24.3.3.1. gem5 memory latency

TODO These look promising:

--list-mem-types
--mem-type=MEM_TYPE
--mem-channels=MEM_CHANNELS
--mem-ranks=MEM_RANKS
--mem-size=MEM_SIZE
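
A sketch of how these could be passed through to gem5 via LKMC's ./run, assuming the usual se.py semantics: first list the available memory models, then pick one explicitly. DDR3_2133_8x8 below is only an assumed example name, the authoritative list comes from --list-mem-types itself:

./run --arch aarch64 --emulator gem5 --userland userland/gcc/busy_loop.c -- --list-mem-types

./run --arch aarch64 --cli-args 1000000 --emulator gem5 --userland userland/gcc/busy_loop.c -- --cpu-type TimingSimpleCPU --mem-type DDR3_2133_8x8 --mem-channels 2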

TODO: how to verify this with the Linux kernel? Besides raw performance benchmarks.

Now for a simplistic raw benchmark on TimingSimpleCPU without caches, via a C busy loop:

./run --arch aarch64 --cli-args 1000000 --emulator gem5 --userland userland/gcc/busy_loop.c -- --cpu-type TimingSimpleCPU

LKMC eb22fd3b6e7fff7e9ef946a88b208debf5b419d5 gem5 872cb227fdc0b4d60acc7840889d567a6936b6e1 outputs:

Exiting @ tick 897173931000 because exiting with last active thread context

and now because:

  • we have no caches, each instruction is fetched from memory

  • each loop iteration contains 11 instructions as shown at Section 36.2, “C busy loop”

  • and supposing that the loop dominates the executable's pre/post main overhead, which we know is true since, as shown in Benchmark emulators on userland executables, an empty dynamically linked C program runs only about 100k instructions, while our loop runs 1000000 * 11 = 11M.

we should be doing about 1000000 * 11 / 897173931000 ps ~ 12260722 ~ 12M random memory accesses per second (gem5 ticks are picoseconds by default), i.e. roughly 12 MB/s if we count one byte per access. The default memory type used is DDR3_1600_8x8 as per:

common/Options.py:101:    parser.add_option("--mem-type", type="choice", default="DDR3_1600_8x8

and according to https://en.wikipedia.org/wiki/DDR3_SDRAM that reaches 6400 MB/s, so we are only off by a factor of about 500x :-) TODO. Maybe if the minimum transaction size is 64 bytes, each access actually moves 64 bytes, which would bring us much closer to that figure.
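
To make the back-of-the-envelope arithmetic above easy to reproduce, here is a small standalone C program, not part of LKMC: the tick count and the 6400 MB/s figure are the ones quoted in this section, and the bytes-per-access values are just assumptions for illustration:

#include <stdio.h>

int main(void) {
    /* gem5 exit tick from the run above; 1 tick = 1 ps by default. */
    double seconds = 897173931000.0 * 1e-12;
    /* 1000000 loop iterations times 11 instructions per iteration. */
    double accesses = 1000000.0 * 11.0;
    double accesses_per_s = accesses / seconds;
    /* DDR3-1600 peak bandwidth figure used in the text above. */
    double ddr3_mb_per_s = 6400.0;
    printf("accesses/s: %.0f\n", accesses_per_s);
    /* Naive reading: one byte per access. */
    printf("MB/s at 1 B/access: %.1f, factor off: %.0f\n",
           accesses_per_s / 1e6, ddr3_mb_per_s / (accesses_per_s / 1e6));
    /* Assumption: each access moves a full 64 byte DRAM burst. */
    printf("MB/s at 64 B/access: %.1f, factor off: %.1f\n",
           accesses_per_s * 64 / 1e6, ddr3_mb_per_s / (accesses_per_s * 64 / 1e6));
    return 0;
}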

Another example we could use later on is userland/gcc/busy_loop.c with a larger iteration count, but that one mixes icache and dcache accesses, so the analysis is a bit more complex:

./run --arch aarch64 --cli-args 0x1000000 --emulator gem5 --userland userland/gcc/busy_loop.c -- --cpu-type TimingSimpleCPU
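
For intuition on the icache/dcache mix, such a busy loop looks roughly like the sketch below. This is illustrative only, the real userland/gcc/busy_loop.c in the LKMC tree is the authoritative version; the point is that a memory-resident counter makes every iteration cost both instruction fetches and data loads/stores when caches are disabled:

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    /* Iteration count from the CLI, base 0 so both 1000000 and 0x1000000 work. */
    unsigned long long max = (argc > 1) ? strtoull(argv[1], NULL, 0) : 1;
    /* volatile forces the counter into memory: each iteration does a data
     * load and store on top of fetching the loop's instructions. */
    volatile unsigned long long i;
    for (i = 0; i < max; i++) {
        /* Empty body: the work is the loop bookkeeping itself. */
    }
    printf("%llu\n", (unsigned long long)i);
    return 0;
}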