24.3.3.1. gem5 memory latency
TODO These look promising:
--list-mem-types --mem-type=MEM_TYPE --mem-channels=MEM_CHANNELS --mem-ranks=MEM_RANKS --mem-size=MEM_SIZE
TODO: how to verify this with the Linux kernel? Besides raw performance benchmarks.
Now for a raw, simplistic benchmark on TimingSimpleCPU without caches, via a C busy loop:
./run --arch aarch64 --cli-args 1000000 --emulator gem5 --userland userland/gcc/busy_loop.c -- --cpu-type TimingSimpleCPU
LKMC eb22fd3b6e7fff7e9ef946a88b208debf5b419d5 gem5 872cb227fdc0b4d60acc7840889d567a6936b6e1 outputs:
Exiting @ tick 897173931000 because exiting with last active thread context
and now, because:

- we have no caches, so each instruction is fetched from memory
- each loop iteration contains 11 instructions as shown at Section 36.2, "C busy loop"
- the loop dominates the executable's pre/post main overhead, which we know is true since, as shown in Benchmark emulators on userland executables, an empty dynamically linked C program runs only about 100k instructions, while our loop runs 1000000 * 11 = 11M
we should have about 1000000 * 11 / 897173931000 ps ~ 12260722 ~ 12M instruction fetches per second of random accesses. The default memory type used is DDR3_1600_8x8
as per:
common/Options.py:101: parser.add_option("--mem-type", type="choice", default="DDR3_1600_8x8",
and according to https://en.wikipedia.org/wiki/DDR3_SDRAM that reaches 6400 MB/s; at 4 bytes per aarch64 instruction, our ~12M fetches/s only amount to ~49 MB/s, so we are off by a factor of about 130x :-) TODO. Maybe if the minimum transaction is 64 bytes, we would be roughly on point: that gives ~785 MB/s, within about 8x.
Another example we could use later on is userland/gcc/busy_loop.c, but then that mixes icache and dcache accesses, so the analysis is a bit more complex:
./run --arch aarch64 --cli-args 0x1000000 --emulator gem5 --userland userland/gcc/busy_loop.c -- --cpu-type TimingSimpleCPU