35.2.2. Benchmark emulators on userland executables
Let’s see how fast our simulators are running some well known or easy to understand userland benchmarks!
TODO: would be amazing to have an automated guest instructions per second count, but I’m not sure how to do that nicely for QEMU: QEMU get guest instruction count.
TODO: automate this further, produce the results table automatically, possibly by generalizing test-executables.
For now we can just run on gem5 to estimate the instruction count per input size and extrapolate?
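One way to sketch that extrapolation, assuming the instruction count grows linearly with the loop count (a fixed startup cost plus a constant cost per loop), is to fit a line through two gem5 measurements. The helper names `fit_linear` and `extrapolate` are hypothetical, and the sample numbers are the 10^5 and 10^6 loop gem5 counts from Table 7. Those two rows use different gem5 builds, so treat the result as illustrative only.

```python
def fit_linear(n1, insts1, n2, insts2):
    """Fit insts = startup + per_loop * n through two (loops, instructions) points."""
    per_loop = (insts2 - insts1) / (n2 - n1)
    startup = insts1 - per_loop * n1
    return startup, per_loop


def extrapolate(startup, per_loop, loops):
    """Estimate the instruction count for a loop count that was not measured."""
    return startup + per_loop * loops


# Two gem5 measurements taken from Table 7 (illustrative inputs).
startup, per_loop = fit_linear(10**5, 2405682, 10**6, 24005699)
print(f"startup ~ {startup:.0f} instructions, per loop ~ {per_loop:.2f}")
print(f"estimate for 10^10 loops: {extrapolate(startup, per_loop, 10**10):.3e}")
```

The fitted startup term can be compared against the instruction count of an empty C program as a sanity check on the linearity assumption.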
For example, the simplest scalable CPU workload would be a C busy loop, so let’s start by analyzing that one.
Summary of manually collected results on 2017 Lenovo ThinkPad P51 at LKMC a18f28e263c91362519ef550150b5c9d75fa3679 + 1: Table 7, “Busy loop MIPS for different simulator setups”. As expected, the less native/more detailed/more complex simulations are slower!
Comment | LKMC | Benchmark build | Emulator command | Loops | Time (s) | Instruction count | Approximate MIPS | Hardware version | Host OS |
---|---|---|---|---|---|---|---|---|---|
Native busy loop | a7ae8e6a8e29ef46d79eb9178d8599d1faeea0e5 + 1 | | | 10^10 | 27 | | | | Ubuntu 20.04 |
QEMU aarch64 busy loop | a18f28e263c91362519ef550150b5c9d75fa3679 + 1 | | | 10^10 | 68 | 1.1 * 10^11 (approx) | 2000 | | |
gem5 busy loop | a18f28e263c91362519ef550150b5c9d75fa3679 | | | 10^6 | 18 | 2.4005699 * 10^7 | 1.3 | | |
gem5 empty C program statically linked | eb22fd3b6e7fff7e9ef946a88b208debf5b419d5 | | | 1 | 0 | 5475 | | 872cb227fdc0b4d60acc7840889d567a6936b6e1 | Ubuntu 20.04 |
gem5 empty C program dynamically linked | eb22fd3b6e7fff7e9ef946a88b208debf5b419d5 | | | 1 | 0 | 106999 | | 872cb227fdc0b4d60acc7840889d567a6936b6e1 | Ubuntu 20.04 |
gem5 busy loop for a debug build | a18f28e263c91362519ef550150b5c9d75fa3679 + 1 | | | 10^5 | 33 | 2.405682 * 10^6 | 0.07 | | |
gem5 busy loop for a fast build | 0d5a41a3f88fcd7ed40fc19474fe5aed0463663f + 1 | userland/gcc/busy_loop.c | | 10^6 | 15 | 2.4005699 * 10^7 | 1.6 | | |
gem5 busy loop for a TimingSimpleCPU | a18f28e263c91362519ef550150b5c9d75fa3679 + 1 | | | 10^6 | 26 | 2.4005699 * 10^7 | 0.9 | | |
gem5 busy loop for a MinorCPU | a18f28e263c91362519ef550150b5c9d75fa3679 + 1 | | | 10^6 | 31 | 1.1018152 * 10^7 | 0.4 | | |
gem5 busy loop for a DerivO3CPU | a18f28e263c91362519ef550150b5c9d75fa3679 + 1 | | | 10^6 | 52 | 1.1018128 * 10^7 | 0.2 | | |
| a18f28e263c91362519ef550150b5c9d75fa3679 + 1 | | | 1 * 1000000 = 10^6 | 63 | 1.1005150 * 10^7 | 0.2 | | |
glibc C pre-main effects | ab6f7331406b22f8ab6e2df5f8b8e464fb35b611 | | | 1 | 2 | 1.26479 * 10^5 | 0.05 | | |
| ab6f7331406b22f8ab6e2df5f8b8e464fb35b611 | glibc C pre-main userland/c/m5ops.c | | 1 | 2 | 1.26479 * 10^5 | 0.05 | | |
| ab6f7331406b22f8ab6e2df5f8b8e464fb35b611 | glibc C++ pre-main userland/cpp/m5ops.cpp | | 1 | 2 | 2.385012 * 10^6 | 1 | | |
| ab6f7331406b22f8ab6e2df5f8b8e464fb35b611 | glibc C++ pre-main userland/cpp/m5ops.cpp | | 1 | 25 | 2.385012 * 10^6 | 0.1 | | |
gem5 optimized build immediate exit on first instruction to benchmark the simulator startup time | ab6f7331406b22f8ab6e2df5f8b8e464fb35b611 | immediate exit userland/freestanding/gem5_exit.S | | 1 | 1 | 1 | | | |
same as above but debug build | ab6f7331406b22f8ab6e2df5f8b8e464fb35b611 | | | 1 | 1 | 1 | | | |
Check the effect of an ExecAll log (log every instruction) on execution time, compare to analogous run without it. | d29a07ddad499f273cc90dd66e40f8474b5dfc40 | | | 10^6 | 136 | 2.4106774 * 10^7 | 0.2 | | |
Same as above but with run command manually hacked to output to a ramfs. Slightly faster, but the bulk was still just in log format operations! | | | | | | | | | |
The first step is to determine a number of loops that runs long enough to give meaningful results, but not so long that we get bored, so about 1 minute.
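A minimal sketch of that calibration, assuming runtime scales roughly linearly with the loop count and that simulator startup time is negligible: time one short trial run, then scale its loop count to the target duration. The function name `loops_for_target` is hypothetical, and the sample input (10^6 loops in 18 seconds) is the gem5 busy loop row of Table 7.

```python
def loops_for_target(trial_loops, trial_seconds, target_seconds=60.0):
    """Scale a trial run's loop count so the run lasts about target_seconds."""
    if trial_seconds <= 0:
        raise ValueError("trial run too short to measure; increase trial_loops")
    return round(trial_loops * target_seconds / trial_seconds)


# Trial: 10^6 loops took 18 seconds, so how many loops for about one minute?
print(loops_for_target(10**6, 18))
```

In practice it is convenient to round the result to a nearby power of 10 so that runs are easy to compare.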
On our 2017 Lenovo ThinkPad P51 machine, we found 10^7 (10 million == 1000 times 10000) loops to be a good number for a gem5 atomic simulation:
./run --arch aarch64 --emulator gem5 --userland userland/gcc/busy_loop.c --cli-args '1 10000000'
./gem5-stat --arch aarch64 sim_insts
as it gives:
- time: 00:01:40
- instructions: 110018162 ~ 110 million
so ~ 110 million instructions / 100 seconds makes ~ 1 MIPS (million instructions per second).
This experiment also suggests that each loop is about 11 instructions long (110M instructions / 10M loops), which we confirm at Section 36.2, “C busy loop”, bingo!
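The arithmetic above can be reproduced with a short sketch. The helper names `parse_hms` and `mips` are hypothetical, and the inputs are the time and instruction count reported for the gem5 atomic run.

```python
def parse_hms(text):
    """Convert an HH:MM:SS string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def mips(instructions, seconds):
    """Millions of simulated instructions per wall-clock second."""
    return instructions / seconds / 1e6


seconds = parse_hms("00:01:40")  # runtime reported for the gem5 run
instructions = 110018162         # sim_insts reported by gem5-stat
loops = 10**7                    # loop count passed on the command line

print(f"{mips(instructions, seconds):.2f} MIPS")
print(f"{instructions / loops:.2f} instructions per loop")
```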
Then for QEMU, we experimentally turn the number of loops up to 10^10 (100000 100000), which should execute about 11 * 10^10 instructions. The runtime is 00:01:08, so we have 1.1 * 10^11 instructions / 68 seconds ~ 1.6 * 10^9 instructions per second, which we round to about 2000 MIPS!
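Since QEMU does not hand us an instruction count directly, the QEMU figure is an extrapolation from the gem5 per-loop cost. A minimal sketch of that estimate, assuming about 11 instructions per loop and ignoring startup instructions (the name `estimated_mips` is hypothetical):

```python
INSTS_PER_LOOP = 11  # assumption: per-loop cost measured on gem5, reused for QEMU


def estimated_mips(loops, seconds, insts_per_loop=INSTS_PER_LOOP):
    """Estimate MIPS from a loop count and a wall-clock time, without a real instruction count."""
    return loops * insts_per_loop / seconds / 1e6


qemu_mips = estimated_mips(10**10, 68)   # QEMU: 10^10 loops in 00:01:08
gem5_mips = estimated_mips(10**7, 100)   # gem5 atomic: 10^7 loops in 00:01:40

print(f"QEMU ~ {qemu_mips:.0f} MIPS, gem5 atomic ~ {gem5_mips:.1f} MIPS")
print(f"QEMU is roughly {qemu_mips / gem5_mips:.0f}x faster on this benchmark")
```

The same function can be reused for the other gem5 CPU models, as long as the per-loop instruction count is re-measured for the build being tested.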
We can then repeat the experiment for other gem5 CPUs to see how they compare.