Linux Kernel Module Cheat

35.2.2. Benchmark emulators on userland executables

Let’s see how fast our simulators are running some well known or easy to understand userland benchmarks!

TODO: would be amazing to have an automated guest instructions per second count, but I’m not sure how to do that nicely for QEMU: QEMU get guest instruction count.

TODO: automate this further, produce the results table automatically, possibly by generalizing test-executables.

For now we can just run on gem5 to estimate the instruction count per input size and extrapolate?

For example, the simplest scalable CPU-bound workload would be a C busy loop, so let’s start by analyzing that one.

Summary of manually collected results on 2017 Lenovo ThinkPad P51 at LKMC a18f28e263c91362519ef550150b5c9d75fa3679 + 1: Table 7, “Busy loop MIPS for different simulator setups”. As expected, the less native/more detailed/more complex simulations are slower!

Table 7. Busy loop MIPS for different simulator setups
Comment	LKMC	Benchmark build	Emulator command	Loops	Time (s)	Instruction count	Approximate MIPS	Hardware version	Host OS
Native busy loop	a7ae8e6a8e29ef46d79eb9178d8599d1faeea0e5 + 1	userland/gcc/busy_loop.c `-O0`	`./run --emulator native --userland userland/gcc/busy_loop.c --cli-args 10000000000`	10^10	27			2017 Lenovo ThinkPad P51	Ubuntu 20.04
QEMU aarch64 busy loop	a18f28e263c91362519ef550150b5c9d75fa3679 + 1	userland/gcc/busy_loop.c `-O0`	`./run --arch aarch64 --userland userland/gcc/busy_loop.c --cli-args 10000000000`	10^10	68	1.1 * 10^11 (approx)	2000
gem5 busy loop	a18f28e263c91362519ef550150b5c9d75fa3679	userland/gcc/busy_loop.c `-O0`	`./run --arch aarch64 --emulator gem5 --static --userland userland/gcc/busy_loop.c --cli-args 1000000`	10^6	18	2.4005699 * 10^7	1.3
gem5 empty C program statically linked	eb22fd3b6e7fff7e9ef946a88b208debf5b419d5	userland/c/empty.c `-O0`	`./run --arch aarch64 --emulator gem5 --static --userland userland/c/empty.c`	1	0	5475		872cb227fdc0b4d60acc7840889d567a6936b6e1	Ubuntu 20.04
gem5 empty C program dynamically linked	eb22fd3b6e7fff7e9ef946a88b208debf5b419d5	userland/c/empty.c `-O0`	`./run --arch aarch64 --emulator gem5 --userland userland/c/empty.c`	1	0	106999		872cb227fdc0b4d60acc7840889d567a6936b6e1	Ubuntu 20.04
gem5 busy loop for a debug build	a18f28e263c91362519ef550150b5c9d75fa3679 + 1	userland/gcc/busy_loop.c `-O0`	`./run --arch aarch64 --emulator gem5 --gem5-build-type debug --static --userland userland/gcc/busy_loop.c --cli-args 100000`	10^5	33	2.405682 * 10^6	0.07
gem5 busy loop for a fast build	0d5a41a3f88fcd7ed40fc19474fe5aed0463663f + 1	userland/gcc/busy_loop.c `-O0 -static`	`./run --arch aarch64 --emulator gem5 --gem5-build-type fast --static --userland userland/gcc/busy_loop.c --cli-args 1000000`	10^6	15	2.4005699 * 10^7	1.6
gem5 busy loop for a TimingSimpleCPU	a18f28e263c91362519ef550150b5c9d75fa3679 + 1	userland/gcc/busy_loop.c `-O0`	`./run --arch aarch64 --emulator gem5 --arch aarch64 --static --userland userland/gcc/busy_loop.c --cli-args 1000000 -- --cpu-type TimingSimpleCPU --caches`	10^6	26	2.4005699 * 10^7	0.9
gem5 busy loop for a MinorCPU	a18f28e263c91362519ef550150b5c9d75fa3679 + 1	userland/gcc/busy_loop.c `-O0`	`./run --arch aarch64 --emulator gem5 --arch aarch64 --userland userland/gcc/busy_loop.c --cli-args 1000000 -- --cpu-type MinorCPU --caches`	10^6	31	1.1018152 * 10^7	0.4
gem5 busy loop for a DerivO3CPU	a18f28e263c91362519ef550150b5c9d75fa3679 + 1	userland/gcc/busy_loop.c `-O0`	`./run --arch aarch64 --emulator gem5 --userland userland/gcc/busy_loop.c --cli-args 1000000 -- --cpu-type DerivO3CPU --caches`	10^6	52	1.1018128 * 10^7	0.2
	a18f28e263c91362519ef550150b5c9d75fa3679 + 1	userland/gcc/busy_loop.c `-O0`	`./run --arch aarch64 --emulator gem5 --gem5-build-id MOESI_CMP_directory -- --cpu-type DerivO3CPU --caches --ruby`	1 * 1000000 = 10^6	63	1.1005150 * 10^7	0.2
glibc C pre-main effects	ab6f7331406b22f8ab6e2df5f8b8e464fb35b611	userland/c/m5ops.c `-O0`	`gem5 --arch aarch64 --cli-args e`	1	2	1.26479 * 10^5	0.05
	ab6f7331406b22f8ab6e2df5f8b8e464fb35b611	glibc C pre-main userland/c/m5ops.c `-O0`	`gem5 --arch aarch64 --cli-args e --gem5-build-type debug`	1	2	1.26479 * 10^5	0.05
	ab6f7331406b22f8ab6e2df5f8b8e464fb35b611	glibc C++ pre-main userland/cpp/m5ops.cpp `-O0`	`gem5 --arch aarch64 --cli-args e`	1	2	2.385012 * 10^6	1
	ab6f7331406b22f8ab6e2df5f8b8e464fb35b611	glibc C++ pre-main userland/cpp/m5ops.cpp `-O0`	`gem5 --arch aarch64 --cli-args e --gem5-build-type debug`	1	25	2.385012 * 10^6	0.1
gem5 optimized build immediate exit on first instruction to benchmark the simulator startup time	ab6f7331406b22f8ab6e2df5f8b8e464fb35b611	immediate exit userland/freestanding/gem5_exit.S `-O0`	`gem5 --arch aarch64`	1	1	1
same as above but debug build	ab6f7331406b22f8ab6e2df5f8b8e464fb35b611	userland/freestanding/gem5_exit.S `-O0`	`gem5 --arch aarch64 --gem5-build-type debug`	1	1	1
Check the effect of an ExecAll log (log every instruction) on execution time, compare to analogous run without it. `trace.txt` size: 3.5GB. 5x slowdown observed with output to a hard disk.	d29a07ddad499f273cc90dd66e40f8474b5dfc40	userland/gcc/busy_loop.c `-O0`	`./run --arch aarch64 --emulator gem5 --userland userland/gcc/busy_loop.c --cli-args 1000000 --gem5-worktree master --trace ExecAll`	10^6	136	2.4106774 * 10^7	0.2
Same as above but with run command manually hacked to output to a ramfs. Slightly faster, but the bulk was still just in log format operations!

The first step is to determine a number of loops that runs long enough to give meaningful results, but not so long that we get bored, so about 1 minute.

On our 2017 Lenovo ThinkPad P51 machine, we found 10^7 (10 million == 1000 times 10000) loops to be a good number for a gem5 atomic simulation:

./run --arch aarch64 --emulator gem5 --userland userland/gcc/busy_loop.c --cli-args '1 10000000'
./gem5-stat --arch aarch64 sim_insts

as it gives:

time: 00:01:40
instructions: 110018162 ~ 110 millions

so ~ 110 million instructions / 100 seconds makes ~ 1 MIPS (million instructions per second).
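The arithmetic above can be sketched in a few lines of Python. The helper names `hms_to_seconds` and `mips` are made up for this illustration and are not part of LKMC; the inputs are just the time and instruction count reported above.

```python
def hms_to_seconds(hms: str) -> int:
    """Convert an HH:MM:SS string such as '00:01:40' to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def mips(instructions: int, wall_seconds: float) -> float:
    """Millions of simulated guest instructions per host wall-clock second."""
    return instructions / wall_seconds / 1e6


if __name__ == "__main__":
    # Values copied from the gem5 atomic run above.
    seconds = hms_to_seconds("00:01:40")
    print(f"{seconds} s, {mips(110018162, seconds):.2f} MIPS")
```

With the numbers above this comes out at about 1.10 MIPS, which we round to 1 MIPS.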

This experiment also suggests that each loop is about 11 instructions long (110M instructions / 10M loops), which we confirm at Section 36.2, “C busy loop”, bingo!
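As a rough cross-check, we can also fit a straight line through two measured points to separate the per-loop cost from the fixed startup cost. The sketch below uses the 10^7 loop run above and the 10^6 loop MinorCPU row of Table 7 (1.1018152 * 10^7 instructions). Those two runs use different CPU models, so this assumes the committed instruction count does not depend on the CPU model, and `fit_line` is a hypothetical helper name, not an LKMC function.

```python
def fit_line(loops_a: int, insts_a: int, loops_b: int, insts_b: int):
    """Two-point linear fit: instructions = per_loop * loops + fixed."""
    per_loop = (insts_b - insts_a) / (loops_b - loops_a)
    fixed = insts_a - per_loop * loops_a
    return per_loop, fixed


if __name__ == "__main__":
    # (loops, instruction count) pairs taken from this section.
    per_loop, fixed = fit_line(10**6, 11018152, 10**7, 110018162)
    print(f"per loop: {per_loop:.3f} instructions")
    print(f"fixed overhead: {fixed:.0f} instructions")
```

Under those assumptions the slope is very close to 11 instructions per loop, and the remainder of roughly 18 thousand instructions is startup and teardown that does not scale with the loop count.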

Then for QEMU, we experimentally turn the number of loops up to 10^10 (100000 100000), which contains an expected 11 * 10^10 instructions, and the runtime is 00:01:08, so we have 1.1 * 10^11 instructions / 68 seconds ~ 1.6 * 10^9 instructions per second, which we round to about 2000 MIPS!
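Since we have no direct instruction count for QEMU, the estimate is an extrapolation from gem5. A minimal sketch of that calculation, assuming the per-loop cost of 11 instructions measured on gem5 also holds under QEMU and ignoring the small fixed startup cost (the function names are illustrative only):

```python
def extrapolate_instructions(insts_per_loop: float, loops: int) -> float:
    """Estimate the total guest instructions, ignoring fixed startup cost."""
    return insts_per_loop * loops


def mips(instructions: float, wall_seconds: float) -> float:
    """Millions of guest instructions per host wall-clock second."""
    return instructions / wall_seconds / 1e6


if __name__ == "__main__":
    # 11 instructions per loop comes from the gem5 run; 68 s is the QEMU runtime.
    estimate = extrapolate_instructions(11, 10**10)
    print(f"estimated instructions: {estimate:.2e}")
    print(f"estimated rate: {mips(estimate, 68):.0f} MIPS")
```

The unrounded result is a bit over 1600 MIPS, so the 2000 MIPS figure in the table should be read as an order of magnitude rather than a precise measurement.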

We can then repeat the experiment for other gem5 CPUs to see how they compare.

User mode vs full system benchmark