35.2.2. Benchmark emulators on userland executables

Let’s see how fast our simulators run some well-known or easy-to-understand userland benchmarks!

TODO: it would be amazing to have an automated guest instructions per second count, but I’m not sure how to do that nicely for QEMU, see: QEMU get guest instruction count.

TODO: automate this further, produce the results table automatically, possibly by generalizing test-executables.

For now we can just run on gem5 to estimate the instruction count per input size and extrapolate?
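A minimal sketch of that extrapolation, assuming the instruction count grows linearly with the number of loops. It fits a line through two gem5 measurements taken from the statically linked busy loop rows of Table 7 below (10^5 and 10^6 loops). The function names are hypothetical and not part of LKMC.

```python
def fit_instruction_count(loops_a, insts_a, loops_b, insts_b):
    """Fit insts = per_loop * loops + overhead through two measurements."""
    per_loop = (insts_b - insts_a) / (loops_b - loops_a)
    overhead = insts_a - per_loop * loops_a
    return per_loop, overhead


def extrapolate(per_loop, overhead, loops):
    """Estimate the instruction count for an input size we did not simulate."""
    return per_loop * loops + overhead


# Two gem5 measurements copied from Table 7 (static busy loop, -O0).
per_loop, overhead = fit_instruction_count(10**5, 2405682, 10**6, 24005699)
print(f"{per_loop:.2f} instructions per loop, {overhead:.0f} fixed instructions")
print(f"estimate for 10^10 loops: {extrapolate(per_loop, overhead, 10**10):.3e}")
```

For those two rows the fit gives about 24 instructions per loop plus a few thousand instructions of fixed startup cost. The per-loop figure depends on the exact benchmark build, so it should be re-measured for each configuration rather than reused blindly.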

For example, the simplest scalable CPU content would be a C busy loop, so let’s start by analyzing that one.

Summary of manually collected results on 2017 Lenovo ThinkPad P51 at LKMC a18f28e263c91362519ef550150b5c9d75fa3679 + 1: Table 7, “Busy loop MIPS for different simulator setups”. As expected, the less native/more detailed/more complex simulations are slower!

Table 7. Busy loop MIPS for different simulator setups
| Comment | LKMC | Benchmark build | Emulator command | Loops | Time (s) | Instruction count | Approximate MIPS | Hardware version | Host OS |
|---|---|---|---|---|---|---|---|---|---|
| Native busy loop | a7ae8e6a8e29ef46d79eb9178d8599d1faeea0e5 + 1 | userland/gcc/busy_loop.c -O0 | ./run --emulator native --userland userland/gcc/busy_loop.c --cli-args 10000000000 | 10^10 | 27 | | | 2017 Lenovo ThinkPad P51 | Ubuntu 20.04 |
| QEMU aarch64 busy loop | a18f28e263c91362519ef550150b5c9d75fa3679 + 1 | userland/gcc/busy_loop.c -O0 | ./run --arch aarch64 --userland userland/gcc/busy_loop.c --cli-args 10000000000 | 10^10 | 68 | 1.1 * 10^11 (approx) | 2000 | | |
| gem5 busy loop | a18f28e263c91362519ef550150b5c9d75fa3679 | userland/gcc/busy_loop.c -O0 | ./run --arch aarch64 --emulator gem5 --static --userland userland/gcc/busy_loop.c --cli-args 1000000 | 10^6 | 18 | 2.4005699 * 10^7 | 1.3 | | |
| gem5 empty C program statically linked | eb22fd3b6e7fff7e9ef946a88b208debf5b419d5 | userland/c/empty.c -O0 | ./run --arch aarch64 --emulator gem5 --static --userland userland/c/empty.c | 1 | 0 | 5475 | | 872cb227fdc0b4d60acc7840889d567a6936b6e1 | Ubuntu 20.04 |
| gem5 empty C program dynamically linked | eb22fd3b6e7fff7e9ef946a88b208debf5b419d5 | userland/c/empty.c -O0 | ./run --arch aarch64 --emulator gem5 --userland userland/c/empty.c | 1 | 0 | 106999 | | 872cb227fdc0b4d60acc7840889d567a6936b6e1 | Ubuntu 20.04 |
| gem5 busy loop for a debug build | a18f28e263c91362519ef550150b5c9d75fa3679 + 1 | userland/gcc/busy_loop.c -O0 | ./run --arch aarch64 --emulator gem5 --gem5-build-type debug --static --userland userland/gcc/busy_loop.c --cli-args 100000 | 10^5 | 33 | 2.405682 * 10^6 | 0.07 | | |
| gem5 busy loop for a fast build | 0d5a41a3f88fcd7ed40fc19474fe5aed0463663f + 1 | userland/gcc/busy_loop.c -O0 -static | ./run --arch aarch64 --emulator gem5 --gem5-build-type fast --static --userland userland/gcc/busy_loop.c --cli-args 1000000 | 10^6 | 15 | 2.4005699 * 10^7 | 1.6 | | |
| gem5 busy loop for a TimingSimpleCPU | a18f28e263c91362519ef550150b5c9d75fa3679 + 1 | userland/gcc/busy_loop.c -O0 | ./run --arch aarch64 --emulator gem5 --static --userland userland/gcc/busy_loop.c --cli-args 1000000 -- --cpu-type TimingSimpleCPU --caches | 10^6 | 26 | 2.4005699 * 10^7 | 0.9 | | |
| gem5 busy loop for a MinorCPU | a18f28e263c91362519ef550150b5c9d75fa3679 + 1 | userland/gcc/busy_loop.c -O0 | ./run --arch aarch64 --emulator gem5 --userland userland/gcc/busy_loop.c --cli-args 1000000 -- --cpu-type MinorCPU --caches | 10^6 | 31 | 1.1018152 * 10^7 | 0.4 | | |
| gem5 busy loop for a DerivO3CPU | a18f28e263c91362519ef550150b5c9d75fa3679 + 1 | userland/gcc/busy_loop.c -O0 | ./run --arch aarch64 --emulator gem5 --userland userland/gcc/busy_loop.c --cli-args 1000000 -- --cpu-type DerivO3CPU --caches | 10^6 | 52 | 1.1018128 * 10^7 | 0.2 | | |
| | a18f28e263c91362519ef550150b5c9d75fa3679 + 1 | userland/gcc/busy_loop.c -O0 | ./run --arch aarch64 --emulator gem5 --gem5-build-id MOESI_CMP_directory -- --cpu-type DerivO3CPU --caches --ruby | 1 * 1000000 = 10^6 | 63 | 1.1005150 * 10^7 | 0.2 | | |
| glibc C pre-main effects | ab6f7331406b22f8ab6e2df5f8b8e464fb35b611 | userland/c/m5ops.c -O0 | gem5 --arch aarch64 --cli-args e | 1 | 2 | 1.26479 * 10^5 | 0.05 | | |
| | ab6f7331406b22f8ab6e2df5f8b8e464fb35b611 | glibc C pre-main userland/c/m5ops.c -O0 | gem5 --arch aarch64 --cli-args e --gem5-build-type debug | 1 | 2 | 1.26479 * 10^5 | 0.05 | | |
| | ab6f7331406b22f8ab6e2df5f8b8e464fb35b611 | glibc C++ pre-main userland/cpp/m5ops.cpp -O0 | gem5 --arch aarch64 --cli-args e | 1 | 2 | 2.385012 * 10^6 | 1 | | |
| | ab6f7331406b22f8ab6e2df5f8b8e464fb35b611 | glibc C++ pre-main userland/cpp/m5ops.cpp -O0 | gem5 --arch aarch64 --cli-args e --gem5-build-type debug | 1 | 25 | 2.385012 * 10^6 | 0.1 | | |
| gem5 optimized build immediate exit on first instruction to benchmark the simulator startup time | ab6f7331406b22f8ab6e2df5f8b8e464fb35b611 | immediate exit userland/freestanding/gem5_exit.S -O0 | gem5 --arch aarch64 | 1 | 1 | 1 | | | |
| Same as above but debug build | ab6f7331406b22f8ab6e2df5f8b8e464fb35b611 | userland/freestanding/gem5_exit.S -O0 | gem5 --arch aarch64 --gem5-build-type debug | 1 | 1 | 1 | | | |
| Check the effect of an ExecAll log (log every instruction) on execution time, compare to analogous run without it. trace.txt size: 3.5GB. 5x slowdown observed with output to a hard disk. | d29a07ddad499f273cc90dd66e40f8474b5dfc40 | userland/gcc/busy_loop.c -O0 | ./run --arch aarch64 --emulator gem5 --userland userland/gcc/busy_loop.c --cli-args 1000000 --gem5-worktree master --trace ExecAll | 10^6 | 136 | 2.4106774 * 10^7 | 0.2 | | |
| Same as above but with run command manually hacked to output to a ramfs. Slightly faster, but the bulk was still just in log format operations! | | | | | | | | | |

The first step is to determine a number of loops that runs long enough to give meaningful results, but not so long that we get bored, so about 1 minute.

On our 2017 Lenovo ThinkPad P51 machine, we found 10^7 (10 million == 1000 times 10000) loops to be a good number for a gem5 atomic simulation:

./run --arch aarch64 --emulator gem5 --userland userland/gcc/busy_loop.c --cli-args '1 10000000'
./gem5-stat --arch aarch64 sim_insts

as it gives:

  • time: 00:01:40

  • instructions: 110018162 ~ 110 million

so ~ 110 million instructions / 100 seconds makes ~ 1 MIPS (million instructions per second).
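The same arithmetic can be sketched in a few lines of Python. The stats excerpt below is a hypothetical example written in the usual `name value # description` layout of gem5's stats.txt, carrying the value reported above; the exact stat name and layout can differ between gem5 versions. The helper names are illustrative and not part of LKMC.

```python
import re


def parse_stat(stats_text, name):
    """Return the integer value of the first stat called `name`."""
    match = re.search(rf"^{re.escape(name)}\s+(\d+)", stats_text, re.MULTILINE)
    if match is None:
        raise KeyError(name)
    return int(match.group(1))


def hms_to_seconds(hms):
    """Convert an hh:mm:ss wall clock string to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def mips(instructions, seconds):
    """Million guest instructions simulated per host second."""
    return instructions / seconds / 1e6


# Hypothetical stats.txt excerpt with the instruction count from this run.
example_stats = """
---------- Begin Simulation Statistics ----------
sim_insts                                 110018162                       # Number of instructions simulated
"""

insts = parse_stat(example_stats, "sim_insts")
seconds = hms_to_seconds("00:01:40")
print(f"{insts} instructions in {seconds} s: {mips(insts, seconds):.2f} MIPS")
```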

This experiment also suggests that each loop is about 11 instructions long (110M instructions / 10M loops), which we confirm in Section 36.2, “C busy loop”, bingo!

Then for QEMU, we experimentally turn the number of loops up to 10^10 (100000 100000), which gives an expected 11 * 10^10 instructions, and the runtime is 00:01:08, so we have 1.1 * 10^11 instructions / 68 seconds ~ 1.6 * 10^9 instructions per second, or roughly 2000 MIPS!
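As a rough sketch of this estimate: QEMU does not report the instruction count here, so the code below assumes the 11 instructions per loop measured with gem5 above and multiplies it by the loop count. The function name is hypothetical.

```python
INSTRUCTIONS_PER_LOOP = 11  # assumption: taken from the gem5 measurement above


def estimated_mips(loops, seconds, instructions_per_loop=INSTRUCTIONS_PER_LOOP):
    """Estimate MIPS when only the loop count and wall time are known."""
    instructions = loops * instructions_per_loop
    return instructions / seconds / 1e6


outer, inner = 100000, 100000  # the two loop counts passed to the benchmark
loops = outer * inner          # 10^10 loops in total
qemu_seconds = 1 * 60 + 8      # runtime of 00:01:08

print(f"estimated instructions: {loops * INSTRUCTIONS_PER_LOOP:.2e}")
print(f"estimated rate: {estimated_mips(loops, qemu_seconds):.0f} MIPS")
```

The result lands in the low thousands of MIPS, the same order of magnitude as the figure quoted above. Since the per-loop count is borrowed from a different simulator run, this is an order-of-magnitude estimate rather than a measurement.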

We can then repeat the experiment for other gem5 CPUs to see how they compare.
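A small sketch of such a comparison, using the times and instruction counts for the 10^6 loop gem5 runs listed in Table 7. The dictionary keys are informal labels chosen here, and the rows differ in more than the CPU model (for example static versus dynamic linking), so the ratios are only indicative.

```python
# (time in seconds, instruction count) for 10^6 loops, copied from Table 7.
runs = {
    "default": (18, 24005699),
    "TimingSimpleCPU": (26, 24005699),
    "MinorCPU": (31, 11018152),
    "DerivO3CPU": (52, 11018128),
    "DerivO3CPU --ruby": (63, 11005150),
}


def mips(instructions, seconds):
    """Million guest instructions simulated per host second."""
    return instructions / seconds / 1e6


baseline_seconds, baseline_insts = runs["default"]
baseline = mips(baseline_insts, baseline_seconds)

for name, (seconds, instructions) in runs.items():
    rate = mips(instructions, seconds)
    # Ratio of the default run's MIPS to this run's MIPS.
    print(f"{name:18} {rate:5.2f} MIPS, default/this = {baseline / rate:4.1f}")
```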