24.22.4.8.1. gem5 event queue DerivO3CPU syscall emulation freestanding example analysis: hazardless
Let’s have a look at the arguably simplest example userland/arch/aarch64/freestanding/linux/hazardless.S.
First let’s start with a gem5 util/o3-pipeview.py O3 pipeline viewer visualization:
// f = fetch, d = decode, n = rename, p = dispatch, i = issue, c = complete, r = retire timeline tick pc.upc disasm seq_num [.ic.r........................................................................fdn]-( 40000) 0x00400078.0 movz x0, #0, #0 [ 1] [.ic.r........................................................................fdn]-( 40000) 0x0040007c.0 movz x1, #1, #0 [ 2] [....................fdn.ic.r....................................................]-( 120000) 0x00400080.0 movz x2, #2, #0 [ 3] [....................fdn.ic.r....................................................]-( 120000) 0x00400084.0 movz x3, #3, #0 [ 4] [....................fdn.ic.r....................................................]-( 120000) 0x00400088.0 movz x4, #4, #0 [ 5] [....................fdn.ic.r....................................................]-( 120000) 0x0040008c.0 movz x5, #5, #0 [ 6] [....................fdn.ic.r....................................................]-( 120000) 0x00400090.0 movz x6, #6, #0 [ 7] [....................fdn.ic.r....................................................]-( 120000) 0x00400094.0 movz x7, #7, #0 [ 8] [....................fdn.pic.r...................................................]-( 120000) 0x00400098.0 movz x8, #8, #0 [ 9] [....................fdn.pic.r...................................................]-( 120000) 0x0040009c.0 movz x9, #9, #0 [ 10] [.....................fdn.ic.r...................................................]-( 120000) 0x004000a0.0 movz x10, #10, #0 [ 11] [.....................fdn.ic.r...................................................]-( 120000) 0x004000a4.0 movz x11, #11, #0 [ 12] [.....................fdn.ic.r...................................................]-( 120000) 0x004000a8.0 movz x12, #12, #0 [ 13] [.....................fdn.ic.r...................................................]-( 120000) 0x004000ac.0 movz x13, #13, #0 [ 14] [.....................fdn.pic.r..................................................]-( 120000) 0x004000b0.0 movz x14, #14, #0 [ 15] [.....................fdn.pic.r..................................................]-( 120000) 0x004000b4.0 movz x15, #15, #0 [ 16] [.....................fdn.pic.r..................................................]-( 120000) 0x004000b8.0 movz x16, #16, #0 [ 17] [.....................fdn.pic.r..................................................]-( 120000) 0x004000bc.0 movz x17, #17, #0 [ 18] [............................................fdn.ic.r............................]-( 160000) 0x004000c0.0 movz x18, #18, #0 [ 19] [............................................fdn.ic.r............................]-( 160000) 0x004000c4.0 movz x19, #19, #0 [ 20] [............................................fdn.ic.r............................]-( 160000) 0x004000c8.0 movz x20, #20, #0 [ 21] [............................................fdn.ic.r............................]-( 160000) 0x004000cc.0 movz x21, #21, #0 [ 22] [............................................fdn.ic.r............................]-( 160000) 0x004000d0.0 movz x22, #22, #0 [ 23] [............................................fdn.ic.r............................]-( 160000) 0x004000d4.0 movz x23, #23, #0 [ 24] [............................................fdn.pic.r...........................]-( 160000) 0x004000d8.0 movz x24, #24, #0 [ 25] [............................................fdn.pic.r...........................]-( 160000) 0x004000dc.0 movz x25, #25, #0 [ 26] [.............................................fdn.ic.r...........................]-( 160000) 0x004000e0.0 movz x26, #26, #0 [ 27] [.............................................fdn.ic.r...........................]-( 160000) 0x004000e4.0 movz x27, #27, #0 [ 28] [.............................................fdn.ic.r...........................]-( 160000) 0x004000e8.0 movz x28, #28, #0 [ 29] [.............................................fdn.ic.r...........................]-( 160000) 0x004000ec.0 movz x29, #29, #0 [ 30] [.............................................fdn.pic.r..........................]-( 160000) 0x004000f0.0 movz x0, #0, #0 [ 31] [.............................................fdn.pic.r..........................]-( 160000) 0x004000f4.0 movz x1, #1, #0 [ 32] [.............................................fdn.pic.r..........................]-( 160000) 0x004000f8.0 movz x2, #2, #0 [ 33] [.............................................fdn.pic.r..........................]-( 160000) 0x004000fc.0 movz x3, #3, #0 [ 34]
The first of instructions has only two instructions because the first instruction is at address 0x400078, so only two instructions fit on that cache line, as the next cache line starts at 0x400080!
The initial fdn
on top middle is likely bugged out, did it wrap around? But the rest makes sense.
From this, we clearly see that up to 8 instructions can be issued concurrently, which matches the default width values we had seen at gem5 DerivO3CPU
.
For example, we can clearly see how:
-
movz x2
through tomovz x9
start running at the exact same time. TODO why doesmov x7
dofdn.ic.r
whilemov x8
dofdn.ic.r
? How are they different? -
movz x10
throughmovz x17
then starts running one step later. This second chunk is fully pipelined with the first instruction pack -
then comes a pause while the next fetch comes back. This group of 16 instructions took up the entire 64-byte cacheline that had been read
First we can have a look at ExecEnable
to get an initial ideal of how many instructions are run at one time:
78500: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue : movz x0, #0, #0 : IntAlu : D=0x0000000000000000 FetchSeq=1 CPSeq=1 flags=(IsInteger) 78500: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+4 : movz x1, #1, #0 : IntAlu : D=0x0000000000000001 FetchSeq=2 CPSeq=2 flags=(IsInteger) 130000: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+8 : movz x2, #2, #0 : IntAlu : D=0x0000000000000002 FetchSeq=3 CPSeq=3 flags=(IsInteger) 130000: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+12 : movz x3, #3, #0 : IntAlu : D=0x0000000000000003 FetchSeq=4 CPSeq=4 flags=(IsInteger) 130000: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+16 : movz x4, #4, #0 : IntAlu : D=0x0000000000000004 FetchSeq=5 CPSeq=5 flags=(IsInteger) 130000: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+20 : movz x5, #5, #0 : IntAlu : D=0x0000000000000005 FetchSeq=6 CPSeq=6 flags=(IsInteger) 130000: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+24 : movz x6, #6, #0 : IntAlu : D=0x0000000000000006 FetchSeq=7 CPSeq=7 flags=(IsInteger) 130000: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+28 : movz x7, #7, #0 : IntAlu : D=0x0000000000000007 FetchSeq=8 CPSeq=8 flags=(IsInteger) 130000: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+32 : movz x8, #8, #0 : IntAlu : D=0x0000000000000008 FetchSeq=9 CPSeq=9 flags=(IsInteger) 130000: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+36 : movz x9, #9, #0 : IntAlu : D=0x0000000000000009 FetchSeq=10 CPSeq=10 flags=(IsInteger) 130500: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+40 : movz x10, #10, #0 : IntAlu : D=0x000000000000000a FetchSeq=11 CPSeq=11 flags=(IsInteger) 130500: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+44 : movz x11, #11, #0 : IntAlu : D=0x000000000000000b FetchSeq=12 CPSeq=12 flags=(IsInteger) 130500: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+48 : movz x12, #12, #0 : IntAlu : D=0x000000000000000c FetchSeq=13 CPSeq=13 flags=(IsInteger) 130500: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+52 : movz x13, #13, #0 : IntAlu : D=0x000000000000000d FetchSeq=14 CPSeq=14 flags=(IsInteger) 130500: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+56 : movz x14, #14, #0 : IntAlu : D=0x000000000000000e FetchSeq=15 CPSeq=15 flags=(IsInteger) 130500: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+60 : movz x15, #15, #0 : IntAlu : D=0x000000000000000f FetchSeq=16 CPSeq=16 flags=(IsInteger) 130500: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+64 : movz x16, #16, #0 : IntAlu : D=0x0000000000000010 FetchSeq=17 CPSeq=17 flags=(IsInteger) 130500: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+68 : movz x17, #17, #0 : IntAlu : D=0x0000000000000011 FetchSeq=18 CPSeq=18 flags=(IsInteger)
This suggests 8, but remember that ExecEnable
shows issue time labels, which do not coincide necessarily with commit times. As we saw in the pipeline viewer above, instructions 9 and 10 have one extra stage.
After the initial two execs from the first cache line, the full commit log chunk around the first group of six `ExecEnable`s looks like:
133500: Commit: system.cpu.commit: Getting instructions from Rename stage. 133500: Commit: system.cpu.commit: Trying to commit instructions in the ROB. 133500: Commit: system.cpu.commit: Trying to commit head instruction, [tid:0] [sn:3] 133500: Commit: system.cpu.commit: [tid:0] [sn:3] Committing instruction with PC (0x400080=>0x400084).(0=>1) 130000: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+8 : movz x2, #2, #0 : IntAlu : D=0x0000000000000002 FetchSeq=3 CPSeq=3 flags=(IsInteger) 133500: ROB: system.cpu.rob: [tid:0] Retiring head instruction, instruction PC (0x400080=>0x400084).(0=>1), [sn:3] 133500: O3CPU: system.cpu: Removing committed instruction [tid:0] PC (0x400080=>0x400084).(0=>1) [sn:3] 133500: Commit: system.cpu.commit: Trying to commit head instruction, [tid:0] [sn:4] 133500: Commit: system.cpu.commit: [tid:0] [sn:4] Committing instruction with PC (0x400084=>0x400088).(0=>1) 130000: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+12 : movz x3, #3, #0 : IntAlu : D=0x0000000000000003 FetchSeq=4 CPSeq=4 flags=(IsInteger) 133500: ROB: system.cpu.rob: [tid:0] Retiring head instruction, instruction PC (0x400084=>0x400088).(0=>1), [sn:4] 133500: O3CPU: system.cpu: Removing committed instruction [tid:0] PC (0x400084=>0x400088).(0=>1) [sn:4] 133500: Commit: system.cpu.commit: Trying to commit head instruction, [tid:0] [sn:5] 133500: Commit: system.cpu.commit: [tid:0] [sn:5] Committing instruction with PC (0x400088=>0x40008c).(0=>1) 130000: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+16 : movz x4, #4, #0 : IntAlu : D=0x0000000000000004 FetchSeq=5 CPSeq=5 flags=(IsInteger) 133500: ROB: system.cpu.rob: [tid:0] Retiring head instruction, instruction PC (0x400088=>0x40008c).(0=>1), [sn:5] 133500: O3CPU: system.cpu: Removing committed instruction [tid:0] PC (0x400088=>0x40008c).(0=>1) [sn:5] 133500: Commit: system.cpu.commit: Trying to commit head instruction, [tid:0] [sn:6] 133500: Commit: system.cpu.commit: [tid:0] [sn:6] Committing instruction with PC (0x40008c=>0x400090).(0=>1) 130000: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+20 : movz x5, #5, #0 : IntAlu : D=0x0000000000000005 FetchSeq=6 CPSeq=6 flags=(IsInteger) 133500: ROB: system.cpu.rob: [tid:0] Retiring head instruction, instruction PC (0x40008c=>0x400090).(0=>1), [sn:6] 133500: O3CPU: system.cpu: Removing committed instruction [tid:0] PC (0x40008c=>0x400090).(0=>1) [sn:6] 133500: Commit: system.cpu.commit: Trying to commit head instruction, [tid:0] [sn:7] 133500: Commit: system.cpu.commit: [tid:0] [sn:7] Committing instruction with PC (0x400090=>0x400094).(0=>1) 130000: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+24 : movz x6, #6, #0 : IntAlu : D=0x0000000000000006 FetchSeq=7 CPSeq=7 flags=(IsInteger) 133500: ROB: system.cpu.rob: [tid:0] Retiring head instruction, instruction PC (0x400090=>0x400094).(0=>1), [sn:7] 133500: O3CPU: system.cpu: Removing committed instruction [tid:0] PC (0x400090=>0x400094).(0=>1) [sn:7] 133500: Commit: system.cpu.commit: Trying to commit head instruction, [tid:0] [sn:8] 133500: Commit: system.cpu.commit: [tid:0] [sn:8] Committing instruction with PC (0x400094=>0x400098).(0=>1) 130000: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+28 : movz x7, #7, #0 : IntAlu : D=0x0000000000000007 FetchSeq=8 CPSeq=8 flags=(IsInteger) 133500: ROB: system.cpu.rob: [tid:0] Retiring head instruction, instruction PC (0x400094=>0x400098).(0=>1), [sn:8] 133500: O3CPU: system.cpu: Removing committed instruction [tid:0] PC (0x400094=>0x400098).(0=>1) [sn:8] 133500: Commit: system.cpu.commit: [tid:0] Marking PC (0x400098=>0x40009c).(0=>1), [sn:9] ready within ROB. 133500: Commit: system.cpu.commit: [tid:0] Marking PC (0x40009c=>0x4000a0).(0=>1), [sn:10] ready within ROB. 133500: Commit: system.cpu.commit: [tid:0] Marking PC (0x4000a0=>0x4000a4).(0=>1), [sn:11] ready within ROB. 133500: Commit: system.cpu.commit: [tid:0] Marking PC (0x4000a4=>0x4000a8).(0=>1), [sn:12] ready within ROB. 133500: Commit: system.cpu.commit: [tid:0] Marking PC (0x4000a8=>0x4000ac).(0=>1), [sn:13] ready within ROB. 133500: Commit: system.cpu.commit: [tid:0] Marking PC (0x4000ac=>0x4000b0).(0=>1), [sn:14] ready within ROB. 133500: Commit: system.cpu.commit: [tid:0] Instruction [sn:9] PC (0x400098=>0x40009c).(0=>1) is head of ROB and ready to commit 133500: Commit: system.cpu.commit: [tid:0] ROB has 10 insts & 182 free entries.
ROB
stands for Re-order buffer.
0x400080⇒0x400084
is an old/new PC address of the first committed instruction.
Another thing we can do, it to try to follow one of the instructions back as it goes through the pipeline. Searching for example for the address 0x400080
, we find:
The first mention of the address happens when is the fetch of the two initial instructions completes. TODO not sure why it doesn’t just also fetch the next cache line at the same time:
FullO3CPU: Ticking main, FullO3CPU. 78500: Fetch: system.cpu.fetch: Running stage. 78500: Fetch: system.cpu.fetch: Attempting to fetch from [tid:0] 78500: Fetch: system.cpu.fetch: [tid:0] Icache miss is complete. 78500: Fetch: system.cpu.fetch: [tid:0] Adding instructions to queue to decode. 78500: DynInst: global: DynInst: [sn:1] Instruction created. Instcount for system.cpu = 1 78500: Fetch: system.cpu.fetch: [tid:0] Instruction PC 0x400078 (0) created [sn:1]. 78500: Fetch: system.cpu.fetch: [tid:0] Instruction is: movz x0, #0, #0 78500: Fetch: system.cpu.fetch: [tid:0] Fetch queue entry created (1/32). 78500: DynInst: global: DynInst: [sn:2] Instruction created. Instcount for system.cpu = 2 78500: Fetch: system.cpu.fetch: [tid:0] Instruction PC 0x40007c (0) created [sn:2]. 78500: Fetch: system.cpu.fetch: [tid:0] Instruction is: movz x1, #1, #0 78500: Fetch: system.cpu.fetch: [tid:0] Fetch queue entry created (2/32). 78500: Fetch: system.cpu.fetch: [tid:0] Issuing a pipelined I-cache access, starting at PC (0x400080=>0x400084).(0=>1). 78500: Fetch: system.cpu.fetch: [tid:0] Fetching cache line 0x400080 for addr 0x400080
so we observe that the first two instructions arrived, and the CPU noticed that 0x400080 hasn’t been fetched yet.
Then for several cycles that follow, the fetch stage just says that it is blocked on data returning:
FullO3CPU: Ticking main, FullO3CPU. 79000: Fetch: system.cpu.fetch: Running stage. 79000: Fetch: system.cpu.fetch: There are no more threads available to fetch from. 79000: Fetch: system.cpu.fetch: [tid:0] Fetch is waiting cache response!
At the same time, the execution of the initial 2 instructions progresses through the pipeline.
These progress up until:
88000: O3CPU: system.cpu: Idle!
at which point there are no more events scheduled besides waiting for the second cache line to come back.
After this, some time passes without events, and the next tick happens when the fetch data returns:
FullO3CPU: Ticking main, FullO3CPU. 130000: Fetch: system.cpu.fetch: Running stage. 130000: Fetch: system.cpu.fetch: Attempting to fetch from [tid:0] 130000: Fetch: system.cpu.fetch: [tid:0] Icache miss is complete. 130000: Fetch: system.cpu.fetch: [tid:0] Adding instructions to queue to decode. 130000: DynInst: global: DynInst: [sn:3] Instruction created. Instcount for system.cpu = 1 130000: Fetch: system.cpu.fetch: [tid:0] Instruction PC 0x400080 (0) created [sn:3]. 130000: Fetch: system.cpu.fetch: [tid:0] Instruction is: movz x2, #2, #0 130000: Fetch: system.cpu.fetch: [tid:0] Fetch queue entry created (1/32). 130000: DynInst: global: DynInst: [sn:4] Instruction created. Instcount for system.cpu = 2 130000: Fetch: system.cpu.fetch: [tid:0] Instruction PC 0x400084 (0) created [sn:4]. 130000: Fetch: system.cpu.fetch: [tid:0] Instruction is: movz x3, #3, #0 130000: Fetch: system.cpu.fetch: [tid:0] Fetch queue entry created (2/32). 130000: DynInst: global: DynInst: [sn:5] Instruction created. Instcount for system.cpu = 3