24.22.4.8.1. gem5 event queue DerivO3CPU syscall emulation freestanding example analysis: hazardless

Let’s have a look at the arguably simplest example userland/arch/aarch64/freestanding/linux/hazardless.S.

First let’s start with a gem5 util/o3-pipeview.py O3 pipeline viewer visualization:

// f = fetch, d = decode, n = rename, p = dispatch, i = issue, c = complete, r = retire

                                     timeline                                             tick          pc.upc     disasm                      seq_num
[.ic.r........................................................................fdn]-(          40000) 0x00400078.0 movz x0, #0, #0           [         1]
[.ic.r........................................................................fdn]-(          40000) 0x0040007c.0 movz x1, #1, #0           [         2]
[....................fdn.ic.r....................................................]-(         120000) 0x00400080.0 movz x2, #2, #0           [         3]
[....................fdn.ic.r....................................................]-(         120000) 0x00400084.0 movz x3, #3, #0           [         4]
[....................fdn.ic.r....................................................]-(         120000) 0x00400088.0 movz x4, #4, #0           [         5]
[....................fdn.ic.r....................................................]-(         120000) 0x0040008c.0 movz x5, #5, #0           [         6]
[....................fdn.ic.r....................................................]-(         120000) 0x00400090.0 movz x6, #6, #0           [         7]
[....................fdn.ic.r....................................................]-(         120000) 0x00400094.0 movz x7, #7, #0           [         8]
[....................fdn.pic.r...................................................]-(         120000) 0x00400098.0 movz x8, #8, #0           [         9]
[....................fdn.pic.r...................................................]-(         120000) 0x0040009c.0 movz x9, #9, #0           [        10]
[.....................fdn.ic.r...................................................]-(         120000) 0x004000a0.0 movz x10, #10, #0         [        11]
[.....................fdn.ic.r...................................................]-(         120000) 0x004000a4.0 movz x11, #11, #0         [        12]
[.....................fdn.ic.r...................................................]-(         120000) 0x004000a8.0 movz x12, #12, #0         [        13]
[.....................fdn.ic.r...................................................]-(         120000) 0x004000ac.0 movz x13, #13, #0         [        14]
[.....................fdn.pic.r..................................................]-(         120000) 0x004000b0.0 movz x14, #14, #0         [        15]
[.....................fdn.pic.r..................................................]-(         120000) 0x004000b4.0 movz x15, #15, #0         [        16]
[.....................fdn.pic.r..................................................]-(         120000) 0x004000b8.0 movz x16, #16, #0         [        17]
[.....................fdn.pic.r..................................................]-(         120000) 0x004000bc.0 movz x17, #17, #0         [        18]
[............................................fdn.ic.r............................]-(         160000) 0x004000c0.0 movz x18, #18, #0         [        19]
[............................................fdn.ic.r............................]-(         160000) 0x004000c4.0 movz x19, #19, #0         [        20]
[............................................fdn.ic.r............................]-(         160000) 0x004000c8.0 movz x20, #20, #0         [        21]
[............................................fdn.ic.r............................]-(         160000) 0x004000cc.0 movz x21, #21, #0         [        22]
[............................................fdn.ic.r............................]-(         160000) 0x004000d0.0 movz x22, #22, #0         [        23]
[............................................fdn.ic.r............................]-(         160000) 0x004000d4.0 movz x23, #23, #0         [        24]
[............................................fdn.pic.r...........................]-(         160000) 0x004000d8.0 movz x24, #24, #0         [        25]
[............................................fdn.pic.r...........................]-(         160000) 0x004000dc.0 movz x25, #25, #0         [        26]
[.............................................fdn.ic.r...........................]-(         160000) 0x004000e0.0 movz x26, #26, #0         [        27]
[.............................................fdn.ic.r...........................]-(         160000) 0x004000e4.0 movz x27, #27, #0         [        28]
[.............................................fdn.ic.r...........................]-(         160000) 0x004000e8.0 movz x28, #28, #0         [        29]
[.............................................fdn.ic.r...........................]-(         160000) 0x004000ec.0 movz x29, #29, #0         [        30]
[.............................................fdn.pic.r..........................]-(         160000) 0x004000f0.0 movz x0, #0, #0           [        31]
[.............................................fdn.pic.r..........................]-(         160000) 0x004000f4.0 movz x1, #1, #0           [        32]
[.............................................fdn.pic.r..........................]-(         160000) 0x004000f8.0 movz x2, #2, #0           [        33]
[.............................................fdn.pic.r..........................]-(         160000) 0x004000fc.0 movz x3, #3, #0           [        34]

The first of instructions has only two instructions because the first instruction is at address 0x400078, so only two instructions fit on that cache line, as the next cache line starts at 0x400080!

The initial fdn on top middle is likely bugged out, did it wrap around? But the rest makes sense.

From this, we clearly see that up to 8 instructions can be issued concurrently, which matches the default width values we had seen at gem5 DerivO3CPU.

For example, we can clearly see how:

  • movz x2 through to movz x9 start running at the exact same time. TODO why does mov x7 do fdn.ic.r while mov x8 do fdn.ic.r? How are they different?

  • movz x10 through movz x17 then starts running one step later. This second chunk is fully pipelined with the first instruction pack

  • then comes a pause while the next fetch comes back. This group of 16 instructions took up the entire 64-byte cacheline that had been read

First we can have a look at ExecEnable to get an initial ideal of how many instructions are run at one time:

  78500: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue    :   movz   x0, #0, #0        : IntAlu :  D=0x0000000000000000  FetchSeq=1  CPSeq=1  flags=(IsInteger)
  78500: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+4    :   movz   x1, #1, #0        : IntAlu :  D=0x0000000000000001  FetchSeq=2  CPSeq=2  flags=(IsInteger)

 130000: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+8    :   movz   x2, #2, #0        : IntAlu :  D=0x0000000000000002  FetchSeq=3  CPSeq=3  flags=(IsInteger)
 130000: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+12    :   movz   x3, #3, #0        : IntAlu :  D=0x0000000000000003  FetchSeq=4  CPSeq=4  flags=(IsInteger)
 130000: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+16    :   movz   x4, #4, #0        : IntAlu :  D=0x0000000000000004  FetchSeq=5  CPSeq=5  flags=(IsInteger)
 130000: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+20    :   movz   x5, #5, #0        : IntAlu :  D=0x0000000000000005  FetchSeq=6  CPSeq=6  flags=(IsInteger)
 130000: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+24    :   movz   x6, #6, #0        : IntAlu :  D=0x0000000000000006  FetchSeq=7  CPSeq=7  flags=(IsInteger)
 130000: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+28    :   movz   x7, #7, #0        : IntAlu :  D=0x0000000000000007  FetchSeq=8  CPSeq=8  flags=(IsInteger)
 130000: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+32    :   movz   x8, #8, #0        : IntAlu :  D=0x0000000000000008  FetchSeq=9  CPSeq=9  flags=(IsInteger)
 130000: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+36    :   movz   x9, #9, #0        : IntAlu :  D=0x0000000000000009  FetchSeq=10  CPSeq=10  flags=(IsInteger)

 130500: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+40    :   movz   x10, #10, #0      : IntAlu :  D=0x000000000000000a  FetchSeq=11  CPSeq=11  flags=(IsInteger)
 130500: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+44    :   movz   x11, #11, #0      : IntAlu :  D=0x000000000000000b  FetchSeq=12  CPSeq=12  flags=(IsInteger)
 130500: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+48    :   movz   x12, #12, #0      : IntAlu :  D=0x000000000000000c  FetchSeq=13  CPSeq=13  flags=(IsInteger)
 130500: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+52    :   movz   x13, #13, #0      : IntAlu :  D=0x000000000000000d  FetchSeq=14  CPSeq=14  flags=(IsInteger)
 130500: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+56    :   movz   x14, #14, #0      : IntAlu :  D=0x000000000000000e  FetchSeq=15  CPSeq=15  flags=(IsInteger)
 130500: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+60    :   movz   x15, #15, #0      : IntAlu :  D=0x000000000000000f  FetchSeq=16  CPSeq=16  flags=(IsInteger)
 130500: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+64    :   movz   x16, #16, #0      : IntAlu :  D=0x0000000000000010  FetchSeq=17  CPSeq=17  flags=(IsInteger)
 130500: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+68    :   movz   x17, #17, #0      : IntAlu :  D=0x0000000000000011  FetchSeq=18  CPSeq=18  flags=(IsInteger)

This suggests 8, but remember that ExecEnable shows issue time labels, which do not coincide necessarily with commit times. As we saw in the pipeline viewer above, instructions 9 and 10 have one extra stage.

After the initial two execs from the first cache line, the full commit log chunk around the first group of six `ExecEnable`s looks like:

 133500: Commit: system.cpu.commit: Getting instructions from Rename stage.
 133500: Commit: system.cpu.commit: Trying to commit instructions in the ROB.

 133500: Commit: system.cpu.commit: Trying to commit head instruction, [tid:0] [sn:3]
 133500: Commit: system.cpu.commit: [tid:0] [sn:3] Committing instruction with PC (0x400080=>0x400084).(0=>1)
 130000: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+8    :   movz   x2, #2, #0        : IntAlu :  D=0x0000000000000002  FetchSeq=3  CPSeq=3  flags=(IsInteger)
 133500: ROB: system.cpu.rob: [tid:0] Retiring head instruction, instruction PC (0x400080=>0x400084).(0=>1), [sn:3]
 133500: O3CPU: system.cpu: Removing committed instruction [tid:0] PC (0x400080=>0x400084).(0=>1) [sn:3]

 133500: Commit: system.cpu.commit: Trying to commit head instruction, [tid:0] [sn:4]
 133500: Commit: system.cpu.commit: [tid:0] [sn:4] Committing instruction with PC (0x400084=>0x400088).(0=>1)
 130000: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+12    :   movz   x3, #3, #0        : IntAlu :  D=0x0000000000000003  FetchSeq=4  CPSeq=4  flags=(IsInteger)
 133500: ROB: system.cpu.rob: [tid:0] Retiring head instruction, instruction PC (0x400084=>0x400088).(0=>1), [sn:4]
 133500: O3CPU: system.cpu: Removing committed instruction [tid:0] PC (0x400084=>0x400088).(0=>1) [sn:4]

 133500: Commit: system.cpu.commit: Trying to commit head instruction, [tid:0] [sn:5]
 133500: Commit: system.cpu.commit: [tid:0] [sn:5] Committing instruction with PC (0x400088=>0x40008c).(0=>1)
 130000: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+16    :   movz   x4, #4, #0        : IntAlu :  D=0x0000000000000004  FetchSeq=5  CPSeq=5  flags=(IsInteger)
 133500: ROB: system.cpu.rob: [tid:0] Retiring head instruction, instruction PC (0x400088=>0x40008c).(0=>1), [sn:5]
 133500: O3CPU: system.cpu: Removing committed instruction [tid:0] PC (0x400088=>0x40008c).(0=>1) [sn:5]

 133500: Commit: system.cpu.commit: Trying to commit head instruction, [tid:0] [sn:6]
 133500: Commit: system.cpu.commit: [tid:0] [sn:6] Committing instruction with PC (0x40008c=>0x400090).(0=>1)
 130000: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+20    :   movz   x5, #5, #0        : IntAlu :  D=0x0000000000000005  FetchSeq=6  CPSeq=6  flags=(IsInteger)
 133500: ROB: system.cpu.rob: [tid:0] Retiring head instruction, instruction PC (0x40008c=>0x400090).(0=>1), [sn:6]
 133500: O3CPU: system.cpu: Removing committed instruction [tid:0] PC (0x40008c=>0x400090).(0=>1) [sn:6]

 133500: Commit: system.cpu.commit: Trying to commit head instruction, [tid:0] [sn:7]
 133500: Commit: system.cpu.commit: [tid:0] [sn:7] Committing instruction with PC (0x400090=>0x400094).(0=>1)
 130000: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+24    :   movz   x6, #6, #0        : IntAlu :  D=0x0000000000000006  FetchSeq=7  CPSeq=7  flags=(IsInteger)
 133500: ROB: system.cpu.rob: [tid:0] Retiring head instruction, instruction PC (0x400090=>0x400094).(0=>1), [sn:7]
 133500: O3CPU: system.cpu: Removing committed instruction [tid:0] PC (0x400090=>0x400094).(0=>1) [sn:7]

 133500: Commit: system.cpu.commit: Trying to commit head instruction, [tid:0] [sn:8]
 133500: Commit: system.cpu.commit: [tid:0] [sn:8] Committing instruction with PC (0x400094=>0x400098).(0=>1)
 130000: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+28    :   movz   x7, #7, #0        : IntAlu :  D=0x0000000000000007  FetchSeq=8  CPSeq=8  flags=(IsInteger)
 133500: ROB: system.cpu.rob: [tid:0] Retiring head instruction, instruction PC (0x400094=>0x400098).(0=>1), [sn:8]
 133500: O3CPU: system.cpu: Removing committed instruction [tid:0] PC (0x400094=>0x400098).(0=>1) [sn:8]

 133500: Commit: system.cpu.commit: [tid:0] Marking PC (0x400098=>0x40009c).(0=>1), [sn:9] ready within ROB.
 133500: Commit: system.cpu.commit: [tid:0] Marking PC (0x40009c=>0x4000a0).(0=>1), [sn:10] ready within ROB.
 133500: Commit: system.cpu.commit: [tid:0] Marking PC (0x4000a0=>0x4000a4).(0=>1), [sn:11] ready within ROB.
 133500: Commit: system.cpu.commit: [tid:0] Marking PC (0x4000a4=>0x4000a8).(0=>1), [sn:12] ready within ROB.
 133500: Commit: system.cpu.commit: [tid:0] Marking PC (0x4000a8=>0x4000ac).(0=>1), [sn:13] ready within ROB.
 133500: Commit: system.cpu.commit: [tid:0] Marking PC (0x4000ac=>0x4000b0).(0=>1), [sn:14] ready within ROB.
 133500: Commit: system.cpu.commit: [tid:0] Instruction [sn:9] PC (0x400098=>0x40009c).(0=>1) is head of ROB and ready to commit
 133500: Commit: system.cpu.commit: [tid:0] ROB has 10 insts & 182 free entries.

ROB stands for Re-order buffer.

0x400080⇒0x400084 is an old/new PC address of the first committed instruction.

Another thing we can do, it to try to follow one of the instructions back as it goes through the pipeline. Searching for example for the address 0x400080, we find:

The first mention of the address happens when is the fetch of the two initial instructions completes. TODO not sure why it doesn’t just also fetch the next cache line at the same time:

FullO3CPU: Ticking main, FullO3CPU.
  78500: Fetch: system.cpu.fetch: Running stage.
  78500: Fetch: system.cpu.fetch: Attempting to fetch from [tid:0]
  78500: Fetch: system.cpu.fetch: [tid:0] Icache miss is complete.
  78500: Fetch: system.cpu.fetch: [tid:0] Adding instructions to queue to decode.
  78500: DynInst: global: DynInst: [sn:1] Instruction created. Instcount for system.cpu = 1
  78500: Fetch: system.cpu.fetch: [tid:0] Instruction PC 0x400078 (0) created [sn:1].
  78500: Fetch: system.cpu.fetch: [tid:0] Instruction is:   movz   x0, #0, #0
  78500: Fetch: system.cpu.fetch: [tid:0] Fetch queue entry created (1/32).
  78500: DynInst: global: DynInst: [sn:2] Instruction created. Instcount for system.cpu = 2
  78500: Fetch: system.cpu.fetch: [tid:0] Instruction PC 0x40007c (0) created [sn:2].
  78500: Fetch: system.cpu.fetch: [tid:0] Instruction is:   movz   x1, #1, #0
  78500: Fetch: system.cpu.fetch: [tid:0] Fetch queue entry created (2/32).
  78500: Fetch: system.cpu.fetch: [tid:0] Issuing a pipelined I-cache access, starting at PC (0x400080=>0x400084).(0=>1).
  78500: Fetch: system.cpu.fetch: [tid:0] Fetching cache line 0x400080 for addr 0x400080

so we observe that the first two instructions arrived, and the CPU noticed that 0x400080 hasn’t been fetched yet.

Then for several cycles that follow, the fetch stage just says that it is blocked on data returning:

FullO3CPU: Ticking main, FullO3CPU.
  79000: Fetch: system.cpu.fetch: Running stage.
  79000: Fetch: system.cpu.fetch: There are no more threads available to fetch from.
  79000: Fetch: system.cpu.fetch: [tid:0] Fetch is waiting cache response!

At the same time, the execution of the initial 2 instructions progresses through the pipeline.

These progress up until:

  88000: O3CPU: system.cpu: Idle!

at which point there are no more events scheduled besides waiting for the second cache line to come back.

After this, some time passes without events, and the next tick happens when the fetch data returns:

FullO3CPU: Ticking main, FullO3CPU.
 130000: Fetch: system.cpu.fetch: Running stage.
 130000: Fetch: system.cpu.fetch: Attempting to fetch from [tid:0]
 130000: Fetch: system.cpu.fetch: [tid:0] Icache miss is complete.
 130000: Fetch: system.cpu.fetch: [tid:0] Adding instructions to queue to decode.
 130000: DynInst: global: DynInst: [sn:3] Instruction created. Instcount for system.cpu = 1
 130000: Fetch: system.cpu.fetch: [tid:0] Instruction PC 0x400080 (0) created [sn:3].
 130000: Fetch: system.cpu.fetch: [tid:0] Instruction is:   movz   x2, #2, #0
 130000: Fetch: system.cpu.fetch: [tid:0] Fetch queue entry created (1/32).
 130000: DynInst: global: DynInst: [sn:4] Instruction created. Instcount for system.cpu = 2
 130000: Fetch: system.cpu.fetch: [tid:0] Instruction PC 0x400084 (0) created [sn:4].
 130000: Fetch: system.cpu.fetch: [tid:0] Instruction is:   movz   x3, #3, #0
 130000: Fetch: system.cpu.fetch: [tid:0] Fetch queue entry created (2/32).
 130000: DynInst: global: DynInst: [sn:5] Instruction created. Instcount for system.cpu = 3