24.22.4.4. gem5 event queue AtomicSimpleCPU syscall emulation freestanding example analysis with caches and multiple CPUs

It would be amazing to analyze a simple example with interconnect packets possibly invalidating caches of other CPUs.

To observe it, we could create one well controlled workload that contains memory flush instructions, and run it on two CPUs.

If we don’t use such memory flush instructions, we would only see the interconnect at work once caches run out of space and start evicting lines.
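
For reference, one way to emit such flushes from userland is the compiler’s cache clearing builtin, which on aarch64 expands to the required data and instruction cache maintenance instructions. A minimal sketch only; the rest of this section does not use it, and relies on a shared variable instead:

/* Minimal sketch: ask the compiler to emit cache maintenance
 * instructions over a byte range. On aarch64, GCC and Clang expand
 * this builtin to the needed dc/ic maintenance sequence. */
void flush_range(char *begin, char *end) {
    __builtin___clear_cache(begin, end);
}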

For this study, we will use the same CLI as gem5 event queue TimingSimpleCPU syscall emulation freestanding example analysis, but with multiple CPUs and a multithreaded workload which shares a variable across threads.

We can use userland/c/atomic.c (see also C multithreading) at LKMC 7c01b29f1ee7da878c7cc9cb4565f3f3cf516a92 and gem5 872cb227fdc0b4d60acc7840889d567a6936b6e1, as in Detailed gem5 analysis of how data races happen:

./run \
  --arch aarch64 \
  --cli-args '2 10' \
  --cpus 3 \
  --emulator gem5 \
  --trace FmtFlag,Cache,DRAM,ExecAll,XBar \
  --userland userland/c/atomic.c \
  -- \
  --caches \
;
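
Before looking at the traces, it helps to keep in mind what the workload does. The following is a paraphrased sketch of the idea of userland/c/atomic.c, not its exact source: every thread increments a shared global in a plain, unsynchronized loop, with the thread and iteration counts taken from the CLI arguments ('2 10' above):

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

/* Paraphrased sketch of the idea of userland/c/atomic.c, not its
 * exact source: each thread bumps a shared global without any
 * synchronization, so different CPUs race on the same cache line. */
unsigned long long global; /* the shared variable (physical 0x2060 below) */
unsigned long long niters;

void *my_thread_main(void *arg) {
    (void)arg;
    for (unsigned long long i = 0; i < niters; ++i)
        global++; /* the unsynchronized LDR/STR pair seen in the ExecAll logs */
    return NULL;
}

int main(int argc, char **argv) {
    unsigned long long nthreads = argc > 1 ? strtoull(argv[1], NULL, 0) : 2;
    niters = argc > 2 ? strtoull(argv[2], NULL, 0) : 10;
    pthread_t *threads = malloc(nthreads * sizeof(*threads));
    for (unsigned long long i = 0; i < nthreads; ++i)
        pthread_create(&threads[i], NULL, my_thread_main, NULL);
    for (unsigned long long i = 0; i < nthreads; ++i)
        pthread_join(threads[i], NULL);
    printf("expected %llu, got %llu\n", nthreads * niters, global);
    free(threads);
    return 0;
}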

The config.dot.svg now looks like this but with 3 CPUs instead of 2:

gem5 config TimingSimpleCPU caches 2 CPUs 12c917de54145d2d50260035ba7fa614e25317a3
Figure 5. config.dot.svg for a system with two TimingSimpleCPU with caches.

Once again we focus on the shared function region my_thread_main, which is where the interesting cross-core memory collisions will be happening.

As a maybe-not-so-interesting warm-up, we have a look at the very first instruction fetches of my_thread_main on CPU1:

93946000: Cache: system.cpu1.icache: access for ReadReq [8b0:8b3] IF miss
93946000: Cache: system.cpu1.icache: createMissPacket: created ReadCleanReq [880:8bf] IF from ReadReq [8b0:8b3] IF
93946000: Cache: system.cpu1.icache: handleAtomicReqMiss: Sending an atomic ReadCleanReq [880:8bf] IF
93946000: CoherentXBar: system.membus: recvAtomicBackdoor: src system.membus.slave[5] packet ReadCleanReq [880:8bf] IF
93946000: CoherentXBar: system.membus: recvAtomicBackdoor: src system.membus.slave[5] packet ReadCleanReq [880:8bf] IF SF size: 1 lat: 1
93946000: Cache: system.cpu0.icache: handleSnoop: snoop hit for ReadCleanReq [880:8bf] IF, old state is state: 7 (E) valid: 1 writable: 1 readable: 1 dirty: 0 | tag: 0 set: 0x22 way: 0
93946000: Cache: system.cpu0.icache: new state is state: 5 (S) valid: 1 writable: 0 readable: 1 dirty: 0 | tag: 0 set: 0x22 way: 0
93946000: DRAM: system.mem_ctrls: recvAtomic: ReadCleanReq 0x880
93946000: Cache: system.cpu1.icache: handleAtomicReqMiss: Receive response: ReadResp [880:8bf] IF in state 0
93946000: Cache: system.cpu1.icache: Block addr 0x880 (ns) moving from state 0 to state: 5 (S) valid: 1 writable: 0 readable: 1 dirty: 0 | tag: 0 set: 0x22 way: 0
93946000: ExecEnable: system.cpu1: A0 T0 : @my_thread_main    :   sub   sp, sp, #48        : IntAlu :  D=0x0000003fffd6b9a0  flags=(IsInteger)
93946500: Cache: system.cpu1.icache: access for ReadReq [8b4:8b7] IF hit state: 5 (S) valid: 1 writable: 0 readable: 1 dirty: 0 | tag: 0 set: 0x22 way: 0
93946500: Cache: system.cpu1.dcache: access for WriteReq [a19a8:a19af] hit state: f (M) valid: 1 writable: 1 readable: 1 dirty: 1 | tag: 0x14 set: 0x66 way: 0
93946500: ExecEnable: system.cpu1: A0 T0 : @my_thread_main+4    :   str   x0, [sp, #8]       : MemWrite :  D=0x0000007ffffefc70 A=0x3fffd6b9a8  flags=(IsInteger|IsMemRef|IsStore)

Now that we know how to read cache logs from gem5 event queue TimingSimpleCPU syscall emulation freestanding example analysis with caches, it is easier to understand what happened:

  • the physical address for my_thread_main is 0x8b0, and requesting it is a miss: this is the first time CPU1 goes anywhere near that region, as it was previously executing standard library code far from our text segment

  • CPU0 already has that cache line (block 0x880) in its icache in state E of MOESI, so it snoops the request and moves to S (see the sketch after this list). We can look up the logs to see exactly where CPU0 had previously read that address:

    59135500: Cache: system.cpu0.icache: Block addr 0x880 (ns) moving from state 0 to state: 7 (E) valid: 1 writable: 1 readable: 1 dirty: 0 | tag: 0 set: 0x22 way: 0
    59135500: CoherentXBar: system.membus: recvAtomicBackdoor: src system.membus.slave[1] packet WritebackClean [8880:88bf]
    59135500: CoherentXBar: system.membus: recvAtomicBackdoor: src system.membus.slave[1] packet WritebackClean [8880:88bf] SF size: 0 lat: 1
    59135500: DRAM: system.mem_ctrls: recvAtomic: WritebackClean 0x8880
    59135500: ExecEnable: system.cpu0: A0 T0 : @frame_dummy    : stp
  • the request does touch RAM: it does not get served directly by the other cache. CPU1 also ends up in state S for the block

  • the second fetch from CPU1 is 4 bytes further ahead at 0x8b4, and this time it is of course a hit, since 0x8b4 falls within the already fetched line 0x880.

    Since this instruction is an STR, it also does a dcache access, to 0xa19a8 in this case, near its stack pointer SP, and that one is also a hit. Not surprising: stack accesses are basically the very first thing any C code does, so some setup code must already have run on CPU1 before my_thread_main.
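
Two reading aids before moving on. First, the state numbers in the logs are just bitmasks of the flags printed next to them: valid is bit 0, writable bit 1, readable bit 2 and dirty bit 3, which is why E shows as 7, S as 5, M as f, O as d and I as 0. Second, every snoop-induced transition in this trace fits the following few lines of C, a simplified MOESI model meant only as a log reading aid, not gem5's actual handleSnoop logic:

/* Simplified snoop-side MOESI model matching this trace. The enum
 * values are the state bitmasks that the gem5 logs print. */
enum moesi { I = 0x0, S = 0x5, E = 0x7, O = 0xd, M = 0xf };

/* New state of our copy of a line when another cache's request for
 * it is snooped. */
enum moesi snoop(enum moesi cur, int other_will_write) {
    if (cur == I)
        return I;         /* we do not have the line: nothing to do */
    if (other_will_write)
        return I;         /* e.g. UpgradeReq: our copy is invalidated */
    switch (cur) {        /* the other cache only wants to read: */
    case E: return S;     /* clean exclusive: demote to shared */
    case M: return O;     /* dirty: keep ownership but allow sharing */
    default: return cur;  /* S and O are already shared */
    }
}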

Now let’s look for the address of the integer that is incremented and shared across threads. We know from Detailed gem5 analysis of how data races happen that the read happens at my_thread_main+36, so we search for the first occurrence:

93952500: Cache: system.cpu1.icache: access for ReadReq [8d4:8d7] IF hit state: 7 (E) valid: 1 writable: 1 readable: 1 dirty: 0 | tag: 0 set: 0x23 way: 0
93952500: Cache: system.cpu1.dcache: access for ReadReq [2060:2063] miss
93952500: Cache: system.cpu1.dcache: createMissPacket: created ReadSharedReq [2040:207f] from ReadReq [2060:2063]
93952500: Cache: system.cpu1.dcache: handleAtomicReqMiss: Sending an atomic ReadSharedReq [2040:207f]
93952500: CoherentXBar: system.membus: recvAtomicBackdoor: src system.membus.slave[6] packet ReadSharedReq [2040:207f]
93952500: CoherentXBar: system.membus: recvAtomicBackdoor: src system.membus.slave[6] packet ReadSharedReq [2040:207f] SF size: 0 lat: 1
93952500: DRAM: system.mem_ctrls: recvAtomic: ReadSharedReq 0x2040
93952500: Cache: system.cpu1.dcache: handleAtomicReqMiss: Receive response: ReadResp [2040:207f] in state 0
93952500: Cache: system.cpu1.dcache: Block addr 0x2040 (ns) moving from state 0 to state: 7 (E) valid: 1 writable: 1 readable: 1 dirty: 0 | tag: 0 set: 0x81 way: 0
93952500: ExecEnable: system.cpu1: A0 T0 : @my_thread_main+36    :   ldr   x0, [x0]           : MemRead :  D=0x0000000000000000 A=0x411060  flags=(IsInteger|IsMemRef|IsLoad)

so we determine that its physical address is 0x2060. The access was a miss, and the block then went into E.
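
As an aside, the block, set and tag values printed in these logs are consistent with 64-byte lines and 512 sets per cache. A quick sanity check, where those two parameters are inferred from the trace itself rather than read from the gem5 config:

#include <stdio.h>

/* Decompose addresses the way the logs print them, assuming
 * 64-byte lines (6 offset bits) and 512 sets (9 index bits),
 * parameters inferred from this trace rather than from the config. */
int main(void) {
    unsigned long long addrs[] = { 0x2060, 0x8b0, 0xa19a8 };
    for (int i = 0; i < 3; ++i) {
        unsigned long long a = addrs[i];
        printf("addr 0x%llx: block 0x%llx set 0x%llx tag 0x%llx\n",
               a,
               a & ~0x3fULL,      /* block address */
               (a >> 6) & 0x1ff,  /* set index */
               a >> 15);          /* tag */
    }
    /* Prints block 0x2040 set 0x81 tag 0x0, block 0x880 set 0x22
     * tag 0x0 and block 0xa1980 set 0x66 tag 0x14, matching the
     * logs above. */
    return 0;
}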

So we look ahead to the following accesses to that physical address, before CPU2 reaches that point of the code and starts making requests as well.

First there is the STR that pairs with the first LDR, which is of course a hit:

93954500: Cache: system.cpu1.dcache: access for WriteReq [2060:2063] hit state: 7 (E) valid: 1 writable: 1 readable: 1 dirty: 0 | tag: 0 set: 0x81 way: 0
93954500: ExecEnable: system.cpu1: A0 T0 : @my_thread_main+52    :   str   x1, [x0]           : MemWrite :  D=0x0000000000000001 A=0x411060  flags=(IsInteger|IsMemRef|IsStore)

It found the line in E, so we presume that the write silently moves it to M. The second read then confirms that it was indeed in M:

93964500: Cache: system.cpu1.dcache: access for ReadReq [2060:2063] hit state: f (M) valid: 1 writable: 1 readable: 1 dirty: 1 | tag: 0 set: 0x81 way: 0
93964500: ExecEnable: system.cpu1: A0 T0 : @my_thread_main+36    :   ldr   x0, [x0]           : MemRead :  D=0x0000000000000001 A=0x411060  flags=(IsInteger|IsMemRef|IsLoad)

and so on.
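
The requester-side write transitions can be modelled just as briefly. The E and M cases are what we have just seen; the S and O cases are what CPU2 will do below. Again only a reading aid under plain MOESI assumptions, not gem5 code:

/* Requester-side MOESI write hit as this trace exhibits it. Same
 * state encoding as the snoop sketch above. */
enum moesi { I = 0x0, S = 0x5, E = 0x7, O = 0xd, M = 0xf };

enum moesi write_hit(enum moesi cur) {
    switch (cur) {
    case E:
    case M:
        return M;   /* already writable: upgrade silently, no bus traffic */
    case S:
    case O:
        /* Not writable: an UpgradeReq is broadcast first so that all
         * other copies get invalidated, then the write proceeds. */
        return M;
    default:
        return cur; /* I would be a write miss, not a hit */
    }
}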

Now let’s jump to when CPU2 starts making requests.

The first time this happens is on its first LDR:

94058500: Cache: system.cpu2.dcache: access for ReadReq [2060:2063] miss
94058500: Cache: system.cpu2.dcache: createMissPacket: created ReadSharedReq [2040:207f] from ReadReq [2060:2063]
94058500: Cache: system.cpu2.dcache: handleAtomicReqMiss: Sending an atomic ReadSharedReq [2040:207f]
94058500: CoherentXBar: system.membus: recvAtomicBackdoor: src system.membus.slave[10] packet ReadSharedReq [2040:207f]
94058500: CoherentXBar: system.membus: recvAtomicBackdoor: src system.membus.slave[10] packet ReadSharedReq [2040:207f] SF size: 1 lat: 1
94058500: Cache: system.cpu1.dcache: handleSnoop: snoop hit for ReadSharedReq [2040:207f], old state is state: f (M) valid: 1 writable: 1 readable: 1 dirty: 1 | tag: 0 set: 0x81 way: 0
94058500: Cache: system.cpu1.dcache: new state is state: d (O) valid: 1 writable: 0 readable: 1 dirty: 1 | tag: 0 set: 0x81 way: 0
94058500: CoherentXBar: system.membus: recvAtomicBackdoor: Not forwarding ReadSharedReq [2040:207f]
94058500: Cache: system.cpu2.dcache: handleAtomicReqMiss: Receive response: ReadResp [2040:207f] in state 0
94058500: Cache: system.cpu2.dcache: Block addr 0x2040 (ns) moving from state 0 to state: 5 (S) valid: 1 writable: 0 readable: 1 dirty: 0 | tag: 0 set: 0x81 way: 0
94058500: ExecEnable: system.cpu2: A0 T0 : @my_thread_main+36    :   ldr   x0, [x0]           : MemRead :  D=0x0000000000000009 A=0x411060  flags=(IsInteger|IsMemRef|IsLoad)

and from this we see:

  • CPU1 moves from M to O

  • CPU2 moves from I to S

It also appears that no DRAM was accessed, since there are no DRAM logs this time: presumably the XBar got the value directly from the other cache. TODO: why did the earlier DRAM read at 93946000 happen then, given that CPU0 had the line when CPU1 asked for it? Perhaps because CPU0 held the line clean (E), so memory could still supply the data, whereas here CPU1 holds it dirty (M) and must supply it itself; the toy model below follows that assumption.

The above log sequence also makes it clear that it is the XBar that maintains coherency: the CPU2 cache tells the XBar what it is doing, and the XBar then tells the caches of the other CPUs about it, which is what leads CPU1 to move to O.
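
To make that concrete, here is a toy model of what the logs suggest the atomic mode crossbar does: consult the snoop filter, snoop the caches that hold the line, and fall through to DRAM only when no cache supplied the data. This is an illustration of the observed behaviour, not gem5’s actual recvAtomicBackdoor; in particular, the rule that only dirty lines supply data is our assumption to reconcile the two traces:

#include <stdio.h>

/* Toy model of the coherent crossbar's atomic read path, matching
 * the observed logs. Not gem5 code; the "only dirty lines supply
 * data" rule is an assumption made to reconcile the two traces. */
enum moesi { I = 0x0, S = 0x5, E = 0x7, O = 0xd, M = 0xf };

struct cache { const char *name; enum moesi state; };

/* Snoop one cache for a read; return 1 if it supplied the data. */
static int snoop_read(struct cache *c) {
    switch (c->state) {
    case M: c->state = O; return 1; /* dirty: supply data, keep ownership */
    case O: return 1;               /* dirty owner: supply data */
    case E: c->state = S; return 0; /* clean: demote, memory supplies */
    default: return 0;
    }
}

static void xbar_atomic_read(struct cache *caches, int n, int req) {
    int supplied = 0, shared = 0;
    for (int i = 0; i < n; ++i) {
        if (i == req || caches[i].state == I)
            continue;                       /* snoop filter miss */
        shared = 1;
        printf("  snoop hit in %s\n", caches[i].name);
        supplied |= snoop_read(&caches[i]);
    }
    if (!supplied)
        printf("  DRAM access\n");
    caches[req].state = shared ? S : E;     /* requester's final state */
}

int main(void) {
    /* 93946000: CPU1 ifetch miss while CPU0 holds the line E:
     * CPU0 demotes to S, but DRAM is still read. */
    struct cache ic[] = { {"cpu0.icache", E}, {"cpu1.icache", I} };
    printf("CPU1 ifetch miss:\n");
    xbar_atomic_read(ic, 2, 1);
    /* 94058500: CPU2 load miss while CPU1 holds the line M:
     * CPU1 supplies the data and moves to O, DRAM is not touched. */
    struct cache dc[] = { {"cpu1.dcache", M}, {"cpu2.dcache", I} };
    printf("CPU2 load miss:\n");
    xbar_atomic_read(dc, 2, 1);
    return 0;
}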

Then CPU1 hits its LDR on O:

94060500: Cache: system.cpu1.dcache: access for ReadReq [2060:2063] hit state: d (O) valid: 1 writable: 0 readable: 1 dirty: 1 | tag: 0 set: 0x81 way: 0
94060500: ExecEnable: system.cpu1: A0 T0 : @my_thread_main+36    :   ldr   x0, [x0]           : MemRead :  D=0x0000000000000009 A=0x411060  flags=(IsInteger|IsMemRef|IsLoad)

and then CPU2 does its write, moving itself to M and CPU1 to I:

94060500: Cache: system.cpu2.dcache: access for WriteReq [2060:2063] hit state: 5 (S) valid: 1 writable: 0 readable: 1 dirty: 0 | tag: 0 set: 0x81 way: 0
94060500: Cache: system.cpu2.dcache: createMissPacket: created UpgradeReq [2040:207f] from WriteReq [2060:2063]
94060500: Cache: system.cpu2.dcache: handleAtomicReqMiss: Sending an atomic UpgradeReq [2040:207f]
94060500: CoherentXBar: system.membus: recvAtomicBackdoor: src system.membus.slave[10] packet UpgradeReq [2040:207f]
94060500: CoherentXBar: system.membus: recvAtomicBackdoor: src system.membus.slave[10] packet UpgradeReq [2040:207f] SF size: 1 lat: 1
94060500: Cache: system.cpu1.dcache: handleSnoop: snoop hit for UpgradeReq [2040:207f], old state is state: d (O) valid: 1 writable: 0 readable: 1 dirty: 1 | tag: 0 set: 0x81 way: 0
94060500: Cache: system.cpu1.dcache: new state is state: 0 (I) valid: 0 writable: 0 readable: 0 dirty: 0 | tag: 0xffffffffffffffff set: 0x81 way: 0
94060500: CoherentXBar: system.membus: recvAtomicBackdoor: Not forwarding UpgradeReq [2040:207f]
94060500: Cache: system.cpu2.dcache: handleAtomicReqMiss: Receive response: UpgradeResp [2040:207f] in state 5
94060500: Cache: system.cpu2.dcache: Block addr 0x2040 (ns) moving from state 5 to state: f (M) valid: 1 writable: 1 readable: 1 dirty: 1 | tag: 0 set: 0x81 way: 0
94060500: ExecEnable: system.cpu2: A0 T0 : @my_thread_main+52    :   str   x1, [x0]           : MemWrite :  D=0x000000000000000a A=0x411060  flags=(IsInteger|IsMemRef|IsStore)

and so on: they just keep fighting over that address and changing one another’s state.