24.22.4.3. gem5 event queue TimingSimpleCPU syscall emulation freestanding example analysis with caches

Let’s just add --caches to the setup from gem5 event queue TimingSimpleCPU syscall emulation freestanding example analysis to see if things go any faster, and add Cache to --trace as in:

--trace Cache,Event,ExecAll,-ExecSymbol,FmtFlag

The resulting trace is:

#0         0: Event: system.cpu.wrapped_function_event: EventFunctionWrapped 43 scheduled @ 0
#2         0: Event: system.mem_ctrls_0.wrapped_function_event: EventFunctionWrapped 14 scheduled @ 7786250
#3         0: Event: system.mem_ctrls_1.wrapped_function_event: EventFunctionWrapped 20 scheduled @ 7786250
#4         0: Event: Event_84: generic 84 scheduled @ 0
#5         0: Event: Event_84: generic 84 rescheduled @ 18446744073709551615
#6         0: Event: system.cpu.wrapped_function_event: EventFunctionWrapped 43 executed @ 0
#7         0: Cache: system.cpu.icache: access for ReadReq [78:7b] IF miss
#8         0: Event: system.cpu.icache.mem_side-MemSidePort.wrapped_function_event: EventFunctionWrapped 59 scheduled @ 1000
#9      1000: Event: system.cpu.icache.mem_side-MemSidePort.wrapped_function_event: EventFunctionWrapped 59 executed @ 1000
#10     1000: Cache: system.cpu.icache: sendMSHRQueuePacket: MSHR ReadReq [78:7b] IF
#12     1000: Cache: system.cpu.icache: createMissPacket: created ReadCleanReq [40:7f] IF from ReadReq [78:7b] IF
#13     1000: Event: system.mem_ctrls.wrapped_function_event: EventFunctionWrapped 9 scheduled @ 1000
#14     1000: Event: system.membus.reqLayer0.wrapped_function_event: EventFunctionWrapped 70 scheduled @ 2000
#15     1000: Event: system.mem_ctrls.wrapped_function_event: EventFunctionWrapped 9 executed @ 1000
#16     1000: Event: system.mem_ctrls_0.wrapped_function_event: EventFunctionWrapped 12 scheduled @ 1000
#17     1000: Event: system.mem_ctrls.wrapped_function_event: EventFunctionWrapped 10 scheduled @ 46250
#18     1000: Event: system.mem_ctrls.wrapped_function_event: EventFunctionWrapped 9 scheduled @ 5000
#19     1000: Event: system.mem_ctrls_0.wrapped_function_event: EventFunctionWrapped 12 executed @ 1000
#20     1000: Event: system.mem_ctrls_0.wrapped_function_event: EventFunctionWrapped 15 scheduled @ 1000
#22     1000: Event: system.mem_ctrls_0.wrapped_function_event: EventFunctionWrapped 15 executed @ 1000
#23     2000: Event: system.membus.reqLayer0.wrapped_function_event: EventFunctionWrapped 70 executed @ 2000
#24     5000: Event: system.mem_ctrls.wrapped_function_event: EventFunctionWrapped 9 executed @ 5000
#25    46250: Event: system.mem_ctrls.wrapped_function_event: EventFunctionWrapped 10 executed @ 46250
#26    46250: Event: system.mem_ctrls.port-RespPacketQueue.wrapped_function_event: EventFunctionWrapped 8 scheduled @ 74250
#27    74250: Event: system.mem_ctrls.port-RespPacketQueue.wrapped_function_event: EventFunctionWrapped 8 executed @ 74250
#28    74250: Event: system.membus.slave[1]-RespPacketQueue.wrapped_function_event: EventFunctionWrapped 74 scheduled @ 77000
#29    74250: Event: system.membus.respLayer1.wrapped_function_event: EventFunctionWrapped 75 scheduled @ 80000
#30    77000: Event: system.membus.slave[1]-RespPacketQueue.wrapped_function_event: EventFunctionWrapped 74 executed @ 77000
#32    77000: Cache: system.cpu.icache: recvTimingResp: Handling response ReadResp [40:7f] IF
#33    77000: Cache: system.cpu.icache: Block for addr 0x40 being updated in Cache
#34    77000: Cache: system.cpu.icache: Block addr 0x40 (ns) moving from state 0 to state: 7 (E) valid: 1 writable: 1 readable: 1 dirty: 0 | tag: 0 set: 0x1 way: 0
#35    77000: Event: system.cpu.icache.cpu_side-CpuSidePort.wrapped_function_event: EventFunctionWrapped 57 scheduled @ 78000
#36    78000: Event: system.cpu.icache.cpu_side-CpuSidePort.wrapped_function_event: EventFunctionWrapped 57 executed @ 78000
#37    78000: Event: Event_40: Timing CPU icache tick 40 scheduled @ 78000
#38    78000: Event: Event_40: Timing CPU icache tick 40 executed @ 78000
#39    78000: ExecEnable: system.cpu: A0 T0 : 0x400078    :   movz   x0, #1, #0        : IntAlu :  D=0x0000000000000001  flags=(IsInteger)
#40    78000: Cache: system.cpu.icache: access for ReadReq [7c:7f] IF hit state: 7 (E) valid: 1 writable: 1 readable: 1 dirty: 0 | tag: 0 set: 0x1 way: 0
#42    78000: Event: system.cpu.icache.cpu_side-CpuSidePort.wrapped_function_event: EventFunctionWrapped 57 scheduled @ 83000
#43    80000: Event: system.membus.respLayer1.wrapped_function_event: EventFunctionWrapped 75 executed @ 80000
#44    83000: Event: system.cpu.icache.cpu_side-CpuSidePort.wrapped_function_event: EventFunctionWrapped 57 executed @ 83000
#45    83000: Event: Event_40: Timing CPU icache tick 40 scheduled @ 83000
#46    83000: Event: Event_40: Timing CPU icache tick 40 executed @ 83000
#47    83000: ExecEnable: system.cpu: A0 T0 : 0x40007c    :   adr   x1, #28            : IntAlu :  D=0x0000000000400098  flags=(IsInteger)
#48    83000: Event: system.cpu.icache.mem_side-MemSidePort.wrapped_function_event: EventFunctionWrapped 59 scheduled @ 84000
          [...]
      191000: Event: Event_85: generic 85 scheduled @ 191000
      191000: Event: Event_85: generic 85 executed @ 191000

So yes, --caches does work here, leading to a runtime of 191000 rather than 469000 without caches!

Notably, we now see that very little time passes between the first and second instructions, which are marked with ExecEnable in #39 and #47. Presumably that is because, when a hit happens, the event chain stops right at icache.cpu_side rather than going all the way out to the DRAM system, and that must have been the case for the second instruction, which is just adjacent to the first one.
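
Taking the tick values at face value, the arithmetic is simple; a minimal sketch (tick numbers copied from the trace above, the ratio is just deduced from them):

```python
# Tick values taken from the trace above.
miss_start, miss_exec = 0, 78000    # first fetch misses and goes out to DRAM
hit_start, hit_exec = 78000, 83000  # second fetch hits in the icache

miss_latency = miss_exec - miss_start
hit_latency = hit_exec - hit_start
print(miss_latency, hit_latency)  # 78000 5000: the miss costs over 15x more
```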

It is also interesting to look into the generated config.dot.svg and compare it to the one without caches: Figure 3, “config.dot.svg for a TimingSimpleCPU without caches.”. With caches: Figure 4, “config.dot.svg for a TimingSimpleCPU with caches.”.

We can see from those that we now have icache and dcache elements inside the CPU block, and that the CPU icache and dcache ports now go through the caches to the SystemXBar rather than being directly connected as before.

It is worth noting that the caches do not affect the ArmITB and ArmDTB TLBs, since those are already caches themselves.

Figure 4. config.dot.svg for a TimingSimpleCPU with caches.

We can break down the events between the instructions as follows.

First, based on TimingSimpleCPU analysis #5, we b TimingSimpleCPU::fetch to see how the initial magically scheduled fetch, which is necessarily a cache miss, works:

EventManager::schedule
PacketQueue::schedSendEvent
BaseCache::CacheMasterPort::schedSendEvent
BaseCache::schedMemSideSendEvent
BaseCache::allocateMissBuffer
BaseCache::handleTimingReqMiss
Cache::handleTimingReqMiss
BaseCache::recvTimingReq
Cache::recvTimingReq
BaseCache::CpuSidePort::recvTimingReq
TimingRequestProtocol::sendReq
MasterPort::sendTimingReq
TimingSimpleCPU::sendFetch
TimingSimpleCPU::FetchTranslation::finish
ArmISA::TLB::translateComplete
ArmISA::TLB::translateTiming
ArmISA::TLB::translateTiming
TimingSimpleCPU::fetch

By comparing this to the uncached access at TimingSimpleCPU analysis #25, we see that this one does not reach the CoherentXBar at all: the cache must be scheduling an event in the future to model the delay between receiving the request and communicating with the XBar.

A quick structural view of the source shows that the non-Ruby caches, such as the ones from this example, are located under:

src/mem/cache

and have the following simple class hierarchy:

  • BaseCache

    • Cache

    • NoncoherentCache

Next, we fast forward to #39 with b TimingSimpleCPU::IcachePort::ITickEvent::process which, as we know from previous sections, is the event that executes instructions, and therefore leaves us at the start of the second instruction.

Then, we b EventManager::schedule to see what that schedules:

EventManager::schedule
PacketQueue::schedSendEvent
PacketQueue::schedSendTiming
QueuedSlavePort::schedTimingResp
BaseCache::handleTimingReqHit
Cache::handleTimingReqHit
BaseCache::recvTimingReq
Cache::recvTimingReq
BaseCache::CpuSidePort::recvTimingReq
TimingRequestProtocol::sendReq
MasterPort::sendTimingReq
TimingSimpleCPU::sendFetch
TimingSimpleCPU::FetchTranslation::finish
ArmISA::TLB::translateComplete
ArmISA::TLB::translateTiming
ArmISA::TLB::translateTiming
TimingSimpleCPU::fetch
TimingSimpleCPU::advanceInst
TimingSimpleCPU::completeIfetch
TimingSimpleCPU::IcachePort::ITickEvent::process

By comparing this trace for the cache hit with the previous one for the cache miss, we see that BaseCache::recvTimingReq decides between either Cache::handleTimingReqHit or Cache::handleTimingReqMiss, and from there we see that the key function that decides if the block is present is BaseCache::access.
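
That control flow can be sketched as a toy Python model (not the real gem5 code, just the dispatch structure deduced from the two backtraces; for simplicity the fill is done immediately on the miss rather than on the later response):

```python
class ToyCache:
    """Minimal sketch of the BaseCache::recvTimingReq hit/miss dispatch."""

    def __init__(self, block_size=64):
        self.block_size = block_size
        self.blocks = set()  # base addresses of blocks currently present

    def access(self, addr):
        # Stand-in for BaseCache::access: decides if the block is present.
        return (addr & ~(self.block_size - 1)) in self.blocks

    def recv_timing_req(self, addr):
        if self.access(addr):
            return self.handle_timing_req_hit(addr)
        return self.handle_timing_req_miss(addr)

    def handle_timing_req_hit(self, addr):
        # Real gem5 schedules a response on the CPU-side port queue here.
        return "hit"

    def handle_timing_req_miss(self, addr):
        # Real gem5 allocates an MSHR entry and fills on recvTimingResp;
        # we fill immediately to keep the sketch short.
        self.blocks.add(addr & ~(self.block_size - 1))
        return "miss"

cache = ToyCache()
print(cache.recv_timing_req(0x78))  # miss, like trace line #7
print(cache.recv_timing_req(0x7c))  # hit, same block, like trace line #40
```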

We can see the access behaviour on the log lines, e.g.:

#7         0: Cache: system.cpu.icache: access for ReadReq [78:7b] IF miss
#40    78000: Cache: system.cpu.icache: access for ReadReq [7c:7f] IF hit state: 7 (E) valid: 1 writable: 1 readable: 1 dirty: 0 | tag: 0 set: 0x1 way: 0

which makes sense since from TimingSimpleCPU analysis #5 we know that the physical address of the initial instruction is 0x78, and 4 bytes are read for each instruction, so the second instruction access is at 0x7c.

The hit line also shows the precise cache state, E from the MOESI protocol: see What is the coherency protocol implemented by the classic cache system in gem5?.

The other log lines are also very clear, e.g. for the miss we see the following lines:

#10     1000: Cache: system.cpu.icache: sendMSHRQueuePacket: MSHR ReadReq [78:7b] IF
#12     1000: Cache: system.cpu.icache: createMissPacket: created ReadCleanReq [40:7f] IF from ReadReq [78:7b] IF
#32    77000: Cache: system.cpu.icache: recvTimingResp: Handling response ReadResp [40:7f] IF
#33    77000: Cache: system.cpu.icache: Block for addr 0x40 being updated in Cache
#34    77000: Cache: system.cpu.icache: Block addr 0x40 (ns) moving from state 0 to state: 7 (E) valid: 1 writable: 1 readable: 1 dirty: 0 | tag: 0 set: 0x1 way: 0

This shows us that the cache miss fills the cache line [40:7f], so we deduce that the cache block size is 0x40 == 64 bytes. The second access only barely hit, at the very last bytes of the block!
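
The block alignment arithmetic can be checked explicitly; a quick sketch using the 64-byte block size deduced above:

```python
BLOCK_SIZE = 0x40  # 64 bytes, deduced from the [40:7f] fill

def block_range(addr):
    """Return the (start, end) addresses of the block containing addr."""
    start = addr & ~(BLOCK_SIZE - 1)
    return start, start + BLOCK_SIZE - 1

print([hex(x) for x in block_range(0x78)])  # ['0x40', '0x7f']
# The second fetch at 0x7c reads [7c:7f]: the very last 4 bytes of the
# block, so the next fetch at 0x80 would miss again, into block [80:bf].
print([hex(x) for x in block_range(0x80)])  # ['0x80', '0xbf']
```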

It also informs us that the block moved to the E state (from the initial I) since a memory read was done.

We can confirm this with --trace DRAM which shows:

   1000: DRAM: system.mem_ctrls: recvTimingReq: request ReadCleanReq addr 64 size 64

Contrast this with the version without --caches seen at TimingSimpleCPU analysis #5, in which the DRAM only actually reads the 4 required bytes.

The only cryptic thing about the messages is the IF flag, but a good computer architect would have correctly guessed that it means "instruction fetch", and src/mem/packet.cc confirms:

void
Packet::print(std::ostream &o, const int verbosity,
              const std::string &prefix) const
{
    ccprintf(o, "%s%s [%x:%x]%s%s%s%s%s%s", prefix, cmdString(),
             getAddr(), getAddr() + getSize() - 1,
             req->isSecure() ? " (s)" : "",
             req->isInstFetch() ? " IF" : "",
             req->isUncacheable() ? " UC" : "",
             isExpressSnoop() ? " ES" : "",
             req->isToPOC() ? " PoC" : "",
             req->isToPOU() ? " PoU" : "");
}
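
The same suffix logic can be mimicked in a few lines of Python (a sketch of the C++ above, not gem5 code; only the two most common flags are reproduced):

```python
def packet_str(cmd, addr, size, inst_fetch=False, uncacheable=False):
    """Mimic Packet::print: '<cmd> [<start>:<end>] IF' style strings."""
    s = "%s [%x:%x]" % (cmd, addr, addr + size - 1)
    if inst_fetch:
        s += " IF"
    if uncacheable:
        s += " UC"
    return s

print(packet_str("ReadReq", 0x78, 4, inst_fetch=True))  # ReadReq [78:7b] IF
```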

Another interesting observation from running with --trace Cache,DRAM,XBar is that between the execution of the two instructions there is a Cache event, but no DRAM or XBar events:

  78000: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue    :   movz   x0, #1, #0        : IntAlu :  D=0x0000000000000001  flags=(IsInteger)
  78000: Cache: system.cpu.icache: access for ReadReq [7c:7f] IF hit state: 7 (E) valid: 1 writable: 1 readable: 1 dirty: 0 | tag: 0 set: 0x1 way: 0
  83000: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+4    :   adr   x1, #28            : IntAlu :  D=0x0000000000400098  flags=(IsInteger)

which is further consistent with the cache hit idea: no traffic goes down to the DRAM nor crossbar.

This block size parameter can be seen set in the gem5 config.ini file:

[system]
cache_line_size=64

so it is runtime configurable. The other key cache parameters can be seen further down in the config:

[system.cpu.dcache]
assoc=2
size=65536

[system.cpu.dcache.replacement_policy]
type=LRURP

[system.cpu.dcache.tags.indexing_policy]
type=SetAssociative

so we understand that by default the classic dcache is a 64 KiB, 2-way set associative cache with LRU replacement.
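
From the assoc, size and cache_line_size values we can derive the geometry; a quick sketch, assuming the standard set-associative decomposition (the trace lines above are for the icache, whose size may differ, but a small block address like 0x40 indexes the same way for any power-of-two set count):

```python
size = 65536  # [system.cpu.dcache] size
assoc = 2     # [system.cpu.dcache] assoc
line = 64     # [system] cache_line_size

num_sets = size // (assoc * line)
offset_bits = line.bit_length() - 1  # log2(64) == 6

def set_index(addr):
    # SetAssociative indexing: drop the block offset bits, mod the set count.
    return (addr >> offset_bits) % num_sets

print(num_sets)              # 512
print(hex(set_index(0x40)))  # 0x1, matching "set: 0x1" in the trace
```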

At:

#7         0: Cache: system.cpu.icache: access for ReadReq [78:7b] IF miss
#8         0: Event: system.cpu.icache.mem_side-MemSidePort.wrapped_function_event: EventFunctionWrapped 59 scheduled @ 1000
#9      1000: Event: system.cpu.icache.mem_side-MemSidePort.wrapped_function_event: EventFunctionWrapped 59 executed @ 1000
#10     1000: Cache: system.cpu.icache: sendMSHRQueuePacket: MSHR ReadReq [78:7b] IF
#12     1000: Cache: system.cpu.icache: createMissPacket: created ReadCleanReq [40:7f] IF from ReadReq [78:7b] IF

we can briefly see the gem5 MSHR doing its thing.

At time 0, the CPU icache wants to read, so it creates a packet that reads only the 4 bytes of the instruction ([78:7b]); that goes into the MSHR, to be treated in a future event.

At 1000, the future event executes: it reads the original packet from the MSHR and uses it to create a new full-block request [40:7f], which gets forwarded.
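
That two-step sequence can be modelled with a toy MSHR queue (a sketch of the behaviour deduced from the trace, not the real gem5 MSHR class):

```python
class ToyMSHR:
    """Holds the original CPU request until the mem-side send event fires."""

    def __init__(self, block_size=64):
        self.block_size = block_size
        self.pending = []  # (addr, size) of original requests

    def allocate(self, addr, size):
        # Time 0: the 4-byte ReadReq [78:7b] goes into the MSHR.
        self.pending.append((addr, size))

    def create_miss_packet(self):
        # Time 1000: the send event reads the MSHR entry and builds a
        # full-block ReadCleanReq covering [40:7f].
        addr, _size = self.pending.pop(0)
        start = addr & ~(self.block_size - 1)
        return ("ReadCleanReq", start, start + self.block_size - 1)

mshr = ToyMSHR()
mshr.allocate(0x78, 4)
print(mshr.create_miss_packet())  # ('ReadCleanReq', 64, 127), i.e. [40:7f]
```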