24.22.4.3. gem5 event queue TimingSimpleCPU syscall emulation freestanding example analysis with caches
Let’s just add --caches
to gem5 event queue TimingSimpleCPU syscall emulation freestanding example analysis to see if things go any faster, and add Cache
to --trace
as in:
--trace Cache,Event,ExecAll,-ExecSymbol,FmtFlag
The resulting trace is:
#0 0: Event: system.cpu.wrapped_function_event: EventFunctionWrapped 43 scheduled @ 0
#2 0: Event: system.mem_ctrls_0.wrapped_function_event: EventFunctionWrapped 14 scheduled @ 7786250
#3 0: Event: system.mem_ctrls_1.wrapped_function_event: EventFunctionWrapped 20 scheduled @ 7786250
#4 0: Event: Event_84: generic 84 scheduled @ 0
#5 0: Event: Event_84: generic 84 rescheduled @ 18446744073709551615
#6 0: Event: system.cpu.wrapped_function_event: EventFunctionWrapped 43 executed @ 0
#7 0: Cache: system.cpu.icache: access for ReadReq [78:7b] IF miss
#8 0: Event: system.cpu.icache.mem_side-MemSidePort.wrapped_function_event: EventFunctionWrapped 59 scheduled @ 1000
#9 1000: Event: system.cpu.icache.mem_side-MemSidePort.wrapped_function_event: EventFunctionWrapped 59 executed @ 1000
#10 1000: Cache: system.cpu.icache: sendMSHRQueuePacket: MSHR ReadReq [78:7b] IF
#12 1000: Cache: system.cpu.icache: createMissPacket: created ReadCleanReq [40:7f] IF from ReadReq [78:7b] IF
#13 1000: Event: system.mem_ctrls.wrapped_function_event: EventFunctionWrapped 9 scheduled @ 1000
#14 1000: Event: system.membus.reqLayer0.wrapped_function_event: EventFunctionWrapped 70 scheduled @ 2000
#15 1000: Event: system.mem_ctrls.wrapped_function_event: EventFunctionWrapped 9 executed @ 1000
#16 1000: Event: system.mem_ctrls_0.wrapped_function_event: EventFunctionWrapped 12 scheduled @ 1000
#17 1000: Event: system.mem_ctrls.wrapped_function_event: EventFunctionWrapped 10 scheduled @ 46250
#18 1000: Event: system.mem_ctrls.wrapped_function_event: EventFunctionWrapped 9 scheduled @ 5000
#19 1000: Event: system.mem_ctrls_0.wrapped_function_event: EventFunctionWrapped 12 executed @ 1000
#20 1000: Event: system.mem_ctrls_0.wrapped_function_event: EventFunctionWrapped 15 scheduled @ 1000
#22 1000: Event: system.mem_ctrls_0.wrapped_function_event: EventFunctionWrapped 15 executed @ 1000
#23 2000: Event: system.membus.reqLayer0.wrapped_function_event: EventFunctionWrapped 70 executed @ 2000
#24 5000: Event: system.mem_ctrls.wrapped_function_event: EventFunctionWrapped 9 executed @ 5000
#25 46250: Event: system.mem_ctrls.wrapped_function_event: EventFunctionWrapped 10 executed @ 46250
#26 46250: Event: system.mem_ctrls.port-RespPacketQueue.wrapped_function_event: EventFunctionWrapped 8 scheduled @ 74250
#27 74250: Event: system.mem_ctrls.port-RespPacketQueue.wrapped_function_event: EventFunctionWrapped 8 executed @ 74250
#28 74250: Event: system.membus.slave[1]-RespPacketQueue.wrapped_function_event: EventFunctionWrapped 74 scheduled @ 77000
#29 74250: Event: system.membus.respLayer1.wrapped_function_event: EventFunctionWrapped 75 scheduled @ 80000
#30 77000: Event: system.membus.slave[1]-RespPacketQueue.wrapped_function_event: EventFunctionWrapped 74 executed @ 77000
#32 77000: Cache: system.cpu.icache: recvTimingResp: Handling response ReadResp [40:7f] IF
#33 77000: Cache: system.cpu.icache: Block for addr 0x40 being updated in Cache
#34 77000: Cache: system.cpu.icache: Block addr 0x40 (ns) moving from state 0 to state: 7 (E) valid: 1 writable: 1 readable: 1 dirty: 0 | tag: 0 set: 0x1 way: 0
#35 77000: Event: system.cpu.icache.cpu_side-CpuSidePort.wrapped_function_event: EventFunctionWrapped 57 scheduled @ 78000
#36 78000: Event: system.cpu.icache.cpu_side-CpuSidePort.wrapped_function_event: EventFunctionWrapped 57 executed @ 78000
#37 78000: Event: Event_40: Timing CPU icache tick 40 scheduled @ 78000
#38 78000: Event: Event_40: Timing CPU icache tick 40 executed @ 78000
#39 78000: ExecEnable: system.cpu: A0 T0 : 0x400078 : movz x0, #1, #0 : IntAlu : D=0x0000000000000001 flags=(IsInteger)
#40 78000: Cache: system.cpu.icache: access for ReadReq [7c:7f] IF hit state: 7 (E) valid: 1 writable: 1 readable: 1 dirty: 0 | tag: 0 set: 0x1 way: 0
#42 78000: Event: system.cpu.icache.cpu_side-CpuSidePort.wrapped_function_event: EventFunctionWrapped 57 scheduled @ 83000
#43 80000: Event: system.membus.respLayer1.wrapped_function_event: EventFunctionWrapped 75 executed @ 80000
#44 83000: Event: system.cpu.icache.cpu_side-CpuSidePort.wrapped_function_event: EventFunctionWrapped 57 executed @ 83000
#45 83000: Event: Event_40: Timing CPU icache tick 40 scheduled @ 83000
#46 83000: Event: Event_40: Timing CPU icache tick 40 executed @ 83000
#47 83000: ExecEnable: system.cpu: A0 T0 : 0x40007c : adr x1, #28 : IntAlu : D=0x0000000000400098 flags=(IsInteger)
#48 83000: Event: system.cpu.icache.mem_side-MemSidePort.wrapped_function_event: EventFunctionWrapped 59 scheduled @ 84000
[...]
191000: Event: Event_85: generic 85 scheduled @ 191000
191000: Event: Event_85: generic 85 executed @ 191000
So yes, --caches
does work here, leading to a runtime of 191000 ticks rather than the 469000 ticks seen without caches!
Notably, we now see that very little time passes between the first and second instructions, which are marked with ExecEnable
in #39 and #47. Presumably this is because, rather than going all the way out to the DRAM system, the event chain stops right at the icache.cpu_side
when a hit happens, which must have been the case for the second instruction, since it is adjacent to the first one.
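As a sanity check on those numbers, here is a quick back-of-the-envelope calculation in plain Python (nothing gem5-specific, just re-deriving the cost of the two fetches from the ExecEnable ticks quoted above):

```python
# Ticks taken from the ExecEnable lines in the trace above.
first_insn_tick = 78_000   # #39: first instruction, paid for the icache miss
second_insn_tick = 83_000  # #47: second instruction, served by an icache hit

miss_cost = first_insn_tick - 0                 # fetch issued at tick 0
hit_cost = second_insn_tick - first_insn_tick

print(miss_cost, hit_cost)       # 78000 5000
print(miss_cost // hit_cost)     # the miss path costs 15x more ticks
```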
It is also interesting to look into the generated config.dot.svg
and compare it to the one without caches: Figure 3, “config.dot.svg
for a TimingSimpleCPU without caches.”. With caches: Figure 4, “config.dot.svg
for a TimingSimpleCPU with caches.”.
We can see from there, that we now have icache
and dcache
elements inside the CPU block, and that the CPU icache
and dcache
ports go through the caches to the SystemXBar
rather than being directly connected as before.
It is worth noting that the caches do not affect the ArmITB
and ArmDTB
TLBs, since those are already caches themselves.
Figure 4. config.dot.svg for a TimingSimpleCPU with caches.

We can break down the events between the instructions as follows.
First, based on TimingSimpleCPU analysis #5, we b TimingSimpleCPU::fetch
to see how the initial, magically scheduled fetch, and the necessary cache miss, work:
EventManager::schedule
PacketQueue::schedSendEvent
BaseCache::CacheMasterPort::schedSendEvent
BaseCache::schedMemSideSendEvent
BaseCache::allocateMissBuffer
BaseCache::handleTimingReqMiss
Cache::handleTimingReqMiss
BaseCache::recvTimingReq
Cache::recvTimingReq
BaseCache::CpuSidePort::recvTimingReq
TimingRequestProtocol::sendReq
MasterPort::sendTimingReq
TimingSimpleCPU::sendFetch
TimingSimpleCPU::FetchTranslation::finish
ArmISA::TLB::translateComplete
ArmISA::TLB::translateTiming
ArmISA::TLB::translateTiming
TimingSimpleCPU::fetch
By comparing this to the uncached access at TimingSimpleCPU analysis #25, we see that this one does not reach the CoherentXBar
at all: the cache must be scheduling an event in the future to model a delay between the cache request and XBar communication.
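That delayed-send idea can be illustrated with a toy discrete event queue. This is only a sketch of the general mechanism under our reading of the trace, not gem5's actual C++ API: the point is that the miss handler does not call into the crossbar synchronously, but schedules a callback one cycle (1000 ticks in this trace) in the future:

```python
import heapq

class ToyEventQueue:
    """Minimal event queue: pop events in tick order and run them."""
    def __init__(self):
        self.now = 0
        self._q = []
        self._seq = 0  # tie-breaker so callbacks are never compared

    def schedule(self, when, callback):
        heapq.heappush(self._q, (when, self._seq, callback))
        self._seq += 1

    def run(self):
        while self._q:
            self.now, _, cb = heapq.heappop(self._q)
            cb()

log = []
eq = ToyEventQueue()

def cache_recv_timing_req():
    # Miss: instead of talking to the crossbar right away, schedule the
    # mem-side send one cycle later, like the MemSidePort event that is
    # scheduled @ 1000 in the trace above.
    log.append((eq.now, 'icache miss'))
    eq.schedule(eq.now + 1000, xbar_recv_timing_req)

def xbar_recv_timing_req():
    log.append((eq.now, 'xbar sees ReadCleanReq'))

eq.schedule(0, cache_recv_timing_req)
eq.run()
print(log)  # [(0, 'icache miss'), (1000, 'xbar sees ReadCleanReq')]
```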
A quick look at the source tree shows that the non-Ruby caches, such as the ones in this example, are located under:
src/mem/cache
with the following simple class hierarchy:
- BaseCache
  - Cache
  - NoncoherentCache
Next, we fast forward to #39
with b TimingSimpleCPU::IcachePort::ITickEvent::process
which, as we know from previous sections, is the event that executes instructions, and therefore leaves us at the start of the second instruction.
Then, we b EventManager::schedule
to see what that schedules:
EventManager::schedule
PacketQueue::schedSendEvent
PacketQueue::schedSendTiming
QueuedSlavePort::schedTimingResp
BaseCache::handleTimingReqHit
Cache::handleTimingReqHit
BaseCache::recvTimingReq
Cache::recvTimingReq
BaseCache::CpuSidePort::recvTimingReq
TimingRequestProtocol::sendReq
MasterPort::sendTimingReq
TimingSimpleCPU::sendFetch
TimingSimpleCPU::FetchTranslation::finish
ArmISA::TLB::translateComplete
ArmISA::TLB::translateTiming
ArmISA::TLB::translateTiming
TimingSimpleCPU::fetch
TimingSimpleCPU::advanceInst
TimingSimpleCPU::completeIfetch
TimingSimpleCPU::IcachePort::ITickEvent::process
By comparing the trace of this cache hit with that of the previous cache miss, we see that BaseCache::recvTimingReq
decides between Cache::handleTimingReqHit
and Cache::handleTimingReqMiss
, and from there we see that the key function deciding if the block is present is BaseCache::access
.
We can see this access behaviour in the log lines, e.g.:
#7 0: Cache: system.cpu.icache: access for ReadReq [78:7b] IF miss
#40 78000: Cache: system.cpu.icache: access for ReadReq [7c:7f] IF hit state: 7 (E) valid: 1 writable: 1 readable: 1 dirty: 0 | tag: 0 set: 0x1 way: 0
which makes sense since from TimingSimpleCPU analysis #5 we know that the physical address of the initial instruction is 0x78, and 4 bytes are read for each instruction, so the second instruction access is at 0x7c.
The hit line also shows the precise cache state E
from the MOESI protocol: What is the coherency protocol implemented by the classic cache system in gem5?.
The other log lines are also very clear, e.g. for the miss we see the following lines:
#10 1000: Cache: system.cpu.icache: sendMSHRQueuePacket: MSHR ReadReq [78:7b] IF
#12 1000: Cache: system.cpu.icache: createMissPacket: created ReadCleanReq [40:7f] IF from ReadReq [78:7b] IF
#32 77000: Cache: system.cpu.icache: recvTimingResp: Handling response ReadResp [40:7f] IF
#33 77000: Cache: system.cpu.icache: Block for addr 0x40 being updated in Cache
#34 77000: Cache: system.cpu.icache: Block addr 0x40 (ns) moving from state 0 to state: 7 (E) valid: 1 writable: 1 readable: 1 dirty: 0 | tag: 0 set: 0x1 way: 0
This shows us that the cache miss fills the cache line [40:7f], from which we deduce that the cache block size is 0x40 == 64 bytes. The second access only barely hits, at the last 4 bytes of the block!
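That alignment arithmetic can be checked with a few lines of plain Python. The mask logic is the standard power-of-two alignment trick, not gem5 code:

```python
BLOCK_SIZE = 0x40  # 64 bytes, as deduced from the [40:7f] fill above

def block_range(addr):
    # Align the access down to the base of its containing cache block.
    base = addr & ~(BLOCK_SIZE - 1)
    return base, base + BLOCK_SIZE - 1

assert block_range(0x78) == (0x40, 0x7f)  # first fetch: miss fills [40:7f]
assert block_range(0x7c) == (0x40, 0x7f)  # second fetch: same block, so a hit
# 0x7c:0x7f are the last 4 bytes of the block: a fetch at 0x80 would
# fall into the next block and miss again.
assert block_range(0x80) == (0x80, 0xbf)
```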
It also informs us that the cache moved to E
(from the initial I
) state since a memory read was done.
We can confirm this with --trace DRAM
which shows:
1000: DRAM: system.mem_ctrls: recvTimingReq: request ReadCleanReq addr 64 size 64
Contrast this with the non --caches
version seen at TimingSimpleCPU analysis #5, in which the DRAM only actually reads the 4 required bytes.
The only cryptic thing about the messages is the IF
flag, but a good computer architect would correctly guess that it stands for "instruction fetch", and src/mem/packet.cc confirms:
void
Packet::print(std::ostream &o, const int verbosity,
              const std::string &prefix) const
{
    ccprintf(o, "%s%s [%x:%x]%s%s%s%s%s%s", prefix, cmdString(),
             getAddr(), getAddr() + getSize() - 1,
             req->isSecure() ? " (s)" : "",
             req->isInstFetch() ? " IF" : "",
             req->isUncacheable() ? " UC" : "",
             isExpressSnoop() ? " ES" : "",
             req->isToPOC() ? " PoC" : "",
             req->isToPOU() ? " PoU" : "");
}
Another interesting observation when running with --trace Cache,DRAM,XBar
is that between the execution of both instructions, there is a Cache
event, but no DRAM
or XBar
events:
78000: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue : movz x0, #1, #0 : IntAlu : D=0x0000000000000001 flags=(IsInteger)
78000: Cache: system.cpu.icache: access for ReadReq [7c:7f] IF hit state: 7 (E) valid: 1 writable: 1 readable: 1 dirty: 0 | tag: 0 set: 0x1 way: 0
83000: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+4 : adr x1, #28 : IntAlu : D=0x0000000000400098 flags=(IsInteger)
which is further consistent with the cache hit idea: no traffic goes down to the DRAM nor crossbar.
This block size parameter can be seen set in the gem5 config.ini file:
[system]
cache_line_size=64
so it is runtime configurable. The other key cache parameters can be seen further down in the config:
[system.cpu.dcache]
assoc=2
size=65536

[system.cpu.dcache.replacement_policy]
type=LRURP

[system.cpu.dcache.tags.indexing_policy]
type=SetAssociative
so we understand that by default the classic cache:
- is 2-way set associative: https://en.wikipedia.org/wiki/CPU_cache#Two-way_set_associative_cache
- has 64KiB total size
- uses the LRURP replacement policy. LRU is a well known policy, and "LRURP" seems to simply stand for "LRU Replacement Policy". Other policies can be seen under: src/mem/cache/replacement_policies/
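Assuming the icache uses the same geometry as the dcache shown above (64KiB, 2-way, 64-byte blocks; an assumption, since the config.ini excerpt here only covers the dcache), a few lines of plain Python reproduce the "tag: 0 set: 0x1" printed in the trace for the fetch at physical address 0x78:

```python
size, assoc, block = 65536, 2, 64   # values from the config.ini above
num_sets = size // (assoc * block)  # bytes per way / block size
block_bits = block.bit_length() - 1     # 6 offset bits
set_bits = num_sets.bit_length() - 1    # 9 index bits

def set_and_tag(addr):
    # Standard set-associative indexing: strip the offset bits, take the
    # low bits as the set index, and the rest as the tag.
    return (addr >> block_bits) % num_sets, addr >> (block_bits + set_bits)

print(num_sets)            # 512
print(set_and_tag(0x78))   # (1, 0): matches "tag: 0 set: 0x1" in the trace
```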
At:
#7 0: Cache: system.cpu.icache: access for ReadReq [78:7b] IF miss
#8 0: Event: system.cpu.icache.mem_side-MemSidePort.wrapped_function_event: EventFunctionWrapped 59 scheduled @ 1000
#9 1000: Event: system.cpu.icache.mem_side-MemSidePort.wrapped_function_event: EventFunctionWrapped 59 executed @ 1000
#10 1000: Cache: system.cpu.icache: sendMSHRQueuePacket: MSHR ReadReq [78:7b] IF
#12 1000: Cache: system.cpu.icache: createMissPacket: created ReadCleanReq [40:7f] IF from ReadReq [78:7b] IF
we can briefly see the gem5 MSHR
doing its thing.
At time 0, the CPU icache wants to read, so it creates a packet that reads 4 bytes only ([78:7b]
) for the instruction, and that goes into the MSHR, to be treated in a future event.
At 1000, the future event is executed: it reads the original packet from the MSHR and uses it to create a new full-block request [40:7f]
, which gets forwarded down the memory hierarchy.
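This two-step flow can be summarized with a toy Python sketch (hypothetical names, not gem5's actual MSHR API): misses are parked keyed by block base address, and a later event widens the parked 4-byte request into a full-block fill, like createMissPacket does:

```python
BLOCK_SIZE = 64

mshrs = {}  # block base address -> list of pending (addr, size) requests

def handle_miss(addr, size):
    # Time 0: park the original narrow request in the MSHR for its block.
    base = addr & ~(BLOCK_SIZE - 1)
    mshrs.setdefault(base, []).append((addr, size))
    return base

def create_miss_packet(base):
    # Time 1000: like createMissPacket, widen to the whole block.
    return (base, base + BLOCK_SIZE - 1)

base = handle_miss(0x78, 4)            # ReadReq [78:7b] goes into the MSHR
print(hex(base))                       # 0x40
print(create_miss_packet(base))        # (64, 127), i.e. ReadCleanReq [40:7f]
```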