24.21.1.1. gem5 functional vs atomic vs timing memory requests
gem5 memory requests can be classified in the following broad categories:
- functional: get the value magically, do not update caches, see also: gem5 functional requests
- atomic: get the value now, without scheduling a separate event, but do not update caches. Cannot work in Ruby due to fundamental limitations, mentioned in passing at: https://gem5.atlassian.net/browse/GEM5-676
- timing: get the value simulating delays and updating caches
This trichotomy can notably be seen in the definition of the MasterPort class:

```cpp
class MasterPort : public Port,
                   public AtomicRequestProtocol,
                   public TimingRequestProtocol,
                   public FunctionalRequestProtocol
```

and the base classes are defined under src/mem/protocol/.
Then, by reading the rest of the class, we see that the send methods are all boring, and just forward to some polymorphic receiver that does the actual interesting work:

```cpp
Tick
sendAtomicSnoop(PacketPtr pkt)
{
    return AtomicResponseProtocol::sendSnoop(_masterPort, pkt);
}

Tick
AtomicResponseProtocol::sendSnoop(AtomicRequestProtocol *peer, PacketPtr pkt)
{
    assert(pkt->isRequest());
    return peer->recvAtomicSnoop(pkt);
}
```
The receive methods are therefore the interesting ones, and must be overridden in derived classes if they ever expect to receive such requests:

```cpp
Tick
recvAtomicSnoop(PacketPtr pkt) override
{
    panic("%s was not expecting an atomic snoop request\n", name());
    return 0;
}

void
recvFunctionalSnoop(PacketPtr pkt) override
{
    panic("%s was not expecting a functional snoop request\n", name());
}

void
recvTimingSnoopReq(PacketPtr pkt) override
{
    panic("%s was not expecting a timing snoop request.\n", name());
}
```
One question that comes up now is: why do CPUs need to care about snoop requests at all?
And one big answer is: to be able to implement LLSC atomicity, as mentioned at: ARM LDXR and STXR instructions, since when other cores update memory, they can invalidate the exclusive lock of the current core.
Then, as you might expect, we can see that for example AtomicSimpleCPU does not override recvTimingSnoopReq.
Now let's see which requests are generated by an ordinary ARM LDR instruction. We run:

```shell
./run \
  --arch aarch64 \
  --debug-vm \
  --emulator gem5 \
  --gem5-build-type debug \
  --userland userland/arch/aarch64/freestanding/linux/hello.S
```
and then break at the methods of the LDR class LDRXL64_LIT, see: gem5 execute vs initiateAcc vs completeAcc.
Before starting, we of course guess that:

- AtomicSimpleCPU will be making atomic accesses from execute
- TimingSimpleCPU will be making timing accesses from initiateAcc, which must generate the event which leads to completeAcc

so let's confirm it.
We break on ArmISAInst::LDRXL64_LIT::execute, which is what AtomicSimpleCPU uses, and that leads as expected to:

```
MasterPort::sendAtomic
AtomicSimpleCPU::sendPacket
AtomicSimpleCPU::readMem
SimpleExecContext::readMem
readMemAtomic<(ByteOrder)1, ExecContext, unsigned long>
readMemAtomicLE<ExecContext, unsigned long>
ArmISAInst::LDRXL64_LIT::execute
AtomicSimpleCPU::tick
```
Notably, AtomicSimpleCPU::readMem immediately translates the address, creates a packet, sends the atomic request, and gets the response back without any events.
And now, if we do the same with --cpu-type TimingSimpleCPU, break at ArmISAInst::LDRXL64_LIT::initiateAcc, and then add another breakpoint for the next event schedule with b EventManager::schedule (which we imagine is the memory read), we reach:
```
EventManager::schedule
DRAMCtrl::addToReadQueue
DRAMCtrl::recvTimingReq
DRAMCtrl::MemoryPort::recvTimingReq
TimingRequestProtocol::sendReq
MasterPort::sendTimingReq
CoherentXBar::recvTimingReq
CoherentXBar::CoherentXBarSlavePort::recvTimingReq
TimingRequestProtocol::sendReq
MasterPort::sendTimingReq
TimingSimpleCPU::handleReadPacket
TimingSimpleCPU::sendData
TimingSimpleCPU::finishTranslation
DataTranslation<TimingSimpleCPU*>::finish
ArmISA::TLB::translateComplete
ArmISA::TLB::translateTiming
ArmISA::TLB::translateTiming
TimingSimpleCPU::initiateMemRead
SimpleExecContext::initiateMemRead
initiateMemRead<ExecContext, unsigned long>
ArmISAInst::LDRXL64_LIT::initiateAcc
TimingSimpleCPU::completeIfetch
TimingSimpleCPU::IcachePort::ITickEvent::process
EventQueue::serviceOne
```
so as expected we have TimingRequestProtocol::sendReq.
Remember however that timing requests are a bit more complicated due to paging, since the page table walk can itself lead to further memory requests.
In this particular instance, the address being read by the ldr x2, =len ARM LDR pseudo-instruction is likely placed just after the text section, so its translation is already in the TLB due to previous instruction fetches; this is why the translation finishes immediately, going straight through TimingSimpleCPU::finishTranslation. Some key snippets are:
```cpp
TLB::translateComplete(const RequestPtr &req, ThreadContext *tc,
    Translation *translation, Mode mode, TLB::ArmTranslationType tranType,
    bool callFromS2)
{
    bool delay = false;
    Fault fault;
    if (FullSystem)
        fault = translateFs(req, tc, mode, translation, delay, true, tranType);
    else
        fault = translateSe(req, tc, mode, translation, delay, true);
    if (!delay)
        translation->finish(fault, req, tc, mode);
    else
        translation->markDelayed();
```
and then translateSe does not use delay at all, so we learn that in syscall emulation delay is always false and things progress immediately there. And then further down, TimingSimpleCPU::finishTranslation does some more fault checking:
```cpp
void
TimingSimpleCPU::finishTranslation(WholeTranslationState *state)
{
    if (state->getFault() != NoFault) {
        translationFault(state->getFault());
    } else {
        if (!state->isSplit) {
            sendData(state->mainReq, state->data, state->res,
                     state->mode == BaseTLB::Read);
```
Tested in gem5 b1623cb2087873f64197e503ab8894b5e4d4c7b4.