24.21.1.1. gem5 functional vs atomic vs timing memory requests

gem5 memory requests can be classified into the following broad categories:

  • functional: the data is moved immediately, with no notion of timing at all. Used for example for debug accesses and to load the executable into memory.

  • atomic: the data is also moved immediately within a single call chain, and an approximate latency is returned, but no events are scheduled.

  • timing: the most realistic mode: the request goes out, and the response comes back later through separately scheduled events.

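As a rough standalone mental model of these three styles (toy code, not gem5's API: ToyMemory, ToyEventQueue and friends are made up for illustration), the difference is in what comes back from a read and when:

#include <cstdint>
#include <cstdio>
#include <functional>
#include <vector>

// Toy event queue: "timing" accesses only produce their data when the
// scheduled response event is later serviced.
struct ToyEventQueue {
    std::vector<std::function<void()>> events;
    void schedule(std::function<void()> e) { events.push_back(std::move(e)); }
    void serviceAll() { for (auto &e : events) e(); events.clear(); }
};

struct ToyMemory {
    std::vector<uint8_t> bytes = std::vector<uint8_t>(1024, 0);

    // functional: move the data now, no notion of time at all.
    uint8_t readFunctional(uint64_t addr) { return bytes[addr]; }

    // atomic: move the data now, and also return an approximate latency.
    uint8_t readAtomic(uint64_t addr, uint64_t &latency) {
        latency = 100;
        return bytes[addr];
    }

    // timing: nothing comes back now; a response event delivers the data later.
    void readTiming(uint64_t addr, ToyEventQueue &eq,
                    std::function<void(uint8_t)> onResponse) {
        eq.schedule([this, addr, onResponse] { onResponse(bytes[addr]); });
    }
};

int main() {
    ToyEventQueue eq;
    ToyMemory mem;
    mem.bytes[8] = 42;

    printf("functional: %d\n", mem.readFunctional(8));

    uint64_t latency;
    uint8_t data = mem.readAtomic(8, latency);
    printf("atomic: %d, latency %llu\n", data, (unsigned long long)latency);

    mem.readTiming(8, eq, [](uint8_t d) { printf("timing: %d\n", d); });
    eq.serviceAll();  // only now does the timing read "complete"
}
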
This trichotomy can be notably seen in the definition of the MasterPort class:

class MasterPort : public Port, public AtomicRequestProtocol,
    public TimingRequestProtocol, public FunctionalRequestProtocol

and the base classes are defined under src/mem/protocol/.

Then, by reading the rest of the class, we see that the send methods are all boring, and just forward to some polymorphic receiver that does the actual interesting activity:

    Tick
    sendAtomicSnoop(PacketPtr pkt)
    {
        return AtomicResponseProtocol::sendSnoop(_masterPort, pkt);
    }

    Tick
    AtomicResponseProtocol::sendSnoop(AtomicRequestProtocol *peer, PacketPtr pkt)
    {
        assert(pkt->isRequest());
        return peer->recvAtomicSnoop(pkt);
    }

The receive methods are therefore the interesting ones, and must be overridden in derived classes if they ever expect to receive such requests:

    Tick
    recvAtomicSnoop(PacketPtr pkt) override
    {
        panic("%s was not expecting an atomic snoop request\n", name());
        return 0;
    }

    void
    recvFunctionalSnoop(PacketPtr pkt) override
    {
        panic("%s was not expecting a functional snoop request\n", name());
    }

    void
    recvTimingSnoopReq(PacketPtr pkt) override
    {
        panic("%s was not expecting a timing snoop request.\n", name());
    }

One question that comes up now is: why do CPUs need to care about snoop requests?

And one big answer is: to be able to implement LLSC atomicity as mentioned at: ARM LDXR and STXR instructions, since when other cores update memory, they could invalidate the lock of the current core.

Then, as you might expect, we can see that for example AtomicSimpleCPU does not override recvTimingSnoopReq.
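
To make those two points concrete, here is a minimal standalone sketch (toy code, not gem5's classes: ToySnoopReceiver, ToyAtomicCpuPort and the monitor handling are made up for illustration) of a port that overrides only the snoop receiver it actually expects to be called, and uses an incoming snoop to clear a local exclusive monitor when another core writes to the monitored address:

#include <cstdint>
#include <cstdio>
#include <cstdlib>

struct ToyPacket { uint64_t addr; bool write; };

// Analogous to the default receive methods above: panic unless overridden.
struct ToySnoopReceiver {
    virtual ~ToySnoopReceiver() = default;
    virtual uint64_t recvAtomicSnoop(const ToyPacket &) {
        fprintf(stderr, "was not expecting an atomic snoop request\n");
        abort();
    }
    virtual void recvTimingSnoopReq(const ToyPacket &) {
        fprintf(stderr, "was not expecting a timing snoop request\n");
        abort();
    }
};

// A toy "atomic CPU" port: it only expects atomic snoops, so it overrides only
// that method and inherits the panicking timing one, much like AtomicSimpleCPU
// not overriding recvTimingSnoopReq.
struct ToyAtomicCpuPort : ToySnoopReceiver {
    // Local LL/SC monitor: the address armed by a load-exclusive.
    bool monitorValid = false;
    uint64_t monitorAddr = 0;

    void loadExclusive(uint64_t addr) { monitorValid = true; monitorAddr = addr; }
    bool storeExclusive(uint64_t addr) { return monitorValid && monitorAddr == addr; }

    uint64_t recvAtomicSnoop(const ToyPacket &pkt) override {
        // Another core wrote to the monitored address: our store-exclusive must fail.
        if (pkt.write && monitorValid && pkt.addr == monitorAddr)
            monitorValid = false;
        return 0; // snoop latency
    }
};

int main() {
    ToyAtomicCpuPort cpu;
    cpu.loadExclusive(0x1000);
    cpu.recvAtomicSnoop({0x1000, /*write=*/true});  // another core stores there
    printf("store-exclusive %s\n", cpu.storeExclusive(0x1000) ? "succeeds" : "fails");
}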

Now let's see which requests are generated by an ordinary ARM LDR instruction. We run:

./run \
  --arch aarch64 \
  --debug-vm \
  --emulator gem5 \
  --gem5-build-type debug \
  --userland userland/arch/aarch64/freestanding/linux/hello.S

and then break at the methods of the LDR class LDRXL64_LIT, as described at: gem5 execute vs initiateAcc vs completeAcc.

Before starting, we of course guess that:

  • AtomicSimpleCPU will be making atomic accesses from execute

  • TimingSimpleCPU will be making timing accesses from initiateAcc, which must generate the event which leads to completeAcc

so let’s confirm it.

We break on ArmISAInst::LDRXL64_LIT::execute, which is what AtomicSimpleCPU uses, and that leads as expected to:

MasterPort::sendAtomic
AtomicSimpleCPU::sendPacket
AtomicSimpleCPU::readMem
SimpleExecContext::readMem
readMemAtomic<(ByteOrder)1, ExecContext, unsigned long>
readMemAtomicLE<ExecContext, unsigned long>
ArmISAInst::LDRXL64_LIT::execute
AtomicSimpleCPU::tick

Notably, AtomicSimpleCPU::readMem immediately translates the address, creates a packet, sends the atomic request, and gets the response back without any events.
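
As a rough sketch of that synchronous shape (toy code, not gem5's API: ToyAtomicCpu, ToyMemory and ToyPacket are made up for illustration), the translation, the packet and the response all happen within one call chain:

#include <cstdint>
#include <cstdio>
#include <vector>

struct ToyPacket { uint64_t paddr; uint64_t data = 0; };

struct ToyMemory {
    std::vector<uint64_t> words = std::vector<uint64_t>(256, 0);
    // Fills in the data and returns an approximate latency, all synchronously.
    uint64_t recvAtomic(ToyPacket &pkt) { pkt.data = words[pkt.paddr / 8]; return 100; }
};

struct ToyAtomicCpu {
    ToyMemory &mem;
    // Identity "translation", standing in for the synchronous TLB lookup.
    uint64_t translate(uint64_t vaddr) { return vaddr; }

    uint64_t readMem(uint64_t vaddr, uint64_t &latency) {
        uint64_t paddr = translate(vaddr);  // 1. translate immediately
        ToyPacket pkt{paddr};               // 2. create the packet
        latency = mem.recvAtomic(pkt);      // 3. "sendAtomic": synchronous round trip
        return pkt.data;                    // 4. data is already valid on return
    }
};

int main() {
    ToyMemory mem;
    mem.words[2] = 1234;
    ToyAtomicCpu cpu{mem};
    uint64_t latency;
    printf("read %llu in %llu ticks\n",
           (unsigned long long)cpu.readMem(16, latency),
           (unsigned long long)latency);
}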

And now, if we do the same with --cpu-type TimingSimpleCPU, break at ArmISAInst::LDRXL64_LIT::initiateAcc, and then add another breakpoint for the next event schedule with b EventManager::schedule (which we imagine is the memory read), we reach:

EventManager::schedule
DRAMCtrl::addToReadQueue
DRAMCtrl::recvTimingReq
DRAMCtrl::MemoryPort::recvTimingReq
TimingRequestProtocol::sendReq
MasterPort::sendTimingReq
CoherentXBar::recvTimingReq
CoherentXBar::CoherentXBarSlavePort::recvTimingReq
TimingRequestProtocol::sendReq
MasterPort::sendTimingReq
TimingSimpleCPU::handleReadPacket
TimingSimpleCPU::sendData
TimingSimpleCPU::finishTranslation
DataTranslation<TimingSimpleCPU*>::finish
ArmISA::TLB::translateComplete
ArmISA::TLB::translateTiming
ArmISA::TLB::translateTiming
TimingSimpleCPU::initiateMemRead
SimpleExecContext::initiateMemRead
initiateMemRead<ExecContext, unsigned long>
ArmISAInst::LDRXL64_LIT::initiateAcc
TimingSimpleCPU::completeIfetch
TimingSimpleCPU::IcachePort::ITickEvent::process
EventQueue::serviceOne

so as expected we have TimingRequestProtocol::sendReq.

Remember however that timing requests are a bit more complicated due to paging, since the page table walk can itself lead to further memory requests.
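
As a toy illustration of that fan-out (made-up names, a one-level page table, and no real event scheduling), a single load can reach memory twice, once for the page table entry and once for the data:

#include <cstdint>
#include <cstdio>
#include <unordered_map>
#include <vector>

struct ToyMemory {
    std::vector<uint64_t> words = std::vector<uint64_t>(1024, 0);
    int requests = 0;  // count every request that reaches "memory"
    uint64_t read(uint64_t paddr) { ++requests; return words[paddr / 8]; }
};

struct ToyMmu {
    ToyMemory &mem;
    uint64_t pageTableBase;                      // physical address of the toy page table
    std::unordered_map<uint64_t, uint64_t> tlb;  // vpn -> ppn

    uint64_t translate(uint64_t vaddr) {
        uint64_t vpn = vaddr >> 12;
        auto it = tlb.find(vpn);
        if (it == tlb.end()) {
            // TLB miss: the walker itself sends a memory request for the PTE.
            uint64_t ppn = mem.read(pageTableBase + 8 * vpn);
            it = tlb.emplace(vpn, ppn).first;
        }
        return (it->second << 12) | (vaddr & 0xfff);
    }
};

int main() {
    ToyMemory mem;
    ToyMmu mmu{mem, /*pageTableBase=*/512};
    mem.words[512 / 8 + 3] = 1;          // PTE: vpn 3 -> ppn 1
    mem.words[(1ull << 12) / 8] = 42;    // the data itself, at physical page 1

    uint64_t data = mem.read(mmu.translate(3 << 12));  // one load from the program
    printf("data=%llu after %d memory requests\n", (unsigned long long)data, mem.requests);
    // This first load caused 2 requests (pagewalk + data); a second load to the
    // same page would hit the TLB and cause only 1.
}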

In this particular instance, the address being read by the ldr x2, =len ARM LDR pseudo-instruction is likely placed just after the text section, so its translation is already in the TLB due to previous instruction fetches. This is why the translation finishes immediately and goes straight through TimingSimpleCPU::finishTranslation. Some key snippets are:

TLB::translateComplete(const RequestPtr &req, ThreadContext *tc,
        Translation *translation, Mode mode, TLB::ArmTranslationType tranType,
        bool callFromS2)
{
    bool delay = false;
    Fault fault;
    if (FullSystem)
        fault = translateFs(req, tc, mode, translation, delay, true, tranType);
    else
        fault = translateSe(req, tc, mode, translation, delay, true);
    if (!delay)
        translation->finish(fault, req, tc, mode);
    else
        translation->markDelayed();

and then translateSe does not use delay at all, so we learn that in syscall emulation delay is always false and things progress immediately there. Further down, TimingSimpleCPU::finishTranslation then does some more fault checking:

void
TimingSimpleCPU::finishTranslation(WholeTranslationState *state)
{
    if (state->getFault() != NoFault) {
        translationFault(state->getFault());
    } else {
        if (!state->isSplit) {
            sendData(state->mainReq, state->data, state->res,
                     state->mode == BaseTLB::Read);

Tested in gem5 b1623cb2087873f64197e503ab8894b5e4d4c7b4.