27.3.4.1. atomic.cpp
C version at: atomic.c.
In this set of examples, we exemplify various synchronization mechanisms, including assembly specific ones, by using the convenience of C++ multithreading:

- userland/cpp/atomic/main.hpp: contains all the code, which is then specialized in separate .cpp files with macros
- userland/cpp/atomic/aarch64_add.cpp: non synchronized aarch64 inline assembly
- userland/cpp/atomic/aarch64_ldaxr_stlxr.cpp: synchronized aarch64 inline assembly with an LDAXR/STLXR retry loop, see: ARM LDXR and STXR instructions. A sketch of that retry loop is shown right after this list.
- userland/cpp/atomic/aarch64_ldadd.cpp: synchronized aarch64 inline assembly with the ARM Large System Extensions (LSE) LDADD instruction
- userland/cpp/atomic/fail.cpp: non synchronized C++ operator ++
- userland/cpp/atomic/mutex.cpp: synchronized with std::mutex
- userland/cpp/atomic/std_atomic.cpp: synchronized with std::atomic_ulong
- userland/cpp/atomic/x86_64_inc.cpp: non synchronized x86_64 inline assembly
- userland/cpp/atomic/x86_64_lock_inc.cpp: synchronized x86_64 inline assembly with the x86 LOCK prefix
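The LDAXR/STLXR pattern mentioned above is, roughly, the following retry loop. This is only a minimal sketch assuming GCC-style inline assembly on an AArch64 target; the real aarch64_ldaxr_stlxr.cpp may be organized differently:

// Sketch of an atomic increment with an LDAXR/STLXR retry loop
// (AArch64 only, GCC/Clang inline assembly); not the actual file contents.
void atomic_inc_ldaxr_stlxr(unsigned long *addr) {
    unsigned long tmp;
    unsigned int fail;
    __asm__ __volatile__(
        "1:\n"
        "\tldaxr %[tmp], [%[addr]]\n"           // load-acquire exclusive
        "\tadd %[tmp], %[tmp], #1\n"            // increment in a register
        "\tstlxr %w[fail], %[tmp], [%[addr]]\n" // store-release exclusive, %w[fail] == 0 on success
        "\tcbnz %w[fail], 1b\n"                 // retry if another core broke the exclusive access
        : [tmp] "=&r" (tmp), [fail] "=&r" (fail)
        : [addr] "r" (addr)
        : "memory"
    );
}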
All examples do exactly the same thing: spawn N threads and loop M times in each thread, incrementing a global integer.
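For concreteness, the shared harness is roughly of the following shape. This is only a sketch of the common structure just described, not the actual contents of main.hpp:

// Sketch of the shared harness: argv[1] = number of threads,
// argv[2] = iterations per thread; each .cpp swaps in its own increment.
#include <cstdlib>
#include <iostream>
#include <thread>
#include <vector>

unsigned long global = 0;
unsigned long niters;

void threadMain() {
    for (unsigned long i = 0; i < niters; ++i)
        global++; // the non synchronized "fail" increment
}

int main(int argc, char **argv) {
    unsigned long nthreads = argc > 1 ? std::strtoul(argv[1], NULL, 0) : 2;
    niters = argc > 2 ? std::strtoul(argv[2], NULL, 0) : 10000;
    std::vector<std::thread> threads;
    for (unsigned long i = 0; i < nthreads; ++i)
        threads.emplace_back(threadMain);
    for (auto &t : threads)
        t.join();
    std::cout << "expect " << nthreads * niters << " global " << global << std::endl;
}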
For large enough inputs, the non-synchronized examples are extremely likely to produce "wrong" results. For example, on a 2017 Lenovo ThinkPad P51 running Ubuntu 19.10 natively, with 2 threads and 10000 loops:
./fail.out 2 10000
we could get an output such as:
expect 20000 global 12676
The actual value is much smaller, because the threads have often overwritten one another with older values.
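The root cause is that the increment is not atomic: it is a read-modify-write sequence. Conceptually (this is only an illustration, not the exact generated code), each non-synchronized increment does:

// Conceptual expansion of the non-atomic increment of global:
unsigned long tmp = global; // two threads may both read the same old value here
tmp = tmp + 1;              // both then compute the same incremented value
global = tmp;               // the later store overwrites the earlier one: an increment is lost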
With --optimization-level 3, the result almost always equals that of a single thread, e.g.:

./build --optimization-level 3 --force-rebuild fail.cpp
./fail.out 4 1000000

usually gives:
expect 4000000 global 1000000
This is because now, instead of the horribly inefficient -O0 assembly that reads global from memory on every iteration, the code:

- reads global into a register
- increments the register
- at the end, writes the resulting value of each thread back to memory, so the threads' final stores simply overwrite one another

(a C++ sketch of this equivalent code appears after the -O3 disassembly below)

The -O0 code therefore mixes things up much more, because it reads and writes back to memory many, many times.
This can be easily seen from the disassembly with:
gdb -batch -ex "disassemble threadMain" fail.out
which gives for -O0:
   0x0000000000402656 <+0>:     endbr64
   0x000000000040265a <+4>:     push   %rbp
   0x000000000040265b <+5>:     mov    %rsp,%rbp
   0x000000000040265e <+8>:     movq   $0x0,-0x8(%rbp)
   0x0000000000402666 <+16>:    mov    0x5c2b(%rip),%rax        # 0x408298 <niters>
   0x000000000040266d <+23>:    cmp    %rax,-0x8(%rbp)
   0x0000000000402671 <+27>:    jae    0x40269b <threadMain()+69>
   0x0000000000402673 <+29>:    mov    0x5c26(%rip),%rdx        # 0x4082a0 <global>
   0x000000000040267a <+36>:    mov    -0x8(%rbp),%rax
   0x000000000040267e <+40>:    mov    %rax,-0x8(%rbp)
   0x0000000000402682 <+44>:    mov    0x5c17(%rip),%rax        # 0x4082a0 <global>
   0x0000000000402689 <+51>:    add    $0x1,%rax
   0x000000000040268d <+55>:    mov    %rax,0x5c0c(%rip)        # 0x4082a0 <global>
   0x0000000000402694 <+62>:    addq   $0x1,-0x8(%rbp)
   0x0000000000402699 <+67>:    jmp    0x402666 <threadMain()+16>
   0x000000000040269b <+69>:    nop
   0x000000000040269c <+70>:    pop    %rbp
   0x000000000040269d <+71>:    retq
and for -O3:
   0x00000000004017f0 <+0>:     endbr64
   0x00000000004017f4 <+4>:     mov    0x2a25(%rip),%rcx        # 0x404220 <niters>
   0x00000000004017fb <+11>:    test   %rcx,%rcx
   0x00000000004017fe <+14>:    je     0x401824 <threadMain()+52>
   0x0000000000401800 <+16>:    mov    0x2a11(%rip),%rdx        # 0x404218 <global>
   0x0000000000401807 <+23>:    xor    %eax,%eax
   0x0000000000401809 <+25>:    nopl   0x0(%rax)
   0x0000000000401810 <+32>:    add    $0x1,%rax
   0x0000000000401814 <+36>:    add    $0x1,%rdx
   0x0000000000401818 <+40>:    cmp    %rcx,%rax
   0x000000000040181b <+43>:    jb     0x401810 <threadMain()+32>
   0x000000000040181d <+45>:    mov    %rdx,0x29f4(%rip)        # 0x404218 <global>
   0x0000000000401824 <+52>:    retq
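In C++ terms, this -O3 loop behaves roughly like the following (a sketch of the equivalent code, assuming the global and niters variables from the harness sketched above; it is not the actual source or generated assembly):

// Sketch of what the -O3 codegen of threadMain is equivalent to:
void threadMain() {
    unsigned long local = global; // single read from memory
    for (unsigned long i = 0; i < niters; ++i)
        local++;                  // register-only increments
    global = local;               // single write back at the end
}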
We can now look into how std::atomic is implemented. In -O3, the disassembly is:
   0x0000000000401770 <+0>:     endbr64
   0x0000000000401774 <+4>:     cmpq   $0x0,0x297c(%rip)        # 0x4040f8 <niters>
   0x000000000040177c <+12>:    je     0x401796 <threadMain()+38>
   0x000000000040177e <+14>:    xor    %eax,%eax
   0x0000000000401780 <+16>:    lock addq $0x1,0x2967(%rip)     # 0x4040f0 <global>
   0x0000000000401789 <+25>:    add    $0x1,%rax
   0x000000000040178d <+29>:    cmp    %rax,0x2964(%rip)        # 0x4040f8 <niters>
   0x0000000000401794 <+36>:    ja     0x401780 <threadMain()+16>
   0x0000000000401796 <+38>:    retq
so we clearly see that a lock addq is used to do an atomic read-modify-write to memory every single time, just like in our other example userland/cpp/atomic/x86_64_lock_inc.cpp; a comparable inline assembly sketch is shown below.
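The contents of x86_64_lock_inc.cpp are not reproduced here; a minimal sketch of the same idea in GCC inline assembly (x86_64 only, details may differ from the real file) is:

// Sketch of a LOCK-prefixed increment with GCC inline assembly (x86_64 only);
// the real x86_64_lock_inc.cpp may differ. Increments the same global
// counter from the harness sketched above.
void threadMain() {
    for (unsigned long i = 0; i < niters; ++i) {
        __asm__ __volatile__(
            "lock incq %0" // the LOCK prefix makes the read-modify-write atomic
            : "+m" (global)
            :
            : "memory"
        );
    }
}

The std_atomic.cpp variant simply uses a std::atomic_ulong counter, whose increment compiles down to the lock addq seen in the disassembly above.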
This setup can also be used to benchmark different synchronization mechanisms. For example, std::mutex was about 1.5x slower with two cores than std::atomic, presumably because it relies on the futex system call, as can be seen from strace -f -s999 -v logs, while std::atomic uses just userland instructions: https://www.quora.com/How-does-std-atomic-work-in-C++11/answer/Ciro-Santilli

Tested in -O3 with:
time ./std_atomic.out 4 100000000
time ./mutex.out 4 100000000
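For reference, the mutex-based threadMain is roughly of the following shape (a sketch assuming the same harness as above; the real mutex.cpp may differ):

// Sketch of a std::mutex protected increment; the real mutex.cpp may differ.
#include <mutex>

std::mutex global_mutex;

void threadMain() {
    for (unsigned long i = 0; i < niters; ++i) {
        std::lock_guard<std::mutex> lock(global_mutex); // lock/unlock around every increment
        global++;
    }
}

Each iteration now pays for a lock and unlock, which stays in userland while uncontended but falls back to the futex system call under contention, which helps explain the slowdown relative to std::atomic.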
Related examples:

- POSIX pthread_mutex
- C11 userland/c/atomic.c documented at C multithreading
Bibliography:
- https://stackoverflow.com/questions/31978324/what-exactly-is-stdatomic/58904448#58904448 "What exactly is std::atomic?"