Introductory analysis of a simple example of the Executable and Linkable Format (ELF).
Extracted from this Stack Overflow answer.
ELF is the dominant file format for Linux. It competes with Mach-O for OS X and PE for Windows.
ELF supersedes .coff, which supersedes a.out.
ELF is specified by the LSB:
The LSB basically links to other standards with minor extensions, in particular:
- Generic (both by SCO):
- System V ABI 4.1 (1997) http://www.sco.com/developers/devspecs/gabi41.pdf, no 64 bit, although a magic number is reserved for it. Same for core files. _This_ is the first document you should look at when searching for information.
- System V ABI Update DRAFT 17 (2003) http://www.sco.com/developers/gabi/2003-12-17/contents.html, adds 64 bit. Only updates chapters 4 and 5 of the previous document: the others remain valid and are still referenced.
- Architecture specific (by the processor vendor):
- IA-32: https://refspecs.linuxfoundation.org/LSB_4.1.0/LSB-Core-IA32/LSB-Core-IA32/elf-ia32.html, points mostly to http://www.sco.com/developers/devspecs/abi386-4.pdf
- AMD64: https://refspecs.linuxfoundation.org/LSB_4.1.0/LSB-Core-AMD64/LSB-Core-AMD64/elf-amd64.html, points mostly to http://www.x86-64.org/documentation/abi.pdf
A handy summary can be found at:
Spin like mad between:
- high level generators. We use the assembler
- file decompilers. We use readelf. It makes it faster to read the ELF file by turning it into human readable output. But you must have seen one byte-by-byte example first, and think about how readelf output maps to the standard.
- low-level generators: stand-alone libraries that let you control every field of the ELF files you generated. https://github.com/BR903/ELFkickers, https://github.com/sqall01/ZwoELF and many more on GitHub.
- consumer: the exec system call of the Linux kernel can parse ELF files to start processes: https://github.com/torvalds/linux/blob/v4.11/fs/binfmt_elf.c, https://stackoverflow.com/questions/8352535/how-does-kernel-get-an-executable-binary-file-running-under-linux/31394861#31394861
The ELF standard specifies multiple file formats:
- Object files (.o). Intermediate step to generating executables and other formats. Object files exist to make compilation faster: with make, we only have to recompile the modified source files based on timestamps. We have to do the linking step every time, but it is much less expensive.
- Executable files (no standard Linux extension). This is what the Linux kernel can actually run.
- Archive files (.a). Libraries meant to be embedded into executables during the linking step.
- Shared object files (.so). Libraries meant to be loaded when the executable starts running.
- Core dumps. Such files may be generated by the Linux kernel when the program does naughty things, e.g. segfault. They exist to help debugging the program.
In this tutorial, we consider only object and executable files.
- Compiler toolchains generate and read ELF files. Sane compilers should use a separate standalone library to do the dirty work. E.g., Binutils uses BFD (in-tree and canonical source).
- Operating systems read and run ELF files. Kernels cannot link to a library nor use the C stdlib, so they are more likely to implement it themselves. This is the case of the Linux kernel 4.2, which implements it in the file
It is non-trivial to determine what is the smallest legal ELF file, or the smallest one that will do something trivial in Linux.
In this example we will consider a saner hello world example that will better capture real life cases.
Let's break down a minimal runnable Linux x86-64 example:
TODO: use a minimal linker script with -T to be more precise and minimal.
- NASM 2.10.09
- Binutils version 2.24 (contains
- Ubuntu 14.04
We don't use a C program as that would complicate the analysis; that will be level 2 :-)
An ELF file contains the following parts:
- ELF header. Points to the position of the section header table and the program header table.
- Section header table (optional on executable). It has e_shnum section headers, each pointing to the position of a section.
- N sections, with N <= e_shnum (optional on executable)
- Program header table (only on executable). It has e_phnum program headers, each pointing to the position of a segment.
- N segments, with N <= e_phnum (only on executable)
The order of those parts is _not_ fixed: the only fixed thing is the ELF header, which must be the first thing in the file. The generic docs say:
Although the figure shows the program header table immediately after the ELF header, and the section header table following the sections, actual files may differ. Moreover, sections and segments have no specified order. Only the ELF header has a fixed position in the file.
In pictures: sample object file with three sections:
But nothing (except sanity) prevents the following topology:
But some newbies may prefer PNGs :-)
We will get into more detail later, but it is good to have it in mind now:
- section: exists before linking, in object files. One or more sections will be put inside a single segment by the linker. The major information that sections contain for the linker is whether the section is:
- raw data to be loaded into memory, e.g.
- or metadata about other sections, that will be used by the linker, but disappear at runtime e.g.
- segment: exists after linking, in the executable file. Contains information about how each segment should be loaded into memory by the OS, notably location and permissions.
Bytes in the object file:
- 0 0: 7f 45 4c 46 = 0x7f 'E', 'L', 'F': ELF magic number
- 0 4: ELFCLASS64: 64 bit ELF
- 0 5: ELFDATA2LSB: little endian data
- 0 6: 01: format version
- 0 7: EI_OSABI (only in 2003 Update) = ELFOSABI_NONE: no extensions.
- 0 8: 00: reserved bytes. Must be set to 0.
- 1 0: 01 00 = 1 (little endian) = ET_REL: relocatable format. On the executable it is ET_EXEC. Another important possibility for the executable is ET_DYN for PIE executables and shared libraries. ET_DYN tells the Linux kernel that the code is position independent, and can be loaded at a random memory location with ASLR. This is explained further at:
- 1 2: EM_X86_64: AMD64 architecture
- 1 4: 01 00 00 00: must be 1
- 1 8: 00: execution address entry point, or 0 if not applicable, as for the object file, which has no entry point. On the executable, it is b0 00 40 00 00 00 00 00. The kernel sets RIP directly to that value when executing. It can be configured by the linker script or -e. But it will segfault if you set it too low: https://stackoverflow.com/questions/2187484/why-is-the-elf-execution-entry-point-virtual-address-of-the-form-0x80xxxxx-and-n
- 2 0: 00: program header table offset, 0 if not present. 40 00 00 00 on the executable, i.e. it starts immediately after the ELF header.
- 2 8: 0x40: section header table file offset, 0 if not present.
- 3 0: 00 00 00 00: arch specific.
The Intel386 architecture defines no flags; so this member contains zero.
- 3 4: 40 00: size of this ELF header. TODO why is this field needed? Isn't the size fixed?
- 3 6: 00 00: size of each program header, 0 if not present. 38 00 on executable: it is 56 bytes long
- 3 8: 00 00: number of program header entries, 0 if not present. 02 00 on executable: there are 2 entries.
- 3 A: 40 00 07 00: section header size and number of entries
- 3 E: e_shstrndx (Section Header STRing iNDeX) = 03 00: index of the .shstrtab section.
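The byte-by-byte walk of the ELF header can also be done programmatically. The following is a minimal sketch, assuming a little-endian ELF64 file. The names parse_elf64_header and sample_header are made up for this illustration, and the sample bytes are rebuilt from the object file values discussed in this section, with e_machine = 0x3E (EM_X86_64) taken from the ABI rather than from the dump.

```python
import struct

# Elf64_Ehdr layout: 16-byte e_ident followed by the fields in file order.
# "<" means little endian with no padding, matching ELFDATA2LSB.
EHDR_FORMAT = "<16sHHIQQQIHHHHHH"
EHDR_FIELDS = (
    "e_ident", "e_type", "e_machine", "e_version", "e_entry", "e_phoff",
    "e_shoff", "e_flags", "e_ehsize", "e_phentsize", "e_phnum",
    "e_shentsize", "e_shnum", "e_shstrndx",
)


def parse_elf64_header(data):
    """Decode the first 64 bytes of a little-endian ELF64 file into a dict."""
    if data[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    if data[4] != 2 or data[5] != 1:
        raise ValueError("this sketch only handles ELFCLASS64 + ELFDATA2LSB")
    values = struct.unpack_from(EHDR_FORMAT, data, 0)
    return dict(zip(EHDR_FIELDS, values))


# Header of the object file, rebuilt by hand (an assumption, not a real dump):
# magic, ELFCLASS64, ELFDATA2LSB, version 1, then 9 zero bytes.
sample_header = struct.pack(
    EHDR_FORMAT,
    b"\x7fELF\x02\x01\x01" + b"\x00" * 9,
    1,      # e_type = ET_REL
    0x3E,   # e_machine = EM_X86_64
    1,      # e_version
    0,      # e_entry: no entry point in an object file
    0,      # e_phoff: no program header table
    0x40,   # e_shoff
    0,      # e_flags
    0x40,   # e_ehsize
    0, 0,   # e_phentsize, e_phnum
    0x40,   # e_shentsize
    7,      # e_shnum
    3,      # e_shstrndx
)

header = parse_elf64_header(sample_header)
print(hex(header["e_shoff"]), header["e_shnum"], header["e_shstrndx"])
```

To try it on a real file, pass the result of open("hello_world.o", "rb").read() instead of sample_header.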
Each entry contains metadata about a given section.
e_shoff of the ELF header gives the starting position, 0x40 here.
e_shnum and e_shentsize from the ELF header say that we have 7 entries, each 0x40 bytes long.
So the table takes bytes from 0x40 to 0x40 + 7 * 0x40 - 1 = 0x1FF.
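The extent of the section header table follows from three ELF header fields: the start offset, the number of entries and the size of each entry. A small sketch of that arithmetic, with hypothetical helper names and the values of this object file (start 0x40, 7 entries, 0x40 bytes each):

```python
def section_header_table_range(e_shoff, e_shnum, e_shentsize):
    """Return (first_byte, last_byte) occupied by the section header table."""
    first = e_shoff
    last = e_shoff + e_shnum * e_shentsize - 1
    return first, last


def section_header_offsets(e_shoff, e_shnum, e_shentsize):
    """Return the file offset at which each section header entry starts."""
    return [e_shoff + i * e_shentsize for i in range(e_shnum)]


first, last = section_header_table_range(0x40, 7, 0x40)
offsets = section_header_offsets(0x40, 7, 0x40)
print(hex(first), hex(last))
print([hex(o) for o in offsets])
```

Entry 0 starts at 0x40 and entry 1 at 0x80, which is where the byte walk of the individual entries picks up.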
Some section names are reserved for certain section types: http://www.sco.com/developers/gabi/2003-12-17/ch4.sheader.html#special_sections e.g.
The struct represented by each entry is:
Contained in bytes 0x40 to 0x7F.
The first section is always magic: http://www.sco.com/developers/gabi/2003-12-17/ch4.sheader.html says:
If the number of sections is greater than or equal to SHN_LORESERVE (0xff00), e_shnum has the value SHN_UNDEF (0) and the actual number of section header table entries is contained in the sh_size field of the section header at index 0 (otherwise, the sh_size member of the initial entry contains 0).
There are also other magic sections detailed in Figure 4-7: Special Section Indexes.
In index 0, SHT_NULL is mandatory. Are there any other uses for it: https://stackoverflow.com/questions/26812142/what-is-the-use-of-the-sht-null-section-in-elf ?
.data is section 1:
- 80 0: 01 00 00 00: index 1 in the .shstrtab string table. Here, 1 says the name of this section starts at character 1 of that section, and ends at the first NUL character, making up the string .data. .data is one of the section names which has a predefined meaning according to http://www.sco.com/developers/gabi/2003-12-17/ch4.strtab.html:
These sections hold initialized data that contribute to the program's memory image.
- 80 4: 01 00 00 00: SHT_PROGBITS: the section content is not specified by ELF, only by how the program interprets it. Normal since a .data section.
- 80 8: SHF_ALLOC: http://www.sco.com/developers/gabi/2003-12-17/ch4.sheader.html#sh_flags, as required from a .data section.
- 90 0: 00: TODO: standard says:
If the section will appear in the memory image of a process, this member gives the address at which the section's first byte should reside. Otherwise, the member contains 0.
but I don't understand it very well yet.
- 90 8: 00 02 00 00 00 00 00 00 = 0x200: number of bytes from the start of the file to the first byte in this section
- a0 0: 0d 00 00 00 00 00 00 00 If we take 0xD bytes starting at sh_offset 0x200, we see: AHA! So our "Hello world!" string is in the data section like we told it to be on the NASM. Once we graduate from hd, we will look this up like: which outputs: NASM sets decent properties for that section because it treats .data magically: https://www.nasm.us/doc/nasmdoc7.html#section-7.9.2 Also note that this was a bad section choice: a good C compiler would put the string in .rodata instead, because it is read-only and it would allow for further OS optimizations.
- a0 8: sh_info = 8x 0: do not apply to this section type. http://www.sco.com/developers/gabi/2003-12-17/ch4.sheader.html#special_sections
- b0 0: 04 = TODO: why is this alignment necessary? Is it only for sh_addr, or also for symbols inside
- b0 8: 00 = the section does not contain a table. If != 0, it means that the section contains a table of fixed size entries. In this file, we see from the readelf output that this is the case for the
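The manual decoding of the .data entry can be reproduced with a short script. This is a sketch under assumptions: the entry bytes are rebuilt by hand from the field values above instead of being read from the file, sh_flags is assumed to be SHF_WRITE | SHF_ALLOC (3), and parse_section_header is a name invented for this example.

```python
import struct

# Elf64_Shdr layout, little endian, 64 bytes in total.
SHDR_FORMAT = "<IIQQQQIIQQ"
SHDR_FIELDS = (
    "sh_name", "sh_type", "sh_flags", "sh_addr", "sh_offset",
    "sh_size", "sh_link", "sh_info", "sh_addralign", "sh_entsize",
)
SHT_PROGBITS = 1
SHF_WRITE = 0x1
SHF_ALLOC = 0x2


def parse_section_header(data, offset=0):
    """Decode one Elf64_Shdr entry found at `offset` in `data`."""
    values = struct.unpack_from(SHDR_FORMAT, data, offset)
    return dict(zip(SHDR_FIELDS, values))


# Hand-built .data entry (assumption: not extracted from a real file).
raw_entry = struct.pack(
    SHDR_FORMAT,
    1,                      # sh_name: index 1 in the section name string table
    SHT_PROGBITS,           # sh_type
    SHF_WRITE | SHF_ALLOC,  # sh_flags (assumed value)
    0,                      # sh_addr
    0x200,                  # sh_offset
    0xD,                    # sh_size: length of the string
    0, 0,                   # sh_link, sh_info
    4,                      # sh_addralign
    0,                      # sh_entsize: not a table of fixed size entries
)

data_section = parse_section_header(raw_entry)
is_loaded = bool(data_section["sh_flags"] & SHF_ALLOC)
print(hex(data_section["sh_offset"]), data_section["sh_size"], is_loaded)
```

On a real object file, the same function could be called with the whole file contents and offset set to e_shoff + index * e_shentsize.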
Now that we've done one section manually, let's graduate and use the readelf -S of the other sections:
.text is executable but not writable: if we try to write to it Linux segfaults. Let's see if we really have some code there: gives:
If we grep b8 01 00 00 on the hd, we see that this only occurs at 00000210, which is what the section says. And the Size is 27, which matches as well. So we must be talking about the right section.
This looks like the right code: a write followed by an exit.
The most interesting part is line a which does: to pass the address of the string to the system call. Currently, the 0x0 is just a placeholder. After linking happens, it will be modified to contain: This modification is possible because of the data of the .rela.text section.
Sections with sh_type == SHT_STRTAB are called _string tables_.
They hold a null separated array of strings.
Such sections are used by other sections when string names are to be used. The using section says:
- which string table they are using
- what is the index on the target string table where the string starts
So for example, we could have a string table containing:
The first byte must be a 0. TODO rationale?
And if another section wants to use the string d e f, they have to point to index 5 of this section (letter d).
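Looking up a string in a string table is just a scan from the given index to the next NUL byte. A minimal sketch, assuming a toy table made of a leading NUL followed by the strings abc and def (the function name read_string is made up for this example):

```python
def read_string(table, index):
    """Return the NUL-terminated string that starts at `index` in `table`."""
    end = table.index(b"\x00", index)
    return table[index:end].decode("ascii")


# Toy string table (an assumption for illustration):
# index: 0   1 2 3 4   5 6 7 8
# data:  \0  a b c \0  d e f \0
table = b"\x00abc\x00def\x00"

print(read_string(table, 1))
print(read_string(table, 5))
print(repr(read_string(table, 0)))
```

Index 0 yields the empty string, and an index may also point into the middle of an entry: index 6 yields ef, so a string table can share suffixes between names.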
Notable string table sections:
sh_type == SHT_STRTAB.
Common name: _section header string table_.
The section name .shstrtab is reserved. The standard says:
This section holds section names.
This section gets pointed to by the e_shstrndx field of the ELF header itself.
String indexes of this section are pointed to by the sh_name field of section headers, which denote strings.
This section does not have SHF_ALLOC marked, so it will not appear on the executing program. outputs:
The data in this section has a fixed format: http://www.sco.com/developers/gabi/2003-12-17/ch4.strtab.html
If we look at the names of other sections, we see that they all contain numbers, e.g. the .text section is number
Then each string ends when the first NUL character is found, e.g. character
sh_type == SHT_SYMTAB.
Common name: _symbol table_.
First we note that:
For SHT_SYMTAB sections, those numbers mean that:
- strings that give symbol names are in section 5,
- the relocation data is in section 6,
A good high level tool to disassemble that section is: which gives:
This is however a high level view that omits some types of symbols. A more detailed disassembly can be obtained with: which gives:
The binary format of the table is documented at http://www.sco.com/developers/gabi/2003-12-17/ch4.symtab.html
The data is: which gives:
The entries are of type:
Like in the section table, the first entry is magical and set to fixed meaningless values.
Entry 1 has ELF64_ST_TYPE == STT_FILE.
ELF64_ST_TYPE is contained inside of st_info.
- 10 8: 01000000 = character 1 in the .strtab, which until the following NUL gives hello_world.asm. This piece of information may be used by the linker to decide on which segment sections go: e.g. in ld linker script we write: to pick a section from a given file. Most of the time however, we will just dump all sections with a given name together with:
- 10 12: 04 Bits 0-3 = ELF64_ST_TYPE = Type = STT_FILE: the main purpose of this entry is to use st_name to indicate the name of the file which generated this object file. Bits 4-7 = ELF64_ST_BIND = Binding = STB_LOCAL. Required value for STT_FILE.
- 10 13: st_shndx = Symbol Table Section header Index = SHN_ABS. Required for STT_FILE.
- 20 0: 00: required value for STT_FILE
- 20 8: 00: no allocated size
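The same symbol table entry can be decoded with a few lines of code. This is a sketch, assuming the 24-byte Elf64_Sym layout and a hand-written copy of the STT_FILE entry. parse_symbol is a hypothetical helper, and SHN_ABS is taken as 0xFFF1, which is stored as the bytes f1 ff in a little-endian file.

```python
import struct

# Elf64_Sym layout, little endian, 24 bytes in total.
SYM_FORMAT = "<IBBHQQ"
SYM_FIELDS = ("st_name", "st_info", "st_other", "st_shndx", "st_value", "st_size")
STT_FILE = 4
STB_LOCAL = 0
SHN_ABS = 0xFFF1


def parse_symbol(data, offset=0):
    """Decode one Elf64_Sym entry and split st_info into type and binding."""
    sym = dict(zip(SYM_FIELDS, struct.unpack_from(SYM_FORMAT, data, offset)))
    sym["type"] = sym["st_info"] & 0xF   # bits 0-3, what ELF64_ST_TYPE extracts
    sym["bind"] = sym["st_info"] >> 4    # bits 4-7, what ELF64_ST_BIND extracts
    return sym


# Hand-written STT_FILE entry (assumption: typed in, not read from the file).
raw_symbol = bytes.fromhex(
    "01000000"          # st_name: index 1 in the symbol string table
    "04"                # st_info: STB_LOCAL << 4 | STT_FILE
    "00"                # st_other
    "f1ff"              # st_shndx: SHN_ABS, little endian
    "0000000000000000"  # st_value
    "0000000000000000"  # st_size
)

file_symbol = parse_symbol(raw_symbol)
print(file_symbol["type"], file_symbol["bind"], hex(file_symbol["st_shndx"]))
```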
Now from the readelf, we interpret the others quickly.
There are two such entries, one pointing to .data and the other to .text.
TODO what is their purpose?
Then come the most important symbols:
The hello_world string is in the .data section (index 1). Its value is 0: it points to the first byte of that section.
_start is marked with GLOBAL visibility since we declared it global in NASM. This is necessary since it must be seen as the entry point. Unlike in C, by default NASM labels are local.
hello_world_len points to the special st_shndx == SHN_ABS == 0xFFF1 (stored as the bytes f1 ff in the file).
0xFFF1 is chosen so as to not conflict with other sections.
st_value == 0xD == 13 which is the value we have stored there on the assembly: the length of the string
This means that relocation will not affect this value: it is a constant.
This is a small optimization that our assembler does for us and which has ELF support.
If we had used the address of hello_world_len anywhere, the assembler would not have been able to mark it as SHN_ABS, and the linker would have extra relocation work on it later.
By default, NASM places a .symtab on the executable as well.
This is only used for debugging. Without the symbols, we are completely blind, and must reverse engineer everything.
You can strip it with objcopy, and the executable will still run. Such executables are called _stripped executables_.
Holds strings for the symbol table.
This section has sh_type == SHT_STRTAB.
It is pointed to by sh_link == 5 of the .symtab section.
This implies that it is an ELF level limitation that global variable names cannot contain NUL characters.
sh_type == SHT_RELA.
Common name: _relocation section_.
.rela.textholds relocation data which says how the address should be modified when the final executable is linked. This points to bytes of the text area that must be modified when linking happens to point to the correct memory locations.
Basically, it translates the object text containing the placeholder 0x0 address: to the actual executable code containing the final 0x6000d8:
It was pointed to by
readelf -r hello_world.o outputs:
The section does not exist in the executable.
The actual bytes are:
- 370 0: r_offset = 0xC: address into the .text whose address this relocation will modify
- 370 8: r_info = 0x200000001. Contains 2 fields:
- ELF64_R_TYPE = 0x1: meaning depends on the exact architecture.
- ELF64_R_SYM = 0x2: index of the symbol to which the address points, here the section symbol for .data, which is at symbol table index 2.
The AMD64 ABI says that type 1 is called R_X86_64_64 and that it represents the operation S + A where:
- S: the value of the symbol on the object file, here 0 because we point to the 00 00 00 00 00 00 00 00 placeholder
- A: the addend, present in field r_addend
This address is added to the section on which the relocation operates. This relocation operation acts on a total of 8 bytes.
- 380 0: r_addend = 0
So in our example we conclude that the new address will be: S + A = .data + 0, and thus the first thing in the data section.
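The relocation can be replayed by hand in a few lines. The following is a sketch, assuming the 24-byte Elf64_Rela layout, an all-zero stand-in for the 27 bytes of .text, and 0x6000d8 as the final address that the linker assigns to the start of .data. parse_rela and apply_r_x86_64_64 are names invented for this illustration, not part of any linker.

```python
import struct

# Elf64_Rela layout: r_offset, r_info, signed r_addend (24 bytes).
RELA_FORMAT = "<QQq"
R_X86_64_64 = 1


def parse_rela(data, offset=0):
    """Decode one Elf64_Rela entry and split r_info into symbol and type."""
    r_offset, r_info, r_addend = struct.unpack_from(RELA_FORMAT, data, offset)
    return {
        "r_offset": r_offset,
        "sym": r_info >> 32,          # what ELF64_R_SYM extracts
        "type": r_info & 0xFFFFFFFF,  # what ELF64_R_TYPE extracts
        "r_addend": r_addend,
    }


def apply_r_x86_64_64(section, rela, symbol_value):
    """Return a copy of `section` with S + A written as 8 bytes at r_offset."""
    patched = bytearray(section)
    value = (symbol_value + rela["r_addend"]) & 0xFFFFFFFFFFFFFFFF
    struct.pack_into("<Q", patched, rela["r_offset"], value)
    return bytes(patched)


rela = parse_rela(struct.pack(RELA_FORMAT, 0xC, 0x200000001, 0))

# Stand-in for .text: 27 zero bytes (assumption, the real section holds code).
text = bytes(27)
linked_text = apply_r_x86_64_64(text, rela, 0x6000D8)
print(rela["sym"], rela["type"], linked_text[0xC:0x14].hex())
```

Only the 8 bytes starting at offset 0xC change; the rest of the section is left untouched.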
Besides sh_type == SHT_RELA, there also exists SHT_REL, which would have section name .rel.text (not present in this object file).
Those represent the same struct, but without the addend, e.g.:
The ELF standard says that in many cases both can be used, and it is just a matter of convenience.
This program did not have certain dynamic linking related sections because we linked it minimally with
However, if you compile a C hello world with GCC 8.2:
some other interesting sections would appear.
Contains the path to the dynamic loader, i.e. /lib64/ld-linux-x86-64.so.2 in Ubuntu 18.10. Explained at: https://stackoverflow.com/questions/8040631/checking-if-a-binary-compiled-with-static/55664341#55664341
Contains a lot of different flag masks.
Seems to be a GNU Binutils extension
Determines if an executable is a position independent executable (PIE).
Seems to be informational only, since not used by Linux kernel 5.0 or glibc 2.29.
file 5.36 however does use it to display file type as explained at: https://stackoverflow.com/questions/34519521/why-does-gcc-create-a-shared-object-instead-of-an-executable-binary-according-to/55704865#55704865
Only appears in the executable.
Contains information of how the executable should be put into the process virtual memory.
The executable is generated from object files by the linker. The main jobs that the linker does are:
- determine which sections of the object files will go into which segments of the executable. In Binutils, this comes down to parsing a linker script, and dealing with a bunch of defaults. You can get the linker script used with ld --verbose, and set a custom one with -T.
- do relocation according to the .rela.text section. This depends on how the multiple sections are put into memory.
readelf -l hello_world.out gives:
On the ELF header, e_phoff, e_phnum and e_phentsize told us that there are 2 program headers, which start at 0x40 and are 0x38 bytes long each, so they are: and:
Structure represented http://www.sco.com/developers/gabi/2003-12-17/ch5.pheader.html:
Breakdown of the first one:
- 40 0: 01 00 00 00 = PT_LOAD: this is a regular segment that will get loaded in memory.
- 40 4: 05 00 00 00 = execute and read permissions. No write: we cannot modify the text segment. A classic way to do this in C is with string literals: https://stackoverflow.com/a/30662565/895245 This allows kernels to do certain optimizations, like sharing the segment amongst processes.
- 40 8: 00 TODO: what is this? Standard says:
This member gives the offset from the beginning of the file at which the first byte of the segment resides.
But it looks like offsets from the beginning of _segments_, not file?
- 50 0: 00 00 40 00 00 00 00 00: initial virtual memory address to load this segment to
- 50 8: 00 00 40 00 00 00 00 00: unspecified effect. Intended for systems in which physical addressing matters. TODO example?
- 60 0: d7 00 00 00 00 00 00 00: size that the segment occupies in the file. If smaller than p_memsz, the OS fills it with zeroes to fit when loading the program. This is how BSS data is implemented to save space on executable files. The i386 ABI says:
The bytes from the file are mapped to the beginning of the memory segment. If the segment’s memory size (p_memsz) is larger than the file size (p_filesz), the ‘‘extra’’ bytes are defined to hold the value 0 and to follow the segment’s initialized area. The file size may not be larger than the memory size.
- 60 8: d7 00 00 00 00 00 00 00: size that the segment occupies in memory
- 70 0: 00 00 20 00 00 00 00 00: 0 or 1 mean no alignment required. TODO why is this required? Why not just use p_addr directly, and get that right? Docs also say:
p_vaddr should equal p_offset, modulo p_align
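The fields of a program header, the alignment rule and the zero fill of the bytes between p_filesz and p_memsz can be illustrated with a short sketch. It assumes the 56-byte Elf64_Phdr layout and rebuilds the first program header by hand from the values above. The helper names and the tiny bss_like example are made up, and real loaders map pages rather than copying bytes like this.

```python
import struct

# Elf64_Phdr layout, little endian, 0x38 = 56 bytes in total.
PHDR_FORMAT = "<IIQQQQQQ"
PHDR_FIELDS = (
    "p_type", "p_flags", "p_offset", "p_vaddr",
    "p_paddr", "p_filesz", "p_memsz", "p_align",
)
PT_LOAD = 1
PF_X, PF_W, PF_R = 0x1, 0x2, 0x4


def parse_program_header(data, offset=0):
    """Decode one Elf64_Phdr entry found at `offset` in `data`."""
    return dict(zip(PHDR_FIELDS, struct.unpack_from(PHDR_FORMAT, data, offset)))


def flags_to_string(p_flags):
    """Render permission bits in R, W, X order, with '-' for unset bits."""
    return "".join(
        letter if p_flags & bit else "-"
        for letter, bit in (("R", PF_R), ("W", PF_W), ("X", PF_X))
    )


def alignment_ok(phdr):
    """Check that p_vaddr equals p_offset modulo p_align (0 and 1 mean none)."""
    align = phdr["p_align"]
    return align <= 1 or phdr["p_vaddr"] % align == phdr["p_offset"] % align


def segment_image(file_bytes, phdr):
    """Bytes to place at p_vaddr: p_filesz bytes from the file, then zeroes."""
    start = phdr["p_offset"]
    content = file_bytes[start:start + phdr["p_filesz"]]
    return content + bytes(phdr["p_memsz"] - phdr["p_filesz"])


# First program header, rebuilt by hand (assumption: not read from the file).
text_segment = parse_program_header(
    struct.pack(PHDR_FORMAT, PT_LOAD, 5, 0, 0x400000, 0x400000, 0xD7, 0xD7, 0x200000)
)
print(flags_to_string(text_segment["p_flags"]), alignment_ok(text_segment))

# Hypothetical segment whose memory size exceeds its file size, as for BSS.
bss_like = dict(text_segment, p_offset=1, p_filesz=2, p_memsz=5)
image = segment_image(b"abcd", bss_like)
print(image)
```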
The second segment (.data) is analogous. TODO: why use offset 0x00000000006000d8? Why not just use
Then the: section of the readelf tells us that:
- 0 is the .text segment. Aha, so this is why it is executable, and not writable
- 1 is the .data segment.
TODO where does this information come from? https://stackoverflow.com/questions/23018496/section-to-segment-mapping-in-elf-files