The ELF format - how programs look from the inside
Introduction

ELF is the file format used for object files (.o's), binaries, shared libraries and core dumps in Linux. It's actually pretty simple and well thought-out.

ELF has the same layout for all architectures; however, endianness and word size can differ, relocation types, symbol types and the like may have platform-specific values, and of course the contained code is architecture-specific.

An ELF file provides two views of the data it contains: a linking view and an execution view. The two views are accessed through two tables: the section header table and the program header table.

Linking view: Section Header Table (SHT)

The SHT gives an overview of the sections contained in the ELF file. Of particular interest are REL sections (relocations), SYMTAB/DYNSYM sections (symbol tables), and VERSYM/VERDEF/VERNEED sections (symbol versioning information).

greek0@iphigenie:~$ readelf -S /bin/bash
There are 26 section headers, starting at offset 0xa4e10:

Section Headers:
  [Nr] Name              Type            Addr     Off   Size   ES Flg Lk Inf Al
  [ 0]                   NULL            00000000 00000 000000 00      0   0  0
  [ 1] .interp           PROGBITS        08048134 00134 000013 00   A  0   0  1
  [ 2] .note.ABI-tag     NOTE            08048148 00148 000020 00   A  0   0  4
  [ 3] .hash             HASH            08048168 00168 002e48 04   A  4   0  4
  [ 4] .dynsym           DYNSYM          0804afb0 02fb0 007890 10   A  5   1  4
  [ 5] .dynstr           STRTAB          08052840 0a840 0074e2 00   A  0   0  1
  [ 6] .gnu.version      VERSYM          08059d22 11d22 000f12 02   A  4   0  2
  [ 7] .gnu.version_r    VERNEED         0805ac34 12c34 000090 00   A  5   2  4
  [ 8] .rel.dyn          REL             0805acc4 12cc4 000040 08   A  4   0  4
  [ 9] .rel.plt          REL             0805ad04 12d04 0005a8 08   A  4  11  4
  [10] .init             PROGBITS        0805b2ac 132ac 000017 00  AX  0   0  4
  [11] .plt              PROGBITS        0805b2c4 132c4 000b60 04  AX  0   0  4
  [12] .text             PROGBITS        0805be30 13e30 077154 00  AX  0   0 16
  [13] .fini             PROGBITS        080d2f84 8af84 00001a 00  AX  0   0  4
  [14] .rodata           PROGBITS        080d2fa0 8afa0 015198 00   A  0   0 32
  [15] .eh_frame_hdr     PROGBITS        080e8138 a0138 00002c 00   A  0   0  4
  [16] .eh_frame         PROGBITS        080e8164 a0164 00009c 00   A  0   0  4
  [17] .ctors            PROGBITS        080e9200 a0200 000008 00  WA  0   0  4
  [18] .dtors            PROGBITS        080e9208 a0208 000008 00  WA  0   0  4
  [19] .jcr              PROGBITS        080e9210 a0210 000004 00  WA  0   0  4
  [20] .dynamic          DYNAMIC         080e9214 a0214 0000d8 08  WA  5   0  4
  [21] .got              PROGBITS        080e92ec a02ec 000004 04  WA  0   0  4
  [22] .got.plt          PROGBITS        080e92f0 a02f0 0002e0 04  WA  0   0  4
  [23] .data             PROGBITS        080e95e0 a05e0 004764 00  WA  0   0 32
  [24] .bss              NOBITS          080edd60 a4d44 004bc8 00  WA  0   0 32
  [25] .shstrtab         STRTAB          00000000 a4d44 0000cc 00      0   0  1

Execution view: Program Header Table (PHT)

The PHT contains information for the kernel on how to start the program. The LOAD directives determine which parts of the ELF file get mapped into memory. The INTERP directive specifies an ELF interpreter, which is normally /lib/ld-linux.so.2 on Linux systems. The DYNAMIC entry points to the .dynamic section, which contains information used by the ELF interpreter to set up the binary.

greek0@iphigenie:~$ readelf -l /bin/bash

Elf file type is EXEC (Executable file)
Entry point 0x805be30
There are 8 program headers, starting at offset 52

Program Headers:
  Type           Offset   VirtAddr   PhysAddr   FileSiz MemSiz  Flg Align
  PHDR           0x000034 0x08048034 0x08048034 0x00100 0x00100 R E 0x4
  INTERP         0x000134 0x08048134 0x08048134 0x00013 0x00013 R   0x1
      [Requesting program interpreter: /lib/ld-linux.so.2]
  LOAD           0x000000 0x08048000 0x08048000 0xa0200 0xa0200 R E 0x1000
  LOAD           0x0a0200 0x080e9200 0x080e9200 0x04b44 0x09728 RW  0x1000
  DYNAMIC        0x0a0214 0x080e9214 0x080e9214 0x000d8 0x000d8 RW  0x4
  NOTE           0x000148 0x08048148 0x08048148 0x00020 0x00020 R   0x4
  GNU_EH_FRAME   0x0a0138 0x080e8138 0x080e8138 0x0002c 0x0002c R   0x4
  GNU_STACK      0x000000 0x00000000 0x00000000 0x00000 0x00000 RW  0x4

 Section to Segment mapping:
  Segment Sections...
   00
   01     .interp
   02     .interp .note.ABI-tag .dynsym .dynstr .gnu.version .gnu.version_r .rel.dyn .rel.plt ...
   03     .ctors .dtors .jcr .dynamic .got .got.plt .data .bss
   04     .dynamic
   05     .note.ABI-tag
   06     .eh_frame_hdr
   07

Putting it all together: the ELF header

Neither the SHT nor the PHT has a fixed position; they can be located anywhere in an ELF file. To find them the ELF header is used, which is located at the very start of the file.

The first bytes contain the ELF magic "\x7fELF", followed by the class ID (32 or 64 bit ELF file), the data format ID (little endian/big endian), the machine type, etc.

Towards the end of the ELF header are the file offsets of the SHT and PHT, along with the size and number of their entries.

greek0@iphigenie:~$ readelf -h /bin/bash
ELF Header:
  Magic:   7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00
  Class:                             ELF32
  Data:                              2's complement, little endian
  Version:                           1 (current)
  OS/ABI:                            UNIX - System V
  ABI Version:                       0
  Type:                              EXEC (Executable file)
  Machine:                           Intel 80386
  Version:                           0x1
  Entry point address:               0x805be30
  Start of program headers:          52 (bytes into file)
  Start of section headers:          675344 (bytes into file)
  Flags:                             0x0
  Size of this header:               52
  Size of program headers:           32
  Number of program headers:         8
  Size of section headers:           40
  Number of section headers:         26
  Section header string table index: 25

The Relocation Table

The relocation table specifies where relocations are needed in order for the program to run. In programs these are normally symbol relocations, i.e. the dynamic linker has to resolve the needed symbol by its name, and then write the symbol address to the place specified in the relocation entry.

Relocation types are architecture-specific and there are usually quite a lot of them. On i386 the most important ones are R_386_GLOB_DAT, meaning "just write the address of the symbol to that location", R_386_COPY, which copies the symbol's data (e.g. the stdout variable) from the shared library into the binary, and R_386_JUMP_SLOT, which is used for the normal PLT/GOT function call relocation mechanism.

The resolution of the symbol value itself is done by the dynamic linker (contained within /lib/ld-linux.so.2, the ELF interpreter commonly used), and is a pretty complex process.
Basically the linker searches all loaded ELF objects (the binary itself and the loaded libraries) and uses the first definition of the symbol it finds.

greek0@iphigenie:~$ readelf -r /bin/bash

Relocation section '.rel.dyn' at offset 0x12cc4 contains 8 entries:
 Offset     Info    Type            Sym.Value  Sym. Name
080e92ec  00078006 R_386_GLOB_DAT    00000000   __gmon_start__
080edd68  00035205 R_386_COPY        080edd68   stdout
080edd6c  00035d05 R_386_COPY        080edd6c   stderr
080edd70  00046405 R_386_COPY        080edd70   PC
080edd74  00067405 R_386_COPY        080edd74   stdin
080edd78  0006e305 R_386_COPY        080edd78   UP

Relocation section '.rel.plt' at offset 0x12d04 contains 181 entries:
 Offset     Info    Type            Sym.Value  Sym. Name
080e9368  00012c07 R_386_JUMP_SLOT   00000000   fileno
080e936c  00013807 R_386_JUMP_SLOT   00000000   strcmp
080e9370  00014107 R_386_JUMP_SLOT   0805b4a4   close
080e9374  00015307 R_386_JUMP_SLOT   00000000   dlsym
080e937c  00016a07 R_386_JUMP_SLOT   00000000   fprintf
080e9388  00018307 R_386_JUMP_SLOT   00000000   fflush
080e9390  00019c07 R_386_JUMP_SLOT   0805b524   unlink
080e930c  00003307 R_386_JUMP_SLOT   00000000   regexec
080e9328  00007a07 R_386_JUMP_SLOT   00000000   ferror
080e9330  00008307 R_386_JUMP_SLOT   00000000   readdir64
080e9334  00008f07 R_386_JUMP_SLOT   00000000   strchr
080e9338  0000a507 R_386_JUMP_SLOT   00000000   fdopen
080e9344  0000da07 R_386_JUMP_SLOT   00000000   getpid
080e9360  00012207 R_386_JUMP_SLOT   00000000   write
080e95cc  00078707 R_386_JUMP_SLOT   00000000   strcpy
...       ...

Exported symbols

When searching for a symbol the dynamic linker looks through the dynamic symbol table .dynsym, so all symbols present there are usable by other programs (in other words: exported and, in case of a library, part of the ABI).

Actually the process is more complicated (involving the hashes in the .hash section), but the end result is the same as just described.
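To give a feel for the hashing step just mentioned, here is a minimal Python sketch of the classic System V ELF hash function as given in the ELF specification, together with the bucket selection it feeds. The helper name bucket_for and the bucket count 1031 are assumptions made up for illustration (the real bucket count is stored at the start of the .hash section), and this only sketches the idea, not what a particular dynamic linker does internally.

```python
def elf_hash(name: bytes) -> int:
    """System V ELF hash of a symbol name, as used by the .hash section."""
    h = 0
    for byte in name:
        h = ((h << 4) + byte) & 0xFFFFFFFF  # emulate 32-bit unsigned arithmetic
        g = h & 0xF0000000                  # top four bits, if any are set
        if g:
            h ^= g >> 24                    # fold them back into the low bits
        h &= ~g & 0xFFFFFFFF                # and clear them
    return h


def bucket_for(name: bytes, nbucket: int) -> int:
    """Index of the hash bucket whose chain would be searched for `name`."""
    return elf_hash(name) % nbucket


if __name__ == "__main__":
    NBUCKET = 1031  # hypothetical bucket count, for illustration only
    for sym in (b"printf", b"versionsort64"):
        print(sym.decode(), hex(elf_hash(sym)), bucket_for(sym, NBUCKET))
```

Symbols whose names hash to the same bucket end up on the same chain, which is why several rows in a hash-ordered symbol listing can share one bucket number.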
greek0@iphigenie:~$ readelf -D -s /lib/libc.so.6

Symbol table for image:
  Num Buc:    Value  Size Type    Bind   Vis     Ndx Name
  260   0: 0011a580     4 OBJECT  GLOBAL DEFAULT  29 _nl_domain_bindings
 1693   1: 000b0f60  1303 FUNC    GLOBAL DEFAULT  11 fts_read
  601   2: 00027df0    13 FUNC    WEAK   DEFAULT  11 scalbln
  208   3: 000698f0   141 FUNC    GLOBAL DEFAULT  11 memmove
 1798   4: 000b8ae0   117 FUNC    GLOBAL DEFAULT  11 lsearch
  348   4: 000dfd20   189 FUNC    GLOBAL DEFAULT  11 xdr_u_hyper
 1675   9: 0005ad10   231 FUNC    GLOBAL DEFAULT  11 fputc
  381   9: 000b92f0   389 FUNC    WEAK   DEFAULT  11 error_at_line
  166   9: 000864d0    36 FUNC    GLOBAL DEFAULT  11 versionsort64
  119   9: 000f2950    36 FUNC    GLOBAL DEFAULT  11 versionsort64
 2113  16: 000ac770    58 FUNC    WEAK   DEFAULT  11 mkdir
  516  16: 000de9c0   677 FUNC    GLOBAL DEFAULT  11 svctcp_create
  979  17: 000b7040    60 FUNC    GLOBAL DEFAULT  11 madvise
 1815  18: 000c61f0    42 FUNC    GLOBAL DEFAULT  11 pthread_mutex_lock
 2018  25: 00054ac0   326 FUNC    WEAK   DEFAULT  11 fputs
  432  30: 000ebc40    33 FUNC    GLOBAL DEFAULT  11 getutxid
 1879  31: 000292b0    64 FUNC    GLOBAL DEFAULT  11 sigdelset
 1902  33: 000ba530   107 FUNC    GLOBAL DEFAULT  11 gnu_dev_makedev
 1385  34: 000f3bc0   153 FUNC    GLOBAL DEFAULT  11 getrlimit64
  895  34: 000b2ad0   153 FUNC    GLOBAL DEFAULT  11 getrlimit64
 1290  37: 0009f400   319 FUNC    WEAK   DEFAULT  11 re_comp
   82  40: 000dbd70  1653 FUNC    GLOBAL DEFAULT  11 clnt_broadcast
 1892  41: 0008a6c0    53 FUNC    WEAK   DEFAULT  11 getresgid
  ...  ...

A more detailed look at versionsort64

The observant reader may have noticed that e.g. versionsort64 is present twice in the dynamic symbol table shown above, and that the two symbols have different values. The reason is pretty simple: libc.so.6 uses symbol versioning, and there are two versions of versionsort64 available. The binutils readelf unfortunately doesn't show the symbol versions; eu-readelf from the elfutils package, however, does.
greek0@iphigenie:~$ readelf -D -s /lib/libc.so.6 | grep versionsort64
  166   9: 000864d0    36 FUNC    GLOBAL DEFAULT  11 versionsort64
  119   9: 000f2950    36 FUNC    GLOBAL DEFAULT  11 versionsort64
greek0@iphigenie:~$ eu-readelf -s /lib/libc.so.6 | grep versionsort64
  119: 000f2950     36 FUNC    GLOBAL DEFAULT       11 versionsort64@GLIBC_2.1
  166: 000864d0     36 FUNC    GLOBAL DEFAULT       11 versionsort64@@GLIBC_2.2

Program loading in the kernel

ELF files themselves aren't terribly interesting. How ELF files are loaded into memory, and what has to happen before the program can execute its own code, can be.

The execution of a program starts inside the kernel, in the exec system call. There the file type is looked up and the appropriate handler is called. The binfmt-elf handler then loads the ELF header and the program header table (PHT), followed by lots of sanity checks.

The kernel then loads the parts specified in the LOAD directives in the PHT into memory. If an INTERP entry is present, the interpreter is loaded too. Statically linked binaries can do without an interpreter; dynamically linked programs always need /lib/ld-linux.so as interpreter because it includes some startup code, loads shared libraries needed by the binary, and performs relocations.

Finally control can be transferred to the program: to the interpreter if one is present, otherwise to the binary itself.

greek0@iphigenie:~$ readelf -l /bin/bash

Program Headers:
  Type           Offset   VirtAddr   PhysAddr   FileSiz MemSiz  Flg Align
  PHDR           0x000034 0x08048034 0x08048034 0x00100 0x00100 R E 0x4
  INTERP         0x000134 0x08048134 0x08048134 0x00013 0x00013 R   0x1
      [Requesting program interpreter: /lib/ld-linux.so.2]
  LOAD           0x000000 0x08048000 0x08048000 0xa0200 0xa0200 R E 0x1000
  LOAD           0x0a0200 0x080e9200 0x080e9200 0x04b44 0x09728 RW  0x1000
  DYNAMIC        0x0a0214 0x080e9214 0x080e9214 0x000d8 0x000d8 RW  0x4
  GNU_STACK      0x000000 0x00000000 0x00000000 0x00000 0x00000 RW  0x4
  ...
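The first loading steps described above (check the magic, read the ELF header, find and walk the program header table) can be imitated from user space. The following Python sketch does that with the struct module; the function names, the dictionary keys and the small PT_NAMES table are assumptions chosen for this example, and the code performs almost none of the sanity checks a real loader would.

```python
import struct

# Field layouts following the 16-byte e_ident, per ELF class (1 = 32 bit, 2 = 64 bit).
EHDR_FMT = {1: "HHIIIIIHHHHHH", 2: "HHIQQQIHHHHHH"}
EHDR_FIELDS = ("type", "machine", "version", "entry", "phoff", "shoff", "flags",
               "ehsize", "phentsize", "phnum", "shentsize", "shnum", "shstrndx")
# Program header layouts; the flags field sits in a different place in 64-bit files.
PHDR_FMT = {1: "IIIIIIII", 2: "IIQQQQQQ"}
PHDR_FIELDS = {
    1: ("type", "offset", "vaddr", "paddr", "filesz", "memsz", "flags", "align"),
    2: ("type", "flags", "offset", "vaddr", "paddr", "filesz", "memsz", "align"),
}
PT_NAMES = {1: "LOAD", 2: "DYNAMIC", 3: "INTERP", 4: "NOTE", 6: "PHDR"}


def parse_elf_header(data: bytes) -> dict:
    """Decode the ELF header found at the very start of `data`."""
    if len(data) < 16 or data[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    elf_class, elf_data = data[4], data[5]
    if elf_class not in EHDR_FMT or elf_data not in (1, 2):
        raise ValueError("unknown ELF class or data encoding")
    endian = "<" if elf_data == 1 else ">"
    values = struct.unpack_from(endian + EHDR_FMT[elf_class], data, 16)
    header = dict(zip(EHDR_FIELDS, values))
    header.update({"class": elf_class, "endian": endian})
    return header


def parse_program_headers(data: bytes, header: dict) -> list:
    """Decode every PHT entry, using the offset and sizes from the ELF header."""
    fmt = header["endian"] + PHDR_FMT[header["class"]]
    fields = PHDR_FIELDS[header["class"]]
    entries = []
    for i in range(header["phnum"]):
        offset = header["phoff"] + i * header["phentsize"]
        entries.append(dict(zip(fields, struct.unpack_from(fmt, data, offset))))
    return entries


def describe(path: str) -> list:
    """Summarise the ELF header and every program header of the file at `path`."""
    with open(path, "rb") as f:
        image = f.read()
    header = parse_elf_header(image)
    lines = [f"entry point {header['entry']:#x}, {header['phnum']} program headers"]
    for ph in parse_program_headers(image, header):
        name = PT_NAMES.get(ph["type"], hex(ph["type"]))
        lines.append(f"{name:8} offset {ph['offset']:#x} vaddr {ph['vaddr']:#x} "
                     f"filesz {ph['filesz']:#x} memsz {ph['memsz']:#x}")
    return lines
```

Printing the lines returned by describe("/bin/bash") gives a rough, abbreviated equivalent of the readelf -l listing; the LOAD entries are the ones the kernel actually maps into memory.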
Dynamic linking and the ELF interpreter

For a statically linked binary that's pretty much it; with dynamically linked binaries, however, a lot more magic has to go on.

First the dynamic linker (contained within the interpreter) looks at the .dynamic section, whose address is stored in the PHT. There it finds the NEEDED entries, which determine which libraries have to be loaded before the program can be run, the *REL* entries, which give the addresses of the relocation tables, the VER* entries, which contain symbol versioning information, etc.

So the dynamic linker loads the needed libraries and performs relocations (either directly at program startup or later, as soon as the relocated symbol is needed, depending on the relocation type).

Finally control is transferred to the address given by the symbol _start in the binary. Normally some gcc/glibc startup code lives there, which in the end calls main().

greek0@iphigenie:~$ readelf -d /bin/bash

Dynamic section at offset 0xa0214 contains 22 entries:
  Tag        Type                         Name/Value
 0x00000001 (NEEDED)                     Shared library: [libncurses.so.5]
 0x00000001 (NEEDED)                     Shared library: [libdl.so.2]
 0x00000001 (NEEDED)                     Shared library: [libc.so.6]
 0x0000000a (STRSZ)                      29922 (bytes)
 0x0000000b (SYMENT)                     16 (bytes)
 0x00000003 (PLTGOT)                     0x80e92f0
 0x00000002 (PLTRELSZ)                   1448 (bytes)
 0x00000014 (PLTREL)                     REL
 0x00000017 (JMPREL)                     0x805ad04
 0x00000011 (REL)                        0x805acc4
 0x00000012 (RELSZ)                      64 (bytes)
 0x6ffffffe (VERNEED)                    0x805ac34
 0x6fffffff (VERNEEDNUM)                 2
 0x6ffffff0 (VERSYM)                     0x8059d22
 0x00000000 (NULL)                       0x0

Symbol lookup by the dynamic linker

As mentioned before, symbol lookup is a complicated process, so I'll only give a simplified description. For every loaded object RTLD (the runtime dynamic linker) keeps a list of loaded objects called the "lookup scope". Every scope contains pointers to all the loaded objects (the binary and all loaded libraries), but the order of objects can differ between different scopes.
What is constant is that the binary is the first object in every scope.

When the RTLD has to resolve a symbol, it first checks which object the relocation is being performed for: was the lookup caused by the binary itself or by one of the loaded libraries? Then it gets the lookup scope for that object and iterates through every object in it. For each object it checks whether the needed symbol is in the dynamic symbol table. In case of a match it just uses that symbol value for the relocation; otherwise it continues its search with the next object in the scope.

Consequences of the symbol lookup rules

Libraries can't just jump directly to functions they export (although they of course know where their own functions are), but have to go through the described symbol lookup mechanism too. This, along with the fact that the binary is always first in every lookup scope, means that symbols defined in the binary override symbols defined in libraries. It's this way on purpose, to allow the binary to override library functions it doesn't like. If this happens nothing will use the library's function, not even calls made by the library itself.

Normally that's a good thing, but it can lead to problems if the binary unintentionally defines a symbol that's also used by some loaded library (think of a program that uses GTK, which pulls in different theme and input plugins depending on the user's system config). E.g. if the program defines a function void print_error(int error_code, char* str); and some plugin defines a function with the same name but another signature, like int print_error(char* str), that may be problematic.

If the plugin doesn't export the print_error symbol, there's no problem at all, because the code in the plugin can just call the proper function directly, without the need for a symbol lookup. However, if the plugin does export the symbol, it has to look up the symbol itself (because that's required by the SystemV ABI spec and, consequently, by the LSB).
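The simplified lookup rule and the print_error clash described above can be modelled in a few lines of Python. This is a toy model only: the LoadedObject class, the resolve function and the object names "app" and "plugin.so" are invented for illustration, and real lookups also involve hashing, symbol versions and binding types.

```python
from dataclasses import dataclass, field


@dataclass
class LoadedObject:
    """A loaded ELF object, reduced to its name and its exported symbols."""
    name: str
    dynsym: dict = field(default_factory=dict)  # symbol name -> definition


def resolve(symbol: str, scope: list):
    """Return (defining object, definition) for the first match in the scope."""
    for obj in scope:
        if symbol in obj.dynsym:
            return obj.name, obj.dynsym[symbol]
    raise LookupError(f"undefined symbol: {symbol}")


# Hypothetical objects mirroring the print_error example.
binary = LoadedObject("app", {"print_error": "void print_error(int, char*)"})
plugin = LoadedObject("plugin.so", {"print_error": "int print_error(char*)"})
libc = LoadedObject("libc.so.6", {"strcpy": "char *strcpy(char*, const char*)"})

# The binary comes first in every lookup scope, so a lookup of print_error
# finds the binary's definition first, even when the plugin asked for it.
scope = [binary, plugin, libc]
print(resolve("print_error", scope))
print(resolve("strcpy", scope))
```

Because the search stops at the first match, the plugin's own exported print_error is never reached as long as the binary defines a symbol of the same name.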
With the symbol exported, the plugin's print_error is interposed by the symbol in the binary, which has an incompatible signature, and that will probably lead to a crash.

Solutions for this

The right solution is to just not export symbols that you don't explicitly want others to use. That's The Right Way™ for many reasons: it speeds up the library/plugin (no symbol lookup has to be performed for internal uses of that function), you don't pollute the namespace, you avoid the possible problems outlined above, and finally, the symbol is not part of your ABI, which means you can change it any way you like without breaking any dependent applications.

There's also a quick hack that can be used to avoid the above problem. When -Bsymbolic is specified on the linker command line when linking a library, the lookup scope of that library is changed so that the library itself is in the first spot, followed by the binary. This doesn't give you any of the other advantages though, and the program can't interpose your symbol, even if it intentionally wants to do so. Consequently -Bsymbolic should be avoided, unless you're really, really lazy ;-).

Further reading

The ELF Specification
Definitive resource on libraries: Ulrich Drepper's DSO Howto
Josselin Mouette's "Packaging shared libraries" talk at Debconf6
The Debian Library Packaging Guide by Junichi Uekawa