AKADEMIA GÓRNICZO-HUTNICZA IM. STANISŁAWA STASZICA W KRAKOWIE

# Introduction to Computer Science

NASM Assembler

Version: 2023

Marek Wilkus, Ph.D.
http://home.agh.edu.pl/~mwilkus
Faculty of Metallurgy and Industrial Computer Science
AGH UST Kraków

# **History introduction**

The first programmable computers were programmed by adapting the hardware memory to the program, usually with plugs or switches. Once the program memory became a part of the random-access memory, it became possible to program the computer by entering commands directly at specific memory addresses. The programmer had to know what each bit pattern does, write the program down in their own words, then translate it to machine code and enter the commands on the "console".

#### Introduction

- As software grew more complex, it became necessary to **encode the program** in a way that makes it easier to modify or adapt.
- The machine-specific commands were described using handy abbreviations called **mnemonics**. A mnemonic usually takes some arguments, such as a value to put into a register. A mnemonic together with its arguments forms a one-line assembly command.

# The objective of this lecture

- Introduce the general principles of Intel x86 assembly,
- Present some of the problems encountered in assembly programming,
- Show the general structure of assembly programs, and how to make one.
- Yes, Intel is not a good assembly language to begin with, but the platform is widely used.

#### **Intel CPUs**

- In the 70s, Intel created the 8088 and 8086 processors:
  - 16-bit registers, 1MB of address space.
- 1982: Intel 80286:
  - 16MB of address space, more instructions.
- 1985: Intel 80386:
  - 4GB of address space, 32-bit registers, protected mode, more instructions.
- 1989: Intel 486:
  - More instructions, 32-bit registers, built-in FPU.
- 1993: Pentium:
  - Incremental changes, power management.
- 1997, 1998, 1999: MMX, 3DNow!, SSE instruction set extensions for CPUs.
- 2003: 64-bit x86:
  - 64-bit registers.
#### Intel architecture

- Lots of backward compatibility.
- Quite troublesome, because it was never redesigned from scratch: each generation extends the previous ones.
- The initial architecture had the following registers:
  - AX, BX, CX, DX: general-purpose 16-bit registers (each can be used as two 8-bit registers: AH, AL, BH, BL, etc.),
  - SI and DI: general-purpose 16-bit registers, intended to be used as indexes or pointers,
  - BP and SP: pointer registers (Base Pointer, Stack Pointer),
  - CS, DS, ES, SS: segment registers (we will not use them),
  - IP: Instruction Pointer,
  - Flags register.

# 1MB of address space vs 16-bit register?

- With 16 bits, you can address only $2^{16}$ bytes = 64kB of memory!
- How did they address 1MB, which needs 20-bit addresses ($2^{20}$ bytes)?
- Selector + Offset:
  - The first group of bits selects a **segment**.
  - The second is an **offset** within that segment.
- This way, both of these numbers count from 0 upwards.
- Scrolling through the memory, the selector stays the same for a longer time, while the offset changes much faster, wrapping around again and again as the selector slowly increases.
- For 20-bit addresses, we need 16 bits for the offset and 4 bits for the segment selector ($2^4=16$, $2^{16}=65536$, $65536 \cdot 16 = 1048576$ bytes = 1MB).
- In the real 8086, the segment registers are also 16 bits wide: the CPU computes the physical address as segment · 16 + offset, so consecutive segments overlap.

# Summing up Intel's trick

- It's good that we will use a 64-bit assembler in this course.
- If you program something under 64kB, you don't need to think about segments.
- If it is larger, you constantly need to make sure you have chosen the correct segment and, if not, switch back and forth, which is troublesome.

#### The Assembler

- Because we are working directly with the CPU's commands, this is a really fast language.
- On the other hand, it is more difficult to put together a complex program, and the code is platform-specific.
- Being platform-specific matters when looking for information on how to solve a specific problem!
- There are many assemblers for the same architecture. They differ in the mnemonics they use, in the order of operands, or in macro-mnemonics (mnemonics which are substituted by specific blocks of instructions).
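To illustrate how assemblers for the same architecture can differ, the sketch below writes two instructions in NASM's Intel syntax and repeats them, in comments, in the AT&T syntax that the GNU assembler uses by default. The two instructions are an arbitrary choice made for this illustration.

```nasm
; NASM (Intel syntax): destination first, plain register names and constants
        mov     rax, 60             ; rax = 60
        xor     rdi, rdi            ; rdi = 0

; The same two instructions for the GNU assembler (AT&T syntax):
; source first, '%' before registers, '$' before constants, 'q' size suffix
;       movq    $60, %rax
;       xorq    %rdi, %rdi
```

Both forms produce the same machine code, so examples found online have to be checked for the syntax they use before they are copied into a NASM file.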
# The assembly process

- First, the mnemonics are detected and translated to machine code, producing an "object file".
- Note that if you make a mistake, but the mnemonics are correct and their arguments are possible (even if they are total nonsense), no error will be shown.
- Then, the object files are joined by the **linker**, proper headers and designations are added, and the **executable** file you can run is generated.

# Netwide Assembler (NASM)

- An assembler/disassembler for the x86-64 architecture.
- Operates under Windows, DOS, Linux and a few other operating systems.
- Outputs **object files**, which then have to be linked into the executable.
- Assembler code files traditionally have an .asm extension.
- Object files: .o.
- Executables in Unix have no extension, or have the default name a.out (even though they follow the modern ELF standard rather than the old a.out executable format).

#### **Build the program.asm in Linux:**

- nasm -felf64 program.asm
- ld program.o
- ./a.out
- Or as a one-liner: nasm -felf64 program.asm; ld program.o; ./a.out

# Assembler program template:

Fields of a source line: labels, instructions, operands, comments.

#### **Mnemonics**

- **mov** x, y: moves y to x. y can be a constant, a register or a memory location. Both operands must be the same size.
- **xor** x, y: XORs x with y, writing the result to x. **and**, **or**, ... work the same way,
- ...and so do **add** and **sub** (add or subtract).
- **syscall**: a special instruction that calls an operating system routine.
  - The **routine's number** is stored in rax.
  - The arguments are stored in other registers.
- **db** (declare bytes): puts bytes in the memory. The label then works as the address of these bytes.

#### The program

- Directives: these tell the assembler about the general conditions for assembling the program.
- Labels: work as "checkpoints" in the program's memory. The program can use their address (if we label some data) or jump to them (if we label a part of the code).
- Sections: contain instructions and data. There should be a code section and a data section, as in the example.
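The directives, labels, sections and mnemonics described above are easier to follow on a concrete listing, so here is a minimal sketch of a complete program for 64-bit Linux. The file name hello.asm, the label message and the printed text are assumptions made for this illustration; the routine numbers 1 (write) and 60 (exit) come from the Linux x86-64 system call interface.

```nasm
; hello.asm - build and run:
;   nasm -felf64 hello.asm; ld hello.o; ./a.out
        global  _start              ; directive: make the entry point visible to ld

        section .text               ; code section
_start:                             ; label: execution starts here
        mov     rax, 1              ; routine number 1 = write
        mov     rdi, 1              ; 1st argument: file descriptor 1 (stdout)
        mov     rsi, message        ; 2nd argument: address of the bytes
        mov     rdx, 14             ; 3rd argument: number of bytes
        syscall                     ; call the operating system

        mov     rax, 60             ; routine number 60 = exit
        xor     rdi, rdi            ; exit code 0
        syscall

        section .data               ; data section
message: db     "Hello, world!", 10 ; 13 characters plus a newline (10)
```

The program prints `Hello, world!` and exits with code 0. Note that the length passed in rdx has to be counted by hand; nothing checks that it matches the declared bytes.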
#### The registers

- We used the following registers: rax, rdi, rsi, rdx. There are also rcx, rbx, rsp and rbp.
- There are **16** basic integer registers available; the 8 remaining registers are simply called r8, r9, r10, ... r14, r15.

# Intel's backward compatibility

- Because of compatibility...
- The lowest 32 bits of each of the first 8 64-bit registers are available as eax, edi, esi, ...
- The lowest 16 bits of these are also available as ax, di, si, etc.
- ...and the lowest 8 bits can be used too: al, dl, sil, etc.
- ...and then bits 8-15 of ax, cx, dx and bx are available as ah, ch, dh, bh.

#### Memory operands

- As an operand of some commands, we can use data from the memory. Then, registers hold the **address** and we instruct the assembler to obtain data from that memory location.
- It is very rare for a command to allow two memory operands.
- The following operands can be used:
  - [x]: an arbitrary memory address, where x is a number.
  - [reg]: the memory address stored in a register.
  - [reg+x]: base register + displacement (offset).
  - [reg+reg\*s], where s=1, 2, 4 or 8: useful when navigating data structures.
  - [reg+reg\*s+x]: as above plus a displacement, e.g. an offset within the structure.

# Commands used in this example

- inc: increments the register.
- **cmp**: compares two operands. The result can be checked by...
- **jne** (jump if not equal): jumps to the label if the comparison gave a "not equal" result.
- jng: jump if not greater than.
- resb: reserves one or more bytes (here 44).
- equ: defines a constant.

#### .bss and .data segments

- Notice that we used the .bss segment in the second example, and .data in the first one.
- Generally, the data segment is used for initialized memory, while bss is used for uninitialized variables that we will overwrite during the program's execution.
- Segments of the program: .bss (uninitialized data), .data (initialized data), .text (code).
- resb vs db:
  - resb reserves an uninitialized area.
  - db defines memory and initializes it with a value.
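To show inc, cmp, jne, equ, resb and a [reg] memory operand working together, the following sketch fills a buffer in .bss with the letters A to Z and prints it. It is not the lecture's own example: the file name alphabet.asm and the names count, buffer and fill are hypothetical and chosen only for this illustration.

```nasm
; alphabet.asm - build and run:
;   nasm -felf64 alphabet.asm; ld alphabet.o; ./a.out
count   equ     26                  ; constant: number of letters

        global  _start
        section .text
_start:
        mov     rdx, buffer         ; rdx = address of the next byte to fill
        mov     al, 'A'             ; al  = current letter
        xor     rcx, rcx            ; rcx = loop counter, starts at 0
fill:
        mov     [rdx], al           ; store one byte (size implied by al)
        inc     rdx                 ; next address
        inc     al                  ; next letter
        inc     rcx                 ; counter++
        cmp     rcx, count          ; all letters stored?
        jne     fill                ; no - loop again

        mov     al, 10              ; newline character
        mov     [rdx], al           ; store it after the last letter

        mov     rax, 1              ; write(1, buffer, count + 1)
        mov     rdi, 1
        mov     rsi, buffer
        mov     rdx, count + 1
        syscall

        mov     rax, 60             ; exit(0)
        xor     rdi, rdi
        syscall

        section .bss
buffer: resb    count + 1           ; uninitialized: 26 letters + newline
```

The buffer goes to .bss because its contents are produced at run time; a db in .data would only make sense if the bytes were known when the program is assembled.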
#### Data "types"

- B: byte, 1 byte,
- W: word, 2 bytes,
- D: double word, 4 bytes,
- Q: quad word, 8 bytes.
- So we can use dd, dq, resq, etc.
- Now for the future: x86 stores multi-byte values in memory in little-endian order. The least significant byte goes to the lowest address, so in a memory dump the bytes appear in reverse order compared with how we write the value in a register!

#### Mov-ing arbitrary values to the memory

We write:

```
mov [rdx], 10
```

This will end with an error.

- The assembler does not know what this 10 is. Byte? Word? Double? Quad?
- We must tell it what size we want:

```
mov byte [rdx], 10
```

The usual size specifiers are byte, word, dword, qword and tword (10 bytes).

# Mov-ing registers and the memory

However, the size can be inferred from the register size:

```
mov eax, [rdx]
```

- We know that eax is 4 bytes wide. So we take 4 bytes from the location pointed to by rdx, and copy them to the eax register.
- And now we should remember the endianness problem.

# Memory limitations

- Remember that in assembly there are no safeguards against overwriting one part of memory with another.
- If we declare two 2-byte values and write a 4-byte value into the first one, the extra 2 bytes will overwrite the next variable without any warning. That's why programming in assembler requires careful planning.

# Assembly is dangerous

While this overwriting can be used for some purposes, Intel's assembler becomes dangerous, as it also allows a multi-byte access at an arbitrary, unaligned address.

- Now the memory structure we made has lost all sense.
- Some architectures simply will not let the programmer access an X-byte variable at an address which is not a multiple of X.

# **Using C libraries**

```
bits 64
; Program counts 0..9
global main
extern printf

section .text
main:
        ; code goes here
        mov r15, 0          ; counter
        mov r14, 10         ; maximum
loop:
        mov rdi, format     ; printf with format...
        mov rsi, r15        ; print the r15
        mov rax, 0          ; al = 0: no floating-point arguments
        call printf         ; call printf
        inc r15             ; r15++
        cmp r15, r14        ; is r15==r14?
        jne loop            ; no - loop again
        ; exit routine
        mov rax, 60         ; system call for exit
        xor rdi, rdi        ; exit code 0
        syscall             ; system call to exit

section .data
        ; define constants here
format: db "w=%ld",10,0     ; printf needs the terminating 0

section .bss
        ; define uninitialized variables here
```

#### **Using C libraries**

To build and run:

```
nasm -felf64 -l count.lst count.asm; gcc -no-pie -o count count.o; ./count
```

Notice a few changes: **-no-pie** means that the executable is not position-independent. This way it is possible to jump to almost any place within the executable, including the linked printf.

#### Now a small change ;-) ...

# What happened?

- The external function used the registers we were using for something else and overwrote their values.
- Do we need to use memory?
- There is a place for temporarily storing such data, and it is called the **stack**.

# Stack as the data structure

- There are two operations: **push** onto the stack and **pop** from the stack.
- We can push and pop values and registers.
- Initially, the stack contains the argument count and the addresses of the program name and its arguments.
- As in a physical stack of objects, the last thing that gets in is the first to go out.

# Let's hold the registers on the stack

Notice the order in which we push and pop these registers!

# Stack pointer

- Intuitively, one would expect a stack to grow "upwards": more items on the stack → a higher value of the pointer to the top.
- On Intel, the start of the stack is pre-declared and it grows backwards, meaning that pushing a 64-bit register onto it moves the top of the stack 8 bytes lower.
- So the stack pointer (rsp register) **decreases**.
- By convention, the **base pointer** (rbp register) points to the start of the stack area used by the current function.

# Stack requirements

- When we call a function, the stack pointer must be **aligned** to a 16-byte boundary.
- The stack is aligned before making a call to the function?
Great, but calling a function throws it out of alignment, because the call pushes the 8-byte return address onto the stack.
- We have to **prepare** the stack before calling functions, or we will get the...

```
ncbx@m4800:~/Publiczny/0Dydaktyka/ICS/nasm$ ./FPU
Segmentation fault
```

# Preparing the stack

- So, to use the stack reliably, we have to:
  - Store the information about where the stack begins somewhere (the beginning of the stack is a good place, it is always there!),
  - ...and then set a new base pointer to the new beginning of the stack.

```
extern printf

section .text
main:
        push rbp
        mov rbp, rsp ; prepare the stack
```

Most external functions work properly only if the stack is set up this way. Otherwise it may not be possible to return from called functions!

# Preparing the stack

- After the stack is prepared, it is still wise to make sure there is no solitary **push** left before a call, as it would shift the stack pointer 8 bytes lower, where we want a multiple of 16.
- A quick hack is to push (and later pop) something else as well. There is usually something we want to protect from being clobbered by the function call.
- In the next example, I balanced this problem by making three stack operations before the function call, aligning the stack (misaligned by 8 bytes) with 8+8+8 bytes.

# Fibonacci sequence

$F_1=0$, $F_2=1$, $F_n=F_{n-2}+F_{n-1}$ (when $n>2$, of course)

- It can be calculated iteratively or recursively.
- The elements grow very fast, so they will overflow a register quickly.
- Parts of this sequence appear surprisingly frequently in mathematics, physics and modelling.

# Floating point operations

# Operation of an FPU

- The base x86 integer instructions cannot do floating-point arithmetic at all; that is the job of a separate unit, the FPU.
- All FPU operations have to be performed using a specific strategy:
- If the operands are in the registers, store them somewhere else, e.g. in the memory.
- Load the numbers from the program's constants, data or memory onto the FPU stack.
- Perform the needed operation/operations.
- Pop the results back from the FPU stack. Again, not to registers!
- The FPU stack has a capacity of 8 operands.

#### But... why?

#### **FPU** caveats

- Depending on the architecture, floating-point numbers are 32 or 64 bits wide.
- Internally, the FPU uses 80 bits. The extra bits help reduce rounding errors in intermediate results.
- Some rare architectures offer even wider registers.
- There are multiple floating-point representations for the FPU, the CPU and printf-like functions, and the CPU has instructions to convert between them.
- Sometimes a floating-point number has to be put in a specific register for a function to treat it as a floating-point value.

# Floating point computations end with errors

- ...the objective is to minimize their influence on the result!
- Typical errors:
  - Some numbers are **not represented properly** in the system (you cannot put an entire *π* into the system!).
  - You may run out of precision, or numbers are misrepresented.
  - If the computation program runs in multiple passes, and each pass adds more detail to the result, the computation may be **prematurely stopped** because of time or roundoff constraints.
  - Poor mathematical assumptions (especially in simulations!), like "let the friction be zero".
  - Human errors in algorithms.

#### The most important commands

- Once the data has been loaded as onto a stack, FPU commands address the same data as registers (ST0, ST1, ...).
- finit: initialize the FPU.
- fld ...: push (load) the number onto the FPU stack.
- fstp: pop the number from the FPU stack, storing it as a real number in the memory (fst stores without popping).
- Many arithmetic instructions have a "p" suffix, which means "perform the operation and pop the FPU stack".

#### **Arithmetic**

- fsqrt: takes the square root of the ST0 FPU register.
- fmul: multiplication (fdiv: division)
  - One operand: multiply ST0 by the operand and store the result in ST0 (the operand is a memory location [...], e.g. a constant or variable defined there).
  - Two operands: multiply the numbers by each other and store the result in the first one. But one of the operands must be ST0.
- fmulp: pops the stack after multiplication.
- fsin, fcos: operate on ST0, write to ST0.
- fadd, fsub: work like fdiv and fmul.

# **Example:**

#### Result:

Taking the square root of the number 28 times (we ran out of precision, so we got 1.000000)

# A few important parts:

Initialize the stack to point at a proper boundary:

```
main:
        push rbp
        mov rbp, rsp ; prepare the stack
```

Because we're dealing with floating point numbers, printf now expects the data in the wide (128-bit) xmm0 register:

```
movsd xmm0, qword [number] ; load the number into the xmm0 register
mov rdi, format            ; printf with format...
mov al, 1                  ; one floating-point argument
call printf                ; call printf
```

# A few important parts...

In the main calculation, we convert everything from/to qword to get rid of the FPU's precision errors:

```
fld qword [number] ; push the number onto the FPU stack
fsqrt              ; perform x=sqrt(x) on the FPU stack
; ...
```

• Constants and data for FPU operation:

```
section .data
number: dq 123.45 ; 1 qword for argument
format: db "v=%f",10,0

section .bss
result: resq 1 ; 1 qword for result
```

# Stack or register?

- The FPU is externally filled/emptied as a stack.
- The numbers can be internally processed as a set of registers.
- However, it is implemented as a set of shift registers, each holding an 80-bit number.
- It means that if we load (fld) two numbers, the last one is always ST0.
- Now: while it is possible to "shift left" the stack to the previously pushed value, pushing any value next would try to write that value over ST0.
- This will shift the stack properly, but destroy the ST0 contents!

#### So one more time

- FLD: load into ST0; the previous ST0 becomes ST1, ST1 becomes ST2, etc.
- FILD: load into ST0 as an integer.
- FLDPI: load Pi into ST0.
- FST: store ST0 into the operand (memory address or ST register).
- FSTP: as above, but pop ST0 from the stack.
- FIST: like FST, but converts the number to an integer.
- FISTP: as above, but pops the value.
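The load, compute and store cycle summarized above can be put together into one small program. This is a minimal sketch and not the lecture's original listing: the file name fpusqrt.asm and the input value 16.0 are assumptions chosen so that the result is easy to check by hand.

```nasm
; fpusqrt.asm - build and run:
;   nasm -felf64 fpusqrt.asm; gcc -no-pie -o fpusqrt fpusqrt.o; ./fpusqrt
        global  main
        extern  printf

        section .text
main:
        push    rbp                     ; prepare the stack; this push also
        mov     rbp, rsp                ; brings rsp back to a 16-byte boundary

        finit                           ; initialize the FPU
        fld     qword [number]          ; push the number onto the FPU stack
        fsqrt                           ; ST0 = sqrt(ST0)
        fstp    qword [result]          ; pop ST0 into memory as a qword

        movsd   xmm0, qword [result]    ; printf takes a double in xmm0
        mov     rdi, format             ; 1st argument: format string
        mov     al, 1                   ; one floating-point argument
        call    printf

        xor     eax, eax                ; return value of main = 0
        pop     rbp                     ; restore the caller's base pointer
        ret

        section .data
number: dq      16.0                    ; input value (assumed for this sketch)
format: db      "v=%f", 10, 0

        section .bss
result: resq    1                       ; 1 qword for the result
```

The program prints `v=4.000000`. The value travels from memory to the FPU stack, back to memory, and only then into xmm0, because there is no instruction that moves data directly between the FPU stack and the other registers.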
#### Other useful instructions

- FABS: absolute value of ST0.
- FCHS: change the sign of ST0.
- FRNDINT: round ST0 to an integer.
- FINIT: used again after an earlier FINIT, it resets the FPU completely, including clearing the stack.
- FYL2X: base-2 logarithm: ST1=ST1\*log₂(ST0); ST0 is then popped, so the result ends up on top of the stack.
- FCOM: compare two operands; one of them is always ST0.

#### FCOM considerations

- The comparison is made in the FPU, while the code is executed by the CPU.
- The result of the comparison has to be **transferred** from the FPU status register to the CPU's status register:

```
fcom      ; compare
fstsw ax  ; store FPU's status register to AX
sahf      ; store AH register to CPU flags
```

# SSE

#### **SSE Extension**

- Streaming SIMD Extensions.
- Introduced in 1999 with the Pentium III processor.
- Allows operations to be performed on 4 floats at once (packed in special 128-bit XMM registers).
- Or on 2 doubles, or 2 floats stored as doubles.
- Applications:
  - Multimedia (en/decoding),
  - Signal processing (SSE2 has DSP instructions),
  - 3D graphics,
  - Scientific computation.

#### XMM registers

- 128 bits wide,
- Initially 8 of them, 16 in the 64-bit architecture,
- Can keep 4 32-bit floats,
- In SSE2, it is also possible to keep and process two 64-bit doubles, two 64-bit integers or four 32-bit integers.
- More rarely, they are used to keep 8 16-bit integers or 16 8-bit integers.

#### **SSE instructions**

- There are two kinds of instructions:
- Packed: perform the same operation on each of the numbers in the packed register (example: MULPS):

| 1 | * | 9 | = | 9 |
|---|---|---|---|----|
| 2 | * | 8 | = | 16 |
| 3 | * | 7 | = | 21 |
| 4 | * | 6 | = | 24 |

- Scalar: only the first number is processed (MULSS):

| 1 | * | 9 | = | 9 |
|---|---|---|---|---|
| 2 |   | 8 |   | 2 |
| 3 |   | 7 |   | 3 |
| 4 |   | 6 |   | 4 |

#### SSE: Example

#### **SSE: Result:**

```
ncbx@m4800:/tmp$ nasm -felf64 -l sseasm.lst sseasm.asm; gcc -no-pie -o ...
v=6.000000 15.000000
```

Two floating-point numbers get multiplied with a single command.
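A packed multiplication of two doubles can be sketched as a complete program as follows. This is not the lecture's sseasm.asm: the file name ssemul.asm, the labels a and b, and the operand values are assumptions chosen so that the products are 6 and 15.

```nasm
; ssemul.asm - build and run:
;   nasm -felf64 ssemul.asm; gcc -no-pie -o ssemul ssemul.o; ./ssemul
        global  main
        extern  printf

        section .text
main:
        push    rbp                 ; prepare the stack
        mov     rbp, rsp

        movupd  xmm0, [a]           ; pack two doubles: xmm0 = { 2.0, 3.0 }
        movupd  xmm1, [b]           ;                   xmm1 = { 3.0, 5.0 }
        mulpd   xmm0, xmm1          ; both products at once: { 6.0, 15.0 }

        movapd  xmm1, xmm0          ; copy the packed result
        unpckhpd xmm1, xmm1         ; unpack: higher double into the low half
        mov     rdi, format         ; printf(format, low of xmm0, low of xmm1)
        mov     al, 2               ; two floating-point arguments
        call    printf

        xor     eax, eax            ; return value of main = 0
        pop     rbp
        ret

        section .data
a:      dq      2.0, 3.0            ; first pair of doubles (assumed values)
b:      dq      3.0, 5.0            ; second pair of doubles (assumed values)
format: db      "v=%f %f", 10, 0
```

The program prints `v=6.000000 15.000000`. The unaligned moves (movupd) are used for loading because nothing in this sketch guarantees that a and b sit on a 16-byte boundary.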
```
mulpd xmm0, xmm1 ; perform the operation
```

#### **SSE: Important things**

- It is necessary to pack and unpack values before/after executing SSE instructions.
- The types must be maintained at all times.
- However, you can use e.g. double-based calculations for floats if you align them properly.
- There are instructions for aligned and unaligned data (like movaps/movups for aligned/unaligned singles). This way it is possible to lay singles out so that they are treated as doubles.

#### **SSE: Arithmetic operations**

- MULPS, MULPD: packed multiplication.
- MULSS, MULSD: scalar multiplication.
- ADD[P/S][S/D]: addition.
- SUB[P/S][S/D]: subtraction.
- SQRT[P/S][S/D]: square root.
- WARNING: the single-precision (..S) variants are guaranteed to work on every CPU with SSE. The double-precision (..D) operations are available only in newer CPUs (>= Pentium 4).

# SSE: Packing/unpacking

- MOVUPS/MOVUPD: move unaligned data as floats/doubles.
- MOVAPS/MOVAPD: move aligned data as floats/doubles.
- UNPCKHPD: unpack the higher double.
- UNPCKHPS: unpack the higher float.
- UNPCKLPD/UNPCKLPS: the same for the lower double/float.

#### Thank you for your attention