CSE 351 Notes

1. Integers and Bitwise Operations
2. Floating Point
3. x86 Assembly
4. GDB
- 4.1. Debugging commands
- 4.2. TUI
5. Memory

These are Mark Polyakov's notes from taking Computer Science 351, The Hardware/Software Interface, at the University of Washington during Autumn 2020.

1 Integers and Bitwise Operations

1.1 Maximums and minimums

Where $w$ is word width:

Umax: $2^w-1$
Umin: 0
Tmax: $2^{w-1}-1$
Tmin: $-2^{w-1}$

1.2 Two's complement; negative numbers

The first bit is 1 when negative, and if so, the value of the number is $-2^{w-1}$ plus the value of the remaining bits. Inverting a number is ~b+1.

Casting a number between signed and unsigned reinterprets the bits in-place.

Two's complement numbers with width w can be between $-2^w$ and $2^w-1$, inclusive. It is evenly distributed across positive and negative in the sense that "There are an equal number of nonnegative numbers and nonzero negative numbers representable in a given

1.3 Bitwise Operations

Most bitwise operations don't care about the type, with the primary exception of >>: If you right-shift a signed number, rather than filling with zeroes from the left, it fills with whatever the leftmost bit was. So 0b10010001 >> 3 = 0b11110010.

2 Floating Point

IEEE has three parts:

Sign: Single bit, set if negative.
Exponent: Scientific notation-esque power of two. To support negative exponents, the actual exponent is equal to the exponent in the binary representation, minus a "bias" which is equal to $2^{w-1}-1$ where $w$ is the width of the exponent, in bits. This lets us have negative exponents. They could just as easily have used two's complement, but didn't. So if E is 0b10010000, then we will multiply the mantissa by $2^{17}$.
Mantissa: The "meat" of the number. The actual mantissa that gets multiplied by the exponent is 0b1.M where M is the bits of the mantissa.

More formally: If S is the sign bit, E is the exponent in the binary representation, M is the mantissa in the binary representation, w is the bit width of the exponent, and m is the bit width of the mantissa, then the number being represented is \[S*2^{E-(2^w-1)}*(1+M*2^{-m})\].

Denormalized numbers are the very smallest. They have a mantissa value without an implicit 1. The implicit one saves 1 bit most of the time, because we know the number is nonzero. But if the exponent is all zeroes, we would be restricted to the littlest value $2^{-(w-1)}$ (eg, $2^{-127}$). But if we allowed leading zeroes in the mantissa, we could simulate smaller exponents, at the cost of thinner mantissas! And indeed, that's exactly what happens when the exponent is all zeroes.

But there's a denormalization problem: If we assume a leading zero when the exponent is the smallest value, then there's a "gap" between $2^{-(w-1)}$ and $2^{-(w-1)+1}$, because one exponent assumes a leading one, but then the next, smaller exponent assumes a leading zero. This is resolved by treating an E of all zeroes (which would seem to logically correspond to $2^{-127}$ for 8-bit exponent) as $2^{-(w-1)+1}$ (eg, $2^{-126}$), ie, the same exponent as when E=1.

Denormalization lets us access numbers $2^{-23}$ the size of the smallest normalized number, so they're definitely worth it!

But is it worth having the gap? Why not just always assume a leading zero in the mantissa and avoid the confusion? This is equivalent to shrinking the mantissa by one bit, for all numbers in the normal range ($\geq2^{-(w-1)+1}$). Because the vast majority of common numbers are normal, it's better to make the tradeoff of losing one level of exponentiation (because the two smallest Es correspond to the same exponent) than to losing one mantissa bit over the commonly used range.

Special Values:

NaN: Exponent all ones, mantissa nonzero.
$+\infty$: Exponent all ones, sign zero, mantissa zero.
$-\infty$: Exponent all ones, sign one, mantissa zero.
Positive zero: All zero.
Negative zero: Sign one, all else zero.

It is true that the special values obscure other values that could exist, but they are so small or large that it's not a big deal.

Example: -7.375 to float. It is -0b111.011, so Sign = 1, Mantissa = 1.11011, Exponent = 2, so S=1, M=0b11011000…, and E=0b10000001, so 0b11000000111011000000000000000000 = 0xC0EC0000. The bits closest to the decimal point (of the mantissa) are most significant!

The largest finite floating value representable is \[\underbrace{(2^{m+1}-1)*2^{-m}}_{\text{Maximum 1.M}}*2^{\overbrace{(2^w-2)}^{\text{max expt}}-\overbrace{(2^{w-1}-1)}^{\text{bias}}}\]

3 x86 Assembly

We're learning AT&T syntax, also used in GNU.

3.1 Registers

Sizes:

%rax: 64-bit
%eax: 32-bit
%ax: 16-bit
%ah: Upper 8 bits of %ax
%al: Lower 8 bits of %ax
%r11: 64-bit (general purpose)
%r11d: 32-bit
%r11w: 16-bit
%r11l: 8-bit (no upper)

Some registers (si, di, sp, bp) do not have upper and lower 8-bit parts. Best reference (not mine):

3.2 Suffixes

b: Byte, 1 byte
w: Word, 2 bytes
l: Long, 4 bytes
q: Quad, 8 bytes

3.3 Instructions

Normal:

mov: Move arg 1 into arg 2
add: Add arg 1 to arg 2
sub: Subtract arg 1 from arg 2
or, and, xor, not
movs: Takes two length suffixes. Does a signed extension that takes a value of the first length and puts it into the second length.
lea: The first argument must be a memory operand, and the second must be a register (because you can't have both operands to any instruction be memory). lea, while primarily for pointer arithmetic, can also be used for arbitrary arithmetic
mul, imul: Unsigned and signed multiplication, respectively.
div, idiv: ^^^ division. Remainder stored into first operand.
cltq: aka cdqe in Intel, performs extension of eax into rax. Equiv to movlq %eax, %rax?

Flow Control:

cmp: Set condition flags equivalent to sub on the same arguments, but do not mutate the second argument.
test: Like cmp, but flags set according to and.

The following bits are set by many arithmetic functions:

ZF: Whether the result was zero (Zero Flag).
SF: Whether the result was negative (Sign Flag).
OF: Signed Overflow (Overflow Flag). Eg, 127+127 will trigger it, but not 127+(-5), even though the last case sets both SF and CF.
CF: Unsigned Overflow (Carry Flag). During subtraction, it means a borrow was required (i.e, the first argument is smaller than the second, unsigned-wise. So the CF is set when subtracting a negative from a positive).

The "jump" commands go to a certain label, while the "set" commands update a certain location (1-byte set only):

Perhaps untinuitively, jmpge, for example, doesn't actually check that the first operand of the preceding cmp was greater than or equal to the second operand – it checks that the result of subtraction is greater than or equal to zero.

3.4 Operands

%rdi: (for example) a register
$95: Immediate integer. May be negative, with signed stuff based on the context (expected argument type of the instruction, which is always inferrable from the instruction).
D(base,index,scale): Dereference the memory at base+index*scale+D (except for lea, which does not dereference). Scale must be 1, 2, 4, or 8. Note that D is not affected by scale.

3.5 Inline assembly in GCC

Introduced by writing asm or __asm__ (if using –std which disables un-prefixed GNU extensions):

asm("xorw $-1, %[output_name]"
        : "output_name" "+r" (c_var)
        : "input_name" "I" (other_c_var), "input_2" "r" (third_c_var)
        : "r16", "eax");

Each output operand has three parts (though the first is optional):

The symbolic name that can then be used inside the asm as %[sym_name]
A "constraint" with regards to how the value can be passed into the program. Here, + indicates that the variable is for both input and output, and the r indicates that it must be placed into a register. An = instead of a + makes it for output only.
A single c variable, in parenthesis for some reason, where the value should come from.

The inputs are similar, but no = or + needed in constraint. Useful constraints other than r include:

i: Immediate integer.
p: Memory address.
F: Immediate float.
m: "Memory operand" (not sure how it's passed in…maybe the same way as p, but it takes the address of the c variable passed in rather than assuming the c variable itself is a pointer?)

Finally, the last comma separated "clobbers" "list is other registers that the asm may dirty.

It's also possible to specify memory and stuff here using special syntax.

Full documentation: https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html

3.6 Procedure Calls

call (which does require a length suffix, eg callq?) jumps to the specified location in memory and begins executing from there. Call pushes the next instruction (after call) onto the stack. Then, ret pops from the stack then jumps to the address stored there. And that's how babies are made!

Arguments are in the following order:

"Diane's Silk Dress Costs $89.

Additional arguments are stored on the stack, with the first arguments pushed last, so that when popped they can be accessed first. Arguments are pushed before the return address (necessarily, because call must be invoked after any pushing).

%rax stores the return value.

These registers are all "calling conventions". These are GNU's.

Some registers are, by convention, "callee-saved", while others are "caller-saved". The callee-saved ones are expected to be pushed and popped from the stack at the beginning and end of any function that uses them. On the other hand, it's safe to mutate caller-saved registers whenever you want; just be aware that subroutines may mangle them.

Caller-saved: RAX, RCX, RDX, R8, R9, R10, R11 (ie, subroutines may mangle them)

Callee-saved: RBX, RBP, RDI, RSI, RSP, R12, R13, R14, R15 (all the non-caller-saved)

Historically, %rbp points to the beginning of the current stack frame (depending on convention, either the addres to jump to after the current procedure, or the first byte of either the saved stack variables or local variables.

4 GDB

4.1 Debugging commands

continue or c: Run to breakpoint.
step: Step until next line of C code. If no -g, keeps going until a function with debug symbols (not recommended!)
stepi: Step one assembly instruction.
next: Like step, but will not enter functions.
nexti: Like stepi, but doesn't step into call instructions.
finish: Run until ret.

4.2 TUI

Lets you view source, assembly, and/or registers, breakpoint locations, etc as you run.

Enter by running layout <layout>, with:

split: Asm + C
src
asm
regs

c-x 2 will toggle through the side-by-side modes. unfortunately, c-x 1 doesn't toggle.

run focus next to switch focus to the next window if necessary.

5 Memory

From high addresses to low addresses:

The Stack. The top of the stack is at the lowest address.
The Heap (malloc).
Static and global variables.
Literals. Unlike static, literal memory is read-only.
Instructions. Also read-only, but unlike literals, this part of memory is executable.

Convenient summary:

5.1 The Stack

The "top" of the stack (lowest address) is stored in %rsp. Usually push and pop (with appropriate suffixes) will automatically manipulate the pointer and add or remove data from the stack (pop places that data into the specified location). However, if multiple items need to be removed from the stack, it's totally legal to just addq $12, rsp to remove 12 bytes.

5.2 Caches

Several layers, from DRAM system memory down to close-to-the-chip SRAM caches. Main performance metrics:

Hit time: How long it takes to retrieve data on cache hit
Miss time: How long it takes to retrieve data from the next layer on cache miss, in addition to hit time.
Average Memory Access Time (AMAT): Takes into account both hit time, miss time, and hit rate. Equal to hitTime+(missRate*missTime).

A "block" is a piece of memory that can be moved into the cache in a single operation.

Different "placement" and "replacement" (eviction) schemes determine how memory is put into caches, and what data is removed from those caches to make space for the new data.

5.2.1 Direct-Mapped

Placement policy: Put block X from memory into slot X mod C, where C is the size of the cache in blocks.

Eviction policy: Remove whatever's at that address.

Very fast to implement, but not intelligent and multiple pieces of frequently accessed memory that have the same X mod C will clash.

Such a simple mapping scheme enables us to break memory addresses into three parts, assuming the block size is a power of two, and the cache stores a power-of-two number of blocks:

Offset: The least significant bits, the offset compared to the block. For a 64-byte block, this would be the lower 6 bytes.
Index: Next least significant bits above offset, equivalent to the address in the cache. If the cache stores 32 blocks, then this would be the next 5 bits of the address.
Tag: Removes any remaining ambiguity. Whenever a block is moved into the cache, its tag is stored, so that when the program subsequently tries to access that location in the cache, the CPU can verify that the block stored in the cache is the same one the client requested by verifying that the tag is the same. Using the numbers above for block and cache size, and 64-bit addresses, the tag would be the 53 most-significant bits of each address.

5.2.2 Set Associative Caches

Same cache address calculation as direct-mapped, but each "address" in the cache refers to multiple possible blocks in the cache. For each cache address, the cache stores multiple blocks, and the "tag" part of the address for each block. To check if a memory access is a cache hit, the cpu checks the tag value of the requested address against the tag value for each block under the correct address in the cache, and only if they all fail is it a cache miss.

Now the need for a replacement/eviction policy makes sense: We don't need to remove blocks from cache until multiple other blocks that map to the same cache address are introduced. LRU is a simple and effective policy, and while it might occasionally get expensive in a large database, on the cache level, if you can implement it in hardware, it's a pretty good strategy.

To sum up, a "cache line" (one block + metadata) in a set associative cache contains:

The tag part of the address, so that the CPU can ensure that the block in the cache is the same one the program requests.
A single validity bit, indicating whether the line is in use. I bet you could use a reserved tag address instead.
The block data itself.

5.3 Virtual memory

Each process thinks it has 64 bits of free memory! The fools! In reality, memory is split into pages (4KiB by default?), and the Memory Management Unit/MMU maintains a "page table" for each process. On the left side of the table are Virtual Page Numbers, and on the right side are Physical Page Numbers. The table also contains, for example, permissions (read/write/execute) and information about pages that are swapped out to disk.

The MSBs of the virtual address refer to the page number, and the LSBs refer to the page offset.

Although page tables are stored in memory, VPN-to-PPN translation is performed by hardware in the MMU. When the MMU encounters a VPN that is not mapped to a PPN, eg it's swapped out, then it triggers a fault and jumps to a handler in the kernel which loads the page into physical memory, updates the page table, then restarts the failed instruction, at which point the MMU does a normal lookup. The kernel can see and modify page tables; it's just memory!

Header bytes count as internal fragmentation for purposes of the exam. So, if you have an 8-byte header on each page, then all pages have at least 8 bytes of internal fragmentation.

5.3.1 Hierarchical Page Tables

If the process can theoretically access any element in its virtual memory, there needs to be a page table entry for every virtual page number. But that's a lot. If you have 4 KiB pages and a 64-bit address space, then that's…2⁶⁴ B/(4KiB) = 2⁵² entries! And you need a page table for each process, so this is clearly infeasible. The solution is a hierarchical page table; a modern x86 system will have 4 levels. Rather than being broken up into simply a page number and offset, addresses are broken up into 4 different page numbers and then a offset. Furthermore, the page tables are themselves stored as pages in memory, so they can be easily allocated and deallocated. There will be a greater number of tables at the inner levels than the outer ones. When a new VPN-PPN mapping is created, new page tables may need to be created.

Hierarchical Page Tables are how page tables can be sparse; because each outer table keeps track of whether each next inner table exists in some management bits.

5.3.2 Translation Lookaside Buffer

A cache for the page table. Each entry is about the same as a page table entry, but it is much like a CPU memory cache in that it has associativity and is a lot smaller than the page table. There's also only one level, so it avoids multiple memory lookups.

Just like with memory caches, the TLB splits up the virtual page number into tag bits and offset bits, based on the size of the TLB.

5.4 Allocators

5.4.1 Free List

Implicit Free List: Each block of memory (allocated or unallocated) stores its size in a header at the beginning of a block. To find an available block, you have to iterate through the blocks.
Explicit Free List: Each free block has a pointer to the next free block.

5.4.2 Strategies

First Fit: Look for the first block in memory that's at least as large as the requested size.
Next Fit: Look for the next block in memory, after the block that was most recently allocated, that's at least as large as the requested size.
Best Fit: Find the available block that is closest to the requested size.

5.4.3 Headers and Footers

Each allocated or free chunk of memory starts with a header and ends with a footer. Both header and footer contain the size of the chunk and whether it is free. "Boundary Tags" is the collective term for header and footer.