CSE 444 Notes

Table of Contents

These are Mark Polyakov's notes from taking Computer Science 444, Database System Internals, at the University of Washington during Spring 2020.

1 Logistics

1.1 Asking Questions

Either ask in chat or unmute yourself and interrupt. If you ask questions a lot, wait for him to call on you after unmuting.

1.2 Contact


2 Transactions

2.1 ACID

  • Atomicity
  • Consistency
  • Isolation
  • Durability

Atomicity and isolation are the important parts. Isolation is mostly synonymous with serialization; it's impossible to tell if you are running alone or concurrently. Atomicity means that the transaction either executes in its entirety or not at all.

2.2 View Serializability

You know asher and prancer and dancer and conflict serializability…but do you know view serializability? It's about what a transaction can "see", rather than precedence graphs. View serializability implies serializability but not conflict serializability.

We define two schedules of several transactions to be view equivalent when:

  • If T reads the initial value of an element, T' must as well.
  • If T performs the final write of an element (last in the schedule), T' must as well.
  • h

And it's view serializable if it's view equivalent to any serial schedule.

2.2.1 Difference from conflict serializability

Only transactions with blind writes (writes without a read) are in the narrow set of view serializable schedules that are not conflict serializable. Here's an example of a blind write that is conflict serializable:

2.3 Recoverability

To be recoverable, a schedule must be

  • Conflict Serializable
  • All of our "dependency" transactions have committed: Transactions that wrote a value we read.

Now we have two graphs to think about: The conflict serializability precedence graph, and the write-read dependency graph to track if all dependencies have committed. A directed edge from T1 to T2 exists if T1 reads the result of a write by T2. T1 cannot commit until all nodes it has directed edges to have committed.

Strict 2PL (do not release any locks until you release all locks, at the very end of the transaction) ensures recoverability; transactions release locks at the same time they commit, so another transaction will not be able to read something the first transaction wrote until the first one commits.

2.3.1 Why conflict serializability?

It's a terminology thing. You can still recover a schedule that is, say, view serializable, but not conflict serializable, as long as it commits transactions in the right order. It's just not "recoverable" by the textbook definition.

2.3.2 Avoid Cascading Aborts

Stricter than the recoverability requirement above: A transaction cannot read anything until the dependency transaction has committed. This transaction is recoverable but might have cascading aborts:


This one avoids cascading aborts:


Strict 2PL prevents cascading aborts.

2.4 2PL Proof

By contradiction: Assume there's a conflict precedence graph cycle of transactions T1..Tn, in that order, and we are using 2PL. T1 must release a lock on whatever element it conflicts with T2 on before T2. T2 must release a lock on some element before T3. T2 cannot release any locks before it has acquired the lock, so T1 must release the lock before T3 acquires a lock. And before T4 acquires a lock. etc etc, until Tn, and eventually T1 again, requires a lock.

Each transaction in the chain has to wait for the previous one, so T1 must wait for itself.

2.5 Steal and Force

  • Steal: Allow evictions and flushes of dirty pages from an uncommitted transaction. But then, if the transaction is aborted, you need to undo those writes!
  • No-steal: Dirty pages form uncommitted transactions remain in memory.
  • Force: When a transaction commits, flush all of its dirtied pages.
  • No-force: When a transaction commits, its dirtied pages may remain in the buffer pool's memory without a flush. But, somewhere on disk, you do need to have a record of what was changed, if you want recovery.

2.6 Timestamp

Each transaction gets a truly unique timestamp. The evaluated schedule must be equivalent to a serial schedule where the transactions are in the order given by the timestamps.

View serializability is used to determine whether an operation is allowed. Every time a transaction is about to read or write an element, check whether the last transaction messing with that element has a strictly less timestamp than us.

We do not need to worry about R/R conflicts – there's no such thing. When processing a write, we check against the last read time and write time. When processing a read, we check against the last write time only.

2.6.1 Thomas' Rule

By enforcing transactions to be serializable in the order they began, timestamp-based concurrency control can reject many schedules that lock-based control would allow. However, with timestamps, we can occasionally re-order writes.

If transaction T1 wants to write element X, but the last write timestamp on X, T2, is newer than our timestamp, an immediate analysis makes it seem like we have to abort. But, if the read timestamp on X is older than T1, we know two things about the serial schedule that we are trying to replicate:

  • T2 has overwritten X (blind write), so any future reads must have timestamps after T2 and expect to see what T2 wrote to X.
  • No transaction has already read us.

Nobody read us, nobody will, what's the point of the write at all? We can drop it.

Summary: If RT(X)<T1 and WT(X)>T1 when T1 wants to write to X, we can just not do the write.

2.6.2 Storing the read/write timestamps on the elements

Do you need to indefinitely store a timestamp for the last reader and writer for all elements? If you wanted to do tuple-based locking, for example (and tuple based locking is especially good for an optimistic locker, because collisions are much just likely), this could get expensive.

Can we just hold a mapping from element IDs to timestamps, then sometimes remove entries to indicate that the last read or write was "long ago"? Well, how long ago is long ago? Can we remove after a transaction has committed?


This would be allowed if we remove timestamps after commit, but is nonserializable.

A better solution is to remove timestamps after commit when the transaction committing has the oldest timestamp of all outstanding transactions.

2.6.3 Notation

The timestamp of the last write on element X.
Timestamp of last read on X.
Timestamp of transaction T
Boolean, whether the last transaction to write X has committed.

2.6.4 Abortions

To ensure recoverability at a major cost to performance, just enforce commits in the order of timestamps. To be a bit smarter, require that C(X)=TRUE when reading X or when writing X and WT(X)>TS(T), usually by delaying a read or write until C(X)=TRUE. We obviously have to enforce it on reads, otherwise we could read uncommitted data. On writes, if we are overwriting an old value (WT(X)<)

2.6.5 Summary

Timestamp scheduling ensures view serializability, avoids cascading aborts if you want, but does not handle phantom insertions.

2.7 Multiple Versions

A logical expansion of the timestamp system.

When a transaction writes X, redirect its write to a copy of X marked by the timestamp. When a transaction reads X, read the latest version prior to the reading transaction's timestamp.

In addition to storing the write timestamp and contents for each version of X, we need to check "legality of writes" by storing the highest timestamp of any transaction that read X. Abort any write with a timestamp more recent than the timestamp on a version with a highest read timestamp more recent than the attempted write. I.e, if someone newer than us has read a version older than us, we can't write.

We can delete a version when all transactions with a timestamp less than or equal to that version have committed.

2.7.1 Abortions

A commit bit, like C(X) in the timestamp system, will ensure that we only read a version once the transaction that created it has committed.

2.7.2 Phantoms

Not a problem! Even long-term storage of elements include a timestamp. When we insert a new tuple, it also has a timestamp on it – there's no un-timestamped copy of the tuple. A read with a timestamp before the insertion will not be able to access the newly read tuple because its version is too recent. Similarly with deletion, we basically store a tombstone as the latest version, so that old reads still read a valid tuple.

2.7.3 Observations

  • Only the transaction that created a version can write to it, ever.

2.8 Validation

Let transactions do whatever they want, then check consistency/serializability after commit.

A transaction does all its work, with writes cached locally per-transaction, during the "read phase". When the application commits, the "read" phase is over. The validation phase happens, and VAL(T) is set to the time that validation finished. Finally, the cached writes are applied to disk.

We keep track of which elements the transaction read and wrote (even though those writes aren't applied to disk until the very end). During the validation phase, we check if any ongoing transaction (not done with the write phase) wants to write

2.9 Performance

30% spent in the buffer pool, 30% acquiring locks, 28% on the WAL, and 12% everything else (i.e, doing the query!) according to the research paper by the guys who made H-Store. This is a real reason NoSQL can be so fast even on a single machine.

3 Crash Recovery

These are all types of "Write-Ahead Logs", where you log what's about to happen just before it happens.

3.1 Undo Log

STEAL and FORCE. Record when a transaction starts, commits, aborts, or updates an element. The update log is made when the update is performed in the buffer pool, before it is outputted to disk. The update log stores the value of the element before the transaction updated it.

To recover, move backwards in the log file, reversing all actions until all uncomitted transactions are undone.

Undo log records are idempotent, so it's safe to recover from the same log multiple times, or to recover from a crash that occurred during a previous recovery.

Undo all log entries by uncommitted transactions. We must scan the whole log file to do this consistently, though we could probably insert special entries into the log that indicate there are no outstanding transactions, so we don't need to continue scanning back further.

FORCE: We don't scan backwards past a commit, so all logged writes must be performed before the commit.

STEAL: Undoing updates is idempotent, so it's no big deal if intermediate outputs are written to disk.

3.1.1 Checkpoints

Usually, you have to scan through the entire undo log to check for very old uncommitted transactions. It's safe, however, to stop scanning once you've reached a point where all older log entries are from committed transactions; you won't undo any of them.

In practice, checkpointing is done periodically. In quiescent checkpointing, new transactions are disallowed, and once all outstanding transactions have committed, we insert a checkpoint, then resume the database. This is horrible: We have to pause all transactions.

A method of nonquiescent checkpointing: When the cron runs, insert a START_CKPT that contains the list of outstanding transactions into the WAL. Every time a transaction commits, insert an END_CKPT if all the transactions from the last START_CKPT have all committed. How far back do you go when undoing? Back to the START_CKPT corresponding to the most recent END_CKPT. We have to go back to the start, not the end, because transactions not listed in the START_CKPT may have begun between the START and END. We cannot go back to a START without an in-progress END because we will not know that all transactions involved in that START are done. When the conditions are followed, we go back to a point where we know all transactions that were outstanding, and we know they are all committed.

The END_CKPT is pedagogical: During recovery, we can just calculate where it should be from the COMMITS. Then, the START_CKPT are really just lists of all existing transactions at points in time.

Another way to determine where to start reading is to take the most recent START_CKPT, regardless of whether it is complete, and continue undoing until we see the START of all transactions in the START_CKPT.

PERSONAL: Periodically scan forwards from the last checkpoint, inserting a new checkpoint every time we are at a location where all transactions are committed.

3.1.2 Rollbacks without a crash

Each line of the WAL has a pointer to the last line corresponding to the same transaction. The transaction itself (in the catalog) has some pointer to the mot recent line related to it in the WAL. So we can go back and undo all the related stuff.

3.2 Redo Log

NO STEAL and NO FORCE. When writing things into the buffer pool, you also add to the WAL. They are not on-disk in the main disk. When you commit, store COMMIT in the WAL. When redoing, only apply updates from COMMIT'ed transactions.

NO STEAL: Can't output to disk before commit, because we'll have no way to undo it. Means memory could fill up if you write a ton, but that's usually faster than having to write to disk before a COMMIT.

NO FORCE: This is more like "optional force"; it's fine to force pages to disk right after commit. While STEAL is better than NO STEAL, it's the opposite for force: NO FORCE is better than FORCE.

3.2.1 Checkpointing

START_CKPT, like with undo log, contains the txnids of all outstanding transactions. However, instead of placing END_CKPT at commit, place END_CKPT when all blocks not synced with disk at the START_CKPT have been flushed. Now, an END_CKPT means that all transactions committed before the transactions listed in the corresponding START_CKPT are flushed to disk.

When recovering, we redo starting from the earliest START of any transaction mentioned in the START_CKPT corresponding to the most recent END_CKPT. This is the earliest possible time that a dirty block may have been dirtied.

PERSONAL: Have some daemon that is constantly writing things from the WAL to disk when the disk isn't too busy, and deletes the UPDATE from the log

3.2.2 Commits without a crash

In an undo log, we implement ROLLBACK using the log. But with a redo log, we could instead implement COMMIT by doing a selective redo on the relevant operations. In practice, though, you don't – just write things to disk when convenient.

3.3 Undo-Redo log

UPDATE entries contain both the new value and the old value. Depending on whether the update entries are before or after the commit, we might undo them or redo them?


A specific implemetation of the undo-redo log. Features include:

  • Fuzzy checkpointing; checkpointing happens as fast as you can read information from the catalog and buffer pool; you don't need to wait for transactions to commit or for pages to be flushed.
  • Works better with advanced concurrency control systems that have weird locking protocols.

3.4.1 Overview

  1. Analysis: Determine what parts of the log we might have to redo or undo.
  2. Redo: Get the database back to the state it was in when it crashed.
  3. Undo: Rollback uncomitted transactions. We keep track of how much we've undone, so that redo not only redoes the dirty pages but also redoes the partial undos if a crash happens during undo.

The weird part of ARIES is that you always redo. This means that at the end of the redo phase, dirty pages will be…dirty.

3.4.2 Checkpointing

There are START and END tokens in the WAL, just like undo and redo logs. However, the contents are very different. Starting at the moment after the START token is written, the dbms starts collecting:

  • The set of outstanding transactions and the lastLSN for each of those transactions. lastLSN is the most recent WAL update (or CLR) related to that transaction.
  • The set of dirty pages.

And when you believe you're done, you write the END token containing those two sets. Of course, transactions might start or stop, clean pages might be dirtied, and dirtied pages may be flushed between the START and the END. The only guarantee is that the two sets are supersets of the sets of outstanding transactions and dirty pages that were outstanding and dirty, respectively, at both START and END. It's also a subset of the outstanding/dirty transactions/pages that were outstanding/dirty at either START or END. Put more weakly, a superset of all outstanding transactions and dirty pages that were outstanding/dirty at the START and any point after or equal to END.

3.4.3 Analysis

Unlike simple undo or redo, we cannot start undoing nor redoing from any checkpoint. The reason we could start recovering at a START in undo and redo was because their checkpointing systems waiting to make the checkpoint until some condition was met, so we know the checkpoint is safe to recover from. Fuzzy checkpoints are constructed very quickly and simply tell us how messy everything is.

  1. Finding the redo point and list of dirty pages

    We must redo everything related to every dirty page at the time of the crash. The checkpoint tells us all pages that were dirty at both START and END. This includes many pages that are no longer dirty; oh well. There's no way to tell when a page is cleaned by the WAL, so we will find a superset of dirty pages; it's ok to redo all of them.

    We proceed from the lastest START that has a corresponding END, and use the set of dirty pages from the END. Scanning forwards through the log, whenever a page is written, add it to the dirty set. For each page in the set, look at its recLSN pointer, which indicates the WAL entry that dirtied it. (For pages that were added during analysis, not pages found in the checkpoint, the recLSN is the LSN of the WAL entry that caused us to add the page to the set). Find the minimium recLSN. We will redo from it. Also store the dirty pages list; we will use it to accelerate the redo.

  2. Finding the undo point; or really, the list of outstanding transactions

    Much like finding the redo point, except we search for outstanding transactions instead of dirty pages. Commits are stored in the WAL, unlike page flushes, so we can know exactly which transactions were outstanding at the crash.

    Start scanning forwards from the latest START with a corresponding END, and use the set of outstanding transactions from the END. Remove transactions when we see a transaction end marker (this indicates either a commit or a completed rollback). Add a transaction when we see an entry other than a commit, and set lastLSN for the transaction to that entry.

    The undo process, as we will soon see, does not actually involve a scan of the WAL, like the redo process does. Instead, we will start with the set of latest updates related to a transaction, then use their prevLSN pointers to move backwards in the log, skipping over irrelevant crud. So the undo part of the analysis phase is not to search for a point to undo from, but a set to undo from.

3.4.4 Redo

Starting at the redo point found during analysis, repeat all update and CLR actions (which we'll get to later) that are related to dirty pages and occurred after the page was last written to disk. We verify the last property by ensuring that:

  • The recLSN is newer than the LSN for the update, which indicates that the update was certainly written to disk (but does not guarantee that it was not written to disk, because we have a superset of the dirty pages).
  • The pageLSN, which is stored on-disk in the page and is the more correct version of the recLSN, is also older than the update's LSN. PERSONAL: The existance of a pageLSN means that we actually could get the exact set of dirty pages, rather than a subset, during the redo-finding pass. But because these are only stored on disk (and not in the WAL), this is extra expensive. Couldn't we store an entry to the log whenever a page is flushed, just like we store when a transaction is committed? Possible answer: We need to know exactly which transactions are outstanding so we don't undo committed transactions, so we must go to any expense to track them properly. This is not true for dirty pages. Usually the number of dirty pages >> number of transactions, which is another reason why we need to be more careful about performance with pages than transactions.

The pageLSN must be updated whenever a redo action is performed (undo and redo actions are written to disk immediately) and whenever a page is flushed to disk during normal operation.

Ensuring that we don't redo old actions (from before the page most recently became dirty) is just for efficiency. If the system crashes during recovery, the second recovery will start redoing from no further forwards than where we left off, so eventually all relevant actions from the redo forward will be undone. The optimization simply changes what counts as "relevant" on the this path from past to present.

3.4.5 Undo

Take the set of all lastLSN for outstanding transactions. Start with the most recent one (we must undo backwards). If it's an update:

  1. Write-ahead that we are about to undo something. This is in the form of a CLR, "Compensation Log Record". While normal update logs contain both the old value and the new value, so that we can redo or undo them, the CLR is "redo-only": It contains only the value that we are undoing to (the old value from the update that the CLR corresponds to). We'll get to what happens when you discover a CLR during the undo process soon – but you don't undo the CLR itself in any sense. In addition to the value, the CLR contains a nextUndoLSN pointer to the update just before the update that the CLR corresponds to. In other word, the prevLSN of the corresponding update, which may be null (if we just undid the first update of the transaction). It says what should be undone next after this CLR.
  2. Perform the action represented by the CLR. This means writing the old value from the update log into the page and updating the pageLSN (which is safe, because redo is already complete).
  3. Add the prevLSN of the update (the same LSN stored in the CLR) to the set of things to undo, unless prevLSN is null.
  4. If this is the first update in the transaction, write a transaction-end token to the WAL: Rollback successful!

If a CLR is encountered, then the system must have crashed while recovering/rolling back this transaction. Any other case is impossible: If the roll back process had not yet begun, there would be no CLR. If rollback was complete, there would be a transaction-end token in the WAL and this transaction would never have ended up in the set of outstanding transactions. It is possible that this was the last CLR corresponding to the transaction, which means (after the redo) that the disk is now in the fully rolled-back state for the transaction. The crash must have occurred either while actually undoing the update corresponding to this final CLR or just before/while writing the transaction-end token to the WAL. Either way, the system crashed before the undo was complete, which is defined as after the transaction-end token has been successfully written to disk.

If a crash occurs between steps 1 and 2, the undo during next recovery will still proceed normally. The lastLSN will point to the correct CLR because it's determined empirically during analysis. And this CLR which was interrupted during the second crash will be caried out by the redo phase.

What do we actually do when we discover a CLR during undo? We add its nextUndoLSN to the set of things we want to undo at some point. We don't actually modify any pages based on a CLR – it's "redo-only".

3.4.6 Log Flushing

Sometimes, when an entry is added to the log, it is not necessary to instantly write that log update to disk. Here's all the times the log is updated and whether or not it needs to be flushed immediately:

Operation Flush? Why
Update No The page is dirty, so the update will disappear from both the page and the log in case of a crash.
Undo/CLR No A CLR is just an update – it will get undone in a crash, along with the dirty page.
Output page Yes Duh. Optimization: Only flush up to pageLSN (last update or CLR to this page, updated already)
Input page No Duh, this is a read operation until we modify it.
Checkpoint No Checkpointing does not change the data on disk. Or, we can always use a different checkpoint.
Commit Yes Duh. Must flush full log (through the commit)
Rollback No Any uncommitted transaction will be rolled back during recovery anyway.
Start txn No Not knowing the transaction ever started is the same as rolling it back.

3.4.7 End Records

When committing or aborting, there is a COMMIT or ABORT record that gets written to the WAL as well as an END record. The END record means that the transaction is really completely over and that no more records related to that transaction will ever be written to the log. The END records are used during analysis to reconstruct the running transactions table. What happens between commit/abort and end? For commit, the WAL is flushed to disk and some bookkeeping (presumably on disk?) happens. For example, the database might keep a list of all transactions that have ever committed – adding to that list would happen between commit and end. If there's a crash between commit and end, the bookkeeping still gets done. The application gets confirmation that the transaction has committed before the bookkeeping, though.

For abort, all the CLRs are created between abort and end. I'm not entirely sure why we need the abort at all in this case, because we would have rolled back the transaction anyway in case of a crash, and the end record on its own is enough to determine the set of running transactions during analysis.

4 Crash course: Java concurrency

4.1 Intrinsic locks and synchronization

Every object has an "intrinsic lock". Any synchronized operation on that object acquires the intrinsic lock, performs the operation, and releases it. If you want to simply and explicitly make your own lock, create an empty Object and synchronize on it.

Anything that happens while a lock is held has a happens-before relationship with anything that happens after the lock is released.

A synchronized method holds the intrinsic lock on the method for the entirety of its duration. A section of a method can be synchronized on its own, too. A static synchronized method is synchronized on the reflective Class.

Locks are reentrant. I.e, if you already hold a lock, you may acquire it again at no cost.

Simple getters and setters may be synchronized, even though those operations are guaranteed to be atomic. This is to prevent too much reordering of operations by the OS (memory inconsistency) that would result in the old value being read after a write.

4.1.1 Example of a Really Bad Thing

From https://stackoverflow.com/a/11459843/1233320

while (sth.get() > 1) {

could be compiled to

if (sth.get() > 1) {
    while (true) {

if sth.get() is a non-synchronized getter. When accessing a value without any guarantees, the only thing you know for sure is that you are reading a value of it that really existed at some point in time. The optimizer may validly assume it never changes. This is almost never desirable.

4.2 Atomic operations

Reading and writing variables that aren't long or double is atomic (this includes reference/class variables). volatile has atomic reads and writes for long and double too.

volatile not only enforces atomicity but also installs happens-before relations between any two operations. I.e, after a write to volatile, all subsequent reads from any thread will read the updated value.

4.3 Livelock

When two threads respond to each other, they may get in an infinite inter-thread loop. Like passing people in a hallway (they use this example in the java documentation).

4.4 High-level objects

4.4.1 Locks

Locks, such as java.util.concurrent.locks.ReentrantLock, have a tryLock() method that fails immediately (or after a timeout) if the lock cannot be acquired. This can be used to mitigate deadlock, unlike with simple intrinsic locks and synchronization.

4.4.2 Collections

  • BlockingQueue: Replaces Queue
  • ConcurrentHashMap: Replaces HashMap
  • ConcurrentNavigableMap: Replaces TreeMap

4.4.3 Primitive Replacements

AtomicInteger, for example, has an atomic increment method.

5 Buffer Pool

Can store a certain number of "pages", which are often the same size as disk blocks (usually 8k). Each spot where a page can be stored is called a "frame", regardless of whether it actually is storing a page right now. Eviction strategies (what to do when all frames are in use and we want to cache a new page) include random, LRU, and the "clock method" (??)

6 On-Disk Format

SimpleDB will have one file per relation.

Pages are usually 8KiB

6.1 Page format

6.1.1 Fixed-length tuples, simple

An array of "slots", which each could contain a tuple. Record IDs point to a page, then a slot. This means we cannot move around tuples to different slots (eg, to defragment after removing a tuple in the middle of a page), because it would invalidate all the record ids from indices, etc. So, typically you just put a tombstone where you deleted the tuple to mark it as unused space when iterating over the page. The only header information we need is the total number of in-use slots in the page.

6.1.2 Variable length, slot directory

When variable-length tuples are allowed, the header has pointers to each tuple and the length of the tuple being pointed to. Record IDs now point to the location in the header/slot directory, not the actual location of the main tuple. We are still prohibited from re-using a "pointer slot" in the header, but the tuples (which are much larger) can be re-arranged at will. Each pointer is small: 1 bit to indicate whether the slot is dead, then more bits to indicate the location and length of the tuple.

6.2 Heap file formats

When talking about heap file formats, we're most interested in how hard it is to insert a new tuple. That's because deletion and selection don't differ much between the methods; that requires a sequential scan unless an index exists, regardless of the heap file format.

6.2.1 Simple

Just store all the pages, record how many there are. Requires a scan of all pages, until finding an empty one, to perform an insert.

6.2.2 Dual linked lists

One method: Each disk file contains, in its header, two lists of pages: One of totally full pages, and one of partially full pages. When inserting a tuple, you insert into the first of the partially full pages. When inserting into a page, you consider whether to move it to the full pages list. When deleting from a page, move it to the partial list. Only when the partial list is empty should you create a new page.

6.2.3 Page directory

If we not only keep a list of the partially full pages but also store how much space they have open, it can speed up insertions when we have variable length tuples (don't need to iterate until finding a page that has enough empty space) or inserting many tuples (faster to insert them all into a single page).

A point here, which also applies to indices and stuff, is that this page directory is, itself, stored in pages, and will frequently exceed the size of a single page!

7 Indices

7.1 Clustered

A table with a clustered index usually still has an index structure stored separately from the main table structure – this allows a faster search through the index before seeking to the correct spot on the main disk (it's probably possible to store the whole index in memory).

Clustered index are often stored in the same file as the table. Usually, all primary keys have a clustered index, so it makes sense to tightly couple the clustered index with the table.

7.1.1 Sparse

A "dense" clustered index has pointers to every tuple in the main table. The sorted nature of the main table means that our index could contain pointers to, say, every other tuple in the table structure. If we want to fetch a key that's not in the index, we find the key just less than it, go to its location in the table structure, then iterate forwards until we find our specific key. A sparse clustered index typically stores one entry for each page of the database, since entire pages are read at once anyway. Ie, the index simply tells you which page to look for, not which tuple.

7.2 Unclustered

7.2.1 B-Tree

Like a binary tree, but each node can have many children. All leaf nodes are at the same level. Each node has an alternating pattern of pointers and boundary values. The pointer between two boundary values points to a node that will eventually lead to all leaf nodes with a value between those two boundary values.

B-trees are good for range queries: Values inside a leaf node (and boundary values inside a non-leaf node) are stored in sorted order. Leaf nodes are a sort of doubly linked list; each leaf node contains pointers to the leaf nodes with values immediately greater than and less than the maximum and minimum values in this leaf node.

A typical fan-out is 100, and a typical fill factor is 67%. Recall that the minimum bound on the fill factor is 50%, for non-root nodes.

  1. Selection
    if this is a leaf node
        scan through the node for value
        return what we found
    for pointer, trailing_boundary_value
        if value < trailing_boundary_value
    	recurse into pointer
    recurse into the very_last_pointer
  2. Insertion

    Beginning from the leaf node where this value should exist:

    if there is space in node
        insert value into node
        (this involves choosing a boundary value)
        create a new node
        move the upper half of node into new_node
        Add value to which node is appropriate
        if node is a root node
    	create a new root node with node and new_node as children
    	recurse on parent node, passing new_node as value
  3. Deletion

    Starting at the leaf node containing this value

    remove value from node
    if size(node) < degree/2 and node is not root
        if a sibling has >d/2 nodes
    	steal the one closest to our data range
    	adjust parent boundary
        else (this means the sibling is at exactly d/2)
    	merge this and the sibling into a nearly-full node
    	recurse into parent, telling it the sibling got deleted

    Why does evenly redistributing tuples between the underflowed node and a sibling always work? It will work if the average fu

  4. Choosing the degree

    The maximum number of values in a node is 2*d. So, the total size (in bytes) is keysize*2*d + pointersize*(2*d + 1).

  5. Transactions

    An insertion or deletion could potentially update the whole tree, back up to the root. Thus, to use the insertion and deletion algorithms described above, a lock of the root, which would prevent any concurrent changes to a table with an index.

    There's something called tree locking which works around this a little bit? Or maybe it just lets you read certain leafs while writing other leafs, and not concurrent writes?

8 Query Optimization and Execution

8.1 Joins

8.1.1 Costs

  • Hash join: B(R) + B(S)
  • Sort-merge join: You can usually integrate the join algorithm with the sorting algorithm; the last pass of a merge sort outputs sorted tuples, so you can store the results of the penultimate sorting pass on each table on disk, then emit tuples from the "final pass" for each table on-demand as the final join merge dictates.
  • Nested Loop (By Tuple, no index): B(R) + T(R) * B(S)
  • Nested Loop (unclustered index): B(R) + T(R) * T(S) / V(S, a). Recall that 1/V(S,a) is the selectivity factor for an equality filter on a given attribute. We assume that all matching tuples are in different pages.
  • Nested Loop (clustered index): B(R) + T(R) * B(S) / V(S, a). Like with an unclustered index join, we loop through the values of a from table R, doing an equality filter on table S. This time, though, we know that all the matching tuples in table S will be adjacent on disk, taking up the minimum number of blocks.
  • Nested Loop (By page): B(R) + B(R) * B(S)
  • Nested Loop (By Block): We can decrease B(R) and B(S) by having multiple pages in the same block, where a block just has to fit comfortably in memory. We have to multiply the result by the number of pages per block, but if that number is N, the total cost: \(N*\left \frac{B(R)}{N} + \frac{B(R)}{N} * \frac{B(S)}{N} \right = B(R) + \frac{B(R) * B(S)}{N}\), a substantial improvement, since the right side was the main problem to begin with. Note that if \(N=B(R)\) that the cost is just \(B(R)+B(S)\), which is the same price as a hash join! Which is correct, if you don't care about CPU. If \(N=B(S)\), then cost is \(2*B(R)\). Wut? I think it might be related to

    Alternatively: B(R) + B(R)/(M-1) * B(S), where M is the number of pages that fit in memory. This assumes we read M-1 pages of R at a time, then read through S one page at a time. But, as you can obviously tell from the equation, if we read M-1 pages of S at a time and did single-page loops through R, the cost would be the same; it's sorta symmetrical (apart from B(R) term at the front).

8.1.2 Partitioned Hash Join (Grace Join)

First, hash all tuples from each table (keeping them segregated from the other table) with a bucket size less than M (in pages). Write all the buckets to separate files on disk. Then loop through the buckets generated from the other table simultaneously. We essentially do a (non-partitioned) hash join on these pairs of buckets, if memory permits (and it will, if B(R)+B(S)<M2): Take one bucket, hash it using a much finer algorithm (smaller bucket size), then match against elements from the other table's bucket one by one.

How do you get the buckets (the "partitioning" phase)? You create an empty M-1 pages in memory, then hash and distribute the partitions of table R one at a time. Once you've exhausted R, write the partitions to disk, then repeat for table S.

Question for office hours or something: What if the distribution on the key is not uniform, so that some buckets are much larger than others? How can we be so sure that a bucket will fit in a page?

8.2 Sorting

Multi-way merge sort is the preferred method. It tends to take \(N*log_M(B)\) disk I/Os. Merging more than 2 runs is trivial.

First, sort in-memory the largest possible number of tuples at once. Unfortunately, each of these sorted runs will be as large as main memory, so we cannot merge the entire runs at once – instead, we only hold one page of each run in memory at once while merging – so we can still merge many sorted runs at once.

Each of the runs generated by the initial procedure of loading as much as we could into memory then sorting had a length of M (measured in pages). Then, we can merge about M runs at once (one page from each, leave one page of space for output). So, in the described two-pass algorithm for sorting, the output run will be up to M2 in size. Alternatively, we can sort up to M2 of data with only two passes!

Because two-pass sorts are almost always possible, the cost of a sort is 3*B(R), because we must read, write, then read, and the output write isn't considered (eg, it may be streamed to the next operator).

8.2.1 Merge Joins

The algorithm is slightly modified from a simple sort. First, runs from different relations are never combined. And in the final pass, we do not output tuples in order; we only output tuples when two runs have the same key, and we must output all possible pairings of tuples.

The "all possible pairings" part is important, because although a run has tuples from exactly one relation, a relation may have tuples spread out across multiple runs. The final pass combination algorithm must be aware of this.

Generally, you can do a merge join in two passes if B(R)+B(S)<M2, but really it must be a bit less because a run cannot have tuples from two relations – so a number of runs equal to the number of relations being joined may not be full, meaning wasted space that cuts into the M2. This is negligible because it scales with the number of relations being joined.

Cost: 3(B(R)+B(S))

8.3 Cardinality Estimation

Only covering what wasn't covered so much in 414…

8.3.1 Assumptions

  1. Uniformity

    Assume that tuples are equally distributed among all unique values (or, for a numerical field, that they are distributed uniformly along the range from min to max).

  2. Independence

    Two selectivity factors can be multiplied together. I.e, the distribution of some attributes is not correlated with the distribution of any other.

  3. Containment of Values

    An equijoin where one table has less distinct values of the key than the other, ALL of the unique values of the key in the table with less unique values of that key are present in the larger table. I.e,

  4. Preservation of Values

    A join does not remove any of the unique values of any attributes of its children. e.g, if we join hospitals with coronavirus patients, and each hospital has a country field, we would expect that the number of unique countries stays the same. This would only be false if there is a country where no hospitals have a coronavirus patient.

8.3.2 Joins

Depends on the "containment of values" assumption.

8.3.3 Histograms

Store ranges of some key vs the number of tuples in that key range.

Equal width
Each range is the same width. Some ranges may have many more tuples than others.
Equal depth
Each range has about the same number of tuples in it. Some ranges have less keys than others.

8.4 Relational Algebra Manipulations

8.4.1 Generalized Distributivity

eg, a SUM aggregate can be distributed into a join – we can sum on both sides!


Because the group column is different than the aggregate column, it might at first seem like we're already out of options. But we can improve a little bit.

SELECT *, SUM(S_sub.sum)
    SELECT *, SUM(S.A) AS sum
      FROM S
    ) AS S_sub USING (C)

We know that the grouping due to C is at least as fine as the grouping due to R.B will be. This inevitably means more seeks due to calculating the aggregate, but the cost savings in the join are often worth it.

A super clear example of when the benefit outweighs the cost:

  FROM big_table x, big_table y

if we count first, the join involves only two tuples. If we join first, the join involves count(bigtable)2 tuples.

Generalized distributivity mostly relies on associativity, so that the results from children can be smoothly combined.

A more complex example:



SELECT SUM(A_agg.acc * B_agg.acc)
  FROM (SELECT c, SUM(a) AS acc FROM A) AS A_agg
       (SELECT c, SUM(b) AS acc FROM B) AS B_agg

Assume two rows in each table, all matching on the join attribute. Then \[a_1*b_1 + a_1*b_2 + a_2*b_1 + a_2*b_2 = (a_1 + a_2) * (b_1 + b_2)\] You can see that upon expansion of the RHS, all the join pairs will break out. We are collapse all tuples from a table that share a join key because upon multiplication they will un-collapse.

8.4.2 (Equi)Join removal


  • The attributes from one table are discarded (projected away later, not used before then).
  • The join attribute on the discarded relation is a primary key (no duplicates).
  • The join attribute in the preserved relation is a foreign key to the join attribute in the discarded relation.
  • The join attribute in the preserved relation is NOT NULL.

8.4.3 Left & Right-Deep Trees

How joins are thought about. Each join has a physical table on the right. Most join operators aren't implemented symmetrically (i.e, they go once over one table then either probe or repeatedly scan over the other). So, having the result of a previous join (if any) on the left is amenable to pipelining.

8.5 Selinger's Algorithm (System R)

How to optimize the order of many joins. Dynamic programming. Calculate the cost of accessing individual tables first, with all possible access methods. Egs:

  • Sequential Scan
  • Index access
  • Range access

Then calculate the cost of each possible join of two tables. For each pair, choose the cheapest join method, by analyzing the costs for the different types of access methods calculated during the first pass (and also deciding which table should be on the left and right). Although this phase will analyze multiple join algorithms on the two subtables, it will only output a single cost – the cheapest one it found. It also saves the estimated cardinality of its output, because upper levels need that to compute their own costs.

Continue recursing, ie, finding the cheapest plans for all possible joins of three tables, by looking at combinations of twos and ones. We will have to consider multiple access methods on ones, though not on twos! Notice how it never makes sense to put the single table on the left, for we wouldn't be able to take advantage of any special access methods.

For the fourth level, we might analyze combinations of 3+1 and 2+2. If we look at 2+2, we are considering "bushy trees".

When we get to the top level, where the number of tables being joined is equal to the number of tables requested in the SQL query, we will have the cheapest cost.

8.5.1 Interesting Orders

In the vanilla Selinger's algorithm, the single tables have multiple access methods listed, but groups of tables have but a single one (the cheapest).

8.5.2 Heuristics

  • Do not consider cartesian products: joins of tables without a common key (eg, joining A, B, C, where A and B share a key, and B and C share a key, but A and C share no keys, a join between A and C at the two-table level would not be considered).
  • Left-deep only: Do not consider bushy trees (eg, ignore 2+2 joins at the 4 level) and always put the single table on the right to take advantage of its access methods.
  • Push down selections: Each possible join outputs its cost and estimated cardinality. Include any applicable selections in that cardinality (though be careful to make sure you don't double-count selections that were performed in children even further down!)

9 Parallel and Distributed Databases

9.1 Terminology

Parallel usually refers to OLAP: Online Analytic Processing, i.e, big data queries. Distributed usually refers to (not replicated) OTAP: Online Transaction Processing, eg, when updates and inserts are happening and being intermixed, like with real world applications.

9.2 MapReduce

The Map function takes a single key-value pair as input and outputs any number of intermediate-key - value pairs.

The Reduce function takes pairs of intermediate keys and bags of values that Map outputted with that intermediate key.

If reduce is associative, we can run reduce once on each machine to hopefully slightly reduce the number of pairs, then run it again on a central server.

9.2.1 Example: Count words

Map: Produce an intermediate pair with key = word and value = # of occurrences for each word in the value.

Reduce: Sum all the intermediate values. If they all came directly from a Map, they will all be 1.

9.3 Concurrency Control

Real systems still use 2PL, but have to communicate across machines to know about locks on things. There's an added aspect, though, about determining whether to commit or abort at the end of a transaction; When the user sends a COMMIT command, it may reach one

9.3.1 2-Phase Commit

We assume we have a reliable "coordinator" machine. There's a PAXOS algorithm for an unreliable coordinator. Redundant disks and such reduce the need.

The user sends a COMMIT message to the coordinator. The coordinator proceeds in 2 phases:

  1. Prepare: Ask each subordinate whether it would be able to handle a commit for this txn. Each subordinate, upon receiving the prepare message, writes "Prepare <txid>" into its redo log before responding "Yes" to the coordinator.
  2. Commit: Write "commit" into the coordinator's redo log, then send a commit message to each subordinate until it returns ACK (if it's crashed, we have to continue sending until it comes back up).

If some subordinate reports, during the prepare phase, that it cannot go through with the commit, then we just send rollback commands to all nodes involved in the transaction.

Recovery: In class, we assume that the coordinator only sends out prepare, abort, or commit messages once to each subordinate; i.e, it does not poll them until they acknowledge. This makes the recovery process both simpler and more complex. Per subordinate:

  • If the last entry is "prepare", then

9.4 Replication

Can be lazy/asynchronous (things update as needed) or eager/synchcronous (push updates, everybody is expected to be in sync). Can also be master/slave or group/multi-master.

9.4.1 Goals

  • Consistency: Servers will give roughly the same results to queries.
  • Availability: Requests are always satisfied.
  • Performance

9.4.2 Single-Item Serializability

Weaker than serializability. Enforces serializability on all operations on each item. So, element A will always be left in a state that could have resulted from a serial schedule, and element B will always be left in a state that could have resulted from a serial schedule, but there may not be a serial schedule that would result in the states of both A and B.

9.4.3 Synchronous/Eager Replication

Updates are given to all hosts as part of the transaction, so by the time the transaction commits, all hosts are ready. Use global 2PL locks to avoid conflicts then 2-phase commit.

  1. Master

    Master: Each object gets assigned to its own master node. You could even select a different master for each transaction. Masters maintain the global lock on the item. When a master crashes, a new one is elected (with PAXOS, usually). But what if the master's network goes down temporarily, causing it to be disconnected, causing an election, causing a second master, which may conflict when the first master comes back up? We only continue with a majority of the nodes, then, and hope that the rest figure it out (hand wavey hand wavey)

    We can use a per-item master to control 2-phase-commit as well (not just 2PL locks) because we only care about single-item serializability. But we still want overall recoverability, not just single item serializability, right?

  2. Group/Multi-Master

    Each replica maintains a lock manager for each item. Set (at compile time) a number X of locks that must be acquired to have an effective exclusive lock and a number S of locks that must be acquired to have an effective shared lock. If you want to modify the data, try to acquire locks until you have at least S locks (if you want to read) or X locks (if you want to write). By the rules of locks, \(2X>N\) (or else multiple transactions could hold exclusive locks) and \(X+S>N\) (or else an exclusive and shared lock could co-exist).

    There are two extremes: "Majority Locking", where \(X=S=\lceil \frac{N+1}{2} \rceil\) (you might be able to make S one less, but that's not important if N is large), and The Other One, where \(X=N\) and \(S=1\). Majority locking makes it hard to read. The Other One makes it hard to write (even a single slow machine will hold up all writes on the item. But you have to update that machine anyway before you commit (release the lock), so it's the same speed?).

9.4.4 Asynchronous/Lazy Replication

Transactions commit after writing to a single replica. The writes happen eventually to the other machines.

  1. Master

    Each item gets a master. Transaction updates the master. The master should push out the update to slaves. If the master fails before sending the updates to anyone, yeah, the committed transaction gets undone. If the master crashes after the updates have been sent to only some slaves, there are ways for the slaves to determine who has the latest version of the item and should be the new master.

    Log Shipping is often the implementation. The master sends out the tail of the WAL to the replicas, which can execute it just like during recovery.

9.5 H-store

H-store is stored in memory to avoid disk IO, with snapshots and logging for persistency.

All transactions are stored procedures that take parameters, so they are fully determined when they start. This allows a "command log" that simply stores a list of transactions and the parameters passed to them.

The database is partitioned, and each partition is given a single thread. Transactions are supposed to be optimized to run on a single partition. Additionally, since each transaction is a stored procedure, the core will be entirely busy during the transaction's execution – the stored procedure doesn't do disk IO or anything else that's blocking. Thus, we do not need locks! But 2-phase-commit can be used to lock across partitions if necessary.

Logging is very simple. Just record each transaction and its parameters before you start, then redo them on recovery! Occasional "snapshots" (complete images of the database) are stored to disk.

When a transaction needs access to multiple partitions, it gains complete control of each partition, to continue avoiding locks (though you still need 2PC). All locks to partitions are granted before the transaction starts executing at all. The DBMS makes a best guess at which partitions the txn will need access to. Timestamp-deadlock-detection-style stuff is used to handle transactions that try to lock the same partition simultaneously. If the transaction needs additional partitions than the one the DBMS guessed for it, the txn gets aborted and retried and the new partitions are requested during the retry. The txn is deterministic based on the parameters.

10 Terms

Data Independence
The application doesn't need to know about the physical storage of the data. Eg, why many self-hosted softwares can work on MySQL or SQLite, at your choice. Also covers, for example, changing the indices that exist or the type of machine the DB is running on, not just choice of DBMS. According to Dan, this is the main reason that non-relational data models have failed: They lack data independence.
Number of attributes/columns in a relation.
See Arity
Actual rows in a database, as opposed to the schema.
Theta join
Any join, with arbitrary condition. Usually used to refer exclusively to joins that aren't equijoins.
Natural Join
An equijoin on all same-named columns. Includes projection/renaming as necessary to include only one copy of each in-common column in the output, at least in RA.
Hot Stand-By
A backup for the master node in a distributed database. It's the designated survivor: If the master fails, there's no need to elect a new one – go straight to the hot stand by.

Author: Mark Polyakov

Created: 2020-06-20 Sat 16:40