CSE 444 Notes
Table of Contents
- 1. Logistics
- 2. Transactions
- 3. Crash Recovery
- 4. Crash course: Java concurrency
- 5. Buffer Pool
- 6. On-Disk Format
- 7. Indices
- 8. Query Optimization and Execution
- 9. Parallel and Distributed Databases
- 10. Terms
These are Mark Polyakov's notes from taking Computer Science 444, Database System Internals, at the University of Washington during Spring 2020.
1 Logistics
1.1 Asking Questions
Either ask in chat or unmute yourself and interrupt. If you ask questions a lot, wait for him to call on you after unmuting.
1.2 Contact
cse444-staff@cs.washington.edu
2 Transactions
2.1 ACID
- Atomicity
- Consistency
- Isolation
- Durability
Atomicity and isolation are the important parts. Isolation is mostly synonymous with serializability; it's impossible to tell whether you are running alone or concurrently with other transactions. Atomicity means that the transaction either executes in its entirety or not at all.
2.2 View Serializability
You know dasher and prancer and dancer and conflict serializability…but do you know view serializability? It's about what a transaction can "see", rather than precedence graphs. View serializability implies serializability but not conflict serializability.
We define two schedules S and S' of the same transactions to be view equivalent when:
- If a transaction reads the initial value of an element in S, it reads the initial value in S' as well.
- If a transaction reads a value written by some other transaction T in S, it reads that element from the same T in S' as well.
- If a transaction performs the final write of an element (last in the schedule) in S, it performs the final write in S' as well.
W(A) | |
R(A) | |
W(A) | |
W(B) | |
R(A) | |
W(A) |
And it's view serializable if it's view equivalent to any serial schedule.
2.2.1 Difference from conflict serializability
Only schedules with blind writes (writes without a preceding read) can be view serializable without being conflict serializable. Here's an example of a blind write that is still conflict serializable:
2.3 Recoverability
To be recoverable, a schedule must:
- Be conflict serializable.
- Commit each transaction only after all of its "dependency" transactions (transactions that wrote a value it read) have committed.
Now we have two graphs to think about: The conflict serializability precedence graph, and the write-read dependency graph to track if all dependencies have committed. A directed edge from T1 to T2 exists if T1 reads the result of a write by T2. T1 cannot commit until all nodes it has directed edges to have committed.
Strict 2PL (hold every lock until the very end of the transaction, releasing them all at once) ensures recoverability; transactions release locks at the same time they commit, so another transaction will not be able to read something the first transaction wrote until the first one commits.
2.3.1 Why conflict serializability?
It's a terminology thing. You can still recover a schedule that is, say, view serializable, but not conflict serializable, as long as it commits transactions in the right order. It's just not "recoverable" by the textbook definition.
2.3.2 Avoid Cascading Aborts
Stricter than the recoverability requirement above: A transaction cannot read anything written by another transaction until that transaction has committed. This schedule is recoverable but suffers a cascading abort:
T1 | T2 |
---|---|
W(A) | |
| R(A) |
| W(B) |
ABRT | |
This one avoids cascading aborts:
T1 | T2 |
---|---|
W(A) | |
CO | |
| R(A) |
| W(B) |
| CO |
Strict 2PL prevents cascading aborts.
2.4 2PL Proof
By contradiction: Assume there's a conflict precedence graph cycle of transactions T1..Tn, in that order, and we are using 2PL. T1 must release its lock on whatever element it conflicts with T2 on before T2 acquires its lock on that element. Likewise, T2 must release a lock on some element before T3 acquires one. Under 2PL, T2 cannot release any lock before it has acquired all of its locks, so T1 must release its lock before T3 acquires a lock. And before T4 acquires a lock. Etc, etc, until Tn, and eventually T1 again, acquires a lock.
Each transaction in the chain has to wait for the previous one, so T1 must wait for itself.
2.5 Steal and Force
- Steal: Allow evictions and flushes of dirty pages from an uncommitted transaction. But then, if the transaction is aborted, you need to undo those writes!
- No-steal: Dirty pages from uncommitted transactions remain in memory.
- Force: When a transaction commits, flush all of its dirtied pages.
- No-force: When a transaction commits, its dirtied pages may remain in the buffer pool's memory without a flush. But, somewhere on disk, you do need to have a record of what was changed, if you want recovery.
2.6 Timestamp
Each transaction gets a truly unique timestamp. The evaluated schedule must be equivalent to a serial schedule where the transactions are in the order given by the timestamps.
View serializability is used to determine whether an operation is allowed. Every time a transaction is about to read or write an element, check whether the last transaction that touched that element has a strictly smaller timestamp than ours.
We do not need to worry about R/R conflicts – there's no such thing. When processing a write, we check against the last read time and write time. When processing a read, we check against the last write time only.
2.6.1 Thomas' Rule
By enforcing transactions to be serializable in the order they began, timestamp-based concurrency control can reject many schedules that lock-based control would allow. However, with timestamps, we can occasionally re-order writes.
If transaction T1 wants to write element X, but the last write timestamp on X, T2, is newer than our timestamp, an immediate analysis makes it seem like we have to abort. But, if the read timestamp on X is older than T1, we know two things about the serial schedule that we are trying to replicate:
- T2 has overwritten X (blind write), so any future reads must have timestamps after T2 and expect to see what T2 wrote to X.
- No transaction has already read us.
Nobody read us, nobody will, what's the point of the write at all? We can drop it.
Summary: If RT(X) < TS(T1) and WT(X) > TS(T1) when T1 wants to write X, we can just skip the write.
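A minimal sketch of these checks in Java (the class, field names, and element-as-string representation are invented for illustration; this is not SimpleDB code):

import java.util.HashMap;
import java.util.Map;

// Hypothetical timestamp scheduler tracking only RT(X) and WT(X) per element.
class TimestampScheduler {
    static class Stamps { long readTS = 0; long writeTS = 0; }
    private final Map<String, Stamps> stamps = new HashMap<>();

    // Returns true if the write should be performed, false if Thomas' rule
    // lets us skip it. Throws to signal that the transaction must abort.
    boolean write(String element, long ts) {
        Stamps s = stamps.computeIfAbsent(element, k -> new Stamps());
        if (s.readTS > ts) {
            // Someone "after" us in the serial order already read the old value.
            throw new IllegalStateException("abort: RT(X) > TS(T)");
        }
        if (s.writeTS > ts) {
            return false; // Thomas' rule: a newer blind write already overwrote X.
        }
        s.writeTS = ts;
        return true;
    }

    void read(String element, long ts) {
        Stamps s = stamps.computeIfAbsent(element, k -> new Stamps());
        if (s.writeTS > ts) {
            throw new IllegalStateException("abort: WT(X) > TS(T)");
        }
        s.readTS = Math.max(s.readTS, ts);
    }
}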
2.6.2 Storing the read/write timestamps on the elements
Do you need to indefinitely store a timestamp for the last reader and writer for all elements? If you wanted to do tuple-based locking, for example (and tuple-based locking is especially good for an optimistic locker, because collisions are much less likely), this could get expensive.
Can we just hold a mapping from element IDs to timestamps, then sometimes remove entries to indicate that the last read or write was "long ago"? Well, how long ago is long ago? Can we remove after a transaction has committed?
T1 | T2 |
---|---|
START | |
| START |
R(X) | |
| W(X) |
| CO |
W(X) | |
This would be allowed if we remove timestamps after commit, but is nonserializable.
A better solution is to remove timestamps after commit when the transaction committing has the oldest timestamp of all outstanding transactions.
2.6.3 Notation
- WT(X)
- The timestamp of the last write on element X.
- RT(X)
- Timestamp of last read on X.
- TS(T)
- Timestamp of transaction T
- C(X)
- Boolean, whether the last transaction to write X has committed.
2.6.4 Abortions
To ensure recoverability at a major cost to performance, just enforce commits in the order of timestamps. To be a bit smarter, require that C(X)=TRUE when reading X or when writing X and WT(X)>TS(T), usually by delaying a read or write until C(X)=TRUE. We obviously have to enforce it on reads, otherwise we could read uncommitted data. On writes, if we are overwriting an old value (WT(X) < TS(T)), there's no problem; it's only when Thomas' rule would have us skip the write (WT(X) > TS(T)) that we must wait for C(X)=TRUE, because if that newer writer aborts, our "skipped" write suddenly matters.
2.6.5 Summary
Timestamp scheduling ensures view serializability, avoids cascading aborts if you want, but does not handle phantom insertions.
2.7 Multiple Versions
A logical expansion of the timestamp system.
When a transaction writes X, redirect its write to a copy of X marked by the timestamp. When a transaction reads X, read the latest version prior to the reading transaction's timestamp.
In addition to storing the write timestamp and contents for each version of X, we need to check "legality of writes" by storing, per version, the highest timestamp of any transaction that read it. When a transaction tries to write X, look at the latest version older than the writer's timestamp: if that version's read timestamp is more recent than the writer, abort the write. I.e, if someone newer than us has already read the version we would be overwriting, we can't write.
We can delete a version when all transactions with a timestamp less than or equal to that version have committed.
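A rough Java sketch of the multi-version read and write rules (hypothetical class; the commit bit and garbage collection are omitted):

import java.util.Map;
import java.util.TreeMap;

// Versions of a single element, keyed by the writer's timestamp.
class MultiVersionElement {
    private static class Version {
        Object value;
        long maxReadTS; // highest timestamp that has read this version
        Version(Object value) { this.value = value; }
    }
    private final TreeMap<Long, Version> versions = new TreeMap<>();

    // Read the latest version written at or before the reader's timestamp.
    Object read(long ts) {
        Map.Entry<Long, Version> e = versions.floorEntry(ts);
        if (e == null) throw new IllegalStateException("no visible version");
        e.getValue().maxReadTS = Math.max(e.getValue().maxReadTS, ts);
        return e.getValue().value;
    }

    // Write a new version, unless a newer transaction already read the
    // version we would logically overwrite (the latest version older than us).
    void write(long ts, Object value) {
        Map.Entry<Long, Version> prev = versions.floorEntry(ts);
        if (prev != null && prev.getValue().maxReadTS > ts) {
            throw new IllegalStateException("abort: a newer reader saw the previous version");
        }
        versions.put(ts, new Version(value));
    }
}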
2.7.1 Abortions
A commit bit, like C(X) in the timestamp system, will ensure that we only read a version once the transaction that created it has committed.
2.7.2 Phantoms
Not a problem! Even long-term storage of elements includes a timestamp. When we insert a new tuple, it also has a timestamp on it – there's no un-timestamped copy of the tuple. A read with a timestamp before the insertion will not be able to access the newly inserted tuple because its version is too recent. Similarly with deletion, we basically store a tombstone as the latest version, so that old reads still read a valid tuple.
2.7.3 Observations
- Only the transaction that created a version can write to it, ever.
2.8 Validation
Let transactions do whatever they want, then check consistency/serializability after commit.
A transaction does all its work, with writes cached locally per-transaction, during the "read phase". When the application commits, the "read" phase is over. The validation phase happens, and VAL(T) is set to the time that validation finished. Finally, the cached writes are applied to disk.
We keep track of which elements the transaction read and wrote (even though those writes aren't applied to disk until the very end). During the validation phase, we check whether any ongoing transaction (one not yet done with its write phase) wants to write an element that we read; if so, we may have read a stale value, and the transaction must abort and retry.
2.9 Performance
30% spent in the buffer pool, 30% acquiring locks, 28% on the WAL, and 12% everything else (i.e, doing the query!) according to the research paper by the guys who made H-Store. This is a real reason NoSQL can be so fast even on a single machine.
3 Crash Recovery
These are all types of "Write-Ahead Logs", where you log what's about to happen just before it happens.
3.1 Undo Log
STEAL and FORCE. Record when a transaction starts, commits, aborts, or updates an element. The update log is made when the update is performed in the buffer pool, before it is outputted to disk. The update log stores the value of the element before the transaction updated it.
To recover, move backwards in the log file, reversing all actions until all uncommitted transactions are undone.
Undo log records are idempotent, so it's safe to recover from the same log multiple times, or to recover from a crash that occurred during a previous recovery.
Undo all log entries by uncommitted transactions. We must scan the whole log file to do this consistently, though we could probably insert special entries into the log that indicate there are no outstanding transactions, so we don't need to continue scanning back further.
FORCE: We never undo writes from a committed transaction (and an undo log has no way to redo them), so all of a transaction's writes must be flushed to disk before its COMMIT is written to the log.
STEAL: Undoing updates is idempotent, so it's no big deal if intermediate outputs are written to disk.
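A toy Java sketch of undo-log recovery under these rules (the log record layout and Storage interface are invented for illustration):

import java.util.HashSet;
import java.util.List;
import java.util.Set;

class UndoRecovery {
    record LogRecord(String kind, long txn, String element, Object oldValue) {}
    interface Storage { void write(String element, Object value); }

    // Scan backwards; restore the old value for every update whose transaction
    // never logged a COMMIT or ABORT. Safe to re-run after a crash during
    // recovery, because writing the same old value twice is idempotent.
    static void recover(List<LogRecord> log, Storage storage) {
        Set<Long> finished = new HashSet<>();
        for (int i = log.size() - 1; i >= 0; i--) {
            LogRecord r = log.get(i);
            switch (r.kind()) {
                case "COMMIT", "ABORT" -> finished.add(r.txn());
                case "UPDATE" -> {
                    if (!finished.contains(r.txn())) {
                        storage.write(r.element(), r.oldValue());
                    }
                }
                default -> { } // START, checkpoint markers, ...
            }
        }
    }
}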
3.1.1 Checkpoints
Usually, you have to scan through the entire undo log to check for very old uncommitted transactions. It's safe, however, to stop scanning once you've reached a point where all older log entries are from committed transactions; you won't undo any of them.
In practice, checkpointing is done periodically. In quiescent checkpointing, new transactions are disallowed, and once all outstanding transactions have committed, we insert a checkpoint, then resume the database. This is horrible: We have to pause all transactions.
A method of nonquiescent checkpointing: When the cron runs, insert a START_CKPT that contains the list of outstanding transactions into the WAL. Every time a transaction commits, insert an END_CKPT if all the transactions from the last START_CKPT have committed. How far back do you go when undoing? Back to the START_CKPT corresponding to the most recent END_CKPT. We have to go back to the start, not the end, because transactions not listed in the START_CKPT may have begun between the START and END. We cannot go back to a START without an in-progress END because we will not know that all transactions involved in that START are done. When the conditions are followed, we go back to a point where we know all transactions that were outstanding, and we know they are all committed.
The END_CKPT is pedagogical: During recovery, we can just calculate where it should be from the COMMITs. Then, the START_CKPT entries are really just lists of all existing transactions at points in time.
Another way to determine where to start reading is to take the most recent START_CKPT, regardless of whether it is complete, and continue undoing until we see the START of all transactions in the START_CKPT.
PERSONAL: Periodically scan forwards from the last checkpoint, inserting a new checkpoint every time we are at a location where all transactions are committed.
3.1.2 Rollbacks without a crash
Each line of the WAL has a pointer to the last line corresponding to the same transaction. The transaction itself (in the catalog) has some pointer to the most recent line related to it in the WAL. So we can go back and undo all the related stuff.
3.2 Redo Log
NO STEAL and NO FORCE. When writing things into the buffer pool, you also add the new value to the WAL; the dirty pages themselves are not yet written to the main data file on disk. When you commit, store COMMIT in the WAL. When redoing, only apply updates from COMMIT'ed transactions.
NO STEAL: Can't output to disk before commit, because we'll have no way to undo it. Means memory could fill up if you write a ton, but that's usually faster than having to write to disk before a COMMIT.
NO FORCE: This is more like "optional force"; it's fine to force pages to disk right after commit. While STEAL is better than NO STEAL, it's the opposite for force: NO FORCE is better than FORCE.
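The mirror-image sketch for redo-log recovery (same invented record layout and Storage interface as the undo sketch above):

import java.util.HashSet;
import java.util.List;
import java.util.Set;

class RedoRecovery {
    record LogRecord(String kind, long txn, String element, Object newValue) {}
    interface Storage { void write(String element, Object value); }

    // First find the committed transactions, then scan forwards and
    // reapply only their updates. Also idempotent.
    static void recover(List<LogRecord> log, Storage storage) {
        Set<Long> committed = new HashSet<>();
        for (LogRecord r : log) {
            if (r.kind().equals("COMMIT")) committed.add(r.txn());
        }
        for (LogRecord r : log) {
            if (r.kind().equals("UPDATE") && committed.contains(r.txn())) {
                storage.write(r.element(), r.newValue());
            }
        }
    }
}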
3.2.1 Checkpointing
START_CKPT, like with the undo log, contains the txnids of all outstanding transactions. However, instead of placing END_CKPT at commit, place END_CKPT when all blocks not synced with disk at the START_CKPT have been flushed. Now, an END_CKPT means that all transactions committed before the transactions listed in the corresponding START_CKPT are flushed to disk.
When recovering, we redo starting from the earliest START of any transaction mentioned in the START_CKPT corresponding to the most recent END_CKPT. This is the earliest possible time that a dirty block may have been dirtied.
PERSONAL: Have some daemon that is constantly writing things from the WAL to disk when the disk isn't too busy, and deletes the UPDATE from the log
3.2.2 Commits without a crash
In an undo log, we implement ROLLBACK using the log. But with a redo log, we could instead implement COMMIT by doing a selective redo on the relevant operations. In practice, though, you don't – just write things to disk when convenient.
3.3 Undo-Redo log
UPDATE entries contain both the new value and the old value. Depending on whether the update entries are before or after the commit, we might undo them or redo them?
3.4 ARIES
A specific implementation of the undo-redo log. Features include:
- Fuzzy checkpointing; checkpointing happens as fast as you can read information from the catalog and buffer pool; you don't need to wait for transactions to commit or for pages to be flushed.
- Works better with advanced concurrency control systems that have weird locking protocols.
3.4.1 Overview
- Analysis: Determine what parts of the log we might have to redo or undo.
- Redo: Get the database back to the state it was in when it crashed.
- Undo: Rollback uncommitted transactions. We keep track of how much we've undone, so that redo not only redoes the dirty pages but also redoes the partial undos if a crash happens during undo.
The weird part of ARIES is that you always redo. This means that at the end of the redo phase, dirty pages will be…dirty.
3.4.2 Checkpointing
There are START and END tokens in the WAL, just like undo and redo logs. However, the contents are very different. Starting at the moment after the START token is written, the dbms starts collecting:
- The set of outstanding transactions and the lastLSN for each of those transactions. lastLSN is the most recent WAL update (or CLR) related to that transaction.
- The set of dirty pages.
And when you believe you're done, you write the END token containing those two sets. Of course, transactions might start or stop, clean pages might be dirtied, and dirtied pages may be flushed between the START and the END. The only guarantee is that the two sets are supersets of the sets of outstanding transactions and dirty pages that were outstanding and dirty, respectively, at both START and END. It's also a subset of the outstanding/dirty transactions/pages that were outstanding/dirty at either START or END. Put more weakly, a superset of all outstanding transactions and dirty pages that were outstanding/dirty at the START and any point after or equal to END.
3.4.3 Analysis
Unlike simple undo or redo, we cannot start undoing nor redoing from just any checkpoint. The reason we could start recovering at a START in undo and redo logging was that their checkpointing systems waited to write the checkpoint until some condition was met, so we know the checkpoint is safe to recover from. Fuzzy checkpoints are constructed very quickly and simply tell us how messy everything is.
- Finding the redo point and list of dirty pages
We must redo everything related to every dirty page at the time of the crash. The checkpoint tells us all pages that were dirty at both START and END. This includes many pages that are no longer dirty; oh well. There's no way to tell when a page is cleaned by the WAL, so we will find a superset of dirty pages; it's ok to redo all of them.
We proceed from the latest START that has a corresponding END, and use the set of dirty pages from the END. Scanning forwards through the log, whenever a page is written, add it to the dirty set. For each page in the set, look at its recLSN pointer, which indicates the WAL entry that dirtied it. (For pages that were added during analysis, not pages found in the checkpoint, the recLSN is the LSN of the WAL entry that caused us to add the page to the set). Find the minimum recLSN. We will redo from it. Also store the dirty pages list; we will use it to accelerate the redo.
- Finding the undo point; or really, the list of outstanding transactions
Much like finding the redo point, except we search for outstanding transactions instead of dirty pages. Commits are stored in the WAL, unlike page flushes, so we can know exactly which transactions were outstanding at the crash.
Start scanning forwards from the latest START with a corresponding END, and use the set of outstanding transactions from the END. Remove transactions when we see a transaction end marker (this indicates either a commit or a completed rollback). Add a transaction when we see an entry other than a commit, and set lastLSN for the transaction to that entry.
The undo process, as we will soon see, does not actually involve a scan of the WAL, like the redo process does. Instead, we will start with the set of latest updates related to a transaction, then use their prevLSN pointers to move backwards in the log, skipping over irrelevant crud. So the undo part of the analysis phase is not to search for a point to undo from, but a set to undo from.
3.4.4 Redo
Starting at the redo point found during analysis, repeat all update and CLR actions (which we'll get to later) that are related to dirty pages and occurred after the page was last written to disk. We check the last property in two ways:
- If the recLSN is newer than the update's LSN, the update was certainly already written to disk, so we skip it. (The converse is not guaranteed – recLSN ≤ LSN does not mean the update is missing from disk – because we only have a superset of the dirty pages.)
- We also require that the pageLSN, which is stored on-disk in the page and is the more correct version of the recLSN, is older than the update's LSN.
PERSONAL: The existence of a pageLSN means that we actually could get the exact set of dirty pages, rather than a superset, during the redo-finding pass. But because these are only stored on disk (and not in the WAL), this is extra expensive. Couldn't we store an entry in the log whenever a page is flushed, just like we store when a transaction is committed? Possible answer: We need to know exactly which transactions are outstanding so we don't undo committed transactions, so we must go to any expense to track them properly. This is not true for dirty pages. Usually the number of dirty pages >> number of transactions, which is another reason why we need to be more careful about performance with pages than transactions.
The pageLSN
must be updated whenever a redo action is performed (undo and
redo actions are written to disk immediately) and whenever a page is flushed
to disk during normal operation.
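The per-update decision can be summarized in a few lines. This is a hypothetical helper, not pseudocode from the ARIES paper: dirtyPages maps pageId to the recLSN found during analysis, and pageLsnOnDisk reads the pageLSN stored in the page itself.

import java.util.Map;
import java.util.function.LongUnaryOperator;

class RedoFilter {
    // Returns true if the update with this LSN, touching this page, must be redone.
    static boolean shouldRedo(long updateLsn, long pageId,
                              Map<Long, Long> dirtyPages,
                              LongUnaryOperator pageLsnOnDisk) {
        Long recLsn = dirtyPages.get(pageId);
        if (recLsn == null) return false;      // page wasn't dirty at the crash
        if (updateLsn < recLsn) return false;   // update certainly already on disk
        long pageLsn = pageLsnOnDisk.applyAsLong(pageId); // requires reading the page
        return updateLsn > pageLsn;             // otherwise the page already contains it
    }
}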
Ensuring that we don't redo old actions (from before the page most recently became dirty) is just for efficiency. If the system crashes during recovery, the second recovery will start redoing from no further forwards than where we left off, so eventually all relevant actions from the redo point forward will be redone. The optimization simply changes what counts as "relevant" on this path from past to present.
3.4.5 Undo
Take the set of all lastLSN
for outstanding transactions. Start with the
most recent one (we must undo backwards). If it's an update:
- Write-ahead that we are about to undo something. This is in the form of a CLR, "Compensation Log Record". While normal update logs contain both the old value and the new value, so that we can redo or undo them, the CLR is "redo-only": It contains only the value that we are undoing to (the old value from the update that the CLR corresponds to). We'll get to what happens when you discover a CLR during the undo process soon – but you don't undo the CLR itself in any sense. In addition to the value, the CLR contains a nextUndoLSN pointer to the update just before the update that the CLR corresponds to. In other words, the prevLSN of the corresponding update, which may be null (if we just undid the first update of the transaction). It says what should be undone next after this CLR.
- Perform the action represented by the CLR. This means writing the old value from the update log into the page and updating the pageLSN (which is safe, because redo is already complete).
- Add the prevLSN of the update (the same LSN stored in the CLR) to the set of things to undo, unless prevLSN is null.
- If this is the first update in the transaction, write a transaction-end token to the WAL: Rollback successful!
If a CLR is encountered, then the system must have crashed while recovering/rolling back this transaction. Any other case is impossible: If the roll back process had not yet begun, there would be no CLR. If rollback was complete, there would be a transaction-end token in the WAL and this transaction would never have ended up in the set of outstanding transactions. It is possible that this was the last CLR corresponding to the transaction, which means (after the redo) that the disk is now in the fully rolled-back state for the transaction. The crash must have occurred either while actually undoing the update corresponding to this final CLR or just before/while writing the transaction-end token to the WAL. Either way, the system crashed before the undo was complete, which is defined as after the transaction-end token has been successfully written to disk.
If a crash occurs between steps 1 and 2, the undo during next recovery will
still proceed normally. The lastLSN
will point to the correct CLR because
it's determined empirically during analysis. And this CLR which was
interrupted during the second crash will be carried out by the redo phase.
What do we actually do when we discover a CLR during undo? We add its nextUndoLSN to the set of things we want to undo at some point. We don't actually modify any pages based on a CLR – it's "redo-only".
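Putting the undo steps together, here's a hypothetical sketch of the loop (the Log/Storage interfaces and record fields are invented; real ARIES tracks much more state):

import java.util.TreeSet;

class AriesUndo {
    record Record(String kind, long txn, String element,
                  Object oldValue, Long prevLsn, Long nextUndoLsn) {}
    interface Log {
        Record fetch(long lsn);
        long appendClr(long txn, String element, Object oldValue, Long nextUndoLsn);
        void appendTxnEnd(long txn);
    }
    interface Storage { void write(String element, Object value, long newPageLsn); }

    static void undo(TreeSet<Long> toUndo, Log log, Storage storage) {
        while (!toUndo.isEmpty()) {
            long lsn = toUndo.pollLast();        // always undo the newest thing first
            Record r = log.fetch(lsn);
            if (r.kind().equals("CLR")) {
                // Redo-only: never undo a CLR, just follow its nextUndoLSN.
                if (r.nextUndoLsn() != null) toUndo.add(r.nextUndoLsn());
                else log.appendTxnEnd(r.txn());  // nothing left to undo for this txn
                continue;
            }
            // 1. Write-ahead the compensation record.
            long clrLsn = log.appendClr(r.txn(), r.element(), r.oldValue(), r.prevLsn());
            // 2. Restore the old value and bump the pageLSN.
            storage.write(r.element(), r.oldValue(), clrLsn);
            // 3. Queue this transaction's previous update, or declare it rolled back.
            if (r.prevLsn() != null) toUndo.add(r.prevLsn());
            else log.appendTxnEnd(r.txn());
        }
    }
}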
3.4.6 Log Flushing
Sometimes, when an entry is added to the log, it is not necessary to instantly write that log update to disk. Here's all the times the log is updated and whether or not it needs to be flushed immediately:
Operation | Flush? | Why |
---|---|---|
Update | No | The page is dirty, so the update will disappear from both the page and the log in case of a crash. |
Undo/CLR | No | A CLR is just an update – it will get undone in a crash, along with the dirty page. |
Output page | Yes | Duh. Optimization: Only flush up to pageLSN (last update or CLR to this page, updated already) |
Input page | No | Duh, this is a read operation until we modify it. |
Checkpoint | No | Checkpointing does not change the data on disk. Or, we can always use a different checkpoint. |
Commit | Yes | Duh. Must flush full log (through the commit) |
Rollback | No | Any uncommitted transaction will be rolled back during recovery anyway. |
Start txn | No | Not knowing the transaction ever started is the same as rolling it back. |
3.4.7 End Records
When committing or aborting, there is a COMMIT
or ABORT
record that gets
written to the WAL as well as an END
record. The END
record means that
the transaction is really completely over and that no more records related
to that transaction will ever be written to the log. The END
records are
used during analysis to reconstruct the running transactions table. What
happens between commit/abort and end? For commit, the WAL is flushed to disk
and some bookkeeping (presumably on disk?) happens. For example, the
database might keep a list of all transactions that have ever committed –
adding to that list would happen between commit and end. If there's a crash
between commit and end, the bookkeeping still gets done. The application
gets confirmation that the transaction has committed before the bookkeeping,
though.
For abort, all the CLRs are created between abort and end. I'm not entirely sure why we need the abort at all in this case, because we would have rolled back the transaction anyway in case of a crash, and the end record on its own is enough to determine the set of running transactions during analysis.
4 Crash course: Java concurrency
4.1 Intrinsic locks and synchronization
Every object has an "intrinsic lock". Any synchronized operation on that object acquires the intrinsic lock, performs the operation, and releases it. If you want to simply and explicitly make your own lock, create an empty Object and synchronize on it.
Anything that happens while a lock is held has a happens-before relationship with anything that happens after the lock is released.
A synchronized method holds the intrinsic lock on the object for the entirety of its duration. A section of a method can be synchronized on its own, too. A static synchronized method is synchronized on the reflective Class object.
Locks are reentrant. I.e, if you already hold a lock, you may acquire it again at no cost.
Simple getters and setters may be synchronized, even though those operations are guaranteed to be atomic. This is to prevent too much reordering of operations by the OS (memory inconsistency) that would result in the old value being read after a write.
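For example (a toy class, not from any assignment):

class Temperature {
    private double celsius;                   // guarded by this

    public synchronized double getCelsius() { // synchronization here is for visibility,
        return celsius;                       // not atomicity
    }

    public synchronized void setCelsius(double c) {
        celsius = c;
    }
}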
4.1.1 Example of a Really Bad Thing
From https://stackoverflow.com/a/11459843/1233320
while (sth.get() > 1) {
    Thread.sleep(1);
}
could be compiled to
if (sth.get() > 1) { while (true) { Thread.sleep(1); } }
if sth.get()
is a non-synchronized getter. When accessing a value without
any guarantees, the only thing you know for sure is that you are reading a
value of it that really existed at some point in time. The optimizer may
validly assume it never changes. This is almost never desirable.
4.2 Atomic operations
Reading and writing variables that aren't long or double is atomic (this includes reference/class variables). volatile has atomic reads and writes for long and double too.
volatile not only enforces atomicity but also installs happens-before relations between any two operations. I.e, after a write to a volatile, all subsequent reads from any thread will read the updated value.
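A classic illustration of that happens-before guarantee (toy example):

class VolatileFlag {
    private int data;
    private volatile boolean ready;

    void writer() {                   // run in one thread
        data = 42;                    // ordinary write...
        ready = true;                 // ...made visible by the volatile write
    }

    void reader() {                   // run in another thread
        if (ready) {                  // volatile read
            System.out.println(data); // guaranteed to see 42, not a stale 0
        }
    }
}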
4.3 Livelock
When two threads respond to each other, they may get in an infinite inter-thread loop. Like passing people in a hallway (they use this example in the java documentation).
4.4 High-level objects
4.4.1 Locks
Locks, such as java.util.concurrent.locks.ReentrantLock, have a tryLock() method that fails immediately (or after a timeout) if the lock cannot be acquired. This can be used to mitigate deadlock, unlike with simple intrinsic locks and synchronization.
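A small usage sketch (the 50 ms timeout is arbitrary):

import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

class TryLockExample {
    private final ReentrantLock lock = new ReentrantLock();

    boolean doWork() throws InterruptedException {
        if (!lock.tryLock(50, TimeUnit.MILLISECONDS)) {
            return false;         // back off (or retry later) instead of blocking forever
        }
        try {
            // ... critical section ...
            return true;
        } finally {
            lock.unlock();
        }
    }
}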
4.4.2 Collections
- BlockingQueue: Replaces Queue
- ConcurrentHashMap: Replaces HashMap
- ConcurrentNavigableMap: Replaces TreeMap
4.4.3 Primitive Replacements
AtomicInteger, for example, has an atomic increment method.
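E.g., a lock-free counter:

import java.util.concurrent.atomic.AtomicInteger;

class HitCounter {
    private final AtomicInteger hits = new AtomicInteger();

    int record() {
        return hits.incrementAndGet(); // one atomic read-modify-write, no lock
    }
}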
5 Buffer Pool
Can store a certain number of "pages", which are often the same size as disk blocks (usually 8k). Each spot where a page can be stored is called a "frame", regardless of whether it actually is storing a page right now. Eviction strategies (what to do when all frames are in use and we want to cache a new page) include random, LRU, and the "clock method" (??)
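A toy LRU frame table in Java (not SimpleDB's actual buffer pool; a real pool must also check the dirty bit and write the page back before evicting):

import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

class LruBufferPool<PageId, Page> {
    private final Map<PageId, Page> frames;

    LruBufferPool(int numFrames) {
        // Access-ordered LinkedHashMap: the eldest entry is the least recently used.
        this.frames = new LinkedHashMap<>(numFrames, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<PageId, Page> eldest) {
                return size() > numFrames; // evict once every frame is occupied
            }
        };
    }

    Page getPage(PageId id, Function<PageId, Page> readFromDisk) {
        return frames.computeIfAbsent(id, readFromDisk); // hit: reuse; miss: load
    }
}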
6 On-Disk Format
SimpleDB will have one file per relation.
Pages are usually 8KiB
6.1 Page format
6.1.1 Fixed-length tuples, simple
An array of "slots", which each could contain a tuple. Record IDs point to a page, then a slot. This means we cannot move around tuples to different slots (eg, to defragment after removing a tuple in the middle of a page), because it would invalidate all the record ids from indices, etc. So, typically you just put a tombstone where you deleted the tuple to mark it as unused space when iterating over the page. The only header information we need is the total number of in-use slots in the page.
6.1.2 Variable length, slot directory
When variable-length tuples are allowed, the header has pointers to each tuple and the length of the tuple being pointed to. Record IDs now point to the location in the header/slot directory, not the actual location of the main tuple. We are still prohibited from re-using a "pointer slot" in the header, but the tuples (which are much larger) can be re-arranged at will. Each pointer is small: 1 bit to indicate whether the slot is dead, then more bits to indicate the location and length of the tuple.
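A rough in-memory picture of such a page in Java (field sizes and names are illustrative, not SimpleDB's actual page classes):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SlottedPage {
    // One header entry per tuple: a dead bit plus the tuple's offset and length.
    record Slot(boolean dead, int offset, int length) {}

    private final List<Slot> directory = new ArrayList<>();
    private final byte[] data = new byte[8192]; // 8 KiB page body

    // A record id is (pageId, slotIndex). The slot stays put even if the
    // tuple bytes are moved during compaction; only the offset changes.
    byte[] readTuple(int slotIndex) {
        Slot s = directory.get(slotIndex);
        if (s.dead()) throw new IllegalStateException("tuple was deleted");
        return Arrays.copyOfRange(data, s.offset(), s.offset() + s.length());
    }
}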
6.2 Heap file formats
When talking about heap file formats, we're most interested in how hard it is to insert a new tuple. That's because deletion and selection don't differ much between the methods; that requires a sequential scan unless an index exists, regardless of the heap file format.
6.2.1 Simple
Just store all the pages, record how many there are. Requires a scan of all pages, until finding an empty one, to perform an insert.
6.2.2 Dual linked lists
One method: Each disk file contains, in its header, two lists of pages: One of totally full pages, and one of partially full pages. When inserting a tuple, you insert into the first of the partially full pages. When inserting into a page, you consider whether to move it to the full pages list. When deleting from a page, move it to the partial list. Only when the partial list is empty should you create a new page.
6.2.3 Page directory
If we not only keep a list of the partially full pages but also store how much space they have open, it can speed up insertions when we have variable length tuples (don't need to iterate until finding a page that has enough empty space) or inserting many tuples (faster to insert them all into a single page).
A point here, which also applies to indices and stuff, is that this page directory is, itself, stored in pages, and will frequently exceed the size of a single page!
7 Indices
7.1 Clustered
A table with a clustered index usually still has an index structure stored separately from the main table structure – this allows a faster search through the index before seeking to the correct spot on the main disk (it's probably possible to store the whole index in memory).
Clustered indices are often stored in the same file as the table. Usually, all primary keys have a clustered index, so it makes sense to tightly couple the clustered index with the table.
7.1.1 Sparse
A "dense" clustered index has pointers to every tuple in the main table. The sorted nature of the main table means that our index could contain pointers to, say, every other tuple in the table structure. If we want to fetch a key that's not in the index, we find the key just less than it, go to its location in the table structure, then iterate forwards until we find our specific key. A sparse clustered index typically stores one entry for each page of the database, since entire pages are read at once anyway. Ie, the index simply tells you which page to look for, not which tuple.
7.2 Unclustered
7.2.1 B-Tree
Like a binary tree, but each node can have many children. All leaf nodes are at the same level. Each node has an alternating pattern of pointers and boundary values. The pointer between two boundary values points to a node that will eventually lead to all leaf nodes with a value between those two boundary values.
B-trees are good for range queries: Values inside a leaf node (and boundary values inside a non-leaf node) are stored in sorted order. Leaf nodes are a sort of doubly linked list; each leaf node contains pointers to the leaf nodes with values immediately greater than and less than the maximum and minimum values in this leaf node.
A typical fan-out is 100, and a typical fill factor is 67%. Recall that the minimum bound on the fill factor is 50%, for non-root nodes.
- Selection
if this is a leaf node
    scan through the node for value
    return what we found
for pointer, trailing_boundary_value
    if value < trailing_boundary_value
        recurse into pointer
recurse into the very_last_pointer
- Insertion
Beginning from the leaf node where this value should exist:
if there is space in node
    insert value into node (this involves choosing a boundary value)
else
    create a new node
    move the upper half of node into new_node
    add value to whichever node is appropriate
    if node is a root node
        create a new root node with node and new_node as children
    else
        recurse on parent node, passing new_node as value
- Deletion
Starting at the leaf node containing this value
remove value from node
if size(node) < d/2 and node is not root
    if a sibling has > d/2 values
        steal the one closest to our data range
        adjust parent boundary
    else (this means the sibling is at exactly d/2)
        merge this and the sibling into a nearly-full node
        recurse into parent, telling it the sibling got deleted
Why does evenly redistributing tuples between the underflowed node and a sibling always work? It works because the average fill of the two nodes is at least the minimum (d/2): the underflowed node has d/2 - 1 values and the sibling has more than d/2, so after redistribution both end up with at least d/2.
- Choosing the degree
The maximum number of values in a node is 2*d. So, the total size (in bytes) is keysize*2*d + pointersize*(2*d + 1). A worked example follows this list.
- Transactions
An insertion or deletion could potentially update the whole tree, back up to the root. Thus, using the insertion and deletion algorithms described above would require a lock on the root, which would prevent any concurrent changes to a table with an index.
There's something called tree locking which works around this a little bit? Or maybe it just lets you read certain leafs while writing other leafs, and not concurrent writes?
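Worked example for choosing the degree, assuming 8 KiB pages, 4-byte keys, and 8-byte pointers (made-up but typical numbers):
\[ \underbrace{4 \cdot 2d}_{\text{keys}} + \underbrace{8 \cdot (2d+1)}_{\text{pointers}} \le 8192 \;\Rightarrow\; 24d + 8 \le 8192 \;\Rightarrow\; d \le 341, \]
so each node can hold up to \(2d = 682\) keys, which is where fan-outs in the hundreds come from.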
8 Query Optimization and Execution
8.1 Joins
8.1.1 Costs
- Hash join: B(R) + B(S)
- Sort-merge join: You can usually integrate the join algorithm with the sorting algorithm; the last pass of a merge sort outputs sorted tuples, so you can store the results of the penultimate sorting pass on each table on disk, then emit tuples from the "final pass" for each table on-demand as the final join merge dictates.
- Nested Loop (By Tuple, no index): B(R) + T(R) * B(S)
- Nested Loop (unclustered index): B(R) + T(R) * T(S) / V(S, a). Recall that 1/V(S,a) is the selectivity factor for an equality filter on a given attribute. We assume that all matching tuples are in different pages.
- Nested Loop (clustered index): B(R) + T(R) * B(S) / V(S, a). Like with an unclustered index join, we loop through the values of a from table R, doing an equality filter on table S. This time, though, we know that all the matching tuples in table S will be adjacent on disk, taking up the minimum number of blocks.
- Nested Loop (By page): B(R) + B(R) * B(S)
Nested Loop (By Block): We can decrease B(R) and B(S) by having multiple pages in the same block, where a block just has to fit comfortably in memory. We have to multiply the result by the number of pages per block, but if that number is N, the total cost is \(N \left( \frac{B(R)}{N} + \frac{B(R)}{N} \cdot \frac{B(S)}{N} \right) = B(R) + \frac{B(R) \cdot B(S)}{N}\), a substantial improvement, since the right side was the main problem to begin with. Note that if \(N=B(R)\) then the cost is just \(B(R)+B(S)\), which is the same price as a hash join! Which is correct, if you don't care about CPU. If \(N=B(S)\), then the cost is \(2*B(R)\). Wut? I think it might be related to
Alternatively: B(R) + B(R)/(M-1) * B(S), where M is the number of pages that fit in memory. This assumes we read M-1 pages of R at a time, then read through S one page at a time. But, as you can obviously tell from the equation, if we read M-1 pages of S at a time and did single-page loops through R, the cost would be the same; it's sorta symmetrical (apart from B(R) term at the front).
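To make the block nested loop numbers concrete, with made-up sizes \(B(R)=1000\), \(B(S)=500\), and \(N=100\) pages per block:
\[ \text{page-at-a-time: } B(R) + B(R) \cdot B(S) = 1000 + 500{,}000 = 501{,}000 \text{ I/Os}, \qquad \text{block-at-a-time: } B(R) + \frac{B(R) \cdot B(S)}{N} = 1000 + 5{,}000 = 6{,}000 \text{ I/Os}. \]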
8.1.2 Partitioned Hash Join (Grace Join)
First, hash all tuples from each table (keeping them segregated from the other table) with a bucket size less than M (in pages). Write all the buckets to separate files on disk. Then loop through the corresponding buckets from the two tables in pairs. We essentially do a (non-partitioned) hash join on these pairs of buckets, if memory permits (and it will, if B(R)+B(S) < \(M^2\)): Take one bucket, hash it using a much finer algorithm (smaller bucket size), then match against elements from the other table's bucket one by one.
How do you get the buckets (the "partitioning" phase)? You create M-1 empty pages in memory, then hash and distribute the tuples of table R one at a time. Once you've exhausted R, write the partitions to disk, then repeat for table S.
Question for office hours or something: What if the distribution on the key is not uniform, so that some buckets are much larger than others? How can we be so sure that a bucket will fit in a page?
8.2 Sorting
Multi-way merge sort is the preferred method. It tends to take \(N*log_M(B)\) disk I/Os. Merging more than 2 runs is trivial.
First, sort in-memory the largest possible number of tuples at once. Unfortunately, each of these sorted runs will be as large as main memory, so we cannot merge the entire runs at once – instead, we only hold one page of each run in memory at once while merging – so we can still merge many sorted runs at once.
Each of the runs generated by the initial procedure of loading as much as we could into memory then sorting had a length of M (measured in pages). Then, we can merge about M runs at once (one page from each, leave one page of space for output). So, in the described two-pass algorithm for sorting, the output run will be up to \(M^2\) pages in size. Alternatively: we can sort up to \(M^2\) pages of data with only two passes!
Because two-pass sorts are almost always possible, the cost of a sort is 3*B(R), because we must read, write, then read, and the output write isn't considered (eg, it may be streamed to the next operator).
8.2.1 Merge Joins
The algorithm is slightly modified from a simple sort. First, runs from different relations are never combined. And in the final pass, we do not output tuples in order; we only output tuples when two runs have the same key, and we must output all possible pairings of tuples.
The "all possible pairings" part is important, because although a run has tuples from exactly one relation, a relation may have tuples spread out across multiple runs. The final pass combination algorithm must be aware of this.
Generally, you can do a merge join in two passes if B(R)+B(S) < \(M^2\), but really it must be a bit less because a run cannot have tuples from two relations – so a number of runs equal to the number of relations being joined may not be full, meaning wasted space that cuts into the \(M^2\). This is negligible because it scales with the number of relations being joined.
Cost: 3(B(R)+B(S))
8.3 Cardinality Estimation
Only covering what wasn't covered so much in 414…
8.3.1 Assumptions
- Uniformity
Assume that tuples are equally distributed among all unique values (or, for a numerical field, that they are distributed uniformly along the range from min to max).
- Independence
Two selectivity factors can be multiplied together. I.e, the distribution of some attributes is not correlated with the distribution of any other.
- Containment of Values
In an equijoin where one table has fewer distinct values of the key than the other, we assume ALL of the unique values of the key in the table with fewer unique values are present in the larger table. I.e., the smaller set of key values is contained in the larger one.
- Preservation of Values
A join does not remove any of the unique values of any attributes of its children. e.g, if we join hospitals with coronavirus patients, and each hospital has a country field, we would expect that the number of unique countries stays the same. This would only be false if there is a country where no hospitals have a coronavirus patient.
8.3.2 Joins
Depends on the "containment of values" assumption.
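For reference, the standard estimate from 414 (this is exactly where containment of values gets used): for an equijoin on attribute \(a\),
\[ T(R \bowtie S) \approx \frac{T(R) \cdot T(S)}{\max\big(V(R,a),\, V(S,a)\big)}. \]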
8.3.3 Histograms
Store ranges of some key vs the number of tuples in that key range.
- Equal width
- Each range is the same width. Some ranges may have many more tuples than others.
- Equal depth
- Each range has about the same number of tuples in it. Some ranges have less keys than others.
8.4 Relational Algebra Manipulations
8.4.1 Generalized Distributivity
eg, a SUM aggregate can be distributed into a join – we can sum on both sides!
SELECT *, SUM(S.A) FROM R INNER JOIN S USING (C) GROUP BY R.B
Because the group column is different than the aggregate column, it might at first seem like we're already out of options. But we can improve a little bit.
SELECT *, SUM(S_sub.sum)
FROM R INNER JOIN (
    SELECT *, SUM(S.A) AS sum
    FROM S
    GROUP BY S.C
) AS S_sub USING (C)
GROUP BY R.B
We know that the grouping due to C is at least as fine as the grouping due to R.B will be. This inevitably means more seeks due to calculating the aggregate, but the cost savings in the join are often worth it.
A super clear example of when the benefit outweighs the cost:
SELECT COUNT(*) FROM big_table x, big_table y
if we count first, the join involves only two tuples (one count from each side). If we join first, the join involves count(big_table)² tuples.
Generalized distributivity mostly relies on associativity, so that the results from children can be smoothly combined.
A more complex example:
SELECT SUM(A.a * B.b) FROM A INNER JOIN B USING (c)
becomes:
SELECT SUM(A_agg.acc * B_agg.acc)
FROM (SELECT c, SUM(a) AS acc FROM A GROUP BY c) AS A_agg
INNER JOIN (SELECT c, SUM(b) AS acc FROM B GROUP BY c) AS B_agg
USING (c)
Assume two rows in each table, all matching on the join attribute. Then \[a_1*b_1 + a_1*b_2 + a_2*b_1 + a_2*b_2 = (a_1 + a_2) * (b_1 + b_2)\] You can see that upon expansion of the RHS, all the join pairs will break out. We can collapse all tuples from a table that share a join key because upon multiplication they will un-collapse.
8.4.2 (Equi)Join removal
If…
- The attributes from one table are discarded (projected away later, not used before then).
- The join attribute on the discarded relation is a primary key (no duplicates).
- The join attribute in the preserved relation is a foreign key to the join attribute in the discarded relation.
- The join attribute in the preserved relation is NOT NULL.
…then the join with the discarded relation can be removed entirely.
8.4.3 Left & Right-Deep Trees
How joins are thought about. Each join has a physical table on the right. Most join operators aren't implemented symmetrically (i.e, they go once over one table then either probe or repeatedly scan over the other). So, having the result of a previous join (if any) on the left is amenable to pipelining.
8.5 Selinger's Algorithm (System R)
How to optimize the order of many joins. Dynamic programming. Calculate the cost of accessing individual tables first, with all possible access methods. Egs:
- Sequential Scan
- Index access
- Range access
Then calculate the cost of each possible join of two tables. For each pair, choose the cheapest join method, by analyzing the costs for the different types of access methods calculated during the first pass (and also deciding which table should be on the left and right). Although this phase will analyze multiple join algorithms on the two subtables, it will only output a single cost – the cheapest one it found. It also saves the estimated cardinality of its output, because upper levels need that to compute their own costs.
Continue recursing, ie, finding the cheapest plans for all possible joins of three tables, by looking at combinations of twos and ones. We will have to consider multiple access methods on ones, though not on twos! Notice how it never makes sense to put the single table on the left, for we wouldn't be able to take advantage of any special access methods.
For the fourth level, we might analyze combinations of 3+1 and 2+2. If we look at 2+2, we are considering "bushy trees".
When we get to the top level, where the number of tables being joined is equal to the number of tables requested in the SQL query, we will have the cheapest cost.
8.5.1 Interesting Orders
In the vanilla Selinger's algorithm, the single tables have multiple access methods listed, but groups of tables have but a single plan (the cheapest). With interesting orders, we additionally keep, for each group of tables, the cheapest plan producing each potentially useful sort order (e.g., sorted on an attribute used in a later join or GROUP BY), because a slightly pricier sorted plan can enable a cheaper merge join or avoid a sort later.
8.5.2 Heuristics
- Do not consider cartesian products: joins of tables without a common key (eg, when joining A, B, C, where A and B share a key and B and C share a key but A and C share no keys, a join between A and C at the two-table level would not be considered).
- Left-deep only: Do not consider bushy trees (eg, ignore 2+2 joins at the 4 level) and always put the single table on the right to take advantage of its access methods.
- Push down selections: Each possible join outputs its cost and estimated cardinality. Include any applicable selections in that cardinality (though be careful to make sure you don't double-count selections that were performed in children even further down!)
9 Parallel and Distributed Databases
9.1 Terminology
Parallel usually refers to OLAP: Online Analytical Processing, i.e, big data queries. Distributed usually refers to (not replicated) OLTP: Online Transaction Processing, eg, when updates and inserts are happening and being intermixed, like with real world applications.
9.2 MapReduce
The Map function takes a single key-value pair as input and outputs any number of intermediate-key - value pairs.
The Reduce function takes pairs of intermediate keys and bags of values that Map outputted with that intermediate key.
If reduce is associative, we can run reduce once on each machine to hopefully slightly reduce the number of pairs, then run it again on a central server.
9.2.1 Example: Count words
Map: Produce an intermediate pair with key = word and value = # of occurrences for each word in the value.
Reduce: Sum all the intermediate values. If they all came directly from a Map, they will all be 1.
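The shape of those two functions in Java (the emit/collect plumbing is invented; this isn't any particular framework's API):

import java.util.ArrayList;
import java.util.List;

class WordCount {
    record Pair(String key, long value) {}

    // Map: emit (word, 1) for every occurrence of a word in the document.
    static List<Pair> map(String docId, String contents) {
        List<Pair> out = new ArrayList<>();
        for (String word : contents.split("\\s+")) {
            if (!word.isEmpty()) out.add(new Pair(word, 1L));
        }
        return out;
    }

    // Reduce: sum the bag of values that share one intermediate key.
    static Pair reduce(String word, List<Long> counts) {
        long sum = 0;
        for (long c : counts) sum += c;
        return new Pair(word, sum);
    }
}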
9.3 Concurrency Control
Real systems still use 2PL, but have to communicate across machines to know about locks on things. There's an added aspect, though, about determining whether to commit or abort at the end of a transaction: when the user sends a COMMIT command, it may reach only one of the machines involved, and all of them must agree on whether the transaction commits or aborts.
9.3.1 2-Phase Commit
We assume we have a reliable "coordinator" machine. There's a PAXOS algorithm for an unreliable coordinator. Redundant disks and such reduce the need.
The user sends a COMMIT
message to the coordinator. The coordinator
proceeds in 2 phases:
- Prepare: Ask each subordinate whether it would be able to handle a commit for this txn. Each subordinate, upon receiving the prepare message, writes "Prepare <txid>" into its redo log before responding "Yes" to the coordinator.
- Commit: Write "commit" into the coordinator's redo log, then send a commit message to each subordinate until it returns ACK (if it's crashed, we have to continue sending until it comes back up).
If some subordinate reports, during the prepare phase, that it cannot go through with the commit, then we just send rollback commands to all nodes involved in the transaction.
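A bare-bones coordinator sketch of the two phases (the interfaces are invented; timeouts, logging of an abort decision, and subordinate-side recovery are omitted):

import java.util.List;

class TwoPhaseCommitCoordinator {
    interface Subordinate {
        boolean prepare(long txn);  // force-writes "Prepare <txn>" before answering yes
        boolean commit(long txn);   // true once the subordinate ACKs
        void rollback(long txn);
    }
    interface Log { void forceWrite(String entry); }

    static void run(long txn, List<Subordinate> subs, Log log) {
        // Phase 1: everyone must promise they can commit.
        for (Subordinate s : subs) {
            if (!s.prepare(txn)) {
                for (Subordinate t : subs) t.rollback(txn);
                return;
            }
        }
        // Phase 2: the decision is now durable; retry each subordinate until it ACKs.
        log.forceWrite("COMMIT " + txn);
        for (Subordinate s : subs) {
            while (!s.commit(txn)) { /* subordinate crashed: keep trying until it's back */ }
        }
    }
}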
Recovery: In class, we assume that the coordinator only sends out prepare, abort, or commit messages once to each subordinate; i.e, it does not poll them until they acknowledge. This makes the recovery process both simpler and more complex. Per subordinate:
- If the last entry is "prepare", then the subordinate does not know whether the coordinator decided to commit or abort; it must hold the transaction's locks and ask the coordinator (or wait to be told) what the outcome was.
9.4 Replication
Can be lazy/asynchronous (things update as needed) or eager/synchronous (push updates, everybody is expected to be in sync). Can also be master/slave or group/multi-master.
9.4.1 Goals
- Consistency: Servers will give roughly the same results to queries.
- Availability: Requests are always satisfied.
- Performance
9.4.2 Single-Item Serializability
Weaker than serializability. Enforces serializability on all operations on each item. So, element A will always be left in a state that could have resulted from a serial schedule, and element B will always be left in a state that could have resulted from a serial schedule, but there may not be a serial schedule that would result in the states of both A and B.
9.4.3 Synchronous/Eager Replication
Updates are given to all hosts as part of the transaction, so by the time the transaction commits, all hosts are ready. Use global 2PL locks to avoid conflicts then 2-phase commit.
- Master
Master: Each object gets assigned to its own master node. You could even select a different master for each transaction. Masters maintain the global lock on the item. When a master crashes, a new one is elected (with PAXOS, usually). But what if the master's network goes down temporarily, causing it to be disconnected, causing an election, causing a second master, which may conflict when the first master comes back up? We only continue with a majority of the nodes, then, and hope that the rest figure it out (hand wavey hand wavey)
We can use a per-item master to control 2-phase-commit as well (not just 2PL locks) because we only care about single-item serializability. But we still want overall recoverability, not just single item serializability, right?
- Group/Multi-Master
Each replica maintains a lock manager for each item. Set (at compile time) a number X of locks that must be acquired to have an effective exclusive lock and a number S of locks that must be acquired to have an effective shared lock. To access the data, try to acquire locks on replicas until you have at least S of them (if you want to read) or X of them (if you want to write). By the rules of locks, \(2X>N\) (or else multiple transactions could hold exclusive locks) and \(X+S>N\) (or else an exclusive and shared lock could co-exist).
There are two extremes: "Majority Locking", where \(X=S=\lceil \frac{N+1}{2} \rceil\) (you might be able to make S one less, but that's not important if N is large), and The Other One, where \(X=N\) and \(S=1\). Majority locking makes it hard to read. The Other One makes it hard to write (even a single slow machine will hold up all writes on the item. But you have to update that machine anyway before you commit (release the lock), so it's the same speed?).
9.4.4 Asynchronous/Lazy Replication
Transactions commit after writing to a single replica. The writes happen eventually to the other machines.
- Master
Each item gets a master. Transaction updates the master. The master should push out the update to slaves. If the master fails before sending the updates to anyone, yeah, the committed transaction gets undone. If the master crashes after the updates have been sent to only some slaves, there are ways for the slaves to determine who has the latest version of the item and should be the new master.
Log Shipping is often the implementation. The master sends out the tail of the WAL to the replicas, which can execute it just like during recovery.
9.5 H-store
H-Store keeps the database in memory to avoid disk IO, with snapshots and logging for persistence.
All transactions are stored procedures that take parameters, so they are fully determined when they start. This allows a "command log" that simply stores a list of transactions and the parameters passed to them.
The database is partitioned, and each partition is given a single thread. Transactions are supposed to be optimized to run on a single partition. Additionally, since each transaction is a stored procedure, the core will be entirely busy during the transaction's execution – the stored procedure doesn't do disk IO or anything else that's blocking. Thus, we do not need locks! But 2-phase-commit can be used to lock across partitions if necessary.
Logging is very simple. Just record each transaction and its parameters before you start, then redo them on recovery! Occasional "snapshots" (complete images of the database) are stored to disk.
When a transaction needs access to multiple partitions, it gains complete control of each partition, to continue avoiding locks (though you still need 2PC). All locks to partitions are granted before the transaction starts executing at all. The DBMS makes a best guess at which partitions the txn will need access to. Timestamp-deadlock-detection-style stuff is used to handle transactions that try to lock the same partition simultaneously. If the transaction needs additional partitions than the one the DBMS guessed for it, the txn gets aborted and retried and the new partitions are requested during the retry. The txn is deterministic based on the parameters.
10 Terms
- Data Independence
- The application doesn't need to know about the physical storage of the data. Eg, why many self-hosted softwares can work on MySQL or SQLite, at your choice. Also covers, for example, changing the indices that exist or the type of machine the DB is running on, not just choice of DBMS. According to Dan, this is the main reason that non-relational data models have failed: They lack data independence.
- Arity
- Number of attributes/columns in a relation.
- Degree
- See Arity
- Instance
- Actual rows in a database, as opposed to the schema.
- Theta join
- Any join, with arbitrary condition. Usually used to refer exclusively to joins that aren't equijoins.
- Natural Join
- An equijoin on all same-named columns. Includes projection/renaming as necessary to include only one copy of each in-common column in the output, at least in RA.
- Hot Stand-By
- A backup for the master node in a distributed database. It's the designated survivor: If the master fails, there's no need to elect a new one – go straight to the hot stand by.