CSE 414 Notes

Table of Contents

These are Mark Polyakov's notes from Computer Science 414, Introduction to Data Management, at the University of Washington.

1 SQL

1.1 Joins

WHERE with multiple tables does an implicit cartesian join. Partnered with a WHERE clause that's similar to an ON statement, it turns into something pretty close to an INNER JOIN.

Self-Join: Join on the same table multiple times. Useful when we want to filter on multiple other rows. Say, we want to find all users who have a friend named "Jane" and a friend named "Jack" – we need to join with the friends table twice, WHERE F1.name = "Jane" AND F2.name = "Jack"

Equi-join: A subset of inner joins where only equalities are in the ON clause. A non-equi-join: We want to find all passengers whose age is greater than the senior age limit for a certain set of airlines. So, FROM airlines JOIN people

1.2 Aggregations

MAX, MIN, AVG, SUM

Logically, aggregates take many values and return just one.

GROUP BY: Partition table into sections where the combination of the given columns is unique, compute aggregates, then concatenate again.

HAVING: Filters the "groups" created by GROUP BY

So, the order:

  1. Partition/Group based on group by
  2. Calculate having filters and remove invalid partitions
  3. Calculate select filters and return

1.2.1 Witnessing Problem

We can't access columns not specified in a group by clause. We partition the table, return the common group by column value, and return the calculated aggregate functions, but we don't even know about on which row the aggregate function "matched" (this doesn't even make sense for AVG or SUM).

SELECT Name
FROM people AS p1 INNER JOIN people AS p2
ON p1.job = p2.job
GROUP BY p2.job, p1.salary, p1.name
HAVING p1.salary = MAX(p2.salary)

Each "group" has a single p1.name, but in p2 contains every other person with the same job.

Can also be solved with a subquery:

SELECT name
FROM people p
INNER JOIN (SELECT name, MAX(salary) as max_salary
            FROM people
            GROUP BY job) m
ON p.job = m.job
WHERE p.salary = m.max_salary

And, perhaps more intuitively, but less performantly (before optimization):

SELECT name
FROM people p1
WHERE p1.salary = (SELECT MAX(salary)
                   FROM people as p2
                   WHERE p1.job = p2.job)

1.3 Set operations

UNION, INTERSECT, and EXCEPT (set difference). Operate on two relations with identical schemas. The query to create the first relation appears immediately before the keyword and the query for the second after, with no parens. eg:

SELECT 

1.4 Views

Like a long-term WITH statement; can be queried like a table but automatically updates with other tables. Views can be optionally stored/cached on disk, though you are making a delicate performance trade here.

One application: Improve data partitioning. Split up a table with many columns into multiple tables with the same primary key but fewer other columns, so that the database can store them in different places, but then access them using a view that joins everything together. It's possible to go even further ("horizontal" partitioning) and give all the sub-tables the same number of columns as the original, but simply store disjoint sets in each – for example, one might store customers with an id number ending in 0, then another with id ending in 1, etc, then the view joins with UNION. I imagine that many database systems can do this sort of partitioning automatically, which simplifies UPDATEs and INSERTs, because you can't generally do that on a view.

Another application: Access control – you can give a user access only to a table that aggregates other data for research purposes, for example.

CREATE VIEW ShortFlights
    SELECT *
      FROM Flights F
     WHERE actual_time < 180

2 Subqueries

Where can they go?

Well, a subquery can return either a bag of tuples or a scalar. When returning a bag, it can be placed:

  • Instead of a table name in FROM clause
  • After VALUES in an insert

When a scalar, it can be placed:

  • …Anywhere a scalar can go?

Sometimes, we can put a single-column relation where a scalar should go, but we have to use a keyword ANY, ALL, IN, EXISTS, NOT IN, or NOT EXISTS to specify how it should treat that column vector as a scalar.

2.1 Correlated Subqueries

All subquery expressions can be written as non-subquery expressions where either:

  • We create a temporary table, insert the results of a subquery into it, then reference that table from a second query.
  • We perform a self join, then, in a loose sense, execute the subquery on the right half of the table. We create a copy of the table for each row in the table.

Option A is an "uncorrelated" subquery, obtion B is a "correlated" subquery; correlated subqueries are logically executed for every row being selected FROM in the parent clause.

2.2 Subqueries as self-joins, or the fundamental solution to the witnessing problem

3 Relational Algebra

Tables in-DB should have unique rows/tuples, but results of queries may have duplicates.

(input): Get all tuples from a relation.

τ (sort): ORDER BY some comma-separated columns in order of precedence.

π (projection): SELECT comma-separated columns.

σ (selection): filter to rows WHERE a certain condition is met.

δ (uniq): filter down to DISTINCT or unique rows.

γ (aggregation): Averages, maxes, minimizes, etc. Implicit projection (only listed columns preserved). Can return less rows than it is passed. Group by columns are listed right alongside aggregate functions when creating the actual function, but there is a fundamental logical difference between grouping and calculating aggregates. The relation is partitioned into relations with the same columns, but possibly less rows, where the combination of the raw column names is unique. For each partition, the aggregates are calculated. Then, each partitioned is returned as a row where each grouping column is a column in the output, with value identifying the partition it came from, and each aggregate is a column with value being the result of the aggregation.

x (cartesian product): All possible tuple combinations. Order of output relation will be the product of the orders of the two input relations.

bowtie / visual studio (join): cartesian product immediately followed by selection/filtering based on the listed conditions (ON clauses). A so-called natural join (with no filtering clauses listed) has implicit conditions that all columns that are present with the same name in both tables must be equal.

\\ (backslash): Set difference

U: Set Union

N (upside-down U): Set intersection

4 Cardinality Estimation

We are usually given T, the total number of tuples in a relation, and V, the number of unique values of a certain column exist among the tuples in a relation.

4.1 WHERE and the Selectivity Factor

When filtering with WHERE, we think of a "selectivity factor" – which percentage of the tuples will remain after the filter.

4.1.1 Ranges

Obviously, just range of allowed values over total range of possible values. If there are strict less than or greater than signs and the data type is integral, make sure to avoid off-by-one errors. Eg, WHERE sth < 10 when sth is distributed uniformly from 1 through 100, there are 9 allowed values for sth – 1, 2, 3, 4, 5, 6, 7, 8, and 9. It can also be unintuitive in the opposite direction: if data is from 0 to 100, inclusive, and we're selecting <= 10, then there are actually 11 possible values!

4.1.2 Equals

WHERE sth = val has a selectivity factor of 1/V(sth); this is the number of rows, on average, with sth equal to a given value (I just restated the problem).

4.1.3 AND

Assume independence, multiply selectivity factors for subclauses.

If conditions are mutually exclusive (disjoint), selectivity factor is zero. If one condition is always true when the other is true, the selectivity factor is the minimum from among its subclauses.

4.1.4 OR

Assume fully disjoint conditions, add selectivity factors for subclauses.

If conditions are independent, divide the sum of the selectivity factors by their product, just like in statistics. If conditions are fully overlapping, take the maximum selectivity factor from a subclause.

4.2 Joins

Specifically, we'll talk about inner equijoins. Let's name the common column `com`. There are a certain number of distinct values of `com`, `V(com)`. If one relation has a value of `com` that the other relation lacks, that value will not be present in the join output. So, we define the number of distinct values of `com` as `min(V(coma), V(comb))`. Now, for each value of `com`, we can calculate the number of rows as a cartesian join between the tables `T(A) * T(B)` followed by an AND selection between two equals, so a selectivity factor of `(1/V(coma) * 1/V(comb))`, meaning a cardinality of `T(A) * T(B) / (1/V(coma) * 1/V(comb)) * min(V(coma), V(comb))`, which can be simplified (we'll get to it).

Alternatively, let's start with all the rows of table A, or `T(A)`. For each row, we will match with `T(B)/V(B)` rows from the other table, so our output cardinality is `T(A) * T(B) / V(B)`

5 Entity/Relationship

5.1 Entity Classes

5.2 Relationships

Connect two entity classes; are a subset of the cartesian joins of those two entity classes. Eg, if we have employee and company, all possible "works for" relationships are the cartesian join of employees and companies: Every employee works for every company. A subset of that is true: These employees work for those companies.

Relationships can have attributes and data beyond just the primary keys of their connected entity classes – eg, we might include a field saying how many hours per week each employee works for their company.

It also makes sense to have multiple relationships between the same classes.

5.2.1 Arrows

Represent the n:n, 1:n, etc-ness of a relationship to its connected entity classes.

Line
Unrestricted
Arrow
Up to one
Semicircle
Exactly one
Explicit
Custom constraint

A relationship between employees and companies with an arrow to the company entity class would mean that an employee cannot work for more than one company.

6 Transactions

6.1 Conflicts

A conflict is a relationship between two operations, whether from the same transaction or different ones, that states their order cannot be changed without breaking the conflict-serializability of the schedule. It's a bit like the elementary row operations in linear algebra: Two schedules are conflict-equivalent iff one can be turned into the other only by swapping around non-conflicting operations. Note here that conflict-equivalent means not only that it seems like one transaction occured completely before the other but that the same transaction occured first. Two different conflict-serializable schedules representing the same two transactions, but executed in opposite orders, cannot necessarily reach each other by swapping non-conflicting operations.

In short: Swapping non-conflicting operations produces a schedule that leaves the database in the exact same state as the original schedule would have.

W = Write, R = Read

6.1.1 WW (Lost Write)

Two writes in one transaction are interrupted by a write from another transaction. Think of both of the Ws of the acronym as being inside the same transaction; they get interrupted by a third W from the other transaction.

6.1.2 WR (Dirty Read)

Read in one transaction happens while the database is in an inconsistent state in the middle of another transaction. The W and R occur in different transactions, in that order. Really, though, it should be a WRW, where the first and last W are in the same transaction.

6.1.3 RW (Unrepeatable Read)

Really should be RWR. The difference from WR, then, is that the intermediate W leaves the database in a consistent state, so the second R reads a consistent state. The problem is that it's a different consistent state than the first R read, so the first transaction may rely on the output of the second R being "compatible" with the first one (i.e, that reading the same row returns the same column values).

6.1.4 Phantom Read

Closely related to RW conflicts; I'd even call them a subset of RW conflicts. They occur when a query that should return a subset of another query returns a row that wasn't in the superset-query's, or when a query that should return a superset of another query returns a row that should have been included in the subset.

For example, if we select flights shorter than 60 minutes, insert a 45 minute flight in another transaction, then select flights shorter than 90 minutes.

Concisely, phantom reads are RW conflicts related to extra or missing rows, while normal RW conflicts have to do with updates on existing rows.

6.2 ACID

Atomic
All or nothing.
Consistent
Conform to constraints, such as foreign keys or data types.
Isolated
Each transaction has the same effect it would have had if all transactions were executed serially.
Durable
Handles interruption gracefully.

6.3 Two-Phase Locking

A mechanism to categorically prevent conflicts! It's simply this: Whenever a transaction will do anything with a resource (read or write), it locks that resource. And, a transaction does not acquire any locks after it has released its first lock.

Can we prove this works?

If there are conflicting operations on the same element, obviously, all of those operations from one transaction will have to occur before the other, because a transaction does not release its lock on that element until it is completely done with it. So, we can simplify our model of each transaction down to a sequence of "operations", one per affected element. Eg, R1(A), W1(A), R2(A) turns into O1(A), O2(A).

Now, we just have to make sure that O1 operations are either all before or all after their corresponding operations from O2. So, O1(A), O2(A), O1(B), O2(B) is OK, but O1(A), O2(A), O2(B), O1(B) is not.

Two transactions both operate on elements A..Z. For now, let's assume that both transactions lock all elements in the same order. Define transaction 1 as the transaction that acquires the lock on A before transaction 2. Well, transaction 2 can't lock A until transaction 1 has locked all of the elements, So, obviously A occurs first in 1. B was locked by transaction 1 before it released A, so B also occurs first in transaction 1. Etc.

What if the order of locking of A..Z differs between the transactions? Then a transaction can acquire some locks while the other acquires some other locks, until one transaction, transaction 1, tries to acquire a lock held by transaction 2. This might be a deadlock, which is outside the scope of this proof, but if it isn't a deadlock,

6.3.1 Strict 2PL

Only release locks just before/simultaneously with a ROLLBACK or COMMIT to prevent unrecoverable schedules (that cannot be rolled back).

6.4 Isolation Levels

SQL: SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED

6.4.1 READ UNCOMMITTED

No read locks. You might read data that will not exist, because it was set by a (different) transaction that will be aborted. I.e, dirty reads are allowed. Write locks are still done with strict 2PL to prevent WW conflicts.

6.4.2 READ COMMITTED

Acquire short-duration (not 2PL) read lock before reading, then release it immediately after reading. This ensures that there's no write lock on what we're reading. Since all writes occur with strict 2PL, this means that we only read data that's been committed by whatever transaction set it. However, if an entire transaction (2) updates and commits while (1) is running, a subsequent read will see the output of (2) that may not have been present in an earlier read (RW conflict, unrepeatable read).

Short-term "shared" locks (like acquiring an immutable reference in Rust) when reading, to summarize.

6.4.3 REPEATABLE READ

Prevents the 3 fundamental conflicts: WW, RW, and WR, by doing strict 2PL on reads as well as writes. So, that transaction (2) that updates a table between reads in the (1) transaction cannot acquire its write lock because the first read from (1) has the read lock!

This is basically the conflict serializability we've studied in class.

Phantom reads are still possible with insertion and deletion.

6.4.4 SERIALIZABLE

Either lock tables or predicates to prevent phantom reads. So, whenever we read something, we acquire a lock not only on the row(s) that are returned, but also on insertion or deleteion of any row that might affect its output.

So, if SELECT * WHERE age >= 18, then we acquire a predicate lock (shared lock) that prevents deletion or insertion of rows that match the predicate. SELECT MAX(age), in a smart database, would be a weird predicate lock that prevents deletion of the row with maximum age (unless there are multiple?) and prevents insertion of rows with age greater than the maximum. SELECT * WHERE userid = 78, if userid is a primary key, doesn't really require much predicate locking! It's already forbidden to duplicate the primary key, so we can't possibly insert a new row with the same userid. As for forbidding deletion, we only have to make sure this one row isn't deleted – that isn't really a predicate lock, just a row deletion lock. Generalizing on that last point a bit, deletion predicate locks could probably be implemented as locks on the returned set of rows plus for many simpler queries.

7 JDBC

7.1 Statements and queries

Three steps for any query:

7.1.1 Create a statement

For prepared statements, the query is stored in the statement like so:

PreparedStatement myStatement = conn.prepareStatement("SQL HERE");

For non-prepared queries, the SQL is passed in later, and conn.createStatement() is enough.

A statement should be close() 'd when it's no longer needed. This will also free results related to it. An alternate approach is to keep a small number of statements open for the lifetime of the application (common with precompiled prepared statements), which will all be automatically closed when the connection is closed (no need to close(), then).

7.1.2 Execute

Just do stmt.executeQuery(), passing a string of SQL as an argumen if this statement is not prepared. This returns a ResultSet.

stmt.executeUpdate() returns an integer, the number of rows changed, instead.

  1. Providing prepared statement values

    If the prepared SQL query has question marks, replace them like so:

    stmt.setInt(1, userIdVariable);
    stmt.setString(2, crypto.hash(userPassword));
    

    The first argument is the index of the question mark, starting at 1.

7.1.3 Read results

7.2 Transactions

Disable auto commit, do your queries, commit explicitly, then re-enable auto commit (if you want that to be the default in your application):

conn.setAutoCommit(false);
conn.createStatement().executeQuery("SELECT UPDATE DELETE BLAH");
conn.createStatement.executeQuery("DELETE UR MOM");
conn.commit();
conn.setAutoCommit(true);

There's also a conn.rollback() method.

8 Other

8.1 Monotonicity

A monotonic query preserves containment/subset relations between tables in its results. More explicitly, the tuples returned by the query will be a subset of the tuples returned by the query when the input relation is a superset of the original relation. It's a homomorphism: The Q (perform query) operation preserves subset relations.

Author: Mark Polyakov

Created: 2020-02-12 Wed 11:18

Validate