CSE 414 Notes

These are Mark Polyakov's notes from taking Computer Science 414, Introduction to Data Management, at the University of Washington during Winter 2020.

1 SQL

1.1 Joins

FROM with multiple tables does an implicit cartesian product. Partnered with a WHERE clause that plays the role of an ON condition, it turns into something pretty close to an INNER JOIN.
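
For example, these two queries should return the same rows (hypothetical people and friends tables):

SELECT *
  FROM people p, friends f
 WHERE p.id = f.user_id

SELECT *
  FROM people p
       INNER JOIN friends f ON p.id = f.user_id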

Self-Join: Join on the same table multiple times. Useful when we want to filter on multiple other rows. Say, we want to find all users who have a friend named "Jane" and a friend named "Jack" – we need to join with the friends table twice, WHERE F1.name = "Jane" AND F2.name = "Jack"
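
A sketch, assuming hypothetical users(id, name) and friends(user_id, name) tables:

SELECT u.name
  FROM users u
       INNER JOIN friends F1 ON F1.user_id = u.id
       INNER JOIN friends F2 ON F2.user_id = u.id
 WHERE F1.name = 'Jane'
   AND F2.name = 'Jack'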

Equi-join: A subset of inner joins where only equalities appear in the ON clause. A non-equi-join: We want to find all passengers whose age is greater than the senior age limit for a certain set of airlines. So, FROM airlines JOIN people with an inequality in the ON condition, sketched below.
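
A minimal sketch, assuming hypothetical airlines(name, senior_age_limit) and people(name, age) tables:

SELECT p.name, a.name
  FROM airlines a
       INNER JOIN people p ON p.age > a.senior_age_limit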

1.2 Aggregations

MAX, MIN, AVG, SUM

Logically, aggregates take many values and return just one.

GROUP BY: Partition table into sections where the combination of the given columns is unique, compute aggregates, then concatenate again.

HAVING: Filters the "groups" created by GROUP BY

So, the order:

  1. Partition/group based on the GROUP BY columns
  2. Compute the HAVING filters and remove failing partitions
  3. Compute the SELECT expressions (including aggregates) for each remaining partition and return them (sketched below)
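
A minimal sketch, assuming a people(job, salary) table like the one used in the next section:

SELECT job, AVG(salary) AS avg_salary
  FROM people
 GROUP BY job
HAVING COUNT(*) > 5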

1.2.1 Witnessing Problem

What if we, say, want to select all people who have income equal to the maximum income?

We can't access columns not specified in a group by clause, so we can't SELECT name HAVING income = MAX(income). This makes even less sense for, say, AVG or SUM.

SELECT p1.name
FROM people AS p1 INNER JOIN people AS p2
ON p1.job = p2.job
GROUP BY p2.job, p1.salary, p1.name
HAVING p1.salary = MAX(p2.salary)

Each "group" has a single p1.name, but p2 contains every other person with the same job.

Can also be solved with a subquery:

SELECT name
FROM people p
INNER JOIN (SELECT job, MAX(salary) as max_salary
            FROM people
            GROUP BY job) m
ON p.job = m.job
WHERE p.salary = m.max_salary

And, perhaps more intuitively, but less performantly (before optimization):

SELECT name
FROM people p1
WHERE p1.salary = (SELECT MAX(salary)
                   FROM people as p2
                   WHERE p1.job = p2.job)

1.3 Set operations

UNION, INTERSECT, and EXCEPT (set difference). Operate on two relations with identical schemas. The query to create the first relation appears immediately before the keyword and the query for the second after, with no parens. eg:

SELECT name
  FROM people
 WHERE age > 20
EXCEPT
SELECT name
  FROM people
 WHERE salary < 10000

Is roughly equivalent to:

SELECT name
  FROM people
 WHERE age > 20
   AND salary >= 10000

1.4 Views

Like a long-term WITH statement; can be queried like a table but automatically reflects updates to the underlying tables. Views can optionally be stored/cached on disk (a materialized view), though you are making a delicate performance trade there.

One application: Improve data partitioning. Split up a table with many columns into multiple tables with the same primary key but fewer other columns, so that the database can store them in different places, then access them through a view that joins everything back together. It's possible to go even further ("horizontal" partitioning) and give all the sub-tables the same columns as the original, but simply store disjoint sets of rows in each – for example, one table might store customers with an id ending in 0, another customers with an id ending in 1, etc. – and the view combines them with UNION (sketched below). I imagine that many database systems can do this sort of partitioning automatically, which simplifies UPDATEs and INSERTs, because you can't generally do those on a view.
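
A sketch of the horizontal case, assuming hypothetical Customers0 and Customers1 sub-tables split by the last digit of id:

CREATE VIEW Customers AS
    SELECT * FROM Customers0
    UNION ALL
    SELECT * FROM Customers1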

Another application: Access control – you can give a user access only to a table that aggregates other data for research purposes, for example.

CREATE VIEW ShortFlights AS
    SELECT *
      FROM Flights F
     WHERE actual_time < 180
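
The view can then be queried like any table (origin_city hypothetical):

SELECT *
  FROM ShortFlights
 WHERE origin_city = 'Seattle'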

1.5 Recursive CTEs

The only way to do recursion in SQL. Although the syntax uses all keywords that we already know, these keywords don't really act the way we expect and recursive CTEs cannot be understood with non-recursive SQL. Here's a simple recursive query that generates the numbers less than 10:

WITH range_10 AS (
  SELECT 1 AS n
           UNION ALL
  SELECT n+1 AS n
    FROM range_10
   WHERE n < 10
)
SELECT * FROM range_10
1
2
3
4
5
6
7
8
9
10

Once again, it's very important not to interpret the recursive syntax the same way you would any other query, or to try and slot it in nicely alongside the SQL you already know. Here's how this query is run:

  1. Execute the query to the left of the UNION ALL (the anchor query), and assign the results the name range_10.
  2. Execute the query to the right of the UNION ALL against the current range_10. Assign only the rows produced by this iteration (not the previous contents of range_10) as the new range_10, and repeat this step.
  3. Once the query to the right of UNION ALL returns the empty set, stop recursing. Union the rows returned by the anchor query and by every iteration together into the final range_10.

Analyzing our range query:

Iteration           range_10 contents AFTER query
(anchor)            1
1                   2
2                   3
3                   4
4                   5
5                   6
6                   7
7                   8
8                   9
9                   10
10                  (empty)
(after recursion)   1,2,3,4,5,6,7,8,9,10

The unintuitive thing here is that the intermediate range_10 bindings are very different than the final one.

Another example:

WITH F(a,b) AS (
     SELECT 0, 1
              UNION ALL
     SELECT b, a+b
       FROM F
      WHERE b < 100
)
SELECT a FROM F;
0
1
1
2
3
5
8
13
21
34
55
89

2 Subqueries

Where can they go?

Well, a subquery can return either a bag of tuples or a scalar. When returning a bag, it can be placed:

  • Instead of a table name in FROM clause
  • After VALUES in an insert

When a scalar, it can be placed:

  • …Anywhere a scalar can go?

Sometimes, we can put a single-column relation where a scalar should go, but we have to use a keyword ANY, ALL, IN, EXISTS, NOT IN, or NOT EXISTS to specify how it should treat that column vector as a scalar.
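
A couple of sketches (hypothetical people and jobs tables):

-- Scalar subquery placed where a scalar can go:
SELECT name
  FROM people
 WHERE salary > (SELECT AVG(salary) FROM people)

-- Single-column relation adapted with IN:
SELECT name
  FROM people
 WHERE job IN (SELECT job FROM jobs WHERE remote = 1)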

2.1 Correlated Subqueries

Every subquery expression can be rewritten as a non-subquery expression in one of two ways:

  • We create a temporary table, insert the results of the subquery into it, then reference that table from a second query.
  • We perform a self join, then, in a loose sense, execute the subquery on the right half of the table. We create a copy of the table for each row in the table.

The first behaves like an "uncorrelated" subquery, the second like a "correlated" subquery; correlated subqueries are logically executed once for every row being selected FROM in the parent clause.

2.2 Subqueries as self-joins, or the fundamental solution to the witnessing problem

3 Relational Algebra

Tables in-DB should have unique rows/tuples, but results of queries may have duplicates.

(input): Get all tuples from a relation.

τ (sort): ORDER BY some comma-separated columns in order of precedence.

π (projection): SELECT comma-separated columns.

σ (selection): filter to rows WHERE a certain condition is met.

δ (uniq): filter down to DISTINCT or unique rows.

γ (aggregation): Averages, maxes, mins, etc. Implicit projection (only listed columns are preserved). Can return fewer rows than it is passed. Grouping columns are listed right alongside aggregate functions when writing the actual operator, but there is a fundamental logical difference between grouping and calculating aggregates. The relation is partitioned into relations with the same columns, but possibly fewer rows, where each partition has a unique combination of the grouping columns. For each partition, the aggregates are calculated. Then each partition is returned as a single row, where each grouping column is a column in the output (with value identifying the partition it came from) and each aggregate is a column (with value being the result of the aggregation). E.g., SELECT job, AVG(salary) FROM people GROUP BY job corresponds to γ_{job, AVG(salary)}(people).

× (cartesian product): All possible tuple combinations. The order (cardinality) of the output relation is the product of the orders of the two input relations.

⋈ (bowtie; looks like the Visual Studio logo) (join): cartesian product immediately followed by selection/filtering based on the listed conditions (ON clauses). A so-called natural join (with no filtering clauses listed) has implicit conditions that all columns present with the same name in both tables must be equal.

\ (backslash): Set difference

∪ (union): Set union

∩ (intersection): Set intersection

4 Cardinality Estimation

We are usually given T, the total number of tuples in a relation, and V, the number of unique values of a certain column among those tuples.

4.1 WHERE and the Selectivity Factor

When filtering with WHERE, we think of a "selectivity factor" – which percentage of the tuples will remain after the filter.

4.1.1 Ranges

Obviously, just the range of allowed values over the total range of possible values. If there are strict less-than or greater-than signs and the data type is integral, make sure to avoid off-by-one errors. Eg, for WHERE sth < 10 when sth is distributed uniformly from 1 through 100, there are 9 allowed values for sth – 1 through 9 – so the selectivity factor is 9/100. It can also be unintuitive in the opposite direction: if data is from 0 to 100, inclusive, and we're selecting <= 10, then there are actually 11 allowed values out of 101!

4.1.2 Equals

WHERE sth = val has a selectivity factor of 1/V(sth); equivalently, T/V(sth) is the number of rows, on average, with sth equal to a given value (assuming values are distributed uniformly).

4.1.3 AND

Assume independence, multiply selectivity factors for subclauses.

If conditions are mutually exclusive (disjoint), selectivity factor is zero. If one condition is always true when the other is true, the selectivity factor is the minimum from among its subclauses.
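
E.g., WHERE age = 30 AND job = 'chef' with V(age) = 50 and V(job) = 20 (made-up numbers) has selectivity (1/50) * (1/20) = 1/1000.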

4.1.4 OR

Assume fully disjoint conditions, add selectivity factors for subclauses.

If conditions are independent, subtract the product of the selectivity factors from their sum (inclusion-exclusion, just like in statistics): s1 + s2 - s1*s2. If conditions are fully overlapping, take the maximum selectivity factor from among the subclauses.
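
E.g., two independent conditions with selectivities 1/4 and 1/2 combine to 1/4 + 1/2 - 1/8 = 5/8.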

4.2 Joins

Specifically, we'll talk about inner equijoins. Let's name the common column `com`. There are a certain number of distinct values of `com`, `V(com)`. If one relation has a value of `com` that the other relation lacks, that value will not be present in the join output. So, we take the number of distinct values of `com` to be `min(V(coma), V(comb))`. Now, for each value of `com`, the number of matching rows is `(T(A)/V(coma)) * (T(B)/V(comb))` – a cartesian product restricted to that value, i.e., `T(A) * T(B)` with a selectivity factor of `1/(V(coma) * V(comb))`. Summing over all `min(V(coma), V(comb))` values gives a cardinality of `T(A) * T(B) / (V(coma) * V(comb)) * min(V(coma), V(comb))`, which simplifies to `T(A) * T(B) / max(V(coma), V(comb))`.

Alternatively, let's start with all the rows of table A, or `T(A)`. For each row, we will match with `T(B)/V(B)` rows from the other table, so our output cardinality is `T(A) * T(B) / V(B)`
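
With made-up numbers: T(A) = 1,000, T(B) = 2,000, V(coma) = 50, V(comb) = 100. Then the estimated join cardinality is T(A) * T(B) / max(V(coma), V(comb)) = 1,000 * 2,000 / 100 = 20,000.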

5 Entity/Relationship

5.1 Entity Classes

5.2 Relationships

Connect two entity classes; a relationship is a subset of the cartesian product of those two entity classes. Eg, if we have employee and company, all possible "works for" relationships are the cartesian product of employees and companies: Every employee works for every company. A subset of that is true: These employees work for those companies.

Relationships can have attributes and data beyond just the primary keys of their connected entity classes – eg, we might include a field saying how many hours per week each employee works for their company.

It also makes sense to have multiple relationships between the same classes.

5.2.1 Arrows

Represent the n:n, 1:n, etc-ness of a relationship to its connected entity classes.

Line: Unrestricted
Arrow: Up to one
Semicircle: Exactly one
Explicit: Custom constraint

A relationship between employees and companies with an arrow to the company entity class would mean that an employee cannot work for more than one company.

6 Transactions

6.1 Conflicts

A conflict is a relationship between two operations, whether from the same transaction or different ones, that states their order cannot be changed without breaking the conflict-serializability of the schedule. It's a bit like the elementary row operations in linear algebra: Two schedules are conflict-equivalent iff one can be turned into the other only by swapping around non-conflicting operations. Note here that conflict-equivalent means not only that it seems like one transaction occurred completely before the other, but that the same transaction occurred first. Two different conflict-serializable schedules representing the same two transactions, but executed in opposite orders, cannot necessarily reach each other by swapping non-conflicting operations.

In short: Swapping non-conflicting operations produces a schedule that leaves the database in the exact same state as the original schedule would have.

W = Write, R = Read

6.1.1 WW (Lost Write)

Two writes in one transaction are interrupted by a write from another transaction. Think of both of the Ws of the acronym as being inside the same transaction; they get interrupted by a third W from the other transaction.

6.1.2 WR (Dirty Read)

Read in one transaction happens while the database is in an inconsistent state in the middle of another transaction. The W and R occur in different transactions, in that order. Really, though, it should be a WRW, where the first and last W are in the same transaction.

6.1.3 RW (Unrepeatable Read)

Really should be RWR. The difference from WR, then, is that the intermediate W leaves the database in a consistent state, so the second R reads a consistent state. The problem is that it's a different consistent state than the first R read, so the first transaction may rely on the output of the second R being "compatible" with the first one (i.e, that reading the same row returns the same column values).

6.1.4 Phantom Read

Closely related to RW conflicts; I'd even call them a subset of RW conflicts. They occur when a query that should return a subset of another query's results returns a row that wasn't in the superset query's results, or when a query that should return a superset of another query's results is missing a row that the subset query returned.

For example, if we select flights shorter than 60 minutes, insert a 45 minute flight in another transaction, then select flights shorter than 90 minutes.

Concisely, phantom reads are RW conflicts related to extra or missing rows, while normal RW conflicts have to do with updates on existing rows.

6.2 ACID

Atomic: All or nothing.
Consistent: Conform to constraints, such as foreign keys or data types.
Isolated: Each transaction has the same effect it would have had if all transactions were executed serially.
Durable: Handles interruption gracefully.

6.3 Two-Phase Locking

A mechanism to categorically guarantee conflict-serializable schedules! It's simply this: Whenever a transaction will do anything with a resource (read or write), it locks that resource first. And a transaction does not acquire any locks after it has released its first lock – hence two phases, acquiring and releasing.

Can we prove this works?

If there are conflicting operations on the same element, obviously, all of those operations from one transaction will have to occur before the other, because a transaction does not release its lock on that element until it is completely done with it. So, we can simplify our model of each transaction down to a sequence of "operations", one per affected element. Eg, R1(A), W1(A), R2(A) turns into O1(A), O2(A).

Now, we just have to make sure that O1 operations are either all before or all after their corresponding operations from O2. So, O1(A), O2(A), O1(B), O2(B) is OK, but O1(A), O2(A), O2(B), O1(B) is not.

Two transactions both operate on elements A..Z. For now, let's assume that both transactions lock all elements in the same order. Define transaction 1 as the transaction that acquires the lock on A before transaction 2. Well, transaction 2 can't lock A until transaction 1 has locked all of the elements, So, obviously A occurs first in 1. B was locked by transaction 1 before it released A, so B also occurs first in transaction 1. Etc.

What if the order of locking of A..Z differs between the transactions? Then a transaction can acquire some locks while the other acquires some other locks, until one transaction, transaction 1, tries to acquire a lock held by transaction 2. This might be a deadlock, which is outside the scope of this proof, but if it isn't a deadlock, transaction 1 simply blocks until transaction 2 releases that lock. Under 2PL, transaction 2 releases nothing until it has acquired every lock it needs, so, by the same argument as above, all of transaction 2's operations on shared elements occur before transaction 1's, and the schedule is still serializable.

6.3.1 Strict 2PL

Only release locks just before/simultaneously with a ROLLBACK or COMMIT to prevent unrecoverable schedules (that cannot be rolled back).

6.4 Isolation Levels

SQL: SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED

6.4.1 READ UNCOMMITTED

No read locks. You might read data that will not exist, because it was set by a (different) transaction that will be aborted. I.e, dirty reads are allowed. Write locks are still done with strict 2PL to prevent WW conflicts.

6.4.2 READ COMMITTED

Acquire short-duration (not 2PL) read lock before reading, then release it immediately after reading. This ensures that there's no write lock on what we're reading. Since all writes occur with strict 2PL, this means that we only read data that's been committed by whatever transaction set it. However, if an entire transaction (2) updates and commits while (1) is running, a subsequent read will see the output of (2) that may not have been present in an earlier read (RW conflict, unrepeatable read).

Short-term "shared" locks (like acquiring an immutable reference in Rust) when reading, to summarize.

6.4.3 REPEATABLE READ

Prevents the 3 fundamental conflicts: WW, RW, and WR, by doing strict 2PL on reads as well as writes. So, that transaction (2) that updates a table between reads in the (1) transaction cannot acquire its write lock because the first read from (1) has the read lock!

This is basically the conflict serializability we've studied in class.

Phantom reads are still possible with insertion and deletion.

6.4.4 SERIALIZABLE

Either lock tables or predicates to prevent phantom reads. So, whenever we read something, we acquire a lock not only on the row(s) that are returned, but also on insertion or deletion of any row that might affect the output.

So, if SELECT * WHERE age >= 18, then we acquire a predicate lock (shared lock) that prevents deletion or insertion of rows that match the predicate. SELECT MAX(age), in a smart database, would be a weird predicate lock that prevents deletion of the row with maximum age (unless there are multiple?) and prevents insertion of rows with age greater than the maximum. SELECT * WHERE userid = 78, if userid is a primary key, doesn't really require much predicate locking! It's already forbidden to duplicate the primary key, so we can't possibly insert a new row with the same userid. As for forbidding deletion, we only have to make sure this one row isn't deleted – that isn't really a predicate lock, just a row deletion lock. Generalizing on that last point a bit, deletion predicate locks could probably be implemented as plain locks on the returned set of rows for many simpler queries.

7 Physical Performance

B(R) notation means "The number of blocks in the relation R"

7.1 Join Algorithms

7.1.1 Nested-Loop

When joining relations R and S, takes B(R) + B(R)*B(S) block reads.

Pseudocode:

for each block bR in R
    for each block bS in S
        for each tuple tR in bR
            for each tuple tS in bS
                check the join condition on (tR, tS)

At the end of the day, every possible pair of tuples will be checked.

7.1.2 Larger Blocks Nested-Loop

What if we read multiple physical blocks at once? Then we can get away with fewer passes over S.

If we read N blocks at a time…

B(R) + B(R)/N * B(S)
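
With made-up numbers: B(R) = 1,000, B(S) = 500, N = 100 gives 1,000 + (1,000/100) * 500 = 6,000 I/Os, versus 501,000 for the plain nested-loop.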

7.1.3 Hash Join

Make a hash table out of one table, then "probe into it" for every tuple from the other table. B(R)+B(S). Of course, you need to draw the rest of the fucking owl, i.e, make one table into a hash table. One table must be small or you must have some long-term hash table on disk.

7.1.4 Sort/Merge Join

If both tables are sorted on the join columns, you can do a single pass over each. One way to think about it is merge-sorting both tables together, then joining on adjacent tuples in the final, sorted list. Also B(R)+B(S), but, like the expensive hash-table-building of a hash join, it requires sorting first.

7.2 Indices

Indices are separate files on disk that sit alongside the main table. For a hash index, this separate file is a hash table mapping the index key to the on-disk location of the whole tuple in the main table file.

7.2.1 Unclustered

The main table is not sorted on the index key. There is no logical relation between the index key and the position of a tuple on-disk in the main table.

7.2.2 Clustered

The main table is sorted on the index key. Reading many tuples with "consecutive" values of the index key (eg, all users with age between 10 and 20, if age is the index key) is fast because it's a sequential read from disk. You may ask, if the table is already sorted, why do we need an index? Even without an index, the database could be smart and do some sort of binary search on the sorted table. But with, say, a hash index, it is substantially faster to find the "start" point from which to retrieve tuples from the table.

Clustered indices are not any faster than unclustered ones when reading a single tuple. A table can only be clustered on a single index (which could be a combination of columns).

7.2.3 Covering

A covering index "covers" all columns of the indexed table. It does not just store a pointer to the main tuple. It is literally a second copy of a table, but sorted on a different key.

7.2.4 I/O Lookup Costs

Let X be the selectivity factor, or the portion of the tuples in a relation that match some condition. Then, the cost of the query (in number of blocks read) is:

  • Sequential Scan: B(R). We can't rule out that the last block contains a tuple matching our query, so we have to check them all!
  • Clustered Index: X*B(R). All tuples that match the condition should be stored contiguously on-disk, so all but one of the blocks we read will be completely filled with tuples that match our query. Thus, we read only the number of blocks necessary to contain all the tuples.
  • Unclustered Index: X*T(R). When X is small, each matching tuple is stored in a different block. Thus, for each match, we read an entire block. When X is larger, X*T(R) is still correct (for this class)! Why? Well, we read a block that contains a valid tuple…but we don't know whether any other tuples in the block match the query, and it's expensive to check all of them, so we throw away the block! Then, if there does turn out to be another matching tuple in the same block, we re-read the block from disk. So yes, with a high X, an unclustered index lookup can be slower than a sequential scan, because the same block might be read many times. A smart database might be able to detect when X is high and take precautions. An unclustered index lookup becomes slower than a sequential scan when the number of tuples you expect to receive is greater than the number of blocks in the relation, i.e., when X*T(R) > B(R). For a fixed selectivity factor, the flip occurs when the block size rises too high.
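
With made-up numbers: B(R) = 1,000 and T(R) = 100,000. At X = 0.001, the unclustered index reads about X*T(R) = 100 blocks, far better than a sequential scan's 1,000. At X = 0.1 it reads about 10,000 blocks, ten times worse than a sequential scan, while a clustered index would read only X*B(R) = 100.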

7.2.5 B-Trees

B-trees make it possible to find all tuples with an attribute greater or less than a certain value. B-trees are stored very much like a binary search tree, but with more than 2 branches per node. B-trees also make it a bit faster to ORDER BY results.

7.3 Indices and Joins, I/O Estimation

We model this not all that differently than a nested-loop join. We scan through one table, while seeking on the other table. This means the I/O estimation is not commutative: Depending on which table you scan on and which you seek on, performance could vary dramatically.

We scan R and seek on S, using attribute t. When we talk about V(sometable), we are referring to the number of unique values of the join attribute t. The I/O estimation formula will generally be B(R) + T(R)*{ seek I/O cost on S }. These seek costs come from the I/O lookup costs for different types of indices above, where the selectivity factor is 1/V(S) (selecting all tuples that match on the join attribute). If R was sorted, you could imagine that the T(R) term would become T(R)/V(R), because we could re-use the results of the last seek on S, but we don't make that assumption in 414.

7.3.1 Unclustered Index (on S)

B(R) + T(R)*{ T(S)/V(S) }

T(S)/V(S) is how long it takes to find all tuples from S that have the desired value of the joined attribute.

7.3.2 Clustered Index (on S)

B(R) + T(R)*{ B(S)/V(S) }

All the tuples from S that match on the join attribute are adjacent on disk, so we read just the number of blocks necessary to contain all that data.
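
With made-up numbers: B(R) = 50, T(R) = 500, T(S) = 10,000, B(S) = 1,000, V(S) = 100. An unclustered index on S costs 50 + 500 * (10,000/100) = 50,050 I/Os, while a clustered index costs 50 + 500 * (1,000/100) = 5,050.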

7.4 Streaming

A naïve database will fully materialize the output of each RA operator at every step of the tree, writing a temporary relation to disk before beginning the next operator up. Sometimes this is required: If we are going to sort the data, we may need access to the same tuple multiple times to perform multiple comparisons on it, or if we are performing a join where it is the secondary table, we need to repeatedly scan through the table for matching tuples. Much of the time, though, we can stream tuples through to the next RA operator.

For example: Filtering usually takes B(R) time. But we only need to look at each tuple once and don't care about tuple order or anything like that, so we can take tuples as they are released from the last operator and the cost of the filter is CPU-only — I/O cost is zero.

Another example: joins. Recall they are usually B(R)+T(R)*{seek cost on S}. We can't stream in S because we need to seek on it. But we can stream in R (and perhaps should restructure the physical plan to allow for that), and if so, the cost becomes just T(R)*{seek cost on S}.

8 JDBC

8.1 Statements and queries

Three steps for any query:

8.1.1 Create a statement

For prepared statements, the query is stored in the statement like so:

PreparedStatement myStatement = conn.prepareStatement("SQL HERE");

For non-prepared queries, the SQL is passed in later, and conn.createStatement() is enough.

A statement should be close()'d when it's no longer needed. This will also free results related to it. An alternate approach is to keep a small number of statements open for the lifetime of the application (common with precompiled prepared statements), which will all be automatically closed when the connection is closed (no need to close(), then).

8.1.2 Execute

Just do stmt.executeQuery(), passing a string of SQL as an argument if this statement is not prepared. This returns a ResultSet.

stmt.executeUpdate() returns an integer, the number of rows changed, instead.

  1. Providing prepared statement values

    If the prepared SQL query has question marks, replace them like so:

    stmt.setInt(1, userIdVariable);
    stmt.setString(2, crypto.hash(userPassword));
    

    The first argument is the index of the question mark, starting at 1.

8.1.3 Read results

ResultSet rs = stmt.executeQuery();
while (rs.next()) { // next() returns false when there are no more rows to read
    int myval = rs.getInt("columnname");
    String mycat = rs.getString("kitty");
}

8.2 Transactions

Disable auto commit, do your queries, commit explicitly, then re-enable auto commit (if you want that to be the default in your application):

conn.setAutoCommit(false);
conn.createStatement().executeUpdate("SELECT UPDATE DELETE BLAH");
conn.createStatement().executeUpdate("DELETE UR MOM");
conn.commit();
conn.setAutoCommit(true);

There's also a conn.rollback() method.

9 Distributed Databases and Parallelization

When we talk about distributed queries, we are mainly talking about aggregate queries – not selecting 1% of the data, or 0.1% of the data, etc. We are more interested in selecting 0.0001% of the data: Maybe the sum of the prices of each car in the world, or the average age of drivers of each model of car, still only a tiny fraction of the total number of tuples (all drivers on the road).

9.1 Partitioning

We still live in a world of tuples, but now the tuples are split up across machines. How do we decide which tuples to put where?

9.1.1 Block Partitioning

We take each tuple, serialize it, and hash it. It's easy to uniformly distribute the hashes across machines. Pros: All machines will store a roughly equal number of tuples (hashes of entire tuples are spread essentially uniformly). Cons: Tuples are unorganized and will need to be reorganized for any meaningful query.

9.1.2 Hash Partitioning

Distribute tuples based on the hash of a single column/key. All tuples with the same hash get put on the same machine. Pros: Common keys put together for possibly quick queries. Cons: Distribution of tuples across machines may be lopsided.

9.1.3 Range Partitioning

Instead of partitioning based on the hash of an attribute, we just distribute based on the attribute itself. So, last names A-LC go on machine 1, LD-PR on machine 2, PS-Z on machine 3, for example. Pros: Fast range queries, and common keys put together. Cons: Lopsided distribution. Though the database can make sure to uniformly distribute the tuples initially, as tuples are added it has to choose between making things lopsided or shifting tons of tuples around on other machines just to do an insert on one machine; they usually choose the former.

9.2 Aggregation and Shuffling

The most interesting big data operations are grouped aggregations. We:

  1. Assign each group to a machine.
  2. Get all the tuples to the machine they were assigned to (Shuffle)
  3. Calculate the aggregate.
  4. Union the aggregate results together on a single machine. This step is quick and trivial.

The shuffle step is the most expensive and usually involves hashing the grouped attributes, then range-partitioning on the hashes (just like a normal hash partition).
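
A sketch tying these steps to the running example (hypothetical drivers(model, age) table):

SELECT model, AVG(age)  -- step 3: computed locally on each machine
  FROM drivers          -- tuples start out partitioned across machines
 GROUP BY model         -- steps 1-2: hash(model) decides which machine gets each tuple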

9.3 MapReduce

How do systems actually implement the aggregation and shuffling procedure described above? In two steps.

9.3.1 Map

This occurs on the machine where the tuples are stored for the long-term. Each tuple is taken through a "map" function which is more like a flatMap.

9.3.2 Reduce

This occurs on the machines that are performing the aggregation calculation, i.e, the targets of the hash shuffles. This is damn similar to the functional reduce function: It gets the grouped attribute and the sequence of tuples returned by the map which originally had that grouped attribute, then returns the output tuples of the aggregation (often 1 tuple only).

10 SQL++

10.1 Types

CREATE TYPE ClassType AS CLOSED {
    department: string,
    course_num: int,
    instructor: string?
}

Instructor is optional. Can DROP TYPE as well. If OPEN instead of CLOSED (or omitted), additional fields are allowed (like &allow-other-keys in Lisp).

  • Arrays: [int]
  • Multiset: {{string}}

10.2 SQL Analogs

SQL        SQL++
Database   Dataverse
Table      Dataset
Index      Index

Create a dataset:

CREATE DATASET Class(ClassType)
  PRIMARY KEY unique_id AUTOGENERATED;

AUTOGENERATED means "Add a UUID field with this name". You must always specify a primary key.

Types and datasets are specific to dataverses. You cannot create a dataset with a type from another dataverse. USE DataVerseName makes it the default dataverse, so we no longer need to put DataVerseName. before everything.

10.3 Unnesting

Let's say our data looks like:

{
  "name": "yoop",
  "orders": [
    { "product": "bopit", "price": 14 },
    { "product": "shakeit", "price": 21 },
  ]
}

And we want output that looks like:

{{
  { "name": "yoop", "product": "bopit" },
  { "name": "yoop", "product": "shakeit" }
}}

We can use this query:

SELECT P.name, O.product
  FROM People AS P, P.orders AS O;

Alternatively:

SELECT P.name, O.product
  FROM People AS P UNNEST P.orders AS O;

It's valid to think of this as a special type of join – join people with orders when the order is contained in people. There's no way to write this with INNER JOIN syntax, though, because there's no attribute to join on.

10.3.1 Unnesting things that aren't tables

For example, if we have data like:

{ "name": "Mark", "favorite_animals": "cat,dog,parakeet" },
{ "name": "Ajay", "favorite_animals": "deer,eagle,parakeet" },

And I want to output data like:

Name Animal
Mark Cat
Mark Dog
Mark Parakeet
Ajay Deer
Ajay Eagle
Ajay Parakeet

We can use split() to treat that last field as a nested list in the FROM clause:

SELECT A.name, animal
  FROM Animals A,
       split(A.favorite_animals, ',') animal

Note that we don't have to access a "field" of animal. It's still a legit "table" though, because "tables" in SQL++ are just lists of…anything. They don't have to be lists of objects with attributes.

A slightly more advanced query:

SELECT DISTINCT animal, names
  FROM Animals A, split(A.favorite_animals, ',') animal
   LET names = (SELECT VALUE A1.name
                        FROM Animals A1,
                             split(A1.favorite_animals, ',') A2
                       WHERE A2 = animal);
Animal Names
Dog [ Mark ]
Cat [ Mark ]
Deer [ Ajay ]
Eagle [ Ajay ]
Parakeet [ Ajay, Mark ]

I've also introduced the LET syntax: It's like WITH but for correlated subqueries.

10.4 Nesting

Say we have:

{{
    { "name": "Mark", "age": 19 },
    { "name": "Shana", "age": 38 },
    { "name": "Dan", "age": 19 },
    { "name": "Zander", "age": 17 },
    { "name": "Kari", "age": 17 },
}}

And we want to group by age:

{{
    { "age": 17,
      "people": [{ "name": "Zander" }, { "name": "Kari" }]},
    { "age": 19,
      "people": [{ "name": "Mark" }, { "name": "Dan" }]},
    { "age": 38,
      "people": [{ "name": "Shana" }]}
}}

We can do this:

SELECT P.age,
       (SELECT P1.name
          FROM People AS P1
         WHERE P1.age = P.age) AS people
  FROM People AS P;

Correlated subqueries are the name of the game!

Small issue: it yields this!

{ "age": 38, "people": [ { "name": "Shana" } ] }
{ "age": 17, "people": [ { "name": "Zander" }, { "name": "Karianna" } ] }
{ "age": 19, "people": [ { "name": "Dan" }, { "name": "Mark" } ] }
{ "age": 19, "people": [ { "name": "Dan" }, { "name": "Mark" } ] }
{ "age": 17, "people": [ { "name": "Zander" }, { "name": "Karianna" } ] }

Adding DISTINCT right after SELECT fixes it.

What if we want this instead?

{{
    { "age": 17, "people": [ "Zander", "Kari" ]},
    { "age": 19, "people": [ "Mark", "Dan" ]},
    { "age": 38, "people": [ "Shana" ]}
}}

We can use the VALUE keyword to return bare values instead of objects with attribute names. It only works when selecting a single column.

SELECT DISTINCT P.age,
       (SELECT VALUE P1.name
          FROM People AS P1
         WHERE P1.age = P.age) AS people
  FROM People AS P;

11 Other

11.1 Monotonicity

A monotonic query preserves containment/subset relations between tables in its results: if the query is run on a superset of the original input relation, its result is a superset of the original result. Adding input tuples can only add output tuples. It's a homomorphism: The Q (perform query) operation preserves subset relations. For example, a plain SELECT…WHERE filter is monotone, but SELECT MAX(age) is not, since inserting a row can change an existing result rather than just adding to it.

11.2 RDD

Resilient Distributed Dataset

Author: Mark Polyakov

Created: 2020-03-18 Wed 12:33
