CSE 312 Notes

1. Counting
2. Conditional Love
3. Expectation and Variance
4. Discrete Zoo
5. Continuous Probability Distributions
6. Joint Distributions
- 6.1. Marginal Distribution
- 6.2. Covariance
7. Tail Bounds

These are Mark Polyakov's notes from taking CSE 312, Foundations of Computing II, at the University of Washington during Spring 2021.

1 Counting

I'll list even the really common rules here, just so that it can serve as a nice "database" to look through when unsure how to solve a problem.

1.1 Factorials

Number of permutations of n elements: n!

1.2 Permutations ("Pick")

How many ways to pick k elements from a set of n elements, if order does matter. n!/(n-k)!

1.3 Combinations ("Choose")

Number of ways to choose k elements from a set of n elements, if order does not matter. Like pick, but divide by number of permutations of the chosen elements, ie n!/((n-k)!k!)

1.4 Subsets whose order does not matter (Multinomial Coefficients)

Eg, how many ways are there to pick two s-es and two t-s from Massachusetts (order does matter, we get to change order if we want)? We must divide by 2!2! instead of 4!, to account for that relative order between the s-es and the t-s does matter.

1.5 Pascal's Rule

$\binom{n}{k}=\binom{n-1}{k-1}+\binom{n-1}{k}$. Looking at the first element of a set of n elements, there are two options:

We include the first element in the chosen set of size k. Then, we must pick k-1 elements from the rest of the original set of size n, hence $\binom{n-1}{k-1}$.
We do not include the first element in the chosen set. Then, we must pick all k elements from the rest of the original set, hence $\binom{n-1}{k}$

1.6 Binomial Theorem

\[(x+y)^n=\sum_{k=0}^n \binom{n}{k} x^ky^{n-k}\]

Imagine all n of the $(x+y)$ terms being laid in front of us. We pick some of them, and from the picked ones we take x, and from the unpicked ones take y. What will be the powers? Yep, $x^ky^{n-k}$. And of course, we can make $\binom{n}{k}$ such picks.

1.7 Inclusion-Exclusion

Like that one Project Euler thing! If you have possible intersecting sets, and want to find how many distinct elements there are, you

Add the elements of each set, eg O(A)+O(B)+…
Subtract the elements of the two-wise intersections, eg $O(A\cap B)+O(A\cap C)+O(B\cap C)+\dots$.
Add the elements of the three-wise intersections, etc.
Repeat until it wouldn't make sense to continue (there are no four-wise intersections of three sets)

1.8 Splitting elements into groups

"Stars and bars", how to split n stars into k groups (empty groups are ok, all stars must be in exactly one group). We think about how to place the bars between the stars. It is tempting to count how many ways there are to place the first barrier, then how many that leaves for the second barrier being somewhere after the first barrier…but because the number of possible positions for the second barrier depends on the position of the first barrier, this is cumbersome. Instead, we look at how to place all the barriers, then remove the over-counted possibilities. We have n+1 options for the first barrier (at the beginning means first group empty, at the end means last group(s) empty). We have n+2 options for the next barrier, since it can be on either side of the first barrier. As we'll see later, it doesn't actually matter which side of the last barrier it's on, but it's important for us to uniformly over-count possibilities by considering all possible ways to position the barriers with respect to each other. Then n+3 for the third barrier, and so on, yielding (n+(k-1))!/n! (because there are k-1 barriers if there are k groups). To avoid over-counting, we remove order between the groups, and so divide by (k-1)! so that all the barriers are treated the same. The leftmost group then is always group 1, the next is group 2, etc. So our answer is $\frac{(n+k-1)!}{n!(k-1)!}=\binom{n+k-1}{k-1}$.

2 Conditional Love

First, the basic rule of conditional probability:

\[P(A|B)=\frac{P(A\cap B)}{P(B)}\]

2.1 Bayes' Rule

Given P(A|G), P(A), and P(G), lets us calculate P(G|A). First, we want to find the "absolute", ie unconditional, probability of $P(A\cap G)$. Then, we'll simply divide by the total probability P(A). In English, "how often do both A and G occur, compared to how often A occurs?" To find $P(A\cap G)$, we simply calculate $P(A|G)P(G)$. Thus, Bayes' rule:

\[P(G|A)=\frac{P(A|G)P(G)}{P(A)}\]

The formal proof is trivial; this is equal to $P(G\cap A)/P(A)$ which is by definition $P(G|A)$.

Often we are given P(A|B) and $P(A|\overline B)$, but not A directly ($\overline B$ is the complement of B). Of course, we can just add these to get P(A). Here's the combined Bayes rule in this case:

\[P(G|A)=\frac{P(A|G)P(G)}{P(A|G)P(G)+P(A|\overline G)P(\overline G)}\]

And of course, $P(\overline G)=1-P(G)$.

Let's build some intuition about Bayes' rule. First, the larger P(A) is, the smaller P(G|A). That's because it means there are more situations where A occurs, but not AcapG.

2.2 Chain Rule

Using the basic rule of conditional probability, we have $P(A\cap B)=P(A|B)P(B)$. Makes sense: all cases where A and B occur are covered by A|B, but we need to make it symmetrical.

We can generalize:

\begin{align*} P(A\cap B\cap C)&=P(A|B\cap C)P(B\cap C) \\ &=P(A|B\cap C)P(B|C)P(C) \end{align*}

\[P(G_1\cap\cdots\cap G_n)=P(G_1|G_2\cap\cdots\cap G_n)P(G_2|G_3\cap\cdots\cap G_n)\cdots P(G_n)\]

2.3 Shortcuts for Bayes' rule (Bayes' factor)

Often, we think about Bayes' rule with regard to tests, for example in a medical context, where we are given false negative and false positive rates for the test. However, these rates help you figure things out about the test result given full information about the person – while in fact we'd like to know about the person given the test result! We'd like some shortcuts to invert the dependency in this case.

In addition to false negative/positive rate, we have sensitivity and (not a standard name) negative sensitivity, which are equal to 1-FNR and 1-FPR respectively. Just from that definition you can determine that sensitivity is the chance of the test returning positive given that the person is positive, and the negative sensitivity is the chance that the test will return negative given that the person is negative. I.e., it's more of an accuracy than a false rate.

A "prior" is the same events, but with the implication in the wrong order. So for testing, it's "given that the person is positive, here's how likely the test is to be positive", for example. Then using Bayes we'd be trying to find "given that the test is positive, how likely is it that the person is positive?"

If the prior is positive test given positive person, then to invert it we would multiply by P(actually_positive)/P(test_positive), but in the real world we don't know P(actually_positive). A workaround TODO

2.4 Response Bias

Wikipedia describes response bias as a bias due to some people just straight up answering survey questions wrong. That's true, but it's not immediately obvious what kind of large effects this has.

Suppose you ask 10,000 people whether they are a human or a cat, then ask them which college they went to. You'll probably find a few cats who went to MIT!

This is related to Bayes' theorem because you'll replicate the results if you take the prior, in the example above, to be "the probability that you report being a cat, given that you wrote you went to college" as nonzero, because, even if I do not believe a cat has ever gone to college, I do believe that some people who went to college would write down that they're a cat.

3 Expectation and Variance

3.1 Expectation

The expectation of a discrete random variable is the weighted average of all the values the random variable can take, weighted by the probability of that outcome. More formally:

\[\mathbb E[X]=\sum_i[\mathbb P(X=X_i)X_i]\]

If a random variable is the sum of other random variables, then the expectation of the sum is the sum of the expectations. This is pretty intuitive for things like the sum of two dice rolls, but less so when the events are dependent on each other.

\begin{align*} \mathbb E[X+Y]&=\sum_{i,j}\Big[\mathbb P(X=X_i,Y=Y_j)X_iY_j\Big] \\ &=\sum_{i,j}\Big[\mathbb P(X=X_i,Y=Y_j)X_i\Big]+\sum_{i,j}\Big[\mathbb P(X=X_i,Y=Y_j)Y_j\Big] \\ &= \sum_i\Big[\mathbb P(X=X_i)X_i\Big]+\sum_j\Big[\mathbb P(Y=Y_j)Y_j\Big] & \text{(total probability)} \\ &=\mathbb E[X]+\mathbb E[Y] \end{align*}

The way dependent variables usually manifest themselves when calculating $\mathbb E[X]+\mathbb E[Y]$ is that in the calculation of $\mathbb{E}[Y]$, you have to use the law of total probability across different cases of X.

3.2 Law of Total Expectation

$\mathbb E[X]=\sum_{A_i \mathrm{ partition}} \mathbb E[X|A_i]P(A_i)$

The proof is straightforward and follows from the law of total probability:

\begin{align*} \sum_{i} \mathbb E[X|A_i]P(A_i) &= \sum_{i}\sum_{x\in\Omega_X}\left[x\cdot P(X=x|A_i)\right]P(A_i) \\ &= \sum_{x\in\Omega_X}\sum_i\left[x\cdot P(X=x|A_i)P(A_i)\right] \\ &= \sum_{x\in\Omega_X} x\cdot P(X=x) & \text{(Law of Total Probability)} \\ &= \mathbb E[X] & \text{(definition)} \end{align*}

3.3 Variance

\[\mathrm{Var}(X)=\sum_i\Big[\mathbb P(X=X_i)(X_i-\mathbb E[X])^2\Big]\]

If we let $Y$ be the random variable $(X-\mathbb{E}[X])^2$, then that expression is equal to $\mathbb{E}[Y]=\mathbb{E}[(X-\mathbb{E}[X])^2]$. We can go deeper!

\begin{align*} \mathbb{E}\left[(X=\mathbb{E}[X])^2\right] &= \mathbb{E}\left[X^2-2X \mathbb{E}[X]+(\mathbb{E}[X])^2\right] \\ &= \mathbb{E}\left[X^2\right]-\mathbb E[2X \mathbb{E}[X]]+\mathbb E\left[(\mathbb{E}[X])^2\right] \\ &= \mathbb{E}\left[X^2\right]-\mathbb E[2X \mathbb{E}[X]]+(\mathbb{E}[X])^2 \\ &= \mathbb{E}\left[X^2\right]-2\mathbb E[X]\mathbb E[X]+(\mathbb{E}[X])^2 & \text{(factor out constants)}\\ &= \mathbb{E}\left[X^2\right]-(\mathbb{E}[X])^2 \end{align*}

Rad! Variance is always positive, so this also implies $\mathbb{E}[X^2]\ge \mathbb{E}[X]^2$.

Using this expression for variance, we can determine some things about variance using facts about expectation (namely, linearity, to both addition and to multiplication by constant).

From either the original definition or the alternate expression in terms of expectation, you can easily derive (for constant a):

Var(a+X)=a+Var(X)

Var(aX)=a²*Var(X)

3.4 Independence and what it implies

Recall that for a set of events to be mutually independent, every subset of them must be independent (as in probability of AND equals product of individual probabilities). For a set of random variables to be mutually independent we have a slightly different condition: For every possible set of values that the set of random variables can take on, the probability of exactly that set of values must equal the product of the probabilities of each individual value in that set.

When some random variables are independent, then the variance of the sum is the sum of the variances and the product of an expectation is the expectation of the product, i.e.:

Var(X+Y)=Var(X)+Var(Y)

E[X*Y]=E[X]*E[Y]

3.4.1 Dice roll sum and first roll are independent??

Nope, they are not independent! However, the events of the sum being exactly 7 and the first dice being any given value are independent, because 7 is the only sum that can be reached no matter what the first roll is; 6 cannot be reached if the first roll is 6, and 8 cannot be reached if the first roll is 1. No matter what the first roll is, there is a 1/6 chance of the second roll being such that the sum is seven. Therefore, fixing the first roll doesn't change the probability of the sum being seven. On the other hand, the first roll being 1 does change the probability that the sum is 8.

4 Discrete Zoo

4.1 Uniform

Uniformly distributed integers from a to b. Expected value (a+b)/2. Variance (b-a)(b-a+2)/12. The variance is a bit strange; it's easier to calculate for the continuous case, where you get (b-a)²/12. There are b-a+1 many integers from a through b, so (b-a)(b-a+2) is related to (b-a+1)².

4.2 Bernoulli

Models a single true/false event, with true probability p. The random variable can only take on values 0 and 1. The expected value is clearly just p, and the variance is

\[\mathbb E[X^2]-\mathbb{E}[X]^2=p-p^2=p(1-p)\]

4.3 Binomial

Related to Bernoulli. In n independent true/false events, with true probability p for each, the random variable is how many flips will be true. The expected value is np (by linearity of expectation, we just sum up the expectations of the indicator variables for each flip). Since each coin flip is independent, the variance is linear across them, and we get that the variance of the n flips is n*p(1-p).

4.4 Geometric

How many true/false events until a true? Once again, related to Bernoulli and binomial. The PMF of X=i is $(1-p)^{i-1}p$. The expected value is

\begin{align*} \mathbb{E}[X]&=\sum_{i=1}^\infty i(1-p)^{i-1}p \\ &= -p\cdot\frac{d}{dp}\sum_{i=1}^\infty (1-p)^i \\ &= -p\cdot\frac{d}{dp}\frac{1-p}{1-(1-p)} \\ &= -p\cdot\frac{-1}{p^2} \\ &= \frac{1}{p} \end{align*}

(derivative idea came from https://math.stackexchange.com/questions/605083/calculate-expectation-of-a-geometric-random-variable). We were able to swap the order of differentiation and infinite summation because $\sum_{i=1}^\infty (1-p)^i$ converges uniformly in a small open interval around any 0<p<1. The negative sign popped out on the second line because of the chain rule on the inner -p.

A more intuitive way to calculate the same thing, using that geometric distributions are "memoryless" (the gambler's fallacy):

\begin{gather*} \mathbb{E}[X]=p\cdot 1+(1-p)(1+\mathbb{E}[X]) \\ \mathbb{E}[X]=p+1-p \mathbb{E}[X]+\mathbb{E}[X]-p \\ p\mathbb{E}[X]=1 \\ \mathbb{E}[X] = \frac{1}{p} \end{gather*}

The variance is TODO

4.5 Hypergeometric

Instead of thinking about geometric in terms of weighted coin flips, let's think about it as how many balls we have to pull, with replacement, from a bin of red and blue balls until we get a red ball. Then hypergeometric is the same thing but without replacement.

4.6 Negative Binomial

Really, it's more like "multiple geometric" (but not hypergeometric!). It's "with probability p of success, how many trials will you have to perform to get exactly n successes?" The PMF X=i is $\binom{i-1}{n-1}(1-p)^{i-n}p^n$.

Rather than calculate the expected value directly, we can use that X is the sum of how long it takes for the first success, then the second success, etc, each of which are geometric random variables, so we can use linearity of expectation to get $\mathbb{E}[X]=\frac{n}{p}$.

We can do the same thing for variance, since the amount of flips we need to get the first success does not affect how long it takes to get the second success, so $\mathrm{Var}(X)=\frac{n\cdot(1-p)}{p^2}$.

4.7 Poisson

4.7.1 Connection between Poisson and Geometric

Poisson is a continuous version of geometric, in a sense, as we increase the number of coin flips in the geometric distribution towards infinity.

5 Continuous Probability Distributions

The PDF gives the "density" of the probability at that point, not the actual probability of a certain event occurring. The PDF can take on any nonnegative values. The constraint is that $\int_{-\infty}^\infty f(x)dx=1$. The CDF has a more direct interpretation, as the probability of the random variable being less than or equal to a given value. It's given by $F(x)=\int_{-\infty}^x f(x)dx$. The probability of the random variable being in a given range is calculated by subtracting the CDF at the low point of the range from the CDF at the high point of the range, which simplifies into $\int_a^b f(x)dx$. Expected value and variance are calculated in much the same way as for discrete variables: $\int_{-\infty}^\infty xf(x)dx$ and $\int_{-\infty}^\infty (x-\mathbb E[X])^2f(x)dx$ respectively. Just like with discrete random variables, there's a more convenient way to write the variance:

\begin{align*} \int_{-\infty}^\infty (x-\mathbb E[X])^2f(x)dx &= \int_{-\infty}^\infty x^2f(x)dx-2\mathbb E[X]\int_{-\infty}^\infty xf(x)dx+\mathbb E[X]^2 \\ &= \mathbb E[X^2]-\mathbb E[X]^2 \end{align*}

5.1 Law of the unconscious statistician

I don't understand the name, but the point is that the expected value of some function g(X), where we only know the PDF of X itself, is given by

\[\mathbb E[g(x)]=\int_a^b g(x)f(x)dx\]

That's because the probability density of g(x) occurring is the same as the density of the x that creates that g(x). The same thing works for discrete distributions.

5.2 Example: Uniform Distribution

We defined the PDF as a piecewise function with constant value 1/(a-b) when a<x<b and zero elsewhere. This is pretty clearly a valid PDF. The CDF is $\frac{x-a}{b-a}$ when a<x<b, zero when x<a, and one when x>b. The expectation is

\[\mathbb E[X]=\int_a^b \frac{x}{b-a}dx=\left[\frac{x^2}{2(b-a)}\right]_a^b=\frac{b^2-a^2}{2(b-a)}=\frac{(b-a)(b+a)}{2(b-a)}=\frac{b+a}{2}\]

The variance is

\begin{align*} \mathbb{E}[X^2]-\mathbb E[X]^2&=\left[\frac{x^3}{3(b-a)}\right]_a^b-\frac{(b+a)^2}{4}\\ &= \frac{b^3-a^3}{3(b-a)}-\frac{(b+a)^2}{4}\\ &= \frac{(b-a)(b^2+ab+a^2)}{3(b-a)} -\frac{(b+a)^2}{4}\\ &= \frac{4(b^2+ab+a^2)-3(b+a)^2}{12} \\ &= \frac{a^2-2ab+b^2}{12} \\ &= \frac{(a-b)^2}{12} \end{align*}

Which is pretty similar to the discrete case.

5.3 Example: Exponential Distribution

5.4 Central Limit Theorem

If we have some i.i.d (independent, identically distributed) variables (this does mean that they must each have the same distribution, same mean, same variance, etc), and we take their sum, as the number of variables approaches infinity, the distribution of the sum approaches a normal distribution.

That's cool, but in order to really apply it, we want to know which normal distribution it converges to. We can normalize the sum. Let n be the number of variables and µ be the mean of each variable being summed. Then $\sum_1^n[X_i]-n\mu$ has the mean normalized to zero. The variance is (hand waving here) the sum of the variances of the X_i. Recall that multiplying by σ multiplies the variance by σ², so if σ² is the variance of each X_i, the variance of the sum is nσ², so we need to divide by $\sigma\sqrt n$. Thus:

\[\frac{\sum_1^n[X_i]-n\mu}{\sigma\sqrt n}\]

becomes distributed like a normal distribution with mean 0, variance/stddev 1.

6 Joint Distributions

A joint distribution is just multiple variables, which may or may not be independent, and may or may not be identically distributed.

6.1 Marginal Distribution

Given a joint distribution, we may wish to find statistics about just one of the variables. This could be useful if the variables are tightly intertwined so that the variables are "given" as a joint distribution to begin with.

To find the marginal distribution for discrete joint distribution, we simply use the law of total probability across all variables not of interest to find the total probability of each possible value of the variable of interest. In the continuous case, we integrate across the variable not of interest. More formally, if we have a joint distribution on X_1,…,X_n and want the marginal for X₁, it is:

\[P(X_1=x)=\int_{-\infty}^\infty\cdots\int_{-\infty}^\infty P(X_1=x|X_2=x_2,\dots,X_n=x_n)dx_2\cdots dx_n\]

6.2 Covariance

Defined as $\mathrm{Cov}(X,Y)=\mathbb E\Big[(X-\mathbb E[X])(Y-\mathbb E[Y])\Big]$. Expanding the product on the right,

\begin{align*} \mathrm{Cov}(X,Y) &=\mathbb E\Big[(X-\mathbb E[X])(Y-\mathbb E[Y])\Big]\\ &=\mathbb E\Big[XY-X\mathbb E[Y]-Y\mathbb E[X]+\mathbb E[X]\mathbb E[Y]\Big] \\ &=\mathbb E[XY]-\mathbb E[X\mathbb E[Y]]-\mathbb E[Y\mathbb E[X]]+\mathbb E[X]\mathbb E[Y]\\ &=\mathbb E[XY]-\mathbb E[Y]\mathbb E[X]-\mathbb E[X]\mathbb E[Y]+\mathbb E[X]\mathbb E[Y] & \text{(factor constants)}\\ &=\mathbb E[XY]-\mathbb E[Y]\mathbb E[X] \end{align*}

Intuitively, if X causes Y to be greater, than XY will be artificially increased beyond their product without interaction. Covariance is positive when the variables are positively correlated, zero if they are independent, and negative when they are oppositely correlated. (Note that this does mean that the zero or nonzeroness of covariance tells you whether the variables are independent).

This allows the general formula for the variance of a sum of random variables which may or may not be independent: Var(X+Y)=Var(X)+Var(Y)+2*Cov(X,Y) TODO: prove

Covariance is in fact a generalization of variance: Cov(X,X)=Var(X).

Given multiple random variables, it's convenient to create a (symmetric) covariance matrix, where the i,j-th entry is the covariance between the i-th and j-th variables.

7 Tail Bounds

7.1 Markov's Inequality

The proof is worth a thousand words.

\begin{align*} \e[X] &= \sum_{x\in\Omega_X}x\cdot P(X=x) \\ &\ge \sum_{x\in\Omega_X, x\ge t}x\cdot P(X=x) & \text{(requires $x\ge0$)}\\ &\ge \sum_{x\in\Omega_X, x\ge t}t\cdot P(X=x) & (x\ge t) \\ &= tP(X\ge t) \end{align*}

Markov's Inequality follows immediately: If all values of $x$ are nonnegative, then

\[P(X\ge t)\le \frac{\e[X]}{t}\]

Equivalently, letting $k=t/\e[X]$:

\[P(X\ge k\e[X])\le \frac{1}{k}\]

How can we possibly put a bound on the probability when all we know is the expected value? At first it seems like the variance could make a big difference. However, the restriction that the x are nonnegative is actually critical, not just a formality. Markov knows that all the minimum value is at least zero, and knows where the expected value is, so it knows things can't be spread too far beyond the other side.

We can use the Markov inequality better by shifting over a distribution until the lowest possible value is zero, since this really helps our estimates and could potentially make it applicable even when the minimum value is less than zero initially.

7.2 Chebyshev's Inequality

If a random variable $Y$ has expected value zero, then the variance is just $\e[Y^2]$. Then, using Markov's inequality:

\[P(Y^2\ge t^2)\le \frac{\e[Y^2]}{t}=\frac{\mathrm{Var}(Y)}{t^2}\]

Notice that $P(Y^2\ge t^2)$ is the same probability as $P(|Y|\ge t)$ if t is nonnegative. Also substituting how we got Y from X by making the expected value zero, we get the form of Chebyshev's Inequality:

\[P(|X-\e[X]|\ge t)\le \frac{\mathrm{Var}(X)}{t^2},\qquad t\ge0\]

7.3 Multiplicative Chernoff Bounds

Only applicable to the sum of independent Bernoulli variables. Note this is not the same as saying it only applicable to binomial variables, because binomial requires that all the Bernoulli variables it is the sum of have the same parameter, but this restriction is not necessary for the Chernoff bounds. Let µ be the mean of the sum and $0\le\sigma\le1$; then $P(\sum X_i\ge (1+\sigma)\mu)\le e^{-\sigma^2\mu/3}$ and $P(\sum X_i\le(1-\sigma)\mu)\le e^{-\sigma^2\mu/2}$ (yes, there is a different denominator in the exponent of the second one)

7.4 Union bound

Okay, this one's pretty lame. Given a whole bunch of events that might be dependent, their OR'd probability (the probability of at least one of them happening) is no greater than the sum of their individual probabilities. Formally,

\[P\left(\bigcup X_i\right)\le\sum P(X_i)\]

This follows from $P(X\cup Y)=P(X)+P(Y)-P(X\cap Y)$.