Chapter 1 Some probability revision

(Original version produced by A.R.Wade.)

Probability spaces and random variables

A probability space \((\Omega,\mathcal{F},\mathbb{P})\) consists of

the sample space \(\Omega\) (the set of all possible outcomes);
a \(\sigma\)-algebra \(\mathcal{F}\) of events;
a probability \(\mathbb{P}\) that to each event \(A \in \mathcal{F}\) assigns a number \(\mathbb{P} (A)\) satisfying the probability axioms.

To say that \(\mathcal{F}\) is a \(\sigma\)-algebra means that \(\mathcal{F}\) is a collection of subsets of \(\Omega\) such that:

\(\Omega \in \mathcal{F}\).
If \(A \in \mathcal{F}\) then its complement \(A^\mathrm{c} \in \mathcal{F}\) too.
If \(A_1, A_2, \ldots \in \mathcal{F}\) then \(\cup_{n=1}^\infty A_n \in \mathcal{F}\) too.

A random variable \(X\) is a function \(X : \Omega \to \mathbb{R}\) which is \(\mathcal{F}\)-measurable, meaning that \[ \{ \omega \in \Omega : X(\omega) \leq x \} \in \mathcal{F} \text{ for all } x \in \mathbb{R} .\] An important example is the random variable \({1}_A\) of an event \(A \in \mathcal{F}\), defined by \[ {1}_A ( \omega ) = \begin{cases} 1 & \text{if } \omega \in A, \\ 0 & \text{if } \omega \notin A . \end{cases} \] If \(\mathcal{G} \subseteq \mathcal{F}\) is another (smaller) \(\sigma\)-algebra, then a random variable \(X\) is \(\mathcal{G}\)-measurable if \[ \{ \omega \in \Omega : X(\omega) \leq x \} \in \mathcal{G} \text{ for all } x \in \mathbb{R} .\]

Roughly speaking \(X\) is \(\mathcal{G}\)-measurable if “knowing” \(\mathcal{G}\) means “knowing” \(X\).

Example. Consider the sample space for a die roll \(\Omega = \{1,2,3,4,5,6\}\). Take \(\mathcal{F} = 2^\Omega\), the power set of \(\Omega\) (the set of all subsets). Take \(\mathcal{G} \subset \mathcal{F}\) given by \[ \mathcal{G} = \bigl\{ \emptyset, E , E^\mathrm{c}, \Omega \bigr\} ,\] where \(E = \{ 2,4,6\}\) is the event that the score is even, and its complement is \(E^\mathrm{c} = \{1,3,5\}\) (the score is odd). Note that \(\mathcal{G}\) is a \(\sigma\)-algebra. Take random variables \(X(\omega) = \omega\) (the score) and \(Y(\omega ) = {1}_E (\omega)\) (the indicator that the score is even).

Both \(X\) and \(Y\) are clearly \(\mathcal{F}\)-measurable.

Moreover, \(Y\) is \(\mathcal{G}\)-measurable, since e.g., \(\{ \omega : Y (\omega) \leq 1/2 \} = E^\mathrm{c} \in \mathcal{G}\), but \(X\) is not \(\mathcal{G}\)-measurable, since e.g., \(\{ \omega : X (\omega ) \leq 1 \} = \{ 1 \} \notin \mathcal{G}\).
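The measurability check in this example can be carried out mechanically. The following is a minimal Python sketch, not part of the original notes; the helper name `is_measurable` is hypothetical. It relies on the fact that, on a finite sample space, it suffices to test the sets \(\{ \omega : X(\omega) \leq x \}\) at the values \(x\) that the random variable actually takes.

```python
# Sample space for a die roll and the sigma-algebra G = {empty, E, E^c, Omega}
omega = frozenset(range(1, 7))
E = frozenset({2, 4, 6})
G = {frozenset(), E, omega - E, omega}

def is_measurable(rv, sigma_algebra, sample_space):
    """Check that {w : rv(w) <= x} lies in sigma_algebra for every x.

    On a finite sample space the set only changes at values taken by rv,
    and it is empty below the smallest value, so testing those values suffices.
    """
    values = sorted({rv(w) for w in sample_space})
    return all(
        frozenset(w for w in sample_space if rv(w) <= x) in sigma_algebra
        for x in values
    )

X = lambda w: w                    # the score
Y = lambda w: 1 if w in E else 0   # indicator that the score is even

print(is_measurable(X, G, omega))  # False: {1} is not in G
print(is_measurable(Y, G, omega))  # True
```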

Expectation and Variance

The expectation of a random variable \(X\) is given by \[\mathbb{E}[X] = \sum_{x} x \mathbb{P}(X = x) \] if \(X\) takes values in a discrete set, or \[ \mathbb{E}[X] = \int_{\mathbb{R}} x f_X(x) \mathrm{d} x \] if \(X\) is a continuous random variable with density function \(f_X(x)\).

The variance of \(X\) is given by \[ \mathbb{V}\mathrm{ar}(X) = \mathbb{E}[X^2] - \mathbb{E}[X]^2 \] and is always non-negative (\(\mathbb{V}\mathrm{ar}(X) \geq 0\)).

Some properties of the expectation and variance are:

Expectation is linear: \(\mathbb{E}[aX+b] = a \mathbb{E}[X] + b\)
For a measurable function \(g\), \(\mathbb{E}[g(X)] = \sum_{x} g(x) \mathbb{P}(X=x)\) or \(\mathbb{E}[g(X)] = \int_{\mathbb{R}} g(x) f_X(x) dx\). In Probability I you may have heard this referred to as LOTUS, or the Law of the Unconscious Statistician. In general, \(\mathbb{E}[g(X)]\) is not the same as \(g( \mathbb{E}[X])\)!
Jensen’s inequality: for any convex function \(f\), \[ f(\mathbb{E}[X]) \leq \mathbb{E}[f(X)]. \]
Cauchy-Schwarz inequality: \(\left\vert \mathbb{E}[XY] \right\vert^2 \leq \mathbb{E}[X^2] \mathbb{E}[Y^2].\)
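As a concrete check of these properties, here is a small Python sketch (an illustration added here, with the hypothetical helper name `expect`) that computes expectations for a fair die exactly, using LOTUS with \(g(x) = x^2\), and compares \(\mathbb{E}[g(X)]\) with \(g(\mathbb{E}[X])\).

```python
from fractions import Fraction

# Probability mass function of a fair die score
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

def expect(g, pmf):
    """E[g(X)] = sum over x of g(x) * P(X = x) (LOTUS, discrete case)."""
    return sum(g(x) * p for x, p in pmf.items())

mean = expect(lambda x: x, pmf)            # E[X]
second_moment = expect(lambda x: x**2, pmf)  # E[X^2]
variance = second_moment - mean**2         # Var(X) = E[X^2] - E[X]^2

print(mean, second_moment, variance)       # 7/2 91/6 35/12

# Linearity: E[2X + 3] = 2 E[X] + 3
print(expect(lambda x: 2 * x + 3, pmf) == 2 * mean + 3)  # True

# Jensen with the convex function g(x) = x^2: g(E[X]) <= E[g(X)]
print(mean**2 <= second_moment)            # True
```

Here \(g(\mathbb{E}[X]) = 49/4\) while \(\mathbb{E}[g(X)] = 91/6\), so the two differ, and the gap is exactly the variance \(35/12\).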

The normal distribution

A real-valued random variable \(X\) has the normal distribution with mean \(\mu\) and variance \(\sigma^2\) if it is a continuous random variable with probability density function \[ f(x) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp \left\{ - \frac{(x-\mu)^2}{2\sigma^2} \right\}, \text{ for } x \in \mathbb{R} .\] We write \(X \sim \mathcal{N} (\mu,\sigma^2)\).

The case \(\mu =0\), \(\sigma^2 = 1\) is the standard normal distribution \(Z \sim \mathcal{N} (0,1)\). The density of the standard normal we usually write as \(\phi\), and the cumulative distribution function is \[ N ( x ) = \mathbb{P} ( Z \leq x ) = \int_{-\infty}^x \phi (y) \mathrm{d} y .\] If \(X \sim \mathcal{N} (\mu,\sigma^2)\), then \(\alpha + \beta X \sim \mathcal{N} (\alpha + \beta \mu, \beta^2 \sigma^2 )\). In particular, if \(X \sim \mathcal{N} (\mu,\sigma^2)\), then \[ \frac{X - \mu}{\sigma} \sim \mathcal{N} (0,1) .\]
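Standardisation means that every normal probability can be computed from \(N\). A minimal Python sketch of this, assuming the standard identity \(N(x) = \tfrac{1}{2} \bigl( 1 + \operatorname{erf} ( x / \sqrt{2} ) \bigr)\) (not stated in these notes) and using hypothetical function names:

```python
import math

def std_normal_cdf(x):
    """N(x) = P(Z <= x) for Z ~ N(0, 1), via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def normal_cdf(x, mu, sigma):
    """P(X <= x) for X ~ N(mu, sigma^2), using (X - mu) / sigma ~ N(0, 1)."""
    return std_normal_cdf((x - mu) / sigma)

# P(X <= 13) for X ~ N(10, 4) equals N(1.5)
print(normal_cdf(13.0, mu=10.0, sigma=2.0))
print(std_normal_cdf(1.5))
```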

The moment generating function of \(X \sim \mathcal{N} (\mu,\sigma^2)\) is \(M_X (t) = \mathbb{E} \left[ \mathrm{e}^{t X} \right] = \mathrm{e}^{\mu t + \sigma^2 t^2 /2}\), for \(t \in \mathbb{R}\).
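For completeness, here is a sketch of where this formula comes from. The symbol \(W\) is introduced only for this derivation: let \(W \sim \mathcal{N}(0,1)\), so that \(\mu + \sigma W \sim \mathcal{N}(\mu, \sigma^2)\) by the affine transformation property above. Completing the square in the exponent, using \(s y - y^2/2 = s^2/2 - (y-s)^2/2\), gives, for \(s \in \mathbb{R}\),

\[\begin{align*} \mathbb{E} \left[ \mathrm{e}^{s W} \right] & = \int_{-\infty}^\infty \mathrm{e}^{s y} \frac{1}{\sqrt{2\pi}} \mathrm{e}^{-y^2/2} \mathrm{d} y = \mathrm{e}^{s^2/2} \int_{-\infty}^\infty \frac{1}{\sqrt{2\pi}} \mathrm{e}^{-(y-s)^2/2} \mathrm{d} y = \mathrm{e}^{s^2/2} , \end{align*}\]

since the last integrand is the \(\mathcal{N}(s,1)\) density, which integrates to \(1\). Taking \(s = \sigma t\),

\[ \mathbb{E} \left[ \mathrm{e}^{t (\mu + \sigma W)} \right] = \mathrm{e}^{\mu t} \, \mathbb{E} \left[ \mathrm{e}^{\sigma t W} \right] = \mathrm{e}^{\mu t + \sigma^2 t^2 / 2} .\]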

You may not all have seen the multivariate normal distribution, although it is important in most statistics courses. A random vector \(\mathbf{X} \in \mathbb{R}^k\) has the \(k\)-dimensional normal distribution with mean vector \(\mu \in \mathbb{R}^k\) and covariance matrix \(\Sigma\) (a \(k \times k\) symmetric, positive-definite matrix) if its moment generating function is given by \[ \mathbb{E} \left[ \exp \left\{ \mathbf{t} \cdot \mathbf{X} \right\} \right] = \exp \left\{ \mu^T \mathbf{t} + \frac{1}{2} \mathbf{t}^T \Sigma \mathbf{t} \right\}, \text{ for } \mathbf{t} \in \mathbb{R}^k . \] We write \(\mathbf{X} \sim \mathcal{N}_k (\mu, \Sigma)\).

The central limit theorem

Let \(X_1, X_2, \ldots\) be independent, identically distributed (i.i.d.) random variables on a probability space \((\Omega, \mathcal{F}, \mathbb{P})\). Set \(S_0 = 0\) and \(S_n = \sum_{i=1}^n X_i\) for \(n \geq 1\).

The central limit theorem says that if \(\mathbb{E} ( X_i ) = \mu\) and \(\mathbb{V}\mathrm{ar} (X_i) = \sigma^2 \in (0,\infty)\), then \[ \frac{ S_n - n \mu}{\sqrt{n \sigma^2}} \overset{d}{\longrightarrow} \mathcal{N} (0,1) .\] In other words, \[ \lim_{n \to \infty} \mathbb{P} \left[ \frac{ S_n - n \mu}{\sqrt{n \sigma^2}} \leq z \right] = N (z) \text{ for all } z \in \mathbb{R} .\]
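The statement can be illustrated by simulation. The sketch below is an illustration only: the choice of Uniform\((0,1)\) summands (so \(\mu = 1/2\), \(\sigma^2 = 1/12\)), the sample sizes, the seed and the function names are all assumptions made here. It estimates \(\mathbb{P} \bigl[ (S_n - n\mu)/\sqrt{n \sigma^2} \leq z \bigr]\) and prints it next to \(N(z)\).

```python
import math
import random

def standardised_sum(n, rng):
    """One sample of (S_n - n*mu) / sqrt(n*sigma^2) for Uniform(0,1) summands."""
    mu, var = 0.5, 1.0 / 12.0
    s_n = sum(rng.random() for _ in range(n))
    return (s_n - n * mu) / math.sqrt(n * var)

def normal_cdf(z):
    """N(z) for the standard normal, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def empirical_cdf_at(z, n, trials, seed=0):
    """Monte Carlo estimate of P[(S_n - n*mu)/sqrt(n*sigma^2) <= z]."""
    rng = random.Random(seed)
    hits = sum(standardised_sum(n, rng) <= z for _ in range(trials))
    return hits / trials

for z in (-1.0, 0.0, 1.0):
    # The two printed numbers should be close for moderately large n
    print(z, empirical_cdf_at(z, n=30, trials=20000), normal_cdf(z))
```

The estimates carry Monte Carlo error of order \(1/\sqrt{\text{trials}}\), so agreement to about two decimal places is all that should be expected here.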

Conditional expectation

For a random variable \(X\) and an event \(A \in \mathcal{F}\), \[ \mathbb{E} ( X \mid A ) = \frac{\mathbb{E} ( X {1}_A )}{\mathbb{P} (A) } \] is the conditional expectation of \(X\) given \(A\).

If \(Y\) is another random variable, then taking \(A= \{ Y = y\}\) we can define \(\mathbb{E} ( X \mid Y = y )\) (at least if \(\mathbb{P} (Y = y) >0\), and more generally in certain cases).

We can define the random variable \(\mathbb{E} (X \mid Y)\) by \[ \mathbb{E} (X \mid Y ) = g (Y), \text{ where } g(y ) = \mathbb{E} ( X \mid Y = y ) .\]

Theorem. \(\mathbb{E} ( \mathbb{E} (X \mid Y ) ) = \mathbb{E} (X)\).

Example. As in the previous example, take \(\Omega = \{1,2,3,4,5,6\}\) and \(E = \{2,4,6\}\), and define random variables \(X(\omega) = \omega\) and \(Y(\omega) = {1}_E (\omega)\). Then

\[\begin{align*} \mathbb{E} ( X \mid Y = 0 ) & = \sum_x x \mathbb{P} ( X= x \mid Y = 0 ) = \frac{1+3+5}{3} = 3 ,\\ \mathbb{E} ( X \mid Y = 1 ) & = \sum_x x \mathbb{P} ( X= x \mid Y = 1 ) = \frac{2+4+6}{3} = 4 . \end{align*}\]

So the random variable \(\mathbb{E} (X \mid Y)\) is given by \[ \mathbb{E} (X \mid Y ) = \begin{cases} 3 & \text{if } Y = 0,\\ 4 & \text{if } Y = 1. \end{cases} \] A compact way to write this is \(\mathbb{E} (X \mid Y ) = 3 + Y\).

So, by the theorem, \(\mathbb{E} (X) = \mathbb{E} ( \mathbb{E} (X \mid Y ) ) = \mathbb{E} ( 3 + Y) = 7/2\) (as you would expect).
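The same computation can be checked exactly in a few lines of Python. This is a sketch with hypothetical names (`cond_expect`, `g`, `tower`), using the definition \(\mathbb{E} ( X \mid A ) = \mathbb{E} ( X {1}_A ) / \mathbb{P} (A)\) with \(A = \{ Y = y \}\) on the fair-die sample space.

```python
from fractions import Fraction

omega = range(1, 7)
prob = {w: Fraction(1, 6) for w in omega}   # fair die

X = lambda w: w                       # the score
Y = lambda w: 1 if w % 2 == 0 else 0  # indicator that the score is even

def cond_expect(rv, event):
    """E(rv | event) = E(rv * 1_event) / P(event), assuming P(event) > 0."""
    p_event = sum(prob[w] for w in omega if event(w))
    return sum(rv(w) * prob[w] for w in omega if event(w)) / p_event

# g(y) = E(X | Y = y) for y = 0, 1
g = {y: cond_expect(X, lambda w, y=y: Y(w) == y) for y in (0, 1)}
print(g[0], g[1])                     # 3 4

# E(E(X | Y)) = E(g(Y)) should equal E(X)
tower = sum(g[Y(w)] * prob[w] for w in omega)
mean_X = sum(X(w) * prob[w] for w in omega)
print(tower, mean_X)                  # 7/2 7/2
```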

Convergence of random variables

Let \(X\) and \(X_0, X_1, X_2, \ldots\) be random variables on a probability space \((\Omega, \mathcal{F}, \mathbb{P})\). We say that \(X_n\) converges to \(X\) almost surely, written \(X_n \overset{a.s.}{\longrightarrow} X\), if \[ \mathbb{P} \left( \bigl\{ \omega: \lim_{n \to \infty} X_n (\omega) = X(\omega) \bigr\} \right) = 1.\] Another way to say the same thing is that if \(N_{\varepsilon}\) (random) is the smallest integer such that \[ | X_n - X | \leq \varepsilon, \text{ for all } n \geq N_\varepsilon ,\] then \(X_n \to X\) a.s. if and only if \(\mathbb{P} \left( N_{\varepsilon} < \infty \text{ for all } \varepsilon > 0 \right) = 1\).

Let \(X\) and \(X_0, X_1, X_2, \ldots\) be random variables on a probability space \((\Omega, \mathcal{F}, \mathbb{P})\). We say that \(X_n\) converges to \(X\) in probability, written \(X_n \overset{p}{\longrightarrow} X\), if \[ \lim_{n \to \infty} \mathbb{P} \left( | X_n - X | > \varepsilon \right) = 0 \text{ for all } \varepsilon > 0 .\]

Let \(X\) and \(X_0, X_1, X_2, \ldots\) be random variables on a probability space \((\Omega, \mathcal{F}, \mathbb{P})\). For \(q \geq 1\), we say \(X_n\) converges to \(X\) in \(L^q\), if \[ \lim_{n \to \infty} \mathbb{E} \left[ | X_n - X |^q \right] = 0 .\]

Let \(X\) and \(X_0, X_1, X_2, \ldots\) be random variables (not necessarily on the same probability space) with cumulative distribution functions \(F (x) = \mathbb{P} ( X \leq x)\) and \(F_n (x) = \mathbb{P} ( X_n \leq x )\). We say \(X_n\) converges to \(X\) in distribution, written \(X_n \overset{d}{\longrightarrow} X\), if \[ \lim_{n \to \infty} F_n(x) = F(x) \text{ for all } x \text{ at which } F \text{ is continuous.} \]

It can be shown that

\(X_n \to X\) a.s. implies that \(X_n \overset{p}{\longrightarrow} X\).
\(X_n \to X\) in \(L^q\) implies that \(X_n \overset{p}{\longrightarrow} X\).
\(X_n \overset{p}{\longrightarrow} X\) implies that \(X_n \overset{d}{\longrightarrow} X\).

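Example. To prove the first implication, fix \(\varepsilon > 0\) and note that if \(n \geq N_{\varepsilon}\) then \(| X_n - X | \leq \varepsilon\), so \(\mathbb{P} ( | X_n - X | > \varepsilon ) \leq \mathbb{P} (n < N_{\varepsilon} )\). The events \(\{ n < N_{\varepsilon} \}\) decrease to \(\{ N_{\varepsilon} = \infty \}\) as \(n \to \infty\), so \[ \limsup_{n \to \infty} \mathbb{P} \left( | X_n - X | > \varepsilon \right) \leq \lim_{n \to \infty} \mathbb{P} (n < N_{\varepsilon} ) = \mathbb{P} ( N_{\varepsilon} = \infty ) .\] If \(X_n \to X\) a.s., then \(\mathbb{P} ( N_{\varepsilon} = \infty ) = 0\), and hence \(X_n \overset{p}{\longrightarrow} X\).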
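To see the definitions in action, consider the following simple illustration (chosen here; it is not from the original notes). Let \(U\) be uniformly distributed on \((0,1)\), and set \(X_n = {1}_{\{ U \leq 1/n \}}\) for \(n \geq 1\) and \(X = 0\). Since \(| X_n - X |\) takes only the values \(0\) and \(1\), for any \(\varepsilon \in (0,1)\) and any \(q \geq 1\),

\[\begin{align*} \mathbb{P} \left( | X_n - X | > \varepsilon \right) & = \mathbb{P} ( U \leq 1/n ) = \frac{1}{n} \to 0 , \\ \mathbb{E} \left[ | X_n - X |^q \right] & = \mathbb{P} ( U \leq 1/n ) = \frac{1}{n} \to 0 , \end{align*}\]

so \(X_n \overset{p}{\longrightarrow} 0\) and \(X_n \to 0\) in \(L^q\). Moreover, \(X_n = 1\) only when \(n \leq 1/U\), so \(N_{\varepsilon} = \lfloor 1/U \rfloor + 1 < \infty\) for every \(\varepsilon \in (0,1)\), and hence \(X_n \to 0\) a.s. as well.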