Probability II

1 Probability and independence

Goals: Recall and extend some of the key facts from Probability I. Derive continuity of probability along monotone sequences of events. Introduce the canonical probability space. Construct infinite sequences of mutually independent events. Construct infinite sequences of independent random variables.


When discussing even the simplest limit results of probability theory, such as the Law of large numbers and the Central limit theorem in Probability I, one uses infinite sequences of independent random variables. But do they exist? To affirmatively answer this question, we first need to explore some of the key concepts of probability theory in more detail. 1

1.1 Probability spaces

Similarly to other areas of pure mathematics, probability theory is developed in an axiomatic way. When talking about applications, ‘randomness’ must be carefully defined, to avoid possible ambiguity:

Example 1.1 A chord of the unit circle is chosen at random. What is the probability that its length is larger than \(\sqrt{3}\), the side of the equilateral triangle inscribed in a circle?

Bertrand argued that there are at least three different but natural ways of generating a random chord:

  1. choose two independent uniform points on the circle and join them by a chord;

  2. choose a uniform point on a fixed radius and draw the chord through it perpendicular to that radius;

  3. choose a uniform point inside the disc and draw the chord having it as its midpoint.

[Figure: random chords generated by each of the above methods; those in red are longer and those in blue are shorter than the triangle side.]

It is an instructive exercise to compute the corresponding probabilities. For more details, see the Wiki page. \(\vartriangleleft\)
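The three answers turn out to be \(\tfrac13\), \(\tfrac12\) and \(\tfrac14\) respectively, which can also be checked numerically. Below is a minimal Monte Carlo sketch; the helper `chord_length` and the method labels are ours, purely for illustration.

```python
import math
import random

def chord_length(method, rng=random):
    """Length of one random chord of the unit circle, generated by one of
    Bertrand's three methods (helper name and labels are illustrative)."""
    if method == "endpoints":       # two uniform points on the circumference
        a, b = rng.uniform(0, 2 * math.pi), rng.uniform(0, 2 * math.pi)
        return 2 * abs(math.sin((a - b) / 2))
    if method == "radius":          # uniform point on a fixed radius as the midpoint
        d = rng.uniform(0, 1)
        return 2 * math.sqrt(1 - d ** 2)
    if method == "midpoint":        # uniform point in the disc as the midpoint
        r = math.sqrt(rng.uniform(0, 1))    # radius of a uniform point in the disc
        return 2 * math.sqrt(1 - r ** 2)
    raise ValueError(method)

trials = 100_000
for method in ("endpoints", "radius", "midpoint"):
    freq = sum(chord_length(method) > math.sqrt(3) for _ in range(trials)) / trials
    print(method, round(freq, 3))   # roughly 1/3, 1/2 and 1/4 respectively
```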


In Probability I you defined a sample space \(\Omega\) as a collection of all possible outcomes of a probabilistic experiment; an event is then a collection of possible outcomes, i.e., a subset of the sample space. The two simplest examples of events are the impossible event \(\varnothing\) and the certain event \(\Omega\). If \(A\subset\Omega\) and \(B\subset\Omega\) are two events, one considers other events such as \(A\cup B\) (A or B), \(A\cap B\) (A and B), \(A^\mathsf{c}\equiv\Omega\setminus A\) (not A), \(A\setminus B\) (A but not B). Events obtained by such finite operations alone are, however, not sufficient for many interesting situations:

Example 1.2 In a standard coin-flipping experiment let \(A_k=\bigl\{\text{ first `head' occurs on $k$th flip}\bigr\}\) and \(B_n=\bigl\{\text{ `head' observed in the first $n$ flips}\bigr\}\); then \[B_n=\bigcup_{k=1}^n A_k,\qquad\text{ while }\qquad B_\infty\equiv \bigl\{\text{ `head' observed}\bigr\}=\bigcup_{k=1}^\infty A_k.\] \(\vartriangleleft\)


Consequently, to have any interesting and useful theory, one needs to include infinite sample spaces and countable event operations:

Definition 1.3 Let \(\mathcal{F}\) be a collection of subsets of \(\Omega\). We shall call \(\mathcal{F}\) a \(\sigma\)-field or a \(\sigma\)-algebra if it has the following properties:

  1. \(\varnothing\in\mathcal{F}\);

  2. if \(A\in\mathcal{F}\), then \(A^\mathsf{c}\in\mathcal{F}\);

  3. if \(A_1\), \(A_2, \ldots\in\mathcal{F}\), then \(\bigcup_{k=1}^\infty A_k\in\mathcal{F}\).


The last condition of Definition 1.3 makes sure that \(\bigl\{\text{ `head' observed}\bigr\}\) in Example 1.2 is indeed an event. In probability theory we always assume that all events describing a probabilistic experiment form a \(\sigma\)-field.

Remark 1.3.1 As you know from Probability I, given \(\Omega\), the trivial \(\sigma\)-field \(\{\varnothing,\Omega\}\) is the smallest \(\sigma\)-field over \(\Omega\). Also, the collection of all subsets of \(\Omega\) (also known as the power set \(2^\Omega\)) is the largest \(\sigma\)-field over \(\Omega\). For further examples see your first year notes and Exercises 1.23-1.25. \(\vartriangleleft\)


Definition 1.4 Let \(\Omega\) be a sample space, and \(\mathcal{F}\) be a \(\sigma\)-field of events in \(\Omega\). A probability distribution \(\mathsf{P}\) on \((\Omega,\mathcal{F})\) is a collection of numbers \(\mathsf{P}(A)\), \(A\in\mathcal{F}\), possessing the following properties:

  1. for every event \(A\in\mathcal{F}\), \(\mathsf{P}(A)\ge0\);

  2. \(\mathsf{P}(\Omega)=1\);

  3. for any pair of incompatible events \(A\) and \(B\) (i.e., \(A\cap B=\varnothing\)), \(\mathsf{P}(A\cup B)=\mathsf{P}(A)+\mathsf{P}(B)\);

  4. for any countable collection \(A_1\), \(A_2\), …of incompatible events (i.e., with \(A_i\cap A_j=\varnothing\) for \(i\neq j\)), \[\tag{1.1}\label{eq:countable-additivity} \mathsf{P}\Bigl(\bigcup_{k=1}^\infty A_k\Bigr)=\sum_{k=1}^\infty \mathsf{P}\bigl(A_k\bigr).\]


In other words, a probability measure is a countably additive map from events in \(\mathcal{F}\) into \([0,1]\). The countable additivity property (\ref{eq:countable-additivity}) is very important for applications. Without the incompatibility constraint, it becomes the countable subadditivity property, also known as Boole's inequality: for any countable collection \(A_1\), \(A_2\), … of events, \[\tag{1.2}\label{eq:countable-sub-additivity} \mathsf{P}\Bigl(\bigcup_{k=1}^\infty A_k\Bigr)\le\sum_{k=1}^\infty \mathsf{P}\bigl(A_k\bigr).\]

Remark 1.4.1 For the left-hand side of (\ref{eq:countable-additivity}) and (\ref{eq:countable-sub-additivity}) to make sense, the countable union \(\cup_{k=1}^\infty A_k\) must be an event. This is guaranteed by the assumption that \(\mathcal{F}\) is a \(\sigma\)-field. \(\vartriangleleft\)


Further useful consequences of the probability axioms in Definition 1.4 include: if \(A\) and \(B\) are events in \(\Omega\), \[\tag{1.3}\label{eq:two-events-combining-probability} \mathsf{P}(B\setminus A)=\mathsf{P}(B)-\mathsf{P}(A\cap B),\qquad \mathsf{P}(A\cup B)=\mathsf{P}(A)+\mathsf{P}(B\setminus A),\qquad \mathsf{P}(A^\mathsf{c})=1-\mathsf{P}(A)\] as well as the monotonicity property, \[\tag{1.4}\label{eq:probability-monotone-inclusion} \varnothing\subseteq A\subseteq B\subseteq\Omega \qquad\Longrightarrow\qquad 0=\mathsf{P}(\varnothing)\le\mathsf{P}(A)\le\mathsf{P}(B)\le\mathsf{P}(\Omega)=1.\]
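For instance, the first identity in (\ref{eq:two-events-combining-probability}) is a one-line consequence of additivity applied to the disjoint decomposition \(B=(A\cap B)\cup(B\setminus A)\): \[\mathsf{P}(B)=\mathsf{P}(A\cap B)+\mathsf{P}(B\setminus A),\qquad\text{ equivalently, }\qquad \mathsf{P}(B\setminus A)=\mathsf{P}(B)-\mathsf{P}(A\cap B).\] The other two identities follow in the same way from the disjoint decompositions \(A\cup B=A\cup(B\setminus A)\) and \(\Omega=A\cup A^\mathsf{c}\), while (\ref{eq:probability-monotone-inclusion}) follows from \(\mathsf{P}(B)=\mathsf{P}(A)+\mathsf{P}(B\setminus A)\ge\mathsf{P}(A)\) whenever \(A\subseteq B\).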

Definition 1.5 A probability space is a triple \((\Omega,\mathcal{F},\mathsf{P})\), where \(\Omega\) is a sample space, \(\mathcal{F}\) is a \(\sigma\)-field of events in \(\Omega\), and \(\mathsf{P}(\cdot)\) is a probability measure on \((\Omega,\mathcal{F})\).


In what follows we shall always assume that some probability space \((\Omega,\mathcal{F},\mathsf{P})\) is fixed. In the countable setting the situation is rather straightforward:

Example 1.6 If \(\Omega\) is countable, as is the case for repeated coin flipping, we can take \((\Omega, 2^{\Omega},\mathsf{P})\) as the probability space. Notice that for \(A \subset \Omega\), \(A=\bigcup_{\omega \in A} \{\omega\}\) so that \[\mathsf{P}(A)=\sum_{\omega \in A} \mathsf{P} (\{\omega\}).\] Then \(\mathsf{P}(\cdot)\) is uniquely determined by the individual values \(\mathsf{P} (\{\omega\})\).

More generally, let \(\Omega\) be a countable set and let \(p:\Omega \to [0,1]\) be such that \(\sum_{\omega \in \Omega}p(\omega)=1\). Then there exists a unique probability measure \(\mathsf{P}\) on \((\Omega,2^{\Omega})\) such that \(\mathsf{P}(\{\omega \})=p(\omega)\).

The case when \(\Omega\) is finite is especially simple. See your first year notes for more details! \(\vartriangleleft\)
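As a small computational sketch of the countable case (the particular weights \(p(\omega)=2^{-\omega}\) on \(\Omega=\{1,2,3,\dots\}\) and the helper names are ours), probabilities of events are obtained simply by summing point masses:

```python
from fractions import Fraction

def p(omega: int) -> Fraction:
    """Point mass p(omega) = 2**(-omega) on Omega = {1, 2, 3, ...}; the masses sum to 1."""
    return Fraction(1, 2 ** omega)

def prob(event, n_terms: int = 60) -> float:
    """Approximate P(A) = sum of p(omega) over omega in A by truncating the series;
    the neglected tail is smaller than 2**(-n_terms)."""
    return float(sum(p(k) for k in range(1, n_terms + 1) if event(k)))

print(prob(lambda k: True))         # ~1.0      : P(Omega) = 1
print(prob(lambda k: k % 2 == 0))   # ~0.333... : P({even outcome}) = 1/3
```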


What happens when \(\Omega\) is uncountable, e.g., \(\Omega=[0,1]\)? The collection \(2^{[0,1]}\) of all subsets of \([0,1]\) is too big and contains some subsets for which probability cannot be defined. 2 Borel realized that a smaller \(\sigma\)-field is needed instead of \(2^{[0,1]}\) for the theory to work.

1.1.1 Generated \(\sigma\)-fields

We start by checking that an intersection of two \(\sigma\)-fields is again a \(\sigma\)-field:

Example 1.7 Let \(\mathcal{X}\) be an arbitrary set. If \(\mathcal{F}_1\) and \(\mathcal{F}_2\) are two \(\sigma\)-fields of subsets of \(\mathcal{X}\), then \(\mathcal{G}=\mathcal{F}_1\cap\mathcal{F}_2\) is also a \(\sigma\)-field of subsets of \(\mathcal{X}\).

To verify this simple but very important property, we just check that all conditions in Definition 1.3 are satisfied: 1) as \(\varnothing\in\mathcal{F}_i\) for all \(i\), we have \(\varnothing\in\mathcal{G}\), by the definition of the intersection of \(\mathcal{F}_i\); 2) fix arbitrary \(A\in\mathcal{G}\); then \(A\in\mathcal{F}_i\) and therefore \(A^\mathsf{c}\in\mathcal{F}_i\) for all \(i\) (as each \(\mathcal{F}_i\) is a \(\sigma\)-field), and so \(A^\mathsf{c}\in\mathcal{G}\), as before; 3) fix an arbitrary sequence \(A_1\), \(A_2, \ldots\in\mathcal{G}\); then \(A_1\), \(A_2, \ldots\in\mathcal{F}_i\) and therefore \(\bigcup_{k=1}^\infty A_k\in\mathcal{F}_i\) for all \(i\) (as each \(\mathcal{F}_i\) is a \(\sigma\)-field), and so \(\bigcup_{k=1}^\infty A_k\in\mathcal{G}\). \(\vartriangleleft\)


To test your understanding, you might wish to attempt the following exercises:

Exercise 1.1

Let \(\mathcal{X}\) be an arbitrary set. Prove the following properties of the \(\sigma\)-fields:
a) If \(\mathcal{F}_1\), \(\mathcal{F}_2\), …, \(\mathcal{F}_m\) is a finite collection of \(\sigma\)-fields in \(\mathcal{X}\), then \(\mathcal{G}=\cap_{j=1}^m\mathcal{F}_j\) is also a \(\sigma\)-field in \(\mathcal{X}\).
b) If \(\mathcal{F}_1\), \(\mathcal{F}_2\), …is a countable collection of \(\sigma\)-fields in \(\mathcal{X}\), then \(\mathcal{G}=\cap_{j=1}^\infty\mathcal{F}_j\) is also a \(\sigma\)-field in \(\mathcal{X}\).
c) If \(\mathcal{F}_\beta\), \(\beta\in\mathcal{B}\), is an arbitrary collection of \(\sigma\)-fields in \(\mathcal{X}\), then \(\mathcal{G}=\cap_{\beta\in\mathcal{B}}\mathcal{F}_\beta\) is also a \(\sigma\)-field in \(\mathcal{X}\).


Exercise 1.2

Let \(\mathcal{X}\) be an arbitrary set. Show by counterexample that if \(\mathcal{F}_1\) and \(\mathcal{F}_2\) are two \(\sigma\)-fields of subsets of \(\mathcal{X}\), then \(\mathcal{F}_1\cup\mathcal{F}_2\) does not have to be a \(\sigma\)-field of subsets of \(\mathcal{X}\).


The simple properties established in Example 1.7 and Exercise 1.1 allow one to define generated sigma-fields:

Definition 1.8 Let \(\mathcal{X}\) be an arbitrary set and let \(\mathcal{D}\) be an arbitrary collection of subsets of \(\mathcal{X}\). Let \(\mathcal{F}_\alpha\), \(\alpha\in\mathcal{A}\), be the collection of all sigma-fields in \(\mathcal{X}\) which contain \(\mathcal{D}\), namely \(\mathcal{D}\subset\mathcal{F}_\alpha\) for all \(\alpha\in\mathcal{A}\) (this collection is non-empty, as it always contains the power set \(2^\mathcal{X}\)). Then \(\mathcal{G}:=\mathcal{G}_\mathcal{D}=\cap_{\alpha\in\mathcal{A}}\mathcal{F}_\alpha\) is the smallest sigma-field of subsets of \(\mathcal{X}\) containing \(\mathcal{D}\), also known as the \(\sigma\)-field generated by \(\mathcal{D}\).
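When \(\mathcal{X}\) is finite, \(\mathcal{G}_\mathcal{D}\) can be produced mechanically by closing \(\mathcal{D}\) under complements and unions until nothing new appears. The brute-force sketch below (function name and set representation are ours, and the approach only makes sense for small finite \(\mathcal{X}\)) reproduces the answer to Exercise 1.25a).

```python
def generated_sigma_field(X, D):
    """Close D under complements and pairwise unions until stable; for a finite
    set X this yields exactly the sigma-field generated by D."""
    X = frozenset(X)
    F = {frozenset(), X} | {frozenset(A) for A in D}
    changed = True
    while changed:
        changed = False
        for A in list(F):
            if X - A not in F:          # close under complements
                F.add(X - A)
                changed = True
        for A in list(F):
            for B in list(F):
                if A | B not in F:      # close under (finite) unions
                    F.add(A | B)
                    changed = True
    return F

# The collection {{'b','c'}} over Omega = {'a','b','c'} generates the four events
# {}, {'a'}, {'b','c'} and {'a','b','c'}  (compare Exercise 1.25a).
print(sorted(sorted(s) for s in generated_sigma_field({'a', 'b', 'c'}, [{'b', 'c'}])))
```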


advanced

Remark 1.8.1 Informally, one could try to construct \(\mathcal{G}_\mathcal{D}\) by starting from the collection of sets \(\mathcal{D}_0:=\mathcal{D}\cup\{\mathcal{X},\varnothing\}\) and then repeatedly apply the countable union property 3 of Definition 1.3. When constructing probability spaces, however, it is not immediately clear that each such resulting countable union can be prescribed a probability value in a unique way.

On the other hand, thanks to the continuity property of probabilities along monotone sequences of events, see Section 1.1.3 below, one can restrict such extensions to monotone operations only. Namely, if \(A_1\subset A_2\subset\ldots\) belong to \(\mathcal{D}_0\), add \(\cup_{k\ge1}A_k\) to \(\mathcal{D}_0\); similarly, if \(B_1\supset B_2\supset \ldots\) belong to \(\mathcal{D}_0\), add \(\cap_{k\ge1}B_k\) to \(\mathcal{D}_0\). Any collection containing all sets from the last two properties (namely, with each increasing sequence \(A_k\) containing its union, as well as with each decreasing sequence \(B_k\) containing its intersection) is called a monotone class. An important result in real analysis is that the smallest monotone class containing \(\mathcal{D}_0\) and the smallest sigma-field containing \(\mathcal{D}_0\) coincide. \(\vartriangleleft\)



1.1.2 The canonical probability space

In the uncountable case of \(\Omega=[0,1]\) it is tempting to start with the collection \(\mathcal{D}\) of all intervals \((a,b]\) with \(0\le a<b\le1\) and to define \(\mathsf{P}\bigl((a,b]\bigr):=b-a\), the length of \((a,b]\). By Definition 1.8 one can uniquely define the sigma-field generated by \(\mathcal{D}\); it is known as the Borel \(\sigma\)-field \(\mathcal{B}[0,1]\). An important result in real analysis claims that the length measure \(\mathsf{P}(\cdot)\) uniquely extends to a probability measure known as the Lebesgue measure in \([0,1]\).

Definition 1.9 The canonical probability space is the triple \(\bigl([0,1],\mathcal{B}[0,1],\mathsf{P}\bigr)\), where \(\mathcal{B}[0,1]\) is the Borel \(\sigma\)-field in \([0,1]\) and \(\mathsf{P}(\cdot)\) is the Lebesgue measure in \([0,1]\).


Remark 1.9.1 One can similarly define the Borel \(\sigma\)-field for other subsets \(\mathcal{X}\) of \(\mathbb{R}\) (including the positive half-line \([0,\infty)\) or the whole of \(\mathbb{R}\)). The length measure then uniquely extends to the Lebesgue measure on \(\mathcal{X}\) (of course, in general one doesn’t expect the probabilistic normalisation \(\mathsf{P}(\mathcal{X})=1\)). Generalisations to higher dimensions are also available. \(\vartriangleleft\)


Remark 1.9.2 The Borel \(\sigma\)-field \(\mathcal{B}[0,1]\) is, arguably, the most natural sigma-field in \([0,1]\). It is generated by many collections of subsets of \([0,1]\). E.g., one can start from any of the collections \(\bigl\{(a,b)\bigr\}\), \(\bigl\{[a,b)\bigr\}\), \(\bigl\{[a,b]\bigr\}\), \(\bigl\{(0,b)\bigr\}\), \(\bigl\{(0,b]\bigr\}\), \(\bigl\{[0,b)\bigr\}\), \(\bigl\{[0,b]\bigr\}\), with \(a\), \(b\) being real, rational, or dyadic rational (i.e., ratios of the type \(\frac{m}{2^n}\) with integer \(m\), \(n\)) and get \(\mathcal{B}[0,1]\) as the corresponding generated sigma-field. Other options include (sufficiently large) collections of open sets or of closed sets in \([0,1]\) etc. While \(\mathbb{R}\) contains many subsets which are not Borel, it is not immediate to construct one. \(\vartriangleleft\)


1.1.3 Monotone sequences of events

Sequences of events arise naturally when a probabilistic experiment is repeated many times. For example, if a fair coin is flipped consecutively, the “event” 3 \[A=\bigl\{\text{ `head' never seen}\bigr\}\equiv\bigl\{\text{ $\mathsf{H}$ never seen}\bigr\}\] is just the intersection, \(A=\cap_{n\ge1}A_n\), of the events \[A_n=\bigl\{\text{ `head' not seen in the first $n$ tosses}\bigr\}.\] This simple remark leads to the following important observations: a) taking countable operations is not that exotic in probabilistic models, and thus any reasonable theory should deal with \(\sigma\)-fields; b) the event \(A\) is in some sense the limit of the sequence \((A_n)_{n\ge1}\), so understanding limits of sequences of events is important.

Definition 1.10 A sequence \((A_n)_{n\ge1}\) of events is increasing if \(A_n\subseteq A_{n+1}\) for all \(n\ge1\). It is decreasing if \(A_n\supseteq A_{n+1}\) for all \(n\ge1\).


Example 1.11 If \((A_n)_{n\ge1}\) is a sequence of arbitrary events in some probability space \((\Omega,\mathcal{F},\mathsf{P})\), then the sequence \((B_n)_{n\ge1}\) with \(B_n=\cup_{k=1}^nA_k\) is increasing.

We first notice that for all sets \(C\) and \(D\) we have \(C\subset C\cup D\). Indeed, by definition, \(x\in C\cup D\) if and only if \(x\) belongs to \(C\) or \(x\) belongs to \(D\). In particular, every \(x\in C\) also belongs to \(C\cup D\), equivalently, \(C\subset C\cup D\). On the other hand, \[B_{n+1}=A_1\cup A_2\cup \ldots \cup A_n\cup A_{n+1}\equiv B_n\cup A_{n+1}\] so that \(B_n\subset B_{n+1}\) for all \(n\ge1\), as claimed.

A similar claim holds for finite intersections along the sequence \(A_n\), see Exercise 1.4. \(\vartriangleleft\)


Exercise 1.3

In the setting of Example 1.11, show that \(\cup_{m=1}^nA_m=\cup_{m=1}^nB_m\) for all integer \(n\ge1\) and that \(\cup_{m\ge1}A_m=\cup_{m\ge1}B_m\).


Exercise 1.4

Show that if \((A_n)_{n\ge1}\) is a sequence of arbitrary events in some probability space \((\Omega,\mathcal{F},\mathsf{P})\), then the sequence \((C_n)_{n\ge1}\) with \(C_n=\cap_{k=1}^nA_k\) is decreasing.


Exercise 1.5

In the setting of Exercise 1.4, show that \(\cap_{m=1}^nA_m=\cap_{m=1}^nC_m\) for all integer \(n\ge1\) and that \(\cap_{m\ge1}A_m=\cap_{m\ge1}C_m\).


The following result shows that the probability measure is continuous along monotone sequences of events.

Lemma 1.12 If \((A_n)_{n\ge1}\) is increasing with \(A:=\lim_nA_n\equiv\cup_{n\ge1}A_n\), then \[\tag{1.5}\label{eq:probability-is-continuous-along-increasing-event-sequence} \mathsf{P}(A)=\mathsf{P}\bigl(\lim_{n\to\infty}A_n\bigr)=\lim_{n\to\infty}\mathsf{P}(A_n).\] If \((A_n)_{n\ge1}\) is a decreasing sequence with \(A:=\lim_nA_n\equiv\cap_{n\ge1}A_n\), then \[\tag{1.6}\label{eq:probability-is-continuous-along-decreasing-event-sequence} \mathsf{P}(A)=\mathsf{P}\bigl(\lim_{n\to\infty}A_n\bigr)=\lim_{n\to\infty}\mathsf{P}(A_n).\] \(\vartriangleleft\)


Remark 1.12.1 If \((A_n)_{n\ge1}\) is not a monotone sequence of events, the claim of the lemma is not necessarily true (find a counterexample!). \(\vartriangleleft\)


Proof Let \((A_n)_{n\ge1}\) be increasing with \(A=\cup_{n\ge1}A_n\). Denote \(C_1=A_1\) and, for \(n\ge2\), put \(C_n=A_n\setminus A_{n-1}=A_n\cap A^\mathsf{c}_{n-1}\). We then have (why?) 4

\[A_n=\bigcup_{k=1}^nA_k=\bigcup_{k=1}^nC_k,\qquad\bigcup_{k=1}^\infty A_k=\bigcup_{k=1}^\infty C_k.\] Since the events in \((C_k)_{k\ge1}\) are mutually incompatible, the countable additivity property (\ref{eq:countable-additivity}) gives \[\mathsf{P}(A)=\mathsf{P}\Bigl(\bigcup_{k\ge1}A_k\Bigr)=\mathsf{P}\Bigl(\bigcup_{k\ge1}C_k\Bigr)=\sum_{k\ge1}\mathsf{P}\bigl(C_k\bigr)\le1.\] Therefore \[0\le\mathsf{P}(A)-\mathsf{P}(A_n)=\mathsf{P}\bigl(A\setminus A_n\bigr)=\mathsf{P}\Bigl(\bigcup_{k>n}C_k\Bigr)=\sum_{k>n}\mathsf{P}(C_k)\to0\] as \(n\to\infty\), as a tail sum of the convergent series \(\sum_{k\ge1}\mathsf{P}\bigl(C_k\bigr)\).

A similar argument (applied, e.g., to the increasing sequence of complements \(A_n^\mathsf{c}\)) works for decreasing sequences (do this!). \(\blacksquare\)


By combining countable monotone approximations together with Lemma 1.12, one can find probabilities of many events of interest.

Example 1.13 A standard six-sided die is tossed repeatedly. Let \(N_1\) denote the total number of ‘ones’ observed. Assuming that the individual outcomes are independent, show that \(\mathsf{P}(N_1=\infty)=1\).
Solution. We show that \(\mathsf{P}(N_1<\infty)=0\) by using a monotone approximation. First, notice that \(\{N_1<\infty\}=\cup_{n\ge1}B_n\) with \(B_n=\left\{\text{no `ones' after $n$th toss}\right\}\), so, by (\ref{eq:countable-sub-additivity}), it is enough to show that \(\mathsf{P}(B_n)=0\) for all \(n\). However, \(B_n=\cap_{m>0}C_{n,m}\) with \(C_{n,m}=\left\{\text{no `one' on tosses $n+1$, \dots, $n+m$}\right\}\) being a decreasing sequence, \(C_{n,m}\supset C_{n,m+1}\) for all \(m\ge1\). By independence of the individual outcomes, \(\mathsf{P}(C_{n,m})=(5/6)^m\) for all \(m\), \(n\). Because \(\mathsf{P}(C_{n,m})\to0\) as \(m\to\infty\), Lemma 1.12 implies \(\mathsf{P}(B_n)=\lim_{m\to\infty}\mathsf{P}(C_{n,m})=0\), as claimed. \(\vartriangleleft\)
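A quick simulation sketch (trial counts and the helper name are ours) illustrates the key estimate used above: the empirical frequency of seeing no ‘ones’ in \(m\) consecutive tosses matches \((5/6)^m\), which tends to \(0\).

```python
import random

def no_ones(m, rng=random):
    """True if none of m consecutive fair-die tosses shows a 'one'."""
    return all(rng.randint(1, 6) != 1 for _ in range(m))

trials = 20_000
for m in (5, 20, 50):
    freq = sum(no_ones(m) for _ in range(trials)) / trials
    print(m, round(freq, 4), round((5 / 6) ** m, 4))   # empirical frequency vs (5/6)**m
```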


important

For a general sequence \((A_k)_{k\ge1}\) of events in \(\Omega\), one says that the limit \(A:=\lim\limits_{k\to\infty}A_k\) exists if every point \(\omega\in\Omega\) either belongs to all \(A_k\) with sufficiently large \(k\), or lies outside all \(A_k\) with sufficiently large \(k\); the limit \(A\) then consists of the points of the first kind. It is easy to construct a sequence \((A_k)_{k\ge1}\) for which the limit does not exist and therefore for which the continuity property of Lemma 1.12 is violated (find a counterexample!).


In contrast, countable unions or countable intersections of events can be written as monotone limits of suitable events and therefore can be assigned a probability value:

Example 1.14 Let \((A_n)_{n\ge1}\) be an arbitrary sequence of events. As shown in Example 1.11 and Exercise 1.3, the union of these events can be written as a monotone increasing limit of their finite unions \(B_n\), \[\bigcup_{n\ge1}A_n=\bigcup_{n\ge1}B_n,\qquad\text{ where }\qquad B_n:=\bigcup_{m=1}^nA_m.\] Similarly, by Exercises 1.4 and 1.5, the intersection of these events can be written as a monotone decreasing limit of their finite intersections \(C_n\), \[\bigcap_{n\ge1}A_n=\bigcap_{n\ge1}C_n,\qquad\text{ where }\qquad C_n:=\bigcap_{m=1}^nA_m.\] In particular, \(\mathsf{P}(\cup_{n\ge1}A_n)\) and \(\mathsf{P}(\cap_{n\ge1}A_n)\) are well defined. \(\vartriangleleft\)


Exercise 1.6

Let \(\bigl([0,1],\mathcal{B}[0,1],\mathsf{P}\bigr)\) be the canonical probability space. By using a countable monotone approximation \(\{x\}\equiv\cap_{m\ge\lceil1/x\rceil}(x-\frac1m,x]\), where \(\lceil 1/x\rceil\) is the smallest integer at least \(1/x>0\), deduce that \(\mathsf{P}(\{x\})=0\) for all \(x\in(0,1]\).


important

Attempts of extending the results of Lemma 1.12 to uncountable limits can easily lead to contradictions. Indeed, the event \((0,1]\) has probability one, while being an uncountable union of zero-probability events \(\{x\}\) with \(x\in(0,1]\).


To summarise our discussion,

important

In probability theory we only work with countable monotone limits of events! If several approximations of an event of interest are available, we will always choose one of those for which Lemma 1.12 can be applied.


For additional practice, you should try some of Exercises 1.17-1.19 below.

1.2 Independence of events

The concept of independence is the main distinction between the abstract real analysis (measure theory) and probability theory. Arguably, it places probability theory in the centre of contemporary mathematics.

Definition 1.15 Let \((\Omega,\mathcal{F},\mathsf{P})\) be a probability space. Two events \(A\), \(B\in\mathcal{F}\) are independent, if their joint probability factorises, \[\tag{1.7}\label{eq:independent-events-def} \mathsf{P}(A\cap B)=\mathsf{P}(A)\mathsf{P}(B).\]


Example 1.16 Let \(A\), \(B\in\mathcal{F}\) be two events, where \(B\) has vanishing probability. Then \(A\) and \(B\) are independent: \[\tag{1.8}\label{eq:zero-probability-event-independence} \mathsf{P}(B)=0\qquad\Longrightarrow\qquad \mathsf{P}(A\cap B)=\mathsf{P}(A)\mathsf{P}(B).\] Similarly, if \(A\) is arbitrary but \(\mathsf{P}(B)=1\), then \(A\) and \(B\) are independent: \[\tag{1.9}\label{eq:full-probability-event-independence} \mathsf{P}(B)=1\qquad\Longrightarrow\qquad \mathsf{P}(A\cap B)=\mathsf{P}(A)\mathsf{P}(B).\]

Indeed, as \(A\cap B\subset B\), by the monotonicity property (\ref{eq:probability-monotone-inclusion}) we have \(\mathsf{P}(A\cap B)=0\) and so (\ref{eq:zero-probability-event-independence}) follows trivially. Similarly, as \(A\setminus (A\cap B)=A\setminus B\subset\Omega\setminus B=B^\mathsf{c}\) with \(\mathsf{P}(B^\mathsf{c})=0\), by (\ref{eq:two-events-combining-probability}) we deduce that \(0\le\mathsf{P}(A)-\mathsf{P}(A\cap B)\le\mathsf{P}(B^\mathsf{c})=0\), equivalently, \(\mathsf{P}(A\cap B)=\mathsf{P}(A)\); hence, (\ref{eq:full-probability-event-independence}) follows trivially. \(\vartriangleleft\)


Exercise 1.7

In the canonical probability space \(\bigl([0,1],\mathcal{B}[0,1],\mathsf{P}\bigr)\), consider the events (with \(d\in[0,\tfrac12)\)): \[A_1:=[0,\tfrac12),\qquad A_2:=[0,\tfrac14)\cup[\tfrac12,\tfrac34),\qquad B:=[\tfrac12,1),\qquad C:=[\tfrac14,\tfrac34),\qquad D:=[d,d+\tfrac12).\] a) show that \(A_1\) and \(A_2\) are independent;
b) show that \(B\) and \(A_2\) are independent;
c) are \(C\) and \(A_2\) independent?
d) for which values of \(d\in[0,\tfrac12)\) are \(D\) and \(A_2\) independent?


Definition 1.17 Let \((\Omega,\mathcal{F},\mathsf{P})\) be a probability space. A finite or infinite collection of events \((A_\alpha)_{\alpha\in\mathcal{A}}\) is (mutually) independent, if the probability of any finite sub-collection factorises, namely, for all integer \(k\ge1\) and all \(\alpha_1\), …, \(\alpha_k\in\mathcal{A}\), \[\tag{1.10}\label{eq:independent-event-collections-def} \mathsf{P}\Bigl(\bigcap_{\ell=1}^k A_{\alpha_\ell}\Bigr)=\prod_{\ell=1}^k\mathsf{P}\bigl(A_{\alpha_\ell}\bigr).\]


Exercise 1.8

Let events \((A_k)_{k\ge1}\) in \(\mathcal{F}\) be such that \(\mathsf{P}(A_k)=1\) for all integer \(k\ge1\). Show that the events \(A_k\) are mutually independent with \(\mathsf{P}\bigl(\cap_kA_k\bigr)=1\).


Exercise 1.9

Let \((B_k)_{k\ge1}\) be events with \(\mathsf{P}(B_k)=0\) for all integer \(k\ge1\). Show that \(\mathsf{P}\bigl(\cup_kB_k\bigr)=0\).


Exercise 1.10

In the canonical probability space \(\bigl([0,1],\mathcal{B}[0,1],\mathsf{P}\bigr)\), let \(A_1:=[0,\tfrac12)\) and \(A_2:=[0,\tfrac14)\cup[\tfrac12,\tfrac34)\) be as in Exercise 1.7. Find another event \(A_3\) of probability \(\tfrac12\) so that \(A_1\), \(A_2\), and \(A_3\) are mutually independent.


Exercise 1.11

In the canonical probability space \(\bigl([0,1],\mathcal{B}[0,1],\mathsf{P}\bigr)\), let \(A_1:=[0,\tfrac12)\) and \(A_2:=[0,\tfrac14)\cup[\tfrac12,\tfrac34)\) be as in Exercise 1.7. Find another event \(E\) of non-trivial probability in \((0,1)\) so that \(A_1\), \(A_2\), and \(E\) are mutually independent.


Inspired by the results in Exercises 1.10 and 1.11, a natural question arises: is there a sequence \((A_k)_{k\ge1}\) of mutually independent events in the canonical probability space \(\bigl([0,1],\mathcal{B}[0,1],\mathsf{P}\bigr)\) such that \(\mathsf{P}(A_k)=\tfrac12\) for all integer \(k\ge1\)?

Some examples of such sequences will be constructed in Section 1.4 below.

1.3 Random variables

Let \((\Omega,\mathcal{F},\mathsf{P})\) be a probability space. Informally speaking, a (real-valued) random variable \(X\) is a ‘nice’ map \(X:\Omega\to\mathbb{R}\).

In the simplest case of a discrete random variable with countable (finite or denumerable) set of possible values \(\mathcal{X}:=\{x_1,x_2,\ldots\}\subset\mathbb{R}\), being ‘nice’ means that the \(X\)-preimage of each \(x_k\in\mathcal{X}\) is an event, \[{}^\forall x_k\in\mathcal{X},\qquad \{X=x_k\}:=\bigl\{\omega\in\Omega:X(\omega)=x_k\bigr\}\in\mathcal{F}.\] In particular, one can speak of the probability mass function \(\{p_k\}\) of \(X\) defined through \(p_k:=\mathsf{P}(X=x_k)\), and introduce the expectation \(\mathsf{E} X\) and the cumulative distribution function \(F_X(\cdot)\) of \(X\) via the usual expressions \[\mathsf{E} X:=\sum_{x_k\in\mathcal{X}}x_k\mathsf{P}(X=x_k),\qquad \mathsf{F}_X(y):=\mathsf{P}(X\le y)\equiv\sum_{x_k\le y}\mathsf{P}(X=x_k),\quad y\in\mathbb{R}.\] In Probability I you saw various examples of discrete random variables, including Bernoulli, binomial, Poisson, and geometric variables.

Another class of random variables you discussed in Probability I consists of continuous random variables, whose distributions can be described in terms of the probability density function \(f(x)\ge0\), such that \[\mathsf{P}(a\le X\le b)=\int_a^bf(x)dx\] for all real \(-\infty\le a\le b\le\infty\), with the normalisation \(\mathsf{P}(X\in\mathbb{R})=1\). In this case \[\mathsf{E} X:=\int_\mathbb{R} xf(x)dx\quad\text{ and }\quad \mathsf{F}_X(y):=\mathsf{P}(X\le y)\equiv\int_{-\infty}^y f(x)dx,\quad y\in\mathbb{R}.\] Examples of continuous variables include uniform, exponential, and gaussian random variables.
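As a numerical sketch of these defining formulas (the choice of a Poisson(2) and an \(\mathsf{Exp}(1)\) variable, and the crude truncations, are ours), the expectation and the cumulative distribution function can be approximated directly from the probability mass function or the density.

```python
import math

# Discrete case: X ~ Poisson(2), via its probability mass function.
lam = 2.0
pmf = lambda k: math.exp(-lam) * lam ** k / math.factorial(k)
mean_X = sum(k * pmf(k) for k in range(100))      # E X    ~ 2.0
cdf_X3 = sum(pmf(k) for k in range(4))            # F_X(3) = P(X <= 3)

# Continuous case: Y ~ Exp(1), via a truncated Riemann sum of its density.
pdf = lambda x: math.exp(-x)                      # density on [0, infinity)
dx = 1e-3
grid = [i * dx for i in range(30_000)]            # [0, 30) carries almost all the mass
mean_Y = sum(x * pdf(x) * dx for x in grid)       # E Y    ~ 1.0
cdf_Y1 = sum(pdf(x) * dx for x in grid if x < 1)  # F_Y(1) ~ 1 - e**(-1) ~ 0.632

print(round(mean_X, 3), round(cdf_X3, 3), round(mean_Y, 3), round(cdf_Y1, 3))
```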

important

In general, given a probability space \((\Omega,\mathcal{F},\mathsf{P})\), a function \(X:\Omega\to\mathbb{R}\) is a random variable, if \[\tag{1.11}\label{eq:random-variable-def} \{X\le y\}:=\bigl\{\omega\in\Omega:X(\omega)\le y\bigr\}\] is an event for each \(y\in\mathbb{R}\); in other words, \(\bigl\{\{X\le y\}:y\in\mathbb{R}\bigr\}\subset\mathcal{F}\).


Lemma 1.18 Let \(X\) be a finite random variable on a probability space \((\Omega,\mathcal{F},\mathsf{P})\), i.e., \(\mathsf{P}(|X|<\infty)=1\). Its cumulative distribution function \(\mathsf{F}_X(y):=\mathsf{P}(X\le y)\) has the following properties:
a) \(\mathsf{F}_X(y)\) is a non-decreasing function of \(y\) such that \(\lim\limits_{y\downarrow-\infty}\mathsf{F}_X(y)=0\) and \(\lim\limits_{y\uparrow+\infty}\mathsf{F}_X(y)=1\);
b) \(\mathsf{F}_X(y)\) is a right-continuous function of \(y\), i.e., \(\lim\limits_{\varepsilon\downarrow0}\mathsf{F}_X(y+\varepsilon)=\mathsf{F}_X(y)\) for all \(y\in\mathbb{R}\). \(\vartriangleleft\)


Remark 1.18.1 The continuity properties of cumulative distribution functions rely upon the fact that monotone uncountable unions (intersections) of events can be written as countable unions (intersections) of suitable events, for which the continuity Lemma 1.12 can be applied. \(\vartriangleleft\)


Proof a) Let \(y_1\le y_2\) for real \(y_1\), \(y_2\). If \(\omega\in\{X\le y_1\}\), we have \(X(\omega)\le y_1\le y_2\) and so \(\{X\le y_1\}\subset\{X\le y_2\}\). By (\ref{eq:probability-monotone-inclusion}), this implies that \(\mathsf{F}_X(y_1)\le\mathsf{F}_X(y_2)\) for all real \(y_1\le y_2\). Next, for each \(\omega\in\Omega\) we have \(X(\omega)=-\infty\) if and only if \(X(\omega)\le-m\) for each integer \(m\ge1\); consequently, \(\{X=-\infty\}=\cap_{m\ge1}\{X\le-m\}\), and the continuity result (\ref{eq:probability-is-continuous-along-decreasing-event-sequence}) implies that \(\lim\limits_{y\downarrow-\infty}\mathsf{F}_X(y)=0\). Similarly, \(X(\omega)<+\infty\) if and only if \(X(\omega)\le m\) for some integer \(m\ge1\); consequently, \(\{X<+\infty\}=\cup_{m\ge1}\{X\le m\}\), and the continuity result (\ref{eq:probability-is-continuous-along-increasing-event-sequence}) implies that \(\lim\limits_{y\uparrow+\infty}\mathsf{F}_X(y)=1\).
b) Fix arbitrary \(y\in\mathbb{R}\). For each integer \(m\ge1\) there is real \(\varepsilon>0\) such that \(y<x=y+\varepsilon<y+\tfrac1m\); therefore, \[{}^\forall m\in\mathbb{N},\qquad\bigcap_{x>y}\{X\le x\}\subset\{X\le y+\tfrac1m\} \qquad\Longrightarrow\qquad\bigcap_{\varepsilon>0}\{X\le y+\varepsilon\}\subset\bigcap_{m\ge1}\{X\le y+\tfrac1m\}.\] Similarly, for each real \(\varepsilon>0\) there is integer \(m\ge1\) so that \(y<y+\tfrac1m<x=y+\varepsilon\); therefore, \[{}^\forall x>y,\qquad\bigcap_{m\ge1}\{X\le y+\tfrac1m\}\subset\{X\le x\} \qquad\Longrightarrow\qquad\bigcap_{m\ge1}\{X\le y+\tfrac1m\}\subset\bigcap_{\varepsilon>0}\{X\le y+\varepsilon\}.\] As a result, \(\cap_{\varepsilon>0}\{X\le y+\varepsilon\}=\cap_{m\ge1}\{X\le y+\tfrac1m\}\), and so \[\lim\limits_{\varepsilon\downarrow0}\mathsf{F}_X(y+\varepsilon)=\lim_{m\uparrow\infty}\mathsf{F}_X(y+\tfrac1m) \equiv\lim_{m\uparrow\infty}\mathsf{P}(X\le y+\tfrac1m)=\mathsf{P}(X\le y)\equiv\mathsf{F}_X(y),\] where the third equality follows from (\ref{eq:probability-is-continuous-along-decreasing-event-sequence}). \(\blacksquare\)


Exercise 1.12

Let \(X\) be a random variable with cumulative distribution function \(\mathsf{F}_X(y)\). Show that for each \(y\in\mathbb{R}\) the limit \(\mathsf{F}_X(y_-):=\lim_{\varepsilon\downarrow0}\mathsf{F}_X(y-\varepsilon)\) is well defined. Find an example with \(\mathsf{F}_X(y_-)\neq\mathsf{F}_X(y)\).


Example 1.19 Given the canonical probability space \(\bigl([0,1],\mathcal{B}[0,1],\mathsf{P}\bigr)\) consider the random variable \(X\) defined via \(X(\omega)=\omega\). Then \(\mathsf{P}(X\in[a,b])\equiv |[a,b]|=b-a\), for all \(0\le a\le b\le1\), that is, \(X\) is uniformly distributed, \(X\sim\mathcal{U}[0,1]\). \(\vartriangleleft\)


Example 1.20 Let \(X\) be a random variable on some probability space \((\Omega,\mathcal{F},\mathsf{P})\) and let \(f:\mathbb{R}\to\mathbb{R}\) be a function. Assume that \[\tag{1.12}\label{eq:measurable-functions} {}^\forall y\in\mathbb{R},\qquad f^{-1}\bigl((-\infty,y]\bigr)\in\mathcal{B}(\mathbb{R}),\] where \(\mathcal{B}(\mathbb{R})\) is the Borel sigma-field in \(\mathbb{R}\). Then the combined map \(Y(\omega):=f(X(\omega))\) is a random variable. \(\vartriangleleft\)


Remark 1.20.1 By using the standard properties of inverse functions, one can show that the relation (\ref{eq:measurable-functions}) extends to \[\tag{1.13}\label{eq:measurable-functions-general-def} {}^\forall B\in\mathcal{B}(\mathbb{R}),\qquad f^{-1}\bigl(B\bigr)\equiv\bigl\{x\in\mathbb{R}:f(x)\in B\bigr\}\in\mathcal{B}(\mathbb{R}),\] see Exercise 1.20. Functions satisfying (\ref{eq:measurable-functions-general-def}) are called measurable. \(\vartriangleleft\)


1.3.1 Independent random variables

Let \(X\) and \(Y\) be two random variables defined on the same probability space \((\Omega,\mathcal{F},\mathsf{P})\). Informally, \(X\) and \(Y\) are independent, if every event related to \(X\) and every event related to \(Y\) are independent.

Definition 1.21 Two random variables \(X\) and \(Y\) on the same probability space \((\Omega,\mathcal{F},\mathsf{P})\) are independent, if \[\tag{1.14}\label{eq:two-rvs-independence-cdf} {}^\forall x,y\in\mathbb{R},\qquad \mathsf{P}(X\le x,Y\le y)=\mathsf{P}(X\le x)\mathsf{P}(Y\le y).\]


Remark 1.21.1 As in Remark 1.9.2, one can show that the Borel sigma field \(\mathcal{B}(\mathbb{R})\) can be generated by the collection \(\bigl\{(-\infty,a]:a\in\mathbb{R}\bigr\}\). Consequently, the condition (\ref{eq:two-rvs-independence-cdf}) is equivalent to \[\tag{1.15}\label{eq:two-rvs-independence-Borel} \mathsf{P}(X\in B_x,Y\in B_y)=\mathsf{P}(X\in B_x)\mathsf{P}(Y\in B_y)\] for all Borel sets \(B_x\), \(B_y\) in \(\mathcal{B}(\mathbb{R})\). \(\vartriangleleft\)


Similarly to Definition 1.17, one can define independence of arbitrary collections of random variables:

Definition 1.22 An arbitrary collection of random variables \((X_\alpha)_{\alpha\in\mathcal{A}}\) on some probability space \((\Omega,\mathcal{F},\mathsf{P})\) is (mutually) independent, if any finite sub-collection of these variables is independent, namely, for all integer \(k\ge1\), all \(\alpha_1\), …, \(\alpha_k\in\mathcal{A}\), and all real \(x_{\alpha_1}\), …, \(x_{\alpha_k}\), \[\tag{1.16}\label{eq:independent-variable-collections-def} \mathsf{P}\bigl(X_{\alpha_1}\le x_{\alpha_1},\dots,X_{\alpha_k}\le x_{\alpha_k}\bigr)=\prod_{\ell=1}^k\mathsf{P}\bigl(X_{\alpha_\ell}\le x_{\alpha_\ell}\bigr).\]


The condition (\ref{eq:independent-variable-collections-def}) can be equivalently written in terms of general Borel sets as in (\ref{eq:two-rvs-independence-Borel}).

Example 1.23 If \(X_1\) and \(X_2\) are independent random variables, while functions \(f_1:\mathbb{R}\to\mathbb{R}\) and \(f_2:\mathbb{R}\to\mathbb{R}\) satisfy (\ref{eq:measurable-functions}) or (\ref{eq:measurable-functions-general-def}), then \(Y_i(\omega):=f_i(X_i(\omega))\) are independent random variables. This property easily extends to any finite or infinite setting. \(\vartriangleleft\)


Example 1.24 Let \(X_1\), \(X_2\) be independent variables and let the functions \(f_1\), \(f_2\) be as in Example 1.23. In view of the measure factorisation properties (\ref{eq:two-rvs-independence-cdf})-(\ref{eq:two-rvs-independence-Borel}), it is straightforward to deduce that \[\mathsf{E}(X_1X_2)=\mathsf{E}(X_1)\mathsf{E}(X_2),\qquad\text{ similarly, }\qquad \mathsf{E}\bigl(f_1(X_1)f_2(X_2)\bigr)=\mathsf{E}\bigl(f_1(X_1)\bigr)\mathsf{E}\bigl(f_2(X_2)\bigr).\] \(\vartriangleleft\)


Example 1.25 Fix arbitrary \(A\in\mathcal{B}[0,1]\) and consider the corresponding indicator random variable \(\mathbf{1}_A\). The latter is Bernoulli distributed with parameter \(\mathsf{E}\mathbf{1}_A=\mathsf{P}(\mathbf{1}_A=1)=\mathsf{P}(A)\), the Lebesgue measure of \(A\).

Fix arbitrary \(A\), \(B\in\mathcal{B}[0,1]\). Then \[\tag{1.17}\label{eq:set-operations-via-indicators} \mathbf{1}_{A\cap B}(\omega)\equiv \mathbf{1}_A(\omega)\mathbf{1}_B(\omega),\qquad \mathbf{1}_{A^\mathsf{c}}(\omega)\equiv 1-\mathbf{1}_A(\omega),\qquad \mathbf{1}_{A\cup B}(\omega)\equiv \mathbf{1}_A(\omega)+\mathbf{1}_B(\omega)-\mathbf{1}_{A\cap B}(\omega),\] and so all set operations can be recorded as linear combinations of products of indicator functions. Furthermore, it is straightforward to check that \[\tag{1.18}\label{eq:independence-of-events-and-their-indicators} \text{ events $A$ and $B$ are independent }\quad\Longleftrightarrow\quad\text{ random variables $\mathbf{1}_A$ and $\mathbf{1}_B$ are independent.}\] Indeed, by the above, for all \(C\), \(D\in\mathcal{B}[0,1]\) \[\tag{1.19}\label{eq:independence-of-events-via-their-indicators} \begin{gathered} \mathsf{P}(\mathbf{1}_C=1,\mathbf{1}_D=1)=\mathsf{P}(\mathbf{1}_C\mathbf{1}_D=1)=\mathsf{P}(\mathbf{1}_{C\cap D}=1)=\mathsf{P}(C\cap D), \quad\mathsf{P}(C)\mathsf{P}(D)=\mathsf{P}(\mathbf{1}_C=1)\mathsf{P}(\mathbf{1}_D=1) \end{gathered}\] with both expressions coinciding in the case where \(C\) and \(D\) are independent. Applying these relations to all combinations with \(C\in\{A,A^\mathsf{c}\}\) and \(D\in\{B,B^\mathsf{c}\}\) one deduces (\ref{eq:independence-of-events-and-their-indicators}). \(\vartriangleleft\)


Exercise 1.13

Let \(A_1\), …, \(A_k\) be events in a given probability space \((\Omega,\mathcal{F},\mathsf{P})\). Show that \(\{A_1,\dots,A_k\}\) are mutually independent if and only if the random variables \(\{\mathbf{1}_{A_1},\dots,\mathbf{1}_{A_k}\}\) are mutually independent.


The last result is the key to our construction of an infinite sequence of independent random variables.

advanced

1.4 Infinite sequences of independent random variables

For an arbitrary fixed \(x\in[0,1)\), let \((d_k)_{k\ge1}\) be the coefficients of its dyadic decomposition \(\sum_{k\ge1}\tfrac{d_k}{2^k}\) with \(d_k=d_k(x)\in\{0,1\}\) satisfying \[\tag{1.20}\label{eq:dyadic-decomposition-sets} d_k=1 \quad\Longleftrightarrow\quad x\in D_k:=\bigcup_{\ell=1}^{2^{k-1}}\Bigl[\frac{2\ell-1}{2^k},\frac{2\ell}{2^k}\Bigr).\] Clearly, each set \(D_k\) is a finite union of intervals of total length \(|D_k|=\tfrac12\). Furthermore, for each \(n\ge1\), \[\tag{1.21}\label{eq:dyadic-decomposition-convergence} 0\le x-\sum_{k=1}^n\frac{d_k(x)}{2^k}\le \frac1{2^n},\] so that the dyadic decomposition converges to \(x\) as \(n\to\infty\). We prove that a random variable \(X\) with values in \([0,1)\) is uniform \(\mathcal{U}[0,1)\) if and only if the sequence \((d_k(X))_{k\ge1}\) of its dyadic coefficients forms an infinite sequence of independent \(\mathsf{Ber}(\tfrac12)\) random variables. In turn, this allows one to generate a sequence of independent uniform \(\mathcal{U}[0,1)\) random variables, from which any sequence of independent variables can be constructed.

Let \(X(\omega):=\omega\) be the \(\mathcal{U}[0,1)\) uniform random variable in \(([0,1),\mathcal{B}[0,1),\mathsf{P})\), recall Example 1.19. In terms of the sets \(D_k\) in (\ref{eq:dyadic-decomposition-sets}) define the Bernoulli \(\mathsf{Ber}(\tfrac12)\) random variables \[\tag{1.22}\label{eq:bernoulli-dyadic-digits} \delta_k(\omega):=\begin{cases}1,& X(\omega)\in D_k,\\ 0,& X(\omega)\notin D_k,\end{cases} \qquad\text{ equivalently, }\qquad \delta_k(\omega)\equiv\mathbf{1}_{D_k}(\omega).\]
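Before turning to the proofs, here is a small computational sketch (the function name and the particular empirical checks are ours): it extracts the first dyadic digits of a number in \([0,1)\) and, for uniform samples, confirms numerically that the digits behave like independent \(\mathsf{Ber}(\tfrac12)\) variables, as Theorem 1.26 below asserts.

```python
import random

def dyadic_digits(x, n):
    """First n dyadic digits d_1, ..., d_n of x in [0,1), so that
    x = sum_k d_k / 2**k up to a truncation error of at most 2**(-n)."""
    digits = []
    for _ in range(n):
        x *= 2
        d = int(x)          # k-th digit: equals 1 exactly when the original x lies in D_k of (1.20)
        digits.append(d)
        x -= d
    return digits

n_digits, trials = 8, 100_000
data = [dyadic_digits(random.random(), n_digits) for _ in range(trials)]
means = [sum(row[k] for row in data) / trials for k in range(n_digits)]
print([round(m, 3) for m in means])                        # each digit mean ~ 0.5
print(round(sum(r[0] * r[1] for r in data) / trials, 3))   # P(d_1 = d_2 = 1) ~ 0.25
```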

Theorem 1.26 a) Given \(X\sim\mathcal{U}[0,1)\) in the canonical probability space \(([0,1),\mathcal{B}[0,1),\mathsf{P})\), define the \(\mathsf{Ber}(\tfrac12)\) variables \(\delta_k\) as in (\ref{eq:bernoulli-dyadic-digits}). Then \(\{\delta_k\}_{k\ge1}\) forms a mutually independent collection of random variables.

b) Given a sequence \(\{\varepsilon_k\}_{k\ge1}\) of mutually independent \(\mathsf{Ber}(\tfrac12)\) random variables on the canonical probability space \(([0,1),\mathcal{B}[0,1),\mathsf{P})\), define \(Y(\omega):=\sum_{k\ge1}\varepsilon_k(\omega)2^{-k}\). Then \(Y\sim\mathcal{U}[0,1)\). \(\vartriangleleft\)


The proof of Theorem 1.26 is postponed until Section 1.4.1. A simple corollary is the following result:

Theorem 1.27 In the canonical probability space \(([0,1),\mathcal{B}[0,1),\mathsf{P})\) there is a sequence \((X_n)_{n\ge1}\) of mutually independent random variables with common \(\mathcal{U}[0,1)\) distribution. \(\vartriangleleft\)


Proof Let \(\{\delta_k\}_{k\ge1}\) be a sequence of independent \(\mathsf{Ber}(\tfrac12)\) random variables constructed in Theorem 1.26a). Using a bijection \(\mathbb{N}\to\mathbb{N}^2\) with \(k\mapsto(n,m)\in\mathbb{N}^2\), renumber these variables as \(\{\delta_m^n\}_{m,n\ge1}\) and define \[X_n(\omega):=\sum_{m\ge1}{\delta_m^n(\omega)}/{2^m}.\] By construction, the variables \(X_n\) are mutually independent, and, by Theorem 1.26b), have \(\mathcal{U}[0,1)\) distribution. \(\blacksquare\)
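The following sketch mimics this proof numerically (the routing rule ‘digit \(k\) goes to stream \(k \bmod n\)’ plays the role of the bijection, and all names are ours): it splits the dyadic digits of a single uniform value into several streams and re-assembles each stream into a new number in \([0,1)\).

```python
import random

def split_uniform(x, how_many, digits_each=16):
    """Route the dyadic digits of one uniform x in [0,1) to `how_many` streams
    (digit k goes to stream k % how_many) and re-assemble each stream into a
    number in [0,1).  digits_each is kept small because a Python float carries
    only about 53 random bits, so this is an approximation of the construction."""
    values = [0.0] * how_many
    weight = [0.5] * how_many
    for k in range(how_many * digits_each):
        x *= 2
        d = int(x)              # next dyadic digit of x
        x -= d
        i = k % how_many
        values[i] += d * weight[i]
        weight[i] /= 2
    return values

print(split_uniform(random.random(), 3))    # three (approximately) independent U[0,1) values
```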


It remains to show how arbitrary distributions can be generated. The following idea is very useful in simulations, e.g., Monte Carlo methods.

Example 1.28 Let \(\mathsf{F}(\cdot)\) be an arbitrary distribution function (namely, \(\mathsf{F}:\mathbb{R}\to[0,1]\) is a non-decreasing right-continuous function that tends to \(0\) at \(-\infty\) and to \(1\) at \(+\infty\), recall Lemma 1.18). Define its left-continuous inverse on \([0,1]\) via \[\mathsf{F}^{-1}(x):=\inf\{z\in\mathbb{R}:\mathsf{F}(z)\ge x\},\qquad x\in[0,1],\] where, as usual, the infimum of the empty set is \(+\infty\). Define \[Y(\omega):=\mathsf{F}^{-1}\bigl(X(\omega)\bigr),\qquad\text{ where }\qquad X\sim\mathcal{U}[0,1).\] Then \(\mathsf{P}(Y\le y)=\mathsf{P}\bigl(\mathsf{F}^{-1}(X)\le y\bigr)=\mathsf{P}\bigl(X\le\mathsf{F}(y)\bigr)=\mathsf{F}(y)\) for all \(y\in\mathbb{R}\), that is, \(Y\) has cumulative distribution function \(\mathsf{F}(\cdot)\). \(\vartriangleleft\)
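For instance, for the \(\mathsf{Exp}(1)\) distribution one has \(\mathsf{F}(z)=1-e^{-z}\) for \(z\ge0\) and \(\mathsf{F}^{-1}(x)=-\log(1-x)\), so the recipe of Example 1.28 can be run directly (sample size and helper name below are ours):

```python
import math
import random

def exp_inverse_cdf(x):
    """Left-continuous inverse of F(z) = 1 - exp(-z), the Exp(1) distribution function."""
    return -math.log(1.0 - x)

n = 200_000
ys = [exp_inverse_cdf(random.random()) for _ in range(n)]   # X ~ U[0,1),  Y = F^{-1}(X)
print(round(sum(ys) / n, 3))                     # sample mean ~ 1
print(round(sum(y <= 1.0 for y in ys) / n, 3))   # P(Y <= 1)   ~ 1 - e**(-1) ~ 0.632
```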


Theorem 1.29 Let \(\bigl(\mathsf{F}_n(\cdot)\bigr)_{n\ge1}\) be an arbitrary sequence of distribution functions. Then there exists a sequence \((Y_n)_{n\ge1}\) of independent random variables such that \(Y_n\) has distribution \(\mathsf{F}_n\). \(\vartriangleleft\)


Proof If \((X_n)_{n\ge1}\) is the sequence of independent \(\mathcal{U}[0,1)\) random variables from Theorem 1.27, then \(Y_n(\omega):=\mathsf{F}_n^{-1}\bigl(X_n(\omega)\bigr)\) are independent random variables where \(Y_n\) has cumulative distribution function \(\mathsf{F}_n\), by Example 1.28. \(\blacksquare\)


1.4.1 Independent Bernoulli sequences

Here we prove Theorem 1.26.

For a Borel set \(A\in\mathcal{B}[0,1)\), define \[\tag{1.23}\label{eq:rademacher-indicator-variables} \rho_A(\omega):=\mathbf{1}_A(\omega)-\mathbf{1}_{A^\mathsf{c}}(\omega)\equiv2\mathbf{1}_A(\omega)-1.\] We need the following general result.

Lemma 1.30 For integer \(n>1\), let \(\{A_k\}_{k=1}^n\) be events in the canonical probability space \(([0,1),\mathcal{B}[0,1),\mathsf{P})\) each having the same probability \(\mathsf{P}(A_k)=\tfrac12\). Then the events \(\{A_k\}_{k=1}^n\) are mutually independent if and only if \[\tag{1.24}\label{eq:expectation-of-rademacher-product-vanishes} {}^\forall \mathcal{J}\subseteq\{1,\dots,n\},\qquad \varnothing\neq\mathcal{J} \quad \Longrightarrow \quad \mathsf{E}\Bigl(\prod_{\ell\in\mathcal{J}}\rho_{A_\ell}\Bigr)=0.\] \(\vartriangleleft\)


Proof Let the events \(\{A_k\}_{k=1}^n\) be mutually independent. Then for each index subset \(\mathcal{I}\subseteq\{1,\dots,n\}\) the intersection \(\cap_{\ell\in\mathcal{I}}A_\ell\) has probability \(2^{-|\mathcal{I}|}\), equivalently, \[\mathsf{E}\Bigl(\prod_{\ell\in\mathcal{I}}(2\mathbf{1}_{A_\ell})\Bigr)=2^{|\mathcal{I}|}\mathsf{E}\bigl(\mathbf{1}_{\cap_{\ell\in\mathcal{I}}A_\ell}\bigr)=1;\] here and below we write \(|\mathcal{I}|\) for the cardinality of \(\mathcal{I}\). As a result, by (\ref{eq:rademacher-indicator-variables}) the expectation in (\ref{eq:expectation-of-rademacher-product-vanishes}) equals \[\mathsf{E}\Bigl(\prod_{\ell\in\mathcal{J}}\bigl(2\mathbf{1}_{A_\ell}-1\bigr)\Bigr)= \mathsf{E}\Bigl(\sum_{\mathcal{I}\subseteq\mathcal{J}}(-1)^{|\mathcal{J}|-|\mathcal{I}|}\prod_{\ell\in\mathcal{I}}(2\mathbf{1}_{A_\ell})\Bigr) =\sum_{\mathcal{I}\subseteq\mathcal{J}}(-1)^{|\mathcal{J}|-|\mathcal{I}|}\mathsf{E}\Bigl(\prod_{\ell\in\mathcal{I}}(2\mathbf{1}_{A_\ell})\Bigr) =\sum_{\mathcal{I}\subseteq\mathcal{J}}(-1)^{|\mathcal{J}|-|\mathcal{I}|},\] where the last sum is just \[\sum_{|\mathcal{I}|=0}^{|\mathcal{J}|}\binom{|\mathcal{J}|}{|\mathcal{I}|}(-1)^{|\mathcal{J}|-|\mathcal{I}|}\cdot1^{|\mathcal{I}|}=(1-1)^{|\mathcal{J}|}=0.\]

On the other hand, suppose the condition (\ref{eq:expectation-of-rademacher-product-vanishes}) holds. Then, by (\ref{eq:rademacher-indicator-variables}), for each index subset \(\mathcal{J}\subseteq\{1,\dots,n\}\) we get \[2^{|\mathcal{J}|}\mathsf{E}\bigl(\mathbf{1}_{\cap_{\ell\in\mathcal{J}}A_\ell}\bigr) =\mathsf{E}\Bigl(\prod_{\ell\in\mathcal{J}}(2\mathbf{1}_{A_\ell})\Bigr)\equiv\mathsf{E}\Bigl(\prod_{\ell\in\mathcal{J}}(\rho_{A_\ell}+1)\Bigr) =\mathsf{E}\Bigl(\sum_{\mathcal{I}\subseteq\mathcal{J}}\prod_{\ell\in\mathcal{I}}\rho_{A_\ell}\Bigr)=1+\sum_{\varnothing\neq\mathcal{I}\subseteq\mathcal{J}}\mathsf{E}\Bigl(\prod_{\ell\in\mathcal{I}}\rho_{A_\ell}\Bigr)=1,\] and so the intersection \(\cap_{\ell\in\mathcal{J}}A_\ell\) has probability \(2^{-|\mathcal{J}|}\); equivalently, the sets \(\{A_\ell\}_{\ell\in\mathcal{J}}\) are independent. \(\blacksquare\)


Lemma 1.31 The collection of variables \(\{\delta_k\}_{k\ge1}\) from (\ref{eq:bernoulli-dyadic-digits}) is mutually independent in \(([0,1),\mathcal{B}[0,1),\mathsf{P})\). \(\vartriangleleft\)


Proof As mentioned above, we have \(\mathsf{P}(D_k)=\tfrac12\) for each integer \(k\ge1\). We fix an arbitrary integer \(n>1\) and show that \(\{D_k\}_{k=1}^n\) are mutually independent by verifying the condition (\ref{eq:expectation-of-rademacher-product-vanishes}). The result of Exercise 1.13 then implies that the corresponding indicator functions \(\{\delta_k\}_{k=1}^n\) are mutually independent \(\mathsf{Ber}(\tfrac12)\) random variables.

To this end, fix an arbitrary index subset \(\mathcal{J}\subseteq\{1,\dots,n\}\) and let \(j^*:=\max\{j:j\in\mathcal{J}\}\) and \(\mathcal{J}_*:=\mathcal{J}\setminus\{j^*\}\). By (\ref{eq:rademacher-indicator-variables})-(\ref{eq:expectation-of-rademacher-product-vanishes}), \[\mathsf{E}\Bigl(\prod_{\ell\in\mathcal{J}}\rho_{D_\ell}\Bigr)=\mathsf{E}\Bigl(\bigl(\mathbf{1}_{D_{j^*}}-\mathbf{1}_{(D_{j^*})^\mathsf{c}}\bigr)\prod_{\ell\in\mathcal{J}_*}\rho_{D_\ell}\Bigr)\] and we separately integrate the last expression on each interval \(\triangle^i_{j^*}:=\bigl[\tfrac{i-1}{2^{j^*-1}},\tfrac{i}{2^{j^*-1}}\bigr)\), with integer \(i\in\{1,\dots,2^{j^*-1}\}\). However, in view of (\ref{eq:dyadic-decomposition-sets}) the product \(\prod_{\ell\in\mathcal{J}_*}\rho_{D_\ell}\) is constant on each \(\triangle^i_{j^*}\), while by (\ref{eq:rademacher-indicator-variables}) the variable \(\rho_{D_{j^*}}\) averages to zero there. Consequently, the integral in the last display vanishes, implying the condition (\ref{eq:expectation-of-rademacher-product-vanishes}). Thus the claim follows from Lemma 1.30 and Exercise 1.13. \(\blacksquare\)


We also have the converse result:

Lemma 1.32 Let \(\{d_k\}_{k\ge1}\) be independent random variables with \(d_k\sim\mathsf{Ber}(\tfrac12)\) for all \(k\ge1\). Then \(Y(\omega):=\sum_{k\ge1}\frac{d_k(\omega)}{2^k}\) has \(\mathcal{U}[0,1)\) distribution. \(\vartriangleleft\)


Proof By Remark 1.9.2, it is sufficient to show that \[\tag{1.25}\label{eq:target-bound-dyadic-interval} \mathsf{P}\bigl(Y\in\bigl[\tfrac{m-1}{2^k},\tfrac{m}{2^k}\bigr)\bigr)=\tfrac1{2^k}\] for all integer \(k\ge0\) and \(1\le m\le2^k\).

To this end, fix integer \(n\ge1\) and let \(Y_n\) be the finite sum approximation to \(Y\), namely, \(Y_n(\omega):=\sum_{k=1}^n\frac{d_k(\omega)}{2^k}\). It is straightforward to verify that \(Y_n\) has uniform distribution in the finite set \(\{\tfrac{\ell}{2^n}\}_{\ell=0}^{2^n-1}\). By the a priori estimate (\ref{eq:dyadic-decomposition-convergence}), for all \(\omega\in[0,1)\) we have \(0\le Y(\omega)-Y_n(\omega)\le\tfrac1{2^n}\). Therefore, for all \(0\le a\le b\le1\) we have \[\mathsf{P}\bigl(Y_n\in\bigl[a,b-\tfrac1{2^n}\bigr)\bigr)\le\mathsf{P}\bigl(Y\in[a,b)\bigr)\le\mathsf{P}\bigl(Y_n\in\bigl[a-\tfrac1{2^n},b\bigr)\bigr),\] and so the probability in (\ref{eq:target-bound-dyadic-interval}) satisfies, for all integer \(n\ge k\), \[\tfrac1{2^k}-\tfrac1{2^n}=\tfrac{2^{n-k}-1}{2^n}\le\mathsf{P}\bigl(Y\in\bigl[\tfrac{m-1}{2^k},\tfrac{m}{2^k}\bigr)\bigr) \le\tfrac{2^{n-k}+1}{2^n}=\tfrac1{2^k}+\tfrac1{2^n}.\] Taking the limit \(n\to\infty\), we recover (\ref{eq:target-bound-dyadic-interval}) and thus finish the proof. \(\blacksquare\)



Bertrand’s paradox was noticed by the French mathematician Joseph Bertrand (1822-1900) at the end of the nineteenth century. For more details, see the Wiki page.

Émile Borel (1871-1956) and Henri Lebesgue (1875-1941) were among the pioneers of the contemporary measure theory and integration theory. Following their work, in 1933 Andrey Kolmogorov (1903-1987) succeeded in formulating the axioms of probability theory as stated above. In particular, his approach states that a probability \(\mathsf{P}\) is a measure on a measurable space \((\Omega,\mathcal{F})\), see, e.g., .

[Figure: (left to right) Bertrand, Borel, Lebesgue, Kolmogorov]

Interesting sources on the early history of modern probability theory are . Some discussion of set theory can be found in , , .

checklist

By the end of this section you should be able to:


Exercise 1.14

Let \(A\) be a set and let \(\bigl(A_\alpha\bigr)_{\alpha\in\mathcal{A}}\) be a collection 5 of sets. Verify the following claims:
a) if \(\bigl(A_\alpha\bigr)_{\alpha\in\mathcal{A}}\) is such that \(A_\alpha\subset A\) for all \(\alpha\in\mathcal{A}\), then \(\cup_\alpha A_\alpha\subset A\).
b) if for every \(x\in A\) there is \(\alpha\in\mathcal{A}\) such that \(x\in A_\alpha\), then \(A\subset\cup_\alpha A_\alpha\).
c) if \(A\subset A_\alpha\) for all \(\alpha\in\mathcal{A}\), then \(A\subset \cap_\alpha A_\alpha\).
d) if for every \(x\notin A\) there is \(\alpha\in\mathcal{A}\) such that \(x\notin A_\alpha\), then \(\cap_\alpha A_\alpha\subset A\).


Exercise 1.15

A coin with probability \(p\in[0,1]\) of showing ‘heads’ is tossed \(n\) times. Let \(E\) be the event \(\{\text{`heads' is observed on the first toss}\}\) and \(F_k\) the event \(\{\text{exactly $k$ `heads' are obtained}\}\). For which pairs of integers \((n,k)\) are \(E\) and \(F_k\) independent?


Exercise 1.16

Let \(A_k\), \(k\ge1\), be events such that \(\mathsf{P}(A_k)=1\) for all \(k\ge1\). Show that
a) \(\mathsf{P}(A_j\cap A_k)=1\) for all \(j\ge1\), \(k\ge1\);
b) \(\mathsf{P}(\cap_{j=1}^mA_j)=1\) for all \(m\ge1\).
Is it true that \(\mathsf{P}(\cap_{j=1}^\infty A_j)=1\)? Justify your answer.


Exercise 1.17

Show that \(\bigcup_{k=1}^\infty\bigl(1/k,1\bigr]=\bigcup_{k=1}^\infty\bigl[1/k,1\bigr]=\bigcup_{0<x<1}(x,1]=(0,1]\).


Exercise 1.18

Show that \(\cap_{k=1}^\infty\bigl(-\infty,1/k\bigr]=\cap_{k=1}^\infty\bigl(-\infty,1/k\bigr)=\cap_{0<x}(-\infty,x]=\cap_{0<x}(-\infty,x)=(-\infty,0]\).


Exercise 1.19

Let events \((C_\alpha)_{\alpha>0}\) form an increasing family, ie., if \(\alpha_1<\alpha_2\), then \(C_{\alpha_1}\subseteq C_{\alpha_2}\). Show that \[\textstyle \text{a)\quad}\bigcap_{\alpha>0}C_\alpha=\bigcap_{k=1}^\infty C_{1/k},\qquad\qquad \text{b)\quad}\bigcup_{\alpha>0}C_\alpha=\bigcup_{k=1}^\infty C_k.\]


Exercise 1.20

Let \(X\) and \(Y\) be two sets, and let \(f\) be a function, \(f:X\to Y\). For every subset \(C\subset Y\), let \(f^{-1}(C):=\bigl\{x\in X:f(x)\in C\bigr\}\subset X\) be the inverse image of \(C\) in \(X\). If \(\mathcal{F}_Y\) is a \(\sigma\)-field in \(Y\), show that the collection \(f^{-1}\bigl(\mathcal{F}_Y\bigr):=\bigl\{f^{-1}(D):D\in\mathcal{F}_Y\bigr\}\) is a \(\sigma\)-field of subsets of \(X\).


Exercise 1.21

Let \(X\ge0\) be a random variable such that \(\mathsf{E} X=0\).
a) Use Markov’s inequality to show that \(\mathsf{P}(X>\varepsilon)=0\), for every fixed \(\varepsilon>0\);
b) Deduce that \(\mathsf{P}(X=0)=1\).


Exercise 1.22

Suppose that \(X\) is a positive random variable with \(\mathsf{E} X<\infty\). Show that \(\mathsf{P}(X=\infty)=0\).


Exercise 1.23

Given arbitrary \(\Omega\), show that \(\{\varnothing,\Omega\}\) and \(2^\Omega\) are \(\sigma\)-fields, recall Remark 1.3.1.


Exercise 1.24

Let \(\Omega=\{a,b,c\}\). a) Is \(\mathcal{F}_1=\bigl\{\varnothing,\{a\},\{b,c\},\Omega\bigr\}\) a \(\sigma\)-field? b) Is \(\mathcal{F}_2=\bigl\{\varnothing,\{a,b\},\{b,c\},\Omega\bigr\}\) a \(\sigma\)-field? Justify your answer.


Exercise 1.25

Let \(\Omega=\{a,b,c\}\). a) Find the \(\sigma\)-field over \(\Omega\) generated by the single-set collection \(\bigl\{\{b,c\}\bigr\}\). b) Find the \(\sigma\)-field over \(\Omega\) generated by the collection \(\bigl\{\{a,b\},\{b,c\}\bigr\}\).


Exercise 1.26

If random variable \(X\) is positive with positive probability, show that \(\mathsf{P}(X>\delta)>0\) for some \(\delta>0\).


Exercise 1.27

If the difference of random variables \(X\) and \(Y\) is positive with positive probability, show that \(\mathsf{P}(X-Y>\delta)>0\) for some \(\delta>0\).


Exercise 1.28

Let \(\mathcal{X}\) be a set, and let \(\mathcal{D}_1\) and \(\mathcal{D}_2\) be two collections of subsets of \(\mathcal{X}\). If \(\mathcal{D}_1\subset\mathcal{D}_2\), show that the generated sigma-fields \(\mathcal{G}_{\mathcal{D}_i}\), \(i\in\{1,2\}\), satisfy \(\mathcal{G}_{\mathcal{D}_1}\subset\mathcal{G}_{\mathcal{D}_2}\).


Exercise 1.29

In the setting of Remark 1.9.2, let \(\mathcal{D}_1:=\bigl\{(a,b]:0\le a<b<1\bigr\}\) and \(\mathcal{D}_2:=\bigl\{(a,b):0\le a<b<1\bigr\}\) be two collections of sub-intervals of \([0,1)\). Write \(\mathcal{G}_{\mathcal{D}_i}\) for the sigma-field generated by \(\mathcal{D}_i\), \(i\in\{1,2\}\).


a) Show that \(\mathcal{D}_2\subset\mathcal{G}_{\mathcal{D}_1}\) and therefore \(\mathcal{G}_{\mathcal{D}_2}\subset\mathcal{G}_{\mathcal{D}_1}\).
b) Show that \(\mathcal{D}_1\subset\mathcal{G}_{\mathcal{D}_2}\) and therefore \(\mathcal{G}_{\mathcal{D}_1}\subset\mathcal{G}_{\mathcal{D}_2}\).
c) Conclude that Borel’s sigma-field in \([0,1)\) coincides with both \(\mathcal{G}_{\mathcal{D}_1}\) and \(\mathcal{G}_{\mathcal{D}_2}\).


2 General sequences of events

Goals: Explore general sequences of events. For a sequence \((A_n)_{n\ge1}\) of events, define the limiting events \(\{A_n\text{ infinitely often}\}\), \(\{A_n\text{ finitely often}\}\), and \(\{A_n\text{ eventually}\}\). Study their properties, using both monotone approximations and the Borel-Cantelli lemma.


As argued in Section 1.1.3, monotone sequences of events play a special role in probability theory. The limit events along such sequences are easy to find and their probabilities can be computed thanks to the important continuity result, Lemma 1.12. Working with general sequences of events can be quite challenging, so, if possible, one should use monotone approximations to the events of interest. In this section we introduce and study several special events, which describe the limiting behaviour of arbitrary sequences of events. They are important in many areas of mathematics ranging from probability to analysis and number theory.

Further examples can be found in .

2.1 Two special limit events

It is instructive to describe convergence of events in terms of their indicators (\ref{eq:Bernoulli-variable-as-indicator-of-uniform}), \[\mathbf{1}_A(\omega)=\begin{cases} 1, & \text{if $\omega\in A$,}\\ 0, & \text{if $\omega\not\in A$.}\end{cases}\]

Example 2.1 If events \((A_n)_{n\ge1}\) form an increasing sequence, then for each fixed \(\omega\in\Omega\) the real sequence \((x_n)_{n\ge1}\) with \(x_n:=\mathbf{1}_{A_n}(\omega)\in\{0,1\}\) is non-decreasing, and therefore converges.6

In particular, there are only three possibilities:
a) \(x_n=0\) for all \(n\ge1\), that is, \(\omega\) is outside of each set \(A_n\) and their limit \(A:=\lim_nA_n=\cup_nA_n\);
b) \(x_n=1\) for all \(n\ge1\), that is, \(\omega\) belongs to each \(A_n\) and to the limit set \(A\);
c) there is \(k\ge1\) such that \(x_1=\ldots=x_{k-1}=x_k=0\) while \(x_{k+1}=x_{k+2}=\ldots=1\), that is, \(\omega\) belongs to all but finitely many \(A_n\) and to the limit set \(A\).
E.g., if \(\Omega=[0,1]\) and \(A_n=[\tfrac1n,1]\) for \(n\ge1\), then \(A_n\) increases to \(A=(0,1]\) as \(n\to\infty\). Taking \(\omega\in\{0,\tfrac12,1\}\subset[0,1]\), we observe all three possibilities mentioned above. \(\vartriangleleft\)


In the setting of increasing sequences \((A_n)_{n\ge1}\), for each \(\omega\in\Omega\) we know whether \(\omega\in A=\lim_nA_n\), or not. In particular, in the former case \(\omega\) must belong to all \(A_n\) with sufficiently large \(n\); such a statement is often abbreviated as7

\[\omega\in\bigl\{A_n\text{ eventually}\bigr\}.\] Of course, a similar description holds for a decreasing sequence \((A_n)_{n\ge1}\).

important

It is very tempting to use the same approach for general sequences \((A_n)_{n\ge1}\) of events, by declaring that a sequence of events \((A_n)_{n\ge1}\) in \(\Omega\) converges if and only if the indicator sequence \(\mathbf{1}_{A_n}(\omega)\in\{0,1\}\) converges for each \(\omega\in\Omega\)!

However, without any condition (such as monotonicity) there is no guarantee that the indicator values \(\mathbf{1}_{A_n}(\omega)\) would converge for all (or at least most) \(\omega\in\Omega\). Nevertheless, for each sequence \((A_n)_{n\ge1}\) there are two particular events, namely, \[\bigl\{A_n\text{ infinitely often}\bigr\} \qquad\text{ and }\qquad \bigl\{A_n\text{ eventually}\bigr\},\] which provide very useful information about the large-\(n\) behaviour of the sequence \((A_n)_{n\ge1}\).


We next describe both of them through countable operations along monotone sequences, so that the result is an event and can be assigned a probability value. First, we have \[\begin{aligned} \bigcap_{n\ge1}\bigcup_{k\ge n}A_k &=\bigl\{\omega\in\Omega:\text{ for each $n\ge1$ there is $k\ge n$ such that $\omega\in A_k$}\bigr\} \\ &=\bigl\{\omega\in\Omega:\omega\in A_n\text{ for infinitely many $n\ge1$}\bigr\}. \end{aligned}\] This relation is often abbreviated as \[\tag{2.1}\label{eq:infinitely-often-event-def} \bigl\{A_n\text{ i.o.}\bigr\}=\bigl\{A_n\text{ infinitely often}\bigr\}=\bigcap_{n\ge1}\bigcup_{k\ge n}A_k.\] Notice that for each \(n\ge1\) \[B_n:=\bigcup_{k\ge n}A_k=\bigl\{\omega\in\Omega:\omega\in A_k\text{ for at least one $k\ge n$}\bigr\}\] is an event satisfying \(B_n=A_n\cup B_{n+1}\supseteq B_{n+1}\), so that \((B_n)_{n\ge1}\) form a decreasing sequence of events. Therefore, \[\bigl\{A_n\text{ i.o.}\bigr\}=\bigcap_{n\ge1}B_n\] is a monotone limit of events, whose probability can be computed from \(\mathsf{P}(B_n)\) by continuity.

By taking complement in (\ref{eq:infinitely-often-event-def}) we get \[\tag{2.2}\label{eq:finitely-often-event-def} \bigl\{A_n\text{ f.o.}\bigr\}=\bigl\{A_n\text{ finitely often}\bigr\}=\bigcup_{n\ge1}\bigcap_{k\ge n}A_k^\mathsf{c},\] or, in the extended form, \[\bigl\{A_n\text{ f.o.}\bigr\}=\bigcup_{n\ge1}\bigcap_{k\ge n}A_k^\mathsf{c} =\bigl\{\omega\in\Omega:\text{ there is $n\ge1$ such that $\omega\not\in A_k$ for all $k\ge n$}\bigr\}.\]

Similarly to (\ref{eq:finitely-often-event-def}), given a sequence \((B_n)_{n\ge1}\) of events, one considers \[\tag{2.3}\label{eq:eventually-event-def} \bigl\{B_n\text{ ev.}\bigr\}=\bigl\{B_n\text{ eventually}\bigr\}=\bigcup_{n\ge1}\bigcap_{k\ge n}B_k =\bigl\{\omega\in\Omega:\text{ for some $n\ge1$, $\omega\in B_k$ for all $k\ge n$}\bigr\}.\] Rewriting the last description via indicator functions, we deduce that if \(B_n\) converges, then the event \(\{B_n\text{ ev.}\}\) contains all \(\omega\in\Omega\) belonging to the limit set \(B=\lim_nB_n\).

important

Warning: In some books one can find different notation for the events in (\ref{eq:infinitely-often-event-def}) and (\ref{eq:eventually-event-def}), namely \[\limsup_nA_n\equiv\bigl\{A_n\text{ i.o.}\bigr\}, \qquad \liminf_nA_n\equiv \bigl\{A_n\text{ ev.}\bigr\},\] but we will not use it here. See, however, Exercises  2.45- 2.47 (and get in touch, if interested).


Example 2.2 Let \(X\) be a positive random variable with \(\mathsf{P}(X<\infty)=1\). For \(n\ge1\), denote \(X_n=\frac1nX\) and, for \(\varepsilon>0\), let \(A_n(\varepsilon):=\bigl\{\omega:|X_n(\omega)|>\varepsilon\bigr\}\equiv\{|X|>n\varepsilon\}\). We obviously have \(A_n(\varepsilon)\supseteq A_{n+1}(\varepsilon)\) for fixed \(\varepsilon>0\) and all \(n\ge1\). Therefore, \[\bigcap_{n\ge1} A_n(\varepsilon)=\lim_{n\to\infty} A_n(\varepsilon)\equiv\bigl\{X=\infty\bigr\},\] which has probability zero by the assumption that \(\mathsf{P}(X<\infty)=1\). On the other hand, because \(A_n(\varepsilon)\) is a decreasing sequence, the event \[\bigl\{A_n(\varepsilon)\text{ i.o.}\bigr\}=\bigcap_{m\ge1}\bigcup_{n\ge m}A_n(\varepsilon) \equiv\bigcap_{m\ge1}A_m(\varepsilon)=\bigl\{X=\infty\bigr\}\] has probability zero. By taking complement, we conclude that \(\bigl\{A_n(\varepsilon)\text{ f.o.}\bigr\}\equiv\bigl\{X<\infty\bigr\}\) for each \(\varepsilon>0\). \(\vartriangleleft\)


Events as in (\ref{eq:finitely-often-event-def}) and (\ref{eq:eventually-event-def}) are well suited for studying limits of sequences of random variables, say \((X_n)_{n\ge1}\). Their usefulness is due to the following simple observation.

important

Let \((x_n)_{n\ge1}\) be a sequence of real numbers. Then \[\begin{aligned} \lim_{n\to\infty}x_n=x \quad\Longleftrightarrow\quad& \forall\varepsilon>0\text{ the condition $|x_n-x|<\varepsilon$ holds eventually } \\ \quad\Longleftrightarrow\quad& \forall\varepsilon>0\text{ the condition $|x_n-x|\ge\varepsilon$ holds finitely often.} \end{aligned}\]


By separately applying this description of convergence \(X_n(\omega)\to X(\omega)\) for individual \(\omega\in\Omega\), we can describe the convergence event as \[C:=\bigl\{\omega:X_n(\omega)\to X(\omega)\bigr\}\equiv\bigcap_{\varepsilon>0}\bigl\{\bigl|X_n(\omega)-X(\omega)\bigr|\ge\varepsilon\text{ f.o.}\bigr\} \equiv\bigcap_{\varepsilon>0}\bigl\{\bigl|X_n(\omega)-X(\omega)\bigr|<\varepsilon\text{ ev.}\bigr\}.\]

In the simple setting of Example 2.2, the convergence event can be described explicitly:

Remark 2.2.1

Recall that the events \(A(\varepsilon)\equiv\bigl\{A_n(\varepsilon)\text{ f.o.}\bigr\}\equiv\bigl\{X<\infty\bigr\}\) in Example 2.2 do not depend on \(\varepsilon\). As a result, the convergence event,8

\[C:=\bigl\{\omega:X_n(\omega)\to0\bigr\}\equiv\bigcap_{\varepsilon>0}\bigl\{A_n(\varepsilon)\text{ f.o.}\bigr\}\equiv\bigl\{X<\infty\bigr\},\] has probability one. For this reason we say that the random variables \(X_k\) converge to zero with probability one (or almost surely). Different modes of convergence of random variables will be discussed later this term. \(\vartriangleleft\)


2.2 Borel-Cantelli lemma

Let \((A_k)_{k\ge1}\) be an infinite sequence of events from some probability space \(\bigl(\Omega,\mathcal{F},\mathsf{P}\bigr)\). One is often interested in finding out how many of the events \(A_n\) occur. 9 Recall that the event “infinitely many of the events \(A_n\) occur” was defined in (\ref{eq:infinitely-often-event-def}) as \[\bigl\{A_n\text{ i.o.}\bigr\}\equiv\bigl\{A_n\text{ infinitely often}\bigr\} =\bigcap_{n\ge1}\bigcup_{k=n}^\infty A_k.\] The next result is very important for applications. Its proof uses the intrinsic monotonicity structure of the definition (\ref{eq:infinitely-often-event-def}).

Lemma 2.3 Let \(A=\cap_{n\ge1}\cup_{k=n}^\infty A_k\) be the event \(\{A_n\text{ i.o.}\}\). Then:
a) If \(\sum_k\mathsf{P}(A_k)<\infty\), then \(\mathsf{P}(A)=0\), ie., with probability one only finitely many of the \(A_k\) occur.
b) If \(\sum_k\mathsf{P}(A_k)=\infty\) and \(A_1\), \(A_2\), … are independent events, then \(\mathsf{P}(A)=1\).

\(\vartriangleleft\)


Remark The independence condition in part  b) above cannot simply be dropped. For example, let \(A_n\equiv E\) for all \(n\ge1\), where \(E\in\mathcal{F}\) satisfies \(0<\mathsf{P}(E)<1\) (so that the events \(A_k\) are not independent). Then \(\sum_k\mathsf{P}(A_k)=\infty\), yet \(A=E\) and \(\mathsf{P}(A)=\mathsf{P}(E)\neq1\). \(\vartriangleleft\)


Remark An even more interesting counterexample to part b) without the independence property can be constructed as follows ( check this!):
\(\vartriangleleft\)


Example 2.4 By the second Borel-Cantelli lemma, Lemma 2.3 b), a monkey hitting keys at random on a typewriter keyboard for an infinite amount of time will almost surely (ie., with probability one) type any particular chosen text, such as the complete works of William Shakespeare (and, in fact, infinitely many copies of the chosen text).



Idea of the argument. Suppose that the typewriter has \(50\) keys, and the word to be typed is ‘banana’. The chance that the first letter typed is b is \(1/50\), as is the chance that the second letter is a, and so on. These events are independent, so the chance of the first six letters matching ‘banana’ is \(1/50^6\). For the same reason the chance that the next six letters match ‘banana’ is also \(1/50^6\), and so on.

Probability of seeing the word ‘banana’ in the first \(n\) blocks


Now, the chance of not typing ‘banana’ in a given block of six letters is \(1-1/50^6\). Because each block is typed independently, the chance of not typing ‘banana’ in any of the first \(n\) blocks of six letters 10 is \(p=(1-1/50^6)^n\), which tends to zero as \(n\to\infty\). If we were to count occurrences of ‘banana’ that crossed blocks, \(p\) would approach zero even more quickly. 11 Finally, once the first copy of the word ‘banana’ appears, the process starts afresh independently of the past, so that the probability of obtaining a second copy of the word ‘banana’ within the same number of blocks is the same as for the first copy, etc.; the result now follows from Lemma 2.3.
Of course, the same argument applies if the monkey were typing any other string of characters of finite length, eg., your favourite novel. 12 \(\vartriangleleft\)
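For readers who like to experiment, the following short Python sketch (not part of the formal argument) checks the block computation numerically; the reduced two-letter alphabet, the target word ‘abba’, and the random seed are illustrative choices that make hits frequent enough to observe directly.

```python
import numpy as np

rng = np.random.default_rng(0)          # fixed seed, illustrative only

# Exact block computation from the text: chance of NOT seeing 'banana' in the
# first n disjoint blocks of six letters on a 50-key typewriter.
p_block = 1 / 50**6
for n in (10**9, 10**11, 10**13):
    print(n, (1 - p_block) ** n)        # decays to zero as n grows

# Monte Carlo on a reduced alphabet, so that hits are actually observed:
alphabet, word = list("ab"), "abba"
L = len(word)
n_blocks = 100_000
letters = rng.choice(alphabet, size=n_blocks * L)
blocks = ("".join(letters[i * L:(i + 1) * L]) for i in range(n_blocks))
hits = sum(b == word for b in blocks)
print("hit frequency:", hits / n_blocks, "theoretical:", (1 / len(alphabet)) ** L)
```

In line with Lemma 2.3 b), the observed hit frequency stays close to the per-block probability, so the number of hits grows without bound as more blocks are typed.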


Remark By using an appropriate monotone approximation, one can deduce the result as in Example 1.13, without explicitly using the Borel-Cantelli lemma. Moreover, the same argument extends to situations where the probability \(p_n\) of typing ‘banana’ in the \(n\)th block of six letters varies with \(n\), but remains uniformly positive, ie., \(p_n\ge\delta>0\) for all \(n\ge1\). The true power of the lemma is seen in situations where \(p_n\to0\) slowly enough to have \(\sum_np_n=\infty\) (provided the events in different blocks are independent). \(\vartriangleleft\)


Proof of Lemma 2.3. a) For every \(n\ge1\), let \(B_n:=\cup_{k=n}^\infty A_k\) be the event that at least one of \(A_k\) with \(k\ge n\) occurs. As \(A\subset B_n\) for all \(n\ge1\), we have \[0\le\mathsf{P}(A)\le\mathsf{P}(B_n)\le\sum_{k=n}^\infty \mathsf{P}(A_k)\to 0\] as \(n\to\infty\), whenever \(\sum_k\mathsf{P}(A_k)<\infty\). Consequently, \(\mathsf{P}(A)=0\).
b) The event \(A^\mathsf{c}=\bigl\{A_n\text{ occur finitely often}\bigr\}\) is related to the sequence \[B_n^\mathsf{c}=\bigcap_{k=n}^\infty A_k^\mathsf{c}\equiv\bigl\{\text{ none of $A_k$, $k\ge n$, occurs}\bigr\} \qquad\text{ via }\qquad A^\mathsf{c}=\bigcup_n\bigcap_{k=n}^\infty A_k^\mathsf{c}=\bigcup_nB_n^\mathsf{c},\] so for \(\mathsf{P}(A^\mathsf{c})=0\) it is sufficient to show that \(\mathsf{P}(B_n^\mathsf{c})=0\) for all \(n\ge1\). By independence and the elementary inequality \(1-x\le e^{-x}\) with \(x\ge0\), we get \[\mathsf{P}\Bigl(\bigcap_{k=n}^mA_k^\mathsf{c}\Bigr)=\prod_{k=n}^m\mathsf{P}\bigl(A_k^\mathsf{c}\bigr)=\prod_{k=n}^m\Bigl(1-\mathsf{P}\bigl(A_k\bigr)\Bigr)\le\exp\Bigl\{-\sum_{k=n}^m\mathsf{P}(A_k)\Bigr\}\] so that \[\mathsf{P}\bigl(B_n^\mathsf{c}\bigr)=\lim_{m\to\infty}\mathsf{P}\Bigl(\bigcap_{k=n}^mA_k^\mathsf{c}\Bigr)\le\exp\Bigl\{-\sum_{k=n}^\infty\mathsf{P}(A_k)\Bigr\}=0,\] as the sum diverges. Hence, \(0\le\mathsf{P}(A^\mathsf{c})\le\sum_{n\ge1}\mathsf{P}(B_n^\mathsf{c})=0\) by sub-additivity, implying that \(\mathsf{P}(A)=1\). \(\blacksquare\)
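The dichotomy of Lemma 2.3 is easy to observe numerically. The sketch below (an illustration, not part of the notes) generates independent events \(A_n\) with \(\mathsf{P}(A_n)=n^{-2}\) (summable series) and with \(\mathsf{P}(A_n)=n^{-1}\) (divergent series) and counts how many of the first \(N\) occur.

```python
import numpy as np

rng = np.random.default_rng(1)          # fixed seed, illustrative only
N = 100_000
n = np.arange(1, N + 1)
U = rng.random(N)                       # independent uniforms driving the events A_n

occ_summable = np.sum(U < 1 / n**2)     # P(A_n) = 1/n^2: sum converges, finitely many occur
occ_divergent = np.sum(U < 1 / n)       # P(A_n) = 1/n:   sum diverges, infinitely many occur

print("P(A_n)=1/n^2: occurrences among first", N, "events:", occ_summable)   # small, stops growing
print("P(A_n)=1/n  : occurrences among first", N, "events:", occ_divergent)  # grows like log N
```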


Example 2.5 A standard six-sided die is tossed repeatedly. Let \(N_k\) denote the total number of tosses on which face \(k\) is observed. Assuming that the individual outcomes are independent, show that \[\mathsf{P}(N_1=\infty)=\mathsf{P}(N_2=\infty)=\mathsf{P}(N_1=\infty,N_2=\infty)=1.\]

Solution. Equalities \(\mathsf{P}(N_1=\infty)=\mathsf{P}(N_2=\infty)=1\) can be derived as in Example 1.13, so that the intersection event \(\{N_1=\infty,N_2=\infty\}\) has probability one.
Alternatively, we derive the first equality from the Borel-Cantelli lemma. To this end, fix \(k\in\{1,2,\dots,6\}\) and denote \(A_n^k=\left\{\text{$n$th toss shows $k$}\right\}\). For different \(n\), the events \(A_n^k\) are independent and have the same probability \(1/6\). As \(\sum_n\mathsf{P}(A_n^k)=\infty\), the Borel-Cantelli lemma implies that the event \(\bigl\{N_k=\infty\bigr\}\equiv\left\{A_n^k\text{ infinitely often}\right\}\) has probability one. The remaining claims now follow as indicated above. \(\vartriangleleft\)


Example 2.6 A coin showing ‘heads’ with probability \(p\in(0,1)\) is tossed repeatedly. With \(X_n\) denoting the result of the \(n\)th toss, let \(C_n=\{X_n=\mathsf{T},X_{n-1}=\mathsf{H}\}\). Show that \(\mathsf{P}(C_n\text{ i.o.})=1\).
Solution. We have \({\{\text{$C_{2n}$ i.o.}\}}\subset\left\{\text{$C_{n}$ i.o.}\right\}\), where, writing \(q=1-p\), \(\mathsf{P}(C_{2n})\equiv pq>0\) and the events \(C_{2n}\) are independent (they involve disjoint pairs of tosses). The result follows from Lemma 2.3 b) (or via monotone approximation). \(\vartriangleleft\)


The Borel-Cantelli lemma is often used to describe long-term behaviour of sequences of random variables.

Example 2.7 Let \((X_k)_{k\ge1}\) be independent random variables with common exponential distribution of mean \(1/\lambda\), ie., \(\mathsf{P}\bigl(X_1>x\bigr)=e^{-\lambda x}\) for all \(x\ge0\). One can show that \(X_n\) grows like \(\frac1\lambda\log n\), more precisely, that 13 \[\mathsf{P}\bigl(\limsup_{n\to\infty}\tfrac {X_n}{\log n}=\tfrac1\lambda\bigr)=1.\]

Solution. For \(\varepsilon>0\), denote \[A_n^\varepsilon:=\Bigl\{\omega:X_n(\omega)>\tfrac{1+\varepsilon}\lambda\log n\Bigr\},\qquad B_n^\varepsilon:=\Bigl\{\omega:X_n(\omega)>\tfrac{1-\varepsilon}\lambda\log n\Bigr\}.\] We clearly have \(\mathsf{P}\bigl(A_n^\varepsilon\bigr)=n^{-(1+\varepsilon)}\) and \(\mathsf{P}\bigl(B_n^\varepsilon\bigr)=n^{-(1-\varepsilon)}\). As \(\sum_n\mathsf{P}(A_n^\varepsilon)<\infty\), by Lemma 2.3 a) the event \(\{\text{$A_n^\varepsilon$ infinitely often}\}\) has probability zero. Similarly, \(B_n^\varepsilon\) are independent and \(\sum_n\mathsf{P}(B_n^\varepsilon)=\infty\); thus, by Lemma 2.3 b), the event \(\{\text{$B_n^\varepsilon$ infinitely often}\}\) has probability one.

As the events \(B_n^\varepsilon\) increase with \(\varepsilon\), so do the events \(\{\text{$B_n^\varepsilon$ infinitely often}\}\); hence \(B^*:=\cap_{\varepsilon>0}\{\text{$B_n^\varepsilon$ infinitely often}\}=\cap_{m\ge1}\{\text{$B_n^{1/m}$ infinitely often}\}\) is a countable intersection of probability-one events and thus has probability one. Similarly, the events \(A_n^\varepsilon\) decrease with \(\varepsilon\), so the events \(\{\text{$A_n^\varepsilon$ finitely often}\}\) increase with \(\varepsilon\), and the event \(A^*:=\cap_{\varepsilon>0}\{\text{$A_n^\varepsilon$ finitely often}\}=\cap_{m\ge1}\{\text{$A_n^{1/m}$ finitely often}\}\) also has probability one. By the characterization of \(\limsup\) (see, e.g., Lemma A.3), we finally deduce that \[\Bigl\{\limsup_{n\to\infty}\tfrac {X_n}{\log n}=\tfrac1\lambda\Bigr\} \equiv\bigcap_{\varepsilon>0}\bigl(\{\text{$A_n^\varepsilon$ finitely often}\}\cap\{\text{$B_n^\varepsilon$ infinitely often}\}\bigr) =A^*\cap B^*,\] which is an event of probability one. \(\vartriangleleft\)
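A quick numerical illustration of Example 2.7 (with the arbitrary choices \(\lambda=1\), \(\varepsilon=0.1\) and a fixed seed): exceedances of the \((1+\varepsilon)\) threshold remain rare, exceedances of the \((1-\varepsilon)\) threshold keep accumulating, and the largest values of \(X_n/\log n\) over a late stretch of indices sit just below \(1/\lambda\).

```python
import numpy as np

rng = np.random.default_rng(2)                      # fixed seed, illustrative only
lam, N, eps = 1.0, 200_000, 0.1
n = np.arange(2, N + 1)                             # start at n=2 so that log n > 0
X = rng.exponential(scale=1 / lam, size=n.size)     # X_n ~ Exp(lam), independent
ratio = lam * X / np.log(n)                         # X_n divided by (1/lam) log n

print("n with X_n > (1+eps)/lam log n:", np.sum(ratio > 1 + eps))   # few: A_n^eps occur finitely often
print("n with X_n > (1-eps)/lam log n:", np.sum(ratio > 1 - eps))   # many, grows with N: B_n^eps i.o.
print("max of X_n/((1/lam) log n) over n >= N/2:", ratio[n >= N // 2].max())   # close to 1
```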


Running maximum (blue) of a sample (black) of \(\mathsf{Exp}(1)\) sequence vs. \((1+\varepsilon)\log{n}\) and \((1-\varepsilon)\log{n}\) profiles (red and green respectively)

Remark 2.7.1 A slightly more general version of the argument in Example 2.7 allows one to control the limiting behaviour of records:
Let \((X_k)_{k\ge1}\) be i.i.d. exponential r.v. with distribution \(\mathsf{P}(X_k>x)=e^{-x}\), and let \(M_n:=\max_{1\le k\le n}X_k\). Then \(\mathsf{P}(\tfrac{M_n}{\log n}\to1)=1\), ie., the normalized maximum \(\tfrac{M_n}{\log n}\) converges to \(1\) almost surely (as \(n\to\infty\)), see Figure  4. \(\vartriangleleft\)


Example 2.8 Let variables \((X_n)_{n\ge1}\) be i.i.d. with \(X_1\sim\mathcal{U}[0,1]\). For \(\alpha>0\), we have \(\mathsf{P}\bigl(X_n>1-n^{-\alpha}\bigr)=n^{-\alpha}\), so that \(\mathsf{P}\bigl(X_n>1-n^{-\alpha}\text{ i.o. }\bigr)=1\) iff \(\alpha\le1\).

A similar analysis shows that \[\mathsf{P}\Bigl(X_n>1-\frac1{n(\log n)^{\beta}}\text{ i.o.}\Bigr)=\begin{cases}1,&\quad \beta\le1,\\0,&\quad\beta>1.\end{cases}\] \(\vartriangleleft\)


Lemma 2.3 is one of the main methods of proving almost sure convergence:

Example 2.9 If \((X_k)_{k\ge1}\) is a sequence of random variables such that for every \(\varepsilon>0\) the event \[A(\varepsilon)\equiv\bigl\{|X_k|>\varepsilon\text{ finitely often}\bigr\}\] has probability one, then \(X_k\) is said to converge to zero with probability one, recall Remark 2.2.1.
As a simple illustration, let \(X_k=\frac1kX\), where a variable \(X\ge0\) has finite mean, \(\mathsf{E} X<\infty\). One then can show that, for each \(\varepsilon>0\), \[\sum_{k\ge1}\mathsf{P}(|X_k|>\varepsilon)=\sum_{k\ge1}\mathsf{P}(X>k\varepsilon)<\infty\] (indeed, \(\sum_{k\ge1}\mathbf{1}_{X>k\varepsilon}\le X/\varepsilon\), so that taking expectations gives \(\sum_{k\ge1}\mathsf{P}(X>k\varepsilon)\le\mathsf{E}(X)/\varepsilon<\infty\)), and thus by Lemma 2.3 the event \(A(\varepsilon)\equiv\{|X_k|>\varepsilon\text{ f.o.}\}\) has probability one, for all such \(\varepsilon\). By the discussion in Remark 2.2.1 and monotonicity of \(A(\varepsilon)\), the convergence event, \(C=\cap_{\varepsilon>0} A(\varepsilon)\), has probability one.
A more general result in the setting of almost sure convergence will be discussed later this term. \(\vartriangleleft\)


The Borel-Cantelli lemma is due to Émile Borel (1871-1956) and Francesco Paolo Cantelli (1875-1966), who discovered the result in the early twentieth century. While originating in measure theory, this simple but powerful result has many interesting applications in probability theory, analysis, and number theory, to name just a few. In 1909 É. Borel used this observation to prove that almost all real numbers are normal.

(left to right) Chebyshev, Hadamard, de la Vallée Poussin, Cantelli, Cramér

In 1936 the Swedish mathematician Harald Cramér (1893-1985) conjectured that if each natural number is independently declared a ‘prime’, then ‘properly formulated’ questions about classical primes should also hold in the random setting (and, in fact, with probability one!). More precisely, the resulting Cramér random model declares that the events \(\{\text{$k$ is prime}\}_{k\in\mathbb{N}}\) are independent with indicators \(\pi_1\equiv0\), \(\pi_2\equiv1\), and \((\pi_k)_{k\ge3}\) being a sequence of independent \(\mathsf{Ber}(1/\log{k})\) Bernoulli random variables. The parameters here are chosen so that the limiting behaviour of random primes is, on average, consistent with the Prime Number Theorem. The latter states that the number of primes not exceeding \(x>0\) is asymptotically equivalent to \(x/\log{x}\) as \(x\to\infty\). The proof of this celebrated result is based on contributions of many mathematicians, including Pafnuty Chebyshev (1821-1894), Jacques Hadamard (1865-1963), and Charles Jean de la Vallée Poussin (1866-1962).

A great advantage of the Cramér random model is that many properties of numbers are much easier to establish in the random setting.

checklist

By the end of this section you should be able to:


Exercise 2.30

A coin showing ‘heads’ with probability \(p\) is tossed repeatedly. Assuming that individual outcomes are independent, find probabilities of the following events:
a) \(A_k=\{\text{no `heads' after the first $k$ tosses}\}\).

b) \(A=\{\text{finitely many `heads' observed}\}\).

c) \(B=\{\text{infinitely many `heads' observed}\}\).
d) \(C=\{\text{infinitely many `heads' and infinitely many `tails' observed}\}\).
e) \(N=\{\text{no two consecutive results are the same}\}\).


Exercise 2.31

A coin showing ‘heads’ with probability \(p>0\) is tossed repeatedly. With \(X_n\) denoting the result of the \(n\)th toss, consider the events \(C_n=\{X_n=\mathsf{H},X_{n-1}=\mathsf{H}\}\). Show that \(\mathsf{P}(C_n\text{ i.o.})=1\).


Exercise 2.32

A coin showing ‘A’ with probability \(p\) and ‘B’ with probability \(1-p\) is tossed repeatedly. Assuming independence of individual outcomes, find the probability of the event \(N=\{\text{pattern `ABBA' never occurs}\}\). What is the probability of the event \(I=\{\text{pattern `ABBA' occurs infinitely many times}\}\)?


Exercise 2.33

For a sequence \((A_k)_{k\ge1}\) of events, let \(N(\omega)=\sum_{k=1}^\infty\mathbf{1}_{A_k}(\omega)\) be the number of \(A_k\)’s which occur. Show that if \(\sum_{k=1}^\infty\mathsf{P}(A_k)<\infty\), then \(\mathsf{P}(N<\infty)=1\), ie., almost surely a finite number of events \(A_k\) occur.


Exercise 2.34

A coin showing ’heads’ with probability \(p\in(0,1)\) is tossed repeatedly. Assuming independence of individual outcomes, show that for the \(n\)th result \(X_n\), we have \(\mathsf{P}(X_n = H\text{ i.o.})=1\) and \(\mathsf{P}(X_n=T\text{ i.o.})=1\).


Exercise 2.35

A standard six-sided die is tossed repeatedly. Let \(N_1\) denote the total number of ones observed. Assuming that the individual outcomes are independent, show that \(\mathsf{P}(N_1=\infty)=1\),
a) by using a suitable monotone approximation for the event \(\{N_1=\infty\}\);
b) by using the Borel-Cantelli lemma.


Exercise 2.36

A collection of coins are flipped consecutively, with individual outcomes assumed independent. Let \(\mathsf{P}(X_n\text{ is a `head'})=p_n\), where \(X_n\) denotes the \(n\)th result. Let \(F\) be the event \(\{\text{finitely many `heads' observed}\}\). Show carefully that \(\mathsf{P}(F)=1\) if and only if \(\sum_np_n<\infty\).


Exercise 2.37

Let \(C_k\), \(k\ge1\), be consecutive results of a coin tossing experiment, in which ‘heads’ show with probability \(p\in(0,1)\). We say that the \(n\)th result starts a run of ‘heads’ of length \(l_n=k\), if \(C_n=C_{n+1}=\dots=C_{n+k-1}=H\) and \(C_{n+k}=T\) (so that \(l_n=0\) means that \(n\)th result is ‘tails’). For fixed \(k>0\), consider the events \(R_n=\bigl\{l_n=k\bigr\}\). Show that \(\mathsf{P}(R_n)=p^k(1-p)\). Deduce that \(\mathsf{P}\bigl(\{\text{infinitely many $R_n$ occur}\}\bigr)=1\).


Exercise 2.38

In the setup of Exercise 2.37, consider the events \(Q_n=\bigl\{l_n=n\bigr\}\). Show that with probability one only finitely many events \(Q_n\) take place, equivalently, \(\mathsf{P}\bigl(\{\text{infinitely many $Q_n$ occur}\}\bigr)=0\).


Exercise 2.39

An urn contains one white and one black ball. At each step a ball is chosen uniformly at random and then returned back to the urn together with another black ball (so that after \(n\) steps the urn contains one white and \(n+1\) black balls).
a) Find the probability of the event \(B_n=\bigl\{\text{first $n$ chosen balls are black}\bigr\}\) and deduce the probability of the event \(B=\cap_nB_n\) that the white ball is never chosen.
b) Find the probability that the white ball is chosen infinitely many times.


Exercise 2.40

A collection of coins are flipped consecutively, with \(n\)th result \(X_n\) satisfying \(\mathsf{P}(X_n\text{ is a `head'})=\frac1{\sqrt n}\). Let \(N\) be the event \(\{\text{infinitely many `heads' and infinitely many `tails' observed}\}\). Assuming that individual outcomes are independent, show that \(\mathsf{P}(N)=1\).


Exercise 2.41

Let \((X_n)_{n\ge1}\) be independent r.v.’s having common distribution \(\mathsf{Geom}(p)\) with success probability \(p\), \(0<p=1-q<1\), that is, \(\mathsf{P}\bigl(X_1>k\bigr)=q^k\) for all integer \(k\ge0\). For fixed \(\varepsilon>0\), consider the events \[A_n(\varepsilon):=\bigl\{X_n>\tfrac{1+\varepsilon}{\log(1/q)}\log n\bigr\},\qquad B_n(\varepsilon):=\bigl\{X_n>\tfrac{1-\varepsilon}{\log(1/q)}\log n\bigr\}.\] Show that for every \(\varepsilon>0\) we have \(\mathsf{P}\bigl(A_n(\varepsilon) \text{ i.o.}\bigr)=0\) and \(\mathsf{P}\bigl(B_n(\varepsilon) \text{ i.o.}\bigr)=1\). What does it tell you about the large-\(n\) behaviour of the sequence \(X_n\)?


Exercise 2.42

In the setup of Example 2.7, show that \(\mathsf{P}\bigl(\limsup\limits_{n\to\infty}\tfrac{X_n-\frac1\lambda\log n}{\log\log n}=\frac1\lambda\bigr)=1\).


Exercise 2.43

Let \((X_n)_{n\ge1}\) be independent standard normal random variables. Using the fact that for \(\xi\sim\mathcal{N}(0,1)\) we have 14 \(\mathsf{P}(\xi>x)\sim\frac{1}{x\sqrt{2\pi}}\exp\bigl\{-x^2/2\bigr\}\) as \(x\to\infty\), show \[\mathsf{P}\bigl(\limsup_{n\to\infty}\tfrac{X_n}{\sqrt{2\log{n}}}=1\bigr)=1,\qquad \mathsf{P}\bigl(\liminf_{n\to\infty}\tfrac{X_n}{\sqrt{2\log{n}}}=-1\bigr)=1.\]


Exercise 2.44

Let \(X_k\), \(k\ge1\), be independent random variables such that for some \(\alpha\in(0,2)\) and all \(x\ge1\) we have \(\mathsf{P}(X_k>x)=\frac1{x^{\alpha}}\). Show that \(\mathsf{P}\bigl(\limsup\limits_{n\to\infty}\frac{\log{X_n}}{\log{n}}=\frac1\alpha\bigr)=1\).


advanced

Optional exercises

Exercise 2.45

Let \((A_n)_{n\ge1}\) be events and let \(A=\limsup_nA_n=\bigcap_n\bigcup_{k=n}^\infty A_k\equiv\{\text{$A_n$ infinitely often}\}\). Show that for every \(\omega\in\Omega\) we have \(\mathbf{1}_{\limsup_nA_n}(\omega)=\limsup_n\mathbf{1}_{A_n}(\omega)\).


Exercise 2.46

Let \(A_1\), \(A_2\), … be events and let \(A=\liminf_nA_n=\bigcup_n\bigcap_{k=n}^\infty A_k\) be the event \[\{\text{all but finitely many of $A_n$'s occur}\}\equiv\{\text{$A^\mathsf{c}_n$ finitely often}\}.\] Show that for every \(\omega\in\Omega\) we have \(\mathbf{1}_{\liminf_nA_n}(\omega)=\liminf_n\mathbf{1}_{A_n}(\omega)\).


Exercise 2.47

For every sequence \(A_1\), \(A_2\), … of events, let \(\limsup_nA_n\) and \(\liminf_nA_n\) be as defined in Exercises  2.45 and  2.46. Using the corresponding definitions, show that \(\liminf_nA_n\subseteq\limsup_nA_n\). If both sides in the last inclusion coincide, \(\liminf_nA_n=\limsup_nA_n\), the resulting set is called the limit of the sequence \((A_n)_{n\ge1}\) and is denoted \(\lim_nA_n\).


Exercise 2.48

Let \((X_n)_{n\ge1}\) be independent random variables, and put \(M_n=\max\{X_1,\dots,X_n\}\) for \(n\ge1\). Let, further, \((b_n)_{n\ge1}\) be a non-decreasing sequence of positive real numbers. By using Lemma A.4 or otherwise, show that if \(b_n\to\infty\) as \(n\to\infty\), then \(\limsup\limits_{n\to\infty}(X_n/b_n)=\limsup\limits_{n\to\infty}(M_n/b_n)\).


Exercise 2.49

Let the sequence \((X_n)_{n\ge1}\) be as in Exercise 2.44. Show that \(\mathsf{P}\bigl(\lim\limits_{n\to\infty}\frac{\log{M_n}}{\log{n}}=\frac1\alpha\bigr)=1\). Notice that for \(\alpha\in(0,1)\) typical \(M_n=\max(X_1,\dots,X_n)\) grows as \(n^{1/\alpha}\), ie., faster than linearly!



3 Convergence of random variables

Goals: Introduce some of the main modes of convergence of random variables: convergence in probability, convergence in \(L^r\), and almost sure convergence. Explore some of their properties. Understand the difference between the weak and strong laws of large numbers.


In probability theory one uses various modes of convergence of random variables, many of which are crucial for applications. In this section we consider some of the most important of them: convergence in \(L^r\), convergence in probability and convergence with probability one (a.k.a.  almost sure convergence). Further examples can be found in .

3.1 Introduction

Getting comfortable with convergence of random variables in a probability space \((\Omega,\mathcal{F},\mathsf{P})\) can take a while, but you might find the following interpretation helpful.

As mentioned previously, a random variable, say \(X\), is a measurable map \(X:\Omega\to\mathbb{R}\). We can thus think of \(X(\omega)\) as the opinion expressed by the individual \(\omega\in\Omega\). A sequence \((X_n)_{n\ge1}\) of random variables is then a database where, for each fixed \(n\in\mathbb{N}\), the values \(X_n(\omega)\), \(\omega\in\Omega\), record the individual answers to a single question of, e.g., a census conducted at time \(n\). Under the assumption that all individuals \(\omega\in\Omega\) are immortal, the question of convergence \(X_n\to X\) as \(n\to\infty\) then boils down to finding out to what degree the profile \(\bigl(X_n(\omega)\bigr)_{\omega\in\Omega}\) at large time \(n\) is similar to the limiting profile \((X(\omega))_{\omega\in\Omega}\).

One can describe this similarity either at the level of individuals \(\omega\) or at the level of the whole population \(\Omega\). In the first case, one fixes \(\omega\in\Omega\) and studies the large-\(n\) behaviour of the sequence \((X_n(\omega))_{n\ge1}\) of real numbers. In particular, one determines whether the convergence \(X_n(\omega)\to X(\omega)\) takes place for the fixed individual \(\omega\in\Omega\). If the set 15 \[\bigl\{\omega\in\Omega: X_n(\omega)\to X(\omega)\text{ as }n\to\infty\bigr\}\] has probability one, we say that \(X_n\) converges to \(X\) with probability one (or converges almost surely). The individual approach is very important in computer science. 16

Alternatively, one can prefer the population level view, and for each fixed \(n\in\mathbb{N}\) quantify to what extent the profile \((X_n(\omega))_{\omega\in\Omega}\) at time \(n\) differs from the limiting profile \((X(\omega))_{\omega\in\Omega}\). Two popular choices are as follows.

One can fix \(\varepsilon>0\) and \(n\in\mathbb{N}\), and evaluate the probability \[\mathsf{P}\bigl(|X_n-X|>\varepsilon\bigr)\equiv\mathsf{P}\bigl(\bigl\{\omega\in\Omega:\bigl|X_n(\omega)-X(\omega)\bigr|>\varepsilon\bigr\}\bigr).\] If, for each fixed \(\varepsilon>0\), this probability vanishes asymptotically, as \(n\to\infty\), one speaks of convergence in probability.

Another popular choice is, for fixed \(r>0\), to describe the discrepancy between \(X_n\) and \(X\) through the value of the expectation \[\mathsf{E}\bigl(|X_n-X|^r\bigr)\equiv\mathsf{E}\bigl(|X_n(\omega)-X(\omega)|^r\bigr).\] If, for some \(r>0\), this expectation goes to zero as \(n\to\infty\), one says that \(X_n\) converges to \(X\) in \(L^r\). The case of \(r\ge1\) is of special interest as the corresponding space of random variables (satisfying, in particular, \(\mathsf{E}(|X|^r)<\infty\)) has additional useful properties.

3.2 Main modes of convergence

Let \((\Omega,\mathcal{F}, \mathsf{P})\) be a probability space and let \((X_n)_{n \ge1}\) be a sequence of random variables. What does it mean to say that \(X_n\) converges to a random variable \(X\)? There are many possibilities, with the simplest one being as follows:

Definition 3.1 We say that a sequence of random variables \((X_n)_{n \geq 1}\) converges pointwise or surely to \(X\) if \(X_n(\omega) \to X(\omega)\) for all \(\omega \in \Omega\) as \(n \to \infty\).


Remark 3.1.1 The pointwise convergence is a very natural property. In particular, it is preserved 17 by the action of continuous functions: if \(X_n\to X\) surely as \(n\to\infty\), and \(f:\mathbb{R}\to\mathbb{R}\) is continuous, then the sequence \(\bigl(f(X_n)\bigr)_{n\geq1}\) of random variables converges to \(f(X)\) surely, cf. Lemma A.5.

However, pointwise convergence is a very restrictive condition and is often too strong to do anything useful in probability, even though this is the usual definition you might be familiar with from analysis courses. \(\vartriangleleft\)


We next look at some weaker forms of convergence.

3.2.1 Convergence in \(L^r\)

Definition 3.2 For fixed \(r>0\), let \(X\) and \((X_n)_{n\ge1}\) be random variables such that \(\mathsf{E}(|X|^r)\) and all \(\mathsf{E}(|X_n|^r)\) are finite. Then the sequence \(X_n\) converges to \(X\) in \(L^r\) as \(n\to\infty\) (written \(X_n\stackrel{\mathsf{L}^r}{\to} X\)), if \(\mathsf{E}\bigl(|X_n-X|^r\bigr)\to0\) as \(n\to\infty\).


Remark 3.2.1 Such convergence is also known as convergence in the \(r\)-th mean. Two special cases are of particular importance.

When \(r=1\), the sequence \((X_n)_{n\ge1}\) converges to \(X\) in mean or in \(L^1\).

Similarly, when \(r=2\), the sequence \((X_n)_{n\ge1}\) converges to \(X\) in mean square or in \(L^2\). \(\vartriangleleft\)


Example 3.3 Let \(\bigl(X_n\bigr)_{n\ge1}\) be a sequence of random variables such that for some real numbers \((a_n)_{n\ge1}\), we have \[\tag{3.1}\label{eq:01-Lp-conv} \mathsf{P}\bigl(X_n=a_n\bigr)=p_n,\qquad \mathsf{P}\bigl(X_n=0\bigr)=1-p_n.\] Then \(X_n\stackrel{\mathsf{L}^r}{\to}0\) if and only if \(\mathsf{E}\bigl(|X_n|^r\bigr)\equiv|a_n|^rp_n\to0\) as \(n\to\infty\). \(\vartriangleleft\)


3.2.2 Convergence in probability

Definition 3.4 We say that a sequence \((X_n)_{n\ge1}\) of random variables converges to a random variable \(X\) in probability (write \(X_n\stackrel{\mathsf{P}}{\to} X\)) as \(n\to\infty\), if for every fixed \(\varepsilon>0\) \[\mathsf{P}\bigl(|X_n-X|\ge\varepsilon\bigr)\to0\qquad\text{ as }n\to\infty.\]


Remark 3.4.1 Convergence in probability is preserved under the action of continuous functions: If \(X_n\stackrel{\mathsf{P}}{\to} X\) as \(n\to\infty\) and \(f:\mathbb{R}\to\mathbb{R}\) is continuous, then \(f(X_n)\stackrel{\mathsf{P}}{\to} f(X)\) as \(n\to\infty\), cf. Lemma A.5. \(\vartriangleleft\)


Example 3.5 Let random variables \(\bigl(X_n\bigr)_{n\ge1}\) be as in (\ref{eq:01-Lp-conv}). Then for every \(\varepsilon>0\) we have \[\mathsf{P}\bigl(|X_n|\ge\varepsilon\bigr)\le\mathsf{P}\bigl(X_n\neq0)=p_n,\] so that \(X_n\stackrel{\mathsf{P}}{\to}0\) if \(p_n\to0\) as \(n\to\infty\). \(\vartriangleleft\)


Exercise 3.50

Let random variables \((X_n)_{n\ge1}\) be as in (\ref{eq:01-Lp-conv}), \(\mathsf{P}\bigl(X_n=a_n\bigr)=p_n=1-\mathsf{P}\bigl(X_n=0\bigr)\), where \(a_n\to0\) as \(n\to\infty\). Is it true that \(X_n\stackrel{\mathsf{L}^r}{\to}0\) as \(n\to\infty\), for some \(r>0\)? Is it true that \(X_n\stackrel{\mathsf{P}}{\to}0\) as \(n\to\infty\)?


Lemma 3.7 below links convergence in \(L^r\) to that in probability. We first recall the following useful result.

Lemma 3.6 Let \(g:[0,\infty)\to[0,\infty)\) be an increasing function. Then for each random variable \(X\ge0\) and each fixed constant \(a>0\) with \(g(a)>0\) we have \[\tag{3.2}\label{eq:general-Markov-inequality} \mathsf{P}(X\ge a)\le \frac{\mathsf{E}\bigl(g(X)\bigr)}{g(a)}.\] \(\vartriangleleft\)


Proof Because \(g\) increases, for each fixed \(a>0\) and all \(x\ge0\) we have \(g(x)\ge g(x)\mathbf{1}_{x\ge a}\ge g(a)\mathbf{1}_{x\ge a}\ge0\). By applying this inequality to the random variable \(X\ge0\) and taking expectations, we obtain \[\mathsf{E}\bigl(g(X)\bigr)\ge g(a)\mathsf{E}\bigl(\mathbf{1}_{X\ge a}\bigr)\equiv g(a)\mathsf{P}(X\ge a).\] As \(g(a)>0\), this is equivalent to (\ref{eq:general-Markov-inequality}). \(\blacksquare\)


Remark 3.6.1 Of course, (\ref{eq:general-Markov-inequality}) is trivial unless \(\mathsf{E}\bigl(g(X)\bigr)<\infty\).

When \(g(x)\equiv x\) for \(x\ge0\), (\ref{eq:general-Markov-inequality}) becomes Markov’s inequality. Similarly, when \(g(x)=x^2\) for \(x\ge0\) and \(X=|Y-\mathsf{E} Y|\) for some random variable \(Y\), (\ref{eq:general-Markov-inequality}) becomes Chebyshev’s inequality. \(\vartriangleleft\)


Lemma 3.7 Let \((X_n)_{n\ge1}\) be a sequence of random variables. If \(X_n\stackrel{\mathsf{L}^r}{\to} X\) for some fixed \(r>0\), then \(X_n\stackrel{\mathsf{P}}{\to} X\) as \(n\to\infty\). \(\vartriangleleft\)


Proof Apply the general Markov inequality (\ref{eq:general-Markov-inequality}) to \(g(x)=x^r\) and the random variable \(|X_n-X|\ge0\). Then, for every fixed \(\varepsilon>0\), \[\mathsf{P}\bigl(|X_n-X|\ge\varepsilon\bigr)\le\frac{\mathsf{E}\bigl(|X_n-X|^r\bigr)}{\varepsilon^r}\to0\quad\text{ as $n\to\infty$.}\] \(\blacksquare\)


Example 3.8 Let random variables \((X_n)_{n\ge1}\) be as in (\ref{eq:01-Lp-conv}), where \(p_n\to0\) while \(|a_n|^rp_n\to\infty\) as \(n\to\infty\). Then, by Example 3.5, \(X_n\stackrel{\mathsf{P}}{\to}0\), while by Example 3.3, \(X_n\) does not converge to \(X\equiv0\) in \(L^r\). \(\vartriangleleft\)
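A concrete instance of Example 3.8, with the illustrative choices \(p_n=1/n\) and \(a_n=e^n\): the defect probabilities \(p_n\) vanish while the first absolute moments blow up, so the sequence converges to zero in probability but not in \(L^1\).

```python
import numpy as np

# X_n = a_n with probability p_n and X_n = 0 otherwise, as in (3.1)
n = np.arange(1, 31)
p = 1 / n                  # p_n -> 0, hence X_n -> 0 in probability
a = np.exp(n)              # |a_n| p_n = e^n / n -> infinity, hence no convergence in L^1

print("P(|X_n| > eps) = p_n for the last few n:", p[-3:])               # tends to 0
print("E|X_n| = |a_n| p_n for the last few n :", (np.abs(a) * p)[-3:])  # blows up
```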


Exercise 3.51

Let random variables \((X_n)_{n\ge1}\) satisfy \(\mathsf{P}(X_n=1)=p_n=1-\mathsf{P}(X_n=0)\).
a) Show that \(X_n\to0\) in probability as \(n\to\infty\) if and only if \(p_n\to0\) as \(n\to\infty\);
b) Show that \(X_n\to1\) in probability as \(n\to\infty\) if and only if \(p_n\to1\) as \(n\to\infty\);
c) Show that \(X_n\to0\) in \(L^r\) (\(r>0\)) as \(n\to\infty\) if and only if \(p_n\to0\) as \(n\to\infty\);
d) Show that \(X_n\to1\) in \(L^r\) (\(r>0\)) as \(n\to\infty\) if and only if \(p_n\to1\) as \(n\to\infty\).


Exercise 3.52

Let \(X\), \(Y\), \((X_n)_{n\ge1}\) and \((Y_n)_{n\ge1}\) be random variables.
a) Let \(X_n\stackrel{\mathsf{P}}{\to} X\) and \(Y_n\stackrel{\mathsf{P}}{\to} Y\) as \(n\to\infty\). Show that \(X_n+Y_n\stackrel{\mathsf{P}}{\to} X+Y\) as \(n\to\infty\).

b) If \(X_n\stackrel{\mathsf{P}}{\to} X\) and \(a\) is a real number, show that \(aX_n\stackrel{\mathsf{P}}{\to} aX\) as \(n\to\infty\).
c) Let \(r\ge1\) be fixed and let \(X_n\stackrel{\mathsf{L}^r}{\to} X\) and \(Y_n\stackrel{\mathsf{L}^r}{\to} Y\) as \(n\to\infty\). Show that \(X_n+Y_n\stackrel{\mathsf{L}^r}{\to} X+Y\) as \(n\to\infty\).

d) If real constants \(a\) and \(r>0\) are fixed and \(X_n\stackrel{\mathsf{L}^r}{\to} X\) as \(n\to\infty\), show that \(aX_n\stackrel{\mathsf{L}^r}{\to} aX\) as \(n\to\infty\).


Exercise 3.53

Let \(X\), \(Y\) and \((X_n)_{n\ge1}\) be random variables.
a) If \(X_n\stackrel{\mathsf{P}}{\to} X\) and \(X_n\stackrel{\mathsf{P}}{\to} Y\) as \(n\to\infty\), show that \(\mathsf{P}(X\neq Y)\equiv\mathsf{P}\bigl(\{\omega\in\Omega:X(\omega)\neq Y(\omega)\}\bigr)=0\), i.e., \(X\) and \(Y\) are equivalent.

b) For \(r\ge1\), let \(X_n\stackrel{\mathsf{L}^r}{\to} X\) and \(X_n\stackrel{\mathsf{L}^r}{\to} Y\) as \(n\to\infty\). Show that \(X\) and \(Y\) are equivalent, ie., \(\mathsf{P}(X\neq Y)=0\).


Exercise 3.54

For random variables \(X\) and \((Y_n)_{n\ge1}\) with \(\mathsf{E}|Y_n|=\frac1n\), let \(X_n:= X+Y_n\). Is it true that \(X_n\stackrel{\mathsf{P}}{\to} X\)? Is it true that \(X_n\stackrel{\mathsf{L}^r}{\to} X\) for some \(r>0\)?


Exercise 3.55

For each \(n\in\mathbb{N}\), let \(X_n\sim\mathsf{Exp}(n)\), that is, \(\mathsf{P}(X_n>a)=e^{-na}\) for all \(a\ge0\). Is it true that \(X_n\stackrel{\mathsf{P}}{\to}0\) as \(n\to\infty\)? Is it true that \(X_n\stackrel{\mathsf{L}^r}{\to}0\) for some \(r>0\)?


3.2.3 Almost sure convergence

Definition 3.9 A sequence \((X_n)_{n\ge1}\) of random variables in \((\Omega,\mathcal{F},\mathsf{P})\) converges, as \(n\to\infty\), to a random variable \(X\) with probability one (or almost surely) if \[\tag{3.3}\label{eq:convergence-almost-sure} \mathsf{P}\Bigl(\bigl\{\omega\in\Omega:X_n(\omega)\to X(\omega)\text{ as $n\to\infty$}\bigr\}\Bigr)=1.\]


Remark 3.9.1 For \(\varepsilon>0\), let \(A_n(\varepsilon)=\bigl\{\omega:|X_n(\omega)-X(\omega)|>\varepsilon\bigr\}\). By the discussion in Remark 2.2.1, property (\ref{eq:convergence-almost-sure}) is equivalent to saying that for every \(\varepsilon>0\) \[\tag{3.4}\label{eq:convergence-almost-sure-2} \mathsf{P}\bigl(\bigl\{A_n(\varepsilon)\text{ finitely often}\bigr\}\bigr)=1.\] This is why the Borel-Cantelli lemma is so useful in studying almost sure limits. \(\vartriangleleft\)


Remark 3.9.2 Almost sure convergence is preserved under the action of continuous functions: If \(X_n\to X\) almost surely as \(n\to\infty\) and \(f:\mathbb{R}\to\mathbb{R}\) is continuous, then \(f(X_n)\to f(X)\) almost surely as \(n\to\infty\), cf. Lemma A.5. \(\vartriangleleft\)


Example 3.10 As in Example 2.2, consider a finite random variable \(X\), ie., satisfying \(\mathsf{P}(|X|<\infty)=1\). Then the sequence \((X_k)_{k\ge1}\) defined via \(X_k:=\frac1kX\) converges to zero with probability one. Indeed, the discussion in Example 2.2 established the following analogue of (\ref{eq:convergence-almost-sure-2}): \[\mathsf{P}\bigl(\bigl\{|X_n(\omega)|>\varepsilon\text{ finitely often}\bigr\}\bigr)=1.\] \(\vartriangleleft\)


It is important to remember that convergence in probability does not control pointwise behaviour, ie., the convergence \(X_n(\omega)\to X(\omega)\) for a fixed \(\omega\in\Omega\). The following important example shows that neither \(X_n\stackrel{\mathsf{P}}{\to} X\) nor \(X_n\stackrel{\mathsf{L}^r}{\to} X\) implies \(X_n\stackrel{\mathsf{a.s.}}{\to} X\) as \(n\to\infty\).

Example 3.11 For \(n\ge1\) put \(m=[\log_2n]\), i.e., let \(m\ge0\) be such that \(2^m\le n<2^{m+1}\). Consider the events \[A_n=\Bigl[\frac{n-2^m}{2^m}, \frac{n+1-2^m}{2^m}\Bigr]\subseteq\bigl[0,1\bigr]\] in the canonical probability space, and let \(X_n(\omega):=\mathbf{1}_{A_n}(\omega)\). Because \[\mathsf{P}\bigl(|\mathbf{1}_{A_n}|>0\bigr)=\mathsf{P}(A_n)\equiv\mathsf{E}\bigl(\mathbf{1}_{A_n}\bigr)=2^{-[\log_2n]}<\frac2n\to0\] as \(n\to\infty\), the sequence \(X_n\) converges to \(X\equiv0\) in probability and in \(L^r\), for each fixed \(r>0\). However, \[\bigl\{\omega\in\Omega:X_n(\omega)\to X(\omega)\equiv0\text{ as $n\to\infty$}\bigr\}=\varnothing,\] ie., there is no point \(\omega\in\Omega\) for which the sequence \(X_n(\omega)\in\{0,1\}\) converges 18 to \(X(\omega)=0\); in fact, for each \(\omega\in\Omega\) the real sequence \(X_n(\omega)\) never stops jumping between \(0\) and \(1\). \(\vartriangleleft\)
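A small computational sketch of Example 3.11 (illustrative only): it evaluates the indicators \(X_n(\omega)=\mathbf{1}_{A_n}(\omega)\) at a fixed point \(\omega\) and confirms that, although \(\mathsf{P}(A_n)\to0\), the value \(1\) keeps reappearing, once in every dyadic block of indices.

```python
import numpy as np

def X(n, omega):
    """Indicator of the interval A_n from Example 3.11, evaluated at omega."""
    m = int(np.floor(np.log2(n)))            # m with 2^m <= n < 2^{m+1}
    left = (n - 2**m) / 2**m
    right = (n + 1 - 2**m) / 2**m
    return 1 if left <= omega <= right else 0

omega = 0.3                                   # arbitrary fixed point of [0,1]
vals = [X(n, omega) for n in range(1, 2001)]
ones = [n for n, v in enumerate(vals, start=1) if v == 1]
print("P(A_n) for n = 1000:", 2.0 ** (-int(np.floor(np.log2(1000)))))   # about 0.002
print("last few indices n <= 2000 with X_n(omega) = 1:", ones[-5:])
# at least one such index lies in every block [2^m, 2^{m+1}), so X_n(omega) never settles at 0
```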


In general, verifying convergence with probability one is not immediate. The following lemma gives a sufficient condition for almost sure convergence.

Lemma 3.12 Let \(X\) and \((X_n)_{n\ge1}\) be random variables. If, for every \(\varepsilon>0\), \[\tag{3.5}\label{eq:BC-as-convergence} \sum_{n=1}^\infty\mathsf{P}\bigl(|X_n-X|>\varepsilon\bigr)<\infty,\] then \(X_n\) converges to \(X\) almost surely. \(\vartriangleleft\)


Proof Fix \(\varepsilon>0\) and let \(A_n(\varepsilon)=\bigl\{\omega\in\Omega:|X_n(\omega)-X(\omega)|>\varepsilon\bigr\}\). By (\ref{eq:BC-as-convergence}), \(\sum_n\mathsf{P}\bigl(A_n(\varepsilon)\bigr)<\infty\), and, by Lemma 2.3 a), only a finite number of \(A_n(\varepsilon)\) occur with probability one. This means that for every fixed \(\varepsilon>0\) the event \[A(\varepsilon):=\bigl\{\omega\in\Omega:|X_n(\omega)-X(\omega)|\le\varepsilon\text{ for all $n$ large enough}\bigr\}\] has probability one. By monotonicity (\(A(\varepsilon_1)\subset A(\varepsilon_2)\) if \(\varepsilon_1<\varepsilon_2\)), the event \[\bigl\{\omega\in\Omega:X_n(\omega)\to X(\omega)\text{ as $n\to\infty$}\bigr\}=\bigcap_{\varepsilon>0}A(\varepsilon)=\bigcap_{m\ge1}A(1/m)\] has probability one. The claim follows. \(\blacksquare\)


Exercise 3.56

Let random variables \((X_n)_{n\ge1}\) be independent with \(\mathsf{P}(X_n=1)=p_n\), \(\mathsf{P}(X_n=0)=1-p_n\). Show carefully that \(X_n\to0\) almost surely as \(n\to\infty\) if and only if \(\sum_np_n<\infty\).


Exercise 3.57

Let independent variables \((X_n)_{n\ge1}\) have uniform distribution, \(X_n\sim\mathcal{U}[1,1+\tfrac1n]\). Does \(X_n\) converge to \(1\) almost surely?


Exercise 3.58

Let \((X_n)_{n\ge1}\) be independent random variables with common exponential distribution \(\mathsf{Exp}(\lambda)\), where \(\lambda>0\); equivalently, \(\mathsf{P}(X_n>x)=e^{-\lambda x}\) for all \(n\in\mathbb{N}\) and \(x>0\). Denote \(Y_n:=\min(X_1,\dots,X_n)\). For each of the claims below, prove the result if it is true or find a counterexample otherwise: a) \(Y_n\stackrel{\mathsf{P}}{\to}0\) as \(n\to\infty\); b) \(Y_n\stackrel{\mathsf{L}^r}{\to}0\) as \(n\to\infty\), for some \(r>0\); c) \(Y_n\stackrel{\mathsf{a.s.}}{\to}0\) as \(n\to\infty\).


Exercise 3.59

Let independent variables \((X_n)_{n\ge1}\) satisfy \(\mathsf{P}(X_n=a_n)=p_n=1-\mathsf{P}(X_n=0)\) for \(p_n\in[0,1]\) and real \(a_n\).
a) If \(a_n=\sqrt{n}\), find a sequence \((p_n)_{n\ge1}\) such that \(X_n\) converges to \(X\equiv0\) in \(L^1\) but not almost surely.
b) If \(a_n=n^2\), find a sequence \((p_n)_{n\ge1}\) such that \(X_n\) converges to \(X\equiv0\) almost surely but not in \(L^1\).


Exercise 3.60

Let random variables \(X\), \((X_n)_{n\ge1}\), \(Y\), \((Y_n)_{n\ge1}\) and a real sequence \((c_n)_{n\ge1}\) be such that \(X_n\stackrel{\mathsf{a.s.}}{\to} X\), \(Y_n\stackrel{\mathsf{a.s.}}{\to} Y\), and for some \(c\in\mathbb{R}\) we have \(c_n\to c\) as \(n\to\infty\).
a) Show that \(X_n+Y_n\stackrel{\mathsf{a.s.}}{\to} X+Y\) as \(n\to\infty\);
b) Show that \(c_nX_n\stackrel{\mathsf{a.s.}}{\to} cX\) as \(n\to\infty\);
c) Show that \(X_nY_n\stackrel{\mathsf{a.s.}}{\to} XY\) as \(n\to\infty\).


Exercise 3.61

Let random variables \((X_n)_{n\ge1}\) be independent with \(\mathsf{P}(X_n=1)=\frac1{\sqrt n}\), \(\mathsf{P}(X_n=0)=1-\frac1{\sqrt n}\). Show that \(\mathsf{P}\bigl(X_n \text{ does not converge as $n\to\infty$}\bigr)=1\).


Example 3.13 Let independent random variables \((X_n)_{n\ge1}\) have common uniform distribution \(\mathcal{U}[0,1]\). Denote \(Y_n:=\min(X_1,\dots,X_n)\). By its definition, for each fixed \(\omega\in\Omega\), the (real) sequence \(Y_n(\omega)\) is non-increasing; it is also ‘obvious’ that \(Y_n\to0\). We identify which modes of convergence hold here.
First, for each fixed \(\varepsilon>0\), we have \[\mathsf{P}(Y_n>\varepsilon)\equiv\mathsf{P}(X_1>\varepsilon,\dots,X_n>\varepsilon)=\prod_{k=1}^n\mathsf{P}(X_k>\varepsilon)=(1-\varepsilon)^n,\] where the second equality follows by independence. Consequently, \(\mathsf{P}(|Y_n|>\varepsilon)\to0\) as \(n\to\infty\), for all \(\varepsilon>0\); in other words, \(Y_n\stackrel{\mathsf{P}}{\to}0\).
Next, \(\sum_n\mathsf{P}(|Y_n|>\varepsilon)<\infty\) for each \(\varepsilon>0\), so Lemma 3.12 implies that \(Y_n\stackrel{\mathsf{a.s.}}{\to}0\) as \(n\to\infty\).
We finally show that \(Y_n\stackrel{\mathsf{L}^r}{\to}0\) as \(n\to\infty\), for each fixed \(r>0\). Using the display above, we derive the probability density of \(Y_n\) via \(f_{Y_n}(y)\equiv-\tfrac{d}{dy}\mathsf{P}(Y_n>y)=n(1-y)^{n-1}\mathbf{1}_{0\le y\le1}\). For fixed \(r>0\), we thus have \(\mathsf{E}(|Y_n|^r)=n\int_0^1y^r(1-y)^{n-1}dy\), and it remains to show that the last expression vanishes asymptotically as \(n\to\infty\).
Changing variables \(y\mapsto z:= ny\), we can write the expectation \(\mathsf{E}(|Y_n|^r)\) as \(n^{-r}\) times the integral 19 \[\int_0^nz^r\Bigl(1-\frac{z}n\Bigr)^{n-1}dz\le\int_0^nz^r\exp\Bigl\{-\frac{n-1}nz\Bigr\}dz,\] which is bounded, uniformly in \(n\in\mathbb{N}\); indeed, for \(n>1\) the last expression is not bigger than \(K_r:=\int_0^\infty z^re^{-z/2}dz<\infty\) ( why?). Finally, \(\mathsf{E}(|Y_n|^r)\le K_rn^{-r}\to0\) as \(n\to\infty\), equivalently, \(Y_n\stackrel{\mathsf{L}^r}{\to}0\). \(\vartriangleleft\)
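As a sanity check of Example 3.13, the following Monte Carlo sketch (illustrative parameters, \(r=2\)) compares simulated values of \(\mathsf{E}(Y_n^2)\) with the exact value \(n\int_0^1y^2(1-y)^{n-1}dy=\tfrac{2}{(n+1)(n+2)}\); both decay at the rate \(n^{-2}\), consistent with the bound \(\mathsf{E}(|Y_n|^r)\le K_rn^{-r}\).

```python
import numpy as np

rng = np.random.default_rng(3)                  # fixed seed, illustrative only
reps = 5_000
for n in (10, 100, 1000):
    Y = rng.random((reps, n)).min(axis=1)       # realisations of Y_n = min(X_1, ..., X_n)
    exact = 2 / ((n + 1) * (n + 2))             # E(Y_n^2) computed in closed form
    print(n, "Monte Carlo E(Y_n^2):", float(np.mean(Y**2)), " exact:", exact)
```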


Verifying almost sure convergence through the Borel-Cantelli lemma (or the sufficient condition (\ref{eq:BC-as-convergence})) is often easier than using an explicit construction in the spirit of Example 2.2. We shall see more examples below.

3.3 Laws of large numbers

According to Wikipedia, “The law of averages is the commonly held belief that a particular outcome or event will over certain periods of time occur at a frequency that is similar to its probability. Depending on context or application it can be considered a valid common-sense observation or a misunderstanding of probability.” A rigorous probabilistic result of this type is known as the law of large numbers; it is formulated in terms of convergence of suitable random variables.

Our first result is the \(L^2\) Weak Law of Large Numbers (WLLN):

Theorem 3.14 Let \((X_n)_{n\ge1}\) be uncorrelated random variables with \(\mathsf{E} X_n=\mu\) and \(\mathsf{Var}(X_n)\le C<\infty\). Denote \(S_n=X_1+\dots+X_n\). Then \(\frac1nS_n\stackrel{\mathsf{L}^2}{\to}\mu\) as \(n\to\infty\). \(\vartriangleleft\)


Proof As variables \(X_k\) are uncorrelated, we have \(\mathsf{Var}(S_n)=\sum_{k=1}^n\mathsf{Var}(X_k)\); therefore \(\mathsf{Var}(S_n)\le Cn\) and \[\mathsf{E}\Bigl(\Bigl(\frac1nS_n-\mu\Bigr)^2\Bigr)=\mathsf{E}\Bigl(\frac{(S_n-n\mu)^2}{n^2}\Bigr)=\frac{\mathsf{Var}(S_n)}{n^2}\le\frac{C}{n}\to0\qquad\text{as $n\to\infty$.}\] \(\blacksquare\)
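A quick simulation of the \(L^2\) WLLN for i.i.d. \(\mathcal{U}[0,1]\) variables (an illustrative choice with \(\mu=\tfrac12\) and \(\mathsf{Var}(X_k)=\tfrac1{12}\)): the estimated mean square error \(\mathsf{E}\bigl((\tfrac1nS_n-\mu)^2\bigr)\) matches the \(\mathsf{Var}(X_1)/n\) decay used in the proof.

```python
import numpy as np

rng = np.random.default_rng(4)                  # fixed seed, illustrative only
mu, var = 0.5, 1 / 12                           # mean and variance of U[0,1]
reps = 5_000
for n in (10, 100, 1000):
    S = rng.random((reps, n)).sum(axis=1)       # reps independent copies of S_n
    mse = np.mean((S / n - mu) ** 2)
    print(n, "estimated E((S_n/n - mu)^2):", float(mse), "   Var(X_1)/n:", var / n)
```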


Remark Let \(Z\sim\mathcal{N}(0,\sigma^2)\), so that \(\mathsf{E} Z=\mathsf{E}(Z^3)=0\), \(\mathsf{E}(Z^2)=\sigma^2\), and \(\mathsf{E}(Z^4)=3\sigma^4\). Define \(Y_n:=\frac1n\sum_{k=1}^nX_k\), where \(X_k:=(Z_k)^2\) with independent \(Z_k\sim\mathcal{N}(0,t)\). Then \(\mathsf{E} Y_n=t\), \(\mathsf{E}\bigl((Y_n-t)^2\bigr)=\frac1n\mathsf{Var}(X_k)=\frac{2t^2}{n}\), and so \(Y_n\to t\) in \(L^2\) as \(n\to\infty\). This simple observation is at the heart of the construction of the stochastic integral with respect to Brownian motion. The resulting area of stochastic analysis has a wide range of applicability, including financial mathematics. \(\vartriangleleft\)


The usual Weak Law of Large Numbers (WLLN) is often stated as a convergence in probability result:

Theorem 3.15 Under the conditions of Theorem 3.14, \(\frac1nS_n\stackrel{\mathsf{P}}{\to}\mu\) as \(n\to\infty\). \(\vartriangleleft\)


Proof Follows immediately from Theorem 3.14 and Lemma 3.7. \(\blacksquare\)


Remark 3.15.1 It is a useful exercise to derive Theorem 3.15 from Chebyshev’s inequality; see Exercise  3.62. \(\vartriangleleft\)


The following optional example shows that a high dimensional cube is essentially a sphere:

advanced

Example 3.16 Let random variables \((X_k)_{k\ge1}\) be i.i.d. with \(X_k\sim \mathcal{U}(-1,1)\), and define \(Y_k=(X_k)^2\). Then \(\mathsf{E} Y_k=\frac13\) and \(\mathsf{Var}(Y_k)\le\mathsf{E}\bigl((Y_k)^2\bigr)=\mathsf{E}\bigl((X_k)^4\bigr)\le1\). Fix \(\varepsilon\in(0,\tfrac13)\) and consider the set \[A_{n,\varepsilon}:=\Bigl\{x\in\mathbb{R}^n:(1-3\varepsilon)\frac{n}3<|x|^2<(1+3\varepsilon)\frac{n}3\Bigr\},\] where \(|x|\) is the usual Euclidean length in \(\mathbb{R}^n\), \(|x|^2=\sum_{k=1}^n(x_k)^2\). By the WLLN, \(\frac1n\sum\limits_{k=1}^nY_k\equiv\frac1n\sum\limits_{k=1}^n(X_k)^2\stackrel{\mathsf{P}}{\to}\frac13\); in other words, for every fixed \(\varepsilon\in(0,\tfrac13)\), a point \(\mathbf{X}=(X_1,\dots,X_n)\) chosen uniformly in \((-1,1)^n\) satisfies \[\mathsf{P}\Bigl(\Bigl|\frac1n\sum_{k=1}^n(X_k)^2-\frac13\Bigr|\ge\varepsilon\Bigr)\equiv\mathsf{P}\bigl(\mathbf{X}\not\in A_{n,\varepsilon}\bigr)\to0\qquad\text{ as } n\to\infty,\] ie., for large \(n\), with probability approaching one, a random point \(\mathbf{X}\in(-1,1)^n\) is near the \(n\)-dimensional sphere of radius \(\sqrt{n/3}\) centred at the origin. \(\vartriangleleft\)



Lemma 3.17 Let random variables \((S_n)_{n\ge1}\) have finite second moments, \(\mu_n\equiv\mathsf{E} S_n\), \(\sigma_n^2\equiv\mathsf{Var}(S_n)<\infty\). Further, let the positive reals \((b_n)_{n\ge1}\) satisfy \(\sigma_n/b_n\to0\) as \(n\to\infty\). Then \((S_n-\mu_n)/{b_n}\to0\) as \(n\to\infty\), both in \(L^2\) and in probability. \(\vartriangleleft\)


Proof Similarly to Theorem 3.15, the result follows from the observation \[\mathsf{E}\Bigl(\frac{(S_n-\mu_n)^2}{b_n^2}\Bigr)=\frac{\mathsf{Var}(S_n)}{b_n^2}\to0\qquad\text{ as }n\to\infty.\] \(\blacksquare\)


Example 3.18 In the setting of the “coupon collector’s problem”, let \(T_n\) be the time to collect all \(n\) coupons. It is easy to show that \(\mathsf{E} T_n=n\sum_{m=1}^n\frac1m\sim n\log n\) and \(\mathsf{Var}(T_n)\le n^2\sum_{m=1}^n\frac1{m^2}\le \frac{\pi^2n^2}6\), so that \[\tfrac{T_n-\mathsf{E} T_n}{n\log n}\to0\qquad\text{ i.e., }\qquad \tfrac{T_n}{n\log n}\to1\qquad\text{ as }n\to\infty,\] both in \(L^2\) and in probability. \(\vartriangleleft\)
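The coupon collector asymptotics are easy to check by simulation; the sketch below (illustrative sizes and seed) draws coupons uniformly at random until all \(n\) types have been seen and compares the average of \(T_n\) with \(n\log n\).

```python
import numpy as np

rng = np.random.default_rng(5)                  # fixed seed, illustrative only

def collect(n):
    """Number of uniform draws from {0, ..., n-1} needed to see every coupon."""
    seen = np.zeros(n, dtype=bool)
    draws = 0
    while not seen.all():
        seen[rng.integers(n)] = True
        draws += 1
    return draws

for n in (50, 200, 1000):
    T = np.array([collect(n) for _ in range(100)])
    print(n, "average T_n / (n log n):", T.mean() / (n * np.log(n)))   # approaches 1 as n grows
```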


Remark 3.18.1 With additional work, one can show that \(\mathsf{P}(T_n\le n\log n+cn)\to e^{-e^{-c}}\) as \(n\to\infty\). The limiting expression \(f(c):= e^{-e^{-c}}\) changes rapidly with \(c\), in particular, \(f(-1)\approx0.06599\) while \(f(3)\approx0.95143\). I.e., for large \(n\) the probability \(\mathsf{P}(T_n\le n\log n+cn)\) quickly changes from almost zero to almost one if the number of opened boxes goes from \(n\log n-n\) to \(n\log n+3n\). This is known as the cutoff phenomenon. \(\vartriangleleft\)


Theorems  3.14 and  3.15 are known as Weak Laws of Large Numbers because of the population-level view, as introduced in Section  3.1.

In contrast, there are Strong Laws of Large Numbers (SLLN), which follow the individual approach; an SLLN is a statement about almost sure convergence. There are many of these; here we mainly consider the famous SLLN due to Borel. It is convenient to first verify another famous result, the Cauchy-Schwarz inequality; the latter has important applications in many areas of mathematics.

Lemma 3.19 Let \(X\) and \(Y\) be random variables with \(\mathsf{E}(|X|^2)<\infty\) and \(\mathsf{E}(|Y|^2)<\infty\). Then, \[\tag{3.6}\label{eq:Cauchy-Schwarz} \mathsf{E}(XY)\le\sqrt{\mathsf{E}(X^2)}\sqrt{\mathsf{E}(Y^2)}.\] \(\vartriangleleft\)


Proof Without loss of generality, we may and shall assume that \(\mathsf{E}(|X|^2)\) and \(\mathsf{E}(|Y|^2)\) are positive, as otherwise \(X\) or \(Y\), and therefore \(XY\), vanish with probability one ( why?). For positive \(\lambda\), consider the random variable \((X-\lambda Y)^2\ge0\). The latter has non-negative expectation, equivalently, \[0\le \mathsf{E}(X^2) -2\lambda\mathsf{E}(XY)+\lambda^2\mathsf{E}(Y^2).\] By rearranging and simplifying, we get \(\mathsf{E}(XY)\le\tfrac1{2\lambda}\mathsf{E}(X^2)+\tfrac\lambda2\mathsf{E}(Y^2)\). The target result (\ref{eq:Cauchy-Schwarz}) now follows by inserting \(\lambda= { \sqrt{ \mathsf{E}(X^2)/ \mathsf{E}(Y^2)}}\) into the last inequality. \(\blacksquare\)


Remark 3.19.1 If \(\mathsf{E}(X^4)\le C\) and \(\mathsf{E}(Y^4)\le C\) for some finite constant \(C\ge0\), then (\ref{eq:Cauchy-Schwarz}) gives \(\mathsf{E}(X^2Y^2)\le C\) as well. \(\vartriangleleft\)


Theorem 3.20 Let variables \((X_n)_{n\ge1}\) be independent with \(\mathsf{E}(X_k)=\mu\) and \(\mathsf{E}\bigl((X_k)^4\bigr)\le C\) for some constant \(C>0\) and all \(k\). If \(S_n:= X_1+X_2+\dots+X_n\), then \(S_n/n\to\mu\) almost surely, as \(n\to\infty\). \(\vartriangleleft\)


Proof We may and shall suppose 20 that \(\mu=\mathsf{E}(X_k)=0\). Now, \[\mathsf{E}\bigl((S_n)^4\bigr)=\mathsf{E}\Bigl(\Bigl(\sum_{k=1}^nX_k\Bigr)^4\Bigr)=\sum_{i,j,k,l=1}^n\mathsf{E}(X_iX_jX_kX_l) =\sum_k\mathsf{E}\bigl((X_k)^4\bigr)+6\sum_{1\le k<m\le n}\mathsf{E}\bigl((X_k)^2(X_m)^2\bigr),\] where the last equality follows from the fact that if \(i\notin\{j,k,l\}\), we have \[\mathsf{E}(X_iX_jX_kX_l)=\mathsf{E}(X_i)\mathsf{E}(X_jX_kX_l)=0,\] as \(\mathsf{E}(X_i)=0\). By Remark 3.19.1, \(\mathsf{E}\bigl((X_k)^2(X_m)^2\bigr)\le C\) for all \(1\le k<m\le n\), so that \(\mathsf{E}\bigl((S_n)^4\bigr)\le 3Cn^2\). Finally, the general Markov inequality (\ref{eq:general-Markov-inequality}) with \(g(x)=x^4\) implies \[\mathsf{P}\bigl(|S_n|>n\varepsilon\bigr)\le\frac{\mathsf{E}\bigl((S_n)^4\bigr)}{(n\varepsilon)^4}\le\frac{3C}{n^2\varepsilon^4}\] and the result follows from the sufficient condition (\ref{eq:BC-as-convergence}) of almost sure convergence. \(\blacksquare\)


Remark 3.20.1 While the Cauchy-Schwarz inequality (\ref{eq:Cauchy-Schwarz}) works without any assumption on (in)dependence of \(X\) and \(Y\), in the independent setting of Theorem 3.20 the bound \(\mathsf{E}((X_k)^2(X_m)^2)\le C\) can also be obtained by noticing the factorisation \(\mathsf{E}((X_k)^2(X_m)^2)=\mathsf{E}((X_k)^2)\mathsf{E}((X_m)^2)\) and the general inequality \(\mathsf{E}(Z^2)\le\sqrt{\mathsf{E}(Z^4)}\). The latter is immediate from the observation that \(\mathsf{Var}(Z^2)=\mathsf{E}(Z^4)-(\mathsf{E}(Z^2))^2\ge0\) for each random variable \(Z\). \(\vartriangleleft\)


With some additional work, 21 one can obtain the following SLLN (which is due to Kolmogorov):

Theorem 3.21 Let random variables \((X_n)_{n\ge1}\) be independent and identically distributed, with \(\mathsf{E}|X_k|<\infty\). If \(S_n:= X_1+\dots+X_n\), then \(\frac1nS_n\to\mu:=\mathsf{E}(X_k)\) almost surely, as \(n\to\infty\). \(\vartriangleleft\)


3.4 Relations between different modes of convergence

We explore some relations between different modes of convergence. We already know that (Lemma 3.7) \[X_n\stackrel{\mathsf{L}^r}{\to} X \qquad\Rightarrow\qquad X_n\stackrel{\mathsf{P}}{\to} X;\] it can also be verified (see Example 3.24 below) that \[\tag{3.7}\label{eq:convergence-toas-implies-convergence-toP} X_n\stackrel{\mathsf{a.s.}}{\to} X \qquad\Rightarrow\qquad X_n\stackrel{\mathsf{P}}{\to} X.\] On the other hand, by Example 3.8, \[X_n\stackrel{\mathsf{P}}{\to} X\qquad\not\Rightarrow\qquad X_n\stackrel{\mathsf{L}^r}{\to} X,\] while according to Example 3.11, \[X_n\stackrel{\mathsf{P}}{\to} X\qquad\not\Rightarrow\qquad X_n\stackrel{\mathsf{a.s.}}{\to} X,\] and the same construction shows that \[X_n\stackrel{\mathsf{L}^r}{\to} X\qquad\not\Rightarrow\qquad X_n\stackrel{\mathsf{a.s.}}{\to} X.\] Finally, Example 3.23 below shows that \[X_n\stackrel{\mathsf{a.s.}}{\to} X\qquad\not\Rightarrow \qquad X_n\stackrel{\mathsf{L}^r}{\to} X.\]

The following examples provide further insights:

Example 3.22 Let \(X_n\) be a sequence of independent random variables such that \(\mathsf{P}(X_n=1)=p_n\), \(\mathsf{P}(X_n=0)=1-p_n\). Then \[X_n\stackrel{\mathsf{P}}{\to} 0\quad\Longleftrightarrow\quad p_n\to0 \quad\Longleftrightarrow\quad X_n\stackrel{\mathsf{L}^r}{\to} 0\qquad\text{ as $n\to\infty$,}\] whereas \[X_n\stackrel{\mathsf{a.s.}}{\to} 0\quad\Longleftrightarrow\quad \sum_n p_n<\infty.\] In particular, taking \(p_n=\tfrac1n\) gives a sequence which converges to \(0\) in probability and in every \(L^r\), but not almost surely. This example also shows that \(X_n\stackrel{\mathsf{P}}{\to} X\not\Rightarrow X_n\stackrel{\mathsf{a.s.}}{\to} X\). \(\vartriangleleft\)


Example 3.23 For every \(n\ge1\), consider the variable \[X_n(\omega):= e^n\cdot\mathbf{1}_{[0,1/n]}(\omega)\equiv\begin{cases} e^n,&\quad 0\le\omega\le1/n\\0,&\quad\omega>1/n,\end{cases}\] in the canonical probability space. Clearly, \(X_n\stackrel{\mathsf{a.s.}}{\to}0\) and \(X_n\stackrel{\mathsf{P}}{\to}0\) as \(n\to\infty\) ( why?); however, given \(r>0\), we have \(\mathsf{E}(|X_n|^r)=\frac{e^{nr}}n\to\infty\) as \(n\to\infty\), ie., \(X_n\not\stackrel{\mathsf{L}^r}{\to} 0\). This example also shows that \(X_n\stackrel{\mathsf{a.s.}}{\to} X\not\Rightarrow X_n\stackrel{\mathsf{L}^r}{\to} X\). \(\vartriangleleft\)


Example 3.24 We verify the implication (\ref{eq:convergence-toas-implies-convergence-toP}). To simplify the notation, we may and shall assume that the limit variable vanishes identically, \(X\equiv0\) (otherwise, use the shifted variables \(X'_n:=X_n-X\)). Assuming \(X_n\stackrel{\mathsf{a.s.}}{\to}0\), we show that, for each fixed \(\varepsilon>0\), the events \(A_n(\varepsilon):=\{|X_n|>\varepsilon\}\equiv\{\omega:|X_n(\omega)|>\varepsilon\}\) satisfy \(\mathsf{P}(A_n(\varepsilon))\to0\) as \(n\to\infty\).

For \(\varepsilon>0\), define the \(\varepsilon\)-tolerance event \[T(\varepsilon):=\bigl\{A^\mathsf{c}_n(\varepsilon)\text{ eventually}\bigr\}\equiv\bigcup_{n\ge1}\bigcap_{k\ge n}A^\mathsf{c}_k(\varepsilon).\] Notice that \(T(\varepsilon_1)\subseteq T(\varepsilon_2)\) if \(0<\varepsilon_1\le\varepsilon_2\), while the convergence event is \[C:=\bigl\{\omega:X_n(\omega)\to0\text{ as $n\to\infty$}\bigr\}=\bigcap_{\varepsilon>0}T(\varepsilon)\equiv\bigcap_{m\ge1}T(1/m).\] As, by assumption, \(\mathsf{P}(C)=1\), monotonicity of \(T(\varepsilon)\) implies that for all \(\varepsilon>0\) we have \(\mathsf{P}\bigl(T(\varepsilon)\bigr)=1\), equivalently, the complement event \(T^\mathsf{c}(\varepsilon)=\bigcap_{n\ge1}\bigcup_{k\ge n}A_k(\varepsilon)\) has probability zero.

Fix \(\varepsilon>0\) and let \(B_n(\varepsilon):=\bigcup_{k\ge n}A_k(\varepsilon)\). Notice that \((B_n(\varepsilon))_{n\ge1}\) form a decreasing sequence of events, \(B_{n+1}(\varepsilon)\subseteq B_n(\varepsilon)\). By continuity of probability, \[\lim_{n\to\infty}\mathsf{P}\bigl(B_n(\varepsilon)\bigr)=\mathsf{P}\Bigl(\bigcap_{n\ge1}B_n(\varepsilon)\Bigr)=\mathsf{P}\bigl(T^\mathsf{c}(\varepsilon)\bigr)=0,\] where the last equality follows from the assumption that \(X_n\stackrel{\mathsf{a.s.}}{\to}0\) as \(n\to\infty\).

As \(A_n(\varepsilon)\subseteq B_n(\varepsilon)\) for all \(n\ge1\) and \(\varepsilon>0\), we deduce that for each fixed \(\varepsilon>0\), \(\mathsf{P}\bigl(A_n(\varepsilon)\bigr)\equiv\mathsf{P}\bigl(|X_n|>\varepsilon\bigr)\to0\) as \(n\to\infty\), equivalently, \(X_n\stackrel{\mathsf{P}}{\to}0\) as \(n\to\infty\). \(\vartriangleleft\)


The following example is important for applications.

Example 3.25 Let random variables \((X_n)_{n\ge1}\) be independent with common \(\mathsf{Exp}(1)\) distribution, i.e., satisfying \(\mathsf{P}(X_k>x)=e^{-x}\) for all \(x\ge0\). Denote \(M_n:=\max_{1\le k\le n}X_k\). We show that \(M_n/\log n\to1\) almost surely as \(n\to\infty\).
Solution. We first verify that \[\tag{3.8}\label{eq:liminf-exponential-records} \liminf_{n\to\infty}\tfrac{M_n(\omega)}{\log n}\ge1,\] on a set \(\Omega_1\) of probability one, \(\mathsf{P}(\Omega_1)=1\). Indeed, by independence of \(X_k\), for every \(y>0\), we have \[\mathsf{P}(M_n\le y)=\prod_{k=1}^n\mathsf{P}(X_k\le y)=\bigl(1-e^{-y}\bigr)^n.\] Consequently, for every fixed \(\varepsilon>0\), \[\mathsf{P}\bigl(\tfrac{M_n}{\log n}\le1-\varepsilon\bigr)=\bigl(1-e^{-(1-\varepsilon)\log n}\bigr)^n=\bigl(1-n^{-(1-\varepsilon)}\bigr)^n\le e^{-n^\varepsilon},\] implying that \[\sum_n\mathsf{P}\bigl(\tfrac{M_n}{\log n}\le1-\varepsilon\bigr)\le\sum_n e^{-n^\varepsilon}<\infty.\] Therefore, by Lemma 2.3 a), the event \[\Omega_1(\varepsilon)=\bigl\{\omega\in\Omega:\tfrac{M_n(\omega)}{\log n}>1-\varepsilon\text{ for all $n$ large enough}\bigr\}\] has probability one, equivalently, the random variable \[n_1=n_1(\omega):=\min\bigl\{n\ge1:M_k(\omega)\ge(1-\varepsilon)\log k\text{ for all $k\ge n$}\bigr\}\] is almost surely finite. As \(\Omega_1(\varepsilon)\) increases with \(\varepsilon\), the set \(\Omega_1:=\cap_{\varepsilon>0}\Omega_1(\varepsilon)\) has probability one, while for each \(\omega\in\Omega_1\) property (\ref{eq:liminf-exponential-records}) holds.

We next show that for a set \(\Omega_2\) of probability one, \(\mathsf{P}(\Omega_2)=1\), we have 22 \[\tag{3.9}\label{eq:limsup-exponential-records} \limsup_{n\to\infty}\tfrac{M_n(\omega)}{\log n}\le1.\] Fix arbitrary \(\omega\in\Omega\) and \(\varepsilon>0\). As \(\log n\to\infty\), by Lemma A.4 the sets \[\bigl\{n\in\mathbb{N}:M_n(\omega)>(1+\varepsilon)\log n\bigr\} \quad\text{ and }\quad \bigl\{n\in\mathbb{N}:X_n(\omega)>(1+\varepsilon)\log n\bigr\}\] are either both finite or both infinite. Consequently, the events \[\Omega_3(\varepsilon):=\bigl\{\omega\in\Omega:\tfrac{M_n(\omega)}{\log n}>(1+\varepsilon)\text{ i.o. }\bigr\} \quad\text{ and }\quad \bigl\{\omega\in\Omega:\tfrac{X_n(\omega)}{\log n}>(1+\varepsilon)\text{ i.o. }\bigr\}\] coincide. By Example 2.7, \(\mathsf{P}(\Omega_3(\varepsilon))=0\) and therefore the event \(\Omega_3:=\cup_{\varepsilon>0}\Omega_3(\varepsilon)\) also has probability zero. Notice that \[\Omega_3\equiv\Bigl\{\omega\in\Omega: \limsup_{n\to\infty}\tfrac{M_n(\omega)}{\log n}>1\Bigr\},\] so that (\ref{eq:limsup-exponential-records}) holds with \(\Omega_2:=\Omega_3^\mathsf{c}\).

Finally, the set \(\Omega_0:=\Omega_1\cap\Omega_2=\Omega_1\setminus\Omega_3\) has probability one, and for each \(\omega\in\Omega_0\) we have \(M_n(\omega)/\log n\to1\) as \(n\to\infty\). \(\vartriangleleft\)
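A simulation makes this limit visible (an illustration only; the sample size and seed are arbitrary choices). Along a single simulated realisation of i.i.d. \(\mathsf{Exp}(1)\) variables, the ratio \(M_n/\log n\) settles near \(1\):

```python
import math
import random

random.seed(0)

# One realisation of M_n / log n for i.i.d. Exp(1) variables (Example 3.25):
# along almost every trajectory the ratio tends to 1 as n grows.
running_max = 0.0
checkpoints = {10, 100, 1_000, 10_000, 100_000, 1_000_000}
for n in range(1, 1_000_001):
    running_max = max(running_max, random.expovariate(1.0))
    if n in checkpoints:
        print(f"n={n:>8}:  M_n / log n = {running_max / math.log(n):.4f}")
```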


There are many modes of convergence in mathematics, see the Wiki page for a list. A good source on convergence of random variables is .

The Law of large numbers has many interesting applications, see, e.g., the Wiki page.

checklist

By the end of this section you should be able to:


Exercise 3.62

Derive Theorem 3.15 from Chebyshev’s inequality.


Exercise 3.63

Let \((X_n)_{n\ge1}\) be i.i.d. Bernoulli random variables with \(\mathsf{E}[X_i]=2/3\). Put \(S_n=X_1+\dots +X_n\). Show that \(S_n \to \infty\) almost surely as \(n\to \infty\).


Exercise 3.64

Consider a sequence \((X_n)_{n \geq 1}\) with \(\mathsf{P}(X_n=1/n)=\mathsf{P}(X_n=-1/n)=1/2\). Show that \(X_n \to 0\) almost surely as \(n \to \infty\).


Exercise 3.65

Suppose \(r\) distinct balls are put at random in \(n\) labelled bins, with all \(n^r\) assignments of balls to bins being equally probable. Let \(N_n\) denote the total number of empty bins. Assuming that \(r/n\to c>0\) as \(n\to\infty\), show that \[\tfrac1n\mathsf{E} N_n=\bigl(1-\tfrac1n\bigr)^r\to e^{-c},\qquad \tfrac1{n^2}\mathsf{Var}(N_n)\to0 \qquad\text{ as $n\to\infty$,}\] and deduce that \(\frac1nN_n\to e^{-c}\) as \(n\to\infty\) both in \(L^2\) and in probability.


Exercise 3.66

Let \((X_n)_{n\ge1}\) be independent identically distributed random variables with \(\mathsf{E}(X_k)=\mu\) and \(\mathsf{E}\bigl((X_k)^2\bigr)=\sigma^2<\infty\). Denote \(S_n=X_1+\dots+X_n\).
a) Use the Chebyshev inequality and the Borel-Cantelli Lemma to show that

\(\displaystyle\frac1{m^2}S_{m^2}\equiv\frac1{m^2}\bigl(X_1+\dots+X_{m^2}\bigr)\stackrel{\mathsf{a.s.}}{\to}\mu =\mathsf{E}(X_k)\) as \(m\to\infty\).


b) Assuming \(X_k\ge0\), show that \(\displaystyle(m+1)^{-2}S_{m^2}\le n^{-1}S_n\le m^{-2}S_{(m+1)^2}\) provided \(m^2\le n\le(m+1)^2\). Use this inequality to show that \(\frac1nS_n\to\mu=\mathsf{E}(X_k)\) almost surely as \(n\to\infty\) for non-negative random variables \(X_k\ge0\).
c) In the general case, decompose into positive and negative part, \(X_k=X_k^+-X_k^-\), where \(X_k^+\ge0\) and \(X_k^-\ge0\) and deduce the strong law of large numbers (SLLN):

if \(S_n=X_1+\dots+X_n\) where \((X_k)_{k\ge1}\) are i.i.d. satisfying \(\mathsf{E}\bigl((X_k)^2\bigr)<\infty\), then \(n^{-1}S_n\to\mu=\mathsf{E}(X_k)\) almost surely as \(n\to\infty\).

d) Show that \(\frac1nS_n\to\mu=\mathsf{E}(X_k)\) in \(L^2\), that is \(\mathsf{E}\Bigl[\Bigl(\frac1nS_n-\mu\Bigr)^2\Bigr]\to0\) as \(n\to\infty\).


advanced

Optional exercises

Exercise 3.67

Let \(X\) and \((X_k)_{k\ge1}\) be random variables.
a) Show that if for some positive sequence \((\varepsilon_k)_{k\ge1}\) with \(\varepsilon_k\to0\) as \(k\to\infty\) we have \(\sum_k\mathsf{P}\bigl(|X_k-X|>\varepsilon_k\bigr)<\infty\), then \(\mathsf{P}\bigl(\bigl\{\omega\in\Omega:X_k(\omega)\to X(\omega)\bigr\}\bigr)=1\).

b) Suppose that \(X_k\stackrel{\mathsf{P}}{\to} X\) as \(k\to\infty\). Define a strictly increasing integer sequence \((n_k)_{k\ge0}\) via \(n_0=0\), and \[n_k=\min\bigl\{n>n_{k-1}:\mathsf{P}\bigl(|X_{n}-X|\ge k^{-1}\bigr)\le k^{-2}\bigr\}.\] Show that for the subsequence \(\bigl(X_{n_k}\bigr)_{k\ge1}\) of the sequence \((X_k)_{k\ge1}\) we have \(X_{n_k}\stackrel{\mathsf{a.s.}}{\to} X\) as \(k\to\infty\).


Exercise 3.68

If \(X_n\) is any sequence of random variables, show that there are constants \(c_n\to\infty\) such that \(X_n/c_n\to0\) almost surely as \(n\to\infty\).


Exercise 3.69

Let \((\Omega,\mathcal{F},\mathsf{P})\) be the canonical probability space with \(\Omega\equiv[0,1]\), recall Definition 1.9. For \(n\ge1\), put \[a_n:=\sum_{k=1}^n1/k,\qquad b_n:=\sum_{k=1}^n2^{-k},\qquad C_n:=[0,1/n],\qquad D_n:=[0,2^{-n}].\] Consider the random variables \[Z_n^{1}:= \mathbf{1}_{a_n+C_n},\qquad Z_n^{2}:=\mathbf{1}_{b_n+C_n},\qquad Z_n^{3}:=\mathbf{1}_{a_n+D_n},\qquad Z_n^{4}:=\mathbf{1}_{b_n+D_n},\] where the addition is modulo \(1\), ie., \(Z_3^1\equiv\mathbf{1}_{a_3+C_3}\equiv\mathbf{1}_{[\frac{11}6,\frac{13}6]}\equiv\mathbf{1}_{[0,1/6]\cup[5/6,1]}\).
Which of the sequences \(Z_n^k\) converge in probability? Which converge almost surely? Justify your answers.



4 Elements of integration

Goals: Introduce the main ideas behind the Lebesgue integration. Explore the key results in the area - the Monotone Convergence Theorem and the Dominated Convergence Theorem.


In probability theory it is convenient to think of a convergent sequence of random variables as successive approximations to the limit random variable. Given a convergent sequence \((X_n)_{n\ge1}\) of random variables, it is often important to know whether various characteristics of such variables, e.g., moments \(\mathsf{E}((X_n)^k)\) or, more generally, expectations of some function applied to these variables \(\mathsf{E}(f(X_n))\), also converge. As the expectation (equivalently, the integral), in its turn, is constructed through a particular limit procedure, one of the central questions is “do the two limit procedures commute?”, i.e., whether \(\mathsf{E}(\lim_nX_n)=\lim_n\mathsf{E}(X_n)\). The answer depends on the notion of the integral used to compute the expectation \(\mathsf{E}(X)\).

In the simplest case, the (Riemann) integral of a non-negative function (in particular, the expectation of a random variable) can be regarded as the area between the graph of that function and the \(x\)-axis. Lebesgue integration is a mathematical construction that extends the notion of the integral to a larger class of functions; it also extends the domains on which these functions can be defined. As such, the Lebesgue integral plays an important role in real analysis, probability, and many other areas of mathematics.

In this section we introduce the main ideas behind the construction of the Lebesgue integral, and explore and apply some of the key results - the Monotone Convergence Theorem (MON) and the Dominated Convergence Theorem (DOM).

4.1 Integration: Riemann vs. Lebesgue

As part of the general movement towards rigour in mathematics in the nineteenth century, attempts were made to put the integral calculus on a firm foundation. The Riemann integral 23 is one of the most widely known examples. Its definition starts with the construction of a sequence of easily calculated approximating sums which converge to the integral of a given function. This definition is successful in the sense that it gives the anticipated answer for many already-solved problems, and provides useful results for many other problems.

However, although the Riemann integral is naturally linear and monotone, it does not interact well with taking limits of sequences of functions, making such limit functions difficult to analyse (and integrate). 24 The Lebesgue integral is easier to deal with when taking limits under the integral sign; it also allows one to calculate integrals for a broader class of functions. For example, the Dirichlet function, which is \(0\) where its argument is irrational and \(1\) otherwise, is Lebesgue-integrable but not Riemann-integrable, see below.

4.1.1 Riemann integral

Recall that a partition of an interval \([a,b]\) is a finite sequence \[a = x_0 < x_1 < x_2 < \ldots < x_n = b.\] Each \([x_i,x_{i+1}]\) is called a sub-interval of the partition. The mesh of a partition is defined to be the length of the longest sub-interval \([x_i,x_{i+1}]\), that is, \(\max(x_{i+1}-x_i)\) where \(0 \le i \le n - 1\).

Let \(f\) be a real-valued function defined on the interval \([a,b]\). The Riemann sum of \(f\) with respect to the partition \(x_0,\ldots,x_n\) is \[\sum_{i=0}^{n-1} f(y_i) (x_{i+1}-x_i),\] where each \(y_i\) is a fixed point in the sub-interval \([x_i,x_{i+1}]\). Notice that the last expression is the sum of areas of rectangles with heights \(f(y_i)\) and widths \(x_{i+1}-x_i\).

Loosely speaking, the Riemann integral of \(f\) is the limit of the Riemann sums of \(f\) as the partitions get finer and finer (i.e., the mesh goes to zero). Every function \(f\) for which this limit exists and does not depend on the choice of the approximating sequence of partitions and points \(y_i\) is called (Riemann) integrable.
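As a purely illustrative instance of this definition, the sketch below evaluates Riemann sums of \(f(x)=x^2\) on \([0,1]\), taking \(y_i=x_i\) (left endpoints) and uniform partitions of shrinking mesh; the sums approach \(\int_0^1x^2\,dx=\tfrac13\).

```python
def riemann_sum(f, a, b, n):
    """Riemann sum of f over [a, b] for the uniform partition into n sub-intervals,
    with f evaluated at the left endpoint y_i = x_i of each sub-interval."""
    width = (b - a) / n
    return sum(f(a + i * width) * width for i in range(n))

# As the mesh (b - a)/n shrinks, the sums approach the Riemann integral 1/3.
for n in (10, 100, 1_000, 10_000):
    print(f"n={n:>6}:  Riemann sum = {riemann_sum(lambda x: x * x, 0.0, 1.0, n):.6f}")
```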

A useful theory of integration (expectation) needs to possess the following properties.

Linearity:

If \(f(x)\) and \(g(x)\) are integrable functions and \(a\) and \(b\) are real constants, then \(af(x)+bg(x)\) is integrable and \[\int \bigl(af(x)+bg(x)\bigr)dx=a\int f(x)dx+b\int g(x)dx;\]

Monotonicity:

If \(f(x)\le g(x)\) for all \(x\), then \[\int f(x)dx\le\int g(x)dx;\]

Respect of limits:

If \(f_n(x)\to f(x)\) as \(n\to\infty\), then \[\int f_n(x)dx \to \int f(x)dx,\qquad\text{ as }n\to\infty.\]

By construction, the Riemann integral is both linear and monotone. However, it is difficult to define in spaces other than \(\mathbb{R}\), and the Riemann integral does not always respect limits:

Example 4.1 For real \(a\), let \(\delta_a(x)\) be the Kronecker delta-function, \[\delta_a(x)=\begin{cases} 1, & \text{ if }a=x, \\ 0, & \text{ if }a\neq x; \end{cases}\] it is immediate that the Riemann integral of \(\delta_a\) vanishes, \(\int \delta_a(x)dx=0\). On the other hand, the Dirichlet function \(D(x)\), or the indicator function of the rational numbers, \[D(x)=\begin{cases} 1, & \text{ if }x\in\mathbb{Q}, \\ 0, & \text{ if }x\notin\mathbb{Q}, \end{cases}\] is clearly not Riemann-integrable.

At the same time, \(D(x)\) can be written as a limit of a point-wise increasing sequence of functions, each of which has zero Riemann integral on each fixed interval of finite length. Indeed, let \(a\) and \(b\) be any fixed real numbers such that \(a<b\). For \(m\in\mathbb{N}\), let \(\mathbb{Q}_m\) be the set of all rational numbers with denominator at most \(m\), \[\mathbb{Q}_m:=\bigl\{x\in\mathbb{R}:kx\in\mathbb{Z} \text{ for some }k=1,\dots,m\bigr\};\] e.g., for \(m=3\) we have \(\mathbb{Q}_3\cap(0,1]=\{\tfrac13,\tfrac12,\tfrac23,1\}\). For each \(m\ge1\), let \[f_m(x):=\sum_{q\in\mathbb{Q}_m}\delta_q(x)\] be the indicator function of the set \(\mathbb{Q}_m\), so that \(f_m(x)\in\{0,1\}\) for each \(x\in\mathbb{R}\). As the sets \(\mathbb{Q}_m\) are increasing in \(m\), for each fixed \(x\in\mathbb{R}\) the sequence \(\bigl(f_m(x)\bigr)_{m\ge1}\) is a non-decreasing sequence (of numbers in \(\{0,1\}\)), \(f_m\nearrow\) as \(m\to\infty\). At the same time, for each \(m\ge1\), we have 25

\(\int_a^bf_m(x)dx=0\) and for each \(x\in\mathbb{R}\) we have \(f_m(x)\nearrow D(x)\) as \(m\to\infty\), while \(\int_a^bD(x)dx\) is not defined. \(\vartriangleleft\)
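The sets \(\mathbb{Q}_m\) are easy to enumerate on \((0,1]\); the short sketch below (illustration only) lists \(\mathbb{Q}_m\cap(0,1]\) for small \(m\) and checks that the sets are nested, which is exactly why the indicators \(f_m\) increase pointwise towards the Dirichlet function \(D\).

```python
from fractions import Fraction

def Q_m_on_unit_interval(m):
    """Rationals x in (0, 1] with kx an integer for some k = 1, ..., m."""
    return {Fraction(j, k) for k in range(1, m + 1) for j in range(1, k + 1)}

previous = set()
for m in range(1, 5):
    current = Q_m_on_unit_interval(m)
    assert previous <= current        # Q_m increases with m, so f_m is non-decreasing in m
    print(f"m={m}: {[str(q) for q in sorted(current)]}")
    previous = current
```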


The Lebesgue integral is a generalization of the Riemann integral. It uses a much more refined construction, which allows one to integrate over different spaces and, under suitable conditions, allows one to interchange limits and integrals. The construction of the Lebesgue integral is via Measure theory, which was created to provide a more detailed analysis of lengths of subsets of the real line, and more generally, area and volume for subsets of Euclidean spaces. The Lebesgue integral is extremely important in many fields, including probability, where we can integrate with respect to the probability measure. This is an object you should already be familiar with - the expectation. In fact, part of this theory allows for the construction of probability measures on uncountable spaces, such as the uniform probability measure on \([0,1]\).

It is possible that some of you will encounter this theory in more detail in later courses, such as Analysis 3/4. The optional section  4.1.2 below mentions the key steps of the construction; its content will not be examined, but you might find it useful.

advanced

4.1.2 Lebesgue integral: sketch of the construction

The modern approach to the theory of Lebesgue integration has two distinct parts:
a) a theory of measurable sets and measures of these sets;
b) a theory of measurable functions and integrals of these functions.

Measure theory initially was created to provide a detailed analysis of the notion of length of subsets of the real line and, more generally, area and volume of subsets of Euclidean spaces. In particular, it provided a systematic answer to the question of which subsets of \(\mathbb{R}\) have a length. As was shown by later developments in set theory, it is actually impossible to assign a length to all subsets of \(\mathbb{R}\) in a way which preserves some natural additivity and translation invariance properties. This suggests that picking out a suitable class of measurable subsets is an essential prerequisite.

The modern approach to measure and integration is axiomatic. One defines a measure as a map \(\mu\) from a \(\sigma\)-field \(\mathcal{A}\) of subsets of a set \(E\) to \([0,+\infty]\), which satisfies a certain list of properties, similar to those of the probability space. These properties can be shown to hold in many different cases.

Integration. In the Lebesgue theory, integrals are limited to a class of functions called measurable functions. Let \(E\) be a set and let \(\mathcal{A}\) be a \(\sigma\)-field of subsets 26 of \(E\). A function \(f:E\to\mathbb{R}\) is measurable if the pre-image of any closed interval \([a,b]\subset\mathbb{R}\) is in \(\mathcal{A}\), i.e., all \(f^{-1}([a,b]) \in \mathcal{A}\). The set of measurable functions is naturally closed under algebraic operations; in addition (and more importantly) this class is closed under various kinds of point-wise sequential limits; eg., if the sequence \(\{f_k\}_{k\in\mathbb{N}}\) consists of measurable functions, then both \[\liminf_{k\to\infty} f_k \quad\text{ and }\quad \limsup_{k\to\infty} f_k\] are measurable functions.
Let a measure space \((E,\mathcal{A},\mu)\) be fixed. The Lebesgue integral \(\int_E f d \mu\) for measurable functions \(f:E\to\mathbb{R}\) is constructed in stages:

Indicator functions: If \(S\in\mathcal{A}\), ie., the set \(S\) is measurable, define the integral of its indicator function 27 \(\mathbf{1}_S\) via \[\int \mathbf{1}_S d \mu = \mu (S).\]

Simple functions: for non-negative simple functions, equivalently, linear combinations of indicator functions \(f=\sum_k a_k \mathbf{1}_{S_k}\) (where the sum is finite and all \(a_k\ge0\)), we use linearity to define 28 \[\mu(f)\equiv\int\Bigl(\sum_k a_k\mathbf{1}_{S_k}\Bigr)d\mu=\sum_ka_k\int\mathbf{1}_{S_k}d \mu=\sum_ka_k\mu(S_k).\] This construction is obviously linear and monotone. Moreover, even if a simple function can be written as \(\sum_k a_k \mathbf{1}_{S_k}\) in many ways, its integral always gives the same value. 29

Non-negative functions: Let \(f:E\to[0,+\infty]\) be measurable. We put \[\int_E fd\mu := \sup\Bigl\{\int_E hd\mu : h\le f, 0\le h \mbox{ simple}\Bigr\}\] We need to check whether this construction is consistent, ie., if \(0\le f\) is simple we need to verify whether this definition coincides with the preceding one. Another question is: if \(f\) as above is Riemann-integrable, does this definition give the same value of the integral? It is not hard to prove that the answer to both questions is yes.

Clearly, if \(f:E\to[0,+\infty]\) is an arbitrary measurable function, its integral \(\int fd\mu\) may be infinite.

Signed functions: If \(f:E\to[-\infty,+\infty]\) is measurable, 30 we decompose it into the positive and negative parts, \(f = f^+ - f^-\), where \[f^+(x) = \begin{cases} f(x) & \mbox{if} \quad f(x) > 0, \\ 0 & \mbox{otherwise}, \end{cases} \qquad f^-(x) = \begin{cases} -f(x) & \mbox{if} \quad f(x) < 0, \\ 0 & \mbox{otherwise}. \end{cases}\] Notice that the functions \(f^+\ge0\) and \(f^-\ge0\) satisfy \(|f| = f^+ + f^-\). If \(\int|f|d\mu\) is finite, then \(f\) is called Lebesgue integrable. In this case, both integrals \(\int f^+d\mu\) and \(\int f^-d\mu\) converge, and it makes sense to define \[\int f d \mu = \int f^+ d \mu - \int f^- d \mu.\]

It turns out that this definition gives the desirable properties of the integral, namely, linearity, monotonicity and regularity when taking limits. The functions which can be obtained from the above construction are called Borel functions. 31 The class of Borel functions is very large and is sufficient for most practical considerations. 32
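On a finite set the staged construction can be carried out literally, which may help fix the ideas. The sketch below is a toy illustration only (it assumes \(E\) is finite, so every subset is measurable and every function is simple); it integrates an indicator and a signed function against a measure given by point masses.

```python
# Toy Lebesgue integration on a finite set E: the measure mu is a dictionary of
# point masses, and every function f: E -> R is automatically simple here.
E = {"a", "b", "c", "d"}
mu = {"a": 0.5, "b": 1.0, "c": 0.25, "d": 2.0}

def integral_of_indicator(S):
    """Stage 1:  int 1_S d(mu) = mu(S)."""
    return sum(mu[x] for x in S)

def integral(f):
    """Final stage:  int f d(mu) = int f^+ d(mu) - int f^- d(mu)."""
    positive_part = sum(f(x) * mu[x] for x in E if f(x) > 0)
    negative_part = sum(-f(x) * mu[x] for x in E if f(x) < 0)
    return positive_part - negative_part

print(integral_of_indicator({"a", "d"}))                        # mu({a, d}) = 2.5
print(integral(lambda x: 3.0 if x in {"a", "b"} else -1.0))     # 3*1.5 - 1*2.25 = 2.25
```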

A careful construction of Lebesgue integration is done in more advanced courses, eg., Analysis 3/4. Notice however, that in the case of discrete spaces (eg., integer lattice), such a construction can be achieved by using ideas at the level of Analysis 1; for details, see (optional) Section  4.3 below.


4.2 Lebesgue integral: main results

A recurring question that one often finds in probability is the following: suppose a sequence of random variables converges; do their expectations converge to the expectation of their limit? That is, assuming \(X_n \to X\) in some sense, is it true that \(\mathsf{E}[X_n] \to \mathsf{E}[X]\)?

According to Example 3.23, the answer is: “not always”. Indeed, if a sequence of random variables \((X_n)_{n\ge1}\) in the canonical probability space \(([0,1],\mathcal{F},\mathsf{P})\) is given via \(X_n(\omega):= e^n\cdot\mathbf{1}_{[0,1/n]}(\omega)\), then \(X_n\stackrel{\mathsf{a.s.}}{\to} X\equiv0\), so that \[\mathsf{E}\bigl(\lim_{n\to\infty}X_n\bigr)=0,\qquad\text{ while }\qquad \lim_{n\to\infty}\mathsf{E}(X_n)=+\infty.\] This example illustrates that we need to be very careful when interchanging limits and expectations! The following theorems provide sufficient conditions when we can perform such an operation. These are stated without proof for expectations of random variables; suitable analogues also hold for integrals of Borel functions; see Analysis 3/4.

Theorem 4.2 (Monotone Convergence Theorem, MON) Let random variables \(X_n\ge0\) be such that \(X_n\nearrow X\) as \(n\to\infty\); namely, for each \(\omega\in\Omega\) the real sequence \(X_n(\omega)\) is non-decreasing and converges to \(X(\omega)\) as \(n\to\infty\). Then \(\mathsf{E}(X_n)\nearrow\mathsf{E}(X)\le\infty\) as \(n\to\infty\). \(\vartriangleleft\)
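A quick numerical illustration (not a proof, and the choice \(X\sim\mathsf{Exp}(1)\) is arbitrary): the truncations \(X_n:=\min(X,n)\) increase pointwise to \(X\), and their expectations, here \(\mathsf{E}(\min(X,n))=1-e^{-n}\), indeed increase to \(\mathsf{E}(X)=1\).

```python
import math
import random

random.seed(1)

# X ~ Exp(1) and X_n := min(X, n) increase pointwise to X; by (MON) the
# expectations E(X_n) = 1 - exp(-n) increase to E(X) = 1.
sample = [random.expovariate(1.0) for _ in range(200_000)]
for n in (0.5, 1, 2, 4, 8):
    monte_carlo = sum(min(x, n) for x in sample) / len(sample)
    print(f"n={n:>4}:  E(min(X, n)) ~ {monte_carlo:.4f}   (exact: {1 - math.exp(-n):.4f})")
```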


Remark 4.2.1 Notice that there is no integrability condition here; the result holds even if \(\mathsf{E}(X)=\infty\). \(\vartriangleleft\)


Remark 4.2.2 As the value of the Lebesgue integral does not change if the function is modified on a set of probability zero, the assumptions of Theorem 4.2 can be relaxed to their almost sure version: \[\tag{4.1}\label{eq:MON-as-condition} \mathsf{P}\bigl(\bigl\{\omega:X_n(\omega)\le X_{n+1}(\omega)\text{ for all }n\ge1\bigr\}\bigr)=1 \qquad\text{ and }\qquad \mathsf{P}\bigl(\bigl\{\omega:X_n(\omega)\to X(\omega)\bigr\}\bigr)=1.\] \(\vartriangleleft\)


Example 4.3 Given nonnegative random variables \((Z_k)_{k\ge1}\), define \(R_m:=\sum_{k=1}^{m} Z_k\). The variables \(R_m\ge0\) form a monotone sequence, \(R_m \leq R_{m+1}\) for all \(m\ge1\), with \(R_m\nearrow\sum_{k=1}^{\infty} Z_k\). Then, by (MON), \[\mathsf{E}\Bigl( \sum_{k=1}^{\infty} Z_k\Bigr) =\mathsf{E}\bigl( \lim_{m \to \infty}R_m\bigr)= \lim_{m \to \infty} \mathsf{E}(R_m) = \sum_{k=1}^{\infty} \mathsf{E}(Z_k) .\] \(\vartriangleleft\)


Example 4.4 For a sequence \((A_n)_{n\ge1}\) of events, \(A_n\in\mathcal{F}\), let \(B_n:=\bigcup_{k=1}^nA_k\). Let \(Z_n:=\mathbf{1}_{A_n}\) and, as in Example 4.3, denote \(R_m:=\sum_{k=1}^{m} Z_k\). Then for each \(n\ge1\), \[\mathbf{1}_{B_n}\le\sum_{k=1}^n\mathbf{1}_{A_k}\equiv R_n\nearrow\sum_{k=1}^{\infty} Z_k.\] So \(\mathsf{P}(B_n)\equiv\mathsf{E}(\mathbf{1}_{B_n})\le\mathsf{E}(R_n)\), which by (MON) is bounded above by \(\mathsf{E}\bigl(\sum_{k=1}^{\infty} Z_k\bigr)=\sum_{k=1}^{\infty}\mathsf{E}(Z_k)=\sum_{k=1}^{\infty}\mathsf{P}(A_k)\), for all \(n\ge1\). On the other hand, \(B_n\nearrow\bigcup_{k\ge1}A_k\), equivalently, \(0\le\mathbf{1}_{B_n}\nearrow\mathbf{1}_{\cup_{k\ge1}A_k}\), as \(n\to\infty\), so that (MON) implies \[\mathsf{P}\bigl(\bigcup_{k\ge1}A_k\bigr)\equiv\mathsf{E}\bigl(\mathbf{1}_{\bigcup_{k\ge1}A_k}\bigr)\le\sum_{k=1}^{\infty}\mathsf{P}(A_k).\] \(\vartriangleleft\)


Corollary 4.5 Let random variables \(X\) and \((X_n)_{n\ge1}\) be such that \(X_n\searrow X\) as \(n\to\infty\). If \(X_1\) is integrable, i.e., \(\mathsf{E}(X_1)\) is finite, then \(\mathsf{E}(X_n)\searrow\mathsf{E}(X)\) as \(n\to\infty\). \(\vartriangleleft\)


Remark 4.5.1 As in Remark 4.2.2, the result holds if the assumptions are relaxed to their almost sure versions: \[\mathsf{P}\bigl(\bigl\{\omega:X_n(\omega)\ge X_{n+1}(\omega)\text{ for all }n\ge1\bigr\}\bigr)=1 \qquad\text{ and }\qquad \mathsf{P}\bigl(\bigl\{\omega:X_n(\omega)\to X(\omega)\bigr\}\bigr)=1.\] \(\vartriangleleft\)


Example 4.6 In the setting of Example 4.3, \(0\le R_n\nearrow R:=\sum_{n\ge1}Z_n\). In terms of \(X_n:= e^{-R_n}\), Corollary 4.5 implies \(\mathsf{E}(e^{-R_n})\searrow\mathsf{E}(e^{-R})\). The distribution of the limit variable \(R\) can be bounded along the lines of the general Markov inequality: for each \(K>0\), \[e^{-K}\mathsf{P}(R\le K) \le \mathsf{E}\bigl(e^{-R}\mathbf{1}_{R\le K}\bigr)\le \mathsf{E}\bigl(e^{-R}\bigr),\] implying that \(\mathsf{P}(R\le K)\le e^K \mathsf{E}(e^{-R})\).
Hence, if \(\mathsf{E}(e^{-R})=0\), then \(\mathsf{P}(R\le K)=0\) for every \(K>0\), and letting \(K\to\infty\) along a monotone sequence (continuity of probability) we deduce that \(R\) is infinite almost surely. \(\vartriangleleft\)


Proof of Corollary 4.5. Let \(Y_n:= X_1-X_n\ge0\). Then \(Y_n\nearrow X_1-X\), and so Theorem 4.2 implies that \[\mathsf{E}(X_1)-\mathsf{E}(X_n)=\mathsf{E}(X_1-X_n)=\mathsf{E}(Y_n)\nearrow\mathsf{E}(X_1-X)=\mathsf{E}(X_1)-\mathsf{E}(X).\] As \(\mathsf{E}(X_1)\) is finite, we deduce that \(\mathsf{E}(X_n)\searrow\mathsf{E}(X)\) as \(n\to\infty\). \(\blacksquare\)


Expectations of random variables can also converge when the monotonicity assumption does not hold:

Theorem 4.7 (Dominated Convergence Theorem, DOM) Let random variables \(X\), \(Y\), and \((X_n)_{n\ge1}\) be such that for all \(\omega\in\Omega\) \[X_n(\omega)\to X(\omega)\qquad\text{ and }\qquad \bigl|X_n(\omega)\bigr|\le Y(\omega),\quad\text{ for all }n\ge1.\] If \(\mathsf{E}(Y)<\infty\), then \(\mathsf{E}(X_n)\to\mathsf{E}(X)\) as \(n\to\infty\). \(\vartriangleleft\)


Remark 4.7.1 As in Remark 4.2.2, the result holds if the assumptions are relaxed to their almost sure versions: \[\mathsf{P}\bigl(\bigl\{\omega:|X_n(\omega)|\le Y(\omega)\text{ for all }n\ge1\bigr\}\bigr)=1 \qquad\text{ and }\qquad \mathsf{P}\bigl(\bigl\{\omega:X_n(\omega)\to X(\omega)\bigr\}\bigr)=1.\] \(\vartriangleleft\)


This has a useful corollary:

Theorem 4.8 (Bounded Convergence Theorem, BDD) Let random variables \(X\) and \((X_n)_{n\ge1}\), and a finite constant \(M\ge0\) be such that \(\bigl|X_n(\omega)\bigr|\le M\) for all \(\omega\in\Omega\) and \(n\ge1\). If \(X_n(\omega)\to X(\omega)\) for each \(\omega\in\Omega\), then \(\mathsf{E}(X_n)\to\mathsf{E}(X)\) as \(n\to\infty\). \(\vartriangleleft\)


Remark 4.8.1 As in Remark 4.2.2, the result holds if the assumptions are relaxed to their almost sure versions: \[\mathsf{P}\bigl(\bigl\{\omega:|X_n(\omega)|\le M\text{ for all }n\ge1\bigr\}\bigr)=1 \qquad\text{ and }\qquad \mathsf{P}\bigl(\bigl\{\omega:X_n(\omega)\to X(\omega)\bigr\}\bigr)=1.\] \(\vartriangleleft\)


Remark 4.8.2 (BDD) also holds under the relaxed assumption \(X_n\stackrel{\mathsf{P}}{\to} X\), see Example 4.12. \(\vartriangleleft\)


Proof of Theorem 4.8. Apply Theorem 4.7 to the random variables \(X\), \((X_n)_{n\ge1}\) and \(Y(\omega)\equiv M\). \(\blacksquare\)


Example 4.9 Given a random variable \(X\ge0\) with \(\mathsf{E}(X^2)<\infty\), define \((Y_k)_{k\ge1}\) via \[Y_k:= X^2\mathbf{1}_{X>k}\equiv\begin{cases} X^2, & \text{ if } X>k,\\ 0, & \text{ otherwise.} \end{cases}\] Notice that \(|Y_k(\omega)|\le X^2(\omega)\) for all \(k\ge1\) and \(\omega\in\Omega\), while \(Y_k\stackrel{\mathsf{a.s.}}{\to}0\) and \(X^2\) is integrable. Therefore, (DOM) implies \[k^2\mathsf{P}(X>k)\le\mathsf{E}(X^2\mathbf{1}_{X>k})=\mathsf{E}(Y_k)\to0\qquad\text{ as }k\to\infty.\] In particular, \(\mathsf{P}(X>k)\) decays faster than \(1/k^2\). Notice that this is a better estimate than \[\mathsf{P}(X>k)\le\frac{\mathsf{E}(X^2)}{k^2},\] implied by the general Markov inequality. \(\vartriangleleft\)
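For a concrete distribution the improvement is easy to tabulate; the sketch below is an illustration only, with the arbitrary choice \(X\sim\mathsf{Exp}(1)\), for which \(\mathsf{P}(X>k)=e^{-k}\) and \(\mathsf{E}(X^2)=2\).

```python
import math

# X ~ Exp(1): P(X > k) = exp(-k) and E(X^2) = 2.
# Example 4.9 gives k^2 P(X > k) -> 0, i.e. the tail decays faster than E(X^2)/k^2,
# whereas Markov's inequality alone only bounds k^2 P(X > k) by E(X^2) = 2.
for k in (1, 2, 5, 10, 20):
    print(f"k={k:>2}:  k^2 * P(X > k) = {k * k * math.exp(-k):.6f}")
```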


Example 4.10 Let \(X\), \(Y\), \((X_n)_{n\ge1}\) be random variables such that \(X_n \stackrel{\mathsf{a.s.}}{\to} X\) and \(\mathsf{P}(|X_n| \leq Y\text{ for all }n\ge1)=1\), where \(\mathsf{E} Y < \infty\). We show that \(X_n \stackrel{\mathsf{L}^1}{\to} X\).
Solution. Fix arbitrary \(\omega\) such that \(|X_n(\omega)|\le Y(\omega)\) for all \(n\ge1\) and \(X_n(\omega)\to X(\omega)\) as \(n\to\infty\). Then \(|X(\omega)|\le Y(\omega)\).
Denote \(Z_n:= |X_n-X|\). We have \[\mathsf{P}\bigl(\bigl\{\omega:|Z_n(\omega)|\le 2Y(\omega)\text{ for all }n\ge1\bigr\}\bigr)=1 \qquad\text{ and }\qquad \mathsf{P}\bigl(\bigl\{\omega:Z_n(\omega)\to 0\bigr\}\bigr)=1.\] As \(\mathsf{E} Y<\infty\), the almost-sure version of (DOM), see Remark 4.7.1, implies that \(\mathsf{E}(Z_n)\to0\) as \(n\to\infty\), equivalently, \(X_n\stackrel{\mathsf{L}^1}{\to} X\). \(\vartriangleleft\)


Example 4.11 For simplicity, let \((X_n)_{n\ge1}\) be such that \(X_n \stackrel{\mathsf{a.s.}}{\to} 0\) as \(n\to\infty\). Then the variables \(Z_n:=|X_n|/(1+|X_n|)\) are uniformly bounded and, by (BDD), \(Z_n\stackrel{\mathsf{L}^1}{\to}0\). As a result \(Z_n\stackrel{\mathsf{P}}{\to}0\), and it is straightforward to show that then \(X_n\stackrel{\mathsf{P}}{\to}0\) as well, see Exercise  4.80 for details. This gives an alternative to Example 3.24. \(\vartriangleleft\)


Example 4.12 Let random variables \((X_n)_{n\ge1}\) be bounded, \(|X_n(\omega)|\le M\) for a constant \(M<\infty\), while \(X_n \stackrel{\mathsf{P}}{\to} X\) as \(n\to\infty\). Then the limit is also bounded, \(|X(\omega)|\le M\), and \[\mathsf{E}|X_n-X|\le\delta\mathsf{P}\bigl(|X_n-X|\le\delta\bigr)+2M\mathsf{P}\bigl(|X_n-X|>\delta\bigr)\] for each \(\delta>0\). As \(X_n \stackrel{\mathsf{P}}{\to} X\), the RHS above is smaller than \(2\delta\) for all \(n\) large enough; since \(\delta>0\) is arbitrary, this gives \(\mathsf{E}|X_n-X|\to0\), ie., \(X_n\stackrel{\mathsf{L}^1}{\to} X\) as \(n\to\infty\). This result partially complements Example 3.8. \(\vartriangleleft\)


advanced

4.3 Integral calculus of sequences

The following facts show that the “integral calculus for sequences” is Analysis 1 level material:

Lemma 4.13 Let \(\mathcal{S}=(s_{m,n})_{m,n\ge1}\) be a collection of numbers in \(\overline{\mathbb{R}}\equiv[-\infty,+\infty]\) which is increasing in both indices \(m\) and \(n\), ie., as soon as \(j\le m\) and \(k\le n\), we have \(s_{j,k}\le s_{m,n}\). Then \[\lim_{m\to\infty}\lim_{n\to\infty}s_{m,n}=\lim_{n\to\infty}\lim_{m\to\infty}s_{m,n}=\sup\mathcal{S}.\] \(\vartriangleleft\)


Remark 4.13.1 In other words, interchanging the order of limits does not change the result! \(\vartriangleleft\)


Proof An easy exercise using definitions of \(\lim\) and \(\sup\). \(\blacksquare\)


Lemma 4.14 Let \(\mathcal{A}=(a_{m,n})_{m,n\ge1}\) be a collection of numbers in \(\overline{\mathbb{R}}^+\equiv[0,+\infty]\). Then \[\sum_{n=1}^\infty\sum_{m=1}^\infty a_{m,n}=\sum_{m=1}^\infty\sum_{n=1}^\infty a_{m,n}=\sup\mathcal{S},\] where \(\mathcal{S}\) is the set of all sums of finitely many elements of \(\mathcal{A}\). \(\vartriangleleft\)


Remark 4.14.1 In other words, iterated sums of non-negative numbers can be summed in any order. You had a similar statement for multiple integrals in the first year; it is often referred to as the Fubini theorem (for non-negative sums). \(\vartriangleleft\)


Proof Just consider all sums \(s_{m,n}=\sum_{i=1}^m\sum_{j=1}^na_{i,j}\) and use Lemma 4.13. \(\blacksquare\)


Lemma 4.15 Let \((a_{m,n})_{m,n\ge1}\) be a collection of numbers in \(\overline{\mathbb{R}}^+\equiv[0,+\infty]\), which is increasing in the second index \(n\), ie., for every fixed \(m\in\mathbb{N}\), the inequality \(a_{m,k}\le a_{m,n}\) holds provided \(k\le n\). Then \[\lim_{n\to\infty}\sum_{m=1}^\infty a_{m,n}=\sum_{m=1}^\infty\lim_{n\to\infty}a_{m,n}.\] \(\vartriangleleft\)


Remark 4.15.1 If the functions \(f_n:\mathbb{N}\to\overline{\mathbb{R}}^+\) are defined via \(f_n(m)=a_{m,n}\), they form a point-wise monotone sequence (ie., for every fixed \(m\in\mathbb{N}\), we have \(f_n(m)\le f_{n+1}(m)\) for all \(n\ge1\)); the statement above says that the limit of the sum (integral) equals the sum (integral) of limits. In other words, Lemma 4.15 is the Monotone Convergence Theorem for sequences. \(\vartriangleleft\)


Proof Put \(s_{m,n}=\sum_{l=1}^ma_{l,n}\) and use Lemma 4.13. \(\blacksquare\)


Lemma 4.16 Let \(\bigl(a_{m,n}\bigr)_{m,n\ge1}\), \(\bigl(a_m\bigr)_{m\ge1}\) and \(\bigl(b_m\bigr)_{m\ge1}\) be collections of numbers such that for every fixed \(m\in\mathbb{N}\), we have \[\lim_{n\to\infty}a_{m,n}=a_m, \qquad \bigl|a_{m,n}\bigr|\le b_m, \qquad\text{ and }\qquad \sum_mb_m<\infty.\] Then \[\lim_{n\to\infty}\sum_{m=1}^\infty a_{m,n}=\sum_{m=1}^\infty a_m=\sum_{m=1}^\infty\lim_{n\to\infty}a_{m,n}.\] \(\vartriangleleft\)


Remark 4.16.1 This is just the Dominated Convergence Theorem for sequences! \(\vartriangleleft\)


Proof Fix arbitrary \(\varepsilon>0\). By assumption, choosing \(M\) large enough, we can get \[\sum_{m>M}\bigl|a_{m,n}-a_m\bigr|\le2\sum_{m>M}b_m<\tfrac\varepsilon2.\] For a finite \(M\) with this property, we can find \(n\) large enough so that \(\sum_{m=1}^M\bigl|a_{m,n}-a_m\bigr|<\varepsilon/2\). Since \(\varepsilon>0\) is arbitrary, the result follows. \(\blacksquare\)


Notice that for non-negative functions on \(\mathbb{N}\), the sum is linear, monotone and respects limits; in other words, in this case the integral calculus reduces to a calculus of sums!


The Riemann integral was suggested by Bernhard Riemann (1826-1866) as the first rigorous attempt at defining the integral of a function on an interval. Since then, many different approaches to integration of functions have been suggested. For example, the Riemann sums are often sandwiched between the lower and upper Darboux sums, which were introduced by Jean-Gaston Darboux (1842-1917).

The Lebesgue integral discussed here was suggested by Henri Lebesgue (1875-1941) and is widely used in probability and analysis.

The Kronecker delta was introduced by Leopold Kronecker (1823-1891). Both Kronecker and Riemann were students of Peter Gustav Lejeune Dirichlet (1805-1859), who is known for contributions to analysis and number theory. Dirichlet was the first to use the pigeonhole principle in a mathematical argument.

(left to right) Dirichlet, Darboux, Kronecker, Lebesgue, and Riemann

checklist

By the end of this section you should be able to:


Exercise 4.70

Let \(Z_1, Z_2, \dots\) be random variables s.t. \(\mathsf{E}\Bigl(\sum\limits_{i=1}^\infty |Z_i|\Bigr)< \infty\). Show that \(\mathsf{E} \Bigl( \sum\limits_{i=1}^{\infty} Z_i \Bigr) = \sum\limits_{i=1}^{\infty} \mathsf{E}(Z_i)\).


Exercise 4.71

Let \((X_n)_{n\ge1}\) be random variables such that \(X_n\sim\mathsf{Exp}(\lambda_n)\), i.e., \(\mathsf{P}(X_n>a)=e^{-\lambda_na}\) for all \(a\ge0\). If \(\sum_{n\ge1}\tfrac1{\lambda_n}<\infty\), show that \(\mathsf{P}(\sum_{n\ge1}X_n<\infty)=1\).


Exercise 4.72

Let \(\Omega = \mathbb{N}\) with \(\mathsf{P}(\{ \omega \})=2^{- \omega}\) for \(\omega \in \Omega\). Let \(X_n(\omega)=2^n\) if \(\omega=n\) and \(X_n(\omega)=0\) otherwise.
a) Show that \(X_n\stackrel{\mathsf{a.s.}}{\to}0\) as \(n\to\infty\) and compute \(\mathsf{E}(X_n)\). Does \(\lim\limits_{n\to\infty}\mathsf{E}(X_n)=\mathsf{E}\bigl(\lim\limits_{n\to\infty}X_n)\)?
b) Is (MON) applicable to this sequence? Justify your answer.
c) Is (BDD) applicable to this sequence? Justify your answer.
d) Is (DOM) applicable to this sequence? Justify your answer.


Exercise 4.73

Let \((X_n)_{n\ge1}\) be random variables such that \(\mathsf{P}\bigl(X_n=n^3\bigr)=\mathsf{P}\bigl(X_n=-n^3\bigr)=\frac1{2n^2}\) and \(\mathsf{P}\bigl(X_n=0\bigr)=1-\frac1{n^2}\).
a) Show that \(X_n\stackrel{\mathsf{a.s.}}{\to}0\) as \(n\to\infty\) and compute \(\mathsf{E}(X_n)\). Does \(\lim\limits_{n\to\infty}\mathsf{E}(X_n)=\mathsf{E}\bigl(\lim\limits_{n\to\infty}X_n)\)?
b) Is (MON) applicable to this sequence? Justify your answer.
c) Is (BDD) applicable to this sequence? Justify your answer.
d) Is (DOM) applicable to this sequence? Justify your answer.


Exercise 4.74

Let \(X\) be a random variable with \(\mathsf{E}|X|<\infty\). Define \[Y_N:= X\land N\equiv\begin{cases}X,&\text{if }X\le N,\\N,&\text{if }X\ge N,\end{cases} \qquad Z_N:= X\mathbf{1}_{|X|\le N}\equiv\begin{cases}X,&\text{if }|X|\le N,\\0,&\text{if }|X|>N.\end{cases}\] a) By using an appropriate limit result, show that \(\mathsf{E}(Y_N)\equiv\mathsf{E}\bigl(X\land N\bigr)\to\mathsf{E}(X)\) and \(\mathsf{E}(Z_N)\equiv\mathsf{E}\bigl[X\mathbf{1}_{|X|\le N}\bigr]\to\mathsf{E}(X)\) as \(N\to\infty\).
b) Find the limit of \(\mathsf{E}\bigl[X\mathbf{1}_{|X|>N}\bigr]\) as \(N\to\infty\).


Exercise 4.75

Let \((\Omega,\mathcal{F},\mathsf{P})\) be the canonical probability space, recall Definition 1.9. Consider the sequence of random variables \[X_n(\omega):= n\mathbf{1}_{(0,1/n)}(\omega)=\begin{cases}n,&\quad \omega\in(0,1/n),\\0,&\quad\text{otherwise.}\end{cases}\] a) Show that \(X_n(\omega)\to X\equiv0\) as \(n\to\infty\) for all \(\omega\in\Omega\), but \(\mathsf{E}(X_n)=1\) and \(\mathsf{E} X=0\).
b) Carefully explain why (MON) does not apply to the sequence \((X_n)_{n\ge1}\).
c) Carefully explain why (BDD) does not apply to the sequence \((X_n)_{n\ge1}\).
d) Carefully explain why (DOM) does not apply to the sequence \((X_n)_{n\ge1}\).


Exercise 4.76

Let \((A_k)_{k\ge1}\) be some events and let the Bernoulli random variables \(\mathbf{1}_{A_k}\) be their indicator functions. The random variables \(N_m(\omega):=\sum_{k=1}^m\mathbf{1}_{A_k}(\omega)\) and \(N(\omega):=\sum_{k=1}^\infty\mathbf{1}_{A_k}(\omega)\equiv\lim_{m\to\infty}N_m(\omega)\) count the number of occurring events among \(A_1\), …, \(A_m\), respectively, the total number of occurring events in the whole sequence \((A_k)_{k\ge1}\). Suppose that \(\sum_k\mathsf{P}(A_k)<\infty\).
a) Show that \(0\le\mathsf{E}(N)\le\sum_k\mathsf{P}(A_k)<\infty\), and therefore that \(N\) is a finite random variable, \(\mathsf{P}(N<\infty)=1\).
b) Show that \(N-N_m:=\lim\limits_{n\to\infty}(N_{m+n}-N_m)\) satisfies \(0\le\mathsf{E} N-\mathsf{E} N_m=\mathsf{E}(N-N_m)\le\sum\limits_{k>m}\mathsf{P}(A_k)\to0\) as \(m\to\infty\).
Notice that part a) is equivalent to the first Borel-Cantelli lemma, while part b) controls the speed of convergence there.


Exercise 4.77

Let \(X_1\), \(X_2\), … be i.i.d. r.v. with \(\mathsf{E}(X_k)=0\) and \(\mathsf{E}\bigl((X_k)^4\bigr)<\infty\). Define random variables \(S_k=\sum_{m=1}^kX_m\), \(Z_n=\sum_{k=1}^n\bigl(S_k/k\bigr)^4\) and \(Z=\sum_{k=1}^\infty\bigl(S_k/k\bigr)^4\).
a) Use (MON) to deduce that \(\mathsf{E}(Z_n)\nearrow\mathsf{E}(Z)\) as \(n\to\infty\); check that \(\mathsf{E}(Z)<\infty\).
b) Deduce that the event \(\bigl\{\omega:Z(\omega)<\infty\bigr\}\subseteq \bigl\{\omega:k^{-1}S_k(\omega)\to0\bigr\}\) has probability one; hence derive the Borel Strong Law of Large Numbers: under the above conditions, \(n^{-1}S_n\stackrel{\mathsf{a.s.}}{\to}0\) as \(n\to\infty\).


Exercise 4.78

Let \(X\sim\mathcal{N}(0,1)\) be a standard Gaussian random variable. Show that as \(a\to\infty\), \[\sqrt{2\pi}ae^{a^2/2}\mathsf{P}(X\ge a)=ae^{a^2/2}\int_a^\infty e^{-x^2/2}dx\to1.\]


Exercise 4.79

Let \((X_n)_{n\ge1}\) be independent random variables such that \(X_n\sim\mathsf{Exp}(\lambda_n)\), ie., \(\mathsf{P}(X_n>a)=e^{-\lambda_na}\) for all \(a\ge0\). If \(\sum_{n\ge1}1/\lambda_n=\sum_{n\ge1}\mathsf{E}(X_n)=\infty\), prove that \(\mathsf{P}\bigl(\sum_nX_n=\infty\bigr)=1\).


Exercise 4.80

Let \(X\) and \((X_n)_{n\ge1}\) be random variables.
a) Suppose that \(X_n \stackrel{\mathsf{a.s.}}{\to} 0\) as \(n\to\infty\), and define \(Z_n:=|X_n|/(1+|X_n|)\). Show that \(Z_n\stackrel{\mathsf{L}^1}{\to}0\) as \(n\to\infty\).
b) In the setting of part a) show that \(Z_n\stackrel{\mathsf{P}}{\to}0\) and deduce that \(X_n\stackrel{\mathsf{P}}{\to}0\) as \(n\to\infty\).
c) Let \(X_n\stackrel{\mathsf{a.s.}}{\to} X\) as \(n\to\infty\). Show that \(X_n\stackrel{\mathsf{P}}{\to} X\) as \(n\to\infty\).


advanced

Optional exercises

Exercise 4.81

Let \(X\) be a finite random variable with values in \(\mathbb{Z}^+=\{0,1,\dots\}\); its generating function \(f_X(u)\) is given by \(f_X(u):=\mathsf{E}\bigl[u^X\bigr]=\sum_{k\ge0}u^k\mathsf{P}(X=k)\). Notice that \(|f_X(u)|\le f_X(1)=1\) if \(|u|\le1\).
a) Use (DOM) to show that for every \(|u|<1\) the derivative \(f'_X(u)\) exists and satisfies \[f'_X(u)=\sum_{k\ge0}ku^{k-1}\mathsf{P}(X=k)\equiv\mathsf{E}\bigl[Xu^{X-1}\bigr].\]
b) Use (MON) to show that \(f'_X(u)\nearrow\sum_{k\ge0}k\mathsf{P}(X=k)\equiv\mathsf{E}[X]\) as \(u\nearrow1\).
c) For integer \(m\ge1\) show that as \(u\nearrow1\), \[f^{(m)}_X(u)\equiv\Bigl(\frac{d}{du}\Bigr)^mf_X(u)\nearrow\mathsf{E}\bigl[X(X-1)\dots(X-m+1)\bigr].\]


Exercise 4.82

a) Let \((X_k)_{k\ge1}\) be a sequence of i.i.d.  bounded random variables, i.e., for some \(K>0\) \(\mathsf{P}(|X|>K)=0\). Show that the \(L^4\)-SLLN applies to this sequence, ie., that \(n^{-1}\sum\limits_{k=1}^nX_k\to\mathsf{E} X\) with probability one.


b) Let \(X>0\) be a finite random variable, \(\mathsf{P}(X<\infty)=1\). For every \(M\in\mathbb{N}\), consider the bounded variable \[X^M:=\min\bigl(X,M\bigr)\equiv\begin{cases}X,&\quad X<M,\\ M,&\quad X\ge M.\end{cases}\] Use (MON) to show that \(\mathsf{E} X^M\to\mathsf{E} X\) as \(M\to\infty\).


c) Let \((X_k)_{k\ge1}\) be a sequence of positive i.i.d. random variables with infinite expectation, \(\mathsf{E} X=\infty\). By using parts a) and b), show that \(n^{-1}\sum_{k=1}^nX_k\to\mathsf{E} X=\infty\) with probability one.

This exercise shows that statistical methods based upon averaging are not very useful in study of random variables with infinite expectation, as in this case the average of \(n\) observations can show more variability than a single observed value!



5 Generating functions

Goals: Define the probability generating functions. Explore their key properties and some of their important applications.


Transformations play a key role in studying probability distributions. For discrete probability distributions, one of the most powerful transformation methods is generating functions. Generating functions are widely used in algebra and more specifically in combinatorics, where they help to simplify laborious and lengthy calculations, and they are also extremely useful in probability.

5.1 Definition and main properties

Definition 5.1 Given a collection of real numbers \((a_k)_{k\ge0}\), the function \[\tag{5.1}\label{eq:generating-function-def} G(s)=G_a(s):=\sum_{k=0}^\infty a_ks^k\] is called the generating function of \((a_k)_{k\ge0}\).


Remark 5.1.1  If the generating function \(G_a(s)\) of \((a_n)_{n\ge0}\) is analytic near the origin (e.g., it is finite for all (complex) \(s\) satisfying \(|s|<R\) for some \(R>0\)), then there is a one-to-one correspondence between \(G_a(s)\) and \((a_n)_{n\ge0}\); namely, \(a_k\) can be recovered via 33 \[\tag{5.2}\label{eq:gen-fn-uniqueness} a_k=\frac{1}{k!}\frac{d^k}{ds^k}G_a(s)\bigm|_{s=0}.\] As a result, one has the uniqueness property of generating functions: if the generating functions \(G_a(s)\) and \(G_b(s)\) of the sequences \((a_n)_{n\ge0}\) and \((b_n)_{n\ge0}\) coincide on a disk of positive radius \(r\), ie., \(G_a(s)=G_b(s)\) for \(|s|<r\), then \(a_n=b_n\) for all \(n\ge0\). \(\vartriangleleft\)


Definition 5.2 If \(X\) is a discrete random variable with values in \(\mathbb{Z}^+:=\{0,1,\dots\}\), its ( probability) generating function, \[\tag{5.3}\label{eq:proba-generating-function} G(s)\equiv G_X(s):=\mathsf{E}\bigl(s^X\bigr)=\sum_{k=0}^\infty s^k\mathsf{P}(X=k),\] is the generating function of the pmf \(\bigl\{p_k\bigr\}\equiv\bigl\{\mathsf{P}(X=k)\bigr\}\) of \(X\).


Remark 5.2.1 For \(|s|\le1\) the generating function \(G_X(s)\) is bounded, \(|G_X(s)|\le\sum_{k\ge0}\mathsf{P}(X=k)\le1\). Consequently, each probability generating function:
1) can be differentiated or integrated term-by-term any number of times at each \(s\) such that \(|s|<1\);
2) possesses the uniqueness property from Remark 5.1.1;
3) satisfies Abel’s theorem, 34 namely: if the sequence \((a_n)_{n\ge0}\) is non-negative and its generating function \(G_a(s)\) is finite for \(|s|<1\), then \(\lim_{s \nearrow 1} G_a(s) =\sum_{n\ge0}a_n\), whether the sum is finite or equals \(+\infty\). This standard result is useful when the radius of convergence of \(G_a(s)\) equals \(1\), as then one has no a priori reason to expect that the limit as \(s\nearrow1\) exists. \(\vartriangleleft\)


Example 5.3 It is straightforward to derive probability generating functions of many distributions, e.g.:
1) If \(\mathsf{P}(X=c)=1\) for some \(c\in\mathbb{Z}^+\), then \(G(s)=\mathsf{E}(s^X)=s^c\).
2) If \(X\sim\mathsf{Ber}(p)\), then \(G(s)=\mathsf{E}(s^X)=(1-p)s^0+p s=(1-p)+ps\).
3) If \(X\sim\mathsf{Bin}(n,p)\), then \(G(s)=\mathsf{E}(s^X)=\sum_{k=0}^n\binom{n}kp^k(1-p)^{n-k}s^k=(1-p+ps)^n\).
4) If \(X\sim\mathsf{Poi}(\lambda)\), then \(G(s)=\mathsf{E}(s^X)=\sum_{k\ge0}\tfrac{\lambda^k}{k!}e^{-\lambda}s^k=e^{\lambda(s-1)}\). \(\vartriangleleft\)
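These closed forms are easy to sanity-check numerically; the sketch below (illustration only, with arbitrary parameter values) compares the truncated series \(\sum_k s^k\mathsf{P}(X=k)\) with the formulas for the binomial and Poisson cases.

```python
import math

def pgf_binomial(n, p, s):
    """The series sum_k s^k P(X = k) for X ~ Bin(n, p)."""
    return sum(math.comb(n, k) * p**k * (1 - p)**(n - k) * s**k for k in range(n + 1))

def pgf_poisson(lam, s, terms=200):
    """Truncated series sum_k s^k P(X = k) for X ~ Poi(lam)."""
    return sum(math.exp(-lam) * lam**k / math.factorial(k) * s**k for k in range(terms))

s, n, p, lam = 0.7, 10, 0.3, 2.5
print(pgf_binomial(n, p, s), (1 - p + p * s)**n)       # both equal (1 - p + ps)^n
print(pgf_poisson(lam, s), math.exp(lam * (s - 1)))    # both equal exp(lam (s - 1))
```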


Theorem 5.4 If \(X\) and \(Y\) are independent random variables with values in \(\{0,1,2,\dots\}\) and \(Z:= X+Y\), then their generating functions satisfy \[G_Z(s)=G_{X+Y}(s)=G_X(s)G_Y(s).\] \(\vartriangleleft\)


Proof Recall: if \(X\) and \(Y\) are independent discrete random variables, and \(f\), \(g:\mathbb{Z}^+\to\mathbb{R}\) are arbitrary functions, then \(f(X)\) and \(g(Y)\) are independent random variables and \(\mathsf{E}\bigl(f(X)g(Y)\bigr)=\mathsf{E}(f(X))\cdot\mathsf{E}(g(Y))\). Now take \(f(X)=s^X\) and \(g(Y)=s^Y\). \(\blacksquare\)


Example 5.5 Let \(X\sim\mathsf{Poi}(\lambda)\) and \(Y\sim\mathsf{Poi}(\mu)\) be independent. Then \(Z=X+Y\) is \(\mathsf{Poi}(\lambda+\mu)\).
Solution. By Example 5.3 and Theorem 5.4, we get \(G_Z(s)=G_X(s)G_Y(s)=e^{\lambda(s-1)}e^{\mu(s-1)}\equiv e^{(\lambda+\mu)(s-1)}\); the result follows by uniqueness. \(\vartriangleleft\)


Exercise 5.83

If \(X\sim\mathsf{Bin}(n,p)\) and \(Y\sim\mathsf{Bin}(m,p)\) are independent, show that \(X+Y\sim\mathsf{Bin}(n+m,p)\).


Example 5.6 If \((X_k)_{k=1}^n\) are i.i.d. random variables with values in \(\{0,1,2,\dots\}\) and if \(S_n=X_1+\dots+X_n\), then \(G_{S_n}(s)=G_{X_1}(s)\dots G_{X_n}(s)\equiv\bigl(G_X(s)\bigr)^n\). This follows from Theorem 5.4 by induction. \(\vartriangleleft\)


Exercise 5.84

If \((X_k)_{k=1}^n\) are i.i.d. random variables with \(X_1\sim\mathsf{Ber}(p)\), show that \(Y:=\sum_{k=1}^nX_k\sim\mathsf{Bin}(n,p)\).


Definition 5.7 A sequence \((c_n)_{n\ge0}\) is the convolution of \((a_k)_{k\ge0}\) and \((b_m)_{m\ge0}\) (write \(c=a\star b\)), if \[\tag{5.4}\label{eq:convolution} c_n=\sum_{k=0}^na_kb_{n-k},\qquad n\ge0.\]


Remark 5.7.1 If \(X\) and \(Y\) are independent variables in \(\{0,1,2,\ldots\}\) and \(Z=X+Y\), then for each \(n\ge0\) \[\mathsf{P}(Z=n)=\mathsf{P}(X+Y=n)=\sum_{k=0}^n\mathsf{P}(X=k)\mathsf{P}(Y=n-k),\] i.e., the distribution of the independent sum \(Z=X+Y\) is the convolution of the distributions of \(X\) and \(Y\). \(\vartriangleleft\)


The key property of convolutions is a generalisation of Theorem 5.4:

Theorem 5.8 If sequences \((a_k)_{k\ge0}\), \((b_m)_{m\ge0}\), and \((c_n)_{n\ge0}\) are such that \(c=a\star b\), then their respective generating functions \(G_c(s)\), \(G_a(s)\), and \(G_b(s)\) satisfy \(G_c(s)=G_a(s)G_b(s)\). \(\vartriangleleft\)


Exercise 5.85

Prove Theorem 5.8 by verifying that the power series \(G_c(s)\) and \(G_a(s)G_b(s)\) have the same coefficients.


The advantage of the Convolution theorem is that various properties of convolutions, which are challenging to verify, can be reduced to multiplication of generating functions, which is often simpler:

Exercise 5.86

For \(n\in\mathbb{N}\), let \((a_k)_{k\ge0}\) satisfy \(a_k=\binom{n}k\). Use Theorem 5.8 to show that \[\sum_{k=0}^n\binom{n}k^2=\sum_{k=0}^n\binom{n}k\binom{n}{n-k}=\binom{2n}n.\]


A probability generating function \(G_X(s)\) can be used to compute moments \(\mathsf{E}(X^k)\) of \(X\):

Theorem 5.9 If \(X\) has generating function \(G(s)\), then the \(k\)th factorial moment of \(X\) satisfies \[\mathsf{E}\bigl[X(X-1)\dots(X-k+1)\bigr]=\frac{d^k}{ds^k}G(1_-):=\lim_{s\nearrow1}\frac{d^k}{ds^k}G(s).\] \(\vartriangleleft\)


Proof Recall Remark 5.2.1. Fix \(s\in(0,1)\) and differentiate \(G(s)\) \(k\) times to get \[\frac{d^k}{ds^k}G(s)=\mathsf{E}\bigl[s^{X-k}X(X-1)\dots(X-k+1)\bigr].\] Taking the limit \(s\nearrow1\) and using Abel’s theorem, we obtain the result. \(\blacksquare\)


Remark 5.9.1 The usual (also known as polynomial) moments can be computed similarly. E.g., the second moment and the variance of \(X\) satisfy \[\tag{5.5}\label{eq:variance-via-gen-fns} \mathsf{E}(X^2)=G_X''(1)+G_X'(1),\qquad \mathsf{Var}(X)=G_X''(1)+G_X'(1)-\bigl(G_X'(1)\bigr)^2.\] \(\vartriangleleft\)
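As a worked check of (\ref{eq:variance-via-gen-fns}), one can differentiate a generating function symbolically. The sketch below uses the sympy library (an external dependency chosen only for illustration) to recover the mean \(np\) and the variance \(np(1-p)\) of \(\mathsf{Bin}(n,p)\) from \(G(s)=(1-p+ps)^n\).

```python
import sympy as sp

s, p, n = sp.symbols('s p n', positive=True)
G = (1 - p + p * s)**n                      # pgf of Bin(n, p), from Example 5.3

G1 = sp.diff(G, s).subs(s, 1)               # G'(1)  = E(X)
G2 = sp.diff(G, s, 2).subs(s, 1)            # G''(1) = E(X(X - 1))
mean = sp.simplify(G1)                      # n*p
variance = sp.simplify(G2 + G1 - G1**2)     # formula (5.5): n*p*(1 - p)
print(mean, variance)
```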


Remark 5.9.2 We also have \[\lim\limits_{s\nearrow1}G_X(s)\equiv\lim\limits_{s\nearrow1}\mathsf{E}[s^X]=\mathsf{P}(X<\infty).\] This allows us to check whether a variable is finite, if we do not know this a priori. \(\vartriangleleft\)


Recall that the moment 35 generating function \(M_X(t):=\mathsf{E}(e^{tX})\equiv G_X(e^t)\) of a random variable \(X\) in \(\{0,1,2,\ldots\}\) is \(M_X(t):=\sum_{k\ge0}\frac{\mathsf{E}(X^k)}{k!}t^k\), the generating function of the sequence \(\mathsf{E}(X^k)/k!\). From Probability I you know that \(\mathsf{E}(X^k)=\tfrac{d^k}{dt^k}M_X(t)\bigm|_{t=0}\).

Example 5.10 If \(X\sim\mathsf{Poi}(\lambda)\), we have \(G_X(s)=e^{\lambda(s-1)}\). Therefore, \(M_X(t)\equiv G_X(e^t)=\exp\{\lambda(e^t-1)\}\). \(\vartriangleleft\)


5.2 Applications of generating functions

The following example is very important for applications.

Example 5.11 Let \((X_k)_{k\ge1}\) be i.i.d. random variables with values in \(\{0,1,2,\dots\}\) and let \(N\ge0\) be an integer-valued random variable independent of \(\{X_k\}_{k\ge1}\). Then 36

\(S_N:= X_1+\dots+X_N\) has generating function \[\tag{5.6}\label{eq:GN-GX} G_{S_N}(s)=G_{N}\bigl(G_X(s)\bigr).\]

Solution. This is a straightforward application of the partition theorem for expectations. Alternatively, the result follows from the standard properties of conditional expectations: \(\mathsf{E}\bigl(z^{S_N}\bigr)=\mathsf{E}\bigl[\mathsf{E}\bigl(z^{S_N}\mid N\bigr)\bigr]=\mathsf{E}\bigl(\bigl[G_X(z)\bigr]^N\bigr)=G_{N}\bigl(G_X(z)\bigr)\). \(\vartriangleleft\)
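Formula (\ref{eq:GN-GX}) lends itself to a simulation check. A minimal sketch, under the assumed (and arbitrary) choices \(N\sim\mathsf{Poi}(\lambda)\) and \(X_k\sim\mathsf{Ber}(p)\), for which \(G_N(G_X(s))=e^{\lambda p(s-1)}\), ie., \(S_N\sim\mathsf{Poi}(\lambda p)\) (the 'thinned' Poisson distribution):

```python
import math
import random

random.seed(2)
lam, p, s = 3.0, 0.4, 0.6

def sample_poisson(lam):
    """Poisson sample via inversion of the cumulative distribution function."""
    u, k, prob = random.random(), 0, math.exp(-lam)
    cumulative = prob
    while u > cumulative:
        k += 1
        prob *= lam / k
        cumulative += prob
    return k

def sample_S_N():
    """S_N = X_1 + ... + X_N with N ~ Poi(lam) and X_k ~ Ber(p) independent."""
    return sum(1 for _ in range(sample_poisson(lam)) if random.random() < p)

trials = 100_000
estimate = sum(s ** sample_S_N() for _ in range(trials)) / trials   # Monte Carlo E(s^{S_N})
print(estimate, math.exp(lam * p * (s - 1)))                        # vs G_N(G_X(s))
```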


Generating functions are also very useful in solving recurrences, especially when combined with the following algebraic fact. 37

Lemma 5.12 Let \(f(x)=g(x)/h(x)\) be a ratio of two polynomials without common roots. Let \(\deg(g)<\deg(h)=m\) and suppose that the roots \(a_1\), …, \(a_m\) of \(h(x)\) are all distinct. Then \(f(x)\) can be decomposed into a sum of partial fractions, ie., for some constants \(b_1\), \(b_2\), …, \(b_m\), \[\tag{5.7}\label{eq:partial-fraction-decomposition} f(x)=\frac{b_1}{a_1-x}+\frac{b_2}{a_2-x}+\dots+\frac{b_m}{a_m-x}.\] \(\vartriangleleft\)


Remark 5.12.1 Because \[\frac{b}{a-x}=\frac{b}{a}\sum\limits_{k\ge0}\Bigl(\frac{x}a\Bigr)^k=\sum_{k\ge0}\frac{b}{a^{k+1}}x^k,\] a generating function of the form (\ref{eq:partial-fraction-decomposition}) can easily be written as a power series. \(\vartriangleleft\)


The following example illustrates one of the most useful applications of generating functions in probability:

Example 5.13 Imagine a diligent janitor who replaces a light bulb the same day as it burns out. Suppose the first bulb is put in on day \(0\) and let \(X_i\) be the lifetime of the \(i\)th light bulb. Let the individual lifetimes \(X_i\) be independent random variables with values in \(\{1,2,\dots\}\) and have a common distribution with generating function \(G_f(s)\). Define \(r_n:=\mathsf{P}\bigl(\text{ a bulb was replaced on day~$n$}\bigr)\) and \(f_k:=\mathsf{P}\bigl(\text{ the first bulb was replaced on day~$k$}\bigr)\). Then \[r_0=1,\quad f_0=0,\qquad\text{ and }\qquad r_n =\textstyle\sum\limits_{k=1}^nf_kr_{n-k},\quad n\ge1.\] A standard computation implies that \[G_r(s)-1=\sum_{n\ge1}r_ns^n=\sum_{n\ge1}\sum\limits_{k=1}^nf_kr_{n-k}s^n=\sum_{k\ge1}f_ks^k\sum_{n\ge k}r_{n-k}s^{n-k}=\sum_{k\ge1}f_ks^kG_r(s)=G_f(s)G_r(s)\] for all \(|s|<1\), so that \(G_r(s)=1/(1-G_f(s))\). \(\vartriangleleft\)
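Both the recursion \(r_n=\sum_{k=1}^nf_kr_{n-k}\) and the identity \(G_r(s)=1/(1-G_f(s))\) can be checked numerically. A small sketch (illustration only; the lifetime distribution on \(\{1,2,3\}\) is an arbitrary choice) computes the \(r_n\) by the recursion and then verifies that the truncated series \(\sum_nr_ns^n\) matches \(1/(1-G_f(s))\) for a value of \(s\) inside the unit disc.

```python
# Arbitrary lifetime distribution on {1, 2, 3}: f_1, f_2, f_3.
f = {1: 0.2, 2: 0.5, 3: 0.3}

N = 200
r = [1.0] + [0.0] * N                 # r_0 = 1
for n in range(1, N + 1):
    r[n] = sum(f.get(k, 0.0) * r[n - k] for k in range(1, n + 1))   # r_n = sum_k f_k r_{n-k}

s = 0.5
G_f = sum(f_k * s**k for k, f_k in f.items())
G_r_truncated = sum(r[n] * s**n for n in range(N + 1))
print(G_r_truncated, 1.0 / (1.0 - G_f))   # the two agree: G_r(s) = 1 / (1 - G_f(s))
```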


Example 5.14 Let \(a_n\) be the probability that \(n\) independent Bernoulli trials (with success probability \(p\)) result in an even number of successes. Find the generating function of \(a_n\).
Solution. The event under consideration occurs if the initial failure at the first trial is followed by an even number of successes or if the initial success is followed by an odd number of successes. So \(a_0=1\) and \(a_n=qa_{n-1}+p(1-a_{n-1})\) for all \(n\ge1\), where \(q=1-p\).

Multiplying these equalities by \(s^n\) and adding them we get \[G_a(s)-1=qsG_a(s)+p\sum_{n\ge1}s^n-psG_a(s)=(q-p)sG_a(s)+\frac{ps}{1-s},\] and after rearranging, \[G_a(s)=\Bigl(1+\frac{ps}{1-s}\Bigr)/\bigl(1-(q-p)s\bigr)=\frac12\Bigl(\frac1{1-s}+\frac1{1-(q-p)s}\Bigr)=\frac12\sum_{n\ge0}s^n+\frac12\sum_{n\ge0}(q-p)^ns^n.\] As a result, \(a_n=\bigl(1+(q-p)^n\bigr)/2\). \(\vartriangleleft\)


Example 5.15 A biased coin is tossed repeatedly; on each toss, it shows ‘heads’ with probability \(p\). Let \(r_n\) be the probability that a sequence of \(n\) tosses never has two ‘heads’ in a row. Show that \(r_0=1\), \(r_1=1\), and for all \(n>1\), \(r_n=qr_{n-1}+pqr_{n-2}\), where \(q=1-p\). Deduce the generating function of the sequence \((r_n)_{n\ge0}\).
Solution. Every sequence of \(n\ge2\) tosses starts either with T or with HT; hence the relation. Multiplying these equalities by \(s^n\) and summing, we get \[G_r(s)=\sum_{n\ge0}r_ns^n=1+s+qs\sum_{n\ge2}r_{n-1}s^{n-1}+pqs^2\sum_{n\ge2}r_{n-2}s^{n-2},\] so that \[G_r(s)=(qs+pqs^2)G_r(s)+1+ps,\qquad\text{ ie., }\qquad G_r(s)=\frac {1+ps}{1-qs-pqs^2}.\] \(\vartriangleleft\)


Theorem 5.16 For every fixed \(n\ge0\) let the sequence \((a_{k,n})_{k\ge0}\) be a probability distribution, ie., \(a_{k,n}\ge0\) and \(\sum_{k\ge0}a_{k,n}=1\). Denote by \(G_n(s)\) the corresponding generating function, \(G_n(s)=\sum_{k\ge0}a_{k,n}s^k\). In order that for every fixed \(k\) \[\tag{5.8}\label{eq:convergence-in-distribution-GF} \lim_{n\to\infty}a_{k,n}=a_k\] it is necessary and sufficient that for every \(s\in[0,1)\) we have \(\lim_{n\to\infty}G_n(s)= G(s)\), where \(G(s)=\sum_{k\ge0}a_ks^k\) is the generating function of the limiting sequence \((a_k)\). \(\vartriangleleft\)


Remark 5.16.1 The convergence in (\ref{eq:convergence-in-distribution-GF}) is known as convergence in distribution! \(\vartriangleleft\)


Example 5.17 If \(X_n\sim\mathsf{Bin}(n,p)\) with \(p=p_n\) satisfying \(n\cdot p_n\to\lambda\) as \(n\to\infty\), then \[G_{X_n}(s)\equiv\bigl(1+p_n(s-1)\bigr)^n\to\exp\{\lambda(s-1)\},\] so that the distribution of \(X_n\) converges to that of \(X\sim\mathsf{Poi}(\lambda)\). \(\vartriangleleft\)


The optional Section  5.3 below illustrates the role of generating functions in the theory of branching processes. Its content will not be examined, but you might find it useful, especially in the current situation.

advanced

5.3 Introduction into branching processes*

Informally, a branching process 38 is described as follows: let \(\{p_k\}_{k\ge0}\) be a fixed probability mass function on \(\{0,1,2,\ldots\}\). A population starts with a single ancestor who forms generation number \(0\). This initial individual splits into \(k\) offspring with probability \(p_k\); the resulting offspring constitute the first generation. Each of the offspring in the first generation splits independently into a random number of offspring according to the probability mass function \(\{p_k\}\). This process continues indefinitely, or until extinction, which occurs when all members of a generation fail to produce offspring.

This model has a number of applications in biology (eg., it can be thought of as a model of population growth or virus spreading), physics (chain reaction in nuclear fission), queueing theory etc. Originally it arose from a study of the likelihood of survival of family names.

Formally, let \(\{Z_{n,k}\}\), \(n\ge1\), \(k\ge1\), be a family of i.i.d. random variables in \(\mathbb{Z}^+:=\{0,1,2,\ldots\}\), each having distribution \(\{p_k\}_{k\ge0}\). Then the branching process \((Z_n)_{n\ge0}\) (generated by \(\{p_k\}_{k\ge0}\)) is defined via \(Z_0=1\), and, for \(n\ge1\), \[\tag{5.9}\label{eq:branch-process-recursion} Z_n:= Z_{n,1}+Z_{n,2}+\dots+Z_{n,Z_{n-1}},\] where the empty sum is interpreted as zero. Write \(\mathsf{P}(\cdot)\equiv\mathsf{P}_1(\cdot)=\mathsf{P}(\cdot|Z_0=1)\) and \(\mathsf{E}(\cdot)\equiv\mathsf{E}_1(\cdot)=\mathsf{E}(\cdot|Z_0=1)\) for the corresponding probability measure and the expectation.

If \(\varphi_n(s)\equiv\mathsf{E}{s^{Z_n}}\) is the generating function of \(Z_n\), a straightforward induction based on (5.9) and the generating function of a random sum (cf. Exercise 5.97) implies \[\tag{5.10}\label{eq:branch-process-gener-fn} \begin{gathered} \varphi_0(s)\equiv s,\qquad \varphi(s)\equiv\varphi_1(s):=\mathsf{E}{s^{Z_1}},\\ \varphi_k(s)=\varphi_{k-1}\bigl(\varphi(s)\bigr)\equiv\varphi\bigl(\varphi_{k-1}(s)\bigr),\quad k>1. \end{gathered}\] Usually explicit calculations are hard, but at least in principle, equations (5.10) determine the distribution of \(Z_n\) for any \(n\ge0\).
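The iteration (5.10) is, however, easy to carry out numerically. The following sketch evaluates \(\varphi_n(0)\) for an illustrative quadratic offspring law (the particular values \(p_0=0.3\), \(p_1=0.3\), \(p_2=0.4\) are an arbitrary choice):

```python
def phi_n(phi, s, n):
    """Evaluate phi_n(s) via phi_0(s) = s and phi_k = phi(phi_{k-1}), as in (5.10)."""
    x = s
    for _ in range(n):
        x = phi(x)
    return x

# illustrative offspring law: p_0 = 0.3, p_1 = 0.3, p_2 = 0.4, so phi(s) = 0.3 + 0.3*s + 0.4*s^2
phi = lambda s: 0.3 + 0.3 * s + 0.4 * s ** 2
for n in (1, 2, 5, 10, 50):
    print(n, phi_n(phi, 0.0, n))            # phi_n(0) = P(Z_n = 0), non-decreasing in n
```

Here \(m=0.3+2\cdot0.4=1.1>1\) and the values \(\varphi_n(0)=\mathsf{P}(Z_n=0)\) increase towards \(0.75\), the smaller root of \(s=\varphi(s)\); see Definition 5.20 and Theorem 5.21 below.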

Example 5.18 Let \(\varphi_1(s)\equiv\varphi(s)=q+ps\) for some \(0<p=1-q<1\). Then \[\varphi_n(s)\equiv q(1+p+\dots+p^{n-1})+p^ns=1+p^n(s-1).\] Notice that here we have \(\varphi_n(s)\to1\) as \(n\to\infty\) for all \(s\in[0,1]\), ie., the distribution of \(Z_n\) converges to that of \(Z_\infty\equiv0\), recall Theorem 5.16. \(\vartriangleleft\)


Assuming that \(\{p_k\}_{k\ge0}\) is a non-degenerate distribution, we have \(\varphi(1)\equiv\varphi_1(1)=1\) and therefore \(\varphi_n(1)=1\) for all \(n\ge0\). Denote by \(m:=\varphi'(1_-)\ge0\) the mean number of offspring of a single individual. A straightforward induction based on (5.10) shows that \[\mathsf{E}\bigl(Z_n\bigr)=\frac{d}{ds}\varphi_n(s)\bigm|_{s=1_-}=m^n=\bigl(\mathsf{E} Z_1\bigr)^n.\]

This suggests that if \(m\equiv\mathsf{E} Z_1\neq1\), the branching process might explode (for \(m>1\)) or die out (for \(m<1\)). One classifies branching processes as critical (\(m=1\)), subcritical (\(m<1\)), or supercritical (\(m>1\)).

Example 5.19 It is straightforward to describe the case \(m<1\). Indeed, Markov’s inequality implies that \[\mathsf{P}(Z_n>0)=\mathsf{P}(Z_n\ge1)\le\mathsf{E}(Z_n)=m^n,\] so that \(\mathsf{P}(Z_n>0)\to0\) as \(n\to\infty\) (ie., \(Z_n\to0\) in probability). Moreover, as \(\sum_{n\ge0}\mathsf{P}(Z_n>0)<\infty\), Borel-Cantelli’s lemma implies that \(\mathsf{P}(Z_n\to0)=1\) (ie., \(Z_n\to0\) almost surely). We also notice that the average total population in this case is finite, \(\mathsf{E}\bigl(\sum_{n\ge0}Z_n\bigr)=\sum_{n\ge0}m^n=(1-m)^{-1}<\infty\). \(\vartriangleleft\)
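A toy simulation of a subcritical process illustrates this rapid extinction (a Python sketch; the \(\mathsf{Bin}(2,0.4)\) offspring law, with \(m=0.8<1\), is an arbitrary choice):

```python
import random

def generation_sizes(offspring, n_gens, rng):
    """One realisation of Z_0, ..., Z_{n_gens} with Z_0 = 1; stops once Z_n = 0."""
    sizes, z = [1], 1
    for _ in range(n_gens):
        z = sum(offspring(rng) for _ in range(z))
        sizes.append(z)
        if z == 0:
            break
    return sizes

rng = random.Random(2024)
offspring = lambda r: sum(r.random() < 0.4 for _ in range(2))    # Bin(2, 0.4), so m = 0.8 < 1
trials = 10_000
extinct = sum(generation_sizes(offspring, 20, rng)[-1] == 0 for _ in range(trials))
print("fraction extinct within 20 generations:", extinct / trials)
```

By the bound above, \(\mathsf{P}(Z_{20}>0)\le(0.8)^{20}\approx0.012\), so the printed fraction should be close to one.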


Definition 5.20 The extinction event \(\mathcal{E}\) is the event \(\mathcal{E}=\bigcup_{n=1}^\infty\bigl\{Z_n=0\bigr\}\). As \(\bigl\{Z_n=0\bigr\}\subset\bigl\{Z_{n+1}=0\bigr\}\) for all \(n\ge0\), the extinction probability \(\rho\) is defined as \[\rho=\mathsf{P}(\mathcal{E})=\lim_{n\to\infty}\mathsf{P}\bigl(Z_n=0\bigr),\] where \(\mathsf{P}\bigl(Z_n=0\bigr)\equiv\varphi_n(0)\) is the probability of extinction before the \((n+1)\)st generation.


The following result helps to derive the extinction probability \(\rho\) without the need to compute the iterates \(\varphi_n(\cdot)\). To avoid trivialities we assume that \(p_0=\mathsf{P}(Z_1=0)\) satisfies 39 \(0<p_0<1\); under this assumption \(\varphi(s)\) is a strictly increasing function of \(s\in[0,1]\).

Theorem 5.21 If \(0<p_0<1\), then the extinction probability \(\rho\) is given by the smallest positive solution to the equation \[\tag{5.11}\label{eq:extinction-probability-solution} s=\varphi(s).\] In particular, if \(m=\mathsf{E} Z_1\le1\), then \(\rho=1\); otherwise, we have \(0<\rho<1\). \(\vartriangleleft\)


Remark 5.21.1 The relation \(\rho=\varphi(\rho)\) has a clear probabilistic sense. Indeed, if \(\rho=\mathsf{P}_1(\mathcal{E})\) is the extinction probability starting from a single individual, \(Z_0=1\), then by independence we get \(\mathsf{P}_k(\mathcal{E})\equiv\mathsf{P}(\mathcal{E}\mid Z_0=k)=\rho^k\), and thus the first step decomposition for \(Z_n\) gives \[\rho=\mathsf{P}(\mathcal{E})=\sum_{k\ge0}\mathsf{P}(\mathcal{E},Z_1=k)=\sum_{k\ge0}\mathsf{P}(\mathcal{E}\mid Z_1=k)\mathsf{P}(Z_1=k) =\sum_{k\ge0}\rho^k\mathsf{P}(Z_1=k)\equiv\mathsf{E}\bigl(\rho^{Z_1}\bigr)\equiv\varphi(\rho),\] in agreement with (5.11). \(\vartriangleleft\)


We will not prove Theorem 5.21 here 40 but just notice that the population has a positive chance \(1-\rho>0\) of survival if and only if the average offspring size per single individual \(m\) is larger than one. In fact, one can show that, in the supercritical case, with probability \(1-\rho>0\) the size \(Z_n\) of the population converges to infinity.

Explicit computations here are often difficult, but you might want to explore the case with quadratic branching, ie., with \(\varphi(s)=as^2+bs+c\), where \(a\), \(b\), \(c\) are positive constants such that \(\varphi(1)=1\). Another explicitly soluble case is with geometric branching, where \(p_k=pq^k\) with \(0<p=1-q<1\) and \(\varphi(s)=p(1-qs)^{-1}\).
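For instance, in the geometric case the fixed-point equation (5.11) can be solved by hand: since \(\varphi(s)=p(1-qs)^{-1}\), one has \(m=\varphi'(1_-)=pq(1-q)^{-2}=q/p\), and \[s=\varphi(s)\quad\Longleftrightarrow\quad qs^2-s+p=0\quad\Longleftrightarrow\quad (s-1)(qs-p)=0,\] so that \(\rho=\min\{1,p/q\}\): the population dies out with certainty when \(q\le p\) (ie., when \(m\le1\)), and survives with probability \(1-p/q>0\) when \(q>p\).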


You might have seen generating functions in other modules, e.g., Discrete Mathematics. A good source of problems on probabilistic applications of generating functions is .

Generating functions were invented by Abraham de Moivre (1667-1754) in the early eighteenth century. Their usefulness was illustrated by Leonhard Euler (1707-1783). Abel’s theorem is due to Niels Henrik Abel (1802-1829), who contributed to many areas of mathematics. He proved that a general quintic equation cannot be solved in radicals; commutative (Abelian) groups were named after him.

(left to right) de Moivre, Euler, Bienaymé, Abel, Galton, and Watson

Branching processes, also known as Galton–Watson processes, are named after Sir Francis Galton (1822-1911) and Rev. Henry William Watson (1827-1903), who introduced the model in their study of the extinction of family names. Their result is believed to have been obtained independently of the earlier work of Irénée-Jules Bienaymé (1796-1878).


By the end of this section you should be able to:


Exercise 5.87

Let the variables \(X\) and \(Y\) satisfy \(\mathsf{P}(X=1)=\mathsf{P}(Y=5)=\frac13\) and \(\mathsf{P}(X=4)=\mathsf{P}(Y=2)=\frac23\);
a) show that \(X\) and \(Y\) have the same first and second moments, but not the same third and fourth moments;
b) find the probability and moment generating functions for \(X\) and \(Y\).


Exercise 5.88

Find the generating functions, both probability \(G(s)\) and moment \(M(t)\), for the following discrete probability distributions:
a) the distribution describing a fair die;
b) the distribution describing a die that always comes up \(3\);
c) the uniform distribution on the set \(\{n,n+1,n+2,\dots,n+k\}\);
d) the binomial distribution on \(\{n,n+1,n+2,\dots,n+k\}\), where \(\mathsf{P}(X=n+m)=\binom{k}mp^m(1-p)^{k-m}\).


Exercise 5.89

Let \(X_1\), \(X_2\), …, \(X_n\) be an independent trials process, with values in \(\{0,1\}\) and mean \(\mu=1/3\). Find the probability and moment generating functions for the distribution of \[\begin{gathered} \text{a)}\quad S_1=X_1,\qquad\text{b)}\quad S_n=X_1+\dots+X_n,\qquad\text{c)}\quad A_n=S_n/n,\qquad\text{d)}\quad S_n^*=(S_n-n\mu)/\sqrt{n\sigma^2}, \end{gathered}\] where \(\sigma^2\) is the variance of \(X_1\).


Exercise 5.90

Let \((a_n)_{n\ge0}\) be a real sequence with generating function \(G_a(s)\).
a) With \((b_k)_{k\ge0}\) being the constant sequence, \(b_k=1\) for all \(k\ge0\), define \(c=a\star b\). Show that \(c_n=\sum_{k\le n}a_k\) and write the generating function \(G_c(s)\) in terms of \(G_a(s)\).
b) Suppose that \(a_n\ge0\) for all \(n\ge0\) and that \(G_a(1)=\sum_na_n<\infty\). For integer \(n\ge0\), denote \(d_n:=\sum_{k\ge n}a_k\). Write the generating function \(G_d(s)\) in terms of \(G_a(s)\).


Exercise 5.91

Let \(X\) be a random variable with generating function \(G_X(s)\). In terms of \(G_X(s)\),
a) find the generating functions of the random variables \(X+1\) and \(2X\);
b) find the generating function of the sequence \(\mathsf{P}(X\le n)=\sum_{k\le n}\mathsf{P}(X=k)\);
c) find the generating function of the sequence \(a_n:=\mathsf{P}(X=2n)\).


Exercise 5.92

A biased coin showing ‘heads’ with probability \(p\in(0,1)\) is independently tossed \(n\) times. Let \(X\) be the total number of heads shown. Use the probability generating function of \(X\) to find:
a) the mean and the variance of \(X\);
b) the probability that \(X\) is even;
c) the probability that \(X\) is divisible by \(3\).


Exercise 5.93

a) Two magic dice are thrown independently, one showing a random number from the set \(\mathcal{D}_1=\{1,3,4,5,6,8\}\) and another from \(\mathcal{D}_2=\{1,2,2,3,3,4\}\). Find the generating functions of the outcomes for both dice. Let \(T\) be the sum of two results; find the generating function of \(T\).
b) Do the same for a pair of standard dice, i.e., compute the generating function of the sum of two independent outcomes taken uniformly in \(\mathcal{D}_0=\{1,2,3,4,5,6\}\). Compare your results.


Exercise 5.94

Fix \(0<\rho<1\), and consider two probability generating functions \(G_X(s)\) and \(G_Y(s)\). Show that \(G_Z(s):=\rho G_X(s)+(1-\rho)G_Y(s)\) is a probability generating function, and interpret this result.


Exercise 5.95

Let \(X\sim\mathsf{Poi}(\lambda)\) with \(\lambda>0\). Show that \(\mathsf{E}\bigl(X(X-1)\dots (X-k+1)\bigr) =\lambda^k\).


Exercise 5.96

Suppose that \(X\sim\mathsf{Geom}(p)\), that is \(\mathsf{P}(X=k) = pq^{k-1}\) for \(k \geq 1\) and \(0<p=1-q<1\).
a) Show that \(G_X(s) = \frac{p s}{1-qs}\); deduce the values of \(\mathsf{E} X\) and \(\mathsf{Var} X\).
b) Find the generating function of the sequence \(a_n:=\mathsf{P}(X>n)\), \(n\ge0\).


Exercise 5.97

Let \((X_n)_{n\ge1}\) be i.i.d. random variables in \(\{0,1,2,\dots\}\) with common generating function \(G_X(s)\). Let \(N\ge0\) be an integer-valued random variable, independent of the sequence \(X_n\); denote its generating function by \(G_N(s)\). The (random) sum \(S_N:= X_1+X_2+\dots+X_N\) has the so-called compound distribution.
a) Use the partition theorem for expectations to find \(\mathsf{E}(S_N)\) in terms of \(\mathsf{E} X\) and \(\mathsf{E} N\);
b) Find \(\mathsf{E}\bigl((S_N)^2\bigr)\) using the partition theorem for expectations; show that \(\mathsf{Var}(S_N)=\mathsf{E}(N)\mathsf{Var}(X)+\mathsf{Var}(N)(\mathsf{E} X)^2\);
c) Show that the generating function \(G_{S_N}(s)\) of \(S_N\) is \(G_{S_N}(s)\equiv G_N\bigl(G_X(s)\bigr)\);
d) Use the previous result to compute \(\mathsf{E}(S_N)\);
e) Compute \(\mathsf{Var}(S_N)\) using the result in c).


Exercise 5.98

A mature individual produces immature offspring according to the probability generating function \(F(s)\).
a) Suppose a population consists of \(k\) immature individuals, each of which grows to maturity with probability \(p\) and then reproduces, independently of the other individuals. Find the probability generating function of the number of immature individuals in the next generation.
b) Find the probability generating function of the number of mature individuals in the next generation, given that there are \(k\) mature individuals in the parent generation.
c) Show that the distributions in a) and b) above have the same mean, but not necessarily the same variance.


Exercise 5.99

A hen lays \(N\) eggs, where \(N\sim\mathsf{Poi}(\lambda)\). Each egg hatches with probability \(p\), independently of all the other eggs. Let \(K\) be the number of chicks, i.e., \(K=X_1+\dots+X_N\), where \(X_k\sim\mathsf{Ber}(p)\) are independent Bernoulli random variables for \(k\ge1\). Show that \(K\sim\mathsf{Poi}(\lambda p)\).


Exercise 5.100

Let \(X_k\), \(k\ge1\), be i.i.d. random variables with common distribution \[\mathsf{P}(X_k=1)=p,\qquad \mathsf{P}(X_k=-1)=q=1-p.\] Define the simple random walk \((S_n)_{n\ge0}\) via \(S_0=0\) and \(S_n=X_1+\dots+X_n\) for \(n\ge1\). Let \(T:=\inf\bigl\{n\ge1:S_n=1\bigr\}\) be the first time this random walk hits \(1\). Find the generating function \(G_T(s)\equiv\mathsf{E}\bigl[s^T\bigr]\).


Exercise 5.101

A slot machine operates so that at the first turn the probability for the player to win is \(1/2\). Thereafter the probability for the player to win is \(1/2\) if they lost at the last turn, but is \(p<1/2\) if they won at the last turn. If \(u_n\) is the probability that the player wins at the \(n\)th turn, show that for \(n>1\) \[u_n+\Bigl(\frac12-p\Bigr)u_{n-1}=\frac12.\] Show that this equation also holds for \(n=1\), if \(u_0\) is suitably defined, and find \(u_n\).


Exercise 5.102

A flea randomly jumps over non-negative integers by flipping a coin before each jump. If the coin shows ‘tails’, the flea jumps to the next integer (ie., \(k\mapsto{k+1}\)); if it shows ‘heads’, the flea jumps over the next integer (ie., \(k\mapsto{k+2}\)). Let \(u_n\) be the probability that, starting at the origin, the flea visits \(n\) at some point.
a) If the coin is fair, show that \(u_n=(u_{n-1}+u_{n-2})/2\), compute the generating function \(G_u(s)\) of the sequence \(u_n\), and thus derive a formula for \(u_n\);
b) Do the same for a biased coin showing “heads” with probability \(p\in[0,1]\).


Exercise 5.103

In a multiple-choice examination, a student chooses between one true and one false answer to each question. Assume that the student answers at random, and let \(N\) be the number of such answers until they first answer two successive questions correctly. Show that \(\mathsf{E}(s^N)=s^2(4-2s-s^2)^{-1}\) and find \(\mathsf{E}(N)\).


Exercise 5.104

In a sequence of Bernoulli trials with success probability \(p\), let \(u_n\) be the probability that the combination SF (a success immediately followed by a failure, in that order) first occurs at trials number \(n-1\) and \(n\). Find the corresponding generating function, mean and variance.


Exercise 5.105

A fair coin is tossed \(n\) times. Let \(u_n\) be the probability that the sequence of tosses never has ‘heads’ followed by ‘heads’. Show that \(u_n=\frac12u_{n-1}+\frac14u_{n-2}\). Find \(u_n\), using the condition \(u_0=u_1=1\). Compute \(u_2\) directly and check that your formula gives the correct value for \(n=2\).



  1. Feel free to consult your first year notes and other materials as necessary.

  2. Get in touch, if interested!

  3. A priori we do not know that \(A\) is an event, i.e., that it can be assigned a probability! Of course, it is “intuitively obvious” that \(\mathsf{P}(A)=0\), but it is not immediately clear how to justify this guess. The methods discussed in this section allow us to define limits of some sequences of events and their probabilities.

  4. Decompositions in the form \(A_n=\bigcup_{k=1}^n\bigl(A_k\setminus(\cup_{m=1}^{k-1}A_m)\bigr)\) are often called telescopic; they are analogous to those in sequential Bayes formulae.

  5. Here \(\mathcal{A}\) can be an arbitrary index set, eg., \(\mathbb{N}\), \(\mathbb{Z}\), \([0,1]\), \(\mathbb{R}^+\equiv[0,\infty)\), \(\mathbb{R}^2\), \(\mathbb{C}\), \(\mathbb{Z}^{25}\) etc.

  6. Recall that a sequence \(a_n\in\{0,1\}\) converges if and only if it is eventually constant, ie., for some \(k\ge1\) we have \(a_k=a_{k+1}=a_{k+2}=\ldots\).

  7. equivalently, \(\omega\in\{A_n^\mathsf{c}\text{ finitely often}\}\);

  8. Recall that in general uncountable intersections of events, e.g., over \(\varepsilon>0\), are not well defined. Here everything is fine, because \(A(\varepsilon)\) does not depend on \(\varepsilon\).

  9. Eg., some results in Number Theory about rational approximations of irrational numbers are formulated in a form similar to Lemma 2.3!

  10. As \(n\) grows, \(p\) gets smaller. For \(n=10^6\), \(p\) is more than 99.99 percent, but for \(n=10^{10}\) the probability \(p\) is about 52.73 percent and for \(n=10^{11}\) it is about 0.17 percent. As \(n\) goes to infinity, the probability \(p\) can be made as small as one likes.

  11. Using the theory of Markov chains one can show that the expected hitting time of the word ‘banana’ is exactly \(50^6\approx1.5625\cdot10^{10}\).

  12. You can use the \(R\) script available from the course webpage to explore sequences of different length and/or different typewriters.

  13. Recall that for a real sequence \((a_n)_{n\ge1}\) one defines \(\limsup\limits_{n\to\infty}a_n\) as the largest limiting point of the sequence \((a_n)_{n\ge1}\), equivalently, \(\limsup\limits_{n\to\infty}a_n\equiv\lim\limits_{n\to\infty}\sup_{k\ge n}a_k\), see App.  A below.

  14. For functions \(f(x)\) and \(g(x)\), we write \(f(x)\sim g(x)\) as \(x\to\infty\), if \(f(x)/g(x)\to1\) in that limit; this asymptotic relation will be rigorously established later in the course.

  15. one can show that this set belongs to \(\mathcal{F}\), i.e., is an event;

  16. In stochastic simulations on a computer (think, e.g., of weather forecast), “random inputs” are often generated by repeatedly iterating deterministic, specially constructed, functions, known as random number generators. In this case the quality of the whole computer experiment depends on the seed, the initial value of the (long) iteration chain. One can interpret each seed as an individual outcome \(\omega\in\Omega\) of a probabilistic experiment.

  17. Notice however, that not all modes of convergence discussed below are preserved by action of all continuous functions!

  18. Try the R script simulating this sequence from the course webpage!

  19. there are many other ways of showing that the integral expressions on the left are uniformly bounded in \(n\ge1\); we find it convenient to use the simple inequality \(1-x\le e^{-x}\), valid for all \(x\in\mathbb{R}\).

  20. otherwise, consider the centred variables \(X'_k=X_k-\mu\) and deduce the result from the relation \(\frac1nS_n'=\frac1nS_n-\mu\) and linearity of almost sure convergence.

  21. we will not do this here!

  22. The result of Exercise  2.48 might also be used here.

  23. proposed by Bernhard Riemann (1826-1866);

  24. This is of prime importance, for instance, in the study of Fourier series, Fourier transforms and other topics.

  25. if \(a\), \(b\), and \(m\) are fixed, the function \(f_m(x)\) vanishes outside the finite set \(\mathbb{Q}_m\cap[a,b]\);

  26. one often calls \((E,\mathcal{A})\) a measurable space, and \((E,\mathcal{A},\mu)\) a measure space;

  27. recall that \(\mathbf{1}_S(x)=1\) if \(x\in S\) and \(\mathbf{1}_S(x)=0\) otherwise

  28. here we always assume that \(0\cdot\infty=\infty\cdot0=0\);

  29. [zero-measure-difference-footnote] Also, if any two functions \(f_1\) and \(f_2\) coincide almost everywhere, ie., they differ on a set of measure zero, \(\mu(x:f_1(x)\neq f_2(x))=0\), their integrals are equal, \(\mu(f_1)=\mu(f_2)\).

  30. Complex valued functions can be similarly integrated, by considering the real part and the imaginary part separately.

  31. by definition \(f:E\to[-\infty,+\infty]\) is Borel, if for every \(a\in\mathbb{R}\), \(\{x\in E:f(x)\le a\}\in\mathcal{A}\), ie., is measurable.

  32. it is not easy to construct a non-Borel real-valued function; get in touch, if interested!

  33. this and several other useful properties of power series can be found in Section  A.3 below.

  34. Theorem A.9 below; notice that the result can also be derived from .

  35. Why do we introduce both \(G_X(s)\) and \(M_X(t)\)?

  36. This is a two-stage probabilistic experiment!

  37. An alternative way would be to use products of matrices; get in touch, if interested!

  38. sometimes called Galton-Watson-Bienaymé process

  39. otherwise the model is degenerate: if \(p_0=0\), then \(Z_n\ge1\) for all \(n\ge0\) so that \(\rho=0\); if \(p_0=1\), then \(\mathsf{P}(Z_1=0)=\rho=1\).

  40. get in touch if interested!