EAS 327: Statistics & Sampling Theory

1.0 Introductory Concepts

A variable x is discrete if it can take on only particular values (eg., if x can take only integer values, it is discrete). Conversely, we say x is continuous on the range [a,b] if it can take on any value a <= x <= b.

2.0 Histogram

Suppose we take a set of N observations (x1, x2, x3, ... xi, ... xN) of a random variable x (in standard shorthand notation, our sequence of observed values is denoted xi, i = 1...N). Now suppose we count the number Nj of these observations that fall into each of a number of ranges, or "bins," where the jth bin covers some range xj,min <= x <= xj,max and has width Δxj = xj,max - xj,min. The ratios Nj/N <= 1 are the relative frequencies with which the random variable x falls into bin j. If our sample is large enough, Nj/N should be a good estimator of the probability that a single sample of x falls into the jth bin. If we plot the Nj/N versus the centre-point values xj of the bins, we have a histogram.

Now, let all the Δxj be equal (call this constant Δx) and define fj = (1/Δx) (Nj/N). We may plot fj against xj. All this does is rescale the histogram, but this "f" is now an empirical (ie., observed) probability density function (pdf) for the random variable x. The pdf of a random variable is an extremely important function, and from it we may extract complete statistical knowledge of x. We'll begin for now by noting some crucial properties of the pdf.
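
To make this concrete, here is a minimal Python sketch (the Gaussian sample, the random seed and the choice of 30 bins are purely illustrative assumptions, not part of the definition) of the binning and re-scaling just described:

```python
import numpy as np

# hypothetical sample: N observations of a random variable x
N = 10_000
x = np.random.default_rng(0).normal(loc=5.0, scale=2.0, size=N)

# count the Nj falling into bins of equal width dx
counts, edges = np.histogram(x, bins=30)      # Nj for each bin
dx = edges[1] - edges[0]                      # constant bin width
centres = 0.5 * (edges[:-1] + edges[1:])      # bin centre-points xj

rel_freq = counts / N      # Nj/N, the relative frequencies (the histogram)
f = rel_freq / dx          # fj = (1/dx)(Nj/N), the empirical pdf

print(f.sum() * dx)        # ~1.0: the empirical pdf integrates to one
```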

3.0 Sample versus Population

Any given sample of N observations (x1, x2, x3, ... xi, ... xN) is finite. From it we can calculate properties called statistics, eg. the average value of x across the sample, denoted ⟨x⟩. This sample average can only be claimed to be an estimate of the average value (μ) of x across the population... so we say ⟨x⟩ is a statistic, while μ is a "parameter." The usual reason for calculating statistics of a sample is to estimate the parameters of the population.

4.0 Parameters, in particular "Moments," of a pdf.

The moments of a pdf f(x) are the "expected values" of x, of x², of x³, and so forth. Many terminologies are used, and perhaps the best here is the E[ ] notation. Then, the expected value of x is denoted E[x], and this is the population mean, commonly given the symbol μ (as noted above, it is conventional to use Greek symbols to represent the parameters of a population, which are distinct from the statistics of a sample).

The rule for calculating the expected value of any variable g=g(x) that is a function of x, where x is a random variable defined on the range a <= x <= b, is very simple:

E[ g(x) ] = ∫ₐᵇ g(x) f(x) dx

It follows at once that the population mean is:

μ = E[x] = ∫ₐᵇ x f(x) dx

while

E[x²] = ∫ₐᵇ x² f(x) dx

Higher "moments about the mean" are defined like (for example)

σ² = ∫ₐᵇ (x - μ)² f(x) dx

and this σ² is nothing other than the population variance (= the second moment about the mean), whose square root is the standard deviation.
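
As a worked illustration of these definitions, the following sketch (assuming numpy and scipy are available; the uniform pdf on [a,b] is just a convenient example choice) evaluates μ, E[x²] and σ² by numerical integration:

```python
import numpy as np
from scipy.integrate import quad

# illustrative pdf: uniform on [a, b], f(x) = 1/(b - a)
a, b = 2.0, 6.0
f = lambda x: 1.0 / (b - a)

# E[g(x)] = integral from a to b of g(x) f(x) dx
def expect(g):
    value, _ = quad(lambda x: g(x) * f(x), a, b)
    return value

mu = expect(lambda x: x)                 # population mean, (a+b)/2 = 4
ex2 = expect(lambda x: x**2)             # second raw moment E[x^2]
var = expect(lambda x: (x - mu)**2)      # second moment about the mean

print(mu, ex2, var, ex2 - mu**2)         # last two agree: var = E[x^2] - mu^2
```

For this pdf the printed values should come out near μ = 4 and σ² = (b-a)²/12 ≈ 1.33, and the last two numbers agree because σ² = E[x²] - μ².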

5.0 Calculating Statistics of a Sample

From a sample of N observations (x1, x2, x3, ... xi, ...xN) we can calculate...

Sample Mean

mx = ⟨x⟩ = 1/N Σi xi

Sample Variance

Define x'i = xi - ⟨x⟩ to be the deviation or fluctuation from the sample mean of the ith sample member xi. Then we may calculate the sample variance sx² (which is simply the square of the sample standard deviation sx) as:

sx² = ⟨ x'² ⟩ = ⟨ (x - ⟨x⟩)² ⟩ = ⟨ x² ⟩ - (⟨x⟩)² =

= 1/N Σi x'i² = 1/N Σi (xi - ⟨x⟩)² = 1/N Σi xi² - ( 1/N Σi xi )²

where all these alternative forms are given to you to familiarise you with different notations (the first line) and different calculational procedures (the three sums on the second line are alternative but equivalent recipes for calculating the sample variance sx²). Of course the sample standard deviation is important as a measure of the spread of the values xi observed in our sample.
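
A quick numerical check (numpy; the Gaussian sample is an arbitrary stand-in for real data) that the alternative recipes really do agree:

```python
import numpy as np

x = np.random.default_rng(1).normal(size=1000)   # hypothetical sample

m_x = x.mean()                                   # sample mean <x>
xp = x - m_x                                     # fluctuations x'_i

s2_a = np.mean(xp**2)                            # (1/N) sum of x'_i^2
s2_b = np.mean(x**2) - m_x**2                    # <x^2> - <x>^2

print(s2_a, s2_b)          # identical to round-off
s_x = np.sqrt(s2_a)        # sample standard deviation
# (numpy's np.var(x) uses the same 1/N convention by default)
```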

The sample mean and standard deviation (mx, sx; note the use of Roman rather than Greek symbols) are the two most common, important and familiar sample statistics, and are often taken without much further worry as estimates of the corresponding population parameters μ, σ. We'll stay away from fretting over whether, in calculating sx, we should have used 1/N or 1/(N-1). This relates to whether or not sx is a "biased estimator" of σ, and since N is often large, who cares?

Two further statistics commonly reported are the skewness

Skx = (1/N Σi x'i³) / sx³

which describes the asymmetry of a pdf (the "bell-shaped curve" is perfectly symmetric and so has zero skewness), and the kurtosis

Kx = (1/N Σi x'i⁴) / sx⁴
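
For a Gaussian sample the skewness should come out near 0 and the kurtosis near 3. The sketch below (numpy; made-up data) computes both directly from the fluctuations x'i as defined above:

```python
import numpy as np

x = np.random.default_rng(2).normal(size=5000)   # hypothetical sample

xp = x - x.mean()                                # fluctuations x'_i
s_x = np.sqrt(np.mean(xp**2))                    # sample standard deviation

Sk = np.mean(xp**3) / s_x**3                     # skewness, ~0 for a Gaussian
K = np.mean(xp**4) / s_x**4                      # kurtosis, ~3 for a Gaussian

print(Sk, K)
```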

6.0 Linear Regression

Suppose we have a sample (ie. a set) of pairs of values (xi, yi), for i = 1, 2, 3 ... N. Imagine we have plotted y versus x, and noted there appears to be some sort of relationship or correlation between them. For convenience we shall call x the "predictor" and y the "predictand," although the selection of which is which is arbitrary.

Suppose we consider, either on theoretical grounds or by inspection of our "scatter-plot" of y versus x, that y is linearly related to x, ie. that there is an underlying connection best expressed in the form:

y = m x + b

where m is the slope of the line (m = dy/dx = Lim(Δx-->0) Δy/Δx) and b is the intercept (value of y when x=0). Then, if the y versus x plot does not form a perfectly straight line, but instead shows some scatter, we will ask ourselves, what is the best choice of m,b on the basis of the available data? In other words, what is the best fit line?

Well, we need a criterion for "best fit." A popular choice, and the basis of "linear regression," is the least-squares error....

Let us distinguish measured y, that is, the yi, from "estimated (or modelled) y," the latter being defined as

yie = m xi + b

Then, (yie - yi) is the error in the ith model estimate y. Now define the "sum of squares of the error" SS as

SS = 1/N Σi (yie - yi)²

and substituting for yie using the model, this becomes, upon carrying through the multiplications:

SS = 1/N Σi ( yi² + m² xi² + 2 m b xi - 2 b yi - 2 m xi yi + b² )

We should like SS to be minimised - and we shall call the selection (m,b) that minimises SS the "best least-squares fit" of the linear model (ie. the line) to the observations.

Now SS depends on our choice of (m,b) and so we can write SS = SS(m,b), indicating a functional dependence... SS is a function of m and of b. In principle, we can plot SS versus m and b (a surface over the m-b plane), and we would like to select that pair of values (m,b) at which SS is smallest - ie., we are seeking that point in m-b space at which the surface SS(m,b) is flat (has zero slope). Those with a grounding in Calculus may recall this means we look for the values of m, b that make the partial derivatives ∂SS/∂m and ∂SS/∂b vanish.

"Partial" derivatives? The partial derivative of SS with respect to m means

∂SS/∂m = Lim(Δm -> 0) [ΔSS/Δm](b held fixed)

OK, now we do some calculus. Differentiating w.r.t. m, term by term, we obtain:

∂SS/∂m = 0 = 1/N Σi [ 2 m xi² + 2 b xi - 2 xi yi ]

Now we introduce bracket symbols ⟨⟩ to denote the average of any quantity over our sample:

⟨ x ⟩ = 1/N Σi xi,

⟨ x² ⟩ = 1/N Σi xi²,

⟨ y ⟩ = 1/N Σi yi,

⟨ y² ⟩ = 1/N Σi yi²,

⟨ xy ⟩ = 1/N Σi xi yi.

These are all statistics of our sample (ie. of the set of observed values xi, yi). They are trivial to calculate, and will be needed below for substitution into the equations we ultimately obtain for our "best" m,b. In terms of these statistics, we may re-write the above equation in a tidier form:

∂SS/∂m = 0 = 2 m ⟨ x² ⟩ + 2b ⟨x⟩ - 2 ⟨xy⟩

This is a single equation containing two unknowns (m,b)... of course the ⟨x⟩, ⟨y⟩ etc. are knowns that are calculable from our sample. We need another equation. No problem: following exactly the same procedure we can extract a second equation ensuring SS is minimised with respect to b. This second equation is:

∂SS/∂b = 0 = 2m ⟨x⟩ - 2 ⟨y⟩ + 2b

So we now have two simultaneous linear algebraic equations in the two unknowns (m,b). Straightforward algebra yields our wanted criterion of best fit:



The best fit straight line is that having slope

m = [ ⟨xy⟩ - ⟨x⟩ ⟨y⟩ ] / [ ⟨ x² ⟩ - ⟨x⟩² ]

and intercept

b = ⟨y ⟩ - m ⟨ x ⟩

What if we wished to force a best fit line through the origin? No problem... then we set b=0 and our equation for minimisation with respect to m simplifies to

∂SS/∂m = 0 = 2 m ⟨ x² ⟩ - 2 ⟨xy⟩

so that our best-fit slope is

m = ⟨xy⟩ / ⟨ x² ⟩
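
Putting these recipes to work, here is a short sketch (numpy; the synthetic data scattered about a known line, and the seed, are purely illustrative) that computes the best-fit slope and intercept from the bracket statistics, plus the slope of the fit forced through the origin:

```python
import numpy as np

# hypothetical paired sample (x_i, y_i) scattered about a known line
rng = np.random.default_rng(3)
x = np.linspace(0.0, 10.0, 50)
y = 2.5 * x + 1.0 + rng.normal(scale=1.0, size=x.size)

# the bracket statistics needed by the best-fit formulas
xbar, ybar = x.mean(), y.mean()    # <x>, <y>
x2bar = np.mean(x**2)              # <x^2>
xybar = np.mean(x * y)             # <xy>

m = (xybar - xbar * ybar) / (x2bar - xbar**2)   # best-fit slope
b = ybar - m * xbar                             # best-fit intercept

m0 = xybar / x2bar                 # slope if the line is forced through the origin

print(m, b, m0)                    # m, b should be near 2.5 and 1.0
```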



7.0 Autocorrelation and Autocorrelation Timescale

Consider a section of a fluctuating signal y(t). We will assume the signal y(t) is "stationary," meaning that its average properties (such as its variance σy²) are not drifting in time. For such a signal we can define the autocovariance as the average value of the product of the signal at one time (t) with its value at a different time (t + Δ); here, for simplicity, we take y to have zero mean (or, equivalently, regard y as the fluctuation about its mean). The autocorrelation is that number made dimensionless by dividing by the variance σy². Thus the autocorrelation R(Δ) is defined to be:

R(Δ) = ⟨ y(t) y(t + Δ) ⟩ / σy²

where the angle brackets denote the average value of the quantity they enclose. As implied by the notation R(Δ), R is independent of the reference point t, and depends only on the time "lag" Δ.

Necessarily R(0) = 1 (since ⟨ y(t) y(t) ⟩ = σy²), ie. R(Δ) = 1 when Δ = 0. It is also intuitively clear that R(Δ) -> 0 at very large time-separations Δ. To give a physical example, if the wind is blowing upward now, in all probability it will still be blowing upward 1 microsecond from now - but we have no idea whether up or down one hour from now.

Now Δ has units of time, and R is dimensionless, so the area under a graph of R(Δ) versus Δ is a time. This area is called the "autocorrelation timescale" (Γ), and it is a good estimate of the time over which the signal y(t) becomes decorrelated from itself. If we take samples at intervals exceeding Γ, those samples will be effectively independent.
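
Here is a rough sketch (numpy) of how one might estimate R(Δ) and Γ from a discretely sampled signal. The first-order autoregressive series, the seed, the 200-lag cutoff, and the convention of integrating R only out to its first zero crossing are all illustrative assumptions, not part of the definition above:

```python
import numpy as np

# hypothetical stationary signal: a first-order autoregressive series whose
# true autocorrelation is R(k*dt) = a**k, giving a timescale ~ dt/(1 - a)
rng = np.random.default_rng(4)
dt, a, n = 0.1, 0.9, 20000
y = np.zeros(n)
for i in range(1, n):
    y[i] = a * y[i - 1] + rng.normal()

yp = y - y.mean()                  # fluctuations about the mean
var = np.mean(yp**2)

def R(lag):
    # sample autocorrelation at a lag of "lag" samples
    return np.mean(yp[:n - lag] * yp[lag:]) / var

lags = np.arange(0, 200)
Rvals = np.array([R(k) for k in lags])

# autocorrelation timescale: area under R(lag*dt), here truncated (a common
# practical convention) at the first zero crossing of the estimated R
first_zero = np.argmax(Rvals <= 0) if np.any(Rvals <= 0) else len(Rvals)
gamma = np.sum(Rvals[:first_zero]) * dt

print(gamma, dt / (1 - a))         # estimate versus the expected ~1.0
```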

8.0 Sampling Distribution of the Mean: the Central Limit Theorem

If a random variable x belongs to (any) probability distribution having mean μ and variance σ², then according to the "Central Limit Theorem" an average

⟨x⟩ = 1/N Σi xi

has a "sampling distribution" which is "asymptotically" Gaussian with mean μ and variance σ2/N.

What does this mean? Well, the mean value ⟨x⟩ is itself random, since the xi are random. In different trials we will get different sample means ⟨x⟩. So ⟨x⟩ itself has a probability distribution. Provided N is "large" (say > 15; hence the proviso "asymptotically"), that distribution is Gaussian (= Normal) with mean μ and variance σ²/N.
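
A quick numerical illustration (numpy; the exponential parent pdf, the sample size N = 30 and the number of trials are arbitrary choices): the sample means cluster around μ with variance close to σ²/N even though the parent pdf is far from Gaussian.

```python
import numpy as np

# draw many samples of size N from a decidedly non-Gaussian pdf
# (exponential with scale 1: mean 1, variance 1) and look at the
# spread of the resulting sample means
rng = np.random.default_rng(5)
N, trials = 30, 20000

means = rng.exponential(scale=1.0, size=(trials, N)).mean(axis=1)

print(means.mean())    # ~ mu          (= 1.0)
print(means.var())     # ~ sigma^2 / N (= 1/30, about 0.033)
```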

9.0 Confidence in Measurements of a Random Variable

Suppose I want to estimate the true average value of a fluctuating variable x(t) over an interval spanning t=(0,T). My estimate is a sample mean, mx = ⟨x⟩ (these are just alternative notations for a sample mean). How many measurements (N) must I take to ensure that with 95% probability (ie. in 95% of all such experiments) my sample mean

⟨x⟩ = 1/N Σi xm(i Δt)

lies in the range μ ± ε? Here xm(i Δt) = xmi is the measured value of x (which may differ from the true x, for example if our instrument is too slow to respond to rapid fluctuations in the signal x) at time t = i Δt, where Δt is the "sampling interval" or "time between samples." And ε is the error band.

The notation xm emphasizes that our measured x's are not necessarily equal to the true x's. We can assume our instrument is accurate, in the sense that if it is presented for a very long period of time with a steady value of x then it measures xm = x. In the jargon, the device has correct "DC" or steady-state response. But we will assume it is not infinitely fast - it is characterised by a time constant τ, so that if presented with a step change in x it achieves only about 63% of the full response in time τ, and about 95% in 3τ. Therefore, we must distinguish xmi (the measurement at time ti) from the true value x(ti). In particular, this means that our measured sample standard deviation (smx) may well differ from the true standard deviation σx. If our instrument is "slow" then smx < σx, while if our instrument is very noisy, perhaps smx > σx.

The CLT (Central Limit Theorem) tells us that if I were to perform this exercise many times, ie. determine mx many times, then my population of measured means would cluster in a Gaussian distribution, centred on the true (population) mean μ, and with standard deviation smx/√N.

Now we know that a Gaussian distribution (the "bell-shaped curve") has about 68% of its area within ±1 standard deviation of the mean, about 95% within ±2 standard deviations, and about 99.7% within ±3 standard deviations. It follows that, provided the number of samples N determining the mean ⟨x⟩ is "large," with 95% probability ⟨x⟩ lies in the range μ ± 2 smx/√N.

We can rearrange this result to give the answer to our starting question ... if I take

N >= 4 (smx/ε)² measurements, then with 95% probability (ie. in 95% of all such experiments) my sample mean ⟨x⟩ = 1/N Σi xm(i Δt) lies in the range μ ± ε. The slower our instrument, the smaller will be our measured variance smx² and thus the smaller the required number of samples.
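
In practice the arithmetic is trivial; here is a two-line sketch (the values of smx and ε are made up purely for illustration):

```python
import numpy as np

# how many independent samples for 95% confidence that the sample mean
# lies within mu +/- eps?  N >= 4 (s_mx / eps)^2
s_mx = 0.8    # measured standard deviation of x (illustrative value)
eps = 0.1     # desired half-width of the error band (illustrative value)

N_required = int(np.ceil(4.0 * (s_mx / eps)**2))
print(N_required)   # 256 samples for these illustrative numbers
```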

The only restriction on this result is that the samples xmi must be independent. How do we ensure they are? What if we packed a million measurements of x into a microsecond? Intuition says they would not be independent. But can we be quantitative about it? Yes, by appealing to the concept of the autocorrelation (self-correlation) of a signal x(t) and the corresponding autocorrelation timescale: if the samples are separated in time by intervals Δt greatly exceeding the autocorrelation timescale, they are independent.


