Fundamentals of Statistics: Distributions
On some level, statistics is the process of describing distributions, which help us describe the probability of an event given a certain set of circumstances. I’ll get into the particulars of a few different types of statistical distributions, like the binomial, Poisson, and Gaussian distributions, which are commonly used to describe scientific data. Before we get into that, though, I want to talk a bit about the statistical distribution as a concept, particularly the idea of a parent distribution and a sample distribution.
A lot of physics boils down to trying to characterize random processes. The ubiquitous, quintessential example is a coin flip. If you want to test the fairness of a coin, you might flip it many times and count the number of heads. What you would have collected is a sample distribution drawn from the parent distribution that describes the randomness of the coin. If you attempt this experiment with a real coin, you will likely get something very close to the canonical distribution for a coin flip, the binomial distribution, because for a probability close to 0.5 the sample tends to lie very close to the parent.
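If you don't have the patience to flip a real coin a few thousand times, the experiment can be sketched in Python. This is only a minimal illustration using the standard library; the function names (`flip_coin`, `binomial_pmf`, `sample_distribution`) and the choice of 10 flips repeated 1000 times are my own assumptions, not anything canonical.

```python
import random
from math import comb


def flip_coin(n_flips, p_heads=0.5, rng=None):
    """Return the number of heads in n_flips flips of a coin with P(heads) = p_heads."""
    rng = rng or random.Random()
    return sum(rng.random() < p_heads for _ in range(n_flips))


def binomial_pmf(k, n, p):
    """Parent distribution: probability of exactly k heads in n flips."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def sample_distribution(n_experiments, n_flips, p_heads=0.5, seed=None):
    """Repeat the n_flips experiment and return the fraction of runs giving each head count."""
    rng = random.Random(seed)
    counts = [0] * (n_flips + 1)
    for _ in range(n_experiments):
        counts[flip_coin(n_flips, p_heads, rng)] += 1
    return [c / n_experiments for c in counts]


if __name__ == "__main__":
    n_flips = 10  # assumed experiment size, chosen only for illustration
    sample = sample_distribution(1000, n_flips)
    for k in range(n_flips + 1):
        parent = binomial_pmf(k, n_flips, 0.5)
        print(f"{k:2d} heads: sample {sample[k]:.3f}   parent {parent:.3f}")
```

Each run prints slightly different sample fractions next to the same parent probabilities, which is exactly the sample/parent distinction.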
For other distributions the distinction between sample and parent becomes much more apparent. I threw together a little Python script you can play with yourself that generates 100 numbers from a randomly seeded pseudo-random number generator. It’ll then plot these data on the same axes as the probability density function of the parent distribution. Potential results might look like this. The values of the mean and standard deviation for a Gaussian distribution are listed in the legend, as well as the calculated values for a sample of 100 “measurements”. A common convention is to refer to the statistics of the parent distribution by the Greek letters $\mu$ and $\sigma$ and the sample statistics by the roman letters $\bar{x}$ and $s$ (or $s^2$ for the variance).
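For anyone who wants to try this without the plotting, here is a stripped-down sketch of the same idea. It is not the script mentioned above, just a stand-in under my own assumptions: the parent mean and standard deviation (`MU = 5.0`, `SIGMA = 2.0`) are arbitrary illustrative values, and `draw_sample` is a hypothetical helper name.

```python
import random
import statistics

MU = 5.0     # parent mean (arbitrary value, assumed for illustration)
SIGMA = 2.0  # parent standard deviation (arbitrary value, assumed)
N = 100      # number of "measurements"


def draw_sample(mu, sigma, n, seed=None):
    """Draw n values from a Gaussian parent distribution.

    With seed=None the generator is seeded from the operating system,
    so every run gives a different sample.
    """
    rng = random.Random(seed)
    return [rng.gauss(mu, sigma) for _ in range(n)]


if __name__ == "__main__":
    sample = draw_sample(MU, SIGMA, N)
    x_bar = statistics.fmean(sample)       # sample mean
    s = statistics.stdev(sample, x_bar)    # sample standard deviation (N - 1 denominator)
    print(f"parent: mean = {MU:.3f}, standard deviation = {SIGMA:.3f}")
    print(f"sample: mean = {x_bar:.3f}, standard deviation = {s:.3f}")
```

Run it a few times and the sample values wander around the parent values without ever quite landing on them.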
As you can see, the calculated values from the sample data are not the same as those from the parent distribution. Note that the mean and most probable value for the sample data are also not the same. This is a relatively small sample, so all this is unsurprising. If we were to increase the sample size, the sample would become more and more similar to its parent distribution. In fact, we assume that the sample distribution becomes the parent distribution in the limit of an infinite number of measurements.
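The effect of sample size is easy to check numerically. The sketch below assumes a standard Gaussian parent (mean 0, standard deviation 1) and a hypothetical helper `sample_stats`; the particular sample sizes are arbitrary.

```python
import random
import statistics


def sample_stats(n, mu=0.0, sigma=1.0, seed=None):
    """Draw n Gaussian values and return (sample mean, sample standard deviation)."""
    rng = random.Random(seed)
    sample = [rng.gauss(mu, sigma) for _ in range(n)]
    return statistics.fmean(sample), statistics.stdev(sample)


if __name__ == "__main__":
    # Parent values are mean 0 and standard deviation 1 (assumed for illustration).
    for n in (10, 100, 1_000, 10_000, 100_000):
        x_bar, s = sample_stats(n)
        print(f"N = {n:>7}: mean = {x_bar:+.4f}, standard deviation = {s:.4f}")
```

The sample values should drift toward 0 and 1 as N grows, though any individual run can fluctuate, especially at small N.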
Both $\mu$ and $\bar{x}$ are straightforward mean values calculated by adding up the measurements and dividing by their number. The standard deviations $\sigma$ and $s$ are defined in terms of the means and are the root mean square of the deviations from the mean value (as illustrated in the title bar above). The square of the standard deviation, the variance, is the second central moment of the distribution. Whatever the nature of the distribution under study might be, these two quantities are usually sufficient to describe it, essentially telling you the mean value (which is also the most probable value for symmetric, single-peaked distributions) and how far a data point is likely to fall from the mean.
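Written out, the definitions look roughly like this. The notation ($x_i$ for the individual measurements, $N$ for their number) is my own choice, and the $N - 1$ in the sample variance is a common convention rather than something fixed by the description above.

```latex
\begin{align*}
  \mu      &= \lim_{N \to \infty} \frac{1}{N} \sum_{i=1}^{N} x_i
  &
  \bar{x}  &= \frac{1}{N} \sum_{i=1}^{N} x_i
  \\
  \sigma^2 &= \lim_{N \to \infty} \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2
  &
  s^2      &= \frac{1}{N - 1} \sum_{i=1}^{N} (x_i - \bar{x})^2
\end{align*}
```

The parent quantities are limits over infinitely many measurements, while the sample quantities are what you can actually compute from a finite data set.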
Next, I’ll take a look at a few common distributions and talk about what situations they are useful to describe. First will be the binomial distribution, which describes the outcomes of measuring “true” values given a number of measurements and a probability. Then, the Poisson distribution, which is a limiting case of the binomial distribution when the number of “true” events is much less than the number of measurements. Lastly, the Gaussian or normal distribution, a symmetric distribution that can describe a large number of stochastic systems.
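As a small preview of the binomial and Poisson relationship, the two can be compared directly. This is a sketch under assumed numbers (1000 measurements with a per-measurement probability of 0.003, so about 3 "true" events on average); `binomial_pmf` and `poisson_pmf` are hypothetical helper names.

```python
from math import comb, exp, factorial


def binomial_pmf(k, n, p):
    """Probability of exactly k "true" outcomes in n measurements."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def poisson_pmf(k, lam):
    """Probability of exactly k events when the mean number of events is lam."""
    return exp(-lam) * lam**k / factorial(k)


if __name__ == "__main__":
    n, p = 1000, 0.003  # assumed values: many measurements, few "true" events
    lam = n * p         # mean number of events, used as the Poisson parameter
    for k in range(11):
        b = binomial_pmf(k, n, p)
        q = poisson_pmf(k, lam)
        print(f"k = {k:2d}: binomial {b:.5f}   Poisson {q:.5f}")
```

With these numbers the two columns should agree closely; making `p` larger at fixed `n` is a quick way to watch the approximation degrade.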
Tags: math, statistics