# Demonstrations of Bayesian Estimation of Statistical Parameters

These demonstrations aim to show how statistical inference from data can be performed using Bayesian sampling methods. In these demonstrations you will see how the parameters of simple descriptive models of data can be estimated by Bayesian sampling. The input is a sample of data and a model; the output is a set of probability distributions of the most likely values of the parameters of that model. This method of data analysis stands in contrast to conventional techniques of null-hypothesis significance testing (NHST), in which what is estimated from the data is instead the parameters of the supposed underlying population that gave rise to the sample. Bayesian sampling methods remove many of the weaknesses of the NHST approach, replacing the often misunderstood p-values and confidence intervals. In the Bayesian approach, statistical inference arises from an analysis of the posterior probability distributions of the model parameters: are there credible differences in the parameter distributions across experimental conditions?

These demonstrations are inspired by the book Doing Bayesian Data Analysis by John Kruschke, which shows how Bayesian sampling methods can be used for many different types of statistical analysis (comparison of means, linear and logistic regression, comparison of relative frequencies, modelling of ordinal data). The implementation of the demonstrations is made possible because of bayes.js, an implementation of an Adaptive Metropolis-within-Gibbs sampler written by Rasmus Bååth.

## How Bayesian sampling works

The essence of the Bayesian approach is to find the most credible parameters of a model that you choose to describe your data. For example, if you have a sample of data and you think it would be well modelled by a normal distribution, you may want to find the most credible values for the mean and the standard deviation of that normal distribution. In such a case we call the mean and the standard deviation the parameters of our model, and use Bayesian sampling to find their most credible values given our sample.

The sampling method is described in the figure below. This describes the basic approach for one parameter, although typically a model will have a number of parameters and the method will explore all the parameters in parallel.

1. We start by specifying the prior distribution for the parameter. This describes any existing knowledge we have for the most credible values of the parameter before we have looked at the sample. In many situations we can put broad unspecific distributions here, since the sample itself will provide sufficient constraints on typical parameter values. We might, for example, use a uniform probability distribution over some large range, saying that we have no other information apart from a minimum and maximum value for the parameter. The use of a broad unspecific prior may not be suitable when the sample is small. For example, imagine we have only a single value in our sample! What we want then is to explore how the additional knowledge that comes from this sample updates our prior knowledge: i.e. how does this new piece of evidence change our previous assumptions?
2. The sampling process can then extract typical values for the parameter from the prior distribution - it does this in a way which ensures that the probability of the sampled value matches the probability of the prior distribution. For a uniform distribution between min and max, sample values would be chosen with equal likelihood between min and max. For a normally distributed prior, samples would be more likely to be close to the mean, and less likely at some distance from the mean. If we looked at the distribution of the chosen samples at this point, we would see that they had the same probability distribution as the prior (we do this in the first demonstration below).
3. Given our sampled value of the parameter, we now estimate how likely this value would be given the sample data. Let's say we are estimating the mean of a normal distribution with a known standard deviation sigma. We need to calculate how likely the data would be given a generating distribution with the chosen mean and the given sigma. We can do this by multiplying together the probabilities of each datum given the distribution: that is, we calculate the probability that each datum came from a normal distribution described by the mean and sigma parameters, then multiply all these together to get the probability that the whole sample was generated by that distribution.
4. We now have a sampled parameter value and a probability. If that probability is high then the parameter value is credible; if the probability is low, then it is not. So at this point we make a random decision to accept or reject the parameter value on the basis of its probability given the data. On the whole, likely values will be accepted and unlikely values rejected.
5. Finally we collect all the accepted samples into a distribution. This will be a distribution of the most likely parameter values given the data: this is the posterior distribution. It tells us which parameter values of our supposed model are most credible.
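The five steps above can be sketched in code. The following is a minimal sketch of a plain Metropolis sampler (a simplified relative of the Adaptive Metropolis-within-Gibbs sampler in bayes.js, not its actual implementation), estimating the mean of a normal distribution with known standard deviation under a uniform prior; the data and all numeric settings are illustrative:

```python
import math
import random

def log_likelihood(mu, data, sigma=1.0):
    # Step 3: the product of per-datum probabilities becomes a sum in log space.
    return sum(-0.5 * math.log(2 * math.pi * sigma ** 2)
               - (x - mu) ** 2 / (2 * sigma ** 2) for x in data)

def metropolis(data, n_samples=5000, prior_min=-100.0, prior_max=100.0,
               step=0.5, seed=1):
    random.seed(seed)
    mu = random.uniform(prior_min, prior_max)   # step 2: start from the uniform prior
    samples = []
    for _ in range(n_samples):
        proposal = mu + random.gauss(0.0, step)
        if prior_min <= proposal <= prior_max:  # step 1: zero prior mass outside the range
            # Step 4: accept or reject at random, based on the likelihood ratio.
            log_ratio = log_likelihood(proposal, data) - log_likelihood(mu, data)
            if random.random() < math.exp(min(0.0, log_ratio)):
                mu = proposal
        samples.append(mu)                      # step 5: collect into the posterior
    return samples

data = [4.8, 5.1, 5.3, 4.9, 5.2]
posterior = metropolis(data)
kept = posterior[1000:]                         # discard early "burn-in" samples
estimate = sum(kept) / len(kept)
```

Working in log space turns the product of probabilities in step 3 into a sum, which avoids numerical underflow for larger samples; discarding the early samples removes the sampler's initial wandering before it settles on credible values.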

In summary we supply a sample of data and a model of the data described by one or more parameters together with our estimate of the prior distribution of those parameters. Bayesian sampling evaluates a range of parameter values and collects those that are most credible into a distribution that matches the posterior probability of the parameters given the data. We can then study the shape of the posterior distribution and draw inferences about the data from the range of credible model parameters.

## Fitting a t-Distribution

The first demonstration shows how a t-distribution can be fitted to a single set of data points. We use a t-distribution not because what we are doing is related to a t-test, but because it provides a model of a data distribution that is robust to the presence of outliers in the data (unlike, say, a normal distribution). The parameters of the t-distribution are μ: the location, σ: the scale, and ν: the degrees of freedom. The μ parameter is analogous to the mean of the normal distribution, the σ parameter analogous to the standard deviation. The ν parameter controls the size of the distribution tails, where as ν→∞ the t-distribution approximates the normal distribution, and lower values indicate heavier tails. In practice a value of ν > 30 would be indistinguishable from the normal distribution for most purposes.
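As a concrete sketch of why the t-distribution is robust (using the standard location-scale form of the t density, not code from the demonstration itself), the log-density below shows how ν controls the tails: an outlier five scale units from the location is far more probable under a heavy-tailed ν = 2 than under a near-normal ν = 100:

```python
import math

def t_logpdf(x, mu, sigma, nu):
    # Log-density of the location-scale Student-t distribution.
    z = (x - mu) / sigma
    return (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
            - 0.5 * math.log(math.pi * nu) - math.log(sigma)
            - (nu + 1) / 2 * math.log(1 + z * z / nu))

# With nu = 1 the t-distribution is the Cauchy: density 1/pi at its centre.
cauchy0 = t_logpdf(0.0, 0.0, 1.0, 1.0)

# An outlier 5 scale units out is far more probable under heavy tails (nu = 2)
# than under a near-normal nu = 100, so it drags the fit around far less.
tail_gain = t_logpdf(5.0, 0.0, 1.0, 2.0) - t_logpdf(5.0, 0.0, 1.0, 100.0)
```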

In the demonstration, the first row of graphs represents the prior distributions we put on the three parameters. These prior distributions describe our initial beliefs in the values of the model parameters without having seen the data. In this instance, we choose very broad and unspecific prior distributions indicating we have no strong prior assumptions. The prior distribution for μ is given as a normal distribution with a mean of 0 and a standard deviation of 500. The prior distribution for σ is given as a uniform distribution between 0 and 100. The prior distribution for ν (actually it's a distribution of ν-1) is given as an exponential distribution with a mean of 29. When the Start Sampling button is pressed, samples are drawn from these prior distributions and the histograms are updated to show the range of samples collected. You will see that over time the samples follow the broad shape of the specified prior distributions.

The second row of graphs in the demonstration show the posterior distributions of the parameters - that is the likely values of the parameters given both the prior distributions and the data. When the Start Sampling button is pressed, samples are drawn from the prior distributions and the data is used to rate the likelihood of the sampled parameter values. The sampling process then adapts to the data such that the most likely parameter values (given the data) are sampled more often. Thus the posterior distributions of the parameters end up displaying the range of credible values for the parameters given the priors and the data.

The third row shows first the data being analysed. You can cut and paste your own numeric data here (separated by spaces or commas). Then a histogram of the data is shown in the second column, overlaid with lines representing the probability densities of some of the t-distribution parameter samples. Finally, some statistics of the posterior distributions are shown in the bottom right table. These include the mode (or most credible) value of the parameter and also the Highest Density Interval (HDI), which describes the range of parameter values that includes 95% of the samples. The HDI is a true "confidence interval", unlike the confidence intervals found in NHST: it actually does tell you the range of parameter values which are most likely given the data.
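One common way to compute an HDI from posterior samples (a sketch, not the demonstration's own code) is to take the narrowest interval that contains 95% of the sorted samples:

```python
def hdi(samples, mass=0.95):
    # The HDI is the narrowest interval containing `mass` of the samples.
    s = sorted(samples)
    n_kept = int(len(s) * mass)
    # Slide a window of n_kept samples along the sorted list; keep the narrowest.
    widths = [(s[i + n_kept - 1] - s[i], i) for i in range(len(s) - n_kept + 1)]
    width, i = min(widths)
    return s[i], s[i + n_kept - 1]

# Uniform samples 0..99: every 95-sample window is equally wide, first one wins.
lo, hi = hdi(list(range(100)))
# Skewed samples: the densest half of the data is the tight cluster at 2.
narrow = hdi([1, 2, 2, 2, 3, 10], mass=0.5)
```

Because it seeks the narrowest interval, the HDI hugs the densest region of the posterior, which is why it can be asymmetric for skewed distributions.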

## Comparison of two independent samples

This demonstration shows how Bayesian sampling methods can be used to compare two independent samples of data to draw a conclusion about whether there is evidence that the samples were drawn from the same or different underlying populations. This is the kind of question that would require use of a t-test in the NHST approach. This use of Bayesian methods to perform an alternative to the t-test is called the BEST (Bayesian Estimation Supersedes the t Test) approach.

The two data samples are shown at the left in the bottom row. You can cut and paste your own data here (separated by commas or spaces). The model we will be fitting to the data comprises two t-distributions. Again the use of t-distributions is not related to the t-test but simply because t-distributions provide a robust model of the data distribution. We use one location parameter for each sample (μ1 & μ2), one scale parameter for each sample (σ1 & σ2), and one normality ("degrees-of-freedom") parameter that is used for both samples (ν).

Histograms of the two data sets are shown in the first column of the first and second rows. Overlaid on these are example distributions chosen from random sets of parameters found during sampling. The posterior probability distributions of the location parameter for both samples are shown in the middle of the top row. The posterior probability distributions of the scale parameter for both samples are shown on the right of the top row. The posterior probability distribution of the normality parameter (used for both samples) is shown in the middle column of the third row.

The second row shows distributions of the differences in the location and scale parameters for the two samples. In the middle column is the posterior probability distribution of the difference in location (μ1−μ2). You can see that the distribution of likely differences is centred about 0.15. More importantly, the 95% HDI runs from about 0.1 to 0.2. The fact that the HDI does not include 0 means that we can say with 95% confidence that there is a significant difference between the samples. The right column of the second row shows the distribution of the difference in scales (σ1−σ2). You can see that the distribution of the difference in scale is centred on about 0.03 and the 95% HDI extends from 0 to 0.08. Since the HDI does include 0, we do not conclude that there is a significant difference between the scale parameters of these two samples.

Lastly, the right column of the third row shows the distribution of the estimated effect size - this is the difference between the sample means measured in units of the average standard deviation of the samples. The effect size distribution is centred on a value of 1.6, with a 95% HDI extending from 0.9 to 2.35.
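In the BEST approach this effect size is computed sample-by-sample from the posterior chains; a sketch, assuming the usual definition of (μ1 − μ2) divided by the root mean square of the two scale parameters, with illustrative chain values:

```python
import math

def effect_size_samples(mu1, mu2, sigma1, sigma2):
    # One effect-size value per posterior sample: location difference
    # divided by the average (root-mean-square) of the two scales.
    return [(m1 - m2) / math.sqrt((s1 * s1 + s2 * s2) / 2)
            for m1, m2, s1, s2 in zip(mu1, mu2, sigma1, sigma2)]

# Two illustrative posterior samples per parameter.
es = effect_size_samples([1.0, 1.2], [0.0, 0.1], [1.0, 1.0], [1.0, 1.0])
```

Computing the effect size per sample, rather than from point estimates, means its posterior distribution (and hence its HDI) comes for free from the chains.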

## Linear Regression

This demonstration shows how Bayesian sampling can be used to find the most credible parameters for linear regression, that is, the most likely straight lines that approximate a set of points on the 2D plane.

As in previous demonstrations, some sample data can be seen at the bottom left of the display. The data consists of paired X & Y values representing points on the scatter plot shown in the middle column at the bottom. You can type new values into the data table, or you can upload a text file of new X,Y values. The text file should be in comma-separated value (CSV) format. Each line in the file should have two columns representing the X and Y values of a single data point separated by a comma. Any lines containing non-numeric data are ignored.
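A file in this format can be parsed with a few lines of code; this sketch (not the demonstration's own parser) keeps only lines that contain exactly two numeric, comma-separated values:

```python
def parse_xy(text):
    # Keep only lines that parse as two comma-separated numbers.
    points = []
    for line in text.splitlines():
        parts = line.split(",")
        if len(parts) != 2:
            continue
        try:
            points.append((float(parts[0]), float(parts[1])))
        except ValueError:
            pass  # non-numeric line (e.g. a header) is ignored
    return points

pts = parse_xy("x,y\n1,10\n2,12.5\nbad line\n3,15")
```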

The linear regression model has three parameters: the intercept, the slope and the standard deviation. The last represents the size of the error between the actual Y values and the predicted Y values given the intercept and slope. The prior distributions for both the intercept and the slope are normal distributions with a mean of 0 and a standard deviation of 10. The prior distribution for the error is a uniform distribution between 0 and 100.
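Putting the normal likelihood and these priors together, the (unnormalised) log-posterior that the sampler explores can be sketched as follows; the function and data are illustrative, not the demonstration's code:

```python
import math

def log_posterior(intercept, slope, sd, xs, ys):
    # Priors as described above: Normal(0, 10) on intercept and slope,
    # Uniform(0, 100) on the error standard deviation.
    if not (0 < sd < 100):
        return -math.inf  # zero prior probability outside the uniform range
    lp = -(intercept ** 2 + slope ** 2) / (2 * 10 ** 2)
    # Likelihood: each y is normal about the predicted line with spread sd.
    for x, y in zip(xs, ys):
        resid = y - (intercept + slope * x)
        lp += -0.5 * math.log(2 * math.pi * sd ** 2) - resid ** 2 / (2 * sd ** 2)
    return lp

xs, ys = [0.0, 1.0, 2.0], [5.0, 15.0, 25.0]  # lies exactly on y = 5 + 10x
better = log_posterior(5.0, 10.0, 1.0, xs, ys) > log_posterior(0.0, 0.0, 1.0, xs, ys)
```

The sampler visits parameter values in proportion to this quantity, so lines close to the data are sampled far more often than lines far from it.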

The histograms on the top of the display show the posterior probability distributions of the three regression model parameters given the data. The data shown in the figure above were randomly generated to have an intercept of 5, a slope of 10 and a random error with standard deviation 1. As you can see, these values were recovered by the sampling method.

## One-way Analysis of Variance

This is a demonstration of how Bayesian sampling methods can be applied to the comparison of means of multiple samples. Here we ask whether multiple samples could have been drawn from a single underlying population, or whether there is evidence for a difference in means. When there are two groups, this is analogous to the NHST two independent samples t-test; when there are more than two groups, it is analogous to a one-way analysis of variance.

The samples of data are shown at the bottom left in a table with two columns. The second column is a sampled value, while the first column specifies which sample group the datum comes from. The group identifiers may be numbers or names. There should be a minimum of three groups. Some simulated data are provided, but you can also type in your own data, or upload a CSV file. The CSV file should have two items per line, mirroring the data table. You can download an example CSV file to check the format.

The middle display in the bottom row shows boxplots of the distribution of sample values found in each group. A boxplot is a convenient way to visualise the range of data values in a sample. The box itself describes the inter-quartile range, with the median value drawn as a thicker central line. The whiskers at the ends of the box describe the total range of the data, excluding outliers. Outliers are plotted as separate circles beyond the whiskers. After sampling is complete, the boxes are also overlaid with example distributions based on the model parameters.

The model we are using to describe this data set comprises a number of parameters together with their prior distributions. In the table below we describe the parameters and the form of the priors:
| Parameter | Description | Prior Distribution |
|---|---|---|
| Baseline | Y value common to all groups | Normal distribution centered on the mean of y with a standard deviation of 5 times the s.d. of all y values |
| Group σ | Standard deviation of the group means about the baseline | Uniform distribution from 0.1 to 10 times the s.d. of all y values |
| σ | Standard deviation of y values within each group | Uniform distribution from 0.1 to 10 times the s.d. of all y values |
| Group μ | Group means (one per group) | Normal distribution with zero mean and a standard deviation of 10 times the s.d. of all y values |
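Under this model each datum is treated as normally distributed about its group's mean, where that mean is the shared baseline plus a per-group deflection; a sketch of the likelihood (illustrative code, not the demonstration's implementation):

```python
import math

def log_likelihood(baseline, group_mu, sigma, groups, ys):
    # Each datum y is modelled as Normal(baseline + its group's deflection, sigma).
    lp = 0.0
    for g, y in zip(groups, ys):
        mean = baseline + group_mu[g]
        lp += -0.5 * math.log(2 * math.pi * sigma ** 2) - (y - mean) ** 2 / (2 * sigma ** 2)
    return lp

# Two hypothetical groups clustered around 10 and 20.
groups = ["a", "a", "b", "b"]
ys = [9.0, 11.0, 19.0, 21.0]
# Deflections of -5 and +5 about a baseline of 15 fit far better than none.
better = (log_likelihood(15.0, {"a": -5.0, "b": 5.0}, 1.0, groups, ys)
          > log_likelihood(15.0, {"a": 0.0, "b": 0.0}, 1.0, groups, ys))
```

The Group σ parameter in the table above then acts as a prior on how far these deflections are allowed to spread about the baseline.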

When the Start Sampling button is pressed, samples are drawn from the prior distributions for all the model parameters and evaluated against the supplied data. Those parameter values which have a high likelihood are sampled more frequently so that the distributions of model parameter samples describe the posterior probability of the model parameters given the data.

The top row of the display shows the posterior probability distributions of the Baseline, Group σ and σ parameters. The left graph in the middle row shows the posterior probability distributions of the Group μ parameters.

Unlike NHST one-way analysis of variance, the Bayesian sampling approach does not provide a single overall test of the hypothesis that all samples were drawn from the same underlying population. Instead we need to compare the means of the groups to see if there is evidence from the posterior distributions that their likely values are significantly different to one another. The middle graph in the middle row shows the posterior distribution of the difference between group 1 mean and group 2 mean. These data are collected sample-by-sample and show the range of likely differences between the two group means. You can see from the figure above that in this case the modal difference is -11 and the HDI extends from -13 to -9. Since the HDI does not include zero, we conclude that there is a significant difference between these two groups.

## Likert Rating Scale

A Likert scale is widely used in psychology to collect information about the attitudes of individuals to some proposition. Typically subjects are given a choice from one of five alternatives that form an ordinal scale such as: strongly agree / agree / don’t know / disagree / strongly disagree. At the end of the data collection, the number of times each alternative was chosen is tallied and may be plotted as, say, a bar graph showing the relative frequency of the chosen alternatives.

A problem with Likert scales is how best to perform statistical inference using the set of counts. Since the scale is ordinal, it does not make sense to compute a mean using a numerical score for each choice (point 5 on the scale is not "five times" point 1 on the scale). How best then to determine whether a particular set of counts is different to the null hypothesis (of all points equal) or is different to another set of counts (from, say, a different subject group answering the same question)? In Doing Bayesian Data Analysis, John Kruschke presents a neat approach to the comparison of two sets of ratings. This is a demonstration of his basic approach, which can also be used with your own data.

In this demonstration, two sets of Likert scale ratings are shown in the bottom left. We will use the Bayesian sampling method to answer the question whether these two sets of ratings are significantly different. Bar charts of the ratings are given on the bottom row overlaid with some hypothetical underlying distributions that we find from sampling. The model we use for these data is shown on the right. We assume that the 5 counts of each rating set are drawn from an underlying normal distribution which has been divided into five sections. In the figure, these sections are shown by the vertical lines. That is, the counts represent the five areas under a normal curve divided by four threshold lines. There are four parameters to the model: the overall mean μ, standard deviation σ, and the position of thresholds t2 and t3. Thresholds t1 and t4 are fixed (in this demonstration they are fixed at values of 1.5 and 4.5). Our sampling process will try to determine the most credible values for these four parameters given the data.

For the process of comparing two rating scale sets, we allow each set to have its own overall mean and standard deviation, but force both sets to share the same threshold values t2 and t3. The table below summarises the parameters and their prior distributions:
| Parameter | Description | Prior Distribution |
|---|---|---|
| μ1 | Mean of overall distribution of set 1 | Normal distribution with a mean of 3 and a standard deviation of 5 |
| μ2 | Mean of overall distribution of set 2 | Normal distribution with a mean of 3 and a standard deviation of 5 |
| σ1 | Standard deviation of overall distribution of set 1 | Uniform distribution from 0.005 to 50 |
| σ2 | Standard deviation of overall distribution of set 2 | Uniform distribution from 0.005 to 50 |
| t2 | Location of threshold 2 for both sets | Normal distribution with a mean of 2.5 and a standard deviation of 0.5 |
| t3 | Location of threshold 3 for both sets | Normal distribution with a mean of 3.5 and a standard deviation of 0.5 |
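Given a set of parameter values, the model's five predicted cell probabilities are the areas under the normal curve between successive thresholds; a sketch (using the fixed thresholds t1 = 1.5 and t4 = 4.5 from the demonstration, with illustrative example values for the sampled parameters):

```python
import math

def normal_cdf(x, mu, sigma):
    # Standard normal CDF via the error function.
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

def cell_probs(mu, sigma, thresholds=(1.5, 2.5, 3.5, 4.5)):
    # Areas under the normal curve cut by the four thresholds; the outer
    # two cells absorb the tails below t1 and above t4.
    cuts = [0.0] + [normal_cdf(t, mu, sigma) for t in thresholds] + [1.0]
    return [cuts[i + 1] - cuts[i] for i in range(5)]

# Example: a distribution centred on the middle rating point.
p = cell_probs(3.0, 1.0)
```

The sampler then rates each candidate (μ, σ, t2, t3) by how well these five probabilities, scaled by the total count, match the observed tallies.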

The top row of the demonstration shows posterior distributions. At the top left are the posterior distributions of the two mean parameters μ1 and μ2. In the middle are the posterior distributions of the two standard deviation parameters σ1 and σ2. You can see that while the distributions have different means, they have very similar standard deviations. At the top right is the distribution of the difference between μ1 and μ2. From the 95% HDI of this distribution it is easy to see that the credible values of the difference do not include zero. We conclude that there is a significant difference between the two sets of ratings.