
Assumptions in linear correlations

Given how simple Karl Pearson’s Coefficient of Correlation is, the assumptions behind it are often forgotten. It is important to ensure that these assumptions hold true for your data; otherwise, Pearson’s Coefficient may be an inappropriate measure.

The assumptions and requirements for computing Karl Pearson’s Coefficient of Correlation are:

1. Normality means that the data sets to be correlated should approximate the normal distribution. In such normally distributed data, most data points tend to hover close to the mean.

2. Homoscedasticity comes from the Greek prefix homo, meaning ‘same’, along with the Greek word skedastikos, which means ‘able to disperse’. Homoscedasticity therefore means ‘equal variances’: the variance of the error term is the same for all values of the independent variable. If the error variance is smaller for one range of values of the independent variable and larger for another, homoscedasticity is violated. It is quite easy to check for homoscedasticity visually by looking at a scatter plot: if the points spread evenly on both sides of the line of best fit along its whole length, the data is homoscedastic (a plotting sketch appears at the end of this post).

3. Linearity simply means that the data follows a linear relationship. Again, this can be examined with a scatter plot: if the data points follow a straight-line (rather than curved) relationship, the data satisfies the linearity assumption.

4. Continuous variables are those that can take any value within an interval; interval and ratio variables are both continuous. To compute Karl Pearson’s Coefficient of Correlation, both data sets must contain continuous variables. If even one of the data sets is ordinal, Spearman’s Coefficient of Rank Correlation is the more appropriate measure.

5. Paired observations means that the data must come in pairs: for every observation of the independent variable, there must be a corresponding observation of the dependent variable. We cannot compute a correlation coefficient if one data set has 12 observations and the other has 10.

6. No outliers should be present in the data. Outliers are not invalid data in themselves, but they can significantly skew the correlation coefficient and make it misleading. When does a data point become an outlier? In general, a data point that lies more than 3.29 standard deviations from the mean is considered an outlier. Outliers are easy to spot visually on a scatter plot, and a simple z-score check is sketched after this list.
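
The checks that do not need a plot, such as normality, paired observations, outliers beyond 3.29 standard deviations, and the fall-back to Spearman’s coefficient, can be scripted. The following is a minimal Python sketch assuming NumPy and SciPy are available; the function name check_correlation_assumptions and the 0.05 significance cut-off are illustrative assumptions, not prescribed by this article.

```python
import numpy as np
from scipy import stats

def check_correlation_assumptions(x, y):
    """Illustrative pre-checks before computing Pearson's r."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)

    # Paired observations: both data sets must have the same length.
    if len(x) != len(y):
        raise ValueError("x and y must contain paired observations")

    # Normality: Shapiro-Wilk test on each variable
    # (p > 0.05 means no strong evidence against normality).
    _, p_x = stats.shapiro(x)
    _, p_y = stats.shapiro(y)
    print(f"Shapiro-Wilk p-values: x={p_x:.3f}, y={p_y:.3f}")

    # Outliers: flag points more than 3.29 standard deviations from the mean.
    z_x, z_y = stats.zscore(x), stats.zscore(y)
    outliers = np.where((np.abs(z_x) > 3.29) | (np.abs(z_y) > 3.29))[0]
    print(f"Potential outliers at indices: {outliers.tolist()}")

    # Continuous data that passes the checks -> Pearson's coefficient;
    # otherwise Spearman's rank correlation is the safer choice.
    if p_x > 0.05 and p_y > 0.05 and len(outliers) == 0:
        r, p = stats.pearsonr(x, y)
        print(f"Pearson r = {r:.3f} (p = {p:.3f})")
    else:
        rho, p = stats.spearmanr(x, y)
        print(f"Spearman rho = {rho:.3f} (p = {p:.3f})")
```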

A scatter plot is invaluable for verifying most of these assumptions. That is why we suggest creating a scatter plot first, before computing the correlation coefficient, as in the sketch below.
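
As a hedged illustration of that workflow, the sketch below (assuming NumPy, SciPy and Matplotlib) draws a scatter plot with a line of best fit so linearity and homoscedasticity can be eyeballed, and only then computes Pearson’s r. The generated sample data is purely for demonstration.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Illustrative paired data; replace with your own two continuous variables.
rng = np.random.default_rng(42)
x = rng.normal(50, 10, 100)
y = 0.8 * x + rng.normal(0, 5, 100)

# Scatter plot with a line of best fit: a straight-line band of points,
# spread evenly on both sides of the line, suggests the linearity and
# homoscedasticity assumptions are reasonable.
slope, intercept = np.polyfit(x, y, 1)
xs = np.sort(x)
plt.scatter(x, y, alpha=0.6)
plt.plot(xs, slope * xs + intercept, color="red", label="line of best fit")
plt.xlabel("independent variable")
plt.ylabel("dependent variable")
plt.legend()
plt.show()

# Only after the visual check, compute the correlation coefficient.
r, p = stats.pearsonr(x, y)
print(f"Pearson r = {r:.3f} (p = {p:.3f})")
```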