Correlation does not imply causation

Useful to know: lists, summing lots of numbers.

Two variables are correlated if there is some statistical relationship between them. However, just because two variables are correlated does not mean that one causes the other. This principle is commonly summarized as "correlation does not imply causation".

Correlation coefficients $r$ for 24 different example data sets $(x, y)$. Top row: values of $r$ close to $-1$ and $1$ indicate strongly linear relationships with little spread, while $r$ close to $0$ indicates no linear relationship and a lot of spread. Middle row: $r>0$ indicates a positive correlation, while $r<0$ indicates a negative one. Bottom row: many nonlinear relationships result in $r=0$, showing that the Pearson correlation coefficient only measures linear relationships. (Image credit: DenisBoigelot, Wikimedia Commons)

One way of computing a correlation coefficient between two variables $X$ and $Y$ with $n$ measurements $x_1, x_2, \dots, x_n$ and $y_1, y_2, \dots, y_n$ is the Pearson correlation coefficient $$r = \frac{\operatorname{cov}(X,Y)}{\sigma_X\sigma_Y}$$ where $$\operatorname{cov}(X,Y) = \sum_{i=1}^n (x_i - \overline{x})(y_i - \overline{y}) = (x_1-\overline{x})(y_1-\overline{y}) + \cdots + (x_n-\overline{x})(y_n-\overline{y})$$ is the covariance between $X$ and $Y$, $$\sigma_X = \sqrt{\sum_{i=1}^n (x_i - \overline{x})^2} = \sqrt{(x_1 - \overline{x})^2 + \cdots + (x_n-\overline{x})^2} \quad \text{and} \quad \sigma_Y = \sqrt{\sum_{i=1}^n (y_i - \overline{y})^2} = \sqrt{(y_1 - \overline{y})^2 + \cdots + (y_n-\overline{y})^2}$$ are the standard deviations of $X$ and $Y$ (both $\operatorname{cov}$ and $\sigma$ are written here without the usual $\frac{1}{n}$ normalization, which cancels in the ratio), and $$\overline{x} = \frac{1}{n} \sum_{i=1}^n x_i = \frac{x_1 + x_2 + \cdots + x_n}{n} \quad \text{and} \quad \overline{y} = \frac{1}{n} \sum_{i=1}^n y_i = \frac{y_1 + y_2 + \cdots + y_n}{n}$$ are the averages (or means) of the $X$ and $Y$ measurements. The value of $r$ always lies between $-1$ and $1$.
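As a sketch, the formulas above translate almost directly into Python using plain lists and sums (the function name `pearson` is my own choice, not part of the exercise):

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient r = cov(X, Y) / (sigma_X * sigma_Y)."""
    n = len(xs)
    # Means of the two measurement lists.
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Covariance and standard deviations, without the 1/n factor,
    # which cancels in the ratio below.
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sigma_x = sqrt(sum((x - x_bar) ** 2 for x in xs))
    sigma_y = sqrt(sum((y - y_bar) ** 2 for y in ys))
    return cov / (sigma_x * sigma_y)
```

For example, perfectly linear data such as `pearson([1, 2, 3], [2, 4, 6])` gives $r = 1$, and reversing the trend, as in `pearson([1, 2, 3], [6, 4, 2])`, gives $r = -1$.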

Given two lists of measurements $x_n$ and $y_n$, return the Pearson correlation coefficient for them.

Input: Two lists $x_n$ and $y_n$ of size $n$.

Output: The Pearson correlation coefficient $r$ between the two variables.

Example input

([5427, 5688, 6198, 6462, 6635, 7336, 7248, 7491, 8161, 8578, 9000], [18.079, 18.594, 19.753, 20.734, 20.831, 23.029, 23.597, 23.584, 22.525, 27.731, 29.449])

Example output

