
Correlation does not imply causation

You will learn about: statistics and lists, and summing lots of numbers.

Two variables are correlated if there's some statistical relationship between the two. However, just because two variables are correlated does not mean that one is caused by the other. This caution is commonly summarized as “correlation does not imply causation”.

Correlation coefficients $r$ for 24 different example data sets $(x, y)$. Top row: values of $r$ close to $-1$ and $1$ suggest very linear relationships with little spread, while values of $r$ close to 0 suggest no relationship and a lot of spread. Middle row: $r>0$ values suggest positive correlations while $r<0$ values suggest negative correlations. Bottom row: many nonlinear relationships result in $r=0$, suggesting that the Pearson correlation coefficient is only good for measuring linear relationships. (Image credit: DenisBoigelot, Wikimedia Commons)

One way of computing a correlation coefficient between two variables $X$ and $Y$ with $n$ measurements $x_1, x_2, \dots, x_n$ and $y_1, y_2, \dots, y_n$ is the Pearson correlation coefficient

$$ r = \frac{\operatorname{cov}(X,Y)}{\sigma_X\sigma_Y} $$

where

$$ \operatorname{cov}(X,Y) = \frac{1}{n} \sum_{i=1}^n (x_i - \overline{x})(y_i - \overline{y}) = \frac{1}{n} \left[ (x_1-\overline{x})(y_1-\overline{y}) + \cdots + (x_n-\overline{x})(y_n-\overline{y}) \right] $$

is the covariance between $X$ and $Y$,

\begin{align} \sigma_X & = \sqrt{\frac{1}{n} \sum_{i=1}^n (x_i - \overline{x})^2} = \sqrt{\frac{1}{n} \left[ (x_1 - \overline{x})^2 + \cdots + (x_n-\overline{x})^2 \right]} \quad \text{and} \\ \sigma_Y & = \sqrt{\frac{1}{n} \sum_{i=1}^n (y_i - \overline{y})^2} = \sqrt{\frac{1}{n} \left[ (y_1 - \overline{y})^2 + \cdots + (y_n-\overline{y})^2 \right]} \end{align}

are the standard deviations of $X$ and $Y$, and

$$ \overline{x} = \frac{1}{n} \sum_{i=1}^n x_i = \frac{x_1 + x_2 + \cdots + x_n}{n} \quad \text{and} \quad \overline{y} = \frac{1}{n} \sum_{i=1}^n y_i = \frac{y_1 + y_2 + \cdots + y_n}{n} $$

are the averages (or means) of the $X$ and $Y$ measurements. The Pearson correlation coefficient $r$ is always between $-1$ and $1$.
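As a small worked example of why $r$ only measures linear relationships, consider a made-up data set (chosen purely for illustration, not taken from the figure) with $n = 5$ measurements $x = (-2, -1, 0, 1, 2)$ and $y_i = x_i^2$, so $y = (4, 1, 0, 1, 4)$. The means are $\overline{x} = 0$ and $\overline{y} = 2$, and the covariance works out to

$$ \operatorname{cov}(X,Y) = \frac{1}{5} \left[ (-2)(2) + (-1)(-1) + (0)(-2) + (1)(-1) + (2)(2) \right] = \frac{1}{5} \left[ -4 + 1 + 0 - 1 + 4 \right] = 0 $$

Both standard deviations are nonzero ($\sigma_X = \sqrt{2}$ and $\sigma_Y = \sqrt{14/5}$), so $r = 0$ even though $y$ is completely determined by $x$.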

Given two lists of measurements $x_n$ and $y_n$, return their Pearson correlation coefficient.

Input: Two lists $x_n$ and $y_n$ of size $n$.

Output: The Pearson correlation coefficient $r$ between the two variables.


Input xₙ: [5427, 5688, 6198, 6462, 6635, 7336, 7248, 7491, 8161, 8578, 9000]
Input yₙ: [18.079, 18.594, 19.753, 20.734, 20.831, 23.029, 23.597, 23.584, 22.525, 27.731, 29.449]
Output correlation coefficient: 0.94684375

Maximum runtime: 60 s
Maximum memory usage: 250 MiB
Function signature: correlation_coefficient(x, y)

Write a function that accepts the input as function parameters and returns the correct output. Make sure to read the description above to produce the correct output in the correct format and use the correct function signature so we can run your code. A good first step is to try reproducing the example(s). Your code must not take longer than the maximum runtime to run and must not use more memory than the allowed limit.
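One possible way to reproduce the example is sketched below in Python. It is a minimal sketch rather than a reference solution: the error handling for mismatched, empty, or constant lists and the use of `math.fsum` for the sums are assumptions of this sketch, not requirements stated in the problem. Since the $\frac{1}{n}$ factors in the covariance and the two standard deviations cancel in the ratio, the sketch works directly with the sums of products of deviations.

```python
import math


def correlation_coefficient(x, y):
    """Return the Pearson correlation coefficient of two equal-length lists."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    n = len(x)
    if n == 0:
        raise ValueError("need at least one measurement")

    # Means of the two sets of measurements.
    x_mean = math.fsum(x) / n
    y_mean = math.fsum(y) / n

    # Deviations from the means.
    dx = [xi - x_mean for xi in x]
    dy = [yi - y_mean for yi in y]

    # n * cov(X, Y), n * sigma_X^2 and n * sigma_Y^2; the factors of n
    # cancel when forming r, so they are never divided out.
    s_xy = math.fsum(a * b for a, b in zip(dx, dy))
    s_xx = math.fsum(a * a for a in dx)
    s_yy = math.fsum(b * b for b in dy)

    # Assumption of this sketch: treat a constant list as an error,
    # because r is undefined when a standard deviation is zero.
    if s_xx == 0 or s_yy == 0:
        raise ValueError("r is undefined when either list is constant")

    return s_xy / math.sqrt(s_xx * s_yy)


# Example data from the problem statement.
x_example = [5427, 5688, 6198, 6462, 6635, 7336, 7248, 7491, 8161, 8578, 9000]
y_example = [18.079, 18.594, 19.753, 20.734, 20.831, 23.029,
             23.597, 23.584, 22.525, 27.731, 29.449]

print(correlation_coefficient(x_example, y_example))
```

The printed value should agree with the example output of 0.94684375 to within rounding. `math.fsum` is used instead of the built-in `sum` because it tracks intermediate rounding errors, which can matter when summing lots of floating-point numbers.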


  • There are some really good websites for this stuff.

Let us know what you think about this problem! Was it too hard? Difficult to understand? Also feel free to discuss the problem, ask questions, and post cool stuff on Discourse. You should be able to see a discussion thread below. Feel free to post your solutions, but if you do, please organize and document your code well so others can learn from it.