Pearson R vs R2: The Ultimate Correlation Showdown

Understanding the distinction between Pearson r and r2 is fundamental for anyone interpreting linear relationships in data. In simple linear regression, r2 is the square of r, yet the two values serve different purposes and communicate different things about a relationship. Confusing them can lead to mistaking the strength and direction of a relationship for the proportion of variance explained.

Pearson r: The Measure of Linear Association

Pearson r, often called the Pearson correlation coefficient, quantifies the strength and direction of a linear relationship between two continuous variables. Its value ranges from -1 to +1, where +1 indicates a perfect positive linear trend, -1 indicates a perfect negative linear trend, and 0 indicates no linear relationship. The metric captures how closely the points cluster around a line and whether the relationship is positive or negative. It does not depend on how steep the line is: rescaling either variable by a positive constant changes the slope but leaves r unchanged. Pearson r is also symmetric, so it is the same whichever variable is treated as independent. Unlike r2, Pearson r retains the sign, which provides critical information about the direction of the association.
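As a minimal sketch of how r is computed, the snippet below implements the standard formula (the sum of cross-products divided by the square root of the product of the sums of squares) in plain Python. The function name pearson_r and the five data points are made up for illustration. It also checks two properties of r: multiplying y by 10 makes the line steeper but does not change r, and negating y flips only the sign.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squared deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data, chosen only for illustration.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

r = pearson_r(x, y)
r_steeper = pearson_r(x, [10 * v for v in y])  # slope multiplied by 10
r_flipped = pearson_r(x, [-v for v in y])      # direction reversed

print(r, r_steeper, r_flipped)
```

The first two printed values agree up to floating-point rounding, and the third has the same magnitude with the opposite sign.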

The Squared Relationship: What r2 Represents

R-squared (r2) is the coefficient of determination. In simple linear regression, with one predictor and an intercept, it equals the square of Pearson r. Squaring removes the sign and yields a value between 0 and 1. This value is the proportion of variance in the dependent variable that is accounted for by its linear relationship with the independent variable. For example, an r2 of 0.85 indicates that 85% of the variability in the outcome is explained by the fitted line. While r describes the direction and tightness of the linear clustering, r2 expresses the goodness of fit as a share of explained variation.
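The link between the two quantities can be checked numerically. The sketch below fits an ordinary least-squares line to a small made-up dataset, computes r2 from its variance definition (1 minus the residual sum of squares over the total sum of squares), and compares it with the square of Pearson r. It assumes simple linear regression with an intercept, which is the setting where the two are equal; the variable names are illustrative.

```python
import math

# Hypothetical data, chosen only for illustration.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n

sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
sxx = sum((a - mean_x) ** 2 for a in x)
syy = sum((b - mean_y) ** 2 for b in y)  # total sum of squares of y

# Ordinary least-squares line: y = intercept + slope * x
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Residual sum of squares around the fitted line.
ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))

r = sxy / math.sqrt(sxx * syy)
r_squared = 1 - ss_res / syy  # proportion of variance explained

print(r ** 2, r_squared)
```

The two printed numbers match up to floating-point rounding. The equality is not guaranteed for models fitted without an intercept or evaluated on data other than the data used for fitting.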

Direction vs. Magnitude of Fit

The primary conceptual difference lies in the information they convey. Pearson r answers the question: "How closely do the points follow a line, and in what direction?" It is a direct measure of association. R2, conversely, answers: "What proportion of the total variation in the outcome is accounted for by the model?" It is a measure of explanatory power. Because r2 is the square of r in simple linear regression, the two cannot disagree: a high r2 requires an r of large magnitude, and an r of large magnitude always produces a high r2. What differs is the scale. Since the magnitude of r is at most 1, r2 is never larger than that magnitude, so a moderate correlation of r = 0.5 corresponds to an r2 of only 0.25. A steep slope with widely scattered points yields a low r and a low r2 alike, because neither metric depends on the steepness of the line.
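To make the difference in scale concrete, the short sketch below tabulates a few illustrative values of r next to their squares. The chosen values are arbitrary. In simple linear regression r2 is fixed by r, so the table shows how quickly the explained proportion shrinks as the correlation weakens.

```python
# Arbitrary illustrative correlation values.
r_values = (-0.9, -0.5, 0.3, 0.5, 0.7, 0.9)

# Map each r to its r2, rounded to two decimals for display.
table = {r: round(r ** 2, 2) for r in r_values}

for r, r_squared in table.items():
    print(f"r = {r:+.1f}  ->  r2 = {r_squared:.2f}")
```

A correlation of 0.7 may sound strong, yet it corresponds to less than half of the variance being explained.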

Interpretation in Statistical Contexts

When evaluating a model, relying solely on r2 can be misleading. A high r2 does not guarantee that the model is appropriate; it can be inflated by outliers or by the specific range of the data. Conversely, a low r2 does not necessarily mean the model is useless, particularly in fields where inherent variability is high. Pearson r adds the one piece of information that r2 discards: whether the relationship is positive or negative. For robust analysis, it is best practice to report both the correlation coefficient (r) and the coefficient of determination (r2) to give a complete picture of direction and explanatory strength.
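The effect of an outlier can be seen in a small experiment. The sketch below uses six made-up points with almost no linear trend, then appends a single extreme point and recomputes r2. The data and the helper name r_squared are hypothetical, and r2 is taken as the square of Pearson r, which assumes simple linear regression with an intercept.

```python
import math

def r_squared(x, y):
    """Square of Pearson r (equals r2 in simple linear regression)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return (sxy / math.sqrt(sxx * syy)) ** 2

# Hypothetical data with almost no linear trend.
x = [1, 2, 3, 4, 5, 6]
y = [2, 4, 1, 5, 2, 4]

without_outlier = r_squared(x, y)
# One extreme point far from the rest dominates the calculation.
with_outlier = r_squared(x + [30], y + [30])

print(round(without_outlier, 3), round(with_outlier, 3))
```

The first value is below 0.1 and the second is above 0.9, even though six of the seven points are the same in both cases. Plotting the data before trusting r2 is the usual safeguard.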

Practical Examples and Calculation

Consider a dataset measuring the relationship between hours studied and exam scores. A Pearson r of 0.9 suggests a strong positive linear association, while the corresponding r2 of 0.81 indicates that 81% of the variance in exam scores is accounted for by its linear relationship with study time. If the Pearson r were -0.9, the r2 would still be 0.81, showing the same proportion of explained variance. The direction of the relationship would be negative, however, meaning that more study time is associated with lower scores, which would be an unusual finding. This example highlights why retaining the sign in r is crucial for theoretical interpretation.
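The arithmetic in this example can be wrapped in a small helper that reports both pieces of information at once. The function name summarize is hypothetical, and the sketch assumes the simple linear regression setting in which r2 is the square of r.

```python
def summarize(r):
    """Return (direction, percent of variance explained) for a Pearson r."""
    if not -1 <= r <= 1:
        raise ValueError("Pearson r must lie between -1 and +1")
    if r > 0:
        direction = "positive"
    elif r < 0:
        direction = "negative"
    else:
        direction = "none"
    # Squaring drops the sign; express the result as a percentage.
    percent_explained = round(100 * r ** 2, 1)
    return direction, percent_explained

print(summarize(0.9))   # study time vs. score, as in the example
print(summarize(-0.9))  # same magnitude, opposite direction
```

Both calls report 81.0 percent of variance explained, and only the direction label distinguishes them.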