Understanding R-Squared

Here is a non-technical explanation of the correlation coefficient R and its square, R-Squared.

Correlation Coefficient R

Visualize a scatter diagram. Now normalize the x and y axes so that all data fit within the square defined by x and y equal to minus 1 to plus 1.

There must be at least one value of x that equals minus 1. There must be at least one value of x that equals plus 1. There must be at least one value of y that equals minus 1. There must be at least one value of y that equals plus 1.

The values of x and y are otherwise allowed to be independent. The value of y can be anything when x equals minus or plus 1. Similarly, the value of x can be anything when y equals minus or plus 1.
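This normalization can be sketched in a few lines of code (the data here are synthetic, purely for illustration): min-max scaling forces every variable to span exactly minus 1 to plus 1.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)            # illustrative data
y = 0.5 * x + rng.normal(size=200)  # y depends on x, plus noise

def scale_to_unit_square(v):
    # Shift and stretch so the smallest value lands on -1 and the largest on +1.
    return 2.0 * (v - v.min()) / (v.max() - v.min()) - 1.0

xs = scale_to_unit_square(x)
ys = scale_to_unit_square(y)
print(xs.min(), xs.max(), ys.min(), ys.max())  # -1.0 1.0 -1.0 1.0
```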

Now fit the best possible straight line through all of the data points. HINT: We have formulas (least squares) that allow us to do this.

Notice that the presence of randomness reduces the slope of this line because of the requirement that the maximum value of y equals +1 and the minimum equals -1.

The slope of this straight line is the correlation coefficient R (because of the normalization process).

If the straight line has a slope of plus 1, the correlation coefficient is 100%. If the straight line has a slope of minus 1, the correlation coefficient is -100%. In both cases, knowledge of x translates directly into knowledge of y.

If the slope is zero (that is, if the fitted line is a constant), then the correlation coefficient is zero. Knowledge of x tells us nothing about y.

A correlation coefficient of 20% to 30% shows that x has a major influence on the behavior of y. It accounts for 20% to 30% of the total variation of y. But even after removing the effect of x, y still retains most of its scatter (i.e., randomness).

If we take many samples of y at the same value of x, the randomness of the average (i.e., mean) and median (mid-point) decreases substantially. The effect of x on the average becomes visible. Each new individual sample, however, retains the full amount of randomness.
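A short sketch of this averaging effect, using synthetic numbers and assuming pure noise with standard deviation 1: the scatter of the mean shrinks like 1 over the square root of N, while each individual sample stays as noisy as ever.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 100
# 10,000 repeated experiments, each drawing N samples of y at the same x.
samples = rng.normal(loc=2.0, scale=1.0, size=(10_000, N))

print(samples.std())               # individual samples: scatter stays near 1
print(samples.mean(axis=1).std())  # the mean: scatter shrinks to about 1/sqrt(N) = 0.1
```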

Regression equations allow us to get the same information about y from many different values of x.

R-Squared

As a rule, variances add. Standard deviations do not.

[When variances don’t add, factors are mutually dependent. We introduce correction terms known as covariances.]
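A minimal sketch with synthetic variables: when they are independent, their variances add while their standard deviations do not, and when they are dependent, a covariance correction term restores the identity.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
a = rng.normal(scale=3.0, size=n)  # variance 9
b = rng.normal(scale=4.0, size=n)  # variance 16

print(np.var(a + b))  # close to 9 + 16 = 25: variances add
print(np.std(a + b))  # close to 5, not 3 + 4 = 7: standard deviations do not

# When the variables are dependent, a covariance correction term appears:
c = a + rng.normal(scale=1.0, size=n)  # c is correlated with a
cov_ac = np.mean((a - a.mean()) * (c - c.mean()))
print(np.var(a + c))                          # equals Var(a) + Var(c) + 2*Cov(a, c)
print(np.var(a) + np.var(c) + 2.0 * cov_ac)
```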

The correlation coefficient R is a fraction of the standard deviation (after scaling the x and y axes). R tells us how much x influences y. R-Squared tells us how much x influences the variance of y.

The total variance of y, when ignoring x, equals the sum of the variance of the effect of x (which is R-Squared times the total variance) plus the variance of what remains (which is (1 - R-Squared) times the total variance). If x causes 20% to 30% of the variation of y, removing its effect reduces the total variance by only 0.04 to 0.09 of the total. More than 90% of the randomness of each individual sample remains in effect.
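This variance split can be checked numerically. A sketch with synthetic data in which x has a weak but genuine influence on y (R near 0.25): the residual variance left after the straight-line fit matches (1 - R-Squared) times the total.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=50_000)
y = 0.25 * x + rng.normal(size=50_000)  # weak but genuine influence of x

r = np.corrcoef(x, y)[0, 1]
slope, intercept = np.polyfit(x, y, 1)
residual = y - (slope * x + intercept)

total = np.var(y)
print(r, r**2)           # R around 0.24, R-Squared around 0.06
print(np.var(residual))  # matches (1 - R-Squared) times the total variance
print((1.0 - r**2) * total)
```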

Stated differently, it takes about 25 samples (when R = 0.20) or 11 samples (when R = 0.30), more precisely, degrees of freedom, to reveal a change in the average (mean) or median (mid-point). That is, R = 1/(square root of N), so N = 1/R-Squared.
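The arithmetic behind those sample counts, assuming the relation R = 1/sqrt(N):

```python
# Rearranging R = 1/sqrt(N) gives N = 1/R^2, the number of samples
# (degrees of freedom) needed for the shift in the mean to stand out.
for R in (0.20, 0.30):
    N = 1.0 / R**2
    print(R, N)  # roughly 25 and 11
```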

[DETAIL: Add another degree of freedom for each statistical measure extracted from the sample.]

Generalization

The scaling that I described restricts statistical distributions to finite values of x and y. The actual scaling is different. The actual formulas are what make sense mathematically.
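One standard choice of scaling, offered here as a sketch of what "makes sense mathematically": divide each axis by its standard deviation (z-scores) instead of forcing the data into a fixed range. With that scaling, the least-squares slope equals R exactly.

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(size=10_000)
y = 0.6 * x + rng.normal(size=10_000)  # synthetic, for illustration

# Standard scaling: subtract the mean, divide by the standard deviation.
zx = (x - x.mean()) / x.std()
zy = (y - y.mean()) / y.std()

slope = np.polyfit(zx, zy, 1)[0]
r = np.corrcoef(x, y)[0, 1]
print(slope, r)  # the two agree: on z-scored axes the fitted slope is R
```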

Example

The Stock-Return Predictor has outer confidence limits of minus and plus 6% at Year 10, a total range of 12%.

Recently, I looked at the refinement possible by introducing earnings growth rate adjustment estimates.
Stock Return Predictor with Earnings Growth Rate Adjustment

From the new calculator, I determined that different earnings growth rate estimates could vary the Year 10 most likely (real, annualized, total) return prediction from 0.89% to 3.11% when starting at today’s valuations. The total variation is 2.22%, which is 18.5% of the total range of uncertainty (12%, which is from minus 6% to plus 6%) inherent in Year 10 predictions.

Introducing earnings growth is equivalent to adding a factor with a correlation coefficient R of 18.5% and an R-Squared of 0.0342.
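The arithmetic from the example, restated as a small calculation:

```python
low, high = 0.89, 3.11  # Year 10 most likely return predictions, percent
total_range = 12.0      # minus 6% to plus 6%

R = (high - low) / total_range
print(round(R, 4), round(R**2, 4))  # 0.185 0.0342
```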

This illustrates that important factors can have low values of R-Squared.

Never dismiss a result simply on the basis of R-Squared. Always consider its application.

Have fun.

John Walter Russell
January 19, 2007
Revised: January 20, 2007