Understanding R-Squared Version A

This is a non-technical explanation of the correlation coefficient R and its square, R-Squared.

In this version I have added material to assist readers with technical training.

Correlation Coefficient R

Visualize a scatter diagram. Now normalize the x axis so that all data are centered about zero with x ranging from minus 1 to plus 1.

There must be at least one value of x that equals minus 1. There must be at least one value of x that equals plus 1.

Similarly, center the y axis so that all values range from y=-ymax to +ymax. [Division by ymax comes later. In version A, I delay the normalization of the values of y.]

The values of x and y are otherwise unconstrained. The value of y can be anything when x equals minus 1 or plus 1. Similarly, the value of x can be anything when y equals -ymax or +ymax.
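
Here is a minimal sketch of this scaling, assuming Python with NumPy (not a tool the text uses) and made-up data:

    import numpy as np

    # Hypothetical raw data: any scatter of paired observations.
    x = np.array([2.0, 3.5, 5.0, 6.5, 8.0, 9.5])
    y = np.array([1.1, 0.4, 2.3, 1.8, 3.0, 2.6])

    # Center x about zero and scale so it spans exactly -1 to +1:
    # subtract the midpoint of the range, then divide by the half-range.
    x_mid = (x.max() + x.min()) / 2.0
    x_scaled = (x - x_mid) / ((x.max() - x.min()) / 2.0)

    # Center y the same way, but delay the division by ymax (Version A).
    y_centered = y - (y.max() + y.min()) / 2.0
    ymax = np.abs(y_centered).max()

    print(x_scaled.min(), x_scaled.max())  # -1.0 1.0
    print(-ymax, ymax)                     # y now runs from -ymax to +ymax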

Now fit the best possible straight line through the data, spanning all values of x. HINT: Excel does this for us.
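
For readers who prefer code to Excel, np.polyfit with degree 1 performs the same ordinary least-squares fit as Excel's linear trendline (the data are the scaled values from the sketch above):

    import numpy as np

    # Scaled data as described above (values made up for illustration).
    x_scaled = np.array([-1.0, -0.6, -0.2, 0.2, 0.6, 1.0])
    y_centered = np.array([-0.6, -1.3, 0.6, 0.1, 1.3, 0.9])

    # Degree-1 polynomial fit = ordinary least-squares straight line.
    slope, intercept = np.polyfit(x_scaled, y_centered, 1)
    print(f"best fit: y = {slope:.3f}*x + {intercept:.3f}")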

The presence of randomness constrains the slope of this line since the maximum value of y equals +ymax and the minimum value of y equals -ymax.

Added Technical Details

To a good approximation when the distribution is symmetrical, the straight line passes through x=0 and y=0.

If we take all of the values of y, while ignoring x, we can calculate a total variance VAR_TOTAL and its square root, the TOTAL_STANDARD_DEVIATION. [We actually calculate estimates.]
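
With NumPy, these estimates are one call each. A sketch, reusing the made-up values from the earlier fit (ddof=1 selects the sample estimate rather than the population formula, which is the "estimates" point above):

    import numpy as np

    y_centered = np.array([-0.6, -1.3, 0.6, 0.1, 1.3, 0.9])

    # Sample estimates of the total variance and standard deviation of y,
    # ignoring x entirely.
    var_total = np.var(y_centered, ddof=1)
    total_standard_deviation = np.sqrt(var_total)
    print(var_total, total_standard_deviation)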

Ideally, with a normal distribution, 68% of the values of y would fall between plus and minus one TOTAL_STANDARD_DEVIATION. For a real sample, the range of the data would be within two or three data points of 1.6 to 2.0 times the TOTAL_STANDARD_DEVIATION. That is, ymax would be close to 1.6 to 2.0 times the TOTAL_STANDARD_DEVIATION. Regardless, the data range (within two or three data points) is plus and minus some such multiple “k” times the TOTAL_STANDARD_DEVIATION.
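
A quick simulation illustrates both claims. The sample size of 25 and the random seed are my own choices, and the value of k wanders from sample to sample:

    import numpy as np

    rng = np.random.default_rng(1)
    y = rng.standard_normal(25)   # a modest normally distributed sample
    y = y - y.mean()

    sd = y.std(ddof=1)
    print(f"within +/- 1 SD: {np.mean(np.abs(y) <= sd):.2f}")  # roughly 0.68

    # The range multiple k, after setting aside the two most extreme
    # points ("within two or three data points" above).
    trimmed_ymax = np.sort(np.abs(y))[:-2].max()
    print(f"k ~ {trimmed_ymax / sd:.2f}")  # often lands near 1.6 to 2.0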

If the correlation were 100%, there would be no scatter outside of the straight line fit. Two TOTAL_STANDARD_DEVIATIONs (i.e., plus and minus one TOTAL_STANDARD_DEVIATION) would correspond to 68% of the total variation of the line as x varies from -1 to +1. The full range of variation would equal 2*k*TOTAL_STANDARD_DEVIATION as x ranges from -1 to +1 (within two or three data points, when using an actual sample). That is, at x=+1, ymax=k*TOTAL_STANDARD_DEVIATION. At x=-1, y=-ymax=-k*TOTAL_STANDARD_DEVIATION. Squaring the full range, the straight line represents 4*k^2*VAR_TOTAL. It “explains” 100% of the total variance of the data.

When the correlation is between -100% and 100%, the straight line covers a fraction f of the range of the data. The values at x=-1 and x=+1 are -f*ymax and +f*ymax. Since ymax=k*TOTAL_STANDARD_DEVIATION, the full range 2*f*ymax equals 2*f*k*TOTAL_STANDARD_DEVIATION. Squaring, the straight line accounts for f^2*4*k^2*VAR_TOTAL. It “explains” f^2 times the total variance of the data.

If we subtract the straight line from all data points y, the difference has a variance equal to (1-f^2)*VAR_TOTAL.
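
This identity is exact for an ordinary least-squares fit, and easy to verify with made-up simulated data (the slope and noise level below are arbitrary; f^2 appears as R^2, anticipating the next paragraph):

    import numpy as np

    rng = np.random.default_rng(2)
    x = rng.uniform(-1, 1, 200)
    y = 0.5 * x + 0.8 * rng.standard_normal(200)  # signal plus noise

    slope, intercept = np.polyfit(x, y, 1)
    residuals = y - (slope * x + intercept)

    r = np.corrcoef(x, y)[0, 1]   # correlation coefficient R
    var_total = np.var(y)         # ddof=0 keeps the identity exact
    var_resid = np.var(residuals)

    print(f"R^2                     = {r**2:.4f}")
    print(f"1 - var_resid/var_total = {1 - var_resid / var_total:.4f}")  # same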

R-squared is f^2. The correlation coefficient R is the fraction of the total range of the data that the straight line covers. The values of the straight line at x=-1 and x=+1 occur at -R*ymax and +R*ymax.

If we normalize the y axis by dividing by ymax, then the straight line equals -R and +R when x=-1 and +1. If the value of the straight line is positive when x is positive, then the correlation is positive. If the value of the straight line is negative when x is positive, then the correlation is negative.

Interpretation of Graphs

The value of the straight line when x=+1 equals the correlation coefficient R times +ymax.

If the slope is zero (that is, if it is a constant), then the correlation coefficient is zero. Knowledge of x tells us nothing about y.

A correlation coefficient of 20% to 30% shows that x has a major influence on the values of y. It amounts to 20% to 30% of the total variation of y. But even after removing the effect of x, the values of y still retain most of their scatter (i.e., randomness). This is because variances, not standard deviations, add when randomness is present.

We can make the effects of x visible by taking many samples and averaging. The randomness of the average (i.e., mean) and of the median (mid-point) decreases substantially. When viewed individually, each new sample has the full amount of randomness.
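
A quick simulation shows the effect; the choice of 25 values per average is arbitrary:

    import numpy as np

    rng = np.random.default_rng(3)

    singles = rng.standard_normal(10_000)                    # individual values
    means = rng.standard_normal((10_000, 25)).mean(axis=1)   # averages of 25

    print(f"SD of individual values: {singles.std():.3f}")  # about 1.0
    print(f"SD of averages of 25:    {means.std():.3f}")    # about 1/sqrt(25) = 0.2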

Regression equations allow us to get similar statistical benefits at many different values of x.

R-Squared

As a rule, variances add. Standard deviations do not.

[When variances don’t add, factors are mutually dependent. We introduce correction terms known as covariances.]
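
A one-check sketch of the rule and its covariance correction, with an arbitrary dependence built into the simulated data:

    import numpy as np

    rng = np.random.default_rng(4)
    a = rng.standard_normal(100_000)
    b = 0.5 * a + rng.standard_normal(100_000)  # b partly depends on a

    # Var(a + b) = Var(a) + Var(b) + 2*Cov(a, b); without the covariance
    # term, the plain sum of variances falls short.
    lhs = np.var(a + b)
    rhs = np.var(a) + np.var(b) + 2 * np.cov(a, b, ddof=0)[0, 1]
    print(f"{lhs:.4f} vs {rhs:.4f}")  # identical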

The correlation coefficient R is a fraction of the standard deviation (after scaling the x and y axes). R tells us how much x influences y. R-squared tells us how much x influences the variance of y.

The total variance VAR_TOTAL of y, when ignoring x, equals the sum of the variance of the effect of x (which is R-Squared times the total variance) plus the variance of what remains (which is (1-R-Squared) times the total variance). If x causes 20% to 30% of the variation of y, it removes only a fraction 0.04 to 0.09 of the total variance. More than 90% of the randomness of each individual sample remains in effect.
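
The arithmetic, spelled out:

    # If x accounts for 20% to 30% of the variation of y (R = 0.2 to 0.3),
    # the explained share of the variance is only R squared.
    for r in (0.2, 0.3):
        explained = r ** 2
        print(f"R = {r:.1f}: explained variance = {explained:.2f}, "
              f"remaining = {1 - explained:.2f}")
    # R = 0.2: explained variance = 0.04, remaining = 0.96
    # R = 0.3: explained variance = 0.09, remaining = 0.91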

Generalization

The scaling that I described restricts statistical distributions to finite values of x and y. This makes sense when handling data.

The actual scaling is different. The actual formulas, which divide the covariance of x and y by the product of their standard deviations, are what make sense mathematically.
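
A sketch of that textbook formula, checked against NumPy's built-in corrcoef on made-up data:

    import numpy as np

    rng = np.random.default_rng(5)
    x = rng.standard_normal(500)
    y = 0.4 * x + rng.standard_normal(500)

    # R = Cov(x, y) / (SD(x) * SD(y)); no axis rescaling is required.
    r_manual = np.cov(x, y, ddof=0)[0, 1] / (x.std() * y.std())
    print(np.isclose(r_manual, np.corrcoef(x, y)[0, 1]))  # True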

Example

The Stock-Return Predictor has outer confidence limits of minus and plus 6% at Year 10, a total range of 12%.

Recently, I looked at the refinement possible by introducing earnings growth rate adjustments.

Stock Return Predictor with Earnings Growth Rate Adjustment

From the new calculator, I determined that different earnings growth rate estimates could vary the Year 10 most likely (real, annualized, total) return prediction from 0.89% to 3.11% when starting at today’s valuations. The total variation is 2.22%, which is 18.5% of the total range of uncertainty (12%, which is from minus 6% to plus 6%) inherent in Year 10 predictions.

Introducing earnings growth is equivalent to adding a factor with a correlation coefficient R of 18.5% and an R-Squared of 0.0342.
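
The arithmetic behind these numbers:

    # The growth-rate adjustment moves the Year 10 prediction across
    # 2.22 percentage points of a 12 percentage-point total range.
    spread = 3.11 - 0.89     # 2.22
    r = spread / 12.0        # 0.185, i.e., R = 18.5%
    print(f"R = {r:.3f}, R-Squared = {r ** 2:.4f}")  # R-Squared = 0.0342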

I consider the earnings growth rate to be an important factor, especially for bottom-up modeling. This example illustrates that important factors can have low values of R-Squared.

Never dismiss a result simply on the basis of R-Squared. Remember that means and medians can be made visible by collecting more data. Always consider the application.

Have fun.

John Walter Russell
January 24, 2007