Just Suppose

I do not have all of the FACTS. Yet, I believe that these insights are relevant to some important studies.

Study Framework (Conjecture)

Imagine a survey of factors that influence stock returns. Imagine the following:

1) It uses the average of single year stock market returns over a long time frame (50 years or more).
2) It includes variances and standard deviations.
3) It combines different stocks in accordance with a large number of factors including the (trailing) price to earnings ratio, dividend yield, price to book value and price to sales.
4) It fits various combinations with linear equations.
5) It selects the best fit on the basis of R-squared.

Study Findings

Such a survey will identify several combinations with extremely high values of R-squared. Most of these factors will be relevant. The degree of relevance will be overstated because of data mining bias. Any large survey will produce such results. It is a routine matter of chance.
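Here is a minimal sketch of the "routine matter of chance" point. It goes further than the study described above by assuming that every factor is pure noise, with no relevance at all, so any fit it finds is due to data mining alone. The function names, the count of 1000 factors and the 20% standard deviation are assumptions made for illustration.

```python
import random


def r_squared(x, y):
    """R-squared of a one-variable least-squares line (squared correlation)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return (sxy * sxy) / (sxx * syy)


def best_noise_fit(n_years=50, n_factors=1000, seed=0):
    """Fit many random factors to random returns; report median and best R-squared."""
    rng = random.Random(seed)
    # Hypothetical single year returns: zero mean, 20% standard deviation.
    returns = [rng.gauss(0.0, 0.20) for _ in range(n_years)]
    scores = []
    for _ in range(n_factors):
        # Each factor is pure noise and has no real link to the returns.
        factor = [rng.gauss(0.0, 1.0) for _ in range(n_years)]
        scores.append(r_squared(factor, returns))
    scores.sort()
    return scores[len(scores) // 2], scores[-1]


median_r2, best_r2 = best_noise_fit()
print(f"median R-squared of a noise factor: {median_r2:.3f}")
print(f"best R-squared among all of them:   {best_r2:.3f}")
```

The best score stands well above the typical one even though nothing here has any predictive value. Selecting on R-squared alone rewards that kind of luck.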

Under normal circumstances, if the R-squared of one of these combinations were 99%, the unexplained scatter would be reduced to 1% of the variance (100%-99%). Taking the square root, the unexplained randomness would be 10% of the standard deviation (since the square root of 0.01 is 0.10). With a single year standard deviation of 20%, the claim would be that the unexplained standard deviation is only 2%. Typically, the confidence limits would be plus and minus 4% (two standard deviations). Applying this to the average single year returns and assuming (incorrectly) data independence, the factors would “explain” differences of the order of 1% after ten years (4% divided by the square root of 10).
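The arithmetic in this paragraph can be checked with a few lines. This is only a sketch. The variable names are mine, and the last step assumes that the ten year figure comes from dividing by the square root of the number of years, which is the usual rule for the average of independent observations.

```python
import math

r_squared = 0.99          # claimed fit
single_year_sd = 0.20     # single year standard deviation of returns
years = 10

# Fraction of the variance left unexplained: 100% - 99% = 1%.
unexplained_variance_fraction = 1.0 - r_squared

# Square root converts a variance fraction into a standard deviation fraction.
unexplained_sd_fraction = math.sqrt(unexplained_variance_fraction)

# Unexplained standard deviation of single year returns.
unexplained_sd = unexplained_sd_fraction * single_year_sd

# Confidence limits of plus and minus two standard deviations.
confidence_half_width = 2.0 * unexplained_sd

# Assumption: independent years, so the average tightens by sqrt(years).
ten_year_half_width = confidence_half_width / math.sqrt(years)

print(f"unexplained standard deviation: {unexplained_sd:.2%}")
print(f"confidence limits: +/- {confidence_half_width:.2%}")
print(f"after {years} years: +/- {ten_year_half_width:.2%}")
```

The last figure comes out at roughly 1.3%, which is the "order of 1%" mentioned above.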

The final step would be to come up with criteria to allow for a variety of choices instead of a single conclusion.

Observations

1) The time frame is inappropriate. Instead of using the average of single year returns, the study should use the average of returns annualized over 10 or 20 years. This allows mean reversion to work its magic. It gets away from the wild year-to-year variation common among stocks.
2) The use of R-squared, although helpful, does not tell the full story. R-squared must be used in combination with the relevant variances (and standard deviations). Slopes reveal the sensitivity of selection factors. Realistic selection criteria take advantage of this sensitivity.
3) The claims made for R-squared values as explanatory are greatly overdone. R-squared is very high because of data mining bias. Most likely, there are better combinations. They are likely to be overlooked.
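As a rough illustration of the first observation, the sketch below computes returns annualized over a multi-year window instead of taking single year returns one at a time. The function names and the sample returns are hypothetical. Compounding (a geometric average) is my assumption about what "annualized" means here.

```python
def annualized_return(yearly_returns):
    """Compound the yearly returns and convert the total to a per-year rate."""
    growth = 1.0
    for r in yearly_returns:
        growth *= 1.0 + r
    return growth ** (1.0 / len(yearly_returns)) - 1.0


def rolling_annualized(yearly_returns, window):
    """Annualized return for every consecutive window of the given length."""
    last_start = len(yearly_returns) - window
    return [
        annualized_return(yearly_returns[start:start + window])
        for start in range(last_start + 1)
    ]


# Hypothetical single year returns, for illustration only.
sample = [0.25, -0.15, 0.30, -0.10, 0.20, 0.05, -0.20, 0.35, 0.10, -0.05, 0.15, 0.00]

for value in rolling_annualized(sample, window=10):
    print(f"{value:.2%}")
```

With real data, the windowed figures would replace the single year returns as the quantity to be fitted, using windows of 10 or 20 years.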

Have fun.

John Walter Russell
April 3, 2008