The next book that I'll blog about is Measuring Success, an edited volume with 11 chapters, 3 editors, and 26 contributors. The book is about the predictive validity of standardized tests in college admissions. It poses something of a dilemma for me. On the one hand, it is rich in data and citations to the peer-reviewed literature. On the other hand, the editors include Lynn Letukas and Ben Wildavsky of the College Board (the organization that produces the SAT), and someone from a research center that includes the College Board on its client list.
I think the proper response is to take the book with a big grain of salt: nothing in it should be treated as a priori truth, but its claims can be read with a suitable dose of "assuming that similar results are found in multiple, independent investigations..."
The first chapter is probably the least rigorous, simply because it's an introductory chapter titled "Eight Myths about Standardized Admissions Testing", so each point is touched on only briefly (though many of the same themes are revisited in more depth in subsequent chapters). The putative myths are:
1) Standardized tests are very poor predictors of freshman grades: The authors concede that weak correlations are found if you look only at the set of admitted students, but the correlations improve once you correct for the fact that the proper comparison pool also includes students who were not admitted but went to college elsewhere (or were admitted but chose other schools). This is a point that I've made before, and with quantitative detail. Moreover, students with weaker preparation often choose different majors than students with stronger preparation, so their grades might not be directly comparable. But the authors include data showing that when you look at a wider pool and control for common college curricula as well as high school GPA, the correlation between SAT score and college grades improves considerably, reaching 0.8 (versus 0.35 in poorly-controlled studies).
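To see why range restriction matters, here's a toy simulation (my own illustration, with made-up numbers, not data or code from the book): generate applicant test scores and later grades with an assumed underlying correlation, then compare the correlation in the full applicant pool with the correlation among only the top-scoring "admitted" slice.

```python
import numpy as np

# Toy illustration of range restriction; simulated data, not numbers from the book.
rng = np.random.default_rng(0)
n = 100_000
true_r = 0.6  # assumed underlying correlation between test score and college grades

score = rng.standard_normal(n)
grades = true_r * score + np.sqrt(1 - true_r**2) * rng.standard_normal(n)

# Correlation across the full applicant pool.
full_r = np.corrcoef(score, grades)[0, 1]

# Correlation among only the "admitted" students (top 20% of scores).
admitted = score > np.quantile(score, 0.80)
restricted_r = np.corrcoef(score[admitted], grades[admitted])[0, 1]

print(f"full pool r ~ {full_r:.2f}, admitted-only r ~ {restricted_r:.2f}")
# Expect roughly 0.60 versus 0.33: the same test looks much weaker
# when you can only observe the students who were selected on it.
```

The exact numbers depend on the assumed correlation and the admit rate, but the direction is general: selecting on the predictor mechanically attenuates its observed correlation, even though the test works equally well for everyone.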
2) Tests do not predict anything beyond grades: The authors show data indicating that students with higher test scores take more advanced college classes than those with lower scores, and choose majors that line up with their relative strengths on the test (e.g. verbal versus math).
3) Alternatives to testing are superior: The authors reference work on various alternative measures, and show that often the sample sizes are small and correlations are weak. Moreover, even if there are superior sources of data for admissions decisions, I'm not sure why one would ONLY use one source of information. Why not build a multi-variate model? And since tests are not as vulnerable to the subjective biases of raters (e.g. interviewers, letter writers, essay readers) the claim of superiority over tests seems to be an extraordinary one, requiring extraordinary evidence. (Or, at a minimum, a very careful articulation of what could count as "superiority"--is lack of bias not one of the desiderata?)
4) Tests do not measure anything relevant to life after college: Here the authors cite correlations between test scores and quality of graduate work as evaluated by faculty (i.e. evaluations of the quality and quantity of research output) as well as work performance after graduate study. However, this is a weak point, because it is focused on the Miller Analogies Test as an admissions test for students preparing to work as counselors, rather than more widely-used tests for undergraduate admissions (e.g. SAT, ACT) or graduate admissions (the GRE is the main game in town here).
Still, the authors are psychologists, so counseling programs would seem to be near to their hearts. I'll give them one fumble here.
5) Beyond a threshold tests have little predictive power: In other words, this is an argument that above a threshold you can't use tests to distinguish decent from great performance. And, of course, it is true that tests can't do that with perfect predictive power. However, the authors cite evidence from large studies (6656 students in one study, 15,040 in another, and 150,000 in a third) that tests have non-trivial predictive power even at the upper end of the talent pool. Intuitively this makes sense: if test scores only had threshold predictive power, then the correlations under point 1 would probably not be as large.
6) Tests only measure socioeconomic status: This one was an eye-opener. They show mean SAT score varying from 1300 (on a 2400 scale) for the lowest income bracket (under $20,000/year household income) to 1700 for the highest bracket (over $200,000/year). That variation isn't trivial, but it is hardly enough to account for the correlations seen earlier, especially given the wide variation within brackets. More importantly, the predictive power of SAT scores remains quite strong even after controlling for family income.
7) Tests are biased: Here the authors are careful to unpack what "bias" means. If a test is biased against group X and favorable to group Y, then if we take a bunch of students with the same test score and look at their college performance, the X students should do better than the Y students. In other words, such an outcome would tell us that X students are doing better than their score would predict while Y students are doing worse, so admitting based on the test puts X students at a disadvantage (they're being treated the same as weaker Y students). However, SAT scores slightly over-predict college grades for minority students. The over-prediction makes sense to me, since disadvantage is multi-faceted, and there are aspects of it that cannot be fully captured by family income. If disadvantage matters and is related to ethnicity, then I would expect minority students with a given family income and the same academic preparation to fare slightly worse (on average) because they face burdens that otherwise-similar white students do not face.
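To make that definition concrete, here is a minimal sketch (my own, with hypothetical variables; the book provides no code) of the standard differential-prediction check: regress college GPA on test score plus a group indicator and inspect the group coefficient. A positive coefficient would mean the test under-predicts that group's performance (bias against the group); the over-prediction the authors report corresponds to a negative coefficient.

```python
import numpy as np

# Sketch of the differential-prediction check with hypothetical data;
# this is my illustration, not the book's analysis.
rng = np.random.default_rng(1)
n = 5_000
group = rng.integers(0, 2, size=n)      # 0 = group Y, 1 = group X
score = rng.standard_normal(n)

# Suppose the test slightly over-predicts grades for group X:
# at any given score, group X students earn somewhat lower GPAs on average.
gpa = 0.5 * score - 0.15 * group + 0.8 * rng.standard_normal(n)

# Regress GPA on an intercept, test score, and the group indicator.
X = np.column_stack([np.ones(n), score, group])
beta, *_ = np.linalg.lstsq(X, gpa, rcond=None)
print(f"group coefficient ~ {beta[2]:.2f}")
# A negative coefficient means the test over-predicts group X's grades;
# a positive coefficient would indicate under-prediction, i.e. bias against group X.
```

Real validity studies add controls and use full regression machinery, but the sign of that group coefficient is the core of the argument.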
8) Coaching produces large gains: The authors show data suggesting that gains from test prep and coaching are overstated. This makes sense to me on a few levels. First, the students who avail themselves of test prep include a substantial pool of students who did poorly on their first try, and regression to the mean is surely a factor here. Second, this pool includes kids who did not make the minimal effort to familiarize themselves with the test beforehand. Getting familiar with the task at hand is a modest undertaking, one that does not require expensive tutors, though tutors will happily collect a fee for helping with it. There are no doubt gains from minimal due diligence, and gains from re-testing after some coaching may in part reflect that due diligence. The open question is whether those gains reflect much beyond that, i.e. things that a kid couldn't do without substantial resources.
Besides, if shilling is a concern, then taking claims about the value of coaching at face value amounts to trusting marketing materials from Kaplan, Princeton Review, etc. That's a dubious proposition.
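The regression-to-the-mean mechanism in point 8 is easy to demonstrate with a toy simulation (mine, not the book's): give every student a fixed "true ability", add independent noise to two sittings, and look at the average change among those who retake after a weak first score. They improve on average even though nothing about their ability changed and no coaching exists in the model.

```python
import numpy as np

# Toy illustration of regression to the mean among retakers; simulated, not the book's data.
rng = np.random.default_rng(2)
n = 100_000
ability = rng.standard_normal(n)                  # fixed "true ability"
test1 = ability + 0.5 * rng.standard_normal(n)    # first sitting = ability + noise
test2 = ability + 0.5 * rng.standard_normal(n)    # second sitting, independent noise

# Students in the bottom quartile of first-sitting scores are the ones
# most likely to pay for coaching and retake the test.
retakers = test1 < np.quantile(test1, 0.25)
gain = (test2[retakers] - test1[retakers]).mean()

print(f"average score gain among low first-time scorers ~ {gain:.2f}")
# A clearly positive "gain" appears with zero coaching effect in this model,
# so part of any observed post-coaching improvement is just regression to the mean.
```

The point isn't that coaching does nothing, only that uncontrolled before/after gains are bound to overstate its effect.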
Now, that said, all of the points made here are worthy of follow-up. The subsequent chapters offer more in-depth analysis that deserves examination; the first chapter is suggestive motivation, but hardly conclusive evidence.