Monday, March 8, 2010

AP101 Brief # 7: Understanding IQ score differences via "IQ Test CHC DNA Fingerprints": Comment on Guevara v Thaler (TX, 2008, 2010)

In a prior post re: Guevara v Thaler (TX, 2008, 2010), I mentioned that a number of individuals had asked me to explain why two different IQ scores were possible from two different tests. Before addressing this issue, I draw the reader's attention to a number of AP101 briefs/reports previously posted (click here, here, and here). These prior posts provide important background information regarding the issue of IQ score differences (e.g., Flynn Effect, scoring errors, etc.).

What many laypersons do not understand, and unfortunately also some psychologists, is that the composite scores from different IQ test batteries are often based on different mixes of cognitive abilities. That is, many IQ tests may measure some of the same abilities (Tests A and B both measure apples), but Test A may measure some other abilities (oranges) not measured by Test B. Conversely, Test B may measure some abilities (bananas) not measured by Test A. And, even when two IQ test batteries have tests of similar abilities (apples), they may measure the abilities slightly differently or may measure a different subset or variety of abilities within the same ability domain (red delicious vs golden delicious).

Also, there is a significant difference between comprehensive IQ batteries (Wechsler batteries; Stanford-Binet; Woodcock-Johnson; etc.) and special-purpose batteries that are deliberately designed to measure a narrower and more restricted set of human abilities. Nonverbal IQ tests are one example of the latter. (Actually, "tests that measure abilities via non-verbal methods" is a more accurate description, as there is no such ability as "non-verbal" ability or IQ.) Nonverbal tests are frequently used when a person comes from a different culture and has a limited understanding of the English language. The so-called nonverbal tests attempt to tap key aspects of a person's cognitive abilities via the use of directions and response formats that impose minimal or no language demands on the examinee (e.g., directions administered via gestures or pantomime).

Let's use two scores from the Guevara case to illustrate.

First, in order to determine if cognitive ability content coverage may explain all (or a part) of total IQ score differences, one needs a taxonomy to categorize the abilities measured by different IQ tests. As I've blogged about repeatedly, the Cattell-Horn-Carroll (CHC) theory of cognitive abilities is now considered the consensus psychometric taxonomy of human abilities (McGrew, 2009). Using the CHC taxonomy, I examined the type and amount of different CHC abilities (fruits) measured by the two primary IQ measures administered to Guevara.

On the Spanish version of the Woodcock-Johnson-Revised cognitive battery (Bateria-R; BAT-R), Guevara obtained a total composite IQ score of 60 (+/- 5 point confidence band: 55-65). On the Test of Nonverbal Intelligence-2nd Edition (TONI-2), Guevara obtained a score of 77 (+/- 5 point confidence band: 72-82). As one can see, even after accounting for unreliability in measurement via the application of the standard error of measurement (SEM) to the scores, the two respective confidence bands do not overlap (55-65 v 72-82). If the IQ score confidence bands did overlap, one would then assume that the difference in IQ scores is not a reliable difference and reflects the known degree of measurement error in each battery's score. However, in the current illustration, the bands do not overlap. There is a statistically significant, reliable difference between the BAT-R and TONI-2 scores.
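The band comparison described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not a scoring procedure: the function names are hypothetical, and the half-width of 5 points is simply the band reported in this post for both scores.

```python
def confidence_band(score, half_width=5):
    """Return the (low, high) band around an obtained score.

    half_width=5 mirrors the +/- 5 point band quoted in the post;
    it is an input here, not something derived from an SEM table.
    """
    return (score - half_width, score + half_width)


def bands_overlap(band_a, band_b):
    """True when two closed intervals share at least one point."""
    return band_a[0] <= band_b[1] and band_b[0] <= band_a[1]


bat_r_band = confidence_band(60)   # BAT-R composite
toni_2_band = confidence_band(77)  # TONI-2

print(bat_r_band, toni_2_band)                 # (55, 65) (72, 82)
print(bands_overlap(bat_r_band, toni_2_band))  # False
```

Because the two intervals share no point, the sketch reaches the same conclusion as the text: the difference is larger than the stated measurement error can account for.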

Using the recognized research-based extant CHC IQ test analysis literature (Flanagan, McGrew, & Ortiz, 2000; Flanagan, Ortiz, & Alfonso, 2007; McGrew & Flanagan, 1998; Woodcock, 1990), I developed the following figure. I call these figures the "IQ Test CHC DNA Fingerprints" for each intelligence battery. In the current figure I have superimposed the IQ Test CHC DNA Fingerprints for both the BAT-R and TONI-2. It should be immediately clear that the BAT-R and TONI-2 IQ scores are composed of dramatically different mixtures of cognitive abilities (different mixtures of fruit). The BAT-R, which is grounded in CHC theory, provides a total composite IQ score based on a combination of seven different tests that each measure a different CHC ability domain. Each contributes 14.3% (one-seventh) to the composite IQ score. Conversely, the TONI-2 is a special-purpose nonverbal battery that consists entirely of a single set of items that measure only one CHC ability domain (i.e., Gf). Clearly one would not expect all individuals to receive similar total IQ scores from these two different IQ batteries, given that people display variability in different CHC abilities (relative strengths and weaknesses). We clearly have a situation where the proverbial "apples and oranges" issue is operating.

[Figure: superimposed IQ Test CHC DNA Fingerprints for the BAT-R and TONI-2]

The psychologist who interpreted the scores indicated that the TONI-2 results were "consistent" with those from the BAT-R.  The psychologist was correct.

Below are the obtained standard scores for the seven BAT-R tests, along with their CHC ability designation:
  • Memory for Names (Glr): 60
  • Memory for Sentences (Gsm): 91
  • Visual Matching (Gs): 67
  • Incomplete Words (Ga): 70
  • Visual Closure (Gv): 67
  • Picture Vocabulary (Gc): 70
  • Analysis-Synthesis (Gf): 83
Note the BAT-R score (83) for the Gf test (Analysis-Synthesis), which taps the same CHC domain (and the only CHC domain) measured by the TONI-2. The BAT-R Gf score of 83 is only 6 points different from the 100% Gf-focused TONI-2 score of 77. As discussed in a previous report, a difference of 6 standard score points is very typical across intelligence tests. Also, when one places confidence bands around the scores of 77 and 83, they overlap, indicating that one should conclude that there is no reliable, statistically significant difference between these two respective Gf-indicator scores.
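The "apples to apples" comparison can be worked through with the seven scores listed above. The following is a rough sketch: the variable names are hypothetical, and the +/- 5 point band is an assumption borrowed from the composite-score discussion earlier in this post (the band actually used for an individual test is not reported here).

```python
# Obtained BAT-R standard scores, keyed by test name: (CHC domain, score)
bat_r_tests = {
    "Memory for Names": ("Glr", 60),
    "Memory for Sentences": ("Gsm", 91),
    "Visual Matching": ("Gs", 67),
    "Incomplete Words": ("Ga", 70),
    "Visual Closure": ("Gv", 67),
    "Picture Vocabulary": ("Gc", 70),
    "Analysis-Synthesis": ("Gf", 83),
}

TONI_2_SCORE = 77  # the TONI-2 measures Gf only
HALF_WIDTH = 5     # ASSUMPTION: same +/- 5 band as used for the composites

# Pull out the one BAT-R test in the same CHC domain as the TONI-2.
bat_r_gf = next(score for domain, score in bat_r_tests.values() if domain == "Gf")

difference = abs(bat_r_gf - TONI_2_SCORE)
gf_band = (bat_r_gf - HALF_WIDTH, bat_r_gf + HALF_WIDTH)
toni_band = (TONI_2_SCORE - HALF_WIDTH, TONI_2_SCORE + HALF_WIDTH)
overlap = gf_band[0] <= toni_band[1] and toni_band[0] <= gf_band[1]

print(difference)          # 6
print(gf_band, toni_band)  # (78, 88) (72, 82)
print(overlap)             # True
```

Under that assumed band, the two Gf indicators overlap (78-82), which matches the conclusion that they do not differ reliably.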

It should be obvious that one cannot directly compare a narrowly focused special-purpose IQ score that measures only Gf (100%) to a composite IQ score in which Gf contributes just 14.3% and which also includes 14.3% contributions from each of 6 other human cognitive ability domains (Ga, Gc, Gs, Gv, Glr, Gsm). In this case we are comparing a measure of one type of fruit (TONI-2; Gf) to a measure that is a more complete produce market of fruits (BAT-R; Gf, Gv, Ga, Gsm, Glr, Gs, Gc).
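One way to make the mismatch concrete is to write each battery's "fingerprint" as a set of domain weights. The sketch below uses only the weights stated in this post (seven equal shares for the BAT-R, 100% Gf for the TONI-2). The `shared_weight` function is a hypothetical illustration of how little the two mixtures have in common; it is not a published psychometric index.

```python
CHC_DOMAINS = ["Gf", "Gv", "Ga", "Gsm", "Glr", "Gs", "Gc"]

# BAT-R: seven tests, one per domain, each an equal share of the composite.
bat_r = {domain: 1 / len(CHC_DOMAINS) for domain in CHC_DOMAINS}

# TONI-2: a single set of items, all measuring Gf.
toni_2 = {"Gf": 1.0}


def shared_weight(fingerprint_a, fingerprint_b):
    """Sum, over all domains, the weight both batteries give that domain.

    Hypothetical illustration only: 1.0 means identical mixtures,
    0.0 means no domain in common.
    """
    domains = set(fingerprint_a) | set(fingerprint_b)
    return sum(
        min(fingerprint_a.get(d, 0.0), fingerprint_b.get(d, 0.0))
        for d in domains
    )


print(round(bat_r["Gf"] * 100, 1))                    # 14.3
print(round(shared_weight(bat_r, toni_2) * 100, 1))   # 14.3
```

In this sketch the only content the two scores share is the BAT-R's one-seventh Gf share; the remaining six-sevenths of the BAT-R composite reflects domains the TONI-2 does not measure at all.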

So which is more accurate or valid?  If one assumes that both tests were administered properly and the examinee understood all tests, the test battery that provides for the measurement of a more comprehensive range of CHC abilities should take precedence as the most valid and reliable indicator of the person's general intellectual ability.

When one compares the same "apples" (Gf coverage measures) from the BAT-R and TONI-2, the test results are "consistent" and validate one another.

Bottom line: One must be cautious when comparing total IQ scores from different IQ batteries, especially when one is a comprehensive IQ battery and the other is a narrowly focused special-purpose battery that measures only 1 (or maybe 2-3) CHC ability domains. Although not as dramatic as in this comparison, even different comprehensive IQ batteries can produce different total IQ scores if those scores are "flavored" with different mixtures of abilities (fruits). This can even be seen in score differences between editions of the same test when its coverage of measured abilities has changed across editions (see the CHC analysis of the evolving CHC content of the Wechsler intelligence batteries across time, which is presented in the same IQ Test CHC DNA Fingerprint format).

I will soon be posting IQ Test CHC DNA Fingerprints for all the major IQ batteries and select special-purpose batteries. Stay tuned.
