Friday, February 5, 2010

AP101 Brief #6: Understanding Wechsler IQ score differences--the CHC evolution of the Wechsler FS IQ score

[Note.  A typo in the original tables used to construct the WAIS figure below has been fixed.  Visual Puzzles on the WAIS-IV had been incorrectly designated as a measure of Gf; it should have been classified as Gv.  This has now been changed and the corresponding text modified.  Sorry for this error.]

Why do the IQ scores for the same individual often differ?

This question often perplexes both users and recipients of psychological reports. In a previous IAP Applied Psychometrics 101 report (AP101 #1:  Understanding IQ score differences) I discussed general statistical information related to the magnitude and frequency of expected IQ score differences for different tests (as a function of the correlation between tests).  In that report I mentioned the following general categories of possible reasons for IQ score differences/discrepancies.
Factors contributing to significant IQ differences are many, and include: (a) procedural or test administration issues (e.g., scoring errors; improper test administration; malingering; age vs grade norms), (b) test norm or standardization differences (e.g., possible errors in the norms; sampling plan for selecting subjects for developing the test norms; publication date of test), (c) content differences, and/or, (d) in the case of group research, research methodology issues (e.g., sample pre-selection effects on reported mean IQs) (McGrew, 1994).
At this time I return to one of these factors--content differences. This brief report does not focus on content differences between different IQ tests but, instead, focuses on the changing content across the various editions of the two primary Wechsler intelligence batteries (WISC/WAIS). This information should be useful when individuals are comparing IQ scores (for the same person) based on different versions of the Wechslers.

Of course, content differences will not be the only reason for possible IQ score differences across editions of the Wechslers for an individual. Other possible reasons include real changes in intelligence, serious scoring errors in either of the two test administrations, the Flynn effect, and other factors. This post focuses only on the changing CHC content of the WISC and WAIS series of intelligence batteries.

As discussed previously in numerous posts, contemporary CHC theory is currently considered the consensus psychometric taxonomy of human cognitive abilities (click here for prior posts and information regarding the theory).  For this brief report, I reviewed the extant CHC-organized factor analysis literature on the various Wechsler intelligence batteries. I then used this information in the following steps:

1.  I identified the individual subtests in all editions of the WISC and WAIS batteries that contributed to the respective Full Scale (FS) IQ score for each battery.

2.  Using the accepted authoritative sources re: the CHC analysis of the Wechsler intelligence batteries (Flanagan, McGrew and Ortiz, 2000; Flanagan, Ortiz, and Alfonso, 2007; McGrew and Flanagan, 1998; Woodcock, 1990), I classified each of the above identified subtests as per the broad CHC ability (or abilities) measured by each subtest.  For readers who want a very brief CHC overview (and ability definition cheat-sheet), click here.

3.  I calculated the percentage of each broad CHC ability represented in each battery's respective FS IQ. For example, for the 1974 WISC-R, the FS IQ is calculated by summing the WISC-R scaled scores from 10 of the individual subtests. Four of these 10 subtests (Information, Comprehension, Similarities, and Vocabulary) have all been consistently classified as indicators of broad Gc. Since each of the individual subtests contributes equally to the FS IQ score, Gc represents at least 40% (4 of 10) of the WISC-R FS IQ.
  • However, the extant CHC Wechsler research has consistently identified a few tests with dual CHC factor loadings. In particular, both Picture Completion and Picture Arrangement have been consistently reported to load on both the Gv (performance scale) and Gc (verbal scale) on the WISC-R. For tests that demonstrated consistent dual CHC factor loadings, I assigned each broad CHC ability measured as representing 1/2 (0.5) of the test. More precise proportional calculation might have been possible (via the calculation of the average factor loadings across all studies), but for the current purpose I used this  simple and (IMHO) reasonably approximate method.
  • As a result, the Picture Completion and Picture Arrangement subtests were each assigned 1/2 (0.5) Gc and 1/2 (0.5) Gv ability classifications. When added together, these two 0.5 Gc test classifications sum to 1.0. When combined with the other four clear Gc tests mentioned above, the final Gc test indicator total is 5.  As a result, the total Gc proportional percentage of the WISC-R FS IQ was calculated as 50% (5 of 10).
4.  Although the Wechsler CHC classifications were based on the primary sources noted above, I did revise some commonly accepted classifications based upon my professional opinion (when supported by empirical research). For example, the Arithmetic subtest has frequently been classified as a measure of Gf, Gsm, and sometimes Gs.  However, when valid factor indicators of Quantitative Knowledge (Gq) have been included in analyses, the Arithmetic subtest consistently displays a robust loading on the Gq factor and only minor loadings on other CHC abilities. I placed greater stock in these studies (e.g., Phelps et al., 2005; Woodcock, 1990) as I deem these to be better designed CHC studies (they included a broader array of CHC ability indicators).  My final determination for Arithmetic was that it is a test that measures both Gq and Gsm.
  • In addition, where appropriate and consistent with published research, I modified a few other commonly accepted CHC Wechsler test classifications to reflect recent research (e.g., Kaufman et al., 2001; Keith et al., 2006; Keith & Reynolds, in press, "CHC abilities and cognitive tests: What we've learned from 20 years of research," Psychology in the Schools; Lichtenberger & Kaufman, 2001; McGrew, 2009; Tulsky & Price, 2003; plus the factor studies reported in the respective technical manuals of each battery). Referring to the mixed measures of Picture Completion and Picture Arrangement mentioned above, research with the WISC-IV has suggested that Picture Completion is primarily a measure of Gv (Gc factor loading minimal or nonexistent) while Picture Arrangement continues to show significant loadings on both Gv and Gc. Thus, Picture Arrangement was classified as a mixed measure of Gc and Gv for all editions of the WISC. In contrast, in the case of the WISC-IV, Picture Completion was classified as a measure of Gv.
  • It is not possible to describe in detail all of the minor "fine tunings" I did for select Wechsler CHC test classifications. The basis for all are included in the various reference sources cited above. In the final analysis the Wechsler CHC test classifications used in this brief report are those made by myself (Kevin McGrew) based on my integration and understanding of the extant empirical research regarding the CHC abilities measured by individual tests in both the WISC and WAIS series of intelligence batteries.
5.  Finally, I calculated the proportion of CHC abilities represented in the FS IQ scores for all editions of the WISC and WAIS.  These values were tabled and plotted on graphs.  The summary graphs are presented below.
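The calculation in steps 3 to 5 can be sketched in a few lines of Python. This is only an illustration, not the spreadsheet actually used for the graphs. The classifications for Information, Comprehension, Similarities, Vocabulary, Picture Completion, Picture Arrangement, and Arithmetic come from the text above; treating Block Design and Object Assembly as Gv and Coding as Gs is an assumption added here, chosen because it reproduces the WISC-R percentages reported in this post. The names `WISC_R_FSIQ` and `chc_percentages` are hypothetical.

```python
from collections import defaultdict
from fractions import Fraction

HALF = Fraction(1, 2)

# WISC-R subtests contributing to the FS IQ, with the share of each
# subtest assigned to a broad CHC ability (dual-loading tests split 0.5/0.5).
WISC_R_FSIQ = {
    "Information": {"Gc": 1},
    "Comprehension": {"Gc": 1},
    "Similarities": {"Gc": 1},
    "Vocabulary": {"Gc": 1},
    "Picture Completion": {"Gc": HALF, "Gv": HALF},
    "Picture Arrangement": {"Gc": HALF, "Gv": HALF},
    "Arithmetic": {"Gq": HALF, "Gsm": HALF},
    # Assumed classifications (not spelled out in the text above):
    "Block Design": {"Gv": 1},
    "Object Assembly": {"Gv": 1},
    "Coding": {"Gs": 1},
}


def chc_percentages(subtests):
    """Return the percentage of the FS IQ attributable to each CHC ability.

    Every subtest counts equally, so an ability's percentage is the sum of
    its shares across subtests divided by the number of subtests.
    """
    totals = defaultdict(Fraction)
    for name, shares in subtests.items():
        if sum(Fraction(s) for s in shares.values()) != 1:
            raise ValueError(f"shares for {name} must sum to 1")
        for ability, share in shares.items():
            totals[ability] += Fraction(share)
    n = len(subtests)
    return {ability: float(100 * total / n) for ability, total in totals.items()}


percentages = chc_percentages(WISC_R_FSIQ)
for ability, pct in sorted(percentages.items(), key=lambda kv: -kv[1]):
    print(f"{ability}: {pct:.1f}%")
```

Exact fractions are used so that the 0.5 splits add up without rounding error. Swapping in a different subtest table (for example, one for another edition) only requires changing the dictionary.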





Conclusions/observations:  A review of all information presented (in and across both graphs) produces a number of interesting conclusions and hypotheses. I only present a few at this time. I encourage others to review the documents and provide additional insights or commentary via the comment feature of the blog or on the various listservs where I have posted an FYI message regarding this set of analyses.

1.  Historically, the FS IQ score from the Wechsler batteries, which is typically interpreted as a measure of general intelligence (g), has been heavily weighted towards the measurement of Gc and Gv abilities. This should not be surprising given the original design blueprint specified by David Wechsler (the measurement of intelligence vis-a-vis two different modes of expression).

2.  The WISC series remained constant in the CHC FS IQ composition from 1949 to 1991. Although tests may have been revised or replaced, the differential CHC proportional contribution to the FS IQ was essentially the same across all three editions. Following the 80% combined contribution of Gc and Gv, much smaller contributions to the FS IQ came from measures of Gs (10%) and of Gq and Gsm (5% each).

3.  The WISC-IV represents a significant change in the general intelligence FS IQ score provided. Gc representation has decreased approximately 20%, Gv representation was cut in half (30% to 15%), Gs abilities increased slightly (5%), and Gq was eliminated. More importantly, there was a fourfold increase in the contribution of Gsm (from 5% to 20%) and a 20 percentage point increase in Gf representation (from 0% to 20%)! Clearly, different FS IQ scores may be obtained by the same individual when comparing WISC-IV FS IQ to either WISC-R or WISC-III scores.  More importantly, the difference may be a function of the different mixture of CHC abilities represented in the different editions of the WISC series.

4.  The first two editions of the WAIS (WAIS and WAIS-R) were identical in differential CHC ability contribution to the FS IQ score. However, starting with the WAIS-III, significant changes in the adult Wechsler battery commenced and were later amplified in the WAIS-IV. Both the WAIS-III and WAIS-IV FS IQs reduced the amount of Gc representation by approximately 14 to 15 percentage points. The contribution of Gv decreased only slightly (27.3% to 22.7%) from the WAIS-R to the WAIS-III, and then by a little under 3 more percentage points from the WAIS-III to the WAIS-IV (22.7% to 20%). Offsetting reductions in Gc and Gv over these two editions was a trend towards greater measurement of Gs (which has doubled from around 9% in the first two editions to approximately 18% to 20% in the last two editions). Gq FS IQ contribution has remained relatively similar throughout all editions. The most dramatic change, which is also consistent with the WISC series, is the introduction of Gf, which went from 0% in the WAIS-R to 9.1% in the WAIS-III and 10% in the WAIS-IV. In general, similar to the WISC series, the adult WAIS series FS IQ has slowly evolved in the CHC abilities it represents. Both Gc and Gv abilities have been systematically reduced, concurrently with significant increases in the contribution of Gs and Gf.
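Because a shift such as "5% to 20%" can be described either as a fourfold increase or as a 15 percentage point increase, it may help to compute both. The following is a minimal sketch using a hypothetical helper, `describe_change`, applied to a few of the figures quoted in the observations above; it is only arithmetic and does not add any new data.

```python
def describe_change(old_pct, new_pct):
    """Return (percentage-point change, ratio new/old).

    The ratio is None when the old value is 0, because a relative change
    from zero is undefined (e.g., Gf going from 0% to 20%).
    """
    points = round(new_pct - old_pct, 1)
    ratio = None if old_pct == 0 else round(new_pct / old_pct, 2)
    return points, ratio


# (ability, edition change, old %, new %) taken from the text above
examples = [
    ("Gsm", "WISC-III to WISC-IV", 5, 20),
    ("Gv", "WISC-III to WISC-IV", 30, 15),
    ("Gf", "WISC-III to WISC-IV", 0, 20),
    ("Gv", "WAIS-R to WAIS-III", 27.3, 22.7),
]

for ability, editions, old, new in examples:
    points, ratio = describe_change(old, new)
    ratio_text = "n/a (starts at 0)" if ratio is None else f"x{ratio}"
    print(f"{ability} ({editions}): {points:+} points, ratio {ratio_text}")
```

Reporting the point change alongside the ratio avoids ambiguity when the starting value is small or zero.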

Implications of the CHC evolution of the WISC and WAIS FS IQ scores are many if one attempts to compare a current IQ score from one battery to an older score from an earlier edition of the same battery (or compare an older score from the children's version to the latest edition of the adult version). Before one can assume that significant changes from a childhood WISC-based IQ to a WAIS-III or WAIS-IV IQ are due to certain factors (neurological insult, malingering, the Flynn effect, etc.), one should review the above graphs and consider the possibility that the different FS IQ scores may both be valid indicators of functioning but may represent different CHC mixes (flavors) of general intelligence.

The potential implications and hypotheses that can be generated with the aid of the above graphs are numerous. For example, Flynn (2006) has suggested that there are problems with the WAIS-III standardization norms given that studies comparing the WAIS-R/WAIS-III scores are not consistent with Flynn effect expectations.  According to Weiss (2007), Flynn is ignoring data that does not fit his theory and instead is using theory to question data (and the integrity of a test's norms). According to Weiss (2007), "the only evidence Flynn provides for this statement is that WAIS-III scores do not fit expectations made based on the Flynn effect. However, the progress of science demands that theories be modified based on new data. Adjusting data to fit theory is an inappropriate scientific method, regardless of how well supported the theory may have been in previous studies." (p. 1, from abstract).

I tend to concur with Weiss's argument that the mere finding that the WAIS-III results were inconsistent with Flynn effect expectations is insufficient evidence to claim that a test's norms are wrong. If the data don't fit--one may need to retrofit (your theory or hypothesis).  By inspecting the second graph above, one can see that a viable explanation for the apparent lack of the WAIS-R-to-WAIS-III Flynn effect is that the WAIS-III FS IQ score represents a different proportional composite of CHC abilities. More specifically, the WAIS-III reduced the proportional representation of Gc from 45.5% to 31.8%, decreased the Gv representation by approximately 5%, doubled the impact of Gs, and for the first time ever introduced close to 10% Gf representation. CHC content changes of the FS IQ scores between batteries may be at play.  Can anyone say "comparing apples to apples+oranges?"

And so on... more comments may be forthcoming.

PS - additional information not included in this original post has now been posted.  Click here.
