Today the opinion regarding the Atkins ID decision for Farad Roland was issued. As per my policy, having served as an expert witness in this particular case, I offer no comments. The opinion can be found here.
An attempt to provide understandable and up-to-date information regarding intelligence testing, intelligence theories, personal competence, adaptive behavior and intellectual disability (mental retardation) as they relate to death penalty (capital punishment) issues. A particular focus will be on psychological measurement, statistical and psychometric issues.
Monday, December 18, 2017
Thursday, October 15, 2015
Evolution of the WJ to WJ IV GIA and CHC clusters
[Image: portion of a table comparing the tests included in the GIA and broad CHC clusters across WJ editions.]
Long-time users of the various editions of the WJ cognitive battery (WJ, WJ-R, WJ III, WJ IV) know that the battery has continued to evolve over time. Above is a portion of a large table that summarizes the same and different tests included in the GIA (g-score) and broad CHC clusters across editions. The complete table demonstrates that the WJ has not remained static, with each new edition evolving in line with research and theory.
The practical benefit of the complete table comes when examiners want to compare similarly named cluster scores across different editions of the WJ--different scores may be due, in part, to the different mixtures of tests in the clusters across editions. I hope this is helpful.
The complete table can be downloaded here. The table is adapted from a similar table in
Cormier, D., McGrew, K., Bulut, O., & Funamoto, A. (2015). Exploring the relationships between broad Cattell-Horn-Carroll (CHC) cognitive abilities and reading achievement during the school-age years. Manuscript submitted for publication.
Labels:
IQ score differences,
WJ,
WJ III,
WJ IV,
WJ-R
Monday, September 30, 2013
IQ score differences across time may reflect real changes in the brain
Lay people and many professionals often express consternation when an individual's measured IQ scores differ at different points in his or her life. This concern is particularly heightened in high-stakes settings where differences in IQ scores can result in changes in eligibility for programs (e.g., social security disability income) or in life-or-death decisions (e.g., Atkins MR/ID death penalty cases).
Factors contributing to significant IQ score differences are many (McGrew, in press a) and may include: (a) procedural or test administration errors (e.g., scoring errors; improper nonstandardized test administration; malingering; age vs. grade norms; practice effects), (b) test norm or standardization differences (e.g., norm obsolescence or the Flynn Effect; McGrew, in press b), (c) content differences across different test batteries or between different editions of the same battery, or (d) variations in a person’s performance on different occasions.
An article "in press" (Neuroimage) by Burgaleta et al. (click here to view copy with annotated comments) provides the important reminder that differences in IQ scores for an individual (across time) may be due to real changes in general intelligence related to real changes in brain development. These researchers found that changes in cortical brain thickness were related to changes in IQ scores. They concluded that "the dynamic nature of intelligence-brain relations...support the idea that changes in IQ across development can reflect meaningful general cognitive ability changes and have a neuroanatomical substrate" (viz., changes in cortical thickness in key brain regions). The hypothesis was offered that changes in the the cortical areas of frontoparietal brain network (see P-FIT model of intelligence) may be related to changes in working memory, which in turn has been strongly associated with general reasoning (fluid intelligence; Gf).
The cortical thickness-IQ change relation was deemed consistent with "cellular events that are sensitive to postnatal development and experience." Possible causal factors suggested included insufficient education or social stimulation during sensitive developmental periods, as well as lifestyle, diet and nutrition, and genetic factors.
- McGrew, K. S. (in press a). Intellectual functioning: Conceptual issues. In E. Polloway (Ed.), Determining intellectual disability in the courts: Focus on capital cases. AAIDD, Washington, DC.
- McGrew, K. S. (in press b). Norm obsolescence: The Flynn Effect. In E. Polloway (Ed.), Determining intellectual disability in the courts: Focus on capital cases. AAIDD, Washington, DC.
Sunday, January 27, 2013
Research Byte: Which is the better measure of intelligence, the WAIS-III or the WAIS-IV?
Taub and Benson have published a new article comparing the changes from the WAIS-III to the WAIS-IV, with implications for Atkins cases. The abstract is below. Dr. Taub can be contacted via this link.
A previous IAP AP101 report dealing with WAIS-III/WAIS-IV structural changes is worth reading when reviewing this current article.
Monday, April 30, 2012
IAP AP101 Report # 13: Problems with the 1960 and 1986 Stanford-Binet IQ Scores in Atkins MR/ID Death Penalty Cases
Often in Atkins MR/ID death penalty cases, historical and contemporary IQ scores are available for review by psychological experts. In many cases these scores vary markedly, and the courts frequently wrestle with the issue of determining the best estimate of the person's general intelligence. A review of many Atkins cases reveals frequent mention of two "gold standard" IQ tests in reports or testimony--namely, the Stanford-Binet and the Wechsler series.
The purpose of this working paper is to alert psychologists and the courts to two little-known (but extremely important) dents in the gold standard status of two versions of the Stanford-Binet--the 1960 SB and the 1986 SB IV. If a Flynn effect adjustment is made to scores from a 1960 SB, the norm date used to calculate the magnitude of the Flynn effect should be 1932...not 1960. If SB IV scores exist in an individual's records, experts providing opinions regarding the individual's general level of intelligence should consider (a) eliminating the score from consideration, (b) not giving the score great weight in formulating an opinion, or (c) at a minimum, providing qualifying statements regarding the validity of the SB IV score, as required by the Joint Test Standards.
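To illustrate why the norm date matters, below is a minimal sketch of a Flynn effect adjustment, assuming the commonly cited correction of roughly 0.3 IQ points per year of norm obsolescence; the rate, the example scores, and the function name are my own illustration, not values from the working paper or any specific case.

```python
def flynn_adjusted_score(obtained_iq, norm_year, test_year, rate=0.3):
    """Subtract `rate` IQ points for every year between the test's
    norming date and the administration date (norm obsolescence).
    The 0.3 points-per-year rate is an assumption, not a value taken
    from the working paper."""
    years_obsolete = test_year - norm_year
    return obtained_iq - rate * years_obsolete


# Hypothetical example: a score of 72 on a 1960 SB administered in 1975.
# Using 1960 as the norm year understates obsolescence relative to the
# 1932 norm date discussed above.
print(round(flynn_adjusted_score(72, norm_year=1960, test_year=1975), 1))  # 67.5
print(round(flynn_adjusted_score(72, norm_year=1932, test_year=1975), 1))  # 59.1
```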
IAP Applied Psychometrics 101 Report # 13 can be downloaded by clicking here.
Thursday, September 22, 2011
IAP AP101 Brief # 10: Understanding IQ score differences: Examiner Errors
Why do significant differences in IQ scores often occur between different tests or the same test given at different times? The explanations are many. Previous IAP Applied Psychometric 101 Reports and Briefs have touched on a number of reasons. Click here to view or link to these reports.
In the first AP101 report, which I recommend reading prior to the material below, test administration and/or scoring errors (examiner errors) were mentioned as a possible reason for score discrepancies. The brief report below addresses this topic.
Test procedural and administration errors (examiner error)
Although most psychologists receive rigorous graduate training in the standardized administration of intelligence tests, the extant research on adherence to standardized administration and scoring procedures has consistently (and unfortunately) reported that examiner errors occur with enough regularity, for both novice and experienced psychological examiners, to be a concern.
Ramos, Alfonso, and Schermerhorn (2009) summarized the extant research on examiner errors and reported that most studies found sufficient average examiner error to produce significant changes in IQ scores for individuals. The most frequent types of errors included failure to record responses, use of incorrect basal and ceiling rules, reporting an incorrect global IQ score, incorrect adding of subtest scores, incorrect assignment of points for specific items, and incorrect calculation of the individual's age. For Wechsler-related studies, Ramos et al.'s review found average error rates ranging from 7.8 to 25.8 errors per test record, almost 90% of examiners making at least one error, and, in one study, two-thirds of the test records reviewed resulting in a change in the Full Scale IQ. Examiner errors do not appear to be instrument specific, as Ramos et al. reported an average error rate of 4.63 errors per test record on the WJ III Tests of Cognitive Abilities.
The importance of verifying accurate administration and scoring is evident in the finding that across experienced psychologists and students in graduate training, ranges of score differences were as high as 25, 22, and 11 points respectively for the WAIS-III Verbal, Performance and Full Scale IQ scores (Ryan & Schnakenberg-Ott, 2003). Despite examiners reporting confidence in their scoring accuracy, Ryan and Schnakenberg-Ott reported average levels of agreement with the standard (accurate) test record of only 26.3% (Verbal IQ), 36.8 % (Performance IQ), and 42.1 % (Full Scale IQ).
This level of examiner error is alarming, particularly in the context of important decision making (e.g., IQ score-based life-and-death Atkins MR/ID decisions; eligibility for intervention programs; eligibility for social security disability funds). The level of examiner experience does not appear to be an explanatory variable. More recently, when investigating a single subtest (WISC-IV Vocabulary), Erdodi, Richard, and Hopwood (2009) reported that more errors may be present when evaluating low- and high-ability examinees.
Numerous test development and professional training and monitoring recommendations have been suggested (see Erdodi et al., 2009; Hopwood & Richard, 2005; Kuentzel et al., 2011; Ramos et al., 2009; Ryan & Schnakenberg-Ott, 2003), some of which have empirically demonstrated improvements in accuracy (see Kuentzel, Hetterscheidt, & Barnett, 2011).
Examiner test administration and scoring errors can be one reason for discrepant IQ scores. It is clear that before attempting to interpret any IQ scores, or trying to reconcile score differences between tests, the first step is for all examiners to double-check their scoring. Another wise step is to seek independent review of a scored test record by another experienced examiner. In the case of Atkins decisions, attempts should be made to secure copies of the original IQ test records for independent review. If any clear errors are present, they should be corrected and new scores recalculated. Only then should psychologists proceed to draw conclusions about the consistency of, or differences between, scores from different IQ tests or from versions of the same test given at different times during an individual's life span.
Any intelligence test results used in an Atkins hearing must be subject to independent review of the original test protocol (this may be impossible for old historical testing results) to ensure against administration or scoring errors that might result in significant differences in the reported IQ score. This is critically important in Atkins cases, where the courts often use a strict, specific-IQ "bright line" cut score to determine the presence of an intellectual disability.
[The original post included images of the abstracts from the primary sources for this brief report.]
Tuesday, April 5, 2011
Time to Stop Executing the Mentally Retarded--The Case for Applying the Standard Error of Measurement
I am pleased to announce that the following IAP Applied Psychometrics 101 report (#11) is now available for viewing and download. I had the unique opportunity to tag along on this paper with Kevin Foley, who is conducting extensive research and writing regarding Atkins MR/ID cases. The manuscript is intended primarily for individuals in the legal profession (judges, lawyers) and is thus written in law review article format.
Although this report is intended primarily for readers of the ICDP blog, I am also posting it to the IQ's Corner blog, as those readers may find of interest the attempt to explain the SEM in terms understandable by non-psychologists.
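For readers who want a concrete feel for the SEM argument, here is a minimal sketch (not taken from the report itself) that builds an approximately 95% confidence interval around an obtained IQ score and checks whether it reaches a bright-line cut score. The SEM value, the cut score of 70, and the example score are all hypothetical.

```python
def confidence_interval(obtained_iq, sem, z=1.96):
    """Return (lower, upper) bounds of an approximately 95% confidence
    interval around an obtained score, given the test's standard error
    of measurement (SEM)."""
    half_width = z * sem
    return obtained_iq - half_width, obtained_iq + half_width


def reaches_cut_score(obtained_iq, sem, cut=70, z=1.96):
    """True if the interval includes the cut score, i.e., the obtained
    score cannot rule out true functioning at or below the cut."""
    lower, upper = confidence_interval(obtained_iq, sem, z)
    return lower <= cut <= upper


# Hypothetical example: obtained IQ of 74 with an assumed SEM of 2.5.
print(confidence_interval(74, 2.5))   # approximately (69.1, 78.9)
print(reaches_cut_score(74, 2.5))     # True: 70 falls inside the interval
```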
Monday, March 8, 2010
AP101 Brief # 7: Understanding IQ score differences via "IQ Test CHC DNA Fingerprints": Comment on Guevara v Thaler (TX, 2008, 2010)
In a prior post re: Guevara v Thaler (TX, 2008, 2010), I mentioned that a number of individuals had asked me to explain why two different IQ scores were possible from two different tests. Before addressing this issue, I draw the reader's attention to a number of AP101 briefs/reports previously posted (click here, here, and here). These prior posts provide important background information regarding the issue of IQ score differences (e.g., Flynn Effect, scoring errors, etc.).
What many laypersons, and unfortunately also some psychologists, do not understand is that different IQ test batteries produce composite scores that are often based on different mixes of cognitive abilities. That is, two IQ tests may measure some of the same abilities (Tests A and B both measure apples), but Test A may measure some abilities (oranges) not measured by Test B. Conversely, Test B may measure some abilities (bananas) not measured by Test A. And even when two IQ batteries include tests of similar abilities (apples), they may measure those abilities slightly differently or may measure a different subset or variety of abilities within the same ability domain (red delicious vs. golden delicious).
Also, there is a significant difference between comprehensive IQ batteries (the Wechslers, Stanford-Binet, Woodcock-Johnson, etc.) and special-purpose batteries that are deliberately designed to measure a narrower, more restricted set of human abilities. Nonverbal IQ tests are one example (actually, "tests that measure abilities via nonverbal methods" is a more accurate description--there is no such thing as a "nonverbal" ability or IQ). Nonverbal tests are frequently used when a person comes from a different culture and has a limited understanding of the English language. These so-called nonverbal tests attempt to tap key aspects of a person's cognitive abilities via directions and response formats that impose minimal or no language demands on the examinee (e.g., directions administered via gestures or pantomime).
Let's use two scores from the Guevara case to illustrate.
First, in order to determine if cognitive ability content coverage may explain all (or a part) of total IQ score differences, one needs a taxonomy to categorize the abilities measured by different IQ tests. As I've blogged about repeatedly, the Cattell-Horn-Carroll (CHC) theory of cognitive abilities is now considered the consensus psychometric taxonomy of human abilities (McGrew, 2009). Using the CHC taxonomy, I examined the type and amount of different CHC abilities (fruits) measured by the two primary IQ measures administered to Guevara.
On the Spanish version of the Woodcock-Johnson Revised cognitive battery (Bateria-R; BAT-R), Guevara obtained a total composite IQ score of 60 (plus or minus 5 point confidence band: 55-65). On the Test of Nonverbal Intelligence-2nd Edition (TONI-2), he obtained a score of 77 (plus or minus 5 point confidence band: 72-82). As one can see, even after accounting for unreliability in measurement by applying the standard error of measurement (SEM) to the scores, the two confidence bands do not overlap (55-65 vs. 72-82). If the confidence bands did overlap, one would assume that the difference in IQ scores is not a reliable difference and simply reflects the known degree of measurement error in each battery's score. In the current illustration, however, the bands do not overlap: there is a statistically significant, reliable difference between the BAT-R and TONI-2 scores.
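The band comparison just described can be expressed in a few lines of code. This is a minimal sketch using the scores and the plus-or-minus 5 point bands reported above; the helper functions are my own illustration, not part of either test's scoring software.

```python
def band(score, margin=5):
    """Confidence band of +/- `margin` points around an obtained score
    (the post reports +/- 5 point bands for both instruments)."""
    return score - margin, score + margin


def bands_overlap(score_a, score_b, margin=5):
    """True if the two bands share any values; if they do not, the
    score difference is treated as a reliable one."""
    low_a, high_a = band(score_a, margin)
    low_b, high_b = band(score_b, margin)
    return low_a <= high_b and low_b <= high_a


bat_r_iq, toni_2_iq = 60, 77                # scores reported in the post
print(band(bat_r_iq), band(toni_2_iq))      # (55, 65) (72, 82)
print(bands_overlap(bat_r_iq, toni_2_iq))   # False: a reliable difference
```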
Using the recognized research-based extant CHC IQ test analysis literature (Flanagan, McGrew, & Ortiz, 2000; Flanagan, Ortiz, & Alfonso, 2007; McGrew & Flanagan, 1998; Woodcock, 1990), I developed the following figure. I call these figures the "IQ Test CHC DNA Fingerprints" for each intelligence battery. In the current figure I have superimposed the CHC IQ Test DNA Fingerprints for both the BAT-R and TONI-2. It should be immediately clear that the BAT-R and TONI-2 IQ scores are composed of dramatically different mixtures of cognitive abilities (different mixtures of fruit). The BAT-R, which is grounded in CHC theory, provides a total composite IQ score based on a combination of seven different tests that each measure a different CHC ability domain. Each contributes 14.3% to the composite IQ score. Conversely, the TONI-2 is a special-purpose nonverbal battery that consists entirely of a single set of items that measure only one CHC ability domain (i.e., Gf). Clearly one would not expect all individuals to receive similar total IQ scores from these two different IQ batteries, given that people display variability in different CHC abilities (relative strengths and weaknesses). We clearly have a situation where the proverbial "apples and oranges" issue is operating.
[Figure: superimposed IQ Test CHC DNA Fingerprints for the BAT-R and TONI-2.]
The psychologist who interpreted the scores indicated that the TONI-2 results were "consistent" with those from the BAT-R. The psychologist was correct.
Below are the obtained standard scores for the seven BAT-R tests, along with their CHC ability designation:
- Memory for Names (Glr): 60
- Memory for Sentences (Gsm): 91
- Visual Matching (Gs): 67
- Incomplete Words (Ga): 70
- Visual Closure (Gv): 67
- Picture Vocabulary (Gc): 70
- Analysis-Synthesis (Gf): 83
It should be obvious that one cannot directly compare a narrowly focused special-purpose IQ score that measures only Gf (100%) to a composite IQ score in which Gf contributes 14.3% and which also includes 14.3% contributions from six other human cognitive ability domains (Ga, Gc, Gs, Gv, Glr, Gsm). In this case we are comparing a measure of one type of fruit (TONI-2; Gf) to a measure that is a more complete produce market of fruits (BAT-R; Gf, Gv, Ga, Gsm, Glr, Gs, Gc).
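To make the "fingerprint" contrast concrete, here is a minimal sketch that computes the proportional CHC composition of each composite under the equal-weighting assumption described above (one-seventh per test for the BAT-R, 100% Gf for the TONI-2); the data structures and function names are illustrative only.

```python
from collections import Counter

# CHC domain measured by each test contributing to each composite,
# as listed in the post (the TONI-2 entry is a single Gf item set).
BAT_R_TESTS = {
    "Memory for Names": "Glr",
    "Memory for Sentences": "Gsm",
    "Visual Matching": "Gs",
    "Incomplete Words": "Ga",
    "Visual Closure": "Gv",
    "Picture Vocabulary": "Gc",
    "Analysis-Synthesis": "Gf",
}
TONI_2_TESTS = {"TONI-2 item set": "Gf"}


def chc_composition(tests):
    """Percentage of the composite contributed by each CHC domain,
    assuming every test is weighted equally."""
    counts = Counter(tests.values())
    return {domain: round(100 * n / len(tests), 1) for domain, n in counts.items()}


print(chc_composition(BAT_R_TESTS))   # each of the 7 domains ~14.3%
print(chc_composition(TONI_2_TESTS))  # {'Gf': 100.0}
```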
So which is more accurate or valid? If one assumes that both tests were administered properly and the examinee understood all tests, the test battery that provides for the measurement of a more comprehensive range of CHC abilities should take precedence as the most valid and reliable indicator of the person's general intellectual ability.
When one compares the same "apples" (Gf coverage measures) from the BAT-R and TONI-2, the test results are "consistent" and validate one another.
Bottom line: One must be cautious when comparing total IQ scores from different IQ batteries, especially when one is a comprehensive IQ battery and the other is a narrowly focused special-purpose battery that measures only one (or maybe two or three) CHC ability domains. Although not as dramatic as in this comparison, even total IQ scores from different comprehensive IQ batteries can differ if the total IQ score is "flavored" with different mixtures of abilities (fruits). This can even be seen in differences in scores between different editions of the same test when the test has changed its coverage of measured abilities across editions (see the CHC analysis of the evolving CHC content of the Wechsler intelligence batteries across time, presented in the same IQ Test CHC DNA Fingerprint format).
I will soon be posting IQ Test CHC DNA Fingerprints for all the major IQ batteries and select special-purpose batteries. Stay tuned.
Friday, February 5, 2010
AP101 Brief #6: Understanding Wechsler IQ score differences--the CHC evolution of the Wechsler FS IQ score
[Note. A typo in the original tables used to construct the WAIS figure below has been fixed. Visual Puzzles on the WAIS-IV had been incorrectly designated as a measure of Gf----it should have been classified Gv. This has now been changed and the corresponding text also modified. Sorry for this error. Changes in the text are so designated below via the strikeover]
Why do the IQ scores for the same individual often differ?
This question often perplexes both users and recipients of psychological reports. In a previous IAP Applied Psychometrics 101 report (AP101 #1: Understanding IQ score differences) I discussed general statistical information related to the magnitude and frequency of expected IQ score differences for different tests (as a function of the correlation between tests). In that report I mentioned the following general categories of possible reasons for IQ score differences/discrepancies.
Factors contributing to significant IQ differences are many, and include: (a) procedural or test administration issues (e.g., scoring errors; improper test administration; malingering; age vs. grade norms), (b) test norm or standardization differences (e.g., possible errors in the norms; sampling plan for selecting subjects for developing the test norms; publication date of test), (c) content differences, and/or (d) in the case of group research, research methodology issues (e.g., sample pre-selection effects on reported mean IQs) (McGrew, 1994).
At this time I return to one of these factors--content differences. This brief report does not focus on content differences between different IQ tests but, instead, focuses on the changing content across the various editions of the two primary Wechsler intelligence batteries (WISC/WAIS). This information should be useful when individuals are comparing IQ scores (for the same person) based on different versions of the Wechslers.
Of course, content differences will not be the only reason for possible IQ score differences across editions of the Wechslers for an individual. Other possible reasons include real changes in intelligence, serious scoring errors in either of the two test administrations, the Flynn effect, and other factors. This post focuses only on the changing CHC content of the WISC and WAIS series of intelligence batteries.
As discussed in numerous previous posts, contemporary CHC theory is currently considered the consensus psychometric taxonomy of human cognitive abilities (click here for prior posts and information regarding the theory). For this brief report, I reviewed the extant CHC-organized factor analysis literature on the various Wechsler intelligence batteries. I then used this information as per the following steps:
1. I identified the individual subtests in all editions of the WISC and WAIS batteries that contributed to the respective Full Scale (FS) IQ score for each battery.
2. Using the accepted authoritative sources re: the CHC analysis of the Wechsler intelligence batteries (Flanagan, McGrew and Ortiz, 2000; Flanagan, Ortiz, and Alfonso, 2007; McGrew and Flanagan, 1998; Woodcock, 1990), I classified each of the above identified subtests as per the broad CHC ability (or abilities) measured by each subtest. For readers who want a very brief CHC overview (and ability definition cheat-sheet), click here.
3. I calculated the percentage of each broad CHC ability represented in each battery's respective FS IQ. For example, for the 1974 WISC-R, the FS IQ is calculated by summing the WISC-R scaled scores from 10 of the individual subtests. Four of these 10 subtests (Information, Comprehension, Similarities, and Vocabulary) have been consistently classified as indicators of broad Gc. Since each of the individual subtests contributes equally to the FS IQ score, Gc represents at least 40% (4 of 10) of the WISC-R FS IQ.
- However, the extant CHC Wechsler research has consistently identified a few tests with dual CHC factor loadings. In particular, both Picture Completion and Picture Arrangement have been consistently reported to load on both the Gv (performance scale) and Gc (verbal scale) on the WISC-R. For tests that demonstrated consistent dual CHC factor loadings, I assigned each broad CHC ability measured as representing 1/2 (0.5) of the test. More precise proportional calculation might have been possible (via the calculation of the average factor loadings across all studies), but for the current purpose I used this simple and (IMHO) reasonably approximate method.
- As a result, the Picture Completion and Picture Arrangement subtests were each assigned a 1/2 (0.5) Gc and 1/2 (0.5) Gv ability classification. When added together, these two 0.5 Gc test classifications sum to 1.0. When combined with the other four clear Gc tests mentioned above, the final Gc test indicator total is 5. As a result, the total Gc proportional percentage of the WISC-R FS IQ was calculated as 50%. (A minimal code sketch of this tallying procedure appears after this list.)
- In addition, where appropriate and consistent with published research, I modified a few other commonly accepted CHC Wechsler test classifications to reflect recent research (e.g., Kaufman et al., 2001; Keith et al., 2006; Keith & Reynolds, in press, CHC abilities and cognitive tests: What we've learned from 20 years of research, Psychology in the Schools; Lichtenberger & Kaufman, 2001; McGrew, 2009; Tulsky & Price, 2003; plus the factor studies reported in the respective technical manuals of each battery). Referring to the mixed measures of Picture Completion and Picture Arrangement mentioned above, research with the WISC-IV has suggested that Picture Completion is primarily a measure of Gv (its Gc factor loading is minimal or nonexistent), while Picture Arrangement continues to show significant loadings on both Gv and Gc. Thus, Picture Arrangement was classified as a mixed measure of Gc and Gv for all editions of the WISC. In contrast, for the WISC-IV, Picture Completion was classified as a measure of Gv.
- It is not possible to describe in detail all of the minor "fine tunings" I made for select Wechsler CHC test classifications. The basis for each is included in the various reference sources cited above. In the final analysis, the Wechsler CHC test classifications used in this brief report are those I (Kevin McGrew) made based on my integration and understanding of the extant empirical research regarding the CHC abilities measured by individual tests in both the WISC and WAIS series of intelligence batteries.
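As noted above, here is a minimal sketch of the tallying procedure described in step 3, with dual-loaded tests contributing 0.5 to each of their two abilities. The WISC-R subtest classifications in the example are my own reconstruction, chosen to be consistent with the percentages reported in the text, not a verbatim copy of the classifications used for the graphs.

```python
from collections import defaultdict

def fsiq_chc_percentages(subtest_abilities):
    """Percentage of the FS IQ attributable to each broad CHC ability.

    `subtest_abilities` maps each FS IQ subtest to the list of CHC
    abilities it measures; a dual-loaded test contributes 0.5 to each
    of its two abilities, as described in the bullets above."""
    totals = defaultdict(float)
    for abilities in subtest_abilities.values():
        for ability in abilities:
            totals[ability] += 1.0 / len(abilities)
    n_tests = len(subtest_abilities)
    return {a: round(100 * w / n_tests, 1) for a, w in totals.items()}


# Illustrative WISC-R classifications (my reconstruction, consistent with
# the 50% Gc / 30% Gv / 10% Gs / 5% Gq / 5% Gsm figures in the text).
wisc_r = {
    "Information": ["Gc"], "Comprehension": ["Gc"],
    "Similarities": ["Gc"], "Vocabulary": ["Gc"],
    "Picture Completion": ["Gv", "Gc"], "Picture Arrangement": ["Gv", "Gc"],
    "Block Design": ["Gv"], "Object Assembly": ["Gv"],
    "Coding": ["Gs"], "Arithmetic": ["Gq", "Gsm"],
}
print(fsiq_chc_percentages(wisc_r))
# {'Gc': 50.0, 'Gv': 30.0, 'Gs': 10.0, 'Gq': 5.0, 'Gsm': 5.0}
```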
Conclusions/observations: A review of all the information presented (in and across both graphs) produces a number of interesting conclusions and hypotheses. I present only a few at this time. I encourage others to review the documents and provide additional insights or commentary via the comment feature of the blog or on the various listservs where I have posted an FYI message regarding this set of analyses.
1. Historically, the FS IQ score from the Wechsler batteries, which is typically interpreted as a measure of general intelligence (g), has been heavily weighted towards the measurement of Gc and Gv abilities. This should not be surprising given the original design blueprint specified by David Wechsler (the measurement of intelligence vis-a-vis two different modes of expression).
2. The WISC series remained constant in its CHC FS IQ composition from 1949 to 1991. Although tests may have been revised or replaced, the differential CHC proportional contribution to the FS IQ was relatively equal across all three editions. Following the 80% combined contribution of Gc and Gv, much smaller contributions to the FS IQ came from measures of Gs (10%) and of Gq and Gsm (5% each).
3. The WISC-IV represents a significant change in the general intelligence FS IQ score provided. Gc representation decreased approximately 20%, Gv representation was cut in half (30% to 15%), Gs abilities increased slightly (5%), and Gq was eliminated. More importantly, there was a fourfold increase in the contribution of Gsm (from 5% to 20%) and a 20% increase in Gf representation (from 0 to 20%)! Clearly, different FS IQ scores may be obtained by the same individual when comparing a WISC-IV FS IQ to WISC-R/WISC-III scores. More importantly, the difference may be a function of the different mixtures of CHC abilities represented in the different editions of the WISC series.
4. The first two editions of the WAIS (WAIS and WAIS-R) were identical in the differential CHC ability contributions to the FS IQ score. However, starting with the WAIS-III, significant changes in the adult Wechsler battery commenced and were later amplified in the WAIS-IV. Both the WAIS-III and WAIS-IV FS IQs reduced the amount of Gc representation by approximately 14% to 15%. The contribution of Gv decreased only slightly (27.3% to 22.7%) from the WAIS-R to the WAIS-III.
The implications of the CHC evolution of the WISC and WAIS FS IQ scores are many if one attempts to compare a current IQ score from one battery to an older score from an earlier edition of the same battery (or to compare an older score from the children's version to the latest edition of the adult version). Before one can assume that significant changes from a childhood WISC-based IQ to a WAIS-III or WAIS-IV score are due to certain factors (neurological insult, malingering, the Flynn effect, etc.), one should review the above graphs and consider the possibility that the different FS IQ scores may both be valid indicators of functioning but may represent different CHC mixes (flavors) of general intelligence.
The potential implications and hypotheses that can be generated with the aid of the above graphs are numerous. For example, Flynn (2006) has suggested that there are problems with the WAIS-III standardization norms, given that studies comparing WAIS-R/WAIS-III scores are not consistent with Flynn effect expectations. According to Weiss (2007), Flynn is ignoring data that do not fit his theory and instead is using theory to question data (and the integrity of a test's norms). In Weiss's (2007) words, "the only evidence Flynn provides for this statement is that WAIS-III scores do not fit expectations made based on the Flynn effect. However, the progress of science demands that theories be modified based on new data. Adjusting data to fit theory is an inappropriate scientific method, regardless of how well supported the theory may have been in previous studies" (p. 1, abstract).
I tend to concur with Weiss's argument that the mere finding that the WAIS-III results were inconsistent with Flynn effect expectations is insufficient evidence to claim that a test's norms are wrong. If the data don't fit--one may need to retrofit (your theory or hypothesis). By inspecting the second graph above, one can see that a viable explanation for the apparent lack of a WAIS-R-to-WAIS-III Flynn effect is that the WAIS-III FS IQ score represents a different proportional composite of CHC abilities. More specifically, the WAIS-III reduced the proportional representation of Gc from 45.5% to 31.8%, decreased the Gv representation by approximately 5%, doubled the impact of Gs, and for the first time introduced close to 10% Gf representation. CHC content changes in the FS IQ scores between batteries may be at play. Can anyone say "comparing apples to apples+oranges"?
And so on.................more comments may be forthcoming.
PS - additional information not included in this original post has now been posted. Click here.
Thursday, September 17, 2009
Why IQ test scores differ: Applied Psychometrics 101---IQ scoring errors (next report)

Is it possible for an individual evaluated for mental retardation (as part of an Atkins proceeding) to have an increased probability of facing execution depending on who administers the intelligence test? Is it possible for an experienced psychological examiner to make a sufficient number of scoring errors to significantly change a person's IQ score from what it should be (if properly scored)? Unfortunately, the answers are "yes."
I learned this firsthand when I reviewed the test record and scoring of an intelligence test (on which I am a coauthor) in a Federal death penalty appeal hinging on the diagnosis of mental retardation. I have since been locating research articles on the accuracy of IQ test scoring for novice and experienced psychological examiners. The results are discouraging.
The next AP101 report will address the issue of test scoring errors in intelligence testing, with a particular emphasis on implications for Atkins MR death penalty cases. The report will include a summary of representative literature, a discussion of my findings in the recent case for which I was a consultant (presented in such a manner to not reveal the identity of the case or any individuals/agencies involved in the case), and recommendations to address the issue.
Stay tuned.
If you have not read the first report in the series, check it out. AP101 101: IQ Test Score Difference Series--#1 Understanding global IQ test correlations.