Thursday, October 8, 2009

AP101 Brief #1a: g or not to g in Atkins MR death penalty cases

Applied Psychometrics (AP) 101 Brief #1a:  g or not to g in Atkins MR death penalty cases (first in a series)

Regardless of whether one believes that general intelligence (g) exists or not (e.g., John Horn), and setting aside the search for the essence of g (via elementary cognitive tasks measuring reaction time, temporal processing, etc.) at the level of brain mechanisms (e.g., Jensen's neural efficiency hypothesis), one thing is clear from a reading of most Atkins MR death penalty cases: psychological experts testifying in these cases often argue that some IQ scores are more accurate estimates of the person's g-ness (IQ) than others.  This happens primarily because the courts, following recognized professional association definitions of mental retardation (APA, AAIDD), emphasize a "deficit in general intellectual functioning" as the first prong of an MR diagnosis.

For example, both in Davis (2009), and especially in Vidal (2007), major arguments focused on whether the Full Scale IQ score from the WAIS-III/IV was the best index of g-ness (and thus mental retardation or mental capacity), or whether one of the part scores (e.g., Verbal IQ, Performance IQ) should be used as the best estimate of the person's g-ness (due to extreme variability in the part scores).  The "my g-estimate is better than your g-estimate" argument appears to be a fundamental point of contention at the core of many Atkins cases, given the assumption that mental retardation is a global deficit in intelligence (see guest post by Watson for some alternative thoughts and excellent insights on the global vs modular nature of intelligence).

Then along comes Maldonado (2009), where the g-ness argument, at one juncture, is based on the belief that the Spanish WAIS-III Verbal IQ, which is best interpreted as a CHC measure of crystallized intelligence (Gc), should take precedence over the BAT-R total composite score, which comprises Gc and six other broad CHC abilities.

"My g-estimate....your g-estimate......this special "nonverbal" g-estimate is more accurate for this individual....that is not a good g-estimate....etc......" These back-and-forth arguments beg for empirical scrutiny.  So....buckle up and let's take a serious look at g-ness.  This is the introduction to a small series of posts that will eventually examine, with empirical data, the relative g-ness of the "gold standard" (WAIS-III/IV) composite scores that are most often debated in these matters.

But first a definition and some methodological background information.  According to the APA Dictionary of Psychology,  general intelligence (the general factor) is:
  • a hypothetical source of individual differences in GENERAL ABILITY (emphasis in original), which represents individuals' abilities to perceive relationships and to derive conclusions from them.  The general factor is said to be a basic ability that underlies the performance of different varieties of intellectual tasks, in contrast to SPECIFIC ABILITIES (emphasis in original), which are alleged each to be unique to a single task (p. 403).
[Note - some of the text below comes from Flanagan, McGrew & Ortiz (2000).  The Wechsler Intelligence Scales and Gf-Gc theory.  Boston:  Allyn & Bacon.]

Intelligence tests have been interpreted often as reflecting a general mental ability referred to as g (Anastasi & Urbina, 1997; Bracken & Fagan, 1990; Carroll, 1993a; French & Hale, 1990; Horn, 1988; Jensen, 1984, 1998; Kaufman, 1979, 1994; Keith, 1997; Sattler, 1992; Sattler & Ryan, 1999; Thorndike & Lohman, 1990).  The g concept was associated originally with Spearman (1904, 1927) and is considered to represent an underlying general intellectual ability (viz., the apprehension of experience and the eduction of relations) that is the basis for most intelligent behavior. The g concept has been one of the more controversial topics in psychology for decades (French & Hale, 1990; Jensen, 1992, 1998; Kamphaus, 1993; McDermott, Fantuzzo, & Glutting, 1990; McGrew, Flanagan, Keith, & Vanderwood, 1997; Roid & Gyurke, 1991; Zachary, 1990).

According to Arend et al. (2003), Jensen (1998a, 1998b) proposed that cognitive complexity might represent a fundamental aspect of g and could be quantified by inspecting the test measures' loadings on the first unrotated factor, because complex tasks show higher factor loadings than simple tasks on that factor.  In many respects, when psychologists are discussing mental retardation and general intelligence, there is an implicit assumption that low general intelligence (e.g., mental retardation) is reflected most clearly in performance on the most cognitively complex measures (i.e., high g measures).

As with the controversy surrounding the nature and meaning of g, disagreements exist about how best to calculate and report psychometric g estimates.  Almost all methods are based on some variant of principal component, principal factor, hierarchical factor, or confirmatory factor analysis (Jensen, 1998; Jensen & Weng, 1994).  Although a hierarchical analysis is generally preferred (see Jensen, 1998, p. 86), as long as the number of tests factored is relatively large, the tests have good reliability, a broad range of abilities is represented by the tests, and the sample is heterogeneous (preferably a large random sample of the general population), the psychometric g's produced by the different methods are typically very similar (Jensen, 1998; Jensen & Weng, 1994).  For the interested reader, Jensen's (1998) treatise on g (The g Factor) is suggested, as it represents the most comprehensive and contemporary integration of the g-related theoretical and research literature.

Operationally, the determination of high, moderate, or low g-ness of tests or composites has typically been based on each measure's correlation (a.k.a. factor or principal component loading) with a single common factor, component, or dimension extracted from the correlations among the set of measures in question.  Measures that "load" high on the g-factor are considered to be the better estimates of general intelligence.

Consider the following simple analogy (which is not original...I borrowed the conceptual idea from Cohen et al., 2006).  You have a special pole that possesses a special form of magnetism (general intelligence).  You throw a bunch of metal marbles (which are the test measures), which have different degrees of the same magnetic force, into a box with the pole at the center.  You gently shake the box.  When you open the box, there is one "king" marble at the top of the pole (it has the highest degree of shared magnetism with the strongest part of the pole), followed by the next strongest....and so on until the metal marble with the least amount of shared magnetic force is at the bottom.  The pole represents g (general intelligence) and the ordering of the metal marbles (the test measures) represents the ordering of the g-ness (degree of shared magnetic force) of the measures.  The "king" test/marble is assigned the highest numerical index, with each succeeding (and lower) test/marble assigned a slightly lower numerical index of g-ness (shared magnetism).

This is what principal component analysis conceptually accomplishes with a collection of IQ test measures.  It statistically orders the various psychometric measures from high g-loading to low g-loading.  This is the typical and traditional statistical currency used by psychometricians and psychologists when discussing the degree of g-ness or g-saturation of different measures, that is, which measures are most important for establishing an estimate of a person's general intelligence.
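The marble ordering can be sketched in a few lines of code.  What follows is only a minimal illustration, assuming NumPy is available: the four measure names and the correlation matrix are invented for the example (they are not published WAIS or WISC values), and `g_loadings` is a hypothetical helper name.  It extracts the first unrotated principal component from a correlation matrix and ranks the measures by their loadings on it.

```python
import numpy as np

def g_loadings(corr):
    """Loadings of each measure on the first unrotated principal component."""
    corr = np.asarray(corr, dtype=float)
    # eigh returns eigenvalues in ascending order for a symmetric matrix
    eigvals, eigvecs = np.linalg.eigh(corr)
    # loading = eigenvector element scaled by sqrt of the largest eigenvalue
    first = eigvecs[:, -1] * np.sqrt(eigvals[-1])
    # the sign of an eigenvector is arbitrary; orient it positively
    return first if first.sum() >= 0 else -first

# Hypothetical measures and correlations, made up for illustration only
names = ["Vocabulary", "Similarities", "Block Design", "Coding"]
corr = np.array([
    [1.00, 0.70, 0.45, 0.30],
    [0.70, 1.00, 0.50, 0.30],
    [0.45, 0.50, 1.00, 0.35],
    [0.30, 0.30, 0.35, 1.00],
])

loadings = g_loadings(corr)

# Order the "marbles" from the top of the pole to the bottom
ranking = sorted(zip(names, loadings), key=lambda pair: pair[1], reverse=True)
for name, loading in ranking:
    print(f"{name:<14s} {loading:.2f}")
```

With real data, the correlation matrix would come from a norming or validity sample, and a hierarchical or confirmatory model might be preferred over this simple principal component approach.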

The problem with within-battery factor analysis is that the resulting g-estimates depend on the battery itself.  For example, a test's loading [note: g-loadings are most often computed for the individual subtests in a test battery, and not the composite scores such as Verbal IQ, processing speed, etc.; it is the latter, the g-ness of composite scores, which appears to be a critical issue in many Atkins cases.  Thus, in this text I refer to a measure's g-loading, where "measure" could mean a test or a composite] on the general intelligence (g) factor will depend on the specific mixture of measures used in the analysis (Gustafsson & Undheim, 1996; Jensen, 1998; Jensen & Weng, 1994; McGrew, Untiedt, & Flanagan, 1996; Woodcock, 1990).  If a single vocabulary measure is combined with nine visual processing measures, the vocabulary measure will most likely display a relatively low g loading because the general factor will be defined primarily by the visual processing measures.  In contrast, if the vocabulary measure is included in a battery of measures that is an even mixture of verbal and visual processing measures, the loading of the vocabulary measure on the general factor will probably be higher.  It is important to understand that measures' g loadings, as typically reported, only reflect each measure's relation to the general factor within a specific intelligence battery.  Although in many situations a measure's g loading will not change dramatically when computed in the context of a different collection of diverse cognitive tests (Jensen, 1998; Jensen & Weng, 1994), this will not always be the case.
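The vocabulary example can be made concrete with a toy calculation.  The sketch below rests on assumptions that are mine and not from any published study: measures in the same domain correlate .60, measures in different domains correlate .30, and the helper names `first_pc_loadings` and `battery_corr` are hypothetical.  It compares the vocabulary measure's loading on the first unrotated principal component in a lopsided battery (one verbal, nine visual) with its loading in a balanced battery (five verbal, five visual).

```python
import numpy as np

def first_pc_loadings(corr):
    """Loadings on the first unrotated principal component."""
    eigvals, eigvecs = np.linalg.eigh(np.asarray(corr, dtype=float))
    loadings = eigvecs[:, -1] * np.sqrt(eigvals[-1])
    return loadings if loadings.sum() >= 0 else -loadings

def battery_corr(domains, within=0.6, across=0.3):
    """Toy correlation matrix: assumed r=within inside a domain, r=across between."""
    domains = np.asarray(domains)
    same_domain = domains[:, None] == domains[None, :]
    corr = np.where(same_domain, within, across).astype(float)
    np.fill_diagonal(corr, 1.0)
    return corr

# The vocabulary measure is always the first (index 0) verbal measure
lopsided = ["verbal"] + ["visual"] * 9       # 1 verbal, 9 visual
balanced = ["verbal"] * 5 + ["visual"] * 5   # even mixture

vocab_lopsided = first_pc_loadings(battery_corr(lopsided))[0]
vocab_balanced = first_pc_loadings(battery_corr(balanced))[0]

print(f"Vocabulary g loading, 1 verbal + 9 visual: {vocab_lopsided:.2f}")
print(f"Vocabulary g loading, 5 verbal + 5 visual: {vocab_balanced:.2f}")
```

Under these made-up correlations the same vocabulary measure loads noticeably lower in the lopsided battery, even though nothing about the measure itself has changed.  Only the company it keeps is different.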

Within-battery (internal validity) vs across-battery (joint; external validity) estimation of test measures' g-ness

When measures from different batteries are combined in the joint-battery approach, the battery-bound g estimates for some measures may be altered significantly.  Flanagan et al. (2000) demonstrated this when they calculated within- and joint-battery g estimates for the WISC-III.  These estimates were derived from a sample of 150 subjects who were administered the WISC-III and WJ III cognitive measures as part of the Phelps validity study reported for the WJ III cognitive technical manual.  Within-battery g estimates were calculated with the WISC-III data based on the first unrotated principal component.  Next, the joint-battery factor analysis allowed for an examination of the WISC-III g estimates when calculated together with another intelligence test battery (WJ III), one that included a broader array of CHC ability measures.

Flanagan et al. (2000) reported that the within- and joint-battery WISC-III g loadings were similar for many of the individual measures.  For example, the within- and joint-battery test g loadings are generally similar (i.e., do not differ by more than .05) for the Similarities (.76 vs .71), Vocabulary (.78 vs .74), Digit Span (.48 vs .49), Block Design (.60 vs .61), Object Assembly (.50 vs .45), and Symbol Search (.57 vs .54) measures.  These six WISC-III measures appear to have similar g characteristics when examined from the perspective of either the WISC-III or CHC (WJ III battery) frameworks.  However, the joint-battery g loadings were noticeably lower than the within-battery g loadings (i.e., lower by .06 or more) for Information  (.77 vs .68), Arithmetic (.70 vs .64), Comprehension (.59 vs .51), Picture Completion (.50 vs .40), Picture Arrangement (.37 vs .31), and Coding (.46 vs .37).   The results suggested that the latter WISC-III measures were relatively weaker g indicators than is suggested by within-battery WISC-III g analysis.
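For readers who want to check the sorting themselves, the short sketch below recomputes the within-minus-joint differences from the loadings quoted in the paragraph above.  The .05 cutoff is the one stated in the text; the variable names are mine, and differences are rounded to two decimals to avoid floating-point noise at the cutoff.

```python
# (within-battery, joint-battery) g loadings as quoted in the text above
reported = {
    "Similarities":        (0.76, 0.71),
    "Vocabulary":          (0.78, 0.74),
    "Digit Span":          (0.48, 0.49),
    "Block Design":        (0.60, 0.61),
    "Object Assembly":     (0.50, 0.45),
    "Symbol Search":       (0.57, 0.54),
    "Information":         (0.77, 0.68),
    "Arithmetic":          (0.70, 0.64),
    "Comprehension":       (0.59, 0.51),
    "Picture Completion":  (0.50, 0.40),
    "Picture Arrangement": (0.37, 0.31),
    "Coding":              (0.46, 0.37),
}

CUTOFF = 0.05  # "similar" means the two loadings differ by no more than .05

# Positive difference = lower loading in the joint-battery analysis
diffs = {name: round(within - joint, 2)
         for name, (within, joint) in reported.items()}

similar = [name for name, d in diffs.items() if abs(d) <= CUTOFF]
lower_in_joint = [name for name, d in diffs.items() if d > CUTOFF]

for name, d in diffs.items():
    label = "similar" if name in similar else "lower in joint battery"
    print(f"{name:<20s} {d:+.2f}  {label}")
```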

This example demonstrates the potential chameleon nature of test measures' g estimates that are calculated within the confines of individual intelligence batteries when compared to those calculated within a comprehensive set of ability measures.

And, yet to be mentioned is another, older, and for some reason under-utilized statistical method for examining the g-ness (cognitive complexity) of IQ test measures...multidimensional scaling (MDS).  We will save that for the next post in this series.

To be continued........................