Nonverbal Ability Tests as Culture-fair Measures of g?
Some test authors are once again claiming that their nonverbal reasoning tests are “culture fair.” The “culture fair” claim is a less extreme version of an earlier claim that such tests were “culture free.” However, the intuitively plausible notion that nonverbal reasoning tests are “culture free” or “culture fair” has been roundly criticized by both measurement specialists and cognitive psychologists. The notion of a “culture free” test surfaced in the 1920s, became popular in the 1930s, but was debunked by some of the more thoughtful measurement experts in the 1940s. Cattell (1971) tried to resurrect the concept in the exposition of his theory of fluid versus crystallized abilities. Cattell attempted to avoid some of the more blatantly false assumptions of a “culture free” test by calling it “culture fair.” While many psychologists eventually became convinced of the utility of his concepts of fluid and crystallized abilities, the notion of “culture fair” tests continued to be widely criticized (Anastasi & Urbina, 1997; Cronbach, 1990; Scarr, 1994).

Footnote: These sorts of confusions are typically not detected by statistical tests for item bias. This is because strategies such as picking an associate of the last term in the problem stem will often lead to the keyed answer on simple problems. The student thus appears to be able to solve some easy items, but not the more difficult ones. Bias statistics look for cases in which an item is consistently harder or easier for students in a particular group, given the number of problems the student answers correctly.

Footnote 3: I agree with Lubinski (2003) that we should measure spatial abilities routinely in talent searches, especially if we can provide instruction that capitalizes on these abilities and furthers their development. However, one should measure these abilities explicitly rather than inadvertently.
The belief that one can measure reasoning ability in a way that eliminates the effects of culture is a recurring fallacy in measurement. Culture permeates nearly all interactions with the environment; indeed, the concept of intelligence is itself rooted in culture (Sternberg, 1985). Further, nonverbal tests such as the Raven Progressive Matrices do not measure the same functions as verbal tests (Scarr, 1994), often show larger differences between ethnic groups than verbal or quantitative tests (Jensen, 1998), and are particularly susceptible to practice and training (Irving, 1983). Indeed, as Scarr (1994) notes, although “tests such as the Raven Matrices may seem fair because they sample skills that are learned by nearly everyone” (p. 324), puzzle-like tests turn out to have their own limitations.
At the surface level, the claim that a test is “culture fair” means that the stimulus materials are assumed to be equally familiar to individuals from different cultures. Although there are cultures in which stylized pictorial stimuli are novel (Miller, 1997), children who have lived in developed countries are generally all exposed to common geometric shapes and line drawings of some sort. However, they may not be equally familiar with the names of these objects or as practiced in using those names. Stylized pictures of everyday objects often differ across cultures and within cultures across time. Thus, the assumption that the test stimuli are equally familiar to all is dubious (Laboratory of Comparative Human Cognition, 1982, p. 687).
At a deeper level, though, the claim is that the types of cognitive tasks posed by the items, and thus the cognitive processes children must use when solving them, are equally familiar. There is an aspect of problem solving that is clearly rooted in culture, namely the habit of translating events into words and talking about them. Although children may recognize ovals, triangles, and trapezoids, and may know about making things bigger or shading them with horizontal rather than vertical lines, the habit of labelling and talking aloud about such things varies across cultures (Heath, 1983). Children who do not actively label objects and transformations are more likely to resort to a purely perceptual strategy on nonverbal tests. Such strategies often succeed on the easiest items that require the completion of a visual pattern or a perceptually salient series, but fail on more difficult items that require the identification and application of multiple transformations on multiple stimuli (Carpenter, Just, & Shell, 1990).

Footnote: Children’s storybooks provide interesting examples of this variation in styles of depicting people, animals, and objects in different cultures. The variations are reminiscent of cultural variations in the onomatopoeic words for animal sounds.

Footnote: For an excellent summary of the role of parents in developing the educative reasoning abilities of their children, see J. Raven (2000). Raven argues that such development is promoted if parents involve children in their own thought processes. Such parents are more likely to respect their [...] and to initiate a cyclical process in which they discover just how competent their children really are and, as a result, become more willing to place them in situations that call for high-level competencies (p. 33).
Thus, although less extreme than the “culture free” claim, the “culture fair” claim is equally misleading. Both claims help perpetuate the myth that “real” abilities are innate; that culture, experience, and education are contaminants; and that intelligence is a unidimensional rather than a multidimensional concept. We have long known that, as Anastasi and Urbina (1997) observed, the very concept of intelligence is rooted in culture. Modern theories of intelligence begin with this fact (Sternberg, 1985). Importantly, they do not end there. Most go on to try to identify those cognitive structures and processes that generate observed differences on tasks valued as indicants of intelligence. But experience always moderates these interactions. And formal schooling organizes tasks that provide opportunities for these experiences. Because of this, intelligence becomes, as Snow and Yalow (1982) put it, “education’s most important product, as well as its most important raw material” (p. 496).
Indeed, education actively aims to cultivate intelligence (Martinez, 2000). Educators who work with children who learn quickly and deeply from school have the most to lose from the misconception that intelligence is independent of experience. If abilities developed independently of experience, then what need would we have for enrichment or acceleration or, indeed, for education at all? The myth that very able children will do fine if left to their own devices is rooted in this misconception.

The Prediction Efficiencies of Figural Reasoning Tests

Figural reasoning tests, then, are one important variety of nonverbal ability tests.
Examples include the Raven Progressive Matrices (Raven et al., 1983), the NNAT (Naglieri, 1997), and the Figure Analogies, Figure Classification, and Figure Analysis subtests of the Cognitive Abilities Test (Lohman & Hagen, 2001a). These sorts of tests are sometimes used when screening students for inclusion in programs for the gifted. Further, strong claims have sometimes been made for their usefulness in making such decisions.
Therefore, I focus exclusively on these sorts of tests in the remainder of this article. The first claim that I make is that these sorts of nonverbal, figural reasoning tests should not be the primary selection instruments for programs for the academically gifted and talented. The reasons typically given for their use are (a) that scores on such tests show correlations with academic achievement that, while lower than the correlations between verbal or quantitative reasoning tests and achievement, are certainly substantial, and (b) that differences between some (but not all) minority groups and English-speaking White students are smaller on figural reasoning tests than on tests with verbal content. Reduced mean differences make a common cut score seem more acceptable when identifying children for inclusion in programs. Many also erroneously assume that the nonverbal test is a culture-fair measure of ability. The reasons such tests should not be used as the primary selection tool are equally straightforward. Students who most need advanced academic instruction are those who currently display academic excellence. Although reasoning abilities are important aptitudes for academic learning, they are not good measures of current academic accomplishment.
Further, of the three major reasoning abilities, figural reasoning ability is the most distal aptitude for success in the primary domains of academic learning such as achievement in literacy or language arts, reading, writing, mathematics, science, and social studies. Selecting students for gifted and talented programs on the basis of a test of nonverbal reasoning ability would admit many students who are unprepared for and thus would not profit from advanced instruction in literacy, language arts, mathematics, science, or other content-rich domains. It would also not select, and thereby exclude, many students who either have already demonstrated high levels of accomplishment in one of these domains or whose high verbal or quantitative reasoning abilities make them much more likely to succeed in such programs. It would be like selecting athletes for advanced training in basketball or swimming or ballet on the basis of their running speed.
These abilities are correlated, and running is even one of the requisite skills in basketball, but running speed is not a fair or proper basis for such decisions. Further, the teams selected in this way would not only include a large number of athletes unprepared for the training that was offered, but would exclude many who would actually benefit from it. Rather, the best measure of the ability to swim or play basketball or perform ballet is a direct measure of the ability to swim or play basketball or perform ballet. In other words, the primary measure of academic giftedness is not something that predicts academic accomplishment, but direct evidence of academic accomplishment (Hagen, 1980). Understanding why a test that shows what some would consider a “strong” correlation with achievement should not be used as a substitute for the measure of achievement requires knowledge of how to interpret correlations. Sadly, many people who must rely on tests to make selection decisions do not understand how imprecise the predictions are, even from seemingly large correlations.
Figure 1. Example of a correlation of r = .6 between a Nonverbal ability test (abscissa) and a Mathematics achievement test (ordinate).
Figure 1 shows an example of what a scatterplot looks like for a correlation of r = .6, which is a reasonable estimate of the correlation between a nonverbal ability test and a concurrently administered mathematics achievement test for both minority and nonminority students (Naglieri & Ronning, 2000). Here the nonverbal ability test is on the X axis and a measure of mathematics achievement is on the Y axis. The percentile-rank scale is used since this is the common metric in selection decisions. Suppose we used the nonverbal reasoning test to identify students for a gifted and talented program, and that we admitted the top five percent.
How many students with mathematics achievement scores in the top five percent would be identified? In this particular sample, picking the top five percent on the nonverbal reasoning test would identify only four students who also scored in the top 5% on the mathematics achievement test. Two students actually scored below the population mean (PR = 50) on the achievement test. In general, picking the top 5% on the ability test would identify only 31% of the students in the top 5% of the math achievement test. Put differently, it would exclude 69% of the students with the best mathematics achievement.
Further, about 10% of those who were selected would actually have scored below the mean on the mathematics test. Someday these students may be ready for advanced instruction in mathematics, but clearly, they have less need for it now than the 69% of students with very high math scores who would be excluded. The situation is even worse if fewer students are selected (e.g., top 3 percent) or if the criterion is a verbal competency (such as writing) that has an even lower correlation with performance on the nonverbal test.
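The arithmetic behind these percentages is easy to check. The short simulation below is not part of the original analyses: it draws ability and achievement scores from a bivariate normal distribution with r = .6 and applies a top-5% cut on the ability test. The sample size, random seed, and normality assumption are mine; the rates it prints land in the same neighborhood as the figures cited above (which come from real data rather than an idealized normal model).

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 1_000_000, 0.6          # r = .6, as in Figure 1

# Correlated (nonverbal ability, math achievement) scores, bivariate normal.
ability = rng.standard_normal(n)
achievement = r * ability + np.sqrt(1 - r**2) * rng.standard_normal(n)

selected = ability >= np.quantile(ability, 0.95)               # top 5% on the ability test
top_achievers = achievement >= np.quantile(achievement, 0.95)  # top 5% in math

# Share of the top 5% math achievers that the ability cut actually captures,
# and share of those selected who fall below the population mean in math.
captured = (selected & top_achievers).sum() / top_achievers.sum()
below_mean = (achievement[selected] < 0).mean()

print(f"Top math achievers captured: {captured:.0%}")    # cf. the ~31% cited above
print(f"Selected but below the mean: {below_mean:.0%}")  # cf. the ~10% cited above
```

Lowering the selection percentile or the correlation in the sketch reproduces the “even worse” cases described above.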
But is it not true that nonverbal reasoning tests are good measures of g? Those who study the organization of human abilities using factor analyses routinely find that nonverbal reasoning tests are good measures of fluid reasoning ability (Gustafsson & Undheim, 1996). However, such analyses look only at that portion of the variation in test scores that is shared with other tests that are included in the factor analysis. Variation that is specific to the test is discarded from the analysis.
Those who use test scores for selection get both parts, not just that portion of the shared variation that measures g. Unfortunately, the specific variance on figural reasoning tests is typically as large as the variation that is explained by the g or Gf factor. The test score variance that is explained by the factor is given by the square of the test's loading on the factor. For example, if a test loads .6 on the Gf factor, then the Gf factor accounts for 36% of the variance on the test.
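A small worked example may make the decomposition concrete. The loading of .6 comes from the text; the reliability of .90 is a hypothetical value added only for illustration, since specific variance is what remains of the reliable variance after the factor is removed.

```python
# Variance decomposition for a single test under a one-factor model (illustrative).
loading = 0.6                 # loading on g/Gf, from the example in the text
reliability = 0.90            # hypothetical reliability, assumed for illustration

g_variance = loading ** 2                     # variance explained by the factor
error_variance = 1 - reliability              # unreliable (error) variance
specific_variance = reliability - g_variance  # reliable but test-specific variance

print(f"g: {g_variance:.2f}, specific: {specific_variance:.2f}, error: {error_variance:.2f}")
# g: 0.36, specific: 0.54, error: 0.10
```

On these illustrative numbers the reliable, test-specific variance is larger than the variance the g factor explains, which is the pattern described above for figural tests.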
Furthermore, the skills that are specific to the figural test are only rarely required in formal schooling. Indeed, as I later show, some of these spatial skills may actually interfere with academic learning. This is not true for verbal or quantitative reasoning tests. For these tests, most of the specific verbal or quantitative abilities measured are also required for success in school. Therefore, if we are interested in identifying those students most in need of acceleration in mathematics, social studies, or literature, then a reasoning test (especially a figural reasoning test) should not be the primary selection instrument. Later I will show that a nonverbal reasoning test is also not the best way to identify those students who are most likely to develop high levels of achievement in academic domains. I would not want to be the G&T coordinator saddled with the responsibility of explaining the fairness of such a test to the parents of the many extremely high-achieving but excluded students. I would also not want to be the administrator saddled with the responsibility of defending such a procedure in court.
Predicting Achievement for ELL Students

Would a figural reasoning test be more appropriate for identifying gifted English Language Learners (ELL) who perform well on tests that use a language other than English? Naglieri and Ronning (2000) report correlations between the NNAT and the Aprenda 2, an achievement test written in Spanish. The mean correlation between the NNAT and Spanish-language reading was r = .32. This means that picking Hispanic students for a program for gifted and talented students on the basis of their NNAT scores would generally exclude 80 percent of those who read well in Spanish (i.e., who score at or above the 90th percentile on the Aprenda 2). Figural reasoning abilities are not the same as verbal reasoning abilities in any language.
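The same kind of sketch, with the correlation dropped to .32, shows why so many strong Spanish readers would be missed. The top-10% NNAT cut and the normality assumption are mine; the reported correlation is the only input taken from Naglieri and Ronning (2000).

```python
import numpy as np

rng = np.random.default_rng(1)
n, r = 1_000_000, 0.32        # r = .32: NNAT vs. Spanish-language reading

nnat = rng.standard_normal(n)
reading = r * nnat + np.sqrt(1 - r**2) * rng.standard_normal(n)

selected = nnat >= np.quantile(nnat, 0.90)              # assumed top-10% NNAT cut
strong_readers = reading >= np.quantile(reading, 0.90)  # 90th percentile in Spanish reading

excluded = 1 - (selected & strong_readers).sum() / strong_readers.sum()
print(f"Strong Spanish readers excluded: {excluded:.0%}")  # close to the ~80% noted above
```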
Distinguishing Achievement from Reasoning Abilities

Some think that it is unfair to use language-based tests of any sort to estimate the abilities of bilingual or multilingual students. In large measure, this is because they want a test that measures the full extent of a child's verbal competence. This is understandable when identification is based on rank in the total sample rather than rank within the subgroup of students with similar linguistic experience. Others see no difference between verbal achievement tests and the sort of verbal reasoning tasks (such as analogies, sentence completions, or verbal classifications) used on verbal reasoning tests. These tests sometimes appear similar, and individual differences on them overlap more than they differ. But the constructs they claim to measure can be distinguished if tests are carefully constructed. Items on good verbal reasoning tests are constructed to emphasize reasoning processes and to reduce as much as possible the influence of extraneous factors such as word frequency.
Consider, for example, the verbal analogies subtest of the Cognitive Abilities Test (CogAT; Lohman & Hagen, 2001a) that is administered to 12th graders. The typical correct answer is a word that can be used correctly in a sentence by about 75% of 7th graders. The average vocabulary level of all other words in the analogy items is grade 5. Nevertheless, the analogy items are quite difficult. The typical 12th grade student answers only about half of the items correctly. Indeed, well-constructed vocabulary tests that use relatively common but abstract words are among the best measures of verbal reasoning. Students learn most new words by inferring plausible meanings from the contexts in which the words are embedded, and then remembering and revising these hypotheses as they encounter the words anew. Achieving precise understandings of relatively common, but abstract words is thus an excellent measure of the efficacy of past reasoning processes in many hundreds or thousands of contexts. On the other hand, knowledge of infrequent or specialized words, while sometimes useful as a measure of prior achievement, estimates reasoning poorly. Nevertheless, there is simply no way to measure verbal reasoning without recourse to words.
One can reduce the impact of reading or knowledge of specialized words or of linguistic conventions, but it is neither possible nor desirable to eliminate them. This is especially the case as one moves up the educational ladder. Similarly, good quantitative reasoning tests aim to reduce the impact of specific instruction in mathematics while emphasizing the ability to discover and apply increasingly abstract quantitative relationships. Once again, it is simply not possible to do this without using concepts and skills taught at some point in school. As with verbal reasoning tests, however, the goal is always to use concepts and skills that are well within the grasp of most children. Tests that make accurate discriminations among the most able children in any grade, however, must include some items that involve more abstract concepts. That children vary much more within a grade than across several adjacent grades is a fact that remains as unfamiliar to most educators as it is familiar to those who work with gifted children.
Figure 2. Correspondence between cognitive abilities and physical skills on the fluid-crystallized continuum.
Quantitative reasoning tests are particularly useful for identifying minority and ELL students who are likely to benefit from acceleration. The verbal requirements of such tests are minimal. Indeed, the directions are often shorter than directions for unfamiliar figural reasoning tests. Unlike figural reasoning, quantitative reasoning is an aptitude for a specific type of educational expertise that is developed in schools and thus affords enrichment and acceleration. Further, minority and ELL students generally perform quite well on such tests, often better than on tests of figural reasoning abilities. Finally, some argue that quantitative reasoning is actually a better marker for g than figural reasoning (Keith & Witta, 1997).
Many who object to tests that use words or numbers as stimuli see the proper role of ability tests as one of measuring innate potential or capacity. I do not think this is ever possible. Indeed, performance on figural reasoning tests (such as the Progressive Matrices Test and adaptations of it) is markedly affected by education and practice. The so-called Flynn effect is much larger for such tests than for more educationally loaded tests (Flynn, 1987, 1999). Further, careful studies show the heritability of scores on such tests to be the same as the heritability of scores on achievement tests. In other words, figural reasoning tests do not measure something that is any more (or less) the product of experience than good verbal reasoning tests.
Although Jensen (1998) disagrees, a much longer list of other notables agrees (Cronbach, 1990; Horn, 1985; Humphreys, 1981; Plomin & De Fries, 1998). Indeed, Humphreys was fond of pointing out that, in the Project Talent data, heritability coefficients were as high for a test of knowledge of the Bible as for measures of fluid reasoning ability.

Understanding Abilities

The Correspondence between Physical and Mental Abilities

Although the relative influence of biology and experience varies across tasks, ultimately all abilities are developed through experience and exercise. However, the development of abilities is difficult to see because our intuitive theories of intelligence constantly get in the way. These intuitive theories are difficult to change because we cannot directly observe thinking or its development. If we could, we would see that cognitive and physical skills develop in much the same way. Because of this, it is helpful to consider the development of observable physical skills. Indeed, from Galton (1869/1972) to Bartlett (1932) to Piaget (1952) to cognitive psychologists such as Anderson (1982), theories of cognitive skills have been built on theories of physical skills. Anderson is most explicit about this. His model for the acquisition of cognitive skills is taken directly from Fitts's (1964) model for the acquisition of physical skills.
The correspondence between physical and cognitive abilities is shown graphically in Figure 2. Tests of general fluid abilities are akin to measures of general physical fitness. Measures of crystallized achievements in mathematics or literature, for example, are like observed proficiencies in particular sports such as basketball or swimming. Physical fitness is an aptitude for learning different sports. Those individuals with high levels of fitness generally find it easier to learn physically demanding activities and to do better at these activities once they learn them. In like manner, reasoning abilities are aptitudes for learning cognitively challenging subjects. Those who reason well learn more quickly and perform at higher levels once they have learned. Skilled athletic performance requires both biological preparedness and extensive practice and training. This is also true of complex cognitive skills. However, physical fitness is also an outcome of participation in physically demanding activities. In like manner, students who learn how to prove theorems in a geometry class or evaluate source documents in a history class also learn how to reason in more sophisticated ways. Reasoning abilities are, thus, critical aptitudes for learning difficult material as well as important outcomes of such learning. Arguing that a good measure of reasoning ability should be independent of motivation, experience, education, or culture is like saying that a good measure of physical fitness should somehow be independent of every sport or physical activity in which the person has engaged. Such a measure is impossible. All abilities, physical and cognitive, are developed through exercise and experience.
There are no exceptions. Note that the analogy to physical skills provides an important role for biology. Speed, strength, and aerobic capacity are clearly rooted in inherited biological structures and processes. The analogy also suggests the importance of multiple test formats in the estimation of abilities. No test gives a pure estimate of ability. Tests that use the same format for all test items offer an advantage for students who (for whatever reason) do well on that format. This is particularly important for nonverbal reasoning tests because task specificity is generally much larger for figural tests than for verbal or quantitative tests (Lohman, 1996). Using a single item format is like estimating physical fitness from a series of running competitions rather than from a more varied set of physical activities.
The analogy to physical skills also can clarify why good measures of aptitude for specific academic domains such as mathematics or rhetoric must go beyond measures of figural reasoning ability. Success in ballet requires a somewhat different set of physical skills and propensities than success in swimming or basketball. A common set of running competitions would not be the best or fairest way to select athletes for advanced training in any of these domains, even if we could assume that all students had equal opportunities to practice running.

The Triad of Reasoning Abilities

There is now overwhelming evidence that human abilities are multidimensional, not unidimensional. This does not mean that, as Gardner (1983) would have it, g is unnecessary or unimportant (see Lubinski & Benbow, 1995). At the other extreme, it does not mean that g is the only thing that matters. Instead, it means that one must attend both to the overall level and to the pattern of those abilities that are most important for school learning. This is particularly important when attempting to identify gifted children.
The importance of going beyond g to measure a profile of reasoning abilities for all students (minority and majority) is shown clearly in the CogAT standardization data. Understanding why this is the case requires a brief review of how reasoning abilities are represented in hierarchical theories of human abilities. Carroll’s (1993) three-stratum theory posits a large array of specific, or stratum I, abilities (Carroll identified 69). These narrow abilities may be grouped into eight broad, or stratum II, abilities. Stratum II abilities in turn define a general (g) cognitive ability factor at the third level. Importantly, the broad abilities at Stratum II vary in their proximity to the g factor at stratum III. The closest is the broad fluid reasoning or Gf factor.
Carroll’s (1993) analyses of the fluid reasoning factor show that it in turn is defined by three reasoning abilities: (1) sequential reasoning: verbal, logical, or deductive reasoning; (2) quantitative reasoning: inductive or deductive reasoning with quantitative concepts; and (3) inductive reasoning: typically measured with figural tasks. These correspond roughly with the three CogAT batteries: verbal reasoning, quantitative reasoning, and figural/nonverbal reasoning. Each of these three reasoning abilities is estimated from two tests in grades K-2 and from three tests in grades 3-12. If given 90 minutes to test students' abilities, most psychologists would not administer a battery of nine different reasoning tests. Instead, they would try to represent a much broader slice of the stratum II or stratum III abilities in Carroll's model. Because of this, they would not have reliable measures of these three aspects of fluid reasoning ability (Gf), but only a composite reasoning factor. They would thus see evidence only for g or Gf and not for the distinguishably different abilities to reason with words (and the concepts they can signify), with numbers or symbols (and the concepts they can signify), and with stylized spatial figures (and the concepts they can signify).
The assertion that nonverbal, figural reasoning tests are fair proxies for verbal or quantitative reasoning tests rests on the mistaken assumption that, absent task-specific factors, all reasoning tests measure more or less the same thing.

Footnote 8: Since Thurstone (1938), test developers have constructed tests that measure different abilities by increasing the representation of tests (and items) that have lower loadings on g and higher loadings on group factors. CogAT test batteries were not constructed in this way. Each battery is designed to maximize the amount of abstract reasoning that is required and is separately scaled. Correlations among tests were computed only after the test was standardized.
Table 1
Percent of High-Scoring Students (Median Stanine = 8 or 9) Showing Different Profiles of Verbal, Quantitative, and Nonverbal Reasoning Abilities on the CogAT Form 6 Multilevel Battery

| Profile | White | Black | Hispanic | Asian | American Indian | Other or Missing | Total |
|---|---|---|---|---|---|---|---|
| All scores at the same level | | | | | | | |
| A | 42.0 | 28.5 | 31.8 | 30.1 | 38.5 | 37.1 | 40.4 |
| One score above or Below | | | | | | | |
| B (V+) | 2.6 | 1.7 | 2.1 | 1.2 | 1.0 | 1.3 | 2.4 |
| B (V-) | 9.1 | 11.9 | 14.1 | 15.1 | 11.4 | 7.7 | 9.7 |
| B (Q+) | 2.6 | 1.1 | 2.8 | 2.9 | 1.4 | 4.4 | 2.6 |
| B (Q-) | 6.1 | 8.6 | 5.1 | 4.4 | 8.2 | 4.9 | 6.1 |
| B (N+) | 2.2 | 2.1 | 2.0 | 2.2 | 0.0 | 3.2 | 2.2 |
| B (N-) | 6.3 | 13.5 | 4.9 | 6.7 | 8.3 | 7.1 | 6.5 |
| Total B | 28.9 | 39.0 | 30.9 | 32.6 | 30.3 | 28.7 | 29.5 |
| Extreme B profile | | | | | | | |
| E (V+) | 1.4 | 0.0 | 0.7 | 1.8 | 0.0 | 0.1 | 1.3 |
| E (V-) | 4.3 | 8.6 | 11.4 | 13.8 | 4.3 | 7.6 | 5.3 |
| E (Q+) | 1.5 | 1.4 | 1.5 | 1.3 | 2.0 | 1.8 | 1.5 |
| E (Q-) | 2.5 | 4.1 | 0.4 | 0.3 | 3.7 | 2.8 | 2.4 |
| E (N+) | 1.4 | 0.0 | 1.8 | 2.1 | 1.7 | 2.9 | 1.5 |
| E (N-) | 2.5 | 5.8 | 2.5 | 1.7 | 1.4 | 2.0 | 2.5 |
| Total EB | 13.6 | 20.0 | 18.3 | 21.0 | 13.1 | 17.3 | 14.4 |
| Two scores Contrast | | | | | | | |
| C (V+Q-) | 2.3 | 0.8 | 2.5 | 1.1 | 0.7 | 2.3 | 2.2 |
| C (V-Q+) | 2.0 | 1.6 | 2.6 | 2.0 | 5.5 | 2.5 | 2.1 |
| C (V+N-) | 2.2 | 0.8 | 2.2 | 0.3 | 4.3 | 1.2 | 2.1 |
| C (V-N+) | 2.1 | 1.0 | 3.3 | 2.8 | 0.7 | 1.0 | 2.1 |
| C (Q+N-) | 1.7 | 2.4 | 1.4 | 1.5 | 1.2 | 2.0 | 1.7 |
| C (Q-N+) | 2.1 | 0.8 | 0.7 | 1.2 | 2.7 | 4.1 | 2.0 |
| Total C | 12.3 | 7.5 | 12.8 | 8.9 | 15.0 | 13.2 | 12.1 |
| Extreme C profile | | | | | | | |
| E (V+Q-) | 0.7 | 0.0 | 0.3 | 0.3 | 0.5 | 0.0 | 0.6 |
| E (V-Q+) | 0.5 | 1.4 | 3.6 | 3.7 | 0.9 | 1.1 | 0.8 |
| E (V+N-) | 0.4 | 3.5 | 0.0 | 1.0 | 0.0 | 0.4 | 0.5 |
| E (V-N+) | 0.7 | 0.1 | 2.1 | 1.7 | 0.7 | 1.0 | 0.8 |
| E (Q+N-) | 0.5 | 0.0 | 0.0 | 0.7 | 0.0 | 1.2 | 0.5 |
| E (Q-N+) | 0.2 | 0.0 | 0.1 | 0.0 | 1.0 | 0.0 | 0.2 |
| Total EC | 3.1 | 5.1 | 6.2 | 7.4 | 3.1 | 3.7 | 3.5 |
| N | 9,361 | 176 | 317 | 550 | 195 | 70 | 11,031 |

Note: All columns total 100. V = Verbal; Q = Quantitative; N = Nonverbal; A = All three scores at approximately the same level; B = One score above or Below the other two scores; C = Two scores Contrast significantly; E = Scores differ by at least 24 points on the Standard Age Score (SAS) scale.
Table 1 shows why this assumption is untenable. The table shows the percentage of high-scoring students in the 2000 CogAT standardization sample who had different score profiles on the CogAT multilevel battery. The most recent edition of CogAT reports a profile score for each student that summarizes the level and pattern of his or her scores across the verbal, quantitative, and nonverbal reasoning batteries. Example profiles are 3A, 9B(V-), and 6C(V+Q-). The number is the student's median age stanine on the three batteries. Stanines range from 1 (lowest 4% of scores in the distribution) to 9 (highest 4% of scores in the distribution). The median stanine estimates the overall level of the profile.
The first letter tells whether all three scores were at the sAme level (an 'A' profile), whether one score was aBove or Below the other two scores (a 'B' profile), or whether two scores showed a significant Contrast (a 'C' profile). In the examples above, 3A means that the median age stanine was 3 and that the three scores did not differ significantly from one another. The second example, 9B(V-), means that the median age stanine was 9 and that the score on the Verbal Battery was significantly lower than the scores on the Quantitative and Nonverbal batteries. The last profile, 6C(V+Q-), shows a relative strength on the Verbal Battery and a relative weakness on the Quantitative Battery. Finally, in an effort to call attention to unusually large differences, profiles with scores that differ by more than 24 points on the SAS scale are all labelled E (for 'Extreme'). For example, 8E (N-) means that the median stanine was 8 and that the score on the Nonverbal Battery was at least 24 points lower than the score on one of the other two batteries.
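The profile codes described above can be summarized as a small scoring rule. The sketch below is a simplification of my own: it treats any 10-point SAS difference as significant and any difference of more than 24 points as extreme, whereas the operational CogAT rules also require non-overlapping confidence intervals (see Footnote 9 below); the function name and the example scores are hypothetical.

```python
from statistics import median

def classify_profile(sas, stanines):
    """Approximate CogAT-style profile label (e.g., '8E (N-)') from V/Q/N scores.

    sas: dict of Standard Age Scores (mean 100, SD 16) keyed by 'V', 'Q', 'N'.
    stanines: dict of the corresponding age stanines (1-9).
    Simplified rules: >= 10 SAS points = significant, > 24 points = extreme.
    """
    level = int(median(stanines.values()))               # overall level = median stanine
    lo_key, mid_key, hi_key = sorted(sas, key=sas.get)   # battery codes, lowest to highest
    lo, mid, hi = sas[lo_key], sas[mid_key], sas[hi_key]

    if hi - lo < 10:
        pattern = "A"                                    # all scores at the same level
    elif hi - mid >= 10 and mid - lo < 10:
        pattern = f"B ({hi_key}+)"                       # one score above the other two
    elif mid - lo >= 10 and hi - mid < 10:
        pattern = f"B ({lo_key}-)"                       # one score below the other two
    else:
        pattern = f"C ({hi_key}+{lo_key}-)"              # two scores contrast

    if hi - lo > 24 and pattern != "A":
        pattern = "E" + pattern[1:]                      # extreme version of B or C
    return f"{level}{pattern}"

# Hypothetical student: strong verbal and quantitative scores, weak nonverbal score.
print(classify_profile({"V": 128, "Q": 124, "N": 98}, {"V": 9, "Q": 8, "N": 5}))  # 8E (N-)
```

Applied across a standardization file, a rule of this kind would generate profile frequencies like those tabulated in Table 1.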
Given the interest here in identifying gifted students, only scores for the 11,031 students who had a median stanine of 8 or 9 were included in the data summarized in the table. This represents the top 10-11% of students in the national sample. If all three of these highly reliable reasoning scores measure approximately the same thing, then the majority of students, especially White students, should have approximately equal scores on the Verbal, Quantitative, and Nonverbal batteries. Here, this would be represented by an 'A' profile. On the contrary, only 42% of high-scoring White students showed this profile. Stated the other way, the majority of high-scoring White students showed significantly uneven profiles of reasoning abilities. Of this majority, 28.9% showed a significant, but not extreme, strength or weakness in one area (see the 'Total B' row). Another 13.6% showed an extreme strength or weakness (see the 'Total EB' row). A relative weakness was much more common than a relative strength. Finally, 15.4% showed a significant (12.3%) or extreme (3.1%) contrast between two scores (see the 'Total C' and 'Total EC' rows). Clearly, one size does not fit all. Giftedness in reasoning abilities is multidimensional, not unidimensional (see Achter, Lubinski, & Benbow, 1996, for a similar conclusion).

Footnote 9: To be significant, the difference must be at least 10 points on the Standard Age Score scale (mean = 100, SD = 16), and the confidence intervals for the two scores must not overlap. The confidence intervals will be wide if students respond inconsistently to items or subtests in the battery.

Footnote 10: Standard Age Scores (SAS) have a mean of 100 and a standard deviation of 16.

Footnote: For frequencies of different score profiles in the full population, see page 86 in Lohman and Hagen (2001c). For frequencies of score profiles for stanine 9 students, see page 125 in Lohman and Hagen (2002).

Footnote: For the multilevel battery, the KR-20 reliabilities average .95, .94, and .95 for the Verbal, Quantitative, and Nonverbal batteries, respectively (Lohman & Hagen, 2002).

Footnote: This statistical fact of life is commonly overlooked by those who would insist on high scores in all three content domains on CogAT to qualify for inclusion in G&T programs. It is why the CogAT authors recommend that schools not use the Composite score for this purpose.
The profiles for minority students are even more interesting. If tests with verbal and quantitative content are inherently biased against minority students, then there should be very few students with an even or ‘A’ profile. Most should show an N+ profile (i.e., a much higher score on the nonverbal battery than on the verbal and quantitative batteries). On the contrary, approximately 30 percent of the Black, Hispanic, and Asian students also showed an even profile across the three batteries. Neither N+ nor E (N+) profiles occurred with greater frequency for these students than for White students. As expected, V- profiles were more common for minority students. Note, however, that 13.4% of the White students also showed either a significant (9.1%) or extreme (4.3%) V- profile. Further, Blacks were much more likely than other ethnic groups to show a significantly lower score on the nonverbal battery (an N- profile) than on either the verbal or quantitative batteries. Fully 19.3 percent showed a significant (13.5%) or extreme (5.8%) relative weakness on the Nonverbal Battery.
This means that screening students with a nonverbal reasoning test will actually eliminate many of the most academically capable Black students in the sample. Indeed, the only extreme profile that was more common for Black students was a verbal strength coupled with a nonverbal weakness, E(V+N-). For Hispanic and Asian-American students, the most common extreme contrast profile was a verbal weakness coupled with a quantitative strength, E(V-Q+). Once again, this argues for the importance of estimating the quantitative reasoning abilities of minority students.

Spatial Strengths as Inaptitude for Academic Learning?
Although figural reasoning ability is not the same as spatial ability, the two constructs fall in the same branch of a hierarchical model of abilities (Gustafsson & Undheim, 1996) or in the same slice of the radex model (Marshalek, Lohman, & Snow, 1983). In both of these models, figural reasoning abilities are closer to g. Spatial abilities, although still highly g-loaded, fall further down in a multilevel hierarchical model or somewhat further from the center of the radex. The key difference is that figural reasoning tests require examinees to make inferences, deductions, and extrapolations from figural stimuli, whereas spatial tests require the ability to create images that preserve configural information in the stimulus, often while performing analog transformations of those images. Many figural tests, of course, sample both reasoning and spatial processing, depending on how items are constructed and how examinees choose to solve them.
These distinctions become important in trying to understand one of the most unexpected findings in our analyses of the CogAT standardization data. At all levels of achievement and from grade 3 through grade 12, students who showed a relative strength on the CogAT Nonverbal Battery showed lower achievement in some areas than students who had the same levels of verbal and quantitative abilities but a relative weakness on the Nonverbal Battery. In other words, a relative strength in nonverbal reasoning seems to be an inaptitude for some aspects of school learning, particularly the sorts of basic skills students must learn in elementary school. The effect was particularly strong for verbal achievement in domains such as Spelling and Language Usage, at the elementary school level, and for students who scored in the average range (Lohman & Hagen, 2001c, p. 102). But the effect was clearly present among the most able students as well (p. 105) and in other achievement domains (e.g., Vocabulary, Reading Comprehension, Math Computation, and the Composite Achievement score).
The only subtest of the Iowa Tests of Basic Skills (ITBS) on which students with an N+ profile consistently outperformed those with an N- profile was on the Maps and Diagrams test. There are several reasons why this could be the case. One possibility is that students with an N+ profile perform especially well on figural reasoning tests because they have unusually strong spatial abilities. Such students may well find themselves mismatched in an educational system that requires mostly linear and verbal modes of thinking rather than their preferred spatial modes of thinking (Lohman, 1994). Another possibility is that achievement tests generally do not measure spatial modes of thinking. Grades or other measures of accomplishment in literature, science, or mathematics may not show such effects. However, Gohm, Humphreys, and Yao (1998) found that students gifted in spatial ability underperformed on a wide range of school interest and achievement measures that included both tests and grades. Although one could envision an alternative educational system in which this might not be the case, underperformance cannot be attributed to the verbal bias of the achievement tests. A third possibility is that a high nonverbal score reflects a strength in fluid reasoning ability rather than in spatial ability. Students who excel in fluid (as opposed to crystallized) abilities are particularly adept at solving unfamiliar problems rather than the more familiar sort of problems routinely encountered in school. However, if this were the case, deficits in mathematics should be as common as deficits in the more verbal domains. High spatial abilities, on the other hand, are commonly linked to problems in verbal fluency, spelling, and grammar (Shepard, 1978). Thus, the effect seems more plausibly linked to a preference for spatial thinking rather than to a relative strength in fluid reasoning.
Finally, one might hypothesize that the effect in undifferentiated samples reflects the performance of minority students. Such students would be particularly likely to underachieve on verbal achievement tests that emphasize specific language skills such as spelling, grammar, and usage. However, our analyses show that these effects are even stronger for minority students than for White students. This means that selecting students on the basis of their nonverbal reasoning abilities without also attending to their verbal and quantitative reasoning abilities will select some students who are even less likely than students with much lower nonverbal reasoning scores to achieve at high levels. Notice that an isolated strength in nonverbal reasoning is not the same thing as strengths in both quantitative and nonverbal reasoning or in both verbal and nonverbal reasoning or in all three.
Students with these score profiles do not show the deficits observed in the N+ group. This concurs with the finding of Humphreys, Lubinski, and Yao (1993) that engineers were more likely to excel on both spatial and mathematical abilities. However, unless these other reasoning abilities are measured, one has no way of knowing whether a particular student with a high nonverbal score is one of those who are even less likely than other students to achieve well.

Reconceptualizing Potential as Aptitude

The primary purpose of schooling is to assist students in developing expertise in particular domains of knowledge and skill that are valued by a culture. The primary purpose of programs for the gifted and talented ought to be to provide appropriate levels of challenging instruction for those students who have demonstrated high levels of accomplishment in one or more of these domains.
This can be done through acceleration or advanced placement, for example. The secondary purpose of such programs ought to be to provide enrichment or intensive instruction for those who show potential for high levels of accomplishment. These students commonly need different levels of challenge than those who have already demonstrated high levels of competence in a domain. Measuring accomplishment is difficult. Measuring potential for accomplishment is even more difficult; more troubling, it is fraught with misconceptions and pitfalls. For example, some misconstrue potential as latent or suppressed competence waiting to burst forth when conditions that prevent its expression are removed (Humphreys, 1973). Such misconceptions have prompted others to reject potential as a pie-in-the-sky concept that refers to the level of expertise an individual might develop if she were reared in some mythically perfect environment. A more moderate position is to understand potential as readiness to acquire proficiency in some context, that is, as aptitude.

Footnote: In these analyses, we predicted achievement from a composite score that averaged across the three CogAT batteries (which was entered first into the regression) and then from the scores of the three CogAT batteries (which were entered simultaneously in the second block). The nonverbal score typically had a negative regression weight, which was larger for minority students than for White students.