The Hood Children's Literacy Project

Inside the 1999 MCAS: A Close Reading of the Fourth Grade Language Arts Test for Massachusetts

By William T. Stokes and Katherine E. Stokes

Introduction

Question #23 of the 1999 MCAS English Language Arts test for fourth graders poses this problem:

23. According to this passage, Ben Franklin would have changed the spelling of 'knife' to:
A. niff,
B. nif,
C. knif,
D. nife.

This is the fifth question in a section that follows a reading comprehension passage entitled "Doing Away with the King's English." What the test takers learn, in the final paragraph, is that "Franklin wanted to drop all silent letters from words....Had Franklin written the dictionary instead of Webster, he would spell give, giv and wrong, rong.... [and tongue as] tong!" This is all the students have to go on. So, what could we conclude about how Ben Franklin would have preferred to spell knife?

Since the results of the second year of MCAS testing were released in December, there has been widespread media coverage, commentary, and critique. While students, teachers, parents, and the general public have a better understanding of what the testing entails than they did a year ago, there still is confusion about what these tests are testing. In this essay, we will focus our attention on the fourth grade English Language Arts (ELA) test. It is our hope that a careful examination of what is being asked of our students will contribute to the public policy discussions concerning the entire MCAS program. We'll return to the question of knife shortly; first, a few general observations regarding the 1999 test results.

In terms of statewide averages, it is reasonable to conclude that the results for 1999 are marginally better than those for 1998. The average scaled score for the fourth grade ELA test was 230 in '98 and 231 in '99. Of the 78,841 students who took the fourth grade test, statewide, 79% scored in the needs improvement or failing categories. Since the aim of education reform and the MCAS testing program is to bring all children to a level of proficiency as defined by the scoring system (i.e., minimum of 240), and since only eight communities reached even an average of 240 (the highest being 242), either the goal is very far off or the test system is deeply flawed. The nine largest communities, which provide schooling for more than 15,000 fourth graders, about 20% of the state total, had an average score of 225 (ranging from Lawrence at 222 to Worcester at 229; Boston reached 224). On a more local level, some towns and cities showed substantial gains, e.g., Arlington from 235 in '98 to 238 in '99 or Saugus from 230 to 234; others declined, e.g., Dover from 240 to 238 or Lynnfield from 239 to 234.(1) All these statistics, however, are of limited value unless we know what the tests are testing. Are the tests poorly designed or too difficult? Are the schools not teaching what the tests demand? We can only approach these questions by first looking inside the tests themselves. Let's return to question #23 introduced in the first paragraph.

How would Ben Franklin prefer that Americans spell knife? We can eliminate option C, since it clearly retains the silent "k." We can eliminate option A also, because a doubled consonant like "ff" usually follows a short vowel, so "niff" would presumably rhyme with stiff. We cannot tell from the reading passage whether Ben Franklin would also object to doubled consonants that represent a single sound, effectively rendering one of them silent. In either case, we can eliminate option A.

We are left with options B and D. In the text we have the example of give, but not an example for five, hive or dive, or in fact of any word that has a long vowel. All of us are aware that there has been great concern for teaching phonics in recent years. In fact, it is often presented as a matter of teaching "direct, systematic, intensive phonics." Moreover, many of the proponents of intensive phonics are also dedicated proponents of the MCAS. Under these circumstances, we might have hoped that a question such as #23 would have been better constructed. Given that knife contains a long vowel, our students will have been taught ("intensively") about the "silent e" rule that marks a long vowel. Therefore, the obvious best choice is D! The silent "k" is omitted, but the important long vowel marker is retained -- otherwise giv, div, fiv and hiv would all presumably rhyme, or the language would acquire a vast number of new homographs (e.g., omitting the silent "e" would reduce cape to cap, hide to hid and cute to cut). The correct answer, however, is B -- nif, according to the MCAS.

Statewide, 34% of fourth grade students chose the "correct" answer, B. How many children chose D? We do not have access to a statewide item analysis, but in a sample of just over one hundred students in two schools, 42% chose answer D. In our view that is the correct answer, because students are explicitly taught to distinguish between "silent" letters such as the "k" in knife and "markers" such as the "e" in ate or the "u" in guide (makes the "g" hard) or even the tense marker in walked ("ed" sounds as "t").

The "Release of Spring 1999 Test Items" (http://www.doe.mass.edu/mcas) indicates that test item #23 is related to "Learning Standard 13" which reads: "Students will identify, analyze, and apply knowledge of the structure, elements, and meaning of nonfiction or informational material and provide evidence from the text to support their understanding." What does this mean? And, how does this standard lead to the crafting of a question that invites a speculative inference about Franklin's ideas about spelling based on four sentences containing three bits of data?

This one question illustrates a pervasive problem in testing of the kind that MCAS represents. It is too often trivial and arbitrary. There is little in the passage to justify the state's preferred answer, and there is a good deal to cause us to doubt the scholarship that presumably underlies the question. Do any of us know what Benjamin Franklin's position was on marking long and short vowels?

Earlier in the passage, it is stated that Noah Webster "added new, American words to our language, one of those new words was barbecue." [Emphasis added.] Are we to believe that he personally invented such words? A little checking reveals that barbecue is a borrowing from Spanish and from Arawak (a Native American language of the Antilles) and that it appeared in English 100 years before Noah Webster was born. This passage rests upon shoddy scholarship and leads to questions that are utterly trivial and arbitrary. Is this what we wish our ten-year-olds to study; is this the test for which the Massachusetts Department of Education wants teachers to 'teach to the test'? How does one prepare for invented "facts"?

In this essay, we will offer further analysis of specific items from the 1999 test, but first we should address a few issues about changes in the design of the test and its scoring compared with 1998. At the conclusion, we'll offer an analysis of the relationship between scores and socioeconomic factors.

Standard Scores, Raw Scores and Performance Levels

The raw score represents the number of points a test taker earns on the items appearing on the test. The fourth grade ELA test for 1999 contained forty items, which included thirty-five multiple choice items (one point each), four open responses (four points each), and one composition (twenty points), for a possible total of seventy-one points. Last year's test had a slightly different mix: fewer multiple choice items and one additional open response, for a total of sixty-eight points. Table 1 below shows the relationship between raw scores for both years and the standard scores and performance levels.
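The point totals described above can be checked with a little arithmetic. What follows is a minimal sketch in Python; the variable names are ours, chosen for illustration, and the item counts and point values are those stated in the preceding paragraph.

```python
# Item counts and point values for the 1999 fourth grade ELA test,
# as described in the text. The dictionary keys are illustrative names.
items_1999 = {
    "multiple_choice": (35, 1),   # (number of items, points each)
    "open_response": (4, 4),
    "composition": (1, 20),
}

total_items = sum(count for count, _ in items_1999.values())
total_points = sum(count * points for count, points in items_1999.values())

# Points that come from written work (open responses plus the composition)
mc_count, mc_points = items_1999["multiple_choice"]
written_points = total_points - mc_count * mc_points

print(total_items)                               # 40
print(total_points)                              # 71
print(written_points)                            # 36
print(round(written_points / total_points, 2))   # 0.51
```

The last figure shows that roughly half of the available points depend on written answers rather than multiple choice.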

Table 1: Performance Levels and the Conversion of Raw Scores to Standard Scores

As the table indicates, the conversion of raw scores to standard scores (also called "cut scores") was such that while there were three additional possible points on the 1999 test, students needed four more points to make the cut to the next performance level. Parents and the public at large should know that the conversion between raw scores and standard scores is decided; it is not somehow required by the test. It is decided each year by the test makers and their consultants (and involved policy makers). A different decision could have been made. The fourth grade ELA test for 1998 was widely criticized for being developmentally inappropriate - it was simply too hard for fourth graders. One option test makers had was to make this year's test somewhat less difficult (i.e., more developmentally appropriate), or to adjust the conversion scores so that, for example, a score of forty-five might have been judged to be proficient. The latter decision was not taken; thus we will have to judge whether the test for 1999 is any more age appropriate.

For those who are new to Massachusetts or who are just focusing on these matters for the first time, it may be of value to point out that aside from the MCAS, there are many well-established standardized tests that are used throughout the nation to assess children's progress. Among these is the Iowa Reading Test.(2) In the 1999 report of performance by third grade students in Massachusetts, it was reported that 31% of children were found to be "advanced" and 37% "proficient." Compare this with the MCAS results: 0% "advanced" and 21% "proficient." This contrast is so extreme that it should prompt a "what-is-wrong-with-this-picture" response. Are the children so different from year to year? Or, are the standards applied fundamentally different?(3)

It would require more space than is available here to thoroughly analyze these two tests. Our intention now is to suggest that at least some of the concerns that parents, teachers, and citizens across the Commonwealth have about the MCAS results should be directed toward the design of the MCAS itself. Another test, the National Assessment of Educational Progress, taken by a sampling of students in every state, yielded the result that Massachusetts fourth graders scored third in the nation, behind Connecticut and statistically tied with Maine.(4)

These discrepancies should encourage us to ask hard questions of those who have designed the MCAS and set the performance standards. It is one matter to establish public policy around the assertion that we should "do better." It is another matter to effectively declare that the sky is falling and that all efforts at advancing educational practices should be distorted to "teaching to the test" - a deeply flawed test, at that.

A Close Reading of the "King's English" Session

We've already introduced some concerns about the passage entitled "Doing Away with the King's English" and the questions that followed. Let's continue that examination a bit further.(5)

The author of the passage is identified as Susan Lurie and a subtitle appears below her name: "Noah Webster wasn't satisfied when the British went home -- he wanted to get rid of their language, too." Anyone familiar with the true linguistic tensions during the late eighteenth and early nineteenth centuries will realize that this subtitle is both overstated and misdirected.(6) A cartoon appears to the left of the text which presumably depicts a very young Noah Webster shouting to three (war?) ships on the horizon: "Back to England! Now we can have our own language!" What do our ten-year-olds make of this? If the MCAS were not such serious business, there might be some humor in this portrayal. But, taken seriously, the distortion of history and the jingoistic tone are disturbing.

The passage then opens with the claim that "More than 265 million people speak English." It is such an odd statement. Why 265 million? The well-known series aired on public television a few years ago, entitled "The Story of English," estimated the figure to be between 750 million and one billion. So, why the figure 265 million? That number happens to be close to the population of the United States. Did the author confuse the two figures?

This is not splitting hairs. We are speaking about a state-sponsored test. From the test takers' point of view, any statement or any "fact" may be queried in the following questions -- it would be important to pay attention and not to engage in any of the doubts that we've just expressed. As indicated above, while the learning standards are sometimes difficult to interpret, the MCAS test becomes the concrete, "official" interpretation.

The "King's English" passage is a little over 500 words in length, ten paragraphs, forty sentences. It includes words such as: Mandarin, intellectually, graduated, and colonizers. We wonder whether the test designers consulted any guides regarding what might be reasonable vocabulary for a fourth grade reading comprehension passage. By our calculation, based on the Fry Readability Formula, this passage is a ninth grade level passage and it is developmentally inappropriate.

Let's look at the other questions in this section. Question #24: "Which words in the third paragraph are used as verbs? A. although, don't, special; B. same, spelling, character; C. particular, pronouncing, spell; D. speak, has, owe." Clearly, one does not need to read the passage to answer this question, if one knows which words are verbs and which are not. The question is said to address "Learning Standard 5: Students will identify, describe, and apply knowledge of the English language and standard English conventions for sentence structure, usage, punctuation, capitalization, and spelling." If this is the point, i.e., to correctly identify verbs, why is such a challenging (ninth grade) passage used, when the content of the passage is quite irrelevant?

Similarly, question #26 asks why Webster and Franklin are capitalized (this also addresses Learning Standard 5). The answer is, of course, that they are proper nouns. Again, it is not necessary to read or comprehend the passage to answer this question. Question #22 asks which genre this passage represents: answer -- nonfiction, but given the fictive nature of the "facts" presented, we might question that designation.

Question #20 asks about the cartoon accompanying the passage. Who are the people in the ships? The answer is clear to an adult, and it was to more than 70% of the test takers. The question hinges on the subtitle, which is itself objectionable, as we commented above.

Question #19 asks "Why is English spoken in so many countries." The answer is given in the second sentence of the second paragraph. But, the foils (the wrong answers) included among the choices, are so clearly false that again it would not be necessary to read the passage -- 80% of children statewide answered this correctly.

Question #21 asks why the "writer" puts barbecue, colour, and give in italics in the last three paragraphs. This question hinges on the fourth graders knowing the meaning of the word italics. One of the words is spelled incorrectly in American English -- that could lead a child to option B, "spelled wrong." The passage is about American English and two of the words look "foreign" -- that could lead a child to option D, "foreign words." Both are incorrect, but could be correct uses of italics in other contexts. The right answer is A. -- "examples of what he's discussing." (Let's leave aside the fact that the writer in question is clearly identified as Susan Lurie.) She is discussing words and their spellings. She mentions colour to point out the British spelling, not what the word means or refers to. If the author had been talking about animals and claimed that the word cat has three letters, it is reasonable to italicize the word; if she claimed that a cat is an example of a common house pet, the word should not be italicized even though it was an example of what she's discussing. This is another poorly constructed question that does little to evaluate reading comprehension, or even the conventions for the use of specialized fonts. And, is this a reasonable problem to set before fourth graders? We don't believe so.

We do not mean to belabor these points, but shouldn't we expect the test design to be excellent? Given the high stakes involved for school systems across the state, and especially for those high school students whose graduation may depend on the high school test, shouldn't we critique all the shortcomings of these instruments?

Question #26, surprisingly, has nothing to do with the reading passage: "A character in a story who says 'y'all come in and sit a spell' is probably from: A. New England. B. California. C. the South. D. Iowa." Did many of the children go back to the passage to see if they had missed this? The test design up to this point would have suggested that they should find the answer in relation to the passage. Then there is the matter of mobility in this society. Neighbors in Cambridge, Massachusetts might well use such phrases. Statewide 49% of students answered correctly, choosing C. What should we conclude about the rest? How would we wish teachers and students to prepare for such questions in the future? Try to imagine a teacher "teaching to the test" by drilling students on stereotypic forms of regional speech, while reminding them, in the interest of equity, that not all people living in those regions speak that way.

The final item in this session is an Open Response: Question #27 -- "Ben Franklin wanted to drop all silent letters from words. Give one reason why this may have been a good idea and one reason why it may have been a bad idea." This question too is said to address Learning Standard 13. As we've tried to indicate, the evidence does not lie in the text for the student to answer this question; the analysis must be based on other learning, but that same learning is undermined in question #23.

We would be very interested in examining children's reasoning in this instance. If teachers had access to children's actual responses, then they might be better able to help them prepare for the next round of testing. They might even fulfill the state mandate that they "teach to the test." Unfortunately, the state does not share this data. Statewide the average score on the item was 1.85 out of 4.0. On a proportional basis, students needed to earn nearly 72% of possible points to be judged proficient. Scoring less than half on this question did not help that cause. We'll ask again: is this the shortcoming of our students, or is this evidence of a deeply flawed assessment? How would we score the test designers' efforts, if we were given the opportunity? We would certainly score it no better than "needs improvement."

We have focused our analysis on questions #19 - 27, and while we might offer similar commentary on any of the forty questions, we expect that readers have understood our concerns, and available space limits further item-by-item analysis. We would like to turn to the socioeconomic factors that are too seldom acknowledged.

Socioeconomic Status and Test Performance

In July of 1999, a report was issued by the Massachusetts Department of Education concerning the performance of 1250 elementary schools on the Iowa Reading Test. "In 19 schools, third-grade students scored higher than average on the test, even though they came from poorer backgrounds than the average third-grader in the state," The Boston Globe reported (7/21/99). And, Commissioner David Driscoll was quoted as saying, "These are schools that you wouldn't expect to perform well, but even despite poverty levels, they scored better than state average, and we wanted to highlight them." While these schools do deserve recognition, it is also distressing that only nineteen schools with more than the average number of low-income students performed this well. More telling still is the commissioner's remark that "you wouldn't expect [them] to perform well." Poorer students, on average, score lower, and schools with more low-income students have lower average scores. Is this true of the MCAS as well?

We decided to examine the performance of students in 105 communities located in four counties in the Boston metropolitan area: Suffolk, Norfolk, Essex and Middlesex -- roughly those lying within 30 miles of Boston. In this region there is a range of high-income suburban areas and lower-income, more-diverse urban areas. Some have median family incomes of greater than $100,000 while others are lower than $30,000. The best available figures are from the last census in 1990. While we wish that we had the data from the forthcoming 2000 census, we will explicitly adopt the assumption that, while income values will have increased, the rank order of communities in terms of median income will not have changed greatly since the last census -- we recognize that some minor re-orderings are also likely.

That said, we calculate that the correlation between median family income and average test performance by city or town is r = .77. This is a very high value. It means that relative wealth or poverty powerfully predicts performance on the tests. Correlations do not establish causal relationships, but this kind of calculation can draw attention to the fact that students' performance on achievement tests is related to socioeconomic characteristics of their parents and communities. Table 2 below provides a glimpse of the relationship between median family income and the average scores on the Fourth Grade English Language Arts Test. We divided the sample communities into five groups (quintiles) based on rank order of median family income. We observed that as median income increases so do scores. The middle three quintiles differ by relatively smaller amounts and the corresponding scores also differ by relatively small amounts. The lowest and highest quintiles show greater differences and so do the corresponding scores.
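For readers who wish to see how a correlation and a quintile breakdown of this kind are computed, here is a minimal sketch in Python. It is not the calculation behind the figures reported here: the ten income and score values are hypothetical, invented only to make the example run, and the function names are ours.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def quintile_means(incomes, scores):
    """Rank communities by income, split them into five groups,
    and return the average score of each group (lowest income first)."""
    ranked = sorted(zip(incomes, scores))
    n = len(ranked)
    means = []
    for i in range(5):
        group = ranked[i * n // 5:(i + 1) * n // 5]
        means.append(sum(score for _, score in group) / len(group))
    return means

# HYPOTHETICAL data for ten made-up communities (not the study's figures):
# median family income in dollars and average scaled score.
incomes = [28000, 35000, 41000, 47000, 52000, 58000, 63000, 71000, 85000, 102000]
scores = [223, 226, 229, 228, 231, 232, 231, 235, 238, 241]

r = pearson_r(incomes, scores)
print(round(r, 2))
print(quintile_means(incomes, scores))  # [224.5, 228.5, 231.5, 233.0, 239.5]
```

With real data one would substitute the 105 communities' census incomes and average scores for the two invented lists; the grouping step assumes the number of communities divides reasonably evenly into five.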

Table 2: MCAS Scores for the Fourth Grade ELA Test in Relation to the Range of Median Family Income for 105 Communities in Eastern Massachusetts

Income figures based on the 1990 census by municipality. Communities chosen include all cities and towns in four counties that reported MCAS scores for fourth grade: Suffolk, Norfolk, Essex and Middlesex. See Note 7.

Conclusion

This brief essay has attempted to draw attention to three distinct kinds of problems with the MCAS, specifically the fourth grade English Language Arts test.

We have shown that there is reason to be concerned about the quality and accuracy of the test items themselves and the age appropriateness of the reading passages. Let us illustrate the problem with one further example: Teachers reported to us that the first reading selection of Session 3 was even more difficult than the one analyzed above. One of the most difficult questions in Session 3 asked "What is the MAIN theme of the poem 'It's Up to People'?" Statewide 31% of children answered correctly: "C. Only human beings can prevent trees from being cut down." In the sample available to us, 38% chose: "D. We will lose animals as well as trees if we do not save trees." While the poem says twice that "it's up to people to save all the trees," the choice of "C" requires an inference. The poem does not mention trees being cut down. The concern in the poem could just as easily have been that it was up to humans to prevent forest fires, or to prevent disease, or to limit pollution. In fact, the long middle stanza of the poem presents couplets that pair animals with trees, as if to suggest that they have an interest in the survival of the trees too, but they don't have the power of humans. Still there is no mention of cutting trees down. Therefore, the test taker is left with choosing between two inferences as the best representation of the "MAIN idea." Should we be faulting ten-year-olds for choosing the inference that includes animals which are mentioned in nine of the poem's eighteen lines? Main idea and main theme questions are notoriously subjective. Here is another instance of arbitrary and age inappropriate requirements. No wonder ELA scores are lower than math and science. There, at least, test designers can check their own work.

Secondly, we have raised questions about the setting of performance levels in relation to raw score test performance. In a town such as Arlington, where the average raw score was 50 and the average standard score was 238, 43% of children scored "proficient" or "advanced." In Wellesley, the average raw score was 53, the average standard score was 242, and 57% of children scored "proficient" or "advanced." Note that for Arlington to set the goal of raising raw scores from 50 to 53 would seem to be a realistic goal and would have enormous consequences for "performance levels." For a city such as Boston, how one sets attainable goals is very different. 94% of Boston fourth graders scored "needs improvement" or "failing" on the ELA test. The average standard score was 224 and the average raw score was 34. If Arlington improves raw score performance by 6% (50 to 53), its change will be well noted by dramatic changes in performance levels. If Boston improves by 6% (34 to 36), there will be virtually no change in the overall performance levels. To have even one quarter of the students reach the "proficient" or "advanced" levels, Boston would need to experience a 25% increase in average raw scores. This paragraph is dense with statistics, and for that perhaps we should apologize, but we believe it is important to show that this mode of evaluation rewards those who are near the top to try to perform even better, while it discourages those who are near the bottom from ever expecting that their efforts will even be noticed.
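The percentages in the preceding paragraph are simple relative changes in average raw score. A brief sketch in Python, using only the raw scores quoted above (the function name is ours, for illustration), makes the arithmetic explicit:

```python
def percent_change(old, new):
    """Relative change from old to new, expressed as a percentage."""
    return (new - old) / old * 100

# Arlington: average raw score rising from 50 to 53
print(round(percent_change(50, 53), 1))   # 6.0

# Boston: average raw score rising from 34 to 36
print(round(percent_change(34, 36), 1))   # 5.9

# A 25% increase on Boston's average raw score of 34
print(34 * 1.25)                          # 42.5
```

The same proportional gain, about 6%, moves Arlington's average from 50 to 53 but moves Boston's only from 34 to 36, while a 25% gain would bring Boston's average raw score to about 42.5.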

Our final point is that these tests may be more accurately assessing the wealth of students' families and communities, than they are accurately assessing individual competencies of students working to become literate citizens prepared to engage in civic responsibilities. Another way to pose this observation is that the test provides advantages to members of certain discourse groups over others. Children in wealthier communities have access to many opportunities that children in low-income communities do not.(8)

While education reform efforts are directed toward specific revisions of curriculum content, professional development of teachers, and minor changes in funding levels (relative to class-based opportunities), there is no evidence that the MCAS and all the other well-funded testing and assessment programs represent a genuine state commitment to equal educational opportunity. In the present political climate, it does seem that certain factions are working harder to show the challenges facing the public schools than they are working to address those challenges. The deck seems stacked. What is the intent in designing a program in which communities that are among the lowest income in the state have no realistic prospect of having their efforts recognized? Will the state remain committed to education reform, or is this all an effort to end public education in the very state that saw its inauguration over 160 years ago?

End Notes

1. The correlation between average MCAS test scores for these communities in 1998 and 1999 is r = .92. Some towns changed their ranking, but little changed overall.

2. The Iowa Test of Basic Skills is published by Riverside Publishing, a division of Houghton-Mifflin. This test was used statewide in Massachusetts for three years (1996-1998) and has been widely used across the country since its development more than two generations ago. It has been validated and tested for reliability, and it competes in the "marketplace" with tests from other publishers in the effort to meet the requirements of state curriculum standards. Massachusetts employed the Vocabulary, Reading Comprehension and Spelling portions of the test battery and other sections, including writing, were and are available.

3. Half the points earned on the MCAS are based on written answers rather than multiple choice; this is one of its strengths. Statewide, students scored 64% of possible raw score points on the multiple choice items and 54% on the written items -- an appreciable difference, but not enough to account for the difference in results between the MCAS and the Iowa Tests.

4. Reported in The Boston Globe on March 5, 1999.

5. The fourth grade ELA test began with the composition. Children had to complete that assignment in two sittings on the same day. Three weeks later, the reading and language sessions began. Session 1 began with a reading passage followed by eight multiple choice (MC) questions and one open response (OR). In the same session, there followed another three MC, then another reading passage, four more MC and another OR. Session 2 introduced the "King's English" passage. It included eight MC and an OR. Session 3 included two reading selections, 12 MC, and one OR. The language arts portion of the MCAS was administered over 5 days.

6. There is also a significant subtext, in this passage, filled with anxieties about an American language. Today, when so many school children are English language learners and so many speak varieties of English that differ significantly from so-called "standard" English, there is tremendous concern about language. Perhaps, this choice of a passage about Noah Webster's desire for an American English reveals something of the test designers' deeper anxieties. Judging by the better performance two years in a row, in terms of standard scores, on the mathematics and science and technology tests, it would seem that the standard has been set especially high for the language test.

7. Because SAT tests are probably more familiar to many readers, the table below may help to make the point that income predicts test scores.

1999 COLLEGE BOUND SENIORS TEST SCORES: SAT
Total Test-Takers: 1,220,130

Source: College Board, College-Board Seniors Nat'l. Report, 1999, as reported by Fair Test, http://www.fairtest.org.

Individual SAT scores certainly reflect myriad individual variables including achievement in school, motivation, study habits, test preparation, and even factors such as reading speed and risk taking. The College Board also reports consistent differences by group characteristics, including gender, race and ethnicity. The single most powerful group variable, however, is wealth. As Table 2 indicates, for each increase in family income, there is also a substantial increase in average SAT scores for that group. Critics have long suggested that we should ask whether tests of this kind actually measure genuine individual differences or group differences. If we conclude they do both, then we should ask which one is measured more accurately. Incidentally, SATs are designed to predict the grades of college freshmen; the correlation between SAT scores and freshman GPA is approximately r = .60.

8. Consider this thought-experiment. Wellesley and West Springfield happened to have exactly the same number of fourth graders take the ELA test. Wellesley as reported above had an average raw score of 53 and average standard score of 242; West Springfield had an average raw score of 42 and average standard score of 230. In the thought-experiment, let's transfer all the teachers and all the administrators and all the curriculum materials between the two communities, but leave all the other features of the communities unchanged. Now, predict the outcome. We suggest that there is far more involved here than personnel and materials.

William T. Stokes, Ed.D. is a Professor at Lesley University, and director of the Hood Children's Literacy Project. For the past twenty-five years he has focused on children's language and literacy development. He can be reached at wstokes@mail.lesley.edu.

Katherine E. Stokes is a junior at Connecticut College majoring in sociology and ethnophotojournalism.

updated 02/17/05 | 03:47 PM