

|  | Op Ed article assessing writing assessment software | 
| Essay assessing software fair to students?
By Lori Hubbard
As a first-year composition instructor at Eastern Michigan University, the first time I heard about essay assessing software, I was horrified. Horrified because I didn’t understand under what parameters this software was programmed to assess writing. How, for example, could such software possibly analyze, interpret, and assess meaning and content in student writing? I was horrified because it seemed to me to be a method of assessing writing based on conventions, on rules of grammar and sentence structure. I could not fathom how essay assessing software could fairly and accurately assess content and meaning in essays. How can software be programmed to examine and assess such aspects of student writing?
The fact is, it really cannot. In the research I’ve done on this subject, I have come across a number of methods assessors use to aid the software in making accurate assessments of content and meaning. While the software is in fact much more capable of fair and accurate assessment than I first realized, it still leaves a great deal to be desired.
The assessing this software does is based on a number of criteria for “good writing.” According to the Shermis et al. article “Trait Ratings for Automated Essay Grading,” the software can be programmed to look for key words, such as “because,” which indicate complex sentences, and with a number of variables that examine sentence length, repeated sentences, presence of end punctuation, misspellings, vulgar words and “me” words. These authors also acknowledge that the software does not pretend to understand the content of the essay, but instead can be trained to “read” for assessment in the same way as human raters.
There are also a number of methods used to train the software to emulate human raters. First, essays rated and scored by people, assessed based on both the parts (conventions, content, etc.)
and the sum of those parts (holistic assessment), are fed into the system, where they act as a basis of comparison for the software. Using the sample essays, the software can then more easily assess new essays by comparing the parts and the sum of the parts of each new essay with those assessed and scored by human raters. This system would be used for the most part for placement tests, such as the ACT and SAT, or other types of entrance exams.
Another method of training the software, offered by Foltz et al. in their article “The Intelligent Essay Assessor: Applications to Educational Technology” and similar to the method above, would be for the instructor of a class to write what s/he would consider an “ideal” essay; student essays would then be assessed based on how closely they connect to this “ideal.” This method could very well be an improved way for instructors of large lecture courses to more fairly assess student writing, given that computers are not subject to biases or exhaustion. The software can also assess around five or six documents a second, which is obviously well beyond human capabilities.
A further benefit of the software, according to Foltz et al., is its ability to recognize when a piece of writing is syntactically, structurally, or creatively unusual. In such cases, the software may flag the essay, indicating it needs human assessment. The software recognizes its own weaknesses and limitations and calls attention to essays it cannot assess as accurately and fairly as others.
I concede that under the above circumstances, this software is extremely useful. Both articles referenced above cite studies showing the accuracy of essay assessing software to be as good as, if not better than, that of human raters. Why, then, is such software unfair to students?
Human raters are far more capable of interpreting meaning and content, and Shermis et al. give a fine example: “Queen America sailed into Santa Maria with 1492 ships.
Her husband, King Columbus, looked to the Indian explorer, Nina Pinta, to find a vast wealth on the beaches of Isabella, but would settle for spices from the continent of Ferdinand.” This excerpt, a fictional piece of writing on Columbus’ discovery of North America, would likely receive a high score for content from essay assessing software, because the keywords in this excerpt match those the software knows are relevant to this topic. If such software is trained to recognize keywords and interpret content accordingly, I can only imagine the massive amounts of misinterpretation that must follow. These misinterpretations are clearly a cause for concern – when assessing student writing, accurate interpretation of content and meaning should be a top priority. This is simply not always possible when using essay assessing software. Human raters have the capacity to interpret meaning and content that essay assessing software does not.
If such software absolutely must be used when assessing student writing – and the amount of time and money saved by using this software, especially for placement test essays, makes its own case for the use of the software – then it absolutely must be used in conjunction with human raters. When the future of the student is at stake, it’s the least we can do. |
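
The keyword problem in the Shermis et al. example can be made concrete with a small sketch. This is not how any real essay assessing product works; it is a deliberately naive scorer, written under the assumption that "content" is judged only by how many topic keywords appear. The keyword list, the function name `keyword_score`, and the accurate comparison sentence are all invented for illustration.

```python
import re

# Hypothetical keyword list for an essay prompt about Columbus.
# It is an assumption for this sketch, not taken from any real system.
TOPIC_KEYWORDS = {
    "columbus", "isabella", "ferdinand", "nina", "pinta",
    "santa", "maria", "1492", "queen", "king", "spices",
}


def keyword_score(essay, keywords):
    """Return the fraction of topic keywords that appear anywhere in the essay."""
    words = set(re.findall(r"[a-z0-9]+", essay.lower()))
    return len(keywords & words) / len(keywords)


# The nonsense excerpt quoted from Shermis et al.
nonsense = (
    "Queen America sailed into Santa Maria with 1492 ships. "
    "Her husband, King Columbus, looked to the Indian explorer, "
    "Nina Pinta, to find a vast wealth on the beaches of Isabella, "
    "but would settle for spices from the continent of Ferdinand."
)

# An accurate sentence on the same topic, written for comparison.
accurate = (
    "In 1492 Columbus sailed with the Nina, the Pinta, and the Santa Maria, "
    "sponsored by Queen Isabella and King Ferdinand, hoping to reach "
    "the spices of Asia."
)

print(keyword_score(nonsense, TOPIC_KEYWORDS))
print(keyword_score(accurate, TOPIC_KEYWORDS))
```

Because the scorer only checks which words are present, the nonsense excerpt and the accurate sentence receive the same score. A human rater would separate them at a glance.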