Automated Essay Grading Systems
 
 

Production Automated Essay Grading Systems

Among the growing number of systems, the following four are of interest: Project Essay Grade (PEG); E-rater; Text Categorisation Technique (TCT); and Intelligent Essay Assessor (IEA). We describe each system and then report on a trial of the IEA conducted with our own students.

One of the earliest mentions of computer grading of essays in the literature is an article by Page describing Project Essay Grade (PEG) (Page, 1966). The idea behind PEG is to help reduce the enormous essay grading load in large educational testing programs, such as the SAT, where the use of multiple graders raises problems with consistency of grading. A sample of the essays to be graded is selected and graded by a number of human judges. Various linguistic features of these essays are then measured, and a multiple regression equation is developed from these measures. This equation is used, along with the appropriate measures from each student essay to be graded, to predict the average score that a human judge would assign.

Multiple regression techniques are used to compute, from approximations of fluency, diction, grammar and punctuation, among other variables, an equation to predict a score for each student essay. In the research reported in Page (1994), the goal was to identify those variables that would prove effective in predicting human raters' scores. Various software products, including a grammar checker, a program to identify words and sentences, a software dictionary, a part-of-speech tagger and a parser, were used to gather data about many variables and their approximations or correlates.
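The PEG procedure described above (measure surface proxies on a sample of human-graded essays, fit a multiple regression, then apply the equation to new essays) can be sketched as follows. This is a minimal illustration and not Page's implementation: the three proxies in `surface_features` and all function names are assumptions made for the example, and the fit is plain ordinary least squares via the normal equations.

```python
import re

def surface_features(essay):
    """Hypothetical surface proxies: word count, mean word length, comma count."""
    words = re.findall(r"[A-Za-z']+", essay)
    n_words = len(words)
    mean_len = sum(len(w) for w in words) / n_words if n_words else 0.0
    return [float(n_words), mean_len, float(essay.count(","))]

def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting.
    Assumes the matrix is non-singular (features are not collinear)."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def fit_regression(rows, scores):
    """Return least-squares coefficients [intercept, b1, b2, ...]."""
    x = [[1.0] + list(r) for r in rows]  # prepend the intercept column
    k = len(x[0])
    xtx = [[sum(row[i] * row[j] for row in x) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * s for row, s in zip(x, scores)) for i in range(k)]
    return solve_linear_system(xtx, xty)

def predict(coeffs, row):
    """Predicted score for one essay's feature row."""
    return coeffs[0] + sum(c * v for c, v in zip(coeffs[1:], row))
```

In use, `rows` would be `[surface_features(e) for e in graded_essays]` and `scores` the corresponding average human scores; `predict(coeffs, surface_features(new_essay))` then gives the estimated score for an ungraded essay.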

The second system of interest is E-rater. It uses a combination of statistical and Natural Language Processing (NLP) techniques to extract linguistic features of the essays to be graded. As in all the conceptual models discussed in this paper, E-rater evaluates student essays against a benchmark set of human-graded essays. E-rater has modules that extract essay vocabulary content, discourse structure information and syntactic information. Multiple linear regression techniques are then used to predict a score for the essay, based upon the features extracted. For each new essay question, the system is run to extract characteristic features from human-scored essay responses. Fifty-seven features of the benchmark essays, based upon six score points in an Educational Testing Service (ETS) scoring guide for manual grading, are initially used to build the regression model. Using stepwise regression techniques, the significant predictor variables are determined. The values derived for these variables from the student essays are then substituted into the particular regression equation to obtain the predicted score.

One of the scoring guide criteria is essay syntactic variety. After parsing the essay with an NLP tool, the parse trees are analysed to determine the clause or verb types that the essay writer used. Ratios are then calculated for each syntactic type on a per-essay and per-sentence basis. Another scoring guide criterion relates to having well-developed arguments in the essay. Discourse analysis techniques are used to examine the essay for discourse units by looking for surface cue words and non-lexical cues. These cues are then used to break the essay up into partitions based upon individual content arguments. The system also compares the topical content of an essay with that of the reference texts by looking at word usage.
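The comparison of topical content by word usage can be illustrated with a word-frequency vector and a cosine similarity measure. The text above does not specify the measure, so cosine similarity is an assumption here, as are the function names and the idea of pooling reference essays per score point; this is a sketch only, not E-rater's actual module.

```python
import math
import re
from collections import Counter

def word_counts(text):
    """Lower-cased word frequency vector for a text."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty text is similar to nothing
    return dot / (norm_a * norm_b)

def most_similar_score(essay, references):
    """Return the score point whose reference text is closest to the essay.

    `references` (hypothetical structure) maps a score point to the pooled
    text of human-graded essays that received that score.
    """
    essay_vec = word_counts(essay)
    return max(
        references,
        key=lambda s: cosine_similarity(essay_vec, word_counts(references[s])),
    )
```

A raw word-count vector is the simplest choice; a weighting scheme that discounts very common words would be a natural refinement.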

E-rater has been evaluated by Burstein, Kukich, Wolff, Lu & Chodorow (1998) and has been found to achieve a level of agreement with human raters of between 87% and 94%, which is claimed to be comparable with that found amongst human raters. For one test essay question the following predictive feature variables were found to be significant: 1) Argument content score; 2) Essay word frequency content score; 3) Total argument development words/phrases; 4) Total pronouns beginning arguments; 5) Total complement clauses beginning arguments; 6) Total summary words beginning arguments; 7) Total detail words beginning arguments; 8) Total rhetorical words developing arguments; 9) Subjunctive modal verbs.

TCT is a system implemented by Larkey (1998) that employs text categorisation techniques, text complexity features, and linear regression methods. The Information Retrieval literature discusses techniques for classifying documents as to their appropriateness of content for given document retrieval queries (van Rijsbergen, 1979). The technique firstly makes use of Bayesian independent classifiers (Maron, 1961) to assign probabilities to documents, estimating the likelihood that they belong to a specified category of documents; this relies on an analysis of the occurrence of certain words in the documents. Secondly, a k-nearest neighbour technique is used to find the k essays closest to the student essay, where k is determined through training the system on a sample of human-graded essays. The Inquery retrieval system (Callan, Croft & Broglio, 1995) was used for this. Finally, eleven text complexity features are used, such as the number of characters in the document, the number of different words in the document, the fourth root of the number of words in the document (see also the discussion on PEG above), and the average sentence length.
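The four text complexity features named above can be computed with a few lines of code. The following is a minimal sketch, assuming a simple regular-expression tokeniser and sentence splitter; the function name, the dictionary keys and the tokenisation rules are illustrative choices, not those of Larkey's system, and the remaining seven features are not covered.

```python
import re

def text_complexity_features(document):
    """Compute four of the text complexity features described in the text."""
    # Assumed tokenisation: runs of letters and apostrophes count as words.
    words = re.findall(r"[A-Za-z']+", document)
    # Assumed sentence splitting: break on runs of ., ! or ?
    sentences = [s for s in re.split(r"[.!?]+", document) if s.strip()]
    n_words = len(words)
    return {
        "n_chars": len(document),
        "n_distinct_words": len({w.lower() for w in words}),
        "fourth_root_n_words": n_words ** 0.25,
        "avg_sentence_length": n_words / len(sentences) if sentences else 0.0,
    }
```

Taking the fourth root compresses the range of essay lengths, so that a very long essay does not dominate the regression simply by being long.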

Larkey conducted a number of regression trials, using different combinations of components. Larkey also used a number of essay sets, including essays on social studies (soc), where content was the primary interest, and essays on general opinion (G1), where style was the main criterion for assessment. The results presented here are for these two essay sets only. For the social studies essays, when all the criteria for assessment were used, the proportion of essays given exactly the same grade as the human graders was 0.60, and the proportion given an adjacent grade (within one grade on either side of the human grade) was 1.00. For the general opinion essays the corresponding figures were 0.55 and 0.97. The system performed remarkably well.
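Exact and adjacent agreement figures of the kind reported here can be computed directly from paired machine and human grades. The sketch below assumes integer grades and treats "adjacent" as within one grade of the human grade, which therefore includes exact matches; the function name is hypothetical.

```python
def agreement_rates(machine_scores, human_scores):
    """Return (exact, adjacent) agreement proportions for paired grades.

    exact:    share of essays where the two grades are identical
    adjacent: share of essays where the grades differ by at most one
    """
    if len(machine_scores) != len(human_scores) or not machine_scores:
        raise ValueError("score lists must be non-empty and of equal length")
    diffs = [abs(m - h) for m, h in zip(machine_scores, human_scores)]
    n = len(diffs)
    exact = sum(d == 0 for d in diffs) / n
    adjacent = sum(d <= 1 for d in diffs) / n
    return exact, adjacent
```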

The fourth system of interest, and our primary focus, is IEA by Landauer et al. (1998). It is this system that we trialled with our own students.