Among the growing number of automated essay grading systems, the following four are of interest:
Project Essay Grade (PEG); E-rater; Text Categorisation Technique (TCT); and Intelligent Essay Assessor
(IEA). We describe each system and then report on a trial of the IEA conducted with our own students.
One of the earliest mentions of computer grading of essays in the literature was in an article by
Page in which he described Project Essay Grade (PEG) (Page, 1966). The idea behind PEG is to help
reduce the enormous essay grading load in large educational testing programs, such as the SAT. When
multiple graders are used, problems arise with consistency of grading. A sample of the essays to be
graded is selected and graded by a number of human judges. Various linguistic features of these essays
are then measured. A multiple regression equation is then developed from these measures. This equation
is then used, along with the appropriate measures from each student essay to be graded, to predict the
average score that a human judge would assign.
Multiple regression techniques are used to compute, from approximations of fluency, diction, grammar
and punctuation, among other variables, an equation that predicts a score for each student
essay. In the research reported in Page (1994), the goal was to identify those variables that would prove
effective in predicting human raters' scores. Various software products, including a grammar checker, a
program to identify words and sentences, a software dictionary, a part-of-speech tagger, and a parser, were
used to gather data about many variables and their approximations or correlates.
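The PEG procedure described above can be illustrated with a minimal sketch. It is not Page's implementation: the two measures per essay, the human scores, and the function names fit_regression and predict_score are all invented for illustration. A regression equation is fitted to the human-graded sample and then applied to the measures taken from an ungraded essay.

```python
from typing import List, Sequence


def fit_regression(X: Sequence[Sequence[float]], y: Sequence[float]) -> List[float]:
    """Least-squares fit via the normal equations (X'X) b = X'y.

    Returns [intercept, b1, ..., bp].
    """
    rows = [[1.0] + [float(v) for v in r] for r in X]  # leading 1.0 = intercept term
    p = len(rows[0])
    # Augmented matrix [X'X | X'y].
    A = [
        [sum(r[i] * r[j] for r in rows) for j in range(p)]
        + [sum(r[i] * t for r, t in zip(rows, y))]
        for i in range(p)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        pivot_row = max(range(col, p), key=lambda r: abs(A[r][col]))
        if abs(A[pivot_row][col]) < 1e-12:
            raise ValueError("measures are collinear; no unique equation")
        A[col], A[pivot_row] = A[pivot_row], A[col]
        pivot = A[col][col]
        A[col] = [v / pivot for v in A[col]]
        for r in range(p):
            if r != col:
                factor = A[r][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][p] for i in range(p)]


def predict_score(coeffs: Sequence[float], measures: Sequence[float]) -> float:
    """Substitute one essay's measures into the fitted equation."""
    return coeffs[0] + sum(c * m for c, m in zip(coeffs[1:], measures))


# Invented training sample: two hypothetical measures per essay and the
# average score the human judges gave it.
sample_measures = [[2, 10], [3, 5], [4, 20], [5, 8], [6, 15]]
human_scores = [3.0, 3.0, 5.0, 4.3, 5.5]

coeffs = fit_regression(sample_measures, human_scores)
predicted = predict_score(coeffs, [4, 10])  # measures from an ungraded essay
print(round(predicted, 2))
```

In practice the sample would be far larger and the measures would come from the linguistic tools listed above. The solver is written out by hand only to keep the sketch free of external libraries.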
The second system of interest is E-rater. It uses a combination of statistical and Natural Language
Processing (NLP) techniques to extract linguistic features of the essays to be graded. As in all the
conceptual models discussed in this paper, student essays are evaluated against a benchmark set
of human graded essays. E-rater has modules that extract essay vocabulary content, discourse structure
information and syntactic information. Multiple linear regression techniques are then used to predict a
score for the essay, based upon the features extracted. For each new essay question, the system is run
to extract characteristic features from human scored essay responses. Fifty-seven features of the benchmark
essays, based upon six score points in an Educational Testing Service (ETS) scoring guide for manual
grading, are initially used to build the regression model. Using stepwise regression techniques, the
significant predictor variables are determined. The values derived for these variables from the student
essays are then substituted into the particular regression equation to obtain the predicted score.
One of the scoring guide criteria is essay syntactic variety. After parsing the essay with an NLP
tool, the parse trees are analysed to determine clause or verb types that the essay writer used. Ratios
are then calculated for each syntactic type on a per essay and per sentence basis. Another scoring guide
criterion relates to having well-developed arguments in the essay. Discourse analysis techniques are used
to examine the essay for discourse units by looking for surface cue words and non-lexical cues. These cues
are then used to break the essay up into partitions based upon individual content arguments. The system
also compares the topical content of an essay with that of the reference texts by looking at word usage.
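One simple way to compare topical content by word usage is to treat each text as a vector of word counts and measure the cosine of the angle between the vectors. The sketch below shows that idea only. The choice of cosine similarity, the tokenisation rule, the function names and the sample texts are our assumptions, not a description of how E-rater itself is implemented.

```python
import math
import re
from collections import Counter


def word_counts(text: str) -> Counter:
    """Lower-case the text and count each word (assumed tokenisation rule)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two word-count vectors (0.0 if either is empty)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Invented texts: one student essay and two reference texts.
essay = "The treaty ended the war and redrew the borders."
on_topic = "The war ended when the treaty was signed and borders changed."
off_topic = "Photosynthesis converts sunlight into chemical energy."

sim_on = cosine_similarity(word_counts(essay), word_counts(on_topic))
sim_off = cosine_similarity(word_counts(essay), word_counts(off_topic))
print(sim_on > sim_off)
```

A raw count comparison like this is dominated by common words such as "the". A practical system would normally weight or filter them.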
E-rater has been evaluated by Burstein, Kukich, Wolff, Lu & Chodorow (1998) and has been found
to achieve a level of agreement with human raters of between 87% and 94%, which is claimed to be
comparable with that found amongst human raters. For one test essay question the following predictive
feature variables were found to be significant: 1) Argument content score; 2) Essay word frequency content
score; 3) Total argument development words/phrases; 4) Total pronouns beginning arguments; 5) Total
complement clauses beginning arguments; 6) Total summary words beginning arguments; 7) Total detail words
beginning arguments; 8) Total rhetorical words developing arguments; 9) Subjunctive modal verbs.
TCT is a system implemented by Larkey (1998) that employs text categorisation techniques, text
complexity features, and linear regression methods. The Information Retrieval literature discusses techniques
for classifying documents as to their appropriateness of content for given document retrieval queries (van
Rijsbergen, 1979). Firstly, the technique uses Bayesian independent classifiers (Maron, 1961) to
assign probabilities to documents, estimating the likelihood that they belong to a specified category of
documents; this relies on an analysis of the occurrence of certain words in the documents. Secondly, a
k-nearest neighbour technique is used to find the k essays closest to the student essay, where k is
determined through training the system on a sample of human graded essays. The Inquery retrieval system
(Callan, Croft & Broglio, 1995) was used for this. Finally, eleven text complexity features are used,
such as the number of characters in the document, the number of different words in the document, the fourth
root of the number of words in the document (see also the discussion on PEG above), and the average sentence
length.
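The four text complexity features named above are straightforward to compute, as the following minimal sketch shows. It covers only those four of the eleven features. The rules used to split words and sentences, and the function name complexity_features, are our assumptions rather than Larkey's definitions.

```python
import re
from typing import Dict


def complexity_features(text: str) -> Dict[str, float]:
    """Compute four simple text complexity features for one document."""
    words = re.findall(r"[A-Za-z']+", text)  # assumed word rule
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]  # assumed sentence rule
    n_words = len(words)
    return {
        "n_chars": float(len(text)),
        "n_distinct_words": float(len({w.lower() for w in words})),
        "fourth_root_n_words": n_words ** 0.25,
        "avg_sentence_length": n_words / len(sentences) if sentences else 0.0,
    }


features = complexity_features("The cat sat. The dog ran away.")
for name, value in features.items():
    print(name, round(value, 3))
```

Taking the fourth root compresses the word count, so that very long essays do not dominate a regression that uses essay length as a predictor.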
Larkey conducted a number of regression trials, using different combinations of components. Larkey also
used a number of essay sets, including essays on social studies (soc), where content was the primary interest,
and essays on general opinion (G1), where style was the main criterion for assessment. The results presented here are
for these two essay sets only. For the social studies essays, when all the criteria for assessment were used, the
proportion of essays graded exactly the same as by human graders was 0.60, and the proportion with adjacent scores
(within one grade on either side) was 1.00. For the general opinion essays the corresponding figures were 0.55 and
0.97. The system performed remarkably well.
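The two agreement figures can be computed as in the sketch below. It assumes that "adjacent" means the machine score is within one grade of the human score, so that exact matches are included, which is our reading of figures such as 0.60 and 1.00. The scores are invented and the function name agreement is ours.

```python
from typing import Sequence, Tuple


def agreement(machine: Sequence[int], human: Sequence[int]) -> Tuple[float, float]:
    """Return (exact, adjacent) agreement proportions.

    exact:    machine score equals the human score.
    adjacent: machine score is within one grade of the human score
              (assumed here to include exact matches).
    """
    if len(machine) != len(human) or not machine:
        raise ValueError("score lists must be non-empty and of equal length")
    n = len(machine)
    exact = sum(m == h for m, h in zip(machine, human)) / n
    adjacent = sum(abs(m - h) <= 1 for m, h in zip(machine, human)) / n
    return exact, adjacent


# Invented scores for five essays.
machine_scores = [3, 4, 2, 5, 1]
human_scores = [3, 3, 2, 5, 3]

exact, adjacent = agreement(machine_scores, human_scores)
print(exact, adjacent)
```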
The fourth system of interest, and our primary focus, is IEA (Landauer et al., 1998). It is this system that we
trialled with our own students.