Intelligent Essay Assessor - an LSA based system
Latent Semantic Analysis (LSA) represents documents and their word contents in
a large two-dimensional matrix, the semantic space. Each word considered
in the analysis is represented as a row of the matrix, and the columns
represent the sentences, paragraphs, or other subdivisions of the
contexts in which the words occur. The cells contain the frequencies of
the words in each context. Using a matrix algebra technique known as
Singular Value Decomposition (SVD), new relationships between words and
documents are uncovered, and existing relationships are modified to
represent their true significance more accurately. SVD breaks the
original matrix into three component matrices that, when multiplied
together, reproduce the original matrix. Keeping only a reduced number
of dimensions of these three matrices and multiplying them again yields
a close approximation to the original matrix, and in that approximation
new relationships between words and contexts are induced. These
relationships, hidden or latent before the SVD, are thereby made
manifest.
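The decomposition and reduction described above can be sketched in a few lines. This is a minimal illustration only, assuming NumPy is available; the words, contexts, and counts are invented, and the choice of two retained dimensions is arbitrary.

```python
import numpy as np

# Toy word-by-context frequency matrix (rows: words, columns: contexts).
# The words and counts are invented for illustration only.
words = ["essay", "grade", "marker", "matrix", "vector"]
X = np.array([
    [1, 1, 0, 0],   # essay
    [1, 1, 0, 0],   # grade
    [1, 0, 0, 0],   # marker: never occurs in the second context
    [0, 0, 1, 1],   # matrix
    [0, 0, 1, 1],   # vector
], dtype=float)

# SVD factors X into three matrices: X = U @ diag(s) @ Vt.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
full = U @ np.diag(s) @ Vt  # multiplying all three reproduces X

# Keep only the k largest singular values (the reduced dimension).
k = 2
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# "marker" had a zero count in the second context, but the reduced
# reconstruction gives it a positive weight there, because it shares
# the first context with "essay" and "grade".
print(round(float(X_k[2, 1]), 3))
```

In this toy case the positive weight for "marker" in the second context is an induced relationship of the kind the paragraph describes: it was absent from the raw counts and appears only after the dimension reduction.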
Landauer et al. (1998) have developed the Intelligent Essay Assessor, using the
LSA model. To grade an essay, a matrix for the essay document is built
and then transformed, using the reduced-dimension SVD matrices, into
the semantic space built for the essay topic domain. The semantic space
typically consists of human graded essays. Vectors are then computed
from a student's essay data. The vector for the essay document is
compared with the vectors for all the documents in the semantic space,
and the essay is assigned the mark of the graded essay with the highest
cosine value, that is, the graded essay most similar to it.
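The comparison step can be sketched as follows. This is not the IEA implementation: the function names, vectors, and marks are hypothetical, and the vectors are assumed to be already expressed in the reduced semantic space. Because the cosine grows as two vectors point in more similar directions, the sketch selects the graded essay with the largest cosine.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # treat an empty vector as similar to nothing
    return dot / (norm_u * norm_v)

def assign_mark(essay_vec, graded):
    """Return the mark of the graded essay most similar to essay_vec.

    graded is a list of (vector, mark) pairs (a hypothetical layout).
    """
    _, best_mark = max(graded, key=lambda pair: cosine(essay_vec, pair[0]))
    return best_mark

# Invented vectors in a two-dimensional reduced space, with human marks.
graded_essays = [
    ([0.9, 0.1], 80),
    ([0.5, 0.5], 65),
    ([0.1, 0.9], 40),
]
mark = assign_mark([0.8, 0.3], graded_essays)
print(mark)
```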
Foltz (1996) reports that LSA grading performance is about as reliable as
human graders. Landauer (1999) reports a test on GMAT essays where the
percentages for adjacent agreement with human graders were between
85% and 91%.
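As a rough illustration of the measure, adjacent agreement is commonly taken to be the percentage of essays on which two scores differ by no more than one scale point. The sketch below assumes that definition; the function name and all scores are invented, not taken from the GMAT test.

```python
def adjacent_agreement(human, machine, tolerance=1):
    """Percentage of essays whose two scores differ by at most tolerance."""
    if len(human) != len(machine) or not human:
        raise ValueError("score lists must be non-empty and of equal length")
    hits = sum(1 for h, m in zip(human, machine) if abs(h - m) <= tolerance)
    return 100.0 * hits / len(human)

# Invented scores on a six-point scale.
human_scores = [4, 5, 3, 6, 2, 4, 5, 3]
machine_scores = [4, 4, 5, 6, 3, 4, 3, 3]
rate = adjacent_agreement(human_scores, machine_scores)
print(rate)
```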
IEA on trial at Curtin University of Technology
During the first semester of 2001 a trial of an automated essay grading system
was conducted. We chose a first year unit on introduction to
Information Systems with an enrolment of some 1,000 students.
IEA requires two hundred manually graded essays as input to establish the
benchmark performance matrix. Three expert human markers graded about
70 papers each and sent the electronic copies, along with the marks, to
the IEA site in the USA. We submitted another 330 ungraded essays for
grading.
Some interesting outcomes were evident. Firstly, the grades from the three
expert human markers, as indicated in Table 1, showed no significant
difference in either absolute marks awarded or the standard deviation
of marks. Grader "A" had always considered himself a "hard" grader and
considered grader "B" rather soft, but the data reveal a slightly
different situation.
| All figures are % | Grader A | Grader B | Grader C |
|---|---|---|---|
| Mean | 64 | 63 | 67 |
| SD | 17 | 19 | 14 |
| Min | 20 | 0 | 25 |
| Max | 100 | 95 | 90 |
| Mode | 65 | 60 | 80 |
| Median | 65 | 65 | 70 |
Table 1 : Human Marker Comparison
To facilitate comparison with the results of IEA grading, we depict the
combined expert human marker grade distribution in Figure 1, which may
be compared with Figure 2.
Figure 1 : combined expert human marker results
Our purpose, however, was not to check our own grading but to see how
consistent IEA would be with our own performance. We discovered that
IEA produced the same mean and standard deviation of marks as the three
expert human markers (see Fig. 2). We were satisfied with its
performance on that account.
Figure 2 : IEA result
Table 2 contains the Human Grader scores and IEA scores by grade range.
Inspection reveals a close correspondence between the two data sets,
giving us confidence in the validity of IEA for our purposes.
| Grade range | Human Grader Scores: Frequency | Human Grader Scores: % | IEA Scores: Frequency | IEA Scores: % |
|---|---|---|---|---|
| 0-9 | 1 | 0.5 | 3 | 0.95 |
| 10-19 | 0 | 0 | 1 | 0.32 |
| 20-29 | 4 | 2 | 2 | 0.63 |
| 30-39 | 10 | 5 | 8 | 2.52 |
| 40-49 | 13 | 6.5 | 30 | 9.46 |
| 50-59 | 33 | 16.5 | 56 | 17.67 |
| 60-69 | 50 | 25 | 96 | 30.28 |
| 70-79 | 45 | 22.5 | 62 | 19.56 |
| 80-89 | 29 | 14.5 | 45 | 14.19 |
| 90-100 | 15 | 7.5 | 14 | 4.42 |
| Total | 200 | 100% | 317 | 100% |
Table 2 : Human Grader versus IEA trial data
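One way to put a number on the correspondence in Table 2 is to recompute each band's percentage share from the frequencies and find where the two distributions differ most. The sketch below uses the frequencies from Table 2; the helper name and the choice of a per-band gap as the comparison are illustrative assumptions, not part of the trial's analysis.

```python
# Frequencies per grade range, copied from Table 2.
bands = ["0-9", "10-19", "20-29", "30-39", "40-49",
         "50-59", "60-69", "70-79", "80-89", "90-100"]
human_freq = [1, 0, 4, 10, 13, 33, 50, 45, 29, 15]
iea_freq = [3, 1, 2, 8, 30, 56, 96, 62, 45, 14]

def shares(freqs):
    """Convert raw frequencies to percentage shares of their total."""
    total = sum(freqs)
    return [100.0 * f / total for f in freqs]

human_pct = shares(human_freq)
iea_pct = shares(iea_freq)

# Absolute gap, in percentage points, between the two shares per band.
gaps = [abs(h - i) for h, i in zip(human_pct, iea_pct)]
max_gap = max(gaps)
max_band = bands[gaps.index(max_gap)]
print(max_band, round(max_gap, 2))
```

On these figures the largest gap falls in the 60-69 band, at a little over five percentage points; every other band differs by less.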
There was an additional and quite unexpected result from the test. The system
picked up several cases of plagiarism that we had failed to notice. In
these cases the plagiarism was that of one student copying the work of
another student, rather than extracting text from another source.
Some disadvantages of the System
There are two important weaknesses and one minor weakness for our purposes in
the system that we trialled. The first weakness is that, for a
successful implementation, one needs to manually grade 200 essays and
feed them into the system. The computer will then accurately and
dependably grade as many more essays on that topic as are required. In
small classes of fewer than a few hundred students, this manual grading
effort makes the system impractical.
The second weakness relates to the cost of using the system. IEA is an American system,
meaning we needed to pay in US dollars, and at the present exchange
rate it cost about A$11,400 to grade a few hundred essays. This is
simply not cost-effective.
There is a third factor. The system is run at a site in the USA rather than on our own
computer network at Curtin University. There is some lack of control
and a potential security risk in having the process run remotely.
Limitations to any Automated Grading System
To utilize any Automated Grading System, the raw data (essays or
examination answers) would need to be in a computer-readable form. The
most obvious form would be electronic documents in Word format. This is
easily achieved where the student can write the essay on a computer.
However, students normally sit examinations at desks with paper and
pen, and the resulting examination script is not easily transferred to
a computer-readable medium. On the other hand, it is possible to have
students sit an exam in a computer laboratory and submit their
examination papers electronically. It may be problematic to have large
numbers sit the exam simultaneously, but there may be workarounds, such
as a take-home examination due within 24 hours. Any number of students
would then be able to sit the exam at the same time and submit their
papers electronically.
Another serious limitation of an essay
grading system is that it grades a student's knowledge of a given set
of material. The model answer would contain only a set body of
knowledge, and the system would grade the student on the part of that
knowledge the student was able to demonstrate. This may be acceptable
in the early years of a course but probably not in more advanced
studies.