Automated Essay Grading Systems
 
 

Intelligent Essay Assessor - an LSA based system

Latent Semantic Analysis (LSA) represents documents and their word contents in a large two-dimensional matrix, the semantic space. Each word being considered for the analysis is represented as a row of the matrix, and the columns represent the sentences, paragraphs, or other subdivisions of the contexts in which the words occur. The cells contain the frequencies of the words in each context. Using a matrix algebra technique known as Singular Value Decomposition (SVD), new relationships between words and documents are uncovered, and existing relationships are modified to more accurately represent their true significance. SVD breaks the original matrix into three component matrices that, when multiplied together, reproduce the original matrix. When only a reduced number of dimensions of these three matrices is retained, the matrix rebuilt from them is a close approximation to the original, and in that reconstruction new relationships between words and contexts are induced. These relationships become manifest, whereas prior to the SVD they were hidden, or latent.
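The decomposition and reduced-dimension reconstruction described above can be illustrated with a small sketch. It assumes NumPy is available; the words, contexts and counts are invented for illustration, and the helper name reduce_rank is hypothetical. It is not the IEA implementation.

```python
import numpy as np

# Toy word-by-context frequency matrix (rows = words, columns = contexts).
# The words, contexts and counts are invented for illustration only.
words = ["database", "query", "table", "network", "router"]
X = np.array([
    [1, 1, 0, 0, 0],   # database
    [0, 1, 1, 0, 0],   # query
    [1, 0, 1, 0, 0],   # table
    [0, 0, 0, 2, 1],   # network
    [0, 0, 0, 1, 2],   # router
], dtype=float)


def reduce_rank(matrix, k):
    """Rebuild `matrix` from its k largest singular values (truncated SVD)."""
    # SVD yields three component matrices: U, the singular values s, and Vt.
    U, s, Vt = np.linalg.svd(matrix, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]


# Keeping every dimension reproduces the original matrix.
X_full = reduce_rank(X, k=5)

# Keeping only two dimensions gives a close approximation in which
# some cells that were zero take on non-zero weights.
X_2 = reduce_rank(X, k=2)

print(np.round(X_2, 2))
```

In this toy matrix "database" never occurs in the third context, yet the two-dimensional reconstruction gives that cell a positive weight, because "database" shares contexts with "query" and "table", which do occur there. That is the sense in which a latent relationship is made manifest.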

Landauer et al. (1998) developed the Intelligent Essay Assessor (IEA) using the LSA model. To grade an essay, a matrix for the essay document is built and then transformed by the SVD technique so that it is approximately reproduced using the reduced-dimension matrices built for the semantic space of the essay topic domain. The semantic space typically consists of human-graded essays. Vectors are then computed from the student's essay data. The vector for the essay document is compared with the vectors of all the documents in the semantic space, and the essay is assigned the mark of the graded essay with the highest cosine value (that is, the closest match) in relation to it.
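The comparison step can be sketched as a nearest-neighbour lookup using the cosine between vectors, on the assumption that the closest match is the graded essay whose cosine with the student essay is largest. The vectors, marks and function names below are hypothetical, and a production system may well combine several neighbours rather than one.

```python
import numpy as np


def cosine(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def grade_by_nearest(essay_vec, graded_vecs, graded_marks):
    """Return (mark, cosine) of the graded essay most similar to essay_vec."""
    sims = [cosine(essay_vec, g) for g in graded_vecs]
    best = int(np.argmax(sims))  # index of the largest cosine
    return graded_marks[best], sims[best]


# Hypothetical two-dimensional essay vectors in a reduced semantic space,
# each paired with the mark a human grader gave that essay.
graded_vecs = [np.array([0.9, 0.1]), np.array([0.5, 0.5]), np.array([0.1, 0.9])]
graded_marks = [80, 65, 40]

# Hypothetical vector for the student essay to be graded.
student = np.array([0.8, 0.3])

mark, sim = grade_by_nearest(student, graded_vecs, graded_marks)
print(mark, round(sim, 3))
```

Here the student vector points in nearly the same direction as the first graded essay, so it receives that essay's mark.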

Foltz (1996) reports that LSA grading performance is about as reliable as human graders. Landauer (1999) reports a test on GMAT essays in which adjacent agreement with human graders was between 85% and 91%.
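For readers unfamiliar with the measure, adjacent agreement is commonly taken to mean the proportion of essays on which two graders' scores differ by no more than one scale point. The sketch below assumes that definition (the cited report may define it differently); the scores and the function name are invented for illustration.

```python
def agreement_rates(scores_a, scores_b, tolerance=1):
    """Return (exact, adjacent) agreement as fractions between 0 and 1.

    exact    : both graders gave the same score
    adjacent : scores differ by at most `tolerance` points (includes exact)
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("score lists must be non-empty and of equal length")
    n = len(scores_a)
    exact = sum(a == b for a, b in zip(scores_a, scores_b)) / n
    adjacent = sum(abs(a - b) <= tolerance for a, b in zip(scores_a, scores_b)) / n
    return exact, adjacent


# Invented scores on a six-point scale for eight essays.
human = [4, 5, 3, 6, 2, 4, 5, 3]
machine = [4, 4, 3, 4, 2, 5, 5, 1]

exact, adjacent = agreement_rates(human, machine)
print(exact, adjacent)
```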

IEA on trial at Curtin University of Technology

During the first semester of 2001 a trial of an automated essay grading system was conducted. We chose a first-year introductory unit on Information Systems with an enrolment of some 1,000 students.

IEA requires two hundred manually graded essays as input to establish the benchmark performance matrix. Three expert human markers graded about 70 papers each and sent the electronic copies, along with the marks, to the IEA site in the USA. A further 330 ungraded essays were submitted for grading.

Some interesting outcomes were evident. Firstly, the grades from the three expert human markers, as shown in Table 1, showed no significant difference in either the absolute marks awarded or the standard deviation of marks. Grader "A" had always considered himself a "hard" grader and considered grader "B" rather soft, but the data reveal a slightly different situation.

All figures are %    Grader A    Grader B    Grader C
Mean                 64          63          67
SD                   17          19          14
Min                  20          0           25
Max                  100         95          90
Mode                 65          60          80
Median               65          65          70

Table 1 : Human Marker Comparison

To facilitate comparison with the results of IEA grading, we depict the combined expert human marker grade distribution in Figure 1, which may be compared with Figure 2.

Figure 1 : combined expert human marker results

Our purpose, however, was not to check our own grading but to see how consistent IEA would be with our own performance. We discovered that IEA produced the same mean and standard deviation of marks as the three expert human markers (see Figure 2). We were satisfied with its performance on that account.

Figure 2 : IEA result

Table 2 contains the Human Grader scores and IEA scores by Grade range. Inspection reveals a close correspondence of the two data sets giving us confidence in the validity of IEA for our purposes.

               Human Grader Scores      IEA Scores
Grade range    Frequency    %           Frequency    %
0-9            1            0.5         3            0.95
10-19          0            0           1            0.32
20-29          4            2           2            0.63
30-39          10           5           8            2.52
40-49          13           6.5         30           9.46
50-59          33           16.5        56           17.67
60-69          50           25          96           30.28
70-79          45           22.5        62           19.56
80-89          29           14.5        45           14.19
90-100         15           7.5         14           4.42
Total          200          100%        317          100%

Table 2 : Human Grader versus IEA trial data
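One way to put a number on the correspondence between the two distributions is to convert each frequency column to percentages and compare them band by band. The sketch below uses the frequencies from Table 2; the choice of measures (largest per-band gap and shared percentage) is ours for illustration and was not part of the trial analysis.

```python
# Frequencies copied from Table 2, in grade-range order.
bands = ["0-9", "10-19", "20-29", "30-39", "40-49",
         "50-59", "60-69", "70-79", "80-89", "90-100"]
human = [1, 0, 4, 10, 13, 33, 50, 45, 29, 15]
iea = [3, 1, 2, 8, 30, 56, 96, 62, 45, 14]


def percentages(freqs):
    """Convert raw frequencies to percentages of their own total."""
    total = sum(freqs)
    return [100.0 * f / total for f in freqs]


human_pct = percentages(human)
iea_pct = percentages(iea)

# Absolute gap, in percentage points, between the two columns per band.
gaps = [abs(h - i) for h, i in zip(human_pct, iea_pct)]
largest_gap = max(gaps)
largest_gap_band = bands[gaps.index(largest_gap)]

# Share of the distribution common to both columns (100 = identical).
overlap = sum(min(h, i) for h, i in zip(human_pct, iea_pct))

print(largest_gap_band, round(largest_gap, 2), round(overlap, 1))
```

On these figures the largest gap falls in the 60-69 band, at a little over five percentage points, and the two distributions share roughly nine tenths of their mass.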

There was an additional and quite unexpected result from the test. The system picked up several cases of plagiarism that we had failed to notice. In these cases the plagiarism was one student copying the work of another student rather than extracting text from another source.

Some disadvantages of the System

There are two important weaknesses and one minor weakness, for our purposes, in the system that we trialled. The first weakness is that a successful implementation requires 200 essays to be graded manually and fed into the system. The computer will then accurately and dependably grade as many more essays on that topic as are required. For small classes of fewer than a few hundred students this becomes impractical.

The second weakness relates to the cost of using the system. IEA is an American system, so we had to pay in US dollars, and at the present exchange rate it cost about A$11,400 to grade a few hundred essays. This is simply not cost-effective.

There is a third factor. The system is run at a site in the USA rather than on our own computer network at Curtin University. Having the process run remotely means some loss of control and a potential security risk.

Limitations to any Automated Grading System

To utilize any Automated Grading System, the raw data (essays or examination answers) would need to be in a computer-readable form. The most obvious form would be electronic documents in Word format. This is easily achieved where the student can write the essay on a computer. However, when students sit examinations they normally do so at desks with pen and paper, and the resulting examination script is not easily transferred to a computer-readable medium. On the other hand, it is possible to have students sit an exam in a computer laboratory and submit their examination papers electronically. It may be problematic to have large numbers sit the exam simultaneously, but there may be workarounds, such as a take-home examination due within 24 hours. Any number of students would then be able to sit the exam at the same time and submit their papers electronically.

Another serious limitation of an essay grading system is that it grades a student's knowledge of a given set of material. The model answer would contain only a set body of knowledge, and the student would be graded on the part of that knowledge he or she was able to demonstrate. This may be acceptable in the early years of a course but probably not in more advanced studies.