We now give an overview of the scoring algorithm. The instructor prepares a model answer that contains the core knowledge required to achieve a 100% score. The system may eventually be able to score a student essay against several model answers, in case the instructor wishes to use more than one content model. The instructor can also provide about 100 to 200 human-graded essays (ideally each graded by three humans), together with their scores, for training purposes. The model and training answers are processed as described above. The system then performs a content matching task in which the model answer content summary is compared against each of the training essay content summaries. Many aspects of the relationships between the model and training essays are then computed, and a linear regression model is fitted to derive a scoring equation. Unmarked student essays are then processed to build the content summaries that populate the NP and VP content structures. Finally, the scoring equation is used to produce a score for each essay.
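The regression step described above can be illustrated with a minimal sketch. It is not MarkIT's actual implementation: the two features (counts of matched noun phrases and matched verb phrases) and all of the numbers are hypothetical, chosen only to show how a scoring equation could be fitted on training essays and then applied to an unmarked essay.

```python
# Illustrative sketch only: the features and data below are invented,
# and the real system computes many more model/essay relationships.

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def fit_scoring_equation(features, scores):
    """Ordinary least squares via the normal equations.

    Returns [intercept, w1, w2, ...].
    """
    rows = [[1.0] + list(f) for f in features]  # prepend the intercept column
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, scores)) for i in range(k)]
    return solve(xtx, xty)

def predict(coeffs, feature_row):
    """Apply the fitted scoring equation to one essay's features."""
    return coeffs[0] + sum(w * x for w, x in zip(coeffs[1:], feature_row))

# Hypothetical training data: (matched NPs, matched VPs) per training essay.
train_features = [(1, 2), (2, 1), (3, 5), (4, 3), (5, 8)]
# Hypothetical human scores, generated here from a known equation so the
# fit can be checked: score = 2 + 3 * NPs + 0.5 * VPs.
train_scores = [2 + 3 * np_count + 0.5 * vp_count
                for np_count, vp_count in train_features]

coeffs = fit_scoring_equation(train_features, train_scores)
new_essay_score = predict(coeffs, (3, 4))  # score for an unmarked essay
```

With real human grades the fit would not be exact, and the residual error is what the evaluation later in this section measures.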
In a large-scale test of MarkIT, 390 essays handwritten by Year 10 high school students on the topic of "The School Leaving Age" were transcribed to Microsoft Word document format. Each essay was graded on a number of categories by three different human graders. The essays and scores were forwarded to the MarkIT development team at the School of Information Systems at Curtin University of Technology for processing. The model answer was chosen from amongst the essays by selecting the one with the highest average score across the three human graders; this essay scored 48.5 out of a possible 54, an overall score of approximately 90%.
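The selection rule for the model answer is simple enough to sketch directly. The essay identifiers and individual grader scores below are invented; only the 54-point maximum and the 48.5 average come from the text.

```python
MAX_SCORE = 54  # maximum possible human score, as stated in the text

def select_model_answer(grades):
    """Return (essay_id, mean_score) for the essay with the highest mean grade.

    `grades` maps an essay identifier to the list of scores it received.
    """
    averages = {essay_id: sum(s) / len(s) for essay_id, s in grades.items()}
    best = max(averages, key=averages.get)
    return best, averages[best]

# Hypothetical grades from three graders for three essays.
grades = {
    "essay_017": [48, 49, 48.5],
    "essay_102": [40, 44, 42],
    "essay_233": [31, 29, 35],
}

best_id, best_mean = select_model_answer(grades)
best_percent = 100 * best_mean / MAX_SCORE  # 48.5 / 54 is just under 90%
```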
Figure 1 shows the variation between the first two graders on 200 essays. The essay scores are arranged in ascending order of one of the human-assigned grades. Note the substantial disagreement in the scores for some essays.
Figure 1: Comparison of Two Human Grader Scores on 200 Essays
The mean score given by Human1 for these 200 essays was 27.74, while the mean score given by Human2 was 30.37, a difference of 2.63. The correlation between the two graders' scores was 0.80. The mean absolute difference between them was 5.22, representing an average error rate of 9.67% when scored out of 54 (the maximum possible human score). After a scoring algorithm was built using these 200 essays as training data, the remaining 190 essays were scored by MarkIT. Figure 2 shows the results, arranged in ascending order of the computer-assigned score.
Figure 2: Results of Computer Scoring of 190 Essays vs Average Human
The mean of the averaged human grades for these 190 essays was 31.41, while the mean grade given by the computer was 29.62, a difference of 1.79. The correlation between the human and computer grades was 0.75. The mean absolute difference between the two was 4.39, representing an average error rate of 8.14% when scored out of 54 (the maximum possible human score).
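The agreement statistics used in both comparisons (difference in means, correlation, mean absolute difference, and error rate as a percentage of the 54-point maximum) can be computed as in the following sketch. The correlation is assumed here to be the Pearson coefficient, which the text does not state explicitly, and the two score lists are invented for illustration.

```python
from math import sqrt

MAX_SCORE = 54  # maximum possible human score, as stated in the text

def mean(xs):
    return sum(xs) / len(xs)

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def mean_abs_diff(xs, ys):
    """Mean absolute difference between paired scores."""
    return mean([abs(x - y) for x, y in zip(xs, ys)])

def error_rate_percent(mad, max_score=MAX_SCORE):
    """Express a mean absolute difference as a percentage of the maximum score."""
    return 100.0 * mad / max_score

# Invented scores for five essays from two hypothetical graders.
grader_a = [20, 25, 30, 35, 40]
grader_b = [24, 23, 33, 36, 45]

mean_gap = abs(mean(grader_a) - mean(grader_b))
r = pearson(grader_a, grader_b)
mad = mean_abs_diff(grader_a, grader_b)
rate = error_rate_percent(mad)
```

Applied to the reported human-versus-human mean absolute difference, `error_rate_percent(5.22)` gives the 9.67% figure quoted for Figure 1.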
The agreement between the computer-assigned scores and the average human scores (correlation 0.75, error rate 8.14%) was close to the agreement between the two human graders (correlation 0.80, error rate 9.67%). We can conclude that, in this particular test, MarkIT performed as well as the human graders.