In a large-scale test of MarkIT, 390 essays handwritten by Year 10 high school students on the topic of "The School Leaving Age" were transcribed to Microsoft Word document format. The essays were graded on a number of categories by three different human graders. The essays and scores were forwarded to the MarkIT development team at the School of Information Systems at Curtin University of Technology for processing. A model answer was chosen by selecting the essay with the highest average score across the three human graders; this essay scored 48.5 out of a possible 54, an overall score of about 90%.
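The model-answer selection described above can be sketched as follows. This is a minimal illustration, not MarkIT's own code: the function name `select_model_answer`, the essay identifiers, and the example scores are all hypothetical; only the maximum score of 54 and the "highest average of three graders" rule come from the text.

```python
from statistics import mean

MAX_SCORE = 54  # maximum possible human score, as stated in the text


def select_model_answer(scores):
    """Return (essay_id, average) for the essay with the highest mean score.

    `scores` maps a hypothetical essay identifier to the list of totals
    awarded by each human grader.
    """
    averages = {essay_id: mean(marks) for essay_id, marks in scores.items()}
    best_id = max(averages, key=averages.get)
    return best_id, averages[best_id]


# Illustrative data only; these are NOT the scores from the study.
example_scores = {
    "essay_001": [30, 28, 33],
    "essay_002": [48, 49, 48.5],
    "essay_003": [41, 39, 44],
}

best_id, best_avg = select_model_answer(example_scores)
percentage = 100 * best_avg / MAX_SCORE
print(best_id, best_avg, round(percentage, 1))
```

With an average of 48.5 out of 54, the percentage works out to roughly 89.8%, which the text rounds to 90%.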
Figure 1 shows the variation between the first two graders on 200 essays. The essay scores are arranged in ascending order of the grade assigned by one of the human graders. Note the substantial disagreement in the scores for some essays.
Figure 1: Comparison of Two Human Grader Scores on 200 Essays
The mean score given by Human1 for these 200 essays was 27.74, while the mean score given by Human2 was 30.37, a difference of 2.63. The correlation between the two graders' scores was 0.80. The mean absolute difference between the two was 5.22, representing an average error rate of 9.67% when scored out of 54 (the maximum possible human score). After a scoring algorithm was built using these 200 essays as training data, the remaining 190 essays were scored by MarkIT. Figure 2 shows the results, arranged in ascending order of the computer-assigned score.
Figure 2: Results of Computer Scoring of 190 Essays vs Average Human
The mean score for the human average grade for these 190 essays was 31.41, while the mean grade given by the computer was 29.62, a difference of 1.79. The correlation between the human and computer grades was 0.75. The mean absolute difference between the two was 4.39, representing an average error rate of 8.14% when scored out of 54 (the maximum possible human score).
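The agreement between the computer-assigned scores and the human average (correlation 0.75, mean absolute difference 4.39) was close to the agreement between the two human graders (correlation 0.80, mean absolute difference 5.22), and the error rates were similar. We can conclude that, in this particular test, MarkIT performed about as well as the human graders.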
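The agreement statistics reported for both comparisons (mean scores, correlation, mean absolute difference, and error rate out of 54) can be reproduced with a short sketch like the one below. It assumes the correlation is the Pearson coefficient, which the text does not state explicitly; the function name `agreement_stats` and the sample scores are hypothetical.

```python
from math import sqrt

MAX_SCORE = 54  # maximum possible human score, as stated in the text


def agreement_stats(a, b, max_score=MAX_SCORE):
    """Compare two equal-length score lists (assumes Pearson correlation)."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    mad = sum(abs(x - y) for x, y in zip(a, b)) / n  # mean absolute difference
    return {
        "mean_a": mean_a,
        "mean_b": mean_b,
        "mean_difference": abs(mean_a - mean_b),
        "correlation": cov / sqrt(var_a * var_b),
        "mean_abs_difference": mad,
        "error_rate_pct": 100 * mad / max_score,
    }


# Illustrative scores only; these are NOT the data from the study.
grader_1 = [20, 25, 30, 35, 40]
grader_2 = [24, 26, 36, 33, 45]
stats = agreement_stats(grader_1, grader_2)
for name, value in stats.items():
    print(f"{name}: {value:.2f}")

# The error rate is simply the mean absolute difference as a share of 54,
# e.g. the reported human-human figure: 100 * 5.22 / 54.
human_error_rate = 100 * 5.22 / MAX_SCORE
```

The same function applies unchanged to the computer-versus-human comparison by passing the computer scores and the average human scores as the two lists.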
A key strength of MarkIT over its counterparts is its emphasis on providing feedback beyond a grade or number. Aspects of an assessed assignment that help students improve include spelling, grammar, reading ease, and grade-level statistics. These data are derived from existing technology and are incorporated by MarkIT into a comprehensive Report on each assignment. Assignment content is presented relative to model answer content as a graph of concepts that juxtaposes the concept content of the student answer with that of the model answer. This graph is interactive: one can drill down to the thesaurus level and also to the assignment level to discover where, and to some degree how, errors and omissions can be rectified.
The following figures show MarkIT's visual feedback components, which are available to the teacher and student on completion of the grading. Figure 3 shows the essay selection screen, with essay identifiers appearing in the left window.
Figure 3: Main Control Panel for Visual Feedback
When an essay identifier is selected, the screen shown in Figure 4 appears. The upper window can be toggled via the tabs to display the Student essay or the Model essay. The lower window can be toggled via the tabs to show further features:
- the grading Report (Figure 4),
- the Graph of the concept counts for Student and Model essays (Figure 5), and
- a Document Tree (Figure 7) representing the grammatical structure of the essay.
Figure 4: Selected Essay Grading Report
Figure 5 presents a graph of the 'concepts' associated with both the model answer and the student answer. Naturally, the better the correspondence between the 'concepts' in both, the better the score. If we focus on the rightmost bar, labelled "Busyness", we see that the student answer contains a frequency of 2 (vertical axis) where the model answer called for no discussion on this topic or concept. We may say the student has introduced irrelevancies into the answer; or perhaps the student has waffled and provided 'filler'.
The concept labelled "being" is a case where the model answer concept is not matched by an equal student contribution; this would correspond to a deficit in knowledge on the part of the student. The concept labelled "addition" is not matched by any student contribution, so we may say the student is unaware of this concept or content. Such visual feedback is informative to student and teacher alike. The teacher is able to interactively explain to the student the strengths and weaknesses of the student's answer. If the teacher double-clicks on a bar in the graph, the thesaurus text for the category represented by the bar is displayed (Figure 6). The student can then see the type of discussion that should have been devoted to the topic, and also get a good feel, from the many words in that category, for how to express that content. The instructor can also switch to the model answer at any time to demonstrate the type of response that was expected.
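The three cases read off the concept graph (a concept the model answer does not call for, a concept covered less than in the model answer, and a concept absent from the student answer) can be illustrated with a small sketch. It assumes concepts are counted by looking words up in a thesaurus; the miniature `THESAURUS`, the function names, the category labels, and the sample sentences are all hypothetical and far simpler than whatever MarkIT actually uses.

```python
import re
from collections import Counter

# Hypothetical miniature thesaurus: concept label -> words belonging to it.
THESAURUS = {
    "being": {"exist", "life", "reality"},
    "addition": {"extra", "also", "more"},
    "busyness": {"busy", "work", "job"},
}


def concept_counts(text, thesaurus=THESAURUS):
    """Count how many words of `text` fall under each thesaurus concept."""
    counts = Counter()
    for word in re.findall(r"[a-z']+", text.lower()):
        for concept, members in thesaurus.items():
            if word in members:
                counts[concept] += 1
    return counts


def compare_concepts(student, model):
    """Label each concept by how the student count relates to the model count."""
    report = {}
    for concept in sorted(set(student) | set(model)):
        s, m = student[concept], model[concept]  # Counter gives 0 if absent
        if m == 0:
            report[concept] = "irrelevant"  # not called for by the model answer
        elif s == 0:
            report[concept] = "missing"     # student did not address it
        elif s < m:
            report[concept] = "deficit"     # addressed, but less than the model
        else:
            report[concept] = "covered"
    return report


# Illustrative texts only.
model_counts = concept_counts("Life and reality also demand more.")
student_counts = concept_counts("My life is busy with work.")
report = compare_concepts(student_counts, model_counts)
print(report)
```

In this toy example "busyness" is flagged as irrelevant, "being" as a deficit, and "addition" as missing, mirroring the three bars discussed above.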
In the upper window of Figures 5 and 7 one can see some highlighted or underlined words in the selected assignment or model answer. This feature visually marks the words in the essay that are associated with the 'concept' selected by the user from the graph in the lower window. This matching of a 'concept' with the words in the assignment that belong to it is an excellent learning aid.
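The word-marking feature can be approximated in a few lines. The sketch below is only an illustration under stated assumptions: the function `highlight_concept`, the asterisk marker, and the word set for "being" are hypothetical, and MarkIT's own display uses highlighting or underlining in its interface rather than text markers.

```python
import re


def highlight_concept(text, concept_words, marker="*"):
    """Wrap every word of `text` that belongs to the concept in `marker`."""
    def mark(match):
        word = match.group(0)
        if word.lower() in concept_words:
            return f"{marker}{word}{marker}"
        return word

    return re.sub(r"[A-Za-z']+", mark, text)


# Hypothetical word list for the concept "being".
being_words = {"exist", "life", "reality"}
marked = highlight_concept("Life after school is the reality.", being_words)
print(marked)
```

Matching is case-insensitive here so that a concept word at the start of a sentence is still marked.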
Figure 5: Concept Frequencies
Figure 6: Thesaurus Entry for a Chosen Concept (being)
Figure 7: Document Tree (Semantic Content Structure)