Automatic Essay Grading: An Exciting - But Not Yet Perfect - Tool

Mar 01, 2013

In April 2012, the Hewlett Foundation hosted a competition on Kaggle, the predictive modeling competition site, to find the most effective system for automated grading of essays. These systems are based on principles of artificial intelligence, and those of us in the artificial intelligence community are always excited to see useful applications of our work. Our excitement, however, should be tempered by the recognition of some unintended consequences of automated essay grading.

In 2001, when I took my first GRE, the exam consisted of three modules: quantitative, verbal, and analytical reasoning. As a computer scientist I excelled in and enjoyed the analytical reasoning – solving all those puzzles, wondering about the sequence of ships docking at harbors. So when I took the GRE again in 2009, I was shocked to discover that my favorite section had been changed. The Educational Testing Service (ETS), which administers the exam, had decided that logical reasoning can be better demonstrated through essay writing. Like the ETS, I believe that the analysis of a person’s critical writing can be a good test of reasoning skills. The only downside is that essays do not lend themselves to automated grading as well as the previous format did.

Researchers have been working on automatic essay grading techniques since the early 1990s, but only recently have these systems come into use. They do indeed perform as well as a human grader in the sense that they assign the same or similar grades as humans. However, the problem lies in how they “learn” what is an A-worthy effort. Automated essay grading can be reduced to statistical pattern matching. By analyzing many essays that human graders have deemed worthy of an A, B, and so on, these systems learn the features that distinguish one grade from another. These features include average words per sentence, capitalization, average syllables per word, and the words’ context. Automatic essay grading systems are distinguished by whether they focus on the content of the essays or the style – syntax, mechanics, and word choice, among other factors.
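To make the pattern matching concrete, here is a minimal Python sketch of the kind of surface features described above, plus a toy "learner" that averages the features of human-graded essays and assigns a new essay the grade whose average it sits closest to. Everything in it is an assumption made for illustration: the function names, the crude vowel-run syllable count, and the nearest-centroid rule are mine, not the method of any competition entry or commercial grader.

```python
import math
import re
from collections import defaultdict


def count_syllables(word: str) -> int:
    """Very rough estimate: count runs of consecutive vowels (a heuristic)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def surface_features(essay: str) -> tuple:
    """Return (avg words per sentence, capitalized-word ratio, avg syllables per word)."""
    sentences = [s for s in re.split(r"[.!?]+", essay) if s.strip()]
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", essay)
    if not sentences or not words:
        return (0.0, 0.0, 0.0)
    return (
        len(words) / len(sentences),
        sum(w[0].isupper() for w in words) / len(words),
        sum(count_syllables(w) for w in words) / len(words),
    )


def train_centroids(graded_essays) -> dict:
    """Average the feature vectors of the essays that humans gave each grade.

    graded_essays is an iterable of (essay_text, grade) pairs.
    """
    by_grade = defaultdict(list)
    for text, grade in graded_essays:
        by_grade[grade].append(surface_features(text))
    return {
        grade: tuple(sum(column) / len(vectors) for column in zip(*vectors))
        for grade, vectors in by_grade.items()
    }


def predict_grade(essay: str, centroids: dict):
    """Assign the grade whose average feature vector is nearest (Euclidean distance)."""
    features = surface_features(essay)
    return min(centroids, key=lambda grade: math.dist(features, centroids[grade]))
```

With this sketch, `surface_features("Paris is in France.")` and `surface_features("Paris is in Greece.")` come out identical, since nothing in these features looks at whether a statement is true.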

Herein lies the problem with the system. Such systems often can’t even check facts, so it is difficult to find an automatic way to get to the heart of the essay: the content and the reasoning strategies that the writer demonstrates. Of course we would like to capture those things in an algorithm, but the science is not quite there yet. So what are the implications of using such systems, and why do we really care whether a computer or a human grades an essay? These systems are being designed in the United States, and ostensibly we are training our algorithms to learn what we American graders have deemed the standard. So these systems are codifying a systemic bias toward mainstream American writing and, more importantly, American reasoning. This is an example of adopting the flaws of our analog systems and transferring them to new digital technologies. The promise of artificial intelligence is to go beyond what we have in the analog and design a future that ought to be, not the one that is. With automated essay grading, we have a unique opportunity to move toward diversity. We the designers should be building systems capable of recognizing a multiplicity of genius.

Research has shown time and again that diverse teams outperform homogeneous teams. The same may hold for reasoning. The problems that we face in the 21st century are of such staggering complexity that we need diversity of perspectives and reasoning in order to solve them. If we design systems that cannot recognize diversity in reasoning because they reduce intelligence to word length and the average syllables per word in an essay, we are filtering out those students who have a different way of writing, which might be indicative of a different way of thinking.

As ETS noted in a 2010 paper, “assessment of creativity, poetry, irony, or other more artistic uses of writing is beyond” most automatic grading systems, even if they’re able to understand the semantics and grammar of a paper well. ETS also noted that different demographics in the author population affect how accurately systems reflect human scores. “Of course, automatic systems are most commonly used in conjunction with human graders, as they should be. And if there is a discrepancy between grades assigned by humans and by automatic systems, another human grader is often called in. Automatic systems provide a quick, cheap way to provide consistency in an area that often does not have enough time or resources.”
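The workflow described above, in which a second human reader is called in when the human and machine grades disagree, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `resolve_grade`, the `max_gap` threshold of one point, and the choice to average scores are all hypothetical, not a documented ETS procedure.

```python
from typing import Callable


def resolve_grade(
    human: float,
    machine: float,
    second_reader: Callable[[], float],
    max_gap: float = 1.0,  # assumed tolerance; real programs set their own
) -> float:
    """Combine a human and a machine score, escalating on disagreement."""
    if abs(human - machine) <= max_gap:
        # Close enough: average the two scores (an assumed rule).
        return (human + machine) / 2
    # Too far apart: ask another human and set the machine score aside.
    return (human + second_reader()) / 2
```

The second reader is passed in as a function so that the extra human effort is only spent when a discrepancy actually occurs.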

I believe the way the Hewlett Foundation is going about solving this problem is helpful. They have decided to crowdsource the solution on Kaggle. What is surprising is that the first, second, and third place winners of the competition had participated in Andrew Ng’s MOOC Machine Learning class at Stanford. This is another example of artificial intelligence being used to create the world that ought to be versus the world that is.

This post is cross-posted on the ICSI blog.