Automated essay scoring (AES) is the use of specialized computer programs to assign grades to essays written in an educational setting. It is a method of educational assessment and an application of natural language processing. Its objective is to classify a large set of textual entities into a small number of discrete categories, corresponding to the possible grades—for example, the numbers 1 to 6. Therefore, it can be considered a problem of statistical classification.
Several factors have contributed to a growing interest in AES. Among them are cost, accountability, standards, and technology. Rising education costs have led to pressure to hold the educational system accountable for results by imposing standards. The advance of information technology promises to measure educational achievement at reduced cost.
The use of AES for high-stakes testing in education has generated significant backlash, with opponents pointing to research suggesting that computers cannot yet grade writing accurately and arguing that their use for such purposes promotes teaching writing in reductive ways (i.e., teaching to the test).
Most historical summaries of AES trace the origins of the field to the work of Ellis Batten Page. In 1966, he argued for the possibility of scoring essays by computer, and in 1968 he published his successful work with a program called Project Essay Grade™ (PEG™). Using the technology of that time, computerized essay scoring would not have been cost-effective, so Page set his efforts aside for about two decades.
By 1990, desktop computers had become so powerful and so widespread that AES was a practical possibility. As early as 1982, a UNIX program called Writer's Workbench was able to offer punctuation, spelling, and grammar advice. In collaboration with several companies (notably Educational Testing Service), Page updated PEG and ran some successful trials in the early 1990s.
Peter Foltz and Thomas Landauer developed a system using a scoring engine called the Intelligent Essay Assessor™ (IEA). IEA was first used to score essays in 1997 for their undergraduate courses. It is now a product from Pearson Educational Technologies and is used for scoring within a number of commercial products and state and national exams.
IntelliMetric® is Vantage Learning's AES engine. Its development began in 1996. It was first used commercially to score essays in 1998.
Educational Testing Service offers e-rater®, an automated essay scoring program. It was first used commercially in February 1999. Jill Burstein was the team leader in its development. ETS's Criterion℠ Online Writing Evaluation Service uses the e-rater engine to provide both scores and targeted feedback.
Lawrence Rudner has done some work with Bayesian scoring, and developed a system called BETSY (Bayesian Essay Test Scoring sYstem). Some of his results have been published in print or online, but no commercial system incorporates BETSY as yet.
Under the leadership of Howard Mitzel and Sue Lottridge, Pacific Metrics developed a constructed response automated scoring engine, CRASE®. Currently utilized by several state departments of education and in a U.S. Department of Education-funded Enhanced Assessment Grant, Pacific Metrics’ technology has been used in large-scale formative and summative assessment environments since 2007.
Measurement Inc. acquired the rights to PEG in 2002 and has continued to develop it.
In 2012, the
Most resources for automated essay scoring are proprietary. However, with the increased activity in current research as a result of the ASAP competition, there has been an increase in open source activity.
On March 12, 2013, HumanReaders.Org launched an online petition, "Professionals Against Machine Scoring of Student Essays in High-Stakes Assessment." Within weeks, the petition gained thousands of signatures, including Noam Chomsky, and was cited in a number of newspapers, including The New York Times, and on a number of education and technology blogs.
The petition describes the use of AES for high-stakes testing as "trivial," "reductive," "inaccurate," "undiagnostic," "unfair," and "secretive."
In a detailed summary of research on AES, the petition site notes, "RESEARCH FINDINGS SHOW THAT no one—students, parents, teachers, employers, administrators, legislators—can rely on machine scoring of essays . . . AND THAT machine scoring does not measure, and therefore does not promote, authentic acts of writing."
The petition specifically addresses the use of AES for high-stakes testing and says nothing about other possible uses.
Proponents of AES point out that computer scoring is more consistent than scoring by fallible human raters and can provide students with instant feedback for formative assessment.
AES has been criticized on various grounds. Yang et al. mention "the overreliance on surface features of responses, the insensitivity to the content of responses and to creativity, and the vulnerability to new types of cheating and test-taking strategies." Several critics are concerned that students' motivation will be diminished if they know that no human will read their writing. Among the most telling critiques are reports of intentionally gibberish essays being given high scores.
In current practice, high-stakes assessments such as the GMAT are always scored by at least one human. AES is used in place of a second rater. A human rater resolves any disagreements of more than one point.
Numerous researchers have reported that their AES systems can, in fact, do better than a human. Page made this claim for PEG in 1994. Scott Elliot said in 2003 that IntelliMetric typically outperformed human scorers.
Inter-rater agreement can now be applied to measuring the computer's performance. A set of essays is given to two human raters and an AES program. If the computer-assigned scores agree with one of the human raters as well as the raters agree with each other, the AES program is considered reliable. Alternatively, each essay is given a "true score" by taking the average of the two human raters' scores, and the two humans and the computer are compared on the basis of their agreement with the true score. This is basically a form of Turing test: by their scoring behavior, can a computer and a human be told apart?
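The "true score" comparison can be illustrated with a minimal sketch. The source does not say which agreement measure is used, so mean absolute deviation from the true score is an assumption here, as are the function name and the toy scores.

```python
def mean_abs_deviation(scores, true_scores):
    """Average distance between a rater's scores and the true scores."""
    return sum(abs(s - t) for s, t in zip(scores, true_scores)) / len(scores)


# Toy data (hypothetical): four essays scored on a 1-6 scale.
human_1 = [4, 3, 5, 2]
human_2 = [4, 4, 5, 3]
machine = [4, 3, 4, 3]

# The "true score" of each essay is the average of the two human scores.
true_scores = [(a + b) / 2 for a, b in zip(human_1, human_2)]

for name, scores in [("human 1", human_1), ("human 2", human_2), ("machine", machine)]:
    print(name, mean_abs_deviation(scores, true_scores))
```

Because the true score is built from the two human scores, each human is compared against a target that already contains their own rating, which is worth keeping in mind when reading such comparisons.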
Percent agreement is a simple statistic applicable to grading scales with scores from 1 to n, where usually 4 ≤ n ≤ 6. It is reported as three figures, each a percent of the total number of essays scored: exact agreement (the two raters gave the essay the same score), adjacent agreement (the raters differed by at most one point; this includes exact agreement), and extreme disagreement (the raters differed by more than two points). Expert human graders were found to achieve exact agreement on 53% to 81% of all essays, and adjacent agreement on 97% to 100%.
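As a rough illustration, the three percent-agreement figures could be computed as follows. The function name and the sample scores are hypothetical; the thresholds follow the definitions given above.

```python
def percent_agreement(scores_a, scores_b):
    """Return (exact, adjacent, extreme) agreement as percentages of all essays."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two equal-length, non-empty score lists")
    diffs = [abs(a - b) for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    exact = 100.0 * sum(d == 0 for d in diffs) / n     # same score
    adjacent = 100.0 * sum(d <= 1 for d in diffs) / n  # within one point, includes exact
    extreme = 100.0 * sum(d > 2 for d in diffs) / n    # more than two points apart
    return exact, adjacent, extreme


# Toy data (hypothetical): ten essays scored by two raters on a 1-6 scale.
rater_1 = [4, 3, 5, 2, 6, 4, 3, 1, 4, 5]
rater_2 = [4, 4, 5, 2, 5, 4, 3, 4, 3, 5]
print(percent_agreement(rater_1, rater_2))  # (60.0, 90.0, 10.0)
```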
Various statistics have been proposed to measure inter-rater agreement. Among them are percent agreement, Scott's π, Cohen's κ, Krippendorff's α, Pearson's correlation coefficient r, Spearman's rank correlation coefficient ρ, and Lin's concordance correlation coefficient.
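To make one of these statistics concrete, here is a minimal sketch of unweighted Cohen's κ, which corrects the observed exact agreement for the agreement expected by chance. The function name and the sample scores are hypothetical, and the sketch is not drawn from any particular AES system, which may use a weighted variant instead.

```python
from collections import Counter


def cohens_kappa(scores_a, scores_b):
    """Unweighted Cohen's kappa for two raters scoring the same essays."""
    n = len(scores_a)
    # Observed proportion of essays on which the raters agree exactly.
    observed = sum(a == b for a, b in zip(scores_a, scores_b)) / n
    # Chance agreement, from each rater's marginal score distribution.
    count_a, count_b = Counter(scores_a), Counter(scores_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1.0 - expected)


# Toy data (hypothetical): eight essays on a 1-4 scale.
rater_1 = [1, 2, 3, 3, 2, 1, 4, 4]
rater_2 = [1, 2, 3, 2, 2, 1, 4, 3]
print(round(cohens_kappa(rater_1, rater_2), 3))  # 0.667
```

In this toy case the raters agree on 75% of essays, chance agreement is 25%, and κ is therefore (0.75 - 0.25) / (1 - 0.25), about 0.667.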
Before computers entered the picture, high-stakes essays were typically given scores by two trained human raters. If the scores differed by more than one point, a third, more experienced rater would settle the disagreement. In this system, there is an easy way to measure reliability: by inter-rater agreement. If raters do not consistently agree within one point, their training may be at fault. If a rater consistently disagrees with whichever other raters look at the same essays, that rater probably needs more training.
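The one-point rule described above can be sketched in a few lines. The function names and the `tolerance` parameter are hypothetical, and the sketch only flags essays for a third rater; it does not model how the final score is decided.

```python
def needs_adjudication(score_a, score_b, tolerance=1):
    """True if two scores differ by more than `tolerance` points."""
    return abs(score_a - score_b) > tolerance


def adjudication_rate(score_pairs, tolerance=1):
    """Fraction of essays that would be sent to a third rater."""
    flagged = sum(needs_adjudication(a, b, tolerance) for a, b in score_pairs)
    return flagged / len(score_pairs)


# Toy data (hypothetical): (first rater, second rater) scores for four essays.
pairs = [(4, 4), (3, 5), (2, 3), (6, 3)]
print([needs_adjudication(a, b) for a, b in pairs])  # [False, True, False, True]
print(adjudication_rate(pairs))                      # 0.5
```

A consistently high adjudication rate for one rater, across many different partners, is the kind of signal the paragraph above associates with a need for more training.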
Any method of assessment must be judged on validity, fairness, and reliability. An instrument is valid if it actually measures the trait that it purports to measure. It is fair if it does not, in effect, penalize or privilege any one class of people. It is reliable if its outcome is repeatable, even when irrelevant external factors are altered.
The various AES programs differ in what specific surface features they measure, how many essays are required in the training set, and most significantly in the mathematical modeling technique. Early attempts used linear regression. Modern systems may use linear regression or other machine learning techniques often in combination with other statistical techniques such as latent semantic analysis and Bayesian inference.
From the beginning, the basic procedure for AES has been to start with a training set of essays that have been carefully hand-scored. The program evaluates surface features of the text of each essay, such as the total number of words, the number of subordinate clauses, or the ratio of uppercase to lowercase letters; these are quantities that can be measured without any human insight. It then constructs a mathematical model that relates these quantities to the scores that the essays received. The same model is then applied to calculate scores of new essays.
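The procedure can be sketched with a deliberately tiny example. Everything here is an illustrative assumption rather than the method of any named system: the function names, the toy training essays, the choice to fit on word count alone, and the rounding and clamping of the prediction to a 1 to 6 scale. Real systems measure many more features and use richer models.

```python
def surface_features(text):
    """Measure simple quantities that need no human insight."""
    upper = sum(c.isupper() for c in text)
    lower = sum(c.islower() for c in text)
    return {
        "word_count": len(text.split()),
        "upper_lower_ratio": upper / lower if lower else 0.0,
    }


def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (slope, intercept)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict(model, text, low=1, high=6):
    """Apply the fitted model to a new essay and clamp to the grading scale."""
    slope, intercept = model
    raw = slope * surface_features(text)["word_count"] + intercept
    return min(high, max(low, round(raw)))


# Toy hand-scored training set (hypothetical): (essay text, human score).
training = [("word " * 50, 2), ("word " * 100, 3), ("word " * 150, 4), ("word " * 200, 5)]
counts = [surface_features(text)["word_count"] for text, _ in training]
scores = [score for _, score in training]
model = fit_line(counts, scores)

print(predict(model, "word " * 120))   # 3
print(predict(model, "word " * 1000))  # 6, clamped to the top of the scale
```

The second prediction shows why a model built only on surface features is easy to game: a longer text scores higher regardless of what it says.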
The two multi-state consortia funded by the U.S. Department of Education to develop next-generation assessments, the Partnership for Assessment of Readiness for College and Careers (PARCC) and the Smarter Balanced Assessment Consortium, are committed to the challenge of transitioning from paper-and-pencil to computer-based testing by the 2014-2015 school year. As state agencies implement the Common Core State Standards, they are making decisions about the next-generation assessments and how to accurately measure the new level of rigor. Innovative automated scoring software that can faithfully replicate how trained educators evaluate a student’s written response offers a new approach for states to meet the challenge. The program would allow easy marking for colleges.
 a claim that has since been strongly contested.