Evaluation of machine translation (MT) is a difficult task, both for humans and for automatic metrics. The main difficulty is that there is no single correct translation, but many alternative good translation options. MT systems are often evaluated using automatic metrics, which commonly rely on comparing a translation to only a single human reference translation. An alternative is some form of human evaluation, commonly ranking between systems, estimation of adequacy and fluency on some scale, or error analysis.
We have explored four different evaluation methods on output from three different statistical MT systems. The main focus is on different types of human evaluation. We compare two conventional evaluation methods, human error analysis and automatic metrics, to two less commonly used evaluation methods based on reading comprehension and eye-tracking. These two evaluations are performed without the subjects seeing the source sentence. There have been few previous attempts to use reading comprehension and eye-tracking for MT evaluation.
One example of a reading comprehension study is Fuji (1999), who conducted an experiment comparing English-to-Japanese MT with several versions of manual corrections of the system output. He found significant differences between texts with large differences on reading comprehension questions. Doherty and O’Brien (2009) is the only study we are aware of that uses eye-tracking for MT evaluation. They found that the average gaze time and fixation counts were significantly lower for sentences judged as excellent in an earlier evaluation than for bad sentences.
Like previous research, we find that both reading comprehension and eye-tracking can be useful for MT evaluation.
The results of these methods are consistent with the other methods when comparing systems with a large quality difference. For systems of similar quality, however, the different evaluation methods often do not show any significant differences.