Frequencies of ratings. The results for the frequencies of evaluations (see Table 3) suggest less frequent agreement for the dimensions of Writing Quality, Semantic Completeness, Entailment, Syntactic Similarity, Lexical Similarity, and Paraphrase Quality. For several dimensions, the most common judgment was the lowest possible rating: Garbage (96%), Frozen Expressions (95%), Irrelevant (96%), and Elaboration (92%). Ratings on the remaining dimensions were far more evenly distributed.
Differences between raters. Because the rating scale in this study ranged from 1 to 6, the maximum difference between any two raters for any one judgment is 5. The smaller the difference between raters, the greater the agreement. We therefore calculated the frequency of each level of discrepancy (i.e., 0 to 5) between the raters. The frequencies of these differences for the 10 paraphrase dimensions suggest that equivalent evaluations (a difference of 0) for Garbage, Frozen Expressions, Irrelevant, and Elaboration were extremely common (see Table 4). For the remaining dimensions, equivalent evaluations were given for 23% to 45% of the sentence pairs.
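The discrepancy calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure actually used in the study: the function name `discrepancy_frequencies`, its parameters, and the example ratings are all hypothetical, and the sketch assumes exactly two raters who each rated the same sentence pairs on one dimension.

```python
from collections import Counter


def discrepancy_frequencies(rater_a, rater_b, scale_min=1, scale_max=6):
    """Return the proportion of items at each absolute rating difference.

    Hypothetical helper: assumes two raters scored the same items, in the
    same order, on an integer scale from scale_min to scale_max.
    """
    if len(rater_a) != len(rater_b):
        raise ValueError("Both raters must rate the same number of items.")
    if not rater_a:
        raise ValueError("At least one rated item is required.")

    # On a 1-6 scale the largest possible difference is 6 - 1 = 5.
    max_diff = scale_max - scale_min

    # Count how often each absolute difference occurs.
    counts = Counter(abs(a - b) for a, b in zip(rater_a, rater_b))

    # Report every level from 0 to max_diff, including levels never observed.
    n_items = len(rater_a)
    return {d: counts.get(d, 0) / n_items for d in range(max_diff + 1)}


# Made-up ratings for one dimension, for illustration only.
example_rater_a = [1, 1, 2, 6, 3, 1, 4, 1]
example_rater_b = [1, 2, 2, 1, 3, 1, 2, 1]
example_freqs = discrepancy_frequencies(example_rater_a, example_rater_b)
print(example_freqs)
```

The entry at difference 0 is the proportion of equivalent evaluations for that dimension. Running the function once per dimension would give one row per dimension, in the manner of a table of differences between raters.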
Table 4: Frequencies of Differences Between Raters.