Kappa Values. Agreement between raters can also be assessed with Kappa (see Table 5). Kappa’s main advantage is that it corrects for chance agreement. However, Kappa is typically applied to nominal categories, whereas the ratings in this challenge are at the interval level. A linear or quadratic weighting scheme must therefore be employed so that ratings of, for example, 1 and 3 are judged as more similar than ratings of 1 and 5. For linear weighting, each step of difference is weighted equally; thus, for the six points of our rating scale, the weights for differences of 0 through 5 are 0.0, 0.2, 0.4, 0.6, 0.8, and 1.0, where equal ratings are weighted at 0.0. For quadratic weighting, greater penalty is placed on larger differences; thus, for our six-point scale the weights are 0.00, 0.36, 0.64, 0.84, 0.96, and 1.00, where equal ratings are again weighted at 0.0. For our rating scheme, the quadratic weights are more appropriate; however, we report both linear and quadratic values.
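The weighting schemes described above can be illustrated with a minimal sketch of Cohen's weighted Kappa for two raters on a 1 to 6 scale. The function names `disagreement_weights` and `weighted_kappa` are hypothetical, and the standard weighted-Kappa formula used here is an assumption about how the reported values were computed. Note that the quadratic values listed above equal 1 - (1 - d/5)^2 for a rating difference d, whereas the more common quadratic form is (d/5)^2; the sketch exposes both (as "listed" and "quadratic"), since which form underlies the reported Kappa values is not stated.

```python
import numpy as np

def disagreement_weights(k=6, scheme="linear"):
    """Return a k x k matrix of disagreement weights (0 = identical ratings)."""
    idx = np.arange(k)
    d = np.abs(idx[:, None] - idx[None, :]) / (k - 1)  # distance scaled to [0, 1]
    if scheme == "linear":
        return d
    if scheme == "quadratic":   # common textbook form: (d / (k - 1)) ** 2
        return d ** 2
    if scheme == "listed":      # reproduces 0.36, 0.64, 0.84, 0.96, 1.00
        return 1.0 - (1.0 - d) ** 2
    raise ValueError(f"unknown scheme: {scheme}")

def weighted_kappa(rater_a, rater_b, k=6, scheme="linear"):
    """Cohen's weighted Kappa for two raters who use the ratings 1..k.

    Undefined (division by zero) if both raters only ever use one
    identical category, because expected disagreement is then zero.
    """
    a = np.asarray(rater_a, dtype=int) - 1   # shift ratings to 0-based indices
    b = np.asarray(rater_b, dtype=int) - 1
    observed = np.zeros((k, k))
    for i, j in zip(a, b):
        observed[i, j] += 1
    observed /= observed.sum()               # joint proportions
    # Chance-expected proportions from the two raters' marginals.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0))
    w = disagreement_weights(k, scheme)
    return 1.0 - (w * observed).sum() / (w * expected).sum()
```

With disagreement weights, identical rating vectors yield a Kappa of 1, and each disagreement lowers the value in proportion to its weight.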
Table 5: Kappa Evaluations for Paraphrase Dimensions of Garbage (Gar), Frozen Expressions (Frz), Irrelevant (Irr), Elaboration (Elb), Writing Quality (WQ), Entailment (Ent), Syntactic Similarity (Syn), Lexical Similarity (Lex), Semantic Similarity (Sem), and Paraphrase Quality (PQ).

| Kappa | Gar | Frz | Irr | Elb | WQ | Ent | Syn | Lex | Sem | PQ |
|---|---|---|---|---|---|---|---|---|---|---|
| Linear | 0.94 | 0.83 | 0.54 | 0.25 | 0.15 | 0.50 | 0.25 | 0.45 | 0.56 | 0.28 |
| Quadratic | 0.94 | 0.83 | 0.57 | 0.35 | 0.26 | 0.67 | 0.43 | 0.62 | 0.71 | 0.43 |

Inter-Variable Correlations. As a final assessment of inter-rater agreement, Table 6 reports the correlations between the paraphrase dimensions. The results demonstrate that raters view Semantic similarity and Entailment as very similar (r = .94, p < .01). Paraphrase quality also seems to be highly related to Semantic similarity (r = .78, p < .01) and Entailment (r = .76, p < .01). However, Paraphrase quality has a low correlation with Lexical similarity (r = .34, p < .01) and no significant correlation with Syntactic similarity.
Table 6: Correlations for the Paraphrase Dimensions of Garbage (Gar), Irrelevant (Irr), Writing Quality (WQ), Entailment (Ent), Syntactic Similarity (Syn), Lexical Similarity (Lex), Semantic Similarity (Sem), and Paraphrase Quality (PQ).

| | Irr | Sem | Ent | Syn | Lex | PQ | WQ |
|---|---|---|---|---|---|---|---|
| Gar | -0.03 | -0.35** | -0.37** | -0.24** | -0.46** | -0.32** | -0.61** |
| Irr | | -0.34** | -0.36** | -0.23** | -0.44** | -0.31** | -0.16** |
| Sem | | | 0.94** | 0.42** | 0.65** | 0.79** | 0.52** |
| Ent | | | | 0.40** | 0.62** | 0.76** | 0.51** |
| Syn | | | | | 0.57** | -0.05* | 0.24** |
| Lex | | | | | | 0.44** | 0.49** |
| PQ | | | | | | | 0.52** |

Note: N = 1998; ** = p < .01; * = p < .05; all correlations for Elaboration are r < .22, and for Frozen Expressions r < .10.
Performance Results
The final gold standard will be used to assess the success of computational algorithms. The gold standard for the 10 paraphrase dimensions is a combination of the rater evaluations. Although raters demonstrated significant agreement across all paraphrase dimensions, differences between judgments were occasionally quite large; for example, 31 protocols had a difference of 5 for Entailment evaluations. To arrive at a final gold standard, two of the three raters resolved disagreements according to the following criteria. If the difference between ratings was greater than 3, the two raters, working together, re-evaluated the sentence pair and could assign any value between 1 and 6 to that dimension, regardless of the previous ratings. For differences of 3, one of the raters re-evaluated the sentence pair and could select any value between the lowest and the highest previous rating. For all other differences, except for Frozen Expressions, the average of the two ratings was taken as the final value. Because Frozen Expressions was a binary variable, all differences were re-examined and a final evaluation of either 0 or 1 was selected.
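The resolution criteria above can be summarized as a small decision rule. The following is only a sketch: the `Resolution` type, the `resolve` function, and the action labels are hypothetical names introduced for illustration, and the human re-evaluation steps are represented by the range of values a rater was allowed to choose from rather than by a computed value.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Resolution:
    action: str                                # how the final value is obtained
    value: Optional[float] = None              # set only when no human review is needed
    allowed_range: Optional[Tuple[int, int]] = None  # choices open to the re-rater(s)

def resolve(r1, r2, binary=False):
    """Map a pair of ratings on one dimension to the resolution step described above."""
    low, high = min(r1, r2), max(r1, r2)
    diff = high - low
    if binary:
        # Frozen Expressions: every disagreement is re-examined, final value 0 or 1.
        if diff == 0:
            return Resolution("agree", value=float(r1))
        return Resolution("binary_review", allowed_range=(0, 1))
    if diff > 3:
        # Two raters re-evaluate together; any value from 1 to 6 is allowed.
        return Resolution("joint_review", allowed_range=(1, 6))
    if diff == 3:
        # One rater re-evaluates, staying between the two previous ratings.
        return Resolution("single_review", allowed_range=(low, high))
    # Differences of 0, 1, or 2: the average of the two ratings is the final value.
    return Resolution("average", value=(r1 + r2) / 2)
```

For example, `resolve(3, 4)` yields an averaged value of 3.5, while `resolve(1, 6)` only signals that a joint re-evaluation is required.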
We computed correlations between the computational indices and the 10 paraphrase dimensions as scored by humans. Table 7 shows, for each paraphrase dimension, the five computational indices with the strongest correlations, ordered left to right by the strength of the correlation.
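As a rough illustration of this ranking step, the sketch below correlates each index with the human ratings for one dimension and keeps the five with the largest absolute Pearson r. The function name `top_indices` and the dictionary layout of the inputs are assumptions made for the example; they are not part of the original analysis.

```python
import numpy as np

def top_indices(index_scores, human_ratings, k=5):
    """Rank indices by |Pearson r| with the human ratings of one dimension.

    index_scores: dict mapping an index name to one value per sentence pair.
    human_ratings: one gold-standard rating per sentence pair.
    Returns a list of (name, r) tuples, strongest correlation first.
    """
    y = np.asarray(human_ratings, dtype=float)
    ranked = []
    for name, values in index_scores.items():
        x = np.asarray(values, dtype=float)
        r = np.corrcoef(x, y)[0, 1]          # Pearson correlation coefficient
        ranked.append((name, float(r)))
    # Strength is judged by magnitude, so strong negative correlations rank high.
    ranked.sort(key=lambda item: abs(item[1]), reverse=True)
    return ranked[:k]
```

Sorting by absolute value matters because several indices, such as length differences, are expected to correlate negatively with the human ratings.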
Table 7: Five Highest Correlating Computational Indices for 10 Dimensions of Paraphrase

| Dimension | 1st | 2nd | 3rd | 4th | 5th |
|---|---|---|---|---|---|
| Garbage | Stem | LSA | Len (dif) | Ent (F) | Ent (A) |
| | -0.68 | -0.48 | 0.44 | -0.43 | -0.41 |
| Frozen Expressions | MED (M) | Len (T-R) | Len (R) | MED (V) | Ent (F) |
| | 0.19 | -0.17 | 0.14 | 0.12 | -0.11 |
| Irrelevant | Stem | LSA | Ent (F) | Ent (A) | TTRc |
| | -0.50 | -0.44 | -0.37 | -0.36 | 0.33 |
| Elaboration | MED (M) | Ent (F) | Ent (A) | TTRc | Ent (R) |
| | 0.23 | -0.21 | -0.20 | 0.18 | -0.18 |
| Writing Quality | Stem | LSA | Len (dif) | Ent (A) | Ent (R) |
| | 0.54 | 0.50 | -0.46 | 0.43 | 0.42 |
| Semantic | Ent (R) | LSA | TTRc | Ent (A) | Len (dif) |
| | 0.56 | 0.56 | -0.53 | 0.53 | -0.52 |
| Entailment | LSA | Ent (R) | Ent (A) | TTRc | Stem |
| | 0.54 | 0.51 | 0.50 | -0.50 | 0.49 |
| Syntactic Similarity | MED (V) | Ent (R) | Ent (A) | TTRc | MED (M) |
| | -0.74 | 0.58 | 0.54 | -0.51 | -0.50 |
| Lexical Similarity | LSA | Ent (A) | Ent (R) | TTRc | Ent (F) |
| | 0.80 | 0.79 | 0.78 | -0.74 | 0.73 |
| Paraphrase Quality | Stem | LSA | Len (dif) | Len (T-R) | Ent (R) |
| | 0.43 | 0.41 | -0.38 | -0.34 | 0.32 |

Note: All correlations are significant at p < .001; N = 1998
Precision, Recall, and F1 Results
To calculate recall, precision, and F1, the gold standard ratings were recoded as binary variables (1-3.49 = 0 [low]; 3.50-6 = 1 [high]). The computational indices were recoded as binary variables by splitting each index at its mean value into 0 (low) and 1 (high). For the Entailer indices, values below .5 were coded as 0 (low) and all other values as 1 (high). Note that neither the mean nor the mid-point is necessarily an optimal cutoff; as such, the results in Table 8 should be considered baseline values.
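The recoding and scoring steps can be sketched as follows. The function names are hypothetical, and one detail is an assumption: the text does not say how a value exactly equal to the mean was coded, so the sketch assigns it to the high class, mirroring the stated Entailer rule (below .5 is low, otherwise high).

```python
import numpy as np

def binarize_gold(ratings, cutoff=3.5):
    """Gold standard: ratings of 3.50 and above are high (1), the rest low (0)."""
    return (np.asarray(ratings, dtype=float) >= cutoff).astype(int)

def binarize_index(values, threshold=None):
    """Index values: split at the mean by default, or at a fixed threshold
    (e.g. 0.5 for the Entailer indices). Ties go to the high class (assumption)."""
    v = np.asarray(values, dtype=float)
    t = v.mean() if threshold is None else threshold
    return (v >= t).astype(int)

def precision_recall_f1(gold, pred, label):
    """Precision, recall, and F1 for one class (label 0 = low, 1 = high)."""
    gold = np.asarray(gold)
    pred = np.asarray(pred)
    tp = int(np.sum((gold == label) & (pred == label)))
    fp = int(np.sum((gold != label) & (pred == label)))
    fn = int(np.sum((gold == label) & (pred != label)))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

Calling `precision_recall_f1` once with `label=0` and once with `label=1` gives the separate Low and High figures of the kind reported in Table 8.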
Table 8: Five Best Performing Indices for Accuracy Assessment for Seven Highest Performing Dimensions.

| Dimension | Index | Low Recall | Low Precision | Low F1 | High Recall | High Precision | High F1 |
|---|---|---|---|---|---|---|---|
| Garbage | Stem | 0.96 | 1.00 | 0.98 | 0.98 | 0.50 | 0.66 |
| | Len (dif) | 0.66 | 1.00 | 0.79 | 0.94 | 0.10 | 0.19 |
| | LSA | 0.65 | 0.99 | 0.79 | 0.85 | 0.09 | 0.17 |
| | Len (T-R) | 0.60 | 1.00 | 0.75 | 0.95 | 0.09 | 0.17 |
| | TTRc | 0.57 | 1.00 | 0.72 | 1.00 | 0.09 | 0.16 |
| Semantic | Len (dif) | 0.63 | 0.52 | 0.57 | 0.75 | 0.82 | 0.78 |
| | TTRc | 0.70 | 0.47 | 0.56 | 0.65 | 0.83 | 0.73 |
| | LSA | 0.58 | 0.48 | 0.53 | 0.72 | 0.80 | 0.76 |
| | Stem | 0.25 | 0.96 | 0.40 | 1.00 | 0.75 | 0.86 |
| | Ent (F) | 0.66 | 0.43 | 0.52 | 0.62 | 0.80 | 0.70 |
| Entailment | Len (dif) | 0.64 | 0.49 | 0.56 | 0.74 | 0.84 | 0.79 |
| | Stem | 0.27 | 0.96 | 0.42 | 1.00 | 0.78 | 0.87 |
| | TTRc | 0.72 | 0.44 | 0.55 | 0.65 | 0.85 | 0.74 |
| | LSA | 0.58 | 0.44 | 0.50 | 0.71 | 0.81 | 0.76 |
| | Ent (F) | 0.67 | 0.41 | 0.51 | 0.62 | 0.83 | 0.71 |
| Syntactic | MED (V) | 0.72 | 0.95 | 0.82 | 0.88 | 0.53 | 0.66 |
| | Ent (R) | 0.78 | 0.86 | 0.82 | 0.66 | 0.51 | 0.57 |
| | Ent (A) | 0.64 | 0.87 | 0.74 | 0.73 | 0.42 | 0.53 |
| | TTRc | 0.55 | 0.89 | 0.68 | 0.80 | 0.39 | 0.52 |
| | Ent (F) | 0.55 | 0.87 | 0.67 | 0.76 | 0.37 | 0.50 |
| Lexical | LSA | 0.76 | 0.58 | 0.66 | 0.78 | 0.89 | 0.83 |
| | TTRc | 0.85 | 0.52 | 0.65 | 0.70 | 0.92 | 0.79 |
| | Ent (F) | 0.85 | 0.51 | 0.64 | 0.68 | 0.92 | 0.78 |
| | Len (dif) | 0.67 | 0.52 | 0.58 | 0.75 | 0.86 | 0.80 |
| | Ent (A) | 0.92 | 0.47 | 0.63 | 0.60 | 0.95 | 0.74 |
| Paraphrase Quality | Len (dif) | 0.48 | 0.60 | 0.53 | 0.73 | 0.62 | 0.67 |
| | TTRc | 0.55 | 0.55 | 0.55 | 0.62 | 0.62 | 0.62 |
| | LSA | 0.45 | 0.57 | 0.50 | 0.70 | 0.60 | 0.65 |
| | MED (M) | 0.56 | 0.53 | 0.54 | 0.57 | 0.60 | 0.59 |
| | Ent (F) | 0.53 | 0.52 | 0.53 | 0.59 | 0.59 | 0.59 |
| Writing Quality | Stem | 0.42 | 0.72 | 0.53 | 0.97 | 0.91 | 0.94 |
| | Len (dif) | 0.73 | 0.27 | 0.39 | 0.69 | 0.94 | 0.80 |
| | LSA | 0.65 | 0.24 | 0.35 | 0.67 | 0.92 | 0.78 |
| | TTRc | 0.81 | 0.24 | 0.37 | 0.60 | 0.95 | 0.74 |
| | Ent (F) | 0.79 | 0.23 | 0.35 | 0.58 | 0.95 | 0.72 |
