The User-Language Paraphrase Challenge

Date	27.01.2017

Kappa Values. Agreement between raters can also be observed via Kappa results (see Table 5). Kappa’s main advantage is that it corrects for chance agreement. However, Kappa is typically computed for nominal categories, whereas in this challenge the ratings are at the interval level. As such, either a linear or a quadratic weighting scheme must be employed to ensure that ratings of, for example, 1 and 3 are judged as more similar than ratings of 1 and 5. For linear weighting, each one-point step in the difference between ratings is weighted equally; thus, for the six intervals in our scheme, the following weights would apply: 0.0, 0.2, 0.4, 0.6, 0.8, and 1.0, where equal ratings would be weighted at 0.0. For quadratic weighting, greater penalty is placed on larger differences; thus, for our six intervals the weights are: 0.00, 0.36, 0.64, 0.84, 0.96, and 1.0, where equal ratings would again be weighted at 0.0. For our rating scheme, the quadratic weights are more appropriate; however, we report both linear and quadratic values.


Table 5: Kappa Evaluations for Paraphrase Dimensions of Garbage (Gar), Frozen Expressions (Frz), Irrelevant (Irr), Elaboration (Elb), Writing Quality (WQ), Entailment (Ent), Syntactic Similarity (Syn), Lexical Similarity (Lex), Semantic Similarity (Sem), and Paraphrase Quality (PQ).

Kappa	Gar	Frz	Irr	Elb	WQ	Ent	Syn	Lex	Sem	PQ
Linear	0.94	0.83	0.54	0.25	0.15	0.50	0.25	0.45	0.56	0.28
Quadratic	0.94	0.83	0.57	0.35	0.26	0.67	0.43	0.62	0.71	0.43
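
The two weighting schemes described above can be sketched in a few lines of Python. This is a minimal illustration rather than the procedure used to produce Table 5: the function names (disagreement_weights, weighted_kappa) are hypothetical, and the quadratic form used here is the standard (d/5)^2, which gives 0.00, 0.04, 0.16, 0.36, 0.64, and 1.00 for six intervals. The quadratic series quoted in the text corresponds instead to 1 - ((5 - d)/5)^2, so which form was applied for Table 5 is an assumption.

```python
from collections import Counter


def disagreement_weights(k, scheme="linear"):
    """Disagreement weight for each rating distance d = 0 .. k-1.

    Assumption: standard forms d/(k-1) (linear) and (d/(k-1))**2 (quadratic).
    """
    span = k - 1
    if scheme == "linear":
        return [d / span for d in range(k)]
    if scheme == "quadratic":
        return [(d / span) ** 2 for d in range(k)]
    raise ValueError(f"unknown weighting scheme: {scheme}")


def weighted_kappa(rater_a, rater_b, categories, scheme="linear"):
    """Weighted kappa for two raters over an ordered list of categories."""
    categories = list(categories)
    position = {c: i for i, c in enumerate(categories)}
    w = disagreement_weights(len(categories), scheme)
    n = len(rater_a)

    # Mean weighted disagreement actually observed between the two raters.
    observed = sum(
        w[abs(position[a] - position[b])] for a, b in zip(rater_a, rater_b)
    ) / n

    # Mean weighted disagreement expected by chance from the marginal counts.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    expected = sum(
        count_a[x] * count_b[y] * w[abs(position[x] - position[y])]
        for x in categories
        for y in categories
    ) / (n * n)

    if expected == 0:
        # Both raters used one identical category throughout; kappa is undefined.
        return float("nan")
    return 1.0 - observed / expected
```

Calling weighted_kappa(ratings_1, ratings_2, range(1, 7), scheme="quadratic") on two equal-length lists of 1-6 ratings would give one cell of a table like Table 5; the variable names here are placeholders.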

Inter-Variable Correlations. As a final assessment of inter-rater agreement, Table 6 reports the correlations between the paraphrase dimensions. The results demonstrate that raters view Semantic similarity and Entailment as very similar (r = .94, p < .01). Paraphrase quality also seems to be highly related to Semantic similarity (r = .78, p < .01) and Entailment (r = .76, p < .01). However, Paraphrase quality has a low correlation with Lexical similarity (r = .34, p < .01) and no significant correlation with Syntactic similarity.

Table 6: Correlations for the Paraphrase Dimensions of Garbage (Gar), Irrelevant (Irr), Writing Quality (WQ), Entailment (Ent), Syntactic Similarity (Syn), Lexical Similarity (Lex), Semantic Similarity (Sem), and Paraphrase Quality (PQ).

	Irr	Sem	Ent	Syn	Lex	PQ	WQ
Gar	-0.03	-0.35**	-0.37**	-0.24**	-0.46**	-0.32**	-0.61**
Irr		-0.34**	-0.36**	-0.23**	-0.44**	-0.31**	-0.16**
Sem			0.94**	0.42**	0.65**	0.79**	0.52**
Ent				0.40**	0.62**	0.76**	0.51**
Syn					0.57**	-0.05*	0.24**
Lex						0.44**	0.49**
PQ							0.52**

Note: N = 1998; ** = p < .01; * = p < .05; All correlations for Elaboration r < .22, for Frozen Expressions r < .10

Performance Results
The final gold standard is used to assess the success of the computational algorithms. For each of the 10 paraphrase dimensions, the gold standard is a combination of the rater evaluations. Although raters demonstrated significant agreement across all paraphrase dimensions, differences between judgments were occasionally quite large; for example, 31 protocols had a difference of 5 for Entailment evaluations. To arrive at a final gold standard, the ratings were reconciled according to the following criteria. If the difference between ratings was greater than 3, two of the three raters (working together) re-evaluated the sentence pair and, whatever the previous ratings for that dimension, could assign any value between 1 and 6. If the difference was exactly 3, one of the raters re-evaluated the sentence pair and could select any value between the lowest and the highest previous rating. For all other differences, except those for Frozen Expressions, the average of the two ratings was taken as the final value. Because Frozen Expressions was a binary variable, all differences were re-examined and a final value of either 0 or 1 was selected.
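
The reconciliation criteria above amount to a small decision rule, sketched below in Python. The function name (resolve_rating), the action labels, and the tuple layout are hypothetical; the sketch only decides which procedure applies and computes the average where the text says an average was used. The re-evaluated values themselves came from human raters and cannot be derived from the two original ratings.

```python
def resolve_rating(r1, r2, binary=False):
    """Decide how two ratings of one sentence pair on one dimension are reconciled.

    Returns (action, value, allowed_range):
      action        -- which procedure applies (labels are illustrative)
      value         -- the final value when it can be computed, else None
      allowed_range -- (low, high) a re-rater may choose from, else None
    """
    diff = abs(r1 - r2)

    if binary:
        # Frozen Expressions: every disagreement was re-examined by hand.
        if diff == 0:
            return ("keep", r1, None)
        return ("re_examine", None, (0, 1))

    if diff > 3:
        # Two raters re-evaluated together; any value from 1 to 6 was allowed.
        return ("joint_rerate", None, (1, 6))
    if diff == 3:
        # One rater re-evaluated, staying between the two earlier ratings.
        return ("single_rerate", None, (min(r1, r2), max(r1, r2)))

    # Differences of 0, 1, or 2: the final value is the average.
    return ("average", (r1 + r2) / 2, None)
```

For example, ratings of 3 and 4 resolve directly to 3.5, whereas ratings of 1 and 6 are only flagged for joint re-evaluation.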

We computed correlations between the computational indices and the 10 paraphrase dimensions as scored by humans. Table 7 shows the five strongest performing computational indices (ordered left to right) in terms of correlation with the paraphrase dimensions.


Table 7: Five Highest Correlating Computational Indices for 10 Dimensions of Paraphrase

Garbage	Stem	LSA	Len (dif)	Ent (F)	Ent (A)
	-0.68	-0.48	0.44	-0.43	-0.41
Frozen Expressions	MED (M)	Len (T-R)	Len (R)	MED (V)	Ent (F)
	0.19	-0.17	0.14	0.12	-0.11
Irrelevant	Stem	LSA	Ent (F)	Ent (A)	TTRc
	-0.50	-0.44	-0.37	-0.36	0.33
Elaboration	MED (M)	Ent (F)	Ent (A)	TTRc	Ent (R)
	0.23	-0.21	-0.20	0.18	-0.18
Writing Quality	Stem	LSA	Len (dif)	Ent (A)	Ent (R)
	0.54	0.50	-0.46	0.43	0.42
Semantic	Ent (R)	LSA	TTRc	Ent (A)	Len (dif)
	0.56	0.56	-0.53	0.53	-0.52
Entailment	LSA	Ent (R)	Ent (A)	TTRc	Stem
	0.54	0.51	0.50	-0.50	0.49
Syntactic Similarity	MED (V)	Ent (R)	Ent (A)	TTRc	MED (M)
	-0.74	0.58	0.54	-0.51	-0.50
Lexical Similarity	LSA	Ent (A)	Ent (R)	TTRc	Ent (F)
	0.80	0.79	0.78	-0.74	0.73
Paraphrase Quality	Stem	LSA	Len (dif)	Len (T-R)	Ent (R)
	0.43	0.41	-0.38	-0.34	0.32

Note: All correlations are significant at p < .001; N = 1998
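
A ranking like the one in Table 7 can be sketched as follows. This is an illustration under stated assumptions, not the code used for the study: the function names (pearson_r, top_indices) are hypothetical, Pearson's r is assumed as the correlation measure, and indices are ordered by the absolute value of r, which is consistent with rows such as Garbage (-0.68, -0.48, 0.44, -0.43, -0.41).

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def top_indices(human_scores, index_scores, k=5):
    """Return the k (name, r) pairs with the largest absolute correlation.

    human_scores -- gold-standard values for one paraphrase dimension
    index_scores -- dict mapping an index name to its values for the same items
    """
    correlations = {
        name: pearson_r(values, human_scores)
        for name, values in index_scores.items()
    }
    # Sort by |r| so strong negative correlations rank alongside positive ones.
    ranked = sorted(correlations.items(), key=lambda item: abs(item[1]), reverse=True)
    return ranked[:k]
```

Here index_scores would hold one list per computational index (for example, a list of LSA values), aligned item by item with the human ratings.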

Precision, Recall, and F1 Results

To calculate recall, precision, and F1 results, the gold standard paraphrase ratings were recoded as binary variables (1-3.49 = 0 [low]; 3.50-6 = 1 [high]). The computational variables were recoded as binary by finding the mean value of each index and coding values as 0 (low) or 1 (high) relative to that mean. For the Entailer indices, values below .5 were coded as 0 (low) and all other values as 1 (high). Note that neither mean values nor mid-point values are necessarily optimal cutoffs; as such, the results in Table 8 should be considered baseline values.


Table 8: Five Best Performing Indices for Accuracy Assessment for Seven Highest Performing Dimensions.

			Low			High
Dimension	Index	Recall	Precision	F1	Recall	Precision	F1
Garbage	Stem	0.96	1.00	0.98	0.98	0.50	0.66
	Len (dif)	0.66	1.00	0.79	0.94	0.10	0.19
	LSA	0.65	0.99	0.79	0.85	0.09	0.17
	Len (T-R)	0.60	1.00	0.75	0.95	0.09	0.17
	TTRc	0.57	1.00	0.72	1.00	0.09	0.16
Semantic	Len (dif)	0.63	0.52	0.57	0.75	0.82	0.78
	TTRc	0.70	0.47	0.56	0.65	0.83	0.73
	LSA	0.58	0.48	0.53	0.72	0.80	0.76
	Stem	0.25	0.96	0.40	1.00	0.75	0.86
	Ent (F)	0.66	0.43	0.52	0.62	0.80	0.70
Entailment	Len (dif)	0.64	0.49	0.56	0.74	0.84	0.79
	Stem	0.27	0.96	0.42	1.00	0.78	0.87
	TTRc	0.72	0.44	0.55	0.65	0.85	0.74
	LSA	0.58	0.44	0.50	0.71	0.81	0.76
	Ent (F)	0.67	0.41	0.51	0.62	0.83	0.71
Syntactic	MED (V)	0.72	0.95	0.82	0.88	0.53	0.66
	Ent (R)	0.78	0.86	0.82	0.66	0.51	0.57
	Ent (A)	0.64	0.87	0.74	0.73	0.42	0.53
	TTRc	0.55	0.89	0.68	0.80	0.39	0.52
	Ent (F)	0.55	0.87	0.67	0.76	0.37	0.50
Lexical	LSA	0.76	0.58	0.66	0.78	0.89	0.83
	TTRc	0.85	0.52	0.65	0.70	0.92	0.79
	Ent (F)	0.85	0.51	0.64	0.68	0.92	0.78
	Len (dif)	0.67	0.52	0.58	0.75	0.86	0.80
	Ent (A)	0.92	0.47	0.63	0.60	0.95	0.74
Paraphrase Quality	Len (dif)	0.48	0.60	0.53	0.73	0.62	0.67
	TTRc	0.55	0.55	0.55	0.62	0.62	0.62
	LSA	0.45	0.57	0.50	0.70	0.60	0.65
	MED (M)	0.56	0.53	0.54	0.57	0.60	0.59
	Ent (F)	0.53	0.52	0.53	0.59	0.59	0.59
Writing Quality	Stem	0.42	0.72	0.53	0.97	0.91	0.94
	Len (dif)	0.73	0.27	0.39	0.69	0.94	0.80
	LSA	0.65	0.24	0.35	0.67	0.92	0.78
	TTRc	0.81	0.24	0.37	0.60	0.95	0.74
	Ent (F)	0.79	0.23	0.35	0.58	0.95	0.72
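
The recoding and scoring steps behind Table 8 can be sketched as follows. This is a minimal sketch under stated assumptions: all function names are hypothetical, a value exactly equal to the mean is coded as high (the text does not say which side it falls on), and no direction reversal is applied for indices that correlate negatively with a dimension.

```python
def binarize_gold(score, cutoff=3.5):
    """Gold-standard rating: 1-3.49 -> 0 (low), 3.50-6 -> 1 (high)."""
    return 1 if score >= cutoff else 0


def binarize_at_mean(values):
    """Computational index: split at the mean (ties coded high, an assumption)."""
    mean = sum(values) / len(values)
    return [1 if v >= mean else 0 for v in values]


def binarize_entailer(value, cutoff=0.5):
    """Entailer indices: below .5 -> 0 (low), otherwise 1 (high)."""
    return 0 if value < cutoff else 1


def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def precision_recall_f1(gold, predicted, target):
    """Precision, recall, and F1 for one class (target = 0 for low, 1 for high)."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == target and p == target)
    fp = sum(1 for g, p in pairs if g != target and p == target)
    fn = sum(1 for g, p in pairs if g == target and p != target)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f1_from(precision, recall)
```

Running precision_recall_f1 once with target=0 and once with target=1 yields the Low and High halves of a row in Table 8.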
