The User-Language Paraphrase Challenge



Date: 27.01.2017

Kappa Values. Agreement between raters can also be assessed using Kappa (see Table 5). Kappa’s main advantage is that it corrects for chance agreement. However, Kappa is typically computed for nominal categories, whereas the ratings in this challenge are at the interval level. As such, either a linear or a quadratic weighting scheme must be employed to ensure that, for example, ratings of 1 and 3 are judged as more similar than ratings of 1 and 5. For linear weighting, the difference at each interval is weighted equally; thus, for the six intervals in our scheme, the following weights apply: 0.0, 0.2, 0.4, 0.6, 0.8, and 1.0, where equal ratings are weighted at 0.0. For quadratic weighting, greater penalty is placed on larger differences; thus, for our six intervals the weights are: 0.00, 0.36, 0.64, 0.84, 0.96, and 1.00, where equal ratings are again weighted at 0.0. For our rating scheme, the quadratic weights are more appropriate; however, we report both linear and quadratic values.
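As an illustration of the weighting idea described above, the following is a minimal sketch of a weighted Kappa computation for two raters on a 1-6 scale. It is not the authors' implementation. The function names (`weighted_kappa`, `linear_weight`, `quadratic_weight`) are hypothetical, and the weights follow the common convention for disagreement weights, d/(k-1) and (d/(k-1))^2, where d is the distance between two ratings and k the number of categories. Under that convention the quadratic weights for six categories are 0.00, 0.04, 0.16, 0.36, 0.64, and 1.00, which is not the sequence listed above, so the exact weighting used for Table 5 remains an assumption here.

```python
def linear_weight(i, j, k):
    """Disagreement weight that grows in equal steps with rating distance."""
    return abs(i - j) / (k - 1)


def quadratic_weight(i, j, k):
    """Disagreement weight that grows with the square of rating distance."""
    return (abs(i - j) / (k - 1)) ** 2


def weighted_kappa(ratings_a, ratings_b, categories, weight):
    """Weighted Kappa for two raters (assumed convention: 0 = full agreement).

    Undefined (division by zero) when both raters use one identical category
    for every item, because chance disagreement is then zero.
    """
    n = len(ratings_a)
    k = len(categories)
    index = {c: i for i, c in enumerate(categories)}

    # Observed joint proportions of (rater A category, rater B category).
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(ratings_a, ratings_b):
        observed[index[a]][index[b]] += 1.0 / n

    # Marginal proportions give the agreement expected by chance.
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    observed_dis = sum(weight(i, j, k) * observed[i][j]
                       for i in range(k) for j in range(k))
    expected_dis = sum(weight(i, j, k) * row[i] * col[j]
                       for i in range(k) for j in range(k))
    return 1.0 - observed_dis / expected_dis


scale = [1, 2, 3, 4, 5, 6]
linear_weights = [round(linear_weight(0, d, 6), 2) for d in range(6)]
quadratic_weights = [round(quadratic_weight(0, d, 6), 2) for d in range(6)]

# Toy ratings, invented purely for illustration.
rater_1 = [1, 2, 3, 4, 5, 6, 3, 4]
rater_2 = [1, 3, 3, 5, 5, 6, 1, 4]
kappa_linear = weighted_kappa(rater_1, rater_2, scale, linear_weight)
kappa_quadratic = weighted_kappa(rater_1, rater_2, scale, quadratic_weight)
```

With either weight function, identical ratings yield a Kappa of 1, and ratings that agree only at chance level yield a Kappa of 0.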





Table 5: Kappa Evaluations for Paraphrase Dimensions of Garbage (Gar), Frozen Expressions (Frz), Irrelevant (Irr), Elaboration (Elb), Writing Quality (WQ), Entailment (Ent), Syntactic Similarity (Syn), Lexical Similarity (Lex), Semantic Similarity (Sem), and Paraphrase Quality (PQ).


































Kappa      Gar   Frz   Irr   Elb   WQ    Ent   Syn   Lex   Sem   PQ
Linear     0.94  0.83  0.54  0.25  0.15  0.50  0.25  0.45  0.56  0.28
Quadratic  0.94  0.83  0.57  0.35  0.26  0.67  0.43  0.62  0.71  0.43

Inter-Variable Correlations. As a final assessment of inter-rater agreement, Table 6 reports the correlations between the paraphrase dimensions. The results demonstrate that raters view Semantic similarity and Entailment as very similar (r = .94, p < .01). Paraphrase quality also seems to be highly related to Semantic similarity (r = .78, p < .01) and Entailment (r = .76, p < .01). However, Paraphrase quality has a low correlation with Lexical similarity (r = .34, p < .01) and no significant correlation with Syntactic similarity.


Table 6: Correlations for the Paraphrase Dimensions of Garbage (Gar), Irrelevant (Irr), Writing Quality (WQ), Entailment (Ent), Syntactic Similarity (Syn), Lexical Similarity (Lex), Semantic Similarity (Sem), and Paraphrase Quality (PQ).




























Dimension  Irr       Sem       Ent       Syn       Lex       PQ        WQ
Gar        -0.03     -0.35**   -0.37**   -0.24**   -0.46**   -0.32**   -0.61**
Irr                  -0.34**   -0.36**   -0.23**   -0.44**   -0.31**   -0.16**
Sem                            0.94**    0.42**    0.65**    0.79**    0.52**
Ent                                      0.40**    0.62**    0.76**    0.51**
Syn                                                0.57**    -0.05*    0.24**
Lex                                                          0.44**    0.49**
PQ                                                                     0.52**

Note: N = 1998; ** = p < .01; * = p < .05; All correlations for Elaboration r < .22, for Frozen Expressions r < .10



























Performance Results
The final gold standard is what will be used to assess the success of computational algorithms. The gold standard for the 10 paraphrase dimensions is a combination of the rater evaluations. Although raters demonstrated significant agreement across all paraphrase dimensions, differences between judgments were occasionally quite large; for example, 31 protocols had a difference of 5 for Entailment evaluations. To produce a final gold standard, two of the three raters (working together) resolved differences according to the following criteria. If the difference between ratings was greater than 3, the two raters re-evaluated the sentence pair together; whatever the previous ratings for that dimension, they could assign that cell any value between 1 and 6. For differences of 3, one of the raters re-evaluated the sentence pair and could select any value between the lowest and the highest previous rating. For all other differences, except for Frozen Expressions, the average of the two ratings was taken as the final value. Because Frozen Expressions was a binary variable, all differences were re-examined and a final evaluation of either 0 or 1 was selected.
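The reconciliation criteria above can be summarized as a small decision procedure. The sketch below is illustrative only: the names `Resolution` and `resolve` are hypothetical, and the code merely routes a pair of ratings to the appropriate step. Where a human re-evaluation is required, it returns the range of permitted values rather than a final score.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Resolution:
    action: str                              # which criterion applies
    value: Optional[float]                   # final value, if no re-rating is needed
    allowed: Optional[Tuple[float, float]]   # inclusive range open to the re-rater(s)


def resolve(rating_a, rating_b, binary=False):
    """Route one pair of ratings for one dimension to a reconciliation step."""
    diff = abs(rating_a - rating_b)

    if binary:
        # Frozen Expressions: any disagreement is re-examined, final value 0 or 1.
        if diff == 0:
            return Resolution("agree", float(rating_a), None)
        return Resolution("binary_rerate", None, (0, 1))

    if diff > 3:
        # Two raters re-evaluate together; any value from 1 to 6 is permitted.
        return Resolution("joint_rerate", None, (1, 6))
    if diff == 3:
        # One rater re-evaluates, bounded by the two previous ratings.
        return Resolution("single_rerate", None,
                          (min(rating_a, rating_b), max(rating_a, rating_b)))
    # All remaining cases: the average of the two ratings is the final value.
    return Resolution("average", (rating_a + rating_b) / 2, None)


examples = [resolve(1, 6), resolve(2, 5), resolve(3, 4), resolve(0, 1, binary=True)]
```

For instance, ratings of 3 and 4 resolve directly to 3.5, whereas ratings of 2 and 5 are passed to a single rater who may choose any value from 2 to 5.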

We computed correlations between the computational indices and the 10 paraphrase dimensions as scored by humans. Table 7 shows the five strongest performing computational indices (ordered left to right) in terms of correlation with the paraphrase dimensions.






















Table 7: Five Highest Correlating Computational Indices for 10 Dimensions of Paraphrase



















Dimension             1st              2nd              3rd              4th              5th
Garbage               Stem -0.68       LSA -0.48        Len (dif) 0.44   Ent (F) -0.43    Ent (A) -0.41
Frozen Expressions    MED (M) 0.19     Len (T-R) -0.17  Len (R) 0.14     MED (V) 0.12     Ent (F) -0.11
Irrelevant            Stem -0.50       LSA -0.44        Ent (F) -0.37    Ent (A) -0.36    TTRc 0.33
Elaboration           MED (M) 0.23     Ent (F) -0.21    Ent (A) -0.20    TTRc 0.18        Ent (R) -0.18
Writing Quality       Stem 0.54        LSA 0.50         Len (dif) -0.46  Ent (A) 0.43     Ent (R) 0.42
Semantic              Ent (R) 0.56     LSA 0.56         TTRc -0.53       Ent (A) 0.53     Len (dif) -0.52
Entailment            LSA 0.54         Ent (R) 0.51     Ent (A) 0.50     TTRc -0.50       Stem 0.49
Syntactic Similarity  MED (V) -0.74    Ent (R) 0.58     Ent (A) 0.54     TTRc -0.51       MED (M) -0.50
Lexical Similarity    LSA 0.80         Ent (A) 0.79     Ent (R) 0.78     TTRc -0.74       Ent (F) 0.73
Paraphrase Quality    Stem 0.43        LSA 0.41         Len (dif) -0.38  Len (T-R) -0.34  Ent (R) 0.32

Note: All correlations are significant at p < .001; N = 1998
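The entries in Table 7 appear to be ordered by the absolute value of the correlation, regardless of sign. A minimal sketch of such a ranking, assuming Pearson correlations and using hypothetical names (`pearson_r`, `top_indices`) and invented toy data, might look as follows; it is not the procedure used to produce the table.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def top_indices(human_ratings, index_scores, k=5):
    """Return the k (name, r) pairs with the largest |r| against the human ratings."""
    scored = [(name, pearson_r(values, human_ratings))
              for name, values in index_scores.items()]
    # Sort by strength of association, ignoring direction; ties keep input order.
    scored.sort(key=lambda item: abs(item[1]), reverse=True)
    return scored[:k]


# Toy data, invented purely for illustration.
human = [1, 2, 3, 4, 5, 6]
indices = {
    "index_a": [1, 2, 3, 4, 5, 6],   # perfectly positively related
    "index_b": [6, 5, 4, 3, 2, 1],   # perfectly negatively related
    "index_c": [1, 3, 2, 5, 4, 6],   # strongly but not perfectly related
}
ranking = top_indices(human, indices, k=2)
```

Ranking on absolute value keeps strongly negative predictors, such as a length-difference measure, alongside strongly positive ones.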


Precision, Recall, and F1 Results

To calculate recall, precision, and F1 results, the gold standard paraphrase ratings were recoded as binary variables (1-3.49 = 0 [low]; 3.50-6 = 1 [high]). Computational variables were recoded as binary variables by finding the mean value and then recoding values as 0 (low) or 1 (high) relative to that mean. In the case of the Entailer indices, values < .5 were recoded as 0 (low) and all other values as 1 (high). Note that neither mean values nor mid-point values are necessarily optimal thresholds; as such, the results in Table 8 should be considered baseline values.
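The recoding and scoring steps described above can be sketched as follows. This is an illustrative baseline, not the authors' code: `binarize` and `precision_recall_f1` are hypothetical names, the data are invented, and two details are assumptions rather than statements from the text. First, values exactly at the threshold are treated as high. Second, the index is assumed to be positively oriented, so an index that correlates negatively with a dimension would need its coding reversed first.

```python
def binarize(values, threshold):
    """Recode values as 1 (high) when >= threshold, else 0 (low)."""
    return [1 if v >= threshold else 0 for v in values]


def precision_recall_f1(gold, predicted, target):
    """Precision, recall, and F1 for one class (target = 0 for low, 1 for high)."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == target and p == target)
    fp = sum(1 for g, p in pairs if g != target and p == target)
    fn = sum(1 for g, p in pairs if g == target and p != target)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Toy data, invented purely for illustration.
gold_ratings = [1.5, 2.0, 3.5, 5.0, 6.0, 3.0]      # human ratings on the 1-6 scale
index_values = [0.1, 0.6, 0.7, 0.9, 0.2, 0.3]      # scores from one computational index

gold_binary = binarize(gold_ratings, 3.5)           # 3.50-6 = high
mean_value = sum(index_values) / len(index_values)
index_binary = binarize(index_values, mean_value)   # split at the index mean

high_scores = precision_recall_f1(gold_binary, index_binary, target=1)
low_scores = precision_recall_f1(gold_binary, index_binary, target=0)
```

Scoring the low and high classes separately mirrors the layout of Table 8, where each index receives one recall, precision, and F1 triple per class.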







Table 8: Five Best Performing Indices for Accuracy Assessment for Seven Highest Performing Dimensions.

























 

 

 

Dimension           Index      Low Recall  Low Precision  Low F1  High Recall  High Precision  High F1
Garbage             Stem       0.96        1.00           0.98    0.98         0.50            0.66
Garbage             Len (dif)  0.66        1.00           0.79    0.94         0.10            0.19
Garbage             LSA        0.65        0.99           0.79    0.85         0.09            0.17
Garbage             Len (T-R)  0.60        1.00           0.75    0.95         0.09            0.17
Garbage             TTRc       0.57        1.00           0.72    1.00         0.09            0.16
Semantic            Len (dif)  0.63        0.52           0.57    0.75         0.82            0.78
Semantic            TTRc       0.70        0.47           0.56    0.65         0.83            0.73
Semantic            LSA        0.58        0.48           0.53    0.72         0.80            0.76
Semantic            Stem       0.25        0.96           0.40    1.00         0.75            0.86
Semantic            Ent (F)    0.66        0.43           0.52    0.62         0.80            0.70
Entailment          Len (dif)  0.64        0.49           0.56    0.74         0.84            0.79
Entailment          Stem       0.27        0.96           0.42    1.00         0.78            0.87
Entailment          TTRc       0.72        0.44           0.55    0.65         0.85            0.74
Entailment          LSA        0.58        0.44           0.50    0.71         0.81            0.76
Entailment          Ent (F)    0.67        0.41           0.51    0.62         0.83            0.71
Syntactic           MED (V)    0.72        0.95           0.82    0.88         0.53            0.66
Syntactic           Ent (R)    0.78        0.86           0.82    0.66         0.51            0.57
Syntactic           Ent (A)    0.64        0.87           0.74    0.73         0.42            0.53
Syntactic           TTRc       0.55        0.89           0.68    0.80         0.39            0.52
Syntactic           Ent (F)    0.55        0.87           0.67    0.76         0.37            0.50
Lexical             LSA        0.76        0.58           0.66    0.78         0.89            0.83
Lexical             TTRc       0.85        0.52           0.65    0.70         0.92            0.79
Lexical             Ent (F)    0.85        0.51           0.64    0.68         0.92            0.78
Lexical             Len (dif)  0.67        0.52           0.58    0.75         0.86            0.80
Lexical             Ent (A)    0.92        0.47           0.63    0.60         0.95            0.74
Paraphrase Quality  Len (dif)  0.48        0.60           0.53    0.73         0.62            0.67
Paraphrase Quality  TTRc       0.55        0.55           0.55    0.62         0.62            0.62
Paraphrase Quality  LSA        0.45        0.57           0.50    0.70         0.60            0.65
Paraphrase Quality  MED (M)    0.56        0.53           0.54    0.57         0.60            0.59
Paraphrase Quality  Ent (F)    0.53        0.52           0.53    0.59         0.59            0.59
Writing Quality     Stem       0.42        0.72           0.53    0.97         0.91            0.94
Writing Quality     Len (dif)  0.73        0.27           0.39    0.69         0.94            0.80
Writing Quality     LSA        0.65        0.24           0.35    0.67         0.92            0.78
Writing Quality     TTRc       0.81        0.24           0.37    0.60         0.95            0.74
Writing Quality     Ent (F)    0.79        0.23           0.35    0.58         0.95            0.72
