Learners’ dictionaries are designed to help learners understand and use words and phrases. A corpus is another resource to help with the same task. How do they relate to each other?
They are both records of the language. The corpus is a sample of the language in the raw. The dictionary is a highly condensed version of roughly the same material. The relation between the two is easy to see when we consider how modern corpus-based dictionaries are prepared. One of the main inputs, at leading dictionary publishers including Collins, Macmillan and Oxford University Press, is the word sketch: a one-page corpus-based summary of a word’s grammatical and collocational behaviour, as in Fig 2. Is a word sketch more corpus-like or dictionary-like? It is automatically produced output from the corpus, which makes it corpus-like, but it is also a condensed summary of what was found there, which makes it dictionary-like. On a continuum from corpus to dictionary, it is somewhere in the middle.
Most learners do not want to be corpus linguists, and concordances are unfamiliar and difficult objects. But dictionaries are familiar from an early age, sometimes even loved. Learners will not be put off if they are expected to look items up in a new kind of dictionary. This suggests a strategy for bringing corpora into the classroom: disguise them as dictionaries.
Dictionary users often find that the examples are the most useful part of a dictionary entry. Moreover, where dictionaries are electronic rather than on paper, the traditional space limitation on examples disappears: there is room for lots of examples. This is an area where corpora can help: they are nothing but examples. However, they are not selected or edited examples. Choosing examples for a dictionary is an advanced lexicographical skill: each example should be short, use familiar words, avoid irrelevant grammatical complexity, show the word in a typical use, and provide a context which helps the learner understand what it means.
While we cannot yet program computers to do the task anything like as well as people, we can perform some parts of it automatically. We can rule out sentences which are too long, or too short, or which contain obscure words, or which have many words capitalised or lots of numbers or square brackets or other characters which are rare in the kind of simple, straightforward sentences we are looking for. We have done that in a program called GDEX (Good Dictionary Example finder, Kilgarriff et al. 2008). The initial project was a collaboration with Macmillan, who used its output as a first-pass filter when adding more examples to their dictionary. The machinery has since been embedded into the Sketch Engine, and concordance lines can now be sorted according to GDEX score, so that the ‘best’ ones are the ones that the user sees first among their search hits.
space   ukWaC freq = 273022

(Each grammatical relation is followed by its frequency and score; each collocate is followed by its frequency and score.)

object_of (86021, 2.2): watch 4113 9.21; confine 1200 8.65; occupy 1230 8.22; allocate 912 7.95; limit 1163 7.76; fill 1179 7.68; enclose 566 7.41; create 3038 7.39; save 1097 7.24; reserve 523 7.08; devote 381 6.8; breathe 344 6.79

pp_between-i (2685, 10.0): paragraph 44 4.46; particle 22 3.64; row 24 3.5; column 25 3.2; word 143 3.1; tooth 17 3.09; building 83 2.22; seat 20 2.2; letter 39 2.1; star 20 2.07; wall 32 1.97; cell 29 1.73

pp_above-i (205, 5.9): shop 29 1.73

pp_per-i (410, 4.7): sq.m 16 9.55; dwelling 44 5.48; unit 18 0.57; person 26 0.57

pp_around-i (515, 4.6): building 19 0.1

pp_within-i (983, 4.2): building 63 1.83; city 28 1.26; area 58 0.18; centre 19 0.14

pp_down-i (76, 4.2): left 17 2.75; right 29 0.51

pp_for-i (15742, 4.1): wheelchair 109 6.31; fridge/freezer 40 6.13; freezer 49 5.93; fridge 52 5.62; cot 35 5.56; contemplation 27 5.43; reflection 91 5.42; recreation 47 5.4; luggage 35 5.16; update 108 5.01; dryer 24 4.91; storage 99 4.9

a_modifier (106533, 2.7): open 10693 9.69; green 3787 9.18; outer 1732 8.75; public 5574 8.18; empty 1268 8.09; ample 970 8.05; enough 1564 7.78; limited 1110 7.54; extra 1325 7.46; urban 959 7.45; short 2051 7.37; much 1740 7.21

n_modifier (64957, 2.6): parking 6052 10.16; disk 2762 9.43; storage 3084 9.32; exhibition 1689 8.08; office 2984 7.75; breathing 488 7.59; loft 426 7.57; gallery 872 7.5; roof 812 7.47; floor 1535 7.41; living 798 7.39; studio 755 7.37

adj_subject_of (8065, 2.3): available 3224 6.63; adjacent 91 6.37; cramped 20 5.79; tight 70 5.76; limited 130 5.45; finite 17 4.83; scarce 16 4.8; empty 41 4.64; accessible 69 4.59; inadequate 20 4.41; suitable 91 4.41; efficient 44 4.08

Fig 2. Word sketch for ‘space’ drawn from ukWaC (truncated)
While GDEX could be much better, it already makes it more likely that corpus examples will be readable.
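
The kind of filtering described above can be illustrated with a minimal sketch. It is not the GDEX implementation: the function names, the length limits, the penalty weights, the list of unwanted characters and the idea of passing in a set of common words are all assumptions made for illustration. It rules out sentences that are too long or too short or that contain brackets and similar characters, penalises obscure words, capitalised words and numbers, and then sorts candidate sentences so that the highest-scoring ones come first.

```python
import re

WORD_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?")
NUMBER_RE = re.compile(r"\d+")
# Characters assumed (for this sketch) to be rare in simple sentences.
ODD_CHARS_RE = re.compile(r"[\[\]{}<>|\\@#]")


def example_score(sentence, common_words, min_words=5, max_words=20):
    """Return a score between 0 and 1; 0 means the sentence is ruled out.

    All thresholds and weights are illustrative assumptions.
    """
    words = WORD_RE.findall(sentence)
    if not min_words <= len(words) <= max_words:
        return 0.0  # too short or too long
    if ODD_CHARS_RE.search(sentence):
        return 0.0  # square brackets and other unusual characters
    # The first word is skipped: it is capitalised in any ordinary sentence.
    capitalised = sum(1 for w in words[1:] if w[0].isupper())
    numbers = len(NUMBER_RE.findall(sentence))
    # A word counts as obscure if it is missing from the supplied word list.
    obscure = sum(1 for w in words if w.lower() not in common_words)
    penalty = 0.2 * obscure + 0.1 * capitalised + 0.1 * numbers
    return max(0.0, 1.0 - penalty)


def sort_by_score(sentences, common_words):
    """Sort candidate sentences so the best-scoring ones come first."""
    return sorted(
        sentences,
        key=lambda s: example_score(s, common_words),
        reverse=True,
    )
```

In practice the set of common words would come from a frequency list for the language, and the weights would need tuning against examples that lexicographers have judged good or bad.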