5.1 Quantities in corpora

In some sense, Corpus Linguistics is really simple, since it is largely focused on one basic indicator: frequency of occurrence. All annotations, such as lemma, POS, text type, etc., are ultimately measured in counts. There are hardly any other types of indicators, such as temperature, color, velocity, or volume. Even word and sentence lengths are fundamentally counts (of characters/phonemes, or tokens/words).

This simplicity, however, comes at a cost: the interpretation is rarely straightforward. We’ve talked about how ratios become important if you are working with differently sized samples. I use sample here very broadly. You can think of different corpora as different samples, but even within the same corpus, what counts as a sample depends on what you look at. Let’s assume we were investigating the use of the amplifier adverb utterly. As a first exploration, we might be interested in whether there are some obvious differences between British and American usage. For the British corpus, let’s take the BNC (The British National Corpus 2007), and for the American corpus the COCA (Davies 2008).

> cqp
BNC; utterly = "utterly" %c
COCA-S; utterly = "utterly" %c
size BNC:utterly; size COCA-S:utterly
...
1248
4401

The American sample returns more matches. Does that mean utterly is more common in American English? No! The American sample is also almost 5 times as large. We need to account for the different sizes by normalizing the counts. The simplest way to achieve that is by calculating a simple ratio, or relative frequency.

\(\frac{1248}{112156361} = 0.0000111\)
\(\frac{4401}{542341719} = 0.0000081\)

Now we can see that utterly actually scores higher in the British sample. We can make those small numbers a bit prettier by turning them into occurrences per hundred, a.k.a. percentages, which gives us 0.00111% and 0.00081%. That’s still not intuitive, so let’s move the decimal point all the way into intuitive territory by taking it per million: utterly occurs 11.1 times per million words in the British sample and 8.1 times in the American sample.
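The normalization can be reproduced in a few lines of Python. This is only a minimal sketch: the function name `relative_frequency` and its `per` parameter are made up for illustration, and the counts and corpus sizes are the ones used in the fractions above.

```python
def relative_frequency(count: int, corpus_size: int, per: int = 1) -> float:
    """Normalize a raw count by corpus size, scaled to `per` tokens."""
    return count / corpus_size * per


# Raw counts and corpus sizes taken from the fractions in the text.
samples = {
    "BNC": (1248, 112_156_361),
    "COCA": (4401, 542_341_719),
}

# Per-million frequencies make differently sized samples comparable.
per_million = {
    name: relative_frequency(count, size, per=1_000_000)
    for name, (count, size) in samples.items()
}

for name, value in per_million.items():
    print(f"{name}: {value:.1f} occurrences per million words")
```

Setting `per=100` gives the percentages instead; the choice of scaling factor only moves the decimal point and does not change the comparison.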

Now how about the adjectives utterly co-occurs with? To keep it simple, let’s focus on the British data.

BNC
count utterly by word on match[1]
...
34      and  [#91-#124]
28      .  [#20-#47]
27      different  [#408-#434]
20      ,  [#0-#19]
13      at  [#142-#154]
13      ridiculous  [#929-#941]
11      impossible  [#645-#655]
11      without  [#1207-#1217]
10      miserable  [#772-#781]
10      opposed  [#820-#829]

The two most frequently co-occurring adjectives are different and ridiculous. However, different is more than twice as frequent after utterly (27 vs 13). Does that mean it is more strongly associated with utterly? No! The adjective different is itself also much more frequent in the corpus than ridiculous (~47500 vs 1777).
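One simple way to see why the raw co-occurrence counts are misleading is to ask what share of each adjective's occurrences follow utterly. The sketch below does just that with the numbers quoted above; it is an illustration under stated assumptions, not a proper association measure, the variable names are invented, and 47,500 is only the approximate corpus frequency of different.

```python
# (count after "utterly", overall corpus frequency) as quoted in the text;
# the overall frequency of "different" is approximate.
adjectives = {
    "different": (27, 47_500),
    "ridiculous": (13, 1_777),
}

# Share of each adjective's occurrences that are preceded by "utterly".
share_with_utterly = {
    adj: with_utterly / total
    for adj, (with_utterly, total) in adjectives.items()
}

for adj, share in share_with_utterly.items():
    print(f"{adj}: {share * 1000:.2f} of every 1000 occurrences follow 'utterly'")
```

Once the overall frequency of each adjective is taken into account, the ranking flips: a far larger share of the occurrences of ridiculous follow utterly than is the case for different.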

References

Davies, Mark. 2008. The corpus of contemporary American English: 450 million words, 1990-2012. http://corpus.byu.edu/coca.

The British National Corpus, version 3 (BNC XML Edition). 2007. http://www.natcorp.ox.ac.uk/; Distributed by Bodleian Libraries, University of Oxford, on behalf of the BNC Consortium.