3.2 Types and Tokens

3.2.1 Word boundaries

At this point we have to make a first technical distinction: we need to decide what counts as a word. In corpus linguistics, the most commonly encountered model of the word is the token, which has a rough and technical, yet simple, definition.

Token

character sequences in between spaces

The emphasis here lies on character sequence. If we group identical sequences and count their occurrences, we are dealing with the related concept of type.

Type

class of identical tokens
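As a minimal sketch of the two definitions above, the following Python snippet splits a string at whitespace to obtain tokens and then groups identical tokens into types. The sample sentence and the function names are invented for illustration and are not part of any corpus tool.

```python
from collections import Counter


def whitespace_tokens(text):
    """Return the character sequences found between whitespace."""
    return text.split()


def type_counts(tokens):
    """Map each type (class of identical tokens) to its token frequency."""
    return Counter(tokens)


sample = "the cat saw the dog and the dog saw the cat"  # invented example
tokens = whitespace_tokens(sample)
types = type_counts(tokens)

print(len(tokens))   # number of tokens
print(len(types))    # number of types
print(types["the"])  # token frequency of the type "the"
```

Because the comparison is made on exact character sequences, The and the would count as two distinct types here.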

Note that neither relying on spaces nor on orthographic characters is by itself ideal in most circumstances. The terms type and token are sometimes also used much more abstractly. You could understand types and tokens as “words”, disregarding spelling conventions. This requires more work, both in defining word and in working with the data later.

The process of breaking up text into “words” is tokenization. Most corpora are designed so that punctuation marks are treated as individual tokens. Clitics such as the possessive ’s and contracted forms of auxiliary verbs such as ’ll, ’ve, ’re are also treated as separate tokens. This decision might be contrary to the definition of word you are working with.
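The tokenization decisions just described can be sketched with a regular expression. This is only an illustration under simplifying assumptions: the pattern `TOKEN_PATTERN` and the function `tokenize` are hypothetical, they cover only the clitics listed above (not, for instance, n’t), and they are not the tokenizer used for any particular corpus.

```python
import re

# Hypothetical pattern: alternatives are tried from left to right.
TOKEN_PATTERN = re.compile(
    r"['’](?:s|ll|ve|re)\b"  # clitics: 's, 'll, 've, 're (straight or curly apostrophe)
    r"|\w+"                   # a run of letters, digits or underscores
    r"|[^\w\s]"               # any single punctuation character
)


def tokenize(text):
    """Split text into tokens, separating punctuation and the listed clitics."""
    return TOKEN_PATTERN.findall(text)


print(tokenize("They'll see John's book."))
```

A plain whitespace split of the same sentence yields four tokens, whereas this sketch yields seven, because the two clitics and the full stop become tokens of their own.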

How many words?

The concept of word is actually very hard to define and its definition depends on several factors. Consider the following data:

living room

living room has a coherent meaning that is highly conventionalized and also culturally specific. It is not purely compositional. It contrasts paradigmatically with words like kitchen, attic, bathroom, which are either clearly monomorphemic or at least orthographically presented as one word. Semantically, you might decide to consider it one unit rather than two. This is not necessarily true from a morphological perspective.

mother-in-law

Semantically, we have a similar situation to the example above. However, we can make the observation that the plural can attach to the first component, thus mothers-in-law. We also find in-laws. The examples below demonstrate that there is some variation in where speakers feel the word boundaries are. Note that the Oxford English Dictionary (OED Online 2020) recognizes mother-in-laws as a rare variant.⁴

They wore it only because their mothers-in-law insisted. (BNC⁵)
I always thought it was mother-in-laws that cause the problem. (COCA⁶)
Angela sided with her new in-laws. (BNC)

Next we have fixed grammatical expressions, which are written as separate words, but mostly understood as one word:

going to
in spite of

going to is undoubtedly one word in spoken language (gonna). in spite of again contrasts paradigmatically with words spelt as one, such as despite. The spacing of English prepositions and conjunctions in particular is rather arbitrary; consider, for example, nevertheless and however.

In summary, the definition of word strongly depends on the point of view.
You might distinguish:

Orthographical words (mostly congruent with token)
Phonological words
Morphological words
Lemmas
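To see how much the choice of word definition affects a count, here is a small sketch that counts the same sentence once as orthographical words and once with a few multiword units merged. The unit list, the sentence and the helper `merge_units` are invented for illustration; a real analysis would need a principled inventory of such units.

```python
# Hypothetical list of expressions treated as single units.
MULTIWORD_UNITS = [("in", "spite", "of"), ("living", "room")]


def merge_units(tokens, units=MULTIWORD_UNITS):
    """Join every listed multiword unit into a single item."""
    merged = []
    i = 0
    while i < len(tokens):
        for unit in units:
            if tuple(tokens[i:i + len(unit)]) == unit:
                merged.append(" ".join(unit))
                i += len(unit)
                break
        else:
            # No unit starts at this position: keep the token as it is.
            merged.append(tokens[i])
            i += 1
    return merged


sentence = "we sat in the living room in spite of the noise"  # invented example
orthographic = sentence.split()
merged = merge_units(orthographic)

print(len(orthographic), len(merged))
```

The same sentence has eleven orthographical words but only eight units once living room and in spite of are each counted as one.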

3.2.2 Word classes in numbers

Now let’s turn back to our hypothesis that there are open and closed word classes. The evidence we need is counts for words and word classes. In an electronic corpus, the notion of orthographical word is the easiest to begin with. We basically count everything surrounded by spaces as a unit, a token. Below I show you the commands used to retrieve the data from our version of the British National Corpus (The British National Corpus 2007). You don’t need to worry about them just yet. In the first lessons I will provide the data and the numbers. The code might be interesting for you at a later stage, however.

BNC> [pos = "NN.*"]                         # get all tokens tagged as noun
BNC> count by hw > "noun_types.txt"         # count every lemma and save as .txt file
BNC> exit                                   # exit cqp and use wc -l (count lines)
$ wc -l noun_types.txt                      # repeat for other word classes (V.*, AJ.*, AV.*, CJ.*)
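The logic of the commands above can be mimicked on a toy scale in Python. This is a sketch only: the (word, lemma, tag) triples are invented, the tags merely follow the CLAWS5 style used in the BNC, and `lemma_counts` is a hypothetical helper rather than a CWB function. `fullmatch` is used on the assumption that the pattern should cover the whole tag.

```python
import re
from collections import Counter

# Invented toy corpus of (word, lemma, part-of-speech tag) triples.
TOY_CORPUS = [
    ("The", "the", "AT0"),
    ("dogs", "dog", "NN2"),
    ("chased", "chase", "VVD"),
    ("the", "the", "AT0"),
    ("dog", "dog", "NN1"),
    ("and", "and", "CJC"),
    ("a", "a", "AT0"),
    ("cat", "cat", "NN1"),
]


def lemma_counts(corpus, pos_pattern):
    """Count lemmas of all tokens whose tag matches pos_pattern entirely."""
    tag = re.compile(pos_pattern)
    return Counter(lemma for _, lemma, pos in corpus if tag.fullmatch(pos))


nouns = lemma_counts(TOY_CORPUS, r"NN.*")
print(sum(nouns.values()))  # noun tokens
print(len(nouns))           # noun types (distinct lemmas)
```

Counting the entries of the resulting table corresponds to counting the lines of the saved file with wc -l.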

In fact, we find that the open word classes do have considerably more types than closed word classes. Not a very exciting result, and not one we would necessarily need corpus linguistics for, but, nevertheless, our first empirical evidence for a linguistic concept.

PoS	Tokens	Types
Nouns	21255608	222445
Verbs	17870538	37003
Adjectives	7297658	125290
Adverbs	5736409	8985
Prepositions	11246423	434
Conjunctions	5659347	455
Articles	8695242	4

A less obvious observation is that function words, although few in number, are individually very frequent.

Function words have a low type frequency
Function words have a high token frequency
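One way to make these two statements concrete is to divide the token count by the type count for each word class, which gives the average number of tokens per type. The sketch below does this with the figures from the table above; the name `tokens_per_type` is an illustrative choice, and the average is only a rough summary, since frequencies within a class are far from evenly distributed.

```python
# (tokens, types) per word class, copied from the table above.
COUNTS = {
    "Nouns": (21255608, 222445),
    "Verbs": (17870538, 37003),
    "Adjectives": (7297658, 125290),
    "Adverbs": (5736409, 8985),
    "Prepositions": (11246423, 434),
    "Conjunctions": (5659347, 455),
    "Articles": (8695242, 4),
}


def tokens_per_type(counts):
    """Return the average number of tokens per type for each word class."""
    return {pos: tokens / types for pos, (tokens, types) in counts.items()}


ratios = tokens_per_type(COUNTS)
# Print the classes from the lowest to the highest average.
for pos, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print(f"{pos:<13}{ratio:>14.1f}")
```

On these counts a noun type occurs fewer than 100 times on average, whereas each of the four article types occurs more than two million times.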

In fact, the most frequent tokens in a corpus are function words. Below, I retrieved the 100 most frequent lemmas from the BNC.

$ cwb-scan-corpus BNC hw | sort -nr | head -n 100 > "bnc_lemma_freq.txt"
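The pipeline above counts every lemma, sorts the counts in descending order and keeps the top of the list. The same three steps can be sketched in Python on an invented list of lemmas; `most_frequent` is a hypothetical helper name.

```python
from collections import Counter


def most_frequent(lemmas, n):
    """Return the n most frequent lemmas with their counts, highest first."""
    return Counter(lemmas).most_common(n)


# Invented toy sequence of lemmas, not taken from the BNC.
toy_lemmas = ["the", "be", "of", "the", "dog", "be", "the"]
top_two = most_frequent(toy_lemmas, 2)
print(top_two)
```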

3.2.3 Considerations

There are more things to consider when counting word types. Words might be spelt the same only by coincidence, the same word might belong to multiple word classes, a word might have several senses, etc.

Homonyms
Polysemy
Conversion
Prototype theory

(…) the distinction between V[erbs], A[djectives], and N[ouns] is one of degree, rather than kind (…)

— Ross (1972)
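How much conversion matters for a type count can be illustrated with a short sketch that counts types once by word form alone and once by the combination of word form and tag. The tagged tokens are invented, and the tags merely follow the CLAWS5 style used in the BNC.

```python
# Invented toy data: (word form, part-of-speech tag) pairs.
TAGGED_TOKENS = [
    ("run", "VVB"),
    ("run", "NN1"),
    ("bank", "NN1"),
    ("bank", "NN1"),
    ("walk", "VVB"),
    ("walk", "NN1"),
]

# Types defined by the word form only.
form_types = {form for form, _ in TAGGED_TOKENS}
# Types defined by word form and word class together.
form_pos_types = set(TAGGED_TOKENS)

print(len(form_types), len(form_pos_types))
```

Adding the tag separates the noun and verb uses of run and walk, but it cannot separate homonyms or senses that share a word class, such as the two nouns spelt bank.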

References

Davies, Mark. 2008. The corpus of contemporary American English: 450 million words, 1990-2012. http://corpus.byu.edu/coca.

OED Online. 2020. http://www.oed.com/; Oxford University Press.

Ross, John R. 1972. The category squish: Endstation Hauptwort. Papers from the eighth regional meeting of the Chicago Linguistic Society, vol. 8, 316–328. Chicago Linguistic Society.

The British National Corpus, version 3 (BNC XML Edition). 2007. http://www.natcorp.ox.ac.uk/; Distributed by Bodleian Libraries, University of Oxford, on behalf of the BNC Consortium.

⁴ “mother-in-law, n. and adj.” OED Online, Oxford University Press, March 2020, www.oed.com/view/Entry/122659. Accessed 28 April 2020.
⁵ British National Corpus (The British National Corpus 2007)
⁶ Corpus of Contemporary American English (Davies 2008)