4 The Lexicon

Recap

Important Concepts

Indicator         Linguistic Concept
Tokens            word, slot (syntagmatic)
Types             word of the same form (paradigmatic)
Parts of Speech   word class
Frequency         commonness, salience, …
Lemma             Lemma

Always remember:
Most linguistic categories can only be quantified indirectly.
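
To make the difference between tokens and types from the recap concrete, here is a minimal sketch in Python. The function name token_type_counts and the toy sentence are made up for illustration, and the sentence is assumed to be tokenized already by splitting on whitespace.

```python
from collections import Counter

def token_type_counts(tokens):
    """Return (number of tokens, number of types, frequency list)."""
    freq = Counter(tokens)  # maps each type to its number of occurrences
    return len(tokens), len(freq), freq

# Toy example, assumed to be tokenized already (split on whitespace).
tokens = "the cat saw the dog and the dog saw the cat".split()
n_tokens, n_types, freq = token_type_counts(tokens)

print(n_tokens)             # every running word counts as a token
print(n_types)              # each distinct form counts once as a type
print(freq.most_common(3))  # a small frequency list
```

The frequency of a type is just the number of tokens that share its form, so the counts are only as meaningful as the decision about what a token is.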

What counts?
Look at the frequency list of all lemmas in the BNC. Did you spot anything odd? We talked about how linguistic concepts are represented in corpora. Tokenization is the best example of how orthography-centric corpora (necessarily) are. Most corpora are designed so that punctuation marks are treated as individual tokens. Clitics such as the possessive ’s and contracted forms of auxiliary verbs such as ’ll, ’ve and ’re are also treated as separate tokens. This decision may conflict with the definition of word you are working with.
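
The effect of this design decision can be sketched with a small regular-expression tokenizer. This is not the tokenizer used for the BNC; the pattern, the function names and the example sentence are assumptions made for illustration, and the pattern only covers the clitics mentioned above.

```python
import re

# Hypothetical pattern, tried left to right at each position:
# 1. a clitic ('s, 'll, 've, 're) with a straight or curly apostrophe
# 2. a run of word characters
# 3. any single character that is neither a word character nor whitespace
TOKEN_PATTERN = re.compile(r"['’](?:s|ll|ve|re)\b|\w+|[^\w\s]")

def corpus_style_tokenize(text):
    """Split off punctuation and clitics as separate tokens."""
    return TOKEN_PATTERN.findall(text)

def whitespace_tokenize(text):
    """Treat every whitespace-delimited string as one token."""
    return text.split()

sentence = "The dog’s owner said they’ll leave, but we’ve stayed."

corpus_tokens = corpus_style_tokenize(sentence)
naive_tokens = whitespace_tokenize(sentence)

print(corpus_tokens, len(corpus_tokens))
print(naive_tokens, len(naive_tokens))
```

The same sentence yields a different number of tokens depending on the tokenization, so token counts and frequency lists always reflect the corpus designers' definition of a token, which need not match your definition of a word.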