2 Word Classes

Today, we are looking at the concept of words and word classes. Our aim is to provide some first evidence for common and inherently quantitative statements about language. For example, we might say that one word is “more common” than another, or that one word class has more members than another. Below we will start by looking at word classes and try to provide evidence for a very simple hypothesis: there are closed word classes with a limited number of members and open word classes with significantly more members.

2.1 Parts of speech

2.1.1 Recap: Open and Closed Word Classes

The idea of open and closed word classes is the first concept we can quantify quite easily with the help of corpus data. An open word class should have considerably more members than a closed one. Let’s first recap which word classes we know.

Open word classes
  • Nouns: time, book, love, kind
  • Verbs: find, try, look, consider
  • Adjectives: green, high, nice, considerate
  • Adverbs: really, nicely, well
Closed word classes
  • Pronouns: I, you, she, they, mine, …
  • Determiners: the, a(n), this, that, some, any, no, …
  • Prepositions: to, in, at, behind, after, …
  • Conjunctions: and, or, so, that, because, …
Lexical vs. function words
  • Auxiliary verbs: be, have, (get, keep)
  • Lexical verbs: eat, sleep, repeat, …

Open word classes are commonly subject to word formation processes like derivation and compounding. Loanwords are also almost always from open word classes.

It is unlikely (though not impossible) for a closed word class to gain new members. Singular they might be considered one rather recent addition to the class of pronouns.

  1. Each member (…) found something they could improve on in the future. (OED Online 2022: they, 2014 Dalby (Queensland) Herald (Nexis))

However, the first written attestations go back as far as the late 13th century (OED Online 2022: they 2.a.). Members of closed word classes are also mostly invariant, i.e. they do not inflect. None of these properties are logically necessary: you could imagine inflecting pronouns, or additional types of pronouns. Some languages have a dual in addition to singular and plural (e.g. Classical Arabic), or a distinction between inclusive and exclusive we (several Polynesian languages). Yet the class of pronouns has remained rather fixed.

2.1.2 PoS-Tags

Figuring out the word class of each word is the job of part-of-speech taggers. Tools like the Tree Tagger (Schmid & Küchenhoff 2013) can determine word classes with an accuracy of around 95% (Horsmann, Erbs & Zesch 2015). Even though this is good enough for most purposes, bear in mind that automatic annotation is error-prone and can produce spurious patterns that have to be accounted for. We will encounter such cases in future sections.

PoS-tags
  • word-class annotation, available in most corpora
  • produced automatically
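
If you want to try automatic tagging yourself, the following is a minimal sketch in Python using NLTK’s off-the-shelf tagger. This is purely illustrative and reflects my own assumptions: it is not the Tree Tagger pipeline used for the course data, it outputs Penn Treebank tags rather than the BNC’s CLAWS tags, and it requires NLTK’s model data to be downloaded once.

# Illustrative only: PoS tagging with NLTK's default tagger (not the Tree Tagger).
import nltk

# one-time model downloads (assumed local setup):
# nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

sentence = "The green book lies on the table."
tokens = nltk.word_tokenize(sentence)   # split the sentence into word tokens
tagged = nltk.pos_tag(tokens)           # Penn Treebank tags, not CLAWS
print(tagged)
# e.g. [('The', 'DT'), ('green', 'JJ'), ('book', 'NN'), ('lies', 'VBZ'), ...]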

2.2 Types and Tokens

2.2.1 Word boundaries

We have to make a first technical distinction at this point. We need to decide what we count as a word. In linguistic corpora, text is broken up into tokens. The simplest definition of token is the following:

Token
  • a character sequence delimited by spaces and punctuation.

The emphasis here lies on character sequence. It is not easy to teach a computer what counts as a word, not least because the concept isn’t even easy to define linguistically (consider compounds, complex prepositions, hyphenation). The technical process is called tokenization, and it varies in complexity from the crude definition above all the way to complex models based on neural networks. If we use tokens to count occurrences, we are dealing with the related concept of type.

Type
  • class of identical tokens

Note that neither relying on spaces nor relying on orthographic characters is ideal in most circumstances, but it is the reality of most data sets, especially older ones. The terms type and token are sometimes also used much more abstractly: you could, for instance, understand a type as a “word” regardless of spelling conventions.
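
To make the two notions concrete, here is a minimal sketch of the crude definition above, under my own simplifying assumptions: split on anything that is not a letter or digit, case-fold, and count how often each resulting type occurs.

# Crude tokenization and type counting for a toy text (illustrative only).
import re
from collections import Counter

text = "The book is on the table. The table is green."
tokens = re.findall(r"\w+", text.lower())   # orthographic tokens, case-folded
types = Counter(tokens)                     # one entry per distinct token string

print(len(tokens))            # 10 tokens
print(len(types))             # 6 types
print(types.most_common(3))   # [('the', 3), ('is', 2), ('table', 2)]

Already in this tiny example the most frequent type is a function word, a point we will return to below.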

How many words?

The concept of word is actually very hard to define and its definition depends on several factors. Consider the following data:

  1. living room

living room has a coherent meaning that is highly conventionalized and also culturally specific. It is not purely compositional. It contrasts paradigmatically with words like kitchen, attic, bathroom, which are either clearly monomorphemic or at least orthographically represented by one word. Semantically, you might decide to consider it one unit rather than two. This is not necessarily true from a morphological perspective.

  1. mother-in-law

Semantically, we have a similar situation to the example above. However, we can observe that the plural attaches to the first component, giving mothers-in-law. We also find in-laws. The examples below demonstrate that there is some variation in where speakers perceive the word boundaries. Note that the Oxford English Dictionary (OED 2022) even recognizes mother-in-laws as a rare variant.

  1. They wore it only because their mothers-in-law insisted. (BNC)
  2. I always thought it was mother-in-laws that cause the problem. (COCA)
  3. Angela sided with her new in-laws. (BNC)

Next we have fixed grammatical expressions, which are written as separate words, but mostly understood as one word:

  1. going to
  2. in spite of

going to is undoubtedly one word in spoken language (gonna). in spite of again contrasts paradigmatically with words spelt as one, such as despite. Prepositions and conjunctions in particular have rather arbitrary spacing in English; consider, for example, nevertheless and however.

In summary, the definition of word strongly depends on the point of view.
You might distinguish:

  • Orthographical words (mostly congruent with token)
  • Phonological words
  • Morphological words
  • Lemmas

2.2.2 Word classes in numbers

Now let’s turn back to our hypothesis that there are open and closed word classes. The evidence we need consists of counts of words and word classes. In an electronic corpus, the notion of orthographical word is the easiest to begin with: we simply count everything surrounded by spaces as a unit, a token. Below I show the commands used to retrieve the data from our version of the British National Corpus (The British National Corpus 2007). You don’t need to worry about them just yet; in the first lessons I will provide the data and the numbers. The code might become interesting for you at a later stage, however.

BNC> [pos = "NN.*"]                         # get all tokens tagged as noun
BNC> count by hw > "noun_types.txt"         # count every lemma and save as .txt file
BNC> exit                                   # exit cqp and use wc -l (count lines)
$ wc -l noun_types.txt                      # repeat for other word classes (V.*, AJ.*, AV.*, CJ.*)
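
If you prefer to see the counting logic outside of CQP, the sketch below does the same bookkeeping in Python over a hand-made list of (word, tag) pairs. The tags and the two-letter prefixes are my own illustrative assumptions loosely modelled on the BNC tagset, and the sketch counts word forms rather than lemmas (hw), so it is a simplification of the commands above.

# Count tokens and distinct types per word class from (word, tag) pairs (toy example).
from collections import defaultdict

tagged = [("the", "AT0"), ("green", "AJ0"), ("book", "NN1"),
          ("lies", "VVZ"), ("on", "PRP"), ("the", "AT0"), ("table", "NN1")]

token_count = defaultdict(int)   # tokens per word class
type_sets = defaultdict(set)     # distinct forms per word class

for word, tag in tagged:
    pos = tag[:2]                # collapse detailed tags, e.g. NN1 -> NN
    token_count[pos] += 1
    type_sets[pos].add(word)

for pos in sorted(token_count):
    print(pos, token_count[pos], len(type_sets[pos]))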

In fact, we find that the open word classes do have considerably more types than closed word classes. Not a very exciting result, and not one we would necessarily need corpus linguistics for, but, nevertheless, our first empirical evidence for a linguistic concept.

PoS              Tokens    Types
Nouns        21,255,608  222,445
Verbs        17,870,538   37,003
Adjectives    7,297,658  125,290
Adverbs       5,736,409    8,985
Prepositions 11,246,423      434
Conjunctions  5,659,347      455
Articles      8,695,242        4

An observation that was not immediately apparent is that function words, though few in number, are individually very frequent.

  • Function words have a low type frequency
  • Function words have a high token frequency

In fact, the most frequent tokens in a corpus are function words. Below, I retrieved the 100 most frequent lemmas from the BNC.

$ cwb-scan-corpus BNC hw | sort -nr | head -100 > "bnc_lemma_freq.txt"   # count all lemmas (hw), sort by frequency, keep the top 100
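
If you would rather inspect the saved list in Python, a sketch like the one below would do. It assumes each line holds a frequency and a lemma separated by a tab, which is what cwb-scan-corpus prints.

# Show the ten most frequent lemmas from the saved frequency list.
# Assumes lines of the form "frequency<TAB>lemma".
with open("bnc_lemma_freq.txt", encoding="utf-8") as infile:
    for line in list(infile)[:10]:
        freq, lemma = line.rstrip("\n").split("\t")
        print(f"{lemma:<10} {int(freq):>12,}")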

2.2.3 Considerations

There are more things to consider when counting word types: words might be spelt the same only by coincidence, a single form might belong to several word classes, a word might have several distinct senses, and so on.

  • Homonyms
  • Polysemy
  • Conversion
  • Prototype theory

(…) the distinction between V[erbs], A[djectives], and N[ouns] is one of degree, rather than kind (…)

Ross (1972)

2.3 Homework

This week, your task is to pick your topics for presentations and homework. You should have all the course literature by now (if not, see previous homework). Starting on November 14th (Antonymy), there are 9 different papers to pick from. Additionally, the 2 project days toward the end of the class can be picked if you’d rather find and present your own topic (e.g. a possible term paper topic). You can only pick one project day (or none), and it would automatically become your main presentation.

Please pick at least 4 of the dates as preferences on this doodle: Click here!

The four assignments will be:

  1. Short summary and discussion of the week’s main reading.
  2. Short summary and discussion of an article of your choice that references the week’s main reading.
  3. Review of the week’s main talk.
  4. Practical homework.

Instead of one big presentation, you’ll have to prepare 2 short presentations in groups of two. The presentations shouldn’t exceed 15 minutes.

2.3.1 Main presentation

Outline the main research questions, hypotheses (whether explicitly stated or not) and what you think are the main results of the data analysis. Prepare a slide presentation and write a short summary or script that represents what you are going to present.

  • group size: 2–3
  • length: < 20-25min (total)
  • summary script: < half a page of prose
  • due date: Wednesday before midnight (24:00).
  • Paste your contribution here: https://yopad.eu/p/ws22_lola-365days

2.3.2 Second presentation

Do a reverse bibliography search and find an empirical paper that references the main reading. Provide a short summary and discuss its relationship to the main reading. Prepare a slide presentation.

  • group size: 2–3
  • length: < 20-25min (total)
  • due date: date of the presentation

2.3.3 Review

The main presentation will be peer reviewed. It’s going to be double-blind, so I want both the reviewers and the presenters to stay anonymous. This is common practice in academia and an important standard for publications; it ensures the quality and correctness of contributions. Determine whether the presenters used terminology correctly, comment on the extracted research questions, hypotheses, and results, and add what you think is missing. This of course requires that you have worked through the reading yourself.

  • group size: 1; there will be multiple reviews per presentation. Don’t let other reviewers influence you.
  • review: < half a page of prose
  • due date: Sunday 24:00 before the respective talk. But the earlier the better. Best case: Presenters can still react before their talk.
  • Paste your review underneath the original contribution here: https://yopad.eu/p/ws22_lola-365days

Should the presentation group be late or fail to contribute, take over their part and write a summary yourself. Of course that means that someone might review you in turn, but that’s also fine, as long as the job is done.

2.3.4 Practical homework

Read the method section of the week’s reading carefully and try to determine the steps that would be necessary to reproduce the data analysis, and potential challenges that need to be overcome. This should result in a short list of instructions we will then discuss in class and carry out if possible. The more we advance in class, the more specific the instructions should become.

2.4 Tip of the day

Today’s tip is from the category: Things I wish I had learned before my Bachelor Thesis
In short: Tiwilbemba.

Build your personal .pdf library

Take every .pdf you download or receive from your instructors and archive it with a naming scheme you can remember easily. Scans from books and collections in particular are an invaluable resource, since not everything is digitized.

My suggestion: lastname_year_keyword, e.g. Deignan_2005_Metaphor.pdf

Also…
Start building your bibliography database

Get the info for a bibliography entry as soon as you read a text. Platforms like Primo and Google Scholar provide bibliography entries in various styles and formats. In a future installment of Tiwilbemba, I will discuss the benefits of tools like BibTeX, Mendeley, Endnote, …

~15s invested per text → hours saved in the long run.