2 Semantics I
In linguistics, we are interested in both the form side and the meaning side of language. Form concerns the surface structure of language, i.e. the phonological, morphological, and syntactic properties of words. Every linguistic form, e.g. a word, a phrase, a construction, an idiom or a whole utterance, has a function. This idea goes back all the way to Ferdinand de Saussure’s model of the linguistic sign [Saussure (2001); original from 1916]. Saussure’s model establishes that there is an arbitrary relationship between form and meaning. The exact sounds we use to express tree do not have any ‘logical’ connection to its meaning. It also predicts that every form has a meaning and function.
Today, we start with the concept of words and word classes, and think about whether there is also a meaning side to such rather abstract classes. We will also see some first evidence for common and inherently quantitative statements about language. For example, we might say that one word is “more common” than another or that one word class has more members than another. Below we will start by looking at word classes and we will try to provide evidence for a very simple hypothesis: There are closed word classes with a limited number of members and open word classes with significantly more members.
2.1 Parts of speech
2.1.1 Recap: Open and Closed Word Classes
The idea of open and closed word classes is the first idea we can quantify very easily with the help of corpus data. An open word class should have far more members than a closed word class. Let’s first recap what types of word classes we know.
- Open word classes
  - Nouns: time, book, love, kind
  - Verbs: find, try, look, consider
  - Adjectives: green, high, nice, considerate
  - Adverbs: really, nicely, well
- Closed word classes
  - Pronouns: I, you, she, they, mine, …
  - Determiners: the, a(n), this, that, some, any, no, …
  - Prepositions: to, in, at, behind, after, …
  - Conjunctions: and, or, so, that, because, …
  - …
- Lexical vs. function words
  - Auxiliary verbs: be, have, (get, keep)
  - Lexical verbs: eat, sleep, repeat, …
Open word classes are commonly subject to word formation processes like derivation and compounding. Loanwords are also almost always from open word classes.
It is unlikely (though not impossible) for a closed word class to gain new members. Singular they might be considered a rather recent addition to the class of pronouns.
- Each member (…) found something they could improve on in the future. (OED Online 2022: they, 2014 Dalby (Queensland) Herald (Nexis))
However, the first written accounts go back as far as the late 13th century (OED Online 2022: they 2.a.). Members of closed word classes are also mostly invariant, i.e. they do not inflect. None of these properties are logically necessary. You could imagine inflecting pronouns, or different types of pronouns. Some languages have a dual in addition to singular and plural (e.g. Classical Arabic), or a distinction between inclusive and exclusive we (several Polynesian languages). Yet the class of pronouns has remained rather fixed.
2.2 Types and Tokens
2.2.1 Word boundaries
We have to make a first technical distinction at this point. We need to decide what we count as a word. In linguistic corpora, text is broken up into tokens. The simplest definition of token is the following:
- Token: character sequences broken up by spaces and punctuation.
The emphasis here lies on character sequence. It is not easy to teach a computer what counts as a word. One simple reason is that the concept isn’t even easy to define linguistically (consider compounds, complex prepositions, hyphenation). The technical process is called tokenization, and it varies in complexity from the crude definition above all the way to complex models based on neural networks. If we use tokens to count occurrences, we are dealing with the related concept of type.
- Type: class of identical tokens
Note that relying on spaces and orthographic characters alone is rarely ideal, but it is the reality of most data sets, especially older ones. The terms type and token are sometimes also used much more abstractly. You could, for example, understand types and tokens as “words” regardless of spelling conventions.
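To make the two definitions concrete, here is a minimal Python sketch of the crude tokenization described above, followed by a type count. The example sentence, the function name `tokenize` and the decision to lowercase everything are assumptions made for illustration. This is not how the BNC was tokenized.

```python
import re
from collections import Counter

def tokenize(text):
    """Crude tokenizer: character sequences broken up by spaces and punctuation."""
    # \w+ matches runs of letters, digits and underscores; anything else is a boundary.
    # Lowercasing is an assumption: without it, "The" and "the" would be two types.
    return re.findall(r"\w+", text.lower())

# Made-up example text.
text = "The living room is next to the kitchen. The mother-in-law likes the living room."

tokens = tokenize(text)   # every occurrence counts
types = Counter(tokens)   # identical tokens are grouped into one type

print(len(tokens))            # number of tokens
print(len(types))             # number of types
print(types.most_common(3))   # the most frequent types with their token counts
```

Note that this tokenizer splits mother-in-law into three tokens and living room into two, which is exactly the kind of decision the crude definition forces on us.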
How many words?
The concept of word is actually very hard to define and its definition depends on several factors. Consider the following data:
- living room
living room has a coherent meaning that is highly conventionalized and also culturally specific. It is not purely compositional. It contrasts paradigmatically with words like kitchen, attic, bathroom, which are either clearly monomorphemic or at least orthographically represented by one word. Semantically, you might decide to consider it one unit rather than two. This is not necessarily true from a morphological perspective.
- mother-in-law
Semantically, we have a similar situation to the example above. However, we can make the observation that the plural can attach to the first component, thus mothers-in-law. We also find in-laws. The examples below demonstrate that there is some variation in where speakers feel the word boundaries. Note that the Oxford English Dictionary (OED) (2022) recognizes mother-in-laws as a rare variant.
- They wore it only because their mothers-in-law insisted. (BNC)
- I always thought it was mother-in-laws that cause the problem. (COCA)
- Angela sided with her new in-laws. (BNC)
Next we have fixed grammatical expressions, which are written as separate words, but mostly understood as one word:
- going to
- in spite of
going to is undoubtedly one word in spoken language (gonna). in spite of again contrasts paradigmatically with words spelt as one, such as despite. Especially prepositions and conjunctions in English have rather arbitrary spacing; consider for example nevertheless, however.
In summary, the definition of word strongly depends on the point of view.
You might distinguish:
- Orthographical words (mostly congruent with token)
- Phonological words
- Morphological words
- Lemmas
2.2.2 Word classes in numbers
Now let’s turn back to our hypothesis that there are open and closed word classes. The evidence we need is counts for words and word classes. In an electronic corpus, the notion of orthographical word is the easiest to begin with. We basically count everything surrounded by spaces as a unit, a token. Below I show you the commands used to retrieve the data from our version of the British National Corpus (The BNC Consortium 2007). You don’t need to worry about them just yet. In the first lessons I will provide the data and the numbers. The code might be interesting for you at a later stage, however.
BNC> [pos = "NN.*"] # get all tokens tagged as noun
BNC> count by hw > "noun_types.txt" # count every lemma and save as .txt file
BNC> exit # exit cqp and use wc -l (count lines)
$ wc -l noun_types.txt # repeat for other word classes, (V.*, AJ.*, AV.*, CJ.*)
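If you are curious what these commands do conceptually, the following Python sketch imitates them on a tiny hand-made list of tokens. The toy data and the function name `count_headwords` are hypothetical. The sketch only mirrors the logic (filter by tag, count by headword, count the distinct headwords) and is not how the corpus software works internally.

```python
import re
from collections import Counter

# Hypothetical toy corpus: (word form, headword, part-of-speech tag) triples.
corpus = [
    ("rights", "right", "NN2"),
    ("book", "book", "NN1"),
    ("books", "book", "NN2"),
    ("found", "find", "VVD"),
    ("the", "the", "AT0"),
    ("green", "green", "AJ0"),
    ("time", "time", "NN1"),
]

def count_headwords(corpus, tag_pattern):
    """Count the headwords of all tokens whose tag matches the pattern as a whole."""
    tag_re = re.compile(tag_pattern)
    return Counter(hw for _, hw, pos in corpus if tag_re.fullmatch(pos))

noun_counts = count_headwords(corpus, r"NN.*")   # analogous to [pos = "NN.*"] and count by hw

print(sum(noun_counts.values()))   # noun tokens
print(len(noun_counts))            # noun types, analogous to wc -l on the saved file
```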
In fact, we find that the open word classes do have considerably more types than closed word classes. Not a very exciting result, and not one we would necessarily need corpus linguistics for, but, nevertheless, our first empirical evidence for a linguistic concept.
PoS | Tokens | Types |
---|---|---|
Nouns | 21255608 | 222445 |
Verbs | 17870538 | 37003 |
Adjectives | 7297658 | 125290 |
Adverbs | 5736409 | 8985 |
Prepositions | 11246423 | 434 |
Conjunctions | 5659347 | 455 |
Articles | 8695242 | 4 |
A less obvious observation is that function words, although there are few of them, are individually very frequent.
- Function words have a low type frequency
- Function words have a high token frequency
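One way to see both points at once is to divide the token count of each word class by its type count. The sketch below does this in Python with the numbers from the table above. The measure (average tokens per type) and the variable names are my choice for illustration, not a standard taken from the course literature.

```python
# Token and type counts copied from the table above: (tokens, types).
counts = {
    "Nouns": (21255608, 222445),
    "Verbs": (17870538, 37003),
    "Adjectives": (7297658, 125290),
    "Adverbs": (5736409, 8985),
    "Prepositions": (11246423, 434),
    "Conjunctions": (5659347, 455),
    "Articles": (8695242, 4),
}

def tokens_per_type(tokens, types):
    """Average number of occurrences per member of a word class."""
    return tokens / types

ratios = {pos: tokens_per_type(tok, typ) for pos, (tok, typ) in counts.items()}

# Print the word classes from the highest to the lowest average.
for pos, ratio in sorted(ratios.items(), key=lambda item: item[1], reverse=True):
    print(f"{pos:<13}{ratio:>12.1f}")
```

In this table, every closed class has a higher average than every open class: the few members of a closed class share a very large number of tokens among them.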
In fact, the most frequent tokens in a corpus are function words. Below, I retrieved the 100 most frequent lemmas from the BNC.
2.3 Form and Meaning
2.3.1 Formal properties of word classes
There are more things to consider when counting word types. Words might be spelt the same only by coincidence, they might have several senses, and they might belong to more than one word class. A good example of the latter is right.
- noun: I know my rights (BNC: 70867707)
- verb: It’s time to right wrongs (BNC: 90301615)
- adjective: I know I made the right decision to stay (BNC: 34288874)
- adverb: I keep it right here in my desk (BNC: 33832643)
- discourse marker: Right, let me turn the cake round (BNC: 107306397)
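The consequence for counting can be sketched in a few lines of Python: the number of types changes depending on whether we count word forms alone or combinations of word form and word class. The small annotated list below is made up for illustration and follows the examples above. The labels are plain strings, not tags from an actual tagset.

```python
from collections import defaultdict

# Hypothetical hand-annotated tokens: (lemma, word class).
tagged = [
    ("right", "noun"),
    ("right", "verb"),
    ("right", "adjective"),
    ("right", "adverb"),
    ("right", "discourse marker"),
    ("decision", "noun"),
    ("know", "verb"),
    ("desk", "noun"),
]

# Collect the set of word classes each lemma occurs in.
classes = defaultdict(set)
for lemma, word_class in tagged:
    classes[lemma].add(word_class)

# Lemmas that belong to more than one word class.
ambiguous = sorted(lemma for lemma, cls in classes.items() if len(cls) > 1)

print(ambiguous)
print(len(classes))       # types if we only look at the lemma
print(len(set(tagged)))   # types if lemma and word class are counted together
```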
Word classes are typically defined by distributional properties. Here are some examples for the class noun.
- Plural marking: right, rights
- Determiners: the right, a right, my right
- Adjective modification: fundamental right
- Use as subject: My rights are being violated.
Words vary considerably in terms of what distributional properties they exhibit. For example, mass nouns are not normally pluralized, but they can be modified by determiners and adjectives. The same is true for abstract nouns like love and hate. There is no plural hates. Non-gradable adjectives like red do not typically occur with comparison markers like more or most or the inflectional suffixes -er and -est. However, you can still find examples.
- They all got hotter and (…) and shinier and fatter (BNC: 55008870)
In statistical terms, we can say that there is a non-zero probability of finding a mass noun in the plural, and a non-gradable adjective in the comparative. Therefore, membership in these classes has to be considered gradual, and they are typically considered prototype categories (Langacker 1999).
(…) the distinction between V[erbs], A[djectives], and N[ouns] is one of degree, rather than kind (…)
— Ross (1972)
2.3.2 Meaning/Function of word classes
The distributional properties of words are not the only thing that defines word classes. We can also try to find a notional definition of the typical meaning of a ‘noun’ or ‘verb’. This is a very difficult task, and there are many different approaches. Nouns are often considered to denote discrete objects, verbs to denote events or actions, and adjectives to denote properties of objects or events. Those ideas are rather abstract and do not apply to many instances of nouns, verbs, and adjectives. For example, the word right from above is not really a discrete object. Likewise, to love does not normally describe an event, but rather a state or property.
We can see an interesting parallelism between this functional fuzziness and the distributional fuzziness of words. A noun that does not describe a typical discrete object also does not exhibit all the properties of a typical noun. Mass nouns are not discrete, and they are not normally pluralized. Stative verbs do not describe events, and they do not normally occur in the progressive. There is a strong correlation between meaning and form, as we would expect from a linguistic sign.
The fact that we use all sorts of words in contexts typical of all sorts of word classes is a result of our capacity for abstraction. We can think of an abstract concept like a right in terms of a discrete object and use it in discourse as though it were one. Likewise, we can think of it as a property and use it as an adjective. Some researchers therefore prefer a discourse-based definition of word classes, e.g. Hopper & Thompson (1984).
- Noun: discrete discourse entity
- Verb: discrete discourse event
- Adjective: discrete discourse property
2.4 Homework
This week, your task is to pick your topics for presentations and homework. You should have all the course literature by now (if not, see previous homework). There are 10 different papers to pick from, including the ones from the weeks when there is no session due to holiday. Additionally, the 2 project days toward the end of the class can be picked if you’d rather find and present your own topic (e.g. as part of your portfolio). You can only pick one project day (or none), and it would automatically become your main presentation.
The task for the presentation is the following: Do a reverse bibliography search, and find an empirical paper that references the main reading of the week. Provide a short summary outlining the research question, hypotheses, linguistic phenomenon and data; and discuss the relationship to the main reading. Prepare a slide presentation.
- group size: 2–3
- length: < 25min (total)
- due date: date of the presentation
The presentations shouldn’t exceed 25 minutes. It is up to you to decide on how to use the time within your group. Shorter is allowed (within reason ;)).
Please pick at least 3 of the dates as preferences on this Doodle: Click here! You can mark your preferred choice by choosing “maybe” for the others. I will try to accommodate your preferences as much as possible, but no guarantees. Your groups will be determined by the overlap of your preferences. If you have a preferred group or partner, you can shoot me an email, and I’ll try to keep you together.
2.5 Tip of the day
Today’s tip is from the category: Things I wish I had learned before my Bachelor Thesis. In short: Tiwilbemba.
Take every .pdf you download or get from your instructors and archive it with a naming scheme you can remember easily. Scans from books and collections are an especially invaluable resource since not everything is digitized.
My suggestion: lastname_year_keyword: e.g. Deignan_2005_Metaphor.pdf
Also: get the info for a bibliography entry as soon as you read a text. Platforms like Primo and Google Scholar provide bibliography entries in various styles and formats. In a future installment of Tiwilbemba I will discuss the benefits of tools like BibTeX, Mendeley, EndNote …
~15s invested per text → hours saved in the long run.