3 Form and meaning

3.1 Recap: The Scientific Process

In this week’s class, we brainstormed about irregular verbs. We established the research question of which verbs are irregular, and gathered some ideas that we formulated as hypotheses. These are the first steps of approaching a topic scientifically.

  1. Mono-syllabic words tend to be irregular.
  2. Loanwords tend to be regular.
  3. Common words tend to be irregular.
  4. Modal verbs tend to be irregular.

What we did not do in class, due to time limitations, was gather information from the literature. Even so, with a bit of background knowledge, we can reason about these hypotheses a bit more. Number (4) should probably be handled separately, because modal verbs form a (sub-)class of their own and differ from lexical verbs in more than one respect. It is important to reduce complexity wherever possible. Next, some correlation is to be expected between “commonness” in (3) and loanword status in (2): loanwords should be less frequent than native Germanic verbs. English, however, is so heavily influenced by other languages that we would need to test whether borrowing and “commonness” are actually correlated. There is also an argument to be made that borrowed nouns are often irregular (cactus, alumnus, …), so why should borrowed verbs necessarily be regular? In practice, it might even be necessary to exclude recent loans altogether in order to narrow the question down further. As you can see, this research question already holds the potential for at least three topics/papers. Finally, (1) and (3) are probably also correlated, since short words tend to be the most common words. Our strongest candidate, therefore, is commonness.

The next step in our process should be to conceptualize “commonness”: define it and find reasons why we would even expect it to matter. For now, all we need to know is that language comprehension and performance are tied to memory, which is very sensitive to repetition, and therefore to the commonness of a stimulus. Even with the best data set in the world, we cannot measure “commonness” directly, so we need to operationalize the concept, i.e. make it measurable. We can count occurrences in a representative text corpus. Frequency can be an indicator of commonness, but remember that it is not the same thing: the former is a tool, the latter a concept. Many misconceptions in science stem from indicators being mistaken for what they are designed to indicate.
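Operationalizing commonness as corpus frequency ultimately boils down to counting tokens. A minimal sketch in Python, with an invented toy text standing in for a representative corpus:

```python
from collections import Counter

def token_frequencies(text):
    """Count raw occurrences of each (lower-cased) token in a text."""
    tokens = text.lower().split()
    return Counter(tokens)

# A toy stand-in for a representative corpus.
corpus = "the cat sat on the mat and the dog sat by the door"
freq = token_frequencies(corpus)

print(freq["the"])  # 4
print(freq["sat"])  # 2
```

The resulting counts are the indicator; whether they track the underlying concept of commonness is exactly the operationalization question raised above.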

Of course, a question like this has been asked before, and we have a good idea what verbs become irregular. Here is a video on it: Click me

In the following chapter, the aim is to provide some evidence for common and inherently quantitative statements about language, such as saying that one word is “more common” than another, or that one word class has more members than another. We will start by looking at word classes and try to provide evidence for a very simple hypothesis: there are closed word classes with a limited number of members and open word classes with significantly more members.

3.2 Parts of speech

3.2.1 Recap: Open and Closed Word Classes

The idea of open and closed word classes is the first we can quantify very easily with the help of corpus data. As opposed to a closed word class, an open word class should have a lot more members. Let’s first recap what types of word classes we know.

Open word classes
  • Nouns: time, book, love, kind
  • Verbs: find, try, look, consider
  • Adjectives: green, high, nice, considerate
  • Adverbs: really, nicely, well

Closed word classes
  • Modal verbs: may, might, can, could, …
  • Pronouns: I, you, she, they, mine, …
  • Determiners: the, a(n), this, that, some, any, no, …
  • Prepositions: to, in, at, behind, after, …
  • Conjunctions: and, or, so, that, because, …
  • …

Closed word classes rarely accept new members. One rather recent addition to the class of pronouns might be singular they. Closed word classes are also mostly invariant in that they do not take inflection. Neither of these properties is logically necessary. You could imagine more pronouns: some languages have a dual in addition to singular and plural (e.g. Classical Arabic), or a distinction between inclusive and exclusive we (several Polynesian languages). Yet the class of pronouns is rather fixed.

Lexical vs. function words
  • Auxiliary verbs: be, have, (get, keep)
  • Lexical verbs: eat, sleep, repeat, …

These first observations about word classes lead us to our core hypothesis for this week. Closed word classes have fewer members than open word classes.

3.2.2 PoS-Tags

Figuring out the word class of each word is done with part-of-speech taggers. Tools like the TreeTagger (Schmid 2013) can determine word classes with an accuracy of around 95% (Horsmann, Erbs & Zesch 2015). Even though this is good enough for most purposes, you have to bear in mind that automatic annotation is error-prone and can cause spurious patterns that have to be accounted for. We will encounter such cases in future sections.

PoS-tags
  • annotation for word class available in most corpora
  • automatically assigned

3.3 Types and Tokens

3.3.1 Word boundaries

We have to make a first technical distinction at this point. We need to decide what we count as a word. In corpus linguistics, the word model most commonly encountered is the token. A token has a very rough and technical, yet simple definition.

Token
  • character sequences in between spaces

The emphasis here lies on character sequence. If we group identical character sequences into classes and count those classes, we are dealing with the related concept of type.

Type
  • class of identical tokens

Note that neither relying on spaces nor on orthographic characters is ideal in all circumstances. The terms type and token are sometimes also used much more abstractly: you could understand types and tokens as “words” regardless of spelling conventions. This, however, requires more work both when defining word and later when working with the data.

The process of breaking up text into “words” is called tokenization. Most corpora are designed so that punctuation marks are treated as individual tokens. Clitics such as the possessive ’s, and contracted forms of auxiliary verbs such as ’ll, ’ve, ’re, are also treated as separate tokens. This decision might run contrary to the definition of word you are working with.
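A simplified tokenizer along these lines can be sketched in a few lines of Python. The rules below (splitting off punctuation and a handful of clitics) are a rough approximation of what real corpus pipelines do, not a faithful reimplementation of any particular one:

```python
import re

def tokenize(text):
    """Naive tokenizer: split off punctuation and some common clitics
    (possessive 's, contracted 'll/'ve/'re/n't) as separate tokens,
    then split on whitespace."""
    text = re.sub(r"([.,!?;:])", r" \1", text)
    text = re.sub(r"('s|'ll|'ve|'re|n't)\b", r" \1", text)
    return text.split()

tokens = tokenize("The cat's owner says she'll come, but the cat won't.")
types = set(t.lower() for t in tokens)

print(tokens)
print(len(tokens), "tokens,", len(types), "types")  # 15 tokens, 13 types
```

Note how won't comes out as wo + n't, mirroring the corpus convention described above, and how the token/type counts differ because the and cat each occur twice.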

How many words?

The concept of word is actually very hard to define and its definition depends on several factors. Consider the following data:

  1. living room

living room has a coherent meaning that is highly conventionalized and also culturally specific. It is not purely compositional. It contrasts paradigmatically with words like kitchen, attic, bathroom, which are either clearly monomorphemic or at least orthographically presented as one word. Semantically, you might therefore decide to consider it one unit rather than two. The same does not necessarily hold from a morphological perspective.

  1. mother-in-law

Semantically, we have a similar situation to the example above. However, we can observe that the plural can attach to the first component, thus mothers-in-law. We also find in-laws. The examples below demonstrate that there is some variation in where speakers feel the word boundaries are. Note that the Oxford English Dictionary (OED Online 2020) recognizes mother-in-laws as a rare variant.

  1. They wore it only because their mothers-in-law insisted. (BNC)
  2. I always thought it was mother-in-laws that cause the problem. (COCA)
  3. Angela sided with her new in-laws. (BNC)

Next we have fixed grammatical expressions, which are written as separate words, but mostly understood as one word:

  1. going to
  2. in spite of

going to is undoubtedly one word in spoken language (gonna). in spite of again contrasts paradigmatically with words spelt as one, such as despite. Prepositions and conjunctions in particular have rather arbitrary spacing in English; consider for example nevertheless and however.

In summary, the definition of word strongly depends on the point of view.
You might distinguish:

  • Orthographical words (mostly congruent with token)
  • Phonological words
  • Morphological words
  • Lemmas

3.3.2 Word classes in numbers

Now let’s turn back to our hypothesis that there are open and closed word classes. The evidence we need is counts for words and word classes. In an electronic corpus, the notion of orthographical word is the easiest to begin with. We basically count everything surrounded by spaces as a unit, a token. Below I show you the commands used to retrieve the data from our version of the British National Corpus (The British National Corpus 2007). You don’t need to worry about them just yet. In the first lessons I will provide the data and the numbers. The code might become interesting for you at a later stage, however.

BNC> [pos = "NN.*"]                         # get all tokens tagged as noun
BNC> count by hw > "noun_types.txt"         # count every lemma and save as .txt file
BNC> exit                                   # exit cqp and use wc -l (count lines)
$ wc -l noun_types.txt                      # repeat for other word classes, (V.*, AJ.*, AV.*, CJ.*)

In fact, we find that the open word classes do have considerably more types than closed word classes. Not a very exciting result, and not one we would necessarily need corpus linguistics for, but, nevertheless, our first empirical evidence for a linguistic concept.

PoS           Tokens       Types
Nouns         21,255,608   222,445
Verbs         17,870,538    37,003
Adjectives     7,297,658   125,290
Adverbs        5,736,409     8,985
Prepositions  11,246,423       434
Conjunctions   5,659,347       455
Articles       8,695,242         4

An observation that was not immediately apparent is that function words, though few in number, are individually very frequent.

  • Function words have a low type frequency
  • Function words have a high token frequency

In fact, the most frequent tokens in a corpus are function words. Below, I retrieved the 100 most frequent lemmas from the BNC.

$ cwb-scan-corpus BNC hw | sort -nr | head -100 > "bnc_lemma_freq.txt"

3.3.3 Considerations

There are more things to consider when counting word types. Words might only be spelt the same by coincidence, we might have words in multiple word classes, words with different senses, etc.

  • Homonyms
  • Polysemy
  • Conversion
  • Prototype theory

(…) the distinction between V[erbs], A[djectives], and N[ouns] is one of degree, rather than kind (…)

Ross (1972)

3.4 Lemma

What are all the grammatical forms of be, cut, tree, nice, beautiful?

  1. be, am, are, is, were, was, been, ’s, ’m, ’re, ?being
  2. cut, cuts, (cut, cut), ?cutting
  3. tree, trees, tree’s, trees’
  4. nice, nicer, nicest
  5. beautiful

A lemma comprises all the inflectional forms of a word. This includes forms with grammatical affixes (tree, trees) and suppletive forms (go, went). Not included are derivational affixes such as -ly. Of course, this requires a clear definition of inflection and derivation. Some researchers might argue that the participial -ing is derivational rather than inflectional. There is also the issue of whether the past participle of verbs like cut is to be seen as a separate form or not.

When it comes to the technical side of research, you have to be aware of the decisions taken when lemmatizing corpus data as to what counts and what doesn’t. A lemma in a corpus is not necessarily equal to a lexeme as a linguistic concept.
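To make the lemma/form distinction concrete, here is a minimal lemmatization sketch in Python. The tiny lemma table is hand-made for illustration; real corpora rely on large lexicons and taggers:

```python
# Hand-made lemma table (illustrative only); note the suppletive form "went".
LEMMAS = {
    "am": "be", "is": "be", "are": "be", "was": "be",
    "were": "be", "been": "be", "being": "be", "be": "be",
    "went": "go", "goes": "go", "gone": "go", "going": "go", "go": "go",
    "trees": "tree", "tree": "tree",
}

def lemmatize(token):
    """Map an inflected form to its lemma; unknown forms map to themselves,
    so a derivation like "nicely" is NOT collapsed into "nice"."""
    return LEMMAS.get(token.lower(), token.lower())

print([lemmatize(w) for w in ["went", "Trees", "is", "nicely"]])
# ['go', 'tree', 'be', 'nicely']
```

The fallback for unknown forms also illustrates the decisions mentioned above: whoever builds the table decides what counts as inflection.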

3.4.1 Distribution

Information about the frequency of a word or its forms can already be very informative. We can extend this idea easily and look at the larger units a word or lemma occurs in. Frequency information about the distribution is much more complex, but is based on the same underlying concepts and measured with the same tools.

As we will see in the upcoming reading, Justeson & Katz (1991), the distribution of adjective pairs plays a crucial role in the formation of antonym pairs. There, the deciding factor is whether they occur together in the same context: different form, same context. We could flip this around and look at words with the same form that occur in wildly different contexts. A special case of this is homonymy.

How can we find out if something is a homonym if we do not know the meaning or want to keep intuition out of the picture?

Animal or sport utensil?

  1. Maybe I’m a fruitarian bat
  2. … with a straighter bat than some of the Englishmen
  3. The unfortunate starved bat was then returned
  4. And not simply a bat, but an autographed bat

(examples from The British National Corpus 2007)

BNC> [pos = "AJ.*"] []? [hw = "bat"]

In this example, the grammatical structure is similar: we find attributive adjectives preceding bat, which is typical for nouns. However, the meaning of the adjectives provides enough context to disambiguate the two uses of bat. The lexicon is structured by both grammar and meaning. If you expanded this to more co-occurrence patterns, e.g. with verbs or even different text types, two clearly distinct patterns would emerge. The animal bat eats, like other animals, whereas the utensil bat strikes, like other club-like devices. A giraffe rarely strikes, and a tennis racket doesn’t eat. Each is part of a distinct lexical field. Distribution plays a defining role in the structure of those fields and, therefore, of our lexicon.
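The idea of comparing co-occurrence patterns can be sketched very simply. The verb lists below are invented toy observations (not real corpus data), standing in for verbs found near bat in two text types:

```python
from collections import Counter

# Invented toy observations: verbs co-occurring with "bat" in two text types.
wildlife_verbs = ["eat", "fly", "sleep", "eat", "hang"]
cricket_verbs = ["strike", "swing", "hit", "strike", "hold"]

wildlife = Counter(wildlife_verbs)
cricket = Counter(cricket_verbs)

# The two co-occurrence profiles share no verbs at all: distributional
# evidence for two distinct senses of "bat".
shared = set(wildlife) & set(cricket)
print(len(shared))  # 0
```

With real data the overlap would rarely be exactly empty, but the same logic scales up: the smaller the shared distribution, the stronger the case for two separate lexical items.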

3.4.2 Association

A key component of human memory is association. The lexicon is organized in associative networks. What we perceive together frequently, we associate as belonging together. This is also referred to as spatial or temporal contiguity.

  1. law and …?
     • order
  2. good or …?
     • bad, evil
  3. the number of the …?
     • ??beast
  4. spoils of …?
     • ??war

The first words that come to mind when you read the first two fragments are most likely law and order and good or bad. For the other two examples, more variation is to be expected. A metal fan might readily come up with beast, since the song of the same name is part of their cultural experience and, therefore, very frequent for them. spoils of war might not be a phrase that everyone is familiar with at all. spoils as a word is very rare; yet there is a strong association with the phrase: if it is encountered, it occurs together with war more often than not.
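The strength of such an association can be approximated with a simple frequency ratio. The counts below are invented for illustration; real numbers would come from a corpus:

```python
# Invented counts for illustration; real numbers would come from a corpus.
count_spoils = 40          # all occurrences of "spoils"
count_spoils_of_war = 30   # occurrences inside the phrase "spoils of war"

# If most tokens of a rare word are bound to one phrase, the association
# between word and phrase is strong ("more often than not" = ratio > 0.5).
ratio = count_spoils_of_war / count_spoils
print(ratio)  # 0.75
```

More sophisticated association measures exist, but they all build on raw co-occurrence counts like these.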

3.5 Frequency and memory

3.5.1 Common and uncommon vowels

In order to illustrate some basic frequency effects (frequency as in counts, not pitch), we had a little experiment in class today with German vowels.

Let’s consider a subset of the German monophthongs with relatively consistent phonetic spellings. We’re taking orthography as an approximation for pronunciation here.

Experimental task:
Within one minute, find as many words as you can that begin with one of the characters.

Grapheme   Token count (LCC)   Type count   This course   Average across courses
〈a〉        376,588             22,915       10            20.3
〈e〉        343,636             16,407       32            18.8
〈i〉        258,792              6,614       23            12.0
〈u〉        191,164              8,122       22            12.0
〈o〉         47,160              5,062       10            11.8
〈ü〉         22,095                872       13            11.2
〈ö〉          2,209                100        8             6.3
# CQP query example: words beginning with "a", ignoring case
LCC-DEU-NEWS-2010;
a = "a.+" %c;

# for token counts
size a;

# for (case-insensitive) type counts, piping into the external program "wc" to count lines
count a by word %c > "| wc -l";

The expected outcome: people find the most words with a, then i and u, and far fewer with ü and ö. The groups brainstorming for the more common vowels had many more distinct words to choose from, which they are also likely to have encountered more often. If you asked learners of German how difficult they find the pronunciation of each of these vowels, you would see a (negative) correlation with the frequency with which those vowels appear in corpus data.

Furthermore, we can observe that front rounded vowels are rare across languages (Maddieson 2013). But why are those vowels so much rarer in the first place?

There are three possibilities:

  • There is a mistake in our approach to counting.
  • It is a coincidence.
  • There is something categorically different about ü and ö.

Let’s assume the last is the case. What ü and ö have in common is that they are front rounded vowels. In fact, we have a pretty good idea about why they are special. In a nutshell: the frequency make-up (in the sense of pitch) of front rounded vowels is not as distinctive as that of other vowels. [a, i, u] are maximally distinct from each other, so (almost) all languages make a distinction between them. [i] and [e] are more similar in sound, yet still much more distinct than [i] and [y]. It is much more common for a language to make a distinction between the former pair than the latter. The exact cross-linguistic patterns and the interesting bio-physical reasons are, unfortunately, far outside the scope of this course. The important conclusion is that, with the help of corpus data, we found an interesting correlation that we could corroborate with other pieces of data, and that ultimately leads us to a fundamental property of language.

3.5.2 Confounding variables

There is a fundamental flaw in our operationalization of the concept of “vowel”: we measured vowel counts with orthographic characters. Here is a non-exhaustive list of what could lead to a systematic skew in our data:

  • e represents at least three different vowel phonemes: ə, e, ɛ
  • there is considerable overlap with the grapheme ä
  • the schwa realization ə of the grapheme e cannot occur at the onset of a word
  • there are common prefixes (an-, un-, über-, …) that cause many different types
  • e, i, a, and u occur in diphthongs, which are phonologically different
  • i, a, and u might represent different monophthongs (especially in loanwords)
  • ö and ü are sometimes transliterated as oe and ue

There are always many factors that could skew your data in one direction or another. In this case, the observed pattern is probably amplified by the variables above. A better operationalization would involve phonetic annotation so that we count the right thing. Ideally, you would control for those confounding variables, and if you can’t, judge the potential implications. One of the mottoes in science is: always try to prove yourself wrong.

3.6 Homework

Reading response:

  • While skimming through this week’s reading, gather 3 pieces of terminology that you weren’t familiar with, that confused you, or that you are unsure about.
  • Try to summarize the two most important conclusions in one sentence each.

Submit your answer via email or DM on Webex Teams.

3.7 Tip of the Day

Today: Multitasking

Learning an academic discipline takes a lot of time and focus. However, some aspects of it are like learning a language or a motor skill. It might sound weird, but knowledge, especially theoretical knowledge, is like a muscle you can train. So here is my suggestion for how to get better at Linguistics or Literary Studies or whatever science you are interested in: listen to lectures, talks, podcasts, and other content in the background.

Great topics to passively consume are:

  • Theory, e.g. Cognitive Linguistics
  • Philosophy of Science, highly interesting, vastly important, but oft neglected
  • Sciences that are not your major

Here are some activities I frequently use to bombard myself with knowledge.

  • weight or endurance training
  • practicing an instrument (especially repetitive technical exercises)
  • cooking
  • cleaning, tidying, building Ikea tables ;)

None of these activities require your full mental focus or involve long pauses, so your thoughts are free to meander through the depths of science. Nowadays, a lot of talks and even full lectures can be found online, and with online teaching taking off right now, there will be ever more.

Linguistics

Luckily, we are not the only university trying to teach you linguistics online. Here are some nice channels to binge watch both actively and passively.

  • Martin Hilpert: Has a variety of lectures and full courses on all things linguistics.
  • The Virtual Linguistics Campus: Old but gold.
  • People without YouTube channels but who are great lecturers: Adele Goldberg, Joan Bybee, George Lakoff, Geoffrey Pullum. I have found many of their lectures and interviews on various channels and platforms.
  • NativLang: Probably my favorite language channel. Animation videos on a variety of language related topics. Focus on Cross-Linguistics.

Other sciences

If you are a curious person, and if you appreciate the academic endeavor, chances are you are interested in other sciences, too. Knowing subjects outside the social sciences may help you in unexpected ways. Here are my go-to channels to listen to in the background.

  • mailab: Focus on (bio-)chemistry, but mostly deals with current debates on the media. You can learn a lot about how news outlets interpret and sometimes misrepresent scientific studies.
  • Closer To Truth: Philosophy. Dealing with the big questions. How do we know facts? Why should we trust in science? What are hypotheses and theories, and why bother?
  • StatQuest: Pleasantly cringey statistics videos.
  • zedstatistics: More in depth. (Less cringe. :( )
  • PBS Space Time: Astrophysics. Popular science without the usual dumbing down. Great stuff to listen to even if you understand nothing. :D
  • 3Blue1Brown: Mathematical concepts with animations instead of formulae. I was horrible at maths in school but I always had a sense that it is actually a very beautiful subject. Wish I had visualizations like these back then.
  • Computerphile: Various computer science topics

I have not yet explored the world of audio books and audio podcasts, but I’m sure there is a lot of great stuff out there.

If you discover anything, let me know! :)