11 Productivity
When it comes to derivational affixes, there are 3 concepts that become very important when you work with data.
- Analyzability
- Semantic transparency
- Productivity
The first aspect, is related to the question of whether a word is complex or not. There are monomorphemic (simple) and polymorphemic (complex) words based on whether you can break them up into component parts. This is straightforward in most cases:
- {Analyz}-{abil}-{ity}
- {Productiv}-{ity}
These two words are clearly composed of smaller meaningful units. Note that there is a lot of spelling variation at morpheme boundaries and there is also allomporphy. Don’t confuse the two. productive is missing its final 〈e〉 when it occurs with a suffix, but the phonological form changes. This is a purely orthographic convention. The change in stress from productive to productivity, however, does cause a change in the phonetic form: /pɹədˈʌktɪv/ → /pɹədəktˈɪv-/. Likewise, the change of the suffix {able} from /-ɪbɫ/ in analyzable to /-ɪˈbɪl-/ can be considered allomorphy.
Another dimension of complexity comes into play when encounter words in your data for which it isn’t even clear whether they are complex or not. Some might seem complex, but require meta-linguistic knowledge, i.e. explicit, learned knowledge about the language. Common cases are loanwords like: poltergeist, schizophrenia, statistics, To a German speaker poltergeist clearly seems polymorphemic because it is a compound in German. This knowledge, however, needs to be learned explicitly by a native speaker of English, because neither component is used on its own in English. schizophrenia is a similar case. You can analyze the component parts academically by looking up the etymological origin. Within the English language without explicit knowledge of Latin and Greek. statistics is trickier. There might be a degree of analyzability here because the suffix -ics is common enough in English, especially for academic disciplines. It is still different from regular derived words in that the root alone doesn’t really carry a conventional meaning, at least one conventional in English. The root in statistics can be considered semantically opaque, which is the opposite of semantically transparent.
Now let’s consider the following three examples:
- ceiling
- building (N)
- interesting
All three words have a very recognizable English -ing suffix. In that sense, they are clearly analyzable as polymorphemic. That means in a native speaker of English the different components might trigger multiple associations. They are still problematic examples if you are interested in derivations with -ing. ceiling has a similar problem to statistics because the root is not really transparent in meaning. /siːl/ does exist in a handful of other English words, such as conceal. So you could make the argument that it is slightly more transparent than statistics. This illustrates that all concepts discussed here are matters of degree. building and interesting are both more transparent in meaning and analyzable. You might still want to treat them separately if you look into -ing as derivational suffix because they are strongly lexicalized. That means they are unlikely to be perceived compositionally, i.e. we have them stored as a whole in our lexicon and there is little to no analogical derivation happening.
Finally, morphemes vary in terms of and how widely they are spread across the lexicon and how commonly they are used to derive new words. A highly productive derivational affix is used with a large variety of roots. This is true of the regular inflectional affixes in English. -s is used as plural suffix on most nouns and it is also the most likely candidate for a neologism. An irregular plural like -en in oxen is more like a living fossil. It is limited to certain words and never used with new vocabulary.
In derivation, productivity is not as clear cut.
Let’s consider the adjectival suffix -ish as in the examples below, taken from the Corpus of Contemporary American English (COCA) with the query [word = ".+-ish"]
.
- I’m still feeling flu-ish.
- … for a nation that was a loyal-ish member of NATO.
- That was selfish rather than other-ish.
In modern spoken language these types of derivations feel quite common. They may not be considered standard or formal language, which is why you will often find them hivenated. Still, intuitively, -ish should be considered productive.
Let’s try to catch more -ish adjectives with a more general query.
It is a good idea to exclude capital letters with [^A-Z]
since a homonym of -ish frequently occurs with nationalities/language names: [word = "[^A-Z].+ish" & class = "ADJ"]
.
Let’s also look at some other suffixes used to derive adjectives and compare some numbers from the BNC with some minimal cleanup:
tokens | suffix | cqp query (BNC) |
---|---|---|
8196 | -ish | [word = "[^A-Z].+ish" & class = "ADJ"] |
168294 | -ous | [word = ".{2,}ous" & class = "ADJ"] |
1025618 | -al | [word = ".{2,}al" & class = "ADJ" & word != "real"%c] |
On a first glance, -ish is much less frequent in the BNC.
This could have something to do with the composition of the corpus (try to restrict to match.text_mode = “spoken”. Is there a difference?).
Normalizing relative to sample size does not help in this case because our sample is the same across.
We could also look for disproportional amounts of false positives due to our queries, but maybe this is not going to be necessary.
Frequency alone is not a sufficient measure for productivity.
We need to check how much of the vocabulary is covered as well.
One measure that can be used is the type-token ratio.
How many occurrences are there relative to how many different words they occur across?
We can get the type frequency by creating a frequency list with count
and counting how many lines it has.
A quick way to do this is to use an external command line tool called wc
with the -l
option for line count: count by word %c > "| wc -l"
.
Note: the following graphs were made with the statistical software called R. I provided the code just in case anyone is interested.
types | tokens | suffix |
---|---|---|
557 | 8196 | -ish |
1410 | 168294 | -ous |
6315 | 1025618 | -al |
The discrepancy in type counts is already much lower and when we take the ratio, we can see that all of a sudden -ish is far ahead. It might not be as frequent as the other two suffixes, but it is also relatively more evenly spread over different roots. You could also imagine the other suffixes being bound to fewer roots that are very high in frequency individually.
barplot(c(557, 1410, 6315) / c(8196, 168294, 1025618),
ylab = "type-token ratio",
names.arg = c("-ish", "-ous", "-al"))
An even better measure is the hapax-token ratio.
Instead of measuring the spread over different roots, we can measure the amounts of roots that only occur once.
This may serve as an indicator for the formation of neologism.
To get hapax frequency, we can add the following to our queries: [... & f(word) = 1]
.
We can observe the following result:
hapaxes: | types: | tokens: | suffix |
---|---|---|---|
271 | 557 | 8196 | -ish |
584 | 168294 | -ous | |
3438 | 1025618 | -al |
barplot(c(271, 584, 3438) / c(8196, 168294, 1025618),
ylab = "hapax-token ratio",
names.arg = c("-ish", "-ous", "-al"))
Even though -ish is comparatively infrequent, it occurs with a lot of types and seems to be used to form neologisms more often than the other two. This might indicate that it is more productive, and also that it is a more recent phenomenon because none of the uses have had the chance yet to become frequent individually. Another observation we can make is that while -al’s type-token ratio is higher than that of -ous, the hapax-token ratios are almost even. The data suggests that their productivity is similar, but that -al is more wide-spread in the lexicon, which makes intuitive sense.
11.1 Homework
We have discussed a variety of multi-word units. Consider the following list, think about the differences and provide one example each that illustrates the differences. Finally, create a CQP query for each of your examples that is as general as possible.
- Collocation
- Lexical Bundle
- Grammatical Framework
- Colligate
11.2 Tip of the day
When you write homework, essays, term papers, or even presentations, keep writing and formatting separated. Latex or Markdown (aka plain text writing) mentioned last week formalizes this idea. Pick a pre-made style or make your own document template, and stick to it. Don’t customize, don’t build from scratch. Keep your formatting at a bare minimum while you are focusing on carrying out the task. That doesn’t mean format badly! Rather, take your time to find or create a template at a different time. This way you can eliminate the one of the biggest form of distraction.
In academic writing across disciplines, all the different style guides you have to deal with might be overwhelming and confusing. But in the end, it can all be boiled down to just three key elements: text, data and references 9. References can and should be generated automatically.
Text should be arranged in coherent paragraphs. Try to package exactly one important argument per paragraph. Section headlines should have some specific formatting, so they can be used as key for a table of contents or cross-referencing. Your type setting tool of choice (Word for most) has a way to deal with this—learn it! Anything else should be taken care of by your template.
When it comes to presenting data, here are the only three elements you should bother with manually.
- Meta-linguistic reference: words and phrases as in-text examples in italics
- Listed examples: indented and in their own paragraph, consecutively numbered
- Tables and figures: keep it simple here, too. They need to have title, numbering and description. Don’t bother applying unnecessary visual effects, or having the text flow nicely around them. If the table or figure doesn’t fit, it belongs in the appendix. Most of those conventions you can see in action on this blog and in every reading. If you want to include a table from another paper, think twice whether it is necessary at all. If you are sure that it is, integrate it into your style, numbering and naming, and do not simply include a screenshot. Also, do not forget to reference it properly.
In a well written text, you don’t need any other visual emphasis, except maybe to highlight parts of listed examples in bold. Italics, underlined or colored text is otherwise unnecessary and unconventional. There are also long quotes, book or journal titles, footnotes and listings; however, it’s worth considering whether you actually need them. In most cases, you are better off skipping those.
Then, there are tables of content, citations and bibliographies, cross-references, lists of tables/abbreviations etc.; but here is a simple rule I learned the hard way: Never create these manually—never! There are ways to deal with citations and bibliographies automatically that allow you to apply whatever style your instructor or potential publisher requires. You can even change it from MLA to Unified on the fly if you need to.
In summary, keep things simple, be aware of the elements in your text, and don’t mix. Formatting can be a huge time sink and should be avoided.