13 Project Day 2

Now that we know that the plural possessive is underrepresented, we can dig deeper. We looked at the frequencies of the plural vs singular forms and found a significant association. What we can do to further explore this pattern is apply the same statistical logic to individual words. The words in question here are the possessors for which we have already got frequency lists.

13.1 Collexeme analysis

Methods that use association measures to compare lexemes within a construction slot fall under the umbrella of Collostruction Analysis (Stefanowitsch & Gries 2003). This involves calculating an association value for each lexeme and, depending on the metric, can be simple enough to do even in Excel. All you need to do is enter the four values of a contingency table, like the one in the last chapter, in one column each and enter the formula for the index. This demonstration, however, is going to use the statistical programming language R.
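To make the arithmetic behind such an index concrete, here is a minimal sketch in base R that builds the contingency table for a single word and computes the log-likelihood ratio from it. All four counts are invented for illustration, and the variable names are just placeholders.

```r
# Hypothetical counts, invented purely for illustration
o11 <- 120      # the word inside the construction
f1  <- 3000     # the word anywhere in the corpus
f2  <- 50000    # all tokens of the construction
n   <- 1e6      # corpus size (here: all noun tokens)

# Observed 2x2 contingency table
observed <- matrix(
  c(o11,      f1 - o11,
    f2 - o11, n - f1 - f2 + o11),
  nrow = 2, byrow = TRUE,
  dimnames = list(word = c("yes", "no"), construction = c("yes", "no"))
)

# Expected frequencies under independence: row total * column total / n
expected <- outer(rowSums(observed), colSums(observed)) / n

# Log-likelihood ratio (G2); cells with 0 observations contribute 0
g2 <- 2 * sum(ifelse(observed > 0, observed * log(observed / expected), 0))
g2
```

The same four cells and the same formula are what you would type into a spreadsheet; R just lets us repeat the calculation for thousands of words at once.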

Before we move on with R, however, we need two more values. We already have the observed frequencies of our nouns in the possessive construction. We also need the overall frequencies of these nouns in the corpus. So back to CQP:

BNC
nouns = [pos = "NN[12].*"]
count nouns by hw %c on match > "poss_all.txt"

Once we have created this new file and downloaded it, we can start doing our calculations in R.

I have written a package that, among other things, can be used for Collostruction Analysis. To install and load it, launch R and enter the following three commands at the prompt:

install.packages("remotes")
remotes::install_github("alex-raw/occurR")
library(occurR)

Now we have everything set up and ready to go. First, we need to tell R where the files are. Each of the following three commands lets you pick one file from a dialogue box: the singular possessive list, the plural possessive list, and the overall noun list, in that order. If it doesn’t work and you’re on Windows, try choose.files() instead. If neither works, you need to figure out the file path for each file yourself and enter it in quotation marks: "/home/mydata/myfile.txt"

path_s_sg <- file.choose()
path_s_pl <- file.choose()
path_all <- file.choose()

Next we are going to import the data. CQP produces TSV files, which are like CSV files but with tabs instead of commas. Therefore, we tell R that the separator is a tab (\t). We also need to switch off quoting with quote = "", otherwise R would treat apostrophes and quotation marks in the data as the start of a quoted string and stop recognizing the tabs inside it as separators. The columns get proper names (f for frequency) after the merge.

s_sg <- read.table(path_s_sg, sep = "\t", quote = "")
s_pl <- read.table(path_s_pl, sep = "\t", quote = "")
all_nouns <- read.table(path_all, sep = "\t", quote = "")
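If you want to see what the quote setting does without touching your real files, here is a small sketch with made-up lines in the shape of CQP's count output (frequency, tab, word). The toy string and the object names are hypothetical.

```r
# Made-up lines in the shape of CQP's `count` output: frequency, tab, word
toy <- "12\to'clock\n7\tfriend\n3\tpeople\n"

# With quote = "" every line becomes exactly one row with two columns
ok <- read.table(text = toy, sep = "\t", quote = "")
ok

# With the default quoting, the apostrophe may be read as the start of a
# quoted string; what exactly happens depends on the data, so the call is
# wrapped in try() and the result is only printed for comparison
try(print(read.table(text = toy, sep = "\t")))
```

Comparing the two printed tables on your own data is a quick way to check that no rows were swallowed during the import.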

Now we have our three frequency lists bound to the three variables s_sg, s_pl, and all_nouns. We need to join the overall frequency list with each possessive frequency list, merging them into one big table per construction. The merge function does this for us, and we tell it to merge by column 2, which holds the words in each frequency list.

both_sg <- merge(s_sg, all_nouns, by = 2)
both_pl <- merge(s_pl, all_nouns, by = 2)
both_sg[is.na(both_sg)] <- 0
both_pl[is.na(both_pl)] <- 0
names(both_sg) <- c("word", "f_poss", "f_all")
names(both_pl) <- c("word", "f_poss", "f_all")
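To see what merge does to the columns, it can help to try it on a toy example first. The two small data frames below are invented and only mimic CQP's column order (V1 = frequency, V2 = word).

```r
# Toy frequency lists in CQP's column order (V1 = frequency, V2 = word);
# all numbers are invented
poss_toy <- data.frame(V1 = c(40, 25), V2 = c("friend", "year"))
all_toy  <- data.frame(V1 = c(900, 1500, 300),
                       V2 = c("friend", "year", "table"))

# Merge on column 2 (the word). The key column moves to the front, the
# two frequency columns get the suffixes .x and .y, and words missing
# from one list are dropped unless all = TRUE is set
merged <- merge(poss_toy, all_toy, by = 2)
names(merged)

# Rename in the resulting order: key, frequency from x, frequency from y
names(merged) <- c("word", "f_poss", "f_all")
merged
```

This is why the new names are given in the order word, f_poss, f_all: the word column comes first after the merge, followed by the frequency from the first table and then the frequency from the second.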

We should double-check if this was successful. This step is optional. The lists are going to be alphabetically sorted after the merge. To better compare with the CQP output, let’s sort them by frequency (column 2) again and inspect the first 10:

both_sg <- both_sg[order(both_sg$f_poss, decreasing = TRUE), ]
head(both_sg, 10)
# same for pl

If everything looks right, we are ready to run a collexeme analysis.

sg_coll <- coll_analysis(both_sg, o11 = f_poss, f1 = f_all, flip = "ll")
pl_coll <- coll_analysis(both_pl, o11 = f_poss, f1 = f_all, flip = "ll")
# to use MI, do coll_analysis(both_sg, fun = "mi")

These two variables contain a list of log-likelihood ratio values that are positive if there is statistical attraction and negative if there is statistical repulsion. We could also use M(utual) I(nformation), as in some of the past readings. There are numerous association measures, and all of them have their advantages and disadvantages (for more detail, see Evert 2005). Log-likelihood has emerged as a quasi-standard in corpus linguistics.
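The difference between attraction and repulsion comes down to whether a word occurs more or less often in the construction than expected. The following sketch shows the general idea for a single word in base R; it is not how occurR implements it, and the function name and all counts are hypothetical.

```r
# Sketch of a signed log-likelihood and MI for one word; the function
# name and all counts below are hypothetical
assoc_sketch <- function(o11, f1, f2, n) {
  # observed and expected cells of the 2x2 contingency table
  obs  <- c(o11, f1 - o11, f2 - o11, n - f1 - f2 + o11)
  expd <- c(f1 * f2, f1 * (n - f2), (n - f1) * f2, (n - f1) * (n - f2)) / n
  g2   <- 2 * sum(ifelse(obs > 0, obs * log(obs / expd), 0))
  c(
    ll = if (obs[1] < expd[1]) -g2 else g2,  # negative sign marks repulsion
    mi = log2(obs[1] / expd[1])              # MI: log2 of observed / expected
  )
}

# Expected frequency in both calls is 3000 * 50000 / 1e6 = 150
assoc_sketch(o11 = 400, f1 = 3000, f2 = 50000, n = 1e6)  # more than expected
assoc_sketch(o11 = 20,  f1 = 3000, f2 = 50000, n = 1e6)  # fewer than expected
```

Both measures agree on the direction here, but they rank words differently: MI only looks at the ratio of observed to expected, while log-likelihood also takes the amount of evidence into account.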

Let’s inspect the elements with the strongest association for the plural form:

# sort
pl_coll <- pl_coll[order(pl_coll$ll, decreasing = TRUE), ]

# top 30
head(pl_coll, 30)

We can see two clear patterns. Most of the strongly attracted lexemes are words for humans. The second lexical field that we can see is words for time increments, like year and month. Only further down the list do we find words that don’t fit those two categories. If we repeat this with the singular data set, we find a similar pattern. In addition, however, there are collective nouns, like company, that are often used metonymically to refer to individual humans. Collective nouns do not seem to show up in the plural form. This could be a first lead on why there are fewer plural possessives.

13.2 Distinctive collexeme analysis

Comparing those two lists is not the best we can do. We can directly compare the two constructions in a variant of Collostruction Analysis called distinctive collexeme analysis. So we just need to merge our original singular and plural data and feed it into the same function again, only slightly differently.

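# f_all is the same for a word in both tables, so we merge on it as well;
# otherwise words found only in the plural list would end up with f_all = NA
all_poss <- merge(both_sg, both_pl, all = TRUE, by = c("word", "f_all"))
names(all_poss) <- c("word", "f_all", "sg", "pl")
# words attested in only one construction have a frequency of 0 in the other
all_poss[is.na(all_poss)] <- 0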
all_poss <- all_poss[order(all_poss$f_all, decreasing = TRUE), ]

distinctive_coll <- coll_analysis(all_poss, o11 = sg, f1 = f_all, f2 = sg + pl, flip = "ll")
distinctive_coll <- distinctive_coll[order(distinctive_coll$ll, decreasing = TRUE), ]

head(distinctive_coll, 30)
tail(distinctive_coll, 30)

The distinctive collexeme analysis takes into account how strongly associated a word is with the other construction. We can see a pattern emerge a bit more clearly. All non-human possessors seem to be avoided in the plural possessive. Some words that are strongly associated with singular possessives, such as company and country, are underrepresented in the plural form. At the bottom of the list, we also have mass nouns and nouns that are generally uncommon in the plural, like world. Naturally, these words will be negatively associated with the plural possessive and are part of the reason why it’s underrepresented.
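The logic of comparing two constructions directly can be sketched for a single word with a small contingency table in which the second construction takes the place of the rest of the corpus. This is the general textbook layout, not necessarily the exact table that coll_analysis builds from the arguments above, and all counts are invented.

```r
# Invented counts for one word in the two constructions
sg_word  <- 300     # the word as singular possessor
pl_word  <- 20      # the word as plural possessor
sg_total <- 60000   # all singular possessive tokens
pl_total <- 15000   # all plural possessive tokens

# Rows: this word vs. all other words; columns: the two constructions
tab <- matrix(
  c(sg_word,            pl_word,
    sg_total - sg_word, pl_total - pl_word),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("word", "other words"), c("sg", "pl"))
)

# Expected frequencies if the word had no preference for either form
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)

# TRUE means the word leans towards the singular possessive
tab[1, 1] > expected[1, 1]
```

A word that is frequent in both constructions can still come out as distinctive for one of them, because only its share relative to the other construction matters here.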

If anything, our exploratory research has brought up more questions than answers (which is normal). We can hypothesize that the factors from Rosenbach (2003) are simply stronger for the plural possessive, especially animacy.

13.3 Where to go from here

We have only looked at the possessor slot of the construction. We could also look at the noun phrase representing the possessed. Looking at the possessor and possessed simultaneously can be done in a co-varying collexeme analysis.

In any case, at this point in our research, we would go back to the literature and look for conceptual motivations for the differences in the collexeme lists.