12 Project Day 1

During the last few weeks of the course, we are going to put it all together and go through the steps needed to create our own project. We are doing this in class, and I am going to take your input rather than prepare anything artificial. You are going to see all the steps, considerations and obstacles involved, and we are going to develop our idea hands-on using the techniques from the homework assignments and tutorials.

The first step is to find an interesting phenomenon and formulate a research question that we can use as a basis. We are going to take the last reading as a starting point. There are always dimensions or edge cases that authors cannot analyze in any given paper. In Rosenbach (2003), the focus was on a comparison of the of-genitive and the s-genitive.

  1. my mother’s birthday
  2. the birthday of my mother

Rosenbach looked at choice contexts, i.e. contexts in which both constructions are possible. Categorical contexts, on the other hand, show a strong, rule-like tendency towards one of the two constructions. These categorical contexts are usually not too interesting, and analyzing them will boil down to summarizing the literature. It is more interesting to look at gray areas where the “rules” are not so clear.

12.1 Preliminary exploration

If you have a linguistic phenomenon you are interested in, there are crucial steps to take first:

  1. Read! A good starting point is finding the phenomenon in reference grammars (e.g. Huddleston & Pullum 2002; Quirk 2010). Handbooks are also good places to figure out the necessary terminology and get some references.
  2. Play with data.

We’ll skip right to step 2 here since we should already have a rough idea of what we are looking for after reading the paper. Let’s start with a very rough query to get a feel for the data. The choice of corpus is not yet important since we are looking at a very general grammatical phenomenon. Remember that clitics, i.e. the “short forms” of be and have as well as the possessive ’s, are usually treated as separate tokens.

BNC
"'s"

We can immediately see that there are false positives in our results. These artifacts are mostly the forms of the verbs be and have mentioned above. We need to adjust our query. Luckily, the POS-tagger has us covered.

[word = "'s" & pos = "POS"]

This takes care of the verbs, but what about plurals? Orthographically, plural possessives appear as a plain apostrophe (’). Let’s split our data along the dimension of number into plural and singular data sets.

singular = [word = "'s" & pos = "POS"]
plural = [word = "'" & pos = "POS"]
cat singular
cat plural

We have now saved the results as variables with variable = .... Let’s now have a closer look at the possessor, i.e. the noun phrase to the left of the possessive marker. Since we are looking at a clitic, and not a suffix, we do not always have nouns in that position. Consider the following:

  1. someone else’s child
  2. a friend of mine’s guitar
  3. the Attorney General’s request

I recommend that you get a feel for how frequent and how systematic those edge cases are. In the following, we will simply define our research object as NOUN + ’s for the sake of simplicity.

Looking at our singular data set, we might not spot the next issue from the concordance alone. Counting the possessors, however, reveals it.

count singular by word %c on match[-1]

match is the first matching token in our results, which is the possessive marker. You can add an index in square brackets to move left and right relative to it: match[-1] is one token to the left of the first match (also cf. cheatsheet). In the resulting frequency list, there are some very frequent nouns that are not singular but still take an overt ’s, like children’s and women’s. These are mostly nouns with irregular plurals, but we also have collective nouns, such as staff and police. The possessive in English basically has only one form, which is only expressed overtly when the noun doesn’t already have a plural -s ending. Nouns whose bases end in s are another source of variation here that we need to acknowledge, but probably won’t end up analyzing. The CLAWS tag set offers us three different tags for the number categories of nouns: “NN1” for singular nouns, “NN2” for plural nouns, and “NN0” for those that are not easily identified as either. Let’s adjust our queries accordingly.

singular = [pos = "NN1.*"] [pos = "POS"]
plural = [pos = "NN2.*"] [pos = "POS"]
count singular by word %c on match
count plural by word %c on match

We no longer want to restrict the form of the possessive marker. We also want the preceding token to be either clearly singular or clearly plural, respectively. The wild card .* allows for ambiguity tags like NN1-NP0 in cases where the tagger is not sure (in this case whether it’s a common singular noun or a proper noun). Our count command now counts on match because we’re including the position before the possessive marker in our query. Comparing the two frequency lists shows that there are mostly words for humans and words for time intervals among the most frequent possessors. This leads us to the question of whether there are any differences between the two forms.
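The effect of the wild card can be tried out in R before running the queries. The following is a small sketch, not part of the original queries: the vector tags is a hand-picked illustration of CLAWS5 tags, and the ^ and $ anchors are added on the assumption that CQP matches a pattern against the whole tag.

```r
# A hand-picked set of CLAWS5 tags, including ambiguity tags
# (assumption: the first half of an ambiguity tag is the more likely reading)
tags <- c("NN1", "NN1-NP0", "NN1-VVB", "NP0-NN1", "NN2", "NN2-VVZ", "NN0")

# CQP is assumed to match the pattern against the entire tag,
# so the R equivalent needs explicit ^ and $ anchors
is_singular <- grepl("^NN1.*$", tags)
is_plural   <- grepl("^NN2.*$", tags)

# Overview: which tags each pattern picks up
data.frame(tag = tags, singular = is_singular, plural = is_plural)
```

Note that a tag like NP0-NN1, where the singular noun reading is only the second choice, is not matched by NN1.*, and NN0 is matched by neither pattern.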

12.2 Finding a research question

We discovered in class that Rosenbach only mentions that the s-genitive tends to be avoided with plural forms (2003: 384). Our corpus queries show us that this is indeed just a tendency and cannot be a strong rule. We can actually test this hypothesis quite easily. In addition to what we already have, we need the number of all singular nouns [pos = "NN1.*"] and the number of all plural nouns [pos = "NN2.*"]. There are 15430546 singular nouns and 5324559 plural nouns overall, and there are 177502 singular possessors and 50492 plural possessors. The code used below is written in R, which we are going to use in the next two weeks (see homework below).

possessive <- cbind(
  singular = c(s = 177502, other = 15430546 - 177502),
  plural   = c(s = 50492,  other = 5324559 - 50492)
)
poss_test <- chisq.test(possessive)
poss_test$observed - poss_test$expected

The expected frequencies show us that if the possessive marker were distributed independently of number, there should be about 7998 more cases of plural + possessive marker than we actually observe. The p-value is also below 0.05, which means the result is statistically significant. We can therefore confirm that plural possessives are underrepresented. Not by much though, so this might be worth investigating.
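To see where the figure of roughly 7998 comes from, the expected counts can be recomputed by hand from the row and column totals. The following is a minimal sketch that only uses the four counts given above; the object names (counts, expected, diff_obs_exp, rate, odds_ratio) are chosen here for illustration, and the odds ratio is an addition that is not part of the analysis above, included as one possible way to put a number on "not by much".

```r
# Observed counts from the BNC queries above
counts <- rbind(
  s     = c(singular = 177502, plural = 50492),
  other = c(singular = 15430546 - 177502, plural = 5324559 - 50492)
)

# Expected count per cell under independence:
# (row total * column total) / grand total
expected <- outer(rowSums(counts), colSums(counts)) / sum(counts)

# Observed minus expected (negative = fewer cases than expected)
diff_obs_exp <- counts - expected
round(diff_obs_exp)

# Share of nouns followed by a possessive marker, per number category (in %)
rate <- counts["s", ] / colSums(counts)
round(100 * rate, 2)

# Odds ratio as a simple effect size (assumption: not used in the text above)
odds_ratio <- (counts["s", "plural"] / counts["other", "plural"]) /
  (counts["s", "singular"] / counts["other", "singular"])
odds_ratio
```

With these counts, about 1.15% of singular nouns and about 0.95% of plural nouns are followed by a possessive marker. Note that chisq.test applies a continuity correction to 2×2 tables by default; this changes the test statistic slightly, but not the expected counts.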

We can now formulate a first draft of our research question:

  • What are the differences in the choice of singular or plural possessors in the s-genitive construction?

We could also ask:

  • Why is the plural underrepresented, and when is it (not) used?

12.3 Homework

During the project days, we are going to pick up where we left off. So please make sure you have caught up with everything above. You’ll also need to download and install a piece of statistical software.

  1. Download and install R. You can find the necessary links here
  2. Carry out the queries above and save the singular and plural data sets to your personal server space using the commands below.
  3. Download the two text files. See this link.

BNC
singular = [pos = "NN1.*"] [pos = "POS"]
plural = [pos = "NN2.*"] [pos = "POS"]
count singular by hw %c on match > "poss_s_sg.txt"
count plural by hw %c on match > "poss_s_pl.txt"
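Once the two files are downloaded, they can be read into R. The following is only a sketch under assumptions: read_cqp_count is a hypothetical helper (not an existing function), and it assumes that each line of the saved count output starts with the frequency, followed by the counted tokens and optionally a match range in square brackets. Check the actual files and adjust the patterns if they look different. The toy input uses invented numbers purely to show the shape of the result.

```r
# Hypothetical helper: parse saved CQP "count" output into a data frame.
# Assumption: each line is "<frequency> <counted tokens>", optionally
# followed by a match range such as [#0-#41].
read_cqp_count <- function(path) {
  lines <- readLines(path, encoding = "UTF-8")
  # keep only lines that start with a number
  lines <- lines[grepl("^\\s*[0-9]+\\s", lines)]
  # drop the trailing match range, if there is one
  lines <- sub("\\s*\\[#[0-9]+-#[0-9]+\\]\\s*$", "", lines)
  data.frame(
    freq = as.integer(sub("^\\s*([0-9]+)\\s+.*$", "\\1", lines)),
    item = trimws(sub("^\\s*[0-9]+\\s+", "", lines)),
    stringsAsFactors = FALSE
  )
}

# Toy input with invented numbers, only to illustrate the assumed format
toy <- tempfile(fileext = ".txt")
writeLines(c("42\tyear 's  [#0-#41]", "7\tweek 's  [#42-#48]"), toy)
poss_toy <- read_cqp_count(toy)
poss_toy
```

With the real data, the call would be something like read_cqp_count("poss_s_sg.txt"), run from the folder the files were downloaded to.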