12 Project Day 1
During these last few weeks, we are going to put it all together. We’re going to go through the steps necessary to create our own project. We are doing this in class, and I’m going to take your input rather than prepare anything artificial. You are going to see all the steps, considerations, and obstacles involved, and we are going to develop our idea hands-on, using all the techniques from the homework assignments and tutorials.
The first step is to find an interesting phenomenon and formulate a research question that we can use as a basis. We are going to take the last reading as a starting point. There are always dimensions or edge cases that authors cannot analyze in any given paper. In Rosenbach (2003), the focus was on a comparison of the of-genitive and the s-genitive.
- my mother’s birthday
- the birthday of my mother
The study looked at choice contexts, i.e. contexts in which both constructions are possible. Categorical contexts, on the other hand, show a strong, rule-like preference for one of the two constructions. These categorical contexts are usually not too interesting, and analyzing them boils down to summarizing the literature. It is more interesting to look at gray areas where the “rules” are not so clear.
12.1 Preliminary exploration
If you have a linguistic phenomenon you are interested in, there are crucial steps to take first:
- Read! A good starting point is finding the phenomenon in reference grammars (e.g. Huddleston & Pullum 2002; Quirk 2010). Handbooks are also good places to figure out the necessary terminology and get some references.
- Play with data.
We’ll skip right to step 2 here since we should already have a rough idea of what we are looking for after reading the paper. Let’s start with a very rough query to get a feel for the data. The choice of corpus is not yet important since we are looking at a very general grammatical phenomenon. Remember that clitics, i.e. the “short forms” of be and have and the possessive ’s, are usually treated as separate tokens.
BNC
"'s"
We can immediately see that there are false positives in our results. These artifacts are mostly the forms of the verbs be and have mentioned above. We need to adjust our query. Luckily, the POS-tagger has us covered.
[word = "'s" & pos = "POS"]
This takes care of the verbs, but what about plurals? Orthographically, plural possessives appear as plain ’. Let’s split our data along the dimension of number into plural and singular data sets.
singular = [word = "'s" & pos = "POS"]
plural = [word = "'" & pos = "POS"]
cat singular
cat plural
We have now saved the results as variables, using the pattern variable = ... for each query.
Let’s now have a closer look at the possessor, i.e. the noun phrase to the left
of the possessive marker.
Since we are looking at a clitic, and not a suffix, we do not always have nouns
in that position. Consider the following:
- someone else’s child
- a friend of mine’s guitar
- the Attorney General’s request
I recommend that you get a feel for how frequent and how systematic those edge cases are. In the following, we will simply define our research object as NOUN + ’s for the sake of simplicity.
Looking at our singular data set, we might not spot the next issue from the concordance alone. Counting the possessors, however, reveals it.
count singular by word %c on match[-1]
Here, match is the first matching token in our results, which is the possessive
marker. You can add an index in square brackets to move left and right relative
to it: match[-1] is the token one position to the left of the first match (also cf. cheatsheet).
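To make the offset logic concrete, here is a minimal sketch in R (the language we use for the statistics later on) that mimics what the count command does. It is only an illustration: the token vector is invented, and the names tokens, match_pos, and possessors are not part of CQP.

```r
# Toy token sequence (invented for illustration, not BNC data).
tokens <- c("the", "children", "'s", "books", "and",
            "my", "mother", "'s", "birthday", "or",
            "the", "Children", "'s", "ward")

# Positions of the possessive marker: the analogue of `match`.
match_pos <- which(tokens == "'s")

# One token to the left of each marker: the analogue of `match[-1]`.
possessors <- tokens[match_pos - 1]

# `%c` ignores case, so we fold to lower case before counting.
freq <- sort(table(tolower(possessors)), decreasing = TRUE)
print(freq)
```

Because of the case folding, children and Children end up in the same row of the frequency list, just as they do with %c in the query.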
In the resulting frequency list, there are some very frequent nouns that are not
singular, but have an overt ’s, like children’s and women’s.
These are mostly nouns with irregular plurals, but we also have collective
nouns, such as staff and police.
The possessive in English basically has only one form, which is only expressed
overtly when the noun doesn’t already have a plural -s ending.
Nouns whose bases end in s are another source for variation here that we need
to acknowledge, but probably won’t end up analyzing.
The CLAWS tag set offers us three tags for the number categories of
nouns: “NN1” for singular nouns, “NN2” for plural nouns, and “NN0” for those
that are not easily identified as either.
Let’s adjust our queries accordingly.
singular = [pos = "NN1.*"] [pos = "POS"]
plural = [pos = "NN2.*"] [pos = "POS"]
count singular by word %c on match
count plural by word %c on match
We no longer want to restrict the possessive marker in form.
We also want the preceding token to be either clearly singular or clearly plural
respectively.
The wildcard .* allows for ambiguity tags like NN1-NP0 in cases where the
tagger is not sure (in this case, whether it’s a common singular noun or a proper
noun).
Our count command now counts on match because we’re including the position
before the possessive marker in our query.
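If you want to check which tags a pattern like NN1.* covers, you can try it out on a few tag strings. The sketch below does this in R. It rests on two assumptions: the tag list is a small hand-picked selection for illustration, and the helper cqp_match is a made-up name that anchors the pattern so it has to match the whole tag, which is how CQP applies regular expressions to attribute values.

```r
# A few tags as they might appear in the pos attribute (hand-picked selection).
tags <- c("NN1", "NN1-NP0", "NN1-VVB", "NP0-NN1", "NN2", "NN2-VVZ", "NN0")

# Hypothetical helper: anchor the pattern with ^ and $ so that it must
# match the whole tag, not just a part of it.
cqp_match <- function(pattern, x) grepl(paste0("^(", pattern, ")$"), x)

singular_hits <- tags[cqp_match("NN1.*", tags)]
plural_hits   <- tags[cqp_match("NN2.*", tags)]

print(singular_hits)
print(plural_hits)
```

Note that a tag that merely contains NN1 further to the right, such as NP0-NN1 in this list, is not covered by the pattern. Whether that matters depends on how often such tags occur in the corpus, so it is worth a quick count.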
Comparing the two frequency lists shows that the most frequent possessors are
mostly words for humans and words for time intervals.
This leads us to the question of whether there are any differences between the two
forms.
12.2 Finding a research question
We discovered in class that Rosenbach only mentions that the s-genitive tends to
be avoided with plural forms (2003: 384).
Our corpus queries show us that this is indeed just a tendency and cannot be a strong rule.
We can actually test this hypothesis quite easily.
In addition to what we already have, we need the number of all singular nouns, [pos = "NN1.*"],
and the number of all plural nouns, [pos = "NN2.*"].
There are 15430546 singular nouns and 5324559 plural nouns overall,
and there are 177502 singular possessors and 50492 plural possessors.
The code used below is written in R, which we are going to use in
the next two weeks (see homework below).
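# 2x2 contingency table: possessive marker vs. no marker, by noun number
possessive <- cbind(
  singular = c(s = 177502, other = 15430546 - 177502),
  plural = c(s = 50492, other = 5324559 - 50492)
)
# chi-squared test of independence (Yates' correction is the default for 2x2 tables)
poss_test <- chisq.test(possessive)
# positive values: more cases than expected; negative values: fewer
poss_test$observed - poss_test$expected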
The difference between the observed and the expected frequencies shows us that, if noun number and possessive marking were independent of each other, there should be about 7998 more cases of plural + possessive marker. The p-value is also below 0.05, which means the result is statistically significant. We can therefore confirm that plural possessives are underrepresented. Not by much, though, so this might be worth investigating.
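To see where the figure of about 7998 comes from, we can compute the expected frequencies by hand. The sketch below uses the counts from the text. The variable names (observed, expected, diff_plural_s, rate) are chosen for illustration only, and the last step, the share of nouns followed by a possessive marker, is an extra descriptive figure that is not part of the test itself.

```r
# Observed counts, taken from the text above.
observed <- cbind(
  singular = c(s = 177502, other = 15430546 - 177502),
  plural   = c(s = 50492,  other = 5324559 - 50492)
)

# Expected count per cell = row total * column total / grand total.
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
dimnames(expected) <- dimnames(observed)

# Observed minus expected for plural nouns followed by the possessive marker.
diff_plural_s <- observed["s", "plural"] - expected["s", "plural"]
print(round(diff_plural_s))

# Share of nouns followed by a possessive marker, in percent.
rate <- observed["s", ] / colSums(observed) * 100
print(round(rate, 2))
```

The shares come out at roughly 1.15 percent for singular nouns and 0.95 percent for plural nouns, which puts the size of the difference into perspective.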
We can now formulate a first draft of our research question:
- What are the differences in the choice of singular or plural possessors in the s-genitive construction?
We could also ask:
- Why is the plural underrepresented, and when is it (not) used?
12.3 Homework
During the project days, we are going to pick up where we left off. So please make sure you have caught up with everything above. You’ll also need to download and install a piece of statistical software.
- Download and install R. You can find the necessary links here
- Carry out the queries above and save the singular and plural data sets to your personal server space using the commands below.
- Download the two text files. See this link.
BNC
singular = [pos = "NN1.*"] [pos = "POS"]
plural = [pos = "NN2.*"] [pos = "POS"]
count singular by hw %c on match > "poss_s_sg.txt"
count plural by hw %c on match > "poss_s_pl.txt"
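Once R is installed, you can already try to load a frequency list. The following is a rough sketch under stated assumptions: it assumes that each line of the saved file consists of a frequency, a tab, and a headword, which you should check against your own download. The function name read_counts and the toy file with its invented counts are placeholders so that the example runs on its own.

```r
# Hypothetical stand-in for a downloaded file such as "poss_s_sg.txt".
# The two-column layout (frequency, tab, headword) is an assumption;
# check your own file and adjust `sep` and `col.names` if it differs.
toy_file <- tempfile(fileext = ".txt")
writeLines(c("12\tgovernment", "9\tyear", "7\tman"), toy_file)  # invented counts

# Read a frequency list into a data frame with columns freq and hw.
read_counts <- function(path) {
  read.delim(path, header = FALSE, sep = "\t", quote = "",
             col.names = c("freq", "hw"), stringsAsFactors = FALSE)
}

sg <- read_counts(toy_file)

# Relative frequency of each headword within the file.
sg$share <- sg$freq / sum(sg$freq)

# Show the ten most frequent possessors.
print(head(sg[order(-sg$freq), ], 10))
```

To use it with your own data, replace toy_file with the path to the downloaded file.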