8.2 Irregular verbs

To a learner of English it might seem unpredictable which verbs are regular and which are irregular. However, there is a strong statistical pattern connecting regularity to frequency.

First, let’s quickly confirm that the regular forms in -ed are indeed more common.[8]

BNC-BABY
ed=[word = ".+ed" & pos = "V.D"]
non_ed=[word != ".+ed" & pos = "V.D"]
count ed by word %c
count non_ed by word %c

The pos tag for past tense forms of verbs always begins with a V and ends in a D. The . matches any single character. The regular expression ".+ed" matches every word ending in ed. With != we look for everything that does not match. The count command provides frequency lists. However, what we need to know is how many different verbs, i.e. how many different types, there are. We don’t need to scroll down and count manually; we can have an external program do that.

count ed by word %c > "| wc -l"
> 3072
count non_ed by word %c > "| wc -l"
> 224

> "| ..." directs the output of count into the program wc a.k.a. word count which can also count lines with the -l option.

We can see that the type frequency, i.e. the number of distinct verb forms, is much lower for the irregular forms than for the regular ones. This is a different kind of “commonness” from the one we have dealt with so far, when we looked at token frequencies of individual structures.
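To make the difference between type and token frequency concrete, here is a minimal Python sketch of what the two queries and wc -l compute together. It is only an illustration outside of CQP: the token list is invented toy data, and count_forms is a hypothetical helper name. CQP matches a regular expression against the whole attribute value, which re.fullmatch imitates here; lower-casing is a rough stand-in for the %c flag.

```python
import re
from collections import Counter

# Invented toy data: (word, pos) pairs standing in for corpus tokens.
TOKENS = [
    ("walked", "VVD"), ("Walked", "VVD"), ("talked", "VVD"),
    ("jumped", "VVD"), ("asked", "VVD"),
    ("went", "VVD"), ("went", "VVD"), ("went", "VVD"), ("went", "VVD"),
    ("was", "VBD"), ("was", "VBD"),
    ("house", "NN1"), ("red", "AJ0"),
]

ED = re.compile(r".+ed")    # same pattern as in the CQP query
PAST = re.compile(r"V.D")   # past tense tags: V, any one character, D


def count_forms(tokens, want_ed):
    """Count lower-cased past tense forms.

    want_ed=True keeps forms matching .+ed, want_ed=False keeps the rest.
    """
    counts = Counter()
    for word, pos in tokens:
        if not PAST.fullmatch(pos):
            continue  # not a past tense verb form
        form = word.lower()
        if bool(ED.fullmatch(form)) == want_ed:
            counts[form] += 1
    return counts


ed = count_forms(TOKENS, want_ed=True)
non_ed = count_forms(TOKENS, want_ed=False)

# Type frequency: number of distinct forms (one output line per form).
# Token frequency: total number of occurrences.
print("ed:     types", len(ed), "tokens", sum(ed.values()))
print("non_ed: types", len(non_ed), "tokens", sum(non_ed.values()))
```

In this toy sample the -ed group has more types, while the other group has fewer types that each occur more often, which is the pattern the corpus counts point to.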

Taken individually, the irregular verbs are actually rather frequent. Let’s have a look at the 25 most frequent verbs in the BNC-BABY.

BNC-BABY
[pos = "VV.*"]
count by hw

I am filtering out modal verbs and the highly frequent and highly irregular verbs be, have, and do by restricting the search to tags that begin with VV; see the CLAWS tagset. (The . matches any character, and * repeats the preceding element any number of times.) The count command gives us a frequency list in descending order of frequency by hw, which is BNC-speak for lemma. Take a look at the output of the commands and consider how many of the verb types are irregular verbs. At the top of the frequency list, most verbs are irregular, and the lower you go, the rarer the irregular verbs become.
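As a quick check of what the tag pattern lets through, the following Python sketch applies "VV.*" to a selection of CLAWS5 verb tags. It is an illustration outside of CQP; the tag list is a hand-picked subset, not the full tagset, and re.fullmatch again stands in for CQP's whole-value matching.

```python
import re

# A hand-picked subset of CLAWS5 verb tags, including one ambiguity tag.
TAGS = ["VVB", "VVD", "VVG", "VVI", "VVN", "VVZ", "VVD-VVN",
        "VBD", "VBZ", "VHD", "VDD", "VM0"]

LEXICAL = re.compile(r"VV.*")  # VV followed by any characters, or none

kept = [tag for tag in TAGS if LEXICAL.fullmatch(tag)]
dropped = [tag for tag in TAGS if not LEXICAL.fullmatch(tag)]

print("kept:   ", kept)
print("dropped:", dropped)
```

The tags for be (VB…), have (VH…), do (VD…), and the modals (VM0) are dropped, while all lexical verb tags are kept. Using .* rather than a single . also keeps longer ambiguity tags such as VVD-VVN.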

It is no coincidence that the most frequent verbs are the most irregular ones. In fact, the rarer a verb is, the more likely it is to be regularized. If an irregular form is used a lot, it survives longer, while rare irregular forms get forgotten; in that case the regular form is used by analogy with other regular verbs. It is the default, so to speak.

For more examples, and further explanation, watch the following video: How words get forgotted


  8. I use the Baby subsample of the BNC only for demonstration here.