7 Frequencies and Zipf’s Law

7.1 Exploration

Today’s reading was an example of an exploratory study. In contrast to the other papers we have read, an exploratory study is usually qualitative rather than quantitative. That means it has a rather broad research question with many categories and a rather basic statistical analysis. The typical quantitative study focuses on only a few categories but analyzes them in much more depth, and its main aim is usually hypothesis testing. In exploratory and qualitative studies, deriving hypotheses is often the end result. Both formats go hand in hand and are equally important to the research process.

7.2 Frequency and scales

Since the linguistics and methodology this week were rather straightforward, let’s take a little detour and think about frequency, our most important measure.

Most of the data we have encountered was count data. Even though a count is one of the simplest measures possible, you will encounter counts in several different forms.

  • Absolute frequency
    • Basic measure
    • Should always be reported since everything else is based on it
    • Sometimes hard to visualize
    • Hard to interpret across different sample or category sizes
  • Relative frequency
    • Absolute frequency divided by all occurrences
    • Either between 0 and 1 or 0% and 100%
    • Makes it possible to compare between different sized samples or sub-categories
    • Extremely low relative frequencies are sometimes reported as normalized frequencies, e.g. 1 per million == 0.000001 == 0.0001%
  • Log scale
    • Most commonly base 10, i.e. 1 to 10 is the same distance as 10 to 100, 100 to 1,000, etc.
    • Uses:
      1. Visualize heavily skewed data
      2. Make exponential data linear (e.g. word counts)
      3. Approximate human perception of quantities
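
To make the measures in this list concrete, here is a minimal sketch in Python. The function name frequency_table and the toy sentence are made up for illustration; with real corpus data you would plug in your own token list.

```python
import math
from collections import Counter


def frequency_table(tokens):
    """Return absolute, relative, per-million and log10 frequency per word."""
    counts = Counter(tokens)
    total = sum(counts.values())
    table = {}
    for word, absolute in counts.items():
        relative = absolute / total  # always between 0 and 1
        table[word] = {
            "absolute": absolute,
            "relative": relative,
            "per_million": relative * 1_000_000,  # normalized frequency
            "log10": math.log10(absolute),  # 1 -> 0, 10 -> 1, 100 -> 2
        }
    return table


# Toy data, not taken from any real corpus
tokens = "the cat saw the dog and the dog saw the cat".split()
table = frequency_table(tokens)

for word, row in sorted(table.items(), key=lambda item: -item[1]["absolute"]):
    print(word, row["absolute"], round(row["relative"], 3), round(row["log10"], 3))
```

Note that the relative and normalized values only make sense together with the absolute count and the sample size, which is why the absolute frequency should always be reported.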

Log scale presentation of frequencies is common with count data for two reasons. Firstly, for most kinds of linguistic units (words, phrases, structures), a few types are extremely frequent and a great many types are extremely infrequent (Zipf’s law). This makes it difficult to visualize quantity differences, or to reason about them at all.
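
Zipf’s law says that the frequency of a type is roughly inversely proportional to its rank. On a log-log scale, such a distribution becomes a straight line with a slope of about -1. The following sketch checks this on idealized, made-up frequencies (not real corpus counts); the helper name loglog_slope is an assumption for illustration, and real data will only approximate the ideal value.

```python
import math


def loglog_slope(frequencies):
    """Least-squares slope of log10(frequency) against log10(rank).

    Expects at least two strictly positive frequencies.
    """
    ranked = sorted(frequencies, reverse=True)  # rank 1 = most frequent
    xs = [math.log10(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log10(freq) for freq in ranked]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


# Idealized frequencies that follow f(rank) = 60000 / rank exactly
ideal = [60000 / rank for rank in range(1, 101)]
slope = loglog_slope(ideal)
print(round(slope, 6))
```

On a linear scale the same numbers look like a steep drop followed by a long flat tail, which is exactly what makes raw frequency plots hard to read.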

Secondly, human perception is much better tuned to relative quantities than to absolute ones (Weber-Fechner law).

Weber-Fechner Law (cf. Kromer 2003)

  • human perception is based on ratios
  • a given absolute difference becomes less and less informative as the quantities grow

It is much easier to tell the difference between 10 dots and 20 dots than between 110 dots and 120 dots, even though the absolute difference is exactly the same. Similar effects can be observed for acoustic frequencies (pitch is essentially also count data, which is why it is also measured as a frequency), brightness, memory, language learning, etc. The inverse of exponentiation is taking the logarithm, which is why log scales are used a lot in information theory (remember Mutual Information) and linguistics.
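
The dot example can be worked through in a few lines of Python. This is only an illustration of the arithmetic, not a model of perception; the function name compare is made up.

```python
import math


def compare(a, b):
    """Describe the step from a to b in absolute, relative and log terms."""
    return {
        "absolute": b - a,
        "ratio": b / a,
        "log10_distance": math.log10(b) - math.log10(a),
    }


small = compare(10, 20)    # 10 dots vs. 20 dots
large = compare(110, 120)  # 110 dots vs. 120 dots

print(small)
print(large)
```

Both pairs differ by 10 in absolute terms, but the first pair is a doubling while the second is an increase of less than 10%. The distance on a log scale reflects this: it is much larger for the first pair than for the second.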

7.3 Homework

If you have not yet handed in any homework, make sure you have caught up and know how to do basic corpus queries. Think about what would be necessary to reproduce the data in the next reading (Anderwald 2011). Send me your ideas, strategies, commands and observations.

7.4 Tip of the day

Here are two seemingly unconnected thoughts on co-occurrence patterns and power-law distributions (Zipf’s law).

Fact 1: One place where n-grams and co-occurrence patterns were used to make something useful is the computer keyboard. The keyboard layout (QWERTY) was carefully designed to keep the most frequent letter combinations (bigrams and trigrams) off adjacent keys (oversimplified), so that old mechanical typewriters would not jam.
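
If you want to see what “most frequent letter combinations” means in practice, letter bigrams can be counted in a few lines. This is a minimal sketch on a made-up sample sentence, not the procedure used by the typewriter designers; the function name letter_bigrams is an assumption for illustration.

```python
from collections import Counter


def letter_bigrams(text):
    """Count adjacent letter pairs within words (no pairs across word boundaries)."""
    counts = Counter()
    for word in text.lower().split():
        letters = [ch for ch in word if ch.isalpha()]  # drop punctuation and digits
        counts.update(a + b for a, b in zip(letters, letters[1:]))
    return counts


# Toy sentence, deliberately heavy on "th"
sample = "the other brother gathered the feathers there"
bigrams = letter_bigrams(sample)

for pair, count in bigrams.most_common(3):
    print(pair, count)
```

Run on a large corpus, a count like this gives the ranking of letter pairs that a layout designer would care about.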

Fact 2: When you work on a project, the amount of time you spend on individual aspects also follows a power law like Zipf’s law. Look up the Pareto principle. You probably need 80% of your time to produce 20% of the work and 20% of your time to produce the other 80%. You cannot avoid that, but you can flatten the curve by focusing on the biggest time sinks.

Loosely related, my tip of the day is another tiwilbemba: Learn touch typing if you haven’t already. Since you study language, chances are you will spend most of your working life typing. Learn a good, fast and healthy typing technique and you can save a lot of time and possibly pain in the long run. Just imagine how much faster you’ll write your bachelor’s thesis if you type at twice the speed while mistyping half as often, which most people can achieve within just a few weeks. I wish I had learned that many years ago. Believe it or not, it can actually be fun. If you have learned a musical instrument: this is basically what it is like, just much, much faster to master. If you haven’t learned a musical instrument: forget about touch typing and go learn an instrument! :D