Many distributions, such as the size of cities, the frequency of words, and the number of followers of people on Twitter, are very skewed. For instance, 16.2% of songs make up 83.8% of plays on Last.fm, and 18.8% of groups make up 81.2% of all group memberships on YouTube. Such distributions are described with terms such as power law and long tail. These terms are often confused, as they refer to slightly different concepts. The confusion is made worse by the existence of multiple ways of visualizing such distributions, which are different mathematically but similar in appearance. In this blog post, I will review the most important ways of visualizing such distributions, and will explain the mathematics behind them.
Many things can be distributed in a power-law-like way. For instance, in 1935 the linguist George Zipf studied the distribution of words in the English language. He counted, for each word, its frequency in a corpus of text. He then sorted the words by descending frequency, and noted that a word’s frequency is approximately inversely proportional to the rank of that word. That is, the second most common word occurs about half as often as the most common word, the third most common word about a third as often, and so on.
This series of numbers can be visualized like this:
The data used for this plot is from a dataset of Reuters news articles in English. (Since this data is taken from a collection of network datasets, the plot uses the term degree to refer to a word’s frequency.) Each point in this plot represents one word, with the X axis representing the rank of the word (from most frequent to least frequent), and the Y axis representing the frequency. The plot is drawn with both axes logarithmic. Thus, the top-left point represents the most frequent word in the dataset, and very rare words are on the right of the plot (showing that many words occur only once in the dataset).
An exact inverse proportional relationship as described by Zipf would look like a straight line with slope exactly minus one on this plot, because the axes are logarithmic. With this dataset, this is not the case (in particular in the top-left area). This is due to the fact that this dataset of words was stripped of words which are ignored by search engines such as “the”, “of”, etc., which happen to be the most common in the English language. Therefore, the plot does not show a straight line, and we may be interested in the shape of this plot as a visualization of the distribution.
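The rank-frequency construction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `rank_frequency` and `loglog_slope` are made up for this example, the input is assumed to be a plain list of tokens, and the least-squares slope is a rough visual aid rather than a statistical test.

```python
from collections import Counter
import math

def rank_frequency(tokens):
    """Return (rank, frequency) pairs, with rank 1 for the most frequent word."""
    counts = Counter(tokens)
    freqs = sorted(counts.values(), reverse=True)
    return list(enumerate(freqs, start=1))

def loglog_slope(points):
    """Least-squares slope of log(y) against log(x) for (x, y) pairs."""
    xs = [math.log(x) for x, _ in points]
    ys = [math.log(y) for _, y in points]
    n = len(points)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

# Tiny hand-made corpus: "a" occurs 3 times, "b" twice, "c" once.
zipf_points = rank_frequency("a a a b b c".split())

# Synthetic data in which frequency is exactly proportional to 1 / rank.
ideal = [(rank, 1200 / rank) for rank in range(1, 11)]
ideal_slope = loglog_slope(ideal)  # approximately -1.0
```

On real data, such as the stop-word-stripped dataset discussed above, the fitted slope depends heavily on which ranks are included, which is one reason to look at the shape of the whole curve rather than a single number.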
The Distribution Plot
In terms of probability theory, the size, frequency or number of followers follow a distribution, and we are thus interested in the shape of that distribution. The plot inspired by Zipf’s Law is thus often said to be a bad way of visualizing a distribution. Instead, one may rather show the distribution plot, i.e., the plot showing the frequency of each value. For the English words example, this means a plot where the X axis shows word frequency values, and the Y axis shows the number of words having this frequency. This results in the following plot:
Again, we use a log-log plot. Unlike the previous plot, this looks much more like a straight line. What is going on? In a log-log plot, a straight line (of the form y = ax + b) is in fact a power law (i.e., log(y) = a log(x) + b ⇔ y = e^b x^a). Thus, the plot suggests that the number of words with frequency x is proportional to x^a, for some constant a, which equals about −1.6 in this case. This type of relationship is called a power law, and the value −a is then called the power-law exponent. This plot type is misleading, however. Since for each value x ≥ 100 only a few words have frequency x, the right half of the plot cannot be interpreted visually at all, and thus the plot effectively visualizes only the distribution of infrequent words.
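The points of the distribution plot can be computed by counting twice: once to get each word's frequency, and once to count how many words share each frequency. A minimal sketch, where `frequency_distribution` is a name chosen for this example and the input is again assumed to be a list of tokens:

```python
from collections import Counter

def frequency_distribution(tokens):
    """Return sorted (frequency, number of words with that frequency) pairs."""
    word_counts = Counter(tokens)                 # word -> frequency
    distribution = Counter(word_counts.values())  # frequency -> number of words
    return sorted(distribution.items())

# "a" occurs 3 times, "b" twice, and "c", "d", "e" once each.
dist = frequency_distribution("a a a b b c d e".split())
# dist is [(1, 3), (2, 1), (3, 1)]
```

Even in this toy example the problem mentioned above is visible: each high frequency is held by a single word, so the right-hand side of the plot consists of isolated points with a count of one.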
The Complementary Cumulated Distribution Plot
In order to also visualize words with uncommon (i.e., high) frequencies, a solution is to plot, for each x, not the probability that a word has frequency x (which is proportional to the number of words having that frequency), but the probability that a word has frequency greater than or equal to x. This is called the complementary cumulated distribution plot (and would be called simply the cumulated distribution plot if it showed the probability of the frequency being less than a given x).
Again, this plot type is shown with both axes logarithmic. Again, this plot type will show a straight line when the data follows a power law. However, this plot does show the full range of values on the X axis, and in fact we can see right away that this is not a power law, as the right half of the plot deviates significantly from a straight line, showing again that our dataset does not follow Zipf’s Law, because very common words were removed from it.
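As a sketch of how the values of this plot can be computed, the following function takes a list with one frequency per word and returns, for each distinct value x, the fraction of words with frequency greater than or equal to x. The name `ccdf` is an assumption made for this example.

```python
from collections import Counter

def ccdf(frequencies):
    """Return sorted (x, P(frequency >= x)) pairs for each distinct value x."""
    n = len(frequencies)
    counts = Counter(frequencies)
    points = []
    remaining = n  # number of words whose frequency is >= the current x
    for x in sorted(counts):
        points.append((x, remaining / n))
        remaining -= counts[x]
    return points

# Five words with frequencies 3, 2, 1, 1 and 1.
points = ccdf([3, 2, 1, 1, 1])
# points is [(1, 1.0), (2, 0.4), (3, 0.2)]
```

Because every value is a sum over all larger frequencies, the curve stays smooth even where individual frequencies are held by only one word.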
Is this plot therefore a better visualization of a skewed distribution than the simple distribution plot? Yes, because it visualizes the whole spectrum of possible values, even rare ones. Is this plot also better than the plot derived from Zipf’s analysis? No, because it is the same plot. In fact, Zipf’s plot and the complementary cumulated distribution are mirror images of each other, mirrored around the 45-degree diagonal axis going from the lower left to the upper right. This can be seen by noting that the rank of a word with frequency x equals the probability of a word having frequency ≥ x, multiplied by the total number of words. Thus both plots are identical, up to an exchange of the axes.
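The equivalence of the two plots can be checked numerically. The sketch below builds both sets of points from the same list of frequencies and confirms that swapping the axes of the Zipf-type points, and dividing the rank by the number of words, gives the complementary cumulated distribution. The function names are hypothetical, and tied words are represented by the largest rank within their tie.

```python
def zipf_points(frequencies):
    """(rank, frequency) pairs, using the largest rank among tied words."""
    ordered = sorted(frequencies, reverse=True)
    last_rank = {}
    for rank, f in enumerate(ordered, start=1):
        last_rank[f] = rank  # overwritten until the last word with frequency f
    return sorted((rank, f) for f, rank in last_rank.items())

def ccdf_points(frequencies):
    """(x, P(frequency >= x)) pairs, computed directly from the definition."""
    n = len(frequencies)
    return sorted(
        (x, sum(1 for f in frequencies if f >= x) / n)
        for x in set(frequencies)
    )

freqs = [3, 2, 1, 1, 1]
zipf = zipf_points(freqs)  # [(1, 3), (2, 2), (5, 1)]

# Exchange the axes and rescale the rank by the total number of words.
swapped = sorted((f, rank / len(freqs)) for rank, f in zipf)
same_plot = swapped == ccdf_points(freqs)  # True
```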
The Long Tail
The confusion between the different types of plots is made worse by the ambiguity of the term “long tail”. In marketing, the long tail refers to the bulk of products which are bought only rarely, and thus are traditionally ignored by retailers, but from which online retailers can profit, since they don’t have the space constraints that traditional retailers have. Rare items are similar to infrequent words, and thus the term long tail describes the Zipf-type plot, in which infrequent items form a tail-like shape to the right of the figure. In the two other plots, however, infrequent items are shown in the upper left corner, and thus do not form a tail at all.
What may also confuse people is the use of the word tail in probability theory to characterize the ends of distributions. In probability theory, the expressions heavy-tailed distribution and fat-tailed distribution have specific meanings and refer to the tail as seen in the distribution plots. They thus refer to very large values of a variable, and not to very small values, as in the expression long tail. What is worse, the expression long-tailed distribution is sometimes also used in probability theory, and also describes the behavior of a distribution at large values of a variable.
The Lorenz Curve
To complete our little tour, I would like to mention the Lorenz curve, which shows the amount of text covered by the X% least frequent words. That is, the X axis goes from 0 to 100, and the Y axis shows the share of the total text made up by the X% least frequent words. The Lorenz curve is usually used to visualize wealth distribution, and is plotted without logarithmic scales. For our English words dataset, it looks like this:
The area between the Lorenz curve (in blue) and the diagonal, when multiplied by two, is the Gini coefficient, a measure of inequality used in economics. Since the plot does not use logarithmic axes, neither power-law distributions nor Zipf’s Law can be easily recognized on it.
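The Lorenz curve and the Gini coefficient can both be computed from the list of word frequencies. The following is a minimal sketch with assumed function names; it uses fractions from 0 to 1 on both axes instead of percentages, and approximates the area under the curve with the trapezoid rule.

```python
def lorenz_curve(frequencies):
    """Points (share of words, share of text), least frequent words first."""
    ordered = sorted(frequencies)
    n = len(ordered)
    total = sum(ordered)
    points = [(0.0, 0.0)]
    running = 0
    for i, f in enumerate(ordered, start=1):
        running += f
        points.append((i / n, running / total))
    return points

def gini(frequencies):
    """Twice the area between the diagonal and the Lorenz curve."""
    pts = lorenz_curve(frequencies)
    area_under_curve = sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(pts, pts[1:])
    )
    # The area under the diagonal is 1/2, so 2 * (1/2 - area) = 1 - 2 * area.
    return 1 - 2 * area_under_curve

equal = gini([1, 1, 1, 1])      # all words equally frequent: close to 0
skewed = gini([3, 2, 1, 1, 1])  # mildly skewed toy data: close to 0.25
```

A value near 0 means that all words are about equally frequent, while values near 1 mean that a small share of words accounts for almost all of the text.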
There are many ways to visualize skewed distributions, and care must be taken when creating, reading and interpreting such plots. I recommend using the complementary cumulated distribution plot, as it is most aligned with probability theory, and is just as expressive as the Zipf-type plot. But in the end, remember that a plot is not a replacement for a proper statistical test, so if you want to assert that a dataset follows Zipf’s Law or a power law, then a statistical test is needed. Based on my experiments with degree distributions in KONECT, my cautious guess is that there are almost no statistically significant power laws in real-world networks. (I can make no statement about other types of distributions, such as city sizes.)