How to Visualize Skewed Distributions

Many distributions, such as the size of cities, the frequency of words, the number of followers of people on Twitter, and others, are very skewed.  For instance, 16.2% of songs make up 83.8% of plays on, and 18.8% of groups make up 81.2% of all group memberships on YouTube. These types of distributions are described by expressions such as power law, long tail, and other terms. These terms are often confused, as they refer to slightly different concepts. Also, the confusion is made worse by the existence of multiple ways of visualizing such distributions, which are different mathematically but similar in appearance. In this blog post, I will review the most important ways of visualizing such distributions, and will explain the mathematics behind them.

Zipf’s Law

Many things can be distributed in a power-law-like way. For instance, the linguist George Zipf was interested in 1935 in the distribution of words in the English language. He counted, for each word, its frequency in a corpus of text. He then sorted the words by descending frequency, and noted that a word’s frequency is approximately inversely proportional to the rank of that word. That is, the second most common word is twice as rare as the most common word, the third most common word is three times as rare as the most common word, and so on.

This series of numbers can be visualized like this:

zipf.v.gottron-reutersThe data used for this plot is from a dataset of Reuters news article in English. (Since this data is taken from a collection of network datasets, the plot used the term degree to refer to a word’s frequency.) Each point in this plot represents one word, with the X axis representing the ranking of the word (from most frequent to less frequent), and the Y axis representing the frequency. The plot is drawn with both axes logarithmic. Thus, the top-left point represents the most common English word (which is “the”), and very rare words are on the right of the plots (showing that many words occur only once in the dataset).

An exact inverse proportional relationship as described by Zipf would look like a straight line with slope exactly minus one on this plot, because the axes are logarithmic. With this dataset, this is not the case (in particular in the top-left area). This is due to the fact that this dataset of words was stripped of words which are ignored by search engines such as “the”, “of”, etc., which happen to be the most common in the English language. Therefore, the plot does not show a straight line, and we may be interested in the shape of this plot as a visualization of the distribution.

The Distribution plot

In terms of probability theory, the size, frequency or number of followers follow a distribution, and we are thus interested in the shape of that distribution. The plot inspired by Zipf’s Law is thus often said to be a bad way of visualizing a distribution.  Instead, one may rather show the distribution plot, i.e., the plot showing the frequency of each value. For the English words example, this means a plot where the X axis shows word frequency values, and the Y axis shows the number of words having this frequency.  This results in the following plot:

degree.v.gottron-reutersAgain, we use a log-log plot. Unlink the previous plot, this looks much more like a straight line. What is going on? In a log-log plot, a straight line (of the form y = ax + b) is in fact a power law (i.e., log(y) = a log(x) + by = eb xa). Thus, the plot suggests that the number of words with frequency x is proportional to xa, for some constant a, which equals about −1.6 in this case.  This type of relationship is called a power law, and the value −a is then called the power-law exponent. This plot type however is misleading. Since for each number x ≥ 100, only few words have the frequency x, the right half of the plot cannot be interpreted at all visually, and thus the plot effectively only visualizes the distribution of infrequent words.

The Complementary Cumulated Distribution Plot

In order to visualize also words with uncommon (i.e., high) frequencies, a solution is to plot, for each x, not the probability that a word has frequency x (which is proportional to the number of words having that frequency), but the probability that a word has frequency greater or equal to x. This is called the complementary cumulated distribution plot (and would be called simply the cumulated distribution plot if it showed the probability of the frequency being less than a given x).

bidd.vx.gottron-reutersAgain, this plot type is shown with both axes logarithmic. Again, this plot type will show a straight line when the data follows a power law. However, this plot does show the full range of values on the X axis, and in fact we can see right away that this is not a power law, as the right half of the plot deviates significantly from a straight line, showing again that our dataset does not follow Zipf’s Law, because very common words were removed from it.

Is this plot therefore a better visualization of a skewed distribution than the simple distribution plot? Yes, because it visualizes the whole spectrum of possible values, even rare ones. Is this plot also better than the plot derived from Zipf’s analysis? No, because it is the same plot. In fact, Zipf’s plot and the complementary cumulated distribution are mirror images of each other, mirrored around the 45-degree diagonal axis going from the lower left to the upper right. This can be seen by noting that the rank of a word with frequency x equals the probability of a word having frequency ≥ x, multiplied by the total number of words. Thus both plots are identical, up to an exchange of the axes.

The Long Tail

The confusion between the different types of plots is made worse by the ambiguity of the term “long tail”. In marketing, the long tail refers to the bulk of products which are bought only rarely, and thus are traditionally ignored by retailers, but from which online retailers can profit, since they don’t have the space constraints that traditional retailers have. Rare items are similar to infrequent words, and thus the term long tail describes the Zipf-type plot, in which infrequent items form a tail-like shape to the right of the figure. In the two other plots however, infrequent items are shown on the upper left corner, and thus do not form a tail at all.

What may also confuse people is the use of the word tail in probability theory to characterize the ends of distributions. In probability theory, the expression heavy-tailed distribution and fat-tailed distribution have specific meanings and refer to the tail as seen in the distribution plots, and thus refer to very large values of a variable, and not to very small values, as in the expression long tail. What is worse, the expression long-tailed distribution is sometimes also used in probability theory, and also describes the behavior of a distribution at large values of a variable.

The Lorenz Curve

To complete our little tour, I would like to mention the Lorenz curve, which shows the amount of text covered by X% of least frequent words. That is, the X axis goes from 0 to 100, and the Y axis shows the total amount of text made up by the X% of least frequent words. The Lorenz curve is usually used to visualize wealth distribution, and is plotted without logarithmic scales. For our English words dataset, it looks like this:

lorenz.vb.gottron-reutersThe area between the Lorenz curve (in blue) and the diagonal is then, when multiplied by two, the Gini coefficient, a measure of inequality as used in economy. Since it does not use logarithmic axes, neither power law distributions nor Zipf’s Law can be easily recognized on it.


There are many ways to visualize skewed distributions, and care must be taken when creating, reading and interpreting such plots. We recommend to use the complementary cumulated distribution plot, as it is most aligned with probability theory, and is just as expressive as the Zipf-type plot. But in the end, remember that a plot is not a replacement for a proper statistical test, so if you want to assert that a dataset follows Zipf’s Law or a power law, then a statistical is needed. Based on my experiments with degree distributions in KONECT, my cautious guess is than there are almost no statistically significant power laws in real-world networks (and I can make no statement about other types of distributions such as city sizes, etc.)

The Surprising Equality of the Mean and Median Path Length in Online Networks

A common statistic computed for networks is the diameter.  The diameter equals the longest shortest path length in a network.  Let me explain:  Given two nodes u and v in a network, their distance is the minimal number of edge needed to reach one from the other.  In other words, the distance between u and v equals the shortest path length between these two nodes. Now, find two nodes u and v such that the distance between them is maximal – this is the diameter of the network. The diameter is thus an integer that characterizes a network.

The diameter is an important statistic for a network, since it is used to characterize a network as a small world. The expression small world refers to the observation that it is possible to reach any node from any node in a small number of steps, as in the famous six degrees of separation.

The diameter is only one of several ways of measuring the “typical path length” in networks, and in fact, it may not even be the most representative one. As it is a maximum, it is not representative of the full distribution of path lengths in the network. Also, being defined as a maximum, it is sensitive to outliers, e.g., tendrils in the networks that increase the diameter but do not affect the overall structure of the network. Therefore, several alternatives measures of typical path lengths can be used:

  • The diameter (as defined above) is the maximum path length
  • The mean path length, averaged over all node pairs
  • The median path length
  • The 90-percentile effective diameter
  • The 50-percentile effective diameter

The effective diameters (50- and 90-percentile) are the path lengths such that 50 or 90 percent of node pairs are a smaller or equal distance apart, interpolated between integer values.  Thus, the median path length is the 50-percentile effective diameter rounded down.

The statistics are shown in the following plot, for the Slashdot Zoo network:

hopdistr.a.slashdot-zooThis shows the cumulated distribution of path lengths over all node pairs in the network.  In this network, the diameter (δ) equals 12. The X-percentile effective diameters are shown as the two vertical red lines and equal 3.51 for the 50-percentile effective diameter (δ0.5), and 4.71 for the 90-percentile effective diameter (δ0.9). The median path length is thus 3 in this network.  The mean path length (δm) equals 4.04 and is somewhere in between. We see that the distribution is skewed, because the mean and median path length are not equal in this network.  However, when I did this analysis for more networks, I found that in most cases, both numbers coincide. The following plot shows many networks (those small enough such that the number could be calculated), plotted by the 50-percentile effective diameter (on the X axis) and the mean path length (on the Y axis). Each network is represented by a two-letter (and a for few three-letter) code. See the results for yourself:

scatter.c.diam_3.diam_4.4This plot shows that both not only correlate very highly, but are almost equal for almost all networks. In other words: the 50-percentile effective diameter and the mean path length are almost equal in actual networks.  What does this mean?  I’m not sure, but this result is not trivial, since the distribution of path lengths is not symmetric in most networks.

For more path length distributions, see here.

For another blog post about the path length distribution, see

The 65 Languages in My Music Collection

As most of you, I have collected music in my life, first as CDs and later on my computer. In particular, a hobby of mine is to collect music in the languages of the world – the more the better! One day, I started tagging each song in my collection with the language it is sung in. Today, about 70% of the 16193 tracks I have are tagged with at least one language.

Here’s the breakdown:

English (8346 songs), French (2366 songs), German (1128 songs), Spanish (587 songs), Portuguese (241 songs), Italian (201 songs), Dutch (197 songs), Arabic (76 songs), Russian (69 songs), Norwegian (69 songs), Latin (64 songs), Nepali (44 songs), Wolof (32 songs), Turkish (31 songs), Occitan (27 songs), Malagasy (26 songs), Korean (23 songs), Chinese (22 songs), French-based creoles (21 songs), Japanese (19 songs), Dyula (19 songs), Polish (18 songs), Swedish (17 songs), Jamaican Creole (13 songs), Finnish (13 songs), Zulu (12 songs), Hawaiian (12 songs), Bulgarian (11 songs), Shona (8 songs), Haitian Creole (8 songs), Nubian (7 songs), Xhosa (6 songs), Catalan (6 songs), Georgian (5 songs), Hindi (5 songs), Galician (4 songs), Hungarian (3 songs), Middle High German (3 songs), Esperanto (3 songs). I have 2 songs in each of the following languages: Sesotho, Romanian, Punjabi, Lingala, Limburgish, Icelandic, Irish Gaelic, Serbian.  I have one song in each of the following languages: Venda, Klingon, Tagalog, Swahili, Old Irish, Scots, Low German (i.e., Plattdeutsch), Ndebele, Mongolian, Modern Hebrew, Fang, Persian, Duala, Breton, Amharic, Afrikaans.

Note:  Some songs contain multiple languages and are counted multiple times in this list.  Some languages are only used for a single phrase, making it arguable that the song is sung “in that language”.

If you’re wondering why a certain language is not included:  There are over 6000 languages in the world, and I have no idea in how many of these songs have been recorded. In any case, I will gladly accept in tips for songs to add to the collection 🙂

Happy Music Listening to everyone!



Kernel of Convex Combination of Matrices

Let A and B be two real m×n matrices, with m << n, such that there exists vectors x, y of length n such that

Ax = 0,    x > 0

By = 0,    y > 0

I.e., such that the kernel of each matrix contains a vector with all components larger than zero.

Question:  Is the same true for convex combinations of A and B?

Put differently:  For any 0 < α < 1, is there a vector z such that

[ αA + (1−α)B ] z = 0,    z > 0  ?

I would be very thankful for a proof, or a refutation.

Thanks to anyone who can help!

Index Notation as Projection Operator and Division of Vectors

I came across a nice equivalence between the standard index notation for vectors and the vector projection operator. This works for vectors of any dimension, but I’ll explain it for vectors in three dimensions.
A vector a can be expressed by its coordinates in a coordinate system:

(ax, ay, az)

These are three numbers that depend on the chosen coordinate system.  Any coordinate system can be described by three vectors x, y and z.  Given these three vectors, we can then write the components of a as

ax = (a · x) / ‖x‖².

We can interpret this as a binary operation with arguments a and x. This operator corresponds to the projection of a onto x, expressed in units where x has length one:

Using the points laid out as in this graphic, we can write the operator as

|ax| = AC / AD

This sign of ax is given by the relative directions of AC and AD.

Since this operator only depends on the dot product (a · x) and the norm ‖x‖, its value is independent of the coordinate system in which it is computed.

As a binary operator, ax is linear in the first argument, i.e., for any number c we have

(ca)x = c (ax),

and for any two vectors a and b we have

(a + b)x = ax + bx.

The operator is inverse-linear in the second operator:

a(cx) = c−1 (ax)

The operator is defined for all vectors a and x, except when x is the zero vector.

When the two vectors a and x have the same norm ‖a‖ = ‖x‖, then exchanging the two operands does not change the value:

ax = xa  ⇔  ‖a‖ = ‖x

This value ax = xa then equals the cosine of the angle between a and x.  In the general case, the operator is however not commutative.

The operator is zero if and only if the two vectors are orthogonal:

ax = 0  ⇔  ax

These properties are similar to those of scalar division, although this operation is of course not a proper division, since it returns not a vector, and cannot be expressed in terms of products and inverses. Note also how by coincidence, the notation ax puts a in a high position and x in a low position, just as in a fraction.

Eleven Essential Physics Lectures by Prof. Leonard Susskind

Prof. Leonard Susskind from Stanford University is one of the most prominent physicists of today, and has given physics lectures in the last few years, which are now online. These lectures are, in my opinion, the fastest way to learn physics. Prof. Susskind is really good at explaining complicated topics. In fact, he doesn’t just follow a script – instead he solves problems and derives results on the whiteboard just as he would on paper by himself. Although he makes errors from time to time, and has to be corrected by the audience, these lectures are still extremely didactic. I really have only very rarely learned so much in so little time.  The lectures range from classical mechanics to cosmology, and cover all important topics in physics. Since there doesn’t seem to be a central place that lists these lectures, I’ve made a list here:

I’ve tried to list the lectures in a rough order of how they should be watched, but beware that I have only seen a small part of them! Also, Prof. Susskind has given lectures about the same topic multiple times; in these cases I have included only one lecture. I will keep this list updated if lectures about new topics appear. Please comment if you find an error. Thanks!

EDIT:  Added Supersymmetry & Grand Unification, which makes it twelve lectures now.

Is the Hop Plot a Logistic Distribution?

The hop plot is a visualization of the distribution of pairwise distances in a network. It shows the number of nodes one can reach (on average) by going N steps in a network.  Here’s an example for the US electrical power grid network:

The X axis shows the numbers of steps taken and the Y axis shows the number of nodes reached, as a percentage.

What kind of distribution do these hop plots follow?  Maybe a normal distribution? A logistic distribution?  We could do all kinds of statistical tests to find out, but I want to do something different. I want to answer that question visually. Therefore, we’re going to use different ways of drawing the Y axis of the plot that correspond to each distribution. Let me explain: If we apply the inverse cumulative distribution function to the values on the Y axis, the plot should show a straight line if the values follows that distribution.  Since the values on the Y axis are between 0 and 1 and have the shape of a sigmoid function (see Wikipedia), we’re going to try out the various sigmoid functions that we know.

(1) Normal distribution – Inverse error function

The cumulative distribution function of the normal distribution is the error function. Let’s try it:

Hmm… does not quite look like a line.

(2) Logistic distribution

The cumulative distribution function of the logistic distribution is the logistic function:

Not that good either.

(3) The tangent sigmoid

We’ll use the arctangent function as our (scaled) cumulative distribution function:

(4) Hyperbolic tangent

We’ll use the hyperbolic tangent as our (scaled) cumulative distribution function:

This too does not fit at all.


In conclusion, none of the functions fit. Therefore, the distances in the US power grid do not follow any of the simple sigmoid models.

Two notes are in order though:

  • A plot is not a statistical test. You should do an actual statistic test when you write a paper!
  • Does anyone known what the usual network models (Erdős–Rényi, Barabási–Albert, Watts–Strogatz, etc.) result in?

Related Blog Posts


Taxonomy of Matrices

Normal matrices, diagonalizable matrices, orthogonal matrices, unitary matrices… the classes of matrices used in math, computer science and other areas can be quite complex and difficult to grasp sometimes. To get an overview of the different classes of complex n-by-n matrices, I made the following diagram:

Taxonomy of n-by-n matrices (PDF version) – with hyperlinks to Wikipedia articles and legend

This diagram is released as cc-by-sa. That means you can use and modify it, as long as you acknowledge me (just link to this page), and publish any modifications also under cc-by-sa.

In this diagram, each rectangle represents one set of complex n-by-n matrices. Arrows represent subset relations. There is lots of information in there, and every arrow tells a story. For instance, all rotation matrices have determinant 1, explaining the arrow from “Rotation matrices” to “Real unit-determinant matrices”. I also colored the classes which are closed under matrix multiplication in blue. Also, I noted down the corresponding multiplicative group, for instance SO(n) with rotation matrices.

There are likely errors in there: I tried to be as diligent as possible, but with the large amount of information compressed into a single diagram, there are bound to be errors. Please comment if you find one!  Also, I did not include lesser-known classes of matrices (such as unistochastic or co-positive matrices), matrices whose definition makes use of the ordering of row and column indices (such as triangular or tridiagonal matrices) and classes that are not closed under the transpose operation (such as left and right stochastic matrices).

A few other observations:

  • The identity matrix is everything except skew-symmetric or skew-Hermitian.
  • The top of the hierarchy are diagonalizable and invertible matrices.
  • Are doubly stochastic matrices a subset of any other class?
  • What are the name of the groups generated by multiplication of double stochastic matrices?
  • There is a one-to-one correspondence between several complex and real classes.  This is however not systematic. For instance, real symmetric matrices can be generalized to complex symmetric matrices, and also to Hermitian matrices.
  • Many classes have alternate names. For instance, Hermitian matrices are called self-adjoint, and rotation matrices are called special orthogonal.
  • The classes of positive-definite and positive-semidefinite matrices is to be interpreted in the narrow sense, i.e., only such matrices that are Hermitian (or symmetric in the real case).  This is also the reason why the class of positive-definite matrices is not closed under multiplication.


  • The PDF version now contains hyperlinks to the Wikipedia articles about the matrix classes and groups.
  • Added symplectic and unimodular matrices
  • Added a legend
  • Added cc-by-sa license
  • Inserted note about positive-definite matrices

The Power Law Paradox, or Why the Power Law Exponent Does Not Mean What You Think It Means

Large networks are often described as having a power-law degree distribution. In other words, the number of nodes having n neighbors is proportional to n−γ, for some number γ (this is the Greek letter Gamma).  Now, it is well-known that this is often not the case, and that many other degree distributions are easily mistaken for a power-law.  Nevertheless, some networks do have power laws, and in these networks, we might think about the meaning of the parameter γ.

Degree distribution of the CiteULike user–tag network

This is the degree distribution of the CiteULike user–tag network.  Using Aaron Clauset’s code, we can find out that it fits a power-law distribution with γ = 1.63.  The code from Aaron Clauset also returns a value of dmin = 1, which means that the power law is valid beginning with minimal degree 1 (i.e., for all nodes).

γ is the slope of the distribution, i.e., when γ is high, the number of nodes with high degree is smaller than the number of nodes with low degree.  We may thus think that a low value of γ denotes a more equal distribution, and that higher and higher values of γ denote more and more unfair degree distributions.  However, this is not the case.  In fact,  the opposite is true: A high value of γ represents a network in which the distribution of edges is fairer.

To measure the equality of the degree distribution, we can use a measure from Economics that has been used to measure the equality of a country’s income distribution. This measure is called the Gini coefficient.  The Gini coefficient can be computed by drawing the so-called Lorenz curve (shown below), and measuring the area that it traces out:

Lorenz curve (blue curve) of the Facebook friendship network

The Lorenz curve, as shown above as a blue curve, consists of all points (X,Y) such that the statement “X percent of the people on Facebook with the smallest number of friends account for Y percent of all friendship links.” For instance, the big black point which lies on the Lorenz curve allows us to state that “The 75% of Facebook users with the least number of friends account for only 25% of all friendship links.” Or, in other words, “The 25% of Facebook users with most friends account for 75% of all friendship links.” If the distribution of friendships among users is maximally equal, then all users have the same number of friends, and the Lorenz curve follows the diagonal of the plot (from the bottom left to the top right).  When the distribution is less equal, the Lorenz curve is nearer to the bottom and right of the plot.  Therefore, the area between the Lorenz curve and the diagonal is an indication of the equality of the degree distribution.  This area is maximally 0.5 (half the square), and therefore we define the Gini coefficient as twice this area, so the Gini coefficient is a measure of equality that goes from 0 (total equality) to 1 (total inequality).

Now we can finally ask:  If two networks with power law degree distributions have two different power law exponents γ, which of the two networks has a higher Gini coefficient?  To find out, we generate networks with perfect power laws, and compute their Gini coefficients. In fact, we don’t have to compute whole networks, it is enough to generate the degree distributions. The result looks like this:

Lorenz curves of perfect power law distributions, with γ ranging from 2.1 to 3.5

The result:  Power law distribution distributions with high γ have a Lorenz curve nearer to the diagonal.  Therefore, a higher value of γ means the distribution is more equal.

This result is somewhat surprising:  We would have thought that a higher slope on the degree distribution plots means that there are even less nodes with many neighbors. However, the opposite is the case.


(1) The mathematically correct name of a power law distribution is “Zeta distribution”, because the normalizing factor 1^(−γ) + 2^(−γ) + 3^(−γ) + … is exactly the Riemann zeta function of γ.

(2) In this test, γ must be strictly larger than 2, otherwise the distribution mean cannot be defined, simply because 1^−1 + 2^−1 + 3^−1 + … is not defined (see Harmonic series)


[1] Fairness on the Web: Alternatives to the Power Law. Jérôme Kunegis & Julia Preusse. WebSci 2012.

9 Reasons to Have an “Enemy” and “Dislike” Feature on Facebook

On Facebook, nothing is negative:  You can have friends, but not enemies.  You can like something but not dislike it.  Not satisfied with this status quo, Dean Terry from the University of Texas in Dallas has created EnemyGraph, a Facebook app that lets you add persons and things as enemies to your profile.

The idea of adding “negative links” may sound counterproductive to a social network, but in fact, there are good reasons to allow the negative “enemy” and “dislike” links in addition to the positive “friend” and “like” links:

  • It better reflects real-world relationships:  In the real world, not everything is positive.  Many relationships, both between people and things are negative. If an online social network wants to reflect the real world as best as possible, it better support enemies and dislikes.
  • Product recommendation gets better:  Facebook can recommend things you like to you, while filtering out what you will not like. Without negative feedback, Facebook will not know that you don’t like certain things or persons, and will continue to recommend them to you.
  • It makes filtering easier:  By removing things and people you don’t like from your newsstream, you Facebook wall will be more interesting to you.
  • It works. In fact, negative feedback is already implemented at a few places in Facebook. For instance, you can already remove individual persons and individual advertisements from your newsstream. Integrating these as first-class features would make it easier for users to understand what is going on.
  • Everybody does it:  Netflix, Amazon and many other websites allow you to give ratings one a rating scale from one to five stars, effectively giving a spectrum of ratings from “unlike” to “like”.
  • Test the principle of “the enemy of my enemy is my friend” in practice.
  • Allow better social network analysis:  Negative links have been used in studies on social networks.
  • Solve the spam and fake account problem. For instance on the website, users can mark other users as friends and foes, which makes to possible to automatically detect trolls.
  • Let people connect by common dislikes.  In fact, many groups in Facebook are created only for people that dislike a specific item, e.g. this.

Read more: