Plotting a Degree Distribution on the Command Line

The following shell command plots a degree distribution on a log-log scale.  It uses only standard Unix commands:

<INPUTFILE sed -re '/^\%/d;s,^\s*\S+\s*(\S+).*$,\1,' | sort | uniq -c | sort -n | sed -re 's,^\s*(\S+).*$,\1,;s,.,#,g' | uniq -c | sed -re 's,^\s*(\S+).*,\1,;s,.,#,g'

The file “INPUTFILE” must be a text representation of a directed network with one edge per line, each line containing a from-node ID and a to-node ID.  The command plots the indegree distribution on a doubly logarithmic scale.  Here it is executed for KONECT’s “YouTube links” dataset:

$ <out.youtube-links sed -re '/^\%/d;s,^\s*\S+\s*(\S+).*$,\1,' | sort | uniq -c | sort -n | sed -re 's,^\s*(\S+).*$,\1,;s,.,#,g' | uniq -c | sed -re 's,^\s*(\S+).*,\1,;s,.,#,g'
#######
#####
####
###
#

The file “out.youtube-links” can be downloaded from here.  The degree distribution is transposed from its usual orientation:  the degrees are on the Y axis and the frequency is on the X axis.  As the output shows, there is a power law here:  the lines get shorter in a linear fashion, and thus the frequency of each degree value is a negative power of the degree value.

This shows that:

  • You don’t need megabytes of graphics code to generate plots
  • You don’t need a log() function to plot logarithmic axes
  • It absolutely makes sense to use both sort(1) and uniq(1) multiple times in a single pipeline

The drawn degree distribution also uses logarithmic binning, and is thus much more useful for visualizing the tail of the distribution than the standard way of plotting degree distributions.
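
For reference, here is the same pipeline split over several lines, with a comment on each stage; the comments are my own annotations of what each step does:

<INPUTFILE \
sed -re '/^\%/d;s,^\s*\S+\s*(\S+).*$,\1,' |   # remove % comment lines; keep the to-node ID of each edge
sort | uniq -c |                              # count how often each node appears, i.e., its indegree
sort -n |                                     # sort the nodes by indegree
sed -re 's,^\s*(\S+).*$,\1,;s,.,#,g' |        # keep only the degree; turn each digit into '#' (bar length = number of digits, i.e., the log10 bin of the degree)
uniq -c |                                     # count the nodes falling into each logarithmic degree bin
sed -re 's,^\s*(\S+).*,\1,;s,.,#,g'           # keep only the bin count; turn each digit into '#' (bar length = log10 of the frequency)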

Try it out with more datasets:  List of datasets in KONECT

Can you write other shell one-liners for generating other network analysis plots?

Largest Network Dataset Ever Released – As Big As Facebook

Good news for the network analysis community:  A new very large Web hyperlink dataset was recently released:  the Web Data Commons Hyperlink Graph consisting of 3.56 billion pages and 129 billion hyperlinks connecting them!  For comparison, the Facebook friendship graph is reported to include 1.26 billion nodes (users) and 150 billion links (friendships).

The Web Data Commons Hyperlink Graph is thus the largest real-world network dataset available outside of the big companies themselves.  This means it is no longer only the big data companies that can do actual “Big Data” research – now any research group can, too.  Therefore, we should be seeing studies using this dataset in the coming years.  I can’t wait to find out what will be done with it.

For another comparison, our own Koblenz Network Collection, which collects network datasets of many different types and sizes, maxes out at 1.9 billion edges, with the Twitter follower network.  But since the new dataset is open, we will of course add it to the collection.

Defining the Clustering Coefficient

Clustering is an important property of social networks:  people tend to have friends who are also friends with each other, resulting in sets of people among which many edges exist, while a set of randomly chosen people would have far fewer edges between them. To measure the clustering in a social (or other type of) network, a common measure is the clustering coefficient.  The clustering coefficient is a real number between zero and one that is zero when there is no clustering, and one for maximal clustering, which happens when the network consists of disjoint cliques.  While the clustering in a network can be measured in a number of ways, one common way is to check for triangles:  when two edges share a node, a network with high clustering makes it likely that a third edge exists such that the three edges form a triangle. Although the clustering coefficient is often used, there are actually two variants of it – which may have completely different values.

Variant (1) – Define the clustering coefficient c1 as the probability that two incident edges are completed by a third one to form a triangle.  Two edges are incident when they share a vertex. In that case, we can check whether a third edge exists such that all three edges form a triangle.  The proportion of such incident edge pairs that are completed by a third edge to form a triangle then equals the clustering coefficient (variant 1).

Variant (2) – First, define the local clustering coefficient c(u) for each node u in the following way:  Let c(u) be the proportion of pairs of neighbors of u that are connected.  In other words, c(u) is the probability that two friends of u are friends of each other in a social network.  Then define the clustering coefficient c2 (variant 2) as the mean of all local clustering coefficients in the network.
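
As a concrete illustration, both variants can be computed with NetworkX.  This is a minimal sketch; the example graph is just a stand-in for whatever network you want to analyze:

# The two clustering coefficient variants, computed with NetworkX.
import networkx as nx

G = nx.karate_club_graph()  # stand-in example network; replace with your own

# Variant 1: probability that a pair of incident edges is closed into a triangle
c1 = nx.transitivity(G)

# Variant 2: mean of the local clustering coefficients c(u); count_zeros=False
# excludes nodes of degree 0 or 1, for which c(u) is undefined (see the notes below)
c2 = nx.average_clustering(G, count_zeros=False)

print(c1, c2)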

Both measures c1 and c2 measure the clustering in a network and should therefore correlate highly, right?  Well, no.  Look at the following plot:

[Figure: scatter.c.clusco.clusco_7.7]

This plot shows the two clustering coefficients for the networks in the Koblenz Network Collection. Each network is represented by a two- or three-letter code. The X axis represents c1 and the Y axis represents c2.  Clearly, we cannot state that the two variants correlate in any way.  Why is that? To see why the two clustering coefficients do not correlate, we have to think about how both are computed.  In the first variant, each pair of incident edges is given the same weight, but in the second variant each node is given the same weight.  This means that nodes with a high degree are overrepresented in variant 1, and underrepresented in variant 2.  The lack of correlation between the two measures is thus a hint that nodes with high and low degree do not, in general, have the same local clustering coefficient.  Let’s see:

[Figure: degcc.a.cit-HepPh]

This plot needs some explanation. It shows a citation network taken from arXiv’s High Energy Physics / Phenomenology section. The network consists of publications connected by citation links, i.e., a link from u to v means that publication u has cited publication v.  The network is taken as undirected here, i.e., we ignore who cited whom. Each blue dot in the plot is one vertex of the network, drawn according to its degree (X axis, logarithmic) and its local clustering coefficient (Y axis, logarithmic).  The red line shows the average local clustering coefficient taken over all nodes of the same degree.  What we see here is clear:  nodes with low degree have a high clustering coefficient. In other words, when a person has many friends, these friends have fewer edges among themselves, which is to be expected, since a person with many friends is likely to have friends from more diverse communities, and a paper cited many times is likely to be cited by papers from more diverse areas.

This explains two things:  First, because nodes with high degree are weighted more in the first variant, and these nodes tend to have a lower local clustering coefficient, the first variant will have lower values than the second one.  Second, the non-correlation between the two variants is due to the fact that the discrepancy between the local clustering coefficients of high-degree and low-degree nodes differs from network to network.  Networks for which both clustering coefficients are near each other have local clustering coefficients that vary little with degree, whereas networks high up in the plot have a larger discrepancy between the local clustering coefficients of high-degree and low-degree nodes.

A few more notes

The clustering coefficients are not always defined. The first variant is not defined when no vertex has degree larger than one. This, of course, is not a problem in practice. The local clustering coefficient is only defined for nodes whose degree is larger than one. Nodes with degree zero or one are present in actual social networks, and the usual method is to simply ignore them and compute the average of the local clustering coefficient over all nodes with degree larger than one, which is what is done in the plots presented here.

Resources

Datasets, Matlab Code, Complete definitions of numerical network statistics

How to Visualize Skewed Distributions

Many distributions, such as the size of cities, the frequency of words, the number of followers of people on Twitter, and others, are very skewed.  For instance, 16.2% of songs make up 83.8% of plays on Last.fm, and 18.8% of groups make up 81.2% of all group memberships on YouTube. These types of distributions are described by expressions such as power law, long tail, and other terms. These terms are often confused, as they refer to slightly different concepts. Also, the confusion is made worse by the existence of multiple ways of visualizing such distributions, which are different mathematically but similar in appearance. In this blog post, I will review the most important ways of visualizing such distributions, and will explain the mathematics behind them.

Zipf’s Law

Many things are distributed in a power-law-like way. For instance, in 1935 the linguist George Zipf studied the distribution of words in the English language. He counted, for each word, its frequency in a corpus of text. He then sorted the words by descending frequency, and noted that a word’s frequency is approximately inversely proportional to that word’s rank. That is, the second most common word is twice as rare as the most common word, the third most common word is three times as rare as the most common word, and so on.

This series of numbers can be visualized like this:

[Figure: zipf.v.gottron-reuters]

The data used for this plot is from a dataset of Reuters news articles in English. (Since this data is taken from a collection of network datasets, the plot uses the term degree to refer to a word’s frequency.) Each point in this plot represents one word, with the X axis representing the rank of the word (from most frequent to least frequent), and the Y axis representing its frequency. The plot is drawn with both axes logarithmic. Thus, the top-left point represents the most common English word (which is “the”), and very rare words are on the right of the plot (showing that many words occur only once in the dataset).

An exact inversely proportional relationship as described by Zipf would look like a straight line with slope exactly minus one on this plot, because the axes are logarithmic. With this dataset, this is not the case (in particular in the top-left area). This is because the dataset was stripped of the words that are ignored by search engines, such as “the”, “of”, etc., which happen to be the most common words in the English language. Therefore, the plot does not show a straight line, and we may be interested in the shape of this plot as a visualization of the distribution.
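
For readers who want to reproduce this kind of plot, here is a minimal sketch in Python; the input file and the whitespace tokenization are placeholders, not the pipeline actually used for the Reuters dataset:

# Zipf-type rank-frequency plot on doubly logarithmic axes.
from collections import Counter
import matplotlib.pyplot as plt

words = open("corpus.txt").read().lower().split()     # placeholder corpus
freq = sorted(Counter(words).values(), reverse=True)  # frequencies, most common word first

plt.loglog(range(1, len(freq) + 1), freq, ".")
plt.xlabel("rank of word"); plt.ylabel("frequency of word")
plt.show()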

The Distribution Plot

In terms of probability theory, the sizes, frequencies or follower counts follow a distribution, and we are thus interested in the shape of that distribution. The plot inspired by Zipf’s Law is therefore often said to be a bad way of visualizing a distribution.  Instead, one may show the distribution plot, i.e., the plot showing the frequency of each value. For the English words example, this means a plot where the X axis shows word frequency values, and the Y axis shows the number of words having each frequency.  This results in the following plot:

[Figure: degree.v.gottron-reuters]

Again, we use a log-log plot. Unlike the previous plot, this looks much more like a straight line. What is going on? In a log-log plot, a straight line is in fact a power law:  log(y) = a log(x) + b is equivalent to y = e^b x^a. Thus, the plot suggests that the number of words with frequency x is proportional to x^a, for some constant a, which equals about −1.6 in this case.  This type of relationship is called a power law, and the value −a is then called the power-law exponent. This plot type, however, is misleading. Since for each value x ≥ 100, only a few words have frequency exactly x, the right half of the plot cannot be interpreted visually at all, and thus the plot effectively only visualizes the distribution of infrequent words.
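
A sketch of this plot type, under the same placeholder assumptions as before:

# Distribution plot: for each frequency value x, the number of words
# occurring exactly x times, on doubly logarithmic axes.
from collections import Counter
import matplotlib.pyplot as plt

words = open("corpus.txt").read().lower().split()  # placeholder corpus
freq_of_freq = Counter(Counter(words).values())    # frequency value -> number of words

plt.loglog(list(freq_of_freq.keys()), list(freq_of_freq.values()), ".")
plt.xlabel("frequency x"); plt.ylabel("number of words with frequency x")
plt.show()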

The Complementary Cumulated Distribution Plot

In order to also visualize words with uncommon (i.e., high) frequencies, a solution is to plot, for each x, not the probability that a word has frequency x (which is proportional to the number of words having that frequency), but the probability that a word has frequency greater than or equal to x. This is called the complementary cumulated distribution plot (it would be called simply the cumulated distribution plot if it showed the probability of the frequency being less than a given x).

[Figure: bidd.vx.gottron-reuters]

Again, this plot type is shown with both axes logarithmic. Again, this plot type will show a straight line when the data follows a power law. However, this plot does show the full range of values on the X axis, and in fact we can see right away that this is not a power law, as the right half of the plot deviates significantly from a straight line, showing again that our dataset does not follow Zipf’s Law, because very common words were removed from it.

Is this plot therefore a better visualization of a skewed distribution than the simple distribution plot? Yes, because it visualizes the whole spectrum of possible values, even rare ones. Is this plot also better than the plot derived from Zipf’s analysis? No, because it is the same plot. In fact, Zipf’s plot and the complementary cumulated distribution are mirror images of each other, mirrored around the 45-degree diagonal axis going from the lower left to the upper right. This can be seen by noting that the rank of a word with frequency x equals the probability of a word having frequency ≥ x, multiplied by the total number of words. Thus both plots are identical, up to an exchange of the axes.
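
The complementary cumulated distribution can be sketched as follows (placeholder input again); note that it is computed from the same word-frequency counts as the Zipf-type plot, which is another way of seeing that the two plots are mirror images:

# Complementary cumulated distribution: fraction of words whose frequency
# is greater than or equal to x, on doubly logarithmic axes.
from collections import Counter
import numpy as np
import matplotlib.pyplot as plt

words = open("corpus.txt").read().lower().split()        # placeholder corpus
freq = np.sort(np.array(list(Counter(words).values())))  # ascending

ccdf = 1.0 - np.arange(len(freq)) / len(freq)  # fraction of words with frequency >= freq[i]

plt.loglog(freq, ccdf, ".")
plt.xlabel("frequency x"); plt.ylabel("P(frequency >= x)")
plt.show()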

The Long Tail

The confusion between the different types of plots is made worse by the ambiguity of the term “long tail”. In marketing, the long tail refers to the bulk of products which are bought only rarely, and thus are traditionally ignored by retailers, but from which online retailers can profit, since they don’t have the space constraints that traditional retailers have. Rare items are similar to infrequent words, and thus the term long tail describes the Zipf-type plot, in which infrequent items form a tail-like shape to the right of the figure. In the two other plots however, infrequent items are shown on the upper left corner, and thus do not form a tail at all.

What may also confuse people is the use of the word tail in probability theory to characterize the ends of distributions. In probability theory, the expressions heavy-tailed distribution and fat-tailed distribution have specific meanings and refer to the tail as seen in the distribution plots; they thus refer to very large values of a variable, and not to very small values, as in the expression long tail. What is worse, the expression long-tailed distribution is sometimes also used in probability theory, where it likewise describes the behavior of a distribution at large values of a variable.

The Lorenz Curve

To complete our little tour, I would like to mention the Lorenz curve, which shows the amount of text covered by the X% least frequent words. That is, the X axis goes from 0 to 100, and the Y axis shows the share of the total text made up by the X% least frequent words. The Lorenz curve is usually used to visualize wealth distributions, and is plotted without logarithmic scales. For our English words dataset, it looks like this:

[Figure: lorenz.vb.gottron-reuters]

The area between the Lorenz curve (in blue) and the diagonal is then, when multiplied by two, the Gini coefficient, a measure of inequality as used in economics. Since the plot does not use logarithmic axes, neither power-law distributions nor Zipf’s Law can be easily recognized on it.
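
A sketch of the Lorenz curve and the Gini coefficient, under the same placeholder assumptions:

# Lorenz curve: share of text covered by the x% least frequent words,
# and the Gini coefficient as twice the area between curve and diagonal.
from collections import Counter
import numpy as np
import matplotlib.pyplot as plt

words = open("corpus.txt").read().lower().split()        # placeholder corpus
freq = np.sort(np.array(list(Counter(words).values())))  # least frequent words first

cum = np.cumsum(freq) / freq.sum()            # cumulative share of the text
x = np.arange(1, len(freq) + 1) / len(freq)   # cumulative share of the words

print("Gini coefficient:", 1 - 2 * np.trapz(cum, x))

plt.plot(100 * x, 100 * cum, label="Lorenz curve")
plt.plot([0, 100], [0, 100], "--", label="diagonal")
plt.xlabel("% of least frequent words"); plt.ylabel("% of text covered")
plt.legend(); plt.show()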

Conclusion

There are many ways to visualize skewed distributions, and care must be taken when creating, reading and interpreting such plots. We recommend using the complementary cumulated distribution plot, as it is most aligned with probability theory, and is just as expressive as the Zipf-type plot. But in the end, remember that a plot is not a replacement for a proper statistical test, so if you want to assert that a dataset follows Zipf’s Law or a power law, a statistical test is needed. Based on my experiments with degree distributions in KONECT, my cautious guess is that there are almost no statistically significant power laws in real-world networks (and I can make no statement about other types of distributions such as city sizes, etc.)

The Surprising Equality of the Mean and Median Path Length in Online Networks

A common statistic computed for networks is the diameter.  The diameter equals the longest shortest path length in a network.  Let me explain:  Given two nodes u and v in a network, their distance is the minimal number of edges needed to reach one from the other.  In other words, the distance between u and v equals the shortest path length between these two nodes. Now, find two nodes u and v such that the distance between them is maximal – this is the diameter of the network. The diameter is thus an integer that characterizes a network.

The diameter is an important statistic for a network, since it is used to characterize a network as a small world. The expression small world refers to the observation that it is possible to reach any node from any node in a small number of steps, as in the famous six degrees of separation.

The diameter is only one of several ways of measuring the “typical path length” in a network, and in fact, it may not even be the most representative one. As it is a maximum, it is not representative of the full distribution of path lengths in the network. Also, being defined as a maximum, it is sensitive to outliers, e.g., tendrils in the network that increase the diameter but do not affect the overall structure of the network. Therefore, several alternative measures of the typical path length can be used:

  • The diameter (as defined above) is the maximum path length
  • The mean path length, averaged over all node pairs
  • The median path length
  • The 90-percentile effective diameter
  • The 50-percentile effective diameter

The effective diameters (50- and 90-percentile) are the path lengths such that 50 or 90 percent of node pairs are a smaller or equal distance apart, interpolated between integer values.  Thus, the median path length is the 50-percentile effective diameter rounded down.
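
To make these definitions concrete, here is a sketch that computes all five statistics with NetworkX.  The exhaustive all-pairs computation is only feasible for small networks, and the linear interpolation convention is my assumption of how the effective diameter is defined, not KONECT’s exact code:

# The five path-length statistics of a network.
import networkx as nx
import numpy as np

G = nx.erdos_renyi_graph(200, 0.05, seed=1)  # stand-in example network

# distances between all unordered node pairs
dists = np.array([d for u, lengths in nx.all_pairs_shortest_path_length(G)
                    for v, d in lengths.items() if u < v])

def effective_diameter(dists, q):
    """Interpolated distance d such that a fraction q of node pairs is at distance <= d."""
    cdf = np.array([(dists <= d).mean() for d in range(dists.max() + 1)])
    d = int(np.searchsorted(cdf, q))              # first integer distance with cdf >= q
    prev = cdf[d - 1] if d > 0 else 0.0
    return d - 1 + (q - prev) / (cdf[d] - prev)   # linear interpolation between d-1 and d

print("diameter:", dists.max())
print("mean path length:", dists.mean())
print("median path length:", np.median(dists))
print("50-percentile effective diameter:", effective_diameter(dists, 0.5))
print("90-percentile effective diameter:", effective_diameter(dists, 0.9))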

The statistics are shown in the following plot, for the Slashdot Zoo network:

[Figure: hopdistr.a.slashdot-zoo]

This shows the cumulated distribution of path lengths over all node pairs in the network.  In this network, the diameter (δ) equals 12. The 50- and 90-percentile effective diameters are shown as the two vertical red lines and equal 3.51 (δ_0.5) and 4.71 (δ_0.9), respectively. The median path length is thus 3 in this network.  The mean path length (δ_m) equals 4.04 and lies somewhere in between. We see that the distribution is skewed, because the mean and median path length are not equal in this network.  However, when I did this analysis for more networks, I found that in most cases, both numbers coincide. The following plot shows many networks (those small enough that the numbers could be calculated), plotted by the 50-percentile effective diameter (on the X axis) and the mean path length (on the Y axis). Each network is represented by a two-letter (or, for a few, three-letter) code. See the results for yourself:

[Figure: scatter.c.diam_3.diam_4.4]

This plot shows that the two statistics not only correlate very highly, but are almost equal for almost all networks. In other words: the 50-percentile effective diameter and the mean path length are almost equal in actual networks.  What does this mean?  I’m not sure, but this result is not trivial, since the distribution of path lengths is not symmetric in most networks.

For more path length distributions, see here.

For another blog post about the path length distribution, see

The 65 Languages in My Music Collection

Like most of you, I have collected music throughout my life, first as CDs and later on my computer. In particular, a hobby of mine is to collect music in the languages of the world – the more the better! One day, I started tagging each song in my collection with the language it is sung in. Today, about 70% of the 16193 tracks I have are tagged with at least one language.

Here’s the breakdown:

English (8346 songs), French (2366 songs), German (1128 songs), Spanish (587 songs), Portuguese (241 songs), Italian (201 songs), Dutch (197 songs), Arabic (76 songs), Russian (69 songs), Norwegian (69 songs), Latin (64 songs), Nepali (44 songs), Wolof (32 songs), Turkish (31 songs), Occitan (27 songs), Malagasy (26 songs), Korean (23 songs), Chinese (22 songs), French-based creoles (21 songs), Japanese (19 songs), Dyula (19 songs), Polish (18 songs), Swedish (17 songs), Jamaican Creole (13 songs), Finnish (13 songs), Zulu (12 songs), Hawaiian (12 songs), Bulgarian (11 songs), Shona (8 songs), Haitian Creole (8 songs), Nubian (7 songs), Xhosa (6 songs), Catalan (6 songs), Georgian (5 songs), Hindi (5 songs), Galician (4 songs), Hungarian (3 songs), Middle High German (3 songs), Esperanto (3 songs). I have 2 songs in each of the following languages: Sesotho, Romanian, Punjabi, Lingala, Limburgish, Icelandic, Irish Gaelic, Serbian.  I have one song in each of the following languages: Venda, Klingon, Tagalog, Swahili, Old Irish, Scots, Low German (i.e., Plattdeutsch), Ndebele, Mongolian, Modern Hebrew, Fang, Persian, Duala, Breton, Amharic, Afrikaans.

Note:  Some songs contain multiple languages and are counted multiple times in this list.  Some languages are only used for a single phrase, making it arguable that the song is sung “in that language”.

If you’re wondering why a certain language is not included:  There are over 6000 languages in the world, and I have no idea in how many of them songs have been recorded. In any case, I will gladly accept tips for songs to add to the collection :-)

Happy Music Listening to everyone!

Kernel of Convex Combination of Matrices

Let A and B be two real m×n matrices, with m ≪ n, such that there exist vectors x and y of length n satisfying

Ax = 0,    x > 0

By = 0,    y > 0

I.e., such that the kernel of each matrix contains a vector with all components larger than zero.

Question:  Is the same true for convex combinations of A and B?

Put differently:  For any 0 < α < 1, is there a vector z such that

[ αA + (1−α)B ] z = 0,    z > 0  ?

I would be very thankful for a proof, or a refutation.
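
While this is no substitute for a proof, here is a sketch of a numerical search for a counterexample; the construction of the random instances is ad hoc, and the positivity check uses the fact that a strictly positive kernel vector can be rescaled so that all components are at least 1:

# Numerical search for a counterexample (not a proof).
import numpy as np
from scipy.optimize import linprog

m, n = 3, 8
rng = np.random.default_rng(0)

def has_positive_kernel_vector(M):
    """Feasibility of M z = 0, z >= 1 (equivalent to a strictly positive kernel vector)."""
    res = linprog(c=np.zeros(n), A_eq=M, b_eq=np.zeros(M.shape[0]),
                  bounds=[(1, None)] * n, method="highs")
    return res.status == 0

def random_instance():
    """Random m x n matrix whose kernel contains a strictly positive vector x."""
    x = rng.random(n) + 0.1                  # strictly positive
    A = rng.standard_normal((m, n))
    return A - np.outer(A @ x, x) / (x @ x)  # make the rows orthogonal to x, so A x = 0

for _ in range(1000):
    A, B = random_instance(), random_instance()
    alpha = rng.random()
    if not has_positive_kernel_vector(alpha * A + (1 - alpha) * B):
        print("possible counterexample found")
        break
else:
    print("no counterexample among 1000 random instances")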

Thanks to anyone who can help!

Index Notation as Projection Operator and Division of Vectors

I came across a nice equivalence between the standard index notation for vectors and the vector projection operator. This works for vectors of any dimension, but I’ll explain it for vectors in three dimensions.
A vector a can be expressed by its coordinates in a coordinate system:

(a_x, a_y, a_z)

These are three numbers that depend on the chosen coordinate system.  Any coordinate system can be described by three vectors x, y and z.  Given these three vectors, we can then write the components of a as

a_x = (a · x) / ‖x‖².

We can interpret this as a binary operation with arguments a and x. This operator corresponds to the projection of a onto x, expressed in units where x has length one:

Using the points laid out as in this graphic, we can write the operator as

|a_x| = AC / AD

The sign of a_x is given by the relative directions of AC and AD.

Since this operator only depends on the dot product (a · x) and the norm ‖x‖, its value is independent of the coordinate system in which it is computed.

As a binary operator, a_x is linear in the first argument, i.e., for any number c we have

(ca)_x = c (a_x),

and for any two vectors a and b we have

(a + b)_x = a_x + b_x.

The operator is inverse-linear in the second argument:

a_(cx) = c⁻¹ (a_x)

The operator is defined for all vectors a and x, except when x is the zero vector.

When the two vectors a and x have the same norm ‖a‖ = ‖x‖, then exchanging the two operands does not change the value:

a_x = x_a  ⇔  ‖a‖ = ‖x‖

This value a_x = x_a then equals the cosine of the angle between a and x.  In the general case, however, the operator is not commutative.

The operator is zero if and only if the two vectors are orthogonal:

a_x = 0  ⇔  a ⊥ x

These properties are similar to those of scalar division, although this operation is of course not a proper division, since it does not return a vector, and cannot be expressed in terms of products and inverses. Note also how, by coincidence, the notation a_x puts a in a high position and x in a low position, just as in a fraction.
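
The properties above are easy to spot-check numerically; this is a small sketch with arbitrary random vectors:

# Numeric spot-check of the properties of the operator a_x.
import numpy as np

def proj(a, x):
    """a_x = (a . x) / ||x||^2, the component of a along x in units of x."""
    return np.dot(a, x) / np.dot(x, x)

rng = np.random.default_rng(42)
a, b, x = rng.standard_normal(3), rng.standard_normal(3), rng.standard_normal(3)
c = 2.5

assert np.isclose(proj(c * a, x), c * proj(a, x))           # linear in the first argument
assert np.isclose(proj(a + b, x), proj(a, x) + proj(b, x))  # additive in the first argument
assert np.isclose(proj(a, c * x), proj(a, x) / c)           # inverse-linear in the second argument

# with equal norms, the operator is symmetric and equals the cosine of the angle
u, v = a / np.linalg.norm(a), x / np.linalg.norm(x)
assert np.isclose(proj(u, v), proj(v, u))
assert np.isclose(proj(u, v), np.dot(u, v))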

Eleven Essential Physics Lectures by Prof. Leonard Susskind

Prof. Leonard Susskind from Stanford University is one of the most prominent physicists of today, and in the last few years he has given a series of physics lectures, which are now online. These lectures are, in my opinion, the fastest way to learn physics. Prof. Susskind is really good at explaining complicated topics. In fact, he doesn’t just follow a script – instead, he solves problems and derives results on the whiteboard just as he would on paper by himself. Although he makes errors from time to time and has to be corrected by the audience, these lectures are still extremely didactic. I have rarely learned so much in so little time.  The lectures range from classical mechanics to cosmology, and cover all important topics in physics. Since there doesn’t seem to be a central place that lists these lectures, I’ve made a list here:

I’ve tried to list the lectures in a rough order of how they should be watched, but beware that I have only seen a small part of them! Also, Prof. Susskind has given lectures about the same topic multiple times; in these cases I have included only one lecture. I will keep this list updated if lectures about new topics appear. Please comment if you find an error. Thanks!

EDIT:  Added Supersymmetry & Grand Unification, which makes it twelve lectures now.

Is the Hop Plot a Logistic Distribution?

The hop plot is a visualization of the distribution of pairwise distances in a network. It shows the number of nodes one can reach (on average) by going N steps in a network.  Here’s an example for the US electrical power grid network:

The X axis shows the number of steps taken, and the Y axis shows the number of nodes reached, as a percentage.

What kind of distribution do these hop plots follow?  Maybe a normal distribution? A logistic distribution?  We could do all kinds of statistical tests to find out, but I want to do something different: I want to answer the question visually. Therefore, we’re going to use different ways of drawing the Y axis of the plot, each corresponding to one candidate distribution. Let me explain: if we apply the inverse cumulative distribution function to the values on the Y axis, the plot will show a straight line exactly if the values follow that distribution.  Since the values on the Y axis are between 0 and 1 and have the shape of a sigmoid function (see Wikipedia), we’re going to try out the various sigmoid functions that we know, as sketched below.
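
Here is a sketch of the idea; the hop-plot values below are made up for illustration, and each transform is the inverse of the (scaled) sigmoid named in the comment:

# Transform the hop plot's Y values by candidate inverse CDFs; the transform
# that yields a straight line identifies the matching distribution.
import numpy as np
import matplotlib.pyplot as plt
from scipy.special import erfinv

steps = np.arange(1, 11)
frac = np.array([.001, .01, .05, .2, .5, .8, .95, .99, .997, .999])  # made-up hop plot

transforms = {
    "probit (normal)":  lambda p: erfinv(2 * p - 1),        # inverse of the error-function sigmoid
    "logit (logistic)": lambda p: np.log(p / (1 - p)),      # inverse of the logistic function
    "tan (arctangent)": lambda p: np.tan(np.pi * (p - .5)), # inverse of the scaled arctangent
    "artanh (tanh)":    lambda p: np.arctanh(2 * p - 1),    # inverse of the scaled hyperbolic tangent
}

fig, axes = plt.subplots(1, 4, figsize=(16, 3))
for ax, (name, f) in zip(axes, transforms.items()):
    ax.plot(steps, f(frac), "o-")
    ax.set_title(name)
    ax.set_xlabel("steps")
plt.tight_layout()
plt.show()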

(1) Normal distribution – Inverse error function

The cumulative distribution function of the normal distribution is (a rescaled version of) the error function. Let’s try it:

Hmm… does not quite look like a line.

(2) Logistic distribution

The cumulative distribution function of the logistic distribution is the logistic function:

Not that good either.

(3) The tangent sigmoid

We’ll use the arctangent function as our (scaled) cumulative distribution function:

(4) Hyperbolic tangent

We’ll use the hyperbolic tangent as our (scaled) cumulative distribution function:

This too does not fit at all.

Conclusion

In conclusion, none of the functions fit. Therefore, the distances in the US power grid do not follow any of the simple sigmoid models.

Two notes are in order though:

  • A plot is not a statistical test. You should do an actual statistical test when you write a paper!
  • Does anyone know what the usual network models (Erdős–Rényi, Barabási–Albert, Watts–Strogatz, etc.) result in?
