The Surprising Equality of the Mean and Median Path Length in Online Networks

A common statistic computed for networks is the diameter.  The diameter equals the longest shortest path length in a network.  Let me explain:  Given two nodes u and v in a network, their distance is the minimal number of edge needed to reach one from the other.  In other words, the distance between u and v equals the shortest path length between these two nodes. Now, find two nodes u and v such that the distance between them is maximal – this is the diameter of the network. The diameter is thus an integer that characterizes a network.

The diameter is an important statistic for a network, since it is used to characterize a network as a small world. The expression small world refers to the observation that it is possible to reach any node from any node in a small number of steps, as in the famous six degrees of separation.

The diameter is only one of several ways of measuring the “typical path length” in networks, and in fact, it may not even be the most representative one. As it is a maximum, it is not representative of the full distribution of path lengths in the network. Also, being defined as a maximum, it is sensitive to outliers, e.g., tendrils in the networks that increase the diameter but do not affect the overall structure of the network. Therefore, several alternatives measures of typical path lengths can be used:

  • The diameter (as defined above) is the maximum path length
  • The mean path length, averaged over all node pairs
  • The median path length
  • The 90-percentile effective diameter
  • The 50-percentile effective diameter

The effective diameters (50- and 90-percentile) are the path lengths such that 50 or 90 percent of node pairs are a smaller or equal distance apart, interpolated between integer values.  Thus, the median path length is the 50-percentile effective diameter rounded down.

The statistics are shown in the following plot, for the Slashdot Zoo network:

hopdistr.a.slashdot-zooThis shows the cumulated distribution of path lengths over all node pairs in the network.  In this network, the diameter (δ) equals 12. The X-percentile effective diameters are shown as the two vertical red lines and equal 3.51 for the 50-percentile effective diameter (δ0.5), and 4.71 for the 90-percentile effective diameter (δ0.9). The median path length is thus 3 in this network.  The mean path length (δm) equals 4.04 and is somewhere in between. We see that the distribution is skewed, because the mean and median path length are not equal in this network.  However, when I did this analysis for more networks, I found that in most cases, both numbers coincide. The following plot shows many networks (those small enough such that the number could be calculated), plotted by the 50-percentile effective diameter (on the X axis) and the mean path length (on the Y axis). Each network is represented by a two-letter (and a for few three-letter) code. See the results for yourself:

scatter.c.diam_3.diam_4.4This plot shows that both not only correlate very highly, but are almost equal for almost all networks. In other words: the 50-percentile effective diameter and the mean path length are almost equal in actual networks.  What does this mean?  I’m not sure, but this result is not trivial, since the distribution of path lengths is not symmetric in most networks.

For more path length distributions, see here.

For another blog post about the path length distribution, see

The 65 Languages in My Music Collection

As most of you, I have collected music in my life, first as CDs and later on my computer. In particular, a hobby of mine is to collect music in the languages of the world – the more the better! One day, I started tagging each song in my collection with the language it is sung in. Today, about 70% of the 16193 tracks I have are tagged with at least one language.

Here’s the breakdown:

English (8346 songs), French (2366 songs), German (1128 songs), Spanish (587 songs), Portuguese (241 songs), Italian (201 songs), Dutch (197 songs), Arabic (76 songs), Russian (69 songs), Norwegian (69 songs), Latin (64 songs), Nepali (44 songs), Wolof (32 songs), Turkish (31 songs), Occitan (27 songs), Malagasy (26 songs), Korean (23 songs), Chinese (22 songs), French-based creoles (21 songs), Japanese (19 songs), Dyula (19 songs), Polish (18 songs), Swedish (17 songs), Jamaican Creole (13 songs), Finnish (13 songs), Zulu (12 songs), Hawaiian (12 songs), Bulgarian (11 songs), Shona (8 songs), Haitian Creole (8 songs), Nubian (7 songs), Xhosa (6 songs), Catalan (6 songs), Georgian (5 songs), Hindi (5 songs), Galician (4 songs), Hungarian (3 songs), Middle High German (3 songs), Esperanto (3 songs). I have 2 songs in each of the following languages: Sesotho, Romanian, Punjabi, Lingala, Limburgish, Icelandic, Irish Gaelic, Serbian.  I have one song in each of the following languages: Venda, Klingon, Tagalog, Swahili, Old Irish, Scots, Low German (i.e., Plattdeutsch), Ndebele, Mongolian, Modern Hebrew, Fang, Persian, Duala, Breton, Amharic, Afrikaans.

Note:  Some songs contain multiple languages and are counted multiple times in this list.  Some languages are only used for a single phrase, making it arguable that the song is sung “in that language”.

If you’re wondering why a certain language is not included:  There are over 6000 languages in the world, and I have no idea in how many of these songs have been recorded. In any case, I will gladly accept in tips for songs to add to the collection :-)

Happy Music Listening to everyone!

 

 

Kernel of Convex Combination of Matrices

Let A and B be two real m×n matrices, with m << n, such that there exists vectors x, y of length n such that

Ax = 0,    x > 0

By = 0,    y > 0

I.e., such that the kernel of each matrix contains a vector with all components larger than zero.

Question:  Is the same true for convex combinations of A and B?

Put differently:  For any 0 < α < 1, is there a vector z such that

[ αA + (1−α)B ] z = 0,    z > 0  ?

I would be very thankful for a proof, or a refutation.

Thanks to anyone who can help!

Index Notation as Projection Operator and Division of Vectors

I came across a nice equivalence between the standard index notation for vectors and the vector projection operator. This works for vectors of any dimension, but I’ll explain it for vectors in three dimensions.
A vector a can be expressed by its coordinates in a coordinate system:

(ax, ay, az)

These are three numbers that depend on the chosen coordinate system.  Any coordinate system can be described by three vectors x, y and z.  Given these three vectors, we can then write the components of a as

ax = (a · x) / ‖x‖².

We can interpret this as a binary operation with arguments a and x. This operator corresponds to the projection of a onto x, expressed in units where x has length one:

Using the points laid out as in this graphic, we can write the operator as

|ax| = AC / AD

This sign of ax is given by the relative directions of AC and AD.

Since this operator only depends on the dot product (a · x) and the norm ‖x‖, its value is independent of the coordinate system in which it is computed.

As a binary operator, ax is linear in the first argument, i.e., for any number c we have

(ca)x = c (ax),

and for any two vectors a and b we have

(a + b)x = ax + bx.

The operator is inverse-linear in the second operator:

a(cx) = c−1 (ax)

The operator is defined for all vectors a and x, except when x is the zero vector.

When the two vectors a and x have the same norm ‖a‖ = ‖x‖, then exchanging the two operands does not change the value:

ax = xa  ⇔  ‖a‖ = ‖x

This value ax = xa then equals the cosine of the angle between a and x.  In the general case, the operator is however not commutative.

The operator is zero if and only if the two vectors are orthogonal:

ax = 0  ⇔  ax

These properties are similar to those of scalar division, although this operation is of course not a proper division, since it returns not a vector, and cannot be expressed in terms of products and inverses. Note also how by coincidence, the notation ax puts a in a high position and x in a low position, just as in a fraction.

Eleven Essential Physics Lectures by Prof. Leonard Susskind

Prof. Leonard Susskind from Stanford University is one of the most prominent physicists of today, and has given physics lectures in the last few years, which are now online. These lectures are, in my opinion, the fastest way to learn physics. Prof. Susskind is really good at explaining complicated topics. In fact, he doesn’t just follow a script – instead he solves problems and derives results on the whiteboard just as he would on paper by himself. Although he makes errors from time to time, and has to be corrected by the audience, these lectures are still extremely didactic. I really have only very rarely learned so much in so little time.  The lectures range from classical mechanics to cosmology, and cover all important topics in physics. Since there doesn’t seem to be a central place that lists these lectures, I’ve made a list here:

I’ve tried to list the lectures in a rough order of how they should be watched, but beware that I have only seen a small part of them! Also, Prof. Susskind has given lectures about the same topic multiple times; in these cases I have included only one lecture. I will keep this list updated if lectures about new topics appear. Please comment if you find an error. Thanks!

Is the Hop Plot a Logistic Distribution?

The hop plot is a visualization of the distribution of pairwise distances in a network. It shows the number of nodes one can reach (on average) by going N steps in a network.  Here’s an example for the US electrical power grid network:

The X axis shows the numbers of steps taken and the Y axis shows the number of nodes reached, as a percentage.

What kind of distribution do these hop plots follow?  Maybe a normal distribution? A logistic distribution?  We could do all kinds of statistical tests to find out, but I want to do something different. I want to answer that question visually. Therefore, we’re going to use different ways of drawing the Y axis of the plot that correspond to each distribution. Let me explain: If we apply the inverse cumulative distribution function to the values on the Y axis, the plot should show a straight line if the values follows that distribution.  Since the values on the Y axis are between 0 and 1 and have the shape of a sigmoid function (see Wikipedia), we’re going to try out the various sigmoid functions that we know.

(1) Normal distribution – Inverse error function

The cumulative distribution function of the normal distribution is the error function. Let’s try it:

Hmm… does not quite look like a line.

(2) Logistic distribution

The cumulative distribution function of the logistic distribution is the logistic function:

Not that good either.

(3) The tangent sigmoid

We’ll use the arctangent function as our (scaled) cumulative distribution function:

(4) Hyperbolic tangent

We’ll use the hyperbolic tangent as our (scaled) cumulative distribution function:

This too does not fit at all.

Conclusion

In conclusion, none of the functions fit. Therefore, the distances in the US power grid do not follow any of the simple sigmoid models.

Two notes are in order though:

  • A plot is not a statistical test. You should do an actual statistic test when you write a paper!
  • Does anyone known what the usual network models (Erdős–Rényi, Barabási–Albert, Watts–Strogatz, etc.) result in?

Related Blog Posts

References

Taxonomy of Matrices

Normal matrices, diagonalizable matrices, orthogonal matrices, unitary matrices… the classes of matrices used in math, computer science and other areas can be quite complex and difficult to grasp sometimes. To get an overview of the different classes of complex n-by-n matrices, I made the following diagram:

Taxonomy of n-by-n matrices (PDF version) – with hyperlinks to Wikipedia articles and legend

This diagram is released as cc-by-sa. That means you can use and modify it, as long as you acknowledge me (just link to this page), and publish any modifications also under cc-by-sa.

In this diagram, each rectangle represents one set of complex n-by-n matrices. Arrows represent subset relations. There is lots of information in there, and every arrow tells a story. For instance, all rotation matrices have determinant 1, explaining the arrow from “Rotation matrices” to “Real unit-determinant matrices”. I also colored the classes which are closed under matrix multiplication in blue. Also, I noted down the corresponding multiplicative group, for instance SO(n) with rotation matrices.

There are likely errors in there: I tried to be as diligent as possible, but with the large amount of information compressed into a single diagram, there are bound to be errors. Please comment if you find one!  Also, I did not include lesser-known classes of matrices (such as unistochastic or co-positive matrices), matrices whose definition makes use of the ordering of row and column indices (such as triangular or tridiagonal matrices) and classes that are not closed under the transpose operation (such as left and right stochastic matrices).

A few other observations:

  • The identity matrix is everything except skew-symmetric or skew-Hermitian.
  • The top of the hierarchy are diagonalizable and invertible matrices.
  • Are doubly stochastic matrices a subset of any other class?
  • What are the name of the groups generated by multiplication of double stochastic matrices?
  • There is a one-to-one correspondence between several complex and real classes.  This is however not systematic. For instance, real symmetric matrices can be generalized to complex symmetric matrices, and also to Hermitian matrices.
  • Many classes have alternate names. For instance, Hermitian matrices are called self-adjoint, and rotation matrices are called special orthogonal.

UPDATE

  • The PDF version now contains hyperlinks to the Wikipedia articles about the matrix classes and groups.
  • Added symplectic and unimodular matrices
  • Added a legend
  • Added cc-by-sa license

The Power Law Paradox, or Why the Power Law Exponent Does Not Mean What You Think It Means

Large networks are often described as having a power-law degree distribution. In other words, the number of nodes having n neighbors is proportional to n−γ, for some number γ (this is the Greek letter Gamma).  Now, it is well-known that this is often not the case, and that many other degree distributions are easily mistaken for a power-law.  Nevertheless, some networks do have power laws, and in these networks, we might think about the meaning of the parameter γ.

Degree distribution of the CiteULike user–tag network

This is the degree distribution of the CiteULike user–tag network.  Using Aaron Clauset’s code, we can find out that it fits a power-law distribution with γ = 1.63.  The code from Aaron Clauset also returns a value of dmin = 1, which means that the power law is valid beginning with minimal degree 1 (i.e., for all nodes).

γ is the slope of the distribution, i.e., when γ is high, the number of nodes with high degree is smaller than the number of nodes with low degree.  We may thus think that a low value of γ denotes a more equal distribution, and that higher and higher values of γ denote more and more unfair degree distributions.  However, this is not the case.  In fact,  the opposite is true: A high value of γ represents a network in which the distribution of edges is fairer.

To measure the equality of the degree distribution, we can use a measure from Economics that has been used to measure the equality of a country’s income distribution. This measure is called the Gini coefficient.  The Gini coefficient can be computed by drawing the so-called Lorenz curve (shown below), and measuring the area that it traces out:

Lorenz curve (blue curve) of the Facebook friendship network

The Lorenz curve, as shown above as a blue curve, consists of all points (X,Y) such that the statement “X percent of the people on Facebook with the smallest number of friends account for Y percent of all friendship links.” For instance, the big black point which lies on the Lorenz curve allows us to state that “The 75% of Facebook users with the least number of friends account for only 25% of all friendship links.” Or, in other words, “The 25% of Facebook users with most friends account for 75% of all friendship links.” If the distribution of friendships among users is maximally equal, then all users have the same number of friends, and the Lorenz curve follows the diagonal of the plot (from the bottom left to the top right).  When the distribution is less equal, the Lorenz curve is nearer to the bottom and right of the plot.  Therefore, the area between the Lorenz curve and the diagonal is an indication of the equality of the degree distribution.  This area is maximally 0.5 (half the square), and therefore we define the Gini coefficient as twice this area, so the Gini coefficient is a measure of equality that goes from 0 (total equality) to 1 (total inequality).

Now we can finally ask:  If two networks with power law degree distributions have two different power law exponents γ, which of the two networks has a higher Gini coefficient?  To find out, we generate networks with perfect power laws, and compute their Gini coefficients. In fact, we don’t have to compute whole networks, it is enough to generate the degree distributions. The result looks like this:

Lorenz curves of perfect power law distributions, with γ ranging from 2.1 to 3.5

The result:  Power law distribution distributions with high γ have a Lorenz curve nearer to the diagonal.  Therefore, a higher value of γ means the distribution is more equal.

This result is somewhat surprising:  We would have thought that a higher slope on the degree distribution plots means that there are even less nodes with many neighbors. However, the opposite is the case.

Notes

(1) The mathematically correct name of a power law distribution is “Zeta distribution”, because the normalizing factor 1^(−γ) + 2^(−γ) + 3^(−γ) + … is exactly the Riemann zeta function of γ.

(2) In this test, γ must be strictly larger than 2, otherwise the distribution mean cannot be defined, simply because 1^−1 + 2^−1 + 3^−1 + … is not defined (see Harmonic series)

References

[1] Fairness on the Web: Alternatives to the Power Law. Jérôme Kunegis & Julia Preusse. WebSci 2012.

9 Reasons to Have an “Enemy” and “Dislike” Feature on Facebook

On Facebook, nothing is negative:  You can have friends, but not enemies.  You can like something but not dislike it.  Not satisfied with this status quo, Dean Terry from the University of Texas in Dallas has created EnemyGraph, a Facebook app that lets you add persons and things as enemies to your profile.

The idea of adding “negative links” may sound counterproductive to a social network, but in fact, there are good reasons to allow the negative “enemy” and “dislike” links in addition to the positive “friend” and “like” links:

  • It better reflects real-world relationships:  In the real world, not everything is positive.  Many relationships, both between people and things are negative. If an online social network wants to reflect the real world as best as possible, it better support enemies and dislikes.
  • Product recommendation gets better:  Facebook can recommend things you like to you, while filtering out what you will not like. Without negative feedback, Facebook will not know that you don’t like certain things or persons, and will continue to recommend them to you.
  • It makes filtering easier:  By removing things and people you don’t like from your newsstream, you Facebook wall will be more interesting to you.
  • It works. In fact, negative feedback is already implemented at a few places in Facebook. For instance, you can already remove individual persons and individual advertisements from your newsstream. Integrating these as first-class features would make it easier for users to understand what is going on.
  • Everybody does it:  Netflix, Amazon and many other websites allow you to give ratings one a rating scale from one to five stars, effectively giving a spectrum of ratings from “unlike” to “like”.
  • Test the principle of “the enemy of my enemy is my friend” in practice.
  • Allow better social network analysis:  Negative links have been used in studies on social networks.
  • Solve the spam and fake account problem. For instance on the website Slashdot.org, users can mark other users as friends and foes, which makes to possible to automatically detect trolls.
  • Let people connect by common dislikes.  In fact, many groups in Facebook are created only for people that dislike a specific item, e.g. this.

Read more:

The Pareto Principle

8.9% of bands make up 91.1% of plays on Last.fm.

9.4% of musical genres represent 90.6% of songs in Wikipedia.

9.5% of words make up 90.5% of Reuters news by word count.

14.4% of record labels represent 85.6% of all bands.

14.6% of user groups account for 75.4% of group memberships on Flickr.

16.2% of football players account for 83.8% positions in sports teams.

16.2% of songs make up 83.8% of plays on Last.fm.

17.4% of movies receive 82.6% of ratings on MovieLens.

17.7% of profiles receive 82.3% of ratings on Czech dating site Libimseti.cz.

18.8% of groups make up 81.2% of all group memberships on YouTube.

19.8% of all categories make up 80.2% of all category inclusions on the English Wikipedia.

20.3% of all users receive 79.7% of “friend” and “foe” links on Slashdot.

21.3% of Facebook users receive 78.7% of wall posts.

22.9% of users make up 77.1% of all “@” mentions on Twitter.

23.1% of projects receive 76.9% of project memberships on Github.

25.0% of Facebook users cover 75.0% of all friendship links.

26.2% of papers receive 73.8% of citations.

26.9% of persons send 73.1% of emails.

27.3% of users receive 72.7% of replies on Digg.

27.4% of actors account for 72.6% of movie credits.

29.9% of all patents receive 70.1% of all patent citations among US patents.

30.4% of all websites receive 69.6% of all hyperlinks.

Statements of this form can be read off a Lorenz curve.  A Lorenz curve plots the share of “things” covered by a given share of other “things”.  In the case of networks, the Lorenz curve shows the amount of edges covered by nodes.  For instance, the following graph shows the number of “plays” received by songs on Last.fm:

The value cited in the examples above can be read off this plot by noting where the Lorenz curve in blue crosses the diagonal, marked by the big dot and the letter P on the plot.  This gives us the phrase from above that 8.9% of bands make up 91.1% of plays on Last.fm.  When the distribution is totally uniform, the Lorenz curve follows the diagonal. The further down the Lorenz curve goes, the less “fair” the distribution of edges. Alternatively to the value P, the area between the Lorenz curve and the other diagonal multiplied by two is the Gini coefficient, marked G on the plot. The Gini coefficient is 0 when the distribution is completely uniform (i.e., “fair”), and 1 when all edge belong to one node. (which is impossible, since edges always attach to two nodes.)

All data was generated with KONECT – The Koblenz Network Collection.

PS the Pareto Principle in the narrow sense states that 20% of people own 80% of things.