I spent the last three years writing my PhD thesis. Since my thesis is about networks, I have used network datasets for research. Experimental results are more significant when done using multiple datasets and therefore I started collected network datasets. At first, I collected bipartite rating graphs such as MovieLens, Epinions and Netflix. Later I added social networks such as the Slashdot Zoo, Facebook and Flickr, hyperlink networks from Google, Wikipedia and TREC. Before my thesis was done, I noticed that I had collected **over a hundred network datasets!**

My PhD thesis is about the *spectral evolution model*, but such a large collection of network datasets is much too interesting to use it just for that. Therefore, I decided to perform some statistics with the collection of network datasets, trying to reproduce well-known results about networks. One such result is that **the density of networks is increasing over time**.

Let me explain that: A network such as Facebook consist of nodes and links. In the case of Facebook, the nodes are users and the links are friendship links. We cannot know the complete Facebook network, but a part of it is available for research. What’s great about this dataset: Not only do we know the friends of each users, we also know the date at which each friendship link was added. This allows us to look at the changes in the network over time. For instance, one important statistic in a network is the average number of links connected to each node. This number is called the network density. A well known result about networks that change over time is that their density is increasing. In other words, the average number of friends people have on Facebook gets larger over time. This is not a trivial property: When a new user account is created, the average number of friends goes down, because new users don’t have any friends.

In other words, the larger the number of nodes in a network, the larger the density. This result was shown to be valid for individual networks, for instance in Jure Leskovec‘s dissertation. Given a large collection of datasets, we may now ask the related question: Is the number of nodes in a network correlated to the density, when comparing different networks? This question can only be answered with a large collection of datasets. Using the network datasets I have available, the result looks like this:

The result is: **Network size and density do not correlate**! In other words, the growing density is a feature of individual networks, not of networks as a whole.

This kind of result is kind of straightforward taken individually, but is only made possible when using a large collection of datasets! Therefore, me and my colleagues at the Institute for Web Science and Technologies (WeST) at the University of Koblenz have decided to make the collection of datasets available in a form that makes it easy to do these kinds of studies. We will call the collection of datasets KONECT, short for *Koblenz Network Collection*. You can look at a poster of KONECT that we presented at the Web Science Conference that WeST organized in Koblenz a few weeks ago. We will also announce the KONECT web site soon—So stay tuned to this blog for more information!

NEW: All network datasets are available at KONECT – The Koblenz Network Collection

Here’s my PhD thesis:

On the Spectral Evolution of Large Networks, Jérôme Kunegis, PhD thesis, University of Koblenz–Landau, 2011.

*Edit: removed preliminary information.*

Interesting … I wonder why this is. I think we would expect that density would increase over time for many growth networks, but not all networks are growth related, so maybe there are a combination of a number of different networks. The other thing I would like to know is whether you were able to collect interconnections among users of each site (I am guessing not) – that’s a lacuna that might be providing the result you suggest.

Hi

About your first comment: If by “growth network” you mean that only new edges can be added, then yes. In fact, the evolution of the average degree is a trade-off between adding new edges (which always increase the average degree) and added new nodes (which always decrease the average degree).

About your second point: we don’t have links from one network to another in KONECT, although if we had them, we would consider the whole as a single network. But of course in the real world, all social networks are connected through the real-life persons, so any social network dataset is always only a subset of the real thing.

Hi, did you look at any citation networks? If so, how does the density in citation networks behave?

Hi Praveena

Citation networks are grouped together with hyperlink networks as “reference networks”, and shown in cyan color in the plot. You can see that the average degree never exceeds about 15. The outdegree in a citation represents the number of references in a paper, so the average outdegree is never very large. Since the average outdegree have to equal the average indegree, it means that also the average indegree is never very large. Note that the plot shows the average total degree for directed networks, so you have to divide by two to get the average outdegree or average indegree. A more recent version of the plot (with more datasets) is available here:

http://konect.uni-koblenz.de/statistics/avgdegree

Cheers

Jérôme

Dear Jérôme,

Very interesting results!

I was wondering if you can help me solve an issue with your expertise… I have to compare three networks of different size (50, 400 and 900 nodes, respectively). I would get very different network properties (for instance avg node degree)… What would you do to compare or to find relevant information (comparable) in those networks? What paper or reference should I look for?

Thank you!

By the way, I’ll keep a copy of your dissetration for further reference 🙂

Dear Laura

Very interesting question! — This is a non-trivial problem, and in fact I’ve been thinking about this a lot. The short answer is that the following statistics are *more or less* size independent:

* average degree

* clustering coefficient

Long answer: These statistics are not really size-independent, and even if they were, they would not be enough statistics to really characterize a network. Currently I’m writing a paper about how to scale the statistics of a network such that it has another size. For instance, given a network, change its size to a given size N, and scale all its other statistics accordingly. The goal is to be able to compare the networks at the same size. And also if N is small enough, it is then possible to draw the graphs and compare them visually.

What networks are you working on?

cheers

Jérôme