The Long Story of a Single Research Dataset

Let’s take a look at how a single dataset has spread from researcher to researcher over the years. It’s this network:


Analysis of the C. elegans metabolic network, from (Jeong et al. 2000)

In the KONECT project, we collect datasets that are in the form of networks, i.e., consist of interconnections between all kinds of things.  Recently, we added a network from biology: the metabolic network of Caenorhabditis elegans.  It represents information about the biochemistry of a certain roundworm, Caenorhabditis elegans, which was first discovered in soil in Algeria in 1900, and has since become one of the most-studied lifeforms (see here for why).

While preparing this dataset, I researched the history of the dataset itself, as I always do in order to have correctly labeled data. It turned out that this dataset has spread through the academic world, crossing the boundaries of multiple disciplines, and appearing in multiple formats in various studies.  Here’s my summary of the dataset’s history, as far as I can reconstruct it:

Genome Sequencing

Our story starts at the end of the 1990s. Caenorhabditis elegans is a model organism, i.e., a species used extensively for research. In fact, it was the first multi-cellular organism to have its genome sequenced. The project to sequence the full genome of C. elegans took decades, and was finalized in 1999, as summarized in this paper:

[1] R. K. Wilson. How the worm was won. The C. elegans genome sequencing project. Trends Genet., 15(2):51–8, 1999.

The resulting genome is about 97 megabases long, and contains more than 19,000 predicted protein-coding genes.

What Is There: Metabolic Reconstruction

Only one year later, in 2000, the genome of C. elegans is included in a project called What Is There, or just WIT.  Information about this project can be found in this paper:

[2] Ross Overbeek, Niels Larsen, Gordon D. Pusch, Mark D’Souza, Evgeni Selkov Jr., Nikos Kyrpides, Michael Fonstein, Natalia Maltsev, and Evgeni Selkov. WIT: Integrated system for high-throughput genome sequence analysis and metabolic reconstruction. Nucleic Acids Res., 28(1):123–125, 2000.

Unfortunately, the URLs of this project are no longer accessible.

However, we can glean information about the project from the article:

“The WIT (What Is There) system has been designed to support comparative analysis of sequenced genomes and to generate metabolic reconstructions based on chromosomal sequences and metabolic modules from the EMP/MPW family of databases. This system contains data derived from about 40 completed or nearly completed genomes.”

Network Science

Also in the year 2000, Hawoong Jeong and colleagues from Notre Dame (Indiana, USA) used the datasets from the WIT project in the following article:

[3] Hawoong Jeong, Bálint Tombor, Réka Albert, Zoltan N. Oltvai, and Albert-László Barabási. The large-scale organization of metabolic networks. Nature, 407:651–654, 2000.

To quote the paper:

“[…] We present a systematic comparative mathematical analysis of the metabolic networks of 43 organisms representing all three domains of life. We show that, despite significant variation in their individual constituents and pathways, these metabolic networks have the same topological scaling properties and show striking similarities to the inherent organization of complex non-biological systems.”

One of the 43 organisms is, of course, C. elegans.

Being an article in the then-emerging field of Network Science, this is the first version of our dataset that is actually a network. To be precise, it is a metabolic network.

Community Detection

In 2005, Jordi Duch and Alex Arenas take the datasets from the 2000 paper in order to study community detection:

[4] Jordi Duch and Alex Arenas. Community detection in complex networks using extremal optimization (arXiv version). Phys. Rev. E, 72(2):027104, 2005.

In this article, seven network datasets from different sources are used, including the metabolic network of C. elegans.  (Another of the seven datasets is the famous Zachary karate club network dataset.)

To quote the article:

“We have also analyzed the community structure of several real networks: the jazz musicians network [27], an university e-mail network [11], the C. elegans metabolic network [28], a network of users of the PGP algorithm for secure information transactions [29], and finally the relations between authors that shared a paper in cond-mat [30].”

We can only guess why these particular datasets were chosen, but the fact that they come from different domains must have played a role, as must the fact that in 2005, far fewer network datasets were easily available than is the case nowadays.

The dataset as used in the paper is made available on the website of Alex Arenas.

Discrete Mathematics and Theoretical Computer Science

In 2011, the 10th implementation challenge of the Center for Discrete Mathematics and Theoretical Computer Science (DIMACS 10) starts. In this challenge, participants must implement graph partitioning and graph clustering algorithms. After the end of the challenge, participants present their results at a DIMACS conference in 2012.  A book including all algorithm descriptions and other papers is published in 2013:

[5] David A. Bader, Henning Meyerhenke, Peter Sanders, Dorothea Wagner (eds.). Graph Partitioning and Graph Clustering. 10th DIMACS Implementation Challenge Workshop. February 13-14, 2012. Georgia Institute of Technology, Atlanta, GA. Contemporary Mathematics 588. American Mathematical Society and Center for Discrete Mathematics and Theoretical Computer Science, 2013.

In the contest, many real-world network datasets are made available for participants to use. Among them is the C. elegans metabolic network, as taken from Alex Arenas’ website.  To quote the contest website:

“The following four datasets are taken from http://deim.urv.cat/~aarenas/data/welcome.htm, with kind permission of Alex Arenas, Dept. Enginyeria Informatica i Matematiques (Computer Science & Mathematics), Universidad Rovira i Virgili.”

The full dataset is available for download from the DIMACS10 website, in the data format used by the contest.

The KONECT Project

In 2018, the dataset is incorporated in the KONECT project, where it is, as of this writing, one out of 1326 such network datasets.

[6] Jérôme Kunegis. KONECT – The Koblenz Network Collection (CiteSeerX version). In Proc. Int. Conf. on World Wide Web Companion, pages 1343–1350, 2013.

In KONECT, we have extracted the C. elegans metabolic network dataset from the DIMACS10 website.  It is now also available for download from its KONECT page, in the KONECT format, which is used for all 1326 KONECT datasets. Additionally, the KONECT page of the dataset shows the values of 33 network statistics, and 45 plots.

Conclusion

This particular dataset has thus undergone the following steps:

Genome
↓
What Is There
↓
Notre Dame
↓
Duch & Arenas
↓
DIMACS10
↓
KONECT

The dataset has gone through all these steps, being downloaded and adapted for each particular project. Even though the researchers at each step (including myself) could have taken the dataset from earlier steps, they didn’t. The reason may simply be ease of download, combined with a lack of knowledge about the full provenance of the data.

Some readers may have noted that the dataset is also included in KONECT in a form extracted directly from Alex Arenas’ website. That version has been in KONECT for several years, but in fact, the two networks are not identical: they have a different number of edges, although the number of nodes is the same. That difference is also the reason why we keep both versions in KONECT. We don’t know what exactly the DIMACS10 project has done with the data. We know that the multiple edges present in Alex Arenas’ version have been collapsed, but that alone does not explain the difference. In light of such slight differences, we can only recommend that researchers analysing networks properly cite the source of their dataset, and correctly document which transformations have been applied to it. Even though this dataset is not one of the most famous (that award probably goes to the Zachary karate club), I would guess that the length of the chain of reuse is comparable for both.
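To make this kind of comparison concrete, here is a minimal Python sketch of how two versions of a network can be checked against each other. It is only an illustration, not the procedure used by KONECT or DIMACS10. The edge lists are small made-up examples rather than the real C. elegans data, the helper names (summarize, edge_set) are hypothetical, and the sketch assumes that the network is treated as undirected.

```python
from collections import Counter


def summarize(edges):
    """Return (number of nodes, edges counted with multiplicity,
    distinct edges after collapsing multiple edges)."""
    nodes = set()
    multiplicity = Counter()
    for u, v in edges:
        nodes.update((u, v))
        # Assumption: the network is undirected, so (u, v) and (v, u)
        # are counted as the same edge.
        multiplicity[frozenset((u, v))] += 1
    return len(nodes), sum(multiplicity.values()), len(multiplicity)


def edge_set(edges):
    """Collapse an edge list into a set of undirected edges."""
    return {frozenset(edge) for edge in edges}


# Hypothetical toy edge lists standing in for two versions of one dataset.
version_a = [(1, 2), (1, 2), (2, 3), (3, 1), (3, 4)]  # contains a multiple edge
version_b = [(1, 2), (2, 3), (3, 4)]

print("version A:", summarize(version_a))
print("version B:", summarize(version_b))

# Edges that remain unexplained even after collapsing multiple edges.
only_in_a = edge_set(version_a) - edge_set(version_b)
print("only in A:", sorted(tuple(sorted(edge)) for edge in only_in_a))
```

In this toy example, both versions have the same nodes, and collapsing the multiple edge in the first version still leaves one edge that the second version lacks. Listing such leftover edges is one way to document exactly how two versions of a dataset differ.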

Disclaimer

This chain of dataset sharing does not, of course, represent the full list of ways in which the dataset was made available, let alone used. In fact, the cited papers are not necessarily the first papers to perform a specific type of analysis – this chain of papers only represents how the dataset was shared, from the original dataset to KONECT. Note that the dataset was also shared in other ways that did not end up in KONECT. For instance, the dataset was used by Petter Holme and colleagues (in collaboration with Hawoong Jeong) for their 2003 article on detecting subnetworks, which used the version from the WIT project. Also, the dataset is available at the ICON project, and was extracted directly from Alex Arenas’ website there. (Note: we can’t seem to get a direct link to that particular dataset in ICON – it can be found easily by searching for it.)

This text is likely to contain errors and omissions – please comment below, and I will correct them.  I remember that Albert-László Barabási had a separate page which may have contained the data too, but I can’t find it right now.  Maybe it was at https://www3.nd.edu/~networks, which is not reachable now?

EDIT 2018-04-18:  Added information via Petter Holme to the disclaimer, mentioning his 2003 paper.

EDIT 2018-04-20:  Fixed typos; change a formulation to make clear article [1] is a summary of the work, not the work itself; use quotation marks consistently.

EDIT 2018-05-08: Fixed typos; replaced “different” by “the same” in the description of the other dataset which is also in KONECT (Thanks Thomas)

EDIT 2018-06-11: Fixed year number in first sentence of section “Community Detection” (“2005 paper” → “2000 paper”)
