Network of Characters in Harry Potter

Here is another network.

This time, it’s the characters in the book series Harry Potter.

The nodes in this network are individual characters, and the edges are their relationships.  There are many different node types, to represent the different types of persons and animals, as well as various relationship types.  The complete list is given in the legend, and is pretty self-explanatory.

Since the universe of Harry Potter is gigantic, I have only included information from the seven books.  Anything else, even if it is canon, is not shown.  I know that there’s much more information available about the ancestors of these characters, and in some cases their children, but none of that is included.  Note that the graph is already quite large and complex, and some edges go all over the place.  Some people are also difficult to find, because their placement may not be obvious.  Since many characters have relationships with many other characters, there is no perfect layout for all of this, and it is impossible to keep all groups of people together as one would like.  For instance, it would be nice to have all members of the Order of the Phoenix in one block, each family in one block, and all Hogwarts teachers in one block, but those groups overlap, and I had to make choices.  I hope the result is a good compromise.

Anyway, the network is also available as a PDF file, in which a text search can be performed.

To make the chart compact, I have omitted any titles like “Professor” and “Mr” in the names of people; you don’t have to notify me that the complete name of Nearly Headless Nick includes a “Sir” 🙂

I’m also aware that the overall layout is not perfect, but with a graph of this size, you have to make some tradeoffs, so I hope it’s still a very useful representation.

Another note:  I know that there are differences between the books and the movies; this network is about the books.

As always, I’m happy about corrections, so please write about any error you see.  (But keep in mind:  I’m not adding info that’s only available outside the seven books.)

The actual PDF file has exactly double the size of A0.  Apparently, there is no standardised name for this format (A−1?), but you could in principle print it as A0.  In the PDF, each text box is 1 cm high.

Finally:  This network contains spoilers for the Harry Potter series.  Also, if you’re not familiar with Harry Potter, then you’ll probably find this all boring . . . .

[Figure: the Harry Potter character network]

PDF Version (searchable, 1.7 MB)


Phylogenetic Tree of My Houseplants

Here’s a graph that is a little bit different from the usual: a tree.  It shows all my houseplants, arranged by their taxonomic relationships.

A tree is a graph that does not contain cycles.  As such, trees lack some of the complex structure that characterizes other complex networks.  However, we can still make out some interesting “complex” properties, for instance the fact that a few plant families have a large number of specimens in them (Cactaceae, Crassulaceae, Orchidaceae), while others have just a single member.

Each number in the graph (for instance [209]) refers to one plant.  In some cases, one number refers to multiple plants that are genetically identical, i.e., that I propagated from cuttings or similar.

All names shown are the taxonomically correct scientific names, for instance Euphorbia milii is what is sometimes called “crown of thorns” in English.

The names ending in “-aceae” are families (the grouping above the genus), and the names ending in “-ales” are orders (the grouping above the family).  Above orders, groupings use the usual scientific names “eudicots”, “monocots”, and “magnoliids”, which are used mostly in their English form even in the scientific literature.

All my house plants are in the group “Angiosperms”, i.e., they are flowering plants.  I used to have a fern, but not anymore, and conifers like Araucaria can easily be kept as house plants too; both groups lie outside of the Angiosperms.

I have resisted the temptation to add my house cat to the diagram, which of course would be a sister group to everything mentioned above.

All pictures are of my own plants, and were taken by myself.

[Figure: the phylogenetic tree of my houseplants]

Game Dependency Graph: The Secret of Monkey Island

[Image: Guybrush Threepwood]

This is something I created a little more than a year ago, and only got around to publishing now:  a game dependency graph.

The Secret of Monkey Island is a computer game from 1990 which I loved to play back in the day.  It is an adventure game, played with the mouse:  You play as a character named Guybrush Threepwood (shown on the right), and can walk around many different areas, interact with objects, people, and other things.  This is not an action game; you don’t need to perform actions at subsecond precision to win, but instead you must find out how to proceed in the story by performing the correct actions, such as using objects with other objects, talking to people, pushing things around, etc. Essentially, the whole game can be seen as a set of actions that must be performed.

The actions in this game cannot be performed in any random order, but are interrelated.  For instance, you need to pick up an object before you can use it.  Thus, we can model the whole game as a directed graph, with the actions being the nodes of the graph, and the dependencies being directed edges.  It is logically impossible for this dependency graph to have directed cycles, and thus this graph is a DAG, a directed acyclic graph.  In fact, this type of graph has already been considered, in particular under the name puzzle dependency chart; see this article by the creator of Monkey Island himself, Ron Gilbert.
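To make the modelling concrete, here is a minimal sketch of such a dependency DAG in Python with the networkx library.  It is purely illustrative, using a handful of Part I actions discussed further down; it is not the tooling used to produce the chart below.

import networkx as nx

# Toy dependency DAG: an edge (a, b) means action a must happen before action b.
# The action names are illustrative, not the exact labels used in the chart.
G = nx.DiGraph()
G.add_edge("get pieces of eight", "buy the sword")
G.add_edge("buy the sword", "train with Captain Smirk")
G.add_edge("get pieces of eight", "train with Captain Smirk")  # redundant transitive edge (see below)
G.add_edge("buy the map", "enter the forest")

# A puzzle dependency graph must not contain directed cycles.
assert nx.is_directed_acyclic_graph(G)

# Any topological order is a valid order in which the actions can be performed.
print(list(nx.topological_sort(G)))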

The graph I created has many additional features in addition to just being a DAG – see below for an explanation.  The picture itself also includes a legend at the bottom.

SPOILER ALERT:  This post contains spoilers for pretty much all puzzles in The Secret of Monkey Island.

NEW:  PDF version of the graph (supports text search)

[Figure: the full dependency graph of The Secret of Monkey Island]

Special Graph Features

The graph has some extra features that put it beyond “just a DAG”, which I’ll list in the following.

First of all, a note about the direction of edges:  it is not indicated by little pointed arrows in the picture; instead I follow the convention that all edges point down.  This is possible because the graph is acyclic.

Multiple Types of Actions

There are different types of actions that can be performed:  pick up, talk to, push, etc.  Each of these is represented by a differently colored node.

  • Rarely used commands are all shown in blue – remember that this is the first Monkey Island game, which had many more commands than later ones.  Commands like “push” and “pull” are used only a few times in the game (but sometimes in a clever way, for instance to “rotate” the safe handle and the balanced stones).
  • Commands like “open” and “go to” are only shown if they have a game-relevant consequence.  I don’t include all doors you have to open just to go through the game.
  • I also don’t include all the people you can talk to.  Almost all of these dialogues are optional, and would just be isolated nodes in the graph.  I only make an exception for really, really important dialogues, such as with the Voodoo Lady.
  • There is a single way to die in the game, which is represented as a black node, and obviously cannot have any outgoing edges.  It is located in the underwater scene in Part I.  (Note that Monkey Island avoids death as a general rule – the inclusion of this is more of an easter egg than actual gameplay, in particular because it will only happen if the player fails to exit the underwater scene within ten minutes – and exiting isn’t too difficult to do.)
  • There is also a way to get stuck in the game, by spending all pieces of eight on the broken grog machine.  This too is indicated by a black node.

Multiple Solutions

Some puzzles in this game have multiple solutions.  This is reflected in the graph by dashed lines that converge on a little circle.  This happens only a few times:

  • In order to enter the forest in Part I, one of two things must be done:  (1) buy the map, or (2) follow the storekeeper to the forest.  Once the forest can be entered, both the treasure and the Sword Master can be found without further help, but one of those two actions must be performed, because otherwise Guybrush refuses to enter the forest.
  • The option to ask the storekeeper to go to the Sword Master (see previous point) can be unlocked in two ways:  (1) by talking to the important-looking pirates, or (2) by buying the sword from the storekeeper.
  • One of four different objects made out of cloth or paper can be used in the recipe puzzle in Part II:  the business card, the map (which is optional), or one of the two T-shirts.  The player can also put all of them into the pot in order to make the inventory shorter, since you don’t need them again.
  • The dam in Part III can be burst in four different ways:  (1) using the lens with the sun, (2) using the spyglass with the rock, (3) using the cannonball with the rock, and (4) using the magnetic compass with the rock.  This also means that getting the spyglass from the fort is completely optional, even though it can be used in two different ways to burst the dam.

“Action-Like” Puzzles

Some “action-like” puzzles in the game do not fit into the “dependency graph” approach, and are represented as a single node, even though the player must perform a little bit more than a single action.  These cases are shown in red in the graph.  They include: learning and applying swordfighting insults, navigating through the forest maze, following the storekeeper to the Sword Master (which is optional), carrying the grog to the prison, haggling with Stan, and tricking the storekeeper in order to steal the note of credit.

(Since the graph doesn’t say how to solve these puzzles, they are the only part of the game not spoiled by this post.)

Areas

The game is divided into what I call areas, which are separated by dashed lines in the graph.  What I call an “area” is a set of rooms between which the player can navigate freely; thus the player is always in a single area at any particular time, and needs to perform specific actions to enter another area.  In the beginning of the game, areas correspond to individual parts of the game, e.g., when the player can finally leave Mêlée Island™, Part II starts, and the player is now on the ship.  Each area can be seen as its own game with its own dependency graph.  There are also what I call “subareas” in the game, which occur when you enter an area (usually a single room) that, when exited, leads you back to the previous area – these subareas are shown as dashed rectangles.  There are two of them: the seafloor scene in Part I, and being captured by the cannibals in Part III.  The ending of the game can be seen to consist of three consecutive small areas.

Network Analysis

We can now finally perform some network analysis.  Of course, a proper network analysis would be more interesting if we had multiple games to analyse, so I’ll restrict myself to things that can be gleaned “by hand”, i.e., without using a computer analysis.

Things You Can Do Immediately

Nodes that don’t have any incoming edges (i.e., no edge coming into them from above) are things you can do immediately when you enter an area.  You can think of there being implicit edges from the action that starts the area to all such nodes, but we don’t draw them because they don’t add any information, and would just clutter the graph.
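Although the analysis here was done by hand, such nodes could be found mechanically; continuing the illustrative networkx sketch from above (not the actual data behind the chart):

# Actions available immediately upon entering an area: nodes with no incoming edges.
immediately_available = [n for n in G.nodes if G.in_degree(n) == 0]
print(immediately_available)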

Optional actions

Nodes in the graph that are not connected downward to the end of the game are optional, i.e., you don’t need to do them to win the game.

This includes talking to the important-looking pirates, talking to the sheriff in the alley, using a breath mint, putting money in the grog machine, getting the PTA minutes from the three pirates, playing with the rat, talking to the Voodoo lady, following the storekeeper to the swordmaster, opening the chest in the bedroom on the ship, picking up and reading all memos on Monkey Island, walking to the edge on the top of Monkey Island and falling down, and destroying the ship.

Of course, many of these are integral to the game’s storytelling, so you should do them – after all, exploring everything is what these kinds of games are all about.  (For destroying the ship, you get two slightly different endings depending on whether you did it or not.)
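As a side note, optional actions could also be identified mechanically: they are the nodes from which the final action of the game cannot be reached.  Continuing the illustrative networkx sketch from above (the node name “end of game” is a made-up stand-in for the game’s last action):

# Optional actions: nodes with no directed path to the final action of the game.
end = "end of game"
G.add_edge("train with Captain Smirk", end)   # toy stand-in for the rest of the game
required = nx.ancestors(G, end) | {end}       # every node with a directed path to the end
optional = set(G.nodes) - required            # in this toy graph: {"buy the map", "enter the forest"}
print(optional)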

Redundant Transitive Edges

The graph is not a transitive reduction:  Certain dependency edges could be removed without changing the situation.

For instance in Part I, you need the pieces of eight to buy the sword, and you need the sword to train with Captain Smirk.  Thus, we can infer that you must have the pieces of eight when you train with Captain Smirk, and could thus omit the link from the pieces of eight to Captain Smirk.

In this case, we include such links because of the storytelling:  You need to give pieces of eight to Captain Smirk, and thus they are a logical dependency, even though it is impossible to not have the pieces of eight if you already have the sword.
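Such redundant edges could also be found mechanically.  Here is a hedged sketch using networkx’s transitive reduction on the toy graph from above (the real chart keeps these edges on purpose, as just explained):

# Edges of G that are absent from its transitive reduction are redundant:
# they are already implied by longer paths.
reduction = nx.transitive_reduction(G)
redundant = set(G.edges) - set(reduction.edges)
print(redundant)   # in the toy graph: {("get pieces of eight", "train with Captain Smirk")}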

Linear vs Parallel Gameplay

You can see in the graph that the game has distinctly different phases, purely based on the structure of the dependency graph:

The beginning is very dense with many possible parallel paths – as you start to explore the first area (Mêlée Island), you can do many things in parallel.

The end of the game, on the other hand, is much more linear:  The thirteen last actions (starting at “use jug o’grog with dish” in the pirate ship) must all be performed in the given order.
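Such a forced sequence shows up in the graph as a long chain.  A related quantity that could be computed mechanically is the longest directed path; in the illustrative networkx sketch from above (the count of thirteen was made by hand, this is only a sketch of the idea):

# All actions on the longest directed path must be performed in exactly this
# relative order, which gives a rough measure of how linear the game gets.
longest_chain = nx.dag_longest_path(G)
print(len(longest_chain), longest_chain)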

Diversity of Puzzles

Clearly, the puzzles in the beginning of the game are much more varied than those at the end.  The last “action-like” or “rarely used command” puzzle is the navigation through the underground maze in Part III – after that, it’s just walking, talking, picking up objects, and using objects until the end of the game.  The last action in the game, however, is at least timed, so there’s that.

Compare that to the start of the game, which includes all possible commands in some way, along with going through a maze, following a non-player character, almost dying underwater, and the whole insult swordfighting section.

(This is subjective of course, and I’d like to see how other games compare.)

Size of the Game

We can also use the graph to measure the “size” of the game.  There are 154 nodes in the graph (if I counted correctly), but since I haven’t done this for any other game, I can’t compare that number.

(Of course, this does not take into account the “action-like” sequences, which are much more present in this game than in other games of the genre.)

End Notes

If you haven’t played this game:  It is supported by ScummVM.  (There is also a remake which I haven’t tried, but you’d better be playing the original version, with those lovely pixely graphics.)

I also want to do this type of analysis for other games.  Candidates are:  Monkey Island 2 and 3, Day of the Tentacle, Flight of the Amazon Queen, The Dig, Full Throttle, Sam and Max, Simon the Sorcerer, and maybe something like Broken Sword.  Oh, and I would love to do Thimbleweed Park too (haven’t played it yet!).  (If I have time of course – this is not the type of analysis that they give you gigantic grants for, as fun as it is.)

If there are things that should be corrected in the graph, please drop a comment.

Fun fact:  the original Amiga version of the game came on four diskettes, and is only 18.5 times larger than the full-size image above (if my calculations are correct).

EDIT 2019-01-18:  Added one more optional action that I missed

EDIT 2019-01-18:  Added the death node in the underwater scene

EDIT 2019-01-18:  Added PDF version

EDIT 2019-01-19:  Added a link to Ron Gilbert’s page on puzzle dependency charts

EDIT 2019-01-21:  Divided a dashed box into “mansion”/”underwater” scenes; small adjustments in the chart (v3)

EDIT 2019-01-21:  Added note that not all dialogues are included in the graph

EDIT 2019-01-21:  Added getting stuck by spending all money on the grog machine; added “enter forest” as a node which can be achieved by either following the storekeeper, or by buying the map; made clear that the poisoned meat can also be produced via the pot o’ stew; rearranged nodes

EDIT 2019-01-22:  Made the legend better (v6)

EDIT 2019-01-23:  Added bullet point about the two ways to enter the forest in Part I

EDIT 2019-02-03:  For following the storekeeper to the Sword Master:  the puzzle has multiple solutions (Thanks LogicDeLuxe from the Thimbleweed Park forums)

EDIT 2019-03-11:  Fixed a typo


The Long Story of a Single Research Dataset

Let’s take a look at how a single dataset has spread from researcher to researcher over the years. It’s this network:

[Figure: Analysis of the C. elegans metabolic network, from (Jeong et al. 2000)]

In the KONECT project, we collect datasets that are in the form of networks, i.e., that consist of interconnections between all kinds of things.  Recently, we added a network from biology: the metabolic network of Caenorhabditis elegans.  It represents information about the biochemistry of that roundworm, which was first discovered in soil in Algeria in 1900, and has since become one of the most-studied lifeforms (see here for why).

While preparing this dataset, I researched the history of the dataset itself, as I always do in order to have correctly labeled data. It turned out that this dataset has spread through the academic world, crossing the boundaries of multiple disciplines, and appearing in multiple formats in various studies.  Here’s my summary of the dataset’s history, as far as I can reconstruct it:

Genome Sequencing

Our story starts at the end of the 1990s. Caenorhabditis elegans is a model organism, i.e., a species used extensively for research.  In fact, it was the first multi-cellular organism to have its genome sequenced.  The project to sequence the full genome of C. elegans took more than a decade, and was finalized in 1999, as summarized in this paper:

[1] R. K. Wilson. How the worm was won. The C. elegans genome sequencing project. Trends Genet., 15(2):51–8, 1999.

The resulting genome is about 97 megabases large (i.e., about 97 million base pairs), and encodes more than 19,000 proteins.

What Is There: Metabolic Reconstruction

Only one year later, in 2000, the genome of C. elegans is included in a project called What Is There, or just WIT.  Information about this project can be found in this paper:

[2] Ross Overbeek, Niels Larsen, Gordon D. Pusch, Mark D’Souza, Evgeni Selkov Jr., Nikos  Kyrpides, Michael Fonstein, Natalia Maltsev, and Evgeni Selkov. WIT: Integrated system for high-throughput genome sequence analysis and metabolic reconstruction. Nucleic Acids Res., 28(1):123–125, 2000.

Unfortunately, the URLs of this project are not accessible anymore.

However, we can glean information about the project from the article:

“The WIT (What Is There) system has been designed to support comparative analysis of sequenced genomes and to generate metabolic reconstructions based on chromosomal sequences and metabolic modules from the EMP/MPW family of databases. This system contains data derived from about 40 completed or nearly completed genomes.”

Network Science

Also in the year 2000, Hawoong Jeong and colleagues from Notre Dame (Indiana, USA) used the datasets from the WIT project in the following article:

[3] Hawoong Jeong, Bálint Tombor, Réka Albert, Zoltan N. Oltvai, and Albert-László Barabási. The large-scale organization of metabolic networks. Nature, 407:651–654, 2000.

To quote the paper:

“[…] We present a systematic comparative mathematical analysis of the metabolic networks of 43 organisms representing all three domains of life. We show that, despite significant variation in their individual constituents and pathways, these metabolic networks have the same topological scaling properties and show striking similarities to the inherent organization of complex non-biological systems.”

One of the 43 organisms is, of course, C. elegans.

Being an article in the then-emerging field of Network Science, this is the first version of our dataset that is actually a network.  To be precise, it is a metabolic network.

Community Detection

In 2005, Jordi Duch and Alex Arenas take the datasets from the 2000 paper in order to study community detection:

[4] Jordi Duch and Alex Arenas. Community detection in complex networks using extremal optimization (arXiv version). Phys. Rev. E, 72(2):027104, 2005.

In this article, seven network datasets from different sources are used, including the metabolic network of C. elegans.  (Another of the seven datasets is the famous Zachary karate club network dataset.)

To quote the article:

“We have also analyzed the community structure of several real networks: the jazz musicians network [27], an university e-mail network [11], the C. elegans metabolic network [28], a network of users of the PGP algorithm for secure information transactions [29], and finally the relations between authors that shared a paper in cond-mat [30].”

We can only guess why these particular datasets were chosen, but the fact that they are from different domains must have played a role, as well as the fact that in 2005, not many network datasets were easily available – unlike nowadays.

The dataset as used in the paper is made available on the website of Alex Arenas.

Discrete Mathematics and Theoretical Computer Science

In 2011, the 10th implementation challenge of the Center for Discrete Mathematics and Theoretical Computer Science (DIMACS 10) starts. In this challenge, participants must implement graph partitioning and graph clustering algorithms. After the end of the challenge, participants present their results at a DIMACS conference in 2012.  A book including all algorithm descriptions and other papers is published in 2013:

[5] David A. Bader, Henning Meyerhenke, Peter Sanders, Dorothea Wagner (eds.). Graph Partitioning and Graph Clustering. 10th DIMACS Implementation Challenge Workshop. February 13-14, 2012. Georgia Institute of Technology, Atlanta, GA. Contemporary Mathematics 588. American Mathematical Society and Center for Discrete Mathematics and Theoretical Computer Science, 2013.

In the contest, many real-world network datasets are made available for participants to use. Among them is the C. elegans metabolic network, as taken from Alex Arenas’ website.  To quote the contest website:

“The following four datasets are taken from http://deim.urv.cat/~aarenas/data/welcome.htm, with kind permission of Alex Arenas, Dept. Enginyeria Informatica i Matematiques (Computer Science & Mathematics), Universidad Rovira i Virgili.”

The full dataset is available for download from the DIMACS10 website, in the data format used by the contest.

The KONECT Project

In 2018, the dataset is incorporated in the KONECT project, where it is, as of this writing, one out of 1326 such network datasets.

[6] Jérôme Kunegis. KONECT – The Koblenz Network Collection (CiteSeerX version). In Proc. Int. Conf. on World Wide Web Companion, pages 1343–1350, 2013.

In KONECT, we have extracted the C. elegans metabolic network dataset from the DIMACS10 website.  It is now also available for download from its KONECT page, in the KONECT format, which is used for all 1326 KONECT datasets. Additionally, the KONECT page of the dataset shows the values of 33 network statistics, and 45 plots.

Conclusion

This particular dataset has thus undergone the following steps:

Genome
↓
What Is There
↓
Notre Dame
↓
Duch & Arenas
↓
DIMACS10
↓
KONECT

The dataset has gone through all these steps, being downloaded and adapted for each particular project. Even though the researchers at each step (including myself) could have taken the dataset from earlier steps in the chain, they didn’t. The reason may simply be the ease of downloading the data, and a lack of knowledge about the full provenance of the data.

Some readers may have noted that the dataset is also included in KONECT in a form extracted directly from Alex Arenas’ website. This has been in KONECT for several years, but in fact, the two networks are not identical:  they have a different number of edges, although the number of nodes is the same. That difference is also the reason why we keep both versions in KONECT. We don’t know what exactly the DIMACS10 project has done with the data. We know that the multiple edges present in Alex Arenas’ version have been collapsed, but that still does not explain the difference. In light of such slight differences, we can only recommend that researchers analysing networks properly cite the source of their dataset, and correctly document which transformations have been applied to it. Even though this dataset is not one of the most famous (that award probably goes to the Zachary karate club), I would guess that the length of the chain of reuse is comparable for both.
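As an illustration of the kind of check involved, here is a hedged sketch (the file name and format are hypothetical, not the actual KONECT pipeline) of how one could test whether collapsing multiple edges alone explains a difference in edge counts between two versions of a network:

import networkx as nx

# Load a version of the network that still contains multiple (parallel) edges.
# The file name is made up for illustration.
multi = nx.read_edgelist("celegans_metabolic_arenas.edges", create_using=nx.MultiGraph)

# Collapsing parallel edges yields a simple graph.
simple = nx.Graph(multi)

print(multi.number_of_edges(), simple.number_of_edges())
# If the edge count of the other version matches neither number, then
# collapsing multiple edges cannot be the whole explanation.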

Disclaimer

This chain of dataset sharing does of course not represent the full list of ways in which the dataset was made available, let alone used. In fact, the cited papers are not necessarily the first papers to perform a specific type of analysis – this chain of papers only represents how the dataset was shared, from the original dataset to KONECT. Note that the dataset was also shared in other ways that did not end up in KONECT. For instance, the dataset was used by Petter Holme and colleagues (in collaboration with Hawoong Jeong) for their 2003 article on detecting subnetworks, which used the version from the WIT project. Also, the dataset is available at the ICON project, where it was extracted directly from Alex Arenas’ website.  (Note: we can’t seem to get a direct link to that particular dataset in ICON – it can be found easily by searching for it.)

This text is likely to contain errors and omissions – please comment below, and I will correct them.  I remember that Albert-László Barabási had a separate page which may have contained the data too, but I can’t find it right now.  Maybe it was at https://www3.nd.edu/~networks, which is not reachable now?

EDIT 2018-04-18:  Added information via Petter Holme to the disclaimer, mentioning his 2003 paper.

EDIT 2018-04-20:  Fixed typos; changed a formulation to make clear article [1] is a summary of the work, not the work itself; used quotation marks consistently.

EDIT 2018-05-08: Fixed typos; replaced “different” by “the same” in the description of the other dataset which is also in KONECT (Thanks Thomas)

EDIT 2018-06-11: Fixed year number in first sentence of section “Community Detection” (“2005 paper” → “2000 paper”)

My Shell Coding Conventions

When writing shell scripts, it is worth paying attention to details.

Here are the conventions I use when writing shell scripts.  These rules are taken from the coding conventions of the Stu project, and are further maintained there.  I have omitted points that are specific to Stu.

Here’s the list:

  • Only use POSIX features of the shell and of standard tools. Read the manpage of a tool on the POSIX website,  rather than your locally installed manpage, which is likely to describe extensions to the tool without documenting them as such.
  • Sed has only basic regular expressions. In particular, no |, +, and no escape sequences like \s/\S/\b/\w. The -r and -E options are not POSIX; they are GNU and/or BSD extensions. A space can be written as [[:space:]]. + can be emulated with \{1,\}. There is no way to write alternatives, i.e., the | operator of extended regular expressions. (Note: there are rumors that -E is going to be in a future POSIX standard; I’ll switch to -E once it’s standardized.) Also, the b and t commands must not be followed by a semicolon; always use a newline after them when there are more commands.
  • Grep does have the -E option for extended regular expressions. Grep is always invoked with the -E or the -F option. (Using basic regular expressions with Grep is also portable, but I don’t do it.)
  • date +%s is not mandated by POSIX, even though it works on all operating systems I tried it on. It outputs the correct Unix time. There is a hackish but portable workaround using the random number generator of awk(1).
  • Shell scripts don’t have a filename suffix. Use #! /bin/sh and set the executable bit. The space after ! is not necessary, but is traditional and I think it looks good, so I always use it.
  • test does not support the -a option. Use && in the shell instead. POSIX has deprecated test -a.
  • The “recursive” option to programs such as ls(1) and cp(1) is -R and not -r. -r for recursive behavior is available and prescribed by POSIX for some commands such as rm(1), but it’s easier to always use -R. Mnemonic: Output will be big, hence a big letter. The idiomatic rm -rf is thus not recommended, and rm -Rf is used instead.
  • I use set -e. I don’t rely on it for “normal” code paths, but as an additional fail-safe. It does not work with $() commands that are used within other commands, and it also does not work with pipelines, except for the last command. When you use $(), assign its result to a variable, and then use the result from that variable. Using $() directly embedded in other commands will make the shell ignore failure of the inner shell. There is no good solution to the “first part of pipeline fails silently” problem.
  • Use $(...) instead of `...`. It avoids many quoting pitfalls. In shell syntax, backticks are a form of quote which also executes its content. Thus, characters such as other backticks and backslashes inside it must be quoted by a backslash, leading to ugly code. $(...) does not have this problem. Also, in Unicode ` is a standalone grave accent character, and thus a letter-like character. This is also the reason why ` doesn’t need to be quoted in Stu, like any other letter. The same goes for ^ and the non-ASCII ´.
  • Double-quote all variables, except if they should expand to multiple words in a command line, or when assigning to a variable. Also, use -- when passing variables as non-option arguments to programs. E.g., write cp -- "$filename_from" "$filename_to". All other invocation styles are unsafe under some values of the variables. Some programs such as printf don’t have options and thus don’t support or need --.
  • Always use IFS= read -r instead of read. It’s the safe way to read anything that’s \n-delimited.
  • To omit the trailing newline with echo, both -n and \c are non-portable. Instead, use printf. In general, printf is an underused command. It can often be used judiciously instead of echo. Note that the format string of printf should not start with a dash. In such cases, use %s and a subsequent string, which can start with a dash.


Google Is the Largest ‘Web Tracker’, but Far from the Only One

We performed a study about web tracking.

[Figure: clusters of web trackers that are frequently embedded together]

Web tracking refers to the very widespread practice of websites embedding so-called web trackers on their pages.  These have various purposes, such as optimizing the loading of images, or embedding additional services on a site.  The best known examples are the “Facebook buttons” and “Twitter buttons” found on many sites, but the most widespread ones are by far those from Google, which in fact operates a multitude of such “tracking services”.  There are however many more – what they all have in common is that they allow the company that runs them to track precisely which sites have been visited by each user.

Web trackers have been well-known for quite some time, but until now no web-scale study had been performed to measure their extent.  Thus, we performed one, analysing 200 terabytes of data covering 3.5 billion web pages.

The key insights are:

  • 90% of all websites contain trackers.
  • Google tracks 24.8% of all domains on the internet, globally.
  • When taking into account that not all web sites get visited equally often, Google’s reach is even higher:  We estimated that 50.7% of all visited pages on the web include trackers by Google.  (Ironically, we estimated this by using PageRank, a measure initially associated with Google itself.)
  • The top three tracking systems deployed on the web are all operated by Google, and use the domains google-analytics.com, google.com, and googleapis.com.
  • The top three companies that have trackers on the web are Google, Facebook, and Twitter, in that order.
  • These big companies are far from the only companies that track:  there is a long tail of trackers on the web – 50% of the tracking services are embedded on fewer than ten thousand domains, while tracking services in the top 1% of the distribution are integrated into more than a million domains.
  • Google, Twitter and Facebook are the dominant tracking companies in almost all countries in the world.  Exceptions are Russia and China, in which local companies take the top rank. These are Yandex and CNZZ, respectively. Even in Iran, Google is the most deployed tracker.
  • Websites about topics that are particularly privacy-sensitive are less likely to contain trackers than other websites, but still, the majority of such sites do contain trackers.  For instance, 60% of all forums and other sites about mental health, addiction, sexuality, and gender identity contain trackers, compared to 90% overall.
  • Many sites contain more than one tracker.  In fact, multiple trackers are so common that we were able to determine clusters of trackers that are often used together by individual sites – these allow us to automatically detect different types of trackers such as advertising trackers, counters and sharing widgets.  (See the picture above.)
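The PageRank-weighted estimate mentioned in the list above can be illustrated with a minimal sketch (made-up data and names, not the study’s actual pipeline): instead of counting each domain once, each domain is weighted by its PageRank in the hyperlink graph, so that frequently visited domains count for more.

import networkx as nx

# Hypothetical hyperlink graph between domains, and a hypothetical set of
# domains observed to embed a Google tracker.
domain_graph = nx.DiGraph([
    ("a.example", "b.example"),
    ("b.example", "c.example"),
    ("c.example", "a.example"),
    ("c.example", "b.example"),
])
with_google_tracker = {"b.example", "c.example"}

pagerank = nx.pagerank(domain_graph)          # PageRank values sum to 1
unweighted_reach = len(with_google_tracker) / domain_graph.number_of_nodes()
weighted_reach = sum(pagerank[d] for d in with_google_tracker)

print(unweighted_reach, weighted_reach)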

Not all trackers have the explicit purpose of tracking people: Many types of systems perform useful services in which tracking is a side effect.  Examples are caching images, optimizing load times, enhancing the usability of a site, etc. For many of these systems, webmasters may not be aware that they allow tracking.

Note:  The study was performed using data from 2012 by the Common Crawl project.  Because their crawling strategy has changed since then, newer data in fact represents a smaller fraction of the whole web.

The study was performed by my colleague Sebastian Schelter from the Database Systems and Information Management Group of the Technical University of Berlin, and myself.

The article is published in the Journal of Web Science, and is available as open access:

  1. Sebastian Schelter and Jérôme Kunegis (2018), “On the Ubiquity of Web Tracking: Insights from a Billion-Page Web Crawl”, The Journal of Web Science: Vol. 4: No. 4, pp 53–66.

The full dataset is available online on Sebastian’s website, and also via the KONECT project.


The Build System Stu in Practice

I haven’t blogged in a while about the Stu build system.  Stu is a piece of software written by myself at the University of Namur that we use to run the calculations of network statistics in the KONECT project, build LaTeX documents, and do all other sorts of data manipulation, and more.  The Stu build system is a domain-specific programming language used to build things.  This is similar to the well-known Make system, but can do much more.  In the KONECT project, we use it in ways that are impossible with Make.  In this post, I will show a very short snippet of a Stu script, as it is used in KONECT, and will explain what it does line by line.

Here is the snippet:

#
# Diadens
#

@diadens: @diadens.[dat/NETWORKS_TIME];

@diadens.$network:  plot/diadens.a.$network.eps;

plot/diadens.a.$network.eps: 
        m/diadens.m 
        $[ -t MATLABPATH]
        dat/stepsi.$network 
        dat/statistic_time.full.(diameter avgdegree).$network
{
        ./matlab m/diadens.m 
}

This code is written in Stu, and is used as shown here in KONECT.  It is part of the (much larger) file main.stu in the KONECT-Analysis project. I have left in the comments as they appear in the original.  In fact, comments in Stu start with #, and you can probably recognize that the first three lines are a comment:

#
# Diadens
#

The name “Diadens” refers to diameter-density; we’ll explain later what that means.

The comment syntax in Stu is the same as in many other scripting languages such as the Shell, Perl, Python, etc.  Until now, there’s nothing special.  Let’s continue:

@diadens: @diadens.[dat/NETWORKS_TIME];

Now we’re talking.  In order to explain this line, I must first explain the overall structure of a Stu script:  a Stu script consists of individual rules.  This one line represents a rule in itself.  Of course, rules may be much longer (as we will see later), but this rule is just a single line.  As a general rule, Stu does not care about whitespace (except for a few cases which we’ll see later).  This means that lines can be split anywhere, and a rule can consist of anything from a single line to many hundreds of lines, even though in practice almost all rules are not longer than a dozen lines or so.

The overall way in which Stu works is similar to that of Make.  If you know Make, the following explanation will appear trivial to you.  But if you don’t, the explanation is not complex either:  each rule tells Stu how to build one or more things.  Rules are the basic building blocks of a Stu script, just like functions are the basic building blocks of a program in C, or classes are the basic building blocks in Java.  The order of rules in a Stu script does not matter.  This is similar to how in C, it does not matter in which order functions are given, except for the detail that in C, functions have to be declared before they are used.  In Stu, there is no separate “declaration” of rules, and therefore the ordering is truly irrelevant.  The only exception is that the first rule in a complete Stu script determines what Stu builds by default, like the main() function in C.  This is however not relevant to our snippet, as it is taken from the middle of the Stu script.  Let us look at the last rule definition again:

@diadens: @diadens.[dat/NETWORKS_TIME];

Here, the first @diadens is the target of the rule, i.e., it declares what this rule is able to build.  Most rules build a file, and one can just write a filename as a target, which means that the goal of the rule is to build the given file.  In this case however, the target begins with @, and that means that the target is a so-called transient target.  A transient target is just Stu’s way of saying that no file is built by the rule, but we still need to do something to build the rule.  In this case, everything after the colon : are the dependencies of the rule.  The meaning is the following:  in order to build the target @diadens, Stu must first build @diadens.[dat/NETWORKS_TIME].  The semicolon ; then indicates the end of the rule.  Instead of a semicolon, it is also possible to specify a command that Stu must execute, by enclosing it in braces { }.  In this rule, there is no command, and therefore a semicolon is used.

At this point, I should give a short explanation about what this example is supposed to do:  the name diadens stands for “diameter and density”.  The goal of this code is to draw what we may call “diameter and density” plots.  These are plots drawn by KONECT which show, for one given dataset of KONECT, the values of the network diameter (a measure of how many “steps” nodes are apart) and the values of the network density (a measure of how many neighbors nodes typically have), as a function of time.  The details of the particular calculation do not matter, but for the curious, the goal of this is to verify empirically whether in real-world networks, diameters are shrinking while the density is increasing.  Suffice it to say that for each dataset in the KONECT project, we want to generate a plot that looks like this:

[Figure: diameter and density plot for the Wikipedia growth network dataset]

This particular plot was generated for the Wikipedia growth network dataset.  As predicted, the diameter (green curve) is shrinking, while the density (purple curve) is increasing.  Regardless of the actual behavior of each dataset, we would like to generate these plots for all datasets in the KONECT project.  However, there are complications:  (1) If we just write a loop over all networks to generate all plots, this will take way too much time.  Instead we want to only generate those plots that have not yet been generated.  (2) We must restrict the plotting to only those datasets that are temporal, i.e. for datasets that don’t include timestamps, we cannot generate this plot.  The Stu snippet takes care of both aspects.  Let’s now look at the dependency:

@diadens.[dat/NETWORKS_TIME]

As always in Stu, if a name starts with a @, it is a transient.  In this case however, it is a little bit more complex than what we have seen before.  In this case, brackets [ ] are used.  What is the purpose of the brackets in Stu?  They indicate that part of the name is to be generated dynamically.  Here, it means that Stu must first build dat/NETWORKS_TIME (which is a file because it does not start with @), and then the brackets [dat/NETWORKS_TIME] will be replaced by the content of the file dat/NETWORKS_TIME.  The file dat/NETWORKS_TIME is a file of KONECT which contains the names of all datasets which have timestamps, all names being separated by whitespace.  This is exactly the list of datasets to which we want to apply our calculation.  In fact, the main.stu file of KONECT also contains a rule for building the file dat/NETWORKS_TIME, which has to be regenerated whenever new datasets are added to the project.  Now, we said that Stu will replace the string [dat/NETWORKS_TIME] with the content of the file dat/NETWORKS_TIME — this is not 100% precise.  Since that file contains multiple names (one for each dataset), Stu will duplicate the whole dependency @diadens.[dat/NETWORKS_TIME], once for each name in the file dat/NETWORKS_TIME.  As a result, the dependencies of @diadens will be of the form @diadens.XXX, where XXX is replaced with each dataset of KONECT that has timestamps.  This is exactly what we want:  we have just told Stu that in order to build @diadens (i.e., build all diameter-density plots), Stu must build all @diadens.XXX targets (i.e., each individual diameter-density plot), for each dataset with timestamps individually.

Then, what is the difference between that one line of Stu script, and a loop over all temporal datasets, generating the plots for all datasets in turn?  For one, Stu is able to parallelize all work, i.e., it will be able to generate multiple plots in parallel, and in fact the number of things to execute in parallel is determined by the -j option to Stu (which should be familiar to users of GNU or BSD Make).  Second, Stu will not always rebuild all plots, but only those that need to be rebuilt.  What this means is that when Stu is invoked as

$ stu @diadens

on the command line, Stu will go through all temporal datasets, and will verify (1) whether the plot has already been generated, and (2) if it has, whether the plotting code has changed since then.  Only then will Stu regenerate a plot.  Furthermore, because the file dat/NETWORKS_TIME itself has a rule for it in the Stu script, Stu will automatically check whether any new datasets have been added to the KONECT project (not a rare occasion), and will generate all diameter-density plots for those.  (Users of Make will now recognize that this is difficult to achieve in Make — most makefiles would autogenerate code to achieve this.)

Now, let us continue in the script:

@diadens.$network: plot/diadens.a.$network.eps;

Again, this is a rule that spans a single line, and since it ends in a semicolon, doesn’t have a command.  The target of this rule is @diadens.$network, a transient.  What’s more, this target contains a dollar sign $, which indicates a parameter.  In this case, $network is a parameter.  This means that this rule can apply to more than one target.  In fact, this rule will apply to any target of the form @diadens.*, where * can be any string.  By using the $ syntax, we automatically give a name to the parameter.  In this case, the parameter is called $network, and it can be used in any dependency in that rule.  Here, there is just a single dependency:  plot/diadens.a.$network.eps.   Its name does not start with @ and therefore it refers to a file.  The filename however contains the parameter $network, which will be replaced by Stu by whatever string was matched by the target of the rule.  For instance, if we tell Stu to build the target @diadens.enron (to build the diameter-density plot of the Enron email dataset), then the actual dependency will be plot/diadens.a.enron.eps.  This is exactly the filename of the image that will contain the diameter-density plot for the Enron email dataset, and of course diameter-density plots in KONECT are stored in files named plot/diadens.a.XXX.eps, where XXX is the name of the dataset.

Finally, we come to the last part of the Stu script snippet, which contains code to actually generate a plot:

plot/diadens.a.$network.eps: 
        m/diadens.m 
        $[ -t MATLABPATH]
        dat/stepsi.$network 
        dat/statistic_time.full.(diameter avgdegree).$network
{
        ./matlab m/diadens.m 
}

This is a single rule in Stu syntax, which we can now explain line by line.  Let’s start with the target:

plot/diadens.a.$network.eps:

This tells Stu that the rule is for building the file plot/diadens.a.$network.eps.  As we have already seen, the target of a rule may contain a parameter such as $network, in which case Stu will perform pattern-matching and apply the given rule to all files (or transients) which match the pattern.  Let’s continue:

        m/diadens.m

This is our first dependency.  It is a file (because it does not start with @), and it does not contain the parameter $network.  Thus, whatever the value of $network is, building the file plot/diadens.a.$network.eps will always need the file m/diadens.m to be up to date.  This makes sense, of course, because m/diadens.m is the Matlab code that we use to generate the diameter-density plots.  Note that a dependency may (but does not need to) include any parameter used in the target name, but it is an error if a dependency includes a parameter that is not part of the target name.

The next line is $[ -t MATLABPATH], but in defiance of linearity of exposition, we will explain that line last, because it is a little bit more complex.  It is used to configure the Matlab path, and uses the advanced -t flag (which stands for trivial).  The next line is:

        dat/stepsi.$network

This is a file dependency, and, as should be familiar by now, it is parametrized.  In other words, in order to build the file plot/diadens.a.$network.eps, the file dat/stepsi.$network must be built first.  In KONECT, the file dat/stepsi.$network contains the individual “time steps” for any temporal analysis, i.e., it tells KONECT how many time steps we plot for each temporal dataset, and where the individual “time steps” are.  This is because, obviously, we don’t want to recompute measures such as the diameter for all possible time points, but only for about a hundred of them, which is largely enough to create good-looking plots; more detail in the plots would not make them look much different.  Needless to say, the Stu script of KONECT also contains a rule for building the file dat/stepsi.$network.  Then, we continue with one more dependency of this rule:

        dat/statistic_time.full.(diameter avgdegree).$network

As before, this dependency includes the parameter $network, which will be replaced by the name of the dataset.  Furthermore, this line contains a Stu operator we have not yet seen:  the parentheses ( ).   The parentheses ( ) in Stu work similarly to the brackets [ ], but instead of reading content from a file, they take their content directly from what is written between them.  Thus, this one line is equivalent to the following two lines:

dat/statistic_time.full.diameter.$network
dat/statistic_time.full.avgdegree.$network

Thus, we see that we are simply telling Stu that the script m/diadens.m will also access these two files, which simply contain the actual values of the diameter and density over time for the given network.  (Note that the density is called avgdegree internally in KONECT.)

Finally, our snippet ends with

{
        ./matlab m/diadens.m 
}

This is an expression in braces { }, and in Stu, these represent a command.  A command can be specified instead of a semicolon when something actually has to be executed for the rule.  In most cases, rules for files have commands, and rules for transients don’t.   The actual command is not written in Stu, but is a shell script.  Stu will simply invoke /bin/sh to execute it.  In this example, we simply execute the Matlab script m/diadens.m.  Here ./matlab is a wrapper script for executing Matlab scripts, which we use because pure Matlab is not a good citizen when it comes to being called from a shell script.  For instance, Stu relies on the exit status (i.e., the numerical exit code) of programs to detect errors.  Unfortunately, Matlab always returns the exit status zero (indicating success) even when it fails, and hence we use a wrapper script.

One question you should now be asking is:  how does that Matlab script know which plot to generate, let alone which input files (like dat/statistic_time.full.diameter.$network) to read?  The answer is that Stu will pass the value of the parameter $network as … an environment variable called $network.  Thus, the Matlab code simply uses getenv('network'), which is the Matlab code to read the environment variable named network.  We thus see that most commands in a Stu script don’t need to handle their parameters explicitly:  they are passed transparently, via the environment, to any called program.

Finally, we can come back to that one line we avoided earlier:

        $[ -t MATLABPATH]

The characters $[ indicate that this is a variable dependency.  This means that Stu will first make sure that the file MATLABPATH is up to date, and then pass the content of that file to the command, as an environment variable of the same name.  Therefore, such a variable dependency is only useful when there is a command.  In this case, the environment variable $MATLABPATH is used by Matlab to determine the path in order to find its libraries.  The file MATLABPATH, which contains that path as used in KONECT, has its own rule in the KONECT Stu script, because its content is non-trivial.  Had we written $[MATLABPATH] as a dependency, that would be the end of the story.  But that would have a huge drawback:  every time we change the Matlab path, Stu would rebuild everything in KONECT, or at least everything that is built with Matlab, simply because MATLABPATH is declared as a dependency.  Instead, we could have omitted the dependency $[MATLABPATH] altogether, and simply have our script ./matlab read out the file MATLABPATH and set the environment variable $MATLABPATH accordingly.  But that would also not have been good, because then the file MATLABPATH would never have been built to begin with, when we start with a fresh installation of KONECT.  We could also put the code to determine the Matlab path directly into ./matlab to avoid the problem, but that would mean that starting every Matlab script would be very slow, as the path would have to be generated again each time.  We could have made ./matlab cache the result (maybe even in the file MATLABPATH), but since ./matlab is called multiple times in parallel, that would have meant that ./matlab would need some form of locking mechanism to avoid race conditions.

Instead of all that, the -t flag in Stu does exactly what we want here.  -t stands for trivial, and is used in Stu to mark a dependency as a trivial dependency.  This means that a change in the file MATLABPATH will not result in everything else being rebuilt.  However, if the file MATLABPATH does not exist, or if it exists but needs to be rebuilt and Stu determines that the command using Matlab must be executed for another reason, then (and only then) does Stu rebuild MATLABPATH.  (For those who know the Make build system:  this is one of the features of Stu that are nearly impossible to recreate in Make.)  Note that there are also flags in Stu that can be used in a similar way:  -p declares a dependency as persistent (meaning it will never be rebuilt if already present), and -o as optional (meaning it is only rebuilt, if at all, if the target file is already present).

This concludes our little tour of a little Stu script snippet.  For more information about Stu, you may read my previous blog articles about it.

You can get Stu on GitHub.

For reference, the example in this blog article uses features introduced in Stu version 2.5.  Stu is always backward compatible with previous versions, but if you have a pre-2.5 version installed, then of course these new features will not work.

The words written in italics in this article are official Stu nomenclature.

EDIT 2018-01-29:  fixed a typo