A Mathematical Riddle

I’ll just place this here:

[Image: the mathematical expression to be computed]

You have to compute this expression.

This riddle was composed a few months ago for a competition, but in the end we decided not to use it.  Actually, it is easier than it seems:  You don’t need a calculator or computer to solve it.  A piece of paper (or two) should be enough.  We wrote the riddle so that there are several shortcuts that can be used to solve it, but if you don’t find the shortcuts, you can still solve it the obvious way; it just takes a little longer.

Also, you have to find the exact answer, not a numerical approximation.

I’m not posting the answer here for now.  If you’re curious whether your answer is correct, post below.

 

No Hairball – The Graph Drawing Experiment

→ QUICKLINK TO THE EXPERIMENT

Many graph drawings look like a hairball.

The larger a network is, the harder it is to visualize it.  Most graph drawing algorithms produce a giant “hairball”, in which nodes and edges are hopelessly mixed up, leaving no way to discern any structure whatsoever. Here is an example:

[Figure: a typical “hairball” graph drawing, taken from the author’s WWW 2009 paper]

This is from one of my own papers (WWW 2009), so I should know what I am talking about. Nowadays, I wouldn’t put such a picture in a paper, let alone on the first page as I did then.  Many papers, however, still contain such graphics.

Can we learn anything from this drawing?  No.  There are no communities visible.  No clustering is apparent.  I cannot even tell whether the graph is bipartite.  In fact, I cannot even estimate the size of the graph from this picture.

So, why do we keep putting hairballs in our papers?  Maybe, because they give us the illusion of insight into a complex network.  Yes, we would like to understand whether a graph displays clustering, bipartivity, assortativity, dissortativity, skewed degree distributions, and a myriad of other interesting properties that complex networks can have. What better way to visualise these features than by drawing the graph?  Isn’t every visualisation also a drawing?

No, visualisations are not necessarily drawings.  We don’t need to draw a graph in order to visualise it.  In fact, the mere fact that we try to draw the entirety of a network in a small space is what leads to the hairball problem to begin with: Hundreds of thousands, millions or even more nodes and edges are crammed into the space of only a few centimetres.  It is no wonder we don’t see anything in a hairball graph:  Each node and each edge gets allocated a space on the order of a micrometre or less – much too small to even be shown on computer displays or printed on paper, let alone seen by the human eye.

What then, can we do to avoid hairballs?  Several ideas have been tried:

  • Show only a subset of the network.  This is called sampling.  In its simplest incarnation, choose a random subset of the nodes to get a subgraph of the real graph of manageable size, and then draw that subgraph (a minimal sketch of this follows the list).
  • Aggregate nearby nodes into single nodes.  This is called coalescing, and also produces smaller graphs, which are then drawn with the usual methods.
  • Allow the user to zoom in, using interactive software.  This is nice, but doesn’t give any more insight into the overall properties of the network.
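For the simplest sampling variant, here is a minimal sketch in Python using networkx (not from the original post; the generated graph, the sample size and the output file name are placeholders):

# A minimal sketch of node sampling: draw only a random induced subgraph.
import random
import networkx as nx
import matplotlib.pyplot as plt

G = nx.barabasi_albert_graph(100000, 3)    # stand-in for a large real network
k = 500                                    # size of the sampled subgraph
nodes = random.sample(list(G.nodes()), k)  # uniform random subset of the nodes
H = G.subgraph(nodes)                      # induced subgraph on the sample

nx.draw(H, node_size=10, width=0.3)        # draw only the small sample
plt.savefig('sample.png')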

These methods are all suited to particular use cases.  But for visualising the overall properties of a graph, they fail.  What do these methods all have in common?  They assume that to visualise a graph, you have to draw the graph.  The question thus becomes:  Can we go further?  Can we visualise a graph without drawing it?  Almost.

In the experiment we are performing, we don’t draw the graph to be visualised.  Instead, we draw another graph – a much smaller one – which is representative of our graph.  In fact, we throw away all nodes and edges of the input.  Only graph statistics such as the assortativity, the bipartivity, and others are kept, and a completely new graph is created, specifically for visualisation.  Most graph and network researchers will now ask:  How can I see individual nodes and edges of the graph?  The answer is that you can’t; that is the point of the method.  We sacrifice local information in the graph in order to make global properties more apparent.

The method is in development, and a paper is upcoming, but not public yet.

We are now performing an experiment to find out how to do this graph visualisation optimally:

TAKE PART IN THE EXPERIMENT

This is an interactive experiment that will ask you to look at graphs, and to answer yes/no questions. In fact, you only have to click on graphs to answer.  However:  You must be knowledgeable in graphs and networks to participate.

This will help our research.  If you want to keep in touch with the results of the experiments, write to <pkumar@uni-koblenz.de>.

If you’re interested in graph visualisation, check out the KONECT project; it has lots of graph visualisation methods based on matrix decompositions.

 

Which Elements are Named After Countries?

This week, the names of four newly discovered chemical elements were announced:

  • Nihonium (Nh) for element 113
  • Moscovium (Mc) for element 115
  • Tennessine (Ts) for element 117
  • Oganesson (Og) for element 118

These names remain open to revision for six months. Barring some unexpected development, they will go down in the chemistry books – and more to the point, the nuclear physics books – as the names of the elements that complete Period 7 of the periodic table.

Of these four, one name refers to a country (Japan), one to a region (Tennessee), one to a city (Moscow), and one to a person (Yuri Oganessian). All four names were chosen by the discoverers, and honour the discoverers. While the discovering teams have indeed done tremendous work and certainly deserve the honour, I find this type of naming quite unfortunate, and even downright selfish.

This has not always been the case. Most elements were not named for selfish reasons. Some elements were named for their properties, such as hydrogen, oxygen, and nitrogen. Some for materials from which they were extracted: beryllium from beryl, aluminium from alum, calcium from calx, sodium from soda, potassium from potash, boron from borax, silicon from silex, and lithium from stones.  Many were named for the colours of their compounds: chlorine is green, bismuth is white, rubidium is red, caesium is blue-grey, gold is yellow, indium is indigo, iodine is violet, rhodium is pink, thallium is green, chromium is colourful, and iridium has the colour of the rainbow. I find these highly poetical. Not to be outdone, some elements were named after astronomical objects: uranium, neptunium, and plutonium for Uranus, Neptune, and Pluto; cerium after Ceres; selenium after the Moon; and tellurium after the Earth.

In more modern times, many radioactive elements were named for their radioactivity: actinium and radium produce rays, astatine is unstable, technetium is artificial. Others were named for various other properties: bromine and osmium smell, argon is inactive, barium is heavy, neon is new, xenon is foreign, phosphorus carries light, krypton and lanthanum are hidden, and dysprosium is hard to get.

Newer elements, however, are not named for their properties, but are simply given names meant to honour the discoverers.  All non-black cells in the following periodic table are elements whose names can be traced back to a country:

[Figure: periodic table of elements, coloured by the country referred to in each element’s name]

The last row, Period 7, is almost completely coloured.

I have made some simplifications in the table:  I count Rutherford as British (by his career), and the Rhine as German. Also, the Curies are Polish here.

Let’s look at the country-based names:  Some are quite innocent, like magnesium and manganese, named indirectly after the Greek region of Magnesia. Copper gets its name from the island of Cyprus, while beryllium and strontium get their names indirectly from places in India and Scotland.

In other cases, chemists have tried to honour their country, such as France (gallium and francium), Germany (germanium), and Russia (ruthenium). Note how nicely gallium and germanium are placed side by side. Also note that while europium is named for a deity, the nearby americium was named consciously after the United States.

The Scandinavian countries have the largest share of elements named after them. Famously, a whopping four elements are named after the tiny Swedish village of Ytterby: ytterbium, yttrium, erbium, and terbium. Scandium is named for Scandinavia, thulium for a mythical Scandinavian region.  Holmium and hafnium are named for Stockholm and Copenhagen, respectively. Nobelium and bohrium are named after a Swedish and a Danish person, respectively.

Since World War II, newly discovered elements have been named simply after people, or after the countries or regions of discovery. In fact, there has been a dispute between American and Soviet physicists about naming priorities. As a result, we don’t have an element named kurchatovium, even though I learned about it in school. With the four new names, that particular dispute should be considered settled at a score of 7–7, although I’m sure some people on either side of the debate still hope for more.

While names such as einsteinium, fermium, copernicium, and meitnerium seem fair, given that these persons were already dead and unrelated to the particular discoveries, it seems to me quite preposterous to name elements after living people who are involved in the research.  This was the case for seaborgium and the newly named (and not yet finalised) oganesson. With all respect for these researchers, I doubt that Glenn T. Seaborg and Yuri Oganessian will go down in history in the same way as Copernicus and Mendeleev. Or Einstein and Meitner. Or Curie and Rutherford.

Also, it seems that the responsible discoverers no longer try to follow Latin nomenclature:  “nihonium” deserves to be called japonium. That would also give the much more memorable symbol Jp, coinciding with Japan’s country code. If I were responsible for bettering Japan’s standing in the world, I would lobby for “Jp”. On the other hand, moscovium sounds very appropriate to me, and I like that tennessine and oganesson use the suffixes of the halogens and noble gases, even if I doubt their corresponding chemistry is going to be characterised anytime soon.

It seems that the last element named after a concept was plutonium, named after the last planet to have been discovered at the time, which is itself named after a god. The current research teams would do well to come back to naming new elements after actual properties of the material, or at least to find metaphors relating to planets and deities, rather than using names merely as a vehicle for advertising their labs and principal investigators. Let’s hope that if elements in Period 8 can be synthesized, we will go back to a more classical naming culture.

Index of Complex Networks – The “Google for Networks” by University of Colorado Boulder

This week at the Conference on Network Science (NetSci 2016), Aaron Clauset from the University of Colorado Boulder unveiled the Index of Complex Networks, assembling information about thousands of network datasets available online. They have about 3500 entries at the moment, with many more to come in the near future.

This news is a big deal in the network science community. Researchers in the field of Network Science are (quite obviously) using network datasets on a daily basis, and anything that helps them get their hands on more of them is good news. Until now, there have been several sites collecting network datasets, such as SNAP from Jure Leskovec’s group at Stanford, and KONECT, written by me at the University of Koblenz.

The new ICON website takes the community one step further by making available not the datasets themselves, but an index of them, i.e., a listing of datasets available on the web, the largest share of which comes from KONECT, but also from SNAP and countless other sources that publish network datasets.

What does this mean for the field of Network Science? It means we will be able to verify our claims not on individual datasets, or on a few dozen of them, but on thousands. This is worth spelling out: With thousands of datasets, we can finally investigate the statistical significance of claims. We may finally find out whether social networks are really scale-free. We may finally find out whether social networks really have so many more triangles than other networks, as has traditionally been claimed in social network studies. In fact, we already have a first result based on the ICON data: Networks as a whole are not scale-free. In other words, degree distributions are not power laws in a statistical sense. This contradicts many claims made in the field.

I expect this data to lead to many new insights. In particular, many “well-known” facts about networks will have to be revised. I have to say, I’m very much looking forward to starting this new chapter of Network Science.

The Index of Complex Networks (ICON) is here: https://icon.colorado.edu/

KONECT (The Koblenz Network Collection) is here: http://konect.uni-koblenz.de/

Stu – Build Automation for Data Mining Projects

Stu is a build tool developed at the University of Koblenz-Landau. In this article, I will explain what sets it apart from the myriad of other build tools, why I wrote it, and how we use it in various projects at the Institute for Web Science and Technologies.

What is Make?

Everyone has probably already heard of Make, a standard Unix tool mainly used in software projects for automating the build process. Big software projects consist of many source files, which are compiled to many individual object files and then linked. Larger software projects may have an even more complex structure: for instance, they might autogenerate code, compile multiple versions of the software, or have different compilation procedures depending on the platform on which they are run. As a result, building software is not an easy task, and many tools exist to make it easier. The quintessential tool in this category is undeniably Make, dating back to the 1970s as part of Unix. By now, Make has been standardized as part of the POSIX specification, and is ubiquitous. The best-known implementation today is probably GNU Make, as it is the default implementation on Linux.

Running Data Mining Projects

While everyone knows Make as a tool for building software, not everyone knows that it is also a completely generic build tool: It can be used to generate datasets, convert files, compile LaTeX documents, etc. For this reason, Make has been the tool I myself have used for years whenever I implemented data mining projects. I use the term “data mining project” in a broad sense here: any project in which data is manipulated. For instance, a typical data mining project may include the following tasks:

  • Download a dataset from the web
  • Uncompress it
  • Clean up the dataset
  • Convert to another format
  • Visualize the dataset
  • Compute features
  • Run a regression
  • Visualize the result of the regression

Most people who do data mining projects would probably write a single program for each step, in their favourite or in the most appropriate programming language. Then, they would run these programs sequentially. If necessary, earlier steps can then be re-run by hand. While this is an acceptable way to go about things in data mining projects, it still leads to many undesirable situations. Sometimes, we find out that our dataset contains individual data points that were wrongly extracted. In this case, we should adapt our clean-up code, and re-run all steps from that point on. Unfortunately, many data mining researchers do not remember the exact sequence of steps taken, and thus they simply do not perform the additional clean-up. A few erroneous data points will not spoil the end results, or so they think.

Common Problems Encountered in Data Mining Projects

In fact, the inability to re-execute previously performed steps is a very common phenomenon, and leads to many unfortunate situations: data clean-up is not performed as it should, tests are not re-run properly, plots are not re-generated, etc. All these problems can be traced back to one root: The inability to remember which steps were actually taken in the past. It thus comes as no surprise that people, as I did, turn to tools like Make to drive their data mining projects. This is a perfectly good solution to a common problem, and thus we may see code like the following:

data-clean.txt:  data.txt cleanup
        ./cleanup <data.txt >data-clean.txt

This is a snippet from a hypothetical data mining project in which Make is used. In this example, a script “cleanup” is used to clean up the dataset, reading the data from the file “data.txt” and writing the cleaned up data into the file “data-clean.txt”. The snippet can be saved in a file “Makefile”, and then “make” can be executed on the command line. Make will then execute the given command, but only if necessary. In fact, Make is smart enough to check the timestamps of the given input and output files, and only rebuild the output file when the input files are newer. This feature of Make is very useful when many rules are combined. Changing a single file and running Make will then only rebuild what has to be rebuilt.

When Make Is Not Enough

This usage of Make shows that Make is not only a source code compilation tool, but a generic build tool. The question then is, why is Make not enough? To give an answer, consider the following:

  • I need to run the same experiment on multiple datasets
  • I need to run multiple experiments on all datasets
  • I need to run the analysis using multiple subsets of all features
  • I may or may not filter out a subset of the data
  • I need to use multiple error measures

What do all these requirements have in common? The need to run things multiple times, parametrized by a variable. That variable can be the dataset chosen, the error measure chosen, or just a binary value denoting whether a certain extra step is to be performed or not. Most data mining projects go about this by simply implementing loops in whatever programming language they are using, iterating over all possible values of the variables. For instance, we can easily write a shell script to clean up all datasets in our collection:

for language in en de fr pt es ja pl nl ; do 
        ./cleanup <data.$language.txt >data-clean.$language.txt 
done

This is a shell script that iterates over eight languages, and calls the clean-up program “cleanup” on each. This works well as long as clean-up is fast. If the datasets in question are Wikipedia datasets, for instance, then the files are likely to be huge, and clean-up slow. If our script is interrupted, it can only be restarted from the beginning, and will re-do the clean-up for all languages, even those for which it has already been performed.

The solution to this problem is to declare each language’s clean-up in Make. For instance, we may write:

data-clean.en.txt:  data.en.txt cleanup
        ./cleanup <data.en.txt >data-clean.en.txt
data-clean.de.txt:  data.de.txt cleanup
        ./cleanup <data.de.txt >data-clean.de.txt
...

and so on for all eight languages.

This is clearly not scalable. Instead we may use GNU Make’s feature of pattern rules, which allows us to write:

data-clean.%.txt:  data.%.txt cleanup
        ./cleanup <data.$*.txt >data-clean.$*.txt

This is much clearer, but still leaves something to be desired: How do we handle multiple parameters, e.g. the language and the subset of features? This can be solved by generating the Makefile code automatically. As we have multiple parameters, we then have to generate code for all possible combinations, leading to very large generated files, and thus a long startup time for Make, which has to parse it all. Alternatively, we can use the following type of code. This example is a real-world example lifted from the KONECT project, from before it switched to Stu:

define TEMPLATE_decomposition_all
decomposition_full.$(1).all: $(foreach NETWORK, $(NETWORKS), \
   decomposition_full.$(1).$(NETWORK)) 
diagonality.$(1).all: $(foreach NETWORK, $(NETWORKS), \
   diagonality.$(1).$(NETWORK))

decomposition_time.full.$(1).all: $(foreach NETWORK, $(NETWORKS_TIME), \
   decomposition_time.full.$(1).$(NETWORK))
$(foreach NETWORK, $(NETWORKS_TIME), \
   decomposition_time.full.$(1).$(NETWORK)): \
   decomposition_time.full.$(1).%: \
   plot/decomposition_time.full.a.$(1).%.eps
decomposition_time.split.$(1).all: $(foreach NETWORK, $(NETWORKS_TIME), \
   decomposition_time.split.$(1).$(NETWORK))
$(foreach NETWORK, $(NETWORKS_TIME), \
   decomposition_time.split.$(1).$(NETWORK)): \
   decomposition_time.split.$(1).%: \
   plot/decomposition_time.split.a.$(1).%.eps 
endef
$(foreach DECOMPOSITION, $(DECOMPOSITIONS_ANY), \
   $(eval $(call TEMPLATE_decomposition_all,$(DECOMPOSITION))))

I am not going to explain this code. Suffice it to say that writing such code for all cases is not the way to go. (If you want to know, this is just a small snippet from the code that generates decompositions of characteristic graph matrices of networks in KONECT. Some parameters appear as GNU Make’s %, some as function parameters $(1), etc.)

A Tool That Allows Parameters in Rules

From the inability of Make to properly handle multiply-parametrized rules came my search for a tool that is able to do it. Fortunately, the landscape of build tools is very large. Unfortunately, virtually all build tools are designed around radically different requirements:

  • Almost all are made for compiling software, not executing arbitrary commands. The few that allow arbitrary commands, such as Make, fail to allow parametrized dependencies.
  • Many are specific to a particular programming language or environment, assuming that everything will be done with it. This fails to account for the fact that any data mining project of even modest size will need to execute individual steps in multiple programming languages.

At the end of my year-long search, only the tool Cook came close to fulfilling all requirements, even though it, too, required some unintended hacks.

The curious reader may consult Wikipedia’s List of build automation software to get an overview of the field.

Only Two Features

Not finding any satisfying build tool, it occurred to me that I only needed two features in addition to Make’s feature set:

  • An arbitrary number of named parameters in rules
  • The ability to generate the list of dependencies dynamically

Thus, I set out to write a build tool having these two features. In what later became Stu, the data clean-up example would be written in the following way:

@all:  [dep];

LANGS { 
        echo >LANGS en de fr pt es ja pl nl 
}

dep: LANGS { 
        for lang in $(cat LANGS) ; do 
                echo data-clean.$lang.txt 
        done >dep 
}

data-clean.$lang.txt:  data.$lang.txt cleanup {
        ./cleanup <data.$lang.txt >data-clean.$lang.txt
}

This snippet of Stu code shows the following two features:

  • Parameters are written using ‘$’, like shell variables. They have the same syntax within filenames and in the shell command.
  • The brackets ‘[…]’ denote that dependencies are read from the content of the given file. Stu will first make sure the file is generated, and then parse the file.

These two features are the essence of Stu. They set it apart from other build tools, and are the reason I wrote it.

Other Features

In fact, the example in the previous section also shows another very convenient idiom used very often in Stu files: The list of languages is itself a file, instead of a variable as it would be in Make. This has the advantage that we can add a new language and all necessary steps will be executed automatically. Also, variables in Make are usually a scalability problem: They are regenerated at each invocation of Make. Even though the main purpose of Make is to avoid unnecessary executions, any variable (whose content would often depend on other variables, such as the list of languages) is regenerated many times.

In fact, Stu initially also had variables like Make, but I removed the feature once I noticed that they were unnecessary, as files are always better suited for what they do.  As a matter of fact, I sometimes think of Stu as a programming language in which the variables are files.

(Note: That was in Version 0.  Now, in version 1.9, features are not removed anymore, to keep backward compatibility.)

Other Facts About Stu

Like many other Make replacements, Stu corrects the typical shortcomings of Make:

  • Error messages are much better (Make has the dreaded “missing separator”); Stu gives full traces like a proper compiler.
  • Stu has a proper tokenisation pass, instead of Make’s variable substitution syntax, avoiding many quoting and escaping problems.
  • Stu catches typical Makefile errors such as dependencies that were not built.
  • Stu has better support for interrupting large builds (Make will often hang or leave processes running).
  • Stu avoids the “inner platform” antipattern present in Make, in which a lot of shell functionality is duplicated in Make functions. Stu encourages all program logic to be implemented in rules, i.e. using a proper shell.
  • Stu supports additional types of dependencies (existence-only dependencies with the prefix ‘!’, optional dependencies with ‘?’, and trivial dependencies with ‘&’). These can only be emulated partially with Make by using unwieldy constructs.

Other design considerations of Stu include:

  • As a use case, focus on data mining instead of software compilation. There are no built-in rules for compilation or other specific applications. Allow use case-specific rules to be written in the language itself. However, Stu is use case-agnostic.
  • Files are the central datatype. Everything is a file. Think of Stu as “a declarative programming language in which all variables are files.” For instance, Stu has no variables like Make; instead, files are used.
  • Files are sacred: Never make the user delete files in order to rebuild things.
  • Do one thing well: We don’t include features such as file compression that can be achieved by other tools from within the shell commands.
  • Embrace POSIX as an underlying standard. Use the Bourne shell as the underlying command interpreter. Don’t try to create a purportedly portable layer on top of it, as POSIX already is a portability layer. Also, don’t try to create a new portable language for executing commands, as /bin/sh already is one.
  • Keep it simple: Don’t use fancy libraries or hip programming languages. Stu is written in plain C++11 with only standard libraries.
  • Have extensive unit test coverage. All published versions pass 100% of unit tests. Stu has 500+ unit tests. All features and error paths are unit tested.
  • Stability of the interface: We follow Semantic Versioning (semver.org) in order to provide stable syntax and semantics. Stu files written now will still work in the future.

Using Stu In Your Project

Instead of having a “Makefile”, have a “main.stu” file. Instead of calling “make”, call “stu”.

To see an example of a large and complex Stu file, see the Stu file of the KONECT project.

Documentation is given in Stu’s manpage.

To use Stu, get it on GitHub and compile it yourself. Stu is written in standard C++11.

Stu has an extensive set of unit tests, and is quite stable. The specification of Stu strictly follows Semantic Versioning, and therefore Stu files will remain compatible in the future.

The name “Stu” follows the precedents of Make replacements Cook and Bake in referring to kitchen-related verbs, and also honours the author of the original Unix Make, Stuart Feldman.

Stu is free software, placed under the GNU General Public License, Version 3.

Stu was written by Jérôme Kunegis.

 

 

Pronunciation of English Vowels

The English language is written in the Latin alphabet, and in principle, each letter corresponds to one sound.  In practice, however, many letters can be pronounced in multiple ways, and many sounds can be written in multiple ways. The following diagram shows the relationships between English sounds and the letters used to write them.

  • The sounds of English are represented here by the phonemes of English, i.e., basic sound units of the language.  This ignores differences between sounds that are not used to distinguish different words, such as the British and American way of pronouncing “au”, which is simply written /ɔː/.  Phonemes are represented using the international phonetic alphabet (IPA).  See English phonology on Wikipedia for a description.
  • English uses not only individual letters, but also groups of multiple letters which are pronounced as one phoneme.  Groups of two letters are called digraphs.

The following diagram shows the relationship between the vowel phonemes of English and the letters and digraphs used to write them.  Each cell shows a word written with the given letter or digraph and producing the given phoneme.  Each column represents words containing the same sound, but written in different ways.  Each row represents words containing the same letter or digraph, but pronounced in different ways.

[Table: relationship between English vowel phonemes and the letters/digraphs used to write them]

Note that the table shows only vowels, and omits vowels followed by R, as well as the /ə/ (schwa) phoneme.  Each word is colour-coded:

  • White words are regular English words
  • Blue words have irregular spelling
  • Green words have a foreign origin explaining their irregular spelling
  • Yellow words have spellings or pronunciations which are either rare, or used only in certain English-speaking countries.

For each cell, we show a word from the first of the above categories for which a word exists. Note: The table is likely to contain errors and omissions!  Please write a comment below, and I will correct them.

  • I have very probably missed many rare and foreign words.  In particular, there are currently no trigraphs (groups of three letters) in the table.
  • The table does not give an indication of the frequency of each letter/phoneme combination.  For instance, “oo” represents /ʌ/ only in very few words, such as “blood” and “flood”.
  • The assessment of spellings as regular and irregular may be arbitrary in places.  For instance, I count “knowledge”/”bought”/”tought” as irregular.
  • Vowels followed by R are not shown, even though they are also irregular, for instance the pronunciations of “learn”, “hear”, “bear” and “heart”.  I put “bury” in this table because the R is not in the same syllable as the vowel.
  • Consonants are missing.  In general, consonants have more regular pronunciations, but there are still some interesting cases, such as “who”/”white”, etc.

Analysis

We may note at once that English orthography does not have a one-to-one relationship between sounds and letters.  In fact, almost all letters and digraphs have multiple pronunciations, and almost all phonemes have multiple spellings. In terms of network analysis, we may ask whether it is possible to reach any phoneme from any other phoneme.  For instance, we may go from /ɪ/ to /ɒ/ using the following sequence:

  • /ɪ/ is written “i” in “in”, which is pronounced /aɪ/ in “mind”, which is written “ei” in “height”, which is pronounced /eɪ/ in “veil”, which is written “a” in “name”, which is pronounced /ɒ/ in “yacht”.

This shows that there is a path from /ɪ/ to /ɒ/, even though these two sounds are completely different.  The sequence “in” → “mind” → “height” → “veil” → “name” → “yacht” has length five.  The underlying graph is thus a bipartite graph:  It contains phonemes and letters/digraphs as nodes, and words as edges. Can all nodes in this graph be reached from any other node?  The answer is no, because the phoneme /ɔɪ/ with its two spellings “oi” and “oy” is isolated from the rest of the graph.  The rest of the graph is connected, however, so any phoneme can be reached from any other phoneme, except for /ɔɪ/. However, we can still find that there are two clusters in the graph, with many edges within each cluster, but not many edges from one to the other:  the AEI group and the OU group:

  • The AEI group contains the sounds /ɪ/, /aɪ/, /ɛ/, /iː/, /æ/, /eɪ/, /ɑː/ and /ɔː/, and the spellings “i”, “y”, “uy”, “ie”, “e”, “ei”, “ea”, “ee”, “eo”, “ey”, “a”, “ae”, “ai”, “ay”, “al”, “au” and “aw”.
  • The OU group contains the sounds /ɒ/, /oʊ/, /ɔɪ/, /aʊ/, /ʊ/, /ʌ/, /uː/ and /juː/, and the spellings “ao”, “o”, “oa”, “oe”, “oi”, “oo”, “ou”, “ow”, “oy”, “u”, “ue”, “eu”, “ew” and “ui”.

The only links between the two groups are irregular and foreign words:

  • Words with irregular spelling: “women”, “busy”, “build”, “bury”, “broad”, “bought”, “knowledge”, “yeoman” and “gaol” (alternative spelling of “jail”).
  • Words of foreign origin:  “phoenix”, “mauve” and “pharaoh”.
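To make the bipartite-graph view concrete, here is a minimal sketch in Python with networkx (not from the original post; it uses only a small hand-picked subset of the words, not the full table):

# Phonemes and spellings are nodes; words are edges between them.
import networkx as nx

words = [                      # (phoneme, spelling, example word)
    ('/ɪ/',  'i',  'in'),
    ('/aɪ/', 'i',  'mind'),
    ('/aɪ/', 'ei', 'height'),
    ('/eɪ/', 'ei', 'veil'),
    ('/eɪ/', 'a',  'name'),
    ('/ɒ/',  'a',  'yacht'),
    ('/ɔɪ/', 'oi', 'coin'),
    ('/ɔɪ/', 'oy', 'boy'),
]

G = nx.Graph()
for phoneme, spelling, word in words:
    G.add_edge(phoneme, 'sp:' + spelling, word=word)   # prefix spellings to avoid clashes

print(nx.shortest_path(G, '/ɪ/', '/ɒ/'))       # the path described above
print(nx.node_connected_component(G, '/ɔɪ/'))  # /ɔɪ/ and its two spellings are isolated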

EDIT:  Added the word “laugh” with pronunciation /æ/.

Plotting a Degree Distribution on the Command Line

The following shell command plots a degree distribution in a log-log scale.  It uses only standard Unix commands:

<INPUTFILE sed -re '/^\%/d;s,^\s*\S+\s*(\S+).*$,\1,' | sort | uniq -c | sort -n | sed -re 's,^\s*(\S+).*$,\1,;s,.,#,g' | uniq -c | sed -re 's,^\s*(\S+).*,\1,;s,.,#,g'

The file “INPUTFILE” must be a text representation of a directed network with one edge per line, each line containing a from-node ID and a to-node ID.  The command will plot the indegree distribution on a doubly logarithmic scale.  Here it is executed for KONECT’s “YouTube links” dataset:

$ <out.youtube-links sed -re '/^\%/d;s,^\s*\S+\s*(\S+).*$,\1,' | sort | uniq -c | sort -n | sed -re 's,^\s*(\S+).*$,\1,;s,.,#,g' | uniq -c | sed -re 's,^\s*(\S+).*,\1,;s,.,#,g'
#######
#####
####
###
#

The file “out.youtube-links” can be downloaded from here.  The degree distribution is transposed from its usual orientation:  The degrees are on the Y axis and the frequencies are on the X axis.  As the output shows, there is a power law here:  The lines get shorter in a linear fashion, and thus the frequency of each degree value is a negative power of the degree value.

This shows that:

  • You don’t need megabytes of graphics code to generate plots
  • You don’t need a log() function to plot logarithmic axes
  • It absolutely makes sense to use both sort(1) and uniq(1) multiple times in a single pipeline

The drawn degree distribution also uses logarithmic binning:  Each degree is reduced to its number of decimal digits (one ‘#’ per digit, roughly its base-10 logarithm), and the same trick is applied to the frequencies.  This makes it much more useful for visualizing the tail of the distribution than the standard way of plotting degree distributions.
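For readers who prefer to see the logic spelled out, here is a Python re-implementation of the same pipeline (a sketch, not part of the original post; it reads the same “out.youtube-links” file):

# Re-implementation of the shell pipeline: logarithmically binned indegree plot.
from collections import Counter

indegree = Counter()
with open('out.youtube-links') as f:
    for line in f:
        if line.startswith('%'):      # skip comment lines
            continue
        fields = line.split()
        if len(fields) < 2:
            continue
        indegree[fields[1]] += 1      # second column: target node ID

# Bin the degrees by their number of decimal digits (roughly log10) ...
bins = Counter(len(str(d)) for d in indegree.values())

# ... and print one '#' per decimal digit of each bin's size.
for b in sorted(bins):
    print('#' * len(str(bins[b])))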

Try it out with more datasets:  List of datasets in KONECT

Can you write other shell one-liners for generating other network analysis plots?

Largest Network Dataset Ever Released – As Big As Facebook

Good news for the network analysis community:  A new very large Web hyperlink dataset was recently released:  the Web Data Commons Hyperlink Graph consisting of 3.56 billion pages and 129 billion hyperlinks connecting them!  For comparison, the Facebook friendship graph is reported to include 1.26 billion nodes (users) and 150 billion links (friendships).

The Web Data Commons Hyperlink Graph is thus the largest real-world network dataset available outside of the big companies themselves.  What this means is that it is no longer only the big data companies that can do actual “Big Data” research – now any research group can, too.  Therefore, we should be seeing studies using this dataset coming out in the next few years.  I can’t wait to find out what will be done with it.

For another comparison, our own Koblenz Network Collection, which collects network datasets of many different types and sizes, maxes out at 1.9 billion edges, with the Twitter follower network.  But since the new dataset is open, we will of course now add it to the collection.

Defining the Clustering Coefficient

Clustering is an important property of social networks:  People tend to have friends who are also friends with each other, resulting in sets of people among which many edges exist, while a set made from randomly chosen people would have far fewer edges between them. To measure the clustering in a social (or other type of) network, a common measure is the clustering coefficient.  The clustering coefficient is a real number between zero and one that is zero when there is no clustering, and one for maximal clustering, which happens when the network consists of disjoint cliques.  While the clustering in a network can be measured in a number of ways, one common way is to check for triangles:  When two edges share a node, we check whether a third edge exists such that the three edges form a triangle; in a network with high clustering, this is likely. Although the clustering coefficient is often used, there are actually two variants of it – which may have completely different values.

Variant (1) – Define the clustering coefficient c1 as the probability that two incident edges are completed by a third one to form a triangle.  Two edges are incident when they share a vertex. In that case, we can check whether a third edge exists such that all three edges form a triangle.  The proportion of such incident edge pairs that are completed by a third edge to form a triangle then equals the clustering coefficient (variant 1).

Variant (2) – First, define the local clustering coefficient c(u) for each node u in the following way:  Let c(u) be the proportion of pairs of neighbours of u that are connected.  In other words, c(u) is the probability that two friends of u are friends of each other in a social network.  Then define the clustering coefficient c2 (variant 2) as the mean of the local clustering coefficients of all nodes in the network.
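As a concrete illustration, here is a minimal sketch in Python using networkx that computes both variants (the example graph is made up for illustration):

# The two clustering-coefficient variants on a small example graph.
import networkx as nx

G = nx.Graph([(1, 2), (2, 3), (1, 3), (3, 4), (4, 5)])

# Variant 1: fraction of incident edge pairs closed into a triangle
# (networkx calls this "transitivity").
c1 = nx.transitivity(G)

# Variant 2: mean of the local clustering coefficients c(u)
# (note that networkx counts nodes of degree < 2 as having c(u) = 0,
# rather than ignoring them).
c2 = nx.average_clustering(G)

print(c1, c2)   # the two values generally differ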

Both measures c1 and c2 measure the clustering in a network and should therefore correlate highly, right?  Well, no.  Look at the following plot:

[Figure: scatter plot of the two clustering coefficients c1 and c2 for the networks in KONECT]

This plot shows the two clustering coefficients for the networks in the Koblenz Network Collection. Each network is represented by a two- or three-letter code. The X axis represents c1 and the Y axis represents c2.  Clearly, we cannot state that the two variants correlate in any way.  Why is that? To see why the two clustering coefficients do not correlate, we have to think about how both are computed.  In the first variant, each pair of incident edges is given the same weight, whereas in the second variant each node is given the same weight.  This means that nodes with a high degree are overrepresented in variant 1, and underrepresented in variant 2.  The lack of correlation between the two measures is thus a hint that nodes with high and low degree do not, in general, have the same local clustering coefficient.  Let’s see:

[Figure: local clustering coefficient versus degree for the arXiv High Energy Physics / Phenomenology citation network]

This plot needs some explanation. It shows a citation network taken from arXiv’s High Energy Physics / Phenomenology section. The network consists of publications connected by citation links, i.e., a link from u to v means that publication u has cited publication v.  The network is taken as undirected here, i.e., we ignore who cited whom. Each blue dot in the plot is one vertex of the network, drawn according to its degree (X axis, logarithmic) and its local clustering coefficient (Y axis, logarithmic).  The red line shows the average local clustering coefficient taken over all nodes of the same degree.  What we see here is clear:  Nodes with low degree have a high local clustering coefficient. In other words, when a person has many friends, these friends have fewer edges among them, which is to be expected, since a person with many friends is likely to have friends from more diverse communities, and a paper cited many times is likely to be cited by papers from more diverse areas.
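The red line can be computed with a few lines of Python (a sketch, not the code used for the plot; the graph here is a generated stand-in for the citation network):

# Average local clustering coefficient as a function of degree.
from collections import defaultdict
import networkx as nx

G = nx.barabasi_albert_graph(10000, 5)   # stand-in for the citation network

local_cc = nx.clustering(G)              # local clustering coefficient c(u)
by_degree = defaultdict(list)
for u, d in G.degree():
    if d > 1:                            # c(u) is only defined for degree > 1
        by_degree[d].append(local_cc[u])

for d in sorted(by_degree):
    print(d, sum(by_degree[d]) / len(by_degree[d]))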

This explains two things:  First, because nodes with high degree are weighted more in the first variant, the first variant will have lower values than the second one.  Second, the non-correlation between the two variants is due to the fact that the discrepancy between the local clustering coefficients of high-degree and low-degree nodes differs from network to network.  Networks for which both clustering coefficients are near each other have local clustering coefficients that vary little with degree, whereas networks high up in the plot have a higher discrepancy between the local clustering coefficients of high-degree and low-degree nodes.

A few more notes

The clustering coefficients are not always defined. The first variant is not defined when no vertex has degree larger than one; this, of course, is not a problem in practice. The local clustering coefficient is only defined for nodes whose degree is larger than one. Nodes of degree zero or one are present in actual social networks, and the usual method is to simply ignore them and compute the average of the local clustering coefficient over all nodes with degree larger than one, which is what is done in the plots presented here.

Resources

Datasets, Matlab Code, Complete definitions of numerical network statistics

How to Visualize Skewed Distributions

Many distributions, such as the size of cities, the frequency of words, the number of followers of people on Twitter, and others, are very skewed.  For instance, 16.2% of songs make up 83.8% of plays on Last.fm, and 18.8% of groups make up 81.2% of all group memberships on YouTube. These types of distributions are described by expressions such as power law, long tail, and other terms. These terms are often confused, as they refer to slightly different concepts. Also, the confusion is made worse by the existence of multiple ways of visualizing such distributions, which are different mathematically but similar in appearance. In this blog post, I will review the most important ways of visualizing such distributions, and will explain the mathematics behind them.

Zipf’s Law

Many things can be distributed in a power-law-like way. For instance, in 1935 the linguist George Zipf was interested in the distribution of words in the English language. He counted, for each word, its frequency in a corpus of text. He then sorted the words by descending frequency, and noted that a word’s frequency is approximately inversely proportional to that word’s rank. That is, the second most common word is twice as rare as the most common word, the third most common word is three times as rare as the most common word, and so on.

This series of numbers can be visualized like this:

[Figure: rank-frequency (Zipf) plot of word frequencies in the Reuters dataset]

The data used for this plot is from a dataset of Reuters news articles in English. (Since this data is taken from a collection of network datasets, the plot uses the term degree to refer to a word’s frequency.) Each point in this plot represents one word, with the X axis representing the rank of the word (from most frequent to least frequent), and the Y axis representing its frequency. The plot is drawn with both axes logarithmic. Thus, the top-left point represents the most common English word (which is “the”), and very rare words are on the right of the plot (showing that many words occur only once in the dataset).

An exact inverse proportional relationship, as described by Zipf, would look like a straight line with slope exactly minus one on this plot, because both axes are logarithmic. With this dataset, this is not the case (in particular in the top-left area). This is because the dataset was stripped of stop words such as “the”, “of”, etc., which are ignored by search engines and which happen to be the most common words in the English language. Therefore, the plot does not show a straight line, and we may be interested in the shape of this plot as a visualization of the distribution.
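Such a rank-frequency plot is easy to produce; here is a minimal sketch in Python (not from the original post; ‘corpus.txt’ is a placeholder for any plain-text corpus):

# Zipf-style rank-frequency plot of word frequencies.
from collections import Counter
import matplotlib.pyplot as plt

with open('corpus.txt') as f:
    counts = Counter(f.read().lower().split())

freqs = sorted(counts.values(), reverse=True)   # frequency of each word, by rank
ranks = range(1, len(freqs) + 1)

plt.loglog(ranks, freqs, '.')
plt.xlabel('rank')
plt.ylabel('frequency')
plt.savefig('zipf.png')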

The Distribution Plot

In terms of probability theory, the size, frequency, or number of followers follows a distribution, and we are thus interested in the shape of that distribution. The plot inspired by Zipf’s Law is thus often said to be a bad way of visualizing a distribution.  Instead, one may rather show the distribution plot, i.e., the plot showing the frequency of each value. For the English words example, this means a plot where the X axis shows word frequency values, and the Y axis shows the number of words having this frequency.  This results in the following plot:

[Figure: distribution plot of word frequencies in the Reuters dataset]

Again, we use a log-log plot. Unlike the previous plot, this one looks much more like a straight line. What is going on? In a log-log plot, a straight line (of the form log(y) = a log(x) + b) is in fact a power law (i.e., y = e^b x^a). Thus, the plot suggests that the number of words with frequency x is proportional to x^a, for some constant a, which equals about −1.6 in this case.  This type of relationship is called a power law, and the value −a is then called the power-law exponent. This plot type, however, is misleading. Since for each number x ≥ 100 only a few words have frequency x, the right half of the plot cannot be interpreted visually at all, and thus the plot effectively only visualizes the distribution of infrequent words.
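The distribution plot itself is just a frequency count of the frequencies; a minimal sketch (same placeholder corpus as above):

# Distribution plot: for each frequency value x, the number of words
# having exactly that frequency.
from collections import Counter
import numpy as np
import matplotlib.pyplot as plt

with open('corpus.txt') as f:
    freqs = list(Counter(f.read().lower().split()).values())

values, n_words = np.unique(freqs, return_counts=True)

plt.loglog(values, n_words, '.')
plt.xlabel('word frequency x')
plt.ylabel('number of words with frequency x')
plt.savefig('distribution.png')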

The Complementary Cumulative Distribution Plot

In order to also visualize words with uncommon (i.e., high) frequencies, a solution is to plot, for each x, not the probability that a word has frequency x (which is proportional to the number of words having that frequency), but the probability that a word has frequency greater than or equal to x. This is called the complementary cumulative distribution plot (it would be called simply the cumulative distribution plot if it showed the probability of the frequency being less than a given x).

[Figure: complementary cumulative distribution plot of word frequencies in the Reuters dataset]

Again, this plot is shown with both axes logarithmic, and again, it will show a straight line when the data follows a power law. However, this plot does show the full range of values on the X axis, and in fact we can see right away that this is not a power law, as the right half of the plot deviates significantly from a straight line, showing again that our dataset does not follow Zipf’s Law, because very common words were removed from it.

Is this plot therefore a better visualization of a skewed distribution than the simple distribution plot? Yes, because it visualizes the whole spectrum of possible values, even rare ones. Is this plot also better than the plot derived from Zipf’s analysis? No, because it is the same plot. In fact, Zipf’s plot and the complementary cumulative distribution plot are mirror images of each other, mirrored around the 45-degree diagonal going from the lower left to the upper right. This can be seen by noting that the rank of a word with frequency x equals the probability of a word having frequency ≥ x, multiplied by the total number of words. Thus both plots are identical, up to an exchange of the axes.
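The following sketch (again with the placeholder corpus) computes the complementary cumulative distribution directly; the last comment spells out the mirror relationship:

# Complementary cumulative distribution plot of word frequencies.
from collections import Counter
import numpy as np
import matplotlib.pyplot as plt

with open('corpus.txt') as f:
    freqs = list(Counter(f.read().lower().split()).values())

values, counts = np.unique(freqs, return_counts=True)   # distinct frequency values
n = counts.sum()                                        # total number of words
ccdf = (n - np.cumsum(counts) + counts) / n             # P(frequency >= x)

plt.loglog(values, ccdf, '.')
plt.xlabel('word frequency x')
plt.ylabel('P(frequency >= x)')
plt.savefig('ccdf.png')

# Mirror relationship: the rank of a word with frequency x is n * P(frequency >= x),
# so exchanging the two axes (up to the factor n) recovers the Zipf-style plot.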

The Long Tail

The confusion between the different types of plots is made worse by the ambiguity of the term “long tail”. In marketing, the long tail refers to the bulk of products which are bought only rarely, and thus are traditionally ignored by retailers, but from which online retailers can profit, since they don’t have the space constraints that traditional retailers have. Rare items are similar to infrequent words, and thus the term long tail describes the Zipf-type plot, in which infrequent items form a tail-like shape on the right of the figure. In the two other plots, however, infrequent items are shown in the upper left corner, and thus do not form a tail at all.

What may also confuse people is the use of the word tail in probability theory to characterize the ends of distributions. In probability theory, the expressions heavy-tailed distribution and fat-tailed distribution have specific meanings and refer to the tail as seen in the distribution plots, i.e., to very large values of a variable, and not to very small values as in the expression long tail. What is worse, the expression long-tailed distribution is sometimes also used in probability theory, and it too describes the behaviour of a distribution at large values of a variable.

The Lorenz Curve

To complete our little tour, I would like to mention the Lorenz curve, which shows the amount of text covered by the X% least frequent words. That is, the X axis goes from 0 to 100, and the Y axis shows the total amount of text made up by the X% least frequent words. The Lorenz curve is usually used to visualize wealth distributions, and is plotted without logarithmic scales. For our English words dataset, it looks like this:

[Figure: Lorenz curve of word frequencies in the Reuters dataset]

The area between the Lorenz curve (in blue) and the diagonal, multiplied by two, is the Gini coefficient, a measure of inequality used in economics. Since the plot does not use logarithmic axes, neither power-law distributions nor Zipf’s Law can be easily recognized on it.
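A Lorenz curve and the corresponding Gini coefficient can be computed as follows (a sketch with the same placeholder corpus; the Gini formula is the standard one for a sample sorted in increasing order):

# Lorenz curve and Gini coefficient of word frequencies.
from collections import Counter
import numpy as np
import matplotlib.pyplot as plt

with open('corpus.txt') as f:
    freqs = np.sort(np.array(list(Counter(f.read().lower().split()).values())))

n = len(freqs)
x = np.arange(1, n + 1) / n             # share of words, least frequent first
y = np.cumsum(freqs) / freqs.sum()      # share of the text they make up

plt.plot(x, y)                          # Lorenz curve
plt.plot([0, 1], [0, 1], '--')          # diagonal of perfect equality
plt.xlabel('share of least frequent words')
plt.ylabel('share of total text')
plt.savefig('lorenz.png')

# Gini coefficient: twice the area between the diagonal and the Lorenz curve.
i = np.arange(1, n + 1)
gini = 2 * np.sum(i * freqs) / (n * freqs.sum()) - (n + 1) / n
print(gini)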

Conclusion

There are many ways to visualize skewed distributions, and care must be taken when creating, reading, and interpreting such plots. We recommend the complementary cumulative distribution plot, as it is most aligned with probability theory, and is just as expressive as the Zipf-type plot. But in the end, remember that a plot is not a replacement for a proper statistical test, so if you want to assert that a dataset follows Zipf’s Law or a power law, a statistical test is needed. Based on my experiments with degree distributions in KONECT, my cautious guess is that there are almost no statistically significant power laws in real-world networks (and I can make no statement about other types of distributions, such as city sizes, etc.).