An Update on the Stu Build System (Stu 2.5)

If you are reading my blog, you are probably aware that I often advertise the Stu build system.  This is a piece of software that we use at the University of Namur to run the KONECT project, as well as many other things.

Version 2.5 of Stu was released a few days ago, and I want to take this opportunity to come back to my previous post about Stu, in which I showed a short example of how to use it.  Now, with version 2.5 available, that example has become even shorter and easier to write.  Version 2.5 is also the first release that implements all three of Stu’s core features. But before I tell you what these three features are, here’s the updated example:

Example

@all: data-clean.[LANGS].txt;

LANGS = { en de fr pt es ja pl nl }

>data-clean.$lang.txt: <data.$lang.txt cleanup 
{
       ./cleanup
}

What does this do?  This is a Stu script to clean up your datasets, i.e. the content of a file ‘main.stu’ in your directory. You would then just call ‘stu’ on the command line from within that directory. In practice, this would be part of a larger Stu script in a data mining project that does much more, such as running the actual experiments, plotting, etc.  In this short snippet, we are only cleaning up our datasets. The details don’t matter for this example, but many data mining projects need to clean up their data as a first step, for instance to remove spam entries from an email dataset.
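
For concreteness, a session could look like this (a sketch; the -j option is explained further below):

$ stu          # with no arguments, Stu builds the default target, here ‘@all’
$ stu -j 4     # run up to four commands in parallel, as in Make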

In this example, we have not just one dataset, but multiple datasets, corresponding to multiple languages, as we would have in a study of Wikipedia, for instance.  We then use the program ‘./cleanup’ to clean up each dataset.

A Stu script consists of a set of rules, each of which tells Stu how to generate one or more files.  The example consists of three rules.

Let’s start at the bottom:  the last four lines are a rule for the target named ‘data-clean.$lang.txt’.  This is a parametrized rule:  The actual filename can contain any non-empty string in place of ‘$lang’. In this example, $lang will be the language code, i.e. ‘fr’ for French, ‘en’ for English, ‘nl’ for Dutch, etc.

The filenames after the colon (‘:’) are the dependencies.  They tell Stu which files need to be present before the target can be built.  In the case of the last rule shown above, there are two dependencies:  the file ‘data.$lang.txt’ (in our example, the not-yet-cleaned data) and the file ‘cleanup’ (the program we use for cleaning up a file).

The actual command to be executed is then given by the block in the last three lines.  It is run by the shell and can do anything – in this example, it calls another program.  Since the rule is parametrized (with the parameter $lang), Stu executes the command with the environment variable $lang set to the value matched in the target name. As a result, the ./cleanup program can simply use getenv("lang"), or its equivalent in any programming language, to access the value of the parameter.

Another feature of Stu we can see here is redirection, written as ‘<’ and ‘>’.  These are used to redirect the standard input of the command (with ‘<’) and the standard output of the command (with ‘>’). As a result, our program ./cleanup does not need to access any file directly; it can simply do its filtering from stdin to stdout.
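
To make this concrete, here is a minimal sketch of what the ‘cleanup’ program could look like, assuming it is written as a shell script; the actual filtering commands are purely illustrative:

#!/bin/sh
# Hypothetical cleanup filter:  Stu connects the raw dataset to stdin and
# the cleaned dataset to stdout via ‘<’ and ‘>’ in the Stu script.  The
# matched parameter is available as the environment variable $lang, e.g.
# to select a language-specific spam filter.
grep -v '^#' | sort -u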

But let’s look at the other two rules in our example.  The rule

LANGS = { en de fr pt es ja pl nl }

is what we call a hardcoded rule.  It simply assigns predefined content to a file. In this case, we use that file to store the list of languages in our experiments.

The first rule is a little more interesting now:

@all: data-clean.[LANGS].txt;

It contains brackets (‘[’ and ‘]’), which I will explain now.  Brackets in Stu mean that a file (in this case ‘LANGS’) should be built, and that its contents should then be inserted in place of the bracketed name.  In this case, since the file ‘LANGS’ contains multiple names (all the language codes), the filename ‘data-clean.[LANGS].txt’ stands for multiple filenames, one for each language code.  As a result, the target ‘@all’ will have as dependencies all files of the form ‘data-clean.$lang.txt’, where $lang is a language code taken from the file ‘LANGS’.
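
With the language codes given above, the first rule thus behaves as if we had written out all the dependencies by hand:

@all:   data-clean.en.txt  data-clean.de.txt  data-clean.fr.txt  data-clean.pt.txt
        data-clean.es.txt  data-clean.ja.txt  data-clean.pl.txt  data-clean.nl.txt;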

Finally, the symbol ‘@’ denotes a so-called transient target, i.e. a target that is not a file but is simply represented as a name within Stu.  Since the target @all is not a file, there is no command to build it, and we therefore write ‘;’ instead of a command.

Comparison with Make

Readers who know the tool Make will now draw comparisons to it. If Stu were just a rehash of Make, I would not have written it, so what are the differences?  In short, these are the three core features of Stu:

  • Parametrized rules:  Like $lang in the example, any rule can be parametrized.  This can be partially achieved in some implementations of Make (such as GNU Make) by using the ‘%’ symbol, but crucially, that does not allow multiple parameters. In the KONECT project, we use up to four parameters in perfectly reasonable use cases; a hypothetical two-parameter rule is sketched after this list.
  • Dynamic dependencies: These are written using brackets (‘[’ and ‘]’) and mean that the dependency file is first built itself, and that its contents are then parsed for further dependencies.
  • Concatenation: I haven’t mentioned concatenation yet, but we have used it in the example.  Simply by writing ‘data-clean.[LANGS].txt’, the string ‘data-clean.’ will be prepended to all names in ‘LANGS’, and ‘.txt’ will be appended to them.
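
To illustrate the first point, here is a hypothetical rule with two parameters; the filenames and the ‘mkplot’ program are made up for this sketch:

plot.$type.$lang.eps:  data-clean.$lang.txt  mkplot
{
        # Both $type and $lang are matched from the target name and are
        # passed to the command as environment variables.
        ./mkplot <data-clean."$lang".txt >plot."$type"."$lang".eps
}

A single rule of this form covers every combination of plot type and language – exactly the kind of pattern that a single ‘%’ in GNU Make cannot express.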

These are the three core features of Stu, and they set it apart from virtually all other such tools. In fact, I initially wrote Stu in order to have the first two features available. The third feature came later, and now that it is implemented in Stu 2.5, it already feels like second nature to use it.  As a general rule, the three features interact beautifully to give very concise and easy-to-understand scripts.  This is the raison d’être of Stu.

Comparison with Other Build Tools

There are many build tools in existence, and some of the features of Stu can be found in other such tools.  The combination of the three core features is, however, unique.  In fact, I had been using Make and Make replacements for years, for KONECT and other projects, always looking for ways to express nicely the complex dependencies that arise in data mining projects.  Finally, I started writing Stu because no existing tool was adequate.  In short, almost all Make replacements fail for me because (1) they are specific to a certain programming language, and (2) they are specific to the task of software building.  For data mining projects, this is not adequate.  Those tools that are generic enough do not implement the three core features.

By the way, I have been using the Cook tool by the late Peter Miller for years.  Among all Make replacements, it is probably the one that comes closest to Stu.

Stu inherits the tradition of Make replacements having cooking-related names (brew, chef, cook, etc.), and at the same time honors the author and inventor of the original UNIX Make, Stuart Feldman.

Design Considerations

The design considerations of Stu are:

  • Genericity: In many projects, the same rule has to be executed over and over again with varying parameters. This is particularly true in data mining / data science / machine learning and related areas, but also applies to simply compiling programs with varying compiler options, etc. Being able to do this using a clean syntax and friendly semantics is the main motivation of Stu, and it is where virtually all other Make replacements fail: most of them force the user to write loops or similar constructs.
  • Generality: Don’t focus on a particular use case such as compilation, but be a generic build tool. There are no built-in rules for compilation or other specific applications. Instead, allow use case-specific rules to be written in the Stu language itself.  Most Make replacement tools instead focus on one specific use-case (almost
    always compilation), making them unsuitable for general use.
  • Files are the central datatype. You can think of Stu as “a declarative programming language in which all variables are files.” For instance, Stu has no variables like Make’s; instead, files are used. Other Make replacements are even worse than Make in this regard, and allow arbitrary variables in whatever programming language they are based on. Stu follows the UNIX principle that any persistent object should have a name in the file system – “everything is a file.”
  • Scalability: Assume that projects are so large that you can’t just clean and rebuild everything if there are build inconsistencies.  Files are sacred; never make the user delete files in order to rebuild things.
  • Simplicity: Do one thing well. We don’t include features such as file compression that can be achieved by other tools from within shell commands. Lists of files and dependencies are themselves targets that are built using shell commands, so any external software can be used to define them, without any special support needed from Stu (see the sketch after this list). Too many Make replacements try to “avoid the shell” and include every possible transformation in the tool itself, effectively amassing dozens of unnecessary dependencies and creating an ad-hoc language that is less well-defined, and less portable, than the shell.
  • Debuggability: Like programs written in any programming language, Stu scripts will contain errors. Stu makes it easy to detect and correct errors by having much better error messages than Make. This is achieved by (1) having a proper syntax based on tokenization (rather than Make’s text replacement rules), and (2) having “compiler-grade” error messages, showing not only what went wrong, but how Stu got there. Anyone who has ever wondered why a certain Make rule was executed (or not) will know the value of this.
  • Portability: Embrace POSIX as the underlying standard. Use the shell as the underlying command interpreter. Don’t try to create a purportedly portable layer on top of it, as POSIX already is a portability layer. Also, don’t try to create a new portable language for executing commands, as /bin/sh already is one. Furthermore, don’t use fancy libraries or hip programming languages. Stu is written in plain C++11 with only standard libraries. Many other Make replacements are based on specific programming languages as their “base standard”, effectively limiting their use to that language and thus preventing projects from using multiple programming languages. Others do even worse and create their own mini-language, invariably less portable and more buggy than the shell.
  • Reliability: Stu has extensive unit test coverage, with more than 1,000 tests. All published versions pass 100% of these tests. All language features and error paths are unit tested.
  • Stability: We follow Semantic Versioning in order to provide syntax and semantics that are stable over time.  Stu files written now will still work in the future.
  • Familiarity: Stu follows the conventions of Make and of the shell as much as possible, to ease the switch from Make to Stu. For instance, the options -j and -k work like in Make. Also, Stu source can be edited with syntax highlighting for the shell, as the two syntaxes are very similar.
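
As an example of the simplicity principle mentioned above, the list of languages from the first example does not have to be hardcoded; it can itself be built by a shell command. The following variant is a sketch, with a purely illustrative ‘ls’/‘sed’ pipeline:

>LANGS:
{
        # Derive the language list from the raw data files that are present.
        ls data.*.txt | sed -e 's/^data\.//' -e 's/\.txt$//'
}

The ‘@all’ rule from the example can then be used unchanged, since Stu only cares that the file ‘LANGS’ exists and contains the list of names.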

That’s All Good!  How Can I Help?

If you want to use Stu – go ahead.  Stu is available on GitHub under the GPLv3 license.  It can be compiled like any program using the “configure–make–install” trio (see the file ‘INSTALL’).  I develop Stu on Linux, and also use it on MacOS.   We’ve even had people compile it on Windows.  If you try it, I will be happy to hear about your experiences, and happy to receive bug reports.
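
Assuming a standard environment with a C++11 compiler, the usual sequence is the one described in the file ‘INSTALL’:

./configure
make
make install     # may require root privileges, or a prefix you can write to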

If you want to see Stu in action, have a look at the Stu script of KONECT.  It controls the computations of all statistics, plots and other experiments in the KONECT project, and runs essentially continuously at the University of Namur.  In practice, I also use Stu for all side projects, generating my CV, writing papers, etc.

In the coming months, I will write a blog entry about common anti-patterns in data science, and how they can be avoided using Stu.

Big Table of Binary Fluorine Compounds

While not as common as, for instance, oxygen, the element fluorine (F) reacts and forms compounds with almost all other elements.  In fact, fluorine is the most electronegative element (except for the unreactive neon on one particular electronegativity scale), and this gives it particular properties:  the pure element, the gas F₂, is extremely reactive, so much so that it will react with almost all materials except a few noble gases (such as neon), certain other unreactive substances (such as nitrogen at standard conditions), and compounds that already contain fluorine (such as Teflon).  In fact, the history of fluorine is highly interesting in itself, having claimed the life of more than one chemist who tried to isolate the pure element.

The following table gives an overview of the binary compounds of fluorine, i.e. compounds that contain the element fluorine (F) and exactly one other element.  The compounds are arranged by oxidation state and stoichiometry, showing that many elements form multiple compounds with fluorine.  Since fluorine is more electronegative than all elements with which it forms binary compounds, almost all of these compounds are called “fluorides”, the exception being fluorine azide (FN₃).

[Table: binary fluorine compounds]

Download PDF

  • The table contains links to the corresponding Wikipedia articles, or to research papers where a Wikipedia article does not exist yet.
  • The compounds are coloured according to their type:  molecular compounds in blue, ionic compounds in black, etc.  The exact legend is given within the plot.
  • As a bonus, we also include the binary fluorine minerals.

Please post additions and corrections here.

UPDATE:  Version 3

How to Pronounce German Vowels

The German language may be known for its many consonants, but what most learners actually have trouble with is the vowels.

The following chart gives a fairly thorough account of the pronunciation of German vowels, in the form of a flowchart.  It will help you determine how to pronounce the vowels in a given word.

[Chart: how to pronounce German vowels]

Download chart as PDF

UPDATE:  This is version 5 of the chart, as of 2017-02-17.

Please comment here with suggestions, corrections, etc.

A Mathematical Riddle

I’ll just place this here:

[Image: the riddle expression]

You have to compute this expression.

This riddle was composed a few months ago for a competition, but in the end we decided not to use it.  Actually, it is easier than it seems:  you don’t need any calculator or computer to solve it; a piece of paper (or two) should be enough.  We wrote the riddle in such a way that there are several shortcuts that can be used to solve it, but if you don’t find the shortcuts, you can still solve it the obvious way – it just takes a little longer.

Also, you have to find the exact answer, not a numerical approximation.

I’m not posting the answer here for now.  If you’re curious whether your answer is correct, post below.


No Hairball – The Graph Drawing Experiment

→ QUICKLINK TO THE EXPERIMENT

Many graph drawings look like a hairball.

The larger a network is, the harder it is to visualize it.  Most graph drawing algorithms produce a giant “hairball”, in which nodes and edges are hopelessly mixed up, leaving no way to discern any structure whatsoever. Here is an example:

[Figure: a typical “hairball” graph drawing]

This is from one of my own papers (WWW 2009), so I should know what I am talking about. Nowadays, I wouldn’t put such a picture in a paper, let alone on the first page as I did then.  Many papers, however, still contain such graphics.

Can we learn anything from this drawing?  No.  There are no communities visible.  No clustering is apparent.  I cannot even tell whether the graph is bipartite.  In fact, I cannot even estimate the size of the graph from this picture.

So why do we keep putting hairballs in our papers?  Maybe because they give us the illusion of insight into a complex network.  Yes, we would like to understand whether a graph displays clustering, bipartivity, assortativity, dissortativity, skewed degree distributions, and a myriad of other interesting properties that complex networks can have. What better way to visualise these features than by drawing the graph?  Isn’t every visualisation also a drawing?

No, visualisations are not necessarily drawings.  We don’t need to draw a graph in order to visualise it.  In fact, the mere fact that we try to draw the entirety of a network in a small space is what leads to the hairball problem to begin with: hundreds of thousands, millions or even more nodes and edges are crammed into a space of only a few centimetres.  It is no wonder we don’t see anything in a hairball graph:  each node and each edge is allocated a space on the order of a micrometre or less – much too small to be shown on computer displays or printed on paper, let alone seen by the human eye.

What then, can we do to avoid hairballs?  Several ideas have been tried:

  • Show only a subset of the network.  This is called sampling.  In its simplest incarnation, choose a random subset of the nodes, to get a subgraph of the real graph of manageable size.  Then, draw that subgraph.
  • Aggregate nearby nodes into single nodes.  This is called coalescing, and also produces smaller graphs, which are then drawn with the usual methods.
  • Allow the user to zoom in, using interactive software.  This is nice, but doesn’t give any more insight into the overall properties of the network.

These methods are all suited to particular use cases.  But for visualising overall properties of a graph, these methods fail.  What do these methods all have in common?  They assume that to visualise a graph, you have to draw the graph.  The question thus becomes, Can we go further?  Can we visualise a graph without drawing it?  Almost.

In the experiment we are performing, we don’t draw the graph to be visualised.  Instead, we draw another graph – a much smaller one – which is representative of our graph.  In fact, we throw away all nodes and edges of the input.  Only graph statistics such as the assortativity, the bipartivity, and others are kept, and a completely new graph is created, specifically for visualisation.  Most graph/network researchers will now ask, How can I see individual nodes and edges of the graph?  The answer is that you can’t; that is the point of the method.  We sacrifice local information in the graph in order to make global properties more apparent.

The method is in development, and a paper is upcoming, but not public yet.

We are now performing an experiment to find out how to do this graph visualisation optimally:

TAKE PART IN THE EXPERIMENT

This is an interactive experiment that will ask you to look at graphs, and to answer yes/no questions. In fact, you only have to click on graphs to answer.  However:  You must be knowledgeable in graphs and networks to participate.

This will help our research.  If you want to keep in touch with the results of the experiments, write to <pkumar@uni-koblenz.de>.

If you’re interested in graph visualisation, check out the KONECT project; it has lots of graph visualisation methods based on matrix decompositions.


Which Elements are Named After Countries?

This week, the names of four newly discovered chemical elements were announced:

  • Nihonium (Nh) for element 113
  • Moscovium (Mc) for element 115
  • Tennessine (Ts) for element 117
  • Oganesson (Og) for element 118

These names are open for revision for six months. Barring some unexpected development, they will go down in the chemistry books – and more to the point, the nuclear physics books – as the names of the elements that complete Period 7 of the periodic table.

Of these four, one name refers to a country (Japan), one to a region (Tennessee), one to a city (Moscow), and one to a person (Yuri Oganessian). All four names were chosen by the discoverers, and honour the discoverers. While the discovering teams have indeed done tremendous work and certainly deserve the honour, I find that type of naming quite unfortunate, and even downright selfish.

This has not always been the case. Most elements were not named for selfish reasons. Some elements were named for their properties, such as hydrogen, oxygen, and nitrogen. Some for materials from which they were extracted: beryllium from beryl, aluminium from alum, calcium from calx, sodium from soda, potassium from potash, boron from borax, silicon from silex, and lithium from stones.  Many were named for the colours of their compounds: chlorine is green, bismuth is white, rubidium is red, caesium is blue-grey, gold is yellow, indium is indigo, iodine is violet, rhodium is pink, thallium is green, chromium is colourful, and iridium has the colour of the rainbow. I find these highly poetical. Not to be outdone, some elements were named after astronomical objects: uranium, neptunium, and plutonium for Uranus, Neptune, and Pluto; cerium after Ceres; selenium after the Moon; and tellurium after the Earth.

In more modern times, many radioactive elements were named for their radioactivity: actinium and radium produce rays, astatine is unstable, technetium is artificial. Others were named for various other properties: bromine and osmium smell, argon is inactive, barium is heavy, neon is new, xenon is foreign, phosphorus carries light, krypton and lanthanum are hidden, and dysprosium is hard to get.

Newer elements, however, are not named for their properties, but are simply given names meant to honour their discoverers.  All non-black cells in the following periodic table are elements whose names can be traced back to a country:


Periodic table of elements by country referred to in the element name.

The last row, Period 7, is almost completely coloured.

I have made some simplifications in the table:  I count Rutherford as British (by his career), and the Rhine as German. Also, the Curies are Polish here.

Let’s look at the country-based names:  Some are quite innocent, like magnesium and manganese, named indirectly after a Greek region called Magnesia. Copper gets its name from the island of Cyprus, while beryllium and strontium get their names indirectly from places in India and Scotland.

In other cases, chemists have tried to honour their country, such as France (gallium and francium), Germany (germanium), and Russia (ruthenium). Note how nicely gallium and germanium are placed side by side. Also note that while europium is named for a deity, the nearby americium was named consciously after the United States.

The Scandinavian countries have the largest share of elements named after them. Famously, a whopping four elements are named after the tiny Swedish village of Ytterby: ytterbium, yttrium, erbium, and terbium. Scandium is named for Scandinavia, thulium for a mythical Scandinavian region.  Holmium and hafnium are named for Stockholm and Copenhagen, respectively. Nobelium and bohrium are named after a Swedish and a Danish person, respectively.

Since World War II, newly discovered elements have been named simply after people, or after the countries or regions of discovery. In fact, there has been a dispute between American and Soviet physicists about naming priorities. As a result, we don’t have an element named kurchatovium, even though I learned about it in school. With the four new names, that particular dispute should be considered settled at a score of 7–7, although I’m sure some people on either side of the debate still hope for more.

While names such as einsteinium, fermium, copernicium, and meitnerium seem fair, given that those persons were already dead and unrelated to the particular discoveries, it seems to me quite preposterous to name elements after living people who are involved in the research.  This was the case for seaborgium and the newly named (and not yet finalised) oganesson. With all respect for these researchers, I doubt that Glenn T. Seaborg and Yuri Oganessian will go down in history in the same way as Copernicus and Mendeleev. Or Einstein and Meitner. Or Curie and Rutherford.

Also, it seems that the responsible discoverers are no longer trying to follow Latin nomenclature:  “nihonium” deserves to be called japonium. That would also give the much more memorable symbol Jp, coinciding with Japan’s country code. If I were responsible for bettering Japan’s standing in the world, I would lobby for “Jp”. On the other hand, moscovium sounds very appropriate to me, and I like that tennessine and oganesson use the suffixes of the halogens and the noble gases, even if I doubt their corresponding chemistry is going to be characterised anytime soon.

It seems that the last element named after a concept was plutonium – after the last planet to have been discovered at the time, and also after a god. The current research teams would do well to come back to naming new elements after actual properties of the material, or at least to find metaphors relating to planets and deities, rather than using names as a vehicle for advertising their labs and principal investigators. Let’s hope that if elements of Period 8 can ever be synthesized, we will go back to a more classical naming culture.

Index of Complex Networks – The “Google for Networks” by University of Colorado Boulder

This week at the Conference on Network Science (NetSci 2016), Aaron Clauset from the University of Colorado Boulder unveiled the Index of Complex Networks, assembling information about thousands of network datasets available online. They have about 3500 entries at the moment, with many more to come in the near future.

This news is a big deal in the network science community. Researchers in the field of Network Science are (quite obviously) using network datasets on a daily basis, and anything that helps them get their hands on more of them is good news. Until now, there have been several sites collecting network datasets, such as SNAP from Jure Leskovec’s group at Stanford, and KONECT, written by me at the University in Koblenz.

The new ICON website takes the community one step further by making available not the datasets themselves, but an index of them: a listing of datasets available on the web, the largest share of which comes from KONECT, but also from SNAP and countless other sources that publish network datasets.

What does this mean for the field of Network Science? It means we will be able to verify our claims not on individual datasets, or on a few dozen of them, but on thousands. This is worth spelling out: with thousands of datasets, we can finally investigate the statistical significance of claims. We may finally find out whether social networks are really scale-free. We may finally find out whether social networks really have so many more triangles than other networks, as traditionally claimed in social network studies. In fact, we already have first results based on the ICON data: networks as a whole are not scale-free. In other words, degree distributions are not power laws in a statistical sense. This contradicts many claims made in the field.

I expect this data to lead to many new insights. In particular, many “well-known” facts about networks will have to be revised. I have to say, I’m very much looking forward to starting this new chapter of Network Science.

The Index of Complex Networks (ICON) is here: https://icon.colorado.edu/

KONECT (The Koblenz Network Collection) is here: http://konect.uni-koblenz.de/