Stu – Build Automation for Data Mining Projects

Stu is a build tool developed at the University of Koblenz-Landau. In this article, I will explain what sets it apart from the myriad of other build tools, why I wrote it, and how we use it in various projects at the Institute for Web Science and Technologies.

What is Make?

Everyone has probably heard of Make, a standard Unix tool mainly used in software projects for automating the build process. Big software projects consist of many source files, which are compiled to many individual object files and then linked. Larger software projects may have an even more complex structure: for instance, they might autogenerate code, compile multiple versions of the software, or even have different compilation procedures depending on the platform on which they are built. As a result, building software is not an easy task, and many tools exist to make it easier. The quintessential tool in this category is undeniably Make, dating back to the 1970s as part of Unix. Make has since been standardized as part of the POSIX specification, and is ubiquitous. The best-known implementation today is probably GNU Make, as it is the default implementation on Linux.

Running Data Mining Projects

While everyone knows Make as a tool for building software, not everyone knows that it is also a completely generic build tool: it can be used to generate datasets, convert files, compile LaTeX documents, etc. For this reason, Make has been the tool I myself have used for years whenever I implemented data mining projects. I use the term “data mining project” in a broad sense here: any project in which data is manipulated. For instance, a typical data mining project may include the following tasks:

  • Download a dataset from the web
  • Uncompress it
  • Clean up the dataset
  • Convert to another format
  • Visualize the dataset
  • Compute features
  • Run a regression
  • Visualize the result of the regression

Most people who do data mining projects would probably write a single program for each step, in their favourite or in the most appropriate programming language, and then run these programs sequentially. If necessary, earlier steps can then be re-run by hand. While this is an acceptable way to go about things in data mining projects, it still leads to many undesirable situations. Sometimes, we find out that our dataset contains individual data points that were wrongly extracted. In this case, we should adapt our clean-up code and re-run all steps from that point on. Unfortunately, many data mining researchers do not remember the exact sequence of steps taken, and thus they simply do not perform the additional clean-up. A few erroneous data points will not spoil the end results, or so they think.

Common Problems Encountered in Data Mining Projects

In fact, the inability to re-execute previously performed steps is a very common phenomenon, and leads to many unfortunate situations: data clean-up is not performed as it should be, tests are not re-run properly, plots are not re-generated, etc. All these problems can be traced back to one root: the inability to remember which steps were actually taken in the past. It thus comes as no surprise that people turn, as I did, to tools like Make to drive their data mining projects. This is a perfectly good solution to a common problem, and thus we may see code like the following:

data-clean.txt:  data.txt cleanup
        ./cleanup <data.txt >data-clean.txt

This is a snippet from a hypothetical data mining project in which Make is used. In this example, a script “cleanup” is used to clean up the dataset, reading the data from the file “data.txt” and writing the cleaned up data into the file “data-clean.txt”. The snippet can be saved in a file “Makefile”, and then “make” can be executed on the command line. Make will then execute the given command, but only if necessary. In fact, Make is smart enough to check the timestamps of the given input and output files, and only rebuild the output file when the input files are newer. This feature of Make is very useful when many rules are combined. Changing a single file and running Make will then only rebuild what has to be rebuilt.
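As a hypothetical sketch of how this plays out when rules are combined (the plotting script and its filenames are invented for illustration), consider a second rule that consumes the cleaned data:

data-clean.txt:  data.txt cleanup
        ./cleanup <data.txt >data-clean.txt

plot.eps:  data-clean.txt plot
        ./plot <data-clean.txt >plot.eps

If “data.txt” or “cleanup” changes, running “make plot.eps” re-runs both commands; if only “plot” changes, only the second command is executed.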

When Make Is Not Enough

This usage of Make shows that Make is not only a source code compilation tool, but a generic build tool. The question then is, why is Make not enough? To give an answer, consider the following:

  • I need to run the same experiment on multiple datasets
  • I need to run multiple experiments on all datasets
  • I need to run the analysis using multiple subsets of all features
  • I may or may not filter out a subset of the data
  • I need to use multiple error measures

What do all these requirements have in common? The need to run things multiple times, parametrized by a variable. That variable can be the dataset chosen, the error measure chosen, or just a binary value denoting whether a certain extra step is performed or not. Most data mining projects go about this by implementing loops in whatever programming language is used, iterating over all possible values of the variables. For instance, we can easily write a shell script to clean up all datasets in our collection:

for language in en de fr pt es ja pl nl ; do 
        ./cleanup <data.$language.txt >data-clean.$language.txt 
done

This is a shell script that iterates over eight languages, and calls the clean-up program “cleanup” on each. This works well as long as clean-up is fast. If the datasets in question are Wikipedia datasets, for instance, then the files are likely to be huge, and clean-up slow. If our script is interrupted, it can only be restarted from the beginning, and will redo the clean-up for all languages, even those for which it has already been performed.

The solution to this problem is to declare each language’s clean-up in Make. For instance, we may write:

data-clean.en.txt:  data.en.txt cleanup
        ./cleanup <data.en.txt >data-clean.en.txt
data-clean.de.txt:  data.de.txt cleanup
        ./cleanup <data.de.txt >data-clean.de.txt
...

and so on for all eight languages.

This is clearly not scalable. Instead, we may use GNU Make’s pattern rules, which allow us to write:

data-clean.%.txt:  data.%.txt cleanup
        ./cleanup <data.$*.txt >data-clean.$*.txt

This is much clearer, but still leaves something to be desired: how do we go about having multiple parameters, e.g. the language and the subset of features? This can be solved by generating the Makefile code automatically. As we have multiple parameters, we then have to generate code for all possible combinations, leading to very large generated files, and thus a long startup time for Make, which has to parse it all. Alternatively, we can use the following type of code. This is a real-world example lifted from the KONECT project, from before it switched to Stu:

define TEMPLATE_decomposition_all
decomposition_full.$(1).all: $(foreach NETWORK, $(NETWORKS), \
   decomposition_full.$(1).$(NETWORK)) 
diagonality.$(1).all: $(foreach NETWORK, $(NETWORKS), \
   diagonality.$(1).$(NETWORK))

decomposition_time.full.$(1).all: $(foreach NETWORK, $(NETWORKS_TIME), \
   decomposition_time.full.$(1).$(NETWORK))
$(foreach NETWORK, $(NETWORKS_TIME), \
   decomposition_time.full.$(1).$(NETWORK)): \
   decomposition_time.full.$(1).%: \
   plot/decomposition_time.full.a.$(1).%.eps
decomposition_time.split.$(1).all: $(foreach NETWORK, $(NETWORKS_TIME), \
   decomposition_time.split.$(1).$(NETWORK))
$(foreach NETWORK, $(NETWORKS_TIME), \
   decomposition_time.split.$(1).$(NETWORK)): \
   decomposition_time.split.$(1).%: \
   plot/decomposition_time.split.a.$(1).%.eps 
endef
$(foreach DECOMPOSITION, $(DECOMPOSITIONS_ANY), \
   $(eval $(call TEMPLATE_decomposition_all,$(DECOMPOSITION))))

I am not going to explain this code. Suffice it to say that writing such code for all cases is not the way to go. (If you want to know, this is just a small snippet from the code that generates decompositions of characteristic graph matrices of networks in KONECT. Some parameters appear as GNU Make’s %, some as function parameters $(1), etc.)

A Tool That Allows Parameters in Rules

Make’s inability to properly handle multiply-parametrized rules prompted my search for a tool that can. Fortunately, the landscape of build tools is very large. Unfortunately, virtually all of them are designed around radically different requirements:

  • Almost all are made for compiling software, not executing arbitrary commands. The few that allow arbitrary commands, such as Make, fail to allow parametrized dependencies.
  • Many are specific to a particular programming language or environment, assuming that everything will be done with it. This fails to account for the fact that any data mining project of even modest size will need to execute individual steps in multiple programming languages.

At the end of my year-long search, only the tool Cook came close to fulfilling all requirements, and even then only by resorting to some unintended hacks.

The curious reader may consult Wikipedia’s List of build automation software to get an overview of the field.

Only Two Features

Having found no satisfying build tool, I realized that I only needed two features in addition to Make’s feature set:

  • An arbitrary number of named parameters in rules
  • The ability to generate the list of dependencies dynamically

Thus, I set out to write a build tool having these two features. In what later became Stu, the data clean-up example would be written in the following way:

@all:  [dep];

LANGS { 
        echo >LANGS en de fr pt es ja pl nl 
}

dep: LANGS { 
        for lang in $(cat LANGS) ; do 
                echo data-clean.$lang.txt 
        done >dep 
}

data-clean.$lang.txt:  data.$lang.txt cleanup {
        ./cleanup <data.$lang.txt >data-clean.$lang.txt
}

This snippet of Stu code shows the following two features:

  • Parameters are written using ‘$’, like shell variables. They have the same syntax within filenames and in the shell command.
  • The brackets ‘[…]’ denote that dependencies are read from the content of the given file. Stu will first make sure the file is generated, and then parse the file.

These two features are the essence of Stu. They set it apart from other build tools, and are the reason I wrote it.
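To illustrate how this addresses the multi-parameter problem from earlier, here is a hypothetical sketch of a single rule with two named parameters, the language and a feature subset; the filenames and the “--features” option of the clean-up script are invented for illustration:

data-clean.$lang.$subset.txt:  data.$lang.txt features.$subset.txt cleanup {
        ./cleanup --features features.$subset.txt \
                <data.$lang.txt >data-clean.$lang.$subset.txt
}

Both parameters are matched against the requested target name; no pattern-rule tricks and no generated Makefile code are needed.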

Other Features

In fact, the example in the previous section also shows another very convenient idiom used very often in Stu files: the list of languages is itself a file, instead of a variable as it would be in Make. This has the advantage that we can add a new language and all necessary steps will be executed automatically. Also, variables in Make are usually a scalability problem: they are regenerated at each invocation of Make. Even though the main purpose of Make is to avoid unnecessary executions, any variable (whose content often depends on other variables, such as the list of languages) is regenerated every time.

In fact, Stu initially also had variables like Make, but I removed the feature once I noticed that they are unnecessary, as files are always better suited for what they do. As a matter of fact, I sometimes think of Stu as a programming language in which the variables are files.

(Note: that was in version 0. Now, in version 1.9, features are no longer removed, in order to keep backward compatibility.)

Other Facts About Stu

Like many other Make replacements, Stu corrects the typical shortcomings of Make:

  • Error messages are much better (Make has the dreaded “missing separator”); Stu gives full traces like a proper compiler.
  • Stu has a proper tokenisation pass, instead of Make’s variable substitution syntax, avoiding many quoting and escaping problems.
  • Stu catches typical Makefile errors such as dependencies that were not built.
  • Stu has better support for interrupting large builds (Make will often hang or leave processes running).
  • Stu avoids the “inner platform” antipattern present in Make, in which a lot of shell functionality is duplicated in Make functions. Stu encourages all program logic to be implemented in rules, i.e. using a proper shell.
  • Stu supports additional types of dependencies (existence-only dependencies with the prefix ‘!’, optional dependencies with ‘?’, and trivial dependencies with ‘&’). These can only be emulated partially with Make by using unwieldy constructs (see the sketch after this list).
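As an example of the last point, here is a hypothetical sketch of an existence-only dependency on an output directory: only the directory’s existence matters, so the rule is not re-run merely because the directory’s timestamp changes whenever a file is added to it. The filenames are again invented for illustration:

output/data-clean.$lang.txt:  data.$lang.txt cleanup !output {
        ./cleanup <data.$lang.txt >output/data-clean.$lang.txt
}

output {
        mkdir -p output
}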

Other design considerations of Stu include:

  • Focus on data mining rather than software compilation as the use case. There are no built-in rules for compilation or other specific applications; use case-specific rules are instead written in the Stu language itself, so Stu itself remains use case-agnostic.
  • Files are the central datatype. Everything is a file. Think of Stu as “a declarative programming language in which all variables are files.” For instance, Stu has no variables like Make; instead, files are used.
  • Files are sacred: Never make the user delete files in order to rebuild things.
  • Do one thing well: We don’t include features such as file compression that can be achieved by other tools from within the shell commands.
  • Embrace POSIX as an underlying standard. Use the Bourne shell as the underlying command interpreter. Don’t try to create a purportedly portable layer on top of it, as POSIX already is a portability layer. Also, don’t try to create a new portable language for executing commands, as /bin/sh already is one.
  • Keep it simple: Don’t use fancy libraries or hip programming languages. Stu is written in plain C++11 with only standard libraries.
  • Have extensive unit test coverage. All published versions pass 100% of unit tests. Stu has 500+ unit tests. All features and error paths are unit tested.
  • Stability of the interface: We follow Semantic Versioning (semver.org) in order to provide stable syntax and semantics. Stu files written now will still work in the future.

Using Stu In Your Project

Instead of having a “Makefile”, have a “main.stu” file. Instead of calling “make”, call “stu”.
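A minimal session might then look as follows; the “@all” target is taken from the example above, and the exact invocation is documented in the manpage:

$ stu           # run Stu with the rules from “main.stu” in the current directory
$ stu @all      # explicitly request a particular target, here ‘@all’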

To see an example of a large and complex Stu file, see the Stu file of the KONECT project.

Documentation is given in Stu’s manpage.

To use Stu, get it on GitHub and compile it yourself. Stu is written in standard C++11.

Stu has an extensive set of unit tests, and is quite stable. The specification of Stu strictly follows Semantic Versioning, and therefore Stu files will remain compatible in the future.

The name “Stu” follows the precedents of Make replacements Cook and Bake in referring to kitchen-related verbs, and also honours the author of the original Unix Make, Stuart Feldman.

Stu is free software, placed under the GNU General Public License, Version 3.

Stu was written by Jérôme Kunegis.
