An Update on the Stu Build System (Stu 2.5)

If you are reading my blog, you are probably aware that I often advertise the Stu build system.  This is a piece of software that we use at the University of Namur to run the KONECT project, as well as many other things.

Version 2.5 of Stu was released a few days ago, and I want to take this opportunity to come back to my previous post about Stu. In that post, I showed a short example of how to use Stu.  Now, with version 2.5 available, that example has become even shorter and easier to write.  Version 2.5 is the first version of Stu that implements all three of its core features. But before I tell you what these three features are, here’s the updated example:

Example

@all: data-clean.[LANGS].txt;

LANGS = { en de fr pt es ja pl nl }

>data-clean.$lang.txt: <data.$lang.txt cleanup 
{
       ./cleanup
}

What does this do?  This is a Stu script to clean up your datasets, i.e., the content of a file ‘main.stu’ in your directory. You would then just call ‘stu’ on the command line from within that directory. In practice, this would be part of a larger Stu script in a data mining project that does much more, such as running the actual experiments, plotting, etc.  In this short snippet, we are only cleaning up our datasets. The details don’t matter for this example, but many data mining projects need to clean up their data as a first step, for instance to remove spam entries from an email dataset.

In this example, we have not just one dataset, but multiple datasets, corresponding to multiple languages, as we would have in a study of Wikipedia for instance.  We use the program ‘./cleanup’ to clean up each dataset.

A Stu script consists of a set of rules, in which each rule tells Stu how to generate one or more files.  The example consists of three rules.

Let’s start at the bottom:  the last four lines are a rule for the target named ‘data-clean.$lang.txt’.  This is a parametrized rule:  The actual filename can contain any non-empty string in place of ‘$lang’. In this example, $lang will be the language code, i.e. ‘fr’ for French, ‘en’ for English, ‘nl’ for Dutch, etc.
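To make the matching of a parametrized target concrete, here is a minimal sketch in Python.  It is not how Stu is implemented (Stu is written in C++, and its exact matching rules may differ, in particular when several parameters are involved); the function name match_target is made up for this illustration.  It only shows the idea that ‘$lang’ stands for any non-empty string:

```python
import re

def match_target(pattern, filename):
    """Return a dict of parameter values if filename matches pattern, else None.

    Illustrative only: a hypothetical helper, not part of Stu.
    """
    # Split the pattern into literal text (even indices) and
    # parameter names such as 'lang' (odd indices).
    parts = re.split(r"\$([A-Za-z_][A-Za-z0-9_]*)", pattern)
    regex = ""
    for i, part in enumerate(parts):
        if i % 2 == 0:
            regex += re.escape(part)        # literal text must match exactly
        else:
            regex += "(?P<%s>.+)" % part    # a parameter matches any non-empty string
    m = re.fullmatch(regex, filename)
    return m.groupdict() if m else None

print(match_target("data-clean.$lang.txt", "data-clean.fr.txt"))  # {'lang': 'fr'}
print(match_target("data-clean.$lang.txt", "data-clean..txt"))    # None
```

The second call returns None because the parameter would have to be the empty string, which is not allowed.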

The filenames listed after the colon (‘:’) are the dependencies.  They tell Stu which files need to be present before the target can be built.  In the case of the last rule shown above, there are two dependencies:  the file ‘data.$lang.txt’ (in our example, the non-cleaned up data), and the file ‘cleanup’ (the program we use for cleaning up a file).

The actual command to be performed is then given by the block in the last three lines.  It is executed by the shell and can do anything, such as calling another program, as in this example.  Since the rule is parametrized (with parameter $lang), Stu will execute the command with the environment variable $lang set to the value matched in the target name. As a result, the ./cleanup program can simply use getenv("lang") or its equivalent in any programming language to access the value of the parameter.

Another feature of Stu we can see here is redirection, written as ‘<’ and ‘>’.  These are used to redirect the standard input of the command (with ‘<’) and the standard output of the command (with ‘>’). As a result, our program ./cleanup does not need to access any file directly; it can simply do its filtering from stdin to stdout.
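As an illustration, a ‘cleanup’ program along these lines could look as follows in Python.  This is only a sketch:  the post does not say what the real program does, so the filtering rule used here (strip whitespace, drop empty lines) is an assumption made up for the example.  What matters is that the program reads stdin, writes stdout, and takes the language from the environment variable ‘lang’:

```python
import os
import sys

def clean(lines):
    """Yield cleaned lines.  The rule itself is purely illustrative:
    surrounding whitespace is stripped and empty lines are dropped."""
    for line in lines:
        text = line.strip()
        if text:
            yield text + "\n"

def main():
    # Stu sets the environment variable 'lang' from the target name.
    lang = os.environ.get("lang", "")
    print("cleaning dataset for language %r" % lang, file=sys.stderr)
    # No file names appear here: Stu's '<' and '>' redirections
    # connect stdin and stdout to the right files.
    sys.stdout.writelines(clean(sys.stdin))

if __name__ == "__main__":
    main()
```

Outside of Stu, the same program could be tried by hand with something like ‘lang=fr ./cleanup <data.fr.txt >data-clean.fr.txt’ in the shell, assuming the file is saved as an executable script with a suitable interpreter line.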

But let’s look at the other two rules in our example.  The rule

LANGS = { en de fr pt es ja pl nl }

is what we call a hardcoded rule.  It simply assigns a predefined content to a file. In this case, we just use that file to store the list of languages in our experiments.

The first rule is a little more interesting now:

@all: data-clean.[LANGS].txt;

It contains brackets (‘[’ and ‘]’), which I will explain now.  Brackets in Stu mean that a file (in this case ‘LANGS’) should be built, and then its contents inserted as a replacement for the brackets.  In this case, since the file ‘LANGS’ contains multiple names (all the language codes), the filename ‘data-clean.[LANGS].txt’ will be expanded multiple times, each time with ‘[LANGS]’ replaced by a single language code.  As a result, the target ‘@all’ will have as dependencies all files of the form ‘data-clean.$lang.txt’, where $lang is a language code taken from the file ‘LANGS’.
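The expansion just described can be sketched in a few lines of Python.  Again, this is not Stu’s implementation:  the function name expand_brackets is hypothetical, the file system is replaced by a dictionary, and the file content is assumed to be a plain whitespace-separated list of names.

```python
def expand_brackets(dependency, file_contents):
    """Expand one '[NAME]' in a dependency into a list of concrete names.

    Illustrative only: file_contents is a dict standing in for the
    file system, mapping a file name to its text content.
    """
    start = dependency.index("[")
    end = dependency.index("]", start)
    prefix = dependency[:start]          # e.g. 'data-clean.'
    name = dependency[start + 1:end]     # e.g. 'LANGS'
    suffix = dependency[end + 1:]        # e.g. '.txt'
    # Stu would first build the file NAME; here we just look it up.
    words = file_contents[name].split()
    return [prefix + word + suffix for word in words]

files = {"LANGS": "en de fr pt es ja pl nl"}
deps = expand_brackets("data-clean.[LANGS].txt", files)
print(deps[:3])  # ['data-clean.en.txt', 'data-clean.de.txt', 'data-clean.fr.txt']
```

Each of the resulting names is then an ordinary dependency of ‘@all’ and is matched against the parametrized rule for ‘data-clean.$lang.txt’.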

Finally, the symbol ‘@’ is used to denote a so-called transient target, i.e. a target that is not a file, but is simply represented as a symbol in Stu.  Since the target @all is not a file, there is no command to build it, and therefore we simply use ‘;’ instead of a command.

Comparison with Make

Readers who know the tool Make will now draw comparisons to it. If Stu were just a rehash of Make, then I would not have written it, so what are the differences?  In short, these are the three core features of Stu:

  • Parametrized rules:  Like $lang in the example, any rule can be parametrized.  This can be partially achieved in some implementations of Make (such as GNU Make) by using the ‘%’ symbol. But crucially, this does not allow multiple parameters. In the KONECT project, we use up to four parameters in very reasonable use-cases.
  • Dynamic dependencies: These are written using brackets (‘[’ and ‘]’) and mean that the dependency file is first built itself, and then its contents are parsed for more dependencies.
  • Concatenation: I haven’t mentioned concatenation yet, but we have used it in the example.  Simply by writing ‘data-clean.[LANGS].txt’, the string ‘data-clean.’ will be prepended to all names in ‘LANGS’, and ‘.txt’ will be appended to them.

These are the three core features of Stu, which set it apart from virtually all other such tools. In fact, I initially wrote Stu in order to have the first two features available. The third feature came to me later, and since I implemented it in Stu 2.5, using it has become second nature.  As a general rule, the three features interact beautifully to give very concise and easy-to-understand scripts.  This is the raison d’être of Stu.

Comparison with Other Build Tools

There are many build tools in existence, and some of the features of Stu can be found in other such tools.  The combination of the three core features is however unique.  In fact, I have been using Make and Make replacements for years, for KONECT and other projects, always looking for ways to nicely express the complex dependencies that arise in data mining projects.  Finally, I started writing Stu because no tool was adequate.  In short, almost all Make replacements fail for me because (1) they are specific to a certain programming language, or (2) they are specific to the task of software building.  For data mining projects, this is not adequate.  Those tools that are generic enough do not implement the three core features.

By the way, I have been using the Cook tool by the late Peter Miller for years.  Among all Make replacements, it is probably the one that comes closest to Stu.

Stu inherits the tradition of Make replacements having cooking-related names (brew, chef, cook, etc.), and at the same time honors the author and inventor of the original UNIX Make, Stuart Feldman.

Design Considerations

The design considerations of Stu are:

  • Genericity: In many projects, the same rule has to be executed over and over again with varying parameters. This is particularly true in data mining / data science / machine learning and related areas, but also applies to simply compiling programs with varying compiler options, etc. Being able to do this using a clean syntax and friendly semantics is the main motivation of Stu, and where virtually all other Make replacements fail. Most Make replacements force the user to write loops or similar constructs.
  • Generality: Don’t focus on a particular use case such as compilation, but be a generic build tool. There are no built-in rules for compilation or other specific applications. Instead, allow use case-specific rules to be written in the Stu language itself.  Most Make replacement tools instead focus on one specific use-case (almost always compilation), making them unsuitable for general use.
  • Files are the central datatype. Everything is a file. You can think of Stu as “a declarative programming language in which all variables are files.” For instance, unlike Make, Stu has no variables; instead, files are used. Other Make replacements are even worse than Make in this regard, and allow any variable in whatever programming language they are using. Stu is based on the UNIX principle that any persistent object should have a name in the file system – “everything is a file.”
  • Scalability: Assume that projects are so large that you can’t just clean and rebuild everything if there are build inconsistencies.  Files are sacred; never make the user delete files in order to rebuild things.
  • Simplicity: Do one thing well. We don’t include features such as file compression that can be achieved by other tools from within shell commands. Lists of files and dependencies are themselves targets that are built using shell commands, and therefore any external software can be used to define them, without any special support needed from Stu. Too many Make replacements try to “avoid the shell” and include every possible transformation in the tool, effectively amassing dozens of unnecessary dependencies, and creating an ad-hoc language much less well-defined, let alone portable, than the shell.
  • Debuggability: Like programs written in any programming language, Stu scripts will contain errors. Stu makes it easy to detect and correct errors by having much better error messages than Make. This is achieved by (1) having a proper syntax based on tokenization (rather than Make’s text replacement rules), and (2) having “compiler-grade” error messages, showing not only what went wrong, but how Stu got there. Anyone who has ever wondered why a certain Make rule was executed (or not) will know the value of this.
  • Portability: Embrace POSIX as an underlying standard. Use the shell as the underlying command interpreter. Don’t try to create a purportedly portable layer on top of it, as POSIX already is a portability layer. Also, don’t try to create a new portable language for executing commands, as /bin/sh already is one. Furthermore, don’t use fancy libraries or hip programming languages. Stu is written in plain C++11 with only standard libraries. Many other Make replacements are based on specific programming languages as their “base standard”, effectively limiting their use to that language, and thus preventing projects from using multiple programming languages. Others do even worse and create their own mini-language, invariably less portable and more buggy than the shell.
  • Reliability: Stu has extensive unit test coverage, with more than 1,000 tests. All published versions pass 100% of these tests. All language features and error paths are unit tested.
  • Stability: We follow Semantic Versioning in order to provide syntax and semantics that are stable over time.  Stu files written now will still work in the future.
  • Familiarity: Stu follows the conventions of Make and of the shell as much as possible, to make it easier to make the switch from Make to Stu. For instance, the options -j and -k work like in Make. Also, Stu source can be edited with syntax highlighting for the shell, as the syntaxes are very similar.

That’s All Good!  How Can I Help?

If you want to use Stu – go ahead.  Stu is available on GitHub under the GPLv3 license.  It can be compiled like any program using the “configure–make–install” trio (see the file ‘INSTALL’).  I am developing Stu on Linux, and am also using it on MacOS.  We’ve even had people compiling it on Windows.  If you try it, I would be happy to hear about your experiences, and bug reports are welcome too.

If you want to see Stu in action, have a look at the Stu script of KONECT.  It controls the computations of all statistics, plots and other experiments in the KONECT project, and runs essentially continuously at the University of Namur.  In practice, I also use Stu for all side projects, generating my CV, writing papers, etc.

In the next months, I will write a blog entry about common anti-patterns in data science, and how they can be avoided using Stu.
