Pronunciation of English Vowels

The English language is written in the Latin alphabet, and in principle, each letter corresponds to one sound.  In practice, however, many letters can be pronounced in multiple ways, and many sounds can be written in multiple ways.  The following diagram shows the relationships between English sounds and the letters used to write them.

  • The sounds of English are represented here by the phonemes of English, i.e., the basic sound units of the language.  This ignores differences between sounds that are not used to distinguish words, such as the British and American ways of pronouncing “au”, which is simply written /ɔː/.  Phonemes are represented using the International Phonetic Alphabet (IPA).  See English phonology on Wikipedia for a description.
  • English uses not only individual letters, but also groups of multiple letters which are pronounced as one phoneme.  Groups of two letters are called digraphs.

The following diagram shows the relationship between the vowel phonemes of English and the letters and digraphs used to write them.  Each cell shows a word written with the given letter or digraph, and producing the given phoneme.  Each column represents words containing the same sound, but written in different ways.  Each row represents words containing the same letter or digraph, but pronounced in different ways.

Relationship between English vowel phonemes and letters/digraphs

Note that the table shows only vowels, and also omits vowels followed by R, as well as the /ə/ (schwa) phoneme.  Each word is color-coded:

  • White words are regular English words
  • Blue words have irregular spelling
  • Green words have a foreign origin explaining their irregular spelling
  • Yellow words have spellings or pronunciations which are either rare, or used only in certain English-speaking countries.

For each cell, we show a word from the first of the above categories for which a word exists.

Notes: The table is likely to contain errors and omissions!  Please write a comment below, and I will correct it.

  • I have very probably missed many rare and foreign words.  In particular, there are currently no trigraphs (groups of three letters) in the table.
  • The table does not give an indication of the frequency of each letter/phoneme combination.  For instance, “oo” represents /ʌ/ only in very few words, such as “blood” and “flood”.
  • The assessment of spellings as regular or irregular may be arbitrary in places.  For instance, I count “knowledge”/“bought”/“thought” as irregular.
  • Vowels followed by R are not shown, even though they are also irregular, for instance the pronunciations of “learn”, “hear”, “bear” and “heart”.  I put “bury” in this table because the R is not in the same syllable as the vowel.
  • Consonants are missing.  In general, consonants have more regular pronunciation, but still there are some interesting cases such as with “who”/”white”, etc.

Analysis

We may note at once that English orthography does not have a one-to-one relationship between sounds and letters.  In fact, almost all letters and digraphs have multiple pronunciations, and almost all phonemes have multiple spellings.

In terms of network analysis, we may ask whether it is possible to reach any phoneme from any other phoneme.  For instance, we may go from /ɪ/ to /ɒ/ using the following sequence:

  • /ɪ/ is written “i” in “in”, which is pronounced /aɪ/ in “mind”, which is written “ei” in “height”, which is pronounced /eɪ/ in “veil”, which is written “a” in “name”, which is pronounced /ɒ/ in “yacht”.

This shows that there is a path from /ɪ/ to /ɒ/, even though these two sounds are completely different.  The sequence “in” → “mind” → “height” → “veil” → “name” → “yacht” has length six.  The underlying graph is thus a bipartite graph:  it contains phonemes and letters/digraphs as nodes, and words as edges.

Can all nodes in this graph be reached from any other node?  The answer is no, because the phoneme /ɔɪ/ with its two spellings “oi” and “oy” is isolated from the rest of the graph.  The rest of the graph is connected, however, so any phoneme can be reached from any other phoneme, except for /ɔɪ/.

However, we can still find two clusters in the graph, with many edges within each cluster, but few edges from one to the other:  the AEI group and the OU group:

  • The AEI group contains the sounds /ɪ/, /aɪ/, /ɛ/, /iː/, /æ/, /eɪ/, /ɑː/ and /ɔː/, and the spellings “i”, “y”, “uy”, “ie”, “e”, “ei”, “ea”, “ee”, “eo”, “ey”, “a”, “ae”, “ai”, “ay”, “al”, “au” and “aw”.
  • The OU group contains the sounds /ɒ/, /oʊ/, /ɔɪ/, /aʊ/, /ʊ/, /ʌ/, /uː/ and /juː/, and the spellings “ao”, “o”, “oa”, “oe”, “oi”, “oo”, “ou”, “ow”, “oy”, “u”, “ue”, “eu”, “ew” and “ui”.

The only links between the two groups are irregular and foreign words:

  • Words with irregular spelling: “women”, “busy”, “build”, “bury”, “broad”, “bought”, “knowledge”, “yeoman” and “gaol” (alternative spelling of “jail”).
  • Words of foreign origin:  “phoenix”, “mauve” and “pharaoh”.
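The reachability argument above can be sketched in a few lines of Python.  This is a toy version using only the handful of words from the example path, plus the isolated /ɔɪ/ (the example words “coin” and “boy” for /ɔɪ/ are my own additions, not taken from the table):

```python
from collections import deque

# Edges of the bipartite phoneme/spelling graph: (phoneme, spelling, example word).
edges = [
    ("/ɪ/",  "i",  "in"),
    ("/aɪ/", "i",  "mind"),
    ("/aɪ/", "ei", "height"),
    ("/eɪ/", "ei", "veil"),
    ("/eɪ/", "a",  "name"),
    ("/ɒ/",  "a",  "yacht"),
    ("/ɔɪ/", "oi", "coin"),
    ("/ɔɪ/", "oy", "boy"),
]

# Adjacency list: node -> list of (neighbor, word witnessing the edge)
adj = {}
for phoneme, spelling, word in edges:
    adj.setdefault(phoneme, []).append((spelling, word))
    adj.setdefault(spelling, []).append((phoneme, word))

def word_path(start, goal):
    """Breadth-first search; returns the words along a shortest path, or None."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, words = queue.popleft()
        if node == goal:
            return words
        for neighbor, word in adj.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, words + [word]))
    return None  # unreachable

print(word_path("/ɪ/", "/ɒ/"))   # ['in', 'mind', 'height', 'veil', 'name', 'yacht']
print(word_path("/ɪ/", "/ɔɪ/"))  # None: /ɔɪ/ is isolated
```

With the full table as edge list, the same search answers the reachability question for any pair of phonemes.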

EDIT:  Added the word “laugh” with pronunciation /æ/.

Plotting a Degree Distribution on the Command Line

The following shell command plots a degree distribution on a log-log scale.  It uses only standard Unix commands:

< INPUTFILE sed -re '/^\%/d;s,^\s*\S+\s*(\S+).*$,\1,' | sort | uniq -c | sort -n | sed -re 's,^\s*(\S+).*$,\1,;s,.,#,g' | uniq -c | sed -re 's,^\s*(\S+).*,\1,;s,.,#,g'

The file “INPUTFILE” must be a text representation of a directed network with one edge per line, each line containing a from-node ID and a to-node ID.  The command plots the indegree distribution on a doubly logarithmic scale.  Here it is executed for KONECT’s “YouTube links” dataset:

$ < INPUTFILE sed -re '/^\%/d;s,^\s*\S+\s*(\S+).*$,\1,' | sort | uniq -c | sort -n | sed -re 's,^\s*(\S+).*$,\1,;s,.,#,g' | uniq -c | sed -re 's,^\s*(\S+).*,\1,;s,.,#,g'

The dataset can be downloaded from KONECT.  The degree distribution is transposed from its usual orientation:  the degrees are on the Y axis and the frequency is on the X axis.  As the output shows, there is a power law here:  the lines get shorter in a linear fashion, and thus the frequency of each degree value is a negative power of the degree value.

This shows that:

  • You don’t need megabytes of graphics code to generate plots
  • You don’t need a log() function to plot logarithmic axes
  • It absolutely makes sense to use both sort(1) and uniq(1) multiple times in a single pipeline

The drawn degree distribution also uses logarithmic binning, and is thus much more useful for visualizing the tail of the distribution than the standard way of plotting degree distributions.
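For comparison, here is a rough Python equivalent of the pipeline (a sketch under the same input-format assumption; the function name is mine).  It makes the trick explicit:  bar lengths are digit counts, which is a base-10 logarithm in disguise:

```python
from collections import Counter

def ascii_degree_plot(lines):
    """Roughly what the shell pipeline does: one '#' bar per order of
    magnitude of indegree; the bar length is the number of digits of the
    node count in that bin -- a log-log plot without ever calling log()."""
    # Indegree of each node: count occurrences of the to-node (second
    # column), skipping '%' comment lines.
    indegree = Counter(
        line.split()[1]
        for line in lines
        if line.strip() and not line.startswith("%")
    )
    # Bin nodes by the number of digits of their indegree (the logarithmic
    # binning performed by the second `uniq -c` on the sorted bars).
    bins = Counter(len(str(d)) for d in indegree.values())
    # One bar per bin, ascending: bar length = digits of the bin's size.
    return ["#" * len(str(bins[width])) for width in sorted(bins)]

# Tiny example edge list (from-node and to-node per line):
for bar in ascii_degree_plot(["1 2", "3 2", "4 2", "1 5", "% comment", "6 7"]):
    print(bar)
```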

Try it out with more datasets:  List of datasets in KONECT

Can you write other shell one-liners for generating other network analysis plots?

Largest Network Dataset Ever Released – As Big As Facebook

Good news for the network analysis community:  A new very large Web hyperlink dataset was recently released:  the Web Data Commons Hyperlink Graph consisting of 3.56 billion pages and 129 billion hyperlinks connecting them!  For comparison, the Facebook friendship graph is reported to include 1.26 billion nodes (users) and 150 billion links (friendships).

The Web Data Commons Hyperlink Graph is thus the largest real-world network dataset available outside of the big companies themselves.  This means it is no longer only the big data companies that can do actual “Big Data” research – now any research group can, too.  Therefore, we should be seeing studies using this dataset over the next few years.  I can’t wait to find out what will be done with it.

For another comparison, our own Koblenz Network Collection, which collects network datasets of many different types and sizes, maxes out at 1.9 billion edges, with the Twitter follower network.  But since the new dataset is open, we will of course add it to the collection.

Defining the Clustering Coefficient

Clustering is an important property of social networks:  people tend to have friends who are also friends with each other, resulting in sets of people among which many edges exist, while a set of randomly chosen people would have far fewer edges between them.

To measure the clustering in a social (or other type of) network, a common measure is the clustering coefficient:  a real number between zero and one that is zero when there is no clustering, and one for maximal clustering, which happens when the network consists of disjoint cliques.  While the clustering in a network can be measured in a number of ways, one common way is to look for triangles:  when two edges share a node in a network with high clustering, it is likely that a third edge exists such that the three edges form a triangle.

Although the clustering coefficient is often used, there are actually two variants of it – which may have completely different values.

Variant (1) – Define the clustering coefficient c1 as the probability that two incident edges are completed by a third edge to form a triangle.  Two edges are incident when they share a vertex; in that case, we can check whether a third edge exists such that all three edges form a triangle.  The proportion of incident edge pairs that are completed in this way equals the clustering coefficient (variant 1).

Variant (2) – First, define the local clustering coefficient c(u) for each node u in the following way:  let c(u) be the proportion of pairs of neighbors of u that are connected.  In other words, c(u) is the probability that, in a social network, two friends of u are friends of each other.  Then define the clustering coefficient (variant 2), c2, as the mean of all local clustering coefficients in the network.
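To see how the two variants can disagree, here is a small Python sketch (my own helper, not KONECT code) that computes both on a toy graph – a triangle with one pendant node attached:

```python
from itertools import combinations

def clustering_coefficients(edges):
    """Compute both variants for an undirected simple graph.

    Variant 1: fraction of incident edge pairs (wedges) closed by a third edge.
    Variant 2: mean of the local clustering coefficients over all nodes of
               degree > 1 (degree-one nodes are skipped, as discussed below)."""
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)

    wedges = closed = 0
    local = []
    for u, nbrs in neighbors.items():
        if len(nbrs) < 2:
            continue  # local coefficient undefined for degree <= 1
        pairs = list(combinations(nbrs, 2))
        links = sum(1 for v, w in pairs if w in neighbors[v])
        wedges += len(pairs)
        closed += links
        local.append(links / len(pairs))

    return closed / wedges, sum(local) / len(local)

# A triangle a-b-c with a pendant node d attached to a:
c1, c2 = clustering_coefficients([("a", "b"), ("b", "c"), ("c", "a"), ("a", "d")])
print(round(c1, 3), round(c2, 3))  # 0.6 0.778 -- the two variants disagree
```

Already on four nodes, giving each wedge the same weight (c1 = 3/5) and giving each node the same weight (c2 = 7/9) yields different values.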

Both measures c1 and c2 measure the clustering in a network and should therefore correlate highly, right?  Well, no.  Look at the following plot:

[Plot: scatter.c.clusco.clusco_7.7]

This plot shows the two clustering coefficients for networks in the Koblenz Network Collection.  Each network is represented by a two- or three-letter code.  The X axis represents c1 and the Y axis represents c2.  Clearly, we cannot state that the variants correlate in any way.  Why is that?

To see why the two clustering coefficients do not correlate, we have to think about how both are computed.  In the first variant, each pair of incident edges is given the same weight, whereas in the second variant each node is given the same weight.  This means that nodes with a high degree are overrepresented in variant 1, and underrepresented in variant 2.  The lack of correlation between the two measures is thus a hint that nodes with high and low degree do not, in general, have the same local clustering coefficient.  Let’s see:

[Plot: degcc.a.cit-HepPh]

This plot needs some explanation.  It shows a citation network taken from arXiv’s High Energy Physics / Phenomenology section.  The network consists of publications connected by citation links, i.e., a link from u to v means that publication u has cited publication v.  The network is taken as undirected here, i.e., we ignore who cited whom.

Each blue dot in the plot is one vertex of the network, drawn according to its degree (X axis, logarithmic) and its local clustering coefficient (Y axis, logarithmic).  The red line shows the average local clustering coefficient taken over all nodes of the same degree.  What we see here is clear:  nodes with low degree have a high clustering coefficient.  In other words, when a person has many friends, these friends have fewer edges among them.  This is to be expected, since a person with many friends is likely to have friends from more diverse communities, and a paper cited many times is likely to be cited by papers from more diverse areas.

This explains two things.  First, because nodes with high degree are weighted more in the first variant, the first variant will have lower values than the second one.  Second, the non-correlation between the two variants is due to the fact that the discrepancy between the local clustering coefficients of high-degree and low-degree nodes differs from network to network.  Networks for which both clustering coefficients are near each other have similar local clustering coefficients for high-degree and low-degree nodes, whereas networks high up in the plot have a higher discrepancy between the two.

A few more notes

The clustering coefficients are not always defined.  The first variant is not defined when no vertex has degree larger than one; this, of course, is not a problem in practice.  The local clustering coefficient is only defined for nodes whose degree is larger than one.  Nodes with degree zero or one are present in actual social networks, and the usual method is to simply ignore them and compute the average of the local clustering coefficient over all nodes with degree larger than one, which is what is done in the plots presented here.


Datasets, Matlab Code, Complete definitions of numerical network statistics

How to Visualize Skewed Distributions

Many distributions, such as the sizes of cities, the frequencies of words, and the numbers of followers of people on Twitter, are very skewed.  For instance, 16.2% of songs make up 83.8% of plays, and 18.8% of groups make up 81.2% of all group memberships on YouTube.  These types of distributions are described by expressions such as power law, long tail, and other terms.  These terms are often confused, as they refer to slightly different concepts, and the confusion is made worse by the existence of multiple ways of visualizing such distributions, which are mathematically different but similar in appearance.  In this blog post, I will review the most important ways of visualizing such distributions, and will explain the mathematics behind them.

Zipf’s Law

Many things can be distributed in a power-law-like way. For instance, the linguist George Zipf was interested in 1935 in the distribution of words in the English language. He counted, for each word, its frequency in a corpus of text. He then sorted the words by descending frequency, and noted that a word’s frequency is approximately inversely proportional to the rank of that word. That is, the second most common word is twice as rare as the most common word, the third most common word is three times as rare as the most common word, and so on.

This series of numbers can be visualized like this:

[Plot: zipf.v.gottron-reuters]

The data used for this plot is from a dataset of Reuters news articles in English.  (Since this data is taken from a collection of network datasets, the plot uses the term degree to refer to a word’s frequency.)  Each point in this plot represents one word, with the X axis representing the rank of the word (from most frequent to least frequent), and the Y axis representing the frequency.  The plot is drawn with both axes logarithmic.  Thus, the top-left point represents the most common word in the dataset, and very rare words are on the right of the plot (showing that many words occur only once in the dataset).

An exact inverse-proportional relationship as described by Zipf would look like a straight line with slope exactly minus one on this plot, because the axes are logarithmic.  With this dataset, this is not the case (in particular in the top-left area).  This is because the dataset was stripped of stop words ignored by search engines, such as “the”, “of”, etc., which happen to be the most common words in the English language.  Therefore, the plot does not show a straight line, and we may be interested in the shape of this plot as a visualization of the distribution.

The Distribution plot

In terms of probability theory, the size, frequency or number of followers follow a distribution, and we are thus interested in the shape of that distribution. The plot inspired by Zipf’s Law is thus often said to be a bad way of visualizing a distribution.  Instead, one may rather show the distribution plot, i.e., the plot showing the frequency of each value. For the English words example, this means a plot where the X axis shows word frequency values, and the Y axis shows the number of words having this frequency.  This results in the following plot:

[Plot: degree.v.gottron-reuters]

Again, we use a log-log plot.  Unlike the previous plot, this one looks much more like a straight line.  What is going on?  In a log-log plot, a straight line is in fact a power law:  log(y) = a·log(x) + b is equivalent to y = e^b · x^a.  Thus, the plot suggests that the number of words with frequency x is proportional to x^a, for some constant a, which equals about −1.6 in this case.  This type of relationship is called a power law, and the value −a is then called the power-law exponent.

This plot type, however, is misleading.  Since for each value x ≥ 100 only few words have frequency exactly x, the right half of the plot cannot be interpreted visually at all, and thus the plot effectively only visualizes the distribution of infrequent words.
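As an illustration only, a naive way to estimate the exponent a is a least-squares fit on the log-log distribution plot.  (This estimator is known to be statistically problematic; a proper analysis would use a maximum-likelihood fit and a significance test.  The function name and the synthetic data are mine.)

```python
import math
from collections import Counter

def loglog_slope(values):
    """Least-squares slope of the log-log distribution plot: a naive
    estimate of the power-law exponent a.  It matches what the eye does
    when it sees a straight line, nothing more."""
    freq = Counter(values)                     # frequency of each value
    xs = [math.log(v) for v in freq]           # log(value)
    ys = [math.log(c) for c in freq.values()]  # log(count of that value)
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)

# Synthetic data following an exact power law with exponent 2:
values = [1] * 64 + [2] * 16 + [4] * 4 + [8]
print(round(loglog_slope(values), 2))  # -2.0
```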

The Complementary Cumulated Distribution Plot

In order to also visualize words with uncommon (i.e., high) frequencies, a solution is to plot, for each x, not the probability that a word has frequency x (which is proportional to the number of words having that frequency), but the probability that a word has frequency greater than or equal to x.  This is called the complementary cumulated distribution plot (it would be called simply the cumulated distribution plot if it showed the probability of the frequency being less than a given x).

[Plot: bidd.vx.gottron-reuters]

Again, this plot type is shown with both axes logarithmic, and again it will show a straight line when the data follows a power law.  However, this plot does show the full range of values on the X axis, and in fact we can see right away that this is not a power law:  the right half of the plot deviates significantly from a straight line, showing again that our dataset does not follow Zipf’s Law, because very common words were removed from it.

Is this plot therefore a better visualization of a skewed distribution than the simple distribution plot? Yes, because it visualizes the whole spectrum of possible values, even rare ones. Is this plot also better than the plot derived from Zipf’s analysis? No, because it is the same plot. In fact, Zipf’s plot and the complementary cumulated distribution are mirror images of each other, mirrored around the 45-degree diagonal axis going from the lower left to the upper right. This can be seen by noting that the rank of a word with frequency x equals the probability of a word having frequency ≥ x, multiplied by the total number of words. Thus both plots are identical, up to an exchange of the axes.
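This mirror-image relationship is easy to check numerically; here is a small sketch with made-up word frequencies (the numbers are arbitrary):

```python
# Toy word frequencies, sorted descending (rank 1 = most frequent word):
freqs = [10, 5, 5, 2, 1, 1, 1]
n = len(freqs)

def rank_via_ccdf(x):
    """n * P(frequency >= x), i.e. the number of words with frequency >= x.
    With ties, this is the largest rank among the words of frequency x."""
    return sum(f >= x for f in freqs)

for rank, x in enumerate(freqs, start=1):
    print(rank, x, rank_via_ccdf(x))
# ranks:     1, 2, 3, 4, 5, 6, 7
# recovered: 1, 3, 3, 4, 7, 7, 7  -- equal, up to ties
```

Up to ties between equally frequent words, the rank is recovered exactly from the complementary cumulated distribution, which is why the two plots are axis-swapped versions of each other.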

The Long Tail

The confusion between the different types of plots is made worse by the ambiguity of the term “long tail”. In marketing, the long tail refers to the bulk of products which are bought only rarely, and thus are traditionally ignored by retailers, but from which online retailers can profit, since they don’t have the space constraints that traditional retailers have. Rare items are similar to infrequent words, and thus the term long tail describes the Zipf-type plot, in which infrequent items form a tail-like shape to the right of the figure. In the two other plots however, infrequent items are shown on the upper left corner, and thus do not form a tail at all.

What may also confuse people is the use of the word tail in probability theory to characterize the ends of distributions.  In probability theory, the expressions heavy-tailed distribution and fat-tailed distribution have specific meanings:  they refer to the tail as seen in the distribution plots, and thus to very large values of a variable, not to very small values as in the expression long tail.  What is worse, the expression long-tailed distribution is sometimes also used in probability theory, and it too describes the behavior of a distribution at large values of a variable.

The Lorenz Curve

To complete our little tour, I would like to mention the Lorenz curve, which shows the amount of text covered by the X% least frequent words.  That is, the X axis goes from 0 to 100, and the Y axis shows the total amount of text made up by the X% least frequent words.  The Lorenz curve is usually used to visualize wealth distributions, and is plotted without logarithmic scales.  For our English words dataset, it looks like this:

[Plot: lorenz.vb.gottron-reuters]

The area between the Lorenz curve (in blue) and the diagonal, multiplied by two, is the Gini coefficient, a measure of inequality used in economics.  Since the plot does not use logarithmic axes, neither power-law distributions nor Zipf’s Law can be easily recognized on it.
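As a sketch, the Gini coefficient can also be computed directly from the values, without drawing the Lorenz curve, using an equivalent closed-form expression (the function name is mine):

```python
def gini(values):
    """Gini coefficient: twice the area between the Lorenz curve and the
    diagonal, computed here via the equivalent formula
    G = sum_i (2i - n - 1) x_i / (n * sum_i x_i) over the sorted values."""
    xs = sorted(values)
    n, total = len(xs), sum(xs)
    return sum((2 * i - n - 1) * x for i, x in enumerate(xs, start=1)) / (n * total)

print(gini([1, 1, 1, 1]))  # 0.0  -- perfect equality
print(gini([0, 0, 0, 4]))  # 0.75 -- one item accounts for everything
```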


There are many ways to visualize skewed distributions, and care must be taken when creating, reading and interpreting such plots.  We recommend the complementary cumulated distribution plot, as it is most aligned with probability theory, and is just as expressive as the Zipf-type plot.  But in the end, remember that a plot is not a replacement for a proper statistical test:  if you want to assert that a dataset follows Zipf’s Law or a power law, a statistical test is needed.  Based on my experiments with degree distributions in KONECT, my cautious guess is that there are almost no statistically significant power laws in real-world networks (and I can make no statement about other types of distributions, such as city sizes).

The Surprising Equality of the Mean and Median Path Length in Online Networks

A common statistic computed for networks is the diameter.  The diameter equals the longest shortest path length in a network.  Let me explain:  given two nodes u and v in a network, their distance is the minimal number of edges needed to reach one from the other.  In other words, the distance between u and v equals the shortest path length between these two nodes.  Now, find two nodes u and v such that the distance between them is maximal – this distance is the diameter of the network.  The diameter is thus an integer that characterizes a network.

The diameter is an important statistic for a network, since it is used to characterize a network as a small world. The expression small world refers to the observation that it is possible to reach any node from any node in a small number of steps, as in the famous six degrees of separation.

The diameter is only one of several ways of measuring the “typical path length” in a network, and in fact, it may not even be the most representative one.  As a maximum, it is not representative of the full distribution of path lengths in the network, and it is sensitive to outliers, e.g., tendrils in the network that increase the diameter but do not affect the overall structure.  Therefore, several alternative measures of the typical path length can be used:

  • The diameter (as defined above) is the maximum path length
  • The mean path length, averaged over all node pairs
  • The median path length
  • The 90-percentile effective diameter
  • The 50-percentile effective diameter

The effective diameters (50- and 90-percentile) are the path lengths such that 50 or 90 percent of node pairs are at a smaller or equal distance, interpolated between integer values.  Thus, the median path length is the 50-percentile effective diameter rounded down.
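As an illustration, here is one way to compute an interpolated effective diameter from a path-length histogram in Python.  Note that the exact interpolation convention varies between tools, so this sketch (with made-up numbers) need not reproduce KONECT’s values exactly:

```python
def percentile_diameter(length_counts, p):
    """Interpolated p-percentile effective diameter from a histogram
    {path length: number of node pairs}.  One common interpolation
    scheme; exact conventions differ between tools."""
    total = sum(length_counts.values())
    target = p * total
    cumulative = 0
    for d in sorted(length_counts):
        previous = cumulative
        cumulative += length_counts[d]
        if cumulative >= target:
            # Interpolate linearly inside the bin (d - 1, d]:
            return (d - 1) + (target - previous) / length_counts[d]

# Toy path-length histogram over all node pairs:
counts = {1: 25, 2: 50, 3: 25}
print(percentile_diameter(counts, 0.5))                              # 1.5
print(round(percentile_diameter(counts, 0.9), 2))                    # 2.6
print(sum(d * c for d, c in counts.items()) / sum(counts.values()))  # mean: 2.0
```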

The statistics are shown in the following plot, for the Slashdot Zoo network:

[Plot: hopdistr.a.slashdot-zoo]

This shows the cumulated distribution of path lengths over all node pairs in the network.  In this network, the diameter (δ) equals 12.  The 50- and 90-percentile effective diameters are shown as the two vertical red lines and equal 3.51 (δ0.5) and 4.71 (δ0.9).  The median path length is thus 3 in this network.  The mean path length (δm) equals 4.04 and lies somewhere in between.

We see that the distribution is skewed, because the mean and median path lengths are not equal in this network.  However, when I did this analysis for more networks, I found that in most cases, both numbers coincide.  The following plot shows many networks (those small enough that the numbers could be calculated), plotted by the 50-percentile effective diameter (X axis) and the mean path length (Y axis).  Each network is represented by a two-letter (and for a few, three-letter) code.  See the results for yourself:

[Plot: scatter.c.diam_3.diam_4.4]

This plot shows that the two quantities not only correlate very highly, but are almost equal for almost all networks.  In other words:  the 50-percentile effective diameter and the mean path length are almost equal in actual networks.  What does this mean?  I’m not sure, but the result is not trivial, since the distribution of path lengths is not symmetric in most networks.

For more path length distributions, see here.


The 65 Languages in My Music Collection

Like most of you, I have collected music all my life, first as CDs and later on my computer.  In particular, a hobby of mine is collecting music in the languages of the world – the more the better!  One day, I started tagging each song in my collection with the language it is sung in.  Today, about 70% of the 16193 tracks I have are tagged with at least one language.

Here’s the breakdown:

English (8346 songs), French (2366 songs), German (1128 songs), Spanish (587 songs), Portuguese (241 songs), Italian (201 songs), Dutch (197 songs), Arabic (76 songs), Russian (69 songs), Norwegian (69 songs), Latin (64 songs), Nepali (44 songs), Wolof (32 songs), Turkish (31 songs), Occitan (27 songs), Malagasy (26 songs), Korean (23 songs), Chinese (22 songs), French-based creoles (21 songs), Japanese (19 songs), Dyula (19 songs), Polish (18 songs), Swedish (17 songs), Jamaican Creole (13 songs), Finnish (13 songs), Zulu (12 songs), Hawaiian (12 songs), Bulgarian (11 songs), Shona (8 songs), Haitian Creole (8 songs), Nubian (7 songs), Xhosa (6 songs), Catalan (6 songs), Georgian (5 songs), Hindi (5 songs), Galician (4 songs), Hungarian (3 songs), Middle High German (3 songs), Esperanto (3 songs). I have 2 songs in each of the following languages: Sesotho, Romanian, Punjabi, Lingala, Limburgish, Icelandic, Irish Gaelic, Serbian.  I have one song in each of the following languages: Venda, Klingon, Tagalog, Swahili, Old Irish, Scots, Low German (i.e., Plattdeutsch), Ndebele, Mongolian, Modern Hebrew, Fang, Persian, Duala, Breton, Amharic, Afrikaans.

Note:  Some songs contain multiple languages and are counted multiple times in this list.  Some languages are only used for a single phrase, making it arguable that the song is sung “in that language”.

If you’re wondering why a certain language is not included:  there are over 6000 languages in the world, and I have no idea how many of them songs have been recorded in.  In any case, I will gladly accept tips for songs to add to the collection 🙂

Happy Music Listening to everyone!