Defining the Clustering Coefficient

Clustering is an important property of social networks:  People tend to have friends who are also friends with each other, resulting in sets of people among which many edges exist, while a set made from randomly chosen people would have a much smaller number of edges between them. To measure the clustering in a social (or other type of) network, a common measure is the clustering coefficient.  The clustering coefficient is a real number between zero and one that is zero when there is no clustering, and one for maximal clustering, which happens when the network consists of disjoint cliques.  While the clustering in a network can be measured in a number of ways, one common way to do it is to check for triangles, i.e., to check that when two edges share a node, then in a network with high clustering, it is likely that a third edge exists such that the three edges form a triangle. Although the clustering coefficient is often used, there are actually two variants of it – which may have completely different values.

Variant (1) – Define the clustering coefficient c1 as the probability that two incident edges are completed by a third one to form a triangle.  Two edges are incident when they share a vertex. In that case, we can check whether a third edge exists such that all three edges form a triangle.  The proportion of such incident edge pairs that are completed by a third edge to form a triangle then equals the clustering coefficient (variant 1).

Variant (2) – First, define the local clustering coefficient c(u) for each node u in the following way:  Let c(u) be the proportion of neighbors of u that are connected.  In other words, c(u) is the probability that two friends of u are friends of each other in a social network.  Then define the clustering coefficient (variant 2) as the mean of all local clustering coefficients in the network.

Both measures c1 and c2 measure the clustering in a network and should therefore correlate highly, right?  Well, no.  Look at the following plot:

scatter.c.clusco.clusco_7.7This plot shows the two clustering coefficients for networks in the Koblenz Network Collection. Each network is represented by a two or three-letter code. The X axis represents c1 and the Y axis represents c2.  Clearly, we cannot state that the variants correlate in any way.  Why is that? To see why the two clustering coefficients do not correlate, we have to think about how both are computed.  In the first variant, each pair of incident edges is given the same weight, but in the second variant each node is given the same weight.  This means that nodes with a high degree will be overrepresented in variant 1, and underrepresented in variant 2.  The correlation of both measures is thus a hint that nodes with high and low degree in general do not have the same local clustering coefficient.  Let’s see:

degcc.a.cit-HepPhThis plot needs some explanation. This is a plot showing a citation network taken from arXiv’s High Energy Physics / Phenomenology section. The network consists of publications connected by citation links, i.e., a link from u to v means that publication u has cited publication v.  The network is taken as undirected here, i.e., we ignore who cited whom. Each blue dot in the plot is one vertex of the network, drawn according to its degree (X axis, logarithmic) and its local clustering coefficient (Y axis, logarithmic).  The red line shows the average local clustering coefficient taken over all nodes of the same degree.  What we see here is clear:  nodes with low degree have a high clustering coefficient. In other words, when person has many friends, these friends have less edges among them, which is to be expected since a person with many friends is likely to have friends from more diverse communities, and a paper getting cited many times is likely to be cited by papers from more diverse areas.

This explain two things:  First, because nodes with high degree are weighted more in the first variant, the first variant will have lower values that the second one.  Second, the non-correlation between the two variants is due to the fact that the discrepancy between the local clustering coefficient of high-degree and low-degree nodes is different from network to network.  Networks for which both clustering coefficients are near each other have more equal local clustering coefficients c1 and c2, whereas networks high up in the figure plot have a higher discrepancy between the local clustering coefficient of high-degree and low-degree nodes.

A few more notes

The clustering coefficients are not always defined. The first variant is not defined when no vertex has degree larger than one. This, of course, is not a problem in practice. The local clustering coefficient is only defined for nodes whose degree is larger than one. These nodes are present in actual social networks, and the usual method is to simply ignore them and compute the average of the local clustering coefficient over all nodes with degree larger than one, which is what is done in the plots presented here.

Resources

Datasets, Matlab Code, Complete definitions of numerical network statistics

Advertisements

5 thoughts on “Defining the Clustering Coefficient

  1. Hi, Jérôme. Thank you for the text.

    In case of real networks it is common that the connections between nodes have some degree of “importance”. Something like degrees of friendship. Then this definition would lose easily this network’s characteristic. But this is easily manageable if you consider weights…

    In cases of random graphs models that you allow multiple edges (but in large quantities) the widely used definition of clustering as “3 times number of triangles over the number of cherries” doesn’t work anymore.

    Do you know other definition for such cases?

    I don’t know if I made myself clear. I can email you with a context if you want.

    Thanks, again!

    • Hi

      yes you’re right, it is not possible to just take the definition c = 3t / s in the multiple-edge case, because that leads to values of c larger than one. Most literature indeed just restrict the clustering coefficient to simple networks, just like KONECT.

      One approach for this would be to think about what it means for a triangle to be closed. Example: If A and B are connected by two edges and B and C are connected by three edges, how many edges must exist between A and C to close that triangle. Then, the clustering coefficient could be defined as the probability that a “cherry” is closed, which is a definition that is equivalent to c = 3t/s for simple networks.

      BTW, I’m really curious about the term “cherry”; is it used in the literature for this pattern?

      Cheers
      Jérôme

      • The problem when you look the clustering as the probability that a cherry is closed (in fact, I believe this is the true definition of clustering) is that it is in general not treatable, I mean, it is hard to compute. =(

        About the definition 3t/s applied to multiple-graphs, as you said, this leads to values greater than one, so this is not a coefficient. But this still say something about the network, right? Or not?
        I mean, even if this definition it isn´t a coefficient, maybe it is still some interesting quantity and can be compared with the real clustering coefficient or say something I direction…
        What do you think?

        About the term cherry, if I´m not wrong, it is used in the graph theory literature. In particular, in an argument called double couting. In the book Graph Theory of Bondy and Murty you can find it ( I guess… =/ )

        Thank you for the answer!

        Cheers,

        Rodrigo

      • Hi Rodrigo

        About using the definition c = 3t/s in the case of multiple edges:

        Let T be the number of triangles ignoring multiple edges, and S the number of cherries ignoring multiple edges. Let w be the average multiplicity, i.e., the average number of edges connecting two node pairs that are connected by at least one edge. Assuming the number of edges connecting two nodes is independent of whether the two nodes pairs are part of a triangle, we have t = w³T and s = w²S, thus:

        c = 3t/s = 3w³T / w²S = w × 3T/S

        I.e., c is w times the unweighted clustering coefficient 3T/S. This means that we should be interested in the quantity

        3t / ws

        I don’t know whether this quantity can be larger than one.

        Cheers
        Jérôme

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s