Clustering is an important property of social networks: People tend to have friends who are also friends with each other, resulting in sets of people among which many edges exist, while a set made from randomly chosen people would have a much smaller number of edges between them. To measure the clustering in a social (or other type of) network, a common measure is the clustering coefficient. The clustering coefficient is a real number between zero and one that is zero when there is no clustering, and one for maximal clustering, which happens when the network consists of disjoint cliques. While the clustering in a network can be measured in a number of ways, one common way to do it is to check for triangles, i.e., to check that when two edges share a node, then in a network with high clustering, it is likely that a third edge exists such that the three edges form a triangle. Although the clustering coefficient is often used, there are actually two variants of it – which may have completely different values.
Variant (1) – Define the clustering coefficient c1 as the probability that two incident edges are completed by a third one to form a triangle. Two edges are incident when they share a vertex. In that case, we can check whether a third edge exists such that all three edges form a triangle. The proportion of such incident edge pairs that are completed by a third edge to form a triangle then equals the clustering coefficient (variant 1).
Variant (2) – First, define the local clustering coefficient c(u) for each node u in the following way: Let c(u) be the proportion of neighbors of u that are connected. In other words, c(u) is the probability that two friends of u are friends of each other in a social network. Then define the clustering coefficient (variant 2) as the mean of all local clustering coefficients in the network.
Both measures c1 and c2 measure the clustering in a network and should therefore correlate highly, right? Well, no. Look at the following plot:
This plot shows the two clustering coefficients for networks in the KONECT project. Each network is represented by a two or three-letter code. The X axis represents c1 and the Y axis represents c2. Clearly, we cannot state that the variants correlate in any way. Why is that? To see why the two clustering coefficients do not correlate, we have to think about how both are computed. In the first variant, each pair of incident edges is given the same weight, but in the second variant each node is given the same weight. This means that nodes with a high degree will be overrepresented in variant 1, and underrepresented in variant 2. The correlation of both measures is thus a hint that nodes with high and low degree in general do not have the same local clustering coefficient. Let’s see:
This plot needs some explanation. This is a plot showing a citation network taken from arXiv’s High Energy Physics / Phenomenology section. The network consists of publications connected by citation links, i.e., a link from u to v means that publication u has cited publication v. The network is taken as undirected here, i.e., we ignore who cited whom. Each blue dot in the plot is one vertex of the network, drawn according to its degree (X axis, logarithmic) and its local clustering coefficient (Y axis, logarithmic). The red line shows the average local clustering coefficient taken over all nodes of the same degree. What we see here is clear: nodes with low degree have a high clustering coefficient. In other words, when person has many friends, these friends have less edges among them, which is to be expected since a person with many friends is likely to have friends from more diverse communities, and a paper getting cited many times is likely to be cited by papers from more diverse areas.
This explain two things: First, because nodes with high degree are weighted more in the first variant, the first variant will have lower values that the second one. Second, the non-correlation between the two variants is due to the fact that the discrepancy between the local clustering coefficient of high-degree and low-degree nodes is different from network to network. Networks for which both clustering coefficients are near each other have more equal local clustering coefficients c1 and c2, whereas networks high up in the figure plot have a higher discrepancy between the local clustering coefficient of high-degree and low-degree nodes.
A few more notes
The clustering coefficients are not always defined. The first variant is not defined when no vertex has degree larger than one. This, of course, is not a problem in practice. The local clustering coefficient is only defined for nodes whose degree is larger than one. These nodes are present in actual social networks, and the usual method is to simply ignore them and compute the average of the local clustering coefficient over all nodes with degree larger than one, which is what is done in the plots presented here.
EDIT: updated to “Matlab code” link to point to GitHub; removed “Datasets” link; updated link to the KONECT Handbook