Building a Twitter Search Engine

This week I attended the Text Retrieval Conference (TREC 2011) to present our contribution to the Microblog Challenge: finding interesting tweets on Twitter about a given list of topics. The Microblog Challenge was organized for the first time in 2011 and attracted more participants than any TREC challenge before: 59 in total. Accordingly, there were many interesting contributions. Here’s my rundown of the main lessons learned:

Shortness of Tweets

Tweets are really short, and therefore searching tweets is radically different from searching traditional documents. For instance, counting the number of times a word is included in a tweet is completely irrelevant. This is in total contrast to traditional search, where TF (‘term frequency’) is unavoidable in any system.

One consequence of the shortness of tweets is that query expansion is really important. In traditional search, a good document such as a Wikipedia page for a given topic will typically contain synonyms of the topic, so you will find the page by searching for any of the synonyms. A tweet, on the other hand, will use just one synonym, so it is the search system that has to find the synonyms. This is not a trivial task, and a research problem in itself.
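
To make the idea concrete, here is a minimal sketch of query expansion in Python. It is not what any TREC participant actually ran: the synonym dictionary is a hand-written placeholder (a real system would have to learn or look up synonyms, which is the hard part), and the names `expand_query` and `matches` are made up for this illustration. Note that matching is binary, since term frequency carries little information in a tweet.

```python
def expand_query(query, synonyms):
    """Return the query terms plus their synonyms, without duplicates.

    `synonyms` is a hypothetical dict mapping a term to a list of synonyms.
    """
    expanded = []
    for term in query.lower().split():
        for candidate in [term] + synonyms.get(term, []):
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


def matches(tweet, expanded_terms):
    """Binary match: does the tweet contain at least one expanded term?

    Tokenization is a plain whitespace split, so punctuation attached to a
    word (e.g. "crash!") is not handled in this sketch.
    """
    tokens = set(tweet.lower().split())
    return any(term in tokens for term in expanded_terms)


# Example with a toy synonym dictionary (an assumption, not real data).
toy_synonyms = {"car": ["automobile", "vehicle"]}
terms = expand_query("car crash", toy_synonyms)
print(terms)
print(matches("Automobile pileup on the A3", terms))
```

Without expansion, the example tweet would not be found by the query "car crash", since it uses neither of the two words.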

Social Search

An interesting comment from Jimmy Lin from Twitter is that Twitter sees itself as a social communication platform and not a microblog. Indeed, that social component is central to Twitter, and a good Twitter search system has to take the social network into account. For instance, most people trust the people they follow more than other people, so returning tweets from these people will make for a more satisfying search. Also, retweets are very valuable: If a tweet gets retweeted, it is likely to be interesting! This fact can be used to compute the ‘interestingness’ of a tweet, and indeed this is what my team from the University of Koblenz-Landau did, as described in this paper we presented at this year’s ACM Web Science Conference:

Naveed, Nasir; Gottron, Thomas; Kunegis, Jérôme; Che Alhadi, Arifah: Bad News Travel Fast: A Content-based Analysis of Interestingness on Twitter. In: Proc. Web Science Conf., 2011.

Query-independent Features

As in traditional search, features independent of the query are important. For instance, tweets with URLs are consistently found to be more interesting than tweets without URLs. This was confirmed by most teams. Some results in this category are more interesting: My team from the University of Koblenz-Landau found that tweets containing a negative smiley such as 😦 are more likely to be retweeted than tweets with a positive smiley such as 🙂 !
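
As a rough illustration, the two query-independent features mentioned above can be extracted with a few lines of Python. This is only a sketch: the URL pattern is deliberately simple, the smiley lists are my own small selection, and the function name `query_independent_features` is invented for this example.

```python
import re

# Simplified URL pattern (an assumption; real tweets need more care).
URL_PATTERN = re.compile(r"https?://\S+")

# Small hand-picked smiley lists, for illustration only.
NEGATIVE_SMILEYS = (":(", ":-(", "😦")
POSITIVE_SMILEYS = (":)", ":-)", "🙂")


def query_independent_features(tweet):
    """Compute features that depend only on the tweet, not on the query."""
    return {
        "has_url": bool(URL_PATTERN.search(tweet)),
        "has_negative_smiley": any(s in tweet for s in NEGATIVE_SMILEYS),
        "has_positive_smiley": any(s in tweet for s in POSITIVE_SMILEYS),
    }


print(query_independent_features("Flight delayed again :( http://example.com/x"))
```

Because these features do not depend on the query, they can be computed once per tweet at indexing time.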

URL Dereferencing

Since tweets are so short and often contain URLs, a big improvement in the recall of the search can be had when the URLs contained in tweets are crawled and the linked content is considered part of the tweet for purposes of the keyword search. In fact, we got word from Twitter people that this is done in real time by Twitter itself.
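
Here is a sketch of what URL dereferencing could look like. It says nothing about how Twitter does it internally. To keep the example runnable without network access, the actual crawling is left to a caller-supplied function `fetch_text` (a hypothetical parameter that takes a URL and returns the page text); the name `expand_with_linked_content` is also made up.

```python
import re

# Simplified URL pattern (an assumption; real tweets need more care).
URL_PATTERN = re.compile(r"https?://\S+")


def expand_with_linked_content(tweet, fetch_text):
    """Return the tweet text followed by the text of every linked page.

    `fetch_text` is a hypothetical callable: URL in, page text out.
    URLs that cannot be fetched are skipped.
    """
    parts = [tweet]
    for url in URL_PATTERN.findall(tweet):
        try:
            parts.append(fetch_text(url))
        except Exception:
            # Dead link, timeout, etc.: keep the tweet text only.
            continue
    return " ".join(parts)


# Usage with a fake in-memory "web" instead of a real crawler.
fake_web = {"http://example.com/a": "Earthquake hits the coast"}
document = expand_with_linked_content(
    "Wow, look at this http://example.com/a", fake_web.__getitem__
)
print(document)
```

The expanded text is what would be indexed for keyword search, so the example tweet becomes findable for "earthquake" even though the tweet itself never uses the word.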

Ground Truth

Like all TREC challenges, the Microblog Challenge used assessment by actual people as ground truth. As a result, participants were not provided with any training data. Most teams therefore labeled a small subset of tweets by hand as interesting and non-interesting. We, however, did something different: In our previous research we had found that being retweeted is a very good indicator of the interestingness of a tweet. This approach carried over to the Microblog Challenge easily: Simply compute the probability of a tweet being retweeted and use it as a score.
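
The following sketch shows how a retweet probability can serve as a ranking score. It uses a plain logistic model over boolean features as a stand-in; I am not claiming this is the exact model from our paper. The weights and bias are invented placeholders, where in practice they would be fitted on tweets whose retweet status is known, and all names are made up for this illustration.

```python
import math

# Invented placeholder parameters, NOT fitted values from any dataset.
HYPOTHETICAL_WEIGHTS = {
    "has_url": 1.0,
    "has_negative_smiley": 0.5,
    "has_positive_smiley": -0.5,
}
HYPOTHETICAL_BIAS = -2.0


def retweet_probability(features, weights=HYPOTHETICAL_WEIGHTS,
                        bias=HYPOTHETICAL_BIAS):
    """Logistic model: map boolean features to a probability in (0, 1)."""
    z = bias + sum(weights.get(name, 0.0) * float(value)
                   for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


def rank_by_retweet_probability(candidates):
    """Sort candidate tweets by decreasing estimated retweet probability."""
    return sorted(candidates,
                  key=lambda c: retweet_probability(c["features"]),
                  reverse=True)


candidates = [
    {"id": "plain", "features": {"has_url": False}},
    {"id": "with_url", "features": {"has_url": True}},
]
for tweet in rank_by_retweet_probability(candidates):
    print(tweet["id"], round(retweet_probability(tweet["features"]), 3))
```

The appeal of this setup is that the training labels come for free: whether a tweet was retweeted can be read off the data, so no manual labeling is needed.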

The Time Element

Twitter is often used to find breaking news: People will tweet about any important event before it is picked up by news outlets. As a result, in many typical searches, only very recent tweets are relevant. This assumption was criticized by many people: non-realtime search is actually harder, because the pool of potentially interesting tweets is much larger. Nevertheless, we argued about renaming the Microblog track to the Real-time Search track. Let’s see how the final decision will look.

Another issue related to time was the evaluation measure: The rules of the challenge did not make clear whether to rank results by time or by relevance. A new evaluation will therefore be made by the organizers and published in December. Next year’s challenge will very probably only require scores to be submitted for a given list of recent tweets, without any ranking necessary.

Short vs Long Queries

The current Twitter search function is optimized for one-word real-time searches.  Should the Challenge be about the search as it is used on Twitter or about how search should be possible on Twitter?  The consensus seems to be that we also want to support longer queries.  Incidentally, these are the queries for which most participants’ systems performed worse!  So there is lots of work to do in that area.

Deemphasize the Challenge

Many challenges in information retrieval and related areas are based around a competition: Each team submits a solution, and the organizers compute a score for everyone and declare a winner. At TREC, things are completely different. TREC is about the methods, and accordingly there is no focus on winning or losing. For instance, each team may use external data but must declare it. A team that doesn’t do so will perform worse in the overall comparison, but its results will still be valued because of the constraint. This attitude also makes cheating irrelevant, since nobody officially wins. This makes the organization much easier, since data doesn’t have to be kept secret.


