# How to Crawl Followers on Twitter to Approximate Potential Influence

Posted Apr 26 2011 11:23 AM

Suppose you want to approximate someone's influence based on their popularity and the popularity of their followers. The following excerpt from the O'Reilly publication *21 Recipes for Mining Twitter* can help.

You could use a breadth-first traversal to crawl the followers of the user to a reasonable depth, and then count the number of nodes in the graph.

A breadth-first traversal is a common technique for traversing a graph, the implicit data structure that underlies social networks. Given a queue, Q1, containing one or more seed nodes, a breadth-first search systematically visits all of the adjacent nodes (neighbors) for these nodes and places them in another queue, Q2. When Q1 becomes empty, it means that all of these nodes have been visited, and the process repeats itself for the nodes in Q2, with Q1 now being used to keep track of neighbors. Once a suitable depth has been reached, the traversal terminates. A breadth-first traversal is easy to implement, and the neighbors for each node can be stored on disk and later analyzed as a graph. The two characteristics that govern the space complexity of a breadth-first traversal are the depth of the traversal and the average branching factor of each node in the graph. The number of followers for Twitter users varies wildly; all it takes to dramatically affect the average branching factor is a single very popular follower, so it would be wise to set an upper threshold.
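The two-queue traversal described above can be sketched in a few lines. In this sketch, `get_followers` is a hypothetical stand-in for whatever fetches a node's neighbors (for Twitter, a paginated followers/ids call), and `branching_cap` implements the upper threshold the text recommends:

```python
def bfs_followers(get_followers, seed_ids, max_depth=2, branching_cap=1000):
    """Two-queue breadth-first traversal to a fixed depth.

    `get_followers` is an assumed callable: node id -> list of neighbor ids.
    """
    visited = set(seed_ids)   # every node seen so far
    q1 = list(seed_ids)       # current frontier (Q1 in the text)
    depth = 0
    while q1 and depth < max_depth:
        q2 = []               # next frontier (Q2 in the text)
        for node in q1:
            # Cap the branching factor so one very popular node
            # doesn't blow up the traversal.
            for neighbor in get_followers(node)[:branching_cap]:
                if neighbor not in visited:
                    visited.add(neighbor)
                    q2.append(neighbor)
        q1 = q2               # Q2 becomes the new Q1
        depth += 1
    return visited
```

For example, with a toy graph stored as a dict, `bfs_followers(lambda n: graph.get(n, []), [seed], max_depth=2)` returns the seed, its followers, and its followers' followers.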

Example 1-27 illustrates an approach for crawling a user’s followers. It uses a `get_all_followers_ids` function that takes into account exceptional circumstances, and uses this function in `crawl_followers`—a typical implementation of breadth-first search.

Example 1-27. Crawling a friendship graph (see http://oreilly.com/c...g/9781449303167)

```python
# -*- coding: utf-8 -*-

import sys
import redis

from recipe__make_twitter_request import make_twitter_request
from recipe__oauth_login import oauth_login
from recipe__setwise_operations import get_redis_id

def crawl_followers(t, r, follower_ids, limit=1000000, depth=2):

    # Helper function

    def get_all_followers_ids(user_id, limit):

        cursor = -1
        ids = []
        while cursor != 0:

            response = make_twitter_request(t, t.followers.ids,
                                            user_id=user_id, cursor=cursor)

            if response is not None:
                ids += response['ids']
                cursor = response['next_cursor']

            print >> sys.stderr, 'Fetched %i total ids for %s' % (len(ids), user_id)

            # Consider storing the ids to disk during each iteration to provide
            # an additional layer of protection from exceptional circumstances.

            if len(ids) >= limit or response is None:
                break

        return ids

    for fid in follower_ids:

        next_queue = get_all_followers_ids(fid, limit)

        # Store a fid => next_queue mapping in Redis or other database of choice.
        # In Redis, it might look something like this:

        rid = get_redis_id('follower_ids', user_id=fid)
        [ r.sadd(rid, _id) for _id in next_queue ]

        d = 1
        while d < depth:
            d += 1
            (queue, next_queue) = (next_queue, [])
            for _fid in queue:
                _follower_ids = get_all_followers_ids(user_id=_fid, limit=limit)

                # Store a _fid => _follower_ids mapping in Redis or other
                # database of choice. In Redis, it might look something like this:

                rid = get_redis_id('follower_ids', user_id=_fid)
                [ r.sadd(rid, _id) for _id in _follower_ids ]

                next_queue += _follower_ids

if __name__ == '__main__':

    SCREEN_NAME = sys.argv[1]

    # Remember to pass in keyword parameters if you don't have a
    # token file stored on disk already

    t = oauth_login()

    # Resolve the id for SCREEN_NAME

    _id = str(t.users.show(screen_name=SCREEN_NAME)['id'])

    crawl_followers(t, redis.Redis(), [_id])

    # The total number of nodes visited represents one measure of potential
    # influence. You can also use the user => follower ids information to
    # create a graph for analysis.
```

The code illustrates how you could use Redis to store the follower `ids` for users encountered during the breadth-first traversal. However, you could just as easily use any other storage medium. Once your traversal has completed, the total number of nodes in the graph is one indicator of the user's potential influence. For example, if you were given a user `id` and traversed its follower graph to a depth of 1, you'd have a hub-and-spoke graph that represents the seed node and its adjacent neighbors. A depth of 2 would represent the user, the user's followers, and the user's followers' followers. Obviously, the higher the depth, the higher the number of nodes in the graph and the higher the potential influence. However, it might be the case that the further out you traverse the graph, the more diverse the users become and the less likely it is that a retweet would occur. It would be interesting to conduct a rigorous experiment to investigate this hypothesis.
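As a toy illustration of the counting step, suppose the crawl results live in an in-memory dict mapping each crawled user id to the set of follower ids fetched for it (a hypothetical stand-in for the Redis sets the recipe writes). The node count is then just the size of the union of the seed with all the stored sets:

```python
def count_reachable(follower_sets, seed_id):
    """Approximate potential influence as the number of distinct
    nodes recorded during the crawl.

    `follower_sets` maps a crawled user id -> set of its follower ids
    (an assumed layout mirroring the recipe's per-user Redis sets).
    """
    nodes = {seed_id}
    for followers in follower_sets.values():
        nodes.update(followers)
    return len(nodes)
```

With Redis itself, the same number falls out of a `SUNION` across the stored keys followed by a length check, which is essentially what the book's set-wise operations recipe does.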
