Jump to content

How to Crawl Followers on Twitter to Approximate Potential Influence

0
  chco's Photo
Posted Apr 26 2011 11:23 AM

Suppose you want to approximate someone’s influence based upon their popularity and the popularity of their followers. The following excerpt from the O'Reilly publication 21 Recipes for Mining Twitter can help.

You could use a breadth-first traversal to crawl the followers of the user to a reasonable depth, and then count the number of nodes in the graph.

A breadth-first traversal is a common technique for traversing a graph, the implicit data structure that underlies social networks. Given a queue, Q1, containing one or more seed nodes, a breadth-first search systematically visits all of the adjacent nodes (neighbors) for these nodes and places them in another queue, Q2. When Q1 becomes empty, it means that all of these nodes have been visited, and the process repeats itself for the nodes in Q2, with Q1 now being used to keep track of neighbors. Once a suitable depth has been reached, the traversal terminates. A breadth-first traversal is easy to implement, and the neighbors for each node can be stored on disk and later analyzed as a graph. The two characteristics that govern the space complexity of a breadth-first traversal are the depth of the traversal and the average branching factor of each node in the graph. The number of followers for Twitter users varies wildly; all it takes to dramatically effect the average branching factor is a very popular follower, so it would be wise to set an upper threshold.

Example 1-27 illustrates an approach for crawling a user’s followers. It uses a get_all_followers_ids function that takes into account exceptional circumstances, and uses this function in crawl_followers—a typical implementation of breadth-first search.

Example 1-27. Crawling a friendship graph (see http://oreilly.com/c...g/9781449303167)

# -*- coding: utf-8 -*-

import sys
import redis
from recipe__make_twitter_request import make_twitter_request
from recipe__setwise_operations import get_redis_id
from recipe__oauth_login import oauth_login


def crawl_followers(t, r, follower_ids, limit=1000000, depth=2):

    # Helper function

    def get_all_followers_ids(user_id, limit):

        cursor = -1
        ids = []
        while cursor != 0:

            response = make_twitter_request(t, t.followers.ids,
                                            user_id=user_id, cursor=cursor)

            if response is not None:
                ids += response['ids']
                cursor = response['next_cursor']

            print >> sys.stderr, 'Fetched %i total ids for %s' % (len(ids), user_id)

            # Consider storing the ids to disk during each iteration to provide an 
            # an additional layer of protection from exceptional circumstances.

            if len(ids) >= limit or response is None:
                break

        return ids

    for fid in follower_ids:

        next_queue = get_all_followers_ids(fid, limit)

        # Store a fid => next_queue mapping in Redis or other database of choice
        # In Redis, it might look something like this:

        rid = get_redis_id('follower_ids', user_id=fid)
        [ r.sadd(rid, _id) for _id in next_queue ]

        d = 1
        while d < depth:
            d += 1
            (queue, next_queue) = (next_queue, [])
            for _fid in queue:
                _follower_ids = get_all_followers_ids(user_id=_fid, limit=limit)

                # Store a fid => _follower_ids mapping in Redis or other 
                # database of choice. In Redis, it might look something like this:

                rid = get_redis_id('follower_ids', user_id=fid)
                [ r.sadd(rid, _id) for _id in _follower_ids ] 

                next_queue += _follower_ids

if __name__ == '__main__':

    SCREEN_NAME = sys.argv[1]

    # Remember to pass in keyword parameters if you don't have a
    # token file stored on disk already

    t = oauth_login()

    # Resolve the id for SCREEN_NAME

    _id = str(t.users.show(screen_name=SCREEN_NAME)['id'])

    crawl_followers(t, redis.Redis(), [_id])

    # The total number of nodes visited represents one measure of potential influence.
    # You can also use the user => follower ids information to create a 
    # graph for analysis.


The code illustrates how you could use Redis to store the follower ids for users encountered during the breadth-first traversal. However, you could just as easily use any other storage medium. Once your traversal has completed, the total number of nodes in the graph is one indicator of the user’s potential influence. For example, if you were given a user id and traversed its follower graph to a depth of 1, you’d have a hub and spoke graph that represents the seed node and its adjacent neighbors. A depth of 2 would represent the user, the user’s followers, and the user’s followers’ followers. Obviously, the higher the depth, the higher the number of nodes in the graph and the higher the potential influence. However, it might be the case that the further out you traverse the graph, the more diverse the users become and the less likely it is that a retweet would occur. It would be interesting to hold a rigorous experiment to investigate this hypothesis.

21 Recipes for Mining Twitter

Learn more about this topic from 21 Recipes for Mining Twitter.

Millions of public Twitter streams harbor a wealth of data, and once you mine them, you can gain some valuable insights. This short and concise book offers a collection of recipes to help you extract nuggets of Twitter information using easy-to-learn Python tools. Each recipe offers a discussion of how and why the solution works, so you can quickly adapt it to fit your particular needs.

See what you'll learn


Tags:
0 Subscribe


0 Replies