
Is there an easy way to visualize the Trending Topic tags found on O'Reilly Answers?

MikeH
Posted Nov 24 2009 11:42 AM

So I did the wordle below the hard way. I grabbed all the text on the O'Reilly Answers > Trending Topics > More... page found here, and used Excel to get a numbered list of topic tags. I then pasted them into http://www.wordle.net/create to produce this wordle. But this process took some time: at least two meetings' worth of cutting and pasting. Are there any hacks to make this quick and easy? Some scraping code or something?

[Image: Wordle of the Trending Topics tags]



1 Reply

Util
Posted Nov 26 2009 08:17 AM

Quick and Easy == Perl!

This program runs in under 5 seconds, and outputs the word list in the format that Wordle expects. The final copy-and-paste is left to you.

If you are on OS X, you can skip the "copy" by piping the output directly into your clipboard with `| pbcopy`.

#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request;    # used directly below, so load it explicitly

my $VERSION  = 0.1;
my $BOT_NAME = "O'Reilly_trending_topics_to_wordle_app";
my $URL_BASE = 'http://answers.oreilly.com/index.php?app=tags&module=tags&filter=alpha';
my $UA = LWP::UserAgent->new() or die;

# Fetch one page of the alphabetical tag listing and return its HTML.
sub get_topics_listing_page {
    my ($page_num) = @_;

    my $request = HTTP::Request->new( GET => "$URL_BASE&page=$page_num" );

    my $response = $UA->request($request);
    die "Could not get page $page_num: " . $response->status_line . "\n"
        if not $response->is_success;

    return $response->content;
}

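# Pull every tag name and its count out of one page of HTML.
# Returns a list of hashrefs: { TOPIC => ..., COUNT => ... }.
sub extract_topics_from_page {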
    my ($html) = @_;

    my $topic_re = qr{
        <li>
            <a [ ] href='http://answers\.oreilly\.com/tag/[^']+'
               [ ] class='tag'
            >
              ( [^<]+ )
            </a>
            \s*
            \( ( \d+ ) \)
        </li>
    }x;

    my @topics;
    for my $line ( split "\n", $html ) {
        push @topics, { TOPIC => $1, COUNT => $2 } while $line =~ /$topic_re/g;
    }

    return @topics;
}

$UA->timeout(2);
$UA->agent("$BOT_NAME/$VERSION");

print STDERR "Getting first page.\n";
my $first_page = get_topics_listing_page(1);

my $url_re  = quotemeta $URL_BASE;
my $last_re = qr{<a [ ] href="$url_re&page=(\d+)">Last</a>}msx;

$first_page =~ /$last_re/
  or die "Could not find Last link in first page";
my $last_page_num = $1;
print STDERR "Expecting $last_page_num total pages.\n";


my @topics = extract_topics_from_page($first_page);

print STDERR "Getting page ";
for my $page_num ( 2 .. $last_page_num ) {
    print STDERR "$page_num ";
    my $page = get_topics_listing_page($page_num);

    push @topics, extract_topics_from_page($page);
}
print STDERR "Done.\n";


print STDERR "Paste this list into the first textbox of http://www.wordle.net/advanced :\n";
print "$_->{TOPIC}:$_->{COUNT}\n" for @topics;
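
If you want to check the extraction step without hitting the site, here is a minimal sketch that runs the tag pattern from the script over a couple of hard-coded lines. The sample HTML and the `wordle_lines` helper are my own inventions for illustration; the markup is only modelled on what the regex expects, so the real pages may differ.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample markup, shaped to fit the pattern below.
# The real listing pages may not look exactly like this.
my $sample_html = join "\n",
    q{<ul>},
    q{<li><a href='http://answers.oreilly.com/tag/perl' class='tag'>perl</a> (12)</li>},
    q{<li><a href='http://answers.oreilly.com/tag/iphone' class='tag'>iphone</a> (7)</li>},
    q{</ul>};

# Tag pattern taken from extract_topics_from_page in the script.
# Under /x, plain whitespace is ignored, so "[ ]" is how a literal space is matched.
my $topic_re = qr{
    <li>
        <a [ ] href='http://answers\.oreilly\.com/tag/[^']+'
           [ ] class='tag'
        >
          ( [^<]+ )
        </a>
        \s*
        \( ( \d+ ) \)
    </li>
}x;

# Hypothetical helper: returns one "topic:count" string per tag found.
sub wordle_lines {
    my ($html) = @_;
    my @lines;
    for my $line ( split /\n/, $html ) {
        push @lines, "$1:$2" while $line =~ /$topic_re/g;
    }
    return @lines;
}

print "$_\n" for wordle_lines($sample_html);
```

With that sample input it prints `perl:12` and `iphone:7`, one per line, which is the `word:weight` shape the script emits for the Wordle advanced page.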