Is there an easy way to visualize the Trending Topic tags found on O'Reilly Answers?
So I did the wordle below the hard way. I grabbed all the text found on the O'Reilly Answers > Trending Topics > More... page found here, and used Excel to get a numbered list of topic tags. I then pasted them into http://www.wordle.net/create to produce this wordle. But this process took some time, at least two meetings' worth of cutting and pasting. Are there any hacks to make this quick and easy? Some scraping code or something?
1 Reply
Quick and Easy == Perl!
This program runs in under 5 seconds, and outputs the word list in the format that Wordle expects. The final copy-and-paste is left to you. If you are on OS X, you can skip the "copy" by piping the output directly into your clipboard with `| pbcopy`.
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request;
my $VERSION = 0.1;
my $BOT_NAME = "O'Reilly_trending_topics_to_wordle_app";
my $URL_BASE = 'http://answers.oreilly.com/index.php?app=tags&module=tags&filter=alpha';
my $UA = LWP::UserAgent->new() or die;
# Fetch one page of the tag listing and return its raw HTML.
sub get_topics_listing_page {
    my ($page_num) = @_;
    my $request  = HTTP::Request->new( GET => "$URL_BASE&page=$page_num" );
    my $response = $UA->request($request);
    die "Could not get page $page_num" if not $response->is_success;
    return $response->content;
}
# Pull every tag name and its count out of one page of HTML.
# Returns a list of hashrefs: { TOPIC => name, COUNT => number }.
sub extract_topics_from_page {
    my ($html) = @_;
    my $topic_re = qr{
        <li>
        <a [ ] href='http://answers\.oreilly\.com/tag/[^']+'
        [ ] class='tag'
        >
        ( [^<]+ )
        </a>
        \s*
        \( ( \d+ ) \)
        </li>
    }x;
    my @topics;
    for my $line ( split "\n", $html ) {
        push @topics, { TOPIC => $1, COUNT => $2 } while $line =~ /$topic_re/g;
    }
    return @topics;
}
$UA->timeout(2);
$UA->agent("$BOT_NAME/$VERSION");
print STDERR "Getting first page.\n";
my $first_page = get_topics_listing_page(1);
my $url_re = quotemeta $URL_BASE;
my $last_re = qr{<a [ ] href="$url_re&page=(\d+)">Last</a>}msx;
$first_page =~ /$last_re/
or die "Could not find Last link in first page";
my $last_page_num = $1;
print STDERR "Expecting $last_page_num total pages.\n";
my @topics = extract_topics_from_page($first_page);
print STDERR "Getting page ";
for my $page_num ( 2 .. $last_page_num ) {
    print STDERR "$page_num ";
    my $page = get_topics_listing_page($page_num);
    push @topics, extract_topics_from_page($page);
}
print STDERR "Done.\n";
print STDERR "Paste this list into the first textbox of http://www.wordle.net/advanced :\n";
print "$_->{TOPIC}:$_->{COUNT}\n" for @topics;
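If you want to see what the extraction pattern matches without hitting the site, here is a minimal offline sketch. The sample markup is hypothetical: I shaped it to fit the regex in the program above, and the real pages may well look different. The sub name `extract_topics` and the variable `$sample_html` are just names I made up for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample markup, shaped to fit the pattern used in the program.
# The tag names and counts are made up for illustration.
my $sample_html = <<'HTML';
<li><a href='http://answers.oreilly.com/tag/perl' class='tag'>perl</a> (12)</li>
<li><a href='http://answers.oreilly.com/tag/iphone' class='tag'>iphone</a> (7)</li>
HTML

# Same idea as extract_topics_from_page: capture the tag text and its count.
sub extract_topics {
    my ($html) = @_;
    my $topic_re = qr{
        <li>
        <a [ ] href='http://answers\.oreilly\.com/tag/[^']+'
        [ ] class='tag'
        >
        ( [^<]+ )      # $1: the tag name
        </a>
        \s*
        \( ( \d+ ) \)  # $2: the count in parentheses
        </li>
    }x;
    my @topics;
    push @topics, { TOPIC => $1, COUNT => $2 } while $html =~ /$topic_re/g;
    return @topics;
}

# Print in the word:weight form that the advanced Wordle page accepts.
print "$_->{TOPIC}:$_->{COUNT}\n" for extract_topics($sample_html);
# prints:
# perl:12
# iphone:7
```

If the real page uses double quotes or extra attributes on the links, the pattern will silently match nothing, so an empty list is the first thing to check when the output looks short.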