The vast majority of event information on the web is unavailable as structured data. The elmcity project tackles this problem by empowering curators to create syndicated networks of iCalendar feeds. But what about all those HTML pages that don't embed hCalendar or offer companion iCalendar feeds? We'd like to extract basic facts from them: event titles, dates and times, URLs. A service called FuseCal did exactly that. You could point it at a typical HTML events page, and it would create a corresponding iCalendar feed.
I'm not wild about HTML screen-scraping. 21st-century citizens need to learn, and practice, basic principles of data exchange. But meanwhile all those HTML event pages are out there, and it would be helpful to liberate the data they contain. Hence the elmcity project's fusecal subsystem, a tiny framework for writing Python plug-ins that parse HTML event pages and return structured events.
So far I've got plug-ins for three sources: MySpace, LibraryThing, and LibraryInsight. The first two are adapted from work done by students at the University of Toronto and Michigan State University as part of UCOSP (Undergraduate Capstone Open Source Projects).
The MySpace parser reads band pages, like http://www.myspace.com/jatobamusic, and parses the list of upcoming shows.
The LibraryThing parser could read pages like this events page for Toronto but, since the page provides a corresponding RSS feed which is easier to parse, it reads that instead. Note, by the way, that this is an example of the Right Thing / Wrong Way pattern. Providing a machine-readable data feed for an events page is the Right Thing. But when the items on that page are calendar events, burying the dates and times in the RSS Description field is doing it the Wrong Way. Embedded hCalendar, and/or a companion iCalendar feed, would be the Right Thing done the Right Way.
The third plug-in, which reads pages like this one on my local library's website, demonstrates a third strategy. The MySpace parser is a conventional HTML scraper, using Leonard Richardson's wonderful BeautifulSoup library to untangle the HTML. The LibraryThing parser uses xml.dom.minidom to get at the RSS Description, and then grovels around inside that looking for dates and times. The LibraryInsight parser, however, makes use of the fact that the details view for each event links to an iCalendar representation of the event. Why doesn't LibraryInsight just bundle them all together into an iCalendar feed? Beats me, but anyway that's what this elmcity plug-in does.
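To make the second strategy concrete, here is a minimal sketch of pulling dates out of an RSS description with xml.dom.minidom and a regular expression. It is not the actual LibraryThing parser: the sample feed, the wording of the date inside the description, and the helper names text_of and parse_feed are all assumptions made for illustration. It also assumes English month names.

```python
import datetime
import re
from xml.dom import minidom

# Hypothetical feed: the element layout and the date wording are assumptions.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0"><channel>
<item>
  <title>Author reading</title>
  <link>http://example.com/event/1</link>
  <description>Tuesday, August 17, 2010 at 7:00 pm. Free admission.</description>
</item>
<item>
  <title>Book sale</title>
  <link>http://example.com/event/2</link>
  <description>No date given here.</description>
</item>
</channel></rss>"""

# Matches text like "August 17, 2010 at 7:00 pm".
DATE_RE = re.compile(r'([A-Z][a-z]+ \d{1,2}, \d{4}) at (\d{1,2}:\d{2} [ap]m)')

def text_of(item, tag):
    """Return the text content of the first <tag> element inside item."""
    node = item.getElementsByTagName(tag)[0]
    return ''.join(n.data for n in node.childNodes if n.nodeType == n.TEXT_NODE)

def parse_feed(rss_text):
    """Return (title, start, url) tuples for items whose description has a date."""
    events = []
    dom = minidom.parseString(rss_text)
    for item in dom.getElementsByTagName('item'):
        match = DATE_RE.search(text_of(item, 'description'))
        if match is None:
            continue  # no recognizable date, so no event
        start = datetime.datetime.strptime(
            '%s %s' % match.groups(), '%B %d, %Y %I:%M %p')
        events.append((text_of(item, 'title'), start, text_of(item, 'link')))
    return events
```

Items without a recognizable date are skipped rather than guessed at, which is usually the safer choice when the date lives in free text.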
Here's what Hello World looks like in this environment:
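import datetime, traceback
from ElmcityEventParser import *  # reconstructed: the exact import form is assumed
class HelloWorldParser(ElmcityEventParser):  # reconstructed class header
    def Parse(self):
        try:
            evt = ElmcityEventParser.Event()
            evt.title = "Hello World"
            evt.start = datetime.datetime.now()
            evt.url = ''
            self.events.append(evt)  # reconstructed from the description that follows
            return self.BuildICS()   # reconstructed from the description that follows
        except:
            self.LogMsg('exception', 'HelloWorldParser.Parse', traceback.format_exc())
parser = HelloWorldParser(url=None)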
Your plugin defines a class that inherits from ElmcityEventParser. It implements a Parse method that, in a real plug-in, fetches the page at self.url, calls a worker method, and then calls the inherited BuildICS method. Your worker method creates one or more ElmcityEventParser.Event objects, fills them with titles, starting date/time values, and per-event URLs, then appends to self.events.
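If you'd like to try that contract without the elmcity modules at hand, the following sketch may help. The ElmcityEventParser class here is a hypothetical stand-in written for this example, not the real base class; the worker method name ParsePage and the FixedEventsParser class are likewise assumptions. The stand-in BuildICS returns strings instead of printing them so the result is easy to inspect.

```python
import datetime

class ElmcityEventParser(object):
    """Hypothetical stand-in for the real base class, for local experiments only."""

    class Event(object):
        def __init__(self):
            self.title = None
            self.start = None
            self.url = None

    def __init__(self, url):
        self.url = url
        self.events = []

    def BuildICS(self):
        # Stand-in behavior: describe the parsed events; no iCalendar is produced.
        return ['%s, %s, %s' % (e.title, e.start, e.url) for e in self.events]

class FixedEventsParser(ElmcityEventParser):
    def Parse(self):
        self.ParsePage()          # worker method (name assumed)
        return self.BuildICS()    # inherited

    def ParsePage(self):
        # A real plug-in would fetch and parse the page at self.url here.
        evt = ElmcityEventParser.Event()
        evt.title = 'Hello World'
        evt.start = datetime.datetime(2010, 8, 17, 10, 25)
        evt.url = 'http://example.com/hello'
        self.events.append(evt)
```

Calling FixedEventsParser(url=None).Parse() returns a one-item list describing the event.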
If you're running CPython, the HelloWorld test looks like this:
>>> import helloworld
<Event: Hello World, 2010-08-17 10:25:07.837000, UTC: False>
If you try the real plug-ins, you'll see similar results. Where's the iCalendar output? You won't see that unless you're running IronPython, which is how the elmcity service itself runs the plug-in code.
Python, IronPython, and Azure
You can use either flavor of Python in the Azure environment: CPython or IronPython. In this case I'm using IronPython so that I can make use of Doug Day's excellent DDay.iCal, which is a .NET-based iCalendar library, as well as all of the elmcity project's own .NET componentry. But I'm assuming that most people who will want to write plug-ins will prefer CPython. So the ElmcityEventParser module defines a couple of methods in two different ways. For example:
# IronPython flavor: write to the elmcity service's log
def LogMsg(self,category=None, message=None, details=None):
    ElmcityUtils.GenUtils.LogMsg(category, message, details)

# CPython flavor: write to the console
def LogMsg(self,category=None, message=None, details=None):
    print '%s, %s, %s' % ( category, message, details )
What this means is that when you're testing a plug-in using CPython, LogMsg writes to the console. But when your plug-in is deployed on Azure, it writes to the elmcity service's log.
BuildICS is also defined two ways. In CPython it just prints out the parsed events. In IronPython it uses DDay.iCal to render them as iCalendar. What if you want to produce iCalendar using CPython? There are several Python iCalendar libraries that can do that.
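For a sense of what such output involves, here is a hand-rolled sketch that renders events as a bare-bones iCalendar document using only the standard library. It is an illustration, not a substitute for one of those libraries: it skips line folding and time zone components, and the function names, the PRODID value, and the tuple layout are assumptions made for this example.

```python
import datetime

def escape_text(value):
    """Escape the characters that RFC 5545 reserves in TEXT values."""
    return (value.replace('\\', '\\\\').replace(';', '\\;')
                 .replace(',', '\\,').replace('\n', '\\n'))

def render_ics(events, stamp):
    """Render (uid, title, start, url) tuples; stamp is the creation time in UTC."""
    fmt = '%Y%m%dT%H%M%S'
    lines = ['BEGIN:VCALENDAR', 'VERSION:2.0', 'PRODID:-//example//sketch//EN']
    for uid, title, start, url in events:
        lines += [
            'BEGIN:VEVENT',
            'UID:' + uid,
            'DTSTAMP:' + stamp.strftime(fmt) + 'Z',  # trailing Z marks UTC
            'DTSTART:' + start.strftime(fmt),        # no Z: local ("floating") time
            'SUMMARY:' + escape_text(title),
            'URL:' + url,
            'END:VEVENT',
        ]
    lines.append('END:VCALENDAR')
    # iCalendar content lines end with CRLF.
    return '\r\n'.join(lines) + '\r\n'
```

Note the escaping of commas in SUMMARY, which matters once titles carry locations like Brattleboro, VT.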
Your parser also inherits another doubly-defined method, ParseDateTime. This one shouldn't have to exist. But there are a few places where Python and IronPython don't precisely align, and datetime.datetime.strptime() is one of them, at least for now. So it's wrapped in ParseDateTime, which takes the same kind of format string that CPython expects. When the service runs your plug-in, it maps ParseDateTime and its format string to a .NET date parser and a .NET-style format string.
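The format-string mapping might look something like the sketch below. This is not the service's actual code: the table covers only a handful of common directives, and the names STRPTIME_TO_DOTNET and to_dotnet_format are hypothetical. Literal letters are wrapped in single quotes because many bare letters are themselves .NET format specifiers. The two parsers may also differ in how strictly they treat zero padding, which the sketch does not address.

```python
# Partial mapping from strptime directives to .NET custom format specifiers.
STRPTIME_TO_DOTNET = {
    '%Y': 'yyyy', '%y': 'yy', '%m': 'MM', '%d': 'dd',
    '%B': 'MMMM', '%b': 'MMM', '%A': 'dddd', '%a': 'ddd',
    '%H': 'HH', '%I': 'hh', '%M': 'mm', '%S': 'ss', '%p': 'tt',
}

def to_dotnet_format(py_format):
    """Translate a strptime-style format string into a .NET-style one."""
    out = []
    i = 0
    while i < len(py_format):
        token = py_format[i:i + 2]
        if token in STRPTIME_TO_DOTNET:
            out.append(STRPTIME_TO_DOTNET[token])
            i += 2
        elif py_format[i] == '%':
            # Fail loudly rather than emit a format that parses the wrong thing.
            raise ValueError('unsupported directive: %r' % token)
        elif py_format[i].isalpha():
            out.append("'%s'" % py_format[i])  # quote literal letters
            i += 1
        else:
            out.append(py_format[i])  # punctuation and spaces pass through
            i += 1
    return ''.join(out)
```

For example, to_dotnet_format('%Y-%m-%d %H:%M') gives 'yyyy-MM-dd HH:mm'.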
Note that IronPython can load and run pure Python modules but not C-based Python modules. So far that hasn't been a problem. In addition to what's in Python's standard library, modules deployed to Azure for the elmcity service include BeautifulSoup, icalendar, and minidom. I'll gladly add more as needed. And if that won't work, there's always plan B: use CPython on Azure.
Here are a couple of tips to keep in mind.
Try virtuous laziness first
Always remember that the best parser is one you don't have to write. If there's a rich event source for which you'd like to write a parser, try sending this email first:
From: Your Friendly Data Liberator
Re: Your events page
I am about to write a parser to make the events on your page accessible in iCalendar format. That will enable people to load the events into personal calendars, and will also enable the events to flow through syndicated calendar networks.
Of course if your service provides an iCalendar feed, I wouldn't have to create one. Is it possible that it exists but I've overlooked it? Or, if it doesn't exist, might you consider providing it?
Your Friendly Data Liberator
To be honest I haven't had much luck with this approach so far. But it never hurts to ask!
Pick juicy targets
You're welcome to write a one-off parser for a specialty web page that's full of events that matter to you. But in general I'd encourage folks to look for general patterns that will work for many hubs, and maybe even for multiple feeds within a hub.
For example, there are lots of hubs that can use the LibraryThing parser. It doesn't find any events for Keene yet, but the LibraryThing RSS feed for many larger towns does yield events. Similarly, any location whose public library runs the LibraryInsight service should be able to use the LibraryInsight parser.
The MySpace parser is even more general. For a given location there may be multiple bands whose schedules intersect with that location. Of course bands rarely appear in only a single location, so you'll want to construct events in a way that makes it easy to filter them to a single place (see below: Titles, locations, and filters).
It's always challenging to deal with time zones, but ideally you won't need to. All the event pages I've looked at assume local time. When the service emits the iCalendar feed based on your parsed event stream, it will apply the timezone setting defined by the hub's curator, and include the appropriate iCalendar VTIMEZONE component.
It's possible that an event source might use UTC. LibraryInsight's iCalendar events, for example, look like this:
LOCATION:Kay Fox Room
SUMMARY:Forever Free Exhibit
UID:LibraryInsight.com-Keene Public Library-281932
In such cases, you can tell the service to expect UTC like so:
evt.start_is_utc = True
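In iCalendar, a date-time value that ends in Z is in UTC, so a plug-in that reads iCalendar data can decide the flag from the value itself. Here is a minimal sketch; the helper name parse_ics_datetime and the sample value in the usage note are made up for illustration.

```python
import datetime

def parse_ics_datetime(value):
    """Return (naive datetime, is_utc) for an iCalendar DATE-TIME value."""
    is_utc = value.endswith('Z')
    if is_utc:
        value = value[:-1]  # drop the UTC marker before parsing
    return datetime.datetime.strptime(value, '%Y%m%dT%H%M%S'), is_utc
```

A plug-in could then write start, is_utc = parse_ics_datetime('20100208T190000Z'), assign start to evt.start, and assign is_utc to evt.start_is_utc.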
Titles, locations, and filters
The elmcity service's notion of an event is aggressively minimal: just a title, a starting date/time, and a link back to the authoritative source. So you should always populate evt.title, evt.start, and evt.url.
Location is a slipperier concept, and the service doesn't even try to deal with it. The strategy is to Do The Simplest Thing That Could Possibly Work, which in this case means:
1. Gather as many events from as many sources as possible.
2. Arrange them on a timeline.
3. Use titles to make the timeline scannable and searchable.
4. Use tags, when available, to support views of the timeline.
5. Use links to point back to authoritative sources where details, in a hodge-podge of styles and formats, can be found.
HTML event pages often don't report locations on a per-event basis. That can happen, for example, when the site's context presumes a location, like LibraryThing events for Toronto. But when a source does call out a location separately from the event title, as with MySpace and LibraryInsight, I recommend that you append the location to the title, like so:
evt.title += ', ' + evt.location
For the LibraryInsight event shown above, that yields the title Forever Free Exhibit, Kay Fox Room in the elmcity service's default rendering.
Finally, note that if your plug-in is called from the elmcity service with a filter string, the list of events will be restricted to those whose titles include the filter string. Consider the upcoming shows listed on Jatoba's MySpace page.
The MySpace parser merges titles, like The Mole's Eye, with locations, like Brattleboro, VT, to produce combined titles like The Mole's Eye, Brattleboro, VT. Currently there are no upcoming Keene shows, so when referenced from the Keene hub using the filter string Keene the plug-in returns no events. But it could be referenced using Peterborough or Burlington to capture events for just those towns.
So when you construct event titles, think about how you want the events to appear in listings. But also think about how curators will want to filter those events based on strings appearing in the titles.
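Here is a small sketch of how merged titles and a filter string interact. It assumes the filter is a plain, case-sensitive substring match, which may not be exactly how the service compares strings. The function names are hypothetical, and the Peterborough venue in the usage note is invented.

```python
def merged_title(title, location):
    """Append the location to the title when the source provides one."""
    return title + ', ' + location if location else title

def filter_by_title(titles, filter_string=None):
    """Keep only titles containing filter_string; no filter keeps everything."""
    if not filter_string:
        return list(titles)
    return [t for t in titles if filter_string in t]
```

Given the titles The Mole's Eye, Brattleboro, VT and Example Hall, Peterborough, NH, filtering on Keene returns nothing, while filtering on Peterborough returns only the second.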
For more information
If you're interested in writing an elmcity plug-in, feel free to contact me directly.