The elmcity service makes extensive use of the Azure table service. Why use this cloud-based key/value store instead of the corresponding SQL-based offering, SQL Azure? The standard answer is that NoSQL scales, and that's true, but it's not the only answer. Sometimes NoSQL is just an easy way to park a hashtable in the cloud.

The venerable hashtable, aka dictionary, has long been my favorite all-purpose data structure. Even in C#, where fully-defined types are the norm, and where powerful ORMs (object-relational mappers) connect the language to SQL stores, it's still easier to dump things into a dictionary.

One way to park that dictionary in the cloud is to pickle (serialize) it as a binary object and post it as a blob. The elmcity service does that regularly, using the Azure blob service. Another way leverages the natural correspondence between a dictionary, which is a bag of named properties, and an entity stored in the Azure table service, which is also a bag of named properties. The elmcity service uses that strategy a lot too.
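The pickle-it-as-a-blob idea is easy to see in miniature. Here's a sketch in Python rather than the service's C# (keys and values are invented for illustration), showing the round trip from property bag to binary object and back:

```python
import pickle

# A hub's property bag -- analogous to the C# dictionary elmcity uses.
# Keys and values here are invented for illustration.
metadata = {"where": "keene,nh,us", "contact": "curator@example.com", "events": 986}

# Pickle (serialize) the dictionary into a binary object. In the real
# service, a byte string like this is what gets posted to the blob store.
blob = pickle.dumps(metadata)

# Deserializing the blob recovers the original property bag intact.
restored = pickle.loads(blob)
```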
One reason to use the table service, versus the blob service, is that the table service enables you to query for sets of entities. Another reason is that the results of those queries are OData feeds that you can work with directly using generic tools. One such tool is Excel which, when coupled with PowerPivot (a free add-in for Excel 2010), can help you visualize query results. In this week's companion article on the Radar blog, I chart the numbers of events flowing into the elmcity hubs from five kinds of sources. Here I'll show how the data behind that chart flows from C# to the Azure table service to Excel.
The metadata property bag as a C# dictionary and an Azure entity
For each elmcity hub, there's a set of entities stored in an Azure table named metadata. All but one of them represent the iCalendar feeds associated with the hub. The remaining one is special: it's a property bag that's been expanding as I discover new things I want to keep track of for each hub. Here's what it looks like when represented in C# and displayed in the Visual Studio debugger:
And here's what it looks like when stored in an Azure table and displayed using Cerebrata's Cloud Storage Studio:
Until last week I was only tracking the total number of events flowing into each hub. When I decided to look in more detail at the five tributaries, I added these attributes to each hub's Azure entity:
This required no database administration. I just modified the aggregator to inject these new properties into the C# dictionary representing each hub, and then merge the dictionaries into their corresponding Azure entities.
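In dictionary terms, that change is just a merge. Here's a minimal sketch in Python rather than the service's C#, with invented property names and counts:

```python
# The hub's existing metadata entity, represented as a property bag.
entity = {"PartitionKey": "elmcity", "RowKey": "elmcity", "events": 986}

# New per-source counters injected by the aggregator (names and
# values invented for illustration).
new_props = {"eventful_events": 64, "upcoming_events": 38, "eventbrite_events": 5}

# Merging is just a dictionary update; no schema migration required.
# The Azure table service offers the same semantics via its Merge
# Entity operation.
entity.update(new_props)
```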
Querying for hub metadata
Although the Azure SDK exposes the table store using the conventions of ADO.NET Data Services and LINQ, the underlying machinery is RESTful. For the elmcity project I made an alternate interface that talks directly to the RESTful core.
Either way, retrieving a hub's metadata entity boils down to this HTTP request, when the hub's ID is elmcity (i.e., Keene):
This works because my convention is to use the ID as the value for both halves of the dual key that indexes entities in an Azure table. Similarly, here's the query to fetch the rest of the hub's entities, the ones that represent iCalendar feeds:
(Note that although these are the actual URLs, they're protected so you can't click through from here to the data.)
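The shape of those requests follows the Azure table service's REST conventions. Here's a sketch of how the two kinds of URL could be built; the storage account name and table name are assumptions for illustration:

```python
from urllib.parse import quote

ACCOUNT = "elmcity"   # storage account name: an assumption for illustration
TABLE = "metadata"
BASE = f"https://{ACCOUNT}.table.core.windows.net/{TABLE}"

def entity_url(hub_id):
    # Single-entity GET: by the convention described above, PartitionKey
    # and RowKey are both the hub's ID.
    return f"{BASE}(PartitionKey='{hub_id}',RowKey='{hub_id}')"

def feeds_url(hub_id):
    # Query for the hub's other entities -- the iCalendar feeds: same
    # partition, excluding the special metadata row.
    flt = f"PartitionKey eq '{hub_id}' and RowKey ne '{hub_id}'"
    return f"{BASE}()?$filter=" + quote(flt)
```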
For this example, we want to fetch the metadata entities for each hub. Or rather, since there are two classes of hub -- one for places, one for topics -- we want to fetch metadata entities for each place hub. Curators define the characteristics of their hubs using yet another representation of this same property bag, based on a convention for using the delicious bookmarking service. Here's what the property bag looks like in delicious:
Place hubs are defined by having a property called where. Topic hubs are defined by having a property called what. So to find just the place hubs, the query is:
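Since the table service excludes entities that lack a property named in a comparison, testing the where property is enough to select just the place hubs. A sketch of that query URL, again with an assumed account name:

```python
from urllib.parse import quote

# Storage account and table names assumed for illustration.
BASE = "https://elmcity.table.core.windows.net/metadata"

# Entities with no 'where' property fall out of the comparison, so
# this filter matches only the place hubs.
place_hubs_query = BASE + "()?$filter=" + quote("where ne ''")
```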
Note that this query includes neither the PartitionKey nor the RowKey. Since the Azure table service doesn't index other attributes, this query will require a table scan. How can that possibly be efficient? Well, it isn't. If there were huge numbers of entities in the table, it would take a long time. But for tens, hundreds, or even low thousands of entities, and when, as in this case, there's no need for subsecond response, it's no problem. If I end up with more than a few thousand hubs, that will be a good problem to have, and I'll happily migrate this table into SQL Azure. Meanwhile, simpler is better.
Viewing the OData feed
The output of the HTTP request that queries for place hubs is an OData feed, shown here in Fiddler's XML view:
This is an Atom feed extended with a type system, used here to annotate the Timestamp property as an Edm.DateTime and the eventful_events and eventbrite_events properties as Edm.Int32. If you open this feed in a browser that presents a friendly view of feeds, you'll see a list of (not very interesting) items as shown here in Firefox:
If you disable the browser's friendly view of feeds, you'll see the raw XML as shown here in IE:
Using the OData feed
You can ingest an OData feed using any XML parser, or you can use one of the client libraries in the OData SDK. But suppose you want to produce a visualization like this one:
It would be nice to be able to read the OData feed directly into Excel. There isn't yet a generic OData importer for all versions of Excel, unfortunately. I hope one will emerge, but meanwhile there's PowerPivot, a free add-in for Excel 2010. Now, calling PowerPivot an OData importer is like calling a Swiss army knife a toothpick. Yes, it serves the purpose, but that one function barely scratches the surface. PowerPivot is a large-capacity in-memory relational data analysis engine. It comes from the world of business intelligence, where quants build cubes out of SQL data and then slice and dice them. As an Excel plug-in, it does the heavy lifting that Excel can't do alone: millions of rows of data, relational joins.
We'll see none of that fancy stuff here; we're only using PowerPivot to load an OData feed into Excel. The elmcity service makes the feed available to me privately over HTTP, but for this example I've just captured the feed as an XML file that can be loaded from local storage.
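And, as promised, any XML parser really can ingest such a captured feed. Here's a sketch in Python using the standard library's ElementTree, run against a trimmed, hand-made fragment in the feed's shape (the property values are invented; the namespaces are the standard OData/Atom ones):

```python
import xml.etree.ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"
M = "http://schemas.microsoft.com/ado/2007/08/dataservices/metadata"
D = "http://schemas.microsoft.com/ado/2007/08/dataservices"

# A trimmed, hand-made fragment shaped like the feed shown above.
sample = f'''<feed xmlns="{ATOM}" xmlns:m="{M}" xmlns:d="{D}">
  <entry>
    <content type="application/xml">
      <m:properties>
        <d:PartitionKey>elmcity</d:PartitionKey>
        <d:eventful_events m:type="Edm.Int32">64</d:eventful_events>
      </m:properties>
    </content>
  </entry>
</feed>'''

def entities(xml_text):
    """Yield each entry's m:properties bag as a dict, honoring Edm.Int32."""
    root = ET.fromstring(xml_text)
    for props in root.iter(f"{{{M}}}properties"):
        bag = {}
        for p in props:
            name = p.tag.split("}")[1]        # strip the namespace
            if p.get(f"{{{M}}}type") == "Edm.Int32":
                bag[name] = int(p.text)       # typed property
            else:
                bag[name] = p.text            # untyped: plain string
        yield bag
```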
After importing the rows into PowerPivot, I send them over to an Excel pivot table.
Over on the Excel side of this PowerPivot/Excel duo, I pick the fields I want to use.
Then I select those columns in the pivot table, and copy/paste links into a worksheet that has formulae for adding up totals and calculating percentages.
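The worksheet arithmetic itself is trivial. In Python it would amount to this (the per-source counts here are made up for illustration; the real numbers live in the pivot table):

```python
# Hypothetical per-source event counts for one hub.
counts = {"eventful": 64, "upcoming": 38, "eventbrite": 5, "facebook": 2, "ical": 877}

# Total events, and each source's share as a percentage.
total = sum(counts.values())
pcts = {k: round(100 * v / total, 1) for k, v in counts.items()}
```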
Finally I add a sparkline column to summarize the percentages using inline column charts.