How to parse key/value pairs in C#

JonUdell
Posted Oct 07 2010 10:02 AM

Introduction

Curators of elmcity calendar hubs use a Delicious tagging convention to specify the metadata that controls their hubs. Calendar entries that flow through those hubs can specify event URLs and categories using the same convention. In this week's companion article on the Radar blog, I argue that everyone ought to develop some intuitions about how and why to create and use simple kinds of structured data. Here I'll show how the elmcity service extracts key/value collections from a variety of contexts.

Key/value pairs in the wild

Honolulu is the newest elmcity hub. Here's the Delicious account that controls it:

Each hub uses a special bookmark, labeled metadata, to control the behavior of the hub. The bookmark's target URL doesn't point to a real resource, but rather to a fictional one, in this case http://delicious.com...avibe/metadata. So it's really more like a URN (Uniform Resource Name) than a URL (Uniform Resource Locator). But from the Delicious point of view it's a URL, it can be bookmarked, and the bookmark can be tagged with arbitrary labels. The elmcity service uses a convention for those labels. When they take the form key=value, they represent key/value pairs that the service will recognize, store, and use. In this example, the Honolulu curator has specified a location (where=honolulu,hi), a timezone (tz=hawaiian), and a contact (elmcity@alohavibe.com).

The other bookmarks collectively form the registry of iCalendar feeds for this hub. A trio of tags -- trusted + ics + feed -- signals to the service that the bookmark points to an iCalendar feed whose events should be merged into the hub. The feed can also carry key=value tags. The url= tag provides a default link for all events coming from a calendar. Ideally individual events will carry their own URLs that can override the default. But many calendar programs don't populate the iCalendar URL property when they export to iCalendar format. And the standard doesn't provide for a default calendar-level URL property. So curators can use the url= convention to supply that.

Similarly, iCalendar does not specify a calendar-level CATEGORIES property. Since calendar feeds are often implicitly categorical, curators can use category= to make that explicit. In this example, every event coming from the Hawaii Reggae Guild will carry two categories or, if you prefer, tags: music and reggae.
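To make the tagging convention concrete, here is a minimal Python sketch of how a bookmark's tags might be separated into key=value settings and plain labels. It is an illustration only, not the elmcity implementation: the function name split_tags, the sample tag list, and the example.com URL are all hypothetical.

```python
def split_tags(tags):
    """Split a bookmark's tags into key=value settings and plain labels."""
    settings, labels = {}, []
    for tag in tags:
        # Split on the first "=" only, so values such as URLs with
        # query strings keep any later "=" characters.
        key, sep, value = tag.partition("=")
        if sep and key and value:
            settings[key] = value
        else:
            labels.append(tag)
    return settings, labels


# Hypothetical tags for one feed bookmark (the URL is made up).
feed_tags = ["trusted", "ics", "feed",
             "url=http://example.com/reggae", "category=music,reggae"]
settings, labels = split_tags(feed_tags)

# A bookmark counts as a feed only when all three marker tags are present.
is_feed = {"trusted", "ics", "feed"} <= set(labels)
print(is_feed, settings)
```

Splitting on the first equals sign is a deliberate choice in this sketch: it keeps a value like http://foo.com?x=y in one piece.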

Of course, calendar feeds are just as likely not to be categorical. Events that flow through them can belong to many different categories. The iCalendar spec does define a CATEGORIES property, and the elmcity service captures it when it's available. But although some calendar programs enable users to categorize events, others don't. So the elmcity service also looks in the Description field, which is always open to user input. If text like category=music,reggae or url=http://tikisgrill/events/32 occurs there, then these key/value pairs are captured and tacked onto the event.

Finding regular expression groups

To find these key/value pairs using Python, I'd start with re.findall, a function in the standard re module. It takes a regular expression pattern and a target string, and scans the string for every non-overlapping occurrence of the pattern. If the pattern specifies groups, it returns a list of the group values instead of the whole matches.
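A short sketch of that behavior, using the sample text from the Description example above. The variable names are arbitrary; only re.findall itself comes from the standard library.

```python
import re

text = "category=music,reggae url=http://tikisgrill/events/32"

# No groups: each item is the whole matched substring.
whole = re.findall(r"\w+=\S+", text)

# One group: each item is that group's text.
keys_only = re.findall(r"(\w+)=\S+", text)

# Two groups: each item is a tuple holding both groups.
pairs = re.findall(r"(\w+)=(\S+)", text)

print(whole)
print(keys_only)
print(pairs)
```

With two groups the result converts directly to a dictionary via dict(pairs), which is the shape the rest of this article works toward.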

In C# there's no directly corresponding built-in method. So I started with a method that scans an input string for a pattern and returns a list of the values of any groups found. Unlike re.findall, it looks only at the first match.


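// Requires: using System.Collections.Generic; using System.Text.RegularExpressions;
public static List<string> RegexFindGroups(
    string input,
    string pattern)
{
    Regex re = new Regex(pattern);
    // Only the first match is examined. Groups[0] is the whole match;
    // when nothing matches, it is a single empty string.
    var groups = re.Match(input).Groups;
    var values = new List<string>();
    foreach (Group g in groups)
        values.Add(g.Value);
    return values;
}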

We can use IronPython to explore some uses of this method:


>>> pat = 'a b c d e'     # pattern includes no groups
>>> RegexFindGroups('a b c d',pat)  # non-matching input
List[str]([''])                     # no match
>>> RegexFindGroups('a b c d e',pat)  # matching input
List[str](['a b c d e'])            # matched input only
>>> pat = 'a (b) c (d) e' # pattern with literal groups
>>> RegexFindGroups('a b c d e',pat)  # matching input and groups
List[str](['a b c d e', 'b', 'd'])    # matched input and groups
>>> pat = 'a (http://.+\s*) c (\d+) e' # pattern with abstract groups
>>> RegexFindGroups('a http://foo.com?x=y c 192534 e',pat)  # matching input
List[str](['a http://foo.com?x=y c 192534 e', 'http://foo.com?x=y', '192534'])
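For readers without IronPython at hand, the same experiments can be approximated in plain Python. The function below is a hypothetical port, not part of the elmcity code: it assumes the C# method's observable behavior as shown in the session above (first match only, whole match at index 0, a single empty string when nothing matches).

```python
import re


def regex_find_groups(text, pattern):
    """Return [whole match, group 1, group 2, ...] for the first match."""
    m = re.search(pattern, text)
    if m is None:
        # Mirror the session above, where a failed match yields ['']
        return [""]
    # Groups that did not participate come back as None in Python;
    # map them to "" so the result is always a list of strings.
    return [m.group(0)] + [g or "" for g in m.groups()]


no_match = regex_find_groups("a b c d", "a b c d e")
literal = regex_find_groups("a b c d e", "a (b) c (d) e")
abstract = regex_find_groups("a http://foo.com?x=y c 192534 e",
                             r"a (http://.+\s*) c (\d+) e")
print(no_match, literal, abstract)
```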

Matching a single key/value pair

Next I added a method to find the key=value pattern for arbitrary keys:


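public static List<string> RegexFindKeyValue(string input)
{
    var pattern = @"\s*(\w+)=([^\s]+)\s*";
    var groups = RegexFindGroups(input, pattern);
    var list = new List<string>();
    // Accept only when the match spans the entire input. The Count check
    // guards the indexing below when the match failed (e.g. an empty
    // input, for which groups[0] == input would still be true).
    if (groups.Count == 3 && groups[0] == input)
    {
        list.Add(groups[1]);   // key
        list.Add(groups[2]);   // value
    }
    return list;
}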

We'll use IronPython again to try it out:


>>> RegexFindKeyValue('abc') # no match
List[str]()
>>> RegexFindKeyValue('abc = def') # no match
List[str]()
>>> RegexFindKeyValue('abc=def') # match
List[str](['abc', 'def'])
>>> RegexFindKeyValue('  abc=def  ') # match
List[str](['abc', 'def'])
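Comparing the whole match to the input amounts to requiring that the pattern cover the entire string. In Python that requirement can be stated directly with re.fullmatch (available since Python 3.4). The sketch below is a rough equivalent for this particular pattern; the name find_key_value is hypothetical.

```python
import re

# Same pattern as the C# method; \S is shorthand for [^\s].
PAIR = re.compile(r"\s*(\w+)=(\S+)\s*")


def find_key_value(text):
    """Return [key, value] if text is exactly one key=value pair, else []."""
    m = PAIR.fullmatch(text)
    return list(m.groups()) if m else []


print(find_key_value("abc"))
print(find_key_value("abc = def"))
print(find_key_value("abc=def"))
print(find_key_value("  abc=def  "))
```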

Finding arbitrary key/value collections in text

Finally, I added a method that takes a set of keys and tries to find associated values in an input text.


public static Dictionary<string, string>
  RegexFindKeysAndValues(List<string> keys, string input)
{
    // Build an alternation such as (url|category)=([^\s]+)
    string keystrings = String.Join("|", keys.ToArray());
    string regex = String.Format(@"({0})=([^\s]+)", keystrings);
    Regex reg = new Regex(regex);
    var metadict = new Dictionary<string, string>();
    Match m = reg.Match(input);
    while (m.Success)
    {
        var key_value = RegexFindKeyValue(m.Groups[0].ToString());
        // Note: Add throws if the same key occurs twice in the input.
        metadict.Add(key_value[0], key_value[1]);
        m = m.NextMatch();
    }
    return metadict;
}

Let's try it out.


>>> keys = System.Collections.Generic.List[str]()
>>> keys.Add('url')
>>> keys.Add('category')
>>> keys
List[str](['url', 'category'])
>>> RegexFindKeysAndValues(keys, \
... 'Four score and seven years ago, \
... our fathers brought forth ... \
... url=http://civilwar.com/north/lincoln.html \
... category=politics,speech')
Dictionary[str, str]( { 
  'url'      : 'http://civilwar.com/north/lincoln.html', 
  'category' : 'politics,speech' } )
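The same idea can be sketched in plain Python. This is a hypothetical variant rather than a line-for-line port, and it makes three assumptions that the C# method does not: keys are escaped before being joined into the pattern, a word boundary keeps url from matching inside a longer word such as curl, and a repeated key overwrites the earlier value instead of raising an error.

```python
import re


def find_keys_and_values(keys, text):
    """Collect key=value pairs from free text for the given keys only."""
    # Escape each key so characters like "." or "+" are taken literally.
    alternation = "|".join(re.escape(k) for k in keys)
    pattern = re.compile(r"\b(%s)=(\S+)" % alternation)
    found = {}
    for key, value in pattern.findall(text):
        found[key] = value  # a later occurrence replaces an earlier one
    return found


speech = ("Four score and seven years ago, "
          "our fathers brought forth ... "
          "url=http://civilwar.com/north/lincoln.html "
          "category=politics,speech")
result = find_keys_and_values(["url", "category"], speech)
print(result)
```

Whether a repeated key should overwrite, accumulate, or be rejected is a policy decision; the sketch simply picks the most forgiving option.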

Other uses for the regex group finder

The elmcity service uses these methods in lots of ways. For example, every place hub has a where= setting that defines the location. Here are some of the values:

honolulu,hi
huntington, wv
keene nh
menlo park, ca
myrtle beach,sc

The names of cities and towns might include spaces or might not. Sometimes there's only a comma between the city name and the state abbreviation, sometimes there's only a space, sometimes there's a space and a comma. That's OK. I try to follow Postel's Law -- "Be conservative in what you send; be liberal in what you accept" -- wherever possible. So I normalize these inputs like so:


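// Lazy city group, then a run of spaces and/or commas, then the state
// at the end. A greedy (.+) followed by a single separator character
// would leave the comma on the city for inputs like "huntington, wv".
var groups = GenUtils.RegexFindGroups(where, @"^(.+?)([\s,]+)(\S+)$");
if (groups.Count > 1)
{
    city_or_town = groups[1];
    state_abbrev = groups[3];
}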

When accepting structured input, we have to enforce some rules. But every extra rule ratchets up the user's frustration another notch. Pick your battles.


