How to Properly Capitalize a Title or Headline in Perl

Posted by gnat, Oct 20 2009 03:11 PM

If you have a string representing a headline, the title of a book, or some other work that needs proper capitalization, use a variant of this tc( ) titlecasing function:

INIT {
    our %nocap;
    for (qw(
            a an the
            and but or
            as at but by for from in into of off on onto per to with
        ))
    {
        $nocap{$_}++;
    }
}

sub tc {
    local $_ = shift;

    # put into lowercase if on stop list, else titlecase;
    # look the word up in lowercase so that "THE" and "Of" are recognized too
    s/(\pL[\pL']*)/$nocap{lc $1} ? lc($1) : ucfirst(lc($1))/ge;

    s/^(\pL[\pL']*) /\u\L$1/x;  # first  word guaranteed to cap
    s/ (\pL[\pL']*)$/\u\L$1/x;  # last word guaranteed to cap

    # treat parenthesized portion as a complete title
    s/\( (\pL[\pL']*) /(\u\L$1/x;
    s/(\pL[\pL']*) \) /\u\L$1)/x;

    # capitalize first word following colon or semi-colon
    s/ ( [:;] \s+ ) (\pL[\pL']* ) /$1\u\L$2/x;

    return $_;
}


The rules for correctly capitalizing a headline or title in English are more complex than simply capitalizing the first letter of each word. If that's all you need to do, something like this should suffice:

s/(\w+\S*\w*)/\u\L$1/g;


Most style guides tell you that the first and last words in the title should always be capitalized, along with every other word that's not an article, the particle "to" in an infinitive construct, a coordinating conjunction, or a preposition.

Here's a demo that also shows the distinguishing property of titlecase: the character \x{01F3} (ǳ) in the data has a titlecase form (ǲ) that differs from its uppercase form (Ǳ).

# with apologies (or kudos) to Steven Brust, PJF,
# and to JRRT, as always.
binmode(STDOUT, ":utf8");   # the data contains a wide character
@data = (
            "the enchantress of \x{01F3}ur mountain",
    "meeting the enchantress of \x{01F3}ur mountain",
    "the lord of the rings: the fellowship of the ring",
);

$mask = "%-20s: %s\n";

sub tc_lame {
    local $_ = shift;
    s/(\w+\S*\w*)/\u\L$1/g;
    return $_;
}

for $datum (@data) { 
    printf $mask, "ALL CAPITALS",       uc($datum);
    printf $mask, "no capitals",        lc($datum);
    printf $mask, "simple titlecase",   tc_lame($datum);
    printf $mask, "better titlecase",   tc($datum);
    print "\n";
}


ALL CAPITALS        : THE ENCHANTRESS OF ǱUR MOUNTAIN
no capitals         : the enchantress of ǳur mountain
simple titlecase    : The Enchantress Of ǲur Mountain
better titlecase    : The Enchantress of ǲur Mountain

ALL CAPITALS        : MEETING THE ENCHANTRESS OF ǱUR MOUNTAIN
no capitals         : meeting the enchantress of ǳur mountain
simple titlecase    : Meeting The Enchantress Of ǲur Mountain
better titlecase    : Meeting the Enchantress of ǲur Mountain

ALL CAPITALS        : THE LORD OF THE RINGS: THE FELLOWSHIP OF THE RING
no capitals         : the lord of the rings: the fellowship of the ring
simple titlecase    : The Lord Of The Rings: The Fellowship Of The Ring
better titlecase    : The Lord of the Rings: The Fellowship of the Ring


One thing to consider is that some style guides prefer capitalizing only prepositions that are longer than three, four, or sometimes five letters. O'Reilly & Associates, for example, keeps prepositions of four or fewer letters in lowercase. Here's a longer list of prepositions if you prefer, which you can modify to your needs:

@all_prepositions = qw{
    about above absent across after against along amid amidst
    among amongst around as at athwart before behind below
    beneath beside besides between betwixt beyond but by circa
    down during ere except for from in into near of off on onto
    out over past per since than through till to toward towards 
    under until unto up upon versus via with within without
};
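
As a rough sketch of the length rule described above, a stop list could be built from only those prepositions that are short enough. The helper name build_nocap, its $max_len parameter, and the shortened sample list are assumptions made for illustration; they are not part of the original recipe.

```perl
use strict;
use warnings;

# A short sample of the longer list, to keep the sketch compact.
my @prepositions = qw(
    about above as at by for from in into of off on onto
    over past per to under upon with within without
);

# Hypothetical helper: articles and coordinating conjunctions always
# stay lowercase; prepositions do so only up to $max_len letters.
sub build_nocap {
    my ($max_len, @preps) = @_;
    my %nocap = map { $_ => 1 } qw(a an the and but or);
    $nocap{$_} = 1 for grep { length($_) <= $max_len } @preps;
    return %nocap;
}

# Four or fewer letters stay lowercase, as in the style mentioned above.
my %nocap = build_nocap(4, @prepositions);

for my $word (qw(from onto about within)) {
    printf "%-8s %s\n", $word, $nocap{$word} ? "lowercase" : "capitalize";
}
```

A hash built this way could stand in for the %nocap that tc( ) consults, so changing the threshold means changing one argument rather than editing the word list by hand.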


This kind of approach can take you only so far, though, because it doesn't distinguish between words that can be several parts of speech. Some prepositions on the list might also double as words that should always be capitalized, such as subordinating conjunctions, adverbs, or even adjectives. For example, it's "Down by the Riverside" but "Getting By on Just $30 a Day", or "A Ringing in My Ears" but "Bringing In the Sheaves".
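
One possible workaround, sketched here under the assumption that you are willing to maintain a list of known phrases by hand, is to patch those phrases after titlecasing. The @phrase_exceptions table and the fix_phrases function are hypothetical names, and this does nothing for phrases that aren't on the list.

```perl
use strict;
use warnings;

# Hypothetical exception table: phrases in which a word from the stop
# list acts as an adverb and therefore keeps its capital.
my @phrase_exceptions = ("Getting By", "Bringing In");

sub fix_phrases {
    my ($title) = @_;
    for my $phrase (@phrase_exceptions) {
        # \b keeps "Bringing In" from matching inside "Bringing Into"
        $title =~ s/\b\Q$phrase\E\b/$phrase/gi;
    }
    return $title;
}

print fix_phrases('Getting by on Just $30 a Day'), "\n";
print fix_phrases("Bringing in the Sheaves"), "\n";
print fix_phrases("Down by the Riverside"), "\n";
```

The first two titles come back with "By" and "In" capitalized, while the third is left alone because it contains no listed phrase. Anything beyond a fixed phrase list would need real part-of-speech analysis, which is outside what a regex-based recipe can offer.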

Another consideration is that you might prefer to apply the \u or ucfirst conversion by itself without also putting the whole string into lowercase. That way a word that's already in all capital letters, such as an acronym, doesn't lose that trait. You probably wouldn't want to convert "FBI" and "LBJ" into "Fbi" and "Lbj".
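
A minimal sketch of that variant follows. The name tc_keep_caps is made up for this example, and the function handles only the stop list and the first and last words, not the parentheses or colon rules of the full tc( ).

```perl
use strict;
use warnings;

my %nocap = map { $_ => 1 } qw(
    a an the and but or
    as at by for from in into of off on onto per to with
);

sub tc_keep_caps {
    my ($title) = @_;

    # lowercase stop words; for everything else, raise only the first
    # letter and leave the remaining letters exactly as they were
    $title =~ s/(\pL[\pL']*)/$nocap{lc $1} ? lc($1) : ucfirst($1)/ge;

    # first and last words always start with a capital
    $title =~ s/^(\pL)/\u$1/;
    $title =~ s/(\pL[\pL']*)$/\u$1/;

    return $title;
}

print tc_keep_caps("the FBI files on LBJ"), "\n";   # The FBI Files on LBJ
```

The trade-off is that input already written entirely in capitals comes out mostly unchanged, so this variant suits mixed-case input better than shouted headlines.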

Learn more about this topic from Perl Cookbook, Second Edition.