
How to use an NLTK part-of-speech tagger

Posted by adfm, Feb 17 2010 02:05 PM

Tagging follows tokenization in the typical natural language processing pipeline. In this excerpt from Natural Language Processing with Python the authors introduce a part-of-speech tagger. The examples assume you are familiar with the Python language and have NLTK installed.


A part-of-speech tagger, or POS tagger, processes a sequence of words, and attaches a part of speech tag to each word (don’t forget to import nltk):

>>> import nltk
>>> text = nltk.word_tokenize("And now for something completely different")
>>> nltk.pos_tag(text)
[('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'),
 ('completely', 'RB'), ('different', 'JJ')]

Here we see that and is CC, a coordinating conjunction; now and completely are RB, or adverbs; for is IN, a preposition; something is NN, a noun; and different is JJ, an adjective.
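Because the tagger returns an ordinary list of (word, tag) tuples, its output can be processed with plain Python. The following is a minimal sketch that groups the words above by tag; the tagged list is copied from the session, and the TAG_GLOSS table is a hand-written assumption for illustration, not an NLTK data structure.

```python
from collections import defaultdict

# Output of nltk.pos_tag() copied from the session above.
tagged = [('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'),
          ('completely', 'RB'), ('different', 'JJ')]

# Hand-written glosses for the tags in this sentence (illustrative only,
# not part of NLTK).
TAG_GLOSS = {
    'CC': 'coordinating conjunction',
    'RB': 'adverb',
    'IN': 'preposition',
    'NN': 'noun',
    'JJ': 'adjective',
}

# Group the words by the tag they received.
words_by_tag = defaultdict(list)
for word, tag in tagged:
    words_by_tag[tag].append(word)

for tag, words in words_by_tag.items():
    print(f"{tag} ({TAG_GLOSS[tag]}): {', '.join(words)}")
```

The same unpacking pattern (for word, tag in tagged) works on any tagger output shown in this excerpt.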

Note

NLTK provides documentation for each tag, which can be queried using the tag, e.g., nltk.help.upenn_tagset('RB'), or a regular expression, e.g., nltk.help.upenn_tagset('NN.*'). Some corpora have README files with tagset documentation; see nltk.corpus.name.readme(), substituting in the name of the corpus (for example, nltk.corpus.brown.readme()).

Let’s look at another example, this time including some homonyms:

>>> text = nltk.word_tokenize("They refuse to permit us to obtain the refuse permit")
>>> nltk.pos_tag(text)
[('They', 'PRP'), ('refuse', 'VBP'), ('to', 'TO'), ('permit', 'VB'), ('us', 'PRP'),
 ('to', 'TO'), ('obtain', 'VB'), ('the', 'DT'), ('refuse', 'NN'), ('permit', 'NN')]

Notice that refuse and permit each appear as both a verb and a noun: refuse is tagged as a present tense verb (VBP) and a noun (NN), while permit is tagged as a base-form verb (VB) and a noun (NN). E.g., refUSE is a verb meaning “deny,” while REFuse is a noun meaning “trash” (i.e., they are not homophones). Thus, we need to know which word is being used in order to pronounce the text correctly. (For this reason, text-to-speech systems usually perform POS tagging.)
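To make the text-to-speech point concrete, here is a rough sketch of how a tag can select between two pronunciations of the same spelling. The PRONUNCIATIONS table and both helper functions are hypothetical names invented for this illustration; a real speech system would use a full pronunciation lexicon. Capitals mark the stressed syllable, as in the paragraph above.

```python
# Output of nltk.pos_tag() copied from the session above.
tagged = [('They', 'PRP'), ('refuse', 'VBP'), ('to', 'TO'), ('permit', 'VB'),
          ('us', 'PRP'), ('to', 'TO'), ('obtain', 'VB'), ('the', 'DT'),
          ('refuse', 'NN'), ('permit', 'NN')]

# Hypothetical pronunciation table keyed on (word, coarse category).
PRONUNCIATIONS = {
    ('refuse', 'verb'): 'refUSE',
    ('refuse', 'noun'): 'REFuse',
    ('permit', 'verb'): 'perMIT',
    ('permit', 'noun'): 'PERmit',
}

def coarse_category(tag):
    """Map a Penn Treebank tag to 'verb', 'noun', or None."""
    if tag.startswith('VB'):
        return 'verb'
    if tag.startswith('NN'):
        return 'noun'
    return None

def pronounce(tagged_words):
    """Return each word, using a stress-marked form when one is listed."""
    return [PRONUNCIATIONS.get((word.lower(), coarse_category(tag)), word)
            for word, tag in tagged_words]

print(' '.join(pronounce(tagged)))
# They refUSE to perMIT us to obtain the REFuse PERmit
```

Without the tags, the lookup would have no way to tell the two occurrences of refuse apart.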

Note

Your Turn: Many words, like ski and race, can be used as nouns or verbs with no difference in pronunciation. Can you think of others? Hint: think of a commonplace object and try to put the word to before it to see if it can also be a verb, or think of an action and try to put the before it to see if it can also be a noun. Now make up a sentence with both uses of this word, and run the POS tagger on this sentence.

Lexical categories like “noun” and part-of-speech tags like NN seem to have their uses, but the details will be obscure to many readers. You might wonder what justification there is for introducing this extra level of information. Many of these categories arise from superficial analysis of the distribution of words in text. Consider the following analysis involving woman (a noun), bought (a verb), over (a preposition), and the (a determiner). The text.similar() method takes a word w, finds all contexts w1 w w2 in which it occurs, then finds all words w' that appear in the same contexts, i.e., w1 w' w2:

>>> text = nltk.Text(word.lower() for word in nltk.corpus.brown.words())
>>> text.similar('woman')
Building word-context index...
man time day year car moment world family house country child boy
state job way war girl place room word
>>> text.similar('bought')
made said put done seen had found left given heard brought got been
was set told took in felt that
>>> text.similar('over')
in on to of and for with from at by that into as up out down through
is all about
>>> text.similar('the')
a his this their its her an that our any all one these my in your no
some other and

Observe that searching for woman finds nouns; searching for bought mostly finds verbs; searching for over generally finds prepositions; searching for the finds several determiners. A tagger can correctly identify the tags on these words in the context of a sentence, e.g., The woman bought over $150,000 worth of clothes.
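The context-matching idea behind text.similar() can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not NLTK's implementation: the similar() function below is a hypothetical stand-in that counts shared (left, right) neighbour pairs, and the toy corpus is made up for the example.

```python
from collections import Counter, defaultdict

def similar(tokens, target, n=5):
    """Rank words by how many (left, right) contexts they share with target."""
    # Map each word to the set of (left neighbour, right neighbour) pairs
    # in which it occurs.
    contexts = defaultdict(set)
    for left, word, right in zip(tokens, tokens[1:], tokens[2:]):
        contexts[word].add((left, right))

    target_contexts = contexts[target]
    scores = Counter()
    for word, word_contexts in contexts.items():
        if word == target:
            continue
        shared = len(word_contexts & target_contexts)
        if shared:
            scores[word] = shared
    return [word for word, _ in scores.most_common(n)]

# Toy corpus, invented for this sketch.
tokens = ("the woman bought a car . the man bought a house . "
          "the woman saw the dog . the man saw the cat . "
          "a girl bought the car .").split()

print(similar(tokens, 'woman'))
# ['man']
```

Even on this tiny corpus, the word that shares contexts with woman is another noun, which is the distributional pattern the Brown corpus output above shows at scale.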

A tagger can also model our knowledge of unknown words; for example, we can guess that scrobbling is probably a verb, with the root scrobble, and likely to occur in contexts like he was scrobbling.
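One simple way to guess at unknown words is to look at their endings. The sketch below is an assumption-laden illustration, not how nltk.pos_tag() works internally: the SUFFIX_RULES list and guess_tag() function are hypothetical, and the rules are deliberately crude. NLTK's own RegexpTagger is built around a similar pattern-to-tag idea.

```python
import re

# Hypothetical suffix rules, tried in order; tags are Penn Treebank tags.
SUFFIX_RULES = [
    (re.compile(r'.*ing$'), 'VBG'),   # gerund or present participle
    (re.compile(r'.*ed$'), 'VBD'),    # simple past
    (re.compile(r'.*ly$'), 'RB'),     # adverb
    (re.compile(r'.*s$'), 'NNS'),     # plural noun
]

def guess_tag(word, default='NN'):
    """Return the tag of the first matching suffix rule, else the default."""
    for pattern, tag in SUFFIX_RULES:
        if pattern.match(word.lower()):
            return tag
    return default

for word in ['scrobbling', 'scrobbled', 'scrobble']:
    print(word, guess_tag(word))
# scrobbling VBG
# scrobbled VBD
# scrobble NN
```

Note that the bare form scrobble falls through to the default noun tag, which shows why suffixes alone are not enough and why a tagger also relies on the surrounding context, such as he was scrobbling.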



1 Reply

Posted by davidqs, Apr 28 2011 01:05 AM

Thanks for the post, very clear and helpful.

A question: I am trying to combine your sample with collocations.

I want to run pos_tag on "Companies in New York" with New York treated as a single token (tagged NNP).

What's the way to do that?

Thanks again,

David