
How to Handle Stop Words and Short Words When Working with Sphinx

Posted Jun 12 2011 05:09 PM

The following excerpt from the O'Reilly publication Introduction to Search with Sphinx will give you some insight into using stop words and short words when searching via Sphinx.

Not all keywords are created equal, and in your average English text corpus, there will be far more instances of “the” than, say, “ostentatious” or “scarcity.” Full-text search engines, Sphinx included, do a good deal of keyword crunching, so the differences in keyword frequencies affect both performance and relevance.

Stop words are keywords that occur so frequently that you choose to ignore them, both when indexing and when searching. They are noise keywords, in a sense.

Removing only a few stop words can improve indexing time and index size considerably. In Table 3-1, we benchmarked the same 100,000-document index with varying numbers of the top N most frequent words stopped.

Table 3-1. Indexing size and time with different stop word settings

N                   Elapsed time   Elapsed time   Index size            Index size
                    (seconds)      (percent)      (millions of bytes)   (percent)
0 (no stop words)   12.2           100.0          73.6                  100.0
10                  11.1           90.9           67.2                  91.3
20                  10.5           86.0           63.8                  86.6
30                  10.4           85.2           61.4                  83.4
100                 9.6            78.6           51.8                  70.3


As you can see, removing just the 10 most frequent words resulted in about a 10 percent improvement in both index size and indexing time. Stopping 100 of them improved indexing time by more than 20 percent and index size by almost 30 percent. That is pretty nice.

Sphinx lets you configure a file with a list of stop words on a per-index basis, using the stopwords directive in sphinx.conf:

index test1
{
    path      = /var/lib/sphinx/data/test1
    source    = src1
    stopwords = /var/lib/sphinx/stopwords.txt
}


That stopwords.txt file should be a plain text document. It will be loaded and broken into keywords according to general index settings (i.e., using any delimiters that mark the boundaries between words in your text input), and from there, keywords mentioned in it will be ignored when working with the test1 index.

How do you know what keywords to put there? You can either use a list of the most common words for your language of choice, or generate a list based on your own data. To do the latter, perform a dry run of indexer in stop words list generation mode, without actually creating an index. This mode is triggered by these two switches:


--buildstops output.txt N

Tells indexer to process the data sources, collect the N most frequent words, and store the resultant list in the output.txt file (one word per line)


--buildfreqs

Tells indexer to also put word frequencies into the output.txt file
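
To make the idea of "the N most frequent words" concrete, here is a minimal Python sketch of the same kind of counting. It is only an illustration, not how indexer is implemented: the function name top_words is hypothetical, and the simple lowercase regex split is an assumption that stands in for Sphinx's own tokenization settings.

```python
import re
from collections import Counter

def top_words(documents, n):
    """Count keywords across all documents and return the n most frequent
    as (word, count) pairs. Tokenization here is a plain regex split,
    which is an assumption and not Sphinx's tokenizer."""
    counts = Counter()
    for text in documents:
        counts.update(re.findall(r"[a-z0-9]+", text.lower()))
    return counts.most_common(n)

docs = ["The cat saw the dog", "the dog ran", "a cat"]
for word, count in top_words(docs, 2):
    print(word, count)
```

On your real data, the list produced by indexer itself is the one to trust, since it reflects the actual index settings.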


When you specify the --buildstops switch, the output file will be in the exact format needed by the stopwords directive. With --buildfreqs, you will also get occurrence counts. The output in that case is not directly usable, but helps you decide what to stop. For instance, running indexer --buildstops out.txt 10 --buildfreqs on our test 100,000-document collection produced the following:

i 740220
the 460421
and 429831
to 429830
a 371786
it 226381
of 218161
you 217176
my 188783
that 187490
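
Once you have decided which of those words to stop, the frequency listing has to be reduced to bare words before it can serve as a stop words file. A rough Python sketch of that step follows; the helper name words_from_freqs and the optional keep filter are hypothetical, and it assumes each line has the "word count" shape shown above.

```python
def words_from_freqs(lines, keep=None):
    """Extract the word from each 'word count' line.
    If keep is given, only words in that set are returned."""
    words = []
    for line in lines:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        word = parts[0]
        if keep is None or word in keep:
            words.append(word)
    return words

freq_lines = ["i 740220", "the 460421", "and 429831"]
# Suppose we decide to stop only "the" and "and".
for word in words_from_freqs(freq_lines, keep={"the", "and"}):
    print(word)
```

Writing the resulting words one per line gives a file in the same shape that --buildstops alone produces.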


Picking the right keywords to stop is always a question of balance between performance and requirements. In extreme cases, the latter might prevent you from having any stop words at all—think of a requirement to search, and find, “to be or not to be” as an exact phrase quote. Unfortunately, using extremely common words did not prevent William Shakespeare from coming up with an extremely famous line. Fortunately, few quotes of interest are built exclusively from infinitives, prepositions, and articles, so stop words can still often be used safely.

Sometimes you also need to stop keywords based simply on length. Even enumerating all single-character words can be cumbersome, not to mention double-character words and more, so there’s a special feature for that. The min_word_len directive in the index definition specifies a minimum keyword length to be indexed—keywords shorter than this limit will not be indexed.

index test1
{
    path         = /var/lib/sphinx/data/test1
    source       = src1
    min_word_len = 3
}



Given this example, “I” and “am” will not be indexed, but “you” will. Such skipped words, referred to as overshort words, are handled exactly like stop words—that is, they’re ignored.
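
The length rule itself is easy to picture in code. The following Python sketch only models the filter described above (keywords shorter than the limit are dropped); drop_overshort is a hypothetical name, and measuring length with len() on already-split words is a simplifying assumption.

```python
def drop_overshort(words, min_word_len):
    """Keep only keywords whose length is at least min_word_len."""
    return [w for w in words if len(w) >= min_word_len]

print(drop_overshort(["I", "am", "you"], 3))  # ['you']
```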

However, by default, they are not ignored completely. Even though Sphinx throws them away both when indexing and when searching, it still adjusts the positions of adjacent keywords accordingly, which affects searches. Assume, for example, that “in” and “the” are stop words. Searches for

“Microsoft Office”

and

“Microsoft in the office”


will, a bit counterintuitively, return different results.

Why? Because of the keyword positions assigned both in indexed documents and in search queries. The positions will be different for the two queries. The first query will match only documents in which “Microsoft” occurs immediately before “office”, while the second one will match only documents in which there are exactly two other words between “Microsoft” and “office”. And because we ignore “in” and “the” and thus don’t specify which two other keywords we want, a document that contains “Microsoft... very nice office” will also match the second query.

So, in terms of searching, you can think of stop words in queries as placeholders that match any keyword.
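
The placeholder effect can be reproduced with a small toy model in Python. This is a sketch under stated assumptions, not Sphinx's matching code: the function names are hypothetical, words are split with a plain regex, and every word (stopped or not) is assumed to consume one position.

```python
import re

def keyword_positions(text, stop_words):
    """Return {(position, keyword)} for text. Stop words are dropped
    but still consume a position (positions start at 1)."""
    words = re.findall(r"\w+", text.lower())
    return {(pos, w) for pos, w in enumerate(words, start=1)
            if w not in stop_words}

def phrase_match(query, document, stop_words):
    """True if the query keywords occur in the document with the same
    relative positions as in the query."""
    q = sorted(keyword_positions(query, stop_words))
    d = keyword_positions(document, stop_words)
    if not q:
        return False
    first_pos, first_word = q[0]
    for pos, word in d:
        if word != first_word:
            continue
        shift = pos - first_pos  # align query start with this occurrence
        if all((p + shift, w) in d for p, w in q):
            return True
    return False

stop = {"in", "the"}
doc = "Microsoft... very nice office"
print(phrase_match("Microsoft Office", doc, stop))         # False
print(phrase_match("Microsoft in the office", doc, stop))  # True
```

In this model the second query keeps a gap of two positions between its keywords, so any two words in the document can fill it.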

That behavior is configurable with the stopword_step and overshort_step directives. Both are binary options that accept either 0 or 1. If stopword_step is 0, stop words are ignored even in the position counts just discussed. The default is 1, which counts stop words in position counts. Similarly, if overshort_step is 0, overshort words are ignored in position counts, and the default value of 1 counts them. If you change either of these directives, re-create your index for the changes to take effect.
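
To see what the step value changes, here is a small Python model of position assignment with a configurable stop word step. It is a sketch that assumes a stop word simply advances the position counter by the step value; positions_with_step is a hypothetical name and the regex split stands in for Sphinx's tokenizer.

```python
import re

def positions_with_step(text, stop_words, stopword_step=1):
    """Return (keyword, position) pairs for the words that are kept.
    A stop word is dropped and advances the position by stopword_step."""
    pos = 0
    kept = []
    for word in re.findall(r"\w+", text.lower()):
        if word in stop_words:
            pos += stopword_step
        else:
            pos += 1
            kept.append((word, pos))
    return kept

stop = {"in", "the"}
print(positions_with_step("Microsoft in the office", stop, stopword_step=1))
# [('microsoft', 1), ('office', 4)]
print(positions_with_step("Microsoft in the office", stop, stopword_step=0))
# [('microsoft', 1), ('office', 2)]
```

With a step of 0, the two kept keywords end up adjacent in this model, so the query would behave like a plain “Microsoft Office” phrase.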

Introduction to Search with Sphinx

Learn more about this topic from Introduction to Search with Sphinx.

Webmasters want fast and powerful search capabilities on their sites, and content management system administrators would like to reveal the wealth of their databases. The solution in both cases is the Sphinx search engine. This concise introduction to Sphinx shows you how to use this free software to index an enormous number of documents and provide fast results to both simple and complex searches.
