Possible factors abound. Major web search engines, striving for top-notch, state-of-the-art ranking quality, account for hundreds of different ranking factors.
So-called text factors depend on the document, the query, or the entire document collection text. Typical factors include:
- How many times did our keywords occur within the matched document?
- How many times were they repeated in the query?
- How frequently does every keyword occur in the entire document collection?
- Do the keywords occur in the document in exactly the same order as they occur in the query? If not, are they at least close to each other, or are they scattered all around the document?
- Where do they occur: in the title field, or in the main content field, near the beginning, or near the end?
- Did we match the query keyword form exactly, or is it a stemmed match?
- How large was the font in which the keyword appeared on the HTML page?
Answers to questions such as these provide text-based factors that a search engine can use to compute its magic relevance value.
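To make this concrete, here is a tiny Python sketch of how one might compute a few such factors for a single document. It is illustrative only: the factor names, the whitespace tokenization, and the sample strings are my own simplifications, not how Sphinx or any particular engine does it.

```python
# A minimal, illustrative sketch (not Sphinx internals): computing a few of the
# text-based factors discussed above for one query and one document.

def text_factors(query, title, body):
    """Return a dict of simple text factors; all names here are made up."""
    q_terms = query.lower().split()
    title_terms = title.lower().split()
    body_terms = body.lower().split()

    return {
        # How many times does each keyword occur in the document?
        "occurrences": {t: body_terms.count(t) + title_terms.count(t) for t in q_terms},
        # Does any keyword occur in the title field?
        "in_title": any(t in title_terms for t in q_terms),
        # Position of the earliest keyword occurrence in the body (None if no match).
        "first_position": min(
            (body_terms.index(t) for t in q_terms if t in body_terms),
            default=None,
        ),
        # Do the keywords occur in the body in exactly the query order?
        "exact_phrase": " ".join(q_terms) in " ".join(body_terms),
    }

print(text_factors("okapi bm25",
                   "Okapi BM25 explained",
                   "bm25 is a classic ranking function and okapi bm25 is its full name"))
```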
Nontext factors are important as well. On a forum site, you might want to boost posts made by moderators in the search results. Or users with lots of hard-earned karma. Or threads that were unusually trafficked. Or, as is usually the case in production, all of the above. On a news site search, an important nontext factor is how recent the found document is. One well-known web search factor is PageRank, which is also a nontext factor. The text in a hyperlink (URL) could arguably be either a text factor or a nontext factor.
Ranking factors are different from sorting conditions, even though in the end they serve the same purpose (ordering the search results), and might use the very same data. Factors affect weights, and therefore the ordering of the results, but in a not-so-obvious way, whereas sorting conditions are used strictly for ordering.
To understand the distinction I’m making, consider a search on a news site. Assume that we can compute a text-based weight ranging from 0.0 to 1.0, and that we also have a publication date. When we boost our text weight by 0.1 for last-day articles, 0.05 for last-week articles, and 0.02 for last-month articles, it’s a weighting factor. However, it’s just one of the factors, and highly relevant matches posted a year ago will be able to outweigh barely relevant ones posted minutes ago. But we can also use the very same date as our primary sorting condition so that results from yesterday are guaranteed to come before results from last year, no matter what their text-based relevance rank (weight) is. We’re using the very same data, maybe even in the very same manner (sorting with that date-based 0.1/0.05/0.02 boost), but now it’s a sorting condition, and not a weighting factor.
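A small Python sketch of that distinction, with hypothetical data (the dates, weights, and helper name are mine, chosen to mirror the 0.1/0.05/0.02 example above):

```python
from datetime import date, timedelta

def recency_boost(published, today=date(2011, 1, 1)):
    # The date-based boost from the example; "today" is fixed so the output is deterministic.
    age = today - published
    if age <= timedelta(days=1):
        return 0.10
    if age <= timedelta(days=7):
        return 0.05
    if age <= timedelta(days=30):
        return 0.02
    return 0.0

# (text_weight, publication_date) pairs for three matched articles.
results = [
    (0.90, date(2010, 1, 3)),    # highly relevant, a year old
    (0.20, date(2010, 12, 31)),  # barely relevant, posted yesterday
    (0.55, date(2010, 12, 20)),  # decent match, two weeks old
]

# 1) Date used as a weighting factor: it nudges the weight, but a strong
#    text match from last year can still win.
by_weight = sorted(results, key=lambda r: r[0] + recency_boost(r[1]), reverse=True)

# 2) The very same date used as the primary sorting condition: anything newer
#    is guaranteed to come first, regardless of text weight.
by_date = sorted(results, key=lambda r: (r[1], r[0]), reverse=True)

print(by_weight)  # the year-old 0.90 match stays on top
print(by_date)    # yesterday's 0.20 match comes first
```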
One famous text-based ranking function is called Okapi BM25 (or BM25 for brevity). It was developed back in the 1980s and 1990s, but is still in wide use. The key idea of BM25 is to rank documents higher if they contain rare keywords and if they contain many occurrences of a keyword. The BM25 weight is computed as:

    BM25 = sum over i = 1..W of IDF(i) * TF(i) * (k + 1) / (TF(i) + k * (1 - b + b * DL / avgDL))
    IDF(i) = log((N - n + 1) / n) / log(N)

where:
- W is the number of keywords.
- TF(i) is term frequency, that is, the number of times the ith keyword occurred in the document being ranked.
- IDF(i) is inverse document frequency, that is, a normalized frequency of the ith keyword in our entire document collection.
- N is the number of documents in our entire collection.
- n is the number of documents that match the ith keyword.
- DL is the current document length.
- avgDL is the average document length in our collection.
- k and b are magic constants (e.g., k = 1.2, b = 0.75).
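To make the formula concrete, here is a minimal Python sketch that computes the BM25 weight exactly as defined above for one query and a toy in-memory collection. The tokenization and sample data are simplifications of my own, not Sphinx code.

```python
import math

def bm25(query_terms, doc_terms, docs, k=1.2, b=0.75):
    """BM25 weight of one document for a list of query terms.

    docs is the whole collection (a list of term lists); doc_terms is the
    document being ranked. Toy code for illustration, not Sphinx internals.
    """
    N = len(docs)                                  # documents in the collection
    DL = len(doc_terms)                            # current document length
    avgDL = sum(len(d) for d in docs) / N          # average document length

    weight = 0.0
    for term in query_terms:                       # i = 1..W
        TF = doc_terms.count(term)                 # term frequency in this document
        n = sum(1 for d in docs if term in d)      # documents matching this keyword
        if n == 0:
            continue
        IDF = math.log((N - n + 1) / n) / math.log(N)   # normalized IDF, -1.0 to 1.0
        weight += IDF * TF * (k + 1) / (TF + k * (1 - b + b * DL / avgDL))
    return weight

docs = [s.split() for s in (
    "the quick brown fox",
    "the lazy dog sleeps",
    "quick quick quick fox jumps over the lazy dog",
    "a cat naps",
)]
print(bm25(["quick", "fox"], docs[0], docs))  # short document, one occurrence of each keyword
print(bm25(["quick", "fox"], docs[2], docs))  # longer document, repeated keyword
```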
The IDF(i) factor is the part that assigns more weight to rare keywords. It ranges from -1.0 when n = N (i.e., the keyword occurs in all documents) to 1.0 when n = 1 (the keyword occurs in exactly one document), and reaches 0.0 when n = (N+1)/2. So, it’s best when the keyword is so rare that it only occurs in a single document, and it’s worst when the keyword occurs in every document. Note that when the keyword occurs in more than half of the documents indexed, IDF gets negative, and matching the keyword actually hurts the rank.
This may seem counterintuitive at first, but if there are more documents with the keyword than without it, we are probably more interested in the less common documents, the ones that do not mention the overly frequent keyword. As a crude example, in an aeronautics database, you can expect “airplane” to be mentioned frequently, and the presence of that word does not help determine which documents are of interest to the person conducting the search.
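A quick numeric check of those boundary cases, using the normalized IDF from the formula above (the collection size of 1,000 is an arbitrary assumption):

```python
import math

N = 1000  # arbitrary collection size for illustration

def idf(n, N=N):
    # Normalized inverse document frequency, as in the BM25 formula above.
    return math.log((N - n + 1) / n) / math.log(N)

print(idf(1))             #  1.0  -> keyword occurs in exactly one document
print(idf((N + 1) // 2))  #  ~0.0 -> keyword occurs in half of the documents
print(idf(N))             # -1.0  -> keyword occurs in every document
print(idf(800))           #  negative: matching such a common keyword hurts the rank
```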
The TF(i) part essentially boosts the weight when the keyword occurs several times, and the complex fraction just serves to limit that boost. We could simply multiply TF(i) by IDF(i), but in that case a document mentioning one keyword 1,000 times would be ranked way too high. On the other hand, this fraction (simplified under an assumption that our document is of average length, or DL = avgDL):

    TF(i) * (k + 1) / (TF(i) + k)

is guaranteed to range from 1.0 to 1.0 + k, growing as TF(i) grows but never going over a certain limit. So, k is essentially a limit on the boost given to a weight by a term’s (a keyword’s) frequency within a document.
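With k = 1.2, a few values of that simplified fraction show the saturation (illustrative numbers only):

```python
k = 1.2  # the same magic constant as above

def tf_boost(TF, k=k):
    # The simplified fraction, assuming DL = avgDL.
    return TF * (k + 1) / (TF + k)

for TF in (1, 2, 5, 50, 1000):
    print(TF, round(tf_boost(TF), 3))
# 1 -> 1.0, 2 -> 1.375, 5 -> 1.774, 50 -> 2.148, 1000 -> 2.197 (never above 1 + k = 2.2)
```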
Finally, the b constant controls some length magic that boosts shorter documents. b should take a value from 0 to 1. When b = 0, document length will not be accounted for at all. When b = 1, document length has the greatest possible influence on the BM25 function.
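A similar peek at the length normalizer, 1 - b + b * DL / avgDL, from the denominator (the document lengths here are arbitrary): with b = 0 it is always 1, and the larger b gets, the more a long document is penalized and a short one favored.

```python
def length_norm(DL, avgDL, b):
    # The length part of the BM25 denominator; larger values shrink the weight.
    return 1 - b + b * DL / avgDL

avgDL = 100
for b in (0.0, 0.75, 1.0):
    short = length_norm(50, avgDL, b)    # half the average length
    long_ = length_norm(200, avgDL, b)   # twice the average length
    print(b, round(short, 2), round(long_, 2))
# b = 0.0  -> 1.0  1.0   (length ignored)
# b = 0.75 -> 0.62 1.75
# b = 1.0  -> 0.5  2.0   (length matters most)
```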
BM25 is important because it’s known to work pretty well, and it is used in virtually every search system that does any relevance ranking at all. It’s a well-known, de facto standard ranking function that is both a good foundation and a good reference: a solid foundation on which to build more complicated, state-of-the-art ranking functions, and a widespread baseline reference model at the same time. In fact, most existing open source systems implement nothing but BM25.
BM25’s major drawback is that it considers only keyword statistics, and does not care how the keywords are located in the document with respect to one another and to the query. For instance, it will rank a document that simply mentions all the keywords in diverse places exactly as high as a document that matches the query perfectly as a phrase. So, given the query “to be or not to be,” BM25 would not boost Hamlet above all the documents that contain those common English words (and of course, the words are too common to perform well in BM25 anyway). Usually one would expect verbatim quotes to be ranked higher.
That quality drawback in BM25 was, in fact, one of the reasons I created Sphinx in the first place. Sphinx can do classic BM25 ranking, but it defaults to a combined ranking function that uses BM25 as a secondary factor and the degree to which the document matches the query as a phrase as the primary factor. We call that phrase proximity ranking. Hence, with the default Sphinx ranking function, documents with verbatim quotes are guaranteed to rank above others, and even documents with partial subphrase matches are guaranteed to rank above documents that merely contain the keywords scattered around.
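If you want to compare the two behaviors yourself, something along these lines should work with the Python client (sphinxapi) that ships with Sphinx. Treat it as a sketch: the index name is hypothetical, and the exact constant names, default port, and call signatures may differ across Sphinx versions.

```python
import sphinxapi  # the Python client bundled with the Sphinx distribution

cl = sphinxapi.SphinxClient()
cl.SetServer("localhost", 9312)   # default searchd port; adjust to your setup

# Classic BM25-only ranking.
cl.SetRankingMode(sphinxapi.SPH_RANK_BM25)
bm25_only = cl.Query("to be or not to be", "myindex")   # 'myindex' is a hypothetical index name

# The default combined ranker: phrase proximity first, BM25 second.
cl.SetRankingMode(sphinxapi.SPH_RANK_PROXIMITY_BM25)
proximity = cl.Query("to be or not to be", "myindex")
```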
Learn more about this topic from Introduction to Search with Sphinx.