How to Search Multiple Indexes while using Sphinx

Posted by chco on Jun 14 2011 02:05 PM

The following excerpt from the O'Reilly publication Introduction to Search with Sphinx covers the easiest ways to work with multiple indexes. In most deployments, you will at some point need to maintain multiple indexes but search through all of them simultaneously.

The alternative would be to store everything in a single, possibly huge, index. That can only work well under a few very specific conditions: when the document collection does not get updated on a daily basis; when it’s OK to utilize a single core for every given search; when you don’t need to combine multiple entity types when searching; and so on. Most real-world tasks are different, and you will likely need more frequent index updates (counted in minutes rather than weeks), scaling across multiple cores, and so forth. Both updates and scaling, as well as a few fancier tasks, require that you be able to search through multiple indexes and combine (aggregate) results. So, let’s look at how that works.

Searching through multiple indexes can be explicit, when you enumerate several indexes in your query call:

$client->Query ( "John Doe", "index1 index2 index3" );



Separators in the index list are ignored, so you can use spaces, commas, semicolons, or anything else.
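
As a rough illustration of what "separators are ignored" means, the Python sketch below splits an index list on anything that is not a letter, digit, or underscore. The function name split_index_list and the exact character class are assumptions made for this example; this is not Sphinx's own parser.

```python
import re

def split_index_list(index_list):
    """Split an index list on any run of non-name characters.

    Assumption: index names consist only of letters, digits, and underscores,
    so every other character is treated as a separator.
    """
    return [name for name in re.split(r"[^A-Za-z0-9_]+", index_list) if name]

print(split_index_list("index1 index2 index3"))   # ['index1', 'index2', 'index3']
print(split_index_list("index1, index2;index3"))  # ['index1', 'index2', 'index3']
```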

Sphinx internally queries every index independently and creates a server-side result set for each one (the top N best matches from that index, where N equals max_matches). It then combines the obtained sets, sorts the combined matches once again (to restore the order you requested), and picks the top N best matches from all the indexes. This “combination” phase is normally very quick, because sorting several thousand matches in RAM is cheap. It only becomes noticeable when you set max_matches rather high and there are many actual matches.
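
The two-stage selection can be sketched in a few lines of Python. This is a toy model and not Sphinx code: matches are plain dictionaries, the ordering is assumed to be "highest weight first", document IDs are assumed to be unique across indexes, and the names top_n and search_all are made up for the example.

```python
import heapq

def top_n(matches, n):
    """Return the n best matches, highest weight first (ties go to the lower ID)."""
    return heapq.nsmallest(n, matches, key=lambda m: (-m["weight"], m["id"]))

def search_all(per_index_matches, max_matches):
    """Take the top N from every index, then re-sort and keep the overall top N."""
    combined = []
    for matches in per_index_matches:
        combined.extend(top_n(matches, max_matches))
    return top_n(combined, max_matches)

index1 = [{"id": 1, "weight": 10}, {"id": 2, "weight": 40}, {"id": 3, "weight": 20}]
index2 = [{"id": 7, "weight": 30}, {"id": 8, "weight": 5}]

result = search_all([index1, index2], max_matches=2)
print([m["id"] for m in result])  # [2, 7]
```

Because every per-index set is already cut down to N entries, the final sort never sees more than N times the number of indexes, which is why the combination step stays cheap for moderate max_matches values.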

The order of indexes in the query is important, however, because it can affect search results in certain cases. It is a nonissue when no rows are shared among indexes, that is, when every document ID is unique and occurs in exactly one index. But when a document ID is duplicated and occurs in both result sets (a case that likely involves different weights and attribute values), we have to pick a single version of that document. Sphinx picks the “newer” version, the one from the index that comes later in the list. For instance, if John Doe matches document 123 in both index1 and index3, and both matches make it into the respective result sets, the data from index3 wins. Note, however, that when document 123 isn’t in the intermediate result set for index3, the final combined result set will still contain the data from index1, even if document 123 did match in index3 but fell outside its top N. So, in a sense, matching documents from indexes specified later in the index list replace “older” matches. Therefore, in case of a conflicting duplicate row, you get the “newer” weight and attribute data in the combined result set whenever the newer match made it into its own intermediate result set.

In made-up pseudo-SQL syntax, this process of eliminating duplicates and combining results can be described as follows:

CREATE TEMPORARY TABLE tmp ...

INSERT INTO tmp SELECT * FROM <index1> WHERE <search-condition>
    ORDER BY <order-condition> LIMIT <max-matches>

REPLACE INTO tmp SELECT * FROM <index2> WHERE <search-condition>
    ORDER BY <order-condition> LIMIT <max-matches>

REPLACE INTO tmp SELECT * FROM <index3> WHERE <search-condition>
    ORDER BY <order-condition> LIMIT <max-matches>
...

SELECT * FROM tmp ORDER BY <order-condition> LIMIT <max-matches>
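
The pseudo-SQL above can be mimicked with a Python dictionary keyed by document ID, where a later assignment plays the role of REPLACE INTO. This is only a sketch under stated assumptions: matches are ordered by descending weight, the src field exists purely to show which index a row came from, and the function name combine is hypothetical.

```python
def combine(result_sets, max_matches):
    """Merge per-index result sets; later sets overwrite earlier ones by document ID."""
    tmp = {}
    for result_set in result_sets:  # processed in index-list order
        ranked = sorted(result_set, key=lambda m: -m["weight"])[:max_matches]
        for match in ranked:        # acts like REPLACE INTO tmp
            tmp[match["id"]] = match
    return sorted(tmp.values(), key=lambda m: -m["weight"])[:max_matches]

index1 = [{"id": 123, "weight": 85, "src": "index1"}]
index3 = [
    {"id": 9, "weight": 90, "src": "index3"},
    {"id": 8, "weight": 80, "src": "index3"},
    {"id": 123, "weight": 10, "src": "index3"},
]

# Document 123 is inside index3's top 3, so the index3 version wins.
wide = combine([index1, index3], max_matches=3)
print([(m["id"], m["src"]) for m in wide])    # [(9, 'index3'), (8, 'index3'), (123, 'index3')]

# With max_matches=2, document 123 misses index3's top 2, so index1's data survives.
narrow = combine([index1, index3], max_matches=2)
print([(m["id"], m["src"]) for m in narrow])  # [(9, 'index3'), (123, 'index1')]
```

The second call reproduces the edge case described earlier: the older row stays in the combined set because the newer match never reached its own intermediate result set.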


Internal index search order isn’t specified. In theory, Sphinx can decide to rearrange actual searches in whatever way it deems necessary. The final result set, however, is deterministic and guaranteed to stay the same.

But what does this have to do with quicker updates, scaling in general, and everyday use? The thing is, when using the disk-based indexing backend, partitioning data into multiple indexes is essentially the way to achieve both goals.

Basically, to speed up indexing updates, you put most of the data in a rarely updated “main” archive index (or index set) that only needs to be reindexed once in a while, and you put the tiny “dynamic” fraction of the data that changes actively into a separate “delta” index that can then be rebuilt (very) frequently. Then you search through both the “main” and “delta” indexes.
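
A minimal Python sketch of the main plus delta idea follows. It assumes each row carries an updated timestamp and that the delta is defined as "everything changed since the last main rebuild"; that cutoff scheme, the field name, and the helper names select_delta and index_list are all illustrative assumptions and not something Sphinx prescribes. The one point taken from the merge rules above is that the delta index goes last in the list, so its rows win on duplicate document IDs.

```python
from datetime import datetime

def select_delta(documents, last_main_build):
    """Return the rows a delta rebuild would pick up: those changed after the main build."""
    return [d for d in documents if d["updated"] > last_main_build]

def index_list(main_name="main", delta_name="delta"):
    """Build the index list with the delta last, so newer rows replace older ones."""
    return f"{main_name} {delta_name}"

documents = [
    {"id": 1, "updated": datetime(2011, 6, 1)},
    {"id": 2, "updated": datetime(2011, 6, 14)},
]
delta_rows = select_delta(documents, last_main_build=datetime(2011, 6, 10))

print([d["id"] for d in delta_rows])  # [2]
print(index_list())                   # main delta
```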

As for scaling, searches against a single index are single-threaded, so you have to set up several indexes to take advantage of multiple cores, CPUs, and disks, and you can search through all those indexes in one go just as well.
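
The partitioning idea can be illustrated with a toy Python model: documents are spread over several shards by document ID, every shard is searched by its own worker, and the partial results are merged and re-sorted. Everything here is an assumption made for the example (the modulo sharding rule, the word-match "search", the function names). Python threads only show the fan-out and merge structure; they do not demonstrate real multi-core speedups.

```python
from concurrent.futures import ThreadPoolExecutor

def shard(documents, shard_count):
    """Assign each document to a shard by document ID modulo the shard count."""
    shards = [[] for _ in range(shard_count)]
    for doc in documents:
        shards[doc["id"] % shard_count].append(doc)
    return shards

def search_shard(docs, word, limit):
    """Toy search: keep documents containing the word, best weight first."""
    hits = [d for d in docs if word in d["text"].split()]
    return sorted(hits, key=lambda d: (-d["weight"], d["id"]))[:limit]

def search_sharded(shards, word, limit):
    """Search every shard concurrently, then merge and re-sort the partial results."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(lambda s: search_shard(s, word, limit), shards))
    merged = [hit for partial in partials for hit in partial]
    return sorted(merged, key=lambda d: (-d["weight"], d["id"]))[:limit]

documents = [
    {"id": 1, "text": "john doe", "weight": 10},
    {"id": 2, "text": "jane doe", "weight": 30},
    {"id": 3, "text": "john smith", "weight": 50},
    {"id": 4, "text": "doe john", "weight": 20},
    {"id": 5, "text": "jane roe", "weight": 40},
]

hits = search_sharded(shard(documents, 3), "doe", limit=2)
print([d["id"] for d in hits])  # [2, 4]
```

Since document IDs are unique across shards here, the sharded search returns the same top matches as searching the whole collection at once.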

So, in one way or another, sooner or later you are going to divide and conquer and search more than one index in one go and have Sphinx combine the results via the routine we just discussed.

Learn more about this topic from Introduction to Search with Sphinx.

Webmasters want fast and powerful search capabilities on their sites, and content management system administrators would like to reveal the wealth of their databases. The solution in both cases is the Sphinx search engine. This concise introduction to Sphinx shows you how to use this free software to index an enormous number of documents and provide fast results to both simple and complex searches.
