Federated search always seems just around the corner

Federated search, sometimes called metasearch, has been filling up conference agendas for years, and The Search Engine Meeting this week was no exception: Abe Lederman, the President of Deep Web Technologies, explained his company’s approach. Federated search has many appealing qualities, but it is rife with the kind of technical hurdles that make a researcher’s head spin. Is federated search finally on its way? Should you be considering that kind of approach for your Web site?

In some sense, federated search has been with us for years: Dogpile, Metacrawler, and others have long provided an ability to search across multiple search engines, and similar approaches are available today for your Web site. The problem is that federated results just aren’t as good as those of the “one-big-index” search engines, such as Google. At first glance, that seems odd. After all, if one search engine can do a good job, wouldn’t searching across ten search engines do an even better job?
So far, the answer is no.
Federated search engines are limited in what they know about the documents they find, because they don’t actually crawl and index those documents—the underlying one-index search engines do. So, while Google’s spider looks at billions of documents across the Internet, Dogpile does not look at any—it merely gets the list of results from Google (and other search engines) and stitches together a list of search results.
Because Dogpile doesn’t actually examine the documents, it suffers from limitations that degrade its results. Relevance ranking, while difficult in a single-index search engine, is excruciating for a federated search engine. Google can rank documents based on where the words appear in the documents, which documents get links to them, and dozens of other factors. Dogpile can’t. Dogpile can only take a guess at which documents are better by examining the titles, snippets, and URLs that Google returns to display on its search results screen. That’s why most people prefer Google, or Yahoo!, or another one-index search engine to Dogpile and Metacrawler.
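To make that limitation concrete, here is a minimal Python sketch of how a federated engine might stitch several ranked lists into one when all it knows is each engine's ordering. It uses reciprocal rank fusion, a general-purpose merging technique picked purely for illustration; nothing here is claimed to be what Dogpile or any other product actually does. The engine names, URLs, function name, and the constant `k=60` are all assumptions.

```python
from collections import defaultdict

def merge_ranked_lists(result_lists, k=60):
    """Merge ranked URL lists using reciprocal rank fusion.

    result_lists: dict mapping a (hypothetical) engine name to its ranked URLs.
    k: damping constant; 60 is a commonly used default, assumed here.
    """
    scores = defaultdict(float)
    for urls in result_lists.values():
        for rank, url in enumerate(urls, start=1):
            # Only the position in each list is used. The document text and
            # the link graph are invisible to the merger.
            scores[url] += 1.0 / (k + rank)
    # Highest combined score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda url: (-scores[url], url))

# Made-up results from two hypothetical engines.
results = {
    "engine_a": ["http://example.com/a", "http://example.com/b", "http://example.com/c"],
    "engine_b": ["http://example.com/b", "http://example.com/d"],
}
merged = merge_ranked_lists(results)
```

In this toy run, the page that both engines returned rises to the top, which is about the only signal a merger like this has to work with.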
Similarly, one-index search engines have clever “de-duplication” algorithms to make sure that they are not storing multiple copies (or near-copies) of documents in their indexes. If they did, searchers would see several documents in their list of search results that are essentially the same. One-index search engines can de-duplicate because they see the actual documents as part of their indexing process. Federated search engines can’t easily “de-dup” documents when they are returned from multiple search engines. They can perform simple operations such as eliminating identical URLs, but they can’t eliminate near-duplicates very easily.
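The gap between the two kinds of de-duplication can be sketched in a few lines of Python. The first check, matching normalized URLs, is the easy part. The second compares word shingles from the snippets, which is only a guess at near-duplication because the full documents are never seen. The result format (dicts with `url` and `snippet` keys), the function names, the shingle size of 3, and the 0.8 threshold are all assumptions made for this sketch.

```python
from urllib.parse import urlsplit

def normalize_url(url):
    """Build a comparison key: lowercase host plus path, ignoring the
    scheme, the fragment, and any trailing slash."""
    parts = urlsplit(url)
    key = parts.netloc.lower() + parts.path.rstrip("/")
    if parts.query:
        key += "?" + parts.query
    return key

def shingles(text, size=3):
    """Return the set of overlapping word n-grams in text."""
    words = text.lower().split()
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def deduplicate(results, threshold=0.8):
    """Drop results whose URL key repeats or whose snippet nearly matches a kept one."""
    kept, seen_urls, kept_shingles = [], set(), []
    for result in results:
        key = normalize_url(result["url"])
        if key in seen_urls:
            continue  # the easy case: the same URL came back from two engines
        sh = shingles(result["snippet"])
        if any(jaccard(sh, other) >= threshold for other in kept_shingles):
            continue  # a guess: snippets can hide or exaggerate similarity
        seen_urls.add(key)
        kept.append(result)
        kept_shingles.append(sh)
    return kept

# Made-up results, as if gathered from several engines.
sample = [
    {"url": "http://Example.com/page/", "snippet": "Federated search sends one query to many engines"},
    {"url": "http://example.com/page#top", "snippet": "Federated search sends one query to many engines"},
    {"url": "http://mirror.example.org/page", "snippet": "Federated search sends one query to many engines"},
    {"url": "http://example.net/other", "snippet": "Relevance ranking is hard without the full document"},
]
unique = deduplicate(sample)
```

Snippets are query-dependent and short, so a check like this will both miss real near-duplicates and occasionally merge pages that merely share a sentence.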
So, federated search engines have long suffered with lower-quality results than one-index search engines, but they make up for it with slower performance. OK, it’s no laughing matter to the federated search engines, but the truth is that a federated search can only be as fast as the slowest one-index search engine it uses. Some fancy tricks can be played to begin showing results before all the one-index search engines have returned their results to the federated search facility, but the final result is always slower than the slowest one-index engine.
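A small simulation shows why. The Python sketch below queries three pretend engines concurrently, using `asyncio.sleep` in place of real network calls, and records the order in which they answer. The engine names and response times are made up; the point is only that results can be shown as they arrive while the complete answer still waits for the slowest engine.

```python
import asyncio
import time

async def query_engine(name, delay):
    """Stand-in for a network call to one underlying engine (simulated with sleep)."""
    await asyncio.sleep(delay)
    return name, [f"{name}-result-{i}" for i in range(1, 3)]

async def federated_search(engines):
    """Query all engines concurrently; return arrival order and combined hits."""
    tasks = [
        asyncio.create_task(query_engine(name, delay))
        for name, delay in engines.items()
    ]
    arrival_order, all_hits = [], []
    for finished in asyncio.as_completed(tasks):
        name, hits = await finished
        # A real front end could start rendering these hits right away.
        arrival_order.append(name)
        all_hits.extend(hits)
    return arrival_order, all_hits

# Hypothetical engines with made-up response times in seconds.
engines = {"fast": 0.05, "medium": 0.10, "slow": 0.30}
start = time.perf_counter()
arrival_order, all_hits = asyncio.run(federated_search(engines))
elapsed = time.perf_counter() - start
```

Because the queries run in parallel, the total time tracks the slowest engine rather than the sum of all three, but it can never beat that slowest engine.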
So, with all these technical hurdles, why are folks still working on federated search?

Scale. Some federated search enthusiasts believe that one-index search engines will eventually run out of steam because the number of documents will exceed their capacity. Thus far, those predictions have not panned out, because one-index search engines have applied parallel processing techniques to essentially create federated search inside themselves, so they gain all the scaling advantages of federated search while retaining all the information about the documents themselves.
Latency. Google’s spider can’t crawl the Internet constantly, so there is always a time lag between when a document is changed and when the search engine knows that it changed. In theory, if a document need only be indexed in one small search index, that search engine could keep up with its smaller number of changes more rapidly. Perhaps each changed document could even be sent to the search engine rather than waiting passively to be crawled. Federated search, in theory, would search all these small search engines and provide more up-to-date results than the one-index search engines.
Reach. Despite Google’s goal of making the world’s information accessible, it has a long way to go. Treasure troves of information are squirrelled away in private (fee-based) databases beyond the spider’s prying eyes. Publishers may be willing to allow a metasearch engine to query their private search engine when they are not willing to have their contents crawled. This sets up an odd dynamic for one-index search engines, in which the same problems get solved over and over again. Amazon offers search-within-the-book, but Google can’t use Amazon’s data, so it set up its own Google Print program. Yahoo! has also struck out on its own path to digitize books, but none of these efforts work together. Federated search engines might be able to search any of these printed book indexes to find a searcher’s answer. Object-oriented theorists believe that objects should be “findable” so that a search engine can literally query each document and have the documents respond if they should be found. Many problems exist with this theory, ranging from spam to performance, but who knows what the future will hold?

Relational database vendors have used federated approaches for years, so that each database can be searched in massive data warehouse applications. Federated relational databases are simpler than federated search, however, because each database is queried with the same SQL language, whereas there isn’t any real equivalent language that works across search engines. Moreover, with relational searches (“Show all the payroll records where the salary is higher than $60,000”) there is a single right answer. With search queries, every search engine would provide different relevance-ranked lists of documents even if they had exactly the same documents in their search indexes.
Abe Lederman, in his talk this week, described how his company is working to improve its de-duplication algorithm and how it is using several approaches to relevance ranking (including retrieving entire documents at search time to decide which ones should be ranked higher). The work is fascinating to search technology buffs like me, but it’s not clear how important it is to a business trying to improve search on its Web site.
If you already have multiple search engines on your site, federated search is an appealing way to build on top of that existing investment, but current federated search facilities don’t easily provide the kind of search experience that most one-index search engines do. So far, you would still be better off wiping out those smaller search engines and putting in one big one. But with all this research activity going on, someday the answer may be different.

Mike Moran
