You might be one of those people who, to your horror, discovered that big chunks of your Web site were not indexed in Google and other search engines. That means those pages can never be found through search. Or perhaps you have a Web site or intranet search engine that has been similarly befuddled trying to find all your content and index it for search within your site. Why do search engines have such trouble with this seemingly basic task? And what can we learn about solving the problem from (of all things) printers?
This problem was brought to mind yesterday while I was on the panel discussing the “Future of Search” at the Enterprise Search Summit in New York. While I was thinking about exciting new features, such as personalization and semantic search, Chris Cleveland, the CEO of search vendor DieselPoint, had a much more prosaic topic—why can’t search engines ingest all the data that we have?
He’s right. Chris was talking about enterprise search (a good thing, given the conference we were at), but the same struggle applies to Internet search, too. Google and friends have done a great job of enhancing their spiders to find many forms of data, and they’ve even created the Sitemaps protocol so that we can help the spiders find our data, but many dynamic pages, Flash content, and other stuff remains buried from view.
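To see how modest that help is, consider what a minimal sitemap looks like: just an XML list of the URLs you want crawled, following the schema published at sitemaps.org (the URL and date here are illustrative):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- One <url> entry per page you want the spider to know about -->
  <url>
    <loc>http://www.example.com/products/page.html</loc>
    <lastmod>2007-05-15</lastmod>
  </url>
</urlset>
```

Useful, but notice what it doesn’t do: it tells the spider where pages are, not how to read whatever is inside them.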
It’s even worse for intranets struggling with myriad data formats and data sources—it’s one of the things that makes implementing enterprise search so hard. Data sources are things like file systems, content management systems, and databases that the spider must know how to talk to nicely so they will offer up their treasures. Data formats are the ways that the content is encoded, such as Microsoft Office, HTML, XML, or PDF. Spiders must understand hundreds of data sources and hundreds more data formats to index information comprehensively—and even then, significant amounts of data are missed.
Chris and some other companies are working on an effort called OpenPipeline so that enterprise search vendors can share the hard-won knowledge of how to ingest all these data sources and formats. The open source Apache community is working on the Tika project to understand various data formats. All these efforts make sense, but I wonder if we are missing a bigger point.
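Tika gives a flavor of what “understanding data formats” means in practice. A sketch using Tika’s simple facade API (the file path is whatever document you point it at):

```java
import java.io.File;
import org.apache.tika.Tika;

public class ExtractText {
    public static void main(String[] args) throws Exception {
        Tika tika = new Tika();
        File doc = new File(args[0]); // Office file, PDF, HTML -- Tika doesn't care

        // Sniff the format, then pull out plain text a search engine can index
        String mimeType = tika.detect(doc);
        String text = tika.parseToString(doc);

        System.out.println("Detected format: " + mimeType);
        System.out.println(text);
    }
}
```

One call, hundreds of formats. But note who wrote the parsers behind that call: the search community, not the companies that invented the formats.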
Why isn’t it the responsibility of the software companies that create the data sources and the data formats to make that information findable by search engines?
Think about it. If you’re old like me, you might remember how painful it was to use printers. Back in the days of MS-DOS, before Windows, printer manufacturers shipped you a box with a book. That book contained the “language” of the printer. If you were a programmer, you could write a program that sent data to the printer encoded in its particular language, and the printer would then produce the printout you wanted.
But suppose you weren’t a programmer? Suppose you just wanted to use a word processor to print a letter? Well, then it got interesting. You had to make sure that the word processor you wanted to buy supported the printer that you wanted to buy. The advice back then was to pick the software you wanted and then buy hardware to match. Word processors even competed based on who supported more types of printers and how well they supported them.
What a waste of time! Every word processor company had staffs of developers figuring out every blessed new printer model that came down the pike. Customers faced huge complexity and had to pay for all that redundant programming effort.
Microsoft brought an end to this particular madness with the idea that the operating system should be responsible for the hardware. So, from then on, word processor software just told Windows to print the stuff and printer manufacturers were expected to provide printer drivers so that any word processor could print on any printer. The experts in the printers, the manufacturers, did the heavy lifting, and they only did it once.
Customers no longer had to pay for all the wasted effort by the word processor companies, and they were saved from the complexity of understanding which word processors supported which printers (and how well they did it).
We need the same thing for data sources and data formats. There’s no reason for enterprise search companies to be competing based on who can spider the most stuff. There’s no reason for those companies to be redundantly solving the same problems just to keep up with changing data sources and data formats. And there’s no reason for customers to be faced with the complexity of finding the enterprise search engine that can handle their data. Or for customers to work their tails off so Google and Yahoo! can find their data for Internet search.
Why don’t we insist that the companies that create these data sources and data formats make them findable? Why can’t we have a standard interface that any search engine can use to learn of the existence of content, to request that content (or have it automatically sent), and to get it in a form that can be easily understood and indexed? I think we need the content driver.
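Nobody has defined that interface yet, but to make the idea concrete, here is a hypothetical sketch of what a content driver contract might look like. Every name in it is invented for illustration; the point is that the data source’s vendor, not the search engine, implements it:

```java
import java.io.InputStream;
import java.util.Iterator;

// Hypothetical "content driver" contract. None of these names are a real
// standard; the vendor of the data source would implement this once, and
// any search engine could then index that source.
interface ContentDriver {
    // Enumerate everything the source holds, so spiders need not guess
    Iterator<ContentItem> listContent();

    // Hand back one item, already decoded into an index-ready form
    ContentItem fetch(String id);
}

class ContentItem {
    String id;            // stable identifier within the source
    String mimeType;      // the item's native format
    String extractedText; // plain text the engine can index directly
    InputStream raw;      // original bytes, if the engine wants them
}
```

The exact shape matters less than the division of labor: the experts in the data source write the driver once, just as printer manufacturers write printer drivers.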
When data sources and formats work the way printer drivers do, both Internet and enterprise search will take a huge step forward.