Why search indexing should be more like printers

You might be one of those people who, to your horror, discovered that big chunks of your Web site were not indexed in Google and other search engines. That means those pages could never be found. Or perhaps you have a Web site or intranet search engine that has been similarly befuddled trying to find all your content and index it for search within your site. Why do search engines have such trouble with this seemingly basic task? And what can we learn about how to improve the situation from (of all things) printers?

This problem was brought to mind yesterday while I was on the panel discussing the “Future of Search” at the Enterprise Search Summit in New York. While I was thinking about exciting new features, such as personalization and semantic search, Chris Cleveland, the CEO of search vendor DieselPoint, had a much more prosaic topic—why can’t search engines ingest all the data that we have?

He’s right. Chris was talking about enterprise search (a good thing, given the conference we were at), but the same struggle applies to Internet search, too. Google and friends have done a great job of enhancing their spiders to find many forms of data, and they’ve even created the Sitemaps protocol so that we can help the spiders find our data, but many dynamic pages, Flash content, and other material remain buried from view.
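For readers who haven’t looked under the hood, the Sitemaps protocol is simply an XML file listing the URLs you want spiders to fetch. Here’s a minimal sketch in Python using only the standard library (the example URLs are made up for illustration):

```python
from xml.etree.ElementTree import Element, SubElement, tostring

def build_sitemap(urls):
    """Build a minimal sitemap.xml document from a list of page URLs."""
    urlset = Element("urlset",
                     xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
    for page in urls:
        url = SubElement(urlset, "url")
        SubElement(url, "loc").text = page  # the one required child element
    return tostring(urlset, encoding="unicode")

# Two hypothetical pages a spider might otherwise miss
print(build_sitemap([
    "http://www.example.com/products?id=42",  # dynamic page
    "http://www.example.com/about.html",
]))
```

The real protocol also allows optional hints such as last-modified dates and change frequency, but even this bare list of URLs is enough to point a spider at content it would never discover by following links.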

It’s even worse for intranets struggling with myriad data formats and data sources—it’s one of the things that makes implementing enterprise search so hard. Data sources are things like file systems, content management systems, and databases that the spider must know how to talk to nicely before they will offer up their treasures. Data formats are the ways that the content is encoded, such as Microsoft Office, HTML, XML, or PDF. Spiders must understand hundreds of data sources and hundreds more data formats to comprehensively index information, and even then, significant amounts of data are missed.

Chris’s company and some others are working on an effort called OpenPipeline so that enterprise search vendors can share the hard-won knowledge of how to ingest these various data forms. The open source Apache community is working on the Tika project to understand various data formats. All these efforts make sense, but I wonder if we are missing a bigger point.
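To make the format side of the problem concrete, here is a toy sketch of the dispatch problem every one of these projects solves: map each data format to an extractor that turns it into plain text a search engine can index. This is not Tika’s actual API; every name below is invented for illustration, and the HTML “extractor” is deliberately naive.

```python
import re

# Registry mapping a MIME type to the function that extracts its text
EXTRACTORS = {}

def extractor(mime_type):
    """Decorator that registers a function as the extractor for one format."""
    def register(func):
        EXTRACTORS[mime_type] = func
        return func
    return register

@extractor("text/plain")
def extract_plain(data):
    return data  # already indexable

@extractor("text/html")
def extract_html(data):
    # A real parser handles entities, scripts, and encodings;
    # this sketch just strips tags and collapses whitespace.
    return " ".join(re.sub(r"<[^>]+>", " ", data).split())

def index_text(mime_type, data):
    """Return indexable text, or fail the way spiders fail today."""
    if mime_type not in EXTRACTORS:
        # This is the gap OpenPipeline and Tika try to close:
        # every unregistered format is invisible to search.
        raise ValueError("no extractor for " + mime_type)
    return EXTRACTORS[mime_type](data)
```

Now multiply the registry by hundreds of formats—and then duplicate that work inside every search vendor’s engineering team—and you have today’s situation.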

Why isn’t it the responsibility of the software companies that create the data sources and the data formats to make that information findable by search engines?

Think about it. If you’re old like me, you might remember how painful it was to use printers. Back in the days of MS-DOS, before Windows, printer manufacturers shipped you a box with a book. That book contained the “language” of the printer. If you were a programmer, you could write a program that sent data to the printer encoded in its particular language, and it would then produce the printout you wanted.

But suppose you weren’t a programmer? Suppose you just wanted to use a word processor to print a letter? Well, then it got interesting. You had to make sure that the word processor you wanted to buy supported the printer that you wanted to buy. The advice back then was to pick the software you wanted and then buy hardware to match. Word processors even competed based on who supported more types of printers and how well they supported them.
What a waste of time! Every word processor company had staffs of developers figuring out every blessed new printer model that came down the pike. Customers faced huge complexity, and had to pay for all that redundant programming effort.

Microsoft brought an end to this particular madness with the idea that the operating system should be responsible for the hardware. So, from then on, word processor software just told Windows to print the stuff and printer manufacturers were expected to provide printer drivers so that any word processor could print on any printer. The experts in the printers, the manufacturers, did the heavy lifting, and they only did it once.

Customers no longer had to pay for all the wasted effort by the word processor companies and they were saved from the complexity of understanding which word processors supported which printers (and how well they did it).

We need the same thing for data sources and data formats. There’s no reason for enterprise search companies to be competing based on who can spider the most stuff. There’s no reason for those companies to be redundantly solving the same problems just to keep up with changing data sources and data formats. And there’s no reason for customers to be faced with the complexity of finding the enterprise search engine that can handle their data. Or for customers to work their tails off so Google and Yahoo! can find their data for Internet search.

Why don’t we insist that the companies that create these data sources and data formats make them findable? Why can’t we have a standard interface that any search engine can use to learn of the existence of content, to request that content (or have it automatically sent), and to get it in a form that can be easily understood and indexed? I think we need the content driver.
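No such standard exists today, but to make the “content driver” idea concrete, here is one hypothetical shape it could take, sketched in Python. The vendor who owns the data source or format implements a small interface once; any search engine can then enumerate and fetch content through it, just as any word processor can print through a printer driver. Every name here is invented.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass

@dataclass
class ContentItem:
    """Normalized output a content driver hands to any search engine."""
    uri: str
    title: str
    text: str       # plain, already-extracted text, ready to index
    modified: str   # last-modified stamp, so engines can crawl incrementally

class ContentDriver(ABC):
    """Hypothetical 'printer driver for content': the source's vendor
    implements this once, and every search engine can index the source
    without writing its own spider code for it."""

    @abstractmethod
    def list_uris(self):
        """Announce what content exists (analogous to a sitemap)."""

    @abstractmethod
    def fetch(self, uri) -> ContentItem:
        """Return one item in normalized, indexable form."""

# A trivial driver over an in-memory dict, standing in for a CMS,
# a file system, or a database.
class DictDriver(ContentDriver):
    def __init__(self, pages):
        self.pages = pages

    def list_uris(self):
        return list(self.pages)

    def fetch(self, uri):
        title, text = self.pages[uri]
        return ContentItem(uri, title, text, modified="unknown")  # placeholder

driver = DictDriver({"cms://doc/1": ("Welcome", "Hello from the intranet")})
for uri in driver.list_uris():
    item = driver.fetch(uri)
    print(item.uri, "->", item.text)
```

The point of the sketch is the division of labor: the extraction logic lives with the people who understand the source, and the search engine only ever sees `ContentItem`s, exactly as Windows applications only ever see the printing API.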

When data sources and formats work the way printer drivers do, both Internet and enterprise search will take a huge step forward.

Mike Moran

Mike Moran is an expert in digital marketing, search technology, social media, text analytics, web personalization, and web metrics, who, as a Certified Speaking Professional, regularly makes speaking appearances. Mike’s previous appearances include keynote speaking appearances worldwide. Mike serves as a senior strategist for Converseon, an AI-powered consumer intelligence technology and consulting firm. He is also a senior strategist for SoloSegment, a marketing automation software solutions and services firm. Mike also serves as a member of the Board of Directors of SEMPO.

Mike spent 30 years at IBM, rising to Distinguished Engineer, an executive-level technical position. Mike held various roles in his IBM career, including eight years at IBM’s customer-facing website, ibm.com, most recently as the Manager of ibm.com Web Experience, where he led 65 information architects, web designers, webmasters, programmers, and technical architects around the world.

Mike's newest book is Outside-In Marketing with world-renowned author James Mathewson. He is co-author of the best-selling Search Engine Marketing, Inc. (with fellow search marketing expert Bill Hunt), now in its Third Edition. Mike is also the author of the acclaimed internet marketing book, Do It Wrong Quickly: How the Web Changes the Old Marketing Rules, named one of best business books of 2007 by the Miami Herald. Mike founded and writes for Biznology® and writes regularly for other blogs.

In addition to Mike’s broad technical background, he holds an Advanced Certificate in Market Management Practice from the Royal UK Charter Institute of Marketing and is a Visiting Lecturer at the University of Virginia’s Darden School of Business. He also teaches at Rutgers Business School. He is a Senior Fellow at the Society for New Communications Research. Mike worked at ibm.com from 1998 through 2006, pioneering IBM’s successful search marketing program.
IBM’s website of over two million pages was a classic “big company” website that has traditionally been difficult to optimize for search marketing. Mike, working with Bill Hunt, developed a strategy for search engine marketing that works for any business, large or small. Moran and Hunt spearheaded IBM’s content improvement that has resulted in dramatic gains in traffic from Google and other internet portals.
