The three hard parts of manually tagging subjects on your web pages

Everyone loves faceted search: drilling down on your search results by selecting filters. We all use it in eCommerce to pick colors and prices, but we are also starting to see more of it on content websites. And most informational websites that tackle the top of the funnel would love to let you drill down by subject (or topic or theme or whatever your favorite word is). They want searchers to be able to drill down in search results based on what the results are about.

And on the search side, that’s easy. Just about every search engine can drill down based on facets, which you can define as just about any value in one of your tags. So, you can filter by year published or industry or anything you can tag for.
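To make the search side concrete, here is a minimal sketch in plain Python of what a facet does: count the tag values across the current results, then keep only the results that match the selected values. The documents, the field names (year, industry, subject), and the helper names facet_counts and drill_down are all hypothetical; a real search engine does this inside its index rather than in application code.

```python
from collections import Counter

# Hypothetical tagged pages; the field names are illustrative, not a required schema.
DOCS = [
    {"title": "Cloud cost basics", "year": 2022, "industry": "finance", "subject": "cloud"},
    {"title": "Securing patient data", "year": 2023, "industry": "healthcare", "subject": "security"},
    {"title": "Cloud migration checklist", "year": 2023, "industry": "healthcare", "subject": "cloud"},
    {"title": "Fraud detection primer", "year": 2023, "industry": "finance", "subject": "security"},
]


def facet_counts(docs, field):
    """Count how many results carry each value of a tag (shown beside the results)."""
    return Counter(doc[field] for doc in docs if field in doc)


def drill_down(docs, **selected):
    """Keep only the results whose tags match every selected facet value."""
    return [doc for doc in docs if all(doc.get(f) == v for f, v in selected.items())]


results = drill_down(DOCS, year=2023)
print(facet_counts(results, "subject"))
print([d["title"] for d in drill_down(results, subject="cloud")])
```

The filtering is only as good as the tags: if the subject field is missing or inconsistent, the facet is too.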

But the hardest thing is to tag your subjects. Three problems bedevil us:

Deciding the subjects. I’m not sure if you have ever tried to do this in a medium-to-large company, but it’s a beast of a process. Everyone has their own idea of what the right subjects should be, and how they should be arranged into some kind of hierarchical taxonomy or ontology. It takes forever to get agreement, and there is no guarantee that your site visitors will actually recognize the subjects your experts chose. And there is no way to know if the result reflects the actual composition of your content.
Tagging the documents. If anything, this is harder. You probably chose several dozen subjects and now you ask the authors to correctly classify each web page into the proper subject category (or categories). That might sound like an easy task, but it is actually quite hard for people to do. Some experiments show that not only do people disagree with each other on the right answer, but if you give the same person the same job a couple of days later, they often disagree with themselves. So, it’s a hard job that is rarely done consistently.
Making changes. You might see a pattern here, but this is actually the hardest job of all. To change your taxonomy, perhaps to reflect changes in your industry, you must reconvene all the same experts and get them to agree again on what the new subjects and hierarchy should be, which usually isn’t any easier than it was the first time. But then comes the real fun, which is that you have to go back, manually review every single document, and potentially retag it with the new and changed subjects. If you have more than a few thousand documents, you might imagine that this is excruciating. There might be a few ways to automate some of the task, but the work that is left is more than enough to dissuade anyone from ever changing their taxonomy.
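If you want to see how inconsistent manual tagging is on your own site, one common way to measure it is Cohen's kappa, which compares how often two sets of tags agree against how often they would agree by chance. The sketch below is a plain-Python version under simple assumptions: one subject per page, and two tagging passes over the same pages (two authors, or the same author a few days apart). The sample tags and the function name are made up for illustration.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Agreement between two tagging passes, corrected for chance agreement."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Share of pages where both passes chose the same subject.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected if each pass assigned subjects at its own base rates.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[s] * freq_b[s] for s in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both passes used one identical subject throughout
    return (observed - expected) / (1 - expected)


# Hypothetical tags from two authors for the same six pages.
author_1 = ["cloud", "cloud", "security", "pricing", "security", "cloud"]
author_2 = ["cloud", "security", "security", "pricing", "cloud", "cloud"]
print(round(cohens_kappa(author_1, author_2), 3))
```

A value of 1 means perfect agreement and a value near 0 means the two passes agree about as often as chance would predict.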

So, what do you do instead? Add a lot more automation.

I have worked with clients to use natural language processing and machine learning technology to examine their content and suggest a subject taxonomy based on what is actually there. The experts can provide some feedback to move categories around and combine others, but, in the end, the machine has done most of the work, and they can agree more easily on a picture that has already been painted than on how to fill a blank canvas.
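As a rough illustration of the idea, and not the method used with those clients, the sketch below proposes candidate subjects straight from the content: for each page it picks the term with the highest TF-IDF weight among terms that appear on at least two pages, then groups pages by that term. The sample pages, the tiny stopword list, and the function names are assumptions for the example; a real project would more likely use topic modeling or clustering over a much larger collection.

```python
import math
import re
from collections import Counter, defaultdict

# A deliberately tiny, hypothetical stopword list.
STOPWORDS = {"the", "a", "and", "to", "of", "for", "in", "your", "how", "with", "on"}


def tokenize(text):
    """Lowercase words, minus stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def suggest_subjects(pages, min_pages=2):
    """Group pages under their most distinctive term that is shared with other pages."""
    tokens = {page_id: tokenize(text) for page_id, text in pages.items()}
    doc_freq = Counter(term for words in tokens.values() for term in set(words))
    n = len(pages)
    groups = defaultdict(list)
    for page_id, words in tokens.items():
        counts = Counter(words)
        # Only terms found on at least min_pages pages can name a subject.
        candidates = [t for t in counts if doc_freq[t] >= min_pages]
        if not candidates:
            groups["(unassigned)"].append(page_id)
            continue
        # TF-IDF: frequent on this page, rare across the collection; ties break by term.
        best = max(candidates, key=lambda t: (counts[t] * math.log(n / doc_freq[t]), t))
        groups[best].append(page_id)
    return dict(groups)


# Hypothetical page text.
PAGES = {
    "p1": "Cloud migration checklist for moving workloads to the cloud",
    "p2": "Cloud cost management and cloud billing",
    "p3": "Security patching for servers: security updates",
    "p4": "Phishing and email security training, security awareness",
}
print(suggest_subjects(PAGES))
```

The output is a starting point for the experts to rearrange and merge, not a finished taxonomy.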

And once we have a machine-generated taxonomy, we have the training data to automatically tag all the documents with those subjects. Now, the tags might not always be correct, but we know humans don’t do the job correctly, either. At least we know the tagging will be done consistently, and we can work to improve the accuracy by analyzing and correcting errors.
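Here is a minimal sketch of that tagging step, assuming one subject per page and using a small multinomial naive Bayes classifier written in plain Python. Naive Bayes is a stand-in chosen for brevity, not necessarily what a production system would use, and the training pages, class name, and method names are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


class SubjectTagger:
    """Multinomial naive Bayes with add-one smoothing."""

    def fit(self, labeled_pages):
        # labeled_pages: iterable of (text, subject) pairs.
        self.word_counts = defaultdict(Counter)
        self.page_counts = Counter()
        for text, subject in labeled_pages:
            self.page_counts[subject] += 1
            self.word_counts[subject].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total_pages = sum(self.page_counts.values())
        # Words never seen in training carry no evidence, so skip them.
        words = [w for w in tokenize(text) if w in self.vocab]

        def score(subject):
            counts = self.word_counts[subject]
            denom = sum(counts.values()) + len(self.vocab)
            log_prior = math.log(self.page_counts[subject] / total_pages)
            return log_prior + sum(math.log((counts[w] + 1) / denom) for w in words)

        # Sorting the subjects makes tie-breaking deterministic.
        return max(sorted(self.page_counts), key=score)


# Hypothetical training pages labeled with the machine-suggested subjects.
TRAINING = [
    ("cloud migration checklist", "cloud"),
    ("cloud cost and billing", "cloud"),
    ("security patching for servers", "security"),
    ("phishing and email security training", "security"),
]
tagger = SubjectTagger().fit(TRAINING)
print(tagger.predict("reducing your cloud billing surprises"))
print(tagger.predict("email phishing awareness"))
```

The same page always gets the same tag, which is the consistency point above, and correcting a wrong tag means adding or fixing a training example and calling fit again.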

Lastly, it’s a lot easier to make changes. You can redo the taxonomy generation process based on any new content you’ve added that might have new subjects. And you can retrain the automation and relabel everything a lot more easily than doing it by hand, which makes you more willing to change your subject taxonomy as your content changes.

If it sounds too good to be true, well, it isn’t perfect. But I am confident that it is a whole lot better than what you are doing now. And if the pain of manual subject taxonomies has kept you from even trying them, this might be a simple way to provide a much easier search experience for your customers.