Concept Search vs. Keyword Search in eDiscovery

Having grown up in the world of enterprise-class solutions built on relational databases, I am very comfortable with SQL query-based searching. Over the past several years, with my focus on the litigation market and litigation technology, I have also become very familiar with keyword searching against scanned and OCR’d document files. However, given my passion for “leading edge” technology and for solutions that can meet the demands of a market going through a paradigm shift, I have not been overly excited or impressed by the state of search technology in eDiscovery.

That has all changed with the emergence of conceptual search technology. I have spent a tremendous amount of time researching conceptual search and how it compares with the alternatives from both a technology standpoint and a business value standpoint. Although I have come to the early conclusion that there is room and a need for all three approaches, I have also determined that there is still a tremendous amount of confusion regarding concept search vs. keyword search technology and the best use of both.

Therefore, in an effort to keep the followers of my blog informed, I have been following a series of excellent posts on the ediscovery 2.0 blog discussing conceptual search. The following is the full text of the latest post, titled “Concept Search Versus Keyword Search in Electronic Discovery”, by Will Uppington:

In my last post, I started a discussion on the myths surrounding concept search. The first myth I dispelled was the “concept search is concept search” myth. The myth is that there is an agreed upon definition of concept search. In actuality, when people in e-discovery use the term concept search, they don’t always mean the same thing. Frequently they are not actually talking about concept search technology at all and are actually talking about concept or content categorization technology, which is very different. The second myth that needs dispelling is that concept search is better than keyword search.

The thinking behind this myth goes something like this:

Keyword search has a lot of problems. It is prone to being over-inclusive, i.e., finding some non-relevant documents, and under-inclusive, i.e., not finding some relevant documents. Concept search technologies are new and interesting and using these technologies you can find documents that keyword search can’t find. Therefore, concept search must be better than keyword search.

Let’s examine this thinking. The first two statements are accurate. Keyword search is not perfect and can produce over- and under-inclusive results. And concept search and content categorization technologies can both help identify documents that keyword search technologies might not find. However, the conclusion that concept search is better than keyword search is not valid and doesn’t follow from these two statements. Why?

In order to answer this question, we first need to go back to the difference between concept search and content categorization. Because these are different technologies, we really need to separately compare concept search versus keyword search and content categorization versus keyword search. Let’s start with content categorization and keyword search.

The issue with this comparison is that keyword search and content categorization do different things. Keyword search can be used in many ways in e-discovery. The two most common are: (1) analysis or case assessment: finding the hot documents and understanding the matter by determining who knew what, when, how and why, etc., and (2) culling: removing non-responsive documents and/or identifying potentially privileged documents in order to reduce a large, starting set of documents to a smaller set before review.

Content categorization, on the other hand, has historically been used within the review phase of e-discovery. Categorization can help reviewers to better understand the documents they are reviewing and thus potentially increase the speed of review. Practitioners with whom I have worked also find that categorization can be useful during analysis by helping to understand a matter and identify potentially important keywords.

However, content categorization has not been used as part of culling, for two reasons. First, culling needs to be transparent. You need to be able to get agreement with or at least explain to the opposing side and the court exactly how you have culled the data set. If you cull based on categories of documents that have been generated by a proprietary, black-box algorithm, it’s going to be difficult to gain agreement on or explain your culling methodology. This is why the typical method of culling is still to use keyword search and either agree on the set of search terms with the opposing side or to use e-discovery search best practices to perform keyword searches on your own.

Second, content categorization has its own issues when it comes to being over- and under-inclusive. There is no guarantee that your group of documents that have been categorized as being related to, for example, a company’s hiring policies include all of the documents in your matter related to hiring policies or that they do not include some documents that may not really be related to hiring policies. Content categorization, like keyword search and virtually every information retrieval technology, is not perfect.
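
To make the transparency point concrete, here is a minimal Python sketch of keyword culling that records which agreed search term matched each document. The `cull` function, the document IDs, the sample texts and the search terms are all hypothetical and chosen only for illustration; this is not how any particular e-discovery product works.

```python
import re

def cull(documents, search_terms):
    """Keep documents that contain at least one search term.

    Returns (kept, audit). The audit maps every document id to the
    list of search terms that matched it, so the culling decision
    for each document can be explained term by term.
    """
    audit = {}
    for doc_id, text in documents.items():
        # Lowercase and split into whole words (exact matching, no stemming).
        words = set(re.findall(r"[a-z]+", text.lower()))
        audit[doc_id] = sorted(t for t in search_terms if t in words)
    kept = [doc_id for doc_id, hits in audit.items() if hits]
    return kept, audit

# Hypothetical documents and agreed search terms.
documents = {
    "DOC-001": "The hiring policy was revised after the interview process changed.",
    "DOC-002": "Lunch menu for the quarterly offsite.",
    "DOC-003": "Please confirm the new recruiting budget and hiring freeze.",
}
search_terms = ["hiring", "recruiting", "interview"]

kept, audit = cull(documents, search_terms)
for doc_id, hits in audit.items():
    print(doc_id, "kept" if hits else "culled", hits)
```

Because every kept or culled document can be traced back to a specific term on an agreed list, the methodology is easy to describe to the opposing side, which is much harder to do with categories produced by a black-box algorithm.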

So what about concept search technology? Surely, concept search technology is better than old, boring keyword search. Well, actually it’s not that clear-cut. The problem with concept search technology is that while it might find more relevant documents than plain keyword search, it will also likely find more false positives. Imagine searching for documents containing “terminate” in an employment matter and your concept search technology automatically searching for “fire”, “dismiss”, etc. as well. You’ll find more documents related to the termination of employees, but you’ll also find a lot more non-relevant documents concerning house fires, the fire department, etc.
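
The “terminate” example can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions: the `RELATED_TERMS` table is a hypothetical, hand-built stand-in for whatever a real concept search engine would derive on its own, and the six documents and their relevance labels are invented. Precision (the share of found documents that are relevant) measures over-inclusiveness, and recall (the share of relevant documents that were found) measures under-inclusiveness.

```python
import re

# Hypothetical, hand-built expansion table. Real concept search engines
# derive related terms themselves rather than reading a fixed list.
RELATED_TERMS = {"terminate": ["fire", "dismiss"]}

def tokenize(text):
    """Lowercase the text and return its set of whole words."""
    return set(re.findall(r"[a-z]+", text.lower()))

def search(documents, terms):
    """Return ids of documents containing at least one of the terms."""
    wanted = set(terms)
    return {doc_id for doc_id, text in documents.items() if tokenize(text) & wanted}

def precision_recall(found, relevant):
    """Precision: share of found docs that are relevant.
    Recall: share of relevant docs that were found."""
    true_hits = len(found & relevant)
    precision = true_hits / len(found) if found else 0.0
    recall = true_hits / len(relevant) if relevant else 0.0
    return precision, recall

# Invented documents for an employment matter; 1-3 are labeled relevant.
documents = {
    1: "We will terminate his employment on Friday.",
    2: "HR decided to fire the regional manager.",
    3: "The board voted to dismiss the CFO.",
    4: "The fire department inspected the warehouse.",
    5: "A house fire delayed the shipment.",
    6: "Quarterly sales figures attached.",
}
relevant = {1, 2, 3}

keyword_hits = search(documents, ["terminate"])
concept_hits = search(documents, ["terminate"] + RELATED_TERMS["terminate"])

print("keyword:", sorted(keyword_hits), precision_recall(keyword_hits, relevant))
print("concept:", sorted(concept_hits), precision_recall(concept_hits, relevant))
```

On this toy data the plain keyword search finds only document 1, so everything it returns is relevant but it misses two of the three relevant documents. The expanded search finds all three relevant documents, but it also pulls in the two documents about the fire department and the house fire.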

So concept search can help address the under-inclusive problem with keyword search (though it won’t solve it) and can be helpful during analysis. But it can often increase the over-inclusive problem. In addition, today’s concept search technologies share the transparency problem with concept categorization. These technologies have largely been designed as “black boxes”, which, as I have discussed in the past, makes sense for enterprise search but not for e-discovery search, and, as a result, could also be potentially difficult to explain and defend. For these reasons, concept search technology isn’t used very much in e-discovery today. In order for its use to become widespread, it will need to become more transparent. But that’s a topic for another day.

The bottom line here is that despite all the hype, concept search and content categorization technologies do not solve all the challenges of e-discovery search. Both of these technologies can be very useful and the technology behind them is always improving. However, as most of the experienced practitioners I work with already know, these technologies are generally better thought of as supplements to keyword search, not replacements. The important question is not whether to use one technology over the other but which technology is best suited to your objectives and how best to use all the available technologies to achieve the desired goal.

About Charles Skamser
Charles Skamser is an internationally recognized technology sales, marketing and product management leader with over 25 years of experience in Information Governance, eDiscovery, Machine Learning, Computer Assisted Analytics, Cloud Computing, Big Data Analytics, IT Automation and ITOA. Charles is the founder and Senior Analyst for eDiscovery Solutions Group, a global provider of information management consulting, market intelligence and advisory services specializing in information governance, eDiscovery, Big Data analytics and cloud computing solutions. Previously, Charles served in various executive roles with disruptive technology start ups and well known industry technology providers. Charles is a prolific author and a regular speaker on the technology that the Global 2000 require to manage the accelerating increase in Electronically Stored Information (ESI). Charles holds a BA in Political Science and Economics from Macalester College.