The technology behind search is much older than you might imagine. Philip Bagley described the basic processes behind search in his 1951 Master's thesis, based on the functionality of the MIT Whirlwind computer. IBM developed the inverted index file structure, still in use in most enterprise search applications, around 1975. The initial development of the BM25 ranking model by Stephen Robertson and Karen Spärck Jones started at roughly the same time, yet BM25 remains an important component of most text-retrieval software.
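To make the longevity of BM25 concrete, here is a minimal sketch of its term-scoring formula. The corpus statistics and the k1/b parameter values below are illustrative defaults, not taken from any particular search engine:

```python
import math

def bm25_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """Score one query term for one document under BM25.

    tf        -- how often the term occurs in the document
    doc_freq  -- how many documents in the corpus contain the term
    k1, b     -- tuning parameters (term-frequency saturation, length normalization)
    """
    # Robertson/Spärck Jones-style IDF; the +1 keeps the value positive
    idf = math.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5) + 1)
    # Term frequency saturates: the 5th occurrence adds less than the 1st
    tf_component = (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * tf_component

# A term appearing 3 times in a 100-word document, in a corpus of 1,000
# documents (50 containing the term, average length 120 words):
score = bm25_score(tf=3, doc_len=100, avg_doc_len=120, n_docs=1000, doc_freq=50)
print(round(score, 3))
```

A document's full BM25 score is simply the sum of this value over every query term, which is why the model pairs so naturally with an inverted index.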
Paradoxically, more research has been carried out into enhancing the performance of information retrieval than into any other enterprise application. At present approximately 200 research papers are published each day, and that excludes a similar number of papers on artificial intelligence that will likely have a secondary impact on information retrieval. The IR Anthology of research papers lists almost 50,000 items, and the comparable ACL Anthology on computational linguistics and natural language processing lists over 70,000 research papers.
Why Should I Care About Search-Related Research?
For managers of search applications, knowing what is emerging from research teams around the world is honestly not going to make any short-term impact on your daily work. What does matter is whether the vendor of your software, or the team that built your open source application, is taking this research into account.
For example, one of the big issues at present concerns the mathematics of vector models, which are the basis of finding related content. Historically, search applications have used sparse vectors, but now (as an outcome of significantly improved processing speeds) dense vectors are gaining ground. There is no easy-to-read comparison, but this blog post might be a good place to start. Ask your vendor which model they see as the future and listen closely to their explanation.
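The distinction is easier to see in miniature. A sparse vector stores only the handful of terms a document actually contains, while a dense vector (typically produced by an embedding model) has a value in every dimension. Both are compared with the same cosine-similarity arithmetic. The vectors below are toy values chosen for illustration, not the output of any real model:

```python
import math

# Sparse: term -> weight, with most dimensions implicitly zero.
# This is the representation an inverted index handles efficiently.
sparse_a = {"enterprise": 0.8, "search": 0.5}
sparse_b = {"search": 0.6, "relevance": 0.9}

def sparse_cosine(a, b):
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b)

# Dense: every dimension carries a learned value (real embedding
# models use hundreds of dimensions, not four).
dense_a = [0.12, -0.40, 0.33, 0.05]
dense_b = [0.10, -0.38, 0.30, 0.07]

def dense_cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(round(sparse_cosine(sparse_a, sparse_b), 3))
print(round(dense_cosine(dense_a, dense_b), 3))
```

Note that the sparse pair only matches where terms overlap literally, whereas dense vectors can score two documents as similar even with no shared vocabulary, which is exactly why they need different (and more compute-hungry) index structures.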
Related Article: (I Can’t Get No) Search Satisfaction
Query management and index management are intrinsically linked. If the index is not built in a way that supports the type of queries your users will make, there is no Plan B. This is why it is so important to understand user requirements and then work back from the UI to define the technology you need. It is well worth reading up on the extent to which different groups of users value the functional elements of a user interface.
A substantial amount of research is being conducted into query management. In general, this is easier to implement than major changes to the index. A good place to start is an outstanding (but very readable) thesis by David Maxwell. The excellent book "Understanding and Improving Information Search" is also well worth buying.
Related Article: How Well Do You Understand Your Content Processing Pipeline?
Relevance, Recall and Precision
Relevance pervades every discussion about search, so it's another important topic to dig into. A good place to start is to join Relevance Slack, which OpenSource Connections set up and which currently has almost 2,000 members. That number alone is a good indication of the quest to optimize relevance. Most vendors focus on precision, a metric based on the number of relevant results on the first page or two of results. Very little attention is paid to recall, which is perhaps even more important in enterprise search than in web search. Entertaining research papers are a rarity, but The eDiscovery Medicine Show is an exception in which the authors raise many important issues about assessing recall performance.
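The precision/recall trade-off is easy to demonstrate with a hypothetical query. Precision asks "of what was returned, how much was relevant?"; recall asks "of everything relevant in the collection, how much was found?" The numbers below are invented for illustration:

```python
def precision_recall(retrieved, relevant):
    """Compute precision and recall for one query's result set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant documents that were actually returned
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical query: 10 results returned, of which 6 are relevant,
# but the collection actually contains 20 relevant documents.
retrieved = range(1, 11)
relevant = list(range(1, 7)) + list(range(100, 114))  # 6 found + 14 missed
p, r = precision_recall(retrieved, relevant)
print(p, r)  # respectable precision (0.6) hides poor recall (0.3)
```

This is why recall matters so much in enterprise search and eDiscovery: a first page of plausible-looking results can conceal the fact that 70% of the relevant documents were never surfaced at all.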
Following the Trends in Microsoft Research
Microsoft Research is by far the most active vendor in terms of conducting and publishing research. It is well worth looking through the Microsoft Research website on a regular basis, as it can give you some indication of the company's direction of travel. As an example, a recent paper provided a substantial amount of detail about Enterprise Alexandria, one of the core technologies of Microsoft Viva.
Related Article: Has Microsoft 365 Been Clinically Tested?
Tracking the Research
Keeping track of the research into information retrieval is a difficult task. A partial solution is watching the preprints on the arXiv site hosted by Cornell University. Preprints are early, freely available versions of research papers, often of high quality, that have not yet been peer-reviewed. Look in particular at the papers in the Information Retrieval, Human-Computer Interaction and Computation and Language sections.
Most of these papers will eventually end up in one of the 30 or so journals that cover enterprise search technology, but many are behind a subscription paywall. The same goes for the papers published in the Digital Library of the Association for Computing Machinery (ACM). A title search on Google Scholar will often turn up an open access version of these papers.
The next five years will without a doubt bring significant developments in information retrieval, far more so than the last five years, thanks to a string of dramatic technical advances, notably BERT (Bidirectional Encoder Representations from Transformers) from Google and related progress in natural language processing. The question you should be asking is whether your search vendor will be a leader or a laggard in making use of these developments.
Martin White is Managing Director of Intranet Focus, Ltd. and is based in Horsham, UK. An information scientist by profession, he has been involved in information retrieval and search for nearly four decades as a consultant, author and columnist.