CapTech Home Page

Technical January 31, 2022

Semantic Search with Deep Learning and Python

Gilles Adande

In the last 5 years, Natural Language Processing (NLP) has leaped forward with the introduction of the transformer architecture, which brought large improvements to NLP applications. One such application is semantic matching. In this post, we illustrate how to build a semantic search function using state-of-the-art transformer-based deep learning models.

The goal here is not to build a scalable search engine, but rather to illustrate some of the key concepts in modern NLP, including the current paradigm of foundational models and transfer learning. Some considerations for building a scalable system are presented at the end of the post.


Whether you want to sift through millions of social media posts, extract information from reports of medical trials and academic research, or simply retrieve relevant texts from thousands of documents, reports, and notes generated by an organization as a part of its daily operation, you need an information retrieval system that can match a query and its search intent to the relevant documents.

Traditional information retrieval methods, such as BM25, rely on word frequencies within indexed documents and on key terms within the query to estimate the relevance of each document to the query. This approach has two key limitations that affect its accuracy. First, documents that do not contain the keywords but include terms that are semantically similar to those keywords may be missed. Second, for a pool of documents written in different languages, and especially languages with different scripts and alphabets, keyword matching fails whenever the query and the document do not share the same vocabulary.
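
To make the first limitation concrete, here is a minimal sketch of BM25 scoring in plain Python. The function name `bm25_scores`, the whitespace tokenization, the parameter defaults, and the three toy documents are illustrative assumptions, not part of the original demo.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with the Okapi BM25 formula."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in tokenized for term in set(d))
    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue  # a term absent from the document contributes nothing
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Hypothetical toy corpus: the second document is about cars but never says "car".
docs = [
    "the senate passed a bill on car safety",
    "new rules for automobile safety were adopted",
    "the committee discussed farm subsidies",
]
scores = bm25_scores("car safety", docs)
```

The second document is scored only through the shared word "safety". The synonym "automobile" earns no credit, which is the gap that embedding-based matching aims to close.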

With the language model approach, both queries and documents are embedded in a common latent space, and we can build a semantic matching function with the measure of relevance defined as the cosine similarity between the vector embeddings, regardless of the language, alphabet script, or presence of specific key terms. (For the multilingual case, you would need to ensure that the model you use has similar sentences aligned across languages.) This approach is called bi-encoding. We could add a second search stage by passing the query and each of the top N documents returned by the first stage as an input pair to a transformer. That additional step would further improve the accuracy of the search because the transformer’s attention heads are then able to see both inputs at once (the cross-encoding approach). An example of such a two-stage method is Facebook’s BLINK entity linker.
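
The two-stage idea can be sketched as follows, assuming the embeddings have already been computed. The names `two_stage_search` and `rerank_fn` are hypothetical. In a real system `rerank_fn` would wrap a cross-encoder that scores the (query, document) text pair; here it is a stand-in lookup so that the control flow can be run on its own.

```python
import numpy as np

def cosine_scores(query_vec, doc_vecs):
    """Cosine similarity between one query vector and each row of doc_vecs."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    return d @ q

def two_stage_search(query_vec, doc_vecs, rerank_fn, top_n=3):
    """Stage 1: rank by bi-encoder cosine similarity. Stage 2: rerank the top_n."""
    first_stage = np.argsort(-cosine_scores(query_vec, doc_vecs))[:top_n]
    rerank_scores = np.asarray([rerank_fn(int(i)) for i in first_stage])
    order = np.argsort(-rerank_scores)
    return [int(first_stage[j]) for j in order]

# Toy 2-d vectors standing in for real embeddings.
doc_vecs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.7, 0.7]])
query_vec = np.array([1.0, 0.0])

# Stand-in for a cross-encoder: it prefers document 1 over document 0.
fake_cross_encoder = {0: 0.2, 1: 0.9}
result = two_stage_search(query_vec, doc_vecs, lambda i: fake_cross_encoder[i], top_n=2)
```

Only the `top_n` candidates reach the second stage, which matters because a cross-encoder must run a full forward pass for every query and document pair.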

Foundational Models and Transfer Learning

Deep learning is all about building high-quality latent spaces, the vector spaces onto which data inputs are mapped. This usually requires a large amount of data. However, once this base vector representation is obtained, it is usually much faster to finetune the vector space for the task at hand.

For NLP, a few tech companies and organizations have open-sourced large pre-trained language models, such as BERT and GPT, that can be used as foundations for vector representations of text. While it is faster to use pre-trained foundational models, you should be aware of some of their pitfalls. If you have a large amount of data for your specific use case and sufficient computing resources for training, you may be better off pre-training the models on your data.

Most of these large pre-trained foundational models have been made easily accessible through Huggingface, which hosts dozens of foundational models (and thousands of models overall). All these models use the transformer architecture or variations of it, but they differ in pretraining procedures, and some architectures make some models better suited for certain tasks. In general, for specific tasks, having an abundance of high-quality data is more important than the minor differences between available models. If you have mission-specific data, you can further finetune the pre-trained models from this post on your data by using or adapting the bi-encoding contrastive training scripts from the sentence-transformers repository.

Let’s illustrate this point by testing bi-encoding on a few models from the Huggingface library on the following sentences.

We expect to see 2 main semantic clusters, {t1, t2, t3} and {t4, t5}, and within the first cluster, t2 being closer than t3.

We are going to encode these sentences with the following 3 models:

  • BERT, which is the foundational model;
  • all-mpnet-base-v2, which is a model finetuned from the mpnet foundational model on multiple semantic tasks and datasets; and
  • msmarco-bert-base-dot-v5, which is a model finetuned from BERT on the MS MARCO dataset of Bing queries and associated queried documents.

    Once the embedding for each text is computed, for each model we compute the cosine similarities between all pairs of embeddings.

    Here are the square matrices of cosine similarities for each of the models, plotted with seaborn. Cosine similarity ranges from -1 to 1; the closer two vectors are in direction, the closer the value is to 1.
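
A minimal sketch of the pairwise computation, assuming the embeddings are rows of a NumPy array. The helper names are hypothetical, and the toy vectors stand in for real sentence embeddings. The `embed` helper shows one way to obtain embeddings with the sentence-transformers package; it is imported lazily and not called here because it downloads the model on first use.

```python
import numpy as np

def cosine_similarity_matrix(embeddings):
    """Return the square matrix of cosine similarities between all rows."""
    emb = np.asarray(embeddings, dtype=np.float64)
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    return unit @ unit.T

def embed(texts, model_name="all-mpnet-base-v2"):
    """Encode texts with a sentence-transformers model (downloaded on first use)."""
    from sentence_transformers import SentenceTransformer  # lazy, optional dependency
    return SentenceTransformer(model_name).encode(texts)

# Toy 2-d vectors: the first two point the same way, the third is orthogonal.
toy_embeddings = [[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]]
sim = cosine_similarity_matrix(toy_embeddings)
```

The resulting matrix can be passed to `seaborn.heatmap(sim)` to produce a plot of the kind described above.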

    We can see that while all models are able to see that {t4, t5} are closely related, only the embeddings from mpnet clearly show the expected structure, with 2 main clusters and the {t2, t3} subcluster within the {t1, t2, t3} group. This is expected since mpnet has been specifically fine-tuned to recognize semantically similar sentences.

    Let’s now test another set of sentences:

    This time we are trying to see whether the models can correctly associate the query t1 to {t2, t3} and the query t4 to {t5}.

    Here are the corresponding similarity matrices:

    Here again we see that base BERT is lacking and is not able to associate t4 and t5. Compared to the previous test, however, the model finetuned on MS MARCO, the dataset of Bing queries and answers, is slightly better able to associate the query t1 with {t2, t3}.

    Search the 116th US House of Representatives Dataset

    You can find the house-of-representatives-congress-116 dataset on Kaggle. Once you download it and load it into pandas, you should have a dataframe like this:

    The summary column contains text summaries of the content of the various bills. We need to embed each of these into a real-valued vector. Before that, however, we need to split the longest summaries into several pieces of text, since the maximum sequence length for the considered models is 512 tokens. The 512-token limit in BERT keeps attention computations tractable, because the cost of self-attention grows quadratically with sequence length. Some newer bidirectional encoder models such as BigBird use sparse attention mechanisms to allow longer sequences to be embedded.
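
One possible way to split long summaries is an overlapping sliding window. The sketch below is an approximation that counts whitespace-separated words rather than model tokens. Because subword tokenizers usually produce more tokens than words, the default budget is set well under 512. The function name and the `max_tokens` and `overlap` defaults are assumptions for illustration.

```python
def split_into_chunks(text, max_tokens=256, overlap=32):
    """Split text into overlapping windows of at most max_tokens words."""
    if overlap >= max_tokens:
        raise ValueError("overlap must be smaller than max_tokens")
    words = text.split()
    if not words:
        return []
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_tokens]))
        if start + max_tokens >= len(words):
            break  # the last window already reaches the end of the text
    return chunks

# Synthetic 600-word "summary" used only to exercise the function.
summary = " ".join(f"w{i}" for i in range(600))
chunks = split_into_chunks(summary, max_tokens=256, overlap=32)
```

For exact limits, the word count would be replaced by the length of the token sequence produced by the tokenizer of the model being used.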

    Finally, our search function will compare an embedding of a query against all vectorized text summaries and return a sorted array of indices of the texts by similarity with the query.
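
A minimal sketch of such a search function, assuming the query and the text summaries have already been embedded. The name `search`, the `top_k` parameter, and the toy vectors are illustrative and not taken from the original code.

```python
import numpy as np

def search(query_embedding, doc_embeddings, top_k=5):
    """Return indices of the top_k documents sorted by cosine similarity,
    together with the corresponding similarity values."""
    q = np.asarray(query_embedding, dtype=np.float64)
    d = np.asarray(doc_embeddings, dtype=np.float64)
    q = q / np.linalg.norm(q)
    d = d / np.linalg.norm(d, axis=1, keepdims=True)
    sims = d @ q                        # one similarity per document
    order = np.argsort(-sims)[:top_k]   # descending by similarity
    return order, sims[order]

# Toy 2-d vectors standing in for real embeddings.
doc_embeddings = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
indices, similarities = search([1.0, 0.1], doc_embeddings, top_k=3)
```

When long summaries have been split into pieces, the returned indices refer to pieces, so a mapping from piece index back to the original row of the dataframe is also needed.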

    For our search function let’s pick the mpnet model, which seems slightly better at finding semantically similar sentences than the other two.

    Here are some examples from the top 5 results using the following input sentence:


    Building semantic matching at scale

    Building a fast, scalable semantic search system for millions, billions, or more documents requires different designs and hardware. The overall picture is similar, however. A simple system will have 2 core components:

    • A document processing pipeline
    • A data store and an indexing and retrieval system for fast similarity search

    The document processing step may be part of a larger data pipeline. It may use GPU instances to deploy the deep learning models, or it may use a serverless architecture and CPU-based computing resources such as Azure Functions or AWS Lambda. Most frameworks now allow intra-op parallelism of deep learning computations across multiple CPU threads, so in some cases it could be cheaper to perform inference on CPUs. (For PyTorch, one can simply call torch.set_num_threads() to set the number of CPU threads used for these computations.) Regardless, an important consideration is the size of the model and the throughput of the vectorization process.

    A lot of work has been done in the NLP community to reduce the size of large transformer models while preserving most of their accuracy, mainly through model distillation and model quantization. Distilled models aim to reproduce predictions from the larger models (referred to as the teachers) with a fraction of the parameters. For example, one model distilled from BERT, DistilBERT, has only 6 transformer layers and 66 million parameters, which is half as many layers and about 40% fewer parameters than BERT-base. You can find trained distilled models on Huggingface, or distill your own fine-tuned model. Quantization, on the other hand, consists of converting the inputs and parameters of neural networks from float32 to types that take fewer bits to represent, such as int8, thus increasing the speed at which computations are performed within the network. Finally, throughput improvement may also be achieved by running inference with optimized engines such as TensorRT, ONNX Runtime, or, more recently, Nimble.
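
To illustrate the idea behind quantization, here is a toy sketch of symmetric per-tensor int8 quantization with NumPy. It is a simplified illustration of the float32 to int8 conversion, not the scheme used by any particular framework, and the function names and random weights are assumptions.

```python
import numpy as np

def quantize_int8(x):
    """Map a float32 array to int8 using one shared scale for the whole tensor."""
    max_abs = float(np.max(np.abs(x)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an approximate float32 array from the int8 values."""
    return q.astype(np.float32) * scale

# Random weights standing in for one layer of a real network.
rng = np.random.default_rng(0)
weights = rng.normal(size=(4, 8)).astype(np.float32)
q_weights, scale = quantize_int8(weights)
restored = dequantize(q_weights, scale)
max_error = float(np.max(np.abs(weights - restored)))
```

The int8 copy takes a quarter of the memory of the float32 original, and the rounding error of each value is bounded by half the scale.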

    Beyond these throughput optimizations, when writing your inference function, make sure to use dynamic batching instead of padding the input sequences to the maximum allowed length (e.g. 512 for BERT-like transformers). With dynamic batching, the tensors fed to the deep learning model are padded only to the longest sequence in the batch, which reduces the number of operations the model needs to perform. If possible, sort the input documents by length so that all texts in a batch have similar lengths, rather than having a mix of very short and very long texts in the same batch.
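
A minimal sketch of dynamic batching with length sorting, assuming the inputs are already tokenized into lists of token ids. The function name `dynamic_batches`, the `pad_id` value, and the toy sequences are hypothetical.

```python
def dynamic_batches(token_sequences, batch_size, pad_id=0):
    """Yield (original_indices, padded_batch) pairs.

    Sequences are sorted by length first, and each batch is padded only
    to the length of its own longest sequence.
    """
    order = sorted(range(len(token_sequences)), key=lambda i: len(token_sequences[i]))
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        longest = max(len(token_sequences[i]) for i in idx)
        padded = [
            token_sequences[i] + [pad_id] * (longest - len(token_sequences[i]))
            for i in idx
        ]
        yield idx, padded

# Toy token id sequences of lengths 3, 50, 4, and 48.
sequences = [[1] * 3, [1] * 50, [1] * 4, [1] * 48]
batches = list(dynamic_batches(sequences, batch_size=2))
padded_tokens = sum(len(row) for _, batch in batches for row in batch)
```

In this toy case the model processes 108 token positions in total, compared with 2048 if each of the four sequences were padded to 512. The returned indices make it possible to put the outputs back in the original document order.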

    Another factor affecting the speed of the query is the retrieval system. On CPU, a direct, exact, and exhaustive search, as we did earlier in the demo, quickly becomes too slow as the number of vectors grows, even when all vectors fit in memory. With GPUs, however, an exact similarity search can be performed very quickly, as long as the vectors fit in GPU memory.

    Typically, beyond about 100 million vectors (of a few hundred dimensions), approximate search methods need to be used. The vectors then need to be indexed, and the index itself should fit in the memory of the machine or cluster to ensure fast retrieval. Up to a few hundred million vectors, Google with ScaNN and Facebook with FAISS have developed efficient retrieval methods that run on a single machine. Beyond billions of vectors, one should consider solutions based on multi-node clusters, such as Milvus, Weaviate, or AWS Elasticsearch. Note that most NLP transformer models output 768-d or 1024-d embeddings. Reducing the dimensionality of these embeddings with principal component analysis (PCA) can often be done without much loss in accuracy, and it reduces the memory requirements and cost of storing such a large number of vectors.
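
The PCA step can be sketched with NumPy alone, as below. The function names, the target of 128 dimensions, and the random matrix standing in for real 768-d embeddings are assumptions for illustration. How much accuracy is retained depends on the actual embeddings and should be measured on real data.

```python
import numpy as np

def fit_pca(embeddings, n_components):
    """Return the mean and the top principal directions of the embeddings."""
    mean = embeddings.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(embeddings - mean, full_matrices=False)
    return mean, vt[:n_components]

def apply_pca(embeddings, mean, components):
    """Project embeddings onto the retained principal directions."""
    return (embeddings - mean) @ components.T

# Random data standing in for real 768-d document embeddings.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(500, 768)).astype(np.float32)
mean, components = fit_pca(embeddings, n_components=128)
reduced = apply_pca(embeddings, mean, components)
```

Query embeddings must be projected with the same `mean` and `components` that were fitted on the documents, so that queries and documents stay in the same reduced space.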

    Finally, if you’d like a more fully featured framework to perform similarity search, you may consider Haystack, Jina, or similar libraries. These frameworks can help you to set up a more advanced similarity search engine more quickly, not only for text, but for multi-modal uses as well, such as searching images or any other type of data.