Building a fast, scalable semantic search system for millions or billions of documents requires different designs and hardware at different scales, but the overall picture is similar: a simple system has two core components:
- A document processing pipeline
- A data store and an indexing and retrieval system for fast similarity search
The document processing step may be part of a larger data pipeline. It may use GPU instances to deploy the deep learning models, or it may use a serverless architecture and CPU-based compute such as Azure Functions or AWS Lambda. Most frameworks now allow intra-op parallelism of deep learning computations across multiple CPU threads, so in some cases it can be cheaper to perform inference on CPUs. (In PyTorch, one can simply call torch.set_num_threads() to run the graph computations on multiple CPU threads.) Regardless, the key considerations are the size of the model and the throughput of the vectorization process.
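As a minimal sketch of CPU-based vectorization (assuming PyTorch and the Hugging Face transformers library; the model checkpoint, thread count, and naive mean pooling are illustrative choices, not a prescribed setup), the encoding step might look like this:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Use several CPU threads for intra-op parallelism during inference.
torch.set_num_threads(8)  # tune to the number of available cores

model_name = "sentence-transformers/all-MiniLM-L6-v2"  # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name).eval()

texts = ["first document to embed", "second document to embed"]
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    # Simple mean pooling over token embeddings (ignores the attention mask for brevity).
    embeddings = model(**inputs).last_hidden_state.mean(dim=1)
```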
A lot of work has been done in the NLP community to reduce the size of large transformer models while preserving most of their accuracy, mainly through model distillation and model quantization. Distilled models aim to reproduce the predictions of larger models (referred to as the teachers) with a fraction of the parameters. For example, DistilBERT, a model distilled from BERT, has only 6 transformer layers and 66 million parameters, roughly 40% fewer than BERT-base. You can find trained distilled models on Hugging Face, or distill your own fine-tuned model. Quantization, on the other hand, converts the inputs and parameters of a neural network from float32 to types that take fewer bits to represent, such as int8, which speeds up the computations performed within the network. Finally, throughput can also be improved by running inference with optimized engines such as TensorRT, ONNX Runtime, or, more recently, Nimble.
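As a rough sketch of post-training dynamic quantization (assuming PyTorch and the transformers library; the checkpoint name is only an example):

```python
import torch
from transformers import AutoModel

# Load a distilled 6-layer encoder (example checkpoint).
model = AutoModel.from_pretrained("distilbert-base-uncased").eval()

# Dynamic quantization: Linear layer weights are stored as int8 and activations
# are quantized on the fly, which speeds up matrix multiplications on CPU.
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
```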
In addition to these model-level optimizations, make sure your inference function uses dynamic batching instead of padding input sequences to the maximum allowed length (e.g. 512 tokens for BERT-like transformers). With dynamic batching, the tensors fed to the deep learning model are padded only to the longest sequence in the batch, reducing the amount of computation the model needs to perform. If possible, sort the input documents by length so that sequences within a batch have similar lengths, rather than mixing very short and very long texts in the same batch.
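A minimal sketch of dynamic batching with length sorting, assuming the Hugging Face tokenizer API (the batch size, model name, and sample documents are illustrative):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

documents = ["a short text", "a somewhat longer document about semantic search", "..."]

# Sort by length so each batch contains sequences of similar size.
documents = sorted(documents, key=len)

batch_size = 32
for start in range(0, len(documents), batch_size):
    batch = documents[start : start + batch_size]
    # padding="longest" pads only to the longest sequence in this batch,
    # not to the model's 512-token maximum.
    inputs = tokenizer(batch, padding="longest", truncation=True, return_tensors="pt")
    # ... run the encoder on `inputs` and collect the embeddings
```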
Another factor affecting query speed is the retrieval system. On CPU, the exact, exhaustive search we performed earlier in the demo quickly becomes too slow as the number of vectors grows, even when all vectors fit in memory. With GPUs, however, exact similarity search can be performed very quickly, as long as the vectors fit in GPU memory.
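For instance, an exact (brute-force) index can be built with FAISS along these lines; this is a sketch that assumes the faiss package is installed, and the random vectors stand in for real document embeddings:

```python
import numpy as np
import faiss  # faiss-cpu or faiss-gpu

d = 384                                                    # embedding dimension (model-dependent)
embeddings = np.random.rand(100_000, d).astype("float32")  # stand-in for real document vectors
faiss.normalize_L2(embeddings)                             # normalize so inner product == cosine

index = faiss.IndexFlatIP(d)                               # exact, exhaustive inner-product search
index.add(embeddings)

# Optionally move the index to a GPU when the vectors fit in GPU memory:
# res = faiss.StandardGpuResources()
# index = faiss.index_cpu_to_gpu(res, 0, index)

scores, ids = index.search(embeddings[:1], 10)             # top-10 neighbors of the first vector
```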
Typically, beyond about 100 million vectors (of a few hundred dimensions), approximate search methods need to be used. The vectors then need to be indexed, and the index itself should fit in the machine's or cluster's memory to ensure fast retrieval. For up to a few hundred million vectors, Google (with ScaNN) and Facebook (with FAISS) have developed efficient retrieval libraries that run on a single machine. Beyond billions of vectors, one should consider solutions based on multi-node clusters, such as Milvus, Weaviate, or AWS ElasticSearch. Note that most NLP transformer models output 768-d or 1024-d embeddings. To reduce the memory requirements and cost of storing such a large number of vectors, the dimensionality of these embeddings can often be reduced with principal component analysis (PCA) without much loss in accuracy.
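As a sketch of what an approximate FAISS index combining PCA and product quantization could look like (the index_factory string, which here reduces to 128 dimensions and uses 1024 inverted lists with 32-byte codes, and the random training vectors are illustrative choices):

```python
import numpy as np
import faiss

d = 768                                                    # typical transformer output dimension
embeddings = np.random.rand(100_000, d).astype("float32")  # stand-in for real embeddings

# PCA down to 128 dimensions, then an inverted-file index (1024 lists) with
# product quantization (32 sub-quantizers of 8 bits each -> 32 bytes per vector).
index = faiss.index_factory(d, "PCA128,IVF1024,PQ32")

index.train(embeddings)   # learns the PCA matrix, coarse centroids, and PQ codebooks
index.add(embeddings)

# At query time, visit 16 of the 1024 inverted lists (a recall/speed trade-off).
faiss.extract_index_ivf(index).nprobe = 16
scores, ids = index.search(embeddings[:1], 10)
```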
Finally, if you’d like a more fully featured framework for similarity search, you may consider Haystack, Jina, or similar libraries. These frameworks help you set up a more advanced search engine quickly, not only for text but also for multi-modal use cases, such as searching images or other types of data.