
Introduction
In the previous blog in this series, we explored the fundamentals of information retrieval in RAG, including structured and unstructured data retrieval, classical retrieval techniques (BM25, TF-IDF), modern retrieval techniques (Neural Search, Dense Retrieval), and vector embeddings with semantic search.
In this blog, we will dive deeper into the vector-based retrieval techniques that power modern RAG implementations. We will cover embedding models, vector similarity metrics, vector databases, indexing strategies, and performance optimizations for handling large-scale retrieval efficiently.
Word2Vec
Word2Vec is a neural network-based approach that represents words as dense vectors. It learns word relationships using two techniques:
Continuous Bag of Words (CBOW): Predicts a target word from surrounding context words.
Skip-Gram: Predicts surrounding context words from a target word.
Example: Training a Word2Vec model with gensim:
from gensim.models import Word2Vec

sentences = [['machine', 'learning', 'is', 'fun'],
             ['deep', 'learning', 'is', 'powerful'],
             ['neural', 'networks', 'can', 'learn']]

model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)
print(model.wv['learning'])  # Word vector for 'learning'
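By default, gensim trains with CBOW; setting sg=1 switches to Skip-Gram. A minimal sketch reusing the toy sentences corpus above (with so little data, the neighbors are only illustrative, not meaningful):

skipgram_model = Word2Vec(sentences, vector_size=100, window=5,
                          min_count=1, workers=4, sg=1)    # sg=1 selects Skip-Gram
print(skipgram_model.wv.most_similar('learning', topn=3))  # nearest neighbors of 'learning'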
FastText
FastText, an extension of Word2Vec, generates word embeddings by considering subword information, making it effective for handling out-of-vocabulary words.
Example: Training a FastText model:
from gensim.models import FastText
model = FastText(sentences, vector_size=100, window=5, min_count=1, workers=4)
print(model.wv['learning'])
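The subword advantage is easy to demonstrate: a word that never appeared in the training sentences still gets a vector, assembled from its character n-grams. A minimal sketch (the word 'learner' is an arbitrary out-of-vocabulary example):

print('learner' in model.wv.key_to_index)  # False: 'learner' is not in the vocabulary
print(model.wv['learner'][:5])             # a vector is still produced from subword n-grams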
GloVe (Global Vectors for Word Representation)
GloVe constructs word embeddings by factorizing a word co-occurrence matrix.
Example: Using pre-trained GloVe embeddings:
import numpy as np

def load_glove_embeddings(file_path):
    embeddings = {}
    with open(file_path, 'r', encoding='utf-8') as f:
        for line in f:
            values = line.split()
            word = values[0]
            vector = np.array(values[1:], dtype=np.float32)
            embeddings[word] = vector
    return embeddings

glove_embeddings = load_glove_embeddings('glove.6B.100d.txt')
print(glove_embeddings['learning'])
Transformer-Based Embeddings
Transformer models like BERT and Sentence-BERT (SBERT) provide contextual embeddings, allowing for superior semantic understanding.
Example: Using SBERT to generate embeddings:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')
sentence_embedding = model.encode("Machine learning is fascinating.")
print(sentence_embedding)
Creating and Using Embedding Models
We have covered enough theory about the glory of embeddings. The question remains: how do we actually create such embeddings and work with them in our applications? As with most things, there are two ways - buy or build!
OpenAI Embeddings
The "buy" option is pretty simple: OpenAI provides powerful pre-trained embedding models via their API.
Example: Generating embeddings with OpenAI:
import openai  # legacy SDK (openai < 1.0); expects OPENAI_API_KEY in your environment

response = openai.Embedding.create(
    input="Machine learning is fascinating.",
    model="text-embedding-ada-002"
)
embedding = response['data'][0]['embedding']
print(embedding)
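If you are on the current SDK (openai >= 1.0), the equivalent call looks roughly like this (a minimal sketch, same model and input as above):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.embeddings.create(
    model="text-embedding-ada-002",
    input="Machine learning is fascinating."
)
print(response.data[0].embedding)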
Using SBERT for Sentence Embeddings
On the other hand, SBERT provides sentence embeddings optimized for similarity search.
Example:
model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(["Machine learning is fascinating.", "AI is the future."])
print(embeddings)
Vector Similarity Metrics
Vector similarity is at the heart of semantic retrieval. If the vectorization is done well, a question and its relevant answer should lie close to each other in the vector space.
To compare embeddings, we use similarity metrics like:
Cosine Similarity (based on the angle between vectors)
Dot Product (magnitude-sensitive similarity)
Euclidean Distance (straight-line distance between vectors)
Example: Calculating cosine similarity:
from sklearn.metrics.pairwise import cosine_similarity
vector1 = embeddings[0].reshape(1, -1)
vector2 = embeddings[1].reshape(1, -1)
similarity = cosine_similarity(vector1, vector2)
print("Cosine Similarity:", similarity)
Introduction to Vector Databases
Vectors sound great, but they are not easy to work with at scale. We cannot simply store embeddings in flat files and brute-force compare the prompt against every entity in the dataset - that quickly becomes impractical. To work with vectors, we need a vector database that performs vector operations with optimized algorithms.
Vector databases are used to store and retrieve embeddings efficiently. There are many open-source and proprietary vector databases. Popular options include:
FAISS (Facebook AI Similarity Search)
Pinecone (Managed vector search service)
Weaviate (Graph-based vector search)
ChromaDB (Lightweight vector search database)
Milvus (Scalable AI-native vector database)
Example: Let us understand the concept by using FAISS for similarity search. Run the code below to see how a vector index works.
import faiss
import numpy as np

index = faiss.IndexFlatL2(768)                          # exact (flat) L2 index for 768-dim vectors
embeddings = np.random.rand(10, 768).astype('float32')  # 10 random vectors as stand-ins for real embeddings
index.add(embeddings)

query_embedding = np.random.rand(1, 768).astype('float32')
D, I = index.search(query_embedding, k=5)               # D: distances, I: indices of the top-5 matches
print("Top matches:", I)
Indexing Strategies for Efficient Retrieval
For any database, retrieval speed depends on the indexing strategy. Properly indexed data leads to faster retrieval, which means lower compute cost, lower latency, and a more efficient product.
Indexing Strategies:
There are several indexing strategies, each suited to different use cases. The three below are common and simple ones.
Flat Indexing: Exhaustive, exact search; simple but slow on large datasets.
IVF (Inverted File Index): Partitions vectors into clusters so that only a subset is searched (see the IVF sketch after the HNSW example below).
HNSW (Hierarchical Navigable Small World): Graph-based indexing; usually the most effective of the three for fast, high-recall approximate search.
Example: In code, it is very simple to configure FAISS to use HNSW indexing:
index = faiss.IndexHNSWFlat(768, 32)  # 768-dim vectors, 32 neighbors per node in the HNSW graph
index.add(embeddings)                 # HNSW needs no separate training step
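For comparison, here is a minimal IVF sketch on the same 768-dimensional embeddings. Unlike flat and HNSW indexes, an IVF index must be trained so it can learn its cluster centroids (the nlist and nprobe values below are illustrative, not tuned):

quantizer = faiss.IndexFlatL2(768)                 # coarse quantizer that defines the clusters
ivf_index = faiss.IndexIVFFlat(quantizer, 768, 4)  # nlist=4 clusters for this tiny demo
ivf_index.train(embeddings)                        # learn the cluster centroids
ivf_index.add(embeddings)
ivf_index.nprobe = 2                               # number of clusters to probe at query time
D, I = ivf_index.search(query_embedding, k=5)
print("Top matches:", I)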
Memory and Performance Optimizations
In the age of cloud-based scaling, every byte of RAM and every CPU cycle matters, because the cost multiplies across millions of requests once we scale up. Hence, we need solid strategies for optimizing performance. The common ones are:
Reducing vector dimensionality (PCA, Autoencoders): Systematically shrinks the vectors with minimal loss of accuracy.
Batch processing queries: A simple technique that amortizes per-query overhead across many queries (a batch-query sketch follows the PCA example below).
Using approximate nearest neighbors (ANN): Approximation is often good enough, and it reduces both compute and index size.
Example: Applying PCA for dimensionality reduction:
from sklearn.decomposition import PCA

# n_components must not exceed min(n_samples, n_features), so this assumes
# an embedding matrix with at least 128 rows.
pca = PCA(n_components=128)
reduced_embeddings = pca.fit_transform(embeddings)
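As for batch processing, FAISS (and most vector databases) accepts a whole matrix of queries in a single call, which amortizes the per-query overhead mentioned above. A minimal sketch against the 768-dimensional FAISS index built earlier:

query_batch = np.random.rand(32, 768).astype('float32')  # 32 queries submitted at once
D, I = index.search(query_batch, k=5)                     # a single call, a single round of overhead
print(I.shape)  # (32, 5): the top-5 matches for each of the 32 queries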
Handling Large-Scale Retrieval Efficiently
POCs are great: everything works well on a small setup. However, the world is entirely different when we deploy at scale. When millions of requests hit the server, we need hardened techniques that ensure each one is processed without errors.
Major techniques for handling large-scale retrieval include:
Sharding across multiple nodes
Using distributed vector databases (e.g., Milvus, Weaviate)
Hybrid search (combining keyword & vector search)
Weaviate, in particular, is gaining popularity because it supports hybrid search out of the box.
Example: Hybrid search with Weaviate:
import weaviate

client = weaviate.Client("http://localhost:8080")  # connect to a local Weaviate instance (v3 Python client)
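A minimal hybrid-query sketch, assuming the v3 Python client and an existing collection named Document with a content property (both names are placeholders for your own schema):

result = (
    client.query
    .get("Document", ["content"])
    .with_hybrid(query="machine learning", alpha=0.5)  # alpha blends keyword (0) and vector (1) scoring
    .with_limit(5)
    .do()
)
print(result)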
Conclusion
This blog covered advanced topics in RAG’s information retrieval component, from embeddings to efficient large-scale retrieval. Understanding these techniques is crucial for building scalable AI-powered retrieval systems.
Stay tuned for the next part, where we will explore integrating these retrieval methods with generative models in RAG systems!