
Retrieval Augmented Generation: 03/20

Writer: Vikas Solegaonkar

Updated: Mar 1

Introduction

In the previous blog in this series, we explored the fundamentals of information retrieval in RAG, including structured and unstructured data retrieval, classical retrieval techniques (BM25, TF-IDF), modern retrieval techniques (Neural Search, Dense Retrieval), and vector embeddings with semantic search.


In this blog, we will dive deeper into the vector-based retrieval techniques that power modern RAG implementations. We will cover embedding models, vector similarity metrics, vector databases, indexing strategies, and performance optimizations for handling large-scale retrieval efficiently.


Word2Vec

Word2Vec is a neural network-based approach that represents words as dense vectors. It learns word relationships using two techniques:

  • Continuous Bag of Words (CBOW): Predicts a target word from surrounding context words.

  • Skip-Gram: Predicts surrounding context words from a target word.


Example: Training a Word2Vec model with gensim:

from gensim.models import Word2Vec

sentences = [['machine', 'learning', 'is', 'fun'],
             ['deep', 'learning', 'is', 'powerful'],
             ['neural', 'networks', 'can', 'learn']]

# sg=0 (the default) trains with CBOW; set sg=1 to train with Skip-Gram
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)
print(model.wv['learning'])  # Word vector for 'learning'
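
With the model trained, gensim can also list a word's nearest neighbours in the learned vector space (results on a toy corpus this small will be noisy):

print(model.wv.most_similar('learning', topn=3))  # Closest words by cosine similarity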

FastText

FastText, an extension of Word2Vec, generates word embeddings by considering subword information, making it effective for handling out-of-vocabulary words.


Example: Training a FastText model:

from gensim.models import FastText

# Reuses the 'sentences' corpus from the Word2Vec example above
model = FastText(sentences, vector_size=100, window=5, min_count=1, workers=4)
print(model.wv['learning'])
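
Because FastText builds vectors from character n-grams, it can even embed a word it never saw during training:

# 'learnings' is not in the training corpus; its vector is composed from subword n-grams
print(model.wv['learnings'])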

GloVe (Global Vectors for Word Representation)

GloVe constructs word embeddings by factorizing a word co-occurrence matrix.

Example: Using pre-trained GloVe embeddings:

import numpy as np

def load_glove_embeddings(file_path):
    embeddings = {}
    with open(file_path, 'r', encoding='utf-8') as f:
        for line in f:
            values = line.split()
            word = values[0]
            vector = np.array(values[1:], dtype=np.float32)
            embeddings[word] = vector
    return embeddings

glove_embeddings = load_glove_embeddings('glove.6B.100d.txt')
print(glove_embeddings['learning'])
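
As a quick sanity check, we can compare two of the loaded vectors directly (assuming both words appear in the downloaded GloVe vocabulary):

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(glove_embeddings['learning'], glove_embeddings['teaching']))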

Transformer-Based Embeddings

Transformer models like BERT and Sentence-BERT (SBERT) provide contextual embeddings, allowing for superior semantic understanding.

Example: Using SBERT to generate embeddings:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
sentence_embedding = model.encode("Machine learning is fascinating.")
print(sentence_embedding)

Creating and Using Embedding Models

We have had enough theory about the glory of embeddings. The question remains: how do we create such embeddings and work with them in our applications? In this age, there are two ways of doing anything - buy or build!


OpenAI Embeddings

This is the "buy" route, and it is pretty simple: OpenAI provides powerful pre-trained embedding models via their API.


Example: Generating embeddings with OpenAI (using the openai>=1.0 Python SDK; it assumes an OPENAI_API_KEY environment variable is set):

from openai import OpenAI

client = OpenAI()  # Reads the API key from the environment

response = client.embeddings.create(
    input="Machine learning is fascinating.",
    model="text-embedding-ada-002"
)
embedding = response.data[0].embedding
print(embedding)
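
The endpoint also accepts a list of strings, so a whole batch of texts can be embedded in one request (a sketch; per-request input limits still apply):

texts = ["Machine learning is fascinating.", "AI is the future."]
batch = client.embeddings.create(input=texts, model="text-embedding-ada-002")
vectors = [item.embedding for item in batch.data]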

Using SBERT for Sentence Embeddings

On the other hand, SBERT provides sentence embeddings optimized for similarity search, and the model runs on your own infrastructure.


Example:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(["Machine learning is fascinating.", "AI is the future."])
print(embeddings)

Vector Similarity Metrics

Similarity between vectors is key to retrieval in RAG. If the vectorization was done correctly, a question and the text that answers it should sit very close to each other in the vector space.


To compare embeddings, we use similarity metrics like:

  • Cosine Similarity (angle between vectors; ignores magnitude)

  • Dot Product (combines angle and magnitude)

  • Euclidean Distance (straight-line distance between vectors)


Example: Calculating cosine similarity:

from sklearn.metrics.pairwise import cosine_similarity

# 'embeddings' holds the two SBERT vectors from the previous example
vector1 = embeddings[0].reshape(1, -1)
vector2 = embeddings[1].reshape(1, -1)

similarity = cosine_similarity(vector1, vector2)
print("Cosine Similarity:", similarity)

Introduction to Vector Databases

Vectors sound great, but they are not so easy to work with. We cannot simply save them in flat files and brute-force compare the prompt against every entry in the dataset; that does not scale. To work with vectors, we need a vector database that performs vector operations with optimized algorithms and indexes.

Vector databases are used to store and retrieve embeddings efficiently. There are hundreds of open source and proprietary vector databases. The popular options include:

  • FAISS (Facebook AI Similarity Search)

  • Pinecone (Managed vector search service)

  • Weaviate (Graph-based vector search)

  • ChromaDB (Lightweight vector search database)

  • Milvus (Scalable AI-native vector database)


Example: Let us make the concept concrete by using FAISS for similarity search. Run the code below to see how a vector index works.

import faiss
import numpy as np

index = faiss.IndexFlatL2(768)                           # Exact L2 (flat) index for 768-dim vectors
embeddings = np.random.rand(10, 768).astype('float32')   # Toy corpus of 10 random vectors
index.add(embeddings)

query_embedding = np.random.rand(1, 768).astype('float32')
D, I = index.search(query_embedding, k=5)                # D: distances, I: indices of nearest vectors
print("Top matches:", I)

Indexing Strategies for Efficient Retrieval

For any database, retrieval speed depends on the indexing strategy used. Properly indexed data retrieves faster, which means lower compute cost, lower latency, and a more efficient product.


Indexing Strategies:

There are several indexing strategies, suited to different use cases. The three below are common and simple ones:

  • Flat Indexing: Exact brute-force search. Simple, but slow on large datasets.

  • IVF (Inverted File Indexing): Partitions vectors into clusters and searches only the most promising ones (see the sketch after the HNSW example below).

  • HNSW (Hierarchical Navigable Small World): Graph-based approximate indexing. Typically the best speed-recall trade-off of the three.


Example: In code, it is very simple to configure FAISS to use HNSW indexing:

index = faiss.IndexHNSWFlat(768, 32)  # 768-dim vectors, 32 graph neighbours per node
index.add(embeddings)                 # Reuses the vectors from the FAISS example above
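
An IVF index needs one extra step: it must be trained on a sample of vectors so it can build its partitions. A minimal sketch with random vectors (nlist must not exceed the number of training points):

import faiss
import numpy as np

d, nlist = 768, 8                               # Vector dimension, number of partitions
vectors = np.random.rand(1000, d).astype('float32')

quantizer = faiss.IndexFlatL2(d)                # Coarse quantizer assigns vectors to partitions
index = faiss.IndexIVFFlat(quantizer, d, nlist)
index.train(vectors)                            # Learn the partition centroids
index.add(vectors)

index.nprobe = 2                                # Partitions scanned per query; higher = more accurate, slower
D, I = index.search(np.random.rand(1, d).astype('float32'), k=5)
print("Top matches:", I)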

Memory and Performance Optimizations

In the age of cloud-based scaling, every byte of RAM and every CPU cycle matters, because the cost multiplies by millions when we scale up to serve as many requests. Hence, we need strong strategies for optimizing performance. The common ones are:

  • Reducing vector dimensionality (PCA, autoencoders): This systematically compresses the vectors, with only a small loss of accuracy.

  • Batch processing queries: The simplest computational technique, amortizing per-query overhead across many requests (sketched after the PCA example below).

  • Using approximate nearest neighbors (ANN): Sometimes an approximation is good enough. This reduces compute as well as data size.


Example: Applying PCA for dimensionality reduction. Note that PCA can produce at most min(n_samples, n_features) components, so it must be fitted on at least as many vectors as the target dimension:

import numpy as np
from sklearn.decomposition import PCA

embeddings = np.random.rand(1000, 768).astype('float32')  # Needs >= 128 samples for 128 components

pca = PCA(n_components=128)
reduced_embeddings = pca.fit_transform(embeddings)
print(reduced_embeddings.shape)  # (1000, 128)
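
Batch processing is natural in FAISS: search accepts a matrix of queries, so many lookups share a single call. A small sketch combining it with an ANN (HNSW) index:

import faiss
import numpy as np

index = faiss.IndexHNSWFlat(768, 32)                    # Approximate, graph-based index
index.add(np.random.rand(1000, 768).astype('float32'))

queries = np.random.rand(64, 768).astype('float32')     # 64 queries in one call
D, I = index.search(queries, k=5)                       # One row of results per query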

Handling Large-Scale Retrieval Efficiently

POCs are great; everything works well on a small setup. However, the world looks entirely different when we deploy at scale. When millions of requests are jamming the server, we need hardened techniques to make sure each of them is processed without an error.


Major techniques for handling large-scale retrieval include:

  • Sharding across multiple nodes

  • Using distributed vector databases (e.g., Milvus, Weaviate)

  • Hybrid search (combining keyword & vector search)


Weaviate is gaining popularity here for its efficient hybrid search performance.


Example: Connecting to a Weaviate instance with the v3 Python client:

import weaviate

client = weaviate.Client("http://localhost:8080")
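
A hybrid query then blends BM25 keyword scores with vector similarity. A sketch, assuming the instance holds an "Article" class with a "content" property:

result = (
    client.query
    .get("Article", ["content"])
    .with_hybrid(query="machine learning", alpha=0.5)  # alpha: 0 = pure keyword, 1 = pure vector
    .with_limit(5)
    .do()
)
print(result)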

Conclusion

This blog covered advanced topics in RAG’s information retrieval component, from embeddings to efficient large-scale retrieval. Understanding these techniques is crucial for building scalable AI-powered retrieval systems.


Stay tuned for the next part, where we will explore integrating these retrieval methods with generative models in RAG systems!

 
 
