
Retrieval Augmented Generation: 02/20

Writer: Vikas Solegaonkar

Updated: Mar 1

Introduction

Retrieval-Augmented Generation (RAG) is a powerful AI architecture that combines generative models with information retrieval to produce accurate, context-aware responses. In RAG, the information retrieval component is crucial as it fetches relevant documents from external knowledge sources to enhance the generative process. This blog delves into the information retrieval component of RAG, covering structured and unstructured data retrieval, classical retrieval techniques, modern retrieval techniques, and vector embeddings and semantic search.


Structured and Unstructured Data Retrieval

Structured data is highly organized and stored in a tabular format, such as relational databases, spreadsheets, and tables. Information retrieval from structured data involves querying relational databases or using structured query languages like SQL to fetch relevant information.

Example: Suppose we have a table storing customer information:

| CustomerID | Name    | Age | City        |
|------------|---------|-----|-------------|
| 1          | Alice   | 28  | New York    |
| 2          | Bob     | 32  | Los Angeles |
| 3          | Charlie | 35  | Chicago     |

To retrieve customers from New York:

SELECT * FROM customers WHERE City = 'New York';

On the other hand, unstructured data lacks a predefined format and includes text documents, emails, images, audio, and video. Retrieving information from unstructured data requires processing natural language and applying algorithms to extract insights.


Example: Suppose we have a collection of text documents describing different topics. Retrieving documents mentioning "climate change" involves using keyword-based or semantic retrieval techniques.
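As a minimal sketch, a naive keyword-based retrieval over such a collection can be nothing more than a substring filter (the documents below are made up for illustration). Semantic retrieval, discussed later in this post, goes beyond this kind of literal matching.

# Naive keyword-based retrieval: keep documents that literally contain the query phrase
documents = [
    "Climate change is accelerating the melting of polar ice.",
    "The quarterly sales report shows strong growth.",
    "New policies aim to mitigate the impact of climate change."
]

query = "climate change"
matches = [doc for doc in documents if query.lower() in doc.lower()]
print(matches)  # the two documents mentioning climate change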


Classical Retrieval Techniques

Classical retrieval techniques rely on traditional algorithms to identify relevant documents based on keyword matching, term frequency, and document relevance. They are not the most accurate or sophisticated techniques available, but very often they are all we need to get things working.


TF-IDF (Term Frequency-Inverse Document Frequency)

TF-IDF measures the importance of a term within a document relative to a collection of documents.

  • Term Frequency (TF): Measures how frequently a term appears in a document.

  • Inverse Document Frequency (IDF): Measures the rarity of a term across all documents.


This is an intuitive way of judging a document's relevance to a given query. Although the technique is quite old, it still works well for most basic use cases.
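
In its simplest form, the score of a term t in a document d is tf-idf(t, d) = tf(t, d) × log(N / df(t)), where N is the total number of documents in the collection and df(t) is the number of documents containing t. The exact weighting and normalization vary between implementations, including scikit-learn's TfidfVectorizer used below.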


Example:

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'AI and machine learning are transforming industries.',
    'Deep learning is a subset of machine learning.',
    'Climate change is affecting the world.'
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

print("Feature names:", vectorizer.get_feature_names_out())
print("TF-IDF matrix:\n", X.toarray())

BM25 (Best Matching 25)

BM25 is a ranking function used by search engines to estimate the relevance of documents to a query. It builds on TF-IDF by adding term-frequency saturation and document-length normalization, so that long documents are not unfairly favored.


Example: Using BM25 with a Python library like rank_bm25:

from rank_bm25 import BM25Okapi

corpus = [
    "AI and machine learning are transforming industries.",
    "Deep learning is a subset of machine learning.",
    "Climate change is affecting the world."
]
tokenized_corpus = [doc.split() for doc in corpus]  # naive whitespace tokenization; a real pipeline would also lower-case and strip punctuation

bm25 = BM25Okapi(tokenized_corpus)
query = "machine learning"
scores = bm25.get_scores(query.split())
print("BM25 scores:", scores)

Modern Retrieval Techniques

Modern retrieval techniques use neural networks and deep learning to improve document relevance and ranking. Naturally, these can be far more accurate than the classical techniques, but they are also correspondingly more costly in terms of the compute required.


Neural Search

Neural search leverages deep learning models to understand query-document relationships based on semantics rather than keywords.


Example: Using a pre-trained model like Sentence-BERT to perform neural search:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')
corpus = [
    "AI and machine learning are transforming industries.",
    "Deep learning is a subset of machine learning.",
    "Climate change is affecting the world."
]

corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
query = "machine learning"
query_embedding = model.encode(query, convert_to_tensor=True)

scores = util.cos_sim(query_embedding, corpus_embeddings)
print("Cosine similarity scores:\n", scores)

Dense Retrieval

Dense retrieval uses dense vector representations of queries and documents, allowing for more nuanced similarity comparisons.


Example: Using FAISS for dense retrieval:

import faiss
import numpy as np

corpus = [
    "AI and machine learning are transforming industries.",
    "Deep learning is a subset of machine learning.",
    "Climate change is affecting the world."
]

embeddings = np.random.rand(len(corpus), 768).astype('float32')  # placeholder vectors; a real system would use an encoder such as Sentence-BERT
index = faiss.IndexFlatL2(768)  # exact (brute-force) L2 search over 768-dimensional vectors
index.add(embeddings)

query_embedding = np.random.rand(1, 768).astype('float32')  # must come from the same encoder / vector space as the corpus
D, I = index.search(query_embedding, k=3)  # D: distances, I: indices of the top matches
print("Top matches:", I)

Vector Embeddings and Semantic Search

In order to get this working, we need a vector representation of the objects. Vector embeddings represent text, images, or other data in a continuous vector space where similar data points are closer together.


Converting a chunk of data into an accurate vector representation is an active research topic. There are many ways to achieve this, and each has its pros and cons. Once we have a good vector representation of the available data, we can perform semantic search over it.
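
As a quick sketch, using the same Sentence-BERT model as in the examples above, each sentence becomes a fixed-size vector, and sentences about similar topics end up closer to each other than unrelated ones:

from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer('all-MiniLM-L6-v2')
sentences = [
    "AI is transforming industries.",
    "Machine learning is changing many businesses.",
    "Climate change is affecting the world."
]
embeddings = model.encode(sentences)

print("Embedding shape:", embeddings.shape)  # (3, 384) for this model
# The two related sentences should be closer than the unrelated pair
print("Distance 0-1:", np.linalg.norm(embeddings[0] - embeddings[1]))
print("Distance 0-2:", np.linalg.norm(embeddings[0] - embeddings[2]))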


Semantic Search

Semantic search retrieves documents based on meaning rather than exact keywords. If our data is represented in a suitable vector space, we can use simple vector arithmetic, such as Euclidean distance or cosine similarity, to measure how close a search query is to the available data.


Example: Using Sentence-BERT for semantic search:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')
corpus = [
    "AI and machine learning are transforming industries.",
    "Deep learning is a subset of machine learning.",
    "Climate change is affecting the world."
]
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)

query = "How does AI impact industries?"
query_embedding = model.encode(query, convert_to_tensor=True)
scores = util.pytorch_cos_sim(query_embedding, corpus_embeddings)

print("Cosine similarity scores:\n", scores)

Visualization Example

Vector embeddings can be visualized using tools like t-SNE or PCA to show how similar documents cluster together.

Example:

from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# 'model' and 'corpus' are reused from the Sentence-BERT example above
embeddings = model.encode(corpus)

# perplexity must be smaller than the number of samples (only 3 documents here)
tsne = TSNE(n_components=2, perplexity=2, random_state=42)
reduced_embeddings = tsne.fit_transform(embeddings)

plt.scatter(reduced_embeddings[:, 0], reduced_embeddings[:, 1])
for i, text in enumerate(corpus):
    plt.annotate(text, (reduced_embeddings[i, 0], reduced_embeddings[i, 1]))

plt.title('t-SNE Visualization of Document Embeddings')
plt.show()

Conclusion

In this part of the blog, we explored structured and unstructured data retrieval, classical retrieval techniques (TF-IDF, BM25), modern retrieval techniques (neural search, dense retrieval), and vector embeddings and semantic search. Understanding these concepts is fundamental to building a robust information retrieval component in RAG systems.


In the next part of this blog, we'll explore more details on how these retrieval techniques work on complex data.

 
 
