
Generating Embeddings for a RAG System

RAG (Retrieval-Augmented Generation) is one of the hottest topics in AI today. It lets you get the desired outputs from a model without fine-tuning it or modifying its underlying layers and weights. Think of it like this: you send your query along with specific instructions to the AI, and it returns results aligned with those instructions. These instructions are what we call a prompt, and the effectiveness of your RAG system largely depends on how well you design that prompt.

Another critical component of RAG is fetching relevant information from your source database. This data helps you build the prompt more effectively. To achieve this, you need to retrieve data that is contextually similar to the user’s query, a process based on semantic similarity. This is where a Vector Database comes into play. As discussed in our previous posts, vector embeddings play a crucial role here. Transformers and other NLP models accept embedded vectors of tokens as input, and these embeddings encode semantic meaning. If you’re not familiar with vector embeddings or transformers, we recommend checking the Understanding Transformers page on Styrish AI.

At a high level, RAG is about finding contextually or semantically similar data to a user’s query. Text in plain English, or any human-readable language, needs to be converted into high-dimensional numeric vectors. Each token is assigned a numeric representation, and the closeness between two vectors indicates how contextually related the tokens are. This is the core concept that powers semantic retrieval in RAG systems.
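
To make this concrete, here is a toy sketch of how closeness between vectors is measured with cosine similarity: vectors pointing in nearly the same direction score close to 1, while unrelated vectors score lower. The three small vectors below are made up purely for illustration and do not come from any real model.

```python
# Toy example: cosine similarity between small, made-up vectors.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between vectors a and b (1.0 = same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

v_cat = np.array([0.9, 0.8, 0.1])       # hypothetical embedding for "cat"
v_kitten = np.array([0.85, 0.75, 0.2])  # hypothetical embedding for "kitten"
v_car = np.array([0.1, 0.2, 0.95])      # hypothetical embedding for "car"

print(cosine_similarity(v_cat, v_kitten))  # high score -> semantically close
print(cosine_similarity(v_cat, v_car))     # lower score -> less related
```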

In the exercise below, we will demonstrate how embedding vectors are generated and then used to fetch contextually relevant data. We will use openly available models and libraries to showcase this workflow. These approaches are very similar to what you would use in real-world RAG projects. If you go with AWS services such as Amazon Bedrock, there are managed components that serve the same purpose.


  1. Generating Embeddings

Here, I have used sentence-transformers to generate the embeddings. Sentence Transformers are lightweight models that convert sentences or paragraphs into dense vector embeddings, enabling semantic search and similarity comparisons for tasks like RAG.
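
Below is a minimal sketch of this step using the sentence-transformers library. The tiny document list and the all-MiniLM-L6-v2 model name are my illustrative choices; any sentence-transformers model and your own document chunks would work the same way.

```python
# Generate dense vector embeddings for a small corpus with sentence-transformers.
from sentence_transformers import SentenceTransformer

# A tiny illustrative corpus; in a real RAG system these would be chunks of your source documents.
documents = [
    "The Eiffel Tower is located in Paris, France.",
    "Python is a popular programming language for machine learning.",
    "FAISS is a library for efficient similarity search over dense vectors.",
    "The Great Wall of China stretches across northern China.",
]

# Load a lightweight pre-trained sentence-embedding model.
model = SentenceTransformer("all-MiniLM-L6-v2")

# Encode the corpus into dense vectors (one 384-dimensional vector per document for this model).
corpus_embeddings = model.encode(documents, convert_to_numpy=True)
print(corpus_embeddings.shape)  # e.g. (4, 384)
```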


  2. Storing Embeddings in a Vector Database (FAISS)

Now, we store these embeddings in our vector database. I have used FAISS (Facebook AI Similarity Search), an open-source library. If you use AWS, you can use Amazon OpenSearch as the vector database, and there are other options for this purpose as well. Remember, one of the core mechanisms used to find semantic similarity is cosine similarity.
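
Here is a minimal sketch of this step, assuming the corpus_embeddings array from the previous snippet. The vectors are L2-normalized first so that FAISS's inner-product index effectively ranks results by cosine similarity.

```python
# Store the document embeddings in a FAISS index.
import faiss
import numpy as np

# FAISS expects float32 vectors; normalize them so inner product == cosine similarity.
embeddings = np.asarray(corpus_embeddings, dtype="float32")
faiss.normalize_L2(embeddings)

# Build a flat (exact) inner-product index and add the document vectors to it.
index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(embeddings)
print(index.ntotal)  # number of vectors stored in the index
```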



  3. Performing Semantic Search

It’s time to perform the search; let’s see the magic in action!

In the screenshots below, we pass two different input queries to our RAG system. You’ll notice that for both queries, the system retrieves contextually relevant and meaningful results, demonstrating the power of semantic search using embeddings and a vector database. I have fetched the top-3 results; you can adjust this to return more or fewer.
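
The screenshots show the actual outputs, so the snippet below is only a rough sketch of the search step itself, assuming the model, documents, and index objects from the earlier snippets; the query string is just an example.

```python
# Semantic search: embed the query and retrieve the most similar documents from FAISS.
import faiss

query = "Which library helps with vector similarity search?"

# Embed and normalize the query exactly as we did for the corpus.
query_embedding = model.encode([query], convert_to_numpy=True).astype("float32")
faiss.normalize_L2(query_embedding)

# Retrieve the top-3 most similar documents (change k to get more or fewer results).
k = 3
scores, indices = index.search(query_embedding, k)

for score, idx in zip(scores[0], indices[0]):
    print(f"{score:.3f}  {documents[idx]}")
```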





Once the contextually relevant data is retrieved, you can append it to a carefully designed prompt and pass it to the LLM. The model then generates the desired output based on the user’s query. This process, combining retrieval of relevant context with prompt-guided generation, forms the complete RAG workflow.
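
As a rough illustration of that final step, here is one way the retrieved documents could be stitched into a prompt. The template wording is my own, and the resulting string would then be sent to whichever LLM or API you are using.

```python
# Assemble a RAG prompt from the top-k documents retrieved above.
retrieved_docs = [documents[i] for i in indices[0]]

prompt = (
    "Answer the question using only the context below.\n\n"
    "Context:\n"
    + "\n".join(f"- {doc}" for doc in retrieved_docs)
    + f"\n\nQuestion: {query}\nAnswer:"
)
print(prompt)  # this prompt is then passed to the LLM of your choice
```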
