Retrieval Augmented Generation with Hugging Face Models in LangChain

usmanmalik57

In my previous articles, I explained how to develop customized chatbots using the Retrieval Augmented Generation (RAG) approach in LangChain. However, I used proprietary models such as those from OpenAI, which can be expensive when you try to scale.

In this article, I will show you how to use free, open-source models from Hugging Face to develop chatbot applications in LangChain. By the end of this tutorial, you will be able to import any Hugging Face Large Language Model (LLM) and embedding model into LangChain and develop your customized chatbot applications.

Importing and Installing Required Libraries

First, install and import the libraries and modules you will need to run the code in this tutorial.

The code in this tutorial was run on Google Colab, where some of the libraries are preinstalled. You can install the rest of the libraries via the following pip commands.

!pip install -q -U transformers==4.38.0
!pip install -q -U sentence-transformers
!pip install -q -U faiss-cpu
!pip install -q -U bitsandbytes==0.42.0
!pip install -q -U accelerate==0.27.1
!pip install -q -U huggingface_hub
!pip install -q -U langchain
!pip install -q -U pypdf

The script below imports the required libraries in your application.

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, logging, pipeline
from langchain_community.llms.huggingface_pipeline import HuggingFacePipeline
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.vectorstores import FAISS
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_core.prompts import ChatPromptTemplate, PromptTemplate
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.chains import create_retrieval_chain
from sentence_transformers import SentenceTransformer
import torch

Importing a Hugging Face LLM in LangChain

A RAG application requires two models: a large language model (LLM) for generating responses and an embedding model for converting documents into numeric representations.

Let's first see how you can import and use an open-source LLM from Hugging Face in LangChain.

The following script defines the quantization settings that reduce the LLM's weights to 4 bits each. This setting reduces the memory required to load and run very large LLMs.

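# Ignore warnings
logging.set_verbosity(logging.CRITICAL)

# Assumed completion: the arguments were cut off; only 4-bit loading is stated in the text
bnb_config = BitsAndBytesConfig(load_in_4bit=True)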
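
To get a feel for why 4-bit weights matter, here is a rough back-of-the-envelope sketch. It assumes a model with about 7 billion weights and counts only the weights themselves, ignoring activations, the KV cache, and any quantization overhead, so treat the numbers as ballpark figures. The function name `weight_memory_gib` and the constant `ASSUMED_PARAMS` are made up for this illustration.

```python
def weight_memory_gib(num_params: float, bits_per_weight: int) -> float:
    """Approximate size of the model weights in GiB."""
    total_bytes = num_params * bits_per_weight / 8
    return total_bytes / (1024 ** 3)


# Assumption: a "7B" model has roughly 7 billion weights.
ASSUMED_PARAMS = 7e9

for bits in (32, 16, 4):
    size = weight_memory_gib(ASSUMED_PARAMS, bits)
    print(f"{bits:>2}-bit weights: {size:.1f} GiB")
```

Going from 16-bit to 4-bit weights cuts the weight memory to a quarter, which is often the difference between fitting on a single Colab GPU and not fitting at all.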

Next, you need to import the model and the corresponding tokenizer from the Hugging Face transformers library.
In the following script, we import the mistralai/Mistral-7B-Instruct-v0.2 model and its tokenizer. You can use any other LLM from Hugging Face if you want.

model_id = "mistralai/Mistral-7B-Instruct-v0.2"
device = "cuda" # the device to load the model onto
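# Assumed completion: the remaining arguments were cut off; this passes the 4-bit config defined above
model = AutoModelForCausalLM.from_pretrained(model_id,
                                             quantization_config=bnb_config)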
tokenizer = AutoTokenizer.from_pretrained(model_id)

The next step is to create a transformers text-generation pipeline using the model and tokenizer you just imported. Then, pass that pipeline to the LangChain HuggingFacePipeline class to create an LLM object.

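# Assumed completion: the remaining arguments were cut off; max_new_tokens is an illustrative value
pipe = pipeline("text-generation",
                model=model,
                tokenizer=tokenizer,
                max_new_tokens=512)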

hf = HuggingFacePipeline(pipeline=pipe)

The HuggingFacePipeline object works like any other LLM in LangChain and allows you to generate responses, as shown in the script below.

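# Assumed completion: the rest of the template was cut off; only the {Question} placeholder is implied by the invoke() call below
template = """You are an expert baking chef. Answer the following question.

Question: {Question}

Answer:"""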

prompt = PromptTemplate.from_template(template)

chain = prompt | hf

question = "How to bake a pizza?"
print(chain.invoke({"Question": question}))
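
The key passed to invoke() has to match the placeholder name in the template exactly, including case. The sketch below mimics that substitution with plain Python string formatting. It is only an illustration of the idea: the template text and the `fill` helper are hypothetical, and LangChain performs its own validation with its own error messages.

```python
# Hypothetical template with a single placeholder named "Question".
template = "You are a helpful assistant.\nQuestion: {Question}\nAnswer:"


def fill(template: str, variables: dict) -> str:
    """Substitute placeholders in the template with the given variables."""
    return template.format(**variables)


filled = fill(template, {"Question": "How to bake a pizza?"})
print(filled)

# A key with the wrong case does not match the placeholder.
try:
    fill(template, {"question": "How to bake a pizza?"})
except KeyError as err:
    print("Missing variable:", err)
```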



The next step is to generate embeddings using Hugging Face embedding models in LangChain.

Generating Hugging Face Model Embeddings in LangChain

To generate embeddings, you must first load the document and split it into chunks.

You can import a PDF document using the LangChain PyPDFLoader class and split the loaded document via its load_and_split() method.

loader = PyPDFLoader("")
pages = loader.load_and_split()

The load_and_split() method splits a PDF document into pages. However, you need to create smaller chunks of your document. To do so, you can use the RecursiveCharacterTextSplitter class from LangChain.

The following script creates text chunks of 1000 characters with an overlap of 200 characters between chunks.

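splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,    # maximum number of characters per chunk
    chunk_overlap=200,  # characters shared between consecutive chunks
)

all_text_chunks = []  # To store chunks from all documents
for doc in pages:
    text_content = doc.page_content
    text_chunks = splitter.split_text(text_content)
    all_text_chunks.extend(text_chunks)  # collect the chunks from this page

print("Total chunks:", len(all_text_chunks))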


Total chunks: 77
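
To make the idea of overlapping chunks concrete, here is a simplified sketch that slides a fixed-size window over a string. It is not how RecursiveCharacterTextSplitter works internally (that class prefers to break on separators such as paragraphs and sentences), and the function name `chunk_with_overlap` is made up for this example.

```python
def chunk_with_overlap(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into fixed-size windows that share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window has reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_with_overlap(sample, chunk_size=10, overlap=4)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last four characters of the previous one. The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk.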

The next step is to create embeddings. To do so, you can use any embedding model from Hugging Face. Pass the embedding model's path to the LangChain HuggingFaceEmbeddings class. You can then use any vector store index such as FAISS to store embedded chunks.

model_path = "thenlper/gte-large"
embeddings = HuggingFaceEmbeddings(
    model_name=model_path
)

embedding_vectors = FAISS.from_texts(all_text_chunks, embeddings)
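
Conceptually, a vector store keeps one vector per chunk and, at query time, returns the chunks whose vectors lie closest to the query vector. The toy sketch below does this with hand-written three-dimensional vectors and squared Euclidean distance. The vectors, texts, and function names are invented for illustration; real embeddings have far more dimensions, and FAISS uses optimized index structures rather than a plain sort. I believe the LangChain FAISS wrapper defaults to Euclidean distance, but treat that as an assumption.

```python
def squared_l2(a: list[float], b: list[float]) -> float:
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k(query_vec: list[float], index: list[tuple[str, list[float]]], k: int = 2) -> list[str]:
    """Return the texts of the k vectors closest to the query vector."""
    ranked = sorted(index, key=lambda item: squared_l2(query_vec, item[1]))
    return [text for text, _ in ranked[:k]]


# Hypothetical chunks with made-up embedding vectors.
toy_index = [
    ("chunk about taxes", [0.9, 0.1, 0.0]),
    ("chunk about healthcare", [0.1, 0.9, 0.0]),
    ("chunk about education", [0.0, 0.2, 0.9]),
]

query_vec = [0.8, 0.2, 0.1]  # pretend this is the embedded user question
nearest = top_k(query_vec, toy_index, k=2)
print(nearest)
```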

We have now imported both the LLM and the embedding model from Hugging Face into LangChain. With these models, we can create a RAG application.

RAG Using Open Source LLM and Embeddings from Hugging Face

The first step in a simple RAG application is to define the prompt that receives user input and the context from embedded documents.

The following script defines our sample prompt. This prompt receives the user input and the context from the vector store index containing embedded documents.

The script also calls create_stuff_documents_chain to build a chain that inserts the retrieved documents into the prompt and runs it on the LLM.

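# Assumed completion: the end of the prompt was cut off; the <end> marker is inferred from the response-splitting step described later
prompt = ChatPromptTemplate.from_template("""Answer the following question based only on the provided context:

Question: {input}

Context: {context}

<end>
""")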

document_chain = create_stuff_documents_chain(hf, prompt)
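
"Stuffing" simply means joining the retrieved chunks into one block of text and placing it in the {context} slot of the prompt. The plain-Python sketch below shows that idea without LangChain. The names `PROMPT`, `stuff_documents`, and `build_prompt` are hypothetical, and the double-newline separator is my understanding of the default rather than something stated in this article.

```python
PROMPT = (
    "Answer the following question based only on the provided context:\n\n"
    "Question: {input}\n\n"
    "Context: {context}\n"
)


def stuff_documents(docs: list[str], separator: str = "\n\n") -> str:
    """Join all retrieved chunks into a single context string."""
    return separator.join(docs)


def build_prompt(question: str, docs: list[str]) -> str:
    """Fill the prompt with the question and the stuffed context."""
    return PROMPT.format(input=question, context=stuff_documents(docs))


retrieved = ["First retrieved chunk.", "Second retrieved chunk."]
final_prompt = build_prompt("What does the document say?", retrieved)
print(final_prompt)
```

Because every retrieved chunk ends up in a single prompt, the number and size of chunks must fit within the model's context window.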

The next step is to create a retriever from the vector store object and pass the retriever and the document_chain to the create_retrieval_chain function. In the following script, create_retrieval_chain returns a chain that fetches the chunks most relevant to the user input and supplies them to the prompt as context.

retriever = embedding_vectors.as_retriever()
retrieval_chain = create_retrieval_chain(retriever, document_chain)
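
The resulting chain takes a dictionary with an "input" key and returns a dictionary that also holds the retrieved "context" and the generated "answer". The sketch below imitates that flow with stand-in functions so the data flow is easy to follow. Everything in it is hypothetical: the keyword-matching retriever, the fake document chain, and the `make_retrieval_chain` helper are not LangChain code.

```python
DOCS = [
    "Both parties support infrastructure spending.",
    "The weather was mild in spring.",
]


def keyword_retriever(query: str) -> list[str]:
    """Stand-in retriever: return documents sharing at least one word with the query."""
    words = set(query.lower().split())
    return [d for d in DOCS if words & set(d.lower().rstrip(".").split())]


def fake_document_chain(inputs: dict) -> str:
    """Stand-in for the LLM call: report how many documents it received."""
    return f"(answer based on {len(inputs['context'])} document(s))"


def make_retrieval_chain(retriever, document_chain):
    """Wire a retriever and a document chain into one callable."""
    def invoke(inputs: dict) -> dict:
        docs = retriever(inputs["input"])
        answer = document_chain({"input": inputs["input"], "context": docs})
        return {"input": inputs["input"], "context": docs, "answer": answer}
    return invoke


toy_chain = make_retrieval_chain(keyword_retriever, fake_document_chain)
toy_result = toy_chain({"input": "Which parties support infrastructure?"})
print(toy_result["answer"])
```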

Finally, you can call the invoke() method on the retrieval chain to get customized responses based on the document embeddings. The answer contains the prompt concatenated with the model's reply. Therefore, I split the answer on the <end> token and return only the model's reply.

def generate_response(query):
    response = retrieval_chain.invoke({"input": query})
    # Keep only the text after the last <end> marker; if the marker is absent, the full answer is returned
    return response["answer"].rpartition("<end>")[2].strip()

query = "What are the three points where Republicans and Democrats agree?"
result = generate_response(query)
print(result)
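
Splitting on a marker is easy to try on a plain string. The sketch below uses Python's str.rpartition to keep only the text after the last marker. The sample strings and the helper name `strip_prompt_echo` are made up for this example.

```python
def strip_prompt_echo(full_text: str, marker: str = "<end>") -> str:
    """Return the text after the last marker, or the whole text if the marker is absent."""
    return full_text.rpartition(marker)[2].strip()


# Hypothetical model output that echoes the prompt before the reply.
echoed = "Question: What is RAG?\nContext: ...\n<end>\nRAG combines retrieval with generation."
print(strip_prompt_echo(echoed))

# Without the marker, the text is returned unchanged apart from surrounding whitespace.
print(strip_prompt_echo("No marker here."))
```

Using rpartition rather than split means the function still behaves sensibly when the marker is missing or appears more than once.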





Creating a RAG application involves a large language model and an embedding model. Although proprietary models may achieve higher accuracy, they can be expensive at scale.
In this article, you saw how to use free, open-source models from Hugging Face to create a simple RAG application. I encourage you to develop your own RAG application using these models and share what you build.
