Using RAG to create an AI expert about your company

Have you ever thought, “I’ve answered this question so many times!” or “This information is available online—why are they asking me?” If not, congratulations—you’re one of the lucky ones! But for most of us, these thoughts have crossed our minds at some point. Wouldn’t it be great if there were a tool to handle these repetitive questions for you? Well, I have good news: the technology to make this happen already exists. All you need to do is piece it together. In this tutorial, I’ll guide you through exactly how to do that using GenAI and RAG.

What is RAG?

Retrieval-Augmented Generation (RAG) is a technique that enhances generative AI by incorporating an external retrieval mechanism. Instead of relying solely on pre-trained knowledge, RAG systems first retrieve relevant external documents or data to provide context. This retrieved information is then used to generate accurate, contextually enriched responses. The retrieval step ensures that the AI can adapt to specific, real-time knowledge requirements, making it particularly effective for specialized domains, dynamic content, or situations where the underlying knowledge evolves frequently.
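As a rough illustration of these two phases, here is a minimal, self-contained sketch in plain Python. The documents and the word-overlap "retriever" are toy placeholders for this example only; the tutorial below uses embeddings, FAISS, and GPT instead.

# Toy illustration of RAG: retrieve the most relevant document, then build a grounded prompt.
# The documents and the scoring function are placeholders; real systems use vector embeddings.
knowledge_base = [
    "Our product API uses OAuth2 tokens that expire after 60 minutes.",
    "Support tickets are synchronized between partner systems every 5 minutes.",
]

def retrieve_context(question, k=1):
    # Toy relevance score: number of words shared between question and document.
    words = set(question.lower().split())
    return sorted(knowledge_base,
                  key=lambda doc: len(words & set(doc.lower().split())),
                  reverse=True)[:k]

def build_prompt(question):
    context = "\n".join(retrieve_context(question))
    # In a real RAG system this prompt is sent to an LLM for the generation step.
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

print(build_prompt("How often are tickets synchronized?"))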

Real-life examples of RAG include customer support chatbots that pull information from technical documentation, medical assistants consulting clinical guidelines, and personalized educational tools providing context-driven tutoring.

A chatbot able to answer questions about the Cisco Smart Bonding Partner API

In this tutorial, we will create a chatbot that utilizes Retrieval-Augmented Generation (RAG) to answer questions about Cisco’s Smart Bonding Partner API, pulling information from publicly available documentation. After completing this tutorial, you should be able to adapt it easily so the chatbot answers questions about your company or products.

Our chatbot will operate on a sequence of messages. Alongside the user’s and the assistant’s messages, relevant documents and other resources can be integrated into the sequence through tool messages. This leads us to model the state of our RAG application as a sequence of messages, which will include:

  1. User input as a HumanMessage.
  2. Vector store query as an AIMessage with tool calls.
  3. Retrieved documents as a ToolMessage.
  4. Final response as an AIMessage.

This structure ensures clear and logical interactions within the chatbot.
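For concreteness, here is a hand-written sketch of what one such turn might look like using LangChain’s message classes. The message contents and the tool-call ID are made up for illustration; in practice, the graph we build below produces these messages.

from langchain_core.messages import AIMessage, HumanMessage, ToolMessage

example_turn = [
    HumanMessage("What is Smart Bonding?"),                                # 1. user input
    AIMessage(content="", tool_calls=[{                                    # 2. model asks for a retrieval
        "name": "retrieve",
        "args": {"query": "Smart Bonding overview"},
        "id": "call_1",
    }]),
    ToolMessage("Source: {...}\nContent: ...", tool_call_id="call_1"),     # 3. retrieved document chunks
    AIMessage("Smart Bonding synchronizes partner tickets with Cisco."),   # 4. final grounded answer
]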

What You’ll Learn

  1. How to retrieve relevant documents using LangChain’s tools.
  2. How to split large texts into manageable chunks.
  3. How to index and search documents using FAISS (Facebook AI Similarity Search).
  4. How to integrate OpenAI’s GPT-based models for generating responses.
  5. How to set up a chatbot with RAG capabilities.

By following this step-by-step guide, you’ll understand how RAG works and how to implement it in Python.

Prerequisites

  • Basic knowledge of Python.
  • An OpenAI API key.
  • Python installed on your system (preferably version 3.8 or higher).

Required Libraries

Before starting, install the following Python libraries:

pip install langchain-openai langgraph langchain-community langchain-text-splitters beautifulsoup4 faiss-cpu
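The OpenAI chat and embedding models used in the steps below read your key from the OPENAI_API_KEY environment variable. The complete script at the end of this tutorial prompts for it interactively; you can do the same before running the individual steps:

import getpass
import os

# Prompt for the key only if it isn't already set in the environment.
if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API key: ")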

Let’s Start Coding

Step 1: Identify where to retrieve information

Since we are retrieving our information from the Cisco Smart Bonding Partner API documentation, we start by identifying the URLs from which to extract information. Each URL is paired with a short context description that helps the chatbot provide accurate answers and ensures that retrieved documents align with user queries.

Feel free to use different URLs to retrieve information about your company or products. It should still work. 🙂

urls_with_context = [
    {"url": "https://developer.cisco.com/docs/smart-bonding-partner-api/introduction-to-smart-bonding/", "context": "Introduction to Cisco's Smart Bonding Partner API and its key features."},
    {"url": "https://developer.cisco.com/docs/smart-bonding-partner-api/use-cases/", "context": "Detailed use cases demonstrating how to use the Smart Bonding Partner API in real scenarios."},
    # Add more URLs as needed
]

Step 2: Content Extraction

The WebBaseLoader is used to extract content from the specified URLs. This content will be used as the knowledge base for the chatbot.

from langchain_community.document_loaders import WebBaseLoader

documents = []
for item in urls_with_context:
    try:
        loader = WebBaseLoader(web_paths=[item["url"]])
        docs = loader.load()
        for doc in docs:
            doc.metadata["context"] = item["context"]
            documents.append(doc)
    except Exception as e:
        print(f"Error loading {item['url']}: {e}")

Step 3: Document Splitting

The documents generated in the previous step are divided into smaller chunks using RecursiveCharacterTextSplitter. This step improves the retrieval accuracy and efficiency by working with smaller, more relevant pieces of text.

from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
splits = text_splitter.split_documents(documents)
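Optionally, you can verify the result before moving on. A quick sanity check (assuming the loading step above succeeded) prints how many chunks were produced and previews the first one:

# Inspect the split output: chunk count, a content preview, and the metadata we attached.
print(f"Loaded {len(documents)} documents, split into {len(splits)} chunks")
if splits:
    print(splits[0].page_content[:200])  # first 200 characters of the first chunk
    print(splits[0].metadata)            # includes the source URL and our custom 'context'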

Step 4: Enable Fast Similarity Searches

FAISS (Facebook AI Similarity Search) is used to index and search documents efficiently. It converts document chunks into dense vectors using OpenAI embeddings and enables fast similarity searches.

from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

vectorstore = FAISS.from_documents(splits, embedding=OpenAIEmbeddings())
retriever = vectorstore.as_retriever()
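Before wiring the vector store into the chatbot, it can be helpful to confirm that similarity search returns sensible chunks. A minimal check might look like this:

# Run a test query directly against the FAISS index.
results = vectorstore.similarity_search("What is Smart Bonding?", k=2)
for doc in results:
    print(doc.metadata.get("source"), "->", doc.page_content[:120])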

Step 5: Define Retrieval Tool

The tool abstraction in LangChain associates a Python function with a schema that defines the function’s name, description, and expected arguments. Tools can then be passed to chat models that support tool calling, allowing the model to request the execution of a specific function with specific inputs.

The retrieve function will be used by the chatbot to search for and return relevant document chunks based on user queries.

from langchain_core.tools import tool

@tool(response_format="content_and_artifact")
def retrieve(query: str):
    """Retrieve information related to a query."""
    retrieved_docs = vectorstore.similarity_search(query, k=2)
    serialized = "\n\n".join(
        (f"Source: {doc.metadata}\n" f"Content: {doc.page_content}")
        for doc in retrieved_docs
    )
    return serialized, retrieved_docs
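Although the chat model will normally decide when to call this tool, you can also invoke it directly to see the serialized text it returns. A quick check, assuming the vector store from Step 4 is in scope:

# Call the tool by hand; invoked directly, it returns the serialized content string.
print(retrieve.invoke({"query": "What is Smart Bonding?"}))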

Step 6: Create a LangGraph Flow

In this tutorial we are using LangGraph. LangGraph is a library built on top of LangChain for orchestrating LLM applications as stateful graphs: each node is a step in the workflow, edges define how control flows between steps, and a shared state (here, the sequence of messages) is passed from node to node.

LangGraph is configured to define the flow of the chatbot using a graph. Our graph will consist of three nodes:

  • A node that fields the user input, either generating a query for the retriever or responding directly;
  • A node for the retriever tool that executes the retrieval step (search for and return relevant document chunks based on user input);
  • A node that generates the final response using the retrieved context.

from langgraph.graph import MessagesState, StateGraph, END  # Classes for managing conversational state and graph structure.
from langgraph.prebuilt import ToolNode, tools_condition  # Prebuilt tool node and routing condition used below.
from langgraph.checkpoint.memory import MemorySaver  # In-memory checkpointer that persists conversation state between turns.
from langchain_core.messages import SystemMessage  # SystemMessage structures the prompt for the AI model.
from langchain_openai import ChatOpenAI  # Chat model wrapper for the OpenAI API.

llm = ChatOpenAI(model="gpt-4o-mini")  # Initialize the language model. You need a valid OpenAI API key.

# Define a function to handle user input and determine whether to retrieve documents or generate a response.
def query_or_respond(state: MessagesState):
    """
    Processes the current state of the conversation and decides whether to retrieve documents
    or generate a response directly using the language model.
    """
    llm_with_tools = llm.bind_tools([retrieve])  # Bind the retrieval tool to the language model.
    response = llm_with_tools.invoke(state["messages"])  # Invoke the model with the current message sequence.
    return {"messages": [response]}  # Return the updated message sequence with the response.

# Define a function to generate a final response using retrieved documents.
def generate(state: MessagesState):
    """
    Creates the final AI response by using the retrieved documents as context.
    """
    # Extract the most recent run of tool messages (the documents retrieved for this turn).
    recent_tool_messages = []
    for message in reversed(state["messages"]):
        if message.type != "tool":
            break
        recent_tool_messages.append(message)
    recent_tool_messages.reverse()

    # Combine the content of the retrieved documents into a single string.
    docs_content = "\n\n".join(doc.content for doc in recent_tool_messages)

    # Create a system message to guide the AI model's response generation.
    system_message_content = (
        "You are an expert AI assistant with knowledge of Cisco's Smart Bonding Partner API."
        " Use the following pieces of retrieved context to answer the question."
        " If you don't know the answer, say that you don't know."
        f"\n\n{docs_content}"
    )

    # Construct the full prompt: system guidance plus the prior user-assistant conversation
    # (tool calls and tool results are excluded; the retrieved content is already in the system message).
    conversation_messages = [
        message
        for message in state["messages"]
        if message.type in ("human", "system")
        or (message.type == "ai" and not message.tool_calls)
    ]
    prompt = [SystemMessage(system_message_content)] + conversation_messages

    # Invoke the model with the constructed prompt and return the response.
    return {"messages": [llm.invoke(prompt)]}

# Initialize the state graph to manage the conversation flow.
graph_builder = StateGraph(MessagesState)

graph_builder.add_node(query_or_respond) # Add a node that handles user queries or responses from the assistant
graph_builder.add_node(ToolNode([retrieve])) # Add a node representing the tool used to retrieve relevant documents
graph_builder.add_node(generate) # Add a node responsible for generating the final response

graph_builder.set_entry_point("query_or_respond") # Set the entry point of the graph to the 'query_or_respond' node
# Add conditional edges to determine the next step based on the tool's condition
graph_builder.add_conditional_edges(
    "query_or_respond",
    tools_condition,
    {END: END, "tools": "tools"},
)
graph_builder.add_edge("tools", "generate") # Define a direct transition from the 'tools' node to the 'generate' node
graph_builder.add_edge("generate", END) # Define a direct transition from the 'generate' node to the end of the process

memory = MemorySaver() # Initialize memory storage for tracking intermediate states
graph = graph_builder.compile(checkpointer=memory) # Compile the graph with memory checkpointing enabled

# Define a configuration dictionary with a unique thread ID to track user sessions. In a real-life scenario, this value should not be hardcoded.
config = {"configurable": {"thread_id": "abc123"}}

Step 7: Chatbot Execution

The chatbot will run in an interactive command-line interface (CLI). It processes user inputs, retrieves relevant information, and generates responses dynamically, using the components created in the previous steps.

print("Chatbot ready. Start typing your queries!")
if __name__ == "__main__":
    while True:
        user_input = input("User: ")
        if user_input.lower() in ["exit", "quit", "q"]:
            print("Goodbye!")
            break
        for step in graph.stream(
            {"messages": [{"role": "user", "content": user_input}]}, 
            stream_mode="values",
            config=config,
        ):
            step["messages"][-1].pretty_print()

Complete Script

Below you can find the complete Python script:

import getpass
import os

from langchain_community.document_loaders import WebBaseLoader
from langchain_community.vectorstores import FAISS
from langchain_core.messages import SystemMessage
from langchain_core.tools import tool
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langgraph.checkpoint.memory import MemorySaver
from langgraph.graph import MessagesState, StateGraph, END
from langgraph.prebuilt import ToolNode, tools_condition

# Step 1: Set up OpenAI API key
os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API key: ")

# Step 2: Define URLs and context
urls_with_context = [
    {"url": "https://developer.cisco.com/docs/smart-bonding-partner-api/introduction-to-smart-bonding/", "context": "Introduction to Cisco's Smart Bonding Partner API and its key features."},
    {"url": "https://developer.cisco.com/docs/smart-bonding-partner-api/use-cases/", "context": "Detailed use cases demonstrating how to use the Smart Bonding Partner API in real scenarios."},
    # Add more URLs as needed
]

# Step 3: Load documents
print("Loading documents...")
documents = []
for item in urls_with_context:
    try:
        loader = WebBaseLoader(web_paths=[item["url"]])
        docs = loader.load()
        for doc in docs:
            doc.metadata["context"] = item["context"]
            documents.append(doc)
    except Exception as e:
        print(f"Error loading {item['url']}: {e}")

# Step 4: Split documents into chunks
print("Splitting documents into chunks...")
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
splits = text_splitter.split_documents(documents)

# Step 5: Create FAISS vector store
print("Creating vector store...")
vectorstore = FAISS.from_documents(splits, embedding=OpenAIEmbeddings())
retriever = vectorstore.as_retriever()

# Step 6: Define retrieval tool
@tool(response_format="content_and_artifact")
def retrieve(query: str):
    """Retrieve information related to a query."""
    retrieved_docs = vectorstore.similarity_search(query, k=2)
    serialized = "\n\n".join(
        (f"Source: {doc.metadata}\n" f"Content: {doc.page_content}")
        for doc in retrieved_docs
    )
    return serialized, retrieved_docs

# Step 7: Define LangGraph logic
print("Defining LangGraph flow...")

llm = ChatOpenAI(model="gpt-4o-mini")  # Create the model once and reuse it in both nodes.

def query_or_respond(state: MessagesState):
    llm_with_tools = llm.bind_tools([retrieve])
    response = llm_with_tools.invoke(state["messages"])
    return {"messages": [response]}

def generate(state: MessagesState):
    # Collect the most recent run of tool messages (the documents retrieved for this turn).
    recent_tool_messages = []
    for message in reversed(state["messages"]):
        if message.type != "tool":
            break
        recent_tool_messages.append(message)
    recent_tool_messages.reverse()
    docs_content = "\n\n".join(doc.content for doc in recent_tool_messages)
    system_message_content = (
        "You are an expert AI assistant with knowledge of Cisco's Smart Bonding Partner API."
        " Use the following pieces of retrieved context to answer the question."
        " If you don't know the answer, say that you don't know."
        f"\n\n{docs_content}"
    )
    conversation_messages = [
        message
        for message in state["messages"]
        if message.type in ("human", "system")
        or (message.type == "ai" and not message.tool_calls)
    ]
    prompt = [SystemMessage(system_message_content)] + conversation_messages
    return {"messages": [llm.invoke(prompt)]}

graph_builder = StateGraph(MessagesState)

graph_builder.add_node(query_or_respond) # Add a node that handles user queries or responses from the assistant
graph_builder.add_node(ToolNode([retrieve])) # Add a node representing the tool used to retrieve relevant documents
graph_builder.add_node(generate) # Add a node responsible for generating the final response

graph_builder.set_entry_point("query_or_respond") # Set the entry point of the graph to the 'query_or_respond' node
# Add conditional edges to determine the next step based on the tool's condition
graph_builder.add_conditional_edges(
    "query_or_respond",
    tools_condition,
    {END: END, "tools": "tools"},
)
graph_builder.add_edge("tools", "generate") # Define a direct transition from the 'tools' node to the 'generate' node
graph_builder.add_edge("generate", END) # Define a direct transition from the 'generate' node to the end of the process

memory = MemorySaver() # Initialize memory storage for tracking intermediate states
graph = graph_builder.compile(checkpointer=memory) # Compile the graph with memory checkpointing enabled

# Define a configuration dictionary with a unique thread ID to track user sessions. In a real-life scenario, this value should not be hardcoded.
config = {"configurable": {"thread_id": "abc123"}}

# Step 8: Run the chatbot
print("Chatbot ready. Start typing your queries!")
if __name__ == "__main__":
    while True:
        user_input = input("User: ")
        if user_input.lower() in ["exit", "quit", "q"]:
            print("Goodbye!")
            break
        for step in graph.stream(
            {"messages": [{"role": "user", "content": user_input}]}, 
            stream_mode="values",
            config=config,
        ):
            step["messages"][-1].pretty_print()

Try It Yourself

After running the script, ask the chatbot “what is Smart Bonding?”:

================================== Ai Message ==================================

Smart Bonding is a technology provided by Cisco that facilitates the integration of a partner's ITSM (Information Technology Service Management) system with Cisco's support ticketing system for the purpose of ticket synchronization. It aims to enhance service integration and management, enabling users to have a unified view of their entire support ecosystem. The Smart Bonding solution helps partners address multi-sourcing challenges and improve business outcomes by providing real-time communications and visibility of the service delivery process. There are two primary approaches for integrating Smart Bonding: self-onboarding and using ServiceNow, both of which are offered at no cost to partners.

The chatbot is now able to answer specific questions about the Smart Bonding Partner API. 🙂

You might be thinking: “I can’t share this command-line interface with my customers! It is not user-friendly!” No worries! To see the concepts from this tutorial in action with a more user-friendly UI, I’ve created a working example of the chatbot using Gradio and deployed it on Hugging Face Spaces.

You can try it out here: Cisco DevNet Docs Chatbot Demo.

Not only can you interact with the chatbot to see how it retrieves and generates responses based on the documentation, but you can also check out the source code to understand how the implementation works.

Conclusion

Congratulations! You’ve now built the foundation for an AI expert tailored to your company using Retrieval-Augmented Generation (RAG). This approach combines the intelligence of generative AI with the precision of information retrieval, allowing you to automate answers to company-specific questions while delivering accurate, context-rich responses.

With this powerful tool, you can address repetitive queries, streamline internal support, or even enhance customer interactions—all while ensuring that your AI assistant stays aligned with the most relevant and up-to-date knowledge about your company.

The script provided in this tutorial is just the beginning. You can expand it to include more data sources, refine the retrieval process, or integrate it into a user-facing chatbot application. The possibilities are endless when you equip your business with a custom AI expert powered by RAG.

Now it’s time to take this foundation and make it uniquely yours. Start building, experimenting, and turning your company’s knowledge into a dynamic, intelligent assistant that adds value every step of the way!