Imagine writing a piece of software that could understand, assist, and even generate code, similar to how a seasoned developer would.
Well, that’s possible with LangChain. By combining vector stores, the Conversational Retrieval Chain, and LLMs, LangChain takes us to a new level of code understanding and generation.
In this guide, we will reverse engineer Reddit’s public source code repository for version 1 of the site to better understand the codebase and provide insights into its inner workings. I was inspired to create this guide after reading Paul Graham’s tweet on the subject (and because I don’t know anything about Lisp, but still wanted to understand what he was talking about).
We’ll use OpenAI’s embedding technology and a tool called Activeloop to make the code understandable and an LLM (GPT-4 in this case) to converse with the code. If you’re interested in using another LLM or a different platform, check out my previous guide on reverse-engineering Twitter’s algorithm using DeepInfra and Dolly.
When we’re done, we’ll be able to shortcut the difficult work of understanding the codebase by asking an AI to answer our most pressing questions rather than spending weeks sifting through the code ourselves. Let’s begin.
A Conceptual Outline for Code Understanding With LangChain
LangChain is a framework that can be used to analyze code repositories, such as those hosted on GitHub. It brings together three important parts: a VectorStore, the Conversational Retrieval Chain, and an LLM (large language model). Together they help you understand code, answer questions about it in context, and even generate new code.
The Conversational Retrieval Chain (ConversationalRetrievalChain in LangChain) finds and retrieves useful information from a VectorStore. What sets it apart from a plain lookup is that it takes the history of the conversation into account. It first uses the LLM to rephrase your follow-up question, together with the chat history, as a standalone question. It then retrieves the code snippets most similar to that question and passes them to the LLM to write the answer. In simpler terms, it’s like having a smart assistant that remembers what you’ve already asked and answers based on the relevant parts of the code.
Now, let’s look into the LangChain workflow and see how it works at a high level:
- Index the code base: The first step is to clone the target repository you want to analyze. Load all the files within the repository, break them into smaller chunks, and initiate the indexing process. If you already have an indexed dataset, you can even skip this step.
- Embedding and Code Store: Each chunk is passed through an embedding model (OpenAI’s embeddings in this guide), which turns it into a vector that captures its meaning. The embedded snippets are stored in a VectorStore, making them readily searchable for future queries.
- Query Understanding: This is where your LLM comes into play. A model like GPT-4 interprets your question in the context of the conversation so far, so that the retrieval step searches for what you actually mean.
- Construct the Retriever: The retriever searches the VectorStore for the code snippets most relevant to your query. The search is configurable: you can choose the distance metric, the number of snippets to return, and any filters specific to your needs.
- Build the Conversational Chain: Once the retriever is set up, connect it to the LLM in a Conversational Retrieval Chain. The chain hands the retrieved snippets and the conversation history to the model, so this is also where you choose which model answers your questions.
- Ask questions: Now comes the exciting part! You can ask questions about the codebase using the Conversational Retrieval Chain. It will generate comprehensive and context-aware answers for you. Your LLM, being part of the Conversational Chain, takes into account the retrieved code snippets and the conversation history to provide you with detailed and accurate answers.
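To make the "embed, store, retrieve" idea concrete, here is a minimal, dependency-free sketch. It is a toy under stated assumptions, not how OpenAI embeddings or Deep Lake work internally: the `embed` function is a simple bag-of-words counter standing in for a real embedding model, and the Lisp-like chunks are made up for illustration.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy 'embedding': a bag-of-words count (a stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9_]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query, chunks, k=2):
    """Return the k chunks most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, embed(c)), reverse=True)
    return ranked[:k]


# Made-up code chunks, for illustration only.
chunks = [
    "(defun hot-score (votes age) ...)",
    "(defun login (user password) ...)",
    "(defun render-page (links) ...)",
]
print(top_k("how does login check the password", chunks, k=1))
```

A real vector store does the same ranking, but over dense vectors produced by an embedding model, which lets it match on meaning rather than on shared words.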
By following this workflow, you’ll be able to effectively use LangChain to gain a deeper understanding of code, get context-aware answers to your questions, and even generate code snippets within GitHub repositories. Now, let’s see it in action, step by step.
Step-By-Step Guide
Let’s dive into the actual implementation.
1. Acquiring the Keys
To get started, you’ll need to register at the respective websites and obtain the API keys for Activeloop and OpenAI.
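Before running anything, it can help to confirm that both keys are actually visible to your script. The sketch below assumes the keys are exposed under the environment variable names `OPENAI_API_KEY` and `ACTIVELOOP_TOKEN`, which are the names these libraries conventionally read; `missing_keys` is a hypothetical helper, not part of either library.

```python
import os

# Assumed variable names; adjust if your setup uses different ones.
REQUIRED_KEYS = ("OPENAI_API_KEY", "ACTIVELOOP_TOKEN")


def missing_keys(env=None):
    """Return the required key names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_KEYS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_keys()
    if missing:
        print("Missing keys:", ", ".join(missing))
    else:
        print("All keys found.")
```

If you keep the keys in a `.env` file, call `load_dotenv()` first so they are loaded into the environment before this check runs.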
2. Setting Up the indexer.py File
Create a Python file, e.g., indexer.py, where you’ll index the data. Import the necessary modules and set the API keys as environment variables.
    import os

    from dotenv import load_dotenv
    from langchain.document_loaders import TextLoader
    from langchain.embeddings.openai import OpenAIEmbeddings
    from langchain.vectorstores import DeepLake

    # Load environment variables from .env
    load_dotenv()

    embeddings = OpenAIEmbeddings(disallowed_special=())
3. Cloning and Indexing the Target Repository
Next, we’ll clone the Reddit 1.0 source repository, then load, split, and index the documents. You can clone the repository from this link.
    root_dir = './reddit1.0-master'

    docs = []
    for dirpath, dirnames, filenames in os.walk(root_dir):
        for file in filenames:
            try:
                loader = TextLoader(os.path.join(dirpath, file), encoding='utf-8')
                docs.extend(loader.load_and_split())
            except Exception:
                # Skip files that can't be loaded as UTF-8 text (e.g., binaries)
                pass
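The loop above silently skips any file it cannot read, which makes it hard to tell how much of the repository actually got indexed. As a rough sketch using only the standard library (so it does not depend on LangChain), the hypothetical helper `read_text_files` below does the same walk but also reports which files were skipped.

```python
import os


def read_text_files(root_dir):
    """Walk root_dir and return ({path: text}, [skipped paths])."""
    loaded, skipped = {}, []
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, encoding="utf-8") as handle:
                    loaded[path] = handle.read()
            except (UnicodeDecodeError, OSError):
                # Not valid UTF-8 text, or not readable at all
                skipped.append(path)
    return loaded, skipped
```

Printing the `skipped` list once before indexing is a cheap way to confirm that only binaries and other non-text files were left out.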
4. Embedding Code Snippets
Next, we use OpenAI embeddings to embed the code snippets. These embeddings are then stored in a VectorStore, which will allow us to perform an efficient similarity search.
    from langchain.text_splitter import CharacterTextSplitter

    text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
    texts = text_splitter.split_documents(docs)

    username = "mikelabs"  # replace with your username from app.activeloop.ai
    db = DeepLake(dataset_path=f"hub://{username}/reddit-source", embedding_function=embeddings)  # dataset would be publicly available
    db.add_documents(texts)
    print("done")
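To see what `chunk_size` and `chunk_overlap` control, here is a simplified sketch of fixed-size chunking. It is an approximation for illustration only: the real CharacterTextSplitter splits on a separator and merges the pieces up to the size limit rather than slicing at exact character offsets, and `chunk_text` is a hypothetical helper.

```python
def chunk_text(text, chunk_size, chunk_overlap=0):
    """Slice text into windows of chunk_size that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
        start += step
    return chunks


print(chunk_text("abcdefghij", chunk_size=4))                   # no overlap
print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1))  # 1 shared char
```

With an overlap of 0, as in this guide, a function that straddles a chunk boundary is cut in two. A small overlap trades some extra storage for a better chance that related lines stay together.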
5. Utilizing GPT-4 to Process and Understand User Queries
Now we set up another Python file, question.py, to use GPT-4, a language model available from OpenAI, to process and understand user queries.
6. Constructing the Retriever
We construct a retriever using the VectorStore we created earlier.
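    from dotenv import load_dotenv
    from langchain.embeddings.openai import OpenAIEmbeddings
    from langchain.vectorstores import DeepLake

    # question.py is a separate script, so it needs its own setup
    load_dotenv()
    embeddings = OpenAIEmbeddings(disallowed_special=())

    db = DeepLake(dataset_path="hub://mikelabs/reddit-source", read_only=True, embedding_function=embeddings)  # use your username

    retriever = db.as_retriever()
    retriever.search_kwargs['distance_metric'] = 'cos'
    retriever.search_kwargs['fetch_k'] = 100
    retriever.search_kwargs['maximal_marginal_relevance'] = True
    retriever.search_kwargs['k'] = 10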
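The settings above fetch a larger pool of candidates (`fetch_k`) and then keep a smaller, more varied subset (`k`) using maximal marginal relevance (MMR). The sketch below shows the standard MMR selection rule on tiny hand-made vectors. It is an illustration of the idea, not Deep Lake's implementation, and `mmr`, `cosine`, and the example vectors are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr(query, candidates, k, lambda_mult=0.5):
    """Pick k candidate indices, trading relevance against redundancy.

    lambda_mult=1.0 ranks purely by relevance to the query;
    lower values penalize candidates similar to ones already picked.
    """
    selected = []
    remaining = list(range(len(candidates)))

    def score(i):
        relevance = cosine(query, candidates[i])
        redundancy = max(
            (cosine(candidates[i], candidates[j]) for j in selected),
            default=0.0,
        )
        return lambda_mult * relevance - (1 - lambda_mult) * redundancy

    while remaining and len(selected) < k:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


query = [1.0, 0.0]
candidates = [[1.0, 0.0], [0.99, 0.1], [0.6, 0.8]]  # the first two are near-duplicates

print(mmr(query, candidates, k=2, lambda_mult=1.0))  # relevance only
print(mmr(query, candidates, k=2, lambda_mult=0.3))  # favors diversity
```

For a codebase, this matters because many chunks look alike (boilerplate, repeated imports). MMR keeps the retriever from filling all ten slots with near-identical snippets.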
7. Building the Conversational Chain
The Conversational Retrieval Chain links the retriever and the language model. This enables our system to process user queries and generate context-aware responses.
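    from langchain.chains import ConversationalRetrievalChain
    from langchain.chat_models import ChatOpenAI

    model = ChatOpenAI(model_name='gpt-4')  # switch to gpt-3.5-turbo if you want
    qa = ConversationalRetrievalChain.from_llm(model, retriever=retriever)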
8. Asking Questions
We can now ask questions about the Reddit source code. The answers provided by the Conversational Retrieval Chain are context-aware and directly based on the codebase.
    questions = [
        "Explain what each file does, at a high level",
        "Explain how the files relate to each other",
        # add your own questions here
    ]
    chat_history = []

    for question in questions:
        result = qa({"question": question, "chat_history": chat_history})
        chat_history.append((question, result['answer']))
        print(f"-> **Question**: {question} \n")
        print(f"**Answer**: {result['answer']} \n")
Here were some of the responses I got:
What will you ask? What will you learn? Let me know!
Limitations
After talking with Shriram Krishnamurthi on Twitter, I realized I should point out that this approach has some limitations for understanding code.
- The analysis may be incomplete, and errors in it can cause you to miss key details or send you down the wrong path.
- There may be some “pollution” of results, which happens when your LLM’s training data contains overlapping terms. For example, the concept of “Reddit Karma” is probably already known to GPT-4 from its training data, so asking how Karma works might lead it to answer from that data rather than from the supplied code.
You will need to exercise good judgment and take a trust-but-verify approach for this initial rough approach. Or maybe you can take things further and build a better system!
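One small way to put trust-but-verify into practice is to check whether the files an answer talks about were actually among the retrieved snippets. The sketch below is a hypothetical helper, not a LangChain feature. It assumes you have the source paths of the retrieved documents on hand (for example, from their metadata) and that the files of interest end in `.lisp` or `.asd`. The file names in the example are made up.

```python
import os
import re


def unsupported_file_mentions(answer, source_paths, extensions=(".lisp", ".asd")):
    """Return file names mentioned in the answer that no retrieved source backs up."""
    known = {os.path.basename(path) for path in source_paths}
    ext_pattern = "|".join(re.escape(ext) for ext in extensions)
    mentioned = set(re.findall(rf"[\w-]+(?:{ext_pattern})\b", answer))
    return sorted(mentioned - known)


# Made-up answer and sources, for illustration only.
answer = "web.lisp defines the handlers and calls into frob.lisp."
sources = ["./reddit1.0-master/web.lisp", "./reddit1.0-master/data.lisp"]
print(unsupported_file_mentions(answer, sources))
```

A non-empty result does not prove the answer is wrong, but it flags claims that may come from the model’s training data rather than from the code you supplied.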
Conclusion
Throughout this guide, we explored reverse engineering Reddit’s public source code repository for version 1 of the site using LangChain. By leveraging AI capabilities, we save valuable time and effort, replacing manual code examination with automated query responses.
LangChain is a powerful tool for code understanding and generation. By combining a VectorStore, the Conversational Retrieval Chain, and an LLM, it lets developers analyze code repositories, get context-aware answers, and generate new code.
LangChain’s workflow involves indexing the codebase, embedding code snippets, processing user queries with a language model, and using the Conversational Retrieval Chain to retrieve relevant code snippets. By customizing the retriever and building the Conversational Chain, developers can fine-tune the retrieval process for more precise results.
By following the step-by-step guide, you can leverage LangChain to enhance your code comprehension, obtain context-aware answers, and even generate code snippets within GitHub repositories. LangChain opens up new possibilities for productivity and understanding. What will you build with it? Thanks for reading!