Natural language queries against tabular data with LangChain and pandas
Introduction
In this tutorial, we will explore how LangChain can be used to retrieve information from a DataFrame for a specific query using the create_pandas_dataframe_agent function.
This script uses pandas and LangChain, together with the OpenAI API, to create a DataFrame agent that can answer natural language questions based on the data in a CSV file.
Step 1: Import Libraries
Let’s begin by importing the necessary libraries.
import pandas as pd
from langchain.agents import create_pandas_dataframe_agent
from langchain.llms import OpenAI
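One note on configuration: the OpenAI wrapper authenticates through the OPENAI_API_KEY environment variable. A minimal way to set it from Python (the key value below is a placeholder, not a real key):

```python
import os

# The OpenAI() wrapper looks up the API key from this environment
# variable. Replace the placeholder with your own key.
os.environ["OPENAI_API_KEY"] = "sk-your-key-here"
```

Alternatively, you can export the variable in your shell before launching the script or notebook.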
Step 2: Create a DataFrame
We will read data from a CSV file to create a DataFrame. In this case we will use a dataset containing information about the Python packages listed on PyPI as of 2019.
df = pd.read_csv("/content/package-manifest.csv")
If you are running this code in a notebook, you can view the first few entries of the DataFrame with the .head() method.
df.head()
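If you don't have the CSV at hand, you can follow along with a small stand-in DataFrame that mimics the manifest's columns. The column names and the first two rows below are taken from the agent output later in this post; the third row is purely illustrative:

```python
import pandas as pd

# A miniature stand-in for package-manifest.csv with the same columns.
df = pd.DataFrame({
    "package_name": ["altair", "seaborn", "numpy"],
    "version": ["3.2.0", "0.9.0", "1.16.0"],
    "summary": [
        "Altair: A declarative statistical visualization library",
        "Statistical data visualization",
        "NumPy is the fundamental package for array computing",
    ],
    "license": ["BSD 3-clause", "BSD 3-Clause", "BSD"],
    "metadata_source": ["PyPI", "Anaconda", "PyPI"],
})
print(df.head())
```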
Step 3: Create DataFrame Agent
Now we will use the LangChain library to create a DataFrame agent which can answer natural language questions based on the CSV data.
agent = create_pandas_dataframe_agent(OpenAI(temperature=0), df, verbose=True)
Here, we have created the agent by passing two arguments:
- First is OpenAI(temperature=0), an instance of the OpenAI LLM wrapper with the temperature set to 0. The temperature controls how much randomness the model introduces when sampling its output; setting it to 0 makes the answers as deterministic as possible, keeping the agent grounded in the data.
- Second is df, the pandas DataFrame instance that we want to query.
Step 4: Ask Questions
Now you are all set to ask questions to the agent.
agent.run("what are the best packages for data visualization?")
Here, we passed the free-text query "what are the best packages for data visualization?"
to the agent, which returned a string listing the packages related to data visualization. Because we set verbose=True, you should also see the agent's reasoning printed in the terminal, like this:
> Entering new AgentExecutor chain...
Thought: I need to find packages that are related to data visualization
Action: python_repl_ast
Action Input: df[df['summary'].str.contains('visualization')]
Observation: Cannot mask with non-boolean array containing NA / NaN values
Thought: I need to filter out the NA/NaN values
Action: python_repl_ast
Action Input: df[df['summary'].str.contains('visualization', na=False)]
Observation: package_name version summary \
62 altair 3.2.0 Altair: A declarative statistical visualizatio...
145 datashader 0.7.0 Data visualization toolchain based on aggregat...
224 graphviz 0.8.4 Open Source graph visualization software.
317 missingno 0.4.2 Missing data visualization module for Python.
425 pyLDAvis 2.1.2 Interactive topic model visualization. Port of...
519 seaborn 0.9.0 Statistical data visualization
604 vida 0.3 Python binding for Vida data visualizations
605 visvis 1.11.2 An object oriented approach to visualization o...
607 vtk 8.1.2 VTK is an open-source toolkit for 3D computer ...
license metadata_source
62 BSD 3-clause PyPI
145 BSD-3-Clause Anaconda
224 EPL v1.0 Anaconda
317 NaN PyPI
425 MIT PyPI
519 BSD 3-Clause Anaconda
604 UNKNOWN PyPI
605 BSD 3-Clause Anaconda
607 BSD PyPI
Thought: I now know the best packages for data visualization
Final Answer: altair, datashader, graphviz, missingno, pyLDAvis, seaborn, vida, visvis, and vtk.
> Finished chain.
altair, datashader, graphviz, missingno, pyLDAvis, seaborn, vida, visvis, and vtk.
Obviously, you can skip the logging (by creating the agent with verbose=False) and just fetch this string:
altair, datashader, graphviz, missingno, pyLDAvis, seaborn, vida, visvis, and vtk.
Behind the scenes
The agent combines an LLMChain with "tools": the LLM reads the user's prompt together with a preview of the DataFrame, decides which pandas code to run, executes it through a Python REPL tool (the python_repl_ast action in the trace above), observes the result, and repeats until it can produce a final answer. I might cover in detail how an agent works in the context of LLMs in a future post; in the meantime you can check out the LangChain documentation.
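As an illustration, the two pandas calls the agent tried in the trace above can be reproduced directly. The first attempt fails on rows whose summary is missing, which is exactly why the agent adds na=False on its second try. The sample DataFrame here is a hypothetical stand-in for the full manifest:

```python
import pandas as pd

df = pd.DataFrame({
    "package_name": ["seaborn", "numpy", "mystery-pkg"],
    "summary": [
        "Statistical data visualization",
        "Fundamental package for array computing",
        None,  # a missing summary, as occurs in the real manifest
    ],
})

# First attempt: str.contains returns NaN for the missing summary,
# so boolean masking raises the "Cannot mask with non-boolean array
# containing NA / NaN values" error the agent observed.
try:
    df[df["summary"].str.contains("visualization")]
except ValueError as e:
    print("Error:", e)

# Second attempt: na=False treats missing summaries as non-matches.
matches = df[df["summary"].str.contains("visualization", na=False)]
print(matches["package_name"].tolist())  # → ['seaborn']
```

This is the whole trick: the agent is not searching embeddings, it is writing and running ordinary pandas code against your DataFrame.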
Conclusion
In this tutorial we have seen how to use LangChain to extract information from a pandas DataFrame with natural language queries, in a few lines of Python code and with an OpenAI API key.