Generative AI for Analytics: Performing Natural Language Queries on Amazon RDS using SageMaker, LangChain, and LLMs


Learn to use LangChain’s SQL Database Chain and Agent with large language models to perform natural language queries (NLQ) of an Amazon RDS for PostgreSQL database.


To paraphrase analytics workflow product vendor YellowFin, “Natural language query (NLQ), also known as natural language search, is a self-service business intelligence (BI) reporting capability that enables analytics users to ask questions of their data. It parses for keywords and generates relevant answers sourced from related databases, with results typically delivered as a report, chart or textual explanation that attempt to answer the query, and provide depth of understanding.”

Using LangChain’s SQL Database Chain and SQL Database Agent, we can leverage large language models (LLMs) to ask questions of an Amazon RDS for PostgreSQL database using natural language. Questions will be converted into SQL queries and executed against the database. Assuming the generated SQL query is well-formed, the query results will be converted into a textual explanation. For example, we can ask questions like, “How many customers have purchased in the last 12 months?” or “What were the total sales in May?” These will be converted into SQL SELECT statements, like “SELECT COUNT(DISTINCT customer_id) AS num_customers FROM purchases WHERE purchase_date >= CURRENT_DATE - INTERVAL '12 months';” The answer is then composed into a textual explanation, such as “A total of 384 customers made purchases in the last 12 months.”

In the following post, we will learn to use LangChain’s SQL Database Chain and SQL Database Agent with OpenAI’s text-davinci-003, an LLM instance within OpenAI’s GPT-3 series, to perform NLQ of an Amazon RDS for PostgreSQL database. We will also learn about the importance of using LangChain’s Prompt Template, Query Checker, few-shot prompting, and retrieval-augmented generation (RAG) to improve our results.

Source Code

All the source code for this post’s demonstration is open-source and available on GitHub.

Architecture

The following diagram represents this post’s high-level architecture. We will use a series of Amazon SageMaker notebooks to interact with an LLM and Amazon RDS database. We will call OpenAI’s LLM APIs or Amazon SageMaker real-time inference LLM endpoints, using LangChain, from the notebooks. For added security and to prevent potentially destructive DDL statements from being executed, LangChain will be restricted to read-only access to the Amazon RDS database.
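As an illustration of that restriction, a dedicated read-only PostgreSQL role could be created for LangChain before starting. The following is a minimal sketch only; the role name, password, database name, and connection details are placeholders, not values from this post’s setup.

import psycopg2

# Hypothetical read-only role for LangChain -- all names, passwords,
# and connection details below are placeholders
READ_ONLY_DDL = """
CREATE ROLE nlq_readonly WITH LOGIN PASSWORD 'change-me';
GRANT CONNECT ON DATABASE moma TO nlq_readonly;
GRANT USAGE ON SCHEMA public TO nlq_readonly;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO nlq_readonly;
"""

# Run once as an administrative user
conn = psycopg2.connect(
    host="your-rds-endpoint",
    port=5432,
    dbname="moma",
    user="postgres",
    password="admin-password",
)
with conn, conn.cursor() as cur:
    cur.execute(READ_ONLY_DDL)
conn.close()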


Amazon SageMaker Notebooks

For this post’s demonstration, we will be using Amazon SageMaker Notebooks. According to AWS, “Amazon SageMaker offers two types of fully managed Jupyter Notebooks for data exploration and building ML models: Amazon SageMaker Studio notebooks and Amazon SageMaker notebook instances.” Using Amazon SageMaker notebook instances will allow us to install the latest version of packages, including LangChain, which publishes releases almost daily.

The use of Amazon SageMaker notebook instances is optional. Alternatively, you can use your local IDE to interact with the Jupyter notebooks. I originally built these notebooks in Microsoft’s VS Code using the Jupyter extension. You only need to modify your database’s security group’s inbound rules to include your IP address on the correct port, as sketched below.
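For example, the inbound rule could be added with the AWS SDK for Python (boto3). This is a sketch only; the security group ID and IP address below are placeholders.

import boto3

# Placeholder values -- substitute your RDS security group ID and public IP
SECURITY_GROUP_ID = "sg-0123456789abcdef0"
MY_IP_CIDR = "203.0.113.10/32"

ec2 = boto3.client("ec2")
ec2.authorize_security_group_ingress(
    GroupId=SECURITY_GROUP_ID,
    IpPermissions=[
        {
            "IpProtocol": "tcp",
            "FromPort": 5432,  # default PostgreSQL port
            "ToPort": 5432,
            "IpRanges": [{"CidrIp": MY_IP_CIDR, "Description": "Notebook access"}],
        }
    ],
)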

Choosing a Large Language Model

While researching this post, I tested several LLMs, including newer foundation models available with Amazon SageMaker JumpStart, such as AI21’s Jurassic-2 Jumbo and Google’s FLAN-T5 XXL. Notebooks for both of these models are available in the GitHub repository. However, due to its superior out-of-the-box results and ease of use through an API, the focus of this post will be on OpenAI’s text-davinci-003 LLM, a model instance within OpenAI’s GPT-3 series. Future posts will focus on using alternative LLMs.

According to OpenAI, “text-davinci-003 is the newer and more capable model, designed specifically for instruction-following tasks. This enables it to respond concisely and more accurately — even in zero-shot scenarios, i.e. without the need for any examples given in the prompt.” Further, “Additionally, text-davinci-003 supports a longer context window (max prompt+completion length) than davinci — 4097 tokens compared to 2049.” OpenAI’s text-davinci-003 LLM has 175 billion parameters. Compare this to Google’s FLAN-T5 XXL LLM, which has a mere 11 billion parameters.
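The context window matters in this use case because table schemas and sample rows are injected into the prompt. As a quick sanity check, token counts can be estimated with OpenAI’s tiktoken library. A small sketch, with an illustrative prompt string:

import tiktoken

# Look up the tokenizer tiktoken maps to text-davinci-003
encoding = tiktoken.encoding_for_model("text-davinci-003")

# Illustrative prompt text only
prompt = "Given an input question, first create a syntactically correct postgresql query to run."
token_count = len(encoding.encode(prompt))
print(f"Prompt consumes {token_count} of the 4,097 available tokens")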

Demonstration

Much of the code found in the notebooks was adapted from LangChain’s SQL Chain example and SQL Database Agent documentation. For this post, we will be using the langchain-sql-OpenAI.ipynb notebook, located in the post’s GitHub project.

Prerequisites

  1. Import The Museum of Modern Art (MoMA) Collection database, from GitHub, into an Amazon RDS for PostgreSQL database.
  2. Create a new Amazon SageMaker notebook instance for this demonstration. Make sure your RDS instance is accessible to your SageMaker Notebook instance environment.
  3. git clone this post’s GitHub repository to your Amazon SageMaker notebook instance.
  4. Create a .env file, used by dotenv, using a terminal in your SageMaker Notebook environment. A sample env.txt file is included in the project.
  5. Add your RDS database credentials to the .env file: RDS_ENDPOINT, RDS_PORT, RDS_USERNAME, RDS_PASSWORD, and RDS_DB_NAME.
  6. Create an OpenAI account and update the .env file to include your OPENAI_API_KEY.

NOTE: When using dotenv, credentials will be stored in plain text. The recommended and more secure method is to use AWS Secrets Manager.
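As a sketch of the more secure approach, the same values could be fetched from AWS Secrets Manager with boto3 at runtime instead of a .env file. The secret name and key names below are assumptions, not part of this post’s project.

import json

import boto3

# Hypothetical secret storing the same keys used in the .env file
SECRET_NAME = "demo/moma/rds-credentials"

client = boto3.client("secretsmanager")
response = client.get_secret_value(SecretId=SECRET_NAME)
secret = json.loads(response["SecretString"])

RDS_URI = (
    f"postgresql+psycopg2://{secret['RDS_USERNAME']}:{secret['RDS_PASSWORD']}"
    f"@{secret['RDS_ENDPOINT']}:{secret['RDS_PORT']}/{secret['RDS_DB_NAME']}"
)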

Install Required Packages

The first step is to install and import the required packages for our demonstration. We want to install the latest version of LangChain since releases are issued almost daily and we want the latest features. At the time of this post, the newest version was 0.0.186.

import os

# Avoid huggingface/tokenizers parallelism error
os.environ["TOKENIZERS_PARALLELISM"] = "false"

# Install the latest versions of langchain and openai
%pip install langchain openai -Uq

# Install latest versions of other required packages
%pip install ipywidgets python-dotenv SQLAlchemy psycopg2 psycopg2-binary -Uq

# Check versions of LangChain and OpenAI
%pip list | grep "langchain\|openai"

from langchain import SQLDatabase, SQLDatabaseChain, OpenAI
from langchain.chains import SQLDatabaseSequentialChain

Setup Environment Variables

For simplicity, we will use dotenv to load the OpenAI and RDS environment variables. Note that the credentials will be stored in plain text using dotenv. The recommended, more secure method is to use AWS Secrets Manager. Again, LangChain should be limited to read-only access to the Amazon RDS database.

LangChain uses SQLAlchemy to connect to SQL databases. With SQLAlchemy, we have a choice of dialects/DBAPI options when connecting to PostgreSQL; I have chosen psycopg2 for this demonstration. We will compose the individual RDS environment variables into the RDS URI used by LangChain: postgresql+psycopg2://user:pwd@hostname:port/dbname.

# Load env vars from .env file
%load_ext dotenv
%dotenv

# SQLAlchemy 2.0 reference: https://docs.sqlalchemy.org/en/20/dialects/postgresql.html
# URI format: postgresql+psycopg2://user:pwd@hostname:port/dbname
RDS_DB_NAME = os.environ.get("RDS_DB_NAME")
RDS_ENDPOINT = os.environ.get("RDS_ENDPOINT")
RDS_PASSWORD = os.environ.get("RDS_PASSWORD")
RDS_PORT = os.environ.get("RDS_PORT")
RDS_USERNAME = os.environ.get("RDS_USERNAME")
RDS_URI = f"postgresql+psycopg2://{RDS_USERNAME}:{RDS_PASSWORD}@{RDS_ENDPOINT}:{RDS_PORT}/{RDS_DB_NAME}"

Specify the LLM

We will start by using OpenAI’s text-davinci-003 LLM with LangChain for this first part of the demonstration. See OpenAI’s Models Overview for more model information. The estimated OpenAI API charge to run this OpenAI notebook once is about $0.40.

llm = OpenAI(model_name="text-davinci-003", temperature=0, verbose=True)

Asking our First Question

We are ready to ask our first question of the MoMA database using LangChain’s SQLDatabaseSequentialChain and OpenAI’s text-davinci-003 LLM. I have included a few sample questions for users unfamiliar with the MoMA database.

# A few sample questions
QUESTION_01 = "How many artists are there?"
QUESTION_02 = "How many artworks are there?"
QUESTION_03 = "How many rows are in the artists table?"
QUESTION_04 = "How many rows are in the artworks table?"
QUESTION_05 = "How many artists are there whose nationality is French?"
QUESTION_06 = "How many artworks were created by artists whose nationality is Spanish?"
QUESTION_07 = "How many artist names start with 'M'?"
QUESTION_08 = "What nationality produced the most number of artworks?"
QUESTION_09 = "How many artworks are by Claude Monet?"
QUESTION_10 = "What is the oldest artwork in the collection?"

from sqlalchemy.exc import ProgrammingError

db = SQLDatabase.from_uri(RDS_URI)
db_chain = SQLDatabaseSequentialChain.from_llm(
    llm, db, verbose=True, use_query_checker=True
)

try:
    db_chain.run(QUESTION_05)
except (ProgrammingError, ValueError) as exc:
    print(f"\n\n{exc}")

Successful results should look similar to the example below. First, we see the question (“How many artists are there whose nationality is French?”), the equivalent SQL query (“SELECT COUNT(*) FROM artists WHERE nationality = ‘French’;”), the resultset (assuming the SQL query was well-formed), and finally, the answer (“There are 839 artists whose nationality is French.”).

> Entering new SQLDatabaseSequentialChain chain...
Table names to use:
['artists']

> Entering new SQLDatabaseChain chain...
How many artists are there whose nationality is French?
SQLQuery:SELECT COUNT(*) FROM artists WHERE nationality = 'French';
SQLResult: [(839,)]
Answer:There are 839 artists whose nationality is French.
> Finished chain.

If the SQL query was malformed by the LLM, you would expect to see output similar to the following:

> Entering new SQLDatabaseSequentialChain chain...
Table names to use:
['Artists']

> Entering new SQLDatabaseChain chain...
How many artists are there whose nationality is French?
SQLQuery:

table_names {'Artists'} not found in database

Here is another example of a malformed SQL query:

> Entering new SQLDatabaseSequentialChain chain...
Table names to use:
['artists']

> Entering new SQLDatabaseChain chain...
How many artists are there whose nationality is French?
SQLQuery:SELECT count(*) FROM artists WHERE nationality = 'French' AND n = 0

(psycopg2.errors.UndefinedColumn) column "n" does not exist
LINE 1: ...count(*) FROM artists WHERE nationality = 'French' AND n = 0
                                                                  ^

[SQL: SELECT count(*) FROM artists WHERE nationality = 'French' AND n = 0]
(Background on this error at: https://sqlalche.me/e/20/f405)

Custom Table Information

According to LangChain’s documentation, “In some cases, it can be useful to provide custom table information instead of using the automatically generated table definitions and the first sample_rows_in_table_info sample rows.” Below we see custom table information for our two MoMA database tables. This method may be impractical when dealing with multiple tables or large table schemas.

custom_table_info = {
    "artists": """CREATE TABLE artists (
        artist_id integer NOT NULL,
        name character varying(200),
        nationality character varying(50),
        gender character varying(25),
        birth_year integer,
        death_year integer,
        CONSTRAINT artists_pk PRIMARY KEY (artist_id)
)

/*
3 rows from artists table:
"artist_id" "name" "nationality" "gender" "birth_year" "death_year"
12 "Jüri Arrak" "Estonian" "Male" 1936 
19 "Richard Artschwager" "American" "Male" 1923 2013
22 "Isidora Aschheim" "Israeli" "Female"  
*/""",
    "artworks": """CREATE TABLE artworks (
        artwork_id integer NOT NULL,
        title character varying(500),
        artist_id integer NOT NULL,
        name character varying(500),
        date integer,
        medium character varying(250),
        dimensions text,
        acquisition_date text,
        credit text,
        catalogue character varying(250),
        department character varying(250),
        classification character varying(250),
        object_number text,
        diameter_cm text,
        circumference_cm text,
        height_cm text,
        length_cm text,
        width_cm text,
        depth_cm text,
        weight_kg text,
        durations integer,
        CONSTRAINT artworks_pk PRIMARY KEY (artwork_id)
)

/*
3 rows from artworks table:
"artwork_id" "title" "artist_id" "name" "date" "medium" "dimensions" "acquisition_date" "credit" "catalogue" "department" "classification" "object_number" "diameter_cm" "circumference_cm" "height_cm" "length_cm" "width_cm" "depth_cm" "weight_kg" "durations"
102312 "Watching the Game" 2422 "John Gutmann" 1934 "Gelatin silver print" "9 3/4 x 6 7/16' (24.8 x 16.4 cm)" "2006-05-11" "Purchase" "N" "Photography" "Photograph" "397.2006"   "24.8"  "16.4"   
103321 "Untitled (page from Sump)" 25520 "Jerome Neuner" 1994 "Page with chromogenic color print and text" "12 x 9 1/2' (30.5 x 24.1 cm)" "2006-05-11" "E.T. Harmax Foundation Fund" "N" "Photography" "Photograph" "415.2006.12"   "30.4801"  "24.13"   
10 "The Manhattan Transcripts Project, New York, New York, Episode 1: The Park" 7056 "Bernard Tschumi"  "Gelatin silver photograph" "14 x 18' (35.6 x 45.7 cm)" "1995-01-17" "Purchase and partial gift of the architect in honor of Lily Auchincloss" "Y" "Architecture & Design" "Architecture" "3.1995.11"   "35.6"  "45.7"   
*/""",
}

Query Checker

According to LangChain’s documentation, “Sometimes the Language Model generates invalid SQL with small mistakes that can be self-corrected using the same technique used by the SQL Database Agent to try and fix the SQL using the LLM.” Simply add the use_query_checker=True parameter to the SQLDatabaseSequentialChain object to enable the query checker option.

Below we see an example of using both custom table information and the query checker.

db = SQLDatabase.from_uri(
    RDS_URI,
    include_tables=["artists", "artworks"],
    sample_rows_in_table_info=3,
    custom_table_info=custom_table_info,
)

db_chain = SQLDatabaseSequentialChain.from_llm(
    llm, db, verbose=True, use_query_checker=True, top_k=3
)

try:
    db_chain.run(QUESTION_06)
except (ProgrammingError, ValueError) as exc:
    print(f"\n\n{exc}")

The results should look similar to the example below. Again, we see the question, the equivalent SQL query, the resultset (assuming the SQL query was well-formed), and finally, the answer.

> Entering new SQLDatabaseSequentialChain chain...
Table names to use:
['artists', 'artworks']

> Entering new SQLDatabaseChain chain...
How many artworks were created by artists whose nationality is Spanish?
SQLQuery:SELECT COUNT(*) FROM artworks 
INNER JOIN artists ON artworks.artist_id = artists.artist_id 
WHERE artists.nationality = 'Spanish';
SQLResult: [(2922,)]
Answer:2922 artworks were created by artists whose nationality is Spanish.
> Finished chain.

Customize Prompt

To further improve results, we can also use LangChain’s Prompt Template. According to LangChain, “A prompt template refers to a reproducible way to generate a prompt. It contains a text string (“the template”) that can take in a set of parameters from the end user and generate a prompt.” Below we see an example prompt template that we will use to improve the chances of getting accurate results from our RDS database.

from langchain.prompts.prompt import PromptTemplate

_DEFAULT_TEMPLATE = """Given an input question, first create a syntactically correct {dialect} query to run, then look at the results of the query and return the answer.
Use the following format:

Question: "Question here"
SQLQuery: "SQL Query to run"
SQLResult: "Result of the SQLQuery"
Answer: "Final answer here"

Only use the following tables:

{table_info}

If someone asks for the art table, they really mean the artworks table.

Question: {input}"""

PROMPT = PromptTemplate(
    input_variables=["input", "table_info", "dialect"], template=_DEFAULT_TEMPLATE
)

Intermediate Steps

According to LangChain’s documentation, “You can also return the intermediate steps of the SQLDatabaseChain. This allows you to access the SQL statement that was generated, as well as the result of running that against the SQL Database.”

Below we see an example of using both a prompt template and intermediate steps.

# Revert to db without custom_table_info
# Could overflow context window (max prompt+completion length) of 4097
db = SQLDatabase.from_uri(RDS_URI)

db_chain = SQLDatabaseChain.from_llm(
    llm,
    db,
    prompt=PROMPT,
    verbose=True,
    use_query_checker=True,
    return_intermediate_steps=True,
)

try:
    result = db_chain(QUESTION_09)
except (ProgrammingError, ValueError) as exc:
    print(f"\n\n{exc}")

result["intermediate_steps"]

The results should look similar to the example below. Again, we see the question, the equivalent SQL query, the resultset (assuming the SQL query was well-formed), the answer, and now, the intermediate steps.

> Entering new SQLDatabaseChain chain...
How many artworks are by Claude Monet?
SQLQuery:SELECT COUNT(*) FROM artworks 
JOIN artists ON artworks.artist_id = artists.artist_id 
WHERE artists.name = 'Claude Monet';
SQLResult: [(6,)]
Answer:6 artworks are by Claude Monet.
> Finished chain.
[{'input': "How many artworks are by Claude Monet?\nSQLQuery:SELECT COUNT(*) FROM artworks \nJOIN artists ON artworks.artist_id = artists.artist_id \nWHERE artists.name = 'Claude Monet';\nSQLResult: [(6,)]\nAnswer:",
  'top_k': '5',
  'dialect': 'postgresql',
  'table_info': 'CREATE TABLE artists (\n        artist_id integer NOT NULL,\n        name character varying(200)",\n        nationality character varying(50)",\n        gender character varying(25)",\n        birth_year integer,\n        death_year integer,\n        CONSTRAINT artists_pk PRIMARY KEY (artist_id)\n)\n\n/*\n3 rows from artists table:\n"artist_id"\t"name"\t"nationality"\t"gender"\t"birth_year"\t"death_year"\n12\t"Jüri Arrak"\t"Estonian"\t"Male"\t1936\t\n19\t"Richard Artschwager"\t"American"\t"Male"\t1923\t2013\n22\t"Isidora Aschheim"\t"Israeli"\t"Female"\t\t\n*/\n\nCREATE TABLE artworks (\n        artwork_id integer NOT NULL,\n        title character varying(500)",\n        artist_id integer NOT NULL,\n        name character varying(500)",\n        date integer,\n        medium character varying(250)",\n        dimensions text",\n        acquisition_date text",\n        credit text",\n        catalogue character varying(250)",\n        department character varying(250)",\n        classification character varying(250)",\n        object_number text",\n        diameter_cm text",\n        circumference_cm text",\n        height_cm text",\n        length_cm text",\n        width_cm text",\n        depth_cm text",\n        weight_kg text",\n        durations integer,\n        CONSTRAINT artworks_pk PRIMARY KEY (artwork_id)\n)\n\n/*\n3 rows from artworks table:\n"artwork_id"\t"title"\t"artist_id"\t"name"\t"date"\t"medium"\t"dimensions"\t"acquisition_date"\t"credit"\t"catalogue"\t"department"\t"classification"\t"object_number"\t"diameter_cm"\t"circumference_cm"\t"height_cm"\t"length_cm"\t"width_cm"\t"depth_cm"\t"weight_kg"\t"durations"\n102312\t"Watching the Game"\t2422\t"John Gutmann"\t1934\t"Gelatin silver print"\t"9 3/4 x 6 7/16\' (24.8 x 16.4 cm)"\t"2006-05-11"\t"Purchase"\t"N"\t"Photography"\t"Photograph"\t"397.2006"\t\t\t"24.8"\t\t"16.4"\t\t\t\n103321\t"Untitled (page from Sump)"\t25520\t"Jerome Neuner"\t1994\t"Page with chromogenic color print and text"\t"12 x 9 1/2\' (30.5 x 24.1 cm)"\t"2006-05-11"\t"E.T. Harmax Foundation Fund"\t"N"\t"Photography"\t"Photograph"\t"415.2006.12"\t\t\t"30.4801"\t\t"24.13"\t\t\t\n10\t"The Manhattan Transcripts Project, New York, New York, Episode 1: The Park"\t7056\t"Bernard Tschumi"\t\t"Gelatin silver photograph"\t"14 x 18\' (35.6 x 45.7 cm)"\t"1995-01-17"\t"Purchase and partial gift of the architect in honor of Lily Auchincloss"\t"Y"\t"Architecture & Design"\t"Architecture"\t"3.1995.11"\t\t\t"35.6"\t\t"45.7"\t\t\t\n*/',
  'stop': ['\nSQLResult:']},
 "SELECT COUNT(*) FROM artworks \nJOIN artists ON artworks.artist_id = artists.artist_id \nWHERE artists.name = 'Claude Monet';",
 {'sql_cmd': "SELECT COUNT(*) FROM artworks \nJOIN artists ON artworks.artist_id = artists.artist_id \nWHERE artists.name = 'Claude Monet';"},
 '[(6,)]',
 {'input': "How many artworks are by Claude Monet?\nSQLQuery:SELECT COUNT(*) FROM artworks \nJOIN artists ON artworks.artist_id = artists.artist_id \nWHERE artists.name = 'Claude Monet';\nSQLResult: [(6,)]\nAnswer:",
  'top_k': '5',
  'dialect': 'postgresql',
  'table_info': 'CREATE TABLE artists (\n        artist_id integer NOT NULL,\n        name character varying(200)",\n        nationality character varying(50)",\n        gender character varying(25)",\n        birth_year integer,\n        death_year integer,\n        CONSTRAINT artists_pk PRIMARY KEY (artist_id)\n)\n\n/*\n3 rows from artists table:\n"artist_id"\t"name"\t"nationality"\t"gender"\t"birth_year"\t"death_year"\n12\t"Jüri Arrak"\t"Estonian"\t"Male"\t1936\t\n19\t"Richard Artschwager"\t"American"\t"Male"\t1923\t2013\n22\t"Isidora Aschheim"\t"Israeli"\t"Female"\t\t\n*/\n\nCREATE TABLE artworks (\n        artwork_id integer NOT NULL,\n        title character varying(500)",\n        artist_id integer NOT NULL,\n        name character varying(500)",\n        date integer,\n        medium character varying(250)",\n        dimensions text",\n        acquisition_date text",\n        credit text",\n        catalogue character varying(250)",\n        department character varying(250)",\n        classification character varying(250)",\n        object_number text",\n        diameter_cm text",\n        circumference_cm text",\n        height_cm text",\n        length_cm text",\n        width_cm text",\n        depth_cm text",\n        weight_kg text",\n        durations integer,\n        CONSTRAINT artworks_pk PRIMARY KEY (artwork_id)\n)\n\n/*\n3 rows from artworks table:\n"artwork_id"\t"title"\t"artist_id"\t"name"\t"date"\t"medium"\t"dimensions"\t"acquisition_date"\t"credit"\t"catalogue"\t"department"\t"classification"\t"object_number"\t"diameter_cm"\t"circumference_cm"\t"height_cm"\t"length_cm"\t"width_cm"\t"depth_cm"\t"weight_kg"\t"durations"\n102312\t"Watching the Game"\t2422\t"John Gutmann"\t1934\t"Gelatin silver print"\t"9 3/4 x 6 7/16\' (24.8 x 16.4 cm)"\t"2006-05-11"\t"Purchase"\t"N"\t"Photography"\t"Photograph"\t"397.2006"\t\t\t"24.8"\t\t"16.4"\t\t\t\n103321\t"Untitled (page from Sump)"\t25520\t"Jerome Neuner"\t1994\t"Page with chromogenic color print and text"\t"12 x 9 1/2\' (30.5 x 24.1 cm)"\t"2006-05-11"\t"E.T. Harmax Foundation Fund"\t"N"\t"Photography"\t"Photograph"\t"415.2006.12"\t\t\t"30.4801"\t\t"24.13"\t\t\t\n10\t"The Manhattan Transcripts Project, New York, New York, Episode 1: The Park"\t7056\t"Bernard Tschumi"\t\t"Gelatin silver photograph"\t"14 x 18\' (35.6 x 45.7 cm)"\t"1995-01-17"\t"Purchase and partial gift of the architect in honor of Lily Auchincloss"\t"Y"\t"Architecture & Design"\t"Architecture"\t"3.1995.11"\t\t\t"35.6"\t\t"45.7"\t\t\t\n*/',
  'stop': ['\nSQLResult:']},
 '6 artworks are by Claude Monet.']

Few-shot Learning

To improve the accuracy of the SQL query, LangChain allows us to use few-shot learning. According to Wikipedia, “In natural language processing, in-context learning, few-shot learning, or few-shot prompting is a prompting technique that allows a model to process examples before attempting a task. The method was popularized after the advent of GPT-3 and is considered to be an emergent property of large language models.” Conveniently, we happen to be using a GPT-3 series LLM in this post’s demonstration.
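Conceptually, few-shot prompting simply prepends worked examples to the question. Here is a minimal, hand-rolled sketch with illustrative examples; LangChain automates this for us below.

# A hand-rolled few-shot prompt; the examples are illustrative only
few_shot_prompt = """Given an input question, create a syntactically correct postgresql query.

Question: How many rows are in the artists table?
SQLQuery: SELECT count(*) FROM artists;

Question: How many artists are there whose nationality is French?
SQLQuery: SELECT count(*) FROM artists WHERE nationality = 'French';

Question: {question}
SQLQuery:"""

print(few_shot_prompt.format(question="How many artworks are by Claude Monet?"))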

Retrieval-Augmented Generation (RAG)

Retrieving relevant, known-good examples from an external datastore and injecting them into the prompt is a form of retrieval-augmented generation (RAG). In the paper Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks, Patrick Lewis et al. describe a general-purpose fine-tuning recipe in which models combine pre-trained parametric and non-parametric memory for language generation. In our case, we combine the parametric memory of the pre-trained LLM, OpenAI’s text-davinci-003, with non-parametric memory, a dense vector index of known-good SQL examples, to increase the accuracy of results.

Vector Database

Embedding databases (aka vector databases), like Pinecone, Milvus, and FAISS, store embeddings and allow you to search by nearest neighbors rather than by substring, as you would with a traditional database. We will use the Chroma vector database and the Sentence Transformers all-MiniLM-L6-v2 model to create embeddings of known-good SQL query examples and measure their semantic similarity to the questions asked. We will then use the closest examples to perform few-shot prompting, a technique to enable in-context learning. Finally, we provide example SQL queries in the prompt along with the question to steer the model to better performance. LangChain handles all of this complexity with minimal coding.

%pip install pyyaml -q
%pip install chromadb sentence_transformers -Uq
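
With the packages installed, the idea of semantic similarity can be made concrete. The short sketch below, which is not part of the notebook, embeds two related questions with the same all-MiniLM-L6-v2 model and computes their cosine similarity; values closer to 1.0 indicate closer meaning.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Two semantically similar questions, embedded as 384-dimensional vectors
embeddings = model.encode([
    "How many artworks are by Claude Monet?",
    "Count the paintings created by Monet.",
])

similarity = util.cos_sim(embeddings[0], embeddings[1])
print(f"Cosine similarity: {similarity.item():.3f}")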

from typing import Dict
import yaml

chain = SQLDatabaseChain.from_llm(
    llm, db, verbose=True, return_intermediate_steps=True, use_query_checker=True
)

def _parse_example(result: Dict) -> Dict:
    sql_cmd_key = "sql_cmd"
    sql_result_key = "sql_result"
    table_info_key = "table_info"
    input_key = "input"
    final_answer_key = "answer"

    _example = {
        "input": result.get("query"),
    }

    steps = result.get("intermediate_steps")
    answer_key = sql_cmd_key  # the first one
    for step in steps:
        if isinstance(step, dict):
            if table_info_key not in _example:
                _example[table_info_key] = step.get(table_info_key)

            if input_key in step:
                if step[input_key].endswith("SQLQuery:"):
                    answer_key = sql_cmd_key  # this is the SQL generation input
                if step[input_key].endswith("Answer:"):
                    answer_key = final_answer_key  # this is the final answer input
            elif sql_cmd_key in step:
                _example[sql_cmd_key] = step[sql_cmd_key]
                answer_key = sql_result_key  # this is SQL execution input
        elif isinstance(step, str):
            _example[answer_key] = step
    return _example

example: Dict

try:
    result = chain(QUESTION_08)
    print("\n*** Query succeeded")
    example = _parse_example(result)
except Exception as exc:
    print("\n*** Query failed")
    result = {"query": QUESTION_08, "intermediate_steps": exc.intermediate_steps}
    example = _parse_example(result)


# print results for now, in reality you may want to write this out 
# to a YAML file or database for manual fix-ups offline
yaml_example = yaml.dump(example, allow_unicode=True)
print("\n" + yaml_example)

The verbose results should look similar to the example below. Again, we see the question, the equivalent SQL query, the resultset (assuming the SQL query was well-formed), the answer, and a YAML-formatted example, which we will see how to use next.

> Entering new SQLDatabaseChain chain...
What nationality produced the most number of artworks?
SQLQuery:SELECT nationality, COUNT(*) AS num_artworks
FROM artists
INNER JOIN artworks ON artists.artist_id = artworks.artist_id
WHERE nationality IS NOT NULL
GROUP BY nationality
ORDER BY num_artworks DESC
LIMIT 5;
SQLResult: [('American', 53955), ('French', 22116), ('German', 8920), ('British', 5480), ('', 3273)]
Answer:The nationality that produced the most number of artworks is American.
> Finished chain.

*** Query succeeded

answer: The nationality that produced the most number of artworks is American.
input: What nationality produced the most number of artworks?
sql_cmd: 'SELECT nationality, COUNT(*) AS num_artworks

  FROM artists

  INNER JOIN artworks ON artists.artist_id = artworks.artist_id

  WHERE nationality IS NOT NULL

  GROUP BY nationality

  ORDER BY num_artworks DESC

  LIMIT 5;'
sql_result: '[(''American'', 53955), (''French'', 22116), (''German'', 8920), (''British'',
  5480), ('''', 3273)]'
table_info: "CREATE TABLE artists (\n        artist_id integer NOT NULL,\n       \
  \ name character varying(200)\",\n        nationality character varying(50)\",\n\
  \        gender character varying(25)\",\n        birth_year integer,\n        death_year\
  \ integer,\n        CONSTRAINT artists_pk PRIMARY KEY (artist_id)\n)\n\n/*\n3 rows\
  \ from artists table:\n\"artist_id\"\t\"name\"\t\"nationality\"\t\"gender\"\t\"\
  birth_year\"\t\"death_year\"\n12\t\"Jüri Arrak\"\t\"Estonian\"\t\"Male\"\t1936\t\
  \n19\t\"Richard Artschwager\"\t\"American\"\t\"Male\"\t1923\t2013\n22\t\"Isidora\
  \ Aschheim\"\t\"Israeli\"\t\"Female\"\t\t\n*/\n\nCREATE TABLE artworks (\n     \
  \   artwork_id integer NOT NULL,\n        title character varying(500)\",\n    \
  \    artist_id integer NOT NULL,\n        name character varying(500)\",\n     \
  \   date integer,\n        medium character varying(250)\",\n        dimensions\
  \ text\",\n        acquisition_date text\",\n        credit text\",\n        catalogue\
  \ character varying(250)\",\n        department character varying(250)\",\n    \
  \    classification character varying(250)\",\n        object_number text\",\n \
  \       diameter_cm text\",\n        circumference_cm text\",\n        height_cm\
  \ text\",\n        length_cm text\",\n        width_cm text\",\n        depth_cm\
  \ text\",\n        weight_kg text\",\n        durations integer,\n        CONSTRAINT\
  \ artworks_pk PRIMARY KEY (artwork_id)\n)\n\n/*\n3 rows from artworks table:\n\"\
  artwork_id\"\t\"title\"\t\"artist_id\"\t\"name\"\t\"date\"\t\"medium\"\t\"dimensions\"\
  \t\"acquisition_date\"\t\"credit\"\t\"catalogue\"\t\"department\"\t\"classification\"\
  \t\"object_number\"\t\"diameter_cm\"\t\"circumference_cm\"\t\"height_cm\"\t\"length_cm\"\
  \t\"width_cm\"\t\"depth_cm\"\t\"weight_kg\"\t\"durations\"\n102312\t\"Watching the\
  \ Game\"\t2422\t\"John Gutmann\"\t1934\t\"Gelatin silver print\"\t\"9 3/4 x 6 7/16'\
  \ (24.8 x 16.4 cm)\"\t\"2006-05-11\"\t\"Purchase\"\t\"N\"\t\"Photography\"\t\"Photograph\"\
  \t\"397.2006\"\t\t\t\"24.8\"\t\t\"16.4\"\t\t\t\n103321\t\"Untitled (page from Sump)\"\
  \t25520\t\"Jerome Neuner\"\t1994\t\"Page with chromogenic color print and text\"\
  \t\"12 x 9 1/2' (30.5 x 24.1 cm)\"\t\"2006-05-11\"\t\"E.T. Harmax Foundation Fund\"\
  \t\"N\"\t\"Photography\"\t\"Photograph\"\t\"415.2006.12\"\t\t\t\"30.4801\"\t\t\"\
  24.13\"\t\t\t\n10\t\"The Manhattan Transcripts Project, New York, New York, Episode\
  \ 1: The Park\"\t7056\t\"Bernard Tschumi\"\t\t\"Gelatin silver photograph\"\t\"\
  14 x 18' (35.6 x 45.7 cm)\"\t\"1995-01-17\"\t\"Purchase and partial gift of the\
  \ architect in honor of Lily Auchincloss\"\t\"Y\"\t\"Architecture & Design\"\t\"\
  Architecture\"\t\"3.1995.11\"\t\t\t\"35.6\"\t\t\"45.7\"\t\t\t\n*/"

According to LangChain’s documentation, we need to run the above cell a few times to collect several examples of input, table_info, and sql_cmd generated by your language model. The sql_cmd values may be incorrect, and you can manually fix them up to build a collection of known-good question-query-result-answer examples. Using YAML allows us to keep an organized record of our inputs and corrected SQL outputs that we can build up over time. For the demonstration, I have placed several examples in a separate file, sql_examples_postgresql.yaml.

- answer: There are 15086 rows in the artists table.
  input: How many rows are in the artists table?
  sql_cmd: SELECT count(*) FROM artists;
  sql_result: '[(15086,)]'
  table_info: |
    CREATE TABLE artists
    (
        artist_id integer NOT NULL,
        name character varying(200),
        nationality character varying(50),
        gender character varying(25),
        birth_year integer,
        death_year integer,
        CONSTRAINT artists_pk PRIMARY KEY (artist_id)
    )

    /*
    3 rows from artists table:
    "artist_id" "name" "nationality" "gender" "birth_year" "death_year"
    1 "Robert Arneson" "American" "Male" 1930 1992
    2 "Doroteo Arnaiz" "Spanish" "Male" 1936 
    3 "Bill Arnold" "American" "Male" 1941 
    */

Now that we have some examples, we perform few-shot prompting. First, we load the samples from the YAML file. Then, we use the HuggingFaceEmbeddings class with the sentence-transformers/all-MiniLM-L6-v2 model to produce embeddings. According to the model card, this sentence-transformers model maps sentences and paragraphs to a 384 dimensional dense vector space.

Next, we use the SemanticSimilarityExampleSelector class along with the VectorStore class, specifying Chroma as our vector database, to store the embeddings and perform a search for examples to use based on semantic similarity. Lastly, we use the FewShotPromptTemplate class to create a specialized few-shot prompt template, which includes a few of our known-good question-query-result-answer examples. If switching database engines, make sure to import the correct prompt from the langchain.chains.sql_database.prompt module. For PostgreSQL, we need to import the _postgres_prompt prompt.

# Load the corrected examples for few-shot prompting
with open("sql_examples_postgresql.yaml", "r") as stream:
    SQL_SAMPLES = yaml.safe_load(stream)

from langchain import FewShotPromptTemplate, PromptTemplate
from langchain.chains.sql_database.prompt import _postgres_prompt, PROMPT_SUFFIX
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
from langchain.prompts.example_selector.semantic_similarity import (
    SemanticSimilarityExampleSelector,
)
from langchain.vectorstores import Chroma

example_prompt = PromptTemplate(
    input_variables=["table_info", "input", "sql_cmd", "sql_result", "answer"],
    template="{table_info}\n\nQuestion: {input}\nSQLQuery: {sql_cmd}\nSQLResult: {sql_result}\nAnswer: {answer}",
)

examples_dict = SQL_SAMPLES

local_embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)

example_selector = SemanticSimilarityExampleSelector.from_examples(
    examples_dict,
    local_embeddings,
    Chroma,
    k=min(3, len(examples_dict)),
)

few_shot_prompt = FewShotPromptTemplate(
    example_selector=example_selector,
    example_prompt=example_prompt,
    prefix=_postgres_prompt + "Here are some examples:",
    suffix=PROMPT_SUFFIX,
    input_variables=["table_info", "input", "top_k"],
)

Using our few-shot prompt template, we can ask a question of the RDS database and expect higher accuracy from the LLM.

db_chain = SQLDatabaseChain.from_llm(
    llm,
    db,
    prompt=few_shot_prompt,
    use_query_checker=True,
    verbose=True,
    return_intermediate_steps=True,
)

try:
    result = db_chain(QUESTION_09)
except (ProgrammingError, ValueError) as exc:
    print(f"\n\n{exc}")

The results should look similar to the example below. Again, we see the question, the equivalent SQL query, the resultset (assuming the SQL query was well-formed), and finally, the answer.

> Entering new SQLDatabaseChain chain...
How many artworks are by Claude Monet?
SQLQuery:SELECT COUNT(*)
FROM artworks
JOIN artists ON artists.artist_id = artworks.artist_id
WHERE artists.name = 'Claude Monet';
SQLResult: [(6,)]
Answer:There are 6 artworks by Claude Monet.
> Finished chain.

LangChain SQL Database Agent

According to LangChain documentation, the SQL Database Agent “builds off of SQLDatabaseChain and is designed to answer more general questions about a database, as well as recover from errors. NOTE: it is not guaranteed that the agent won’t perform DML statements on your database given certain questions. Be careful running it on sensitive data!” Using LangChain’s SQL Database Agent, we can ask general questions about the MoMA RDS database, such as “Describe the artists table.”

from langchain.agents import create_sql_agent
from langchain.agents.agent_toolkits import SQLDatabaseToolkit
from langchain.sql_database import SQLDatabase

toolkit = SQLDatabaseToolkit(db=db, llm=llm)
agent_executor = create_sql_agent(llm=llm, toolkit=toolkit, verbose=True)

# Example of describing a table
try:
    agent_executor.run("Describe the artists table.")
except (ProgrammingError, ValueError) as exc:
    print(f"\n\n{exc}")The results should look similar to the example below. They are different than what we have seen previously with  SQLDatabaseChain and SQLDatabaseSequentialChain examples. In this example, the agent lists the tables, then finds the most likely candidate. We can see the thought process of the LLM. LangChain returns the final answer: "The artists table contains the columns artist_id, name, nationality, gender, birth_year, and death_year. Examples of artists in the table include Jüri Arrak, Richard Artschwager, and Isidora Aschheim."

The results should look similar to the example below. We see the thoughts, actions, and observations of the LLM as it analyzes the question and formulates a response.

> Entering new AgentExecutor chain...
Action: list_tables_sql_db
Action Input: ""
Observation: artists, artworks
Thought: I should look at the schema of the artists table
Action: schema_sql_db
Action Input: "artists"
Observation: CREATE TABLE artists (
        artist_id integer NOT NULL,
        name character varying(200)",
        nationality character varying(50)",
        gender character varying(25)",
        birth_year integer,
        death_year integer,
        CONSTRAINT artists_pk PRIMARY KEY (artist_id)
)

/*
3 rows from artists table:
"artist_id" "name" "nationality" "gender" "birth_year" "death_year"
12 "Jüri Arrak" "Estonian" "Male" 1936 
19 "Richard Artschwager" "American" "Male" 1923 2013
22 "Isidora Aschheim" "Israeli" "Female"  
*/
Thought: I now know the final answer
Final Answer: The artists table contains the columns artist_id, name, nationality, gender, birth_year, and death_year. Examples of artists in the table include Jüri Arrak, Richard Artschwager, and Isidora Aschheim.

> Finished chain.

Using the Agent, we can also ask questions, as we did previously, of the data. For example, “What is the oldest artwork in the collection?”

# Example of running queries
try:
    agent_executor.run(QUESTION_10)
except (ProgrammingError, ValueError) as exc:
    print(f"\n\n{exc}")

The results should look similar to the example below. Again, we see the thoughts, actions, and observations of the LLM as it analyzes the question and formulates the response: “The oldest artwork in the collection is the Sugar Bowl, created in 1768.”

> Entering new AgentExecutor chain...
Action: list_tables_sql_db
Action Input: ""
Observation: artists, artworks
Thought: I should check the schema of the artworks table to see what columns I can use to order the results.
Action: schema_sql_db
Action Input: "artworks"
Observation: 
CREATE TABLE artworks (
 artwork_id INTEGER NOT NULL, 
 title VARCHAR(500), 
 artist_id INTEGER NOT NULL, 
 name VARCHAR(500), 
 date INTEGER, 
 medium VARCHAR(250), 
 dimensions TEXT, 
 acquisition_date TEXT, 
 credit TEXT, 
 catalogue VARCHAR(250), 
 department VARCHAR(250), 
 classification VARCHAR(250), 
 object_number TEXT, 
 diameter_cm TEXT, 
 circumference_cm TEXT, 
 height_cm TEXT, 
 length_cm TEXT, 
 width_cm TEXT, 
 depth_cm TEXT, 
 weight_kg TEXT, 
 durations INTEGER, 
 CONSTRAINT artworks_pk PRIMARY KEY (artwork_id)
)

/*
3 rows from artworks table:
artwork_id title artist_id name date medium dimensions acquisition_date credit catalogue department classification object_number diameter_cm circumference_cm height_cm length_cm width_cm depth_cm weight_kg durations
102312 Watching the Game 2422 John Gutmann 1934 Gelatin silver print 9 3/4 x 6 7/16' (24.8 x 16.4 cm) 2006-05-11 Purchase N Photography Photograph 397.2006   24.8  16.4   None
103321 Untitled (page from Sump) 25520 Jerome Neuner 1994 Page with chromogenic color print and text 12 x 9 1/2' (30.5 x 24.1 cm) 2006-05-11 E.T. Harmax Foundation Fund N Photography Photograph 415.2006.12   30.4801  24.13   None
10 The Manhattan Transcripts Project, New York, New York, Episode 1: The Park 7056 Bernard Tschumi None Gelatin silver photograph 14 x 18' (35.6 x 45.7 cm) 1995-01-17 Purchase and partial gift of the architect in honor of Lily Auchincloss Y Architecture & Design Architecture 3.1995.11   35.6  45.7   None
*/
Thought: I should query the artworks table, ordering by the date column, and limit the results to 10.
Action: query_checker_sql_db
Action Input: SELECT artwork_id, title, date FROM artworks ORDER BY date ASC LIMIT 10
Observation: 

SELECT artwork_id, title, date FROM artworks ORDER BY date ASC LIMIT 10;
Thought: The query looks correct, I should execute it.
Action: query_sql_db
Action Input: SELECT artwork_id, title, date FROM artworks ORDER BY date ASC LIMIT 10
Observation: [(91978, 'Sugar Bowl', 1768), (91977, 'Demitasse Cup and Saucer', 1768), (2767, 'Partial Coffee and Tea Service', 1768), (91976, 'Teacup and Saucer', 1768), (91975, 'Coffee Pot and Lid', 1768), (49894, 'Portrait of M. Martin', 1805), (50818, 'H. Berton', 1809), (70403, 'Plate for Souvenir of Italy', 1811), (21127, 'Headpiece (page 315) from FABLES', 1818), (21257, 'Tailpiece (page 104) from FABLES', 1818)]
Thought: I now know the final answer.
Final Answer: The oldest artwork in the collection is the Sugar Bowl, created in 1768.

> Finished chain.

Conclusion

In this post, we learned to use LangChain’s SQL Database Chain and Agent with OpenAI’s text-davinci-003 LLM to perform Natural Language Queries (NLQ) of Amazon RDS for PostgreSQL. We also learned about the importance of using LangChain’s Prompt Template, Query Checker, few-shot prompting, and retrieval-augmented generation (RAG) to improve our results.

Stay tuned for the accompanying video for this post and associated follow-up posts where we will use Amazon SageMaker real-time inference LLM endpoints to perform the same activities on the database and Streamlit to create end-user applications built on LangChain’s NLQ capabilities.

If you are not yet a Medium member and want to support writers like me, please sign up here: https://garystafford.medium.com/membership




This blog represents my viewpoints and not those of my employer, Amazon Web Services (AWS). All product names, logos, and brands are the property of their respective owners.


  'stop': ['\nSQLResult:']},
 "SELECT COUNT(*) FROM artworks \nJOIN artists ON artworks.artist_id = artists.artist_id \nWHERE artists.name = 'Claude Monet';",
 {'sql_cmd': "SELECT COUNT(*) FROM artworks \nJOIN artists ON artworks.artist_id = artists.artist_id \nWHERE artists.name = 'Claude Monet';"},
 '[(6,)]',
 {'input': "How many artworks are by Claude Monet?\nSQLQuery:SELECT COUNT(*) FROM artworks \nJOIN artists ON artworks.artist_id = artists.artist_id \nWHERE artists.name = 'Claude Monet';\nSQLResult: [(6,)]\nAnswer:",
  'top_k': '5',
  'dialect': 'postgresql',
  'table_info': 'CREATE TABLE artists (\n        artist_id integer NOT NULL,\n        name character varying(200)",\n        nationality character varying(50)",\n        gender character varying(25)",\n        birth_year integer,\n        death_year integer,\n        CONSTRAINT artists_pk PRIMARY KEY (artist_id)\n)\n\n/*\n3 rows from artists table:\n"artist_id"\t"name"\t"nationality"\t"gender"\t"birth_year"\t"death_year"\n12\t"Jüri Arrak"\t"Estonian"\t"Male"\t1936\t\n19\t"Richard Artschwager"\t"American"\t"Male"\t1923\t2013\n22\t"Isidora Aschheim"\t"Israeli"\t"Female"\t\t\n*/\n\nCREATE TABLE artworks (\n        artwork_id integer NOT NULL,\n        title character varying(500)",\n        artist_id integer NOT NULL,\n        name character varying(500)",\n        date integer,\n        medium character varying(250)",\n        dimensions text",\n        acquisition_date text",\n        credit text",\n        catalogue character varying(250)",\n        department character varying(250)",\n        classification character varying(250)",\n        object_number text",\n        diameter_cm text",\n        circumference_cm text",\n        height_cm text",\n        length_cm text",\n        width_cm text",\n        depth_cm text",\n        weight_kg text",\n        durations integer,\n        CONSTRAINT artworks_pk PRIMARY KEY (artwork_id)\n)\n\n/*\n3 rows from artworks table:\n"artwork_id"\t"title"\t"artist_id"\t"name"\t"date"\t"medium"\t"dimensions"\t"acquisition_date"\t"credit"\t"catalogue"\t"department"\t"classification"\t"object_number"\t"diameter_cm"\t"circumference_cm"\t"height_cm"\t"length_cm"\t"width_cm"\t"depth_cm"\t"weight_kg"\t"durations"\n102312\t"Watching the Game"\t2422\t"John Gutmann"\t1934\t"Gelatin silver print"\t"9 3/4 x 6 7/16\' (24.8 x 16.4 cm)"\t"2006-05-11"\t"Purchase"\t"N"\t"Photography"\t"Photograph"\t"397.2006"\t\t\t"24.8"\t\t"16.4"\t\t\t\n103321\t"Untitled (page from Sump)"\t25520\t"Jerome Neuner"\t1994\t"Page with chromogenic color print and text"\t"12 x 9 1/2\' (30.5 x 24.1 cm)"\t"2006-05-11"\t"E.T. Harmax Foundation Fund"\t"N"\t"Photography"\t"Photograph"\t"415.2006.12"\t\t\t"30.4801"\t\t"24.13"\t\t\t\n10\t"The Manhattan Transcripts Project, New York, New York, Episode 1: The Park"\t7056\t"Bernard Tschumi"\t\t"Gelatin silver photograph"\t"14 x 18\' (35.6 x 45.7 cm)"\t"1995-01-17"\t"Purchase and partial gift of the architect in honor of Lily Auchincloss"\t"Y"\t"Architecture & Design"\t"Architecture"\t"3.1995.11"\t\t\t"35.6"\t\t"45.7"\t\t\t\n*/',
  'stop': ['\nSQLResult:']},
 '6 artworks are by Claude Monet.']

Few-shot Learning

To improve the accuracy of the SQL query, LangChain allows us to use few-shot learning. According to Wikipedia, “In natural language processing, in-context learning, few-shot learning, or few-shot prompting is a prompting technique that allows a model to process examples before attempting a task. The method was popularized after the advent of GPT-3 and is considered to be an emergent property of large language models.” Conveniently, we happen to be using a GPT-3 series LLM in this post’s demonstration.

Retrieval-Augmented Generation (RAG)

The method of allowing a model to process examples before attempting a task is called retrieval-augmented generation (RAG). According to the paper, Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks, by Patrick Lewis et al., with a general-purpose fine-tuning recipe for retrieval-augmented generation (RAG), models combine pre-trained parametric and non-parametric memory for language generation. In our case, we combine the parametric memory of the pre-trained LLM, OpenAI text-davinci-003 in this case, and the non-parametric memory, a dense vector index of known-good SQL examples, to increase the accuracy of results.

Vector Database

Embeddings databases (aka vector databases), like Pinecone, Milvus, and FAISS, store embeddings and allow you to search by nearest neighbors rather than substrings like a traditional database. We will use the Chroma vector database and the Sentence Transformers all-MiniLM-L6-v2 model to create embeddings of known-good SQL query examples and measure their semantic similarity to the questions asked. We will then use the closest examples to perform few-shot prompting, a technique to enable in-context learning. Finally, we provide example SQL queries in the prompt along with the question to steer the model to better performance. LangChain handles all of this complexity with minimal coding.

%pip install pyyaml -q
%pip install chromadb sentence_transformers -Uq

from typing import Dict
import yaml

chain = SQLDatabaseChain.from_llm(
    llm, db, verbose=True, return_intermediate_steps=True, use_query_checker=True
)

def _parse_example(result: Dict) -> Dict:
    sql_cmd_key = "sql_cmd"
    sql_result_key = "sql_result"
    table_info_key = "table_info"
    input_key = "input"
    final_answer_key = "answer"

    _example = {
        "input": result.get("query"),
    }

    steps = result.get("intermediate_steps")
    answer_key = sql_cmd_key  # the first one
    for step in steps:
        if isinstance(step, dict):
            if table_info_key not in _example:
                _example[table_info_key] = step.get(table_info_key)

            if input_key in step:
                if step[input_key].endswith("SQLQuery:"):
                    answer_key = sql_cmd_key  # this is the SQL generation input
                if step[input_key].endswith("Answer:"):
                    answer_key = final_answer_key  # this is the final answer input
            elif sql_cmd_key in step:
                _example[sql_cmd_key] = step[sql_cmd_key]
                answer_key = sql_result_key  # this is SQL execution input
        elif isinstance(step, str):
            _example[answer_key] = step
    return _example

example: any

try:
    result = chain(QUESTION_08)
    print("\n*** Query succeeded")
    example = _parse_example(result)
except Exception as exc:
    print("\n*** Query failed")
    result = {"query": QUESTION_08, "intermediate_steps": exc.intermediate_steps}
    example = _parse_example(result)


# print results for now, in reality you may want to write this out 
# to a YAML file or database for manual fix-ups offline
yaml_example = yaml.dump(example, allow_unicode=True)
print("\n" + yaml_example)

The verbose results should look similar to the example below. Again, we see the question, the equivalent SQL query, the resultset (assuming the SQL query was well-formed), the answer, and a YAML-formatted example, which we will see how to use next.

> Entering new SQLDatabaseChain chain...
What nationality produced the most number of artworks?
SQLQuery:SELECT nationality, COUNT(*) AS num_artworks
FROM artists
INNER JOIN artworks ON artists.artist_id = artworks.artist_id
WHERE nationality IS NOT NULL
GROUP BY nationality
ORDER BY num_artworks DESC
LIMIT 5;
SQLResult: [('American', 53955), ('French', 22116), ('German', 8920), ('British', 5480), ('', 3273)]
Answer:The nationality that produced the most number of artworks is American.
> Finished chain.

*** Query succeeded

answer: The nationality that produced the most number of artworks is American.
input: What nationality produced the most number of artworks?
sql_cmd: 'SELECT nationality, COUNT(*) AS num_artworks

  FROM artists

  INNER JOIN artworks ON artists.artist_id = artworks.artist_id

  WHERE nationality IS NOT NULL

  GROUP BY nationality

  ORDER BY num_artworks DESC

  LIMIT 5;'
sql_result: '[(''American'', 53955), (''French'', 22116), (''German'', 8920), (''British'',
  5480), ('''', 3273)]'
table_info: "CREATE TABLE artists (\n        artist_id integer NOT NULL,\n       \
  \ name character varying(200)\",\n        nationality character varying(50)\",\n\
  \        gender character varying(25)\",\n        birth_year integer,\n        death_year\
  \ integer,\n        CONSTRAINT artists_pk PRIMARY KEY (artist_id)\n)\n\n/*\n3 rows\
  \ from artists table:\n\"artist_id\"\t\"name\"\t\"nationality\"\t\"gender\"\t\"\
  birth_year\"\t\"death_year\"\n12\t\"Jüri Arrak\"\t\"Estonian\"\t\"Male\"\t1936\t\
  \n19\t\"Richard Artschwager\"\t\"American\"\t\"Male\"\t1923\t2013\n22\t\"Isidora\
  \ Aschheim\"\t\"Israeli\"\t\"Female\"\t\t\n*/\n\nCREATE TABLE artworks (\n     \
  \   artwork_id integer NOT NULL,\n        title character varying(500)\",\n    \
  \    artist_id integer NOT NULL,\n        name character varying(500)\",\n     \
  \   date integer,\n        medium character varying(250)\",\n        dimensions\
  \ text\",\n        acquisition_date text\",\n        credit text\",\n        catalogue\
  \ character varying(250)\",\n        department character varying(250)\",\n    \
  \    classification character varying(250)\",\n        object_number text\",\n \
  \       diameter_cm text\",\n        circumference_cm text\",\n        height_cm\
  \ text\",\n        length_cm text\",\n        width_cm text\",\n        depth_cm\
  \ text\",\n        weight_kg text\",\n        durations integer,\n        CONSTRAINT\
  \ artworks_pk PRIMARY KEY (artwork_id)\n)\n\n/*\n3 rows from artworks table:\n\"\
  artwork_id\"\t\"title\"\t\"artist_id\"\t\"name\"\t\"date\"\t\"medium\"\t\"dimensions\"\
  \t\"acquisition_date\"\t\"credit\"\t\"catalogue\"\t\"department\"\t\"classification\"\
  \t\"object_number\"\t\"diameter_cm\"\t\"circumference_cm\"\t\"height_cm\"\t\"length_cm\"\
  \t\"width_cm\"\t\"depth_cm\"\t\"weight_kg\"\t\"durations\"\n102312\t\"Watching the\
  \ Game\"\t2422\t\"John Gutmann\"\t1934\t\"Gelatin silver print\"\t\"9 3/4 x 6 7/16'\
  \ (24.8 x 16.4 cm)\"\t\"2006-05-11\"\t\"Purchase\"\t\"N\"\t\"Photography\"\t\"Photograph\"\
  \t\"397.2006\"\t\t\t\"24.8\"\t\t\"16.4\"\t\t\t\n103321\t\"Untitled (page from Sump)\"\
  \t25520\t\"Jerome Neuner\"\t1994\t\"Page with chromogenic color print and text\"\
  \t\"12 x 9 1/2' (30.5 x 24.1 cm)\"\t\"2006-05-11\"\t\"E.T. Harmax Foundation Fund\"\
  \t\"N\"\t\"Photography\"\t\"Photograph\"\t\"415.2006.12\"\t\t\t\"30.4801\"\t\t\"\
  24.13\"\t\t\t\n10\t\"The Manhattan Transcripts Project, New York, New York, Episode\
  \ 1: The Park\"\t7056\t\"Bernard Tschumi\"\t\t\"Gelatin silver photograph\"\t\"\
  14 x 18' (35.6 x 45.7 cm)\"\t\"1995-01-17\"\t\"Purchase and partial gift of the\
  \ architect in honor of Lily Auchincloss\"\t\"Y\"\t\"Architecture & Design\"\t\"\
  Architecture\"\t\"3.1995.11\"\t\t\t\"35.6\"\t\t\"45.7\"\t\t\t\n*/"

According to LangChain’s documentation, run the above cell a few times to collect several examples of input, table_info and sql_cmd generated by your language model. The sql_cmd values may be incorrect, and you can manually fix them up to build a collection of known-good question-query-result-answer examples. Using YAML allows us to keep an organized record of our inputs and corrected SQL output that we can build up over time. For the demonstration, I have placed several examples in a separate file, sql_examples.yaml.

  class- answer: There are 15086 rows in the artists table.
  input: How many rows are in the artists table?
  sql_cmd: SELECT count(*) FROM artists;
  sql_result: '[(15086,)]'
  table_info: |
    CREATE TABLE artists
    (
        artist_id integer NOT NULL,
        name character varying(200) COLLATE pg_catalog."default",
        nationality character varying(50) COLLATE pg_catalog."default",
        gender character varying(25) COLLATE pg_catalog."default",
        birth_year integer,
        death_year integer,
        CONSTRAINT artists_pk PRIMARY KEY (artist_id)
    )

    /*
    3 rows from artists table:
    "artist_id" "name" "nationality" "gender" "birth_year" "death_year"
    1 "Robert Arneson" "American" "Male" 1930 1992
    2 "Doroteo Arnaiz" "Spanish" "Male" 1936 
    3 "Bill Arnold" "American" "Male" 1941 
    */

Now that we have some examples, we can do few-shot prompting. First, we load the samples from the YAML file. Then, we use the HuggingFaceEmbeddings class with the sentence-transformers/all-MiniLM-L6-v2 model to produce embeddings. According to the model card, this sentence-transformers model maps sentences and paragraphs to a 384 dimensional dense vector space.

Next, we use the SemanticSimilarityExampleSelector class along with the VectorStore class, specifying Chroma as our vector database, to store the embeddings and perform a search for examples to use based on semantic similarity. Lastly, we use the FewShotPromptTemplate class to create a specialized few-shot prompt template, which includes a few of our known-good question-query-result-answer examples. If switching database engines, make sure to import the correct prompt from the langchain.chains.sql_database.prompt module. For PostgreSQL, we need to import the _postgres_prompt prompt.

# Load the corrected examples for few-shot prompting
with open("sql_examples.yaml", "r") as stream:
    SQL_SAMPLES = yaml.safe_load(stream)

from langchain import FewShotPromptTemplate, PromptTemplate
from langchain.chains.sql_database.prompt import _postgres_prompt, PROMPT_SUFFIX
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
from langchain.prompts.example_selector.semantic_similarity import (
    SemanticSimilarityExampleSelector,
)
from langchain.vectorstores import Chroma

example_prompt = PromptTemplate(
    input_variables=["table_info", "input", "sql_cmd", "sql_result", "answer"],
    template="{table_info}\n\nQuestion: {input}\nSQLQuery: {sql_cmd}\nSQLResult: {sql_result}\nAnswer: {answer}",
)

examples_dict = SQL_SAMPLES

local_embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)

example_selector = SemanticSimilarityExampleSelector.from_examples(
    examples_dict,
    local_embeddings,
    Chroma,
    k=min(3, len(examples_dict)),
)

few_shot_prompt = FewShotPromptTemplate(
    example_selector=example_selector,
    example_prompt=example_prompt,
    prefix=_postgres_prompt + "Here are some examples:",
    suffix=PROMPT_SUFFIX,
    input_variables=["table_info", "input", "top_k"],
)

Using our few-shot prompt template, we can ask a question of the RDS database and expect higher accuracy from the LLM.

db_chain = SQLDatabaseChain.from_llm(
    llm,
    db,
    prompt=few_shot_prompt,
    use_query_checker=True,
    verbose=True,
    return_intermediate_steps=True,
)

try:
    result = db_chain(QUESTION_09)
except (ProgrammingError, ValueError) as exc:
    print(f"\n\n{exc}")

The results should look similar to the example below. Again, we see the question, the equivalent SQL query, the resultset (assuming the SQL query was well-formed), and finally, the answer.

> Entering new SQLDatabaseChain chain...
How many artworks are by Claude Monet?
SQLQuery:SELECT COUNT(*)
FROM artworks
JOIN artists ON artists.artist_id = artworks.artist_id
WHERE artists.name = 'Claude Monet';
SQLResult: [(6,)]
Answer:There are 6 artworks by Claude Monet.
> Finished chain.

LangChain SQL Database Agent

According to LangChain documentation, the SQL Database Agent “builds off of SQLDatabaseChain and is designed to answer more general questions about a database, as well as recover from errors. NOTE: it is not guaranteed that the agent won’t perform DML statements on your database given certain questions. Be careful running it on sensitive data!” Using LangChain’s SQL Database Agent, we can ask general questions about the MoMA RDS database, such as “Describe the artists table.”

from langchain.agents import create_sql_agent
from langchain.agents.agent_toolkits import SQLDatabaseToolkit
from langchain.sql_database import SQLDatabase

toolkit = SQLDatabaseToolkit(db=db, llm=llm)
agent_executor = create_sql_agent(llm=llm, toolkit=toolkit, verbose=True)

try:
    agent_executor.run("Describe the artists table.")
except (ProgrammingError, ValueError) as exc:
    print(f"\n\n{exc}")

The results should look similar to the example below. They are different than what we have seen previously with SQLDatabaseChain and SQLDatabaseSequentialChain examples. In this example, the agent lists the tables, then finds the most likely candidate. We can see the thought process of the LLM. LangChain returns the final answer: “The artists table contains the columns artist_id, name, nationality, gender, birth_year, and death_year. Examples of artists in the table include Jüri Arrak, Richard Artschwager, and Isidora Aschheim.

> Entering new AgentExecutor chain...
Action: list_tables_sql_db
Action Input: ""
Observation: artists, artworks
Thought: I should look at the schema of the artists table
Action: schema_sql_db
Action Input: "artists"
Observation: CREATE TABLE artists (
        artist_id integer NOT NULL,
        name character varying(200)",
        nationality character varying(50)",
        gender character varying(25)",
        birth_year integer,
        death_year integer,
        CONSTRAINT artists_pk PRIMARY KEY (artist_id)
)

/*
3 rows from artists table:
"artist_id" "name" "nationality" "gender" "birth_year" "death_year"
12 "Jüri Arrak" "Estonian" "Male" 1936 
19 "Richard Artschwager" "American" "Male" 1923 2013
22 "Isidora Aschheim" "Israeli" "Female"  
*/
Thought: I now know the final answer
Final Answer: The artists table contains the columns artist_id, name, nationality, gender, birth_year, and death_year. Examples of artists in the table include Jüri Arrak, Richard Artschwager, and Isidora Aschheim.

> Finished chain.

Conclusion

In this post, we learned to use LangChain’s SQL Database Chain and Agent with OpenAI’s text-davinci-003 LLM to perform Natural Language Queries (NLQ) of Amazon RDS for PostgreSQL. We also learned about the importance of few-shot promoting and retrieval-augmented generation (RAG) to improve our results. Stay tuned for the accompanying video for this post and an associated follow-up post where we will use Amazon SageMaker real-time inference LLM endpoints to perform the same activities on the database.


This blog represents my viewpoints and not those of my employer, Amazon Web Services (AWS). All product names, logos, and brands are the property of their respective owners.

Love to see this. Definitely recommend using Microsoft + Waii.ai's Archerfish Benchmarking Framework here, just published in CIDR 2024 - https://www.cidrdb.org/cidr2024/papers/p74-floratou.pdf

Like
Reply
Baskarasethupathi R

Technical Solutions Architect | Senior Full Stack Developer | Python | Gen AI | Fast API | Angular | React | Node js | Project Lead | Azure Devops | Senior Technical Consultant

4mo

Hi Team, Could someone explain how text streaming can be implemented when utilizing a database chain that connects an SQL database with Azure OpenAI?

Like
Reply
Abdul Rehman Baber

Next.Js| React| Typescript| langchain. Welcome to my domain of Modern Web Development with Gen AI

8mo

Using Agents. This is the perhaps the most awesome thing that came after chatGPT

Bal Heroor

Generative AI, Data-Driven Culture, Automation First

11mo

The efficacy of generated SQL in our experience was dicy. For complex real world use cases the queires turned out to generate wrong outcomes. I would rather train claude or hugging face with my dataset schema and relationships to generate better precise SQL

Gary Stafford

Area Principal Solutions Architect @AWS | AWS Analytics TFC | AWS GenAI Hero | 10x AWS Certified | Experienced Technology Leader, Principal Consultant, CTO, COO, President

11mo

To view or add a comment, sign in

Insights from the community

Others also viewed

Explore topics