Semantic Search on AWS Docs or Custom Documents

This sample project demonstrates how to set up AWS infrastructure to perform semantic search and question answering on documents using transformer machine learning models such as BERT, RoBERTa, or GPT (via the Haystack open source framework).

As an example, users can type questions about AWS services and find answers from the AWS documentation or custom local documents.

The deployed solution supports two answering styles:

  • extractive question answering finds the semantically closest documents to the question and highlights the most likely answer(s) within these documents.
  • generative question answering, also referred to as long form question answering (LFQA), finds the semantically closest documents to the question and generates a free-form answer from them.

Please note that this project is intended for demo purposes; see the disclaimer below.
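
For illustration only, a minimal extractive question answering pipeline built with Haystack could look like the following sketch. The host, index name, reader model, and example query are assumptions for this sketch, not necessarily what the deployed solution uses.

from haystack.document_stores import OpenSearchDocumentStore
from haystack.nodes import EmbeddingRetriever, FARMReader
from haystack.pipelines import ExtractiveQAPipeline

# Connect to an existing OpenSearch index (hypothetical host and index name)
document_store = OpenSearchDocumentStore(host="localhost", index="aws-docs")

# Dense retriever that finds the semantically closest documents to the question
retriever = EmbeddingRetriever(
    document_store=document_store,
    model_format="sentence_transformers",
    embedding_model="all-mpnet-base-v2",
)

# Extractive reader that highlights the most likely answer spans in the retrieved documents
reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2")

pipeline = ExtractiveQAPipeline(reader=reader, retriever=retriever)
result = pipeline.run(
    query="How do I stop an EC2 instance?",
    params={"Retriever": {"top_k": 5}, "Reader": {"top_k": 3}},
)
for answer in result["answers"]:
    print(answer.answer, "|", answer.context)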

Architecture

The main components of this project are:

  • an Amazon OpenSearch Service domain that stores the documents and their embeddings
  • two Amazon ECS services: the Streamlit-based frontend UI and the Haystack REST search API
  • an ECS ingestion task that downloads, converts, and indexes the documents into the OpenSearch domain

How to deploy the solution

Deploy with AWS Cloud9

Follow our step-by-step deployment instructions to deploy the semantic search application if you are new to AWS, Terraform, or semantic search, or if you prefer detailed, guided instructions.

For more general deployment instructions follow the sections below.

General Deployment Instructions

The infrastructure folder (the application backend) contains a Terraform project that deploys an OpenSearch domain and two ECS services:

  • frontend: Streamlit-based UI built by Haystack (repo)
  • search API: REST API built by Haystack

The main steps to deploy the solution are:

  • Deploy the Terraform stack
  • Optional: Ingest the AWS documentation

Pre-requisites

To follow the steps below you need an AWS account as well as the AWS CLI, Terraform, Docker, and git installed and configured on the machine you deploy from.

Deploy the application infrastructure Terraform stack

  • git clone this repository

  • Configure the infrastructure region, subnets, and availability zones in the infrastructure/terraform.tfvars file as needed

  • Initialize
    In this example the Terraform state is stored remotely and managed through an S3 backend with a DynamoDB table to acquire the state lock. This allows collaboration on the same Terraform infrastructure from different machines. (If you prefer local state instead, remove the terraform { backend "s3" { ... } } block from the infrastructure/tf-backend.tf file and run terraform init directly.)

    • Create an S3 bucket and a DynamoDB table to store the Terraform state backend in a region of your choice.
      STATE_REGION=<AWS region>
      S3_BUCKET=<YOUR-BUCKET-NAME>
      aws s3 mb s3://$S3_BUCKET --region $STATE_REGION
      SYNC_TABLE=<YOUR-TABLE-NAME>
      aws dynamodb create-table --table-name $SYNC_TABLE --attribute-definitions AttributeName=LockID,AttributeType=S --key-schema AttributeName=LockID,KeyType=HASH --billing-mode PAY_PER_REQUEST --region $STATE_REGION
    • Change to the directory containing the application infrastructure's infrastructure/main.tf file
      cd infrastructure
    • Initialize Terraform with the S3 remote state backend by running
      terraform init \
      -backend-config="bucket=$S3_BUCKET" \
      -backend-config="region=$STATE_REGION" \
      -backend-config="dynamodb_table=$SYNC_TABLE"
  • Deploy
    Run terraform apply and approve the changes by typing yes.

    terraform apply

    Please note: deployment can take a long time to push the container images, depending on the upload bandwidth of your machine.
    For a faster deployment you can run the Terraform deployment from a development environment hosted in the same AWS region, for example by using the AWS Cloud9 IDE.

  • Use
    Once the deployment is complete, browse to the output URL (loadbalancer_url) from the Terraform output to see the application.
    However, searches won't return any results until you ingest documents.

  • Clean up
    To remove all resources created for the application infrastructure, run

    terraform destroy

    (If you used the ingestion Terraform stack below, make sure to destroy the ingestion resources first to avoid conflicts.)

Ingest the AWS documentation

This second Terraform stack builds, pushes, and runs a Docker container as an ECS task.
The ingestion container downloads either a single awsdocs repository (e.g. amazon-ec2-user-guide) or all 256 awsdocs repos (full) and converts the .md files into .txt using pandoc.
The .txt documents are then ingested into the application's OpenSearch cluster in the required Haystack format and become available for search.
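
As an illustration of the conversion step, the sketch below shows how .md files could be turned into plain .txt files with pandoc; the directory name is a placeholder, and the actual container uses its own script.

import subprocess
from pathlib import Path

# Hypothetical directory containing the cloned awsdocs repositories
docs_dir = Path("awsdocs")

# Convert every Markdown file to a plain-text file next to it using pandoc
for md_file in docs_dir.rglob("*.md"):
    txt_file = md_file.with_suffix(".txt")
    subprocess.run(
        ["pandoc", str(md_file), "-f", "markdown", "-t", "plain", "-o", str(txt_file)],
        check=True,
    )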

  • Change from the infrastructure directory to the directory containing the ingestion stack's ingestion/main.tf file
     cd ../ingestion
  • Initialize Terraform
    (here we are using local state instead of a remote S3 backend for simplicity)
    terraform init
  • Run the ingestion as a Terraform deployment.
    The location of the S3 remote state from the previous infrastructure deployment is needed here as input variables.
    It is used as a data source to read the infrastructure's output variables, such as the OpenSearch endpoint or the private subnets. You can set the S3 bucket and its region either in the infrastructure/terraform.tfvars file or by passing the input variables via
    terraform apply \
    -var="infra_region=$STATE_REGION" \
    -var="infra_tf_state_s3_bucket=$S3_BUCKET"
    Please note: deployment can take a long time to push the container image, depending on the upload bandwidth of your machine. For a faster deployment you can build and push the container in AWS, for example by using the AWS Cloud9 IDE.
  • Once the previous step finishes, the ECS ingestion task is started. You can check its progress in the AWS console, for example in Amazon CloudWatch under the log group name semantic-search by checking the ingestion-job logs (see the log-check sketch after this list). After the task has finished successfully, the ingested documents are searchable via the application.
  • After running the ingestion job, you can remove the created ingestion resources (e.g. the ECR repository or the task definition) by running
    terraform destroy \
    -var="infra_region=$STATE_REGION" \
    -var="infra_tf_state_s3_bucket=$S3_BUCKET"

Ingesting your own documents

Take a look at ingestion/awsdocs/ingest.py to see how to adapt the ingestion script for your own documents. In brief, you can ingest local or downloaded files as follows:

from haystack.document_stores import OpenSearchDocumentStore
from haystack.nodes import EmbeddingRetriever
from haystack.utils import convert_files_to_docs

# Create a wrapper for the existing OpenSearch document store
document_store = OpenSearchDocumentStore(...)

# Convert local files into Haystack documents
dicts_aws = convert_files_to_docs(dir_path=..., ...)

# Write the documents to the OpenSearch document store
document_store.write_documents(dicts_aws, index=...)

# Compute and update the embeddings for each document with a transformer ML model.
# An embedding is the vector representation that is learned by the transformer and that
# allows us to capture and compare the semantic meaning of documents via this
# vector representation.
# Be sure to use the same model that you want to use later in the search pipeline.
retriever = EmbeddingRetriever(
    document_store=document_store,
    model_format="sentence_transformers",
    embedding_model="all-mpnet-base-v2",
)
document_store.update_embeddings(retriever)
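
After the embeddings are updated, you can run a quick retrieval check with the same retriever defined above; the query below is just an illustrative example.

# Retrieve the documents that are semantically closest to a test question
docs = retriever.retrieve(query="How do I stop an EC2 instance?", top_k=3)
for doc in docs:
    print(doc.meta.get("name"), "|", doc.content[:200])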

Security

See CONTRIBUTING for more information.

Contributing

If you want to contribute to Haystack, check out their GitHub repository.

License

This library is licensed under the MIT-0 License. See the LICENSE file.

Disclaimer

This solution is intended to demonstrate the functionality of using machine learning models for semantic search and question answering. It is not intended for production deployment as is.

For best practices on adapting this solution for production use cases, please follow the AWS Well-Architected guidance.
