Image Identification and Classification with Amazon Bedrock, OpenSearch, and OpenCLIP

Build a generative AI-powered vehicle damage assessment application on AWS using Vector Engine for Amazon OpenSearch Serverless, AI21 Labs’ Foundation Models, and OpenCLIP


In this post, we will build a hypothetical auction vehicle damage assessment and valuation application using generative AI technologies. Based on the vehicle identification number (VIN) and images showing vehicle damage, we will use machine learning and generative AI technologies to assess the vehicle’s value for auction and create a detailed vehicle description. Major technologies include OpenCLIP, an open-source implementation of OpenAI’s CLIP (Contrastive Language-Image Pre-training), Amazon’s recently announced Vector Engine for Amazon OpenSearch Serverless (Preview), and AI21 Labs’ Jurassic-2 (J2) Ultra Foundation Model, accessed through Amazon Bedrock. The application can be easily adapted for multiple other industries where evaluating the cost of damage is required, including insurance claims, shipping, and retail.

This post was inspired by a series of Generative AI hackathons I recently competed in with two of my close AWS peers, Jigna Gandhi and Chad Jodon. As a team, we developed this winning solution for the vehicle auction industry.

Video Demonstration

A brief video demonstration of the AI-powered Vehicle Damage Evaluator UI is available on YouTube.

Architecture

The architectural diagram below illustrates two separate workloads covered in the post. First, an Amazon SageMaker Studio Notebook generates the OpenCLIP vector embeddings of all vehicle images, then creates the new Amazon OpenSearch Serverless vector index, and finally indexes the OpenSearch documents that contain the vehicle vector embeddings and other image metadata into the new vector index. Second, a Streamlit application written in Python provides the user interface that performs the vehicle damage assessment and generates the vehicle auction description.

High-level architectural diagram of the post’s demonstration

Are you new to CLIP (Contrastive Language-Image Pre-training) and dense vector embedding for semantic image search? If so, I suggest reading my previous post, Using Amazon OpenSearch Serverless Vector Search and OpenAI CLIP Multimodal Model for Semantic Image Search.

Open Source Code

All of the source code used in this post’s demonstration, including the Amazon SageMaker Studio Notebook and Streamlit application, is open-sourced and available on GitHub.

Technologies

OpenCLIP

OpenCLIP is an open-source implementation of OpenAI’s CLIP. OpenAI defines CLIP as a neural network trained on various image/text pairs. It can be instructed in natural language to predict the most relevant text snippet, given an image, without directly optimizing for the task. CLIP contains both a text and an image encoder. It can create dense vector embeddings of images and text tokens. CLIP is optimized for visual classification tasks.

OpenAI CLIP diagram, courtesy of the CLIP GitHub README file

Vector Engine for Amazon OpenSearch Serverless

In late July 2023, AWS announced the Vector Engine for Amazon OpenSearch Serverless (Preview). OpenSearch Serverless now supports three primary collection types: time series, search, and vector search. Vector search collections enable semantic search over vector embeddings, simplifying vector data management and powering machine learning (ML)-augmented search experiences and generative AI applications, such as chatbots, personal assistants, and fraud detection.

AI21 Labs’ Jurassic-2 (J2) Ultra Foundation Model

AI21 Labs was founded in 2017 by AI pioneers and technology veterans from Stanford, CrowdX, and Mobileye. AI21 Labs, an AWS launch partner for Amazon Bedrock, builds state-of-the-art language models, including their current Jurassic-2 (J2)-series next-generation foundation models: Ultra, Mid, and Light. According to AI21 Labs, “Jurassic-2 Ultra is our largest and most powerful foundation model for complex language generation tasks, producing the highest quality for any language comprehension or generation task. According to our internal evaluations from HELM, the leading benchmark for language models, Jurassic-2 Ultra scores a win-rate of 86.8%, solidifying it as a leader in the LLM space. This is also the most costly language model with the highest latency but most capable of carrying out complex generation and comprehension tasks.” J2 Ultra excels at text summarization, including “the generation of coherent, complete, and engaging product descriptions, while minimizing hallucinations.” The ideal model choice for this solution.

Amazon Bedrock

AWS announced the general availability of Amazon Bedrock in late September 2023. Previously, Bedrock was available to customers in preview for several months. Amazon Bedrock is a fully managed serverless service that makes foundation models (FMs) from leading AI companies, such as Cohere, Anthropic, Stability AI, Meta, and AI21 Labs, available through an application programming interface (API). AI21 Labs’ J2 Ultra foundation model was accessed through Amazon Bedrock for this post using the Boto3 SDK.

Using AI21 Labs’ J2 Ultra FM through Amazon Bedrock’s Text playground

Vehicle Image Datasets

We will use two vehicle image datasets, one for damaged vehicles and one for undamaged vehicles. Several vehicle image datasets contain both damaged and undamaged vehicles. For the damaged vehicles, we will use a dataset from Roboflow, the Car-Damage-Type-Detection-End-Game Computer Vision Project. You will need to download and unzip the dataset. It is divided into train, test, and validation splits and is further organized by the type of damage, such as crack, dent, scratch, and flat tire. The dataset includes multiple years, makes, and models of vehicles.

A portion of the damaged vehicles image dataset hierarchy

Example of a search for ‘dented bumpers’ from the damaged vehicle image dataset

For the undamaged vehicles, I prebuilt a custom image dataset from several used automotive resale websites, such as Autotrader. For brevity, we will focus on late-model 3-series BMW automobiles in this post.

A portion of the undamaged vehicle image dataset hierarchy

For best results, the undamaged vehicle images should be cropped similarly to the damaged vehicle images.

Example of search results from the undamaged vehicle image dataset

The undamaged vehicle image dataset is available for download from Kaggle: Undamaged Vehicle Image Dataset. There are many other vehicle datasets available if you don’t want to build your own.

Undamaged Vehicle Image Dataset on Kaggle

Getting Started

For this post, I will use Amazon SageMaker Studio to run code developed in a Jupyter Notebook. Amazon SageMaker Studio is AWS’s fully integrated development environment (IDE) for machine learning. The post’s Jupyter Notebook, located in the GitHub repository, has three primary functions: 1) create the OpenCLIP dense vector embeddings of all vehicle images, 2) create the new Amazon OpenSearch Serverless vector index, and 3) index OpenSearch documents that contain dense vector embedding and associated image metadata into the new vector index.

Create Notebook Environment

OpenCLIP relies on PyTorch. Although you can run the notebook on a CPU-based instance, since we are using Amazon SageMaker Studio, I suggest the PyTorch 2.0.0 Python 3.10 GPU Optimized SageMaker Notebook image, the Python 3 kernel, and an ml.g4dn.xlarge instance type. According to the documentation, the AWS Deep Learning (DL) Containers for PyTorch 2.0.0 with CUDA 11.8 include containers for training on GPU. For more information, see the Release Notes for Deep Learning Containers. Don’t forget to shut down the kernel session and the running notebook when finished to minimize AWS costs.

Setting up Amazon SageMaker Studio notebook environment

Create Vector Embeddings

First, install the required Python packages for the demonstration, including OpenCLIP and the Python client for OpenSearch, opensearch-py.
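A minimal install cell for the notebook might look like the following (the package list mirrors what the notebook needs; exact versions are not pinned here):

```python
# Install OpenCLIP, the OpenSearch Python client, and supporting packages
%pip install --upgrade --quiet open_clip_torch opensearch-py boto3 pandas
```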

Then, import the required packages for this part of the demonstration.
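For reference, the imports used in this part of the notebook might look something like this:

```python
import json

import boto3
import open_clip
import torch
from opensearchpy import AWSV4SignerAuth, OpenSearch, RequestsHttpConnection
from PIL import Image
```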

OpenCLIP has access to over eighty pre-trained ViT, ConvNeXt, EVA, ResNet, and other CLIP models. Models can be listed using the open_clip.list_pretrained() command. Detailed OpenCLIP model cards can be found on Hugging Face.

For image embeddings, we will use the ViT-L/14 — LAION-2B CLIP model, a CLIP ViT-L/14 model trained on the 2-billion-sample English subset of LAION-5B using OpenCLIP. In the model name, “ViT” is short for Vision Transformer, “L” stands for Large, and “14” indicates a 14 x 14 input patch size. Thus, a 224 x 224-pixel image (the input size for this CLIP model) divided into 14 x 14-pixel patches yields 256 fixed-size patches across 3 channels (RGB). We also use ViT-L-14 for the text tokenization part of CLIP. The model produces 768-dimension embeddings (see Table 18: Common CLIP hyperparameters). See the paper, An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, for a deeper understanding of how these ViT models work.
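A sketch of loading this model with OpenCLIP follows; the pretrained tag is the LAION-2B checkpoint name reported by open_clip.list_pretrained() (confirm the exact tag against the OpenCLIP model cards):

```python
# Load the ViT-L/14 CLIP model pretrained on the LAION-2B English subset
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-L-14", pretrained="laion2b_s32b_b82k"
)
tokenizer = open_clip.get_tokenizer("ViT-L-14")

# Use the GPU if one is available (e.g., on the ml.g4dn.xlarge instance)
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device).eval()
```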

To create the image embeddings, we use the following function, which calls model.encode_image. It accepts the image path as a string and returns the embedding:
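A minimal version of that function might look like this (the function name and the L2-normalization step are my choices, not prescribed by the notebook):

```python
def create_image_embedding(image_path: str) -> list[float]:
    """Return a 768-dimension CLIP embedding for the image at image_path."""
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    with torch.no_grad():
        image_features = model.encode_image(image)
        # Normalize so cosine-similarity comparisons behave consistently
        image_features /= image_features.norm(dim=-1, keepdim=True)
    return image_features.squeeze(0).cpu().tolist()
```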

Although not a feature of this application, in addition to creating vector embeddings of images for semantic similarity searches, we can also create text embeddings with OpenCLIP by calling model.encode_text:
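A comparable text-embedding helper, shown here only for completeness, might look like this:

```python
def create_text_embedding(text: str) -> list[float]:
    """Return a 768-dimension CLIP embedding for a text string."""
    tokens = tokenizer([text]).to(device)
    with torch.no_grad():
        text_features = model.encode_text(tokens)
        text_features /= text_features.norm(dim=-1, keepdim=True)
    return text_features.squeeze(0).cpu().tolist()
```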

We must assign a severity level to each image to later calculate vehicle devaluation due to damage using a semantic similarity search. Since manually labeling all 4,000 images would take too long for the demo, we will randomly assign one of three severity levels to each vehicle photo:
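One simple way to make that random assignment, purely as a demo shortcut, is shown below (undamaged images would receive a severity of “none”):

```python
import random

# Demo shortcut: in production, each image would be hand-labeled
SEVERITY_LEVELS = ["minor", "moderate", "severe"]

def assign_random_severity() -> str:
    return random.choice(SEVERITY_LEVELS)
```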

Given the quantity and size of the vector embeddings and corresponding OpenSearch documents, I persisted the documents to disk in CSV files, versus holding them in memory, and then indexed them separately. This step is not strictly necessary, but it made it quicker to recreate the OpenSearch vector index during my testing without recomputing the embeddings each time.

Each OpenSearch document will look similar to the following sample (abridged vector embedding of 768 dimensions):
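An illustrative document, expressed as a Python dictionary, is shown below; the field values are made up for the example, and the 768-value embedding is abridged:

```python
document = {
    "name": "bmw-3-series-dent-042.jpg",
    "file_path": "damaged_vehicles/dent/bmw-3-series-dent-042.jpg",
    "description": "BMW 3 Series with a dented rear bumper",
    "severity": "moderate",
    "image_vector": [0.0213, -0.0467, 0.0158, ..., 0.0092],  # 768 dimensions in total
}
```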

Create a Vector Index

In my previous post, Using Amazon OpenSearch Serverless Vector Search and OpenAI CLIP Multimodal Model for Semantic Image Search, I detailed creating an Amazon OpenSearch domain and the associated AWS IAM Roles and required OpenSearch Policies. Review the post for more details. We will assume those resources already exist and start by creating the vector search collection and corresponding vector index. First, instantiate an OpenSearch Serverless client to connect to Amazon OpenSearch:
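A minimal sketch, assuming the Boto3 opensearchserverless client and a placeholder region:

```python
import boto3

# Boto3 client for managing OpenSearch Serverless collections
aoss_client = boto3.client("opensearchserverless", region_name="us-east-1")
```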

Then, create a new OpenSearch collection of type: VECTORSEARCH:
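Using that client, collection creation might look like this (the collection name is illustrative, and the required encryption, network, and data access policies are assumed to already exist, as covered in the previous post):

```python
response = aoss_client.create_collection(
    name="vehicle-image-search",
    type="VECTORSEARCH",
    description="Vector search collection for vehicle image embeddings",
)
print(response["createCollectionDetail"]["status"])  # e.g., CREATING
```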

Once the collection is created, re-establish the OpenSearch Serverless connection by providing the new vector search collection’s OpenSearch endpoint, which can be found in the OpenSearch console or by using the AWS CLI or Boto3 SDK.

Make sure to update the host with the OpenSearch endpoint and region based on your environment. Do not include https:// in the host value.
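A sketch of re-establishing the connection with opensearch-py and SigV4 authentication for the aoss service (the endpoint below is a placeholder):

```python
host = "abc123xyz.us-east-1.aoss.amazonaws.com"  # placeholder: your collection endpoint, without https://
region = "us-east-1"

credentials = boto3.Session().get_credentials()
auth = AWSV4SignerAuth(credentials, region, "aoss")

os_client = OpenSearch(
    hosts=[{"host": host, "port": 443}],
    http_auth=auth,
    use_ssl=True,
    verify_certs=True,
    connection_class=RequestsHttpConnection,
    pool_maxsize=20,
)
```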

Within the new vector search collection, we will next create a new vector index using indices.create(). Each document in the index will contain the following five properties: name, file_path, description, severity, and image_vector. Aside from the image_vector, the four remaining property values are informational. We will see how we can leverage these property values later in the post.

The dense vector embeddings, created from the images by the OpenCLIP model, will contain 768 dimensions. To query these vectors, the index is configured to use the Non-Metric Space Library (nmslib), with the Hierarchical Navigable Small Worlds algorithm (HNSW) and cosine similarity (cosinesimil). I suggest reviewing OpenSearch’s k-NN index documentation and the AWS blog post, Choose the k-NN algorithm for your billion-scale use case with OpenSearch, for more information on the search options available.
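Putting the mapping together, index creation might look like the following sketch (the index name is illustrative):

```python
index_name = "vehicle-image-index"

index_body = {
    "settings": {"index.knn": True},
    "mappings": {
        "properties": {
            "name": {"type": "text"},
            "file_path": {"type": "text"},
            "description": {"type": "text"},
            "severity": {"type": "keyword"},
            "image_vector": {
                "type": "knn_vector",
                "dimension": 768,
                "method": {
                    "name": "hnsw",
                    "space_type": "cosinesimil",
                    "engine": "nmslib",
                },
            },
        }
    },
}

os_client.indices.create(index=index_name, body=index_body)
```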

To confirm its creation, we can retrieve the description of the new vector index we just created using indices.get():
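That check is a one-liner against the client created earlier:

```python
print(json.dumps(os_client.indices.get(index=index_name), indent=2, default=str))
```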

Example output:

Indexing Documents

With the vector index created, we can now index both the damaged and undamaged vehicle documents into Amazon OpenSearch using index():
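A sketch of indexing from the persisted CSV files (the file names and column layout are assumptions based on the earlier persistence step):

```python
import csv
import json

def index_documents(csv_path: str) -> None:
    """Read previously persisted documents from CSV and index them one at a time."""
    with open(csv_path, newline="") as csv_file:
        for row in csv.DictReader(csv_file):
            document = {
                "name": row["name"],
                "file_path": row["file_path"],
                "description": row["description"],
                "severity": row["severity"],
                "image_vector": json.loads(row["image_vector"]),
            }
            os_client.index(index=index_name, body=document)

index_documents("damaged_vehicle_documents.csv")
index_documents("undamaged_vehicle_documents.csv")
```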

Using the bulk operation (bulk()) to index the documents would be much quicker. Unfortunately, I ran into a few limitations with the vector index as compared to the search index when assigning document IDs (a “Document ID is not supported in create/index operation request” error).

Test the Vector Index

To test that the index is working, you can create a vector embedding of a sample image, perform a k-NN query of the vector index, return the top n results, and display the resulting images:
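A sketch of that test, reusing the embedding helper and OpenSearch client from earlier (the sample image path is a placeholder):

```python
query_embedding = create_image_embedding("samples/test_bmw_3_series.jpg")  # placeholder path

knn_query = {
    "size": 5,
    "query": {
        "knn": {
            "image_vector": {
                "vector": query_embedding,
                "k": 5,
            }
        }
    },
    "_source": ["name", "file_path", "description", "severity"],
}

results = os_client.search(index=index_name, body=knn_query)

for hit in results["hits"]["hits"]:
    print(f'{hit["_score"]:.4f}  {hit["_source"]["file_path"]}')
```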

Using the output, we can qualitatively judge whether the results closely match the test image:

Example of search results from the undamaged vehicle image query

Using Dimensionality Reduction to Evaluate Search Results

There are several ways to quantitatively evaluate the search results and compare multiple search result sets. One technique we can use is t-distributed Stochastic Neighbor Embedding (t-SNE), which, according to Wikipedia, is a statistical method for visualizing high-dimensional data by giving each data point a location in a two- or three-dimensional map. Using t-SNE, we can reduce the dimensionality of the 768-dimension CLIP embeddings returned from multiple semantic similarity searches down to two dimensions and plot the results as a 2D scatter plot. Below, we see three distinct clusters of vectors representing the top 10 query results for three different vehicle images. We can observe the relative proximity of the results within each OpenSearch result set (cluster) and between different result sets. The proximity reflects the similarity between the images within the result set.
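A minimal sketch of the t-SNE projection, assuming the embeddings from the three result sets have been stacked into a single NumPy array with a parallel list of query labels (the helper name is mine):

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.manifold import TSNE

def plot_tsne(embeddings: np.ndarray, labels: list[str]) -> None:
    """Project 768-dimension embeddings to 2D and plot one cluster per query image."""
    reduced = TSNE(n_components=2, perplexity=5, random_state=42).fit_transform(embeddings)
    for label in sorted(set(labels)):
        points = reduced[np.array(labels) == label]
        plt.scatter(points[:, 0], points[:, 1], label=label)
    plt.legend()
    plt.title("t-SNE projection of CLIP embeddings by query image")
    plt.show()
```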

Comparing three different result sets using t-SNE

Another method is using Principal Component Analysis (PCA), a popular statistical technique for reducing the dimensionality of a dataset. Using PCA, we reduce the dimensionality of the 768-dimension CLIP embeddings, returned from multiple semantic similarity searches, down to three dimensions and plot the results as a 3D scatter plot. You can also view the chart below as an animated 3D video on YouTube.
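The PCA version is nearly identical, reducing to three components and plotting on a 3D axis (reusing the imports above):

```python
from sklearn.decomposition import PCA

def plot_pca_3d(embeddings: np.ndarray, labels: list[str]) -> None:
    """Project 768-dimension embeddings to 3D with PCA and plot one cluster per query image."""
    reduced = PCA(n_components=3).fit_transform(embeddings)
    ax = plt.figure().add_subplot(projection="3d")
    for label in sorted(set(labels)):
        points = reduced[np.array(labels) == label]
        ax.scatter(points[:, 0], points[:, 1], points[:, 2], label=label)
    ax.legend()
    plt.show()
```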

Comparing three different result sets using PCA

AI-powered Vehicle Damage Evaluator Application

Like the Jupyter Notebook, running the application on a CPU-based instance with OpenCLIP is possible but can be very slow, even on an M1-powered MacBook Pro. A GPU-based instance is recommended. To run the application, you must install several Python packages, including the latest boto3 package, which incorporates the new bedrock-runtime service for the invoke_model method. A requirements.txt file is included in the GitHub repository. I prefer to create a virtualenv for my Python projects:

The AI-powered Vehicle Damage Evaluator application is built with Streamlit. To start the Streamlit application using HTTP on the default port 8501, run the following command from your terminal:

It is also possible to run the Streamlit application securely using HTTPS on port 443 if you have a registered internet domain and corresponding SSL/TLS certificate:

Vehicle Information

The first section of the application, Vehicle Information, accepts the vehicle identification number (VIN), the exterior vehicle color, and the mileage. The initial auction estimate simulates a call to a third-party API to obtain the estimated vehicle value. Since these are often paid services, for this post we will simulate a response from Edmunds or a similar external data provider, such as VinAudit or CarsXE.

Vehicle Information section

When you input the VIN, an external API call is made to the National Highway Traffic Safety Administration (NHTSA), specifically to the NHTSA’s Product Information Catalog and Vehicle Listing (vPIC) API.

NHTSA vehicle data
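A sketch of that lookup using vPIC’s free DecodeVinValues endpoint (error handling is kept minimal, and the VIN shown is a placeholder):

```python
import requests

def decode_vin(vin: str) -> dict:
    """Decode a VIN with the NHTSA vPIC API and return the flat results dictionary."""
    url = f"https://vpic.nhtsa.dot.gov/api/vehicles/DecodeVinValues/{vin}?format=json"
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.json()["Results"][0]

vehicle = decode_vin("1ABCD23EFGH456789")  # placeholder VIN
print(vehicle.get("ModelYear"), vehicle.get("Make"), vehicle.get("Model"))
```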

Below is the abridged output of up to 150 informational fields returned in the JSON payload. Note the vPIC data does not include vehicle color or current mileage.

Be aware that this is a free service that occasionally goes down for maintenance; do not count on it in production. For production workloads, use a reliable third-party API service.

Uploading Damaged Images

The next section of the application, the Image Uploader, accepts three images of vehicle damage. Three is an arbitrary number; it could be more or fewer. The application should be able to differentiate between damaged and undamaged vehicle images. The images should be in PNG or JPEG format.

Once uploaded, the application will attempt to enhance the images. The theory is that the photos could be overexposed, underexposed, or out of focus. Ideally, the application will automatically enhance the image to optimize the vector embedding we will create next. The uploaded images are saved locally so they can potentially be added to the image database, thus improving the query results.

Image Uploader section
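The post does not prescribe a specific enhancement technique; one simple, conservative approach using Pillow might look like this:

```python
from PIL import Image, ImageEnhance

def enhance_image(image_path: str, output_path: str) -> str:
    """Apply mild contrast, sharpness, and brightness adjustments to an uploaded photo."""
    image = Image.open(image_path).convert("RGB")
    image = ImageEnhance.Contrast(image).enhance(1.2)
    image = ImageEnhance.Sharpness(image).enhance(1.5)
    image = ImageEnhance.Brightness(image).enhance(1.05)
    image.save(output_path)
    return output_path
```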

Individual Image Analysis

Next, dense 768-dimension vector embeddings are created for each uploaded and enhanced image using the same OpenCLIP pre-trained model previously used to create the OpenSearch index. The first time the application is started, there will be a delay while the model is downloaded to the server where the application is hosted.

Then, a query is performed against the vector index using each vector embedding. We use Amazon OpenSearch’s k-nearest neighbors (k-NN) search, which relies on a hierarchical proximity graph for fast approximate k-NN queries. The top ten results, ordered by relevance score (_score), are returned for each of the three images.

Individual image search results

Using the individual image search results and the accompanying image metadata, the application performs a damage assessment and develops a devaluation calculation based on the severity of the damage. Note that the severity level associated with each image in the demonstration was assigned randomly to the damaged image dataset. Ideally, each image would be hand-labeled, using services such as Amazon SageMaker Ground Truth. Each level of severity has a recommended level of devaluation: none (0%), minor (-10%), moderate (-25%), and severe (-40%). You may choose to vary these percentages based on the vehicle’s year, make, and model in Production.

Individual image damage analysis
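A sketch of how those percentages might be applied; summing the per-image devaluations against the initial estimate is my assumption, consistent with the combined total described in the final analysis below:

```python
DEVALUATION_BY_SEVERITY = {
    "none": 0.00,
    "minor": 0.10,
    "moderate": 0.25,
    "severe": 0.40,
}

def damage_adjusted_value(initial_estimate: float, severities: list[str]) -> float:
    """Subtract the recommended devaluation for each image's damage severity."""
    total_devaluation = sum(DEVALUATION_BY_SEVERITY[s] for s in severities)
    return max(initial_estimate * (1.0 - total_devaluation), 0.0)

# Example: a $20,000 initial estimate with one moderate and two minor damage areas
print(damage_adjusted_value(20_000, ["moderate", "minor", "minor"]))  # 11000.0
```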

Final Analysis

Combining the three image analysis results, the application produces a final analysis, which includes data fields extracted and combined from the vehicle data obtained through the third-party API call. The analysis also includes a hypothetical final vehicle damage assessment, a combined total of the individual evaluations of the three images.

Final Analysis section

Vehicle Auction Description

Not everyone evaluating an auction vehicle is proficient at writing detailed vehicle descriptions, and many organizations have well-defined style guides for producing content such as product descriptions. With the final analysis, the application can write the product description. To generate the vehicle descriptions, the application leverages the capabilities of Amazon Bedrock, specifically AI21 Labs’ J2 Ultra Foundation Model (FM). The resulting description is fully editable by the end user.

AI-powered vehicle auction description

Foundation models, such as the J2 Ultra model, are commonly combined with carefully crafted prompt templates and techniques like in-context learning to generate content in the style of human-like personas (e.g., famous personalities, professionals, or movie characters). In this case, in addition to a detailed vehicle description, the application can write a description in the style of an auctioneer.

The application also includes a “Degree of creativity” slider, which is tied directly to the model’s temperature. Raising this value affects the accuracy of the description: the results may sound more creative, but at the risk of including vehicle details that are not factual, such as the interior color.

AI-powered vehicle auction description written as an auctioneer
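A sketch of the underlying Bedrock call for these generations, using the J2 Ultra model ID and the AI21 request format (the prompt contents and parameter values are illustrative):

```python
import json

import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

def generate_description(prompt: str, temperature: float = 0.3) -> str:
    """Invoke AI21 Labs' Jurassic-2 Ultra model through Amazon Bedrock."""
    body = json.dumps({
        "prompt": prompt,
        "maxTokens": 500,
        "temperature": temperature,  # driven by the "Degree of creativity" slider
        "topP": 1.0,
    })
    response = bedrock_runtime.invoke_model(
        modelId="ai21.j2-ultra-v1",
        contentType="application/json",
        accept="application/json",
        body=body,
    )
    response_body = json.loads(response["body"].read())
    return response_body["completions"][0]["data"]["text"]
```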

Lastly, the application can translate the vehicle description into Spanish. Although we could have easily leveraged Amazon Translate for this, we used the J2 Ultra model instead. While Amazon Translate supports translation between 75 languages and is a great choice, AI21 Labs’ J2 models, in addition to English, “demonstrate exceptional proficiency in multiple languages, including Spanish, French, German, Portuguese, Italian, and Dutch,” according to AI21 Labs.

AI-powered vehicle auction description written in Spanish

Text-to-Speech Capabilities

The application also converts the generated vehicle description from text to speech. The audio can be played within the application and downloaded for further use. Here is a sample. There are several text-to-speech options that can be easily adapted for use within the application, including Google Text-to-Speech (gTTS), Amazon Polly, Google Cloud Text-to-Speech AI, and Speechify. Below is audio output from the application generated with Amazon Polly.
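A sketch of the Amazon Polly conversion (the voice choice and output file name are illustrative):

```python
import boto3

polly_client = boto3.client("polly", region_name="us-east-1")

def synthesize_description(description: str, output_path: str = "vehicle_description.mp3") -> str:
    """Convert the generated vehicle description to an MP3 file with Amazon Polly."""
    response = polly_client.synthesize_speech(
        Text=description,
        OutputFormat="mp3",
        VoiceId="Joanna",
        Engine="neural",
    )
    with open(output_path, "wb") as audio_file:
        audio_file.write(response["AudioStream"].read())
    return output_path
```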

Persist the Vehicle Description

The final feature of the AI-powered Vehicle Damage Evaluator application is the ability to save the vehicle description to a persistent data store, such as Amazon DynamoDB. Below is an abridged example of the JSON document saved by the application based on the VIN.
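A sketch of persisting the document to DynamoDB, keyed by the VIN (the table and attribute names are assumptions):

```python
import boto3

dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
table = dynamodb.Table("VehicleDescriptions")  # table name is illustrative

def save_vehicle_description(vin: str, description: str, final_value: float) -> None:
    """Persist the generated description and adjusted value, keyed by VIN."""
    table.put_item(
        Item={
            "vin": vin,
            "description": description,
            # DynamoDB does not accept Python floats; store the value as a string
            "estimated_auction_value": f"{final_value:.2f}",
        }
    )
```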

Conclusion

In this post, we learned how to leverage generative AI technologies to build a fully functional application. We combined OpenCLIP, an open-source implementation of OpenAI’s CLIP, with the Vector Engine for Amazon OpenSearch Serverless to perform semantic image similarity searches using dense vector embeddings. Finally, we used AI21 Labs’ J2 foundation models, accessed through Amazon Bedrock, to create precise summarizations and high-quality product descriptions from multiple data sources.

Image created with Midjourney

This blog represents my viewpoints and not those of my employer, Amazon Web Services (AWS). All product names, logos, and brands are the property of their respective owners.

