Profile Amazon RDS for PostgreSQL or Amazon Aurora PostgreSQL PL/pgSQL code using plprofiler

July 2023: This post was reviewed for accuracy.

PostgreSQL is considered to be the primary open-source database choice when migrating from commercial databases such as Oracle. AWS provides two managed PostgreSQL options: Amazon Relational Database Service (Amazon RDS) for PostgreSQL and Amazon Aurora PostgreSQL-Compatible Edition. Identifying slow queries and tuning for better performance is an important task for developers and database administrators managing Amazon RDS and Aurora PostgreSQL environments. To identify slow queries, multiple options are available. You can use the pgBadger tool to find the queries from the PostgreSQL log files by generating a report, or you can use Amazon RDS Performance Insights, which provides an easy-to-understand dashboard for detecting performance problems. For more information on finding slow queries, see Optimizing and tuning queries in Amazon RDS PostgreSQL based on native and external tools.

If database slowness is caused by SQL queries, you can generate query plans using the EXPLAIN command and optimize the SQLs accordingly. However, if the slowness is due to PL/pgSQL functions or procedures, the EXPLAIN plan doesn’t help much because it can’t generate plans for the SQLs inside the function or procedure. The SQL queries inside are optimized just like normal SQL queries, however separately and one by one. Or you can use the additional module auto_explain to get more details. Statements inside PL/pgSQL functions are considered nested statements, so you need to turn on the auto_explain.log_nested_statements parameter. For more details, refer to How can I log execution plans of queries for Amazon RDS PostgreSQL or Aurora PostgreSQL to tune query performance.

Generating a plan for SQLs (manually or through auto_explain) is a time-consuming process. To overcome this, you can use the plprofiler extension, which creates performance profiles of UDFs (user-defined PL/pgSQL functions) and stored procedures. You can use this extension to figure out the issue in the PL/pgSQL functions and procedures along with their runtimes. As part of profiling, we can generate HTML reports that contain details of the queries along with their respective runtimes.

The plprofiler tool is available in Amazon RDS for PostgreSQL and Aurora PostgreSQL (supported from version 11.6 and later).

In this post, we discuss how to install and configure the plprofiler extension. We also generate a report for a sample function to troubleshoot and find the queries causing the slowdown.

Solution overview

To use plprofiler, you need to load the extension on the backend (create the extension in the database) and install the plprofiler command line utility to generate reports.

Complete the following steps as part of installation:

Configure and enable plprofiler on Aurora PostgreSQL or Amazon RDS for PostgreSQL.
Install plprofiler on Amazon Elastic Compute Cloud (Amazon EC2) so that the command line utility can be invoked.

Prerequisites

To install and configure plprofiler, you need to complete the following steps as prerequisites:

Create an Aurora PostgreSQL cluster or RDS for PostgreSQL instance if you don’t already have one.
Create an EC2 instance (Ubuntu 16 or Amazon Linux) to install plprofiler and the PostgreSQL client to access the Aurora or RDS for PostgreSQL instance.
Install the PostgreSQL client.

To install PostgreSQL on Ubuntu, complete the following steps:

Create the file repository configuration

sudo sh -c 'echo "deb http://apt.postgresql.org/pub/repos/apt $(lsb_release -cs)-pgdg main" > /etc/apt/sources.list.d/pgdg.list'

Import the repository signing key

wget --quiet -O - https://www.postgresql.org/media/keys/ACCC4CF8.asc | \
  sudo apt-key add -

Update the package lists

sudo apt-get update

Install the latest version of PostgreSQL. If you want a specific version, use postgresql-12 or similar instead of postgresql

sudo apt-get -y install postgresql-client-13

To install PostgreSQL on Amazon Linux, complete the following steps:

Check the available PostgreSQL and Python version and then enable the required versions

sudo amazon-linux-extras | egrep "postgres|python" 
sudo amazon-linux-extras enable python3.8 postgresql13

Update yum and then install PostgreSQL and Python

sudo yum update -y 
sudo yum install postgresql* python38* -y

Configure and enable plprofiler on Aurora PostgreSQL

Complete the following steps to configure and enable plprofiler on Aurora for PostgreSQL:

Set the shared_preload_libraries parameter to plprofiler in the instance parameter group, as shown in the following screenshot.

This parameter change requires an Aurora PostgreSQL instance reboot because this is a static parameter.

When the reboot is complete, connect to the database using pgadmin4 or psql client and create the plprofiler extension in the database.

psql -h endpoint -U dbuser -d dbname -c ”create extension plprofiler;”

Install plprofiler on an EC2 instance

After the plprofiler is enabled and configured on the database, we can proceed with the following commands to install plprofiler on an EC2 instance, which we can use as a command line utility.

On Ubuntu, log in to the EC2 instance using putty or terminal or Session Manager and connect to the root user.

sudo apg-get update
sudo pip3.5 install plprofiler

On Amazon Linux, log in to the EC2 instance using putty or terminal or Session Manager and connect to the root user.

sudo pip3.8 install plprofiler

To check where plprofiler is installed, run on the command line which plprofiler:

3c22fb76c717:python-plprofiler$ which plprofiler
/usr/local/bin/plprofiler
3c22fb76c717:python-plprofiler$

Test and validate plprofiler

To test plprofiler, we create the sample tables and function shown in the following example. Make sure you create the plprofiler extension in the testing database.

In this example, we created a separate database called plprofiler, sample tables, and a function in the public schema. Then we used the function inventory_in_stock for testing.

Create tables with the following code:

CREATE TABLE public.rental (
    rental_id integer NOT NULL PRIMARY KEY,
    rental_date timestamp with time zone NOT NULL,
    inventory_id integer NOT NULL,
    customer_id integer NOT NULL,
    return_date timestamp with time zone,
    staff_id integer NOT NULL,
    last_update timestamp with time zone DEFAULT now() NOT NULL
);

CREATE TABLE public.inventory (
    inventory_id integer  NOT NULL,
    film_id integer NOT NULL,
    store_id integer NOT NULL,
    last_update timestamp with time zone DEFAULT now() NOT NULL
);

Insert the dummy data using the generate_series function.

The following query inserts 1 million rows – a sequence of numbers for integer columns and a random date from any of the dates in the last 2 years for the timestamp columns:

INSERT INTO public.rental VALUES (
generate_series(1,1000000), 
now()-random()*(now() - (now() - '2 years'::interval)),
generate_series(1,1000000),
random()*generate_series(1,1000000),
now()-random()*(now() - (now() - '2 years'::interval)),
random()*generate_series(1,1000000),
now()-random()*(now() - (now() - '2 years'::interval))
);

INSERT INTO public.inventory VALUES (
generate_series(1,1000000),
random()*generate_series(1,1000000),
random()*generate_series(1,1000000),
now()-random()*(now() - (now() - '2 years'::interval))
);

Create a function with the following code:

CREATE FUNCTION public.inventory_in_stock(p_inventory_id integer) RETURNS boolean
    LANGUAGE plpgsql
    AS $$
DECLARE
    v_rentals INTEGER;
    v_out     INTEGER;
BEGIN
    -- AN ITEM IS IN-STOCK IF THERE ARE EITHER NO ROWS IN THE rental TABLE
    -- FOR THE ITEM OR ALL ROWS HAVE return_date POPULATED


     IF NOT EXISTS (SELECT 1 FROM rental WHERE inventory_id =    p_inventory_id) THEN
     RETURN TRUE;
     END IF;


    SELECT COUNT(rental_id) INTO v_out
    FROM rental WHERE EXISTS (SELECT 1 FROM inventory
    WHERE inventory.inventory_id = p_inventory_id)
    AND rental.return_date IS NULL;


    IF v_out > 0 THEN
      RETURN FALSE;
    ELSE
      RETURN TRUE;
    END IF;
END $$;

After you create the objects, log in to the EC2 instance and run the following command to invoke plprofiler from the command line.

plprofiler run --command "select inventory_in_stock(1525)" --output /tmp/inventory_in_stock.html -h endpoint.rds.amazonaws.com -d plprofiler -U postgres

This generates the report inventory_in_stock.html.

Copy the inventory_in_stock.html file to a machine on which you can open it in a web browser.
Extract the time-consuming query, as highlighted in the following screenshot, and generate the run plan as follows:

plprofiler=> explain analyze SELECT 1 FROM rental WHERE rental.inventory_id =1525;
                                              QUERY PLAN
------------------------------------------------------------------------------------------------------
 Seq Scan on rental  (cost=0.00..21846.00 rows=3 width=4) (actual time=0.279..68.023 rows=12 loops=1)
   Filter: (inventory_id = 1525)
   Rows Removed by Filter: 999988
 Planning Time: 0.296 ms
 Execution Time: 68.080 ms
(5 rows)

In this plan, we can see that the query is taking more time because it’s performing a sequential scan on table.

To resolve this, add an index on the inventory_id column.

CREATE INDEX idx_rental_inventory_id ON public.rental USING btree(inventory_id);

Check the query plan with the index created, which now shows the index scan.

plprofiler=> explain analyze SELECT 1 FROM rental WHERE rental.inventory_id =1525;
                                                              QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
 Index Only Scan using idx_rental_inventory_id on rental  (cost=0.42..4.82 rows=3 width=4) (actual time=0.022..0.031 rows=12 loops=1)
   Index Cond: (inventory_id = 1525)
   Heap Fetches: 12
 Planning Time: 0.141 ms
 Execution Time: 0.048 ms
(5 rows)

Rerun the plprofiler report from the EC2 instance.

plprofiler run --command "select inventory_in_stock(1525)" --output /tmp/inventory_in_stock-1.html -h endpoint.rds.amazonaws.com -d plprofiler -U postgres

If we check the inventory_in_stock.html report again, we can see that the same query is taking less time (almost 67% less) to run after we added the index.

Creating an index on the inventory_id column of the rental table resolved the performance issue. However, if you look at the design prospective and as a best practice, you need a primary key on the inventory_id column of the inventory table and a referential key (foreign key) on the inventory_id column of the rental table.

Use the following ALTER statements to add the primary and foreign keys for the tables accordingly.

ALTER TABLE public.inventory ADD PRIMARY KEY (inventory_id);

ALTER TABLE public.rental ADD CONSTRAINT constraint_rental_fk FOREIGN KEY (inventory_id) REFERENCES public.inventory (inventory_id) ON DELETE CASCADE;

Generate a new plprofiler report for the function.

plprofiler run --command "select inventory_in_stock(1525)" --output /tmp/inventory_in_stock-2.html -h endpoint.rds.amazonaws.com -d plprofiler -U postgres

We can see that performance improved by approximately 50%.

Cleanup

Resource cleanup is one of the important process after all testing is completed. In our case after completion of testing, we have terminated the EC2 and Aurora PostgreSQL instances which were not required. If you’re no longer going to use them, make sure you terminate them as well:

Terminate your EC2 instance
Terminate your Amazon RDS for PostgreSQL instance or your Aurora PostgreSQL instance

Conclusion

The plprofiler tool helps expedite the process of fine-tuning complex user-defined functions and stored procedures in Aurora for PostgreSQL or Amazon RDS for PostgreSQL by reducing the manual effort. As explained in this post, you can generate a plprofiler report for a problematic function or procedure. You can then review the report to find the part of the function or procedure where it is slow and take necessary actions to improve performance.

If you have any questions or suggestions about this post, leave a comment.

About the Authors

Rajesh Madiwale is a Lead Consultant with Amazon Web Services. He has deep expertise on database development and administration on Amazon RDS for PostgreSQL, Aurora PostgreSQL, Redshift, MySQL and Greenplum databases. He is an ardent member of the PostgreSQL community and has been working on PostgreSQL his entire tenure. He has also delivered several sessions at PostgreSQL conferences.

Baji Shaik is a Sr. Lead Consultant with AWS ProServe, GCC India. His background spans a wide depth and breadth of expertise and experience in SQL/NoSQL database technologies. He is a Database Migration Expert and has developed many successful database solutions addressing challenging business requirements for moving databases from on-premises to Amazon RDS and Aurora PostgreSQL/MySQL. He is an eminent author, having written several books on PostgreSQL. A few of his recent works include “PostgreSQL Configuration“, “Beginning PostgreSQL on the Cloud”, and “PostgreSQL Development Essentials“. Furthermore, he has delivered several conference and workshop sessions.

Bhanu Akula is a Consultant with Amazon Web Services. She has expertise in database administration, consulting, solutioning on Amazon RDS for PostgreSQL, Aurora PostgreSQL, and Greenplum databases. She has been working with the database community on Oracle, PostgreSQL, Greenplum since the beginning of her career. She is working actively for the customers from all over the world in the Assess part of their migration journey, in discovering data through various discovery tools and recommending the right sized recommendations.