Monitoring PostgreSQL Index Creation Progress on Large Tables
The Challenge: Adding an Index to 750 Million Rows
Database administrators and engineers often face the daunting task of adding new indexes to large tables in production environments. In this real-world case study, we'll walk through the process of creating a critical unique index on a PostgreSQL table containing 750 million records on an AWS RDS instance (db.r6g.4xlarge).
Understanding the Problem
Our large transaction_records table needed a unique index on the id column to improve query performance. For tables of this size, index creation is not a trivial operation and can take significant time to complete.
-- Simplified table structure
create table public.transaction_records
(
id uuid not null,
transaction_id uuid,
category_id integer not null,
account_code bigint not null,
amount double precision not null,
remaining_amount double precision,
is_active smallint not null,
processed_amount double precision default 0,
fiscal_year bigint,
branch_code bigint,
-- other metadata columns
);
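Before kicking off a long build, it helps to size up what you are dealing with. Here is a quick sketch using the system catalog (the table name matches the structure above; reltuples is an estimate maintained by autovacuum and analyze, not an exact count):
-- Estimated row count and on-disk size (sketch; values are illustrative)
SELECT reltuples::bigint AS approx_rows,
       pg_size_pretty(pg_total_relation_size(oid)) AS total_size
FROM pg_class
WHERE relname = 'transaction_records';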
The Impact of Missing Indexes
When we queried the table without an index on the id column, PostgreSQL had to perform a full sequential scan:
explain select * from transaction_records where id = 'f67a3c21-45e8-412d-b897-d28ab5c7f931';
Result:
Gather  (cost=1000.00..20055623.77 rows=1 width=136)
  Workers Planned: 2
  ->  Parallel Seq Scan on transaction_records  (cost=0.00..20054623.67 rows=1 width=136)
        Filter: (id = 'f67a3c21-45e8-412d-b897-d28ab5c7f931'::uuid)
This execution plan reveals a critical performance issue - the database must scan all 750 million rows to find a single record, resulting in:
- Extremely high query cost (over 20 million)
- Poor performance for applications
- Unnecessary resource consumption
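Before deciding on a fix, it is worth confirming which indexes already exist on the table. A quick check against pg_indexes (a sketch, using our table name):
-- List existing indexes on the table (sketch)
SELECT indexname, indexdef
FROM pg_indexes
WHERE schemaname = 'public'
  AND tablename = 'transaction_records';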
Choosing the Right Approach
After a failed attempt to create the index with the standard syntax (a plain CREATE INDEX blocks writes to the table for the duration of the build, and ours dragged on for hours), we opted for a less disruptive approach using concurrent index creation:
CREATE UNIQUE INDEX CONCURRENTLY idx_transaction_records_id ON transaction_records(id);
The CONCURRENTLY option allows regular database operations to continue while the index is being built, making it suitable for production environments.
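One caveat worth planning for: if a concurrent build fails or is cancelled, PostgreSQL leaves behind an invalid index that still takes up space and is still maintained on writes. A sketch of how to spot and remove it before retrying (the index name is the one from our statement above):
-- Find indexes left invalid by a failed concurrent build (sketch)
SELECT indexrelid::regclass AS index_name
FROM pg_index
WHERE NOT indisvalid;

-- Drop the leftover index without blocking writers, then re-run the CREATE statement
DROP INDEX CONCURRENTLY IF EXISTS idx_transaction_records_id;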
Monitoring Progress: The Key to Sanity
When operations take hours or even days to complete, visibility becomes crucial. PostgreSQL provides tools to monitor long-running index creations through system views.
1. Check the Process Status
First, identify the running process:
SELECT pid, query, state, now() - pg_stat_activity.query_start AS duration
FROM pg_stat_activity
WHERE query LIKE 'CREATE%INDEX%';
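A small variation on the same query also surfaces what the backend is currently waiting on, which becomes useful once the build reaches one of its waiting phases (a sketch; the column list is simply what we found handy):
SELECT pid,
       state,
       wait_event_type,
       wait_event,
       now() - query_start AS duration,
       left(query, 80) AS query
FROM pg_stat_activity
WHERE query ILIKE 'CREATE%INDEX%'
  AND pid <> pg_backend_pid();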
2. Monitor Detailed Progress
PostgreSQL offers detailed progress information through the pg_stat_progress_create_index view:
SELECT * FROM pg_stat_progress_create_index;
Sample output:
pid                | 14820
datid              | 16404
datname            | ptx_prod
relid              | 410166
index_relid        | 419340
command            | CREATE INDEX CONCURRENTLY
phase              | waiting for writers before validation
lockers_total      | 18
lockers_done       | 14
current_locker_pid | 25860
blocks_total       | 0
blocks_done        | 0
tuples_total       | 730737795
tuples_done        | 730737795
partitions_total   | 0
partitions_done    | 0
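The raw row is dense, so it can help to wrap the view in a small query that labels the interesting columns and pulls elapsed time from pg_stat_activity. A sketch (the percentage columns only mean something in phases that report blocks or tuples):
SELECT p.pid,
       p.phase,
       p.lockers_done || '/' || p.lockers_total AS lockers,
       round(100.0 * p.blocks_done / nullif(p.blocks_total, 0), 1) AS blocks_pct,
       round(100.0 * p.tuples_done / nullif(p.tuples_total, 0), 1) AS tuples_pct,
       now() - a.query_start AS elapsed
FROM pg_stat_progress_create_index p
JOIN pg_stat_activity a USING (pid);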
3. Understanding Progress Phases
The index creation process goes through several phases:
- Initial Build Phase: Scanning the table and building the index structure (730,737,795 tuples processed out of 730,737,795 total)
- Waiting for Writers: Ensuring all concurrent writes are accounted for (lockers_total: 18, lockers_done: 14)
- Index Validation: Scanning Index: Validating the index structure (1,758,055 items processed out of 2,813,624 total)
- Index Validation: Scanning Table: Final validation against table data (5,548,280 items processed out of 16,248,698 total)
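Rather than re-running these queries by hand, psql's \watch meta-command can repeat the last query on a fixed interval; a usage sketch:
-- In psql: run once, then repeat every 30 seconds until interrupted
SELECT phase, lockers_done, lockers_total, blocks_done, blocks_total, tuples_done, tuples_total
FROM pg_stat_progress_create_index;
\watch 30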
Results: The Payoff
After nearly 5 hours, our index creation completed. We could see the immediate impact on query performance:
explain select * from transaction_records where id = 'f67a3c21-45e8-412d-b897-d28ab5c7f931';
New execution plan:
Index Scan using idx_transaction_records_id on transaction_records  (cost=0.57..8.59 rows=1 width=136)
  Index Cond: (id = 'f67a3c21-45e8-412d-b897-d28ab5c7f931'::uuid)
The estimated query cost dropped from over 20 million to just 8.59. Instead of scanning all 750 million rows, the database now reads a handful of index pages, so point lookups complete orders of magnitude faster.
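It is also worth confirming that the finished index is marked valid (a build that failed partway through would show indisvalid = false) and seeing how much space it occupies; a sketch:
-- Verify the new index is valid and unique, and check its size (sketch)
SELECT c.relname AS index_name,
       i.indisvalid,
       i.indisunique,
       pg_size_pretty(pg_relation_size(i.indexrelid)) AS index_size
FROM pg_index i
JOIN pg_class c ON c.oid = i.indexrelid
WHERE c.relname = 'idx_transaction_records_id';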
Handling Naming Conflicts
When creating indexes and constraints, we discovered that index names share the schema's relation namespace with tables, sequences, and views, so creation fails if another object already uses the same name. If you encounter a naming conflict, use a more distinctive naming convention:
-- Example of using a more distinctive naming pattern
CREATE UNIQUE INDEX CONCURRENTLY idx_tr_2023_id ON transaction_records(id);
Checking for naming conflicts before creating indexes can save time and help avoid errors during the process.
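A quick existence check against pg_class before running the CREATE statement catches a clash up front; a sketch:
-- Check whether any relation in the schema already uses the intended name (sketch)
SELECT relname, relkind
FROM pg_class
WHERE relname = 'idx_transaction_records_id'
  AND relnamespace = 'public'::regnamespace;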
Key Takeaways
- Use CONCURRENTLY: For production systems, use concurrent index creation to minimize disruption.
- Monitor progress: Utilize PostgreSQL system views to track long-running operations.
- Understand resource impact: Operations on large tables require significant resources and time.
- Check naming conflicts: Ensure index names are unique within your schema.
- Be patient: Index creation on large tables takes time, but the performance benefits are worth it.
- Communicate timelines: Set realistic expectations with stakeholders about index creation durations.
Configuration Considerations
For large tables, consider optimizing these PostgreSQL parameters:
- maintenance_work_mem: Governs the memory available to the index build's sort; our setting was 2140275kB (2.14GB)
- work_mem: Affects sort and hash operations in the queries running alongside the build
- max_parallel_workers_per_gather: Impacts parallel query operations
- effective_io_concurrency: Can improve I/O throughput
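These can be inspected with SHOW, and maintenance_work_mem in particular can be raised for just the session that runs the build (on RDS, persistent changes go through the instance's parameter group). A sketch:
-- Inspect current settings (sketch)
SHOW maintenance_work_mem;
SHOW max_parallel_workers_per_gather;

-- Raise the build's sort memory for this session only, then run the CREATE INDEX
SET maintenance_work_mem = '2GB';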
By understanding how to monitor and manage index creation on large tables, you can make informed decisions when working with massive datasets in PostgreSQL, minimizing downtime while maximizing performance gains.