Audience: SREs, DBAs, platform operators
Last updated: 2026-02-23
Database: PostgreSQL 15+ via Amazon RDS (or self-managed)
Connection pooler: PgBouncer
Table of Contents
- Backup
- Restore
- Manual VACUUM
- Index Maintenance
- Schema Migrations
- Monitoring Queries
- Connection Pool Tuning (PgBouncer)
1. Backup
Automated Backups (RDS)
RDS automated backups are configured with a 7-day retention period. Snapshots are taken daily during the maintenance window (02:00-03:00 UTC).
Manual Backup
Use the make backup target or run the equivalent commands directly.
Makefile target:
# Makefile
backup:
@echo "Starting KaireonAI database backup..."
@TIMESTAMP=$$(date +%Y%m%d_%H%M%S) && \
BACKUP_FILE="kaireon_backup_$${TIMESTAMP}.sql.gz" && \
pg_dump "$(DATABASE_URL)" \
--format=custom \
--compress=9 \
--verbose \
--file="/tmp/$${BACKUP_FILE}" && \
aws s3 cp "/tmp/$${BACKUP_FILE}" \
"s3://kaireon-backups/daily/$${BACKUP_FILE}" \
--storage-class STANDARD_IA && \
rm -f "/tmp/$${BACKUP_FILE}" && \
echo "Backup complete: s3://kaireon-backups/daily/$${BACKUP_FILE}"
Manual execution:
# Full database backup
make backup
# Or run directly:
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
pg_dump "$DATABASE_URL" \
--format=custom \
--compress=9 \
--verbose \
--file="/tmp/kaireon_backup_${TIMESTAMP}.sql.gz"
# Upload to S3
aws s3 cp "/tmp/kaireon_backup_${TIMESTAMP}.sql.gz" \
"s3://kaireon-backups/daily/" \
--storage-class STANDARD_IA
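Note that --format=custom produces a pg_dump archive (restored with pg_restore, not psql), even though the file is named .sql.gz. A quick pre-upload sanity check is to look for the archive's PGDMP magic bytes; a small sketch (the helper name is ours, the magic-byte check itself is standard):

```shell
# is_custom_dump: succeeds if the file begins with the "PGDMP" magic
# bytes that pg_dump writes at the start of every custom-format archive.
is_custom_dump() {
  [ "$(head -c 5 "$1" 2>/dev/null)" = "PGDMP" ]
}

# Demo with a stand-in file; in practice, point it at /tmp/$BACKUP_FILE
printf 'PGDMP-demo' > /tmp/demo_dump
if is_custom_dump /tmp/demo_dump; then
  echo "custom-format archive"
else
  echo "NOT a custom-format archive"
fi
rm -f /tmp/demo_dump
```

Running this before the aws s3 cp step catches truncated or plain-SQL files early.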
Schema-only backup (for migration reference):
pg_dump "$DATABASE_URL" \
--schema-only \
--file="/tmp/kaireon_schema_$(date +%Y%m%d).sql"
Table-specific backup:
pg_dump "$DATABASE_URL" \
--format=custom \
--compress=9 \
--table=decision_logs \
--file="/tmp/decision_logs_$(date +%Y%m%d).sql.gz"
RDS Snapshot
aws rds create-db-snapshot \
--db-instance-identifier kaireon-prod \
--db-snapshot-identifier "kaireon-manual-$(date +%Y%m%d-%H%M%S)"
Backup Verification
Run monthly to ensure backups are valid:
# Download the latest backup
LATEST=$(aws s3 ls s3://kaireon-backups/daily/ --recursive | sort | tail -1 | awk '{print $4}')
aws s3 cp "s3://kaireon-backups/$LATEST" /tmp/verify_backup.sql.gz
# Restore to a test database
createdb kaireon_verify
pg_restore --dbname=kaireon_verify --verbose /tmp/verify_backup.sql.gz
# Verify row counts
psql kaireon_verify -c "SELECT schemaname, relname, n_live_tup FROM pg_stat_user_tables ORDER BY n_live_tup DESC LIMIT 10;"
# Clean up
dropdb kaireon_verify
rm /tmp/verify_backup.sql.gz
2. Restore
Restore from pg_dump Backup
Use the make restore target or run the equivalent commands directly.
Makefile target:
# Makefile
restore:
@if [ -z "$(BACKUP_FILE)" ]; then \
echo "Usage: make restore BACKUP_FILE=s3://kaireon-backups/daily/kaireon_backup_20260223.sql.gz"; \
exit 1; \
fi
@echo "WARNING: This will overwrite the target database. Press Ctrl+C to abort."
@sleep 5
@aws s3 cp "$(BACKUP_FILE)" /tmp/restore_backup.sql.gz
@pg_restore \
--dbname="$(DATABASE_URL)" \
--clean \
--if-exists \
--verbose \
/tmp/restore_backup.sql.gz
@rm -f /tmp/restore_backup.sql.gz
@echo "Restore complete."
Manual execution:
# Download backup
aws s3 cp "s3://kaireon-backups/daily/kaireon_backup_20260223.sql.gz" /tmp/restore.sql.gz
# Option A: Restore to existing database (destructive)
pg_restore \
--dbname="$DATABASE_URL" \
--clean \
--if-exists \
--verbose \
/tmp/restore.sql.gz
# Option B: Restore to a new database
createdb kaireon_restored
pg_restore \
--dbname=kaireon_restored \
--verbose \
/tmp/restore.sql.gz
Restore from RDS Snapshot
# List available snapshots
aws rds describe-db-snapshots \
--db-instance-identifier kaireon-prod \
--query 'DBSnapshots[*].[DBSnapshotIdentifier,SnapshotCreateTime]' \
--output table
# Restore to a new instance
aws rds restore-db-instance-from-db-snapshot \
--db-instance-identifier kaireon-restored \
--db-snapshot-identifier kaireon-manual-20260223-120000 \
--db-instance-class db.r6g.xlarge \
--vpc-security-group-ids sg-xxxxxxxx
# Wait for the instance to become available
aws rds wait db-instance-available --db-instance-identifier kaireon-restored
Point-in-Time Recovery (RDS)
aws rds restore-db-instance-to-point-in-time \
--source-db-instance-identifier kaireon-prod \
--target-db-instance-identifier kaireon-pitr \
--restore-time "2026-02-23T10:30:00Z" \
--db-instance-class db.r6g.xlarge
aws rds wait db-instance-available --db-instance-identifier kaireon-pitr
Post-Restore Checklist
- Verify row counts match expectations.
- Run ANALYZE on all tables to update statistics.
- Verify application connectivity.
- Run a smoke test against the decision endpoint.
- Update DNS or connection strings if restoring to a new instance.
-- Post-restore ANALYZE
ANALYZE VERBOSE;
-- Verify key tables
SELECT 'blueprints' AS tbl, count(*) FROM blueprints
UNION ALL
SELECT 'offers', count(*) FROM offers
UNION ALL
SELECT 'decision_logs', count(*) FROM decision_logs
UNION ALL
SELECT 'customers', count(*) FROM customers;
3. Manual VACUUM
When to VACUUM
- Dead tuple ratio exceeds 20% on any table.
- After large bulk deletes or updates.
- Before a VACUUM FULL, if table bloat exceeds 50%.
- Autovacuum is lagging (check pg_stat_user_tables.last_autovacuum).
Standard VACUUM
Does not block reads or writes (it takes only a lightweight ShareUpdateExclusiveLock). Safe to run during production hours.
-- VACUUM a specific table with verbose output
VACUUM (VERBOSE) decision_logs;
VACUUM (VERBOSE) audit_events;
VACUUM (VERBOSE) customers;
-- VACUUM and update statistics
VACUUM (VERBOSE, ANALYZE) decision_logs;
-- VACUUM all tables
VACUUM (VERBOSE, ANALYZE);
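On PostgreSQL 13 and later, index vacuuming for a single large table can additionally be parallelized; the worker count is a request, capped by max_parallel_maintenance_workers:

```sql
-- Parallel index vacuuming (PG 13+); workers apply to index cleanup only
VACUUM (PARALLEL 4, VERBOSE, ANALYZE) decision_logs;
```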
VACUUM FULL
Reclaims disk space by rewriting the table. Acquires an ACCESS EXCLUSIVE lock, blocking all reads and writes. Schedule during maintenance windows only.
-- Check if VACUUM FULL is needed (bloat estimation)
SELECT
schemaname,
relname,
pg_size_pretty(pg_relation_size(relid)) AS table_size,
n_dead_tup,
n_live_tup,
ROUND(100.0 * n_dead_tup / NULLIF(n_live_tup + n_dead_tup, 0), 2) AS dead_pct,
last_autovacuum,
last_vacuum
FROM pg_stat_user_tables
WHERE n_dead_tup > 10000
ORDER BY n_dead_tup DESC;
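The same dead-tuple percentage can be computed from the shell for cron-based alerting by piping psql output through awk. A sketch with stand-in numbers (in a real check the input would come from something like psql "$DATABASE_URL" -At -F',' -c "SELECT relname, n_live_tup, n_dead_tup FROM pg_stat_user_tables"):

```shell
# Compute dead-tuple percentage the way the SQL above does:
# 100 * n_dead_tup / (n_live_tup + n_dead_tup), guarding divide-by-zero.
printf '%s\n' \
  'decision_logs,800000,250000' \
  'customers,50000,1000' |
awk -F',' '{
  total = $2 + $3
  pct = (total > 0) ? 100.0 * $3 / total : 0
  printf "%s %.2f%%\n", $1, pct
}'
```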
-- Run VACUUM FULL (locks table, schedule in maintenance window)
VACUUM FULL VERBOSE decision_logs;
Autovacuum Tuning
-- Check autovacuum settings per table
SELECT relname, reloptions
FROM pg_class
WHERE relname IN ('decision_logs', 'audit_events', 'customers');
-- Set aggressive autovacuum for high-churn tables
ALTER TABLE decision_logs SET (
autovacuum_vacuum_threshold = 1000,
autovacuum_vacuum_scale_factor = 0.01,
autovacuum_analyze_threshold = 500,
autovacuum_analyze_scale_factor = 0.005
);
ALTER TABLE audit_events SET (
autovacuum_vacuum_threshold = 1000,
autovacuum_vacuum_scale_factor = 0.01
);
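To undo a per-table override and fall back to the server-wide autovacuum defaults:

```sql
-- Remove per-table overrides; the global settings apply again
ALTER TABLE decision_logs RESET (
    autovacuum_vacuum_threshold,
    autovacuum_vacuum_scale_factor
);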
Monitor Autovacuum Progress
-- Check running autovacuum processes
SELECT pid, datname, relid::regclass, phase,
heap_blks_total, heap_blks_scanned, heap_blks_vacuumed,
index_vacuum_count, max_dead_tuples, num_dead_tuples
FROM pg_stat_progress_vacuum;
-- Check autovacuum worker count
SELECT count(*) FROM pg_stat_activity WHERE backend_type = 'autovacuum worker';
4. Index Maintenance
Check Index Health
-- Index usage statistics (find unused indexes)
SELECT
schemaname,
relname,
indexrelname,
pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
idx_scan,
idx_tup_read,
idx_tup_fetch
FROM pg_stat_user_indexes
ORDER BY idx_scan ASC, pg_relation_size(indexrelid) DESC;
-- Find duplicate indexes (heuristic: matches on table and column list only;
-- verify operator classes and partial-index predicates before dropping)
SELECT
indrelid::regclass AS table_name,
array_agg(indexrelid::regclass) AS duplicate_indexes,
count(*)
FROM pg_index
GROUP BY indrelid, indkey
HAVING count(*) > 1;
-- Index bloat estimation
SELECT
schemaname,
relname,
indexrelname,
pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
idx_scan
FROM pg_stat_user_indexes
WHERE pg_relation_size(indexrelid) > 10485760 -- > 10MB
ORDER BY pg_relation_size(indexrelid) DESC;
Rebuild Indexes
Use REINDEX CONCURRENTLY (PostgreSQL 12+) to rebuild indexes without blocking reads or writes on the table.
-- Rebuild a specific index (non-blocking)
REINDEX INDEX CONCURRENTLY idx_decision_logs_customer_id;
-- Rebuild all indexes on a table (non-blocking)
REINDEX TABLE CONCURRENTLY decision_logs;
-- Rebuild all indexes in the database (maintenance window recommended)
REINDEX DATABASE CONCURRENTLY kaireon;
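If a concurrent reindex (or CREATE INDEX CONCURRENTLY) is interrupted, it can leave behind an invalid index, often suffixed _ccnew. Worth checking after any concurrent rebuild:

```sql
-- Find invalid indexes left behind by interrupted concurrent operations
SELECT indexrelid::regclass AS index_name,
       indrelid::regclass  AS table_name
FROM pg_index
WHERE NOT indisvalid;
```

Invalid indexes consume space and slow writes without serving queries; drop and rebuild them.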
Create Missing Indexes
Common queries that benefit from indexes:
-- Decision log lookups by customer
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_decision_logs_customer_created
ON decision_logs (customer_id, created_at DESC);
-- Audit event lookups by entity
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_audit_events_entity
ON audit_events (entity_type, entity_id, created_at DESC);
-- Offer lookups by status
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_offers_status
ON offers (status) WHERE status = 'ACTIVE';
Drop Unused Indexes
-- Confirm index is truly unused (check over 30+ days)
-- Then drop it:
DROP INDEX CONCURRENTLY IF EXISTS idx_old_unused_index;
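The idx_scan counters are cumulative since the last statistics reset, so confirm they actually cover a long enough window before trusting a zero:

```sql
-- If stats_reset is recent, idx_scan = 0 proves nothing
SELECT datname, stats_reset
FROM pg_stat_database
WHERE datname = 'kaireon';
```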
5. Schema Migrations
Using Prisma db push
KaireonAI uses Prisma 7 with db push for schema synchronization.
Standard migration workflow:
cd platform
# 1. Edit schema.prisma with the desired changes
# 2. Generate Prisma client (validates schema)
npx prisma generate
# 3. Preview the pending changes without applying them
npx prisma migrate diff \
--from-url "$DATABASE_URL" \
--to-schema-datamodel schema.prisma \
--script
# (Destructive dev-only reset: npx prisma db push --force-reset --accept-data-loss)
# 4. Apply to production
npx prisma db push
Pre-migration checklist:
- Back up the database (see Backup).
- Review the schema diff carefully.
- Check for data loss warnings.
- Run in staging first.
- Coordinate with the team (announce in #kaireon-deployments).
Post-migration checklist:
- Run ANALYZE on affected tables.
- Verify application starts without errors.
- Check that all API endpoints return valid responses.
- Monitor error rates for 15 minutes.
Manual SQL Migrations
For changes that Prisma cannot handle (data migrations, custom constraints):
# Create a migration file
mkdir -p migrations
cat > migrations/$(date +%Y%m%d%H%M%S)_description.sql << 'EOF'
-- Migration: description
-- Author: <name>
-- Date: <date>
BEGIN;
-- Your SQL here
ALTER TABLE decision_logs ADD COLUMN response_time_ms INTEGER;
-- Backfill if needed
UPDATE decision_logs SET response_time_ms = 0 WHERE response_time_ms IS NULL;
-- Add constraint after backfill
ALTER TABLE decision_logs ALTER COLUMN response_time_ms SET NOT NULL;
COMMIT;
EOF
# Apply the migration
psql "$DATABASE_URL" -f migrations/<filename>.sql
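The %Y%m%d%H%M%S prefix is what keeps migrations applying in order: lexicographic sort equals chronological sort. A quick demonstration (filenames are examples):

```shell
# Timestamp-prefixed migration names sort chronologically with a plain sort
printf '%s\n' \
  '20260223120000_add_response_time.sql' \
  '20251101093000_initial.sql' \
  '20260101000000_add_offers_index.sql' |
sort
```

This is why `for f in migrations/*.sql` applies them in creation order without any extra bookkeeping.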
Rollback Procedure
# 1. Restore schema from a pre-migration backup
# (pg_restore requires a custom-format dump; take it beforehand with:
#  pg_dump "$DATABASE_URL" --schema-only --format=custom --file=/tmp/pre_migration_schema.dump)
pg_restore \
--dbname="$DATABASE_URL" \
--schema-only \
--clean \
--if-exists \
/tmp/pre_migration_schema.dump
# 2. Or write a manual rollback script
psql "$DATABASE_URL" << 'EOF'
BEGIN;
ALTER TABLE decision_logs DROP COLUMN IF EXISTS response_time_ms;
COMMIT;
EOF
# 3. Regenerate Prisma client from rolled-back schema
cd platform && npx prisma generate
6. Monitoring Queries
Dashboard Queries
Run these queries periodically or integrate them into Grafana via the PostgreSQL data source.
Database size and growth:
SELECT
pg_size_pretty(pg_database_size('kaireon')) AS total_size;
SELECT
schemaname,
tablename,
pg_size_pretty(pg_total_relation_size(schemaname || '.' || tablename)) AS total_size,
pg_size_pretty(pg_relation_size(schemaname || '.' || tablename)) AS table_size,
pg_size_pretty(pg_indexes_size((schemaname || '.' || tablename)::regclass)) AS index_size
FROM pg_tables
WHERE schemaname = 'public'
ORDER BY pg_total_relation_size(schemaname || '.' || tablename) DESC
LIMIT 20;
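To track growth over time (for the growth-rate alert below), periodically snapshot the database size into a small ops table; the table name here is ours:

```sql
-- Hypothetical history table for size sampling (e.g. from a nightly cron)
CREATE TABLE IF NOT EXISTS ops_db_size_history (
    sampled_at  timestamptz NOT NULL DEFAULT now(),
    total_bytes bigint      NOT NULL
);
INSERT INTO ops_db_size_history (total_bytes)
SELECT pg_database_size('kaireon');
```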
Active connections:
SELECT
count(*) AS total,
count(*) FILTER (WHERE state = 'active') AS active,
count(*) FILTER (WHERE state = 'idle') AS idle,
count(*) FILTER (WHERE state = 'idle in transaction') AS idle_in_txn,
count(*) FILTER (WHERE wait_event_type IS NOT NULL) AS waiting
FROM pg_stat_activity
WHERE datname = 'kaireon';
Slow queries (current):
SELECT
pid,
now() - query_start AS duration,
state,
LEFT(query, 200) AS query_preview
FROM pg_stat_activity
WHERE state != 'idle'
AND datname = 'kaireon'
AND now() - query_start > interval '5 seconds'
ORDER BY duration DESC;
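When a runaway query turns up, cancel it gracefully first; terminate the backend only if the cancel is ignored. The pid 12345 below stands in for a pid returned by the query above:

```sql
-- Cancel the running query (keeps the connection open)
SELECT pg_cancel_backend(12345);
-- Last resort: kill the backend entirely
SELECT pg_terminate_backend(12345);
```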
Table statistics:
SELECT
relname,
seq_scan,
seq_tup_read,
idx_scan,
idx_tup_fetch,
n_tup_ins,
n_tup_upd,
n_tup_del,
n_live_tup,
n_dead_tup,
last_vacuum,
last_autovacuum,
last_analyze,
last_autoanalyze
FROM pg_stat_user_tables
ORDER BY n_live_tup DESC;
Cache hit ratio (should be >99%):
SELECT
'index' AS type,
sum(idx_blks_hit) AS hits,
sum(idx_blks_read) AS reads,
ROUND(100.0 * sum(idx_blks_hit) / NULLIF(sum(idx_blks_hit) + sum(idx_blks_read), 0), 2) AS hit_ratio
FROM pg_statio_user_indexes
UNION ALL
SELECT
'table',
sum(heap_blks_hit),
sum(heap_blks_read),
ROUND(100.0 * sum(heap_blks_hit) / NULLIF(sum(heap_blks_hit) + sum(heap_blks_read), 0), 2)
FROM pg_statio_user_tables;
Lock contention:
-- pg_blocking_pids() (PG 9.6+) covers every lock type, including row locks
SELECT
blocked.pid AS blocked_pid,
LEFT(blocked.query, 200) AS blocked_query,
blocking.pid AS blocking_pid,
LEFT(blocking.query, 200) AS blocking_query,
now() - blocked.query_start AS blocked_duration
FROM pg_stat_activity blocked
JOIN pg_stat_activity blocking
ON blocking.pid = ANY (pg_blocking_pids(blocked.pid))
ORDER BY blocked_duration DESC;
Replication lag (if read replicas exist):
-- On the primary
SELECT
client_addr,
state,
sent_lsn,
write_lsn,
flush_lsn,
replay_lsn,
pg_size_pretty(pg_wal_lsn_diff(sent_lsn, replay_lsn)) AS replay_lag
FROM pg_stat_replication;
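On the replica itself, lag can be read in wall-clock terms rather than bytes:

```sql
-- Run on a standby: time since the last replayed transaction
SELECT now() - pg_last_xact_replay_timestamp() AS replay_delay;
```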
Alerting Thresholds
| Metric | Warning | Critical |
|---|---|---|
| Cache hit ratio | <99% | <95% |
| Dead tuple ratio | >10% | >30% |
| Active connections | >70% of max | >90% of max |
| Replication lag | >1MB | >10MB |
| Long-running queries | >30s | >120s |
| Idle in transaction | >60s | >300s |
| Database size growth | >5%/day | >10%/day |
7. Connection Pool Tuning (PgBouncer)
Current Configuration
PgBouncer sits between the KaireonAI application and PostgreSQL, multiplexing connections.
Architecture:
KaireonAI API pods (N replicas)
|
v
PgBouncer (transaction pooling)
|
v
PostgreSQL (max_connections=200)
Configuration File
Located in the PgBouncer ConfigMap: kaireon-pgbouncer-config
; /etc/pgbouncer/pgbouncer.ini
[databases]
kaireon = host=kaireon-prod.xxxxxxxx.us-east-1.rds.amazonaws.com port=5432 dbname=kaireon
[pgbouncer]
; Connection pooling mode
pool_mode = transaction
; Pool sizing
default_pool_size = 25
min_pool_size = 5
reserve_pool_size = 5
reserve_pool_timeout = 3
; Client limits
max_client_conn = 500
max_db_connections = 150
; Timeouts
server_idle_timeout = 600
server_connect_timeout = 15
server_login_retry = 15
client_idle_timeout = 0
client_login_timeout = 60
query_timeout = 300
query_wait_timeout = 120
; Logging
log_connections = 0
log_disconnections = 0
log_pooler_errors = 1
stats_period = 60
; Security
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
; Admin
admin_users = pgbouncer_admin
stats_users = pgbouncer_stats
listen_addr = 0.0.0.0
listen_port = 6432
Pool Sizing
Rule of thumb:
default_pool_size = max_connections / (number_of_pgbouncer_instances * safety_factor)
Example:
max_connections = 200
pgbouncer_instances = 2
safety_factor = 1.5 (leaves room for direct connections, replication, monitoring)
default_pool_size = 200 / (2 * 1.5) = ~66
But default_pool_size is set to 25 instead, because:
- 30 connections are reserved for replication, monitoring, and admin access.
- (200 - 30) / 2 instances = 85 per instance is the hard ceiling.
- With reserve_pool_size=5, the effective maximum is 30 server connections per instance.
- 30 * 2 = 60 server connections total, well under the ceiling.
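The sizing arithmetic above can be scripted for quick what-if checks when any of the inputs change; the numbers below are the example's:

```shell
# What-if arithmetic for PgBouncer pool sizing (example values)
max_connections=200
reserved=30          # replication, monitoring, admin
instances=2
default_pool_size=25
reserve_pool_size=5

ceiling=$(( (max_connections - reserved) / instances ))
effective=$(( (default_pool_size + reserve_pool_size) * instances ))

echo "per-instance ceiling:   $ceiling"      # 85
echo "effective server conns: $effective"    # 60
echo "headroom below ceiling: $(( max_connections - reserved - effective ))"  # 110
```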
Monitoring PgBouncer
# Show pool statistics
kubectl exec -it deploy/kaireon-pgbouncer -- psql -p 6432 pgbouncer -c "SHOW POOLS;"
# Show active clients
kubectl exec -it deploy/kaireon-pgbouncer -- psql -p 6432 pgbouncer -c "SHOW CLIENTS;"
# Show server connections
kubectl exec -it deploy/kaireon-pgbouncer -- psql -p 6432 pgbouncer -c "SHOW SERVERS;"
# Show statistics
kubectl exec -it deploy/kaireon-pgbouncer -- psql -p 6432 pgbouncer -c "SHOW STATS;"
# Show configuration
kubectl exec -it deploy/kaireon-pgbouncer -- psql -p 6432 pgbouncer -c "SHOW CONFIG;"
Tuning for Common Scenarios
High API traffic (many short queries):
pool_mode = transaction
default_pool_size = 25
max_client_conn = 1000
server_idle_timeout = 300
Long-running analytics queries:
; Use a separate PgBouncer instance with session pooling
pool_mode = session
default_pool_size = 10
max_client_conn = 50
query_timeout = 3600
Burst traffic handling:
default_pool_size = 25
reserve_pool_size = 10
reserve_pool_timeout = 3
max_client_conn = 500
Applying Configuration Changes
# Edit the ConfigMap
kubectl edit configmap kaireon-pgbouncer-config
# Reload PgBouncer without restart (applies most settings)
kubectl exec -it deploy/kaireon-pgbouncer -- psql -p 6432 pgbouncer -c "RELOAD;"
# Verify new settings
kubectl exec -it deploy/kaireon-pgbouncer -- psql -p 6432 pgbouncer -c "SHOW CONFIG;" | grep pool_size
# If RELOAD is insufficient (e.g., listen_port changed), restart:
kubectl rollout restart deploy/kaireon-pgbouncer
Troubleshooting PgBouncer
Clients waiting for connections:
# Check waiting count
kubectl exec -it deploy/kaireon-pgbouncer -- psql -p 6432 pgbouncer -c \
"SHOW POOLS;" | grep -v "^-"
# If cl_waiting > 0, increase default_pool_size or max_db_connections
Connection refused errors:
# Check if max_client_conn is reached
kubectl exec -it deploy/kaireon-pgbouncer -- psql -p 6432 pgbouncer -c \
"SHOW LISTS;"
# If used_clients is near max_client_conn, increase it
Prepared statement errors in transaction mode:
# Transaction pooling does not support prepared statements natively.
# Either:
# 1. Disable prepared statements in Prisma:
# Set ?pgbouncer=true in the connection string
# 2. Or switch to session pooling (reduces multiplexing benefit)
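A third option on newer deployments: PgBouncer 1.21+ can track protocol-level prepared statements in transaction mode, enabled with a single setting:

```ini
; pgbouncer.ini (requires PgBouncer >= 1.21)
; Non-zero enables prepared-statement tracking in transaction pooling mode
max_prepared_statements = 200
```

Verify your PgBouncer version before relying on this; on older versions the two workarounds above are the only choices.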
Maintenance Schedule
| Task | Frequency | Window | Impact |
|---|---|---|---|
| Automated backup (RDS) | Daily | 02:00-03:00 UTC | None |
| Manual backup verification | Monthly | Business hours | None |
| VACUUM ANALYZE (large tables) | Nightly | 03:00-04:00 UTC | None |
| VACUUM FULL (if needed) | Quarterly | Maintenance window | Table locked |
| REINDEX CONCURRENTLY | Monthly | Off-peak hours | Minimal |
| Index usage review | Quarterly | Business hours | None |
| Connection pool review | Quarterly | Business hours | None |
| Disk usage review | Weekly | Business hours | None |
| Statistics update (ANALYZE) | Nightly | 03:00-04:00 UTC | None |