
Audience: SREs, DBAs, platform operators
Last updated: 2026-02-23
Database: PostgreSQL 15+ via Amazon RDS (or self-managed)
Connection pooler: PgBouncer

Table of Contents

  1. Backup
  2. Restore
  3. Manual VACUUM
  4. Index Maintenance
  5. Schema Migrations
  6. Monitoring Queries
  7. Connection Pool Tuning (PgBouncer)

1. Backup

Automated Backups (RDS)

RDS automated backups are configured with a 7-day retention period. Snapshots are taken daily during the maintenance window (02:00-03:00 UTC).
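To confirm or adjust these settings (using the kaireon-prod instance identifier from elsewhere in this runbook):
# Inspect the current automated-backup configuration
aws rds describe-db-instances \
  --db-instance-identifier kaireon-prod \
  --query 'DBInstances[0].[BackupRetentionPeriod,PreferredBackupWindow]' \
  --output table

# Change retention or window if needed
aws rds modify-db-instance \
  --db-instance-identifier kaireon-prod \
  --backup-retention-period 7 \
  --preferred-backup-window 02:00-03:00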

Manual Backup

Use the make backup target or run the equivalent commands directly. Backups use pg_dump's custom archive format (restorable only with pg_restore, not psql), so files carry a .dump extension. Makefile target:
# Makefile
backup:
	@echo "Starting KaireonAI database backup..."
	@TIMESTAMP=$$(date +%Y%m%d_%H%M%S) && \
	BACKUP_FILE="kaireon_backup_$${TIMESTAMP}.dump" && \
	pg_dump "$(DATABASE_URL)" \
		--format=custom \
		--compress=9 \
		--verbose \
		--file="/tmp/$${BACKUP_FILE}" && \
	aws s3 cp "/tmp/$${BACKUP_FILE}" \
		"s3://kaireon-backups/daily/$${BACKUP_FILE}" \
		--storage-class STANDARD_IA && \
	rm -f "/tmp/$${BACKUP_FILE}" && \
	echo "Backup complete: s3://kaireon-backups/daily/$${BACKUP_FILE}"
Manual execution:
# Full database backup
make backup

# Or run directly:
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
pg_dump "$DATABASE_URL" \
  --format=custom \
  --compress=9 \
  --verbose \
  --file="/tmp/kaireon_backup_${TIMESTAMP}.sql.gz"

# Upload to S3
aws s3 cp "/tmp/kaireon_backup_${TIMESTAMP}.sql.gz" \
  "s3://kaireon-backups/daily/" \
  --storage-class STANDARD_IA
Schema-only backup (for migration reference):
pg_dump "$DATABASE_URL" \
  --schema-only \
  --file="/tmp/kaireon_schema_$(date +%Y%m%d).sql"
Table-specific backup:
pg_dump "$DATABASE_URL" \
  --format=custom \
  --compress=9 \
  --table=decision_logs \
  --file="/tmp/decision_logs_$(date +%Y%m%d).sql.gz"

RDS Snapshot

aws rds create-db-snapshot \
  --db-instance-identifier kaireon-prod \
  --db-snapshot-identifier "kaireon-manual-$(date +%Y%m%d-%H%M%S)"

Backup Verification

Run monthly to ensure backups are valid:
# Download the latest backup
LATEST=$(aws s3 ls s3://kaireon-backups/daily/ --recursive | sort | tail -1 | awk '{print $4}')
aws s3 cp "s3://kaireon-backups/$LATEST" /tmp/verify_backup.sql.gz

# Restore to a test database
createdb kaireon_verify
pg_restore --dbname=kaireon_verify --verbose /tmp/verify_backup.dump

# Verify row counts
psql kaireon_verify -c "SELECT schemaname, relname, n_live_tup FROM pg_stat_user_tables ORDER BY n_live_tup DESC LIMIT 10;"

# Clean up
dropdb kaireon_verify
rm /tmp/verify_backup.dump

2. Restore

Restore from pg_dump Backup

Use the make restore target or run the equivalent commands directly. Makefile target:
# Makefile
restore:
	@if [ -z "$(BACKUP_FILE)" ]; then \
		echo "Usage: make restore BACKUP_FILE=s3://kaireon-backups/daily/kaireon_backup_20260223.sql.gz"; \
		exit 1; \
	fi
	@echo "WARNING: This will overwrite the target database. Press Ctrl+C to abort."
	@sleep 5
	@aws s3 cp "$(BACKUP_FILE)" /tmp/restore_backup.dump
	@pg_restore \
		--dbname="$(DATABASE_URL)" \
		--clean \
		--if-exists \
		--verbose \
		/tmp/restore_backup.dump
	@rm -f /tmp/restore_backup.dump
	@echo "Restore complete."
Manual execution:
# Download backup
aws s3 cp "s3://kaireon-backups/daily/kaireon_backup_20260223.sql.gz" /tmp/restore.sql.gz

# Option A: Restore to existing database (destructive)
pg_restore \
  --dbname="$DATABASE_URL" \
  --clean \
  --if-exists \
  --verbose \
  /tmp/restore.dump

# Option B: Restore to a new database
createdb kaireon_restored
pg_restore \
  --dbname=kaireon_restored \
  --verbose \
  /tmp/restore.dump

Restore from RDS Snapshot

# List available snapshots
aws rds describe-db-snapshots \
  --db-instance-identifier kaireon-prod \
  --query 'DBSnapshots[*].[DBSnapshotIdentifier,SnapshotCreateTime]' \
  --output table

# Restore to a new instance
aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier kaireon-restored \
  --db-snapshot-identifier kaireon-manual-20260223-120000 \
  --db-instance-class db.r6g.xlarge \
  --vpc-security-group-ids sg-xxxxxxxx

# Wait for the instance to become available
aws rds wait db-instance-available --db-instance-identifier kaireon-restored

Point-in-Time Recovery (RDS)

aws rds restore-db-instance-to-point-in-time \
  --source-db-instance-identifier kaireon-prod \
  --target-db-instance-identifier kaireon-pitr \
  --restore-time "2026-02-23T10:30:00Z" \
  --db-instance-class db.r6g.xlarge

aws rds wait db-instance-available --db-instance-identifier kaireon-pitr

Post-Restore Checklist

  1. Verify row counts match expectations.
  2. Run ANALYZE on all tables to update statistics.
  3. Verify application connectivity.
  4. Run a smoke test against the decision endpoint (a sketch follows the SQL below).
  5. Update DNS or connection strings if restoring to a new instance.
-- Post-restore ANALYZE
ANALYZE VERBOSE;

-- Verify key tables
SELECT 'blueprints' AS tbl, count(*) FROM blueprints
UNION ALL
SELECT 'offers', count(*) FROM offers
UNION ALL
SELECT 'decision_logs', count(*) FROM decision_logs
UNION ALL
SELECT 'customers', count(*) FROM customers;
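For checklist step 4, a minimal smoke test might look like the following; the endpoint path, host, and token variable are placeholders, not the real API contract:
# Hypothetical smoke test against the decision endpoint
curl -sf -X POST "https://api.kaireonai.com/v1/decisions" \
  -H "Authorization: Bearer $KAIREON_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"customer_id": "smoke-test"}' \
  && echo "decision endpoint OK" || echo "decision endpoint FAILED"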

3. Manual VACUUM

When to VACUUM

  • Dead tuple ratio exceeds 20% on any table (a quick check is sketched below).
  • After large bulk deletes or updates.
  • Table bloat exceeds 50% (a candidate for VACUUM FULL; see below).
  • Autovacuum is lagging (check pg_stat_user_tables.last_autovacuum).
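A quick shell check for the first and last conditions:
# Tables with the highest dead-tuple ratios, plus autovacuum recency
psql "$DATABASE_URL" -c "
SELECT relname,
       n_dead_tup,
       ROUND(100.0 * n_dead_tup / NULLIF(n_live_tup + n_dead_tup, 0), 2) AS dead_pct,
       last_autovacuum
FROM pg_stat_user_tables
WHERE n_live_tup + n_dead_tup > 0
ORDER BY dead_pct DESC NULLS LAST
LIMIT 10;"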

Standard VACUUM

Does not lock the table. Safe to run during production hours.
-- VACUUM a specific table with verbose output
VACUUM (VERBOSE) decision_logs;
VACUUM (VERBOSE) audit_events;
VACUUM (VERBOSE) customers;

-- VACUUM and update statistics
VACUUM (VERBOSE, ANALYZE) decision_logs;

-- VACUUM all tables
VACUUM (VERBOSE, ANALYZE);

VACUUM FULL

Reclaims disk space by rewriting the table. Acquires an exclusive lock. Schedule during maintenance windows only.
-- Check if VACUUM FULL is needed (bloat estimation)
SELECT
  schemaname,
  relname,
  pg_size_pretty(pg_relation_size(relid)) AS table_size,
  n_dead_tup,
  n_live_tup,
  ROUND(100.0 * n_dead_tup / NULLIF(n_live_tup + n_dead_tup, 0), 2) AS dead_pct,
  last_autovacuum,
  last_vacuum
FROM pg_stat_user_tables
WHERE n_dead_tup > 10000
ORDER BY n_dead_tup DESC;

-- Run VACUUM FULL (locks table, schedule in maintenance window)
VACUUM FULL VERBOSE decision_logs;

Autovacuum Tuning

-- Check autovacuum settings per table
SELECT relname, reloptions
FROM pg_class
WHERE relname IN ('decision_logs', 'audit_events', 'customers');

-- Set aggressive autovacuum for high-churn tables
ALTER TABLE decision_logs SET (
  autovacuum_vacuum_threshold = 1000,
  autovacuum_vacuum_scale_factor = 0.01,
  autovacuum_analyze_threshold = 500,
  autovacuum_analyze_scale_factor = 0.005
);

ALTER TABLE audit_events SET (
  autovacuum_vacuum_threshold = 1000,
  autovacuum_vacuum_scale_factor = 0.01
);

Monitor Autovacuum Progress

-- Check running autovacuum processes
SELECT pid, datname, relid::regclass, phase,
  heap_blks_total, heap_blks_scanned, heap_blks_vacuumed,
  index_vacuum_count, max_dead_tuples, num_dead_tuples
FROM pg_stat_progress_vacuum;

-- Check autovacuum worker count
SELECT count(*) FROM pg_stat_activity WHERE backend_type = 'autovacuum worker';

4. Index Maintenance

Check Index Health

-- Index usage statistics (find unused indexes)
SELECT
  schemaname,
  relname,
  indexrelname,
  pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
  idx_scan,
  idx_tup_read,
  idx_tup_fetch
FROM pg_stat_user_indexes
ORDER BY idx_scan ASC, pg_relation_size(indexrelid) DESC;

-- Find duplicate indexes
SELECT
  indrelid::regclass AS table_name,
  array_agg(indexrelid::regclass) AS duplicate_indexes,
  count(*)
FROM pg_index
GROUP BY indrelid, indkey
HAVING count(*) > 1;

-- Index bloat estimation
SELECT
  schemaname,
  relname,
  indexrelname,
  pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
  idx_scan
FROM pg_stat_user_indexes
WHERE pg_relation_size(indexrelid) > 10485760  -- > 10MB
ORDER BY pg_relation_size(indexrelid) DESC;

Rebuild Indexes

Use REINDEX CONCURRENTLY to avoid blocking reads and writes; it still takes brief locks when it starts and finishes.
-- Rebuild a specific index (non-blocking)
REINDEX INDEX CONCURRENTLY idx_decision_logs_customer_id;

-- Rebuild all indexes on a table (non-blocking)
REINDEX TABLE CONCURRENTLY decision_logs;

-- Rebuild all indexes in the database (maintenance window recommended)
REINDEX DATABASE CONCURRENTLY kaireon;

Create Missing Indexes

Common queries that benefit from indexes:
-- Decision log lookups by customer
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_decision_logs_customer_created
ON decision_logs (customer_id, created_at DESC);

-- Audit event lookups by entity
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_audit_events_entity
ON audit_events (entity_type, entity_id, created_at DESC);

-- Offer lookups by status
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_offers_status
ON offers (status) WHERE status = 'ACTIVE';

Drop Unused Indexes

-- Confirm index is truly unused (check over 30+ days)
-- Then drop it:
DROP INDEX CONCURRENTLY IF EXISTS idx_old_unused_index;
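One way to establish "unused over 30+ days" is to snapshot idx_scan now and diff later; a sketch (the CSV path is an assumption, and counters reset on pg_stat_reset() or crash recovery):
# Record current index scan counts; re-run after 30 days and compare
psql "$DATABASE_URL" -At -F',' -c "
SELECT now()::date, schemaname, indexrelname, idx_scan
FROM pg_stat_user_indexes;" >> /tmp/kaireon_index_scans.csv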

5. Schema Migrations

Using Prisma db push

KaireonAI uses Prisma 7 with db push for schema synchronization. Standard migration workflow:
cd platform

# 1. Edit schema.prisma with the desired changes

# 2. Generate Prisma client (validates schema)
npx prisma generate

# 3. Dev only: force-sync a development database
#    (DESTRUCTIVE: --force-reset drops and recreates the schema; never run in production)
npx prisma db push --accept-data-loss --force-reset

# 4. Apply to production
npx prisma db push
Pre-migration checklist:
  1. Back up the database (see Backup); a schema-capture sketch follows this list.
  2. Review the schema diff carefully.
  3. Check for data loss warnings.
  4. Run in staging first.
  5. Coordinate with the team (announce in #kaireon-deployments).
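A minimal pre-migration capture for steps 1 and 2; the migrate diff flags and schema path are assumptions to verify against your Prisma version:
# Capture the pre-migration schema; the Rollback Procedure below applies this file
pg_dump "$DATABASE_URL" \
  --schema-only \
  --clean --if-exists \
  --file=/tmp/pre_migration_schema.sql

# Preview pending changes as SQL
npx prisma migrate diff \
  --from-url "$DATABASE_URL" \
  --to-schema-datamodel prisma/schema.prisma \
  --script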
Post-migration checklist:
  1. Run ANALYZE on affected tables.
  2. Verify application starts without errors.
  3. Check that all API endpoints return valid responses.
  4. Monitor error rates for 15 minutes.

Manual SQL Migrations

For changes that Prisma cannot handle (data migrations, custom constraints):
# Create a migration file
mkdir -p migrations
cat > migrations/$(date +%Y%m%d%H%M%S)_description.sql << 'EOF'
-- Migration: description
-- Author: <name>
-- Date: <date>

BEGIN;

-- Your SQL here
ALTER TABLE decision_logs ADD COLUMN response_time_ms INTEGER;

-- Backfill if needed
UPDATE decision_logs SET response_time_ms = 0 WHERE response_time_ms IS NULL;

-- Add constraint after backfill
ALTER TABLE decision_logs ALTER COLUMN response_time_ms SET NOT NULL;

COMMIT;
EOF

# Apply the migration
psql "$DATABASE_URL" -f migrations/<filename>.sql

Rollback Procedure

# 1. Restore the schema from the pre-migration dump
#    (plain SQL from pg_dump --schema-only --clean --if-exists, so apply with psql)
psql "$DATABASE_URL" -f /tmp/pre_migration_schema.sql

# 2. Or write a manual rollback script
psql "$DATABASE_URL" << 'EOF'
BEGIN;
ALTER TABLE decision_logs DROP COLUMN IF EXISTS response_time_ms;
COMMIT;
EOF

# 3. Regenerate Prisma client from rolled-back schema
cd platform && npx prisma generate

6. Monitoring Queries

Dashboard Queries

Run these queries periodically or integrate them into Grafana via the PostgreSQL data source. Database size and growth:
SELECT
  pg_size_pretty(pg_database_size('kaireon')) AS total_size;

SELECT
  schemaname,
  tablename,
  pg_size_pretty(pg_total_relation_size(schemaname || '.' || tablename)) AS total_size,
  pg_size_pretty(pg_relation_size(schemaname || '.' || tablename)) AS table_size,
  pg_size_pretty(pg_indexes_size(schemaname || '.' || tablename)) AS index_size
FROM pg_tables
WHERE schemaname = 'public'
ORDER BY pg_total_relation_size(schemaname || '.' || tablename) DESC
LIMIT 20;
Active connections:
SELECT
  count(*) AS total,
  count(*) FILTER (WHERE state = 'active') AS active,
  count(*) FILTER (WHERE state = 'idle') AS idle,
  count(*) FILTER (WHERE state = 'idle in transaction') AS idle_in_txn,
  count(*) FILTER (WHERE wait_event_type IS NOT NULL) AS waiting
FROM pg_stat_activity
WHERE datname = 'kaireon';
Slow queries (current):
SELECT
  pid,
  now() - query_start AS duration,
  state,
  LEFT(query, 200) AS query_preview
FROM pg_stat_activity
WHERE state != 'idle'
  AND datname = 'kaireon'
  AND now() - query_start > interval '5 seconds'
ORDER BY duration DESC;
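If a runaway query must be stopped, cancel it first and terminate only as a last resort (substitute the offending pid from the query above):
# Cancel the running query but keep the backend alive
psql "$DATABASE_URL" -c "SELECT pg_cancel_backend(12345);"
# Terminate the backend entirely if cancel has no effect
psql "$DATABASE_URL" -c "SELECT pg_terminate_backend(12345);"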
Table statistics:
SELECT
  relname,
  seq_scan,
  seq_tup_read,
  idx_scan,
  idx_tup_fetch,
  n_tup_ins,
  n_tup_upd,
  n_tup_del,
  n_live_tup,
  n_dead_tup,
  last_vacuum,
  last_autovacuum,
  last_analyze,
  last_autoanalyze
FROM pg_stat_user_tables
ORDER BY n_live_tup DESC;
Cache hit ratio (should be >99%):
SELECT
  'index' AS type,
  sum(idx_blks_hit) AS hits,
  sum(idx_blks_read) AS reads,
  ROUND(100.0 * sum(idx_blks_hit) / NULLIF(sum(idx_blks_hit) + sum(idx_blks_read), 0), 2) AS hit_ratio
FROM pg_statio_user_indexes
UNION ALL
SELECT
  'table',
  sum(heap_blks_hit),
  sum(heap_blks_read),
  ROUND(100.0 * sum(heap_blks_hit) / NULLIF(sum(heap_blks_hit) + sum(heap_blks_read), 0), 2)
FROM pg_statio_user_tables;
Lock contention (relation-level approximation; pg_blocking_pids() identifies exact blockers):
SELECT
  blocked.pid AS blocked_pid,
  blocked.query AS blocked_query,
  blocking.pid AS blocking_pid,
  blocking.query AS blocking_query,
  now() - blocked.query_start AS blocked_duration
FROM pg_stat_activity blocked
JOIN pg_locks bl ON bl.pid = blocked.pid
JOIN pg_locks bk ON bk.locktype = bl.locktype
  AND bk.database IS NOT DISTINCT FROM bl.database
  AND bk.relation IS NOT DISTINCT FROM bl.relation
  AND bk.page IS NOT DISTINCT FROM bl.page
  AND bk.tuple IS NOT DISTINCT FROM bl.tuple
  AND bk.pid != bl.pid
JOIN pg_stat_activity blocking ON bk.pid = blocking.pid
WHERE NOT bl.granted;
Replication lag (if read replicas exist):
-- On the primary
SELECT
  client_addr,
  state,
  sent_lsn,
  write_lsn,
  flush_lsn,
  replay_lsn,
  pg_size_pretty(pg_wal_lsn_diff(sent_lsn, replay_lsn)) AS replay_lag
FROM pg_stat_replication;

Alerting Thresholds

Metric                 Warning        Critical
Cache hit ratio        <99%           <95%
Dead tuple ratio       >10%           >30%
Active connections     >70% of max    >90% of max
Replication lag        >1 MB          >10 MB
Long-running queries   >30 s          >120 s
Idle in transaction    >60 s          >300 s
Database size growth   >5%/day        >10%/day
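As an illustration, a cron-style shell check for the cache-hit-ratio row (thresholds hard-coded from the table; alert delivery is left as a stub):
#!/usr/bin/env bash
# Evaluate the table cache hit ratio against the thresholds above
RATIO=$(psql "$DATABASE_URL" -At -c "
SELECT ROUND(100.0 * sum(heap_blks_hit) /
       NULLIF(sum(heap_blks_hit) + sum(heap_blks_read), 0), 2)
FROM pg_statio_user_tables;")
: "${RATIO:=100}"   # default when no I/O stats exist yet

if (( $(echo "$RATIO < 95" | bc -l) )); then
  echo "CRITICAL: table cache hit ratio ${RATIO}%"   # route to paging
elif (( $(echo "$RATIO < 99" | bc -l) )); then
  echo "WARNING: table cache hit ratio ${RATIO}%"    # route to chat alerts
fi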

7. Connection Pool Tuning (PgBouncer)

Current Configuration

PgBouncer sits between the KaireonAI application and PostgreSQL, multiplexing connections. Architecture:
KaireonAI API pods (N replicas)
    |
    v
PgBouncer (transaction pooling)
    |
    v
PostgreSQL (max_connections=200)

Configuration File

Located in the PgBouncer ConfigMap: kaireon-pgbouncer-config
; /etc/pgbouncer/pgbouncer.ini

[databases]
kaireon = host=kaireon-prod.xxxxxxxx.us-east-1.rds.amazonaws.com port=5432 dbname=kaireon

[pgbouncer]
; Connection pooling mode
pool_mode = transaction

; Pool sizing
default_pool_size = 25
min_pool_size = 5
reserve_pool_size = 5
reserve_pool_timeout = 3

; Client limits
max_client_conn = 500
max_db_connections = 150

; Timeouts
server_idle_timeout = 600
server_connect_timeout = 15
server_login_retry = 15
client_idle_timeout = 0
client_login_timeout = 60
query_timeout = 300
query_wait_timeout = 120

; Logging
log_connections = 0
log_disconnections = 0
log_pooler_errors = 1
stats_period = 60

; Security
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt

; Admin
admin_users = pgbouncer_admin
stats_users = pgbouncer_stats

listen_addr = 0.0.0.0
listen_port = 6432

Pool Sizing Formula

default_pool_size = max_connections / (number_of_pgbouncer_instances * safety_factor)

Example:
  max_connections = 200
  pgbouncer_instances = 2
  safety_factor = 1.5 (leaves room for direct connections, replication, monitoring)

  default_pool_size = 200 / (2 * 1.5) = ~66

  We set 25 instead because:
  - 30 connections are reserved for replication, monitoring, and admin
  - (200 - 30) / 2 instances = 85 per instance is the hard ceiling
  - default_pool_size = 25 plus reserve_pool_size = 5 gives an effective max of 30 per instance
  - 30 * 2 = 60 server connections total, leaving ample headroom under max_db_connections = 150
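The same arithmetic as a shell sketch (values mirror this environment; adjust as needed):
# Per-instance server-connection ceiling after reserving admin/monitoring slots
MAX_CONNECTIONS=200
INSTANCES=2
RESERVED=30
echo "ceiling per instance: $(( (MAX_CONNECTIONS - RESERVED) / INSTANCES ))"   # 85
echo "configured effective max: $(( 25 + 5 ))"   # default_pool_size + reserve_pool_size = 30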

Monitoring PgBouncer

# Show pool statistics
kubectl exec -it deploy/kaireon-pgbouncer -- psql -p 6432 pgbouncer -c "SHOW POOLS;"

# Show active clients
kubectl exec -it deploy/kaireon-pgbouncer -- psql -p 6432 pgbouncer -c "SHOW CLIENTS;"

# Show server connections
kubectl exec -it deploy/kaireon-pgbouncer -- psql -p 6432 pgbouncer -c "SHOW SERVERS;"

# Show statistics
kubectl exec -it deploy/kaireon-pgbouncer -- psql -p 6432 pgbouncer -c "SHOW STATS;"

# Show configuration
kubectl exec -it deploy/kaireon-pgbouncer -- psql -p 6432 pgbouncer -c "SHOW CONFIG;"

Tuning for Common Scenarios

High API traffic (many short queries):
pool_mode = transaction
default_pool_size = 25
max_client_conn = 1000
server_idle_timeout = 300
Long-running analytics queries:
; Use a separate PgBouncer instance with session pooling
pool_mode = session
default_pool_size = 10
max_client_conn = 50
query_timeout = 3600
Burst traffic handling:
default_pool_size = 25
reserve_pool_size = 10
reserve_pool_timeout = 3
max_client_conn = 500

Applying Configuration Changes

# Edit the ConfigMap
kubectl edit configmap kaireon-pgbouncer-config

# Reload PgBouncer without restart (applies most settings)
kubectl exec -it deploy/kaireon-pgbouncer -- psql -p 6432 pgbouncer -c "RELOAD;"

# Verify new settings
kubectl exec -it deploy/kaireon-pgbouncer -- psql -p 6432 pgbouncer -c "SHOW CONFIG;" | grep pool_size

# If RELOAD is insufficient (e.g., listen_port changed), restart:
kubectl rollout restart deploy/kaireon-pgbouncer

Troubleshooting PgBouncer

Clients waiting for connections:
# Check waiting count
kubectl exec -it deploy/kaireon-pgbouncer -- psql -p 6432 pgbouncer -c \
  "SHOW POOLS;" | grep -v "^-"
# If cl_waiting > 0, increase default_pool_size or max_db_connections
Connection refused errors:
# Check if max_client_conn is reached
kubectl exec -it deploy/kaireon-pgbouncer -- psql -p 6432 pgbouncer -c \
  "SHOW LISTS;"
# If used_clients is near max_client_conn, increase it
Prepared statement errors in transaction mode:
# Transaction pooling does not support prepared statements natively.
# Either:
# 1. Disable prepared statements in Prisma:
#    Set ?pgbouncer=true in the connection string
# 2. Or switch to session pooling (reduces multiplexing benefit)
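For option 1, the connection string takes this shape (host, user, and password are placeholders):
# pgbouncer=true tells Prisma to skip prepared statements; point the URL at PgBouncer (port 6432)
DATABASE_URL="postgresql://kaireon_app:<password>@kaireon-pgbouncer:6432/kaireon?pgbouncer=true"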

Maintenance Schedule

Task                            Frequency   Window               Impact
Automated backup (RDS)          Daily       02:00-03:00 UTC      None
Manual backup verification      Monthly     Business hours       None
VACUUM ANALYZE (large tables)   Nightly     03:00-04:00 UTC      None
VACUUM FULL (if needed)         Quarterly   Maintenance window   Table locked
REINDEX CONCURRENTLY            Monthly     Off-peak hours       Minimal
Index usage review              Quarterly   Business hours       None
Connection pool review          Quarterly   Business hours       None
Disk usage review               Weekly      Business hours       None
Statistics update (ANALYZE)     Nightly     03:00-04:00 UTC      None
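The nightly rows can be driven by cron, assuming DATABASE_URL is set in the crontab environment and the log paths exist:
# VACUUM ANALYZE high-churn tables at 03:10 UTC
10 3 * * * psql "$DATABASE_URL" -c "VACUUM (ANALYZE) decision_logs; VACUUM (ANALYZE) audit_events;" >> /var/log/kaireon/vacuum.log 2>&1
# Refresh planner statistics database-wide at 03:40 UTC
40 3 * * * psql "$DATABASE_URL" -c "ANALYZE;" >> /var/log/kaireon/analyze.log 2>&1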