Production Checklist
Essential tasks before deploying to production.
Security Checklist
Secrets & Keys
-
JWT secrets generated - Use cryptographically secure random values (min 48 bytes)
openssl rand -base64 48 -
Encryption key generated - 32-byte hex key for token encryption
openssl rand -hex 32 -
Azure OpenAI key secured - Never commit to version control
-
Resend API key secured - Store in encrypted
.env.resend.production.enc -
Environment files excluded -
.envin.gitignore
Authentication
- JWT_ACCESS_EXPIRY set - Recommended: 15 minutes
- JWT_REFRESH_EXPIRY set - Recommended: 7 days
- Rate limiting configured - 10 login attempts per 15 minutes
- Password requirements enforced - Min 8 chars, mixed case, numbers, symbols
Network Security
- HTTPS enabled - TLS 1.3 for all connections
- CORS configured -
ALLOWED_ORIGINSwith specific domains - Helmet headers enabled - CSP, HSTS, X-Frame-Options
- WebSocket secured - WSS for Socket.io connections
Data Protection
- Tenant isolation enforced -
ENFORCE_TENANT_ISOLATION=true - Encryption at rest - MongoDB and Qdrant encryption
- PII masking enabled - Output scanning for sensitive data
- Audit logging enabled - Security events tracked
Infrastructure Checklist
Database
- MongoDB replica set - For high availability
- MongoDB authentication - Username/password or SCRAM
- MongoDB indexes created - Performance optimization
npm run db:indexes - Backup strategy defined - Automated daily backups
Redis
- Redis password set - AUTH required
- Redis persistence configured - RDB or AOF
- Memory limits set -
maxmemorypolicy
Qdrant
- Qdrant API key set - Authentication enabled
- Collection created - With correct dimensions
- Backup strategy defined - Snapshot schedule
Container/Server
- Non-root user - Containers run as unprivileged user
- Resource limits - CPU and memory constraints
- Health checks - Kubernetes/Docker health probes
- Log rotation - Prevent disk exhaustion
Application Checklist
Configuration
- NODE_ENV=production - Production mode enabled
- Debug logging disabled -
LOG_LEVEL=infoorwarn - Retrieval tracing disabled -
LOG_RETRIEVAL_TRACE=false - Cache enabled -
RAG_CACHE_ENABLED=true
LLM Configuration
- Azure OpenAI configured - All required variables set
- Timeouts configured - LLM and streaming timeouts
- Guardrails enabled - Hallucination blocking active
- Fallback messages - User-friendly error messages
Email Service
- Resend API key configured -
RESEND_API_KEYset in production env - Domain verified in Resend - DKIM, SPF, and DMARC DNS records configured for sender domain
- From email set -
RESEND_FROM_EMAILmatches a verified Resend domain - Email flows tested - Password reset, email verification, workspace invitation, welcome emails all sending correctly
Sync & Workers
- Worker processes running - All BullMQ workers active (assessment, questionnaire, monitoring)
- Stale job recovery -
STALE_JOB_TIMEOUT_HOURSconfigured - Monitoring worker scheduled -
MONITORING_INTERVAL_HOURS=24set; repeatable job registered at startup - Institution name configured -
INSTITUTION_NAMEset for RoI export RT.01.01 sheet
Monitoring Checklist
Logging
- Structured logging - JSON format for aggregation
- Log aggregation - Centralized logging (ELK, Datadog, etc.)
- Error tracking - Sentry or similar service
- Request tracing - Correlation IDs for debugging
Metrics
- Health endpoint -
/healthreturning status - API metrics - Response times, error rates
- LLM metrics - Token usage, latencies
- Queue metrics - Job processing rates
Alerting
- Error rate alerts - High error rate notifications
- Latency alerts - Slow response warnings
- Disk/memory alerts - Resource exhaustion warnings
Performance Checklist
Optimization
- Embedding concurrency -
EMBEDDING_MAX_CONCURRENCYtuned - Chunk filtering -
ENABLE_CHUNK_FILTER=true - Re-ranking enabled -
ENABLE_CROSS_ENCODER_RERANK=true - Context expansion -
ENABLE_CONTEXT_EXPANSION=true
Caching
- RAG cache enabled - Response caching active
- Re-rank cache enabled -
RERANK_CACHE_TTLconfigured - CDN configured - Static asset caching
Database Indexes
Ensure these indexes exist:
// User
db.users.createIndex({ email: 1 }, { unique: true });
// Conversation
db.conversations.createIndex({ workspaceId: 1, userId: 1 });
db.conversations.createIndex({ updatedAt: -1 });
// Message
db.messages.createIndex({ conversationId: 1, createdAt: 1 });
db.messages.createIndex({ createdAt: 1 }, { expireAfterSeconds: 7776000 }); // 90 days
Deployment Checklist
Pre-Deployment
- Dependencies updated -
npm auditpassed - Tests passing - All unit and integration tests
- Build successful - Production build completes
- Environment validated - All required variables set
Deployment
- Blue-green or canary - Zero-downtime deployment
- Database migrations - Schema updates applied
- Cache cleared - Stale cache invalidated
- Workers restarted - Background jobs processing
Post-Deployment
- Health check passing -
/healthreturns 200 - Smoke tests passed - Critical paths verified
- Logs reviewed - No unexpected errors
- Metrics baseline - Performance within expectations
Backup & Recovery
Backup Strategy
- MongoDB backups - Daily snapshots, 30-day retention
- Redis backups - RDB snapshots
- Qdrant backups - Collection snapshots
- Secrets backup - Encrypted, off-site storage
Recovery Plan
- RTO defined - Recovery Time Objective documented
- RPO defined - Recovery Point Objective documented
- Recovery tested - Quarterly restore tests
- Runbooks created - Step-by-step recovery procedures
Documentation
- Architecture documented - System design docs
- Runbooks created - Operational procedures
- API documented - Swagger/OpenAPI specs
- Incident response - Escalation procedures
Compliance
- GDPR compliance - Data processing documented
- Data retention - TTL policies implemented (Analytics auto-delete after 90 days)
- Audit trail - Access logging enabled
- Privacy policy - User-facing documentation
- DORA knowledge base seeded -
npm run seed:compliancerun in production environment - RoI export tested -
GET /api/v1/workspaces/roi-exportreturns valid XLSX with 4 sheets - Monitoring alerts tested - Email delivery confirmed for at least one alert type