diff --git a/docs/plans/2026-02-14-monitoring-dashboard-plan.md b/docs/plans/2026-02-14-monitoring-dashboard-plan.md new file mode 100644 index 0000000..7d8ecdd --- /dev/null +++ b/docs/plans/2026-02-14-monitoring-dashboard-plan.md @@ -0,0 +1,1236 @@ +# Monitoring & Alerting Dashboard - Implementation Plan + +> **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task. + +**Goal:** Build a real-time Monitoring & Alerting Dashboard in the Payload Admin Panel with system health checks, service monitoring, performance tracking, configurable alerts, and structured log viewing. + +**Architecture:** Event-driven with SSE real-time stream + REST endpoints. A SnapshotCollector runs in the queue-worker PM2 process every 60s, collecting metrics from OS, PostgreSQL, PgBouncer, Redis, SMTP, OAuth, and BullMQ. An AlertEvaluator checks metrics against configurable rules stored in a Payload Collection. The dashboard is a 5-tab Custom Admin View using the same patterns as YouTubeAnalyticsDashboard. + +**Tech Stack:** Payload CMS 3.76.1, Next.js 16 (App Router), React 19, TypeScript, Node.js `os` module, SSE via ReadableStream, BullMQ, PostgreSQL, Redis. 
+ +**Design Doc:** `docs/plans/2026-02-14-monitoring-dashboard-design.md` + +--- + +## Phase 1: Types & Collections (Foundation) + +### Task 1: Shared Types + +**Files:** +- Create: `src/lib/monitoring/types.ts` +- Test: `tests/unit/monitoring/types.test.ts` + +**Step 1: Write types test** + +```typescript +// tests/unit/monitoring/types.test.ts +import { describe, it, expect } from 'vitest' +import type { + SystemHealth, + ServiceStatus, + OAuthStatus, + CronStatus, + QueueStatus, + PerformanceMetrics, + SystemMetrics, + MonitoringEvent, +} from '@/lib/monitoring/types' + +describe('Monitoring Types', () => { + it('SystemMetrics has all required sections', () => { + const metrics: SystemMetrics = { + timestamp: new Date().toISOString(), + system: { + cpuUsagePercent: 23, + memoryUsedMB: 4200, + memoryTotalMB: 8192, + memoryUsagePercent: 51.3, + diskUsedGB: 30, + diskTotalGB: 50, + diskUsagePercent: 60, + loadAvg1: 0.5, + loadAvg5: 0.8, + uptime: 1209600, + }, + services: { + payload: { status: 'online', pid: 1234, memoryMB: 512, uptimeSeconds: 86400, restarts: 0 }, + queueWorker: { status: 'online', pid: 5678, memoryMB: 256, uptimeSeconds: 86400, restarts: 0 }, + postgresql: { status: 'online', connections: 12, maxConnections: 50, latencyMs: 2 }, + pgbouncer: { status: 'online', activeConnections: 8, waitingClients: 0, poolSize: 20 }, + redis: { status: 'online', memoryUsedMB: 48, connectedClients: 5, opsPerSec: 120 }, + }, + external: { + smtp: { status: 'online', lastCheck: new Date().toISOString(), responseTimeMs: 180 }, + metaOAuth: { status: 'ok', tokensTotal: 2, tokensExpiringSoon: 1, tokensExpired: 0 }, + youtubeOAuth: { status: 'ok', tokensTotal: 3, tokensExpiringSoon: 0, tokensExpired: 0 }, + cronJobs: { + communitySync: { lastRun: new Date().toISOString(), status: 'ok' }, + tokenRefresh: { lastRun: new Date().toISOString(), status: 'ok' }, + youtubeSync: { lastRun: new Date().toISOString(), status: 'ok' }, + }, + }, + performance: { avgResponseTimeMs: 
120, p95ResponseTimeMs: 350, p99ResponseTimeMs: 800, errorRate: 0.02, requestsPerMinute: 45 }, + } + expect(metrics.system.cpuUsagePercent).toBe(23) + expect(metrics.services.payload.status).toBe('online') + expect(metrics.external.smtp.status).toBe('online') + expect(metrics.performance.avgResponseTimeMs).toBe(120) + }) + + it('MonitoringEvent types are exhaustive', () => { + const events: MonitoringEvent['type'][] = ['health', 'service', 'alert', 'log', 'performance'] + expect(events).toHaveLength(5) + }) +}) +``` + +**Step 2: Run test — expect FAIL** (types don't exist yet) + +```bash +pnpm test tests/unit/monitoring/types.test.ts +``` + +**Step 3: Implement types** + +Create `src/lib/monitoring/types.ts` with all interfaces: +- `SystemHealth` (CPU, RAM, Disk, Load, Uptime) +- `ProcessStatus` (status, pid, memoryMB, uptimeSeconds, restarts) +- `PostgresqlStatus`, `PgBouncerStatus`, `RedisStatus` +- `SmtpStatus`, `OAuthTokenStatus`, `CronJobStatus` +- `ServiceStatuses` (all services combined) +- `ExternalStatuses` (SMTP, OAuth, Cron) +- `PerformanceMetrics` (avg, p95, p99, errorRate, rpm) +- `SystemMetrics` (the full snapshot object) +- `MonitoringEvent` (discriminated union for SSE events: health | service | alert | log | performance) +- `AlertCondition` = 'gt' | 'lt' | 'eq' | 'gte' | 'lte' +- `AlertSeverity` = 'warning' | 'error' | 'critical' +- `LogLevel` = 'debug' | 'info' | 'warn' | 'error' | 'fatal' +- `LogSource` = 'payload' | 'queue-worker' | 'cron' | 'email' | 'oauth' | 'sync' + +**Step 4: Run test — expect PASS** + +```bash +pnpm test tests/unit/monitoring/types.test.ts +``` + +**Step 5: Commit** + +```bash +git add src/lib/monitoring/types.ts tests/unit/monitoring/types.test.ts +git commit -m "feat(monitoring): add shared types for monitoring system" +``` + +--- + +### Task 2: MonitoringSnapshots Collection + +**Files:** +- Create: `src/collections/MonitoringSnapshots.ts` +- Modify: `src/payload.config.ts` (add to collections array) +- Modify: 
`src/lib/access/index.ts` (add monitoring access) + +**Step 1: Add monitoring access control** + +In `src/lib/access/index.ts`, add: +```typescript +export const monitoringAccess = { + read: superAdminOnly, + create: superAdminOnly, // Only system can create + update: denyAll, // Immutable snapshots + delete: superAdminOnly, // Retention cleanup only +} +``` + +**Step 2: Create MonitoringSnapshots collection** + +Pattern: Follow `AuditLogs.ts` structure. Use `admin.group: 'Monitoring'`. Fields use `type: 'group'` for nested objects and `type: 'json'` for service/external status objects. + +Key fields: +- `timestamp` (date, required, indexed) +- `system` group: cpuUsagePercent, memoryUsedMB, memoryTotalMB, memoryUsagePercent, diskUsedGB, diskTotalGB, diskUsagePercent, loadAvg1, loadAvg5, uptime (all `type: 'number'`) +- `services` group: payload, queueWorker, postgresql, pgbouncer, redis (all `type: 'json'`) +- `external` group: smtp, metaOAuth, youtubeOAuth, cronJobs (all `type: 'json'`) +- `performance` group: avgResponseTimeMs, p95ResponseTimeMs, p99ResponseTimeMs, errorRate, requestsPerMinute (all `type: 'number'`) + +**Step 3: Register in payload.config.ts** + +Add `MonitoringSnapshots` to the `collections` array (import + add). + +**Step 4: Commit** + +```bash +git add src/collections/MonitoringSnapshots.ts src/payload.config.ts src/lib/access/index.ts +git commit -m "feat(monitoring): add MonitoringSnapshots collection" +``` + +--- + +### Task 3: MonitoringLogs Collection + +**Files:** +- Create: `src/collections/MonitoringLogs.ts` +- Modify: `src/payload.config.ts` + +**Step 1: Create collection** + +Pattern: Like `AuditLogs.ts` — WORM (read + create only, no update/delete via UI). 
+ +Key fields: +- `level` (select: debug, info, warn, error, fatal — required) +- `source` (select: payload, queue-worker, cron, email, oauth, sync — required) +- `message` (text, required) +- `context` (json) +- `requestId` (text) +- `userId` (relationship → users) +- `tenant` (relationship → tenants) +- `duration` (number, min: 0) + +Admin config: `group: 'Monitoring'`, `defaultColumns: ['level', 'source', 'message', 'createdAt']`, `useAsTitle: 'message'`. + +**Step 2: Register in payload.config.ts** + +**Step 3: Commit** + +```bash +git add src/collections/MonitoringLogs.ts src/payload.config.ts +git commit -m "feat(monitoring): add MonitoringLogs collection" +``` + +--- + +### Task 4: MonitoringAlertRules Collection + +**Files:** +- Create: `src/collections/MonitoringAlertRules.ts` +- Modify: `src/payload.config.ts` + +**Step 1: Create collection** + +Access: Super-admin full CRUD. + +Key fields: +- `name` (text, required) +- `metric` (text, required — e.g. `system.cpuUsagePercent`) +- `condition` (select: gt, lt, eq, gte, lte — required) +- `threshold` (number, required) +- `severity` (select: warning, error, critical — required) +- `channels` (select, hasMany: true — email, slack, discord — required) +- `recipients` group: + - `emails` (array of text fields) + - `slackWebhook` (text) + - `discordWebhook` (text) +- `cooldownMinutes` (number, defaultValue: 15, min: 1) +- `enabled` (checkbox, defaultValue: true) +- `tenant` (relationship → tenants, optional) + +Admin: `group: 'Monitoring'`, `useAsTitle: 'name'`. 
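
Taken together, the collection might look roughly like this — a sketch only: the `superAdminOnly` import path, the plain string `options` form, and field ordering are assumptions to reconcile with the project's existing collections:

```typescript
// src/collections/MonitoringAlertRules.ts — sketch, align with real project conventions
import type { CollectionConfig } from 'payload'
import { superAdminOnly } from '@/lib/access' // assumed export location

export const MonitoringAlertRules: CollectionConfig = {
  slug: 'monitoring-alert-rules',
  admin: { group: 'Monitoring', useAsTitle: 'name' },
  access: { read: superAdminOnly, create: superAdminOnly, update: superAdminOnly, delete: superAdminOnly },
  fields: [
    { name: 'name', type: 'text', required: true },
    { name: 'metric', type: 'text', required: true }, // dot path, e.g. 'system.cpuUsagePercent'
    { name: 'condition', type: 'select', required: true, options: ['gt', 'lt', 'eq', 'gte', 'lte'] },
    { name: 'threshold', type: 'number', required: true },
    { name: 'severity', type: 'select', required: true, options: ['warning', 'error', 'critical'] },
    { name: 'channels', type: 'select', hasMany: true, required: true, options: ['email', 'slack', 'discord'] },
    {
      name: 'recipients',
      type: 'group',
      fields: [
        { name: 'emails', type: 'array', fields: [{ name: 'email', type: 'text' }] },
        { name: 'slackWebhook', type: 'text' },
        { name: 'discordWebhook', type: 'text' },
      ],
    },
    { name: 'cooldownMinutes', type: 'number', defaultValue: 15, min: 1 },
    { name: 'enabled', type: 'checkbox', defaultValue: true },
    { name: 'tenant', type: 'relationship', relationTo: 'tenants' },
  ],
}
```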
+ +**Step 2: Register in payload.config.ts** + +**Step 3: Commit** + +```bash +git add src/collections/MonitoringAlertRules.ts src/payload.config.ts +git commit -m "feat(monitoring): add MonitoringAlertRules collection" +``` + +--- + +### Task 5: MonitoringAlertHistory Collection + +**Files:** +- Create: `src/collections/MonitoringAlertHistory.ts` +- Modify: `src/payload.config.ts` + +**Step 1: Create collection** + +Access: Read for super-admin, create for system, update only `resolvedAt` and `acknowledgedBy`. + +Key fields: +- `rule` (relationship → monitoring-alert-rules) +- `metric` (text, required) +- `value` (number, required) +- `threshold` (number, required) +- `severity` (select: warning, error, critical — required) +- `message` (text, required) +- `channelsSent` (select, hasMany: email, slack, discord) +- `resolvedAt` (date, optional) +- `acknowledgedBy` (relationship → users, optional) + +Admin: `group: 'Monitoring'`, `useAsTitle: 'message'`, `defaultColumns: ['severity', 'metric', 'message', 'createdAt', 'acknowledgedBy']`. 
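
One way to express "update only `resolvedAt` and `acknowledgedBy`" is Payload's field-level access: allow `update` at the collection level for super-admins, then deny it on every other field. A sketch (field list abridged):

```typescript
// Sketch: lock all fields after create except the two acknowledgement fields
import type { Field } from 'payload'

const immutable = { update: () => false }

const fields: Field[] = [
  { name: 'metric', type: 'text', required: true, access: immutable },
  { name: 'value', type: 'number', required: true, access: immutable },
  { name: 'threshold', type: 'number', required: true, access: immutable },
  { name: 'message', type: 'text', required: true, access: immutable },
  // ...rule, severity, channelsSent: same `access: immutable`...
  { name: 'resolvedAt', type: 'date' },                                  // stays updatable
  { name: 'acknowledgedBy', type: 'relationship', relationTo: 'users' }, // stays updatable
]
```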
+ +**Step 2: Register in payload.config.ts** + +**Step 3: Commit** + +```bash +git add src/collections/MonitoringAlertHistory.ts src/payload.config.ts +git commit -m "feat(monitoring): add MonitoringAlertHistory collection" +``` + +--- + +### Task 6: Database Migration + +**Step 1: Create migration** + +```bash +pnpm payload migrate:create +``` + +**CRITICAL:** The migration MUST include `payload_locked_documents_rels` columns for ALL 4 new collections: + +```sql +ALTER TABLE "payload_locked_documents_rels" + ADD COLUMN IF NOT EXISTS "monitoring_snapshots_id" integer REFERENCES monitoring_snapshots(id) ON DELETE CASCADE; +ALTER TABLE "payload_locked_documents_rels" + ADD COLUMN IF NOT EXISTS "monitoring_logs_id" integer REFERENCES monitoring_logs(id) ON DELETE CASCADE; +ALTER TABLE "payload_locked_documents_rels" + ADD COLUMN IF NOT EXISTS "monitoring_alert_rules_id" integer REFERENCES monitoring_alert_rules(id) ON DELETE CASCADE; +ALTER TABLE "payload_locked_documents_rels" + ADD COLUMN IF NOT EXISTS "monitoring_alert_history_id" integer REFERENCES monitoring_alert_history(id) ON DELETE CASCADE; + +CREATE INDEX IF NOT EXISTS "payload_locked_documents_rels_monitoring_snapshots_idx" ON "payload_locked_documents_rels" ("monitoring_snapshots_id"); +CREATE INDEX IF NOT EXISTS "payload_locked_documents_rels_monitoring_logs_idx" ON "payload_locked_documents_rels" ("monitoring_logs_id"); +CREATE INDEX IF NOT EXISTS "payload_locked_documents_rels_monitoring_alert_rules_idx" ON "payload_locked_documents_rels" ("monitoring_alert_rules_id"); +CREATE INDEX IF NOT EXISTS "payload_locked_documents_rels_monitoring_alert_history_idx" ON "payload_locked_documents_rels" ("monitoring_alert_history_id"); +``` + +**Step 2: Review generated migration, add locked_documents_rels columns if missing** + +**Step 3: Run migration via direct DB connection** + +```bash +./scripts/db-direct.sh migrate +``` + +**Step 4: Generate import map** + +```bash +pnpm payload generate:importmap +``` + 
+**Step 5: Commit** + +```bash +git add src/migrations/ src/app/\(payload\)/importMap.js +git commit -m "feat(monitoring): add database migration for 4 monitoring collections" +``` + +--- + +## Phase 2: Backend Services + +### Task 7: MonitoringService — System Health + +**Files:** +- Create: `src/lib/monitoring/monitoring-service.ts` +- Test: `tests/unit/monitoring/monitoring-service.test.ts` + +**Step 1: Write test for checkSystemHealth()** + +```typescript +import { describe, it, expect } from 'vitest' +import { checkSystemHealth } from '@/lib/monitoring/monitoring-service' + +describe('MonitoringService', () => { + describe('checkSystemHealth', () => { + it('returns CPU, memory, disk, load, and uptime', async () => { + const health = await checkSystemHealth() + expect(health.cpuUsagePercent).toBeGreaterThanOrEqual(0) + expect(health.cpuUsagePercent).toBeLessThanOrEqual(100) + expect(health.memoryTotalMB).toBeGreaterThan(0) + expect(health.memoryUsedMB).toBeGreaterThan(0) + expect(health.memoryUsagePercent).toBeGreaterThanOrEqual(0) + expect(health.diskTotalGB).toBeGreaterThan(0) + expect(health.uptime).toBeGreaterThan(0) + expect(health.loadAvg1).toBeGreaterThanOrEqual(0) + }) + }) +}) +``` + +**Step 2: Run test — expect FAIL** + +**Step 3: Implement checkSystemHealth()** + +Use Node.js `os` module: `os.cpus()`, `os.totalmem()`, `os.freemem()`, `os.loadavg()`, `os.uptime()`. +For disk: use `child_process.execSync('df -B1 / | tail -1')` to get disk usage (Linux-only, which is fine — production is Linux). +For CPU: sample `/proc/stat` twice with 100ms delay to calculate usage percentage. 
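
That two-sample delta can be sketched as a pair of pure helpers. This is a sketch under two assumptions: the aggregate `cpu` line layout from proc(5) (`user nice system idle iowait irq softirq steal ...`) and the choice to count iowait as idle time:

```typescript
// Sketch: compute CPU usage from two /proc/stat samples taken ~100ms apart.
interface CpuSample {
  idle: number  // idle + iowait jiffies
  total: number // sum of all fields
}

export function parseCpuLine(stat: string): CpuSample {
  // First line of /proc/stat is the aggregate "cpu" row; drop the label, parse numbers.
  const fields = stat.split('\n')[0].trim().split(/\s+/).slice(1).map(Number)
  const idle = fields[3] + (fields[4] ?? 0)
  const total = fields.reduce((sum, n) => sum + n, 0)
  return { idle, total }
}

export function cpuUsagePercent(first: CpuSample, second: CpuSample): number {
  const totalDelta = second.total - first.total
  if (totalDelta <= 0) return 0 // no elapsed jiffies between samples
  const idleDelta = second.idle - first.idle
  return ((totalDelta - idleDelta) / totalDelta) * 100
}
```

`checkSystemHealth()` would then read `/proc/stat` (e.g. `fs.readFileSync`), wait ~100ms, read again, and feed both samples to `cpuUsagePercent()`.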
+ +**Step 4: Run test — expect PASS** + +**Step 5: Commit** + +```bash +git add src/lib/monitoring/monitoring-service.ts tests/unit/monitoring/monitoring-service.test.ts +git commit -m "feat(monitoring): add system health check (CPU, RAM, disk, load)" +``` + +--- + +### Task 8: MonitoringService — Service Checks + +**Files:** +- Modify: `src/lib/monitoring/monitoring-service.ts` +- Test: `tests/unit/monitoring/monitoring-service.test.ts` + +**Step 1: Write tests for service checks** + +Test `checkRedis()`, `checkPostgresql()`, `checkPgBouncer()`, `checkQueues()`. +These need mocking since they connect to external services: + +```typescript +import { vi } from 'vitest' + +describe('checkRedis', () => { + it('returns redis status with memory and client info', async () => { + // Mock redis.info() response + const result = await checkRedis() + expect(result).toHaveProperty('status') + expect(result).toHaveProperty('memoryUsedMB') + expect(result).toHaveProperty('connectedClients') + expect(result).toHaveProperty('opsPerSec') + }) +}) +``` + +**Step 2: Implement service checks** + +- `checkRedis()`: Use `redis.info()` from `src/lib/redis.ts`, parse `used_memory`, `connected_clients`, `instantaneous_ops_per_sec` +- `checkPostgresql()`: Direct query `SELECT count(*) FROM pg_stat_activity` + `SELECT 1` latency test via `./scripts/db-direct.sh` or Payload's DB adapter +- `checkPgBouncer()`: Query `SHOW POOLS` via PgBouncer admin connection (127.0.0.1:6432) +- `checkQueues()`: Use BullMQ `Queue.getJobCounts()` for email, pdf, retention queues +- `checkSmtp()`: Create SMTP transporter and call `verify()` with timeout +- `checkOAuthTokens()`: Query `social-accounts` collection for expiring tokens (< 7 days) +- `checkCronJobs()`: Check audit-logs/monitoring-logs for recent cron executions + +**Step 3: Add `collectMetrics()` that calls all checks with `Promise.allSettled()`** + +**Step 4: Run tests — expect PASS** + +**Step 5: Commit** + +```bash +git add 
src/lib/monitoring/monitoring-service.ts tests/unit/monitoring/monitoring-service.test.ts +git commit -m "feat(monitoring): add service checks (Redis, PostgreSQL, PgBouncer, queues, SMTP, OAuth)" +``` + +--- + +### Task 9: PerformanceTracker + +**Files:** +- Create: `src/lib/monitoring/performance-tracker.ts` +- Test: `tests/unit/monitoring/performance-tracker.test.ts` + +**Step 1: Write test** + +```typescript +describe('PerformanceTracker', () => { + it('tracks requests and computes metrics', () => { + const tracker = new PerformanceTracker(1000) // 1000-entry ring buffer + tracker.track('GET', '/api/posts', 200, 120) + tracker.track('GET', '/api/posts', 200, 250) + tracker.track('GET', '/api/posts', 500, 800) + + const metrics = tracker.getMetrics('1h') + expect(metrics.avgResponseTimeMs).toBeCloseTo(390, 0) + expect(metrics.errorRate).toBeCloseTo(0.333, 2) + expect(metrics.requestsPerMinute).toBeGreaterThan(0) + expect(metrics.p95ResponseTimeMs).toBeGreaterThanOrEqual(metrics.avgResponseTimeMs) + }) + + it('ring buffer evicts old entries', () => { + const tracker = new PerformanceTracker(2) // tiny buffer + tracker.track('GET', '/a', 200, 100) + tracker.track('GET', '/b', 200, 200) + tracker.track('GET', '/c', 200, 300) + + const metrics = tracker.getMetrics('1h') + // Only last 2 entries should remain + expect(metrics.avgResponseTimeMs).toBeCloseTo(250, 0) + }) +}) +``` + +**Step 2: Run test — expect FAIL** + +**Step 3: Implement PerformanceTracker** + +- Class with ring buffer (fixed-size array + pointer) +- Each entry: `{ timestamp, method, path, statusCode, durationMs }` +- `track()`: Add to ring buffer +- `getMetrics(period)`: Filter by time window, compute avg/p95/p99/errorRate/rpm +- Export singleton instance: `export const performanceTracker = new PerformanceTracker(10_000)` + +**Step 4: Run test — expect PASS** + +**Step 5: Commit** + +```bash +git add src/lib/monitoring/performance-tracker.ts tests/unit/monitoring/performance-tracker.test.ts +git 
commit -m "feat(monitoring): add performance tracker with ring buffer" +``` + +--- + +### Task 10: MonitoringLogger + +**Files:** +- Create: `src/lib/monitoring/monitoring-logger.ts` +- Test: `tests/unit/monitoring/monitoring-logger.test.ts` + +**Step 1: Write test** + +```typescript +describe('MonitoringLogger', () => { + it('creates logger with source and logs to collection', async () => { + const logger = createMonitoringLogger('cron') + // Mock payload.create + await logger.info('Cron job completed', { jobName: 'community-sync', duration: 3500 }) + // Verify payload.create was called with correct args + }) + + it('respects minimum log level from env', async () => { + // MONITORING_LOG_LEVEL=warn → info/debug should not write to DB + }) +}) +``` + +**Step 2: Implement MonitoringLogger** + +- `createMonitoringLogger(source: LogSource)` factory function +- Returns object with `debug()`, `info()`, `warn()`, `error()`, `fatal()` methods +- Each method calls `payload.create({ collection: 'monitoring-logs', data: { level, source, message, context, ... 
} })` +- Respects `MONITORING_LOG_LEVEL` env var (default: 'info') +- Falls back to `console.log` if Payload is not initialized (startup phase) +- Non-blocking: fire-and-forget with `.catch(console.error)` + +**Step 3: Run test — expect PASS** + +**Step 4: Commit** + +```bash +git add src/lib/monitoring/monitoring-logger.ts tests/unit/monitoring/monitoring-logger.test.ts +git commit -m "feat(monitoring): add structured monitoring logger" +``` + +--- + +### Task 11: AlertEvaluator + +**Files:** +- Create: `src/lib/monitoring/alert-evaluator.ts` +- Test: `tests/unit/monitoring/alert-evaluator.test.ts` + +**Step 1: Write test** + +```typescript +describe('AlertEvaluator', () => { + it('fires alert when metric exceeds threshold (gt)', () => { + const rule = { metric: 'system.cpuUsagePercent', condition: 'gt', threshold: 80, severity: 'warning' } + const metrics = { system: { cpuUsagePercent: 92 } } + expect(evaluateCondition(rule, getMetricValue(metrics, rule.metric))).toBe(true) + }) + + it('does not fire when metric is below threshold', () => { + const rule = { metric: 'system.cpuUsagePercent', condition: 'gt', threshold: 80 } + const metrics = { system: { cpuUsagePercent: 45 } } + expect(evaluateCondition(rule, getMetricValue(metrics, rule.metric))).toBe(false) + }) + + it('resolves nested metric paths', () => { + const metrics = { services: { redis: { memoryUsedMB: 512 } } } + expect(getMetricValue(metrics, 'services.redis.memoryUsedMB')).toBe(512) + }) + + it('respects cooldown period', () => { + const evaluator = new AlertEvaluator() + // First fire should pass + expect(evaluator.shouldFire('rule-1', 15)).toBe(true) + // Immediate second fire should be blocked (cooldown) + expect(evaluator.shouldFire('rule-1', 15)).toBe(false) + }) +}) +``` + +**Step 2: Implement AlertEvaluator** + +- `getMetricValue(metrics, path)`: Resolve dot-notation path like `system.cpuUsagePercent` +- `evaluateCondition(rule, value)`: Compare value against threshold using condition 
operator
+- `AlertEvaluator` class with in-memory cooldown map (ruleId → lastFiredAt)
+- `evaluateRules(payload, metrics)`: Load enabled rules from `monitoring-alert-rules`, evaluate each, fire alerts
+- `dispatchAlert(payload, rule, metrics, value)`: Create `monitoring-alert-history` record + call existing `sendAlert()` from `src/lib/alerting/alert-service.ts`
+
+**Step 3: Run test — expect PASS**
+
+**Step 4: Commit**
+
+```bash
+git add src/lib/monitoring/alert-evaluator.ts tests/unit/monitoring/alert-evaluator.test.ts
+git commit -m "feat(monitoring): add alert evaluator with cooldown and multi-channel dispatch"
+```
+
+---
+
+### Task 12: SnapshotCollector
+
+**Files:**
+- Create: `src/lib/monitoring/snapshot-collector.ts`
+- Modify: `scripts/run-queue-worker.ts` (add monitoring worker)
+- Modify: `ecosystem.config.cjs` (add env var)
+
+**Step 1: Implement SnapshotCollector**
+
+```typescript
+import { collectMetrics } from './monitoring-service'
+import { AlertEvaluator } from './alert-evaluator'
+import { getPayload } from 'payload'
+import config from '@payload-config'
+
+let interval: NodeJS.Timeout | null = null
+const alertEvaluator = new AlertEvaluator()
+
+export async function startSnapshotCollector(): Promise<void> {
+  const INTERVAL = parseInt(process.env.MONITORING_SNAPSHOT_INTERVAL || '60000', 10)
+  console.log(`[SnapshotCollector] Starting (interval: ${INTERVAL}ms)`)
+
+  interval = setInterval(async () => {
+    try {
+      const payload = await getPayload({ config })
+      const metrics = await collectMetrics()
+      await payload.create({ collection: 'monitoring-snapshots', data: { ...metrics, timestamp: new Date().toISOString() } })
+      await alertEvaluator.evaluateRules(payload, metrics)
+    } catch (error) {
+      console.error('[SnapshotCollector] Error:', error)
+    }
+  }, INTERVAL)
+}
+
+export async function stopSnapshotCollector(): Promise<void> {
+  if (interval) { clearInterval(interval); interval = null }
+  console.log('[SnapshotCollector] Stopped')
+}
+```
+
+**Step 2: 
Add to queue worker**
+
+In `scripts/run-queue-worker.ts`, add:
+```typescript
+const ENABLE_MONITORING = process.env.QUEUE_ENABLE_MONITORING !== 'false'
+// ... dynamic import
+const { startSnapshotCollector, stopSnapshotCollector } = await import('../src/lib/monitoring/snapshot-collector')
+// ... conditional start
+if (ENABLE_MONITORING) await startSnapshotCollector()
+// ... shutdown
+if (ENABLE_MONITORING) stopPromises.push(stopSnapshotCollector())
+```
+
+**Step 3: Add env var to ecosystem.config.cjs**
+
+Add to queue-worker env:
+```javascript
+QUEUE_ENABLE_MONITORING: 'true',
+MONITORING_SNAPSHOT_INTERVAL: '60000',
+```
+
+**Step 4: Commit**
+
+```bash
+git add src/lib/monitoring/snapshot-collector.ts scripts/run-queue-worker.ts ecosystem.config.cjs
+git commit -m "feat(monitoring): add snapshot collector to queue worker"
+```
+
+---
+
+### Task 13: Data Retention Integration
+
+**Files:**
+- Modify: `src/lib/retention/retention-config.ts`
+
+**Step 1: Add 3 new retention policies**
+
+```typescript
+{
+  name: 'monitoring-snapshots',
+  collection: 'monitoring-snapshots',
+  retentionDays: parseInt(process.env.RETENTION_MONITORING_SNAPSHOTS_DAYS || '7', 10),
+  dateField: 'createdAt',
+  batchSize: 500,
+  description: 'Delete monitoring snapshots older than X days',
+},
+{
+  name: 'monitoring-alert-history',
+  collection: 'monitoring-alert-history',
+  retentionDays: parseInt(process.env.RETENTION_MONITORING_ALERTS_DAYS || '90', 10),
+  dateField: 'createdAt',
+  batchSize: 100,
+  description: 'Delete alert history older than X days',
+},
+{
+  name: 'monitoring-logs',
+  collection: 'monitoring-logs',
+  retentionDays: parseInt(process.env.RETENTION_MONITORING_LOGS_DAYS || '30', 10),
+  dateField: 'createdAt',
+  batchSize: 200,
+  description: 'Delete monitoring logs older than X days',
+},
+```
+
+**Step 2: Commit**
+
+```bash
+git add src/lib/retention/retention-config.ts
+git commit -m "feat(monitoring): add retention policies for monitoring collections"
+```
+
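Looking back at Task 11, the dot-path resolver and threshold check it describes are small pure functions. A sketch — signatures simplified for clarity (the real `evaluateCondition` in the tests takes a rule object):

```typescript
// Sketch of Task 11's metric resolution and condition evaluation.
type AlertCondition = 'gt' | 'lt' | 'eq' | 'gte' | 'lte'

// Resolve a dot-notation path like 'services.redis.memoryUsedMB' against a snapshot.
export function getMetricValue(metrics: unknown, path: string): number | undefined {
  const value = path.split('.').reduce<unknown>(
    (obj, key) => (obj && typeof obj === 'object' ? (obj as Record<string, unknown>)[key] : undefined),
    metrics,
  )
  return typeof value === 'number' ? value : undefined
}

// One comparator per condition keeps the evaluation table-driven and exhaustive.
const ops: Record<AlertCondition, (value: number, threshold: number) => boolean> = {
  gt: (v, t) => v > t,
  lt: (v, t) => v < t,
  eq: (v, t) => v === t,
  gte: (v, t) => v >= t,
  lte: (v, t) => v <= t,
}

export function evaluateCondition(condition: AlertCondition, value: number, threshold: number): boolean {
  return ops[condition](value, threshold)
}
```

Returning `undefined` for missing or non-numeric paths lets `evaluateRules()` skip rules whose metric is absent from a partial snapshot instead of firing spurious alerts.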
+--- + +## Phase 3: API Endpoints + +### Task 14: Health Endpoint + +**Files:** +- Create: `src/app/(payload)/api/monitoring/health/route.ts` + +**Step 1: Implement GET handler** + +Pattern: Follow `community/stats/route.ts`. Auth check for super-admin. Call `checkSystemHealth()`, return JSON. + +```typescript +import { NextRequest, NextResponse } from 'next/server' +import { getPayload } from 'payload' +import config from '@payload-config' +import { checkSystemHealth } from '@/lib/monitoring/monitoring-service' + +export async function GET(req: NextRequest) { + try { + const payload = await getPayload({ config }) + const { user } = await payload.auth({ headers: req.headers }) + if (!user || !(user as any).isSuperAdmin) { + return NextResponse.json({ error: 'Unauthorized' }, { status: 401 }) + } + const health = await checkSystemHealth() + return NextResponse.json({ data: health, timestamp: new Date().toISOString() }) + } catch (error: unknown) { + return NextResponse.json({ error: error instanceof Error ? error.message : 'Unknown error' }, { status: 500 }) + } +} + +export const dynamic = 'force-dynamic' +``` + +**Step 2: Commit** + +```bash +git add "src/app/(payload)/api/monitoring/health/route.ts" +git commit -m "feat(monitoring): add /api/monitoring/health endpoint" +``` + +--- + +### Task 15: Services Endpoint + +**Files:** +- Create: `src/app/(payload)/api/monitoring/services/route.ts` + +Same pattern as health. Calls `checkPostgresql()`, `checkPgBouncer()`, `checkRedis()`, `checkSmtp()`, `checkOAuthTokens()`, `checkCronJobs()`, `checkQueues()` via `Promise.allSettled()`. Returns combined result. + +**Commit:** +```bash +git commit -m "feat(monitoring): add /api/monitoring/services endpoint" +``` + +--- + +### Task 16: Performance Endpoint + +**Files:** +- Create: `src/app/(payload)/api/monitoring/performance/route.ts` + +Reads `?period=1h|6h|24h|7d` query param. Calls `performanceTracker.getMetrics(period)`. 
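
Validating the param can be a small lookup; a sketch (the 24h fallback mirrors the default Task 19 uses for snapshots):

```typescript
// Sketch: map the ?period= query param to a time window.
const PERIOD_MS = { '1h': 3_600_000, '6h': 21_600_000, '24h': 86_400_000, '7d': 604_800_000 } as const
export type Period = keyof typeof PERIOD_MS

export function parsePeriod(raw: string | null): Period {
  // Unknown or missing values fall back to the 24h default rather than erroring.
  return raw !== null && raw in PERIOD_MS ? (raw as Period) : '24h'
}

export function periodStart(period: Period, now: number = Date.now()): Date {
  return new Date(now - PERIOD_MS[period])
}
```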
+ +**Commit:** +```bash +git commit -m "feat(monitoring): add /api/monitoring/performance endpoint" +``` + +--- + +### Task 17: Alerts Endpoint + Acknowledge + +**Files:** +- Create: `src/app/(payload)/api/monitoring/alerts/route.ts` +- Create: `src/app/(payload)/api/monitoring/alerts/acknowledge/route.ts` + +**GET /alerts:** Query `monitoring-alert-history` with pagination (`?page=1&limit=20`), filter by severity, sort by createdAt desc. + +**POST /alerts/acknowledge:** Body `{ alertId }`. Sets `acknowledgedBy` to current user and `resolvedAt` to now. Super-admin only. + +**Commit:** +```bash +git commit -m "feat(monitoring): add /api/monitoring/alerts + acknowledge endpoints" +``` + +--- + +### Task 18: Logs Endpoint + +**Files:** +- Create: `src/app/(payload)/api/monitoring/logs/route.ts` + +**GET /logs:** Query `monitoring-logs` with pagination, filters: +- `?level=warn` (exact or gte) +- `?source=cron` +- `?search=text` (searches in message) +- `?from=ISO&to=ISO` (date range) +- `?page=1&limit=50` + +**Commit:** +```bash +git commit -m "feat(monitoring): add /api/monitoring/logs endpoint" +``` + +--- + +### Task 19: Snapshots Endpoint + +**Files:** +- Create: `src/app/(payload)/api/monitoring/snapshots/route.ts` + +**GET /snapshots:** Query `monitoring-snapshots` for trend data. +- `?period=1h|6h|24h|7d` (default: 24h) +- `?fields=system.cpuUsagePercent,system.memoryUsagePercent` (optional field selection for bandwidth) +- Returns array sorted by timestamp asc (oldest first for charts). + +**Commit:** +```bash +git commit -m "feat(monitoring): add /api/monitoring/snapshots endpoint" +``` + +--- + +### Task 20: SSE Stream Endpoint + +**Files:** +- Create: `src/app/(payload)/api/monitoring/stream/route.ts` + +**Step 1: Implement SSE stream** + +Pattern: Follow `community/stream/route.ts` exactly. 
+
+Key differences from community stream:
+- Multiple event types with different intervals:
+  - Health metrics: every 10s (via `checkSystemHealth()`)
+  - Performance metrics: every 30s (via `performanceTracker.getMetrics()`)
+  - New alerts: check every 5s (query `monitoring-alert-history` for new since lastCheck)
+  - New logs (warn+): check every 5s (query `monitoring-logs` where level >= warn since lastCheck)
+- Each event has a `type` field in the SSE data
+- Max duration: 25s with reconnect signal (same as community)
+
+```typescript
+// SSE event format:
+controller.enqueue(encoder.encode(`event: health\ndata: ${JSON.stringify(healthData)}\n\n`))
+controller.enqueue(encoder.encode(`event: alert\ndata: ${JSON.stringify(alertData)}\n\n`))
+```
+
+Note: Use named SSE events (`event: health\n`) so the client can use `eventSource.addEventListener('health', ...)`.
+
+**Step 2: Commit**
+
+```bash
+git add "src/app/(payload)/api/monitoring/stream/route.ts"
+git commit -m "feat(monitoring): add SSE stream endpoint with multi-event types"
+```
+
+---
+
+## Phase 4: Dashboard UI
+
+### Task 21: Admin View Registration + NavLinks
+
+**Files:**
+- Create: `src/components/admin/MonitoringNavLinks.tsx`
+- Create: `src/components/admin/MonitoringDashboardView.tsx`
+- Modify: `src/payload.config.ts` (add view + navlink)
+
+**Step 1: Create MonitoringNavLinks**
+
+Pattern: Copy `CommunityNavLinks.tsx` exactly. Single link:
+```typescript
+const links = [
+  { href: '/admin/monitoring', label: 'Monitoring Dashboard' },
+]
+```
+Group label: `'Monitoring'`.
+
+**Step 2: Create MonitoringDashboardView**
+
+```typescript
+'use client'
+import React from 'react'
+import { MonitoringDashboard } from './MonitoringDashboard'
+export const MonitoringDashboardView: React.FC = () => <MonitoringDashboard />
+export default MonitoringDashboardView
+```
+
+**Step 3: Register in payload.config.ts**
+
+```typescript
+afterNavLinks: [
+  // ... 
existing
+  '@/components/admin/MonitoringNavLinks#MonitoringNavLinks',
+],
+views: {
+  // ... existing
+  MonitoringDashboard: {
+    Component: '@/components/admin/MonitoringDashboardView#MonitoringDashboardView',
+    path: '/monitoring',
+  },
+},
+```
+
+**Step 4: Commit**
+
+```bash
+git add src/components/admin/MonitoringNavLinks.tsx src/components/admin/MonitoringDashboardView.tsx src/payload.config.ts
+git commit -m "feat(monitoring): register admin view and sidebar navigation"
+```
+
+---
+
+### Task 22: MonitoringDashboard Main Component
+
+**Files:**
+- Create: `src/components/admin/MonitoringDashboard.tsx`
+- Create: `src/components/admin/MonitoringDashboard.scss`
+
+**Step 1: Implement tab shell**
+
+Pattern: Follow `YouTubeAnalyticsDashboard.tsx` structure.
+
+```typescript
+'use client'
+import React, { useState, useEffect, useCallback, useRef } from 'react'
+import './MonitoringDashboard.scss'
+import { SystemHealthTab } from './monitoring/SystemHealthTab'
+import { ServicesTab } from './monitoring/ServicesTab'
+import { PerformanceTab } from './monitoring/PerformanceTab'
+import { AlertsTab } from './monitoring/AlertsTab'
+import { LogsTab } from './monitoring/LogsTab'
+
+type Tab = 'health' | 'services' | 'performance' | 'alerts' | 'logs'
+
+export const MonitoringDashboard: React.FC = () => {
+  const [activeTab, setActiveTab] = useState<Tab>('health')
+  const eventSourceRef = useRef<EventSource | null>(null)
+  const [connected, setConnected] = useState(false)
+
+  // SSE connection setup
+  useEffect(() => {
+    const es = new EventSource('/api/monitoring/stream', { withCredentials: true })
+    eventSourceRef.current = es
+
+    es.addEventListener('open', () => setConnected(true))
+    es.addEventListener('error', () => { setConnected(false); /* auto-reconnect */ })
+
+    return () => { es.close(); eventSourceRef.current = null }
+  }, [])
+
+  // Pass eventSource to tabs for real-time updates
+  return (
+    <div className="monitoring">
+      <div className="monitoring__header">
+        <h1>Monitoring Dashboard</h1>
+        <span className={connected ? 'monitoring__status--connected' : 'monitoring__status--disconnected'}>
+          {connected ? '● Live' : '○ Disconnected'}
+        </span>
+      </div>
+      <div className="monitoring__tabs">{/* Tab buttons */}</div>
+      <div className="monitoring__content">
+        {activeTab === 'health' && <SystemHealthTab eventSource={eventSourceRef.current} />}
+        {activeTab === 'services' && <ServicesTab eventSource={eventSourceRef.current} />}
+        {activeTab === 'performance' && <PerformanceTab />}
+        {activeTab === 'alerts' && <AlertsTab eventSource={eventSourceRef.current} />}
+        {activeTab === 'logs' && <LogsTab eventSource={eventSourceRef.current} />}
+      </div>
+    </div>
  )
}
```

**Step 2: Create SCSS with BEM classes**

`.monitoring__header`, `.monitoring__tabs`, `.monitoring__tab`, `.monitoring__tab--active`, `.monitoring__content`, `.monitoring__status--connected`, `.monitoring__status--disconnected`

**Step 3: Commit**

```bash
git add src/components/admin/MonitoringDashboard.tsx src/components/admin/MonitoringDashboard.scss
git commit -m "feat(monitoring): add main dashboard component with SSE connection and tab shell"
```

---

### Task 23: Shared UI Components

**Files:**
- Create: `src/components/admin/monitoring/StatusBadge.tsx`
- Create: `src/components/admin/monitoring/GaugeWidget.tsx`
- Create: `src/components/admin/monitoring/TrendChart.tsx`
- Create: `src/components/admin/monitoring/LogTable.tsx`

**StatusBadge:** Simple component: status string → colored badge (online=green, warning=yellow, offline=red).

**GaugeWidget:** Displays a metric with label, value, unit, and colored arc/bar. Props: `{ label, value, max, unit, thresholds: { warning: number, critical: number } }`.
Use a CSS-only approach (no chart library): circular progress via `conic-gradient`, or a simple horizontal bar.

**TrendChart:** Renders time-series data as a simple SVG line chart. Props: `{ data: Array<{timestamp, value}>, label, unit, height }`.
Pure SVG, no chart library, so it adds nothing to the bundle. Scales automatically to the container width.

**LogTable:** Renders log entries with expandable JSON context. Props: `{ logs, onLoadMore }`.
Each row: level icon, source badge, message, timestamp. Click to expand the `context` JSON.

**Commit:**
```bash
git commit -m "feat(monitoring): add shared UI components (StatusBadge, GaugeWidget, TrendChart, LogTable)"
```

---

### Task 24: SystemHealthTab

**Files:**
- Create: `src/components/admin/monitoring/SystemHealthTab.tsx`

**Implementation:**

1. Initial fetch: `GET /api/monitoring/health`
2. 
SSE listener: `eventSource.addEventListener('health', ...)` updates gauges in real-time +3. Trend data: `GET /api/monitoring/snapshots?period=24h&fields=system.cpuUsagePercent,system.memoryUsagePercent,system.loadAvg1` +4. Renders: 4 GaugeWidgets (CPU, RAM, Disk, Uptime) + 3 TrendCharts (CPU 24h, Memory 24h, Load 24h) + +**Commit:** +```bash +git commit -m "feat(monitoring): add System Health tab with gauges and trend charts" +``` + +--- + +### Task 25: ServicesTab + +**Files:** +- Create: `src/components/admin/monitoring/ServicesTab.tsx` + +**Implementation:** + +1. Initial fetch: `GET /api/monitoring/services` +2. SSE listener: `eventSource.addEventListener('service', ...)` for status changes +3. Renders expandable service cards: + - Payload CMS (PID, Memory, Uptime, Restarts) + - Queue Worker (PID, Memory, Active Jobs) + - PostgreSQL (Connections, Pool, Latency) + - PgBouncer (Active, Waiting, Pool Size) + - Redis (Memory, Clients, Ops/s) + - SMTP (Status, Last Check, Response Time) + - OAuth Tokens (Meta + YouTube, expiry warnings) + - Cron Jobs (Last run times per job) + +Each card has a StatusBadge header. + +**Commit:** +```bash +git commit -m "feat(monitoring): add Services tab with expandable service cards" +``` + +--- + +### Task 26: PerformanceTab + +**Files:** +- Create: `src/components/admin/monitoring/PerformanceTab.tsx` + +**Implementation:** + +1. Period selector: 1h, 6h, 24h, 7d (buttons) +2. Fetch: `GET /api/monitoring/performance?period=24h` +3. KPI cards: Avg Response Time, P95, P99, Error Rate, RPM +4. TrendCharts from snapshots: `GET /api/monitoring/snapshots?period=24h&fields=performance.avgResponseTimeMs,performance.errorRate,performance.requestsPerMinute` + +**Commit:** +```bash +git commit -m "feat(monitoring): add Performance tab with KPI cards and trend charts" +``` + +--- + +### Task 27: AlertsTab + +**Files:** +- Create: `src/components/admin/monitoring/AlertsTab.tsx` + +**Implementation:** + +1. 
Fetch: `GET /api/monitoring/alerts?page=1&limit=20` +2. SSE listener: `eventSource.addEventListener('alert', ...)` prepends new alerts +3. Active/Unacknowledged alerts highlighted at top +4. Severity filter (warning, error, critical) +5. Acknowledge button: `POST /api/monitoring/alerts/acknowledge` with `{ alertId }` +6. Link to MonitoringAlertRules collection in admin: `/admin/collections/monitoring-alert-rules` +7. Pagination + +**Commit:** +```bash +git commit -m "feat(monitoring): add Alerts tab with acknowledge and real-time updates" +``` + +--- + +### Task 28: LogsTab + +**Files:** +- Create: `src/components/admin/monitoring/LogsTab.tsx` + +**Implementation:** + +1. Fetch: `GET /api/monitoring/logs?page=1&limit=50` +2. SSE listener: `eventSource.addEventListener('log', ...)` prepends new warn+ entries +3. Filters: level dropdown, source dropdown, text search input, date range +4. Uses LogTable component +5. Load more button for pagination +6. Auto-scroll toggle for new SSE entries + +**Commit:** +```bash +git commit -m "feat(monitoring): add Logs tab with filters, search, and real-time updates" +``` + +--- + +## Phase 5: Final Integration + +### Task 29: Generate ImportMap & Build Test + +**Step 1: Generate import map** + +```bash +pnpm payload generate:importmap +``` + +**Step 2: Build test** + +```bash +pm2 stop payload +NODE_OPTIONS="--no-deprecation --max-old-space-size=1024" pnpm build +pm2 start payload +``` + +**Step 3: Fix any build errors** + +**Step 4: Commit** + +```bash +git add src/app/\(payload\)/importMap.js +git commit -m "chore(monitoring): regenerate import map and verify build" +``` + +--- + +### Task 30: Run All Tests + +```bash +pnpm test tests/unit/monitoring/ +``` + +Fix any failures, then: + +```bash +git commit -m "test(monitoring): fix test issues and verify all monitoring tests pass" +``` + +--- + +### Task 31: Update Documentation + +**Files:** +- Modify: `CLAUDE.md` (add Monitoring to Subsysteme table, add collections) +- Modify: 
`docs/CLAUDE_REFERENCE.md` (add Monitoring section) +- Modify: `docs/PROJECT_STATUS.md` (mark as completed) + +**CLAUDE.md changes:** +- Add to Subsysteme table: `| Monitoring & Alerting | src/lib/monitoring/, API: /api/monitoring/* | docs/CLAUDE_REFERENCE.md |` +- Add 4 collections to Collections table: + - `monitoring-snapshots`, `monitoring-logs`, `monitoring-alert-rules`, `monitoring-alert-history` + +**CLAUDE_REFERENCE.md:** Add new section with API endpoints, SSE events, env vars. + +**PROJECT_STATUS.md:** Move "Monitoring & Alerting Dashboard" from Langfristig to Abgeschlossen. + +**Commit:** +```bash +git commit -m "docs: add monitoring dashboard to project documentation" +``` + +--- + +## Environment Variables Summary + +Add to `.env` (all optional with defaults): + +```env +# Monitoring +QUEUE_ENABLE_MONITORING=true +MONITORING_SNAPSHOT_INTERVAL=60000 +MONITORING_LOG_LEVEL=info +RETENTION_MONITORING_SNAPSHOTS_DAYS=7 +RETENTION_MONITORING_ALERTS_DAYS=90 +RETENTION_MONITORING_LOGS_DAYS=30 +``` + +--- + +## Task Dependency Graph + +``` +Phase 1 (Foundation): + Task 1 (Types) → Task 2-5 (Collections) → Task 6 (Migration) + +Phase 2 (Services): + Task 1 → Task 7 (Health) → Task 8 (Services) → Task 9 (PerfTracker) + Task 1 → Task 10 (Logger) + Task 1 → Task 11 (AlertEvaluator) + Task 7,8,9,11 → Task 12 (SnapshotCollector) + Task 2-5 → Task 13 (Retention) + +Phase 3 (APIs): + Task 7 → Task 14 (Health API) + Task 8 → Task 15 (Services API) + Task 9 → Task 16 (Performance API) + Task 5,11 → Task 17 (Alerts API) + Task 3,10 → Task 18 (Logs API) + Task 2 → Task 19 (Snapshots API) + Task 7,8,9,10,11 → Task 20 (SSE Stream) + +Phase 4 (UI): + Task 21 (Registration) → Task 22 (Main Component) → Task 23 (Shared Components) + Task 23 → Tasks 24-28 (Tab Components) — can be parallel + +Phase 5 (Integration): + All → Task 29 (Build) → Task 30 (Tests) → Task 31 (Docs) +``` + +--- + +## Estimated File Count + +| Category | Files | +|----------|-------| +| Collections (4) | 4 
| +| Lib/Monitoring (6) | 6 | +| API Routes (8) | 8 | +| UI Components (12) | 12 | +| Tests (5) | 5 | +| Migrations (1) | 1 | +| Modified (5) | payload.config.ts, run-queue-worker.ts, ecosystem.config.cjs, retention-config.ts, access/index.ts | +| **Total** | **~41 files** |