Monitoring & Alerting Dashboard - Implementation Plan
For Claude: REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.
Goal: Build a real-time Monitoring & Alerting Dashboard in the Payload Admin Panel with system health checks, service monitoring, performance tracking, configurable alerts, and structured log viewing.
Architecture: Event-driven with SSE real-time stream + REST endpoints. A SnapshotCollector runs in the queue-worker PM2 process every 60s, collecting metrics from OS, PostgreSQL, PgBouncer, Redis, SMTP, OAuth, and BullMQ. An AlertEvaluator checks metrics against configurable rules stored in a Payload Collection. The dashboard is a 5-tab Custom Admin View using the same patterns as YouTubeAnalyticsDashboard.
Tech Stack: Payload CMS 3.76.1, Next.js 16 (App Router), React 19, TypeScript, Node.js os module, SSE via ReadableStream, BullMQ, PostgreSQL, Redis.
Design Doc: docs/plans/2026-02-14-monitoring-dashboard-design.md
Phase 1: Types & Collections (Foundation)
Task 1: Shared Types
Files:
- Create: src/lib/monitoring/types.ts
- Test: tests/unit/monitoring/types.test.ts
Step 1: Write types test
// tests/unit/monitoring/types.test.ts
import { describe, it, expect } from 'vitest'
// Only the types exercised below are imported; the remaining interfaces are listed in Step 3.
import type { SystemMetrics, MonitoringEvent } from '@/lib/monitoring/types'
describe('Monitoring Types', () => {
it('SystemMetrics has all required sections', () => {
const metrics: SystemMetrics = {
timestamp: new Date().toISOString(),
system: {
cpuUsagePercent: 23,
memoryUsedMB: 4200,
memoryTotalMB: 8192,
memoryUsagePercent: 51.3,
diskUsedGB: 30,
diskTotalGB: 50,
diskUsagePercent: 60,
loadAvg1: 0.5,
loadAvg5: 0.8,
uptime: 1209600,
},
services: {
payload: { status: 'online', pid: 1234, memoryMB: 512, uptimeSeconds: 86400, restarts: 0 },
queueWorker: { status: 'online', pid: 5678, memoryMB: 256, uptimeSeconds: 86400, restarts: 0 },
postgresql: { status: 'online', connections: 12, maxConnections: 50, latencyMs: 2 },
pgbouncer: { status: 'online', activeConnections: 8, waitingClients: 0, poolSize: 20 },
redis: { status: 'online', memoryUsedMB: 48, connectedClients: 5, opsPerSec: 120 },
},
external: {
smtp: { status: 'online', lastCheck: new Date().toISOString(), responseTimeMs: 180 },
metaOAuth: { status: 'ok', tokensTotal: 2, tokensExpiringSoon: 1, tokensExpired: 0 },
youtubeOAuth: { status: 'ok', tokensTotal: 3, tokensExpiringSoon: 0, tokensExpired: 0 },
cronJobs: {
communitySync: { lastRun: new Date().toISOString(), status: 'ok' },
tokenRefresh: { lastRun: new Date().toISOString(), status: 'ok' },
youtubeSync: { lastRun: new Date().toISOString(), status: 'ok' },
},
},
performance: { avgResponseTimeMs: 120, p95ResponseTimeMs: 350, p99ResponseTimeMs: 800, errorRate: 0.02, requestsPerMinute: 45 },
}
expect(metrics.system.cpuUsagePercent).toBe(23)
expect(metrics.services.payload.status).toBe('online')
expect(metrics.external.smtp.status).toBe('online')
expect(metrics.performance.avgResponseTimeMs).toBe(120)
})
it('MonitoringEvent types are exhaustive', () => {
const events: MonitoringEvent['type'][] = ['health', 'service', 'alert', 'log', 'performance']
expect(events).toHaveLength(5)
})
})
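Step 2: Run test, expect FAIL (types don't exist yet). Because Vitest strips import type at runtime, the missing module may only surface under a type check, not in the test run itself.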
pnpm test tests/unit/monitoring/types.test.ts
Step 3: Implement types
Create src/lib/monitoring/types.ts with all interfaces:
- SystemHealth (CPU, RAM, Disk, Load, Uptime)
- ProcessStatus (status, pid, memoryMB, uptimeSeconds, restarts)
- PostgresqlStatus, PgBouncerStatus, RedisStatus
- SmtpStatus, OAuthTokenStatus, CronJobStatus
- ServiceStatuses (all services combined)
- ExternalStatuses (SMTP, OAuth, Cron)
- PerformanceMetrics (avg, p95, p99, errorRate, rpm)
- SystemMetrics (the full snapshot object)
- MonitoringEvent (discriminated union for SSE events: health | service | alert | log | performance)
- AlertCondition = 'gt' | 'lt' | 'eq' | 'gte' | 'lte'
- AlertSeverity = 'warning' | 'error' | 'critical'
- LogLevel = 'debug' | 'info' | 'warn' | 'error' | 'fatal'
- LogSource = 'payload' | 'queue-worker' | 'cron' | 'email' | 'oauth' | 'sync'
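A minimal sketch of how a few of these types might look. Field names follow the test in Step 1. The data payload of each MonitoringEvent variant and the describeEvent helper are assumptions for illustration, since the plan only names the five event types.

```typescript
// Sketch only: a subset of src/lib/monitoring/types.ts.
export type AlertSeverity = 'warning' | 'error' | 'critical'
export type LogLevel = 'debug' | 'info' | 'warn' | 'error' | 'fatal'

export interface SystemHealth {
  cpuUsagePercent: number
  memoryUsedMB: number
  memoryTotalMB: number
  memoryUsagePercent: number
  diskUsedGB: number
  diskTotalGB: number
  diskUsagePercent: number
  loadAvg1: number
  loadAvg5: number
  uptime: number
}

export interface PerformanceMetrics {
  avgResponseTimeMs: number
  p95ResponseTimeMs: number
  p99ResponseTimeMs: number
  errorRate: number
  requestsPerMinute: number
}

// Discriminated union: narrowing on `type` selects the matching `data` shape.
// The `data` payloads below are assumptions, not part of the plan.
export type MonitoringEvent =
  | { type: 'health'; data: SystemHealth }
  | { type: 'service'; data: { name: string; status: string } }
  | { type: 'alert'; data: { severity: AlertSeverity; message: string } }
  | { type: 'log'; data: { level: LogLevel; message: string } }
  | { type: 'performance'; data: PerformanceMetrics }

// Hypothetical helper showing exhaustive handling of the union.
export function describeEvent(event: MonitoringEvent): string {
  switch (event.type) {
    case 'health':
      return `cpu ${event.data.cpuUsagePercent}%`
    case 'service':
      return `${event.data.name} is ${event.data.status}`
    case 'alert':
      return `[${event.data.severity}] ${event.data.message}`
    case 'log':
      return `${event.data.level}: ${event.data.message}`
    case 'performance':
      return `avg ${event.data.avgResponseTimeMs}ms`
    default: {
      // Fails to compile if a new variant is added but not handled above.
      const unreachable: never = event
      return unreachable
    }
  }
}
```

Keeping the event type as the discriminant lets the SSE client and server share one definition and have the compiler flag unhandled event types.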
Step 4: Run test — expect PASS
pnpm test tests/unit/monitoring/types.test.ts
Step 5: Commit
git add src/lib/monitoring/types.ts tests/unit/monitoring/types.test.ts
git commit -m "feat(monitoring): add shared types for monitoring system"
Task 2: MonitoringSnapshots Collection
Files:
- Create: src/collections/MonitoringSnapshots.ts
- Modify: src/payload.config.ts (add to collections array)
- Modify: src/lib/access/index.ts (add monitoring access)
Step 1: Add monitoring access control
In src/lib/access/index.ts, add:
export const monitoringAccess = {
read: superAdminOnly,
create: superAdminOnly, // Only system can create
update: denyAll, // Immutable snapshots
delete: superAdminOnly, // Retention cleanup only
}
Step 2: Create MonitoringSnapshots collection
Pattern: Follow AuditLogs.ts structure. Use admin.group: 'Monitoring'. Fields use type: 'group' for nested objects and type: 'json' for service/external status objects.
Key fields:
- timestamp (date, required, indexed)
- system group: cpuUsagePercent, memoryUsedMB, memoryTotalMB, memoryUsagePercent, diskUsedGB, diskTotalGB, diskUsagePercent, loadAvg1, loadAvg5, uptime (all type: 'number')
- services group: payload, queueWorker, postgresql, pgbouncer, redis (all type: 'json')
- external group: smtp, metaOAuth, youtubeOAuth, cronJobs (all type: 'json')
- performance group: avgResponseTimeMs, p95ResponseTimeMs, p99ResponseTimeMs, errorRate, requestsPerMinute (all type: 'number')
Step 3: Register in payload.config.ts
Add MonitoringSnapshots to the collections array (import + add).
Step 4: Commit
git add src/collections/MonitoringSnapshots.ts src/payload.config.ts src/lib/access/index.ts
git commit -m "feat(monitoring): add MonitoringSnapshots collection"
Task 3: MonitoringLogs Collection
Files:
- Create: src/collections/MonitoringLogs.ts
- Modify: src/payload.config.ts
Step 1: Create collection
Pattern: Like AuditLogs.ts — WORM (read + create only, no update/delete via UI).
Key fields:
- level (select: debug, info, warn, error, fatal; required)
- source (select: payload, queue-worker, cron, email, oauth, sync; required)
- message (text, required)
- context (json)
- requestId (text)
- userId (relationship → users)
- tenant (relationship → tenants)
- duration (number, min: 0)
Admin config: group: 'Monitoring', defaultColumns: ['level', 'source', 'message', 'createdAt'], useAsTitle: 'message'.
Step 2: Register in payload.config.ts
Step 3: Commit
git add src/collections/MonitoringLogs.ts src/payload.config.ts
git commit -m "feat(monitoring): add MonitoringLogs collection"
Task 4: MonitoringAlertRules Collection
Files:
- Create: src/collections/MonitoringAlertRules.ts
- Modify: src/payload.config.ts
Step 1: Create collection
Access: Super-admin full CRUD.
Key fields:
- name (text, required)
- metric (text, required; e.g. system.cpuUsagePercent)
- condition (select: gt, lt, eq, gte, lte; required)
- threshold (number, required)
- severity (select: warning, error, critical; required)
- channels (select, hasMany: true; email, slack, discord; required)
- recipients group:
  - emails (array of text fields)
  - slackWebhook (text)
  - discordWebhook (text)
- cooldownMinutes (number, defaultValue: 15, min: 1)
- enabled (checkbox, defaultValue: true)
- tenant (relationship → tenants, optional)
Admin: group: 'Monitoring', useAsTitle: 'name'.
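The field list above could translate into a collection config roughly like the following sketch. It is written as a plain object so the snippet stands alone; in the project it would presumably be typed as Payload's CollectionConfig. Access control is omitted, and the sub-field name email inside the emails array is an assumption.

```typescript
// Sketch of src/collections/MonitoringAlertRules.ts (access control omitted).
// In the project, type this as CollectionConfig from 'payload'.
export const MonitoringAlertRules = {
  slug: 'monitoring-alert-rules',
  admin: { group: 'Monitoring', useAsTitle: 'name' },
  fields: [
    { name: 'name', type: 'text', required: true },
    // Dot-notation path into a snapshot, e.g. system.cpuUsagePercent
    { name: 'metric', type: 'text', required: true },
    { name: 'condition', type: 'select', required: true, options: ['gt', 'lt', 'eq', 'gte', 'lte'] },
    { name: 'threshold', type: 'number', required: true },
    { name: 'severity', type: 'select', required: true, options: ['warning', 'error', 'critical'] },
    { name: 'channels', type: 'select', hasMany: true, required: true, options: ['email', 'slack', 'discord'] },
    {
      name: 'recipients',
      type: 'group',
      fields: [
        // 'email' as the row field name is an assumption
        { name: 'emails', type: 'array', fields: [{ name: 'email', type: 'text' }] },
        { name: 'slackWebhook', type: 'text' },
        { name: 'discordWebhook', type: 'text' },
      ],
    },
    { name: 'cooldownMinutes', type: 'number', defaultValue: 15, min: 1 },
    { name: 'enabled', type: 'checkbox', defaultValue: true },
    // Optional: no `required` flag
    { name: 'tenant', type: 'relationship', relationTo: 'tenants' },
  ],
}
```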
Step 2: Register in payload.config.ts
Step 3: Commit
git add src/collections/MonitoringAlertRules.ts src/payload.config.ts
git commit -m "feat(monitoring): add MonitoringAlertRules collection"
Task 5: MonitoringAlertHistory Collection
Files:
- Create: src/collections/MonitoringAlertHistory.ts
- Modify: src/payload.config.ts
Step 1: Create collection
Access: Read for super-admin, create for system, update only resolvedAt and acknowledgedBy.
Key fields:
- rule (relationship → monitoring-alert-rules)
- metric (text, required)
- value (number, required)
- threshold (number, required)
- severity (select: warning, error, critical; required)
- message (text, required)
- channelsSent (select, hasMany: email, slack, discord)
- resolvedAt (date, optional)
- acknowledgedBy (relationship → users, optional)
Admin: group: 'Monitoring', useAsTitle: 'message', defaultColumns: ['severity', 'metric', 'message', 'createdAt', 'acknowledgedBy'].
Step 2: Register in payload.config.ts
Step 3: Commit
git add src/collections/MonitoringAlertHistory.ts src/payload.config.ts
git commit -m "feat(monitoring): add MonitoringAlertHistory collection"
Task 6: Database Migration
Step 1: Create migration
pnpm payload migrate:create
CRITICAL: The migration MUST include payload_locked_documents_rels columns for ALL 4 new collections:
ALTER TABLE "payload_locked_documents_rels"
ADD COLUMN IF NOT EXISTS "monitoring_snapshots_id" integer REFERENCES monitoring_snapshots(id) ON DELETE CASCADE;
ALTER TABLE "payload_locked_documents_rels"
ADD COLUMN IF NOT EXISTS "monitoring_logs_id" integer REFERENCES monitoring_logs(id) ON DELETE CASCADE;
ALTER TABLE "payload_locked_documents_rels"
ADD COLUMN IF NOT EXISTS "monitoring_alert_rules_id" integer REFERENCES monitoring_alert_rules(id) ON DELETE CASCADE;
ALTER TABLE "payload_locked_documents_rels"
ADD COLUMN IF NOT EXISTS "monitoring_alert_history_id" integer REFERENCES monitoring_alert_history(id) ON DELETE CASCADE;
CREATE INDEX IF NOT EXISTS "payload_locked_documents_rels_monitoring_snapshots_idx" ON "payload_locked_documents_rels" ("monitoring_snapshots_id");
CREATE INDEX IF NOT EXISTS "payload_locked_documents_rels_monitoring_logs_idx" ON "payload_locked_documents_rels" ("monitoring_logs_id");
CREATE INDEX IF NOT EXISTS "payload_locked_documents_rels_monitoring_alert_rules_idx" ON "payload_locked_documents_rels" ("monitoring_alert_rules_id");
CREATE INDEX IF NOT EXISTS "payload_locked_documents_rels_monitoring_alert_history_idx" ON "payload_locked_documents_rels" ("monitoring_alert_history_id");
Step 2: Review generated migration, add locked_documents_rels columns if missing
Step 3: Run migration via direct DB connection
./scripts/db-direct.sh migrate
Step 4: Generate import map
pnpm payload generate:importmap
Step 5: Commit
git add src/migrations/ src/app/\(payload\)/importMap.js
git commit -m "feat(monitoring): add database migration for 4 monitoring collections"
Phase 2: Backend Services
Task 7: MonitoringService — System Health
Files:
- Create: src/lib/monitoring/monitoring-service.ts
- Test: tests/unit/monitoring/monitoring-service.test.ts
Step 1: Write test for checkSystemHealth()
import { describe, it, expect } from 'vitest'
import { checkSystemHealth } from '@/lib/monitoring/monitoring-service'
describe('MonitoringService', () => {
describe('checkSystemHealth', () => {
it('returns CPU, memory, disk, load, and uptime', async () => {
const health = await checkSystemHealth()
expect(health.cpuUsagePercent).toBeGreaterThanOrEqual(0)
expect(health.cpuUsagePercent).toBeLessThanOrEqual(100)
expect(health.memoryTotalMB).toBeGreaterThan(0)
expect(health.memoryUsedMB).toBeGreaterThan(0)
expect(health.memoryUsagePercent).toBeGreaterThanOrEqual(0)
expect(health.diskTotalGB).toBeGreaterThan(0)
expect(health.uptime).toBeGreaterThan(0)
expect(health.loadAvg1).toBeGreaterThanOrEqual(0)
})
})
})
Step 2: Run test — expect FAIL
Step 3: Implement checkSystemHealth()
Use Node.js os module: os.cpus(), os.totalmem(), os.freemem(), os.loadavg(), os.uptime().
For disk: use child_process.execSync('df -B1 / | tail -1') to get disk usage (Linux-only, which is fine — production is Linux).
For CPU: sample /proc/stat twice with 100ms delay to calculate usage percentage.
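The two-sample CPU calculation can be sketched as two pure functions, which keeps the arithmetic testable without touching the filesystem. The function names are hypothetical; reading /proc/stat and waiting 100ms between samples is left to the caller.

```typescript
// Aggregate CPU counters taken from the first line of /proc/stat (in jiffies).
export interface CpuTimes {
  idle: number
  total: number
}

// Parses the aggregate line, e.g. "cpu  4705 150 1120 16250 520 0 30 0 0 0".
// Field order: user nice system idle iowait irq softirq steal guest guest_nice
export function parseProcStatCpuLine(line: string): CpuTimes {
  const parts = line.trim().split(/\s+/)
  if (parts[0] !== 'cpu') throw new Error('expected the aggregate "cpu" line')
  const values = parts.slice(1).map(Number)
  // Count iowait as idle time
  const idle = (values[3] ?? 0) + (values[4] ?? 0)
  // guest and guest_nice are already included in user and nice, so sum only the first 8 fields
  const total = values.slice(0, 8).reduce((sum, v) => sum + v, 0)
  return { idle, total }
}

// Usage between two samples = share of non-idle jiffies in the elapsed window.
export function cpuUsagePercent(prev: CpuTimes, next: CpuTimes): number {
  const totalDelta = next.total - prev.total
  const idleDelta = next.idle - prev.idle
  if (totalDelta <= 0) return 0
  const busy = (1 - idleDelta / totalDelta) * 100
  // Clamp to [0, 100] and round to one decimal place
  return Math.min(100, Math.max(0, Math.round(busy * 10) / 10))
}
```

checkSystemHealth() would then read the first line of /proc/stat, wait, read it again, and pass both parsed samples to cpuUsagePercent().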
Step 4: Run test — expect PASS
Step 5: Commit
git add src/lib/monitoring/monitoring-service.ts tests/unit/monitoring/monitoring-service.test.ts
git commit -m "feat(monitoring): add system health check (CPU, RAM, disk, load)"
Task 8: MonitoringService — Service Checks
Files:
- Modify: src/lib/monitoring/monitoring-service.ts
- Test: tests/unit/monitoring/monitoring-service.test.ts
Step 1: Write tests for service checks
Test checkRedis(), checkPostgresql(), checkPgBouncer(), checkQueues().
These need mocking since they connect to external services:
import { vi } from 'vitest'
describe('checkRedis', () => {
it('returns redis status with memory and client info', async () => {
// Mock redis.info() response
const result = await checkRedis()
expect(result).toHaveProperty('status')
expect(result).toHaveProperty('memoryUsedMB')
expect(result).toHaveProperty('connectedClients')
expect(result).toHaveProperty('opsPerSec')
})
})
Step 2: Implement service checks
- checkRedis(): Use redis.info() from src/lib/redis.ts, parse used_memory, connected_clients, instantaneous_ops_per_sec
- checkPostgresql(): Direct query SELECT count(*) FROM pg_stat_activity + SELECT 1 latency test via ./scripts/db-direct.sh or Payload's DB adapter
- checkPgBouncer(): Query SHOW POOLS via PgBouncer admin connection (127.0.0.1:6432)
- checkQueues(): Use BullMQ Queue.getJobCounts() for email, pdf, retention queues
- checkSmtp(): Create SMTP transporter and call verify() with timeout
- checkOAuthTokens(): Query social-accounts collection for expiring tokens (< 7 days)
- checkCronJobs(): Check audit-logs/monitoring-logs for recent cron executions
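As an illustration of the parsing half of checkRedis(), the sketch below turns the text returned by the Redis INFO command into the status shape used in the Task 1 test. The function and interface names are hypothetical, and connection handling (including the offline case) is deliberately left out.

```typescript
export interface RedisStatusSketch {
  status: 'online' | 'offline'
  memoryUsedMB: number
  connectedClients: number
  opsPerSec: number
}

// INFO output is a list of "key:value" lines separated by CRLF,
// with "# Section" header lines in between.
export function parseRedisInfo(raw: string): RedisStatusSketch {
  const fields = new Map<string, string>()
  for (const line of raw.split(/\r?\n/)) {
    if (!line || line.startsWith('#')) continue
    const idx = line.indexOf(':')
    if (idx === -1) continue
    fields.set(line.slice(0, idx), line.slice(idx + 1).trim())
  }
  // Missing or non-numeric fields fall back to 0 rather than NaN
  const num = (key: string): number => {
    const n = Number(fields.get(key))
    return Number.isFinite(n) ? n : 0
  }
  return {
    status: 'online', // a successful INFO reply implies the server is reachable
    memoryUsedMB: Math.round((num('used_memory') / (1024 * 1024)) * 10) / 10,
    connectedClients: num('connected_clients'),
    opsPerSec: num('instantaneous_ops_per_sec'),
  }
}
```

Separating parsing from the network call also makes the unit test in Step 1 simpler: the mock only has to return a fixed INFO string.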
Step 3: Add collectMetrics() that calls all checks with Promise.allSettled()
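One way collectMetrics() might use Promise.allSettled() is sketched below: every check runs concurrently, and a rejected check is mapped to an offline status instead of failing the whole snapshot. The runChecks helper and the { status: 'offline', error } fallback shape are assumptions.

```typescript
type CheckResult = Record<string, unknown>

// Runs all checks concurrently. A check that rejects is reported as offline
// with its error message, so one failing service never blocks the snapshot.
export async function runChecks(
  checks: Record<string, () => Promise<CheckResult>>,
): Promise<Record<string, CheckResult>> {
  const names = Object.keys(checks)
  const settled = await Promise.allSettled(names.map((name) => checks[name]()))
  const results: Record<string, CheckResult> = {}
  settled.forEach((outcome, i) => {
    results[names[i]] =
      outcome.status === 'fulfilled'
        ? outcome.value
        : {
            status: 'offline',
            error: outcome.reason instanceof Error ? outcome.reason.message : String(outcome.reason),
          }
  })
  return results
}
```

collectMetrics() could call this once for the services group and once for the external group, then assemble the SystemMetrics object from the results.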
Step 4: Run tests — expect PASS
Step 5: Commit
git add src/lib/monitoring/monitoring-service.ts tests/unit/monitoring/monitoring-service.test.ts
git commit -m "feat(monitoring): add service checks (Redis, PostgreSQL, PgBouncer, queues, SMTP, OAuth)"
Task 9: PerformanceTracker
Files:
- Create: src/lib/monitoring/performance-tracker.ts
- Test: tests/unit/monitoring/performance-tracker.test.ts
Step 1: Write test
describe('PerformanceTracker', () => {
it('tracks requests and computes metrics', () => {
const tracker = new PerformanceTracker(1000) // 1000-entry ring buffer
tracker.track('GET', '/api/posts', 200, 120)
tracker.track('GET', '/api/posts', 200, 250)
tracker.track('GET', '/api/posts', 500, 800)
const metrics = tracker.getMetrics('1h')
expect(metrics.avgResponseTimeMs).toBeCloseTo(390, 0)
expect(metrics.errorRate).toBeCloseTo(0.333, 2)
expect(metrics.requestsPerMinute).toBeGreaterThan(0)
expect(metrics.p95ResponseTimeMs).toBeGreaterThanOrEqual(metrics.avgResponseTimeMs)
})
it('ring buffer evicts old entries', () => {
const tracker = new PerformanceTracker(2) // tiny buffer
tracker.track('GET', '/a', 200, 100)
tracker.track('GET', '/b', 200, 200)
tracker.track('GET', '/c', 200, 300)
const metrics = tracker.getMetrics('1h')
// Only last 2 entries should remain
expect(metrics.avgResponseTimeMs).toBeCloseTo(250, 0)
})
})
Step 2: Run test — expect FAIL
Step 3: Implement PerformanceTracker
- Class with ring buffer (fixed-size array + pointer)
- Each entry: { timestamp, method, path, statusCode, durationMs }
- track(): Add to ring buffer
- getMetrics(period): Filter by time window, compute avg/p95/p99/errorRate/rpm
- Export singleton instance:
export const performanceTracker = new PerformanceTracker(10_000)
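A minimal sketch of the ring buffer described above, written to satisfy the two tests in Step 1. Several details are assumptions: status codes of 500 and above count as errors, percentiles use the nearest-rank method, and requests per minute is averaged over the whole requested window.

```typescript
interface RequestEntry {
  timestamp: number
  method: string
  path: string
  statusCode: number
  durationMs: number
}

type Period = '1h' | '6h' | '24h' | '7d'

const PERIOD_MS: Record<Period, number> = {
  '1h': 3_600_000,
  '6h': 21_600_000,
  '24h': 86_400_000,
  '7d': 604_800_000,
}

export class PerformanceTracker {
  private readonly buffer: Array<RequestEntry | undefined>
  private readonly capacity: number
  private next = 0 // slot that the next entry overwrites

  constructor(capacity: number) {
    this.capacity = capacity
    this.buffer = new Array<RequestEntry | undefined>(capacity)
  }

  track(method: string, path: string, statusCode: number, durationMs: number): void {
    this.buffer[this.next] = { timestamp: Date.now(), method, path, statusCode, durationMs }
    // Wrap around: once full, the oldest entry is evicted
    this.next = (this.next + 1) % this.capacity
  }

  getMetrics(period: Period) {
    const windowMs = PERIOD_MS[period]
    const cutoff = Date.now() - windowMs
    const entries = this.buffer.filter(
      (e): e is RequestEntry => e !== undefined && e.timestamp >= cutoff,
    )
    if (entries.length === 0) {
      return { avgResponseTimeMs: 0, p95ResponseTimeMs: 0, p99ResponseTimeMs: 0, errorRate: 0, requestsPerMinute: 0 }
    }
    const durations = entries.map((e) => e.durationMs).sort((a, b) => a - b)
    // Nearest-rank percentile on the sorted durations
    const percentile = (p: number): number =>
      durations[Math.min(durations.length - 1, Math.ceil((p / 100) * durations.length) - 1)]
    const errors = entries.filter((e) => e.statusCode >= 500).length
    return {
      avgResponseTimeMs: durations.reduce((sum, d) => sum + d, 0) / durations.length,
      p95ResponseTimeMs: percentile(95),
      p99ResponseTimeMs: percentile(99),
      errorRate: errors / entries.length,
      requestsPerMinute: entries.length / (windowMs / 60_000),
    }
  }
}
```

The fixed capacity bounds memory regardless of traffic, at the cost of losing older entries during bursts; the 10_000 used for the singleton is the plan's choice.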
Step 4: Run test — expect PASS
Step 5: Commit
git add src/lib/monitoring/performance-tracker.ts tests/unit/monitoring/performance-tracker.test.ts
git commit -m "feat(monitoring): add performance tracker with ring buffer"
Task 10: MonitoringLogger
Files:
- Create: src/lib/monitoring/monitoring-logger.ts
- Test: tests/unit/monitoring/monitoring-logger.test.ts
Step 1: Write test
describe('MonitoringLogger', () => {
it('creates logger with source and logs to collection', async () => {
const logger = createMonitoringLogger('cron')
// Mock payload.create
await logger.info('Cron job completed', { jobName: 'community-sync', duration: 3500 })
// Verify payload.create was called with correct args
})
it('respects minimum log level from env', async () => {
// MONITORING_LOG_LEVEL=warn → info/debug should not write to DB
})
})
Step 2: Implement MonitoringLogger
- createMonitoringLogger(source: LogSource) factory function
- Returns object with debug(), info(), warn(), error(), fatal() methods
- Each method calls payload.create({ collection: 'monitoring-logs', data: { level, source, message, context, ... } })
- Respects MONITORING_LOG_LEVEL env var (default: 'info')
- Falls back to console.log if Payload is not initialized (startup phase)
- Non-blocking: fire-and-forget with .catch(console.error)
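The level filtering and fire-and-forget behavior can be sketched with an injectable sink in place of payload.create, which also makes the Step 1 tests straightforward to write. The createLoggerSketch name, the LogSink type, and passing the minimum level as a parameter (rather than reading MONITORING_LOG_LEVEL directly) are assumptions.

```typescript
type LogLevel = 'debug' | 'info' | 'warn' | 'error' | 'fatal'
type LogSource = 'payload' | 'queue-worker' | 'cron' | 'email' | 'oauth' | 'sync'

// Numeric ranks make "at least this level" a simple comparison
const LEVEL_ORDER: Record<LogLevel, number> = { debug: 0, info: 1, warn: 2, error: 3, fatal: 4 }

interface LogRecord {
  level: LogLevel
  source: LogSource
  message: string
  context?: Record<string, unknown>
}

// Stand-in for payload.create({ collection: 'monitoring-logs', data })
type LogSink = (record: LogRecord) => Promise<unknown>

export function createLoggerSketch(source: LogSource, sink: LogSink, minLevel: LogLevel = 'info') {
  const write = (level: LogLevel, message: string, context?: Record<string, unknown>): void => {
    // Drop entries below the configured minimum level
    if (LEVEL_ORDER[level] < LEVEL_ORDER[minLevel]) return
    // Fire-and-forget: the caller never waits on the database write
    void sink({ level, source, message, context }).catch(console.error)
  }
  const at = (level: LogLevel) => (message: string, context?: Record<string, unknown>) =>
    write(level, message, context)
  return { debug: at('debug'), info: at('info'), warn: at('warn'), error: at('error'), fatal: at('fatal') }
}
```

In the real logger the sink would wrap payload.create, and the console fallback would apply while Payload is not yet initialized.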
Step 3: Run test — expect PASS
Step 4: Commit
git add src/lib/monitoring/monitoring-logger.ts tests/unit/monitoring/monitoring-logger.test.ts
git commit -m "feat(monitoring): add structured monitoring logger"
Task 11: AlertEvaluator
Files:
- Create: src/lib/monitoring/alert-evaluator.ts
- Test: tests/unit/monitoring/alert-evaluator.test.ts
Step 1: Write test
describe('AlertEvaluator', () => {
it('fires alert when metric exceeds threshold (gt)', () => {
const rule = { metric: 'system.cpuUsagePercent', condition: 'gt', threshold: 80, severity: 'warning' }
const metrics = { system: { cpuUsagePercent: 92 } }
expect(evaluateCondition(rule, getMetricValue(metrics, rule.metric))).toBe(true)
})
it('does not fire when metric is below threshold', () => {
const rule = { metric: 'system.cpuUsagePercent', condition: 'gt', threshold: 80 }
const metrics = { system: { cpuUsagePercent: 45 } }
expect(evaluateCondition(rule, getMetricValue(metrics, rule.metric))).toBe(false)
})
it('resolves nested metric paths', () => {
const metrics = { services: { redis: { memoryUsedMB: 512 } } }
expect(getMetricValue(metrics, 'services.redis.memoryUsedMB')).toBe(512)
})
it('respects cooldown period', () => {
const evaluator = new AlertEvaluator()
// First fire should pass
expect(evaluator.shouldFire('rule-1', 15)).toBe(true)
// Immediate second fire should be blocked (cooldown)
expect(evaluator.shouldFire('rule-1', 15)).toBe(false)
})
})
Step 2: Implement AlertEvaluator
- getMetricValue(metrics, path): Resolve dot-notation path like system.cpuUsagePercent
- evaluateCondition(rule, value): Compare value against threshold using condition operator
- AlertEvaluator class with in-memory cooldown map (ruleId → lastFiredAt)
- evaluateRules(payload, metrics): Load enabled rules from monitoring-alert-rules, evaluate each, fire alerts
- dispatchAlert(payload, rule, metrics, value): Create monitoring-alert-history record + call existing sendAlert() from src/lib/alerting/alert-service.ts
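The pure parts of the evaluator (path lookup, comparison, cooldown) might look like the sketch below. It matches the signatures used in the Step 1 tests; the optional now parameter on shouldFire is an addition for testability, and treating a missing metric as "does not fire" is an assumption. evaluateRules() and dispatchAlert() are omitted because they depend on Payload.

```typescript
type AlertCondition = 'gt' | 'lt' | 'eq' | 'gte' | 'lte'

const COMPARATORS: Record<AlertCondition, (value: number, threshold: number) => boolean> = {
  gt: (v, t) => v > t,
  lt: (v, t) => v < t,
  eq: (v, t) => v === t,
  gte: (v, t) => v >= t,
  lte: (v, t) => v <= t,
}

// Walks a dot-notation path such as "services.redis.memoryUsedMB".
// Returns undefined if any segment is missing or the leaf is not a number.
export function getMetricValue(metrics: unknown, path: string): number | undefined {
  let current: unknown = metrics
  for (const key of path.split('.')) {
    if (typeof current !== 'object' || current === null) return undefined
    current = (current as Record<string, unknown>)[key]
  }
  return typeof current === 'number' ? current : undefined
}

export function evaluateCondition(
  rule: { condition: AlertCondition; threshold: number },
  value: number | undefined,
): boolean {
  // Assumption: a metric that cannot be resolved never fires
  if (value === undefined) return false
  return COMPARATORS[rule.condition](value, rule.threshold)
}

export class AlertEvaluator {
  // ruleId → timestamp (ms) of the last fired alert
  private readonly lastFiredAt = new Map<string, number>()

  shouldFire(ruleId: string, cooldownMinutes: number, now: number = Date.now()): boolean {
    const last = this.lastFiredAt.get(ruleId)
    if (last !== undefined && now - last < cooldownMinutes * 60_000) return false
    this.lastFiredAt.set(ruleId, now)
    return true
  }
}
```

Because the cooldown map lives in memory, it resets whenever the queue-worker process restarts, so an alert may fire again right after a restart.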
Step 3: Run test — expect PASS
Step 4: Commit
git add src/lib/monitoring/alert-evaluator.ts tests/unit/monitoring/alert-evaluator.test.ts
git commit -m "feat(monitoring): add alert evaluator with cooldown and multi-channel dispatch"
Task 12: SnapshotCollector
Files:
- Create: src/lib/monitoring/snapshot-collector.ts
- Modify: scripts/run-queue-worker.ts (add monitoring worker)
- Modify: ecosystem.config.cjs (add env var)
Step 1: Implement SnapshotCollector
import { collectMetrics } from './monitoring-service'
import { AlertEvaluator } from './alert-evaluator'
import { getPayload } from 'payload'
import config from '@payload-config'

let interval: NodeJS.Timeout | null = null
let running = false // true while a collection tick is still in flight
const alertEvaluator = new AlertEvaluator()

export async function startSnapshotCollector(): Promise<void> {
  if (interval) return // already started; avoid stacking timers
  const parsed = parseInt(process.env.MONITORING_SNAPSHOT_INTERVAL || '60000', 10)
  const INTERVAL = Number.isFinite(parsed) && parsed > 0 ? parsed : 60000
  console.log(`[SnapshotCollector] Starting (interval: ${INTERVAL}ms)`)
  interval = setInterval(async () => {
    if (running) return // skip this tick if the previous one has not finished
    running = true
    try {
      const payload = await getPayload({ config })
      const metrics = await collectMetrics()
      await payload.create({ collection: 'monitoring-snapshots', data: { ...metrics, timestamp: new Date().toISOString() } })
      await alertEvaluator.evaluateRules(payload, metrics)
    } catch (error) {
      console.error('[SnapshotCollector] Error:', error)
    } finally {
      running = false
    }
  }, INTERVAL)
}

export async function stopSnapshotCollector(): Promise<void> {
  if (interval) { clearInterval(interval); interval = null }
  console.log('[SnapshotCollector] Stopped')
}
Step 2: Add to queue worker
In scripts/run-queue-worker.ts, add:
const ENABLE_MONITORING = process.env.QUEUE_ENABLE_MONITORING !== 'false'
// ... dynamic import
const { startSnapshotCollector, stopSnapshotCollector } = await import('../src/lib/monitoring/snapshot-collector')
// ... conditional start
if (ENABLE_MONITORING) await startSnapshotCollector()
// ... shutdown
if (ENABLE_MONITORING) stopPromises.push(stopSnapshotCollector())
Step 3: Add env var to ecosystem.config.cjs
Add to queue-worker env:
QUEUE_ENABLE_MONITORING: 'true',
MONITORING_SNAPSHOT_INTERVAL: '60000',
Step 4: Commit
git add src/lib/monitoring/snapshot-collector.ts scripts/run-queue-worker.ts ecosystem.config.cjs
git commit -m "feat(monitoring): add snapshot collector to queue worker"
Task 13: Data Retention Integration
Files:
- Modify: src/lib/retention/retention-config.ts
Step 1: Add 3 new retention policies
{
name: 'monitoring-snapshots',
collection: 'monitoring-snapshots',
retentionDays: parseInt(process.env.RETENTION_MONITORING_SNAPSHOTS_DAYS || '7', 10),
dateField: 'createdAt',
batchSize: 500,
description: 'Delete monitoring snapshots older than X days',
},
{
name: 'monitoring-alert-history',
collection: 'monitoring-alert-history',
retentionDays: parseInt(process.env.RETENTION_MONITORING_ALERTS_DAYS || '90', 10),
dateField: 'createdAt',
batchSize: 100,
description: 'Delete alert history older than X days',
},
{
name: 'monitoring-logs',
collection: 'monitoring-logs',
retentionDays: parseInt(process.env.RETENTION_MONITORING_LOGS_DAYS || '30', 10),
dateField: 'createdAt',
batchSize: 200,
description: 'Delete monitoring logs older than X days',
},
Step 2: Commit
git add src/lib/retention/retention-config.ts
git commit -m "feat(monitoring): add retention policies for monitoring collections"
Phase 3: API Endpoints
Task 14: Health Endpoint
Files:
- Create: src/app/(payload)/api/monitoring/health/route.ts
Step 1: Implement GET handler
Pattern: Follow community/stats/route.ts. Auth check for super-admin. Call checkSystemHealth(), return JSON.
import { NextRequest, NextResponse } from 'next/server'
import { getPayload } from 'payload'
import config from '@payload-config'
import { checkSystemHealth } from '@/lib/monitoring/monitoring-service'
export async function GET(req: NextRequest) {
try {
const payload = await getPayload({ config })
const { user } = await payload.auth({ headers: req.headers })
if (!user || !(user as any).isSuperAdmin) {
return NextResponse.json({ error: 'Unauthorized' }, { status: 401 })
}
const health = await checkSystemHealth()
return NextResponse.json({ data: health, timestamp: new Date().toISOString() })
} catch (error: unknown) {
return NextResponse.json({ error: error instanceof Error ? error.message : 'Unknown error' }, { status: 500 })
}
}
export const dynamic = 'force-dynamic'
Step 2: Commit
git add "src/app/(payload)/api/monitoring/health/route.ts"
git commit -m "feat(monitoring): add /api/monitoring/health endpoint"
Task 15: Services Endpoint
Files:
- Create: src/app/(payload)/api/monitoring/services/route.ts
Same pattern as health. Calls checkPostgresql(), checkPgBouncer(), checkRedis(), checkSmtp(), checkOAuthTokens(), checkCronJobs(), checkQueues() via Promise.allSettled(). Returns combined result.
Commit:
git commit -m "feat(monitoring): add /api/monitoring/services endpoint"
Task 16: Performance Endpoint
Files:
- Create: src/app/(payload)/api/monitoring/performance/route.ts
Reads ?period=1h|6h|24h|7d query param. Calls performanceTracker.getMetrics(period).
Commit:
git commit -m "feat(monitoring): add /api/monitoring/performance endpoint"
Task 17: Alerts Endpoint + Acknowledge
Files:
- Create: src/app/(payload)/api/monitoring/alerts/route.ts
- Create: src/app/(payload)/api/monitoring/alerts/acknowledge/route.ts
GET /alerts: Query monitoring-alert-history with pagination (?page=1&limit=20), filter by severity, sort by createdAt desc.
POST /alerts/acknowledge: Body { alertId }. Sets acknowledgedBy to current user and resolvedAt to now. Super-admin only.
Commit:
git commit -m "feat(monitoring): add /api/monitoring/alerts + acknowledge endpoints"
Task 18: Logs Endpoint
Files:
- Create: src/app/(payload)/api/monitoring/logs/route.ts
GET /logs: Query monitoring-logs with pagination, filters:
- ?level=warn (exact or gte)
- ?source=cron
- ?search=text (searches in message)
- ?from=ISO&to=ISO (date range)
- ?page=1&limit=50
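The filters above could be mapped to a Payload where clause along these lines. The buildLogsWhere helper is hypothetical, it takes the query as a plain object (for example from Object.fromEntries(url.searchParams)), and it picks the "gte" interpretation of ?level by expanding the level into itself plus all more severe levels.

```typescript
type LogLevel = 'debug' | 'info' | 'warn' | 'error' | 'fatal'

// Ordered from least to most severe
const LEVELS: LogLevel[] = ['debug', 'info', 'warn', 'error', 'fatal']

type WhereClause = Record<string, unknown>

export function buildLogsWhere(query: Record<string, string | undefined>): { and: WhereClause[] } {
  const and: WhereClause[] = []

  // level=warn matches warn, error and fatal; unknown levels are ignored
  const levelIndex = LEVELS.indexOf(query.level as LogLevel)
  if (levelIndex !== -1) and.push({ level: { in: LEVELS.slice(levelIndex) } })

  if (query.source) and.push({ source: { equals: query.source } })
  if (query.search) and.push({ message: { like: query.search } })
  if (query.from) and.push({ createdAt: { greater_than_equal: query.from } })
  if (query.to) and.push({ createdAt: { less_than_equal: query.to } })

  return { and }
}
```

The result would be passed as the where argument of payload.find({ collection: 'monitoring-logs', ... }) together with page, limit, and a descending sort on createdAt.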
Commit:
git commit -m "feat(monitoring): add /api/monitoring/logs endpoint"
Task 19: Snapshots Endpoint
Files:
- Create: src/app/(payload)/api/monitoring/snapshots/route.ts
GET /snapshots: Query monitoring-snapshots for trend data.
- ?period=1h|6h|24h|7d (default: 24h)
- ?fields=system.cpuUsagePercent,system.memoryUsagePercent (optional field selection for bandwidth)
- Returns array sorted by timestamp asc (oldest first for charts).
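A sketch of the two pieces of logic this endpoint needs: turning ?period into a start date and trimming each snapshot to the requested fields. Both helper names are hypothetical, and returning the selected values keyed by their dot path (plus timestamp) is an assumed response shape.

```typescript
const PERIOD_MS = new Map<string, number>([
  ['1h', 3_600_000],
  ['6h', 21_600_000],
  ['24h', 86_400_000],
  ['7d', 604_800_000],
])
const DEFAULT_PERIOD_MS = 86_400_000 // 24h

// Start of the query window; unknown or missing periods fall back to 24h.
export function periodStart(period: string | undefined, now: number = Date.now()): Date {
  const windowMs = PERIOD_MS.get(period ?? '') ?? DEFAULT_PERIOD_MS
  return new Date(now - windowMs)
}

// Keeps only the requested dot-notation paths, always including timestamp.
export function pickFields(doc: Record<string, unknown>, fields: string[]): Record<string, unknown> {
  const out: Record<string, unknown> = { timestamp: doc.timestamp }
  for (const path of fields) {
    let current: unknown = doc
    for (const key of path.split('.')) {
      current =
        typeof current === 'object' && current !== null
          ? (current as Record<string, unknown>)[key]
          : undefined
    }
    out[path] = current
  }
  return out
}
```

The route would query snapshots with timestamp greater than or equal to periodStart(period), sort ascending, and map pickFields over the documents when ?fields is present.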
Commit:
git commit -m "feat(monitoring): add /api/monitoring/snapshots endpoint"
Task 20: SSE Stream Endpoint
Files:
- Create: src/app/(payload)/api/monitoring/stream/route.ts
Step 1: Implement SSE stream
Pattern: Follow community/stream/route.ts exactly.
Key differences from community stream:
- Multiple event types with different intervals:
  - Health metrics: every 10s (via checkSystemHealth())
  - Performance metrics: every 30s (via performanceTracker.getMetrics())
  - New alerts: check every 5s (query monitoring-alert-history for new since lastCheck)
  - New logs (warn+): check every 5s (query monitoring-logs where level >= warn since lastCheck)
- Each event has a type field in the SSE data
- Max duration: 25s with reconnect signal (same as community)
// SSE event format:
controller.enqueue(encoder.encode(`event: health\ndata: ${JSON.stringify(healthData)}\n\n`))
controller.enqueue(encoder.encode(`event: alert\ndata: ${JSON.stringify(alertData)}\n\n`))
Note: Use named SSE events (event: health\n) so the client can use eventSource.addEventListener('health', ...).
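To keep the wire format in one place, the stream route could build every frame through a small helper like the hypothetical formatSseEvent below. It follows the standard SSE framing: an optional id line, an event line, a data line, and a blank line to terminate the event.

```typescript
// Serializes one named SSE event. The trailing blank line ends the event.
export function formatSseEvent(event: string, data: unknown, id?: string): string {
  const lines: string[] = []
  if (id !== undefined) lines.push(`id: ${id}`)
  lines.push(`event: ${event}`)
  // JSON.stringify without indentation never emits raw newlines,
  // so a single data: line is sufficient.
  lines.push(`data: ${JSON.stringify(data)}`)
  return lines.join('\n') + '\n\n'
}
```

Inside the ReadableStream, each send then becomes controller.enqueue(encoder.encode(formatSseEvent('health', healthData))).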
Step 2: Commit
git add "src/app/(payload)/api/monitoring/stream/route.ts"
git commit -m "feat(monitoring): add SSE stream endpoint with multi-event types"
Phase 4: Dashboard UI
Task 21: Admin View Registration + NavLinks
Files:
- Create: src/components/admin/MonitoringNavLinks.tsx
- Create: src/components/admin/MonitoringDashboardView.tsx
- Modify: src/payload.config.ts (add view + navlink)
Step 1: Create MonitoringNavLinks
Pattern: Copy CommunityNavLinks.tsx exactly. Single link:
const links = [
{ href: '/admin/monitoring', label: 'Monitoring Dashboard' },
]
Group label: 'Monitoring'.
Step 2: Create MonitoringDashboardView
'use client'
import React from 'react'
import { MonitoringDashboard } from './MonitoringDashboard'
export const MonitoringDashboardView: React.FC = () => <MonitoringDashboard />
export default MonitoringDashboardView
Step 3: Register in payload.config.ts
afterNavLinks: [
// ... existing
'@/components/admin/MonitoringNavLinks#MonitoringNavLinks',
],
views: {
// ... existing
MonitoringDashboard: {
Component: '@/components/admin/MonitoringDashboardView#MonitoringDashboardView',
path: '/monitoring',
},
},
Step 4: Commit
git add src/components/admin/MonitoringNavLinks.tsx src/components/admin/MonitoringDashboardView.tsx src/payload.config.ts
git commit -m "feat(monitoring): register admin view and sidebar navigation"
Task 22: MonitoringDashboard Main Component
Files:
- Create: src/components/admin/MonitoringDashboard.tsx
- Create: src/components/admin/MonitoringDashboard.scss
Step 1: Implement tab shell
Pattern: Follow YouTubeAnalyticsDashboard.tsx structure.
'use client'
import React, { useState, useEffect } from 'react'
import './MonitoringDashboard.scss'
import { SystemHealthTab } from './monitoring/SystemHealthTab'
import { ServicesTab } from './monitoring/ServicesTab'
import { PerformanceTab } from './monitoring/PerformanceTab'
import { AlertsTab } from './monitoring/AlertsTab'
import { LogsTab } from './monitoring/LogsTab'

type Tab = 'health' | 'services' | 'performance' | 'alerts' | 'logs'

export const MonitoringDashboard: React.FC = () => {
  const [activeTab, setActiveTab] = useState<Tab>('health')
  // Held in state rather than a ref: a ref is still null during the first render
  // and changing it does not re-render, so tabs could receive a stale null.
  const [eventSource, setEventSource] = useState<EventSource | null>(null)
  const [connected, setConnected] = useState(false)

  // SSE connection setup
  useEffect(() => {
    const es = new EventSource('/api/monitoring/stream', { withCredentials: true })
    setEventSource(es)
    es.addEventListener('open', () => setConnected(true))
    es.addEventListener('error', () => { setConnected(false); /* auto-reconnect */ })
    return () => { es.close() }
  }, [])

  // Pass eventSource to tabs for real-time updates
  return (
    <div className="monitoring">
      <div className="monitoring__header">
        <h1>Monitoring Dashboard</h1>
        <div className={`monitoring__status ${connected ? 'monitoring__status--connected' : 'monitoring__status--disconnected'}`}>
          {connected ? '● Live' : '○ Disconnected'}
        </div>
      </div>
      <div className="monitoring__tabs">{/* Tab buttons */}</div>
      <div className="monitoring__content">
        {activeTab === 'health' && <SystemHealthTab eventSource={eventSource} />}
        {activeTab === 'services' && <ServicesTab eventSource={eventSource} />}
        {activeTab === 'performance' && <PerformanceTab />}
        {activeTab === 'alerts' && <AlertsTab eventSource={eventSource} />}
        {activeTab === 'logs' && <LogsTab eventSource={eventSource} />}
      </div>
    </div>
  )
}
Step 2: Create SCSS with BEM classes
.monitoring__header, .monitoring__tabs, .monitoring__tab, .monitoring__tab--active, .monitoring__content, .monitoring__status--connected, .monitoring__status--disconnected
Step 3: Commit
git add src/components/admin/MonitoringDashboard.tsx src/components/admin/MonitoringDashboard.scss
git commit -m "feat(monitoring): add main dashboard component with SSE connection and tab shell"
Task 23: Shared UI Components
Files:
- Create: src/components/admin/monitoring/StatusBadge.tsx
- Create: src/components/admin/monitoring/GaugeWidget.tsx
- Create: src/components/admin/monitoring/TrendChart.tsx
- Create: src/components/admin/monitoring/LogTable.tsx
StatusBadge: Simple component: status string → colored badge (online=green, warning=yellow, offline=red).
GaugeWidget: Displays a metric with label, value, unit, and colored arc/bar. Props: { label, value, max, unit, thresholds: { warning: number, critical: number } }.
Use CSS-only approach (no chart library): circular progress with conic-gradient or simple horizontal bar.
TrendChart: Renders time-series data as a simple SVG line chart. Props: { data: Array<{timestamp, value}>, label, unit, height }.
Pure SVG, no chart library, so it adds no third-party dependency to the bundle. Scales automatically to container width.
LogTable: Renders log entries with expandable JSON context. Props: { logs, onLoadMore }.
Each row: level icon, source badge, message, timestamp. Click to expand context JSON.
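The non-visual logic behind GaugeWidget and TrendChart can be sketched as two pure functions. Both names are hypothetical. The chart helper assumes evenly spaced samples (reasonable for 60s snapshots) and spaces points by index rather than by timestamp.

```typescript
interface TrendPoint {
  timestamp: string
  value: number
}

// Maps a series to the `points` attribute of an SVG <polyline>,
// scaled to fill a width × height viewBox.
export function toPolylinePoints(data: TrendPoint[], width: number, height: number): string {
  if (data.length === 0) return ''
  const values = data.map((d) => d.value)
  const min = Math.min(...values)
  const max = Math.max(...values)
  const range = max - min || 1 // flat series: avoid division by zero
  const stepX = data.length > 1 ? width / (data.length - 1) : 0
  return data
    .map((d, i) => {
      const x = i * stepX
      // SVG y grows downward, so invert the normalized value
      const y = height - ((d.value - min) / range) * height
      return `${x.toFixed(1)},${y.toFixed(1)}`
    })
    .join(' ')
}

// Picks the gauge color band from the thresholds prop.
export function gaugeLevel(
  value: number,
  thresholds: { warning: number; critical: number },
): 'ok' | 'warning' | 'critical' {
  if (value >= thresholds.critical) return 'critical'
  if (value >= thresholds.warning) return 'warning'
  return 'ok'
}
```

Rendering with a viewBox of 0 0 width height and a CSS width of 100% lets the chart follow the container width without recomputing the points.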
Commit:
git commit -m "feat(monitoring): add shared UI components (StatusBadge, GaugeWidget, TrendChart, LogTable)"
Task 24: SystemHealthTab
Files:
- Create: src/components/admin/monitoring/SystemHealthTab.tsx
Implementation:
- Initial fetch: GET /api/monitoring/health
- SSE listener: eventSource.addEventListener('health', ...) updates gauges in real-time
- Trend data: GET /api/monitoring/snapshots?period=24h&fields=system.cpuUsagePercent,system.memoryUsagePercent,system.loadAvg1
- Renders: 4 GaugeWidgets (CPU, RAM, Disk, Uptime) + 3 TrendCharts (CPU 24h, Memory 24h, Load 24h)
Commit:
git commit -m "feat(monitoring): add System Health tab with gauges and trend charts"
Task 25: ServicesTab
Files:
- Create: src/components/admin/monitoring/ServicesTab.tsx
Implementation:
- Initial fetch: GET /api/monitoring/services
- SSE listener: eventSource.addEventListener('service', ...) for status changes
- Renders expandable service cards:
  - Payload CMS (PID, Memory, Uptime, Restarts)
  - Queue Worker (PID, Memory, Active Jobs)
  - PostgreSQL (Connections, Pool, Latency)
  - PgBouncer (Active, Waiting, Pool Size)
  - Redis (Memory, Clients, Ops/s)
  - SMTP (Status, Last Check, Response Time)
  - OAuth Tokens (Meta + YouTube, expiry warnings)
  - Cron Jobs (Last run times per job)
Each card has a StatusBadge header.
Commit:
git commit -m "feat(monitoring): add Services tab with expandable service cards"
Task 26: PerformanceTab
Files:
- Create: src/components/admin/monitoring/PerformanceTab.tsx
Implementation:
- Period selector: 1h, 6h, 24h, 7d (buttons)
- Fetch:
GET /api/monitoring/performance?period=24h - KPI cards: Avg Response Time, P95, P99, Error Rate, RPM
- TrendCharts from snapshots:
GET /api/monitoring/snapshots?period=24h&fields=performance.avgResponseTimeMs,performance.errorRate,performance.requestsPerMinute
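Since the period selector drives both requests, the URLs could be built by a small shared helper. The sketch below is illustrative only: buildMonitoringUrl is a hypothetical name, and it assumes the period parameter accepts the same four values on both endpoints.

```typescript
// Hypothetical helper; the name and signature are assumptions.
type Period = '1h' | '6h' | '24h' | '7d'

// Build a request URL for the performance or snapshots endpoint.
function buildMonitoringUrl(
  endpoint: 'performance' | 'snapshots',
  period: Period,
  fields: string[] = [],
): string {
  const query = [`period=${encodeURIComponent(period)}`]
  if (fields.length > 0) {
    // Field paths are encoded one by one so the separating commas stay literal.
    query.push(`fields=${fields.map((f) => encodeURIComponent(f)).join(',')}`)
  }
  return `/api/monitoring/${endpoint}?${query.join('&')}`
}
```

Switching the selected period then only requires re-running both fetches with the new value, which keeps the KPI cards and trend charts on the same time window.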
Commit:
git commit -m "feat(monitoring): add Performance tab with KPI cards and trend charts"
Task 27: AlertsTab
Files:
- Create: src/components/admin/monitoring/AlertsTab.tsx
Implementation:
- Fetch: GET /api/monitoring/alerts?page=1&limit=20
- SSE listener: eventSource.addEventListener('alert', ...) prepends new alerts
- Active/unacknowledged alerts highlighted at top
- Severity filter (warning, error, critical)
- Acknowledge button: POST /api/monitoring/alerts/acknowledge with { alertId }
- Link to MonitoringAlertRules collection in admin: /admin/collections/monitoring-alert-rules
- Pagination
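The list handling described above (prepend on SSE, unacknowledged first, severity filter) can be sketched as pure functions that are easy to unit test. The AlertItem shape and all function names are assumptions; the real fields would come from the monitoring-alert-history collection.

```typescript
// Assumed alert shape; not taken from the plan.
type Severity = 'warning' | 'error' | 'critical'

interface AlertItem {
  id: string
  severity: Severity
  acknowledged: boolean
  createdAt: string // ISO 8601
}

// Prepend an alert received over SSE, ignoring duplicates by id.
function prependAlert(alerts: AlertItem[], incoming: AlertItem): AlertItem[] {
  if (alerts.some((a) => a.id === incoming.id)) return alerts
  return [incoming, ...alerts]
}

// Unacknowledged alerts first, then newest first within each group.
function sortForDisplay(alerts: AlertItem[]): AlertItem[] {
  return [...alerts].sort((a, b) => {
    if (a.acknowledged !== b.acknowledged) return a.acknowledged ? 1 : -1
    return Date.parse(b.createdAt) - Date.parse(a.createdAt)
  })
}

// Apply the severity filter; null means "show all".
function filterBySeverity(alerts: AlertItem[], severity: Severity | null): AlertItem[] {
  return severity === null ? alerts : alerts.filter((a) => a.severity === severity)
}
```

Keeping these outside the component means the SSE handler only has to call a state setter with prependAlert, and the render path applies filterBySeverity followed by sortForDisplay.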
Commit:
git commit -m "feat(monitoring): add Alerts tab with acknowledge and real-time updates"
Task 28: LogsTab
Files:
- Create: src/components/admin/monitoring/LogsTab.tsx
Implementation:
- Fetch: GET /api/monitoring/logs?page=1&limit=50
- SSE listener: eventSource.addEventListener('log', ...) prepends new warn+ entries
- Filters: level dropdown, source dropdown, text search input, date range
- Uses LogTable component
- Load more button for pagination
- Auto-scroll toggle for new SSE entries
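Entries that arrive over SSE do not pass through the paginated query, so the active filters may also need to be applied on the client. The sketch below shows one possible predicate. The LogEntry and LogFilters shapes, the level ordering, and the choice to treat the level dropdown as a minimum level are all assumptions.

```typescript
// Assumed level ordering, lowest to highest; not taken from the plan.
type LogLevel = 'debug' | 'info' | 'warn' | 'error' | 'fatal'
const LEVEL_ORDER: LogLevel[] = ['debug', 'info', 'warn', 'error', 'fatal']

// Assumed entry shape; the real fields come from the monitoring-logs collection.
interface LogEntry {
  level: LogLevel
  source: string
  message: string
  timestamp: string // ISO 8601
}

interface LogFilters {
  minLevel?: LogLevel // level dropdown, treated here as "this level or higher"
  source?: string // source dropdown
  search?: string // text search, case-insensitive match on the message
  from?: string // date range start (inclusive)
  to?: string // date range end (inclusive)
}

// Return true when the entry passes every filter that is set.
function matchesFilters(entry: LogEntry, f: LogFilters): boolean {
  if (f.minLevel && LEVEL_ORDER.indexOf(entry.level) < LEVEL_ORDER.indexOf(f.minLevel)) return false
  if (f.source && entry.source !== f.source) return false
  if (f.search && !entry.message.toLowerCase().includes(f.search.toLowerCase())) return false
  const t = Date.parse(entry.timestamp)
  if (f.from && t < Date.parse(f.from)) return false
  if (f.to && t > Date.parse(f.to)) return false
  return true
}
```

In the 'log' listener, an incoming entry would be prepended only if matchesFilters returns true for the current filter state.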
Commit:
git commit -m "feat(monitoring): add Logs tab with filters, search, and real-time updates"
Phase 5: Final Integration
Task 29: Generate ImportMap & Build Test
Step 1: Generate import map
pnpm payload generate:importmap
Step 2: Build test
pm2 stop payload
NODE_OPTIONS="--no-deprecation --max-old-space-size=1024" pnpm build
pm2 start payload
Step 3: Fix any build errors
Step 4: Commit
git add src/app/\(payload\)/importMap.js
git commit -m "chore(monitoring): regenerate import map and verify build"
Task 30: Run All Tests
pnpm test tests/unit/monitoring/
Fix any failures, then:
git commit -m "test(monitoring): fix test issues and verify all monitoring tests pass"
Task 31: Update Documentation
Files:
- Modify: CLAUDE.md (add Monitoring to Subsysteme table, add collections)
- Modify: docs/CLAUDE_REFERENCE.md (add Monitoring section)
- Modify: docs/PROJECT_STATUS.md (mark as completed)
CLAUDE.md changes:
- Add to Subsysteme table: | Monitoring & Alerting | src/lib/monitoring/, API: /api/monitoring/* | docs/CLAUDE_REFERENCE.md |
- Add 4 collections to Collections table: monitoring-snapshots, monitoring-logs, monitoring-alert-rules, monitoring-alert-history
CLAUDE_REFERENCE.md: Add new section with API endpoints, SSE events, env vars.
PROJECT_STATUS.md: Move "Monitoring & Alerting Dashboard" from Langfristig to Abgeschlossen.
Commit:
git commit -m "docs: add monitoring dashboard to project documentation"
Environment Variables Summary
Add to .env (all optional with defaults):
# Monitoring
QUEUE_ENABLE_MONITORING=true
MONITORING_SNAPSHOT_INTERVAL=60000
MONITORING_LOG_LEVEL=info
RETENTION_MONITORING_SNAPSHOTS_DAYS=7
RETENTION_MONITORING_ALERTS_DAYS=90
RETENTION_MONITORING_LOGS_DAYS=30
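Because every variable is optional, the code that reads them needs a fallback for each. The sketch below assumes that the values listed above are the defaults and that the interval is in milliseconds (60000 matches the 60s collection cycle). The function and type names are hypothetical.

```typescript
// Hypothetical config shape; field names are assumptions.
interface MonitoringConfig {
  enabled: boolean
  snapshotIntervalMs: number
  logLevel: string
  retentionDays: { snapshots: number; alerts: number; logs: number }
}

// Parse a positive integer, falling back when the value is missing or invalid.
function parsePositiveInt(raw: string | undefined, fallback: number): number {
  const n = Number.parseInt(raw ?? '', 10)
  return Number.isFinite(n) && n > 0 ? n : fallback
}

// Read monitoring settings from an env-like object.
// Defaults are ASSUMED to equal the values shown in the .env listing.
function readMonitoringConfig(env: Record<string, string | undefined>): MonitoringConfig {
  return {
    enabled: (env['QUEUE_ENABLE_MONITORING'] ?? 'true') === 'true',
    snapshotIntervalMs: parsePositiveInt(env['MONITORING_SNAPSHOT_INTERVAL'], 60000),
    logLevel: env['MONITORING_LOG_LEVEL'] ?? 'info',
    retentionDays: {
      snapshots: parsePositiveInt(env['RETENTION_MONITORING_SNAPSHOTS_DAYS'], 7),
      alerts: parsePositiveInt(env['RETENTION_MONITORING_ALERTS_DAYS'], 90),
      logs: parsePositiveInt(env['RETENTION_MONITORING_LOGS_DAYS'], 30),
    },
  }
}
```

In the queue worker this would be called once at startup as readMonitoringConfig(process.env). Taking the env object as a parameter keeps the function testable without mutating process.env.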
Task Dependency Graph
Phase 1 (Foundation):
Task 1 (Types) → Task 2-5 (Collections) → Task 6 (Migration)
Phase 2 (Services):
Task 1 → Task 7 (Health) → Task 8 (Services) → Task 9 (PerfTracker)
Task 1 → Task 10 (Logger)
Task 1 → Task 11 (AlertEvaluator)
Task 7,8,9,11 → Task 12 (SnapshotCollector)
Task 2-5 → Task 13 (Retention)
Phase 3 (APIs):
Task 7 → Task 14 (Health API)
Task 8 → Task 15 (Services API)
Task 9 → Task 16 (Performance API)
Task 5,11 → Task 17 (Alerts API)
Task 3,10 → Task 18 (Logs API)
Task 2 → Task 19 (Snapshots API)
Task 7,8,9,10,11 → Task 20 (SSE Stream)
Phase 4 (UI):
Task 21 (Registration) → Task 22 (Main Component) → Task 23 (Shared Components)
Task 23 → Tasks 24-28 (Tab Components) — can be parallel
Phase 5 (Integration):
All → Task 29 (Build) → Task 30 (Tests) → Task 31 (Docs)
Estimated File Count
| Category | Files |
|---|---|
| Collections (4) | 4 |
| Lib/Monitoring (6) | 6 |
| API Routes (8) | 8 |
| UI Components (12) | 12 |
| Tests (5) | 5 |
| Migrations (1) | 1 |
| Modified (5) | payload.config.ts, run-queue-worker.ts, ecosystem.config.cjs, retention-config.ts, access/index.ts |
| Total | ~41 files |