cms.c2sgmbh/docs/plans/2026-02-14-monitoring-dashboard-design.md
Martin Porwoll 15bdd66eb6 docs: add monitoring & alerting dashboard design
Event-driven architecture with SSE real-time updates, 5-tab dashboard
(System Health, Services, Performance, Alerts, Logs), 4 new collections,
and integration with existing alert-service.ts.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-14 23:53:25 +00:00

Monitoring & Alerting Dashboard - Design

Date: 2026-02-14
Status: Approved

Goal

A comprehensive monitoring dashboard in the Payload admin panel, providing:

  • System health monitoring (CPU, RAM, disk)
  • Service status (DB, Redis, queue, SMTP, OAuth)
  • Performance tracking (response times, error rates)
  • Configurable alerting (multi-channel: email, Slack, Discord)
  • Structured log viewer
  • Real-time updates via SSE

Architecture: Event-Driven with SSE + REST

REST endpoints serve initial and historical data; an SSE stream delivers live updates.

┌─────────────────────────────────────────────────────────┐
│  Admin UI: /admin/monitoring                             │
│  ┌──────────┬──────────┬───────────┬────────┬────────┐  │
│  │ System   │ Services │Performance│ Alerts │  Logs  │  │
│  │ Health   │          │           │        │        │  │
│  └────┬─────┴────┬─────┴─────┬─────┴───┬────┴───┬────┘  │
│       │          │           │         │        │        │
│       ▼          ▼           ▼         ▼        ▼        │
│  SSE Stream (/api/monitoring/stream)                     │
│  + REST Endpoints (/api/monitoring/*)                    │
└───────────────────────┬─────────────────────────────────┘
                        │
┌───────────────────────▼─────────────────────────────────┐
│  Backend Services                                        │
│  ┌────────────────┐  ┌──────────────┐  ┌──────────────┐ │
│  │MonitoringService│  │ Performance  │  │  Alert       │ │
│  │(Health, Services│  │ Tracker      │  │  Evaluator   │ │
│  │ OAuth, SMTP)   │  │ (Ring-Buffer)│  │  (Rules DB)  │ │
│  └───────┬────────┘  └──────┬───────┘  └──────┬───────┘ │
│          │                  │                  │         │
│  ┌───────▼──────────────────▼──────────────────▼───────┐ │
│  │  SnapshotCollector (60s interval, in queue worker)  │ │
│  └─────────────────────────┬───────────────────────────┘ │
│                            │                             │
│  ┌─────────────────────────▼───────────────────────────┐ │
│  │  MonitoringLogger (Structured Logs → Collection)    │ │
│  └─────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────┘
                        │
┌───────────────────────▼─────────────────────────────────┐
│  Collections                                             │
│  MonitoringSnapshots │ MonitoringAlertRules               │
│  MonitoringAlertHistory │ MonitoringLogs                  │
└─────────────────────────────────────────────────────────┘

Scope: Payload Stack + External Services

Monitored:

  • Payload CMS process (PM2)
  • Queue worker process (PM2)
  • PostgreSQL + PgBouncer
  • Redis
  • SMTP connections
  • OAuth token status (Meta, YouTube)
  • Cron job health
  • BullMQ queues

Collections (4)

MonitoringAlertRules

Configurable alert rules, managed in the admin panel.

| Field | Type | Description |
|---|---|---|
| name | text | Rule name |
| metric | text | Metric path (e.g. system.memory.usagePercent) |
| condition | select | gt, lt, eq, gte, lte |
| threshold | number | Threshold value |
| severity | select | warning, error, critical |
| channels | select (hasMany) | email, slack, discord |
| recipients.email | array | Email recipients |
| recipients.slackWebhook | text | Slack webhook URL |
| recipients.discordWebhook | text | Discord webhook URL |
| cooldownMinutes | number | Minimum interval between alerts (default: 15) |
| enabled | checkbox | Active/inactive |
| tenant | relationship | Optional: tenant-specific |
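
To illustrate how the metric, condition, and threshold fields could work together, here is a minimal sketch of evaluating a single rule against a nested metrics object. The helper names (resolveMetric, matchesCondition, ruleIsViolated) and the shape of the metrics object are assumptions made for this example; the design does not specify them.

```typescript
// Sketch only: the types mirror a subset of the MonitoringAlertRules fields.
type Condition = "gt" | "lt" | "eq" | "gte" | "lte";

interface AlertRule {
  name: string;
  metric: string; // dot-separated path, e.g. "system.memory.usagePercent"
  condition: Condition;
  threshold: number;
  enabled: boolean;
}

// Hypothetical helper: walk a nested metrics object along a dot-separated path.
function resolveMetric(metrics: unknown, path: string): number | undefined {
  let current: unknown = metrics;
  for (const key of path.split(".")) {
    if (typeof current !== "object" || current === null) return undefined;
    current = (current as Record<string, unknown>)[key];
  }
  return typeof current === "number" ? current : undefined;
}

// Hypothetical helper: apply the comparison operator selected in the rule.
function matchesCondition(value: number, condition: Condition, threshold: number): boolean {
  switch (condition) {
    case "gt": return value > threshold;
    case "lt": return value < threshold;
    case "eq": return value === threshold;
    case "gte": return value >= threshold;
    case "lte": return value <= threshold;
  }
}

// A disabled rule or a missing metric never counts as a violation.
function ruleIsViolated(rule: AlertRule, metrics: unknown): boolean {
  if (!rule.enabled) return false;
  const value = resolveMetric(metrics, rule.metric);
  return value !== undefined && matchesCondition(value, rule.condition, rule.threshold);
}
```

Treating an unresolvable metric path as "not violated" is one possible choice; a real implementation might prefer to log such rules so that typos in the path do not go unnoticed.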

MonitoringAlertHistory

Alert log (WORM: write once, read many).

| Field | Type | Description |
|---|---|---|
| rule | relationship | → MonitoringAlertRules |
| metric | text | Metric path |
| value | number | Metric value when the alert fired |
| threshold | number | Threshold value |
| severity | select | warning, error, critical |
| message | text | Alert message |
| channelsSent | select (hasMany) | Channels the alert was sent to |
| resolvedAt | date | Time of resolution |
| acknowledgedBy | relationship | → Users |

MonitoringLogs

Structured logs for business events.

| Field | Type | Description |
|---|---|---|
| level | select | debug, info, warn, error, fatal |
| source | select | payload, queue-worker, cron, email, oauth, sync |
| message | text | Log message |
| context | json | Structured metadata |
| requestId | text | Correlation ID |
| userId | relationship | → Users |
| tenant | relationship | → Tenants |
| duration | number | Duration in ms |

MonitoringSnapshots

Historical system metrics for trend charts.

| Field | Type | Description |
|---|---|---|
| timestamp | date | Timestamp |
| system.cpuUsagePercent | number | CPU usage (%) |
| system.memoryUsedMB | number | RAM used (MB) |
| system.memoryTotalMB | number | RAM total (MB) |
| system.memoryUsagePercent | number | RAM usage (%) |
| system.diskUsedGB | number | Disk used (GB) |
| system.diskTotalGB | number | Disk total (GB) |
| system.diskUsagePercent | number | Disk usage (%) |
| system.loadAvg1 | number | Load average, 1 min |
| system.loadAvg5 | number | Load average, 5 min |
| system.uptime | number | Uptime in seconds |
| services.payload | json | { status, pid, memory, uptime, restarts } |
| services.queueWorker | json | { status, pid, memory, uptime, restarts } |
| services.postgresql | json | { status, connections, poolSize, latency } |
| services.pgbouncer | json | { status, activeConns, waitingClients } |
| services.redis | json | { status, memoryUsed, clients, opsPerSec } |
| external.smtp | json | { status, lastCheck, responseTime } |
| external.metaOAuth | json | { status, tokensExpiring, tokensExpired } |
| external.youtubeOAuth | json | { status, tokensExpiring, tokensExpired } |
| external.cronJobs | json | { lastRuns: { ... } } |
| performance.avgResponseTime | number | Average response time |
| performance.errorRate | number | Error rate (%) |
| performance.requestsPerMinute | number | Requests per minute |

API Endpoints

| Method | Endpoint | Description | Auth |
|---|---|---|---|
| GET | /api/monitoring/health | System status (live) | Super-Admin / monitoring |
| GET | /api/monitoring/services | Service status | Super-Admin / monitoring |
| GET | /api/monitoring/performance | Performance metrics | Super-Admin / monitoring |
| GET | /api/monitoring/alerts | Alert history (paginated) | Super-Admin / monitoring |
| POST | /api/monitoring/alerts/acknowledge | Acknowledge an alert | Super-Admin |
| GET | /api/monitoring/logs | Logs (paginated, filterable) | Super-Admin / monitoring |
| GET | /api/monitoring/snapshots | Historical metrics | Super-Admin / monitoring |
| GET | /api/monitoring/stream | SSE real-time stream | Super-Admin / monitoring |

SSE Stream Events

| Event | Interval | Data |
|---|---|---|
| health | 10s | System metrics (CPU, RAM, disk) |
| service | On change | Service status updates |
| alert | Immediately | New alerts |
| log | Immediately (warn+) | New log entries (level >= warn) |
| performance | 30s | Performance metrics |
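
As a rough illustration of how these events could be put on the wire, the sketch below serializes a named event in the standard text/event-stream framing (an `event:` line, a `data:` line, and a blank line) and applies the warn+ filter for log events. The function names formatSseEvent and shouldStreamLog are hypothetical and not part of the design.

```typescript
// Event names taken from the SSE stream events table.
type MonitoringEvent = "health" | "service" | "alert" | "log" | "performance";

const LOG_LEVELS = ["debug", "info", "warn", "error", "fatal"] as const;
type LogLevel = (typeof LOG_LEVELS)[number];

// Serializes one event in text/event-stream framing.
// JSON.stringify without indentation escapes newlines inside strings,
// so the payload always fits on a single data: line.
function formatSseEvent(event: MonitoringEvent, data: unknown): string {
  return `event: ${event}\ndata: ${JSON.stringify(data)}\n\n`;
}

// Only entries at level warn or above are pushed over the stream.
function shouldStreamLog(level: LogLevel): boolean {
  return LOG_LEVELS.indexOf(level) >= LOG_LEVELS.indexOf("warn");
}
```

On the client side, a browser EventSource would receive these frames through addEventListener("health", ...) and similar listeners, one per event name.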

Backend Services

MonitoringService (src/lib/monitoring/monitoring-service.ts)

Central service for collecting metrics.

collectMetrics(): Promise<SystemMetrics>
checkSystemHealth(): SystemHealth      // CPU, RAM, disk, uptime via os module
checkPostgresql(): ServiceStatus       // pg_stat_activity, latency test
checkPgBouncer(): ServiceStatus        // SHOW POOLS via PgBouncer admin console
checkRedis(): ServiceStatus            // INFO command
checkSmtp(): ServiceStatus             // SMTP EHLO check
checkOAuthTokens(): OAuthStatus        // check SocialAccounts token expiry
checkCronJobs(): CronStatus            // last execution times
checkQueues(): QueueStatus             // BullMQ getJobCounts()
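
For the system health part, one common approach is to derive CPU usage from two samples of per-core CPU times, because a single sample only contains cumulative counters. The sketch below assumes input in the shape of the `times` objects returned by Node's os.cpus(), plus byte counts such as those from os.totalmem() and os.freemem(). The function names are illustrative, and disk usage is left out because the os module does not expose it.

```typescript
// Same shape as the `times` objects returned by Node's os.cpus().
interface CpuTimes {
  user: number;
  nice: number;
  sys: number;
  idle: number;
  irq: number;
}

// CPU usage in percent between two samples, aggregated over all cores.
function cpuUsagePercent(before: CpuTimes[], after: CpuTimes[]): number {
  const total = (t: CpuTimes): number => t.user + t.nice + t.sys + t.idle + t.irq;
  let idleDelta = 0;
  let totalDelta = 0;
  after.forEach((now, i) => {
    const prev = before[i];
    if (!prev) return; // core missing in the first sample: skip it
    idleDelta += now.idle - prev.idle;
    totalDelta += total(now) - total(prev);
  });
  return totalDelta > 0 ? (1 - idleDelta / totalDelta) * 100 : 0;
}

// Converts raw byte counts into the memory fields used by MonitoringSnapshots.
function memoryUsage(totalBytes: number, freeBytes: number) {
  const MB = 1024 * 1024;
  const usedBytes = totalBytes - freeBytes;
  return {
    memoryUsedMB: Math.round(usedBytes / MB),
    memoryTotalMB: Math.round(totalBytes / MB),
    memoryUsagePercent: totalBytes > 0 ? (usedBytes / totalBytes) * 100 : 0,
  };
}
```

A caller would take one sample with `os.cpus().map((c) => c.times)`, wait a short moment, take a second sample, and pass both to cpuUsagePercent.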

PerformanceTracker (src/lib/monitoring/performance-tracker.ts)

Lightweight request metrics, kept in an in-memory ring buffer.

trackRequest(method, path, statusCode, duration): void
getMetrics(period: '1h' | '6h' | '24h' | '7d'): PerformanceMetrics

Integration: Payload beforeOperation / afterOperation hooks.
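
A ring buffer keeps memory bounded by overwriting the oldest sample once capacity is reached. The following is a minimal sketch of that idea, not the planned implementation: the class name RequestRingBuffer, the nearest-rank P95 calculation, and counting only status codes of 500 and above as errors are all assumptions made for this example.

```typescript
interface RequestSample {
  timestamp: number; // epoch milliseconds
  statusCode: number;
  durationMs: number;
}

class RequestRingBuffer {
  private readonly samples: RequestSample[] = [];
  private readonly capacity: number;
  private next = 0;

  constructor(capacity: number) {
    this.capacity = capacity;
  }

  // Overwrites the oldest entry once the buffer is full.
  track(sample: RequestSample): void {
    this.samples[this.next] = sample;
    this.next = (this.next + 1) % this.capacity;
  }

  // Aggregates all retained samples with timestamp >= sinceMs.
  metricsSince(sinceMs: number) {
    const recent = this.samples.filter((s) => s.timestamp >= sinceMs);
    if (recent.length === 0) {
      return { count: 0, avgMs: 0, p95Ms: 0, errorRatePercent: 0 };
    }
    const sorted = recent.map((s) => s.durationMs).sort((a, b) => a - b);
    // Nearest-rank percentile: the smallest value with at least 95% of samples at or below it.
    const p95Index = Math.min(sorted.length - 1, Math.ceil(sorted.length * 0.95) - 1);
    const errors = recent.filter((s) => s.statusCode >= 500).length; // assumption: 5xx only
    return {
      count: recent.length,
      avgMs: sorted.reduce((sum, d) => sum + d, 0) / sorted.length,
      p95Ms: sorted[p95Index],
      errorRatePercent: (errors / recent.length) * 100,
    };
  }
}
```

Because the buffer lives in process memory, its contents are lost on restart and are not shared between PM2 processes; longer periods such as 7d would presumably have to come from MonitoringSnapshots.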

SnapshotCollector (src/lib/monitoring/snapshot-collector.ts)

Periodic metric collection; runs inside the queue worker PM2 process.

startCollector(): void       // setInterval(60_000)
stopCollector(): void        // clearInterval, SIGTERM handler
saveSnapshot(metrics): void  // → MonitoringSnapshots Collection
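
The start/stop lifecycle could be sketched as follows. The factory createSnapshotCollector, its injected collect and save functions, and the overlap guard are assumptions for illustration; in the real service, collect would presumably wrap collectMetrics() and save would write to the MonitoringSnapshots collection.

```typescript
interface CollectorDeps<M> {
  collect: () => Promise<M>;
  save: (metrics: M) => Promise<void>;
  intervalMs?: number; // defaults to 60 seconds
  onError?: (error: unknown) => void;
}

function createSnapshotCollector<M>(deps: CollectorDeps<M>) {
  let timer: ReturnType<typeof setInterval> | undefined;
  let busy = false; // prevents overlapping runs when one tick takes longer than the interval

  async function tick(): Promise<void> {
    if (busy) return;
    busy = true;
    try {
      await deps.save(await deps.collect());
    } catch (error) {
      // A failed snapshot must not kill the worker process.
      deps.onError?.(error);
    } finally {
      busy = false;
    }
  }

  return {
    tick,
    start(): void {
      if (timer !== undefined) return; // calling start twice is harmless
      timer = setInterval(() => {
        void tick();
      }, deps.intervalMs ?? 60_000);
    },
    stop(): void {
      if (timer !== undefined) clearInterval(timer);
      timer = undefined;
    },
    isRunning: (): boolean => timer !== undefined,
  };
}
```

In the queue worker, stop() would typically be registered as a SIGTERM handler, for example `process.on("SIGTERM", () => collector.stop())`, so that PM2 restarts do not leave a dangling interval.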

AlertEvaluator (src/lib/monitoring/alert-evaluator.ts)

Checks metrics against MonitoringAlertRules.

evaluateRules(metrics): Promise<Alert[]>
shouldFireAlert(rule, value): boolean    // cooldown + deduplication
dispatchAlert(alert): Promise<void>      // → existing alert-service.ts
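
The cooldown part of shouldFireAlert could look roughly like the sketch below, which remembers the last firing time per rule and suppresses repeats inside the cooldown window. The class name AlertCooldown, the explicit nowMs parameter, and the in-memory Map are assumptions; the design's signature takes (rule, value) and does not say where the last firing time is stored.

```typescript
interface CooldownRule {
  id: string;
  cooldownMinutes: number;
}

class AlertCooldown {
  // Last firing time per rule id, in epoch milliseconds.
  private readonly lastFired = new Map<string, number>();

  // Returns true and records the firing if the rule is outside its cooldown window.
  shouldFire(rule: CooldownRule, nowMs: number): boolean {
    const last = this.lastFired.get(rule.id);
    if (last !== undefined && nowMs - last < rule.cooldownMinutes * 60_000) {
      return false; // still cooling down: suppress the duplicate alert
    }
    this.lastFired.set(rule.id, nowMs);
    return true;
  }
}
```

Keeping this state in memory means a process restart resets all cooldowns. Looking up the most recent MonitoringAlertHistory entry for the rule would be one way to make the check survive restarts.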

MonitoringLogger (src/lib/monitoring/monitoring-logger.ts)

Structured logger that writes to the MonitoringLogs collection.

const logger = createMonitoringLogger('source')
logger.info('message', { context })
logger.warn('message', { context })
logger.error('message', { context, requestId, userId, tenant })
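
A minimal sketch of such a logger factory is shown below. It differs from the call shown above in one respect that is purely an assumption of this example: the persistence step is passed in as a second argument (sink) so the sketch stays self-contained, whereas the real logger would presumably write to the MonitoringLogs collection internally.

```typescript
type LogLevel = "debug" | "info" | "warn" | "error" | "fatal";
type LogSource = "payload" | "queue-worker" | "cron" | "email" | "oauth" | "sync";

interface LogMeta {
  context?: Record<string, unknown>;
  requestId?: string;
  userId?: string;
  tenant?: string;
}

interface LogEntry extends LogMeta {
  level: LogLevel;
  source: LogSource;
  message: string;
  timestamp: string; // ISO 8601
}

// Hypothetical persistence hook; may be synchronous or asynchronous.
type LogSink = (entry: LogEntry) => void | Promise<void>;

function createMonitoringLogger(source: LogSource, sink: LogSink) {
  const write = (level: LogLevel) => (message: string, meta: LogMeta = {}): void => {
    const entry: LogEntry = { ...meta, level, source, message, timestamp: new Date().toISOString() };
    try {
      // Handles sync and async sinks; a failing sink must never break the caller.
      Promise.resolve(sink(entry)).catch(() => undefined);
    } catch {
      // Synchronous sink failure: intentionally ignored in this sketch.
    }
  };
  return {
    debug: write("debug"),
    info: write("info"),
    warn: write("warn"),
    error: write("error"),
    fatal: write("fatal"),
  };
}
```

Swallowing sink errors keeps logging from ever failing a request, at the cost of losing entries silently; writing a fallback line to stderr would be a reasonable refinement.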

Dashboard UI

Admin view: /admin/monitoring with 5 tabs.

Tab 1: System Health

  • Gauge widgets: CPU, RAM, disk, uptime (color-coded green/yellow/red)
  • Trend charts: CPU 24h, memory 24h, load average (from MonitoringSnapshots)
  • Live updates via SSE health events
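
The color coding of the percentage gauges could be reduced to a small pure function like the one below. The thresholds of 70 and 90 percent are placeholders chosen for this sketch; the design does not state where green turns to yellow or yellow to red.

```typescript
type GaugeColor = "green" | "yellow" | "red";

// Maps a usage percentage to a gauge color.
// warnAt and criticalAt are assumed defaults, not values from the design.
function gaugeColor(percent: number, warnAt = 70, criticalAt = 90): GaugeColor {
  if (percent >= criticalAt) return "red";
  if (percent >= warnAt) return "yellow";
  return "green";
}
```

Uptime is not a percentage, so its gauge would need a different rule.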

Tab 2: Services

  • Service cards with status badge (Online/Warning/Offline)
  • Details: PID, memory, uptime, restarts (PM2 processes)
  • PostgreSQL: connections, pool, latency
  • Redis: memory, clients, ops/s
  • OAuth: token status with expiry warning
  • SMTP: last check, response time
  • Cron: last execution times
  • Live updates via SSE service events

Tab 3: Performance

  • Response time chart (avg, P95, P99)
  • Error rate chart
  • Requests per minute chart
  • Time range filter (1h, 6h, 24h, 7d)

Tab 4: Alerts

  • Active/unacknowledged alerts (at the top, highlighted)
  • Alert history table (filterable by severity and time range)
  • Acknowledge button per alert
  • Alert rules CRUD (MonitoringAlertRules)
  • New alerts via SSE alert events

Tab 5: Logs

  • Log table (level, source, message, timestamp)
  • Filters: level, source, time range, full-text search
  • Expandable JSON context per entry
  • Auto-scroll for new warn+ entries (via SSE log events)

Access Control

| Action | Super-Admin | monitoring role |
|---|---|---|
| View dashboard | Yes | Yes |
| Edit alert rules | Yes | No |
| Acknowledge alerts | Yes | No |
| View logs | Yes | Yes |
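
The access matrix can be encoded as a lookup table so that every endpoint and UI element asks the same question. This is a sketch under assumptions: the role identifiers "super-admin" and "monitoring" and the action names are invented for the example and may not match how roles are stored in the Users collection.

```typescript
type Role = "super-admin" | "monitoring";
type Action = "viewDashboard" | "editAlertRules" | "acknowledgeAlerts" | "viewLogs";

// One row per action, listing the roles that may perform it.
const PERMISSIONS: Record<Action, readonly Role[]> = {
  viewDashboard: ["super-admin", "monitoring"],
  editAlertRules: ["super-admin"],
  acknowledgeAlerts: ["super-admin"],
  viewLogs: ["super-admin", "monitoring"],
};

function canPerform(role: Role, action: Action): boolean {
  return PERMISSIONS[action].includes(role);
}
```

Using Record<Action, ...> makes the compiler flag any action that is added to the type without a corresponding permission row.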

Data Retention

| Collection | Retention | Env variable |
|---|---|---|
| monitoring-snapshots | 7 days | RETENTION_MONITORING_SNAPSHOTS_DAYS |
| monitoring-alert-history | 90 days | RETENTION_MONITORING_ALERTS_DAYS |
| monitoring-logs | 30 days | RETENTION_MONITORING_LOGS_DAYS |
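
How the environment variables and defaults from the table might be resolved into a deletion cutoff is sketched below. The helper names retentionDays and retentionCutoff are hypothetical, and the existing Data Retention System may already provide equivalent logic; falling back to the default on missing or invalid values is likewise an assumption.

```typescript
// Defaults taken from the retention table.
const RETENTION_DEFAULTS = {
  RETENTION_MONITORING_SNAPSHOTS_DAYS: 7,
  RETENTION_MONITORING_ALERTS_DAYS: 90,
  RETENTION_MONITORING_LOGS_DAYS: 30,
} as const;

type RetentionVar = keyof typeof RETENTION_DEFAULTS;
type Env = Record<string, string | undefined>;

// Reads the retention period, falling back to the default on missing or invalid input.
function retentionDays(name: RetentionVar, env: Env): number {
  const parsed = Number.parseInt(env[name] ?? "", 10);
  return Number.isInteger(parsed) && parsed > 0 ? parsed : RETENTION_DEFAULTS[name];
}

// Documents older than the returned date are eligible for cleanup.
function retentionCutoff(name: RetentionVar, env: Env, now: Date): Date {
  const DAY_MS = 24 * 60 * 60 * 1000;
  return new Date(now.getTime() - retentionDays(name, env) * DAY_MS);
}
```

In the application, `process.env` would be passed as the env argument; taking it as a parameter keeps the functions easy to test.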

File Structure

src/
├── collections/
│   ├── MonitoringAlertRules.ts
│   ├── MonitoringAlertHistory.ts
│   ├── MonitoringLogs.ts
│   └── MonitoringSnapshots.ts
├── lib/monitoring/
│   ├── monitoring-service.ts
│   ├── performance-tracker.ts
│   ├── snapshot-collector.ts
│   ├── alert-evaluator.ts
│   ├── monitoring-logger.ts
│   └── types.ts
├── app/(payload)/api/monitoring/
│   ├── health/route.ts
│   ├── services/route.ts
│   ├── performance/route.ts
│   ├── alerts/route.ts
│   ├── alerts/acknowledge/route.ts
│   ├── logs/route.ts
│   ├── snapshots/route.ts
│   └── stream/route.ts
├── components/admin/
│   ├── MonitoringDashboard.tsx
│   ├── MonitoringDashboard.scss
│   ├── MonitoringNavLinks.tsx
│   └── monitoring/
│       ├── SystemHealthTab.tsx
│       ├── ServicesTab.tsx
│       ├── PerformanceTab.tsx
│       ├── AlertsTab.tsx
│       ├── LogsTab.tsx
│       ├── GaugeWidget.tsx
│       ├── TrendChart.tsx
│       ├── StatusBadge.tsx
│       └── LogTable.tsx

Dependencies

New npm packages: none. The design uses only the Node.js os module and existing DB connections.

Existing infrastructure that is reused:

  • src/lib/alerting/alert-service.ts - multi-channel alerting
  • src/lib/queue/ - BullMQ integration
  • src/lib/redis.ts - Redis client
  • /api/community/stream pattern - SSE implementation
  • Data Retention System - automatic cleanup