# Monitoring & Alerting Dashboard - Design

**Date:** 14 February 2026

**Status:** Approved

## Goal

A comprehensive monitoring dashboard in the Payload admin panel with:

- System health monitoring (CPU, RAM, disk)
- Service status (DB, Redis, queue, SMTP, OAuth)
- Performance tracking (response times, error rates)
- Configurable alerting (multi-channel: email, Slack, Discord)
- Structured log viewer
- Real-time updates via SSE

## Architecture: Event-Driven with SSE + REST

REST endpoints serve initial and historical data; an SSE stream delivers live updates.

```
┌─────────────────────────────────────────────────────────┐
│  Admin UI: /admin/monitoring                            │
│  ┌──────────┬──────────┬───────────┬────────┬────────┐  │
│  │ System   │ Services │Performance│ Alerts │ Logs   │  │
│  │ Health   │          │           │        │        │  │
│  └────┬─────┴────┬─────┴─────┬─────┴───┬────┴───┬────┘  │
│       │          │           │         │        │       │
│       ▼          ▼           ▼         ▼        ▼       │
│          SSE Stream (/api/monitoring/stream)            │
│          + REST Endpoints (/api/monitoring/*)           │
└───────────────────────┬─────────────────────────────────┘
                        │
┌───────────────────────▼─────────────────────────────────┐
│  Backend Services                                       │
│ ┌─────────────────┐  ┌──────────────┐  ┌──────────────┐ │
│ │MonitoringService│  │ Performance  │  │ Alert        │ │
│ │(Health, Services│  │ Tracker      │  │ Evaluator    │ │
│ │ OAuth, SMTP)    │  │ (Ring-Buffer)│  │ (Rules DB)   │ │
│ └────────┬────────┘  └──────┬───────┘  └──────┬───────┘ │
│          │                  │                 │         │
│ ┌────────▼──────────────────▼─────────────────▼───────┐ │
│ │ SnapshotCollector (60s interval in queue worker)    │ │
│ └─────────────────────────┬───────────────────────────┘ │
│                           │                             │
│ ┌─────────────────────────▼───────────────────────────┐ │
│ │ MonitoringLogger (Structured Logs → Collection)     │ │
│ └─────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────┘
                        │
┌───────────────────────▼─────────────────────────────────┐
│  Collections                                            │
│  MonitoringSnapshots    │ MonitoringAlertRules          │
│  MonitoringAlertHistory │ MonitoringLogs                │
└─────────────────────────────────────────────────────────┘
```

## Scope: Payload Stack + External Services

**Monitored:**

- Payload CMS process (PM2)
- Queue worker process (PM2)
- PostgreSQL + PgBouncer
- Redis
- SMTP connections
- OAuth token status (Meta, YouTube)
- Cron job health
- BullMQ queues

## Collections (4)

### MonitoringAlertRules

Configurable alert rules, managed in the admin panel.

| Field | Type | Description |
|-------|------|-------------|
| name | text | Rule name |
| metric | text | Metric path (e.g. `system.memory.usagePercent`) |
| condition | select | gt, lt, eq, gte, lte |
| threshold | number | Threshold value |
| severity | select | warning, error, critical |
| channels | select (hasMany) | email, slack, discord |
| recipients.email | array | Email recipients |
| recipients.slackWebhook | text | Slack webhook URL |
| recipients.discordWebhook | text | Discord webhook URL |
| cooldownMinutes | number | Minimum interval between alerts, in minutes (default: 15) |
| enabled | checkbox | Active/inactive |
| tenant | relationship | Optional: tenant-specific |

### MonitoringAlertHistory

Alert log (WORM - write once, read many).

| Field | Type | Description |
|-------|------|-------------|
| rule | relationship | → MonitoringAlertRules |
| metric | text | Metric path |
| value | number | Current value |
| threshold | number | Threshold value |
| severity | select | warning, error, critical |
| message | text | Alert message |
| channelsSent | select (hasMany) | Channels the alert was sent to |
| resolvedAt | date | Time of resolution |
| acknowledgedBy | relationship | → Users |

### MonitoringLogs

Structured logs for business events.

| Field | Type | Description |
|-------|------|-------------|
| level | select | debug, info, warn, error, fatal |
| source | select | payload, queue-worker, cron, email, oauth, sync |
| message | text | Log message |
| context | json | Structured metadata |
| requestId | text | Correlation ID |
| userId | relationship | → Users |
| tenant | relationship | → Tenants |
| duration | number | Duration in ms |

### MonitoringSnapshots

Historical system metrics for trend charts.

| Field | Type | Description |
|-------|------|-------------|
| timestamp | date | Timestamp |
| system.cpuUsagePercent | number | CPU usage % |
| system.memoryUsedMB | number | RAM used |
| system.memoryTotalMB | number | RAM total |
| system.memoryUsagePercent | number | RAM usage % |
| system.diskUsedGB | number | Disk used |
| system.diskTotalGB | number | Disk total |
| system.diskUsagePercent | number | Disk usage % |
| system.loadAvg1 | number | Load average, 1 min |
| system.loadAvg5 | number | Load average, 5 min |
| system.uptime | number | Uptime in seconds |
| services.payload | json | { status, pid, memory, uptime, restarts } |
| services.queueWorker | json | { status, pid, memory, uptime, restarts } |
| services.postgresql | json | { status, connections, poolSize, latency } |
| services.pgbouncer | json | { status, activeConns, waitingClients } |
| services.redis | json | { status, memoryUsed, clients, opsPerSec } |
| external.smtp | json | { status, lastCheck, responseTime } |
| external.metaOAuth | json | { status, tokensExpiring, tokensExpired } |
| external.youtubeOAuth | json | { status, tokensExpiring, tokensExpired } |
| external.cronJobs | json | { lastRuns: { ... } } |
| performance.avgResponseTime | number | Average response time |
| performance.errorRate | number | Error rate % |
| performance.requestsPerMinute | number | Requests per minute |

## API Endpoints

| Method | Endpoint | Description | Auth |
|--------|----------|-------------|------|
| GET | `/api/monitoring/health` | System status (live) | Super-Admin / monitoring |
| GET | `/api/monitoring/services` | Service status | Super-Admin / monitoring |
| GET | `/api/monitoring/performance` | Performance metrics | Super-Admin / monitoring |
| GET | `/api/monitoring/alerts` | Alert history (paginated) | Super-Admin / monitoring |
| POST | `/api/monitoring/alerts/acknowledge` | Acknowledge an alert | Super-Admin |
| GET | `/api/monitoring/logs` | Logs (paginated, filterable) | Super-Admin / monitoring |
| GET | `/api/monitoring/snapshots` | Historical metrics | Super-Admin / monitoring |
| GET | `/api/monitoring/stream` | SSE real-time stream | Super-Admin / monitoring |

### SSE Stream Events

| Event | Interval | Data |
|-------|----------|------|
| `health` | 10s | System metrics (CPU, RAM, disk) |
| `service` | On change | Service status updates |
| `alert` | Immediately | New alerts |
| `log` | Immediately (warn+) | New log entries (level >= warn) |
| `performance` | 30s | Performance metrics |

## Backend Services

### MonitoringService (`src/lib/monitoring/monitoring-service.ts`)

Central service for metric collection.
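
As a rough sketch of the system-health part of this service, the following reads RAM, load average and uptime from Node's `os` module. The function name `readSystemHealth` and the `SystemHealthSketch` shape are assumptions for illustration, with field names borrowed from MonitoringSnapshots. CPU percentage (which needs two samples of `os.cpus()`) and disk usage (not reported by `os`; `fs.statfs` in Node 18.15+ would be one option) are left out.

```typescript
import * as os from 'node:os';

// Hypothetical shape; field names follow the MonitoringSnapshots collection.
interface SystemHealthSketch {
  memoryUsedMB: number;
  memoryTotalMB: number;
  memoryUsagePercent: number;
  loadAvg1: number;
  loadAvg5: number;
  uptime: number; // seconds
}

function readSystemHealth(): SystemHealthSketch {
  const totalBytes = os.totalmem();
  const usedBytes = totalBytes - os.freemem();
  const load = os.loadavg(); // [1 min, 5 min, 15 min]; all zeros on Windows
  const toMB = (bytes: number): number => Math.round(bytes / (1024 * 1024));

  return {
    memoryUsedMB: toMB(usedBytes),
    memoryTotalMB: toMB(totalBytes),
    // Percentage rounded to one decimal place.
    memoryUsagePercent: Math.round((usedBytes / totalBytes) * 1000) / 10,
    loadAvg1: load[0] ?? 0,
    loadAvg5: load[1] ?? 0,
    uptime: Math.floor(os.uptime()),
  };
}
```

The function is synchronous, matching the non-Promise return type of `checkSystemHealth()` in the signatures below.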

```typescript
collectMetrics(): Promise
checkSystemHealth(): SystemHealth   // CPU, RAM, disk, uptime via os module
checkPostgresql(): ServiceStatus    // pg_stat_activity, latency test
checkPgBouncer(): ServiceStatus     // SHOW POOLS via PgBouncer admin
checkRedis(): ServiceStatus         // INFO command
checkSmtp(): ServiceStatus          // SMTP EHLO check
checkOAuthTokens(): OAuthStatus     // check SocialAccounts token expiry
checkCronJobs(): CronStatus         // last execution times
checkQueues(): QueueStatus          // BullMQ getJobCounts()
```

### PerformanceTracker (`src/lib/monitoring/performance-tracker.ts`)

Lightweight request metrics held in an in-memory ring buffer.

```typescript
trackRequest(method, path, statusCode, duration): void
getMetrics(period: '1h' | '6h' | '24h' | '7d'): PerformanceMetrics
```

Integration: Payload `beforeOperation` / `afterOperation` hooks.

### SnapshotCollector (`src/lib/monitoring/snapshot-collector.ts`)

Periodic metric collection; runs in the queue worker PM2 process.

```typescript
startCollector(): void        // setInterval(60_000)
stopCollector(): void         // clearInterval, SIGTERM handler
saveSnapshot(metrics): void   // → MonitoringSnapshots collection
```

### AlertEvaluator (`src/lib/monitoring/alert-evaluator.ts`)

Checks metrics against MonitoringAlertRules.

```typescript
evaluateRules(metrics): Promise
shouldFireAlert(rule, value): boolean   // cooldown + deduplication
dispatchAlert(alert): Promise           // → existing alert-service.ts
```

### MonitoringLogger (`src/lib/monitoring/monitoring-logger.ts`)

Structured logger that writes to the MonitoringLogs collection.

```typescript
const logger = createMonitoringLogger('source')
logger.info('message', { context })
logger.warn('message', { context })
logger.error('message', { context, requestId, userId, tenant })
```

## Dashboard UI

Admin view: `/admin/monitoring` with 5 tabs.
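
The tabs receive live data as named SSE events (`health`, `service`, `alert`, `log`, `performance`) from `/api/monitoring/stream`. A minimal sketch of the wire format and of routing events to tab handlers could look like this. `parseSseFrame` and `TabEventRouter` are hypothetical names, and JSON payloads are an assumption; in the browser, `EventSource` does the frame parsing natively, so the parser here mainly helps with tests or non-browser clients.

```typescript
// Event names taken from the "SSE Stream Events" table.
type MonitoringEventName = 'health' | 'service' | 'alert' | 'log' | 'performance';

interface ParsedSseFrame {
  event: string; // "message" when the frame carries no event field
  data: string;  // data lines joined with "\n"
}

// Parses one SSE frame, i.e. the text between two blank lines of the stream.
function parseSseFrame(frame: string): ParsedSseFrame | null {
  let event = 'message';
  const dataLines: string[] = [];
  for (const line of frame.split(/\r?\n/)) {
    if (line === '' || line.startsWith(':')) continue; // blank line or comment (keep-alive)
    const colon = line.indexOf(':');
    const field = colon === -1 ? line : line.slice(0, colon);
    let value = colon === -1 ? '' : line.slice(colon + 1);
    if (value.startsWith(' ')) value = value.slice(1); // strip one optional leading space
    if (field === 'event') event = value;
    else if (field === 'data') dataLines.push(value);
  }
  // Frames without any data line are not dispatched.
  return dataLines.length > 0 ? { event, data: dataLines.join('\n') } : null;
}

type TabHandler = (payload: unknown) => void;

// Hypothetical router: each tab registers a handler for the event it displays.
class TabEventRouter {
  private handlers = new Map<string, TabHandler[]>();

  on(event: MonitoringEventName, handler: TabHandler): void {
    const list = this.handlers.get(event) ?? [];
    list.push(handler);
    this.handlers.set(event, list);
  }

  // Returns true if at least one handler received the frame.
  dispatch(frame: string): boolean {
    const parsed = parseSseFrame(frame);
    if (!parsed) return false;
    const list = this.handlers.get(parsed.event);
    if (!list || list.length === 0) return false;
    const payload: unknown = JSON.parse(parsed.data); // assumption: payloads are JSON
    for (const handler of list) handler(payload);
    return true;
  }
}
```

In a browser the same routing reduces to `new EventSource('/api/monitoring/stream')` plus one `addEventListener` call per event name.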

### Tab 1: System Health

- Gauge widgets: CPU, RAM, disk, uptime (color-coded green/yellow/red)
- Trend charts: CPU 24h, memory 24h, load average (from MonitoringSnapshots)
- Live updates via SSE `health` events

### Tab 2: Services

- Service cards with status badge (Online/Warning/Offline)
- Details: PID, memory, uptime, restarts (PM2 processes)
- PostgreSQL: connections, pool, latency
- Redis: memory, clients, ops/s
- OAuth: token status with expiry warning
- SMTP: last check, response time
- Cron: last execution times
- Live updates via SSE `service` events

### Tab 3: Performance

- Response-time chart (Avg, P95, P99)
- Error-rate chart
- Requests/minute chart
- Time-range filter (1h, 6h, 24h, 7d)

### Tab 4: Alerts

- Active/unacknowledged alerts (at the top, highlighted)
- Alert history table (filterable by severity and time range)
- Acknowledge button per alert
- Alert rules CRUD (MonitoringAlertRules)
- New alerts via SSE `alert` events

### Tab 5: Logs

- Log table (level, source, message, timestamp)
- Filters: level, source, time range, full-text search
- Expandable JSON context per entry
- Auto-scroll for new warn+ entries (via SSE `log` events)

## Access Control

| Action | Super-Admin | monitoring role |
|--------|-------------|-----------------|
| View dashboard | Yes | Yes |
| Edit alert rules | Yes | No |
| Acknowledge alerts | Yes | No |
| View logs | Yes | Yes |

## Data Retention

| Collection | Retention | Env variable |
|------------|-----------|--------------|
| monitoring-snapshots | 7 days | `RETENTION_MONITORING_SNAPSHOTS_DAYS` |
| monitoring-alert-history | 90 days | `RETENTION_MONITORING_ALERTS_DAYS` |
| monitoring-logs | 30 days | `RETENTION_MONITORING_LOGS_DAYS` |

## File Structure

```
src/
├── collections/
│   ├── MonitoringAlertRules.ts
│   ├── MonitoringAlertHistory.ts
│   ├── MonitoringLogs.ts
│   └── MonitoringSnapshots.ts
├── lib/monitoring/
│   ├── monitoring-service.ts
│   ├── performance-tracker.ts
│   ├── snapshot-collector.ts
│   ├── alert-evaluator.ts
│   ├── monitoring-logger.ts
│   └── types.ts
├── app/(payload)/api/monitoring/
│   ├── health/route.ts
│   ├── services/route.ts
│   ├── performance/route.ts
│   ├── alerts/route.ts
│   ├── alerts/acknowledge/route.ts
│   ├── logs/route.ts
│   ├── snapshots/route.ts
│   └── stream/route.ts
├── components/admin/
│   ├── MonitoringDashboard.tsx
│   ├── MonitoringDashboard.scss
│   ├── MonitoringNavLinks.tsx
│   └── monitoring/
│       ├── SystemHealthTab.tsx
│       ├── ServicesTab.tsx
│       ├── PerformanceTab.tsx
│       ├── AlertsTab.tsx
│       ├── LogsTab.tsx
│       ├── GaugeWidget.tsx
│       ├── TrendChart.tsx
│       ├── StatusBadge.tsx
│       └── LogTable.tsx
```

## Dependencies

**New npm packages:** None - uses only the Node.js `os` module and existing DB connections.

**Existing infrastructure that is reused:**

- `src/lib/alerting/alert-service.ts` - multi-channel alerting
- `src/lib/queue/` - BullMQ integration
- `src/lib/redis.ts` - Redis client
- `/api/community/stream` pattern - SSE implementation
- Data retention system - automatic cleanup
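
To make the rule semantics from MonitoringAlertRules and AlertEvaluator concrete, here is a minimal sketch of how a rule might be checked against collected metrics. It is an illustration under assumptions, not the actual implementation: the nested metrics object, the helper names `resolveMetric` and `conditionMet`, and the extra `lastFiredAt` / `now` parameters on `shouldFireAlert` are all hypothetical (the design only lists `shouldFireAlert(rule, value)`), and deduplication is not covered.

```typescript
type AlertCondition = 'gt' | 'lt' | 'eq' | 'gte' | 'lte';

// Subset of the MonitoringAlertRules fields needed for evaluation.
interface AlertRuleSketch {
  name: string;
  metric: string; // dot-separated path, e.g. "system.memory.usagePercent"
  condition: AlertCondition;
  threshold: number;
  cooldownMinutes: number;
  enabled: boolean;
}

// Walks a dot-separated path; returns undefined unless the leaf is a number.
function resolveMetric(metrics: Record<string, unknown>, path: string): number | undefined {
  let current: unknown = metrics;
  for (const key of path.split('.')) {
    if (typeof current !== 'object' || current === null) return undefined;
    current = (current as Record<string, unknown>)[key];
  }
  return typeof current === 'number' ? current : undefined;
}

function conditionMet(value: number, condition: AlertCondition, threshold: number): boolean {
  switch (condition) {
    case 'gt': return value > threshold;
    case 'lt': return value < threshold;
    case 'eq': return value === threshold;
    case 'gte': return value >= threshold;
    case 'lte': return value <= threshold;
  }
}

// lastFiredAt and now are epoch milliseconds; both parameters are assumptions.
function shouldFireAlert(
  rule: AlertRuleSketch,
  value: number,
  lastFiredAt: number | undefined,
  now: number,
): boolean {
  if (!rule.enabled) return false;
  if (!conditionMet(value, rule.condition, rule.threshold)) return false;
  if (lastFiredAt === undefined) return true; // never fired before
  return now - lastFiredAt >= rule.cooldownMinutes * 60_000; // cooldown elapsed?
}
```

With `cooldownMinutes: 15`, a rule that keeps matching on every 60-second snapshot would fire at most once per 15 minutes.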