From 15bdd66eb69dd3d0a6d983a46c7cb510ea1e82b7 Mon Sep 17 00:00:00 2001
From: Martin Porwoll
Date: Sat, 14 Feb 2026 23:53:25 +0000
Subject: [PATCH] docs: add monitoring & alerting dashboard design

Event-driven architecture with SSE real-time updates, 5-tab dashboard
(System Health, Services, Performance, Alerts, Logs), 4 new collections,
and integration with existing alert-service.ts.

Co-Authored-By: Claude Opus 4.6
---
 .../2026-02-14-monitoring-dashboard-design.md | 339 ++++++++++++++++++
 1 file changed, 339 insertions(+)
 create mode 100644 docs/plans/2026-02-14-monitoring-dashboard-design.md

diff --git a/docs/plans/2026-02-14-monitoring-dashboard-design.md b/docs/plans/2026-02-14-monitoring-dashboard-design.md
new file mode 100644
index 0000000..45cdbb1
--- /dev/null
+++ b/docs/plans/2026-02-14-monitoring-dashboard-design.md
@@ -0,0 +1,339 @@
+# Monitoring & Alerting Dashboard - Design
+
+**Date:** 2026-02-14
+**Status:** Approved
+
+## Goal
+
+Comprehensive monitoring dashboard in the Payload admin panel with:
+- System health monitoring (CPU, RAM, disk)
+- Service status (DB, Redis, queue, SMTP, OAuth)
+- Performance tracking (response times, error rates)
+- Configurable alerting (multi-channel: email, Slack, Discord)
+- Structured log viewer
+- Real-time updates via SSE
+
+## Architecture: Event-Driven with SSE + REST
+
+REST endpoints serve initial and historical data; an SSE stream delivers live updates.
+
+```
+┌─────────────────────────────────────────────────────────┐
+│  Admin UI: /admin/monitoring                            │
+│  ┌──────────┬──────────┬───────────┬────────┬────────┐  │
+│  │ System   │ Services │Performance│ Alerts │ Logs   │  │
+│  │ Health   │          │           │        │        │  │
+│  └────┬─────┴────┬─────┴─────┬─────┴───┬────┴───┬────┘  │
+│       │          │           │         │        │       │
+│       ▼          ▼           ▼         ▼        ▼       │
+│         SSE Stream (/api/monitoring/stream)             │
+│         + REST Endpoints (/api/monitoring/*)            │
+└───────────────────────┬─────────────────────────────────┘
+                        │
+┌───────────────────────▼─────────────────────────────────┐
+│  Backend Services                                       │
+│  ┌─────────────────┐ ┌──────────────┐ ┌──────────────┐  │
+│  │MonitoringService│ │ Performance  │ │ Alert        │  │
+│  │(Health, Services│ │ Tracker      │ │ Evaluator    │  │
+│  │ OAuth, SMTP)    │ │ (Ring-Buffer)│ │ (Rules DB)   │  │
+│  └───────┬─────────┘ └──────┬───────┘ └──────┬───────┘  │
+│          │                  │                │          │
+│ ┌────────▼──────────────────▼────────────────▼────────┐ │
+│ │ SnapshotCollector (60s interval in queue worker)    │ │
+│ └─────────────────────────┬───────────────────────────┘ │
+│                           │                             │
+│ ┌─────────────────────────▼───────────────────────────┐ │
+│ │ MonitoringLogger (Structured Logs → Collection)     │ │
+│ └─────────────────────────────────────────────────────┘ │
+└─────────────────────────────────────────────────────────┘
+                        │
+┌───────────────────────▼─────────────────────────────────┐
+│  Collections                                            │
+│  MonitoringSnapshots     │  MonitoringAlertRules        │
+│  MonitoringAlertHistory  │  MonitoringLogs              │
+└─────────────────────────────────────────────────────────┘
+```
+
+## Scope: Payload Stack + External Services
+
+**Monitored:**
+- Payload CMS process (PM2)
+- Queue worker process (PM2)
+- PostgreSQL + PgBouncer
+- Redis
+- SMTP connections
+- OAuth token status (Meta, YouTube)
+- Cron job health
+- BullMQ queues
+
+## Collections (4)
+
+### MonitoringAlertRules
+
+Alert rules, configurable in the admin panel.
+
+| Field | Type | Description |
+|------|-----|--------------|
+| name | text | Rule name |
+| metric | text | Metric path (e.g. `system.memory.usagePercent`) |
+| condition | select | gt, lt, eq, gte, lte |
+| threshold | number | Threshold value |
+| severity | select | warning, error, critical |
+| channels | select (hasMany) | email, slack, discord |
+| recipients.email | array | Email recipients |
+| recipients.slackWebhook | text | Slack webhook URL |
+| recipients.discordWebhook | text | Discord webhook URL |
+| cooldownMinutes | number | Minimum interval between alerts in minutes (default: 15) |
+| enabled | checkbox | Active/inactive |
+| tenant | relationship | Optional: tenant-specific |
+
+### MonitoringAlertHistory
+
+Alert log (WORM: write once, read many).
+
+| Field | Type | Description |
+|------|-----|--------------|
+| rule | relationship | → MonitoringAlertRules |
+| metric | text | Metric path |
+| value | number | Current value |
+| threshold | number | Threshold value |
+| severity | select | warning, error, critical |
+| message | text | Alert message |
+| channelsSent | select (hasMany) | Channels the alert was sent to |
+| resolvedAt | date | Time of resolution |
+| acknowledgedBy | relationship | → Users |
+
+### MonitoringLogs
+
+Structured logs for business events.
+
+| Field | Type | Description |
+|------|-----|--------------|
+| level | select | debug, info, warn, error, fatal |
+| source | select | payload, queue-worker, cron, email, oauth, sync |
+| message | text | Log message |
+| context | json | Structured metadata |
+| requestId | text | Correlation ID |
+| userId | relationship | → Users |
+| tenant | relationship | → Tenants |
+| duration | number | Duration in ms |
+
+### MonitoringSnapshots
+
+Historical system metrics for trend charts.
+
+| Field | Type | Description |
+|------|-----|--------------|
+| timestamp | date | Timestamp |
+| system.cpuUsagePercent | number | CPU usage % |
+| system.memoryUsedMB | number | RAM used (MB) |
+| system.memoryTotalMB | number | RAM total (MB) |
+| system.memoryUsagePercent | number | RAM usage % |
+| system.diskUsedGB | number | Disk used (GB) |
+| system.diskTotalGB | number | Disk total (GB) |
+| system.diskUsagePercent | number | Disk usage % |
+| system.loadAvg1 | number | Load average, 1 min |
+| system.loadAvg5 | number | Load average, 5 min |
+| system.uptime | number | Uptime in seconds |
+| services.payload | json | { status, pid, memory, uptime, restarts } |
+| services.queueWorker | json | { status, pid, memory, uptime, restarts } |
+| services.postgresql | json | { status, connections, poolSize, latency } |
+| services.pgbouncer | json | { status, activeConns, waitingClients } |
+| services.redis | json | { status, memoryUsed, clients, opsPerSec } |
+| external.smtp | json | { status, lastCheck, responseTime } |
+| external.metaOAuth | json | { status, tokensExpiring, tokensExpired } |
+| external.youtubeOAuth | json | { status, tokensExpiring, tokensExpired } |
+| external.cronJobs | json | { lastRuns: { ... } } |
+| performance.avgResponseTime | number | Average response time |
+| performance.errorRate | number | Error rate % |
+| performance.requestsPerMinute | number | Requests per minute |
+
+## API Endpoints
+
+| Method | Endpoint | Description | Auth |
+|---------|----------|--------------|------|
+| GET | `/api/monitoring/health` | System status (live) | Super-Admin / monitoring |
+| GET | `/api/monitoring/services` | Service status | Super-Admin / monitoring |
+| GET | `/api/monitoring/performance` | Performance metrics | Super-Admin / monitoring |
+| GET | `/api/monitoring/alerts` | Alert history (paginated) | Super-Admin / monitoring |
+| POST | `/api/monitoring/alerts/acknowledge` | Acknowledge an alert | Super-Admin |
+| GET | `/api/monitoring/logs` | Logs (paginated, filterable) | Super-Admin / monitoring |
+| GET | `/api/monitoring/snapshots` | Historical metrics | Super-Admin / monitoring |
+| GET | `/api/monitoring/stream` | SSE real-time stream | Super-Admin / monitoring |
+
+### SSE Stream Events
+
+| Event | Interval | Data |
+|-------|-----------|-------|
+| `health` | 10s | System metrics (CPU, RAM, disk) |
+| `service` | On change | Service status updates |
+| `alert` | Immediately | New alerts |
+| `log` | Immediately (warn+) | New log entries (level >= warn) |
+| `performance` | 30s | Performance metrics |
+
+## Backend Services
+
+### MonitoringService (`src/lib/monitoring/monitoring-service.ts`)
+
+Central service for metric collection.
+
+```typescript
+collectMetrics(): Promise
+checkSystemHealth(): SystemHealth    // CPU, RAM, disk, uptime via os module
+checkPostgresql(): ServiceStatus     // pg_stat_activity, latency test
+checkPgBouncer(): ServiceStatus      // SHOW POOLS via PgBouncer admin
+checkRedis(): ServiceStatus          // INFO command
+checkSmtp(): ServiceStatus           // SMTP EHLO check
+checkOAuthTokens(): OAuthStatus      // check SocialAccounts token expiry
+checkCronJobs(): CronStatus          // last execution times
+checkQueues(): QueueStatus           // BullMQ getJobCounts()
+```
+
+### PerformanceTracker (`src/lib/monitoring/performance-tracker.ts`)
+
+Lightweight request metrics kept in an in-memory ring buffer.
+
+```typescript
+trackRequest(method, path, statusCode, duration): void
+getMetrics(period: '1h' | '6h' | '24h' | '7d'): PerformanceMetrics
+```
+
+Integration: Payload `beforeOperation` / `afterOperation` hooks.
+
+### SnapshotCollector (`src/lib/monitoring/snapshot-collector.ts`)
+
+Periodic metric collection; runs in the queue-worker PM2 process.
+
+```typescript
+startCollector(): void          // setInterval(60_000)
+stopCollector(): void           // clearInterval, SIGTERM handler
+saveSnapshot(metrics): void     // → MonitoringSnapshots collection
+```
+
+### AlertEvaluator (`src/lib/monitoring/alert-evaluator.ts`)
+
+Checks metrics against MonitoringAlertRules.
+
+```typescript
+evaluateRules(metrics): Promise
+shouldFireAlert(rule, value): boolean   // cooldown + deduplication
+dispatchAlert(alert): Promise           // → existing alert-service.ts
+```
+
+### MonitoringLogger (`src/lib/monitoring/monitoring-logger.ts`)
+
+Structured logger that writes to the MonitoringLogs collection.
+
+```typescript
+const logger = createMonitoringLogger('source')
+logger.info('message', { context })
+logger.warn('message', { context })
+logger.error('message', { context, requestId, userId, tenant })
+```
+
+## Dashboard UI
+
+Admin view: `/admin/monitoring` with 5 tabs.
+
+### Tab 1: System Health
+- Gauge widgets: CPU, RAM, disk, uptime (color-coded green/yellow/red)
+- Trend charts: CPU 24h, memory 24h, load average (from MonitoringSnapshots)
+- Live updates via SSE `health` events
+
+### Tab 2: Services
+- Service cards with status badge (Online/Warning/Offline)
+- Details: PID, memory, uptime, restarts (PM2 processes)
+- PostgreSQL: connections, pool, latency
+- Redis: memory, clients, ops/s
+- OAuth: token status with expiry warning
+- SMTP: last check, response time
+- Cron: last execution times
+- Live updates via SSE `service` events
+
+### Tab 3: Performance
+- Response-time chart (avg, P95, P99)
+- Error-rate chart
+- Requests/minute chart
+- Time-range filter (1h, 6h, 24h, 7d)
+
+### Tab 4: Alerts
+- Active/unacknowledged alerts (at the top, highlighted)
+- Alert history table (filterable by severity and time range)
+- Acknowledge button per alert
+- Alert rules CRUD (MonitoringAlertRules)
+- New alerts via SSE `alert` events
+
+### Tab 5: Logs
+- Log table (level, source, message, timestamp)
+- Filters: level, source, time range, full-text search
+- Expandable JSON context per entry
+- Auto-scroll for new warn+ entries (via SSE `log` events)
+
+## Access Control
+
+| Action | Super-Admin | monitoring role |
+|--------|-------------|------------------|
+| View dashboard | Yes | Yes |
+| Edit alert rules | Yes | No |
+| Acknowledge alerts | Yes | No |
+| View logs | Yes | Yes |
+
+## Data Retention
+
+| Collection | Retention | Env variable |
+|------------|-----------|-------------|
+| monitoring-snapshots | 7 days | `RETENTION_MONITORING_SNAPSHOTS_DAYS` |
+| monitoring-alert-history | 90 days | `RETENTION_MONITORING_ALERTS_DAYS` |
+| monitoring-logs | 30 days | `RETENTION_MONITORING_LOGS_DAYS` |
+
+## File Structure
+
+```
+src/
+├── collections/
+│   ├── MonitoringAlertRules.ts
+│   ├── MonitoringAlertHistory.ts
+│   ├── MonitoringLogs.ts
+│   └── MonitoringSnapshots.ts
+├── lib/monitoring/
+│   ├── monitoring-service.ts
+│   ├── performance-tracker.ts
+│   ├── snapshot-collector.ts
+│   ├── alert-evaluator.ts
+│   ├── monitoring-logger.ts
+│   └── types.ts
+├── app/(payload)/api/monitoring/
+│   ├── health/route.ts
+│   ├── services/route.ts
+│   ├── performance/route.ts
+│   ├── alerts/route.ts
+│   ├── alerts/acknowledge/route.ts
+│   ├── logs/route.ts
+│   ├── snapshots/route.ts
+│   └── stream/route.ts
+├── components/admin/
+│   ├── MonitoringDashboard.tsx
+│   ├── MonitoringDashboard.scss
+│   ├── MonitoringNavLinks.tsx
+│   └── monitoring/
+│       ├── SystemHealthTab.tsx
+│       ├── ServicesTab.tsx
+│       ├── PerformanceTab.tsx
+│       ├── AlertsTab.tsx
+│       ├── LogsTab.tsx
+│       ├── GaugeWidget.tsx
+│       ├── TrendChart.tsx
+│       ├── StatusBadge.tsx
+│       └── LogTable.tsx
+```
+
+## Dependencies
+
+**New npm packages:** None - uses only the Node.js `os` module and existing DB connections.
+
+**Existing infrastructure used:**
+- `src/lib/alerting/alert-service.ts` - multi-channel alerting
+- `src/lib/queue/` - BullMQ integration
+- `src/lib/redis.ts` - Redis client
+- `/api/community/stream` pattern - SSE implementation
+- Data retention system - automatic cleanup
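Review note on the AlertEvaluator section: the design names `shouldFireAlert(rule, value)` with "cooldown + deduplication" but does not spell out the logic. The following is a minimal sketch of how a rule could be checked against a snapshot, assuming the metric path follows the MonitoringSnapshots field names and that the last firing time per rule is kept in an in-memory map. The names `AlertRule`, `resolveMetric`, `conditionMet`, and `shouldFire` are hypothetical and not part of the patch.

```typescript
// Hypothetical shape mirroring a subset of the MonitoringAlertRules fields.
type Condition = "gt" | "lt" | "eq" | "gte" | "lte";

interface AlertRule {
  name: string;
  metric: string; // dot path into a snapshot, e.g. "system.memoryUsagePercent"
  condition: Condition;
  threshold: number;
  cooldownMinutes: number;
  enabled: boolean;
}

// Walk a dot-separated path through a nested object; return the value only if it is a number.
function resolveMetric(snapshot: unknown, path: string): number | undefined {
  let current: unknown = snapshot;
  for (const key of path.split(".")) {
    if (typeof current !== "object" || current === null) return undefined;
    current = (current as Record<string, unknown>)[key];
  }
  return typeof current === "number" ? current : undefined;
}

// Apply the rule's comparison operator to the measured value.
function conditionMet(condition: Condition, value: number, threshold: number): boolean {
  switch (condition) {
    case "gt": return value > threshold;
    case "lt": return value < threshold;
    case "eq": return value === threshold;
    case "gte": return value >= threshold;
    case "lte": return value <= threshold;
  }
}

// Assumption: lastFiredAt maps rule name -> epoch ms of the last alert sent.
// Returns true (and records the firing time) only if the rule is enabled,
// the condition holds, and the cooldown window has elapsed.
function shouldFire(
  rule: AlertRule,
  snapshot: unknown,
  lastFiredAt: Map<string, number>,
  now: number,
): boolean {
  if (!rule.enabled) return false;
  const value = resolveMetric(snapshot, rule.metric);
  if (value === undefined) return false;
  if (!conditionMet(rule.condition, value, rule.threshold)) return false;
  const last = lastFiredAt.get(rule.name);
  if (last !== undefined && now - last < rule.cooldownMinutes * 60_000) return false;
  lastFiredAt.set(rule.name, now);
  return true;
}
```

In the collector loop this would be called once per enabled rule after each snapshot, passing `Date.now()` as `now`. If the evaluator runs in more than one process, the cooldown state would more likely be derived from MonitoringAlertHistory than from an in-memory map.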