cms.c2sgmbh/docs/plans/2026-02-14-monitoring-dashboard-design.md
Martin Porwoll 15bdd66eb6 docs: add monitoring & alerting dashboard design
Event-driven architecture with SSE real-time updates, 5-tab dashboard
(System Health, Services, Performance, Alerts, Logs), 4 new collections,
and integration with existing alert-service.ts.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-14 23:53:25 +00:00


# Monitoring & Alerting Dashboard - Design
**Date:** 2026-02-14
**Status:** Approved
## Goal
A comprehensive monitoring dashboard in the Payload admin panel with:
- System health monitoring (CPU, RAM, disk)
- Service status (DB, Redis, queue, SMTP, OAuth)
- Performance tracking (response times, error rates)
- Configurable alerting (multi-channel: email, Slack, Discord)
- Structured log viewer
- Real-time updates via SSE
## Architecture: Event-Driven with SSE + REST
REST endpoints serve initial and historical data; an SSE stream delivers live updates.
```
┌─────────────────────────────────────────────────────────┐
│ Admin UI: /admin/monitoring                             │
│ ┌──────────┬──────────┬───────────┬────────┬────────┐   │
│ │ System   │ Services │Performance│ Alerts │ Logs   │   │
│ │ Health   │          │           │        │        │   │
│ └────┬─────┴────┬─────┴─────┬─────┴───┬────┴───┬────┘   │
│      │          │           │         │        │        │
│      ▼          ▼           ▼         ▼        ▼        │
│ SSE Stream (/api/monitoring/stream)                     │
│ + REST Endpoints (/api/monitoring/*)                    │
└───────────────────────┬─────────────────────────────────┘
┌───────────────────────▼─────────────────────────────────┐
│ Backend Services                                        │
│ ┌─────────────────┐  ┌──────────────┐  ┌──────────────┐ │
│ │MonitoringService│  │ Performance  │  │ Alert        │ │
│ │(Health, Services│  │ Tracker      │  │ Evaluator    │ │
│ │ OAuth, SMTP)    │  │ (Ring-Buffer)│  │ (Rules DB)   │ │
│ └────────┬────────┘  └──────┬───────┘  └──────┬───────┘ │
│          │                  │                 │         │
│ ┌────────▼──────────────────▼─────────────────▼───────┐ │
│ │ SnapshotCollector (60s interval, queue worker)      │ │
│ └─────────────────────────┬───────────────────────────┘ │
│                           │                             │
│ ┌─────────────────────────▼───────────────────────────┐ │
│ │ MonitoringLogger (Structured Logs → Collection)     │ │
│ └─────────────────────────────────────────────────────┘ │
└───────────────────────┬─────────────────────────────────┘
┌───────────────────────▼─────────────────────────────────┐
│ Collections                                             │
│ MonitoringSnapshots    │ MonitoringAlertRules           │
│ MonitoringAlertHistory │ MonitoringLogs                 │
└─────────────────────────────────────────────────────────┘
```
## Scope: Payload Stack + External Services
**Monitored:**
- Payload CMS process (PM2)
- Queue worker process (PM2)
- PostgreSQL + PgBouncer
- Redis
- SMTP connections
- OAuth token status (Meta, YouTube)
- Cron job health
- BullMQ queues
## Collections (4)
### MonitoringAlertRules
Configurable alert rules, managed in the admin panel.
| Field | Type | Description |
|-------|------|-------------|
| name | text | Rule name |
| metric | text | Metric path (e.g. `system.memory.usagePercent`) |
| condition | select | gt, lt, eq, gte, lte |
| threshold | number | Threshold value |
| severity | select | warning, error, critical |
| channels | select (hasMany) | email, slack, discord |
| recipients.email | array | Email recipients |
| recipients.slackWebhook | text | Slack webhook URL |
| recipients.discordWebhook | text | Discord webhook URL |
| cooldownMinutes | number | Minimum interval between alerts, in minutes (default: 15) |
| enabled | checkbox | Active/inactive |
| tenant | relationship | Optional: tenant-specific |
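A rule combines a dotted `metric` path, a `condition`, and a `threshold`. The following is a minimal sketch of how such a rule could be checked against a metrics object. The helper names `resolveMetric` and `conditionMet` are hypothetical, and the nested shape of the metrics object is an assumption based on the example path above.

```typescript
type Condition = "gt" | "lt" | "eq" | "gte" | "lte";

// Hypothetical helper: walk a dotted path such as "system.memory.usagePercent"
// through a nested object. Returns undefined unless the path ends at a number.
function resolveMetric(metrics: unknown, path: string): number | undefined {
  let current: unknown = metrics;
  for (const key of path.split(".")) {
    if (typeof current !== "object" || current === null) return undefined;
    current = (current as Record<string, unknown>)[key];
  }
  return typeof current === "number" ? current : undefined;
}

// Hypothetical helper: apply one of the five rule conditions.
function conditionMet(condition: Condition, value: number, threshold: number): boolean {
  switch (condition) {
    case "gt":
      return value > threshold;
    case "lt":
      return value < threshold;
    case "eq":
      return value === threshold;
    case "gte":
      return value >= threshold;
    case "lte":
      return value <= threshold;
  }
}
```

A rule whose metric path does not resolve to a number is probably best skipped rather than treated as a breach, which is why `resolveMetric` returns `undefined` instead of a default value.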
### MonitoringAlertHistory
Alert log (WORM: write once, read many).
| Field | Type | Description |
|-------|------|-------------|
| rule | relationship | → MonitoringAlertRules |
| metric | text | Metric path |
| value | number | Measured value |
| threshold | number | Threshold value |
| severity | select | warning, error, critical |
| message | text | Alert message |
| channelsSent | select (hasMany) | Channels the alert was sent to |
| resolvedAt | date | Time of resolution |
| acknowledgedBy | relationship | → Users |
### MonitoringLogs
Structured logs for business events.
| Field | Type | Description |
|-------|------|-------------|
| level | select | debug, info, warn, error, fatal |
| source | select | payload, queue-worker, cron, email, oauth, sync |
| message | text | Log message |
| context | json | Structured metadata |
| requestId | text | Correlation ID |
| userId | relationship | → Users |
| tenant | relationship | → Tenants |
| duration | number | Duration in ms |
### MonitoringSnapshots
Historical system metrics for trend charts.
| Field | Type | Description |
|-------|------|-------------|
| timestamp | date | Timestamp |
| system.cpuUsagePercent | number | CPU usage % |
| system.memoryUsedMB | number | RAM used (MB) |
| system.memoryTotalMB | number | RAM total (MB) |
| system.memoryUsagePercent | number | RAM usage % |
| system.diskUsedGB | number | Disk used (GB) |
| system.diskTotalGB | number | Disk total (GB) |
| system.diskUsagePercent | number | Disk usage % |
| system.loadAvg1 | number | Load average, 1 min |
| system.loadAvg5 | number | Load average, 5 min |
| system.uptime | number | Uptime in seconds |
| services.payload | json | { status, pid, memory, uptime, restarts } |
| services.queueWorker | json | { status, pid, memory, uptime, restarts } |
| services.postgresql | json | { status, connections, poolSize, latency } |
| services.pgbouncer | json | { status, activeConns, waitingClients } |
| services.redis | json | { status, memoryUsed, clients, opsPerSec } |
| external.smtp | json | { status, lastCheck, responseTime } |
| external.metaOAuth | json | { status, tokensExpiring, tokensExpired } |
| external.youtubeOAuth | json | { status, tokensExpiring, tokensExpired } |
| external.cronJobs | json | { lastRuns: { ... } } |
| performance.avgResponseTime | number | Average response time |
| performance.errorRate | number | Error rate % |
| performance.requestsPerMinute | number | Requests per minute |
## API Endpoints
| Method | Endpoint | Description | Auth |
|--------|----------|-------------|------|
| GET | `/api/monitoring/health` | System status (live) | Super-Admin / monitoring |
| GET | `/api/monitoring/services` | Service status | Super-Admin / monitoring |
| GET | `/api/monitoring/performance` | Performance metrics | Super-Admin / monitoring |
| GET | `/api/monitoring/alerts` | Alert history (paginated) | Super-Admin / monitoring |
| POST | `/api/monitoring/alerts/acknowledge` | Acknowledge an alert | Super-Admin |
| GET | `/api/monitoring/logs` | Logs (paginated, filterable) | Super-Admin / monitoring |
| GET | `/api/monitoring/snapshots` | Historical metrics | Super-Admin / monitoring |
| GET | `/api/monitoring/stream` | SSE real-time stream | Super-Admin / monitoring |
### SSE Stream Events
| Event | Interval | Data |
|-------|----------|------|
| `health` | 10s | System metrics (CPU, RAM, disk) |
| `service` | On change | Service status updates |
| `alert` | Immediately | New alerts |
| `log` | Immediately (warn+) | New log entries (level >= warn) |
| `performance` | 30s | Performance metrics |
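The event names in the table map directly onto the `event:` field of the Server-Sent Events wire format. A minimal sketch of the framing and of the `warn+` filter for `log` events could look like this; `formatSseEvent` and `shouldStreamLog` are hypothetical names, and the payload shapes are assumptions.

```typescript
type MonitoringEvent = "health" | "service" | "alert" | "log" | "performance";

// Log levels in ascending severity, as listed for the MonitoringLogs collection.
const LOG_LEVELS = ["debug", "info", "warn", "error", "fatal"] as const;
type LogLevel = (typeof LOG_LEVELS)[number];

// Frame one SSE message: an "event:" line, a "data:" line, then a blank line.
// JSON.stringify never emits raw newlines, so a single data line is enough.
function formatSseEvent(event: MonitoringEvent, data: unknown): string {
  return `event: ${event}\ndata: ${JSON.stringify(data)}\n\n`;
}

// Only entries at or above the minimum level are pushed to the stream.
function shouldStreamLog(level: LogLevel, minimum: LogLevel = "warn"): boolean {
  return LOG_LEVELS.indexOf(level) >= LOG_LEVELS.indexOf(minimum);
}
```

On the client, a browser `EventSource` opened on `/api/monitoring/stream` can subscribe per event name with `addEventListener("health", handler)` and parse `event.data` as JSON.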
## Backend Services
### MonitoringService (`src/lib/monitoring/monitoring-service.ts`)
Central service for metric collection.
```typescript
collectMetrics(): Promise<SystemMetrics>
checkSystemHealth(): SystemHealth // CPU, RAM, disk, uptime via the os module
checkPostgresql(): ServiceStatus // pg_stat_activity, latency test
checkPgBouncer(): ServiceStatus // SHOW POOLS via the PgBouncer admin console
checkRedis(): ServiceStatus // INFO command
checkSmtp(): ServiceStatus // SMTP EHLO check
checkOAuthTokens(): OAuthStatus // check token expiry in SocialAccounts
checkCronJobs(): CronStatus // last execution times
checkQueues(): QueueStatus // BullMQ getJobCounts()
```
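As an illustration of the `os`-based part of `checkSystemHealth()`, the sketch below reads memory, load average, and uptime. The function name `readMemoryAndLoad` and the return shape are assumptions that mirror the `system.*` snapshot fields. Note that the `os` module does not report disk usage, and CPU usage as a percentage has to be derived from two `os.cpus()` samples, so both are left out here.

```typescript
import * as os from "node:os";

interface MemoryAndLoad {
  memoryTotalMB: number;
  memoryUsedMB: number;
  memoryUsagePercent: number;
  loadAvg1: number;
  loadAvg5: number;
  uptime: number; // seconds
}

// Hypothetical helper covering the metrics the os module exposes directly.
function readMemoryAndLoad(): MemoryAndLoad {
  const totalBytes = os.totalmem();
  const usedBytes = totalBytes - os.freemem();
  const load = os.loadavg(); // [1 min, 5 min, 15 min]; all zeros on Windows
  const toMB = (bytes: number): number => Math.round(bytes / 1024 / 1024);
  return {
    memoryTotalMB: toMB(totalBytes),
    memoryUsedMB: toMB(usedBytes),
    memoryUsagePercent: Math.round((usedBytes / totalBytes) * 1000) / 10,
    loadAvg1: load[0] ?? 0,
    loadAvg5: load[1] ?? 0,
    uptime: Math.round(os.uptime()),
  };
}
```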
### PerformanceTracker (`src/lib/monitoring/performance-tracker.ts`)
Lightweight request metrics held in an in-memory ring buffer.
```typescript
trackRequest(method, path, statusCode, duration): void
getMetrics(period: '1h' | '6h' | '24h' | '7d'): PerformanceMetrics
```
Integration: via Payload `beforeOperation` / `afterOperation` hooks.
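A fixed-capacity ring buffer keeps memory bounded by overwriting the oldest sample once it is full. The sketch below shows one possible shape; `RequestRingBuffer` and `summarize` are hypothetical names, counting only 5xx responses as errors is an assumption, and P95 uses the nearest-rank method.

```typescript
interface RequestSample {
  timestamp: number; // epoch milliseconds
  statusCode: number;
  duration: number; // milliseconds
}

class RequestRingBuffer {
  private readonly samples: (RequestSample | undefined)[];
  private readonly capacity: number;
  private next = 0;

  constructor(capacity: number) {
    this.capacity = capacity;
    this.samples = new Array<RequestSample | undefined>(capacity).fill(undefined);
  }

  // Overwrites the oldest slot once the buffer has wrapped around.
  push(sample: RequestSample): void {
    this.samples[this.next] = sample;
    this.next = (this.next + 1) % this.capacity;
  }

  // All retained samples recorded at or after the cutoff timestamp.
  since(cutoff: number): RequestSample[] {
    return this.samples.filter(
      (s): s is RequestSample => s !== undefined && s.timestamp >= cutoff,
    );
  }
}

function summarize(samples: RequestSample[]) {
  if (samples.length === 0) return { avgResponseTime: 0, p95: 0, errorRate: 0 };
  const durations = samples.map((s) => s.duration).sort((a, b) => a - b);
  const avgResponseTime = durations.reduce((a, b) => a + b, 0) / durations.length;
  // Nearest-rank percentile: the value at position ceil(0.95 * n), 1-based.
  const p95 = durations[Math.ceil(0.95 * durations.length) - 1] ?? 0;
  // Assumption: only 5xx responses count as errors.
  const errors = samples.filter((s) => s.statusCode >= 500).length;
  return { avgResponseTime, p95, errorRate: (errors / samples.length) * 100 };
}
```

Because the buffer lives in process memory, a period such as `7d` can only be served if the capacity covers that much traffic and the process has not restarted in the meantime.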
### SnapshotCollector (`src/lib/monitoring/snapshot-collector.ts`)
Periodic metric collection; runs in the queue worker PM2 process.
```typescript
startCollector(): void // setInterval(60_000)
stopCollector(): void // clearInterval, SIGTERM handler
saveSnapshot(metrics): void // → MonitoringSnapshots Collection
```
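One way to structure the start/stop pair is a small factory around `setInterval`, sketched below. `createCollector` is a hypothetical name, and the guard against overlapping runs is an addition that the design does not specify.

```typescript
type CollectFn = () => Promise<void>;

function createCollector(collect: CollectFn, intervalMs = 60_000) {
  let timer: ReturnType<typeof setInterval> | undefined;
  let busy = false;

  // Run one collection cycle; skip it if the previous one is still in flight.
  const tick = async (): Promise<void> => {
    if (busy) return;
    busy = true;
    try {
      await collect();
    } catch (err) {
      // A failed cycle must not stop the interval.
      console.error("snapshot collection failed", err);
    } finally {
      busy = false;
    }
  };

  return {
    // Idempotent: calling start() twice does not create a second timer.
    start(): void {
      if (timer === undefined) timer = setInterval(() => void tick(), intervalMs);
    },
    stop(): void {
      if (timer !== undefined) {
        clearInterval(timer);
        timer = undefined;
      }
    },
    isRunning: (): boolean => timer !== undefined,
    tick,
  };
}
```

In the queue worker, `stop()` would typically be wired to the shutdown signal, for example `process.on("SIGTERM", () => collector.stop())`, so that PM2 restarts do not leave a timer behind.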
### AlertEvaluator (`src/lib/monitoring/alert-evaluator.ts`)
Evaluates metrics against MonitoringAlertRules.
```typescript
evaluateRules(metrics): Promise<Alert[]>
shouldFireAlert(rule, value): boolean // cooldown + deduplication
dispatchAlert(alert): Promise<void> // → existing alert-service.ts
```
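The cooldown check can be illustrated with a per-rule timestamp map. This is a sketch under assumptions: `CooldownTracker` is a hypothetical name, its signature differs from `shouldFireAlert(rule, value)`, and the state is kept in memory, so it is lost on restart and not shared between processes.

```typescript
interface CooldownRule {
  id: string;
  cooldownMinutes: number;
}

class CooldownTracker {
  // Rule id → epoch milliseconds of the last alert that was allowed to fire.
  private readonly lastFired = new Map<string, number>();

  // Returns true and records the time if the rule is outside its cooldown.
  // `now` is injectable so the logic can be tested without real clocks.
  shouldFire(rule: CooldownRule, now: number = Date.now()): boolean {
    const last = this.lastFired.get(rule.id);
    if (last !== undefined && now - last < rule.cooldownMinutes * 60_000) {
      return false; // still cooling down: suppress the duplicate
    }
    this.lastFired.set(rule.id, now);
    return true;
  }
}
```

A persistent variant could instead look up the latest MonitoringAlertHistory entry for the rule, at the cost of one query per evaluation.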
### MonitoringLogger (`src/lib/monitoring/monitoring-logger.ts`)
Structured logger that writes to the MonitoringLogs collection.
```typescript
const logger = createMonitoringLogger('source')
logger.info('message', { context })
logger.warn('message', { context })
logger.error('message', { context, requestId, userId, tenant })
```
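The factory pattern above can be sketched with an injected sink, which keeps the logger testable without a database. `createSinkLogger` and `LogSink` are hypothetical names; in the real service the sink would presumably create a document in the MonitoringLogs collection.

```typescript
type LogLevel = "debug" | "info" | "warn" | "error" | "fatal";
type LogSource = "payload" | "queue-worker" | "cron" | "email" | "oauth" | "sync";

interface LogEntry {
  level: LogLevel;
  source: LogSource;
  message: string;
  context: Record<string, unknown>;
  createdAt: string; // ISO 8601
}

// Assumption: the sink decides where entries go (collection, console, test array).
type LogSink = (entry: LogEntry) => void;

function createSinkLogger(source: LogSource, sink: LogSink) {
  // Build one logging function per level, all bound to the same source.
  const at =
    (level: LogLevel) =>
    (message: string, context: Record<string, unknown> = {}): void =>
      sink({ level, source, message, context, createdAt: new Date().toISOString() });

  return {
    debug: at("debug"),
    info: at("info"),
    warn: at("warn"),
    error: at("error"),
    fatal: at("fatal"),
  };
}
```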
## Dashboard UI
Admin view: `/admin/monitoring` with 5 tabs.
### Tab 1: System Health
- Gauge widgets: CPU, RAM, disk, uptime (color-coded green/yellow/red)
- Trend charts: CPU 24h, memory 24h, load average (from MonitoringSnapshots)
- Live updates via SSE `health` events
### Tab 2: Services
- Service cards with status badge (Online/Warning/Offline)
- Details: PID, memory, uptime, restarts (PM2 processes)
- PostgreSQL: connections, pool, latency
- Redis: memory, clients, ops/s
- OAuth: token status with expiry warning
- SMTP: last check, response time
- Cron: last execution times
- Live updates via SSE `service` events
### Tab 3: Performance
- Response time chart (avg, P95, P99)
- Error rate chart
- Requests per minute chart
- Time range filter (1h, 6h, 24h, 7d)
### Tab 4: Alerts
- Active/unacknowledged alerts (at the top, highlighted)
- Alert history table (filterable by severity and time range)
- Acknowledge button per alert
- Alert rules CRUD (MonitoringAlertRules)
- New alerts via SSE `alert` events
### Tab 5: Logs
- Log table (level, source, message, timestamp)
- Filters: level, source, time range, full-text search
- Expandable JSON context per entry
- Auto-scroll for new warn+ entries (via SSE `log` events)
## Access Control
| Action | Super-Admin | monitoring role |
|--------|-------------|-----------------|
| View dashboard | Yes | Yes |
| Edit alert rules | Yes | No |
| Acknowledge alerts | Yes | No |
| View logs | Yes | Yes |
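The matrix above can be encoded as a lookup table so that route handlers and UI components share one source of truth. The role and action identifiers below are illustrative assumptions, not names taken from the existing codebase.

```typescript
type Role = "super-admin" | "monitoring";
type Action = "viewDashboard" | "editAlertRules" | "acknowledgeAlerts" | "viewLogs";

// One entry per row of the access-control table.
const PERMISSIONS: Record<Action, readonly Role[]> = {
  viewDashboard: ["super-admin", "monitoring"],
  editAlertRules: ["super-admin"],
  acknowledgeAlerts: ["super-admin"],
  viewLogs: ["super-admin", "monitoring"],
};

function isAllowed(role: Role, action: Action): boolean {
  return PERMISSIONS[action].includes(role);
}
```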
## Data Retention
| Collection | Retention | Env variable |
|------------|-----------|--------------|
| monitoring-snapshots | 7 days | `RETENTION_MONITORING_SNAPSHOTS_DAYS` |
| monitoring-alert-history | 90 days | `RETENTION_MONITORING_ALERTS_DAYS` |
| monitoring-logs | 30 days | `RETENTION_MONITORING_LOGS_DAYS` |
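Each retention period is a default that the listed environment variable can override. A minimal sketch of that lookup and of the resulting cutoff date follows; `retentionDays` and `retentionCutoff` are hypothetical helpers, and how the existing data retention system actually reads its configuration is not specified here.

```typescript
// Parse an env value as a positive whole number of days, else use the default.
function retentionDays(envValue: string | undefined, fallback: number): number {
  const parsed = Number.parseInt(envValue ?? "", 10);
  return Number.isFinite(parsed) && parsed > 0 ? parsed : fallback;
}

// Documents created before the returned date are eligible for cleanup.
function retentionCutoff(days: number, now: Date = new Date()): Date {
  return new Date(now.getTime() - days * 24 * 60 * 60 * 1000);
}
```

For snapshots this would be called as `retentionDays(process.env.RETENTION_MONITORING_SNAPSHOTS_DAYS, 7)`.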
## File Structure
```
src/
├── collections/
│   ├── MonitoringAlertRules.ts
│   ├── MonitoringAlertHistory.ts
│   ├── MonitoringLogs.ts
│   └── MonitoringSnapshots.ts
├── lib/monitoring/
│   ├── monitoring-service.ts
│   ├── performance-tracker.ts
│   ├── snapshot-collector.ts
│   ├── alert-evaluator.ts
│   ├── monitoring-logger.ts
│   └── types.ts
├── app/(payload)/api/monitoring/
│   ├── health/route.ts
│   ├── services/route.ts
│   ├── performance/route.ts
│   ├── alerts/route.ts
│   ├── alerts/acknowledge/route.ts
│   ├── logs/route.ts
│   ├── snapshots/route.ts
│   └── stream/route.ts
├── components/admin/
│   ├── MonitoringDashboard.tsx
│   ├── MonitoringDashboard.scss
│   ├── MonitoringNavLinks.tsx
│   └── monitoring/
│       ├── SystemHealthTab.tsx
│       ├── ServicesTab.tsx
│       ├── PerformanceTab.tsx
│       ├── AlertsTab.tsx
│       ├── LogsTab.tsx
│       ├── GaugeWidget.tsx
│       ├── TrendChart.tsx
│       ├── StatusBadge.tsx
│       └── LogTable.tsx
```
## Dependencies
**New npm packages:** None. Uses only the Node.js `os` module and existing DB connections.
**Existing infrastructure that is reused:**
- `src/lib/alerting/alert-service.ts` - multi-channel alerting
- `src/lib/queue/` - BullMQ integration
- `src/lib/redis.ts` - Redis client
- `/api/community/stream` pattern - SSE implementation
- Data retention system - automatic cleanup