Event-driven architecture with SSE real-time updates, 5-tab dashboard (System Health, Services, Performance, Alerts, Logs), 4 new collections, and integration with existing alert-service.ts. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Monitoring & Alerting Dashboard - Design
Date: 2026-02-14 Status: Approved
Goal
Comprehensive monitoring dashboard in the Payload admin panel with:
- System health monitoring (CPU, RAM, disk)
- Service status (DB, Redis, queue, SMTP, OAuth)
- Performance tracking (response times, error rates)
- Configurable alerting (multi-channel: email, Slack, Discord)
- Structured log viewer
- Real-time updates via SSE
Architecture: Event-Driven with SSE + REST
REST endpoints serve initial and historical data; an SSE stream delivers live updates.
┌─────────────────────────────────────────────────────────┐
│ Admin UI: /admin/monitoring │
│ ┌──────────┬──────────┬───────────┬────────┬────────┐ │
│ │ System │ Services │Performance│ Alerts │ Logs │ │
│ │ Health │ │ │ │ │ │
│ └────┬─────┴────┬─────┴─────┬─────┴───┬────┴───┬────┘ │
│ │ │ │ │ │ │
│ ▼ ▼ ▼ ▼ ▼ │
│ SSE Stream (/api/monitoring/stream) │
│ + REST Endpoints (/api/monitoring/*) │
└───────────────────────┬─────────────────────────────────┘
│
┌───────────────────────▼─────────────────────────────────┐
│ Backend Services │
│ ┌────────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │MonitoringService│ │ Performance │ │ Alert │ │
│ │(Health, Services│ │ Tracker │ │ Evaluator │ │
│ │ OAuth, SMTP) │ │ (Ring-Buffer)│ │ (Rules DB) │ │
│ └───────┬────────┘ └──────┬───────┘ └──────┬───────┘ │
│ │ │ │ │
│ ┌───────▼──────────────────▼──────────────────▼───────┐ │
│ │ SnapshotCollector (60s interval in queue worker) │ │
│ └─────────────────────────┬───────────────────────────┘ │
│ │ │
│ ┌─────────────────────────▼───────────────────────────┐ │
│ │ MonitoringLogger (Structured Logs → Collection) │ │
│ └─────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────┘
│
┌───────────────────────▼─────────────────────────────────┐
│ Collections │
│ MonitoringSnapshots │ MonitoringAlertRules │
│ MonitoringAlertHistory │ MonitoringLogs │
└─────────────────────────────────────────────────────────┘
Scope: Payload Stack + External Services
Monitored:
- Payload CMS process (PM2)
- Queue worker process (PM2)
- PostgreSQL + PgBouncer
- Redis
- SMTP connections
- OAuth token status (Meta, YouTube)
- Cron job health
- BullMQ queues
Collections (4)
MonitoringAlertRules
Configurable alert rules in the admin panel.
| Field | Type | Description |
|---|---|---|
| name | text | Rule name |
| metric | text | Metric path (e.g. system.memory.usagePercent) |
| condition | select | gt, lt, eq, gte, lte |
| threshold | number | Threshold value |
| severity | select | warning, error, critical |
| channels | select (hasMany) | email, slack, discord |
| recipients.email | array | Email recipients |
| recipients.slackWebhook | text | Slack webhook URL |
| recipients.discordWebhook | text | Discord webhook URL |
| cooldownMinutes | number | Minimum interval between alerts in minutes (default: 15) |
| enabled | checkbox | Active/inactive |
| tenant | relationship | Optional: tenant-specific |
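How a rule's metric path and condition could be applied to a metrics object is sketched below. This is a minimal illustration, not the planned implementation: `resolveMetric`, `matches` and the `AlertRule` type are hypothetical names, and the nested shape of the metrics object is an assumption derived from the example path system.memory.usagePercent.

```typescript
// Hypothetical subset of a MonitoringAlertRules document.
type Condition = "gt" | "lt" | "eq" | "gte" | "lte";

interface AlertRule {
  name: string;
  metric: string; // dot-separated path, e.g. "system.memory.usagePercent"
  condition: Condition;
  threshold: number;
}

// Walk a dot-separated path through a nested metrics object.
// Returns undefined if a segment is missing or the leaf is not a number.
function resolveMetric(metrics: unknown, path: string): number | undefined {
  let current: unknown = metrics;
  for (const key of path.split(".")) {
    if (typeof current !== "object" || current === null) return undefined;
    current = (current as Record<string, unknown>)[key];
  }
  return typeof current === "number" ? current : undefined;
}

// Apply the rule's comparison operator to a resolved value.
function matches(rule: AlertRule, value: number): boolean {
  switch (rule.condition) {
    case "gt":
      return value > rule.threshold;
    case "lt":
      return value < rule.threshold;
    case "eq":
      return value === rule.threshold;
    case "gte":
      return value >= rule.threshold;
    case "lte":
      return value <= rule.threshold;
  }
}
```

A rule whose metric cannot be resolved would simply be skipped in this sketch; whether that case should raise its own alert is left open.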
MonitoringAlertHistory
Alert log (WORM, write once).
| Field | Type | Description |
|---|---|---|
| rule | relationship | → MonitoringAlertRules |
| metric | text | Metric path |
| value | number | Current value |
| threshold | number | Threshold value |
| severity | select | warning, error, critical |
| message | text | Alert message |
| channelsSent | select (hasMany) | Channels the alert was sent to |
| resolvedAt | date | Time of resolution |
| acknowledgedBy | relationship | → Users |
MonitoringLogs
Structured logs for business events.
| Field | Type | Description |
|---|---|---|
| level | select | debug, info, warn, error, fatal |
| source | select | payload, queue-worker, cron, email, oauth, sync |
| message | text | Log message |
| context | json | Structured metadata |
| requestId | text | Correlation ID |
| userId | relationship | → Users |
| tenant | relationship | → Tenants |
| duration | number | Duration in ms |
MonitoringSnapshots
Historical system metrics for trend charts.
| Field | Type | Description |
|---|---|---|
| timestamp | date | Timestamp |
| system.cpuUsagePercent | number | CPU usage % |
| system.memoryUsedMB | number | RAM used (MB) |
| system.memoryTotalMB | number | RAM total (MB) |
| system.memoryUsagePercent | number | RAM usage % |
| system.diskUsedGB | number | Disk used (GB) |
| system.diskTotalGB | number | Disk total (GB) |
| system.diskUsagePercent | number | Disk usage % |
| system.loadAvg1 | number | Load average, 1 min |
| system.loadAvg5 | number | Load average, 5 min |
| system.uptime | number | Uptime in seconds |
| services.payload | json | { status, pid, memory, uptime, restarts } |
| services.queueWorker | json | { status, pid, memory, uptime, restarts } |
| services.postgresql | json | { status, connections, poolSize, latency } |
| services.pgbouncer | json | { status, activeConns, waitingClients } |
| services.redis | json | { status, memoryUsed, clients, opsPerSec } |
| external.smtp | json | { status, lastCheck, responseTime } |
| external.metaOAuth | json | { status, tokensExpiring, tokensExpired } |
| external.youtubeOAuth | json | { status, tokensExpiring, tokensExpired } |
| external.cronJobs | json | { lastRuns: { ... } } |
| performance.avgResponseTime | number | Average response time |
| performance.errorRate | number | Error rate % |
| performance.requestsPerMinute | number | Requests per minute |
API Endpoints
| Method | Endpoint | Description | Auth |
|---|---|---|---|
| GET | /api/monitoring/health | System status (live) | Super-Admin / monitoring |
| GET | /api/monitoring/services | Service status | Super-Admin / monitoring |
| GET | /api/monitoring/performance | Performance metrics | Super-Admin / monitoring |
| GET | /api/monitoring/alerts | Alert history (paginated) | Super-Admin / monitoring |
| POST | /api/monitoring/alerts/acknowledge | Acknowledge alert | Super-Admin |
| GET | /api/monitoring/logs | Logs (paginated, filterable) | Super-Admin / monitoring |
| GET | /api/monitoring/snapshots | Historical metrics | Super-Admin / monitoring |
| GET | /api/monitoring/stream | SSE real-time stream | Super-Admin / monitoring |
SSE Stream Events
| Event | Interval | Data |
|---|---|---|
| health | 10s | System metrics (CPU, RAM, disk) |
| service | On change | Service status updates |
| alert | Immediately | New alerts |
| log | Immediately (warn+) | New log entries (level >= warn) |
| performance | 30s | Performance metrics |
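On the wire, each of these events is a named SSE message: an `event:` line, a `data:` line and a blank line. The helper below is a small sketch of that framing; `formatSseEvent` and the payload shape are hypothetical and not taken from the existing /api/community/stream implementation.

```typescript
// Event names from the table above.
type MonitoringEvent = "health" | "service" | "alert" | "log" | "performance";

// Serialize one named SSE message. JSON.stringify without an indent argument
// never emits raw newlines, so a single "data:" line is enough.
function formatSseEvent(event: MonitoringEvent, data: unknown): string {
  return `event: ${event}\ndata: ${JSON.stringify(data)}\n\n`;
}

// Example frame for a health update (field name is illustrative only).
const frame = formatSseEvent("health", { cpuUsagePercent: 12 });
```

On the client, a listener registered with `EventSource.addEventListener("health", ...)` receives only messages carrying that event name, which is what lets each tab subscribe to its own event type.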
Backend Services
MonitoringService (src/lib/monitoring/monitoring-service.ts)
Central service for collecting metrics.
collectMetrics(): Promise<SystemMetrics>
checkSystemHealth(): SystemHealth // CPU, RAM, disk, uptime via os module
checkPostgresql(): ServiceStatus // pg_stat_activity, latency test
checkPgBouncer(): ServiceStatus // SHOW POOLS via PgBouncer admin
checkRedis(): ServiceStatus // INFO command
checkSmtp(): ServiceStatus // SMTP EHLO check
checkOAuthTokens(): OAuthStatus // check SocialAccounts token expiry
checkCronJobs(): CronStatus // last execution times
checkQueues(): QueueStatus // BullMQ getJobCounts()
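As an illustration of the os-based part of checkSystemHealth, the sketch below reads memory, load average and uptime from the Node.js os module. The function name `sampleSystemHealth` and the return type are assumptions; field names follow the MonitoringSnapshots table. CPU usage percent and disk usage are left out on purpose: the former needs two samples of `os.cpus()` times, and the os module does not expose disk usage, so that value would have to come from another source.

```typescript
import * as os from "node:os";

// Hypothetical subset of the SystemHealth type.
interface SystemHealthSample {
  memoryTotalMB: number;
  memoryUsedMB: number;
  memoryUsagePercent: number;
  loadAvg1: number;
  loadAvg5: number;
  uptime: number; // seconds
}

function sampleSystemHealth(): SystemHealthSample {
  const totalBytes = os.totalmem();
  const usedBytes = totalBytes - os.freemem();
  const load = os.loadavg(); // [1 min, 5 min, 15 min]; all zeros on Windows
  const toMB = (bytes: number): number => Math.round(bytes / (1024 * 1024));

  return {
    memoryTotalMB: toMB(totalBytes),
    memoryUsedMB: toMB(usedBytes),
    memoryUsagePercent: Math.round((usedBytes / totalBytes) * 1000) / 10,
    loadAvg1: load[0] ?? 0,
    loadAvg5: load[1] ?? 0,
    uptime: Math.floor(os.uptime()),
  };
}
```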
PerformanceTracker (src/lib/monitoring/performance-tracker.ts)
Lightweight request metrics in an in-memory ring buffer.
trackRequest(method, path, statusCode, duration): void
getMetrics(period: '1h' | '6h' | '24h' | '7d'): PerformanceMetrics
Integration: Payload beforeOperation / afterOperation hooks.
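A minimal sketch of the ring-buffer idea follows. It is not the planned PerformanceTracker: the class name, the capacity parameter, counting status codes >= 500 as errors, and the nearest-rank P95 are all assumptions made for illustration.

```typescript
interface RequestSample {
  timestamp: number; // ms since epoch
  statusCode: number;
  duration: number; // ms
}

class RequestRingBuffer {
  private readonly samples: RequestSample[] = [];
  private readonly capacity: number;
  private next = 0; // slot that the next sample is written to

  constructor(capacity: number) {
    this.capacity = capacity;
  }

  track(statusCode: number, duration: number, timestamp: number = Date.now()): void {
    const sample: RequestSample = { timestamp, statusCode, duration };
    if (this.samples.length < this.capacity) {
      this.samples.push(sample);
    } else {
      this.samples[this.next] = sample; // buffer full: overwrite the oldest entry
    }
    this.next = (this.next + 1) % this.capacity;
  }

  // Aggregate all samples that fall inside the last `windowMs` milliseconds.
  metrics(windowMs: number, now: number = Date.now()) {
    const recent = this.samples.filter((s) => now - s.timestamp <= windowMs);
    if (recent.length === 0) {
      return { count: 0, avgResponseTime: 0, errorRate: 0, p95: 0 };
    }
    const durations = recent.map((s) => s.duration).sort((a, b) => a - b);
    const total = durations.reduce((sum, d) => sum + d, 0);
    const errors = recent.filter((s) => s.statusCode >= 500).length;
    const p95Index = Math.ceil(durations.length * 0.95) - 1; // nearest-rank method
    return {
      count: recent.length,
      avgResponseTime: total / recent.length,
      errorRate: (errors / recent.length) * 100,
      p95: durations[p95Index] ?? 0,
    };
  }
}
```

A fixed-capacity buffer only reaches as far back as its capacity allows, so longer periods such as 7d would presumably have to be answered from MonitoringSnapshots rather than from memory.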
SnapshotCollector (src/lib/monitoring/snapshot-collector.ts)
Periodic metric collection; runs in the queue worker PM2 process.
startCollector(): void // setInterval(60_000)
stopCollector(): void // clearInterval, SIGTERM handler
saveSnapshot(metrics): void // → MonitoringSnapshots collection
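The start/stop lifecycle can be sketched as follows. `createCollector` and its injected `collect`/`save` callbacks are hypothetical stand-ins for collectMetrics and saveSnapshot; the overlap guard is an addition that the design does not mention.

```typescript
type Collect<T> = () => Promise<T>;
type Save<T> = (metrics: T) => Promise<void>;

function createCollector<T>(collect: Collect<T>, save: Save<T>, intervalMs = 60_000) {
  let timer: ReturnType<typeof setInterval> | undefined;
  let busy = false; // prevents a slow run from overlapping with the next tick

  async function tick(): Promise<void> {
    if (busy) return;
    busy = true;
    try {
      await save(await collect());
    } catch (err) {
      // A failed snapshot must not stop the interval.
      console.error("snapshot collection failed", err);
    } finally {
      busy = false;
    }
  }

  return {
    tick,
    start(): void {
      if (timer !== undefined) return; // already started
      timer = setInterval(() => void tick(), intervalMs);
    },
    stop(): void {
      if (timer !== undefined) clearInterval(timer);
      timer = undefined;
    },
    isRunning: (): boolean => timer !== undefined,
  };
}
```

In the queue worker, `stop()` would be registered on shutdown, for example with `process.once("SIGTERM", () => collector.stop())`, so the interval does not keep the process alive.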
AlertEvaluator (src/lib/monitoring/alert-evaluator.ts)
Checks metrics against MonitoringAlertRules.
evaluateRules(metrics): Promise<Alert[]>
shouldFireAlert(rule, value): boolean // cooldown + deduplication
dispatchAlert(alert): Promise<void> // → existing alert-service.ts
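The cooldown part of shouldFireAlert could look roughly like this. The sketch keeps the last fire time per rule in memory, which is an assumption; the same check could equally be answered by querying MonitoringAlertHistory. `CooldownTracker` and the `id` field are hypothetical names.

```typescript
interface CooldownRule {
  id: string;
  cooldownMinutes: number; // default in the rules collection: 15
}

class CooldownTracker {
  // rule id → timestamp (ms) of the last alert that was allowed through
  private readonly lastFired = new Map<string, number>();

  // Returns true and records the time if the rule may fire now;
  // returns false while the rule is still inside its cooldown window.
  shouldFire(rule: CooldownRule, now: number = Date.now()): boolean {
    const last = this.lastFired.get(rule.id);
    if (last !== undefined && now - last < rule.cooldownMinutes * 60_000) {
      return false;
    }
    this.lastFired.set(rule.id, now);
    return true;
  }
}
```

Because suppressed alerts do not update the stored timestamp, a condition that stays violated fires once per cooldown window rather than being silenced indefinitely.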
MonitoringLogger (src/lib/monitoring/monitoring-logger.ts)
Structured logger that writes to the MonitoringLogs collection.
const logger = createMonitoringLogger('source')
logger.info('message', { context })
logger.warn('message', { context })
logger.error('message', { context, requestId, userId, tenant })
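A possible shape for such a logger factory is sketched below. Unlike the one-argument call shown above, this version takes a second `sink` parameter so that it stays self-contained; that parameter, the `LogEntry` type and the timestamp field are assumptions. In the real service the sink would presumably write to the monitoring-logs collection through Payload.

```typescript
type LogLevel = "debug" | "info" | "warn" | "error" | "fatal";

interface LogEntry {
  level: LogLevel;
  source: string;
  message: string;
  context: Record<string, unknown>;
  timestamp: string; // ISO 8601
}

// Hypothetical sink; receives every entry the logger produces.
type LogSink = (entry: LogEntry) => void;

function createMonitoringLogger(source: string, sink: LogSink) {
  // Build one logging function per level; all share the same source and sink.
  const write =
    (level: LogLevel) =>
    (message: string, context: Record<string, unknown> = {}): void => {
      sink({ level, source, message, context, timestamp: new Date().toISOString() });
    };

  return {
    debug: write("debug"),
    info: write("info"),
    warn: write("warn"),
    error: write("error"),
    fatal: write("fatal"),
  };
}
```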
Dashboard UI
Admin view: /admin/monitoring with 5 tabs.
Tab 1: System Health
- Gauge widgets: CPU, RAM, disk, uptime (color-coded green/yellow/red)
- Trend charts: CPU 24h, memory 24h, load average (from MonitoringSnapshots)
- Live updates via SSE health events
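The color coding of the gauges amounts to a threshold lookup. The sketch below assumes cut-offs of 70% and 90%; the design does not specify thresholds, so both values and the function name `gaugeColor` are placeholders.

```typescript
type GaugeColor = "green" | "yellow" | "red";

// Map a usage percentage to a gauge color.
// warnAt / criticalAt are assumed defaults, not values from the design.
function gaugeColor(usagePercent: number, warnAt = 70, criticalAt = 90): GaugeColor {
  if (usagePercent >= criticalAt) return "red";
  if (usagePercent >= warnAt) return "yellow";
  return "green";
}
```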
Tab 2: Services
- Service cards with status badge (Online/Warning/Offline)
- Details: PID, memory, uptime, restarts (PM2 processes)
- PostgreSQL: connections, pool, latency
- Redis: memory, clients, ops/s
- OAuth: token status with expiry warning
- SMTP: last check, response time
- Cron: last execution times
- Live updates via SSE service events
Tab 3: Performance
- Response-Time Chart (Avg, P95, P99)
- Error-Rate Chart
- Requests/Minute Chart
- Time range filter (1h, 6h, 24h, 7d)
Tab 4: Alerts
- Active/unacknowledged alerts (at the top, highlighted)
- Alert history table (filterable by severity and time range)
- Acknowledge button per alert
- Alert rules CRUD (MonitoringAlertRules)
- New alerts via SSE alert events
Tab 5: Logs
- Log table (level, source, message, timestamp)
- Filters: level, source, time range, full-text search
- Expandable JSON context per entry
- Auto-scroll for new warn+ entries (via SSE log events)
Access Control
| Action | Super-Admin | monitoring role |
|---|---|---|
| View dashboard | Yes | Yes |
| Edit alert rules | Yes | No |
| Acknowledge alerts | Yes | No |
| View logs | Yes | Yes |
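The matrix above translates directly into a lookup table. The role and action identifiers in this sketch are hypothetical; only the yes/no values come from the table.

```typescript
type Role = "super-admin" | "monitoring";
type Action = "viewDashboard" | "editAlertRules" | "acknowledgeAlerts" | "viewLogs";

// Roles allowed to perform each action, mirroring the access control table.
const permissions: Record<Action, readonly Role[]> = {
  viewDashboard: ["super-admin", "monitoring"],
  editAlertRules: ["super-admin"],
  acknowledgeAlerts: ["super-admin"],
  viewLogs: ["super-admin", "monitoring"],
};

function canPerform(role: Role, action: Action): boolean {
  return permissions[action].includes(role);
}
```

Keeping the matrix in one place lets the REST routes and the UI (for example, hiding the acknowledge button) share the same check.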
Data Retention
| Collection | Retention | Env variable |
|---|---|---|
| monitoring-snapshots | 7 days | RETENTION_MONITORING_SNAPSHOTS_DAYS |
| monitoring-alert-history | 90 days | RETENTION_MONITORING_ALERTS_DAYS |
| monitoring-logs | 30 days | RETENTION_MONITORING_LOGS_DAYS |
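How a retention job might turn one of these variables into a cut-off date is sketched below. `retentionCutoff` is a hypothetical helper and not part of the existing Data Retention System; it assumes the table values act as defaults when the variable is unset or invalid.

```typescript
// Compute the timestamp before which documents are eligible for cleanup.
// `env` is passed in (e.g. process.env) so the function stays testable.
function retentionCutoff(
  env: Record<string, string | undefined>,
  variable: string,
  defaultDays: number,
  now: Date = new Date(),
): Date {
  const parsed = Number.parseInt(env[variable] ?? "", 10);
  // Fall back to the default for missing, non-numeric or non-positive values.
  const days = Number.isFinite(parsed) && parsed > 0 ? parsed : defaultDays;
  return new Date(now.getTime() - days * 24 * 60 * 60 * 1000);
}

// Example: 7-day default for snapshots, evaluated at a fixed point in time.
const snapshotCutoff = retentionCutoff(
  {},
  "RETENTION_MONITORING_SNAPSHOTS_DAYS",
  7,
  new Date("2026-02-14T00:00:00Z"),
);
```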
File Structure
src/
├── collections/
│ ├── MonitoringAlertRules.ts
│ ├── MonitoringAlertHistory.ts
│ ├── MonitoringLogs.ts
│ └── MonitoringSnapshots.ts
├── lib/monitoring/
│ ├── monitoring-service.ts
│ ├── performance-tracker.ts
│ ├── snapshot-collector.ts
│ ├── alert-evaluator.ts
│ ├── monitoring-logger.ts
│ └── types.ts
├── app/(payload)/api/monitoring/
│ ├── health/route.ts
│ ├── services/route.ts
│ ├── performance/route.ts
│ ├── alerts/route.ts
│ ├── alerts/acknowledge/route.ts
│ ├── logs/route.ts
│ ├── snapshots/route.ts
│ └── stream/route.ts
├── components/admin/
│ ├── MonitoringDashboard.tsx
│ ├── MonitoringDashboard.scss
│ ├── MonitoringNavLinks.tsx
│ └── monitoring/
│     ├── SystemHealthTab.tsx
│     ├── ServicesTab.tsx
│     ├── PerformanceTab.tsx
│     ├── AlertsTab.tsx
│     ├── LogsTab.tsx
│     ├── GaugeWidget.tsx
│     ├── TrendChart.tsx
│     ├── StatusBadge.tsx
│     └── LogTable.tsx
Dependencies
New npm packages: none. Uses only the Node.js os module and existing DB connections.
Existing infrastructure that is reused:
- src/lib/alerting/alert-service.ts - Multi-channel alerting
- src/lib/queue/ - BullMQ integration
- src/lib/redis.ts - Redis client
- /api/community/stream pattern - SSE implementation
- Data Retention System - Automatic cleanup