mirror of https://github.com/complexcaresolutions/cms.c2sgmbh.git synced 2026-03-17 22:04:10 +00:00

Martin Porwoll 15bdd66eb6 docs: add monitoring & alerting dashboard design

Event-driven architecture with SSE real-time updates, 5-tab dashboard
(System Health, Services, Performance, Alerts, Logs), 4 new collections,
and integration with existing alert-service.ts.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

2026-02-14 23:53:25 +00:00


Monitoring & Alerting Dashboard - Design

Date: 2026-02-14 Status: Approved

Goal

A comprehensive monitoring dashboard in the Payload admin panel, providing:

System health monitoring (CPU, RAM, disk)
Service status (DB, Redis, queue, SMTP, OAuth)
Performance tracking (response times, error rates)
Configurable alerting (multi-channel: email, Slack, Discord)
Structured log viewer
Real-time updates via SSE

Architecture: Event-Driven with SSE + REST

REST endpoints serve initial and historical data; an SSE stream delivers live updates.

┌─────────────────────────────────────────────────────────┐
│  Admin UI: /admin/monitoring                             │
│  ┌──────────┬──────────┬───────────┬────────┬────────┐  │
│  │ System   │ Services │Performance│ Alerts │  Logs  │  │
│  │ Health   │          │           │        │        │  │
│  └────┬─────┴────┬─────┴─────┬─────┴───┬────┴───┬────┘  │
│       │          │           │         │        │        │
│       ▼          ▼           ▼         ▼        ▼        │
│  SSE Stream (/api/monitoring/stream)                     │
│  + REST Endpoints (/api/monitoring/*)                    │
└───────────────────────┬─────────────────────────────────┘
                        │
┌───────────────────────▼─────────────────────────────────┐
│  Backend Services                                        │
│  ┌────────────────┐  ┌──────────────┐  ┌──────────────┐ │
│  │MonitoringService│  │ Performance  │  │  Alert       │ │
│  │(Health, Services│  │ Tracker      │  │  Evaluator   │ │
│  │ OAuth, SMTP)   │  │ (Ring-Buffer)│  │  (Rules DB)  │ │
│  └───────┬────────┘  └──────┬───────┘  └──────┬───────┘ │
│          │                  │                  │         │
│  ┌───────▼──────────────────▼──────────────────▼───────┐ │
│  │  SnapshotCollector (60s interval in queue worker)   │ │
│  └─────────────────────────┬───────────────────────────┘ │
│                            │                             │
│  ┌─────────────────────────▼───────────────────────────┐ │
│  │  MonitoringLogger (Structured Logs → Collection)    │ │
│  └─────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────┘
                        │
┌───────────────────────▼─────────────────────────────────┐
│  Collections                                             │
│  MonitoringSnapshots │ MonitoringAlertRules               │
│  MonitoringAlertHistory │ MonitoringLogs                  │
└─────────────────────────────────────────────────────────┘

Scope: Payload Stack + External Services

Monitored:

Payload CMS process (PM2)
Queue worker process (PM2)
PostgreSQL + PgBouncer
Redis
SMTP connections
OAuth token status (Meta, YouTube)
Cron job health
BullMQ queues

Collections (4)

MonitoringAlertRules

Alert rules, configurable in the admin panel.

Field	Type	Description
name	text	Rule name
metric	text	Metric path (e.g. `system.memory.usagePercent`)
condition	select	gt, lt, eq, gte, lte
threshold	number	Threshold value
severity	select	warning, error, critical
channels	select (hasMany)	email, slack, discord
recipients.email	array	Email recipients
recipients.slackWebhook	text	Slack webhook URL
recipients.discordWebhook	text	Discord webhook URL
cooldownMinutes	number	Minimum interval between alerts (default: 15)
enabled	checkbox	Active/inactive
tenant	relationship	Optional: tenant-specific
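The `metric`, `condition`, and `threshold` fields suggest how a single rule could be checked against a metrics object. The following is a minimal sketch under that assumption; the type names `Condition` and `AlertRule` and the helpers `resolveMetric` and `isTriggered` are hypothetical and not part of the design.

```typescript
// Hypothetical types modelled on the MonitoringAlertRules fields.
type Condition = "gt" | "lt" | "eq" | "gte" | "lte";

interface AlertRule {
  name: string;
  metric: string; // dot-separated metric path
  condition: Condition;
  threshold: number;
}

// Walk a dot-separated path through a nested metrics object.
// Returns undefined when the path is missing or does not end in a number.
function resolveMetric(metrics: unknown, path: string): number | undefined {
  let current: unknown = metrics;
  for (const key of path.split(".")) {
    if (typeof current !== "object" || current === null) return undefined;
    current = (current as Record<string, unknown>)[key];
  }
  return typeof current === "number" ? current : undefined;
}

// True when the metric value satisfies the rule's condition.
function isTriggered(rule: AlertRule, metrics: unknown): boolean {
  const value = resolveMetric(metrics, rule.metric);
  if (value === undefined) return false; // unknown metric: never fire
  switch (rule.condition) {
    case "gt": return value > rule.threshold;
    case "lt": return value < rule.threshold;
    case "eq": return value === rule.threshold;
    case "gte": return value >= rule.threshold;
    case "lte": return value <= rule.threshold;
  }
}

const rule: AlertRule = {
  name: "High memory usage",
  metric: "system.memory.usagePercent",
  condition: "gt",
  threshold: 90,
};
const sample = { system: { memory: { usagePercent: 93.5 } } };
console.log(isTriggered(rule, sample)); // true
```

Treating an unresolvable metric path as "not triggered" is one possible choice; a real evaluator might prefer to log such rules as misconfigured.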

MonitoringAlertHistory

Alert log (WORM - write once).

Field	Type	Description
rule	relationship	→ MonitoringAlertRules
metric	text	Metric path
value	number	Current value
threshold	number	Threshold value
severity	select	warning, error, critical
message	text	Alert message
channelsSent	select (hasMany)	Channels the alert was sent to
resolvedAt	date	Time of resolution
acknowledgedBy	relationship	→ Users

MonitoringLogs

Structured logs for business events.

Field	Type	Description
level	select	debug, info, warn, error, fatal
source	select	payload, queue-worker, cron, email, oauth, sync
message	text	Log message
context	json	Structured metadata
requestId	text	Correlation ID
userId	relationship	→ Users
tenant	relationship	→ Tenants
duration	number	Duration in ms

MonitoringSnapshots

Historical system metrics for trend charts.

Field	Type	Description
timestamp	date	Timestamp
system.cpuUsagePercent	number	CPU usage %
system.memoryUsedMB	number	RAM used
system.memoryTotalMB	number	RAM total
system.memoryUsagePercent	number	RAM usage %
system.diskUsedGB	number	Disk used
system.diskTotalGB	number	Disk total
system.diskUsagePercent	number	Disk usage %
system.loadAvg1	number	Load average, 1 min
system.loadAvg5	number	Load average, 5 min
system.uptime	number	Uptime in seconds
services.payload	json	{ status, pid, memory, uptime, restarts }
services.queueWorker	json	{ status, pid, memory, uptime, restarts }
services.postgresql	json	{ status, connections, poolSize, latency }
services.pgbouncer	json	{ status, activeConns, waitingClients }
services.redis	json	{ status, memoryUsed, clients, opsPerSec }
external.smtp	json	{ status, lastCheck, responseTime }
external.metaOAuth	json	{ status, tokensExpiring, tokensExpired }
external.youtubeOAuth	json	{ status, tokensExpiring, tokensExpired }
external.cronJobs	json	{ lastRuns: { ... } }
performance.avgResponseTime	number	Average response time
performance.errorRate	number	Error rate %
performance.requestsPerMinute	number	Requests per minute

API Endpoints

Method	Endpoint	Description	Auth
GET	`/api/monitoring/health`	System status (live)	Super-Admin / monitoring
GET	`/api/monitoring/services`	Service status	Super-Admin / monitoring
GET	`/api/monitoring/performance`	Performance metrics	Super-Admin / monitoring
GET	`/api/monitoring/alerts`	Alert history (paginated)	Super-Admin / monitoring
POST	`/api/monitoring/alerts/acknowledge`	Acknowledge alert	Super-Admin
GET	`/api/monitoring/logs`	Logs (paginated, filterable)	Super-Admin / monitoring
GET	`/api/monitoring/snapshots`	Historical metrics	Super-Admin / monitoring
GET	`/api/monitoring/stream`	SSE real-time stream	Super-Admin / monitoring

SSE Stream Events

Event	Interval	Data
`health`	10s	System metrics (CPU, RAM, disk)
`service`	On change	Service status updates
`alert`	Immediately	New alerts
`log`	Immediately (warn+)	New log entries (level >= warn)
`performance`	30s	Performance metrics
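To illustrate what the stream could put on the wire, here is a small sketch of serializing the five event types in the `text/event-stream` format, plus the "warn and above" filter for `log` events. The helper names `formatSseEvent` and `shouldStreamLog` and the payload fields are assumptions for illustration only.

```typescript
// Event names from the table; payload shapes are assumptions.
type MonitoringEvent = "health" | "service" | "alert" | "log" | "performance";

// Log levels in ascending severity, as listed for MonitoringLogs.
const LOG_LEVELS = ["debug", "info", "warn", "error", "fatal"] as const;
type LogLevel = (typeof LOG_LEVELS)[number];

// Serialize one event as an SSE frame: an "event:" line, a "data:" line,
// and a terminating blank line. JSON.stringify without indentation yields
// a single line, so one "data:" line is sufficient.
function formatSseEvent(event: MonitoringEvent, data: unknown): string {
  return `event: ${event}\ndata: ${JSON.stringify(data)}\n\n`;
}

// Only entries at level "warn" or higher are pushed as `log` events.
function shouldStreamLog(level: LogLevel): boolean {
  return LOG_LEVELS.indexOf(level) >= LOG_LEVELS.indexOf("warn");
}

const frame = formatSseEvent("health", { cpuUsagePercent: 42, memoryUsagePercent: 63.1 });
console.log(frame);
console.log(shouldStreamLog("info"), shouldStreamLog("error")); // false true
```

On the client, a browser `EventSource` would receive these frames through listeners registered per event name, for example `source.addEventListener("health", handler)`.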

Backend Services

MonitoringService (`src/lib/monitoring/monitoring-service.ts`)

Central service for metric collection.

collectMetrics(): Promise<SystemMetrics>
checkSystemHealth(): SystemHealth      // CPU, RAM, disk, uptime via os module
checkPostgresql(): ServiceStatus       // pg_stat_activity, latency test
checkPgBouncer(): ServiceStatus        // SHOW POOLS via PgBouncer admin console
checkRedis(): ServiceStatus            // INFO command
checkSmtp(): ServiceStatus             // SMTP EHLO check
checkOAuthTokens(): OAuthStatus        // check SocialAccounts token expiry
checkCronJobs(): CronStatus            // last execution times
checkQueues(): QueueStatus             // BullMQ getJobCounts()
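As a rough sketch of the `os`-based part of `checkSystemHealth()`, the following reads memory, load average, and uptime from Node's `os` module. The `SystemHealth` shape shown here is an assumption that borrows field names from MonitoringSnapshots. CPU usage percent is left out because it needs two samples of `os.cpus()` taken some time apart, and disk usage is left out because the `os` module does not expose it.

```typescript
import * as os from "node:os";

// Hypothetical shape; field names follow the MonitoringSnapshots "system" group.
interface SystemHealth {
  memoryUsedMB: number;
  memoryTotalMB: number;
  memoryUsagePercent: number;
  loadAvg1: number;
  loadAvg5: number;
  uptime: number; // seconds
}

const BYTES_PER_MB = 1024 * 1024;

function checkSystemHealth(): SystemHealth {
  const totalBytes = os.totalmem();
  const usedBytes = totalBytes - os.freemem();
  const load = os.loadavg(); // [1 min, 5 min, 15 min]; all zeros on Windows
  return {
    memoryUsedMB: Math.round(usedBytes / BYTES_PER_MB),
    memoryTotalMB: Math.round(totalBytes / BYTES_PER_MB),
    // One decimal place, e.g. 63.1
    memoryUsagePercent: Math.round((usedBytes / totalBytes) * 1000) / 10,
    loadAvg1: load[0] ?? 0,
    loadAvg5: load[1] ?? 0,
    uptime: Math.floor(os.uptime()),
  };
}

console.log(checkSystemHealth());
```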

PerformanceTracker (`src/lib/monitoring/performance-tracker.ts`)

Lightweight request metrics in an in-memory ring buffer.

trackRequest(method, path, statusCode, duration): void
getMetrics(period: '1h' | '6h' | '24h' | '7d'): PerformanceMetrics

Integration: Payload beforeOperation / afterOperation hooks.
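A fixed-capacity ring buffer keeps memory bounded by overwriting the oldest sample once it is full. The sketch below shows one way this could look; the class name `PerformanceRingBuffer`, the explicit `timestamp` and `windowMs` parameters, and counting status codes of 500 and above as errors are all assumptions made to keep the example deterministic.

```typescript
interface RequestSample {
  timestamp: number; // ms since epoch
  method: string;
  path: string;
  statusCode: number;
  duration: number; // ms
}

class PerformanceRingBuffer {
  private readonly samples: RequestSample[] = [];
  private readonly capacity: number;
  private next = 0; // index of the slot to write next

  constructor(capacity: number) {
    this.capacity = capacity;
  }

  trackRequest(method: string, path: string, statusCode: number,
               duration: number, timestamp: number = Date.now()): void {
    const sample: RequestSample = { timestamp, method, path, statusCode, duration };
    if (this.samples.length < this.capacity) {
      this.samples.push(sample);
    } else {
      this.samples[this.next] = sample; // buffer full: overwrite the oldest entry
    }
    this.next = (this.next + 1) % this.capacity;
  }

  // Aggregate all samples no older than windowMs.
  getMetrics(windowMs: number, now: number = Date.now()) {
    const recent = this.samples.filter((s) => now - s.timestamp <= windowMs);
    if (recent.length === 0) {
      return { count: 0, avgResponseTime: 0, errorRate: 0 };
    }
    const totalDuration = recent.reduce((sum, s) => sum + s.duration, 0);
    const errors = recent.filter((s) => s.statusCode >= 500).length; // assumption: 5xx = error
    return {
      count: recent.length,
      avgResponseTime: totalDuration / recent.length,
      errorRate: (errors / recent.length) * 100,
    };
  }
}

const tracker = new PerformanceRingBuffer(3);
const t0 = 1_000_000;
tracker.trackRequest("GET", "/api/posts", 200, 100, t0);
tracker.trackRequest("GET", "/api/posts", 500, 300, t0 + 1000);
tracker.trackRequest("POST", "/api/posts", 200, 200, t0 + 2000);
tracker.trackRequest("GET", "/api/media", 200, 400, t0 + 3000); // overwrites the first sample
console.log(tracker.getMetrics(60_000, t0 + 3000));
```

Because the buffer lives in process memory, longer periods such as 7d would presumably have to come from MonitoringSnapshots rather than from the buffer itself.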

SnapshotCollector (`src/lib/monitoring/snapshot-collector.ts`)

Periodic metric collection; runs in the queue worker PM2 process.

startCollector(): void       // setInterval(60_000)
stopCollector(): void        // clearInterval, SIGTERM handler
saveSnapshot(metrics): void  // → MonitoringSnapshots Collection
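The start/stop lifecycle could be sketched as follows. The factory `createSnapshotCollector`, its injected `collectMetrics` and `saveSnapshot` dependencies, and the `busy` guard against overlapping runs are assumptions for illustration; the design only names the three functions above.

```typescript
type Metrics = Record<string, unknown>;

interface CollectorDeps {
  collectMetrics: () => Promise<Metrics>;
  saveSnapshot: (metrics: Metrics) => Promise<void>;
  intervalMs?: number; // defaults to 60 seconds
}

function createSnapshotCollector(deps: CollectorDeps) {
  let timer: ReturnType<typeof setInterval> | undefined;
  let busy = false; // skip a tick if the previous one is still running

  async function tick(): Promise<void> {
    if (busy) return;
    busy = true;
    try {
      await deps.saveSnapshot(await deps.collectMetrics());
    } catch (err) {
      // A failed snapshot must not stop the interval.
      console.error("snapshot collection failed", err);
    } finally {
      busy = false;
    }
  }

  return {
    // Returns false if the collector was already running.
    start(): boolean {
      if (timer !== undefined) return false;
      timer = setInterval(() => void tick(), deps.intervalMs ?? 60_000);
      return true;
    },
    stop(): void {
      if (timer !== undefined) clearInterval(timer);
      timer = undefined;
    },
    isRunning: (): boolean => timer !== undefined,
    tick,
  };
}

const saved: Metrics[] = [];
const collector = createSnapshotCollector({
  collectMetrics: async () => ({ timestamp: new Date().toISOString() }),
  saveSnapshot: async (metrics) => { saved.push(metrics); },
});
collector.start();
// In the queue worker, a SIGTERM handler would call collector.stop().
collector.stop();
```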

AlertEvaluator (`src/lib/monitoring/alert-evaluator.ts`)

Checks metrics against MonitoringAlertRules.

evaluateRules(metrics): Promise<Alert[]>
shouldFireAlert(rule, value): boolean    // cooldown + deduplication
dispatchAlert(alert): Promise<void>      // → existing alert-service.ts
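The cooldown part of the firing decision could work roughly as sketched below: an alert for a rule is suppressed while the last one is younger than `cooldownMinutes`. The function name `passesCooldown`, the `CooldownRule` type, and the in-memory `Map` are assumptions; a real implementation might instead look up the latest MonitoringAlertHistory entry for the rule.

```typescript
interface CooldownRule {
  id: string;
  cooldownMinutes: number; // the rules collection defaults this to 15
}

// Last fire time per rule id, in ms since epoch. In-memory only in this sketch.
const lastFired = new Map<string, number>();

// Returns true and records the fire time when the rule is outside its
// cooldown window; returns false to suppress a duplicate alert.
function passesCooldown(rule: CooldownRule, now: number = Date.now()): boolean {
  const previous = lastFired.get(rule.id);
  const cooldownMs = rule.cooldownMinutes * 60_000;
  if (previous !== undefined && now - previous < cooldownMs) {
    return false;
  }
  lastFired.set(rule.id, now);
  return true;
}

const highMemory: CooldownRule = { id: "high-memory", cooldownMinutes: 15 };
const start = Date.now();
console.log(passesCooldown(highMemory, start));              // true: first alert
console.log(passesCooldown(highMemory, start + 5 * 60_000)); // false: within 15 min
console.log(passesCooldown(highMemory, start + 20 * 60_000)); // true: cooldown elapsed
```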

MonitoringLogger (`src/lib/monitoring/monitoring-logger.ts`)

Structured logger that writes to the MonitoringLogs collection.

const logger = createMonitoringLogger('source')
logger.info('message', { context })
logger.warn('message', { context })
logger.error('message', { context, requestId, userId, tenant })
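One way such a logger factory could be structured is sketched below. Unlike the signature above, this version takes a second `sink` argument so that it runs without a database; in the actual design the sink would presumably be a write to the MonitoringLogs collection. The `LogEntry` and `LogSink` types are hypothetical.

```typescript
type LogLevel = "debug" | "info" | "warn" | "error" | "fatal";

interface LogEntry {
  level: LogLevel;
  source: string;
  message: string;
  context?: Record<string, unknown>;
  createdAt: string; // ISO timestamp
}

// Hypothetical sink: receives each finished entry.
type LogSink = (entry: LogEntry) => void;

function createMonitoringLogger(source: string, sink: LogSink) {
  // Build one logging method per level.
  const write = (level: LogLevel) =>
    (message: string, context?: Record<string, unknown>): void => {
      sink({
        level,
        source,
        message,
        ...(context ? { context } : {}),
        createdAt: new Date().toISOString(),
      });
    };
  return {
    debug: write("debug"),
    info: write("info"),
    warn: write("warn"),
    error: write("error"),
    fatal: write("fatal"),
  };
}

const entries: LogEntry[] = [];
const logger = createMonitoringLogger("cron", (entry) => { entries.push(entry); });
logger.info("cleanup finished");
logger.error("SMTP check failed", { requestId: "req-42" });
console.log(entries.map((e) => `${e.level} [${e.source}] ${e.message}`));
```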

Dashboard UI

Admin view: /admin/monitoring with 5 tabs.

Tab 1: System Health

Gauge widgets: CPU, RAM, disk, uptime (color-coded green/yellow/red)
Trend charts: CPU 24h, memory 24h, load average (from MonitoringSnapshots)
Live updates via SSE health events
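The green/yellow/red coding of the gauges could be derived from a percentage with two thresholds. This is a sketch only: the function name `gaugeColor` and the default thresholds of 70 and 90 are assumptions, since the design does not specify them.

```typescript
type GaugeColor = "green" | "yellow" | "red";

// Map a usage percentage to a gauge color.
// warnAt and criticalAt are illustrative defaults, not values from the design.
function gaugeColor(percent: number, warnAt = 70, criticalAt = 90): GaugeColor {
  if (percent >= criticalAt) return "red";
  if (percent >= warnAt) return "yellow";
  return "green";
}

console.log(gaugeColor(42), gaugeColor(75), gaugeColor(95)); // green yellow red
```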

Tab 2: Services

Service cards with status badge (Online/Warning/Offline)
Details: PID, memory, uptime, restarts (PM2 processes)
PostgreSQL: connections, pool, latency
Redis: memory, clients, ops/s
OAuth: token status with expiry warning
SMTP: last check, response time
Cron: last execution times
Live updates via SSE service events

Tab 3: Performance

Response time chart (avg, P95, P99)
Error rate chart
Requests per minute chart
Time range filter (1h, 6h, 24h, 7d)

Tab 4: Alerts

Active/unacknowledged alerts (at the top, highlighted)
Alert history table (filterable by severity and time range)
Acknowledge button per alert
Alert rules CRUD (MonitoringAlertRules)
New alerts via SSE alert events

Tab 5: Logs

Log table (level, source, message, timestamp)
Filters: level, source, time range, full-text search
Expandable JSON context per entry
Auto-scroll for new warn+ entries (via SSE log events)

Access Control

Action	Super-Admin	monitoring role
View dashboard	Yes	Yes
Edit alert rules	Yes	No
Acknowledge alerts	Yes	No
View logs	Yes	Yes
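The permission matrix can be expressed as a lookup table that both the API routes and the UI could consult. This is a minimal sketch; the role strings, the action names, and the helper `canPerform` are assumptions and may not match how roles are stored on the user document.

```typescript
type Role = "super-admin" | "monitoring";
type Action = "viewDashboard" | "editAlertRules" | "acknowledgeAlerts" | "viewLogs";

// One row of the access control matrix per action.
const PERMISSIONS: Record<Action, readonly Role[]> = {
  viewDashboard: ["super-admin", "monitoring"],
  editAlertRules: ["super-admin"],
  acknowledgeAlerts: ["super-admin"],
  viewLogs: ["super-admin", "monitoring"],
};

// True if any of the user's roles is allowed to perform the action.
function canPerform(roles: readonly string[], action: Action): boolean {
  return PERMISSIONS[action].some((allowed) => roles.includes(allowed));
}

console.log(canPerform(["monitoring"], "viewLogs"));          // true
console.log(canPerform(["monitoring"], "acknowledgeAlerts")); // false
```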

Data Retention

Collection	Retention	Env variable
monitoring-snapshots	7 days	`RETENTION_MONITORING_SNAPSHOTS_DAYS`
monitoring-alert-history	90 days	`RETENTION_MONITORING_ALERTS_DAYS`
monitoring-logs	30 days	`RETENTION_MONITORING_LOGS_DAYS`
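A cleanup job could turn these settings into a cutoff date and delete everything older. The sketch below assumes that the listed retention periods act as defaults when the environment variable is unset or invalid; the helpers `retentionDays` and `retentionCutoff` are hypothetical, and the environment is passed in as a plain object (for example `process.env`) to keep the functions testable.

```typescript
type Env = Record<string, string | undefined>;

// Assumed defaults, taken from the retention table.
const RETENTION_DEFAULTS = {
  RETENTION_MONITORING_SNAPSHOTS_DAYS: 7,
  RETENTION_MONITORING_ALERTS_DAYS: 90,
  RETENTION_MONITORING_LOGS_DAYS: 30,
} as const;

type RetentionKey = keyof typeof RETENTION_DEFAULTS;

// Read the retention period in days, falling back to the default
// when the variable is missing, non-numeric, or not positive.
function retentionDays(key: RetentionKey, env: Env): number {
  const parsed = Number.parseInt(env[key] ?? "", 10);
  return Number.isFinite(parsed) && parsed > 0 ? parsed : RETENTION_DEFAULTS[key];
}

// Documents created before the returned date are eligible for deletion.
function retentionCutoff(key: RetentionKey, env: Env, now: Date = new Date()): Date {
  const msPerDay = 24 * 60 * 60 * 1000;
  return new Date(now.getTime() - retentionDays(key, env) * msPerDay);
}

const now = new Date("2026-02-14T00:00:00Z");
console.log(retentionCutoff("RETENTION_MONITORING_SNAPSHOTS_DAYS", {}, now).toISOString());
// 2026-02-07T00:00:00.000Z
```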

File Structure

src/
├── collections/
│   ├── MonitoringAlertRules.ts
│   ├── MonitoringAlertHistory.ts
│   ├── MonitoringLogs.ts
│   └── MonitoringSnapshots.ts
├── lib/monitoring/
│   ├── monitoring-service.ts
│   ├── performance-tracker.ts
│   ├── snapshot-collector.ts
│   ├── alert-evaluator.ts
│   ├── monitoring-logger.ts
│   └── types.ts
├── app/(payload)/api/monitoring/
│   ├── health/route.ts
│   ├── services/route.ts
│   ├── performance/route.ts
│   ├── alerts/route.ts
│   ├── alerts/acknowledge/route.ts
│   ├── logs/route.ts
│   ├── snapshots/route.ts
│   └── stream/route.ts
├── components/admin/
│   ├── MonitoringDashboard.tsx
│   ├── MonitoringDashboard.scss
│   ├── MonitoringNavLinks.tsx
│   └── monitoring/
│       ├── SystemHealthTab.tsx
│       ├── ServicesTab.tsx
│       ├── PerformanceTab.tsx
│       ├── AlertsTab.tsx
│       ├── LogsTab.tsx
│       ├── GaugeWidget.tsx
│       ├── TrendChart.tsx
│       ├── StatusBadge.tsx
│       └── LogTable.tsx

Dependencies

New npm packages: none. Uses only the Node.js os module and existing DB connections.

Existing infrastructure that is reused:

src/lib/alerting/alert-service.ts - multi-channel alerting
src/lib/queue/ - BullMQ integration
src/lib/redis.ts - Redis client
/api/community/stream pattern - SSE implementation
Data retention system - automatic cleanup
