mirror of
https://github.com/complexcaresolutions/cms.c2sgmbh.git
synced 2026-03-17 22:04:10 +00:00
Event-driven architecture with SSE real-time updates, 5-tab dashboard (System Health, Services, Performance, Alerts, Logs), 4 new collections, and integration with existing alert-service.ts. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
# Monitoring & Alerting Dashboard - Design

**Date:** 14 February 2026
**Status:** Approved

## Goal

A comprehensive monitoring dashboard in the Payload admin panel, providing:
- System health monitoring (CPU, RAM, disk)
- Service status (DB, Redis, queue, SMTP, OAuth)
- Performance tracking (response times, error rates)
- Configurable alerting (multi-channel: email, Slack, Discord)
- Structured log viewer
- Real-time updates via SSE

## Architecture: Event-Driven with SSE + REST

REST endpoints serve initial and historical data; an SSE stream delivers live updates.

```
┌─────────────────────────────────────────────────────────┐
│               Admin UI: /admin/monitoring               │
│  ┌──────────┬──────────┬───────────┬────────┬────────┐  │
│  │ System   │ Services │Performance│ Alerts │ Logs   │  │
│  │ Health   │          │           │        │        │  │
│  └────┬─────┴────┬─────┴─────┬─────┴───┬────┴───┬────┘  │
│       │          │           │         │        │       │
│       ▼          ▼           ▼         ▼        ▼       │
│           SSE Stream (/api/monitoring/stream)           │
│           + REST Endpoints (/api/monitoring/*)          │
└───────────────────────┬─────────────────────────────────┘
                        │
┌───────────────────────▼─────────────────────────────────┐
│                    Backend Services                     │
│ ┌─────────────────┐ ┌──────────────┐   ┌──────────────┐ │
│ │MonitoringService│ │ Performance  │   │    Alert     │ │
│ │(Health, Services│ │   Tracker    │   │  Evaluator   │ │
│ │ OAuth, SMTP)    │ │ (Ring-Buffer)│   │  (Rules DB)  │ │
│ └───────┬─────────┘ └──────┬───────┘   └──────┬───────┘ │
│         │                  │                  │         │
│ ┌───────▼──────────────────▼──────────────────▼───────┐ │
│ │  SnapshotCollector (60s interval in queue worker)   │ │
│ └─────────────────────────┬───────────────────────────┘ │
│                           │                             │
│ ┌─────────────────────────▼───────────────────────────┐ │
│ │  MonitoringLogger (Structured Logs → Collection)    │ │
│ └─────────────────────────────────────────────────────┘ │
└───────────────────────┬─────────────────────────────────┘
                        │
┌───────────────────────▼─────────────────────────────────┐
│                       Collections                       │
│  MonitoringSnapshots     │  MonitoringAlertRules        │
│  MonitoringAlertHistory  │  MonitoringLogs              │
└─────────────────────────────────────────────────────────┘
```

## Scope: Payload Stack + External Services

**Monitored:**
- Payload CMS process (PM2)
- Queue worker process (PM2)
- PostgreSQL + PgBouncer
- Redis
- SMTP connections
- OAuth token status (Meta, YouTube)
- Cron job health
- BullMQ queues

## Collections (4)

### MonitoringAlertRules

Alert rules, configurable in the admin panel.

| Field | Type | Description |
|-------|------|-------------|
| name | text | Rule name |
| metric | text | Metric path (e.g. `system.memory.usagePercent`) |
| condition | select | gt, lt, eq, gte, lte |
| threshold | number | Threshold value |
| severity | select | warning, error, critical |
| channels | select (hasMany) | email, slack, discord |
| recipients.email | array | Email recipients |
| recipients.slackWebhook | text | Slack webhook URL |
| recipients.discordWebhook | text | Discord webhook URL |
| cooldownMinutes | number | Minimum interval between alerts in minutes (default: 15) |
| enabled | checkbox | Active/inactive |
| tenant | relationship | Optional: tenant-specific |

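
To illustrate how a rule's `metric`, `condition`, and `threshold` fields could work together, here is a minimal sketch. The `AlertRule` interface, the helper names (`resolveMetric`, `compare`, `isViolated`), and the nested metrics object are assumptions made for this example; only the field names and the operator list come from the table above.

```typescript
// Hypothetical shapes for illustration; only the field names come from the rule table.
type Condition = "gt" | "lt" | "eq" | "gte" | "lte";

interface AlertRule {
  ruleName: string;
  metric: string; // dotted metric path
  condition: Condition;
  threshold: number;
}

// Walk a dotted path such as "system.memory.usagePercent" through a nested object.
// Returns undefined when a segment is missing or the leaf is not a number.
function resolveMetric(metrics: unknown, path: string): number | undefined {
  let current: unknown = metrics;
  for (const key of path.split(".")) {
    if (typeof current !== "object" || current === null) return undefined;
    current = (current as Record<string, unknown>)[key];
  }
  return typeof current === "number" ? current : undefined;
}

// Apply the rule's comparison operator to a measured value.
function compare(condition: Condition, value: number, threshold: number): boolean {
  switch (condition) {
    case "gt": return value > threshold;
    case "lt": return value < threshold;
    case "eq": return value === threshold;
    case "gte": return value >= threshold;
    case "lte": return value <= threshold;
  }
}

// A rule is violated when its metric exists and the comparison holds.
function isViolated(rule: AlertRule, metrics: unknown): boolean {
  const value = resolveMetric(metrics, rule.metric);
  return value !== undefined && compare(rule.condition, value, rule.threshold);
}

const memoryRule: AlertRule = {
  ruleName: "High memory usage",
  metric: "system.memory.usagePercent",
  condition: "gt",
  threshold: 90,
};

console.log(isViolated(memoryRule, { system: { memory: { usagePercent: 93.5 } } })); // true
```

Treating a missing metric as "not violated" is a design choice of this sketch; an implementation might instead want to raise its own alert when an expected metric is absent.
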
### MonitoringAlertHistory

Alert log (WORM, write once).

| Field | Type | Description |
|-------|------|-------------|
| rule | relationship | → MonitoringAlertRules |
| metric | text | Metric path |
| value | number | Current value |
| threshold | number | Threshold value |
| severity | select | warning, error, critical |
| message | text | Alert message |
| channelsSent | select (hasMany) | Channels the alert was sent to |
| resolvedAt | date | Time of resolution |
| acknowledgedBy | relationship | → Users |

### MonitoringLogs

Structured logs for business events.

| Field | Type | Description |
|-------|------|-------------|
| level | select | debug, info, warn, error, fatal |
| source | select | payload, queue-worker, cron, email, oauth, sync |
| message | text | Log message |
| context | json | Structured metadata |
| requestId | text | Correlation ID |
| userId | relationship | → Users |
| tenant | relationship | → Tenants |
| duration | number | Duration in ms |

### MonitoringSnapshots

Historical system metrics for trend charts.

| Field | Type | Description |
|-------|------|-------------|
| timestamp | date | Timestamp |
| system.cpuUsagePercent | number | CPU usage (%) |
| system.memoryUsedMB | number | RAM used (MB) |
| system.memoryTotalMB | number | RAM total (MB) |
| system.memoryUsagePercent | number | RAM usage (%) |
| system.diskUsedGB | number | Disk used (GB) |
| system.diskTotalGB | number | Disk total (GB) |
| system.diskUsagePercent | number | Disk usage (%) |
| system.loadAvg1 | number | Load average (1 min) |
| system.loadAvg5 | number | Load average (5 min) |
| system.uptime | number | Uptime in seconds |
| services.payload | json | { status, pid, memory, uptime, restarts } |
| services.queueWorker | json | { status, pid, memory, uptime, restarts } |
| services.postgresql | json | { status, connections, poolSize, latency } |
| services.pgbouncer | json | { status, activeConns, waitingClients } |
| services.redis | json | { status, memoryUsed, clients, opsPerSec } |
| external.smtp | json | { status, lastCheck, responseTime } |
| external.metaOAuth | json | { status, tokensExpiring, tokensExpired } |
| external.youtubeOAuth | json | { status, tokensExpiring, tokensExpired } |
| external.cronJobs | json | { lastRuns: { ... } } |
| performance.avgResponseTime | number | Average response time |
| performance.errorRate | number | Error rate (%) |
| performance.requestsPerMinute | number | Requests per minute |

## API Endpoints

| Method | Endpoint | Description | Auth |
|--------|----------|-------------|------|
| GET | `/api/monitoring/health` | System status (live) | Super-Admin / monitoring |
| GET | `/api/monitoring/services` | Service status | Super-Admin / monitoring |
| GET | `/api/monitoring/performance` | Performance metrics | Super-Admin / monitoring |
| GET | `/api/monitoring/alerts` | Alert history (paginated) | Super-Admin / monitoring |
| POST | `/api/monitoring/alerts/acknowledge` | Acknowledge an alert | Super-Admin |
| GET | `/api/monitoring/logs` | Logs (paginated, filterable) | Super-Admin / monitoring |
| GET | `/api/monitoring/snapshots` | Historical metrics | Super-Admin / monitoring |
| GET | `/api/monitoring/stream` | SSE real-time stream | Super-Admin / monitoring |

### SSE Stream Events

| Event | Interval | Data |
|-------|----------|------|
| `health` | 10s | System metrics (CPU, RAM, disk) |
| `service` | On change | Service status updates |
| `alert` | Immediately | New alerts |
| `log` | Immediately (warn+) | New log entries (level >= warn) |
| `performance` | 30s | Performance metrics |

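
As a sketch of what one named event looks like on the wire, the helper below serializes a Server-Sent Events frame. The function name `formatSseEvent` and the payload shape are assumptions for illustration; the event names are the ones listed in the table, and the frame layout (an `event:` line, a `data:` line, then a blank line) follows the SSE format.

```typescript
// Event names come from the event table; payload shapes are placeholders.
type MonitoringEvent = "health" | "service" | "alert" | "log" | "performance";

// Serialize one Server-Sent Events frame: an "event:" line, a "data:" line,
// and an empty line that terminates the frame. JSON.stringify never emits raw
// newlines, so the payload always fits on a single "data:" line.
function formatSseEvent(eventName: MonitoringEvent, data: unknown): string {
  return `event: ${eventName}\ndata: ${JSON.stringify(data)}\n\n`;
}

const healthFrame = formatSseEvent("health", { cpuUsagePercent: 42 });
console.log(JSON.stringify(healthFrame));
```

On the client, a browser `EventSource` can subscribe to each named event with `addEventListener("health", ...)` and parse `event.data` as JSON.
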
## Backend Services

### MonitoringService (`src/lib/monitoring/monitoring-service.ts`)

Central service for metric collection.

```typescript
collectMetrics(): Promise<SystemMetrics>
checkSystemHealth(): SystemHealth // CPU, RAM, disk, uptime via os module
checkPostgresql(): ServiceStatus // pg_stat_activity, latency test
checkPgBouncer(): ServiceStatus // SHOW POOLS via PgBouncer admin console
checkRedis(): ServiceStatus // INFO command
checkSmtp(): ServiceStatus // SMTP EHLO check
checkOAuthTokens(): OAuthStatus // check SocialAccounts token expiry
checkCronJobs(): CronStatus // last execution times
checkQueues(): QueueStatus // BullMQ getJobCounts()
```

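
The memory, load, and uptime part of `checkSystemHealth()` might be derived roughly as follows. This is a sketch under assumptions: the `RawSystemReadings` and `SystemHealthSample` types and the `toSystemHealth` helper are hypothetical, and the readings are passed in as plain numbers (in the units Node's `os.totalmem()`, `os.freemem()`, `os.loadavg()`, and `os.uptime()` return) so the conversion stays easy to test. CPU and disk usage are left out here; note that the `os` module itself does not expose disk usage.

```typescript
// Raw readings in the units returned by Node's os module.
interface RawSystemReadings {
  totalMemBytes: number; // os.totalmem()
  freeMemBytes: number; // os.freemem()
  loadAvg: number[]; // os.loadavg() -> [1 min, 5 min, 15 min]
  uptimeSeconds: number; // os.uptime()
}

// Hypothetical subset of the snapshot's system.* fields.
interface SystemHealthSample {
  memoryUsedMB: number;
  memoryTotalMB: number;
  memoryUsagePercent: number;
  loadAvg1: number;
  loadAvg5: number;
  uptime: number;
}

const BYTES_PER_MB = 1024 * 1024;

// Round to one decimal place.
function roundTo1(value: number): number {
  return Math.round(value * 10) / 10;
}

// Convert raw byte and second counts into the units used by the snapshot fields.
function toSystemHealth(raw: RawSystemReadings): SystemHealthSample {
  const usedBytes = raw.totalMemBytes - raw.freeMemBytes;
  return {
    memoryUsedMB: Math.round(usedBytes / BYTES_PER_MB),
    memoryTotalMB: Math.round(raw.totalMemBytes / BYTES_PER_MB),
    memoryUsagePercent: roundTo1((usedBytes / raw.totalMemBytes) * 100),
    loadAvg1: raw.loadAvg[0] ?? 0,
    loadAvg5: raw.loadAvg[1] ?? 0,
    uptime: Math.floor(raw.uptimeSeconds),
  };
}

// Example with fixed numbers: 8 GiB total, 2 GiB free.
const healthSample = toSystemHealth({
  totalMemBytes: 8 * 1024 * BYTES_PER_MB,
  freeMemBytes: 2 * 1024 * BYTES_PER_MB,
  loadAvg: [0.42, 0.35, 0.3],
  uptimeSeconds: 86400.7,
});

console.log(healthSample.memoryUsagePercent); // 75
```
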
### PerformanceTracker (`src/lib/monitoring/performance-tracker.ts`)

Lightweight request metrics held in an in-memory ring buffer.

```typescript
trackRequest(method, path, statusCode, duration): void
getMetrics(period: '1h' | '6h' | '24h' | '7d'): PerformanceMetrics
```

Integration: via Payload `beforeOperation` / `afterOperation` hooks.

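
A ring-buffer tracker along these lines could look like the sketch below. Several details are assumptions rather than part of the design: the `createPerformanceTracker` factory, the optional `timestamp` argument, passing the window in milliseconds instead of a period string, counting status codes >= 500 as errors, and the nearest-rank P95.

```typescript
// One recorded request. The timestamp field is an assumption added for windowing.
interface RequestSample {
  method: string;
  path: string;
  statusCode: number;
  duration: number; // milliseconds
  timestamp: number; // milliseconds since epoch
}

interface PerformanceMetrics {
  requestCount: number;
  avgResponseTime: number;
  p95ResponseTime: number;
  errorRate: number; // percent of requests with status >= 500 (assumed definition)
}

// Fixed-capacity buffer: once full, each new sample overwrites the oldest one.
class RingBuffer<T> {
  private items: T[] = [];
  private next = 0;
  private capacity: number;

  constructor(capacity: number) {
    this.capacity = capacity;
  }

  push(item: T): void {
    if (this.items.length < this.capacity) {
      this.items.push(item);
    } else {
      this.items[this.next] = item;
    }
    this.next = (this.next + 1) % this.capacity;
  }

  toArray(): T[] {
    return this.items.slice();
  }
}

function createPerformanceTracker(capacity: number) {
  const buffer = new RingBuffer<RequestSample>(capacity);

  function trackRequest(
    method: string,
    path: string,
    statusCode: number,
    duration: number,
    timestamp: number = Date.now(),
  ): void {
    buffer.push({ method, path, statusCode, duration, timestamp });
  }

  // Aggregate all samples recorded within the last `windowMs` milliseconds.
  function getMetrics(windowMs: number, now: number = Date.now()): PerformanceMetrics {
    const samples = buffer.toArray().filter((s) => now - s.timestamp <= windowMs);
    if (samples.length === 0) {
      return { requestCount: 0, avgResponseTime: 0, p95ResponseTime: 0, errorRate: 0 };
    }
    const durations = samples.map((s) => s.duration).sort((a, b) => a - b);
    const total = durations.reduce((sum, d) => sum + d, 0);
    // Nearest-rank percentile: the value at 1-based position ceil(0.95 * n).
    const p95Index = Math.ceil(durations.length * 0.95) - 1;
    const errors = samples.filter((s) => s.statusCode >= 500).length;
    return {
      requestCount: samples.length,
      avgResponseTime: total / samples.length,
      p95ResponseTime: durations[p95Index] ?? 0,
      errorRate: (errors / samples.length) * 100,
    };
  }

  return { trackRequest, getMetrics };
}

const tracker = createPerformanceTracker(10_000);
tracker.trackRequest("GET", "/api/pages", 200, 120);
tracker.trackRequest("POST", "/api/media", 500, 480);
console.log(tracker.getMetrics(60 * 60 * 1000));
```

Because the buffer has a fixed capacity, memory use stays bounded regardless of traffic; the trade-off is that under heavy load the longer windows (24h, 7d) may only cover the most recent samples.
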
### SnapshotCollector (`src/lib/monitoring/snapshot-collector.ts`)

Periodic metric collection; runs in the queue worker PM2 process.

```typescript
startCollector(): void // setInterval(60_000)
stopCollector(): void // clearInterval, SIGTERM handler
saveSnapshot(metrics): void // → MonitoringSnapshots collection
```

### AlertEvaluator (`src/lib/monitoring/alert-evaluator.ts`)

Checks metrics against MonitoringAlertRules.

```typescript
evaluateRules(metrics): Promise<Alert[]>
shouldFireAlert(rule, value): boolean // cooldown + deduplication
dispatchAlert(alert): Promise<void> // → existing alert-service.ts
```

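
The cooldown part of `shouldFireAlert` could be sketched as below. This is an illustration only: it covers the cooldown check, not deduplication, takes a timestamp instead of the metric value, and keeps its state in memory (so a process restart resets it). The `createCooldownGate` factory and the `CooldownRule` shape are hypothetical.

```typescript
// Minimal rule shape for this sketch; `id` is an assumed unique identifier.
interface CooldownRule {
  id: string;
  cooldownMinutes: number;
}

// Returns a function that remembers, per rule, when an alert last fired.
function createCooldownGate() {
  const lastFiredAt: Record<string, number | undefined> = {};

  return function shouldFireAlert(rule: CooldownRule, now: number = Date.now()): boolean {
    const last = lastFiredAt[rule.id];
    const cooldownMs = rule.cooldownMinutes * 60_000;
    // Suppress the alert while the previous one is still within the cooldown window.
    if (last !== undefined && now - last < cooldownMs) return false;
    lastFiredAt[rule.id] = now;
    return true;
  };
}

const shouldFireAlert = createCooldownGate();
const diskRule: CooldownRule = { id: "rule-1", cooldownMinutes: 15 };
const start = Date.UTC(2026, 1, 14, 12, 0, 0);

console.log(shouldFireAlert(diskRule, start)); // true
console.log(shouldFireAlert(diskRule, start + 5 * 60_000)); // false
console.log(shouldFireAlert(diskRule, start + 15 * 60_000)); // true
```

A suppressed alert does not restart the window in this sketch, so a continuously violated rule fires once per cooldown period.
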
### MonitoringLogger (`src/lib/monitoring/monitoring-logger.ts`)

Structured logger that writes to the MonitoringLogs collection.

```typescript
const logger = createMonitoringLogger('source')
logger.info('message', { context })
logger.warn('message', { context })
logger.error('message', { context, requestId, userId, tenant })
```

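
One way such a logger might be structured is sketched below. To keep the example runnable without a database, it takes a second `sink` argument that receives each entry; that parameter, the `LogEntry` shape, and the `isAtLeast` helper are assumptions, whereas the design's `createMonitoringLogger` takes only the source and writes to the collection. The level order matches the `level` options of MonitoringLogs.

```typescript
type LogLevel = "debug" | "info" | "warn" | "error" | "fatal";

// Levels in ascending severity, matching the MonitoringLogs `level` options.
const LEVEL_ORDER: LogLevel[] = ["debug", "info", "warn", "error", "fatal"];

interface LogEntry {
  level: LogLevel;
  source: string;
  message: string;
  context: Record<string, unknown>;
  timestamp: string;
}

// Hypothetical sink: in the real service this would write to the collection.
type LogSink = (entry: LogEntry) => void;

// True when `level` is at least as severe as `minimum` (e.g. the "warn+" filter).
function isAtLeast(level: LogLevel, minimum: LogLevel): boolean {
  return LEVEL_ORDER.indexOf(level) >= LEVEL_ORDER.indexOf(minimum);
}

function createMonitoringLogger(source: string, sink: LogSink) {
  // Build one logging function per level; all of them share the same source.
  const at = (level: LogLevel) =>
    (message: string, context: Record<string, unknown> = {}): void => {
      sink({ level, source, message, context, timestamp: new Date().toISOString() });
    };
  return {
    debug: at("debug"),
    info: at("info"),
    warn: at("warn"),
    error: at("error"),
    fatal: at("fatal"),
  };
}

// Usage: collect entries in memory and pick the ones eligible for the SSE `log` event.
const entries: LogEntry[] = [];
const logger = createMonitoringLogger("email", (entry) => {
  entries.push(entry);
});
logger.info("Newsletter sent", { recipients: 120 });
logger.error("SMTP connection failed", { host: "smtp.example.com" });

const streamable = entries.filter((entry) => isAtLeast(entry.level, "warn"));
console.log(streamable.length); // 1
```
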
## Dashboard UI

Admin view: `/admin/monitoring` with 5 tabs.

### Tab 1: System Health
- Gauge widgets: CPU, RAM, disk, uptime (color-coded green/yellow/red)
- Trend charts: CPU 24h, memory 24h, load average (from MonitoringSnapshots)
- Live updates via SSE `health` events

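
The green/yellow/red color coding of the gauges could be derived with a small helper like the one below. The thresholds (70% and 90%) and the name `gaugeColor` are assumptions for illustration; the design does not specify them.

```typescript
type GaugeColor = "green" | "yellow" | "red";

// Map a usage percentage to a gauge color.
// The default thresholds are assumed values, not taken from the design.
function gaugeColor(usagePercent: number, warnAt: number = 70, criticalAt: number = 90): GaugeColor {
  if (usagePercent >= criticalAt) return "red";
  if (usagePercent >= warnAt) return "yellow";
  return "green";
}

console.log(gaugeColor(42)); // green
console.log(gaugeColor(75)); // yellow
console.log(gaugeColor(95)); // red
```
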
### Tab 2: Services
- Service cards with status badge (Online/Warning/Offline)
- Details: PID, memory, uptime, restarts (PM2 processes)
- PostgreSQL: connections, pool, latency
- Redis: memory, clients, ops/s
- OAuth: token status with expiry warning
- SMTP: last check, response time
- Cron: last execution times
- Live updates via SSE `service` events

### Tab 3: Performance
- Response time chart (avg, P95, P99)
- Error rate chart
- Requests per minute chart
- Time range filter (1h, 6h, 24h, 7d)

### Tab 4: Alerts
- Active/unacknowledged alerts (at the top, highlighted)
- Alert history table (filterable by severity and time range)
- Acknowledge button per alert
- Alert rules CRUD (MonitoringAlertRules)
- New alerts via SSE `alert` events

### Tab 5: Logs
- Log table (level, source, message, timestamp)
- Filters: level, source, time range, full-text search
- Expandable JSON context per entry
- Auto-scroll for new warn+ entries (via SSE `log` events)

## Access Control

| Action | Super-Admin | monitoring role |
|--------|-------------|-----------------|
| View dashboard | Yes | Yes |
| Edit alert rules | Yes | No |
| Acknowledge alerts | Yes | No |
| View logs | Yes | Yes |

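
The permission matrix above can be encoded as a lookup table. In this sketch the role and action identifiers (`"super-admin"`, `"viewDashboard"`, and so on) and the `can` helper are made up for the example; only the Yes/No values come from the table.

```typescript
// Identifiers are assumptions for this sketch; the Yes/No values mirror the matrix.
type Role = "super-admin" | "monitoring";
type Action = "viewDashboard" | "editAlertRules" | "acknowledgeAlerts" | "viewLogs";

const PERMISSIONS: Record<Action, Record<Role, boolean>> = {
  viewDashboard: { "super-admin": true, monitoring: true },
  editAlertRules: { "super-admin": true, monitoring: false },
  acknowledgeAlerts: { "super-admin": true, monitoring: false },
  viewLogs: { "super-admin": true, monitoring: true },
};

// Look up whether a role may perform an action.
function can(role: Role, action: Action): boolean {
  return PERMISSIONS[action][role];
}

console.log(can("monitoring", "viewLogs")); // true
console.log(can("monitoring", "acknowledgeAlerts")); // false
```

Typing the table as `Record<Action, Record<Role, boolean>>` means the compiler flags any action or role that is added later without a corresponding entry.
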
## Data Retention

| Collection | Retention | Env variable |
|------------|-----------|--------------|
| monitoring-snapshots | 7 days | `RETENTION_MONITORING_SNAPSHOTS_DAYS` |
| monitoring-alert-history | 90 days | `RETENTION_MONITORING_ALERTS_DAYS` |
| monitoring-logs | 30 days | `RETENTION_MONITORING_LOGS_DAYS` |

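
A cleanup job needs a cutoff date per collection. The sketch below shows one way to compute it from the env variables and defaults in the table; the helper names (`retentionDays`, `retentionCutoff`) and the choice to fall back to the default on missing or invalid values are assumptions, and the environment is passed in as a plain object so the logic can run anywhere.

```typescript
// Defaults from the retention table, keyed by env variable name.
const RETENTION_DEFAULTS = {
  RETENTION_MONITORING_SNAPSHOTS_DAYS: 7,
  RETENTION_MONITORING_ALERTS_DAYS: 90,
  RETENTION_MONITORING_LOGS_DAYS: 30,
} as const;

type RetentionKey = keyof typeof RETENTION_DEFAULTS;
type Env = Record<string, string | undefined>;

const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Read the retention period from the environment; fall back to the default
// when the variable is missing or not a positive integer (assumed behavior).
function retentionDays(key: RetentionKey, env: Env): number {
  const parsed = parseInt(env[key] ?? "", 10);
  return isFinite(parsed) && parsed > 0 ? parsed : RETENTION_DEFAULTS[key];
}

// Documents older than the returned date are eligible for deletion.
function retentionCutoff(key: RetentionKey, env: Env, now: Date = new Date()): Date {
  return new Date(now.getTime() - retentionDays(key, env) * MS_PER_DAY);
}

const referenceDate = new Date("2026-02-14T00:00:00Z");
const snapshotCutoff = retentionCutoff("RETENTION_MONITORING_SNAPSHOTS_DAYS", {}, referenceDate);
console.log(snapshotCutoff.toISOString()); // 2026-02-07T00:00:00.000Z
```

In a Node.js process the second argument would typically be `process.env`.
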
## File Structure

```
src/
├── collections/
│   ├── MonitoringAlertRules.ts
│   ├── MonitoringAlertHistory.ts
│   ├── MonitoringLogs.ts
│   └── MonitoringSnapshots.ts
├── lib/monitoring/
│   ├── monitoring-service.ts
│   ├── performance-tracker.ts
│   ├── snapshot-collector.ts
│   ├── alert-evaluator.ts
│   ├── monitoring-logger.ts
│   └── types.ts
├── app/(payload)/api/monitoring/
│   ├── health/route.ts
│   ├── services/route.ts
│   ├── performance/route.ts
│   ├── alerts/route.ts
│   ├── alerts/acknowledge/route.ts
│   ├── logs/route.ts
│   ├── snapshots/route.ts
│   └── stream/route.ts
├── components/admin/
│   ├── MonitoringDashboard.tsx
│   ├── MonitoringDashboard.scss
│   ├── MonitoringNavLinks.tsx
│   └── monitoring/
│       ├── SystemHealthTab.tsx
│       ├── ServicesTab.tsx
│       ├── PerformanceTab.tsx
│       ├── AlertsTab.tsx
│       ├── LogsTab.tsx
│       ├── GaugeWidget.tsx
│       ├── TrendChart.tsx
│       ├── StatusBadge.tsx
│       └── LogTable.tsx
```

## Dependencies

**New npm packages:** None. Uses only the Node.js `os` module and existing DB connections.

**Existing infrastructure that is reused:**
- `src/lib/alerting/alert-service.ts` - multi-channel alerting
- `src/lib/queue/` - BullMQ integration
- `src/lib/redis.ts` - Redis client
- `/api/community/stream` pattern - SSE implementation
- Data retention system - automatic cleanup