mirror of
https://github.com/complexcaresolutions/cms.c2sgmbh.git
synced 2026-03-17 16:14:12 +00:00
docs: add monitoring & alerting dashboard design
Event-driven architecture with SSE real-time updates, 5-tab dashboard (System Health, Services, Performance, Alerts, Logs), 4 new collections, and integration with existing alert-service.ts.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
parent
a0ef957a7f
commit
15bdd66eb6
1 changed files with 339 additions and 0 deletions
339
docs/plans/2026-02-14-monitoring-dashboard-design.md
Normal file
@@ -0,0 +1,339 @@
# Monitoring & Alerting Dashboard - Design

**Date:** 2026-02-14
**Status:** Approved

## Goal

A comprehensive monitoring dashboard in the Payload admin panel with:
- System health monitoring (CPU, RAM, disk)
- Service status (DB, Redis, queue, SMTP, OAuth)
- Performance tracking (response times, error rates)
- Configurable alerting (multi-channel: email, Slack, Discord)
- Structured log viewer
- Real-time updates via SSE

## Architecture: Event-Driven with SSE + REST

REST endpoints serve initial and historical data; an SSE stream delivers live updates.

```
┌─────────────────────────────────────────────────────────┐
│  Admin UI: /admin/monitoring                            │
│  ┌──────────┬──────────┬───────────┬────────┬────────┐  │
│  │ System   │ Services │Performance│ Alerts │ Logs   │  │
│  │ Health   │          │           │        │        │  │
│  └────┬─────┴────┬─────┴─────┬─────┴───┬────┴───┬────┘  │
│       │          │           │         │        │       │
│       ▼          ▼           ▼         ▼        ▼       │
│           SSE Stream (/api/monitoring/stream)           │
│           + REST Endpoints (/api/monitoring/*)          │
└───────────────────────┬─────────────────────────────────┘
                        │
┌───────────────────────▼─────────────────────────────────┐
│                    Backend Services                     │
│  ┌─────────────────┐ ┌──────────────┐ ┌──────────────┐  │
│  │MonitoringService│ │ Performance  │ │    Alert     │  │
│  │(Health, Services│ │   Tracker    │ │  Evaluator   │  │
│  │ OAuth, SMTP)    │ │ (Ring Buffer)│ │  (Rules DB)  │  │
│  └────────┬────────┘ └──────┬───────┘ └──────┬───────┘  │
│           │                 │                │          │
│  ┌────────▼─────────────────▼────────────────▼───────┐  │
│  │ SnapshotCollector (60s interval in queue worker)  │  │
│  └─────────────────────────┬─────────────────────────┘  │
│                            │                            │
│  ┌─────────────────────────▼─────────────────────────┐  │
│  │ MonitoringLogger (Structured Logs → Collection)   │  │
│  └───────────────────────────────────────────────────┘  │
└───────────────────────┬─────────────────────────────────┘
                        │
┌───────────────────────▼─────────────────────────────────┐
│                       Collections                       │
│  MonitoringSnapshots      │  MonitoringAlertRules       │
│  MonitoringAlertHistory   │  MonitoringLogs             │
└─────────────────────────────────────────────────────────┘
```

## Scope: Payload Stack + External Services

**Monitored:**
- Payload CMS process (PM2)
- Queue worker process (PM2)
- PostgreSQL + PgBouncer
- Redis
- SMTP connections
- OAuth token status (Meta, YouTube)
- Cron job health
- BullMQ queues

## Collections (4)

### MonitoringAlertRules

Alert rules that can be configured in the admin panel.

| Field | Type | Description |
|-------|------|-------------|
| name | text | Rule name |
| metric | text | Metric path (e.g. `system.memory.usagePercent`) |
| condition | select | gt, lt, eq, gte, lte |
| threshold | number | Threshold value |
| severity | select | warning, error, critical |
| channels | select (hasMany) | email, slack, discord |
| recipients.email | array | Email recipients |
| recipients.slackWebhook | text | Slack webhook URL |
| recipients.discordWebhook | text | Discord webhook URL |
| cooldownMinutes | number | Minimum interval between alerts, in minutes (default: 15) |
| enabled | checkbox | Active/inactive |
| tenant | relationship | Optional: tenant-specific |

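How a rule's `metric`, `condition`, and `threshold` fields might be evaluated can be sketched as follows. This is a minimal sketch, not the planned implementation: the function names `resolveMetric` and `conditionMet` are hypothetical, and it assumes metrics arrive as a nested object addressed by the dotted path.

```typescript
// Condition values come from the table above; the function names are hypothetical.
type Condition = 'gt' | 'lt' | 'eq' | 'gte' | 'lte'

// Walk a dotted path such as "system.memory.usagePercent" through a nested
// metrics object. Returns undefined if the path is missing or not numeric.
function resolveMetric(metrics: unknown, path: string): number | undefined {
  let current: unknown = metrics
  for (const key of path.split('.')) {
    if (typeof current !== 'object' || current === null) return undefined
    current = (current as Record<string, unknown>)[key]
  }
  return typeof current === 'number' ? current : undefined
}

// Compare a metric value against the rule's threshold.
function conditionMet(value: number, condition: Condition, threshold: number): boolean {
  switch (condition) {
    case 'gt':
      return value > threshold
    case 'lt':
      return value < threshold
    case 'eq':
      return value === threshold
    case 'gte':
      return value >= threshold
    case 'lte':
      return value <= threshold
  }
}
```

A rule whose metric path cannot be resolved would presumably be skipped rather than treated as a breach; the design does not specify this.
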
### MonitoringAlertHistory

Alert log (WORM, write once).

| Field | Type | Description |
|-------|------|-------------|
| rule | relationship | → MonitoringAlertRules |
| metric | text | Metric path |
| value | number | Current value |
| threshold | number | Threshold value |
| severity | select | warning, error, critical |
| message | text | Alert message |
| channelsSent | select (hasMany) | Channels the alert was sent to |
| resolvedAt | date | Time of resolution |
| acknowledgedBy | relationship | → Users |

### MonitoringLogs

Structured logs for business events.

| Field | Type | Description |
|-------|------|-------------|
| level | select | debug, info, warn, error, fatal |
| source | select | payload, queue-worker, cron, email, oauth, sync |
| message | text | Log message |
| context | json | Structured metadata |
| requestId | text | Correlation ID |
| userId | relationship | → Users |
| tenant | relationship | → Tenants |
| duration | number | Duration in ms |

### MonitoringSnapshots

Historical system metrics for trend charts.

| Field | Type | Description |
|-------|------|-------------|
| timestamp | date | Timestamp |
| system.cpuUsagePercent | number | CPU usage (%) |
| system.memoryUsedMB | number | RAM used (MB) |
| system.memoryTotalMB | number | RAM total (MB) |
| system.memoryUsagePercent | number | RAM usage (%) |
| system.diskUsedGB | number | Disk used (GB) |
| system.diskTotalGB | number | Disk total (GB) |
| system.diskUsagePercent | number | Disk usage (%) |
| system.loadAvg1 | number | Load average, 1 min |
| system.loadAvg5 | number | Load average, 5 min |
| system.uptime | number | Uptime in seconds |
| services.payload | json | { status, pid, memory, uptime, restarts } |
| services.queueWorker | json | { status, pid, memory, uptime, restarts } |
| services.postgresql | json | { status, connections, poolSize, latency } |
| services.pgbouncer | json | { status, activeConns, waitingClients } |
| services.redis | json | { status, memoryUsed, clients, opsPerSec } |
| external.smtp | json | { status, lastCheck, responseTime } |
| external.metaOAuth | json | { status, tokensExpiring, tokensExpired } |
| external.youtubeOAuth | json | { status, tokensExpiring, tokensExpired } |
| external.cronJobs | json | { lastRuns: { ... } } |
| performance.avgResponseTime | number | Average response time |
| performance.errorRate | number | Error rate (%) |
| performance.requestsPerMinute | number | Requests per minute |

## API Endpoints

| Method | Endpoint | Description | Auth |
|--------|----------|-------------|------|
| GET | `/api/monitoring/health` | System status (live) | Super-Admin / monitoring |
| GET | `/api/monitoring/services` | Service status | Super-Admin / monitoring |
| GET | `/api/monitoring/performance` | Performance metrics | Super-Admin / monitoring |
| GET | `/api/monitoring/alerts` | Alert history (paginated) | Super-Admin / monitoring |
| POST | `/api/monitoring/alerts/acknowledge` | Acknowledge an alert | Super-Admin |
| GET | `/api/monitoring/logs` | Logs (paginated, filterable) | Super-Admin / monitoring |
| GET | `/api/monitoring/snapshots` | Historical metrics | Super-Admin / monitoring |
| GET | `/api/monitoring/stream` | SSE real-time stream | Super-Admin / monitoring |

### SSE Stream Events

| Event | Interval | Data |
|-------|----------|------|
| `health` | 10s | System metrics (CPU, RAM, disk) |
| `service` | On change | Service status updates |
| `alert` | Immediately | New alerts |
| `log` | Immediately (warn+) | New log entries (level >= warn) |
| `performance` | 30s | Performance metrics |

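Each of the events above would be written to the stream as a named Server-Sent Events frame. The following is a minimal sketch of that serialization; the helper name `formatSseEvent` and the payload shape are assumptions, only the event names come from the table.

```typescript
// Event names taken from the SSE table above.
type MonitoringEventName = 'health' | 'service' | 'alert' | 'log' | 'performance'

// Serialize one event into the text/event-stream wire format:
// an "event:" line, a single "data:" line, and a terminating blank line.
// JSON.stringify without an indent argument emits no line breaks,
// so the payload always fits on one data line.
function formatSseEvent(event: MonitoringEventName, payload: unknown): string {
  return `event: ${event}\ndata: ${JSON.stringify(payload)}\n\n`
}
```

On the client, a browser `EventSource` opened on `/api/monitoring/stream` could subscribe per tab with `addEventListener('health', ...)` and parse `event.data` as JSON.
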
## Backend Services

### MonitoringService (`src/lib/monitoring/monitoring-service.ts`)

Central service for metric collection.

```typescript
collectMetrics(): Promise<SystemMetrics>
checkSystemHealth(): SystemHealth        // CPU, RAM, disk, uptime via os module
checkPostgresql(): ServiceStatus         // pg_stat_activity, latency test
checkPgBouncer(): ServiceStatus          // SHOW POOLS via PgBouncer admin
checkRedis(): ServiceStatus              // INFO command
checkSmtp(): ServiceStatus               // SMTP EHLO check
checkOAuthTokens(): OAuthStatus          // check SocialAccounts token expiry
checkCronJobs(): CronStatus              // last execution times
checkQueues(): QueueStatus               // BullMQ getJobCounts()
```

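As an illustration of the `os`-based part of `checkSystemHealth()`, here is a minimal sketch covering memory, load average, and uptime. The function and interface names are hypothetical; field names mirror the MonitoringSnapshots collection. Note that the `os` module does not expose disk usage, so that value would need another source (for example `fs.statfs` in recent Node.js versions), and CPU usage in percent requires comparing two `os.cpus()` samples; both are left out here.

```typescript
import * as os from 'node:os'

// Subset of the system metrics that node:os can provide directly.
interface SystemHealthSample {
  memoryTotalMB: number
  memoryUsedMB: number
  memoryUsagePercent: number
  loadAvg1: number
  loadAvg5: number
  uptime: number // seconds
}

function sampleSystemHealth(): SystemHealthSample {
  const totalBytes = os.totalmem()
  const usedBytes = totalBytes - os.freemem()
  const load = os.loadavg() // [1 min, 5 min, 15 min]; all zeros on Windows
  const toMB = (bytes: number): number => Math.round(bytes / 1024 / 1024)

  return {
    memoryTotalMB: toMB(totalBytes),
    memoryUsedMB: toMB(usedBytes),
    memoryUsagePercent: (usedBytes / totalBytes) * 100,
    loadAvg1: load[0] ?? 0,
    loadAvg5: load[1] ?? 0,
    uptime: Math.floor(os.uptime()),
  }
}
```
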
### PerformanceTracker (`src/lib/monitoring/performance-tracker.ts`)

Lightweight request metrics in an in-memory ring buffer.

```typescript
trackRequest(method, path, statusCode, duration): void
getMetrics(period: '1h' | '6h' | '24h' | '7d'): PerformanceMetrics
```

Integration: Payload `beforeOperation` / `afterOperation` hooks.

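The ring buffer behind `trackRequest` / `getMetrics` could look roughly like this. It is a sketch under assumptions: the names `createRingBuffer` and `summarize` are hypothetical, the capacity is chosen by the caller, and a response with status 500 or higher is counted as an error, which the design does not define.

```typescript
interface RequestSample {
  timestamp: number // epoch milliseconds
  statusCode: number
  duration: number // milliseconds
}

// Fixed-capacity buffer: once full, each new sample overwrites the oldest one,
// so memory use stays bounded regardless of traffic.
function createRingBuffer(capacity: number) {
  const slots: RequestSample[] = []
  let next = 0 // slot the next write goes to

  return {
    push(sample: RequestSample): void {
      if (slots.length < capacity) {
        slots.push(sample)
      } else {
        slots[next] = sample
      }
      next = (next + 1) % capacity
    },
    // Samples recorded at or after `since` (epoch ms); order is not guaranteed.
    since(since: number): RequestSample[] {
      return slots.filter((s) => s.timestamp >= since)
    },
  }
}

// Aggregate a window of samples. Assumption: status >= 500 counts as an error.
function summarize(samples: RequestSample[]): { avgResponseTime: number; errorRate: number } {
  const total = samples.length
  if (total === 0) return { avgResponseTime: 0, errorRate: 0 }
  const durationSum = samples.reduce((sum, s) => sum + s.duration, 0)
  const errors = samples.filter((s) => s.statusCode >= 500).length
  return { avgResponseTime: durationSum / total, errorRate: (errors / total) * 100 }
}
```

A `getMetrics('1h')` call would then amount to `summarize(buffer.since(Date.now() - 3_600_000))`.
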
### SnapshotCollector (`src/lib/monitoring/snapshot-collector.ts`)

Periodic metric collection; runs in the queue worker PM2 process.

```typescript
startCollector(): void         // setInterval(60_000)
stopCollector(): void          // clearInterval, SIGTERM handler
saveSnapshot(metrics): void    // → MonitoringSnapshots collection
```

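The start/stop lifecycle might be sketched as below. Unlike the signatures above, this version takes the collection callback and the interval as parameters so it stays self-contained; that parameterization, and the helper `isCollectorRunning`, are assumptions made for illustration.

```typescript
type CollectFn = () => Promise<void>

let timer: ReturnType<typeof setInterval> | undefined

// Start the periodic collection. Calling it twice does not create a second timer.
function startCollector(collect: CollectFn, intervalMs = 60_000): void {
  if (timer !== undefined) return
  timer = setInterval(() => {
    // A failed run is logged and must not stop the schedule or crash the worker.
    collect().catch((err) => console.error('snapshot collection failed', err))
  }, intervalMs)
}

// Stop the periodic collection; safe to call when nothing is running.
function stopCollector(): void {
  if (timer === undefined) return
  clearInterval(timer)
  timer = undefined
}

function isCollectorRunning(): boolean {
  return timer !== undefined
}
```

In the worker process, `stopCollector` would typically be registered as a `SIGTERM` handler so that PM2 restarts do not leave a pending timer behind.
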
### AlertEvaluator (`src/lib/monitoring/alert-evaluator.ts`)

Checks metrics against MonitoringAlertRules.

```typescript
evaluateRules(metrics): Promise<Alert[]>
shouldFireAlert(rule, value): boolean    // cooldown + deduplication
dispatchAlert(alert): Promise<void>      // → existing alert-service.ts
```

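The cooldown part of `shouldFireAlert` can be illustrated as follows. This is a sketch with a different signature than the one listed above: it receives the already-evaluated condition and the current time explicitly, and it keeps the last-fired timestamps in an in-memory `Map`, which is an assumption (state shared across processes would need Redis or the alert history collection).

```typescript
interface CooldownRule {
  id: string
  cooldownMinutes: number // from MonitoringAlertRules, default 15
}

// Last time (epoch ms) an alert was fired per rule id.
const lastFiredAt = new Map<string, number>()

// Fire only if the condition holds and the rule is not inside its cooldown window.
function shouldFireAlert(rule: CooldownRule, conditionMet: boolean, now: number): boolean {
  if (!conditionMet) return false
  const last = lastFiredAt.get(rule.id)
  if (last !== undefined && now - last < rule.cooldownMinutes * 60_000) return false
  lastFiredAt.set(rule.id, now)
  return true
}
```
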
### MonitoringLogger (`src/lib/monitoring/monitoring-logger.ts`)

Structured logger that writes to the MonitoringLogs collection.

```typescript
const logger = createMonitoringLogger('source')
logger.info('message', { context })
logger.warn('message', { context })
logger.error('message', { context, requestId, userId, tenant })
```

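One possible shape for the factory is sketched below. To keep the example self-contained it takes a `sink` callback as a second argument instead of writing to the MonitoringLogs collection directly; that parameter and the `LogEntry` type are assumptions, not part of the design.

```typescript
type LogLevel = 'debug' | 'info' | 'warn' | 'error' | 'fatal'

interface LogEntry {
  level: LogLevel
  source: string
  message: string
  context: Record<string, unknown>
  timestamp: string // ISO 8601
}

// Hypothetical persistence hook; in the design this would write to MonitoringLogs.
type LogSink = (entry: LogEntry) => void

function createMonitoringLogger(source: string, sink: LogSink) {
  // Build one logging method per level, all bound to the same source.
  const write =
    (level: LogLevel) =>
    (message: string, context: Record<string, unknown> = {}): void => {
      sink({ level, source, message, context, timestamp: new Date().toISOString() })
    }

  return {
    debug: write('debug'),
    info: write('info'),
    warn: write('warn'),
    error: write('error'),
    fatal: write('fatal'),
  }
}
```
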
## Dashboard UI

Admin view: `/admin/monitoring` with 5 tabs.

### Tab 1: System Health
- Gauge widgets: CPU, RAM, disk, uptime (color-coded green/yellow/red)
- Trend charts: CPU 24h, memory 24h, load average (from MonitoringSnapshots)
- Live updates via SSE `health` events

### Tab 2: Services
- Service cards with status badge (Online/Warning/Offline)
- Details: PID, memory, uptime, restarts (PM2 processes)
- PostgreSQL: connections, pool, latency
- Redis: memory, clients, ops/s
- OAuth: token status with expiry warning
- SMTP: last check, response time
- Cron: last execution times
- Live updates via SSE `service` events

### Tab 3: Performance
- Response time chart (avg, P95, P99)
- Error rate chart
- Requests/minute chart
- Time range filter (1h, 6h, 24h, 7d)

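The P95 and P99 values for the response time chart could be derived from the tracked request durations. The design does not name a percentile method; the sketch below assumes the nearest-rank definition, and the function name `percentile` is hypothetical.

```typescript
// Nearest-rank percentile: the smallest value such that at least p percent
// of all values are less than or equal to it. Returns 0 for an empty input.
function percentile(values: number[], p: number): number {
  if (values.length === 0) return 0
  const sorted = [...values].sort((a, b) => a - b)
  // Multiply before dividing to keep the rank computation in integers where possible.
  const rank = Math.ceil((p * sorted.length) / 100)
  const index = Math.min(sorted.length - 1, Math.max(0, rank - 1))
  return sorted[index] ?? 0
}
```
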
### Tab 4: Alerts
- Active/unacknowledged alerts (at the top, highlighted)
- Alert history table (filterable by severity and time range)
- Acknowledge button per alert
- Alert rules CRUD (MonitoringAlertRules)
- New alerts via SSE `alert` events

### Tab 5: Logs
- Log table (level, source, message, timestamp)
- Filters: level, source, time range, full-text search
- Expandable JSON context per entry
- Auto-scroll for new warn+ entries (via SSE `log` events)

## Access Control

| Action | Super-Admin | monitoring role |
|--------|-------------|-----------------|
| View dashboard | Yes | Yes |
| Edit alert rules | Yes | No |
| Acknowledge alerts | Yes | No |
| View logs | Yes | Yes |

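The matrix above maps naturally onto a small lookup that route handlers could share. This is only a sketch: the role identifiers `'super-admin'` and `'monitoring'`, the action names, and the function `canPerform` are hypothetical and not taken from the existing codebase.

```typescript
// Hypothetical identifiers for the two roles and four actions in the table.
type Role = 'super-admin' | 'monitoring'
type Action = 'viewDashboard' | 'editAlertRules' | 'acknowledgeAlerts' | 'viewLogs'

// One entry per table row: which roles may perform the action.
const permissions: Record<Action, readonly Role[]> = {
  viewDashboard: ['super-admin', 'monitoring'],
  editAlertRules: ['super-admin'],
  acknowledgeAlerts: ['super-admin'],
  viewLogs: ['super-admin', 'monitoring'],
}

function canPerform(role: Role, action: Action): boolean {
  return permissions[action].includes(role)
}
```
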
## Data Retention

| Collection | Retention | Env variable |
|------------|-----------|--------------|
| monitoring-snapshots | 7 days | `RETENTION_MONITORING_SNAPSHOTS_DAYS` |
| monitoring-alert-history | 90 days | `RETENTION_MONITORING_ALERTS_DAYS` |
| monitoring-logs | 30 days | `RETENTION_MONITORING_LOGS_DAYS` |

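Resolving these settings might look like the sketch below. It assumes the values in the table act as defaults when a variable is unset or invalid, which the design implies but does not state; the function names are hypothetical, and the environment is passed in as a plain object (for example `process.env`) to keep the example self-contained.

```typescript
// Variable names and day counts from the table above, treated as defaults (assumption).
const retentionDefaults = {
  RETENTION_MONITORING_SNAPSHOTS_DAYS: 7,
  RETENTION_MONITORING_ALERTS_DAYS: 90,
  RETENTION_MONITORING_LOGS_DAYS: 30,
} as const

type RetentionVar = keyof typeof retentionDefaults

// Read the retention period in days, falling back to the default
// when the variable is missing or not a positive integer.
function retentionDays(name: RetentionVar, env: Record<string, string | undefined>): number {
  const parsed = Number.parseInt(env[name] ?? '', 10)
  return Number.isInteger(parsed) && parsed > 0 ? parsed : retentionDefaults[name]
}

// Documents older than the returned date are eligible for cleanup.
function retentionCutoff(name: RetentionVar, env: Record<string, string | undefined>, now: Date): Date {
  return new Date(now.getTime() - retentionDays(name, env) * 24 * 60 * 60 * 1000)
}
```
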
## File Structure

```
src/
├── collections/
│   ├── MonitoringAlertRules.ts
│   ├── MonitoringAlertHistory.ts
│   ├── MonitoringLogs.ts
│   └── MonitoringSnapshots.ts
├── lib/monitoring/
│   ├── monitoring-service.ts
│   ├── performance-tracker.ts
│   ├── snapshot-collector.ts
│   ├── alert-evaluator.ts
│   ├── monitoring-logger.ts
│   └── types.ts
├── app/(payload)/api/monitoring/
│   ├── health/route.ts
│   ├── services/route.ts
│   ├── performance/route.ts
│   ├── alerts/route.ts
│   ├── alerts/acknowledge/route.ts
│   ├── logs/route.ts
│   ├── snapshots/route.ts
│   └── stream/route.ts
├── components/admin/
│   ├── MonitoringDashboard.tsx
│   ├── MonitoringDashboard.scss
│   ├── MonitoringNavLinks.tsx
│   └── monitoring/
│       ├── SystemHealthTab.tsx
│       ├── ServicesTab.tsx
│       ├── PerformanceTab.tsx
│       ├── AlertsTab.tsx
│       ├── LogsTab.tsx
│       ├── GaugeWidget.tsx
│       ├── TrendChart.tsx
│       ├── StatusBadge.tsx
│       └── LogTable.tsx
```

## Dependencies

**New npm packages:** None. Uses only the Node.js `os` module and existing DB connections.

**Existing infrastructure that is reused:**
- `src/lib/alerting/alert-service.ts` - multi-channel alerting
- `src/lib/queue/` - BullMQ integration
- `src/lib/redis.ts` - Redis client
- `/api/community/stream` pattern - SSE implementation
- Data retention system - automatic cleanup