mirror of
https://github.com/complexcaresolutions/cms.c2sgmbh.git
synced 2026-03-17 22:04:10 +00:00
# Monitoring & Alerting Dashboard - Implementation Plan

> **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.

**Goal:** Build a real-time Monitoring & Alerting Dashboard in the Payload Admin Panel with system health checks, service monitoring, performance tracking, configurable alerts, and structured log viewing.

**Architecture:** Event-driven with SSE real-time stream + REST endpoints. A SnapshotCollector runs in the queue-worker PM2 process every 60s, collecting metrics from OS, PostgreSQL, PgBouncer, Redis, SMTP, OAuth, and BullMQ. An AlertEvaluator checks metrics against configurable rules stored in a Payload Collection. The dashboard is a 5-tab Custom Admin View using the same patterns as YouTubeAnalyticsDashboard.

**Tech Stack:** Payload CMS 3.76.1, Next.js 16 (App Router), React 19, TypeScript, Node.js `os` module, SSE via ReadableStream, BullMQ, PostgreSQL, Redis.

**Design Doc:** `docs/plans/2026-02-14-monitoring-dashboard-design.md`

---

## Phase 1: Types & Collections (Foundation)

### Task 1: Shared Types

**Files:**
- Create: `src/lib/monitoring/types.ts`
- Test: `tests/unit/monitoring/types.test.ts`

**Step 1: Write types test**

```typescript
// tests/unit/monitoring/types.test.ts
import { describe, it, expect } from 'vitest'
import type { SystemMetrics, MonitoringEvent } from '@/lib/monitoring/types'

describe('Monitoring Types', () => {
  it('SystemMetrics has all required sections', () => {
    const metrics: SystemMetrics = {
      timestamp: new Date().toISOString(),
      system: {
        cpuUsagePercent: 23,
        memoryUsedMB: 4200,
        memoryTotalMB: 8192,
        memoryUsagePercent: 51.3,
        diskUsedGB: 30,
        diskTotalGB: 50,
        diskUsagePercent: 60,
        loadAvg1: 0.5,
        loadAvg5: 0.8,
        uptime: 1209600,
      },
      services: {
        payload: { status: 'online', pid: 1234, memoryMB: 512, uptimeSeconds: 86400, restarts: 0 },
        queueWorker: { status: 'online', pid: 5678, memoryMB: 256, uptimeSeconds: 86400, restarts: 0 },
        postgresql: { status: 'online', connections: 12, maxConnections: 50, latencyMs: 2 },
        pgbouncer: { status: 'online', activeConnections: 8, waitingClients: 0, poolSize: 20 },
        redis: { status: 'online', memoryUsedMB: 48, connectedClients: 5, opsPerSec: 120 },
      },
      external: {
        smtp: { status: 'online', lastCheck: new Date().toISOString(), responseTimeMs: 180 },
        metaOAuth: { status: 'ok', tokensTotal: 2, tokensExpiringSoon: 1, tokensExpired: 0 },
        youtubeOAuth: { status: 'ok', tokensTotal: 3, tokensExpiringSoon: 0, tokensExpired: 0 },
        cronJobs: {
          communitySync: { lastRun: new Date().toISOString(), status: 'ok' },
          tokenRefresh: { lastRun: new Date().toISOString(), status: 'ok' },
          youtubeSync: { lastRun: new Date().toISOString(), status: 'ok' },
        },
      },
      performance: { avgResponseTimeMs: 120, p95ResponseTimeMs: 350, p99ResponseTimeMs: 800, errorRate: 0.02, requestsPerMinute: 45 },
    }
    expect(metrics.system.cpuUsagePercent).toBe(23)
    expect(metrics.services.payload.status).toBe('online')
    expect(metrics.external.smtp.status).toBe('online')
    expect(metrics.performance.avgResponseTimeMs).toBe(120)
  })

  it('MonitoringEvent types are exhaustive', () => {
    const events: MonitoringEvent['type'][] = ['health', 'service', 'alert', 'log', 'performance']
    expect(events).toHaveLength(5)
  })
})
```

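**Step 2: Run test, expect FAIL** (types don't exist yet). Vitest erases `import type` statements when it transpiles, so if the runtime test passes anyway, confirm the failure with a type check.

```bash
pnpm test tests/unit/monitoring/types.test.ts
```
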
**Step 3: Implement types**

Create `src/lib/monitoring/types.ts` with all interfaces:
- `SystemHealth` (CPU, RAM, Disk, Load, Uptime)
- `ProcessStatus` (status, pid, memoryMB, uptimeSeconds, restarts)
- `PostgresqlStatus`, `PgBouncerStatus`, `RedisStatus`
- `SmtpStatus`, `OAuthTokenStatus`, `CronJobStatus`
- `ServiceStatuses` (all services combined)
- `ExternalStatuses` (SMTP, OAuth, Cron)
- `PerformanceMetrics` (avg, p95, p99, errorRate, rpm)
- `SystemMetrics` (the full snapshot object)
- `MonitoringEvent` (discriminated union for SSE events: health | service | alert | log | performance)
- `AlertCondition` = 'gt' | 'lt' | 'eq' | 'gte' | 'lte'
- `AlertSeverity` = 'warning' | 'error' | 'critical'
- `LogLevel` = 'debug' | 'info' | 'warn' | 'error' | 'fatal'
- `LogSource` = 'payload' | 'queue-worker' | 'cron' | 'email' | 'oauth' | 'sync'

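As a hedged illustration of the list above, part of `types.ts` might look like the following. The field names of `SystemHealth` and `PerformanceMetrics` come from the test in Step 1; the `data` shape of each event and the `describeEvent` helper are assumptions made for this sketch, and the real module would `export` each declaration.

```typescript
type AlertSeverity = 'warning' | 'error' | 'critical'
type LogLevel = 'debug' | 'info' | 'warn' | 'error' | 'fatal'

interface SystemHealth {
  cpuUsagePercent: number
  memoryUsedMB: number
  memoryTotalMB: number
  memoryUsagePercent: number
  diskUsedGB: number
  diskTotalGB: number
  diskUsagePercent: number
  loadAvg1: number
  loadAvg5: number
  uptime: number
}

interface PerformanceMetrics {
  avgResponseTimeMs: number
  p95ResponseTimeMs: number
  p99ResponseTimeMs: number
  errorRate: number
  requestsPerMinute: number
}

// Discriminated union: the `type` tag selects the shape of `data` (shapes assumed here).
type MonitoringEvent =
  | { type: 'health'; data: SystemHealth }
  | { type: 'service'; data: Record<string, unknown> }
  | { type: 'alert'; data: { severity: AlertSeverity; message: string } }
  | { type: 'log'; data: { level: LogLevel; message: string } }
  | { type: 'performance'; data: PerformanceMetrics }

// Hypothetical helper: the `never` assignment stops compiling if a new event type is added
// without a matching case, which is a stronger exhaustiveness check than counting strings.
function describeEvent(event: MonitoringEvent): string {
  switch (event.type) {
    case 'health':
      return `cpu ${event.data.cpuUsagePercent}%`
    case 'service':
      return 'service update'
    case 'alert':
      return `${event.data.severity}: ${event.data.message}`
    case 'log':
      return `${event.data.level}: ${event.data.message}`
    case 'performance':
      return `avg ${event.data.avgResponseTimeMs}ms`
    default: {
      const unreachable: never = event
      return unreachable
    }
  }
}
```
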
**Step 4: Run test, expect PASS**

```bash
pnpm test tests/unit/monitoring/types.test.ts
```

**Step 5: Commit**

```bash
git add src/lib/monitoring/types.ts tests/unit/monitoring/types.test.ts
git commit -m "feat(monitoring): add shared types for monitoring system"
```

---

### Task 2: MonitoringSnapshots Collection

**Files:**
- Create: `src/collections/MonitoringSnapshots.ts`
- Modify: `src/payload.config.ts` (add to collections array)
- Modify: `src/lib/access/index.ts` (add monitoring access)

**Step 1: Add monitoring access control**

In `src/lib/access/index.ts`, add:
```typescript
export const monitoringAccess = {
  read: superAdminOnly,
  create: superAdminOnly, // Only system can create
  update: denyAll, // Immutable snapshots
  delete: superAdminOnly, // Retention cleanup only
}
```

**Step 2: Create MonitoringSnapshots collection**

Pattern: Follow `AuditLogs.ts` structure. Use `admin.group: 'Monitoring'`. Fields use `type: 'group'` for nested objects and `type: 'json'` for service/external status objects.

Key fields:
- `timestamp` (date, required, indexed)
- `system` group: cpuUsagePercent, memoryUsedMB, memoryTotalMB, memoryUsagePercent, diskUsedGB, diskTotalGB, diskUsagePercent, loadAvg1, loadAvg5, uptime (all `type: 'number'`)
- `services` group: payload, queueWorker, postgresql, pgbouncer, redis (all `type: 'json'`)
- `external` group: smtp, metaOAuth, youtubeOAuth, cronJobs (all `type: 'json'`)
- `performance` group: avgResponseTimeMs, p95ResponseTimeMs, p99ResponseTimeMs, errorRate, requestsPerMinute (all `type: 'number'`)

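The key fields above could be expressed roughly as follows. This is only a sketch: `LeafField`, `GroupField`, and the `leaves` helper are local stand-ins so the snippet runs on its own, whereas the real collection file would type the whole object with Payload's `CollectionConfig`.

```typescript
// Local stand-ins for Payload's field types, used only to keep this sketch self-contained.
type LeafField = { name: string; type: 'date' | 'number' | 'json'; required?: boolean; index?: boolean }
type GroupField = { name: string; type: 'group'; fields: LeafField[] }

// Hypothetical helper: build several same-typed fields from a list of names.
const leaves = (type: 'number' | 'json', names: string[]): LeafField[] =>
  names.map((name) => ({ name, type }))

const monitoringSnapshotFields: Array<LeafField | GroupField> = [
  { name: 'timestamp', type: 'date', required: true, index: true },
  {
    name: 'system',
    type: 'group',
    fields: leaves('number', [
      'cpuUsagePercent', 'memoryUsedMB', 'memoryTotalMB', 'memoryUsagePercent',
      'diskUsedGB', 'diskTotalGB', 'diskUsagePercent', 'loadAvg1', 'loadAvg5', 'uptime',
    ]),
  },
  {
    name: 'services',
    type: 'group',
    fields: leaves('json', ['payload', 'queueWorker', 'postgresql', 'pgbouncer', 'redis']),
  },
  {
    name: 'external',
    type: 'group',
    fields: leaves('json', ['smtp', 'metaOAuth', 'youtubeOAuth', 'cronJobs']),
  },
  {
    name: 'performance',
    type: 'group',
    fields: leaves('number', [
      'avgResponseTimeMs', 'p95ResponseTimeMs', 'p99ResponseTimeMs', 'errorRate', 'requestsPerMinute',
    ]),
  },
]
```

Generating the leaf fields from name lists keeps the collection in step with the `SystemMetrics` shape from Task 1 and makes a missing field easy to spot in review.
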
**Step 3: Register in payload.config.ts**

Add `MonitoringSnapshots` to the `collections` array (import + add).

**Step 4: Commit**

```bash
git add src/collections/MonitoringSnapshots.ts src/payload.config.ts src/lib/access/index.ts
git commit -m "feat(monitoring): add MonitoringSnapshots collection"
```

---

### Task 3: MonitoringLogs Collection

**Files:**
- Create: `src/collections/MonitoringLogs.ts`
- Modify: `src/payload.config.ts`

**Step 1: Create collection**

Pattern: Like `AuditLogs.ts`: WORM (read + create only, no update/delete via UI).

Key fields:
- `level` (select: debug, info, warn, error, fatal; required)
- `source` (select: payload, queue-worker, cron, email, oauth, sync; required)
- `message` (text, required)
- `context` (json)
- `requestId` (text)
- `userId` (relationship → users)
- `tenant` (relationship → tenants)
- `duration` (number, min: 0)

Admin config: `group: 'Monitoring'`, `defaultColumns: ['level', 'source', 'message', 'createdAt']`, `useAsTitle: 'message'`.

**Step 2: Register in payload.config.ts**

**Step 3: Commit**

```bash
git add src/collections/MonitoringLogs.ts src/payload.config.ts
git commit -m "feat(monitoring): add MonitoringLogs collection"
```

---

### Task 4: MonitoringAlertRules Collection

**Files:**
- Create: `src/collections/MonitoringAlertRules.ts`
- Modify: `src/payload.config.ts`

**Step 1: Create collection**

Access: Super-admin full CRUD.

Key fields:
- `name` (text, required)
- `metric` (text, required, e.g. `system.cpuUsagePercent`)
- `condition` (select: gt, lt, eq, gte, lte; required)
- `threshold` (number, required)
- `severity` (select: warning, error, critical; required)
- `channels` (select, hasMany: true; options: email, slack, discord; required)
- `recipients` group:
  - `emails` (array of text fields)
  - `slackWebhook` (text)
  - `discordWebhook` (text)
- `cooldownMinutes` (number, defaultValue: 15, min: 1)
- `enabled` (checkbox, defaultValue: true)
- `tenant` (relationship → tenants, optional)

Admin: `group: 'Monitoring'`, `useAsTitle: 'name'`.

**Step 2: Register in payload.config.ts**

**Step 3: Commit**

```bash
git add src/collections/MonitoringAlertRules.ts src/payload.config.ts
git commit -m "feat(monitoring): add MonitoringAlertRules collection"
```

---

### Task 5: MonitoringAlertHistory Collection

**Files:**
- Create: `src/collections/MonitoringAlertHistory.ts`
- Modify: `src/payload.config.ts`

**Step 1: Create collection**

Access: Read for super-admin, create for system, update only `resolvedAt` and `acknowledgedBy`.

Key fields:
- `rule` (relationship → monitoring-alert-rules)
- `metric` (text, required)
- `value` (number, required)
- `threshold` (number, required)
- `severity` (select: warning, error, critical; required)
- `message` (text, required)
- `channelsSent` (select, hasMany: email, slack, discord)
- `resolvedAt` (date, optional)
- `acknowledgedBy` (relationship → users, optional)

Admin: `group: 'Monitoring'`, `useAsTitle: 'message'`, `defaultColumns: ['severity', 'metric', 'message', 'createdAt', 'acknowledgedBy']`.

**Step 2: Register in payload.config.ts**

**Step 3: Commit**

```bash
git add src/collections/MonitoringAlertHistory.ts src/payload.config.ts
git commit -m "feat(monitoring): add MonitoringAlertHistory collection"
```

---

### Task 6: Database Migration

**Step 1: Create migration**

```bash
pnpm payload migrate:create
```

**CRITICAL:** The migration MUST include `payload_locked_documents_rels` columns for ALL 4 new collections:

```sql
ALTER TABLE "payload_locked_documents_rels"
  ADD COLUMN IF NOT EXISTS "monitoring_snapshots_id" integer REFERENCES monitoring_snapshots(id) ON DELETE CASCADE;
ALTER TABLE "payload_locked_documents_rels"
  ADD COLUMN IF NOT EXISTS "monitoring_logs_id" integer REFERENCES monitoring_logs(id) ON DELETE CASCADE;
ALTER TABLE "payload_locked_documents_rels"
  ADD COLUMN IF NOT EXISTS "monitoring_alert_rules_id" integer REFERENCES monitoring_alert_rules(id) ON DELETE CASCADE;
ALTER TABLE "payload_locked_documents_rels"
  ADD COLUMN IF NOT EXISTS "monitoring_alert_history_id" integer REFERENCES monitoring_alert_history(id) ON DELETE CASCADE;

CREATE INDEX IF NOT EXISTS "payload_locked_documents_rels_monitoring_snapshots_idx" ON "payload_locked_documents_rels" ("monitoring_snapshots_id");
CREATE INDEX IF NOT EXISTS "payload_locked_documents_rels_monitoring_logs_idx" ON "payload_locked_documents_rels" ("monitoring_logs_id");
CREATE INDEX IF NOT EXISTS "payload_locked_documents_rels_monitoring_alert_rules_idx" ON "payload_locked_documents_rels" ("monitoring_alert_rules_id");
CREATE INDEX IF NOT EXISTS "payload_locked_documents_rels_monitoring_alert_history_idx" ON "payload_locked_documents_rels" ("monitoring_alert_history_id");
```

**Step 2: Review generated migration, add locked_documents_rels columns if missing**

**Step 3: Run migration via direct DB connection**

```bash
./scripts/db-direct.sh migrate
```

**Step 4: Generate import map**

```bash
pnpm payload generate:importmap
```

**Step 5: Commit**

```bash
git add src/migrations/ src/app/\(payload\)/importMap.js
git commit -m "feat(monitoring): add database migration for 4 monitoring collections"
```

---

## Phase 2: Backend Services

### Task 7: MonitoringService - System Health

**Files:**
- Create: `src/lib/monitoring/monitoring-service.ts`
- Test: `tests/unit/monitoring/monitoring-service.test.ts`

**Step 1: Write test for checkSystemHealth()**

```typescript
import { describe, it, expect } from 'vitest'
import { checkSystemHealth } from '@/lib/monitoring/monitoring-service'

describe('MonitoringService', () => {
  describe('checkSystemHealth', () => {
    it('returns CPU, memory, disk, load, and uptime', async () => {
      const health = await checkSystemHealth()
      expect(health.cpuUsagePercent).toBeGreaterThanOrEqual(0)
      expect(health.cpuUsagePercent).toBeLessThanOrEqual(100)
      expect(health.memoryTotalMB).toBeGreaterThan(0)
      expect(health.memoryUsedMB).toBeGreaterThan(0)
      expect(health.memoryUsagePercent).toBeGreaterThanOrEqual(0)
      expect(health.diskTotalGB).toBeGreaterThan(0)
      expect(health.uptime).toBeGreaterThan(0)
      expect(health.loadAvg1).toBeGreaterThanOrEqual(0)
    })
  })
})
```

**Step 2: Run test, expect FAIL**

**Step 3: Implement checkSystemHealth()**

Use Node.js `os` module: `os.cpus()`, `os.totalmem()`, `os.freemem()`, `os.loadavg()`, `os.uptime()`.
For disk: use `child_process.execSync('df -B1 / | tail -1')` to get disk usage (Linux-only, which is fine since production is Linux).
For CPU: sample `/proc/stat` twice with 100ms delay to calculate usage percentage.

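The CPU part of Step 3 can be sketched as two pure helpers: one parses the aggregate `cpu` line of `/proc/stat`, the other turns two samples into a percentage. The names `parseProcStat` and `cpuUsagePercent`, and the choice to count `iowait` as idle time, are assumptions for illustration; reading the file twice 100ms apart is left to the caller.

```typescript
interface CpuSample {
  idle: number
  total: number
}

// Parse the aggregate "cpu" line (not "cpu0", "cpu1", ...) of /proc/stat.
function parseProcStat(content: string): CpuSample {
  const line = content.split('\n').find((l) => l.startsWith('cpu '))
  if (!line) throw new Error('no aggregate cpu line found')
  // Columns after the label: user nice system idle iowait irq softirq steal.
  // Guest columns are skipped because the kernel already includes them in user/nice.
  const fields = line.trim().split(/\s+/).slice(1, 9).map(Number)
  const idle = (fields[3] ?? 0) + (fields[4] ?? 0) // assumption: iowait counts as idle
  const total = fields.reduce((sum, value) => sum + value, 0)
  return { idle, total }
}

// Usage between two samples: the share of elapsed ticks that was not idle, one decimal place.
function cpuUsagePercent(before: CpuSample, after: CpuSample): number {
  const totalDelta = after.total - before.total
  if (totalDelta <= 0) return 0
  const idleDelta = after.idle - before.idle
  return Math.round((1 - idleDelta / totalDelta) * 1000) / 10
}
```

Keeping the parsing separate from the file read lets the unit test feed fixed strings instead of depending on the load of the machine that runs it.
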
**Step 4: Run test, expect PASS**

**Step 5: Commit**

```bash
git add src/lib/monitoring/monitoring-service.ts tests/unit/monitoring/monitoring-service.test.ts
git commit -m "feat(monitoring): add system health check (CPU, RAM, disk, load)"
```

---

### Task 8: MonitoringService - Service Checks

**Files:**
- Modify: `src/lib/monitoring/monitoring-service.ts`
- Test: `tests/unit/monitoring/monitoring-service.test.ts`

**Step 1: Write tests for service checks**

Test `checkRedis()`, `checkPostgresql()`, `checkPgBouncer()`, `checkQueues()`.
These need mocking since they connect to external services:

```typescript
import { describe, it, expect, vi } from 'vitest'
import { checkRedis } from '@/lib/monitoring/monitoring-service'

describe('checkRedis', () => {
  it('returns redis status with memory and client info', async () => {
    // Mock redis.info() response (for example with vi.mock) before calling the check
    const result = await checkRedis()
    expect(result).toHaveProperty('status')
    expect(result).toHaveProperty('memoryUsedMB')
    expect(result).toHaveProperty('connectedClients')
    expect(result).toHaveProperty('opsPerSec')
  })
})
```

**Step 2: Implement service checks**

- `checkRedis()`: Use `redis.info()` from `src/lib/redis.ts`, parse `used_memory`, `connected_clients`, `instantaneous_ops_per_sec`
- `checkPostgresql()`: Direct query `SELECT count(*) FROM pg_stat_activity` + `SELECT 1` latency test via `./scripts/db-direct.sh` or Payload's DB adapter
- `checkPgBouncer()`: Query `SHOW POOLS` via PgBouncer admin connection (127.0.0.1:6432)
- `checkQueues()`: Use BullMQ `Queue.getJobCounts()` for email, pdf, retention queues
- `checkSmtp()`: Create SMTP transporter and call `verify()` with timeout
- `checkOAuthTokens()`: Query `social-accounts` collection for expiring tokens (< 7 days)
- `checkCronJobs()`: Check audit-logs/monitoring-logs for recent cron executions

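For the Redis check, the parsing step might look like this. It is a sketch that assumes `redis.info()` returns the raw `INFO` text (`key:value` lines with `# Section` headers); `parseRedisInfo` is a hypothetical helper name, and the result shape follows the `redis` entry in the Task 1 test.

```typescript
interface RedisStatus {
  status: 'online' | 'offline'
  memoryUsedMB: number
  connectedClients: number
  opsPerSec: number
}

// Turn the raw INFO text into the three numbers the dashboard shows.
function parseRedisInfo(info: string): RedisStatus {
  const values = new Map<string, string>()
  for (const line of info.split(/\r?\n/)) {
    if (line === '' || line.startsWith('#')) continue // skip blanks and section headers
    const separator = line.indexOf(':')
    if (separator > 0) values.set(line.slice(0, separator), line.slice(separator + 1))
  }
  const numberOf = (key: string): number => Number(values.get(key) ?? 0)
  return {
    status: 'online',
    memoryUsedMB: Math.round(numberOf('used_memory') / 1024 / 1024), // used_memory is in bytes
    connectedClients: numberOf('connected_clients'),
    opsPerSec: numberOf('instantaneous_ops_per_sec'),
  }
}
```

With the parser isolated, the unit test for `checkRedis()` only needs to mock the client so that `info()` resolves to a fixed string.
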
**Step 3: Add `collectMetrics()` that calls all checks with `Promise.allSettled()`**

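One possible shape for the aggregation is shown below. `collectServices`, `ServiceCheck`, and the `{ status: 'offline', error }` fallback are assumptions for this sketch; the point is only that `Promise.allSettled()` lets a failing check degrade to an offline entry instead of rejecting the whole snapshot.

```typescript
type ServiceResult = Record<string, unknown>
type ServiceCheck = () => Promise<ServiceResult>

// Run all checks concurrently and map each outcome back to its service name.
async function collectServices(
  checks: Record<string, ServiceCheck>,
): Promise<Record<string, ServiceResult>> {
  const entries = Object.entries(checks)
  // allSettled never rejects, so one failing check cannot abort the others.
  const settled = await Promise.allSettled(entries.map(([, check]) => check()))
  const services: Record<string, ServiceResult> = {}
  entries.forEach(([name], index) => {
    const outcome = settled[index]
    services[name] =
      outcome?.status === 'fulfilled'
        ? outcome.value
        : { status: 'offline', error: String(outcome?.reason) }
  })
  return services
}
```
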
**Step 4: Run tests, expect PASS**

**Step 5: Commit**

```bash
git add src/lib/monitoring/monitoring-service.ts tests/unit/monitoring/monitoring-service.test.ts
git commit -m "feat(monitoring): add service checks (Redis, PostgreSQL, PgBouncer, queues, SMTP, OAuth)"
```

---

### Task 9: PerformanceTracker

**Files:**
- Create: `src/lib/monitoring/performance-tracker.ts`
- Test: `tests/unit/monitoring/performance-tracker.test.ts`

**Step 1: Write test**

```typescript
import { describe, it, expect } from 'vitest'
import { PerformanceTracker } from '@/lib/monitoring/performance-tracker'

describe('PerformanceTracker', () => {
  it('tracks requests and computes metrics', () => {
    const tracker = new PerformanceTracker(1000) // 1000-entry ring buffer
    tracker.track('GET', '/api/posts', 200, 120)
    tracker.track('GET', '/api/posts', 200, 250)
    tracker.track('GET', '/api/posts', 500, 800)

    const metrics = tracker.getMetrics('1h')
    expect(metrics.avgResponseTimeMs).toBeCloseTo(390, 0)
    expect(metrics.errorRate).toBeCloseTo(0.333, 2)
    expect(metrics.requestsPerMinute).toBeGreaterThan(0)
    expect(metrics.p95ResponseTimeMs).toBeGreaterThanOrEqual(metrics.avgResponseTimeMs)
  })

  it('ring buffer evicts old entries', () => {
    const tracker = new PerformanceTracker(2) // tiny buffer
    tracker.track('GET', '/a', 200, 100)
    tracker.track('GET', '/b', 200, 200)
    tracker.track('GET', '/c', 200, 300)

    const metrics = tracker.getMetrics('1h')
    // Only last 2 entries should remain
    expect(metrics.avgResponseTimeMs).toBeCloseTo(250, 0)
  })
})
```

**Step 2: Run test, expect FAIL**

**Step 3: Implement PerformanceTracker**

- Class with ring buffer (fixed-size array + pointer)
- Each entry: `{ timestamp, method, path, statusCode, durationMs }`
- `track()`: Add to ring buffer
- `getMetrics(period)`: Filter by time window, compute avg/p95/p99/errorRate/rpm
- Export singleton instance: `export const performanceTracker = new PerformanceTracker(10_000)`

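A minimal sketch of the class described above follows. Several details are assumptions rather than part of the plan: only 5xx responses count as errors, percentiles use the nearest-rank method, and `requestsPerMinute` averages over the whole requested window.

```typescript
interface RequestEntry {
  timestamp: number
  method: string
  path: string
  statusCode: number
  durationMs: number
}

type Period = '1h' | '6h' | '24h' | '7d'

const PERIOD_MS: Record<Period, number> = {
  '1h': 3_600_000,
  '6h': 21_600_000,
  '24h': 86_400_000,
  '7d': 604_800_000,
}

class PerformanceTracker {
  private readonly buffer: RequestEntry[] = []
  private readonly capacity: number
  private next = 0 // slot that the next track() call writes to

  constructor(capacity: number) {
    this.capacity = capacity
  }

  track(method: string, path: string, statusCode: number, durationMs: number): void {
    // Once the buffer is full, this overwrites the oldest entry.
    this.buffer[this.next] = { timestamp: Date.now(), method, path, statusCode, durationMs }
    this.next = (this.next + 1) % this.capacity
  }

  getMetrics(period: Period) {
    const cutoff = Date.now() - PERIOD_MS[period]
    const entries = this.buffer.filter((entry) => entry.timestamp >= cutoff)
    const count = entries.length
    if (count === 0) {
      return { avgResponseTimeMs: 0, p95ResponseTimeMs: 0, p99ResponseTimeMs: 0, errorRate: 0, requestsPerMinute: 0 }
    }
    const durations = entries.map((entry) => entry.durationMs).sort((a, b) => a - b)
    // Nearest-rank percentile over the sorted durations.
    const percentile = (p: number): number => durations[Math.max(0, Math.ceil(p * count) - 1)] ?? 0
    const errors = entries.filter((entry) => entry.statusCode >= 500).length // assumption: 5xx only
    return {
      avgResponseTimeMs: durations.reduce((sum, d) => sum + d, 0) / count,
      p95ResponseTimeMs: percentile(0.95),
      p99ResponseTimeMs: percentile(0.99),
      errorRate: errors / count,
      requestsPerMinute: count / (PERIOD_MS[period] / 60_000), // assumption: averaged over the window
    }
  }
}
```

The buffer lives in process memory, so each Node.js process that imports the singleton sees only the requests it tracked itself.
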
**Step 4: Run test, expect PASS**

**Step 5: Commit**

```bash
git add src/lib/monitoring/performance-tracker.ts tests/unit/monitoring/performance-tracker.test.ts
git commit -m "feat(monitoring): add performance tracker with ring buffer"
```

---

### Task 10: MonitoringLogger

**Files:**
- Create: `src/lib/monitoring/monitoring-logger.ts`
- Test: `tests/unit/monitoring/monitoring-logger.test.ts`

**Step 1: Write test**

```typescript
import { describe, it } from 'vitest'
import { createMonitoringLogger } from '@/lib/monitoring/monitoring-logger'

describe('MonitoringLogger', () => {
  it('creates logger with source and logs to collection', async () => {
    const logger = createMonitoringLogger('cron')
    // Mock payload.create
    await logger.info('Cron job completed', { jobName: 'community-sync', duration: 3500 })
    // Verify payload.create was called with correct args
  })

  it('respects minimum log level from env', async () => {
    // MONITORING_LOG_LEVEL=warn → info/debug should not write to DB
  })
})
```

**Step 2: Implement MonitoringLogger**

- `createMonitoringLogger(source: LogSource)` factory function
- Returns object with `debug()`, `info()`, `warn()`, `error()`, `fatal()` methods
- Each method calls `payload.create({ collection: 'monitoring-logs', data: { level, source, message, context, ... } })`
- Respects `MONITORING_LOG_LEVEL` env var (default: 'info')
- Falls back to `console.log` if Payload is not initialized (startup phase)
- Non-blocking: fire-and-forget with `.catch(console.error)`

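The level filtering and fire-and-forget behaviour can be sketched as below. To keep the snippet runnable on its own, the Payload call is injected as a `write` function and the minimum level is a parameter; both are assumptions of this sketch, since the plan's factory takes only `source` and reads `MONITORING_LOG_LEVEL` itself.

```typescript
type LogLevel = 'debug' | 'info' | 'warn' | 'error' | 'fatal'
type LogSource = 'payload' | 'queue-worker' | 'cron' | 'email' | 'oauth' | 'sync'

interface LogEntry {
  level: LogLevel
  source: LogSource
  message: string
  context?: Record<string, unknown>
}

// Hypothetical stand-in for `payload.create({ collection: 'monitoring-logs', data })`.
type LogWriter = (entry: LogEntry) => Promise<unknown>

const LEVELS: LogLevel[] = ['debug', 'info', 'warn', 'error', 'fatal']

function createMonitoringLogger(source: LogSource, write: LogWriter, minLevel: LogLevel = 'info') {
  const threshold = LEVELS.indexOf(minLevel)
  const at = (level: LogLevel) => (message: string, context?: Record<string, unknown>): void => {
    if (LEVELS.indexOf(level) < threshold) return // below the configured minimum
    // Fire-and-forget: callers never wait on the database write.
    write({ level, source, message, context }).catch(console.error)
  }
  return { debug: at('debug'), info: at('info'), warn: at('warn'), error: at('error'), fatal: at('fatal') }
}
```

Injecting the writer also makes the Step 1 test straightforward: pass a spy as `write` and assert on the entries it receives.
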
**Step 3: Run test, expect PASS**

**Step 4: Commit**

```bash
git add src/lib/monitoring/monitoring-logger.ts tests/unit/monitoring/monitoring-logger.test.ts
git commit -m "feat(monitoring): add structured monitoring logger"
```

---

### Task 11: AlertEvaluator

**Files:**
- Create: `src/lib/monitoring/alert-evaluator.ts`
- Test: `tests/unit/monitoring/alert-evaluator.test.ts`

**Step 1: Write test**

```typescript
import { describe, it, expect } from 'vitest'
import { AlertEvaluator, evaluateCondition, getMetricValue } from '@/lib/monitoring/alert-evaluator'

describe('AlertEvaluator', () => {
  it('fires alert when metric exceeds threshold (gt)', () => {
    const rule = { metric: 'system.cpuUsagePercent', condition: 'gt', threshold: 80, severity: 'warning' }
    const metrics = { system: { cpuUsagePercent: 92 } }
    expect(evaluateCondition(rule, getMetricValue(metrics, rule.metric))).toBe(true)
  })

  it('does not fire when metric is below threshold', () => {
    const rule = { metric: 'system.cpuUsagePercent', condition: 'gt', threshold: 80 }
    const metrics = { system: { cpuUsagePercent: 45 } }
    expect(evaluateCondition(rule, getMetricValue(metrics, rule.metric))).toBe(false)
  })

  it('resolves nested metric paths', () => {
    const metrics = { services: { redis: { memoryUsedMB: 512 } } }
    expect(getMetricValue(metrics, 'services.redis.memoryUsedMB')).toBe(512)
  })

  it('respects cooldown period', () => {
    const evaluator = new AlertEvaluator()
    // First fire should pass
    expect(evaluator.shouldFire('rule-1', 15)).toBe(true)
    // Immediate second fire should be blocked (cooldown)
    expect(evaluator.shouldFire('rule-1', 15)).toBe(false)
  })
})
```

**Step 2: Implement AlertEvaluator**

- `getMetricValue(metrics, path)`: Resolve dot-notation path like `system.cpuUsagePercent`
- `evaluateCondition(rule, value)`: Compare value against threshold using condition operator
- `AlertEvaluator` class with in-memory cooldown map (ruleId → lastFiredAt)
- `evaluateRules(payload, metrics)`: Load enabled rules from `monitoring-alert-rules`, evaluate each, fire alerts
- `dispatchAlert(payload, rule, metrics, value)`: Create `monitoring-alert-history` record + call existing `sendAlert()` from `src/lib/alerting/alert-service.ts`

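The three pure pieces of the evaluator might be sketched as follows. Treating a missing or non-numeric metric as "do not fire" and the optional `now` parameter on `shouldFire` (added to make the cooldown testable without timers) are assumptions of this sketch; `evaluateRules` and `dispatchAlert` are omitted because they depend on Payload.

```typescript
type AlertCondition = 'gt' | 'lt' | 'eq' | 'gte' | 'lte'

// Walk a dot-notation path such as "system.cpuUsagePercent"; return undefined if it
// does not lead to a number.
function getMetricValue(metrics: unknown, path: string): number | undefined {
  let current: unknown = metrics
  for (const key of path.split('.')) {
    if (typeof current !== 'object' || current === null) return undefined
    current = (current as Record<string, unknown>)[key]
  }
  return typeof current === 'number' ? current : undefined
}

function evaluateCondition(
  rule: { condition: AlertCondition; threshold: number },
  value: number | undefined,
): boolean {
  if (value === undefined) return false // assumption: unknown metrics never fire
  switch (rule.condition) {
    case 'gt':
      return value > rule.threshold
    case 'lt':
      return value < rule.threshold
    case 'eq':
      return value === rule.threshold
    case 'gte':
      return value >= rule.threshold
    case 'lte':
      return value <= rule.threshold
  }
}

class AlertEvaluator {
  private readonly lastFiredAt = new Map<string, number>() // ruleId -> epoch ms

  // Returns true and records the firing time unless the rule fired within its cooldown.
  shouldFire(ruleId: string, cooldownMinutes: number, now: number = Date.now()): boolean {
    const last = this.lastFiredAt.get(ruleId)
    if (last !== undefined && now - last < cooldownMinutes * 60_000) return false
    this.lastFiredAt.set(ruleId, now)
    return true
  }
}
```

Because the cooldown map is held in memory, it resets whenever the queue-worker process restarts, so a rule may fire again right after a restart.
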
**Step 3: Run test, expect PASS**

**Step 4: Commit**

```bash
git add src/lib/monitoring/alert-evaluator.ts tests/unit/monitoring/alert-evaluator.test.ts
git commit -m "feat(monitoring): add alert evaluator with cooldown and multi-channel dispatch"
```

---

### Task 12: SnapshotCollector

**Files:**
- Create: `src/lib/monitoring/snapshot-collector.ts`
- Modify: `scripts/run-queue-worker.ts` (add monitoring worker)
- Modify: `ecosystem.config.cjs` (add env var)

**Step 1: Implement SnapshotCollector**

```typescript
import { collectMetrics } from './monitoring-service'
import { AlertEvaluator } from './alert-evaluator'
import { getPayload } from 'payload'
import config from '@payload-config'

let interval: NodeJS.Timeout | null = null
const alertEvaluator = new AlertEvaluator()

export async function startSnapshotCollector(): Promise<void> {
  const INTERVAL = parseInt(process.env.MONITORING_SNAPSHOT_INTERVAL || '60000', 10)
  console.log(`[SnapshotCollector] Starting (interval: ${INTERVAL}ms)`)

  interval = setInterval(async () => {
    try {
      const payload = await getPayload({ config })
      const metrics = await collectMetrics()
      await payload.create({ collection: 'monitoring-snapshots', data: { ...metrics, timestamp: new Date().toISOString() } })
      await alertEvaluator.evaluateRules(payload, metrics)
    } catch (error) {
      console.error('[SnapshotCollector] Error:', error)
    }
  }, INTERVAL)
}

export async function stopSnapshotCollector(): Promise<void> {
  if (interval) { clearInterval(interval); interval = null }
  console.log('[SnapshotCollector] Stopped')
}
```

**Step 2: Add to queue worker**

In `scripts/run-queue-worker.ts`, add:
```typescript
const ENABLE_MONITORING = process.env.QUEUE_ENABLE_MONITORING !== 'false'
// ... dynamic import
const { startSnapshotCollector, stopSnapshotCollector } = await import('../src/lib/monitoring/snapshot-collector')
// ... conditional start
if (ENABLE_MONITORING) await startSnapshotCollector()
// ... shutdown
if (ENABLE_MONITORING) stopPromises.push(stopSnapshotCollector())
```

**Step 3: Add env var to ecosystem.config.cjs**

Add to queue-worker env:
```javascript
QUEUE_ENABLE_MONITORING: 'true',
MONITORING_SNAPSHOT_INTERVAL: '60000',
```

**Step 4: Commit**

```bash
git add src/lib/monitoring/snapshot-collector.ts scripts/run-queue-worker.ts ecosystem.config.cjs
git commit -m "feat(monitoring): add snapshot collector to queue worker"
```

---

### Task 13: Data Retention Integration

**Files:**
- Modify: `src/lib/retention/retention-config.ts`

**Step 1: Add 3 new retention policies**

```typescript
{
  name: 'monitoring-snapshots',
  collection: 'monitoring-snapshots',
  retentionDays: parseInt(process.env.RETENTION_MONITORING_SNAPSHOTS_DAYS || '7', 10),
  dateField: 'createdAt',
  batchSize: 500,
  description: 'Delete monitoring snapshots older than X days',
},
{
  name: 'monitoring-alert-history',
  collection: 'monitoring-alert-history',
  retentionDays: parseInt(process.env.RETENTION_MONITORING_ALERTS_DAYS || '90', 10),
  dateField: 'createdAt',
  batchSize: 100,
  description: 'Delete alert history older than X days',
},
{
  name: 'monitoring-logs',
  collection: 'monitoring-logs',
  retentionDays: parseInt(process.env.RETENTION_MONITORING_LOGS_DAYS || '30', 10),
  dateField: 'createdAt',
  batchSize: 200,
  description: 'Delete monitoring logs older than X days',
},
```

**Step 2: Commit**

```bash
git add src/lib/retention/retention-config.ts
git commit -m "feat(monitoring): add retention policies for monitoring collections"
```

---

## Phase 3: API Endpoints

### Task 14: Health Endpoint

**Files:**
- Create: `src/app/(payload)/api/monitoring/health/route.ts`

**Step 1: Implement GET handler**

Pattern: Follow `community/stats/route.ts`. Auth check for super-admin. Call `checkSystemHealth()`, return JSON.

```typescript
import { NextRequest, NextResponse } from 'next/server'
import { getPayload } from 'payload'
import config from '@payload-config'
import { checkSystemHealth } from '@/lib/monitoring/monitoring-service'

export async function GET(req: NextRequest) {
  try {
    const payload = await getPayload({ config })
    const { user } = await payload.auth({ headers: req.headers })
    if (!user || !(user as any).isSuperAdmin) {
      return NextResponse.json({ error: 'Unauthorized' }, { status: 401 })
    }
    const health = await checkSystemHealth()
    return NextResponse.json({ data: health, timestamp: new Date().toISOString() })
  } catch (error: unknown) {
    return NextResponse.json({ error: error instanceof Error ? error.message : 'Unknown error' }, { status: 500 })
  }
}

export const dynamic = 'force-dynamic'
```

**Step 2: Commit**

```bash
git add "src/app/(payload)/api/monitoring/health/route.ts"
git commit -m "feat(monitoring): add /api/monitoring/health endpoint"
```

---

### Task 15: Services Endpoint

**Files:**
- Create: `src/app/(payload)/api/monitoring/services/route.ts`

Same pattern as health. Calls `checkPostgresql()`, `checkPgBouncer()`, `checkRedis()`, `checkSmtp()`, `checkOAuthTokens()`, `checkCronJobs()`, `checkQueues()` via `Promise.allSettled()`. Returns combined result.

**Commit:**
```bash
git commit -m "feat(monitoring): add /api/monitoring/services endpoint"
```

---

### Task 16: Performance Endpoint

**Files:**
- Create: `src/app/(payload)/api/monitoring/performance/route.ts`

Reads `?period=1h|6h|24h|7d` query param. Calls `performanceTracker.getMetrics(period)`.

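Since the `period` value arrives as an untrusted query string, it is worth validating before it reaches the tracker. The sketch below shows one way to do that; `parsePeriod`, `periodStart`, and the `'1h'` fallback are assumptions, not part of the plan.

```typescript
const PERIOD_MS = {
  '1h': 60 * 60 * 1000,
  '6h': 6 * 60 * 60 * 1000,
  '24h': 24 * 60 * 60 * 1000,
  '7d': 7 * 24 * 60 * 60 * 1000,
} as const

type Period = keyof typeof PERIOD_MS

// Accept only the four documented values; anything else falls back to the default.
function parsePeriod(raw: string | null, fallback: Period = '1h'): Period {
  const allowed = Object.keys(PERIOD_MS) as Period[]
  return allowed.find((period) => period === raw) ?? fallback
}

// Start of the time window, useful when the period is used to query stored snapshots.
function periodStart(period: Period, now: Date = new Date()): Date {
  return new Date(now.getTime() - PERIOD_MS[period])
}
```

In a route handler this would typically be called as `parsePeriod(req.nextUrl.searchParams.get('period'))`.
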
**Commit:**
```bash
git commit -m "feat(monitoring): add /api/monitoring/performance endpoint"
```

---

### Task 17: Alerts Endpoint + Acknowledge

**Files:**
- Create: `src/app/(payload)/api/monitoring/alerts/route.ts`
- Create: `src/app/(payload)/api/monitoring/alerts/acknowledge/route.ts`

**GET /alerts:** Query `monitoring-alert-history` with pagination (`?page=1&limit=20`), filter by severity, sort by createdAt desc.

**POST /alerts/acknowledge:** Body `{ alertId }`. Sets `acknowledgedBy` to current user and `resolvedAt` to now. Super-admin only.

**Commit:**
```bash
git commit -m "feat(monitoring): add /api/monitoring/alerts + acknowledge endpoints"
```

---

### Task 18: Logs Endpoint

**Files:**
- Create: `src/app/(payload)/api/monitoring/logs/route.ts`

**GET /logs:** Query `monitoring-logs` with pagination, filters:
- `?level=warn` (exact or gte)
- `?source=cron`
- `?search=text` (searches in message)
- `?from=ISO&to=ISO` (date range)
- `?page=1&limit=50`

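The filters above could be translated into a Payload `where` clause along these lines. This is a sketch under stated assumptions: `buildLogsWhere` is a hypothetical helper, `level` is interpreted with "gte" semantics (the requested level and everything more severe), and `Where` is a loose local alias rather than Payload's own type.

```typescript
const LOG_LEVELS = ['debug', 'info', 'warn', 'error', 'fatal'] as const

type Where = Record<string, unknown>

function buildLogsWhere(params: URLSearchParams): Where {
  const and: Where[] = []

  // level=warn matches warn, error, and fatal; unknown levels are ignored.
  const levelIndex = LOG_LEVELS.findIndex((level) => level === params.get('level'))
  if (levelIndex >= 0) and.push({ level: { in: LOG_LEVELS.slice(levelIndex) } })

  const source = params.get('source')
  if (source) and.push({ source: { equals: source } })

  const search = params.get('search')
  if (search) and.push({ message: { like: search } })

  const from = params.get('from')
  if (from) and.push({ createdAt: { greater_than_equal: from } })

  const to = params.get('to')
  if (to) and.push({ createdAt: { less_than_equal: to } })

  return and.length > 0 ? { and } : {}
}
```

The result would be passed as `where` to `payload.find({ collection: 'monitoring-logs', ... })` together with `page`, `limit`, and a descending sort on `createdAt`.
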
**Commit:**
```bash
git commit -m "feat(monitoring): add /api/monitoring/logs endpoint"
```

---

### Task 19: Snapshots Endpoint

**Files:**
- Create: `src/app/(payload)/api/monitoring/snapshots/route.ts`

**GET /snapshots:** Query `monitoring-snapshots` for trend data.
- `?period=1h|6h|24h|7d` (default: 24h)
- `?fields=system.cpuUsagePercent,system.memoryUsagePercent` (optional field selection for bandwidth)
- Returns array sorted by timestamp asc (oldest first for charts).

**Commit:**
```bash
git commit -m "feat(monitoring): add /api/monitoring/snapshots endpoint"
```

---

### Task 20: SSE Stream Endpoint

**Files:**
- Create: `src/app/(payload)/api/monitoring/stream/route.ts`

**Step 1: Implement SSE stream**

Pattern: Follow `community/stream/route.ts` exactly.

Key differences from community stream:
- Multiple event types with different intervals:
  - Health metrics: every 10s (via `checkSystemHealth()`)
  - Performance metrics: every 30s (via `performanceTracker.getMetrics()`)
  - New alerts: check every 5s (query `monitoring-alert-history` for new since lastCheck)
  - New logs (warn+): check every 5s (query `monitoring-logs` where level >= warn since lastCheck)
- Each event has a `type` field in the SSE data
- Max duration: 25s with reconnect signal (same as community). Because the 30s performance interval is longer than this, send one performance reading when the connection opens so that event type is not starved.

```typescript
// SSE event format:
controller.enqueue(encoder.encode(`event: health\ndata: ${JSON.stringify(healthData)}\n\n`))
controller.enqueue(encoder.encode(`event: alert\ndata: ${JSON.stringify(alertData)}\n\n`))
```

Note: Use named SSE events (`event: health\n`) so the client can use `eventSource.addEventListener('health', ...)`.

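To avoid repeating the frame template for every event type, the formatting can be pulled into a small helper. `formatSseEvent` and `encodeSseEvent` are hypothetical names for this sketch; the frame layout is the one shown in the note above.

```typescript
type MonitoringEventType = 'health' | 'service' | 'alert' | 'log' | 'performance'

// One SSE frame: a named event, a single JSON data line, and a blank line as terminator.
// JSON.stringify never emits raw newlines, so the payload always fits on one data line.
function formatSseEvent(type: MonitoringEventType, data: unknown): string {
  return `event: ${type}\ndata: ${JSON.stringify(data)}\n\n`
}

const sseEncoder = new TextEncoder()

// Bytes ready to hand to `controller.enqueue()` inside the ReadableStream.
function encodeSseEvent(type: MonitoringEventType, data: unknown): Uint8Array {
  return sseEncoder.encode(formatSseEvent(type, data))
}
```

Inside the stream this reduces each emission to `controller.enqueue(encodeSseEvent('health', healthData))`.
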
**Step 2: Commit**

```bash
git add "src/app/(payload)/api/monitoring/stream/route.ts"
git commit -m "feat(monitoring): add SSE stream endpoint with multi-event types"
```

---

## Phase 4: Dashboard UI

### Task 21: Admin View Registration + NavLinks

**Files:**
- Create: `src/components/admin/MonitoringNavLinks.tsx`
- Create: `src/components/admin/MonitoringDashboardView.tsx`
- Modify: `src/payload.config.ts` (add view + navlink)

**Step 1: Create MonitoringNavLinks**

Pattern: Copy `CommunityNavLinks.tsx` exactly. Single link:
```typescript
const links = [
  { href: '/admin/monitoring', label: 'Monitoring Dashboard' },
]
```
Group label: `'Monitoring'`.

**Step 2: Create MonitoringDashboardView**

```typescript
'use client'
import React from 'react'
import { MonitoringDashboard } from './MonitoringDashboard'
export const MonitoringDashboardView: React.FC = () => <MonitoringDashboard />
export default MonitoringDashboardView
```

**Step 3: Register in payload.config.ts**

```typescript
afterNavLinks: [
  // ... existing
  '@/components/admin/MonitoringNavLinks#MonitoringNavLinks',
],
views: {
  // ... existing
  MonitoringDashboard: {
    Component: '@/components/admin/MonitoringDashboardView#MonitoringDashboardView',
    path: '/monitoring',
  },
},
```

**Step 4: Commit**

```bash
git add src/components/admin/MonitoringNavLinks.tsx src/components/admin/MonitoringDashboardView.tsx src/payload.config.ts
git commit -m "feat(monitoring): register admin view and sidebar navigation"
```

---

### Task 22: MonitoringDashboard Main Component

**Files:**
- Create: `src/components/admin/MonitoringDashboard.tsx`
- Create: `src/components/admin/MonitoringDashboard.scss`

**Step 1: Implement tab shell**

Pattern: Follow `YouTubeAnalyticsDashboard.tsx` structure.

```typescript
'use client'
import React, { useState, useEffect } from 'react'
import './MonitoringDashboard.scss'
import { SystemHealthTab } from './monitoring/SystemHealthTab'
import { ServicesTab } from './monitoring/ServicesTab'
import { PerformanceTab } from './monitoring/PerformanceTab'
import { AlertsTab } from './monitoring/AlertsTab'
import { LogsTab } from './monitoring/LogsTab'

type Tab = 'health' | 'services' | 'performance' | 'alerts' | 'logs'

export const MonitoringDashboard: React.FC = () => {
  // setActiveTab is wired to the tab buttons (placeholder below)
  const [activeTab, setActiveTab] = useState<Tab>('health')
  // Kept in state rather than a ref: reading ref.current during render
  // does not trigger a re-render, so tabs could receive a stale null.
  const [eventSource, setEventSource] = useState<EventSource | null>(null)
  const [connected, setConnected] = useState(false)

  // SSE connection setup
  useEffect(() => {
    const es = new EventSource('/api/monitoring/stream', { withCredentials: true })
    setEventSource(es)

    es.addEventListener('open', () => setConnected(true))
    // EventSource reconnects automatically after an error
    es.addEventListener('error', () => setConnected(false))

    return () => es.close()
  }, [])

  // Pass eventSource to tabs for real-time updates
  return (
    <div className="monitoring">
      <div className="monitoring__header">
        <h1>Monitoring Dashboard</h1>
        <div className={`monitoring__status ${connected ? 'monitoring__status--connected' : 'monitoring__status--disconnected'}`}>
          {connected ? '● Live' : '○ Disconnected'}
        </div>
      </div>
      <div className="monitoring__tabs">{/* Tab buttons */}</div>
      <div className="monitoring__content">
        {activeTab === 'health' && <SystemHealthTab eventSource={eventSource} />}
        {activeTab === 'services' && <ServicesTab eventSource={eventSource} />}
        {activeTab === 'performance' && <PerformanceTab />}
        {activeTab === 'alerts' && <AlertsTab eventSource={eventSource} />}
        {activeTab === 'logs' && <LogsTab eventSource={eventSource} />}
      </div>
    </div>
  )
}
```

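Each tab receives the `eventSource` prop and registers its own named listener. One way to avoid repeating the parse-and-cleanup logic in every tab is a small subscription helper. The sketch below is an illustration only: `subscribe`, `EventSourceLike`, and `MessageLike` are hypothetical names, and the helper is typed against a minimal interface so it can be exercised without a browser (passing a real `EventSource` may need a cast depending on compiler settings).

```typescript
// Minimal structural types so the helper works with a test double.
interface MessageLike {
  data: string
}

interface EventSourceLike {
  addEventListener(type: string, listener: (event: MessageLike) => void): void
  removeEventListener(type: string, listener: (event: MessageLike) => void): void
}

// Registers a listener for one named SSE event and returns an unsubscribe
// function, which fits the cleanup contract of React's useEffect.
function subscribe<T>(
  source: EventSourceLike | null,
  type: string,
  onData: (payload: T) => void,
): () => void {
  if (!source) return () => {}

  const listener = (event: MessageLike): void => {
    let payload: T
    try {
      payload = JSON.parse(event.data) as T
    } catch {
      return // ignore malformed payloads rather than crash the tab
    }
    onData(payload)
  }

  source.addEventListener(type, listener)
  return () => source.removeEventListener(type, listener)
}
```

In a tab this could look like `useEffect(() => subscribe(eventSource, 'health', setHealth), [eventSource])`, so the listener is re-registered whenever the connection object changes.
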
**Step 2: Create SCSS with BEM classes**

Define the classes used by the shell: `.monitoring`, `.monitoring__header`, `.monitoring__status`, `.monitoring__status--connected`, `.monitoring__status--disconnected`, `.monitoring__tabs`, `.monitoring__tab`, `.monitoring__tab--active`, `.monitoring__content`

**Step 3: Commit**

```bash
git add src/components/admin/MonitoringDashboard.tsx src/components/admin/MonitoringDashboard.scss
git commit -m "feat(monitoring): add main dashboard component with SSE connection and tab shell"
```

---

### Task 23: Shared UI Components

**Files:**
- Create: `src/components/admin/monitoring/StatusBadge.tsx`
- Create: `src/components/admin/monitoring/GaugeWidget.tsx`
- Create: `src/components/admin/monitoring/TrendChart.tsx`
- Create: `src/components/admin/monitoring/LogTable.tsx`

**StatusBadge:** Simple component that maps a status string to a colored badge (online=green, warning=yellow, offline=red).

**GaugeWidget:** Displays a metric with label, value, unit, and colored arc/bar. Props: `{ label, value, max, unit, thresholds: { warning: number, critical: number } }`.
Use a CSS-only approach (no chart library): circular progress with `conic-gradient` or a simple horizontal bar.

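The threshold and `conic-gradient` logic for the gauge can be kept in two pure functions. This is a sketch under assumptions: the function names, the `GaugeLevel` type, and the color palette are hypothetical, and it treats higher values as worse, which fits CPU, RAM, and disk usage but not a metric such as uptime.

```typescript
interface GaugeThresholds {
  warning: number
  critical: number
}

type GaugeLevel = 'ok' | 'warning' | 'critical'

// Hypothetical palette; the real colors would come from the SCSS theme.
const GAUGE_COLORS: Record<GaugeLevel, string> = {
  ok: '#2e7d32',
  warning: '#f9a825',
  critical: '#c62828',
}

// Classifies a value against the thresholds (higher is worse).
function gaugeLevel(value: number, thresholds: GaugeThresholds): GaugeLevel {
  if (value >= thresholds.critical) return 'critical'
  if (value >= thresholds.warning) return 'warning'
  return 'ok'
}

// Builds the CSS background for a circular gauge: a colored arc covering
// value/max of the circle, with the remainder in neutral grey.
function gaugeBackground(value: number, max: number, thresholds: GaugeThresholds): string {
  const ratio = max > 0 ? Math.min(Math.max(value / max, 0), 1) : 0
  const degrees = Math.round(ratio * 360)
  const color = GAUGE_COLORS[gaugeLevel(value, thresholds)]
  return `conic-gradient(${color} ${degrees}deg, #e0e0e0 ${degrees}deg)`
}
```

The component would apply the returned string as an inline `background` style on a round element and overlay the numeric value and unit in the center.
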
**TrendChart:** Renders time-series data as a simple SVG line chart. Props: `{ data: Array<{timestamp, value}>, label, unit, height }`.
Pure SVG with no chart library, so it adds no charting dependency to the bundle. Scales automatically to the container width.

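The core of such a chart is turning the data points into an SVG path string. The following is a minimal sketch, assuming points are evenly spaced in time (one per snapshot) so the x position can be derived from the index; `buildLinePath` and `TrendPoint` are hypothetical names.

```typescript
interface TrendPoint {
  timestamp: string
  value: number
}

// Converts a series into an SVG path ("M x,y L x,y ...") that spans the
// full width and height of the given viewBox.
function buildLinePath(data: TrendPoint[], width: number, height: number): string {
  if (data.length === 0) return ''

  const values = data.map((point) => point.value)
  const min = Math.min(...values)
  const max = Math.max(...values)
  const range = max - min || 1 // flat series: avoid division by zero
  const stepX = data.length > 1 ? width / (data.length - 1) : 0

  return data
    .map((point, index) => {
      const x = index * stepX
      // SVG y grows downward, so the maximum value maps to y = 0
      const y = height - ((point.value - min) / range) * height
      return `${index === 0 ? 'M' : 'L'}${x.toFixed(1)},${y.toFixed(1)}`
    })
    .join(' ')
}
```

Rendering the path inside an `<svg>` with a fixed `viewBox`, `width="100%"`, and `preserveAspectRatio="none"` is one way to get the automatic scaling to container width without measuring the DOM.
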
**LogTable:** Renders log entries with expandable JSON context. Props: `{ logs, onLoadMore }`.
Each row: level icon, source badge, message, timestamp. Click to expand `context` JSON.

**Commit:**
```bash
git add src/components/admin/monitoring/StatusBadge.tsx src/components/admin/monitoring/GaugeWidget.tsx src/components/admin/monitoring/TrendChart.tsx src/components/admin/monitoring/LogTable.tsx
git commit -m "feat(monitoring): add shared UI components (StatusBadge, GaugeWidget, TrendChart, LogTable)"
```

---

### Task 24: SystemHealthTab

**Files:**
- Create: `src/components/admin/monitoring/SystemHealthTab.tsx`

**Implementation:**

1. Initial fetch: `GET /api/monitoring/health`
2. SSE listener: `eventSource.addEventListener('health', ...)` updates gauges in real-time
3. Trend data: `GET /api/monitoring/snapshots?period=24h&fields=system.cpuUsagePercent,system.memoryUsagePercent,system.loadAvg1`
4. Renders: 4 GaugeWidgets (CPU, RAM, Disk, Uptime) + 3 TrendCharts (CPU 24h, Memory 24h, Load 24h)

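The trend request in step 3 uses dotted field paths, which the tab also needs to read back out of each snapshot. A minimal sketch of both directions follows; `buildSnapshotsUrl` and `readPath` are hypothetical helpers, and the snapshot shape is assumed to mirror the dotted paths in the query string.

```typescript
// Builds the snapshots URL with a comma-separated list of dotted field paths.
function buildSnapshotsUrl(period: string, fields: string[]): string {
  const query = [`period=${encodeURIComponent(period)}`]
  if (fields.length > 0) {
    // Encode each field separately so the separating commas stay literal
    query.push(`fields=${fields.map((field) => encodeURIComponent(field)).join(',')}`)
  }
  return `/api/monitoring/snapshots?${query.join('&')}`
}

// Reads a numeric value at a dotted path, e.g. "system.cpuUsagePercent".
// Returns undefined when the path is missing or the value is not a number.
function readPath(source: unknown, path: string): number | undefined {
  let current: unknown = source
  for (const key of path.split('.')) {
    if (typeof current !== 'object' || current === null) return undefined
    current = (current as Record<string, unknown>)[key]
  }
  return typeof current === 'number' ? current : undefined
}
```

Mapping each snapshot to `{ timestamp, value: readPath(snapshot, 'system.cpuUsagePercent') }` and dropping undefined values would yield the `data` array for one TrendChart.
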
**Commit:**
```bash
git add src/components/admin/monitoring/SystemHealthTab.tsx
git commit -m "feat(monitoring): add System Health tab with gauges and trend charts"
```

---

### Task 25: ServicesTab

**Files:**
- Create: `src/components/admin/monitoring/ServicesTab.tsx`

**Implementation:**

1. Initial fetch: `GET /api/monitoring/services`
2. SSE listener: `eventSource.addEventListener('service', ...)` for status changes
3. Renders expandable service cards:
   - Payload CMS (PID, Memory, Uptime, Restarts)
   - Queue Worker (PID, Memory, Active Jobs)
   - PostgreSQL (Connections, Pool, Latency)
   - PgBouncer (Active, Waiting, Pool Size)
   - Redis (Memory, Clients, Ops/s)
   - SMTP (Status, Last Check, Response Time)
   - OAuth Tokens (Meta + YouTube, expiry warnings)
   - Cron Jobs (Last run times per job)

Each card has a StatusBadge header.

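For the OAuth Tokens card, the expiry warning has to be reduced to one of the StatusBadge states. The sketch below shows one possible mapping; `tokenBadgeStatus`, the `expiresAt` input, and the 7-day warning window are all assumptions made for illustration, not values taken from the plan.

```typescript
type BadgeStatus = 'online' | 'warning' | 'offline'

const DAY_MS = 24 * 60 * 60 * 1000

// Maps a token expiry timestamp (ISO 8601) to a badge status:
// expired or unknown = offline, expiring within warnDays = warning.
function tokenBadgeStatus(
  expiresAt: string | null,
  now: Date = new Date(),
  warnDays = 7, // hypothetical default
): BadgeStatus {
  if (!expiresAt) return 'offline'

  const expiry = new Date(expiresAt).getTime()
  if (Number.isNaN(expiry)) return 'offline'

  const remainingMs = expiry - now.getTime()
  if (remainingMs <= 0) return 'offline'
  return remainingMs < warnDays * DAY_MS ? 'warning' : 'online'
}
```

Passing `now` as a parameter keeps the function deterministic, which makes the three states easy to cover in a unit test.
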
**Commit:**
```bash
git add src/components/admin/monitoring/ServicesTab.tsx
git commit -m "feat(monitoring): add Services tab with expandable service cards"
```

---

### Task 26: PerformanceTab

**Files:**
- Create: `src/components/admin/monitoring/PerformanceTab.tsx`

**Implementation:**

1. Period selector: 1h, 6h, 24h, 7d (buttons)
2. Fetch: `GET /api/monitoring/performance?period=24h`
3. KPI cards: Avg Response Time, P95, P99, Error Rate, RPM
4. TrendCharts from snapshots: `GET /api/monitoring/snapshots?period=24h&fields=performance.avgResponseTimeMs,performance.errorRate,performance.requestsPerMinute`

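The four period buttons map to a `period` query value that something has to translate into a time window. A minimal sketch of that translation is shown here; `PERIOD_MS`, `isPeriod`, and `periodStart` are hypothetical names, and falling back to `24h` for unknown input is an assumption.

```typescript
type Period = '1h' | '6h' | '24h' | '7d'

const HOUR_MS = 60 * 60 * 1000

// Duration of each selectable period in milliseconds.
const PERIOD_MS: Record<Period, number> = {
  '1h': HOUR_MS,
  '6h': 6 * HOUR_MS,
  '24h': 24 * HOUR_MS,
  '7d': 7 * 24 * HOUR_MS,
}

// Type guard for values coming from a query string or button handler.
function isPeriod(value: string): value is Period {
  return Object.prototype.hasOwnProperty.call(PERIOD_MS, value)
}

// Returns the start of the time window; unknown periods fall back to 24h.
function periodStart(period: string, now: Date = new Date()): Date {
  const safePeriod: Period = isPeriod(period) ? period : '24h'
  return new Date(now.getTime() - PERIOD_MS[safePeriod])
}
```

The same helper could serve both the tab (to label the chart range) and the API route (to build the `createdAt` filter), which keeps the accepted values in one place.
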
**Commit:**
```bash
git add src/components/admin/monitoring/PerformanceTab.tsx
git commit -m "feat(monitoring): add Performance tab with KPI cards and trend charts"
```

---

### Task 27: AlertsTab

**Files:**
- Create: `src/components/admin/monitoring/AlertsTab.tsx`

**Implementation:**

1. Fetch: `GET /api/monitoring/alerts?page=1&limit=20`
2. SSE listener: `eventSource.addEventListener('alert', ...)` prepends new alerts
3. Active/Unacknowledged alerts highlighted at top
4. Severity filter (warning, error, critical)
5. Acknowledge button: `POST /api/monitoring/alerts/acknowledge` with `{ alertId }`
6. Link to MonitoringAlertRules collection in admin: `/admin/collections/monitoring-alert-rules`
7. Pagination

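Steps 2 and 3 interact: an alert arriving over SSE must be merged into the fetched list without duplicates while unacknowledged alerts stay on top. The following sketch shows one way to do that with pure functions. The `AlertItem` field names (`id`, `createdAt`, `acknowledged`) are assumptions for illustration, not the collection's confirmed schema.

```typescript
interface AlertItem {
  id: string
  severity: 'warning' | 'error' | 'critical'
  createdAt: string // ISO 8601
  acknowledged: boolean
}

// Unacknowledged alerts first, then newest first within each group.
function sortAlerts(alerts: AlertItem[]): AlertItem[] {
  return [...alerts].sort((a, b) => {
    if (a.acknowledged !== b.acknowledged) return a.acknowledged ? 1 : -1
    // ISO 8601 UTC strings sort chronologically as plain strings
    return b.createdAt.localeCompare(a.createdAt)
  })
}

// Adds or replaces an alert received via SSE and re-sorts the list.
function mergeAlert(current: AlertItem[], incoming: AlertItem): AlertItem[] {
  const withoutDuplicate = current.filter((alert) => alert.id !== incoming.id)
  return sortAlerts([incoming, ...withoutDuplicate])
}
```

Because `mergeAlert` replaces by `id`, the same function can apply the updated record returned by the acknowledge endpoint, moving the alert out of the highlighted group.
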
**Commit:**
```bash
git add src/components/admin/monitoring/AlertsTab.tsx
git commit -m "feat(monitoring): add Alerts tab with acknowledge and real-time updates"
```

---

### Task 28: LogsTab

**Files:**
- Create: `src/components/admin/monitoring/LogsTab.tsx`

**Implementation:**

1. Fetch: `GET /api/monitoring/logs?page=1&limit=50`
2. SSE listener: `eventSource.addEventListener('log', ...)` prepends new warn+ entries
3. Filters: level dropdown, source dropdown, text search input, date range
4. Uses LogTable component
5. Load more button for pagination
6. Auto-scroll toggle for new SSE entries

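The "warn+" rule and the prepend behavior can be expressed as two small functions. This is a sketch under assumptions: the level list and its order, the `LogEntry` shape, and the 500-entry cap are hypothetical and only illustrate the comparison.

```typescript
// Hypothetical level order, lowest to highest severity.
const LOG_LEVELS = ['debug', 'info', 'warn', 'error', 'fatal'] as const
type LogLevel = (typeof LOG_LEVELS)[number]

interface LogEntry {
  id: string
  level: LogLevel
  source: string
  message: string
}

// True when `level` is at least as severe as `minimum`.
function isAtLeast(level: LogLevel, minimum: LogLevel): boolean {
  return LOG_LEVELS.indexOf(level) >= LOG_LEVELS.indexOf(minimum)
}

// Prepends a streamed entry if it is warn or above, and caps the list so a
// long-running dashboard session does not grow without bound.
function prependLog(logs: LogEntry[], entry: LogEntry, maxEntries = 500): LogEntry[] {
  if (!isAtLeast(entry.level, 'warn')) return logs
  return [entry, ...logs].slice(0, maxEntries)
}
```

The level dropdown in step 3 could reuse `isAtLeast` with the selected level as `minimum` to filter the rows passed to LogTable.
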
**Commit:**
```bash
git add src/components/admin/monitoring/LogsTab.tsx
git commit -m "feat(monitoring): add Logs tab with filters, search, and real-time updates"
```

---

## Phase 5: Final Integration

### Task 29: Generate ImportMap & Build Test

**Step 1: Generate import map**

```bash
pnpm payload generate:importmap
```

**Step 2: Build test**

```bash
pm2 stop payload
NODE_OPTIONS="--no-deprecation --max-old-space-size=1024" pnpm build
pm2 start payload
```

**Step 3: Fix any build errors**

**Step 4: Commit**

```bash
git add src/app/\(payload\)/importMap.js
git commit -m "chore(monitoring): regenerate import map and verify build"
```

---

### Task 30: Run All Tests

```bash
pnpm test tests/unit/monitoring/
```

Fix any failures, then:

```bash
git commit -m "test(monitoring): fix test issues and verify all monitoring tests pass"
```

---

### Task 31: Update Documentation

**Files:**
- Modify: `CLAUDE.md` (add Monitoring to Subsysteme table, add collections)
- Modify: `docs/CLAUDE_REFERENCE.md` (add Monitoring section)
- Modify: `docs/PROJECT_STATUS.md` (mark as completed)

**CLAUDE.md changes:**
- Add to Subsysteme (subsystems) table: `| Monitoring & Alerting | src/lib/monitoring/, API: /api/monitoring/* | docs/CLAUDE_REFERENCE.md |`
- Add 4 collections to Collections table:
  - `monitoring-snapshots`, `monitoring-logs`, `monitoring-alert-rules`, `monitoring-alert-history`

**CLAUDE_REFERENCE.md:** Add new section with API endpoints, SSE events, env vars.

**PROJECT_STATUS.md:** Move "Monitoring & Alerting Dashboard" from Langfristig (long-term) to Abgeschlossen (completed).

**Commit:**
```bash
git add CLAUDE.md docs/CLAUDE_REFERENCE.md docs/PROJECT_STATUS.md
git commit -m "docs: add monitoring dashboard to project documentation"
```

---

## Environment Variables Summary

Add to `.env` (all optional with defaults):

```env
# Monitoring
QUEUE_ENABLE_MONITORING=true
MONITORING_SNAPSHOT_INTERVAL=60000
MONITORING_LOG_LEVEL=info
RETENTION_MONITORING_SNAPSHOTS_DAYS=7
RETENTION_MONITORING_ALERTS_DAYS=90
RETENTION_MONITORING_LOGS_DAYS=30
```

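Since every variable is optional, the code that reads them needs explicit fallbacks. A minimal sketch of such a reader follows. `readMonitoringConfig` and `MonitoringConfig` are hypothetical names, and it assumes the values listed above are also the defaults, which the plan implies but does not state.

```typescript
interface MonitoringConfig {
  enabled: boolean
  snapshotIntervalMs: number
  logLevel: string
  retentionDays: { snapshots: number; alerts: number; logs: number }
}

// Parses a positive integer, falling back when the value is missing or invalid.
function positiveInt(raw: string | undefined, fallback: number): number {
  const parsed = parseInt(raw ?? '', 10)
  return !isNaN(parsed) && parsed > 0 ? parsed : fallback
}

// Reads the monitoring settings from an env-like object.
// Assumption: the defaults equal the values shown in the .env example.
function readMonitoringConfig(env: Record<string, string | undefined>): MonitoringConfig {
  return {
    enabled: env.QUEUE_ENABLE_MONITORING !== 'false',
    snapshotIntervalMs: positiveInt(env.MONITORING_SNAPSHOT_INTERVAL, 60000),
    logLevel: env.MONITORING_LOG_LEVEL || 'info',
    retentionDays: {
      snapshots: positiveInt(env.RETENTION_MONITORING_SNAPSHOTS_DAYS, 7),
      alerts: positiveInt(env.RETENTION_MONITORING_ALERTS_DAYS, 90),
      logs: positiveInt(env.RETENTION_MONITORING_LOGS_DAYS, 30),
    },
  }
}
```

Taking the env object as a parameter (called with `process.env` in production) keeps the function testable without mutating global state.
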
---

## Task Dependency Graph

```
Phase 1 (Foundation):
  Task 1 (Types) → Task 2-5 (Collections) → Task 6 (Migration)

Phase 2 (Services):
  Task 1 → Task 7 (Health) → Task 8 (Services) → Task 9 (PerfTracker)
  Task 1 → Task 10 (Logger)
  Task 1 → Task 11 (AlertEvaluator)
  Task 7,8,9,11 → Task 12 (SnapshotCollector)
  Task 2-5 → Task 13 (Retention)

Phase 3 (APIs):
  Task 7 → Task 14 (Health API)
  Task 8 → Task 15 (Services API)
  Task 9 → Task 16 (Performance API)
  Task 5,11 → Task 17 (Alerts API)
  Task 3,10 → Task 18 (Logs API)
  Task 2 → Task 19 (Snapshots API)
  Task 7,8,9,10,11 → Task 20 (SSE Stream)

Phase 4 (UI):
  Task 21 (Registration) → Task 22 (Main Component) → Task 23 (Shared Components)
  Task 23 → Tasks 24-28 (Tab Components), can run in parallel

Phase 5 (Integration):
  All → Task 29 (Build) → Task 30 (Tests) → Task 31 (Docs)
```

---

## Estimated File Count

| Category | Files |
|----------|-------|
| Collections | 4 |
| Lib/Monitoring | 6 |
| API Routes | 8 |
| UI Components | 12 |
| Tests | 5 |
| Migrations | 1 |
| Modified | 5 (payload.config.ts, run-queue-worker.ts, ecosystem.config.cjs, retention-config.ts, access/index.ts) |
| **Total** | **~41 files** |