Monitoring & Alerts
The system provides multiple layers of monitoring to keep you informed about trading activity, system health, and risk events. Even in BOUNDED_AUTONOMOUS or FULL_AUTONOMOUS mode, you should monitor the system regularly.
Monitoring Dashboards
1. AlphaOracle Dashboard (Port 3000)
The main React dashboard at http://localhost:3000.
Pages:
- Portfolio: Real-time positions, P&L, sector exposure, equity curve
- Strategies: Strategy rankings, backtest results, signal history
- Risk: Risk limits, PDT counter, circuit breakers, kill switch
- Trades: Order history, pending approvals, execution quality metrics
- Model Health (ML): Feature importance, accuracy, drift detection, model versions
Update frequency: Real-time via WebSocket (sub-second updates for prices and orders)
Best for: Day-to-day trading monitoring, portfolio tracking, order review
2. Grafana Dashboards (Port 3001)
Advanced monitoring dashboards at http://localhost:3001.
Default dashboards:
- System Health: CPU, memory, API latency, database connections, Redis hits/misses
- Trading Performance: Total return, Sharpe ratio, win rate, profit factor, drawdown
- Risk Metrics: Position counts, sector exposure, PDT usage, circuit breaker events
- Order Metrics: Orders submitted, filled, cancelled, rejected; fill rates; slippage
Update frequency: 1-5 minute intervals (configurable)
Best for: Historical analysis, system performance tuning, debugging infrastructure issues
Default credentials:
- Username: admin
- Password: admin (change on first login)
3. Prometheus Metrics (Port 9090)
Raw metrics endpoint at http://localhost:9090.
Available metrics:
- portfolio_total_equity: Portfolio value in dollars
- portfolio_positions_count: Number of open positions
- orders_submitted_total: Counter of submitted orders
- orders_filled_total: Counter of filled orders
- api_request_duration_seconds: API endpoint latencies
- pdt_trades_used: Current PDT trade count
Best for: Custom alerting, integration with external monitoring tools, debugging
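Instant queries against the Prometheus HTTP API return results in a standard JSON envelope. The sketch below parses that envelope to pull out the latest sample for a metric such as portfolio_total_equity; the sample payload is illustrative, not real output from the system.

```python
import json

# Example instant-query response envelope from the Prometheus HTTP API
# (GET /api/v1/query?query=portfolio_total_equity). Values are illustrative.
sample_response = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {
                "metric": {"__name__": "portfolio_total_equity"},
                "value": [1767225600.0, "25314.77"],  # [unix_ts, value-as-string]
            }
        ],
    },
})

def latest_value(response_body: str) -> float:
    """Extract the most recent sample value from an instant-query response."""
    payload = json.loads(response_body)
    if payload["status"] != "success":
        raise RuntimeError(f"query failed: {payload}")
    result = payload["data"]["result"]
    if not result:
        raise LookupError("metric returned no samples")
    _timestamp, value = result[0]["value"]
    return float(value)  # Prometheus encodes sample values as strings

print(latest_value(sample_response))  # → 25314.77
```

Note that Prometheus always transmits sample values as strings, so the conversion to float is required before comparing against thresholds.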
Alert Channels
The system supports three alert channels:
1. Logs (Always On)
All events are logged to logs/backend.log with structured logging.
Log levels:
- INFO: Normal operations (orders submitted, positions updated)
- WARNING: Non-critical issues (PDT approaching limit, stale data detected)
- ERROR: Operational failures (API failures, database timeouts)
- CRITICAL: Severe issues (kill switch activated, max drawdown exceeded)
Viewing logs:
tail -f logs/backend.log # Follow live
grep "CRITICAL" logs/backend.log # Filter by level
grep "pdt" logs/backend.log # Filter by keyword
2. Slack (Optional)
Send alerts to a Slack channel.
Setup:
- Create a Slack webhook URL (https://api.slack.com/messaging/webhooks)
- Configure the webhook in your environment:
  export SA_SLACK_WEBHOOK_URL=https://hooks.slack.com/services/YOUR/WEBHOOK/URL
- Enable Slack in config/settings.yaml:
  notifications:
    enabled: true
    channels: [slack]
- Restart the backend:
  ./scripts/restart-backend.sh
Alert format:
🚨 [CRITICAL] Kill Switch Activated
Market crash detected, VIX spike to 47
Timestamp: 2026-03-12T15:34:22Z
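The alert format above maps directly onto a Slack incoming-webhook message body. This is a minimal sketch of how a notifier could render it; the function name and severity-to-emoji mapping are assumptions, not the system's actual implementation.

```python
import json

# Assumed severity-to-emoji mapping, mirroring the alert format shown above.
SEVERITY_EMOJI = {"CRITICAL": "🚨", "WARNING": "⚠️", "INFO": "ℹ️"}

def build_slack_payload(level: str, title: str, detail: str, timestamp: str) -> str:
    """Render an alert as a Slack incoming-webhook JSON body."""
    emoji = SEVERITY_EMOJI.get(level, "")
    text = f"{emoji} [{level}] {title}\n{detail}\nTimestamp: {timestamp}"
    return json.dumps({"text": text})

body = build_slack_payload(
    "CRITICAL", "Kill Switch Activated",
    "Market crash detected, VIX spike to 47", "2026-03-12T15:34:22Z")
# POST `body` to SA_SLACK_WEBHOOK_URL with Content-Type: application/json
```

Slack's incoming webhooks accept a JSON object with a `text` field, so the whole alert collapses into a single string; richer layouts would use Slack's block kit instead.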
3. Telegram (Optional)
Send alerts to a Telegram bot.
Setup:
- Create a Telegram bot via @BotFather
- Get your chat ID by messaging @userinfobot
- Configure the bot in your environment:
  export SA_TELEGRAM_BOT_TOKEN=your_bot_token
  export SA_TELEGRAM_CHAT_ID=your_chat_id
- Enable Telegram in config/settings.yaml:
  notifications:
    enabled: true
    channels: [telegram]
- Restart the backend:
  ./scripts/restart-backend.sh
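Under the hood, Telegram delivery is a call to the Bot API's sendMessage method using the token and chat ID configured above. A sketch of the request construction (the helper function is hypothetical; the token and chat ID shown are placeholders):

```python
from urllib.parse import urlencode

def telegram_send_url(bot_token: str, chat_id: str, text: str) -> str:
    """Build a Telegram Bot API sendMessage URL for delivering an alert."""
    query = urlencode({"chat_id": chat_id, "text": text})
    return f"https://api.telegram.org/bot{bot_token}/sendMessage?{query}"

url = telegram_send_url("123456:ABC-token", "987654321",
                        "[WARNING] PDT approaching limit: 2/3 day trades used")
# Fetch `url` (e.g. with urllib.request.urlopen) to deliver the message.
```

Testing the URL manually in a browser is a quick way to confirm the bot token and chat ID are correct before relying on automated delivery.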
Alert Conditions
The system sends alerts for these events:
Critical Alerts (Immediate Action Required)
- Kill switch activated: Trading halted manually
- Max drawdown exceeded: Portfolio dropped 10% from peak (trading halted)
- Max daily loss exceeded: Lost 3% in one day (trading halted)
- PDT limit reached: Used all 3 day trades (can’t day trade until window rolls)
- Position reconciliation failed: System’s positions don’t match broker’s (data integrity issue)
Warning Alerts (Monitor Closely)
- VIX spike: VIX > 35 (extreme volatility, circuit breaker active)
- Stale data detected: Price data > 5 minutes old (feed interruption)
- PDT approaching limit: 2/3 day trades used (1 remaining)
- High drawdown: Portfolio down 7-9% from peak (approaching 10% limit)
- Large position drift: Position size differs from broker by >1%
- Model staleness: ML model hasn’t retrained in 14+ days
Info Alerts (FYI)
- Circuit breaker cleared: VIX dropped below 35, trading resumed
- Data feed reconnected: Market data feed recovered
- Model retrained: New ML model deployed
- Dead man switch check: Operator heartbeat required within 48 hours
Alert Configuration
Alerts are controlled by circuit breaker settings in config/risk_limits.yaml:
circuit_breakers:
vix_threshold: 35.0
stale_data_seconds: 300
reconciliation_interval_seconds: 300
max_reconciliation_drift_pct: 1.0
dead_man_switch_hours: 48
See Risk Limits Reference for details.
Monitoring Best Practices
Daily Checks (5 minutes)
- Portfolio page: Check total equity, daily P&L, open positions
- Risk page: Verify no circuit breakers active, PDT counter normal
- Trades page: Review today’s executed orders, check for rejections
- Logs: Scan for WARNING or CRITICAL entries
Weekly Reviews (30 minutes)
- Grafana: Review weekly performance metrics (return, drawdown, Sharpe)
- Strategy page: Compare strategy performance, consider enabling/disabling strategies
- Model Health: Check ML model accuracy, feature drift
- Risk limits: Adjust limits based on observed behavior
Monthly Audits (2 hours)
- Performance analysis: Compare to benchmarks (S&P 500), calculate returns
- Risk analysis: Review max drawdown, worst days, correlations
- Strategy tuning: Backtest on recent data, adjust parameters
- System health: Review Grafana system metrics, database growth, API latencies
Autonomous Mode Monitoring
Even in BOUNDED_AUTONOMOUS mode:
- Check dashboard at least once per day
- Review weekly performance reports
- Monitor Slack/Telegram alerts in real-time
- Investigate any WARNING or CRITICAL alerts immediately
Autonomous doesn’t mean unattended. You remain responsible for system oversight.
Health Check Interval
The system runs health checks every 60 seconds (configurable in config/settings.yaml):
monitoring:
health_check_interval_seconds: 60
Health checks verify:
- Database connectivity
- Redis connectivity
- IBKR connection status (if provider is ibkr)
- Market data feed status
- Position reconciliation
Failed health checks trigger alerts.
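The health-check cycle described above amounts to running each named check and alerting on failures. A minimal sketch, assuming a registry of callables (the check names and structure are hypothetical, not the system's actual internals):

```python
from typing import Callable

def run_health_checks(checks: dict[str, Callable[[], bool]]) -> list[str]:
    """Run every registered check once; return the names of failed checks."""
    failed = []
    for name, check in checks.items():
        try:
            healthy = check()
        except Exception:
            healthy = False  # an exception counts as a failed check
        if not healthy:
            failed.append(name)
    return failed

# Simulated run with one outage; the real system repeats this every
# health_check_interval_seconds and alerts on each failure.
failures = run_health_checks({
    "database": lambda: True,
    "redis": lambda: True,
    "market_data_feed": lambda: False,  # simulated feed outage
})
print(failures)  # → ['market_data_feed']
```

Catching exceptions inside the loop matters: a connectivity check that throws (e.g. a database timeout) must register as a failure rather than crash the monitoring loop itself.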
Integration with External Tools
Export Metrics to External Monitoring
Prometheus metrics can be scraped by:
- Datadog: Use Prometheus integration
- Grafana Cloud: Add Prometheus data source
- CloudWatch: Use Prometheus exporter for AWS
Custom Alerting
Create custom Prometheus alert rules in config/prometheus/alerts.yml:
groups:
- name: trading_alerts
rules:
- alert: HighDrawdown
expr: portfolio_drawdown_pct > 8
for: 5m
annotations:
summary: "Portfolio drawdown exceeds 8%"
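The same rule file can watch other exported metrics. As a sketch, a rule on pdt_trades_used (from the metrics list above) could mirror the built-in "PDT approaching limit" warning; the rule name and thresholds here are illustrative:

```yaml
groups:
  - name: trading_alerts
    rules:
      - alert: PDTNearLimit
        expr: pdt_trades_used >= 2
        for: 1m
        annotations:
          summary: "2 of 3 PDT day trades used"
```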
Troubleshooting
Alert Not Received
- Check logs: Verify the alert was generated (grep "alert" logs/backend.log)
- Check notification config: Ensure notifications.enabled: true in settings.yaml
- Check webhook URL: Test the Slack/Telegram webhook manually
- Check environment variables: Verify SA_SLACK_WEBHOOK_URL or SA_TELEGRAM_BOT_TOKEN is set
Dashboard Not Updating
- Check WebSocket connection: Browser console should show WebSocket connected
- Check Redis: Ensure Redis is running (redis-cli ping should return PONG)
- Check backend logs: Look for WebSocket errors in logs/backend.log
- Refresh the page: If the WebSocket disconnects, refreshing reconnects it
Grafana Not Showing Data
- Check Prometheus: Visit http://localhost:9090/targets; all targets should show “UP”
- Check data source: In Grafana → Configuration → Data Sources, Prometheus should be connected
- Check metrics: Run query in Prometheus UI to verify metrics exist
Related Topics
- Risk Management — Circuit breakers and alert triggers
- Kill Switch — Emergency halt procedure
- Autonomy Modes — Monitoring requirements for each mode
- Application Settings — Configure health check interval and channels