Monitoring & Alerts
The system provides multiple layers of monitoring to keep you informed about trading activity, system health, and risk events. Even in BOUNDED_AUTONOMOUS or FULL_AUTONOMOUS mode, you should monitor the system regularly.
Monitoring Dashboards
1. AlphaOracle Dashboard (Port 3000)
The main React dashboard at http://localhost:3000.
Pages:
- Portfolio: Real-time positions, P&L, sector exposure, equity curve
- Strategies: Strategy rankings, backtest results, signal history
- Risk: Risk limits, PDT counter, circuit breakers, kill switch
- Trades: Order history, pending approvals, execution quality metrics
- Model Health (ML): Feature importance, accuracy, drift detection, model versions
Update frequency: Real-time via WebSocket (sub-second updates for prices and orders)
Best for: Day-to-day trading monitoring, portfolio tracking, order review
2. Grafana Dashboards (Port 3001)
Advanced monitoring dashboards at http://localhost:3001.
Default dashboards:
- System Health: CPU, memory, API latency, database connections, Redis hits/misses
- Trading Performance: Total return, Sharpe ratio, win rate, profit factor, drawdown
- Risk Metrics: Position counts, sector exposure, PDT usage, circuit breaker events
- Order Metrics: Orders submitted, filled, cancelled, rejected; fill rates; slippage
Update frequency: 1-5 minute intervals (configurable)
Best for: Historical analysis, system performance tuning, debugging infrastructure issues
Default credentials:
- Username: admin
- Password: admin (change on first login)
3. Prometheus Metrics (Port 9090)
Raw metrics endpoint at http://localhost:9090.
Available metrics:
- portfolio_total_equity: Portfolio value in dollars
- portfolio_positions_count: Number of open positions
- orders_submitted_total: Counter of submitted orders
- orders_filled_total: Counter of filled orders
- api_request_duration_seconds: API endpoint latencies
- pdt_trades_used: Current PDT trade count
Best for: Custom alerting, integration with external monitoring tools, debugging
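Instant queries against the Prometheus HTTP API return results in a standard JSON envelope. The sketch below parses that envelope to pull out the latest sample for a metric such as portfolio_total_equity; the sample payload is illustrative, not real output from the system.

```python
import json

# Example instant-query response envelope from the Prometheus HTTP API
# (GET /api/v1/query?query=portfolio_total_equity). Values are illustrative.
sample_response = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {
                "metric": {"__name__": "portfolio_total_equity"},
                "value": [1767225600.0, "25314.77"],  # [unix_ts, value-as-string]
            }
        ],
    },
})

def latest_value(response_body: str) -> float:
    """Extract the most recent sample value from an instant-query response."""
    payload = json.loads(response_body)
    if payload["status"] != "success":
        raise RuntimeError(f"query failed: {payload}")
    result = payload["data"]["result"]
    if not result:
        raise LookupError("metric returned no samples")
    _timestamp, value = result[0]["value"]
    return float(value)  # Prometheus encodes sample values as strings

print(latest_value(sample_response))  # → 25314.77
```

Note that Prometheus always transmits sample values as strings, so the conversion to float is required before comparing against thresholds.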
Alert Channels
The system supports three alert channels:
1. Logs (Always On)
All events are logged to logs/backend.log with structured logging.
Log levels:
- INFO: Normal operations (orders submitted, positions updated)
- WARNING: Non-critical issues (PDT approaching limit, stale data detected)
- ERROR: Operational failures (API failures, database timeouts)
- CRITICAL: Severe issues (kill switch activated, max drawdown exceeded)
Viewing logs:
tail -f logs/backend.log # Follow live
grep "CRITICAL" logs/backend.log # Filter by level
grep "pdt" logs/backend.log # Filter by keyword
2. Slack (Optional)
Send alerts to a Slack channel.
Setup:
- Create a Slack webhook URL (https://api.slack.com/messaging/webhooks)
- Configure the webhook in your environment:
  export SA_SLACK_WEBHOOK_URL=https://hooks.slack.com/services/YOUR/WEBHOOK/URL
- Enable Slack in config/settings.yaml:
  notifications:
    enabled: true
    channels: [slack]
- Restart the backend:
  ./scripts/restart-backend.sh
Alert format:
🚨 [CRITICAL] Kill Switch Activated
Market crash detected, VIX spike to 47
Timestamp: 2026-03-12T15:34:22Z
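The alert format above maps directly onto a Slack incoming-webhook message body. This is a minimal sketch of how a notifier could render it; the function name and severity-to-emoji mapping are assumptions, not the system's actual implementation.

```python
import json

# Assumed severity-to-emoji mapping, mirroring the alert format shown above.
SEVERITY_EMOJI = {"CRITICAL": "🚨", "WARNING": "⚠️", "INFO": "ℹ️"}

def build_slack_payload(level: str, title: str, detail: str, timestamp: str) -> str:
    """Render an alert as a Slack incoming-webhook JSON body."""
    emoji = SEVERITY_EMOJI.get(level, "")
    text = f"{emoji} [{level}] {title}\n{detail}\nTimestamp: {timestamp}"
    return json.dumps({"text": text})

body = build_slack_payload(
    "CRITICAL", "Kill Switch Activated",
    "Market crash detected, VIX spike to 47", "2026-03-12T15:34:22Z")
# POST `body` to SA_SLACK_WEBHOOK_URL with Content-Type: application/json
```

Slack's incoming webhooks accept a JSON object with a `text` field, so the whole alert collapses into a single string; richer layouts would use Slack's block kit instead.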
3. Telegram (Optional)
Send alerts to a Telegram bot.
Setup:
- Create a Telegram bot via @BotFather
- Get your chat ID by messaging @userinfobot
- Configure the bot in your environment:
  export SA_TELEGRAM_BOT_TOKEN=your_bot_token
  export SA_TELEGRAM_CHAT_ID=your_chat_id
- Enable Telegram in config/settings.yaml:
  notifications:
    enabled: true
    channels: [telegram]
- Restart the backend:
  ./scripts/restart-backend.sh
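Under the hood, Telegram delivery is a call to the Bot API's sendMessage method using the token and chat ID configured above. A sketch of the request construction (the helper function is hypothetical; the token and chat ID shown are placeholders):

```python
from urllib.parse import urlencode

def telegram_send_url(bot_token: str, chat_id: str, text: str) -> str:
    """Build a Telegram Bot API sendMessage URL for delivering an alert."""
    query = urlencode({"chat_id": chat_id, "text": text})
    return f"https://api.telegram.org/bot{bot_token}/sendMessage?{query}"

url = telegram_send_url("123456:ABC-token", "987654321",
                        "[WARNING] PDT approaching limit: 2/3 day trades used")
# Fetch `url` (e.g. with urllib.request.urlopen) to deliver the message.
```

Testing the URL manually in a browser is a quick way to confirm the bot token and chat ID are correct before relying on automated delivery.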
Alert Conditions
The system sends alerts for these events:
Critical Alerts (Immediate Action Required)
- Kill switch activated: Trading halted manually
- Max drawdown exceeded: Portfolio dropped 10% from peak (trading halted)
- Max daily loss exceeded: Lost 3% in one day (trading halted)
- PDT limit reached: Used all 3 day trades (can’t day trade until window rolls)
- Position reconciliation failed: System’s positions don’t match broker’s (data integrity issue)
Warning Alerts (Monitor Closely)
- VIX spike: VIX > 35 (extreme volatility, circuit breaker active)
- Stale data detected: Price data > 5 minutes old (feed interruption)
- PDT approaching limit: 2/3 day trades used (1 remaining)
- High drawdown: Portfolio down 7-9% from peak (approaching 10% limit)
- Large position drift: Position size differs from broker by >1%
- Model staleness: ML model hasn’t retrained in 14+ days
Info Alerts (FYI)
- Circuit breaker cleared: VIX dropped below 35, trading resumed
- Data feed reconnected: Market data feed recovered
- Model retrained: New ML model deployed
- Dead man switch check: Operator heartbeat required within 48 hours
Alert Configuration
Alerts are controlled by circuit breaker settings in config/risk_limits.yaml:
circuit_breakers:
vix_threshold: 35.0
stale_data_seconds: 300
reconciliation_interval_seconds: 300
max_reconciliation_drift_pct: 1.0
dead_man_switch_hours: 48
See Risk Limits Reference for details.
Monitoring Best Practices
Daily Checks (5 minutes)
- Portfolio page: Check total equity, daily P&L, open positions
- Risk page: Verify no circuit breakers active, PDT counter normal
- Trades page: Review today’s executed orders, check for rejections
- Logs: Scan for WARNING or CRITICAL entries
Weekly Reviews (30 minutes)
- Grafana: Review weekly performance metrics (return, drawdown, Sharpe)
- Strategy page: Compare strategy performance, consider enabling/disabling strategies
- Model Health: Check ML model accuracy, feature drift
- Risk limits: Adjust limits based on observed behavior
Monthly Audits (2 hours)
- Performance analysis: Compare to benchmarks (S&P 500), calculate returns
- Risk analysis: Review max drawdown, worst days, correlations
- Strategy tuning: Backtest on recent data, adjust parameters
- System health: Review Grafana system metrics, database growth, API latencies
Autonomous Mode Monitoring
Even in BOUNDED_AUTONOMOUS mode:
- Check dashboard at least once per day
- Review weekly performance reports
- Monitor Slack/Telegram alerts in real-time
- Investigate any WARNING or CRITICAL alerts immediately
Autonomous doesn’t mean unattended. You remain responsible for system oversight.
Health Check Interval
The system runs health checks every 60 seconds (configurable in config/settings.yaml):
monitoring:
health_check_interval_seconds: 60
Health checks verify:
- Database connectivity
- Redis connectivity
- IBKR connection status (if provider is ibkr)
- Market data feed status
- Position reconciliation
Failed health checks trigger alerts.
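The health-check cycle described above amounts to running each named check and alerting on failures. A minimal sketch, assuming a registry of callables (the check names and structure are hypothetical, not the system's actual internals):

```python
from typing import Callable

def run_health_checks(checks: dict[str, Callable[[], bool]]) -> list[str]:
    """Run every registered check once; return the names of failed checks."""
    failed = []
    for name, check in checks.items():
        try:
            healthy = check()
        except Exception:
            healthy = False  # an exception counts as a failed check
        if not healthy:
            failed.append(name)
    return failed

# Simulated run with one outage; the real system repeats this every
# health_check_interval_seconds and alerts on each failure.
failures = run_health_checks({
    "database": lambda: True,
    "redis": lambda: True,
    "market_data_feed": lambda: False,  # simulated feed outage
})
print(failures)  # → ['market_data_feed']
```

Catching exceptions inside the loop matters: a connectivity check that throws (e.g. a database timeout) must register as a failure rather than crash the monitoring loop itself.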
Integration with External Tools
Export Metrics to External Monitoring
Prometheus metrics can be scraped by:
- Datadog: Use Prometheus integration
- Grafana Cloud: Add Prometheus data source
- CloudWatch: Use Prometheus exporter for AWS
Custom Alerting
Create custom Prometheus alert rules in config/prometheus/alerts.yml:
groups:
- name: trading_alerts
rules:
- alert: HighDrawdown
expr: portfolio_drawdown_pct > 8
for: 5m
annotations:
summary: "Portfolio drawdown exceeds 8%"
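The same rule file can watch other exported metrics. As a sketch, a rule on pdt_trades_used (from the metrics list above) could mirror the built-in "PDT approaching limit" warning; the rule name and thresholds here are illustrative:

```yaml
groups:
  - name: trading_alerts
    rules:
      - alert: PDTNearLimit
        expr: pdt_trades_used >= 2
        for: 1m
        annotations:
          summary: "2 of 3 PDT day trades used"
```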
Troubleshooting
Alert Not Received
- Check logs: Verify the alert was generated (grep "alert" logs/backend.log)
- Check notification config: Ensure notifications.enabled: true in settings.yaml
- Check webhook URL: Test the Slack/Telegram webhook manually
- Check environment variables: Verify SA_SLACK_WEBHOOK_URL or SA_TELEGRAM_BOT_TOKEN is set
Dashboard Not Updating
- Check WebSocket connection: Browser console should show WebSocket connected
- Check Redis: Ensure Redis is running (redis-cli ping should return PONG)
- Check backend logs: Look for WebSocket errors in logs/backend.log
- Refresh the page: If the WebSocket disconnects, refreshing reconnects it
Grafana Not Showing Data
- Check Prometheus: Visit http://localhost:9090/targets; all targets should show “UP”
- Check data source: In Grafana → Configuration → Data Sources, Prometheus should be connected
- Check metrics: Run query in Prometheus UI to verify metrics exist
Related Topics
- Risk Management — Circuit breakers and alert triggers
- Kill Switch — Emergency halt procedure
- Autonomy Modes — Monitoring requirements for each mode
- Application Settings — Configure health check interval and channels