Monitoring Module
The src/monitoring/ module implements observability for the alpha-oracle system via Prometheus metrics (counters, gauges, histograms), AlertManager for Slack/Telegram notifications, and Grafana dashboards. All components emit structured logs via structlog for centralized analysis.
Purpose
Monitoring provides:
- Prometheus metrics on port 8001 for trading activity, portfolio state, risk events, and system health
- AlertManager with Slack and Telegram channels for critical alerts
- Grafana dashboards pre-configured via provisioning
- Structured logging via structlog with JSON output for parsing
- Health check metrics for broker, feed, database, and Redis connectivity
Key Components
Prometheus Metrics
TradingMetrics (src/monitoring/metrics.py)
Centralized Prometheus metrics registry.
Metric Types:
| Type | Purpose | Example |
|---|---|---|
| Counter | Monotonically increasing count | trading_orders_total |
| Gauge | Current value (can go up/down) | trading_portfolio_equity_dollars |
| Histogram | Distribution of values | trading_trade_pnl_dollars |
| Info | Static metadata | trading_system_info (version, environment) |
System Metrics:
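from prometheus_client import Counter, Gauge, Histogram, Info

# Static metadata, exposed as the trading_system_info series
system_info = Info("trading_system", "Trading system information")
system_info.info({
    "version": "1.0.0",
    "environment": "production",
    "broker": "ibkr",
})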
Trading Metrics:
# Order counters (labels: side, order_type, strategy, status)
orders_total = Counter("trading_orders_total", "Total orders submitted", ["side", "order_type", "strategy", "status"])
orders_total.labels(side="BUY", order_type="LIMIT", strategy="SwingMomentum", status="FILLED").inc()
# Trade counters (labels: side, strategy)
trades_total = Counter("trading_trades_total", "Total trades executed", ["side", "strategy"])
trades_total.labels(side="BUY", strategy="SwingMomentum").inc()
# Trade P&L histogram (labels: strategy)
trade_pnl = Histogram("trading_trade_pnl_dollars", "Trade P&L in dollars", ["strategy"], buckets=[-1000, -500, -200, -100, -50, -20, 0, 20, 50, 100, 200, 500, 1000])
trade_pnl.labels(strategy="SwingMomentum").observe(150.50)
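Prometheus histogram buckets are cumulative: each bucket counts every observation less than or equal to its upper bound, and an implicit +Inf bucket counts everything. The sketch below reproduces that counting in plain Python for the P&L buckets above. The helper name cumulative_bucket_counts is hypothetical and is not part of TradingMetrics.

```python
from bisect import bisect_left

# Upper bounds copied from the trade_pnl histogram definition above.
PNL_BUCKETS = [-1000, -500, -200, -100, -50, -20, 0, 20, 50, 100, 200, 500, 1000]


def cumulative_bucket_counts(observations, upper_bounds):
    """Return {upper_bound: count} using Prometheus "le" semantics.

    Every bucket counts all observations <= its bound; the extra
    +Inf bucket therefore equals the total number of observations.
    """
    bounds = list(upper_bounds) + [float("inf")]
    counts = [0] * len(bounds)
    for value in observations:
        first = bisect_left(bounds, value)  # first bucket whose bound >= value
        for i in range(first, len(bounds)):
            counts[i] += 1
    return dict(zip(bounds, counts))


counts = cumulative_bucket_counts([150.50, -75.0, 20.0], PNL_BUCKETS)
print(counts[100], counts[200], counts[float("inf")])
```

A trade of 150.50 first appears in the 200 bucket, so any quantile Grafana derives from these buckets for that trade can only be resolved to the 100 to 200 range.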
Portfolio Metrics:
# Gauges (values updated on each portfolio snapshot)
portfolio_equity = Gauge("trading_portfolio_equity_dollars", "Total portfolio equity")
portfolio_equity.set(105432.50)
portfolio_cash = Gauge("trading_portfolio_cash_dollars", "Available cash")
portfolio_cash.set(52716.25)
portfolio_positions_count = Gauge("trading_portfolio_positions_count", "Number of open positions")
portfolio_positions_count.set(8)
portfolio_daily_pnl = Gauge("trading_portfolio_daily_pnl_dollars", "Daily P&L in dollars")
portfolio_daily_pnl.set(543.20)
portfolio_daily_pnl_pct = Gauge("trading_portfolio_daily_pnl_pct", "Daily P&L percentage")
portfolio_daily_pnl_pct.set(0.52)
portfolio_drawdown_pct = Gauge("trading_portfolio_drawdown_pct", "Current drawdown percentage")
portfolio_drawdown_pct.set(3.2)
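The percentage gauges above are derived values. The source does not show how the portfolio monitor computes them, so the following is a minimal sketch under two assumptions: daily P&L is measured against start-of-day equity, and drawdown is measured against the running equity peak. Both function names are hypothetical.

```python
def daily_pnl_pct(equity, daily_pnl):
    """Daily P&L as a percentage of start-of-day equity (assumed base)."""
    start_of_day_equity = equity - daily_pnl
    if start_of_day_equity <= 0:
        raise ValueError("start-of-day equity must be positive")
    return daily_pnl / start_of_day_equity * 100.0


def drawdown_pct(peak_equity, equity):
    """Percentage decline from the running equity peak (0 at a new high)."""
    if peak_equity <= 0:
        raise ValueError("peak equity must be positive")
    return max(0.0, (peak_equity - equity) / peak_equity * 100.0)


# Values taken from the gauge examples above.
print(round(daily_pnl_pct(105432.50, 543.20), 2))  # 0.52
```

Under these assumptions the example equity and daily P&L reproduce the 0.52 shown for trading_portfolio_daily_pnl_pct.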
Risk Metrics:
# Risk check counters (labels: result = approve/reject/require_approval/reduce_size)
risk_checks_total = Counter("trading_risk_checks_total", "Total risk checks performed", ["result"])
risk_checks_total.labels(result="approve").inc()
# PDT guard gauge
pdt_trades_used = Gauge("trading_pdt_trades_used", "Day trades used in rolling 5-day window (max 3)")
pdt_trades_used.set(2)
# Circuit breakers gauge
circuit_breakers_tripped = Gauge("trading_circuit_breakers_tripped", "Number of circuit breakers currently tripped")
circuit_breakers_tripped.set(0)
# Kill switch gauge (1=active, 0=inactive)
kill_switch_active = Gauge("trading_kill_switch_active", "Kill switch status")
kill_switch_active.set(0)
Execution Metrics:
# Fill latency histogram
fill_latency_ms = Histogram("trading_fill_latency_ms", "Time from signal to fill in milliseconds", buckets=[100, 500, 1000, 2000, 5000, 10000])
fill_latency_ms.observe(1250)
# Slippage histogram
slippage_bps = Histogram("trading_slippage_bps", "Execution slippage in basis points", buckets=[0, 1, 2, 5, 10, 20, 50])
slippage_bps.observe(3.3)
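Slippage in basis points is the relative difference between the expected price and the fill price, multiplied by 10,000. The sign convention below (positive means a worse fill for the trader) is an assumption, as is the function name; the execution engine's actual calculation is not shown in this document.

```python
def slippage_in_bps(side, expected_price, fill_price):
    """Signed slippage in basis points; positive means a worse fill."""
    if expected_price <= 0:
        raise ValueError("expected_price must be positive")
    # Paying more hurts a BUY; receiving less hurts a SELL.
    direction = 1.0 if side.upper() == "BUY" else -1.0
    return direction * (fill_price - expected_price) / expected_price * 10_000.0


print(round(slippage_in_bps("BUY", 150.00, 150.05), 1))  # 3.3
```

With buckets starting at 0, any favorable (negative) slippage lands in the first bucket, so the histogram as defined cannot distinguish how favorable a fill was.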
ML Metrics:
# Model accuracy gauge
ml_model_accuracy = Gauge("trading_ml_model_accuracy", "Current ML model accuracy")
ml_model_accuracy.set(0.58)
# PSI drift gauge
ml_psi_drift = Gauge("trading_ml_psi_drift", "Population Stability Index (model drift)")
ml_psi_drift.set(0.12)
Metrics Endpoint:
Prometheus scrapes metrics from http://localhost:8001/metrics.
Configuration (config/settings.yaml):
monitoring:
  prometheus_port: 8001
  health_check_interval_seconds: 60
Start Metrics Server:
from prometheus_client import start_http_server
start_http_server(8001) # Called on API startup
AlertManager
AlertManager (src/monitoring/alerts.py)
Routes alerts to multiple channels: log, Slack, Telegram.
Severity Levels:
- CRITICAL: System failures, kill switch activation, account violations
- WARNING: Risk thresholds breached (drawdown > 5%, PDT approaching limit)
- INFO: Routine events (job completions, model retraining)
Channels:
- log: Always enabled (structlog)
- slack: Webhook URL required
- telegram: Bot token + chat ID required
Configuration:
from datetime import datetime

from src.monitoring.alerts import AlertManager, AlertSeverity

alert_manager = AlertManager(channels=["log", "slack"])
alert_manager.configure_slack(webhook_url="https://hooks.slack.com/services/...")

# send_alert is awaited, so this call must run inside an async function
await alert_manager.send_alert(
    severity=AlertSeverity.CRITICAL,
    title="Kill Switch Activated",
    message="Trading halted due to circuit breaker: VIX > 35",
    metadata={"vix": 38.5, "timestamp": datetime.utcnow().isoformat()},
)
Slack Webhook Payload:
{
  "text": "⚠️ CRITICAL: Kill Switch Activated",
  "blocks": [
    {
      "type": "section",
      "text": {
        "type": "mrkdwn",
        "text": "*Kill Switch Activated*\n\nTrading halted due to circuit breaker: VIX > 35\n\n*Metadata:*\n• VIX: 38.5\n• Timestamp: 2024-03-15T14:32:00Z"
      }
    }
  ]
}
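A payload of this shape can be assembled from the alert fields with a small helper. This is a sketch only: build_slack_payload is a hypothetical name, not the AlertManager's actual method, and the emoji mapping covers only the CRITICAL pairing visible in the example.

```python
import json

# Only the CRITICAL pairing appears in the example payload; other
# severities fall back to no emoji in this sketch.
SEVERITY_EMOJI = {"CRITICAL": "⚠️"}


def build_slack_payload(severity, title, message, metadata=None):
    """Build a webhook payload with a fallback text and one mrkdwn section."""
    lines = [f"*{title}*", "", message]
    if metadata:
        lines += ["", "*Metadata:*"]
        lines += [f"• {key}: {value}" for key, value in metadata.items()]
    header = f"{SEVERITY_EMOJI.get(severity, '')} {severity}: {title}".strip()
    return {
        "text": header,
        "blocks": [
            {"type": "section", "text": {"type": "mrkdwn", "text": "\n".join(lines)}}
        ],
    }


payload = build_slack_payload(
    "CRITICAL",
    "Kill Switch Activated",
    "Trading halted due to circuit breaker: VIX > 35",
    {"VIX": 38.5, "Timestamp": "2024-03-15T14:32:00Z"},
)
print(json.dumps(payload, ensure_ascii=False, indent=2))
```

The top-level text field serves as the notification fallback, while the blocks array carries the formatted body.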
Telegram Message:
🔴 CRITICAL: Kill Switch Activated
Trading halted due to circuit breaker: VIX > 35
Metadata:
• VIX: 38.5
• Timestamp: 2024-03-15T14:32:00Z
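Telegram delivery goes through the Bot API sendMessage method, which takes the chat ID and the message text. The sketch below only builds the HTTP request so it can be inspected without network access; build_telegram_request is a hypothetical helper, and the transport the AlertManager actually uses is not shown in this document.

```python
import json
import urllib.request


def build_telegram_request(bot_token, chat_id, text):
    """Build (but do not send) a Bot API sendMessage request."""
    url = f"https://api.telegram.org/bot{bot_token}/sendMessage"
    body = json.dumps({"chat_id": chat_id, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder credentials; real values come from TELEGRAM_BOT_TOKEN
# and TELEGRAM_CHAT_ID.
request = build_telegram_request(
    "123456:TEST-TOKEN",
    "-1001234567890",
    "🔴 CRITICAL: Kill Switch Activated\nTrading halted due to circuit breaker: VIX > 35",
)
print(request.get_method(), request.full_url)
```

Sending would be a call such as urllib.request.urlopen(request, timeout=10); inside the async send_alert path, a non-blocking HTTP client would be the more natural fit.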
Alert Triggers:
- PDT Warning: 2 day trades used (1 remaining).
- PDT Critical: 3 day trades used (limit reached).
- Drawdown Warning: Drawdown > 5%.
- Drawdown Critical: Drawdown > 8% (circuit breaker trips at 10%).
- Daily Loss Warning: Daily P&L < -2%.
- Daily Loss Critical: Daily P&L < -2.5% (circuit breaker trips at -3%).
- Model Drift Warning: PSI > 0.1.
- Model Drift Critical: PSI > 0.2 (retraining required).
- Kill Switch Activation: Any kill switch trigger.
- Broker Disconnect: IBKR connection lost.
- Feed Disconnect: Market feed connection lost.
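The threshold-based triggers above can be expressed as a small rule table. This is an illustrative sketch: the metric keys, the RULES table, and classify are hypothetical names, and only the numeric thresholds come from the list.

```python
import operator

# metric key -> (comparison, warning threshold, critical threshold)
RULES = {
    "drawdown_pct": (operator.gt, 5.0, 8.0),
    "daily_pnl_pct": (operator.lt, -2.0, -2.5),
    "psi": (operator.gt, 0.1, 0.2),
    "pdt_trades_used": (operator.ge, 2, 3),
}


def classify(metric, value):
    """Return "CRITICAL", "WARNING", or None for a single metric reading."""
    breached, warning_level, critical_level = RULES[metric]
    # Check the stricter threshold first so a critical value is never
    # reported as a mere warning.
    if breached(value, critical_level):
        return "CRITICAL"
    if breached(value, warning_level):
        return "WARNING"
    return None


for metric, value in [("drawdown_pct", 6.0), ("daily_pnl_pct", -2.7), ("psi", 0.12)]:
    print(metric, value, classify(metric, value))
```

Event-style triggers (kill switch activation, broker or feed disconnects) have no threshold and would be raised directly by the component that detects them.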
Grafana Dashboards
Grafana (port 3001) is pre-configured with dashboards via provisioning.
Dashboards:
- Trading Overview:
  - Portfolio equity over time
  - Daily P&L (bar chart)
  - Open positions count
  - Order success rate
- Risk Metrics:
  - PDT trades used (gauge)
  - Circuit breaker states (status panel)
  - Kill switch status (indicator)
  - Drawdown chart
- Execution Quality:
  - Slippage distribution (histogram)
  - Fill latency distribution (histogram)
  - Order type breakdown (pie chart)
- ML Model Performance:
  - Model accuracy over time
  - PSI drift over time
  - Feature importance (bar chart)
  - Prediction confidence distribution
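Several of these panels map directly onto the metrics defined earlier. The PromQL expressions below are plausible queries for a few of them, not the contents of the provisioned dashboards, which this document does not show. The 5m rate window, the panel keys, and the instant_query_url helper are assumptions; /api/v1/query is the standard Prometheus HTTP API endpoint for instant queries.

```python
from urllib.parse import urlencode

# Hypothetical panel -> PromQL mapping built from the metric names above.
PANEL_QUERIES = {
    "Order success rate": (
        'sum(rate(trading_orders_total{status="FILLED"}[5m]))'
        " / sum(rate(trading_orders_total[5m]))"
    ),
    "Fill latency p95 (ms)": (
        "histogram_quantile(0.95, "
        "sum by (le) (rate(trading_fill_latency_ms_bucket[5m])))"
    ),
    "Order type breakdown": "sum by (order_type) (trading_orders_total)",
    "Drawdown": "trading_portfolio_drawdown_pct",
    "Kill switch": "trading_kill_switch_active",
}


def instant_query_url(base_url, expr):
    """Build a Prometheus instant-query URL for a PromQL expression."""
    return base_url.rstrip("/") + "/api/v1/query?" + urlencode({"query": expr})


print(instant_query_url("http://prometheus:9090", PANEL_QUERIES["Drawdown"]))
```

Histogram panels query the _bucket series, which is why the latency expression aggregates by the le label before calling histogram_quantile.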
Provisioning (config/grafana/provisioning/dashboards/):
# dashboard.yml
apiVersion: 1
providers:
  - name: 'default'
    orgId: 1
    folder: ''
    type: file
    disableDeletion: false
    updateIntervalSeconds: 10
    allowUiUpdates: true
    options:
      path: /etc/grafana/provisioning/dashboards
Data Source (config/grafana/provisioning/datasources/prometheus.yml):
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: true
Structured Logging
structlog Configuration
All modules use structlog for structured JSON logging.
Configuration (src/api/main.py):
import structlog
structlog.configure(
    processors=[
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.stdlib.add_log_level,
        structlog.processors.StackInfoRenderer(),
        structlog.processors.format_exc_info,
        structlog.processors.JSONRenderer(),
    ],
    wrapper_class=structlog.stdlib.BoundLogger,
    logger_factory=structlog.stdlib.LoggerFactory(),
    cache_logger_on_first_use=True,
)
Usage:
import structlog
logger = structlog.get_logger(__name__)
logger.info("order_submitted", symbol="AAPL", order_id="abc123", quantity=100, price=150.00)
logger.warning("pdt_day_trade_recorded", symbol="AAPL", day_trades_used=3)
logger.error("order_rejected", symbol="AAPL", reason="max_positions_exceeded", max_positions=20)
Output (JSON):
{"event": "order_submitted", "symbol": "AAPL", "order_id": "abc123", "quantity": 100, "price": 150.0, "timestamp": "2024-03-15T14:32:15.123456Z", "level": "info"}
{"event": "pdt_day_trade_recorded", "symbol": "AAPL", "day_trades_used": 3, "timestamp": "2024-03-15T14:35:20.987654Z", "level": "warning"}
{"event": "order_rejected", "symbol": "AAPL", "reason": "max_positions_exceeded", "max_positions": 20, "timestamp": "2024-03-15T14:40:10.456789Z", "level": "error"}
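Because every line is a standalone JSON object, downstream analysis needs nothing beyond a JSON parser. As a small illustration, the sketch below filters log lines by minimum level; iter_events and LEVEL_ORDER are hypothetical names, and the sample lines are abbreviated from the output above.

```python
import json

LEVEL_ORDER = {"debug": 10, "info": 20, "warning": 30, "error": 40, "critical": 50}

SAMPLE_LINES = [
    '{"event": "order_submitted", "symbol": "AAPL", "quantity": 100, "level": "info"}',
    '{"event": "pdt_day_trade_recorded", "symbol": "AAPL", "day_trades_used": 3, "level": "warning"}',
    '{"event": "order_rejected", "symbol": "AAPL", "reason": "max_positions_exceeded", "level": "error"}',
    "not json: a stray line from another process",
]


def iter_events(lines, min_level="warning"):
    """Yield parsed log records at or above min_level, skipping bad lines."""
    threshold = LEVEL_ORDER[min_level]
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate non-JSON lines mixed into the file
        if not isinstance(record, dict):
            continue
        if LEVEL_ORDER.get(record.get("level"), 0) >= threshold:
            yield record


events = [record["event"] for record in iter_events(SAMPLE_LINES)]
print(events)
```

The same loop works on a file handle, for example open("logs/backend.log"), since it consumes any iterable of lines.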
Log Files:
- Backend: logs/backend.log
- Frontend: logs/frontend.log
Log Rotation:
- Size-based: 100 MB per file
- Retention: 10 files
- Compression: gzip
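One way to get this rotation policy from the standard library is a RotatingFileHandler with a gzip rotator and namer. This document does not say how rotation is actually implemented (it could equally be logrotate or a container log driver), so treat the following as a sketch; build_rotating_handler and the two helper functions are hypothetical names.

```python
import gzip
import logging
import logging.handlers
import os
import shutil


def gzip_rotator(source, dest):
    """Compress the rotated file into dest and remove the original."""
    with open(source, "rb") as src, gzip.open(dest, "wb") as dst:
        shutil.copyfileobj(src, dst)
    os.remove(source)


def gzip_namer(default_name):
    """Append .gz so backups become backend.log.1.gz, backend.log.2.gz, ..."""
    return default_name + ".gz"


def build_rotating_handler(path, max_bytes=100 * 1024 * 1024, backup_count=10):
    """100 MB per file, 10 compressed backups, matching the policy above."""
    handler = logging.handlers.RotatingFileHandler(
        path, maxBytes=max_bytes, backupCount=backup_count
    )
    handler.rotator = gzip_rotator
    handler.namer = gzip_namer
    return handler
```

A handler built this way can be attached to the standard-library root logger, which is where structlog's LoggerFactory sends its rendered JSON lines.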
Prometheus Configuration
config/prometheus.yml
Prometheus scrape configuration.
global:
  scrape_interval: 15s
  evaluation_interval: 15s
scrape_configs:
  - job_name: 'alpha-oracle-api'
    static_configs:
      - targets: ['host.docker.internal:8000']  # FastAPI main app
        labels:
          service: 'api'
  - job_name: 'alpha-oracle-metrics'
    static_configs:
      - targets: ['host.docker.internal:8001']  # Prometheus metrics endpoint
        labels:
          service: 'metrics'
  - job_name: 'redis'
    static_configs:
      - targets: ['redis:6379']
        labels:
          service: 'redis'
  - job_name: 'timescaledb'
    static_configs:
      - targets: ['timescaledb:5432']
        labels:
          service: 'database'
Targets:
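- API: localhost:8000 (health check endpoint)
- Metrics: localhost:8001/metrics (Prometheus client)
- Redis: localhost:6379 (Redis exporter)
- TimescaleDB: localhost:5432 (Postgres exporter)
Ports 6379 and 5432 are the native Redis and PostgreSQL ports, which do not serve Prometheus metrics themselves; these jobs only return data if an exporter answers at those addresses (redis_exporter and postgres_exporter listen on 9121 and 9187 by default).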
Configuration
Settings (config/settings.yaml):
monitoring:
  prometheus_port: 8001
  health_check_interval_seconds: 60
Environment Variable Overrides:
export SA_MONITORING__PROMETHEUS_PORT=8002
export SA_MONITORING__HEALTH_CHECK_INTERVAL_SECONDS=30
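The variable names suggest an SA_ prefix with a double underscore separating nesting levels, a convention also used by settings libraries such as pydantic-settings. The project's actual loader is not shown here, so the sketch below only illustrates how such names could map onto the YAML structure; apply_env_overrides is a hypothetical helper.

```python
import copy

ENV_PREFIX = "SA_"       # prefix seen in the exports above
NESTED_DELIMITER = "__"  # assumed separator between nesting levels


def apply_env_overrides(settings, environ):
    """Return a copy of settings with SA_<SECTION>__<KEY> values applied."""
    merged = copy.deepcopy(settings)
    for name, raw in environ.items():
        if not name.startswith(ENV_PREFIX):
            continue
        path = name[len(ENV_PREFIX):].lower().split(NESTED_DELIMITER)
        node = merged
        for part in path[:-1]:
            node = node.setdefault(part, {})
        key = path[-1]
        current = node.get(key)
        if isinstance(current, bool) or not isinstance(current, (int, float)):
            node[key] = raw  # leave strings and unknown keys as text
        else:
            node[key] = type(current)(raw)  # keep ports and intervals numeric
    return merged


defaults = {"monitoring": {"prometheus_port": 8001, "health_check_interval_seconds": 60}}
overridden = apply_env_overrides(
    defaults,
    {"SA_MONITORING__PROMETHEUS_PORT": "8002", "PATH": "/usr/bin"},
)
print(overridden["monitoring"])
```

In practice the second argument would be os.environ; passing a plain dict keeps the example deterministic.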
Slack Webhook:
export SLACK_WEBHOOK_URL="https://hooks.slack.com/services/..."
Telegram Bot:
export TELEGRAM_BOT_TOKEN="123456789:ABCdefGHIjklMNOpqrsTUVwxyz"
export TELEGRAM_CHAT_ID="-1001234567890"
Integration with Other Modules
- Execution Engine (src/execution/): Increments orders_total, trades_total; observes fill_latency_ms, slippage_bps.
- Risk Management (src/risk/): Updates pdt_trades_used, circuit_breakers_tripped, kill_switch_active.
- Portfolio Monitor (src/risk/portfolio_monitor.py): Updates portfolio_equity, portfolio_daily_pnl, portfolio_drawdown_pct.
- ML Pipeline (src/signals/): Updates ml_model_accuracy, ml_psi_drift.
- API (src/api/): Exposes health check endpoint for Prometheus scraping.
Critical Patterns
- Metrics as code: All metrics are defined centrally in the TradingMetrics class.
- Labels for grouping: Use labels (e.g., strategy, side) for filtering in Grafana.
- Histogram buckets: Pre-defined buckets for latency, slippage, P&L distributions.
- Structured logs for audit: All trading decisions logged with context (symbol, order_id, timestamp).
- Alert severity escalation: INFO → WARNING → CRITICAL based on threshold breaches.
- Prometheus scrape interval: 15s (balance between timeliness and load).
Glossary Links
- Redis — In-memory data store
- PDT — Pattern Day Trader rule
- IBKR — Interactive Brokers
- Prometheus — Metrics and monitoring system
- Grafana — Visualization and alerting platform