Monitoring¶
This guide covers the key metrics, limits, and alerting recommendations for operating mlflow-dynamodbstore in production.
CloudWatch Metrics¶
DynamoDB publishes metrics to CloudWatch automatically. These are the most important ones to monitor:
Capacity Metrics¶
| Metric | Dimension | What to Watch |
|---|---|---|
| `ConsumedReadCapacityUnits` | Table, GSI | Sustained spikes indicate hot partitions |
| `ConsumedWriteCapacityUnits` | Table, GSI | Track write patterns during training runs |
| `ThrottledRequests` | Table, GSI | Must be zero in steady state |
| `ReadThrottleEvents` | Table, GSI | Read-side throttling |
| `WriteThrottleEvents` | Table, GSI | Write-side throttling |
Throttling
Any non-zero `ThrottledRequests` value means your application is hitting
capacity limits. For on-demand tables, this usually points to the
partition-level throughput limit (3,000 RCU / 1,000 WCU per partition). For
provisioned tables, consider increasing capacity or switching to on-demand.
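To get a feel for how close a training workload comes to the per-partition write limit quoted above, a back-of-the-envelope check can help. The sketch below is an illustration, not part of mlflow-dynamodbstore: the function names are hypothetical, and so is the assumption that every logged metric value costs about one WCU (one item write of at most 1 KB). Real consumption may be higher if the store writes more than one item per value or maintains indexes.

```python
# Per-partition write limit quoted in this guide.
PARTITION_WCU_LIMIT = 1_000


def estimated_write_wcu_per_second(concurrent_runs, metrics_per_step,
                                   steps_per_second, wcu_per_write=1.0):
    """Rough WCU/s landing on one experiment partition.

    Assumes (hypothetically) one write of `wcu_per_write` WCU per logged value.
    """
    return concurrent_runs * metrics_per_step * steps_per_second * wcu_per_write


def write_headroom(wcu_per_second, limit=PARTITION_WCU_LIMIT):
    """Fraction of the partition write limit still unused (negative if over)."""
    return 1.0 - wcu_per_second / limit


if __name__ == "__main__":
    load = estimated_write_wcu_per_second(
        concurrent_runs=20, metrics_per_step=15, steps_per_second=2
    )
    print(f"estimated load: {load:.0f} WCU/s, headroom: {write_headroom(load):.0%}")
```

A negative headroom suggests that runs logging to a single experiment could be throttled even on an on-demand table.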
Latency Metrics¶
| Metric | What to Watch |
|---|---|
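| `SuccessfulRequestLatency` | p50 and p99 latency; spikes indicate large scans |
| `SystemErrors` | DynamoDB internal errors (should be near zero) |
| `UserErrors` | Client-side (HTTP 400) errors such as validation failures; conditional check failures are reported separately as `ConditionalCheckFailedRequests` |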
Example CloudWatch Dashboard¶
```json
{
  "widgets": [
    {
      "type": "metric",
      "properties": {
        "metrics": [
          ["AWS/DynamoDB", "ConsumedReadCapacityUnits", "TableName", "mlflow"],
          ["AWS/DynamoDB", "ConsumedWriteCapacityUnits", "TableName", "mlflow"]
        ],
        "period": 60,
        "stat": "Sum",
        "title": "Table Capacity"
      }
    },
    {
      "type": "metric",
      "properties": {
        "metrics": [
          ["AWS/DynamoDB", "ThrottledRequests", "TableName", "mlflow"]
        ],
        "period": 60,
        "stat": "Sum",
        "title": "Throttled Requests"
      }
    }
  ]
}
```
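If you manage several tables, generating the dashboard body from code avoids hand-editing table names. The following is a minimal sketch using only the Python standard library. `metric_widget` and `build_dashboard_body` are hypothetical helpers, and the default table name `mlflow` is taken from the example above.

```python
import json


def metric_widget(title, metric_names, table_name, period=60, stat="Sum"):
    """Build one CloudWatch metric widget for AWS/DynamoDB table metrics."""
    return {
        "type": "metric",
        "properties": {
            "metrics": [
                ["AWS/DynamoDB", name, "TableName", table_name]
                for name in metric_names
            ],
            "period": period,
            "stat": stat,
            "title": title,
        },
    }


def build_dashboard_body(table_name="mlflow"):
    """Return the two-widget dashboard from the example for any table name."""
    return {
        "widgets": [
            metric_widget(
                "Table Capacity",
                ["ConsumedReadCapacityUnits", "ConsumedWriteCapacityUnits"],
                table_name,
            ),
            metric_widget("Throttled Requests", ["ThrottledRequests"], table_name),
        ]
    }


if __name__ == "__main__":
    print(json.dumps(build_dashboard_body("mlflow"), indent=2))
```

The printed JSON string is the kind of value the CloudWatch `PutDashboard` API takes as its dashboard body. Depending on how you create the dashboard, you may also need to add a `region` property to each widget.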
Partition Size Monitoring¶
10 GB LSI Partition Limit¶
DynamoDB enforces a 10 GB limit per partition key when Local Secondary Indexes (LSIs) are present. mlflow-dynamodbstore uses LSIs, so this limit applies.
Each experiment's data lives under a single partition key (EXP#<id>). If an
experiment accumulates more than 10 GB of items (runs, metrics, tags, params,
traces), writes to that partition will fail with an
ItemCollectionSizeLimitExceededException.
Hard Limit
The 10 GB LSI partition limit cannot be increased. It is a fundamental DynamoDB constraint. Plan your data layout accordingly.
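Because this failure surfaces as an exception on write, it can be useful to recognize it explicitly in application code and route it to an alert. The sketch below assumes the error reaches your code as a botocore-style `ClientError`, whose `response["Error"]["Code"]` carries the error name. Whether mlflow-dynamodbstore wraps or re-raises this exception is not covered here, so treat `is_partition_full_error` as a hypothetical helper.

```python
LSI_LIMIT_ERROR = "ItemCollectionSizeLimitExceededException"


def is_partition_full_error(exc):
    """True if `exc` looks like a botocore ClientError for the 10 GB limit.

    Uses duck typing so the check works without importing botocore.
    """
    response = getattr(exc, "response", None)
    if not isinstance(response, dict):
        return False
    return response.get("Error", {}).get("Code") == LSI_LIMIT_ERROR


if __name__ == "__main__":
    class FakeClientError(Exception):
        """Stand-in with the same `.response` shape as botocore's ClientError."""

        def __init__(self, code):
            super().__init__(code)
            self.response = {"Error": {"Code": code}}

    print(is_partition_full_error(FakeClientError(LSI_LIMIT_ERROR)))
    print(is_partition_full_error(ValueError("unrelated")))
```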
What Consumes Space¶
| Item Type | Typical Size | Volume Driver |
|---|---|---|
| Run META | 1-2 KB | Number of runs |
| Tags | 100-500 B | Tags per run |
| Params | 100-500 B | Params per run |
| Latest Metrics | 200-500 B | Metric keys per run |
| Metric History | 100-200 B | Steps per metric key |
| Trace META | 1-2 KB | Number of traces |
| Cached Spans | 5-50 KB | Span count per trace |
Estimating Partition Size¶
Rough formula per experiment:
partition_size ≈ runs × (2KB + tags × 300B + params × 300B + metrics × 300B)
+ runs × metrics × steps × 150B (metric history)
+ traces × (2KB + cached_spans_size)
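Example: 1,000 runs, 10 tags, 20 params, 15 metrics, 100 steps each, no traces:
run items ≈ 1,000 × (2KB + 10 × 300B + 20 × 300B + 15 × 300B) ≈ 15.5 MB
metric history ≈ 1,000 × 15 × 100 × 150B = 225 MB
total ≈ 240 MB
At roughly 240 MB, this experiment is well within the 10 GB limit. The danger zone is experiments with many runs, many metric keys, and long step histories.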
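The rough formula above translates directly into a small estimator, which makes it easy to try different run counts and step histories. This is a sketch under the same per-item size assumptions as the formula (300 B per tag, param, and latest metric; 150 B per history step); the function name and the use of decimal units are choices made for illustration.

```python
KB = 1_000  # decimal kilobytes are precise enough for a rough estimate
LSI_LIMIT_BYTES = 10 * 1_000**3  # 10 GB, in decimal units to stay conservative


def estimate_partition_bytes(runs, tags, params, metrics, steps,
                             traces=0, cached_spans_bytes=0):
    """Apply the rough per-experiment formula and return a size in bytes.

    tags, params, metrics are counts per run; steps is per metric key;
    cached_spans_bytes is the cached span size per trace.
    """
    per_run_items = 2 * KB + (tags + params + metrics) * 300
    run_bytes = runs * per_run_items
    history_bytes = runs * metrics * steps * 150
    trace_bytes = traces * (2 * KB + cached_spans_bytes)
    return run_bytes + history_bytes + trace_bytes


if __name__ == "__main__":
    size = estimate_partition_bytes(runs=1_000, tags=10, params=20,
                                    metrics=15, steps=100)
    print(f"{size / 1_000**2:.1f} MB, {size / LSI_LIMIT_BYTES:.1%} of the limit")
```

Raising `steps` or `runs` by an order of magnitude shows how quickly metric history comes to dominate the total.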
Mitigation Strategies¶
- Enable metric history TTL -- Set `metric_history_retention_days` to prune old step data automatically.
- Limit steps per metric -- Log metrics at intervals rather than every step.
- Split large experiments -- Create new experiments periodically instead of logging thousands of runs to one experiment.
- Monitor item counts -- Track growth with the table's `ItemCount` value (reported by DescribeTable) or with periodic scans for per-experiment counts.
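The "limit steps per metric" strategy can be sketched as a small wrapper around whatever function records a metric. This is an illustration only: `make_interval_logger` is a hypothetical helper, and `log_fn` stands for your own logging call (for example `mlflow.log_metric`, which accepts `key`, `value`, and `step`).

```python
def make_interval_logger(log_fn, every_n_steps):
    """Wrap `log_fn(key, value, step=...)` so only every Nth step is recorded."""
    if every_n_steps < 1:
        raise ValueError("every_n_steps must be at least 1")

    def log(key, value, step):
        # Step 0 is always logged, then every Nth step after it.
        if step % every_n_steps == 0:
            log_fn(key, value, step=step)
            return True
        return False

    return log


if __name__ == "__main__":
    recorded = []
    log = make_interval_logger(
        lambda key, value, step: recorded.append((key, value, step)),
        every_n_steps=10,
    )
    for step in range(100):
        log("loss", 1.0 / (step + 1), step)
    print(len(recorded))  # 10 history items instead of 100
```

In a real training loop you would probably also log the final step explicitly, so the last value is never dropped.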
Item Count Monitoring¶
Track the total item count in your table (the `ItemCount` value reported by DescribeTable) to understand growth trends.
Note
ItemCount is updated approximately every 6 hours. For real-time counts,
use a Scan with Select=COUNT, but be aware this consumes read capacity.
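One way to collect this value and turn it into a growth signal is sketched below. The helpers are hypothetical; `client` is assumed to behave like `boto3.client("dynamodb")`, whose `describe_table` response includes `Table.ItemCount`. Because the count is only refreshed periodically, comparing samples taken days apart is more meaningful than comparing samples taken minutes apart.

```python
def table_item_count(client, table_name):
    """Read the approximate item count via DescribeTable.

    `client` is expected to behave like boto3.client("dynamodb").
    """
    return client.describe_table(TableName=table_name)["Table"]["ItemCount"]


def week_over_week_growth(previous_count, current_count):
    """Relative growth between two samples, e.g. 0.25 for +25%."""
    if previous_count <= 0:
        return None  # growth is undefined without a positive baseline
    return (current_count - previous_count) / previous_count


# Usage with boto3 (third-party, not executed here):
#   import boto3
#   count = table_item_count(boto3.client("dynamodb"), "mlflow")
#   growth = week_over_week_growth(last_week_count, count)
```

Storing one sample per week is enough to evaluate a week-over-week growth threshold.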
X-Ray Trace Monitoring¶
If X-Ray integration is enabled, monitor these additional dimensions:
| Metric / Signal | Where to Check | Action |
|---|---|---|
| X-Ray `TracesProcessed` | CloudWatch X-Ray | Track trace ingestion rate |
| Cached span item count | DynamoDB table scan | Verify cache-spans runs |
| X-Ray 30-day retention | Calendar / alarm | Run cache-spans before expiry |
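The 30-day retention row implies a deadline for each trace. A small date calculation, sketched below, can feed an alarm or a scheduled job. The 30-day figure comes from the table above; the function names and the 7-day safety margin are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone

XRAY_RETENTION = timedelta(days=30)


def cache_deadline(trace_time, safety_margin=timedelta(days=7)):
    """Latest time by which the spans of a trace should be cached."""
    return trace_time + XRAY_RETENTION - safety_margin


def is_cache_overdue(oldest_uncached_trace_time, now=None):
    """True when the oldest uncached trace has reached its deadline."""
    now = now or datetime.now(timezone.utc)
    return now >= cache_deadline(oldest_uncached_trace_time)


if __name__ == "__main__":
    oldest = datetime(2024, 1, 1, tzinfo=timezone.utc)
    print(cache_deadline(oldest))  # 2024-01-24 00:00:00+00:00
```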
Alerting Recommendations¶
Critical Alerts¶
| Alert | Threshold | Action |
|---|---|---|
| `ThrottledRequests` > 0 for 5 minutes | Any throttling | Increase capacity or investigate hot keys |
| `SystemErrors` > 0 for 5 minutes | Any system error | Check AWS Health Dashboard |
| `ItemCollectionSizeLimitExceededException` | Any occurrence | Split experiment or enable metric TTL |
Warning Alerts¶
| Alert | Threshold | Action |
|---|---|---|
| `ConsumedReadCapacityUnits` > 80% of provisioned | Sustained over 15 minutes | Consider scaling up or auto-scaling |
| `SuccessfulRequestLatency` p99 > 100ms | Sustained over 10 minutes | Check for large scans |
| Item count growth > 20% week-over-week | Weekly check | Review TTL settings and cleanup |
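The 80% threshold compares two different units: CloudWatch reports `ConsumedReadCapacityUnits` as a total per period when the `Sum` statistic is used, while provisioned capacity is expressed per second. The sketch below shows the conversion. The function names are hypothetical, and the datapoints are assumed to be `Sum` values you have already fetched from CloudWatch.

```python
def utilization(consumed_sum, period_seconds, provisioned_units):
    """Average consumed units per second as a fraction of provisioned units."""
    if period_seconds <= 0 or provisioned_units <= 0:
        raise ValueError("period_seconds and provisioned_units must be positive")
    return (consumed_sum / period_seconds) / provisioned_units


def sustained_breach(datapoint_sums, period_seconds, provisioned_units,
                     threshold=0.80):
    """True if every datapoint in the window exceeds the threshold."""
    return bool(datapoint_sums) and all(
        utilization(s, period_seconds, provisioned_units) > threshold
        for s in datapoint_sums
    )


if __name__ == "__main__":
    # Fifteen 60-second datapoints for a table provisioned at 100 RCU.
    window = [5_100] * 15  # 5,100 units per minute is 85 units per second
    print(sustained_breach(window, period_seconds=60, provisioned_units=100))
```

Requiring every datapoint in the window to exceed the threshold matches the "sustained over 15 minutes" wording and avoids alerting on a single short burst.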
Example CloudWatch Alarm (Terraform)¶
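```hcl
resource "aws_cloudwatch_metric_alarm" "throttled_requests" {
  alarm_name          = "mlflow-dynamodb-throttled-requests"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 1
  metric_name         = "ThrottledRequests"
  namespace           = "AWS/DynamoDB"
  period              = 300 # seconds; one 5-minute evaluation window
  statistic           = "Sum"
  threshold           = 0
  alarm_actions       = [aws_sns_topic.alerts.arn]

  # CloudWatch reports ThrottledRequests per operation (TableName plus Operation).
  # If this alarm stays in INSUFFICIENT_DATA, add an Operation dimension or
  # alarm on ReadThrottleEvents and WriteThrottleEvents instead.
  dimensions = {
    TableName = "mlflow"
  }
}
```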