Monitoring and Observability
This guide covers monitoring and observability practices for zae-limiter deployments, including structured logging, CloudWatch metrics, alerts, and dashboard templates.
Overview
Effective monitoring of a rate limiter is critical for:
- Availability - Detecting service degradation before users are impacted
- Latency - Ensuring rate limit checks don't become a bottleneck
- Throughput - Understanding capacity and scaling needs
- Errors - Identifying and resolving issues quickly
zae-limiter provides built-in observability through:
| Component | Purpose |
|---|---|
| CloudWatch Alarms | Proactive alerting on anomalies |
| Structured Logs | JSON-formatted logs for analysis |
| Dead Letter Queue | Capturing failed events for investigation |
| Usage Snapshots | Aggregated consumption metrics |
| Audit Logging | Security and compliance tracking |
Compliance Requirements
For tracking who changed what and when, see the Audit Logging Guide.
Structured Logging
The Lambda aggregator uses structured JSON logging compatible with CloudWatch Logs Insights.
Log Format
All log entries follow this JSON structure:
{
  "timestamp": "2024-01-15T10:30:00.000000+00:00",
  "level": "INFO",
  "logger": "zae_limiter.aggregator.handler",
  "message": "Lambda invocation completed",
  "request_id": "abc123-def456",
  "processed": 50,
  "snapshots_updated": 100,
  "processing_time_ms": 45.23
}
Log Fields Reference
| Field | Type | Description |
|---|---|---|
| timestamp | string | ISO 8601 timestamp (UTC) |
| level | string | Log level: DEBUG, INFO, WARNING, ERROR |
| logger | string | Logger name (module path) |
| message | string | Human-readable message |
| request_id | string | Lambda request ID for correlation |
| function_name | string | Lambda function name |
| record_count | int | DynamoDB stream records in batch |
| processed | int | Records successfully processed |
| deltas_extracted | int | Consumption deltas found |
| snapshots_updated | int | Usage snapshots updated |
| error_count | int | Processing errors |
| processing_time_ms | float | Total execution time (ms) |
Log Levels
| Level | When Used |
|---|---|
| DEBUG | Detailed processing info (snapshot updates) |
| INFO | Invocation start/end, batch processing summary |
| WARNING | Recoverable errors (single record failures) |
| ERROR | Unrecoverable errors (batch failures) |
Example Log Entries
Invocation Start:
{
  "timestamp": "2024-01-15T10:30:00.000000+00:00",
  "level": "INFO",
  "logger": "zae_limiter.aggregator.handler",
  "message": "Lambda invocation started",
  "request_id": "abc123-def456",
  "function_name": "ZAEL-limiter-aggregator",
  "record_count": 50,
  "table_name": "ZAEL-limiter",
  "snapshot_windows": ["hourly", "daily"]
}
Batch Complete:
{
  "timestamp": "2024-01-15T10:30:00.500000+00:00",
  "level": "INFO",
  "logger": "zae_limiter.aggregator.processor",
  "message": "Batch processing completed",
  "processed_count": 50,
  "deltas_extracted": 45,
  "snapshots_updated": 90,
  "error_count": 0,
  "processing_time_ms": 423.15
}
Error with Exception:
{
  "timestamp": "2024-01-15T10:30:01.000000+00:00",
  "level": "ERROR",
  "logger": "zae_limiter.aggregator.processor",
  "message": "Error processing record",
  "record_index": 12,
  "exception": "Traceback (most recent call last):\n..."
}
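Because each entry is a single JSON object, exported log lines can be summarized with a few lines of Python. The following is a minimal sketch, not part of zae-limiter: the function name `summarize_log_lines` is hypothetical, and it assumes each log line holds one JSON object using the `level` and `processing_time_ms` fields shown above.

```python
import json
from collections import Counter


def summarize_log_lines(lines):
    """Summarize aggregator log lines given as JSON strings.

    Returns the number of entries per level and the mean
    processing_time_ms over entries that carry that field.
    Lines that are not JSON objects are skipped.
    """
    levels = Counter()
    durations = []
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # e.g. Lambda START/END/REPORT lines
        if not isinstance(entry, dict):
            continue
        levels[entry.get("level", "UNKNOWN")] += 1
        if "processing_time_ms" in entry:
            durations.append(float(entry["processing_time_ms"]))
    mean_ms = sum(durations) / len(durations) if durations else None
    return {"levels": dict(levels), "mean_processing_time_ms": mean_ms}


sample = [
    '{"level": "INFO", "message": "Batch processing completed", "processing_time_ms": 400.0}',
    '{"level": "ERROR", "message": "Error processing record", "record_index": 12}',
    "REPORT RequestId: abc123-def456",
]
print(summarize_log_lines(sample))
```

For anything beyond ad hoc checks, CloudWatch Logs Insights is likely the better tool, since it queries the same fields without exporting logs.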
CloudWatch Metrics
Lambda Metrics
Monitor the aggregator Lambda function:
| Metric | Namespace | Description | Recommended Threshold |
|---|---|---|---|
| Invocations | AWS/Lambda | Total executions | Baseline + 50% |
| Errors | AWS/Lambda | Failed executions | > 1 per 5 min |
| Duration | AWS/Lambda | Execution time (ms) | > 80% of timeout |
| Throttles | AWS/Lambda | Throttled invocations | > 0 |
| IteratorAge | AWS/Lambda | Stream processing lag (ms) | > 30,000 ms |
| ConcurrentExecutions | AWS/Lambda | Parallel executions | Account limit |
DynamoDB Metrics
Monitor table performance:
| Metric | Namespace | Description | Recommended Threshold |
|---|---|---|---|
| ConsumedReadCapacityUnits | AWS/DynamoDB | RCU usage | Provisioned capacity |
| ConsumedWriteCapacityUnits | AWS/DynamoDB | WCU usage | Provisioned capacity |
| ReadThrottleEvents | AWS/DynamoDB | Read throttles | > 0 |
| WriteThrottleEvents | AWS/DynamoDB | Write throttles | > 0 |
| SystemErrors | AWS/DynamoDB | Service errors | > 0 |
| SuccessfulRequestLatency | AWS/DynamoDB | Request latency (ms) | p99 > 100 ms |
SQS Metrics (Dead Letter Queue)
Monitor failed event processing:
| Metric | Namespace | Description | Recommended Threshold |
|---|---|---|---|
| ApproximateNumberOfMessagesVisible | AWS/SQS | Messages in DLQ | > 0 |
| ApproximateAgeOfOldestMessage | AWS/SQS | Oldest message age (s) | > 3600 |
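The DLQ depth can also be checked directly against SQS, for example from a runbook script. The sketch below is an illustration, not part of zae-limiter: `dlq_depth` is a hypothetical helper, and the queue name in the usage comment is an assumption based on the `<table name>-aggregator-dlq` pattern. The client is passed in, so any object exposing the boto3 SQS methods `get_queue_url` and `get_queue_attributes` works.

```python
def dlq_depth(sqs_client, queue_name):
    """Return the approximate number of visible messages in the queue."""
    queue_url = sqs_client.get_queue_url(QueueName=queue_name)["QueueUrl"]
    response = sqs_client.get_queue_attributes(
        QueueUrl=queue_url,
        # This attribute backs the ApproximateNumberOfMessagesVisible metric.
        AttributeNames=["ApproximateNumberOfMessages"],
    )
    # SQS returns attribute values as strings.
    return int(response["Attributes"]["ApproximateNumberOfMessages"])


# Usage with boto3 (third-party, needs AWS credentials); the queue name
# is an assumed example:
#
#   import boto3
#   depth = dlq_depth(boto3.client("sqs"), "ZAEL-limiter-aggregator-dlq")
#   print(f"DLQ depth: {depth}")
```

Any non-zero result means at least one stream batch failed processing and should be investigated.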
CloudWatch Logs Insights Queries
Batch Processing Performance
Analyze processing latency over time:
fields @timestamp, @message
| filter @message like /Batch processing completed/
| parse @message /processing_time_ms":\s*(?<duration>[\d.]+)/
| stats avg(duration) as avg_ms,
        pct(duration, 50) as p50_ms,
        pct(duration, 95) as p95_ms,
        pct(duration, 99) as p99_ms
  by bin(1h)
| sort @timestamp desc
Error Analysis
Find recent errors and warnings:
fields @timestamp, @message, @logStream
| filter level = "ERROR" or level = "WARNING"
| parse @message /message":\s*"(?<error_message>[^"]+)/
| sort @timestamp desc
| limit 100
Invocation Summary
Aggregate processing metrics:
fields @timestamp, @message
| filter @message like /Lambda invocation completed/
| parse @message /processed":\s*(?<processed>\d+).*snapshots_updated":\s*(?<snapshots>\d+)/
| stats sum(processed) as total_processed,
        sum(snapshots) as total_snapshots,
        count() as invocations
  by bin(1h)
| sort @timestamp desc
Entity Usage Analysis
Find highest-usage entities:
fields @timestamp, @message
| filter @message like /Snapshot updated/
| parse @message /entity_id":\s*"(?<entity>[^"]+)".*resource":\s*"(?<resource>[^"]+)/
| stats count() as updates by entity, resource
| sort updates desc
| limit 50
Cold Start Detection
Identify Lambda cold starts:
fields @timestamp, @message, @duration
| filter @type = "REPORT"
| filter @message like /Init Duration/
| parse @message /Init Duration: (?<init_duration>[\d.]+) ms/
| stats count() as cold_starts,
        avg(init_duration) as avg_init_ms
  by bin(1h)
Error Rate Calculation
Calculate error rate percentage:
fields @timestamp
| filter @message like /Lambda invocation/
| parse @message /error_count":\s*(?<errors>\d+)/
| stats sum(errors) as total_errors,
        count() as total_invocations,
        total_errors * 100.0 / total_invocations as error_rate_pct
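Queries like these can also be run from a script, for example to feed a report. The sketch below is a hedged illustration, not part of zae-limiter: `run_insights_query` is a hypothetical helper built on the boto3 CloudWatch Logs calls `start_query` and `get_query_results`, and the log group name in the usage comment assumes the Lambda default of `/aws/lambda/<function name>`.

```python
import time


def run_insights_query(logs_client, log_group, query, start, end,
                       poll_seconds=1.0, max_polls=60):
    """Run a Logs Insights query and return its rows as dicts.

    start and end are Unix timestamps in seconds.
    """
    query_id = logs_client.start_query(
        logGroupName=log_group,
        startTime=int(start),
        endTime=int(end),
        queryString=query,
    )["queryId"]
    # Queries run asynchronously, so poll until a terminal status.
    for _ in range(max_polls):
        response = logs_client.get_query_results(queryId=query_id)
        status = response["status"]
        if status == "Complete":
            # Each row is a list of {"field": ..., "value": ...} cells.
            return [
                {cell["field"]: cell["value"] for cell in row}
                for row in response["results"]
            ]
        if status in ("Failed", "Cancelled", "Timeout"):
            raise RuntimeError(f"query ended with status {status}")
        time.sleep(poll_seconds)
    raise TimeoutError("query did not complete in time")


# Usage with boto3 (third-party, needs AWS credentials); the log group
# name is an assumed example:
#
#   import boto3
#   now = time.time()
#   rows = run_insights_query(
#       boto3.client("logs"),
#       "/aws/lambda/ZAEL-limiter-aggregator",
#       'filter level = "ERROR" | limit 20',
#       start=now - 3600,
#       end=now,
#   )
```

All values come back as strings, so numeric columns need converting before further arithmetic.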
X-Ray Tracing
Future Enhancement
X-Ray tracing integration is planned for a future release. Track progress in Issue #107.
Planned capabilities include:
- Lambda Active Tracing - End-to-end request visibility
- DynamoDB SDK Instrumentation - Database call traces
- Custom Subsegments - Business logic timing (acquire/release operations)
- Trace Header Propagation - Cross-service correlation
Dashboard Templates
Operations Dashboard
Create a CloudWatch dashboard for day-to-day operations:
{
  "widgets": [
    {
      "type": "metric",
      "properties": {
        "title": "Lambda Invocations & Errors",
        "region": "${AWS::Region}",
        "metrics": [
          ["AWS/Lambda", "Invocations", "FunctionName", "${TableName}-aggregator", {"stat": "Sum"}],
          [".", "Errors", ".", ".", {"stat": "Sum", "color": "#d62728"}]
        ],
        "period": 300,
        "view": "timeSeries"
      }
    },
    {
      "type": "metric",
      "properties": {
        "title": "Lambda Duration (p50/p95/p99)",
        "region": "${AWS::Region}",
        "metrics": [
          ["AWS/Lambda", "Duration", "FunctionName", "${TableName}-aggregator", {"stat": "p50"}],
          ["...", {"stat": "p95"}],
          ["...", {"stat": "p99"}]
        ],
        "period": 300,
        "view": "timeSeries"
      }
    },
    {
      "type": "metric",
      "properties": {
        "title": "Stream Iterator Age",
        "region": "${AWS::Region}",
        "metrics": [
          ["AWS/Lambda", "IteratorAge", "FunctionName", "${TableName}-aggregator", {"stat": "Maximum"}]
        ],
        "period": 60,
        "view": "timeSeries",
        "annotations": {
          "horizontal": [{"value": 30000, "label": "Threshold (30s)"}]
        }
      }
    },
    {
      "type": "metric",
      "properties": {
        "title": "DynamoDB Capacity",
        "region": "${AWS::Region}",
        "metrics": [
          ["AWS/DynamoDB", "ConsumedReadCapacityUnits", "TableName", "${TableName}", {"stat": "Sum"}],
          [".", "ConsumedWriteCapacityUnits", ".", ".", {"stat": "Sum"}]
        ],
        "period": 300,
        "view": "timeSeries"
      }
    },
    {
      "type": "metric",
      "properties": {
        "title": "DynamoDB Throttles",
        "region": "${AWS::Region}",
        "metrics": [
          ["AWS/DynamoDB", "ReadThrottleEvents", "TableName", "${TableName}", {"stat": "Sum"}],
          [".", "WriteThrottleEvents", ".", ".", {"stat": "Sum"}]
        ],
        "period": 300,
        "view": "timeSeries"
      }
    },
    {
      "type": "metric",
      "properties": {
        "title": "Dead Letter Queue",
        "region": "${AWS::Region}",
        "metrics": [
          ["AWS/SQS", "ApproximateNumberOfMessagesVisible", "QueueName", "${TableName}-aggregator-dlq"]
        ],
        "period": 60,
        "view": "singleValue"
      }
    }
  ]
}
Capacity Planning Dashboard
Create a dashboard for capacity analysis:
{
  "widgets": [
    {
      "type": "metric",
      "properties": {
        "title": "RCU/WCU Consumption Trend",
        "region": "${AWS::Region}",
        "metrics": [
          ["AWS/DynamoDB", "ConsumedReadCapacityUnits", "TableName", "${TableName}", {"stat": "Sum", "period": 3600}],
          [".", "ConsumedWriteCapacityUnits", ".", ".", {"stat": "Sum", "period": 3600}]
        ],
        "view": "timeSeries",
        "stacked": false
      }
    },
    {
      "type": "metric",
      "properties": {
        "title": "Request Latency Distribution",
        "region": "${AWS::Region}",
        "metrics": [
          ["AWS/DynamoDB", "SuccessfulRequestLatency", "TableName", "${TableName}", "Operation", "GetItem", {"stat": "p50"}],
          ["...", {"stat": "p95"}],
          ["...", {"stat": "p99"}]
        ],
        "period": 300,
        "view": "timeSeries"
      }
    },
    {
      "type": "metric",
      "properties": {
        "title": "Lambda Concurrent Executions",
        "region": "${AWS::Region}",
        "metrics": [
          ["AWS/Lambda", "ConcurrentExecutions", "FunctionName", "${TableName}-aggregator", {"stat": "Maximum"}]
        ],
        "period": 60,
        "view": "timeSeries"
      }
    },
    {
      "type": "metric",
      "properties": {
        "title": "Throttle Events (7 Day)",
        "region": "${AWS::Region}",
        "metrics": [
          ["AWS/DynamoDB", "ReadThrottleEvents", "TableName", "${TableName}", {"stat": "Sum", "period": 86400}],
          [".", "WriteThrottleEvents", ".", ".", {"stat": "Sum", "period": 86400}]
        ],
        "view": "bar"
      }
    }
  ]
}
Dashboard Deployment
Replace ${TableName} with your actual table name (e.g., ZAEL-limiter) and ${AWS::Region} with your region before deploying.
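The placeholder replacement can be scripted so the same template serves several deployments. This is a minimal sketch rather than a zae-limiter feature: `render_dashboard` is a hypothetical helper, and the dashboard name in the usage comment is an invented example. It uses plain string replacement and then parses the result to catch a malformed template before deployment.

```python
import json


def render_dashboard(template, table_name, region):
    """Fill in the two placeholders and check the result is valid JSON."""
    body = (
        template
        .replace("${TableName}", table_name)
        .replace("${AWS::Region}", region)
    )
    json.loads(body)  # raises ValueError if the template is malformed
    return body


# Usage with boto3 (third-party, needs AWS credentials); the file name
# and dashboard name are assumed examples:
#
#   import boto3
#   with open("operations-dashboard.json") as f:
#       body = render_dashboard(f.read(), "ZAEL-limiter", "us-east-1")
#   boto3.client("cloudwatch").put_dashboard(
#       DashboardName="zae-limiter-operations",
#       DashboardBody=body,
#   )
```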
Alert Configuration
Default Alarms
The stack deploys these alarms when --enable-alarms is set, which is the default:
| Alarm | Metric | Threshold | Period | Evaluation |
|---|---|---|---|---|
| {name}-aggregator-error-rate | Lambda Errors | > 1 | 5 min | 2 periods |
| {name}-aggregator-duration | Lambda Duration | > 80% timeout | 5 min | 2 periods |
| {name}-stream-iterator-age | IteratorAge | > 30,000 ms | 5 min | 2 periods |
| {name}-aggregator-dlq-alarm | SQS Messages | >= 1 | 5 min | 1 period |
| {name}-read-throttle | ReadThrottleEvents | > 1 | 5 min | 2 periods |
| {name}-write-throttle | WriteThrottleEvents | > 1 | 5 min | 2 periods |
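To make the Threshold, Period, and Evaluation columns concrete, the sketch below models how such an alarm decides its state. It is a deliberately simplified illustration, not CloudWatch's actual algorithm: it assumes every evaluation period must breach the threshold and ignores missing-data handling, neither of which this guide specifies for the deployed alarms. The function name `alarm_state` is hypothetical.

```python
def alarm_state(datapoints, threshold, evaluation_periods, at_least=False):
    """Return "ALARM" if the most recent periods all breach the threshold.

    datapoints: one aggregated value per period, oldest first.
    at_least:   compare with >= instead of > (as in the DLQ row).
    """
    if len(datapoints) < evaluation_periods:
        return "INSUFFICIENT_DATA"
    recent = datapoints[-evaluation_periods:]
    if at_least:
        breaching = [value >= threshold for value in recent]
    else:
        breaching = [value > threshold for value in recent]
    return "ALARM" if all(breaching) else "OK"


# Error alarm: > 1 error in each of the last two 5-minute periods.
print(alarm_state([0, 2, 3], threshold=1, evaluation_periods=2))
# DLQ alarm: >= 1 message in the single most recent period.
print(alarm_state([0, 1], threshold=1, evaluation_periods=1, at_least=True))
```

Requiring two periods means a single 5-minute spike does not page anyone, while the one-period DLQ alarm fires as soon as any message lands in the queue.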
Deploying with Alarms
# Deploy with alarms enabled (default)
zae-limiter deploy --name limiter --region us-east-1

# Deploy with SNS notifications
zae-limiter deploy --name limiter --region us-east-1 \
  --alarm-sns-topic arn:aws:sns:us-east-1:123456789012:alerts

# Customize duration threshold (70% of timeout)
zae-limiter deploy --name limiter --region us-east-1 \
  --lambda-duration-threshold-pct 70

# Disable alarms (not recommended for production)
zae-limiter deploy --name limiter --region us-east-1 --no-alarms
Threshold Tuning Guide
| Alarm | Default | When to Increase | When to Decrease |
|---|---|---|---|
| Error Rate | >1/5min | High-volume systems with rare transient errors | Critical systems requiring immediate response |
| Duration | 80% timeout | Batch workloads with variable processing time | Latency-sensitive applications |
| Iterator Age | 30 seconds | Batch-tolerant analytics workloads | Real-time processing requirements |
| DLQ Messages | >=1 | Never (always investigate DLQ messages) | N/A |
| Throttles | >1/5min | During planned traffic spikes | Before hitting capacity limits |
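The Duration threshold is expressed as a percentage of the Lambda timeout, while the Duration metric itself is reported in milliseconds. The arithmetic can be sketched as follows; `duration_threshold_ms` is a hypothetical helper, and the 60-second timeout is an assumed value for illustration, not necessarily what the stack configures.

```python
def duration_threshold_ms(timeout_seconds, threshold_pct=80):
    """Convert a Lambda timeout and a percentage into a threshold in ms."""
    if not 0 < threshold_pct <= 100:
        raise ValueError("threshold_pct must be in (0, 100]")
    return timeout_seconds * 1000 * threshold_pct / 100


# With an assumed 60-second timeout:
print(duration_threshold_ms(60))      # default 80%
print(duration_threshold_ms(60, 70))  # stricter 70% setting
```

Lowering the percentage gives earlier warning that invocations are approaching the timeout, at the cost of more alerts for workloads with variable processing time.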
Programmatic Configuration
from zae_limiter import RateLimiter, StackOptions

limiter = RateLimiter(
    name="limiter",
    region="us-east-1",
    stack_options=StackOptions(
        enable_alarms=True,
        alarm_sns_topic="arn:aws:sns:us-east-1:123456789012:alerts",
        lambda_duration_threshold_pct=75,  # Alert at 75% of timeout
        log_retention_days=90,
    ),
)
Next Steps
- Operations Guide - Troubleshooting and operational procedures
- Audit Logging - Security and compliance tracking
- Performance Tuning - Capacity planning and optimization
- Deployment Guide - Infrastructure setup
- CloudFormation Reference - Template customization