Monitoring and Observability

This guide covers monitoring and observability practices for zae-limiter deployments, including structured logging, CloudWatch metrics, alerts, and dashboard templates.

Overview

Effective monitoring of a rate limiter is critical for:

  • Availability - Detecting service degradation before users are impacted
  • Latency - Ensuring rate limit checks don't become a bottleneck
  • Throughput - Understanding capacity and scaling needs
  • Errors - Identifying and resolving issues quickly

zae-limiter provides built-in observability through:

| Component | Purpose |
| --- | --- |
| CloudWatch Alarms | Proactive alerting on anomalies |
| Structured Logs | JSON-formatted logs for analysis |
| Dead Letter Queue | Capturing failed events for investigation |
| Usage Snapshots | Aggregated consumption metrics |
| Audit Logging | Security and compliance tracking |

Compliance Requirements

For tracking who changed what and when, see the Audit Logging Guide.

Structured Logging

The Lambda aggregator uses structured JSON logging compatible with CloudWatch Logs Insights.

Log Format

All log entries follow this JSON structure:

{
  "timestamp": "2024-01-15T10:30:00.000000+00:00",
  "level": "INFO",
  "logger": "zae_limiter.aggregator.handler",
  "message": "Lambda invocation completed",
  "request_id": "abc123-def456",
  "processed": 50,
  "snapshots_updated": 100,
  "processing_time_ms": 45.23
}
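
If you emit logs from your own code and want them to show up in the same Logs Insights queries, a formatter along the following lines produces compatible entries. This is an illustrative sketch only, not the formatter shipped with zae-limiter; the JsonFormatter class and the extra_fields convention are assumptions for the example.

import json
import logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Hypothetical formatter mirroring the field layout shown above."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Merge any structured fields passed via extra={"extra_fields": {...}}.
        entry.update(getattr(record, "extra_fields", {}))
        return json.dumps(entry)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("my_app")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.info("Custom check", extra={"extra_fields": {"processed": 50}})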

Log Fields Reference

| Field | Type | Description |
| --- | --- | --- |
| timestamp | string | ISO 8601 timestamp (UTC) |
| level | string | Log level: DEBUG, INFO, WARNING, ERROR |
| logger | string | Logger name (module path) |
| message | string | Human-readable message |
| request_id | string | Lambda request ID for correlation |
| function_name | string | Lambda function name |
| record_count | int | DynamoDB stream records in batch |
| processed | int | Records successfully processed |
| deltas_extracted | int | Consumption deltas found |
| snapshots_updated | int | Usage snapshots updated |
| error_count | int | Processing errors |
| processing_time_ms | float | Total execution time (ms) |

Log Levels

| Level | When Used |
| --- | --- |
| DEBUG | Detailed processing info (snapshot updates) |
| INFO | Invocation start/end, batch processing summary |
| WARNING | Recoverable errors (single record failures) |
| ERROR | Unrecoverable errors (batch failures) |

Example Log Entries

Invocation Start:

{
  "timestamp": "2024-01-15T10:30:00.000000+00:00",
  "level": "INFO",
  "logger": "zae_limiter.aggregator.handler",
  "message": "Lambda invocation started",
  "request_id": "abc123-def456",
  "function_name": "ZAEL-limiter-aggregator",
  "record_count": 50,
  "table_name": "ZAEL-limiter",
  "snapshot_windows": ["hourly", "daily"]
}

Batch Complete:

{
  "timestamp": "2024-01-15T10:30:00.500000+00:00",
  "level": "INFO",
  "logger": "zae_limiter.aggregator.processor",
  "message": "Batch processing completed",
  "processed_count": 50,
  "deltas_extracted": 45,
  "snapshots_updated": 90,
  "error_count": 0,
  "processing_time_ms": 423.15
}

Error with Exception:

{
  "timestamp": "2024-01-15T10:30:01.000000+00:00",
  "level": "ERROR",
  "logger": "zae_limiter.aggregator.processor",
  "message": "Error processing record",
  "record_index": 12,
  "exception": "Traceback (most recent call last):\n..."
}

CloudWatch Metrics

Lambda Metrics

Monitor the aggregator Lambda function:

| Metric | Namespace | Description | Recommended Threshold |
| --- | --- | --- | --- |
| Invocations | AWS/Lambda | Total executions | Baseline + 50% |
| Errors | AWS/Lambda | Failed executions | > 1 per 5 min |
| Duration | AWS/Lambda | Execution time (ms) | > 80% of timeout |
| Throttles | AWS/Lambda | Throttled invocations | > 0 |
| IteratorAge | AWS/Lambda | Stream processing lag (ms) | > 30,000 ms |
| ConcurrentExecutions | AWS/Lambda | Parallel executions | Account limit |

DynamoDB Metrics

Monitor table performance:

| Metric | Namespace | Description | Recommended Threshold |
| --- | --- | --- | --- |
| ConsumedReadCapacityUnits | AWS/DynamoDB | RCU usage | Provisioned capacity |
| ConsumedWriteCapacityUnits | AWS/DynamoDB | WCU usage | Provisioned capacity |
| ReadThrottleEvents | AWS/DynamoDB | Read throttles | > 0 |
| WriteThrottleEvents | AWS/DynamoDB | Write throttles | > 0 |
| SystemErrors | AWS/DynamoDB | Service errors | > 0 |
| SuccessfulRequestLatency | AWS/DynamoDB | Request latency (ms) | p99 > 100 ms |

SQS Metrics (Dead Letter Queue)

Monitor failed event processing:

| Metric | Namespace | Description | Recommended Threshold |
| --- | --- | --- | --- |
| ApproximateNumberOfMessagesVisible | AWS/SQS | Messages in DLQ | > 0 |
| ApproximateAgeOfOldestMessage | AWS/SQS | Oldest message age (s) | > 3600 |
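
These metrics can also be pulled programmatically. The sketch below uses boto3's CloudWatch client to read the maximum stream iterator age over the last hour; the function name matches the example earlier in this guide, so substitute your own deployment's name.

from datetime import datetime, timedelta, timezone
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# Maximum stream iterator age (ms) for the aggregator over the last hour.
resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/Lambda",
    MetricName="IteratorAge",
    Dimensions=[{"Name": "FunctionName", "Value": "ZAEL-limiter-aggregator"}],
    StartTime=datetime.now(timezone.utc) - timedelta(hours=1),
    EndTime=datetime.now(timezone.utc),
    Period=300,
    Statistics=["Maximum"],
)
for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Maximum"])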

CloudWatch Logs Insights Queries

Batch Processing Performance

Analyze processing latency over time:

fields @timestamp, @message
| filter @message like /Batch processing completed/
| parse @message /processing_time_ms":(?<duration>[\d.]+)/
| stats avg(duration) as avg_ms,
        pct(duration, 50) as p50_ms,
        pct(duration, 95) as p95_ms,
        pct(duration, 99) as p99_ms
  by bin(1h)
| sort @timestamp desc
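
The queries in this section can also be run programmatically. Below is a minimal boto3 sketch using start_query and get_query_results; it assumes the aggregator's log group follows the default /aws/lambda/<function-name> naming.

import time
from datetime import datetime, timedelta, timezone
import boto3

logs = boto3.client("logs", region_name="us-east-1")

QUERY = """
fields @timestamp, @message
| filter @message like /Batch processing completed/
| parse @message /processing_time_ms":(?<duration>[\\d.]+)/
| stats avg(duration) as avg_ms, pct(duration, 95) as p95_ms by bin(1h)
"""

start = logs.start_query(
    logGroupName="/aws/lambda/ZAEL-limiter-aggregator",  # assumed log group name
    startTime=int((datetime.now(timezone.utc) - timedelta(days=1)).timestamp()),
    endTime=int(datetime.now(timezone.utc).timestamp()),
    queryString=QUERY,
)

# Poll until the query finishes, then print each result row as a dict.
while True:
    result = logs.get_query_results(queryId=start["queryId"])
    if result["status"] in ("Complete", "Failed", "Cancelled"):
        break
    time.sleep(1)

for row in result["results"]:
    print({field["field"]: field["value"] for field in row})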

Error Analysis

Find recent errors and warnings:

fields @timestamp, @message, @logStream
| filter level = "ERROR" or level = "WARNING"
| parse @message /message":"(?<error_message>[^"]+)/
| sort @timestamp desc
| limit 100

Invocation Summary

Aggregate processing metrics:

fields @timestamp, @message
| filter @message like /Lambda invocation completed/
| parse @message /processed":(?<processed>\d+).*snapshots_updated":(?<snapshots>\d+)/
| stats sum(processed) as total_processed,
        sum(snapshots) as total_snapshots,
        count() as invocations
  by bin(1h)
| sort @timestamp desc

Entity Usage Analysis

Find highest-usage entities:

fields @timestamp, @message
| filter @message like /Snapshot updated/
| parse @message /entity_id":"(?<entity>[^"]+)".*resource":"(?<resource>[^"]+)/
| stats count() as updates by entity, resource
| sort updates desc
| limit 50

Cold Start Detection

Identify Lambda cold starts:

fields @timestamp, @message, @duration
| filter @type = "REPORT"
| filter @message like /Init Duration/
| parse @message /Init Duration: (?<init_duration>[\d.]+) ms/
| stats count() as cold_starts,
        avg(init_duration) as avg_init_ms
  by bin(1h)

Error Rate Calculation

Calculate error rate percentage:

fields @timestamp
| filter @message like /Lambda invocation/
| parse @message /error_count":(?<errors>\d+)/
| stats sum(errors) as total_errors, count() as total_invocations
| display total_errors, total_invocations,
         (total_errors * 100.0 / total_invocations) as error_rate_pct
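
The same ratio can be derived from Lambda metrics rather than logs. Here is a sketch using GetMetricData with a metric math expression; the function name again assumes the aggregator naming used in the examples above.

from datetime import datetime, timedelta, timezone
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

def lambda_sum(query_id: str, metric_name: str) -> dict:
    """Build a Sum query for an AWS/Lambda metric on the aggregator function."""
    return {
        "Id": query_id,
        "MetricStat": {
            "Metric": {
                "Namespace": "AWS/Lambda",
                "MetricName": metric_name,
                "Dimensions": [
                    {"Name": "FunctionName", "Value": "ZAEL-limiter-aggregator"}
                ],
            },
            "Period": 3600,
            "Stat": "Sum",
        },
    }

resp = cloudwatch.get_metric_data(
    MetricDataQueries=[
        lambda_sum("errors", "Errors"),
        lambda_sum("invocations", "Invocations"),
        {
            "Id": "error_rate_pct",
            "Expression": "100 * errors / invocations",
            "Label": "Error rate (%)",
        },
    ],
    StartTime=datetime.now(timezone.utc) - timedelta(days=1),
    EndTime=datetime.now(timezone.utc),
)
for result in resp["MetricDataResults"]:
    print(result["Id"], result["Values"])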

X-Ray Tracing

Future Enhancement

X-Ray tracing integration is planned for a future release. Track progress in Issue #107.

Planned capabilities include:

  • Lambda Active Tracing - End-to-end request visibility
  • DynamoDB SDK Instrumentation - Database call traces
  • Custom Subsegments - Business logic timing (acquire/release operations)
  • Trace Header Propagation - Cross-service correlation

Dashboard Templates

Operations Dashboard

Create a CloudWatch dashboard for day-to-day operations:

{
  "widgets": [
    {
      "type": "metric",
      "properties": {
        "title": "Lambda Invocations & Errors",
        "region": "${AWS::Region}",
        "metrics": [
          ["AWS/Lambda", "Invocations", "FunctionName", "${TableName}-aggregator", {"stat": "Sum"}],
          [".", "Errors", ".", ".", {"stat": "Sum", "color": "#d62728"}]
        ],
        "period": 300,
        "view": "timeSeries"
      }
    },
    {
      "type": "metric",
      "properties": {
        "title": "Lambda Duration (p50/p95/p99)",
        "region": "${AWS::Region}",
        "metrics": [
          ["AWS/Lambda", "Duration", "FunctionName", "${TableName}-aggregator", {"stat": "p50"}],
          ["...", {"stat": "p95"}],
          ["...", {"stat": "p99"}]
        ],
        "period": 300,
        "view": "timeSeries"
      }
    },
    {
      "type": "metric",
      "properties": {
        "title": "Stream Iterator Age",
        "region": "${AWS::Region}",
        "metrics": [
          ["AWS/Lambda", "IteratorAge", "FunctionName", "${TableName}-aggregator", {"stat": "Maximum"}]
        ],
        "period": 60,
        "view": "timeSeries",
        "annotations": {
          "horizontal": [{"value": 30000, "label": "Threshold (30s)"}]
        }
      }
    },
    {
      "type": "metric",
      "properties": {
        "title": "DynamoDB Capacity",
        "region": "${AWS::Region}",
        "metrics": [
          ["AWS/DynamoDB", "ConsumedReadCapacityUnits", "TableName", "${TableName}", {"stat": "Sum"}],
          [".", "ConsumedWriteCapacityUnits", ".", ".", {"stat": "Sum"}]
        ],
        "period": 300,
        "view": "timeSeries"
      }
    },
    {
      "type": "metric",
      "properties": {
        "title": "DynamoDB Throttles",
        "region": "${AWS::Region}",
        "metrics": [
          ["AWS/DynamoDB", "ReadThrottleEvents", "TableName", "${TableName}", {"stat": "Sum"}],
          [".", "WriteThrottleEvents", ".", ".", {"stat": "Sum"}]
        ],
        "period": 300,
        "view": "timeSeries"
      }
    },
    {
      "type": "metric",
      "properties": {
        "title": "Dead Letter Queue",
        "region": "${AWS::Region}",
        "metrics": [
          ["AWS/SQS", "ApproximateNumberOfMessagesVisible", "QueueName", "${TableName}-aggregator-dlq"]
        ],
        "period": 60,
        "view": "singleValue"
      }
    }
  ]
}

Capacity Planning Dashboard

Create a dashboard for capacity analysis:

{
  "widgets": [
    {
      "type": "metric",
      "properties": {
        "title": "RCU/WCU Consumption Trend",
        "region": "${AWS::Region}",
        "metrics": [
          ["AWS/DynamoDB", "ConsumedReadCapacityUnits", "TableName", "${TableName}", {"stat": "Sum", "period": 3600}],
          [".", "ConsumedWriteCapacityUnits", ".", ".", {"stat": "Sum", "period": 3600}]
        ],
        "view": "timeSeries",
        "stacked": false
      }
    },
    {
      "type": "metric",
      "properties": {
        "title": "Request Latency Distribution",
        "region": "${AWS::Region}",
        "metrics": [
          ["AWS/DynamoDB", "SuccessfulRequestLatency", "TableName", "${TableName}", "Operation", "GetItem", {"stat": "p50"}],
          ["...", {"stat": "p95"}],
          ["...", {"stat": "p99"}]
        ],
        "period": 300,
        "view": "timeSeries"
      }
    },
    {
      "type": "metric",
      "properties": {
        "title": "Lambda Concurrent Executions",
        "region": "${AWS::Region}",
        "metrics": [
          ["AWS/Lambda", "ConcurrentExecutions", "FunctionName", "${TableName}-aggregator", {"stat": "Maximum"}]
        ],
        "period": 60,
        "view": "timeSeries"
      }
    },
    {
      "type": "metric",
      "properties": {
        "title": "Throttle Events (7 Day)",
        "region": "${AWS::Region}",
        "metrics": [
          ["AWS/DynamoDB", "ReadThrottleEvents", "TableName", "${TableName}", {"stat": "Sum", "period": 86400}],
          [".", "WriteThrottleEvents", ".", ".", {"stat": "Sum", "period": 86400}]
        ],
        "view": "bar"
      }
    }
  ]
}

Dashboard Deployment

Replace ${TableName} with your actual table name (e.g., ZAEL-limiter) and ${AWS::Region} with your region before deploying.
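
A minimal sketch of performing the substitution and publishing the dashboard with boto3; dashboard.json is assumed to contain one of the templates above, and the dashboard name is arbitrary.

import boto3

REGION = "us-east-1"
TABLE_NAME = "ZAEL-limiter"

# Load a dashboard template and substitute the placeholders used above.
with open("dashboard.json") as f:
    body = f.read()
body = body.replace("${TableName}", TABLE_NAME).replace("${AWS::Region}", REGION)

cloudwatch = boto3.client("cloudwatch", region_name=REGION)
cloudwatch.put_dashboard(
    DashboardName=f"{TABLE_NAME}-operations",
    DashboardBody=body,
)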

Alert Configuration

Default Alarms

The stack deploys these alarms when alarms are enabled (the default):

| Alarm | Metric | Threshold | Period | Evaluation |
| --- | --- | --- | --- | --- |
| {name}-aggregator-error-rate | Lambda Errors | > 1 | 5 min | 2 periods |
| {name}-aggregator-duration | Lambda Duration | > 80% of timeout | 5 min | 2 periods |
| {name}-stream-iterator-age | IteratorAge | > 30,000 ms | 5 min | 2 periods |
| {name}-aggregator-dlq-alarm | SQS Messages | >= 1 | 5 min | 1 period |
| {name}-read-throttle | ReadThrottleEvents | > 1 | 5 min | 2 periods |
| {name}-write-throttle | WriteThrottleEvents | > 1 | 5 min | 2 periods |
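
For reference, the following boto3 sketch creates an alarm equivalent to the DLQ row above. The deployed stack creates these alarms for you; this is only an illustration, and the alarm and queue names assume the {name}-aggregator-dlq pattern with name=limiter.

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

cloudwatch.put_metric_alarm(
    AlarmName="limiter-aggregator-dlq-alarm",
    Namespace="AWS/SQS",
    MetricName="ApproximateNumberOfMessagesVisible",
    Dimensions=[{"Name": "QueueName", "Value": "limiter-aggregator-dlq"}],
    Statistic="Maximum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=1,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:alerts"],  # optional SNS topic
)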

Deploying with Alarms

# Deploy with alarms enabled (default)
zae-limiter deploy --name limiter --region us-east-1

# Deploy with SNS notifications
zae-limiter deploy --name limiter --region us-east-1 \
    --alarm-sns-topic arn:aws:sns:us-east-1:123456789012:alerts

# Customize duration threshold (70% of timeout)
zae-limiter deploy --name limiter --region us-east-1 \
    --lambda-duration-threshold-pct 70

# Disable alarms (not recommended for production)
zae-limiter deploy --name limiter --region us-east-1 --no-alarms

Threshold Tuning Guide

| Alarm | Default | When to Increase | When to Decrease |
| --- | --- | --- | --- |
| Error Rate | > 1 / 5 min | High-volume systems with rare transient errors | Critical systems requiring immediate response |
| Duration | 80% of timeout | Batch workloads with variable processing time | Latency-sensitive applications |
| Iterator Age | 30 seconds | Batch-tolerant analytics workloads | Real-time processing requirements |
| DLQ Messages | >= 1 | Never (always investigate DLQ messages) | N/A |
| Throttles | > 1 / 5 min | During planned traffic spikes | Before hitting capacity limits |

Programmatic Configuration

from zae_limiter import RateLimiter, StackOptions

limiter = RateLimiter(
    name="limiter",
    region="us-east-1",
    stack_options=StackOptions(
        enable_alarms=True,
        alarm_sns_topic="arn:aws:sns:us-east-1:123456789012:alerts",
        lambda_duration_threshold_pct=75,  # Alert at 75% of timeout
        log_retention_days=90,
    ),
)

Next Steps