Lambda Aggregator Operations¶
This guide covers troubleshooting and operational procedures for the Lambda aggregator function that processes DynamoDB stream events and maintains usage snapshots.
Decision Tree¶
Health Indicators¶
Monitor these metrics for Lambda health. See Monitoring Guide for dashboard templates.
| Metric | Healthy | Warning | Critical |
|---|---|---|---|
| Error Rate | 0% | < 1% | > 1% |
| Duration | < 50% timeout | < 80% timeout | > 80% timeout |
| Iterator Age | < 1s | < 30s | > 30s |
| DLQ Depth | 0 | 1-10 | > 10 |
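The thresholds above can be turned into a small classifier for scripts or ad hoc checks. This is a minimal sketch: the function names are hypothetical, and values that fall exactly on a boundary (for example an error rate of exactly 1%) are treated as the more severe level, which is an assumption the table does not state.

```python
def classify_error_rate(percent: float) -> str:
    """Error Rate row: 0% healthy, below 1% warning, otherwise critical."""
    if percent == 0:
        return "healthy"
    return "warning" if percent < 1 else "critical"


def classify_duration(duration_ms: float, timeout_ms: float) -> str:
    """Duration row: compare the duration to the configured timeout."""
    ratio = duration_ms / timeout_ms
    if ratio < 0.5:
        return "healthy"
    return "warning" if ratio < 0.8 else "critical"


def classify_iterator_age(seconds: float) -> str:
    """Iterator Age row: below 1s healthy, below 30s warning."""
    if seconds < 1:
        return "healthy"
    return "warning" if seconds < 30 else "critical"


def classify_dlq_depth(messages: int) -> str:
    """DLQ Depth row: 0 healthy, 1-10 warning, more than 10 critical."""
    if messages == 0:
        return "healthy"
    return "warning" if messages <= 10 else "critical"
```

For example, `classify_duration(90_000, 120_000)` evaluates a 90 s invocation against a 120 s timeout (75%) and returns `"warning"`.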
Troubleshooting¶
Symptoms¶
- Usage snapshots not updating
- Messages accumulating in Dead Letter Queue (DLQ)
- Lambda duration alarm triggered
- CloudWatch Logs showing errors
Diagnostic Steps¶
Check Lambda errors:
# Show ERROR log lines from the last hour (uses GNU date syntax)
aws logs filter-log-events \
--log-group-name /aws/lambda/ZAEL-<name>-aggregator \
--start-time $(date -u -d '1 hour ago' +%s)000 \
--filter-pattern "ERROR"
Check DLQ message count:
aws sqs get-queue-attributes \
--queue-url https://sqs.<region>.amazonaws.com/<account>/ZAEL-<name>-aggregator-dlq \
--attribute-names ApproximateNumberOfMessages
Inspect DLQ messages:
aws sqs receive-message \
--queue-url https://sqs.<region>.amazonaws.com/<account>/ZAEL-<name>-aggregator-dlq \
--max-number-of-messages 10 \
--visibility-timeout 0
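The commands above repeat the same resource names, and `date -d '1 hour ago'` only works with GNU date. A small helper can build both in a portable way. This is a sketch: the function names are hypothetical, and the naming pattern is taken from the commands in this section.

```python
import time
from typing import Optional


def aggregator_resources(name: str, region: str, account: str) -> dict:
    """Build the resource identifiers used in the diagnostic commands."""
    function_name = f"ZAEL-{name}-aggregator"
    return {
        "function_name": function_name,
        "log_group": f"/aws/lambda/{function_name}",
        "dlq_url": (
            f"https://sqs.{region}.amazonaws.com/{account}/{function_name}-dlq"
        ),
    }


def start_time_ms(hours_ago: float, now: Optional[float] = None) -> int:
    """Epoch milliseconds for `hours_ago` hours before `now` (default: current time)."""
    now = time.time() if now is None else now
    return int((now - hours_ago * 3600) * 1000)
```

The value returned by `start_time_ms(1)` can be passed directly as `--start-time` to `aws logs filter-log-events`, which expects epoch milliseconds.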
Error Rate Issues¶
CloudWatch Logs Insights query for errors:
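One way to run an error query from Python is sketched below. The query text is an assumption (it simply mirrors the `ERROR` filter pattern used in the diagnostic steps), and `run_insights_query` is a hypothetical helper. It takes the CloudWatch Logs client as an argument and relies only on the `start_query` and `get_query_results` calls.

```python
import time

# Assumed query: newest log lines containing "ERROR"
ERROR_QUERY = """fields @timestamp, @message
| filter @message like /ERROR/
| sort @timestamp desc
| limit 50"""


def run_insights_query(logs_client, log_group, query, start, end, poll_seconds=1.0):
    """Start a Logs Insights query and poll until it finishes.

    `start` and `end` are epoch seconds. Returns one dict per result row.
    """
    query_id = logs_client.start_query(
        logGroupName=log_group,
        startTime=start,
        endTime=end,
        queryString=query,
    )["queryId"]
    while True:
        result = logs_client.get_query_results(queryId=query_id)
        if result["status"] in ("Complete", "Failed", "Cancelled", "Timeout"):
            break
        time.sleep(poll_seconds)
    if result["status"] != "Complete":
        raise RuntimeError(f"Query ended with status {result['status']}")
    # Each row is a list of {"field": ..., "value": ...} pairs
    return [{f["field"]: f["value"] for f in row} for row in result["results"]]


# Usage (requires boto3 and AWS credentials):
#   import boto3
#   now = int(time.time())
#   rows = run_insights_query(
#       boto3.client("logs"), "/aws/lambda/ZAEL-<name>-aggregator",
#       ERROR_QUERY, now - 3600, now)
```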
Common errors and solutions:
| Error | Cause | Solution |
|---|---|---|
| `AccessDeniedException` | IAM role missing permissions | Check role has DynamoDB and SQS permissions |
| `ValidationException` | Invalid DynamoDB operation | Check for schema changes |
| `ResourceNotFoundException` | Table or stream doesn't exist | Verify table name, redeploy stack |
| `ProvisionedThroughputExceededException` | DynamoDB throttling | See DynamoDB Operations |
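When many log lines come back, it can help to tally which of the errors above actually occur. The sketch below counts each known error name at most once per log message. `KNOWN_ERRORS` and `summarize_errors` are hypothetical names, and the suggestions are copied from the table.

```python
import re
from collections import Counter

# Error names and suggested actions, copied from the table above
KNOWN_ERRORS = {
    "AccessDeniedException": "Check role has DynamoDB and SQS permissions",
    "ValidationException": "Check for schema changes",
    "ResourceNotFoundException": "Verify table name, redeploy stack",
    "ProvisionedThroughputExceededException": "See DynamoDB Operations",
}

_ERROR_PATTERN = re.compile("|".join(re.escape(name) for name in KNOWN_ERRORS))


def summarize_errors(messages):
    """Count how many log messages mention each known error name."""
    counts = Counter()
    for message in messages:
        # set() so a message that repeats a name is counted only once
        counts.update(set(_ERROR_PATTERN.findall(message)))
    return counts


def print_summary(counts):
    """Print errors by frequency, each with its suggested action."""
    for name, count in counts.most_common():
        print(f"{count:4d}  {name}: {KNOWN_ERRORS[name]}")
```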
High Lambda Duration¶
Symptoms: Duration alarm triggered, processing_time_ms > 80% of timeout
Diagnostic query:
fields @timestamp, @message
| filter @message like /Batch processing completed/
| parse @message /processing_time_ms":\s*(?<duration>[\d.]+)/
| stats avg(duration) as avg_ms,
pct(duration, 50) as p50_ms,
pct(duration, 95) as p95_ms,
pct(duration, 99) as p99_ms
by bin(1h)
| sort @timestamp desc
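The same extraction can be done offline on exported log lines, for example to see what share of batches exceed 80% of the timeout. This is a sketch with hypothetical function names. It assumes the log lines contain a JSON-style `"processing_time_ms"` field, as the query above does, and it tolerates optional whitespace after the colon.

```python
import re

_DURATION = re.compile(r'processing_time_ms":\s*(?P<duration>[\d.]+)')


def extract_durations(lines):
    """Return processing_time_ms values (as floats) found in the log lines."""
    durations = []
    for line in lines:
        match = _DURATION.search(line)
        if match:
            durations.append(float(match.group("duration")))
    return durations


def share_over_threshold(durations_ms, timeout_s, threshold=0.8):
    """Fraction of durations above `threshold` of the Lambda timeout."""
    if not durations_ms:
        return 0.0
    limit_ms = timeout_s * 1000 * threshold
    return sum(d > limit_ms for d in durations_ms) / len(durations_ms)
```

A non-zero share for the configured timeout corresponds to the critical row of the Duration health indicator.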
Solutions:
- Increase Lambda memory (CPU scales with memory)
- Reduce batch size in event source mapping
- Check DynamoDB latency - see DynamoDB Operations
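The first two solutions can be applied with the Lambda API. The sketch below is a hypothetical helper that takes a Lambda client and uses `update_function_configuration`, `list_event_source_mappings`, and `update_event_source_mapping`. The example values (512 MB, batch size 50) are illustrative assumptions, not tuned recommendations.

```python
def tune_aggregator(lambda_client, function_name, memory_mb=None, batch_size=None):
    """Apply a memory and/or batch size change and return what was changed."""
    changes = []
    if memory_mb is not None:
        lambda_client.update_function_configuration(
            FunctionName=function_name,
            MemorySize=memory_mb,
        )
        changes.append(f"memory={memory_mb}MB")
    if batch_size is not None:
        # Only the first page of mappings is read; a single stream
        # mapping per function is assumed here.
        mappings = lambda_client.list_event_source_mappings(
            FunctionName=function_name
        )["EventSourceMappings"]
        for mapping in mappings:
            lambda_client.update_event_source_mapping(
                UUID=mapping["UUID"],
                BatchSize=batch_size,
            )
            changes.append(f"batch_size={batch_size} ({mapping['UUID']})")
    return changes


# Usage (requires boto3 and AWS credentials):
#   import boto3
#   tune_aggregator(boto3.client("lambda"), "ZAEL-<name>-aggregator",
#                   memory_mb=512, batch_size=50)
```

Changes made this way bypass the deployment CLI, so treat them as a quick mitigation and carry the new values into the next redeploy.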
Messages in Dead Letter Queue¶
Symptoms: DLQ alarm triggered, messages accumulating
Investigation:
- Check Lambda logs for the error that caused the failure
- Determine whether the error is transient or persistent
- Fix the root cause before reprocessing
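A rough first-pass triage of the error found in the logs might look like the sketch below. The grouping is an assumption: throttling is treated as transient (it usually clears once load drops or capacity is raised), while permission, validation, and missing-resource errors are treated as persistent until someone fixes the cause. `triage` is a hypothetical helper.

```python
# Assumed grouping of the error names listed under "Common errors and solutions"
TRANSIENT_ERRORS = {"ProvisionedThroughputExceededException"}
PERSISTENT_ERRORS = {
    "AccessDeniedException",
    "ValidationException",
    "ResourceNotFoundException",
}


def triage(error_code: str) -> str:
    """Return 'transient', 'persistent', or 'unknown' for an error code."""
    if error_code in TRANSIENT_ERRORS:
        return "transient"
    if error_code in PERSISTENT_ERRORS:
        return "persistent"
    return "unknown"


def safe_to_reprocess(error_code: str, root_cause_fixed: bool) -> bool:
    """Transient errors can be retried; anything else needs a fix first."""
    return triage(error_code) == "transient" or root_cause_fixed
```

Unknown codes are deliberately not retried until the root cause is confirmed fixed.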
Reprocess DLQ messages after fix:
import json

import boto3

sqs = boto3.client('sqs')
lambda_client = boto3.client('lambda')
dlq_url = "https://sqs.<region>.amazonaws.com/<account>/ZAEL-<name>-aggregator-dlq"

while True:
    response = sqs.receive_message(
        QueueUrl=dlq_url,
        MaxNumberOfMessages=10,
        WaitTimeSeconds=5,
    )
    messages = response.get('Messages', [])
    if not messages:
        break  # No visible messages left in the DLQ
    for msg in messages:
        # Reprocess the failed event
        body = json.loads(msg['Body'])
        # Invoke Lambda directly with the failed records.
        # 'Event' is asynchronous: Lambda queues the event and returns
        # immediately, so check the logs to confirm it was processed.
        lambda_client.invoke(
            FunctionName='ZAEL-<name>-aggregator',
            InvocationType='Event',
            Payload=json.dumps(body),
        )
        # Delete from DLQ once the invoke call has been accepted
        # (invoke raises an exception if the request is rejected)
        sqs.delete_message(
            QueueUrl=dlq_url,
            ReceiptHandle=msg['ReceiptHandle'],
        )
        print(f"Resubmitted message: {msg['MessageId']}")
Purge DLQ (discard all messages):
Data Loss
This permanently discards failed events. Only use after confirming data is not needed.
aws sqs purge-queue \
--queue-url https://sqs.<region>.amazonaws.com/<account>/ZAEL-<name>-aggregator-dlq
Cold Start Issues¶
Diagnostic query:
fields @timestamp, @message, @duration
| filter @type = "REPORT"
| filter @message like /Init Duration/
| parse @message /Init Duration: (?<init_duration>[\d.]+) ms/
| stats count() as cold_starts,
avg(init_duration) as avg_init_ms
by bin(1h)
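For a quick check on exported `REPORT` lines, the same numbers can be computed locally. This is a sketch with a hypothetical function name. It relies on Lambda adding `Init Duration: <n> ms` to the `REPORT` line only for cold starts, which is the same assumption the query above makes.

```python
import re

_INIT = re.compile(r"Init Duration: (?P<init>[\d.]+) ms")


def cold_start_stats(report_lines):
    """Count cold starts and average their init duration in milliseconds."""
    inits = []
    for line in report_lines:
        match = _INIT.search(line)
        if match:
            inits.append(float(match.group("init")))
    if not inits:
        return {"cold_starts": 0, "avg_init_ms": None}
    return {"cold_starts": len(inits), "avg_init_ms": sum(inits) / len(inits)}
```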
Solutions:
- Increase Lambda memory (faster initialization)
- Enable provisioned concurrency for consistent latency:
# Publish a new version first (provisioned concurrency requires a version or alias)
VERSION=$(aws lambda publish-version \
--function-name ZAEL-<name>-aggregator \
--query 'Version' --output text)
# Configure provisioned concurrency on the published version
aws lambda put-provisioned-concurrency-config \
--function-name ZAEL-<name>-aggregator \
--qualifier "$VERSION" \
--provisioned-concurrent-executions 2
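Provisioned concurrency takes some time to allocate, and it only applies to invocations that target the configured version or alias, so the event source mapping must point at that qualifier rather than the unqualified function. The sketch below polls `get_provisioned_concurrency_config` until the status is `READY`. The helper name and the polling limits are assumptions.

```python
import time


def wait_for_provisioned_concurrency(
    lambda_client, function_name, qualifier, poll_seconds=5.0, max_polls=60
):
    """Poll until provisioned concurrency is READY and return its config."""
    for _ in range(max_polls):
        config = lambda_client.get_provisioned_concurrency_config(
            FunctionName=function_name,
            Qualifier=qualifier,
        )
        status = config["Status"]  # IN_PROGRESS, READY, or FAILED
        if status == "READY":
            return config
        if status == "FAILED":
            raise RuntimeError(config.get("StatusReason", "allocation failed"))
        time.sleep(poll_seconds)
    raise TimeoutError("Provisioned concurrency was not READY in time")


# Usage (requires boto3 and AWS credentials):
#   import boto3
#   wait_for_provisioned_concurrency(
#       boto3.client("lambda"), "ZAEL-<name>-aggregator", "<version>")
```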
Procedures¶
Lambda Redeployment¶
Update Lambda code only:
Full stack update:
Memory/Timeout Adjustment¶
Via CLI (recommended):
# Redeploy with new settings
zae-limiter deploy --name <name> --region <region> \
--lambda-memory 512 \
--lambda-timeout 120
Direct Lambda update:
aws lambda update-function-configuration \
--function-name ZAEL-<name>-aggregator \
--memory-size 512 \
--timeout 120
Verification¶
After any Lambda changes, verify health:
# Check function configuration
aws lambda get-function-configuration \
--function-name ZAEL-<name>-aggregator
# Watch for errors in real-time
aws logs tail /aws/lambda/ZAEL-<name>-aggregator --follow
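The configuration check can also be scripted so that a change is only considered done once the function reports the expected settings. This is a sketch: `verify_configuration` is a hypothetical helper that reads the `MemorySize`, `Timeout`, `State`, and `LastUpdateStatus` fields from the `get_function_configuration` response.

```python
def verify_configuration(lambda_client, function_name, expected_memory_mb, expected_timeout_s):
    """Return a list of problems; an empty list means the check passed."""
    config = lambda_client.get_function_configuration(FunctionName=function_name)
    problems = []
    if config.get("MemorySize") != expected_memory_mb:
        problems.append(
            f"MemorySize is {config.get('MemorySize')}, expected {expected_memory_mb}"
        )
    if config.get("Timeout") != expected_timeout_s:
        problems.append(
            f"Timeout is {config.get('Timeout')}, expected {expected_timeout_s}"
        )
    if config.get("State") != "Active":
        problems.append(f"State is {config.get('State')}")
    if config.get("LastUpdateStatus") != "Successful":
        problems.append(f"LastUpdateStatus is {config.get('LastUpdateStatus')}")
    return problems


# Usage (requires boto3 and AWS credentials):
#   import boto3
#   issues = verify_configuration(
#       boto3.client("lambda"), "ZAEL-<name>-aggregator", 512, 120)
```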
Related¶
- Stream Processing - Iterator age and stream lag issues
- DynamoDB Operations - Throttling and capacity issues
- Monitoring Guide - CloudWatch dashboards and alerts