Lambda Aggregator Operations

This guide covers troubleshooting and operational procedures for the Lambda aggregator function that processes DynamoDB stream events and maintains usage snapshots.

Decision Tree

flowchart TD
    START([Lambda Issue]) --> Q1{What's the symptom?}
    Q1 -->|Error rate alarm| CHECK1[Check CloudWatch Logs]
    Q1 -->|Duration alarm| CHECK2[Check processing time]
    Q1 -->|DLQ messages| CHECK3[Inspect DLQ]
    Q1 -->|Cold starts| CHECK4[Check init duration]
    CHECK1 --> DIAG{Error type?}
    DIAG -->|Permission denied| FIX1[Check IAM role]
    DIAG -->|Timeout| FIX2[Increase memory/timeout]
    DIAG -->|DynamoDB error| LINK1([→ DynamoDB])
    DIAG -->|Code error| FIX3[Check logs, deploy fix]
    CHECK2 --> FIX2
    CHECK3 --> DLQ[DLQ Processing]
    CHECK4 --> FIX4[Increase memory]
    click CHECK1 "#error-rate-issues" "View error diagnostics"
    click FIX1 "#error-rate-issues" "IAM troubleshooting"
    click FIX2 "#high-lambda-duration" "Increase resources"
    click LINK1 "dynamodb/" "DynamoDB operations"
    click FIX3 "#lambda-redeployment" "Redeploy Lambda"
    click DLQ "#messages-in-dead-letter-queue" "DLQ processing"
    click FIX4 "#cold-start-issues" "Cold start fixes"

Health Indicators

Monitor these metrics for Lambda health. See Monitoring Guide for dashboard templates.

Metric	Healthy	Warning	Critical
Error Rate	0%	< 1%	> 1%
Duration	< 50% timeout	< 80% timeout	> 80% timeout
Iterator Age	< 1s	< 30s	> 30s
DLQ Depth	0	1-10	> 10
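
The thresholds in the table can be turned into a small check for scripts or scheduled health probes. The following is a minimal sketch, not part of the project: the function names are hypothetical, and because the table does not say how to treat values that fall exactly on a boundary (for example an error rate of exactly 1%), the sketch assumes they count as critical.

```python
from typing import Literal

Status = Literal["healthy", "warning", "critical"]


def classify_error_rate(percent: float) -> Status:
    """Error Rate row: 0% is healthy, below 1% is a warning, otherwise critical."""
    if percent == 0:
        return "healthy"
    return "warning" if percent < 1 else "critical"


def classify_duration(duration_ms: float, timeout_ms: float) -> Status:
    """Duration row: compare the duration with the configured timeout."""
    ratio = duration_ms / timeout_ms
    if ratio < 0.5:
        return "healthy"
    # Assumption: exactly 80% of the timeout is treated as critical.
    return "warning" if ratio < 0.8 else "critical"


def classify_iterator_age(seconds: float) -> Status:
    """Iterator Age row: under 1s healthy, under 30s warning, otherwise critical."""
    if seconds < 1:
        return "healthy"
    return "warning" if seconds < 30 else "critical"


def classify_dlq_depth(messages: int) -> Status:
    """DLQ Depth row: 0 healthy, 1-10 warning, more than 10 critical."""
    if messages == 0:
        return "healthy"
    return "warning" if messages <= 10 else "critical"
```

For example, `classify_duration(90_000, 120_000)` evaluates a 90-second run against a 120-second timeout (75% of the timeout) and returns `"warning"`.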

Troubleshooting

Symptoms

Usage snapshots not updating
Messages accumulating in Dead Letter Queue (DLQ)
Lambda duration alarm triggered
CloudWatch Logs showing errors

Diagnostic Steps

Check Lambda errors:

# Search the last hour of logs for ERROR entries
# (GNU date syntax; on macOS/BSD use: date -u -v-1H +%s)
aws logs filter-log-events \
  --log-group-name /aws/lambda/<name>-aggregator \
  --start-time "$(date -u -d '1 hour ago' +%s)000" \
  --filter-pattern "ERROR"
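
The same search can be run from Python, which avoids depending on a particular `date` implementation for the start time. This is a sketch under assumptions: `start_time_ms` and `recent_errors` are hypothetical helper names, and the log group follows the `/aws/lambda/<name>-aggregator` pattern used in the command above. `boto3` is imported lazily so the timestamp helper can be used without AWS access.

```python
from datetime import datetime, timedelta, timezone
from typing import List, Optional


def start_time_ms(hours_ago: float, now: Optional[datetime] = None) -> int:
    """Return the epoch timestamp in milliseconds for `hours_ago` hours before `now`."""
    now = now or datetime.now(timezone.utc)
    return int((now - timedelta(hours=hours_ago)).timestamp() * 1000)


def recent_errors(log_group: str, hours_ago: float = 1.0) -> List[str]:
    """Collect log messages matching ERROR from the given log group."""
    import boto3  # lazy import: only needed when actually querying AWS

    logs = boto3.client("logs")
    paginator = logs.get_paginator("filter_log_events")
    messages: List[str] = []
    for page in paginator.paginate(
        logGroupName=log_group,
        startTime=start_time_ms(hours_ago),
        filterPattern="ERROR",
    ):
        messages.extend(event["message"] for event in page["events"])
    return messages
```

A call such as `recent_errors("/aws/lambda/<name>-aggregator")` pages through all matching events rather than stopping at the first page of results.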

Check DLQ message count:

aws sqs get-queue-attributes \
  --queue-url https://sqs.<region>.amazonaws.com/<account>/<name>-aggregator-dlq \
  --attribute-names ApproximateNumberOfMessages

Inspect DLQ messages:

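# Peek at up to 10 messages; a visibility timeout of 0 leaves them
# immediately available to other consumers
aws sqs receive-message \
  --queue-url https://sqs.<region>.amazonaws.com/<account>/<name>-aggregator-dlq \
  --max-number-of-messages 10 \
  --visibility-timeout 0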

Error Rate Issues

CloudWatch Logs Insights query for errors:

fields @timestamp, @message
| filter @message like /ERROR/
| sort @timestamp desc
| limit 50

Common errors and solutions:

Error	Cause	Solution
`AccessDeniedException`	IAM role missing permissions	Check role has DynamoDB and SQS permissions
`ValidationException`	Invalid DynamoDB operation	Check for schema changes
`ResourceNotFoundException`	Table or stream doesn't exist	Verify table name, redeploy stack
`ProvisionedThroughputExceededException`	DynamoDB throttling	See DynamoDB Operations
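
When triaging many log lines, the table above can be applied mechanically. The sketch below is illustrative only: `SOLUTIONS` and `suggest_fix` are hypothetical names, and the matching is a plain substring search on the exception name, which assumes the aggregator logs the exception class name verbatim.

```python
from typing import Optional

# Exception name -> suggested solution, copied from the table above.
SOLUTIONS = {
    "AccessDeniedException": "Check role has DynamoDB and SQS permissions",
    "ValidationException": "Check for schema changes",
    "ResourceNotFoundException": "Verify table name, redeploy stack",
    "ProvisionedThroughputExceededException": "See DynamoDB Operations",
}


def suggest_fix(log_message: str) -> Optional[str]:
    """Return the suggested solution for the first known error name in the message."""
    for error_name, solution in SOLUTIONS.items():
        if error_name in log_message:
            return solution
    return None  # Unknown error: inspect the logs manually
```

Messages that match none of the known names return `None`, which signals that the log entry needs manual inspection.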

High Lambda Duration

Symptoms: Duration alarm triggered, processing_time_ms > 80% of timeout

Diagnostic query:

fields @timestamp, @message
| filter @message like /Batch processing completed/
| parse @message /processing_time_ms":(?<duration>[\d.]+)/
| stats avg(duration) as avg_ms,
        pct(duration, 50) as p50_ms,
        pct(duration, 95) as p95_ms,
        pct(duration, 99) as p99_ms
  by bin(1h)
| sort @timestamp desc
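
For log lines that have already been exported (for example with `aws logs filter-log-events`), roughly the same summary can be computed offline. This is a sketch under assumptions: the function names are hypothetical, the regular expression mirrors the `parse` pattern above but also tolerates whitespace after the colon, and percentiles use the nearest-rank method, which may differ slightly from the values Logs Insights reports.

```python
import math
import re
from typing import Dict, Iterable, List

# Mirrors the parse pattern in the query; \s* additionally allows '": 123'.
PROCESSING_TIME = re.compile(r'processing_time_ms":\s*([\d.]+)')


def percentile(sorted_values: List[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list sorted in ascending order."""
    rank = max(1, math.ceil(pct / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def summarize_durations(messages: Iterable[str]) -> Dict[str, float]:
    """Compute avg/p50/p95/p99 of processing_time_ms across log messages."""
    durations = sorted(
        float(match.group(1))
        for match in (PROCESSING_TIME.search(message) for message in messages)
        if match
    )
    if not durations:
        return {}
    return {
        "avg_ms": sum(durations) / len(durations),
        "p50_ms": percentile(durations, 50),
        "p95_ms": percentile(durations, 95),
        "p99_ms": percentile(durations, 99),
    }
```

Unlike the query, this sketch does not group by hour; it summarizes whatever set of messages is passed in.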

Solutions:

Increase Lambda memory (CPU scales with memory):

aws lambda update-function-configuration \
  --function-name <name>-aggregator \
  --memory-size 512

Reduce batch size in event source mapping:

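# Look up the UUID of the function's first event source mapping
MAPPING_UUID=$(aws lambda list-event-source-mappings \
  --function-name <name>-aggregator \
  --query 'EventSourceMappings[0].UUID' \
  --output text)

# Lower the batch size so each invocation processes fewer records
aws lambda update-event-source-mapping \
  --uuid "$MAPPING_UUID" \
  --batch-size 50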

Check DynamoDB latency - see DynamoDB Operations

Messages in Dead Letter Queue

Symptoms: DLQ alarm triggered, messages accumulating

Investigation:

Check the Lambda logs for the error that caused the failure
Determine whether the error is transient or persistent
Fix the root cause before reprocessing
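
The transient-versus-persistent decision in the steps above can be sketched as a lookup over the error names listed under Error Rate Issues. The grouping is an assumption made for illustration, not something the project defines: throttling is treated as transient because it usually clears once load drops, while permission, schema, and missing-resource errors are treated as persistent because they recur until something is changed. `classify_failure` is a hypothetical name.

```python
# Assumed grouping of the error names from the "Error Rate Issues" table.
TRANSIENT_ERRORS = {"ProvisionedThroughputExceededException"}
PERSISTENT_ERRORS = {
    "AccessDeniedException",
    "ValidationException",
    "ResourceNotFoundException",
}


def classify_failure(log_message: str) -> str:
    """Return 'persistent', 'transient', or 'unknown' for a failure log message."""
    # Persistent errors are checked first: they block reprocessing until fixed.
    if any(name in log_message for name in PERSISTENT_ERRORS):
        return "persistent"
    if any(name in log_message for name in TRANSIENT_ERRORS):
        return "transient"
    return "unknown"
```

A `"persistent"` or `"unknown"` result means the root cause should be investigated and fixed before any DLQ messages are replayed.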

Reprocess DLQ messages after fix:

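import json

import boto3

sqs = boto3.client("sqs")
lambda_client = boto3.client("lambda")

dlq_url = "https://sqs.<region>.amazonaws.com/<account>/<name>-aggregator-dlq"

while True:
    # Long-poll the DLQ for up to 10 messages at a time
    response = sqs.receive_message(
        QueueUrl=dlq_url,
        MaxNumberOfMessages=10,
        WaitTimeSeconds=5,
    )

    messages = response.get("Messages", [])
    if not messages:
        break

    for msg in messages:
        # The message body holds the failed event
        body = json.loads(msg["Body"])

        # Invoke synchronously so that a failure is visible
        # before the message is removed from the DLQ
        result = lambda_client.invoke(
            FunctionName="<name>-aggregator",
            InvocationType="RequestResponse",
            Payload=json.dumps(body),
        )

        if "FunctionError" in result:
            # Stop here: the message stays in the DLQ and becomes
            # visible again after the queue's visibility timeout
            error = result["Payload"].read().decode()
            raise SystemExit(f"Reprocessing failed for {msg['MessageId']}: {error}")

        # Delete from the DLQ only after successful reprocessing
        sqs.delete_message(
            QueueUrl=dlq_url,
            ReceiptHandle=msg["ReceiptHandle"],
        )

        print(f"Reprocessed message: {msg['MessageId']}")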

Purge DLQ (discard all messages):

Warning: Data Loss

This permanently discards failed events. Only use it after confirming that the data is not needed.

aws sqs purge-queue \
  --queue-url https://sqs.<region>.amazonaws.com/<account>/<name>-aggregator-dlq

Cold Start Issues

Diagnostic query:

fields @timestamp, @message, @duration
| filter @type = "REPORT"
| filter @message like /Init Duration/
| parse @message /Init Duration: (?<init_duration>[\d.]+) ms/
| stats count() as cold_starts,
        avg(init_duration) as avg_init_ms
  by bin(1h)
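
The same extraction can be applied to REPORT lines that have already been downloaded. This sketch assumes the standard Lambda REPORT line format, in which `Init Duration: <n> ms` appears only on cold starts; `cold_start_stats` is a hypothetical name, and unlike the query it does not bucket results by hour.

```python
import re
from typing import Dict, Iterable

# Same pattern as the parse step in the query above.
INIT_DURATION = re.compile(r"Init Duration: ([\d.]+) ms")


def cold_start_stats(log_lines: Iterable[str]) -> Dict[str, float]:
    """Count cold starts and average their init duration across REPORT lines."""
    init_durations = []
    for line in log_lines:
        if not line.startswith("REPORT"):
            continue  # Only REPORT lines carry Init Duration
        match = INIT_DURATION.search(line)
        if match:
            init_durations.append(float(match.group(1)))
    count = len(init_durations)
    average = sum(init_durations) / count if count else 0.0
    return {"cold_starts": count, "avg_init_ms": average}
```

Warm invocations produce REPORT lines without an `Init Duration` field, so they are skipped rather than counted as zero.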

Solutions:

Increase Lambda memory (faster initialization)

Enable provisioned concurrency for consistent latency:

# Publish a new version first (provisioned concurrency requires a version or alias)
VERSION=$(aws lambda publish-version \
  --function-name <name>-aggregator \
  --query 'Version' --output text)

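# Configure provisioned concurrency on the published version.
# It only applies to invocations that target this version (or an alias
# pointing to it), not to invocations of $LATEST.
aws lambda put-provisioned-concurrency-config \
  --function-name <name>-aggregator \
  --qualifier "$VERSION" \
  --provisioned-concurrent-executions 2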

Procedures

Lambda Redeployment

Update Lambda code only:

zae-limiter upgrade --name <name> --region <region> --lambda-only

Full stack update:

zae-limiter deploy --name <name> --region <region>

Memory/Timeout Adjustment

Via CLI (recommended):

# Redeploy with new settings
zae-limiter deploy --name <name> --region <region> \
  --lambda-memory 512 \
  --lambda-timeout 120

Direct Lambda update:

aws lambda update-function-configuration \
  --function-name <name>-aggregator \
  --memory-size 512 \
  --timeout 120

Verification

After any Lambda changes, verify health:

# Check function configuration
aws lambda get-function-configuration \
  --function-name <name>-aggregator

# Watch for errors in real-time
aws logs tail /aws/lambda/<name>-aggregator --follow
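
The configuration check can also be automated by comparing the JSON returned by `aws lambda get-function-configuration` with the intended settings. The sketch below is an assumption-laden helper, not part of the project: `config_problems` is a hypothetical name, and it inspects only the `MemorySize`, `Timeout`, and `LastUpdateStatus` fields of the response.

```python
from typing import Any, Dict, List


def config_problems(config: Dict[str, Any], memory_mb: int, timeout_s: int) -> List[str]:
    """Return a list of mismatches between the live configuration and the intended one."""
    problems: List[str] = []
    if config.get("MemorySize") != memory_mb:
        problems.append(f"MemorySize is {config.get('MemorySize')}, expected {memory_mb}")
    if config.get("Timeout") != timeout_s:
        problems.append(f"Timeout is {config.get('Timeout')}, expected {timeout_s}")
    # LastUpdateStatus is Successful, Failed, or InProgress when present.
    status = config.get("LastUpdateStatus")
    if status not in (None, "Successful"):
        problems.append(f"LastUpdateStatus is {status}")
    return problems
```

An empty list means the update has been applied as intended; for example, load the CLI output with `json.loads` and call `config_problems(config, 512, 120)` after the adjustment shown above.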

Related guides:

Stream Processing - Iterator age and stream lag issues
DynamoDB Operations - Throttling and capacity issues
Monitoring Guide - CloudWatch dashboards and alerts