Performance Tuning Guide¶
This guide provides detailed recommendations for optimizing zae-limiter performance, covering DynamoDB capacity planning, Lambda configuration, and cost optimization strategies.
1. DynamoDB Capacity Planning¶
Understanding RCU/WCU Costs¶
Each zae-limiter operation has specific DynamoDB capacity costs. Use this table for capacity planning:
| Operation | RCUs | WCUs | Notes |
|---|---|---|---|
| acquire() | 1 | 1 | O(1) regardless of limit count (composite bucket items) |
| acquire() with cascade | 2 | 4 | Entity + parent bucket reads and writes (TransactWriteItems, 2 WCU per item) |
| acquire() speculative success | 0 | 1 | Skips read; conditional UpdateItem (issue #315) |
| acquire() speculative success + cascade (sequential) | 0 | 2 | Child then parent speculative UpdateItem |
| acquire() speculative success + cascade (parallel) | 0 | 2 | Concurrent child + parent via entity cache (issue #318) |
| acquire() speculative fast rejection | 0 | 0 | Exhausted bucket; rejected from ALL_OLD without write |
| acquire() speculative fallback (non-cascade) | 1 | 2 | Failed speculative (1 WCU) + normal path (1 RCU + 1 WCU) |
| acquire() speculative cascade fallback (parent refill helps) | 0.5 | 3 | Child stays consumed; parent-only read (0.5 RCU) + single-item write (1 WCU) |
| acquire() retry (contention) | 0 | 1 | ADD-based writes don't require re-read |
| acquire() with adjustments | 0 | +1 per entity | Independent writes via write_each() (1 WCU each) |
| acquire() rollback (on exception) | 0 | +1 per entity | Independent compensating writes (1 WCU each) |
| Aggregator bucket refill (per active bucket) | 0 | 1 | Proactive refill via Lambda; 0 WCU if lock lost |
| acquire(limits=None) with config cache miss | +3 | 0 | +3 GetItem operations for config hierarchy |
| available() | 1 | 0 | Read-only, single composite bucket item |
| get_limits() | 1 | 0 | Query operation |
| set_limits() | 1 | N+1 | Query + N PutItems |
| delete_entity() | 1 | batched | Query + BatchWrite in 25-item chunks |
O(1) Cost Optimization (v0.7.0)
ADR-114 (Composite Bucket Items) and ADR-115 (ADD-Based Writes) reduced acquire() costs
from O(N) to O(1) where N is the number of limits. All limits for an entity+resource are
stored in a single DynamoDB item, and ADD operations eliminate the need for read-modify-write
cycles on contention retries.
Namespace Overhead
Namespace-prefixed keys (e.g., {ns}/ENTITY#id) add a few bytes per item but have no measurable impact on RCU/WCU costs. All operations in the table above apply identically regardless of namespace.
Capacity Validation
These costs are validated by automated tests. Run uv run pytest tests/benchmark/test_capacity.py -v to verify.
Capacity Estimation Formula¶
Use these formulas to estimate hourly capacity requirements:
Hourly RCUs = requests/hour × (1 + cascade_pct + config_cache_miss_pct × 3)
Hourly WCUs = requests/hour × (1 + cascade_pct × 3)
With speculative writes enabled (speculative_writes=True), the steady-state formula changes:
Hourly RCUs = requests/hour × (fallback_pct + cascade_pct × 0.5 + config_cache_miss_pct × 3)
Hourly WCUs = requests/hour × (1 + cascade_pct)
Where fallback_pct is the fraction of requests that fall back to the slow path (typically <5% for pre-warmed buckets).
O(1) Scaling (v0.7.0+)
Costs no longer scale with the number of limits per request. Composite bucket items store all limits in a single DynamoDB item, so 1 limit and 10 limits cost the same.
With the Lambda aggregator enabled (2 windows: hourly, daily):
Example Calculations¶
Scenario 1: Simple API Rate Limiting¶
- 10,000 requests/hour
- 2 limits per request (rpm, tpm)
- No cascade, config cached
Calculation:
Limit count doesn't affect cost
With composite bucket items (v0.7.0+), whether you track 1 limit or 10 limits, the DynamoDB cost is identical—all limits are stored in a single item.
Scenario 2: Hierarchical LLM Limiting¶
- 10,000 requests/hour
- 2 limits per request
- 50% use cascade (API key → project)
- 2% config cache miss rate
Calculation:
RCUs = 10,000 × (1 + 0.5 + 0.02×3) = 10,000 × 1.56 = 15,600 RCUs/hour
WCUs = 10,000 × (1 + 0.5×3) = 10,000 × 2.5 = 25,000 WCUs/hour
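The estimation formulas can be checked with a few lines of Python. This is a minimal sketch: the helper name estimate_hourly_capacity and its parameters are illustrative assumptions, not part of the zae-limiter API. It only encodes the two formula pairs given in this section. For Scenario 1 (no cascade, config cached) the standard formula reduces to 10,000 RCUs and 10,000 WCUs per hour.

```python
def estimate_hourly_capacity(
    requests_per_hour: int,
    cascade_pct: float = 0.0,
    config_cache_miss_pct: float = 0.0,
    speculative: bool = False,
    fallback_pct: float = 0.0,
) -> tuple[float, float]:
    """Return (RCUs/hour, WCUs/hour) using the formulas from this guide.

    Hypothetical helper for illustration; not part of zae-limiter.
    """
    if speculative:
        # Steady-state formula with speculative_writes=True
        rcu_factor = fallback_pct + cascade_pct * 0.5 + config_cache_miss_pct * 3
        wcu_factor = 1 + cascade_pct
    else:
        # Standard read-then-write formula
        rcu_factor = 1 + cascade_pct + config_cache_miss_pct * 3
        wcu_factor = 1 + cascade_pct * 3
    return requests_per_hour * rcu_factor, requests_per_hour * wcu_factor


# Scenario 1: no cascade, config cached
s1_rcu, s1_wcu = estimate_hourly_capacity(10_000)
# Scenario 2: 50% cascade, 2% config cache miss rate
s2_rcu, s2_wcu = estimate_hourly_capacity(
    10_000, cascade_pct=0.5, config_cache_miss_pct=0.02
)
print(f"Scenario 1: {s1_rcu:,.0f} RCUs/hour, {s1_wcu:,.0f} WCUs/hour")
print(f"Scenario 2: {s2_rcu:,.0f} RCUs/hour, {s2_wcu:,.0f} WCUs/hour")
```

Passing speculative=True together with a measured fallback_pct gives the speculative-write estimate for the same traffic.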
Billing Mode Selection¶
| Mode | Best For | Trade-offs |
|---|---|---|
| PAY_PER_REQUEST (default) | Variable traffic, new deployments | Higher per-request cost, no planning needed |
| Provisioned | Steady traffic >100 TPS | Lower cost at scale, requires planning |
| Provisioned + Reserved | High-volume production | Lowest cost, 1-year commitment |
Migration Guidance
Start with PAY_PER_REQUEST. Once traffic patterns stabilize (typically 2-4 weeks), analyze CloudWatch metrics to determine optimal provisioned capacity. Switch when monthly on-demand costs exceed provisioned + 20% buffer.
2. Lambda Concurrency Settings¶
The aggregator Lambda processes DynamoDB Stream events to maintain usage snapshots.
Default Configuration¶
| Setting | Default | Range | Impact |
|---|---|---|---|
| Memory | 256 MB | 128-3008 MB | Higher = faster, more expensive |
| Timeout | 60 seconds | 1-900 seconds | Should be 2× typical duration |
| Reserved Concurrency | None | 1-1000 | Limits parallel executions |
Memory Tuning¶
Lambda CPU scales linearly with memory allocation:
| Memory | vCPUs | Best For |
|---|---|---|
| 128 MB | ~0.08 | Minimal workloads (testing only) |
| 256 MB | ~0.15 | Most workloads (default) |
| 512 MB | ~0.30 | High-throughput streams |
| 1024 MB | ~0.60 | Rarely needed |
Guidance based on batch size:
- <50 records/batch: 128-256 MB sufficient
- 50-100 records/batch: 256-512 MB recommended
- Peak streams: Monitor Lambda duration; increase memory if >50% of timeout
Concurrency Management¶
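DynamoDB Streams creates roughly one shard per table partition, and a partition sustains up to 1,000 WCU (about 1,000 writes/sec for items up to 1 KB). By default, each shard invokes one Lambda instance at a time.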
Recommendations:
| Volume | Reserved Concurrency | Notes |
|---|---|---|
| <1000 writes/sec | None | Default scaling sufficient |
| 1000-10000/sec | 10-50 | Prevents runaway scaling |
| >10000/sec | Expected shards + 20% | Based on table monitoring |
Error Handling¶
Configure error handling for production reliability:
# Deploy with DLQ and alarms
zae-limiter deploy --name my-app \
--alarm-sns-topic arn:aws:sns:us-east-1:123456789012:alerts
- Retries: Failed records retry 3 times within the same batch
- DLQ: Persistent failures go to Dead Letter Queue (if configured)
- Duration Alarm: Triggers at 80% of timeout (48s default)
3. Batch Operation Patterns¶
Transaction Limits¶
DynamoDB enforces these limits:
| Constraint | Limit | Impact |
|---|---|---|
| TransactWriteItems | 100 items max | Affects multi-limit updates |
| BatchWriteItem | 25 items per request | Entity deletion is chunked |
| Optimistic locking | Entire transaction fails | Causes retry on contention |
Efficient Patterns¶
Multi-Limit Acquisition¶
# Efficient: Single lease for multiple limits
async with limiter.acquire(
    "entity-id",
    "llm-api",
    {"rpm": 1},  # Initial consumption (1 request)
    limits=[rpm_limit, tpm_limit],
) as lease:
    # 1 BatchGetItem + 1 UpdateItem (1 WCU, single composite bucket)
    response = await call_llm()
    await lease.adjust(tpm=response.usage.total_tokens)
    # Adjustment: +1 UpdateItem via write_each (1 WCU)

# Inefficient: Separate acquisitions
async with limiter.acquire("entity-id", "llm-api", {"rpm": 1}, limits=[rpm_limit]):
    async with limiter.acquire("entity-id", "llm-api", {"tpm": 100}, limits=[tpm_limit]):
        # 2 reads + 2 writes (doubles cost!)
        pass
Cascade Optimization¶
# Entity without cascade (default): saves 1 GetEntity + parent bucket operations
await limiter.create_entity(entity_id="api-key", parent_id="project-1")
async with limiter.acquire("api-key", "llm-api", {"rpm": 1}, limits=limits):
    pass  # Only checks api-key's limits

# Entity with cascade: checks and updates parent limits too
await limiter.create_entity(entity_id="api-key", parent_id="project-1", cascade=True)
async with limiter.acquire("api-key", "llm-api", {"rpm": 1}, limits=limits):
    pass  # Checks both api-key AND project-1 limits
Write Sharding (Automatic Pre-Shard Buckets)¶
Starting with v0.9.0 (GHSA-76rv), zae-limiter automatically handles DynamoDB hot partition
mitigation via pre-shard buckets. Each bucket item lives on its own DynamoDB partition
key (PK={ns}/BUCKET#{id}#{resource}#{shard}), and an auto-injected wcu:1000
infrastructure limit tracks per-partition write pressure.
How it works:
- Every bucket starts with shard_count=1 (shard 0)
- An internal wcu limit (capacity: 1000 millitokens) is auto-injected on every bucket
- When wcu is exhausted on a speculative write, the client doubles shard_count via a conditional write on shard 0 (source of truth)
- The Lambda aggregator proactively doubles shards at >=80% wcu capacity before clients experience throttling
- Shard count changes on shard 0 are propagated to all other shards by the aggregator
- Clients pick a random shard from the entity cache: random.randrange(shard_count)
- If application limits are exhausted on one shard but the entity has multiple shards, the client retries on up to 2 other randomly chosen shards
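The client-side shard selection described in the list above (one random shard, then up to 2 other random shards) could look roughly like the following. This is a sketch under stated assumptions, not the library's implementation: acquire_on_shards and the try_consume callback are hypothetical names, and try_consume stands in for the per-shard conditional write.

```python
import random
from typing import Callable, Optional


def acquire_on_shards(
    shard_count: int,
    try_consume: Callable[[int], bool],
    max_extra_shards: int = 2,
    rng: Optional[random.Random] = None,
) -> Optional[int]:
    """Try one random shard, then up to `max_extra_shards` other shards.

    Returns the index of the shard that accepted the request, or None if
    every shard that was tried is exhausted. Hypothetical helper.
    """
    rng = rng or random.Random()
    first = rng.randrange(shard_count)
    if try_consume(first):
        return first
    # Pick distinct retry shards, never repeating the first one.
    others = [s for s in range(shard_count) if s != first]
    for shard in rng.sample(others, k=min(max_extra_shards, len(others))):
        if try_consume(shard):
            return shard
    return None


# Example: 4 shards where only shard 2 still has application tokens.
has_tokens = {0: False, 1: False, 2: True, 3: False}
tried: list[int] = []


def fake_consume(shard: int) -> bool:
    tried.append(shard)
    return has_tokens[shard]


result = acquire_on_shards(4, fake_consume)
print(f"accepted on shard: {result}, shards tried: {tried}")
```

Because at most three of the four shards are tried, the example can return None even though shard 2 has tokens; bounded retries trade a small rejection risk for a predictable worst-case latency.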
Shard-aware capacity: The aggregator divides effective capacity and refill amount
by shard_count when computing refills, so each shard receives its proportional share
of tokens.
No application code changes required. Pre-shard buckets are transparent to users.
The wcu limit is filtered from all user-facing output (bucket states, exceptions,
usage snapshots).
When automatic sharding is insufficient:
For extreme high-fanout cascade scenarios (1000+ children with cascade=True writing to the
same parent), automatic bucket sharding handles the per-bucket partition pressure. However,
if you need to distribute traffic across multiple logical parents for application-level
load balancing, you can still use manual entity sharding:
# Manual entity sharding for application-level distribution
# (Only needed for extreme cascade fan-out beyond what pre-shard handles)
import zlib

num_shards = 10
api_key_id = "api-key-12345"
# Use a stable hash: built-in hash() on str is randomized per process,
# so the same key could map to different parents in different workers
shard_id = zlib.crc32(api_key_id.encode()) % num_shards
parent_id = f"project-1-shard-{shard_id}"
await limiter.create_entity(
    entity_id=api_key_id,
    parent_id=parent_id,
    cascade=True,
)
Stored Limits Optimization¶
# Config caching reduces RCUs (60s TTL by default)
repo = await Repository.open(config_cache_ttl=60)  # seconds (0 to disable)
limiter = RateLimiter(repository=repo)

# Pass explicit limits to skip config resolution entirely
async with limiter.acquire(
    entity_id="user-123",
    resource="api",
    limits=[Limit.per_minute("rpm", 100)],  # No config lookup
    consume={"rpm": 1},
) as lease:
    ...
Bulk Operations¶
# Efficient bulk limit setup
await limiter.set_limits("entity-1", [rpm_limit, tpm_limit], resource="llm-api")
await limiter.set_limits("entity-2", [rpm_limit, tpm_limit], resource="llm-api")
# Runs 2 Queries + 2×2 PutItems
# Entity deletion (automatically batched in 25-item chunks)
await limiter.delete_entity("entity-2")
# Runs 1 Query + BatchWrite (up to 25 WCUs per chunk)
4. Expected Latencies¶
Operation Latencies¶
Latencies vary by environment and depend on network conditions, DynamoDB utilization, and operation complexity.
| Operation | Moto p50 | LocalStack p50 | AWS (external) p50 | AWS (in-region) p50 |
|---|---|---|---|---|
| acquire() - single limit | 9ms | 43ms | 38ms | 15-20ms |
| acquire() - two limits | 11ms | 43ms | 36ms | 15-20ms |
| acquire() with cascade | 22ms | 47ms | 48ms | 25-35ms |
| available() check | 1ms | 7ms | 10ms | 1-3ms |
Environment Differences
- Moto: In-memory mock, measures code overhead only
- LocalStack: Docker-based, includes local network latency (varies by host)
- AWS (external): From outside AWS, includes internet latency (~8-14ms per round-trip)
- AWS (in-region): From EC2/Lambda in same region (~0.5-1ms per round-trip)
In-Region Performance
When running inside AWS (same region as DynamoDB), latency drops significantly because network round-trips take <1ms instead of 8-14ms. For a typical LLM API call, rate limit overhead is ~4% (20ms / 500ms) vs ~7% when calling from external networks.
Latency Breakdown¶
Typical acquire() latency breakdown for a single limit (non-cascade):
acquire() latency breakdown (external client):
├── Network to AWS ~8-10ms (internet latency)
├── DynamoDB GetItem ~3-5ms (server-side processing)
├── Network back ~8-10ms
├── UpdateItem ~3-5ms (single-item API, 1 WCU)
└── Network back ~8-10ms
─────────
Total: ~30-40ms
acquire() latency breakdown (in-region):
├── Network to DynamoDB ~0.5-1ms (VPC internal)
├── DynamoDB GetItem ~3-5ms
├── Network back ~0.5-1ms
├── UpdateItem ~3-5ms (single-item API, 1 WCU)
└── Network back ~0.5-1ms
─────────
Total: ~10-15ms
With speculative writes enabled (speculative_writes=True), the read round trip is eliminated on success:
speculative acquire() latency breakdown (in-region, success):
├── Network to DynamoDB ~0.5-1ms (VPC internal)
├── Conditional UpdateItem ~3-5ms (1 WCU, skips read)
└── Network back ~0.5-1ms
─────────
Total: ~5-8ms
Single-item vs Transaction writes
Non-cascade acquire() writes a single composite bucket item, so transact_write()
dispatches it as a plain UpdateItem (1 WCU). Cascade mode with 2 items uses
TransactWriteItems (2 WCU per item). Adjustments and rollbacks always use
independent single-item writes via write_each() (1 WCU each).
Environment Selection¶
| Environment | Use Case | Latency Factor |
|---|---|---|
| Moto | Unit tests, CI/CD | 1× (baseline) |
| LocalStack | Integration tests, local dev | 4-5× |
| AWS (external) | Development, testing | 4× |
| AWS (in-region) | Production | 2× |
Run benchmarks to measure your specific environment:
# Moto benchmarks (fast)
uv run pytest tests/benchmark/test_latency.py -v --benchmark-json=latency.json
# LocalStack benchmarks (requires Docker)
docker compose up -d
export AWS_ENDPOINT_URL=http://localhost:4566
uv run pytest tests/benchmark/test_localstack.py -v --benchmark-json=latency.json
# AWS benchmarks (requires credentials)
uv run pytest tests/benchmark/test_aws.py --run-aws -v
5. Throughput Benchmarks¶
Maximum Throughput¶
Theoretical and practical throughput limits depend on contention patterns:
| Scenario | Moto TPS | AWS TPS | Bottleneck |
|---|---|---|---|
| Sequential, single entity | 95 | 28 | Network round-trip |
| Sequential, multiple entities | 76 | 26 | Network round-trip |
| Concurrent, separate entities | 85 | 176 | Scales with parallelism |
| Concurrent, single entity | 88 | — | Optimistic locking contention |
| Cascade sequential | 27 | 19 | Parent bucket operations |
| Cascade concurrent | 28 | 91 | Parent bucket contention |
AWS Concurrent Performance
AWS concurrent throughput (176 TPS) exceeds sequential (28 TPS) because parallel requests to separate entities eliminate serialization. In-region performance would be ~2× higher due to reduced network latency.
Contention Analysis¶
When multiple requests update the same bucket concurrently, DynamoDB's optimistic locking causes transaction retries:
Concurrent updates to same bucket:
├── Request A: Read bucket version=1
├── Request B: Read bucket version=1
├── Request A: Write with condition version=1 → SUCCESS, version=2
├── Request B: Write with condition version=1 → FAIL (ConditionalCheckFailed)
└── Request B: Retry with version=2 → SUCCESS
Each retry adds ~10-30ms latency.
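The sequence above can be reproduced with a small in-memory model. This is only an illustration of version-conditioned writes with retry: the Bucket class, ConditionalCheckFailed exception, and consume helper are hypothetical and do not mirror zae-limiter internals or talk to DynamoDB.

```python
from dataclasses import dataclass


class ConditionalCheckFailed(Exception):
    """Raised when the stored version no longer matches the expected one."""


@dataclass
class Bucket:
    tokens: int = 100
    version: int = 1

    def read(self) -> tuple[int, int]:
        return self.tokens, self.version

    def conditional_write(self, new_tokens: int, expected_version: int) -> None:
        # Mimics a conditional write: succeed only if nobody wrote in between.
        if self.version != expected_version:
            raise ConditionalCheckFailed
        self.tokens, self.version = new_tokens, self.version + 1


def consume(bucket: Bucket, amount: int, snapshot: tuple[int, int], max_retries: int = 3) -> int:
    """Consume `amount` starting from a (tokens, version) snapshot.

    Re-reads and retries on conflict; returns the number of retries used.
    """
    tokens, version = snapshot
    for retries in range(max_retries + 1):
        try:
            bucket.conditional_write(tokens - amount, version)
            return retries
        except ConditionalCheckFailed:
            tokens, version = bucket.read()  # refresh the stale snapshot
    raise RuntimeError("gave up after repeated contention")


bucket = Bucket()
snapshot_a = bucket.read()  # Request A reads version=1
snapshot_b = bucket.read()  # Request B reads version=1
retries_a = consume(bucket, 1, snapshot_a)  # succeeds immediately
retries_b = consume(bucket, 1, snapshot_b)  # conflicts once, then succeeds
print(retries_a, retries_b, bucket)
```

Each retry in this model corresponds to one extra round trip, which is where the added latency per retry comes from.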
Mitigation Strategies¶
# Strategy 1: Higher capacity (reduces contention per request)
rpm_limit = Limit.per_minute("rpm", 1000)
# Strategy 2: Distribute load across entities
# Instead of one shared entity, use sharded entities:
shard = hash(request_id) % 10
entity_id = f"api-key-shard-{shard}"
# Strategy 3: Client-side rate limiting before acquire
# Reduce concurrent requests to the same entity
Running Benchmarks¶
Use the automated benchmark runner:
# Run all benchmarks (moto + LocalStack)
python scripts/run_benchmarks.py
# Include AWS benchmarks
python scripts/run_benchmarks.py --run-aws
# Skip LocalStack (moto only)
python scripts/run_benchmarks.py --skip-localstack
# Custom output directory
python scripts/run_benchmarks.py --output-dir ./results
Or run individual test suites:
# Throughput tests
uv run pytest tests/benchmark/test_throughput.py -v
# Analyze results
python -c "import json; print(json.load(open('benchmark.json'))['benchmarks'])"
6. Cost Optimization Strategies¶
DynamoDB Cost Breakdown¶
Costs vary by region. Using us-east-1 as reference:
| Component | On-Demand Cost | Notes |
|---|---|---|
| Write Request Units | $0.625 per million | Each WCU = one write |
| Read Request Units | $0.125 per million | Each RCU = one read |
| Storage | $0.25 per GB/month | Usually minimal |
| Streams | $0.02 per 100K reads | Lambda polling |
| Lambda | $0.20 per million + duration | Aggregator function |
Cost Estimation Examples¶
O(1) Costs (v0.7.0+)
With composite bucket items, costs no longer multiply by number of limits. Whether you track 2 limits or 10 limits per request, DynamoDB costs are the same.
Low Volume: 10K requests/day¶
DynamoDB:
Writes: 10K × 30 days = 300K WCUs = $0.19
Reads: 10K × 30 days = 300K RCUs = $0.04
Streams: 300K events = $0.06
Lambda: 300K invocations ≈ $0.06 + duration
Storage: ~10 MB = negligible
─────────────────────────────────────────────────────────
Total: ~$0.35/month
Medium Volume: 1M requests/day¶
DynamoDB:
Writes: 1M × 30 = 30M WCUs = $18.75
Reads: 1M × 30 = 30M RCUs = $3.75
Streams: 30M events = $6.00
Lambda: 30M invocations ≈ $6.00 + duration
─────────────────────────────────────────────────────────
Total (on-demand): ~$35/month
Total (provisioned with auto-scaling): ~$22/month
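The on-demand estimates above follow directly from the us-east-1 prices in the cost breakdown table. The sketch below reproduces them; estimate_monthly_cost is a hypothetical helper, and it assumes one RCU, one WCU, one stream read, and one Lambda invocation per request, with Lambda duration charges excluded.

```python
# On-demand prices (USD, us-east-1) as listed in the cost breakdown table.
WRITE_PER_MILLION = 0.625
READ_PER_MILLION = 0.125
STREAM_READ_PER_100K = 0.02
LAMBDA_PER_MILLION = 0.20


def estimate_monthly_cost(requests_per_day: int, days: int = 30) -> dict[str, float]:
    """Estimate monthly on-demand cost for a non-cascade workload.

    Assumes 1 RCU + 1 WCU + 1 stream read + 1 Lambda invocation per request.
    Lambda duration and storage are not included.
    """
    n = requests_per_day * days
    costs = {
        "writes": n / 1_000_000 * WRITE_PER_MILLION,
        "reads": n / 1_000_000 * READ_PER_MILLION,
        "streams": n / 100_000 * STREAM_READ_PER_100K,
        "lambda": n / 1_000_000 * LAMBDA_PER_MILLION,
    }
    costs["total"] = sum(costs.values())
    return costs


low = estimate_monthly_cost(10_000)
medium = estimate_monthly_cost(1_000_000)
for label, costs in (("Low volume", low), ("Medium volume", medium)):
    print(label, {name: round(value, 4) for name, value in costs.items()})
```

The unrounded totals come to about $0.345 and $34.50 per month, in line with the rounded figures in the two examples.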
Cost Reduction Strategies¶
1. Disable Unused Features¶
# Create entity without cascade if not needed
# (saves 1 WCU on the speculative path, 1 RCU + 3 WCUs on the normal path)
await limiter.create_entity(entity_id="entity", parent_id="project-1")  # cascade=False by default
async with limiter.acquire(
    entity_id="entity",
    resource="api",
    consume={"rpm": 1},
    limits=limits,
):
    pass
# If limits are static, pass them explicitly as above: this skips config
# resolution, which costs 3 RCUs on each config cache miss
repo = await Repository.open()
limiter = RateLimiter(repository=repo)
2. Optimize TTL Settings¶
# Shorter TTL = faster cleanup = less storage
# bucket_ttl is configured via builder or CloudFormation
repo = await Repository.open()
limiter = RateLimiter(repository=repo)
3. Reduce Snapshot Granularity¶
# Deploy without aggregator if usage tracking not needed
zae-limiter deploy --name my-app --no-aggregator
4. Switch to Provisioned at Scale¶
- Break-even: ~5M operations/month
- Use auto-scaling with 70% target utilization
- Consider reserved capacity for >20M ops/month
5. Batch Similar Operations¶
# Combine multiple limits into single acquire
async with limiter.acquire(
    entity_id="entity",
    resource="api",
    consume={"rpm": 1},  # 1 transaction vs 3
    limits=[rpm_limit, tpm_limit, daily_limit],
):
    pass
Cost Monitoring¶
Set up CloudWatch metrics for cost tracking:
DynamoDB Metrics:
- ConsumedReadCapacityUnits
- ConsumedWriteCapacityUnits
- AccountProvisionedReadCapacityUtilization
- AccountProvisionedWriteCapacityUtilization
Lambda Metrics:
- Invocations
- Duration
- ConcurrentExecutions
Recommended Alerts:
# Deploy with alarms for cost anomalies
zae-limiter deploy --name my-app \
--alarm-sns-topic arn:aws:sns:us-east-1:123456789012:billing-alerts
# Set AWS Budgets alert at 80% of expected monthly cost
aws budgets create-budget \
--account-id 123456789012 \
--budget file://budget.json \
--notifications-with-subscribers file://notifications.json
7. Config Cache Tuning¶
The config cache reduces DynamoDB reads by caching system defaults, resource defaults, and entity limits. This section covers tuning the cache for your workload.
Cache Configuration¶
from zae_limiter import RateLimiter, Repository
# Default: 60-second TTL (recommended for most workloads)
repo = await Repository.open(config_cache_ttl=60)
limiter = RateLimiter(repository=repo)
# High-frequency updates: Shorter TTL for faster propagation
repo = await Repository.open(config_cache_ttl=10)
limiter = RateLimiter(repository=repo)
# Disable caching: For testing or when config changes must be immediate
repo = await Repository.open(config_cache_ttl=0)
limiter = RateLimiter(repository=repo)
Cost Impact¶
Without caching, each acquire() call performs 3 DynamoDB reads to resolve limits:
- Entity-level config lookup (1 RCU)
- Resource-level config lookup (1 RCU)
- System-level config lookup (1 RCU)
With caching (default):
| Traffic Rate | Cache Hit Rate | Amortized RCU/request |
|---|---|---|
| 1 req/sec | 98.3% | 0.05 RCU |
| 10 req/sec | 99.8% | 0.005 RCU |
| 100 req/sec | 99.98% | 0.0005 RCU |
Negative caching also helps: When an entity has no custom config (95%+ of entities typically), the cache remembers this to avoid repeated lookups.
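The hit rates in the table follow from a simple model: with steady traffic, one request per TTL window misses and pays the 3 config reads, and every other request in that window hits. A short sketch of that arithmetic, where cache_stats is a hypothetical helper and the one-miss-per-window model is a simplifying assumption:

```python
def cache_stats(
    requests_per_sec: float,
    ttl_seconds: float = 60.0,
    reads_per_miss: int = 3,
) -> tuple[float, float]:
    """Return (hit_rate, amortized_rcu_per_request) for one cached config.

    Assumes steady traffic and exactly one miss per TTL window.
    """
    requests_per_window = requests_per_sec * ttl_seconds
    if requests_per_window <= 1:
        # Every request arrives after the entry expired: all misses.
        return 0.0, float(reads_per_miss)
    miss_rate = 1 / requests_per_window
    return 1 - miss_rate, reads_per_miss * miss_rate


for rate in (1, 10, 100):
    hit_rate, rcu = cache_stats(rate)
    print(f"{rate:>3} req/sec: hit rate {hit_rate:.2%}, {rcu:.4f} RCU/request")
```

Calling cache_stats with a different ttl_seconds shows how much read cost a shorter TTL adds for a given traffic rate.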
Automatic Cache Eviction¶
Config-modifying methods (set_limits(), delete_limits()) automatically evict relevant cache entries. Manual invalidation is only needed after external changes (e.g., direct DynamoDB writes or changes from another process).
Manual Invalidation¶
After external config changes, force immediate refresh:
Without manual invalidation, changes propagate within the TTL period (max 60 seconds by default).
Monitoring Cache Performance¶
# Get cache statistics
stats = repo.get_cache_stats()
total = stats.hits + stats.misses
print(f"Cache hit rate: {stats.hits / total:.1%}" if total else "No requests yet")
print(f"Cache entries: {stats.size}")
print(f"TTL: {stats.ttl_seconds}s")
TTL Selection Guidelines¶
| Scenario | Recommended TTL | Rationale |
|---|---|---|
| Production (stable config) | 60s (default) | Best cost/latency trade-off |
| Development/testing | 10-30s | Faster config iteration |
| Compliance-critical | 10-30s | Minimizes staleness |
| Testing with frequent changes | 0 (disabled) | Immediate visibility |
| High-traffic APIs (>100 req/s) | 60-120s | Maximize cache hits |
8. Speculative Writes¶
Speculative writes (issue #315) enable a fast path for acquire() that skips the read round trip by issuing a conditional UpdateItem directly. This is most effective for pre-warmed buckets with sufficient capacity.
Configuring Speculative Writes¶
from zae_limiter import RateLimiter, Repository
repo = await Repository.open()
limiter = RateLimiter(repository=repo, speculative_writes=True) # Default
How It Works¶
Instead of the normal read-then-write flow (BatchGetItem + UpdateItem), the speculative path attempts a conditional UpdateItem first.
First acquire (sequential, populates entity cache):
acquire(entity_id, resource, consume)
|
+- Speculative UpdateItem (condition: bucket exists AND has enough tokens)
   |
   +- SUCCEEDS -> read cascade/parent_id from ALL_NEW, populate entity cache
   |  +- cascade=False -> DONE (1 RT, 0 RCU, 1 WCU)
   |  +- cascade=True -> Parent speculative UpdateItem (sequential)
   |     +- SUCCEEDS -> DONE (2 RT, 0 RCU, 2 WCU)
   |     +- FAILS -> [parent failure handling, see below]
   |
   +- FAILS (ConditionalCheckFailedException)
      +- No ALL_OLD (bucket missing) -> SLOW PATH (creates bucket)
      +- Missing limit in ALL_OLD -> SLOW PATH
      +- Refill would help -> SLOW PATH
      +- Refill won't help -> RateLimitExceeded (0 RCU, 0 WCU)
Subsequent acquires (parallel, issue #318):
When the entity cache contains (cascade=True, parent_id) from a prior acquire, child and parent speculative writes are issued concurrently:
acquire(entity_id, resource, consume) [cache hit: cascade=True, parent_id known]
|
+- asyncio.gather(child_speculative, parent_speculative)
   |
   +- BOTH SUCCEED -> DONE (1 RT, 0 RCU, 2 WCU)
   +- CHILD FAILS, PARENT SUCCEEDS -> Compensate parent, check child ALL_OLD
   |  +- [same child failure handling as sequential path]
   +- CHILD SUCCEEDS, PARENT FAILS -> Check parent ALL_OLD (child stays consumed)
   |  +- No ALL_OLD (missing) -> Compensate child, SLOW PATH
   |  +- Missing limit -> Compensate child, SLOW PATH
   |  +- Refill won't help -> Compensate child, RateLimitExceeded
   |  +- Refill would help -> Parent-only slow path (keep child)
   |     +- Parent acquire succeeds -> DONE (2 RT, 0.5 RCU, 3 WCU)
   |     +- Parent acquire fails -> Compensate child, SLOW PATH
   +- BOTH FAIL -> Check child ALL_OLD, fall back or fast-reject
The ReturnValuesOnConditionCheckFailure=ALL_OLD response provides the current bucket state on failure, allowing the limiter to determine whether refill would help without an additional read.
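To make the "refill would help" decision concrete, here is a minimal sketch of how a failed speculative write could be classified from the returned old state. Everything in it is an assumption for illustration: OldLimitState, its field names, and classify_failure are hypothetical, and the real stored attributes and refill math may differ. It assumes a plain token bucket that refills linearly with elapsed time up to its capacity.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class OldLimitState:
    """Hypothetical view of one limit taken from the ALL_OLD image."""
    tokens: float          # tokens stored at the last write
    last_refill_ms: int    # timestamp of the last refill
    capacity: float        # maximum tokens the bucket can hold
    refill_per_ms: float   # linear refill rate


def projected_tokens(old: OldLimitState, now_ms: int) -> float:
    """Tokens the bucket would hold if refilled for the elapsed time."""
    elapsed = max(0, now_ms - old.last_refill_ms)
    return min(old.capacity, old.tokens + elapsed * old.refill_per_ms)


def classify_failure(
    old_limits: Optional[Mapping[str, OldLimitState]],
    consume: Mapping[str, float],
    now_ms: int,
) -> str:
    """Map a failed conditional write to 'slow_path' or 'rate_limit_exceeded'."""
    if old_limits is None:
        return "slow_path"  # bucket missing: slow path creates it
    if any(name not in old_limits for name in consume):
        return "slow_path"  # limit not stored in the bucket yet
    if all(projected_tokens(old_limits[n], now_ms) >= amt for n, amt in consume.items()):
        return "slow_path"  # refill would help
    return "rate_limit_exceeded"  # fast rejection, no further reads or writes


old = {"rpm": OldLimitState(tokens=0.0, last_refill_ms=0, capacity=100.0, refill_per_ms=0.1)}
print(classify_failure(old, {"rpm": 1}, now_ms=5))   # too early for a refill to cover the request
print(classify_failure(old, {"rpm": 1}, now_ms=20))  # enough time has passed
```

The useful property is that all four branches are decided from data already returned with the failed write, so the fast rejection costs no extra read.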
Deferred cascade compensation: When the child speculative write succeeds but the parent fails with "refill would help", the child's consumption is kept in place while a parent-only slow path is attempted. This avoids compensating the child (1 WCU), re-reading it (0.5 RCU), and using TransactWriteItems for the full cascade write (4 WCU). Instead, only the parent is read (0.5 RCU) and written via a single-item UpdateItem (1 WCU). Compensation only happens when the parent-only path also fails.
Entity metadata cache (issue #318): Repository._entity_cache stores {entity_id: (cascade, parent_id)} as immutable metadata with no TTL. After the first acquire populates the cache (from ALL_NEW on speculative success or from the entity META record on slow path), subsequent cascade acquires fire child and parent speculative writes concurrently via asyncio.gather inside speculative_consume(). This reduces cascade latency from 2 sequential round trips to 1 parallel round trip while maintaining the same WCU cost. In sync mode, asyncio.gather is transformed to self._run_in_executor(lambda: a, lambda: b) using a lazy ThreadPoolExecutor(max_workers=2) for true parallel execution.
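The parallel cascade path can be pictured with a small asyncio model. This is a sketch, not the code inside speculative_consume(): speculative_update is a hypothetical stand-in that sleeps to simulate one DynamoDB round trip, and the returned strings only name the branch that would be taken.

```python
import asyncio
import time


async def speculative_update(bucket: str, ok: bool) -> bool:
    """Hypothetical stand-in for one conditional UpdateItem round trip."""
    await asyncio.sleep(0.05)  # simulated network + DynamoDB latency
    return ok


async def cascade_acquire_parallel(child_ok: bool, parent_ok: bool) -> str:
    # Both writes are in flight at the same time: one round trip of latency.
    child, parent = await asyncio.gather(
        speculative_update("child", child_ok),
        speculative_update("parent", parent_ok),
    )
    if child and parent:
        return "done"
    if parent and not child:
        return "compensate parent, inspect child ALL_OLD"
    if child and not parent:
        return "inspect parent ALL_OLD (child stays consumed)"
    return "inspect child ALL_OLD, fall back or fast-reject"


start = time.perf_counter()
outcome = asyncio.run(cascade_acquire_parallel(True, True))
elapsed = time.perf_counter() - start
print(f"{outcome} in {elapsed * 1000:.0f} ms")
```

With sequential awaits the two simulated writes would take roughly twice as long; gathering them keeps the elapsed time close to a single round trip while the number of writes stays the same.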
Cost Comparison¶
| Scenario | Round Trips | RCU | WCU | Cost per 1M |
|---|---|---|---|---|
| Normal path (non-cascade) | 2 | 1 | 1 | $0.75 |
| Speculative success (non-cascade) | 1 | 0 | 1 | $0.625 |
| Speculative fast rejection (exhausted) | 1 | 0 | 0 | $0.00 |
| Speculative fallback (refill helps) | 3 | 1 | 2 | $1.375 |
| Normal path (cascade) | 3 | 2 | 4 | $2.75 |
| Speculative success (cascade, sequential) | 2 | 0 | 2 | $1.25 |
| Speculative success (cascade, parallel) | 1 | 0 | 2 | $1.25 |
| Speculative cascade fallback (parent refill helps) | 2+ | 0.5 | 3 | $1.94 |
| Speculative cascade fast rejection (parent exhausted) | 1 | 0 | 2 | $1.25 |
When speculative writes save money
The speculative path is cheaper than the normal path when most requests succeed without needing refill. If a high percentage of requests fall back to the slow path (new entities, near-capacity buckets, frequent config changes), the extra WCU from the failed speculative write makes it more expensive.
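A rough break-even point for the non-cascade case can be derived from the on-demand prices in section 6 ($0.125 per million RCUs, $0.625 per million WCUs). The sketch below is an illustration under simplifying assumptions: cost_per_million and the other names are hypothetical, every request either succeeds speculatively or falls back, and fast rejections (which cost nothing) are ignored.

```python
RCU_PER_MILLION = 0.125  # USD, on-demand, us-east-1
WCU_PER_MILLION = 0.625


def cost_per_million(rcu: float, wcu: float) -> float:
    """Cost in USD of one million operations with the given RCU/WCU usage."""
    return rcu * RCU_PER_MILLION + wcu * WCU_PER_MILLION


normal = cost_per_million(rcu=1, wcu=1)     # read-then-write path
success = cost_per_million(rcu=0, wcu=1)    # speculative success
fallback = cost_per_million(rcu=1, wcu=2)   # failed speculative + normal path


def blended_speculative_cost(fallback_pct: float) -> float:
    """Average cost per million when `fallback_pct` of requests fall back."""
    return (1 - fallback_pct) * success + fallback_pct * fallback


# Fallback rate at which speculative writes cost the same as the normal path.
break_even = (normal - success) / (fallback - success)
print(f"normal=${normal:.3f}, success=${success:.3f}, fallback=${fallback:.3f}")
print(f"break-even fallback rate: {break_even:.1%}")
```

Under these assumptions, speculative writes are cheaper as long as fewer than about one in six requests fall back to the slow path; the latency benefit applies regardless.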
Latency Comparison¶
| Scenario | Round Trips | Expected Latency (in-region) |
|---|---|---|
| Normal path (non-cascade) | 2 | 10-15ms |
| Speculative success (non-cascade) | 1 | 5-8ms |
| Speculative fast rejection (exhausted) | 1 | 5-8ms |
| Speculative fallback (refill helps) | 3 | 15-22ms |
| Normal path (cascade) | 3 | 15-22ms |
| Speculative success (cascade, sequential) | 2 | 8-12ms |
| Speculative success (cascade, parallel) | 1 | 5-8ms |
| Speculative cascade fallback (parent refill helps) | 2+ | 12-20ms |
| Speculative cascade fast rejection (parent exhausted) | 1 | 5-8ms |
Aggregator-Assisted Refill (Issue #317)¶
When the Lambda aggregator is enabled, it proactively refills token buckets for active entities between client requests. This keeps speculative writes on the fast path (1 RT, 0 RCU, 1 WCU) by ensuring buckets have sufficient tokens, reducing fallback to the slow path (3 RT, 1 RCU, 2 WCU).
How it works:
- The aggregator processes DynamoDB Stream events for bucket modifications
- For each active (entity, resource) bucket, it aggregates consumption deltas from the batch
- If projected tokens after natural refill are insufficient to cover the observed consumption rate, it writes a proactive refill
- The refill uses ADD (commutative with concurrent speculative writes) and an optimistic lock on rf to prevent double-refill
Cost: 1 WCU per refill written (0 WCU if another writer updated rf first). The cost is amortized across all stream records in a batch, so high-throughput workloads see fewer refills per request.
Aggregator refill + speculative writes
The combination of aggregator-assisted refill and speculative writes provides the best latency and cost profile: the aggregator keeps buckets warm so speculative writes rarely fall back, achieving ~5-8ms p50 latency at $0.625/M requests (non-cascade).
When to Use Speculative Writes¶
Good fit:
- High-throughput workloads with pre-warmed buckets
- Buckets that rarely exhaust capacity (high capacity relative to request rate)
- Latency-sensitive applications where saving one round trip matters
- Cascade entities with repeated acquires (entity cache enables parallel writes after first acquire)
- Deployments with the Lambda aggregator enabled (aggregator keeps buckets warm for speculative success)
Poor fit:
- New entities that have never been seen before (first acquire always falls back)
- Near-capacity buckets that frequently exhaust (high fallback rate)
- Workloads with frequent config changes (missing limits trigger fallback)
- One-shot entities that are only acquired once (entity cache provides no benefit)
Monitoring Speculative Effectiveness¶
Track the ratio of speculative successes to fallbacks to determine if speculative writes are beneficial for your workload:
# Speculative writes work transparently with acquire()
# Monitor DynamoDB ConsumedWriteCapacityUnits to observe:
# - Lower WCU = more speculative successes
# - Higher WCU = more fallbacks (consider disabling)
Disabling speculative writes
If most requests are from new entities or near-capacity buckets, disable with speculative_writes=False to avoid the extra WCU from failed speculative attempts.
9. Load Testing with Locust¶
For realistic, multi-user load testing against a live DynamoDB stack, zae-limiter provides a Locust integration module (zae_limiter.locust). It exposes RateLimiterUser and RateLimiterSession, analogous to Locust's built-in HttpUser and HttpSession, so that every acquire(), available(), and management call fires Locust request events with timing.
Installation¶
Install the [bench] extra to pull in Locust and its dependencies:
Quick Start¶
from locust import task
from zae_limiter.locust import RateLimiterUser


class MyUser(RateLimiterUser):
    stack_name = "my-limiter"

    @task
    def do_acquire(self):
        with self.client.acquire(
            entity_id="user-123",
            resource="gpt-4",
            consume={"rpm": 1, "tpm": 500},
            name="gpt-4/baseline",
        ):
            pass  # simulate work
Run with:
Key Design Points¶
- Shared limiter: A single SyncRateLimiter instance is shared across all Locust user greenlets (thread-safe via boto3).
- Connection pool: _configure_boto3_pool() automatically enlarges the boto3 connection pool (default 1000, override with BOTO3_MAX_POOL env var) to prevent pool exhaustion under high concurrency.
- Event types: ACQUIRE, COMMIT, RATE_LIMITED, AVAILABLE, and management operations (SET_SYSTEM_DEFAULTS, CREATE_ENTITY, etc.) appear as distinct request types in the Locust UI.
- Rate limit handling: RateLimitExceeded is tracked as RATE_LIMITED (not counted as a failure), so Locust statistics cleanly separate infrastructure errors from expected rate limiting.
Example Scenarios¶
Pre-built locustfiles are available in examples/locust/locustfiles/:
| Scenario | File | Description |
|---|---|---|
| Simple | simple.py | Single resource, single limit, basic acquire |
| Max RPS | max_rps.py | Zero-wait back-to-back acquire for throughput ceiling |
| LLM Gateway | llm_gateway.py | 8 LLM models with RPM + TPM and lease adjustments |
| LLM Production | llm_production.py | Weighted tasks with custom daily/spike load shapes |
| Stress | stress.py | 16K entities with whale/spike/power-law traffic patterns |
See examples/locust/README.md for full usage instructions including distributed execution on AWS.
Summary¶
| Optimization Area | Key Recommendations |
|---|---|
| Capacity | Start with on-demand, switch to provisioned at 5M+ ops/month |
| Latency | Expect 15-20ms p50 in-region, 35-45ms external; network is the dominant factor |
| Throughput | Distribute load across entities to avoid contention |
| Cost | Disable cascade/stored_limits when not needed |
| Config Cache | Use default 60s TTL; invalidate manually for immediate changes |
| Speculative Writes | Enable for pre-warmed high-throughput workloads; saves 1 round trip on success; cascade entities get parallel writes after first acquire |
| Load Testing | Use zae_limiter.locust with RateLimiterUser for realistic multi-user load tests; see examples/locust/ |
| Monitoring | Set up CloudWatch alerts for capacity and cost anomalies |
For detailed benchmark data, run: