Upgrading

This guide describes the v2 upgrade path for mlflow-dynamodbstore, moving from manual CLI-driven operations to event-driven automation.

Overview

The v1 architecture relies on manual CLI commands for maintenance tasks. The v2 architecture introduces:

Component	Replaces	Benefit
DynamoDB Streams + Lambda	Manual `denormalize-tags backfill`	Real-time async materialization
OpenSearch Serverless	Trigram-based FTS	Full-text search with ranking
EventBridge scheduled rules	Manual `cleanup-expired`	Automated periodic cleanup

Info

The v2 infrastructure is deployed via the zae-mlflow CDK repository. mlflow-dynamodbstore itself remains the same Python package -- the upgrade is purely on the infrastructure side.

DynamoDB Streams + Lambda

What It Does

A Lambda function consumes DynamoDB Streams events and performs real-time materialization:

Tag denormalization -- When a tag item is written, the Lambda automatically updates the corresponding META item with the denormalized attribute. No more backfill needed for new data.
Trigram index updates -- When a name field changes, trigrams are recomputed and written to the index partition.
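The two materialization paths above can be sketched as a pure function over a stream event. This is an illustrative outline, not the deployed Lambda's code: the `SK` key name, the `#TAG#` marker, and the `key`/`value`/`name` attribute names are assumptions about the item layout, and it presumes the stream is configured to carry both new and old images.

```python
from typing import Iterator


def trigrams(text: str) -> set[str]:
    """Return the set of lowercase 3-character substrings of `text`."""
    normalized = text.lower()
    return {normalized[i:i + 3] for i in range(len(normalized) - 2)}


def plan_updates(event: dict) -> Iterator[tuple[str, dict]]:
    """Yield (action, payload) pairs for each relevant stream record."""
    for record in event.get("Records", []):
        if record.get("eventName") not in ("INSERT", "MODIFY"):
            continue  # deletions are ignored in this sketch
        data = record["dynamodb"]
        new = data.get("NewImage", {})
        old = data.get("OldImage", {})
        sort_key = data["Keys"]["SK"]["S"]  # hypothetical key name
        if "#TAG#" in sort_key:  # hypothetical marker for tag items
            yield "denormalize_tag", {
                "key": new["key"]["S"],
                "value": new["value"]["S"],
            }
        elif "name" in new and new.get("name") != old.get("name"):
            name = new["name"]["S"]
            yield "update_trigrams", {"name": name, "trigrams": trigrams(name)}
```

Keeping the planning step free of AWS calls makes it easy to unit-test; the real handler would translate each planned action into a write against the table.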

Migration Steps

Deploy the CDK stack with Streams enabled:

cd zae-mlflow
npx cdk deploy --context streams=true

Verify the Lambda is processing events:

aws logs tail /aws/lambda/mlflow-materializer --follow

Run a one-time backfill to catch up existing data:

mlflow-dynamodbstore denormalize-tags backfill \
    --table mlflow --region us-east-1

Remove backfill from cron -- the Lambda handles new data going forward.

Note

The backfill CLI command remains available for disaster recovery or re-materialization after schema changes.

Considerations

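Lambda concurrency -- By default, Lambda runs one concurrent invocation per stream shard, so concurrency scales with the shard count. For tables with high write throughput, monitor the function's ConcurrentExecutions and IteratorAge metrics.
Error handling -- The Lambda uses a dead-letter queue (DLQ) for failed events. Monitor the DLQ for poison records (events that fail on every retry).
Ordering -- DynamoDB Streams delivers the changes to any given item in the order they were made, and records within a shard are ordered. The Lambda processes each shard's records in sequence, so updates to an item are applied in write order.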
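One way to combine in-order processing with failure isolation is Lambda's partial batch response. The sketch below assumes the event source mapping has `ReportBatchItemFailures` enabled (whether the CDK stack does this is not stated here); `process` is a hypothetical per-record callback.

```python
from typing import Callable


def handle_batch(event: dict, process: Callable[[dict], None]) -> dict:
    """Process stream records in order; stop and report at the first failure."""
    for record in event.get("Records", []):
        try:
            process(record)
        except Exception:
            sequence_number = record["dynamodb"]["SequenceNumber"]
            # Lambda retries from this record onward, so later records in the
            # shard are not applied ahead of the one that failed.
            return {"batchItemFailures": [{"itemIdentifier": sequence_number}]}
    # An empty list tells Lambda the whole batch succeeded.
    return {"batchItemFailures": []}
```

Stopping at the first failure, rather than collecting every failed record, keeps writes to the same item from being applied out of order. Once the configured retries are exhausted, the failing batch is what ends up in the DLQ.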

OpenSearch Serverless

What It Does

Replaces the trigram-based full-text search with OpenSearch Serverless, providing:

Proper tokenization and stemming
Relevance-ranked results
Fuzzy matching
No 10 GB partition limit for search indexes

Migration Steps

Deploy OpenSearch Serverless collection via CDK:

cd zae-mlflow
npx cdk deploy --context opensearch=true

Configure the tracking store to use OpenSearch for search:

export MLFLOW_DYNAMODBSTORE_SEARCH_BACKEND=opensearch
export MLFLOW_DYNAMODBSTORE_OPENSEARCH_ENDPOINT=https://...aoss.amazonaws.com

Index existing data -- The CDK stack includes a one-time indexing Lambda that scans the DynamoDB table and populates OpenSearch.
Verify search works through the MLflow UI.
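Before restarting the server, it can help to sanity-check the two environment variables from the configuration step. The helper below is a hypothetical pre-flight check, not part of mlflow-dynamodbstore; the `"trigram"` label for the fallback is an assumption used only for illustration.

```python
import os
from typing import Mapping, Optional


def resolve_search_backend(
    env: Mapping[str, str] = os.environ,
) -> tuple[str, Optional[str]]:
    """Return (backend, endpoint) as implied by the environment."""
    backend = env.get("MLFLOW_DYNAMODBSTORE_SEARCH_BACKEND", "").strip().lower()
    if backend != "opensearch":
        # Unset (or any other value) falls back to trigram search in this sketch.
        return "trigram", None
    endpoint = env.get("MLFLOW_DYNAMODBSTORE_OPENSEARCH_ENDPOINT", "").strip()
    if not endpoint.startswith("https://"):
        raise ValueError(
            "MLFLOW_DYNAMODBSTORE_OPENSEARCH_ENDPOINT must be an https:// URL "
            "when the search backend is 'opensearch'"
        )
    return "opensearch", endpoint
```

Calling `resolve_search_backend()` in a deployment script surfaces a missing endpoint before the first search request does.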

Considerations

Cost -- OpenSearch Serverless bills a minimum of roughly 2 OCUs (OpenSearch Compute Units) around the clock, even when the collection is idle. Evaluate whether your search volume justifies that fixed cost.
IAM -- The MLflow server needs IAM permissions to call the OpenSearch endpoint. The CDK stack configures this automatically.
Fallback -- The trigram search remains functional. You can switch back by unsetting MLFLOW_DYNAMODBSTORE_SEARCH_BACKEND.
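To put the OCU floor in concrete terms, the arithmetic is simply OCUs × hourly price × hours per month. The function below is a back-of-the-envelope sketch; the price is deliberately a required argument because it varies by region and over time, and the 730-hour month is an approximation.

```python
def monthly_ocu_floor(
    ocu_count: float,
    price_per_ocu_hour: float,
    hours_per_month: float = 730.0,
) -> float:
    """Estimate the minimum monthly compute charge for an idle collection."""
    return round(ocu_count * price_per_ocu_hour * hours_per_month, 2)
```

For example, `monthly_ocu_floor(2, 0.24)` uses an illustrative price of 0.24 per OCU-hour (an assumption, not a quoted rate); substitute the current figure for your region. Storage is billed separately and is not included here.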

EventBridge Scheduled Cleanup

What It Does

Replaces manual cleanup-expired CLI runs with an EventBridge rule that triggers a Lambda on a schedule.

Migration Steps

Deploy the scheduled cleanup Lambda via CDK:

cd zae-mlflow
npx cdk deploy --context scheduled-cleanup=true

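Configure the schedule (default: daily at 03:00 UTC). Note that a rate expression such as the one below fires every 24 hours counted from when the rule is created, not at a fixed time of day; if the stack passes the value through to EventBridge unchanged, a cron expression like cron(0 3 * * ? *) would pin the run to 03:00 UTC: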

cd zae-mlflow
npx cdk deploy \
    --context scheduled-cleanup=true \
    --context cleanup-schedule="rate(1 day)"

Remove the CLI from cron -- EventBridge handles scheduling.

Monitor via CloudWatch Logs:

aws logs tail /aws/lambda/mlflow-cleanup --follow
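A malformed schedule expression typically only surfaces when the deploy fails, so a quick local check can save a round trip. The sketch below validates the shape of EventBridge `rate(...)` and `cron(...)` expressions; it is a hypothetical helper, checks structure only (not the validity of individual cron fields), and assumes the `cleanup-schedule` context value is handed to EventBridge as-is.

```python
import re

_RATE = re.compile(r"^rate\((\d+) (minute|minutes|hour|hours|day|days)\)$")
_CRON = re.compile(r"^cron\((\S+ ){5}\S+\)$")  # six space-separated fields


def classify_schedule(expression: str) -> str:
    """Return 'rate' or 'cron', or raise ValueError if the shape is wrong."""
    match = _RATE.match(expression)
    if match:
        value, unit = int(match.group(1)), match.group(2)
        if value < 1:
            raise ValueError("rate value must be a positive integer")
        # EventBridge expects a singular unit for 1 and a plural unit otherwise.
        if (value == 1) == unit.endswith("s"):
            raise ValueError(f"unit {unit!r} does not agree with value {value}")
        return "rate"
    if _CRON.match(expression):
        return "cron"
    raise ValueError(f"not a rate() or cron() expression: {expression!r}")
```

`classify_schedule("rate(1 day)")` accepts the value used in the deploy command above, while `rate(1 days)` is rejected.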

Considerations

The Lambda runs the same logic as cleanup-expired -- it scans for orphaned items and sets ttl = now.
Timeout -- For large tables, ensure the Lambda timeout is sufficient (default 5 minutes; increase for tables with millions of items, up to Lambda's 15-minute maximum).
Alerting -- Set up a CloudWatch alarm on Lambda errors to catch failures.
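For tables too large to finish within one invocation, the cleanup loop can watch its own time budget and exit cleanly instead of being killed mid-scan. This is a minimal sketch of that pattern, not the deployed Lambda's code: `pages`, `is_orphaned`, and `expire` are hypothetical stand-ins for the table scan, the orphan check, and the write that sets ttl.

```python
import time
from typing import Callable, Iterable, Optional


def cleanup_pages(
    pages: Iterable[list],
    is_orphaned: Callable[[dict], bool],
    expire: Callable[[dict, int], None],
    remaining_ms: Callable[[], int],
    safety_margin_ms: int = 30_000,
    now: Optional[int] = None,
) -> dict:
    """Expire orphaned items page by page, stopping before the time budget runs out."""
    now = int(time.time()) if now is None else now  # ttl value: current epoch seconds
    expired = 0
    for page in pages:
        if remaining_ms() < safety_margin_ms:
            # Not enough time for another page; report an incomplete run.
            return {"expired": expired, "complete": False}
        for item in page:
            if is_orphaned(item):
                expire(item, now)
                expired += 1
    return {"expired": expired, "complete": True}
```

Inside a Lambda handler, `remaining_ms` would be `context.get_remaining_time_in_millis`. Logging the returned summary gives the CloudWatch alarm something concrete to key on when a run ends with `complete` set to false.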

Upgrade Checklist

Deploy CDK stack with desired v2 features
Run one-time backfill / indexing for existing data
Verify Lambda functions are processing events correctly
Monitor CloudWatch Logs and DLQs for errors
Remove manual CLI cron jobs
Update runbooks to reference Lambda-based operations
Update alerting to include Lambda error metrics

Rollback

All v2 features are additive. To roll back:

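Disable Streams -- Remove the Lambda trigger; the table continues to function without it. If you rely on denormalized tags, put the denormalize-tags backfill back on its schedule, since new tag writes are no longer materialized automatically.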
Switch search backend -- Unset MLFLOW_DYNAMODBSTORE_SEARCH_BACKEND to fall back to trigram search.
Disable EventBridge rule -- Disable the rule and re-enable the CLI cron job.

The CLI commands remain fully functional regardless of whether v2 infrastructure is deployed.