
Designing a Serverless Memory Architecture for AI Agents

Isaac Gutiérrez Brugada · 7 min read

Why Serverless Matters for Agent Memory

Most agent memory solutions require always-on infrastructure. A Postgres instance, a Redis cluster, a vector database process. These services cost money even when nobody's using them, and they require capacity planning, patching, and monitoring.

For agent memory, the usage pattern is spiky. A SaaS product with 100 tenants might have 5 agents running right now and 95 idle. Tomorrow those numbers flip. Traditional databases charge you for peak capacity 24/7.

Serverless flips this model. You pay for what you use. At idle, costs approach zero. Under load, the system scales automatically. No capacity planning. No 3 AM pages because the database ran out of connections.

This is why we built Mnemora on AWS serverless primitives. Here's how the architecture works.

The Stack

Mnemora composes four AWS services, each chosen for a specific memory access pattern:

DynamoDB On-Demand: Working Memory

Working memory is key-value state — the current task, session variables, intermediate results. The access pattern is simple: write a JSON blob keyed by agent ID and session ID, read it back later.

DynamoDB on-demand is ideal for this. Sub-10ms single-item reads and writes. No provisioned capacity — you pay per request ($1.25 per million writes, $0.25 per million reads). At zero traffic, it costs nothing.

The partition key is tenant_id#agent_id and the sort key encodes the entity type (SESSION#<id>, EPISODE#<timestamp>#<id>, etc.). This single-table design means one DynamoDB table serves working memory and hot episodic storage.
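The key layout above can be sketched as a few small helpers. These function names are illustrative, not Mnemora's actual code; they just show how the composite keys are assembled.

```python
# Illustrative helpers for the single-table key design described above.
# Names and formats follow the article; the helpers themselves are hypothetical.

def make_pk(tenant_id: str, agent_id: str) -> str:
    # Partition key: tenant_id#agent_id keeps each agent's items together
    # and naturally spreads tenants across DynamoDB partitions.
    return f"{tenant_id}#{agent_id}"

def session_sk(session_id: str) -> str:
    return f"SESSION#{session_id}"

def episode_sk(timestamp: str, episode_id: str) -> str:
    # Timestamp-first sort key lets a single Query return episodes in
    # chronological order with a begins_with("EPISODE#") condition.
    return f"EPISODE#{timestamp}#{episode_id}"

print(make_pk("acme", "agent-7"))                       # acme#agent-7
print(episode_sk("2024-06-01T12:00:00Z", "ep-9"))       # EPISODE#2024-06-01T12:00:00Z#ep-9
```

Because every entity type shares one partition key, working memory and hot episodes live in the same table and can be fetched together in one query.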

Optimistic locking uses a version integer field. Every update includes the expected version number. If another process updated the item first, the ConditionExpression fails and DynamoDB rejects the write with a ConditionalCheckFailedException, which the API surfaces as a 409 Conflict — no distributed locks needed.
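A minimal sketch of that versioned write, built as the plain keyword arguments you would pass to a boto3 `Table.update_item` call (constructing the dict keeps the example runnable without AWS credentials; attribute names are illustrative):

```python
# Sketch of the optimistic-locking update described above. The returned dict
# mirrors boto3 Table.update_item kwargs; field names are illustrative.

def build_versioned_update(pk: str, sk: str, new_state: dict, expected_version: int) -> dict:
    return {
        "Key": {"pk": pk, "sk": sk},
        "UpdateExpression": "SET #s = :state, version = :next",
        # The write succeeds only if nobody bumped the version first;
        # otherwise DynamoDB raises ConditionalCheckFailedException.
        "ConditionExpression": "version = :expected",
        "ExpressionAttributeNames": {"#s": "state"},
        "ExpressionAttributeValues": {
            ":state": new_state,
            ":next": expected_version + 1,
            ":expected": expected_version,
        },
    }

params = build_versioned_update("acme#agent-7", "SESSION#s1", {"step": 3}, expected_version=4)
# table.update_item(**params)  # would raise on a lost race; caller re-reads and retries
```

On a conflict, the caller re-reads the item, reapplies its change against the fresh state, and retries — classic compare-and-swap semantics without any lock service.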

Aurora Serverless v2 + pgvector: Semantic Memory

Semantic memory requires vector similarity search. You store text, it gets embedded into a 1024-dimensional vector, and later you search by meaning rather than by exact keywords.

Aurora Serverless v2 with the pgvector extension handles this. Aurora Serverless v2 scales in 0.5 ACU increments, from a minimum of 0.5 ACU up to whatever ceiling you set. At minimum capacity, it costs about $0.12/hour (roughly $87/month for 0.5 ACU).

This is the one component that isn't truly scale-to-zero — Aurora Serverless v2 doesn't pause to zero like the original Aurora Serverless v1 did. The minimum 0.5 ACU is the floor. For the capability it provides (full Postgres with vector search, relational queries, and row-level security), we consider this an acceptable trade-off.

The pgvector extension stores embeddings as a native vector(1024) column type. We use HNSW indexing (hnsw (embedding vector_cosine_ops)) for approximate nearest neighbor search, configured with m = 16 and ef_construction = 200 for a good balance of recall and speed.
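The schema and index described above look roughly like the following. The statements are held as Python strings as they would be executed over a psycopg connection to Aurora; the table and column names are illustrative, but the `vector(1024)` type, the HNSW parameters, and the `<=>` cosine-distance operator are standard pgvector.

```python
# Illustrative pgvector DDL matching the setup described above.
# Table/column names are hypothetical; pgvector syntax is standard.

CREATE_TABLE = """
CREATE TABLE semantic_memory (
    id         bigserial PRIMARY KEY,
    tenant_id  text NOT NULL,
    content    text NOT NULL,
    embedding  vector(1024) NOT NULL
);
"""

# HNSW index with the parameters from the text: m = 16, ef_construction = 200.
CREATE_INDEX = """
CREATE INDEX ON semantic_memory
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 200);
"""

# <=> is pgvector's cosine-distance operator; 1 - distance gives similarity.
SEARCH = """
SELECT content, 1 - (embedding <=> %(query)s) AS similarity
FROM semantic_memory
WHERE tenant_id = %(tenant_id)s
ORDER BY embedding <=> %(query)s
LIMIT 10;
"""
```

Filtering by `tenant_id` before the ORDER BY keeps each search inside one tenant's rows, which also pairs naturally with the row-level security policies discussed later.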

S3: Cold Episodic Storage

Recent episodes live in DynamoDB for fast access. But episodic data grows linearly — every agent action, every conversation turn generates an episode. Storing months of history in DynamoDB gets expensive.

S3 provides cost-effective cold storage. Old episodes are tiered from DynamoDB to S3 with a prefix structure: s3://mnemora-episodes-dev/<tenant_id>/<agent_id>/<date>/. S3 storage costs $0.023 per GB/month. For episodic data that's rarely accessed, this is orders of magnitude cheaper than keeping it in DynamoDB.
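The prefix layout above is easy to express as a small key builder — a hypothetical helper, shown here to make the tiering path concrete:

```python
# Illustrative builder for the S3 object-key layout described above:
# <tenant_id>/<agent_id>/<date>/<episode_id>.json
from datetime import date

def episode_object_key(tenant_id: str, agent_id: str, day: date, episode_id: str) -> str:
    # Date-partitioned prefixes make lifecycle rules and batch restores
    # cheap: a whole day of episodes shares one prefix.
    return f"{tenant_id}/{agent_id}/{day.isoformat()}/{episode_id}.json"

print(episode_object_key("acme", "agent-7", date(2024, 6, 1), "ep-9"))
# acme/agent-7/2024-06-01/ep-9.json
```

Scoping IAM policies to the `tenant_id/` prefix then gives per-tenant isolation for free, a point the multi-tenancy section returns to.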

Lambda ARM64: Compute

All six Lambda functions run on ARM64 (Graviton2), which is 20% cheaper than x86 for the same compute. Functions use Python 3.12 and include the AWS SDK, psycopg3 for Aurora connections, and Pydantic for request validation.

Lambda pricing is straightforward: $0.20 per million invocations plus duration charges. At low traffic, the cost is negligible. At high traffic, Lambda's concurrency model handles thousands of parallel requests without any capacity planning.

Why No LLM in the CRUD Path

This is Mnemora's most important architectural decision: basic memory operations — store, read, update, delete — never call an LLM.

Competitors like Mem0 route every operation through an LLM. When you store a memory, the LLM extracts key information, categorizes it, and decides how to merge it with existing knowledge. This is powerful but has real costs:

  • Latency: An LLM call adds 500ms-2000ms to every operation. Mnemora's DynamoDB writes complete in under 10ms.
  • Token cost: Every memory operation burns input and output tokens. At scale, this dominates your bill.
  • Unpredictability: LLM outputs are non-deterministic. The same store operation might produce different results on retry.
  • Dependency: If the LLM provider has an outage, your memory layer goes down too.

Mnemora makes exactly one model call: generating vector embeddings on write. When you store semantic memory, Bedrock Titan embeds the text into a 1024-dim vector. This is a deterministic operation (the same input always produces the same embedding) that takes about 50ms and costs $0.02 per million tokens.
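A sketch of that embedding call, assuming Titan Text Embeddings V2 via Bedrock's InvokeModel API. The request-building part is runnable as-is; the actual invocation (commented out) needs AWS credentials, and the model ID and `dimensions` field are my assumption of the setup, not confirmed by the article.

```python
import json

# Builds an InvokeModel request for Titan Text Embeddings V2 (assumed model).
# Only the request shape is shown; the live call is commented out below.
def titan_embed_request(text: str) -> dict:
    return {
        "modelId": "amazon.titan-embed-text-v2:0",
        "contentType": "application/json",
        "accept": "application/json",
        # Titan V2 accepts a "dimensions" field; 1024 matches the article.
        "body": json.dumps({"inputText": text, "dimensions": 1024}),
    }

req = titan_embed_request("user prefers concise answers")
print(json.loads(req["body"])["dimensions"])  # 1024

# import boto3
# client = boto3.client("bedrock-runtime")
# resp = client.invoke_model(**titan_embed_request("user prefers concise answers"))
# vector = json.loads(resp["body"].read())["embedding"]  # list of 1024 floats
```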

Reads never touch an LLM. Semantic search computes cosine similarity against the stored vectors using pgvector — pure math, no token cost, deterministic results.
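"Pure math" here means nothing more exotic than this — cosine similarity is a dot product over two norms, which pgvector evaluates natively on the stored vectors:

```python
import math

# Cosine similarity: dot(a, b) / (|a| * |b|). This is the same quantity
# pgvector's vector_cosine_ops index ranks by (as 1 - cosine distance).
def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0  (identical direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0  (orthogonal)
```

Same inputs, same ranking, every time — which is exactly the determinism the read path is designed around.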

Cost Analysis

What does this actually cost in practice?

At Idle (~$1/month excluding Aurora)

| Service | Idle Cost |
| --- | --- |
| Aurora Serverless v2 (0.5 ACU) | ~$87/month |
| DynamoDB (on-demand, 0 requests) | $0 |
| Lambda (0 invocations) | $0 |
| S3 (minimal storage) | < $0.10 |
| API Gateway (0 requests) | $0 |

The honest minimum is around $87/month due to Aurora's 0.5 ACU floor. We state "~$1/month at idle" for the non-Aurora components. If you're building a new project and testing with minimal traffic, the Aurora cost is the dominant factor.
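The arithmetic behind the floor is simple. Using the article's ~$0.12/hour figure for minimum Aurora capacity and the per-request prices quoted earlier (the 730 hours/month convention is mine):

```python
# Back-of-envelope for the idle floor, using the article's own prices.
HOURS_PER_MONTH = 730  # common billing approximation

aurora_floor = 0.12 * HOURS_PER_MONTH          # ~$87.6/month at minimum capacity
dynamodb_idle = 0.0                             # on-demand: zero requests, zero cost
lambda_idle = 0.0                               # zero invocations, zero cost

# At 300K requests/month (10K/day), per-request services stay negligible:
dynamo_writes = 300_000 * 1.25 / 1_000_000      # ~$0.375 if all were writes
lambda_invocations = 300_000 * 0.20 / 1_000_000 # ~$0.06 in request charges

print(round(aurora_floor, 2))      # 87.6
print(round(dynamo_writes, 3))     # 0.375
```

Aurora dominates by two orders of magnitude, which is why the rest of the table rounds to zero.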

At 10K Requests/Day

| Service | Cost |
| --- | --- |
| Aurora Serverless v2 (0.5-1 ACU) | ~$87-175/month |
| DynamoDB (300K requests/month) | ~$0.50 |
| Lambda (300K invocations) | ~$0.10 |
| Bedrock Titan embeddings | ~$2-5 |
| API Gateway | ~$0.30 |
| S3 | < $1 |
| **Total** | ~$90-180/month |

The cost scales primarily with Aurora ACU usage and embedding volume. DynamoDB, Lambda, and API Gateway costs remain negligible even at moderate traffic.

Scaling Patterns

DynamoDB Partitioning

The tenant_id#agent_id partition key distributes load across DynamoDB partitions naturally. Each tenant's data is isolated in its own partition space. Hot partitions are automatically split by DynamoDB's adaptive capacity.

Aurora ACU Auto-Scaling

Aurora Serverless v2 scales in 0.5 ACU increments based on CPU and memory utilization. A connection spike from a burst of semantic searches automatically scales the cluster up. When traffic subsides, it scales back down. The scaling takes seconds, not minutes.

Lambda Concurrency

Lambda functions scale to hundreds of concurrent executions instantly. Each invocation gets its own compute environment, so there's no shared state to contend over. The only bottleneck is Aurora connection pooling — we use RDS Proxy to manage database connections and prevent connection exhaustion under high concurrency.

Multi-Tenancy

Tenant isolation is entirely logical, not physical. Every tenant shares the same infrastructure, but data is strictly separated:

  • DynamoDB: The partition key prefix tenant_id# ensures that queries never cross tenant boundaries. DynamoDB's access model means a query cannot read another tenant's partition without knowing that tenant's full key.
  • Aurora: Every query includes a parameterized WHERE tenant_id = $1 clause. Row-level security (RLS) policies provide defense-in-depth — even if application code has a bug, the database enforces isolation.
  • S3: Object prefixes tenant_id/ combined with IAM policies enforce prefix-level isolation within the shared bucket.
  • Lambda authorizer: The API key is resolved to a tenant_id in the authorizer function. Downstream handlers receive the tenant ID from the authorizer context — never from the client request. The client cannot impersonate another tenant.
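The authorizer rule in the last bullet can be sketched as a few lines of handler code. The event shape follows API Gateway's Lambda authorizer context; the `tenant_id` field name is illustrative.

```python
# Sketch of tenant resolution in a downstream handler, as described above.
# tenant_id comes ONLY from the authorizer context, never from client input.

def resolve_tenant(event: dict) -> str:
    ctx = event["requestContext"]["authorizer"]
    tenant_id = ctx.get("tenant_id")
    if not tenant_id:
        # Fail closed: no verified tenant, no data access.
        raise PermissionError("missing tenant context")
    return tenant_id

fake_event = {
    "requestContext": {"authorizer": {"tenant_id": "acme"}},
    # A client-supplied header is simply never consulted:
    "headers": {"x-tenant-id": "someone-else"},
}
print(resolve_tenant(fake_event))  # acme
```

Every downstream access — DynamoDB key prefix, SQL WHERE clause, S3 object prefix — is then derived from this one server-side value, so a forged header has nothing to attach to.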

This shared infrastructure model is what makes serverless cost-effective. Each tenant pays only for their usage, and idle tenants cost nothing.

The Trade-Offs

This architecture isn't perfect for every use case:

  • Aurora's minimum cost means you always pay for 0.5 ACU, even at zero traffic. For hobby projects, this might be more than you want to spend.
  • No self-hosting. The tight integration with AWS services means Mnemora can't run on arbitrary infrastructure.
  • Cold starts. Lambda functions have cold start latency of 200-500ms after periods of inactivity. For latency-sensitive applications, provisioned concurrency adds cost.
  • Regional. Mnemora runs in us-east-1. Multi-region deployments would require significant additional infrastructure.

For teams building production agent systems on AWS, these trade-offs are usually acceptable. The combination of pay-per-use pricing, automatic scaling, and zero operational overhead makes serverless a strong fit for the spiky, multi-tenant workloads that agent memory systems serve.