Playbook

MongoDB Schema Designer

Design a denormalized MongoDB schema with proper indexes, embedded vs referenced choices, and validation rules.

MongoDB Schema Designer

Design a MongoDB schema that's optimized for your access patterns — not just a translation of a relational schema. Includes embed-vs-reference decisions, indexes for the queries that matter, and validation rules to prevent bad data.

When to use

  • Starting a new project that's chosen MongoDB
  • Migrating from SQL to MongoDB and need to redesign (don't just translate tables)
  • Performance-tuning an existing schema where queries are slow

Prompt

You are a senior MongoDB architect. Design a MongoDB schema optimized for the
access patterns described below. NoSQL design is access-pattern-first; structure
follows queries.

## Input

**Domain description:**
{{domain_description}}

**Top access patterns (CRITICAL):**
{{access_patterns}}

**Expected scale:**
{{scale}}

## Output

Produce four sections.

### 1. Access pattern analysis

For each access pattern from the input:
- Pattern: "Get all orders for a customer in the last 30 days"
- Frequency: high / medium / low
- Latency target: e.g., <50ms p95
- Access type: read-heavy / write-heavy / mixed
- Implication for schema: e.g., "embed recent orders into customer doc, archive old to separate collection"

### 2. Schema rationale

Explain key decisions:

**Embed vs reference**
For each relationship, decide:
- **Embed** when:
  - One-to-few (typically <100 children)
  - Children are queried with the parent in the same request
  - Children rarely change independently
  - Total document size stays well under 16MB
- **Reference** when:
  - One-to-many or many-to-many (children >100)
  - Children are queried independently
  - Children change frequently
  - Multiple parents share the same child (DRY)

**Atomicity boundaries**
- MongoDB is atomic at the document level
- Group together what must be updated atomically
- Multi-document transactions exist but are slower — design to avoid them

**Read vs write trade-offs**
- Denormalize for reads (duplicate data) when reads vastly outnumber writes
- Reference when writes need consistency more than reads need speed

**Schema flexibility**
- Even though MongoDB is schemaless, USE schema validation
- Document the expected shape — future engineers will thank you

### 3. Collections

For each collection produce:

```
## Collection: collection_name

**Purpose:** [1-2 sentence description]
**Estimated count:** [order of magnitude]
**Average document size:** [estimate]
**Sharding strategy (if applicable):** [shard key + rationale]

### Document shape (example)

```json
{
  "_id": ObjectId("..."),
  "field": "value",
  "nested": {
    "subfield": "value"
  },
  "array_of_embedded": [
    { "key": "value" }
  ],
  "reference": ObjectId("...")  // ref to other_collection
}
```

### JSON schema validator

```javascript
db.createCollection("collection_name", {
  validator: {
    $jsonSchema: {
      bsonType: "object",
      required: ["field1", "field2"],
      properties: {
        field1: {
          bsonType: "string",
          minLength: 1,
          maxLength: 200,
          description: "..."
        }
        // ... all fields
      }
    }
  },
  validationLevel: "strict",
  validationAction: "error"
});
```

### Indexes

```javascript
// Why: supports access pattern "list X by Y"
db.collection_name.createIndex({ "field": 1, "other": -1 });

// Why: enforces uniqueness
db.collection_name.createIndex({ "email": 1 }, { unique: true });

// Why: TTL for session expiry
db.collection_name.createIndex({ "expires_at": 1 }, { expireAfterSeconds: 0 });
```

For each index:
- Comment explaining which access pattern it supports
- Compound index field order matters: equality → sort → range
- Use partial indexes (`partialFilterExpression`) when only some docs need indexing
- Consider sparse indexes for optional fields
```

### 4. Migration script

A single `mongosh` script that:
- Creates all collections with validators
- Creates all indexes
- Seeds reference data if needed
- Idempotent (safe to re-run)

## Quality checklist

- [ ] Every access pattern has at least one supporting index
- [ ] No collection grows unbounded without a TTL or archive strategy
- [ ] All `_id` fields explicitly typed (ObjectId or your custom type)
- [ ] Money stored as Decimal128, not Double
- [ ] Dates stored as Date type, not strings
- [ ] No fields stored as Strings that are really Booleans/Numbers/Dates
- [ ] Schema validator is set on every collection
- [ ] Sharding strategy documented (or "not sharded; reason: ___")

Example input

domain_description: |
  E-commerce platform. Customers place orders. Each order has line items
  referencing products. Customers have shipping addresses (multiple).
  Products have variants (size, color). Reviews on products.
access_patterns: |
  1. Get all orders for a customer (last 30 days), with line items - HIGH freq
  2. Get product detail page (product + variants + recent 10 reviews) - HIGH
  3. Search products by name/category - MEDIUM
  4. Add review to product - MEDIUM
  5. Update order status - LOW
scale: |
  ~500K customers, ~50K products, ~10M orders, ~5M reviews. 90% read.

Tips

  • Access patterns first, schema second. This is the fundamental difference from SQL.
  • Don't translate SQL schemas. A customers + orders + order_items SQL schema usually becomes a customers collection with embedded recent orders + a separate orders collection for historical/searchable data.
  • Test queries with .explain('executionStats') before going to production. Look for IXSCAN (good) vs COLLSCAN (bad).
  • Use Atlas search for full-text — don't try to roll your own with regex on indexed fields.
  • For multi-tenant SaaS, include tenant_id as the first field in every compound index.

Common mistakes to avoid

  • Treating MongoDB like SQL (over-normalization, JOIN-style with $lookup everywhere)
  • Unbounded array growth (a comments array on a post can hit the 16MB limit)
  • No TTL on session/cache/log collections (they grow forever)
  • Indexing every field "just in case" (each index costs writes and storage)
  • Storing dates as strings (breaks range queries)
  • Storing money as Double (use Decimal128 for precision)

Related assets

Command Palette

Search for a command to run...