Data Validation Skill

Auto-triggers during data migration work to verify the new system has the same data as legacy. Runs counts, sums, distributions, and spot-checks across both databases, then surfaces drift in a way that's actionable.

When it triggers

"Verify data migration"
"Are the counts matching between old and new?"
"Validate the customers table migrated correctly"
"Check for drift in the orders table"
"Compare aggregate values between source and target DB"

Why a skill

Data validation is high-stakes and error-prone. Counting rows is easy; catching the case where 50 rows have wrong values is hard. This skill encodes the patterns that catch real bugs during migrations: aggregate sums, distribution comparisons, foreign key integrity, sample row diffs, time-windowed reconciliation.

Installation

Copy this skill folder to ~/.claude/skills/data-validation/
Restart Claude Code
Try: "Validate that customers migrated correctly from SQL Server to Postgres"

SKILL.md content

---
name: data-validation
description: |
  Use this skill when the user wants to verify data parity between two
  databases during a migration. Triggers on: "validate migration",
  "check data parity", "compare old and new", "are counts matching",
  "verify backfill", "check for drift", "reconcile databases".

  Generates: validation queries, comparison scripts, drift reports.

  Do NOT use for: greenfield data validation (use schema constraints),
  backup verification (separate concern), or testing new code (use
  test-generator skill).
---

# Data Validation

You generate validation queries and comparison logic to verify data parity
between two databases during migration. You catch the bugs that simple
COUNT(*) won't catch.

## Validation hierarchy

Run validations in this order. Earlier checks are cheaper and rule out
broad issues; later checks catch subtle bugs.

### Level 1: Row counts

The dumbest, fastest check. If counts don't match, stop everything else.

```sql
-- Source (SQL Server)
SELECT COUNT(*) FROM customers;

-- Target (Postgres)
SELECT COUNT(*) FROM customers;

If counts differ:

Investigate WHERE the missing rows are (run with date ranges)
Check the migration log for skipped rows
Don't proceed to other checks until counts match

Level 2: Aggregate sums and stats

Catches data corruption that doesn't show up in counts:

-- Source
SELECT
  COUNT(*) AS total_count,
  SUM(amount) AS total_amount,
  AVG(amount) AS avg_amount,
  MIN(created_at) AS earliest,
  MAX(created_at) AS latest,
  COUNT(DISTINCT customer_id) AS unique_customers
FROM orders;

-- Target
SELECT
  COUNT(*) AS total_count,
  SUM(amount) AS total_amount,
  AVG(amount) AS avg_amount,
  MIN(created_at) AS earliest,
  MAX(created_at) AS latest,
  COUNT(DISTINCT customer_id) AS unique_customers
FROM orders;

Catches:

Currency precision drift (decimal → numeric translation issues)
Date range loss (rows from 1900 silently dropped)
Customer ID translation issues (UUID generation went wrong)

Level 3: Distribution comparisons

Group by enums / status fields and compare:

-- Source
SELECT status, COUNT(*) FROM orders GROUP BY status;

-- Target
SELECT status, COUNT(*) FROM orders GROUP BY status;

Catches:

Enum mapping bugs (legacy 'X' meant 'Refunded' but target maps it to 'Cancelled')
Status migration errors

Level 4: Foreign key integrity

-- Orphaned orders (customer_id pointing to non-existent customer)
SELECT COUNT(*)
FROM orders o
LEFT JOIN customers c ON o.customer_id = c.id
WHERE c.id IS NULL;

Run on both. Should be 0 in target. If non-zero in source, document (those rows were already orphaned in legacy — that's the existing data quality issue, not a migration bug).

Level 5: Spot checks (sample rows)

Take N random rows from source, find the matching rows in target, compare every field:

-- Source: 50 random orders
SELECT TOP 50 * FROM orders ORDER BY NEWID();

Then for each, query target by primary key and diff field-by-field.

This catches:

Field-level translation bugs (one column wasn't migrated; default value used)
Encoding issues (Unicode characters mangled)
Whitespace differences (trailing spaces stripped)

Level 6: Time-windowed reconciliation

For ongoing dual-write or CDC, check that recent changes are flowing:

-- Source: orders created in last hour
SELECT id, created_at FROM orders
WHERE created_at >= DATEADD(hour, -1, GETUTCDATE())
ORDER BY created_at;

-- Target: same query
SELECT id, created_at FROM orders
WHERE created_at >= NOW() - INTERVAL '1 hour'
ORDER BY created_at;

The two result sets should match. If target is missing recent rows, the sync pipeline is lagging or broken.

Level 7: Business invariants

Domain-specific checks. Examples:

Sum of order line items = order total
Sum of all account balances = sum of all transactions
Active customer count + inactive count = total count
Sum of inventory movements = current stock

These catch business logic errors during migration that pure data checks miss.

Process when activated

Step 1: Understand the migration

Ask:

Source connection: how to access source DB
Target connection: how to access target DB
Tables in scope: which tables to validate (often a subset)
Phase: pre-cutover (initial backfill done, planning to verify) or post-cutover (already migrated, verifying ongoing parity)
Tolerance: zero tolerance, or some drift acceptable (e.g., timestamps on metadata fields are fine to differ)

Step 2: Generate validation script

Generate a script that runs all 7 levels of validation. Output the queries in both source and target SQL dialects.

For each table, the script should:

Run Level 1 (counts) — fail loudly if mismatch
Run Level 2 (aggregates) — report differences
Run Level 3 (distributions) — report group-by differences
Run Level 4 (FK integrity) — report orphans
Run Level 5 (spot checks) — sample 50 rows, diff field-by-field
Run Level 6 (recent windows) — if dual-write phase
Run Level 7 (business invariants) — domain-specific

The script:

Connects to both DBs
Runs queries in parallel where possible
Outputs a structured report (Markdown or JSON)
Marks each check as PASS / FAIL / WARN with details

Step 3: Generate the comparison script

For SQL Server → Postgres comparisons, the script needs to handle dialect differences. Generate the queries paired:

# scripts/validation/compare.py
import pyodbc          # for SQL Server
import psycopg2        # for Postgres
import sys
from decimal import Decimal

def validate_table_counts(source_conn, target_conn, table_name):
    src_cur = source_conn.cursor()
    tgt_cur = target_conn.cursor()

    src_cur.execute(f"SELECT COUNT(*) FROM {table_name}")
    src_count = src_cur.fetchone()[0]

    tgt_cur.execute(f"SELECT COUNT(*) FROM {table_name}")
    tgt_count = tgt_cur.fetchone()[0]

    return {
        'check': 'row_count',
        'table': table_name,
        'source': src_count,
        'target': tgt_count,
        'status': 'PASS' if src_count == tgt_count else 'FAIL',
        'delta': tgt_count - src_count
    }

# ... similar for aggregates, distributions, etc.

def main():
    source = pyodbc.connect(SOURCE_CONN_STRING)
    target = psycopg2.connect(TARGET_CONN_STRING)

    results = []
    for table in TABLES_TO_VALIDATE:
        results.append(validate_table_counts(source, target, table))
        results.append(validate_aggregates(source, target, table))
        # ... etc

    report = generate_markdown_report(results)
    print(report)

    # Exit non-zero if any FAIL
    if any(r['status'] == 'FAIL' for r in results):
        sys.exit(1)

Step 4: Generate sample-row diff logic

For Level 5 spot checks, generate code that:

Takes 50 random IDs from source
Queries the same 50 from target
Diffs every field
Reports which fields differ and how

def spot_check_table(source_conn, target_conn, table_name, pk_column='id', sample_size=50):
    src_cur = source_conn.cursor()
    src_cur.execute(f"SELECT TOP {sample_size} * FROM {table_name} ORDER BY NEWID()")
    src_rows = src_cur.fetchall()
    src_columns = [desc[0] for desc in src_cur.description]

    differences = []
    for src_row in src_rows:
        pk_value = src_row[src_columns.index(pk_column)]
        tgt_cur = target_conn.cursor()
        tgt_cur.execute(f"SELECT * FROM {table_name} WHERE {pk_column} = %s", (pk_value,))
        tgt_row = tgt_cur.fetchone()

        if tgt_row is None:
            differences.append({'pk': pk_value, 'issue': 'Missing in target'})
            continue

        for col_idx, col_name in enumerate(src_columns):
            src_val = src_row[col_idx]
            tgt_val = tgt_row[col_idx]
            if not values_equal(src_val, tgt_val, col_name):
                differences.append({
                    'pk': pk_value,
                    'column': col_name,
                    'source': src_val,
                    'target': tgt_val
                })

    return differences

def values_equal(src, tgt, col_name):
    """Compare values with tolerance for known acceptable differences."""
    # Whitespace tolerance for legacy-stored text
    if isinstance(src, str) and isinstance(tgt, str):
        return src.strip() == tgt.strip()
    # Decimal precision tolerance
    if isinstance(src, Decimal) and isinstance(tgt, Decimal):
        return abs(src - tgt) < Decimal('0.001')
    # Skip migration-added timestamp fields
    if col_name in {'migrated_at', 'last_synced'}:
        return True
    return src == tgt

Step 5: Generate the drift report

For ongoing migrations (dual-write, CDC), generate a recurring drift report that:

Runs every hour / shift / day
Reports new drift since last run
Alerts if drift exceeds threshold

def drift_report():
    return {
        'timestamp': datetime.now(),
        'tables': {
            'customers': {
                'count_drift': 0,
                'sum_drift': Decimal('0.00'),
                'last_synced': '2024-01-15T12:00:00Z',
                'lag_seconds': 30
            },
            'orders': {
                'count_drift': 5,  # FAIL
                'sum_drift': Decimal('-127.50'),
                'last_synced': '2024-01-15T11:55:00Z',
                'lag_seconds': 330
            }
        }
    }

Email or Slack the report on schedule. Alert thresholds:

Count drift > 10 rows
Sum drift > $1.00
Sync lag > 5 minutes

Step 6: Output the actual diff

When validation fails, the user wants to see the actual difference:

## Validation report — 2024-01-15 14:00 UTC

### customers
- ✅ Row count: 1,234,567 (match)
- ✅ Sum check: passed
- ✅ Distribution: matches across 5 status values
- ✅ FK integrity: 0 orphans
- ✅ Spot check: 50/50 rows match
- **Status: PASS**

### orders
- ❌ Row count: source=8,901,234, target=8,901,229, delta=-5
- ❌ Sum check: source=$45,678,123.50, target=$45,678,000.00, delta=-$123.50
- ⚠️ Distribution: matches except 'cancelled' (src=12,345, tgt=12,340)
- ✅ FK integrity: 0 orphans
- ❌ Spot check: 47/50 rows match. 3 differences:
  - id=42: column `discount` source=$10.00, target=$0.00
  - id=87: column `notes` source='Test order', target=NULL
  - id=156: column `status` source='cancelled', target='void'
- **Status: FAIL — 5 rows missing, $123.50 missing, 3 spot-check failures**

### Recommended actions
1. Investigate the 5 missing orders (likely the same orders that account for the $123.50)
2. Check status mapping: 'void' vs 'cancelled' — are these supposed to be the same enum value?
3. Re-run migration for affected rows OR backfill manually

Step 7: Don't let "small" drift slide

If drift is small (e.g., 5 rows out of 9 million):

It's still worth investigating
5 missing rows could be 5 customers' orders that vanished
Don't accept "we'll fix it later" — those drift items become permanent

Push back on:

"It's only 0.0001% drift, ship it"
"We'll deal with the missing rows in a one-off script"
"The sums are close enough"

Ask: which 5 rows? Are they meaningful? Why are they missing?

Anti-patterns to avoid

Validating only counts (misses field-level corruption)
Validating only aggregates (misses distribution shifts)
Sample size too small (50 rows out of 10M might miss systematic bugs)
Random sampling without seed (can't reproduce findings)
Treating warnings as passes
Letting drift accumulate (small drift today = big problem tomorrow)
Validating on dev data (validate on production-like volumes)

Output format

Default structure:

## Validation results

[Per-table report with PASS/FAIL/WARN]

## Summary
- Tables checked: N
- Passed: N
- Failed: N
- Warnings: N

## Action items
[Specific, actionable next steps]

## Scripts generated
[Paths to validation scripts]

Make findings copy-pasteable for tickets.


## Pairing with other skills

- **Data Migration Plan** template includes a section on validation; this skill generates the actual validation code
- **Migration Planner Skill** orchestrates this skill during the validation phase
- **Postgres MCP** + **Filesystem MCP** make this skill more useful — direct DB access for queries

## Tips

- Run validation continuously during dual-write phases. Don't wait until cutover to discover drift.
- Save baseline validation reports — comparing today's drift to last week's tells you if the system is degrading.
- For very large tables, sampling at the right scale matters. 50 rows out of 10M might miss systemic bugs; sample smarter (stratified by status, by date range).
- Pair with feature flags: validate per-tenant after each tenant migration.

## Limitations

- Cannot validate behavior, only data state. (Behavior parity tests handle behavior.)
- Cross-engine comparisons (SQL Server vs Postgres) require careful handling of type differences (decimal precision, datetime tz).
- For very large tables, full validation is expensive — generate sampled validation scripts for these.
- Cannot fix the drift, only report it. Remediation is a separate concern.