Playbook
TemplateDiscoverProduct BriefsFeatured

Data Source Inventory & Profiling

Discovery template for ETL projects: inventory all data sources, profile their structure and quality, and document downstream dependencies.

Data Source Inventory & Profiling

The first step of every data engineering engagement. You can't build a modern data platform on top of sources you don't understand. This template produces the discovery document that drives every later decision — what to ingest, what to skip, what to fix at source.

When to use

  • Starting a cloud data warehouse modernization
  • Inheriting a data platform from another team or vendor
  • Before scoping/estimating an ETL engagement
  • When users complain about "the data" but nobody knows which data
  • For data quality fix-it projects (where to start?)

Why this matters

Most failed data projects fail at this step. Teams skip discovery, build pipelines for the obvious sources, then discover three months in that:

  • The "customer" entity exists in 7 systems with 7 different definitions
  • The "real" data is in a spreadsheet someone in Finance updates monthly
  • A critical report depends on a SQL Server view nobody documented
  • Half the source data has quality issues that have been silently masked by legacy ETL

This template forces honest discovery before commitments are made.

Prompt

You are a senior data architect doing initial discovery on a data
engineering engagement. Generate a comprehensive data source inventory
and profiling document.

## Input

**Project context:** {{project_context}}
**Source systems:** {{source_systems}}
**Target platform:** {{target_platform}}
**Known pain points:** {{known_pain_points}}

## Output

A Markdown document with these sections:

### 1. Engagement summary
1-2 paragraphs: what's the project, why now, what success looks like.

### 2. Source system inventory

For each source system, document:

#### System overview
- **Name and version** (Salesforce Lightning, SAP S/4HANA 2022, SQL Server 2019)
- **Vendor / hosting** (SaaS / on-prem / hybrid)
- **Owner team** (who runs it)
- **Primary use** (what business function it supports)
- **Active? Sunset planned?** (no point ingesting from a system being retired in 6 months)
- **License / contract considerations** (data extraction allowed? per-row costs?)

#### Connectivity
- **Network access** (direct connection, VPN, private endpoint, public API)
- **Authentication** (service principal, key vault secret, OAuth, basic auth)
- **Rate limits** (API call limits, query throttling)
- **Latency to Azure** (matters for incremental loads)

#### Data access methods
- **Native API** (REST, GraphQL, SOAP)
- **Database direct** (JDBC, ODBC, native driver)
- **File extracts** (CSV/JSON/XML drop, SFTP, blob storage)
- **CDC available?** (transaction log access, change feeds)
- **Recommended pattern** for ingestion (which method to use)

#### Data scope in this system
- **Tables / objects in scope** (with row counts where known)
- **Tables / objects explicitly out of scope** (and why)
- **PII / sensitive data flags** (per table)
- **Refresh frequency available** (real-time, hourly, daily, monthly)

### 3. Entity-level profiling

For each major entity (customer, order, product, etc.) across all source systems:

```markdown
## Entity: Customer

**Found in (with row counts):**
- Salesforce: Account (45,000 rows)
- SAP S/4: BUT000 (180,000 rows)
- SQL Server CRM_Legacy: dbo.Customer (52,000 rows)
- Excel from Finance: Top_Accounts.xlsx (~500 rows, manually maintained)

**Definition variance:**
- Salesforce defines customer as "any signed contract"
- SAP defines customer as "any business partner with sales transactions"
- SQL Server includes prospects (not yet customers)
- Excel includes only accounts >$1M ARR

**Master / source of truth:**
[Document conclusion: which is authoritative? Or is the answer "depends"?]

**Identifier strategy:**
- Salesforce uses 18-char Account ID
- SAP uses 10-digit BUT000 ID
- SQL Server uses incremental int
- No common key — needs MDM / matching strategy

**Migration concerns:**
- Customer count varies by system; need reconciliation logic
- PII handling: Salesforce has consent flags, others don't track
- Historical data: SAP keeps full history, Salesforce only current state
```

For each shared entity, this is where the project gets interesting. Document
ALL duplications, not just the convenient ones.

### 4. Data profiling (per major source table)

For each significant source table, profile:

#### Volume and velocity
- **Row count** (current)
- **Daily change rate** (inserts, updates, deletes per day)
- **Annual growth** (rows/year)
- **Peak load times** (when is the system busy)

#### Schema characteristics
- **Column count**
- **Primary key** (and uniqueness verification)
- **Nullable column count** (and which are commonly null)
- **Data types in use** (especially: text, decimal precision, dates, booleans)
- **Indexes that matter** (for query patterns)

#### Data quality findings
For each table, run profiling queries and report:

```sql
-- Null percentages
SELECT 
  COUNT(*) AS total_rows,
  SUM(CASE WHEN customer_id IS NULL THEN 1 ELSE 0 END) AS null_customer_id,
  SUM(CASE WHEN order_date IS NULL THEN 1 ELSE 0 END) AS null_order_date
FROM orders;

-- Distinct value counts (catches enum issues)
SELECT order_status, COUNT(*) FROM orders GROUP BY order_status ORDER BY 2 DESC;

-- Duplicate detection
SELECT customer_email, COUNT(*) FROM customers GROUP BY customer_email HAVING COUNT(*) > 1;

-- Referential integrity
SELECT COUNT(*) FROM orders o
LEFT JOIN customers c ON o.customer_id = c.id
WHERE c.id IS NULL;

-- Date range sanity
SELECT MIN(order_date), MAX(order_date), 
  SUM(CASE WHEN order_date > CURRENT_DATE THEN 1 ELSE 0 END) AS future_dated
FROM orders;
```

Document findings:
- **Null rates** higher than expected
- **Orphaned records** (FK violations)
- **Duplicate keys** in supposedly-unique columns
- **Invalid values** (dates in 1900, negative ages, future dates)
- **Inconsistent formats** (phone numbers in 5 patterns)
- **Encoding issues** (mojibake from past charset mistakes)
- **Out-of-range values** (PIC 9(7) overflow, decimal truncation)

For each issue: severity (critical/medium/low), where it surfaces (downstream impact), and recommendation (fix at source / fix during ingest / migrate as-is + flag).

### 5. Downstream dependencies

For each source, identify what consumes it today:

| Consumer | Type | Frequency | Owner | Impact if breaks |
|----------|------|-----------|-------|------------------|
| Sales Daily Report | Power BI report | daily | Sales Ops | High (CFO sees it) |
| Customer 360 dashboard | Tableau | hourly | Marketing | Medium |
| Commission calculation | SQL job | weekly | Finance | High (paychecks) |
| Mailing list export | Custom Python | monthly | Marketing | Low |
| Predictive model training | Databricks notebook | monthly | Data Science | Medium |

For each consumer:
- **What query/feed they use** (specific tables, refresh pattern)
- **Latency requirement** (how fresh data needs to be)
- **Schema dependencies** (what columns they reference; renaming breaks them)
- **Migration plan** (will they switch to new platform? when?)

This list is always longer than initially expected. Budget time to find consumers nobody mentioned.

### 6. Data lineage (current state)

For the most important pipelines, document current lineage:

```
SAP S/4 BUT000 ──┐
                  ├─→ SSIS Job: NightlyCustomerLoad ─→ Stage_Customer (SQL Server) ─┐
Salesforce API ──┘                                                                     ├─→ Customer_Master (DW)

SQL Server CRM_Legacy ─→ ADF copy activity ─→ Stage_Legacy_Customer (Synapse) ────────┘


                                                                                Power BI: Customer 360
```

Use Mermaid or plain text. Goal: visualize WHERE data goes through current pipelines.

This often reveals:
- Multiple paths producing different results
- Single points of failure
- Black-box transformations (the .ipynb nobody owns)
- Manual steps in supposedly automated pipelines

### 7. Pain points and opportunities

Capture user-reported issues categorized by root cause:

#### Data freshness issues
"The dashboard is always 2 days behind" → root cause analysis: where's the lag?

#### Conflicting numbers
"Sales report shows 45M revenue, finance says 47M" → root cause: which source is right?

#### Performance issues
"Daily report takes 4 hours to refresh" → root cause: ETL design? Storage? Compute?

#### Quality issues
"Half the customer emails are invalid" → root cause: source quality? Ingestion bug?

For each, document root cause and whether the modernization addresses it (and how).

### 8. Risks for the data engagement

Top 10 risks specific to this project. For each:
- Description
- Likelihood / impact
- Mitigation
- Trigger

Common risks for ETL projects:
- **Schema changes mid-project.** Source schemas evolve; pipelines break.
- **Volume surprises.** Test data is 10K rows; production is 100M.
- **Hidden consumers.** Reports nobody mentioned that depend on legacy.
- **Source system access changes.** API deprecation, account changes.
- **Data quality surfacing.** Issues that didn't matter before become blockers.
- **Stakeholder availability.** SMEs needed for definition validation.
- **Compliance changes.** Cross-border data, PII handling, retention policies.

### 9. Strategic recommendations

Based on inventory, recommend:

- **Quick wins:** sources/datasets that should be migrated first (high value, low risk)
- **Defer items:** sources that should wait (low ROI, high complexity, pending sunset)
- **Excludes:** sources that should NOT be migrated (replaced by something else, dying anyway)
- **Architecture implications:** what the inventory suggests about the target architecture

### 10. Open questions

What you couldn't determine. Each:
- The question
- Why it matters for project planning
- Who could answer
- Default assumption + risk

## Style

- Specific over generic (named systems, named consumers, real numbers)
- Honest about unknowns
- Tone: discovery is meant to find problems; finding lots is good
- Include actual SQL profiling queries used and their results
- 10-30 pages typical for non-trivial engagements

## Quality bar

- Every source system has connectivity + access method documented
- Major entities have cross-system definition variance documented
- Profiling shows actual numbers, not assumptions
- Downstream consumers found and contacted
- Pain points have root cause analysis
- Risks are specific to THIS project, not generic

Tips

  • Run profiling queries against actual data, not samples. Sample data hides systematic issues.
  • Talk to consumers, not just producers. The team that produces the data and the team that uses it often disagree about what it means.
  • Find the spreadsheet. Every data project has a critical Excel file someone updates manually. Find it before it surprises you.
  • Document the dead bodies. Reports that don't refresh anymore, jobs that fail nightly, ETLs that someone "fixed" with a hardcoded value — all reveal what to be careful about.
  • Profile early, profile often. Schema and quality drift; today's findings won't be tomorrow's reality.
  • Pair with stakeholder interviews. The audit document is the artifact; the conversations that produce it are where the real understanding happens.

Common mistakes to avoid

  • Auditing only systems with formal owners. Spreadsheets, MS Access DBs, custom scripts in shared drives — these all carry data and they all matter.
  • Trusting documentation over reality. Always verify with profiling queries.
  • Skipping the consumer audit. Consumers determine what migrations are non-negotiable.
  • Not quantifying. "Lots of data" is useless. "12 source systems, 340 tables, 800 GB total, ~50 daily-active dashboards" is actionable.
  • Hiding bad findings to look efficient. If discovery finds problems, that's the point. Suppressing them just delays the surprise.
  • Treating discovery as one-shot. Discovery continues throughout the project; this template produces the initial picture.

Related assets

Command Palette

Search for a command to run...