Data Source Inventory & Profiling
The first step of every data engineering engagement. You can't build a modern data platform on top of sources you don't understand. This template produces the discovery document that drives every later decision — what to ingest, what to skip, what to fix at source.
When to use
- Starting a cloud data warehouse modernization
- Inheriting a data platform from another team or vendor
- Before scoping/estimating an ETL engagement
- When users complain about "the data" but nobody knows which data
- For data quality fix-it projects (where to start?)
Why this matters
Most failed data projects fail at this step. Teams skip discovery, build pipelines for the obvious sources, then discover three months in that:
- The "customer" entity exists in 7 systems with 7 different definitions
- The "real" data is in a spreadsheet someone in Finance updates monthly
- A critical report depends on a SQL Server view nobody documented
- Half the source data has quality issues that have been silently masked by legacy ETL
This template forces honest discovery before commitments are made.
Prompt
You are a senior data architect doing initial discovery on a data
engineering engagement. Generate a comprehensive data source inventory
and profiling document.
## Input
**Project context:** {{project_context}}
**Source systems:** {{source_systems}}
**Target platform:** {{target_platform}}
**Known pain points:** {{known_pain_points}}
## Output
A Markdown document with these sections:
### 1. Engagement summary
1-2 paragraphs: what's the project, why now, what success looks like.
### 2. Source system inventory
For each source system, document:
#### System overview
- **Name and version** (Salesforce Lightning, SAP S/4HANA 2022, SQL Server 2019)
- **Vendor / hosting** (SaaS / on-prem / hybrid)
- **Owner team** (who runs it)
- **Primary use** (what business function it supports)
- **Active? Sunset planned?** (no point ingesting from a system being retired in 6 months)
- **License / contract considerations** (data extraction allowed? per-row costs?)
#### Connectivity
- **Network access** (direct connection, VPN, private endpoint, public API)
- **Authentication** (service principal, key vault secret, OAuth, basic auth)
- **Rate limits** (API call limits, query throttling)
- **Latency to Azure** (matters for incremental loads)
#### Data access methods
- **Native API** (REST, GraphQL, SOAP)
- **Database direct** (JDBC, ODBC, native driver)
- **File extracts** (CSV/JSON/XML drop, SFTP, blob storage)
- **CDC available?** (transaction log access, change feeds)
- **Recommended pattern** for ingestion (which method to use)
#### Data scope in this system
- **Tables / objects in scope** (with row counts where known)
- **Tables / objects explicitly out of scope** (and why)
- **PII / sensitive data flags** (per table)
- **Refresh frequency available** (real-time, hourly, daily, monthly)
### 3. Entity-level profiling
For each major entity (customer, order, product, etc.) across all source systems:
```markdown
## Entity: Customer
**Found in (with row counts):**
- Salesforce: Account (45,000 rows)
- SAP S/4: BUT000 (180,000 rows)
- SQL Server CRM_Legacy: dbo.Customer (52,000 rows)
- Excel from Finance: Top_Accounts.xlsx (~500 rows, manually maintained)
**Definition variance:**
- Salesforce defines customer as "any signed contract"
- SAP defines customer as "any business partner with sales transactions"
- SQL Server includes prospects (not yet customers)
- Excel includes only accounts >$1M ARR
**Master / source of truth:**
[Document conclusion: which is authoritative? Or is the answer "depends"?]
**Identifier strategy:**
- Salesforce uses 18-char Account ID
- SAP uses 10-digit BUT000 ID
- SQL Server uses incremental int
- No common key — needs MDM / matching strategy
**Migration concerns:**
- Customer count varies by system; need reconciliation logic
- PII handling: Salesforce has consent flags, others don't track
- Historical data: SAP keeps full history, Salesforce only current state
```
For each shared entity, this is where the project gets interesting. Document
ALL duplications, not just the convenient ones.
### 4. Data profiling (per major source table)
For each significant source table, profile:
#### Volume and velocity
- **Row count** (current)
- **Daily change rate** (inserts, updates, deletes per day)
- **Annual growth** (rows/year)
- **Peak load times** (when is the system busy)
#### Schema characteristics
- **Column count**
- **Primary key** (and uniqueness verification)
- **Nullable column count** (and which are commonly null)
- **Data types in use** (especially: text, decimal precision, dates, booleans)
- **Indexes that matter** (for query patterns)
#### Data quality findings
For each table, run profiling queries and report:
```sql
-- Null percentages
SELECT
COUNT(*) AS total_rows,
SUM(CASE WHEN customer_id IS NULL THEN 1 ELSE 0 END) AS null_customer_id,
SUM(CASE WHEN order_date IS NULL THEN 1 ELSE 0 END) AS null_order_date
FROM orders;
-- Distinct value counts (catches enum issues)
SELECT order_status, COUNT(*) FROM orders GROUP BY order_status ORDER BY 2 DESC;
-- Duplicate detection
SELECT customer_email, COUNT(*) FROM customers GROUP BY customer_email HAVING COUNT(*) > 1;
-- Referential integrity
SELECT COUNT(*) FROM orders o
LEFT JOIN customers c ON o.customer_id = c.id
WHERE c.id IS NULL;
-- Date range sanity
SELECT MIN(order_date), MAX(order_date),
SUM(CASE WHEN order_date > CURRENT_DATE THEN 1 ELSE 0 END) AS future_dated
FROM orders;
```
Document findings:
- **Null rates** higher than expected
- **Orphaned records** (FK violations)
- **Duplicate keys** in supposedly-unique columns
- **Invalid values** (dates in 1900, negative ages, future dates)
- **Inconsistent formats** (phone numbers in 5 patterns)
- **Encoding issues** (mojibake from past charset mistakes)
- **Out-of-range values** (PIC 9(7) overflow, decimal truncation)
For each issue: severity (critical/medium/low), where it surfaces (downstream impact), and recommendation (fix at source / fix during ingest / migrate as-is + flag).
### 5. Downstream dependencies
For each source, identify what consumes it today:
| Consumer | Type | Frequency | Owner | Impact if breaks |
|----------|------|-----------|-------|------------------|
| Sales Daily Report | Power BI report | daily | Sales Ops | High (CFO sees it) |
| Customer 360 dashboard | Tableau | hourly | Marketing | Medium |
| Commission calculation | SQL job | weekly | Finance | High (paychecks) |
| Mailing list export | Custom Python | monthly | Marketing | Low |
| Predictive model training | Databricks notebook | monthly | Data Science | Medium |
For each consumer:
- **What query/feed they use** (specific tables, refresh pattern)
- **Latency requirement** (how fresh data needs to be)
- **Schema dependencies** (what columns they reference; renaming breaks them)
- **Migration plan** (will they switch to new platform? when?)
This list is always longer than initially expected. Budget time to find consumers nobody mentioned.
### 6. Data lineage (current state)
For the most important pipelines, document current lineage:
```
SAP S/4 BUT000 ──┐
├─→ SSIS Job: NightlyCustomerLoad ─→ Stage_Customer (SQL Server) ─┐
Salesforce API ──┘ ├─→ Customer_Master (DW)
│
SQL Server CRM_Legacy ─→ ADF copy activity ─→ Stage_Legacy_Customer (Synapse) ────────┘
│
↓
Power BI: Customer 360
```
Use Mermaid or plain text. Goal: visualize WHERE data goes through current pipelines.
This often reveals:
- Multiple paths producing different results
- Single points of failure
- Black-box transformations (the .ipynb nobody owns)
- Manual steps in supposedly automated pipelines
### 7. Pain points and opportunities
Capture user-reported issues categorized by root cause:
#### Data freshness issues
"The dashboard is always 2 days behind" → root cause analysis: where's the lag?
#### Conflicting numbers
"Sales report shows 45M revenue, finance says 47M" → root cause: which source is right?
#### Performance issues
"Daily report takes 4 hours to refresh" → root cause: ETL design? Storage? Compute?
#### Quality issues
"Half the customer emails are invalid" → root cause: source quality? Ingestion bug?
For each, document root cause and whether the modernization addresses it (and how).
### 8. Risks for the data engagement
Top 10 risks specific to this project. For each:
- Description
- Likelihood / impact
- Mitigation
- Trigger
Common risks for ETL projects:
- **Schema changes mid-project.** Source schemas evolve; pipelines break.
- **Volume surprises.** Test data is 10K rows; production is 100M.
- **Hidden consumers.** Reports nobody mentioned that depend on legacy.
- **Source system access changes.** API deprecation, account changes.
- **Data quality surfacing.** Issues that didn't matter before become blockers.
- **Stakeholder availability.** SMEs needed for definition validation.
- **Compliance changes.** Cross-border data, PII handling, retention policies.
### 9. Strategic recommendations
Based on inventory, recommend:
- **Quick wins:** sources/datasets that should be migrated first (high value, low risk)
- **Defer items:** sources that should wait (low ROI, high complexity, pending sunset)
- **Excludes:** sources that should NOT be migrated (replaced by something else, dying anyway)
- **Architecture implications:** what the inventory suggests about the target architecture
### 10. Open questions
What you couldn't determine. Each:
- The question
- Why it matters for project planning
- Who could answer
- Default assumption + risk
## Style
- Specific over generic (named systems, named consumers, real numbers)
- Honest about unknowns
- Tone: discovery is meant to find problems; finding lots is good
- Include actual SQL profiling queries used and their results
- 10-30 pages typical for non-trivial engagements
## Quality bar
- Every source system has connectivity + access method documented
- Major entities have cross-system definition variance documented
- Profiling shows actual numbers, not assumptions
- Downstream consumers found and contacted
- Pain points have root cause analysis
- Risks are specific to THIS project, not genericTips
- Run profiling queries against actual data, not samples. Sample data hides systematic issues.
- Talk to consumers, not just producers. The team that produces the data and the team that uses it often disagree about what it means.
- Find the spreadsheet. Every data project has a critical Excel file someone updates manually. Find it before it surprises you.
- Document the dead bodies. Reports that don't refresh anymore, jobs that fail nightly, ETLs that someone "fixed" with a hardcoded value — all reveal what to be careful about.
- Profile early, profile often. Schema and quality drift; today's findings won't be tomorrow's reality.
- Pair with stakeholder interviews. The audit document is the artifact; the conversations that produce it are where the real understanding happens.
Common mistakes to avoid
- Auditing only systems with formal owners. Spreadsheets, MS Access DBs, custom scripts in shared drives — these all carry data and they all matter.
- Trusting documentation over reality. Always verify with profiling queries.
- Skipping the consumer audit. Consumers determine what migrations are non-negotiable.
- Not quantifying. "Lots of data" is useless. "12 source systems, 340 tables, 800 GB total, ~50 daily-active dashboards" is actionable.
- Hiding bad findings to look efficient. If discovery finds problems, that's the point. Suppressing them just delays the surprise.
- Treating discovery as one-shot. Discovery continues throughout the project; this template produces the initial picture.