Step 04 of 9 3-4 weeks· advanced
Step 4: PySpark Transformation Framework
Establish PySpark standards before writing many notebooks. Shared utility library, Bronze / Silver / Gold templates, performance patterns, testing approach.
Recommended prompts
Use one of these to do the work in your IDE
Open the template to read it in full. Click Copy prompt to grab it (with your stack values pre-filled where they apply) — then paste into Claude Code, Cursor, or wherever you build.
Recommended skills
Drop these into Claude Code for this phase
Skills auto-trigger on the right kind of request. Install once; they apply to every prompt that fits.
Recommended MCP configs
Wire these tools into Claude Code first
MCP servers give Claude Code direct access to external systems (Jira, browsers, databases). Configure once.
When you're done
Verify these in your own work before moving on
This is a checklist for you to mentally tick off in your repo and IDE — the site doesn't track it, you do.
- Shared utility library deployed and tested
- Standard notebook templates created (Bronze / Silver / Gold)
- Bronze, Silver, Gold patterns proven on a real entity
- Performance patterns documented with examples
- Testing approach proven
- Code-review checklist for PySpark notebooks created
Common pitfalls
What goes wrong at this step
- Treating notebooks as scripts — no structure, no docs, no error handling
- Copy-paste boilerplate everywhere — build a shared library
- Spark for small data (<1GB) — use SQL / pandas / DuckDB instead
- No idempotency — re-runs create duplicates
- inferSchema in production — slow, unpredictable
- .collect() on large data — OOM disasters