· By Jeff Baart

From spreadsheets to a real Lakehouse: building a data platform on Databricks

A personal-scale story about wiring up Unity Catalog, Lakeflow Declarative Pipelines, a serverless ingest job, and a Mosaic AI agent — for the price of a fancy coffee per month.


Where your data probably lives

If you're reading this, you almost certainly have some data you'd like to ask questions of. The shape varies, as you stare at the rows and columns. This post also happens to cover personal passions - Golf, Data, and AI.


It might be a folder of CSVs that started as one spreadsheet and got cloned every time you exported a report. It might be a SQL Server table at work, maintained by a team you don't quite know how to file tickets with. It might be the output of a SaaS product — a finance tool, a CRM, a hobby tracker — that has a perfectly good web UI and no obvious way to get the data out. It might even be (no judgment) a Microsoft Access database sitting on a network share under somebody's desk.

That's where I was, at personal scale. For over ten years my buddies and I have played a small weekly golf league. The league treasurer (Me) pays for a SaaS scoring platform that does what it's supposed to do: enter scores, see standings. The data is all there. It just lives inside that one product.

What I really wanted, in plain terms, was an agent sitting on top of all that data. Something I could ask in English — "how have I been playing lately compared to last year?" or "who's been hot in the league the last few weeks, and why?" Something that could research scores and history on demand, surface insights I'd otherwise miss, and even draft the weekly recap so the buddies didn't have to wait for me to find the time. The kind of work I'd happily hand off to a smart caddie if one existed.

The question I wanted to answer for myself wasn't "how do I extract the data" — that's the easy part — but what would a real, production-shaped data platform look like at personal scale, if I built it the way I'd build it at work? Unity Catalog. Declarative pipelines. Scheduled jobs. An AI agent on top, with the same governance discipline I'd insist on professionally. Could I do all of that on an Azure subscription, for hobby money?

Short answer: yes. This post walks the build.


What "deliberate" means here

Before any code, the thing I want to flag is the bar I held myself to:

Could I justify this design decision in a code review with a Databricks engineer?

Personal projects are tempting territory for shortcuts — a long Python script in a notebook, a CSV in a OneDrive folder, a scheduled task on a laptop. None of those are wrong. But they don't teach you anything you'd use at work. So at every fork, I held myself to the same building blocks I'd reach for in production:

  • Unity Catalog for storage, even on a one-user workspace — because permissions, lineage, and audit all live there. The day a second person asks "can I see this too?" you want to be answering "I'll grant you access," not "let me restructure everything first."
  • Lakeflow Declarative Pipelines with explicit bronze / silver / gold layers — because when you find a bug, you want to rebuild the downstream tables from the layer below without re-hitting the original source. Mixed-purpose tables make this impossible.
  • Asset Bundles as the source of truth, not the workspace UI — because "click change in the UI" leaves no record of what you did or why. A YAML file in a folder is the difference between "I think I changed the schedule last month" and an actual edit history that proves it.
  • Serverless Jobs with a schedule and email failure alerts — because the difference between a hobby project and a real one is whether you notice when it breaks. Failure alerts are the cheapest insurance there is.
  • Safe-to-rerun ingest with MERGE so a re-upload is harmless — because every recurring data pipeline eventually needs to be re-run on the same data: backfills, fixes, network blips. If a second run produces duplicates or corrupts state, you've baked in operational pain forever.
  • No secrets in code — secret scopes for jobs, Azure Managed Identity for the web layer — because the moment a secret hits source control or a chat window, it's compromised. Designing this in on day one is free; retrofitting it on day 100 is expensive.

That's the bar. Now the build.


Step 1: get the data out of wherever it lives


Whatever you're starting from, the first step is the same — write a small adapter that turns "the place the data lives" into a stable file output you control. Two patterns cover most cases:

  • Tabular sources (spreadsheets, SQL tables, CSV exports, Access). One Python script with pandas or pyodbc, scheduled to read and write a normalized CSV per entity into a known location.
  • API-backed sources (most modern SaaS). One Python script with requests that handles auth, pagination, and rate limits, and writes the same shape of CSVs.

In my case the source is a SaaS web app with no published API. The legitimate way to find out what's available is just to use the product the way you normally would, with your browser's developer tools open, and watch what calls the page makes as you click around. Most modern web apps talk JSON to themselves; the URLs and payloads are right there in the Network tab. That's not a hack — it's the same technique the product's own front-end developers use.

For my source the result was a small JSON endpoint for player rosters and recent scores, and a server-rendered HTML page (an "advanced scorecard" view) that has the full per-hole detail for any week. I wrote two small requests-based readers — one for each — and merged them into one normalized output:

def fetch_round_html(self, weeknum: int) -> str:
    """POST the form for a specific week, return the rendered HTML."""
    af = _AF_RE.search(self.s.get(BASE + DETAIL_URL).text).group(1)
    return self.s.post(
        BASE + DETAIL_URL,
        data={"__RequestVerificationToken": af, "weeknum": str(weeknum)},
        headers={"Referer": BASE + DETAIL_URL},
    ).text

Then a BeautifulSoup parser turns the HTML table into one row per (player, week, hole). All twenty-some weeks of the current season pulled in one ingest run. The same client switches "seasons" with a server-side session toggle, so last year's data is one extra loop.

Two general lessons worth keeping:

  1. The JSON layer isn't always the richest one. Many products have a perfectly clean API for what's on screen now, and a fuller HTML rendering for historical or detailed views. If the JSON feels thin, look at the HTML.
  2. Make the adapter a real piece of code, not a one-off. Same safe re-run behavior, same logging, same error handling you'd put in a production data pipeline. The fact that "production" here means "Wednesday morning, just me" doesn't change the shape of the code.

Whether your source is a SaaS portal, a SQL table, a folder of spreadsheets, or that Access database under the desk, the output of this step is identical: a small set of clean CSVs in a known location, written by a script you can re-run any time.

A quick aside on connectors. Databricks ships a wide range of pre-built, governed connectors — Salesforce, Workday, ServiceNow, SQL Server, Snowflake, SharePoint, and many more — that take care of auth, schema discovery, change-data-capture, and incremental load for you. If your source is on that list, skip the custom adapter entirely.



Step 2: land it in Unity Catalog


Azure Databricks on a Visual Studio subscription. Premium tier (you need it for Unity Catalog). The whole workspace, container registry, and web hosting cost me under 20 dollars a month on average.

The catalog layout is the medallion you'd expect:

golf_league/
├── bronze/
│   ├── landing/                  (Unity Catalog volume — where CSVs land)
│   ├── players_raw                streaming Delta, MERGE on player_number
│   └── match_results_raw          streaming Delta, MERGE on (player, week, side, hole)
├── silver/
│   ├── players                    one row per current player
│   ├── rounds                     one row per (player, match_date, course_side)
│   └── hole_scores                one row per played hole
└── gold/
    ├── player_season              season averages per player
    ├── weekly_summary             weekly recap with low/avg gross + net
    ├── hole_stats                 per-hole difficulty
    └── handicap_trend             handicap movement per player over time

The reasoning behind each layer is standard, but worth being explicit:

  • Bronze is the source's truth, normalized to one shape per entity but otherwise untouched. If you find a bug in your silver logic, you should be able to rebuild silver from bronze without re-hitting the source.
  • Silver is your business's truth. Joins are settled, types are coerced, expectations enforce sanity. This is what analysts and AI agents read from.
  • Gold is product-shaped. One table per common question — weekly recap, per-player season, hole difficulty leaderboard.

What turns this from "three folders" into a real layer cake is how rows move between them.


Step 3: the pipeline (Lakeflow Declarative Pipelines)

The pipeline is one Python file, declarative, ~150 lines. Two patterns matter.

Pattern A: streaming Auto Loader + apply_changes for bronze. The ingest job writes the same CSV filename every week. Auto Loader picks it up; apply_changes MERGEs on the primary key with SCD type 1 (latest wins). The result: re-uploading the same CSV is a harmless, and re-uploading a corrected one overwrites the bad rows in place. No "full refresh" ever needed.

@dlt.view(name="match_results_stream")
@dlt.expect_or_drop(
    "mr_keys_present",
    "player_number IS NOT NULL AND match_date IS NOT NULL "
    "AND course_side IS NOT NULL AND hole_number IS NOT NULL"
)
def match_results_stream():
    return (
        spark.readStream.format("cloudFiles")
            .option("cloudFiles.format", "csv")
            .option("cloudFiles.schemaLocation", f"{landing}/_schemas/match_results")
            .option("cloudFiles.allowOverwrites", "true")
            .option("pathGlobFilter", "match_results.csv")
            .option("header", "true")
            .schema(MATCH_RESULTS_SCHEMA)
            .load(landing)
            .withColumn("_ingest_ts", F.current_timestamp())
    )

dlt.create_streaming_table(name="golf_league.bronze.match_results_raw")

dlt.apply_changes(
    target="golf_league.bronze.match_results_raw",
    source="match_results_stream",
    keys=["player_number", "match_date", "course_side", "hole_number"],
    sequence_by=F.col("_ingest_ts"),
    stored_as_scd_type=1,
)

The allowOverwrites flag is the easy-to-miss bit. Without it, Auto Loader tracks files by their original metadata and silently skips rewrites of the same path. With it, every new write of the same filename gets re-processed and merged. That's the property that lets the weekly job write to a stable path forever.

Pattern B: batch reads + expect for silver. Silver tables are aggregations or projections of the bronze stream. They're written as batch (dlt.read, not dlt.read_stream) so I can use window functions and joins without fighting streaming semantics. Quality is enforced with @dlt.expect — soft warnings for things like "this gross score is realistic" — and @dlt.expect_or_drop for hard invariants like "the primary key columns are not null."

@dlt.table(name="golf_league.silver.rounds")
@dlt.expect("gross_score_realistic", "gross_score BETWEEN 18 AND 200")
def silver_rounds():
    mr = dlt.read("golf_league.bronze.match_results_raw")
    return (
        mr.groupBy("player_number", "match_date", "course_side")
          .agg(
              F.sum("gross_score").alias("gross_score"),
              F.sum("absolute_net_score").alias("net_score"),
              F.round(F.sum("match_play_points"), 2).alias("match_play_points"),
              F.max("player_handicap").alias("player_handicap"),
              # ...
          )
          .withColumn("season_year", F.year("match_date"))
    )

Gold tables read from silver the same way and produce the consumption shapes for downstream apps.

The whole pipeline runs on Databricks Serverless. No cluster to manage, scales to zero between runs, and bills by what it actually processes. For my data volume (a few thousand rows a week) a full bronze-to-gold refresh costs pennies.


Step 4: scheduling — Asset Bundles + a serverless Job

The pipeline above is one resource. The job that calls the source, writes the CSVs to the landing volume, and then triggers the pipeline is another. Both are defined in YAML inside a Databricks Asset Bundle:

resources:
  jobs:
    weekly_ingest:
      name: golf-weekly-ingest-${bundle.target}
      schedule:
        quartz_cron_expression: "0 0 12 ? * WED *"   # noon ET Wednesday
        timezone_id: America/New_York
        pause_status: UNPAUSED
      tasks:
        - task_key: harvest_to_landing
          spark_python_task:
            python_file: ../src/ingest.py
            parameters:
              - "--catalog"
              - "${var.catalog}"
              - "--landing-path"
              - "${var.landing_volume_path}"
          environment_key: default
        - task_key: run_pipeline
          depends_on: [{task_key: harvest_to_landing}]
          pipeline_task:
            pipeline_id: ${resources.pipelines.golf_pipeline.id}
            full_refresh: false
      environments:
        - environment_key: default
          spec:
            client: "1"
            dependencies:
              - requests>=2.32
              - beautifulsoup4>=4.12
              - lxml>=5.0

A few things this small block buys you:

  • Cron in source control. I changed the schedule from Sunday morning to Wednesday noon (after Tuesday-night league play) with a one-line YAML edit. Run databricks bundle deploy -t dev, and the workspace is updated; my edit history shows when and why I changed it.
  • One command, two resources. The same bundle deploy creates or updates both the pipeline and the job that runs it. Adding a backfill-only second job for historical seasons was a copy-paste of the same task block with one extra parameter.
  • Secrets are decoupled. Credentials for the source live in a UC secret scope, read by the ingest script at runtime. The bundle and the codebase never see them.
# Inside the ingest job, running on serverless:
user = dbutils.secrets.get("golf-league", "glt_username")
pwd  = dbutils.secrets.get("golf-league", "glt_password")

The Wednesday-noon job kicks off, hits the source, drops two CSVs into the landing volume, and triggers the pipeline. Total wall-clock runtime is about three minutes for a typical week. Total cost is in the cents.


Step 5: Genie — the analytical layer

This is my favorite part. With silver and gold tables in place, the next question is: what's the cheapest, fastest path from "user has a question" to "answer with numbers"? In a Databricks workspace, that's a Genie space.

A Genie space is a curated, scoped natural-language query surface over a chosen set of UC tables. You give it the tables, a small ontology of business definitions (synonyms, defaults, what "this season" means), and a few trusted SQL examples. Then anyone with access can ask English questions and get a SQL answer, a table, and usually a chart.

# golf-league-coach Genie space, configured via UI but tracked in source
tables:
  - golf_league.gold.player_season
  - golf_league.gold.hole_stats
  - golf_league.gold.handicap_trend
  - golf_league.gold.weekly_summary
  - golf_league.silver.players
  - golf_league.silver.rounds
  - golf_league.silver.hole_scores

defaults:
  - "Always interpret 'this season' as MAX(season_year)."
  - "Front 9 = course_side 1, Back 9 = course_side 2."
  - "A 'round' is one row in silver.rounds (one player, one date, one side)."

glossary:
  - net score   ⇒ silver.rounds.net_score   (gross − player_handicap)
  - hardest hole ⇒ ORDER BY gold.hole_stats.avg_over_par DESC
  - birdie      ⇒ hole_scores.gross_score = hole_par − 1

trusted_sql:
  - q: "Who has the lowest scoring average this season?"
    sql: |
      SELECT full_name, avg_gross_score, rounds_played
      FROM golf_league.gold.player_season
      WHERE season_year = (SELECT MAX(season_year) FROM golf_league.gold.player_season)
      ORDER BY avg_gross_score ASC LIMIT 5

This file is the design artifact — what gets pasted into the Genie space's Instructions, Glossary, and Trusted SQL panels in the UI. Tracked alongside everything else.

Once configured, Genie handles a wide range of questions just fine: leaderboards, weekly recaps, head-to-head comparisons, hardest holes, handicap trends. For straightforward "fetch the numbers" requests, it's a delightful experience and the engineering work behind it is minimal.

But Genie alone has limits, and those limits are part of why the next step matters.


Step 6: wrapping Genie in a Mosaic AI agent

Two limits show up the moment users start asking real questions:

  • Advisory questions need to be broken down. "How can Jeff improve this season?" isn't a single SQL query — it's at least three (his worst holes, his front-vs-back split, the league average to compare against), plus a final step that turns those into recommendations.
  • Response consistency matters. A natural-language query interface that sometimes refuses, sometimes answers, and sometimes asks a clarifying question back makes a poor production experience. You want a predictable shape every time.

Both are solved by wrapping Genie in a small Mosaic AI Agent Framework agent. The agent is a tiny LangGraph loop with a single tool: genie_query(question). Its system prompt teaches it to break advisory questions into one or more pure data-fetch sub-questions Genie reliably handles, then to combine the results into a final answer in a consistent voice.

SYSTEM_PROMPT = """You are the BD Golf League's caddie — a friendly, data-savvy
analyst. The data is the point; tone is light seasoning.

# Your one tool
`genie_query(question: str)` runs natural-language analytics over the league's
lakehouse (golf_league.silver.* and golf_league.gold.*) and returns SQL +
tabular results.

# How to answer
1. DECOMPOSE the user's question into 1–3 sub-questions Genie can answer with SQL.
2. CALL `genie_query` for each — chain multiple calls.
3. SYNTHESIZE the results into 2–4 short paragraphs of analysis with cited numbers.
4. ADVISORY QUESTIONS: identify worst holes, scoring splits, handicap trend,
   par%. Translate the data into actionable recommendations. Never refuse.
"""

@tool
def genie_query(question: str) -> str:
    """Query the lakehouse with a natural-language question."""
    return _genie_query_impl(question)   # POSTs to /api/2.0/genie/spaces/...

class CaddieAgent(ChatAgent):
    def _build_graph(self):
        llm = ChatDatabricks(endpoint="databricks-meta-llama-3-3-70b-instruct",
                             temperature=0.3, max_tokens=1500)
        return create_react_agent(llm, [genie_query], prompt=SYSTEM_PROMPT)

AGENT = CaddieAgent()
mlflow.models.set_model(AGENT)

Deployment is a three-step recipe that Databricks documents end-to-end:

  1. Log the agent to MLflow as pyfunc code-as-model, registered into Unity Catalog. Declare the Genie space and the LLM endpoint as resources so the deployment's runtime principal is automatically granted the right access.
  2. Deploy it with databricks.agents.deploy() — this provisions a Model Serving endpoint, sets up automatic authentication passthrough, and creates inference tables (a Delta table that captures every request/response).
  3. Query it like any other model: POST /serving-endpoints/golf-caddie-agent/invocations with chat-completion-style messages.

Stepping back: this is exactly what Agent Bricks brings to the table. It establishes a single governed pattern across the agent's full stack — the Mosaic AI Agent Framework for the code, Unity Catalog for model registration and lineage, Model Serving with automatic inference tables for deployment and observability, and Genie Spaces as a first-class tool. Assembling those pieces by hand is what teaches you what's actually underneath the brand name; the brand is the unified developer experience on top.

At this point the agent answers questions over the Lakehouse. It still has a long list of "what's next" — measurable quality, guardrails, fine tune the web frontend, a bridge into enterprise platforms — but the foundation is there. After a little HTML magic:



The gotchas that ate the most time

A few learnings worth writing down so the second time around they cost five minutes:

  • Auto Loader and same-named files: the default behavior is to skip a re-write of a path it's already seen. Set cloudFiles.allowOverwrites = true if your ingest is going to write to a stable path each run. Otherwise your pipeline silently stalls and you wonder why nothing's updating.
  • Streaming and window functions don't mix: ROW_NUMBER over a partition needs a batch read, not a stream. If you try it in a dlt.read_stream flow you'll get NON_TIME_WINDOW_NOT_SUPPORTED_IN_STREAMING. The fix is to keep bronze streaming (so MERGE works incrementally), then read bronze as batch in silver where you actually do the deduplication.
  • sys.exit(0) in a Spark Python task is treated as failure: the framework's worker interprets any SystemExit as an abnormal exit, even when the code is zero. Just return from main, or raise SystemExit(message) only on nonzero.
  • Schema changes break DLT live tables: if you alter the schema your silver function returns (drop a column, rename one), the next pipeline run will fail to write the table. Drop the affected silver/gold tables first, then redeploy the bundle. Bronze is fine because MERGE doesn't care about new columns.
  • MSA tenants and Personal Access Tokens: if you generate a PAT before your workspace has Model Serving enabled, the token won't have the serving.serving-endpoints scope and any agent call will 403. Regenerate the PAT after enabling serving. Better, skip PATs entirely — register your application identity (a Managed Identity in Azure works perfectly) as a Databricks service principal, and you've got zero secrets to rotate.

None of these are showstoppers. All of them cost me an hour of head-scratching the first time.


What this costs

For one user, one league, weekly cadence, with a Mosaic AI agent on top:

Item Monthly Notes
Databricks Premium workspace (Azure) ~$0 idle Pay-per-use; no fixed seat cost
Serverless pipeline + job runs ~$2–3 A few minutes of serverless compute per week
Model Serving endpoint (agent) ~$1–2 Scale-to-zero, only pays while answering questions
Storage (ADLS Gen2 under UC) <$1 Tens of MB

The whole thing runs comfortably under five dollars a month on personal credit. Annualize and it's the cost of one decent steak dinner. That's the real democratizing change of the serverless Lakehouse plus Mosaic AI: production-grade data engineering and agent development as a budget hobby.


What's next

This post built the foundation: data in Unity Catalog, a declarative pipeline on a schedule, a Genie space, and a Mosaic AI agent that answers questions over the Lakehouse. The agent works. It is not yet fully in production.

Part two picks up exactly where this leaves off — turning a working agent into one you can confidently share with end users:

  • Measuring agent quality with the CLEARS framework and MLflow's built-in judges, so a prompt change can be tested instead of vibe-checked.
  • AI Gateway for inference tables, usage tracking, and (where available) guardrails.
  • A FastAPI web app as the front end, with charts, voice playback, and shared-password auth — deployed on Azure Container Apps.
  • Auth without secrets — Azure Managed Identity registered as a Databricks service principal.
  • Bridging into Microsoft 365 via Entra Agent ID, so end users can invoke the agent right from Teams or Copilot, with the user's identity propagated through to Unity Catalog audit.

Until then, the data is there. Quiet, dependable, queryable in SQL, refreshed every Wednesday at noon Eastern, the morning after league night. And the agent is there too — answering questions over it, consistent enough to demo, ready to be hardened. Exactly the foundation I wanted.

References & further reading

If you want to go deeper on any of the building blocks covered above, here are the official Databricks docs and product pages for each:

Lakehouse foundation

Pipelines & ingestion

Analytics & agent layer