World's Simplest Data Catalogue?

June 22, 2026

This post is the fourth in a series of blog posts about data, several years in the making. It stands alone, so feel free to dive straight in, but if you do want to learn more about the world’s simplest data pipeline, data model or even the world’s simplest BI tool, you know where to click!

I have adopted the strategy I walk through here to create a data catalogue for my own personal data lake. You can find it here, on my GitHub. It’s open, free and ready to use as a template, if you feel so inclined.


What’s a data catalogue anyway?

Data is the lifeblood of modern business! Collecting, analysing and presenting data has become a ubiquitous pastime in every office across the world, and an ability to use data to justify decisions is now the most important skill anyone can learn. From receptionist to CEO, if you can make your point with data, you’ll do better for it.

This means that almost every business now obsesses about gathering data. Clicks, hits, customers, prospects, sales, revenue, overheads, temperature readings from sensors under every desk… hundreds of different datasets, billions upon billions of rows… from Excel spreadsheets to multi-million dollar databases in the cloud.

Gathering data is easy, cleaning it up is hard, analysing and learning from it are very difficult indeed, but the final boss of data maturity, the bit almost nobody manages to get done, is the Data Catalogue.

The concept is simple: we have a lot of data and we have a lot of people who want to use that data - all we need is a directory that tells everyone where they can find what they need. Like a phone book, or a dating app for data. While we’re at it, we might as well also include some other stuff, like who owns the data, how long we can keep it, where it came from, who’s allowed to access it and what tools they need to do so.

It sounds simple enough but in my experience few ever make it work. Some get by without it, lacking the time, scale or regulatory pressure to make it happen and relying on tribal knowledge to fill the gap. Others invest huge sums of money and build whole teams of people to fuss over expensive platforms and arcane processes that keep them looking busy but never really deliver what was needed in the first place.

I want to propose a different approach. One which is simple enough for any small business, scalable enough for large enterprises and straightforward enough to roll out in a matter of hours. All we have to do is get back to what we really need and why we really need it, while embracing one old and one very new technology…

The core thesis

Most data governance fails by being adopted by nobody, not by being insufficiently rigorous. The realistic competitor to this approach isn’t a “better” catalogue - it’s the status quo of zero documentation, tribal knowledge, and “ask Dave, he wrote it.” Against that baseline, a markdown file that’s 90% complete and slightly stale is an enormous improvement.

So the whole thing optimises for one thing above everything else: the lowest possible barrier to writing and updating an entry. Every other decision falls out of that.

The one-line version: capture 90% of perfection and 100% of necessity, in plain text, in the repo, where the people who change the data will actually see it.


What we’re actually trying to achieve

The goal isn’t a complete catalogue. It’s a catalogue that’s still being updated in two years, because that’s the only kind that’s ever useful. Most data documentation doesn’t fail because it was too simple - it fails because it was too much effort to keep current, drifts within months, and everyone quietly stops trusting it.

Design target: coverage and trust over completeness and polish. Every dataset having an honest, slightly rough entry beats three flagship datasets with beautiful entries and nothing else. People use the first. The second teaches people that the catalogue doesn’t cover what they need, so they stop checking it.


The thing that makes this possible now

Markdown-in-git as a docs format isn’t new. What’s changed is the justification for not doing it that way.

Structured metadata formats (JSON, XML, formal schemas) earned their place because machines could only reliably extract information from rigid structure - a parser needs to know exactly where the field name is. That constraint shaped a generation of catalogue tooling, which is why most catalogues end up looking like databases describing databases.

That constraint has weakened. An LLM reading a well-written markdown file extracts the same information a parser gets from a JSON schema, and the same file is also the best possible format for a human to read. We’re not picking markdown as a lesser, human-friendly alternative to “real” structured metadata. We’re picking it because the reason structured metadata existed - machines needing rigid structure - has mostly gone away, and human-readability was always the harder constraint to satisfy anyway.

This reframes “machine readable.” It used to mean structured. Now a machine can read prose as easily as a human can, so the best format for a machine and the best format for a human are the same format. That convergence is the whole unlock.


Why low friction beats high rigour (the trust point)

Every catalogue I’ve seen in practice - including in large, well-resourced organisations - has the same failure mode: process and tooling accumulate around the metadata layer in the name of quality, and that process becomes the reason people stop updating it. A mandatory review step, a required field that doesn’t apply to most datasets, a slow UI, a schema validator that blocks your commit over a formatting nit - each individually reasonable, collectively fatal. The dataset changes faster than anyone’s willing to fight the process to document it.

This approach deliberately makes the opposite trade. It enforces almost nothing. Anyone can write a file, anyone can edit one and the only quality control is the judgement of the person doing it. Lower bar - but a bar that gets cleared far more often, which is the only metric that matters: does the catalogue reflect what’s true right now?

The deeper version, and the actual philosophical core: it’s about where you spend effort. The instinct in most systems is to code against everything a developer could get wrong - validation, required fields, gates, checks. That’s an arms race you lose, because there are infinitely many ways to do it wrong and you’ll never enumerate them. The trust-based alternative spends that same effort making it trivially easy to do it right: a good template, a clear example, sensible conventions. Then trust people to use them.

This sounds reckless to a governance audience - until you point out that the heavyweight alternative is the thing that’s actually failing. Worth landing that contrast hard in the post.


Convention over configuration (name the pattern)

This is the pattern the whole thing rests on. Worth naming explicitly because a technical audience recognises it instantly and it does a lot of persuasive work.

CoC (Rails made it famous): provide sensible defaults and a shared convention, so the common case needs no configuration, and you only specify what genuinely differs. That’s exactly this. The convention is “a markdown file per dataset, with these headings.” No config, no schema, no settings. Follow the convention, it just works.

The payoff that’s easy to undersell: because it’s only a convention over plain text, anyone can extend it without permission. Need to track foreign keys properly? Add a column to your template. Need a data-retention field, a GDPR lawful-basis field, a cost-centre? Add a heading. Nothing breaks, because there’s no central schema to violate - the “parser” is a human or an LLM, both of which cope fine with an extra section. Compare a structured catalogue product, where a custom field is a config change, a migration, or a feature request to a vendor. This is the single biggest practical advantage; lean on it.
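To make "add a heading" concrete, here is a minimal Python sketch of scaffolding a new entry. The base headings below are placeholders I have made up for illustration (the real template lives in the repo), and render_entry and write_entry are hypothetical helper names, not part of any tool.

```python
from pathlib import Path

# Hypothetical base headings: placeholders for illustration only,
# not the actual template from the repo.
BASE_HEADINGS = [
    "Description",
    "Location",
    "Fields",
    "Known issues & caveats",
    "Interested parties",
]


def render_entry(title: str, extra_headings: tuple[str, ...] = ()) -> str:
    """Return a markdown skeleton: a title, then one '##' section per heading."""
    lines = [f"# {title}", ""]
    for heading in [*BASE_HEADINGS, *extra_headings]:
        lines += [f"## {heading}", "", "TODO", ""]
    return "\n".join(lines)


def write_entry(folder: Path, name: str, text: str) -> Path:
    """Write the entry as <folder>/<name>.md, creating the folder if needed."""
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / f"{name}.md"
    path.write_text(text, encoding="utf-8")
    return path


if __name__ == "__main__":
    # A team that handles personal data just adds the headings it needs.
    print(render_entry("example_dataset", ("Data retention", "GDPR lawful basis")))
```

Extending the convention is an extra string in a tuple: no migration, no config change and nothing for existing entries to break against.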


Design choices, and why (one-liners for the post)

  • One file per dataset, no central schema for the catalogue itself. Each entry self-contained; nothing to migrate when you add a new section.
  • No relationship graph, no formal join modelling. A human or an AI agent can spot a plausible join from two well-written files without it being declared anywhere - so the real move isn’t “the AI figures it out,” it’s that we’ve delegated the job to whoever’s using the data. We’re not building one schema to pre-solve every possible question across every dataset; we’re lowering the friction of getting many datasets in front of many people fast, and trusting them to work out the joins they actually need. Model the dataset, not the data model.
  • “Interested parties” as a courtesy, not a contract. Publisher gets visibility; neither side takes on an obligation it can’t realistically keep.
  • “Known issues & caveats” as a first-class section, not an afterthought. Willingness to admit a dataset is messy is what makes the rest of the entry trustworthy. A catalogue that only ever looks polished is one nobody quite believes. (The live repo proves this one out - see below.)
  • Git is the only version history. Not really a design choice so much as declining to reinvent one. Git is version history - the de facto standard for who changed what, when and why, and every dev already has it. A changelog field or version number in the document would be barking yourself when you keep a dog: a worse, hand-maintained copy of what git already does perfectly for free.
  • Subtree over submodule for importing into other repos. Submodule is textbook-correct but adds an init step people forget and a failure mode that breaks builds. Subtree vendors the files in so they behave like any other file, at the cost of a little command ceremony. Pick the option that fails less often over the one that’s more architecturally pure.
  • No tooling required. A folder of text files works on day one - nothing to install, host, or pay for - and degrades gracefully: still just as readable if you abandon every other part of the convention next year.
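On the "git is the only version history" point: if you ever want a changelog for an entry, it can be derived on demand rather than hand-maintained. A rough Python sketch, assuming git is installed and the catalogue lives in a git repo; history_command and entry_history are hypothetical names of my own.

```python
import subprocess
from pathlib import Path


def history_command(entry: Path, limit: int = 5) -> list[str]:
    """Build the git command that lists recent changes to one entry file."""
    return [
        "git", "log",
        f"--max-count={limit}",
        "--follow",                  # keep following the file across renames
        "--date=short",
        "--format=%h %ad %an %s",    # short hash, date, author, subject
        "--", str(entry),
    ]


def entry_history(entry: Path, repo: Path, limit: int = 5) -> list[str]:
    """Run the command inside the repo and return one line per commit."""
    result = subprocess.run(
        history_command(entry, limit),
        cwd=repo, capture_output=True, text=True, check=True,
    )
    return result.stdout.splitlines()
```

Who changed what, when and why is all there already; the document itself never needs a version number.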

A note on “maturity” (myth-busting)

Tempting to assume large, established orgs have this solved, and that something this simple is a stepping stone for small teams who haven’t earned a “real” catalogue. Not the experience in practice. Data governance is the final boss of data maturity - nobody is doing it well. The reality across orgs of every size tends to be one of three things, none good:

  • Nothing at all. No catalogue. What data means lives in people’s heads - tribal knowledge that walks out the door when they leave.
  • An expensive product nobody maintains. Real money spent on a governance tool that then needs one person to police and curate it. Coverage thin, entries stale, the rest of the business never opens it.
  • A wiki page. Best case. Still needs a small number of people to check and update it, lives far from the code, out of sight of the developers who actually change the data. So it drifts.

Common thread in all three: distance from the code, and dependence on a few designated curators. This catalogue’s answer is to put the docs in the repo next to the code and let anyone update them, so neither problem can take hold.

So the comparison isn’t “good enough for a small team, not good enough for a serious one.” It’s “cheap and honest, against expensive and unreliable” - and on that comparison cheap and honest wins more often than the size of an org’s data team would suggest. An expensive, polished, low-coverage catalogue can be worse than nothing, because it looks finished, so nobody questions it.

Final point, callback to CoC: there is no one size fits all, and this doesn’t pretend there is. Every team, business and org has different needs - some must handle PII and have a hard legal requirement to classify it; others never touch personal data and a classification field is just noise. A prescriptive system has to either force the PII fields on everyone or negotiate them in as config. A convention over plain text sidesteps it: take the foundation, add the headings you need, drop the ones you don’t. Simple text documents with shared conventions for headings and contents are infinitely extensible - which is exactly why they can fit everyone without being designed for anyone in particular.


Discoverability (don’t lose this thread - it’s the point of a catalogue)

Discoverability is the single most important feature of any catalogue. Everything else is in service of someone being able to find “do we have UK inflation data?” or “how do I look up a postcode?” and getting an answer.

Two consumers, both served by the same low-structure approach:

  • Human browsing. Wants good folder structure, consistent naming, and a root index. Discovery mechanism is “scan a list, recognise a name, open the file.” ls your way to an answer, no tooling.
  • LLM/agent querying. Wants each file self-sufficient and richly described, and the whole corpus cheap to load or grep. Discovery mechanism is “read everything (or embed/grep it), then reason.” No search index needed because the corpus is small text and the reader is a language model.

Good discoverability for the agent imposes almost no extra structure beyond what the human already needs. No tags, no controlled vocabulary, no search backend. Just: consistent layout, a root index as the table of contents, self-contained files, and plain literal language over jargon.

The one rule worth stating outright: write the description as if answering “what would someone search for to find this.” Fold the common names and synonyms into the prose - “UK Consumer Price Index (CPI) inflation… also called the inflation rate or RPI in older sources” - rather than adding a separate aliases field. A well-written description is the search-terms field. (The live ons_inflation entry does this well: CPI, CPIH, RPI all in the first sentence.)
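A small Python sketch of why that rule pays off: when synonyms live in the prose, a plain text search across the folder is all the "search backend" you need. This is an illustration of the idea rather than anything in the repo; find_entries is a hypothetical helper and the layout (markdown files somewhere under a root folder) is an assumption.

```python
from pathlib import Path


def find_entries(root: Path, *terms: str) -> list[Path]:
    """Return entry files whose text mentions every search term (case-insensitive)."""
    wanted = [term.lower() for term in terms]
    hits = []
    for path in sorted(root.rglob("*.md")):
        text = path.read_text(encoding="utf-8").lower()
        if all(term in text for term in wanted):
            hits.append(path)
    return hits
```

Searching for "rpi" finds the inflation entry only because the description bothered to mention it. An agent gets the same benefit when it reads or greps the files.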


What this deliberately does NOT do (pre-empt the objections)

A credible reader from a governance background will instinctively look for these and read their absence as an oversight. Name them as choices:

  • No automated lineage or relationship graph.
  • No enforced schema validation on the markdown (a linter that checks headings exist is fine; anything that blocks a commit is not).
  • No access control (this describes data, it doesn’t gate access - wrong tool for that).
  • No mandatory review process (trust the author).
  • No central tooling, server, search index or UI (it’s a git repo - clone it, grep it, point an agent at it, render it with any static-site generator if you want it prettier; none required).
  • No versioning inside the document (git is the version history).
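For the "linter is fine, blocking is not" line, a minimal Python sketch of what a friendly check might look like. The expected headings are hypothetical placeholders, and the important design choice is that lint always returns 0, so it can warn in a hook or CI job without ever stopping a commit.

```python
import re

# Hypothetical expected headings; swap in whatever your own convention uses.
EXPECTED = ["Description", "Known issues & caveats", "Interested parties"]

# Matches markdown heading lines such as "## Description".
HEADING = re.compile(r"^#{1,6}[ \t]+(.+?)[ \t]*$", flags=re.MULTILINE)


def missing_headings(markdown: str, expected=EXPECTED) -> list[str]:
    """Return the expected headings that never appear as a heading line."""
    found = {m.group(1).lower() for m in HEADING.finditer(markdown)}
    return [h for h in expected if h.lower() not in found]


def lint(markdown: str, name: str = "<entry>") -> int:
    """Print a warning per missing heading. Always returns 0: advice, not a gate."""
    for heading in missing_headings(markdown):
        print(f"warning: {name}: no '{heading}' section")
    return 0
```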

NEW - things the live repo taught us (not in the original notes)

I built the thing and pointed it at three real, messy source domains (a self-built weather pipeline, a sprawling Elexon/energy estate, bulk government reference data). What that surfaced, worth weaving into the post as “this isn’t theory, here’s it surviving contact with reality”:

  • The convention survived contact with reality - that’s the real test. Three very different domains, the same twelve headings absorbed all of them with no per-domain schema and no special cases. elexon_boalf (a composite-key balancing-instruction file with 18 fields and a dozen flags) and ons_inflation (a single monthly time series) sit comfortably in the same format. That’s the CoC claim proving out in practice rather than on paper. Strongest single piece of evidence for the whole post.

  • Caveats sections are where the real value lives, and you can’t fake them. Concrete examples from the repo that exist in no schema anywhere and would otherwise be pure tribal knowledge:

    • The Chivenor station’s run of implausible readings below −5°C in July 2022, hard-filtered in the modelled table but still present in raw.
    • “The table is deleted and rebuilt nightly, so any Athena query running around 06:00 UTC may see empty or partial results.” This single line saves someone a genuinely baffling afternoon.
    • Elexon’s so_flag: instructions for system security rather than energy imbalance, excluded from the cash-out price calc - separate them or your balancing-cost analysis is wrong.
    • Land Registry CSVs have no header row and fixed column order; partition columns stored as unpadded strings.

    These are the bits a polished enterprise catalogue culturally can’t afford to write down, because it’s optimised to look authoritative. Mine can afford total honesty because nobody’s procurement decision depended on it looking finished. That asymmetry is itself an argument.

  • Lineage emerged for free, without building lineage. Because the folders mirror the pipeline and each entry says where it came from in plain English (“derived from incoming.weather by removing duplicates and filtering Chivenor”), a reader gets the benefit of lineage - where did this come from, what was done to it - without any graph machinery. Prose lineage in terms of what actually happened beats a formal DAG you have to maintain. (Note: the raw-vs-modelled mechanics themselves are old ground from the original pipeline post - reference it, don’t re-explain it. The catalogue-relevant point is only that good prose descriptions deliver lineage as a side effect.)

  • The index grouped itself by source-and-stage, not flat alphabetical - and that’s better. Someone landing cold sees the shape of the estate (three sources, raw vs modelled) before reading a single entry. The front door doing real work. Wasn’t specified anywhere; it’s what you naturally reach for when the index is just a markdown file you hand-write.

  • An AI agent wrote most of these entries as a side effect of the work, and that’s the intended workflow, not a gimmick. The mechanical sections (location, field spec, ETL pointer) are derivable from the code, so an agent fills them fast; the judgement sections (caveats, license, classification) still want a human pass. This is the realistic future of the format: the human curates and sanity-checks, the machine does the typing. Worth a sentence - it closes the loop on “the same format is best for humans and machines” by showing machines can write it too, not just read it.
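On the index grouping itself by source and stage: my index is hand-written, but if that ever became a chore the same shape could be generated from the folders. A rough Python sketch, assuming entries sit in nested folders under a catalogue root (the exact layout is my assumption here) and using a hypothetical build_index helper.

```python
from collections import defaultdict
from pathlib import Path


def build_index(root: Path) -> str:
    """Group entries by their folder (relative to root) and list them as links."""
    groups: dict[str, list[Path]] = defaultdict(list)
    for path in sorted(root.rglob("*.md")):
        rel = path.relative_to(root)
        if len(rel.parts) == 1:
            continue  # skip top-level files such as the README or the index itself
        groups[rel.parent.as_posix()].append(rel)

    lines = ["# Catalogue index", ""]
    for group in sorted(groups):
        lines += [f"## {group}", ""]
        lines += [f"- [{rel.stem}]({rel.as_posix()})" for rel in groups[group]]
        lines.append("")
    return "\n".join(lines)
```

The output is still just a markdown file, so you can generate it once and then carry on editing it by hand.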

Small inconsistencies I noticed (honest “and here’s what I’d tidy” material, optional)

Including these makes the post more credible, not less - it models the same honesty the caveats sections preach:

  • Title format drifted across folders. gov-etl uses descriptive titles (# ONS Inflation (MM23)), weather-etl uses # lake.weather — Deduplicated..., energy-etl uses bare table names (# elexon_boalf). All defensible; for skim-discoverability, picking one (table_name — short description) and noting it in the README settles it cheaply. A CoC nudge, not a rule - and a nice live demonstration that conventions need the occasional gardening.
  • Empty “Interested parties” table in elexon_boalf (header row, no body). Honest - nobody’s registered - but per my own “blank reads as nobody-looked” rule, a one-line convention for “none registered yet” would make it read as deliberate. Tiny.

These two are a good way to end the “real repo” section: the format doesn’t stop you being sloppy, it just makes sloppiness cheap to see and cheaper to fix. Which is the whole point versus a rigid system that hides drift behind a polished UI.


Possible structure for the post (rough)

  1. Hook: the problem isn’t rigour, it’s that nobody updates the catalogue. Ask Dave.
  2. The unlock: LLMs changed what “machine readable” means. Human-readable IS machine-readable now.
  3. The idea: a folder of markdown files, one per dataset, in the repo next to the code.
  4. Show it: the template, then a real filled-in entry from the repo.
  5. The philosophy: convention over configuration; trust over enforcement; git is version history; what it deliberately doesn’t do.
  6. Reality check: three real domains, the caveats that saved me, lineage-for-free, the agent wrote it.
  7. Myth-bust: big orgs haven’t solved this; cheap-and-honest beats expensive-and-unreliable; no one size fits all.
  8. Close: take it, bend it, it’s yours - link the repo.

Style reminders for when I draft

  • No em dashes. Hyphens, semicolons, colons, or restructure.
  • No Oxford commas.
  • No AI-ish transitions (crucially, furthermore, notably, importantly…).
  • British English.
  • “get in touch” not “make contact”.
  • Personal and direct, dantelore.com voice - not formal or bureaucratic.
  • I write the final version myself; these are notes, not prose to lift wholesale.