Offline JAX Transformer Trained from Scratch for Confidential Board Meeting Decision & Action Extraction
A regulated financial firm needed to extract structured decisions, responsible parties, deadlines, and action types from internal board minutes — without any document leaving their network, and without any pre-trained model whose training data or licence they could not account for. We built a 47M-parameter encoder-only transformer in JAX/Flax from random weights, trained entirely on their de-identified historical minutes on their own hardware. It runs on CPU in under a second per document and ships as an offline CLI with a local web UI.
What We Built
Custom Tokeniser for Governance Vocabulary
SentencePiece BPE tokeniser trained on the client's own corpus of board minutes, resolution logs, and governance templates. Handles abbreviations, role titles, quorum terminology, and statutory references common in board documentation without any external vocabulary.
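In production the tokeniser is a single SentencePiece training call over the client's corpus; the BPE procedure it runs can be illustrated in a few lines of plain Python. A minimal sketch of the greedy merge loop — the corpus words below are invented stand-ins for governance vocabulary, not client data, and this is a conceptual illustration rather than the SentencePiece implementation:

```python
from collections import Counter

def bpe_merges(words, num_merges):
    """Greedy BPE: repeatedly merge the most frequent adjacent symbol pair."""
    # Represent each word as a tuple of symbols; counts weight pair statistics.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # nothing left to merge
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word with the chosen pair fused into one symbol.
        merged = {}
        for word, freq in vocab.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            merged[tuple(out)] = merged.get(tuple(out), 0) + freq
        vocab = merged
    return merges

corpus = ["quorum", "quorate", "resolution", "resolved"] * 5
merges = bpe_merges(corpus, 6)
```

Training on the client's own minutes means merges like these capture in-domain stems (quorum terminology, statutory references) instead of general web text.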
47M Parameter Encoder-Only Transformer from Scratch
12-layer encoder with multi-head attention and positional encodings, written in JAX/Flax and initialised from random weights — no pre-trained checkpoint, no transfer learning. Trained using Optax on 4× CPU workers on the client's own hardware over a 3-day run.
On-Premise Training Pipeline (Air-Gapped)
Full training pipeline packaged as a Docker image buildable from offline sources — no PyPI calls, no HuggingFace downloads, no external DNS during training. All dependencies vendored and verified with SHA hashes before the build begins.
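The pre-build hash check can be sketched with the standard library alone. The manifest format here (one `sha256-digest filename` pair per line) is an assumption for illustration, not the client's actual layout:

```python
import hashlib
import pathlib

def verify_vendor_dir(vendor_dir, manifest_path):
    """Check every vendored artefact against a 'sha256 filename' manifest.

    Returns a list of failure messages; an empty list means the offline
    Docker build may proceed.
    """
    expected = {}
    for line in pathlib.Path(manifest_path).read_text().splitlines():
        if not line.strip():
            continue
        digest, name = line.split()
        expected[name] = digest

    failures = []
    for name, digest in expected.items():
        path = pathlib.Path(vendor_dir) / name
        if not path.is_file():
            failures.append(f"missing: {name}")
            continue
        actual = hashlib.sha256(path.read_bytes()).hexdigest()
        if actual != digest:
            failures.append(f"hash mismatch: {name}")
    return failures
```

Gating the build on an empty failure list means a tampered or incomplete vendor directory stops the pipeline before any image is produced — no network fallback is possible because none is configured.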
Four Extraction Heads
Separate classification heads trained per task on top of the shared encoder: decision text span, responsible party name, deadline date expression, and action type label (Approve / Defer / Delegate / Reject / Note). Each head outputs a confidence score alongside its extracted value.
Structured JSON Output with Confidence Scores
Every extracted item returned as structured JSON with field, value, start/end character span in the original document, and a per-field confidence score — enabling downstream systems to apply their own acceptance thresholds or flag low-confidence items for human review.
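A sketch of the record shape a downstream system would consume — field names and the example sentence are illustrative, not the client's actual schema:

```python
import json

def to_record(field, text, start, end, confidence):
    """One extracted field: value, character span, and confidence."""
    return {
        "field": field,
        "value": text[start:end],
        "span": {"start": start, "end": end},
        "confidence": round(confidence, 4),
    }

doc = "RESOLVED: the CFO to deliver the revised budget by 30 June."
records = [
    to_record("responsible_party", doc, 14, 17, 0.97),
    to_record("deadline", doc, 51, 58, 0.91),
]
# Downstream systems apply their own acceptance threshold.
needs_review = [r for r in records if r["confidence"] < 0.95]
print(json.dumps(records, indent=2))
```

Because every value carries its source span, a reviewer can always jump from a JSON field back to the exact characters in the original minutes.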
Offline CLI and Local Web UI
Command-line interface for batch processing with JSON or CSV output. Local web UI (FastAPI + Jinja2, no JavaScript dependencies from CDN) for interactive use — paste or upload a minutes document, inspect highlighted extractions with confidence colour coding, and export results. No GPU, no internet, no cloud.
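A stdlib sketch of the batch CLI's shape — the flag names and the `extract` callback are hypothetical stand-ins, not the shipped tool's interface:

```python
import argparse
import csv
import io
import json
import pathlib

def build_parser():
    p = argparse.ArgumentParser(
        prog="extract-minutes",
        description="Offline batch extraction from board minutes documents.")
    p.add_argument("inputs", nargs="+", type=pathlib.Path)
    p.add_argument("--format", choices=("json", "csv"), default="json")
    p.add_argument("--threshold", type=float, default=0.0,
                   help="drop fields below this confidence")
    return p

def run(argv, extract):
    """Process each input file with `extract` and render JSON or CSV."""
    args = build_parser().parse_args(argv)
    rows = []
    for path in args.inputs:
        for rec in extract(path.read_text()):
            if rec["confidence"] >= args.threshold:
                rows.append({"document": path.name, **rec})
    if args.format == "csv" and rows:
        buf = io.StringIO()
        writer = csv.DictWriter(buf, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)
        return buf.getvalue()
    return json.dumps(rows, indent=2)
```

Keeping the renderer a pure function of its arguments makes the same code path trivially reusable by the local web UI, with no network call anywhere in the loop.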
Technologies Used
JAX, Flax, Optax, SentencePiece, Docker, FastAPI, Jinja2
Key Outcomes
Under 1 second per-document extraction time on CPU — no GPU required at inference
Air-gapped — training, inference, and UI run with zero outbound network calls
No pre-trained model licences, no third-party data used in training
Need Something Similar?
We build private AI systems that never leave your infrastructure — trained on your data, owned entirely by you, with no dependency on any external model provider.