Every role I’ve had has been the same problem with more powerful tools.
I didn’t stumble into AI pipeline work. It found me — because I’ve been solving this class of problem for years without having a name for it.
The Pattern I Didn’t See Coming
Back in 2017, I was building Access databases and VBA macros for Navy and USMC clients. The work looked simple on the surface: take siloed data that no one could make sense of, clean it up, connect it, and give leadership visibility they didn’t have before.
A $628M budget trapped in disconnected Excel files. 170 million data points across depot systems with no unified view. Lessons learned that lived in email threads and died there.
The problem was always the same: signal buried in noise, manual work standing between the data and the decision.
VBA and Access were the right tools for that scale. I used them well. But I was already solving an extraction problem — I just didn’t know it yet.
Millennium: NLP at Scale
By 2023 I was at Millennium, working on something bigger: the Navy Personnel and Pay (NP2) program, a federal modernization effort with 10,000+ requirements that each had to be matched to test cases.
The manual process took analysts 3–4 months per cycle. My job was to make that faster.
I built a Python-based NLP platform that applied:
- Jaccard similarity scoring to measure requirement overlap
- Hierarchical clustering to group related requirements
- Context-aware topic extraction to pull meaning from dense technical language
- Adaptive acronym harvesting — 633 Navy-specific terms with 98% extraction accuracy
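For readers who haven’t met the first item on that list: Jaccard similarity is the size of the overlap between two token sets divided by the size of their union. The sketch below is a minimal illustration, not the NP2 code. The tokenizer, the function names, and the sample requirement and test cases are all assumptions made up for this example.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase a string and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity |A & B| / |A | B|; defined here as 0.0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_test_cases(requirement: str, test_cases: dict[str, str]) -> list[tuple[str, float]]:
    """Return (test_case_id, score) pairs sorted from best match to worst."""
    req_tokens = tokenize(requirement)
    scored = [
        (tc_id, jaccard(req_tokens, tokenize(text)))
        for tc_id, text in test_cases.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical sample data, invented for illustration only.
requirement = "System shall calculate monthly pay for active duty members"
test_cases = {
    "TC-001": "Verify monthly pay calculation for active duty members",
    "TC-002": "Verify leave balance is displayed on the dashboard",
}
ranking = rank_test_cases(requirement, test_cases)
```

A real pipeline would likely add stop-word removal and domain-specific normalization before scoring, which is where something like the acronym harvesting would come in.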
The result: validation time per cycle dropped from 3–4 months to under 2 hours, a reduction of more than 98%. Analysts validated requirements 50% faster. Traceability confidence went from 60–70% to 75–85%.
I called it an NLP automation platform at the time. Looking back, I’d call it something else.
What I’d Call It Now
What I built at Millennium was a primitive retrieval-augmented generation (RAG) system.
I didn’t have that vocabulary then. But the architecture was the same:
- Ingest a collection of documents (requirements)
- Extract meaning and structure from text
- Score similarity and relevance
- Surface the right matches to a human analyst
That’s retrieval-augmented generation without the generation step. The retrieval and augmentation were both there.
The tooling was 2023-era Python and NLP libraries. The concept was timeless.
Autyvia: The Same Problem, Modern Stack
In late 2025 I joined Autyvia, an architecture, engineering, and construction (AEC) intelligence startup, to build their competitive intelligence pipeline from the ground up.
The problem statement was immediately familiar: extract signal from noise at scale and eliminate manual work.
This time the input isn’t Navy requirements. It’s 500+ content items a week from RSS feeds, YouTube channels, podcasts, and industry association websites. The extraction target isn’t test case matches. It’s companies, people, pain points, technology trends, and newsletter hooks.
The architecture maps directly:
| Millennium (2023) | Autyvia (2025–present) |
|---|---|
| Python NLP pipeline | Python + n8n orchestration |
| Jaccard similarity | pgvector cosine similarity |
| Hierarchical clustering | HDBSCAN semantic clustering |
| Domain acronym extraction | LLM entity extraction (Ollama/Llama 3.1) |
| Manual analyst review | Automated delivery to SharePoint + Teams |
The concepts transferred directly. The implementation evolved.
What used to require hand-tuned similarity functions now uses vector embeddings. What used to require manual review pipelines now runs on a schedule and delivers to the editorial team automatically. What used to take 3–4 months now runs every Monday morning.
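To make the Jaccard-to-cosine row of the table concrete: cosine similarity compares two embedding vectors by the angle between them rather than by shared tokens. In pgvector, the `<=>` operator returns cosine distance, which is 1 minus this value. The sketch below is plain Python for illustration, not the production query. The three-dimensional vectors and item names are invented; real embeddings have hundreds of dimensions.

```python
import math


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def nearest(query: list[float], items: dict[str, list[float]]) -> str:
    """Return the key of the stored vector most similar to the query."""
    return max(items, key=lambda key: cosine_similarity(query, items[key]))


# Toy embeddings, invented for illustration only.
stored = {
    "article-a": [0.9, 0.1, 0.0],
    "article-b": [0.0, 0.2, 0.9],
}
best_match = nearest([1.0, 0.0, 0.1], stored)
```

The practical difference from Jaccard is that two items can score as similar without sharing a single word, as long as the embedding model places them close together.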
The Stack Today
The production pipeline I maintain at Autyvia:
- Ingestion: RSS feeds, YouTube RSS, podcast sources, config-driven Python scrapers for industry associations
- Processing: Ollama/Llama 3.1 running locally for relevance scoring, entity extraction, and structured JSON output
- Storage: PostgreSQL with pgvector for semantic search and clustering
- Orchestration: 15+ n8n workflows handling ingestion, deduplication, LLM processing, and delivery
- Delivery: Automated export to SharePoint and Microsoft Teams via Graph API
- Infrastructure: Docker containers running on-prem, OAuth2/Azure AD for Microsoft 365 integrations
Zero manual steps in the weekly delivery cycle.
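Running with no manual steps means the pipeline can’t trust an LLM response just because it was asked for JSON. The sketch below shows one way a processing step might validate structured output before it reaches storage. It is an assumption-heavy illustration, not the Autyvia code: the field names are guessed from the extraction targets described earlier, and `parse_extraction` is a hypothetical helper.

```python
import json
from typing import Optional

# Hypothetical schema: field names assumed from the extraction targets
# (companies, people, pain points, technology trends, newsletter hooks).
EXPECTED_FIELDS = (
    "companies",
    "people",
    "pain_points",
    "technology_trends",
    "newsletter_hooks",
)


def parse_extraction(raw: str) -> Optional[dict[str, list[str]]]:
    """Parse an LLM response into the expected shape, or return None if unusable."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None

    result: dict[str, list[str]] = {}
    for field in EXPECTED_FIELDS:
        value = data.get(field, [])  # a missing field is treated as an empty list
        if not isinstance(value, list):
            return None
        # Keep non-empty strings only, trimmed, de-duplicated in original order.
        cleaned = (item.strip() for item in value if isinstance(item, str))
        result[field] = list(dict.fromkeys(item for item in cleaned if item))
    return result


# Hypothetical model response, invented for illustration only.
sample = '{"companies": ["Acme", " Acme ", ""], "people": ["J. Doe"]}'
parsed = parse_extraction(sample)
```

Returning `None` instead of raising lets an orchestrator branch on failure, for example by retrying the item or routing it to a dead-letter table.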
The Through-Line
VBA/Access → Python NLP → LLM pipelines with pgvector.
Each transition felt like learning something new. Looking back, it was the same skill deepening — pattern recognition, extraction, automation — applied to increasingly complex problems with increasingly powerful tools.
If you’re looking for someone who thinks this way about data problems — and has the production pipeline to show for it — I’m open to conversations about what’s next.
— Linton