Trunk Tools' stack cut document review from 60 days to 10 by ditching general-purpose models

Most verticals aren’t clean, well-oiled SaaS databases; the reality is ugly documents, proprietary schemas, implicit workflows, and long‑running tasks that most general-purpose models struggle with.

This prompted construction project management company Trunk Tools to build a specialized, three-layer architecture (perception, semantics, agents) based on highly detailed data to support high-accuracy, highly relevant industry automation.

Their purpose-built stack has shrunk review cycles from months to days, prevented costly field errors, and given autonomous agents the ability to reason over millions of pages of documentation, Trunk says.

“We really set out to take the data from dispersed systems, pre-process it, structure it, go through our ontology into a knowledge graph, and then train AI models,” said Sarah Buchner, Trunk’s founder and CEO and a former carpenter.

For builders in other verticals, Trunk’s approach could serve as a blueprint for transforming data chaos into agent‑ready, industry-specific workflows.

Where general-purpose LLMs break down on industry data

Foundation LLMs, while powerful, are optimized for breadth, not always depth.

“General-purpose LLMs are trained to be okay at everything, so they're weak at anything niche,” said Kriti Faujdar, a senior product manager working in AI infrastructure, agentic AI, security, and LLM platforms. Examples include rare terms, domain-specific reasoning, and the unspoken context that any practitioner “just knows.”

Web, app, and software developer Sébastien De Bollivier agreed that the biggest bottleneck is reliability on data that is “jargon-dense, abbreviation-heavy, and format-specific.”

“A GPT-4-class model can understand a French legal contract, but will fumble the specific article references practitioners need to cite,” he said.

Besides, the most valuable enterprise data never made it into pretraining anyway, Faujdar pointed out. It's sitting in internal systems and proprietary formats. “RAG helps a little,” she said. “But it's just giving better facts to a model that still can't reason properly in the domain.”

Pre-training on domain data is critical; enterprises should then fine-tune on good task examples and build their own evals. “A few thousand examples from real practitioners beats millions of scraped, noisy ones,” Faujdar said.

Mixture-of-experts (MoE) can provide specialization without inference costs blowing up. Pairing RAG with fine-tuning also works well; RAG handles the factual long tail while fine-tuning fixes vocabulary and reasoning.

De Bollivier pointed to the advantage of hybrid stacks: A general-purpose model for reasoning and orchestration, a smaller fine-tuned model (or dense retrieval over a curated corpus) for domain-specific extraction. He advised: “Don't fine-tune to make the model 'smarter' about a domain, fine-tune to make it more reliable on the specific output format your workflow requires.”

The trades and construction are certainly industries seeing traction with these techniques, as are legal and healthcare, De Bollivier said. These verticals have “high stakes for errors plus standardized document formats, equaling clear domain-training ROI.”

One honest caveat worth mentioning, Faujdar said: Specialized models often fall apart outside their domain, so they are of little use beyond their area of expertise unless they are retrained.

Perception, semantics, agents: inside Trunk's three-layer stack

In highly specialized domains like construction, “data dumps” into LLMs don’t cut it, said Trunk’s CTO Amrish Kapoor. This is because most transformers are probabilistic models: When given an image, they report back that it is “probably” a tree, or “probably” a child playing next to a tree.

This makes them insufficient for high‑precision symbolic interpretation. For instance, in construction documents, a 2-millimeter-wide symbol has a vastly different meaning depending on where it’s placed.

Further, constrained by context limits, probabilistic models struggle with long‑term project memory. “I don't mean a context window of a few tokens,” Kapoor said. “I'm talking about long term memory that stretches across months and years, because this is how long some of these projects are.”

Instead, Trunk’s three-layer system breaks workflows into:

Perception (reading and extracting data from messy docs like PDFs, drawings, or scans).

A semantic/graph layer (making sense of that data and understanding the relationships within it).

LLMs and agents on top.

Construction drawings are typically symbolic, Buchner said. A door isn't always labeled ‘door.’ Sometimes it's simply an arc on a wall that a trained eye learns to read based on years of practice.

“The perception layer is what teaches AI to read that language,” she said. The semantic layer then gives that information meaning; for instance, connecting the door to the drawing that details it, the spec that governs it, and the trade that installs it. This helps answer project engineers’ critical questions: Not "is there a door here?" but "does this door create a problem down the line?"
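
As a rough illustration of the semantic layer described above, the sketch below links a door found by a perception step to the drawing, spec, and trade it relates to, then walks those links to answer a downstream question. Every identifier here (the node names, the relation labels, the helper functions) is hypothetical and is not Trunk's ontology; plain Python dictionaries stand in for a real graph store.

```python
from collections import defaultdict

# Hypothetical (subject, relation, object) triples. The IDs and relation
# names are illustrative assumptions, not Trunk's actual schema.
EDGES = [
    ("door:D-101", "detailed_in", "drawing:A-501"),
    ("door:D-101", "governed_by", "spec:S-1"),
    ("door:D-101", "installed_by", "trade:doors"),
    ("door:D-101", "requires", "panel:EP-3"),
    ("panel:EP-3", "detailed_in", "drawing:E-201"),
]


def build_graph(edges):
    """Index triples as graph[subject][relation] -> list of objects."""
    graph = defaultdict(lambda: defaultdict(list))
    for subject, relation, obj in edges:
        graph[subject][relation].append(obj)
    return graph


def neighbors(graph, node, relation):
    """Objects linked to `node` by `relation` (empty list if none)."""
    return list(graph.get(node, {}).get(relation, []))


def undocumented_requirements(graph, node, issued_drawings):
    """Items `node` requires that no issued drawing details."""
    missing = []
    for item in neighbors(graph, node, "requires"):
        drawings = neighbors(graph, item, "detailed_in")
        if not any(d in issued_drawings for d in drawings):
            missing.append(item)
    return missing


graph = build_graph(EDGES)
issued = {"drawing:A-501"}  # assumption: the electrical sheet was not issued
print(undocumented_requirements(graph, "door:D-101", issued))
```

With only the architectural sheet issued, the query reports the panel as undocumented, which is the "does this door create a problem down the line?" style of question rather than a simple presence check.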

Particularly in construction, that shift matters because the cost of a problem compounds with time. “A conflict caught in design is relatively low cost to address,” Buchner said, “whereas the same problem caught in the field might cost tens of thousands of dollars.”

At a high level, the system identifies the document type and begins extracting information based on content (drawings, schedules, paragraph text). This data is then “transformed and augmented” in the platform, which triggers agentic workflows like knowledge graph relationships and end-user workflows.

For instance, an agent might review an architecture bulletin and produce a visual overlay comparing an older version and a newer version (flagging additions and removals), then generate written narratives that describe what those changes are in simple terms. This helps users understand what’s changed and coordinate with trade partners on updated pricing and change orders.
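
The comparison step in that example can be sketched as a diff over the elements extracted from each revision. This is a minimal sketch under stated assumptions: the element IDs, the attribute names, and the one-line narrative format are all hypothetical, and a real system would work from perception-layer output rather than hand-written dictionaries.

```python
def diff_revisions(old, new):
    """Compare two {element_id: attributes} dicts from two drawing revisions."""
    added = sorted(new.keys() - old.keys())
    removed = sorted(old.keys() - new.keys())
    # Elements present in both revisions whose attributes differ.
    changed = sorted(k for k in old.keys() & new.keys() if old[k] != new[k])
    return {"added": added, "removed": removed, "changed": changed}


def narrative(diff):
    """Turn a diff into a short plain-language summary."""
    parts = []
    for label in ("added", "removed", "changed"):
        if diff[label]:
            parts.append(f"{label}: {', '.join(diff[label])}")
    return "; ".join(parts) if parts else "no changes detected"


# Hypothetical extracted elements from an older and a newer revision.
old_rev = {
    "beam:B-12": {"elevation_in": 120.0},
    "door:D-101": {"width_in": 36},
}
new_rev = {
    "beam:B-12": {"elevation_in": 128.5},
    "window:W-7": {"width_in": 48},
}

summary = diff_revisions(old_rev, new_rev)
print(narrative(summary))
```

Keeping the structured diff separate from the narrative means the same result can feed both a visual overlay and a written summary.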

The scale of construction’s data problem

Construction workflows are “ripe with implicit assumptions and connections between data in its myriad of sources,” Buchner said. And the amount of unstructured data is “humanly impossible” to process or make sense of.

Buchner estimated the average high-rise building generates about 3.6 million pages of corresponding documentation. “If you print it into a stack of papers it would be as high as the building itself.”

All three layers of Trunk’s stack — perception, semantic, LLM — are trained on “very specific datasets” from customers with “explicit permissions” and auto‑labeling/IP, Kapoor explained. Customers who don’t want Trunk training on their data can opt out.

Data is deidentified and aggregated, and Trunk also collects “tons more” labeled data through other pipelines like 3D building information modeling (BIM).

Trunk says it only ships agents that achieve around 95% accuracy. The team maintains continuous evaluation pipelines based on ground truth data from customers and experts. They also employ an LLM-as-a-judge model.

“This notion of an LLM as a judge is to score how well you're doing, both subjectively as well as objectively,” Kapoor said. Objectivity can be an easy ‘right’ or ‘not right,’ but subjectivity requires more nuance.

For instance, when creating an email, narrative, or explanation, an LLM-as-a-judge framework can create a composite score: a numerical value that aggregates different metrics and tests a model's performance or risk.
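
To make the objective and subjective split concrete, the sketch below scores objective items by exact match against ground truth, rolls subjective judge ratings into a weighted composite, and applies a release threshold. The 0.95 threshold echoes the roughly 95% figure Trunk cites; the metric names, the weights, the exact-match rule, and the choice to gate on both scores are illustrative assumptions, not Trunk's evaluation pipeline. Judge ratings are passed in as plain numbers, so no real LLM is called.

```python
def exact_match_accuracy(predictions, ground_truth):
    """Fraction of ground-truth items the agent answered exactly right."""
    if not ground_truth:
        raise ValueError("ground_truth must not be empty")
    correct = sum(
        1 for key, expected in ground_truth.items()
        if predictions.get(key) == expected
    )
    return correct / len(ground_truth)


def composite_score(ratings, weights):
    """Weighted average of judge ratings, each expected to lie in [0, 1]."""
    if set(ratings) != set(weights):
        raise ValueError("ratings and weights must cover the same metrics")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(ratings[m] * weights[m] for m in ratings) / total_weight


def passes_release_gate(accuracy, composite, threshold=0.95):
    """Assumed policy: both scores must meet the threshold."""
    return accuracy >= threshold and composite >= threshold


# Hypothetical objective check: 19 of 20 extracted values match ground truth.
truth = {f"item-{i}": i for i in range(20)}
preds = dict(truth)
preds["item-0"] = -1  # one deliberate miss
acc = exact_match_accuracy(preds, truth)

# Hypothetical judge ratings for one generated narrative.
ratings = {"factual_accuracy": 1.0, "clarity": 0.5, "tone": 0.5}
weights = {"factual_accuracy": 2, "clarity": 1, "tone": 1}
comp = composite_score(ratings, weights)

print(acc, comp, passes_release_gate(acc, comp))
```

In this made-up run the objective score clears the bar but the composite does not, so the gate holds the release. That is the kind of case where a subjective score adds information an exact-match check would miss.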

There can be challenges, though, particularly with latency, Buchner noted; any time the reasoning capacity of underlying models increases, the risk of latency goes up, too. Trunk maintains a set of evaluation criteria to objectively measure latency whenever changes are made to underlying infrastructure, agents, and API calls.

Then, “before we release to customers, we ensure marginal changes to the end-user experience are well worth the performance enhancements,” Buchner said.
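
A latency check of the kind described above could look something like the following. It is only a sketch: the 95th-percentile statistic, the 10% tolerance, and the sample numbers are assumptions chosen for illustration, since the article does not say which criteria Trunk uses.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_regressed(baseline_ms, candidate_ms, pct=95, tolerance=0.10):
    """True if the candidate's percentile latency exceeds the baseline's
    by more than the allowed tolerance."""
    base = percentile(baseline_ms, pct)
    cand = percentile(candidate_ms, pct)
    return cand > base * (1 + tolerance)


# Hypothetical response times in milliseconds before and after a change.
baseline = [100] * 19 + [200]
slower = [100] * 18 + [300, 300]

print(latency_regressed(baseline, slower))
print(latency_regressed(baseline, baseline))
```

Comparing a tail percentile rather than the mean is a common choice because a handful of slow responses tends to shape how users perceive speed.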

From 60 days to 10: the measurable payoff

Trunk’s platform powers seven AI agents purpose-built for construction, handling tasks such as analyzing request for information (RFI) responses, providing overviews of bids, or reviewing drawings and submittals.

The submittal agent, for instance, flags missing, conflicting, or noncompliant information in product specs and RFIs. While it’s an essential step in the construction process, “it's a super annoying workflow,” Buchner said, because human reviewers have to compare documents “with a bunch of other parts of documents.”

But the agent is able to do this in seconds, and Trunk says it has reduced submittal cycles from 50 to 60 days to 10, “which has massive schedule and financial implications.”

Trunk is now at a place where these agents are communicating directly with each other, which is “quite exciting,” Buchner said. So, for example, one agent will review an architectural drawing for accuracy, then autonomously hand it over to agents that handle RFIs and ask follow-up questions.

“If the drawings have problems, the RFI agent is taking over and is actively reaching out for clarification,” Buchner explained.

Trunk says its customers report savings of 20 to 40 minutes per field question. Buchner said that users in the field know better than anyone how much of a “time suck” it is to go back and forth from office trailers, dig through project documents in scattered systems or printed PDFs, reconcile discrepancies, and return to coordinate with trade partners.

Trunk says its customers report these additional outcomes:

Average 8-minute time savings for single-document retrieval (status checks, location lookups, quantity queries).

Average 20-minute time savings for standard referencing (cross-referencing 2 to 3 spec sections to form an answer).

Average 40-minute time savings for multi-document research (listing and filtering queries, mapping relationships, analyzing RFIs and submittals across 4 to 6 documents).

Average 75-minute time savings for complex tasks (creating RFIs and other communication materials, deep cross-referencing across documents, change tracking).

In one instance, Trunk’s drawing review agent flagged that a structural beam had been moved up 8.5 inches. However, this was not documented by the architect. If the change hadn’t been caught, the project manager would likely have had to strip out and reinstall the right size beam, Buchner said. This rework would have added $10,000 or more to the budget, and “certainly there would have been implications on the schedule.”

Buchner also pointed to other examples: an agent flagged $60,000 in exaggerated pricing with no justification from landscaping subcontractors; identified a fireplace that needed to be sealed prior to drywall installation, saving around $100,000 in labor, materials, and delays; and called out that an electric door required a panel that wasn’t included in electrical drawings.

Learnings for other industries

Trunk’s approach to building agents is applicable to any vertical working with high volumes of unstructured, industry-specific data.

Builders working in specific verticals must understand the industry-specific data challenges their end users face and build technical infrastructure that can transform unstructured data into something an “LLM can traverse and understand,” Buchner said.

“Only then can you build the connections between data points that ultimately feed agentic workflows.”

A lot of money is being invested in foundational models, so enterprises should build modular systems that can leverage the strengths of various models as they continue to improve, Buchner advised.

Then, “build your technical advantage where the generic models are not investing and not performing well,” she said.
