OCR for construction documents does not work, we fixed it

[−] Terr_ 46d ago

> OCR for construction documents does not work

I'm reminded of the Xerox JBIG2 bug back in ~2013, where certain scan settings could silently replace numbers inside documents, and bad construction-plans were one of the cases that led to it being discovered. [0]

It wasn't overt OCR per se, end-user users weren't intending to convert pixels to characters or vice-versa.

[0] https://www.youtube.com/watch?v=c0O6UXrOZJo&t=6m03s

[−] TehCorwiz 46d ago

If I recall it was an artifact of the compression algo.

Full context and details: https://www.dkriesel.com/en/blog/2013/0802_xerox-workcentres...

[−] hackcasual 46d ago

JBIG2 does glyph binning, as you say not exactly OCR, but similar. So chunks of the image that look sufficiently similar get replaced with a reference to a single instance.

[−] thaumasiotes 46d ago

> not exactly OCR, but similar. So chunks of the image that look sufficiently similar get replaced with a reference to a single instance.

How can we describe OCR that wouldn't match this definition exactly?

[−] Terr_ 46d ago

It's not too hard, while they share some mechanics, the underlying use-cases and requirements are very different.

_______ Optical character recognition:

1. You have a set of predefined patterns of interest which are well-known.

2. You're trying your best to find all occurrences of those patterns. If a letter appears only once, you still need to detect it.

3. You don't care much about visual similarity within a category. The letter "B" written in extremely different fonts is the same letter.

4. You care strongly about the boundaries between categories. For example, "B+" must resolve to two known characters in sequence.

5. You want to keep details of exactly where something was found, or at the least in what order they were found. You're creating a layer of new details, which may be added to the artifact.

_______ "Glyph compression":

1. You don't have a predefined set of patterns, the algorithm is probably trying to dynamically guess at patterns which are sufficiently similar and frequent.

2. Your aren't trying to find all occurrences, only sufficiently similar and common ones, to maximize compression. If a letter appears only once, it can be ignored.

3. You do care strongly about visual similarity within a category, you don't want to mix-n-match fonts.

4. You don't care about clear category lines, if "B+" becomes its own glyph, that's no problem.

5. You're discarding detail from the artifact, to make it smaller.

[−] yuliyp 46d ago

Glyph binning looks for any chunks in the image that are similar to eachother, regardless of what they are. Letters, eyeballs, pennies, triangles, etc without caring what it is. OCR looks specifically to try and identify characters (i.e. it starts with a knowledge of an alphabet, then looks for things in the image that look like those.

If the image is actually text, both of them can end up finding things. Binning will identify "these things look almost the same", while OCR will identify "these look like the letter M"

[−] Dylan16807 46d ago

Jbig2 dynamically pulls reference chunks out of the image, which makes it more likely to have insufficient separation between the target shapes.

It also gives a false sense of security when it displays dirty pixels that still clearly show a specific digit, since you think you're basically looking at the original.

[−] thaumasiotes 46d ago

That's a description of Jbig2, not a description of OCR.

Jbig2 is an OCR algorithm that doesn't assume the document comes from a pre-existing alphabet.

[−] Dylan16807 46d ago

You asked what the difference was, and I said the difference. Was it unclear that to fit the phrasing of your question, we add "OCR doesn't"? I would not personally call Jbig2 OCR.

[−] thaumasiotes 46d ago

> You asked what the difference was, and I said the difference.

Take another look at my comment.

[−] Dylan16807 46d ago

Let me try rephrasing to make the response to your original comment as clear as possible.

Question: "How can we describe OCR that wouldn't match this definition exactly?"

Answer: This definition largely fits OCR, but "reference to a single instance" is a weird way to phrase it. A better definition of OCR would include how it uses builtin knowledge of glyphs and text structure, unlike JBIG2 which looks for examples dynamically. And that difference in technique gives you a significant difference in the end results.

Is that better?

The definition you quoted is not an "exact" fit to OCR, it's a mildly misleading fit to OCR, and clearing up the misleading part makes it no longer fit both.

[−] h317 46d ago

I cannot wait for the day when tech companies become players in the construction industry because it looks like it is the only way forward to make a change.

To think that everything has been digitalized a long time ago, yet contract law cannot properly deal with delineating responsibilities between GC and Architects, who are still sending 2D drawings to each other.

Imagine, all this information about quantities and door types (and everything else) is already available and produced by the architect's team, BUT they cannot share it! Because if they do, they are responsible for the numbers in case something is wrong.

So now there is this circus of: Arch technologist making the base drawing with doors. GC receives documents, counts doors for verification, and sends them to the sub. Subcontractor looks at these drawings, counts them again, and sends data to the supplier. Guess what, the supplier also looks, counts, confirms, and back we go.

Though I think robotics will change all of that. And when we have some sort of bot assistance, big tech players will have a bigger leverage in this, which will lead to the proper change management architecture.

Anyway, cool product. Anything to help with estimation. Really hope it gets traction.

[−] punnerud 46d ago

I had a job as HVAC engineer for the upgraded Oslo Airport back in 2011; started doing HVAC work for 3 weeks the rest was programming trying to make the rest of people more efficient. Made an Excel sheet with a lot of macros to manage all the drawing of the airport. That’s why I switched to programming when I continued to study, and did not want to come back before I got more experience.

They even gave me a big desk at Trondheim/Tyholt so I could help them with the software during my studies.

[−] ajcp 46d ago

For what it's worth I found Oslo Airport to be one of the best airport experiences I've ever had. 5 stars.

[−] punnerud 45d ago

Thank you, I remember a lot of good discussions between architects and engineers. How to make it both beautiful and a good environment to be in

[−] wcisco17 46d ago

Hey thanks!! For estimation we cover division 08 (doors and opening) you can use it for your estimating purposes with these two endpoints:

- Counting all the doors: https://www.getanchorgrid.com/developer/docs/endpoints/drawi... - Extracting schedules in architectural drawings: https://www.getanchorgrid.com/developer/docs/endpoints/drawi...

and use Claude or any other AI tool to wire up the UI

We're releasing toilets (division 10) later this week, then floors and pipes next.

[−] ndespres 46d ago

I’ve worked on projects where a lot of work was done in highly collaborative drawings on Bluebeam, in which vendors add their markups and items and the program facilitates counting it all at the end of the phase. My role was only in things like wireless AP placement and low voltage cabling drop locations, not anything safety critical like doors, but I assume those vendors were able to keep track of those items in a similar way. For actual engineering projects I’m glad so many people have to take the time to count.

[−] alexeischiopu 46d ago

I couldn’t agree more. The fact this data isn’t programmable is really holding the industry behind.

When building PlanGrid there were so many things we wished we could have done had this been unlocked.

I’m now working on doing just that.

[−] sreekanth850 46d ago

We’re taking a different path, building a parsing engine that converts CAD (DWG/DXF) into fully structured JSON with preserved semantics (no ML in the critical path).We also have a separate GIS parser that extracts vector data (features, layers, geometries) independently, Like to know how you handle consistency and reproducibility across runs using models and how you make it affordable, especially at scale. because as far as i know CAD and GIS need precision and accuracy.

[−] wcisco17 46d ago

interesting yeah parsing DWG/DXF natively makes sense when the source file is clean and well-structured. The precision argument is valid in controlled environments.

The challenge we kept running into is that construction drawings in the wild aren’t always that clean. Unresolved xrefs, exploded dynamic blocks, version incompatibilities, SHX font substitutions — by the time a PDF hits a GC’s desk it’s often the only reliable artifact left. The CAD source may not even be available.

That’s why we see vision becomes the more pragmatic path — not because it’s more precise than structured CAD parsing, but because PDFs are the actual lingua franca of construction. Every firm, every trade, every discipline hands off PDFs. So we made a bet on meeting the document where it actually lives.

On consistency and reproducibility — that’s a real challenge with vision models. Our approach is to keep detection scope narrow and validate confidence scores on every output rather than trying to generalize broadly. Happy to go deeper on that if useful.

[−] sreekanth850 46d ago

As a part of our product development, we had fought with PDF so much, even we have a generic PDF parser with triple pipeline (One for single column, another for multi column and third for complex table based layouts) yet we are not getting 100% accuracy, I would say that it's bit risky to bet on PDF. PDF often is the most complex format ever made and it was never made for data extraction. And You are right that vision models are the only way but hallucination is real.

[−] runxel 46d ago

Dumbcad line barf will not help you with that at all.

There already is a format that is plain text and preserves the semantics: IFC. That's what it was made for.

[−] sreekanth850 46d ago

We’re not just dumping primitives, we extract full CAD context including entities, layers, blocks, colors, and topology. That metadata allow reconstruct structure deterministically. IFC is great when available, but in most real-world pipelines DWG is still the source of truth, often degraded. Our focus is making that usable without relying on probabilistic vision layers. People depend on PDF for cad files due to its portability and to avoid software dependency/licensing, we aim solving that, any machine or pipeline that needs CAD or GIS data for analytics, search, or reasoning can operate on our structured output without requiring a native CAD or ESRI license.

[−] oneneptune 46d ago

Is this a service / product you plan to offer outwardly? I'd be interested in learning more. Use case: estimation.

[−] nostrapollo 46d ago

First off, congrats on the launch! Construction is a tough market to build in. My personal view after being in it a for a few years is that there is no shortage of MVPs. In fact there is an MVP for every problem at every level (or at least it feels that way) but construction is /vast/ and the rough edges that seem juicy at first, in practice are optimizations rather than bottlenecks for constructors.

I hope you succeed because it would be great to have a standard API for this data, but I would advise on one of two directions: become the standard by being close to 100% accurate at finding symbols (one symbol doesn't seem to cut it in our testing) or make a great, comprehensive workflow for a small subset of the market and become standard that way.

In both cases, you cannot do a broad 'market test', you need to spend many hours with a specific sub-set of users in construction.

Disclaimer: I'm a co-founder of Provision.

[−] petee 46d ago

I ran the example doors given and it missed 9 swinging doors, some that were in double swing pairs, and a few that were just out on their own not clustered. Not bad overall though

[−] ichamo 46d ago

Really interesting to see this space developing. I'm building a masonry-specific quantity takeoff tool (vision model extraction into a parametric domain model that spits out bid-ready quantities) and the "data prison" framing resonates hard.

One thing I've learned going deep in a single trade: the distance between "structured JSON from a drawing" and "numbers an estimator will bid with" is enormous. I've been really impressed with Bobyard and SketchDeck especially.

h317's point about the liability-driven re-counting circus is spot on. Each party in the chain needs to own their numbers. Revit could have solved this a long time ago had this not been the case. An API that makes each individual count faster is valuable but it doesn't collapse the chain.

Would love to talk to anyone else building in this space.

[−] frogguy 46d ago

Looks cool! Where are you getting the data to finetune the cv models for element extraction? I'm worried there isn't a robust enough dataset to be able to build a detection model that will generalize to all of the slightly different standards each discipline (and each firm for that matter) use.

[−] Iulioh 46d ago

When will this be available for 30000x8000px electrical diagrams?

I have to make a BOM and oh boy I hate my job

[−] xnx 39d ago

Doesn't Gemini do this directly? I uploaded an image and asked it to identify the doors and it gave me a JSON array of the boxes.

[−] punnerud 46d ago

«Why we did it»; would rather have a “How we did it”. The why gave me AI generated marketing material feelings.

Tailscale’s article about NAT traversal is an example of how to write “how we did it”: https://tailscale.com/blog/how-nat-traversal-works

[−] testUser1228 44d ago

I know its been a few days, but have you looked into the new Bluebeam offerings and do you see that as competition for you guys?

https://www.bluebeam.com/bluebeam-max/

Their first example is counting fixtures

[−] tomedwrds 46d ago

I have been working on an extension of this problem lately that involves extracting all doors + any details about those doors to produce quotes. I have found giving the pdf to codex pretty good at it as it can take subcrops of the plans to look at certain areas of high noise in more detail. Only downside is cost is quite high.

[−] copypaper 46d ago

Very interesting. Im on vacation but will check this out at work next week.

What is the maximum resolution you support for PDFs? The max gemini will do is 3072x3072. We have plans that are 10x that size.

[−] superdocs1 44d ago

Interesting, I've dealt with a similar problem before. If anyone needs a self-hosted/custom solution, feel free to reach out.

[−] mmethodz 46d ago

Do that for Finnish construction documents. My parser is 30000+ lines candidate based but the lack of standards and the Finnish language...

[−] achillesheels 46d ago

Love it! Starbucks Vente Machiato sip

Love to give it to an arc client, not sure who the right person to implement this would be? Hmm…

[−] gsampsonva 45d ago

How can this tool help easy my workload? I'm already pretty overwhelmed.

[−] rmpanga04 37d ago

This is incredible

[−] testUser1228 46d ago

What do you foresee being the end use case for this (or most valuable use case)?

[−] hspraggins77 46d ago

Great points raised!

[−] alexeischiopu 46d ago

Good idea :)

[−] vessenes 46d ago

cool. What's pricing like?

[−] i18nagentai 46d ago

[flagged]

[−] ware-intel 46d ago

Your smart features looks like a game changer? Nice job!

[−] fithisux 46d ago

Of course it is not working. PDF and images are supposed to be tamper resistant. OCR tries to reverse engineer them.

OCR for construction documents does not work, we fixed it (getanchorgrid.com)

98 comments