Unverified: What Practitioners Post About OCR, Agents, and Tables

[−] bonsai_spool 40d ago

Please write in your own words! I’m not inclined to read something if it consists of what you copied and pasted from Claude.

[−] ikidd 40d ago

This reads less like LLM output than it does someone just transcribing their brief notes as they did their research. Lot of missing subject nouns, which is not something I'd expect to see from AI output.

[−] bonsai_spool 40d ago

You can ask an LLM to write in a different voice—they don't all sound exactly the same, though this one is no different than other examples.

When I use an LLM, it tries to sound like me but there are still tendencies it falls back on, especially when the context window begins to expand.

The 'missing subject nouns' is probably the LLM's way of sounding like an authoritative source in a technical field since many programmers like to write that way.

[−] bonsai_spool 40d ago

Here's a great example of something written by a human that otherwise seems to have a similar structure to the OP:

https://lalitm.com/post/building-syntaqlite-ai/

Flags for LLM vs human drafting:

- Subtitles have the rhetoric turned to 11 with LLMs (note: who has ever had multiple sentences as a blog post heading? It's bizarre):

  - LLM "The Demo Works. Production Does Not."

  - Human "AI is why this project exists, and why it's as complete as it is"

- Sources for claims that call for evidence

  - LLM "Six months ago, a practitioner could name a preferred OCR engine with confidence. Based on what I read, that confidence is gone." - *What was read?*

  - Human "AI coding tools and playing slot machines"[ref]

- Variable paragraph lengths, where things that need more explanation have longer paragraphs (and vice versa)

  - LLM *Scroll through—each thing is about the same length*

----

There are lots of tells like this. This is a moment to get good at detecting LLM text in case it's surreptitiously used to your detriment.

[−] chelm 40d ago

Ok, let's not discuss the content but the format.

> Who has ever had multiple sentences?

Many? https://forum.wordreference.com/threads/two-sentences-in-a-t...

> Sources for claims that call for evidence

Absolutely. You got the joke, right? This was the main point of the full article: no primary sources, only unverified aggregates. A strong contrast to what I normally did once per month.

> Variable paragraph lengths

I tried to compare it to the URL you posted. It's quite similar. I would rather have said: shorter sentences, shorter paragraphs. But let's not fight about this ;)

[−] bonsai_spool 40d ago

I'll amend my statement; I think the comparison text was written by an LLM with human editing. As I read it more, there are also some LLM-isms there.

[−] obsidianbases1 40d ago

Interesting complaint, because many might not share any of their ideas if it weren't for LLMs making it easy. Not everyone has the incentive to dedicate a day to producing writing worth publishing. But maybe they would if it took significantly less time.

Even considering HN's no-LLMs-for-comments rule, which I mostly agree with, I think we would all lose if the same rule were applied to publishing in general.

[−] curtisf 40d ago

"I would rather read the prompt"

https://claytonwramsey.com/blog/prompt/

discussion: https://news.ycombinator.com/item?id=43888803

All of the output beyond the prompt contains, definitionally, essentially no useful information. Unless it's being used to translate from one human language to another, you're wasting your reader's time and energy in exchange for your own. If you have useful ideas, share them, and if you believe in the age of LLMs, be less afraid of them being unpolished and simply ask your readers to rely on their preferred tools to piece through it.

[−] x1798DE 40d ago

I have also found that LLMs do not help me communicate my ideas in any way because the bottleneck is getting the ideas out of my head and into the prompt in the first place, but I will disagree with the idea that the output beyond the prompt contains no useful information.

In the article you linked, the output he is complaining about probably had a prompt like this: "What are the downsides of using Euler angles for rotation representation in robotics? Please provide a bulleted list and suggest alternatives." The LLM expanded on it based on its knowledge of the domain or based on a search tool (or both). Charitably, the student looked it over, thought through the information, and decided it was good (or possibly tweaked it around the edges) and then sent it over, though in practice they probably just assumed it was correct and didn't check it.

For writing an essay like "I would rather read the prompt" LLMs don't seem like they would speed up the process much, but for something that involves synthesizing or summarizing information LLMs definitely can generate you a useful essay (though at least at the moment the default system prompts generate something distinctively bland and awful).

[−] obsidianbases1 40d ago

Sounds reasonable until you consider that the "prompt" might include a million tokens of context, not to mention follow-up/iterations

[−] chelm 40d ago

Did you read the article?

[−] quinndupont 40d ago

Very helpful analysis that confirms everything I’ve encountered. OCR remains a thorny issue. The author talks about professional workflows struggling with tables and such, but I’ve found it challenging to get clean copies of long documents (books). The hybrid workflow (layout then OCR) sounds promising.

[−] ChrisKnott 40d ago

Is there a SOTA OCR model that prioritises failing in a debuggable way?

What I want is an output that records which sections of the image have contributed to each word/letter, preferably with per word confidence levels and user correctable identification information.

I should be able to build a UI to say: no, this section is red-on-green vertically aligned Cyrillic characters; try again.
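A minimal sketch of the kind of record described above, in Python. Every name here (`Region`, `WordResult`, `hint`, `low_confidence`, `regions_to_retry`) is hypothetical and not taken from any existing OCR library; it only illustrates keeping the source region, a per-word confidence, and a user-supplied correction hint together in one output.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Region:
    """Pixel rectangle in the source image (hypothetical layout)."""
    x: int
    y: int
    width: int
    height: int


@dataclass
class WordResult:
    text: str
    region: Region               # which part of the image produced this word
    confidence: float            # 0.0 to 1.0, as reported by the engine
    hint: Optional[str] = None   # user correction, e.g. "vertical Cyrillic"


def low_confidence(words: list[WordResult], threshold: float = 0.8) -> list[WordResult]:
    """Return the words a review UI would surface first."""
    return [w for w in words if w.confidence < threshold]


def regions_to_retry(words: list[WordResult]) -> list[tuple[Region, str]]:
    """Collect (region, hint) pairs for words the user has annotated."""
    return [(w.region, w.hint) for w in words if w.hint is not None]
```

A UI built on something like this could highlight `low_confidence(words)`, let the user attach a `hint` to a word, and pass `regions_to_retry(words)` back to the engine for a second attempt.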

[−] ikidd 40d ago

Funnily enough, I was processing some handwritten tables into Excel with Sonnet. It did way better than I thought it would; I'd say like 95%.

I did have it put confidence indexes next to the output per line, and that was pretty useless: they were either really high or really low, and the confidence didn't match the mistakes at all.
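One way to check whether reported confidence tracks the actual mistakes is to bucket lines by confidence and compare error rates per bucket. This is a rough sketch, assuming you have hand-checked each line; the function name, the bucket edges, and the `(confidence, is_correct)` input shape are all made up for illustration.

```python
def error_rate_by_bucket(lines, edges=(0.5, 0.9)):
    """Group (confidence, is_correct) pairs into low/mid/high buckets
    and return the fraction of wrong lines in each non-empty bucket."""
    totals = {}
    for confidence, is_correct in lines:
        if confidence < edges[0]:
            bucket = "low"
        elif confidence < edges[1]:
            bucket = "mid"
        else:
            bucket = "high"
        count, errors = totals.get(bucket, (0, 0))
        totals[bucket] = (count + 1, errors + (0 if is_correct else 1))
    return {b: errors / count for b, (count, errors) in totals.items()}
```

If the "high" bucket has about the same error rate as the "low" bucket, the confidence numbers carry no signal, which matches what I saw.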

[−] bobajeff 40d ago

It's very surprising to me that the state-of-the-art tools for data entry and digitizing still require a lot of supervision. From the article, it's not that surprising that handwritten documents are harder for old-school OCR or AI, as that can be hard even for humans in some cases. But tables and different layouts seem like low-hanging fruit for vision models.

[−] adam-badar 40d ago

working with continuous OCR capture across 3 monitors using screenpipe. at 1.2fps you get usable text extraction but it uses 600 MB-2 GB of RAM.

biggest issue is OCR can't distinguish directionality, i.e. if someone messages you, or you type "let's cancel the meeting", the text is identical but the intent isn't

[−] jgalt212 40d ago

> The Demo Works. Production Does Not.

Truer words have never been spoken. LLMs make mind-blowing demos, but real-world performance is much less impressive (though still useful).

An example from yesterday:

I asked Google / Nano Banana to repaint my house with a few options. It gave a nice write up on three themes and a nice rendering of 1/3 vertical slices in one image of each theme.

Then, I asked it to redraw the image entirely in one of the themes. It redrew the image 1/3 in the one theme I asked for and 2/3 in a theme I did not ask for. Further prompting did not fix it. At the end of the day, this was a useful exercise and I was able to get some sense of what color scheme would work better for my house, but the level of execution was miles away from the perfection portrayed in demos and by hypester / huckster bloggers and VCs.

Unverified: What Practitioners Post About OCR, Agents, and Tables (idp-software.com)

28 comments