Interesting to note how similar this seems to what happened with Benj Edwards at Ars Technica. AI was used to extract or summarize information, and quotes found in the summary were then used as source material for the final writing and never double checked against the actual source.
I’ve run into a similar problem myself - working with a big transcript, I asked an AI to pull out passages that related to a certain topic, and only because of oddities in the timestamps extracted did I realize that most of the quotes did not exist in the source at all.
It might be a solved problem in the sense that a solution exists, but not in the sense that it no longer happens with the tools most people would expect to handle the task.
Looking at the media ecosystem at large gives me a case of gallows humor.
In some sections of the ecosystem, firms still penalize journalists for errors. In other sections, checking reduces the velocity of attention-grabbing headlines. The difference in treatment is… farcical.
We need more good journalists, and more good journalism - but we no longer have ways to subsidize such work. Ads / classifieds are dead, and revenue accrues to only a few.
We can't square this circle. It's why they're all A/B testing headlines (resulting in the most deranged partisan clickbait), killed off their (too expensive) editorial desks (especially for international news), rely solely on (barely) rewriting AP, Reuters, and PRNewswire copy, and fill their sites with opinion rather than factual reporting, all while arguing for government handouts to the sector.
Out of curiosity, if you asked for the same text extraction multiple times, each inside a fresh context, would it likely fabricate different quotes each time? And if so, (a) might that be a procedure we train humans to perform to better understand LLM unreliability, and (b) could we instrumentalize the behavior, measuring answer overlap with non-LLM statistical tools?
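The overlap measurement in (b) needs no LLM machinery at all: treat each run's extracted quotes as a set and compare the sets pairwise with Jaccard similarity. A minimal sketch, where the three quote sets below are hypothetical stand-ins for the output of three fresh-context extraction runs:

```python
from itertools import combinations

def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity of two quote sets (1.0 = identical sets)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Hypothetical quote sets, standing in for three fresh-context runs.
runs = [
    {"the budget was cut in march", "we never approved that"},
    {"the budget was cut in march", "it was a surprise to everyone"},
    {"we never approved that", "it was a surprise to everyone"},
]

# Mean pairwise similarity across all runs; consistently low overlap
# would suggest the quotes are being invented rather than extracted.
scores = [jaccard(a, b) for a, b in combinations(runs, 2)]
mean_overlap = sum(scores) / len(scores)
```

With genuinely extracted quotes you would expect `mean_overlap` near 1.0; heavy fabrication should drag it toward 0.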
Also, quote-presence testing/linking against source would seem to be a trivial layer to build on a chat interface, no LLM required. Just highlight and link the longest common strings.
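That layer can be almost trivially small: score each quote by the longest run of characters it shares verbatim with the source. A minimal sketch using Python's standard-library difflib; the source text and quotes are made up for illustration:

```python
from difflib import SequenceMatcher

def quote_support(quote: str, source: str) -> float:
    """Fraction of the quote covered by its longest verbatim match in the source."""
    q, s = quote.lower(), source.lower()
    if not q:
        return 0.0
    match = SequenceMatcher(None, q, s, autojunk=False).find_longest_match(
        0, len(q), 0, len(s)
    )
    return match.size / len(q)

source = "The committee met on Tuesday and voted to delay the rollout until spring."
real_quote = "voted to delay the rollout"
fake_quote = "unanimously endorsed the plan"
```

A quote present verbatim scores 1.0; a fabricated one scores near 0, and anything in between flags a paraphrase worth checking by hand. A chat UI could highlight and link the matched span directly.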
e.g.: https://docs.cloud.google.com/vertex-ai/generative-ai/docs/g...
I have no idea how we square this circle.