The war against PDFs is heating up

(economist.com)

20 points | by pseudolus 3 hours ago

13 comments

pseudolus 3 hours ago
https://archive.ph/aCleq
barrister 2 hours ago
Seems to be a weak pitch for an Israeli startup called Factify. Their new document type is also closed source, which seems like an obvious showstopper for a ubiquitous global document replacement, especially in today's extremely heated and untrustworthy environment.
No strong argument imo for replacing the pdf.
- g947o 1 hour ago
  It weirdly reminds me of SynthID/C2PA. At the end of the day, they matter very little. People are going to do what they want to do.
  If people want to manage versions, access, etc., they are going to do it right the first time with existing document formats and permission control mechanisms, ranging from "making the document only accessible to certain users" to "have someone read a document in a specific room", which have worked reasonably well.
pavel_lishin 2 hours ago
> Yet Duff Johnson, head of the PDF Association, protector of the format, argues that the fault lies not in the file type but in ourselves. He contends that there is no reason developers cannot build bots that are able to use PDFs. The AI assistant embedded in Acrobat, Adobe’s PDF reader, is designed to do precisely that, notes Leonard Rosenthol, the software firm’s PDF guru.
Designed to, but does it do it well without the problems noted earlier in the article?
- ssl-3 2 hours ago
  Strictly anecdotally, I've had no trouble feeding PDFs to OpenAI's bot.
  The searchable PDFs get searched, and the just-pictures-of-words ones get fed through their (quite good, IMHO) OCR.
  I use it all the time. It's remarkably good for locating the details I need in the poorly-organized ~1,200 page factory manual for my Honda.
  (Well, it's not necessarily organized poorly. It's just designed with the clear intent that it is mostly to serve as a set of repair instructions, and sometimes I don't want repair instructions. Sometimes I want to know how a thing works for my own cognitive benefit instead of how to diagnose and R&R it as a series of steps.)
- cyberax 1 hour ago
  I'm using paperless-ngx for personal document management, and Claude Desktop was able to read and OCR all the PDFs there just fine (through an MCP connector).
  It also was able to parse my tax forms in 3 languages.
maxloh 1 hour ago
For context, here is the startup's website: https://www.factify.com/. The site consists of only two main pages: the landing page and a "careers" section.
Based on the site, the service appears to be little more than a document hosting platform with tracking features, such as monitoring who copied the document and the specific paragraphs they selected. They’ve intentionally omitted a download feature to prevent access to outdated versions, but otherwise, the experience seems no different from an ordinary PDF reader.
There is no mention of a "new standard" on their front page. I suspect they don't actually convert the documents. They likely just convert pages to encrypted images and use client-side rendering for text elements to allow for selection and copying.
g947o 1 hour ago
My biggest gripes:
* you cannot easily view a PDF in dark mode. Solutions do exist, but there are always some limitations
* poor experience reading on mobile devices (mentioned in the article). You can use "Reflow" features provided by Acrobat or similar tools, but they often don't work offline, not to mention that Acrobat is bloated and filled with dark patterns that trick you into buying a subscription
dhosek 2 hours ago
Well, that was a nonsense article. Badly written software has trouble with PDFs, accessibility is an afterthought (which, sadly, is true of most things) and some small group thinks they can invent a better wheel, ignoring the fact that they’d have to do a lot of work to overcome the first mover advantages of HTML and PDF and this comment now has more information than the original article thanks to that clause beginning with “ignoring”.
Gualdrapo 2 hours ago
Reminds me of this, which was posted here on HN a few days ago:
https://scottlocklin.wordpress.com/2023/05/31/djvu-and-its-c...
sghaz 1 hour ago
This looks like a sponsored article. Very poor quality.
cratermoon 2 hours ago
There are PDF files and there are PDF files. Many (most?) PDFs I run into are generated from Microsoft Word or some other MS product with no structure at all. The majority of people who use MS products don't understand or care about structure. The WYSIWYG imperative means lots of markup to describe font size, color, and decoration, to make every section heading look the same without ever designating the text as a section head. The same happens with paragraphs, page breaks, and column flow. The resulting document looks correct enough to the creator. Other people, who have a different version of Word, different fonts, and a thousand other little differences, won't see it correctly. That leads our author to generate a PDF, probably with embedded fonts, to ensure uniform appearance across these thousand little exceptions.
The result is a document with the content mixed up so incomprehensibly with appearance controls as to be both unreadable and without any residue of the underlying intended structure of the document's sections, headers, figures, paragraphs, captions, footnotes, or anything.
And then there are PDF files that are nothing more than a series of images of pages of text. If you're lucky and the scans are clean, a good OCR might be able to recover most of the content.
What I'm saying is, it doesn't matter the tool, if authors don't encode structure and formatting in semantically meaningful ways.
- tpm 2 hours ago
  So what you are actually saying is that there is a market for a tool that will recreate the PDF with a structure based on how the original PDF looks?
  - cratermoon 2 hours ago
    The market has been needing a tool like that for 30 years. A PDF document of the type I describe is like a broken egg. Information is lost between the authoring and rendering, to the extent that it's not clear recreating the original is even possible.
    - pessimizer 2 hours ago
      A typesetter could recreate the document through looking at it, doing some font research, and playing with the kerning for a while. Saying it's not possible to recreate a typeset document that is readable is absurd, no matter how twisted and insane the actual postscript is.
pessimizer 2 hours ago
The war against pdfs is based on AI being too stupid to read them? That's a condemnation of AI, not pdfs. I, a natural intelligence, can easily read pdfs.
- Cheyana 1 hour ago
  Perfect response.
lsbehe 2 hours ago
I'll miss getting documentation as a pile of pictures in a PDF.
ur-whale 2 hours ago
https://archive.is/aCleq