Suppose you have a scanned document. Scans are flawed, ugly, bulky, and sometimes contain sensitive info. How do you produce a clean vector PDF from that?

In the FOSS world, we have #Tesseract to OCR the PDF. This just generates the text for searchability but retains the ugliness and bulk. It would be crazy painful to extract the text and then use LaTeX to manually reconstruct the layout.

Windows has a tool named ABBYY FineReader Pro Edition. It performs the OCR, then also finds text and images and puts numbered rectangles around them. It finds a matching font, style, and weight for the text and gives users a chance to delete rectangles at will before constructing a new vector PDF. I took the scan of a 1-page letter and dithered it to bitonal using FOSS tools, yielding a 43 kB file. I then processed it with ABBYY, which produced a much cleaner vector PDF weighing in at 12 kB. Most importantly, ABBYY makes it easy to omit sensitive blocks like names, addresses, and signatures, or overwrite them with “REDACTED”.

Is there anything like this in the FOSS world? I suppose not, but please correct me if I’m wrong.

I would love it if a tool could take an OCR’d PDF and produce a LaTeX doc where every object simply amounts to a \put command with coordinates in a picture environment, so it could be edited in LaTeX.
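To make the idea concrete, here is a rough sketch of what the last step of such a tool might look like. It is not an existing program: `Block`, `escape_latex`, and `blocks_to_picture` are names I made up, and it assumes the OCR stage has already handed over text blocks with positions in points, measured from the top-left corner of an A4 page (595 x 842 pt). Getting those blocks out of a real OCR'd PDF is the hard part and is not shown.

```python
from dataclasses import dataclass

# Characters LaTeX treats specially, mapped to safe replacements.
_LATEX_ESCAPES = {
    "\\": r"\textbackslash{}",
    "&": r"\&", "%": r"\%", "$": r"\$", "#": r"\#", "_": r"\_",
    "{": r"\{", "}": r"\}",
    "~": r"\textasciitilde{}", "^": r"\textasciicircum{}",
}


def escape_latex(text: str) -> str:
    """Escape LaTeX special characters in a single pass."""
    return "".join(_LATEX_ESCAPES.get(ch, ch) for ch in text)


@dataclass
class Block:
    """Hypothetical OCR result: one piece of text and where it sits."""
    x: float   # left edge, in points from the left of the page
    y: float   # baseline, in points from the TOP of the page (assumption)
    text: str


def blocks_to_picture(blocks, page_width=595.0, page_height=842.0, redact=()):
    """Emit a LaTeX picture environment with one \\put per block.

    `redact` holds the indices of blocks whose text is replaced
    with the word REDACTED instead of being copied over.
    """
    lines = [
        r"\setlength{\unitlength}{1pt}",
        rf"\begin{{picture}}({page_width:g},{page_height:g})",
    ]
    for i, block in enumerate(blocks):
        text = "REDACTED" if i in redact else escape_latex(block.text)
        # picture's origin is bottom-left, so flip the vertical axis.
        y = page_height - block.y
        lines.append(rf"  \put({block.x:g},{y:g}){{{text}}}")
    lines.append(r"\end{picture}")
    return "\n".join(lines)


if __name__ == "__main__":
    letter = [Block(72, 72, "Dear Sir & Madam"), Block(72, 100, "Jane Doe")]
    print(blocks_to_picture(letter, redact={1}))
```

Because every block becomes one line of text, deleting or redacting a block is a one-line edit, and the result diffs cleanly. Fonts, sizes, and images are ignored here; a real tool would need to carry those along too.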

Potential use cases:

  1. Whistle-blowing. E.g. someone wants to leak a document but needs to strip it of forensic artifacts and certain sensitive data.
  2. Activism. E.g. you complain to a public administration about some injustice, and they give a shitty anti-human reply. Simply publishing the response can help dish out some shame and embarrassment. But you probably don’t want your personal details like your address exposed, and you probably have to scrub the name of the administrator to respect their GDPR rights.
  3. Digitizing paper books. You have a bound book, but space and weight are scarce when you travel. You need an electronic format, perhaps even just as a backup. A raster scan is bulky and messy, and does not scale well on different screens.
  4. Version control. Binary blobs are not great for a version control system. Converting raster scans to quasi-text vector files makes edits possible and lets you track the history of changes.

Update

FOSS libraries apparently already exist.