A very basic shell script that I'm using to gauge the quality of already OCRed PDFs. It takes the filename of a PDF as a parameter and prints the total word count, the number of words aspell does not recognize, and the percentage of unknown words. A good PDF (exported straight from the original source) likely has an unknown rate of around 5%, while a poorly OCRed scan may come in at 20% or higher.
Requires pdftotext and aspell.
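For illustration, here is a minimal sketch of how such a check could be put together. It is not the original script. The function names `unknown_pct` and `ocr_quality` are made up for this example, and it assumes aspell has a default dictionary installed for the document's language.

```shell
#!/bin/sh
# Sketch of an OCR quality check; an illustration, not the original script.
# Assumes pdftotext and aspell are on PATH and aspell has a default dictionary.

# Print the percentage of unknown words to one decimal place.
# Arguments: total word count, unknown word count.
unknown_pct() {
    awk -v total="$1" -v unknown="$2" \
        'BEGIN { if (total > 0) printf "%.1f\n", 100 * unknown / total; else print "0.0" }'
}

# Print word statistics for one PDF. Argument: path to the PDF.
ocr_quality() {
    pdf=$1
    # The "-" output name makes pdftotext write the text to stdout.
    text=$(pdftotext "$pdf" -) || return 1
    # Turn every run of non-letter characters into a newline: one word per line.
    words=$(printf '%s\n' "$text" | tr -cs "[:alpha:]'" '\n' | sed '/^$/d')
    total=$(printf '%s\n' "$words" | grep -c .)
    # "aspell list" reads stdin and prints each unrecognized word on its own line.
    unknown=$(printf '%s\n' "$words" | aspell list | grep -c .)
    printf 'Total words:   %s\n' "$total"
    printf 'Unknown words: %s\n' "$unknown"
    printf 'Unknown rate:  %s%%\n' "$(unknown_pct "$total" "$unknown")"
}
```

With the functions loaded, `ocr_quality scan.pdf` prints the three figures for a (hypothetical) file `scan.pdf`. Repeated words are counted each time they occur, so the rate reflects how much of the running text is unrecognized rather than how many distinct words are.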