4.9.1 Creating Searchable PDFs

OSCAR stores most incoming documents as PDFs. These are usually generated by scanning or faxing software (see Hylafax) and are plain image PDFs whose text cannot be searched or selected. Open source OCR software can add a text layer to such a PDF so that its text becomes selectable and searchable.

Creating Searchable PDFs from Regular Image PDFs

A searchable but hidden text layer can be added to a scanned or faxed image.

Selecting Text in a Searchable PDF

Figure 1: Example of a PDF from a scanned document to which a text layer has been added and selected

Document Version History

  • v1.0 – initial public release to oscarmanual.org on July 5, 2013
The document is copyright by Peter Hutten-Czapski © 2013 under the Creative Commons Attribution-Share Alike 3.0 Unported License

Contents

    1. Installing a Script

    Installation Instructions

    Here we will use Ghostscript, Cuneiform and hocr2pdf. Acceptable results are available even with the versions packaged for Ubuntu Lucid Linux 10.04 LTS:

    • GPL Ghostscript 8.71 (2010)
    • Cuneiform for Linux 0.7.0 (2010)
    • hocr2pdf version 0.7.4 (2009)

    I suggest using the latest stable versions of your preferred image conversion, OCR and PDF creation software, and testing your settings before putting them into production. If the results are suboptimal, try an alternate OCR engine such as Tesseract v3.0 or newer.
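
As a rough sketch of that substitution, assuming Tesseract 3.x and hocr2pdf are installed, the OCR step for a single page image might look like the function below. It is meant to stand in for the cuneiform call in the script further down; the function name ocr_page_tesseract is made up for this illustration, and the handling of the output extension assumes that older 3.0x releases write .html while 3.03 and later write .hocr.

```shell
#!/bin/bash
# Hypothetical replacement for the cuneiform step, using Tesseract's hocr config.
# Usage: ocr_page_tesseract /path/to/page-0001.tif
ocr_page_tesseract() {
    local page="$1"
    local base="${page%.tif}"

    # "hocr" is the Tesseract config that switches output to hOCR
    tesseract "$page" "$base" hocr

    # assumption: newer releases name the result .hocr, older ones .html
    if [ -f "$base.hocr" ]; then
        mv -- "$base.hocr" "$base.html"
    fi

    # overlay the recognised text on the page image, as in the cuneiform version
    hocr2pdf -i "$page" -o "$base.pdf" < "$base.html"
}
```

Compare the output of both engines on a few representative faxes before settling on one.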

    Open a terminal and type the following to install the versions packaged for your release of Ubuntu:

    sudo apt-get install cuneiform ghostscript exactimage
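
Before going further it may be worth confirming that the three commands the script relies on are actually on the PATH. The following is only a minimal sketch; missing_tools is a helper name invented here, not something shipped with OSCAR or any of the packages.

```shell
#!/bin/bash
# Print the names of any given commands that cannot be found on the PATH.
# missing_tools is a hypothetical helper used only for this check.
missing_tools() {
    local tool missing=""
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || missing="$missing $tool"
    done
    echo "${missing# }"
}

missing="$(missing_tools gs cuneiform hocr2pdf)"
if [ -n "$missing" ]; then
    echo "Not installed: $missing"
else
    echo "All required tools found"
fi
```
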
    Open a text editor (such as vi, nano or gedit) and paste the following into it.

    #!/bin/bash
    # Run OCR on a multi-page PDF file and create a new PDF with
    # the extracted text (if any) in a hidden layer.
    # Requires cuneiform, hocr2pdf, gs.
    # Usage: ./ocrpdf.sh input.pdf output.pdf
    
    set -e
    
    input="$1"
    output="$2"
    
    tmpdir="$(mktemp -d)"
    # remove the temporary directory even if a step fails
    trap 'rm -rf -- "$tmpdir"' EXIT
    
    # extract images of the pages as TIFF files
    # note: resolution is hard-coded; do not go below fax resolution of 150dpi
    gs -sDEVICE=tiffg4 -r300x300 -sOutputFile="$tmpdir/page-%04d.tif" -dNOPAUSE -dBATCH -- "$input"
    
    # OCR each TIFF image into annotated HTML (hOCR) and then convert it into a PDF
    for page in "$tmpdir"/page-*.tif
    do
        base="${page%.tif}"
        cuneiform -f hocr -o "$base.html" "$page"
        hocr2pdf -i "$page" -o "$base.pdf" < "$base.html"
    done
    
    # combine each of the pages into one PDF
    gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile="$output" "$tmpdir"/page-*.pdf
    
    # cleanup
    rm -rf -- "$tmpdir"

    Save the file as ocrpdf.sh and make it executable with chmod 755 ocrpdf.sh
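
Once ocrpdf.sh has produced an output file, one way to confirm that a text layer is really present is to try extracting text from it. The sketch below assumes pdftotext from the poppler-utils package is installed, which is not one of the tools listed above; has_text_layer is a name invented for this example.

```shell
#!/bin/bash
# Hypothetical check: does a PDF contain any extractable text?
# Requires pdftotext (poppler-utils), an assumption beyond the tools above.
has_text_layer() {
    # succeed if at least one letter or digit can be extracted
    pdftotext "$1" - 2>/dev/null | grep -q '[[:alnum:]]'
}

# When a file name is given on the command line, report on it.
if [ -n "$1" ]; then
    if has_text_layer "$1"; then
        echo "$1: text layer found"
    else
        echo "$1: no extractable text"
    fi
fi
```

A page that is blank or that the OCR engine could not read will also report no extractable text, so test with a document you know contains clear print.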

    Set up a cron job to take the scanned files from the directory where they arrive and process them into the directory from which you load files into the Inbox.
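
A wrapper for that cron job could be sketched roughly as follows. All three paths are placeholders, as is the wrapper's own name, and the behaviour on failure (leaving the original scan in place) is just one reasonable choice; adjust both to your site.

```shell
#!/bin/bash
# Sketch of a cron wrapper around ocrpdf.sh. The paths below are
# placeholders (assumptions), not locations defined by OSCAR.
incoming="${INCOMING_DIR:-/path/to/scans/incoming}"
outgoing="${OUTGOING_DIR:-/path/to/oscar/incoming}"
ocr="${OCR_SCRIPT:-/usr/local/bin/ocrpdf.sh}"

process_incoming() {
    local src name
    for src in "$incoming"/*.pdf; do
        [ -e "$src" ] || continue      # nothing waiting
        name="$(basename "$src")"
        # write under a temporary name, then rename, so a half-written
        # file never appears under its final name
        if "$ocr" "$src" "$outgoing/.$name.tmp"; then
            mv -- "$outgoing/.$name.tmp" "$outgoing/$name"
            rm -- "$src"
        else
            rm -f -- "$outgoing/.$name.tmp"
            echo "OCR failed for $src, left in place" >&2
        fi
    done
}

process_incoming
```

If the wrapper were saved as /usr/local/bin/ocr-incoming.sh (a hypothetical name), a crontab entry such as the following would run it every five minutes, with flock preventing overlapping runs:

    */5 * * * * flock -n /tmp/ocr-incoming.lock /usr/local/bin/ocr-incoming.sh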
