Automatically remove blank pages from pdf

2012-04-05

I found myself having to generate PDFs from spreadsheets automatically and then strip the blank pages from them. I found an unfinished (and badly formatted) solution here, so I took a few minutes to finish it:

non-blank-page-ranges.py

#! /usr/bin/python3
"""Read text input and print non-blank page ranges
(pages should be separated by ^L pagebreaks)"""

import sys

# find non-blank pages
page = 1
blank = True
nonblanks = []
for line in sys.stdin:
  for char in line:
    if char == "\x0c": # ^L, pagebreak
      if not blank:
        nonblanks.append(page)
      # new page
      page += 1
      blank = True
    else:
      blank = False

# count a final page that isn't terminated by a pagebreak
if not blank:
  nonblanks.append(page)

# exit if no non-blank pages found
if not nonblanks:
  sys.exit(1)

# print ranges of non-blank pages in format used by pdftk
# (e.g. "1-3 5-8 10-10")
ranges = []
nonblanks = sorted(nonblanks)
first_in_range = 0

for i in range(1, len(nonblanks)):
  # if the page increased by more than 1 (i.e. at least one
  # page got skipped), append the current range and start a
  # new one
  if nonblanks[i] > nonblanks[i-1] + 1:
    ranges.append("{}-{}".format(nonblanks[first_in_range],
                                 nonblanks[i-1]))
    first_in_range = i

# append the last range
ranges.append("{}-{}".format(nonblanks[first_in_range],
                             nonblanks[-1]))
print(" ".join(ranges))
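The range-grouping step at the end of the script can also be sketched as a standalone function, which makes it easier to check by hand. This is a minimal sketch, assuming the input is a list of 1-based page numbers; the name page_ranges is hypothetical and not part of the script above.

```python
def page_ranges(pages):
    """Group page numbers into pdftk-style ranges, e.g. "1-3 5-8 10-10".

    page_ranges is a hypothetical helper name, not from the original script.
    """
    ranges = []
    start = prev = None
    for page in sorted(set(pages)):
        if start is None:
            # first page seen: open the first range
            start = prev = page
        elif page == prev + 1:
            # consecutive page: extend the current range
            prev = page
        else:
            # at least one page was skipped: close the range, open a new one
            ranges.append("{}-{}".format(start, prev))
            start = prev = page
    if start is not None:
        # close the last open range
        ranges.append("{}-{}".format(start, prev))
    return " ".join(ranges)


print(page_ranges([1, 2, 3, 5, 6, 7, 8, 10]))  # prints "1-3 5-8 10-10"
```

A single page still comes out as a range ("4-4"), which matches the format the script prints.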

pdf-remove-blank-pages

#! /bin/bash

for filename in "$@"; do
  # get non-blank ranges
  ranges="$(pdftotext "$filename" - | \
    "$HOME/.bin/non-blank-page-ranges.py")"

  if [ -z "$ranges" ]; then
    echo "no non-blank pages found in $filename" >&2
    continue
  fi

  # rename pdf
  if [ -e "${filename}.old" ]; then
    echo "file exists: ${filename}.old" >&2
    continue
  fi

  mv -n "$filename" "${filename}.old"

  if [ -e "$filename" ] || [ ! -e "${filename}.old" ]; then
    echo "couldn't rename file $filename" >&2
    continue
  fi

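  # create new pdf with non-blank pages only
  # ($ranges is deliberately unquoted so each range becomes its own argument)
  if ! pdftk "${filename}.old" cat $ranges output "$filename"; then
    echo "pdftk failed for $filename; original kept as ${filename}.old" >&2
  fi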
done
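For anyone who would rather drive pdftk from Python, the rename-then-filter step of the shell script could look roughly like the sketch below. The function names pdftk_command and remove_blank_pages are hypothetical, and the sketch assumes pdftk is on the PATH and that ranges is the string printed by non-blank-page-ranges.py.

```python
import os
import subprocess


def pdftk_command(old_path, new_path, ranges):
    """Build the pdftk argument list used by the shell script.

    ranges is a space-separated string such as "1-3 5-8"; each range
    has to be passed to pdftk as its own argument.
    """
    return ["pdftk", old_path, "cat", *ranges.split(), "output", new_path]


def remove_blank_pages(filename, ranges):
    """Rename filename to filename.old, then write the filtered PDF.

    Hypothetical helper; requires pdftk to be installed.
    """
    old_path = filename + ".old"
    if os.path.exists(old_path):
        # same safety check as the shell script: never clobber a backup
        raise FileExistsError(old_path)
    os.rename(filename, old_path)
    # check=True raises CalledProcessError if pdftk exits non-zero
    subprocess.run(pdftk_command(old_path, filename, ranges), check=True)


print(pdftk_command("A.pdf.old", "A.pdf", "1-3 5-8"))
# prints ['pdftk', 'A.pdf.old', 'cat', '1-3', '5-8', 'output', 'A.pdf']
```

Splitting the ranges string in Python plays the same role as leaving the variable unquoted in the shell version.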

The export to pdf macro

The automatic PDF export comes from here, with this fix: the first argument of executeDispatch becomes document.getCurrentController().getFrame() instead of just document.

Sub exportPDF(xlsFile)

  xlsURL = ConvertToURL(xlsFile)

  ' Open the document.
  xlsDoc = StarDesktop.loadComponentFromURL(xlsURL, "_blank", 0, Array(_
             MakePropertyValue("Hidden", True)_
             )_
           )

  ' Export PDF
  ' http://user.services.openoffice.org/en/forum/viewtopic.php?f=9&t=31957#p146050
  ' corereflection error: http://www.oooforum.org/forum/viewtopic.phtml?t=27661
  pdfFile = Left(xlsFile, Len(xlsFile)-3) + "pdf"
  pdfURL = ConvertToURL(pdfFile)

  createUnoService("com.sun.star.frame.DispatchHelper").executeDispatch(_
    xlsDoc.getCurrentController().getFrame(), ".uno:ExportToPDF", "", 0, Array(_
      MakePropertyValue("URL", pdfURL),_
      MakePropertyValue("FilterName", "calc_pdf_Export"),_
      MakePropertyValue("FilterData",_
        Array(_
          Array("UseLosslessCompression",0,true,com.sun.star.beans.PropertyState.DIRECT_VALUE),_
          Array("Quality",0,90,com.sun.star.beans.PropertyState.DIRECT_VALUE),_
          Array("ReduceImageResolution",0,false,com.sun.star.beans.PropertyState.DIRECT_VALUE),_
          Array("MaxImageResolution",0,300,com.sun.star.beans.PropertyState.DIRECT_VALUE),_
          Array("UseTaggedPDF",0,false,com.sun.star.beans.PropertyState.DIRECT_VALUE),_
          Array("SelectPdfVersion",0,0,com.sun.star.beans.PropertyState.DIRECT_VALUE),_
          Array("ExportNotes",0,false,com.sun.star.beans.PropertyState.DIRECT_VALUE),_
          Array("ExportBookmarks",0,true,com.sun.star.beans.PropertyState.DIRECT_VALUE),_
          Array("OpenBookmarkLevels",0,-1,com.sun.star.beans.PropertyState.DIRECT_VALUE),_
          Array("UseTransitionEffects",0,true,com.sun.star.beans.PropertyState.DIRECT_VALUE),_
          Array("IsSkipEmptyPages",0,true,com.sun.star.beans.PropertyState.DIRECT_VALUE),_
          Array("IsAddStream",0,false,com.sun.star.beans.PropertyState.DIRECT_VALUE),_
          Array("FormsType",0,0,com.sun.star.beans.PropertyState.DIRECT_VALUE),_
          Array("ExportFormFields",0,true,com.sun.star.beans.PropertyState.DIRECT_VALUE),_
          Array("HideViewerToolbar",0,false,com.sun.star.beans.PropertyState.DIRECT_VALUE),_
          Array("HideViewerMenubar",0,false,com.sun.star.beans.PropertyState.DIRECT_VALUE),_
          Array("HideViewerWindowControls",0,false,com.sun.star.beans.PropertyState.DIRECT_VALUE),_
          Array("ResizeWindowToInitialPage",0,false,com.sun.star.beans.PropertyState.DIRECT_VALUE),_
          Array("CenterWindow",0,false,com.sun.star.beans.PropertyState.DIRECT_VALUE),_
          Array("OpenInFullScreenMode",0,false,com.sun.star.beans.PropertyState.DIRECT_VALUE),_
          Array("DisplayPDFDocumentTitle",0,true,com.sun.star.beans.PropertyState.DIRECT_VALUE),_
          Array("InitialView",0,0,com.sun.star.beans.PropertyState.DIRECT_VALUE),_
          Array("Magnification",0,0,com.sun.star.beans.PropertyState.DIRECT_VALUE),_
          Array("Zoom",0,100,com.sun.star.beans.PropertyState.DIRECT_VALUE),_
          Array("PageLayout",0,0,com.sun.star.beans.PropertyState.DIRECT_VALUE),_
          Array("FirstPageOnLeft",0,false,com.sun.star.beans.PropertyState.DIRECT_VALUE),_
          Array("InitialPage",0,1,com.sun.star.beans.PropertyState.DIRECT_VALUE),_
          Array("Printing",0,2,com.sun.star.beans.PropertyState.DIRECT_VALUE),_
          Array("Changes",0,4,com.sun.star.beans.PropertyState.DIRECT_VALUE),_
          Array("EnableCopyingOfContent",0,true,com.sun.star.beans.PropertyState.DIRECT_VALUE),_
          Array("EnableTextAccessForAccessibilityTools",0,true,com.sun.star.beans.PropertyState.DIRECT_VALUE),_
          Array("ExportLinksRelativeFsys",0,false,com.sun.star.beans.PropertyState.DIRECT_VALUE),_
          Array("PDFViewSelection",0,0,com.sun.star.beans.PropertyState.DIRECT_VALUE),_
          Array("ConvertOOoTargetToPDFTarget",0,false,com.sun.star.beans.PropertyState.DIRECT_VALUE),_
          Array("ExportBookmarksToPDFDestination",0,false,com.sun.star.beans.PropertyState.DIRECT_VALUE),_
          Array("_OkButtonString",0,"",com.sun.star.beans.PropertyState.DIRECT_VALUE),_
          Array("EncryptFile",0,false,com.sun.star.beans.PropertyState.DIRECT_VALUE),_
          Array("DocumentOpenPassword",0,"",com.sun.star.beans.PropertyState.DIRECT_VALUE),_
          Array("RestrictPermissions",0,false,com.sun.star.beans.PropertyState.DIRECT_VALUE),_
          Array("PermissionPassword",0,"",com.sun.star.beans.PropertyState.DIRECT_VALUE),_
          Array("",0,,com.sun.star.beans.PropertyState.DIRECT_VALUE)_
          )_
        )_
      )_
    )

  xlsDoc.Close(True)
End Sub
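
Note that the macro builds the PDF name with Left(xlsFile, Len(xlsFile)-3) + "pdf", which assumes a three-character extension such as .xls. As a minimal sketch of the same idea without that assumption (in Python rather than Basic, with the hypothetical name pdf_path_for), the extension can be swapped whatever its length:

```python
from pathlib import PurePosixPath


def pdf_path_for(spreadsheet_path):
    """Return the spreadsheet path with its extension replaced by .pdf.

    pdf_path_for is a hypothetical helper, not part of the macro.
    """
    return str(PurePosixPath(spreadsheet_path).with_suffix(".pdf"))


print(pdf_path_for("/tmp/report.xls"))   # prints "/tmp/report.pdf"
print(pdf_path_for("/tmp/report.xlsx"))  # prints "/tmp/report.pdf"
```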

4 Comments
  1. ThSGM

    Hi,

    Two quick questions: does this work for removing blank pages from scanned documents?

    Also, I’m a bit ignorant of Unix; I understand that the .py file should be placed in $HOME/.bin/, but what do I do with pdf-remove-blank-pages? Can this be made into a script and placed in /bin as well? How is it called (i.e. what is the command to call pdf-remove-blank-pages with input/output filenames)?

    Thanks!

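    • Probably not. It relies on pdftotext outputting no characters for that page. pdftotext only reads a PDF’s text layer: a plain scan has none, so every page would look blank, and in a scan that has been through OCR the noise on a blank page may well have been recognized as random characters.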
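
To make the blankness test concrete, here is a minimal sketch of the check applied to pdftotext-style text, assuming each page ends with a ^L (form feed). The name blank_pages is hypothetical, and unlike the script in the post this variant also treats whitespace-only pages as blank.

```python
def blank_pages(text):
    """Return the 1-based numbers of pages with no visible characters.

    Assumes pages are terminated by form feeds (^L).
    Hypothetical helper; whitespace-only pages count as blank here,
    which is a deliberate difference from the script in the post.
    """
    pages = text.split("\x0c")
    # A trailing form feed leaves one empty string after the last page.
    if pages and pages[-1] == "":
        pages.pop()
    return [n for n, page in enumerate(pages, start=1) if not page.strip()]


clean = "first page\x0c\x0cthird page\x0c"
noisy = "first page\x0c.,'\x0cthird page\x0c"  # stray characters on page 2
print(blank_pages(clean))  # prints [2]
print(blank_pages(noisy))  # prints []
```

A few stray characters are enough for a page to stop counting as blank, which is why noisy input defeats this approach.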

      The pdf-remove-blank-pages is a function declared in my zsh configuration file (bash is the default shell on most Linux systems; zsh is an arguably better, though slower, interactive shell that isn’t really appropriate for scripting). When it comes to small scripts, unless I know for sure I’m running something on a computer without zsh (or my config files), I tend to use zsh functions instead of bash scripts just so I don’t have to type so many quotation marks. :P

      Here’s a bash version you can save to a script (I put it in the post as well):

      #! /bin/bash
      
      for filename in "$@"; do
          ranges="$(pdftotext "$filename" - | "$HOME/.bin/non-blank-page-ranges.py")"
          mv "$filename" "$filename.old" && pdftk "$filename.old" cat $ranges output "$filename"
      done

      You can place it anywhere you want (the same goes for the Python script; just remember to change its path in the first line of the for loop). I just use ~/.bin/ to keep all my local no-root-required scripts in one place. You can call it by its full path, followed by the PDFs as arguments. For example (I’m using the dollar sign to represent your command prompt; you shouldn’t type it in):

      $ ~/.bin/pdf-remove-blank-pages A.pdf B.pdf C.pdf

      Note that HOME is an environment variable that contains your home folder’s path. In bash (and zsh), the tilde is a shortcut for your home folder as well, but it turns literal when quoted (i.e. “~/file.txt” won’t work, while “$HOME/file.txt” will), so it’s better practice to use $HOME instead when writing scripts (in sh, the dollar sign followed by a variable name is expanded to that variable’s contents; without it, the variable name will just be considered literal text).

      Also note that you have to make both scripts executable before you can run them (you can also do this in whatever GUI file manager you have available):

      $ chmod +x ~/.bin/pdf-remove-blank-pages ~/.bin/non-blank-page-ranges.py

      Hope that helps.

      p.s. Sorry, just noticed you asked about passing input and output filenames. This script treats every file you pass as input: it renames each one by adding “.old” to its name and saves the output under the original name. So, for example, if you run it on one file named “myfile.pdf”, you will call it with:

      $ ~/.bin/pdf-remove-blank-pages myfile.pdf

      And it will rename the file to “myfile.pdf.old” and save the new pdf to “myfile.pdf”.

      • Annamalai

        It is renaming the file to “myfile.pdf.old”. But not saving the new pdf. Any good guess on this error?

      • I would imagine either pdftk is failing, or mv is returning a non-zero status, so pdftk isn’t even being called. I’d expect either of them to output an error message if there was a problem, though. I added some error handling in the scripts above, please see if those help you narrow down where the issue is.
