Remove images from PDF files

Some time ago I ran into some Service Manuals with unusual background images.

As much as I found that amusing, I wanted to clean up the files (in a semi-automatic way) to make the technical contents more readable.

The files are in the PDF format, and at a lower level a PDF is basically a hierarchy of objects, with each object having a unique numeric ID.

Images are usually discrete objects in a PDF file, so by manipulating the hierarchy to remove both the objects representing the images and the references to those objects, we get a document without the unwanted images.
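To make the idea concrete, here is a toy model in plain Python. It is only a sketch and not the actual pdfstrip code: the `Ref` type, the dictionary layout and the function names are made up for illustration, and a real PDF would be read through a library such as pdfrw rather than written out as dicts.

```python
from collections import namedtuple

# Hypothetical stand-in for an indirect reference such as "4 0 R".
Ref = namedtuple("Ref", "obj_id")


def is_image(obj):
    """Image XObjects carry /Subtype /Image in their dictionary."""
    return isinstance(obj, dict) and obj.get("Subtype") == "Image"


def prune(value, doomed):
    """Copy `value`, leaving out every Ref that points at a doomed object."""
    def points_at_doomed(item):
        return isinstance(item, Ref) and item.obj_id in doomed

    if isinstance(value, dict):
        return {key: prune(item, doomed)
                for key, item in value.items() if not points_at_doomed(item)}
    if isinstance(value, list):
        return [prune(item, doomed)
                for item in value if not points_at_doomed(item)]
    return value


def strip_images(objects):
    """Drop image objects and every reference to them."""
    doomed = {oid for oid, obj in objects.items() if is_image(obj)}
    return {oid: prune(obj, doomed)
            for oid, obj in objects.items() if oid not in doomed}


# A tiny made-up document: object ID -> object.
objects = {
    1: {"Type": "Catalog", "Pages": Ref(2)},
    2: {"Type": "Pages", "Kids": [Ref(3)]},
    3: {"Type": "Page",
        "Resources": {"XObject": {"Im0": Ref(4)}, "Font": {"F1": Ref(5)}}},
    4: {"Type": "XObject", "Subtype": "Image"},
    5: {"Type": "Font", "Subtype": "Type1"},
}

cleaned = strip_images(objects)
print(sorted(cleaned))
print(cleaned[3]["Resources"])
```

In a real file the page content stream would normally also contain a `/Im0 Do` operator that paints the image, and that reference has to go as well; the toy model leaves content streams out.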

I put together a Proof-of-Concept called pdfstrip (pun intended, you'll get it if you look at the amusing files from above).

pdfstrip depends on the pdfrw Python library, which is available in Debian as python3-pdfrw.

Check out the README file for an example of how to use pdfstrip.

BTW, if you want to know more about the internals of PDF files, make sure to check out these interesting talks by Ange Albertini.

Limitations

For now, pdfstrip cannot strip inline images. They are, however, quite easy to spot in the PDF source: they are delimited by the markers BI and EI, and there is always an ID marker between the two. Removing the source code between the markers by hand usually works, but it is a brute-force approach.
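The same brute-force removal can be scripted. The following is a minimal sketch, not part of pdfstrip: the function name is hypothetical, and it assumes the content stream has already been decompressed into a bytes object. It shares the weakness of the manual approach, since raw image data (or a text string) that happens to contain a whitespace-delimited EI or BI would fool the pattern.

```python
import re

# BI <key/value pairs> ID <raw image data> EI, each marker set off by whitespace.
INLINE_IMAGE = re.compile(rb"(?<!\S)BI\s.*?\sID\s.*?\sEI(?!\S)", re.DOTALL)


def strip_inline_images(content):
    """Return (cleaned_stream, number_of_inline_images_removed)."""
    return INLINE_IMAGE.subn(b"", content)


# A made-up, uncompressed content stream with one inline image.
stream = (
    b"q 100 0 0 100 0 0 cm\n"
    b"BI /W 2 /H 2 /CS /G /BPC 8 ID \x00\xff\x00\xff EI\n"
    b"Q\n"
    b"BT /F1 12 Tf (Hello) Tj ET\n"
)

cleaned, removed = strip_inline_images(stream)
print(removed)
print(cleaned)
```

Working on bytes rather than str matters here, because the data between ID and EI is arbitrary binary.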

Editing PDFs with Vim

When editing a PDF file with a text editor, it is advisable to edit it in binary mode. In Vim this can be achieved by adding the following snippet to the ~/.vimrc file:

autocmd BufReadPre,BufNewFile *.pdf            let &bin=1 | set display=uhex

It is also useful to show the cursor offset in bytes, which makes it easier to follow the xref table. This can be done with a snippet like the following when setting the Vim statusline:

function! ByteOffset()
  return line2byte(line('.')) + col('.') - 1
endfunction

" show offset in bytes, when in binary mode.
" note: this count is 1-based (like %o), while xref offsets are 0-based.
" apparently we can't just use %o if we want the field to be conditional...
set statusline+=%{(has('byte_offset')&&getbufvar(bufnr('%'),'&bin'))?'[Offset:'.ByteOffset().']':''}
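Since hand edits shift everything that follows them, it can help to check afterwards which xref entries went stale. Here is a rough sketch of such a check in Python. All function names are made up, and it assumes a classic (uncompressed) xref table and that no stream data happens to look like an object header; files using xref streams would need a real parser.

```python
import re

# A line that starts an indirect object, e.g. b"12 0 obj".
OBJ_HEADER = re.compile(rb"(?:^|(?<=[\r\n]))(\d+)\s+(\d+)\s+obj\b")
XREF_KEYWORD = re.compile(rb"(?:^|(?<=[\r\n]))xref\s+")
SUBSECTION = re.compile(rb"(\d+) (\d+)\s+")
XREF_ENTRY = re.compile(rb"(\d{10}) (\d{5}) ([nf])\s+")


def actual_offsets(data):
    """Map object number -> 0-based byte offset of its 'N G obj' header."""
    return {int(m.group(1)): m.start(1) for m in OBJ_HEADER.finditer(data)}


def xref_offsets(data):
    """Map object number -> offset recorded in the last classic xref table."""
    tables = list(XREF_KEYWORD.finditer(data))
    if not tables:
        raise ValueError("no classic xref table found")
    pos = tables[-1].end()
    recorded = {}
    while True:
        header = SUBSECTION.match(data, pos)
        if header is None:  # reached the trailer
            break
        first, count = int(header.group(1)), int(header.group(2))
        pos = header.end()
        for number in range(first, first + count):
            entry = XREF_ENTRY.match(data, pos)
            if entry is None:
                raise ValueError("malformed xref entry")
            if entry.group(3) == b"n":  # skip free ('f') entries
                recorded[number] = int(entry.group(1))
            pos = entry.end()
    return recorded


def stale_entries(data):
    """Return {object number: (recorded, actual)} where the two disagree."""
    actual = actual_offsets(data)
    return {number: (offset, actual.get(number))
            for number, offset in xref_offsets(data).items()
            if actual.get(number) != offset}


def build_demo_pdf():
    """Assemble a tiny two-object file with a correct xref table."""
    bodies = [b"<< /Type /Catalog /Pages 2 0 R >>",
              b"<< /Type /Pages /Kids [] /Count 0 >>"]
    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for number, body in enumerate(bodies, start=1):
        offsets.append(len(out))
        out += b"%d 0 obj\n%s\nendobj\n" % (number, body)
    xref_pos = len(out)
    out += b"xref\n0 %d\n0000000000 65535 f \n" % (len(bodies) + 1)
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset
    out += (b"trailer\n<< /Size %d /Root 1 0 R >>\nstartxref\n%d\n%%%%EOF\n"
            % (len(bodies) + 1, xref_pos))
    return bytes(out)


pdf = build_demo_pdf()
print(stale_entries(pdf))

# Simulate a hand edit that adds one byte inside object 1.
edited = pdf.replace(b"/Type /Catalog", b"/Type  /Catalog")
print(stale_entries(edited))
```

The sketch finds the table by searching for the xref keyword instead of trusting the startxref value, because that value is itself a byte offset and goes stale after the same kind of edit.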
