Remove images from PDF files
Some time ago I ran into some Service Manuals with unusual background images.
As much as I found that amusing, I wanted to clean up the files (in a semi-automatic way) to make the technical contents more readable.
The files are in the PDF format, and at a lower level a PDF is basically a hierarchy of objects, with each object having a unique numeric ID.
Images are usually discrete objects in a PDF file, so by manipulating the hierarchy to remove the objects representing the images, along with the references to these objects, we get a document without the unwanted images.
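To make the idea of numbered objects and references more concrete, here is a rough sketch in plain Python that scans the raw bytes of a PDF for image objects and for the objects that point at them. This is not how pdfstrip works (it relies on pdfrw); the function names and the sample data are made up for illustration, and the regular expressions only cope with objects stored in plain form, not with objects packed inside compressed object streams.

```python
import re

# "<num> <gen> obj ... endobj", for objects stored in plain form only.
OBJ_RE = re.compile(rb"(?<!\d)(\d+)\s+(\d+)\s+obj\b(.*?)\bendobj\b", re.DOTALL)
IMAGE_RE = re.compile(rb"/Subtype\s*/Image\b")


def find_image_objects(pdf_bytes):
    """Return the numbers of the objects marked as /Subtype /Image."""
    return [int(m.group(1)) for m in OBJ_RE.finditer(pdf_bytes)
            if IMAGE_RE.search(m.group(3))]


def find_referrers(pdf_bytes, target):
    """Return the numbers of the objects containing a '<target> <gen> R' reference."""
    ref_re = re.compile(rb"(?<!\d)%d\s+\d+\s+R\b" % target)
    return [int(m.group(1)) for m in OBJ_RE.finditer(pdf_bytes)
            if ref_re.search(m.group(3))]


# Hypothetical, heavily simplified PDF fragment: a page, an image and a form.
sample = (
    b"%PDF-1.4\n"
    b"3 0 obj\n<< /Type /Page /Resources << /XObject << /Im0 4 0 R >> >> >>\nendobj\n"
    b"4 0 obj\n<< /Type /XObject /Subtype /Image /Width 1 /Height 1 /Length 1 >>\n"
    b"stream\nx\nendstream\nendobj\n"
    b"5 0 obj\n<< /Type /XObject /Subtype /Form /Length 0 >>\nstream\n\nendstream\nendobj\n"
)

images = find_image_objects(sample)
referrers = find_referrers(sample, images[0])
print(images, referrers)
```

In this fragment object 4 is the image and object 3 (the page) is the one referring to it, so both the object and the /Im0 entry would have to go.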
I put together a Proof-of-Concept called pdfstrip (pun intended, you'll get it if you look at the amusing files from above).
pdfstrip depends on the pdfrw Python library, which is available in Debian as python3-pdfrw.
Check out the README file for an example of use of pdfstrip.
BTW, if you want to know more about the internals of PDF files, make sure to check out these interesting talks by Ange Albertini.
Limitations
For now, inline images cannot be stripped by pdfstrip. They are, however, quite easy to spot in the PDF source: they are delimited by the markers BI and EI, and there is always an ID marker between the two. Removing by hand the source code delimited by the markers usually works, but this is a brute-force approach.
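The same brute-force removal can be sketched in a few lines of Python. This is only an illustration with made-up names: it assumes the content stream has already been decompressed, it can be fooled by image data that happens to contain the bytes EI surrounded by whitespace, and it does not fix the stream length or the xref offsets afterwards.

```python
import re

# BI <image parameters> ID <image data> EI, each marker delimited by whitespace.
INLINE_IMAGE_RE = re.compile(rb"(?<!\S)BI\s.*?\sID\s.*?\sEI(?!\S)", re.DOTALL)


def strip_inline_images(content):
    """Return the content stream with every BI ... ID ... EI span removed."""
    return INLINE_IMAGE_RE.sub(b"", content)


# Hypothetical decompressed content stream: one inline image, then some text.
content = (
    b"q 10 0 0 10 0 0 cm\n"
    b"BI /W 1 /H 1 /BPC 8 /CS /G ID \xff\nEI\n"
    b"Q\n"
    b"BT (text) Tj ET\n"
)

stripped = strip_inline_images(content)
print(stripped)
```

The text-drawing operators are left alone because the pattern only matches BI when it stands as a separate token.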
Editing PDFs with Vim
When editing a PDF file with a text editor, it is advisable to edit it in binary mode. In Vim this can be achieved by adding the following snippet to the ~/.vimrc file:
autocmd BufReadPre,BufNewFile *.pdf let &bin=1 | set display=uhex
It is also useful to show the cursor offset in bytes, which makes it easier to follow the xref table. This can be done with a snippet like the following when setting the Vim statusline:
function! ByteOffset()
  return line2byte(line('.')) + col('.') - 1
endfunction

" show offset in bytes, when in binary mode.
" apparently we can't just use %o if we want the field to be conditional...
set statusline+=%{(has('byte_offset')&&getbufvar(bufnr('%'),'&bin'))?'[Offset:'.ByteOffset().']':''}
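When cross-checking the xref table after a manual edit, it can also help to compute the object offsets outside the editor. The following Python sketch (the function name and the sample are made up for illustration) lists the byte offset at which each object header starts. The offsets are counted from 0, which as far as I can tell is how xref entries count them, whereas the Vim function above counts from 1 like %o does. The regular expression can also be fooled by binary stream data that happens to look like an object header.

```python
import re

# "<num> <gen> obj" headers; the lookbehind avoids starting inside a longer number.
OBJ_HEADER_RE = re.compile(rb"(?<!\d)(\d+)\s+(\d+)\s+obj\b")


def object_offsets(pdf_bytes):
    """Map each object number to the 0-based byte offset of its header."""
    return {int(m.group(1)): m.start()
            for m in OBJ_HEADER_RE.finditer(pdf_bytes)}


# Hypothetical minimal fragment with two empty objects.
sample_pdf = (
    b"%PDF-1.4\n"
    b"1 0 obj\n<< >>\nendobj\n"
    b"2 0 obj\n<< >>\nendobj\n"
)

for number, offset in sorted(object_offsets(sample_pdf).items()):
    # xref entries store the offset as a zero-padded 10-digit number.
    print(number, "%010d" % offset)
```

The printed offsets can then be compared with the entries in the xref section of the edited file.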