By Ken Sharp - Thursday, September 29, 2022
There are quite a few ‘guides’ or suggestions on the Internet for ‘optimizing’ PDF files, and many of them suggest using Ghostscript to perform the task. Confusingly, the advice they give often differs and is sometimes outright contradictory, making it hard to find reliable information. This post attempts to document what pdfwrite (and the Ghostscript family of interpreters) can and can’t do, and why.
The first problem is that when people say ‘optimize’ they can mean many different things; perhaps to one person, optimize means ‘make my file smaller’ while to another, it means ‘make it more likely to open in a browser’. The two different goals mean that the two files might be constructed differently, but both could be described as optimized for the task they are meant to fulfill.
Let’s start by listing some of the things people can mean when they say ‘optimized’:
- Fast web view
- Minimise file size
- Conforming to a subset of the PDF specification (e.g., PDF/A)
- Produce a PDF file without errors
There may be others, but these are the most common, so we’ll look at these. Before we begin, though, we need to understand how the pdfwrite device, and the interpreter(s), actually work.
Because of the way PDF files are constructed, it is possible, up to a point, to treat them as a set of ‘building blocks’. Pages, images, and forms can be thought of as components of the PDF and can be rearranged or reordered. In addition, it is possible to take a content stream (the operations marking a page or making up a Form) and call it from another stream. So you could create a new content stream for a page which calls the original stream and then performs some other operations; this allows an application to effectively add to a page without changing the original page content. Several PDF applications work like this.
This isn’t how pdfwrite and Ghostscript work, though many users seem to think it is. With the Ghostscript family, the input interpreter reads the input (which could be a PDF file, a PostScript program, a PCL or XPS page description, or even some kinds of image formats) and sends a sequence of marking operations to the graphics library. When rendering (e.g., to the display), the graphics library turns those operations into pixels. When the output device is one of the high-level devices, the graphics library instead passes the operations to the device, which turns them back into high-level operations in a potentially different output language. Currently, that can be PDF, PostScript, PCL or XPS.
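As a concrete example, here is a minimal invocation (file names are placeholders) that interprets a PostScript program and re-emits the marks it makes as a PDF file:
gs -sDEVICE=pdfwrite -o output.pdf input.ps
The -o switch sets the output file and implies -dNOPAUSE and -dBATCH. Exactly the same invocation accepts a PDF file as input; the point to bear in mind is that the input is fully re-interpreted and re-emitted, not copied.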
The important point to remember is that the output only contains those parts of the input which draw on the page; other metadata is generally lost. The PDF interpreter does make a lot of effort to copy metadata from an input PDF file to the pdfwrite device, so things like hyperlinks, bookmarks and so on will often be preserved, but it is vital to realise that not everything is preserved, and even where the content appears to be the same (because metadata is preserved or the operation draws on the page) the actual commands may not be the same. For example, in PDF syntax, this:
0 1 0 rg
0 0 72 72 re
f
fills a rectangle starting at 0,0 (the bottom-left corner), 72 units wide and 72 units high (one inch square at the default 72 units per inch), with pure green.
This:
0 1 0 rg
0 0 m
0 72 l
72 72 l
72 0 l
h
f
starts from 0,0, creates a line segment to 0,72, then to 72,72, then to 72,0, and finally closes the path and fills it with green.
The two will render identically, but the actual content is quite different. Although this is (obviously) a highly simplified example, it should be clear that there are several ways to achieve a particular appearance in PDF. Most of the time, this doesn’t matter, but if your workflow relies in any way on the actual content of a PDF file, then you may find it does not work after being processed through the pdfwrite device, even though it looks exactly the same.
This will be important in a couple of the sections below.
Fast web view
Ordinarily, the entire PDF file must be downloaded before it can be opened because the cross-reference table can only be found by looking at the document trailer, which is located at the end of the file.
Fast Web View (described as Linearized PDF in the specification) is a way of producing a PDF file that allows the first page (and only the first page) to be drawn before the whole file is available, on viewers that support it. Support is not widespread, and the feature is of limited value since it only affects the first page. However, if you find it useful, you can have the pdfwrite device produce a PDF file that conforms to this part of the specification by using -dFastWebView.
The caveats above about the file content apply, of course, and you should be aware that this feature requires random access to the output file. You cannot use this feature and stream the output to stdout.
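For instance, a minimal command line (file names are placeholders) to produce a linearized file might look like this:
gs -sDEVICE=pdfwrite -dFastWebView=true -o linearized.pdf input.pdf
Because linearization requires rewriting the file layout, the output must be a seekable file, as noted above.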
Produce a PDF file without errors
Ghostscript’s PDF interpreter is very tolerant of errors and, unlike Adobe Acrobat, will tell you when it finds them rather than silently ignoring them. It seems that several other PDF consumers, including printers, are rather less tolerant of errors in PDF files and will fail to print or otherwise process them. This is not a criticism of those consumers; the files are broken or invalid. Sometimes, though, Ghostscript can produce the expected output, or at least match Acrobat (which many users consider to be the same thing), even when other consumers cannot.
One solution is to process such files through Ghostscript and the pdfwrite device to produce a ‘clean’ or ‘optimized’ PDF file, which is much more likely to work, or at least work the same on all consumers. Indeed, we have been told of some organisations which process all their PDF files through Ghostscript and pdfwrite before sending them into their workflow, just to avoid such problems.
There’s no special configuration required for this, though users might find the -dPDFSTOPONERROR and -dPDFSTOPONWARNING controls useful for filtering out potentially bad files for further inspection.
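A sketch of such a ‘cleaning’ pass (file names are placeholders), stopping on the first error so that genuinely unrecoverable files can be set aside rather than silently repaired:
gs -sDEVICE=pdfwrite -dPDFSTOPONERROR -o cleaned.pdf suspect.pdf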
For the especially curious, the -dPDFDEBUG control will dump the PDF content to stdout as it is processed, which can provide still more detail for those skilled in reading raw PDF. Beware: this has a large performance cost and will scribble a great many more or less incomprehensible messages to the output!
Conforming to a subset of the specification
There is already documentation on producing PDF/X-3 and PDF/A files available here:
https://ghostscript.readthedocs.io/en/gs10.0.0/VectorDevices.html#creating-a-pdf-x-3-document
https://ghostscript.readthedocs.io/en/gs10.0.0/VectorDevices.html#creating-a-pdf-a-document
and information on producing ZUGFeRD (Factur-X, EN16931) files here:
https://ghostscript.com/blog/zugferd.html
I don’t propose to cover these further. We may support other subsets of PDF in the future, and these will be covered in the Ghostscript documentation as well.
Minimise file size
This is probably the most confusing area of all, and the one where the various pieces of advice posted by users most often contradict each other.
The first thing to understand is that there is no guarantee that processing a PDF file with Ghostscript and the pdfwrite output device will produce a smaller file. In fact, it may even produce a larger one! As you’ll recall from the introduction, the operations written to the output may not be the same as the original ones and, as the example showed, may be more verbose and therefore larger.
In addition, at the time of writing, the pdfwrite device does not support either XrefStm (compressed xref tables) or ObjStm (a means of compressing other kinds of objects) in the PDF files it creates. If the original file did use those and used them effectively, it may well be smaller than the pdfwrite output.
However, in general, these don’t save a huge amount of space, and many PDF files don’t take advantage of them anyway.
The other aspect of how pdfwrite works is where we can gain an advantage, sometimes quite a significant one. Recall that we do not write metadata (or at least some kinds of metadata) into the output PDF file, and that we rewrite the file completely.
Quite a few PDF files contain extraneous white space or leading/trailing zeros, and many of them contain genuine metadata which may not be useful to the user.
The extraneous bytes can be written for several reasons, but one of the most common is the value associated with the /Length key of a stream. Fairly often, the PDF producer doesn’t know the length of a compressed stream until after it has been written, so it doesn’t know how many bytes to reserve in the PDF file to hold that length.
Because the cross-reference table holds the offset of each object in the PDF file, a naive producer often cannot easily change an object’s recorded position after it has been written either. So instead, it writes the Length value as a long run of space characters, which it later partially overwrites with the actual digits; the remaining white space is simply wasted. The pdfwrite device doesn’t do that, so it can save space there.
Some applications embed data in the PDF file, which is meaningful to that application but not to anything else. For example, Adobe Illustrator can embed the original Illustrator document in the PDF file. If you then open that PDF file with Illustrator, it will use the saved document rather than trying to interpret the PDF file. Because the pdfwrite device doesn’t understand the embedded data, it doesn’t copy it, which in the case of Illustrator can lead to quite large savings.
This is also where problems can arise; let’s take a real file we’ve been sent and look inside it. A simplified sketch of its Page dictionary (the values shown here are illustrative, not copied from the actual file) looks like this:
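<< /Type /Page
   /MediaBox [0 0 595 842]
   /Resources 5 0 R
   /Contents 6 0 R
   % non-standard, application-specific entry:
   /OneVisionPageColorsInfo << /CreationDate (D:20220901120000)
                               /PageProcessColors [/Cyan /Magenta /Yellow /Black]
                               /PageCustomColors [] >>
>>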
In there, amongst the standard keys, we can see ‘/OneVisionPageColorsInfo’, which is a non-standard key, obviously added by a OneVision product. It is a dictionary with CreationDate, PageProcessColors and PageCustomColors keys. Once this file has been processed and a new file created by pdfwrite, that information will not be present. It saves space, but at the cost of dropping the metadata. If the file were then sent back through a workflow that expected to find the PageCustomColors in a OneVisionPageColorsInfo dictionary, it would fail.
That pretty much covers the ‘incidental’ space savings that might occur by using the pdfwrite device. Depending on the content of the original file, these might or might not be significant. Now let’s look at actions we can take which might save even more space, but potentially at the cost of some quality.
The pdfwrite device understands many of the ‘distiller parameter’ controls that are specified by Adobe for use with the Adobe Acrobat Distiller product. Distiller takes PostScript programs as input and produces PDF files as output. As we noted right at the start, the Ghostscript family can take a variety of different inputs and produce PDF files as output, and we can apply these distiller parameters to affect how that output is produced.
Images
The biggest win is often to reduce the size of the image data in the PDF file since, when images are present, they generally account for most of the bytes. We can do this in several ways:
- Alter the colour space
- Change the compression
- Remove duplicates
- Reduce the ‘effective’ resolution of the image
Altering the colour space
We can convert the content to grayscale instead of colour; for RGB input, this is a saving of 66.6%, and for CMYK input, it’s a saving of 75%. Of course, the entire document will then be in gray, which may not be desirable. Use -sColorConversionStrategy=Gray to do this.
If the image data is known to be in CMYK, then the ConvertCMYKImagesToRGB switch can turn them into RGB instead for a 25% saving while retaining the colour.
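As sketches (file names are placeholders; ConvertCMYKImagesToRGB is a distiller parameter set here as a boolean), the two approaches look like this:
gs -sDEVICE=pdfwrite -sColorConversionStrategy=Gray -o gray.pdf input.pdf
gs -sDEVICE=pdfwrite -dConvertCMYKImagesToRGB=true -o rgbimages.pdf input.pdf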
Change the compression
If the image data is not compressed, or is losslessly compressed, then using a lossy compression filter can save some more space. There are three controls for this: one each for gray, colour, and monochrome (1 bit per pixel) images. There’s little point in trying to alter the monochrome image compression, as there is no lossy filter available; just leave that one alone. The GrayImageFilter and ColorImageFilter controls, however, can both be set to /DCTEncode, which applies JPEG compression to the images. Be aware that this is a lossy compression: the quality will be reduced, and if the original image was already JPEG compressed, compressing it again will degrade the quality further.
Note that you will also have to disable the automatic filter using the AutoFilterGrayImages and AutoFilterColorImages controls (set them to false). There is no AutoFilter control for monochrome images.
You should also disable PassThroughJPEGImages and PassThroughJPXImages.
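Putting those together, here is a sketch of a command line that applies JPEG compression to gray and colour images (with the quality caveats above; file names are placeholders):
gs -sDEVICE=pdfwrite \
   -dAutoFilterColorImages=false -dColorImageFilter=/DCTEncode \
   -dAutoFilterGrayImages=false -dGrayImageFilter=/DCTEncode \
   -dPassThroughJPEGImages=false -dPassThroughJPXImages=false \
   -o recompressed.pdf input.pdf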
Removing duplicates
Some PDF creation tools, when using the same image (e.g., a company logo) multiple times, will insert a copy of the image each time. With PDF, this isn’t necessary; we can use a reference to the image instead and embed only one copy of the image data. If the -dDetectDuplicateImages control is true, the pdfwrite device will take an MD5 hash of every image and, if it detects two images with the same hash, will replace the second usage with a reference to the first image. NOTE: this control defaults to true, so you do not need to turn it on, but you might want to turn it off.
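For example, to switch the detection off (file names are placeholders):
gs -sDEVICE=pdfwrite -dDetectDuplicateImages=false -o output.pdf input.pdf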
Reduce image resolution
This is another area where there is a considerable amount of confusion, not helped by the fact that many image formats include a resolution in the image information. In fact, no image format really has an inherent resolution; there is merely a number of image samples (or pixels) horizontally and vertically. Until the image is drawn on a medium, we can’t say what the resolution actually is.
To take a concrete example, let’s say I have an image that is 360 pixels by 360 pixels. I then draw it into a one-inch square on a PDF page. The ‘effective’ resolution of the image is, clearly, 360 dpi. Now I take the same image and draw it into a two-inch square on my PDF page. I haven’t changed the image in any way, but now its effective resolution is 180 dpi, half what it was. Yet the image is unchanged, so clearly the resolution isn’t an inherent property of the image itself.
The controls for reducing image resolution are, unfortunately, complicated because that’s the way that Adobe defined them, and for compatibility, we chose to implement the same controls. Basically, there is a set of controls for each image ‘type’: Monochrome, Gray or Colour. We first need to turn on downsampling for the image type(s) we want to change. So set -dDownsampleColorImages, -dDownsampleGrayImages and -dDownsampleMonoImages to true to reduce the resolution of each type of image.
Next, there’s a resolution for each image type, and we need to bear in mind the point above about the effective resolution of an image. You should also note that if an image is used in multiple locations at different sizes, it will be downsampled only once, to the lowest effective resolution. The controls are -dColorImageResolution, -dGrayImageResolution and -dMonoImageResolution.
Confusingly, there is also a ‘threshold’ for each of these resolutions. It exists so that we don’t spend time processing image data for very little reward: if an image’s effective resolution is already ‘close’ to what we want, we won’t try to reduce it. Again, there are three controls: -dColorImageDownsampleThreshold, -dGrayImageDownsampleThreshold and -dMonoImageDownsampleThreshold.
Let’s take an example here, using our image above, which has 360 pixels in each direction and is drawn in a one-inch square. Assume we’ve set the desired ImageResolution to 300. If the ImageDownsampleThreshold is the default of 1.5, then we will only downsample images that have an effective resolution of at least 300 x 1.5 = 450 dpi. Our image drawn in a one-inch square has an effective resolution of 360 dpi and therefore doesn’t qualify, so we won’t reduce its resolution. Using a threshold of 1.1, however, we would downsample images with a resolution of 300 x 1.1 = 330 dpi or more, and so we would downsample our example image. If we altered the ImageDownsampleThreshold to 1.0, then we would reduce every image with an effective resolution above 300 x 1.0 = 300 dpi down to 300 dpi.
Moving on, the final parameter is the type of downsampling to apply, and there are three possibilities: Subsample, Average and Bicubic. For Monochrome images, we can only use Subsample because the other types involve coming up with some kind of average value of the pixels we are considering. Since monochrome images can only have black or white pixels, there’s no way to come up with an average. For the other image types, we can use Average or Bicubic; without going into technical detail, bicubic will produce more ‘pleasing’ results but will take longer.
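Bringing the downsampling controls together, here is a sketch that reduces colour and gray images to 150 dpi and monochrome images to 300 dpi (the resolutions and file names are illustrative):
gs -sDEVICE=pdfwrite \
   -dDownsampleColorImages=true -dColorImageResolution=150 \
   -dColorImageDownsampleType=/Bicubic -dColorImageDownsampleThreshold=1.0 \
   -dDownsampleGrayImages=true -dGrayImageResolution=150 \
   -dGrayImageDownsampleType=/Bicubic \
   -dDownsampleMonoImages=true -dMonoImageResolution=300 \
   -o downsampled.pdf input.pdf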
There is one control that encapsulates stored values for all of the controls above: the -dPDFSETTINGS control. The default for this control is not to apply any kind of space saving; there are four other possible values, /screen, /ebook, /printer and /prepress, which are covered in detail in the Ghostscript documentation at: https://ghostscript.readthedocs.io/en/gs10.0.0/VectorDevices.html#the-family-of-pdf-and-postscript-output-devices
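For example:
gs -sDEVICE=pdfwrite -dPDFSETTINGS=/ebook -o smaller.pdf input.pdf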
Non-image options
There are a few other ways that the file size can be reduced, but with potential downsides.
The embedding of fonts can be controlled using the /NeverEmbed distiller parameter, which takes an array of the names of fonts that should not be embedded. Of course, if you don’t embed a font, the final PDF consumer will have to use a substitute, which means the PDF file may not display as intended.
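Because NeverEmbed takes an array of names, it is most easily set with a small snippet of PostScript on the command line; a sketch (the font names here are examples only, and the file names are placeholders):
gs -sDEVICE=pdfwrite -o output.pdf -c "<</NeverEmbed [/Helvetica /Times-Roman /Courier]>> setdistillerparams" -f input.pdf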
By default, pdfwrite embeds any halftone screens (used to ‘dither’ the output on a monochrome device). These can be discarded reasonably safely since any monochrome device (e.g., printer) will always be able to use its own defaults. You can drop this information by setting -dPreserveHalftoneInfo to false.
Similarly with overprint settings, though dropping these might give surprising results on a real CMYK printer or press; the -dPreserveOverprintSettings switch controls this.
Transfer functions (dot gain compensation, gamma correction), if present in the input file, can be applied directly to the colour values rather than preserved, allowing the functions themselves to be dropped from the output. Set -dTransferFunctionInfo to /Apply instead of /Preserve.
Undercolour removal and black generation functions are used when converting RGB to CMYK, and PDF files can carry around rules for how to do this. Since printers will always have their own defaults, it is safe to drop these too, by setting -dUCRandBGInfo to /Remove.
PDF files can contain certain metadata describing the content; this is known as Marked Content and Structure Information. Currently, pdfwrite doesn’t preserve Structure Information (it is planned to do so when time allows), but the new PDF interpreter can preserve Marked Content (the old interpreter cannot). If you know you don’t need this, then you can set -dPreserveMarkedContent to false.
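As a sketch combining the non-image options above (apply any transfer functions, drop the rest; file names are placeholders):
gs -sDEVICE=pdfwrite \
   -dPreserveHalftoneInfo=false -dPreserveOverprintSettings=false \
   -dTransferFunctionInfo=/Apply -dUCRandBGInfo=/Remove \
   -dPreserveMarkedContent=false \
   -o output.pdf input.pdf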
Summary
It isn’t really possible to give a ‘one size fits all’ recommendation on how to use these parameters; it depends on how badly you want to reduce the file size, what compromises on quality you are prepared to make, and how much time you want to spend processing the PDF file to get there.
I’ve tried to describe all the reasonable ways to ‘optimize’ a PDF file and the controls you can use to help do so, as well as warning of what possible pitfalls there might be. Hopefully, this is clear enough to be of some assistance in navigating the admittedly complicated settings for the PDF interpreter and the pdfwrite device.