Converting PDF to images

deceptikon

I work with images...a lot. Often this involves image processing of various types such as resizing, resampling, and various cleanup operations. A common issue is that people conflate Adobe PDF with images and expect any application that works with images to work with PDF as well. But PDF isn't really an image type (even though it can store images), so to do any kind of editing or cleanup a PDF must first be converted to an actual image type.

Anyone who has worked with PDF in more than a consumption capacity will know that Adobe products cost money. The most useful enterprise products cost a lot of money. However, there are free options such as Ghostscript if you're willing to learn the SDK. For my personal imaging library I've chosen to create a class that invokes Ghostscript to consume a PDF and produce an image for processing.

The code is straightforward, but because Ghostscript is a C-based SDK, it might be confusing to C# programmers who aren't used to lower level languages or the P/Invoke support in .NET.
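
For readers new to P/Invoke, the key idea is that a C API expects each string as a pointer to a NUL-terminated run of bytes in memory that won't move. The sketch below is not taken from the class (which pins byte arrays with GCHandle); it uses the framework's Marshal helpers to show the same round trip in isolation. CStringDemo and RoundTrip are made-up names for illustration.

```csharp
using System;
using System.Runtime.InteropServices;

public static class CStringDemo
{
    // Copies a managed string into unmanaged memory as a NUL-terminated
    // ANSI string, reads it back through the pointer, and frees the memory.
    public static string RoundTrip(string value)
    {
        // Allocates unmanaged memory and writes the bytes plus a trailing '\0'.
        IntPtr pointer = Marshal.StringToHGlobalAnsi(value);

        try
        {
            // This pointer is what a C function taking "char *" would receive.
            return Marshal.PtrToStringAnsi(pointer);
        }
        finally
        {
            // Unmanaged memory is not garbage collected, so free it explicitly.
            Marshal.FreeHGlobal(pointer);
        }
    }
}
```

Calling CStringDemo.RoundTrip("-dNOPAUSE") hands back an equal string. The try/finally shape is the part that carries over to the class below: anything allocated or pinned for native code has to be released by hand.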

The biggest benefit of this class is that it supports streams and byte arrays on top of just the files supported by Ghostscript. One example of using byte arrays is extracting a PDF from an email attachment (which one of my production applications does). The usefulness of streams should be obvious without examples, of course. ;)

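Note 1: For this class to work, gsdll32.dll (available on the Ghostscript site) must be somewhere on the DLL search path, for example in the application folder. Because it is a 32-bit native DLL, the application must run as a 32-bit (x86) process. The GAC only holds managed assemblies, so it is not an option here.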

Note 2: This class is not fully tested, so use it at your own risk. If you find any bugs or make any improvements, I happily welcome feedback.

using System;
using System.Collections.Generic;
using System.IO;
using System.Runtime.InteropServices;
using System.Text;

namespace JRD.Imaging
{
    /// <summary>
    /// Supports conversion of Adobe PDF formatted data to image formats.
    /// </summary>
    public class PdfConvert
    {
        #region P/Invoke
        [DllImport("kernel32.dll", EntryPoint = "RtlMoveMemory")]
        private static extern void CopyMemory(IntPtr Destination, IntPtr Source, uint Length);

        [DllImport("gsdll32.dll", EntryPoint = "gsapi_new_instance")]
        private static extern int gsapi_new_instance(out IntPtr pinstance, IntPtr caller_handle);

        [DllImport("gsdll32.dll", EntryPoint = "gsapi_init_with_args")]
        private static extern int gsapi_init_with_args(IntPtr instance, int argc, IntPtr argv);

        [DllImport("gsdll32.dll", EntryPoint = "gsapi_exit")]
        private static extern int gsapi_exit(IntPtr instance);

        [DllImport("gsdll32.dll", EntryPoint = "gsapi_delete_instance")]
        private static extern void gsapi_delete_instance(IntPtr instance);
        #endregion

        #region Public Properties
        /// <summary>
        /// Gets or sets the destination image format for conversion.
        /// </summary>
        /// <remarks>
        /// The default is 24-bit RGB TIFF.
        /// </remarks>
        public PdfImageFormat ImageFormat { get; set; }

        /// <summary>
        /// Gets or sets a default destination page size in the absence of specific width and height.
        /// </summary>
        /// <remarks>
        /// The default is letter size.
        /// </remarks>
        public PdfPageSize DefaultPageSize { get; set; }

        /// <summary>
        /// Gets or sets the destination width of the image in pixels.
        /// </summary>
        /// <remarks>
        /// The default is 0, such that DefaultPageSize will be used instead.
        /// </remarks>
        public int Width { get; set; }

        /// <summary>
        /// Gets or sets the destination height of the image in pixels.
        /// </summary>
        /// <remarks>
        /// The default is 0, such that DefaultPageSize will be used instead.
        /// </remarks>
        public int Height { get; set; }

        /// <summary>
        /// Gets or sets the destination horizontal resolution of the image in DPI/PPI.
        /// </summary>
        /// <remarks>
        /// The default is 0, such that a resolution will be automatically chosen.
        /// </remarks>
        public int ResolutionX { get; set; }

        /// <summary>
        /// Gets or sets the destination vertical resolution of the image in DPI/PPI.
        /// </summary>
        /// <remarks>
        /// The default is 0, such that a resolution will be automatically chosen.
        /// </remarks>
        public int ResolutionY { get; set; }

        /// <summary>
        /// Gets or sets the first page of the PDF in a range conversion.
        /// </summary>
        /// <remarks>
        /// The default is -1 representing the first page in the file.
        /// </remarks>
        public int FirstPage { get; set; }

        /// <summary>
        /// Gets or sets the last page of the PDF in a range conversion.
        /// </summary>
        /// <remarks>
        /// The default is -1 representing the last page in the file.
        /// </remarks>
        public int LastPage { get; set; }

        /// <summary>
        /// Gets or sets the quality of the image when ImageFormat is "jpeg".
        /// </summary>
        /// <remarks>
        /// The default is 75.
        /// </remarks>
        public int JpegQuality { get; set; }

        /// <summary>
        /// Gets or sets the compression type when ImageFormat is "tiff".
        /// </summary>
        /// <remarks>
        /// The default is uncompressed.
        /// </remarks>
        public PdfTiffCompression TiffCompression { get; set; }

        /// <summary>
        /// Gets or sets whether images will be fit to the default page size.
        /// </summary>
        public bool FitPage { get; set; }

        /// <summary>
        /// Gets or sets whether each page is converted to a separate file.
        /// </summary>
        /// <remarks>
        /// This property must be true if ImageFormat does not support multiple pages,
        /// otherwise only the first page in the source PDF will be converted.
        /// </remarks>
        public bool SeparatePages { get; set; }
        #endregion

        #region Public Interface
        /// <summary>
        /// Creates and initializes a new instance.
        /// </summary>
        public PdfConvert()
        {
            ImageFormat = PdfImageFormat.tiff24nc;
            DefaultPageSize = PdfPageSize.letter;
            FirstPage = -1;
            LastPage = -1;
            TiffCompression = PdfTiffCompression.none;
        }

        /// <summary>
        /// Converts the provided PDF represented by a byte array to an image file.
        /// </summary>
        /// <param name="input">The source PDF byte array.</param>
        /// <param name="output">The destination image file.</param>
        /// <returns>True if the conversion succeeded.</returns>
        public bool Convert(byte[] input, string output)
        {
            using (var ms = new MemoryStream(input))
            {
                return Convert(ms, output);
            }
        }

        /// <summary>
        /// Converts the provided PDF represented by a byte array to the provided destination stream.
        /// </summary>
        /// <param name="input">The source PDF byte array.</param>
        /// <param name="output">The destination stream.</param>
        /// <returns>True if the conversion succeeded.</returns>
        /// <remarks>
        /// If the destination stream is seekable, the position will be set to the beginning.
        /// </remarks>
        public bool Convert(byte[] input, Stream output)
        {
            using (var ms = new MemoryStream(input))
            {
                return Convert(ms, output);
            }
        }

        /// <summary>
        /// Converts the provided PDF file represented by a byte array to a byte array.
        /// </summary>
        /// <param name="input">The source PDF byte array.</param>
        /// <returns>A byte array on successful conversion, or null if the conversion fails.</returns>
        public byte[] Convert(byte[] input)
        {
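            // The result needs its own expandable stream; a MemoryStream wrapping the input array has a fixed size.
            using (var ms = new MemoryStream())
            {
                return Convert(input, ms) ? ms.ToArray() : null;
            }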
        }

        /// <summary>
        /// Converts the provided PDF file to a byte array.
        /// </summary>
        /// <param name="input">The source PDF file.</param>
        /// <returns>A byte array on successful conversion, or null if the conversion fails.</returns>
        public byte[] Convert(string input)
        {
            using (var ms = new MemoryStream())
            {
                return Convert(input, ms) ? ms.ToArray() : null;
            }
        }

        /// <summary>
        /// Converts the provided PDF file to an image file.
        /// </summary>
        /// <param name="input">The source PDF file.</param>
        /// <param name="output">The destination image file.</param>
        /// <returns>True if the conversion succeeded.</returns>
        public bool Convert(string input, string output)
        {
            return ExecuteGhostscriptCommand(BuildGhostscriptCommand(input, output));
        }

        /// <summary>
        /// Converts the provided PDF file to the provided destination stream.
        /// </summary>
        /// <param name="input">The source PDF file.</param>
        /// <param name="output">The destination stream.</param>
        /// <returns>True if the conversion succeeded.</returns>
        /// <remarks>
        /// If the destination stream is seekable, the position will be set to the beginning.
        /// </remarks>
        public bool Convert(string input, Stream output)
        {
            var dstFile = Path.GetTempFileName();

            try
            {
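                // Ghostscript only works with files, so we need to use a temporary destination file for conversion.
                // Note: with SeparatePages enabled the output goes to numbered files rather than dstFile,
                // so this overload is only meaningful when SeparatePages is false.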
                if (ExecuteGhostscriptCommand(BuildGhostscriptCommand(input, dstFile)))
                {
                    using (var reader = File.OpenRead(dstFile))
                    {
                        reader.CopyTo(output);

                        // Reset the stream if we can
                        if (output.CanSeek)
                        {
                            output.Seek(0, SeekOrigin.Begin);
                        }
                    }

                    return true;
                }
            }
            finally
            {
                if (File.Exists(dstFile))
                {
                    File.Delete(dstFile);
                }
            }

            return false;
        }

        /// <summary>
        /// Converts the provided PDF represented by a stream to a byte array.
        /// </summary>
        /// <param name="input">The source PDF stream.</param>
        /// <returns>A byte array on successful conversion, or null if the conversion fails.</returns>
        /// <remarks>
        /// The source stream position will be modified by this conversion.
        /// </remarks>
        public byte[] Convert(Stream input)
        {
            var srcFile = Path.GetTempFileName();

            try
            {
                // Ghostscript only works with files, so we need to send the stream to a temporary file for conversion.
                using (var writer = new FileStream(srcFile, FileMode.Create, FileAccess.Write))
                {
                    input.CopyTo(writer);
                }

                return Convert(srcFile);
            }
            finally
            {
                if (File.Exists(srcFile))
                {
                    File.Delete(srcFile);
                }
            }
        }

        /// <summary>
        /// Converts the provided PDF represented by a stream to an image file.
        /// </summary>
        /// <param name="input">The source PDF stream.</param>
        /// <param name="output">The destination image file.</param>
        /// <returns>True if the conversion succeeded.</returns>
        /// <remarks>
        /// The source stream position will be modified by this conversion.
        /// </remarks>
        public bool Convert(Stream input, string output)
        {
            var srcFile = Path.GetTempFileName();

            try
            {
                // Ghostscript only works with files, so we need to send the stream to a temporary file for conversion.
                using (var writer = new FileStream(srcFile, FileMode.Create, FileAccess.Write))
                {
                    input.CopyTo(writer);
                }

                return Convert(srcFile, output);
            }
            finally
            {
                if (File.Exists(srcFile))
                {
                    File.Delete(srcFile);
                }
            }
        }

        /// <summary>
        /// Converts the provided PDF represented by a stream to the provided destination stream.
        /// </summary>
        /// <param name="input">The source PDF stream.</param>
        /// <param name="output">The destination stream.</param>
        /// <returns>True if the conversion succeeded.</returns>
        /// <remarks>
        /// The source stream position will be modified by this conversion.
        /// If the destination stream is seekable, the position will be set to the beginning.
        /// </remarks>
        public bool Convert(Stream input, Stream output)
        {
            var srcFile = Path.GetTempFileName();

            try
            {
                // Ghostscript only works with files, so we need to send the stream to a temporary file for conversion.
                using (var writer = new FileStream(srcFile, FileMode.Create, FileAccess.Write))
                {
                    input.CopyTo(writer);
                }

                return Convert(srcFile, output);
            }
            finally
            {
                if (File.Exists(srcFile))
                {
                    File.Delete(srcFile);
                }
            }
        }
        #endregion

        #region Ghostscript Workers
        /// <summary>
        /// Executes a generated Ghostscript command for conversion.
        /// </summary>
        /// <param name="args">A list of Ghostscript arguments.</param>
        /// <returns>True if the conversion succeeded, false otherwise.</returns>
        private bool ExecuteGhostscriptCommand(List<string> args)
        {
            var handles = new GCHandle[args.Count];
            var pHandles = new IntPtr[args.Count];

            // Ghostscript is a C-based API, so we must allocate fixed memory and convert C# strings to
            // NUL-terminated C-style strings
            for (int i = 0; i < args.Count; i++)
            {
                handles[i] = GCHandle.Alloc(Encoding.Default.GetBytes((args[i] ?? string.Empty) + "\0"), GCHandleType.Pinned);
                pHandles[i] = handles[i].AddrOfPinnedObject();
            }

            var memHandle = GCHandle.Alloc(pHandles, GCHandleType.Pinned);
            var pInstance = IntPtr.Zero;
            var ret = -1;

            try
            {
                ret = gsapi_new_instance(out pInstance, IntPtr.Zero);

                if (ret >= 0)
                {
                    ret = gsapi_init_with_args(pInstance, args.Count, memHandle.AddrOfPinnedObject());
                }
            }
            finally
            {
                // Don't forget to release memory, we're in old school coding mode here. :)
                for (int i = 0; i < handles.Length; i++)
                {
                    handles[i].Free();
                }

                memHandle.Free();

                // Safely dispose of Ghostscript
                if (pInstance != IntPtr.Zero)
                {
                    gsapi_exit(pInstance);
                    gsapi_delete_instance(pInstance);
                }
            }

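            // 0 is success; -101 is Ghostscript's "quit" code, which signals a normal exit rather than a failure
            return (ret == 0) || (ret == -101);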
        }

        /// <summary>
        /// Generates a list of Ghostscript arguments based on current property settings.
        /// </summary>
        /// <param name="inputFile">The source PDF file for conversion.</param>
        /// <param name="outputFile">The destination image file for conversion.</param>
        /// <returns>A list of constructed Ghostscript arguments.</returns>
        private List<string> BuildGhostscriptCommand(string inputFile, string outputFile)
        {
            var args = new List<string>();

            args.Add("pdf2img");
            args.Add("-dNOPAUSE");
            args.Add("-dBATCH");
            args.Add("-dSAFER");
            args.Add(string.Format("-sDEVICE={0}", ImageFormat));

            // If an explicit size is specified, use that. Otherwise use the default page size.
            if (Width > 0 && Height > 0)
            {
                args.Add(string.Format("-g{0}x{1}", Width, Height));
            }
            else
            {
                args.Add(string.Format("-sPAPERSIZE={0}", DefaultPageSize));
            }

            if (ResolutionX > 0 && ResolutionY > 0)
            {
                args.Add(string.Format("-r{0}x{1}", ResolutionX, ResolutionY));
            }

            if (FirstPage > 0)
            {
                args.Add(string.Format("-dFirstPage={0}", FirstPage));
            }

            if (LastPage > 0)
            {
                args.Add(string.Format("-dLastPage={0}", LastPage));
            }

            // Apply format-specific options
            if (ImageFormat == PdfImageFormat.jpeg && JpegQuality > 0 && JpegQuality < 101)
            {
                args.Add(string.Format("-dJPEGQ={0}", JpegQuality));
            }
            else if (ImageFormat.ToString().StartsWith("tiff"))
            {
                args.Add(string.Format("-sCompression={0}", TiffCompression));
            }

            if (FitPage)
            {
                args.Add("-dFitPage");
            }

            if (SeparatePages)
            {
                // Format the output file such that Ghostscript can version more than one
                int lastDotIndex = outputFile.LastIndexOf('.');

                if (lastDotIndex > 0)
                {
                    // Note: Ghostscript uses a printf-like string format for versioned file names
                    outputFile = outputFile.Insert(lastDotIndex, "%04d");
                }
            }

            args.Add(string.Format("-sOutputFile={0}", outputFile));
            args.Add(inputFile);

            return args;
        }
        #endregion
    }
}
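
The listing refers to three enum types (PdfImageFormat, PdfPageSize and PdfTiffCompression) that aren't shown. Because the class passes each value to Ghostscript via ToString(), the member names have to match Ghostscript's own device, paper size and compression names. What follows is only a sketch of what they might look like, not the original definitions: tiff24nc, jpeg, letter and none come from the code above, and the remaining members are an assumed subset of the names Ghostscript documents.

```csharp
namespace JRD.Imaging
{
    // Assumed subset of Ghostscript output devices (passed as -sDEVICE=...).
    public enum PdfImageFormat
    {
        tiff24nc,   // 24-bit RGB TIFF (the class default)
        tiffgray,   // 8-bit grayscale TIFF
        tiffg4,     // 1-bit TIFF with CCITT Group 4 compression
        jpeg,       // color JPEG
        jpeggray,   // grayscale JPEG
        png16m,     // 24-bit color PNG
        pnggray,    // 8-bit grayscale PNG
        bmp16m      // 24-bit color BMP
    }

    // Assumed subset of Ghostscript paper sizes (passed as -sPAPERSIZE=...).
    public enum PdfPageSize
    {
        letter,
        legal,
        ledger,
        a3,
        a4,
        a5
    }

    // Assumed TIFF compression names (passed as -sCompression=...).
    public enum PdfTiffCompression
    {
        none,
        crle,
        g3,
        g4,
        lzw,
        pack
    }
}
```

With those in place, a call such as new PdfConvert { ImageFormat = PdfImageFormat.png16m, SeparatePages = true }.Convert("input.pdf", "page.png") should produce page0001.png, page0002.png, and so on (the file names here are placeholders).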
mindylynn0 0 Newbie Poster

I couldn't agree more. A PDF does need to be converted to an image type for editing or cleanup.
Converting a PDF document to a raster image is not new technology. Google Docs, ImageMagick, Adobe Reader, GIMP, PDF Converter, etc. are commonly used for PDF conversion. I've never used Ghostscript, but I'd like to try it.

deceptikon 1,790 Code Sniper Team Colleague Featured Poster

As a side note, Ghostscript is the back-end for ImageMagick's PDF conversion. I'm a huge fan of ImageMagick and its variants.

Not sure what the other products use under the hood, but since most of my work involves custom code, a fairly open API of some sort is critical. Customers tend not to react well to "well, use PDF Converter". ;)

Oxiegen 88 Basically an Occasional Poster Featured Poster

Wouldn't it have been easier to use iTextSharp instead?
It's free and already has all the classes and methods for creating/manipulating PDFs and converting them to images.

deceptikon 1,790 Code Sniper Team Colleague Featured Poster

Wouldn't it have been easier to use iTextSharp instead?

Last I checked, iTextSharp didn't support PDF rasterization directly.

mike_2000_17 2,669 21st Century Viking Team Colleague Featured Poster

there are free options such as Ghostscript

It might be worth mentioning that it is free only as long as you don't use it in a commercial application (that is, only for GPL / AGPL projects). If you intend to use it in a commercial application, you must contact their licensing broker (Artifex Software Inc.) to negotiate an agreement with them. Otherwise, you are breaking the law.

I just didn't want people to get the wrong idea, because you said it's "free" and keep referring to having developed software for your "clients" / "customers" with it. People might get the false impression that you don't have to pay a licensing fee to Ghostscript for using their product as part of your commercial products.

deceptikon 1,790 Code Sniper Team Colleague Featured Poster

because you said it's "free"

Very true, I should have qualified "free".

and keep referring to having developed software for your "clients" / "customers" with it

As clearly stated in the OP, this class was written for my personal library. The only mention of customers was a tongue in cheek comment by myself and your mention in the quoted text, so it's a little unfair to claim that I "keep referring" to it.
