PDFs full of images only.
Open one in Notepad, delete a few lines and save it again, effectively breaking your PDF file.
You can still open it, but all your pages will be blank, with maybe a scribble at the bottom somewhere.

Now... how do you check whether a PDF file is broken like that? (It can STILL be opened by Adobe, with the message "Insufficient data for an image", and you can still scroll through your broken little file.)

(Oh, and you do NOT have the original to compare it against with some fancy checksum.)

What I've done so far, with no success, might I add:

  • Checked whether the pages are blank (by checking the amount of data on each page). Fun fact: it's exactly the same as in the original, so fail.
  • Tried to open the PDF in a window. It still opened. Broken, but it still opened, so I couldn't catch anything.
  • Read the pages into a file stream, but that only worked for the text, and I have images. (I'll put the code for this one at the bottom, since it's the only one that partly worked, but like I said, only for PDFs with words in them.)
  • Checked the headers. But the fault isn't in the headers (most of the time).

I've used EVERY free library there is:

  1. ITextSharp
  2. AcroPDF
  3. SautinSoft.PdfFocus

And hell if I know what else!

How would you check for a broken PDF like that?

Am I checking the right way? Is there a different way to do it?
I'm so desperate about this that it doesn't even have to be in C# (well, preferably). I even tried a whole new thing, Ghostscript, but I couldn't get it to work, so that failed too.

Here is the code that worked for the text but not for the images, since the extracted strings came up empty. Go figure.

// Needs: using System; using System.IO; using System.Windows.Forms;
//        using iTextSharp.text.pdf; using iTextSharp.text.pdf.parser;
public bool ReadPdfFile(FileInfo f, string sourceDir)
{
    lbl_CurrentFile.Text = f.FullName;

    // Let the form repaint so the label shows the current file.
    Application.DoEvents();

    try
    {
        PdfReader pdfReader = new PdfReader(f.FullName);

        for (int page = 1; page <= pdfReader.NumberOfPages; page++)
        {
            // Throws for pages whose text content cannot be parsed.
            // For image-only pages it simply returns an empty string,
            // so a damaged image goes unnoticed here.
            ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
            string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
        }
        pdfReader.Close();
    }
    catch (Exception)
    {
        // Any parsing failure is treated as "broken".
        return false;
    }
    return true;
}

Any help would be awesome.

Thanks

For something like this you may have to look at the PDF specification. There's also an Acrobat SDK and PDF Library SDK, both of which cost money, and I have no idea what they offer in terms of detecting corruption. From what I understand, 1.7 is the latest specification, although Adobe keeps releasing additional extensions.

The basic structure of the file is pretty simple. At the end of the file, after the startxref keyword, there's a file offset for the cross-reference table. That table is a collection of file offsets for the "objects", and the trailer next to it points to the document catalog, a dictionary that is the root of the page tree. Each object has a header that says what it is, and stream objects (which is where your images live) also declare their length. You could check simple things like that; beyond that it will start getting quite involved.
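
To make "check simple things like that" a bit more concrete, here is a rough sketch in Python (standard library only, since you said it doesn't have to be C#). It assumes two things: that the number after the last startxref keyword should point at a cross-reference section, and that each stream's declared /Length should land right in front of its endstream keyword. Deleting lines in a text editor shifts bytes around, so both tend to go wrong. The function names (startxref_problem, stream_problems, check_pdf) are made up for this example, and the parsing is regex-based and deliberately naive: streams whose /Length is an indirect reference ("12 0 R") are skipped, and the individual cross-reference entries are not verified.

```python
import re

# "stream" keyword that closes an object's dictionary and opens its data.
STREAM_KW = re.compile(rb">>\s*stream\r?\n")
# Direct "/Length 123"; group 2 is set for an indirect "/Length 12 0 R".
LENGTH_KEY = re.compile(rb"/Length\s+(\d+)(\s+\d+\s+R\b)?")


def startxref_problem(data):
    """Return a message if the final startxref offset looks wrong, else None."""
    pos = data.rfind(b"startxref")
    m = re.match(rb"startxref\s+(\d+)", data[pos:pos + 64]) if pos >= 0 else None
    if m is None:
        return "no startxref offset found"
    off = int(m.group(1))
    head = data[off:off + 32]
    # Either a classic "xref" table or an "N G obj" cross-reference stream.
    if head.startswith(b"xref") or re.match(rb"\d+\s+\d+\s+obj", head):
        return None
    return "startxref offset %d does not point at a cross-reference section" % off


def stream_problems(data):
    """Return messages for streams whose declared /Length misses 'endstream'."""
    problems = []
    pos = 0
    while True:
        m = STREAM_KW.search(data, pos)
        if m is None:
            return problems
        body = m.end()  # first byte of the stream data
        obj_at = data.rfind(b"obj", pos, m.start())
        header = data[max(obj_at, pos):body]
        lm = LENGTH_KEY.search(header)
        if lm is None:
            problems.append("stream at byte %d has no /Length" % body)
        elif lm.group(2) is None:
            end = body + int(lm.group(1))
            # After the data there may be an end-of-line, then "endstream".
            if data[end:end + 16].lstrip(b"\r\n ").startswith(b"endstream"):
                pos = end  # length is consistent: skip over the binary data
                continue
            problems.append("stream at byte %d: /Length does not match the data" % body)
        # Indirect /Length or a problem: resume at the next "endstream".
        nxt = data.find(b"endstream", body)
        pos = nxt if nxt >= 0 else len(data)


def check_pdf(data):
    """Run both checks on the raw bytes of a PDF and return a list of messages."""
    first = startxref_problem(data)
    return ([first] if first else []) + stream_problems(data)
```

You would call it as check_pdf(open(path, "rb").read()) and treat a non-empty list as "probably broken". Keep in mind this is stricter than Adobe Reader, which quietly rebuilds a damaged cross-reference table, so a file that still displays fine could be flagged too.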
