I have a BIG PDF file that was created as a report. The file has 1.6M lines of data. No images at all.
Each page has a title with a page number, a header for the columns, and 28 data lines after it. Each line has 12 fields. Some data fields may be empty.
At the end there is a line marking the end of the data, and below it a file creation date.
I would like to extract just the data lines. I have seen samples of writing to a PDF file, for example using the mjwPDF class, but could not find any sample code for reading TEXT from a PDF.

https://www.google.com/#q=read+pdf+in+vb6 disagrees with you. That is, it appears it's been done before.

I'd disregard that, though, and suggest you forget about parsing the PDF: just open it in a reader, select all, and paste it into a plain text file. Then you're off to the races.

As to field counts, that's for your code to handle.
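
Once the text is in a plain file, the filtering could look something like the sketch below. It is written in Python only to keep it short; the same logic carries over to VB6 line by line. Everything about the line layout is an assumption here: the title pattern, the header text, the end marker, and the tab delimiter are all hypothetical and need to be adjusted to the real report. In particular, a copy and paste may not preserve tabs or empty fields, so check that first.

```python
import re

# Hypothetical markers: none of these come from the actual report.
TITLE_RE = re.compile(r"Page\s+\d+\s*$")   # assumed: title line ends with "Page N"
HEADER_LINE = "ID\tNAME\tQTY"              # assumed: exact text of the column header
END_MARKER = "END OF DATA"                 # assumed: text on the end-of-data line


def extract_data_lines(lines):
    """Yield data lines only; skip blanks, titles and headers, stop at the end marker."""
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue
        if END_MARKER in line:
            break  # everything after this (the creation date) is ignored
        if TITLE_RE.search(line) or line == HEADER_LINE:
            continue
        yield line


def split_fields(line, sep="\t", count=12):
    """Split a data line and pad with empty strings so it always has `count` fields."""
    parts = line.split(sep)
    parts += [""] * (count - len(parts))
    return parts[:count]


sample = [
    "Sales Report Page 1",
    "ID\tNAME\tQTY",
    "1\tfoo\t3",
    "2\t\t4",
    "",
    "Sales Report Page 2",
    "ID\tNAME\tQTY",
    "3\tbar\t5",
    "END OF DATA",
    "Created 2015-01-01",
]
data = list(extract_data_lines(sample))
```

For the real file, passing an open file object (for example `open("report.txt", encoding="utf-8")`, where the file name is made up) to `extract_data_lines` reads one line at a time, so 1.6 million lines do not have to fit in memory at once.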

Cutting and pasting 1.6 million lines of data? I've never tried that before but something tells me it's not going to work :-o

One issue lies in the way text is stored within a PDF. Strings of text are typically broken up into arbitrary chunks, and not necessarily stored in the correct reading order. Trying to determine which chunks belong to which column, paragraph or sentence can be a challenge at times, and occasionally borders on the impossible.
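
To make the ordering problem concrete, here is a rough sketch (in Python, for brevity) of how reading order might be rebuilt from positioned chunks. It assumes some extraction tool has already given you each chunk as an `(x, y, text)` tuple in PDF coordinates, where `y` grows toward the top of the page. The function name and the tolerance value are made up for illustration, and real reports often need more care than this, for example when assigning chunks to columns.

```python
def rebuild_lines(chunks, y_tolerance=2.0):
    """Group (x, y, text) chunks into lines, top to bottom, left to right.

    Chunks whose y values differ by at most y_tolerance from the first chunk
    of a row are treated as belonging to that row.
    """
    rows = []  # each entry: (row_y, [(x, text), ...])
    # Larger y is higher on the page, so sort by descending y.
    for x, y, text in sorted(chunks, key=lambda c: -c[1]):
        if rows and abs(rows[-1][0] - y) <= y_tolerance:
            rows[-1][1].append((x, text))
        else:
            rows.append((y, [(x, text)]))
    # Within each row, order the chunks by x and join them with spaces.
    return [" ".join(t for _, t in sorted(items)) for _, items in rows]


# Chunks deliberately listed out of reading order.
chunks = [
    (100, 700.0, "World"),
    (10, 680.0, "Second"),
    (10, 700.5, "Hello"),
    (60, 680.0, "line"),
]
lines = rebuild_lines(chunks)
```

With the sample above, `lines` comes out as two strings in reading order even though the chunks were stored scrambled.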

Unless the PDF has been previously tagged of course. Tagging the contents of a PDF can offer a way to extract data in the correct reading order.

If going back to the original data that generated the PDF is not an option, try extracting the data with a third-party component. Search the web for "VB.NET PDF component" and you'll find there are several on the market. Text extraction is not a trivial task, so different components are likely to produce different results; it's worth trying a few to discover which one works best for you.
