Reading TEXT from a PDF file Using VB6

Question

levshlomo 0 Newbie Poster

9 Years Ago

I have a big PDF file that was created as a report. The file has 1.6M lines of data and no images at all.
Each page has a title with a page number, a header for the columns, and 28 data lines after it. Each line has 12 fields. Some data fields may be empty.
At the end there is a line marking the end of the data and, below it, a file creation date.
I would like to extract just the data lines. I have seen samples of writing to a PDF file, for example using the mjwPDF class, but could not find any sample code for reading text from a PDF.

file-system pdf visual-basic


All 3 Replies

rproffitt 2,706 https://5calls.org

9 Years Ago

https://www.google.com/#q=read+pdf+in+vb6 disagrees with you. That is, it appears it's been done before.

That said, I'd skip parsing the PDF itself: just open it in a reader, select all, and paste it into a plain text file. Then you're off to the races.

As to field counts, that's for your code to handle.
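
For what it's worth, here is a rough sketch of that filtering step, written in Python for brevity; the same logic should port to VB6 with `Line Input #` and `Split`. Everything about the layout is an assumption, since the thread never shows the actual text: it assumes fields are tab-separated, that the column header repeats verbatim on every page, and that titles, the end-of-data marker and the creation date do not split into 12 fields. The name `data_rows` is made up for illustration.

```python
from typing import Iterable, Iterator, List

FIELD_COUNT = 12  # from the question: each data line has 12 fields


def data_rows(lines: Iterable[str], header: str, sep: str = "\t") -> Iterator[List[str]]:
    """Yield the fields of each data line, skipping titles, headers and footers.

    Assumptions (not confirmed by the thread): fields are separated by `sep`,
    the column header repeats verbatim on every page, and page titles, the
    end-of-data marker and the creation date do not split into 12 fields.
    """
    header = header.rstrip("\r\n")
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == header:
            continue  # repeated column header
        fields = line.split(sep)
        if len(fields) != FIELD_COUNT:
            continue  # page title, end marker, date line, or blank line
        yield fields  # empty fields come through as ""
```

Because it is a generator, it can be fed an open file object and will process 1.6M lines one at a time without holding them all in memory. If the pasted text turns out to be space-aligned rather than tab-separated, the split rule would need to change to fixed column positions.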


LaxLoafer 71 Posting Whiz in Training · Answer 1 · 2016-03-14T21:36:38+00:00

Cutting and pasting 1.6 million lines of data? I've never tried that before but something tells me it's not going to work :-o

One issue lies in the way text is stored within a PDF. Strings of text are typically broken up into arbitrary chunks, and not necessarily stored in the correct reading order. Determining which chunks belong to which column, paragraph or sentence can be a challenge, and occasionally borders on the impossible.

Unless the PDF has been previously tagged of course. Tagging the contents of a PDF can offer a way to extract data in the correct reading order.

If resorting to the original data that generated the PDF is not an option, try extracting the data with a third-party component. Search the web for "VB.NET PDF component" and you'll find there are several on the market. Different components are likely to produce different results because text extraction is not a trivial task, so it's worth trying a few to discover which one works best for you.

rproffitt 2,706 https://5calls.org Moderator · Answer 2 · 2016-03-14T22:14:00+00:00

I have cut and pasted such amounts, so I can say it does work. Back in the 32-bit days or before, it was a problem.

And then there's Ghostscript if it must be automated. Search follows.
https://www.google.com/search?q=gswin64c+-sDEVICE%3Dtxtwrite+-o+output.txt+input.pdf
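
The command encoded in that search link is `gswin64c -sDEVICE=txtwrite -o output.txt input.pdf`. As a minimal sketch of driving it from a script (Python here; VB6's `Shell` function could launch the same command line), assuming Ghostscript is installed and `gswin64c` is on the PATH. The function names `txtwrite_command` and `pdf_to_text` are made up for illustration.

```python
import subprocess
from typing import List


def txtwrite_command(pdf_path: str, txt_path: str, exe: str = "gswin64c") -> List[str]:
    """Build the Ghostscript command line from the search link above."""
    # txtwrite is Ghostscript's text-extraction output device; -o names the output file.
    return [exe, "-sDEVICE=txtwrite", "-o", txt_path, pdf_path]


def pdf_to_text(pdf_path: str, txt_path: str, exe: str = "gswin64c") -> None:
    """Run Ghostscript and raise CalledProcessError if it exits with an error."""
    # Assumption: `exe` is installed and reachable on the PATH.
    subprocess.run(txtwrite_command(pdf_path, txt_path, exe), check=True)
```

The resulting text file can then be filtered line by line in whatever language you prefer. How well the columns survive depends on how the report was generated, so check a few pages of output before trusting it for all 1.6M lines.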
