Hi guys, I'm looking into writing some JavaScript (or using an API, of course) to search a collection of PDFs and return the results as JSON for processing (mainly displaying them back to the user). I've played around a bit with the Adobe Acrobat console, which is essentially a console running inside Adobe Acrobat Reader that lets you run JavaScript.
The following simple snippet runs in that console and returns a list of results in all the PDFs selected:

search.matchCase = false;
search.wordMatching = "MatchAnyWord";
search.bookmarks = true;
search.query("will","Folder","/C/Users/xxx/Desktop/PDFs");

I created a folder called PDFs containing 2 PDFs, and running this search returns all the results in a separate window. That's great, but I need to be able to "export" this functionality and package it in a script. Does anybody have any ideas? Or even better, has anybody done this before?
cheers


What did you use to retrieve text from the PDF files? And what is the format of the results? Did you read the PDF files using JavaScript or something else?

Nothing, the above is internal to Adobe Acrobat, so it literally is those lines of code pasted into this peculiar Adobe console. It returns the results in a nice window. I presume that under the hood it's a JSON object of some kind, but from there you don't have access to the code. I've looked a bit into pdf.js, https://github.com/mozilla/pdf.js, as that seems to be the way forward, although installing it and getting it up and running is proving rather challenging, as not everything works the way it should, especially running gulp. I failed in the office, so I'll try on my laptop at home this evening, hopefully with a clean install of Node.js and everything else, and I'm hoping to get some help from their IRC channel.
Has anybody used pdf.js?
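
For anyone else trying pdf.js for this, here is a minimal sketch of pulling the text out of one PDF page by page. It only uses calls from the pdf.js document API (getDocument, getPage, getTextContent). The function name extractPages and the idea of passing the library in as a parameter are my own assumptions, chosen so the sketch does not depend on how pdf.js was installed or built.

```javascript
// Sketch: collect the text of every page of one PDF using pdf.js.
// `pdfjsLib` is passed in by the caller (an assumption of this sketch) so the
// function does not care how pdf.js was installed or imported.
async function extractPages(pdfjsLib, source) {
  // getDocument() returns a loading task; its `promise` resolves to the document.
  const pdf = await pdfjsLib.getDocument(source).promise;
  const pages = [];
  // pdf.js numbers pages starting from 1.
  for (let pageNumber = 1; pageNumber <= pdf.numPages; pageNumber++) {
    const page = await pdf.getPage(pageNumber);
    const content = await page.getTextContent();
    // Each text item carries its string in `str`; join them into one string.
    const text = content.items.map((item) => item.str || "").join(" ");
    pages.push({ page: pageNumber, text });
  }
  return pages;
}
```

In the browser build the library object is the pdfjsLib global. In Node the import path depends on the pdf.js version, so check the release you installed. The result is an array of { page, text } objects, which is already easy to search and serialise.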

If you want to extract text from PDF files using JavaScript, you could try pdftotext on GitHub. It should be simpler than pdf.js; however, the extracted results may not be in JSON format.
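
Getting from plain extracted text to JSON is not much work, though. The following is a rough sketch, assuming you already have each file's text as a string with pages separated by form feed characters (which is how the pdftotext command-line tool separates pages by default, as far as I know). The function name searchDocuments and the shape of the result object are made up for illustration.

```javascript
// Sketch: search plain text extracted from PDFs and return the hits as JSON.
// Assumption: `documents` maps a file name to that file's extracted text,
// with pages separated by form feed characters ("\f").
function searchDocuments(documents, term) {
  const needle = term.toLowerCase();
  const results = [];
  // An empty search term would match everywhere, so treat it as "no hits".
  if (needle.length > 0) {
    for (const [file, text] of Object.entries(documents)) {
      text.split("\f").forEach((pageText, index) => {
        const haystack = pageText.toLowerCase();
        let position = haystack.indexOf(needle);
        while (position !== -1) {
          results.push({
            file,
            page: index + 1,
            // A little surrounding text so the hit can be shown to the user.
            snippet: pageText
              .slice(Math.max(0, position - 30), position + needle.length + 30)
              .trim(),
          });
          position = haystack.indexOf(needle, position + needle.length);
        }
      });
    }
  }
  return JSON.stringify({ query: term, count: results.length, results }, null, 2);
}
```

This is a case-insensitive substring match, so searching for "will" also hits "willing". It is only loosely comparable to matchCase = false in the Acrobat snippet and does no whole-word matching.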

One good approach would be to use a programming SDK that provides high-level classes for extracting text from documents, including PDFs. The following LEADTOOLS project shows how you can search for text inside a group of files in a folder:
https://www.leadtools.com/blog/document-imaging/ocr/directory-word-search-images-documents-25-projects-25-days/

You can use the same classes in a web service and return the search results to the client side as a JSON string.
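
To illustrate the web service idea in general terms, here is a small sketch of a request handler that answers a search with JSON. It is not LEADTOOLS code: it is plain JavaScript for Node, and the search function passed in is a hypothetical stand-in for whatever actually searches the documents on the server. The /search path and the q parameter are also just assumptions.

```javascript
// Sketch: wrap a search function in an HTTP handler that answers with JSON.
// `search` is a hypothetical function (term -> array of hits); it stands in
// for whatever actually searches the PDFs on the server.
function makeSearchHandler(search) {
  return function handleRequest(req, res) {
    // req.url holds only the path and query string, so give URL a dummy base.
    const url = new URL(req.url, "http://localhost");
    const term = url.searchParams.get("q");
    const headers = { "Content-Type": "application/json" };
    if (url.pathname !== "/search" || !term) {
      res.writeHead(400, headers);
      res.end(JSON.stringify({ error: "Expected /search?q=<term>" }));
      return;
    }
    res.writeHead(200, headers);
    res.end(JSON.stringify({ query: term, results: search(term) }));
  };
}
```

With Node's built-in http module this could be wired up as http.createServer(makeSearchHandler(mySearch)).listen(3000), where mySearch and the port number are placeholders. The client then fetches /search?q=will and parses the JSON response.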
