Extracting Phone Numbers from Documents (Word, PDF)

Question

Noorul Ariff -1 Light Poster

10 Years Ago

Hi Friends,
I want to build a resume parser in C#. That is, when we upload resumes (more than 100), it should extract the name, email ID, phone number, and skills from each one.

Please don't tell me that software for this is already available. I tried those tools, but they do not work properly, so I want to build it myself.

Please help me.

c#

Edited 10 Years Ago by pritaeas because: Moved to C#



overwraith 83 Newbie Poster · Answer 1 · 2014-12-18T01:22:41+00:00

The following is Java code for extracting the text from PDF files:

import java.io.IOException;

import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfReaderContentParser;
import com.itextpdf.text.pdf.parser.SimpleTextExtractionStrategy;
import com.itextpdf.text.pdf.parser.TextExtractionStrategy;

public class PDFToText {

    public static void main(String[] args) {
        for (int i = 0; i < args.length; i++) {
            try {
                System.out.println(wordsInFile(args[i]));
            } catch (IOException e) {
                e.printStackTrace();
            }
        }

    }//end main

    //extracts the text of every page in the given pdf document and returns it as one string
    public static String wordsInFile(String pdf) throws IOException {
        PdfReader reader = new PdfReader(pdf);
        PdfReaderContentParser parser = new PdfReaderContentParser(reader);
        TextExtractionStrategy strategy;
        StringBuilder fullText = new StringBuilder();

        //Only a single page in memory at a given time
        for (int i = 1; i <= reader.getNumberOfPages(); i++) {
            strategy = (TextExtractionStrategy) parser.processContent(i, new SimpleTextExtractionStrategy());
            //one page of text
            fullText.append(strategy.getResultantText());
        }//end loop
        reader.close();

        return fullText.toString();
    }//end method

}//end class

The reason this is relevant is that instead of iText you can use iTextSharp, a C# port of iText. I am reading a book called "iText in Action", which covers the Java version of the library. I think this is where iTextSharp can be downloaded:
http://sourceforge.net/projects/itextsharp/
I am unsure how to parse Word documents, but I seem to remember some CodeProject articles that discuss the process a little. Those projects are quite different from the iText library.
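
Once the text is out of the document, the phone numbers and email addresses asked about in the question could probably be pulled out with regular expressions. Below is a rough sketch in Java (to match the code above; the same idea should carry over to C#'s Regex class). The class name ContactExtractor, its method names, and both patterns are my own assumptions for illustration, not anything from iText. The phone pattern is deliberately loose, so it will also catch other long digit runs such as "2010-2014", and it does not handle parentheses; expect to tune it against your real resumes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: scans already-extracted text for phone numbers and emails.
class ContactExtractor {

    // Assumed pattern: optional leading '+', then 7 to 15 digits, each pair of
    // digits separated by at most one space, dot, or hyphen.
    private static final Pattern PHONE =
            Pattern.compile("\\+?\\d(?:[ .-]?\\d){6,14}");

    // Assumed pattern: a simplified email shape, not a full RFC 5322 validator.
    private static final Pattern EMAIL =
            Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");

    // Returns every non-overlapping match of the pattern, in order of appearance.
    static List<String> findAll(Pattern pattern, String text) {
        List<String> matches = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    static List<String> phoneNumbers(String text) {
        return findAll(PHONE, text);
    }

    static List<String> emails(String text) {
        return findAll(EMAIL, text);
    }

    public static void main(String[] args) {
        // In practice this string would be the text returned by the PDF extraction step.
        String sample = "John Doe\nEmail: john.doe@example.com\n"
                + "Phone: +1 555-010-9999\nSkills: C#, SQL";
        System.out.println(phoneNumbers(sample)); // prints [+1 555-010-9999]
        System.out.println(emails(sample));       // prints [john.doe@example.com]
    }
}
```

Names and skills are harder, because they have no fixed shape. Matching the text against a list of known skill keywords is one possible starting point, but that is a guess on my part and not something I have tested.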