Content extraction using Apache Tika

Traevel 216 Light Poster

7 Years Ago


From the official website:

The Apache Tika™ toolkit detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF). All of these file types can be parsed through a single interface, making Tika useful for search engine indexing, content analysis, translation, and much more.

In this tutorial we will implement the four most important features of Apache Tika (as of version 1.14).

Table of contents

Is this tutorial for me?
Requirements
How can I detect a file's type?
How can I extract a file's content?
How can I extract a file's metadata?
How can I detect a file's language?
Final thoughts
External Resources

Is this tutorial for me?

What can Tika do?

Filetype detection
Language detection
Metadata extraction
Content extraction

What else can Tika do?

Named Entity Recognition (in combination with OpenNLP)
Image recognition (in combination with Tensorflow)
Language translation (in combination with Joshua)

What can't Tika do?

Bypass a password
Bypass file encryption

Can Tika handle filetype ... ?

Yes.

Really?

Yes.

What if I make one up?

Yes.

Send me $100!

Ye... wait a minute!

Requirements

Java SDK 1.7+ (required)
JUnit 4 to unit test with (required)(optional if you just want to watch the world burn)
An IDE (optional)

Note that this tutorial will use Java 8.

To include Tika in your project add the required dependency.

Maven

<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-app</artifactId>
    <version>1.14</version>
</dependency>

Gradle

dependencies {
    implementation 'org.apache.tika:tika-app:1.14'
}

Note: you can also use tika-app-1.14.jar as a standalone command-line utility. See the external resources at the end of the tutorial for the download link.

How can I detect a file's type?

Start by creating a new class called FileTypeDetector and create the following method:

/**
 * Detect a file's MIME type and return it as a string.
 *
 * @param filePath the path to the file.
 * @return the file's MIME type.
 * @throws Exception for brevity, this is not good practice.
 */
public String getFileType(final String filePath) throws Exception {

    // initialize a new Tika Configuration
    TikaConfig tikaConfig = new TikaConfig();

    // create a new Tika Metadata object
    Metadata metadata = new Metadata();

    // turn file path string into a path object
    Path path = Paths.get(filePath);

    // determine the media type of the file, closing the stream afterwards
    MediaType mediaType;
    try (TikaInputStream stream = TikaInputStream.get(path)) {
        mediaType = tikaConfig.getDetector().detect(stream, metadata);
    }

    // return the type and subtype of the media type object
    return mediaType.getType() + "/" + mediaType.getSubtype();
}

Ensure that Metadata and MediaType are imported from the correct source:

import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MediaType;

Next, create a unit test for the detector called FileTypeDetectorTest and create the following method:

@Rule
public TemporaryFolder folder = new TemporaryFolder();

@Test
public void testGetFileType() throws Exception {

    // create a file in the temp folder
    File plainTextFile = folder.newFile("plain.txt");
    // write to it as if it is a text file
    Files.write("This is plain text", plainTextFile, Charset.defaultCharset());
    // assert that its mime type comes back as text/plain
    assertThat(new FileTypeDetector().getFileType(plainTextFile.getPath()),
            is("text/plain"));

    // create a file in the temp folder
    File binaryFile = folder.newFile("binary.bin");
    // write to it as if it is a binary file
    Files.write(new byte[]{}, binaryFile);
    // assert that its mime type comes back as application/octet-stream
    assertThat(new FileTypeDetector().getFileType(binaryFile.getPath()),
            is("application/octet-stream"));

    // create a new image
    BufferedImage image = new BufferedImage(10, 10, BufferedImage.TYPE_INT_RGB);
    // create a file in the temp folder
    File imageFile = folder.newFile("image.png");
    // write to it as if it is a png image
    ImageIO.write(image, "png", imageFile);
    // assert that its mime type comes back as image/png
    assertThat(new FileTypeDetector().getFileType(imageFile.getPath()),
            is("image/png"));

    // create a file in the temp folder, with a wrong extension
    File wrongExtensionFile = folder.newFile("image.txt");
    // write to it as if it is a png image
    ImageIO.write(image, "png", wrongExtensionFile);
    // make sure it's still a "txt" file
    assertThat(wrongExtensionFile.getName(), is("image.txt"));
    // assert that its mime type comes back as image/png
    assertThat(new FileTypeDetector().getFileType(wrongExtensionFile.getPath()),
            is("image/png"));
}

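Ensure that the is matcher is imported from Hamcrest. Also note that the Files.write calls above use the argument order of Guava's com.google.common.io.Files, not java.nio.file.Files, so Guava needs to be on the test classpath:

import static org.hamcrest.core.Is.is;
import com.google.common.io.Files;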

If everything has gone correctly you should be able to run the unit test successfully.

Note that this unit test uses features from JUnit 4 to easily create temporary folders and files which will be automatically deleted at the end of the test.
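
The last assertion passes because detection looks at what is inside the file rather than trusting its extension. The sketch below illustrates that idea in plain Java for a single format. It is not how Tika is implemented (Tika consults a large registry of signatures and name patterns, and also recognises plain text), and the class name MagicByteSniffer is made up for this example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

public class MagicByteSniffer {

    // the eight bytes that every PNG file starts with
    private static final byte[] PNG_SIGNATURE = {
            (byte) 0x89, 'P', 'N', 'G', '\r', '\n', 0x1A, '\n'
    };

    /** Guess a MIME type from the leading bytes of some content. */
    public static String sniff(final byte[] leadingBytes) {
        if (leadingBytes.length >= PNG_SIGNATURE.length
                && Arrays.equals(
                        Arrays.copyOf(leadingBytes, PNG_SIGNATURE.length),
                        PNG_SIGNATURE)) {
            return "image/png";
        }
        // anything unrecognised falls back to the generic binary type
        return "application/octet-stream";
    }

    /** Read only the first few bytes of a file and sniff them. */
    public static String sniff(final Path path) throws IOException {
        byte[] header = new byte[PNG_SIGNATURE.length];
        int total = 0;
        try (InputStream in = Files.newInputStream(path)) {
            int n;
            // read() may return fewer bytes than asked for, so loop
            while (total < header.length
                    && (n = in.read(header, total, header.length - total)) != -1) {
                total += n;
            }
        }
        return sniff(Arrays.copyOf(header, total));
    }
}
```

The file name never enters the decision, which is why a PNG saved as image.txt is still reported as image/png.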

How can I extract a file's content?

This is where it can get complex. A large number of parsers and content handlers are available, suited to all sorts of complicated processing of text files, as well as a number of possible return types that can be used as direct input to other toolkits (e.g. Apache Lucene). However, for the purpose of this tutorial we will stick to two simple versions.

Start by creating a new class called FileContentExtraction and add the following method:

/**
 * The simplest way of extracting a file's content.
 *
 * @param filePath the path to the file.
 * @return Its content in a string
 * @throws Exception for brevity, this is not good practice.
 */
public String extractContentSimple(final String filePath) throws Exception {

    // create a new Tika facade
    Tika tika = new Tika();

    // parse using the facade
    return tika.parseToString(Paths.get(filePath));
}

That's all there is to it. Anything more complex is just a matter of tweaking the parsing, output, translation, etc.

Now create a new unit test called FileContentExtractionTest and add the following method:

@Rule
public TemporaryFolder folder = new TemporaryFolder();

@Test
public void testExtractContentSimple() throws Exception {
    // create a file in the temp folder
    File plainTextFile = folder.newFile("plain.txt");

    // write to it as if it is a text file
    Files.write("This is plain text", plainTextFile, Charset.forName("UTF-8"));

    // assert the content is what we put in above
    // because a line break will be added (and that would be platform dependent)
    // we only check the beginning
    assertThat(new FileContentExtraction().extractContentSimple(plainTextFile.getPath()),
            startsWith("This is plain text"));
}

Ensure that startsWith is imported from the correct source:

import static org.hamcrest.core.StringStartsWith.startsWith;

Now, for a slightly more configurable solution we'll add a new method to the FileContentExtraction class:

/**
 * A slightly more controlled way of extracting a file's content.
 *
 * @param filePath the path to the file.
 * @return Its content in a string
 * @throws Exception for brevity, this is not good practice.
 */
public String extractContent(final String filePath) throws Exception {

    // create a new auto detect parser
    AutoDetectParser parser = new AutoDetectParser();

    // create a new content handler
    ContentHandler handler = new BodyContentHandler();

    // create a new metadata object
    Metadata metadata = new Metadata();

    // try-with-resources to parse the inputstream
    try (InputStream stream = new FileInputStream(new File(filePath))) {

        // parse the stream and add data to handler and metadata
        parser.parse(stream, handler, metadata);

        // return the result
        return handler.toString();
    }
}

We're still using the default parser and a standard body handler, but by specifying them explicitly we can swap them out at a later stage if we so wish. For instance, if we know the language of the file, we could add language-specific processing such as stemming, so that running, run and runs would all become run. As far as I know Tika itself does not stem words, so that step would typically live in a downstream analyzer. It is useful when we're hooking up the content extraction to a search engine, for example.
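
One concrete reason to construct the handler ourselves is the output limit. To my knowledge, a BodyContentHandler created with the no-argument constructor stops after 100,000 characters and the parse then fails with a SAXException, while the facade's parseToString quietly truncates at the same default length. Here is a minimal sketch, assuming we want the full text regardless of size. The class and method names are made up for illustration.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

public class UnlimitedContentExtraction {

    /**
     * Extract a file's full text without the default character limit.
     * Hypothetical helper, not part of the original tutorial classes.
     */
    public String extractAll(final String filePath) throws Exception {

        AutoDetectParser parser = new AutoDetectParser();

        // a write limit of -1 disables the default character limit
        BodyContentHandler handler = new BodyContentHandler(-1);

        Metadata metadata = new Metadata();

        try (InputStream stream = Files.newInputStream(Paths.get(filePath))) {
            parser.parse(stream, handler, metadata);
        }

        return handler.toString();
    }
}
```

Keep in mind that the whole text ends up in memory, so for very large documents a streaming handler may be the better choice.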

All that's left is adding the unit test to our FileContentExtractionTest which is identical to the one we made before:

@Test
public void testExtractContent() throws Exception {
    // create a file in the temp folder
    File plainTextFile = folder.newFile("plain.txt");

    // write to it as if it is a text file
    Files.write("This is plain text", plainTextFile, Charset.forName("UTF-8"));

    // assert the content is what we put in above
    // because a line break will be added (and that would be platform dependent)
    // we only check the beginning
    assertThat(new FileContentExtraction().extractContent(plainTextFile.getPath()),
            startsWith("This is plain text"));
}

Note that this unit test uses features from JUnit 4 to easily create temporary folders and files which will be automatically deleted at the end of the test.

How can I extract a file's metadata?

Reading the metadata becomes trivial since we can re-use the code for extracting content. As an added bonus, the Tika parser attaches an extra attribute to the metadata it returns: the name of the parser it used (X-Parsed-By). This attribute is not stored in the original file, but requesting other (actual) attributes such as Content-Type follows the exact same steps. Those are, however, more difficult to unit test, as adding metadata to a file from within Java is non-trivial.

Create a new class called FileMetadataExtractor and create the following method:

/**
 * Retrieves a metadata attribute from a file's metadata.
 *
 * @param filePath the path to the file.
 * @param attribute the attribute to extract from the metadata.
 * @return the requested attribute value.
 * @throws Exception for brevity, this is not good practice.
 */
public String getMetadata(final String filePath, final String attribute) throws Exception {

    // create a new auto detect parser
    AutoDetectParser parser = new AutoDetectParser();

    // create a new content handler
    ContentHandler handler = new BodyContentHandler();

    // create a new metadata object
    Metadata metadata = new Metadata();

    // try-with-resources to parse the inputstream
    try (InputStream stream = new FileInputStream(new File(filePath))) {

        // parse the stream and add data to handler and metadata
        parser.parse(stream, handler, metadata);
    }

    // return the requested attribute
    return metadata.get(attribute);
}

Ensure that Metadata and BodyContentHandler are imported from the correct source:

import org.apache.tika.metadata.Metadata;
import org.apache.tika.sax.BodyContentHandler;

Next, create a new unit test called FileMetadataExtractorTest and create the following method:

@Rule
public TemporaryFolder folder = new TemporaryFolder();

@Test
public void testGetMetadata() throws Exception {

    // create a file in the temp folder
    File plainTextFile = folder.newFile("plain.txt");

    // write to it as if it is a text file
    Files.write("This is plain text", plainTextFile, Charset.defaultCharset());

    // assert that the attribute X-Parsed-By (added by the Tika parser) equals the DefaultParser
    assertThat(new FileMetadataExtractor().getMetadata(plainTextFile.getPath(), "X-Parsed-By"),
            is("org.apache.tika.parser.DefaultParser"));
}

Ensure that the is matcher is imported from the correct source:

import static org.hamcrest.core.Is.is;

If everything has gone correctly you should be able to run the unit test successfully.

Note that this unit test uses features from JUnit 4 to easily create temporary folders and files which will be automatically deleted at the end of the test.
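
The getMetadata method needs the attribute name up front, and metadata.get only returns the first value of an attribute that holds several (X-Parsed-By typically lists more than one parser). While exploring an unfamiliar file it can help to dump everything instead. Below is a minimal sketch of that. The class name FileMetadataDump is made up, and it passes a do-nothing SAX DefaultHandler because here we only care about the metadata, not the text.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.xml.sax.helpers.DefaultHandler;

public class FileMetadataDump {

    /**
     * Collect every metadata attribute Tika reports for a file.
     * Hypothetical helper, not part of the original tutorial classes.
     */
    public Map<String, List<String>> getAllMetadata(final String filePath) throws Exception {

        AutoDetectParser parser = new AutoDetectParser();
        Metadata metadata = new Metadata();

        try (InputStream stream = Files.newInputStream(Paths.get(filePath))) {
            // DefaultHandler ignores all content events, so the text is discarded
            parser.parse(stream, new DefaultHandler(), metadata);
        }

        // sorted by attribute name, keeping all values of multi-valued attributes
        Map<String, List<String>> result = new TreeMap<>();
        for (String name : metadata.names()) {
            result.put(name, Arrays.asList(metadata.getValues(name)));
        }
        return result;
    }
}
```

Printing the returned map for a handful of sample files is a quick way to find out which attribute names are worth asking for.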

How can I detect a file's language?

Start by creating a new class called FileLanguageDetector and create the following method:

/**
 * Determine the language of a string.
 *
 * @param text the text to analyze.
 * @return the language of <b>text</b>.
 * @throws Exception for brevity, this is not good practice.
 */
public String getLanguage(final String text) throws Exception {

    // create a new language detector
    LanguageDetector detector = new OptimaizeLangDetector();

    // load the models in the detector
    detector.loadModels();

    // fetch and store the result
    LanguageResult result = detector.detect(text);

    // return the detected language
    return result.getLanguage();
}

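Note that OptimaizeLangDetector is not a typo. It and the two related types should be imported from:

import org.apache.tika.langdetect.OptimaizeLangDetector;
import org.apache.tika.language.detect.LanguageDetector;
import org.apache.tika.language.detect.LanguageResult;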

Next, create a unit test for the detector called FileLanguageDetectorTest and create the following method:

@Test
public void testGetLanguage() throws Exception {

    // assert that the string is in english
    assertThat(new FileLanguageDetector().getLanguage("This is written in..."),
            is("en"));

    // assert that the string is in spanish
    assertThat(new FileLanguageDetector().getLanguage("Donde esta la Biblioteca?"),
            is("es"));

    // assert that the string is in french
    assertThat(new FileLanguageDetector().getLanguage("Qu’est-ce que tu veux?"),
            is("fr"));

    // assert that the string is in finnish
    assertThat(new FileLanguageDetector().getLanguage("Hyvää päivänjatkoa"),
            is("fi"));
}

If everything has gone correctly you should be able to run the unit test successfully.

Note that language detection will never be 100% accurate and mistakes will be made. In an ideal world you would try to extract the language of a file from its metadata attributes. However, this method can be used when no such data is provided and you still want to perform language-specific operations (e.g. selecting processing designed for that language or handling files written in more than one language).

As an expansion on the above you could concatenate multiple lines from the file and analyze it as a whole for a more accurate result.
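
A minimal sketch of that expansion, assuming the text is already available as lines (for example from a UTF-8 text file, or from the content extracted earlier and split on line breaks). The class name LanguageSample and the choice to skip blank lines are mine, not something Tika prescribes.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;

public class LanguageSample {

    /** Join the first maxLines non-blank lines into a single string. */
    public static String join(final List<String> lines, final int maxLines) {
        return lines.stream()
                .map(String::trim)
                .filter(line -> !line.isEmpty())
                .limit(maxLines)
                .collect(Collectors.joining(" "));
    }

    /**
     * Convenience wrapper for plain text files.
     * Assumes UTF-8 and reads the whole file, which is fine for a sketch.
     */
    public static String fromFile(final Path path, final int maxLines) throws IOException {
        return join(Files.readAllLines(path, StandardCharsets.UTF_8), maxLines);
    }
}
```

The resulting string could then be passed to getLanguage from above, which gives the detector more than a single short sentence to work with.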

Final thoughts

This is by no means code to be put in a production environment; it is merely intended as a beginner's guide to Apache Tika. There will be mistakes, typos, and other errors. If you've found one, be sure to mention it in a reply so that a mod might be able to correct it as time moves forward.

Many improvements and expansions can be made on the examples shown here. Extracting content (or other file specific data) is usually the first step in a larger process.

If you're still wondering why Tika could be useful when extracting content, remember this: we merely extracted content from a few text files, but the same code can be used to extract text from virtually any format. PDF, EPUB, HTML, DOCX... even your very own custom format.

External Resources

Online
Official website
Download
Tika API (1.14)
Tika commandline
Tika Source

Offline
Note, the following book is a few years old but has been written by the developers themselves.

Tika In Action
Chris A. Mattmann and Jukka L. Zitting
December 2011
ISBN 9781935182856
256 pages
