1,105,386 Community Members

TF/IDF

Member Avatar
timzter
Newbie Poster
2 posts since Apr 2012
Reputation Points: 0 [?]
Q&As Helped to Solve: 0 [?]
Skill Endorsements: 0 [?]
 
0
 

I am currently building a search engine using Javaand I'm trying to do TF/IDF. I have got the TF working but am no stuck on the IDF. This is what I have done so far but I can't compile it as I get errors on the imports. Anyone know a way around this.

Thanks in advance

// Source: src/main/java/net/sf/jtmt/indexers/IdfIndexer.java
package net.sf.jtmt.indexers;

import org.apache.commons.collections15.Transformer;
import org.apache.commons.math.linear.RealMatrix;

/**
 * Reduces the weight of words which are commonly found (ie in more
 * documents). The factor by which it is reduced is chosen from the book
 * as:
 * f(m) = 1 + log(N/d(m))
 * where N = total number of docs in collection
 *       d(m) = number of docs containing word m
 * so where a word is more frequent (ie d(m) is high, f(m) would be low.
 */
public class IdfIndexer implements Transformer<RealMatrix,RealMatrix> {

  public RealMatrix transform(RealMatrix matrix) {
    // Phase 1: apply IDF weight to the raw word frequencies
    int n = matrix.getColumnDimension();
    for (int j = 0; j < matrix.getColumnDimension(); j++) {
      for (int i = 0; i < matrix.getRowDimension(); i++) {
        double matrixElement = matrix.getEntry(i, j);
        if (matrixElement > 0.0D) {
          double dm = countDocsWithWord(
            matrix.getSubMatrix(i, i, 0, matrix.getColumnDimension() - 1));
          matrix.setEntry(i, j, matrix.getEntry(i,j) * (1 + Math.log(n) - Math.log(dm)));
        }
      }
    }
    // Phase 2: normalize the word scores for a single document
    for (int j = 0; j < matrix.getColumnDimension(); j++) {
      double sum = sum(matrix.getSubMatrix(0, matrix.getRowDimension() -1, j, j));
      for (int i = 0; i < matrix.getRowDimension(); i++) {
        matrix.setEntry(i, j, (matrix.getEntry(i, j) / sum));
      }
    }
    return matrix;
  }

  private double sum(RealMatrix colMatrix) {
    double sum = 0.0D;
    for (int i = 0; i < colMatrix.getRowDimension(); i++) {
      sum += colMatrix.getEntry(i, 0);
    }
    return sum;
  }

  private double countDocsWithWord(RealMatrix rowMatrix) {
    double numDocs = 0.0D;
    for (int j = 0; j < rowMatrix.getColumnDimension(); j++) {
      if (rowMatrix.getEntry(0, j) > 0.0D) {
        numDocs++;
      }
    }
    return numDocs;
  }
}
Member Avatar
JamesCherrill
... trying to help
10,387 posts since Apr 2008
Reputation Points: 2,081 [?]
Q&As Helped to Solve: 1,752 [?]
Skill Endorsements: 47 [?]
Moderator
Featured
 
0
 

Hi timzter
If you are looking for help here you need to provide enough information.
"I get errors" doesn't help. Post the full exact text of all error messages.

Member Avatar
timzter
Newbie Poster
2 posts since Apr 2012
Reputation Points: 0 [?]
Q&As Helped to Solve: 0 [?]
Skill Endorsements: 0 [?]
 
0
 

The first error which I can't get past is:
The import org.apache cannot be resolved

Thanks

Member Avatar
NormR1
Posting Sage
7,723 posts since Jun 2010
Reputation Points: 563 [?]
Q&As Helped to Solve: 793 [?]
Skill Endorsements: 16 [?]
Team Colleague
 
0
 

You probably need the jar file that contains the classes in the package you are trying to import. When you get it, you'll have to put it on the classpath so the compiler can find it.

Try Googling for the where you can download the jar file

You
This article has been dead for over three months: Start a new discussion instead
Post:
Start New Discussion
Tags Related to this Article