
TF/IDF

I am currently building a search engine in Java and I'm trying to implement TF/IDF. I have the TF part working but am now stuck on the IDF. This is what I have so far, but I can't compile it because I get errors on the imports. Does anyone know a way around this?

Thanks in advance

// Source: src/main/java/net/sf/jtmt/indexers/IdfIndexer.java
package net.sf.jtmt.indexers;

import org.apache.commons.collections15.Transformer;
import org.apache.commons.math.linear.RealMatrix;

/**
 * Reduces the weight of words which are commonly found (ie in more
 * documents). The factor by which it is reduced is chosen from the book
 * as:
 * f(m) = 1 + log(N/d(m))
 * where N = total number of docs in collection
 *       d(m) = number of docs containing word m
 * so where a word is more frequent (ie d(m) is high), f(m) would be low.
 */
public class IdfIndexer implements Transformer<RealMatrix,RealMatrix> {

  public RealMatrix transform(RealMatrix matrix) {
    // Phase 1: apply IDF weight to the raw word frequencies
    int n = matrix.getColumnDimension();
    for (int j = 0; j < matrix.getColumnDimension(); j++) {
      for (int i = 0; i < matrix.getRowDimension(); i++) {
        double matrixElement = matrix.getEntry(i, j);
        if (matrixElement > 0.0D) {
          double dm = countDocsWithWord(
            matrix.getSubMatrix(i, i, 0, matrix.getColumnDimension() - 1));
          matrix.setEntry(i, j, matrix.getEntry(i,j) * (1 + Math.log(n) - Math.log(dm)));
        }
      }
    }
    // Phase 2: normalize the word scores for a single document
    for (int j = 0; j < matrix.getColumnDimension(); j++) {
      double sum = sum(matrix.getSubMatrix(0, matrix.getRowDimension() -1, j, j));
      for (int i = 0; i < matrix.getRowDimension(); i++) {
        matrix.setEntry(i, j, (matrix.getEntry(i, j) / sum));
      }
    }
    return matrix;
  }

  private double sum(RealMatrix colMatrix) {
    double sum = 0.0D;
    for (int i = 0; i < colMatrix.getRowDimension(); i++) {
      sum += colMatrix.getEntry(i, 0);
    }
    return sum;
  }

  private double countDocsWithWord(RealMatrix rowMatrix) {
    double numDocs = 0.0D;
    for (int j = 0; j < rowMatrix.getColumnDimension(); j++) {
      if (rowMatrix.getEntry(0, j) > 0.0D) {
        numDocs++;
      }
    }
    return numDocs;
  }
}
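
For comparison, here is a minimal sketch of the same weighting, f(m) = 1 + log(N/d(m)) followed by per-document normalization, written against plain `double[][]` arrays so it compiles without any external jars. The class name `PlainIdf` and the method `applyIdf` are made up for illustration and are not part of jtmt. It assumes the same layout as the class above (rows are words, columns are documents), returns a new array instead of modifying its input, and skips empty columns rather than dividing by zero, which is a small departure from the posted code.

```java
class PlainIdf {

  // tf[i][j] = raw frequency of word i in document j (assumed layout).
  static double[][] applyIdf(double[][] tf) {
    int numWords = tf.length;
    int numDocs = numWords == 0 ? 0 : tf[0].length;
    double[][] weighted = new double[numWords][numDocs];

    // Phase 1: multiply each frequency by 1 + log(N / d(m)).
    for (int i = 0; i < numWords; i++) {
      int docsWithWord = 0;
      for (int j = 0; j < numDocs; j++) {
        if (tf[i][j] > 0.0) {
          docsWithWord++;
        }
      }
      if (docsWithWord == 0) {
        continue; // word appears nowhere, its row stays all zeros
      }
      double idf = 1 + Math.log((double) numDocs / docsWithWord);
      for (int j = 0; j < numDocs; j++) {
        weighted[i][j] = tf[i][j] * idf;
      }
    }

    // Phase 2: scale each document column so its weights sum to 1.
    for (int j = 0; j < numDocs; j++) {
      double sum = 0.0;
      for (int i = 0; i < numWords; i++) {
        sum += weighted[i][j];
      }
      if (sum > 0.0) { // leave empty documents untouched
        for (int i = 0; i < numWords; i++) {
          weighted[i][j] /= sum;
        }
      }
    }
    return weighted;
  }

  public static void main(String[] args) {
    // Two words, two documents: word 0 is in both, word 1 only in the first.
    double[][] tf = { {1, 1}, {2, 0} };
    double[][] w = applyIdf(tf);
    for (double[] row : w) {
      System.out.println(java.util.Arrays.toString(row));
    }
  }
}
```

Because word 1 occurs in only one of the two documents, it gets the larger factor (1 + log 2), while word 0, which occurs in both, gets a factor of exactly 1.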
timzter

Hi timzter
If you are looking for help here you need to provide enough information.
"I get errors" doesn't help. Post the full exact text of all error messages.

JamesCherrill

The first error which I can't get past is:
The import org.apache cannot be resolved

Thanks

timzter

You probably need the jar file that contains the classes in the package you are trying to import. When you get it, you'll have to put it on the classpath so the compiler can find it.

Try searching online for where you can download the jar file.
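
Once the jars are in place, a quick way to confirm they are actually visible is to ask the class loader for the two imported types by name. The sketch below is only an illustration: `ClasspathCheck` and `isOnClasspath` are made-up names, and the thread does not say which jars or versions are needed. The package names suggest the pre-3.x Apache Commons Math namespace and a generics port of Commons Collections, but that is an assumption. The program itself compiles without those jars because the class names are plain strings.

```java
class ClasspathCheck {

  // Returns true if the named class can be located by this class's loader.
  static boolean isOnClasspath(String className) {
    try {
      // 'false' means: locate the class but do not run its static initializers.
      Class.forName(className, false, ClasspathCheck.class.getClassLoader());
      return true;
    } catch (ClassNotFoundException | LinkageError e) {
      return false;
    }
  }

  public static void main(String[] args) {
    // The two types imported by IdfIndexer in the original post.
    String[] needed = {
      "org.apache.commons.collections15.Transformer",
      "org.apache.commons.math.linear.RealMatrix"
    };
    for (String name : needed) {
      System.out.println(name + (isOnClasspath(name) ? " found" : " MISSING"));
    }
  }
}
```

Run it with the same `-cp` (classpath) value you pass to `javac`. If either line prints MISSING, the compiler will not find that package either. In Eclipse, where the "cannot be resolved" wording comes from, the equivalent step is adding the jars to the project's build path.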

NormR1
