I have a task to do. I will describe it briefly:

I have a file of 20,000 lines. The task is to recognize groups of similar lines.

ex1 : "The quick brown fox jumps on a lazy dog"
ex2 : "The quick brown dog jumps on a very lazy fox"

I want these two sentences to count as one group. So among the 20,000 lines I have to work out how many groups are present.

Time complexity is the major issue. Is there any approach I can go ahead with?

An idea: assign each new word a number, and for each line store an array of the numbers representing its words. Comparing numbers should be faster than comparing Strings.
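A minimal sketch of that idea in Java (the class and method names here are my own, not from the thread): a `HashMap` hands out the next integer ID the first time a word is seen, so every line becomes an `int[]` that can be compared cheaply.

```java
import java.util.*;

public class WordEncoder {
    private final Map<String, Integer> ids = new HashMap<>();

    // Returns the line as an array of word IDs; unseen words get the next ID.
    public int[] encode(String line) {
        String[] words = line.toLowerCase().split("\\s+");
        int[] codes = new int[words.length];
        for (int i = 0; i < words.length; i++) {
            // size() is read before the new entry is inserted, so IDs are 0,1,2,...
            codes[i] = ids.computeIfAbsent(words[i], w -> ids.size());
        }
        return codes;
    }

    public static void main(String[] args) {
        WordEncoder enc = new WordEncoder();
        System.out.println(Arrays.toString(enc.encode("The quick brown fox")));
        // → [0, 1, 2, 3]
        System.out.println(Arrays.toString(enc.encode("The quick brown dog")));
        // → [0, 1, 2, 4]  (only "dog" is new)
    }
}
```

Comparing two `int[]` with `Arrays.equals` is then a single pass over primitives rather than character-by-character String comparison.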

When you talk about time complexity, do you consider space as well? If not, then you may try the following. A LineDataComparison object holds:

  - lineNumber (int)
  - content (String[])  // each string is a word of the line
  - lineObj (LineDataObject)  // the parsed line data
  - duplicatedLineNumber (ArrayList<Integer>)  // numbers of the lines that match this one

and the collection lineComp holds:

  - allLines (ArrayList<LineDataComparison>)

lineComp <- a new, empty LineDataComparison collection
lineNumber <- 0
read <- the first line of the file

while read!=EOF
  lineNumber <- lineNumber + 1
  aLine <- read the line
  lObj <- create LineDataComparison using the line data (aLine) and lineNumber
  if lObj does not match any entry already in lineComp
    add lObj to lineComp
  else  // discard the newly created object and record the line number instead
    add lineNumber to the matched LineDataComparison in lineComp
  end if
  read <- next line
end while

Reading the file is O(n). The overall time cost depends on how lObj is compared against lineComp: a plain linear scan is O(n²·m) in the worst case, where m is the number of words in a line, while keeping lineComp sorted (or hashed) can bring it down to around O(n log n · m). The space cost is bounded by the whole content of the file, though it may be less than that because duplicates are stored only as line numbers. Breaking out of the scan as soon as a match is found improves the average case too.
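The loop above can be sketched with a `HashMap` doing the "is this already in lineComp" check in O(1) amortized time, giving roughly O(n·m) overall. One caveat: the thread never pins down what "similar" means, so this sketch assumes the simplest rule, that two lines match when they contain the same words regardless of order (the class name `LineGrouper` and the key function are my own). Note this would not group the fox/dog example, which differ by the word "very"; true fuzzy similarity needs a distance threshold on top of this.

```java
import java.io.*;
import java.util.*;

public class LineGrouper {
    // Canonical key: lowercase the words and sort them, so word order is ignored.
    // (Assumption: "similar" = same multiset of words; adapt the rule as needed.)
    static String keyOf(String line) {
        String[] words = line.toLowerCase().split("\\s+");
        Arrays.sort(words);
        return String.join(" ", words);
    }

    public static void main(String[] args) throws IOException {
        // Maps canonical key -> line numbers in that group.
        Map<String, List<Integer>> groups = new LinkedHashMap<>();
        int lineNumber = 0;
        try (BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
            String line;
            while ((line = in.readLine()) != null) {
                lineNumber++;
                groups.computeIfAbsent(keyOf(line), k -> new ArrayList<>())
                      .add(lineNumber);
            }
        }
        System.out.println("Groups: " + groups.size());
    }
}
```

Each line is hashed once and looked up once, so duplicate detection no longer scales with the number of groups already seen.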
