Hi, my friends and I are doing a project on "text summarization". I would like your help in developing the following modules.
If anyone knows about this, please help me as soon as possible.
Here are the modules:
1. Text pre-processor (convert .doc to .txt)
2. Sentence separator
3. Word separator
4. Stop word eliminator
5. Word frequency calculator
6. Scoring algorithms
7. Ranking algorithms

I need your help badly, so please reply soon.
Thanks in advance.

>>1. Text pre-processor (convert .doc to .txt)
When you load a .doc file, you also load its formatting and other embedded data. You need to weed that out.
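If you end up working in Python with the newer .docx format, one possible starting point is a minimal sketch built on the third-party python-docx package. Treat it as an assumption about your setup, not the only way to do it; old binary .doc files are not handled by python-docx and would need a separate converter.

from docx import Document   # pip install python-docx

def docx_to_txt(in_path, out_path):
    # Keep only the paragraph text; all styling and embedded objects are dropped.
    doc = Document(in_path)
    text = "\n".join(p.text for p in doc.paragraphs)
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(text)

docx_to_txt("report.docx", "report.txt")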

>>2. Sentence separator
A sentence ends with a period (.), so use that as your delimiter when splitting the string.
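For example, a naive Python sketch of that rule (it will mis-split abbreviations such as "Dr.", so the rule needs refining later):

import re

def split_sentences(text):
    # Naive rule: a sentence ends with ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]

print(split_sentences("Unix is old. It is still used! Why?"))
# ['Unix is old.', 'It is still used!', 'Why?']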

>>3. Word separator
Words are separated by spaces, so use whitespace as your delimiter when splitting each sentence.
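Along the same lines, a small sketch; extracting alphabetic runs avoids keeping punctuation stuck to the words:

import re

def split_words(sentence):
    # Splitting on spaces alone would keep punctuation attached ("cat,"),
    # so pull out alphabetic runs and lowercase them instead.
    return re.findall(r"[a-z']+", sentence.lower())

print(split_words("The cat, the dog."))   # ['the', 'cat', 'the', 'dog']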
>>4. Stop word eliminator
Need to be more specific.

>>5. Word frequency calculator
Get each word and count its frequency. Just dive in and see what happens.
Also try Google, and use the search on this forum for "frequency counter".
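In Python, collections.Counter already does most of this, for example:

from collections import Counter

words = ['unix', 'is', 'old', 'unix', 'is', 'everywhere']
freq = Counter(words)        # Counter({'unix': 2, 'is': 2, 'old': 1, 'everywhere': 1})
print(freq.most_common(2))   # [('unix', 2), ('is', 2)]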

>>6. Scoring algorithms
What do you mean by this?

>>7. Ranking algorithms
Relative to what?

Thanks for your reply.
I think you should go through the following description; it will give you a better idea of exactly what I want.

1. Text pre-processor. This will work on HTML or Word documents and convert them to plain text for processing by the rest of the system.
2. Sentence separator. This goes through the document and separates the sentences based on some rules (for example, a sentence ending is marked by a dot followed by a space). Other appropriate criteria may also be added to separate the sentences.

3. Word separator. This separates the words based on some criteria (for example, a space denotes the end of a word).

4. Stop-word eliminator. This eliminates common English words like ‘a, an, the, of, from’ etc. before further processing. These words are known as ‘stop-words’. A list of applicable stop-words for English is available on the Internet. (The first sketch after this list covers this step together with point 5.)

5. Word-frequency calculator. This calculates the number of times each word appears in the document (stop-words have been eliminated earlier and will not figure in this calculation), and also the number of sentences in which that word appears. For example, the word ‘Unix’ may appear a total of 100 times in a document, spread over 80 sentences (some sentences have more than one occurrence of the word). Min/max thresholds can be set on the frequencies, to be determined by trial and error.

6. Scoring algorithm. This algorithm determines the score of each sentence. Several possibilities exist. The score can be made proportional to the sum of the frequencies of the words in the sentence (i.e., if a sentence has three words A, B and C, the score is proportional to the total number of times A, B and C occur in the document). The score can also be made inversely proportional to the number of sentences in which the sentence's words appear. Many such heuristic rules can be applied to score the sentences. (The second sketch after this list covers this step together with points 7 and 8.)

7. Ranking. The sentences will be ranked according to their scores. Other criteria, such as the position of a sentence in the document, can be used to adjust the ranking. For example, even if their scores are high, we would not put consecutive sentences together.

8. Summarizing. Based on the user's input for the size of the summary, the top sentences will be picked from the ranked list and concatenated. The resulting summary could be stored in a file named <originalfilename>_summary.txt.

9. User interface. The tool could use a GUI or a plain command-line interface. In either case, it should have an easy and intuitive way of getting the input from the user (the document, the desired size of the summary, etc.).
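For points 4 and 5, a minimal Python sketch might look like the following. The stop-word list here is only a placeholder (a real list would be loaded from a downloaded file) and the tokenization is deliberately simple:

import re
from collections import Counter

# Placeholder stop-word list; a fuller English list would be loaded from a file.
STOP_WORDS = {"a", "an", "the", "of", "from", "and", "to", "in", "is", "it"}

def content_words(sentence):
    # Lowercased words of a sentence with stop-words removed.
    return [w for w in re.findall(r"[a-z']+", sentence.lower())
            if w not in STOP_WORDS]

def frequencies(sentences):
    word_freq = Counter()   # total occurrences of each word in the document
    sent_freq = Counter()   # number of sentences containing each word
    for s in sentences:
        words = content_words(s)
        word_freq.update(words)
        sent_freq.update(set(words))   # each word counted once per sentence
    return word_freq, sent_freq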
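And for points 6 to 8, a rough sketch of one possible scheme. The frequency-divided-by-sentence-count weighting is just one of the heuristics described in point 6, and the toy dictionaries at the bottom stand in for the output of the previous sketch:

import re

def score_sentence(sentence, word_freq, sent_freq):
    # Heuristic score: sum of each word's document frequency, damped by the
    # number of sentences the word occurs in (the inverse idea from point 6).
    score = 0.0
    for w in re.findall(r"[a-z']+", sentence.lower()):
        if w in word_freq:                 # stop-words were never counted
            score += word_freq[w] / sent_freq[w]
    return score

def summarize(sentences, word_freq, sent_freq, size):
    # Rank sentence indices by score, keep the top `size`, then restore
    # document order so the summary reads in sequence.
    ranked = sorted(range(len(sentences)),
                    key=lambda i: score_sentence(sentences[i], word_freq, sent_freq),
                    reverse=True)
    chosen = sorted(ranked[:size])
    return " ".join(sentences[i] for i in chosen)

sents = ["Unix is an operating system.", "Unix is old.", "Cats sleep."]
wf = {"unix": 2, "operating": 1, "system": 1, "old": 1, "cats": 1, "sleep": 1}
sf = {"unix": 2, "operating": 1, "system": 1, "old": 1, "cats": 1, "sleep": 1}
print(summarize(sents, wf, sf, 1))   # -> Unix is an operating system.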

Actually, I have very little time left, so do you know where I can get source code for these modules?
