How do I parse a txt file into its sentences?

Hello everyone, I am currently working on text processing with Python and I want to split a whole .txt file into its sentences. I tried to write some regular expressions but did not succeed; I only managed to come up with a regex that splits each paragraph into its sentences. How can I split the entire text into sentences? Also, if a long sentence does not end in the same paragraph where it starts, it should still be printed as a whole. Thanks.

3 contributors, 2 replies, 3 views; discussion span 8 hours; last updated 2 years ago
koveras vehcna

Split on the period. Note that the following requirement doesn't make sense as written and needs an example:

"Also, if a sentence is long, does not end in the same paragraph, it should be printed as a whole too"
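
One possible reading of that requirement is that a sentence interrupted by a line or paragraph break should still come out as one unit. Here is a minimal sketch under that assumption; the function name `split_sentences` and the sample text are made up for illustration. It flattens all whitespace first and then splits after end punctuation that is followed by a space, so it goes a little further than splitting on the period alone.

```python
import re

def split_sentences(text):
    """Split text into sentences at '.', '?' or '!' followed by whitespace."""
    # Collapse newlines and repeated spaces so that a sentence continuing
    # on the next line or in the next paragraph stays in one piece.
    flat = " ".join(text.split())
    # Split on the whitespace that follows end punctuation; the lookbehind
    # keeps the punctuation attached to its sentence.
    return [s for s in re.split(r"(?<=[.?!])\s+", flat) if s]

# Hypothetical sample text with a sentence broken across two lines.
sample = "This sentence starts here\nand ends on the next line. Is 1.23 split? No."
for number, sentence in enumerate(split_sentences(sample), 1):
    print(number, ":", sentence)
```

Because the split requires whitespace after the punctuation, a number like 1.23 is left alone, but abbreviations such as "Mr. Smith" would still be cut in two.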

woooee

Here is an example, but the messy English convention of putting the period inside the quotation marks, as in "Quoting sentence.", needs to be handled for this to work properly:

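import itertools as it

endsentence = ".?!"
filein = 'd:/test/advsh12.txt'

# Read the whole file; iterating over a string yields single characters.
with open(filein) as f:
    text = f.read()

# groupby yields alternating runs: sentence text (False), punctuation (True).
sentences = it.groupby(text, lambda char: char in endsentence)

previous = ''  # defined up front in case the text starts with punctuation
for number, (truth, sentence) in enumerate(sentences):
    chunk = ''.join(sentence)
    if truth:
        # Rejoin the sentence text with its closing punctuation.
        print(number // 2 + 1, ':', (previous + chunk).replace('\n', ' ').strip())
    previous = chunk
    if number >= 2 * 100:
        break  # first 100 sentences only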

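We use itertools.groupby to separate the text into alternating runs of sentence text and punctuation, and join each pair when a punctuation run is read (truth is True). Note that the check is made character by character, so the code as written also stops inside a number such as 1.23; testing only the ends of words would avoid that.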
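
A word-based variant could look roughly like the sketch below. It is not the groupby approach above but a plain loop over whitespace-separated words, and the names `iter_sentences` and `closers` are made up for illustration. A word ends a sentence only if it finishes with end punctuation once any trailing quotes or brackets are ignored, which is one way to cope with the "Quoting sentence." case.

```python
def iter_sentences(text, endsentence=".?!", closers="\"')"):
    """Yield sentences, ending one whenever a word ends in sentence punctuation."""
    current = []
    for word in text.split():
        current.append(word)
        # Ignore trailing quotes or brackets so that 'sentence."' still counts.
        if word.rstrip(closers).endswith(tuple(endsentence)):
            yield " ".join(current)
            current = []
    if current:
        # Trailing text without closing punctuation.
        yield " ".join(current)

# Hypothetical sample: a decimal number, a quoted sentence, and a line break.
sample = 'He paid 1.23 for it. She said "Quoting sentence." Then\nthey left!'
for number, sentence in enumerate(iter_sentences(sample), 1):
    print(number, ":", sentence)
```

Since `text.split()` treats newlines like any other whitespace, a sentence that runs across lines or paragraphs is printed as a whole. Abbreviations that end in a period would still be treated as sentence ends.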

pyTony
