Hello everyone, I am currently working on text processing with Python and I want to split a whole .txt file into its sentences. I tried to write some regular expressions, but I only managed to come up with one that splits each paragraph into its sentences. How can I split the entire text into sentences? Also, if a sentence is long and does not end in the same paragraph, it should still be printed as a whole. Thanks.

Split on the period. Note that the following requirement is unclear and needs an example:

Also, if a sentence is long, does not end in the same paragraph, it should be printed as a whole too

Here is an example, though the messy English convention of putting the period inside the quotes ("Quoting sentence.") would need special handling to make this work:

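import itertools as it

endsentence = ".?!"
filein = 'd:/test/advsh12.txt'
with open(filein) as f:
    text = f.read()
# Group consecutive characters by whether they are sentence-ending punctuation.
sentences = it.groupby(text, lambda ch: ch in endsentence)
previous = ''  # text run that precedes the current punctuation run
for number, (truth, sentence) in enumerate(sentences):
    chunk = ''.join(sentence)
    if truth:
        # Punctuation run: print it joined to the preceding text, newlines flattened.
        print(number // 2 + 1, ':', (previous + chunk).replace('\n', ' '))
    previous = chunk
    if number >= 2 * 100:
        break  # first 100 sentences only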

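We use itertools groupby to separate the text into alternating runs of sentence text and punctuation, and join each pair when a punctuation run is read (truth is True). Note that the check is made character by character, so a period inside a number such as 1.23 also ends a sentence; to avoid stopping there, only punctuation at the end of a word should be counted.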
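
For comparison, here is a minimal regex-based sketch of the same idea, not a definitive solution. It assumes a sentence ends at '.', '?' or '!' followed by whitespace, and the function name split_sentences is made up for this illustration. Whitespace is collapsed first, so a sentence that runs across a line or paragraph break stays in one piece.

```python
import re

def split_sentences(text):
    """Split text into sentences at '.', '?' or '!' followed by whitespace."""
    # Collapse newlines and repeated spaces so a sentence that continues
    # across a line or paragraph break is kept in one piece.
    flat = re.sub(r'\s+', ' ', text).strip()
    if not flat:
        return []
    # Split on whitespace that directly follows end punctuation, so the
    # period inside "1.23" does not end a sentence.
    return re.split(r'(?<=[.?!])\s+', flat)

sample = "It costs 1.23 dollars. Is that\n\ntoo much? No!"
for number, sentence in enumerate(split_sentences(sample), 1):
    print(number, ':', sentence)
```

This will still split wrongly after abbreviations such as "Mr." and will not split after a period placed inside closing quotes, so treat it as a starting point only.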
