concatenate lines based on regex

Question

Stefano Mtangoo 455 Senior Poster

13 Years Ago

Hi,
I need to concatenate lines based on a regex. The lines to be concatenated are scattered through the file. Every record begins with number $number$number$ followed by its text. Nothing marks the end of a record, only the beginning. Here is an example. I want to rewrite this one

2 $5$233$ check big cat if it have not eaten all the meat
3 $5$233$ check big cat if it have not eaten all the meat 
        <a href=""http://example.com"">An Example duh! </a>
        <a href=""http://example.com"">An Example duh! </a> 
2 $5$233$ check big cat if it have not eaten all the meat 
check big cat if it have not eaten all the meat
<a href=""http://example.com"">An Example duh! </a>
2 $5$233$ check big cat if it have not eaten all the meat

into this one

2 $5$233$ check big cat if it have not eaten all the meat
3 $5$233$ check big cat if it have not eaten all the meat <a href=""http://example.com"">An Example duh! </a> <a href=""http://example.com"">An Example duh! </a> 
2 $5$233$ check big cat if it have not eaten all the meat  check big cat if it have not eaten all the meat <a href=""http://example.com"">An Example duh! </a>
2 $5$233$ check big cat if it have not eaten all the meat

I have been away from Python for a while, so I'm running out of ideas and I need your help.
Thanks

python regex


2 Contributors
19 Replies
215 Views
2 Days Discussion Span
Latest Post 13 Years Ago Latest Post by Stefano Mtangoo

All 19 Replies

TrustyTony 888 ex-Moderator

13 Years Ago

>>> import re
>>> data = """2 $5$233$ check big cat if it have not eaten all the meat
3 $5$233$ check big cat if it have not eaten all the meat 
        <a href=""http://example.com"">An Example duh! </a>
        <a href=""http://example.com"">An Example duh! </a> 
2 $5$233$ check big cat if it have not eaten all the meat 
check big cat if it have not eaten all the meat
<a href=""http://example.com"">An Example duh! </a>
2 $5$233$ check big cat if it have not eaten all the meat"""
>>> re.findall(r'(\d* \$\d*\d*.*) ', data)
['2 $5$233$ check big cat if it have not eaten all the', '3 $5$233$ check big cat if it have not eaten all the meat', '2 $5$233$ check big cat if it have not eaten all the meat', '2 $5$233$ check big cat if it have not eaten all the']

Stefano Mtangoo commented: Great PyTony +13

TrustyTony 888 ex-Moderator

13 Years Ago

This regex does not catch the continuation lines. Maybe multiline mode would help (I do not remember exactly): catch the lines and then do .replace('\n', '')


TrustyTony 888 ex-Moderator

13 Years Ago

It looks like we would also need to make the dot match newlines by using

re.MULTILINE | re.DOTALL
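
For reference, the two flags do different jobs: re.MULTILINE lets ^ and $ match at every line boundary, while re.DOTALL lets the dot match newlines. A minimal sketch on a made-up two-record string (the sample text and variable names are assumptions, not taken from the attached file):

```python
import re

# Made-up sample: two header lines, each followed by one continuation line.
sample = "1 $3$5$\n<a>For </a>\n2 $4$6$\n<a>For </a>"
pattern = r'^\d+ \$\d+\$\d+\$.*'

# MULTILINE only: '.' stops at each newline, so only the header lines match.
per_line = re.findall(pattern, sample, re.MULTILINE)

# MULTILINE | DOTALL: the greedy '.*' runs to the end of the string,
# so everything collapses into a single match.
greedy = re.findall(pattern, sample, re.MULTILINE | re.DOTALL)

print(per_line)
print(greedy)
```

So neither flag combination alone separates the records; the pattern also has to say where a record stops.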

TrustyTony 888 ex-Moderator

13 Years Ago

Actually, why can't you use plain Python to group the lines (itertools.groupby or a generator)? Why must it be re?

data = """2 $5$233$ check big cat if it have not eaten all the meat
3 $5$233$ check big cat if it have not eaten all the meat 
        <a href=""http://example.com"">An Example duh! </a>
        <a href=""http://example.com"">An Example duh! </a> 
2 $5$233$ check big cat if it have not eaten all the meat 
check big cat if it have not eaten all the meat
<a href=""http://example.com"">An Example duh! </a>
2 $5$233$ check big cat if it have not eaten all the meat"""

def get_blocks(source):
    block = []
    for line in source:
        # simplified lazy check; unlikely to misfire on this data
        if line.count('$') == 3:
            if block:
                yield ''.join(block)
            block = [line]
        else:
            block.append(line)


print('\n'.join(get_blocks(data.splitlines())))
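
The itertools.groupby route mentioned above could look roughly like the sketch below. It assumes a header line has the shape number, optional blanks, $number$number$ at the very start of the line; START and group_records are hypothetical names introduced only for this illustration.

```python
import re
from itertools import groupby

# Assumed header shape: digits, optional blanks, then $digits$digits$.
START = re.compile(r'\d+[ \t]*\$\d+\$\d+\$')

def group_records(lines):
    """Group each header line with the continuation lines that follow it."""
    record_no = 0

    def key(line):
        # Every header line bumps the counter, so all lines of one
        # record share the same key and end up in the same group.
        nonlocal record_no
        if START.match(line):
            record_no += 1
        return record_no

    return [' '.join(part.strip() for part in group)
            for _, group in groupby(lines, key)]
```

The stateful key function is the price of using groupby here; the generator version keeps the same state in a plain list, which many readers find easier to follow.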


TrustyTony 888 ex-Moderator

13 Years Ago

This also seems to work for your sample.txt:

def get_blocks(source):
    block = []
    for line in source:
        # simplified lazy check; unlikely to misfire on this data
        if len(line) > 12 and line.lstrip()[0].isdigit() and line[:12].count('$') == 3:
            if block:
                yield ''.join(block)
            block = [line]
        else:
            block.append(line)


with open('sample.txt') as data:
    print('------------------\n'.join(get_blocks(data)))

EDIT: slightly stronger check for correct start of block



Stefano Mtangoo 455 Senior Poster · Answer 1 · 2011-12-02T22:04:18+00:00

Let me check, but your solution seems ingeniously simple!

Stefano Mtangoo 455 Senior Poster · Answer 2 · 2011-12-03T19:50:17+00:00

The text below does not work even in re.M mode. Is there anything else I can do?

1 $3$5$ 
        <a href=""http://daniweb.com"">For </a>
       <a href=""http://daniweb.com"">For </a><a href=""http://daniweb.com"">For </a>
        <a href=""http://daniweb.com"">For </a><a href=""http://daniweb.com"">For </a>

Stefano Mtangoo 455 Senior Poster · Answer 3 · 2011-12-03T20:04:50+00:00

Oh! And it just returns 1 $3$5$ when I do

res = re.findall(r'(\d{1,}\s*\$\s*\d{1,}\s*\$\s*\d{1,}\s*\$*.*)', data, re.MULTILINE)
for line in res:
    print(line)
    exit(0)  # stop after the first match

It is supposed to return the whole thing as a single line.
The text file is big and I didn't write it myself!

Stefano Mtangoo 455 Senior Poster · Answer 4 · 2011-12-03T20:20:41+00:00

It looks like we would also need to make the dot match newlines by using
re.MULTILINE | re.DOTALL

That works for the first line, but then every following line matches too because of the dot!

Is there a way to tell it to match everything except a given pattern (the one that marks the start of the next record)?
For example, using the above with DOTALL/MULTILINE on the text below, everything is returned as a single record when there should be two.

1 $3$5$
<a href=""http://daniweb.com"">For </a>
<a href=""http://daniweb.com"">For </a><a href=""http://daniweb.com"">For </a>
<a href=""http://daniweb.com"">For </a><a href=""http://daniweb.com"">For </a>
2 $4$6$
<a href=""http://daniweb.com"">For </a>
<a href=""http://daniweb.com"">For </a><a href=""http://daniweb.com"">For </a>
<a href=""http://daniweb.com"">For </a><a href=""http://daniweb.com"">For </a>
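
One way to express "everything up to the next header" in a regex is a lazy match followed by a lookahead. The sketch below is only an illustration under stated assumptions: headers sit at the start of a line in the form number, optional blanks, $number$number$, and collapsing every run of whitespace to a single space is acceptable. HEADER, RECORD and join_records are hypothetical names.

```python
import re

# Assumed header shape at the start of a line.
HEADER = r'^\d+[ \t]*\$\d+\$\d+\$'

# Lazily consume text until the next header or the end of the string.
# MULTILINE makes '^' match at each line start; DOTALL lets '.' cross newlines.
RECORD = re.compile(HEADER + r'.*?(?=' + HEADER + r'|\Z)',
                    re.MULTILINE | re.DOTALL)

def join_records(text):
    """Return one string per record, with inner line breaks turned into spaces."""
    return [' '.join(match.split()) for match in RECORD.findall(text)]
```

Because the lookahead does not consume the next header, each header starts its own match, so the two-record sample above would come back as two items rather than one.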

Stefano Mtangoo 455 Senior Poster · Answer 5 · 2011-12-03T21:24:14+00:00


Actually, why can't you use plain Python to group the lines (itertools.groupby or a generator)? Why must it be re?

PT,
I don't care which tool gets the job done. What I want to achieve is this: everything from number$number$number$text up to the next number$number$number$ should be on the same line. The text in between is arbitrary.
So if itertools can do the job, that is fine with me :)

The sample text is attached.

sample.txt (23.05 KB)

The attachment preview is chopped off after the first 10 KB. Please download the entire file.

1 $1$26$  And <a href=""strongs://430"">God </a><a href=""strongs://559"">said</a>, Let us <a href=""strongs://6213"">make </a><a href=""strongs://120"">man </a>in our <a href=""strongs://6754"">image</a>, after our <a href=""strongs://1823"">likeness</a>: and let them have <a href=""strongs://7287"">dominion </a>over the <a href=""strongs://1710"">fish </a>of the <a href=""strongs://3220"">sea</a>, and over the <a href=""strongs://5775"">fowl </a>of the <a href=""strongs://8064"">air</a>, and over the <a href=""strongs://929"">cattle</a>, and over <a href=""strongs://3605"">all </a>the <a href=""strongs://776"">earth</a>, and over <a href=""strongs://3605"">every </a>creeping <a href=""strongs://7431"">thing </a>that <a href=""strongs://7430"">creepeth </a><a href=""strongs://5921"">upon </a>the <a href=""strongs://776"">earth</a>. 
1 $1$27$  So <a href=""strongs://430"">God </a><a href=""strongs://1254"">created </a><a href=""strongs://853""></a><a href=""strongs://120"">man </a>in his <span style=""color:#808080;font-style:italic;""> own </span><a href=""strongs://6754"">image</a>, in the <a href=""strongs://6754"">image </a>of <a href=""strongs://430"">God </a><a href=""strongs://1254"">created </a>he him; <a href=""strongs://2145"">male </a>and <a href=""strongs://5347"">female </a><a href=""strongs://1254"">created </a>he them. 
1 $1$28$  And <a href=""strongs://430"">God </a><a href=""strongs://1288"">blessed </a>them, and <a href=""strongs://430"">God </a><a href=""strongs://559"">said </a>unto them, Be <a href=""strongs://6509"">fruitful</a>, and <a href=""strongs://7235"">multiply</a>, and <a href=""strongs://4390"">replenish </a><a href=""strongs://853""></a>the <a href=""strongs://776"">earth</a>, and <a href=""strongs://3533"">subdue </a>it: and have <a href=""strongs://7287"">dominion </a>over the <a href=""strongs://1710"">fish </a>of the <a href=""strongs://3220"">sea</a>, and over the <a href=""strongs://5775"">fowl </a>of the <a href=""strongs://8064"">air</a>, and over <a href=""strongs://3605"">every </a>living <a href=""strongs://2416"">thing </a>that <a href=""strongs://7430"">moveth </a><a href=""strongs://5921"">upon </a>the <a href=""strongs://776"">earth</a>. 
1 $1$29$  And <a href=""strongs://430"">God </a><a href=""strongs://559"">said</a>, <a href=""strongs://2009"">Behold</a>, I have <a href=""strongs://5414"">given </a>you <a href=""strongs://853""></a><a href=""strongs://3605"">every </a><a href=""strongs://6212"">herb </a><a href=""strongs://2232"">bearing </a><a href=""strongs://2233"">seed</a>, <a href=""strongs://834"">which </a><span style=""color:#808080;font-style:italic;""> is </span><a href=""strongs://5921"">upon </a>the <a href=""strongs://6440"">face </a>of <a href=""strongs://3605"">all </a>the <a href=""strongs://776"">earth</a>, and <a href=""strongs://3605"">every </a><a href=""strongs://6086"">tree</a>, in the <a href=""strongs://834"">which </a><span style=""color:#808080;font-style:italic;""> is </span> the <a href=""strongs://6529"">fruit </a>of a <a href=""strongs://6086"">tree </a><a href=""strongs://2232"">yielding </a><a href=""strongs://2233"">seed</a>; to you it shall <a href=""strongs://1961"">be </a>for <a href=""strongs://402"">meat</a>. 
1 $1$30$  And to <a href=""strongs://3605"">every </a><a href=""strongs://2416"">beast </a>of the <a href=""strongs://776"">earth</a>, and to <a href=""strongs://3605"">every </a><a href=""strongs://5775"">fowl </a>of the <a href=""strongs://8064"">air</a>, and to every <a href=""strongs://3605"">thing </a>that <a href=""strongs://7430"">creepeth </a><a href=""strongs://5921"">upon </a>the <a href=""strongs://776"">earth</a>, <a href=""strongs://834"">wherein </a><span style=""color:#808080;font-style:italic;""> there </span><span style=""color:#808080;font-style:italic;""> is </span><a href=""strongs://5315|2416"">life</a>, <span style=""color:#808080;font-style:italic;""> I </span><span style=""color:#808080;font-style:italic;""> have </span><span style=""color:#808080;font-style:italic;""> given </span><a href=""strongs://853""></a><a href=""strongs://3605"">every </a><a href=""strongs://3418"">green </a><a href=""strongs://6212"">herb </a>for <a href=""strongs://402"">meat</a>: and it <a href=""strongs://1961"">was </a><a href=""strongs://3651"">so</a>. 
1 $1$31$  And <a href=""strongs://430"">God </a><a href=""strongs://7200"">saw </a><a href=""strongs://853""></a>every <a href=""strongs://3605"">thing </a><a href=""strongs://834"">that </a>he had <a href=""strongs://6213"">made</a>, and, <a href=""strongs://2009"">behold</a>, <span style=""color:#808080;font-style:italic;""> it </span><span style=""color:#808080;font-style:italic;""> was </span><a href=""strongs://3966"">very </a><a href=""strongs://2896"">good</a>. And the <a href=""strongs://6153"">evening </a>and the <a href=""strongs://1242"">morning </a><a href=""strongs://1961"">were </a>the <a href=""strongs://8345"">sixth </a><a href=""strongs://3117"">day</a>. 
1 $2$1$  Thus the <a href=""strongs://8064"">heavens </a>and the <a href=""strongs://776"">earth </a>were <a href=""strongs://3615"">finished</a>, and <a href=""strongs://3605"">all </a>the <a href=""strongs://6635"">host </a>of them. 
1 $2$2$  And on the <a href=""strongs://7637"">seventh </a><a href=""strongs://3117"">day </a><a href=""strongs://430"">God </a><a href=""strongs://3615"">ended </a>his <a href=""strongs://4399"">work </a><a href=""strongs://834"">which </a>he had <a href=""strongs://6213"">made</a>; and he <a href=""strongs://7673"">rested </a>on the <a href=""strongs://7637"">seventh </a><a href=""strongs://3117"">day </a>from <a href=""strongs://4480|3605"">all </a>his <a href=""strongs://4399"">work </a><a href=""strongs://834"">which </a>he had <a href=""strongs://6213"">made</a>. 
1 $2$3$  And <a href=""strongs://430"">God </a><a href=""strongs://1288"">blessed </a><a href=""strongs://853""></a>the <a href=""strongs://7637"">seventh </a><a href=""strongs://3117"">day</a>, and <a href=""strongs://6942"">sanctified </a>it: <a href=""strongs://3588"">because </a>that in it he had <a href=""strongs://7673"">rested </a>from <a href=""strongs://4480|3605"">all </a>his <a href=""strongs://4399"">work </a><a href=""strongs://834"">which </a><a href=""strongs://430"">God </a><a href=""strongs://1254"">created </a>and <a href=""strongs://6213"">made</a>. 
1 $2$4$ 
        <a href=""strongs://428"">These </a>
        <span style=""color:#808080;font-style:italic;""> are </span> the <a href=""strongs://8435"">generations </a>of the <a href=""strongs://8064"">heavens </a>and of the <a href=""strongs://776"">earth </a>when they were <a href=""strongs://1254"">created</a>, in the <a href=""strongs://3117"">day </a>that the <a href=""strongs://3068"">LORD </a><a href=""strongs://430"">God </a><a href=""strongs://6213"">made </a>the <a href=""strongs://776"">earth </a>and the <a href=""strongs://8064"">heavens</a>, 
1 $2$5$  And <a href=""strongs://3605"">every </a><a href=""strongs://7880"">plant </a>of the <a href=""strongs://7704"">field </a><a href=""strongs://2962"">before </a>it <a href=""strongs://1961"">was </a>in the <a href=""strongs://776"">earth</a>, and <a href=""strongs://3605"">every </a><a href=""strongs://6212"">herb </a>of the <a href=""strongs://7704"">field </a><a href=""strongs://2962"">before </a>it <a href=""strongs://6779"">grew</a>: <a href=""strongs://3588"">for </a>the <a href=""strongs://3068"">LORD </a><a href=""strongs://430"">God </a>had <a href=""strongs://3808"">not </a>caused it to <a href=""strongs://4305"">rain </a><a href=""strongs://5921"">upon </a>the <a href=""strongs://776"">earth</a>, and <span style=""color:#808080;font-style:italic;""> there </span><span style=""color:#808080;font-style:italic;""> was </span><a href=""strongs://369"">not </a>a <a href=""strongs://120"">man </a>to <a href=""strongs://5647"">till </a><a href=""strongs://853""></a>the <a href=""strongs://127"">ground</a>. 
1 $2$6$  But there went <a href=""strongs://5927"">up </a>a <a href=""strongs://108"">mist </a><a href=""strongs://4480"">from </a>the <a href=""strongs://776"">earth</a>, and <a href=""strongs://8248"">watered </a><a href=""strongs://853""></a>the <a href=""strongs://3605"">whole </a><a href=""strongs://6440"">face </a>of the <a href=""strongs://127"">ground</a>. 
1 $2$7$  And the <a href=""strongs://3068"">LORD </a><a href=""strongs://430"">God </a><a href=""strongs://3335"">formed </a><a href=""strongs://853""></a><a href=""strongs://120"">man </a><span style=""color:#808080;font-style:italic;""> of </span> the <a href=""strongs://6083"">dust </a><a href=""strongs://4480"">of </a>the <a href=""strongs://127"">ground</a>, and <a href=""strongs://5301"">breathed </a>into his <a href=""strongs://639"">nostrils </a>the <a href=""strongs://5397"">breath </a>of <a href=""strongs://2416"">life</a>; and <a href=""strongs://120"">man </a><a href=""strongs://1961"">became </a>a <a href=""strongs://2416"">living </a><a href=""strongs://5315"">soul</a>. 
1 $2$8$  And the <a href=""strongs://3068"">LORD </a><a href=""strongs://430"">God </a><a href=""strongs://5193"">planted </a>a <a href=""strongs://1588"">garden </a><a href=""strongs://4480|6924"">eastward </a>in <a href=""strongs://5731"">Eden</a>; and <a href=""strongs://8033"">there </a>he <a href=""strongs://7760"">put </a><a href=""strongs://853""></a>the <a href=""strongs://120"">man </a><a href=""strongs://834"">whom </a>he had <a href=""strongs://3335"">formed</a>. 
1 $2$9$  And out <a href=""strongs://4480"">of </a>the <a href=""strongs://127"">ground </a>made the <a href=""strongs://3068"">LORD </a><a href=""strongs://430"">God </a>to <a href=""strongs://6779"">grow </a><a href=""strongs://3605"">every </a><a href=""strongs://6086"">tree </a>that is <a href=""strongs://2530"">pleasant </a>to the <a href=""strongs://4758"">sight</a>, and <a href=""strongs://2896"">good </a>for <a href=""strongs://3978"">food</a>; the <a href=""strongs://6086"">tree </a>of <a href=""strongs://2416"">life </a>also in the <a href=""strongs://8432"">midst </a>of the <a href=""strongs://1588"">garden</a>, and the <a href=""strongs://6086"">tree </a>of <a href=""strongs:

Stefano Mtangoo 455 Senior Poster · Answer 6 · 2011-12-03T21:25:54+00:00

I cannot understand the code. Can you explain it a bit? I have been away from Python for so long ;)

Stefano Mtangoo 455 Senior Poster · Answer 7 · 2011-12-03T21:57:33+00:00

It seems to work OK (I still have to check data integrity further), but could you explain the code?

TrustyTony 888 ex-Moderator Team Colleague Featured Poster · Answer 8 · 2011-12-03T22:09:15+00:00

I think my code fails to give out the last block (sorry, but that was under 5 minutes of coding and only one run on your sample). You should add

yield ''.join(block)

to the end of the function.

Can you tell me what you understand and what you do not?

Stefano Mtangoo 455 Senior Poster · Answer 9 · 2011-12-03T22:20:22+00:00

General concept of what you are doing.
However, it does not work for these lines in sample.txt:

1 $2$3$  And <a href=""strongs://430"">God </a><a href=""strongs://1288"">blessed </a><a href=""strongs://853""></a>the <a href=""strongs://7637"">seventh </a><a href=""strongs://3117"">day</a>, and <a href=""strongs://6942"">sanctified </a>it: <a href=""strongs://3588"">because </a>that in it he had <a href=""strongs://7673"">rested </a>from <a href=""strongs://4480|3605"">all </a>his <a href=""strongs://4399"">work </a><a href=""strongs://834"">which </a><a href=""strongs://430"">God </a><a href=""strongs://1254"">created </a>and <a href=""strongs://6213"">made</a>. 
1 $2$4$ 
        <a href=""strongs://428"">These </a>
        <span style=""color:#808080;font-style:italic;""> are </span> the <a href=""strongs://8435"">generations </a>of the <a href=""strongs://8064"">heavens </a>and of the <a href=""strongs://776"">earth </a>when they were <a href=""strongs://1254"">created</a>, in the <a href=""strongs://3117"">day </a>that the <a href=""strongs://3068"">LORD </a><a href=""strongs://430"">God </a><a href=""strongs://6213"">made </a>the <a href=""strongs://776"">earth </a>and the <a href=""strongs://8064"">heavens</a>,

TrustyTony 888 ex-Moderator Team Colleague Featured Poster · Answer 10 · 2011-12-03T22:39:53+00:00

I am collecting lines into a block and yielding the block when a new one starts (so the last block needs its own yield at the end). To accept such a short start line, reduce the length limit from 12 to 8 or 6.

Stefano Mtangoo 455 Senior Poster · Answer 11 · 2011-12-03T23:29:55+00:00

Thanks PT,
I will mark this solved and finish up the quirks myself.
Thanks for your help!

TrustyTony 888 ex-Moderator Team Colleague Featured Poster · Answer 13 · 2011-12-04T02:08:34+00:00

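Anyway, here is my cleaned-up code:

def block_start(line, limit=8):
    # A block starts with a digit and has three '$' in its first 12 characters.
    # The [:1] slice avoids an IndexError on lines that are only whitespace.
    return len(line) > limit and line.lstrip()[:1].isdigit() and line[:12].count('$') == 3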


def get_blocks(source, block_start=block_start):
    block = []
    for line in source:
        # a new header line closes the block collected so far
        if block_start(line):
            if block:
                yield ''.join(block)
            block = [line]
        else:
            block.append(line)
    if block:
        yield ''.join(block)

with open('sample.txt') as data:
    print('------------------\n'.join(get_blocks(data)))
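
One caveat when the source is a file object: every line still ends with '\n', so ''.join(block) keeps the line breaks inside a block. A possible variant, assuming whitespace at the edges of each line is not significant, strips each line and joins the pieces with a single space. The regex start check and the names START and get_records are assumptions made for this sketch, not part of the code above.

```python
import re

# Assumed header shape: digits, optional blanks, then $digits$digits$.
START = re.compile(r'\d+[ \t]*\$\d+\$\d+\$')

def get_records(lines):
    """Yield one single-line record per header line plus its continuation lines."""
    parts = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if START.match(line) and parts:
            yield ' '.join(parts)  # a new header closes the previous record
            parts = []
        parts.append(line)
    if parts:
        yield ' '.join(parts)  # do not lose the final record
```

It can be driven the same way as the code above, for example by passing the open file to get_records and joining the results with '\n'.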

Stefano Mtangoo 455 Senior Poster · Answer 14 · 2011-12-04T19:07:36+00:00

Thanks Tony!
