
concatenate lines based on regex

Hi,
I need to concatenate lines based on a regex. The lines to be concatenated are scattered. All entries begin with number$number$number$sentences. There is nothing to mark the end of a sentence, only the beginning. Here is an example. I want to rewrite this one

2 $5$233$ check big cat if it have not eaten all the meat
3 $5$233$ check big cat if it have not eaten all the meat 
        <a href=""http://example.com"">An Example duh! </a>
        <a href=""http://example.com"">An Example duh! </a> 
2 $5$233$ check big cat if it have not eaten all the meat 
check big cat if it have not eaten all the meat
<a href=""http://example.com"">An Example duh! </a>
2 $5$233$ check big cat if it have not eaten all the meat

into this one

2 $5$233$ check big cat if it have not eaten all the meat
3 $5$233$ check big cat if it have not eaten all the meat <a href=""http://example.com"">An Example duh! </a> <a href=""http://example.com"">An Example duh! </a> 
2 $5$233$ check big cat if it have not eaten all the meat  check big cat if it have not eaten all the meat <a href=""http://example.com"">An Example duh! </a>
2 $5$233$ check big cat if it have not eaten all the meat

Since I have been away from Python for a while, I'm running out of ideas and I need your help.
Thanks

evstevemd
Senior Poster
3,713 posts since Jun 2007
Reputation Points: 462
Solved Threads: 392
 
>>> import re
>>> data = """2 $5$233$ check big cat if it have not eaten all the meat
3 $5$233$ check big cat if it have not eaten all the meat 
        <a href=""http://example.com"">An Example duh! </a>
        <a href=""http://example.com"">An Example duh! </a> 
2 $5$233$ check big cat if it have not eaten all the meat 
check big cat if it have not eaten all the meat
<a href=""http://example.com"">An Example duh! </a>
2 $5$233$ check big cat if it have not eaten all the meat"""
>>> re.findall(r'(\d* \$\d*\d*.*) ', data)
['2 $5$233$ check big cat if it have not eaten all the', '3 $5$233$ check big cat if it have not eaten all the meat', '2 $5$233$ check big cat if it have not eaten all the meat', '2 $5$233$ check big cat if it have not eaten all the']
pyTony
pyMod
Moderator
5,359 posts since Apr 2010
Reputation Points: 782
Solved Threads: 852

Let me check, but your solution seems ingeniously simple!

evstevemd

This regex does not catch the continuation lines. Maybe multiline mode would help (I do not remember exactly): catch the lines and then call .replace('\n', '') on them.
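
For reference, a small sketch of what multiline mode does and does not change, using a made-up three-line string (`sample` and `pattern` are assumptions, not the poster's data). `re.MULTILINE` only lets `^` and `$` match at every line boundary; the dot still stops at a newline, so a pattern ending in `.*` cannot reach the continuation lines on its own.

```python
import re

# Hypothetical miniature of the data: one start line plus two continuation lines.
sample = "1 $3$5$ first\n  second\n  third"

pattern = r'^\d+\s*\$\d+\$\d+\$.*'

# Without DOTALL the dot never crosses a newline, with or without MULTILINE.
plain = re.findall(pattern, sample)
multi = re.findall(pattern, sample, re.MULTILINE)

print(plain)  # ['1 $3$5$ first']
print(multi)  # ['1 $3$5$ first']
```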

pyTony

The text below does not work even in re.M mode. Is there anything else I can do?

1 $3$5$ 
        <a href=""http://daniweb.com"">For </a>
       <a href=""http://daniweb.com"">For </a><a href=""http://daniweb.com"">For </a>
        <a href=""http://daniweb.com"">For </a><a href=""http://daniweb.com"">For </a>
evstevemd

Oh! And it just returns 1 $3$5$ when I do

import re

res = re.findall(r'(\d{1,}\s*\$\s*\d{1,}\s*\$\s*\d{1,}\s*\$*.*)', data, re.MULTILINE)
for line in res:
    print(line)
    break  # only look at the first match

It is supposed to return the whole thing as a single line.
The text file is big and I didn't write it myself!

evstevemd

Looks like we should also make the dot match newlines by using

re.MULTILINE | re.DOTALL
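
A sketch of how the combined flags could be used, on a made-up two-record string (`sample`, `start`, `record` and `joined` are assumed names). With `re.DOTALL` a greedy `.*` would run to the end of the input, so this version uses a non-greedy `.*?` plus a lookahead that stops just before the next line starting with number$number$number$, or at the end of the text.

```python
import re

# Hypothetical stand-in for the real file contents.
sample = (
    "1 $3$5$\n"
    "  <a>For </a>\n"
    "  <a>For </a>\n"
    "2 $4$6$ text\n"
    "more text"
)

# Start-of-record marker: digits, '$', digits, '$', digits, '$' at a line start.
start = r'^\s*\d+\s*\$\s*\d+\s*\$\s*\d+\s*\$'

record = re.compile(
    start + r'.*?(?=' + start + r'|\Z)',
    re.MULTILINE | re.DOTALL,
)

# Collapse each record's whitespace (including newlines) to single spaces.
joined = [' '.join(m.group(0).split()) for m in record.finditer(sample)]

for line in joined:
    print(line)
```

This prints one line per record. Note that it also squeezes runs of spaces inside a record, which may or may not be wanted.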
pyTony

That works for the first line, but then everything after it matches as well because of the dot!

Is there a way to tell it to match everything up to a given pattern (the one that marks the next entry)?
For example, using the above with DOTALL/MULTILINE on the text below returns a single line when it should be two.

1 $3$5$
<a href=""http://daniweb.com"">For </a>
<a href=""http://daniweb.com"">For </a><a href=""http://daniweb.com"">For </a>
<a href=""http://daniweb.com"">For </a><a href=""http://daniweb.com"">For </a>
2 $4$6$
<a href=""http://daniweb.com"">For </a>
<a href=""http://daniweb.com"">For </a><a href=""http://daniweb.com"">For </a>
<a href=""http://daniweb.com"">For </a><a href=""http://daniweb.com"">For </a>
evstevemd

Actually, why not use plain Python to group the lines (itertools.groupby or a generator) instead of re?

data = """2 $5$233$ check big cat if it have not eaten all the meat
3 $5$233$ check big cat if it have not eaten all the meat 
        <a href=""http://example.com"">An Example duh! </a>
        <a href=""http://example.com"">An Example duh! </a> 
2 $5$233$ check big cat if it have not eaten all the meat 
check big cat if it have not eaten all the meat
<a href=""http://example.com"">An Example duh! </a>
2 $5$233$ check big cat if it have not eaten all the meat"""

def get_blocks(source):
    block = []
    for line in source:
        # simplified, lazy check; unlikely to misfire on this data
        if line.count('$') == 3:
            if block:
                yield ''.join(block)
            block = [line]
        else:
            block.append(line)


print('\n'.join(get_blocks(data.splitlines())))
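
The post also mentions itertools.groupby as an option. A minimal sketch of that variant follows, assuming Python 3; `START`, `group_blocks` and the three sample lines are hypothetical names and data, and the start-line test here is a regex rather than the `count('$')` check above. The key function numbers the blocks, so consecutive lines with the same number end up in one group.

```python
import re
from itertools import groupby

# Assumed start-line test: number $ number $ number $ at the beginning of a line.
START = re.compile(r'\s*\d+\s*\$\s*\d+\s*\$\s*\d+\s*\$')


def group_blocks(lines):
    """Yield lists of lines; each list begins at a start line."""
    block_no = 0

    def key(line):
        # Bump the counter whenever a new block begins.
        nonlocal block_no
        if START.match(line):
            block_no += 1
        return block_no

    for _, block in groupby(lines, key):
        yield list(block)


lines = ["1 $2$3$ a", "  b", "2 $4$6$ c"]
result = [' '.join(part.strip() for part in block) for block in group_blocks(lines)]
print(result)  # ['1 $2$3$ a b', '2 $4$6$ c']
```

Because groupby closes the final group when the input ends, this variant needs no separate step for the last block.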
pyTony

PT,
I don't care which tool gets the job done. What I want to achieve is this: everything from number$number$number$text up to the next number$number$number$ should be on the same line. But the text is random.
So if itertools can do the job, that is fine with me :)

Sample text is attached.

Attachments sample.txt (23.05KB)
evstevemd

I cannot understand the code. Can you explain it a bit? I have been away from Python for so long ;)

evstevemd

This seems to work for your sample.txt as well:

def get_blocks(source):
    block = []
    for line in source:
        # simplified, lazy check; unlikely to misfire on this data
        if len(line) > 12 and line.lstrip()[0].isdigit() and line[:12].count('$') == 3:
            if block:
                yield ''.join(block)
            block = [line]
        else:
            block.append(line)


with open('sample.txt') as data:
    print('------------------\n'.join(get_blocks(data)))

EDIT: slightly stronger check for correct start of block

pyTony

Seems to work OK (I have to do further checks for data integrity). But could you explain the code?

evstevemd

I think my code fails to emit the last block (sorry, but that was under 5 minutes of coding, with only one run on your sample). You should add

yield ''.join(block)


to the end of the function.

Can you tell me what you understand and what you do not?

pyTony

I understand the general concept of what you are doing.
However, it does not work for these lines in sample.txt:

1 $2$3$  And <a href=""strongs://430"">God </a><a href=""strongs://1288"">blessed </a><a href=""strongs://853""></a>the <a href=""strongs://7637"">seventh </a><a href=""strongs://3117"">day</a>, and <a href=""strongs://6942"">sanctified </a>it: <a href=""strongs://3588"">because </a>that in it he had <a href=""strongs://7673"">rested </a>from <a href=""strongs://4480|3605"">all </a>his <a href=""strongs://4399"">work </a><a href=""strongs://834"">which </a><a href=""strongs://430"">God </a><a href=""strongs://1254"">created </a>and <a href=""strongs://6213"">made</a>. 
1 $2$4$ 
        <a href=""strongs://428"">These </a>
        <span style=""color:#808080;font-style:italic;""> are </span> the <a href=""strongs://8435"">generations </a>of the <a href=""strongs://8064"">heavens </a>and of the <a href=""strongs://776"">earth </a>when they were <a href=""strongs://1254"">created</a>, in the <a href=""strongs://3117"">day </a>that the <a href=""strongs://3068"">LORD </a><a href=""strongs://430"">God </a><a href=""strongs://6213"">made </a>the <a href=""strongs://776"">earth </a>and the <a href=""strongs://8064"">heavens</a>,
evstevemd

I am collecting a block and yielding it when a new block starts (and so the last block must have a yield at the end). To accept such a short start line, reduce the length limit from 12 to 8 or 6.
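
To illustrate the length limit, here is a small sketch that applies the posted start-line test with the limit as a parameter. `is_start` and the two sample strings are hypothetical, and `[:1]` replaces `[0]` only so that whitespace-only lines cannot raise an IndexError.

```python
def is_start(line, limit):
    # Same idea as the check in get_blocks, with the length limit made explicit.
    return len(line) > limit and line.lstrip()[:1].isdigit() and line[:12].count('$') == 3


short_line = "1 $2$4$ \n"  # a start line with nothing after the markers
long_line = "1 $2$3$  And God blessed the seventh day\n"

print(len(short_line))           # 9
print(is_start(short_line, 12))  # False: too short for the old limit
print(is_start(short_line, 8))   # True
print(is_start(long_line, 12))   # True
```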

pyTony

Thanks PT,
I will mark this solved and sort out the remaining quirks myself.
Thanks for your help!

evstevemd

Here is my cleaned-up code anyway:

def block_start(line, limit=8):
    # A block starts with a digit and has three '$' within its first 12 characters.
    # [:1] avoids an IndexError on whitespace-only lines.
    return len(line) > limit and line.lstrip()[:1].isdigit() and line[:12].count('$') == 3


def get_blocks(source, block_start=block_start):
    block = []
    for line in source:
        if block_start(line):
            # a new block begins: emit the one collected so far
            if block:
                yield ''.join(block)
            block = [line]
        else:
            block.append(line)
    # emit the last block, which no later start line would trigger
    if block:
        yield ''.join(block)

with open('sample.txt') as data:
    print('------------------\n'.join(get_blocks(data)))
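
One caveat: lines read from a file keep their trailing newline, so each joined block still spans several lines. A possible follow-up step, sketched under the assumption that squeezing all whitespace runs to single spaces is acceptable (`one_line`, `blocks` and `flat` are hypothetical; `blocks` stands in for the strings that get_blocks yields):

```python
def one_line(block_text):
    """Collapse a multi-line block into one line with single spaces."""
    return ' '.join(block_text.split())


# Stand-in for the output of get_blocks on a file.
blocks = [
    "3 $5$233$ check big cat \n        <a>An Example</a>\n        <a>An Example</a> \n",
    "2 $5$233$ check big cat\n",
]

flat = [one_line(b) for b in blocks]
print('\n'.join(flat))
```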
pyTony

Thanks Tony!

evstevemd

This question has already been solved
