Hi,
I need to concatenate lines based on a regex. The lines to be concatenated are scattered through the file. Every record begins with number $number$number$ followed by sentences. Nothing marks the end of a record, only the beginning. Here is an example. I want to rewrite this:

2 $5$233$ check big cat if it have not eaten all the meat
3 $5$233$ check big cat if it have not eaten all the meat 
        <a href=""http://example.com"">An Example duh! </a>
        <a href=""http://example.com"">An Example duh! </a> 
2 $5$233$ check big cat if it have not eaten all the meat 
check big cat if it have not eaten all the meat
<a href=""http://example.com"">An Example duh! </a>
2 $5$233$ check big cat if it have not eaten all the meat

into this one

2 $5$233$ check big cat if it have not eaten all the meat
3 $5$233$ check big cat if it have not eaten all the meat <a href=""http://example.com"">An Example duh! </a> <a href=""http://example.com"">An Example duh! </a> 
2 $5$233$ check big cat if it have not eaten all the meat  check big cat if it have not eaten all the meat <a href=""http://example.com"">An Example duh! </a>
2 $5$233$ check big cat if it have not eaten all the meat

Since I have been out of Python for a while, I'm running out of ideas and I need your help.
Thanks

"copyright", "credits" or "license()" for more information.
>>> import re
>>> data = """2 $5$233$ check big cat if it have not eaten all the meat
3 $5$233$ check big cat if it have not eaten all the meat 
        <a href=""http://example.com"">An Example duh! </a>
        <a href=""http://example.com"">An Example duh! </a> 
2 $5$233$ check big cat if it have not eaten all the meat 
check big cat if it have not eaten all the meat
<a href=""http://example.com"">An Example duh! </a>
2 $5$233$ check big cat if it have not eaten all the meat"""
>>> re.findall(r'(\d* \$\d*\d*.*) ', data)
['2 $5$233$ check big cat if it have not eaten all the', '3 $5$233$ check big cat if it have not eaten all the meat', '2 $5$233$ check big cat if it have not eaten all the meat', '2 $5$233$ check big cat if it have not eaten all the']

Let me check, but your solution seems ingeniously simple!

This regex does not catch the following lines. Maybe use multiline mode (I do not remember exactly): catch the lines and then do .replace('\n', '').

Oh! And it just returns 1 $3$5$ when I do

res = re.findall(r'(\d{1,}\s*\$\s*\d{1,}\s*\$\s*\d{1,}\s*\$*.*)', data, re.MULTILINE)
for line in res:
    print(line)
    exit(0)  # quit after printing the first match

It is supposed to return the whole record as a single line.
The text file is big and I didn't write it myself!

Looks like we would also need the dot to match newlines, by doing

re.MULTILINE | re.DOTALL

That is fine up to the first line, but then all the following lines match as well, because of the dot!

Is there a way to tell it to match everything except a given pattern (the one that marks the start of the next record)?
For example, using the regex above with DOTALL | MULTILINE on the text below, everything is returned as a single line when it should be two.

1 $3$5$
<a href=""http://daniweb.com"">For </a>
<a href=""http://daniweb.com"">For </a><a href=""http://daniweb.com"">For </a>
<a href=""http://daniweb.com"">For </a><a href=""http://daniweb.com"">For </a>
2 $4$6$
<a href=""http://daniweb.com"">For </a>
<a href=""http://daniweb.com"">For </a><a href=""http://daniweb.com"">For </a>
<a href=""http://daniweb.com"">For </a><a href=""http://daniweb.com"">For </a>
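
One possible answer to the question above, as a minimal sketch: keep DOTALL | MULTILINE, make the dot lazy, and stop it with a lookahead at the next line that starts like a header. The header pattern and the names HEADER, RECORD and one_line_records are assumptions made for illustration; the real file may need a looser or stricter pattern.

```python
import re

# Assumed header shape: digits, optional blanks, then $digits$digits$.
HEADER = r'\d+[ \t]*\$\d+\$\d+\$'

# A record runs from a header at the start of a line up to (not including)
# the next such header, or to the end of the text.
RECORD = re.compile(
    r'^' + HEADER + r'.*?(?=^' + HEADER + r'|\Z)',
    re.MULTILINE | re.DOTALL,
)

def one_line_records(text):
    """Return each record with its lines stripped and joined by one space."""
    records = []
    for match in RECORD.finditer(text):
        parts = (part.strip() for part in match.group(0).splitlines())
        records.append(' '.join(part for part in parts if part))
    return records

sample = (
    '1 $3$5$\n'
    '<a href="http://daniweb.com">For </a>\n'
    '2 $4$6$\n'
    '<a href="http://daniweb.com">For </a>\n'
)
for record in one_line_records(sample):
    print(record)
```

The lazy `.*?` grows one character at a time, and the lookahead lets it stop just before the next header without consuming it, so each header starts its own match.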

Actually, why can't you use normal Python to group the lines (itertools.groupby or a generator) instead of re?

data = """2 $5$233$ check big cat if it have not eaten all the meat
3 $5$233$ check big cat if it have not eaten all the meat 
        <a href=""http://example.com"">An Example duh! </a>
        <a href=""http://example.com"">An Example duh! </a> 
2 $5$233$ check big cat if it have not eaten all the meat 
check big cat if it have not eaten all the meat
<a href=""http://example.com"">An Example duh! </a>
2 $5$233$ check big cat if it have not eaten all the meat"""

def get_blocks(source):
    block = []
    for line in source:
        # simplified lazy check not very likely to mistake
        if line.count('$') == 3:
            if block:
                yield ''.join(block)
            block = [line]
        else:
            block.append(line)


print('\n'.join(get_blocks(data.splitlines())))

PT,
I don't care which tool gets the job done. What I want to achieve is this: everything from number$number$number$text up to the next number$number$number$ should be on the same line. But the text is random.
So if itertools can do the job, that is fine with me :)

sample text is attached
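
For reference, the itertools.groupby route mentioned above could look roughly like this. It is only a sketch: the header regex and the names HEADER and grouped_records are assumptions, and a regex is used just to recognise a header line, not to do the grouping.

```python
import itertools
import re

# Assumed header shape: optional indent, digits, then $digits$digits$.
HEADER = re.compile(r'\s*\d+[ \t]*\$\d+\$\d+\$')

def grouped_records(lines):
    """Yield one flattened string per record using itertools.groupby."""
    record_number = 0

    def key(line):
        # Every header line bumps the counter, so all lines of one record
        # share the same key and groupby keeps them together.
        nonlocal record_number
        if HEADER.match(line):
            record_number += 1
        return record_number

    for _, group in itertools.groupby(lines, key):
        joined = ' '.join(part.strip() for part in group if part.strip())
        if joined:
            yield joined

lines = [
    '2 $5$233$ check big cat',
    '3 $5$233$ check big cat ',
    '        <a>An Example duh! </a>',
    '2 $5$233$ check big cat ',
    'check big cat',
]
for record in grouped_records(lines):
    print(record)
```

Any lines that come before the first header get key 0 and are yielded as their own group.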

I cannot understand the code. Can you explain it a bit? I have been away from Python for so long ;)

It also seems to work for your sample.txt:

def get_blocks(source):
    block = []
    for line in source:
        # simplified lazy check not very likely to mistake
        if len(line) > 12 and line.lstrip()[0].isdigit() and line[:12].count('$') == 3:
            if block:
                yield ''.join(block)
            block = [line]
        else:
            block.append(line)


with open('sample.txt') as data:
    print('------------------\n'.join(get_blocks(data)))

EDIT: slightly stronger check for correct start of block

Seems to work OK (I have to do further checks for data integrity). But could you explain the code?

I think my code fails to emit the last block (sorry, that was under 5 minutes of coding and only one run on your sample). You should add

yield ''.join(block)

to end of the function.

Can you tell me what you understand and what you don't?

I understand the general concept of what you are doing.
However, it does not work for these lines in sample.txt:

1 $2$3$  And <a href=""strongs://430"">God </a><a href=""strongs://1288"">blessed </a><a href=""strongs://853""></a>the <a href=""strongs://7637"">seventh </a><a href=""strongs://3117"">day</a>, and <a href=""strongs://6942"">sanctified </a>it: <a href=""strongs://3588"">because </a>that in it he had <a href=""strongs://7673"">rested </a>from <a href=""strongs://4480|3605"">all </a>his <a href=""strongs://4399"">work </a><a href=""strongs://834"">which </a><a href=""strongs://430"">God </a><a href=""strongs://1254"">created </a>and <a href=""strongs://6213"">made</a>. 
1 $2$4$ 
        <a href=""strongs://428"">These </a>
        <span style=""color:#808080;font-style:italic;""> are </span> the <a href=""strongs://8435"">generations </a>of the <a href=""strongs://8064"">heavens </a>and of the <a href=""strongs://776"">earth </a>when they were <a href=""strongs://1254"">created</a>, in the <a href=""strongs://3117"">day </a>that the <a href=""strongs://3068"">LORD </a><a href=""strongs://430"">God </a><a href=""strongs://6213"">made </a>the <a href=""strongs://776"">earth </a>and the <a href=""strongs://8064"">heavens</a>,

I am collecting a block and yielding it when a new block starts (and so the last block must have a yield at the end). To accept such a short start line, reduce the length limit from 12 to 8 or 6.
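
To see why the limit matters, here is a small check of the start-of-block test against the short header from the post. It is a sketch: the limit is turned into a parameter, `[:1]` replaces `[0]` so blank lines cannot raise an error, and the two test strings assume the header ends with a newline, with or without a trailing space.

```python
def block_start(line, limit):
    # Start-of-block test from the thread, with the length limit as a parameter.
    return len(line) > limit and line.lstrip()[:1].isdigit() and line[:12].count('$') == 3

with_space = "1 $2$4$ \n"    # 9 characters including the newline
without_space = "1 $2$4$\n"  # 8 characters including the newline

for limit in (12, 8, 6):
    print(limit, block_start(with_space, limit), block_start(without_space, limit))
# 12 False False
# 8 True False
# 6 True True
```

With a limit of 8 the header is only accepted if it carries the trailing space, so 6 is the safer choice for lines like this.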

Thanks PT,
I will mark this solved and finish up quirks myself
Thanks for your help!

Anyway, here is my cleaned-up code:

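def block_start(line, limit=8):
    # Simplified, lazy check: the line is long enough, starts with a digit
    # and has exactly three '$' within its first 12 characters.
    # [:1] instead of [0] avoids an IndexError on whitespace-only lines.
    return len(line) > limit and line.lstrip()[:1].isdigit() and line[:12].count('$') == 3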


def get_blocks(source, block_start=block_start):
    block = []
    for line in source:
        # simplified lazy check not very likely to mistake
        if block_start(line):
            if block:
                yield ''.join(block)
            block = [line]
        else:
            block.append(line)
    if block:
        yield ''.join(block)

with open('sample.txt') as data:
    print('------------------\n'.join(get_blocks(data)))
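
Note that lines read from a file keep their trailing newline, so ''.join(block) still prints each block over several lines. If the goal is the one-line-per-record layout from the original question, a variant along these lines might do. It is only a sketch: io.StringIO stands in for sample.txt, and the name one_line_blocks and the limit of 6 are assumptions.

```python
import io

def block_start(line, limit=6):
    # Same lazy check as above; [:1] keeps blank lines from raising an error.
    return len(line) > limit and line.lstrip()[:1].isdigit() and line[:12].count('$') == 3

def one_line_blocks(source):
    """Yield each block as a single line, parts separated by one space."""
    block = []
    for line in source:
        text = line.strip()
        if not text:
            continue  # skip empty lines entirely
        if block_start(line):
            if block:
                yield ' '.join(block)
            block = [text]
        else:
            block.append(text)
    if block:
        yield ' '.join(block)  # do not forget the last block

# Stand-in for open('sample.txt'); the content is a shortened made-up sample.
data = io.StringIO(
    "2 $5$233$ check big cat\n"
    "3 $5$233$ check big cat \n"
    "        <a>An Example duh! </a>\n"
    "2 $5$233$ check big cat \n"
    "check big cat\n"
)
for block in one_line_blocks(data):
    print(block)
```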