Hi,
I need to concatenate lines based on a regex. The lines to be concatenated are scattered. Every record begins with number$number$number$ followed by text. Nothing marks the end of a record, only its beginning. Here is an example. I want to rewrite this
2 $5$233$ check big cat if it have not eaten all the meat
3 $5$233$ check big cat if it have not eaten all the meat
<a href="http://example.com">An Example duh! </a>
<a href="http://example.com">An Example duh! </a>
2 $5$233$ check big cat if it have not eaten all the meat
check big cat if it have not eaten all the meat
<a href="http://example.com">An Example duh! </a>
2 $5$233$ check big cat if it have not eaten all the meat
into this
2 $5$233$ check big cat if it have not eaten all the meat
3 $5$233$ check big cat if it have not eaten all the meat <a href="http://example.com">An Example duh! </a> <a href="http://example.com">An Example duh! </a>
2 $5$233$ check big cat if it have not eaten all the meat check big cat if it have not eaten all the meat <a href="http://example.com">An Example duh! </a>
2 $5$233$ check big cat if it have not eaten all the meat
Since I have been away from Python for a while, I'm running out of ideas and I need your help.
Thanks
>>> import re
>>> data = """2 $5$233$ check big cat if it have not eaten all the meat
3 $5$233$ check big cat if it have not eaten all the meat
<a href="http://example.com">An Example duh! </a>
<a href="http://example.com">An Example duh! </a>
2 $5$233$ check big cat if it have not eaten all the meat
check big cat if it have not eaten all the meat
<a href="http://example.com">An Example duh! </a>
2 $5$233$ check big cat if it have not eaten all the meat"""
>>> re.findall(r'(\d* \$\d*\d*.*) ', data)
['2 $5$233$ check big cat if it have not eaten all the', '3 $5$233$ check big cat if it have not eaten all the meat', '2 $5$233$ check big cat if it have not eaten all the meat', '2 $5$233$ check big cat if it have not eaten all the']
Let me check, but your solution seems ingeniously simple!
It looks like we would also have to make the dot match newlines by passing
re.MULTILINE | re.DOTALL
That is fine for the first line, but then all the following lines match too, because of the dot!
Is there a way to tell it to match everything except a given pattern (the one that marks the start of the next record)?
For example, using the above with DOTALL/MULTILINE on the data below, everything gets returned as a single line when it should be two.
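One possible answer to the DOTALL question, as a sketch rather than something tested on the real data: make the dot lazy and stop it with a lookahead at the next record start. The START pattern is an assumption about what a record start looks like (digits, optional spaces, then $digits$digits$), and the names `record` and `records` are made up for this example.

```python
import re

data = """2 $5$233$ check big cat if it have not eaten all the meat
3 $5$233$ check big cat if it have not eaten all the meat
<a href="http://example.com">An Example duh! </a>
<a href="http://example.com">An Example duh! </a>
2 $5$233$ check big cat if it have not eaten all the meat
check big cat if it have not eaten all the meat
<a href="http://example.com">An Example duh! </a>
2 $5$233$ check big cat if it have not eaten all the meat"""

# Assumed record start: digits, optional spaces, then $digits$digits$
START = r'^\d+ *\$\d+\$\d+\$'

# DOTALL lets '.' cross newlines; the lazy '.*?' stops as soon as the
# lookahead sees the next record start (or the end of the string).
record = re.compile(START + r'.*?(?=' + START + r'|\Z)',
                    re.MULTILINE | re.DOTALL)

# Join the physical lines of each match with single spaces.
records = [' '.join(part.strip() for part in m.group().splitlines())
           for m in record.finditer(data)]
print('\n'.join(records))
```

The lookahead does not consume anything, so the next match can begin exactly at the next record start.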
Actually, why can't you use plain Python to group the lines (itertools.groupby or a generator)? Why must it be re?
data = """2 $5$233$ check big cat if it have not eaten all the meat
3 $5$233$ check big cat if it have not eaten all the meat
<a href="http://example.com">An Example duh! </a>
<a href="http://example.com">An Example duh! </a>
2 $5$233$ check big cat if it have not eaten all the meat
check big cat if it have not eaten all the meat
<a href="http://example.com">An Example duh! </a>
2 $5$233$ check big cat if it have not eaten all the meat"""

def get_blocks(source):
    block = []
    for line in source:
        # simplified, lazy check (exactly three '$'), not very likely to misfire
        if line.count('$') == 3:
            # a new record starts: hand out the one collected so far
            if block:
                yield ' '.join(block)
            block = [line]
        else:
            block.append(line)
    # no later start line flushes the last record, so yield it here
    if block:
        yield ' '.join(block)

print('\n'.join(get_blocks(data.splitlines())))
PT,
I don't care which tool gets my job done. What I want to achieve is this: everything from number$number$number$text up to the next number$number$number$ should end up on the same line. The text itself is arbitrary.
So if itertools can do the job, that is fine with me :)
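Since itertools.groupby was mentioned as an option, here is a rough sketch of how it could be used. It is only an illustration: the RECORD_START pattern is an assumption about the format (digits, optional spaces, then $digits$digits$), and `join_records` and `key` are hypothetical names. The key function counts record starts, so a start line and its continuation lines share one key and land in one group.

```python
import re
from itertools import groupby

# Assumed record start: digits, optional spaces, then $digits$digits$
RECORD_START = re.compile(r'\d+ *\$\d+\$\d+\$')

def join_records(lines):
    """Yield one joined line per record."""
    record_no = 0

    def key(line):
        # Bump the counter on every start line; continuation lines
        # keep the number of the record they belong to.
        nonlocal record_no
        if RECORD_START.match(line):
            record_no += 1
        return record_no

    for _, group in groupby(lines, key=key):
        yield ' '.join(line.strip() for line in group)

lines = [
    '3 $5$233$ check big cat if it have not eaten all the meat',
    '<a href="http://example.com">An Example duh! </a>',
    '2 $5$233$ check big cat if it have not eaten all the meat',
    'check big cat if it have not eaten all the meat',
    '2 $5$233$ check big cat if it have not eaten all the meat',
]
joined = list(join_records(lines))
print('\n'.join(joined))
```

The stateful key only works because groupby calls it once per line, in order.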
I cannot follow the get_blocks code. Can you explain it a bit? I have been away from Python for so long ;)
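def get_blocks(source):
    block = []
    for line in source:
        # simplified, lazy check, not very likely to misfire:
        # long enough, starts with a digit, three '$' in the first 12 characters
        if len(line) > 12 and line.lstrip()[:1].isdigit() and line[:12].count('$') == 3:
            if block:
                yield ''.join(block)
            block = [line]
        else:
            block.append(line)
    # no later start line flushes the last block, so yield it here
    if block:
        yield ''.join(block)

with open('sample.txt') as data:
    # lines read from a file keep their newlines; blocks are separated by a dashed rule
    print('------------------\n'.join(get_blocks(data)))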
EDIT: slightly stronger check for correct start of block
Seems to work OK (I still have to check the data integrity further). But could you explain the code?
I am collecting a block and yielding it when a new block starts (and so the last block must have a yield at the end). To accept such a short start line, reduce the length limit from 12 to 8 or 6.
def block_start(line, limit=8):
    # simplified, lazy check, not very likely to misfire:
    # long enough, starts with a digit, three '$' in the first 12 characters
    return len(line) > limit and line.lstrip()[:1].isdigit() and line[:12].count('$') == 3

def get_blocks(source, block_start=block_start):
    block = []
    for line in source:
        if block_start(line):
            # a new block starts: hand out the one collected so far
            if block:
                yield ''.join(block)
            block = [line]
        else:
            block.append(line)
    # the last block has no following start line, so yield it here
    if block:
        yield ''.join(block)

with open('sample.txt') as data:
    print('------------------\n'.join(get_blocks(data)))
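Because get_blocks takes the start test as a parameter, a stricter test can be swapped in without touching the loop. The sketch below is only an illustration of that idea: `strict_start` is a hypothetical regex-based test (assuming a start line is digits, optional spaces, then $digits$digits$), io.StringIO stands in for sample.txt, whose contents are not shown in this thread, and this variant joins each block onto one line, as asked in the first post.

```python
import io
import re

def get_blocks(source, block_start):
    """Collect lines into blocks; yield each block joined onto one line."""
    block = []
    for line in source:
        if block_start(line):
            if block:
                yield ' '.join(part.strip() for part in block)
            block = [line]
        else:
            block.append(line)
    # flush the last block, which no later start line will trigger
    if block:
        yield ' '.join(part.strip() for part in block)

# Hypothetical stricter start test; returns a match object (truthy) or None.
strict_start = re.compile(r'\s*\d+ *\$\d+\$\d+\$').match

# Stand-in for open('sample.txt'), using lines from the first post.
sample = io.StringIO(
    '3 $5$233$ check big cat if it have not eaten all the meat\n'
    '<a href="http://example.com">An Example duh! </a>\n'
    '2 $5$233$ check big cat if it have not eaten all the meat\n'
)
records = list(get_blocks(sample, strict_start))
print('\n'.join(records))
```

Any callable that takes a line and returns something truthy for a start line can be passed the same way.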