Hi,

I have a text file with a string in it. The string is built like this:

test =""" 
        [
        "this is a text and its supposed to contain every possible char."
        ], 
        [
        "like *.;#]§< and many more."
        ], 
        [
        "plus there are even
        newlines

        in it."
        ]"""

I want to parse out each text in between the square brackets and the quotes. So my desired output would be a list like the following:

['this is a text and its supposed to contain every possible char.', 'like *.;#]§< and many more.', 'plus there are even newlines in it.']

I tried the following regex, with the corresponding output:

import re

my_list = re.findall(r'(?<=\[\n {8}\").*(?=\"\n {8}\])', test)
print(my_list)

['this is a text and its supposed to contain every possible char.', 'like *.;#]§< and many more.']

So this one is close to what I want: a list of two elements containing only the text between the square brackets and the quotes. But it doesn't include the third element because that one contains newlines, which '.' does not match by default.
So I tried another regex:

my_list = re.findall(r'(?<=\[\n {8}\")[\s\S]*(?=\"\n {8}\])', test)
print(my_list)

['this is a text and its supposed to contain every possible char."\n        ], \n        [\n        "like *.;#]§< and many more."\n        ], \n        [\n        "plus there are even\n        newlines\n        \n        in it.']

This time it includes the third element, but it seems to ignore the lookahead entirely and doesn't split the text, since the list consists of only one element.

How can I get the desired output with regex?
Or is there a faster or clearer way to get it without regex (BeautifulSoup, maybe)?

I am thankful for any help and hints. Thank you in advance.

All 6 Replies

Try using \"([^\"]*)\" as your regular expression. This will return matches on all double quoted strings. Using [^\"]* instead of .* ensures that each double quoted string is taken as a unit.
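For reference, here is a minimal sketch of this suggestion applied to the string from the question. The whitespace-collapsing step is an addition that is not part of the suggestion itself; it assumes the newlines and indentation inside a text should become single spaces, as in the desired output. Note that `[^"]*` stops at the first double quote, so this only works while the texts contain no quotes of their own.

```python
import re

# Sample input modelled on the question's string (8-space indentation assumed).
test = (
    ' \n'
    '        [\n'
    '        "this is a text and its supposed to contain every possible char."\n'
    '        ], \n'
    '        [\n'
    '        "like *.;#]§< and many more."\n'
    '        ], \n'
    '        [\n'
    '        "plus there are even\n'
    '        newlines\n'
    '\n'
    '        in it."\n'
    '        ]'
)

# [^"]* matches any run of characters except a double quote, newlines included.
raw = re.findall(r'"([^"]*)"', test)

# Assumption: collapse each run of whitespace (newlines plus indentation) to one space.
my_list = [" ".join(item.split()) for item in raw]
print(my_list)
```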

Thanks for your reply. In this case it works, but I want it to work for every possible character, even when the text between the brackets and the quotes contains quotes itself, as in the following example:

[
"this should also be "included" or even single "."
]

And then your regex fails unfortunately.

Why don't you list the matches, one per line, for the input string and I'll try to put together a regexp that gives you what you want?

Please use the test string

' \n        [\n        "this is a text and its supposed to "contain" every possible char."\n        ], \n        [\n        "like *.;#]§< and many more."\n        ], \n        [\n        "plus there are even\n        newlines\n        in it."\n        ]'

which is the original string plus your later complication.

Incidentally, can you see how

i want it to apply it for every possible character

is a little ambiguous?

I am not sure it would help if I listed the strings one per line and then applied your regex to them. The thing is that later on I want to parse a lot more of those texts, and the only thing they have in common is that the text I am trying to match sits between [\n" and "\n]. The texts in between can differ a lot and can contain any character.
That's why I was trying to find this specific regex.
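The delimiter rule described here can be sketched with a non-greedy match. The second attempt in the question returned a single element because `[\s\S]*` is greedy and runs on to the last closing delimiter; `.*?` with `re.DOTALL` stops at the first one instead. This is only a sketch under assumptions: the indentation between the bracket and the quote is treated as arbitrary whitespace (`\s*`), whitespace inside each text is collapsed to single spaces, and no text itself contains a quote followed by a newline and a closing bracket.

```python
import re

# Test string from the reply above, including the embedded quotes.
test = (
    ' \n'
    '        [\n'
    '        "this is a text and its supposed to "contain" every possible char."\n'
    '        ], \n'
    '        [\n'
    '        "like *.;#]§< and many more."\n'
    '        ], \n'
    '        [\n'
    '        "plus there are even\n'
    '        newlines\n'
    '        in it."\n'
    '        ]'
)

# Opening delimiter: '[', newline, indentation, '"'.
# Closing delimiter: '"', newline, indentation, ']'.
# (.*?) is non-greedy, and re.DOTALL lets '.' match newlines as well.
pattern = re.compile(r'\[\n\s*"(.*?)"\n\s*\]', re.DOTALL)
raw = pattern.findall(test)

# Assumption: collapse newlines and indentation inside each text to one space.
my_list = [" ".join(item.split()) for item in raw]
print(my_list)
```

Embedded quotes and a stray `]` inside a text are left alone, because only a quote that is directly followed by a newline, optional whitespace, and `]` ends a match.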

I can't help you if you won't give me a simple "here is the input" and "here is the output".

A simple split will do the trick:

test =""" 
        [
        "this is a text and its supposed to contain every possible char."
        ], 
        [
        "like *.;#]§< and many more."
        ], 
        [
        "plus there are even
        newlines

        in it."
        ]"""

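# Split at every opening bracket that is followed by a newline.
# The first chunk is whatever precedes the first bracket.
x = test.split('[\n')
for rec in x:
    print(rec)
    print("-" * 20)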
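The chunks produced by the split still carry the surrounding quotes, the closing brackets, and the indentation. One possible way to finish the job without regex is sketched below. The trimming logic is an assumption layered on top of the split idea: it keeps everything between the first and the last double quote of each chunk and collapses whitespace to single spaces. It would break if a text itself contained `[` followed by a newline.

```python
# Test string with embedded quotes, as used earlier in the thread.
test = (
    ' \n'
    '        [\n'
    '        "this is a text and its supposed to "contain" every possible char."\n'
    '        ], \n'
    '        [\n'
    '        "like *.;#]§< and many more."\n'
    '        ], \n'
    '        [\n'
    '        "plus there are even\n'
    '        newlines\n'
    '        in it."\n'
    '        ]'
)

my_list = []
# Skip the first chunk: it is the text before the first opening bracket.
for chunk in test.split('[\n')[1:]:
    start = chunk.find('"')   # opening quote of the text
    end = chunk.rfind('"')    # closing quote of the text
    if start != -1 and end > start:
        text = chunk[start + 1:end]
        # Assumption: collapse newlines and indentation to single spaces.
        my_list.append(" ".join(text.split()))

print(my_list)
```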