Decided to start my own thread rather than hijack thread "Sorting"

I followed ghostdog74's advice and used the re module to extract numeric strings:

import re

data_raw = """[20]
[ 35+ ]
age = 40
(84)
100kg
  $245
"""

# use regex module re to extract numeric string
data_list = re.findall(r"\d+",data_raw)
print data_list  # ['20', '35', '40', '84', '100', '245']

That works fine, but when I change one of the values to a floating point number:

import re

data_raw = """[20]
[ 35+ ]
age = 40
(84)
100kg
  $245.99
"""

# use re to extract numeric string (however, float split at '.')
data_list = re.findall(r"\d+",data_raw)
print data_list  # ['20', '35', '40', '84', '100', '245', '99']

How can I make re handle floating point numbers?

Hi!

data_list = re.findall(r'\d+(?:\.\d+)?', data_raw)

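You want one or more digits (\d+), optionally followed by a point and some more digits (\.\d+). The trailing ? is what makes that second part optional.
The ?: stops the () from creating a capturing group; we only want the parentheses to cluster the two patterns. With a capturing group, findall returns just the group's contents instead of the whole match. You can see it in the output you get without the "?:":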

['', '', '', '', '', '.99']
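For comparison, here is a small side-by-side sketch of the two group styles, reusing data_raw from the question. The names captured and whole are only for illustration:

```python
import re

data_raw = """[20]
[ 35+ ]
age = 40
(84)
100kg
  $245.99
"""

# capturing group: findall returns only what the group matched
captured = re.findall(r"\d+(\.\d+)?", data_raw)
# non-capturing group: findall returns the whole match
whole = re.findall(r"\d+(?:\.\d+)?", data_raw)

print(captured)  # ['', '', '', '', '', '.99']
print(whole)     # ['20', '35', '40', '84', '100', '245.99']
```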

Hope this was not too confusing ;)

Regards, mawe

another way is to use the "|" special character.

>>> re.findall(r'\d+\.\d+|\d+',data_raw)
['20', '35', '40', '84', '100', '245.99']
>>>

Sorry, somehow I can't find my edit button, but anyway:
A decimal/float looks like this: 245.332 or 4.5 or 74.32.
So to match them we need one or more digits, followed by "." and then one or more digits again, so the expression becomes "\d+\.\d+". The "|\d+" alternative then picks up the plain integers.
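One detail worth checking with the "|" approach is the order of the alternatives, since re tries them left to right and takes the first one that succeeds. A minimal sketch, using a made-up text string rather than the original data:

```python
import re

text = "100kg $245.99"

# float alternative first: the longer pattern gets the first try
float_first = re.findall(r"\d+\.\d+|\d+", text)
# integer alternative first: \d+ succeeds on '245', so the float branch never runs
int_first = re.findall(r"\d+|\d+\.\d+", text)

print(float_first)  # ['100', '245.99']
print(int_first)    # ['100', '245', '99']
```

So the float pattern should come before the integer pattern.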

Two nice solutions, wow! Now I have a question, what if we had a "-$245.99" to extract so it would give "-245.99"?

Hi, wow, getting harder. :-)
Anyway, here's a rather crude way, and I am sure there are better ways (using re). I do a substitution first, then the rest:

>>> data_raw = """
... header
... [23 ]
... [ 43 ]
... [4323]
... [-$23.44 ]
... [ 12.32 ]
... footer
... """
>>> 
>>> re.findall(r"(-\d+\.\d+|\d+\.\d+|\d+)",re.sub(r"(-\$)","-",data_raw))
['23', '43', '4323', '-23.44', '12.32']

I still prefer not to use re though lol:)
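As one possible alternative to the substitute-then-search approach, the sign and the dollar sign can be handled in a single pattern. This is only a sketch under the assumption that a "$" may sit between the minus sign and the digits; the names pairs and numbers are made up for the example:

```python
import re

data_raw = """
header
[23 ]
[ 43 ]
[4323]
[-$23.44 ]
[ 12.32 ]
footer
"""

# group 1: optional minus sign; \$? skips an optional dollar sign;
# group 2: digits with an optional fractional part
pairs = re.findall(r"(-?)\$?(\d+(?:\.\d+)?)", data_raw)

# with two capturing groups, findall returns (sign, digits) tuples
numbers = [sign + digits for sign, digits in pairs]
print(numbers)  # ['23', '43', '4323', '-23.44', '12.32']
```

Note that this would also treat the hyphen in something like "12-34" as a minus sign, so it fits data like the above rather than arbitrary text.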

This re stuff makes my head spin! I can see that it is very powerful for text processing, but also seemingly very complex! Almost another language within Python.

Most string manipulation problems can be solved with Python's string functions. Only very complex ones need regexps, so try to avoid regexps where possible. Of course, if you are good at them, then by all means use them, but think of the next person reading your code, who may not understand regexps. Just my $0.02 :cheesy:

Makes my head spin too, so I used this short 'regular stuff' code ...

# extract numeric value from a data stream
# caveat --> only for one number per data line
 
data_raw = """
header
[23 ]
[ 43 ]
[4323]
[-$23.44 ]
[ 12.32 ]
footer
"""
 
data_list = data_raw.split('\n')
print data_list  # test
 
num_list = []
for x in data_list:
    s = ""
    for c in x:
        if c in '1234567890.-':
            s += c
    if s:
        num_list.append(s)
 
print num_list  # ['23', '43', '4323', '-23.44', '12.32']

... you should be able to figure that one out.
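The character filter keeps any digit, "." or "-" on a line, so a line such as "item 1 - item 2" would come out as "1-2". A hedged variation that checks each result with float() and skips anything that is not a number; extract_numbers is a hypothetical helper name and the extra data line is made up to show the failure case:

```python
def extract_numbers(text):
    """Hypothetical helper: at most one number per line, filtered by character."""
    numbers = []
    for line in text.split('\n'):
        s = "".join(c for c in line if c in '1234567890.-')
        if not s:
            continue
        try:
            numbers.append(float(s))
        except ValueError:
            # characters passed the filter but do not form a number, e.g. '1-2'
            pass
    return numbers

data_raw = """
header
[23 ]
[-$23.44 ]
item 1 - item 2
"""

result = extract_numbers(data_raw)
print(result)  # [23.0, -23.44]
```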

Thank you, I could understand that code! Have been bitten by the re bug a little too!
