Hi,

I'm trying to parse a file which contains some random text.
How can I match the "garbage" characters (anything but a digit or letter) that separate the valid tokens?
For example: 25.5.5 will produce . (the 2nd dot, because 25.5 is a rational number)

Another example:
----3.82 will produce --- (because -3.82 is a rational number)

I've tried
r'[+-](?!\d)|(?<!\d)[.][^\+\-.\d\w\s]+'

but it won't fit the cases I mentioned above.
Thanks,
Assaf.

All 4 Replies

Try this:

pattern = (?P<data>[+\-]?(?:(?:\d+\.\d+)|\w+))|(?P<garbage>.)

for m in re.finditer(pattern, garbled_text):
    print m.group('data'), m.group(garbage)    # m.groupdict() will also work

@nbaztec:

Your code did not quite run or produce the expected result. After some debugging, this is what worked for me:

import re

garbled_text= '----3.82'
pattern = r'(?P<data>[+\-]?(?:(?:\d+\.\d+)|\w+))|(?P<garbage>.)'
g = ''

for m in re.finditer(pattern, garbled_text):
    if m.group('data'):
        print(m.group('data'), repr(g))    # m.groupdict() will also work
        g = ''
    else:
        g += m.group('garbage')

Output:

-3.82 '---'

Hmm, I still have a problem.
I tried that pattern but it didn't catch a lot of stuff.
What I need is:
given

   AAAA15.2.2.2.2.2.AAAjraw AJR53 ++--15.041%58#*&%# &.#
    &.*.#

the output of the junk collector should be:

    . . . ++- % #*&%# &.#
    &.*.#

This pattern almost solves everything except the dots inside 15.2.2.2.2, I think:

patternGarbage=r'[+-](?!\d)|(?<!\d)[.]|[^\+\-.\d\w\s]+'

I could really use a bit more help, thanks :)
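
To see where this pattern falls short, here is a small sketch that runs it over a few inputs from this thread. The helper name find_junk is made up for illustration; the pattern itself is copied unchanged from the post above. The lookbehind (?<!\d) rejects any dot that directly follows a digit, so a dot between two numbers can never be reported as junk.

```python
import re

# The junk pattern from the post above, unchanged.
pattern_garbage = r'[+-](?!\d)|(?<!\d)[.]|[^\+\-.\d\w\s]+'

def find_junk(text):
    """Return every fragment of text that the junk pattern matches (hypothetical helper)."""
    return re.findall(pattern_garbage, text)

print(find_junk('----3.82'))   # ['-', '-', '-']  signs not followed by a digit
print(find_junk('&.#'))        # ['&', '.', '#']  this dot does not follow a digit
print(find_junk('15.2.2.2'))   # []               every dot follows a digit, so none match
```

The last call shows the gap: deciding whether a dot is junk requires knowing whether the preceding number has already used up its decimal point, which a single-character lookbehind cannot express. Matching the valid tokens first and treating everything else as junk avoids that problem.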

@pyTony
The code is fine, albeit I missed the quotes on garbage here while posting.

import re
pat = r'(?P<text>[+\-]?(?:(?:\d+\.\d+)|\w+))|(?P<garbage>.)'
for m in re.finditer(pat, "+25.5.5 ---3.82sscs+35.25.2.2"):
    print(m.group('text'), m.group('garbage'))  # m.groupdict() will also work

Gives:

+25.5 None
None .
5 None
None
None -
None -
-3.82 None
sscs None
+35.25 None
None .
2.2 None

Which is the output I intended (segregating data from garbage). The OP can do with these values as he pleases.
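
As one possible way of working with those values, the sketch below merges neighbouring garbage characters into runs, which is closer to the '---' the OP asked for. It relies on m.lastgroup, which gives the name of the named group that matched; the function name segregate is made up for this example and the pattern is the one from the snippet above.

```python
import re
from itertools import groupby

# Same pattern as in the snippet above.
pat = r'(?P<text>[+\-]?(?:(?:\d+\.\d+)|\w+))|(?P<garbage>.)'

def segregate(s):
    """Return (kind, value) pairs, joining adjacent garbage characters into one run."""
    result = []
    # m.lastgroup is 'text' or 'garbage', depending on which alternative matched.
    for kind, matches in groupby(re.finditer(pat, s), key=lambda m: m.lastgroup):
        values = [m.group() for m in matches]
        if kind == 'garbage':
            result.append((kind, ''.join(values)))      # merge the run
        else:
            result.extend((kind, v) for v in values)    # keep data tokens separate
    return result

print(segregate('----3.82'))   # [('garbage', '---'), ('text', '-3.82')]
```

Note that whitespace still counts as garbage here, so a space next to other junk ends up in the same run.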

@Despairy: The pattern is doing fine, except that it is not matching the first digit, because I used \w (which also matches [0-9]) in my regex. Substitute [a-zA-Z] for it. I also forgot to make the decimal part optional, and to turn on the "dot matches newline" flag (re.S). Here is my sample code, custom-fitted to your needs:

import re

# The OP's input from above, without the leading indentation
sample = "AAAA15.2.2.2.2.2.AAAjraw AJR53 ++--15.041%58#*&%# &.#\n&.*.#"
pat = r'(?P<text>[+\-]?(?:(?:\d+(?:\.\d+)?)|[A-Za-z]+))|(?P<garbage>.)'   # \w -> [A-Za-z]
data = []
garbage = []
for m in re.finditer(pat, sample, re.S):     # re.S: '.' also matches newlines
    print(m.group('text'), m.group('garbage'))   # m.groupdict(); debug output to see the segregation
    if m.group('text') is not None:          # Do /your/ processing here
        data.append(m.group('text'))
    if m.group('garbage') is not None:
        garbage.append(m.group('garbage'))
print("Data %s = %s" % (data, ''.join(data)))
print("Junk %s = %s" % (garbage, ''.join(garbage)))

Output:

AAAA None
15.2 None
None .
2.2 None
None .
2.2 None
None .
AAAjraw None
None  
AJR None
53 None
None  
None +
None +
None -
-15.041 None
None %
58 None
None #
None *
...
...
Data ['AAAA', '15.2', '2.2', '2.2', 'AAAjraw', 'AJR', '53', '-15.041', '58'] = AAAA15.22.22.2AAAjrawAJR53-15.04158
Junk ['.', '.', '.', ' ', ' ', '+', '+', '-', '%', '#', '*', '&', '%', '#', ' ', '&', '.', '#', '\n', '&', '.', '*', '.', '#'] = ...  ++-%#*&%# &.#
&.*.#
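
If the goal is the exact layout the OP listed (junk runs separated by single spaces), a different route is to blank out the data tokens with re.sub rather than collecting matches. This is only a sketch under that assumption: the names DATA and junk_only are invented here, and DATA is just the "text" alternative of the pattern above.

```python
import re

# The "text" alternative of the pattern above: an optionally signed number, or a run of letters.
DATA = r'[+\-]?(?:\d+(?:\.\d+)?|[A-Za-z]+)'

def junk_only(s):
    """Replace every data token with a space, then tidy up the spaces left behind."""
    junk = re.sub(DATA, ' ', s)                    # each number or word becomes one space
    return re.sub(r'[ \t]+', ' ', junk).strip()    # collapse runs of spaces and tabs

sample = "AAAA15.2.2.2.2.2.AAAjraw AJR53 ++--15.041%58#*&%# &.#\n&.*.#"
print(junk_only(sample))
```

For the sample this prints ". . . ++- % #*&%# &.#" on the first line and "&.*.#" on the second, and junk_only('----3.82') gives '---'. Newlines are kept as they are, so a line that starts with a data token would begin with a single space.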