My brain is freezing up again. I have a string that looks something like this

1 apple--1 pear--1 peach--2 onion--2 carrot--3 <bee mince--3 <por chops--4 <oth salad:--potato--4 <oth bread:--garlic

Then I have a few dictionaries:

dic1 = {'1': 'fruit',  '2': 'vegetable', '3': 'meat', '4': 'other'}
dic2 = {'bee': 'beef', 'por': 'pork'}

What I'm trying to do is transform this into XML, sort of.
The output needs to look like this:

<fruit>apple</fruit> <fruit>pear</fruit> <fruit>peach</fruit> <vegetable>onion</vegetable> <vegetable>carrot</vegetable> <meat>beef<type>mince</type></meat> <meat>pork<type>chops</type></meat> <other>(salad)<type>potato</type></other> <other>(bread)<type>garlic</type></other>

fruit, vegetable and meat are easy, but I don't know what to do with other. Initially I thought I'd just use split() and then iterate over it using keys(), something like this:

items = line.split('--')
for item in items:
    # item[0] is the category digit, item[2:] is the name after the space
    if item[0] in dic1.keys():
        print '<%s>%s</%s>' % (dic1[item[0]], item[2:], dic1[item[0]])

And this works for fruit and vegetables, and I can even get meat to behave by adding to that, but I don't know how I can make other behave. I've even tried writing a recursive function and it almost works, but I'm not sure I'm on the right track:

def other(line):
    line = line.split('--', 1)
    if line[0][0] in dic1.keys():
        print '%s\n' % line[0]
    elif line[0][0] == '4':
        if line[0].endswith(':'):
            value = line[1].split('--', 1)
            print '%s, %s\n' % (line[0][:-1], value[0])
        else:
            print '%s\n' % line[0]

    if line[1]:
        other(line[1])

This almost gets me what I want but I end up with

salad, potato potato bread, garlic garlic

Any help welcome. Thanks

All 14 Replies

I suggest storing a format string for other, with %s in the place where the item name goes. Then you only need % to fill the name into the template.
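
A rough sketch of how that template idea might look, assuming an item ending in ':' always has its value in the very next piece. The function name convert and the variable template are made up for illustration; dic1 and dic2 are the dictionaries from the question. print() is called with a single argument so the snippet runs on Python 2 and 3.

```python
dic1 = {'1': 'fruit', '2': 'vegetable', '3': 'meat', '4': 'other'}
dic2 = {'bee': 'beef', 'por': 'pork'}

def convert(line):
    """Return a list of XML-like fragments for one '--'-separated line."""
    out = []
    template = None  # pending format string for an "other" item
    for item in line.split('--'):
        if template is not None:
            # The previous item ended with ':', so this piece is its value.
            out.append(template % item)
            template = None
            continue
        key, rest = item.split(' ', 1)
        tag = dic1[key]
        if rest.startswith('<'):
            code, name = rest[1:].split(' ', 1)
            if name.endswith(':'):
                # e.g. '<oth salad:' -> keep a template, fill it in next round
                template = '<%s>(%s)<type>%%s</type></%s>' % (tag, name[:-1], tag)
            else:
                # e.g. '<bee mince' -> expand the code through dic2
                out.append('<%s>%s<type>%s</type></%s>' % (tag, dic2[code], name, tag))
        else:
            out.append('<%s>%s</%s>' % (tag, rest, tag))
    return out

line = ('1 apple--1 pear--1 peach--2 onion--2 carrot--3 <bee mince'
        '--3 <por chops--4 <oth salad:--potato--4 <oth bread:--garlic')
print(' '.join(convert(line)))
```

With the sample string from the question this should reproduce the requested output line. It is only a sketch: a name containing a literal % would break the template step.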

Maybe this will help ...

data = """\
1 apple--1 pear--1 peach--2 onion--2 carrot--3 <bee mince--3 <por chops--4 <oth salad:--potato--4 <oth bread:--garlic
"""

for item in data.split('--'):
    print item
    print item.split(" ", 2)

Fixed it. I can now turn this:

a1, a2, a3, a:, 4, a5

into this:

a1
a2
a3
a4
a5

Doing something like this.

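line = 'a1, a2, a3, a:, 4, a5'
flag = None
items = line.split(', ')  # avoid shadowing the built-in name "list"
for item in items:
    if item.endswith(':'):
        # remember the prefix and wait for the next item
        flag = item[:-1]
    elif flag is not None:
        # glue the remembered prefix onto this item, then reset
        print flag + item
        flag = None
    else:
        print item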

I hope this makes sense.
Does anyone know if it's ok to do flag = None the way I have it?

I see no connection to your original post.

flag = None should be okay.
The Python object None simply represents an empty value.

Ah yes, I simplified it a bit. So from my original example it now looks like this:

input = '1 apple--1 pear--1 peach--2 onion--2 carrot--3 <bee mince--3 <por chops--4 <oth salad:--4 potato--4 <oth bread:--4 garlic'

dic1 = {'1': 'fruit',  '2': 'vegetable', '3': 'meat', '4': 'other'}
dic2 = {'bee': 'beef', 'por': 'pork'}

startswith_tuple = ('1', '2', '3')

def des(field):
    flag = None
    item_list = field.split('--')
    for item in item_list:
        if item.endswith(':'):
            flag = item[7:-1]
        else:
            if item.startswith('4'):
                if not item.startswith('4 <'):
                    print '<%s>%s</%s>' % (flag, item[2:], flag)
                    flag = None
            elif item.startswith(startswith_tuple):
                if '<' in item:
                    print '<%s>%s</%s>' % (dic2[item[3:6]], item[7:], dic2[item[3:6]])
                else:
                    print '<%s>%s</%s>' % (dic1[item[0]], item[2:], dic1[item[0]])


des(input)

This gives me:

<fruit>apple</fruit>
<fruit>pear</fruit>
<fruit>peach</fruit>
<vegetable>onion</vegetable>
<vegetable>carrot</vegetable>
<beef>mince</beef>
<pork>chops</pork>
<salad>potato</salad>
<bread>garlic</bread>

Which is exactly what I need. (Now I'm hungry)

All of this could be handled more rigorously if you could describe the exact syntax rules of the input string.

Unfortunately I don't yet know the exact syntax rules. I have loads and loads of data to sift through, and for the most part it seems to make sense:

KEY value--KEY value--etc...

so

FLK ethnomusicology--FLK folk music--FLK popular music--LOC England--LOC Newcastle-upon-Tyne--

outputs:

<subjfolk>ethnomusicology</subjfolk>
<subjfolk>folk music</subjfolk>
<subjfolk>popular music</subjfolk>
<subjloc>England</subjloc>
<subjloc>Newcastle-upon-Tyne</subjloc>

So the lookup dictionary(for these examples) looks like this:

subject = {'FLK': 'subjfolk', 'LOC': 'subjloc'}
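
For the regular KEY value--KEY value case, the lookup could be sketched roughly like this. The function name to_tags is made up for illustration, and it assumes the key never contains a space. print() is called with a single argument so it runs on Python 2 and 3.

```python
subject = {'FLK': 'subjfolk', 'LOC': 'subjloc'}

def to_tags(line, lookup):
    """Turn 'KEY value--KEY value--' into a list of <tag>value</tag> strings."""
    tags = []
    for item in line.split('--'):
        if not item:
            continue  # skip the empty piece left by a trailing '--'
        key, value = item.split(' ', 1)  # split on the first space only
        tag = lookup[key]
        tags.append('<%s>%s</%s>' % (tag, value, tag))
    return tags

line = ('FLK ethnomusicology--FLK folk music--FLK popular music'
        '--LOC England--LOC Newcastle-upon-Tyne--')
for tag in to_tags(line, subject):
    print(tag)
```

Single hyphens such as the ones in Newcastle-upon-Tyne are left alone because the split is on the two-character separator only. An unknown key raises KeyError, which may be useful for spotting surprises in the data.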

but there are inconsistencies. This:

LFE <ret youth culture--LFE social class--LFE <zot after:--LFE World War II

Needs to output:

<subjfeat>(relationship to)youth culture</subjfeat>
<subjfeat>social class</subjfeat>
<subjfeat>(after)World War II</subjfeat>

At the moment I am fairly happy with what I have and all I can do now is run it over the data and see what happens. One day I'll have a job where they have specs and mappings for data!!
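
For reference, the LFE example above could be handled along these lines. This is only a sketch under guessed rules: the qualifiers dictionary and the function name to_qualified_tags are hypothetical, the expansion 'ret' -> 'relationship to' is inferred from the one example, and an item ending in ':' is assumed to qualify the item that follows it.

```python
subject = {'LFE': 'subjfeat'}            # tag name taken from the example output
qualifiers = {'ret': 'relationship to'}  # assumed expansion, inferred from the example

def to_qualified_tags(line):
    tags = []
    pending = None  # qualifier text waiting for the next item
    for item in line.split('--'):
        if not item:
            continue
        key, value = item.split(' ', 1)
        if value.startswith('<'):
            code, text = value[1:].split(' ', 1)
            if text.endswith(':'):
                # e.g. '<zot after:' -> qualifier applies to the next item
                pending = text[:-1]
                continue
            # e.g. '<ret youth culture' -> expand the code from the lookup
            value = '(%s)%s' % (qualifiers[code], text)
        elif pending is not None:
            value = '(%s)%s' % (pending, value)
            pending = None
        tag = subject[key]
        tags.append('<%s>%s</%s>' % (tag, value, tag))
    return tags

line = 'LFE <ret youth culture--LFE social class--LFE <zot after:--LFE World War II'
for tag in to_qualified_tags(line):
    print(tag)
```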

Just came across this in the data:
LFE <ret youth culture--LFE social class--LFE <zot after:--LFE World War II---LFE information

I see split does not allow me to do this:

line.split(('---', '--'))

Any suggestions on how to do this? I'm thinking I'll have to build some sort of recursive function.

Split on -- and strip the - from the ends of each piece of data.
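
A minimal sketch of that suggestion (split_items is a made-up name). It assumes no value legitimately starts or ends with a hyphen, since strip('-') would remove that too.

```python
def split_items(line):
    """Split on '--', then strip leftover '-' from the ends of each piece."""
    pieces = [piece.strip('-') for piece in line.split('--')]
    return [piece for piece in pieces if piece]  # drop empty pieces

line = ('LFE <ret youth culture--LFE social class--LFE <zot after:'
        '--LFE World War II---LFE information')
print(split_items(line))
```

The '---' is split at its first two hyphens, which leaves '-LFE information' as a piece; the strip then removes the leading hyphen.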

I stumbled onto something on the interwebs and after fiddling with it a bit I ended up with this:

def my_split(in_line):
    output = [in_line]
    separators = ['---', '--']
    for separator in separators:
        in_line = output
        output = []
        for item in in_line:
            output += item.split(separator)
    return output

So this:

print my_split('1--2--3--4---5---6--7---8')

outputs:
['1', '2', '3', '4', '5', '6', '7', '8']

This works nicely

It looks like it works, though I would write it a little more concisely:

def my_split(*in_line):
    for separator in '---', '--':
        output = []
        for item in in_line:
            output.extend(item.split(separator))
        in_line = output
    return output

print my_split('1--2--3--4---5---6--7---8')
print my_split('1--2--3--4---5', '6--7---8')

Using re.split() is also an option.

>>> import re
>>> s = '1--2--3--4---5---6--7---8'
>>> re.split(r'\-+', s)
['1', '2', '3', '4', '5', '6', '7', '8']

Changed it slightly but it works brilliantly

import re
s = '1--2--3--4---5---6--7---8--bla-bla'
re.split(r'\--+', s)
['1', '2', '3', '4', '5', '6', '7', '8', 'bla-bla']
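
The same pattern can also be written as r'-{2,}' (two or more hyphens), which should match exactly what r'\--+' matches. A small sketch that additionally drops the empty piece left by a trailing separator, as in the LOC example earlier in the thread; the names SEPARATOR and split_on_dashes are made up for illustration.

```python
import re

# Two or more hyphens in a row; a single hyphen inside a value is left alone.
SEPARATOR = re.compile(r'-{2,}')

def split_on_dashes(line):
    """Split on runs of 2+ hyphens and discard empty pieces."""
    return [piece for piece in SEPARATOR.split(line) if piece]

print(split_on_dashes('1--2--3--4---5---6--7---8--bla-bla'))
print(split_on_dashes('LOC England--LOC Newcastle-upon-Tyne--'))
```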