My brain is freezing up again. I have a string that looks something like this

1 apple--1 pear--1 peach--2 onion--2 carrot--3 <bee mince--3 <por chops--4 <oth salad:--potato--4 <oth bread:--garlic

Then I have a few dictionaries:

dic1 = {'1': 'fruit',  '2': 'vegetable', '3': 'meat', '4': 'other'}
dic2 = {'bee': 'beef', 'por': 'pork'}

What I'm trying to do is transform this into XML, sort of.
The output needs to look like this:

<fruit>apple</fruit> <fruit>pear</fruit> <fruit>peach</fruit> <vegetable>onion</vegetable> <vegetable>carrot</vegetable> <meat>beef<type>mince</type></meat> <meat>pork<type>chops</type></meat> <other>(salad)<type>potato</type></other> <other>(bread)<type>garlic</type></other>

fruit, vegetable and meat are easy, but I don't know what to do with other. Initially I thought I'd just use split() and then iterate over the result using keys(), something like this:

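items = line.split('--')
for item in items:
    # the first character of each item is the lookup key into dic1
    if item[0] in dic1.keys():
        # item[2:] skips the key and the space that follows it
        print '<%s>%s</%s>' % (dic1[item[0]], item[2:], dic1[item[0]])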

And this works for fruit and vegetables, and I can even get meat to behave by adding to it, but I don't know how to make other behave. I've even tried writing a recursive function and it almost works, but I'm not sure I'm on the right track:

def other(line):
    line = line.split('--', 1)
    if line[0][0] in dic1.keys():
        print '%s\n' % line[0]
    elif line[0][0] == '4':
        if line[0].endswith(':'):
            value = line[1].split('--', 1)
            print '%s, %s\n' % (line[0][:-1], value[0])
            print '%s\n' % line[0]

    if line[1]:

This almost gets me what I want but I end up with

salad, potato potato bread, garlic garlic

Any help welcome. Thanks

I suggest storing a format string for other, with %s in the place where the article name goes; then you only need % to fill the name into the template.
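A minimal sketch of that idea, assuming an item ending in ':' always labels the very next item. The function name render_other and the variable template are made up for illustration, and it only handles the other entries. It is written to run on both Python 2 and 3.

```python
def render_other(line):
    """Return the <other> elements found in a '--' separated line."""
    results = []
    template = None  # format string waiting for the next item's name
    for item in line.split('--'):
        if template is not None:
            # The previous item ended with ':', so this item is the name.
            results.append(template % item)
            template = None
        elif item.startswith('4') and item.endswith(':'):
            # '4 <oth salad:' -> label 'salad'; %s marks where the name goes.
            label = item.split(' ', 2)[2][:-1]
            # Escape any literal % in the label so it survives formatting.
            label = label.replace('%', '%%')
            template = '<other>(' + label + ')<type>%s</type></other>'
    return results

print(render_other('2 carrot--4 <oth salad:--potato--4 <oth bread:--garlic'))
```

The template is built once when the label is seen, and the name is filled in one step later, so no recursion is needed.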

Maybe this will help ...

data = """\
1 apple--1 pear--1 peach--2 onion--2 carrot--3 <bee mince--3 <por chops--4 <oth salad:--potato--4 <oth bread:--garlic

for item in data.split('--'):
    print item
    print item.split(" ", 2)

Fixed it. I can now turn this:

a1, a2, a3, a:, 4, a5

into this:


Doing something like this.

flag = None
items = line.split(', ')
for item in items:
    if item.endswith(':'):
        flag = item
        if condition a:
            do something
            flag = None
            do something else

I hope this makes sense.
Does anyone know if it's ok to do flag = None the way I have it?
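Resetting a flag to None once it has been used is a common Python idiom. Here is a small sketch of the pattern, assuming the label applies only to the single item after it. The function pair_labels and its tuple output are invented for illustration, since the real output isn't shown above.

```python
def pair_labels(line):
    """Pair each item ending in ':' with the item that follows it."""
    flag = None
    pairs = []
    for item in line.split(', '):
        if item.endswith(':'):
            flag = item[:-1]            # remember the label, colon removed
        else:
            pairs.append((flag, item))  # flag is None for ordinary items
            flag = None                 # the label applies to one item only
    return pairs

print(pair_labels('a1, a2, a3, a:, 4, a5'))
```

If you test the flag later, "if flag is not None:" is safer than "if flag:", because a bare ':' item leaves an empty string in the flag, and an empty string counts as false.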

Edited 4 Years Ago by 4evrmrepylrning

Ah yes, I simplified it a bit. So from my original example it now looks like this:

input = '1 apple--1 pear--1 peach--2 onion--2 carrot--3 <bee mince--3 <por chops--4 <oth salad:--4 potato--4 <oth bread:--4 garlic'

dic1 = {'1': 'fruit',  '2': 'vegetable', '3': 'meat', '4': 'other'}
dic2 = {'bee': 'beef', 'por': 'pork'}

startswith_tuple = ('1', '2', '3')

def des(field):
    flag = None
    item_list = field.split('--')
    for item in item_list:
        if item.endswith(':'):
            # '4 <oth salad:' -> remember 'salad' for the next item
            flag = item[7:-1]
        elif item.startswith('4'):
            if not item.startswith('4 <'):
                print '<%s>%s</%s>' % (flag, item[2:], flag)
                flag = None
        elif item.startswith(startswith_tuple):
            if '<' in item:
                # '3 <bee mince' -> item[3:6] is the dic2 key, item[7:] the name
                print '<%s>%s</%s>' % (dic2[item[3:6]], item[7:], dic2[item[3:6]])
            else:
                print '<%s>%s</%s>' % (dic1[item[0]], item[2:], dic1[item[0]])


This gives me:


Which is exactly what I need. (Now I'm hungry)

Edited 4 Years Ago by pyTony: fixed your code block

All this would be more serious if you could describe the exact syntax rules of the input string.

Edited 4 Years Ago by Gribouillis

Unfortunately I don't yet know the exact syntax rules. I have loads and loads of data to sift through, and for the most part it seems to make sense:

KEY value--KEY value--etc...


FLK ethnomusicology--FLK folk music--FLK popular music--LOC England--LOC Newcastle-upon-Tyne--


<subjfolk>folk music</subjfolk>
<subjfolk>popular music</subjfolk>

So the lookup dictionary (for these examples) looks like this:

subject = {'FLK': 'subjfolk', 'LOC': 'subjloc'}

but there are inconsistencies. This:

LFE <ret youth culture--LFE social class--LFE <zot after:--LFE World War II

Needs to output:

<subjfeat>(relationship to)youth culture</subjfeat>
<subjfeat>social class</subjfeat>
<subjfeat>(after)World War II</subjfeat>

At the moment I am fairly happy with what I have and all I can do now is run it over the data and see what happens. One day I'll have a job where they have specs and mappings for data!!
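For what it's worth, the rules as far as they can be read off the examples above could be sketched like this. The 'LFE' entry in subject and the qualifier dictionary are assumptions inferred from the expected output, and convert is a made-up name. The sketch also assumes every item has the form "KEY value" and that a value ending in ':' labels the next item. It runs on both Python 2 and 3.

```python
# 'LFE' and the qualifier mapping are guesses based on the expected output.
subject = {'FLK': 'subjfolk', 'LOC': 'subjloc', 'LFE': 'subjfeat'}
qualifier = {'ret': 'relationship to'}

def convert(line):
    """Turn 'KEY value--KEY value' into a list of XML-like strings."""
    out = []
    prefix = ''  # '(label)' carried over from an item ending in ':'
    for item in line.split('--'):
        if not item:
            continue  # a trailing '--' leaves an empty item
        key, _, value = item.partition(' ')
        if value.startswith('<'):
            # '<ret youth culture' -> code 'ret', value 'youth culture'
            code, _, value = value[1:].partition(' ')
            if value.endswith(':'):
                prefix = '(%s)' % value[:-1]  # label for the next item
                continue
            # Unknown codes fall back to the raw code itself.
            value = '(%s)%s' % (qualifier.get(code, code), value)
        tag = subject[key]
        out.append('<%s>%s%s</%s>' % (tag, prefix, value, tag))
        prefix = ''
    return out

for element in convert('LFE <ret youth culture--LFE social class--LFE <zot after:--LFE World War II'):
    print(element)
```

An unknown key still raises KeyError here, which is probably what you want while you are discovering the rules from the data.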

Edited 4 Years Ago by 4evrmrepylrning

I just came across this in the data:
LFE <ret youth culture--LFE social class--LFE <zot after:--LFE World War II---LFE information

I see that split() does not allow me to do this:

line.split(('---', '--'))

Any suggestions on how to do this? I'm thinking I'll have to build some sort of recursive function.

I stumbled onto something on the interwebs and after fiddling with it a bit I ended up with this:

def my_split(in_line):
    output = [in_line]
    separators = ['---', '--']
    for separator in separators:
        in_line = output
        output = []
        for item in in_line:
            output += item.split(separator)
    return output

So this:

print my_split('1--2--3--4---5---6--7---8')

['1', '2', '3', '4', '5', '6', '7', '8']

This works nicely

It looks like it works, though I would write it a little more concisely:

def my_split(*in_line):
    for separator in '---', '--':
        output = []
        for item in in_line:
            # split every piece on the current separator
            output.extend(item.split(separator))
        in_line = output
    return output

print my_split('1--2--3--4---5---6--7---8')
print my_split('1--2--3--4---5', '6--7---8')

Edited 4 Years Ago by pyTony

Using re.split() is also an option.

>>> import re
>>> s = '1--2--3--4---5---6--7---8'
>>> re.split(r'\-+', s)
['1', '2', '3', '4', '5', '6', '7', '8']

Edited 4 Years Ago by snippsat

nicely done

Changed it slightly but it works brilliantly

import re
s = '1--2--3--4---5---6--7---8--bla-bla'
re.split(r'\--+', s)
['1', '2', '3', '4', '5', '6', '7', '8', 'bla-bla']
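The same pattern can also be spelled -{2,}, meaning "two or more hyphens", which matches exactly what \--+ matches. A small sketch, assuming you also want to drop the empty item that a trailing '--' leaves behind; SEPARATOR and split_items are names made up for this example.

```python
import re

# Two or more hyphens in a row; a single hyphen inside a value is left alone.
SEPARATOR = re.compile(r'-{2,}')

def split_items(line):
    """Split on runs of 2+ hyphens and drop empty items."""
    return [item for item in SEPARATOR.split(line) if item]

print(split_items('1--2--3--4---5---6--7---8--bla-bla'))
print(split_items('LOC England--LOC Newcastle-upon-Tyne--'))
```

The second call shows why matching a single hyphen would be a problem: Newcastle-upon-Tyne has to stay in one piece.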