This could be a cooking recipe, but in my case it is a chemical recipe.

Here is a typical generic chemical recipe:
23 g chemicalA is dissolved in 250 ml methanol. The solution is cooled
to 0 - 5 degC and 18 ml compoundB is added dropwise over 60 minutes.
The mixture is heated to reflux overnight, then cooled to 0 - 5 degC
for 3 hours. The precipitated product is collected, washed with 50 ml
cold methanol. Yield: 28 g compoundC as a light yellow solid after
drying in vacuum at 70 degC for 6 hours.

I have a list of chemicals I am searching for:
chemicals =

I want to extract amount and unit of measurement for each chemical
from the recipe to give a list of sublists [chemical, amount, unit].
Something like:
recipe1 = [, , ...]

Using Python, how do I go about this best?

Hi!

This works:

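# read the whole recipe into one string
text = open("recipe.txt", "r").read()

# split on whitespace and strip sentence punctuation from each word
words = [ i.strip(".?!") for i in text.split() ]

chemicals = ["chemicalA", "compoundB", "compoundC", "methanol"]

recipe = {}
for chemical in chemicals:
    # index() returns only the first occurrence of the chemical
    ind = words.index(chemical)
    # assumes the order: amount unit chemical
    recipe[chemical] = [words[ind-2], words[ind-1]]

print(recipe)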

DON'T USE IT!! :)

It works with your example text, but is not very flexible. It assumes that the chemical info is always in the order amount unit chemical.
I'm not sure if there is a clever and really flexible way to do what you want. My solution is neither clever nor ...
(There is also a problem with methanol, because it appears twice in the text. How would you handle this?)
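
One possible way around the repeated methanol, as a rough sketch only: walk the word list with enumerate() and keep every hit, instead of just the first one that index() finds. The helper name find_all and the shortened sample text are made up for illustration, and it still assumes the 'amount unit chemical' order.

```python
# Sketch only: collect every "amount unit chemical" hit, not just the first.
# find_all() is a made-up helper name; the sample text is a shortened recipe.
text = ("23 g chemicalA is dissolved in 250 ml methanol. "
        "The product is washed with 50 ml methanol.")

words = [w.strip(".,?!") for w in text.split()]
chemicals = ["chemicalA", "methanol"]

def find_all(words, chemicals):
    recipe = {}
    for ind, word in enumerate(words):
        # need two words in front of the chemical for amount and unit
        if word in chemicals and ind >= 2:
            recipe.setdefault(word, []).append([words[ind-2], words[ind-1]])
    return recipe

recipe = find_all(words, chemicals)
print(recipe)
```

Each chemical now maps to a list of [amount, unit] pairs. A modifier in front of the name (like 'cold methanol') would still trip it up, because the two words before the name are taken blindly.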

Regards, mawe

Thanks mawe, I am going to play with that. I guess one can assume the 'amount unit chemical' order fairly reliably. Might have to add a list of common units, and of course the amount has to start with a digit. That would ignore the second methanol. If I wanted to have the second amount of methanol, I would have to add it to the chemical list as 'cold methanol' or add a list of potential modifiers. Could be an interesting project.

You also might have trouble with solvents: how many mL should 'wash with water, then acetone' count as?

How crucial is completeness? Are you trying to extract keywords for reference, or are you trying to prepare an ingredients list for a lab assistant, or 'other'?

Jeff

Mostly a chemicals list to project purchases and budgets. There has to be some slack for spills etc.

I did something like that in VB.NET, using classes to work out atomic mass, moles of reactants, volume of gas and concentration (I'm a chemistry student).

Another possibility is to focus in on the numbers and then try to figure out which chemical the number goes with. So:

'Dissolve 5.00 g of salicylic acid in methanol (about 50 mL will suffice) in a 100-mL Erlenmeyer flask. Add 10 drops of either sulfuric or phosphoric acid as a catalyst and heat for 20 minutes.'

could lead to this kind of analysis:

Assume we have a class

class Item(object):

    def __init__(self):
        self.name = ""
        self.qty = 0.0
        self.unit = ""
        self.type = ""  # options: 'chemical', 'equipment'

* grab the 5.00, 50, 100, 10, and 20.
* since the 'g', 'mL', and '-mL' are in a known list of units, create some Item()'s and partially populate them. You could do some cute things here like mapping 'milliliters' to 'mL', etc. It might make sense to make 'drops' be a unit.
* since the '5.00 g' is followed with 'of X', assume X is a chemical name. Look it up -- sure nuff, it is. You've got a match. Fill in 'salicylic acid' in the Item.name and 'chemical' in the Item.type.
* working outwards from 50 mL, the first chemical name is 'methanol' -- put 50 mL of methanol as a high probability (on a 'tentative' list or something)
* the 100-mL Erlenmeyer is a known piece of equipment. Add it to the equipment list.
* 10 drops is followed by 'of', but then followed by a partial match 'sulfuric'. Put 'sulfuric acid' in the name and move item to the tentative list.
* '20 minutes' is a known unit of time; ignore.
* That leaves the chemical 'phosphoric acid.' If you feel like tackling booleans, the 'or' could associate it with the 10 drops.
* Move any complete Item()s from the tentative list to the final list.
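
The steps above could be sketched roughly like this. It is only an illustration under several assumptions: the lookup tables (UNIT_ALIASES, TIME_UNITS, CHEMICALS, EQUIPMENT) and the helpers nearest_chemical() and analyze() are invented for the demo, Item takes qty and unit in its constructor for brevity, the 'partial match' is simply the first word of a chemical name, and the 'or' logic for phosphoric acid is left out.

```python
import re

# Hypothetical lookup tables; a real project would need much longer lists.
UNIT_ALIASES = {"g": "g", "ml": "mL", "milliliters": "mL", "drops": "drops"}
TIME_UNITS = ["minutes", "hours"]
CHEMICALS = ["salicylic acid", "methanol", "sulfuric acid", "phosphoric acid"]
EQUIPMENT = ["Erlenmeyer flask"]

class Item(object):
    def __init__(self, qty, unit):
        self.name = ""
        self.qty = qty
        self.unit = unit
        self.type = ""  # options: 'chemical', 'equipment'

def nearest_chemical(text, pos):
    # work outwards from pos; the first word of a name counts as a partial match
    best = None
    for chem in CHEMICALS:
        for hit in re.finditer(re.escape(chem.split()[0]), text):
            dist = abs(hit.start() - pos)
            if best is None or dist < best[0]:
                best = (dist, chem)
    return best[1] if best else ""

def analyze(text):
    final, tentative = [], []
    # a number, then a space or hyphen, then a word that might be a unit
    for m in re.finditer(r"(\d+(?:\.\d+)?)[ -]([A-Za-z]+)", text):
        unit = m.group(2).lower()
        if unit in TIME_UNITS or unit not in UNIT_ALIASES:
            continue  # times and unknown words are ignored
        item = Item(float(m.group(1)), UNIT_ALIASES[unit])
        rest = text[m.end():]
        of_match = [c for c in CHEMICALS if rest.startswith(" of " + c)]
        gear = [e for e in EQUIPMENT if rest.startswith(" " + e)]
        if of_match:    # 'of X' with a known chemical: a sure match
            item.name, item.type = of_match[0], "chemical"
            final.append(item)
        elif gear:      # known piece of equipment right after the unit
            item.name, item.type = gear[0], "equipment"
            final.append(item)
        else:           # best guess only, goes on the tentative list
            item.name, item.type = nearest_chemical(text, m.start()), "chemical"
            tentative.append(item)
    return final, tentative

text = ("Dissolve 5.00 g of salicylic acid in methanol (about 50 mL will "
        "suffice) in a 100-mL Erlenmeyer flask. Add 10 drops of either "
        "sulfuric or phosphoric acid as a catalyst and heat for 20 minutes.")

final, tentative = analyze(text)
for label, items in [("final", final), ("tentative", tentative)]:
    for it in items:
        print("%-9s %6.2f %-5s %-16s %s" % (label, it.qty, it.unit, it.name, it.type))
```

On this sample the salicylic acid and the flask land on the final list, while the methanol and the sulfuric acid stay tentative until somebody (or some further rule) confirms them.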

Whew! It's hard teaching a computer to read like a human.

Jeff

I took mawe's code and Jeff's ideas and came up with this:

# analyze a chemical recipe to create a chemical data list

rcp = """23 g chemicalA is dissolved in 250 ml methanol. The solution is cooled
to 0 - 5 degC and 18 ml compoundB is added dropwise over 60 minutes.
The mixture is heated to reflux overnight, then cooled to 0 - 5 degC
for 3 hours. The precipitated product is collected, washed with 50 ml
cold methanol. Yield: 28 g compoundC as a light yellow solid after
drying in vacuum at 70 degC for 6 hours.
"""

chem_list = ['chemicalA', 'compoundB', 'compoundC', 'methanol']

unit_list = ['g', 'ml', 'kg', 'l']

# break recipe text down to list and remove punctuation marks
rcp_list = [ w.strip(' .,?') for w in rcp.split(None)]
print(rcp_list)

print('-'*60)

recipe1 = []
for ix, item in enumerate(rcp_list):
    try:
        # check for number
        if item.isdigit():
            # check for unit
            if rcp_list[ix+1].lower() in unit_list:
                # check for chemical name
                if rcp_list[ix+2] in chem_list:
                    recipe1.append([item, rcp_list[ix+1], rcp_list[ix+2]])
                # case of modifier (can add more)
                elif rcp_list[ix+3] in chem_list:
                    recipe1.append([item, rcp_list[ix+1], rcp_list[ix+3]])
    except IndexError:
        # ran past the end of the word list, nothing left to check
        pass

for data in recipe1:
    print(data)

"""
my output -->
['23', 'g', 'chemicalA']
['250', 'ml', 'methanol']
['18', 'ml', 'compoundB']
['50', 'ml', 'methanol']
['28', 'g', 'compoundC']
"""

Actually an interesting project. Tell the TA to stay close to the 'amount unit [modifier] chemical' structure and to keep the modifiers down to something manageable.
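
One thing to watch with the amount: str.isdigit() only accepts whole numbers, so an amount like '5.00' would be skipped. A small sketch of a more tolerant check follows; is_amount is just a name made up for this example, and it assumes amounts are written as plain decimals (no '1e3', no fractions like '1/2').

```python
import re

# digits, optionally followed by a decimal point and more digits
AMOUNT = re.compile(r"\d+(?:\.\d+)?\Z")

def is_amount(word):
    """Return True for words like '23' or '5.00', False otherwise."""
    return AMOUNT.match(word) is not None

for word in ["23", "5.00", "0.5", "-", "g", "5."]:
    print("%s %s" % (word, is_amount(word)))
```

In the loop, `if item.isdigit():` could then become `if is_amount(item):`.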

One potential fly in the ointment! If there is a space in the chemical name, you might have to use a dash, like sodium-chloride. You could let Python analyze the text for you and do that ahead of splitting it.

Okay, this should work for the 'space in chemical name' problem:

# analyze a chemical recipe to create a chemical data list

rcp = """23 g powdered chemicalA is dissolved in 250 ml methanol. The
solution is cooled to 0 - 5 degC and 18 ml compoundB is added dropwise
over 60 minutes.  The mixture is heated to reflux overnight, then cooled
to 0 - 5 degC for 3 hours. The precipitated product is collected, washed
with 50 ml ice cold methanol. Yield: 28 g compoundC as a light yellow 
solid after drying in vacuum at 70 degC for 6 - 12 hours.

For 'space in chemical' testing:
10 g sodium chloride
50 g potassium carbonate
"""

chem_list = [
'chemicalA', 
'compoundB', 
'compoundC', 
'methanol',
'sodium chloride',
'potassium carbonate']

unit_list = ['g', 'ml', 'kg', 'l']

# replace space in chemical name with a special filler
# for split() and change back after split
space = ' '
# pick a filler that is not usually in a chemical name
filler = '_'
for chem in chem_list:
    if chem in rcp:
        if space in chem:
            chem1 = chem.replace(space, filler)
            #print chem, chem1  # test
            rcp = rcp.replace(chem, chem1)

#print rcp  # test

print('-'*60)

# break recipe text down to list and remove punctuation marks
rcp_list = [ w.rstrip(' .,?') for w in rcp.split(None)]

print(rcp_list)  # test

print('-'*60)

# split() is done, now replace the special filler back to a space
# (a new list is built, so the list is not changed while looping over it)
rcp_list = [word.replace(filler, space) for word in rcp_list]

print(rcp_list)  # test

print('-'*60)

recipe1 = []
for ix, item in enumerate(rcp_list):
    try:
        # check for number
        if item.isdigit():
            # check for unit
            if rcp_list[ix+1].lower() in unit_list:
                # check for chemical name
                if rcp_list[ix+2] in chem_list:
                    recipe1.append([item, rcp_list[ix+1], rcp_list[ix+2]])
                # case of one modifier
                elif rcp_list[ix+3] in chem_list:
                    recipe1.append([item, rcp_list[ix+1], rcp_list[ix+3]])
                # case of two modifiers
                elif rcp_list[ix+4] in chem_list:
                    recipe1.append([item, rcp_list[ix+1], rcp_list[ix+4]])
    except IndexError:
        # ran past the end of the word list, nothing left to check
        pass

for data in recipe1:
    print(data)

"""
my output -->
['23', 'g', 'chemicalA']
['250', 'ml', 'methanol']
['18', 'ml', 'compoundB']
['50', 'ml', 'methanol']
['28', 'g', 'compoundC']
['10', 'g', 'sodium chloride']
['50', 'g', 'potassium carbonate']
"""

Dang, Python is fun!
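
For comparison, here is a rough sketch of another route: build one regular expression from chem_list and unit_list, so names with spaces match directly and no filler is needed. It also accepts decimal amounts. The shortened recipe text, the 2.5 kg line and the limit of two modifier words are assumptions for the demo, not something tested on real recipes.

```python
import re

rcp = """23 g powdered chemicalA is dissolved in 250 ml methanol. The product
is washed with 50 ml ice cold methanol.
10 g sodium chloride
2.5 kg potassium carbonate
"""

chem_list = ['chemicalA', 'methanol', 'sodium chloride', 'potassium carbonate']
unit_list = ['g', 'ml', 'kg', 'l']

# longest entries first, so a longer name wins over a shorter one it contains
chem_pat = '|'.join(re.escape(c) for c in sorted(chem_list, key=len, reverse=True))
unit_pat = '|'.join(re.escape(u) for u in sorted(unit_list, key=len, reverse=True))

# amount, unit, up to two modifier words, then a chemical name
pattern = re.compile(
    r'\b(\d+(?:\.\d+)?)\s+(%s)\s+(?:\w+\s+){0,2}?(%s)\b' % (unit_pat, chem_pat),
    re.IGNORECASE)

recipe1 = [list(m.groups()) for m in pattern.finditer(rcp)]
for data in recipe1:
    print(data)
```

Because \s+ also matches a line break, a chemical name that happens to wrap onto the next line is still a candidate, but only if the name in chem_list is written with the same internal spacing as in the text.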

When in Rome do what the Romanians do? I don't get it.

Somewhere in the Bushism collection, things uttered by the best president we ever had. I think it is supposed to be "When in Rome do what the Romans do!". I thought you Brits were hanging on to every word this wise man says!

BTW, thanks Ene for the great code! Now I simply have to process the data, reflect the number of students and can give our stockroom guy an idea what is coming up.
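
A minimal sketch of that last processing step, in case it helps someone: add up the amounts per chemical, multiply by the number of students and add some slack for spills. The class size of 20, the 10 % slack and the decision to leave the product compoundC off the shopping list are all assumptions made up for this example.

```python
# Sketch only: the class size and the slack factor are made-up numbers.
recipe1 = [
    ['23', 'g', 'chemicalA'],
    ['250', 'ml', 'methanol'],
    ['18', 'ml', 'compoundB'],
    ['50', 'ml', 'methanol'],
    ['28', 'g', 'compoundC'],
]

students = 20             # assumed class size
slack = 1.10              # assumed 10 % extra for spills
products = ['compoundC']  # assumed: the yield is made, not purchased

# total per (chemical, unit) for one run of the recipe
totals = {}
for amount, unit, chem in recipe1:
    if chem in products:
        continue
    key = (chem, unit)
    totals[key] = totals.get(key, 0.0) + float(amount)

# scale up for the whole class, including the slack
order = {}
for key, qty in totals.items():
    order[key] = round(qty * students * slack, 1)

for chem, unit in sorted(order):
    print("%-10s %8.1f %s" % (chem, order[(chem, unit)], unit))
```

Amounts are keyed by chemical and unit together, so a chemical listed once in g and once in kg would show up as two separate lines rather than being added incorrectly.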
