Multiple Word Replace in Text (Python)

18 Years Ago vegaseat 0 20K Views

This snippet shows how to have fun replacing multiple words in a text. The target words and the replacement words form key:value pairs in a dictionary. The search and replacement are done using Python's regular expression module re. The code also gives an example of a function within a function.

python

# replace words in a text that match key_strings in a dictionary with the given value_string
# Python's regular expression module  re  is used here
# tested with Python24       vegaseat      07oct2005

import re

def multiwordReplace(text, wordDic):
    """
    take a text and replace words that match a key in a dictionary with
    the associated value, return the changed text
    """
    rc = re.compile('|'.join(map(re.escape, wordDic)))
    def translate(match):
        return wordDic[match.group(0)]
    return rc.sub(translate, text)

str1 = \
"""When we see a Space Shuttle sitting on its launch pad, there are two big booster rockets
attached to the sides of the main fuel tank. These are solid rocket boosters, made by Thiokol
at their factory in Utah. The engineers who designed the solid rocket boosters might have preferred
to make them a bit fatter, but they had to be shipped by train from the factory to the launch site.
The railroad line from the factory runs through a tunnel in the mountains.  The boosters had to fit
through that tunnel.  The tunnel is slightly wider than the railroad track.  The width of the railroad
track came from the width of horse-drawn wagons in England, which were as wide as two horses' behinds.
So, a major design feature of what is the world's most advanced transportation system was determined
over two thousand years ago by the width of a horse's ###!
"""

# the dictionary has target_word : replacement_word pairs
wordDic = {
'booster': 'rooster',
'rocket': 'pocket',
'solid': 'salted',
'tunnel': 'funnel',
'ship': 'slip'}

# call the function and get the changed text
str2 = multiwordReplace(str1, wordDic)

print(str2)

vegaseat 1,735 DaniWeb's Hypocrite

17 Years Ago

# simpler and faster without re ...
# contributed by Python Fan 'bvdet'
 
def multipleReplace(text, wordDict):
    """
    take a text and replace words that match the key in a dictionary
    with the associated value, return the changed text
    """
    for key in wordDict:
        text = text.replace(key, wordDict[key])
    return text

CS guy 0 Newbie Poster

12 Years Ago

@vegaseat: You are very wrong. Well, I haven't timed it, but theoretically you should be, because with multiple separate replaces you're running through the whole text string for every key in the dictionary, whereas the OP's regex method traverses 'text' just once. For short texts this may not make much difference, but for longer texts it will. Also, if one of the to-be-replaced strings (keys) is a superset of another, you usually want the longer string to be replaced first (it's more specific; e.g. 'theater' before 'the'). Well, the regex method, because regexes try to match the maximum length, will do this naturally. By simply traversing a dict, you take your chances as to which key gets replaced first (e.g. 'the' could be replaced, even in the middle of 'theater'). Thirdly (though least important), when searching key by key, each search is totally separate; in the regex version, if there's any overlap, the regex compiler will use that to find a slightly more efficient way to search through all keys /at once/, at any part of the text.

edit: oh, you're the same guy. Well, have you timed it? I claim that if you have a sufficiently long 'text' the re method will be faster (not to mention correctness in the case of overlap among your keys).

Edited 12 Years Ago by CS guy because: noticed OP and vegaseat were the same guy.

Gribouillis 1,391 Programming Explorer

12 Years Ago

Actually, on this text, the second method is 6 times faster:

from timeit import Timer
NTIMES = 100000
testlist = [multiwordReplace, multipleReplace]
for func in testlist:
    print("{n}(): {u:.2f} usecs".format(n=func.__name__, u=Timer(
        "{n}(str1, wordDic)".format(n=func.__name__),
        "from __main__ import {f}, str1, wordDic".format(f=",".join(x.__name__ for x in testlist))
        ).timeit(number=NTIMES) * 1.e6/NTIMES))

""" my output --->

multiwordReplace(): 64.09 usecs
multipleReplace(): 10.56 usecs

"""

It's more important to notice that the functions replace substrings and not whole words (for example, 'solidity' would be replaced by 'saltedity'), and also that they may give different results because of the multiple passes of the second function.
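
To illustrate the last point, here is a minimal sketch with a made-up dictionary in which one replacement value is itself a key. The two function bodies follow the snippets above; the names single_pass and multi_pass and the sample data are only for illustration, and the multi-pass result assumes the dictionary is traversed in insertion order (Python 3.7+).

```python
import re

def single_pass(text, word_dic):
    # regex approach: every match is looked up once, results are never rescanned
    rc = re.compile('|'.join(map(re.escape, word_dic)))
    return rc.sub(lambda match: word_dic[match.group(0)], text)

def multi_pass(text, word_dic):
    # str.replace approach: each pass also sees the output of earlier passes
    for key in word_dic:
        text = text.replace(key, word_dic[key])
    return text

swap = {'cat': 'dog', 'dog': 'bird'}   # hypothetical sample data
sentence = 'the cat chases the dog'

print(single_pass(sentence, swap))  # the dog chases the bird
print(multi_pass(sentence, swap))   # the bird chases the bird
```

In the multi-pass version the 'dog' produced by the first pass is replaced again by the second pass, which is why the two results differ.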

vegaseat commented: thanks +15

Skrell 0 Light Poster

12 Years Ago

I have spent the last hour trying to figure out this section of code from above.

    def translate(match):
        return wordDic[match.group(0)]

Can anyone PLEASE explain what this is doing and how it works?

TrustyTony 888 pyMod

12 Years Ago

There should be a lot of tutorials out there, for example http://www.tutorialspoint.com/python/python_reg_expressions.htm.
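
In short, when the second argument of sub() is a function, re calls it once per match with a match object and inserts whatever it returns; match.group(0) is the text that matched, which here is always one of the dictionary keys. A small sketch that records each call (the seen list and the shortened dictionary are only for illustration):

```python
import re

wordDic = {'rocket': 'pocket', 'tunnel': 'funnel'}
rc = re.compile('|'.join(map(re.escape, wordDic)))

seen = []  # illustration only: records what each call receives

def translate(match):
    # match.group(0) is the exact piece of text that matched the pattern
    seen.append(match.group(0))
    # look that piece up in the dictionary and return its replacement
    return wordDic[match.group(0)]

result = rc.sub(translate, 'a rocket in a tunnel')
print(seen)    # ['rocket', 'tunnel']
print(result)  # a pocket in a funnel
```

In the original snippet translate is defined inside multiwordReplace so that it can see the wordDic argument of the enclosing function.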

wallars 0 Newbie Poster

12 Years Ago

def wordReplace(sentList, wordDict):
    find = lambda searchList, elem: [[i for i, x in enumerate(searchList) if x == e] for e in elem]
    wordList = list(wordDict)
    wordInd = find(sentList,wordList)
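    # pair each key with its own list of positions, so that every
    # position gets the replacement for the key that was found there
    for word, indices in zip(wordList, wordInd):
        for j in indices:
            sentList[j] = wordDict[word]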
    return "".join(sentList)

This is probably super inefficient, but maybe one of y'all will be able to make it better. :D

Example:

>>> wordReplace(['Alex', ' ','is',' ','cool'],{'is':'was'})
'Alex was cool'

Varunkrishna 0 Junior Poster in Training

11 Years Ago

I wanted to get the input from the user and then use this to translate the words from English to German.
For example, if a user types "a", the German equivalent is "ein"; similarly, if the user types "an", the German equivalent is "eine". So if the input is "a an", the desired output is "ein eine", but what I am getting is "ein einn". Can anyone please help me out here? Thank you in advance.

Lucaci Andrew 140 Za s|n

11 Years Ago

To get a better answer to your problem, post your question in a new thread.

paddy3118 11 Light Poster

11 Years Ago

Both the original and vegaseat's code need to replace longer words first to avoid the problem where a shorter word is part of a longer one.
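
A minimal sketch of that fix for the regex version: sort the keys by decreasing length before joining them into the pattern. The function name multiword_replace_longest_first is made up here, the dictionary is the "a"/"an" example from the post above, and the unsorted result assumes the dictionary keeps insertion order (Python 3.7+).

```python
import re

def multiword_replace_longest_first(text, word_dic):
    # put longer keys first so 'an' is tried before 'a'
    keys = sorted(word_dic, key=len, reverse=True)
    rc = re.compile('|'.join(map(re.escape, keys)))
    return rc.sub(lambda match: word_dic[match.group(0)], text)

german = {'a': 'ein', 'an': 'eine'}

# unsorted pattern 'a|an': the 'a' alternative wins inside 'an'
unsorted_rc = re.compile('|'.join(map(re.escape, german)))
unsorted_result = unsorted_rc.sub(lambda m: german[m.group(0)], 'a an')
print(unsorted_result)  # ein einn

sorted_result = multiword_replace_longest_first('a an', german)
print(sorted_result)    # ein eine
```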

TrustyTony 888 pyMod

11 Years Ago

Good thinking, paddy3118, but a little incomplete. How about the case of a shorter replacement word containing another word? We should use an extraction regexp, adding '\W' (non-word character) at the beginning and end of each word.

paddy3118 11 Light Poster

11 Years Ago

Hi pyTony, sorting on just the length should suffice, as long as you arrange to replace longer matches before shorter matches. The regexp alternation operator '|' tries the LHS before the RHS, so listing all words in order of decreasing length should do it.

If there are sequences of word characters that are not whole words, then you are right: you need to replace on word boundaries (which is also missing from the original code).
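
One way to sketch the word-boundary idea is to wrap the alternation in \b anchors, which match between a word character and a non-word character. This is only an illustration (the function name replace_whole_words is made up), and it assumes every key starts and ends with a word character.

```python
import re

def replace_whole_words(text, word_dic):
    # longest keys first, then require a word boundary on both sides
    keys = sorted(word_dic, key=len, reverse=True)
    pattern = r'\b(?:' + '|'.join(map(re.escape, keys)) + r')\b'
    return re.sub(pattern, lambda match: word_dic[match.group(0)], text)

demo = replace_whole_words('solid solidity, booster boosters',
                           {'solid': 'salted', 'booster': 'rooster'})
print(demo)  # salted solidity, rooster boosters
```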

P.S. I had to do this kind of thing when replacing all signal names by an HTML link to further data. This was in a Verilog source file that could have hundreds of signals.

ZZucker 342 Practically a Master Poster

11 Years Ago

A slight modification of vegaseat's original code to take care of whole words:

''' re_sub_whole words.py
replace whole words in a text using Python module re
and a dictionary of key_word:replace_with pairs

tested with Python273  ZZ
'''

import re

def word_replace(text, replace_dict):
    '''
    Replace words in a text that match a key in replace_dict
    with the associated value, return the modified text.
    Only whole words are replaced.
    Note that replacement is case sensitive, but attached
    quotes and punctuation marks are neutral.
    '''
    rc = re.compile(r"[A-Za-z_]\w*")
    def translate(match):
        word = match.group(0)
        return replace_dict.get(word, word)
    return rc.sub(translate, text)

old_text = """\
In this text 'debug' will be changed but not 'debugger'.
Similarly red will be replaced but not reduce.
How about Fred, -red- and red?
Red, white and blue shipped a ship!"""

# create a dictionary of key_word:replace_with pairs
replace_dict = {
"red" : "redish",
"debug" : "fix",
'ship': 'boat'
}

new_text = word_replace(old_text, replace_dict)

print(old_text)
print('-'*60)
print(new_text)

''' my output -->
In this text 'debug' will be changed but not 'debugger'.
Similarly red will be replaced but not reduce.
How about Fred, -red- and red?
Red, white and blue shipped a ship!
------------------------------------------------------------
In this text 'fix' will be changed but not 'debugger'.
Similarly redish will be replaced but not reduce.
How about Fred, -redish- and redish?
Red, white and blue shipped a boat!
'''

Tim1234 0 Newbie Poster

9 Years Ago

I use the same commands daily, only the Date, etc. changes. I have a master template with the commands. I use the Python program to modify it and poop out a ready to use text document called Worksheet.txt . Then I just copy and paste my commands and text.

First Create a text File called MyMasterTemplate.txt with these five lines:
grep -i alert elfYYYYMMDD.fil (Command to find the word alert in the Error Log File (elf) with Date - YYYYMMDD)
EMAILS TO SEND
Alerts for MM/DD/YYYY (Email Subject Line)
There was an alert on MM/DD/YYYY. (Email Body)

Here is the program:

TextMemory = open("MyMasterTemplate.txt").read() #read template from Disk into Memory

TextMemory = TextMemory.replace('YYYYMMDD', '20141225') #replace YYYYMMDD with date 20141225
TextMemory = TextMemory.replace('MM/DD/YYYY', '12/25/2014')

f2 = open("Worksheet.txt", "w") #Open a file to write to
f2.write(TextMemory)

f2.close()

Edited 9 Years Ago by Gribouillis because: fixed code display

Gribouillis 1,391 Programming Explorer

9 Years Ago

@tim1234 This is a good way to do it. For more sophisticated work in the same direction, I would suggest trying a templating module such as mako.
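
mako is a third-party package and is not shown here. As a smaller step in the same direction, the standard library's string.Template fills named $placeholders. The sketch below assumes the template lines are rewritten with hypothetical placeholder names $ymd and $mdy, and it derives both from one date so they cannot get out of sync.

```python
from datetime import date
from string import Template

# hypothetical rewrite of the template lines using $-placeholders
template = Template(
    "grep -i alert elf$ymd.fil\n"
    "Alerts for $mdy\n"
    "There was an alert on $mdy.\n"
)

def fill(day):
    # derive both date formats from a single date object
    return template.substitute(ymd=day.strftime("%Y%m%d"),
                               mdy=day.strftime("%m/%d/%Y"))

worksheet = fill(date(2014, 12, 25))
print(worksheet)
```

Passing date.today() instead of a fixed date would produce the current day's worksheet.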
