I’m building a tagger that searches through a corpus of IM data and tags any instances of words that occur on a wordlist. I've run into a problem and was hoping to find help. I'd like to try and understand exactly why it's not working, so I've laid out everything I can think of that might be helpful.
    data = ["@LINE2@ 04-09-2006/DAT 09:05:30/TIM [Team]/CHT @NAME@ Digs Phung @NAME@ @CONTENT@ you might be crazy @CONTENT@ @LINE2@\n"]
    wordlist = ["a", "an", "aardvark", "aardvarks", "aback", "abacus", "you", "be"]

    import re

    def tagger(data_string):
        string_copy = data_string
        for entry in wordlist:
            p = re.compile("\s" + entry + "\s", re.IGNORECASE)
            g = p.search(string_copy)
            if g == None:
                pass
            else:
                h = g.group()
                space_copy = string_copy.replace(h, h + "/TAG ")
                string_copy = space_copy.replace(" /TAG", "/TAG")
        return string_copy

    tagged = []
    for line in data:
        x = tagger(line)
        tagged.append(x)
This works as expected, producing:
    tagged = ['@LINE2@ 04-09-2006/DAT 09:05:30/TIM [Team]/CHT @NAME@ Digs Phung @NAME@ @CONTENT@ you/TAG might be/TAG crazy @CONTENT@ @LINE3@']
But when I do the same thing to the full wordlist (~40k words) and data (a list with ~1 million strings), I get the following error:
    Traceback (most recent call last):
      File "<pyshell#197>", line 2, in <module>
        x = tagger(data_string)
      File "<pyshell#195>", line 4, in tagger
        p = re.compile("\s" + entry + "\s", re.IGNORECASE)
      File "C:\Python25\lib\re.py", line 188, in compile
        return _compile(pattern, flags)
      File "C:\Python25\lib\re.py", line 241, in _compile
        raise error, v # invalid expression
    error: unexpected end of regular expression
I've run a similar function over the full dataset with smaller wordlists (~50) and haven't had any problems, so I figured that the issue was with something in this particular wordlist. Here's what I've already done:
1) I've tested the wordlist to make sure it contains only alphanumeric characters, because I thought other characters might be interfering.
2) The function moves through the wordlist once, and then the error message pops up. The second line of data (where it hangs) is above.
3) I swapped out the variable names to words that couldn't appear on the wordlist, in case there was some conflict there.
4) I did some searches for the error message, but the explanations were way over my head.
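One way to narrow this down might be to compile each wordlist entry on its own and record the ones that fail, since the traceback points at `re.compile` rather than at the search or replace steps. The following is only a sketch under assumptions: `find_bad_entries` and the `sample` list are hypothetical and not part of my real code, and the failing entry `"aback["` is made up to show what a stray regex metacharacter would do. The exact error text depends on the Python version, so none is shown.

```python
import re

def find_bad_entries(wordlist):
    """Return (index, entry, message) for each entry whose pattern fails to compile."""
    bad = []
    for i, entry in enumerate(wordlist):
        try:
            # Same pattern construction as in tagger(), with raw strings for \s
            re.compile(r"\s" + entry + r"\s", re.IGNORECASE)
        except re.error as exc:
            bad.append((i, entry, str(exc)))
    return bad

# Hypothetical sample: the last entry contains an unclosed "[",
# which the regex engine reads as the start of a character set.
sample = ["you", "be", "aback["]
bad_entries = find_bad_entries(sample)

for index, entry, message in bad_entries:
    print(index, repr(entry), message)

# re.escape() makes an entry match literally, so the same entry compiles.
safe = re.compile(r"\s" + re.escape("aback[") + r"\s", re.IGNORECASE)
```

If a check like this reports nothing for the full wordlist, the problem is presumably somewhere other than the entries themselves. If it does report entries, wrapping `entry` in `re.escape()` inside `tagger` would be one possible way to treat every word as literal text.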