lowercase word generator

TrustyTony

This handy function turns a file into a stream of words stripped of punctuation, whitespace and digits. It does not split contractions, so for example "we'd" stays a single word rather than becoming two; if you want that, you can further process the yielded words or change the definition (see the sketch after the code).

import string
def get_lower_words(filein):
    ''' yield lower case words from an open file, stripped of
        punctuation, whitespace and digits '''
    for line in filein:
        while line:
            # split off everything up to the first space
            word, match, line = line.partition(' ')
            word = word.lower().strip(string.punctuation+
                                      string.whitespace+
                                      string.digits)
            if word: yield word

for word in get_lower_words(open('11.txt')):
    print word
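
If you do want contractions split, here is a minimal sketch of such further processing (the apostrophe rule and the helper name split_contractions are my own illustration, not part of the snippet):

def split_contractions(words):
    ''' split each yielded word on apostrophes, so "we'd" gives "we" and "d" '''
    for word in words:
        for part in word.split("'"):
            if part:
                yield part

for word in split_contractions(get_lower_words(open('11.txt'))):
    print word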
griswolf

Why use partition() instead of split()? What happens with 'malformed' lines such as word1\tword2? Using split() would fix that problem (but it builds a list of all the words at once, so the generator holds a lot of data if the lines are long).
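
A quick interactive illustration of the tab case (my own, not from the post):

>>> 'word1\tword2'.partition(' ')
('word1\tword2', '', '')
>>> 'word1\tword2'.split()
['word1', 'word2']

With partition(' ') the tab never splits the line, so the generator yields the whole thing (minus the stripped trailing digit) as a single "word".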

TrustyTony

In StackOverflow discussions it has come up that partition() is very much faster than split(). I have not timed it myself. You can check it by changing the partition call in a copy of the function to

word, line = line.split(None, 1)

and comparing the speed.
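
A minimal timeit sketch for such a comparison (the sample line is my own assumption):

import timeit
setup = "line = 'the quick brown fox jumps over the lazy dog'"
print timeit.timeit("line.partition(' ')", setup)
print timeit.timeit('line.split(None, 1)', setup)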

TrustyTony


Except that with split() you should change to a for loop, as the change above does not work for lines with fewer than two words. Also, my simple usage example does not close the file; it is better to use a with statement.
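
A sketch of the usage with the file closed properly:

with open('11.txt') as filein:
    for word in get_lower_words(filein):
        print word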

I did a few other versions, including a letter-by-letter isalpha()/groupby() scan, and the timing was not bad for the for loop over the split line. I will also time a re version for comparison and post the results.

TrustyTony

Here is do-it-yourself wall-time timing of the different versions, using Alice in Wonderland.

Notice that the results of these versions differ, as some treat every non-letter as a word break: for example, "don't" stays one word in the partition and split versions but becomes "don" and "t" in the translate, groupby and re versions, which is why the top-ten counts below differ.

Now you can start to make plans for how to spend all those milliseconds saved per book compared to the answer in the earlier post (all 78 of them).

import string
import time
from collections import defaultdict
import itertools
import re

# translation table turning everything but letters into ' '
only_letters = ''.join(chr(c) if chr(c).isalpha() else ' ' for c in range(256))

# regular expression capturing words
words = re.compile(r'\w+')

def get_lower_words(filein):
    for line in filein:
        while line:
            line = line.replace('--',' ')  # treat '--' as a word break
            word, match, line = line.partition(' ')
            word = word.lower().strip(string.punctuation+
                                      string.whitespace+
                                      string.digits)
            if word: yield word

def lower_words_split(filein):
    ''' give lower case words keeping non-letters inside words '''
    for line in filein:
        ## deal with double dashes a la Project Gutenberg
        line = line.replace('--',' ')
        for word in line.split(None):
            word = word.lower().strip(string.punctuation+
                                      string.digits)
            if word: yield word

def lower_words_split_trans(filein):
    ''' turn all non-letters into spaces and split on space '''
    for line in filein:
        for word in line.translate(only_letters).split(' '):
            if word: yield word.lower()

def lower_generate(filein):
    ''' generate the same words as lower_words_split_trans
        by a letter-by-letter scan with groupby '''
    return (''.join(c)
            for line in filein
            for islet,c in itertools.groupby(line.lower(), lambda x: x.isalpha()) if islet)

def lower_words_re(filein):
    return (w.lower()
            for line in filein
            for w in re.findall(words, line))
          
for func in (get_lower_words,
             lower_generate,
             lower_words_split,
             lower_words_split_trans,
             lower_words_re):
    with open('11.txt') as alice:
        counts = defaultdict(int)

        t0 = time.time()
        for word in func(alice):
            counts[word] += 1
            
        t1 = time.time()
        print ('%s, %.0f ms' % (func, 1000.0*(t1-t0)))
        print '\n'.join('%4i:%20s' % pair
                        for pair in sorted(((count, word)
                                            for word, count in counts.items()),
                                           reverse=True)[:10])
        raw_input('Ready')

''' Output on my computer:
Microsoft Windows XP [versio 5.1.2600]
(C) Copyright 1985 - 2001 Microsoft Corp.

J:\test>yieldwords.py
<function get_lower_words at 0x00BD3030>, 125 ms
1813:                 the
 934:                 and
 805:                  to
 689:                   a
 628:                  of
 545:                  it
 541:                 she
 462:                said
 435:                 you
 429:                  in
Ready
<function lower_generate at 0x00BD31F0>, 266 ms
1818:                 the
 940:                 and
 809:                  to
 690:                   a
 631:                  of
 610:                  it
 553:                 she
 545:                   i
 481:                 you
 462:                said
Ready
<function lower_words_split at 0x00BD3170>, 78 ms
1813:                 the
 934:                 and
 805:                  to
 689:                   a
 628:                  of
 545:                  it
 541:                 she
 462:                said
 435:                 you
 429:                  in
Ready
<function lower_words_split_trans at 0x00BD31B0>, 47 ms
1818:                 the
 940:                 and
 809:                  to
 690:                   a
 631:                  of
 610:                  it
 553:                 she
 545:                   i
 481:                 you
 462:                said
Ready
<function lower_words_re at 0x00BD3230>, 78 ms
1818:                 the
 940:                 and
 809:                  to
 690:                   a
 631:                  of
 610:                  it
 553:                 she
 543:                   i
 481:                 you
 462:                said
Ready
'''
griswolf

I rewrote Tony's tests to be more uniform (all of them now use the yield keyword, etc.). I skipped lower_generate, which was slowest in his tests. This was run on my OS X laptop.
Bottom line: using split() beats partition() by a factor of about 3.5 on this data (4619 ms vs 1299 ms).

from collections import defaultdict
import string
import re
import time
stripA = string.punctuation+string.whitespace+string.digits
stripB = string.punctuation+string.digits
only_letters = ''.join(chr(c) if chr(c).isalpha() else ' ' for c in range(256))
wordRE = re.compile(r'\w+')

def get_lower_words_partition(filein):
  """line.partition(' ')"""
  for line in filein:
    while line:
      word, match, line = line.partition(' ')
      word = word.lower().strip(stripA)
      if word: yield word

def get_lower_words_split_one(filein):
  """line.split(None,1)"""
  for line in filein:
    while line:
      try:
        word,line = line.split(None,1)
      except ValueError:
        word,line = line,None
      word = word.lower().strip(stripA)
      if word: yield word

def get_lower_words_split_all(filein):
  """line.split()"""
  for line in filein:
    for word in line.split():
      word = word.lower().strip(stripB)
      if word: yield word

def get_lower_words_xlate_split_one(filein):
  """line.translate().split(None,1)"""
  for line in filein:
    line = line.translate(only_letters)
    while line:
      try:
        word,line = line.split(None,1)
      except ValueError:
        word,line = line,None
      word = word.lower().strip(stripA)
      if word: yield word

def get_lower_words_xlate_split_all(filein):
  """line.translate().split()"""
  for line in filein:
    line = line.translate(only_letters)
    for word in line.split():
      word = word.lower().strip(stripB)
      if word: yield word
    
def get_lower_words_xlate_partition(filein):
  """line.translate().partition(' ')"""
  for line in filein:
    line = line.translate(only_letters)
    while line:
      word, match, line = line.partition(' ')
      word = word.lower().strip(stripA)
      if word: yield word

def get_lower_words_re(filein):
  """'\w+'.findall(line)"""
  for line in filein:
    for w in wordRE.findall(line):
      if w: yield w.lower()

def get_lower_words_xlate_re(filein):
  """'\w+'.findall(line.translate())"""
  for line in filein:
    line = line.translate(only_letters)
    for w in wordRE.findall(line):
      if w:
        yield w.lower()

fs = [globals()[f] for f in dir() if f.startswith('get_')]
functions = zip(fs, (f.__doc__ for f in fs))

results = []
for func,doc in functions:
  with open('/tmp/big.txt') as rabbit:
    counts = defaultdict(int)
    t0 = time.time()
    for word in func(rabbit):
      counts[word] += 1
    t1 = time.time()
    result = '%5.0f ms (distinct words: %d) -- %s' % (1000.0*(t1-t0),len(counts),doc)
    print('%5.0f ms -- %s'%(1000.0*(t1-t0),doc))
    results.append(result)
print('')
for r in sorted(results):
  print(r)
"""Results using a file 'big.txt' with about 264K lines, some very long (cat of several PHP files, then by hand merged up to 5000 lines into a single line, several places in the file):

% wc /tmp/big.txt 
  264538  902027 10014551 /tmp/big.txt
% # wc is the 'word count' util: 264,538 lines, 902,027 'words', about 10M characters
% 
% python time_wordgen.py
 4619 ms -- line.partition(' ')
 1497 ms -- '\w+'.findall(line)
 1299 ms -- line.split()
 3387 ms -- line.split(None,1)
 6514 ms -- line.translate().partition(' ')
 1625 ms -- '\w+'.findall(line.translate())
 1417 ms -- line.translate().split()
 3787 ms -- line.translate().split(None,1)

 1299 ms (distinct words: 6375) -- line.split()
 1417 ms (distinct words: 2577) -- line.translate().split()
 1497 ms (distinct words: 3298) -- '\w+'.findall(line)
 1625 ms (distinct words: 2577) -- '\w+'.findall(line.translate())
 3387 ms (distinct words: 6375) -- line.split(None,1)
 3787 ms (distinct words: 2577) -- line.translate().split(None,1)
 4619 ms (distinct words: 6375) -- line.partition(' ')
 6514 ms (distinct words: 2577) -- line.translate().partition(' ')
"""