Help regarding regular expression and codecs module

Question

enigmaenigma 0 Newbie Poster

17 Years Ago

I have text files in Russian that I need to read, strip of all special characters (such as !, digits, and blank spaces), and write to another file in encoded form.

Below is the code I am using:

for fname in filelist:
    if fname.endswith('.txt'):
        count = 0
        path = os.path.join(dirs, fname)
        os.chmod(dirs, 0755)
        txtfile = file('%s/%s.txt' % (dir, fname), 'w', 'utf-8')
        textf = codecs.open("%s/%s" % (dirss,fname), 'r', "utf-8")
        p = re.compile("[^a-z A-Z]")
        lines = textf.readlines()

        for line in lines:
            s2 = p.sub("", line)
            txtfile.write(s2)
            count +=1

        if count <= count1:
            continue
        if count == count1:
            break

textf.close()
txtfile.close()

When I execute the above code, it gives me the error "TypeError: an integer is required".
I am working with txt files in multiple languages.

Please help.

python

All 5 Replies

woooee 814 Nearly a Posting Maven

17 Years Ago

if count <= count1: Is count1 declared somewhere? If so, which line is the error referencing? Also, use CODE tags; see this post: http://www.daniweb.com/forums/announcement114-3.html

woooee 814 Nearly a Posting Maven

17 Years Ago

I know very little about UTF, but I have always seen files opened this way:
import codecs
fp = codecs.open( fname, "w", "utf-8" )
I don't know if the filename used has to be a unicode string, and would assume that the records written to the file would be unicode strings.
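
A minimal sketch of the codecs.open approach described above, assuming Python 3 (the thread itself uses Python 2). The helper name write_and_read_back and the sample text are hypothetical. The filename can be an ordinary string; what matters is that the data written is a Unicode string, which the codec then encodes to bytes.

```python
import codecs
import os
import tempfile


def write_and_read_back(text, encoding="utf-8"):
    """Write text with codecs.open, then read it back with the same encoding."""
    # Hypothetical throwaway file; the name is generated, not taken from the thread.
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        # The third argument of codecs.open is the encoding name.
        with codecs.open(path, "w", encoding) as fp:
            fp.write(text)  # a Unicode string, encoded to bytes on write
        with codecs.open(path, "r", encoding) as fp:
            return fp.read()  # decoded back into a Unicode string
    finally:
        os.remove(path)


sample = u"Привет, мир"
round_tripped = write_and_read_back(sample)
```

In Python 3 the built-in open(path, mode, encoding="utf-8") does the same job without the codecs module.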


enigmaenigma 0 Newbie Poster · Answer 1 · 2008-03-08T06:43:56+00:00

count1 is the number of entries in filelist, the listing of the directory.

dirs = '/tmp/5.0/'
dirss = '/tmp/5.0/fr'
filelist = os.listdir(dirss)
os.mkdir('%s/FRENCH' % dirss)
dir = '/tmp/5.0/fr/FRENCH'
count1 = len(filelist)
for fname in filelist:
    if fname.endswith('.txt'):
        count = 0
        path = os.path.join(dirs, fname)
        os.chmod(dirs, 0755)
        txtfile = file('%s/%s.txt' % (dir, fname), 'w', 'utf-8')
        textf = codecs.open("%s/%s" % (dirss,fname), 'r', "utf-8")
        p = re.compile("[^a-z A-Z]")
        lines = textf.readlines()

        for line in lines:
            s2 = p.sub("", line)
            txtfile.write(s2)
            count +=1

        if count <= count1:
            continue
        if count == count1:
            break

textf.close()
txtfile.close()

The error I get is below:
Traceback (most recent call last):
  File "reg1.py", line 23, in ?
    txtfile = file('%s/%s.txt' % (dir, filename), 'w', "utf-8")
TypeError: an integer is required

Thanks in advance.
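
The traceback points at the file(...) call: in Python 2, the third positional argument of file() is the buffer size, an integer, not an encoding, which is why 'utf-8' in that position raises a TypeError. The sketch below reproduces the same kind of mistake in Python 3, where the built-in open() also takes buffering as its third positional parameter. It assumes Python 3; the function names and the temporary path are hypothetical.

```python
import os
import tempfile


def try_positional_encoding(path):
    """Return the TypeError message raised when an encoding is passed positionally."""
    try:
        # Third positional parameter is buffering (an int), so a string is rejected.
        fp = open(path, "w", "utf-8")
    except TypeError as exc:
        return str(exc)
    fp.close()
    return None


def write_utf8(path, text):
    """Write text as UTF-8 by passing the encoding as a keyword argument."""
    with open(path, "w", encoding="utf-8") as fp:
        fp.write(text)


# Hypothetical scratch location, used only for this demonstration.
demo_path = os.path.join(tempfile.mkdtemp(), "out.txt")
error_message = try_positional_encoding(demo_path)
write_utf8(demo_path, u"Привет")
with open(demo_path, "r", encoding="utf-8") as fp:
    demo_content = fp.read()
```

In the Python 2 code from the post, replacing file(...) with codecs.open(..., 'w', 'utf-8') avoids the error, because codecs.open does accept the encoding as its third argument.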

enigmaenigma 0 Newbie Poster · Answer 2 · 2008-03-11T06:11:49+00:00

The filenames are in English, but the text in each file is in a different language (French, German, Spanish, and so on); there are more than 20 languages. Each txt file contains text in only one language, and the txt files are grouped by language.

Can someone help with this? I have been struggling with it for quite some time.

Thanks for the response, woooee. I tried your piece of code, but it did not work.

Can codecs and regular expressions work together?
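
They can: once the file is decoded, the regular expression runs on Unicode text. One caveat is that the pattern [^a-z A-Z] treats every non-ASCII letter as a character to remove, so Cyrillic or accented letters are stripped too. The sketch below, assuming Python 3 (where str patterns are Unicode-aware by default), contrasts that pattern with one that removes only non-letters. The names ASCII_ONLY, NON_LETTERS, and keep_letters are hypothetical.

```python
import re

# The pattern from the post: removes anything that is not an ASCII letter or a space.
ASCII_ONLY = re.compile(r"[^a-z A-Z]")

# Removes non-word characters, digits, and underscores, leaving letters of any script.
NON_LETTERS = re.compile(r"[\W\d_]+")


def keep_letters(text):
    """Strip punctuation, digits, whitespace, and underscores from text."""
    return NON_LETTERS.sub("", text)


russian = u"Привет, мир! 123"
stripped_ascii = ASCII_ONLY.sub("", russian)   # only the two spaces survive
stripped_unicode = keep_letters(russian)       # the Cyrillic letters survive
```

Note that this variant also removes spaces, which matches the stated goal of dropping blank spaces; the pattern in the post keeps them.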

enigmaenigma 0 Newbie Poster · Answer 3 · 2008-03-13T03:54:53+00:00

Hi all,

I have found the solution to the issue.
Below is the code:

count1 = len(filelist)
print('The count is %s' % count1)
for fname in filelist:
    if fname.endswith('.txt'):
        count = 0
        path = os.path.join(dirs, fname)
        os.chmod(dirs, 0755)
        (filename, extension) = os.path.splitext(fname)
        file_name = filename.upper()
        txtfile = codecs.open('%s/%s.txt' % (dir, filename), 'wU', "utf-8")
        textf = open("%s/%s" % (dirss,fname), 'rU')
        p = re.compile("[^a-z A-Z]")
        lines = textf.readlines()
        for line in lines:
            s2 = p.sub("", line)
            txtfile.write(s2)
            count +=1

        if count <= count1:
            continue
        if count == count1:
            break

textf.close()
txtfile.close()

My mistake was opening the input file for reading with codecs. I only have to open it in rU mode and write the content to a new file using codecs with an encoding.
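
For comparison, here is a sketch of the same read, filter, and write loop in Python 3, not the code from the post. It assumes the input files are UTF-8 and uses a Unicode-aware pattern, because [^a-z A-Z] removes every character outside the ASCII letters and the space, so non-English letters do not survive it. The function name clean_text_files, the directory names, and the sample file are all hypothetical.

```python
import os
import re
import tempfile

# Removes non-word characters, digits, and underscores; letters of any script remain.
NON_LETTERS = re.compile(r"[\W\d_]+")


def clean_text_files(src_dir, dst_dir, encoding="utf-8"):
    """Copy each .txt file from src_dir into dst_dir, keeping only letters."""
    os.makedirs(dst_dir, exist_ok=True)
    written = []
    for fname in sorted(os.listdir(src_dir)):
        src_path = os.path.join(src_dir, fname)
        if not fname.endswith(".txt") or not os.path.isfile(src_path):
            continue
        dst_path = os.path.join(dst_dir, fname)
        # Both files are opened with an explicit encoding and closed per file.
        with open(src_path, "r", encoding=encoding) as src, \
                open(dst_path, "w", encoding=encoding) as dst:
            for line in src:
                dst.write(NON_LETTERS.sub("", line))
        written.append(dst_path)
    return written


# Demonstration on a scratch directory (hypothetical sample data).
demo_src = tempfile.mkdtemp()
demo_dst = os.path.join(demo_src, "CLEAN")
with open(os.path.join(demo_src, "sample.txt"), "w", encoding="utf-8") as fp:
    fp.write(u"Привет, мир! 123\nÇa va? 42\n")
demo_written = clean_text_files(demo_src, demo_dst)
```

Opening the files in a with block closes each pair as soon as its file is processed, rather than only after the loop ends. Line breaks are removed along with the other non-letters, as they are by the pattern in the post.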

Thanks.
