I have text files in Russian that I need to read, strip of all special characters (such as "!", digits, and blank spaces), and write to another file in encoded form.

Below is the code I am using:

for fname in filelist:
    if fname.endswith('.txt'):
        count = 0
        path = os.path.join(dirs, fname)
        os.chmod(dirs, 0755)
        txtfile = file('%s/%s.txt' % (dir, fname), 'w', 'utf-8')
        textf = codecs.open("%s/%s" % (dirss,fname), 'r', "utf-8")
        p = re.compile("[^a-z A-Z]")
        lines = textf.readlines()

        for line in lines:
            s2 = p.sub("", line)
            txtfile.write(s2)
            count +=1

        if count <= count1:
            continue
        if count == count1:
            break

textf.close()
txtfile.close()

When I execute the above code, it fails with the error "TypeError: an integer is required".
I am working with text files in multiple languages.

Please help

count1 is the number of entries in filelist, the listing of the input directory.

dirs = '/tmp/5.0/'
dirss = '/tmp/5.0/fr'
filelist = os.listdir(dirss)
os.mkdir('%s/FRENCH' % dirss)
dir = '/tmp/5.0/fr/FRENCH'
count1 = len(filelist)
for fname in filelist:
    if fname.endswith('.txt'):
        count = 0
        path = os.path.join(dirs, fname)
        os.chmod(dirs, 0755)
        txtfile = file('%s/%s.txt' % (dir, fname), 'w', 'utf-8')
        textf = codecs.open("%s/%s" % (dirss,fname), 'r', "utf-8")
        p = re.compile("[^a-z A-Z]")
        lines = textf.readlines()

        for line in lines:
            s2 = p.sub("", line)
            txtfile.write(s2)
            count +=1

        if count <= count1:
            continue
        if count == count1:
            break

textf.close()
txtfile.close()

The error I get is below:

Traceback (most recent call last):
  File "reg1.py", line 23, in ?
    txtfile = file('%s/%s.txt' % (dir, filename), 'w', "utf-8")
TypeError: an integer is required

Thanks in advance.

I know very little about UTF, but I have always seen files opened this way:
import codecs
fp = codecs.open( fname, "w", "utf-8" )
I don't know if the filename used has to be a unicode string, and would assume that the records written to the file would be unicode strings.
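For what it's worth, the "an integer is required" message in the traceback is consistent with how the built-in file/open is declared: its third positional argument is the buffer size (an integer), not an encoding, whereas codecs.open takes the encoding in that position. Below is a minimal sketch of the difference. It is written for a current Python, where open stands in for the old file built-in, and the scratch path and sample text are made up for illustration.

```python
import codecs
import os
import tempfile

# Hypothetical scratch file, used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "demo.txt")

# The built-in open() takes (name, mode, buffering), so a string in the
# third position is rejected with a TypeError.
try:
    open(path, "w", "utf-8")
    builtin_error = None
except TypeError as exc:
    builtin_error = exc

# codecs.open() takes (name, mode, encoding), so the same call shape works.
fp = codecs.open(path, "w", "utf-8")
fp.write(u"Привет, мир\n")
fp.close()

# Reading back through codecs.open() yields decoded (unicode) text.
fp = codecs.open(path, "r", "utf-8")
text = fp.read()
fp.close()
```

The file name itself can stay a plain ASCII string; it is the data passed to write() that should be unicode text.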

The file names are in English, but the text inside each file is in a different language (French, German, Spanish, and so on). There are more than 20 languages. Each text file contains only one language, and the files are grouped by language.

Can someone help with this? I have been struggling with it for quite some time.

Thanks for the response, woooee. I tried your piece of code, but it did not work.

Does anyone know whether codecs and regular expressions can work together?
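On the question of codecs and regular expressions together: the re module works on the decoded text that a codecs-opened file hands back. One thing to watch is that the pattern [^a-z A-Z] keeps only ASCII letters and spaces, so it would delete every Cyrillic character from a Russian file. Here is a hedged sketch of a Unicode-aware alternative. It assumes "special characters" means anything that is neither a letter nor a plain space (mirroring the original pattern, which also keeps spaces), and the names keep_letters, ASCII_ONLY and NOT_LETTER_OR_SPACE are made up for this example.

```python
import re

# The pattern from the thread: anything outside a-z, A-Z and space is removed,
# which includes all Cyrillic and accented letters.
ASCII_ONLY = re.compile(u"[^a-z A-Z]")

# [^\w ] matches anything that is not a word character or a plain space;
# [\d_] additionally matches digits and the underscore, which \w would keep.
# re.UNICODE makes \w and \d Unicode-aware (already the default for text
# patterns on Python 3).
NOT_LETTER_OR_SPACE = re.compile(r"[^\w ]|[\d_]", re.UNICODE)


def keep_letters(text):
    """Return text with punctuation, digits, underscores and line breaks removed."""
    return NOT_LETTER_OR_SPACE.sub(u"", text)


sample = u"Привет, мир! 123_abc"
ascii_result = ASCII_ONLY.sub(u"", sample)    # Cyrillic letters are lost
unicode_result = keep_letters(sample)         # Cyrillic letters survive
```

This is only an approximation of "letters": a few numeric characters that are not decimal digits still count as word characters, so they would pass through.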

Hi all,

I have found a solution to the issue.
The code is below:

count1 = len(filelist)
print('The count is %s' % count1)
for fname in filelist:
    if fname.endswith('.txt'):
        count = 0
        path = os.path.join(dirs, fname)
        os.chmod(dirs, 0755)
        (filename, extension) = os.path.splitext(fname)
        file_name = filename.upper()
        txtfile = codecs.open('%s/%s.txt' % (dir, filename), 'wU', "utf-8")
        textf = open("%s/%s" % (dirss,fname), 'rU')
        p = re.compile("[^a-z A-Z]")
        lines = textf.readlines()
        for line in lines:
            s2 = p.sub("", line)
            txtfile.write(s2)
            count +=1

        if count <= count1:
            continue
        if count == count1:
            break

textf.close()
txtfile.close()

My mistake was opening the input file for reading with codecs. I only had to open it in rU mode and then write the content to a new file using codecs with an encoding.
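Pulling the thread together, here is a rough sketch of the whole job with each file opened and closed inside the loop. It is not the poster's code: it is written for a current Python, uses io.open with an explicit encoding in place of codecs.open, assumes the input files are UTF-8 (the thread never states their encoding), and keeps letters of any script rather than only a-z and A-Z. The names clean_directory, src_dir and dst_dir are made up for this example.

```python
import io
import os
import re

# Keep letters (any script) and plain spaces; drop digits, underscores,
# punctuation and line breaks.
NOT_LETTER_OR_SPACE = re.compile(r"[^\w ]|[\d_]", re.UNICODE)


def clean_directory(src_dir, dst_dir, encoding="utf-8"):
    """Write a cleaned copy of every .txt file in src_dir into dst_dir."""
    if not os.path.isdir(dst_dir):
        os.makedirs(dst_dir)
    written = []
    for fname in sorted(os.listdir(src_dir)):
        src_path = os.path.join(src_dir, fname)
        # Skip sub-directories and anything that is not a .txt file.
        if not fname.endswith(".txt") or not os.path.isfile(src_path):
            continue
        dst_path = os.path.join(dst_dir, fname)
        # The with-block closes both files, even if an error occurs.
        with io.open(src_path, "r", encoding=encoding) as src, \
                io.open(dst_path, "w", encoding=encoding) as dst:
            for line in src:
                dst.write(NOT_LETTER_OR_SPACE.sub(u"", line))
        written.append(dst_path)
    return written
```

With the paths from this thread, the call would look like clean_directory('/tmp/5.0/fr', '/tmp/5.0/fr/FRENCH'). Because line breaks are removed along with the other characters, each output file ends up as a single line, which matches what the original pattern does.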

Thanks.
