Is there some way I can display the 'escape characters'?
Escape characters generally have codes below decimal 32 or above decimal 127 (for English text), but you may have to adjust for your character set. This works for me:
# -*- coding: utf-8 -*-
test_file = [ "about, approximately περίπου cinema σινεμά (το)", \
"after μετά cinema σινεμά/κινηματογράφος", \
"afternoon απόγευμα (το) city πόλη (η)" ]
escape_min = 32
escape_max = 127
for rec in test_file:
    printed = 0
    for num, character in enumerate(rec):
        ord_chr = ord(character)
        if (ord_chr < escape_min) or (ord_chr > escape_max):
            if not printed:
                print "\n", rec
                printed = 1
            print " ", num, ord_chr
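For anyone on Python 3, here is a rough sketch of the same check. The function name `find_unusual` and the sample string are my own, made up for illustration. Note that Python 3 strings are walked one character at a time, so you see code points such as 956 rather than the byte values 206 and 207 that the Python 2 loop above reports.

```python
def find_unusual(text, low=32, high=127):
    """Return (index, code point) pairs for characters outside low..high."""
    return [(i, ord(ch)) for i, ch in enumerate(text)
            if ord(ch) < low or ord(ch) > high]

# Hypothetical test string: a tab followed by a Greek word
sample = "after\tμετά"

# ascii() shows control and non-ASCII characters as escape sequences
print(ascii(sample))

for index, code in find_unusual(sample):
    print(index, code, hex(code))
```

The `ascii()` call alone may be enough if you only want to see where the escapes are.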
woooee
Nearly a Posting Maven
2,454 posts since Dec 2006
Reputation Points: 777
Solved Threads: 714
I can't see where the columns are in your example because this website reduces multiple spaces to single spaces. If you wrap your example in code tags (click the "Help with Code Tags" link for instructions) we can see how your example data really is appearing to you.
For example, when I type the following without the code tags, you can't see the 20 spaces that I typed between 'lineA' and 'This':
'This is lineA This is lineB'
With the code tags, the same example looks like this:
'This is lineA                    This is lineB'
d5e5
Practically a Posting Shark
810 posts since Sep 2009
Reputation Points: 159
Solved Threads: 159
If I understand correctly, the above shows 14 examples of output records -- that is, 14 examples of the contents of newString after the newString = '%-30s%-30s%s' % ( lineA , lineB, new_line) runs.
What I'd like to do is run that command myself to reproduce your output and try to modify the command to avoid the irregular-looking output.
Could you show us one or two examples of the initial contents of the lineA, lineB, and new_line variables? I'm a Python beginner as well so I don't know all the answers, but I think if we knew some sample values for the input variables we would have something to test.
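To show what that format string is meant to do, here is a small sketch with made-up ASCII values (the two entries are placeholders, not data from the real file). Each `%-30s` left-justifies its value in a 30-character column.

```python
# Hypothetical sample values, plain ASCII only
line_a = "aunt"
line_b = "bank"
new_line = "\n"

# %-30s pads each value on the right with spaces up to 30 characters
new_string = '%-30s%-30s%s' % (line_a, line_b, new_line)

print(repr(new_string))
print(len(new_string))  # two 30-character columns plus the newline
```

With ASCII input the second column always starts at position 30, which is the regular layout the command is aiming for.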
d5e5
Thanks Chico2009. I'm interested because here in Canada we have two official languages, so it's good to learn how this encoding thing works. I just tried testing something in Python's IDLE and it complained right away about "unsupported characters".
I have to go to lunch now but I'll look at your attachment later this afternoon if possible.
d5e5
No solution yet. Now it's driving me crazy. :)
What I think is happening is that Python reads your input file and writes your output file as plain byte strings, so the non-ASCII characters pass through unchanged. (The # -*- coding: utf-8 -*- statement only tells Python how the script itself is encoded.) However, when I examine the contents of lineA using IDLE, it shows me content like the following:
>>> lineA
'2 \xce\xb4\xcf\x8d\xce\xbf (also \xce\xb4\xcf\x85\xce\xbf)'
>>>
>>> len(lineA)
50
>>> lineB
'but \xce\xbc\xce\xb1'
>>> len(lineB)
34
>>> #...So when we adjust the widths of lineA and lineB, it is the above UTF-8 byte sequences we are manipulating
>>> combined_AB = "%-56s%-60s" % (lineA,lineB)
>>> combined_AB
'2 \xce\xb4\xcf\x8d\xce\xbf (also \xce\xb4\xcf\x85\xce\xbf) but \xce\xbc\xce\xb1 '
>>> len(combined_AB)
116
>>> #You see the problem? Python pads to the requested width by counting bytes, and each Greek letter takes two bytes in UTF-8, so the line comes out too short once the bytes are displayed as characters.
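Here is a short Python 3 sketch of the same mismatch, using the first Greek word from lineA above. The variable names are my own. It only illustrates the byte count versus the character count; it is not a fix.

```python
# The UTF-8 bytes of the first Greek word shown in lineA above
encoded = b'\xce\xb4\xcf\x8d\xce\xbf'
word = encoded.decode("utf-8")

print(len(encoded))  # number of bytes
print(len(word))     # number of characters

# Padding the bytes counts bytes; padding the text counts characters
padded_bytes = encoded.ljust(10)
padded_text = word.ljust(10)

print(len(padded_bytes.decode("utf-8")))  # displayed width after byte padding
print(len(padded_text))                   # displayed width after text padding
```

The byte-padded version ends up one column short for every Greek letter in the field.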
d5e5
Now that we can recreate the problem I'm confident that we're close to finding a way around it. I just need to learn a little more about how to format a unicode string. Maybe something to do with encoding or decoding it to set the length we want when formatting.
I'll try to look at it some more this afternoon, unless someone else finds the answer before then.
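One possible shape for the decode idea, sketched in Python 3 and not tested against the real dictionary file: decode the bytes to text, format the text, then encode the result again. The function name `format_columns` and its parameters are hypothetical, and I am assuming the file really is UTF-8.

```python
def format_columns(line_a, line_b, width_a=56, width_b=60, encoding="utf-8"):
    """Decode two byte strings, pad them as text, and re-encode the result."""
    text_a = line_a.decode(encoding)
    text_b = line_b.decode(encoding)
    # '*' takes the column width from the argument list
    combined = "%-*s%-*s" % (width_a, text_a, width_b, text_b)
    return combined.encode(encoding)

# Byte strings similar to the ones shown in the IDLE session
row = format_columns(b'but \xce\xbc\xce\xb1', b'2 \xce\xb4\xcf\x8d\xce\xbf', 10, 12)
print(row.decode("utf-8") + "|")
```

Because the padding happens on decoded text, the widths are counted in characters, not bytes.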
d5e5
You would have to step through the string and count the number of actual letters (each Greek letter is stored as two bytes), and then add enough spaces to the string to make it the length you want. This appears to work, but you'll have to test for yourself.
def print_rec(rec):
    print rec
    substrs = rec.split()
    greek_str = " ".join(substrs[1:])
    num_chr = 0
    for char in greek_str:
        print ord(char),
        ## 206 and 207 are the first bytes of the two-byte Greek letters,
        ## so skipping them counts each letter once
        if ord(char) not in [206, 207]:
            num_chr += 1
    print "\n", num_chr

fp = open("./origDict.txt", "r")
## skip heading
fp.readline()
print_rec(fp.readline())
print_rec(fp.readline())
print_rec(fp.readline())
fp.close()
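A variation on the same counting idea, sketched in Python 3. Instead of skipping the lead bytes 206 and 207, it skips UTF-8 continuation bytes (those of the form 10xxxxxx), which should also cope with characters outside the basic Greek block. The names `count_characters` and `pad_bytes` are made up for this example.

```python
def count_characters(data):
    """Count characters in UTF-8 bytes by skipping continuation bytes."""
    # Continuation bytes have the bit pattern 10xxxxxx
    return sum(1 for byte in data if (byte & 0xC0) != 0x80)

def pad_bytes(data, width):
    """Append spaces until the data holds `width` characters."""
    missing = width - count_characters(data)
    return data + b" " * max(missing, 0)

greek = b'\xce\xb4\xcf\x8d\xce\xbf'
print(count_characters(greek))
print(pad_bytes(greek, 8))
```

This counts code points, so accents typed as separate combining marks would still be counted as extra characters.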
woooee
Please try the following as well. It works on my computer now.
#!/usr/local/bin/python
# -*- coding: utf-8 -*-
import os, sys
#DictManipulation.py written 26/09/09
#See flow_dia_for_new_dict.ods, this directory
#
# A text file that is an English to Greek dictionary and has 2 columns.
# As there are many entries it takes up 25 or so pages. The object is to
# produce a file that has 2 x 2 columns, thereby halving the number
# of pages. The 2nd column generated should be the next 72 entries following
# the first column so the alphabetical order is maintained.
#
# A sample of the original file is shown below
#
#August Αύγουστος (ο)
#aunt θεία η
#autumn φθινόπωρο (το)
#bad, ugly (adj.) άσχημος, -η, -ο
#bad, wicked, evil (adj.) κακός, -ή, -ό
#bank τράπεζα (η)
#basket καλάθι (το)
def rpad (orig_string, length):
    """Adds spaces to the end of a string until it has the desired length"""
    #Decode the UTF-8 bytes to unicode first; if your default encoding is Ascii, encoding the byte string directly fails
    ustring = unicode(orig_string,"utf-8","strict")
    #Encoded as cp737 (a single-byte Greek code page), each Greek letter has length 1 instead of 2
    plain_string = ustring.encode("cp737", "replace")
    spaces_needed = length - len(plain_string)
    padded_string = orig_string + " " * spaces_needed
    return padded_string
lines_per_page = 73
current_page = 0
line_pointer_a = 1
line_pointer_b = line_pointer_a + lines_per_page
write_file = open ("newDict.txt", "w")
read_file = open ("origDict.txt", "r")
lines_in_file = read_file.readlines ()
read_file.close()
lengh_file = len (lines_in_file)
new_line = '\n'
while line_pointer_a != (lines_per_page * (current_page + 1 ) + 1):
    lineA = lines_in_file [line_pointer_a].strip()
    lineB = lines_in_file [line_pointer_b].strip()
    #####newString = '%-56s%-60s%s' % ( lineA , lineB, new_line)
    newString = rpad(lineA,56) + rpad(lineB,60) + new_line
    write_file.write (newString)
    print newString
    print "***Length of the above string is: %d***" % len(newString)
    line_pointer_a += 1
    line_pointer_b += 1
    if line_pointer_b >= lengh_file:
        write_file.close()
        print 'Job Done'
        break
    if line_pointer_a == (lines_per_page * (current_page + 1 ) + 1):
        current_page += 2
        line_pointer_a = (lines_per_page * (current_page ) + 1)
        line_pointer_b = line_pointer_a + lines_per_page
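For comparison, a rough Python 3 sketch of the same two-column layout. It is not a drop-in replacement for the script above: the function name `two_column_pages` is my own, it works on a list of already-decoded strings with the heading line removed, and unlike the loop above it also keeps leftover entries on a short last page. In Python 3 `str.ljust` pads by characters, so no cp737 step is needed.

```python
def rpad(text, width):
    """Pad text with spaces on the right up to `width` characters."""
    return text.ljust(width)

def two_column_pages(entries, lines_per_page, left_width=56):
    """Pair entry i with entry i + lines_per_page, one page at a time."""
    rows = []
    for start in range(0, len(entries), 2 * lines_per_page):
        left = entries[start:start + lines_per_page]
        right = entries[start + lines_per_page:start + 2 * lines_per_page]
        for i, left_entry in enumerate(left):
            right_entry = right[i] if i < len(right) else ""
            # rstrip drops the padding when there is no right-hand entry
            rows.append((rpad(left_entry, left_width) + right_entry).rstrip())
    return rows

# A few entries from the sample in the script's comments
entries = ["August Αύγουστος (ο)", "aunt θεία η", "autumn φθινόπωρο (το)",
           "bank τράπεζα (η)", "basket καλάθι (το)"]
for row in two_column_pages(entries, lines_per_page=2, left_width=30):
    print(row)
```

The tiny page size here is only to keep the demonstration short; the real file would use the page length from the script.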
d5e5