hi
I am a complete novice and have written the following code. It works as required, except that the text is not in a tidy format when printed.
The origDict.txt file has been exported in a fixed-width format, and I am trying to achieve a fixed-width layout for the 2nd column so that it looks neat and readable.
I feel my problem could be something to do with unusual 'escape characters' in origDict.txt. Is there some way I can display the 'escape characters' to enable some further investigation?
I have tried various layout techniques to no avail. Some help would be very much appreciated.
A copy of my code:

lines_per_page = 73       
current_page  = 0       
line_pointer_a = 1        
line_pointer_b = line_pointer_a + lines_per_page 

write_file = open ("newDict.txt", "w")
read_file = open ("origDict.txt", "r")
lines_in_file = read_file.readlines ()
read_file.close()
lengh_file = len (lines_in_file)



while line_pointer_a != (lines_per_page * (current_page + 1 ) + 1):
    lineA = lines_in_file [line_pointer_a].strip() #Read line & delete CR
    lineB = lines_in_file[line_pointer_b] # Read line 
    newString = '%-56s %-56s' % ( lineA , lineB)
    write_file.write (newString) # write string to newDict
    print newString
    line_pointer_a += 1
    line_pointer_b += 1
    if line_pointer_b >= lengh_file:
        write_file.close()
        print 'Job Done'
        break
    if line_pointer_a == (lines_per_page * (current_page + 1 ) + 1):
        current_page += 2
        line_pointer_a = (lines_per_page * (current_page ) + 1)              
        line_pointer_b = line_pointer_a + lines_per_page

Below is a sample of part of the generated newDict.txt file; it is an English to Greek dictionary. The printed result is not represented very well below, due to limitations of the site display (I think!)

about, approximately περίπου cinema σινεμά (το)
after μετά cinema σινεμά/κινηματογράφος
afternoon απόγευμα (το) city πόλη (η)
again πάλι clock, watch ρολόι (το)
airplane αεροπλάνο (το) closed κλειστός, -ή, -ό
all όλος, -η, -ο clothes ρούχα (τα)
alone, only μόνος, -η, -ο cloud σύννεφο (το)
always πάντα/πάντοτε coffee καφές (ο)
American αμερικανικός, -ή, -ό coffehouse καφενείο (το)
and και cognac κονιάκ (το)
and, too, also και cognac κονιάκ (το)


Is there some way I can display the 'escape characters'

The characters of interest are generally those below decimal 32 (control characters) and above decimal 127 (non-ASCII, in English text), but you may have to adjust the limits for your character set. This works for me.

# -*- coding: utf-8 -*-

test_file = [ "about, approximately περίπου cinema σινεμά (το)", \
              "after μετά cinema σινεμά/κινηματογράφος", \
              "afternoon απόγευμα (το) city πόλη (η)" ]

escape_min = 32
escape_max = 127
for rec in test_file:
   printed = 0
   for num, character in enumerate(rec):
      ord_chr = ord(character)
      if (ord_chr < escape_min) or (ord_chr > escape_max):
         if not printed:
            print "\n", rec
            printed = 1
         print "     ", num, ord_chr
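
Another way to look at the same thing is to print each line with the hidden characters written out as escapes, so tabs, carriage returns and non-ASCII letters become visible in place. This is only a minimal sketch: the name show_hidden and the sample line are made up for illustration, and it assumes the line is either decoded text or a raw Python 2 byte string (in which case every byte above 126 shows up as \xNN).

```python
# -*- coding: utf-8 -*-
from __future__ import print_function


def show_hidden(line):
    """Return the line with control and non-ASCII characters written as escapes."""
    out = []
    for ch in line:
        code = ord(ch)
        if code < 32 or code >= 127:
            # \xNN for codes that fit in one byte, \uNNNN for larger code points
            out.append("\\x%02x" % code if code < 256 else "\\u%04x" % code)
        else:
            out.append(ch)
    return "".join(out)


# Hypothetical line containing a tab and a CR/LF line ending
sample = u"after\tμετά\r\n"
print(show_hidden(sample))
```

A stray tab or carriage return in origDict.txt would show up as \x09 or \x0d in the output, which makes it easy to tell whether the layout problem comes from control characters or from the Greek letters themselves.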

hi
I am really struggling with this problem.
If I make the 2 columns in origDict.txt both English, the formatting works great. If I then replace a few ASCII characters with Greek, the formatting is lost for those entries:

1 ena 4 4
2 δεα a a
3 τρια1 1
4 τερρερα2 2
a a 3 3
1 1 4 4
2 2 a a
3 3 1 1
4 4 2 2
a a 3 3
The formatting used is
newString = '%-20s%s' % ( lineA , lineB)
Some help would be much appreciated

I can't see where the columns are in your example because this website reduces multiple spaces to single spaces. If you wrap your example in code tags (click the "Help with Code Tags" link for instructions) we can see how your example data really is appearing to you.

For example, when I type:
'This is lineA This is lineB' without the code tags you can't see the 20 spaces that I typed between 'lineA' and 'This'.

With the code tags, the same example looks like this:

'This is lineA       This is lineB'

Hi
Thanks for your response.
On further investigation, the 2nd column is not aligning to a fixed position on the page but is offset by the number of characters in the first column. The formatting code is:
newString = '%-30s%-30s%s' % ( lineA , lineB, new_line)

1           εωα            1           1                 
2           δεο            2           2                 
3           τρεα          3           3                 
4           τεσσερα    4           4                 
and         Και            a           a                 
but         μα              1           1                 
live        μενώ          2           2                 
look        κιτάζω      3           3                 
have        εχω            4           4                 
111         1                 a           a                 
254768      2                 1           1                 
3           ζφσγφγ      2           2                 
4343        4                 3           3                 
a           a                 4           4

If I understand correctly, the above shows 14 examples of output records -- that is, 14 examples of the contents of newString after the newString = '%-30s%-30s%s' % ( lineA , lineB, new_line) runs.

What I'd like to do is run that command myself to reproduce your output and try to modify the command to avoid the irregular-looking output.

Could you show us one or two examples of the initial contents of the lineA, lineB, and new_line variables? I'm a Python beginner as well so I don't know all the answers, but I think if we knew some sample values for the input variables we would have something to test.

Hi
Thanks for your response.
My code follows. Please note that in the commented header section, the list of English to Greek translations is in perfectly fixed-width fields, so both columns are left justified. A copy of the original dictionary is attached.
Thanks for your interest.

#!/usr/local/bin/python
# -*- coding: iso-8859-15 -*-
import os, sys
#DictManipulation.py written 26/09/09
#See flow_dia_for_new_dict.ods, this directory
#
# A text file that is an English to Greek  dictionary and has 2 columns. 
# As there are many entries it takes up 25 or so pages. The object is to 
# produce a file that has 2 x 2 columns and so producing half the number 
# of pages. The 2nd column generated should be the next 72 entries following
# the first column so the alphabetical order is maintained.
# 
# A sample of the original file is shown below
#
#August                    Αύγουστος (ο)                        
#aunt                      θεία  η                              
#autumn                    φθινόπωρο (το)                       
#bad, ugly (adj.)          άσχημος, -η, -ο                      
#bad, wicked, evil (adj.)  κακός, -ή, -ό                        
#bank                      τράπεζα (η)                          
#basket                    καλάθι (το)     

lines_per_page = 73       
current_page  = 0       
line_pointer_a = 1        
line_pointer_b = line_pointer_a + lines_per_page 

write_file = open ("newDict.txt", "w")
read_file = open ("origDict.txt", "r")
lines_in_file = read_file.readlines ()
read_file.close()
lengh_file = len (lines_in_file)
new_line = '\n'


while line_pointer_a != (lines_per_page * (current_page + 1 ) + 1):
    lineA = lines_in_file [line_pointer_a].strip() 
    lineB = lines_in_file [line_pointer_b].strip()
    newString = '%-56s%-60s%s' % ( lineA , lineB, new_line)
    write_file.write (newString)
    print newString
    line_pointer_a += 1
    line_pointer_b += 1
    if line_pointer_b >= lengh_file:
        write_file.close()
        print 'Job Done'
        break
    if line_pointer_a == (lines_per_page * (current_page + 1 ) + 1):
        current_page += 2
        line_pointer_a = (lines_per_page * (current_page ) + 1)              
        line_pointer_b = line_pointer_a + lines_per_page

Thanks Chico2009. I'm interested because here in Canada we have two official languages so it's good to learn how this encoding thing works. I just tried testing something in Python's Idle and it complained right away about "unsupported characters".
I have to go to lunch now but I'll look at your attachment later this afternoon if possible.

No solution yet. Now it's driving me crazy.:)
What I think is happening is that Python reads your input file and writes your output file as raw bytes, so the non-ASCII characters pass through unchanged (the # -*- coding: utf-8 -*- statement only tells Python how to interpret the source file itself). However, when I examine the contents of lineA using Idle, it shows me content like the following:

>>> lineA
'2                             \xce\xb4\xcf\x8d\xce\xbf (also \xce\xb4\xcf\x85\xce\xbf)'
>>> 
>>> len(lineA)
50
>>> lineB
'but                           \xce\xbc\xce\xb1'
>>> len(lineB)
34
>>> #...So when we adjust the widths of lineA and lineB, it is the above UTF-8 byte sequences we are manipulating
>>> combined_AB = "%-56s%-60s" % (lineA,lineB)
>>> combined_AB
'2                             \xce\xb4\xcf\x8d\xce\xbf (also \xce\xb4\xcf\x85\xce\xbf)      but                           \xce\xbc\xce\xb1                          '
>>> len(combined_AB)
116
>>> #You see the problem? Python pads by byte count, and each Greek letter takes two bytes in UTF-8, so the padded field comes out too narrow once the bytes are displayed as letters.
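
The mismatch between byte count and character count can be reproduced without the dictionary file. The following is a minimal sketch, assuming the file is UTF-8 encoded (which the \xce and \xcf bytes above suggest); the variable names are made up for illustration, and the sample has the same shape as lineB.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

# Same shape as lineB above: 'but', 27 spaces, then two Greek letters
text = u"but" + u" " * 27 + u"μα"
raw = text.encode("utf-8")   # what a Python 2 str read from the file holds

print(len(text))   # 32 characters
print(len(raw))    # 34 bytes, because each Greek letter takes two bytes

padded_raw = raw.ljust(60)     # pads by byte count, like '%-60s' on a byte string
padded_text = text.ljust(60)   # pads by character count

print(len(padded_raw.decode("utf-8")))   # 58 columns on screen, two short
print(len(padded_text))                  # 60 columns on screen
```

The field padded as bytes ends up short by exactly one column per Greek letter, which matches the drifting second column in the earlier output.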

Hi d5e5,
Thanks for your post, it looks like a bug?
Or am I using the left justify function incorrectly?
Thanks for your efforts

Now that we can recreate the problem I'm confident that we're close to finding a way around it. I just need to learn a little more about how to format a unicode string. Maybe something to do with encoding or decoding it to set the length we want when formatting.

I'll try to look at it some more this afternoon, unless someone else finds the answer before then.

You would have to step through the unicode and count the number of actual letters, and then add enough spaces to the string to make it the length you want. This appears to work, but you'll have to test for yourself.

def print_rec(rec):
   print rec
   substrs = rec.split()
   greek_str = " ".join(substrs[1:])
   num_chr = 0
   for chr in greek_str:
      print ord(chr),
      if ord(chr) not in [206, 207]:
         num_chr += 1

   print "\n", num_chr


fp = open("./origDict.txt", "r")
## skip heading
fp.readline()

print_rec(fp.readline())
print_rec(fp.readline())
print_rec(fp.readline())
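
A slightly more general variation on the same counting idea, offered as a sketch rather than a tested fix for the real file: instead of skipping the two lead bytes 206 and 207, skip UTF-8 continuation bytes (those of the form 10xxxxxx), which also covers letters outside the basic Greek block. The names utf8_char_count and pad_utf8 are made up for illustration, and the input is assumed to be valid UTF-8. Combining marks and double-width characters are not handled.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function


def utf8_char_count(raw):
    """Count characters in UTF-8 encoded bytes by skipping continuation bytes."""
    count = 0
    for byte in bytearray(raw):      # bytearray yields ints on Python 2 and 3
        if (byte & 0xC0) != 0x80:    # continuation bytes look like 10xxxxxx
            count += 1
    return count


def pad_utf8(raw, width):
    """Right-pad UTF-8 bytes with spaces until they hold `width` characters."""
    return raw + b" " * max(0, width - utf8_char_count(raw))


raw = u"δύο (also δυο)".encode("utf-8")
print(len(raw))                # 20 bytes
print(utf8_char_count(raw))    # 14 characters
print(len(pad_utf8(raw, 30).decode("utf-8")))   # 30 characters after padding
```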

Please try the following as well. It works on my computer now.

#!/usr/local/bin/python
# -*- coding: utf-8 -*-
import os, sys
#DictManipulation.py written 26/09/09
#See flow_dia_for_new_dict.ods, this directory
#
# A text file that is an English to Greek  dictionary and has 2 columns. 
# As there are many entries it takes up 25 or so pages. The object is to 
# produce a file that has 2 x 2 columns and so producing half the number 
# of pages. The 2nd column generated should be the next 72 entries following
# the first column so the alphabetical order is maintained.
# 
# A sample of the original file is shown below
#
#August                    Αύγουστος (ο)                        
#aunt                      θεία  η                              
#autumn                    φθινόπωρο (το)                       
#bad, ugly (adj.)          άσχημος, -η, -ο                      
#bad, wicked, evil (adj.)  κακός, -ή, -ό                        
#bank                      τράπεζα (η)                          
#basket                    καλάθι (το)     

def rpad (orig_string, length):
    """Adds spaces to the end of a string until it has the desired length"""
    #Decode the UTF-8 bytes to unicode first; calling encode on the raw byte string fails if your default encoding is ASCII
    ustring = unicode(orig_string,"utf-8","strict")
    #In cp737 each Greek letter is a single byte, so len() below counts 1 per letter instead of 2
    plain_string = ustring.encode("cp737", "replace")
    spaces_needed = length - len(plain_string)
    padded_string = orig_string + " " * spaces_needed
    return padded_string

lines_per_page = 73       
current_page  = 0       
line_pointer_a = 1        
line_pointer_b = line_pointer_a + lines_per_page 

write_file = open ("newDict.txt", "w")
read_file = open ("origDict.txt", "r")
lines_in_file = read_file.readlines ()
read_file.close()
lengh_file = len (lines_in_file)
new_line = '\n'


while line_pointer_a != (lines_per_page * (current_page + 1 ) + 1):
    lineA = lines_in_file [line_pointer_a].strip() 
    lineB = lines_in_file [line_pointer_b].strip()
    #####newString = '%-56s%-60s%s' % ( lineA , lineB, new_line)
    newString = rpad(lineA,56) + rpad(lineB,60) + new_line
    write_file.write (newString)
    print newString
    print "***Length of the above string is: %d***" % len(newString)
    line_pointer_a += 1
    line_pointer_b += 1
    if line_pointer_b >= lengh_file:
        write_file.close()
        print 'Job Done'
        break
    if line_pointer_a == (lines_per_page * (current_page + 1 ) + 1):
        current_page += 2
        line_pointer_a = (lines_per_page * (current_page ) + 1)              
        line_pointer_b = line_pointer_a + lines_per_page

Hi Guys
Thank you both for your efforts.
I have implemented your mod and it works great.
I have been trying to use the u prefix to tell the interpreter that a unicode string requires formatting, with no success.
There must be a way of achieving the desired effect using the format operators!?

Thanks again for your help
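
On the question of using the format operators directly: they do pad by character count once the lines are decoded text rather than raw bytes. The sketch below assumes the file is UTF-8 and uses io.open to decode while reading; format_pair and the small temporary sample file are made up for illustration and stand in for origDict.txt. It has not been run against the real dictionary.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function
import io
import os
import tempfile


def format_pair(line_a, line_b, width_a=56, width_b=60):
    """Left-justify two already-decoded lines into fixed-width columns."""
    return u"%-*s%-*s" % (width_a, line_a.strip(), width_b, line_b.strip())


# Write a tiny stand-in for origDict.txt so the sketch is self-contained.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with io.open(path, "w", encoding="utf-8") as handle:
    handle.write(u"2     δύο (also δυο)\nbut   μα\n")

# io.open decodes while reading, so each line is text, not raw bytes.
with io.open(path, "r", encoding="utf-8") as handle:
    lines = handle.readlines()

row = format_pair(lines[0], lines[1])
print(repr(row))
print(len(row))   # 116 characters: 56 for the first column, 60 for the second
```

When writing the result, the output file would need to be opened the same way (io.open with mode "w" and encoding="utf-8") so the text is encoded back to bytes on the way out.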
