Hi there. Day 2 of programming Python. In this thread I posted my first attempt:
http://www.daniweb.com/forums/post1231604.html#post1231604
Growing from there, this one wades into slightly deeper water.

I have three .txt files:

nvutf8.txt: new vocab items are stored here
esutf8.txt: example sentences are stored here
exoututf8.txt: example sentences from esutf8.txt that contain vocab from nvutf8.txt are supposed to end up here

I have written the following code:

#step1: find example sentences in esutf8.txt which contain new voc items from nvutf8.txt
#step2: among those sentences find those which contain as few as possible new words from kvutf8.txt (known vocab).

import codecs

enout = codecs.open('Python/ExListBuild/exoututf8.txt', encoding = 'utf-8', mode = 'w')

nvin = codecs.open('Python/ExListBuild/nvutf8.txt', encoding = 'utf-8', mode = 'r')

for line in open('Python/ExListBuild/nvutf8.txt'):
	newvocab = nvin.readline()
	print "-"
	print "next vocab item being checked"
	print "-"
	esin = codecs.open('Python/ExListBuild/esutf8.txt', encoding = 'utf-8', mode = 'r')
	for line in open('Python/ExListBuild/esutf8.txt'):
		sentence = esin.readline()
		index = sentence.find(newvocab)
		if index==-1:
			print "nope"
		else:
			print "yes"
			enout.write(sentence)
	esin.close()
nvin.close()

There are some hard-to-understand irregularities going on.

I use the following example sentences in esutf8.txt:
我前边要拐弯了,请注意。
车来了快跑。
请排好队上车。
带好自己的东西。
方向错了!
我给你讲一个成语故事。
感谢你对我们的关心。

For new vocab I use in nvutf8.txt:

要
我

And I get the following back in exoututf8.txt:
我前边要拐弯了,请注意。
我给你讲一个成语故事。
感谢你对我们的关心。

So it worked fine for 我, but it did not work for 要 (which is in the first sentence).
EDIT: Apparently it only ever works for the last word from nvutf8.txt (also for vocab of two or more characters, like 自己).

I have a version (with analogous code) running without the UTF-8 stuff which works fine for Roman letters.
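
For anyone hitting the same wall: the pattern of only the last item from nvutf8.txt ever matching is consistent with readline() keeping the trailing newline, so every search string except the one on the file's last line carries an invisible '\n'. A minimal sketch of the effect, using one of the sentences above:

# -*- coding: utf-8 -*-
# Sketch: why a vocab item with a trailing newline never matches.
sentence = u'我前边要拐弯了,请注意。'

vocab_raw = u'要\n'               # what readline() returns for a non-final line
vocab_clean = vocab_raw.rstrip()  # trailing newline/whitespace removed

print repr(vocab_raw)             # u'\u8981\n' -- the newline is part of the string
print sentence.find(vocab_raw)    # -1: the sentence never contains '要' directly followed by '\n'
print sentence.find(vocab_clean)  # 3: found once the newline is stripped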


edit: Please delete - reason: code was complete garbage.

With a couple of small modifications your code does find some sentences. I'm not certain it's doing exactly what you want, but see for yourself. The changes: convert the line and the search string into unicode before doing the find, and trim the whitespace from the search string first.

#!/usr/bin/env python
#step1: find example sentences in esutf8.txt which contain new voc items from nvutf8.txt
#step2: among those sentences find those which contain as few as possible new words from kvutf8.txt (known vocab).

def to_unicode_or_bust(obj, encoding='utf-8'):
    # Decode byte strings to unicode; unicode (and non-strings) pass through.
    if isinstance(obj, basestring):
        if not isinstance(obj, unicode):
            obj = unicode(obj, encoding)
    return obj

import codecs
MyDir = '/home/david/Programming/Python'
enout = codecs.open(MyDir + '/' + 'exoututf8.txt', encoding = 'utf-8', mode = 'w')

nvin = codecs.open(MyDir + '/' + 'nvutf8.txt', encoding = 'utf-8', mode = 'r')

for line in open(MyDir + '/' + 'nvutf8.txt'):
    newvocab = nvin.readline()
    newvocab_uni = to_unicode_or_bust(newvocab)
    newvocab_uni = newvocab_uni.rstrip()
    print "-"
    print newvocab_uni.encode('utf-8') + " is the next vocab item being checked"
    print "-"
    esin = codecs.open(MyDir + '/' + 'esutf8.txt', encoding = 'utf-8', mode = 'r')
    for line in open(MyDir + '/' + 'esutf8.txt'):
        sentence = esin.readline()
        sentence_uni = to_unicode_or_bust(sentence)
        index = sentence_uni.find(newvocab_uni)
        if index==-1:
            print "nope"
        else:
            print "yes"
            enout.write(sentence)
    esin.close()
nvin.close()

After running the above, exoututf8.txt now contains (the first sentence appears twice because it matches more than one vocab item):

我前边要拐弯了,请注意。
我前边要拐弯了,请注意。
我给你讲一个成语故事。
感谢你对我们的关心。

I copied the to_unicode_or_bust function from a presentation about Unicode.
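
In case it helps, here is what the function does with different inputs (with to_unicode_or_bust from the code above in scope):

# to_unicode_or_bust decodes byte strings and passes everything else through.
print repr(to_unicode_or_bust('\xe6\x88\x91'))  # u'\u6211' -- UTF-8 bytes decoded (the bytes of 我)
print repr(to_unicode_or_bust(u'\u6211'))       # u'\u6211' -- already unicode, returned as-is
print repr(to_unicode_or_bust(42))              # 42 -- non-strings are returned untouched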

Thanks for the input. Will check it more thoroughly tomorrow.

Ok, I checked your program and made only some minor changes (the way the for-loops are called, and more intuitive variable names), and now it ALMOST works; the only thing that still seems to be messed up is the first entry from the nvutf8.txt file. Here is the new code:

#!/usr/bin/env python
#step1: find example sentences in esutf8.txt which contain new voc items from nvutf8.txt
#step2: among those sentences find those which contain as few as possible new words from kvutf8.txt (known vocab).

def to_unicode_or_bust(obj, encoding='utf-8'):
    # Decode byte strings to unicode; unicode (and non-strings) pass through.
    if isinstance(obj, basestring):
        if not isinstance(obj, unicode):
            obj = unicode(obj, encoding)
    return obj

import codecs
MyDir = '/Programme/notepad++/Python/ExListBuild2'
exout_file = codecs.open(MyDir + '/' + 'exoututf8.txt', encoding = 'utf-8', mode = 'w')

newvocab_file = codecs.open(MyDir + '/' + 'nvutf8.txt', encoding = 'utf-8', mode = 'r')

exsentences_file = codecs.open(MyDir + '/' + 'esutf8.txt', encoding = 'utf-8', mode = 'r')

for line_nv in newvocab_file.readlines():
    newvocab_uni = to_unicode_or_bust(line_nv)
    newvocab_uni = newvocab_uni.rstrip()
    print "-"
    print newvocab_uni.encode('utf-8') + " is the next vocab item being checked"
    print "-"
    
    for line_es in exsentences_file.readlines():
        sentence_uni = to_unicode_or_bust(line_es)
        index = sentence_uni.find(newvocab_uni)
        if index==-1:
            print "nope"
        else:
            print "yes"
            exout_file.write(line_es)
    exsentences_file.seek(0)  # rewind so readlines() sees the sentences again for the next vocab item
newvocab_file.close()
exsentences_file.close()
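
As an aside, the seek(0) dance can be avoided by reading each file once into a list and looping over the lists. A sketch with the same paths as above; note that codecs.open already hands back unicode lines, so the conversion helper isn't needed here:

# Sketch: read each file once instead of rewinding with seek(0).
import codecs

MyDir = '/Programme/notepad++/Python/ExListBuild2'

newvocab_file = codecs.open(MyDir + '/nvutf8.txt', encoding='utf-8', mode='r')
vocab_items = [line.rstrip() for line in newvocab_file if line.strip()]
newvocab_file.close()

exsentences_file = codecs.open(MyDir + '/esutf8.txt', encoding='utf-8', mode='r')
sentences = exsentences_file.readlines()
exsentences_file.close()

exout_file = codecs.open(MyDir + '/exoututf8.txt', encoding='utf-8', mode='w')
for item in vocab_items:
    for sentence in sentences:
        if item in sentence:        # same test as sentence.find(item) != -1
            exout_file.write(sentence)
exout_file.close()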

I use these example sentences:

医生说的连体人, 就是李方刚生的这个孩子。
医生放下电话。
王医生给的药太难吃。
中国有了英文热。
可是,去美国我们没有钱,再说, 我和向右是连在一只

and as new vocab


The command-line output then reads:

´╗┐ÕŬis the next vocab item being checked
-
nope
nope
nope
nope
nope
-
ÕŬis the next vocab item being checked
-
nope
nope
nope
nope
yes

Notice the difference between ´╗┐ÕŬ and ÕŬ. Any ideas on how to fix that?

Perhaps your file has a BOM (byte order mark) at the beginning. You could try opening it with encoding 'utf-8-sig' instead of 'utf-8'.
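
A quick way to see the difference (the byte string below is just the UTF-8 bytes of 我 plus a newline, with a BOM in front):

# -*- coding: utf-8 -*-
# Sketch: utf-8-sig strips the BOM, plain utf-8 keeps it as U+FEFF.
import codecs

data = codecs.BOM_UTF8 + '\xe6\x88\x91\n'   # the bytes of '我\n' preceded by a BOM

print repr(data.decode('utf-8'))      # u'\ufeff\u6211\n' -- BOM survives as U+FEFF
print repr(data.decode('utf-8-sig'))  # u'\u6211\n'       -- BOM stripped

The same applies when reading through codecs.open: with encoding='utf-8-sig' the decoded first line no longer starts with U+FEFF, so the first vocab item compares equal again.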


Thanks a bunch. That solved it.
