Hi there. This is day 2 of programming Python for me. I posted my first attempt in this thread,
and building on that, I am now getting into slightly deeper water.
I have three .txt files:
nvutf8.txt: new vocab items are stored here.
esutf8.txt: example sentences are stored here.
exoututf8.txt: example sentences from esutf8.txt that contain vocab from nvutf8.txt are supposed to be stored here.
I have written the following code:
    #step1: find example sentences in esutf8.txt which contain new voc items from nvutf8.txt
    #step2: among those sentences find those which contain as few as possible new words from kvutf8.txt (known vocab).
    import codecs

    enout = codecs.open('Python/ExListBuild/exoututf8.txt', encoding = 'utf-8', mode = 'w')
    nvin = codecs.open('Python/ExListBuild/nvutf8.txt', encoding = 'utf-8', mode = 'r')
    for line in open('Python/ExListBuild/nvutf8.txt'):
        newvocab = nvin.readline()
        print "-"
        print "next vocab item being checked"
        print "-"
        esin = codecs.open('Python/ExListBuild/esutf8.txt', encoding = 'utf-8', mode = 'r')
        for line in open('Python/ExListBuild/esutf8.txt'):
            sentence = esin.readline()
            index = sentence.find(newvocab)
            if index==-1:
                print "nope"
            else:
                print "yes"
                enout.write(sentence)
        esin.close()
    nvin.close()
The output shows some irregularities that I find hard to understand.
I use the following example sentences in esutf8.txt:
For new vocab I use in nvutf8.txt:
And I get returned in exoututf8.txt:
So it worked fine for 我, but it did not work for 要 (which is in the first sentence).
EDIT: Apparently it always works ONLY for the last word from nvutf8.txt. (also for two or more char vocab like
I have a version (with analogous code) running without the UTF-8 handling, and it works fine for Roman letters.
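
One possible explanation for the "only the last word works" symptom, offered as a guess that has not been checked against the actual files: readline() keeps the trailing newline, so every vocab item except the last one in the file is searched for as "word plus newline", which never occurs in the middle of a sentence. The sketch below illustrates that idea with made-up sample data; find_examples and the sample sentences are hypothetical names and values chosen for illustration, not part of the original script. It does not explain why the version without UTF-8 handling behaves differently.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function


def find_examples(vocab_lines, sentence_lines):
    """Return every sentence that contains a vocab item, grouped by vocab item.

    Both arguments are iterables of lines as read from a file, so each
    line may still carry its trailing newline.
    """
    # strip() removes the trailing newline that readline() and file
    # iteration keep on every line.
    sentences = [s.strip() for s in sentence_lines]
    matches = []
    for raw in vocab_lines:
        word = raw.strip()
        if not word:
            # Skip blank lines: the empty string is found in every sentence.
            continue
        for sentence in sentences:
            if word in sentence:
                matches.append(sentence)
    return matches


# Hypothetical sample data standing in for nvutf8.txt and esutf8.txt.
# The last vocab line has no trailing newline, as in many text files.
vocab_lines = [u"要\n", u"我"]
sentence_lines = [u"他要喝水\n", u"我是学生\n", u"你好\n"]

# The suspected symptom: the raw line (word + newline) is not found,
# while the stripped word is.
raw_index = sentence_lines[0].find(vocab_lines[0])
stripped_index = sentence_lines[0].find(vocab_lines[0].strip())
print(raw_index, stripped_index)

result = find_examples(vocab_lines, sentence_lines)
for sentence in result:
    print(sentence)
```

To use this with the real files, the two lists could be replaced by file objects opened with io.open(path, encoding='utf-8'), which yield one line per iteration. Opening each file only once and iterating over that one handle would also avoid mixing `for line in open(...)` with a separate readline() on a second handle.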