Hi all, I have a requirement to remove all non-letter characters (numbers, punctuation, symbols, non-printing characters, etc.).
string.punctuation does a good job, but it does NOT remove any non-English punctuation (like '。', which is the Chinese full stop).
So I came across this code:

    import unicodedata

    def onlyWord(text):
        Word = set(['Lm', 'Lo', 'Lu', 'Ll', 'Lt'])
        return ''.join(x for x in text if unicodedata.category(x) in Word)

    print(onlyWord('µ'))

Great! It does what I wanted, and I have now realized that ᐒ is a letter (Unicode category Lo).
The problem is that, as a challenge, I CANNOT use unicodedata, collections, re, or a number of other libraries.
So I want to know how to print out the list of code points that are defined as 'Lu' (as an example; I can extend it to the other categories listed above) so that I can do this:
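To make the category check concrete, here is a small sketch that prints the Unicode general category of a few sample characters and then filters a string with the same five letter categories. The names `LETTER_CATEGORIES` and `only_letters` are hypothetical, chosen only for this illustration, and the sample characters are arbitrary.

```python
import unicodedata

# The five "Letter" general categories used by onlyWord above.
LETTER_CATEGORIES = {'Lm', 'Lo', 'Lu', 'Ll', 'Lt'}

def only_letters(text):
    """Keep only characters whose general category is a letter category."""
    return ''.join(ch for ch in text
                   if unicodedata.category(ch) in LETTER_CATEGORIES)

# Show the code point and category of a few sample characters.
for ch in ['µ', 'ᐒ', '中', '。', '1', '-']:
    print(repr(ch), ord(ch), unicodedata.category(ch))

print(only_letters('中文。'))
```

The loop shows why '。' is dropped while 'µ' and 'ᐒ' are kept: only the latter fall into one of the letter categories.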

    def processDisallowedChars(word):
        '''
        This function removes all the non-alphabetical characters within strings,
        with hyphens and contractions / apostrophes in mind.
        Examples of allowed chars:
        one-north
        tom's
        bill gates
        don't
        中文
        '''
        # Initialize a set of acceptable characters.
        setalphabet = set("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ")
        setdisallowed = set(word).difference(setalphabet)
        setexceptions = set(
            tuple(i for i in range(181, 195102)
                  if not (182 <= i <= 191) and (i != 215) and (i != 247)
                  and not (706 <= i <= 709) and not (722 <= i <= 735)
                  and not (741 <= i <= 881) and (i != 12290)))
        for x in set(setdisallowed):
            if ord(x) not in setexceptions:
                # Further check to ensure the word does not start or end with illegal characters.
                while word.endswith(x) or word.startswith(x):
                    word = (word[0:1].replace(x, "")
                            + word[1:len(word) - 1]
                            + word[len(word) - 1:].replace(x, ""))
                if x == '\'':
                    # Removes a disallowed "'" character when it is not followed by "t" or "s"
                    # (example: don'b instead of don't / Tom'b instead of Tom's).
                    # This is to allow contractions and apostrophes.
                    targetchar = max([word.rfind("'t"), word.rfind("'s")])
                    if targetchar > 0:
                        # Strip any disallowed use of "'" except 't and 's.
                        word = word[:targetchar].replace(x, "") + word[targetchar:]
                    else:
                        word = word.replace(x, "")
                elif x == '-':
                    # Removes disallowed consecutive "-" characters. Hyphens are meant to be used once.
                    # Allowed example: Twentieth-century (20th century).
                    # Disallowed example: Twentieth--century (will be replaced with Twentieth-century)
                    # or Mid-----Air (Mid-Air).
                    # 1st line: converts the string into a new list containing each character of the string,
                    # and returns a new string built from the characters that satisfy the criteria in the 2nd line.
                    # 2nd line: the new list accepts any char, except consecutive "-", which are treated as a single "-".
                    word = "".join([word[n] for n in range(len(word))
                                    if (word[n] != '-') or (word[n] != word[n-1])])
                else:
                    # Remove all other punctuation, numbers and unprintable characters,
                    # except hyphens and contractions (apostrophes).
                    word = word.replace(x, "")
        return word  # Returns a string with only legal characters.

    print(processDisallowedChars('中文。'))

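The hyphen-collapsing step from the function above can also be looked at in isolation. The following is only a sketch: `collapse_hyphens` is a hypothetical helper name, and the explicit `n == 0` guard is an addition that stops `word[n - 1]` from wrapping around to the last character when the word starts with a hyphen.

```python
def collapse_hyphens(word):
    """Treat any run of consecutive '-' characters as a single '-'."""
    # Keep a character unless it is a '-' directly preceded by another '-'.
    # The n == 0 guard avoids comparing the first character with word[-1].
    return "".join(
        ch for n, ch in enumerate(word)
        if ch != '-' or n == 0 or word[n - 1] != '-'
    )

print(collapse_hyphens("Twentieth--century"))
print(collapse_hyphens("Mid-----Air"))
```

In the full function, leading and trailing hyphens are already stripped before this step runs, so the guard only matters when the helper is used on its own.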
What I actually want is to generate 'setexceptions' as a tuple of code point numbers, so that it accepts only letters and no other characters.
Reference I have used:
Now I am stuck on this variable:

    setexceptions = set(
        tuple(i for i in range(181, 195102)
              if not (182 <= i <= 191) and (i != 215) and (i != 247)
              and not (706 <= i <= 709) and not (722 <= i <= 735)
              and not (741 <= i <= 881) and (i != 12290)))

Going through the WikiBook page and comparing it with fileformat by hand is really inefficient.
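For illustration only, here is one way such a list could be generated without importing anything. It assumes the challenge still permits built-in string methods, which the post does not state. Python's documentation defines `str.isalpha()` in terms of the same five general categories ('Lm', 'Lt', 'Lu', 'Ll', 'Lo'), so it can stand in for the unicodedata check. `letter_ranges` is a hypothetical helper name, and the exact result depends on the Unicode database bundled with the Python version in use.

```python
def letter_ranges(start=0, stop=0x110000):
    """Return inclusive (first, last) code point ranges whose characters
    satisfy str.isalpha(), scanning range(start, stop)."""
    ranges = []
    run_start = None  # first code point of the current run of letters
    for i in range(start, stop):
        if chr(i).isalpha():
            if run_start is None:
                run_start = i
        elif run_start is not None:
            ranges.append((run_start, i - 1))
            run_start = None
    if run_start is not None:
        # The scan ended while still inside a run of letters.
        ranges.append((run_start, stop - 1))
    return ranges

# Print the first few ranges instead of the whole list.
print(letter_ranges()[:10])

# Membership can also be tested directly, without building any table.
print(chr(12290).isalpha(), chr(181).isalpha())
```

Storing ranges rather than every single code point keeps the output short enough to paste into a script as a literal, which is roughly what the hand-written 'setexceptions' expression is trying to approximate.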
Thank you for the help.