How to separate two languages (English, Hindi) in Python

This is my sample data:

1 . wikiner2013inflected 1-1 1.000 Sharaabi शराबी

2 . ted 1-1 1.0 politicians do not have permission to do what needs to be done. राजनीतिज्ञों के पास जो कार्य करना चाहिए, वह करने कि अनुमति नहीं है .

3 . ted 1-1 1.0 I'd like to tell you about one such child, मई आपको ऐसे ही एक बच्चे के बारे में बताना चाहूंगी,

What I need is:

1 . [[Sharaabi],[शराबी]]

2 . [[politicians do not have permission to do what needs to be done.],[राजनीतिज्ञों के पास जो कार्य करना चाहिए, वह करने कि अनुमति नहीं है .]]

3 . [[I'd like to tell you about one such child],[मई आपको ऐसे ही एक बच्चे के बारे में बताना चाहूंगी]]

So far, what I have been able to do is remove the first three values from each list:

[

['Sharaabi', '\xe0\xa4\xb6\xe0\xa4\xb0\xe0\xa4\xbe\xe0\xa4\xac\xe0\xa5\x80'],

['politicians', 'do', 'not', 'have', 'permission', 'to', 'do', 'what', 'needs', 'to', 'be', 'done.', '\xe0\xa4\xb0\xe0\xa4\xbe\xe0\xa4\x9c\xe0\xa4\xa8\xe0\xa5\x80\xe0\xa4\xa4\xe0\xa4\xbf\xe0\xa4\x9c\xe0\xa5\x8d\xe0\xa4\x9e\xe0\xa5\x8b\xe0\xa4\x82', '\xe0\xa4\x95\xe0\xa5\x87', '\xe0\xa4\xaa\xe0\xa4\xbe\xe0\xa4\xb8', '\xe0\xa4\x9c\xe0\xa5\x8b', '\xe0\xa4\x95\xe0\xa4\xbe\xe0\xa4\xb0\xe0\xa5\x8d\xe0\xa4\xaf', '\xe0\xa4\x95\xe0\xa4\xb0\xe0\xa4\xa8\xe0\xa4\xbe', '\xe0\xa4\x9a\xe0\xa4\xbe\xe0\xa4\xb9\xe0\xa4\xbf\xe0\xa4\x8f,', '\xe0\xa4\xb5\xe0\xa4\xb9', '\xe0\xa4\x95\xe0\xa4\xb0\xe0\xa4\xa8\xe0\xa5\x87', '\xe0\xa4\x95\xe0\xa4\xbf', '\xe0\xa4\x85\xe0\xa4\xa8\xe0\xa5\x81\xe0\xa4\xae\xe0\xa4\xa4\xe0\xa4\xbf', '\xe0\xa4\xa8\xe0\xa4\xb9\xe0\xa5\x80\xe0\xa4\x82', '\xe0\xa4\xb9\xe0\xa5\x88', '.'],

["I'd", 'like', 'to', 'tell', 'you', 'about', 'one', 'such', 'child,', '\xe0\xa4\xae\xe0\xa4\x88', '\xe0\xa4\x86\xe0\xa4\xaa\xe0\xa4\x95\xe0\xa5\x8b', '\xe0\xa4\x90\xe0\xa4\xb8\xe0\xa5\x87', '\xe0\xa4\xb9\xe0\xa5\x80', '\xe0\xa4\x8f\xe0\xa4\x95', '\xe0\xa4\xac\xe0\xa4\x9a\xe0\xa5\x8d\xe0\xa4\x9a\xe0\xa5\x87', '\xe0\xa4\x95\xe0\xa5\x87', '\xe0\xa4\xac\xe0\xa4\xbe\xe0\xa4\xb0\xe0\xa5\x87', '\xe0\xa4\xae\xe0\xa5\x87\xe0\xa4\x82', '\xe0\xa4\xac\xe0\xa4\xa4\xe0\xa4\xbe\xe0\xa4\xa8\xe0\xa4\xbe', '\xe0\xa4\x9a\xe0\xa4\xbe\xe0\xa4\xb9\xe0\xa5\x82\xe0\xa4\x82\xe0\xa4\x97\xe0\xa5\x80,']

]

Any hints on how to separate the English and Hindi into two separate lists?


The "?" characters are actually Hindi characters that are not being displayed properly.

Split each line on spaces into words, then check each word against an English dictionary.

The data is 286000 lines. If I check each word against an English dictionary, won't that take a lot of time?

No, it will be very fast.
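
As a rough illustration of the word-by-word idea, here is a minimal Python 3 sketch. Note that it swaps the English-dictionary lookup for a check against the Devanagari Unicode block (U+0900 to U+097F), which is an assumption not made in the thread; the helper names `is_devanagari` and `split_by_script` are hypothetical. It also assumes the tokens are already decoded text (not the UTF-8 byte strings shown in the question) and that the three leading columns have been removed.

```python
def is_devanagari(token):
    """Return True if any character lies in the Devanagari block (U+0900-U+097F)."""
    return any('\u0900' <= ch <= '\u097f' for ch in token)


def split_by_script(tokens):
    """Split a token list at the first Devanagari token.

    Everything before that token is treated as English; everything from it
    onward (including trailing punctuation) is treated as Hindi.
    """
    for i, token in enumerate(tokens):
        if is_devanagari(token):
            return [' '.join(tokens[:i]), ' '.join(tokens[i:])]
    # No Devanagari token found: treat the whole line as English.
    return [' '.join(tokens), '']


# Tokens of the first sample line, after dropping the three leading columns.
tokens = ['Sharaabi', 'शराबी']
print(split_by_script(tokens))  # ['Sharaabi', 'शराबी']
```

Each token is inspected once, so the cost grows linearly with the size of the file. Splitting at the first Devanagari token rather than classifying every token keeps punctuation such as a trailing "." on the Hindi side.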

I suppose your data is stored in a file. Can you send such a file with, say, 10 lines of data?

I have sent the file.

It seems very easy because the file is a TAB-separated file. Here is my code in Python 2:

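#!/usr/bin/env python
# -*-coding: utf8-*-
'''Print the English and Hindi columns of a TAB-separated corpus file.
'''
from __future__ import (absolute_import, division,
                        print_function, unicode_literals)
import codecs

def process_file(filename):
    # Decode the file as UTF-8 so the Hindi text is read as unicode
    with codecs.open(filename, encoding='utf8') as ifh:
        for line in ifh:
            # Drop the line ending, then split the columns on TAB
            row = line.rstrip('\r\n').split('\t')
            # The last two columns hold the English and Hindi text
            english, hindi = row[-2:]
            print('English:', english)
            print('Hindi:', hindi)

if __name__ == '__main__':
    process_file('hindmonocorp05.txt')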
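
For anyone on Python 3, the same TAB-split idea could be sketched as follows, this time returning the `[english, hindi]` pairs asked for in the question instead of printing them. The function names `parse_line` and `read_pairs` are hypothetical, and the sample line is built on the assumption that the columns of the first data row are separated by TABs, with the English and Hindi text in the last two columns.

```python
def parse_line(line):
    """Return [english, hindi] taken from the last two TAB-separated columns."""
    row = line.rstrip('\r\n').split('\t')
    # Raises ValueError if the line has fewer than two columns.
    english, hindi = row[-2:]
    return [english.strip(), hindi.strip()]


def read_pairs(filename):
    """Read a UTF-8 file and return one [english, hindi] pair per non-blank line."""
    with open(filename, encoding='utf8') as ifh:
        return [parse_line(line) for line in ifh if line.strip()]


# First sample row from the question, with TABs assumed between columns.
sample = 'wikiner2013inflected\t1-1\t1.000\tSharaabi\tशराबी\n'
print(parse_line(sample))  # ['Sharaabi', 'शराबी']
```

Calling `read_pairs('hindmonocorp05.txt')` would then give a list of such pairs for the whole file, provided every non-blank line has at least two columns.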


Thanks a lot, Gribouillis. You made it very simple.
