How to separate two languages (English, Hindi) in Python

This is my sample data:

1 . wikiner2013inflected 1-1 1.000 Sharaabi शराबी

2 . ted 1-1 1.0 politicians do not have permission to do what needs to be done. राजनीतिज्ञों के पास जो कार्य करना चाहिए, वह करने कि अनुमति नहीं है .

3 . ted 1-1 1.0 I'd like to tell you about one such child, मई आपको ऐसे ही एक बच्चे के बारे में बताना चाहूंगी,

What I need is:

1 . [[Sharaabi],[शराबी]]

2 . [[politicians do not have permission to do what needs to be done.],[राजनीतिज्ञों के पास जो कार्य करना चाहिए, वह करने कि अनुमति नहीं है .]]

3 . [[I'd like to tell you about one such child],[मई आपको ऐसे ही एक बच्चे के बारे में बताना चाहूंगी]]

So far I have been able to split each line into words and remove the first three values from each list:

[

['Sharaabi', '\xe0\xa4\xb6\xe0\xa4\xb0\xe0\xa4\xbe\xe0\xa4\xac\xe0\xa5\x80'],

['politicians', 'do', 'not', 'have', 'permission', 'to', 'do', 'what', 'needs', 'to', 'be', 'done.', '\xe0\xa4\xb0\xe0\xa4\xbe\xe0\xa4\x9c\xe0\xa4\xa8\xe0\xa5\x80\xe0\xa4\xa4\xe0\xa4\xbf\xe0\xa4\x9c\xe0\xa5\x8d\xe0\xa4\x9e\xe0\xa5\x8b\xe0\xa4\x82', '\xe0\xa4\x95\xe0\xa5\x87', '\xe0\xa4\xaa\xe0\xa4\xbe\xe0\xa4\xb8', '\xe0\xa4\x9c\xe0\xa5\x8b', '\xe0\xa4\x95\xe0\xa4\xbe\xe0\xa4\xb0\xe0\xa5\x8d\xe0\xa4\xaf', '\xe0\xa4\x95\xe0\xa4\xb0\xe0\xa4\xa8\xe0\xa4\xbe', '\xe0\xa4\x9a\xe0\xa4\xbe\xe0\xa4\xb9\xe0\xa4\xbf\xe0\xa4\x8f,', '\xe0\xa4\xb5\xe0\xa4\xb9', '\xe0\xa4\x95\xe0\xa4\xb0\xe0\xa4\xa8\xe0\xa5\x87', '\xe0\xa4\x95\xe0\xa4\xbf', '\xe0\xa4\x85\xe0\xa4\xa8\xe0\xa5\x81\xe0\xa4\xae\xe0\xa4\xa4\xe0\xa4\xbf', '\xe0\xa4\xa8\xe0\xa4\xb9\xe0\xa5\x80\xe0\xa4\x82', '\xe0\xa4\xb9\xe0\xa5\x88', '.'],

["I'd", 'like', 'to', 'tell', 'you', 'about', 'one', 'such', 'child,', '\xe0\xa4\xae\xe0\xa4\x88', '\xe0\xa4\x86\xe0\xa4\xaa\xe0\xa4\x95\xe0\xa5\x8b', '\xe0\xa4\x90\xe0\xa4\xb8\xe0\xa5\x87', '\xe0\xa4\xb9\xe0\xa5\x80', '\xe0\xa4\x8f\xe0\xa4\x95', '\xe0\xa4\xac\xe0\xa4\x9a\xe0\xa5\x8d\xe0\xa4\x9a\xe0\xa5\x87', '\xe0\xa4\x95\xe0\xa5\x87', '\xe0\xa4\xac\xe0\xa4\xbe\xe0\xa4\xb0\xe0\xa5\x87', '\xe0\xa4\xae\xe0\xa5\x87\xe0\xa4\x82', '\xe0\xa4\xac\xe0\xa4\xa4\xe0\xa4\xbe\xe0\xa4\xa8\xe0\xa4\xbe', '\xe0\xa4\x9a\xe0\xa4\xbe\xe0\xa4\xb9\xe0\xa5\x82\xe0\xa4\x82\xe0\xa4\x97\xe0\xa5\x80,']

]

Any hint on how to separate the English and Hindi words into two separate lists?

Edited 1 Year Ago by jamesjohnson25

Split each line on spaces into words, then check each word against an English dictionary.
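The suggestion above could be sketched roughly as follows, in Python 3. This is only an illustration: the word set ENGLISH_WORDS and the helper split_by_dictionary are hypothetical names that do not appear in the thread, and a real word list would be loaded from a file rather than typed in. Holding the words in a set keeps each lookup close to constant time.

```python
# -*- coding: utf-8 -*-
import string

# Hypothetical, tiny word list for illustration only;
# a real one would be loaded from a word-list file.
ENGLISH_WORDS = {"i'd", "like", "to", "tell", "you",
                 "about", "one", "such", "child"}

def split_by_dictionary(words, english_words=ENGLISH_WORDS):
    """Split tokens into (english, other) strings using set membership."""
    english, other = [], []
    for word in words:
        # Ignore surrounding punctuation and case when looking the word up
        key = word.strip(string.punctuation).lower()
        if key in english_words:
            english.append(word)
        else:
            other.append(word)
    return ' '.join(english), ' '.join(other)
```

Note the limits of this approach: anything missing from the word list, such as the transliterated name "Sharaabi" or a stand-alone ".", ends up on the non-English side.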

The data is 286,000 lines. If I run each word against an English dictionary, won't it take a lot of time?

No, it will be very fast.
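A per-word check can be cheaper still when the two languages use different scripts. The sketch below is a swapped-in technique, not the dictionary lookup suggested earlier: it assumes Python 3 Unicode strings (in Python 2 the byte strings would have to be decoded from UTF-8 first) and that the Hindi side is written in Devanagari, Unicode block U+0900 to U+097F. The helper names is_devanagari and split_by_script are made up for this example.

```python
# -*- coding: utf-8 -*-

def is_devanagari(word):
    """Return True if any character lies in the Devanagari block U+0900-U+097F."""
    return any('\u0900' <= ch <= '\u097f' for ch in word)

def split_by_script(words):
    """Cut the token list at the first Devanagari word: (english, hindi)."""
    for i, word in enumerate(words):
        if is_devanagari(word):
            return ' '.join(words[:i]), ' '.join(words[i:])
    # No Devanagari word found: treat the whole line as English
    return ' '.join(words), ''
```

Cutting at the first Devanagari word, rather than classifying every token, keeps a trailing "." or a Latin word such as "Chrome" inside the Hindi half. It relies on the English text always coming first, as it does in the sample data.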

I suppose your data is stored in a file. Can you send such a file with, say, 10 lines of data?

Edited 1 Year Ago by Gribouillis

I have sent the file

Edited 1 Year Ago by jamesjohnson25: Attached file

Attachments
wikiner2013inflected	1-1	1.000	Sharaabi	
ted	1-1	1.0	politicians do not have permission to do what needs to be done.	      ,       .
ted	1-1	1.0	I'd like to tell you about one such child,	          ,
indic2012	1-1	manual	This percentage is even greater than the percentage in India.	        
quote-name	1-1	1.0	- John Collins	-  
ted	1-1	1.0	what we really mean is that they're bad at not paying attention.	          
launchpad	1-1	implied	%{APPNAME} would like to send notifications, but you need to be signed in to Chrome.	%{APPNAME}    ,   Chrome     .
launchpad	1-1	implied	Important Messages	 
launchpad	1-1	implied	User authentication required for VPN connection '%s'...	  VPN  '%s'    ...
launchpad	1-1	implied	Surface width	 
launchpad	1-1	implied	Reinstall	  
agro-hunaligned	1-1	0.87	2. Infection caused by germs.	2.   
wikiner2013	1-1	implied	Suhasi Goradia Dhami	 
indic2012	1-1	manual	.The ending portion of these Vedas is called Upanishad.	       
wikiner2013infldected	1-1	0.065	Yuga

It seems very easy because the file is TAB-separated. Here is my code in Python 2:

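#!/usr/bin/env python
# -*-coding: utf8-*-
'''Print the English and Hindi columns of a TAB-separated parallel corpus.
'''
from __future__ import (absolute_import, division,
                        print_function, unicode_literals)
import codecs

def process_file(filename):
    # Decode as UTF-8 so that the Hindi text is read as unicode
    with codecs.open(filename, encoding='utf8') as ifh:
        for line in ifh:
            # Drop the line ending, then split on TABs
            row = line.rstrip('\r\n').split('\t')
            if len(row) < 2:
                continue  # skip blank or malformed lines
            # The last two columns hold the English and Hindi text
            english, hindi = row[-2:]
            print('English:', english)
            print('Hindi:', hindi)

if __name__ == '__main__':
    process_file('hindmonocorp05.txt')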

And the result

Edited 1 Year Ago by Gribouillis

Attachments term1.png 70.75 KB
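For readers on Python 3, the same TAB-splitting idea can be adapted to return the nested [[english], [hindi]] lists asked for in the question. This is a sketch under assumptions, not the code posted above: the helper name read_pairs is made up, and it assumes every usable row has the five columns seen in the attached sample (source, alignment, score, English, Hindi).

```python
# -*- coding: utf-8 -*-

def read_pairs(lines):
    """Return a list of [[english], [hindi]] pairs from TAB-separated lines."""
    pairs = []
    for line in lines:
        row = line.rstrip('\r\n').split('\t')
        if len(row) < 5:
            continue  # assumption: rows with a missing column are skipped
        english, hindi = row[-2:]
        pairs.append([[english.strip()], [hindi.strip()]])
    return pairs

# Usage with a real file (the file name is the one used in the thread):
# with open('hindmonocorp05.txt', encoding='utf-8') as fh:
#     pairs = read_pairs(fh)
```

Any iterable of lines works, so the function can be tried on a short list of strings before running it over the full 286,000-line file.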