How to separate two languages (English, Hindi) in Python

Question

jamesjohnson25 0 Newbie Poster

10 Years Ago

This is my sample data:

1 . wikiner2013inflected 1-1 1.000 Sharaabi शराबी

2 . ted 1-1 1.0 politicians do not have permission to do what needs to be done. राजनीतिज्ञों के पास जो कार्य करना चाहिए, वह करने कि अनुमति नहीं है .

3 . ted 1-1 1.0 I'd like to tell you about one such child, मई आपको ऐसे ही एक बच्चे के बारे में बताना चाहूंगी,

What I need is:

1 . [[Sharaabi],[शराबी]]

2 . [[politicians do not have permission to do what needs to be done.][राजनीतिज्ञों के पास जो कार्य करना चाहिए, वह करने कि अनुमति नहीं है .]]

3 . [[I'd like to tell you about one such child][मई आपको ऐसे ही एक बच्चे के बारे में बताना चाहूंगी]]

So far, what I have been able to do is remove the first three values from each list:

[

['Sharaabi', '\xe0\xa4\xb6\xe0\xa4\xb0\xe0\xa4\xbe\xe0\xa4\xac\xe0\xa5\x80'],

['politicians', 'do', 'not', 'have', 'permission', 'to', 'do', 'what', 'needs', 'to', 'be', 'done.', '\xe0\xa4\xb0\xe0\xa4\xbe\xe0\xa4\x9c\xe0\xa4\xa8\xe0\xa5\x80\xe0\xa4\xa4\xe0\xa4\xbf\xe0\xa4\x9c\xe0\xa5\x8d\xe0\xa4\x9e\xe0\xa5\x8b\xe0\xa4\x82', '\xe0\xa4\x95\xe0\xa5\x87', '\xe0\xa4\xaa\xe0\xa4\xbe\xe0\xa4\xb8', '\xe0\xa4\x9c\xe0\xa5\x8b', '\xe0\xa4\x95\xe0\xa4\xbe\xe0\xa4\xb0\xe0\xa5\x8d\xe0\xa4\xaf', '\xe0\xa4\x95\xe0\xa4\xb0\xe0\xa4\xa8\xe0\xa4\xbe', '\xe0\xa4\x9a\xe0\xa4\xbe\xe0\xa4\xb9\xe0\xa4\xbf\xe0\xa4\x8f,', '\xe0\xa4\xb5\xe0\xa4\xb9', '\xe0\xa4\x95\xe0\xa4\xb0\xe0\xa4\xa8\xe0\xa5\x87', '\xe0\xa4\x95\xe0\xa4\xbf', '\xe0\xa4\x85\xe0\xa4\xa8\xe0\xa5\x81\xe0\xa4\xae\xe0\xa4\xa4\xe0\xa4\xbf', '\xe0\xa4\xa8\xe0\xa4\xb9\xe0\xa5\x80\xe0\xa4\x82', '\xe0\xa4\xb9\xe0\xa5\x88', '.'],

["I'd", 'like', 'to', 'tell', 'you', 'about', 'one', 'such', 'child,', '\xe0\xa4\xae\xe0\xa4\x88', '\xe0\xa4\x86\xe0\xa4\xaa\xe0\xa4\x95\xe0\xa5\x8b', '\xe0\xa4\x90\xe0\xa4\xb8\xe0\xa5\x87', '\xe0\xa4\xb9\xe0\xa5\x80', '\xe0\xa4\x8f\xe0\xa4\x95', '\xe0\xa4\xac\xe0\xa4\x9a\xe0\xa5\x8d\xe0\xa4\x9a\xe0\xa5\x87', '\xe0\xa4\x95\xe0\xa5\x87', '\xe0\xa4\xac\xe0\xa4\xbe\xe0\xa4\xb0\xe0\xa5\x87', '\xe0\xa4\xae\xe0\xa5\x87\xe0\xa4\x82', '\xe0\xa4\xac\xe0\xa4\xa4\xe0\xa4\xbe\xe0\xa4\xa8\xe0\xa4\xbe', '\xe0\xa4\x9a\xe0\xa4\xbe\xe0\xa4\xb9\xe0\xa5\x82\xe0\xa4\x82\xe0\xa4\x97\xe0\xa5\x80,']

]

Any hints on how to separate the English and the Hindi into two separate lists?

python

Edited 10 Years Ago by jamesjohnson25


All 7 Replies

Gribouillis 1,391 Programming Explorer

10 Years Ago

No it will be very fast.

I suppose your data is stored in a file. Can you send such a file with, say, 10 lines of data?

Edited 10 Years Ago by Gribouillis

Gribouillis 1,391 Programming Explorer

10 Years Ago

It seems very easy because the file is TAB-separated. Here is my code in Python 2:

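#!/usr/bin/env python
# -*-coding: utf8-*-
'''Print the English and Hindi fields of each line of a TAB-separated file.
'''
from __future__ import (absolute_import, division,
                        print_function, unicode_literals)
import codecs

def process_file(filename):
    # codecs.open decodes each line from UTF-8 into unicode text
    with codecs.open(filename, encoding='utf8') as ifh:
        for line in ifh:
            # Drop the line ending so it does not stick to the Hindi field
            row = line.rstrip('\r\n').split('\t')
            # The last two TAB-separated fields are the English and Hindi text
            english, hindi = row[-2:]
            print('English:', english)
            print('Hindi:', hindi)

if __name__ == '__main__':
    process_file('hindmonocorp05.txt')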

And the result

Edited 10 Years Ago by Gribouillis
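For anyone on Python 3, the same idea can be sketched roughly as below. This is not Gribouillis's code, only an adaptation: the names parse_line and read_pairs are made up for illustration, and the five-column layout (source, alignment, score, English, Hindi) is an assumption read off the attached sample. Lines with a different number of columns are skipped rather than guessed at.

```python
import io

# Assumed layout: source, alignment, score, English, Hindi (5 TAB-separated fields)
EXPECTED_FIELDS = 5

def parse_line(line):
    """Return (english, hindi) for a well-formed line, else None."""
    fields = line.rstrip('\r\n').split('\t')
    if len(fields) != EXPECTED_FIELDS:
        return None  # e.g. a row whose Hindi column is missing
    return fields[3], fields[4]

def read_pairs(filename):
    """Yield (english, hindi) pairs from a UTF-8, TAB-separated file."""
    with io.open(filename, encoding='utf-8') as fh:
        for line in fh:
            pair = parse_line(line)
            if pair is not None:
                yield pair

if __name__ == '__main__':
    print(parse_line('ted\t1-1\t1.0\thello\tनमस्ते\n'))
```

Calling list(read_pairs('hindmonocorp05.txt')) would then give a list of two-element tuples, one per usable line. The file is read one line at a time, so the 286000 lines do not need to fit in memory at once.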


jamesjohnson25 0 Newbie Poster · Answer 1 · 2015-04-21T15:53:21+00:00

The "?" are basically Hindi characters that are not being recognised properly.
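One likely cause of the "?" (an assumption, since the thread does not say where they appeared) is that the text was pushed through an encoding that has no Devanagari characters. A minimal Python 3 sketch of that effect, using the first Hindi word from the sample data:

```python
text = 'शराबी'

# An ASCII target cannot represent Devanagari; with errors='replace'
# every code point becomes '?'.
as_ascii = text.encode('ascii', errors='replace')
print(as_ascii)  # b'?????'

# UTF-8 keeps everything; these are the same bytes shown in the
# question as '\xe0\xa4\xb6\xe0\xa4\xb0...'.
as_utf8 = text.encode('utf-8')
print(as_utf8)

# Decoding the UTF-8 bytes gives the original text back.
print(as_utf8.decode('utf-8') == text)  # True
```

If that is what is happening, keeping the data as UTF-8 end to end (file, editor, terminal) should preserve the Hindi text.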

Slavi 94 Master Poster Featured Poster · Answer 2 · 2015-04-21T16:17:55+00:00

Split each line on spaces into words, then check each word against an English dictionary.
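A possible alternative to a dictionary lookup, offered only as a sketch: Hindi is written in Devanagari, which occupies the Unicode block U+0900 to U+097F, so a token can be classified by its characters alone. The helper names is_devanagari and split_by_script are hypothetical, the code assumes Python 3 (or already-decoded unicode strings), and it assumes the English text always comes first.

```python
def is_devanagari(token):
    """True if any character of token lies in the Devanagari block U+0900..U+097F."""
    return any('\u0900' <= ch <= '\u097f' for ch in token)

def split_by_script(tokens):
    """Split a token list at the first token that contains Devanagari.

    Everything before that token is treated as English, everything from
    it onward as Hindi, so a trailing '.' stays with the Hindi side.
    """
    for i, tok in enumerate(tokens):
        if is_devanagari(tok):
            return [' '.join(tokens[:i]), ' '.join(tokens[i:])]
    return [' '.join(tokens), '']  # no Devanagari found

print(split_by_script(['Sharaabi', 'शराबी']))  # ['Sharaabi', 'शराबी']
```

This needs only one pass over the characters of each line. It does misplace the boundary when the Hindi side starts with a non-Devanagari token (a leading "-", "2." or "%{APPNAME}", as in some rows of the attached sample), so splitting on the TAB character is more reliable when the original file is available.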

jamesjohnson25 0 Newbie Poster · Answer 3 · 2015-04-21T16:23:34+00:00

The data is 286000 lines. If I "run each word against an English dictionary", it would take a lot of time, wouldn't it?

jamesjohnson25 0 Newbie Poster · Answer 4 · 2015-04-21T18:38:18+00:00


I have sent the file

hindmonocorp05.txt (2.1 KB)

wikiner2013inflected	1-1	1.000	Sharaabi	
ted	1-1	1.0	politicians do not have permission to do what needs to be done.	      ,       .
ted	1-1	1.0	I'd like to tell you about one such child,	          ,
indic2012	1-1	manual	This percentage is even greater than the percentage in India.	        
quote-name	1-1	1.0	- John Collins	-  
ted	1-1	1.0	what we really mean is that they're bad at not paying attention.	          
launchpad	1-1	implied	%{APPNAME} would like to send notifications, but you need to be signed in to Chrome.	%{APPNAME}    ,   Chrome     .
launchpad	1-1	implied	Important Messages	 
launchpad	1-1	implied	User authentication required for VPN connection '%s'...	  VPN  '%s'    ...
launchpad	1-1	implied	Surface width	 
launchpad	1-1	implied	Reinstall	  
agro-hunaligned	1-1	0.87	2. Infection caused by germs.	2.   
wikiner2013	1-1	implied	Suhasi Goradia Dhami	 
indic2012	1-1	manual	.The ending portion of these Vedas is called Upanishad.	       
wikiner2013infldected	1-1	0.065	Yuga

Edited 10 Years Ago by jamesjohnson25 because: Attached file

jamesjohnson25 0 Newbie Poster · Answer 5 · 2015-04-22T01:57:47+00:00

Thanks a lot, Gribouillis. You made it very simple.
