Hello,

I have a number of long text files that were taken from a Windows system and use a specific encoding (windows-1256).

Can I count the total number of distinct words in these files using Ubuntu commands?

For example, if I have three text files containing, respectively:

-Hi, hello.
-Hello all.
-Good people in all time.

The output should be: 7.

Thank you

0

Thank you readers :)

I have now found a solution for part of the problem:

cat ~/folderName/* | tr ' ' '\n' | sort | uniq | wc -w

This command counts the unique words across all files in folderName.
It works correctly with English files.

But so far it does not work with files that use the windows-1256 encoding.
The terminal cannot read the words; it prints 0.

I changed encoding of the terminal by:

gconftool --set --type=string /apps/gnome-terminal/profiles/Default/encoding WINDOWS-1256 

I also tried CP1256 instead of WINDOWS-1256, but so far it still does not work :(
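
A possible direction, offered as a sketch rather than a confirmed fix: the commands in the pipeline read the raw bytes of the files, so changing the terminal's display encoding may not affect what they see. One option is to convert each file to UTF-8 with iconv before counting. The function name count_unique_words_encoded and its default encoding argument are assumptions made for illustration, and the sketch splits on whitespace only (it does not strip punctuation or fold case).

```shell
# Hypothetical helper: count distinct whitespace-separated words in all files
# of a directory, decoding each file from a source encoding first.
# Usage: count_unique_words_encoded DIR [ENCODING]
count_unique_words_encoded() {
    dir=$1
    enc=${2:-WINDOWS-1256}   # assumed default; CP1256 is another name for it
    for f in "$dir"/*; do
        iconv -f "$enc" -t UTF-8 "$f"   # decode to UTF-8
        echo                            # keep files apart if one lacks a final newline
    done |
        LC_ALL=C tr -s '[:space:]' '\n' |   # one word per line (ASCII whitespace)
        LC_ALL=C sort -u |                  # keep each distinct word once
        LC_ALL=C grep -c .                  # count the non-empty lines
}
```

It could be called as count_unique_words_encoded ~/folderName, which prints a single number.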

0

For your original example, your script gives the answer 9, not 7 as you requested. The command you show does not handle punctuation and capitalization.

You probably want to add some sed-like functionality to remove punctuation and address capitalization. After that you can consider the encoding.
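
As a minimal sketch of that suggestion for plain English (ASCII) text, tr alone can fold case and turn every non-letter into a word separator, so sed is not strictly needed. The function name count_unique_words is hypothetical, and the sketch assumes each file ends with a newline so that cat does not join the last word of one file to the first word of the next.

```shell
# Hypothetical helper: count distinct words in all files of a directory,
# ignoring case and punctuation (ASCII letters only).
# Usage: count_unique_words DIR
count_unique_words() {
    cat "$1"/* |
        LC_ALL=C tr '[:upper:]' '[:lower:]' |   # Hello -> hello
        LC_ALL=C tr -cs '[:alpha:]' '\n' |      # every run of non-letters becomes one newline
        LC_ALL=C sort -u |                      # keep each distinct word once
        grep -c .                               # count the non-empty lines
}
```

For the three example files from the question this prints 7.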

0

Thank you.
Do you mean adding something like tr -cd 'a-zA-Z''\n' to the command:

cat ~/folderName/* | tr ' ' '\n'| tr -cd 'a-zA-Z''\n' | sort | uniq | wc -w

Thank you. I have added that now, but the encoding problem is still there.

0

Sorry, I cannot edit my previous post.

Can I find the count of words without needing the wc command?
For example, like this:

tr -cs 'a-zA-Z''\n' '\n'|  sort | uniq  < ~/folderName/*

But I get the error: ~/folderName/: ambiguous redirect
If I quote the path as "~/folderName/*" or '~/folderName/*', I get a different error: No such file or directory
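
For context on the two errors: in bash, the word after < must expand to exactly one file, so a glob that matches several files is reported as an ambiguous redirect, and inside quotes neither ~ nor * is expanded, so the shell looks for a file literally named ~/folderName/*. A minimal sketch that avoids the redirect by letting cat read all the files is shown here; the function name list_unique_words is hypothetical, and like the command above it only lists the words rather than counting them.

```shell
# Hypothetical helper: print each distinct word (ASCII letters only) found in
# the files of a directory, one per line, without using input redirection.
# Usage: list_unique_words DIR
list_unique_words() {
    cat "$1"/* |                        # the glob stays outside the quotes so it expands
        tr -cs 'a-zA-Z' '\n' |          # every run of non-letters becomes one newline
        LC_ALL=C sort -u |              # sort and drop duplicates
        grep .                          # drop a possible empty line
}
```

Calling it as list_unique_words "$HOME/folderName" sidesteps the tilde problem, because $HOME is expanded inside double quotes.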


0

Finally, I got this correct solution :)

cat ~/folderName/* | tr ' ' '\n' |tr -cd 'a-zA-Z''\n'| awk '{a[$1]} END {for(i in a) print i}'  > ~/out.txt 
awk 'END {print NR}' ~/out.txt

I hope this helps anyone who has the same problem :)

But this command needs a file to store the output (one word per line). Is there any problem if this file contains a huge number of lines, e.g. more than 500,000?
Also, can I combine these commands so that the file is not needed? So far I cannot find a way.

Can anyone help me?

Thank you.
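
One way the two commands might be combined, as a sketch: awk can count the array entries in its END block instead of printing them, so no intermediate file is needed. The function name count_unique_words_awk and the array name seen are hypothetical; the preprocessing is kept the same as in the command above, so the count is still case-sensitive.

```shell
# Hypothetical helper: same pipeline as above, but awk prints the number of
# distinct words directly instead of writing them to a file.
# Usage: count_unique_words_awk DIR
count_unique_words_awk() {
    cat "$1"/* |
        tr ' ' '\n' |                   # one word per line
        tr -cd 'a-zA-Z\n' |             # delete everything except letters and newlines
        awk 'NF { seen[$1] }            # remember each non-empty word once
             END { n = 0; for (w in seen) n++; print n }'
}
```

For the three example files this prints 8 rather than 7, because Hello and hello are still counted separately; adding a tr '[:upper:]' '[:lower:]' stage would merge them.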

