Count the number of distinct words in files

Question

mini programmer 0 Junior Poster in Training

11 Years Ago

Hello,

I have a number of long text files that were taken from a Windows system and use a specific encoding (windows-1256).

Can I count the number of distinct words across all of these files using Ubuntu commands?

For example, if I have three text files that contain, respectively:

-Hi, hello.
-Hello all.
-Good people in all time.

The output should be: 7.

Thank you

ubuntu

Edited 11 Years Ago by mini programmer


All 5 Replies

L7Sqr 227 Practically a Master Poster

11 Years Ago

For your original example, your script gives the answer 9, not 7 as you requested. The command you show does not handle punctuation and capitalization.

You probably want to add some sed-like functionality to remove punctuation and address capitalization. After that you can consider the encoding.
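
One way to apply this advice is sketched below, assuming the words are plain ASCII English as in the question's example. The function name count_unique_words is made up for illustration; it reads text on standard input.

```shell
#!/bin/sh
# Hypothetical helper: count distinct words read from standard input.
# Assumes ASCII English text; anything that is not a-z is treated as a separator.
count_unique_words() {
    tr '[:upper:]' '[:lower:]' |   # fold case so "Hello" and "hello" match
    tr -cs 'a-z' '\n' |            # turn every run of non-letters into one newline
    sed '/^$/d' |                  # drop the empty line left by leading punctuation
    sort -u |                      # keep one copy of each word
    wc -l                          # one word per line, so count the lines
}

# The three example lines from the question
printf '%s\n' 'Hi, hello.' 'Hello all.' 'Good people in all time.' | count_unique_words
```

For the three example lines this prints 7, the count asked for in the question.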


mini programmer 0 Junior Poster in Training · Answer 1 · 2014-03-18T16:00:09+00:00

Thank you readers :)

Now, I found a solution for part of the problem:

cat ~/folderName/* | tr ' ' '\n' | sort | uniq | wc -w

This command calculates the number of unique words in all files in folderName.
It works correctly with English files.

But so far it does not work with files that use the windows-1256 encoding.
The command does not count the words in them; it prints 0.

I changed the terminal's encoding with:

gconftool --set --type=string /apps/gnome-terminal/profiles/Default/encoding WINDOWS-1256

I also tried CP1256 instead of WINDOWS-1256, but it still does not work :(
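
The terminal encoding setting controls how the terminal displays and reads characters; the data flowing through the pipes is not touched by it, so the commands still receive raw windows-1256 bytes. A possible approach, assuming iconv is installed and its build recognises the name WINDOWS-1256, is to convert the files to UTF-8 before counting. The function name and the sample bytes below are made up for illustration, and punctuation is not stripped in this sketch.

```shell
#!/bin/sh
# Hypothetical helper: count distinct whitespace-separated words in
# windows-1256 files given as arguments.
# Assumes iconv is available and knows the WINDOWS-1256 charset name.
count_unique_words_cp1256() {
    cat "$@" |
    iconv -f WINDOWS-1256 -t UTF-8 |   # decode the Windows bytes into UTF-8
    tr -s '[:space:]' '\n' |           # put each word on its own line
    sed '/^$/d' |                      # drop empty lines
    LC_ALL=C sort -u |                 # compare raw bytes, keep one copy of each
    wc -l
}

# Demo with made-up sample data: octal escapes for windows-1256 Arabic letters
tmpdir=$(mktemp -d)
printf '\343\321\315\310\307 \323\341\307\343\n' > "$tmpdir/a.txt"
printf '\343\321\315\310\307\n' > "$tmpdir/b.txt"
count_unique_words_cp1256 "$tmpdir"/*.txt
rm -r "$tmpdir"
```

With real data the call would look like count_unique_words_cp1256 ~/folderName/*.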

mini programmer 0 Junior Poster in Training · Answer 2 · 2014-03-19T03:57:22+00:00

Thank you.
Do you mean adding something like tr -cd 'a-zA-Z''\n' to the command, like this:

cat ~/folderName/* | tr ' ' '\n'| tr -cd 'a-zA-Z''\n' | sort | uniq | wc -w

Thank you. I added that now, but the encoding problem is still there.

mini programmer 0 Junior Poster in Training · Answer 3 · 2014-03-19T16:57:47+00:00

Sorry, I cannot edit my previous post.

Can I find the word count without using the wc command?
For example, like this:

tr -cs 'a-zA-Z''\n' '\n'|  sort | uniq  < ~/folderName/*

But I get the error: ~/folderName/: ambiguous redirect
and if I quote the path as "~/folderName/*" or '~/folderName/*', I get the error: No such file or directory
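
Both errors appear to come from how the shell handles <: a redirection takes exactly one file, so an unquoted * that matches several files is ambiguous, while a quoted * is taken literally and no file has that name. The redirection is also attached to uniq, the last command, rather than to tr. A sketch of one way around this, using a made-up function name and temporary sample files, lets cat receive the expanded file list instead:

```shell
#!/bin/sh
# Hypothetical helper: print each distinct word from the given files, one per line.
list_unique_words() {
    cat "$@" |                 # cat accepts many files; a "<" redirection accepts one
    tr -cs 'a-zA-Z' '\n' |     # every run of non-letters becomes a single newline
    sed '/^$/d' |              # drop empty lines
    sort -u                    # keep one copy of each word
}

# Demo on temporary files; with real data: list_unique_words ~/folderName/*
tmpdir=$(mktemp -d)
printf 'Hello all.\n' > "$tmpdir/one.txt"
printf 'Good people in all time.\n' > "$tmpdir/two.txt"
list_unique_words "$tmpdir"/*
rm -r "$tmpdir"
```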

mini programmer 0 Junior Poster in Training · Answer 4 · 2014-03-20T15:38:39+00:00

Finally, I got this correct solution :)

cat ~/folderName/* | tr ' ' '\n' | tr -cd 'a-zA-Z''\n' | awk '{a[$1]} END {for(i in a) print i}' > ~/out.txt
awk 'END {print NR}' ~/out.txt

I hope this helps anyone who has the same problem :)

But this command needs a file to store the output (lines/words). Is there any problem if this file contains a huge number of lines, e.g. more than 500,000?
Also, can I combine these commands so that the file is no longer needed? So far I have not found a solution.

Can anyone help me?

Thank you.
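
The intermediate file can probably be avoided by letting awk count the array keys itself in its END block, so nothing is written to disk. The sketch below is one way to do that under the same ASCII-only assumption as the commands above. The function name is made up, and lowercasing is added here so that Hello and hello count as one word.

```shell
#!/bin/sh
# Hypothetical helper: count distinct words on standard input without a temp file.
count_unique_words_awk() {
    tr '[:upper:]' '[:lower:]' |   # fold case before comparing words
    tr -cs 'a-z' '\n' |            # every run of non-letters becomes one newline
    awk 'NF { seen[$1] }           # remember each non-empty word as an array key
         END { n = 0; for (w in seen) n++; print n }'
}

# The three example lines from the question
printf '%s\n' 'Hi, hello.' 'Hello all.' 'Good people in all time.' | count_unique_words_awk
```

For the three example lines this prints 7. With real data the input would come from cat ~/folderName/* instead of printf.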
