Hello everyone,
I'm looking for some help with a couple of simple tasks. I only need this for linguistic analysis, so sorry if these are dumb questions. :)

Here is a simple script that uses grep to find the lines in one file that contain a certain word:

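# count the lines that contain the word
linecount=$(grep "someword" "$1"/file.txt | wc -l)
echo "$linecount"
# count the words on those lines, skipping the first tab-separated field
wordcount=$(grep "someword" "$1"/file.txt | cut -f2- | wc -w)
echo "$wordcount"
echo 'avg words per line:'
echo "scale=2; $wordcount / $linecount" | bc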

What would be the simplest way to:
- find the maximum line length (in words)? Should wc -L probably be used somehow?
- count the vocabulary size (simply the number of different tokens) across all the found lines? I could only apply uniq -c to lines, not to words.

I really appreciate any help. Many thanks in advance!


If you have a line
"their therapist is over there"

and you're searching for "the", which result would you want?
0 - nothing matches the actual word "the"
1 - the line contains "the" somewhere
3 - there are three places where "the" appears
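
To make the three readings concrete, here is a rough sketch of how each could be counted with grep on that sample line. It assumes a grep that supports -w and -o (GNU and BSD grep both do); the variable names are made up for illustration.

```shell
line='their therapist is over there'

# Reading "0": lines where "the" occurs as a whole word (-w).
# grep exits non-zero when nothing matches, hence the "|| true".
whole=$(printf '%s\n' "$line" | grep -cw 'the' || true)

# Reading "1": lines that contain "the" anywhere.
anywhere=$(printf '%s\n' "$line" | grep -c 'the' || true)

# Reading "3": every single occurrence; -o prints one match per line.
occurrences=$(printf '%s\n' "$line" | grep -o 'the' | wc -l)

echo "$whole $anywhere $occurrences"
```

Note that grep -c counts matching lines, not matches, which is why the third reading needs -o piped into wc -l.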


Actually, the words being searched for are technical markup tags. Each one unambiguously identifies a single natural-language line that needs to be parsed, so in this case such ambiguities in the search can generally be ignored.
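
With that settled, one possible way to get both numbers is sketched below. It assumes, as the cut -f2- in the original script suggests, that each line is a markup tag, a tab, and then the text. The sample data and variable names are made up for illustration. As far as I know, wc -L (a GNU extension) reports the longest line in characters rather than words, so awk's NF is used instead.

```shell
# Hypothetical sample data: markup tag, tab, then the text.
tmp=$(mktemp)
printf 'someword\tthe cat sat on the mat\n' > "$tmp"
printf 'other\tthis line is skipped\n' >> "$tmp"
printf 'someword\ta dog barked\n' >> "$tmp"

# Maximum line length in words: NF is awk's count of
# whitespace-separated fields on the current line.
maxwords=$(grep "someword" "$tmp" | cut -f2- |
    awk 'NF > max { max = NF } END { print max + 0 }')

# Vocabulary size: put one token per line, drop empty lines,
# keep only the distinct tokens, and count them.
vocab=$(grep "someword" "$tmp" | cut -f2- |
    tr -s '[:space:]' '\n' | sed '/^$/d' | sort -u | wc -l)

echo "max words per line: $maxwords"
echo "vocabulary size: $vocab"
rm -f "$tmp"
```

Tokens are compared exactly here, so "The" and "the" count as two different types, and punctuation stays attached to words. Piping through tr '[:upper:]' '[:lower:]' before sort -u would fold case if that is wanted.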
