Isolate the words

Question

jecris 0 Newbie Poster

14 Years Ago

A common problem when processing incoming text is to isolate the words in the text.
This is made more difficult by the punctuation; words have commas, "quote marks",
(even brackets) next to them, or hyphens in the middle of the word. This punctuation
doesn't count as letters when the words have to be looked up in a dictionary by the
program.

For this problem, you must separate out "clean" words from text, that is, words with no
attached or embedded non-letters. A "word" is any continuous string of non-whitespace
characters, with whitespace characters on each side of it. For this problem, a
"whitespace" character is a space character or an end-of-line character, or the start or
end of the file (so that, for example, if the file is "A B", where there is a space
character between the A and B but no other, then there are two words, "A" and "B").

Input will be from standard input (keyboard) and will consist of lines with no more than
60 characters in each line. Every line will be terminated by a character which isn't
whitespace (which will be followed immediately by an end-of-line character). The input
will be terminated by a line consisting of a single #.
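As a starting point, the input rules above could be handled roughly like this. It is only a sketch: the class and method names (`InputLines`, `readUntilTerminator`, `splitWords`) are made up for illustration, and it assumes the terminator is a line that is exactly `#`, so a line that merely starts with `#` is still treated as text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class InputLines {
    // Collects lines until a line consisting of a single '#' (or end of input).
    // Assumption: the terminator line contains nothing but the '#' character.
    static List<String> readUntilTerminator(Scanner in) {
        List<String> lines = new ArrayList<>();
        while (in.hasNextLine()) {
            String line = in.nextLine();
            if (line.equals("#")) {
                break;
            }
            lines.add(line);
        }
        return lines;
    }

    // Splits one line into "words": runs of non-space characters.
    // Only the space character separates words here, as in the problem text.
    static List<String> splitWords(String line) {
        List<String> words = new ArrayList<>();
        for (String piece : line.split(" ")) {
            if (!piece.isEmpty()) {
                words.add(piece);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        List<String> lines = readUntilTerminator(new Scanner(System.in));
        for (String line : lines) {
            System.out.println(splitWords(line));
        }
    }
}
```

Passing the `Scanner` in as a parameter keeps the reading logic testable with a string instead of the keyboard.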

Output, which must be written to standard output (the screen), must be the lines of the
incoming text, with the punctuation stripped away from each word. "Punctuation" is any
character which is not a letter (a-z and A-Z) or a whitespace character - your program
must not change the letters and space characters (although space characters at the
ends of lines will be ignored). When punctuation occurs in the middle of a word (i.e.
there is no whitespace character next to it), it must be simply removed - see what
happens to the word "doesn't" in the example. A word which consists entirely of
punctuation will therefore be removed entirely. There is a special rule for a hyphen ("-")
which is the very last character in a line - the word part before the hyphen, and the first
word part on the next line, form a single word, and this must be written on a line by itself
(so that no line is ever lengthened). There will always be a space before the word part
on the first line, and a space after the word part on the second line.
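One possible way to apply the output rules is sketched below. The names `WordIsolator`, `stripPunctuation` and `process` are hypothetical, and the sketch makes two assumptions that the problem text does not settle: space characters are left exactly where they are (so a removed all-punctuation word can leave several spaces in a row), and a trailing hyphen on the very last line before the `#` is treated as ordinary punctuation because there is no next line to join with. The two-line input in `main` is invented for the demonstration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class WordIsolator {
    // Removes every character that is not a letter or a space,
    // then drops spaces at the end of the line.
    static String stripPunctuation(String text) {
        String kept = text.replaceAll("[^A-Za-z ]", "");
        return kept.replaceAll(" +$", "");
    }

    // Cleans all lines, applying the special rule for a hyphen that is
    // the very last character of a line.
    static List<String> process(List<String> lines) {
        List<String> out = new ArrayList<>();
        String carried = null; // word part cut off before a trailing hyphen
        for (int i = 0; i < lines.size(); i++) {
            String line = lines.get(i);
            if (carried != null) {
                // Join the carried part with the first word part of this line
                // and write the joined word on a line by itself.
                int space = line.indexOf(' ');
                String head = space < 0 ? line : line.substring(0, space);
                out.add(stripPunctuation(carried + head));
                line = space < 0 ? "" : line.substring(space + 1);
                carried = null;
            }
            boolean hasNext = i + 1 < lines.size();
            if (hasNext && line.endsWith("-")) {
                // Hold back the last word part (without the hyphen).
                int space = line.lastIndexOf(' ');
                carried = line.substring(space + 1, line.length() - 1);
                line = space < 0 ? "" : line.substring(0, space);
            }
            out.add(stripPunctuation(line));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> demo = Arrays.asList(
                "next to them, or hyph-",
                "ens in the word. It doesn't");
        for (String line : process(demo)) {
            System.out.println(line);
        }
        // prints:
        // next to them or
        // hyphens
        // in the word It doesnt
    }
}
```

Because the joined word goes on its own line and both surrounding lines only lose characters, no output line ends up longer than its input line.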

Example

Input
A common problem when processing incoming text is to isolate
the words in the text. This is made more difficult by the
punctuation; words have commas, "quote marks",
(even brackets) next to them, or hyphens
in the middle of the word. This punctuation doesn't
count as letters when the words have to be looked up in a
# dictionary by the 12345 "**&! program.
#

Output
A common problem when processing incoming text is to isolate
the words in the text This is made more difficult by the
punctuation words have commas quote marks
even brackets next to them or
hyphens
in the middle of the word This punctuation doesnt
count as letters when the words have to be looked up in a
dictionary by the program

java

masijade 1,351 Industrious Poster Team Colleague Featured Poster · Answer 1 · 2011-02-01T13:56:48+00:00

See this response.