Java Split Text by Spaces and Punctuation

Hi All,
I have to split a text into words using both spaces and punctuation as delimiters.
Punctuation includes characters like .,!?:;'"-
I am using the split function as follows:

wordsArray = strLine.split("[.,!?:;'\"-]+\\s*");


However, this only splits my text by spaces and ignores other characters I've set as delimiters.
Obviously, I am new at regexes and I could use your help.
Thank you.

michelleradu
Junior Poster in Training
55 posts since Dec 2009
Reputation Points: 10
Solved Threads: 5
 

The regular expression ([.,!?:;'\"-]|\\s)+ should do the trick. It reads as "one or more occurrences of either a punctuation character or a whitespace character". The important part is to keep the '+' outside the alternation, since a delimiter can be any mix of whitespace and punctuation.
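As a rough illustration of that pattern in use, here is a minimal sketch. The class name SplitWords, the method splitWords, and the sample sentence are made up for this example; only the regular expression comes from the reply.

```java
import java.util.Arrays;

class SplitWords {
    // Pattern from the reply: one or more punctuation or whitespace characters.
    static final String DELIMITERS = "([.,!?:;'\"-]|\\s)+";

    // Hypothetical helper: splits a line into words on any run of delimiters.
    static String[] splitWords(String line) {
        return line.split(DELIMITERS);
    }

    public static void main(String[] args) {
        String line = "Hello, world! How are you?";
        // Prints [Hello, world, How, are, you]
        System.out.println(Arrays.toString(splitWords(line)));
    }
}
```

Note that String.split drops trailing empty strings but keeps a leading one, so a line that begins with a delimiter yields an empty first element.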

The problem with your regular expression was that it didn't describe punctuation or whitespace, but rather punctuation followed by whitespace. If you have a 'this or that' situation, use alternation. Writing patterns one after another just adds to the matching requirement; your pattern read "match one or more punctuation characters *followed* by zero or more whitespace characters", which isn't what you wanted. Hence it matched punctuation but never matched whitespace that stood on its own.
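To make the "matching requirement" point concrete, the sketch below runs the original pattern from the question on a sample sentence. The class name OriginalPatternDemo, the method name, and the sentence are assumptions for illustration only.

```java
import java.util.Arrays;

class OriginalPatternDemo {
    // Pattern from the question: punctuation FOLLOWED BY optional whitespace.
    static final String ORIGINAL = "[.,!?:;'\"-]+\\s*";

    // Hypothetical helper that applies the original pattern unchanged.
    static String[] splitWithOriginal(String line) {
        return line.split(ORIGINAL);
    }

    public static void main(String[] args) {
        String line = "Hello, world! How are you?";
        // A space with no punctuation in front of it is never a delimiter here,
        // so "How are you" stays in one piece.
        // Prints [Hello, world, How are you]
        System.out.println(Arrays.toString(splitWithOriginal(line)));
    }
}
```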

~s.o.s~
Failure as a human
Administrator
11,938 posts since Jun 2006
Reputation Points: 3,281
Solved Threads: 734
 

~s.o.s~,
Excellent answer and explanation for regex - I now understand what I was doing wrong.
Your solution works perfectly.
Thank you.

michelleradu
 

How can I not split exceptions such as Mr. and Mrs.?

leopoldauer
Newbie Poster
1 post since Apr 2012
Reputation Points: 0
Solved Threads: 0
 

It's possible to handle exceptions by using the negative lookbehind assertion feature of regular expressions. Just add another alternative to the alternation that uses a lookaround, and you should be golden. But this approach leaves a nasty regular expression in your code, so if possible, deal with exceptions after performing the basic split rather than modifying the regex.
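One way the lookbehind idea might look is sketched below. This is not a pattern given in the thread: the class name, the method name, and the choice to protect only "Mr" and "Mrs" are assumptions. The period gets its own alternative guarded by a negative lookbehind, so it counts as a delimiter only when it does not directly follow one of those titles.

```java
import java.util.Arrays;

class SplitWithTitles {
    // Assumed pattern: a period is a delimiter unless preceded by the word Mr or Mrs;
    // the remaining punctuation and whitespace behave as in the earlier reply.
    static final String DELIMITERS =
            "(?:(?<!\\b(?:Mr|Mrs))\\.|[,!?:;'\"-]|\\s)+";

    // Hypothetical helper: splits on delimiters while keeping "Mr." and "Mrs." whole.
    static String[] splitKeepingTitles(String line) {
        return line.split(DELIMITERS);
    }

    public static void main(String[] args) {
        String line = "Mr. Smith met Mrs. Jones, then left.";
        // Prints [Mr., Smith, met, Mrs., Jones, then, left]
        System.out.println(Arrays.toString(splitKeepingTitles(line)));
    }
}
```

Every additional abbreviation has to be added inside the lookbehind, which shows how quickly the pattern grows and why handling exceptions outside the regex is suggested above.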

Also, please create a new thread for your question and refer to this thread if you feel it is related, rather than bumping an existing solved thread.

~s.o.s~
 

This question has already been solved
