Hi All,
I have to split a text into words using both spaces and punctuation as delimiters.
Punctuation includes characters like .,!?:;'"-
I am using the split function as follows:

wordsArray = strLine.split("[.,!?:;'\"-]+\\s*");

However, this only splits my text by spaces and ignores other characters I've set as delimiters.
Obviously, I am new to regexes and could use your help.
Thank you.

The regular expression ([.,!?:;'\"-]|\\s)+ should do the trick. It reads as "one or more occurrences of either a whitespace character or a punctuation character". The important part is to keep the '+' outside the alternation, since a delimiter can be a mix of whitespace and punctuation.

The problem with your regular expression is that it doesn't match punctuation *or* whitespace, but rather punctuation followed by whitespace. If you have a 'this or that' situation, use alternation. Writing patterns one after another makes the match stricter; your pattern read "match one or more punctuation characters *followed* by zero or more whitespace characters", which isn't what you wanted. Hence it matched only where punctuation was present and never split on whitespace alone.
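To make the difference concrete, here is a minimal sketch that runs both patterns on the same line. The class name SplitDemo, the helper splitWords and the sample sentence are made up for illustration; only the two patterns come from this thread.

```java
import java.util.Arrays;

public class SplitDemo {
    // Pattern from the question: a run of punctuation, then optional whitespace.
    static final String ORIGINAL = "[.,!?:;'\"-]+\\s*";
    // Pattern from the answer: a run of punctuation and/or whitespace, in any mix.
    static final String FIXED = "([.,!?:;'\"-]|\\s)+";

    // Hypothetical helper: split a line with the given delimiter regex.
    static String[] splitWords(String line, String regex) {
        return line.split(regex);
    }

    public static void main(String[] args) {
        String line = "Hello, world! How are you?";
        // Whitespace alone is not a delimiter here, so "How are you" stays whole.
        System.out.println(Arrays.toString(splitWords(line, ORIGINAL)));
        // prints [Hello, world, How are you]
        System.out.println(Arrays.toString(splitWords(line, FIXED)));
        // prints [Hello, world, How, are, you]
    }
}
```

One thing to watch for: if the line starts with a delimiter, split returns an empty string as the first element, so you may want to skip empty entries.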

Excellent answer and explanation for regex - I now understand what I was doing wrong.
Your solution works perfectly.
Thank you.

How can I avoid splitting on exceptions such as Mr. and Mrs.?

It's possible to handle exceptions by using the negative lookbehind feature of regular expressions. Just throw in another alternation which uses lookaround and you should be golden. But this approach leaves a nasty regular expression in your code, so if possible, try to deal with exceptions after performing the basic splitting rather than modifying the regex.
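As a rough sketch of the lookbehind idea (one possible reading, not the only way to write it): pull '.' out of the character class and guard it with negative lookbehinds so it is not treated as a delimiter right after "Mr" or "Mrs". The class name TitleAwareSplit, the pattern itself and the sample sentence are assumptions made up for this example.

```java
import java.util.Arrays;

public class TitleAwareSplit {
    // Hypothetical pattern: '.' is a delimiter only when it is NOT directly
    // preceded by the whole word "Mr" or "Mrs"; other punctuation and
    // whitespace behave as in the basic pattern.
    static final String DELIMS =
        "((?<!\\bMr)(?<!\\bMrs)[.]|[,!?:;'\"-]|\\s)+";

    static String[] splitWords(String line) {
        return line.split(DELIMS);
    }

    public static void main(String[] args) {
        String line = "Mr. Smith and Mrs. Jones arrived.";
        System.out.println(Arrays.toString(splitWords(line)));
        // prints [Mr., Smith, and, Mrs., Jones, arrived]
    }
}
```

Note that the period stays attached to "Mr." and "Mrs." in the result, and that every new exception needs another lookbehind, which is why the pattern gets ugly quickly.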

Also, please create a new thread for your question, and refer to this thread if you feel it is related, rather than bumping an existing solved thread.