This method invocation
txt.match(/\[.*\]/g)
works exactly as intended - it returns an array of matches
[anytext1],[anytext2],[anytext3]
I don't need the brackets, so out of curiosity I tried this tweak
txt.match(/\[.*(?=\])/g)
which - somewhat to my surprise I must admit - does exactly what I was hoping it would do and returns
[anytext1,[anytext2,[anytext3
(IOW the closing bracket is not included in the match).
Emboldened by that success I tried this further tweak
txt.match(/(?:\[).*(?=\])/g)
but was disappointed; that also returns
[anytext1,[anytext2,[anytext3
(IOW the opening bracket is still included in the match).

The asymmetry of the (? operators in this context appears to be a glitch; either that or the (?: doesn't do what I think it does.
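
For anyone who wants to reproduce the three results, here is a minimal sketch. The sample string txt is an assumption (one bracketed item per line); the three expressions are exactly the ones quoted above.

```javascript
// Hypothetical sample input: one bracketed item per line.
const txt = "[anytext1]\n[anytext2]\n[anytext3]";

// Both brackets are part of the pattern, so both appear in each match.
const withBoth = txt.match(/\[.*\]/g);

// The closing bracket sits inside (?= ... ), so it is checked but not returned.
const withLookahead = txt.match(/\[.*(?=\])/g);

// The opening bracket sits inside (?: ... ), yet it is still returned.
const withNonCapturing = txt.match(/(?:\[).*(?=\])/g);

console.log(withBoth);         // [ '[anytext1]', '[anytext2]', '[anytext3]' ]
console.log(withLookahead);    // [ '[anytext1', '[anytext2', '[anytext3' ]
console.log(withNonCapturing); // [ '[anytext1', '[anytext2', '[anytext3' ]
```

Note that . does not match a newline by default, which is why each line yields its own match here.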

FXM,

What is your input string txt , that gives these results?

Airshow

This

<html> <pre>
<!A>
[=== Board 24 has been corrected ===]
[]
[=== Final results will be posted tomorrow ===]

Sub 20 Thursday Afternoon Pairs Thursday Aft Session March 25, 2010
Scores after  7 rounds  Average:   84.0      Section  A  North-South
Pair    Pct   Score      Section Rank      Overall Rank      MPs     
                         A     B     C     A     B     C

is an example.

FXM,

Aha, screen-scraping by the look of it. I have done a bit of that myself.

I suggest the following approach:

  • Use your first regex formulation, txt.match(/\[.*\]/g) , to create an array of matches, including the bits you don't want.
  • Use string.replace() to strip out the bits you don't want (the "[===" and "===]" delimiters) further down in your code, where you come to use the matches.

For example, to display the matches you might do something like this:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html>
<head>
<title>Airshow :: Untitled</title>
<style type="text/css">
#results {
	width: 300px;
}
p {
	border: 1px solid #999999;
	margin: 3px 0;
	padding: 3px;
	background-color: #99CCCC;
	font-family: verdana,arial;
	font-size: 10pt;
}
</style>

<script>
onload = function(){
	var txt = "<html> <pre>\n<!A>\n[=== Board 24 has been corrected ===]\n[]\n[=== Final results will be posted tomorrow ===]\nSub 20 Thursday Afternoon Pairs Thursday Aft Session March 25, 2010\nScores after  7 rounds  Average:   84.0      Section  A  North-South\nPair    Pct   Score      Section Rank      Overall Rank      MPs\n                         A     B     C     A     B     C";
	var matches = txt.match(/\[.*\]/g) || []; // match() returns null when nothing matches
	var res = document.getElementById('results');
	var i, p;
	for(i=0; i<matches.length; i++){
		if(matches[i] !== '[]'){
			p = document.createElement('p');
			p.innerHTML = matches[i].replace('[=== ', '').replace(' ===]', '');
			res.appendChild(p);
		}
	}
};
</script>
</head>

<body>

<div id="results"></div>

</body>
</html>

As I say, this is just an example. What you do with the matches will be determined by the requirements of your target application.

Regex gurus could most likely do more in a single statement, but complex regexes, although impressive, get harder and harder to maintain as time goes on. They don't have the "at-a-glance" quality of most other types of code. Complex regexes can also be less efficient, in terms of execution time, than multiple lines of code to the same effect. (Regex gurus will disagree, of course.)

Airshow

And here is a pretty good reference for javascript regular expression terms.

Airshow

OK to all that.

The post-scrape processing is actually far simpler. The =s were entered by the user (so she expects them to appear); the rest is a trivial substring (because the ending bracket is already gone and the starting bracket is always in position 0).

However, my original question was: why didn't the (?: exclude the left bracket (and indirectly why did the (?= exclude the right bracket)? I was surprised that the (?= worked as I wanted [because AFAIK it is not documented to do so in this context] and then I was semi-surprised that with the (?= working, the (?: seemed to fail [given my - possibly incorrect - assumption that its match consumes input].
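
A sketch that may help with the question above: (?: ... ) is a non-capturing group, which only means it creates no numbered group; the text it matches is still consumed and stays in the overall match. (?= ... ) is a lookahead, which checks the text without consuming it. The sample strings and variable names below are assumptions for illustration, and the capturing-group loop is one possible way to get the inner text, not necessarily the best for the original page.

```javascript
// Hypothetical one-line sample; the bracketed text is an assumption.
const line = "[anytext1]";

// (?:\[) creates no numbered group, but it still consumes "[".
const nonCapturing = /(?:\[).*(?=\])/.exec(line);
console.log(nonCapturing[0]);     // "[anytext1" (whole match, "[" included)
console.log(nonCapturing.length); // 1 (whole match only, no numbered groups)

// A capturing group around the inner text exposes it as group 1.
const capturing = /\[(.*)\]/.exec(line);
console.log(capturing[0]); // "[anytext1]"
console.log(capturing[1]); // "anytext1"

// With the g flag, match() returns whole matches only,
// so group 1 is collected with an exec() loop instead.
const txt = "[anytext1]\n[anytext2]\n[anytext3]";
const re = /\[(.*)\]/g;
const inner = [];
let m;
while ((m = re.exec(txt)) !== null) {
  inner.push(m[1]);
}
console.log(inner); // [ 'anytext1', 'anytext2', 'anytext3' ]
```

Engines that support lookbehind could also try (?<=\[) in place of (?:\[), but that support is an assumption about the target browsers.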

And here is a pretty good reference for javascript regular expression terms.

Airshow

For future readers, IMO this is a somewhat better starting point. For example, it mentions all three (? operators [w3schools misses (?:]. Admittedly jsk isn't perfect either. A couple of examples: its documentation of the (? operators is inconsistent, and it continues to recommend deprecated RegExp properties.

For example,
Airshow

I gave this post an up arrow but I didn't think that was nearly enough appreciation. Hence, this follow-up. I was really impressed that you took the time to give a detailed example of what you were suggesting.

txt.match(/(?:\[).*(?=\])/g)

Further testing seems to show conclusively that in this context the text matched by the (?: group is always included in the result. That is consistent with (?: being a non-capturing group: it creates no numbered group, but it still consumes input. In the course of testing I removed the uncertainty - in my mind, at least - as to what the .* would match by recasting the expression as txt.match(/(?:\[)[^[]*?(?=\])/g) .
As expected the return was the same. Out of curiosity I then tried txt.match(/[^[]*?(?=\])/g) .
This excluded the left bracket "by brute force" so the result was exactly correct - as expected - but the expression now took on the order of 10 times as long to evaluate (5 seconds as opposed to half a second) - presumably because of the backtracking that was now possible.

Obviously the better solution was to use the fastest expression and discard the unwanted left bracket.
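
A minimal sketch of that approach, assuming a shortened version of the sample page posted earlier in this thread (the variable names are hypothetical):

```javascript
// Hypothetical sample resembling the scraped page in this thread.
const txt = "<pre>\n[=== Board 24 has been corrected ===]\n[]\n" +
  "[=== Final results will be posted tomorrow ===]\nScores after 7 rounds";

// Fast expression: a literal "[" starts each match, and the lookahead
// keeps the closing "]" out of it. match() returns null if nothing matches.
const matches = txt.match(/\[.*(?=\])/g) || [];

// The left bracket is always at position 0, so slice(1) discards it.
const items = matches.map(function (s) { return s.slice(1); });

console.log(items);
```

The empty "[]" line comes through as an empty string, so it may need filtering out depending on what the page should display.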
