Java Scanner -- Scanner going beyond token

Question

AssertNull 1,094 Practically a Posting Shark

8 Years Ago

I've always been bad with regular expression syntax. I am converting text into Bezier curves, polygons, etc. Part of that is identifying points in the form (x,y) where x and y are coordinates in a graph. I am extracting x and y as doubles and creating Point objects with those two double values. I'm having problems with the white space. I cannot assume that there will be white space and I cannot assume that there won't be white space. Here's my attempt to extract three points from a String with some random white space thrown in.

        String str = "(   4.5 , 8.9 ) (76.4,  9)   (67.3  , 0.3) ";
        Scanner scan = new Scanner(str);
        double x[] = new double[3];
        double y[] = new double[3];
        int i;

        try
        {
            for(i = 0; i < 3; i++)
            {
                scan.next("\\(");
                x[i] = scan.nextDouble();
                scan.next(",");
                y[i] = scan.nextDouble();
                scan.next("\\)");
            }

            for(i = 0; i < 3; i++)
            {
                System.out.print("(" + x[i] + "," + y[i] + ")");
            }
        }
        catch(Exception ex)
        {
            System.out.println("malformed");
        }

It works fine for the first point. Note that every token has white space after it for the first point. Then it gets to the second opening paren, which is followed by a digit rather than whitespace, and throws an InputMismatchException.

I'm wondering if it has something to do with "lazy" versus "greedy" pattern searching. I want it to stop when it finds the first pattern (in this case, the opening paren token), not keep going and try to keep matching, to match as little as possible and stop, so I guess that would be "lazy" (Java calls it "reluctant"?) As mentioned, I'm not very good with Regular Expressions (that's an understatement). I've been playing around with sticking in question marks, etc., but haven't got it working.
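For readers unsure of the terminology, here is a minimal sketch of the difference between a greedy quantifier (.*) and a reluctant one (.*?) in java.util.regex. The class name QuantifierDemo and the helper firstMatch are made up for illustration, and the sketch says nothing about whether quantifiers are actually the cause of the Scanner exception described above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuantifierDemo {
    // Hypothetical helper: returns the first match of regex in text, or null if there is none.
    static String firstMatch(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String text = "(1,2) (3,4)";
        // Greedy: .* takes as much as it can, so the match runs to the LAST ")".
        System.out.println(firstMatch("\\(.*\\)", text));   // (1,2) (3,4)
        // Reluctant: .*? takes as little as it can, so the match stops at the FIRST ")".
        System.out.println(firstMatch("\\(.*?\\)", text));  // (1,2)
    }
}
```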

java regex

JamesCherrill 4,733 Most Valuable Poster

8 Years Ago

I'm not good at regex, so I often start by removing all the white space (one simple method call) so what remains is easier to parse.

JamesCherrill 4,733 Most Valuable Poster

8 Years Ago

OK. From what you had said ("I cannot assume that there will be white space") I assumed that the white space was not a delimiter, only the parens and the commas were valid delimiters, so basically you can:

Start with "( 4.5 , 8.9 ) (76.4, 9) (67.3 , 0.3) "
Delete all white space: "(4.5,8.9)(76.4,9)(67.3,0.3)"
Delete all close parens: "(4.5,8.9(76.4,9(67.3,0.3"
Split on open parens to get x,y pairs: "4.5,8.9", "76.4,9", "67.3,0.3"
(ignore the "" before the first open paren, or just delete the first open paren)
Split the pairs on comma: "4.5", "8.9", "76.4", "9", "67.3", "0.3"
Parse the values as doubles

... or something like that. A real regex samurai would know how to get all the texts between all pairs of open and close parens.

(I'm no fan of Scanner - it was supposed to make things easy for beginners, but it doesn't even do that. It's almost always easier to read whole lines and parse them yourself)
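The split-based steps listed above could be sketched roughly as follows. This is only an illustration of that recipe, not code from the thread: the class name SplitParse and the method parsePoints are assumptions, and the sketch does no validation beyond what Double.parseDouble does.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitParse {
    // Hypothetical helper: returns each point as a two-element array {x, y}.
    static List<double[]> parsePoints(String text) {
        String compact = text.replaceAll("\\s+", "")  // delete all white space
                             .replace(")", "");       // delete all close parens
        List<double[]> points = new ArrayList<>();
        for (String pair : compact.split("\\(")) {    // split on open parens
            if (pair.isEmpty()) {
                continue;                             // the "" before the first open paren
            }
            String[] xy = pair.split(",");            // split the pair on the comma
            points.add(new double[] {
                Double.parseDouble(xy[0]),
                Double.parseDouble(xy[1])
            });
        }
        return points;
    }

    public static void main(String[] args) {
        for (double[] p : parsePoints("(   4.5 , 8.9 ) (76.4,  9)   (67.3  , 0.3) ")) {
            System.out.println(p[0] + ", " + p[1]);
        }
    }
}
```

Because all white space is removed first, this would also accept input such as "(4 .5, 8.9)", which may or may not be acceptable.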

JamesCherrill 4,733 Most Valuable Poster

8 Years Ago

Sorry, it's a dull Sunday morning and I couldn't resist working on my Regex skills a bit.
Here's what I came up with...

        String data = "(   4.5 , 8.9 ) (76.4,  9)   (67.3  , 0.3) ";
        // shortest string between ( and ,
        Pattern p1 = Pattern.compile("\\(.*?,"); 
        Matcher m1 = p1.matcher(data);

        // shortest string between , and )
        Pattern p2 = Pattern.compile(",.*?\\)"); 
        Matcher m2 = p2.matcher(data);

        while (m1.find() && m2.find()) {

            // extract the strings between the (, and ,) delimiters...
            String v1 = m1.group();
            String v2 = m2.group();
            System.out.println("\"" + v1 + "\"     \"" + v2 + "\"");

            // remove the opening & closing delimiters...
            v1 = v1.substring(1, v1.length() - 1);
            v2 = v2.substring(1, v2.length() - 1);
            System.out.println("\"" + v1 + "\"     \"" + v2 + "\"");

            // trim any surrounding spaces and parse...
            double d1 = Double.parseDouble(v1.trim());
            double d2 = Double.parseDouble(v2.trim());
            System.out.println(d1 + ", " + d2 + "\n");
        }

Obviously that's coded for explanation and tracing rather than compactness, but it shows how it all works pretty well, I think.

Edit: I revisited this a bit later, and prefer this version. It uses a Regex to break out all the strings between a ( and a ), but then uses split to break those up ready for parsing...

        Matcher m = Pattern.compile("\\(.*?\\)").matcher(data);
        while (m.find()) {
            String[] parts = m.group().split("[\\(\\),]");
            double d1 = Double.parseDouble(parts[1].trim());
            double d2 = Double.parseDouble(parts[2].trim());
            System.out.println(d1 + ", " + d2 + "\n");
        }
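The same match can also be done with capturing groups, so that the delimiters never have to be stripped or split off afterwards. This is only a sketch of that variation, not code from the thread: the class name GroupParse, the method describe, and the pattern itself are assumptions made for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupParse {
    // Group 1 is the text between "(" and ",", group 2 the text between "," and ")".
    // [^,()]* means "any run of characters that are not a comma or a paren".
    private static final Pattern POINT = Pattern.compile("\\(([^,()]*),([^,()]*)\\)");

    // Hypothetical helper: returns one "x, y" line per point found in data.
    static String describe(String data) {
        StringBuilder out = new StringBuilder();
        Matcher m = POINT.matcher(data);
        while (m.find()) {
            double x = Double.parseDouble(m.group(1).trim());
            double y = Double.parseDouble(m.group(2).trim());
            out.append(x).append(", ").append(y).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(describe("(   4.5 , 8.9 ) (76.4,  9)   (67.3  , 0.3) "));
    }
}
```

As with the versions above, a group that is not a valid number makes Double.parseDouble throw a NumberFormatException.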

AssertNull commented: Got me headed in the right direction +5

AssertNull 1,094 Practically a Posting Shark · Answer 1 · 2016-09-17T20:02:50+00:00

JamesCherrill wrote: "I'm not good at regex, so I often start by removing all the white space (one simple method call) so what remains is easier to parse."

That's what's NOT working. It works when there's white space. It doesn't work when there isn't white space after a paren or a comma (i.e., when there is no white space after, say, the comma, it'll end up reading in ",99.99" as a single string, as opposed to a comma token, which I can throw away, and then a 99.99 token, which I can read in as a double). The white space acts as a token delimiter, which is good. I just can't assume white space. Taking OUT the white space seems like it would make the problem worse.

AssertNull 1,094 Practically a Posting Shark · Answer 2 · 2016-09-17T21:44:44+00:00

White space is not a delimiter and is meaningless for my purposes. It just happened to be the default delimiter for Scanner, so whenever Scanner hit white space, it stopped adding characters to the token. That had the inadvertent effect of making my faulty code identify the tokens the way I wanted, by accident, so I got away with it when there was white space. Just dumb luck. I was just pointing out that it throws an exception immediately WITHOUT any white space, but "works" when there is white space between all tokens. My task is to make it parse correctly for every type of white space occurrence, including none. I was actually reading the Scanner spec wrong and wasn't thinking about delimiters at all. I thought it would read up until the regular expression was hit, lazily eat that regular expression, then stop. It appears that's NOT what it does, so it wasn't that my regular expression was wrong, it's that I was misunderstanding what the function did.
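The behaviour described here can be seen by listing the tokens Scanner produces with its default delimiter. The sketch below is an illustration only; the class name TokenDemo and both helper methods are made up. It relies on the documented behaviour of Scanner.hasNext(String), which reports whether the whole next token matches the pattern.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class TokenDemo {
    // Hypothetical helper: lists the tokens Scanner yields with its default (white space) delimiter.
    static List<String> tokens(String text) {
        List<String> result = new ArrayList<>();
        try (Scanner scan = new Scanner(text)) {
            while (scan.hasNext()) {
                result.add(scan.next());
            }
        }
        return result;
    }

    // Hypothetical helper: true only if the ENTIRE next token matches the regex.
    static boolean nextTokenMatches(String text, String regex) {
        try (Scanner scan = new Scanner(text)) {
            return scan.hasNext(regex);
        }
    }

    public static void main(String[] args) {
        // With spaces everywhere, "(" is a token of its own; without them it is glued to the number.
        for (String token : tokens("( 4.5 , 8.9 ) (76.4, 9)")) {
            System.out.println("\"" + token + "\"");
        }
        System.out.println(nextTokenMatches("( 4.5", "\\("));   // true: the token is "("
        System.out.println(nextTokenMatches("(76.4,", "\\("));  // false: the token is "(76.4,"
    }
}
```

The second point yields the tokens "(76.4," and "9)", so a call such as next("\\(") has no token that is just an open paren to match.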

What seems to work is to make the delimiter be the paren or comma, then read up to that delimiter, then "find" the delimiter itself. Then set the next delimiter, read up to it, then "find it", etc. I imagine there's a better way, but this seems to read in and not reject well-formed point pairs. It may have the opposite problem and NOT reject things that it should reject, though in this case I'm not sure that's a problem (i.e. having extra characters in there will just result in them being thrown away). Here's the code that works (I think. Still testing it. So far, so good).

        try
        {
            for(i = 0; i < 3; i++)
            {
                scan.useDelimiter("\\(");
                scan.findInLine("\\(");
                scan.useDelimiter(",");
                x[i] = Double.parseDouble(scan.next());
                scan.findInLine(",");
                scan.useDelimiter("\\)");                
                y[i] = Double.parseDouble(scan.next());
                scan.findInLine("\\)");
            }

            for(i = 0; i < 3; i++)
            {
                System.out.print("(" + x[i] + "," + y[i] + ")");
            }
        }
        catch(Exception ex)
        {
            System.out.println("malformed");
        }

But I'm actually thinking about doing it the way you suggested and write my own parser. The reason I didn't was that I didn't want to write the actual code for converting a String into a double, but I realize I don't have to. I'll just isolate the substring between the comma and paren, strip it of white space, and call Double.parseDouble like I do above.

JamesCherrill 4,733 Most Valuable Poster Team Colleague Featured Poster · Answer 3 · 2016-09-18T08:02:11+00:00

That all makes sense.
One advantage of the DIY solution is that you get to choose how to deal with malformed input rather than being stuck with whatever uninformative thing Scanner may do.
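A hand-written parser along the lines discussed in the last two posts might look something like the sketch below: it isolates the substrings between the delimiters with indexOf, trims them, and hands them to Double.parseDouble. The class name ManualParse, the method parsePoints, and the choice of IllegalArgumentException with an index in the message are all assumptions, shown only to illustrate choosing your own error handling.

```java
import java.util.ArrayList;
import java.util.List;

public class ManualParse {
    // Hypothetical helper: returns each point as {x, y}; text outside the parens is ignored.
    static List<double[]> parsePoints(String text) {
        List<double[]> points = new ArrayList<>();
        int pos = 0;
        while (true) {
            int open = text.indexOf('(', pos);
            if (open < 0) {
                break; // no more points
            }
            int comma = text.indexOf(',', open + 1);
            int close = text.indexOf(')', open + 1);
            if (comma < 0 || close < 0 || comma > close) {
                throw new IllegalArgumentException("Malformed point starting at index " + open);
            }
            try {
                double x = Double.parseDouble(text.substring(open + 1, comma).trim());
                double y = Double.parseDouble(text.substring(comma + 1, close).trim());
                points.add(new double[] { x, y });
            } catch (NumberFormatException ex) {
                throw new IllegalArgumentException("Bad number in point starting at index " + open, ex);
            }
            pos = close + 1; // continue after this point's close paren
        }
        return points;
    }

    public static void main(String[] args) {
        for (double[] p : parsePoints("(   4.5 , 8.9 ) (76.4,  9)   (67.3  , 0.3) ")) {
            System.out.println(p[0] + ", " + p[1]);
        }
    }
}
```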

AssertNull 1,094 Practically a Posting Shark · Answer 4 · 2016-09-18T22:35:34+00:00

I took your idea and went a little farther with it. In the end, I'm thinking I'll probably go with my own custom function. Again, I originally went with it, then abandoned it because while I was finding the legitimate double STRINGS correctly, I didn't want to write parseDouble. For some reason, it wasn't registering in my head that I could write my own custom pattern-finding function and not worry about regular expressions and still use Java's built-in parseDouble function. Not sure why that didn't occur to me at the time. I guess I was in "all or nothing" mode in my thinking. Anyway, as always, one has to decide how much time to devote to making sure one's code "catches" all the weird stuff people might throw at it to break things intentionally, and figure out what is "good enough".

Anyway, I came up with this. Thanks for the posts. It was the ".*?" that I was missing, and that got me going in the right direction.

        // needs: import java.util.regex.Matcher; import java.util.regex.Pattern;
        String data = "  (-.9,7)(6,.888)(889. , .3  ) (666.43, 8.) (   4.5 ,   18.9  ) ( 76.4 ,  9.9)(67.3  ,0.3)";

        String DOUBLE = "\\-?[\\d\\.]+"; // 0 or 1 minus signs, then 1 or more digits or decimal points (note this allows some illegal values, but parseDouble will catch them)
        String WHITE_DOUBLE_WHITE = "\\s*?" + DOUBLE + "\\s*?"; // 0 or more white-space chars, followed by double, followed by 0 or more white space chars
        String DOUBLE_PAIR = "\\({1}" + WHITE_DOUBLE_WHITE + ",{1}" + WHITE_DOUBLE_WHITE + "\\){1}"; // open paren, double, comma, double, close paren

        Pattern p1 = Pattern.compile(DOUBLE_PAIR); 
        Matcher m1 = p1.matcher(data);
        Pattern p2 = Pattern.compile(DOUBLE); 

        while (m1.find())
        {
            String s1 = m1.group();
            // Look for the two numbers inside the matched pair only. Matching p2 against
            // the whole of data would fall out of step with m1 as soon as a number
            // appeared outside a well-formed pair.
            Matcher m2 = p2.matcher(s1);
            if(!(m2.find()))
                throw new Exception("This should never happen!");
            double d1 = Double.parseDouble(m2.group());
            if(!(m2.find()))
                throw new Exception("This should never happen!");
            double d2 = Double.parseDouble(m2.group());
            System.out.println(d1 + "," + d2 + "\n");
        }

Taywin 312 Posting Virtuoso · Answer 5 · 2016-09-26T19:27:50+00:00

I know that this thread has been marked as solved. I just want to throw in one thing.

You may try the regex "\\-?\\d*\\.\\d*|\\-?\\d+" to capture all the numbers in your list instead of going about it the way you are. What the regex does can be explained in two parts.

\\-? allows a number to start with a - sign. \\d*\\.\\d* captures any number that contains a decimal point, with or without leading or trailing digits. This pattern can go wrong if there is only a dot (.) with no digit next to it (though that would mean the input is malformed). The | separates the two alternative patterns. The second, \\-?\\d+, is plain and simple: it captures a whole number, which may be negative.
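As a quick way to try the regex out, the sketch below collects every match it finds in a string. The class name AlternationDemo and the method numbers are made up for illustration; the pattern is the one given above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AlternationDemo {
    // First alternative: a number containing a decimal point; second: a whole number.
    private static final Pattern NUMBER = Pattern.compile("\\-?\\d*\\.\\d*|\\-?\\d+");

    // Hypothetical helper: returns every substring of data that the pattern matches.
    static List<String> numbers(String data) {
        List<String> found = new ArrayList<>();
        Matcher m = NUMBER.matcher(data);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(numbers("(-.9,7)(6,.888)"));  // [-.9, 7, 6, .888]
        // The caveat mentioned above: a lone dot also matches the first alternative.
        System.out.println(numbers("a . b"));            // [.]
    }
}
```

Note that this finds numbers anywhere in the input, so it does not by itself check that they sit inside well-formed (x,y) pairs.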

AssertNull 1,094 Practically a Posting Shark · Answer 6 · 2016-09-28T00:40:01+00:00

Your example uses the | operator. Mine did not. I think the | operator here offers some real power that I was not harnessing. It's also nice in that you can make your individual regular expressions more human readable by separating each possible legal match by the | operator rather than figuring out one big long one that makes you scratch your head for a while. This helps. Thank you.
