Question about string pattern matching

Question

George2 0 Junior Poster

19 Years Ago

Hello everyone,

I am working on a dictionary application implementation. I have question about a string pattern matching algorithm implementation. In this algorithm, * can be used to match any sequence of characters. The input is string s1 and string s2, the algorithm using s2 to match s1, and the output is the number of possibility matching.

For example, s1 = ABC, s2 = AC*, output is 0;
For example, s1 = ABC, s2 = A*C, output is 1;
For example, s1 = AB, s2 = ***, output is 6 (AB,NULL,NULL
NULL,AB.NULL
NULL,NULL,AB
NULL,A,B
A,NULL,B
A, B, NULL);

Does anyone know what is the most efficient implementation? Currently, I only have a brute-force implementation.

thanks in advance & happy new year,
George

algorithm java

5 Contributors
21 Replies
447 Views
3 Years Discussion Span
Latest Post 16 Years Ago Latest Post by javaAddict

All 21 Replies

jwenting 1,905 duckman

19 Years Ago

Write a parser to turn the input into a regular expression and run that on the dataset.

jwenting 1,905 duckman

19 Years Ago

The regexp package is part of the core JRE distribution since 1.4.0 so there is no additional installation or download required.

jwenting 1,905 duckman

19 Years Ago

why can't you use regexp? It's the natural choice.
If not that you're going to have to use some form of system using indexOf() and keeping track of all the numbers that's returning from repeated calls.
Could probably be written quite efficiently using recursion.

jwenting 1,905 duckman

19 Years Ago

shoot the project lead for his decision making ;)

indexOf will tell you where a substring resides (if anywhere). Call repeatedly on the parts of the string that aren't wildcards and check whether the returned values are in order.
Should be quite efficient, leaving the biggest trick the splitting of the pattern into chunks to feed into your search function.

jwenting 1,905 duckman

19 Years Ago

say your pattern is something like "AB*cd*F*", you'd start by checking whether the string begins with "AB".
If it does, check whether in the substring you get when removing AB from the front of the data you have "cd" at some point.
If it does, remove everything up to and including that "cd" and check for an "F" in the rest.
If that's found, return true. If at any point indexOf() returns -1 (not found) return false.

Something like

public boolean matcher(String data, String[] pattern, boolean startsWith) {
	int idx = data.indexOf(pattern[0]);
	if (idx == -1 || (idx > 0 && startsWith)) {
		return false;
	}
	String newdata = data.substring(idx);
	String[] newPat = new String[pattern.length()-1];
	for (int i=1; i<pattern.length; i++) {
		newPat[i-1] = pattern[i];
	}
	return matcher(newdata, newPat, false);
}

should do the trick.
Mind I've not tested this, and I've excluded the needed bounds checking for brevity.

Reply to this topic

Be a part of the DaniWeb community

We're a friendly, industry-focused community of developers, IT pros, digital marketers, and technology enthusiasts meeting, networking, learning, and sharing knowledge.

George2 0 Junior Poster · Answer 1 · 2006-01-02T17:45:40+00:00

jwenting,

Write a parser to turn the input into a regular expression and run that on the dataset.

Your idea is very helpful! In my application, I can not use regular expression library because I want to make it a small foot-print application. So, I want to build the algorithm from scratch and want to find out an efficient implementation.

Could you help please? Thanks.

regards,
George

George2 0 Junior Poster · Answer 2 · 2006-01-03T17:07:35+00:00

jwenting,

The regexp package is part of the core JRE distribution since 1.4.0 so there is no additional installation or download required.

I fully agree with you. In my specific application, I can not use that regular expression library and I have to implement this string matching algorithm from scratch. Do you have any ideas of an efficient implementation?

regards,
George

George2 0 Junior Poster · Answer 3 · 2006-01-04T09:29:00+00:00

jwenting,

why can't you use regexp? It's the natural choice.
If not that you're going to have to use some form of system using indexOf() and keeping track of all the numbers that's returning from repeated calls.
Could probably be written quite efficiently using recursion.

For a number of reasons, I can not use regexp -- it is up to project lead. And up to me, I am asked to implement from scratch. :-)

I think using indexOf() is a very good idea! Anyway, do you have any idea of an efficient implementation using indexOf() -- not using regexp?

regards,
George

George2 0 Junior Poster · Answer 4 · 2006-01-05T13:45:55+00:00

jwenting,

indexOf will tell you where a substring resides (if anywhere). Call repeatedly on the parts of the string that aren't wildcards and check whether the returned values are in order.

Good idea! I still have some detailed questions about your method, especially "Call repeatedly on the parts of the string that aren't wildcards". Could you provide some sample source codes please?

regards,
George

George2 0 Junior Poster · Answer 5 · 2006-01-06T16:00:56+00:00

jwenting,

say your pattern is something like "AB*cd*F*", you'd start by checking whether the string begins with "AB".
If it does, check whether in the substring you get when removing AB from the front of the data you have "cd" at some point.
If it does, remove everything up to and including that "cd" and check for an "F" in the rest.
If that's found, return true. If at any point indexOf() returns -1 (not found) return false.
Something like
public boolean matcher(String data, String[] pattern, boolean startsWith) {
	int idx = data.indexOf(pattern[0]);
	if (idx == -1 || (idx > 0 && startsWith)) {
		return false;
	}
	String newdata = data.substring(idx);
	String[] newPat = new String[pattern.length()-1];
	for (int i=1; i<pattern.length; i++) {
		newPat[i-1] = pattern[i];
	}
	return matcher(newdata, newPat, false);
}
should do the trick.
Mind I've not tested this, and I've excluded the needed bounds checking for brevity.

Thank you very much for your reply! I think your method only checks whether a pattern matches a given string, is that correct? But my problem is that, I want to find out the different ways that a pattern can match a given string.

have a good weekend,
George

jwenting 1,905 duckman Team Colleague · Answer 6 · 2006-01-06T17:02:52+00:00

yes, I check whether the pattern matches the string at all, not how many different matches there are.
That would mean taking the total number of permutations of the parts of input defined by the wildcards.
With 1 character that's 1 permutation, with 2 you're up to 2, with 3 it's 6, so with n you're up to n! options.
With long input that can quickly get very large indeed.

George2 0 Junior Poster · Answer 7 · 2006-01-07T13:24:35+00:00

jwenting,

yes, I check whether the pattern matches the string at all, not how many different matches there are.
That would mean taking the total number of permutations of the parts of input defined by the wildcards.
With 1 character that's 1 permutation, with 2 you're up to 2, with 3 it's 6, so with n you're up to n! options.
With long input that can quickly get very large indeed.

What do you mean "With 1 character that's 1 permutation, with 2 you're up to 2, with 3 it's 6, so with n you're up to n! options."? I can not understand it. Could you please show your idea with some specific samples please?

Anyway, I think your idea is a great one!

regards,
George

iamthwee · Answer 8 · 2006-01-07T20:50:07+00:00

What do you mean "With 1 character that's 1 permutation, with 2 you're up to 2, with 3 it's 6, so with n you're up to n! options."? I can not understand it. Could you please show your idea with some specific samples please?

Oh dear,

The are good reasons why using a dataset, rather then a brute force algorithm would be the optimal chose here.

With 1 character that's 1 permutation, with 2 you're up to 2, with 3 it's 6, so with n you're up to n! options.
With long input that can quickly get very large indeed.

Jwenting has pointed out the reason why above.

Here's a quick example.

Say you have a string AB
the perms are:

AB, and BA which is 2.

For ABC, there are 6.

ABC
ACB
BAC
BCA
CBA
CAB

In maths the number of perms can be determined quickly, using the simple rule.

If the length of the string is N then the number of perms is given by:

Nx (N-1) X (N-2) etc.

So that for a word of length four the number of perms is:

4x3x2x1

The rate of growth is a problem, and in most cases, a quicker more efficient algorithm should be sort after.

Unfortunately, your project lead has other ideas. Sack him :eek:

jwenting 1,905 duckman Team Colleague · Answer 9 · 2006-01-07T23:25:24+00:00

well, it's a rather simple means. Determining in advance what the rules are and finding a simple calculation will do is far easier than trying all possible options and seeing if they fit and coming to the same result ;)
Can't get faster too than that in this case.

George2 0 Junior Poster · Answer 10 · 2006-01-08T12:33:15+00:00

iamthwee,

Oh dear,
The are good reasons why using a dataset, rather then a brute force algorithm would be the optimal chose here.
Jwenting has pointed out the reason why above.
Here's a quick example.
Say you have a string AB
the perms are:
For ABC, there are 6.
In maths the number of perms can be determined quickly, using the simple rule.
If the length of the string is N then the number of perms is given by:
So that for a word of length four the number of perms is:
The rate of growth is a problem, and in most cases, a quicker more efficient algorithm should be sort after.
Unfortunately, your project lead has other ideas. Sack him :eek:

Do you mean if the input string (which we will use pattern to match) is ABC, we should at first permutate it to (ABC, ACB, BAC, BCA, CBA, CAB)?

I do not know why you should do so, since the question is to find in how many different approaches a pattern (containing '*') can match an input string.

regards,
George

George2 0 Junior Poster · Answer 11 · 2006-01-08T12:35:23+00:00

jwenting,

well, it's a rather simple means. Determining in advance what the rules are and finding a simple calculation will do is far easier than trying all possible options and seeing if they fit and coming to the same result ;)
Can't get faster too than that in this case.

Thank you very much for your reply! I understand your points. The current issue is that, I do not understand the relationship between your permutation algorithm and my question.

For example, do you mean if the input string (which we will use pattern to match) is ABC, we should at first permutate it to (ABC, ACB, BAC, BCA, CBA, CAB)?

I do not know why you should do so, since the question is to find in how many different approaches a pattern (containing '*') can match an input string.

regards,
George

jwenting 1,905 duckman Team Colleague · Answer 12 · 2006-01-08T13:46:26+00:00

No, if the input string is ABC and the pattern is A*, you have 2 characters left over which means there are 2! = 1*2 = 2 possible strings that match.
If you have a string ABCDEF and you have a pattern *CD* you have 4! ((2+2)!) matches.

That's assuming that the matches need to be made up of the characters in the input string with the fixed ones as defined in the pattern.
If the input string can be anything as long as the pattern is in it the number of possible options is always infinity or zero depending on whether the fixed part(s) of the pattern exist or not.

George2 0 Junior Poster · Answer 13 · 2006-01-09T13:55:28+00:00

jwenting,

No, if the input string is ABC and the pattern is A*, you have 2 characters left over which means there are 2! = 1*2 = 2 possible strings that match.
If you have a string ABCDEF and you have a pattern *CD* you have 4! ((2+2)!) matches.

I think it should be 2!*2!.

That's assuming that the matches need to be made up of the characters in the input string with the fixed ones as defined in the pattern.

I agree. I am just wondering how to use program to implement it.

If the input string can be anything as long as the pattern is in it the number of possible options is always infinity or zero depending on whether the fixed part(s) of the pattern exist or not.

What means "If the input string can be anything as long as the pattern is in it"? Could you show an example please?

regards,
George

jwenting 1,905 duckman Team Colleague · Answer 14 · 2006-01-09T17:42:22+00:00

It should be n!, not (n!)^n.
If each of the n characters can be used in any of the positions irrespective of whether it was already used at another position it would be (m!)^m where m is the number of distinct characters in the total set of n characters, but I assume you have n distinct characters each of which has to be used only once.
To be completely accurate you have to calculate m here as well by removing duplicates.

To determine n, you only need the length of the input string minus the length of the fixed part(s).
Determining m is harder, as you'll have to take the non-fixed part of the input and determine the number of distinct characters in it.
Not hard, but takes a little work. Using a HashMap of Character objects will do nicely. m will be the size of the HashMap.

What means "If the input string can be anything as long as the pattern is in it"? Could you show an example please?[\quote]
If the pattern is ABC*, both ABC1234 and ABC123456 are valid.
Now if your input were ABC1234 you'd have 4! options if the valid string length is restricted to the length of the input string.
If however it is not restricted to that (7 long in this case), the longer string is also valid, which would yield a total number of 6!+5!+4! options for just that length and any length at least as long as ABC1234.
Logically then the number of options would be infinity as the maximum length of the valid resulting string would be infinity (of course the computer would be unable to handle such a long string, but we're talking mathematics here).
Your number of results would be the SIGMA(n!) where n between 4 and infinity inclusive.

aru211285 0 Newbie Poster · Answer 15 · 2009-06-09T19:36:06+00:00

i need to extract the number part from the string which looks like "(7 r 8digits).txt". i need to store this found string in some variable, but i am confused wih the groups twhether o get the string as group() or group(1).

Pattern p = Pattern.compile("\b[0-9]{7,8}\b");
Matcher matcher = p.matcher("num"); //where num looks like10986745.txt r 0986745.txt
String identifier = matcher.group();
System.out.println(identifier);
But i am getting the output as no match found.
So it would be very helpful if some one would clear this

javaAddict 900 Nearly a Senior Poster Team Colleague Featured Poster · Answer 16 · 2009-06-09T20:09:13+00:00

i need to extract the number part from the string which looks like "(7 r 8digits).txt". i need to store this found string in some variable, but i am confused wih the groups twhether o get the string as group() or group(1).
Pattern p = Pattern.compile("\b[0-9]{7,8}\b");
Matcher matcher = p.matcher("num"); //where num looks like10986745.txt r 0986745.txt
String identifier = matcher.group();
System.out.println(identifier);
But i am getting the output as no match found.
So it would be very helpful if some one would clear this

I deleted your PM. Start a new thread

Question about string pattern matching

Recommended Answers Collapse Answers

All 21 Replies

Recommended Answers