Hello everyone,


I am working on a dictionary application implementation. I have question about a string pattern matching algorithm implementation. In this algorithm, * can be used to match any sequence of characters. The input is string s1 and string s2, the algorithm using s2 to match s1, and the output is the number of possibility matching.

For example, s1 = ABC, s2 = AC*, output is 0;
For example, s1 = ABC, s2 = A*C, output is 1;
For example, s1 = AB, s2 = ***, output is 6 (AB,NULL,NULL
NULL,AB.NULL
NULL,NULL,AB
NULL,A,B
A,NULL,B
A, B, NULL);

Does anyone know what is the most efficient implementation? Currently, I only have a brute-force implementation.


thanks in advance & happy new year,
George

jwenting,

Write a parser to turn the input into a regular expression and run that on the dataset.

Your idea is very helpful! In my application, I can not use regular expression library because I want to make it a small foot-print application. So, I want to build the algorithm from scratch and want to find out an efficient implementation.

Could you help please? Thanks.


regards,
George

The regexp package is part of the core JRE distribution since 1.4.0 so there is no additional installation or download required.

jwenting,

The regexp package is part of the core JRE distribution since 1.4.0 so there is no additional installation or download required.

I fully agree with you. In my specific application, I can not use that regular expression library and I have to implement this string matching algorithm from scratch. Do you have any ideas of an efficient implementation?


regards,
George

why can't you use regexp? It's the natural choice.
If not that you're going to have to use some form of system using indexOf() and keeping track of all the numbers that's returning from repeated calls.
Could probably be written quite efficiently using recursion.

jwenting,

why can't you use regexp? It's the natural choice.
If not that you're going to have to use some form of system using indexOf() and keeping track of all the numbers that's returning from repeated calls.
Could probably be written quite efficiently using recursion.

For a number of reasons, I can not use regexp -- it is up to project lead. And up to me, I am asked to implement from scratch. :-)

I think using indexOf() is a very good idea! Anyway, do you have any idea of an efficient implementation using indexOf() -- not using regexp?


regards,
George

shoot the project lead for his decision making ;)

indexOf will tell you where a substring resides (if anywhere). Call repeatedly on the parts of the string that aren't wildcards and check whether the returned values are in order.
Should be quite efficient, leaving the biggest trick the splitting of the pattern into chunks to feed into your search function.

jwenting,

indexOf will tell you where a substring resides (if anywhere). Call repeatedly on the parts of the string that aren't wildcards and check whether the returned values are in order.

Good idea! I still have some detailed questions about your method, especially "Call repeatedly on the parts of the string that aren't wildcards". Could you provide some sample source codes please?


regards,
George

say your pattern is something like "AB*cd*F*", you'd start by checking whether the string begins with "AB".
If it does, check whether in the substring you get when removing AB from the front of the data you have "cd" at some point.
If it does, remove everything up to and including that "cd" and check for an "F" in the rest.
If that's found, return true. If at any point indexOf() returns -1 (not found) return false.

Something like

public boolean matcher(String data, String[] pattern, boolean startsWith) {
	int idx = data.indexOf(pattern[0]);
	if (idx == -1 || (idx > 0 && startsWith)) {
		return false;
	}
	String newdata = data.substring(idx);
	String[] newPat = new String[pattern.length()-1];
	for (int i=1; i<pattern.length; i++) {
		newPat[i-1] = pattern[i];
	}
	return matcher(newdata, newPat, false);
}

should do the trick.
Mind I've not tested this, and I've excluded the needed bounds checking for brevity.

jwenting,

say your pattern is something like "AB*cd*F*", you'd start by checking whether the string begins with "AB".
If it does, check whether in the substring you get when removing AB from the front of the data you have "cd" at some point.
If it does, remove everything up to and including that "cd" and check for an "F" in the rest.
If that's found, return true. If at any point indexOf() returns -1 (not found) return false.

Something like

public boolean matcher(String data, String[] pattern, boolean startsWith) {
	int idx = data.indexOf(pattern[0]);
	if (idx == -1 || (idx > 0 && startsWith)) {
		return false;
	}
	String newdata = data.substring(idx);
	String[] newPat = new String[pattern.length()-1];
	for (int i=1; i<pattern.length; i++) {
		newPat[i-1] = pattern[i];
	}
	return matcher(newdata, newPat, false);
}

should do the trick.
Mind I've not tested this, and I've excluded the needed bounds checking for brevity.

Thank you very much for your reply! I think your method only checks whether a pattern matches a given string, is that correct? But my problem is that, I want to find out the different ways that a pattern can match a given string.


have a good weekend,
George

yes, I check whether the pattern matches the string at all, not how many different matches there are.
That would mean taking the total number of permutations of the parts of input defined by the wildcards.
With 1 character that's 1 permutation, with 2 you're up to 2, with 3 it's 6, so with n you're up to n! options.
With long input that can quickly get very large indeed.

jwenting,

yes, I check whether the pattern matches the string at all, not how many different matches there are.
That would mean taking the total number of permutations of the parts of input defined by the wildcards.
With 1 character that's 1 permutation, with 2 you're up to 2, with 3 it's 6, so with n you're up to n! options.
With long input that can quickly get very large indeed.

What do you mean "With 1 character that's 1 permutation, with 2 you're up to 2, with 3 it's 6, so with n you're up to n! options."? I can not understand it. Could you please show your idea with some specific samples please?

Anyway, I think your idea is a great one!


regards,
George

What do you mean "With 1 character that's 1 permutation, with 2 you're up to 2, with 3 it's 6, so with n you're up to n! options."? I can not understand it. Could you please show your idea with some specific samples please?

Oh dear,

The are good reasons why using a dataset, rather then a brute force algorithm would be the optimal chose here.

With 1 character that's 1 permutation, with 2 you're up to 2, with 3 it's 6, so with n you're up to n! options.
With long input that can quickly get very large indeed.

Jwenting has pointed out the reason why above.

Here's a quick example.

Say you have a string AB
the perms are:

AB, and BA which is 2.

For ABC, there are 6.

ABC
ACB
BAC
BCA
CBA
CAB

In maths the number of perms can be determined quickly, using the simple rule.

If the length of the string is N then the number of perms is given by:

Nx (N-1) X (N-2) etc.

So that for a word of length four the number of perms is:

4x3x2x1

The rate of growth is a problem, and in most cases, a quicker more efficient algorithm should be sort after.

Unfortunately, your project lead has other ideas. Sack him :eek:

well, it's a rather simple means. Determining in advance what the rules are and finding a simple calculation will do is far easier than trying all possible options and seeing if they fit and coming to the same result ;)
Can't get faster too than that in this case.

iamthwee,

Oh dear,

The are good reasons why using a dataset, rather then a brute force algorithm would be the optimal chose here.

Jwenting has pointed out the reason why above.

Here's a quick example.

Say you have a string AB
the perms are:

For ABC, there are 6.

In maths the number of perms can be determined quickly, using the simple rule.

If the length of the string is N then the number of perms is given by:

So that for a word of length four the number of perms is:

The rate of growth is a problem, and in most cases, a quicker more efficient algorithm should be sort after.

Unfortunately, your project lead has other ideas. Sack him :eek:

Do you mean if the input string (which we will use pattern to match) is ABC, we should at first permutate it to (ABC, ACB, BAC, BCA, CBA, CAB)?

I do not know why you should do so, since the question is to find in how many different approaches a pattern (containing '*') can match an input string.


regards,
George

jwenting,

well, it's a rather simple means. Determining in advance what the rules are and finding a simple calculation will do is far easier than trying all possible options and seeing if they fit and coming to the same result ;)
Can't get faster too than that in this case.

Thank you very much for your reply! I understand your points. The current issue is that, I do not understand the relationship between your permutation algorithm and my question.

For example, do you mean if the input string (which we will use pattern to match) is ABC, we should at first permutate it to (ABC, ACB, BAC, BCA, CBA, CAB)?

I do not know why you should do so, since the question is to find in how many different approaches a pattern (containing '*') can match an input string.


regards,
George

No, if the input string is ABC and the pattern is A*, you have 2 characters left over which means there are 2! = 1*2 = 2 possible strings that match.
If you have a string ABCDEF and you have a pattern *CD* you have 4! ((2+2)!) matches.

That's assuming that the matches need to be made up of the characters in the input string with the fixed ones as defined in the pattern.
If the input string can be anything as long as the pattern is in it the number of possible options is always infinity or zero depending on whether the fixed part(s) of the pattern exist or not.

jwenting,

No, if the input string is ABC and the pattern is A*, you have 2 characters left over which means there are 2! = 1*2 = 2 possible strings that match.
If you have a string ABCDEF and you have a pattern *CD* you have 4! ((2+2)!) matches.

I think it should be 2!*2!.

That's assuming that the matches need to be made up of the characters in the input string with the fixed ones as defined in the pattern.

I agree. I am just wondering how to use program to implement it.

If the input string can be anything as long as the pattern is in it the number of possible options is always infinity or zero depending on whether the fixed part(s) of the pattern exist or not.

What means "If the input string can be anything as long as the pattern is in it"? Could you show an example please?


regards,
George

It should be n!, not (n!)^n.
If each of the n characters can be used in any of the positions irrespective of whether it was already used at another position it would be (m!)^m where m is the number of distinct characters in the total set of n characters, but I assume you have n distinct characters each of which has to be used only once.
To be completely accurate you have to calculate m here as well by removing duplicates.

To determine n, you only need the length of the input string minus the length of the fixed part(s).
Determining m is harder, as you'll have to take the non-fixed part of the input and determine the number of distinct characters in it.
Not hard, but takes a little work. Using a HashMap of Character objects will do nicely. m will be the size of the HashMap.

What means "If the input string can be anything as long as the pattern is in it"? Could you show an example please?[\quote]

If the pattern is ABC*, both ABC1234 and ABC123456 are valid.
Now if your input were ABC1234 you'd have 4! options if the valid string length is restricted to the length of the input string.
If however it is not restricted to that (7 long in this case), the longer string is also valid, which would yield a total number of 6!+5!+4! options for just that length and any length at least as long as ABC1234.
Logically then the number of options would be infinity as the maximum length of the valid resulting string would be infinity (of course the computer would be unable to handle such a long string, but we're talking mathematics here).
Your number of results would be the SIGMA(n!) where n between 4 and infinity inclusive.

i need to extract the number part from the string which looks like "(7 r 8digits).txt". i need to store this found string in some variable, but i am confused wih the groups twhether o get the string as group() or group(1).

Pattern p = Pattern.compile("\b[0-9]{7,8}\b");
Matcher matcher = p.matcher("num"); //where num looks like10986745.txt r 0986745.txt
String identifier = matcher.group();
System.out.println(identifier);
But i am getting the output as no match found.
So it would be very helpful if some one would clear this

i need to extract the number part from the string which looks like "(7 r 8digits).txt". i need to store this found string in some variable, but i am confused wih the groups twhether o get the string as group() or group(1).

Pattern p = Pattern.compile("\b[0-9]{7,8}\b");
Matcher matcher = p.matcher("num"); //where num looks like10986745.txt r 0986745.txt
String identifier = matcher.group();
System.out.println(identifier);
But i am getting the output as no match found.
So it would be very helpful if some one would clear this

I deleted your PM. Start a new thread

This article has been dead for over six months. Start a new discussion instead.