i need to search for a pattern match in an array of strings, where each string is a list of symbols and a given string in the array may match to only one symbol in the pattern.

So suppose we have a pattern as A+B*C and we have the string array as

[1] ACD
[2] BC
[3] ADE
[4] BAE
[5] CD
[6] DE

So above we have a match starting from string [3], as AAC (or ABC). i need to find the longest such match.

i think of one approach as to construct strings taking different combination of symbols (one from each string) and compare it against the pattern.

So starting form 1st string ("ACD") we take up A and match it against the pattern. it is a partial match so we go ahead and take B from second string ("BC"). So we now have "AB" which is a partial match. So we go ahead to string 3 ("ADE") and try each of its symbols for a partial match. But it fails so we drop the current path and try again taking different symbol from the first string, and so on. i think matcher.hitEnd() can help in detecting partial match.

While this approach may work, it will be inefficient for large number of strings (and symbols). i have such strings numbering in thousands (taken from a database table). It will be great if some one can point out more efficient approach to this.

Recommended Answers

All 6 Replies

have a pattern as A+B*C

How is that a pattern? Can you define a pattern?
Is xe=+@2 a pattern?

And this is confusing:

have a match starting from string [3], as AAC

Is [3] ADE

Am I correct in assuming that the regular expression like pattern "A+B*C" means "one or more strings each containing the letter 'A', followed by zero or more strings each containing the letter 'B', followed by one string containing the letter 'C'."?

So "AC" would match '[1] ACD, [2] BC'. And '[3] ADE, [4] BAE, [5] CD' would be "ABC" or "AAC" (or both). So '[4] BAE, [5] CD' is "AC". So that's 3 or 4 matches in the example, depending on how you count them.

Is the "A+B*C" pattern fixed at design time (when you write the program) or is it not known until runtime. And if not known until runtime, what subset of "regular expression syntax" do you intend to support?

You are correct about the regular expression. However, the OP doesn't match any particular string from the list. He needs to read the list "vertically", top to bottom, extracting exactly one letter from each string, in such a way that the resulting string would match the pattern.
That is how AAC in his example is formed: A from [3], A from [4], and C from [5].

Honestly, I do not see any hope for a more efficient approach than OP described.

Ugh!! regular expressions.

How is that a pattern? Can you define a pattern?
Is xe=+@2 a pattern?

And this is confusing:

Is [3] ADE

sorry for not being clear. By pattern i meant a regular expression here, like A+B*C.

nezachems reply precisely explains what i am trying to do.

Am I correct in assuming that the regular expression like pattern "A+B*C" means "one or more strings each containing the letter 'A', followed by zero or more strings each containing the letter 'B', followed by one string containing the letter 'C'."?

Yes.

So "AC" would match '[1] ACD, [2] BC'. And '[3] ADE, [4] BAE, [5] CD' would be "ABC" or "AAC" (or both). So '[4] BAE, [5] CD' is "AC". So that's 3 or 4 matches in the example, depending on how you count them.

Right. Sorry for the confusion, just a random example i came up with.

Is the "A+B*C" pattern fixed at design time (when you write the program) or is it not known until runtime. And if not known until runtime, what subset of "regular expression syntax" do you intend to support?

pattern is not known until runtime. i intend to support the following operators in the regular expressions with their usual semantics - *, + , | and ?. Also patterns can be nested like ((A+B)*C).

Be a part of the DaniWeb community

We're a friendly, industry-focused community of developers, IT pros, digital marketers, and technology enthusiasts meeting, networking, learning, and sharing knowledge.