Hi,

I'm trying to figure out the best way to design a regular expression class whose instances can have a parent or children.

I want to develop a generic regex extractor for text files.

Example:
- An HTML file has a table.
- Each table has some data (let's say classes).
- Each class has some properties.
- Each property can have multiple values (an array).
- And so on.

We need:
- A regex to extract each class, which is a subtable in the main table.
- Regexes for each property; the properties are rows.
- Regexes for each value in an array.

You see the pattern. So I need a recursive class or something like that.

Does someone have an idea of what could be a good design?

Thanks

So, is a class represented by a row in the table? If so, easy: just look for the tr tags.
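A minimal sketch of "just look for the tr tags", in Python. The sample markup and the name ROW_RE are made up for illustration. This assumes the rows are not nested inside each other; with nested tables a non-greedy regex like this one stops at the first closing tag it meets.

```python
import re

# Non-greedy match between <tr ...> and </tr>.
# DOTALL lets a row span several lines; IGNORECASE accepts <TR> as well.
ROW_RE = re.compile(r"<tr\b[^>]*>(.*?)</tr>", re.DOTALL | re.IGNORECASE)

# Made-up sample input.
sample = "<table><tr><td>A</td></tr>\n<TR class='x'><td>B</td>\n</TR></table>"

# findall returns the content of group 1 for every row found.
rows = [row.strip() for row in ROW_RE.findall(sample)]
```

Each entry in rows is the inner markup of one row, which can then be handed to the next regex.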

What the fuck is a recursive class? What the fuck are you talking about?

Perhaps he means a linked list?

Sort of a linked list. It's not only a question of <tr> tags. Inside each <tr> there could be other sets of values I need to extract; inside these values there might be other values, and so on.

So a regular expression could carry a set of other regular expressions.

Algorithm:

matches m_parent = regex_parent.match(text)
foreach (x in m_parent)
{
   load the set of sub_regexes
   foreach (r in sub_regexes)
   {
      matches m_child = r.match(x)
      ...
      load set of sub_sub...
      ... and so on
   }
}
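The nesting above can be sketched as a small tree in Python, where each node holds one pattern and a list of child nodes. This is only one possible design, not a definitive one. The class name RegexNode, the field names and the sample table are all hypothetical, and the sketch assumes that capture group 1 (when the pattern has one) is the text passed down to the children.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class RegexNode:
    """One level of the extraction tree (hypothetical design)."""
    name: str
    pattern: str
    children: List["RegexNode"] = field(default_factory=list)

    def extract(self, text: str) -> List[Dict]:
        results = []
        for m in re.finditer(self.pattern, text, re.DOTALL | re.IGNORECASE):
            # Pass group 1 down if the pattern defines it, else the whole match.
            inner = m.group(1) if m.re.groups else m.group(0)
            node = {"name": self.name, "text": inner.strip(), "children": []}
            # Recurse: every child regex runs only on this match's text.
            for child in self.children:
                node["children"].extend(child.extract(inner))
            results.append(node)
        return results


# Made-up input: rows contain cells, cells may contain "key: value" pairs.
html = """
<table>
  <tr><td>Dog</td><td>legs: 4; sound: woof</td></tr>
  <tr><td>Bird</td><td>legs: 2; sound: tweet</td></tr>
</table>
"""

tree = RegexNode("row", r"<tr>(.*?)</tr>", [
    RegexNode("cell", r"<td>(.*?)</td>", [
        RegexNode("property", r"(\w+:\s*\w+)"),
    ]),
])

rows = tree.extract(html)
```

Here rows is a list of nested dictionaries that mirrors the tree of regexes: rows, then cells, then properties. Adding another level only means adding another child node.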

An application of this extractor could be extracting the results of a Google query. There are blocks on the page, and in each block there's some info.
The same application could work with Yahoo, The Pirate Bay, etc. Only the regex file would need to change.
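To illustrate the "only the regex file changes" idea, here is a rough Python sketch where the tree of patterns is plain JSON data and the extraction code never changes. The JSON layout, the key names (name, pattern, children), the function extract and the sample page are all assumptions made for the example; a real site would need its own patterns.

```python
import json
import re

# Hypothetical "regex file" for one site; another site would ship its own.
SITE_A = json.loads("""
{
  "name": "result",
  "pattern": "<li>(.*?)</li>",
  "children": [
    {"name": "title", "pattern": "<b>(.*?)</b>"},
    {"name": "url", "pattern": "<i>(.*?)</i>"}
  ]
}
""")


def extract(spec, text):
    """Apply spec["pattern"] to text, then each child spec to every match."""
    out = []
    for m in re.finditer(spec["pattern"], text, re.DOTALL):
        inner = m.group(1)  # assumes every pattern defines capture group 1
        fields = {
            child["name"]: extract(child, inner)
            for child in spec.get("children", [])
        }
        out.append({"text": inner, "fields": fields})
    return out


# Made-up page with two result blocks.
page = (
    "<ul>"
    "<li><b>First hit</b> <i>http://example.com/a</i></li>"
    "<li><b>Second hit</b> <i>http://example.com/b</i></li>"
    "</ul>"
)

results = extract(SITE_A, page)
```

Supporting another site would then mean writing a second JSON file with different patterns and calling extract with it.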


Rashakil: Please stay polite; your answer is very unprofessional.

I am a professional programmer, so that means my answer is by definition professional :P

Ok then... as a professional answer it wasn't useful.

Sort of a linked list. It's not only a question of <tr> tags. Inside each <tr> there could be other sets of values I need to extract; inside these values there might be other values, and so on.

Well, the regular expression would handle that just fine...

I think it would help if you specified more precisely what you expect the input text to be, and gave examples.

I want to create an application that could extract any structured data. Kind of a generic parser.

Examples:
- Google results
- CNN news
- Forums
- Engadget
- ...

All these websites have structured data, but each of them is structured differently. It could be easy to extract data from them using a structured tree of regular expressions.
