I am trying to extract the important text, and only the important text, from an RSS feed. I have determined that all of the items in the feed have the important text underneath a tag that starts

<div id="important text"> ...

In some of the files the important text is under <p> tags that are all children of the above <div> tag. In others all the important text is under the <div> tags that are children of the <div id="important text"> tag. It is easy enough to extract the text, but there are cases where the formatting gets messed up. These are cases where all the important text, while under the "first level" div tags, is actually inside <div> tags that are children of <div> tags that are children of the <div id="important text"> tag. Sometimes the important text is actually in great-grandchildren.

In order to select the tags where it is useful to put in line breaks, I thought to count the number of text characters that live within "first level" div tags but do not live within "second level" div tags, those that live within "second level" div tags but not within "third level" div tags, etc. I am enough of a mathematician to realize that all I really need to do is to find out how many text characters live within first level div tags, how many live within second level div tags, how many live within third level div tags, etc., since the "within this level but not the next" counts are just the differences between consecutive totals.

To cut a long story short, I would be quite happy to count the number of text characters contained within a specific tag and all of its descendants. How do I do that?

I promise to keep reading and trying to solve it myself while you are thinking about helping me.
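
For what it is worth, here is a minimal XSLT 1.0 sketch of the counting step. It relies on the fact that the string-value of an element is the concatenation of all of its descendant text nodes, so string-length() on the element already covers the tag and everything below it. It assumes the elements are in no namespace and that the container can be located by id="important text"; the output labels are made up for illustration.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <xsl:template match="/">
    <!-- Assumption: the container is found anywhere in the input by its id. -->
    <xsl:variable name="important" select="//div[@id='important text']"/>

    <!-- Counts every character in all descendant text nodes, whitespace included. -->
    <xsl:text>all characters: </xsl:text>
    <xsl:value-of select="string-length($important)"/>
    <xsl:text>&#10;</xsl:text>

    <!-- Same count after deleting spaces, tabs, line feeds, and carriage returns. -->
    <xsl:text>non-whitespace characters: </xsl:text>
    <xsl:value-of select="string-length(translate($important, ' &#9;&#10;&#13;', ''))"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>
</xsl:stylesheet>
```

If several elements match, string-length() uses only the first one in document order, and if none match it reports 0.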

Can you give a simple example document of what you're dealing with and what you want the desired output to be? Also, XSLT 1 or 2?

I normally use XSLT 1.0, but in case of emergency I can trot out XSLT 2.0. For the moment I would rather not, but will if there is a good reason to. I will try to give an example that will bring out the true difficulty with the situation. It would be clearer if I gave more than one example, since I have to be able to distinguish within XSLT between the different possible paradigms that are used in the XML file that I receive. Here are the basic issues:
1) I am only interested in text contained in the tag <div id="important text">
2) In some of the XML files all the text that I want is contained in text nodes that are descendants of tags of the type <div id="important text">/<div>
3) In other cases all the text of interest is in text nodes that are descendants of tags of the type <div id="important text">/<div>/<div>
4) In other cases all the text of interest is in text nodes that are children of tags of the type <div id="important text">/<p>
5) In other cases all the text of interest is in text nodes that are descendants of tags of the type <div id="important text">/<div>/<div>/<div>

These are the only types I have seen so far. I need to determine which of these types our current xml file belongs to and process it accordingly.

My ideas for working are:
1) Count the number of characters of text we would get if the file were of each of the types, figure out which type gives the maximum number of characters, and then process the file as if it were of that type.
2) Count the number of tags of each type described in the list of 5 types and process the file as if it were of the type that has the maximal number of tags.
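
Both ideas can be sketched in XSLT 1.0 roughly as follows. This is only an illustration under some assumptions: the elements are in no namespace, whitespace is ignored so that the per-level counts subtract cleanly, and the variable names and output labels are invented here. Since XSLT 1.0 has no way to sum string lengths over a node-set directly, the sketch concatenates the text of each level into a variable and measures that.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <xsl:template match="/">
    <xsl:variable name="important" select="//div[@id='important text']"/>
    <!-- Characters to ignore: space, tab, line feed, carriage return. -->
    <xsl:variable name="ws" select="' &#9;&#10;&#13;'"/>

    <!-- Concatenate the text found under each candidate level. -->
    <xsl:variable name="textP">
      <xsl:for-each select="$important/p"><xsl:value-of select="."/></xsl:for-each>
    </xsl:variable>
    <xsl:variable name="text1">
      <xsl:for-each select="$important/div"><xsl:value-of select="."/></xsl:for-each>
    </xsl:variable>
    <xsl:variable name="text2">
      <xsl:for-each select="$important/div/div"><xsl:value-of select="."/></xsl:for-each>
    </xsl:variable>
    <xsl:variable name="text3">
      <xsl:for-each select="$important/div/div/div"><xsl:value-of select="."/></xsl:for-each>
    </xsl:variable>

    <!-- Cumulative non-whitespace character counts per level. -->
    <xsl:variable name="lenP" select="string-length(translate($textP, $ws, ''))"/>
    <xsl:variable name="len1" select="string-length(translate($text1, $ws, ''))"/>
    <xsl:variable name="len2" select="string-length(translate($text2, $ws, ''))"/>
    <xsl:variable name="len3" select="string-length(translate($text3, $ws, ''))"/>

    <!-- Idea 1: characters that live at one level but not at the next one down. -->
    <xsl:text>p characters: </xsl:text><xsl:value-of select="$lenP"/>
    <xsl:text>&#10;level 1 only: </xsl:text><xsl:value-of select="$len1 - $len2"/>
    <xsl:text>&#10;level 2 only: </xsl:text><xsl:value-of select="$len2 - $len3"/>
    <xsl:text>&#10;level 3 and deeper: </xsl:text><xsl:value-of select="$len3"/>

    <!-- Idea 2: number of tags of each type. -->
    <xsl:text>&#10;p tags: </xsl:text><xsl:value-of select="count($important/p)"/>
    <xsl:text>&#10;level 1 divs: </xsl:text><xsl:value-of select="count($important/div)"/>
    <xsl:text>&#10;level 2 divs: </xsl:text><xsl:value-of select="count($important/div/div)"/>
    <xsl:text>&#10;level 3 divs: </xsl:text><xsl:value-of select="count($important/div/div/div)"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>
</xsl:stylesheet>
```

Picking the winning type would then be a matter of comparing these numbers in an xsl:choose and applying the templates for that level.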

<div id="important text">
<div><span id="delved">This is some text.</span></div>
<div></div>
<div id="yet more stuff">
<div>This is the season<em>of</em>our<i>discombobulation</i></div>
<div>The sum of human knowledge<weird>is an integration problem</weird>.</div>
</div>
<div>Oh look a little bit of text.</div>
</div>

I would like the output to look like
<Para>
This is the season of our discombobulation
</Para>
<Para>
The sum of human knowledge is an integration problem.
</Para>
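
A rough XSLT 1.0 sketch of the output step, assuming the file has already been identified as the "second level" type and that the input is well-formed with elements in no namespace. The Paras wrapper element is hypothetical, added only so the result has a single root. Each text node is followed by a space before normalizing, so that words split across inline tags such as <em> do not run together.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" omit-xml-declaration="yes" indent="yes"/>

  <xsl:template match="/">
    <!-- Hypothetical wrapper so the result is a well-formed document. -->
    <Paras>
      <xsl:apply-templates select="//div[@id='important text']/div/div"/>
    </Paras>
  </xsl:template>

  <xsl:template match="div">
    <!-- Join all descendant text nodes, separated by single spaces. -->
    <xsl:variable name="joined">
      <xsl:for-each select=".//text()">
        <xsl:value-of select="."/>
        <xsl:text> </xsl:text>
      </xsl:for-each>
    </xsl:variable>
    <!-- Skip divs that contain no visible text. -->
    <xsl:if test="normalize-space($joined)">
      <Para><xsl:value-of select="normalize-space($joined)"/></Para>
    </xsl:if>
  </xsl:template>
</xsl:stylesheet>
```

One known wrinkle with this approach: punctuation that directly follows an inline element, such as the period after </weird>, ends up with a space in front of it, so it would need extra handling to match the desired output exactly.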
