Hello all,

I'm hoping someone here might be able to provide some assistance. Let me see if I can describe the general problem I have and perhaps someone can offer some ideas on the best way to approach solving it.

I've got a text file that is basically a dump of thousands of other text files pasted one after another. I have no choice in how I receive this data, just to make that clear; this is how it comes and I have to make it work. What I need to do is separate each individual chunk of file text into its own separate file. Here is an example, formatted for your convenience, of the data in the large text file:

AB1234
TITLE: News Headline 1
SOURCE: USA Today
TEXT: News article here
123489
TITLE: News Headline 2
SOURCE: Newsweek
TEXT: News article 2

So that is the general idea, greatly simplified but still conveying the necessary info. Each article is preceded by an identifier that sometimes starts with two letters and sometimes does not, followed by some random number of digits. After that identifier, I believe TITLE always appears next. So, knowing this, what I'm trying to do is pull out each article and place it into its own individual file, along with the preceding identifier if possible.

Currently this is done with a macro in Word, but it takes hours, leaves out data like the identifier, and does not always account for all the data or for data integrity problems. I'm thinking a Perl script could do this much more easily and quickly, though I could easily be wrong about that. I haven't used Perl in a while, so I'm trying to get this to work.

For now I'm just trying to get the regexp to work. What I have seems like it might be on the right track, but it does not work entirely:

/[a-z]{0,2}[0-9]+\s+TITLE:(?!TITLE)/g

I put each match from this into another file temporarily, just to see what gets returned. Currently this returns:

1234TITLE:
123489TITLE:

So not only is it missing the AB from the first identifier, but it's also missing all the formatting and, of course, the rest of the text. Anyway, if anyone has suggestions I would greatly appreciate it. Or if someone knows of an even smarter way to do this that would be faster and more efficient, I'm all ears. There are some limitations on the directions I can take this, though, and the data itself cannot be changed: it comes in a text file and that's the only option I have.

Thanks to everyone.

All 3 Replies

I would read the file line by line instead of trying to write a single regexp that does too much.

while (<>) {
   if (/^([a-zA-Z]{0,2}\d+)\s*$/) {   # identifier alone on its own line
      my $id = $1;
      # now collect the TITLE/SOURCE/TEXT lines in this block
      # and write them to a new file named after $id
   }
}
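Fleshed out, that loop might look like the sketch below. It assumes the identifier always sits alone on its own line (and that no line of article text ever looks like one), and that everything up to the next identifier belongs to the current record. The `split_records` name and the `<id>.txt` filenames are just illustrative, not anything from your existing setup.

```perl
use strict;
use warnings;

# Split the whole dump into [identifier, body] pairs.
# Assumption: an identifier (0-2 letters then digits) sits alone on its
# own line, and such a line never occurs inside an article's text.
sub split_records {
    my ($dump) = @_;
    my @records;
    my ( $id, @lines );
    for my $line ( split /\n/, $dump ) {
        if ( $line =~ /^([A-Za-z]{0,2}\d+)\s*$/ ) {
            # new identifier: close out the previous record, if any
            push @records, [ $id, join( "\n", @lines ) ] if defined $id;
            $id    = $1;
            @lines = ();
        }
        elsif ( defined $id ) {
            push @lines, $line;    # part of the current article
        }
    }
    # don't forget the final record
    push @records, [ $id, join( "\n", @lines ) ] if defined $id;
    return @records;
}

# Usage sketch: slurp the big file once, then write each record out.
# local $/;                            # slurp mode
# my $dump = <>;
# for my $rec ( split_records($dump) ) {
#     my ( $id, $body ) = @$rec;
#     open my $out, '>', "$id.txt" or die "Can't write $id.txt: $!";
#     print $out "$id\n$body\n";
#     close $out;
# }
```

Note that `[A-Za-z]` (or the /i modifier) is what keeps the AB in AB1234; your original `[a-z]` only matches lowercase, which is why the letters were dropped. The one real ambiguity is an article whose text contains a bare number on a line by itself; if that can happen, you'd want to also require that the following line starts with TITLE before treating a line as an identifier.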

You know, the more I've been going over it, the more I think you're probably right. The best approach is probably to use regexes to pull out exactly what's needed, but otherwise just read the file in line by line.

A quick follow-up question, then: do you think reading such a large file line by line will be quicker than a Word macro parsing the entire file? I think a lot of the slowdown with the Word macro comes from opening the file, which takes up so many system resources. I don't know whether this approach will actually be quicker or not.

I have no idea whether using Perl will ultimately prove to be faster. You will just have to try it and see.
