Hello, I am trying to write a script for someone at my university. The only experience I've had with Perl really is through web pages, which coincidentally this is, albeit a bit different. Normally the web page is hosted online and I have to use LWP::* or WWW::* to access it, but in this case it's 40,000 pages stored locally. Supposing that recursive directories (i.e., subdirectories) are not an issue, how would I go about reading in one file at a time, doing the scraping I need, and then moving on after printing out the extracted info? Basically something on the command line like this:

myScript.pl *.htm*

where it will read in all .htm or .html files to be parsed. I feel a bit daft for not knowing how to do this, but oh well, all part of learning I guess.

You have pretty much answered your own question; all you have to do is implement it:

- reading in one file at a time
- doing the scraping I need
- moving on after printing out the extracted info

What have you tried so far?

Heh, I know the idea behind everything I need to do; it's just a matter of implementation.

@myFile = <@ARGV>;

is pretty much all I tried, but that reads in file names. I just don't know how to read in the actual content.

See if anything here helps you:

http://www.perl.com/pub/a/2004/10/14/file_editing.html
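
For context, the in-place editing idiom that article builds on is a one-liner like this (the substitution pattern is just a placeholder):

perl -i.bak -pe 's/old_text/new_text/g' *.htm*

Here -i.bak edits each file in place while keeping a .bak backup, -p loops over every input line, and -e supplies the code to run.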

That looks a little advanced for me. I suppose the beginning would be helpful, but what I need to do is far more than just a one-liner acting as sed for me.

Have you read the documentation that comes with every Perl installation? File opening and reading are FAQs, and are handled quite well in the docs (see perlopentut and perlfaq5).
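
A minimal sketch of the documented open/read pattern (the file name here is just a placeholder):

use strict;
use warnings;

my $file = 'page.html';                        # placeholder name
open my $fh, '<', $file or die "Can't open $file: $!";
while (my $line = <$fh>) {
    # ... examine $line here ...
}
close $fh;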

A hint: since you are after the source code, you will need to use LWP for that, and any scraping can be done with Mechanize. But beware - Mechanize doesn't handle HTML created by JavaScript, which is happening more and more these days.
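
If you did go the Mechanize route, the basic fetch is only a few lines; note that LWP can also reach files on disk through file:// URLs, though plain open is simpler for local pages (the path below is a placeholder):

use strict;
use warnings;
use WWW::Mechanize;                         # from CPAN

my $mech = WWW::Mechanize->new();
$mech->get('file:///path/to/page.html');    # placeholder path
my $html = $mech->content;                  # the raw page source
# ... scrape $html here ...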

This will give a list of all the .htm(l) files in the given path and open and read them one at a time:

@ARGV = <path/to/files/*.htm*>;   # glob the matching file names into @ARGV
while (<>) {
    # each line of every file arrives in $_; the current file name is in $ARGV
}
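
If it is easier to scrape each page as a whole rather than line by line, here is a hedged variant of the same glob idea (the path is a placeholder):

foreach my $file (glob 'path/to/files/*.htm*') {
    open my $fh, '<', $file or do { warn "Can't open $file: $!"; next };
    my $html = do { local $/; <$fh> };   # slurp the whole file into one string
    close $fh;
    # ... scrape $html, print the extracted info, move on ...
}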

How you parse each file depends on what you are searching for. There are a number of HTML parsing modules on CPAN that will probably work better than any parsing code you could write yourself.
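
For instance, a short sketch using HTML::TreeBuilder from CPAN; what it extracts here (page title and link hrefs) is only an illustration of the kind of scraping you might do:

use strict;
use warnings;
use HTML::TreeBuilder;              # CPAN module

foreach my $file (@ARGV) {
    my $tree  = HTML::TreeBuilder->new_from_file($file);
    my $title = $tree->look_down(_tag => 'title');
    print "$file\t", $title ? $title->as_text : '(no title)', "\n";
    foreach my $link ($tree->look_down(_tag => 'a')) {
        my $href = $link->attr('href');
        print "\t$href\n" if defined $href;
    }
    $tree->delete;                  # free the parse tree; matters across 40,000 files
}

Run as myScript.pl *.htm *.html; on shells that do not expand wildcards, the @ARGV glob shown above does the same job from inside the script.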

Question: is the directory you are walking on a local drive, or are you trying to scrape web pages at another site? That will determine how you do this.

As I mentioned in my original post, they are stored locally and thus I cannot use LWP or WWW.

@ARGV = <path/to/files/*.htm*>;
while (<>) {
    # ...
}

I will try this.

Question: is the directory you are walking on a local drive, or are you trying to scrape web pages at another site? That will determine how you do this.

He said in the first post that they are local files.

but this is 40,000 pages stored locally
