Pretty noob question regarding input
Join Date: Dec 2007
Posts: 7
Solved Threads: 1
Hello, I am trying to write a script for someone at my university. The only experience I've really had with Perl is with web pages, which coincidentally is what this is, albeit a bit different. Normally the web page is hosted online and I have to use LWP:: or WWW:: to access it, but this is 40,000 pages stored locally. Supposing that recursive directories (i.e., subdirectories) are not an issue, how would I go about reading in one file at a time, doing the scraping I need, and then moving on after printing out the extracted info? Basically something on the command line like this:
myScript.pl *.htm*
where it will read in all htm or html files to be parsed. I feel a bit daft for not knowing how to do this, but oh well, all part of learning I guess.
Join Date: Sep 2007
Posts: 176
Solved Threads: 20
- reading in one file at a time
- doing the scraping I need
- moving on after printing out the extracted info
Amer Neely - Web Mechanic
"Others make web sites. We make web sites work!"
Join Date: Dec 2007
Posts: 7
Solved Threads: 1
Heh, I know the idea behind everything I need to do, it's just a matter of implementation.
@myFile = <@ARGV>;
is pretty much all I tried, but that reads in file names. I just don't know how to read in the actual content.
see if anything here helps you:
http://www.perl.com/pub/a/2004/10/14/file_editing.html
That looks a little advanced for me. I suppose the beginning would be helpful, but what I need to do is far more than just a one-liner to act as sed for me.
Join Date: Sep 2007
Posts: 176
Solved Threads: 20
Have you read the documentation that comes with every Perl installation? File opening / reading are FAQs, and are handled quite well in the docs.
A hint: since you are after the source code, you will need to use LWP for that, and any scraping can be done with Mechanize. But beware - Mechanize doesn't handle HTML created by JavaScript, which is happening more and more these days.
Last edited by trudge; Dec 20th, 2007 at 6:06 pm.
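For anyone landing here later, the open-and-read pattern that the docs cover might look roughly like the sketch below. It assumes the file names arrive on the command line (as in `myScript.pl *.htm*`); `slurp_file` is a made-up helper name, and the byte count is only a stand-in for whatever scraping you actually need.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the full contents of one file as a single string.
sub slurp_file {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    local $/;                 # no record separator, so one read returns the whole file
    my $content = <$fh>;
    close $fh;
    return $content;
}

# Run the loop only when this file is executed as a script.
unless (caller) {
    for my $file (@ARGV) {
        my $html = slurp_file($file);
        # Placeholder for the real scraping step: just report the size.
        printf "%s: %d bytes\n", $file, length $html;
    }
}
```

On a Unix-like shell the `*.htm*` wildcard is expanded before Perl starts, so `@ARGV` already holds the matching file names. The Windows cmd.exe shell does not expand wildcards, so there you would need to expand the pattern yourself, for example with Perl's built-in `glob`.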
@ARGV = <path/to/files/*.htm*>;
while(<>){
   .......
}
will give a list of all the htm(l) files in the given path and open and read them one at a time.
How you parse each file depends on what you are searching for. There are a number of HTML parsing modules on CPAN that will probably work better than any parsing code you can write yourself.
Last edited by KevinADC; Dec 20th, 2007 at 6:15 pm.
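One thing to be aware of with the `while(<>)` loop above: it hands you lines from all the files as one continuous stream. If you need to treat each page separately, a possible sketch is to collect lines until `eof` reports the end of the current file, then process that page and move on. The helper names `extract_title` and `scan_files` are invented for illustration, and the `<title>` regex is only a placeholder for the real scraping; as noted above, a CPAN HTML parser is likely the better tool for anything non-trivial.

```perl
use strict;
use warnings;

# Hypothetical helper: pull the <title> text out of an HTML string (undef if none).
sub extract_title {
    my ($html) = @_;
    return $html =~ m{<title[^>]*>\s*(.*?)\s*</title>}is ? $1 : undef;
}

# Read the named files through <> and return one [filename, title] pair per file.
sub scan_files {
    my @files = @_;
    return () unless @files;      # with an empty @ARGV, <> would read STDIN instead
    local @ARGV = @files;
    my @results;
    my $html = '';
    while (my $line = <>) {
        $html .= $line;
        next unless eof;          # eof without parentheses: end of the current file
        # $ARGV holds the name of the file that was just finished.
        push @results, [ $ARGV, extract_title($html) ];
        $html = '';
    }
    return @results;              # files with no lines at all produce no entry
}

# Run only when executed as a script.
unless (caller) {
    for my $r (scan_files(@ARGV)) {
        my ($file, $title) = @$r;
        print "$file: ", (defined $title ? $title : '(no title found)'), "\n";
    }
}
```

To use a fixed directory instead of command-line arguments, something like `scan_files(glob('path/to/files/*.htm*'))` should work the same way as the `<path/to/files/*.htm*>` form shown earlier.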





