Pretty noob question regarding input

#1 - roryne - Dec 20th, 2007
Hello, I am trying to write a script for someone at my university. The only real experience I've had with Perl is with web pages, which coincidentally this is, albeit a bit different. Normally the web page is hosted online and I have to use the LWP:: or WWW:: modules to access it, but this is 40,000 pages stored locally. Supposing that recursive directories (i.e., subdirectories) are not an issue, how would I go about reading in one file at a time, doing the scraping I need, and then moving on after printing out the extracted info? Basically something on the command line like this:

myScript.pl *.htm*

where it will read in all htm or html files to be parsed. I feel a bit daft for not knowing how to do this, but oh well, all part of learning I guess.

#2 - KevinADC - Dec 20th, 2007

#3 - trudge - Dec 20th, 2007
You have pretty much answered your own question; all you have to do is implement it.

reading in one file at a time
doing the scraping I need
moving on after printing out the extracted info
What have you tried so far?
Amer Neely - Web Mechanic
"Others make web sites. We make web sites work!"

#4 - roryne - Dec 20th, 2007
Heh, I know the idea behind everything I need to do; it's just a matter of implementation.

  @myFile = <@ARGV>;

is pretty much all I tried, but that reads in file names. I just don't know how to read in the actual content.

see if anything here helps you:

http://www.perl.com/pub/a/2004/10/14/file_editing.html


That looks a little advanced for me. I suppose the beginning would be helpful, but what I need to do is far more than just a one-liner to act as sed for me.
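
On the attempt above: angle brackets around anything other than a filehandle are treated as a file glob, so <@ARGV> returns matching file names rather than file contents. A minimal sketch of reading the contents instead, assuming the file names arrive in @ARGV; read_file is a hypothetical helper name, not something from the thread, and the printf line only stands in for the real scraping:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the full text of one file as a single string.
sub read_file {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    local $/;                 # no record separator: read the whole file at once
    my $content = <$fh>;
    close $fh;
    return $content;
}

# @ARGV holds the file names given on the command line
# (a Unix shell expands *.htm* before the script starts).
for my $path (@ARGV) {
    my $html = read_file($path);
    # ... scraping of $html would go here ...
    printf "%s: %d characters read\n", $path, length $html;
}
```

Reading one file per loop iteration keeps only a single page in memory at a time, which matters with 40,000 files.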

#5 - trudge - Dec 20th, 2007
Have you read the documentation that comes with every Perl installation? File opening / reading are FAQs, and are handled quite well in the docs.

A hint: since you are after the source code, you will need to use LWP for that, and any scraping can be done with Mechanize. But beware - Mechanize doesn't handle HTML created by JavaScript, which is happening more and more these days.

#6 - KevinADC - Dec 20th, 2007
This will give a list of all the htm(l) files in the given path, then open and read them one at a time:

  @ARGV = <path/to/files/*.htm*>;   # glob: every .htm/.html file in the directory
  while (<>) {
      # $_ holds the current line; $ARGV holds the name of the current file
  }

How you parse each file depends on what you are searching for. There are a number of HTML parsing modules on CPAN that will probably work better than any parsing code you can write yourself.
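
The loop above reads line by line across all the files, so per-file work needs to know where one file ends. A rough sketch of one way to do that, using eof and $ARGV; scrape_titles is a hypothetical helper name, and the title regex is only a placeholder for whatever extraction is actually needed (one of the CPAN parsers would be the sturdier choice):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: read every file in @paths through <> and return a
# list of [file name, title] pairs, one pair per non-empty file.
sub scrape_titles {
    my @paths = @_;
    return unless @paths;      # with an empty @ARGV, <> would read STDIN
    local @ARGV = @paths;      # <> opens each of these names in turn
    my @results;
    my $html = '';
    while (my $line = <>) {
        $html .= $line;
        if (eof) {             # true at the end of each individual file
            # Placeholder extraction; swap in a real HTML parser here.
            my ($title) = $html =~ m{<title[^>]*>\s*(.*?)\s*</title>}is;
            push @results, [ $ARGV, defined $title ? $title : '(no title)' ];
            $html = '';        # start fresh for the next file
        }
    }
    return @results;
}

# Expand the pattern inside the script, as in the snippet above.
# 'path/to/files' is the placeholder directory from that snippet.
for my $pair (scrape_titles(glob 'path/to/files/*.htm*')) {
    print "$pair->[0]: $pair->[1]\n";
}
```

Globbing inside the script also avoids relying on the shell to expand the wildcard (the Windows command prompt does not), and it can sidestep command-line length limits when the directory holds tens of thousands of files.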

#7 - trudge - Dec 20th, 2007
Question: is the directory you are walking on a local drive, or are you trying to scrape web pages at another site? That will determine how you do this.

#8 - roryne - Dec 20th, 2007
As I mentioned in my original post, they are stored locally and thus I cannot use LWP or WWW.

@ARGV = <path/to/files/*.htm*>;
while(<>){
.......
}


I will try this.

#9 - KevinADC - Dec 21st, 2007
Originally posted by trudge:
Question: is the directory you are walking on a local drive, or are you trying to scrape web pages at another site? That will determine how you do this.
He said in the first post that they are local files:

"but this is 40,000 pages stored locally"