Views: 2386 | Replies: 8 | Solved
Join Date: Apr 2008
Posts: 5
Reputation: jedi_ralf is an unknown quantity at this point 
Rep Power: 0
Solved Threads: 0
jedi_ralf
Newbie Poster

Awk, using variables in a regular expression

  #1  
Apr 25th, 2008
I am new to awk, so please excuse any mistakes.
I was hoping someone would be able to tell me if it's possible to include variables in a regular expression, and if so, how.

The code I currently have is meant to find the beginning of link (<a) and image (<img) tags in an HTML page.
Instead of duplicating the same code for each tag (i.e., hard-coding each value), I was hoping to use a for() loop to run the same expressions, changing only the variable on each iteration.

The code I have:
setTags[1,1] = "<a"
setTags[2,1] = "<img"

for ( j = 1; j < 3; j++ ) {
         TAG = setTags[j,1]
         # TAG between the slashes is read as the literal letters T, A, G
         split($0,removeFront,/[>]?[.*]?TAG/)
}
This loops through twice, once looking to split by a link tag, and another to split by the image tag. This will tell me how many links and how many images are on each line.
However, I am unable to get the TAG variable (only uppercase to make it stand out) interpreted. Depending on what I do, it will look for the letters T, A and G, or it simply won't work.

Any assistance you can give would be greatly appreciated.
Thanks,
Ralf
Join Date: Feb 2007
Posts: 52
Reputation: radoulov is an unknown quantity at this point 
Rep Power: 2
Solved Threads: 5
radoulov
Junior Poster in Training

Re: Awk, using variables in a regular expression

  #2  
Apr 28th, 2008
Could you post a sample html input and the desired output?
jedi_ralf

Re: Awk, using variables in a regular expression

  #3  
Apr 28th, 2008
Thanks for your reply.
Originally Posted by radoulov:
Could you post a sample html input and the desired output?

Sure. I don't quite see how it's relevant to the question though.
Anyway, the html can be any webpage. All I'm interested in are the link and image tags.
eg:
<a href="http://www.google.com">Google</a> - <img src="http://www.google.com/intl/en_ALL/images/logo.gif" />
In essence, this is a small part of my larger program. I will also be assigning the attribute name (href or src) and the end of the tag (</a> or >) to the 2D array.
These will be used to retrieve the address within the tag.

So far I have not planned an output. I simply wish to get an array of all link and image addresses within the webpage. Once I have that, output will be a simple matter of running through the array and printing everything, counting duplicates, etc.

Do you know if it's possible to have variables in a regular expression?
Last edited by jedi_ralf : Apr 28th, 2008 at 5:17 pm.
radoulov

Re: Awk, using variables in a regular expression

  #4  
Apr 28th, 2008
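If you have GNU Awk (the three-argument form of match() used here is a gawk extension):

awk 'END { 
# _ and __ count the links and images collected by the main rules
print "#links:", _
  for (i=1;i<=_;i++)
    print la[i]
print "#images:", __
  for (i=1;i<=__;i++)
    print ia[i] }
{ r = $0
# each pass drops everything up to the end of the previous match;
# t[1] receives the text captured by the parenthesised group
while (match(r = substr(r, RSTART+RLENGTH), /<a href=([^>]*)>/, t)) 
  la[++_] = t[1] 
}
{ r = $0
while (match(r = substr(r, RSTART+RLENGTH), /<img([^>]*)>/, t))
  ia[++__] = t[1] 
}'  html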
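
For an awk without the gawk extension, the same walk over a line can be sketched with the two-argument match(), which only sets RSTART and RLENGTH. This is a minimal sketch, not radoulov's code: the shell function name extract_hrefs is hypothetical, and it assumes links are written exactly as `<a href=...>` as in the sample line from this thread.

```shell
# Hypothetical helper: print every href value found on stdin, one per line.
extract_hrefs() {
  awk '{
    rest = $0
    # match() sets RSTART and RLENGTH; loop until no further link tag is found
    while (match(rest, /<a href=[^>]*>/)) {
      tag = substr(rest, RSTART, RLENGTH)       # the whole "<a href=...>" tag
      print substr(tag, 9, length(tag) - 9)     # text between "<a href=" and ">"
      rest = substr(rest, RSTART + RLENGTH)     # continue after this tag
    }
  }'
}

# Sample line taken from this thread
printf '%s\n' '<a href="http://www.google.com">Google</a> - <img src="logo.gif" />' | extract_hrefs
```

As in the gawk version, the printed value still carries the surrounding double quotes.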
jedi_ralf
Re: Awk, using variables in a regular expression

  #5  
Apr 29th, 2008
Thank you for the code. There's some advanced stuff in there I can learn from.
However, the problem still remains that there will be code redundancy because I am still using two while loops, one for links and one for images. Is there no way I can replace the "<img" and "<a" tags with a variable and have only one while loop?
Thanks for your help so far!
radoulov

Re: Awk, using variables in a regular expression

  #6  
Apr 29th, 2008
You could:

awk 'END { 
print "#links:", _
  for (i=1;i<=_;i++)
    print la[i]
print "#images:", __
  for (i=1;i<=__;i++)
    print ia[i] }
{ r = $0
while (match(r = substr(r, RSTART+RLENGTH), /<(a href=|img)([^>]*)>/, t)) {
  if (substr(r, RSTART, 2) == "<a")
    la[++_] = t[2]
  else
    ia[++__] = t[2]  
  }
}'  html

Or, given that you insist on a variable:

awk 'END { 
print "#links:", _
  for (i=1;i<=_;i++)
    print la[i]
print "#images:", __
  for (i=1;i<=__;i++)
    print ia[i] }
{ r = $0; tag = "<(a href=|img)([^>]*)>"
while (match(r = substr(r, RSTART+RLENGTH), tag, t)) {
  if (substr(r, RSTART, 2) == "<a")
    la[++_] = t[2]
  else
    ia[++__] = t[2]  
  }
}'  html


Or even:

awk 'END { 
print "#links:", _
  for (i=1;i<=_;i++)
    print la[i]
print "#images:", __
  for (i=1;i<=__;i++)
    print ia[i] }
{ r = $0; lnk = "a href="; img = "img"; tag = "<("lnk"|"img")([^>]*)>"
while (match(r = substr(r, RSTART+RLENGTH), tag, t)) {
  if (substr(r, RSTART, RLENGTH) ~ "^<"lnk)
    la[++_] = t[2]
  else
    ia[++__] = t[2]  
  }
}'  html
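
The for() loop asked about in the first post can also be kept: build one regular expression string per tag and pass that string to match(). The following is a rough sketch under assumptions, not code from this thread: the function name count_tags and the tag list "a,img" are made up for illustration, and only POSIX awk features are used.

```shell
# Hypothetical helper: count opening <a ...> and <img ...> tags on stdin.
count_tags() {
  awk 'BEGIN {
    # Tag names to look for; this list is an illustrative assumption
    n = split("a,img", tags, ",")
  }
  {
    for (j = 1; j <= n; j++) {
      rest = $0
      # Dynamic regex: an ordinary string, built by concatenation, no slashes
      re = "<" tags[j] "( [^>]*)?>"
      while (match(rest, re)) {
        count[j]++
        rest = substr(rest, RSTART + RLENGTH)
      }
    }
  }
  END {
    for (j = 1; j <= n; j++)
      print tags[j] ": " (count[j] + 0)
  }'
}

printf '%s\n' '<a href="http://www.google.com">Google</a> - <img src="logo.gif" />' | count_tags
```

The point is only that a string variable is accepted wherever awk expects a regular expression; the alternation in the examples above remains the shorter route when the tags are handled the same way.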
Last edited by radoulov : Apr 29th, 2008 at 7:00 am.
jedi_ralf
Re: Awk, using variables in a regular expression

  #7  
Apr 29th, 2008
Wow, that's brilliant, thank you! You have solved all my problems!

I've ended up using your example of the while loop with a match in it (before, I had been splitting by the tag and then using a for loop to go through each instance of the tag). It now looks so much neater and clearer.
Also, using the or (|) between different tags has worked brilliantly, without the need for a separate loop for each tag.

I see that in my first post, although I explained my problem, I did not justify it well. As well as finding the tag in the line, I also needed to find the entire <a href=".."> or <img src="..." /> and replace it.
My final regular expression ended up being
/(<a|<img)[^>]+(href|src)="?http:\/\/[^>]+>/
which can be used to replace the opening tag of a link or an entire image tag, regardless of (X)HTML standard and of any other attributes which may be present (class, id, target, etc.).

Finally, it's good to know that variables can be used, so long as the regular expression is built as a string (for example, assigned to a variable) instead of being written between slashes. In this case your first example did the job, but it's great to know for future use.
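
One detail worth knowing when a regular expression lives in a string: awk processes the string's escape sequences first, so a backslash meant for the regex has to be doubled, while a forward slash needs no escaping at all. A small sketch, assuming a hypothetical function name links_to_host and a hard-coded host chosen only for illustration:

```shell
# Hypothetical helper: print the input lines that link to one particular host.
links_to_host() {
  awk 'BEGIN {
    host = "www\\.google\\.com"            # "\\." in the string becomes \. in the regex
    re = "(href|src)=\"?http://" host      # no need to escape / in a string
  }
  $0 ~ re { print }'
}

printf '%s\n' '<a href="http://www.google.com">Google</a>' | links_to_host
```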

Just to double check I've got it, would this be the fix for the code from my first post?
setTags[1,1] = "<a"
setTags[2,1] = "<img"

for ( j = 1; j < 3; j++ ) {
         TAG = "[>]?[.*]?"setTags[j,1]
         split($0, removeFront, TAG)
}
In this case, the TAG variable is set to the regular expression, including the setTags variable, which changes depending on the loop. It has been left outside the inverted commas. split is then given TAG as the parameter of the regular expression.
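
A stripped-down sketch of that idea, for counting how often a tag opening occurs: split() returns the number of pieces, so a line with n pieces contains n-1 separators. The function name count_with_split is hypothetical, the separator is passed in with -v, and it must be longer than one character to be treated as a regular expression (a single space has a special meaning to split()).

```shell
# Hypothetical helper: count occurrences of a tag opening such as "<a " on stdin.
count_with_split() {
  awk -v tag="$1" '{
    n = split($0, parts, tag)     # tag is a string used as a dynamic regex
    if (n > 0)                    # split() returns 0 for an empty line
      total += n - 1              # n pieces means n-1 separators
  }
  END { print total + 0 }'
}

printf '%s\n' '<a href="http://www.google.com">Google</a> - <img src="logo.gif" />' | count_with_split '<a '
```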

Once again, thanks for all your hard work. It really is appreciated!
Ralf
radoulov

Re: Awk, using variables in a regular expression

  #8  
Apr 30th, 2008
Did you test your code?
Do you get the expected result?
jedi_ralf
Re: Awk, using variables in a regular expression

  #9  
Apr 30th, 2008
Originally Posted by radoulov:
Did you test your code?
Do you get the expected result?
Gave it a go and got the results I was looking for. Took a bit of fiddling, but it works ^_^
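#!/bin/gawk -f
BEGIN {
   # Per tag: opening, attribute name, closing delimiter (set up once)
   setTags[1,1] = "<img "
   setTags[1,2] = "src"
   setTags[1,3] = ">"

   setTags[2,1] = "<a "
   setTags[2,2] = "href"
   setTags[2,3] = ">"
}
{
   for ( j = 1; j < 3; j++ ) {
      # Build the regular expression as a string from the table entries
      expression = setTags[j,1] setTags[j,2] "[^>]+" setTags[j,3]
      gsub(expression, "")
      if (j==2) 
         gsub("</a>", "")
   }
   # Output the line with the tags removed
   print
}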
Runs through each line of the file and removes images and links.
The ending of the link (</a>) has to be removed separately because of the "[^>]" in the regular expression (which prevents it overrunning). However, this also has the added advantage of leaving in the text or image which was wrapped in the link tags.
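
Following the alternation radoulov used earlier in the thread, the two passes and the separate </a> removal could probably be folded into a single gsub(). This is only a sketch with a hypothetical function name, strip_tags, and it is a little looser than the script above because it does not require src or href to come first in the tag.

```shell
# Hypothetical helper: delete <img ...> tags, <a ...> openings and </a> from stdin.
strip_tags() {
  awk '{
    # One alternation covers both tag openings and the closing </a>
    gsub(/<(img|a) [^>]*>|<\/a>/, "")
    print
  }'
}

printf '%s\n' '<a href="http://www.google.com">Google</a> - <img src="logo.gif" />' | strip_tags
```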

Thank you again for your help.
I have learnt a lot.