jedi_ralf
Newbie Poster
20 posts since Apr 2008
Reputation Points: 10
Solved Threads: 3
 

Could you post a sample html input and the desired output?

radoulov
Junior Poster in Training
54 posts since Feb 2007
Reputation Points: 10
Solved Threads: 5
 

Thanks for your reply.
Could you post a sample html input and the desired output?
Sure. I don't quite see how it's relevant to the question though.
Anyway, the html can be any webpage. All I'm interested in are the link and image tags.
eg:

<a href="http://www.google.com">Google</a> - <img src="http://www.google.com/intl/en_ALL/images/logo.gif" />

In essence, this is a small part of my larger program. I will also be assigning the assignment operator (href or src) and the end tag ( or >) to the 2d array.
These will be used to retrieve the address within the tag.

So far I have not planned an output. I simply wish to get an array of all link and image addresses within the webpage. Once I have that, output will be a simple process of running through the array, printing everything, or counting duplicates, etc.

Do you know if it's possible to have variables in a regular expression?
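
For reference, the goal described in this post (collect every link and image address, then count duplicates) could be sketched roughly as follows. This is only an illustration, not code from the thread: the function name extract_urls is made up, it assumes attribute values are wrapped in double quotes, and it sticks to POSIX awk features so it does not need GNU Awk.

```sh
# Sketch: list every href/src address in the input and how often it occurs.
# extract_urls is a hypothetical helper name; it reads HTML on stdin.
extract_urls() {
  awk '
  {
    rest = $0
    # Find each href="..." or src="..." attribute on the line.
    while (match(rest, /(href|src)="[^"]*"/)) {
      attr = substr(rest, RSTART, RLENGTH)
      sub(/^[^"]*"/, "", attr)   # drop everything up to the opening quote
      sub(/"$/, "", attr)        # drop the closing quote
      if (!(attr in count)) order[++n] = attr   # remember first-seen order
      count[attr]++
      rest = substr(rest, RSTART + RLENGTH)     # continue after this match
    }
  }
  END {
    for (i = 1; i <= n; i++) print count[order[i]], order[i]
  }'
}

printf '%s\n' \
  '<a href="http://www.google.com">Google</a> - <img src="http://www.google.com/intl/en_ALL/images/logo.gif" />' \
  '<a href="http://www.google.com">Again</a>' | extract_urls
# prints:
# 2 http://www.google.com
# 1 http://www.google.com/intl/en_ALL/images/logo.gif
```

Each address is stored once in order[] and tallied in count[], so printing all addresses or only the duplicated ones becomes a single loop in END.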

jedi_ralf

If you have GNU Awk:

awk 'END { 
print "#links:", _
  for (i=1;i<=_;i++)
    print la[i]
print "#images:", __
  for (i=1;i<=__;i++)
    print ia[i] }
{ r = $0
while (match(r = substr(r, RSTART+RLENGTH), /<a href=([^>]*)>/, t)) 
  la[++_] = t[1] 
}
{ r = $0
while (match(r = substr(r, RSTART+RLENGTH), /<img([^>]*)>/, t))
  ia[++__] = t[1] 
}'  html
radoulov

Thank you for the code. There's some advanced stuff in there I can learn from.
However, the problem still remains that there will be code redundancy because I am still using two while loops, one for links and one for images. Is there no way I can replace the "

jedi_ralf

You could:

awk 'END { 
print "#links:", _
  for (i=1;i<=_;i++)
    print la[i]
print "#images:", __
  for (i=1;i<=__;i++)
    print ia[i] }
{ r = $0
while (match(r = substr(r, RSTART+RLENGTH), /<(a href=|img)([^>]*)>/, t)) {
  if (substr(r, RSTART, 2) == "<a")
    la[++_] = t[2]
  else
    ia[++__] = t[2]  
  }
}'  html


Or, given that you insist on the variable :)

awk 'END { 
print "#links:", _
  for (i=1;i<=_;i++)
    print la[i]
print "#images:", __
  for (i=1;i<=__;i++)
    print ia[i] }
{ r = $0; tag = "<(a href=|img)([^>]*)>"
while (match(r = substr(r, RSTART+RLENGTH), tag, t)) {
  if (substr(r, RSTART, 2) == "<a")
    la[++_] = t[2]
  else
    ia[++__] = t[2]  
  }
}'  html

Or even:

awk 'END { 
print "#links:", _
  for (i=1;i<=_;i++)
    print la[i]
print "#images:", __
  for (i=1;i<=__;i++)
    print ia[i] }
{ r = $0; lnk = "a href="; img = "img"; tag = "<("lnk"|"img")([^>]*)>"
while (match(r = substr(r, RSTART+RLENGTH), tag, t)) {
  if (substr(r, RSTART, RLENGTH) ~ "^<"lnk)
    la[++_] = t[2]
  else
    ia[++__] = t[2]  
  }
}'  html
radoulov
Wow, that's brilliant, thank you! You have solved all my problems!

I've ended up using your example of the while loop with a match in it (before I had been splitting by the tag and then using a for loop to loop through each instance of the tag). It now looks so much neater and clearer.
Also, using the or (|) between different tags has worked brilliantly, without the need for multiple loops for each tag.

I see that in my first post, although I explained my problem, I did not justify it well. As well as finding the tag in the line, I also needed to find the entire link or image tag and replace it. My final regular expression ended up being

/(<a|<img)[^>]+(href|src)=http:\/\/[^>]+[.]*>/

which can be used to replace the front tag of the link or an entire image tag, regardless of (x)html standard and of any other attributes which may be present (class, id, target, etc).

Finally, it's good to know that variables can be used, so long as you assign the whole regular expression to another variable before use. In this case your first example did the job, but it's great to know for future use. Just to double check I've got it, would this be the solution in code?

setTags[1,1] = "<a"
setTags[2,1] = "<img"
for ( j = 1; j < 3; j++ ) {
   TAG = "[>]?[.*]?"setTags[j,1]
   split($0, removeFront, TAG)
}

In this case, the TAG variable is set to the regular expression, including the setTags variable, which changes depending on the loop. It has been left outside the inverted commas. split is then given TAG as the parameter of the regular expression.

Once again, thanks for all your hard work. It really is appreciated!

Ralf
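
As a side note on the question above: in awk, any string expression can be used where a regular expression is expected (a "dynamic regexp"), so the pattern can be concatenated from array elements inside the loop. One detail worth checking in the proposed TAG is that inside brackets the characters . and * are literal, so [.*]? matches an optional dot or asterisk rather than "any text". The following is a minimal sketch of the loop idea under those assumptions; count_tags and the name[] array are hypothetical names, not taken from the thread.

```sh
# Sketch: loop over tag names held in an array and build the regular
# expression for each one at run time. count_tags is a hypothetical name.
count_tags() {
  awk '
  BEGIN {
    name[1] = "a"
    name[2] = "img"
  }
  {
    for (j = 1; j <= 2; j++) {
      # String concatenation yields a dynamic regexp; no /.../ literal needed.
      tag = "<" name[j] " [^>]*>"
      rest = $0
      while (match(rest, tag)) {
        hits[j]++
        rest = substr(rest, RSTART + RLENGTH)   # continue after this match
      }
    }
  }
  END {
    for (j = 1; j <= 2; j++) print name[j], hits[j] + 0
  }'
}
```

The same tag string could equally be passed to split() or gsub(); only the way it is built changes, not the function that consumes it.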

jedi_ralf

Did you test your code?
Do you get the expected result?

radoulov
Gave it a go and got the results I was looking for. Took a bit of fiddling, but it works ^_^

#!/bin/gawk -f
{
   setTags[1,1] = "<img "
   setTags[1,2] = "src"
   setTags[1,3] = ">"

   setTags[2,1] = "<a "
   setTags[2,2] = "href"
   setTags[2,3] = ">"

   for ( j = 1; j < 3; j++ ) {
      expression = setTags[j,1]setTags[j,2]"[^>]+[.]*"setTags[j,3]
      gsub(expression, "")
      if (j==2) 
         gsub("</a>", "")
   }
}

This runs through each line of the file and removes images and links.
The closing tag of the link (</a>) has to be removed separately because of the "[^>]" in the regular expression (which prevents it from overrunning). However, this also has the added advantage of leaving in the text or image which was wrapped in the link tags.
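
To see the effect on actual input, here is a small self-contained variant of the same idea. It is a sketch rather than the program used in the thread: the wrapper name strip_tags is made up, the table is filled once in BEGIN instead of on every line, and an explicit print is added so that the modified line is written out.

```sh
# Sketch: remove <img ...> tags and <a ...> / </a> tags, keeping the link text.
# strip_tags is a hypothetical helper name; it reads HTML on stdin.
strip_tags() {
  awk '
  BEGIN {
    # opening text, attribute name, closing text for each tag
    setTags[1,1] = "<img "; setTags[1,2] = "src";  setTags[1,3] = ">"
    setTags[2,1] = "<a ";   setTags[2,2] = "href"; setTags[2,3] = ">"
  }
  {
    for (j = 1; j <= 2; j++) {
      # [^>]+ stops at the first closing bracket, so one tag is removed at a time
      expression = setTags[j,1] setTags[j,2] "[^>]+" setTags[j,3]
      gsub(expression, "")
    }
    gsub("</a>", "")   # the closing link tag is removed separately
    print
  }'
}

printf '%s\n' '<a href="http://www.google.com">Google</a>' | strip_tags
# prints: Google
```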

Thank you again for your help.
I have learnt a lot.

jedi_ralf

This question has already been solved
