I am new to awk, so please excuse any mistakes.
I was hoping someone would be able to tell me if it's possible to include variables in a regular expression, and if so, how.

The code I currently have is meant to find the beginning of link (<a) and image (<img) tags in an HTML page.
Instead of duplicating the same code for each tag (i.e., hard-coding each value), I was hoping to use a for() loop over the same expressions, changing only the variable on each pass.

The code I have:

setTags[1,1] = "<a"
setTags[2,1] = "<img"

for ( j = 1; j < 3; j++ ) {
         TAG = setTags[j,1]
         split($0,removeFront,/[>]?[.*]?TAG/)
}

This loops through twice, once looking to split by a link tag, and another to split by the image tag. This will tell me how many links and how many images are on each line.
However, I am unable to get the TAG variable (only uppercase to make it stand out) interpreted. Depending on what I do, it will look for the letters T, A and G, or it simply won't work.

Any assistance you can give would be greatly appreciated.
Thanks,
Ralf

All 8 Replies

Could you post a sample html input and the desired output?

Thanks for your reply.

Sure. I don't quite see how it's relevant to the question though.
Anyway, the html can be any webpage. All I'm interested in are the link and image tags.
eg:

<a href="http://www.google.com">Google</a> - <img src="http://www.google.com/intl/en_ALL/images/logo.gif" />

In essence, this is a small part of my larger program. I will also be assigning the assignment operator (href or src) and the end tag (</a> or >) to the 2d array.
These will be used to retrieve the address within the tag.

So far I have not planned an output. I simply wish to get an array of all link and image addresses within the webpage. Once I have that, output will be a simple process of running through the array, printing all, or counting duplicates, etc.

Do you know if it's possible to have variables in a regular expression?

If you have GNU Awk:

awk 'END { 
print "#links:", _
  for (i=1;i<=_;i++)
    print la[i]
print "#images:", __
  for (i=1;i<=__;i++)
    print ia[i] }
{ r = $0
while (match(r = substr(r, RSTART+RLENGTH), /<a href=([^>]*)>/, t)) 
  la[++_] = t[1] 
}
{ r = $0
while (match(r = substr(r, RSTART+RLENGTH), /<img([^>]*)>/, t))
  ia[++__] = t[1] 
}'  html
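
The three-argument match() used above is a GNU Awk extension. As a rough sketch for other awks, assuming only the standard two-argument match() with RSTART and RLENGTH, the same link values can be cut out with substr(). The sample input line and the variable name rest are made up for illustration:

```shell
# Portable sketch: no capture array, just RSTART/RLENGTH arithmetic.
links=$(printf '%s\n' '<a href="http://www.google.com">Google</a> <a href="http://example.com">Ex</a>' | awk '{
    rest = $0
    while (match(rest, /<a href=[^>]*>/)) {
        # "<a href=" is 8 characters long and ">" is 1, so skip 8 and drop 9
        print substr(rest, RSTART + 8, RLENGTH - 9)
        # continue scanning after the end of this match
        rest = substr(rest, RSTART + RLENGTH)
    }
}')
echo "$links"
```

Like t[1] in the gawk version, the printed values still include the surrounding double quotes.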

Thank you for the code. There's some advanced stuff in there I can learn from.
However, the problem still remains that there will be code redundancy because I am still using two while loops, one for links and one for images. Is there no way I can replace the "<img" and "<a" tags with a variable and have only one while loop?
Thanks for your help so far!

You could:

awk 'END { 
print "#links:", _
  for (i=1;i<=_;i++)
    print la[i]
print "#images:", __
  for (i=1;i<=__;i++)
    print ia[i] }
{ r = $0
while (match(r = substr(r, RSTART+RLENGTH), /<(a href=|img)([^>]*)>/, t)) {
  if (substr(r, RSTART, 2) == "<a")
    la[++_] = t[2]
  else
    ia[++__] = t[2]  
  }
}'  html

Or, given that you insist on the variable :)

awk 'END { 
print "#links:", _
  for (i=1;i<=_;i++)
    print la[i]
print "#images:", __
  for (i=1;i<=__;i++)
    print ia[i] }
{ r = $0; tag = "<(a href=|img)([^>]*)>"
while (match(r = substr(r, RSTART+RLENGTH), tag, t)) {
  if (substr(r, RSTART, 2) == "<a")
    la[++_] = t[2]
  else
    ia[++__] = t[2]  
  }
}'  html

Or even:

awk 'END { 
print "#links:", _
  for (i=1;i<=_;i++)
    print la[i]
print "#images:", __
  for (i=1;i<=__;i++)
    print ia[i] }
{ r = $0; lnk = "a href="; img = "img"; tag = "<("lnk"|"img")([^>]*)>"
while (match(r = substr(r, RSTART+RLENGTH), tag, t)) {
  if (substr(r, RSTART, RLENGTH) ~ "^<"lnk)
    la[++_] = t[2]
  else
    ia[++__] = t[2]  
  }
}'  html

Wow, that's brilliant, thank you! You have solved all my problems!

I've ended up using your example of the while loop with a match in it (before I had been splitting by the tag and then using a for loop to loop through each instance of the tag). It now looks so much neater and clearer.
Also, using the or (|) between different tags has worked brilliantly, without the need for multiple loops for each tag.

I see that in my first post, although explaining my problem, I did not justify it well. As well as finding the tag in the line, I also needed to find the entire <a href=".."> or <img src="..." /> and replace it.
My final regular expression ended up being

/(<a|<img)[^>]+(href|src)=http:\/\/[^>]+[.]*>/

which can be used to replace the front tag of the link or an entire image tag, regardless of (x)html standard and of any other attributes which may be present (class, id, target, etc.)

Finally, it's good to know that variables can be used, so long as you assign the whole regular expression to another variable before use. In this case your first example did the job, but it's great to know for future use.
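
A small sketch of what is going on, as far as I understand awk's rules: anything between slashes is a literal pattern, while a string value on the right of ~ (or passed to split, match or gsub) is used as a dynamic regex. It does not seem to need its own variable; a concatenation written in place behaves the same way. The sample line and the names tags, literal, dynamic and inline are invented for the demo:

```shell
line='<a href="http://www.google.com">Google</a> - <img src="logo.gif" />'

result=$(printf '%s\n' "$line" | awk '{
    n = split("<a <img", tags, " ")              # hypothetical tag list
    for (j = 1; j <= n; j++) {
        tag = tags[j]
        literal = ($0 ~ /tag/)     ? "yes" : "no"   # looks for the letters t, a, g
        dynamic = ($0 ~ tag)       ? "yes" : "no"   # uses the string held in tag
        inline  = ($0 ~ ("^" tag)) ? "yes" : "no"   # concatenation built in place
        print tag, literal, dynamic, inline
    }
}')
echo "$result"
```

For this line the literal column is "no" both times, the dynamic column is "yes" both times, and the anchored inline test is "yes" only for <a, since the line starts with that tag.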

Just to double check I've got it, would this be the solution to the code?

setTags[1,1] = "<a"
setTags[2,1] = "<img"

for ( j = 1; j < 3; j++ ) {
         TAG = "[>]?[.*]?"setTags[j,1]
         split($0, removeFront, TAG)
}

In this case, the TAG variable is set to the regular expression, including the setTags variable, which changes depending on the loop. It has been left outside the inverted commas. split is then given TAG as the parameter of the regular expression.
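
To illustrate the counting idea, here is a minimal sketch under the assumption that the plain tag text is enough as the separator: split() returns the number of pieces, so the number of separators found is one less. The input line and the names tags, parts and pieces are only for the example:

```shell
counts=$(printf '%s\n' '<a href="x">one</a> <img src="a.gif" /> <a href="y">two</a>' | awk '{
    tags[1] = "<a"; tags[2] = "<img"
    for (j = 1; j <= 2; j++) {
        # the string in tags[j] is treated as the separator regex
        pieces = split($0, parts, tags[j])
        # an empty line gives 0 pieces, so guard against reporting -1
        print tags[j], (pieces > 0 ? pieces - 1 : 0)
    }
}')
echo "$counts"
```

Note that "<a" would also match the start of tags such as <abbr, so a stricter separator like "<a " may be safer on real pages.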

Once again, thanks for all your hard work. It really is appreciated!
Ralf

Did you test your code?
Do you get the expected result?

Gave it a go and got the results I was looking for. Took a bit of fiddling, but it works ^_^

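#!/bin/gawk -f
# Tag table: [n,1] opening tag, [n,2] attribute name, [n,3] closing delimiter.
# Filled once in BEGIN rather than on every input line.
BEGIN {
   setTags[1,1] = "<img "
   setTags[1,2] = "src"
   setTags[1,3] = ">"

   setTags[2,1] = "<a "
   setTags[2,2] = "href"
   setTags[2,3] = ">"
}
{
   for ( j = 1; j < 3; j++ ) {
      # Build the regex as a string, e.g. "<img src[^>]+[.]*>"
      expression = setTags[j,1] setTags[j,2] "[^>]+[.]*" setTags[j,3]
      gsub(expression, "")
      if (j==2) 
         gsub("</a>", "")
   }
   print   # output the line with the tags removed
}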

Runs through each line of the file and removes images and links.
The ending of the link (</a>) has to be removed separately because of the "[^>]" in the regular expression (which prevents it overrunning). However, this also has the added advantage of leaving in the text or image which was wrapped in the link tags.
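
For anyone wondering about the overrun, this is a quick sketch comparing the bounded class with a greedy ".*" on one made-up line (the variable names bounded and greedy are just for the demo):

```shell
line='<a href="http://www.google.com">Google</a> - <img src="logo.gif" /> end'

# [^>]+ cannot cross a ">", so the match stops at the end of the opening tag
bounded=$(printf '%s\n' "$line" | awk '{ gsub(/<a [^>]+>/, ""); print }')

# .* is greedy and runs on to the last ">" on the line
greedy=$(printf '%s\n' "$line" | awk '{ gsub(/<a .*>/, ""); print }')

printf 'bounded: [%s]\n' "$bounded"
printf 'greedy:  [%s]\n' "$greedy"
```

The bounded version removes only the opening link tag and leaves the link text, the closing tag and the image alone, while the greedy version swallows everything up to the image's closing ">" and leaves just " end".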

Thank you again for your help.
I have learnt a lot.
