Views: 2386 | Replies: 8 | Solved
Join Date: Apr 2008
Posts: 5
Reputation:
Rep Power: 0
Solved Threads: 0
I am new to awk, so please excuse any mistakes.
I was hoping someone would be able to tell me if it's possible to include variables in a regular expression, and if so, how.
The code I currently have is meant to find the beginning of link (<a) and image (<img) tags in an HTML page.
Instead of duplicating the same code for each tag (i.e., hard-coding each value), I was hoping to use a for() loop over the same expressions, changing only the variable on each iteration.
The code I have:
setTags[1,1] = "<a"
setTags[2,1] = "<img"
for ( j = 1; j < 3; j++ ) {
    TAG = setTags[start,1]
    split($0,removeFront,/[>]?[.*]?TAG/)
}
This loops through twice, once splitting by the link tag and once by the image tag. This will tell me how many links and how many images are on each line.
However, I am unable to get the TAG variable (uppercase only to make it stand out) interpreted. Depending on what I do, it either looks for the letters T, A and G, or it simply doesn't work.
Any assistance you can give would be greatly appreciated.
Thanks,
Ralf
Thanks for your reply.
Sure. I don't quite see how it's relevant to the question, though.
Anyway, the HTML can be any webpage. All I'm interested in are the link and image tags.
e.g.:
<a href="http://www.google.com">Google</a> - <img src="http://www.google.com/intl/en_ALL/images/logo.gif" />
In essence, this is a small part of my larger program. I will also be assigning the assignment operator (href or src) and the end tag (</a> or >) to the 2D array.
These will be used to retrieve the address within the tag.
So far I have not planned an output. I simply wish to get an array of all link and image addresses within the webpage. Once I have that, output will be a simple process of running through the array, printing all, counting duplicates, etc.
Do you know if it's possible to have variables in a regular expression?
Last edited by jedi_ralf : Apr 28th, 2008 at 5:17 pm.
If you have GNU Awk:
awk 'END {
  print "#links:", _
  for (i=1;i<=_;i++)
    print la[i]
  print "#images:", __
  for (i=1;i<=__;i++)
    print ia[i] }
{ r = $0
  # three-argument match() is a gawk extension: t[1] holds the text
  # captured by the first parenthesised group
  while (match(r = substr(r, RSTART+RLENGTH), /<a href=([^>]*)>/, t))
    la[++_] = t[1]
  }
{ r = $0
  while (match(r = substr(r, RSTART+RLENGTH), /<img([^>]*)>/, t))
    ia[++__] = t[1]
  }' html
Thank you for the code. There's some advanced stuff in there I can learn from.
However, the problem still remains that there will be code redundancy because I am still using two while loops, one for links and one for images. Is there no way I can replace the "<img" and "<a" tags with a variable and have only one while loop?
Thanks for your help so far!
You could:
awk 'END {
  print "#links:", _
  for (i=1;i<=_;i++)
    print la[i]
  print "#images:", __
  for (i=1;i<=__;i++)
    print ia[i] }
{ r = $0
  # one alternation covers both tags; t[2] is the second captured group
  while (match(r = substr(r, RSTART+RLENGTH), /<(a href=|img)([^>]*)>/, t)) {
    if (substr(r, RSTART, 2) == "<a")
      la[++_] = t[2]
    else
      ia[++__] = t[2]
    }
  }' html

Or, given that you insist on the variable:

awk 'END {
  print "#links:", _
  for (i=1;i<=_;i++)
    print la[i]
  print "#images:", __
  for (i=1;i<=__;i++)
    print ia[i] }
{ r = $0; tag = "<(a href=|img)([^>]*)>"
  while (match(r = substr(r, RSTART+RLENGTH), tag, t)) {
    if (substr(r, RSTART, 2) == "<a")
      la[++_] = t[2]
    else
      ia[++__] = t[2]
    }
  }' html

Or even:
awk 'END {
  print "#links:", _
  for (i=1;i<=_;i++)
    print la[i]
  print "#images:", __
  for (i=1;i<=__;i++)
    print ia[i] }
{ r = $0; lnk = "a href="; img = "img"; tag = "<("lnk"|"img")([^>]*)>"
  while (match(r = substr(r, RSTART+RLENGTH), tag, t)) {
    if (substr(r, RSTART, RLENGTH) ~ "^<"lnk)
      la[++_] = t[2]
    else
      ia[++__] = t[2]
    }
  }' html
Last edited by radoulov : Apr 29th, 2008 at 7:00 am.
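The examples above rely on gawk's three-argument match(). For an awk without that extension, a minimal sketch along the same lines could use the two-argument match() and cut the hit out with RSTART and RLENGTH. The sample line in the html variable and the array names links and imgs are assumptions for illustration, and this version collects the whole tag rather than only the captured attribute text.

```shell
# Sample input (an assumption for illustration, taken from the example line in the thread).
html='<a href="http://www.google.com">Google</a> - <img src="http://www.google.com/intl/en_ALL/images/logo.gif" />'

out=$(printf '%s\n' "$html" | awk '
BEGIN { tag = "<(a href=|img)[^>]*>" }    # dynamic regex held in a string
{
    r = $0
    # Two-argument match() sets RSTART and RLENGTH in any POSIX awk.
    while (match(r, tag)) {
        hit = substr(r, RSTART, RLENGTH)  # the whole matched tag
        if (hit ~ /^<a /)
            links[++nl] = hit
        else
            imgs[++ni] = hit
        r = substr(r, RSTART + RLENGTH)   # continue after this match
    }
}
END {
    print "#links:", nl + 0
    for (i = 1; i <= nl; i++) print links[i]
    print "#images:", ni + 0
    for (i = 1; i <= ni; i++) print imgs[i]
}')
printf '%s\n' "$out"
```

The remaining text is reassigned only after the hit has been copied out, so RSTART and RLENGTH always refer to the string they were computed from.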
I've ended up using your example of the while loop with a match in it (before I had been splitting by the tag and then using a for loop to loop through each instance of the tag). It now looks so much neater and clearer.
Also, using the or (|) between different tags has worked brilliantly, without the need for multiple loops for each tag.
I see that in my first post, although explaining my problem, I did not justify it well. As well as finding the tag in the line, I also needed to find the entire <a href=".."> or <img src="..." /> and replace it.
My final regular expression ended up being
/(<a|<img)[^>]+(href|src)=http:\/\/[^>]+[.]*>/
Finally, it's good to know that variables can be used, so long as you assign the whole regular expression to another variable before use. In this case your first example did the job, but it's great to know for future use.
Just to double check I've got it, would this be the solution to the code?
setTags[1,1] = "<a"
setTags[2,1] = "<img"
for ( j = 1; j < 3; j++ ) {
    TAG = "[>]?[.*]?"setTags[j,1]
    split($0, removeFront, TAG)
}
Once again, thanks for all your hard work. It really is appreciated!
Ralf
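For anyone reading later, here is a small sketch of that idea: a name written inside /.../ is taken as literal letters, while a string built by concatenation is treated as a regular expression wherever awk expects one (split, match, gsub, ~). The sample line, the tags array and the "[ \t>]" suffix are assumptions for illustration; the suffix is only there so that "<a" does not also match tags such as <abbr>.

```shell
# Hypothetical sample line with two links and one image.
html='<a href="x">one</a> <img src="y" /> <a href="z">two</a>'

out=$(printf '%s\n' "$html" | awk '
BEGIN { tags[1] = "a"; tags[2] = "img" }
{
    for (j = 1; j <= 2; j++) {
        # Build the regex as a string; /<TAG/ would look for the letters T, A, G.
        re = "<" tags[j] "[ \t>]"
        n = split($0, parts, re)           # n pieces means n - 1 separators
        count = (n > 0) ? n - 1 : 0        # an empty line yields n == 0
        printf "%s: %d\n", tags[j], count
    }
}')
printf '%s\n' "$out"
```

Since split() returns the number of pieces, the number of tags on the line is one less than that, which gives the per-line link and image counts mentioned in the first post.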
Gave it a go and got the results I was looking for. Took a bit of fiddling, but it works ^_^
Runs through each line of the file and removes images and links.
#!/bin/gawk -f
{
    setTags[1,1] = "<img "
    setTags[1,2] = "src"
    setTags[1,3] = ">"
    setTags[2,1] = "<a "
    setTags[2,2] = "href"
    setTags[2,3] = ">"
    for ( j = 1; j < 3; j++ ) {
        expression = setTags[j,1]setTags[j,2]"[^>]+[.]*"setTags[j,3]
        gsub(expression, "")
        if (j==2)
            gsub("</a>", "")
    }
}
The ending of the link (</a>) has to be removed separately because of the "[^>]" in the regular expression (which prevents it from overrunning). However, this also has the added advantage of leaving in the text or image which was wrapped in the link tags.
Thank you again for your help.
I have learnt a lot.
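A possible variation on the script above, offered as a sketch rather than the poster's final program: the array is filled once in BEGIN instead of on every line, the attribute is allowed anywhere inside the tag, and a print is added on the assumption that the cleaned line should be written out (the script as posted has no print statement). The sample line in html is an assumption taken from earlier in the thread.

```shell
# Sample input (assumed; same example line as earlier in the thread).
html='<a href="http://www.google.com">Google</a> - <img src="http://www.google.com/intl/en_ALL/images/logo.gif" />'

out=$(printf '%s\n' "$html" | awk '
BEGIN {
    # Same layout as the posted script: opening text, attribute, closing text.
    setTags[1,1] = "<img "; setTags[1,2] = "src";  setTags[1,3] = ">"
    setTags[2,1] = "<a ";   setTags[2,2] = "href"; setTags[2,3] = ">"
}
{
    for (j = 1; j < 3; j++) {
        # Concatenate the pieces into one dynamic regex per tag.
        expression = setTags[j,1] "[^>]*" setTags[j,2] "[^>]*" setTags[j,3]
        gsub(expression, "")
    }
    gsub("</a>", "")    # closing link tags are removed separately
    print               # assumed: emit the line with the tags stripped
}')
printf '%s\n' "$out"
```

On the sample line this leaves the link text in place and drops both tags, which matches the behaviour described in the post.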