Hello,

I used bash scripting for some time in the past and know quite a
bit of Linux, but I haven't used either in a long time.

I am between jobs and took a data entry job where I need to do some stats work on a very long list of websites.

I get the stats from quantcast.com, which gives statistics for
other websites. I put each site from the list I already have
into the SITENAME field here:

http://www.quantcast.com/SITENAME/demographics


Here's the url for google.nl stats on quantcast:

http://www.quantcast.com/google.nl/demographics


Here's a screenshot of that page, with google.nl in place of the above SITENAME:

http://s995.photobucket.com/albums/af80/unknownbucket/?action=view&current=quantcast.jpg


(The problem now is that the quantcast.com site gives the stats in images instead of text so I will have a hard time getting the values out of the images.)


Once I have these values, they have to be entered into the xls
sheet. So I will need this information in a file format that can
be converted or imported into Excel easily, such as CSV.

I know this can be done, but I will need to relearn quite a lot of
bash scripting. I have not used it for some time.

Can someone please help me write this script? I will be able to
understand the basics for sure.


This is what I had intended to do at first, before I knew that the stats are in images rather than text:

1. Copy / Paste the site list into a text file. (Each site is on a
newline already.)

2. Use something like sed/awk etc to insert the http://www.quantcast.com/
before each of these sitenames.

3. To each of the results of (2) above, append "/demographics"
at the end of the URL.

4. The above will now be a file with links to each of the sites on
quantcast.com. Using wget, download each page and save them with some
names or numbers.

5. Tidy the html or somehow batch-convert the html to plain-text.

6. From this file, get the given field names and their values (that
is, the numbers) and save this info for each site into a new file,
along with only the original site name (without the quantcast.com
part), using the cut command or something.

7. Convert this file into a CSV or tab-delimited format.

The CSV conversion would, I think, be the last thing to do.
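
Steps 1 to 4 above could be sketched roughly as follows. This is only a sketch: sites.txt and the pages directory are made-up names for illustration, and it assumes the site list has exactly one site name per line.

```shell
#!/bin/sh
# Sketch of steps 1-4. The names sites.txt and pages/ used in the usage
# line below are hypothetical.

# Steps 2-3: strip trailing whitespace, drop empty lines, then wrap each
# site name in the quantcast URL pattern.
build_urls() {
    sed -e 's/[[:space:]]*$//' -e '/^$/d' \
        -e 's|^|http://www.quantcast.com/|' -e 's|$|/demographics|' "$1"
}

# Step 4: fetch each URL and save it under a numbered name.
# fetch_pages LISTFILE OUTDIR
fetch_pages() {
    mkdir -p "$2"
    n=0
    build_urls "$1" | while IFS= read -r url; do
        n=$((n + 1))
        wget -q -O "$2/page_$n.html" "$url"
    done
}

# Usage: fetch_pages sites.txt pages
```

Since the pages are numbered in list order, line N of the site list corresponds to page_N.html, as long as the list has no blank lines.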

But first, I need to try to get each page, save it with a numbered
name, then extract the images.

Save the images for each site in separate folders.
(The image names are the same for every site, for example
demograp.png, demograq.png, demograr.png, demogras.png.)

And then extract the content of each image using any utility I can
find. I hope this part really can be done.
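
Getting the images out of a saved page and into one folder per site might look something like this. It is a sketch under assumptions I have not checked against the real pages: that the images are referenced with double-quoted src attributes and that those hold absolute URLs. The function names are made up.

```shell
#!/bin/sh
# Sketch: list the PNG URLs referenced in one saved page and download them
# into a folder of their own. Assumes src="..." attributes with absolute
# URLs; inspect a real page before relying on this.

# list_png_urls PAGE
list_png_urls() {
    grep -o 'src="[^"]*\.png"' "$1" | sed -e 's/^src="//' -e 's/"$//'
}

# save_images PAGE DIR
save_images() {
    mkdir -p "$2"
    list_png_urls "$1" | while IFS= read -r url; do
        wget -q -P "$2" "$url"
    done
}

# Usage: save_images pages/page_1.html images/1
```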

Any help is very much appreciated, and thanks.
Regards,
deboo

All 3 Replies

This is not really a shell scripting problem -- reading the content of images is a very hard AI question, the difficulty of which underlies the whole concept of CAPTCHAs. While, indeed, the images quantcast provides are not deliberately distorted like CAPTCHA images, it's still non-trivial. If you want to go this route, I'd suggest first using ImageMagick <http://www.imagemagick.org> to chop each number into its own image according to the pattern common to all quantcast graphs of a given type, then process each number using an OCR program such as GOCR <http://jocr.sourceforge.net>.
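
A rough sketch of that crop-then-OCR idea, for a single number. The region size and offset are placeholders you would have to measure from a real graph, the function names are invented for illustration, and it assumes ImageMagick's convert and gocr are both installed.

```shell
#!/bin/sh
# Sketch: cut one rectangular region out of an image with ImageMagick and
# pass it to GOCR. All coordinates are placeholders, not measured values.

# crop_geometry WIDTH HEIGHT X Y -> ImageMagick geometry string WxH+X+Y
crop_geometry() {
    printf '%sx%s+%s+%s\n' "$1" "$2" "$3" "$4"
}

# ocr_region IMAGE WIDTH HEIGHT X Y
# Crops the region to a temporary PNM file (a format gocr reads directly)
# and prints whatever text gocr recognises in it.
ocr_region() {
    tmp=$(mktemp) || return 1
    convert "$1" -crop "$(crop_geometry "$2" "$3" "$4" "$5")" +repage "pnm:$tmp"
    gocr -i "$tmp"
    rm -f "$tmp"
}

# Usage (placeholder coordinates): ocr_region demograp.png 60 20 100 40
```

Because every graph of a given type has the same layout, the same set of coordinates should work for every site once measured, but expect to check the OCR output by eye for a while.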

This is not really a shell scripting problem -- reading the content of images is a very hard AI question, the difficulty of which underlies the whole concept of CAPTCHAs. While, indeed, the images quantcast provides are not deliberately distorted like CAPTCHA images, it's still non-trivial.

Right now I am still manually typing the content due to this problem.
I still require some help and shell scripting may be able to help me here.

I already prepended the full quantcast URL to each site (concatenated the two strings, in Excel lingo), but I still have to manually edit each cell using the F2 key and then press Enter. There are 10,000 of them, so I wish I could just add a space at the end of each URL.

I guess sed can do this? Can someone give me the syntax to do this?
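
For reference, appending a space to every line with sed can be sketched like this. The file names in the usage line are placeholders, and whether the trailing space has the desired effect once the file is back in Excel is something I have not verified.

```shell
#!/bin/sh
# Sketch: print FILE with one space appended to the end of every line.
# In a sed substitution, $ matches the end of the line.
append_space() {
    sed 's/$/ /' "$1"
}

# Usage: append_space urls.txt > urls_spaced.txt
```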

If you want to go this route, I'd suggest first using ImageMagick <http://www.imagemagick.org> to chop each number into its own image according to the pattern common to all quantcast graphs of a given type, then process each number using an OCR program such as GOCR <http://jocr.sourceforge.net>.

I don't understand the bold text in the above quote. Can you please clarify?

Regards,
Deboo
