Hi guys, I'll try to keep it as short as possible.
Basically, I love the webcomic xkcd. Recently I've been browsing around for a downloader, and I found one written in Perl that works great. It even grabs the alt-text (the text that pops up when you hover your mouse over the comic for a second or so), which is a feature I was desperately looking for.

I thought it was all too good to be true: the program did everything I wanted it to, and did it well. It would download each comic to its own folder and save the alt-text in a text file in the same folder.

It was only after looking at the output that I found the problems:
1) (Main) When a comic uses an apostrophe, whether in the title or in the alt-text, the script does not write an apostrophe. Instead of saving a file called, say, "It's Hot Today.png", it saves the file with this in place of the apostrophe:

& # 3 9 ;

NOTE: I had to put a space between each character, otherwise the forum would convert it into an apostrophe.

This would occur within the alt-text files as well, if the alt-text contained an apostrophe.

Is it maybe an issue with the operating system (Ubuntu 10.10, 32-Bit)?
Does anyone know what I can change either in my system, or in the script, to rectify this problem?

The second problem:
2) About 90% of the comics on www.xkcd.com are in .png format, but about 10% are not. The problem is that the program always saves the file as "x.png", even if the actual file is a .jpg.

Being the absolute beginner I am, I have no idea how to change this from "save all images with the .png extension" to "save each image in its original format".


If anyone could provide any help at all, or even just give their opinion on potential solutions, it would be greatly appreciated.
The full source code of the application is listed below.

Again, I am just starting out in the world of Perl, and have next to no experience with it or its syntax.


Thank you so much in advance for your help,

Liam.

#!/usr/bin/perl 

use LWP::Simple;
#  use Smart::Comments;

## Objectives ##

#  Download all comics from xkcd.com
#  Ability to download new comics
#  Download ALT text
#  Saved in: ~/Desktop


# Set Specifics
$sitePrefix = "http://xkcd.com/";

## Path to main XKCD directory ##
$path = "$ENV{HOME}/Desktop";


mkdir "$path/XKCD", 0755 or print "XKCD Directory Exists\n";
chomp($path = "$path/XKCD");

$d = get($sitePrefix);
if ($d =~ /http:\/\/xkcd.com\/(\d+)\//) {
    $current = $1;
}

# Obtains all individual comic data
sub getComicData {
    my $siteData = get("$sitePrefix$current/");
    my @data = split /\n/, $siteData;
    foreach (@data) {
        if (/http:\/\/xkcd.com\/(\d+)\//) {
            $current = $1;
        }
        if (/src="(http:\/\/imgs.xkcd.com\/comics\/.+\.\w{3})"/) {
            $currentUrl = $1;
            if (/alt="(.+?)"/) {
                $title = $1;
                $title = "House of Pancakes" if $current == 472;  # Color title on comic 472 with weird syntax
            }
            if (/title="(.+?)"/) {    # title attribute, commonly known as 'alt' text
                $alt = $1;
            }
        }
    }
}

chdir "$path" or die "Cannot change directory: $!";
&getComicData();
while ( get("$sitePrefix$current/")){ ### Writing Files $current: $title
    print "Writing Files $current: $title\n";
   # Create directories for individual comics
    mkdir "$current $title", 0755 or die "Previously Downloaded";
    chdir "$path/$current $title" or die "Cannot change directory: $!";

    # Save image file
    $image = get($currentUrl);
    open my $IMAGE, '>>', "$title.png"
        or die "Cannot create file!";
    print $IMAGE $image;
    close $IMAGE;

    # Save alt text
    open my $TXT, '>>', "$title ALT.txt"
        or die "Cannot create file!";
    print $TXT $alt;
    close $TXT;
    chdir "$path" or die "Cannot change directory: $!";
    $current--;

    # Check for non existent 404 comic
    $current-- if $current == 404;

    &getComicData();
}


# End Gracefully
print "Download Complete\n"

But the issue with the file types still needs to be addressed.

Instead of taking the filename from the alt attribute, take it from the end of the src URL.

# Obtains all individual comic data
sub getComicData {
    my $siteData = get("$sitePrefix$current/");
    my @data = split /\n/, $siteData;
    foreach (@data) {
        if (/http:\/\/xkcd.com\/(\d+)\//) {
            $current = $1;
        }
        
        #Instead of taking filename from alt
        #take it from the end of the src URL
        if (/src="(http:\/\/imgs.xkcd.com\/comics\/(.+\.\w{3}))"/) {
            $currentUrl = $1;
            #if (/alt="(.+?)"/) {
            #    $title = $1;
            #    $title = "House of Pancakes" if $current == 472;  # Color title on comic 472 with weird syntax
            #}
            $title = $2;
            print "File to save: $title\n";    # print instead of say, which needs "use feature 'say'"
            if (/title="(.+?)"/) {    # title attribute, commonly known as 'alt' text
                $alt = $1;
            }
        }
    }
}

Okay, I'm a little lost... Here is the updated version of the code, using the method I mentioned above to fix the apostrophe issue.

#!/usr/bin/perl 

use LWP::Simple;
#  use Smart::Comments;

## Objectives ##

#  Download all comics from xkcd.com
#  Ability to download new comics
#  Download ALT text
#  Saved in: ~/Desktop

## Bugs ##

# Files that aren't a png, are being saved as one.

# Set Specifics
$sitePrefix = "http://xkcd.com/";

## Path to main xkcd directory ##
$path = "$ENV{HOME}/Desktop";


mkdir "$path/xkcd", 0755 or print "xkcd Directory Exists\n";
chomp($path = "$path/xkcd");

$d = get($sitePrefix);
if ($d =~ /http:\/\/xkcd.com\/(\d+)\//) {
    $current = $1;
}

# Obtains all individual comic data
sub getComicData {
    my $siteData = get("$sitePrefix$current/");
    my @data = split /\n/, $siteData;
    foreach (@data) {
        if (/http:\/\/xkcd.com\/(\d+)\//) {
            $current = $1;
        }
        if (/src="(http:\/\/imgs.xkcd.com\/comics\/.+\.\w{3})"/) {
            $currentUrl = $1;
            if (/alt="(.+?)"/) {
                $title = $1;
                $title =~ s/&#(\d+);/chr($1)/ge;
                $title = "House of Pancakes" if $current == 472;  # Color title on comic 472 with weird syntax
            }
            if (/title="(.+?)"/) {    # title attribute, commonly known as 'alt' text
                $alt = $1;
                $alt =~ s/&#(\d+);/chr($1)/ge;
            }
        }
    }
}

chdir "$path" or die "Cannot change directory: $!";
&getComicData();
while ( get("$sitePrefix$current/")){ ### Writing Files $current: $title
    print "Writing Files $current: $title\n";
   # Create directories for individual comics
    mkdir "$current $title", 0755 or die "Previously Downloaded";
    chdir "$path/$current $title" or die "Cannot change directory: $!";

    # Save image file
    $image = get($currentUrl);
    open my $IMAGE, '>>', "$title.png"
        or die "Cannot create file!";
    print $IMAGE $image;
    close $IMAGE;

    # Save alt text
    open my $TXT, '>>', "$title ALT.txt"
        or die "Cannot create file!";
    print $TXT $alt;
    close $TXT;
    chdir "$path" or die "Cannot change directory: $!";
    $current--;

    # Check for non existent 404 comic
    $current-- if $current == 404;

    &getComicData();
}


# End Gracefully
print "Download Complete\n"
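
For reference, the substitution added to $title and $alt above can be tried on its own. This is only a minimal sketch, assuming that decimal references like the one shown earlier are the only thing that needs decoding. The helper name decode_numeric_entities and the sample string are made up for illustration and are not part of the script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not in the original script): replace decimal numeric
# character references such as &#39; with the character they stand for.
sub decode_numeric_entities {
    my ($text) = @_;
    # /g replaces every match; /e evaluates chr($1) as Perl code.
    $text =~ s/&#(\d+);/chr($1)/ge;
    return $text;
}

# Sample input only; the real script would pass in $title or $alt.
my $raw = 'It&#39;s Hot Today';
print decode_numeric_entities($raw), "\n";
```

This pattern does not cover named entities (such as &amp;) or hexadecimal references. If those turn up, the decode_entities function from the CPAN module HTML::Entities may be a better fit.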

d5e5, so by using the filename from the end of the src URL, that should solve the issue with the incorrect filetypes, yes?

Yes. Assuming that the src URL ends in an extension that correctly identifies the file type, you can capture the file name (or just the extension, .png or .jpg, if you prefer.)

You need to change the following statement from

open my $IMAGE, '>>', "$title.png"
        or die "Cannot create file!";

to

open my $IMAGE, '>>', "$title.$ext"
        or die "Cannot create file!";

after declaring a variable called $ext and capturing the correct file extension into it.

How do you capture the file extension into your $ext variable? See the following snippet from your script, which I have modified slightly. Notice that the regular expression (or regex) now contains two sets of parentheses instead of one. The second set of parentheses captures the end of the src url into $2, which we save to $ext.

# Obtains all individual comic data
sub getComicData {
    my $siteData = get("$sitePrefix$current/");
    my @data = split /\n/, $siteData;
    my $ext;
    foreach (@data) {
        if (/http:\/\/xkcd.com\/(\d+)\//) {
            $current = $1;
        }
        if (/src="(http:\/\/imgs.xkcd.com\/comics\/.+(\.\w{3}))"/) {
            $currentUrl = $1;
            $ext = $2;
            print "$currentUrl has the following extension: $ext", "\n";
            sleep(3);
            if (/alt="(.+?)"/) {
                $title = $1;
                $title =~ s/&#(\d+);/chr($1)/ge;
                $title = "House of Pancakes" if $current == 472;  # Color title on comic 472 with weird syntax
            }
            if (/title="(.+?)"/) {    # title attribute, commonly known as 'alt' text
                $alt = $1;
                $alt =~ s/&#(\d+);/chr($1)/ge;
            }
        }
    }
}
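
The capture can also be tried in isolation. Note that in the snippet above the second group, (\.\w{3}), includes the leading dot, so $ext would hold ".png" rather than "png". The sketch below takes the other route and captures the extension without the dot, so it can be joined to a name with a single ".". The helper name split_image_url and the example URL are assumptions for illustration only.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not in the original script): split a comic image URL
# into its base file name and its extension, without the dot.
sub split_image_url {
    my ($url) = @_;
    # Group 1: everything after /comics/ up to the last dot.
    # Group 2: the 3 or 4 word characters after that dot.
    if ($url =~ m{/comics/(.+)\.(\w{3,4})$}) {
        return ($1, $2);
    }
    return;    # empty list when the URL does not look like a comic image
}

# Made-up URL in the same shape as the src attributes the script matches.
my ($name, $ext) = split_image_url('http://imgs.xkcd.com/comics/example_comic.jpg');
print "$name.$ext\n";
```

Whichever form is used, the dot should come from only one place: either from the captured extension or from the string that builds the file name, not both.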

My apologies for the late reply. I experienced a death in my family only hours after seeing your reply, and obviously had other priorities.

So, now, including all the changes you've recommended, plus those to fix the apostrophe issue, I have the script. It works, except that it is not saving the files with an extension at all. Files come out like this: "MyFile." --with no extension. The ALT text files are fine, and still have their .txt extension.

I've thought about this problem, and tried to fix it, for a while now, but I can't seem to find the issue.
Would you possibly be able to shed some light?

Thanks, and thank you so much for your past help,

Liam.

Please accept my condolences for your loss.

I think the remaining problem is that the lexical $ext variable (i.e. declared with my, and so limited in scope to the enclosing block or subroutine) has gone out of scope by the point where you name and save the file. Try the following: move the my $ext; out of the subroutine and put it near the beginning of your program, where it will have the widest scope. You could put it immediately before the # Set Specifics comment, for example, like this:

## Bugs ##

# Files that aren't a png, are being saved as one.
my $ext; #Declare variable to store file extension until file is saved.

# Set Specifics
# etc.
#.

Make sure the my $ext; occurs only once in your program. The script should really start with use strict; use warnings; as nearly all Perl scripts should.
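
The scoping behaviour described above can be seen in a small stand-alone sketch. The two subroutine names are invented for the demonstration and do not appear in the script. It assumes use strict, under which $ext has to be declared before the subroutines that use it.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Declared once at file level: visible to the subs below and to the main code.
my $ext = '';

# Hypothetical sub: assigns to the file-level $ext.
sub set_ext_shared {
    $ext = '.jpg';
    return;
}

# Hypothetical sub: "my" creates a NEW $ext that hides the file-level one
# and disappears when the sub returns.
sub set_ext_private {
    my $ext = '.png';
    return;
}

set_ext_private();
printf "after set_ext_private: '%s'\n", $ext;    # file-level $ext is still empty

set_ext_shared();
printf "after set_ext_shared:  '%s'\n", $ext;    # file-level $ext is now .jpg
```

A second my $ext; inside the subroutine would behave like set_ext_private here, which is why the declaration should occur only once.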

Thank you so much for your help, it worked. (Although, the files were being saved as "MyFile..png", but that was easily fixed by removing the extra period from the file saving section.)

Once again, you've been amazing in both your speedy replies, and great help.

Thanks,
Liam.

You're welcome. I didn't want to test the full script as it involved saving a lot of files so the suggestions were kind of hit and miss, but I'm glad it works now. Please don't forget to mark this thread solved.
