Hi Dw

I'm trying to create a program that lets me browse websites when I'm not connected to the internet. I can download a single page using the code below, but what I want is to download the entire website with its contents (pictures, CSS, etc.) so that I can still browse it offline. The code I'm using requires me to supply the name under which the file will be saved to my drive, which will cause errors because I may not name it the same way as the original. I also want to scan each page for hyperlinks and, if it has any, download the linked content too, so that the site works offline the same way it does over the internet. The code that I'm using is:

Dim myWebClient As New System.Net.WebClient
Dim myFile2SaveTo As String = "C:\0000\index.html"
myWebClient.DownloadFile("http://example.com/", myFile2SaveTo)

I doubt that is possible without hacking the website, which is illegal and could get you a nice long prison sentence with free room, board and food for the rest of your life.

Comments
Love it!

Assuming you're not violating the website's terms of use, with a WebBrowser control, the Document property contains collections for images and links that you could use to download them. This is all well and good, except that you'll have to parse the HtmlDocument and fix every link that needs to point to a local source.

Edited 3 Years Ago by tinstaafl

Unfortunately I don't have much knowledge of VB.NET, however I would recommend you look into something along the lines of WGet.

WGet is a command-line utility which you might be able to incorporate into your VB scripts to achieve the same effect. This means that it is possible to download the entire site, including all files associated with it (with the exclusion of PHP code, content that has been restricted, etc.) and, with the correct options, update the links and such to allow you to browse it offline.

WGet works in a similar way to a spider, but instead of logging each page it stores a copy. You just need to be careful in defining your options, as if the site links to other external sites you might end up downloading their entire contents as well!

Edited 3 Years Ago by Octet

Even if you could download the entire site, wouldn't you have to use a server such as Apache2 in order to browse it with a browser? And what about the database that the site might use, such as the one DaniWeb uses? Are you going to download the entire database files as well? That would probably mean violating the site's security because the database probably contains member names and passwords.

Edited 3 Years Ago by Ancient Dragon

"Even if you could download the entire site, wouldn't you have to use a server such as Apache2 in order to browse it with a browser?"

All that's necessary is to fix the links to point to local files instead of remote files.

If the program only follows public links and downloads the public page it can't access restricted info.

The biggest danger in programming something like this is not having specific restrictions on which links it follows. Imagine the data dump from following every link tree in a site like DaniWeb, not to mention the time it would take to fix every link.
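To make that restriction concrete, here is a minimal sketch of one way to limit which links get followed. The module and function names (CrawlScope, ShouldFollow) are made up for illustration, and it assumes every candidate is already an absolute URL. It only accepts http/https links on the same host as the start page and skips anything it has already seen.

```vbnet
Imports System
Imports System.Collections.Generic

Module CrawlScope
    ' URLs already queued or downloaded, compared case-insensitively.
    Private ReadOnly visited As New HashSet(Of String)(StringComparer.OrdinalIgnoreCase)

    ' True only for http/https links on the same host as the start page
    ' that have not been seen before. Both arguments must be absolute URIs.
    Function ShouldFollow(startPage As Uri, candidate As Uri) As Boolean
        If candidate.Scheme <> Uri.UriSchemeHttp AndAlso candidate.Scheme <> Uri.UriSchemeHttps Then
            Return False
        End If
        If Not String.Equals(candidate.Host, startPage.Host, StringComparison.OrdinalIgnoreCase) Then
            Return False
        End If
        ' Drop the #fragment so page.html#top and page.html count as one page.
        Dim key As String = candidate.GetLeftPart(UriPartial.Query)
        ' HashSet.Add returns False when the key was already present.
        Return visited.Add(key)
    End Function

    Sub Main()
        Dim start As New Uri("http://example.com/")
        Console.WriteLine(ShouldFollow(start, New Uri("http://example.com/about.html")))
        Console.WriteLine(ShouldFollow(start, New Uri("http://example.com/about.html#team")))
        Console.WriteLine(ShouldFollow(start, New Uri("http://other.example.org/")))
    End Sub
End Module
```

A strict host comparison like this treats www.example.com and example.com as different sites, so you may want to loosen it. A depth limit or a cap on the number of pages would be a sensible addition too.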

Something like this would work, and with the --convert-links option it will change the links to ones suitable for offline viewing. If it is only .html files, .css and images then you won't need a web server to view the content offline.

$ wget --mirror -p --convert-links -P ./directory/to/store http://www.foobar.com 

Anything that can be viewed by the user can be downloaded, even things that require credentials, although obviously you would need to give WGet a cookie file so that it can log in.

You won't be able to download an entire database, but if the content from that database is rendered to the page then it can be downloaded as text. Taking DaniWeb as an example, I could assign a cookie to the site and download content from my profile as well as every single article being displayed on each page, whilst avoiding any links that lead to external sites.

The main problem, however, is that this will obviously consume a lot of bandwidth for both you and the site's servers... so you need to be careful you don't swamp them. This is especially important for small sites, otherwise you might find you're added to a blacklist.

You'll want to get the package WGet.exe and then call its process in VB.
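As a rough sketch of what calling it from VB could look like, the snippet below starts wget.exe with the same options as the command above and waits for it to finish. The module name, function names and the paths in Main are all placeholders I made up, so adjust them to wherever wget.exe and your output folder actually live.

```vbnet
Imports System
Imports System.Diagnostics

Module WgetRunner
    ' Builds the argument string for wget; both values are wrapped in quotes
    ' so that paths containing spaces survive.
    Function BuildArguments(url As String, targetDir As String) As String
        Return String.Format("--mirror -p --convert-links -P ""{0}"" ""{1}""", targetDir, url)
    End Function

    ' Runs wget.exe, waits for it to finish and returns its exit code
    ' (wget reports 0 when no problems occurred).
    Function RunWget(wgetPath As String, url As String, targetDir As String) As Integer
        Dim info As New ProcessStartInfo(wgetPath, BuildArguments(url, targetDir))
        info.UseShellExecute = False
        info.CreateNoWindow = True
        Using proc As Process = Process.Start(info)
            proc.WaitForExit()
            Return proc.ExitCode
        End Using
    End Function

    Sub Main()
        ' Hypothetical locations, used here only for illustration.
        Dim code As Integer = RunWget("C:\Tools\wget.exe", "http://www.foobar.com", "C:\0000\site")
        Console.WriteLine("wget exited with code " & code)
    End Sub
End Module
```

Avoid a target folder that ends in a backslash, because the backslash would escape the closing quote in the argument string. For a large site you would probably run this on a background thread so the form stays responsive.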

Edited 3 Years Ago by Octet

You can get the URL addresses from a downloaded page by using Regex.Matches with an expression like this "http://[^""'\s>]+" (just an example; it matches everything from http:// up to the next quote, space or closing bracket). Then for each URL found in the list that Regex.Matches returns, spin up a new thread and use DownloadFile to download that file. You still need to check whether each file should be downloaded or not, so you won't start downloading from another website.
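Here is a small sketch along those lines, with one change: instead of matching bare http:// text it looks at href and src attributes, so relative links such as "css/site.css" are picked up too and resolved against the page's own address. The names LinkExtractor and ExtractLinks are just ones I chose, and a regex is only an approximation of real HTML parsing, so treat this as a starting point.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module LinkExtractor
    ' Matches href="..." or src="..." with single or double quotes
    ' and captures the value up to the closing quote or a # fragment.
    Private ReadOnly LinkPattern As New Regex(
        "\b(?:href|src)\s*=\s*[""']([^""'#]+)",
        RegexOptions.IgnoreCase)

    ' Returns the absolute http/https URLs referenced by the page, without duplicates.
    Function ExtractLinks(html As String, pageUrl As Uri) As List(Of Uri)
        Dim found As New List(Of Uri)
        Dim seen As New HashSet(Of String)
        For Each m As Match In LinkPattern.Matches(html)
            Dim absolute As Uri = Nothing
            ' Resolve relative links against the URL the page was downloaded from.
            If Not Uri.TryCreate(pageUrl, m.Groups(1).Value, absolute) Then Continue For
            ' Skip mailto:, javascript: and other non-web schemes.
            If absolute.Scheme <> Uri.UriSchemeHttp AndAlso absolute.Scheme <> Uri.UriSchemeHttps Then Continue For
            If seen.Add(absolute.AbsoluteUri) Then found.Add(absolute)
        Next
        Return found
    End Function

    Sub Main()
        Dim html As String = "<a href=""about.html"">About</a><img src='/img/logo.png'>"
        For Each link As Uri In ExtractLinks(html, New Uri("http://example.com/docs/index.html"))
            Console.WriteLine(link.AbsoluteUri)
        Next
    End Sub
End Module
```

This will miss links that are built by JavaScript or referenced from inside CSS files, so those would need separate handling.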

No idea about the page's name, though. If the URL has any GET parameters or the site uses URL rewriting, it might be difficult.
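One possible approach to the naming problem, sketched below under my own assumptions, is to derive the local path from the URL itself: mirror the site's folder structure, fall back to index.html when the URL ends in a slash, and append any GET parameters to the file name. LocalNames and LocalPathFor are hypothetical names, and the character replacement targets Windows file-name rules.

```vbnet
Imports System
Imports System.IO

Module LocalNames
    ' Maps a URL to a file path under rootDir, mirroring the site's folders.
    Function LocalPathFor(url As Uri, rootDir As String) As String
        Dim relative As String = Uri.UnescapeDataString(url.AbsolutePath).TrimStart("/"c)
        ' "/" or "/blog/" has no file name, so store it as index.html.
        If relative = "" OrElse relative.EndsWith("/") Then
            relative &= "index.html"
        End If
        ' Keep GET parameters in the name so page.php?id=1 and ?id=2 do not collide.
        If url.Query <> "" Then
            relative &= "_" & url.Query.TrimStart("?"c)
        End If
        ' Replace characters that Windows does not allow in file names.
        For Each bad As Char In "\:*?""<>|"
            relative = relative.Replace(bad, "_"c)
        Next
        Return Path.Combine(rootDir, url.Host, relative.Replace("/"c, Path.DirectorySeparatorChar))
    End Function

    Sub Main()
        Console.WriteLine(LocalPathFor(New Uri("http://example.com/"), "C:\0000"))
        Console.WriteLine(LocalPathFor(New Uri("http://example.com/css/site.css"), "C:\0000"))
        Console.WriteLine(LocalPathFor(New Uri("http://example.com/page.php?id=7"), "C:\0000"))
    End Sub
End Module
```

DownloadFile does not create missing folders, so call Directory.CreateDirectory(Path.GetDirectoryName(localPath)) before saving. A name like page.php_id=7 has no extension a browser recognises, so with URL rewriting or query strings you may still need to append .html yourself and fix the links that point to that page.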

Edited 3 Years Ago by TheApex

Thanks to you all, but what I would like to clarify is that I'm trying to create a program that will allow the user to surf offline, just like SurfOffline Pro and other programs; I don't want to violate any site's rules. Saving a site's pages to disk will help me because I also freelance in web development. Most of my clients have huge sites that they sometimes want me to update, so I decided to create a program to download the website, update it, and save it back to the server. This will also help if a client has lost or damaged his/her backup. Instead of giving me or my staff members server credentials to download or update the site online, the client can just give me the site address, and I will download it, update it, and save it back. Also, if clients want to move to my server, I will simply provide a field for them to type in their current domain, and I can then retrieve the website straight to my server. If a client wants to use the same web pages on a new hosting server, this will save a lot of time compared with logging in to both servers to download and upload the website.

Judging by your question - you chose a very... humorous username.

All you need is a simple shell script, if it's on a Linux web server, that uses the cp command with the -b switch (which backs up any file it would overwrite). The -u switch might also be useful. I'm not sure why you're asking this in VB.NET, maybe that's due to my lack of knowledge in the area, but still - everything you want can be achieved on a Windows server using a batch script (or PowerShell, which is simply cmd on steroids) as well as scheduled tasks, or on a Linux web server using a shell script that simply does something like:

cp -Ruf /home/webserver/public/website.com/public/[source] /sys/backup/[destination]

YOUR PATHS WILL BE DIFFERENT...

You need the -R switch whenever the source is a directory; -R is recursive, so it copies the directory and everything beneath it. -u only copies a file if the source is newer than the destination or the destination is missing. -f forces the copy. Note that cp switches are case-sensitive.

Sorry for the sloppy reply in a hurry.

For all of the above you'll need access to the server via SSH or remote desktop, or at least FTP/SFTP access to execute manual backups.

Edited 3 Years Ago by mmcdonald

Thanks, but I want to do it in VB because there are some other features I would like to add to this project, and the user will find everything in one place. The features I want to add will help the user download, update and save the pages quickly and easily with a few simple options, so that the minute the user clicks submit, the program saves a backup copy and also transfers the pages to a server. That's why I find SurfOffline Pro a good model for this project: it allows the user to just type in the URL and then choose what to download, and if the user doesn't choose anything but did provide a valid URL, it downloads the entire website. So what I'm trying to do at the moment is download the website simply by providing the URL, so that I can then add the features I want to this project.
