I'm helping a friend on a research project. She has a list of websites (~100) and wants to download all the text from several pages of each site for use in another program. However, she is not technology savvy (she has never used Linux or programmed before) but is very smart and willing to learn. She originally planned to spend several weeks copy-pasting by hand, but she's now been convinced to try a computational solution. Is there any tool you would suggest?

Thanks,

Agilemind

The easiest way would probably be to set up a MySQL database and then run a simple PHP crawler, or, since it is only the text you want, simply to cURL each page.
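
For example, a bare-bones cURL script along these lines (just a sketch; it assumes the list of addresses is saved in a plain urls.txt file, one URL per line) would fetch each page and keep only its visible text:

<?php
// Rough sketch: pull down each page with cURL and keep only its text.
// Assumes the URLs are listed in urls.txt, one per line (hypothetical file name).
$urls = file('urls.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);

foreach ($urls as $i => $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // return the page instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);  // follow redirects
    $html = curl_exec($ch);
    curl_close($ch);

    if ($html === false) {
        continue;  // skip any page that fails to download
    }

    // strip_tags() removes the HTML markup and keeps whatever text is left
    // (script and style contents included, so the output may need a little clean-up)
    $text = strip_tags($html);
    file_put_contents('page_' . $i . '.txt', $text);
}

Each page then ends up in its own .txt file, which should also suit a program that only accepts text or Word documents as input.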

I do, however, have some concerns relating to copyright infringement and plagiarism. Even for research, there is no real reason for your friend to be copying entire pages' worth of information, especially not from 100 or more sites. To research effectively, she should find the relevant facts and statistics on each page and then cite the source they were taken from in the appropriate format.

Could you elaborate a bit more, please? I might be misunderstanding, but otherwise it does seem that she is plagiarising, or at least researching incorrectly.

Thanks,

She is comparing the text from various websites to find common and differing themes, messages, patterns, etc. in how a particular subject is presented. She already has a program she is familiar with for doing the text analysis, but it requires either a plain text document or a Word document as input.
