For the last 5 months my friends and I have been trying to collect relevant information from news sites (information about global warming, pollution, ecology, etc.).
The only solution I've found is to use each news site's own search engine and type in suitable keywords, such as the topics in the parentheses above. The matching articles then appear, and the only way to collect the relevant sentences is to read the whole text or to use CTRL+F and type in the keywords again.

Unfortunately, this only worked for a few days because we got tired of doing it. I've calculated that it would take us a long time to achieve what we wanted.

Before the project started, I thought the best thing would be to come up with an artificial intelligence algorithm that would save us 80% of our time. I don't see any other good programming approach apart from looping through the text and extracting the sentences that contain the relevant words, but that is the same thing CTRL+F does.

I don't want to make this post too long. I would be very glad if someone could tell me how to approach such a problem (efficiently collecting data from news sites). Is AI worth it? If yes, what type of AI should I use, or should I use another approach?
I would also consider tagged languages to solve my problem. If there is a better language, please suggest it.

All 2 Replies

Well, you could retrieve the page data with a script (Perl, Python, etc.). Write a script that downloads the page, then use regular expressions to search it for the words you care about. That should help you screen the page data and reduce your manual workload.
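To make that concrete, here is a rough Python sketch of the idea, using only the standard library. The keyword list comes from the question; the helper names (`fetch`, `html_to_text`, `matching_sentences`) are made up for illustration, and the sentence splitting is a naive regex that will trip over abbreviations, so treat it as a starting point rather than a finished tool.

```python
import re
import urllib.request
from html.parser import HTMLParser

# Topics from the question; adjust to taste.
KEYWORDS = ("global warming", "pollution", "ecology")


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def fetch(url):
    """Download a page and decode it (hypothetical helper, needs network)."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


def html_to_text(html):
    """Strip tags and collapse whitespace into single spaces."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return re.sub(r"\s+", " ", " ".join(parser.parts)).strip()


def matching_sentences(text, keywords=KEYWORDS):
    """Return the sentences that mention any keyword (case-insensitive)."""
    pattern = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)
    # Naive split: break after '.', '!' or '?' followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    return [s for s in sentences if pattern.search(s)]
```

You would call it as `matching_sentences(html_to_text(fetch(url)))` for each article URL you have collected. Keep in mind that this still depends on the page layout and on the site allowing automated requests.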

Or, much better, ask the people maintaining those sites whether they offer something like an RSS feed you can access. That gives you the data as text, uncluttered by advertising, JavaScript, menus, headers, etc., and of course with their full blessing and no risk of your script suddenly failing because someone changed the layout of their pages.
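If a site does offer a feed, filtering it takes only a few lines. The sketch below assumes a plain RSS 2.0 feed (`<item>` elements with `<title>`, `<description>` and `<link>`); Atom feeds use different element names and namespaces and would need adjusting. The function name `relevant_items` is just an illustration.

```python
import xml.etree.ElementTree as ET

# Topics from the question; adjust to taste.
KEYWORDS = ("global warming", "pollution", "ecology")


def relevant_items(rss_xml, keywords=KEYWORDS):
    """Return (title, link) pairs for feed items that mention any keyword."""
    root = ET.fromstring(rss_xml)
    hits = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        summary = item.findtext("description", default="")
        link = item.findtext("link", default="")
        # Case-insensitive substring match on title plus summary.
        haystack = (title + " " + summary).lower()
        if any(k.lower() in haystack for k in keywords):
            hits.append((title, link))
    return hits
```

The feed text itself can be downloaded with `urllib.request.urlopen(feed_url).read()` and passed straight in. Run it on a schedule and you get a growing list of candidate articles without anyone typing keywords into a search box.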
