Hi all,
I'm writing a web crawler as part of a search engine project at university, and have been using jsoup. It connects to a page and neatly extracts all the hrefs from anchor tags for me. However, when I add these to the ArrayList of URLs waiting to be connected to and crawled, I can end up with literally hundreds of duplicates.
I'm now a bit lost in the bottomless pit of HashSet copy/remove functions people have written for this problem, and I'm not convinced that is the best solution. I would prefer to prevent the duplicates from being crawled in the first place, along the lines of the pseudocode below. The problem is that if I'm stepping through a list, I can't add to it at the same time.
get newurl from arraylist;
if (newurl is not in visitedarray) {
    connect to newurl;
    grab links and add them to arraylist;
    add newurl to visitedarray;
}
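
For illustration, here is a minimal Java sketch of one way the idea above could work. It is not necessarily the best approach. It assumes a queue (ArrayDeque) in place of the ArrayList, so URLs can be added while others are being taken off, and a HashSet that records every URL ever queued, so duplicates are rejected before they get in. The names DedupCrawler, crawl, fetchLinks and maxPages are made up for this example. fetchLinks stands in for the jsoup part, which in a real crawler might be something like Jsoup.connect(url).get().select("a[href]") with absUrl("href") called on each element.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

class DedupCrawler {
    // Crawls breadth-first from seed and returns the URLs in the order visited.
    // fetchLinks is a hypothetical stand-in for the code that downloads a page
    // and returns the links found on it (e.g. via jsoup).
    static List<String> crawl(String seed,
                              Function<String, List<String>> fetchLinks,
                              int maxPages) {
        Deque<String> frontier = new ArrayDeque<>(); // URLs waiting to be crawled
        Set<String> seen = new HashSet<>();          // every URL ever queued
        List<String> visited = new ArrayList<>();    // URLs crawled so far

        frontier.add(seed);
        seen.add(seed);

        // Polling from a queue avoids iterating over a list while adding to it.
        while (!frontier.isEmpty() && visited.size() < maxPages) {
            String url = frontier.poll();
            visited.add(url);
            for (String link : fetchLinks.apply(url)) {
                // Set.add returns false when the link is already present,
                // so a duplicate never reaches the queue.
                if (seen.add(link)) {
                    frontier.add(link);
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        // A tiny fake "web" so the sketch runs without network access.
        Map<String, List<String>> fakeWeb = Map.of(
            "a", List.of("b", "c", "b"),
            "b", List.of("a", "c"),
            "c", List.of("a", "d", "d"),
            "d", List.of());
        // prints [a, b, c, d]
        System.out.println(crawl("a", url -> fakeWeb.getOrDefault(url, List.of()), 100));
    }
}
```

Checking the set at the moment a link is found, rather than when it is taken off the list, means each URL is queued at most once, so there is nothing to clean up afterwards. Real URLs would probably also need normalising first (trailing slashes, fragments), which this sketch ignores.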