Member Avatar for kris0r

Hi all,

I'm writing a web crawler as part of a search engine project at university, and have been using jsoup to my advantage. This connects to a page and neatly takes all the hrefs from anchor tags for me. However, when I add these to the ArrayList of URLs waiting to be connected to and crawled, I can end up with literally hundreds of duplicates.

I'm now a bit lost in the bottomless pit of HashSet copy/remove functions people have written for this problem, and am not convinced this is the best solution. I would prefer to somehow prevent the duplicates in the first place, along the lines of the pseudocode below. The problem is that if I want to step through an array, I can't then add to it over time.

get newurl from arraylist;
if (newurl is not in visitedarray) {
    connect to newurl;
    grab links;
    add newurl to visitedarray;
}

The ArrayList class has a boolean contains(Object o); method that returns true if the ArrayList already contains this Object, so you can use this for your if(newurl is not in visitedarray) test.
The test used to see if Object o is already in the ArrayList is o.equals(element) for each element in the list. So you need to ensure that the elements you are putting in the list have an equals method that works the way you need. The API doc for URL (or whatever other class you are using) will explain how it implements equals. If it's your own class, then it's up to you.
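
As a rough illustration of the contains test inside a crawl loop, here is a minimal sketch. It assumes URLs are stored as Strings; the class name CrawlSketch, the grabLinks method, and the FAKE_WEB map are hypothetical stand-ins for the jsoup step and are not part of any library. An index-based loop is used instead of a for-each loop so that new links can be appended to the list while it is being stepped through.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CrawlSketch {
    // Hypothetical stand-in for the jsoup step: maps a page to the links found on it.
    static final Map<String, List<String>> FAKE_WEB = new HashMap<>();
    static {
        FAKE_WEB.put("a", Arrays.asList("b", "c", "a"));
        FAKE_WEB.put("b", Arrays.asList("a", "c"));
        FAKE_WEB.put("c", Arrays.asList("b"));
    }

    // Returns the links on a page, or an empty list for an unknown page.
    static List<String> grabLinks(String url) {
        List<String> links = FAKE_WEB.get(url);
        return links != null ? links : Collections.<String>emptyList();
    }

    // Visits every page reachable from start once and returns the visit order.
    static List<String> crawl(String start) {
        List<String> toVisit = new ArrayList<>();
        List<String> visited = new ArrayList<>();
        toVisit.add(start);
        // Index-based loop: toVisit.size() is re-read on every pass,
        // so links appended below are picked up later in the same loop.
        for (int i = 0; i < toVisit.size(); i++) {
            String url = toVisit.get(i);
            if (visited.contains(url)) {
                continue; // already crawled
            }
            visited.add(url);
            for (String link : grabLinks(url)) {
                // contains() calls equals() against each element in turn
                if (!visited.contains(link) && !toVisit.contains(link)) {
                    toVisit.add(link);
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        System.out.println(crawl("a")); // prints [a, b, c]
    }
}
```

Note that contains on an ArrayList scans the whole list, so for a large crawl a HashSet of visited URLs would give faster lookups with the same equals-based semantics.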

Member Avatar for kris0r

OK, I shall start with the contains method from ArrayList and have a play. I've looked up and down the API for ArrayList most of today and can't believe I didn't see contains or put two and two together. Never mind! The "Object" parameter type did confuse me, though: is this a generic type, meaning it can be whatever you make it? I got errors when trying to remove Strings even though I'd added Strings, so I had to take an extra step and call Object.toString() to use them again... Thanks, James.

Edit: to save double posting, I thought I'd add another question here. Is there a way to dynamically assign variable names? I want to create documents from the links I find, but the link changes each time and I don't want each document to just overwrite the last. How can I name each new object (which would be passed to the next part of the search engine to be analysed)?

The contains method is defined as taking an Object as its parameter to make it suitably universal. The equals test it applies to each element typically starts by checking the type of the param against that element, and goes no further if the types don't match.
Since Java 1.5 you should always specify the type of your Collections, e.g.
ArrayList<URL>
which will (a) ensure you don't put anything incorrect into it and, more usefully, (b) mean that when you retrieve anything from the list the compiler already knows what type it is, so you don't need to cast or call toString() on it.
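
To make points (a) and (b) concrete, here is a small sketch using a list of Strings. The class name TypedListDemo and the helper firstUpper are made up for the example, and the example.com URLs are placeholders.

```java
import java.util.ArrayList;
import java.util.List;

public class TypedListDemo {
    // Returns the first element in upper case. No cast or toString() is needed,
    // because the compiler knows every element of the list is a String.
    static String firstUpper(List<String> links) {
        String first = links.get(0);
        return first.toUpperCase();
    }

    public static void main(String[] args) {
        List<String> links = new ArrayList<>();
        links.add("http://example.com/a");
        links.add("http://example.com/b");
        // links.add(42);  // (a) would not compile: 42 is not a String

        // remove(Object) finds the element to remove via equals()
        links.remove("http://example.com/a");

        // (b) the retrieved element is already typed as String
        System.out.println(firstUpper(links)); // prints HTTP://EXAMPLE.COM/B
    }
}
```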
