
I've built a classifier after following an online tutorial. I take a bunch of tweets, use an HTML parser to get rid of unescaped syntax, remove everything shorter than 3 words, and build a dictionary from what's left. I then work out the frequency distribution of the words, and so on.
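In case it matters, the preprocessing looks roughly like this (a simplified sketch; the real code follows the tutorial, so the names here are just placeholders):

import html
import nltk

def preprocess(raw_tweets):
    # Undo HTML entities (e.g. &amp;) and lowercase each tweet
    cleaned = []
    for text in raw_tweets:
        words = html.unescape(text).lower().split()
        if len(words) >= 3:  # drop anything shorter than 3 words
            cleaned.append(words)
    # Frequency distribution over all remaining words
    all_words = nltk.FreqDist(w for words in cleaned for w in words)
    return cleaned, all_words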

My problem comes to creating my classifier, using the code:

import nltk

training_set = nltk.classify.apply_features(extract_features, tweets)
classifier = nltk.NaiveBayesClassifier.train(training_set)

I tested this on a small dataset of 50 tweets and it worked fine. The problem is that my full dataset is 1.5 million tweets. Training has been running for nearly an hour and hasn't finished. I also came across a post online from someone who said that after 10 hours their 100,000-tweet dataset was still being processed.

With a dataset this size, is using the built-in classifier unreasonable (and if not, how long should it take)? Is there a simple alternative? I was only able to build this thanks to an excellent tutorial, as my Python knowledge is very limited.
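In case it points anyone in the right direction, one alternative I've seen mentioned is wrapping one of scikit-learn's classifiers with NLTK's SklearnClassifier, which accepts the same training set format. A rough sketch (I haven't tried this myself, and it assumes scikit-learn is installed):

from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.naive_bayes import MultinomialNB

# Same (features, label) training set as before, but trained with
# scikit-learn's vectorised Naive Bayes instead of NLTK's pure-Python one
classifier = SklearnClassifier(MultinomialNB())
classifier.train(training_set)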