I have a txt file containing three articles, each delimited by the HTML-style tags <doc> </doc>.
I need to count the words in each article and get a result like this:

[the] -> [1, 20] -> [2, 34] -> [3, 12]
[author] -> [1, 7] -> [3, 2]

The code I'm using at the moment counts all the words in the whole txt file, so it's not giving me the output I want. Does anybody have suggestions for how I can create that output?

This is the code I have so far:

import re
import nltk
import numpy as np
import matplotlib.pyplot as plt
from operator import itemgetter
file=open('/Users/c1/Desktop/doc.txt')

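def unicount(file):
    '''Count every word in the whole file (not yet split per article).'''
    dic = {}

    for word in file.read().split():
        word = word.lower()
        if not tekens(word):
            # skip tokens that contain punctuation
            continue
        elif word in dic:
            dic[word] += 1
        else:
            dic[word] = 1
    print(dic)
    print(len(dic))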


def tekens(word):
    ''' Filtering out all punctuation marks'''
    regex = re.compile("^[A-Za-z0-9]+$")
    if regex.match(word):
        return True
    else:
        return False

unicount(file)

unicount counts ALL the words in the document, but what I want is to count the words within each <body>.
This code gives me the following output:

'fair': 1, 'po': 3, 'color': 1,

Edited 8 Months Ago by Cabba23

@woooee What is the best way to split the text on the <doc> tag?
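
One possible way to split on the tags and count per article is sketched below. It is only a sketch under some assumptions: the articles are wrapped in literal <doc> ... </doc> tags (change the pattern if your file really uses <body>), the articles are numbered from 1 in file order, and the word filter is the same regular expression as in tekens(). The names count_per_doc and format_entry and the sample text are made up for illustration; they are not from the original code.

```python
import re
from collections import Counter, defaultdict

# Captures the text between <doc ...> and </doc>; DOTALL lets "." span newlines.
DOC_RE = re.compile(r"<doc\b[^>]*>(.*?)</doc>", re.DOTALL | re.IGNORECASE)
# Same filter as tekens(): keep only purely alphanumeric tokens.
WORD_RE = re.compile(r"^[A-Za-z0-9]+$")


def count_per_doc(text):
    """Return {word: [(doc_number, count), ...]}, documents numbered from 1."""
    index = defaultdict(list)
    for doc_number, body in enumerate(DOC_RE.findall(text), start=1):
        words = (w.lower() for w in body.split())
        counts = Counter(w for w in words if WORD_RE.match(w))
        for word, count in counts.items():
            index[word].append((doc_number, count))
    return dict(index)


def format_entry(word, postings):
    """Render one entry as: [word] -> [doc, count] -> [doc, count]"""
    parts = ["[%s]" % word] + ["[%d, %d]" % p for p in postings]
    return " -> ".join(parts)


if __name__ == "__main__":
    # Hypothetical sample input, standing in for the contents of doc.txt.
    sample = (
        "<doc>The author saw the cat.</doc>\n"
        "<doc>The dog</doc>\n"
        "<doc>An author and the dog</doc>"
    )
    index = count_per_doc(sample)
    for word in sorted(index):
        print(format_entry(word, index[word]))
```

To run it on the real file, read the whole file into one string first, for example count_per_doc(open('/Users/c1/Desktop/doc.txt').read()). Note that, like the original tekens() filter, this drops any token with punctuation attached (such as "cat."), so stripping punctuation before the check may give better counts.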
