Create document vectors

Question

udev 0 Newbie Poster

13 Years Ago

Hi!
I want to read from a set of documents and put the information into a matrix[x][y], where x is the document and y is a boolean field denoting whether a particular word appears in document x or not. So each row would have y fields/dimensions, where y is the number of distinct words across the documents. Something like:

D1: "The cat in the hat disabled"
D2: "A cat is a fine pet ponies."
D3: "Dogs and cats make good pets"
D4: "I haven’t got a hat."

    good  pet   hat   make  dog   cat   poni  fine  disabl
D1 [+0.00 +0.00 +1.00 +0.00 +0.00 +1.00 +0.00 +0.00 +1.00 ]
D2 [+0.00 +1.00 +0.00 +0.00 +0.00 +1.00 +1.00 +1.00 +0.00 ]
D3 [+1.00 +1.00 +0.00 +1.00 +1.00 +1.00 +0.00 +0.00 +0.00 ]
D4 [+0.00 +0.00 +1.00 +0.00 +0.00 +0.00 +0.00 +0.00 +0.00 ]

The column terms differ somewhat from the original words because stemming is applied to the document contents and frequently occurring words are ignored.
All the documents are in the same directory and are named sequentially.

python


All 14 Replies

woooee 814 Nearly a Posting Maven

13 Years Ago

I would suggest a dictionary with the key pointing to a list, which would show up similar to your example

D_dict = {"D1":[0, 0, 1, 0, 0, 1, 0, 0, 0 ],
          "D2":[0, 1, 0, 0, 0, 1, 1, 1, 0 ] }

Come up with some code to read the file and split/compare the words, and post back with any problems.
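A minimal sketch of the "split/compare the words" step, assuming Python 3. It builds the vocabulary from in-memory strings and then marks each word's presence per document. The stop-word list and the names `tokenize` and `build_matrix` are made up for illustration, and no stemming is applied, so the columns will not match the stemmed terms in the question.

```python
import re

# Hypothetical stop-word list; the thread does not say which words are ignored.
STOP_WORDS = {"the", "a", "in", "is", "and", "i", "got"}

def tokenize(text):
    """Lowercase the text and return its words, minus stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    return [w for w in words if w not in STOP_WORDS]

def build_matrix(docs):
    """Return (vocabulary, matrix); matrix[x][y] is 1 if word y occurs in document x."""
    token_sets = [set(tokenize(d)) for d in docs]
    vocabulary = sorted(set().union(*token_sets))
    matrix = [[1 if word in tokens else 0 for word in vocabulary]
              for tokens in token_sets]
    return vocabulary, matrix

docs = [
    "The cat in the hat disabled",
    "A cat is a fine pet ponies.",
    "Dogs and cats make good pets",
    "I haven't got a hat.",
]
vocabulary, matrix = build_matrix(docs)
print(vocabulary)
for name, row in zip(("D1", "D2", "D3", "D4"), matrix):
    print(name, row)
```

The rows can also be stored in a dictionary keyed by "D1", "D2", and so on, as suggested above; a plain list of lists is used here only because the goal is a matrix.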


TrustyTony 888 pyMod

13 Years Ago

Maybe you could pick up some ideas from this word haiku code I wrote recently (it rearranges the words of a text into lines following a given word-count pattern):

from __future__ import print_function

bookfile = '11.txt'
pattern = (7, 5, 7)

def nwords(book, count):
    for dontcare in range(count):
        ## yield the word at the beginning (index 0) and remove it from the list, one by one
        if book:
            yield book.pop(0)
        else:
            break
    ## stop yielding once count words have been given

with open(bookfile) as thebook:
    # read text of book and split from white space
    bookaslist =  thebook.read().split()
    # until bookaslist is empty, which is considered False value
    while bookaslist:
        for count in pattern:
            # rejoin count words and center it for 60 column print
            print(' '.join(nwords(bookaslist, count)).center(60))
        # empty line between to clarify the form
        print()

TrustyTony 888 pyMod

13 Years Ago

And what does it say when you change the line to

print index, len(row), row

I have to say that your code is very strange: k iterates over the keys of the dictionary as if it were a list, and then you go on to use i[k][0].keys()[0][4]!


TrustyTony 888 pyMod

13 Years Ago

for i in grail:
    for k in i.keys():
        print float(i[k][0].keys())

I was using this to extract 21 from 'file21.txt'. It gives an error saying float argument must be a string or a number.

You can do

print [float(i[k][0].keys()[0][4:6]) for i in grail for k in i]



udev 0 Newbie Poster · Answer 1 · 2010-09-25T11:51:17+00:00

Actually, I have to feed this as input to another program, so I want it as a matrix. I'll try to code whatever part I can and post it here. Also, is there any way of reading different files from the same directory other than os.walk? I find it a bit tedious to use.
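On the os.walk question: when all files sit in one directory, the standard-library glob module is one simpler option. The sketch below assumes Python 3 and files named like file01.txt; `read_documents` is a made-up helper name, and the temporary directory exists only to make the example self-contained.

```python
import glob
import os
import tempfile

def read_documents(directory, pattern="*.txt"):
    """Return {filename: text} for files in `directory` matching `pattern`, in sorted name order."""
    documents = {}
    for path in sorted(glob.glob(os.path.join(directory, pattern))):
        with open(path, encoding="utf-8") as handle:
            documents[os.path.basename(path)] = handle.read()
    return documents

# Demo with a throwaway directory holding two small files.
with tempfile.TemporaryDirectory() as demo_dir:
    for number, text in [(1, "The cat in the hat"), (2, "A cat is a fine pet")]:
        path = os.path.join(demo_dir, "file%02d.txt" % number)
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(text)
    documents = read_documents(demo_dir)

print(sorted(documents))
```

Unlike os.walk, glob.glob with a pattern like this does not descend into subdirectories, which fits the "same directory" case; os.listdir would work too if no filename filtering is needed.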

udev 0 Newbie Poster · Answer 2 · 2010-10-06T10:31:36+00:00

Sorry for not replying all this time. Was real busy!
Tony, I couldn't get much from the code you provided; it didn't give me any leads for solving the problem mentioned before.
Guys! I need help and I need it fast (hate to sound demanding! :|)
I guess I will change the problem statement.
I have this data structure (a list of dicts):
{'year': ({'file21.txt': 1}, {'year': [1040]})}
{'year': ({'file22.txt': 2}, {'year': [1604, 1846]})}
{'year': ({'file26.txt': 1}, {'year': [110]})}
{'year-old': ({'file17.txt': 1}, {'year-old': [1344]})}
{'yearlong': ({'file01.txt': 1}, {'yearlong': [4681]})}
{'yet': ({'file01.txt': 2}, {'yet': [2055, 2403]})}
{'yet': ({'file11.txt': 1}, {'yet': [4409]})}
And I have to map it to a matrix such that the zeroth row belongs to the term year and the 21st element of this row is marked as one (because year appears in file21.txt). Likewise for all the terms. I need it in matrix form because I have to carry out some matrix operations on it.
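One way this mapping could be sketched, assuming Python 3, one row per distinct term, and 31 columns so that the number in the filename can be used directly as the column index. `file_number`, `term_matrix`, and `NUM_FILES` are illustrative names, not part of the original code.

```python
import re

# A few entries in the shape shown above.
grail = [
    {'year': ({'file21.txt': 1}, {'year': [1040]})},
    {'year': ({'file22.txt': 2}, {'year': [1604, 1846]})},
    {'yet': ({'file01.txt': 2}, {'yet': [2055, 2403]})},
    {'yet': ({'file11.txt': 1}, {'yet': [4409]})},
]

NUM_FILES = 31  # assumption: columns 0..30, so file21.txt maps to column 21

def file_number(filename):
    """Extract the first run of digits from a name such as 'file21.txt'."""
    return int(re.search(r"\d+", filename).group())

def term_matrix(entries, num_columns=NUM_FILES):
    """Return (terms, matrix) with one row per distinct term."""
    rows = {}
    for entry in entries:
        for term, (file_counts, _positions) in entry.items():
            # Reuse the row if this term was seen before, else start with zeros.
            row = rows.setdefault(term, [0] * num_columns)
            for filename in file_counts:
                row[file_number(filename)] = 1
    terms = sorted(rows)
    return terms, [rows[term] for term in terms]

terms, matrix = term_matrix(grail)
for term, row in zip(terms, matrix):
    print(term, row)
```

Collecting rows in a dictionary first means that the two separate 'year' entries end up in the same row instead of producing two rows for one term.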

udev 0 Newbie Poster · Answer 3 · 2010-10-06T10:58:52+00:00

for i in grail:
    for k in i.keys():
        print float(i[k][0].keys())

I was using this to extract 21 from 'file21.txt'. It gives an error saying float argument must be a string or a number.
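The error is consistent with float() receiving the whole collection of keys rather than a single filename. A small sketch of pulling the number out of one entry, assuming Python 3 and that the digits always sit at positions 4 and 5 of the filename, as in 'file21.txt':

```python
entry = {'year': ({'file21.txt': 1}, {'year': [1040]})}

for term in entry:
    file_counts = entry[term][0]        # {'file21.txt': 1}
    filename = next(iter(file_counts))  # the only key: 'file21.txt'
    number = int(filename[4:6])         # characters 4 and 5 hold the digits
    print(term, filename, number)
```

The slice is a shortcut that only works for this naming scheme; a different filename layout would need a different slice or a regular expression.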

udev 0 Newbie Poster · Answer 4 · 2010-10-06T11:38:28+00:00

That is done. Now

def matrix(grail):
    #for i in range(1,31):
    for i in grail:
        for k in i.keys():
            index = int(i[k][0].keys()[0][4]+i[k][0].keys()[0][5])
            print index
            row[int(index)] == 1

This gives list index out of range. :(
P.S. grail is the array of dictionaries in the format mentioned above

udev 0 Newbie Poster · Answer 5 · 2010-10-06T12:22:52+00:00

And what does it say when you change the line to
print index, len(row), row
I have to say that your code is very strange: k iterates over the keys of the dictionary as if it were a list, and then you go on to use i[k][0].keys()[0][4]!

Regarding the "strange" part, I can't help it. I have to get something done in two days and I am trying out anything and everything to that end! Regarding the thing you mentioned:

rows = [0]*31
def matrix(grail):
    #for i in range(1,31):
    for i in grail:
        for k in i.keys():
            index = int(i[k][0].keys()[0][4]+i[k][0].keys()[0][5])
            rows[index] == 1
            print index, len(rows), rows

This gives
.
.
23 31 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
.
.
as the output.
I initialized rows list(size 31) as a list of all zeroes.

udev 0 Newbie Poster · Answer 6 · 2010-10-06T12:35:04+00:00

This seems to work! Let me be sure of it though. Will post back again! Thanks tonyjv

def matrix(grail):
    #for i in range(1,31):
    for i in grail:
        for k in i.keys():
            rows = [0]*31
            index = int(i[k][0].keys()[0][4]+i[k][0].keys()[0][5])
            rows[index] = 1
            #print index, len(rows), rows
            print rows

TrustyTony 888 pyMod Team Colleague Featured Poster · Answer 7 · 2010-10-06T13:07:43+00:00

That does not look possible: you are setting rows[23] to 1, yet rows is all zeroes. Strange. But there is no error anymore, then. You are losing the previous values of rows on every loop iteration; you must save them somewhere. Maybe do the initialization in the outer loop instead?
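One likely explanation for the all-zero list: `rows[index] == 1` is a comparison, not an assignment, so it evaluates to True or False and the result is thrown away. A short sketch of the difference (written for Python 3; the variable names `unchanged` and `changed` are only for illustration):

```python
rows = [0] * 31
index = 23

rows[index] == 1   # comparison: the result is discarded, rows is untouched
unchanged = sum(rows)

rows[index] = 1    # assignment: stores 1 at position 23
changed = sum(rows)

print(unchanged, changed)
```

The version that "seems to work" uses the single `=`, which fits this reading.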

udev 0 Newbie Poster · Answer 8 · 2010-10-06T13:19:06+00:00

Hmm... I need a separate row for each term, so I think resetting rows is logical. To put these rows in a 2D matrix, can I use something like mat.append(rows)?

TrustyTony 888 pyMod Team Colleague Featured Poster · Answer 9 · 2010-10-06T13:26:32+00:00

Sounds right.
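A sketch of the mat.append(rows) idea, assuming Python 3 and that the column indices for each term have already been extracted. `build_rows` and `NUM_FILES` are illustrative names, and the index lists are made-up sample input.

```python
NUM_FILES = 31  # assumption: one column per file number 0..30

def build_rows(indices_per_term):
    """Return a list of rows, one per term, with 1 at each given column index."""
    mat = []
    for indices in indices_per_term:
        rows = [0] * NUM_FILES   # a fresh list for every term
        for index in indices:
            rows[index] = 1
        mat.append(rows)         # append stores a reference to this list
    return mat

mat = build_rows([[21, 22, 26], [1, 11]])
for row in mat:
    print(row)
```

Creating `rows` inside the loop matters here: append stores a reference, so reusing a single list for every term would make all rows of `mat` point at the same object.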

udev 0 Newbie Poster · Answer 10 · 2010-10-06T14:10:46+00:00

I'll wait for a while (in case I have any doubts) before marking this thread as solved. Thanks tonyjv! :) You came to my rescue as always! Really appreciate that!
