cy163 0 Newbie Poster

Hi ALL,

The project we're working on is a document retrieval system. Two kinds of Features for each document in a repository are extracted as described in table-1 and table-2.


Table1
DocID Word Freq
1 Book 9
1 Study 3
2 Work 4
2 Run 5
2 User 2
3 Search 6
3 Sweat
Table2
DocID Word1 Word2 Freq
1 Book Read 3
1 Study English 2
2 Work Hard 1
3 Sweat warm 1


Given an example document, which is described as
ExTable1
Word Freq
Book 8
Read 7

ExTable2
WordA WordB Freq
Book Read 2
Book Write 1

I need to calculate the similarities between the example document and each of the document in the repository.

For example, the similarity between Doc1 and the example document( based on ExTable1 and Table1) can be calculated as follows

Read: 9+8= 17
the similarity based on ExTable2 and Table3 can be calculated as follows

Book--Read: 3+2 = 5

Finally, calculating the final result: 17+5=22

In my program, I implement the above operations through two steps.

First create two views corresponding to the two-level matching. Then, based on the two views to calculate the final result.

First Step:

CREATE view view1 AS SELECT DocID, SUM(ExTable1.Freq + Table1.Freq) Score FROM ExTable1, Table1 WHERE ExTable1.Word = Table1.Word Group BY DocID order BY Score;

CREATE view view2 AS SELECT DocID, SUM(ExTable2.Freq + Table2.Freq) Score FROM ExTable2, Table2 WHERE ExTable2.WordA = Table2.WordA AND ExTable2.WordB = Table2.WordB  Group BY DocID order BY Score;

STEP 2

SELECT view1.DocID As DocID, view1.Score+view2.Score As Score FROM view1 LEFT JOIN view2 on view1.DocID=view2.DocID ORDER BY Score

However, this approach is problmetic when the system receives two or more user query requests simultaneously, since view1 and view2 are shared by all users. The contents of view1 and view2 would be rewritten when another user makes his request.

Now, I am not sure how to solve this problem. Someone suggested me to merge the above two statements for creating view1 and view2 into one statement. I am not sure whether this is feasible, and whether it is the right solution for this problem.

Please help.