string matching with controlled vocabulary

Hello,

I have the following question. I have a text file with data like this:

1 observational study
1.1 cohort study
1.1.1 retrospective cohort study
1.1.2 prospective cohort study
1.2 cross-sectional study


And another file with data like this:

cross-sectional survey 12345.txt
retrospective study 2345.txt
...


I want to do approximate string matching. More specifically, I want to read each line of the second file and find the most similar line in the first file. So for the first line of the second file, "cross-sectional survey" would be assigned to "1.2 cross-sectional study".

Is there any way of doing this? :-/

doomas10
Newbie Poster
21 posts since Jul 2010
Reputation Points: 10
Solved Threads: 0
 


Split each record with split(). In the first file, skip the first element of the resulting list (the number) and join the rest, i.e. " ".join(parts[1:]). Similarly, skip the last element (the file name) in the second file.
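A minimal sketch of that suggestion, assuming the fields are separated by whitespace and that the file name in the second file contains no spaces. The function names `vocabulary_term` and `query_term` are made up for illustration:

```python
def vocabulary_term(line):
    """Drop the leading number: '1.1 cohort study' -> 'cohort study'."""
    parts = line.split()
    return " ".join(parts[1:])


def query_term(line):
    """Drop the trailing file name: 'retrospective study 2345.txt' -> 'retrospective study'."""
    parts = line.split()
    return " ".join(parts[:-1])
```

This only strips the parts that should not take part in the comparison; the similarity measure itself still has to be chosen.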

woooee
Nearly a Posting Maven
2,454 posts since Dec 2006
Reputation Points: 777
Solved Threads: 714
 

You might find this old function of mine useful: Longest common subsequence
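The linked function is not reproduced in this thread, so the following is not that code. It is a generic sketch of the standard dynamic-programming way to compute the length of the longest common subsequence of two strings, with a hypothetical name `lcs_length`:

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of strings a and b."""
    # prev[j] holds the LCS length of the processed prefix of a and b[:j].
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                # Extend the subsequence that ended before both characters.
                curr.append(prev[j - 1] + 1)
            else:
                # Otherwise carry over the better of the two neighbours.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]
```

One way to use it for this problem would be to score every vocabulary term against a query and keep the highest score, probably normalised by string length so that long terms are not favoured.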

pyTony
pyMod
Moderator
5,359 posts since Apr 2010
Reputation Points: 782
Solved Threads: 852
 

I am not sure how this would work with the split command. I want to compare the strings in the second file with those in the first and, for each line of the second file, get the most similar line from the first.

For example "retrospective study" is similar to "retrospective cohort study" thus the comparison will return "retrospective cohort study" as the most similar string for the line of the second file "retrospective study" :/

I will try your function and will let you know how it goes. Is there any module that can do this?

doomas10

If you read the messages in that thread, you will see that there is a module, difflib, which has related functionality.
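As a rough illustration of what difflib offers here, the sketch below uses `difflib.get_close_matches` from the standard library on the sample data from the first post. The helper name `best_match` and the 0.6 cutoff (the function's default) are assumptions for illustration, not something prescribed in the thread:

```python
import difflib

vocabulary = [
    "1 observational study",
    "1.1 cohort study",
    "1.1.1 retrospective cohort study",
    "1.1.2 prospective cohort study",
    "1.2 cross-sectional study",
]

# Map each term without its number back to the full vocabulary line.
terms = {" ".join(line.split()[1:]): line for line in vocabulary}


def best_match(query, cutoff=0.6):
    """Return the vocabulary line most similar to query, or None."""
    matches = difflib.get_close_matches(query, list(terms), n=1, cutoff=cutoff)
    return terms[matches[0]] if matches else None


print(best_match("cross-sectional survey"))
print(best_match("retrospective study"))
```

The cutoff controls how dissimilar a match may be before it is rejected; it would need tuning against real data.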

pyTony
