Sep 23rd, 2005

How to parse patent data from html's for dummy

Hello!

I'm a total dummy when it comes to programming, and now I have a heck of a data extraction job to do for my graduation thesis.
Hopefully somebody can point me in the right direction here, because a whole day of surfing didn't lead me to the right application to do this monster job for me.

The thing goes like this: I have to parse patents (in html format on my hard drive) to get specific bits of data (a technology class), store this data in a database, then have an interface to display the data in a meaningful format.

The patents are assigned to about 300 firms and were filed from 1994 to the present. From this whole pile of HTML files I need to know which technology classes each firm had in each year, and how many times each class occurs. The patents all have the same layout, so the technology class is always preceded by the same term, which marks the right line. The same goes for the firm name and the year of filing.
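The tally described above (how often each technology class occurs per firm per year) could be sketched with Python's built-in sqlite3 module. This is only a rough illustration: the table name `patent_class`, its columns, and the sample rows are made up for the example and do not come from real patent files.

```python
import sqlite3

# Hypothetical rows, one per technology class found on a patent:
# (firm name, year of filing, technology class). Not real data.
rows = [
    ("Nokia", 1995, "H04B"),
    ("Nokia", 1995, "H04B"),
    ("Nokia", 1995, "H04M"),
    ("Nokia", 1996, "H04B"),
]

# An in-memory database keeps the sketch self-contained;
# pass a file name instead to keep the data on disk.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE patent_class (firm TEXT, year INTEGER, tech_class TEXT)"
)
conn.executemany("INSERT INTO patent_class VALUES (?, ?, ?)", rows)

# Count how often each class occurs for each firm in each year.
counts = conn.execute(
    "SELECT firm, year, tech_class, COUNT(*) "
    "FROM patent_class "
    "GROUP BY firm, year, tech_class "
    "ORDER BY firm, year, tech_class"
).fetchall()

for firm, year, tech_class, n in counts:
    print(firm, year, tech_class, n)
```

A database is not strictly needed for a job of this size, but it makes the "per firm, per year" grouping a single query.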

Anybody know if there's some little program in which I can set some rules to do this, without any programming?

I attached an example for Nokia in 1995. The technology class can be found after 'Intern'l Class'. The classes all follow the same pattern.

Hope the info is somewhat meaningful, as I study economic geography ;-)
FinnDutch
Sep 23rd, 2005

Re: How to parse patent data from html's for dummy

Sounds like you are a candidate for regular expression pattern matching and searching, which Python provides in the re module. More detail at:
http://www.amk.ca/python/howto/regex/
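As a rough sketch of what that could look like: the snippet below pulls the firm, the filing year, and the technology classes out of one page with re. Only the label 'Intern'l Class' comes from the original post. The sample HTML, the labels 'Assignee' and 'Filed', the semicolon separator, and the helper names `field_after` and `parse_patent` are assumptions made for illustration, so they would need adjusting to the real files.

```python
import re

# Hypothetical fragment of one patent page; the real layout may differ.
SAMPLE_HTML = """
<tr><td>Assignee:</td><td><b>Nokia Mobile Phones Ltd.</b></td></tr>
<tr><td>Filed:</td><td><b>March 14, 1995</b></td></tr>
<tr><td>Intern'l Class:</td><td>H04B 001/38; H04M 001/00</td></tr>
"""

TAG = re.compile(r"<[^>]+>")  # matches any HTML tag


def field_after(label, html):
    """Return the text following `label` on the same line, tags removed."""
    # '.' does not match a newline, so (.*) stops at the end of the line.
    match = re.search(re.escape(label) + r":?(.*)", html)
    if match is None:
        return None
    return TAG.sub(" ", match.group(1)).strip()


def parse_patent(html):
    """Extract firm, filing year and technology classes from one page."""
    firm = field_after("Assignee", html)      # assumed label
    filed = field_after("Filed", html) or ""  # assumed label
    classes = field_after("Intern'l Class", html) or ""
    year = re.search(r"\b(?:19|20)\d{2}\b", filed)
    return {
        "firm": firm,
        "year": int(year.group(0)) if year else None,
        # Assumes several classes are separated by semicolons.
        "classes": [c.strip() for c in classes.split(";") if c.strip()],
    }


record = parse_patent(SAMPLE_HTML)
print(record)
```

To process a whole folder, the same function could be called on the text of each file in turn, collecting one record per patent.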
vegaseat
Sep 24th, 2005

Re: How to parse patent data from html's for dummy

OK, thanks for the info! I'll do some studying...
FinnDutch

© 2011 DaniWeb® LLC