How to parse patent data from HTML files, for a dummy
Join Date: Sep 2005
Posts: 2
Hello!
I'm a total dummy when it comes to programming, and now I have a heck of a data extraction job to do for my graduation thesis :cry:
Hopefully somebody can point me in the right direction here, because a whole day of surfing didn't turn up the right application to do this monster job for me.
The thing goes like this: I have to parse patents (in HTML format on my hard drive) to get specific bits of data (a technology class), store this data in a database, then have an interface to display the data in a meaningful format.
The patents are assigned to about 300 firms in the years 1994 to present. I need to know, from this whole pile of HTML files, which technology classes each firm had in each year, and how many of each. The patents all have the same layout, so the technology class is always preceded by the same term, which identifies the right line. The same goes for firm name and year of filing.
Does anybody know if there's some little program in which I can set some rules to do this, without any programming?
I attached an example for Nokia in 1995. The technology class can be found after 'Intern'l Class'. These classes all follow the same pattern.
Hope the info is somewhat meaningful, as I study economic geography ;-)
Sounds like you are a candidate for "Regular Expression" pattern matching and searching, which Python provides in the module re. More detail at:
http://www.amk.ca/python/howto/regex/
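To give an idea of what that could look like, here is a rough sketch using re. The sample markup and the labels 'Assignee' and 'Filed' are guesses (only 'Intern'l Class' comes from the post above), and field_after and parse_patent are made-up helper names, so the patterns would need adjusting to the real files.

```python
import re
from html import unescape

# Hypothetical snippet; the layout of the real patent files may differ.
SAMPLE = """
<tr><td>Assignee:</td><td><b>Nokia Mobile Phones Ltd.</b> (Salo, FI)</td></tr>
<tr><td>Filed:</td><td><b>March 14, 1995</b></td></tr>
<tr><td>Intern'l Class:</td><td>H04B 001/38; H04M 001/00</td></tr>
"""

TAG = re.compile(r"<[^>]+>")


def field_after(label, page):
    """Return the rest of the line after `label`, with HTML tags stripped."""
    match = re.search(re.escape(label) + r":?(.*)", page)
    if match is None:
        return None
    # Replace tags with spaces, then collapse runs of whitespace.
    return " ".join(TAG.sub(" ", match.group(1)).split())


def parse_patent(page):
    """Pull firm, filing year and technology classes out of one patent page."""
    page = unescape(page)  # turn entities such as &#39; back into characters
    filed = field_after("Filed", page) or ""            # assumed label
    year = re.search(r"\b(?:19|20)\d{2}\b", filed)
    classes = field_after("Intern'l Class", page) or ""
    return {
        "firm": field_after("Assignee", page),          # assumed label
        "year": int(year.group(0)) if year else None,
        "classes": [c.strip() for c in classes.split(";") if c.strip()],
    }


print(parse_patent(SAMPLE))
```

This assumes each label and its value sit on the same line of the file, as in the sample. If the value is on the following line, the pattern has to allow for the line break.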
May 'the Google' be with you!
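For the database and counting part, Python's built-in sqlite3 module may be enough. The sketch below assumes each patent has already been reduced to a firm, a year and a list of classes; the records, the table name patent_class and the functions load and class_counts are invented for illustration.

```python
import sqlite3

# Hypothetical records, one dict per parsed patent.
records = [
    {"firm": "Nokia", "year": 1995, "classes": ["H04B 001/38", "H04M 001/00"]},
    {"firm": "Nokia", "year": 1995, "classes": ["H04B 001/38"]},
    {"firm": "Nokia", "year": 1996, "classes": ["H04Q 007/32"]},
]


def load(conn, records):
    """Store one row per (firm, year, technology class) occurrence."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS patent_class "
        "(firm TEXT, year INTEGER, tech_class TEXT)"
    )
    rows = [(r["firm"], r["year"], c) for r in records for c in r["classes"]]
    conn.executemany("INSERT INTO patent_class VALUES (?, ?, ?)", rows)
    conn.commit()


def class_counts(conn):
    """Count how often each class occurs per firm and year."""
    return conn.execute(
        "SELECT firm, year, tech_class, COUNT(*) FROM patent_class "
        "GROUP BY firm, year, tech_class "
        "ORDER BY firm, year, tech_class"
    ).fetchall()


conn = sqlite3.connect(":memory:")  # use a file name to keep the data on disk
load(conn, records)
for row in class_counts(conn):
    print(row)
```

The resulting table can also be opened in a spreadsheet or any SQLite browser, which might cover the "interface to display the data" part without more programming.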