Problem extracting a sequence from an HTML page

Question

nethero 0 Newbie Poster

12 Years Ago

Hi there,

I'm kind of new to Python and I'm trying to extract a protein sequence from this webpage...

http://www.ncbi.nlm.nih.gov/protein/BAH23558.1

When I use urllib.urlopen, the HTML it gets does not contain the sequence data. When I open this page in Firefox and use Firebug to look at the page, I can see the data. It looks like simply using Python to grab the HTML file won't work. I'm not sure why this happens, but if anyone could explain it to me I'd be very grateful. I suspect the data is loaded server-side and I need to tweak my Python code to include it somehow. I'm currently reading up on the DOM, but any pointers would be appreciated.

P.S. I know I could simply copy the data but I have to process over 1000 of these links (which I have saved in a text file) so I need to figure out how this works.

EDIT:
I'm sure it has something to do with the div id="viewercontent1". This div looks like this in the HTML pulled by urllib.urlopen()...

<div id="viewercontent1" class="seq gbff" val="224176120" SequenceSize="7562" VirtualSequence=""></div>

However, when I look at the page in Firefox using Firebug, the div looks like this...

<div style="display: block;" id="viewercontent1" class="seq gbff" val="224176120" sequencesize="7562" virtualsequence=""><div><div class="sequence"><a name="locus_224176120"></a><div class="hnav" id="hnav224176120_0"><div class="goto"><a aria-expanded="false" role="button" href="#goto224176120_0" class="tgt_dark jig-ncbipopper" config="openMethod : 'click', closeMethod : 'click', destPosition: 'bottom left', adjustFit: 'none', triggerPosition: 'bottom left'" id="gotopopper224176120_0">Go to:</a></div></div><div class="tabPopper nonstd_popper" style="display: none;" id="goto224176120_0"><ul class="locals"><li><a href="#feature_224176120" title="Jump to the feature table of this record">Features</a></li><li><a href="#sequence_224176120" title="Jump to the sequence of this record">Sequence</a></li></ul></div>
<pre class="genbank">LOCUS       BAH23558                 362 aa            linear   VRL 26-FEB-2009
DEFINITION  VP1 [BK polyomavirus].
ACCESSION   BAH23558
VERSION     BAH23558.1  GI:224176120
DBSOURCE    accession <a href="http://www.ncbi.nlm.nih.gov/nuccore/224176116">AB485712.1</a>
KEYWORDS    .
SOURCE      BK polyomavirus
  ORGANISM  <a href="http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=10629">BK polyomavirus</a>
            Viruses; dsDNA viruses, no RNA stage; Polyomaviridae; Polyomavirus.
REFERENCE   1
  AUTHORS   Sugimoto,C., Hara,K., Taguchi,F. and Yogo,Y.
  TITLE     Growth efficiency of naturally occurring BK virus variants in vivo
            and in vitro
  JOURNAL   J. Virol. 63 (7), 3195-3199 (1989)
   PUBMED   <a href="http://www.ncbi.nlm.nih.gov/pubmed/2542627">2542627</a>

REFERENCE   2  (residues 1 to 362)
  AUTHORS   Zhong,S. and Yogo,Y.
  TITLE     Direct Submission
  JOURNAL   Submitted (20-FEB-2009) Contact:Shan Zhong Graduate School of
            Medicine, The University of Tokyo, Department of Urology; Hongo
            7-3-1, Bunkyo-ku, Tokyo 113-8655, Japan
<a name="comment_224176120"></a><a name="feature_224176120"></a>FEATURES             Location/Qualifiers
     source          1..362
                     /organism="BK polyomavirus"
                     /isolate="MT clone 111"
                     /isolation_source="urine of a patient with systemic lupus
                     erythematosus"
                     /db_xref="taxon:<a href="http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=10629">10629</a>"
                     /country="Japan"
                     /note="complete genome;
                     vector: pAT153"
     <a href="http://www.ncbi.nlm.nih.gov/protein/224176120?from=1&amp;to=362&amp;report=gpwithparts">Protein</a>         1..362
                     /product="VP1"
     <a href="http://www.ncbi.nlm.nih.gov/protein/224176120?from=2&amp;to=362&amp;report=gpwithparts">Region</a>          2..362
                     /region_name="PHA02614"
                     /note="Major capsid protein VP1; Provisional"
                     /db_xref="CDD:<a href="http://www.ncbi.nlm.nih.gov/Structure/cdd/cddsrv.cgi?uid=177437">177437</a>"
     <a href="http://www.ncbi.nlm.nih.gov/nuccore/224176116?from=1587&amp;to=2675&amp;report=gbwithparts">CDS</a>             1..362
                     /coded_by="AB485712.1:1587..2675"
ORIGIN      
<a name="sequence_224176120"></a>        1 maptkrkgec pgaapkkpkd pvqvpkllik ggvevlevkt gvdaitevec flnpemgdpd
       61 enlrgfslkl saendfssds perkmlpcys tariplpnln edltcgnllm weavtvqtev
      121 igitsmlnlh agsqkvhehg ggkpiqgsnf hffavggdpl emqgvlmnyr tkypegtitp
      181 knptaqsqvm ntdhkayldk nnaypvecwi pdpsrnentr yfgtltggen vppvlhvtnt
      241 attvlldeqg vgplckadsl yvsaadicgl ftnssgtqqw rglaryfkir lrkrsvknpy
      301 pisfllsdli nrrtqrvdgq pmygmesqve evrvfdgtek lpgdpdmiry idkqgqlqtk
      361 ml
//</pre>

<a name="slash_224176120"></a></div>
</div></div>

python


All 3 Replies

predator78 22 Junior Poster

12 Years Ago

It's hard to tell what the problem is without seeing the exact page and the code you're using. If there is an issue with providing that information, I would suggest repeating what you seem to have done already: navigate the site manually first and take careful note of how you reach the page you are requesting. Is it some sort of popup? Is it possible you need to keep track of cookies? And so on.



TrustyTony 888 pyMod Team Colleague Featured Poster · Answer 1 · 2011-05-25T02:50:50+00:00

Follow the access instructions at http://eutils.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html, or use FTP access.

http://www.ncbi.nlm.nih.gov/guide/data-software/#downloads_
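
As a rough illustration of the E-utilities route suggested here, the sketch below asks the efetch endpoint for one protein record in FASTA format and strips the header line. It is written for Python 3 (urllib.request rather than the urllib/urllib2 of the original posts). The function names are hypothetical, and the https endpoint and parameter values are assumptions based on the public E-utilities documentation, so check them against the current help page before relying on them.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed efetch endpoint; verify against the current E-utilities help page.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(accession, rettype="fasta"):
    """Return an efetch URL asking for one protein record as plain text."""
    params = {
        "db": "protein",
        "id": accession,
        "rettype": rettype,
        "retmode": "text",
    }
    return EFETCH + "?" + urlencode(params)


def fasta_to_sequence(fasta_text):
    """Drop the '>' header line(s) and join the remaining lines."""
    lines = [line.strip() for line in fasta_text.splitlines()]
    return "".join(line for line in lines if line and not line.startswith(">"))


def fetch_sequence(accession):
    """Network call: download the FASTA record and return the bare sequence."""
    with urlopen(build_efetch_url(accession)) as response:
        return fasta_to_sequence(response.read().decode("utf-8"))
```

Calling fetch_sequence("BAH23558.1") should then return the residues for the record linked in the question, provided the service responds as documented.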

nethero 0 Newbie Poster · Answer 2 · 2011-05-25T17:58:25+00:00

Thanks for the help, guys. I did not figure out how to fix my problem; however, I did find a different way of doing it. I found the link to the HTTP API of NCBI (thanks tonyjv for the suggestion). I wrote some Python code with a couple of urllib2.urlopen calls to retrieve the data I needed.
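
The poster's own code is not shown. For the "over 1000 links in a text file" part of the task, a minimal sketch might pull the accession out of each link and group the accessions into batches, since efetch accepts a comma-separated list of IDs. The function names, the batch size of 100, and the assumption that every link has the form .../protein/<accession> are illustrative, not taken from the thread.

```python
from urllib.parse import urlparse


def accession_from_link(link):
    """Take the last path component of a .../protein/<accession> URL."""
    path = urlparse(link.strip()).path
    return path.rstrip("/").rsplit("/", 1)[-1]


def read_accessions(lines):
    """Return one accession per non-blank line of the links file."""
    return [accession_from_link(line) for line in lines if line.strip()]


def batches(items, size=100):
    """Split items into lists of at most `size` elements (size is an assumption)."""
    return [items[i:i + size] for i in range(0, len(items), size)]
```

With a hypothetical links.txt, something like `batches(read_accessions(open("links.txt")))` yields lists that can each be joined with commas and passed as the id parameter of one request. NCBI publishes usage guidelines on request rates, so a pause between batches is sensible.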

Still, I'd like to figure out how to get the information directly from the webpage in case I'm ever in a situation where there is no HTTP API and the webpage is the only source. However, that problem is probably best solved in a non-Python forum. Perhaps I should repost this question elsewhere.
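
For the case where the fully rendered HTML is available (for example, the markup Firebug shows in the question, saved to a file), the sequence can be read out of the GenBank text between the ORIGIN line and the closing //. The sketch below is only that parsing step, with a hypothetical function name. It does not solve the fetching problem, because the HTML returned by urllib.urlopen has an empty viewercontent1 div, and it assumes ORIGIN appears only once in the text, as the section keyword.

```python
import re
from html import unescape


def sequence_from_genbank_html(html_text):
    """Return the residues between ORIGIN and // as one lowercase string.

    Returns None when no ORIGIN ... // block is present, which is what
    happens with the unrendered page whose viewer div is still empty.
    """
    match = re.search(r"ORIGIN(.*?)^//", html_text, re.DOTALL | re.MULTILINE)
    if match is None:
        return None
    # Drop embedded tags such as <a name="sequence_..."></a> before reading letters.
    body = unescape(re.sub(r"<[^>]+>", "", match.group(1)))
    # Keep only the letter groups; position numbers and spaces are skipped.
    return "".join(re.findall(r"[a-z]+", body))
```

Run against the rendered markup quoted in the question, this should give the 362-residue string that starts with maptkrkgec and ends with ml.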
