Hi there,

I'm kind of new to Python and I'm trying to extract a protein sequence from this webpage...


When I use urllib.urlopen, the HTML it returns does not contain the sequence data. When I open the page in Firefox and inspect it with Firebug, I can see the data. It looks like simply using Python to grab the HTML file won't work. I'm not sure why this happens, but if anyone could explain it to me I'd be very grateful. I suspect the data is loaded server-side and I need to tweak my Python code to include it somehow. I'm currently reading up on the DOM, but any pointers would be appreciated.

P.S. I know I could simply copy the data, but I have to process over 1000 of these links (which I have saved in a text file), so I need to figure out how this works.

I'm sure it has something to do with the div with id="viewercontent1". In the HTML pulled by urllib.urlopen(), this div looks like this...

<div id="viewercontent1" class="seq gbff" val="224176120" SequenceSize="7562" VirtualSequence=""></div>

However, when I look at the page in Firefox using Firebug, the div looks like this...

<div style="display: block;" id="viewercontent1" class="seq gbff" val="224176120" sequencesize="7562" virtualsequence=""><div><div class="sequence"><a name="locus_224176120"></a><div class="hnav" id="hnav224176120_0"><div class="goto"><a aria-expanded="false" role="button" href="#goto224176120_0" class="tgt_dark jig-ncbipopper" config="openMethod : 'click', closeMethod : 'click', destPosition: 'bottom left', adjustFit: 'none', triggerPosition: 'bottom left'" id="gotopopper224176120_0">Go to:</a></div></div><div class="tabPopper nonstd_popper" style="display: none;" id="goto224176120_0"><ul class="locals"><li><a href="#feature_224176120" title="Jump to the feature table of this record">Features</a></li><li><a href="#sequence_224176120" title="Jump to the sequence of this record">Sequence</a></li></ul></div>
<pre class="genbank">LOCUS       BAH23558                 362 aa            linear   VRL 26-FEB-2009
DEFINITION  VP1 [BK polyomavirus].
VERSION     BAH23558.1  GI:224176120
DBSOURCE    accession <a href="http://www.ncbi.nlm.nih.gov/nuccore/224176116">AB485712.1</a>
SOURCE      BK polyomavirus
  ORGANISM  <a href="http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=10629">BK polyomavirus</a>
            Viruses; dsDNA viruses, no RNA stage; Polyomaviridae; Polyomavirus.
  AUTHORS   Sugimoto,C., Hara,K., Taguchi,F. and Yogo,Y.
  TITLE     Growth efficiency of naturally occurring BK virus variants in vivo
            and in vitro
  JOURNAL   J. Virol. 63 (7), 3195-3199 (1989)
   PUBMED   <a href="http://www.ncbi.nlm.nih.gov/pubmed/2542627">2542627</a>

REFERENCE   2  (residues 1 to 362)
  AUTHORS   Zhong,S. and Yogo,Y.
  TITLE     Direct Submission
  JOURNAL   Submitted (20-FEB-2009) Contact:Shan Zhong Graduate School of
            Medicine, The University of Tokyo, Department of Urology; Hongo
            7-3-1, Bunkyo-ku, Tokyo 113-8655, Japan
<a name="comment_224176120"></a><a name="feature_224176120"></a>FEATURES             Location/Qualifiers
     source          1..362
                     /organism="BK polyomavirus"
                     /isolate="MT clone 111"
                     /isolation_source="urine of a patient with systemic lupus
                     /db_xref="taxon:<a href="http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=10629">10629</a>"
                     /note="complete genome;
                     vector: pAT153"
     <a href="http://www.ncbi.nlm.nih.gov/protein/224176120?from=1&amp;to=362&amp;report=gpwithparts">Protein</a>         1..362
     <a href="http://www.ncbi.nlm.nih.gov/protein/224176120?from=2&amp;to=362&amp;report=gpwithparts">Region</a>          2..362
                     /note="Major capsid protein VP1; Provisional"
                     /db_xref="CDD:<a href="http://www.ncbi.nlm.nih.gov/Structure/cdd/cddsrv.cgi?uid=177437">177437</a>"
     <a href="http://www.ncbi.nlm.nih.gov/nuccore/224176116?from=1587&amp;to=2675&amp;report=gbwithparts">CDS</a>             1..362
<a name="sequence_224176120"></a>        1 maptkrkgec pgaapkkpkd pvqvpkllik ggvevlevkt gvdaitevec flnpemgdpd
       61 enlrgfslkl saendfssds perkmlpcys tariplpnln edltcgnllm weavtvqtev
      121 igitsmlnlh agsqkvhehg ggkpiqgsnf hffavggdpl emqgvlmnyr tkypegtitp
      181 knptaqsqvm ntdhkayldk nnaypvecwi pdpsrnentr yfgtltggen vppvlhvtnt
      241 attvlldeqg vgplckadsl yvsaadicgl ftnssgtqqw rglaryfkir lrkrsvknpy
      301 pisfllsdli nrrtqrvdgq pmygmesqve evrvfdgtek lpgdpdmiry idkqgqlqtk
      361 ml

<a name="slash_224176120"></a></div>

It's hard to tell what the problem is without seeing the exact page and the code you're using. If you can't provide that information, I would suggest repeating what you seem to have done already: navigate the site manually first and take careful note of how you reach the page you are requesting. Is it some sort of popup? Is it possible you need to keep track of cookies? And so on.

Thanks for the help, guys. I did not figure out how to fix my problem; however, I did find a different way of doing it. I found the link to NCBI's HTTP API (thanks tonyjv for the suggestion). I wrote some Python code with a couple of urllib2.urlopen calls to retrieve the data I needed.
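
For anyone who lands on this thread later, here is a minimal sketch of that kind of approach. It is not the exact code used above. It assumes the NCBI E-utilities efetch endpoint with db=protein and rettype=fasta, and it is written for Python 3, where urllib.request takes the place of urllib2. The function names are made up for illustration.

```python
import urllib.parse
import urllib.request

# Assumed endpoint: NCBI E-utilities "efetch". Check the current NCBI
# documentation before relying on this URL or these parameters.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(record_id):
    """Build an efetch URL asking for one protein record as FASTA text."""
    params = {
        "db": "protein",
        "id": str(record_id),
        "rettype": "fasta",
        "retmode": "text",
    }
    return EFETCH_URL + "?" + urllib.parse.urlencode(params)


def parse_fasta(text):
    """Return (header, sequence) for the first record in FASTA text."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("text does not look like FASTA")
    header = lines[0][1:]
    seq_lines = []
    for line in lines[1:]:
        if line.startswith(">"):  # start of a second record; stop here
            break
        seq_lines.append(line)
    return header, "".join(seq_lines)


def fetch_protein(record_id):
    """Download one record (makes a network request) and parse it."""
    with urllib.request.urlopen(build_efetch_url(record_id)) as response:
        text = response.read().decode("utf-8")
    return parse_fasta(text)
```

Calling fetch_protein("224176120") would request the record shown above. For a long list of IDs, NCBI publishes usage limits for E-utilities, so it is worth checking them and pausing between requests.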

Still, I'd like to figure out how to get the information directly from the webpage in case I'm ever in a situation where there is no HTTP API and the webpage is the only source. However, that problem is probably best solved in a non-Python forum. Perhaps I should repost this question elsewhere.
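
One part of this is plain text processing, whatever the source of the markup. If the fully rendered div is available (for example, copied out of Firebug as above), the sequence can be pulled out of the genbank block with a regular expression. The sketch below is Python 3 and assumes the sequence lines look like the ones quoted earlier: a position number followed by groups of up to ten lowercase letters. extract_sequence is a hypothetical helper name, and it does not solve the problem of getting the rendered markup in the first place.

```python
import re

# Strips HTML tags such as <a name="sequence_..."></a> from a line.
TAG = re.compile(r"<[^>]+>")

# A sequence line: a position number, then groups of 1-10 lowercase letters.
SEQ_LINE = re.compile(r"^\s*\d+((?:\s+[a-z]{1,10})+)\s*$")


def extract_sequence(block):
    """Join the residues from every sequence line in a rendered block."""
    parts = []
    for line in block.splitlines():
        match = SEQ_LINE.match(TAG.sub("", line))
        if match:
            # Drop the spaces between the ten-letter groups.
            parts.append("".join(match.group(1).split()))
    return "".join(parts).upper()
```

Header lines such as DEFINITION or AUTHORS do not start with a number, so they are skipped. A stricter version could start collecting only after the sequence anchor.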