catching dynamically generated content

Question

gunbuster363 0 Junior Poster

14 Years Ago

I am crawling a website, and I found that the source code of a page differs from the content shown on the page.
Evidently the content is generated by some script.
Does anyone know how to simulate that action and get the content?

For example, someone's post says "Today I go to the beach and blah blah blah...",
but the source code of that page does not contain "Today I go to the beach and blah blah blah...",
and I want to download "Today I go to the beach and blah blah blah..." to my computer using Python.

Does anyone know how to do this?

Thanks All!

python


All 3 Replies

Beat_Slayer 17 Posting Pro in Training

14 Years Ago

Something is wrong, since urllib2 should give you the same source code for the page.


gunbuster363 0 Junior Poster · Answer 1 · 2010-07-15T14:04:28+00:00

Sorry, I made some mistakes.
This is the case:
For example, someone's post says "Today I go to the beach and blah blah blah..."

Then I used urllib2 to fetch the webpage and viewed the result in a text editor. It does not contain "Today I go to the beach and blah blah blah...".

However, if I use the browser's "view source code" option, "Today I go to the beach and blah blah blah..." is present.

gunbuster363 0 Junior Poster · Answer 2 · 2010-07-15T21:18:04+00:00

Something is wrong, since urllib2 should give you the same source code for the page.

For example, this is a webpage:
http://www.tripadvisor.com/ShowUserReviews-g294217-d301413-r19916386-Mandarin_Oriental_Hong_Kong-Hong_Kong_Hong_Kong_Region.html#CHECK_RATES_CONT

It contains a block of Japanese text.

The source code shown by the browser contains the block of HTML and JavaScript below. I am quite sure that the page executes the JavaScript and loads the content from somewhere else, because I have studied this website for several months and its reviews always start with a <p> tag whose "id" attribute has the form "reviewXXXXXX":

<p id="review_text_19916386"></p> 
<script type="text/javascript"> 
showReviewWithAnswers(19916386, '16');
</script> 
</div><!--/ entry--> 
</div><!--/ summary--> 
<div class="note"> 
This review is the subjective opinion of an individual traveler and not of TripAdvisor LLC nor of its partners. </div>

In other parts of the website, the reviews are stored in the <p> tag I mentioned above. If the content is in English, the text is simply stored in the HTML, as on this webpage:
http://www.tripadvisor.com/ShowUserReviews-g294217-d301413-r69881993-Mandarin_Oriental_Hong_Kong-Hong_Kong_Hong_Kong_Region.html#REVIEWS

which has this code:

<p id="review_68439781">We just got back from Hong Kong today. It was a three night, four day stay in the Mandarin Oriental.<br/>Unlike the Four Seasons or Island Shangri-la, the Mandarin Oriental is not built on a huge mall. This is a slight disadvantage, but I think the fact that it is right on central and the subway entrance is across the street from the back entrance of the hotel, makes up for the lack of a mall downstairs.<br/>The service was excellent, discreet but efficient. The rooms are very modern with a TV in the spinning bathroom mirror, and Internet on the TV in the main bedroom.<br/>They have a pillow menu available, and very good Hermes toiletries (I don't bother to bring my shampoo and conditioner along on short trips).<br/>The Chinese restaurant is very good, and the server there was very helpful when we got home pretty late one night and decided we were craving for almond milk and black sesame cream.<br/>Our best experience was dinner in the Grill. My aunt described it best by saying the service reminded her of the days when they would get all dressed up for dinner. The waiters were very attentive, seeing to every detail and the food was perfect. I am still dreaming of the pre-appetizers they brought us, after the bread and olive oil.</p>
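The difference between the two pages can be checked mechanically. The following is only a minimal sketch, assuming the review paragraphs always have an "id" that starts with "review" as described above. The class name ReviewTextParser, the helper extract_reviews, and the shortened sample string are made up for illustration. It collects the text of each such <p> and lists the ones that are empty placeholders.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser   # Python 2, as used in this thread


class ReviewTextParser(HTMLParser):
    """Collect the text inside every <p> whose id starts with "review"."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.reviews = {}      # maps the <p> id to its text
        self._current = None   # id of the <p> currently being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            pid = dict(attrs).get("id") or ""
            if pid.startswith("review"):
                self._current = pid
                self.reviews[pid] = ""
        elif tag == "br" and self._current is not None:
            # Treat <br/> inside a review as a line break.
            self.reviews[self._current] += "\n"

    def handle_endtag(self, tag):
        if tag == "p":
            self._current = None

    def handle_data(self, data):
        if self._current is not None:
            self.reviews[self._current] += data


def extract_reviews(html):
    """Return a dict of review paragraph id -> text found in html."""
    parser = ReviewTextParser()
    parser.feed(html)
    parser.close()
    return parser.reviews


# Shortened sample built from the two snippets quoted in this post.
sample = ('<p id="review_text_19916386"></p>'
          '<p id="review_68439781">We just got back.<br/>'
          'The service was excellent.</p>')

reviews = extract_reviews(sample)
# Paragraphs with no text are the ones the page fills in with JavaScript.
empty = [pid for pid, text in reviews.items() if not text.strip()]
print(empty)
```

An empty entry only shows that the text is not in the HTML that was parsed. It does not say where the script gets the text from.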

To be clear, I believe this website stores non-English content in some database and loads it with JavaScript when necessary.
What I want to do is grab this non-English content.

To get started, I ran a test with urllib2.

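import urllib2  # Python 2 standard library
# Fetch the HTML exactly as the server sends it; urllib2 does not run any JavaScript.
page = urllib2.urlopen("http://www.tripadvisor.com/ShowUserReviews-g294217-d301413-r19916386-Mandarin_Oriental_Hong_Kong-Hong_Kong_Hong_Kong_Region.html#CHECK_RATES_CONT")
print page.read()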

and the result I got was:
<p id="review_text_19916386"></p>
<script type="text/javascript">
showReviewWithAnswers(19916386, '16');
</script>

Please help me out!
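A possible first step, as a rough sketch only: since urllib2 does not execute JavaScript, the fetched HTML can at least be searched for the showReviewWithAnswers(...) calls to find out which reviews are loaded later. The names CALL_RE and find_deferred_reviews are made up for illustration. The meaning of the second argument and the address the script actually requests do not appear anywhere in this thread, so they would have to be looked up separately, for example by watching the browser's network requests while the page loads.

```python
import re

# Matches calls such as: showReviewWithAnswers(19916386, '16');
# The first group is the review id; the meaning of the second is unknown.
CALL_RE = re.compile(
    r"showReviewWithAnswers\(\s*(\d+)\s*,\s*'([^']*)'\s*\)"
)


def find_deferred_reviews(html):
    """Return a list of (review_id, second_argument) string pairs."""
    return CALL_RE.findall(html)


# The output quoted above, used here instead of a live request.
sample = """<p id="review_text_19916386"></p>
<script type="text/javascript">
showReviewWithAnswers(19916386, '16');
</script>"""

calls = find_deferred_reviews(sample)
print(calls)
```

With the review ids in hand, the remaining work is to reproduce whatever request the script makes, or to drive a real browser that runs the script and then read the filled-in page.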
