I am crawling a website and found that the page source differs from the content shown in the browser.
Evidently the content is generated by a script.
Does anyone know how to simulate that action and retrieve the content?


For example, someone posts "Today I go to the beach and blah blah blah...",
but the source code of that page does not contain "Today I go to the beach and blah blah blah...",
and I want to download "Today I go to the beach and blah blah blah..." to my computer using Python.


Does anyone know how to do this?

Thanks All!


Sorry, I made some mistakes.
This is the actual case:
For example, someone posts "Today I go to the beach and blah blah blah..."

Then I used urllib2 to fetch the page and viewed the result in a text editor; it does not contain "Today I go to the beach and blah blah blah...".

However, if I use the browser's "view source code" option, "Today I go to the beach and blah blah blah..." is present.

Something is wrong, since urllib2 should give you the same source code for the page.
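
One way to check this claim is to save the browser's "view source" output and the urllib2 output as text and diff them. The following is only a minimal sketch: source_diff is a hypothetical helper name, and the two sample strings are stand-ins for the saved copies, not real downloads.

```python
import difflib

def source_diff(browser_html, fetched_html):
    """Return unified-diff lines between two saved copies of a page."""
    return list(difflib.unified_diff(
        browser_html.splitlines(),
        fetched_html.splitlines(),
        fromfile="browser",
        tofile="urllib2",
        lineterm="",
    ))

# Stand-in strings (assumption); in practice, read both from saved files.
browser_copy = '<p id="review_text_19916386"></p>\n<div class="note">'
fetched_copy = '<p id="review_text_19916386"></p>\n<div class="note">'

diff_lines = source_diff(browser_copy, fetched_copy)
print(len(diff_lines))  # 0 means the two copies are identical
```

If the diff is empty, the two sources really are the same, and the missing text must be added after the page loads rather than sent in the HTML.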

For example, this is one such page:
http://www.tripadvisor.com/ShowUserReviews-g294217-d301413-r19916386-Mandarin_Oriental_Hong_Kong-Hong_Kong_Hong_Kong_Region.html#CHECK_RATES_CONT

It contains a block of Japanese text.

The source shown by the browser contains the block of HTML and JavaScript below. I am quite sure the page executes the JavaScript and loads the content from somewhere else, because I have studied this website for several months: a review always starts with a <p> tag whose "id" attribute is of the form "reviewXXXXXX":

<p id="review_text_19916386"></p> 
<script type="text/javascript"> 
showReviewWithAnswers(19916386, '16');
</script> 
</div><!--/ entry--> 
</div><!--/ summary--> 
<div class="note"> 
This review is the subjective opinion of an individual traveler and not of TripAdvisor LLC nor of its partners. </div>
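
As a first step toward following that script, the review id and the second argument can be pulled out of the inline call with a regular expression. This is a minimal sketch under assumptions: find_review_calls and CALL_RE are hypothetical names, the pattern only covers the call format shown above, and the meaning of the second argument ('16') is not known from the page. The URL that showReviewWithAnswers actually requests is not in the source above; it would have to be found in the site's JavaScript files or in the browser's network monitor.

```python
import re

# Matches the inline call seen in the page source, e.g.
# showReviewWithAnswers(19916386, '16');
CALL_RE = re.compile(r"showReviewWithAnswers\(\s*(\d+)\s*,\s*'([^']*)'\s*\)")

def find_review_calls(html):
    """Return (review_id, second_argument) pairs for every call in html."""
    return CALL_RE.findall(html)

# Sample copied from the page source shown above.
sample = '''<p id="review_text_19916386"></p>
<script type="text/javascript">
showReviewWithAnswers(19916386, '16');
</script>'''

calls = find_review_calls(sample)
print(calls)  # [('19916386', '16')]
```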

Elsewhere on the website, the reviews are stored in the <p> tag I mentioned above. If the content is in English, however, the text is placed directly in the HTML, as on this page:
http://www.tripadvisor.com/ShowUserReviews-g294217-d301413-r69881993-Mandarin_Oriental_Hong_Kong-Hong_Kong_Hong_Kong_Region.html#REVIEWS

which contains this code:

<p id="review_68439781">We just got back from Hong Kong today. It was a three night, four day stay in the Mandarin Oriental.<br/>Unlike the Four Seasons or Island Shangri-la, the Mandarin Oriental is not built on a huge mall. This is a slight disadvantage, but I think the fact that it is right on central and the subway entrance is across the street from the back entrance of the hotel, makes up for the lack of a mall downstairs.<br/>The service was excellent, discreet but efficient. The rooms are very modern with a TV in the spinning bathroom mirror, and Internet on the TV in the main bedroom.<br/>They have a pillow menu available, and very good Hermes toiletries (I don't bother to bring my shampoo and conditioner along on short trips).<br/>The Chinese restaurant is very good, and the server there was very helpful when we got home pretty late one night and decided we were craving for almond milk and black sesame cream.<br/>Our best experience was dinner in the Grill. My aunt described it best by saying the service reminded her of the days when they would get all dressed up for dinner. The waiters were very attentive, seeing to every detail and the food was perfect. I am still dreaming of the pre-appetizers they brought us, after the bread and olive oil.</p>

To be clear, I believe this website stores non-English content in a database and loads it with JavaScript when needed.
What I want to do is grab this non-English content.
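
To tell the two cases apart in a downloaded page, the review <p> tags can be sorted into those that already contain text and those left empty for the script to fill. This is only a sketch, written for Python 3 (in Python 2 the module is called HTMLParser). ReviewTextParser and split_reviews are hypothetical names, and the assumption that every review id starts with "review" comes from the two samples above.

```python
from html.parser import HTMLParser

class ReviewTextParser(HTMLParser):
    """Collect the text inside <p> tags whose id starts with "review"."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.reviews = {}      # id attribute -> text found inside the tag
        self._current = None   # id of the review <p> being read, if any

    def handle_starttag(self, tag, attrs):
        pid = dict(attrs).get("id") or ""
        if tag == "p" and pid.startswith("review"):
            self._current = pid
            self.reviews[pid] = ""

    def handle_endtag(self, tag):
        if tag == "p":
            self._current = None

    def handle_data(self, data):
        if self._current is not None:
            self.reviews[self._current] += data

def split_reviews(html):
    """Return (inline, empty): reviews with text, and ids left empty."""
    parser = ReviewTextParser()
    parser.feed(html)
    parser.close()
    inline = {k: v for k, v in parser.reviews.items() if v.strip()}
    empty = [k for k, v in parser.reviews.items() if not v.strip()]
    return inline, empty

# Shortened stand-ins for the two samples quoted above.
sample = ('<p id="review_text_19916386"></p>'
          '<p id="review_68439781">We just got back from Hong Kong today.'
          '<br/>More text.</p>')
inline, empty = split_reviews(sample)
print(empty)  # ids whose text still has to be fetched some other way
```

The ids in empty are the reviews whose text is not in the HTML, so those are the ones that need the script's data source.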


Well, to get started, I ran a test with urllib2.

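import urllib2  # Python 2 standard library
# Fetch the raw HTML; the JavaScript in the page is not executed
page = urllib2.urlopen("http://www.tripadvisor.com/ShowUserReviews-g294217-d301413-r19916386-Mandarin_Oriental_Hong_Kong-Hong_Kong_Hong_Kong_Region.html#CHECK_RATES_CONT")
print page.read()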

and the result I got contains:
<p id="review_text_19916386"></p>
<script type="text/javascript">
showReviewWithAnswers(19916386, '16');
</script>

Please help me out!
