You are currently viewing page 2 of this multi-page discussion thread.
Jul 4th, 2009

Re: ::Making Daniweb FAQ Bot::

Quote originally posted by cwarn23:
That's good advice, Shawn, and I will take note of it.
The only thing I would disagree with is the unset(). I have noticed in the past that when reassigning a string to a variable in a loop a few million times, the server runs out of memory, whereas if unset() is used the memory problem does not occur.
If you're using the most recent version of PHP, that shouldn't happen. If it does happen on the most recent version, you might want to submit a bug report; that kind of problem is called a memory leak.
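
One way to check this on a given PHP version is to measure memory before and after such a loop. The sketch below is only an illustration, not a benchmark: the function name `memoryGrowthAfterLoop`, the string size, and the iteration count are all assumptions chosen for the example.

```php
<?php
// Hypothetical helper: reassign a string to the same variable many times
// and report how much memory usage grew once the loop is over.
function memoryGrowthAfterLoop(int $iterations, bool $useUnset): int
{
    $before = memory_get_usage();
    for ($i = 0; $i < $iterations; $i++) {
        $s = str_repeat('x', 1024) . $i; // a fresh string of about 1 KB each pass
        if ($useUnset) {
            unset($s); // explicitly release the value
        }
    }
    unset($s); // drop the last value so only memory that was never released remains
    return memory_get_usage() - $before;
}

printf("without unset: %d bytes\n", memoryGrowthAfterLoop(100000, false));
printf("with unset:    %d bytes\n", memoryGrowthAfterLoop(100000, true));
```

If both figures stay small and roughly equal, plain reassignment is releasing the old strings. A figure that grows with the iteration count would point to the kind of leak worth reporting.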
Sponsor
Reputation Points: 520
Solved Threads: 268
Code Monkey
ShawnCplus
1,564 posts
since Apr 2005
Jul 5th, 2009

I thought I would let you know that I have made a demo bot, although for another website; it is attached to this post. It may need a few efficiency changes, but it does the job. The attachment should also give you a better understanding of the sort of code structure I'm talking about, as I have just written one possible algorithm.

--------
Edit:
I now see what you mean about the RSS feeds. So are you suggesting that the bot should only check the topics linked to in the RSS feeds?
Attached Files
File Type: zip files.zip (15.4 KB, 7 views)
Last edited by cwarn23; Jul 5th, 2009 at 4:39 am. Reason: added info
Sponsor
Featured Poster
Reputation Points: 410
Solved Threads: 258
Occupation: Genius
cwarn23
3,004 posts
since Sep 2007
Jul 5th, 2009

Yes, the RSS feeds could be used as the index of posts, so you don't have to discover topics by following links, which is guesswork.

The index of links is already contained in the RSS feed.
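
As a rough illustration of treating a feed as the index, the sketch below reads the item links out of an RSS 2.0 document with SimpleXML. The function name `rssItemLinks` and the sample feed are made up for the example; a real bot would download the forum's feed instead.

```php
<?php
// Hypothetical helper: return every item link found in an RSS 2.0 document.
function rssItemLinks(string $rssXml): array
{
    // LIBXML_NOCDATA turns CDATA sections into plain text nodes.
    $feed = simplexml_load_string($rssXml, 'SimpleXMLElement', LIBXML_NOCDATA);
    if ($feed === false) {
        return array(); // not well-formed XML
    }
    $links = array();
    foreach ($feed->channel->item as $item) {
        $links[] = trim((string) $item->link);
    }
    return $links;
}

// Made-up sample feed used only to show the shape of the data.
$sample = <<<XML
<?xml version="1.0"?>
<rss version="2.0"><channel><title>Sample</title>
<item><title>A</title><link><![CDATA[http://example.com/showthread.php?t=101]]></link></item>
<item><title>B</title><link><![CDATA[http://example.com/showthread.php?t=102]]></link></item>
</channel></rss>
XML;

print_r(rssItemLinks($sample));
```

Parsing the feed as XML avoids depending on the exact text layout of each link element, which a regular expression would have to match character by character.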
Moderator
Reputation Points: 457
Solved Threads: 101
Nearly a Posting Virtuoso
digital-ether
1,250 posts
since Sep 2005
Jul 6th, 2009

I have just completed a bot that indexes the pages linked from the RSS feeds. The script I created for the job is below, and I will alter it to scan 15 RSS feeds. Does anybody know how often the RSS feeds update? Because of my monthly bandwidth limit, the shortest interval between bot runs I could set would be about 1 hour, so please advise on what interval the bot should scan at. You may also be able to see in the code below the algorithm I have created. Are there any suggestions on how to improve the algorithm for later sentence retrieval?

Code is:
<?php
// Note: the mysql_* functions used below belong to the old mysql extension,
// which was removed in PHP 7.
set_time_limit(1800);
mysql_connect('localhost', 'root', '');
mysql_select_db('botsearch');

// Return the scheme and host part of a URL.
function domain($ddomain) {
    return preg_replace('/^((http(s)?:\/\/)?([^\/]+))(.*)/', '$1', $ddomain);
}

// Request only the headers of a URL and report whether it answered
// with "200 OK" or a "Moved" redirect.
function url_exists($durl)
{
    // Version 4.x supported
    $handle = curl_init($durl);
    if (false === $handle) {
        return false;
    }
    curl_setopt($handle, CURLOPT_HEADER, true);
    curl_setopt($handle, CURLOPT_FAILONERROR, true); // this works
    curl_setopt($handle, CURLOPT_HTTPHEADER,
        array("User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.15) Gecko/20080623 Firefox/2.0.0.15"));
    curl_setopt($handle, CURLOPT_NOBODY, true);
    curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
    $connectable = curl_exec($handle);
    curl_close($handle);
    if ($connectable === false) {
        return false;
    }
    // Only the status line at the start of the response matters.
    $con = substr($connectable, 0, 30);
    // stripos() can return 0, so compare against false explicitly.
    return stripos($con, '200 OK') !== false || stripos($con, 'Moved') !== false;
}

// Load the noun list into an array keyed by word for fast lookups.
$in_wordlist = array();
$r = mysql_query('SELECT * FROM `noun_list`');
while ($row = mysql_fetch_assoc($r)) {
    $in_wordlist[$row['word']] = true;
}

// Index the page currently held in $f_data for the given topic.
function generate($url, $topicid) {
    global $f_data; // Data of file contents
    global $in_wordlist;

    $r = mysql_query('SELECT * FROM `faq_datareference` WHERE `id`="'.mysql_real_escape_string($topicid).'"');
    if (mysql_num_rows($r) == 0) {
        // Normalise the HTML, drop quoted posts and collect the code boxes.
        $f_data = html_entity_decode($f_data);
        $f_data = str_replace('-|code|--|code|-', '-|code|- -|code|-', $f_data);
        $f_data = preg_replace('#<blockquote[^>]+>(.*)</blockquote>#U', '', $f_data);
        $tmp = str_replace(array('<br>', '<br />'), "\r", $f_data);
        preg_match_all('#<div class="codecontent">(.*)</div>#iUm', $tmp, $code);
        $codeid = 0;
        unset($tmp);
        $f_data = str_replace(array('<br/>', '<br />'), '<br>', $f_data);
        $f_data = str_replace('<br><br>', '<br>', $f_data);
        $f_data = str_replace('<br><br>', '<br>', $f_data); // in case of 4 <br> in a row
        $f_data = preg_replace('#<pre class="code">(.*)</pre>#iU', '-|code|-', $f_data);
        preg_match_all('#<div id="post_message_[0-9]+">(.*)</div> </div> </div>#isU', $f_data, $a_data);

        // Initialise every counter up front so nothing is read before it is set.
        $val = 0;
        $val2 = 0;
        $totalval = 0;
        $abortval = 0;
        $totalcodeval = 0;
        $did_wordcheck = false;
        $nounverb_result = array();
        $var = array('', 0);

        foreach ($a_data[1] as $adata) {
            $adata = strtolower(str_replace('...', '.', strip_tags($adata, '<br>')));
            $d_data = preg_split('/([\?.!]|<br>)/', $adata);
            if ($abortval < 4) {
                foreach ($d_data as $sentence) {
                    if (strlen($sentence) > 16 && $abortval < 4) {
                        $re = mysql_query('SELECT * FROM `faq_data` WHERE `string`="'.mysql_real_escape_string($sentence).'" AND `id`="'.mysql_real_escape_string($topicid).'"');
                        $did_wordcheck = false;
                        if (mysql_num_rows($re) == 0) {
                            // New sentence: count the words found in the noun list.
                            $did_wordcheck = true;
                            $words = explode(' ', $sentence);
                            foreach ($words as $word) {
                                if (isset($in_wordlist[$word])) {
                                    $val += 1;
                                    if (!isset($nounverb_result[$word])) {
                                        $nounverb_result[$word] = 1;
                                    } else {
                                        $nounverb_result[$word] += 1;
                                    }
                                }
                            }
                            $var[0] = $sentence;
                            $var[1] = $val;
                            $totalval += ($val - $val2);
                            mysql_query('INSERT INTO `faq_data` SET `match`="'.$var[1].'", `string`="'.mysql_real_escape_string($var[0]).'", `id`="'.mysql_real_escape_string($topicid).'"');
                        } else {
                            // Sentence already stored: count towards the abort limit.
                            $abortval += 1;
                        }

                        $val = 0;
                        $val2 = 0;
                    } else if ($sentence == '-|code|--|code|-') {
                        $re = mysql_query('SELECT * FROM `faq_data` WHERE `match`="-1" AND `string`="'.mysql_real_escape_string($code[1][$codeid]).'" AND `id`="'.mysql_real_escape_string($topicid).'"');
                        if ($abortval == 0 || mysql_num_rows($re) == 0) {
                            $var[0] = $sentence;
                            $var[1] += 3;
                            mysql_query('UPDATE `faq_data` SET `match`="'.$var[1].'" WHERE `string`="'.mysql_real_escape_string($var[0]).'" AND `id`="'.mysql_real_escape_string($topicid).'"');
                            mysql_query('INSERT INTO `faq_data` SET `match`="-1", `string`="'.mysql_real_escape_string($code[1][$codeid]).'", `id`="'.mysql_real_escape_string($topicid).'"');
                            $codeid += 1;
                            $val += 2;
                        }
                        $val2 = $val;
                        $totalcodeval += 1;
                    }
                }
            } else {
                break;
            }
        }

        if ($did_wordcheck == true) {
            // Store the ten most frequent list words for this topic.
            arsort($nounverb_result);
            $sql = 'INSERT INTO `faq_datareference` SET `id`="'.mysql_real_escape_string($topicid).'"';
            $i = 1;
            foreach ($nounverb_result as $key => $count) {
                if ($i > 10) {
                    break;
                }
                $sql .= ', `word'.$i.'`="'.mysql_real_escape_string($key).'", `word'.$i.'num`='.$count;
                $i++;
            }
            // Pad the remaining word columns when fewer than ten words were found.
            for (; $i < 11; $i++) {
                $sql .= ', `word'.$i.'`=" ", `word'.$i.'num`=-1';
            }
            $sql .= ', `num_nounsverbs`='.$totalval.', `num_codeboxes`='.$totalcodeval;
            mysql_query($sql);
        }
    }
    echo '<b>Generated Url: </b>'.$url.'<br>';
    unset($f_data);
}

$rss = file_get_contents('http://www.daniweb.com/forums/rss143.xml');
preg_match_all('#<link><!\[CDATA\[(.*[&?]t=([^\&]++).*)\]\]></link>#U', $rss, $topic_url);
for ($id = 0; isset($topic_url[2][$id]); $id++) {
    $rr = mysql_query('SELECT `id` FROM `faq_datareference` WHERE `id`="'.mysql_real_escape_string($topic_url[2][$id]).'"');
    if (url_exists($topic_url[1][$id]) && mysql_num_rows($rr) == 0) {
        $f_data = file_get_contents($topic_url[1][$id]);
        // [0-9]+ so that threads with ten or more pages are matched too.
        preg_match('#<td class="alt1" nowrap="nowrap"><span>Page 1 of ([0-9]+)</span></td>#i', $f_data, $pages_num);
        generate($topic_url[1][$id], $topic_url[2][$id]);
        flush();
        if (!empty($pages_num[1]) && $pages_num[1] > 1) {
            for ($i = 2; $i <= $pages_num[1]; $i++) {
                // Fetch the page being indexed rather than page 1 again.
                // Note: generate() skips its indexing step once the topic
                // already has a row in faq_datareference.
                $page_url = 'http://www.daniweb.com/forums/showthread.php?t='.$topic_url[2][$id].'&page='.$i;
                if (url_exists($page_url)) {
                    $f_data = file_get_contents($page_url);
                    generate($page_url, $topic_url[2][$id]);
                    flush();
                }
            }
        }
    }
}
?>
Thanks.
Last edited by cwarn23; Jul 6th, 2009 at 10:29 am.
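
The scoring idea in the script can be tried out on its own, without a database. The sketch below is a simplified stand-in, not the script itself: the function name `scoreSentences`, the `trim()` call, and the sample word list are assumptions added for the example.

```php
<?php
// Hypothetical helper: split a post into sentences and count how many words
// of each sentence appear in a noun list.
function scoreSentences(string $post, array $nounList, int $minLength = 16): array
{
    $nouns = array_flip($nounList); // word => index, for fast isset() lookups
    $text = strtolower(str_replace('...', '.', strip_tags($post, '<br>')));
    $scores = array();
    foreach (preg_split('/([\?.!]|<br>)/', $text) as $sentence) {
        $sentence = trim($sentence);
        if (strlen($sentence) <= $minLength) {
            continue; // too short to be a useful sentence
        }
        $score = 0;
        foreach (explode(' ', $sentence) as $word) {
            if (isset($nouns[$word])) {
                $score++;
            }
        }
        $scores[$sentence] = $score;
    }
    return $scores;
}

$post = 'The server runs out of memory in a loop. Use unset on the variable!<br>ok';
print_r(scoreSentences($post, array('server', 'memory', 'loop', 'variable')));
```

Keeping the scoring in a pure function like this makes it possible to experiment with different word lists and length thresholds before touching the database code.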
cwarn23
Jul 6th, 2009

Good idea, but in practice I fear that it will be very difficult to make and have it work properly.

Why not just create a bot that uses premade solutions that can be trusted? It can analyze the words in the question (kind of like what you have) and post the appropriate solution.

The easier solution is to have more read-me topics on common errors (which make up the bulk of threads) and add some logic to make sure posters have at least looked at them before being allowed to post their topic.

Another thing to think about is having a search function when they are posting their topic. They type out their problem and hit a search button (instead of 'submit new thread') and it brings up relevant threads. If they don't find a solution there, then they can post the thread to the forums. This would suppress a lot of the stupid questions that are posted on a daily basis.
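
A minimal sketch of the premade-solutions idea, assuming a simple keyword overlap is enough to pick an answer. The function name `bestSolution` and the two sample solutions are invented for the illustration.

```php
<?php
// Hypothetical helper: pick the premade solution whose keywords overlap most
// with the words of a question. Returns null when nothing matches.
function bestSolution(string $question, array $solutions): ?string
{
    $words = array_unique(
        preg_split('/[^a-z0-9_]+/', strtolower($question), -1, PREG_SPLIT_NO_EMPTY)
    );
    $bestTitle = null;
    $bestScore = 0;
    foreach ($solutions as $title => $keywords) {
        $score = count(array_intersect($words, $keywords));
        if ($score > $bestScore) {
            $bestScore = $score;
            $bestTitle = $title;
        }
    }
    return $bestTitle;
}

// Made-up solutions table: title => keywords.
$solutions = array(
    'Headers already sent' => array('headers', 'sent', 'header', 'output'),
    'Undefined index notice' => array('undefined', 'index', 'notice', 'array'),
);

var_dump(bestSolution('Why do I get "headers already sent" after echo?', $solutions));
var_dump(bestSolution('How do I cook rice?', $solutions));
```

Returning null when no keyword matches leaves room for the fallback described above: if nothing relevant is found, the poster goes ahead and submits the thread.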
Reputation Points: 235
Solved Threads: 193
Nearly a Posting Virtuoso
kkeith29
1,315 posts
since Jun 2007
Jul 6th, 2009

I didn't quite catch what you're suggesting in your post above. The part about bringing up relevant threads is not exactly what I'm doing. My database contains many rows of sentences, and when a user searches, an AI will string some of those sentences together to form paragraphs. At the moment I'm using a system of nouns and verbs. Could you rephrase your post in simpler terms? It's hard to see what you're suggesting other than that I use a premade project/solution.
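
To make the idea of stringing stored sentences together concrete, here is one very simple way it could work: rank the sentences by how many search terms they contain and join the best few. This is only a sketch under that assumption; the function name `buildParagraph` and the sample data are hypothetical, and it is not the algorithm used by the bot.

```php
<?php
// Hypothetical helper: choose the stored sentences that share the most words
// with the search terms and join them into one paragraph.
function buildParagraph(array $sentences, array $searchTerms, int $limit = 3): string
{
    $ranked = array();
    foreach ($sentences as $sentence) {
        $words = explode(' ', strtolower($sentence));
        $hits = count(array_intersect($searchTerms, $words));
        if ($hits > 0) {
            $ranked[$sentence] = $hits;
        }
    }
    arsort($ranked); // highest number of matching terms first
    $chosen = array_slice(array_keys($ranked), 0, $limit);
    // Capitalise each sentence and end it with a full stop.
    return implode(' ', array_map(function ($s) {
        return ucfirst($s) . '.';
    }, $chosen));
}

// Made-up rows standing in for sentences stored in the database.
$stored = array(
    'use unset to free the variable',
    'the loop runs out of memory on the server',
    'daniweb has many forums',
);

echo buildParagraph($stored, array('memory', 'server', 'unset')), "\n";
```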
cwarn23
Jul 6th, 2009

Instead of trying to create an understandable paragraph without human intervention (which will not be easy at all), just build a database of solutions that humans have created and post the appropriate one. That way the information can be trusted to be accurate.

The other suggestions I had were for changing how DaniWeb works, if you didn't catch that. They are not really related to your question at all. I just know that moderators etc. will be looking at it, so I thought I would throw it out there.
Last edited by kkeith29; Jul 6th, 2009 at 9:22 pm.
kkeith29
Jul 6th, 2009

Quote:
The other suggestions I had were for changing how DaniWeb works, if you didn't catch that. They are not really related to your question at all. I just know that moderators etc. will be looking at it, so I thought I would throw it out there.
If you mean the URL system, the URLs being accessed are all listed in the RSS feeds or are the page 2, 3, 4, 5, etc. links. For some reason the RSS feeds don't link to the rewritten (mod_rewrite) URLs, which is why the links appear as they do. But if you mean something different, I would like to know, as I can't see how else this script might change how DaniWeb works.
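
Because the feed links use the plain query-string form, the topic id can be read from the `t` parameter without a regular expression. The sketch below shows one way to do that; the function name `threadIdFromUrl` and the id `12345` are made up for the example.

```php
<?php
// Hypothetical helper: read the thread id from the "t" query parameter of a
// non-rewritten showthread.php URL. Returns null when there is no such parameter.
function threadIdFromUrl(string $url): ?string
{
    $query = parse_url($url, PHP_URL_QUERY);
    if (!is_string($query)) {
        return null; // no query string at all
    }
    parse_str($query, $params);
    return isset($params['t']) && is_string($params['t']) ? $params['t'] : null;
}

var_dump(threadIdFromUrl('http://www.daniweb.com/forums/showthread.php?t=12345&page=2'));
```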
cwarn23
Jul 7th, 2009

I was talking about an actual change to the DaniWeb code. Your script couldn't do that.

Really, I think that the FAQ bot would be too difficult to make work. If you can think of how to write code that interprets something as complex as language the way the human mind does, then good for you. I am sure you can make it work well enough, but there will always be a downside. In reality, I don't think it will build sentences that make sense. How will it know that the info it's taking in is correct? Most people who post don't have a good understanding of English and write sentences that are not grammatically correct. How will you combat that without a human reading it all?

About the script itself: do you have multiple servers to use for load balancing? That script takes a huge amount of resources. I ran it on my server along with a few others, and they were having problems. The bad thing is that it is only half of the code required to pull this off. You are just getting the info and haven't devised a way to build the sentences into a paragraph and post it to a thread. The code is going to have to build a database and keep that database current (unless you can get access to the DaniWeb databases, which I don't see ever happening), which is going to be hard.
Last edited by kkeith29; Jul 7th, 2009 at 12:26 am.
kkeith29
Jul 7th, 2009

It seems the main answer to all those problems is one of the things I'm good at: advanced algorithms. I find complex algorithms fun to make, and in cases like this they can produce a great result, like the algorithm that founded Google. It's not as if there's anything hard about it. If worst comes to worst, I could just make the bot reword entire topics into single MySQL entries (e.g. 1 paragraph per MySQL entry), and there would be no need for mass processing when searching the database. The problem is all about making the right algorithm, and I'm sure I could do that. I was just asking if anybody had any thoughts on how it should work.

Note that the CPU load is mainly caused by the file_get_contents() function downloading files of around 30 KB each, but calling the sleep() function occasionally will relieve the CPU to some degree. Just in case you're wondering, this bot will be hosted on a virtual private server with 512 MB of RAM.
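
A small sketch of the pausing idea, assuming the downloads happen one after another. The function name `fetchThrottled` and the stand-in fetcher are hypothetical; the fetcher is passed in as a callable so the example runs without network access.

```php
<?php
// Hypothetical helper: fetch a list of URLs one at a time, pausing between
// requests. $fetch is any callable that takes a URL and returns its contents.
function fetchThrottled(array $urls, callable $fetch, int $pauseMicroseconds): array
{
    $pages = array();
    $last = count($urls) - 1;
    foreach (array_values($urls) as $n => $url) {
        $pages[$url] = $fetch($url);
        if ($n < $last) {
            usleep($pauseMicroseconds); // idle between downloads
        }
    }
    return $pages;
}

// Stand-in fetcher so the sketch runs offline;
// passing 'file_get_contents' instead would download the real pages.
$fakeFetch = function ($url) {
    return 'contents of ' . $url;
};

print_r(fetchThrottled(array('page1', 'page2'), $fakeFetch, 100000));
```

Pausing does not reduce the total amount of work; it spreads the work over a longer period, which can leave more room for other sites on the same server.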
Last edited by cwarn23; Jul 7th, 2009 at 1:42 am.
cwarn23
