::Making Daniweb FAQ Bot::

ShawnCplus
Re: ::Making Daniweb FAQ Bot::
#11, Jul 4th, 2009
Originally Posted by cwarn23:
Although that's good advice Shawn, and I will take note of it.
The only thing I would disagree with is the unset(). I have noticed in the past that when reassigning a string to a variable in a loop a few million times, the server runs out of memory, whereas if unset() is used the memory problem does not occur.

If you're using the most recent version of PHP then that shouldn't happen. If it does happen on the most recent version, you might want to submit a bug report; that kind of problem is called a memory leak.
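One way to check the claim on a given PHP build is to measure it. The sketch below is only an illustration: the function name memoryGrowth() and the iteration count are made up for this example, and it relies on the built-in memory_get_usage(). It runs the same loop with and without unset() and reports how far memory use rose above the starting point.

```php
<?php
// Hypothetical helper: returns how many bytes memory use rose above its
// starting point while a string was reassigned $iterations times.
function memoryGrowth(bool $useUnset, int $iterations = 100000): int
{
    $before = memory_get_usage();
    $peak = $before;
    for ($i = 0; $i < $iterations; $i++) {
        // Reassign a fresh string of roughly 1 KB on every pass.
        $value = str_repeat('x', 1024) . $i;
        if ($useUnset) {
            unset($value);
        }
        $peak = max($peak, memory_get_usage());
    }
    return $peak - $before;
}

$withoutUnset = memoryGrowth(false);
$withUnset = memoryGrowth(true);
echo "Growth without unset(): {$withoutUnset} bytes\n";
echo "Growth with unset(): {$withUnset} bytes\n";
```

If the first number climbs with the iteration count while the second stays flat, that would point to the kind of leak worth reporting.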

cwarn23
Re: ::Making Daniweb FAQ Bot::
#12, Jul 5th, 2009
I thought I would let you know that I have made a demo bot, although for another website, and attached it to this post. Although it may need a few efficiency changes, it does the job. The attachment should also give you a better understanding of the sort of code structure I'm talking about, as I have just written one possible algorithm.

--------
Edit:
I now see what you mean about the RSS feeds. So are you saying to perhaps have the bot check only the topics linked to in the RSS feeds?
Last edited by cwarn23; Jul 5th, 2009 at 4:39 am. Reason: added info
Attached Files: files.zip (15.4 KB)

digital-ether
Re: ::Making Daniweb FAQ Bot::
#13, Jul 5th, 2009
Originally Posted by cwarn23:
I thought I would let you know that I have made a demo bot, although for another website, and attached it to this post. Although it may need a few efficiency changes, it does the job. The attachment should also give you a better understanding of the sort of code structure I'm talking about, as I have just written one possible algorithm.

--------
Edit:
I now see what you mean about the RSS feeds. So are you saying to perhaps have the bot check only the topics linked to in the RSS feeds?

Yes, the RSS feeds could be used as the index of posts, so you don't have to follow links (which is guesswork).

You already know that the index of thread links is contained in the RSS feed.
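To make the idea concrete, here is a minimal sketch of reading thread links out of a feed with SimpleXML (it assumes the SimpleXML extension is enabled). The sample feed, the example.com URLs, and the function name threadLinksFromRss() are made up for illustration; a real run would fetch the forum's feed over HTTP instead.

```php
<?php
// Hypothetical sample feed standing in for a real forum RSS feed.
$sampleRss = <<<XML
<rss version="2.0"><channel>
<item><title>First</title><link><![CDATA[http://example.com/showthread.php?t=101&goto=newpost]]></link></item>
<item><title>Second</title><link><![CDATA[http://example.com/showthread.php?t=202]]></link></item>
</channel></rss>
XML;

// Returns an array mapping thread id (the "t" query parameter) to thread URL.
function threadLinksFromRss(string $rssXml): array
{
    // LIBXML_NOCDATA merges CDATA sections into ordinary text nodes.
    $feed = simplexml_load_string($rssXml, 'SimpleXMLElement', LIBXML_NOCDATA);
    if ($feed === false) {
        return [];
    }
    $threads = [];
    foreach ($feed->channel->item as $item) {
        $url = trim((string) $item->link);
        // Pull the query string apart to find the thread id.
        parse_str((string) parse_url($url, PHP_URL_QUERY), $query);
        if (isset($query['t'])) {
            $threads[$query['t']] = $url;
        }
    }
    return $threads;
}

$threads = threadLinksFromRss($sampleRss);
```

Because every URL comes straight from the feed, the bot never has to crawl arbitrary links to discover threads.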

cwarn23
Re: ::Making Daniweb FAQ Bot::
#14, Jul 6th, 2009
I have just completed a bot that indexes the pages linked from the RSS feeds. Below is the script I created for the job; I will alter it to scan 15 RSS feeds. Does anybody know how often the RSS feeds update? Because of my monthly bandwidth limit, the shortest interval between bot runs I could set would be about 1 hour, so please advise on what interval the bot should scan at. You may also be able to see in the code below the algorithm I have created. Are there any suggestions on how to improve the algorithm for later sentence retrieval?

Code is:
<?php
// Note: the mysql_* functions used below were removed in PHP 7.0,
// so this script targets PHP 5 as written.
set_time_limit(1800);
mysql_connect('localhost', 'root', '');
mysql_select_db('botsearch');

// Return the scheme and host part of a URL.
function domain($ddomain) {
    return preg_replace('/^((http(s)?:\/\/)?([^\/]+))(.*)/', '$1', $ddomain);
}

// Request only the headers of a URL and report whether the
// status line says "200 OK" or "Moved".
function url_exists($durl)
{
    // Version 4.x supported
    $handle = curl_init($durl);
    if (false === $handle)
    {
        return false;
    }
    curl_setopt($handle, CURLOPT_HEADER, true);
    curl_setopt($handle, CURLOPT_FAILONERROR, true); // this works
    curl_setopt($handle, CURLOPT_HTTPHEADER,
        array("User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.15) Gecko/20080623 Firefox/2.0.0.15"));
    curl_setopt($handle, CURLOPT_NOBODY, true);
    curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
    $connectable = curl_exec($handle);
    curl_close($handle);
    // Only the status line matters, so keep the first 30 characters.
    $con = substr((string) $connectable, 0, 30);
    return stripos($con, '200 OK') !== false || stripos($con, 'Moved') !== false;
}

// Load the noun list into an array keyed by word for fast isset() lookups.
$in_wordlist = array();
$r = mysql_query('SELECT * FROM `noun_list`');
while ($row = mysql_fetch_assoc($r)) {
    $in_wordlist[$row['word']] = true;
}

// Split the page held in the global $f_data into sentences, score each
// sentence by the listed words it contains, and store the results.
function generate($url, $topicid) {
    global $f_data; // Data of file contents
    global $in_wordlist;

    $r = mysql_query('SELECT * FROM `faq_datareference` WHERE `id`="'.mysql_real_escape_string($topicid).'"');
    if (mysql_num_rows($r) == 0) {
        $f_data = html_entity_decode($f_data);
        $f_data = str_replace('-|code|--|code|-', '-|code|- -|code|-', $f_data);
        // Drop quoted posts so the same sentences are not indexed twice.
        $f_data = preg_replace('#<blockquote[^>]+>(.*)</blockquote>#U', '', $f_data);
        $tmp = str_replace(array('<br>', '<br />'), "\r", $f_data);
        preg_match_all('#<div class="codecontent">(.*)</div>#iUm', $tmp, $code);
        $codeid = 0;
        unset($tmp);
        // Normalise line breaks and collapse runs of them.
        $f_data = str_replace(array('<br/>', '<br />'), '<br>', $f_data);
        $f_data = str_replace('<br><br>', '<br>', $f_data);
        $f_data = str_replace('<br><br>', '<br>', $f_data); // in case of 4 <br> in a row
        // Replace each code box with a placeholder token.
        $f_data = preg_replace('#<pre class="code">(.*)</pre>#iU', '-|code|-', $f_data);
        preg_match_all('#<div id="post_message_[0-9]+">(.*)</div> </div> </div>#isU', $f_data, $a_data);
        $val = 0;
        $val2 = 0;               // was used before being set
        $totalval = 0;
        $abortval = 0;           // counts sentences that are already stored
        $totalcodeval = 0;
        $did_wordcheck = false;  // was undefined when no sentence qualified
        $nounverb_result = array();
        $var = array('', 0);
        foreach ($a_data[1] AS $adata) {
            $adata = strtolower(str_replace('...', '.', strip_tags($adata, '<br>')));
            $d_data = preg_split('/([\?.!]|<br>)/', $adata);
            if ($abortval >= 4) {
                break; // four known sentences: assume the page is already indexed
            }
            foreach ($d_data AS $sentence) {
                if (strlen($sentence) > 16 && $abortval < 4) {
                    $re = mysql_query('SELECT * FROM `faq_data` WHERE `string`="'.mysql_real_escape_string($sentence).'" AND `id`="'.mysql_real_escape_string($topicid).'"');
                    $did_wordcheck = false;
                    if (mysql_num_rows($re) == 0) {
                        $did_wordcheck = true;
                        // Count the words of this sentence that are in the word list.
                        $words = explode(' ', $sentence);
                        foreach ($words AS $word) {
                            if (isset($in_wordlist[$word])) {
                                $val += 1;
                                if (!isset($nounverb_result[$word])) {
                                    $nounverb_result[$word] = 1;
                                } else {
                                    $nounverb_result[$word] += 1;
                                }
                            }
                        }
                        $var[0] = $sentence;
                        $var[1] = $val;
                        $totalval += ($val - $val2);
                        mysql_query('INSERT INTO `faq_data` SET `match`="'.$var[1].'", `string`="'.mysql_real_escape_string($var[0]).'", `id`="'.mysql_real_escape_string($topicid).'"');
                    } else {
                        $abortval += 1;
                    }
                    $val = 0;
                    $val2 = 0;
                } else if ($sentence == '-|code|--|code|-') {
                    // A sentence made of two adjacent code placeholders stands for a code box.
                    $re = mysql_query('SELECT * FROM `faq_data` WHERE `match`="-1" AND `string`="'.mysql_real_escape_string($code[1][$codeid]).'" AND `id`="'.mysql_real_escape_string($topicid).'"');
                    if ($abortval == 0 || mysql_num_rows($re) == 0) {
                        $var[0] = $sentence;
                        $var[1] += 3;
                        mysql_query('UPDATE `faq_data` SET `match`="'.$var[1].'" WHERE `string`="'.mysql_real_escape_string($var[0]).'" AND `id`="'.mysql_real_escape_string($topicid).'"');
                        mysql_query('INSERT INTO `faq_data` SET `match`="-1", `string`="'.mysql_real_escape_string($code[1][$codeid]).'", `id`="'.mysql_real_escape_string($topicid).'"');
                        $codeid += 1;
                        $val += 2;
                    }
                    $val2 = $val;
                    $totalcodeval += 1;
                }
            }
        }

        if ($did_wordcheck == true) {
            // Store the ten most frequent listed words for this topic.
            arsort($nounverb_result);
            $sql = 'INSERT INTO `faq_datareference` SET `id`="'.mysql_real_escape_string($topicid).'"';
            $i = 1;
            foreach ($nounverb_result AS $key => $count) {
                if ($i > 10) {
                    break;
                }
                $sql .= ', `word'.$i.'`="'.mysql_real_escape_string($key).'", `word'.$i.'num`='.$count;
                $i++;
            }
            // Pad the remaining word columns when fewer than ten words were found.
            for (; $i < 11; $i++) {
                $sql .= ', `word'.$i.'`=" ", `word'.$i.'num`=-1';
            }
            $sql .= ', `num_nounsverbs`='.$totalval.', `num_codeboxes`='.$totalcodeval;
            mysql_query($sql);
        }
    }
    echo '<b>Generated Url: </b>'.$url.'<br>';
    unset($f_data);
}

// Use the RSS feed as the index of threads to scan.
$rss = file_get_contents('http://www.daniweb.com/forums/rss143.xml');
preg_match_all('#<link><!\[CDATA\[(.*[&?]t=([^\&]++).*)\]\]></link>#U', $rss, $topic_url);
for ($id = 0; isset($topic_url[2][$id]); $id++) {
    $rr = mysql_query('SELECT `id` FROM `faq_datareference` WHERE `id`="'.mysql_real_escape_string($topic_url[2][$id]).'"');
    if (url_exists($topic_url[1][$id]) && mysql_num_rows($rr) == 0) {
        $f_data = file_get_contents($topic_url[1][$id]);
        // [0-9]+ so that threads with ten or more pages are handled too.
        preg_match('#<td class="alt1" nowrap="nowrap"><span>Page 1 of ([0-9]+)</span></td>#i', $f_data, $pages_num);
        generate($topic_url[1][$id], $topic_url[2][$id]);
        flush();
        if (!empty($pages_num[1]) && $pages_num[1] > 1) {
            // Note: generate() returns early once the topic has a row in
            // `faq_datareference`, so later pages are only indexed if page 1 stored none.
            for ($i = 2; $i <= $pages_num[1]; $i++) {
                // Fetch the page being indexed; the original re-downloaded page 1 each time.
                $page_url = 'http://www.daniweb.com/forums/showthread.php?t='.$topic_url[2][$id].'&page='.$i;
                if (url_exists($page_url)) {
                    $f_data = file_get_contents($page_url);
                    generate($page_url, $topic_url[2][$id]);
                    flush();
                }
            }
        }
    }
}
?>
Thanks.
Last edited by cwarn23; Jul 6th, 2009 at 10:29 am.
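
The core of the algorithm above (split each post into sentences, then score each sentence by how many of its words are in the word list) can be tried without a database. The sketch below is a simplified stand-in rather than the script itself: scoreSentences() and the three-word list are made up for illustration, and only the 16-character minimum is taken from the script.

```php
<?php
// Hypothetical helper: maps each sufficiently long sentence in $text to the
// number of its words that appear as keys in $wordlist.
function scoreSentences(string $text, array $wordlist, int $minLength = 16): array
{
    $text = strtolower(str_replace('...', '.', $text));
    $scores = [];
    foreach (preg_split('/[?.!]/', $text) as $sentence) {
        $sentence = trim($sentence);
        if (strlen($sentence) <= $minLength) {
            continue; // too short to be useful, as in the script above
        }
        $score = 0;
        foreach (explode(' ', $sentence) as $word) {
            if (isset($wordlist[$word])) {
                $score++;
            }
        }
        $scores[$sentence] = $score;
    }
    return $scores;
}

// Made-up word list; the script loads its list from the `noun_list` table.
$wordlist = ['server' => true, 'memory' => true, 'loop' => true];
$scores = scoreSentences(
    'The server runs out of memory in a loop. Ok. Why does this happen to me?',
    $wordlist
);
print_r($scores);
```

Keeping the scoring in a function like this makes it easy to experiment with other weightings before touching the database code.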

kkeith29
Re: ::Making Daniweb FAQ Bot::
#15, Jul 6th, 2009
Good idea, but in practice I fear that it will be very difficult to make and have it work properly.

Why not just create a bot that uses premade solutions that can be trusted? It can analyze the words in the question (kind of like what you have) and post the appropriate solution.

The easier solution is to have more read-me topics on common errors (which make up the bulk of threads) and add some logic to make sure users have at least looked at them before being allowed to post their topic.

Another thing to think about is having a search function when they are posting their topic. They type out their problem and hit a search button (instead of 'submit new thread') and it brings up relevant threads. If they don't find a solution there, then they can post the thread to the forums. This would suppress a lot of the stupid questions that are posted on a daily basis.

cwarn23
Re: ::Making Daniweb FAQ Bot::
#16, Jul 6th, 2009
Originally Posted by kkeith29:
Good idea, but in practice I fear that it will be very difficult to make and have it work properly.

Why not just create a bot that uses premade solutions that can be trusted? It can analyze the words in the question (kind of like what you have) and post the appropriate solution.

The easier solution is to have more read-me topics on common errors (which make up the bulk of threads) and add some logic to make sure users have at least looked at them before being allowed to post their topic.

Another thing to think about is having a search function when they are posting their topic. They type out their problem and hit a search button (instead of 'submit new thread') and it brings up relevant threads. If they don't find a solution there, then they can post the thread to the forums. This would suppress a lot of the stupid questions that are posted on a daily basis.

I didn't quite catch what you're suggesting. The part about bringing up relevant threads is not exactly what I'm doing. My database contains many rows of sentences, and when a user searches, an AI will string together some of those sentences to form paragraphs. At the moment I'm using a system of nouns and verbs. Could you rephrase the information in the above quote in simpler terms? It's hard to see what you're suggesting other than that I use a premade project/solution.

kkeith29
Re: ::Making Daniweb FAQ Bot::
#17, Jul 6th, 2009
Instead of trying to create an understandable paragraph without human intervention (which will not be easy at all), just build a database of solutions that humans have created and post the appropriate one. That way the information can be trusted to be accurate.

The other suggestions I had were for changing how DaniWeb works, if you didn't catch that. They're not really related to your question at all. I just know that moderators etc. will be looking at this thread, so I thought I would throw them out there.
Last edited by kkeith29; Jul 6th, 2009 at 9:22 pm.
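
A rough sketch of the premade-solutions idea, under stated assumptions: the two solutions, their keywords, and the function pickSolution() are all invented for illustration, and a real bot would keep its human-written answers in a database. The bot counts how many of each solution's keywords occur in the question and posts the best match only if enough keywords hit.

```php
<?php
// Hypothetical hand-written solutions with the keywords that should trigger them.
$solutions = [
    [
        'keywords' => ['headers', 'already', 'sent'],
        'answer'   => 'Make sure nothing is output before header() is called.',
    ],
    [
        'keywords' => ['mysql', 'connect', 'denied'],
        'answer'   => 'Check the database username and password.',
    ],
];

// Returns the answer whose keywords best match the question,
// or null when fewer than $minHits keywords match.
function pickSolution(string $question, array $solutions, int $minHits = 2): ?string
{
    // Break the question into lowercase words for fast lookups.
    $words = array_flip(
        preg_split('/[^a-z0-9]+/', strtolower($question), -1, PREG_SPLIT_NO_EMPTY)
    );
    $best = null;
    $bestHits = 0;
    foreach ($solutions as $solution) {
        $hits = 0;
        foreach ($solution['keywords'] as $keyword) {
            if (isset($words[$keyword])) {
                $hits++;
            }
        }
        if ($hits > $bestHits) {
            $bestHits = $hits;
            $best = $solution['answer'];
        }
    }
    return $bestHits >= $minHits ? $best : null;
}

$matched = pickSolution('Warning: headers already sent by output', $solutions);
$unmatched = pickSolution('How do I center a div?', $solutions);
```

Returning null when too few keywords match lets the bot stay silent rather than post an unrelated answer.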

cwarn23
Re: ::Making Daniweb FAQ Bot::
#18, Jul 6th, 2009
Originally Posted by kkeith29:
The other suggestions I had were for changing how DaniWeb works, if you didn't catch that. They're not really related to your question at all. I just know that moderators etc. will be looking at this thread, so I thought I would throw them out there.

If you mean the URL system, the URLs being accessed are all either on the RSS feeds page or in the page 2, 3, 4, 5 etc. links. For some reason the RSS feeds don't link to the rewritten URL (the mod_rewrite URL), which is why the links appear as they do. But if you mean something different then I would like to know, as I can't see how else this script might change how DaniWeb works.

kkeith29
Re: ::Making Daniweb FAQ Bot::
#19, Jul 7th, 2009
I was talking about an actual change to the DaniWeb code. Your script couldn't do that.

Really, I think the FAQ bot would be too difficult to make work. If you can think of how to code like the human mind to interpret something as complex as language, then good for you. I am sure you can make it work well enough, but there will always be a downside. In reality, I don't think it will build sentences that make sense. How will it know that the info it's taking in is correct? Most people who post don't have a good understanding of English and write sentences that are not grammatically correct. How will you combat that without a human reading it all?

About the script itself: do you have multiple servers to use for load balancing? That script takes a huge amount of resources. I ran it on my server along with a few others and they were having problems. The bad thing is that this is only half of the code required to pull it off. You are just getting the info and haven't yet devised a way to build the sentences into a paragraph and post it to a thread. The code is going to have to build a database and keep that database current (unless you can get access to the DaniWeb databases, which I don't see ever happening), which is going to be hard.
Last edited by kkeith29; Jul 7th, 2009 at 12:26 am.

cwarn23
Re: ::Making Daniweb FAQ Bot::
#20, Jul 7th, 2009
It seems the main answer to all those problems is one of the things I'm good at: advanced algorithms. I find complex algorithms fun to make, and in cases like this they can produce a great result, like the algorithm that founded Google. It's not as if there's anything hard about it. If worst comes to worst, I could just make the bot reword entire topics into single MySQL entries (e.g. 1 paragraph per MySQL entry), and there would be no need for mass processing when searching the database. The problem is all about making the right algorithm, and I'm sure I could do that. I was just asking if anybody had any thoughts on how it should work.

Note that the CPU load is mainly caused by the file_get_contents() function downloading files of around 30 KB in size, but calling the sleep() function occasionally will relieve the CPU to some degree. Just in case you're wondering, this bot will be hosted on a virtual private server with 512 MB of RAM.
Last edited by cwarn23; Jul 7th, 2009 at 1:42 am.
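
The pause between downloads could be sketched like this. It is only an illustration: fetchThrottled() is a made-up helper, the half-second default is an arbitrary assumption, and usleep() is used in place of sleep() so that pauses shorter than a second are possible.

```php
<?php
// Hypothetical helper: fetches each URL with $fetch, pausing between
// requests so the server is not kept busy without a break.
function fetchThrottled(array $urls, callable $fetch, int $pauseMicroseconds = 500000): array
{
    $pages = [];
    $last = count($urls) - 1;
    foreach (array_values($urls) as $index => $url) {
        $pages[$url] = $fetch($url);
        if ($index < $last) {
            usleep($pauseMicroseconds); // no pause needed after the final URL
        }
    }
    return $pages;
}

// A stand-in fetcher is used here so the example runs offline; the bot
// would pass 'file_get_contents' instead.
$fakeFetch = function ($url) {
    return 'contents of ' . $url;
};
$pages = fetchThrottled(['page-1', 'page-2'], $fakeFetch, 1000);
```

Passing the fetcher in as a callable keeps the throttling separate from the download code, so the pause length can be tuned on its own.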

©2003 - 2009 DaniWeb® LLC