::Making Daniweb FAQ Bot::
That's good advice, Shawn, and I will take note of it.
The only thing I would disagree with is the unset(). In the past I have noticed that when a string is reassigned to a variable in a loop a few million times, the server runs out of memory, whereas if unset() is used the memory problem does not occur.
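One way to check this on a given server is to measure it directly. The sketch below is only an illustration: measure_reassign() is a made-up helper, and the iteration count and string size are arbitrary assumptions. It reports how far memory_get_usage() climbs while a variable is reassigned in a loop, with and without unset() before each assignment.

```php
<?php
// Hypothetical helper (not from the thread): reassigns a string $iterations
// times and returns how many bytes memory usage grew at its highest point.
function measure_reassign($iterations, $useUnset) {
    $before = memory_get_usage();
    $peak = $before;
    for ($i = 0; $i < $iterations; $i++) {
        if ($useUnset) {
            unset($s); // harmless on the first pass, when $s does not exist yet
        }
        $s = str_repeat('x', 1024) . $i; // a fresh string of about 1 KB each pass
        $now = memory_get_usage();
        if ($now > $peak) {
            $peak = $now;
        }
    }
    return $peak - $before;
}

$withUnset    = measure_reassign(100000, true);
$withoutUnset = measure_reassign(100000, false);
echo "growth with unset():    $withUnset bytes\n";
echo "growth without unset(): $withoutUnset bytes\n";
```

If the two numbers come out about the same, the memory problem probably came from something else in the loop (for example, values piling up in an array) rather than from the reassignment itself.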
GCS d- s+ a-->? C++(++++) UL+++ P+>+++ L+++ E--- W+++
N+ o K w++(---) O? !M- V PS+>++ PE+ Y+ PGP !t- 5? X- R tv+
b+>++ DI+ D G++>+++ e+ h+>++ r y+
PMs asking for help will not be answered, post on the forums. That's what they're there for.
I thought I would let you know that I have made a demo bot, though for another website, and attached it to this post. Although it may need a few efficiency changes, it does the job. The attachment should also give you a better understanding of the sort of code structure I'm talking about, as I have just written one possible algorithm.
--------
Edit:
I now see what you mean about the RSS feeds. So are you saying to perhaps have a bot that only checks the topics linked to in the RSS feeds?
Last edited by cwarn23; Jul 5th, 2009 at 4:39 am. Reason: added info
Try not to bump 10 year old threads as it can be really annoying.
Like php then read my website at http://syntax.cwarn23.net/
Star-Trek-Atlantis - now that's what I call a movie ^_^
My favourite PC. - MacGyver Fan
Bad english note: dis-iz-2b4u
You know that the index of links is contained in the RSS feed.
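As a rough sketch of what reading that index could look like, the block below pulls the topic links out of a feed with SimpleXML and uses the t= parameter as the topic id. The function name topic_links_from_rss() and the sample feed are made up for illustration, and it assumes the SimpleXML extension is enabled and that the feed is plain RSS 2.0 with one link per item.

```php
<?php
// Hypothetical helper: returns an array of topic id => thread URL for every
// <item><link> in the feed whose URL carries a t= query parameter.
function topic_links_from_rss($xml) {
    $topics = array();
    $feed = simplexml_load_string($xml, 'SimpleXMLElement', LIBXML_NOCDATA);
    if ($feed === false) {
        return $topics; // not well-formed XML
    }
    foreach ($feed->channel->item as $item) {
        $url = trim((string) $item->link);
        $query = parse_url($url, PHP_URL_QUERY);
        if (!is_string($query)) {
            continue; // link has no query string
        }
        parse_str($query, $params);
        if (isset($params['t'])) {
            $topics[$params['t']] = $url;
        }
    }
    return $topics;
}

// Made-up sample feed, standing in for the real RSS file.
$sample = '<?xml version="1.0"?><rss version="2.0"><channel>'
    . '<item><link><![CDATA[http://www.example.com/forums/showthread.php?t=123&goto=newpost]]></link></item>'
    . '<item><link><![CDATA[http://www.example.com/forums/showthread.php?t=456]]></link></item>'
    . '</channel></rss>';

$topics = topic_links_from_rss($sample);
print_r($topics);
```

The bot would then only need to visit the URLs in $topics whose ids are not already in its database.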
www.fijiwebdesign.com - web design and development and fun
Cpanel Email - Let users Register email accounts on your website upon registration
Ajax Chat - Fully browser based chat!
I have just completed a bot that indexes the pages linked from the RSS feeds. Below is the script I created for the job; I will alter it to scan 15 RSS feeds. Does anybody know how often the RSS feeds update? Because of my monthly bandwidth limit, the shortest gap between bot runs I could set would be about 1 hour, so please advise on how often the bot should scan. You may also be able to see in the code below the algorithm I have created. Are there any suggestions on how to improve the algorithm for later sentence retrieval?
Thanks.
Code is:
<?php
set_time_limit(1800);
mysql_connect('localhost', 'root', '');
mysql_select_db('botsearch');

// Return the scheme and host part of a URL.
function domain($ddomain) {
    return preg_replace('/^((http(s)?:\/\/)?([^\/]+))(.*)/', '$1', $ddomain);
}

// Send a header-only request and report whether the URL answers with "200 OK" or a "Moved" redirect.
function url_exists($durl) {
    // Version 4.x supported
    $handle = curl_init($durl);
    if (false === $handle) {
        return false;
    }
    curl_setopt($handle, CURLOPT_HEADER, true);
    curl_setopt($handle, CURLOPT_FAILONERROR, true); // this works
    curl_setopt($handle, CURLOPT_HTTPHEADER, array("User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.15) Gecko/20080623 Firefox/2.0.0.15"));
    curl_setopt($handle, CURLOPT_NOBODY, true);
    curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
    $connectable = curl_exec($handle);
    //echo '"'.$connectable.'"';
    curl_close($handle);
    $con = substr_replace($connectable, '', 30); // keep only the first 30 characters (the status line)
    return (stripos($con, '200 OK') !== false || stripos($con, 'Moved') !== false);
}

// Load the noun list into a lookup table keyed by word.
$in_wordlist = array();
$r = mysql_query('SELECT * FROM `noun_list`');
while ($row = mysql_fetch_assoc($r)) {
    $in_wordlist[$row['word']] = true;
}

// Index one downloaded page (held in the global $f_data) for the given topic.
function generate($url, $topicid) {
    global $f_data; // Data of file contents
    global $in_wordlist;
    $r = mysql_query('SELECT * FROM `faq_datareference` WHERE `id`="' . mysql_real_escape_string($topicid) . '"');
    if (mysql_num_rows($r) == 0) {
        // Clean up the page: decode entities, drop quoted posts, normalise line breaks.
        //$tmp=htmlspecialchars_decode($f_data);
        $f_data = html_entity_decode($f_data);
        $f_data = str_replace('-|code|--|code|-', '-|code|- -|code|-', $f_data);
        $f_data = preg_replace('#<blockquote[^>]+>(.*)</blockquote>#U', '', $f_data);
        $tmp = str_replace(array('<br>', '<br />'), "\r", $f_data);
        preg_match_all('#<div class="codecontent">(.*)</div>#iUm', $tmp, $code);
        $codeid = 0;
        unset($tmp);
        $f_data = str_replace(array('<br/>', '<br />'), '<br>', $f_data);
        $f_data = str_replace('<br><br>', '<br>', $f_data);
        $f_data = str_replace('<br><br>', '<br>', $f_data); // in case of 4 <br> in a row
        $f_data = preg_replace('#<pre class="code">(.*)</pre>#iU', '-|code|-', $f_data);
        preg_match_all('#<div id="post_message_[0-9]+">(.*)</div> </div> </div>#isU', $f_data, $a_data);

        // Initialise everything that is read before it is first written.
        $val = 0;
        $val2 = 0;
        $totalval = 0;
        $abortval = 0;
        $totalcodeval = 0;
        $did_wordcheck = false;
        $nounverb_result = array();
        $var = array('', 0);

        foreach ($a_data[1] as $adata) {
            // Split each post into lower-case sentences.
            $adata = strtolower(str_replace('...', '.', strip_tags($adata, '<br>')));
            $d_data = preg_split('/([\?.!]|<br>)/', $adata);
            if ($abortval < 4) {
                foreach ($d_data as $sentence) {
                    if (strlen($sentence) > 16 && $abortval < 4) {
                        $re = mysql_query('SELECT * FROM `faq_data` WHERE `string`="' . mysql_real_escape_string($sentence)
                            . '" AND `id`="' . mysql_real_escape_string($topicid) . '"');
                        $did_wordcheck = false;
                        if (mysql_num_rows($re) == 0) {
                            // New sentence: count how many of its words are in the noun list.
                            $did_wordcheck = true;
                            $words = explode(' ', $sentence);
                            foreach ($words as $word) {
                                if (isset($in_wordlist[$word])) {
                                    $val += 1;
                                    if (!isset($nounverb_result[$word])) {
                                        $nounverb_result[$word] = 1;
                                    } else {
                                        $nounverb_result[$word] += 1;
                                    }
                                }
                            }
                            $var[0] = $sentence;
                            $var[1] = $val;
                            $totalval += ($val - $val2);
                            mysql_query('INSERT INTO `faq_data` SET `match`="' . $var[1]
                                . '", `string`="' . mysql_real_escape_string($var[0])
                                . '", `id`="' . mysql_real_escape_string($topicid) . '"');
                        } else {
                            // Sentence already stored; give up on the page after 4 of these.
                            $abortval += 1;
                        }
                        $val = 0;
                        $val2 = 0;
                    } else if ($sentence == '-|code|--|code|-') {
                        // Code box placeholder: store the code itself with match = -1.
                        $re = mysql_query('SELECT * FROM `faq_data` WHERE `match`="-1" AND `string`="' . mysql_real_escape_string($code[1][$codeid])
                            . '" AND `id`="' . mysql_real_escape_string($topicid) . '"');
                        if ($abortval == 0 || mysql_num_rows($re) == 0) {
                            $var[0] = $sentence;
                            $var[1] += 3;
                            mysql_query('UPDATE `faq_data` SET `match`="' . $var[1]
                                . '" WHERE `string`="' . mysql_real_escape_string($var[0])
                                . '" AND `id`="' . mysql_real_escape_string($topicid) . '"');
                            mysql_query('INSERT INTO `faq_data` SET `match`="-1", `string`="' . mysql_real_escape_string($code[1][$codeid])
                                . '", `id`="' . mysql_real_escape_string($topicid) . '"');
                            $codeid += 1;
                            $val += 2;
                        }
                        $val2 = $val;
                        $totalcodeval += 1;
                    }
                }
            } else {
                break;
            }
        }

        if ($did_wordcheck == true) {
            // Store the ten most frequent list words for this topic.
            arsort($nounverb_result);
            $var = 'INSERT INTO `faq_datareference` SET `id`="' . mysql_real_escape_string($topicid) . '"';
            $i = 1;
            foreach ($nounverb_result as $key => $val) {
                if ($i > 10) {
                    break;
                } else {
                    $var .= ', `word' . $i . '`="' . mysql_real_escape_string($key) . '", `word' . $i . 'num`=' . $val;
                }
                $i++;
            }
            for (; $i < 11; $i++) {
                $var .= ', `word' . $i . '`=" ", `word' . $i . 'num`=-1';
            }
            $var .= ', `num_nounsverbs`=' . $totalval . ', `num_codeboxes`=' . $totalcodeval;
            mysql_query($var);
        }
    }
    echo '<b>Generated Url: </b>' . $url . '<br>';
    unset($f_data);
}

// Read the feed and index every topic that is not in the database yet.
$rss = file_get_contents('http://www.daniweb.com/forums/rss143.xml');
preg_match_all('#<link><!\[CDATA\[(.*[&?]t=([^\&]++).*)\]\]></link>#U', $rss, $topic_url);
for ($id = 0; isset($topic_url[2][$id]); $id++) {
    $rr = mysql_query('SELECT `id` FROM `faq_datareference` WHERE `id`="' . mysql_real_escape_string($topic_url[2][$id]) . '"');
    if (url_exists($topic_url[1][$id]) && mysql_num_rows($rr) == 0) {
        $f_data = file_get_contents($topic_url[1][$id]);
        // [0-9]+ so that threads with 10 or more pages are also matched
        preg_match('#<td class="alt1" nowrap="nowrap"><span>Page 1 of ([0-9]+)</span></td>#i', $f_data, $pages_num);
        generate($topic_url[1][$id], $topic_url[2][$id]);
        flush();
        if (!empty($pages_num[1]) && $pages_num[1] > 1) {
            for ($i = 2; $i <= $pages_num[1]; $i++) {
                // Download the page being indexed rather than page 1 again.
                $page_url = 'http://www.daniweb.com/forums/showthread.php?t=' . $topic_url[2][$id] . '&page=' . $i;
                if (url_exists($page_url)) {
                    $f_data = file_get_contents($page_url);
                    generate($page_url, $topic_url[2][$id]);
                    flush();
                }
            }
        }
    }
}
?>
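One possible improvement, offered only as a sketch: the sentence splitting and noun counting in the script above could be pulled out into a function that does not touch the database, which makes the scoring easy to try on sample text. The name score_sentences() and the sample word list are made up here. It keeps the script's "longer than 16 characters" rule, but it trims each sentence and splits on any whitespace, which is a small change from the original.

```php
<?php
// Hypothetical helper: splits text into sentences and returns an array of
// sentence => number of its words found in $wordlist (keyed by word).
function score_sentences($text, array $wordlist, $minLength = 16) {
    $scores = array();
    $sentences = preg_split('/[?.!]+|<br>/', strtolower($text));
    foreach ($sentences as $sentence) {
        $sentence = trim($sentence);
        if (strlen($sentence) <= $minLength) {
            continue; // same length cut-off as the script
        }
        $hits = 0;
        foreach (preg_split('/\s+/', $sentence) as $word) {
            if (isset($wordlist[$word])) {
                $hits++;
            }
        }
        $scores[$sentence] = $hits;
    }
    return $scores;
}

// Made-up word list and post, standing in for noun_list and a real thread.
$wordlist = array('function' => true, 'memory' => true, 'loop' => true);
$post = 'The function runs out of memory inside the loop. Short one. Does unset help with memory?';
$scores = score_sentences($post, $wordlist);
print_r($scores);
```

With the scoring separated like this, the database inserts could be done in one pass over the returned array instead of one query per sentence while parsing.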
Last edited by cwarn23; Jul 6th, 2009 at 10:29 am.
Try not to bump 10 year old threads as it can be really annoying.
Like php then read my website at http://syntax.cwarn23.net/
Star-Trek-Atlantis - now that's what I call a movie ^_^
My favourite PC. - MacGyver Fan
Bad english note: dis-iz-2b4u
Good idea, but in practice I fear that it will be very difficult to make and have it work properly.
Why not just create a bot that uses premade solutions that can be trusted? It can analyze the words in the question (kind of like what you have) and post the appropriate solution.
The easier solution is to have more read-me topics on common errors (which make up the bulk of threads) and add some logic to make sure posters have at least looked at them before being allowed to post their topic.
Another thing to think about is having a search function when they are posting their topic. They type out their problem and hit a search button (instead of 'submit new thread') and it brings up relevant threads. If they don't find a solution there, then they can post the thread to the forums. This would suppress a lot of the stupid questions that are posted on a daily basis.
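A very rough sketch of that search step, assuming the thread titles are already available as an array (in practice they would come from the forum database): rank the threads by how many words they share with the question. The names tokenize() and rank_threads() and the sample titles are made up for illustration.

```php
<?php
// Hypothetical helper: lower-cases text and returns its distinct words.
function tokenize($text) {
    $words = preg_split('/[^a-z0-9_]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    return array_unique($words);
}

// Hypothetical helper: returns thread id => shared word count, best match
// first, for at most $limit threads that share at least one word.
function rank_threads($question, array $threads, $limit = 3) {
    $questionWords = tokenize($question);
    $scores = array();
    foreach ($threads as $id => $title) {
        $shared = count(array_intersect($questionWords, tokenize($title)));
        if ($shared > 0) {
            $scores[$id] = $shared;
        }
    }
    arsort($scores);
    return array_slice($scores, 0, $limit, true); // true keeps the thread ids as keys
}

// Made-up thread titles keyed by thread id.
$threads = array(
    101 => 'Headers already sent error in php',
    102 => 'mysql_fetch_array expects parameter 1 to be resource',
    103 => 'How to upload a file with php',
);
$matches = rank_threads('Warning: headers already sent error', $threads);
print_r($matches);
```

Counting shared words is about the crudest ranking there is; ignoring common words such as "the" and "php" would be an obvious next step.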
Google is your friend.
Use [code] tags.
If you have found a solution to your problem, please mark the thread as SOLVED.
"Brings up relevant threads" is not exactly what I'm doing. My database contains many rows of sentences, and on a user's search an AI will string some of those sentences together to form paragraphs. At the moment I'm using a system of nouns and verbs. So could you rephrase your suggestion in simpler terms? It's hard to see what you're suggesting other than that I use a premade project/solution.
Try not to bump 10 year old threads as it can be really annoying.
Like php then read my website at http://syntax.cwarn23.net/
Star-Trek-Atlantis - now that's what I call a movie ^_^
My favourite PC. - MacGyver Fan
Bad english note: dis-iz-2b4u
Instead of trying to create an understandable paragraph without human intervention (will not be easy at all), just build a database of solutions that humans have created and post the appropriate one. That way the information can be trusted to be accurate.
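To make the idea concrete, here is a minimal sketch. The $solutions array, its 'triggers' and 'answer' keys, and find_solution() are all made up for illustration; a real bot would load human-written answers from a database table. A solution is only returned when every one of its trigger words appears in the question, otherwise the bot stays silent.

```php
<?php
// Hypothetical table of human-written solutions. Each entry lists the words
// that must all appear in the question before the answer is posted.
$solutions = array(
    array(
        'triggers' => array('headers', 'already', 'sent'),
        'answer'   => 'Remove any output (including whitespace) before header() or session_start().',
    ),
    array(
        'triggers' => array('mysql_fetch_array', 'resource'),
        'answer'   => 'Your query failed; check the return value of mysql_query() and print mysql_error().',
    ),
);

// Hypothetical helper: returns the first answer whose trigger words are all
// present in the question, or null when nothing matches.
function find_solution($question, array $solutions) {
    $words = preg_split('/[^a-z0-9_]+/', strtolower($question), -1, PREG_SPLIT_NO_EMPTY);
    foreach ($solutions as $solution) {
        $missing = array_diff($solution['triggers'], $words);
        if (count($missing) === 0) {
            return $solution['answer'];
        }
    }
    return null;
}

$answer = find_solution('Cannot modify header information - headers already sent by', $solutions);
echo $answer === null ? "No trusted solution found\n" : $answer . "\n";
```

Returning null rather than a best guess is deliberate here: the point of the suggestion is that the bot only posts answers a human has already checked.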
The other suggestions I had were for changing how Daniweb works, if you didn't catch that. They are not really related to your question at all. I just know that moderators etc. will be looking at it, so I thought I would throw it out there.
Last edited by kkeith29; Jul 6th, 2009 at 9:22 pm.
Google is your friend.
Use [code] tags.
If you have found a solution to your problem, please mark the thread as SOLVED.
I was talking about an actual change to the Daniweb code. Your script couldn't do that.
Really, I think that the FAQ bot would be too difficult to make work. If you can think of how to code like the human mind to interpret something as complex as language, then good for you. I am sure you can make it work well enough, but there will always be a downside. In reality, I don't think it will build sentences that make sense. How will it know that the info it's taking in is correct? Most people who post don't have a good understanding of English and write sentences that are not grammatically correct. How will you combat that without a human reading it all?
About the script itself: do you have multiple servers to use for load balancing? That script takes a huge amount of resources. I ran it on my server along with a few others and they were having problems. The bad thing is that it is only half of the code required to pull this off. You are just getting the info and haven't devised a way to build the sentences into a paragraph and post it to a thread. The code is going to have to build a database and keep that database current (unless you can get access to the Daniweb databases, which I don't see ever happening), which is going to be hard.
Last edited by kkeith29; Jul 7th, 2009 at 12:26 am.
Google is your friend.
Use [code] tags.
If you have found a solution to your problem, please mark the thread as SOLVED.
It seems the main answer to all those problems is one of the things I'm good at: advanced algorithms. I find complex algorithms fun to make, and in cases like this they can give a great result, like the algorithm that founded Google. It's not as if there's anything hard about it. If worst comes to worst, I could just make the bot reword entire topics into single MySQL entries (e.g. 1 paragraph per MySQL entry) and there would be no need for mass processing when searching the database. The problem is all about making the right algorithm, and I'm sure I could do that. I'm just asking if anybody has any thoughts on how it should work.
Note that the CPU load is mainly caused by the file_get_contents() function downloading files around 30 KB in size, but using the sleep() function occasionally will help the CPU to some degree. Just in case you're wondering, this bot will be hosted on a virtual private server with 512 MB of RAM.
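As a sketch of the pausing idea (closures need PHP 5.3 or later): the function below fetches a list of URLs with a pause after each download except the last. fetch_throttled() is a made-up name, and the fetcher is passed in as a callable so the loop can be tried without touching the network; a real run would pass 'file_get_contents' and a longer pause.

```php
<?php
// Hypothetical helper: calls $fetch for each URL and sleeps between requests.
// Returns an array of url => whatever $fetch returned for it.
function fetch_throttled(array $urls, $fetch, $pauseMicroseconds = 500000) {
    $pages = array();
    $urls = array_values($urls);
    $last = count($urls) - 1;
    foreach ($urls as $i => $url) {
        $pages[$url] = call_user_func($fetch, $url);
        if ($i < $last) {
            usleep($pauseMicroseconds); // no pause needed after the final URL
        }
    }
    return $pages;
}

// Stand-in fetcher for trying the loop offline.
$fake = function ($url) {
    return 'body of ' . $url;
};
$pages = fetch_throttled(array('http://example.com/a', 'http://example.com/b'), $fake, 1000);
echo count($pages), " pages fetched\n";
```

Pausing does not reduce the total amount of work; it only spreads it out so other processes on the server get a turn.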
Last edited by cwarn23; Jul 7th, 2009 at 1:42 am.
Try not to bump 10 year old threads as it can be really annoying.
Like php then read my website at http://syntax.cwarn23.net/
Star-Trek-Atlantis - now that's what I call a movie ^_^
My favourite PC. - MacGyver Fan
Bad english note: dis-iz-2b4u