I was thinking of a new project: a bot that scans DaniWeb for information and enters the useful parts into a database. People could then ask the website questions, and PHP + MySQL would provide answers in paragraph form for the user to read. Is this OK on the legal side of things? I have written bots for my personal use before and have made a tutorial. Also, if this is OK with the mods, any help with the algorithm would be appreciated.

kvprajapati commented: So nice that work. Simple yet elegant. +4

cwarn23,

Congratulations on the artwork: sharp, smooth, smart, and intelligent code work.

Nobody besides a mod or admin can follow that link.

Is the topic in the moderators' lounge looking hopeful, or haven't there been many replies yet?

Not many replies yet; I don't think Dani has seen it yet. Happygeek (administrator) has, and it's currently being discussed.

Still little discussion on that yet.

Aren't the RSS feeds available for every forum a better alternative than a spider? It would involve less parsing...

I just checked the RSS feeds on forum 17 and they seem to be full of questions instead of answers. I was thinking more along the lines of the user typing a question like I did in post #1, or perhaps even sending a link to their post, and PHP stringing together answers from the database. The database would be populated from bot scans. For now I'll find another site to scan as a demo while a decision is being made.

Some tips for your code, cwarn:

Don't use preg_* functions unless absolutely necessary. If you're comparing a static string, use strpos or stripos.

Define reused regular expressions in one place (you use /(.*)[.]([^.\?]+)(\?(.*))?/ three times in the same script). Define it once in a variable and use it that way; one point of failure is always better.

If you're going to go for speed at the cost of memory usage, get rid of in_array. Build a one-level index and use isset() so you're performing O(1) operations instead of O(n). in_array scans the whole array; array_key_exists is a hash lookup too, but isset() has historically been the cheaper call. Example:

$some_big_array = array(1 => 'somestring', /* ... */ 10000 => 'anotherstring');

// Index keyed by the strings themselves
$strings_index = array('somestring' => 1, /* ... */ 'anotherstring' => 1);

$search = 'somestring';
if (isset($strings_index[$search])) { /* found */ } // O(1) operation

if (in_array($search, $some_big_array)) { /* found */ } // O(n) operation

Stop using global.

Don't use unset() right before an assignment to the same variable, i.e.,

$datac = "somestring"; unset($datac); $datac = "anotherstring";

The unset() is wasted time because reassignment implicitly frees the previous value.
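The first two tips can be sketched roughly as follows. The names URL_PARTS_PATTERN, contains_text() and file_extension() are made up for illustration and are not from cwarn's script; only the pattern itself is the one quoted above.

```php
<?php
// Hypothetical names; the pattern is the one quoted in the tip above,
// now defined in exactly one place.
const URL_PARTS_PATTERN = '/(.*)[.]([^.\?]+)(\?(.*))?/';

// Static substring check: stripos() avoids the regex engine entirely.
function contains_text($haystack, $needle) {
    // stripos() returns 0 for a match at the start, so compare against false.
    return stripos($haystack, $needle) !== false;
}

// Every caller that needs the pattern goes through the same definition.
function file_extension($url) {
    if (preg_match(URL_PARTS_PATTERN, $url, $m)) {
        return $m[2]; // text between the last dot and any query string
    }
    return null;
}
```

If the pattern ever needs changing, there is only one line to edit.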

That's good advice, Shawn, and I will take note of it.
The only thing I would disagree with is the unset(). I have noticed in the past that when reassigning a string to a variable in a loop a few million times, the server runs out of memory, whereas if unset() is used the memory problem does not occur.

If you're using the most recent version then that shouldn't happen. If you are using the most recent version and it still does, you might want to submit a bug report; those are called memory leaks :)
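One way to check this on a given PHP build is to measure memory around a reassignment loop. This is only a rough sketch: memory_growth_on_reassign() is a made-up helper, and the 1 KB string size and iteration count are arbitrary.

```php
<?php
// Reassign a fresh string (about 1 KB) many times and report how much
// memory usage grew across the loop.
function memory_growth_on_reassign($iterations) {
    $before = memory_get_usage();
    for ($i = 0; $i < $iterations; $i++) {
        // No unset() here: assigning releases the previous string.
        $data = str_repeat('x', 1024) . $i;
    }
    unset($data); // drop the last value so it is not counted as growth
    return memory_get_usage() - $before;
}

echo memory_growth_on_reassign(1000000), " bytes\n";
```

If the printed figure grows with the iteration count, that would point to the kind of leak worth reporting.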

I thought I would let you know that I have made a demo bot, but for another website, as attached to this post. Although it may need a few efficiency changes, it does the job. The attachment should also give you a better understanding of the sort of code structure I'm talking about, as I have just written one possible algorithm.

--------
Edit:
I now see what you mean about the RSS feeds. So are you saying to perhaps have a bot that only checks the topics linked to in the RSS feeds?

Yes, the RSS feeds could be used as the index of posts, so you don't have to follow links (guessing).

You know that the index of links is contained in the RSS feed.
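Reading the feed as an index could look something like this. It is a sketch under assumptions: thread_links_from_rss() is a made-up name, the feed is taken to be ordinary RSS 2.0 with channel/item/link elements, and each thread link is assumed to carry its id in a "t" query parameter.

```php
<?php
// Return thread id => URL for every <item> in an RSS document.
function thread_links_from_rss($rss_xml) {
    $links = array();
    // LIBXML_NOCDATA folds CDATA sections into plain text nodes.
    $feed = simplexml_load_string($rss_xml, 'SimpleXMLElement', LIBXML_NOCDATA);
    if ($feed === false) {
        return $links; // not parseable as XML
    }
    foreach ($feed->channel->item as $item) {
        $url = trim((string) $item->link);
        $query = parse_url($url, PHP_URL_QUERY);
        if (!is_string($query)) {
            continue; // link has no query string
        }
        parse_str($query, $params);
        if (isset($params['t']) && is_string($params['t']) && ctype_digit($params['t'])) {
            $links[$params['t']] = $url;
        }
    }
    return $links;
}
```

Using an XML parser here means the bot does not depend on the exact text layout of the feed.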

I have just completed a bot that indexes the pages linked from the RSS feeds. Below is the script I created for the job; I will alter it to scan 15 RSS feeds. Does anybody know how often the RSS feeds update? The minimum gap between bot runs I could set would be about 1 hour, due to monthly bandwidth limits. So please advise on what interval the bot should scan at. You may also be able to see in the code below the algorithm I have created. Are there any suggestions on how to improve it for later sentence retrieval?

Code is:

<?php
set_time_limit (1800);
mysql_connect('localhost','root','');
mysql_select_db('botsearch');

function domain ($ddomain) {
return preg_replace('/^((http(s)?:\/\/)?([^\/]+))(.*)/','$1',$ddomain);
}

function url_exists($durl)
		{
		// Version 4.x supported
		$handle   = curl_init($durl);
		if (false === $handle)
			{
			return false;
			}
		curl_setopt($handle, CURLOPT_HEADER, true);
		curl_setopt($handle, CURLOPT_FAILONERROR, true);  // this works
		curl_setopt($handle, CURLOPT_HTTPHEADER, 
Array("User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.15) Gecko/20080623 Firefox/2.0.0.15") );
		curl_setopt($handle, CURLOPT_NOBODY, true);
		curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
		$connectable = curl_exec($handle);
        //echo '"'.$connectable.'"';
		curl_close($handle);  
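        // Only the status line (first 30 bytes) matters here
        $con=substr($connectable,0,30);
        // stripos() can return 0, so compare against false explicitly
        if (stripos($con,'200 OK')!==false || stripos($con,'Moved')!==false) {
            return true;
            } else {
            return false;
            }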
		}


$in_wordlist=array(); // word => true lookup table, filled from noun_list
$r=mysql_query('SELECT * FROM `noun_list`');
while ($row=mysql_fetch_assoc($r)) {
    $in_wordlist[$row['word']]=true;
    }

function generate($url,$topicid) {
		global $f_data; //Data of file contents
        global $in_wordlist;

        $r=mysql_query('SELECT * FROM `faq_datareference` WHERE `id`="'.mysql_real_escape_string($topicid).'"');
        if (mysql_num_rows($r)==0) {
            //do something with webpage $f_data.
            //$tmp=htmlspecialchars_decode($f_data);
            $f_data=html_entity_decode($f_data);
            $f_data=str_replace('-|code|--|code|-','-|code|- -|code|-',$f_data);
            $f_data=preg_replace('#<blockquote[^>]+>(.*)</blockquote>#U','',$f_data);
            $tmp=str_replace(array('<br>','<br />'),"\r",$f_data);
            preg_match_all('#<div class="codecontent">(.*)</div>#iUm',$tmp,$code);
            $codeid=0;
            unset($tmp);
            $f_data=str_replace(array('<br/>','<br />'),'<br>',$f_data);
            $f_data=str_replace('<br><br>','<br>',$f_data);
            $f_data=str_replace('<br><br>','<br>',$f_data); //in case of 4 <br> in a row
            $f_data=preg_replace('#<pre class="code">(.*)</pre>#iU','-|code|-',$f_data);
            preg_match_all('#<div id="post_message_[0-9]+">(.*)</div> </div> </div>#isU',$f_data,$a_data);
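            $val=0;
            $val2=0; // read at the first $totalval update, so initialise it
            $totalval=0;
            $abortval=0;
            $totalcodeval=0;
            $nounverb_result=array(); // word => number of occurrences
            $did_wordcheck=false; // stays false if no sentence is long enough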
            foreach ($a_data[1] AS $adata) {
                $adata=strtolower(str_replace('...','.',strip_tags($adata,'<br>')));
                $d_data=preg_split('/([\?.!]|<br>)/',$adata);
                if ($abortval<4) {
                    foreach ($d_data AS $sentence) {
                        if (strlen($sentence)>16 && $abortval<4) {
                            $re=mysql_query('SELECT * FROM `faq_data` WHERE `string`="'.mysql_real_escape_string($sentence).'" AND `id`="'.mysql_real_escape_string($topicid).'"');
                            $did_wordcheck=false;
                            if (mysql_num_rows($re)==0) {
                            $did_wordcheck=true;
                            $words=explode(' ',$sentence);
                            foreach ($words AS $word) {
                                if (isset($in_wordlist[$word])) {
                                    $val+=1;
                                    if (!isset($nounverb_result[$word])) {
                                        $nounverb_result[$word]=1;
                                        } else {
                                        $nounverb_result[$word]+=1;
                                        }
                                    }
                                }
                            $var[0]=$sentence;
                            $var[1]=$val;
                            $totalval+=($val-$val2);
                                mysql_query('INSERT INTO `faq_data` SET `match`="'.$var[1].'", `string`="'.mysql_real_escape_string($var[0]).'", `id`="'.mysql_real_escape_string($topicid).'"');
                                } else {
                                $abortval+=1;
                                }


                            $val=0;
                            $val2=0;
                            } else if ($sentence=='-|code|--|code|-') {
                            $re=mysql_query('SELECT * FROM `faq_data` WHERE `match`="-1" AND `string`="'.mysql_real_escape_string($code[1][$codeid]).'" AND `id`="'.mysql_real_escape_string($topicid).'"');
                            if ($abortval==0 || mysql_num_rows($re)==0) {
                                $var[0]=$sentence;
                                $var[1]+=3;
                                mysql_query('UPDATE `faq_data` SET `match`="'.$var[1].'" WHERE `string`="'.mysql_real_escape_string($var[0]).'" AND `id`="'.mysql_real_escape_string($topicid).'"');
                                mysql_query('INSERT INTO `faq_data` SET `match`="-1", `string`="'.mysql_real_escape_string($code[1][$codeid]).'", `id`="'.mysql_real_escape_string($topicid).'"');
                                $codeid+=1;
                                $val+=2;
                                }
                            $val2=$val;
                            $totalcodeval+=1;
                            }
                        }
                    } else {
                    break;
                    }
                }
            
            if ($did_wordcheck==true) {
                arsort($nounverb_result);
                $var='INSERT INTO `faq_datareference` SET `id`="'.mysql_real_escape_string($topicid).'"';
                $i=1;
                foreach ($nounverb_result AS $key=>$val) {
                    if ($i>10) { break; } else {
                        $var.=', `word'.$i.'`="'.mysql_real_escape_string($key).'", `word'.$i.'num`='.$val;
                        }
                    $i++;
                    }
                for (;$i<11;$i++) {
                    $var.=', `word'.$i.'`=" ", `word'.$i.'num`=-1';
                    }
                $var.=', `num_nounsverbs`='.$totalval.', `num_codeboxes`='.$totalcodeval;
                mysql_query($var);
                }
            }
		echo '<b>Generated Url: </b>'.$url.'<br>';
	unset($f_data);
    }

$rss=file_get_contents('http://www.daniweb.com/forums/rss143.xml');
preg_match_all('#<link><!\[CDATA\[(.*[&?]t=([^\&]++).*)\]\]></link>#U',$rss,$topic_url);
for ($id=0;isset($topic_url[2][$id]);$id++) {
$rr=mysql_query('SELECT `id` FROM `faq_datareference` WHERE `id`="'.mysql_real_escape_string($topic_url[2][$id]).'"');
if (url_exists($topic_url[1][$id]) && mysql_num_rows($rr)==0) {
    $f_data=file_get_contents($topic_url[1][$id]);
    preg_match('#<td class="alt1" nowrap="nowrap"><span>Page 1 of ([0-9]+)</span></td>#i',$f_data,$pages_num);
    generate($topic_url[1][$id],$topic_url[2][$id]);
    flush();
    if (!empty($pages_num[1]) && $pages_num[1]>1) {
        for ($i=2;$i<=$pages_num[1];$i++) {
            // Fetch page $i itself rather than re-reading page 1
            $page_url='http://www.daniweb.com/forums/showthread.php?t='.$topic_url[2][$id].'&page='.$i;
            if (url_exists($page_url)) {
                $f_data=file_get_contents($page_url);
                generate($page_url,$topic_url[2][$id]);
                flush();
                }
            }
        }
    }
    }
?>

Thanks.
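For anyone who wants to try the scoring idea without the database, here is a stripped-down sketch of it. It is not the script above: score_sentences() is a made-up name, the word list is passed in as an array instead of being read from noun_list, and code boxes are ignored.

```php
<?php
// Score each sentence by how many of its words appear in a word list,
// and tally how often each listed word occurs overall.
// $wordlist maps word => true, like $in_wordlist in the script above.
function score_sentences($text, array $wordlist, $min_length = 17) {
    $sentences = array();
    $word_counts = array();
    $parts = preg_split('/[?.!]/', strtolower($text));
    foreach ($parts as $part) {
        $sentence = trim($part);
        if (strlen($sentence) < $min_length) {
            continue; // too short to be useful
        }
        $score = 0;
        foreach (preg_split('/\s+/', $sentence) as $word) {
            if (isset($wordlist[$word])) {
                $score++;
                $word_counts[$word] = isset($word_counts[$word]) ? $word_counts[$word] + 1 : 1;
            }
        }
        $sentences[] = array('string' => $sentence, 'match' => $score);
    }
    arsort($word_counts); // most frequent words first
    // Keep the ten most frequent words, as the script does with word1..word10.
    return array($sentences, array_slice($word_counts, 0, 10, true));
}
```

The first returned array corresponds to what goes into faq_data, the second to the top words stored in faq_datareference.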

Good idea, but in practice I fear that it will be very difficult to make and have it work properly.

Why not just create a bot that uses premade solutions that can be trusted? It can analyze the words in the question (kind of like what you have) and post the appropriate solution.

The easier solution is to have more read-me topics on common errors (which make up the bulk of threads) and add some logic to make sure posters have at least looked at them before being allowed to post their topic.

Another thing to think about is having a search function when they are posting their topic. They type out their problem and hit a search button (instead of 'submit new thread') and it brings up relevant threads. If they don't find a solution there, then they can post the thread to the forums. This would suppress a lot of the stupid questions that are posted on a daily basis.

I didn't quite catch what you're suggesting. The part about bringing up relevant threads is not exactly what I'm doing. My database contains many rows of sentences, and on the user's search an AI will string together some of those sentences to form paragraphs. At the moment I'm using a system of nouns and verbs. Could you rephrase your post above in simpler terms? It's hard to see what you're suggesting other than that I use a premade project/solution.

Instead of trying to create an understandable paragraph without human intervention (will not be easy at all), just build a database of solutions that humans have created and post the appropriate one. That way the information can be trusted to be accurate.
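As a rough illustration of the premade-solutions idea, the matching could be as simple as counting keyword overlap. Everything here is hypothetical: best_solution() is a made-up name, and the data shape (solution text mapped to a keyword list) is only one possible layout.

```php
<?php
// Pick the premade solution whose keywords overlap most with the question.
// $solutions maps a solution text to its list of lowercase keywords.
function best_solution($question, array $solutions) {
    $words = array_flip(preg_split('/\W+/', strtolower($question), -1, PREG_SPLIT_NO_EMPTY));
    $best = null;
    $best_hits = 0;
    foreach ($solutions as $text => $keywords) {
        $hits = 0;
        foreach ($keywords as $keyword) {
            if (isset($words[$keyword])) {
                $hits++;
            }
        }
        if ($hits > $best_hits) {
            $best_hits = $hits;
            $best = $text;
        }
    }
    return $best; // null when nothing matched
}
```

Because the answers are written by people, the bot only has to choose one, not compose it.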

The other suggestions I had were for changing how DaniWeb works, in case you didn't catch that. They're not really related to your question at all. I just know that moderators etc. will be looking at this thread, so I thought I would throw them out there.

If you mean the URL system, the URLs being accessed are all on the RSS feeds page or are the page 2, 3, 4, 5 etc. links. For some reason the RSS feeds don't link to the rewritten URL (mod_rewrite URL), which is why the links appear as they are. But if you mean something different, I would like to know, as I can't see how else this script might change how DaniWeb works.

I was talking about an actual change to the daniweb code. Your script couldn't do that.

Really, I think the FAQ bot would be too difficult to make work. If you can think of how to code like the human mind to interpret something as complex as language, then good for you. I am sure you can make it work well enough, but there will always be a downfall. In reality, I don't think it will build sentences that make sense. How will it know that the info it's taking in is correct? Most people who post don't have a good understanding of English and write sentences that are not grammatically correct. How will you combat that without a human reading it all?

About the script itself. Do you have multiple servers to use for load balancing? That script takes a huge amount of resources. I ran it on my server along with a few others and they were having problems. The bad thing is that that is only half of the code required to pull that off. You are just getting the info and haven't devised a way to build the sentences into a paragraph and post it onto a thread. The code is going to have to build a database and keep that database current (unless you can have access to the daniweb databases, which I don't see happening ever) which is going to be hard.

It seems the main answer to all those problems is one of the things I'm good at: advanced algorithms. I find complex algorithms fun to make, and in cases like this they can give a great result, like the algorithm that founded Google. It's not as if there's anything hard about it. If worst comes to worst, I could just make the bot reword entire topics into single MySQL entries (e.g. 1 paragraph per MySQL entry) and there would be no need for mass processing when searching the database. The problem is all about making the right algorithm, and I'm sure I could do that. I'm just asking if anybody has any thoughts on how it should work.

Note that the reason for the CPU load is mainly the file_get_contents() function downloading files around 30 KB in size, but using the sleep() function occasionally will help the CPU to some degree. Just in case you're wondering, this bot will be hosted on a virtual private server with 512 MB RAM.
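Pausing between downloads might be sketched like this. fetch_throttled() is a made-up helper and the two-second default is an arbitrary figure; $fetch is any callable that returns the body for a URL, which in the bot could be 'file_get_contents'.

```php
<?php
// Fetch a list of URLs with a pause between requests.
function fetch_throttled(array $urls, $fetch, $delay_seconds = 2.0) {
    $pages = array();
    $last = count($urls) - 1;
    foreach (array_values($urls) as $i => $url) {
        $pages[$url] = call_user_func($fetch, $url);
        if ($i < $last) {
            // Idle between downloads; no pause after the final one.
            usleep((int) round($delay_seconds * 1000000));
        }
    }
    return $pages;
}
```

Passing the fetcher in as a parameter also makes it possible to try the loop with a fake function before pointing it at a real site.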

I have managed to program an algorithm which places capitals at the start of sentences, keeps only the useful information and places code in a <pre> CSS code box. My last problem (I hope) is that the RSS feeds in most forums I've checked seem to be almost live and contain links to threads with only one or two posts. The only solution I can think of is for a small script to retrieve the links in the RSS feeds every hour or so and store them in a database to be followed a week later. Does anybody know a better way around this problem, or isn't there one?
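The store-now, follow-later idea could be sketched as below, using a plain array in place of the database table. queue_link() and due_links() are made-up names, and the seven-day delay is just the "week later" figure from the post.

```php
<?php
// Remember a link the first time it is seen; later sightings are ignored.
// $queue maps URL => Unix timestamp at which the link was first seen.
function queue_link(array &$queue, $url, $now) {
    if (!isset($queue[$url])) {
        $queue[$url] = $now;
    }
}

// Return the links that have waited at least $delay_seconds.
function due_links(array $queue, $now, $delay_seconds = 604800) { // 7 days
    $due = array();
    foreach ($queue as $url => $first_seen) {
        if ($now - $first_seen >= $delay_seconds) {
            $due[] = $url;
        }
    }
    return $due;
}
```

The hourly script would only call queue_link(), while the crawler would work through whatever due_links() returns.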

Just to let all the moderators know, the bot has now been launched.
