do random selection of subset from data table

Question

showman13 49 Junior Poster

11 Years Ago

Good afternoon All.

I'm running into difficulties trying to get a random selection from a subset in a query.

Basically what I have is a table with a record for every New Active referral, that contains the date(active_date) the member ID(mem_id) and the referrer ID(ref_id)...

What I need to do is select all ref_ids that have 6 or more referrals(mem_ids) within a specific date range, and then randomly select one of those selected ref_ids to receive a contest bonus...

Sounds pretty simple, but I've been messing with it for hours and no luck...

This query gives me a list of the ref_ids and the count of refs each has.

SELECT ref_id, count(active_id) as refs
  FROM `new_actives`
  WHERE active_date>'2013-03-31'
  AND active_date<'2013-04-07'
  GROUP BY ref_id
  ORDER BY refs desc

The table has an assortment of records where the ref_id can match 1 mem_id or could potentially have an unlimited number. Currently the most any one ref_id has is 8 matching referrals (mem_ids)

Ultimately what I would like to do is to be able to randomly select a winner from the referrers that have 6 or more referrals
THEN
Do the same for those with 4 or more, and then the same for 2 or more and finally 1 or more...

I'm sure I can manually do record counts and then do the random selection, but was thinking there would be a much easier quicker way to accomplish this.

The basic random selection routine that I use for other things is not really suitable for this purpose because it takes into consideration all individual records in the table to randomly select the winner from, and I only want to consider each referrer once in each drawing, so those that have 8 referrals won't have 8 chances to win, but only 1.
This is that code:

  SELECT FLOOR(RAND() * COUNT(*)) AS `offset`
  FROM `new_actives`
  WHERE active_date>'2013-03-31'
  AND active_date<'2013-04-07'

Any suggestions or direction would be greatly appreciated...

And if I'm going in the wrong direction, please feel free to set me straight (nicely);-))

php

4 Contributors
24 Replies
189 Views
4 Days Discussion Span
Latest Post 11 Years Ago Latest Post by showman13

All 24 Replies

rene.romerobenavides 0 Newbie Poster

11 Years Ago

Which database engine are you using? Does it support window functions and common table expressions? Let me tinker with those a little bit, I'm on a quest to learn advanced sql and your problem sounds interesting. I use postgresql, which supports both features.

rene.romerobenavides 0 Newbie Poster

11 Years Ago

In postgresql you would do something like this:

SELECT referral_id,count() FROM new_actives WHERE active_date >= '2013-04-01' AND active_date <= '2013-04-30' GROUP BY referral_id HAVING count() >= 6
ORDER BY RANDOM() LIMIT 1;

diafol

11 Years Ago

It's somewhat difficult to test queries on a table where we only have the fields and no real data. Would it be possible for you to print an SQL dump (or link to a *.sql file if it is large), so that we can test?

diafol

11 Years Ago

Sorry late in getting back - I've been to choir practice! I've been singing 'You Raise Me Up' for almost 2 hours. :)

The subquery:

SELECT ref_id, COUNT(active_id) AS refs
  FROM `new_actives`
  WHERE active_date>'2013-03-31'
  AND active_date<'2013-04-07'
  GROUP BY ref_id ORDER BY RAND()

Is just yours with a randomiser for the return set - so they are in random order. Try it on its own and you'll see what it does.

The AS tab1 is just an alias for the subquery - and you're right it's not used elsewhere - but it's still essential to the inner workings of MySQL. So it could be referenced elsewhere for example if you wanted to use it in a JOIN query. You can call the subquery anything you want - I just chose tab1 - short for table1.

So, in essence, what we have is:

SELECT ref_id, refs FROM tab1 GROUP BY refs

I hope the alias now makes sense.
Now the above just returns the first row of each distinct refs value. Usually, to get this sort of result you'd do a SELECT DISTINCT - but of course, that's impossible in this case because you want the ref_id as well, so a SELECT DISTINCT would not work as expected.

The refs filter (where) you want to add can be added to the outside query:

SELECT ref_id, refs FROM (SELECT ref_id, COUNT(active_id) AS refs
  FROM `new_actives`
  WHERE active_date>'2013-03-31'
  AND active_date<'2013-04-07'
  GROUP BY ref_id ORDER BY RAND()) AS tab1
  WHERE refs >= 4 GROUP BY refs

If you have a look at the mysql manual, you'll see that WHERE clauses have to come before the GROUP BY clause in that specific level.

Alternatively, you could do this:

SELECT ref_id, refs FROM (SELECT ref_id, COUNT(active_id) AS refs
  FROM `new_actives`
  WHERE active_date>'2013-03-31'
  AND active_date<'2013-04-07'
  GROUP BY ref_id 
  HAVING refs >= 3 
  ORDER BY RAND()) AS tab1
  GROUP BY refs

This makes use of the HAVING clause. You can't use a WHERE clause on an aliased aggregated field (COUNT), so you need to use HAVING instead.

This wasn't meant as a complete solution as I'm sure that you're aware. I thought I'd post a simple idea that you could play around with. There must be many different approaches to this, including the use of INNER JOINs and alternative randomizing methods.

I hope other contributors have a go at offering a solution too, as variety is the spice of life!

Edited 11 Years Ago by diafol

showman13 commented: Absolutely on target. thank you +2

diafol

11 Years Ago

You don't need both HAVING and WHERE. Either should do.
The problem here is that you have to change the SQL for every draw, so you end up running 5 queries, unless you use UNION ALL syntax.

I'd still favour the php code from my previous post as this allows easy manipulation of code if there are any changes and the SQL is easy - no need for an ORDER BY either:

SELECT ref_id, COUNT(ref_id) AS refs FROM new_actives GROUP BY ref_id

Just need to include the dates in a WHERE clause.

Convoluted SQL - including the use of subqueires - often make my skin crawl. But that's a personal thing. :)

showman13 commented: Great Feedback and Direction +0

Reply to this topic

Be a part of the DaniWeb community

We're a friendly, industry-focused community of developers, IT pros, digital marketers, and technology enthusiasts meeting, networking, learning, and sharing knowledge.

showman13 49 Junior Poster · Answer 1 · 2013-04-07T19:46:29+00:00

I use mySql.. I know nothing about postgresql...

But the real challenge is in getting a single randomly selected referrer from within each group based on how many active referrals they have within the contest period.

masterjiraya -3 Junior Poster · Answer 2 · 2013-04-08T04:23:46+00:00

in your where clause dont use AND logical operator only, use BETWEEN.

BETWEEN Expression AND Expression

masterjiraya -3 Junior Poster · Answer 3 · 2013-04-08T04:26:37+00:00

what is this mem_ids?
If this mem_ids you are talking is existing on the other table, use Left Join (). Left Join function is the function of subset in Discrete Mathematics.
here's is the basic usage of LEFT JOIN

select table_a.*, table_b.* from table_a
left join table_a.ID=table_b.ID
where table_a.active_date Between "01/01/2013" And "01/30/2013"

showman13 49 Junior Poster · Answer 4 · 2013-04-08T13:11:32+00:00

masterjiraya - thank you for your response but you aren't addressing the question that I presented.

Changing the AND logic to the BETWEEN logic has no bearing on the outcome

And, there is no JOIN required. all the information necessary to accomplish what I'm trying to do is in this single table

the mem_id in each record simply indicates which member they referred and really has no bearing on making a random selection from the subset that would be retrieved in the query

I simply need to make a random selection from the subset, which is a list of ref_ids that have referred a minimum of X members within the date range, to determine a single contest winner, is all I am trying to do.

showman13 49 Junior Poster · Answer 5 · 2013-04-10T16:20:06+00:00

Sorry diafol, I didn't receive a notification that someone responded to this thread or I would have replied already.

Here is a dump part of the table... only 51 records of it anyway.

Thanks for your response. I always appreciate your input, and look forward to your thoughts.

-- phpMyAdmin SQL Dump
-- version 3.5.5
-- http://www.phpmyadmin.net
--
-- Host: localhost
-- Generation Time: Apr 10, 2013 at 11:13 AM
-- Server version: 5.1.68-cll
-- PHP Version: 5.3.17

SET SQL_MODE="NO_AUTO_VALUE_ON_ZERO";
SET time_zone = "+00:00";

--
-- Table structure for table `new_actives`
--

CREATE TABLE IF NOT EXISTS `new_actives` (
  `active_id` int(7) NOT NULL AUTO_INCREMENT,
  `mem_id` int(8) NOT NULL COMMENT 'mem_id becoming active',
  `ref_id` int(8) NOT NULL COMMENT 'referrer ID',
  `active_date` datetime NOT NULL,
  PRIMARY KEY (`active_id`)
) ENGINE=MyISAM  DEFAULT CHARSET=latin1 COMMENT='one record for each NEW Activation' AUTO_INCREMENT=52 ;

--
-- Dumping data for table `new_actives`
--

INSERT INTO `new_actives` (`active_id`, `mem_id`, `ref_id`, `active_date`) VALUES
(1, 10001513, 10000037, '2013-03-30 20:32:21'),
(2, 10000275, 10000212, '2013-03-31 10:30:28'),
(3, 10001481, 10000446, '2013-03-31 10:32:04'),
(4, 10000542, 10000202, '2013-03-31 10:36:08'),
(5, 10001249, 10000766, '2013-03-31 10:37:44'),
(6, 10001034, 10001030, '2013-03-31 10:51:12'),
(7, 10002628, 10002564, '2013-03-31 11:12:00'),
(8, 10003006, 10002564, '2013-03-31 11:13:36'),
(9, 10002533, 10001040, '2013-04-03 17:45:34'),
(10, 10002534, 10001040, '2013-04-03 17:47:13'),
(11, 10003120, 10000992, '2013-04-03 19:46:29'),
(12, 10003151, 10000992, '2013-04-03 21:54:55'),
(13, 10003121, 10000992, '2013-04-03 22:02:06'),
(14, 10003092, 10003010, '2013-04-03 22:41:46'),
(15, 10003099, 10003010, '2013-04-03 22:44:49'),
(16, 10003100, 10003010, '2013-04-03 22:47:27'),
(17, 10003203, 10003010, '2013-04-03 22:49:32'),
(18, 10003205, 10003010, '2013-04-03 22:51:57'),
(19, 10003207, 10003010, '2013-04-03 22:54:25'),
(20, 10001006, 10000360, '2013-04-03 23:01:12'),
(21, 10001422, 10000989, '2013-04-04 02:54:12'),
(22, 10002056, 10001882, '2013-04-04 04:07:07'),
(23, 10002535, 10001040, '2013-04-04 13:07:02'),
(24, 10002598, 10001893, '2013-04-04 17:59:08'),
(25, 10003187, 10002583, '2013-04-04 22:17:42'),
(26, 10003219, 10003018, '2013-04-05 22:54:19'),
(27, 10003206, 10003010, '2013-04-05 23:14:36'),
(28, 10003143, 10003020, '2013-04-05 23:34:14'),
(29, 10002638, 10002564, '2013-04-05 23:42:42'),
(30, 10000359, 10000156, '2013-04-05 23:50:14'),
(31, 10002058, 10001882, '2013-04-06 00:52:13'),
(32, 10002063, 10001882, '2013-04-06 01:00:47'),
(33, 10002171, 10002170, '2013-04-06 14:30:59'),
(34, 10003039, 10003008, '2013-04-06 14:31:20'),
(35, 10001205, 10000613, '2013-04-06 14:31:42'),
(36, 10001497, 10000783, '2013-04-06 14:32:02'),
(37, 10002434, 10002424, '2013-04-06 14:34:14'),
(38, 10000264, 10000209, '2013-04-06 14:34:33'),
(39, 10001153, 10000135, '2013-04-06 14:34:47'),
(40, 10002556, 10001622, '2013-04-06 14:35:10'),
(41, 10002233, 10002182, '2013-04-06 14:35:26'),
(42, 10000406, 10002141, '2013-04-06 14:35:43'),
(43, 10000623, 10000539, '2013-04-06 14:36:08'),
(44, 10002640, 10002564, '2013-04-06 14:37:05'),
(45, 10002278, 10000001, '2013-04-06 14:37:39'),
(46, 10001489, 10001416, '2013-04-06 14:39:03'),
(47, 10002669, 10002564, '2013-04-06 14:39:23'),
(48, 10002231, 10002181, '2013-04-06 14:39:41'),
(49, 10001973, 10001631, '2013-04-06 14:40:00'),
(50, 10002342, 10001537, '2013-04-06 14:40:28'),
(51, 10001711, 10001677, '2013-04-06 14:40:49');

diafol · Answer 6 · 2013-04-10T16:43:42+00:00

OK, cheers Doug. I'm out for about an hour, will be back later if somebody else doesn't answer in the meantime.

diafol · Answer 7 · 2013-04-10T18:17:47+00:00

Play with this:

SELECT ref_id, refs FROM (SELECT ref_id, COUNT(active_id) AS refs
  FROM `new_actives`
  WHERE active_date>'2013-03-31'
  AND active_date<'2013-04-07'
  GROUP BY ref_id ORDER BY RAND()) AS tab1
  GROUP BY refs

essentially what it does is

inner clause randomizes order
outer clause picks first occurence of each refs

showman13 49 Junior Poster · Answer 8 · 2013-04-10T19:56:04+00:00

Thank you diafol,

That is an interesting SELECT statement and naturally, I don't quite understand what it is doing...
more specifically I don't understand the ORDER BY RAND() - what does that actually accomplish?
and what is the tab1 used for? - There is no reference to it in the result set.

AND the other question would be if I only want to make a random selection of the ref_id, refs that have >=4 referrals(active_id), where would I put in that where clause...

I've tried it inside and outside the inner select, but can't seem to find the right place.

showman13 49 Junior Poster · Answer 9 · 2013-04-10T23:16:55+00:00

OK... thanks again. yes the alias makes total sense - thanks for the clarification.

I did try the where clause on the outside query and got the same result I get with your example.

What I get when I say >=2, is a random selection for 3, 5, and 7

BUT, what I really need is a single result when I say >=2 referrals, I need a single referrer randomly selected in that group.

Then I also need to be able to do the same thing with different limitations... i.e. >=6 / >=4 / >=2 / >=1

I figure if I can get it to give me a single random selection for any of them, then I can just run it in a loop to get the results for each of the 4 different contest parameters.

Hope that makes sense...

diafol · Answer 10 · 2013-04-10T23:35:14+00:00

I'm not quite sure.

Your sets: >=6 / >=4 / >=2 / >=1

Does this mean that somebody with 8 refs could appear in each draw? It would with the operators that you're using OR do you mean:

Where the first set can have 1, the second set 2 or 3, the third set can be 4 or 5...

diafol · Answer 11 · 2013-04-11T00:03:19+00:00

You could do this:

SELECT ref_id, refs, IF(refs >= 8, 8,IF(refs >= 6, 6, IF(refs >=4,4, IF(refs >=2,2,1)))) AS grp
  FROM (SELECT ref_id, COUNT(active_id) AS refs
  FROM `new_actives`
  WHERE active_date>'2013-03-31'
  AND active_date<'2013-04-07'
  GROUP BY ref_id      
  ORDER BY RAND()) AS tab1
  GROUP BY grp
  ORDER BY RAND()

That will give you a third column called grp (short for group), comprising of 4 records( your data):

1,2,4,6 (no 8 as no ref_id repeated 8 times or more).

$draws = array('t1'=>false,'t2'=>false,'t4'=>false,'t6'=>false,'t8'=>false);
while($row = mysql_fetch_assoc($result)){
    $draws['t' . $row['grp']] = $row['ref_id'];
}

Now you have the ref_id and the key in the $draws array. PHP not tested and there may be a more elegant way of doing it.

showman13 49 Junior Poster · Answer 12 · 2013-04-11T01:29:24+00:00

it means if you have 1 or more you are in the category 1 drawing
if you have 2 or more you are in the category 1 and category 2 drawing
4 or more you are in category 1, 2, and 3 drawings
6 or more you are in all 4 category drawings

This is a weekly drawing contest that I will be running periodically on my members site.

it doesn't matter how many new referrals they have for that week, if they have at least one, they will get an entry into at least one of the drawings...

If they have 6 or if they have 60 it doesn't matter, they will still be in all 4 drawings with 1 entry in each.

showman13 49 Junior Poster · Answer 13 · 2013-04-11T01:32:04+00:00

If I can get it so I can just define the number at the time that I run it, then I could adjust the criteria at any time that I am running the contest...

That way I could use it for multiple purposes, by making it a function that I just send the referral number to, and let it do it's thing...

diafol · Answer 14 · 2013-04-11T06:24:32+00:00

OK, my mistake, you can't use this code for multiple draws wrt same ref_id. I'm off to work now and I'm going to be busy tonight, so I may not be able to come back until tomorrow eve. Anybody else?

diafol · Answer 15 · 2013-04-11T10:31:14+00:00

SELECT ref_id, COUNT(ref_id) AS refs FROM new_actives GROUP BY ref_id

$pool = array();
while($row = mysql_fetch_array($result)){
    if($row['refs'] >= 8) $pool[$t8][] = $row['refs'];
    if($row['refs'] >= 6) $pool[$t6][] = $row['refs'];
    if($row['refs'] >= 4) $pool[$t4][] = $row['refs'];
    if($row['refs'] >= 2) $pool[$t2][] = $row['refs'];
    if($row['refs'] >= 1) $pool[$t1][] = $row['refs'];
}

foreach($pool as $k => $v){
    shuffle($v);
    $draw[$k] = $v[0];
}

That should give you your draws. NOT TESTED.

showman13 49 Junior Poster · Answer 16 · 2013-04-11T14:06:58+00:00

I've been playing around with a lot of the samples you have sent, and think I may have come up with a variation on one that will give me what I'm looking for...

What is your opinion of this code?

    SELECT ref_id, refs FROM (
      SELECT ref_id, COUNT(active_id) AS refs
      FROM `new_actives`
      WHERE active_date>'2013-03-31'
      AND active_date<'2013-04-07'
      GROUP BY ref_id
      HAVING refs >= 1
      ORDER BY RAND()
    ) AS tab1
    WHERE refs>=1
LIMIT 1

When testing it against the data table, I get a random selection within the range that is defined by the refs >= whatever number I use, 1, 2, 4, or 6...

Please let me know if you see a flaw in the functionality that could cause any issues. I'm working a little outside the realm of my understanding at the moment, but it is getting clearer as we go along.

Thank you again for all your feedback on this.

showman13 49 Junior Poster · Answer 17 · 2013-04-11T20:52:06+00:00

OK diafol,
For now, this will work just fine. We are 1 week into a 4 week contest, so I can just go to the url that contains this little script on the website and it will give me the winners after the end of each week.

I will ultimately clean it up and make it more presentable so I can do the selections on a teleseminar where everyone can see it as it happens, but right now I just need to know who won so I can pay out the winnings.

I just put it together with a little form to enter the beginning day of each contest and the number of days, and it gives me back the ending day of the contest and the list of the winners in each category.

Based on the feedback I got from you I will be able to make this into something very useful, thank you very much for all your time and effort.

if ($Submit != 'Submit'){
print"<br>Enter the Contest Parameters to determine the winners<br><br>";
?>
<form name="winner_select" action="<? print $_SERVER['PHP_SELF']; ?>" method="POST" target="_self">
  Start Date (YYYY-MM-DD) <input name="start_date" type="text" size="10" tabindex="1" maxlength="10"><br>
  Term in Days <input name="term" type="text" size="2" tabindex="2" maxlength="2"><br>
  <input type="submit" name="Submit" value="Submit" tabindex="3">
</form>
<?
}else{
  print"<br>These are the Contest Parameters and the selected Winners.<br><br>";
  $end_date=date('Y-m-d',strtotime("$term day", strtotime("$start_date"))) ;
  print "<br>Start date is ".$start_date."<br>Term is ".$term."<br>End date is ".$end_date;

  $target=array(1 => 1,2,4,6);
  for ($x=1; $x<=4; ++$x){
    $sql="
      SELECT ref_id, refs FROM (
        SELECT ref_id, COUNT(active_id) AS refs
        FROM `new_actives`
        WHERE active_date>'$start_date'
        AND active_date<'$end_date'
        GROUP BY ref_id
        HAVING refs >= $target[$x]
        ORDER BY RAND()
      ) AS tab1
      LIMIT 1
    ";

    $ref_rec = mysql_query($sql);
    $category[$x]=mysql_fetch_array($ref_rec);
    print "<br><br>Category ".$x." WINNER - ref_id - ".$category[$x][0]." || count ".$category[$x][1];
  }
}

My only concern at this point is exactly how Random the selections are, and if there is a way to improve the randomness (not sure if it is possible).

Thanks Again,
Douglas

diafol · Answer 18 · 2013-04-11T23:23:56+00:00

My only concern at this point is exactly how Random the selections are, and if there is a way to improve the randomness (not sure if it is possible).

Do you think the randomness is an issue?

If so run a few tests (say a hundred iterations?) while storing the results and totals. It should give you an idea of the randomness.

showman13 49 Junior Poster · Answer 19 · 2013-04-12T02:17:51+00:00

I had run about 50 tests but didn't store the results, and in so doing, it seemed to be quite repetative in some aspects. But then I realized with the limited amount of data for it to randomize, the chances were high that I would get that repetative result. I think that it will be just fine.

Thanks again.
Douglas

do random selection of subset from data table

Recommended Answers Collapse Answers

All 24 Replies

Recommended Answers