I am looking for a mysql query to search and list most common pairs with their IDs and number of occurence. Thanks.

Recommended Answers

All 11 Replies

Probably not going to happen with such a bare-bones description. How about giving us the schema, some sample data and what you expect to see as a result? How do you define "most common pairs"? Do you mean the two most common, five most common?

Thanks Jim. Here is some Data:

ID  B1  B2  B3  B4  B5  B6
1   6   8   10  26  27  36
2   2   5   10  11  32  42
3   20  21  23  24  29  38
4   3   4   17  19  33  49
5   2   17  20  23  33  41
6   1   12  13  20  31  48
7   20  21  26  41  44  47
8   3   4   43  44  46  47
9   6   7   20  23  29  46
10  1   5   13  20  46  40
11  2   5   10  18  47  40
12  20  21  23  24  37  39
13  3   4   17  25  38  41
14  5   17  20  29  30  41
15  12  14  28  31  32  43

Results would be like:

Count   B1  B2   ID(1)  ID(2)   ID(3)   ID(4)   ID(5)   ID(6)   ID(7)   ID(8) ID(9)etc.
      5     1   2     10     25     45      90   220                    
      8     11   36    4      12      13        19    40     42     89      90

For most common pairs. If it is possible, it would be nice to have a threshold level. for example if threshold is set to 5, then it would list only the most common pairs that occur at least 5 times.

The numbers represented by B1,B2,B3,B4,B5,B6. For example on this example data the pair 3,4 is repeated 3 times.

    ID   B1 B2
    5    3   4
    9    3   4
    14   3   4

Each row of table has 6 unique numbers. Note: I made a mistake in ID:5 row, there are two 4s. Numbers don't repeat in the rows.

I don't know how you'd do that in a query (or if it is even possible. What I would do is go through the table and build a dictionary using val(b1) & "," & val(b2) as the key and the frequency as the value. Then I would scan the dictionary and pick the keys with the highest frequencies. Implementation depends on the language you are using.

Sometimes, even if you can do it in a query, the query can be so convoluted that it is cleaner to do it in code.

As example:

-- table definition
CREATE TABLE `numbers` (
  `id` tinyint(3) unsigned NOT NULL AUTO_INCREMENT,
  `b1` tinyint(3) unsigned DEFAULT NULL,
  `b2` tinyint(3) unsigned DEFAULT NULL,
  `b3` tinyint(3) unsigned DEFAULT NULL,
  `b4` tinyint(3) unsigned DEFAULT NULL,
  `b5` tinyint(3) unsigned DEFAULT NULL,
  `b6` tinyint(3) unsigned DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

-- sample data
INSERT INTO `numbers`
VALUES (1, 6, 8, 10, 26, 27, 36),
       (2, 2, 5, 10, 11, 32, 42),
       (3, 20, 21, 23, 24, 29, 38),
       (4, 3, 4, 17, 19, 33, 49),
       (5, 2, 17, 20, 23, 33, 41),
       (6, 1, 12, 13, 20, 31, 48),
       (7, 20, 21, 26, 41, 44, 47),
       (8, 3, 4, 43, 44, 46, 47),
       (9, 6, 7, 20, 23, 29, 46),
       (10, 1, 5, 13, 20, 46, 40),
       (11, 2, 5, 10, 18, 47, 40),
       (12, 20, 21, 23, 23, 37, 39),
       (13, 3, 4, 17, 25, 38, 41),
       (14, 5, 17, 20, 29, 30, 41),
       (15, 12, 14, 28, 31, 32, 43);

-- query
> SELECT * FROM numbers;
+------+------+------+------+------+------+------+
|   id |   b1 |   b2 |   b3 |   b4 |   b5 |   b6 |
|------+------+------+------+------+------+------|
|    1 |    6 |    8 |   10 |   26 |   27 |   36 |
|    2 |    2 |    5 |   10 |   11 |   32 |   42 |
|    3 |   20 |   21 |   23 |   24 |   29 |   38 |
|    4 |    3 |    4 |   17 |   19 |   33 |   49 |
|    5 |    2 |   17 |   20 |   23 |   33 |   41 |
|    6 |    1 |   12 |   13 |   20 |   31 |   48 |
|    7 |   20 |   21 |   26 |   41 |   44 |   47 |
|    8 |    3 |    4 |   43 |   44 |   46 |   47 |
|    9 |    6 |    7 |   20 |   23 |   29 |   46 |
|   10 |    1 |    5 |   13 |   20 |   46 |   40 |
|   11 |    2 |    5 |   10 |   18 |   47 |   40 |
|   12 |   20 |   21 |   23 |   23 |   37 |   39 |
|   13 |    3 |    4 |   17 |   25 |   38 |   41 |
|   14 |    5 |   17 |   20 |   29 |   30 |   41 |
|   15 |   12 |   14 |   28 |   31 |   32 |   43 |
+------+------+------+------+------+------+------+
15 rows in set

Now, maybe you can use GROUP BY and UNION ALL, the first query looks like this:

> SELECT b1, b2, COUNT(id) AS tot FROM numbers GROUP BY b1, b2 ORDER BY tot DESC;
+------+------+-------+
|   b1 |   b2 |   tot |
|------+------+-------|
|   20 |   21 |     3 |
|    3 |    4 |     3 |
|    2 |    5 |     2 |
|    6 |    8 |     1 |
|    2 |   17 |     1 |
|    1 |   12 |     1 |
|    6 |    7 |     1 |
|    1 |    5 |     1 |
|    5 |   17 |     1 |
|   12 |   14 |     1 |
+------+------+-------+
10 rows in set

To add a threshold use this query as a sub query:

> SELECT b1 AS 'x', b2 AS 'y', tot FROM (SELECT b1, b2, COUNT(id) AS tot FROM numbers GROUP BY b1, b2 ORDER BY tot DESC) AS sub WHERE sub.tot > 1;
+-----+-----+-------+
|   x |   y |   tot |
|-----+-----+-------|
|  20 |  21 |     3 |
|   3 |   4 |     3 |
|   2 |   5 |     2 |
+-----+-----+-------+
3 rows in set

Now the UNION ALL part:

SET @threshold = 1;

-- union all
-- b1, b2
SELECT b1 AS 'x', b2 AS 'y', tot
  FROM (SELECT b1, b2, COUNT(id) AS tot
          FROM numbers
         GROUP BY b1, b2
         ORDER BY tot DESC)
    AS sub
 WHERE sub.tot > @threshold
--
 UNION ALL
-- b2, b3
SELECT b2 AS 'x', b3 AS 'y', tot
  FROM (SELECT b2, b3, COUNT(id) AS tot
          FROM numbers
         GROUP BY b2, b3
         ORDER BY tot DESC)
    AS sub
 WHERE sub.tot > @threshold
 --
 UNION ALL
-- b3, b4
SELECT b3 AS 'x', b4 AS 'y', tot
  FROM (SELECT b3, b4, COUNT(id) AS tot
          FROM numbers
         GROUP BY b3, b4
         ORDER BY tot DESC)
    AS sub
 WHERE sub.tot > @threshold
 --
 UNION ALL
-- b4, b5
SELECT b4 AS 'x', b5 AS 'y', tot
  FROM (SELECT b4, b5, COUNT(id) AS tot
          FROM numbers
         GROUP BY b4, b5
         ORDER BY tot DESC)
    AS sub
 WHERE sub.tot > @threshold
--
 UNION ALL
-- b5, b6
SELECT b5 AS 'x', b6 AS 'y', tot
  FROM (SELECT b5, b6, COUNT(id) AS tot
          FROM numbers
         GROUP BY b5, b6
         ORDER BY tot DESC)
    AS sub
 WHERE sub.tot > @threshold
--
 UNION ALL
-- limit the result set
SELECT NULL, NULL, NULL
  FROM DUAL
 LIMIT 5;

Which returns:

+-----+-----+-------+
|   x |   y |   tot |
|-----+-----+-------|
|  20 |  21 |     3 |
|   3 |   4 |     3 |
|   2 |   5 |     2 |
|   5 |  10 |     2 |
|  21 |  23 |     2 |
+-----+-----+-------+
5 rows in set

The last union query is done to limit the full result set and points to the DUAL table, a dummy table for queries that don't need table references. If you need the full result set just remove it.

Is this you where searching for?

I don't know how it would perform with real data, maybe you could add indexes:

ALTER TABLE `numbers` ADD KEY `pair_1` (`b1`, `b2`);
ALTER TABLE `numbers` ADD KEY `pair_2` (`b2`, `b3`);
ALTER TABLE `numbers` ADD KEY `pair_3` (`b3`, `b4`);
ALTER TABLE `numbers` ADD KEY `pair_4` (`b4`, `b5`);
ALTER TABLE `numbers` ADD KEY `pair_5` (`b5`, `b6`);

and see how it works through EXPLAIN SELECT ....

Thanks for your advice, Jim.

Cereal, thanks for your valuable time, and coding. This is more than I expected, actually I didn't expect that much coding. I'll need to learn mysql deeper to understand and implement this code.

commented: You're welcome! +15

It should be easy, in practice there are 5 select queries that group the rows by the choosen pairs, in the case below the pair is composed by the columns b1 and b2:

SELECT b1 AS 'x', b2 AS 'y', tot
  FROM (SELECT b1, b2, COUNT(id) AS tot
          FROM numbers
         GROUP BY b1, b2
         ORDER BY tot DESC)
    AS sub
 WHERE sub.tot > @threshold

each following query moves to the next column, so you go from b1,b2 to b2,b3 and so on until you reach the last b5,b6.

The UNION ALL statement is used to return the results together. Otherwise you should run five separated requests. It's not a join because each result is separated from the others, the database engine checks only that, the number of columns defined in the SELECT statements, matches all along the query. By using strings it would look like this:

> select 'a', 'b' union all select 'b', 'c' union all select 'c', 'd';

+-----+-----+
| a   | b   |
|-----+-----|
| a   | b   |
| b   | c   |
| c   | d   |
+-----+-----+

With such approach if 20,21 is repeated 3 times in b1,b2 and one time in b2,b3 you will get only 3, not 4. I don't know what you expect, but in case you want 4 then RJ's approach is probably the best, or at least you could use these queries as base for a script.

Right now it does not enter on my mind how to fix it easily with a query, but in practice if you could access the indexes contents directly, you would get those traversal results.

Note: by removing the limit from the dual query, or returning less rows than the limit, you will get an empty row at the end, due to the SELECT NULL, NULL, NULL statement.

Ok, thanks, Cereal. I will look into RJ's approach too.

Be a part of the DaniWeb community

We're a friendly, industry-focused community of developers, IT pros, digital marketers, and technology enthusiasts meeting, networking, learning, and sharing knowledge.