Stored Routines: Word Filtering Inside the Database

Bob Hensley 2 Tallied Votes 501 Views Share

MySQL Stored Routines: Another Useful Scenario

In a past guide I discussed MySQL stored routines within MySQL. Now I’m back with another guide on MySQL stored routines. And this time it’s a more concrete case; one that you may run into yourself in the future (or maybe you already have)! Let’s jump right on, shall we?

The scenario…

You’re implementing a comments system to your application. Users will be able to comment on articles you and your staff post. A concern arises: explicit comments. It’s something all web developers have to deal with: people being crude and just not decent at all.

How you’re probably handling it right now…

Chances are your first intuition would be to filter explicit words in the application code. Maybe you hard-code the words into an array. Perhaps you put them in the database. Either way, you’d be handling the manipulation within the application code instead of the database.

I’m here to present a different solution!

In data-driven applications there will always be a juggle: how much should be handled by the application and how much by the database itself? I firmly believe that we should pass the buck off on the database anytime it makes sense. And in this scenario, I believe it makes sense.

By using stored routines we can push more responsibility onto the database. Should your application really care about what words are not allowed? I don’t believe so. After all, if the data is clean and secure then it should be good to go to the database. Let the database handle excluding specific words at that point.

Stored Routines: Implementation

Let’s begin implementing these stored routines, shall we? You may follow along verbatim, or change details as we go along. The choice is yours.

Preparation

We’re going to need a database to play with. We’ll also need a table with dummy data. Without this we have no real way of playing with stored routines!

CREATE DATABASE bhdctut_reserved_words;

USE bhdctut_reserved_words;

CREATE TABLE reserved_words (
    word VARCHAR(50) NOT NULL,
    UNIQUE(word)
);

INSERT INTO reserved_words VALUES("butthole"), ("jerk"), ("meanyface"), ("prick"), ("pinhead");

CREATE TABLE comments (
    cid INT NOT NULL AUTO_INCREMENT,
    comment VARCHAR(200) NOT NULL DEFAULT "",
    PRIMARY KEY(cid)
);

None of this should be new for you. We’re creating the database, switching it into scope and then building our tables. We’ve also preloaded some very mature words to filter. No content in the comments table is needed yet.

Filter Function

We’re going to use a function and not a procedure. If you recall, procedures cannot return values. Functions can. For what we’re doing a function makes much more sense.

Stored Routines Tip: If you remember anything about procedures versus functions, just remember this: procedures are red tape, but functions actually return results.

Our function will first need to accept the comment in its entirety. We’re going to return an equivalent data type, as it’ll line up with the comments relation. And this is going to be a deterministic function; it will always return the same value, given the same input.

DELIMITER ||
DROP FUNCTION IF EXISTS filter_words;
CREATE FUNCTION filter_words (comment VARCHAR(200))
    RETURNS VARCHAR(200)
    DETERMINISTIC
BEGIN

END;||

We’re first changing the delimiter (what SQL looks for to say “oh, that statement is complete”) because we need to use semicolons through the code, but do not want the statements being executed prematurely. And because we may be overwriting the function, we’re dropping the function if it exists.

This function is not valid and will err out if you execute it. We do not have a return statement. But we will soon.

We need to declare some variables to be used momentarily:

DELIMITER ||
DROP FUNCTION IF EXISTS filter_words;
CREATE FUNCTION filter_words (comment VARCHAR(200))
    RETURNS VARCHAR(200)
    DETERMINISTIC
BEGIN
    DECLARE done INT DEFAULT FALSE;

    DECLARE a VARCHAR(50);
    DECLARE b CHAR(6);
    DECLARE RESULT VARCHAR(200);

    DECLARE word_list CURSOR FOR SELECT word FROM reserved_words;
    DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = TRUE;
END;||

We’re going to be looping through our filtered words using MySQL’s cursor implementation. Simply put, a database cursor is a construct that allows you to traverse over a data set (such as the result set from a SELECT statement).

Stored Routines Tip: You may see variables declared using a format such as “SET @my_var = 10″. This is a user-defined sessions variable and has limited uses inside stored routines.

To facilitate the cursor we need a variable holding the status: done or ongoing. Hence why this is a boolean value by default. We’ll set it to TRUE when the traversal has completed.

The next three variables are for our filtering. We’ll store the current word we’re looking for inside a. And we’ll replace occurrences of a with b (which we’ll define soon). And the results are saved to the result variable.

The last two variables deal, again, with the cursor. We need to query the database for our words. The data type we’re using is CURSOR, as that’s exactly what word_list will be- a cursor construct. And it’s going to be a cursor for the select statement we’ve declared.

Now the last variable can be a little tricky. But simply put, a handler is a variable that changes when a certain condition becomes true. As defined here, we’re using a CONTINUE handler. But this handler is listening for a callback saying “couldn’t find a data set.” And if that occurs- our done variable goes from FALSE to TRUE.

We need some default values in our variables…

DELIMITER ||
DROP FUNCTION IF EXISTS filter_words;
CREATE FUNCTION filter_words (comment VARCHAR(200))
    RETURNS VARCHAR(200)
    DETERMINISTIC
BEGIN
    DECLARE done INT DEFAULT FALSE;

    DECLARE a VARCHAR(50);
    DECLARE b CHAR(6);
    DECLARE RESULT VARCHAR(200);

    DECLARE word_list CURSOR FOR SELECT word FROM reserved_words;
    DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = TRUE;

    SET b = "******";
    SET RESULT = comment;
END;||

As mentioned, b is our replacement string. If we find a bad word, we’ll replace it with b. And result needs to be, for now, the comment that has been passed into the function. After all, the comment will be returned untouched if no bad words are found!

Now we get to do something fun… use the cursor to loop.

DELIMITER ||
DROP FUNCTION IF EXISTS filter_words;
CREATE FUNCTION filter_words (comment VARCHAR(200))
    RETURNS VARCHAR(200)
    DETERMINISTIC
BEGIN
    DECLARE done INT DEFAULT FALSE;

    DECLARE a VARCHAR(50);
    DECLARE b CHAR(6);
    DECLARE RESULT VARCHAR(200);

    DECLARE word_list CURSOR FOR SELECT word FROM reserved_words;
    DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = TRUE;

    SET b = "******";
    SET RESULT = comment;

    OPEN word_list;

    read_words: LOOP
        FETCH word_list INTO a;

        IF done THEN
            LEAVE read_words;
        END IF;
    END LOOP;

    CLOSE word_list;
END;||

On line 19 we’re opening the cursor. That is, we’re populating the construct with the results of the query. Effectively we’re instantiating it.

Jump to line 29 real quick and you’ll see the opposite: we’re closing the cursor. This is effectively the garbage collection. A closed cursor cannot be reopened, by the way.

Lines 21 through 27 are the loop. We’re labeling the loop so that we can end it within a conditional. You can see that occurring at lines 24-26. There we are checking the done variable (remember, it’ll become TRUE if the handler finds we’ve run out of data). If it’s TRUE, we terminate the loop.

Line 22 is another cursor thing. The FETCH function grabs the next result in the cursor. That is, it proceeds to the next entry. And it stores that into the variable given: a.

Stored Routines Tip: Are you familiar with PHP’s MySQLi extension? If so you’ve probably used mysqli_result::fetch_row(). The functionality between that and FETCH here are very similar.

Just one thing left to do: replace bad words.

DELIMITER ||
DROP FUNCTION IF EXISTS filter_words;
CREATE FUNCTION filter_words (comment VARCHAR(200))
    RETURNS VARCHAR(200)
    DETERMINISTIC
BEGIN
    DECLARE done INT DEFAULT FALSE;

    DECLARE a VARCHAR(50);
    DECLARE b CHAR(6);
    DECLARE RESULT VARCHAR(200);

    DECLARE word_list CURSOR FOR SELECT word FROM reserved_words;
    DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = TRUE;

    SET b = "******";
    SET RESULT = comment;

    OPEN word_list;

    read_words: LOOP
        FETCH word_list INTO a;

        IF done THEN
            LEAVE read_words;
        END IF;

        SET RESULT = REPLACE(RESULT, a, b);
    END LOOP;

    CLOSE word_list;

    RETURN(RESULT);
END;||

Lines 27 and 32 are the only new code. Line 27 is pretty easy to understand: replace any occurrence of the current bad word (if any exist) and overwrite our result variable.

Line 32 is then us returning the result.

Let’s check to see if the function works…

SELECT filter_words("Hey butthole, you are such a prick.")||

Executing this should, if you have used the same data set as me, result in:

Hey ******, you are such a ******.

Let’s Automate This

Now we need to make this function actually meaningful. We could, if we really wanted to, invoke the function on every INSERT manually:

INSERT INTO comments(comment) VALUE(filter_words("Hey butthole, you are such a prick."));

Or we could tell MySQL to do it automatically. I wrote a guide some time back discussing triggers and how to use them. I would encourage you to read through that guide at some point. We’re actually going to employ a trigger here.

CREATE TRIGGER filter_words BEFORE INSERT
    ON comments
    FOR EACH ROW
BEGIN
    SET NEW.comment = (SELECT filter_words(NEW.comment));
END;||

By setting this trigger to invoke before any INSERT function is executed on the comments table, we can control the data going in. All we’re doing here is setting the comment value of any new INSERT to the output of our filter_words function. You can think of NEW as a pre-populated object containing the fields being passed.

Better Replacements… Custom Replacements

What’s better than replacing bad words with asterisks? Being able to decide what to replace bad words with on a more granular level! Wouldn’t it be cool if we could have all bad words default to ******** but have the option of replacing any given word with whatever we want?

I think so!

Let’s change our delimiter back for now- we have some work to do.

DELIMITER ;

We need to alter the reserved_words table to better facilitate our needs. Specifically, we need to add a new attribute (column): replacement_word. This will default to our asterisks, should a specific value not be given.

ALTER TABLE reserved_words
ADD replacement_word VARCHAR(50) NOT NULL DEFAULT "******";

Double check to make sure your alteration occurred (DESCRIBE is very useful).

DESCRIBE reserved_words;

And update our existing data set to have the asterisks.

UPDATE reserved_words
SET replacement_word = '******';

You can now go ahead and set some custom replacement words. For the sake of this guide I’m only setting up one word. Feel free to get more adventurous, though!

UPDATE reserved_words
SET replacement_word = 'mouth hole'
WHERE word = 'butthole';

The alterations to the function are simple. We need to change the data type of our replacement variable (b), as CHAR(6) just won’t do anymore. Our replacement_words query, for our cursor, needs to be modified to pull our new attribute. We then just need to get rid of our asterisks definition and fetch the second attribute.

DELIMITER ||
DROP FUNCTION IF EXISTS filter_words;
CREATE FUNCTION filter_words (comment VARCHAR(200))
    RETURNS VARCHAR(200)
    DETERMINISTIC
BEGIN
    DECLARE done INT DEFAULT FALSE;

    DECLARE a VARCHAR(50);
    DECLARE b VARCHAR(50);
    DECLARE RESULT VARCHAR(200);

    DECLARE word_list CURSOR FOR SELECT word, replacement_word FROM reserved_words;
    DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = TRUE;

    SET RESULT = comment;

    OPEN word_list;

    read_words: LOOP
        FETCH word_list INTO a, b;

        IF done THEN
            LEAVE read_words;
        END IF;

        SET RESULT = REPLACE(RESULT, a, b);
    END LOOP;

    CLOSE word_list;

    RETURN(RESULT);
END;||

Line 10 is the new data type for b. Line 13 has the updated query. Line 16 used to be the asterisks assignment, but it’s gone. And line 21 now fetches the replacement word into the b variable. Because we’re using the same variable, line 27 is still valid.

Check for validity:

SELECT filter_words("Hey butthole, you are such a prick.")||

With my data that results in:

Hey mouth hole, you are such a ******.

Your mileage will vary depending on your data.

There We Have It!

Alright, nearly 2,000 words and numerous code examples later, we’re done. In this guide we have gone over stored routines. We’ve reintroduced triggers. And we’ve moved logic into the database in a meaningful way. What's your opinion on empowering the database to handle more logic?

Please comment below should anything need more explaining. I am always happy to make modifications to my guides!

batuzai04123 0 Newbie Poster

This is great thanks for the guide!!!!

Be a part of the DaniWeb community

We're a friendly, industry-focused community of developers, IT pros, digital marketers, and technology enthusiasts meeting, networking, learning, and sharing knowledge.