1.11M Members

Simple Trie Class

 
3
 

As the title says, it's a simple trie class written in C++11. The only operations implemented are insert, remove, and search (both prefix and exact matches). I sort of abandoned the test/debug process, and while I'm somewhat confident that there aren't any serious bugs, I can't guarantee it. Use this class at your own risk. ;)

#include <map>
#include <memory>
#include <stack>
#include <string>

template <typename CharT>
class trie {
    struct node {
        std::map<CharT, std::shared_ptr<node>> link;
        bool terminal;
        CharT key;
    public:
        node(CharT key): terminal(false), key(key) { }
    };

    std::shared_ptr<node> root;
public:
    trie(): root(std::make_shared<node>(CharT())) {}

    /*
        @description:
            Adds a word path specified by s to the trie.
    */
    void add(const std::basic_string<CharT>& s);

    /*
        @description:
            Removes a word path specified by s from the trie.
    */
    void remove(const std::basic_string<CharT>& s);

    /*
        @description:
            Searches for the prefix s when require_terminal is false or exactly matches
            a terminal path created by add() when require_terminal is true.
        @returns:
            basic_string::npos is returned if a match is found. If no match is found, the
            index of the mismatch is returned. If require_terminal is true and a prefix is
            matched, the size of s is returned.
    */
    typename std::basic_string<CharT>::size_type match(const std::basic_string<CharT>& s, bool require_terminal = true);
};

template <typename CharT>
void trie<CharT>::add(const std::basic_string<CharT>& s)
{
    auto it = root;

    // Traverse the path and extend it as necessary
    for (auto c : s) {
        if (it->link.find(c) == it->link.end()) {
            it->link[c] = std::make_shared<node>(c);
        }

        it = it->link[c];
    }

    // Mark the last key node as a terminal for the path. Useful for
    // determining the difference between a prefix and a full word.
    it->terminal = true;
}

template <typename CharT>
void trie<CharT>::remove(const std::basic_string<CharT>& s)
{
    if (match(s) != std::basic_string<CharT>::npos) {
        // This key doesn't exist in the trie, removing a prefix would break continuity
        return;
    }

    std::stack<decltype(root)> bak;
    auto it = root;

    // Store the reversed path
    for (auto c : s) {
        bak.push(it);
        it = it->link[c];
    }

    // Only necessary when another path intersects and continues 
    // from the end of this one, but it's cheaper just to always 
    // unmark the node as terminal.
    it->terminal = false;

    if (it->link.size() == 0) {
        /*
            Truncate any unshared tail section of this path
        */
        CharT key = it->key;

        // Find the start of the unshared tail for this path
        while (!bak.empty() && bak.top()->link.size() <= 1 && !bak.top()->terminal) {
            key = bak.top()->key;
            bak.pop();
        }

        if (!bak.empty()) {
            // Average case: Trim the tail from the parent's links
            bak.top()->link.erase(key);
        }
        else {
            // Edge case: This was the last path in the trie
            root->link.clear();
        }
    }
}

template <typename CharT>
typename std::basic_string<CharT>::size_type trie<CharT>::match(const std::basic_string<CharT>& s, bool require_terminal)
{
    auto end = s.begin();
    auto it = root;

    // Follow the path until the end of the string or a mismatch
    while (end != s.end() && it->link.find(*end) != it->link.end()) {
        it = it->link[*end++];
    }

    if (end == s.end() && (!require_terminal || it->terminal)) {
        // The path was matched completely
        return std::basic_string<CharT>::npos;
    }

    return end - s.begin(); // Return the index of the mismatch
}
 
2
 

Is there any reason at all for using a shared_ptr<node> for your links? Why not use a unique_ptr<node>? Why incur the extra memory payload and overhead. It seems to me that the following would be a bit simpler (and smaller, and faster):

#include <map>
#include <memory>
#include <stack>
#include <limits>
#include <iterator>

template <typename T>
class trie {
    struct node {
        std::map<T, std::unique_ptr<node> > link;
        bool terminal;
    public:
        node(): terminal(false) { }
    };

    std::unique_ptr<node> root;
public:
    trie(): root(new node()) {}

    typedef std::size_t size_type;

    template <typename String>
    size_type match(const String& s, bool require_terminal = true) const
    {
        // Follow the path until the end of the string or a mismatch
        auto it = &root;
        for(auto sit = s.begin(); sit != s.end(); ++sit) {
            auto lit = (*it)->link.find(*sit);
            if(lit == (*it)->link.end())
                return std::distance(s.begin(), sit); // Return the index of the mismatch
            it = &( lit->second );
        }

        if (!require_terminal || (*it)->terminal)
            return std::numeric_limits<size_type>::max(); // The path was matched completely

        return s.size(); // Return the index of the mismatch
    }

    template <typename String>
    void add(const String& s) {
        auto it = &root;
        // Traverse the path and extend it as necessary
        for (auto c : s)
            it = &( (*it)->link[c] );
        (*it)->terminal = true;
    }

    template <typename String>
    void remove(const String& s) {
        auto it = &root;

        // Store the reversed path
        std::stack< decltype(it) > bak;
        auto sit = s.begin();
        for (; sit != s.end(); ++sit) {
            bak.push(it);
            auto lit = (*it)->link.find(*sit);
            if(lit == (*it)->link.end())
                return; // no match! nothing to remove!
            it = &( lit->second );
        }
        --sit;                 // move back to last.
        (*it)->terminal = false;  // this node no longer ends a word.

        /* Truncate any unshared tail section of this path */
        if ((*it)->link.size() != 0)
            return;

        // Find the start of the unshared tail for this path
        while (!bak.empty() && (*bak.top())->link.size() <= 1 && !(*bak.top())->terminal) {
            --sit;
            bak.pop();
        }

        if (!bak.empty())
            (*bak.top())->link.erase(*sit); // Trim the tail
        else
            root->link.clear(); // Trim it all!!
    }
};

I couldn't help myself, I had to fix a few additional things, such as:

  • const-correctness for the match() function (doesn't change the object);
  • the unnecessary storage of the keys inside the nodes, since the key is always known from the parent that led to that node, and nodes are only accessible from their parent;
  • those inefficient repeated queries into the maps, i.e., whenever you see the "find()" function and the indexing operator near to each other, you can usually eliminate one of them (and it is worth it, performance-wise);
  • the unnecessary preemptive check for a match in the remove function. I believe the intent was to avoid the stacking of the path (in bak) if there is no match to be found, but the linked-list traversal (during the matching) will be the main factor for performance (or lack thereof), so, minimizing the amount of traversals is far more important than avoiding the partial (and possibly useless) construction of a (very fast) contiguous storage stack; and,
  • the overly restricted interface. Although I know that tries are supposed to handle strings of characters, there is no need to limit your implementation to that specific case only. This data structure is just as applicable to any sequence of values (that can act as a key in a std::map).
 
3
 

I notice that add() got broken in your changes. It just traverses the (probably nonexistent) path rather than adding keys where they don't exist. You'd end up dereferencing null pointers unless new nodes are added at the right places:

template <typename String>
void add(const String& s) {
    auto it = &root;
    // Traverse the path and extend it as necessary
    for (auto c : s) {
        if (!(*it)->link[c]) {
            (*it)->link[c].reset(new node());
        }
        it = &( (*it)->link[c] );
    }
    (*it)->terminal = true;
}

Is there any reason at all for using a shared_ptr<node> for your links?

Yes. It's not a good reason, but I do have a reason. ;)

Why not use a unique_ptr<node>?

Ultimately it comes down to two things:

  1. As an arbitrary restriction I didn't want to use new explicitly, and there's no std::make_unique(). I could have written my own make_unique(), but like I said, it's just a reason, not a good one.

  2. I didn't want to diddle about working around unique_ptr's ownership rules with non-owning aliases, and as another arbitrary restriction I didn't want to use raw pointers at all.

For this class, I'm not especially concerned about the overhead of shared_ptr, so I went with what at the time felt like an easier approach. If it were intended as a serious library I'd put more work into optimization.

const-correctness for the match() function (doesn't change the object);

Bugger, I totally forgot to make match() const. Thanks for the catch.

the unnecessary storage of the keys inside the nodes, since the key is always known from the parent that led to that node, and nodes are only accessible from their parent;

I'm not sure I agree with that. It's not unreasonable to want to know the key of the current node without requiring a reference to the parent to figure it out.

those inefficient repeated queries into the maps, i.e., whenever you see the "find()" function and the indexing operator near to each other, you can usually eliminate one of them (and it is worth it, performance-wise);

No argument here, though in some circumstances I might argue that the explicit nature of a find and insert would be worth the performance hit in terms of clarity. A lot of people get surprised by the behavior of operator[] for std::map and how it always creates an item.

the unnecessary preemptive check for a match in the remove function.

Actually, given that the code was written in haste and when my attention was divided, I didn't want to risk introducing a bug for the case where there's no match. It was simpler and clearer just to verify the string prior to attempting removal. Defensive programming FTW. ;)

the overly restricted interface.

Was intentional. These days I'm becoming more and more of the opinion that overly generalized code is a Bad Thing™, so I default to a reasonably generic subset of an interface and generalize further only as necessary.

I've used tries occasionally over the years, but never for anything other than string data. So this class is a trie for strings, where "string" is generalized to any integer-based character type supported by std::basic_string. So based on my experience with tries I've chosen a reasonably generic subset of sequences. I considered including template parameters for the type traits and allocator as well, and ultimately decided that they're overridden so rarely it wasn't worth the line noise.

Hopefully I've described my rationale without making it seem that I'm against generic interfaces in general.

 
2
 

I notice that add() got broken in your changes. It just traverses the (probably nonexistent) path rather than adding keys where they don't exist. You'd end up dereferencing null pointers unless new nodes are added at the right places:

Yes you're right. I originally re-wrote the code with a by-value storage of the nodes, in a std::map<T,node>. Until I realized that it wasn't standard-compliant to assume that an incomplete type could be stored in an STL container, although many implementation allow it (because it is certainly possible to implement STL containers that allow incomplete types, but the standard doesn't require it). So, I went back and changed it to unique_ptr, with rather ugly effects, including that bug, which wouldn't have been one if the nodes were stored by value. I guess this is a legitimate case for wanting containers to contain incomplete types, it would save one level of indirection and one level of new-allocated memory, both of which are a big deal in a tight loop.

Btw, you, again, introduced three map lookups where only one is needed:

template <typename String>
void add(const String& s) {
    auto it = &root;
    // Traverse the path and extend it as necessary
    for (auto c : s) {
        // lookup / create the node for the character c:
        it = &( (*it)->link[c] );
        if (!(*it)) {
            // it didn't exist yet, so, create it:
            (*it) = std::unique_ptr(new node());
        }
    }
    (*it)->terminal = true;
}

For this class, I'm not especially concerned about the overhead of shared_ptr, so I went with what at the time felt like an easier approach.

The best, simplest and clearest solution would certainly have been to store them by value, but short of that, go with unique_ptr.

If it were intended as a serious library I'd put more work into optimization.

Even for a serious library, very often, simply caring for not pessimizing the code by some basic (wrong) choices is enough to have nearly the best performance you can get. It's the remaining optimizations that you can leave for when you fine-tune / profile the code. The whole point is that taking the additional 5 seconds to second guess your "easier choice" and getting a more streamline implementation to begin with pays off in time that you don't lose revisiting that code later or waiting for your program to finish. And this is a skill that requires training, and so, if you write simple example codes for fun, then at the same time, it's good to hone that skill.

The above correction to your correction to my correction of the add function is a good example of that, it has to be a reflex, things like repeated look-ups of the same element should stand out like 6-feet clown at a midget conference.

I'm not sure I agree with that. It's not unreasonable to want to know the key of the current node without requiring a reference to the parent to figure it out.

Use case? I know that common sense would dictate that the nodes should store their own values. But I couldn't think of any reason why it would be needed, or even really any more convenient. This is another case of second-guessing your "easier choices".

Was intentional. These days I'm becoming more and more of the opinion that overly generalized code is a Bad Thing™,

You do have a point there. Things can be too generic and leading to confusion. I agree that it might not be worth the generalization in this case, although there isn't really that much difference here between a string-only version and the more generic version: the name of the class is the same, its template arguments are the same, the interface is the same (except the string parameter is template type), and the small implementation changes to allow any kind of container or value-type are not really significant. So, there isn't much penalty in terms of clarity of the use and the purpose of the code. But, in general, it is true that making things too generic can hinder clarity and usability.

In this example of a trie, which is essentially a kind of set (as in std::set) for elements that are strings (in the general sense, sequence of values), one technically appropriate and fully-generic name for it would be something like lexicographic_set, but it would be much clearer to call it string_set or something like that, the point being to attribute a name that more directly conveys its predominant (special) purpose (even if the implementation is still generic, if that can be done without hurting the clarity of the interface). Personally, I didn't know what a "trie" was until now, so, that's an indication that it's probably not a very good name for it either.

It might sound trivial to talk about naming things, but very often, making things more generic mostly hurts in the semantics of the operations and the names of the classes and functions that no longer match the specific and predominant uses, and there lies to confusion and usability issues. So, that's the broader point here.

so I default to a reasonably generic subset of an interface and generalize further only as necessary.

Uhmmm, I'm not sure I like that approach too much. I've always found it hard to add more generalizations to a piece of code (class or function) without causing at least some minor updates to the interface and thus get a ripple effect through the code that depends on it, and then stuff hits the fan. I usually do the opposite. Almost everything I write has at least two layers: crazy-super generic algorithmic code, and easy to use special-purpose wrappers. Thanks to C++ templates and inlining, this has zero overhead in general. If you look at the STL or standard strings, you'll see the same pattern (at least in standard implementations, if not in the standard document itself).

In this example, I would probably write a as-generic-as-can-be trie class template, and then provide a wrapper called string_set, making sure its interface is clear and well oriented towards storing a set of strings (as opposed to some other more generic sequence). And then, I might add another wrapper (or a template alias) called lexicographic_set. Anyone who comes around looking for an implementation of a trie will be able to find it and, hopefully, use it by virtue of its genericity. Anyone looking to store a bunch of strings will find a clear and usable container for that. And having fully-generic code under-the-hood of a library is very handy. But there is no doubt that multiple levels are needed.

One interesting example is BGL's adjacency_list class which is a pretty terrifying class template to expose to the users of a library: it has a rather obscur implementation-oriented name (as opposed to user-oriented), it has a ton of template arguments to tweak the behavior, it has a special set of observable behavior for each combination of template arguments (e.g., iterator invalidations, etc.), and so on. However, as an implementer of graph-theoretic algorithms, this class template is awesome for me, i.e., when implementing performance-critical algorithms under-the-hood. So, there is also a measure of who the target audience is. If you want a user-friendly one-size-fits-all interface, you can always wrap ugly things in nice packages.

 
2
 

Btw, you, again, introduced three map lookups where only one is needed

Because, again, I'm not especially concerned over what I feel is a negligible cost.

The best, simplest and clearest solution would certainly have been to store them by value, but short of that, go with unique_ptr.

I don't disagree, but you're neglecting the arbitrary restrictions I put in place when writing this code. Unless you have an example of using unique_ptr without any explicit use of new or raw pointers, I'll stand by my choice of shared_ptr. Off the top of my head, I can't think of any way to use unique_ptr directly without violating one or both of those restrictions.

Dumping those restrictions and with more of an eye toward map performance, I'd probably go with something more like this (quickly done, so please excuse any bugs or obvious omissions):

#include <iterator>
#include <map>
#include <memory>
#include <stack>
#include <string>

template <typename CharT>
class trie {
    struct node {
        std::map<CharT, std::unique_ptr<node>> link;
        bool terminal;
        CharT key;
    public:
        node(CharT key): terminal(false), key(key) { }
    };

    std::unique_ptr<node> root;
public:
    trie(): root(new node(CharT())) {}

    /*
        @description:
            Adds a word path specified by s to the trie.
    */
    void add(const std::basic_string<CharT>& s);

    /*
        @description:
            Removes a word path specified by s from the trie.
    */
    void remove(const std::basic_string<CharT>& s);

    /*
        @description:
            Searches for the prefix s when require_terminal is false or exactly matches
            a terminal path created by add() when require_terminal is true.
        @returns:
            basic_string::npos is returned if a match is found. If no match is found, the
            index of the mismatch is returned. If require_terminal is true and a prefix is
            matched, the size of s is returned.
    */
    typename std::basic_string<CharT>::size_type match(const std::basic_string<CharT>& s, bool require_terminal = true) const;
};

template <typename CharT>
void trie<CharT>::add(const std::basic_string<CharT>& s)
{
    auto it = root.get();

    // Traverse the path and extend it as necessary
    for (auto c : s) {
        auto& link = it->link[c];

        if (!link) {
            link.reset(new node(c));
        }

        it = link.get();
    }

    // Mark the last key node as a terminal for the path. Useful for
    // determining the difference between a prefix and a full word.
    it->terminal = true;
}

template <typename CharT>
void trie<CharT>::remove(const std::basic_string<CharT>& s)
{
    auto it = root.get();
    std::stack<decltype(it)> bak;

    // Store the reversed path
    for (auto c : s) {
        auto link = it->link.find(c);

        if (link == std::end(it->link)) {
            return; // No match
        }

        bak.push(it);
        it = link->second.get();
    }

    // Only necessary when another path intersects and continues 
    // from the end of this one, but it's cheaper just to always 
    // unmark the node as terminal.
    it->terminal = false;

    if (it->link.size() == 0) {
        /*
            Truncate any unshared tail section of this path
        */
        CharT key = it->key;

        // Find the start of the unshared tail for this path
        while (!bak.empty() && bak.top()->link.size() <= 1 && !bak.top()->terminal) {
            key = bak.top()->key;
            bak.pop();
        }

        if (!bak.empty()) {
            // Average case: Trim the tail from the parent's links
            bak.top()->link.erase(key);
        }
        else {
            // Edge case: This was the last path in the trie
            root->link.clear();
        }
    }
}

template <typename CharT>
typename std::basic_string<CharT>::size_type trie<CharT>::match(const std::basic_string<CharT>& s, bool require_terminal) const
{
    auto it = root.get();

    // Follow the path until the end of the string or a mismatch
    for (auto end = std::begin(s); end != std::end(s); ++end) {
        auto link = it->link.find(*end);

        if (link == std::end(it->link)) {
            // The path was exhausted while searching
            return std::distance(s.begin(), end);
        }

        it = link->second.get();
    }

    if (!require_terminal || it->terminal) {
        // The path was matched completely
        return std::basic_string<CharT>::npos;
    }

    return s.size(); // A prefix was matched
}

You'll probably see a few places where I added stuff you indirectly reminded me about that I should have put there in the first place. ;)

Use case?

Iteration comes to mind. The class obviously doesn't support iteration in STL style, but it would be pretty damn useless without providing the key value on each item. You could certainly make it work under the hood, but that strikes me as unnecessarily awkward.

the interface is the same (except the string parameter is template type) and the small implementation changes to allow any kind of container or value-type are not really significant

I think you're forgetting about the specializations in both interface and implementation to support C-style strings. At present your "String" template parameter makes the assumption that the collection is an STL-style sequence collection. I'd find it irritating to have to call match() like this (or some variation thereof):

obj.match<std::string>("someword");

So it's not quite as simple as you're making it out to be, and the more generic you try to make it, the more complex it will become. My question is: is the added complexity worth it?

the point being to attribute a name that more directly conveys its predominant (special) purpose

Indeed, and I suspect I would have done that if this class weren't specifically written as an example of a simple trie. ;) The snippet isn't a set that happens to be implemented as a trie, it's a trie. I have no problem naming the class as such for the intended purpose.

Uhmmm, I'm not sure I like that approach too much. [...]

It seems I didn't express my reasoning well enough. Rather than over-engineer things into oblivion, I try to recognize exactly what the intended use of the class will be and write code that's just generic enough to do it.

Your example of the STL and standard strings is a good example of writing for an unknown problem domain. In that case, yes, you should try to be as generic as possible without destroying ease of use. But the vast majority of problems don't need to take into account as broad of a spectrum as the standard library. If you use the standard library as your model, you may end up over-engineering something that has no business being so general.

Isn't it about time forums rewarded their contributors?

Earn rewards points for helping others. Gain kudos. Cash out. Get better answers yourself.

It's as simple as contributing editorial or replying to discussions labeled or OP Kudos

You
This is an OP Kudos discussion and contributors may be rewarded
Post:
Start New Discussion
View similar articles that have also been tagged: