I notice that add() got broken in your changes. It just traverses the (probably nonexistent) path rather than adding keys where they don't exist. You'd end up dereferencing null pointers unless new nodes are added at the right places:
Yes you're right. I originally re-wrote the code with a by-value storage of the nodes, in a std::map<T,node>. Until I realized that it wasn't standard-compliant to assume that an incomplete type could be stored in an STL container, although many implementation allow it (because it is certainly possible to implement STL containers that allow incomplete types, but the standard doesn't require it). So, I went back and changed it to unique_ptr, with rather ugly effects, including that bug, which wouldn't have been one if the nodes were stored by value. I guess this is a legitimate case for wanting containers to contain incomplete types, it would save one level of indirection and one level of new-allocated memory, both of which are a big deal in a tight loop.
Btw, you, again, introduced three map lookups where only one is needed:
template <typename String>
void add(const String& s) {
auto it = &root;
// Traverse the path and extend it as necessary
for (auto c : s) {
// lookup / create the node for the character c:
it = &( (*it)->link[c] );
if (!(*it)) {
// it didn't exist yet, so, create it:
(*it) = std::unique_ptr(new node());
}
}
(*it)->terminal = true;
}
For this class, I'm not especially concerned about the overhead of shared_ptr, so I went with what at the time felt like an easier approach.
The best, simplest and clearest solution would certainly have been to store them by value, but short of that, go with unique_ptr.
If it were intended as a serious library I'd put more work into optimization.
Even for a serious library, very often, simply caring for not pessimizing the code by some basic (wrong) choices is enough to have nearly the best performance you can get. It's the remaining optimizations that you can leave for when you fine-tune / profile the code. The whole point is that taking the additional 5 seconds to second guess your "easier choice" and getting a more streamline implementation to begin with pays off in time that you don't lose revisiting that code later or waiting for your program to finish. And this is a skill that requires training, and so, if you write simple example codes for fun, then at the same time, it's good to hone that skill.
The above correction to your correction to my correction of the add function is a good example of that, it has to be a reflex, things like repeated look-ups of the same element should stand out like 6-feet clown at a midget conference.
I'm not sure I agree with that. It's not unreasonable to want to know the key of the current node without requiring a reference to the parent to figure it out.
Use case? I know that common sense would dictate that the nodes should store their own values. But I couldn't think of any reason why it would be needed, or even really any more convenient. This is another case of second-guessing your "easier choices".
Was intentional. These days I'm becoming more and more of the opinion that overly generalized code is a Bad Thing™,
You do have a point there. Things can be too generic and leading to confusion. I agree that it might not be worth the generalization in this case, although there isn't really that much difference here between a string-only version and the more generic version: the name of the class is the same, its template arguments are the same, the interface is the same (except the string parameter is template type), and the small implementation changes to allow any kind of container or value-type are not really significant. So, there isn't much penalty in terms of clarity of the use and the purpose of the code. But, in general, it is true that making things too generic can hinder clarity and usability.
In this example of a trie, which is essentially a kind of set (as in std::set) for elements that are strings (in the general sense, sequence of values), one technically appropriate and fully-generic name for it would be something like lexicographic_set, but it would be much clearer to call it string_set or something like that, the point being to attribute a name that more directly conveys its predominant (special) purpose (even if the implementation is still generic, if that can be done without hurting the clarity of the interface). Personally, I didn't know what a "trie" was until now, so, that's an indication that it's probably not a very good name for it either.
It might sound trivial to talk about naming things, but very often, making things more generic mostly hurts in the semantics of the operations and the names of the classes and functions that no longer match the specific and predominant uses, and there lies to confusion and usability issues. So, that's the broader point here.
so I default to a reasonably generic subset of an interface and generalize further only as necessary.
Uhmmm, I'm not sure I like that approach too much. I've always found it hard to add more generalizations to a piece of code (class or function) without causing at least some minor updates to the interface and thus get a ripple effect through the code that depends on it, and then stuff hits the fan. I usually do the opposite. Almost everything I write has at least two layers: crazy-super generic algorithmic code, and easy to use special-purpose wrappers. Thanks to C++ templates and inlining, this has zero overhead in general. If you look at the STL or standard strings, you'll see the same pattern (at least in standard implementations, if not in the standard document itself).
In this example, I would probably write a as-generic-as-can-be trie class template, and then provide a wrapper called string_set, making sure its interface is clear and well oriented towards storing a set of strings (as opposed to some other more generic sequence). And then, I might add another wrapper (or a template alias) called lexicographic_set. Anyone who comes around looking for an implementation of a trie will be able to find it and, hopefully, use it by virtue of its genericity. Anyone looking to store a bunch of strings will find a clear and usable container for that. And having fully-generic code under-the-hood of a library is very handy. But there is no doubt that multiple levels are needed.
One interesting example is BGL's adjacency_list class which is a pretty terrifying class template to expose to the users of a library: it has a rather obscur implementation-oriented name (as opposed to user-oriented), it has a ton of template arguments to tweak the behavior, it has a special set of observable behavior for each combination of template arguments (e.g., iterator invalidations, etc.), and so on. However, as an implementer of graph-theoretic algorithms, this class template is awesome for me, i.e., when implementing performance-critical algorithms under-the-hood. So, there is also a measure of who the target audience is. If you want a user-friendly one-size-fits-all interface, you can always wrap ugly things in nice packages.