Hi all,

I am trying to figure out how to tokenize a string consisting of words in alphabetical order, all crammed together with no delimiters. I can't for the life of me see how this could be done without analyzing the string beforehand and hard coding a delimiter in.

Example string:

aaardvarkapplebananabicyclecobracupdelve

Any ideas?


You have a dictionary of words somewhere, right? My first attack would be to put the dictionary into a trie, with a special search that returns the end of the first complete match, starting from the end of the most recent match. For example, given a very limited dictionary:

#include <algorithm>
#include <iostream>
#include <string>
#include <cctype>
#include <climits>
#include <cstddef>

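class trie {
    struct node {
        char key;
        bool terminal;              // true when a dictionary word ends at this node
        node *link[CHAR_MAX + 1];   // indexed by character; assumes plain ASCII input

        node(char key): key(key), terminal(false)
        {
            std::fill_n(link, CHAR_MAX + 1, (node*)0);
        }
    } *root;                        // nodes are never freed; fine for a short demo
public:
    trie(): root(0) {}

    // Insert a word, folding it to lower case
    void add(const std::string& s)
    {
        node **it = &root;

        if (*it == 0)
            *it = new node(0);

        for (std::string::size_type i = 0; i < s.size(); ++i) {
            char key = std::tolower(s[i]);

            if ((*it)->link[key] == 0)
                (*it)->link[key] = new node(key);

            it = &(*it)->link[key];
        }

        (*it)->terminal = true;
    }

    // Return the end of the first complete word that starts at begin,
    // or begin itself when no dictionary word starts there
    std::string::const_iterator match(const std::string& s, std::string::const_iterator begin) const
    {
        std::string::const_iterator end = begin;
        const node *it = root;

        while (it != 0 && end != s.end() && it->link[std::tolower(*end)] != 0) {
            it = it->link[std::tolower(*end++)];

            if (it->terminal)
                return end;
        }

        return begin;
    }
};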

trie initialize_dictionary()
{
    const char *words[] = {
        "aardvark","apple","banana","bicycle","cobra"
    };
    trie trie;

    for (std::size_t i = 0; i < sizeof words / sizeof *words; i++)
        trie.add(words[i]);

    return trie;
}

int main()
{
    trie dict = initialize_dictionary();
    std::string s = "aardvarkapplebananabicyclecobra";
    std::string::const_iterator end;

    for (std::string::const_iterator begin = s.begin(); begin != s.end(); begin = end) {
        end = dict.match(s, begin);

        // Stop if nothing was consumed, otherwise the loop would never advance
        if (end == begin) {
            std::cerr << "No dictionary word at offset " << (begin - s.begin()) << '\n';
            break;
        }

        std::cout << std::string(begin, end) << '\n';
    }
}
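
A greedy scan like the one above can get stuck when one dictionary word is a prefix of another (say "cup" and "cupboard"). One possible way around that is to backtrack: try each candidate word in turn and undo the choice if the rest of the string cannot be split. The following is only a rough sketch under a few assumptions: it swaps the trie for a std::set to keep the example short, the function name segment is made up for illustration, and the worst case is exponential unless failed positions are remembered.

```cpp
#include <set>
#include <string>
#include <vector>

// Try to split s[pos..] into dictionary words.
// On success the words are appended to out and true is returned;
// on failure out is left as it was on entry.
bool segment(const std::string& s, std::string::size_type pos,
             const std::set<std::string>& dict, std::vector<std::string>& out)
{
    if (pos == s.size())
        return true;    // consumed the whole string

    for (std::string::size_type len = 1; pos + len <= s.size(); ++len) {
        std::string word = s.substr(pos, len);

        if (dict.count(word) == 0)
            continue;   // not a dictionary word, try a longer candidate

        out.push_back(word);

        if (segment(s, pos + len, dict, out))
            return true;

        out.pop_back(); // dead end: undo and try a longer word
    }

    return false;
}
```

Calling segment(s, 0, dict, out) with an empty out either fills it with one valid split or returns false without touching it.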

In your example:

std::tolower(*end)

tolower is in <cctype> and is a C function.

Interesting, thanks for the reply. After reading in the file, I realized that the words were separated by nulls, so I guess this will be straightforward after all. The text file made it seem like there were no delimiters, but once I read it into a char array, I saw that it was null-terminated. Thanks anyways!
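
For the null-separated case, splitting is mostly a matter of keeping the real length of the buffer around, since anything that treats the data as a C string stops at the first null. A minimal sketch, assuming the whole file has already been read into a std::string with its full length (the name split_on_nul is hypothetical, and empty fields are simply skipped):

```cpp
#include <string>
#include <vector>

// Split a buffer into words on '\0' bytes, skipping empty fields.
std::vector<std::string> split_on_nul(const std::string& buffer)
{
    std::vector<std::string> words;
    std::string::size_type start = 0;

    while (start < buffer.size()) {
        std::string::size_type stop = buffer.find('\0', start);

        if (stop == std::string::npos)
            stop = buffer.size();   // last word has no trailing null

        if (stop > start)
            words.push_back(buffer.substr(start, stop - start));

        start = stop + 1;
    }

    return words;
}
```

When building the std::string from a char array, pass the length explicitly, as in std::string buffer(raw, length), so the embedded nulls are kept.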

In your example:

tolower is in <cctype> and is a C function.

I'm sure Narue knows that. The more important questions are:

  • Why do you think there is a problem?
  • Do you know why she wrote the call to it the way she did?

@OP:
glad you figured it out

tolower is in <cctype> and is a C function.

If you think there's a problem, it's best to just say what you think it is.

once I read it into a char array, I saw that it was null-terminated

That's good. Working without any kind of formatting is either tricky or impossible, and I'm not a fan of either. ;)

Sorry, I thought it would be obvious it's not in the std namespace.

Sorry, I thought it would be obvious it's not in the std namespace.

That's not obvious at all because it is in the std namespace when you use <cctype>.

No it isn't.
http://www.cplusplus.com/reference/clibrary/cctype/toupper/

Are you using Visual Studio? What version?

Well, a quick (possibly misleading) Google search claims C library stuff is meant to be in the std namespace, but Visual Studio sucks and it isn't.

My thoughts are:
http://www.cplusplus.com/reference/std/locale/toupper/
http://www.cplusplus.com/reference/clibrary/cctype/toupper/

The only version of std::toupper provided is:

template <class charT>
  charT toupper ( charT c, const locale& loc );

Could someone make sure std::toupper(int) exists on other compilers? (G++)

Microsoft always likes to have small differences from other compilers. For instance, on G++ I don't think there is a std::exception constructor overload that accepts a C string.

No it isn't.

Yes, it is. I don't care what links you post, my source is the C++ standard. And unless you're terribly familiar with it, I strongly suggest you don't argue the finer points of the standard with me, though you're welcome to confirm that I'm correct. ;)

In particular, we're referring to the section tagged depr.c.headers:

1 For compatibility with the C standard library and the C Unicode TR, the C++ standard library provides the 25 C headers, as shown in Table 151.

Table 151 — C headers
<assert.h> <float.h> <math.h> <stddef.h> <tgmath.h>
<complex.h> <inttypes.h> <setjmp.h> <stdio.h> <time.h>
<ctype.h> <iso646.h> <signal.h> <stdint.h> <uchar.h>
<errno.h> <limits.h> <stdarg.h> <stdlib.h> <wchar.h>
<fenv.h> <locale.h> <stdbool.h> <string.h> <wctype.h>

2 Every C header, each of which has a name of the form name.h, behaves as if each name placed in the standard library namespace by the corresponding cname header is placed within the global namespace scope. It is unspecified whether these names are first declared or defined within namespace scope (3.3.6) of the namespace std and are then injected into the global namespace scope by explicit using-declarations (7.3.3).

3 [ Example: The header <cstdlib> assuredly provides its declarations and definitions within the namespace std. It may also provide these names within the global namespace. The header <stdlib.h> assuredly provides the same declarations and definitions within the global namespace, much as in the C Standard. It may also provide these names within the namespace std. —end example ]

This comes from the latest draft of C++0x, but you'll find that the original C++98 standard says the same thing. Also notice (before claiming that a compiler sucks) that regardless of which header you choose, the other namespace is allowed as an extension. That is, global namespace on top of std namespace for <cname> and std namespace on top of global namespace for <name.h>.
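
To make the two versions being discussed concrete, here is a small sketch that calls both: the one-argument std::toupper(int) declared by <cctype> and the two-argument template declared by <locale>. The helper names upper_cctype and upper_locale are invented for this example, and the cast to unsigned char is there because the <cctype> functions are only defined for values representable as unsigned char (or EOF).

```cpp
#include <cctype>   // int std::toupper(int)
#include <locale>   // template <class charT> charT std::toupper(charT, const std::locale&)
#include <string>

// Uppercase with the <cctype> version, found in namespace std
std::string upper_cctype(std::string s)
{
    for (std::string::size_type i = 0; i < s.size(); ++i)
        s[i] = static_cast<char>(std::toupper(static_cast<unsigned char>(s[i])));

    return s;
}

// Uppercase with the <locale> version, which takes the locale explicitly
std::string upper_locale(std::string s, const std::locale& loc = std::locale::classic())
{
    for (std::string::size_type i = 0; i < s.size(); ++i)
        s[i] = std::toupper(s[i], loc);

    return s;
}
```

Both overloads live in namespace std and are told apart by their argument count, so including both headers in the same file is fine.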

With VC++ 2010, look in <cctype> and you will find that it's in the std namespace, just as Narue said it is. In yvals.h, _STD_BEGIN is defined as: #define _STD_BEGIN namespace std {

Very helpful, I never knew. That's why we communicate. :D
