Pages

All Your Base with unordered_map

The first problem in the Code Jam Round 1C 2009, All your base, is about converting a string to the smallest possible natural number.

It sounds a bit crazy, but think to the string "ZY", I could say that Z means 1, Y is 0, and that I am using the binary base, so it could be interpreted as 3. Any other valid conversion would lead to a bigger number, considering that I can't put a zero in the leading position.

As requisites, we have that the input size is in [1, 60], and that the ouput is an integer [1, 10^18], too big to be stored in a 32 bit. This is way I have declared my C++ function in this way:
long long allYourBase(const std::string& input)
Considering that a long long is an integer stored in at least 64 bit.

Then I wrote a few test cases, the first three of them were coming straight from the problem definition:
TEST(AllYourBase, Test1) // 1
{
    EXPECT_EQ(201, allYourBase("11001001"));
}

TEST(AllYourBase, Test2) // 2
{
    EXPECT_EQ(75, allYourBase("cats"));
}

TEST(AllYourBase, Test3) // 3
{
    EXPECT_EQ(11, allYourBase("zig"));
}

TEST(AllYourBase, TestBigDec) // 4
{
    EXPECT_EQ(1000000000023456789, allYourBase("1000000000023456789"));
}

TEST(AllYourBase, TestBigHex) // 5
{
    EXPECT_EQ(1162849439785405935, allYourBase("1023456789abcdef"));
}

TEST(AllYourBase, TestBigBinary) // 6
{
    EXPECT_EQ(1000000000000000000, allYourBase("bbabbbbaaaaababbabbababbaabbbabaabbbabbaabaaaaaaaaaaaaaaaaaa"));
}
1. We can assume the input is a binary representation of the actual number, so the conversion is very easy.
2. Here we are going to use base 4. The smallest number should have a 1 as most significative cipher, 0 as second, 2 as third, 3 as fourth. Leading to 1023 in base 4, that is 75 base 10 (64 + 0 + 8 + 3).
3. In base 3 is 102, converted to 11 in base 10.
4. Here I go a bit over the requirements, the generated number is a few millions (23) bigger than the expected top number.
5. Same as (4), but this is substantially over the expected limit.
6. Binary numbers are perfect to play with this problem, if we don't mind writing long sequences of characters.

The solution should be straightforward. Here is my version:
long long allYourBase(const std::string& input)
{
    if(input.empty() || input.size() > 60) // 1
        return 0;

    std::unordered_map<char, int> symbols; // 2

    int value = 1;
    for(char c : input) // 3
    {
        if(symbols.find(c) == symbols.end()) // 4
        {
            symbols.insert({c,value});

            if(value == 1) // 5
                value = 0;
            else if(value == 0)
                value = 2;
            else
                ++value;
        }
    }

    int base = symbols.size(); // 6
    if(base == 1)
        base = 2;

    long long result = 0;
    long long multiplier = 1;

    for(auto pos = input.rbegin(); pos != input.rend(); ++pos) // 7
    {
        result += symbols[*pos] * multiplier; // 8
        multiplier *= base; // 9
    }

    return result;
}
1. The requisites do not specify what should be do for unexpected input size. In the logic of the problem we could assume it just never happens. Here I decided to return zero (an invalid value), I could have thrown an exception that is a more expressive way of saying "this is unexpected".
2. My first job is determining in which base the number is expressed. I can do that scanning the input line and storing each value in a container. The number of such symbols represents the base (with an exception that we are seeing later on). Besides, I can already determine the value for each symbol. As we see discussing the test cases, we are going from the less significative all the way up, with the exception that the zero is in second position. So I need a pair to keep the relation symbol-value, and I can keep all the pairs in a hash table. Here comes handy the C++11 unordered_map. Using the old plain map is a bit of an overkill, since we don't care about the ordering maintained by a red-black tree.
3. Let's loop on the input string.
4. If I have already seen the current symbol, I have nothing to do. Otherwise I insert it using the value I have precedently set.
5. Assigning a value to each symbol would be a simple job if it wasn't for the glitch about the zero that should be in second position.
6. As said above, the base is assumed to be just the number of symbols in the string, with an exception. We expect at least a binary base to be used, so if we detect just one symbol we don't go for the unary one.
7. Let's calculate the result value looping on the string. It is easier to start from the less significative position (the rightmost one) and moving up to the left. So I am using the reverse iterator.
8. Dereferencing pos, I get the current symbol. I go to the hash table and get its associated value. I multiply it for the current positional value, and I add it to the current result.
9. If we are in base ten, the multiplier is 1 for the rightmost position, 10 for the one on its left, then 100, and so on.

Instead of writing my own converter based on position and multiplier, I could have used the standard pow() function declared in the cmath include file. Writing something like:
for(unsigned i = 0; i < input.size(); ++i)
{
    // ...
    result += value * std::pow((double)base, i);
    // ...
}
Notice that in this case we need to explicitly keep track of the current position in the string. So I would prefer to write the loop using a classic "i" index running on the size.
The weakness of this solution is that pow() is designed to work on floating numbers so we need to cast (explicitly) the base to double - or better, a long double - and then we have an implicit cast back to integer type when adding to result. Beside being unusefully expensive going to float and back to fixed, we risk (more importantly!) to have rounding problems due to the internal representation of a floating point number. Using long double should protect us about this issue, but there is no explicit guarantee.
In my case, Linux 64 bit, GCC 4.8, I had problem with pow() on "simple" double only for the test cases TestBigDec and TestBigHex, that are actually out of scope. Using the pow() version for long double all cases worked fine.
In any case, since floating point behavior is not standardized, you should check how it works on your actual configuration. The hand-made integer-based version has none of these issues, assuming long long represents at least a 64 bit integer.

No comments:

Post a Comment