Hi,

I've coded huffman coding.I've formed my codebook.Its correct upto that point.I'm using unsigned int(32 bits) and packing the codes into it and writing to file whenever 32 bits get filled.Somehow..my resultant file is larger than my source file.Am i doing something wrong?

Sounds like it. The compression is more pronounced when the distribution of the input is skewed; i.e. there are more occurances of some characters than others.

Feel free to post some code, or provide more info on how you're performing the codebook creation, packing, saving.

Hmm, I just replied to your last post...

Huffman only works if the tree is properly formed. If you are getting a larger output file then your tree is malformed.

You are accounting for the difference in what it takes to actually store the tree in the output, right? If you are testing on a really small input then the file could grow just because of the space necessary to store the tree...

My tree is formed correctly.I've verified it with a small input and tried it on paper.I'm storing my codes in a int [256][2] array where [256][0] has the code and [256][1] has the length in bits.I'm using the following code to generate that.

void encodeTree(node *root, int code,int length)
        {
                if(root->lptr)
                {
                        encodeTree(root->lptr,code<<1,length+1);
                }
                if(root->rptr)
                {
                        encodeTree(root->rptr,(code<<1)|1,length+1);
                }
                if(!root->lptr && !root->rptr)
                {
                        c->codes[(int)root->sym][0]=code;
                        c->codes[(int)root->sym][1]=length;
                }

        }

this procedure works fine.I'm then encoding the file with the following function

void encodeFile(const std::string f,const std::string op)
        {
                unsigned int cnum1,cnum2,temp1,temp2,temp3,temp,len;
                char c1,c2;
                int i,j,k;
                std::ifstream inFile( f.c_str(), ios::in | ios::binary );
                std::ofstream outFile( op.c_str(), ios::out);
                if(inFile.is_open())
                {
                        c1=inFile.get();
                        cnum1=(int)c1;
                        temp=c->codes[cnum1][0];
                        len=c->codes[cnum1][1];
                        i=32-len;
                        temp=temp<<i;
                        while(!inFile.eof())
                        {

                                c2=inFile.get();
                                cnum2=(int)c2;
                                if(len+c->codes[cnum2][1]<32)
                                {
                                        temp1=temp;
                                        temp2=c->codes[cnum2][0];j=32-(len+c->codes[cnum2][1]);
                                        temp2=temp2<<j;
                                        temp=temp1 | temp2;
                                        len=len+c->codes[cnum2][1];
                                        
                                }
                                else
                                {
                                        temp1=temp;
                                        temp2=c->codes[cnum2][0];
                                        j=64-(len+c->codes[cnum2][1]);
                                        k=(len+c->codes[cnum2][1])-32;
                                        temp=temp2;
                                        if(j!=32)
                                        temp=temp<<j;
                                        else
                                        temp=0;
                                        temp2=temp2>>k;
                                        temp2=temp1 | temp2;
                                        len=k;
                                        outFile<<temp2;
                                 }
                        }
                }
        outFile.close();
        }

That is, I'm reading character after character and packing until i reach 32 bits and outputting the integer.

You've got waaaay too many variables swimming around in there.

That said, you've managed to do all your bit-shifting without collisions.


Big output
The reason why your output file is so big is because you switched from binary to text. Go ahead and open your output in Notepad to see a giant list of digits. ("Digits", mind you, not numbers.)

Line 7: Open your output file with ios::binary.
Line 44: Don't use << --use OutFile.write() or OutFile.put() instead.


Some other notes
Line 16: Don't use while (!InFile.eof()) --it leaves you open to a possible infinite loop. Use while (InFile.good()) instead.

Packing: you have failed to heed my advice about bit-packing order. As a result your output stream is indecipherable, since the whole point of the Huffman code-tree is to produce codes with unique prefixes. If you read the codes backwards there is no guarantee against ambiguous matches --and by packing from MSB to LSB you give the decoder no way to determine where each code begins.

Missing bits: You also didn't write any leftover bits to file before closing it, as I warned you to not do. As a result the last few code(s) of your stream will be missing in the output file.


Helpful hints
Don't try to do too many things at once. Your code tries to do everything: read, pack, and write, all at once. Break it down into smaller chunks.

There is no appreciable need to handle output in integers larger than one byte. The code sizes can still be any number of bits you like, more or less than a byte, but by packing and writing only one byte at a time you avoid issues with endianness. A Huffman file is a stream of bits, not words, so it should always be in strict "little-endian" format...

...you also don't have to second-guess file buffering. The OS and/or fstreams will handle that gracefully for you.

If you offload packing bits to a separate function then the encodeFile() function stops having to care about all that bit-packing and when to actually write data etc. All it has to do is read a byte from the source file, look it up, and send the correct number of bits to the pack-and-write function. When the source file exhausts, then all you need to do is write the left-over (as yet unwritten) bits, then close your files.

Huffman files need extra data encoded with them: the size of the source file, and the Huffman tree itself. If you fail to save those in the file then the decoder will be unable to decipher the output file. (One of the reasons the L77-family arithmetic schemes are so much more popular is because they adaptivly encode their dictionary tree in the stream.)

Try to reduce the number of variables you are using. Don't create temporaries unless you absolutely must. For example, on lines 10 and 11 you use the c1 variable to do... nothing. Just write: cnum1 = inFile.get(); Also, try to use more descriptive names. For example, I would replace "cnum1" with something like "source_value" or "input_symbol" or something that makes its purpose obvious at a glance. Then, whenever you need to input a source symbol to be encoded, just use the "source_symbol" (or whatever you name it) variable, instead of introducing more and more new ones.

Whew. Hope this helps.

[edit] I hope I haven't made you feel stupid by being overly critical. You are doing fine, and it is clear that you are able to think logically about things (like properly packing bits into a word --something that befuddles a lot of people). Hence the detailed (and really long winded) responses... Otherwise I'd just click away. :)

Thanks a billion.Over critical is good.thanks for caring enough to write out such a huuuuge reply.Now I'm gonna sit and read it :)

I have one other doubt.When I'm doing inFile.get(), occasionally say like pdf's or some files some totally wierd character comes up and (int) becomes some whacky number like 48329262 when its supposed to be less than 256.What exactly is going on?

You caught me just before I left...

It looks like the default char type is signed. So:

unsigned char 0xBF (default windows codepage for '¿', but any binary data will do)
read as --> signed char -65
expanded to --> signed int -65 (0xFFFFFFBF)
converted to --> unsigned int 4294967231

Alas. Make sure to cast to unsigned before casting to int.

Hope this helps. (...and works)

The packing in reverse is not clear to me.Let me see if I got this right.When we are writing a bit stream, the bit endianness is little so we write from least significant bit to most significant bit? or are we talking about least and most significant byte?

If you read and write your files one byte at a time, endianness is not an issue.

However, bit order is always an issue. The first bit you write should be in bit position 0 of the first byte you write. The second bit goes in bit position 1, etc. The ninth bit goes in position 0 of the second byte. Etc.

I'll use that little huffman tree from the Wikipedia.

space a     e     f     h     i     m     n      
111   010   000   1101  1010  1000  0111  0010  
s     t     l     o     p     r     u     x      
1011  0110  11001 00110 10011 11000 00111 10010

Now the thing to remember here is that each code is a list of bits. Hence you have to read the above bitstreams from left-to-right, which is opposite from our usual way of looking at numbers.

For example, the 'h' is code '1101', which we can store as the number 11 --not 13. The way we are writing it here is with the lsb on the left and the msb on the right. Hence: 1 + 2 + 0 + 8 = 11 This becomes important when we start sticking more than one code next to each other. Since the order of the bits is important (not the numbers!), we must always read left-to-right, least-significant-bit to most-significant-bit.

An example: Bitsequence of a p p l e

The bitstream a..p....p....l....e.. 010100111001111001000 Separate it into bytes byte 0 . . byte 1 . . byte 2 01234567 . 01234567 . 01234567 bit positions 01010011 . 10011110 . 01000... bit values

Each byte can now be treated as an unsigned number, and written to file
(remember that we are thinking about our bits lsb..msb, but numbers msd..lsd) 202 121 2 (in decimal) CA 79 02 (or hexadecimal if you prefer it)

Up to now we have had the luxury of those friendly little red and green bb-tags to help us keep the codes separate.

But now we have to pretend we are the decoder and get our message back. For simplicity in my example here, I'll skip the part about getting the next byte when we run out of bits in the current byte. However, we will always read each byte's bits starting at lsb and ending at msb.

Our bitstream: 010100111001111001000000 Read bits only until we can exactly match a code in our Huffman tree. (Make sure to compare what I do to our list of codes above.) 010100111001111001000000 no match 010100111001111001000000 no match 010100111001111001000000 matches 'a'

Starting over ...100111001111001000000 no match ...100111001111001000000 no match ...100111001111001000000 no match ...100111001111001000000 no match ...100111001111001000000 matches 'p'

Starting over ........1001111001000000 no match ........1001111001000000 no match ........1001111001000000 no match ........1001111001000000 no match ........1001111001000000 matches 'p'

Starting over .............11001000000 no match .............11001000000 no match .............11001000000 no match .............11001000000 no match .............11001000000 matches 'l'

Starting over ..................000000 no match ..................000000 no match ..................000000 matches 'e'

At this point, we have recovered the same number of symbols as the original stream, so we are done.

Backwards Stuff
Argh... I'm tired. Just so long as you don't start matching bits from the wrong end. Is 010 'a'?
Or the back end of 'h', 'n', or 'x'?

Just so long as you don't mix them. When you mix the msb/ls-bits of the symbols, you fracture the bitstream. And your decoder has to know it. That's all...

/me goes away

Comments
Respect for the amount of time you put into this

Awesome! I've finished encoding.About the pseudo eof, it cant be 8-bit.So, I'm using the first 9 bit character which is 256.Do i need to assign an additional 8 bits(a byte) to each node just so that i can store my pseudo eof?

Also, When I'm writing the header(the huffman tree) I'm representing it as a 0 for an internal node and a 1 for a leaf.So,
0 0 1 10110110 1 11001011
This is a root followed by internal node followed by leaf and the ascii for the leaf. and so on.Should I use the pseudo eof to differentiate where the header has ended and where the data starts?

This article has been dead for over six months. Start a new discussion instead.