Hi
I have found several articles on the web about data compression and encoding. My interest in learning this is to decompress compressed data in files such as PDF, which are encoded in one way or another. I'm probably asking the wrong question here.

I'm wondering how someone goes about learning this kind of thing. I've come across terms such as FlateDecode and LZWDecode but have no idea how I would use or implement them. Is there a good book on this topic, or some other resource?

Thanks

Start from the simplest and work your way up: RLE, sliding window, LZW, and Huffman are the simpler algorithms. They're relatively easy to implement and reason about.

Production-level algorithms tend to be either proprietary or variations/combinations of the simpler ones. Also keep in mind that different types of data compress differently, so a different algorithm will be optimal for each. For example, RLE variants are simple, but they are a solid winner in compression ratio when you have text or binary data that contains long runs of repeated values.
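To make the RLE point concrete, here is a minimal sketch of a byte-oriented run-length coder. The (count, value) pair layout and the function names `rle_encode` and `rle_decode` are assumptions chosen for illustration; real formats each define their own RLE variant.

```python
def rle_encode(data: bytes) -> bytes:
    """Encode data as (count, value) byte pairs, with count in 1..255."""
    out = bytearray()
    i = 0
    while i < len(data):
        value = data[i]
        run = 1
        # Extend the run while the next byte repeats and the count fits in one byte.
        while i + run < len(data) and data[i + run] == value and run < 255:
            run += 1
        out += bytes((run, value))
        i += run
    return bytes(out)


def rle_decode(encoded: bytes) -> bytes:
    """Reverse rle_encode: expand each (count, value) pair."""
    out = bytearray()
    for i in range(0, len(encoded), 2):
        count, value = encoded[i], encoded[i + 1]
        out += bytes([value]) * count
    return bytes(out)
```

Long runs shrink to two bytes each, while data with no repeats doubles in size, which is why this only wins on the kind of input described above.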

I'd probably browse Amazon for algorithm books and books specializing in compression algorithms, and search Google for specific algorithms as a starting point. Wikipedia will give you a list of algorithms to do further research on.

Thanks. For now I would like to learn how to identify FlateDecode or LZW so that I can apply the appropriate filter when necessary. Would anyone know how I would go about learning this?

What do you mean by how to identify it? When looking at an encoding algorithm, or the resulting encoded bytes?

The resulting encoded bytes. Basically, I'm trying to understand, if I want to parse a stream which is encoded, which filter I will need to apply to it. So I would assume that by looking at the stream data (in its encoded form) there may or may not be a way of telling what kind of filter to apply to it in order to decode it.

I know the question is quite abstract. I'm probably not understanding some rather important points about encoding here, though I'm attempting to teach myself a little more about the topic.

Thanks

So I would assume that by looking at the stream data (in its encoded form) there may or may not be a way of telling what kind of filter to apply to it in order to decode it.

I don't doubt it, but that may be excessively difficult. You're in luck though, as many file formats will have an identifying prefix in the byte stream that you can use. For example:

Zip:     50 4B
Rar:     52 61 72 21
Tar-LZW: 1F 9D
GZip:    1F 8B
DEFLATE: 78 9C (zlib header, default compression level)
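The prefixes listed above can be checked with a simple lookup. This is only a sketch: the table `MAGIC_PREFIXES` and the function `identify_format` are hypothetical names, and the labels are the ones used in this thread rather than official format names.

```python
from typing import Optional

# Prefix bytes copied from the list above, paired with the thread's labels.
# Note: 78 9C is the zlib wrapper around DEFLATE at the default compression
# level; other levels produce 78 01, 78 5E, or 78 DA instead.
MAGIC_PREFIXES = [
    (bytes.fromhex("52617221"), "Rar"),
    (bytes.fromhex("504B"), "Zip"),
    (bytes.fromhex("1F9D"), "Tar-LZW"),
    (bytes.fromhex("1F8B"), "GZip"),
    (bytes.fromhex("789C"), "DEFLATE"),
]


def identify_format(data: bytes) -> Optional[str]:
    """Return the label whose prefix matches the start of data, or None."""
    for prefix, name in MAGIC_PREFIXES:
        if data.startswith(prefix):
            return name
    return None
```

A match is only a hint. Two-byte prefixes can occur by chance in arbitrary data, so it is safer to confirm by actually attempting the decode.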

Keep in mind that there is a difference between the general algorithms and the archive file formats which are their concrete realization. For any given algorithm, there is more than one possible representation of the compressed data. It is only when you have a specific format that you can meaningfully talk about detecting the format from the file metadata. For example, two different archive formats may both use LZW compression for part of their formats, yet the actual formats could be quite different.

EDIT: I was wrong about Zip and GZip. Sorry if I misled anyone.

Thanks. I'm starting to catch on. Knowing the below will set me on a new path for sure.

Zip:     50 4B
Rar:     52 61 72 21
Tar-LZW: 1F 9D
GZip:    1F 8B
DEFLATE: 78 9C

I'm marking this as solved because I think the above covers it. I'll also add that metadata provides valuable information for decoding.
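On the metadata point: in a PDF, each stream's dictionary names its filters in the /Filter entry, so you normally read the filter name rather than guess it from the bytes. Below is a rough sketch of that idea. The helpers `stream_filters` and `decode_stream` are hypothetical, the regex is a simplification rather than a real PDF parser, and only FlateDecode (which is zlib-format data) is handled; /DecodeParms predictors are ignored.

```python
import re
import zlib
from typing import List

# Matches "/Filter /Name" or "/Filter [/Name1 /Name2]" in a stream dictionary.
_FILTER_RE = re.compile(rb"/Filter\s*(\[[^\]]*\]|/[^\s/<>\[\]()]+)")
_NAME_RE = re.compile(rb"/([^\s/<>\[\]()]+)")


def stream_filters(stream_dict: bytes) -> List[str]:
    """Return the filter names from a raw stream dictionary, in order."""
    match = _FILTER_RE.search(stream_dict)
    if match is None:
        return []  # no /Filter entry: the stream is stored unencoded
    return [name.decode("ascii") for name in _NAME_RE.findall(match.group(1))]


def decode_stream(stream_dict: bytes, raw: bytes) -> bytes:
    """Apply each named filter in order; only FlateDecode is supported here."""
    data = raw
    for name in stream_filters(stream_dict):
        if name == "FlateDecode":
            data = zlib.decompress(data)
        else:
            raise NotImplementedError("filter not handled in this sketch: " + name)
    return data
```

For example, `decode_stream(b"<< /Filter /FlateDecode >>", zlib.compress(b"hello"))` returns `b"hello"`. An LZWDecode stream would raise `NotImplementedError` here, which is the place to plug in an LZW decoder once you have written one.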

Nice question and answers. This is useful for me, thanks ^^