Hi guys, I've to develop a software to compress and extract files using the RLE algorithm.

And I've decided to implement it so that if I've a set of chars like
"ABRTTTTOPSSSSSNU" it will be converted as "ABR4*TP5*SNU".

Now, the problem is that for example the char '*' could be present also in the file, making a false-positive.
Does exist any "special" char which could I use to be sure that it's not present in any file? Or do you know any other "smart" implementation to do something like this?

Thanks

Because of how RLE works, a run of two will not naturally be in the encoded string. You can use a run of two of any character as a special sequence that signifies the start of an encoded run for that character. In your example, "ABRTTTTOPSSSSSNU" becomes "ABRTT4OPSS5NU" with the following logic.

; encode RLE
while not eof(istr)
    current = get_next_char(istr)
    length = 1
    while peek() = current
        get_next_char(istr) ; eat the run
        length = length + 1
    if length > 1
        write_n(ostr, current, 2)
        write(ostr, length)
    else
        write(ostr, current)
; decode RLE
while not eof(istr)
    current = get_next_char(istr)
    if peek() = current
        get_next_char(istr) ; eat the duplicate
        length = get_next_int(istr)
        write_n(ostr, current, length)
    else
        write(ostr, current)

This way there is no worry about special characters being in the unencoded data because there are no special characters, just a special sequence of any character that would otherwise not exist.

Thanks for your reply, and for the good idea.
However what happens if in the original file I've a run long 2?
For example "NTBBOP" should become "NTBB2OP", increasing the result of 1 char

Thanks for your reply, and for the good idea.
However what happens if in the original file I've a run long 2?
For example "NTBBOP" should become "NTBB2OP", increasing the result of 1 char

This would be a problem with the special character approach also, "NTBBOP" becomes "NTB*2OP". RLE is very sensitive to the input. One degenerate case for RLE is many runs where the encoding is longer than the run. If such degenerate cases are expected, RLE is not a suitable compression method.

Edited 6 Years Ago by Luther von Wulf: n/a

Well..it's not completely true...'cause with a "special" char I could consider only the runs longer than 3.
So that "NTBBOP" remains as it is, without losing anything.

Edited 6 Years Ago by cesvokamra: n/a

Well..it's not completely true...'cause with a "special" char I could consider only the runs longer than 3.
So that "NTBBOP" remains as it is, without losing anything.

Then you need some kind of escape for the "special" character unless it is so special that it will never be in the unencoded string. The fact still remains that RLE is not suitable for input that has few long runs. If the input is degenerate enough that the encoding of short runs is not alleviated by the encoding of long runs, RLE is a poor choice of compression.

Yes, you're true..but unfortunately the RLE algorithm is required for this software, and it's not choosed by me :)

This article has been dead for over six months. Start a new discussion instead.