Ok, so this is part of a project I'm working on. I am dealing with DNA strings that consist only of the letters A, C, G, and T. I have to write these strings to a file and be able to read them back out of the file. However, when putting them in the file we have to convert them to bytes and compress them (to save file space) using this format:

A: 00
C: 01
G: 10
T: 11

It was explained to me that you can do this by bit shifting along with some ANDs and ORs; however, after hours of work I'm not getting it.

As I understand it, each character is its own byte, e.g.:

A: 00000000
C: 00000001
G: 00000010
T: 00000011

These strings can be anywhere from a few dozen to thousands of characters long. However, I have to compress 4 letters at a time, into one byte.

Example:

If I read in the above as a string "ACGT"

it should be compressed and look like this as a byte.

00011011

So this is read from left to right.
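To make the left-to-right packing concrete, here is a minimal sketch of how one group of four letters could be folded into a byte with shifts and ORs. The names PackDemo, code, and packFour are made up for illustration and are not from the post.

```java
final class PackDemo {
    // Map a base to its 2-bit code: A=0 (00), C=1 (01), G=2 (10), T=3 (11).
    static int code(char base) {
        switch (base) {
            case 'A': return 0;
            case 'C': return 1;
            case 'G': return 2;
            case 'T': return 3;
            default: throw new IllegalArgumentException("not a DNA base: " + base);
        }
    }

    // Pack exactly four bases into one byte, first base in the two highest bits.
    static byte packFour(String four) {
        int packed = 0;
        for (int i = 0; i < 4; i++) {
            // Shift left to make room, then OR the next 2-bit code into the low bits.
            packed = (packed << 2) | code(four.charAt(i));
        }
        return (byte) packed;
    }

    public static void main(String[] args) {
        byte b = packFour("ACGT");
        // Mask with 0xFF so a negative byte would still print as its 8-bit pattern.
        System.out.println(Integer.toBinaryString(b & 0xFF)); // prints 11011 (leading zeros dropped)
        System.out.println(b); // prints 27
    }
}
```

Each pass through the loop moves the letters already packed two bits to the left, so the first letter ends up in the highest two bits and the last letter in the lowest two.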

I don't know how to do this. Here is my compression attempt, which is not working:

String string = "ACGT";
byte[] byt = new byte[string.length()/4];
byte temp = 0;
for(int i = 0; i < string.length(); i += 4) {
    for(int j = 0; j < 4; j++) {
        if(string.charAt(j+i) == 'A')
            temp = (byte) (temp << 2 | 00);
        if(string.charAt(j+i) == 'C')
            temp = (byte) (temp << 2 | 01);
        if(string.charAt(j+i) == 'G')
            temp = (byte) (temp << 2 | 10);
        if(string.charAt(j+i) == 'T')
            temp = (byte) (temp << 2 | 11);
    }
    byt[i] = temp;
}
for(int k = 0; k < byt.length; k++)
    System.out.println("byt[k] " + byt[k]);
System.out.println("byt: " + byt);

The output is:
byt[k] 59
byt: [B@3e25a5


I believe it's supposed to be
byt[k] 27 (since binary 00011011 is 27 in decimal, right?)

and the byt: line won't be readable at all. So since it's showing 59, I think it's incorrect.
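On the unreadable byt: line: concatenating a byte[] with a string prints the array's type tag and hash code, not its contents. A small sketch of two ways to inspect the bytes instead, using java.util.Arrays.toString from the standard library and a hypothetical helper toBits that is not from the post:

```java
import java.util.Arrays;

final class PrintBytesDemo {
    // Render a byte as an 8-character binary string, e.g. 27 -> "00011011".
    static String toBits(byte b) {
        String s = Integer.toBinaryString(b & 0xFF); // & 0xFF drops the sign extension
        return "00000000".substring(s.length()) + s; // left-pad with zeros to 8 digits
    }

    public static void main(String[] args) {
        byte[] byt = { 27 };
        System.out.println("byt: " + byt);                  // type tag plus hash code, like [B@3e25a5
        System.out.println("byt: " + Arrays.toString(byt)); // prints byt: [27]
        System.out.println("bits: " + toBits(byt[0]));      // prints bits: 00011011
    }
}
```

Printing the bit pattern makes it easy to compare the result against the expected 00011011 by eye.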

Here is my attempt at decompressing it, which should give back the string I entered, "ACGT":

String str = "";

for(int i = 0; i < byt.length; i++) {
    for(int j = 0; j < 4; j++) {
        if((byte) (byt[i]%4) == 00)
            str = str + "A";
        if((byte) (byt[i]%4) == 01)
            str = str + "C";
        if((byte) (byt[i]%4) == 10)
            str = str + "G";
        if((byte) (byt[i]%4) == 11)
            str = str + "T";
        byt[i] = (byte) (byt[i] >> 2);
    }
}

System.out.println("str: " + str);

The output is:

str: A

So for some reason it's not reading the rest of the characters.


Could someone help me out with this? I've spent way too much time on this and I'm just not getting it to work. I also need to be able to decompress it after I can compress it.

Edited 5 Years Ago by charchar88: n/a

You need to use values whose bit patterns are 00000000, 00000001, 00000010, and 00000011, i.e.

byte oo = 0x00;
byte oi = 0x01;
byte io = 0x02;
byte ii = 0x03;

as what you are using now (00, 01, 10, and 11) is actually zero, one, ten, and eleven. Not quite what you want.

Or use the 0x form directly.

(P.S. you can also use 0, 1, 2, and 3 directly, of course.)
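For anyone following along, a small sketch of why the first attempt printed 59: in Java source, 10 and 11 are the decimal numbers ten and eleven, not bit patterns. The class and method names below are made up for illustration.

```java
final class LiteralDemo {
    // Packs A, C, G, T using the literals from the first attempt (00, 01, 10, 11).
    static int packWithDecimalLiterals() {
        int temp = 0;
        temp = temp << 2 | 00; // octal zero, value 0
        temp = temp << 2 | 01; // octal one, value 1
        temp = temp << 2 | 10; // decimal ten, bits 1010
        temp = temp << 2 | 11; // decimal eleven, bits 1011
        return temp;
    }

    // Packs A, C, G, T using the intended 2-bit values 0, 1, 2, 3.
    static int packWithIntendedValues() {
        int temp = 0;
        temp = temp << 2 | 0;
        temp = temp << 2 | 1;
        temp = temp << 2 | 2; // could also be written 0x02, or 0b10 on Java 7 and later
        temp = temp << 2 | 3; // could also be written 0x03, or 0b11 on Java 7 and later
        return temp;
    }

    public static void main(String[] args) {
        System.out.println(packWithDecimalLiterals()); // prints 59
        System.out.println(packWithIntendedValues());  // prints 27
    }
}
```

The extra high bits of ten and eleven spill into the bits that belong to the neighbouring letters, which is where 59 comes from instead of 27.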

Edited 5 Years Ago by masijade: n/a

Ok, I tried what you suggested with my original code below and the outputs were the same. I will keep looking at it but I believe my compression method works fine, it's just my decompression that sometimes works and sometimes doesn't.

I'm almost there, but it doesn't work for every ACGT sequence. See below:

String Str = "ACGTCGAT";
byte[] compressed = compress(Str);
System.out.println("String compressed: " + compressed);
String Str2 = decompress(compressed);
System.out.println("String decompressed: " + Str2);

Will output:
String compressed: [B@69b332
String decompressed: ACGTCGAT

This is correct and what it should output since that was the String I compressed.

String Str3 = "GCATTACTGA";
byte[] compressed2 = compress(Str3);
System.out.println("2nd String compressed: " + compressed2);
String Str4 = decompress(compressed2);
System.out.println("2nd String decompressed: " + Str4);

Will output:
2nd String compressed: [B@173a10f
2nd String decompressed: ACAGCACG

This is incorrect since the String I started with was GCATTACTGA.

It seems it prints it correctly with some letters and not with others.

I'm not sure why.


Here are my two methods:

public static byte[] compress(String sequence)
{
    int length = sequence.length()-(sequence.length()%4);
    byte[] byt = new byte[length/4];
    byte temp = 0;
    for(int i = 0; i < length; i += 4) {
        for(int j = 0; j < 4; j++) {
            if(sequence.charAt(j+i) == 'A')
                temp = (byte) (temp << 2 | 0);
            if(sequence.charAt(j+i) == 'C')
                temp = (byte) (temp << 2 | 1);
            if(sequence.charAt(j+i) == 'G')
                temp = (byte) (temp << 2 | 2);
            if(sequence.charAt(j+i) == 'T')
                temp = (byte) (temp << 2 | 3);
        }
        byt[i/4] = temp;
    }
    return byt;
}


public static String decompress(byte[] sequence)
{
    String str = "";
    String temp = "";
    for(int i = 0; i < sequence.length; i++) {
        System.out.println((byte)(sequence[i]%4));
        if(sequence[i] < 0)
            sequence[i] = (byte) (sequence[i] + 127);
        for(int j = 0; j < 4; j++) {
            if((byte) (sequence[i]%4) == 0)
                temp = "A" + temp;
            if((byte) (sequence[i]%4) == 1)
                temp = "C" + temp;
            if((byte) (sequence[i]%4) == 2)
                temp = "G" + temp;
            if((byte) (sequence[i]%4) == 3)
                temp = "T" + temp;
            sequence[i] = (byte) (sequence[i] >> 2);
        }
        str = str + temp;
        temp = "";
    }
    return str;
}

Edited 5 Years Ago by charchar88: n/a
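One thing worth noting about the compress method posted above: its length is rounded down to a multiple of 4, so for the 10-letter test string only the first 8 letters are stored and the trailing "GA" is dropped. Below is a rough sketch of one way to keep the tail, assuming the caller stores the original letter count separately and passes it back to decompress. That extra length parameter, and the names RoundTripDemo and LETTERS, are assumptions for illustration, not part of the assignment as described.

```java
final class RoundTripDemo {
    // Position in this string is the 2-bit code: A=0, C=1, G=2, T=3.
    static final String LETTERS = "ACGT";

    // Pack a sequence four letters per byte; unused bits in the last byte stay 0.
    static byte[] compress(String sequence) {
        byte[] packed = new byte[(sequence.length() + 3) / 4]; // round up to whole bytes
        for (int i = 0; i < sequence.length(); i++) {
            int code = LETTERS.indexOf(sequence.charAt(i));
            if (code < 0) {
                throw new IllegalArgumentException("not a DNA base: " + sequence.charAt(i));
            }
            int shift = 6 - 2 * (i % 4); // first letter of each group goes in the top two bits
            packed[i / 4] |= code << shift;
        }
        return packed;
    }

    // Unpack the first `length` letters; the caller has to remember the original length.
    static String decompress(byte[] packed, int length) {
        StringBuilder out = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            int shift = 6 - 2 * (i % 4);
            out.append(LETTERS.charAt((packed[i / 4] >> shift) & 3)); // & 3 keeps two bits
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String original = "GCATTACTGA";
        byte[] packed = compress(original);
        System.out.println(packed.length); // prints 3
        System.out.println(decompress(packed, original.length())); // prints GCATTACTGA
    }
}
```

Without the stored length, the padding bits in the last byte would be indistinguishable from trailing A letters, which is why this sketch passes the count explicitly.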

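In your decompress I would stay with bitwise operations, as in the encode, and forget the arithmetic, i.e.
Delete lines 29/30 (the if(sequence[i] < 0) check and the + 127 adjustment under it) - this is definitely going to mess up any byte starting with a 1 bit (i.e. groups of four beginning with G or T - e.g. your second test case).
Replace sequence[i]%4 with sequence[i]&3 to mask out all but the last 2 bits. Unlike %, which gives a negative remainder for a negative byte, & 3 always yields 0 to 3.
Most of the (byte) casts are redundant or undesirable (forcing int back to byte when the next operation will convert back to int anyway); only the ones that assign back into a byte are needed.

Of these comments, the first I think is your main problem; the second is needed along with it so that negative bytes still decode, and the last is just tidying up.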
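A short sketch of the masking point, using the first byte of the second test case. "GCAT" packs to the bits 10010011, which is -109 as a signed Java byte. The class name MaskDemo and this loop-by-shift version of decompress are illustrative, not the original poster's code.

```java
final class MaskDemo {
    // Unpack each byte into four letters, highest two bits first.
    static String decompress(byte[] packed) {
        final char[] letters = { 'A', 'C', 'G', 'T' };
        StringBuilder out = new StringBuilder(packed.length * 4);
        for (byte b : packed) {
            for (int shift = 6; shift >= 0; shift -= 2) {
                // The sign extension from >> lands in the high bits, which & 3 discards.
                out.append(letters[(b >> shift) & 3]);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        byte gcat = (byte) 0x93; // bits 10 01 00 11, i.e. G C A T
        System.out.println(gcat);     // prints -109
        System.out.println(gcat % 4); // prints -1, so none of the == 0..3 tests would match
        System.out.println(gcat & 3); // prints 3, the code for T
        System.out.println(decompress(new byte[] { gcat, (byte) 0xC7 })); // prints GCATTACT
    }
}
```

Reading the bits by shift amount also leaves the input array unmodified, whereas shifting sequence[i] in place destroys the compressed data as it is decoded.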

Wow, that simple fix did it! Thanks!

Yeah, all I did was delete lines 29/30 (the if(sequence[i] < 0) adjustment) and change sequence[i]%4 to sequence[i]&3.

Thanks!
