Hi there!

This is an excerpt from a PDB file that my team has extracted and saved as a text file.

ATOM 1 N ASN A 2 18.668 27.299 52.379 1.00 41.19 N
ATOM 2 CA ASN A 2 19.400 26.674 53.492 1.00 40.18 C
ATOM 3 C ASN A 2 19.710 27.737 54.550 1.00 37.56 C
ATOM 4 O ASN A 2 19.123 27.737 55.640 1.00 38.90 O
ATOM 5 N LEU A 3 20.637 28.606 54.184 1.00 34.40 N

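The coordinates we need are the three decimal values after the residue number on each line (for example 18.668 27.299 52.379). We want to extract them in the form (x,y,z), one per line in a long list, and save the list as a new text file.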

Would greatly appreciate your help.

I understand that I can't extract the columns directly and that I have to parse each line and split it into tokens. I can't load everything into arrays, though, because there are thousands of these files to extract. Could someone teach me?


I have an idea in mind, and that is to read the file line by line as strings and split each line into tokens. Then it will most probably come out as something like:

1. ATOM
2. 1
3. N
4. ASN
5. A
6. 2
7. 18.668
8. 27.299
9. 52.379
10. 1.00
11. 41.19
12. N

so one line will become 12 tokens.

Right. From here, I do a loop that writes only the 7th to 9th tokens of each line into a new text file. Can this be done? Kindly advise me. Thanks!

Another option is to use Excel files.

If a text file is mandatory, then String.split() or StringTokenizer may fulfil your requirements.

Use StringTokenizer and skip over the first n tokens to get to the ones you want.

Alrighty. Thanks for the advice. Studying the usage now.

Hi, I came up with this, but I am getting really confused by the methods and classes. Could anyone help me straighten out my thoughts? I am so sorry to trouble you.

import java.util.*;
import java.io.*;

public class CoorToks {

    public StringTokenizer(String token); //invalid method declaration
    public static void main(String[] args) throws IOException {
        BufferedReader inputStream = null; // scan input line by line
        PrintWriter outputStream = null;// output aligned the same way
        Pattern delim=Pattern.compile("/s");

       String token;
	    StringTokenizer tokenizer = new StringTokenizer(token);


        try {
            inputStream =
                new BufferedReader(new FileReader("1APB.pdb.txt"));
            outputStream =
                new PrintWriter(new FileWriter("characteroutput.txt"));
            	while(tokenizer.hasMoreTokens())
            	{
                 if (token.trim().startsWith("ATOM") && !token.trim().endsWith("H")) // I need to scan for the word "ATOM" before i start tokenizing. ends at H.
                 {
                  // and i only need the 7th to 9th tokens of each line.
                  // should i use a pattern delimiter instead?
                  String tokens[]=delim.split(token);
                  double x= Double.parseDouble(tokens[7]);
                  double y= Double.parseDouble(tokens[8]);
                  double z= Double.parseDouble(tokens[9]);
                  outputStream.println(token);

                  //the compiler says it can't find variable tokens. which means i have to do a declaration of variables?
                  // how do i do that when there are so many tokens coming from the text file.

                 }
            	}


        }//end of try

         finally {
         	while ((token = inputStream.readLine()) != null) {
                outputStream.println(token);}
            if (inputStream != null) {
                inputStream.close();
            }
            if (outputStream != null) {
                outputStream.close();
            }
        }
    }
}

Why are you mixing usage of StringTokenizer with the String.split() method?

Try debugging your program by printing out the values of the variables as they are read and parsed/tokenized. You need to see what each method and class is doing to your data.
To print an array use the Arrays.toString() method.
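
To make that debugging suggestion concrete, here is a minimal sketch. It assumes the line is split on runs of whitespace with String.split("\\s+"); the class name TokenDebug and the helper tokenize are made up for illustration and are not from the original program.

```java
import java.util.Arrays;

public class TokenDebug {

    // Splits one line on runs of whitespace (an assumption about the file layout).
    static String[] tokenize(String line) {
        return line.trim().split("\\s+");
    }

    public static void main(String[] args) {
        String line = "ATOM 1 N ASN A 2 18.668 27.299 52.379 1.00 41.19 N";
        String[] tokens = tokenize(line);
        // Arrays.toString shows every element, so you can see where each value landed.
        System.out.println(tokens.length + " tokens: " + Arrays.toString(tokens));
        // Array indexes start at 0, so the 7th to 9th tokens are tokens[6] to tokens[8].
        System.out.println("x=" + tokens[6] + " y=" + tokens[7] + " z=" + tokens[8]);
    }
}
```

Printing the array this way should make it obvious that indexes 7, 8 and 9 are one position too far for the coordinates.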

compiler says it can't find variable tokens.

Please copy and paste full text of any error messages.

OK. I'll work on it now. Thanks!

Here is the code for you:

import java.util.*;
import java.io.*;
import java.util.regex.Pattern;

public class Main {

    //invalid method declaration

    public static void main(String[] args) throws IOException {
        BufferedReader inputStream = null; // scan input line by line
        PrintWriter outputStream = null;   // output aligned the same way
        Pattern delim = Pattern.compile("/s");

        String token = null;

        try {
            inputStream =
                new BufferedReader(new FileReader("c:/Users/ashish/documents/test.txt"));
            System.out.println("hkjdshjk");
            outputStream =
                new PrintWriter(new FileWriter("c:/Users/ashish/documents/wtest.txt"));

            while ((token = inputStream.readLine()) != null) {
                int i = 0;
                // I need to scan for the word "ATOM" before i start tokenizing. ends at H.
                if (token.trim().startsWith("ATOM") && !token.trim().endsWith("H")) {
                    token = token.substring(17);
                    StringTokenizer tokenizer = new StringTokenizer(token);
                    System.out.println(token);
                    while (tokenizer.hasMoreTokens() && i < 3) {
                        String st = tokenizer.nextToken();
                        double x = Double.parseDouble(st);
                        outputStream.print(x);
                        outputStream.print("\t");
                        i++;
                    }
                    // and i only need the 7th to 9th tokens of each line.
                    outputStream.println("");
                    // should i use a pattern delimiter instead?

                    /*String tokens[]=delim.split(token);
                    double x= Double.parseDouble(tokens[7]);
                    double y= Double.parseDouble(tokens[8]);
                    double z= Double.parseDouble(tokens[9]);
                    outputStream.println(token);
                    */
                    //the compiler says it can't find variable tokens. which means i have to do a declaration of variables?
                }
            }
        }//end of try
        finally {
            while ((token = inputStream.readLine()) != null) {
                outputStream.println(token);
            }
            if (inputStream != null) {
                inputStream.close();
            }
            if (outputStream != null) {
                outputStream.close();
            }
        }
    }
}

token=token.substring(17);

What is the 17 for? Is the data column aligned?

Hi Agen8k, I really appreciate you and your efforts. But, like NormR1 highlighted, I think you are only getting, or rather printing out, the first 17 tokens? I understand the concept behind what you are doing, but my text file has many, many values that I would need to tokenize. And the file being copied over is exactly the same as the original text, which is also not what I am looking for. So yeah, I'm working on it! :) Thanks man!

NormR1, there has been a twist in what I need to extract. Now it's not only the x and y coordinate values; I need to scan out the Protein ID too.

For example, in this protein text file:

HEADER BINDING PROTEINS 15-NOV-91 1APB
TITLE A PRO TO GLY MUTATION IN THE HINGE OF THE ARABINOSE-BINDING
TITLE 2 PROTEIN ENHANCES BINDING AND ALTERS SPECIFICITY: SUGAR-
TITLE 3 BINDING AND CRYSTALLOGRAPHIC STUDIES
ATOM 1 N ASN A 2 18.668 27.299 52.379 1.00 41.19 N
ATOM 2 CA ASN A 2 19.400 26.674 53.492 1.00 40.18 C
ATOM 3 C ASN A 2 19.710 27.737 54.550 1.00 37.56 C
ATOM 4 O ASN A 2 19.123 27.737 55.640 1.00 38.90 O
ATOM 5 N LEU A 3 20.637 28.606 54.184 1.00 34.40 N
ATOM 6 CA LEU A 3 20.986 29.784 54.994 1.00 29.68 C
ATOM 7 C LEU A 3 19.777 30.725 55.041 1.00 26.30 C
ATOM 8 O LEU A 3 19.051 30.861 54.044 1.00 26.16 O
ATOM 9 CB LEU A 3 22.239 30.409 54.379 1.00 30.97 C
ATOM 10 CG LEU A 3 22.816 31.630 55.080 1.00 33.27 C
the list goes on...

But as you can see:
1) I only have "ATOM" to serve as an indicator for a new string.
2) The protein code, i.e. "1APB", is also what I need to extract.
3) The coordinate values (the three decimal numbers after the residue number on each ATOM line) are the values I need to extract.

Do you have any further advice that I should consider?

Cheers.
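
For the protein code, one possible approach is sketched below. It assumes, based only on the sample HEADER line above, that the ID is the last whitespace-separated token of the line that starts with HEADER; the class name ProteinId and the method idFromHeader are hypothetical names chosen for this sketch.

```java
public class ProteinId {

    // Returns the last whitespace-separated token of a HEADER line,
    // or null if the line is not a HEADER line or has nothing after the keyword.
    // Assumption: the ID code is always the final token, as in the sample line.
    static String idFromHeader(String line) {
        String trimmed = line.trim();
        if (!trimmed.startsWith("HEADER")) {
            return null;
        }
        String[] tokens = trimmed.split("\\s+");
        if (tokens.length < 2) {
            return null;
        }
        return tokens[tokens.length - 1];
    }

    public static void main(String[] args) {
        System.out.println(idFromHeader("HEADER BINDING PROTEINS 15-NOV-91 1APB"));
    }
}
```

The same line-by-line loop that handles the ATOM lines could call this on each line and remember the first non-null result.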

I thought they were aligned, but they are not. We need a little change.


Well, actually, in the saved output file it is aligned because of the

BufferedReader inputStream = null; // scan input line by line
      PrintWriter outputStream = null;// output aligned the same way

But for the information of anyone wondering, we're looking at whitespace, "/s" :) Cheers. Thanks a lot!
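
A note on that pattern, with a small sketch: in a Java regex the whitespace class is written "\\s" (backslash, not forward slash), and "\\s+" matches a run of one or more whitespace characters. The pattern "/s" only matches a literal slash followed by the letter s. The class name WhitespaceSplit and the helper splitOnWhitespace are made up for this example.

```java
import java.util.regex.Pattern;

public class WhitespaceSplit {

    // Splits a line on runs of whitespace.
    static String[] splitOnWhitespace(String line) {
        return Pattern.compile("\\s+").split(line.trim());
    }

    public static void main(String[] args) {
        String line = "ATOM 2 CA ASN A 2 19.400 26.674 53.492 1.00 40.18 C";
        // "/s" is a literal slash followed by 's', which never occurs in this line,
        // so the whole line comes back as a single element.
        System.out.println(Pattern.compile("/s").split(line).length);
        // "\\s+" matches one or more whitespace characters, giving one element per column.
        System.out.println(splitOnWhitespace(line).length);
    }
}
```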

Use an if condition to check whether "CA" appears in a line; if it does, start your substring from the 18th character, otherwise from the 17th.

I hope that works.

?

I am sorry, which "CA" are you looking at?

ATOM 2 "CA" ASN A 2 19.400 26.674 53.492 1.00 40.18 C


Only these lines have coordinates at the 18th char, I suppose.

I can't possibly just tokenize the lines this way. I have so many lines that there is no way this will work; it will take forever. I have a million text files in my database that I need to run this program on. Haha.

I can't possibly just tokenize the lines this way

What way is "this way"?
If you have a text file with multiple lines in the format you've shown and you want to get data from those files you'll need to read the lines one by one from the files.
Then test a line to see if it has the data you want.
If so, tokenize the line, skip the first n tokens to get to the tokens/data you want, save those tokens and then go back and read the next line.
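
The steps above can be sketched roughly as follows. This is only an illustration under some assumptions: coordinates are the 7th to 9th whitespace-separated tokens of lines starting with ATOM, and the output format is (x,y,z). The names AtomCoordinates, coordinatesOf and extract are invented for the sketch, and the sample text stands in for a real file.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.io.StringReader;
import java.io.StringWriter;
import java.util.StringTokenizer;

public class AtomCoordinates {

    // Returns "(x,y,z)" for an ATOM line, or null if the line holds no coordinates.
    static String coordinatesOf(String line) {
        if (!line.trim().startsWith("ATOM")) {
            return null;
        }
        StringTokenizer tokenizer = new StringTokenizer(line);
        if (tokenizer.countTokens() < 9) {
            return null; // too short to contain x, y and z
        }
        for (int i = 0; i < 6; i++) {
            tokenizer.nextToken(); // skip ATOM, serial, atom name, residue, chain, residue number
        }
        String x = tokenizer.nextToken();
        String y = tokenizer.nextToken();
        String z = tokenizer.nextToken();
        return "(" + x + "," + y + "," + z + ")";
    }

    // Reads one line at a time, so memory use does not grow with the file size.
    static void extract(BufferedReader in, PrintWriter out) throws IOException {
        String line;
        while ((line = in.readLine()) != null) {
            String coordinates = coordinatesOf(line);
            if (coordinates != null) {
                out.println(coordinates);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        String sample = "HEADER BINDING PROTEINS 15-NOV-91 1APB\n"
                + "ATOM 1 N ASN A 2 18.668 27.299 52.379 1.00 41.19 N\n"
                + "ATOM 2 CA ASN A 2 19.400 26.674 53.492 1.00 40.18 C\n";
        StringWriter result = new StringWriter();
        try (BufferedReader in = new BufferedReader(new StringReader(sample));
             PrintWriter out = new PrintWriter(result)) {
            extract(in, out);
        }
        System.out.print(result);
    }
}
```

For real files, the StringReader and StringWriter would be swapped for a FileReader and a FileWriter. Since only one line is held at a time, no array of lines is needed, however many files are processed.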

I agree with NormR1.

The following code may
(1) extract the coordinate values in the three float columns of the file "data.txt" in the same folder, and
(2) print only the extracted float data to the DOS window.

import java.io.*;
import java.util.*;


public class CoorToks {

   public static void main(String args[]) {
      // try-with-resources closes the reader even if reading fails
      try (BufferedReader input = new BufferedReader(
               new FileReader(new File("data.txt")))) {
         String text;
         while ((text = input.readLine()) != null) {
            StringTokenizer s = new StringTokenizer(text, " ");
            int counter = 0;
            while (s.hasMoreTokens()) {
               String ss = s.nextToken();
               counter++;
               // tokens 7, 8 and 9 hold the x, y and z coordinates
               if (counter > 6 && counter < 10)
                  System.out.print(ss + " ");
            }
            System.out.println();
         }
      } catch (IOException ioException) {
         // report the problem instead of silently ignoring it
         ioException.printStackTrace();
      }
   }
}

Taking advantage of output redirection, a powerful feature of DOS (and Unix/Linux), the following Java code could be useful for your data extraction. The original text data file is data.txt, while the extracted data are stored in the text file OutputData.dat.

import java.io.*;
import java.util.*;
public class CoorToks1 {
   public static void main(String args[]) {
      if (args.length < 1) {
         System.err.println("Usage: java CoorToks1 <input file>");
         return;
      }
      // try-with-resources closes the reader even if reading fails
      try (BufferedReader input = new BufferedReader(
               new FileReader(new File(args[0])))) {
         String text;
         System.out.println("x\ty\tz");
         while ((text = input.readLine()) != null) {
            StringTokenizer s = new StringTokenizer(text, " ");
            int counter = 0;
            while (s.hasMoreTokens()) {
               String ss = s.nextToken();
               counter++;
               // tokens 7, 8 and 9 hold the x, y and z coordinates
               if (counter > 6 && counter < 10)
                  System.out.print(ss + "\t");
            }
            System.out.println();
         }
      } catch (IOException ioException) {
         // report the problem instead of silently ignoring it
         ioException.printStackTrace();
      }
   }
}

The command on DOS is shown as follows.
java CoorToks1 data.txt >OutputData.dat

Hey. Thank you very much for your help on both occasions. I'll get back soon! Cheers.

I have made the following program, which reads the text file "data.txt" (in the format you defined in your first post) and extracts the columns of coordinate values x, y, z into a new file "myfile.txt". Both files are in the same folder as the code file. I hope this program may help you in your work.

/* The class FileUtil is originally defined by 
 * Java source code example
 * http://www.javadb.com/write-lines-of-text-to-file-using-a-printwriter
 * The method writeLinesToFile(...) has been modified.
 */

import java.io.*;
import java.util.*;

class FileUtil {

  // Writes the given lines to the file, one per line.
  public void writeLinesToFile(String filename,
              List<String> linesToWrite,
              boolean appendToFile) {

    // FileWriter(filename, true) starts writing at the end of an existing file;
    // FileWriter(filename, false) overwrites it.
    try (PrintWriter pw = new PrintWriter(new FileWriter(filename, appendToFile))) {
      for (String line : linesToWrite) {
        pw.println(line);
      }
    }
    catch (IOException e) {
      e.printStackTrace();
    }
  }
}

public class CoorToks0 {

   public static void main(String args[]) {
      // A List grows as needed, so the input is not limited to a fixed number of lines.
      List<String> lines = new ArrayList<String>();
      lines.add("x\ty\tz");
      // try-with-resources closes the reader even if reading fails
      try (BufferedReader input = new BufferedReader(
               new FileReader(new File("data.txt")))) {
         String text;
         while ((text = input.readLine()) != null) {
            StringTokenizer s = new StringTokenizer(text, " ");
            int counter = 0;
            String line = "";
            while (s.hasMoreTokens()) {
               String ss = s.nextToken();
               counter++;
               // tokens 7, 8 and 9 hold the x, y and z coordinates
               if (counter > 6 && counter < 10) {
                  line += ss;
                  line += "\t";
               }
            }
            // Java strings need no "\0" terminator, so none is appended.
            lines.add(line);
         }
      } catch (IOException ioException) {
         ioException.printStackTrace();
      }
      FileUtil util = new FileUtil();
      util.writeLinesToFile("myfile.txt", lines, false);
   }
}

Thank you so much, I am studying the code you gave.

I think the confusion is in the suggestion that you find the substrings by position in the string - the data is not fixed-width, and if it were, there's no reason to count on it remaining so, since data formats change. "Why can't we use a two-character value in column 3? Can't you just change the code to deal with it?" Unless fixed-width is part of your data definition, don't rely on it.

Norm is right - splitting the file into tokens is the way to go. You can either read line by line, or read the whole file into one long string and split that on your line separator ("ATOM") - a more perl-like solution, less economical but possibly easier to code.
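
The whole-file variant might look something like the sketch below. It assumes the word ATOM appears only at the start of coordinate records (a TITLE or REMARK line containing the word would confuse it) and that the whole file fits comfortably in memory. SplitOnAtom and coordinates are names made up for this example, and the text is a literal string here rather than a file read from disk.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitOnAtom {

    // Splits the whole text on the word ATOM and pulls x, y, z out of each chunk.
    static List<String> coordinates(String wholeFile) {
        List<String> result = new ArrayList<>();
        String[] chunks = wholeFile.split("ATOM");
        // chunks[0] is whatever came before the first ATOM (HEADER, TITLE...), so skip it.
        for (int i = 1; i < chunks.length; i++) {
            String[] tokens = chunks[i].trim().split("\\s+");
            if (tokens.length >= 8) {
                // ATOM itself was consumed by the split, so x, y, z are tokens[5] to tokens[7].
                result.add("(" + tokens[5] + "," + tokens[6] + "," + tokens[7] + ")");
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String wholeFile = "HEADER BINDING PROTEINS 15-NOV-91 1APB\n"
                + "ATOM 1 N ASN A 2 18.668 27.299 52.379 1.00 41.19 N\n"
                + "ATOM 2 CA ASN A 2 19.400 26.674 53.492 1.00 40.18 C\n";
        for (String c : coordinates(wholeFile)) {
            System.out.println(c);
        }
    }
}
```

Reading line by line is probably the safer choice when there are many large files, since this version holds the full text and every chunk in memory at once.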

Thanks JON! You're right on. I'm currently still packaging it with other class files and methods. I'll get back to you guys when I'm done!

Yeah, I was thinking over using "CA" as the "trigger" to count to the 18th character, as Agen8k proposed, and thought it couldn't work because not every line of the text file has "CA". Now I see your point that he was on the topic of substrings.

Thanks EVERYBODY!
