How to extract columns of coordinate values from a text file (in Java)?
Hi there!
This is an excerpt from a PDB file that my team has extracted in the form of a text file.
ATOM 1 N ASN A 2 18.668 27.299 52.379 1.00 41.19 N
ATOM 2 CA ASN A 2 19.400 26.674 53.492 1.00 40.18 C
ATOM 3 C ASN A 2 19.710 27.737 54.550 1.00 37.56 C
ATOM 4 O ASN A 2 19.123 27.737 55.640 1.00 38.90 O
ATOM 5 N LEU A 3 20.637 28.606 54.184 1.00 34.40 N
The coordinates we need are the 7th to 9th values on each line (for example 18.668, 27.299 and 52.379 on the first line). We need to extract them in the form (x,y,z), as one long list, and save them as a new text file.
Would greatly appreciate your help.
I understand that I can't extract columns directly, and that I have to parse each line and split it into tokens. I can't use arrays, though, because there are thousands of these files to extract from. Could someone teach me?
jemimaloh
Junior Poster in Training
50 posts since Jun 2010
Reputation Points: 10
Solved Threads: 0
I have an idea in mind: read the file line by line as strings and split each line into tokens. Then
it will most probably come out as something like:
1 ATOM
2 1
3 N
4 ASN
5 A
6 2
7 18.668
8 27.299
9 52.379
10 1.00
11 41.19
12 N
So one line will become 12 tokens.
Right. From here, I do a loop that writes only the 7th to 9th tokens of each line into a new text file. Can this be done? Kindly advise me. Thanks!
jemimaloh
Use StringTokenizer and skip over the first n tokens to get to the ones you want.
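A minimal sketch of that suggestion, assuming the fields are separated by whitespace as in the excerpt above. The class name SkipTokens and the helper coordinates are hypothetical names chosen for illustration.

```java
import java.util.StringTokenizer;

class SkipTokens {
    // Hypothetical helper: builds "(x,y,z)" from the 7th to 9th
    // whitespace-separated tokens, or returns null if the line is too short.
    static String coordinates(String line) {
        // With no delimiter argument, StringTokenizer splits on whitespace.
        StringTokenizer st = new StringTokenizer(line);
        if (st.countTokens() < 9) {
            return null;
        }
        // Skip the first six tokens: ATOM, serial number, atom name,
        // residue name, chain and residue number.
        for (int i = 0; i < 6; i++) {
            st.nextToken();
        }
        String x = st.nextToken();
        String y = st.nextToken();
        String z = st.nextToken();
        return "(" + x + "," + y + "," + z + ")";
    }

    public static void main(String[] args) {
        String line = "ATOM 1 N ASN A 2 18.668 27.299 52.379 1.00 41.19 N";
        System.out.println(coordinates(line)); // prints (18.668,27.299,52.379)
    }
}
```

The tokens are kept as strings rather than parsed to double, so trailing zeros such as 55.640 are written out unchanged.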
NormR1
Posting Expert
6,677 posts since Jun 2010
Reputation Points: 1,138
Solved Threads: 656
Alrighty. Thanks for the advice. Studying the usage now.
jemimaloh
Hi, I came up with this, but the way I am using the methods and classes is really out of this world. Could anyone help me straighten out my thoughts? I am so sorry to trouble you.
import java.util.*;
import java.io.*;
public class CoorToks {
    public StringTokenizer(String token); //invalid method declaration
    public static void main(String[] args) throws IOException {
        BufferedReader inputStream = null; // scan input line by line
        PrintWriter outputStream = null; // output aligned the same way
        Pattern delim = Pattern.compile("/s");
        String token;
        StringTokenizer tokenizer = new StringTokenizer(token);
        try {
            inputStream =
                new BufferedReader(new FileReader("1APB.pdb.txt"));
            outputStream =
                new PrintWriter(new FileWriter("characteroutput.txt"));
            while (tokenizer.hasMoreTokens())
            {
                if (token.trim().startsWith("ATOM") && !token.trim().endsWith("H")) // I need to scan for the word "ATOM" before i start tokenizing. ends at H.
                {
                    // and i only need the 7th to 9th tokens of each line.
                    // should i use a pattern delimiter instead?
                    String tokens[] = delim.split(token);
                    double x = Double.parseDouble(tokens[7]);
                    double y = Double.parseDouble(tokens[8]);
                    double z = Double.parseDouble(tokens[9]);
                    outputStream.println(token);
                    //the compiler says it can't find variable tokens. which means i have to do a declaration of variables?
                    // how do i do that when there are so many tokens coming from the text file.
                }
            }
        } //end of try
        finally {
            while ((token = inputStream.readLine()) != null) {
                outputStream.println(token);
            }
            if (inputStream != null) {
                inputStream.close();
            }
            if (outputStream != null) {
                outputStream.close();
            }
        }
    }
}
jemimaloh
Why are you mixing usage of StringTokenizer with the String.split() method?
Try debugging your program by printing out the values of the variables as they are read and parsed/tokenized. You need to see what each method and class is doing to your data.
To print an array use the Arrays.toString() method.
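As a rough illustration of that debugging step, the sketch below splits one of the sample lines and prints the resulting array. The class name PrintTokens and the helper describe are made up for the example, and it assumes the fields are separated by whitespace.

```java
import java.util.Arrays;

class PrintTokens {
    // Hypothetical helper: shows how many tokens a line produces and what they are.
    static String describe(String line) {
        // "\\s+" is the regular expression for one or more whitespace characters.
        String[] tokens = line.trim().split("\\s+");
        return tokens.length + " tokens: " + Arrays.toString(tokens);
    }

    public static void main(String[] args) {
        String line = "ATOM 1 N ASN A 2 18.668 27.299 52.379 1.00 41.19 N";
        // prints 12 tokens: [ATOM, 1, N, ASN, A, 2, 18.668, 27.299, 52.379, 1.00, 41.19, N]
        System.out.println(describe(line));
    }
}
```

Printing the array this way also makes the indexing visible: Java arrays start at 0, so the 7th to 9th tokens are tokens[6] to tokens[8], not tokens[7] to tokens[9].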
compiler says it can't find variable tokens.
Please copy and paste full text of any error messages.
NormR1
OK. I'll work on it now. Thanks!
jemimaloh
token=token.substring(17);
What is the 17 for? Is the data column aligned?
NormR1
Hi Agen8k, I really appreciate you and your efforts. But, like NormR1 highlighted, I think you are only getting in, or rather printing out, the first 17 tokens? I understand the concept behind what you are doing, but my text file has many, many values that I would need to tokenize. Also, the file being copied over is exactly the same as what was originally in the text, which is not what I am looking for. So yeah, I'm working on it! :) Thanks, man!
jemimaloh
NormR1, there has been a twist in what I need to extract. Now it's not only the coordinate values; I need to pull out the protein ID too.
For example, in this protein text file:
HEADER BINDING PROTEINS 15-NOV-91 1APB
TITLE A PRO TO GLY MUTATION IN THE HINGE OF THE ARABINOSE-BINDING
TITLE 2 PROTEIN ENHANCES BINDING AND ALTERS SPECIFICITY: SUGAR-
TITLE 3 BINDING AND CRYSTALLOGRAPHIC STUDIES
ATOM 1 N ASN A 2 18.668 27.299 52.379 1.00 41.19 N
ATOM 2 CA ASN A 2 19.400 26.674 53.492 1.00 40.18 C
ATOM 3 C ASN A 2 19.710 27.737 54.550 1.00 37.56 C
ATOM 4 O ASN A 2 19.123 27.737 55.640 1.00 38.90 O
ATOM 5 N LEU A 3 20.637 28.606 54.184 1.00 34.40 N
ATOM 6 CA LEU A 3 20.986 29.784 54.994 1.00 29.68 C
ATOM 7 C LEU A 3 19.777 30.725 55.041 1.00 26.30 C
ATOM 8 O LEU A 3 19.051 30.861 54.044 1.00 26.16 O
ATOM 9 CB LEU A 3 22.239 30.409 54.379 1.00 30.97 C
ATOM 10 CG LEU A 3 22.816 31.630 55.080 1.00 33.27 C
The list goes on...
But as you can see:
1) I only have "ATOM" to serve as an indicator for a new string.
2) The protein code, i.e. "1APB", is also what I need to extract.
3) The coordinate values (the 7th to 9th values on each ATOM line) are the values I need to extract.
Do you have any further advice that I should consider?
Cheers.
jemimaloh
I thought they were aligned, but they are not. We need a little change.
Well, actually, in the saved output file it is aligned because of the
BufferedReader inputStream = null; // scan input line by line
PrintWriter outputStream = null;// output aligned the same way
But for the information of anyone wondering, we're looking at whitespace, "/s" :) Cheers. Thanks a lot!
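One caution on that delimiter: in a Java regular expression, whitespace is written with a backslash ("\\s" inside a string literal), whereas "/s" matches a literal slash followed by the letter s. The small comparison below is only a sketch; the class name WhitespaceSplit and the helper countTokens are hypothetical.

```java
import java.util.regex.Pattern;

class WhitespaceSplit {
    // Hypothetical helper: counts the pieces a line is split into by the given regex.
    static int countTokens(String regex, String line) {
        return Pattern.compile(regex).split(line.trim()).length;
    }

    public static void main(String[] args) {
        String line = "ATOM 1 N ASN A 2 18.668 27.299 52.379 1.00 41.19 N";
        // "/s" never matches here, so the whole line comes back as a single piece.
        System.out.println(countTokens("/s", line));    // prints 1
        // "\\s+" matches each run of whitespace, giving the 12 expected tokens.
        System.out.println(countTokens("\\s+", line));  // prints 12
    }
}
```

Using "\\s+" rather than "\\s" means runs of several spaces between columns are treated as one separator instead of producing empty tokens.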
jemimaloh
I am sorry, which "CA" are you looking at?
jemimaloh
I can't possibly just tokenize the lines this way. I have so many lines that there is no way this will work. It will take forever. I have a million text files in my database that I need to run this program on. Haha.
jemimaloh
I can't possibly just tokenize the lines this way
What way is "this way"?
If you have a text file with multiple lines in the format you've shown and you want to get data from those files you'll need to read the lines one by one from the files.
Then test a line to see if it has the data you want.
If so, tokenize the line, skip the first n tokens to get to the tokens/data you want, save those tokens and then go back and read the next line.
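The steps above can be sketched as follows. This is a minimal sketch under stated assumptions, not a definitive implementation: the class name ExtractCoordinates, the method extract and the command-line file arguments are illustrative, the fields are assumed to be separated by whitespace as in the excerpts, and the protein ID is assumed to be the last token of the HEADER line.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.StringTokenizer;

class ExtractCoordinates {
    // Reads the input one line at a time and writes the protein ID followed by
    // one "(x,y,z)" line per ATOM record. No array of lines is ever built.
    static void extract(BufferedReader in, PrintWriter out) throws IOException {
        String line;
        while ((line = in.readLine()) != null) {
            StringTokenizer st = new StringTokenizer(line);
            if (!st.hasMoreTokens()) {
                continue; // blank line
            }
            String record = st.nextToken();
            if (record.equals("HEADER")) {
                // Assumption: the ID (e.g. 1APB) is the last token on the HEADER line.
                String id = null;
                while (st.hasMoreTokens()) {
                    id = st.nextToken();
                }
                if (id != null) {
                    out.println(id);
                }
            } else if (record.equals("ATOM") && st.countTokens() >= 8) {
                // "ATOM" is already consumed; skip tokens 2 to 6 to reach x, y, z.
                for (int i = 0; i < 5; i++) {
                    st.nextToken();
                }
                String x = st.nextToken();
                String y = st.nextToken();
                String z = st.nextToken();
                out.println("(" + x + "," + y + "," + z + ")");
            }
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.out.println("usage: java ExtractCoordinates <input file> <output file>");
            return;
        }
        // try-with-resources closes both files even if reading fails.
        try (BufferedReader in = new BufferedReader(new FileReader(args[0]));
             PrintWriter out = new PrintWriter(new FileWriter(args[1]))) {
            extract(in, out);
        }
    }
}
```

Because only one line is held in memory at a time, the number of lines per file does not matter, and the same method can be called in a loop over any number of files. The sketch does not apply the earlier filter that skips lines ending in "H"; that test could be added to the ATOM branch if it is still wanted.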
NormR1