I'm writing a program where I take in a file and need to grab strings of length n (n = 3, 4, 5, 6, 7 or 8).

The issue I'm having is that I need to grab these strings in a sliding window. So if n = 3, I need to grab the first three characters and save them, then grab characters 2, 3, 4 and save them, then grab characters 3, 4, 5 and save them, and so on. I'm using get to grab strings of n length; I just don't know how to offset the window to grab the strings after the first one.

So if the file contained GATCGAT I would need to get and save the strings GAT, ATC, TCG, CGA, and GAT.

Edited 4 Years Ago by Dewey1040: n/a

I see two ways of doing this. Either read the entire contents of the file into a string and then break the string apart, or use file seeking functionality. Here is an example of the file seeking functionality for n=3 strings:

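// needs <cstdio> and <cstring>
FILE *file = fopen(filename, "rb"); // binary mode so relative seeks are well defined
if (!file) return;
char buffer[4]; // holds 3 characters plus the terminating '\0'
// fgets reads at most size-1 characters, so pass 4 to get 3;
// stop at a short read (end of file) or when a newline enters the window
while (fgets(buffer, sizeof buffer, file) && strlen(buffer) == 3 && buffer[2] != '\n')
{
    // add buffer to the list of return values
    // step back 2 so the next read starts one character after this one
    if (fseek(file, -2, SEEK_CUR) != 0) break;
}
fclose(file);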

I'd suggest you read the contents into a std::string and just use the substr method to do the work for you. For example:

#include <iostream>
#include <string>

int main () {
    const std::string input = "GATCGAT";
    std::size_t offset = 0, len = input.size (), window_step = 1;
    std::cout << "INPUT: " << input << std::endl;
    // try every window size from 3 through 8
    for (std::size_t window = 3; window < 9; ++window) {
        std::cout << "Window: " << window << std::endl;
        // keep going while a full window still fits in what is left
        while ((len - offset) >= window) {
            std::cout << " " << input.substr (offset, window) << std::endl;
            offset += window_step;
        }
        offset = 0;  // restart from the beginning for the next size
        std::cout << std::endl;
    }

    return 0;
}

The output from that is:

INPUT: GATCGAT
Window: 3
 GAT
 ATC
 TCG
 CGA
 GAT

Window: 4
 GATC
 ATCG
 TCGA
 CGAT

Window: 5
 GATCG
 ATCGA
 TCGAT

Window: 6
 GATCGA
 ATCGAT

Window: 7
 GATCGAT

Window: 8
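The example above uses a hard-coded string. Here is a rough sketch of how the same idea might look when the input comes from a file. The names read_sequence and sliding_windows are made up for illustration, and skipping every character other than A, C, G and T is an assumption about the file format, not something stated in this thread.

```cpp
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper: read a whole file into a string, keeping only the
// letters A, C, G and T (newlines and anything else are dropped).
// Returns an empty string if the file cannot be opened.
std::string read_sequence(const std::string& path) {
    std::ifstream in(path.c_str());
    std::string seq;
    char ch;
    while (in.get(ch)) {
        if (ch == 'A' || ch == 'C' || ch == 'G' || ch == 'T') {
            seq += ch;
        }
    }
    return seq;
}

// Hypothetical helper: every window of length n, moving one character
// at a time. Returns an empty vector if no complete window fits.
std::vector<std::string> sliding_windows(const std::string& seq, std::size_t n) {
    std::vector<std::string> out;
    if (n == 0 || seq.size() < n) {
        return out;
    }
    for (std::size_t offset = 0; offset + n <= seq.size(); ++offset) {
        out.push_back(seq.substr(offset, n));
    }
    return out;
}
```

Calling sliding_windows(read_sequence(filename), 3) on a file containing GATCGAT should give the same five strings as the Window: 3 output above.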

Thanks for the response. This is the code I'm attempting to use when n = 3. The code is hitting an endless loop somehow. Before trying the fseek method, I tried copying the file into a string, and I was hitting bumps when trying to count the different trimers (sets of 3).

while( !feof( inputFile )){
				
			fgets(chromosome, 3, inputFile);
			fseek(inputFile, -1, SEEK_CUR);

			for( int i = 0; i < 3; i++ ){
				if( chromosome[i] == 'A' ){
					if( i == 1 )
						a = 0;
					if( i == 2 )
						b = 0;
					if( i == 3 )
						c = 0;
				}
				if( chromosome[i] == 'T' ){
					if( i == 1 )
						a = 1;
					if( i == 2 )
						b = 1;
					if( i == 3 )
						c = 1;
				}
				if( chromosome[i] == 'C' ){
					if( i == 1 )
						a = 2;
					if( i == 2 )
						b = 2;
					if( i == 3 )
						c = 2;
				}
				if( chromosome[i] == 'G' ){
					if( i == 1 )
						a = 3;
					if( i == 2 )
						b = 3;
					if( i == 3 )
						c = 3;
				}
				trimer[a][b][c]++;  // count current trimer
			}
		}

EDIT: Ignore the error of checking whether i equals 1, 2 or 3; I changed it to check for 0, 1 or 2.
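A likely explanation for the endless loop, judging only from the snippet shown: a successful fseek clears the stream's end-of-file indicator. Once fgets runs into the end of the file, the fseek that follows resets the flag and steps back one character, so feof never reports true and the last character is read over and over. Two other things look off as well: fgets(chromosome, 3, inputFile) reads at most two characters, and trimer[a][b][c]++ sits inside the for loop, so it runs once per character instead of once per trimer.

Below is a minimal sketch that sidesteps seeking by counting from a string that has already been read from the file. base_index and count_trimers are illustrative names, not from the thread; the mapping A=0, T=1, C=2, G=3 is taken from the code above.

```cpp
#include <cstddef>
#include <string>

// Map a base to the index used in the post above (A=0, T=1, C=2, G=3).
// Returns -1 for any other character.
int base_index(char base) {
    switch (base) {
        case 'A': return 0;
        case 'T': return 1;
        case 'C': return 2;
        case 'G': return 3;
        default:  return -1;
    }
}

// Count every overlapping trimer in seq. The caller must zero counts first.
// Windows containing a character other than A, T, C or G are skipped.
void count_trimers(const std::string& seq, unsigned int counts[4][4][4]) {
    for (std::size_t i = 0; i + 3 <= seq.size(); ++i) {
        int a = base_index(seq[i]);
        int b = base_index(seq[i + 1]);
        int c = base_index(seq[i + 2]);
        if (a < 0 || b < 0 || c < 0) {
            continue;
        }
        ++counts[a][b][c];  // once per window, not once per character
    }
}
```

With the input GATCGAT and a zeroed array, counts[3][0][1] (GAT) should end up as 2, and the entries for ATC, TCG and CGA as 1 each.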

A simple change would allow you to count how many times each window occurs. Consider the following update to my original code:

#include <iostream>
#include <map>
#include <string>

// print each distinct window and how many times it was seen
void show_counts (const std::map< std::string, unsigned int >& trios) {
    std::map< std::string, unsigned int >::const_iterator it = trios.begin(),
        last = trios.end ();
    for (; it != last; ++it) {
        std::cout << "[" << it->first << "] : " << it->second << std::endl;
    }
    std::cout << std::endl;
}

int main () {
    std::map< std::string, unsigned int > group_counts;
    const std::string input = "GATCGAT";
    std::size_t offset = 0, len = input.size (), window_step = 1;
    // i is the window size, from 3 through 8
    for (std::size_t i = 3; i < 9; ++i) {
        while ((len - offset) >= i) {
            std::string trio = input.substr (offset, i);
            group_counts[trio]++;  // operator[] starts a new key at 0
            offset += window_step;
        }
        offset = 0;
        show_counts (group_counts);
        group_counts.clear ();  // start fresh for the next window size
    }

    return 0;
}

That output looks like:

[ATC] : 1
[CGA] : 1
[GAT] : 2
[TCG] : 1

[ATCG] : 1
[CGAT] : 1
[GATC] : 1
[TCGA] : 1

[ATCGA] : 1
[GATCG] : 1
[TCGAT] : 1

[ATCGAT] : 1
[GATCGA] : 1

[GATCGAT] : 1

Which is what I think you are after.
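If the file turns out to be too large to hold in memory (an assumption; nothing in this thread says how big the input is), the same counting can be done while reading one character at a time, with no seeking. This is only a sketch: count_windows is a made-up name, and skipping everything except A, C, G and T is a guess at the file format.

```cpp
#include <cstddef>
#include <istream>
#include <map>
#include <sstream>  // only needed for the std::istringstream usage noted below
#include <string>

// Hypothetical helper: count every window of length n read from `in`,
// keeping only the last n bases in memory. Characters other than
// A, C, G and T (newlines, for example) are skipped.
std::map<std::string, unsigned int> count_windows(std::istream& in, std::size_t n) {
    std::map<std::string, unsigned int> counts;
    if (n == 0) {
        return counts;
    }
    std::string window;
    char ch;
    while (in.get(ch)) {
        if (ch != 'A' && ch != 'C' && ch != 'G' && ch != 'T') {
            continue;
        }
        window += ch;
        if (window.size() > n) {
            window.erase(0, 1);  // drop the oldest base
        }
        if (window.size() == n) {
            ++counts[window];
        }
    }
    return counts;
}
```

It works with any std::istream, so a std::ifstream opened on the file or a std::istringstream holding "GATCGAT" can both be passed in. With n = 3 and that input, GAT should be counted twice and ATC, TCG and CGA once each, matching the first group of output above.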
