Hi everyone,

First post here so feel free to school me if my etiquette needs it.

I'm reading a very long string of data (genome stuff AT TT GT AA AG AA ... etc) and comparing it with a similar string using the Hamming distance (naively checking for similarity entry by entry).

I have quite a few questions to put to you, but let's begin with one that will probably demonstrate my level of C++.

``````int main () {
ifstream indata;
string str;
int j,count;
indata.open("file");

if(!indata) {
cerr << "Error: file could not be opened" << endl;
exit(1);
}

for (j=0; j<899701; j++){
indata >> str[j];        //read in base line
}
while(indata){
count=0;
for (j=0; j<899701; j++){
indata >> str[j];       //read in line to be compared
count=count+(str[j]==str[j]);   //hamming distance
}
}

indata.close();
return 0;
}
``````

My first question: Can I replace

``````    indata >> str[j];       //read in line to be compared
count=count+(str[j]==str[j]);   //hamming distance
``````

with something like

``````    count=count+(str[j]==indata.get);
``````

which would combine the two steps. Also, if this is possible, would it significantly speed things up? I have >5000 strings to compare and each has ~800000 pairs of letters.

I am also open to any suggestions about different approaches to take here. Is C++ a good choice for this kind of problem?

Any help/suggestions/scathing criticism is much appreciated.
Peace.

A one-off calculation of the Hamming distance between two strings takes O(N) time. This code would not be noticeably faster, but it is perhaps cleaner.

``````#include <string>

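// Hamming distance: the number of positions at which a and b differ.
// Returns std::string::npos if the two strings have different lengths.
std::size_t hamming_distance( const std::string& a, const std::string& b )
{
    if( a.size() != b.size() ) return std::string::npos ;
    std::size_t hd = 0 ;
    for( std::size_t i = 0 ; i < a.size() ; ++i ) hd += ( a[i] != b[i] ) ; // count mismatches
    return hd ;
}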

#include <fstream>

int main()
{
std::ifstream file( "file" ) ;
std::string a, b ;
std::getline( file, a ) ;
while( std::getline( file, b ) )
{
std::size_t distance = hamming_distance( a, b ) ;
// use distance
}
}
``````
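On the original question of combining the read and the comparison: `indata.get` needs call parentheses, and `get()` does not skip whitespace the way `>>` does. A minimal sketch of comparing while reading, without storing the second line, might look like the following. The name `streaming_mismatches` and the `ok` flag are made up for illustration, and the sketch assumes every line is formatted exactly like the base line (same spacing, same length). I would not expect it to be noticeably faster than the `std::getline` version; it only saves holding the second line in memory.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>   // std::istringstream, handy for trying this out in memory
#include <string>

// Hypothetical helper: reads the next line of `in` one character at a time
// and compares it against `base` without storing the line.
// Returns the number of positions (within base.size()) that differ.
// `ok` is set to false if no line could be read or if the line length
// is not equal to base.size().
std::size_t streaming_mismatches( std::istream& in, const std::string& base, bool& ok )
{
    std::size_t mismatches = 0 ;
    std::size_t i = 0 ;          // position within the current line
    bool read_any = false ;
    char c ;

    while( in.get(c) )           // get(c) keeps whitespace, unlike operator>>
    {
        read_any = true ;
        if( c == '\n' ) break ;  // end of the current line
        if( i < base.size() && c != base[i] ) ++mismatches ;
        ++i ;
    }

    ok = read_any && ( i == base.size() ) ;
    return mismatches ;
}
```

To use it, read the base line once with `std::getline`, then call the helper in a loop until `ok` comes back false. An `std::istringstream` works as a stand-in for the file when experimenting.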

If you are repeatedly calculating the Hamming distance with a constant set of strings in the file, optimizations may be possible. For instance, encode the genome strings into a bit sequence and then calculate using bit-wise operations (with a `std::bitset<>`).

``````#include <unordered_map>

std::string create_bit_sequence( const std::string& str )
{
static std::unordered_map<char,std::string> m =
{ { 'A', "00" }, { 'C', "01" }, { 'T', "10" }, { 'G', "11" } } ;
std::string result ;
for( char c : str ) result += m[c] ;
return result ;
}
``````
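To make the bit-wise idea a bit more concrete, here is a rough sketch of one way it could go. It is not the only way, and it swaps the string of '0'/'1' characters for packed 64-bit words. The helper names `pack_bases` and `packed_hamming` are invented for this example, the encoding used is A=00, C=01, G=10, T=11, and it assumes a compiler with `<cstdint>` (C++11). Note that it counts differing single bases; if a pair like `AT` should count as one entry, the mask would need adjusting.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: packs bases into 64-bit words, 2 bits per base,
// 32 bases per word. Any character other than A, C, G, T (for example
// the spaces between pairs) is skipped. n_bases receives the base count.
std::vector<std::uint64_t> pack_bases( const std::string& s, std::size_t& n_bases )
{
    std::vector<std::uint64_t> words ;
    n_bases = 0 ;
    for( std::size_t i = 0 ; i < s.size() ; ++i )
    {
        std::uint64_t code = 0 ;
        switch( s[i] )
        {
            case 'A' : code = 0 ; break ;
            case 'C' : code = 1 ; break ;
            case 'G' : code = 2 ; break ;
            case 'T' : code = 3 ; break ;
            default : continue ;  // not a base: move on to the next character
        }
        if( n_bases % 32 == 0 ) words.push_back(0) ;  // start a new word
        words.back() |= code << ( 2 * ( n_bases % 32 ) ) ;
        ++n_bases ;
    }
    return words ;
}

// Counts the bases that differ between two packed sequences.
// Precondition: both were packed from the same number of bases.
std::size_t packed_hamming( const std::vector<std::uint64_t>& a,
                            const std::vector<std::uint64_t>& b )
{
    assert( a.size() == b.size() ) ;
    const std::uint64_t low_bits = 0x5555555555555555ULL ;  // low bit of each 2-bit slot
    std::size_t distance = 0 ;
    for( std::size_t i = 0 ; i < a.size() ; ++i )
    {
        const std::uint64_t x = a[i] ^ b[i] ;                    // set bits mark differences
        const std::uint64_t differs = ( x | ( x >> 1 ) ) & low_bits ;  // one bit per differing base
        distance += std::bitset<64>( differs ).count() ;         // population count
    }
    return distance ;
}
```

The packing cost is paid once per string, so this only helps when the same strings are compared many times. Each XOR then covers 32 bases at once.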

Many thanks. You've given me a lot to think about. I will certainly try using bit-wise operations.

I'm having a number of problems implementing the bit sequence function. For instance

`````` for( char c : str ) result += m[c] ;
``````

gives the error: "a function definition is not allowed here before the ':' token". I should mention I'm using XCode on Mac OSX so I've had to make a couple of changes.

``````#include <tr1/unordered_map>

std::string create_bit_sequence( const std::string& str )
{
static std::tr1::unordered_map<char,std::string> m;
m['A']="00";
m['C']="01";
m['G']="10";
m['T']="11";
std::string result ;
for( char c : str ) result += m[c] ;
return result ;
}
``````

I had to change the declaration of `m` because I was getting the error: 'm' must be initialised by the constructor, not by '{...}'.

Thanks.

I'm using XCode on Mac OSX

If it's a recent version of GCC or clang, you should be able to compile the code with the `-std=c++11` compiler option.

Otherwise:

``````std::string create_bit_sequence( const std::string& str )
{
std::string result ;

for( std::string::size_type i = 0 ; i < str.size() ; ++i  )
{
switch( str[i] )
{
case 'A' : result += "00" ; break ;
case 'C' : result += "01" ; break ;
case 'G' : result += "10" ; break ;
case 'T' : result += "11" ; break ;
default : { /* unexpected */ }
}
}

return result ;
}
``````
`````` for( char c : str ) result += m[c] ;
``````

It's a Java-like way of iterating over a string: at each iteration, `char c` holds the next character of the string.
Here's what my compiler says:

range-based-for loops are not allowed in C++98 mode

Also, perhaps this message explains it better:

This file requires compiler and library support for the upcoming ISO C++ standard, C++0x. This support is currently experimental, and must be enabled with the -std=c++0x or -std=gnu++0x compiler options.

That is, if you have an "older" compiler, like mine. :)
