Comparing 2 Arrays and Counting differences and similarities

Question

bonotevo 0 Newbie Poster

15 Years Ago

This is my first post so please bear with me. I am reading 2 files into 2 arrays and then trying to compare the arrays to each other. When I compare the arrays, I am trying to find out how many locations have matching letters in both strings. For example:
array1[]={AAACCCGTTT} and array2[]={AACCCCGGTT}; locations 0 and 1 match but location 2 is different. My code seems to work on smaller files, but when I open text files that are over 2,200 characters long I get inconsistencies. The output tells me I have more matches than are possible, but when I test with smaller files it seems to work correctly. I am also trying to use this code /*while ((array1[l] != '\0') && (array2[l] != '\0'))*/ to only read the characters and not the empty part of the array. I have tried using this in my for loop, within another for loop to try to get a count of total characters, and in a do-while statement, but it just does not seem to work. Any ideas would be appreciated.

#include <iostream>
#include <fstream>
#include <string>
using namespace std;

int main(){
	char array1[20000];
	char array2[20000];
	char ch;
	char compare1;
	char compare2;
	int match=0;
	int nomatch=0;
	int i=0;
	int j=0;
	int counter1=0;	

	ifstream fin;
	ifstream fin2;
	fin.open("seq0.txt");				while (fin.get(ch)){
			array1[i++] = ch;
			array1[i]=0;}
	fin2.open("seq1.txt");
		while (fin2.get(ch)){
			array2[j++] = ch;
			array2[j]=0;}
	for (int k=0; k<20000; k++){
		while (array1[k] != '\0'){
			counter1++;}
}
	for (int l=0; l<20000; l++){
		if (array1[l] = array2[l]){
			match++;		
		}				
		else {
			nomatch++;
		}
	}
	for (int l=0; l<20000; l++)	{
		compare1 = array1[l];
		compare2 = array2[l];

		if (compare1 = compare2){
			match++;			
		}				
		else {
			nomatch++;
		}
	}
	cout << match << endl;
	cout << nomatch << endl;

	fin.close();
	return 0;
}

c++


Ancient Dragon 5,243 Achieved Level 70

15 Years Ago

Line 32: you are using the wrong operator. Use == to test the two characters, not =, which is the assignment operator. Same problem on line 43.
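To make the symptom concrete, here is a minimal sketch of the two operators side by side. The function names and the use of std::string are assumptions made for this illustration; they are not part of the original program. With =, the condition copies one character over the other and then tests the copied value, which is nonzero for ordinary text, so every compared position is reported as a match.

```cpp
#include <cstddef>
#include <string>

// Buggy variant, kept only to show the symptom (hypothetical name).
// 'a' is taken by value because the assignment below overwrites it.
std::size_t count_with_assignment(std::string a, const std::string& b)
{
    std::size_t matches = 0;
    for (std::size_t i = 0; i < a.size() && i < b.size(); ++i)
    {
        // '=' assigns b[i] to a[i]; the condition is true for any nonzero character
        if ((a[i] = b[i]))
        {
            ++matches;
        }
    }
    return matches;
}

// Corrected variant (hypothetical name): '==' compares the two characters.
std::size_t count_with_comparison(const std::string& a, const std::string& b)
{
    std::size_t matches = 0;
    for (std::size_t i = 0; i < a.size() && i < b.size(); ++i)
    {
        if (a[i] == b[i])
        {
            ++matches;
        }
    }
    return matches;
}
```

For the sample strings from the question, the assignment variant reports 10 matches while the comparison variant reports 8.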

Clinton Portis 211 Practically a Posting Shark

15 Years Ago

I will offer a few improvements to your code and propose a possible pseudo-coded solution to your initial question.

Firstly, reading in an entire file one character at a time is inefficient and should be avoided. There may be some situations where this approach is warranted; however, I believe that this isn't one of them.

Although modern day processors will blast through your file in an instant reading character at a time, extra unnecessary milliseconds can add up depending on your application. I guess it's better to learn about optimization early on so that your future employer will not look at you like a buffoon when there are more obvious optimal methods.

In your defense, I do have suspicions that other methods of reading a file, such as getline(), may at some point internally break down to a char-at-a-time memcpy operation; however, this is unsubstantiated since I do not have access to the secret internal workings of C++ standard functions.

For your current application needs, I would suggest reading the entire file in at once into a char type buffer using the read() function. We'll also be using the seekg() and tellg() functions which we can use easily to get the size of the file (number of characters in file). Lastly, we will be declaring a dynamic array equal to the number of characters contained in the file, as opposed to having to arbitrarily declare an array of a fixed size (very inefficient and could potentially lead to buffer overflow if not used carefully.)

So, here is my proposed solution; a bit more optimal than your code. Perhaps you can add these file I/O techniques to your bag o' tricks:

// var declarations
int length = 0;
char* buffer = NULL;

// get length of file:
fin.seekg (0, ios::end);
length = is.tellg();
fin.seekg (0, ios::beg);

// allocate memory; simple dynamic array of specific size
// much more efficient than having to 'guess' at an array size
buffer = new char [length];

// read data as a block:
fin.read (buffer,length);
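For reference, the same steps can be wrapped into one self-contained function. This is only a sketch under a few assumptions of mine: the helper name read_whole_file is made up, the file is opened in binary mode so the size reported by tellg() matches the number of characters read, and a std::vector<char> stands in for the new[] buffer so that no delete[] is needed.

```cpp
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper: reads an entire file into a vector<char>.
// Returns an empty vector if the file cannot be opened or is empty.
std::vector<char> read_whole_file(const std::string& path)
{
    std::ifstream fin(path.c_str(), std::ios::binary);
    if (!fin)
    {
        return std::vector<char>();
    }

    // get length of file
    fin.seekg(0, std::ios::end);
    const std::streamoff length = fin.tellg();
    fin.seekg(0, std::ios::beg);
    if (length <= 0)
    {
        return std::vector<char>();
    }

    // allocate exactly as many characters as the file holds
    std::vector<char> buffer(static_cast<std::size_t>(length));

    // read data as a block; gcount() reports how many characters arrived
    fin.read(buffer.data(), length);
    buffer.resize(static_cast<std::size_t>(fin.gcount()));
    return buffer;
}
```

Note that a trailing newline in the file ends up in the buffer too, so it may need to be trimmed before comparing sequences.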

You have efficiently managed to read in the entire file like a pro. Reading in your second file can be done in a similar manner into a second char type buffer (array). Now all you need to do is perform a simple array comparison in order to return useful information to the user:

//pseudo-coded solution
//it will be your job to translate this solution into c++ code

create a loop(loop from array element zero to 'size of array')
{
     if( buffer[element] equals buffer2[element] )
     {
          increment a counter
     }
}
 
//Display useful and interesting information to the user
cout << "There were " << counter << " matches found and " << (length - counter) << " differences. ";
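One possible translation of the pseudo-code into C++ might look like the sketch below. The function names are made up for illustration, and it assumes that length is no larger than the shorter of the two buffers, so the loop never reads past the end of either one.

```cpp
#include <cstddef>
#include <iostream>

// Hypothetical helper: counts the positions at which the first 'length'
// characters of buffer and buffer2 are equal.
// Assumes both buffers hold at least 'length' characters.
std::size_t count_matching_positions(const char* buffer, const char* buffer2,
                                     std::size_t length)
{
    std::size_t counter = 0;
    for (std::size_t element = 0; element < length; ++element)
    {
        if (buffer[element] == buffer2[element])
        {
            ++counter;
        }
    }
    return counter;
}

// Prints the summary using the same wording as the pseudo-code.
void report(std::size_t counter, std::size_t length)
{
    std::cout << "There were " << counter << " matches found and "
              << (length - counter) << " differences.\n";
}
```

If the two files differ in size, passing the smaller of the two lengths keeps the comparison inside both buffers; how to count the leftover characters is a separate decision.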

Look here if you would like to learn more about the <fstream> class member functions we used today. The specific example I cited above can be viewed here.


Clinton Portis 211 Practically a Posting Shark · Answer 1 · 2009-12-07T18:49:44+00:00

//this
length = is.tellg();
//should be this
length = fin.tellg();

mrnutty 761 Senior Poster · Answer 2 · 2009-12-07T19:59:25+00:00

Use the STL's set_intersection and set_difference. There are other variants of those as well.
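As a rough sketch of what std::set_intersection computes (the helper name common_characters is made up for this example): the algorithm requires sorted input ranges and ignores positions, so it reports which characters the two sequences share, not how many locations match. That makes it an answer to a somewhat different question than the position-by-position count asked about above.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <string>

// Hypothetical helper: returns the characters common to both strings,
// treating each string as a sorted multiset (positions are ignored).
std::string common_characters(std::string a, std::string b)
{
    // set_intersection requires both input ranges to be sorted
    std::sort(a.begin(), a.end());
    std::sort(b.begin(), b.end());

    std::string common;
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                          std::back_inserter(common));
    return common;
}
```

For example, "ABC" and "CBA" share all three characters according to this function, even though only the middle position matches when the strings are compared location by location.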

bonotevo 0 Newbie Poster · Answer 3 · 2009-12-07T22:42:34+00:00

Thank you for the responses. My professor told us that this was very inefficient but I suppose it is a good starting point. I think I have the solution but I am still debugging.

Clinton Portis 211 Practically a Posting Shark · Answer 4 · 2009-12-07T23:36:41+00:00

If you are referring to the file i/o implementation, I would like to argue to your professor that reading the entire file at once is more efficient than reading 'line by line' or 'word by word' or 'char by char'. Additionally, the use of a dynamically allocated buffer increases efficiency due to not needing to 'over guess' the amount of space needed by statically creating an array of an arbitrary size.

If you are referring to the array comparison algorithm, I am unaware of anything that will provide you with the results you need any more efficiently than performing an 'element by element' comparison.

Dave Sinkula 2,398 long time no c Team Colleague · Answer 5 · 2009-12-07T23:51:24+00:00


Could you attach the longer input files?

This is very similar to your original; how does it differ from the output you expect?

#include <cstdio>    // EOF
#include <cstring>   // strlen
#include <fstream>
#include <iostream>
using namespace std;

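// Reads the first line of filename into array (at most size - 1 characters)
// and null-terminates it. size must be at least 1.
// Returns the number of characters stored, or 0 if the file cannot be opened.
int get_data(const char *filename, char *array, size_t size)
{
   ifstream file(filename);
   if ( !file )
   {
      return 0;
   }
   size_t i = 0;
   int ch;
   // stop one short of size so the terminator below stays inside the array
   while ( i + 1 < size && (ch = file.get()) != EOF && ch != '\n' )
   {
      array[i++] = static_cast<char>(ch);
   }
   array[i] = '\0';
   return static_cast<int>(i);
}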

int compare(const char *a, const char *b)
{
   int diffs = 0, matches = 0;
   size_t alen = strlen(a);
   size_t blen = strlen(b);
   for ( size_t i = 0; i < alen && i < blen; ++i )
   {
      if ( a[i] != b[i] )
      {
         ++diffs;
      }
      else
      {
         ++matches;
      }
   }
   cout << "matches = " << matches << "\n";
   cout << "diffs   = " << diffs   << "\n";
   return diffs;
}

int main()
{
   char a[10000], b[10000];
   if ( get_data("seq0.txt", a, sizeof a) &&
        get_data("seq1.txt", b, sizeof b) )
   {
      cout << "a: " << a << "\n";
      cout << "b: " << b << "\n";
      compare(a, b);
   }
   return 0;
}

/* my output
a: AAACCCGTTT
b: AACCCCGGTT
matches = 8
diffs   = 2
*/