I think the following problem is a basic one in numerical codes:

Suppose that we would like to do a for loop

for (double x = 0; x < Y; x += dx)
{
  // numeric code
}

The problem is that if we set dx=0.1
it is not exactly 1/10.
The last digit is usually bad,
and during the loop the error grows up.
If we fix Y, and want higher precision (decreasing dx)
means that the number of steps is increasing also.
If Y is in the order of magnitude 1 (unit),
dx = 1e-7 gives the highest precision (approximately)
because double is about 15 digits.
This is very bad!
Because this is only a simple loop,
and no numeric calculation has been done yet!

Are there any library to solve this problem?
Or can we set dx as exact double numbers?

Recommended Answers

All 9 Replies

There is no 'Exact' double. One way I would do is to work with it in integer or long, and divide it later to get double. Even though, there will always be an error because of epsilon value in computation for approximation, the error should not be cumulative as the way you do (occurs at the last computation).

If you want a fixed number of iterations, use an integer value to control the iteration number and progress.

for (int i = 0; i < MaxIter; ++i)
{
  double x = dx * i;
  // numeric code
}

Many numerical methods do not require a fixed amount of iterations but might vary the step size. In this case, the original loop you posted is fine. If you want an exact terminal value of x to be Y, you can do this:

for (double x = 0; x <= Y; x += dx)
{
  // numeric code... possibly changing dx
  if( x + dx > Y) dx = Y - x - epsilon; //change dx to fix exactly to the bound (minus an epsilon of precision).
}

Two additional techniques people sometimes use when they are really really concerned with numerical errors, are rational numbers and interval arithmetics. With rational numbers (i.e. fractions of integer numbers), you can store rational numbers exactly and do simple math operations with them, but things get a little complicated when you encounter transcendental functions (sin, cos, exp, ln, etc.). So, this can avoid error accumulation for all simple operations (plus, minus, multiply, etc.) with a run-time cost of course, but it doesn't eliminate numerical errors altogether.

On the other hand, interval arithmetics allows you to track the accumulation of numerical errors (to some extent), and be able to deal with it, or to know if your results are trustworthy or not (it gives you an uncertainty interval). Sometimes, people use interval arithmetics to test how the error evolves in an algorithm compared to another to help them choose the best method, and then, they use the chosen method for the real application (because interval arithmetic can be much slower).

OK, sometimes you can solve the problem with integers, but n general if there is a complicated numerical algorithm in the loop, this is not possible.

Then you have big problems because you can't solve it with c or c++. Try another language, such as fortran, which is more suited to numerical analysis.

*sigh* I didn't want to bother posting this because I'll probably get flamed, but maybe you can make use of it--

/**
 * Numerics.h
 */
#ifndef NUMERICS
#define NUMERICS

#include <sstream>
#include <string>
#include <iomanip>

using namespace std;


/**
 * Convenience class for enabling easy
 * conversions between strings and primitive types.
 *
 * This class is not meant to support non-primitive types,
 * but it should if target class Type has istream and ostream
 * operators overloaded.
 */
namespace Numerics{

	template<class T> class Numerical;

	template<>
	class Numerical<float>{
	
		private:
			float number;
			int precision;
			static const int PRECISION_LIMIT = 60;

		public:
			/**
			 * Constructs a Numerical<float> based on a float
			 */
			Numerical(float value = 0) : number(value), precision(8){
				(*this)(getString());
			}

			/**
			 * Attempts to construct a Numerical<float> based on the string input
			 */
			Numerical(const string& arg){
				(*this)(arg);
			}

			/**
			 * Sets the precision setting. The setting will be adjusted to fit within 1-to-PRECISION_LIMIT
			 * which is 60.
			 */
			Numerical<float>& setPrecision(int pv){

				precision = (pv >= 1) ? ( (pv <= PRECISION_LIMIT) ? pv : PRECISION_LIMIT) : 1;

				(*this)( getString() );
				return *this;
			}

			/**
			 * Returns the current precision setting
			 */
			int getCurrentPrecision(){return precision;}

			/**
			 * Attempts to assign the argument value to the value
			 * of this Object's type.
			 * If the value is invalid, nothing happens.
			 */
			string operator()(const string& arg){
				stringstream ss (stringstream::in | stringstream::out);
				try{
					ss << setprecision(precision);
					ss << arg;
					ss >> number;
				}catch(...){
					throw string("Invalid String!");
				}
				return ss.str();
			}

			/**
			 * Attempts to assign the argument value to the value of
			 * this Object's type.
			 */
			float operator()(float value){
				number = value;
				(*this)(getString());
				return number ;
			}

			/**
			 * Returns a string representation of this Object's number
			 */
			string getString() const{
				stringstream ss (stringstream::in | stringstream::out);
				ss << setprecision(precision);
				ss << number;
				return ss.str();
			}

			/**
			 * Returns a copy of this Object's number
			 */
			float getValue(){
				return number;
			}

			/**
			 * Extraction operator used to return the underlying value
			 * during operations assosciated with primitive types.
			 */
			operator float& (){
				return number;
			}

			/**
			 * const version of the above operator. Returns a copy
			 * of this Object's number.
			 */
			operator float () const{
				(const_cast<Numerical<float>&>(*this))(getString());
				return number;
			}
	};

	template<>
	class Numerical<double>{
	
		private:
			double number;
			int precision;
			static const int PRECISION_LIMIT = 60;

		public:
			/**
			 * Constructs a Numerical<double> based on a double
			 */
			Numerical(double value = 0) : number(value), precision(8){
				(*this)(getString());
			}

			/**
			 * Attempts to construct a Numerical<double> based on the string input
			 */
			Numerical(const string& arg) : precision(8){
				(*this)(arg);
			}

			/**
			 * Sets the precision setting. The setting will be adjusted to fit within 1-to-PRECISION_LIMIT
			 * which is 60.
			 */
			Numerical<double>& setPrecision(int pv){

				precision = (pv >= 1) ? ( (pv <= PRECISION_LIMIT) ? pv : PRECISION_LIMIT) : 1;

				(*this)( getString() );
				return *this;
			}

			/**
			 * Returns the current precision setting
			 */
			int getCurrentPrecision(){return precision;}

			/**
			 * Attempts to assign the argument value to the value
			 * of this Object's type.
			 * If the value is invalid, nothing happens.
			 */
			string operator()(const string& arg){
				stringstream ss (stringstream::in | stringstream::out);
				try{
					ss << setprecision(precision);
					ss << arg;
					ss >> number;
				}catch(...){
					throw string("Invalid String!");
				}
				return ss.str();
			}

			/**
			 * Attempts to assign the argument value to the value of
			 * this Object's type.
			 */
			double operator()(double value){
				number = value;
				(*this)(getString());
				return number ;
			}

			/**
			 * Returns a string representation of this Object's number
			 */
			string getString() const{
				stringstream ss (stringstream::in | stringstream::out);
				ss << setprecision(precision);
				ss << number;
				return ss.str();
			}

			/**
			 * Returns a copy of this Object's number
			 */
			double getValue(){
				return number;
			}

			/**
			 * Extraction operator used to return the underlying value
			 * during operations assosciated with primitive types.
			 */
			operator double& (){
				(*this)(getString());
				return number;
			}

			/**
			 * const version of the above operator. Returns a copy
			 * of this Object's number.
			 */
			operator double () const{
				(const_cast<Numerical<double>&>(*this))(getString());
				return number;
			}
	};

	template<class T>
	class Numerical{
	
		private:
			T number;

		public:
			/**
			 * Constructs a Numerical<T> based on the T value
			 */
			Numerical(T value = 0) : number(value){}

			/**
			 * Attempts to construct a Numerical<T> based on the string input
			 */
			Numerical(const string& arg){
				(*this)(arg);
			}

			/**
			 * Attempts to assign the argument value to the value
			 * of this Object's type.
			 * If the value is invalid, nothing happens.
			 */
			string operator()(const string& arg){
				stringstream ss (stringstream::in | stringstream::out);
				try{
					ss << arg;
					ss >> number;
				}catch(...){
					throw string("Invalid Input!");
				}
				return ss.str();
			}

			/**
			 * Attempts to assign the argument value to the value of
			 * this Object's type.
			 */
			T operator()(T value){
				return (number = value);
			}

			/**
			 * Returns a string representation of this Object's number
			 */
			string getString(){
				stringstream ss (stringstream::in | stringstream::out);
				ss << number;
				return ss.str();
			}

			/**
			 * Returns a copy of this Object's number
			 */
			T getValue(){
				return number;
			}

			/**
			 * Extraction operator used to return the underlying value
			 * during operations assosciated with primitive types.
			 */
			operator T& (){
				return number;
			}

			/**
			 * const version of the above operator. Returns a copy
			 * of this Object's number.
			 */
			operator T () const{
				return number;
			}
	};

	/* Some meaningful typedefs for Numerical types */
	typedef Numerical<short> Short;
	typedef Numerical<unsigned short> U_Short;
	typedef Numerical<int> Integer;
	typedef Numerical<unsigned int> U_Integer;
	typedef Numerical<double> Double;
	typedef Numerical<float> Float;
	typedef Numerical<char> Char;
	typedef Numerical<unsigned char> U_Char;
	typedef Numerical<wchar_t> Wide_Char;
	typedef Numerical<long int> Long;
	typedef Numerical<unsigned long int> U_Long;

	/* For non-standard types, like __int8, __int16, __int32, and __int64 */
	#ifdef ALLOW_NONSTANDARD_PRIMITIVE_TYPES

		#if (ALLOW_NONSTANDARD_PRIMITIVE_TYPES == 0x01)
			typedef Numerical < __int8 > __Int8;
			typedef Numerical < unsigned __int8 > U___Int8;
			typedef Numerical < __int16 > __Int16;
			typedef Numerical < unsigned __int16 > U___Int16;
			typedef Numerical < __int32 > __Int32;
			typedef Numerical < unsigned __int32 > U___Int32;
			typedef Numerical < __int64 > __Int64;
			typedef Numerical < unsigned __int64 > U___Int64;
		#endif

	#endif
}

#endif
#include "Numerics.h"
#include <iostream>

using namespace std;
using namespace Numerics;

int main(){

	enum {PRECISION = 1};

	Float dx = 0.05f;
	dx.setPrecision(PRECISION);
	float Y = 2.5f;
	Float x;
	for ( x = 0; x < Y; x += dx)
	{
		cout << x << endl;
	}
	
	Float w = 0.1234567f;
	w.setPrecision(PRECISION);

	if(w == 1/float(10))
		cout << "Yay!" << endl;
	else cout << "Boo!" << endl;
	{
		float w = 0.1f;

		if(w == 1/float(10))
			cout << "Yay!" << endl;
		else cout << "Boo!" << endl;
	}

	Float t = 0.1111111111f;
	t.setPrecision(PRECISION);
	cout << t << endl;

	cin.get();
	return 0;
}

-- this doesn't really solve the problem. This masks the problem by conversion with precision settings. The extra overhead is probably not worth the trade-off for "accurate floating precision values." I say it in quotes because the accuracy is really questionable.

You can either stick to integers for simplicity and less overhead or attempt to "force" the values to be precise with additional overhead. I honestly don't think the extra precision is worth the trade-off in speed...

@Intrade:>>*sigh* I didn't want to bother posting this because I'll probably get flamed
Sorry if I have to light your fire here... Did I miss something? or does your implementation only do some float to string and string to float operations? I understand you take a string and cutoff to a required precision and then transform to float again. But how does that achieve anything? At best, the float after the conversion is going to be the same precision as that before, at worse, the precision is far worse after the conversion back to float. There is no gain in precision that I can think of. And most of the accuracy problem does not come from initial values, but comes from the operations themselves, and I don't see any overloaded arithmetic operators that magically are more precise than with the primitive types.

As far as overhead, interval arithmetic would be still much better than the code you posted. If you really want to achieve crazy precision levels, custom floating-point formats would be needed (like using two or three or more double variables to store the values), but that would require assembly code to reach any reasonable efficient level.

@AD: using fortran is not going to make the floating-point variables store the values with more precision than with C or C++ or even Assembly. But you are right that fortran is often cited as more suited for numerical methods, mainly because it out-performs C and C++, in terms of computation time, but not in terms of precision.

@merse>>OK, sometimes you can solve the problem with integers, but n general if there is a complicated numerical algorithm in the loop, this is not possible.
Exactly, a complicated numerical algorithm has to be designed such that it limits the accumulation of round-off errors in the floating-point values and have methods to control and estimate it. There is no silver-bullet. A numerical algorithm's quality is generally assessed in terms of how much it amplifies the round-off errors as it processes the input and spits out the output. Nobody that I know of, designs a numerical algorithm without analyzing the round-off error amplification factor, it is at the core of numerical analysis in general.

At the extreme, you can use an analogue computer instead.. which doesn't accumulate round-off error (because it has none), but it suffers for noise as a result of electromagnetic disturbances on the signals. In any case, exact floating-point arithmetic is useless, nature itself doesn't even produce exact results.. why would you ask a computer to do it? It is all about controlling the uncertainty around the answers you get.

commented: always have amazing answers :) +34

Then you have big problems because you can't solve it with c or c++. Try another language, such as fortran, which is more suited to numerical analysis.

I dont think so. First double is OK if we use binary numbers!
Second we can define any kind of number class and use it.

Be a part of the DaniWeb community

We're a friendly, industry-focused community of developers, IT pros, digital marketers, and technology enthusiasts meeting, networking, learning, and sharing knowledge.