Greetings All,

Peace. I'm new to this forum and am confused about how to approach statistical analysis of variables in C++. I am going to do some simple descriptive statistics (mean, mode, variance) and some slightly more complicated tests (chi-square and Pearson correlation). My plan is to write my own formulas and re-create a distribution table in code to get the level of significance that the analysis needs.

Is this approach re-inventing the wheel, or is it better to try to figure out statistical code that has already been written? I have had problems with the latter. For example, the snippet below on chi-square in C++ baffles me as to what is passed in.

-- ALGLIB --
     Copyright 19.09.2006 by Bochkanov Sergey
*************************************************************************/
void onesamplevariancetest(const ap::real_1d_array& x,
     int n,
     double variance,
     double& bothtails,
     double& lefttail,
     double& righttail)
{
    int i;
    double xmean;
    double ymean;
    double xvar;
    double yvar;
    double p;
    double s;
    double stat;

    if( n<=1 )
    {
        bothtails = 1.0;
        lefttail = 1.0;
        righttail = 1.0;
        return;
    }

The code is nice, but the implementation is confusing. The question is whether I should try to understand it or implement my own. If I write my own, I would need to re-type the distribution table so that a chi-square result could be checked for significance.
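
On the point about re-typing the distribution table: the table values can usually be computed instead of typed in. The sketch below is one way to do that, not ALGLIB's method. It evaluates the right-tail probability of the chi-square distribution through the regularized incomplete gamma function, using only std::lgamma, std::exp and std::log from <cmath>. The function names regularizedGammaP and chiSquareRightTail are made up for this example, and the plain power series used here is an assumption that is adequate for the moderate statistics found in a printed table, not for extreme tails.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: regularized lower incomplete gamma function P(a, x),
// evaluated with the power series
//   P(a, x) = x^a * e^(-x) / Gamma(a + 1) * sum_{n>=0} x^n / ((a+1)(a+2)...(a+n)).
// The series converges for every x >= 0, but slowly when x is much larger than a.
double regularizedGammaP(double a, double x)
{
    assert(a > 0.0 && x >= 0.0);
    if (x == 0.0) {
        return 0.0;
    }
    double term = 1.0;  // n = 0 term of the sum
    double sum = 1.0;
    for (int n = 1; n < 10000; ++n) {
        term *= x / (a + n);
        sum += term;
        if (term < sum * 1e-15) {
            break;      // further terms no longer change the sum
        }
    }
    // The prefactor is built in log space so large arguments do not overflow.
    return sum * std::exp(-x + a * std::log(x) - std::lgamma(a + 1.0));
}

// Hypothetical helper: probability that a chi-square variable with the given
// degrees of freedom exceeds `statistic`, i.e. the right-tail p-value.
double chiSquareRightTail(double statistic, int degreesOfFreedom)
{
    assert(statistic >= 0.0 && degreesOfFreedom > 0);
    return 1.0 - regularizedGammaP(0.5 * degreesOfFreedom, 0.5 * statistic);
}
```

As a sanity check, with 2 degrees of freedom the right tail is exactly exp(-statistic / 2), so chiSquareRightTail(5.991, 2) should come out close to the familiar 0.05 level. Because the result is formed as 1 minus P, very small p-values lose precision; a production version would want a separate method for the far tail.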

Thank you for any points in the right direction.

Peace.

const ap::real_1d_array& x ---- this appears to be a constant reference called x to an object of type real_1d_array. real_1d_array is apparently a user-declared type in the ap namespace. Being a constant reference means that x cannot be changed in this function. Based on its name, this user-declared type is probably an implementation of a single-dimensional array of real numbers. Since there is little reason to redefine a simple array, the type probably has added features, like bounds checking or memory handling, that make arrays more user friendly. It also means that this function is useless unless you include the declaration of the user-defined type in whatever project you want to reuse it in.

int n ---- this is a pretty poor variable name, as you can't tell what it is supposed to represent unless you look at how it's used and try to guess.

double variance ---- this is a pretty common statistical name, and it is being passed by value. Any changes made to this variable inside the function will not carry back to the function that called this one.

The rest are all references to variables of type double that have pretty common statistical names. Being references, any changes made to these variables inside the function will be visible in the function that called this one.
double& bothtails,
double& lefttail,
double& righttail
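
To make the three kinds of parameters concrete, here is a minimal sketch. passingDemo is a made-up function, not part of ALGLIB, and it uses std::vector<double> in place of ap::real_1d_array so that it compiles on its own.

```cpp
#include <cassert>
#include <vector>

// Hypothetical function that mirrors the parameter styles discussed above.
void passingDemo(const std::vector<double>& x,  // const reference: no copy, read-only
                 double variance,               // by value: the function gets its own copy
                 double& result)                // reference: writes reach the caller
{
    assert(!x.empty());
    // x[0] = 0.0;              // would not compile, because x is const
    variance = variance * 2.0;  // changes only the local copy
    result = x[0] + variance;   // the caller's variable is updated
}
```

If the caller has data = {1.0, 2.0}, v = 3.0 and r = 0.0, then after passingDemo(data, v, r) the value of v is still 3.0, data is untouched, and r holds 7.0. That is how onesamplevariancetest hands back bothtails, lefttail and righttail without a return value.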

Only you can decide whether it's easier to reuse code you've found elsewhere or write your own. Sometimes you end up trying it one way, then go the other anyway. Sometimes it's pretty straightforward to decide which way to go. In the end, though, only you can decide what's best for you.

Dear Lerner,

Thank you so much for your reply. I have never done anything like this before (statistical analysis in code). I used to be a Java programmer, and there were always libraries for this. My fear was that I would spend time doing all this work in C++, re-writing code for the chi-square, the standard deviation, etc., and then hear from somebody, "Why did you do that? You could have simply used the foo foo library." :-O

You are correct. I started out trying to use other programmers' code, but I think I'll write my own, since it appears from your post that there really is no standard out there. This way it can be tailored to our needs.

Thank you so much for your time. I was really impressed by the length and depth of your post, and I feel honored by your reply.

Peace

There are lots of third-party libraries for C++ out there that cover an amazing range of tasks. Unfortunately, their ease of use is very inconsistent. To my knowledge there is nothing in standard C++ to help you. There may be code that is considered "industry standard", but since I'm not a statistician, or someone who does a lot of statistical calculations, I don't know that for sure. Good luck!
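
If you do roll your own, the basic descriptive statistics are short to write with the general-purpose pieces that standard C++ does provide (std::accumulate from <numeric>, std::sqrt from <cmath>). The following is only a sketch under my own assumptions: the names mean, sampleVariance and pearson are made up, the data lives in std::vector<double>, and the variance is the sample variance (dividing by n - 1).

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <vector>

// Arithmetic mean of a non-empty sample.
double mean(const std::vector<double>& x)
{
    assert(!x.empty());
    return std::accumulate(x.begin(), x.end(), 0.0) / static_cast<double>(x.size());
}

// Sample variance: sum of squared deviations divided by (n - 1).
double sampleVariance(const std::vector<double>& x)
{
    assert(x.size() > 1);
    const double m = mean(x);
    double sumSquares = 0.0;
    for (double value : x) {
        sumSquares += (value - m) * (value - m);
    }
    return sumSquares / static_cast<double>(x.size() - 1);
}

// Pearson correlation coefficient of two equally sized samples.
// Returns NaN if either sample is constant (zero variance).
double pearson(const std::vector<double>& x, const std::vector<double>& y)
{
    assert(x.size() == y.size() && x.size() > 1);
    const double mx = mean(x);
    const double my = mean(y);
    double sxy = 0.0, sxx = 0.0, syy = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        const double dx = x[i] - mx;
        const double dy = y[i] - my;
        sxy += dx * dy;
        sxx += dx * dx;
        syy += dy * dy;
    }
    return sxy / std::sqrt(sxx * syy);
}
```

For x = {1, 2, 3, 4} this gives a mean of 2.5 and a sample variance of 5/3, and pearson(x, y) with y = {2, 4, 6, 8} is 1 because y is an exact linear function of x. The two-pass variance (mean first, then deviations) is chosen over the one-pass sum-of-squares shortcut because it is less prone to rounding trouble.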
