I have some stock market price data organized as follows in a structure:

struct PriceInfo
{
         string       Date;
         unsigned int Time;
         double       Open;
    double       High;
    double       Low;
    double       Close;
    unsigned int Volume;

    };

This struct organizes price data per interval of date and interval of time, where time is measured in minutes. So for instance, minute data would look like this:

121009 930 2100 2200 2075 2150 10000 
121009 931 ....
121009 932

Where 930 is 9:30 am and 121009 Dec 10th 2009. A whole day's worth of data, measured in 1min intervals, runs from 930 to 1614 pm. The data file I have contains years+ of data and my first project is to be able to do analysis on individual days, such as on the price data for 12/10/09 only or for a range of days, such as 12/10/09 to 3/03/11.

One way to organize the data is to do a simple std::vector <PriceInfo> where we create a vector of objects.

But my question is, what's the most intuitive container to use with data organized as such to create individual day containers to analyze? Is there anything else other than say a map that can allow one to do things like searching and retrieving, while controlling for the date and time?

Thanks,
TR

Edited 3 Years Ago by pyTony: fixed formatting

First of all, what do you intend to do with the data?

I'd like to do basic analysis. Let's say that I have a file with a year's worth of data for 2010. I can define a task to search the container and retrieve (and/or print) the time of day when each day's low price was made. So, manually to do this, one would open the file, look at the date and time columns for any given day, and eyeball the low field to find the lowest price. Works great for one eyeball, but obviously cannot be done for data covering 365 days where each day has 390 minutes worth of data.

Well you could store the data in a set and define a comparison function that sorts the data by the lowest price. After reading the contents into the set and the first element of the set should be the one with the lowest price.

class comparePrice
{
    bool operator()(const PriceInfo & lhs, const PriceInfo & rhs) const
    {
        return lhs.low < rhs.low;
    }
};

//  inside code
set<PriceInfo, comparePrice> data;
// add data to the set and it will be sorted by the lowest price field.

Edited 5 Years Ago by NathanOliver: n/a

Well you could store the data in a set and define a comparison function that sorts the data by the lowest price. After reading the contents into the set and the first element of the set should be the one with the lowest price.

class comparePrice
{
    bool operator()(const PriceInfo & lhs, const PriceInfo & rhs) const
    {
        return lhs.low < rhs.low;
    }
};

//  inside code
set<PriceInfo, comparePrice> data;
// add data to the set and it will be sorted by the lowest price field.

That's a good idea. Let me ask you this though, question related to design again: If you have data organized by date, then by time (see the first two fields below starting at leftmost), how does one restrict the analysis to a given date if one doesn't know the first date in the file?


121009 930 2100 2200 2075 2150 10000
121009 931 ....
121009 932
...
121009 1612
121009 1613
121009 1614 // day ends here and new day starts on the next line
121109 930 ...
121109 931 ....
etc
121109 1614 // day ends here and new day starts on the next line

I mean, I could peek into the file and see that 12/10/09 is the first date, but that would be 'cheating'. Would a way around it be to define what date or date range I want the analysis for (i.e., finding the low price of that particular day), check to see whether those dates are in memory, and then use some type of function to restrict the analysis to the given date or date range?

I guess what I'm having trouble imagining is how does the program know where the day starts and ends?

It really depends what you want to do. You can read the entire file into a some sort of container like a vector and then pull data out of the vector as needed. Another way is you can just read only the data you want out of the file to keep the size lower. If you read all of the file into a vector and want to just look for a particular day or range of dates then you could do something like

vector<PriceInfo> data // stores the entire file
set<PriceInfo, comparePrice> smallSet // to put only sepecific days to get the lowest price
vector<PriceInfo>::const_iterator it = data.begin(), end = data.end();
string lowerDate, upperDate;  // for storing the date range
while (it++ != end)
{
    if (it->date >= lowerDate && it->.date <= upperDate)
        smallSet.insert(*it);
}

Alternately if you only want to read in from the file only the days you need you could do something like

set<PriceInfo, comparePrice> smallSet // to put only sepecific days to get the lowest price
string lowerDate, upperDate;  // for storing the date range
ifstream fin("somefile.txt");
PriceInfo reader;
while (fin >> reader) // assume you define a >> operator for PriceInfo
{
    if (reader.date >= lowerDate && reader.date <= upperDate)
        smallSet.insert(reader);
}

Edited 5 Years Ago by NathanOliver: n/a

Nathan, that's excellent, exactly the type of input I was hoping for. Any specific books that you would recommend that maybe dwell more on dealing with data in this kind of a way, using the stl algos and functors together with different containers, maybe with examples like the code you just posted? I mean I'm thinking not all books are equal on that front I guess.

I'll go with the first option, that stores the entire file, because this is C++ after all, and speed should not be an issue even with large files.

Edited 5 Years Ago by toneranger: n/a

The subject of your question is really about what is called "data mining", which is a field of Database research. In other words, how to retrieve, classify, cluster, and compile statistics about data, especially large amounts of data. Frankly, this is what database engines are for, this is what they do, I would recommend you consider using one of those. Or, at least, read on methods for data mining in databases.

I think that for your application, the standard containers and algos don't provide enough features, and it will lead to a highly inefficient implementation, either in memory usage or in processing time. This is because you will have to either sort only according to one criteria (like date-time) and waste significant processing time to find records according to another criteria (like low-price), or you will have to have several containers to achieve fast look-ups for each criteria. What you need is a multi-indexing system, that is, you need a container which can store data once, but achieve fast look-ups for any of a number of criteria (key-value, or index). There are fundamentally three ways to do this (besides the flawed methods I just mentioned), that I know of, and I'm not an expert in this.

The first is to have one unordered container of all the data (like a std::vector or std::list) and hold several associative containers which indirectly refer to these records. This is a kind of generalization of single-index associative containers. This is implemented in the Boost.Multi-Index library. This library allows you to set up associative containers that allow fast look-ups and range searches (like from date X to date Y) according to any of the criteria you want to.

The second method is called spatial partitioning (you will often hear BSP tree, for Binary Space Partitioning tree). This is a way to create a data structure that is split recursively according to the many criteria you have such that you can easily find all data elements in a particular region of the space (or "feature space", which is the space of all possible records you have). In some sense, this is like the generalization of "binary search trees" to higher dimensions (i.e. more search criteria). There are several methods for this in euclidean spaces, or close to that. If you don't have a Euclidean space, then you have to fall back to a more general kind of space, that is, a metric space. In this case, algorithms are much harder to develop for spatial partitioning in metric spaces, the most notable is the Vantage-Point tree (and some variations of it, and similar data structures like M-Trees, DVP-Trees, and MVP-Trees).

The third method is called data clustering. In this approach, you create groups or clusters of data elements that are similar by some metric. Look-ups involve taking your criteria, finding clusters that are most likely to contain elements with that criteria and restricting your search within those clusters. In some sense, these algorithms are similar to adaptive bucket methods for sorting.

The easiest solution is definitely to just use Boost.Multi-Index. As for the other methods, the only really good implementations of those methods are under the hood of database engines (databases are just massive data mining algorithm machines), and this is why they are the best option to mine for data when you have a large amount of it.

Mike,

Greatly appreciate your input, thanks for taking the time to write. Very interesting: I'm somewhat shocked b/c I had this idea that C++ was a very powerful language, and that the type of database mining I want to do while not trivial would be manageable. But you're the second person to tell me that it's not that trivial to do the type of associative searching I want to do...

When you say database research, what kind of languages are best suited for that? I will google in meantime and look up the boost methods you mentioned.

If anyone has done this type of database mining pls feel free to chip in, value everyone's thoughts.

Thanks much!

When I say "database research", I mean research being done in the field of databases and data mining. This means, professors and computer scientists who are specialists in developing algorithms to crunch massive amounts of data.

Databases are not really tied to a particular programming language. C++ is as good as any other language for dealing with databases. But, you use databases normally through client/server systems, i.e., you have a database server to which you can register data entries and perform any queries you have. You communicate to the server via a client library. Most client libraries are available in most popular languages. I'm not an expert on this, so I prefer not giving any more info on this.

It really depends what you want to do. You can read the entire file into a some sort of container like a vector and then pull data out of the vector as needed. Another way is you can just read only the data you want out of the file to keep the size lower. If you read all of the file into a vector and want to just look for a particular day or range of dates then you could do something like

vector<PriceInfo> data // stores the entire file
set<PriceInfo, comparePrice> smallSet // to put only sepecific days to get the lowest price
vector<PriceInfo>::const_iterator it = data.begin(), end = data.end();
string lowerDate, upperDate;  // for storing the date range
while (it++ != end)
{
             smallSet.insert(*it);
}

Alternately if you only want to read in from the file only the days you need you could do something like

set<PriceInfo, comparePrice> smallSet // to put only sepecific days to get the lowest price
string lowerDate, upperDate;  // for storing the date range
ifstream fin("somefile.txt");
PriceInfo reader;
while (fin >> reader) // assume you define a >> operator for PriceInfo
{
    if (reader.date >= lowerDate && reader.date <= upperDate)
        smallSet.insert(reader);
}

Nathan,

Is it necessary to use a set if one wanted to find the time at which that low price was made? Remember the struct contains the following fields: date time open high low close volume. So while the compareprice functor does the job to sort the set, it cannot tell me at what time that low price was made.

Conceptually I'm thinking that I need something like a map, but one which where the key would map to more than just one value, but instead a number of values, so that I could access these values using it-->first, it-->second and so on. Would that be a multi-map?

If so, is a multi map the only way to tackle this problem?

Thanks

After you add the data into the set all you have to do is

//...
vector<PriceInfo> data // stores the entire file
set<PriceInfo, comparePrice> smallSet // to put only sepecific days to get the lowest price
vector<PriceInfo>::const_iterator it = data.begin(), end = data.end();
string lowerDate, upperDate;  // for storing the date range
while (it++ != end)
{
    if (it->date >= lowerDate && it->.date <= upperDate)
        smallSet.insert(*it);
}
PriceInfo lowest = *(smallSet.begin());  // gets the first element which is the lowest price
cout << "The lowest price ocurd at: " << lowest.date << " " << lowest.time;
//...

Edited 5 Years Ago by NathanOliver: n/a

After you add the data into the set all you have to do is

//...
vector<PriceInfo> data // stores the entire file
set<PriceInfo, comparePrice> smallSet // to put only sepecific days to get the lowest price
vector<PriceInfo>::const_iterator it = data.begin(), end = data.end();
string lowerDate, upperDate;  // for storing the date range
while (it++ != end)
{
    if (it->date >= lowerDate && it->.date <= upperDate)
        smallSet.insert(*it);
}
PriceInfo lowest = *(smallSet.begin());  // gets the first element which is the lowest price
cout << "The lowest price ocurd at: " << lowest.date << " " << lowest.time;
//...

Hi Nathan,

Prior to main I defined the following:

typedef std::vector<PriceInfo> PriceList;


class LowestPrice
{
	bool operator () (const PriceInfo& a, const PriceInfo& b) const
	{
	return a.Low < b.Low;
	}
};

typedef std::set<PriceInfo, LowestPrice> smallSet;

and then the main here:

int main(int argc, char* argv[])
{
	if (argc != 2)
	{
		std::cerr << "Usage: " << argv[0] << " filename " << std::endl;
		return EXIT_FAILURE;
	}

	PriceList prices;

	       std::ifstream dataFile(argv[1]);
		if (!dataFile.is_open())
		{
			std::cerr << "Could not open file : '" << argv[1] << "'" << std::endl;
			return EXIT_FAILURE;
		}
		dataFile >> prices;
	

    std::vector<PriceInfo>::const_iterator it = prices.begin(), end = prices.end();
	std::string lowerDate, upperDate;
	while (it++ != end);
	{
		if (it->Date >= lowerDate && it->Date <= upperDate)
			smallSet.insert(*it);
	}
	PriceInfo lowest = *(smallSet.begin());
	std::cout << "The lowest price occurred at:" << lowest.Date << " " << lowest.Time;

	return EXIT_SUCCESS;
}

I'm getting the following 5 syntax errors when I try to compile:

error C2143: syntax error : missing ';' before '.' on the "smallset.insert(*it)" line
error C2143: syntax error : missing ';' before '.' on the "smallset.insert(*it)" line
error C2143: syntax error : missing ')' before '.' on the PriceInfo lowest = etc line
error C2059: syntax error : '.' on the PriceInfo lowest = etc line
error C2059: syntax error : ')' on the PriceInfo lowest = etc line

Any idea on why we're getting these syntax errors?

Thanks

Edited 5 Years Ago by toneranger: n/a

smallSet is a type, not a variable. You need to make a object of that type to be able to use it, like you do with "prices".

Also, the loop that starts at line 22 does not make sense. Given how it is, it will start at begin()+1 and end at end(), when it should be from begin() to end()-1. Why not use the idiomatic loop:

for(std::vector<PriceInfo>::const_iterator it = prices.begin(); it != prices.end(); ++it) {
  //..
}

And, anyhow, you have a stray semi-colon at the end of Line 22.

smallSet is a type, not a variable. You need to make a object of that type to be able to use it, like you do with "prices".

Also, the loop that starts at line 22 does not make sense. Given how it is, it will start at begin()+1 and end at end(), when it should be from begin() to end()-1. Why not use the idiomatic loop:

for(std::vector<PriceInfo>::const_iterator it = prices.begin(); it != prices.end(); ++it) {
  //..
}

And, anyhow, you have a stray semi-colon at the end of Line 22.

Thanks Mike, I declared a smallSet object as smallset sortedLows (see below), corrected the iterator syntax, and compiled, and now I'm only getting 1 error: "cannot access private member declared in class 'LowestPrice' "

So is that basically the same problem, namely that I need to instantiate a LowestPrice object? If so, how would I pass the object to my 'sortedLows' object?

smallSet sortedLows;
		

for (std::vector<PriceInfo>::const_iterator it = prices.begin(); it!= prices.end(); ++it)
		{
	       std::string lowerDate, upperDate;

	        {
		        if (it->Date >= lowerDate && it->Date <= upperDate)
			      sortedLows.insert(*it);
	        }
	PriceInfo lowest = *(sortedLows.begin());
	std::cout << "The lowest price occurred at:" << lowest.Date << " " << lowest.Time;
	

	return EXIT_SUCCESS;

The problem is simply that your () operator in the LowestPrice is private (which is the default access right for a class). So, you should do either this:

class LowestPrice
{
    public: //notice "public" here.
	bool operator () (const PriceInfo& a, const PriceInfo& b) const
	{
	return a.Low < b.Low;
	}
};

Or this:

struct LowestPrice //notice the "struct" here, which has "public" default access rights.
{
	bool operator () (const PriceInfo& a, const PriceInfo& b) const
	{
	return a.Low < b.Low;
	}
};

Got it to compile and run now. Will be back with more questions later. Thanks for your help so far.

Edited 5 Years Ago by toneranger: n/a

Ok the code compiles and when I run it using my data file, it does spit out the correct low of the day and the time at which that low was made. This works when I input "12/10//2009", which is the first date in my data file.

However, when I try the next date, which is 12/11/2009, it gives me this error message from the debug library: "Expression: map/set iterator not dereferencable"

int main(int argc, char* argv[])
{
	if (argc != 2)
	{
		std::cerr << "Usage: " << argv[0] << " filename " << std::endl;
		return EXIT_FAILURE;
	}

	PriceList prices;

		std::ifstream dataFile(argv[1]);
		if (!dataFile.is_open())
		{
			std::cerr << "Could not open file : '" << argv[1] << "'" << std::endl;
			return EXIT_FAILURE;
		}
		dataFile >> prices;
	
    smallSet sortedLows;
		for (std::vector<PriceInfo>::const_iterator it = prices.begin(); it!= prices.end(); ++it)
		{
	       std::string lowerDate, upperDate;
		   std::cout << "Please enter lowerDate:" << std::endl;
		   std::cin >> lowerDate;
           std::cout << "Please enter upperDate:" << std::endl;
		   std::cin >> upperDate;

	        {
		        if (it->Date >= lowerDate && it->Date <= upperDate)
			      sortedLows.insert(*it);
	        }
	PriceInfo lowest = *(sortedLows.begin());
	std::cout << "The lowest price occurred at:" << lowest.Date << " " << lowest.Time;
	

	return EXIT_SUCCESS;
		}
}

In this line:

PriceInfo lowest = *(sortedLows.begin());

How can you tell that sortedLows is not empty? If the set/list/map is empty, then the begin() iterator will not be dereferenciable (because it is equal to end() iterator, which is one past the last element). Verify that the container is not empty before dereferencing the begin() iterator.

In this line:

PriceInfo lowest = *(sortedLows.begin());

How can you tell that sortedLows is not empty? If the set/list/map is empty, then the begin() iterator will not be dereferenciable (because it is equal to end() iterator, which is one past the last element). Verify that the container is not empty before dereferencing the begin() iterator.

I added this code:

if (sortedLows.empty())
			std::cout << "sortedLows container is empty." << std::endl;
		else
			std::cout << "sortedLows container is not empty." << std::endl;

And you're right, it returned empty. I'm not sure why it does though. The csv data file does contain the date 12/11/2009, I checked it. Also, the vector from which the set pulls the data from is supposed to contain all of the data in the csv file. So why would the sortedLows.insert(*it) fail to populate?

Edited 5 Years Ago by toneranger: n/a

No where do I see a loop to read in all of the data from the file

dataFile >> prices;

Is only getting the frist record from the file.

Edited 5 Years Ago by NathanOliver: n/a

No where do I see a loop to read in all of the data from the file

dataFile >> prices;

Is only getting the frist record from the file.

Hi Nathan,

Sorry for the delay in responding, been busy with work and took a break from programming. I've since retested my file and it reads in all the data correctly. I know this is true because when I output the data, all the data prints to the screen (print code not in the code below), so the loop completes.

So any idea on how to deal with this task: how does one go about sorting the contents of my vector of structures? Do we have to start over completely with the sort algorithm or is it off by a small thing?

The code below still has the same problem. It spits out the time at which the low was made for the first day but does not work for the 2nd day.

#include <stdafx.h>
#include <algorithm>
#include <fstream>
#include <iomanip>
#include <iostream>
#include <iterator>
#include <set>
#include <sstream>
#include <string>
#include <vector>
#include <cstdlib>

struct PriceInfo
{
	double       Open;
	double       High;
	double       Low;
	double       Close;
	unsigned int Volume;
	unsigned int Time;
	std::string  Date;
};


struct LowestPrice
{
	bool operator () (const PriceInfo& a, const PriceInfo& b) const
	{
	return a.Low < b.Low;
	}
};


typedef std::set<PriceInfo, LowestPrice> smallSet;


typedef std::vector<PriceInfo> PriceList;


std::istream& operator>>(std::istream& ist, PriceInfo& priceInfo)
{
	std::string ignored;
	std::string line;
	getline (ist, ignored, '\t');
	getline (ist, ignored, '\t');
	getline (ist, priceInfo.Date, '\t');
	getline (ist, line);
	std::istringstream linestream(line);
	linestream >> priceInfo.Time >> priceInfo.Open >> priceInfo.High >> priceInfo.Low >> priceInfo.Close >> priceInfo.Volume;
	return ist;
}

std::ostream& WriteAmount(std::ostream& ostream,
                          double amount)
{
    ostream
        << std::setw(10) << std::right
        << std::fixed << std::setprecision(2)
        << amount;
    return ostream;
}

std::ostream& operator<<(std::ostream& ostream,
                         const PriceInfo& priceInfo)
{
    ostream << "| " << std::setw(10) << priceInfo.Date;
    WriteAmount(ostream, priceInfo.Time);
    WriteAmount(ostream, priceInfo.Open);
    WriteAmount(ostream, priceInfo.High);
    WriteAmount(ostream, priceInfo.Low);
    WriteAmount(ostream, priceInfo.Close);
	WriteAmount(ostream, priceInfo.Volume);
    ostream << " |";
    return ostream;
}

template <typename T>
std::istream& operator>>(std::istream& ist,
                         std::vector<T>& vector)
{
    std::copy(
        std::istream_iterator<T>(ist),
        std::istream_iterator<T>(),
        std::back_inserter(vector));
    return ist;
}

template <typename T>
std::ostream& operator<<(std::ostream& ostream,
                         const std::vector<T>& vector)
{
    std::copy(
        vector.begin(),
        vector.end(),
        std::ostream_iterator<T>(ostream, "\n"));
    return ostream;
}

void ReadPricesFromFile(const char* filename, std::vector<PriceInfo>& prices)
{
    std::ifstream input(filename);
    input >> prices;
}

void PrintPrices(std::ostream& output, const PriceList& prices)
{
	output << "| "
		<< std::setw(10) << "Date"   << " | "
		<< std::setw(6)  << "Time"   << " | "
		<< std::setw(6)  << "Volume" << " | "
		<< std::setw(8)  << "Open"   << " | "
		<< std::setw(8)  << "High"   << " | "
		<< std::setw(8)  << "Low"    << " | "
		<< std::setw(8)  << "Close"  << " |"
		<< std::endl     << prices;
}

int main(int argc, char* argv[])
{
    
		if (argc != 2)
	{
		std::cerr << "Usage: " << argv[0] << " filename " << std::endl;
		return EXIT_FAILURE;
	}

	
	PriceList prices;
    ReadPricesFromFile(argv[1], prices);
    

        smallSet sortedLows;
		for (std::vector<PriceInfo>::const_iterator it = prices.begin(); it!= prices.end(); ++it)
		{
	       std::string lowerDate, upperDate;
		   std::cout << "Please enter lowerDate:" << std::endl;
		   std::cin >> lowerDate;
           std::cout << "Please enter upperDate:" << std::endl;
		   std::cin >> upperDate;

	        {
		        if (it->Date == lowerDate && it->Date == upperDate)
			      sortedLows.insert(*it);
	        }

		if (sortedLows.empty())
			std::cout << "sortedLows container is empty." << std::endl;
		else
			std::cout << "sortedLows container is not empty." << std::endl;
    
	PriceInfo lowest = *(sortedLows.begin());
	std::cout << "The lowest price occurred at:" << lowest.Date << " " << lowest.Time;
		}

    return EXIT_SUCCESS;
}
This article has been dead for over six months. Start a new discussion instead.