Hi,

I am building a database that will be populated with financial data. Taking into account what the data looks like, how it will be organised, what I am trying to do with it, and my hardware, budget and competency, I am looking for the most efficient design for super fast querying of this data (I am not concerned with write performance). I am hoping to get some guidance on how I should progress.

What the Data Looks Like
I have effectively collected monthly stock data over a 25-year period. For every stock in the world, I have its associated financial data (approximately 50 mostly numeric fields) at a monthly frequency.

e.g.

Microsoft, May 2000 -> associated financial data (50 fields)
Cisco, May 2000 -> associated financial data (50 fields)
Microsoft, June 2000 -> associated financial data (50 fields)
Cisco, June 2000 -> associated financial data (50 fields)
Microsoft, July 2000 -> associated financial data (50 fields)

How the Data will be organised
The data will be organised into one large table. Initially there will be approximately 3 million rows and 50 columns. The database will grow by approximately 15,000 rows per month.
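
For concreteness, the table would look roughly like the sketch below (the column names are just placeholders; the real table has about 50 mostly numeric fields per stock per month):

```sql
-- Rough sketch of the single large table (placeholder column names;
-- the real table has ~50 mostly numeric fields per stock per month).
CREATE TABLE monthly_financials (
    stock_id   INT         NOT NULL,
    month      DATE        NOT NULL,   -- stored as the first day of the month
    region     VARCHAR(32),
    sector     VARCHAR(32),
    market_cap DOUBLE,
    net_profit DOUBLE,
    -- ... roughly 45 more, mostly numeric, fields ...
    PRIMARY KEY (stock_id, month)
);
```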

What I am trying to do with this data
The data will be used in a few different ways (rough query sketches follow the list below):

1. Time series analysis, where I will retrieve some or all of the financial data for a stock over the entire 25-year period, e.g. all of Microsoft's financial data at monthly intervals. I may perform this sort of query quite frequently, especially when I request these data arrays for a group of stocks, e.g. some or all of the financial data at monthly intervals over the 25 years for Stocks A, B, C, D and E.

2. Aggregation of columns, where I will aggregate financial data for a group of stocks over their entire 25-year history. For example, I may want to aggregate the financial data of all stocks that meet particular criteria (located in a specific region, above a certain size, similar types of business, highly profitable, etc.) at monthly intervals over the 25-year period. These queries are likely to be quite dynamic, so I will never know in advance how the data will be aggregated.

3. Same as 1 & 2, but at a single point in time only, e.g. give me all the financial data for Stock A as of May 2001, or let's aggregate the financial data of all Japanese companies in June 2011.
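
Here are rough sketches of those three query shapes against the placeholder table above, purely to illustrate the access patterns:

```sql
-- 1. Time series: some or all fields for one stock (or a small group) over 25 years
SELECT *
FROM   monthly_financials
WHERE  stock_id IN (1, 2, 3, 4, 5)
ORDER  BY stock_id, month;

-- 2. Dynamic aggregation: monthly aggregates for stocks matching ad-hoc criteria
SELECT month, AVG(net_profit), SUM(market_cap)
FROM   monthly_financials
WHERE  region = 'Japan'
  AND  market_cap > 1000000000
GROUP  BY month
ORDER  BY month;

-- 3. Point in time: everything for one stock in one month
SELECT *
FROM   monthly_financials
WHERE  stock_id = 1
  AND  month = '2001-05-01';
```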

I would like to be able to perform these queries very, very fast e.g. in milliseconds.

Hardware
Dual-core Xeon server, 3.5 GHz, 3 GB RAM.

Budget
Free open source technology.

Competency
I am a good programmer (Java, C, C++, .NET and PHP), but know little about database design beyond basic MySQL/SQL Server. My experience with databases is limited to SQL-type databases, but I am willing to try something new if it will serve my purpose.

I have tried to do this before using one large table in MySQL, with both the MyISAM and Infobright storage engines. Infobright is good, but when you're pulling back time series data across 50 columns it provides little benefit over MyISAM. When you're aggregating over 50 columns it's significantly faster, but still takes a few seconds. I have been reading about in-memory solutions but have no idea how to implement them (or how much memory I would need to hold a table 3 million rows long and 50 columns wide in memory). I have also been reading about Map/Reduce solutions but am not sure I could get blazing fast aggregate query results, especially since I only have one machine and not a cluster of nodes. Does anyone have any advice?

Rameez

You need to normalise the data into multiple tables; then you can select only the data you need for a particular query, rather than transferring a lot of data over the wire that you're not interested in. I suspect that if you pulled 25 years of data across the wire from a single wide table, you'd be likely to hit timeouts.
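
As a rough illustration of what I mean (table and column names are only placeholders), you could split the static stock attributes out from the monthly numbers and join them in only when a query needs them:

```sql
-- Illustrative split of the one big table: static attributes in a small
-- stocks table, the ~50 monthly numeric fields in a separate table keyed
-- on (stock_id, month). Region/sector move out of the big monthly table.
CREATE TABLE stocks (
    stock_id INT          PRIMARY KEY,
    name     VARCHAR(100) NOT NULL,
    region   VARCHAR(32),
    sector   VARCHAR(32)
);

CREATE TABLE monthly_financials (
    stock_id   INT  NOT NULL,
    month      DATE NOT NULL,
    market_cap DOUBLE,
    net_profit DOUBLE,
    -- ... remaining numeric fields ...
    PRIMARY KEY (stock_id, month),
    KEY idx_month (month)   -- helps the point-in-time queries
);
```

Queries that only need the numeric time series then never have to touch the text columns at all.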

Purely from a design perspective, you want to start with the schema as normalised as possible, then denormalise (move back toward your single table by combining data) until you get close to your required performance.

I suspect, though, that you will need to implement some caching, so that you are essentially holding all the data in memory and avoiding the database call altogether.
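
As a very rough back-of-envelope (assuming your ~50 fields are mostly 8-byte numeric values): 3,000,000 rows x 50 x 8 bytes is about 1.2 GB of raw data, so with 3 GB of RAM holding the whole working set in memory is at least plausible, although indexes and engine overhead will push that figure up. One cheap way to try it, since you're already on MySQL, is the MEMORY storage engine. A sketch with hypothetical names follows (note the MEMORY engine doesn't support TEXT/BLOB columns and uses fixed-width rows):

```sql
-- Sketch: copy the working set into a MEMORY-engine table so reads never
-- touch disk. Assumes a source table called monthly_financials.
-- Note: the copy is lost on server restart and must be reloaded.
SET SESSION max_heap_table_size = 1610612736;  -- allow ~1.5 GB for the copy

CREATE TABLE monthly_financials_mem ENGINE=MEMORY
    SELECT * FROM monthly_financials;

ALTER TABLE monthly_financials_mem
    ADD PRIMARY KEY (stock_id, month);
```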

Both of these subjects, and database design for performance generally, are massive areas, and ones I would expect to take a good deal of experience to get totally right. Once you know you have the right design, you may want to add caching in a middle tier if you have one; if you don't, your database server may do some basic caching for you.

3 million rows of data isn't a great deal really, but it depends on how many users are going to be hitting it. I could see this very quickly becoming an enterprise-scale application, where you really do need to know what you're doing to get the performance and speed you want.

I am a little concerned that you consider yourself a good programmer in several languages but have never really touched database design before. Without that area of design you're doing yourself a disservice - I suggest you learn it as a priority!

If you want ridiculously quick results, you could look into materialised views: precomputed, indexed query results that are very fast to select from but slower to keep up to date when the underlying data changes. That said, if you don't know about database design there are no shortcuts, and it is unlikely you will get blistering speed without a massive fluke!
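
One caveat: MySQL doesn't have materialised views built in, but you can get much the same effect with a pre-aggregated summary table that you rebuild whenever the data changes. A sketch, with purely illustrative names based on the split above:

```sql
-- Emulating a materialised view in MySQL: a summary table holding
-- pre-computed monthly aggregates, rebuilt after each monthly load.
CREATE TABLE region_month_summary (
    region      VARCHAR(32) NOT NULL,
    month       DATE        NOT NULL,
    stock_count INT         NOT NULL,
    avg_profit  DOUBLE,
    total_cap   DOUBLE,
    PRIMARY KEY (region, month)
);

REPLACE INTO region_month_summary
SELECT s.region,
       f.month,
       COUNT(*),
       AVG(f.net_profit),
       SUM(f.market_cap)
FROM   stocks s
JOIN   monthly_financials f ON f.stock_id = s.stock_id
GROUP  BY s.region, f.month;
```

Since your data only changes once a month, the cost of rebuilding the summary hardly matters, and the point-in-time aggregations become simple primary-key lookups.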

Hope this helps, as I know it probably wasn't exactly what you wanted to hear.
