Hi to all,

Problem Statement:
I have 4 nodes which are executing in parallel, each node is processing some information and storing records in MySQL DB, the possible records being saved in an db daily is above thousand, but i want only 100 samples to be stored in db daily, i need some logic (mixing with and ip would help?) which permit me whether to store in database or not.. i do not want to connect with db again and again to check whether the row count is 100 or not, that would certainly crash my db. one thing more 4 nodes are in middle layer (mean four nodes are running independently, one node does not know the state of other node), if it is on upper layer then there was no problem at all i would just put a counter in start and then check if counter is below 100 then store else do not store.

Please any help in this is appreciated.

Thanks:)

Recommended Answers

All 5 Replies

When you use the word 'sample' it means something very specific to a statistician. If you do indeed want a proper sample of your data, you must not truncate it "when the space runs out" because there might be, for instance, something that only shows up in the late evening, or when a particular user is logged in or ...

Instead, what you must do is store all the data, perhaps locally, then take a proper random sample from that full amount. OR you can properly save one item at a time based on a true random process, and accept that you usuall will get a count that is not exactly 100. What is special about 100?

Of course if you simply want to get 100 items, you can do something much simpler: Accept data only from one node, and stop when you've seen 100.

Why would checking the row count crash your DB?
You could create a trigger which checks BEFORE INSERT that the row count does not exceed your maximum and reject the INSERT if it does.
How to reject a row in a trigger: http://forums.mysql.com/read.php?99,134466,134481#msg-134481

a single node is basically processing thousands of samples a day, for each sample then i have to connect with db to check if the row count is still less than 100 this would crash my db because at the same time other nodes are also working. and this check has to be done whether to store or not instead of having 100 samples in db. for all other process this check is failed each time.

For a trigger will work when i am trying to insert, and for insert i need to connect.. so i am again connecting with db?

When you use the word 'sample' it means something very specific to a statistician. If you do indeed want a proper sample of your data, you must not truncate it "when the space runs out" because there might be, for instance, something that only shows up in the late evening, or when a particular user is logged in or ...

Instead, what you must do is store all the data, perhaps locally, then take a proper random sample from that full amount. OR you can properly save one item at a time based on a true random process, and accept that you usuall will get a count that is not exactly 100. What is special about 100?

Of course if you simply want to get 100 items, you can do something much simpler: Accept data only from one node, and stop when you've seen 100.

no i need samples from all nodes, lets say if i have 5 nodes then each node should save only 20 samples to db 20*5 = 100
As i mentioned earlier an upper counter is the simplest solution on each node, but i can not have that logic here.
the control of node is not in my hand. consider it like a black box. i have feed the code script now the script needs to take decision at run time whether to store results in db or not.
some logic mixing with some random number and the time???

Can you do it the other way round? Have your server send a message to all known clients as soon as the maximum has been reached.
I must admit that I do not yet understand the setup. If you have no control of the nodes, then you have either an intermediate agent between the database and the nodes or you are working directly on database level. Is that so? Or what do you mean with "no control"?
Did you test if the database performance really is affected by your query? MySQL has a query cache which should do most of the work in this case.
You could set up a polling process which connects to the database only once and queries it every few minutes whether the maximum has been reached. Then your clients connect to this polling process to see if it's o.k. to feed the database before they connect to the database. The process could have a very simple protocol with minimal overhead.

Be a part of the DaniWeb community

We're a friendly, industry-focused community of developers, IT pros, digital marketers, and technology enthusiasts meeting, networking, learning, and sharing knowledge.