My servers are running RedHat ES 3.

A while back (maybe a month or two ago), when I came in to work in the morning, our main server was not responding at all: nothing over the network, and nothing on the console. The power light was still on, though.

Upon trying to restart it, it became apparent that the RAID controller card had gone bad.

I replaced the RAID controller card, and everything has pretty much seemed to be working as normal.

Except that there have now been 2 or 3 times since I replaced the card when it has done the same thing: completely unresponsive, not even on the console. Restarting it seems to solve the problem, though.

What seems even stranger to me is that this has always happened in the middle of the night, when nothing much is going on on the server. During the day and evening, this server runs up to 140 'dumb' terminals, and many people also access it through PuTTY and Samba. But it has never gone down while all of that is going on.

It's not a major problem (yet), since it has happened only a few times, and only at night -- but I want to head this off before it does become a problem. I have tried looking at the logs, etc., and haven't found anything revealing (at least not to me).

Does anyone have any suggestions as to what it might be, or where I should start looking? Do you think the new RAID card might also be going bad?

Any and all input is greatly appreciated!

(Let me know what extra information I should post.)

Maybe your RAID port is bad? Or your motherboard.

Try putting the controller in a different slot on the motherboard.
Also, run a test on the HDDs; maybe one of them is kicking the controller in the shins.

Update all firmware. Update RHEL to the latest release.
Scan /var/log/messages and dmesg for errors, oopses and kernel panics.
What brand is the server?
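
For the log-scanning suggestion above, here is a rough sketch of one way to pull likely trouble lines out of a log file. It assumes a current Python 3, so it is probably easiest to run against a copy of /var/log/messages on another machine. The keyword list and the function names (find_suspicious_lines, scan_file) are made up for illustration and are not an official or complete set of things to look for.

```python
import re
import sys

# Keywords that often show up near kernel or disk trouble.
# This list is an illustrative assumption, not an exhaustive set.
PATTERN = re.compile(r"oops|panic|i/o error|call trace|timed out", re.IGNORECASE)


def find_suspicious_lines(lines):
    """Return (line_number, text) pairs for every line matching PATTERN."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if PATTERN.search(line):
            hits.append((number, line.rstrip("\n")))
    return hits


def scan_file(path):
    """Scan one log file; undecodable bytes are replaced rather than fatal."""
    with open(path, errors="replace") as handle:
        return find_suspicious_lines(handle)


if __name__ == "__main__":
    # Usage: python scan_log.py /path/to/copy/of/messages
    if len(sys.argv) > 1:
        for number, text in scan_file(sys.argv[1]):
            print(f"{number}: {text}")
```

Since the hangs happen at night, the timestamps on the last few lines written before each reboot are probably worth more attention than any single keyword match.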

Has it reached the magic number?

I know that older Windows servers crash after a certain number of days (the uptime counter overflows).
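
To put a number on the "magic number" idea: a fixed-width tick counter wraps after 2^bits ticks, so the time to overflow depends only on the counter width and the tick rate. The small sketch below works that out. The function name and the two tick rates are illustrative assumptions, not values taken from the server in question.

```python
SECONDS_PER_DAY = 86400


def days_until_wrap(ticks_per_second, counter_bits=32):
    """Days before an unsigned counter of the given width wraps around."""
    seconds = (2 ** counter_bits) / ticks_per_second
    return seconds / SECONDS_PER_DAY


# A 32-bit counter of milliseconds (1000 ticks per second):
print(round(days_until_wrap(1000), 1))  # 49.7
# A 32-bit counter ticking 100 times per second:
print(round(days_until_wrap(100), 1))   # 497.1
```

Both figures look longer than the gaps between the hangs described above (2 or 3 hangs in a month or two, each followed by a restart), though that is only a rough reading of the timeline.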

I hate being that person who posts a question and disappears...

I greatly appreciate the input and ideas, and I will be back to try things out and let the community know what I find. But things just got really busy at work, and maintaining the servers is considered my "side" work (except when something is actually interfering with production). This problem has been rare and has never happened during a shift, so even though I would personally like to get to the bottom of it, I'll have to wait until I have some "free time" at work to continue looking into it.