I recently received an assignment that is, in my opinion, vastly above my paygrade and current knowledge, but I am dead set on completing it in a timely manner.
I've posted this on another tech forum with zero results, so I appeal to those in the industry who hopefully have run into this before.
Okay, so I've been tasked by my supervisor to see if there is a way to read the status of our server room's UPS bank so we can triage power needs and shut down certain servers gracefully ahead of others.
We currently have a system that will provide us 1.5 hours of battery backup at 75% load for our 90 servers. However, some of our servers are not mission critical and can be shut down early, which leaves more runtime for our larger systems to run graceful shutdowns.
Right now, my boss has to come in when there's a power outage and manually shut things down to extend the runtime for the large servers, because at the moment we have no way to decide WHEN machines need to shut down.
I'm supposed to find a way to read information from the UPS system and set a 5 minute timer, so that if the power is still off after 5 minutes, I quickly kill 21 of our non-important machines to give the mainframe and the other larger machines time to back up their information and shut down nicely.
We have quite a few short power outages (30 seconds to 4 minutes) from time to time, so we would like nothing to shut down for the short ones. If it looks like the power will be out for a while (over 10 minutes), the system should start shutting down non-critical machines. If the power comes back on, it should stop killing machines and, hopefully, turn them back on once power is stable. That would give us completely hands-free power management while maintaining uptime.
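To make the timing rules above concrete, here is a minimal sketch in Python of the decision logic only. The class name, the action names, and the default threshold are all assumptions for illustration (the post mentions both a 5 minute timer and "over 10 minutes", so the threshold is a parameter), and it does not talk to any real UPS.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class OutageTracker:
    """Decides what to do on each poll of the UPS (all names are hypothetical)."""
    grace_seconds: float = 5 * 60           # assumption: the 5 minute timer
    outage_started: Optional[float] = None  # when battery power was first seen
    shedding: bool = False                  # True once load shedding has begun

    def update(self, on_utility: bool, now: float) -> str:
        """Return 'idle', 'wait', 'shed', or 'restore' for this poll."""
        if on_utility:
            self.outage_started = None
            if self.shedding:
                self.shedding = False
                return "restore"   # power came back after load was shed
            return "idle"          # normal operation, or a short blip that ended
        if self.outage_started is None:
            self.outage_started = now
        if now - self.outage_started >= self.grace_seconds:
            self.shedding = True
            return "shed"          # outage outlasted the grace period
        return "wait"              # on battery, still inside the grace period
```

A polling loop would call `update()` every few seconds with `time.monotonic()` as `now`. A real version would probably also want to wait until power has been stable for a while before acting on `restore`.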
I have been given 12 dummy machines to use for testing purposes, so I'm pretty much open to any suggestions.
What I am going to do is plug one of my spare machines into the main UPS, and connect it via serial to an APC SUPS1500 that is plugged into the power backbone. The machine on the main UPS will act as a script server (sorta). When the little APC loses power, it will tell the script server, which will run a script. The script will shut down the non-critical servers remotely, then check whether utility power is back (how do I do this?). Then it will shut down the important servers remotely in a specific order, checking each time whether power is back. If power comes back, it runs a script that boots up all the machines with 45 seconds between each boot so we don't blow the breaker.
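The ordering part of that plan could be sketched roughly like this in Python. The functions `shut_down`, `power_on`, and `power_is_back` are hypothetical callables passed in by the caller, since how each machine is actually stopped, started, or how the UPS is queried depends on the setup; only the 45 second gap comes from the plan above.

```python
import time
from typing import Callable, Iterable, List

BOOT_GAP_SECONDS = 45  # from the plan: 45 seconds between boots


def shutdown_in_order(hosts: Iterable[str],
                      shut_down: Callable[[str], None],
                      power_is_back: Callable[[], bool]) -> List[str]:
    """Shut hosts down one at a time, stopping early if utility power returns.

    Returns the hosts that were actually shut down, in order.
    """
    done = []
    for host in hosts:
        if power_is_back():   # re-check before every shutdown
            break
        shut_down(host)
        done.append(host)
    return done


def staggered_boot(hosts: Iterable[str],
                   power_on: Callable[[str], None],
                   gap: float = BOOT_GAP_SECONDS,
                   sleep: Callable[[float], None] = time.sleep) -> None:
    """Power hosts on one at a time, pausing between boots (not before the first)."""
    for i, host in enumerate(hosts):
        if i:
            sleep(gap)
        power_on(host)
```

Passing the callables in keeps the ordering logic testable on the dummy machines before it is pointed at anything real.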
I'm not quite sure how to get information from the UPS into the program each time it needs to decide whether power is back, so I'm hung up there.
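One possible way to get at that information, sketched in Python, is to parse the text output of a UPS monitoring daemon. This assumes the apcupsd package is installed on the script server and supports the small APC over its serial cable, which has not been verified here. Its `apcaccess status` command prints `KEY : value` lines, including a `STATUS` line that reads `ONLINE` on utility power and `ONBATT` on battery. The function names are made up for illustration.

```python
import subprocess
from typing import Dict


def parse_status(text: str) -> Dict[str, str]:
    """Turn 'KEY : value' lines into a dict, splitting at the first colon."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def on_utility(fields: Dict[str, str]) -> bool:
    """True when the STATUS field contains the ONLINE flag."""
    return "ONLINE" in fields.get("STATUS", "").split()


def read_ups() -> Dict[str, str]:
    """Run apcaccess and parse its output (assumes apcupsd is installed and running)."""
    result = subprocess.run(["apcaccess", "status"],
                            capture_output=True, text=True, check=True)
    return parse_status(result.stdout)
```

The script would then call `on_utility(read_ups())` before each shutdown step to decide whether to keep going or to reverse course.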
I'm also not sure how to use a batch file to send kill signals to a Linux / Unix / Sun server without SSHing into each one by hand. With as many servers as I have to deal with, writing a separate connect script for each would be terrible.
I'll be okay with all the Windows servers though.
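For the Unix-type machines, one common approach is a single loop over a host list that runs the shutdown command through non-interactive SSH. The Python sketch below assumes key-based SSH login has already been set up from the script server to each host. The host names, the `root` user, and the inventory layout are placeholders, and the exact shutdown command differs between Linux, BSD, and Solaris, so each one would need checking.

```python
import subprocess
from typing import List, Sequence, Tuple

# Hypothetical inventory: (host, remote shutdown command) pairs.
# The command shown is only an example; it varies by operating system.
NON_CRITICAL = [
    ("test01.example", "shutdown -h now"),
    ("test02.example", "shutdown -h now"),
]


def build_ssh_command(host: str, remote_command: str,
                      user: str = "root", timeout: int = 10) -> List[str]:
    """Build an ssh command line that never prompts for a password."""
    return [
        "ssh",
        "-o", "BatchMode=yes",               # fail instead of prompting
        "-o", f"ConnectTimeout={timeout}",   # don't hang on dead hosts
        f"{user}@{host}",
        remote_command,
    ]


def shut_down_all(hosts: Sequence[Tuple[str, str]], user: str = "root") -> List[str]:
    """Run each host's shutdown command; return hosts where ssh exited nonzero.

    A nonzero exit can also just mean the connection dropped as the host
    went down, so the returned list is "worth a second look", not proof of failure.
    """
    suspect = []
    for host, command in hosts:
        result = subprocess.run(build_ssh_command(host, command, user))
        if result.returncode != 0:
            suspect.append(host)
    return suspect
```

With this shape, adding or removing a server is a one-line change to the list rather than a new script.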
Where I'm coming from:
I'm a hardware guy with a little bit of C++ and Java experience, and lots of time working with servers remotely doing web development and hosting. But I've never even had to work with UPSs before, and rarely have I used batch files, so I'm in a whole new ballgame now.
If there's an easier way to go about this, please let me know. This solution came out of about 4 intense brainstorming sessions by myself, so I really don't have much to draw on.
My main problems:
Automating login for the Unix/Sun/Linux/BSD systems.
Deciding when power has come back on so I can reverse the shutdown process.
I'm appealing to the almighty gods of the internet :D
Thanks in advance,
Public Sector; Network Engineering.