November 25, 2009
Sysadmins: trust devs. They are hired because they are very good at what they do. They don't want to reboot your servers or overload them.
Devs: trust sysadmins. They are hired because they are very good at what they do. They don't want to under-resource your apps or cut off access to them.
Let's all just get along. There are some housekeeping requirements, though. There must be an accountability chain as well as an audit trail of changes made on production systems. There must also be quick and easy access to details about hardware, apps, network, basically anything you can catalog. Very important questions like these should be trivial to answer:
Each should be answered with way more information than seems sane. Leave it to meatspace to make sense of it.
Some ideas...
How do you give access to production systems, keep people accountable, and maintain a secure audit trail? A simple starting point is to hook production up to a read-only copy of your corporate directory and let dev and ops log in with sudo. Yes, root shell in production. Nobody wants to break things, but if somebody really is incompetent, they should be trained, not gated off. Be sure to set up auditd to track filesystem writes by devs and ops as well as access to other accounts (all sudo and su usage). Tracking privileged-port socket creation may be good, too. Make daily/weekly/monthly reports as well as real-time alerting for big ohshits like root login, exec of important binaries, rm of special files, the usual Security Relevant Objects stuff.
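As a rough sketch, the auditd side of that could start out something like the rules below. Paths and key names are illustrative, not a recommendation for your environment:

    # Watch writes and attribute changes under the deploy tree, tagged for reporting
    auditctl -w /var/www/app -p wa -k app_writes
    # Record every sudo and su execution (who escalated, and when)
    auditctl -w /usr/bin/sudo -p x -k priv_esc
    auditctl -w /bin/su -p x -k priv_esc
    # Flag anything exec'd with root privileges for the daily report
    auditctl -a always,exit -F arch=b64 -S execve -F euid=0 -k root_exec
    # Pull a report later, e.g. everything tagged priv_esc since midnight:
    # ausearch -k priv_esc -ts today

ausearch and aureport can then drive the daily/weekly summaries and the real-time alerts.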
Aside from shell access, everybody should be able to deploy, revert, and stage new code at any time. There are many, many ways to do this. I hear flickr uses an internal web app that has a big red button. Be sure the button reverts ALL changes: all code, packages, daemons, db alters, switch/load-balancer config,... EVERYTHING. Stuff broke, so change it back NOW!
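The big red button can be as simple as one entry point that rolls everything back together and stops loudly on the first failure. A hypothetical sketch; every command named here is a placeholder for whatever your shop actually uses:

    #!/bin/sh
    # Hypothetical revert-everything script. The tools below don't exist; they
    # stand in for your deploy system, migration runner, and network config push.
    set -e                                  # stop at the first failed step
    PREV="$1"                               # tag or timestamp to roll back to
    deploy-code --revision "$PREV"          # code and packages
    run-db-migrations --down-to "$PREV"     # db alters
    push-lb-config --revision "$PREV"       # switch/load-balancer config
    restart-daemons                         # bounce anything caching old state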
An important part of minimizing the impact of "shit breaking" is resource isolation. If something fails, it should fail completely independently. Maybe resource usage scales 1:1 between two apps, maybe localhost latency would improve performance, maybe one of them is usually well-behaved: none of that justifies letting two apps share resources unless one going down has only a limited impact on the other. For example, if your main app requires a daemon for db connection pooling, serialization, or some message queue, and can operate at all without that daemon (even with degraded features), then think twice (or ten times) before putting the two on the same server. There are tools like rlimit and chroot, or even virtualization, to minimize the correlated performance hit. Sometimes these will get you close enough to independent failures.
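When the two really do have to share a box, you can at least put a fence around the less trusted one. A minimal sketch, with made-up limits and paths:

    # Cap the pooling daemon's resources so a blowup can't starve the main app.
    # Limits, jail path, and daemon name are illustrative only.
    ulimit -v 524288        # virtual memory ceiling for this shell, in KB (~512MB)
    ulimit -n 1024          # max open file descriptors
    chroot /srv/jail/pooler /usr/sbin/pooler-daemon    # hypothetical daemon, in its own jail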
How do I get any and all info without impacting the performance of production hardware? You don't. The overhead is a requirement that gets built into hardware scaling, just like monitoring. Of course, always try to find the least invasive method. Some ideas: