I’m a Systems/Software Engineer in the San Francisco Bay Area. I moved from Columbus, Ohio in 2007 after getting a B.S. in Physics from the Ohio State University. I'm married, and we have dogs.

Under my GitHub account (https://github.com/addumb): I open-sourced python-aliyun in 2014, I keep an outdated Python 2 project starter template at python-example, and I have a pretty handy “sshec2” command along with some others in tools.

Devops

November 25, 2009

Sysadmins: trust devs. They were hired because they are very good at what they do. They don't want to reboot your servers or overload them.

Devs: trust sysadmins. They were hired because they are very good at what they do. They don't want to under-resource your apps or block off access to them.

Let's all just get along. There are some housekeeping requirements, though. There must be an accountability chain as well as an audit trail of changes made on production systems. There must also be quick and easy access to details about hardware, apps, network, basically anything you can catalog. Very important questions like these should be trivial to answer:

  • What switch are these servers connected to?
  • What version of this package should be installed where?
  • What is this IP?
  • What is this hostname?

Each should be answered with way more information than seems sane. Leave it to meatspace to make sense of it.
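For instance, the last two questions can be answered from any box with nothing but DNS. Here's a hypothetical little "whatis" script (pure Python standard library, nothing about your infrastructure assumed) that dumps forward and reverse records for whatever you hand it:

    #!/usr/bin/env python
    """whatis: answer 'what is this IP / what is this hostname' with too much info."""
    import socket
    import sys

    def describe(target):
        # Forward lookup: name (or IP) -> canonical name, aliases, addresses.
        try:
            name, aliases, addrs = socket.gethostbyname_ex(target)
            print("forward: %s -> %s (aliases: %s)"
                  % (name, ", ".join(addrs), ", ".join(aliases) or "none"))
        except socket.gaierror as err:
            print("forward: %s has no A record (%s)" % (target, err))
            addrs = [target]  # maybe it was already an IP
        # Reverse lookup: each address -> PTR record.
        for addr in addrs:
            try:
                ptr = socket.gethostbyaddr(addr)[0]
                print("reverse: %s -> %s" % (addr, ptr))
            except (socket.herror, socket.gaierror) as err:
                print("reverse: %s has no PTR (%s)" % (addr, err))

    if __name__ == "__main__":
        for arg in sys.argv[1:]:
            describe(arg)

Run it against a hostname or an IP; anything with mismatched forward and reverse records is worth a second look.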

Some ideas...

How should you grant access to production systems, keep people accountable, and maintain a secure audit trail? A simple starting point is to hook production up to a read-only copy of your corporate directory and let dev and ops log in and have sudo. Yes, root shell in production. Nobody wants to break things, but if somebody really is incompetent, they should be trained, not gated off. Be sure to set up auditd to track filesystem writes by devs and ops as well as access to other accounts (all sudo and su usage). Tracking privileged-port socket creation may be good, too. Build daily/weekly/monthly reports as well as realtime alerting for the big ohshits: root login, exec of important binaries, rm of special files, the usual Security Relevant Objects stuff.
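The reporting half doesn't need to be fancy. A rough sketch (it assumes the stock Linux audit.log format, where every record carries a type= field and most carry the login uid as auid=, and it leaves the actual audit rules to auditctl) that rolls a day's log into a who-did-how-much-of-what summary:

    #!/usr/bin/env python
    """Daily summary of audit records, grouped by login uid (auid) and record type."""
    import collections
    import re
    import sys

    RECORD = re.compile(r'type=(?P<type>\S+) .*?auid=(?P<auid>\d+)')

    def summarize(path="/var/log/audit/audit.log"):
        counts = collections.defaultdict(int)
        with open(path) as log:
            for line in log:
                match = RECORD.search(line)
                if match:
                    counts[(match.group("auid"), match.group("type"))] += 1
        for (auid, rtype), count in sorted(counts.items()):
            print("auid=%s %-20s %d" % (auid, rtype, count))

    if __name__ == "__main__":
        summarize(*sys.argv[1:])

Mail that out nightly and page on the scary record types instead of waiting for the report.
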
Aside from shell access, everybody should be able to deploy, revert, and stage new code at any time. There are many, many ways to do this. I hear Flickr uses an internal web app with a big red button. Be sure the button reverts ALL changes: all code, packages, daemons, db alters, switch/loadbalancer config... EVERYTHING. Stuff broke, so change it back NOW!
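What the button actually runs will be specific to your shop; the sketch below is purely illustrative (every "deploy-tool" command in it is made up). It only exists to show the one property that matters: a single entry point that walks back every kind of change, in order, without stopping at the first failure.

    #!/usr/bin/env python
    """The big red button: revert EVERYTHING to the last known-good state."""
    import subprocess
    import sys

    # Every step is a placeholder for your own tooling; the only real
    # requirement is that the list covers everything a deploy can touch.
    REVERT_STEPS = [
        ["deploy-tool", "rollback", "--code"],       # hypothetical: previous code release
        ["deploy-tool", "rollback", "--packages"],   # hypothetical: previous package set
        ["deploy-tool", "rollback", "--db"],         # hypothetical: undo db alters
        ["deploy-tool", "rollback", "--lb-config"],  # hypothetical: loadbalancer/switch config
        ["deploy-tool", "restart", "--daemons"],     # hypothetical: bounce affected daemons
    ]

    def big_red_button():
        failures = 0
        for step in REVERT_STEPS:
            print("running: %s" % " ".join(step))
            if subprocess.call(step) != 0:
                # Keep going: a partial revert still beats a fully broken site,
                # but make the failure loud so a human follows up.
                print("FAILED (continuing): %s" % " ".join(step))
                failures += 1
        return failures

    if __name__ == "__main__":
        sys.exit(big_red_button())
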

An important part of minimizing the impact of "shit breaking" is resource isolation. If something fails, it should fail completely independently. Don't let two apps share resources just because their resource usage scales 1:1, or because localhost latency will improve performance, or because something is usually well-behaved, unless one going down will have only a limited impact on the other. For example, if your main app requires a daemon for db connection pooling, serialization, or some message queue, but can still operate in some fashion without that daemon (with degraded features), then think twice (or ten times) before putting the two on the same server. There are tools to limit the correlated damage, like rlimit and chroot, or even virtualization. Sometimes these will get you close enough to independent failures.
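If you do end up colocating, at least fence the helper in. A minimal sketch using Python's resource module; the daemon name and the limits here are stand-ins, not recommendations:

    #!/usr/bin/env python
    """Launch a helper daemon with hard resource limits so it can't starve the main app."""
    import resource
    import subprocess

    def cap_resources():
        # Runs in the child just before exec: cap address space at 512 MB and
        # CPU time at one hour so a runaway helper can't eat the whole box.
        resource.setrlimit(resource.RLIMIT_AS, (512 * 1024 * 1024, 512 * 1024 * 1024))
        resource.setrlimit(resource.RLIMIT_CPU, (3600, 3600))

    # "connection-poolerd" is a stand-in for whatever daemon you're isolating.
    pooler = subprocess.Popen(["connection-poolerd", "--port", "6432"],
                              preexec_fn=cap_resources)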

How do I get any and all of this info without impacting the performance of production hardware? You don't. The overhead is a requirement you build into hardware scaling considerations, just like monitoring. Of course, always try to find the least invasive method. Some ideas:

  • Crawl switch MAC tables to correlate servers and switch ports. Do it often if you do a lot of wire finagling. (There's a sketch of this after the list.)
  • Use a pre-made daily report about packages installed (CentOS has /var/log/rpm, generated by /etc/cron.daily/rpm).
  • Write many small and simple tools to make the info accessible everywhere at any time.
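Here's the rough sketch of the MAC-table crawl promised above. It assumes net-snmp's snmpwalk is on the box, the switch answers the "public" read community, and it speaks BRIDGE-MIB; mapping the bridge port number back to an interface name takes one more walk (dot1dBasePortIfIndex), which is left out to keep this short.

    #!/usr/bin/env python
    """Crawl a switch's MAC table (BRIDGE-MIB dot1dTpFdbPort) via snmpwalk."""
    import subprocess
    import sys

    # dot1dTpFdbPort maps a learned MAC address to a bridge port number.
    OID = "1.3.6.1.2.1.17.4.3.1.2"

    def mac_table(switch, community="public"):
        out = subprocess.check_output(
            ["snmpwalk", "-v2c", "-c", community, "-On", switch, OID])
        table = {}
        for line in out.decode().splitlines():
            # .1.3.6.1.2.1.17.4.3.1.2.0.22.61.123.45.67 = INTEGER: 12
            oid, _, value = line.partition(" = INTEGER: ")
            if not value:
                continue
            # The last six sub-identifiers of the OID are the MAC, in decimal.
            mac_bytes = oid.strip().split(".")[-6:]
            mac = ":".join("%02x" % int(b) for b in mac_bytes)
            table[mac] = int(value)
        return table

    if __name__ == "__main__":
        for mac, port in sorted(mac_table(sys.argv[1]).items()):
            print("%s on bridge port %d" % (mac, port))

Dump that into whatever catalog you keep, alongside the server-side MACs from ifconfig, and "what switch port is this server on?" stops being a walk to the cage.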

This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 United States License. :wq