Just after 2 a.m. on Sunday morning, members of the LogCheck DevOps team (that’s Development + Operations) received an automated text message with a warning that our database server was running low on disk space. It was unpleasant news, but fortunately, not unexpected.
In the world of software, the phrase “eating your own dog food” means using your own product, so that you can see it the way that your customers see it. The phrase was coined at Microsoft in the 1980s, but the idea is even older.
It’s not always easy to do (if I were hired to write software for an air traffic control console, I’m not sure what I would do with it myself), but when you can, it helps you empathize with your users. Sometimes, it even inspires you to think of ways to make the product better that wouldn’t have occurred to you otherwise.
LogCheck, like most of the software services you use, is a combination of apps, servers, and other services, passing messages back and forth over the Internet and private networks. LogCheck is a system, and, just like a two-pipe steam heating system, it requires routine monitoring and maintenance to avoid system failure.
How much memory are the web servers using? Are email reports being delivered? What percentage of code is covered by tests? How much free space does the database server have? It’s easy to set an alarm to go off when any one of these, or a hundred other metrics, falls out of its expected range. But an alarm is not enough.
The chief engineer of a prestigious Manhattan skyscraper once told me that, even if every piece of equipment in his building were fully automated, he would still mandate routine inspections, “because it gets the guys in front of the equipment.” If you’re counting on humans to jump in when a problem occurs, you have to keep them in the loop.
For a while now, we’ve been using LogCheck to monitor the LogCheck system itself. We created a logbook for the DevOps team, and every day one of us is responsible for keeping it up to date.
Among the benefits we discovered:
- Increased situational awareness. Because we were routinely checking the free disk space on the database server, we knew how quickly it was running out, and how much time we had to draw up a plan to resize it. When it was time to act, we were ready.
- Avoiding “knowledge silos”. In order to complete your rounds, you need to be able to sign in to each system and find your way around. Sharing responsibility means we don’t have to worry that “the email guy” is the only person who knows how to check whether our outgoing email is being delivered.
- We don’t let maintenance tasks slide. As a security measure, we limit some access to a list of permitted IP addresses. Thanks to LogCheck, that list is reviewed and unused addresses are culled on a regular basis.
- We know our product better. We all use the mobile app to enter readings and the web app to review them, and we all receive the email report when something is out of range. Even though we individually work on different parts of LogCheck, the whole team can see how they all fit together.
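To make the first benefit concrete, here is a rough sketch of how a handful of logbook readings can be turned into a “how much time do we have?” estimate. It is not how LogCheck does it; the function name `days_until_full` and the sample numbers are made up for illustration, and it assumes free space shrinks at a roughly steady rate.

```python
from datetime import date


def days_until_full(readings):
    """Estimate days left until free space reaches zero.

    `readings` is a list of (date, free_gb) pairs, oldest first
    (a hypothetical format, not LogCheck's). Fits a least-squares line
    through the readings and extrapolates it to zero free space.
    Returns None if free space is not shrinking.
    """
    if len(readings) < 2:
        raise ValueError("need at least two readings")

    first_day = readings[0][0]
    xs = [(day - first_day).days for day, _ in readings]
    ys = [free_gb for _, free_gb in readings]

    n = len(readings)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:
        raise ValueError("readings must span more than one day")

    # Slope is the change in free GB per day (negative when filling up).
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / var_x
    if slope >= 0:
        return None

    intercept = mean_y - slope * mean_x
    day_of_zero = -intercept / slope
    # Count from the most recent reading, not from the first one.
    return day_of_zero - xs[-1]


if __name__ == "__main__":
    # Made-up sample readings, one per day.
    sample = [
        (date(2024, 1, 1), 100.0),
        (date(2024, 1, 2), 90.0),
        (date(2024, 1, 3), 80.0),
    ]
    print(days_until_full(sample))
```

A straight-line fit is the simplest reasonable choice here. If your disk usage comes in bursts, treat the result as a rough guide and keep checking.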
Earlier in the week, when we noticed the database was running out of room, we researched our options and devised an upgrade plan that wouldn’t require any downtime. We even had time to stage a “dress rehearsal” on a backup database to see if the procedure worked. When the alarm went off on Sunday morning, it didn’t take long to verify that the situation hadn’t changed (I could go back to sleep!), and on Sunday evening we executed our plan. If we hadn’t been keeping a logbook… let’s just say I probably would have looked more haggard than usual on Monday morning.
The next time you’re talking to a software vendor, be sure to ask: do you eat your own dog food? And if you’re responsible for maintaining a heating system, a software system, or any other kind of system, I hope for your own sake that you’re keeping a logbook. (And if you’re not, did you know you can try LogCheck for free?)