Monitoring and Logging

I’ve had a lot of arguments about monitoring and logging during my career in operations. Many of these arguments were centered around the idea that I wanted to monitor and log “too much”. That’s not necessarily wrong, but I’ve always thought of it more as “monitor and log everything.”

The reason for that is really simple. When (it’s not “if”) something goes wrong, you’ll need data to help understand and troubleshoot. If you don’t monitor and log all the things, then you’ll always be playing at least partially in the dark. I’m sure a philosophical argument could be made about never being able to truly monitor everything, but I think the point is clear.

On the Subject of KPIs

That being said, not every data point you collect will be a key performance indicator (KPI), nor do you need to set an alarm for every data point. You also don’t need to retain all that data for 30 years. Keep your KPIs for longer if it makes sense. The average response time of your app, for example, could be useful even a year later, but the disk utilization of the web servers likely not so much.
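To make the retention idea concrete, here is a minimal sketch of a per-class retention rule. The metric names, the two classes, and the retention windows are all hypothetical placeholders I made up for illustration; pick values that fit your own environment.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical retention windows. The numbers are illustrative only.
RETENTION = {
    "kpi": timedelta(days=365),         # e.g. average app response time
    "operational": timedelta(days=30),  # e.g. web server disk utilization
}

# Hypothetical mapping of metric names to a retention class.
METRIC_CLASS = {
    "app.response_time.avg": "kpi",
    "web.disk.utilization": "operational",
}


def is_expired(metric, recorded_at, now):
    """Return True if a data point is older than its class allows."""
    # Anything not explicitly marked as a KPI gets the shorter window.
    metric_class = METRIC_CLASS.get(metric, "operational")
    return now - recorded_at > RETENTION[metric_class]


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    ninety_days_ago = now - timedelta(days=90)
    for name in METRIC_CLASS:
        print(name, "expired:", is_expired(name, ninety_days_ago, now))
```

The point of the sketch is only that retention is decided per class of metric, not globally, so the KPI survives while the operational noise ages out.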

We are also often asked to recommend and implement monitoring. It’s pretty simple: use services like Datadog and Loggly.

Monitoring and Logging Services

Unless your core business is monitoring/logging, you should leave the management of this cost sink to people who really and truly care about it because it’s their business. You will never make money from monitoring, so instead of diverting your precious resources to it, you will be much better off using them for revenue generation.

There is one exception and that is scale. Many of the monitoring and logging services can become expensive when you’re running a lot of systems. In that case the economies of scale may tip in favor of putting together a custom solution.
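A rough way to see where that tipping point sits is a simple break-even calculation. This is a back-of-the-envelope sketch, assuming a flat per-host service price and a self-hosted cost made up of a fixed monthly staff cost plus a per-host infrastructure cost. All of those parameters and the figures in the example are assumptions, not real pricing.

```python
import math


def monthly_service_cost(hosts, price_per_host):
    """Cost of a hosted service billed per monitored host."""
    return hosts * price_per_host


def monthly_self_hosted_cost(hosts, fixed_staff_cost, infra_cost_per_host):
    """Cost of running it yourself: people first, then machines."""
    return fixed_staff_cost + hosts * infra_cost_per_host


def break_even_hosts(price_per_host, fixed_staff_cost, infra_cost_per_host):
    """Smallest host count at which self-hosting is strictly cheaper.

    Returns None if self-hosting never wins under these assumptions.
    """
    saving_per_host = price_per_host - infra_cost_per_host
    if saving_per_host <= 0:
        return None
    return math.floor(fixed_staff_cost / saving_per_host) + 1


if __name__ == "__main__":
    # Placeholder figures only.
    print(break_even_hosts(price_per_host=15,
                           fixed_staff_cost=12000,
                           infra_cost_per_host=3))
```

Real pricing is rarely this linear (volume discounts, log ingestion charges, on-call burden), so treat the result as a starting point for the conversation rather than an answer.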

Sensu is a nice alerting tool that works much better in a cloud environment with ephemeral machines than the venerable Nagios and its forks. The biggest reasons are the subscription-based monitoring of clients and the ability to distribute load much more easily than with Nagios. Much like Nagios, Sensu doesn’t come with a way to track and graph the data points it collects. Also like Nagios, that’s not hard to add. A good choice is InfluxDB with Grafana. Graphite is an alternative to InfluxDB, but it’s harder to set up and maintain.
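To illustrate why subscriptions suit ephemeral machines, here is a toy model of the idea. It is not Sensu's code or configuration format, and the check and subscription names are made up. The point is that checks are attached to roles, and a new machine only has to announce which roles it has.

```python
# Hypothetical checks, each published to one or more subscriptions (roles).
CHECKS = {
    "check_http": {"web"},
    "check_disk": {"web", "db"},
    "check_replication": {"db"},
}


def checks_for_client(client_subscriptions, checks=CHECKS):
    """Return the names of all checks a client should run.

    A client runs a check if it shares at least one subscription with it.
    No per-host definition is needed on the server side.
    """
    wanted = set(client_subscriptions)
    return sorted(name for name, subs in checks.items() if subs & wanted)


if __name__ == "__main__":
    # A freshly launched web server only declares its role.
    print(checks_for_client(["web"]))
    # A machine that plays both roles picks up everything.
    print(checks_for_client(["web", "db"]))
```

Contrast this with a model where every host must be listed explicitly before it is monitored, which is painful when machines come and go all day.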

On the logging front, the big hitter is the ELK Stack. Without a lot of fussing you can aggregate all of your logs and work with them through search and visualizations. There is also a really good community that can help you parse and analyze log formats.
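Parsing gets much easier if your own applications emit structured logs in the first place. As a minimal sketch using only Python's standard logging module, the formatter below writes one JSON object per line. The field names and the logger name are my own choices for illustration, not something the ELK Stack requires.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object on one line."""

    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        return json.dumps(payload)


def build_logger(name, stream):
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    # Replace existing handlers so repeated calls don't duplicate output.
    logger.handlers = [handler]
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = build_logger("checkout", sys.stdout)  # "checkout" is a made-up name
    log.info("order %s placed", 42)
```

A log shipper can then forward those lines as they are, and you skip writing a custom parsing pattern for your own log format.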

In Conclusion

Ultimately it comes down to cost. With a lot of devices to monitor and lots of logs to aggregate, you might consider running things yourself. The cost of a service can add up quickly as scale increases. Just be sure to remember that while you can run some very good tools for free, the setup and maintenance still need to happen. If you have the staff and skill to make this unproblematic, go for it. Otherwise you’ll be better off focusing on your product and creating revenue than venturing into the rabbit hole that is monitoring, logging, and metrics collection.

