The path to an automated monitoring system
Juliano Martinez, Francisco Freire

Abstract
After years of struggling with manual, hand-crafted monitoring systems, Locaweb reached a point where the number of services and the volume of data generated by these systems had become huge. We needed to keep up with the company's growth, scale the system, and learn from past errors in record time. The challenge: design and implement an automated and integrated monitoring system in a short time. This paper shows how we built our automated monitoring system in less than 3 months, with almost 400k service checks, using cfengine, check_mk, Python and a home-grown project called leela. We discuss the challenges we faced in designing, developing and integrating everything and putting this project into production, as well as how we leverage a heuristic to automatically open tickets without flooding our operations team.

1 Introduction

Monitoring has been one of the biggest problems in system administration for years. “How do we scale monitoring?”, “How do we cover everything that must be monitored?”, “Which alarms are more or less critical?”, “What has to be monitored from the application perspective?”: these questions live in every system administrator's head. Our work focuses on removing false positives from the services and applications being monitored, providing a good way to calculate a composite SLA, and using a single solution to keep all system administrators speaking the same language.

2 Challenges

Locaweb has a huge environment (more than 6k physical servers and 13k virtual machines), and we have to offer the most recent products and systems. Everything can change overnight and look completely different the next day. Given this requirement, the monitoring system needs to be effective and able to grow along with the infrastructure.