Friday, September 24, 2010

Take Monitoring To The Next Generation With Rocksteady

Predictive monitoring with #Rocksteady.
Most of today's monitoring systems only tell you what has already happened, so we usually end up running around fixing things that have already gone south. It would be really nice if there was a monitor that could predict what is going to happen. But since that is a bit difficult, we would settle for something
So a bunch of Googlers got together to create Rocksteady:

Rocksteady is an effort to use a complex event processing engine to analyze user-defined metrics. The end goal is to derive root-cause conclusions from metric-driven events. Rocksteady is only the metric analysis part of the whole picture, but we also present a solution, including metric conventions, metric sending, load balancing, and graphing, that works well for us.

  1. Metrics are sent from hosts into RabbitMQ. A metric looks something like 1min.juicer.system.cpu.prct_idle.dc1.pi101 82 1283192317. See the metric format documentation for details.
  2. Rocksteady subscribes to the metric exchange on RabbitMQ. It can also publish its own metrics into RabbitMQ.
  3. Graphite subscribes to the metric exchange on RabbitMQ.
  4. Rocksteady can request historic metrics from Graphite.
  5. Rocksteady processes the metrics and alerts Nagios.
  6. Rocksteady captures some data, either raw or composite, into a database.
  7. Graphite data is used in the dashboard.
  8. Data captured by Rocksteady is used in the dashboard.
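
To make the metric line in step 1 concrete, here is a minimal Python sketch that splits such a line into its three space-separated fields. This is only an illustration, not Rocksteady's own code: the `Metric` type and `parse_metric` function are hypothetical names, and reading the last field as a Unix timestamp in seconds is an assumption based on the sample value.

```python
from typing import NamedTuple


class Metric(NamedTuple):
    """Hypothetical container for one metric line (not part of Rocksteady)."""
    name: str       # dotted metric name
    value: float    # measured value
    timestamp: int  # assumed to be seconds since the Unix epoch


def parse_metric(line: str) -> Metric:
    """Parse one 'name value timestamp' line into a Metric."""
    name, value, timestamp = line.split()
    return Metric(name, float(value), int(timestamp))


metric = parse_metric("1min.juicer.system.cpu.prct_idle.dc1.pi101 82 1283192317")
print(metric.name.split("."))          # the dotted components of the name
print(metric.value, metric.timestamp)
```

What each dotted component of the name stands for is defined by the project's metric convention, so the sketch leaves the name unsplit apart from printing its parts.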

These Googlers use Rocksteady to monitor operational metrics at AdMob, for example to find the reasons for latency. They put requests per second (rps) together with a bunch of other metrics, such as CPU and network traffic;
... together in a prediction algorithm such as Holt-Winters to predict a confidence band for the next arriving value. We then record an event whenever metrics are outside the band more than a certain number of times in a row. This is what we call auto threshold establishment. Now, if we have an SLA we really care about, such as response time, we can set a hard threshold, say 250ms. When response time slows beyond 250ms, Rocksteady tells us whether rps, CPU or network crossed their respective thresholds. Now instead of just knowing there is a latency problem, we can also quickly pinpoint the potential cause.
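
The idea in the quote can be sketched in a few lines of Python. This is not Rocksteady's implementation, and it is not full Holt-Winters either: as a stated simplification it uses non-seasonal double exponential smoothing (Holt's method) for the forecast and a smoothed absolute error for the band width. The class name `BandDetector`, the function `likely_causes`, all parameter values, and the choice to freeze the model while a value is outside the band are assumptions made for illustration.

```python
class BandDetector:
    """Simplified, non-seasonal stand-in for a Holt-Winters confidence band."""

    def __init__(self, alpha=0.5, beta=0.1, gamma=0.1, width=3.0,
                 warmup=5, max_in_a_row=2):
        self.alpha, self.beta, self.gamma = alpha, beta, gamma
        self.width = width                # band half-width, in deviations
        self.warmup = warmup              # samples to learn from before flagging
        self.max_in_a_row = max_in_a_row  # violations tolerated in a row
        self.level = None
        self.trend = 0.0
        self.deviation = 0.0
        self.seen = 0
        self.in_a_row = 0

    def update(self, value):
        """Feed one value; return True when an event should be recorded."""
        self.seen += 1
        if self.level is None:
            self.level = value
            return False
        forecast = self.level + self.trend
        error = abs(value - forecast)
        outside = self.seen > self.warmup and error > self.width * self.deviation
        if outside:
            # Assumption: keep the model frozen so the outlier does not
            # widen the band it is being judged against.
            self.in_a_row += 1
        else:
            self.in_a_row = 0
            new_level = self.alpha * value + (1 - self.alpha) * forecast
            self.trend = (self.beta * (new_level - self.level)
                          + (1 - self.beta) * self.trend)
            self.level = new_level
            self.deviation = self.gamma * error + (1 - self.gamma) * self.deviation
        return self.in_a_row > self.max_in_a_row


def likely_causes(response_ms, active_events, sla_ms=250.0):
    """If the hard SLA threshold is breached, list metrics with an active event."""
    if response_ms <= sla_ms:
        return []
    return sorted(name for name, active in active_events.items() if active)


# Made-up samples: rps jumps while CPU and network stay flat.
detectors = {name: BandDetector() for name in ("rps", "cpu", "network")}
samples = ([{"rps": 100.0, "cpu": 40.0, "network": 10.0}] * 10
           + [{"rps": 500.0, "cpu": 40.0, "network": 10.0}] * 3)
active = {}
for sample in samples:
    active = {name: detectors[name].update(v) for name, v in sample.items()}

print(likely_causes(310.0, active))  # response time above the 250ms threshold
print(likely_causes(200.0, active))  # response time within the SLA
```

With these made-up samples only rps leaves its band more than twice in a row, so it is the only metric reported once the response time passes the hard threshold. A real deployment would need a seasonal model and tuned parameters.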
You can find Rocksteady and the related technology, including the code, because it is a Google Open Source effort. The design information is published as well.
Perhaps Facebook, which has been monitoring bygone data, could use this to prevent Facebook outages like yesterday's.

Get ready to Rocksteady - Google Open Source Blog

