
Predictive monitoring industrialisation

Nowadays, Machine Learning has become a common trend. AI is everywhere and, hopefully, it will change our world for the better: we find it in smartphones, cars, customer behaviour analysis, and more.

AI improves the user experience by applying machine learning to our behaviour, so that it can predict our next “move”. Why not use the same approach to predict the next problem in our applications?

Well, we decided to give predictive monitoring a try, i.e. to try to detect anomalies before they actually occur.

This article is not about the data scientists’ algorithms themselves, but about how we industrialized the concept.

Predictive monitoring: what and why?

Some time ago, we attended a conference on how to detect adverse effects occurring with certain medicines. The goal was to automatically and proactively detect small populations that could develop issues with these medicines. The speaker then explained how to use the R framework and various algorithms, incomprehensible to ordinary people, to magically find these patients.

On my side, being deeply involved in application monitoring concerns, something clicked during this demonstration: how could we use the same method to detect production anomalies quickly and automatically?

After a quick look at the market, it became apparent that this type of solution was still in its infancy.

So we reached out to the community of Worldline data scientists to discuss the feasibility. They answered positively and started to provide us with algorithms.

Early tests helped us identify the main benefits:

  • no more need to configure any threshold: the algorithms detect the seasonality of the metrics and flag abnormal variations
  • anomalies are detected early, before they become incidents

They also highlighted a constraint: the analysed data must exhibit seasonality (and, of course, cover a sufficient time lapse for the algorithms to detect it).
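
To make this more concrete, here is a minimal sketch of the principle, deliberately simplistic and not the actual algorithms provided by our data scientists: learn a per-slot baseline over past seasons, then flag points that deviate too much from it. All class and parameter names are hypothetical.

```java
import java.util.List;

/**
 * Minimal illustration of seasonality-based anomaly detection.
 * NOT the PoC algorithm; just a sketch of the principle:
 * learn a per-season-slot baseline, then flag large deviations.
 */
public class SeasonalAnomalyDetector {

    private final double[] mean;    // baseline mean per slot (e.g. per hour of day)
    private final double[] stdDev;  // baseline deviation per slot
    private final double threshold; // number of standard deviations tolerated

    public SeasonalAnomalyDetector(List<double[]> history, int slotsPerSeason, double threshold) {
        this.mean = new double[slotsPerSeason];
        this.stdDev = new double[slotsPerSeason];
        this.threshold = threshold;

        // history: one array of values per slot, gathered over several past seasons
        for (int slot = 0; slot < slotsPerSeason; slot++) {
            double[] values = history.get(slot);
            double sum = 0;
            for (double v : values) sum += v;
            mean[slot] = sum / values.length;

            double var = 0;
            for (double v : values) var += (v - mean[slot]) * (v - mean[slot]);
            stdDev[slot] = Math.sqrt(var / values.length);
        }
    }

    /** True if the value observed at the given slot deviates abnormally from the seasonal baseline. */
    public boolean isAnomalous(int slot, double value) {
        return Math.abs(value - mean[slot]) > threshold * stdDev[slot];
    }
}
```

With this kind of approach no static threshold is configured by hand: the “normal” range follows the seasonal shape of the metric, which is precisely why a sufficient history of seasonal data is required.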

All in all, this diagram sums up our ambitions:

Predictive vs. standard maintenance

Worldline context

After these first tests, we decided to think further:

  • how to make the use of analytics algorithms easy for everyone
  • how to integrate with our existing monitoring ecosystem
  • how to spread the benefits to the whole company

As an answer, we decided to design and develop a Proof of Concept (PoC) focused on predictive monitoring, in order to industrialize the use of these algorithms and make them accessible to everyone in our company.

Some other aspects were part of the reflection:

  • The time series database used to collect metrics is already provided by our production technicians, which allows us to reach a greater number of applications

  • The R framework is widespread in the data science community, and the algorithms provided by our data scientists are written in R

  • We are part of the Worldline Expert Network’s Labs dealing with monitoring subjects (we know this topic well and can reach other experts through the network)

A glance at the proposed architecture

Proof Of Concept Architecture

The modularity of the solution, built around an API, is intended to enable the addition of any:

  • sources of computation requests
  • results-consuming clients
  • sources of data

Each component of this architecture is built as a container image and integrated by our CI platform, allowing:

  • easier ‘external’ contributions (especially for the computation module and the predictive analytics algorithms it embeds, provided by our data scientists)
  • infrastructure-agnostic deployments

For our PoC experimentation, deployment is currently handled by a CaaS solution.

Technologies used:

  • Analysis framework: R
  • Orchestration: Docker Swarm
  • Messaging bus: RabbitMQ
  • API coding: Java, Spring Boot
  • Time series database: OpenTSDB
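
To give an idea of how these pieces fit together, here is a simplified sketch of the API entry point (hypothetical names and payload, not the actual PoC code): a Spring Boot controller receives a computation request from any client and publishes it on a RabbitMQ queue, from which the R computation workers consume.

```java
import java.io.Serializable;

import org.springframework.amqp.rabbit.core.RabbitTemplate;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RestController;

/**
 * Simplified sketch of the API entry point: any source of computation requests
 * (scheduler, monitoring tool, CI job...) can POST a request; the API only
 * accepts it and hands it over to the messaging bus for the computation workers.
 */
@RestController
public class ComputationRequestController {

    // Hypothetical queue name, consumed by the R-based computation module
    private static final String COMPUTATION_QUEUE = "predictive.computation.requests";

    private final RabbitTemplate rabbitTemplate;

    public ComputationRequestController(RabbitTemplate rabbitTemplate) {
        this.rabbitTemplate = rabbitTemplate;
    }

    @PostMapping("/computations")
    public String submit(@RequestBody ComputationRequest request) {
        // Publish the request to RabbitMQ; a worker fetches the metric history
        // from OpenTSDB and runs the predictive analytics on it.
        rabbitTemplate.convertAndSend(COMPUTATION_QUEUE, request);
        return "accepted";
    }

    /** Minimal request payload: which metric to analyse and over which history range. */
    public record ComputationRequest(String metric, String start, String end) implements Serializable {}
}
```

The same decoupling works on the output side: results go back through the bus (or a store), so any results-consuming client can be plugged in without touching the computation module.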

Our results so far

Performance as a function of CPU units

Benchmark

We ran benchmarks on the solution, with great overall results.

Giving detailed figures here wouldn’t be really relevant, since the metrics history range and downsampling, as well as the number of computation instances, have an impact on the results. But as a reference, we managed to reach 300 metrics computed per minute with 36 CPU units dedicated to computation.

The most interesting part was to:

  • establish the scalability potential of the solution (and how linear the ratio of CPU to metrics processed was)
  • produce a capacity planning that can be shared internally (a back-of-the-envelope illustration follows)
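
Assuming the scaling really is close to linear, the reference point above gives a simple rule of thumb: roughly 300 / 36 ≈ 8.3 metrics per minute per CPU unit. The helper below is a hypothetical illustration of that arithmetic, not a tool from the PoC.

```java
/**
 * Back-of-the-envelope capacity planning, assuming linear scaling and the
 * benchmark reference point of 300 metrics/minute for 36 CPU units
 * (about 8.3 metrics per minute per CPU unit).
 */
public class CapacityPlanning {

    private static final double METRICS_PER_MINUTE_PER_CPU = 300.0 / 36.0;

    /** CPU units needed to analyse the given number of metrics every minute. */
    static int cpuUnitsFor(int metricsPerMinute) {
        return (int) Math.ceil(metricsPerMinute / METRICS_PER_MINUTE_PER_CPU);
    }

    public static void main(String[] args) {
        // Analysing 1,000 metrics every minute would need about 120 CPU units
        System.out.println(cpuUnitsFor(1000) + " CPU units");
    }
}
```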

And as a final bonus, the PoC demonstrated the possibility of recycling obsolete pieces of hardware (the only prerequisite being the ability to run a Docker host).

Here follows an example of the detection of one anomalous datapoint (in blue) over a time series source:

Anomaly detected

What’s in sight?

We use the predictive monitoring tool on our own projects, on production and operational perimeters. At the same time, we are promoting this solution within our company to make it a product available to every collaborator.

By design, there are several ways to extend and improve this solution: new algorithms of course, but we are also considering adding other kinds of data sources, such as schema-less databases.

Despite these overall positive results, you should keep in mind that conventional monitoring remains mandatory; predictive monitoring is complementary to it.

One of the reasons is the false positives that can emerge with the current predictive analytics.

False positives are a typical topic in the data science field. This concern is indeed under study inside Worldline by Dalila Hattab’s teams (who provided us with the analytics algorithms) and could be covered in a future post.