Anomaly Detection

The Randoli anomaly detection feature is based on Prometheus. When investigating a workload, it can help you understand the workload's behavior and point you towards possible root causes.

You can set up anomaly detection for workloads with Prometheus queries based on:

  • Metrics directly related to the workload (e.g., memory usage, latency, throughput, number of errors, etc.)
  • Metrics from core infrastructure services your workload depends on (e.g., Kafka consumer lag, Postgres query latency, etc.)

When anomalies are detected, they show up in the Workload Overview tab and the Timeline tab (and in the Cluster Timeline if the scope is CLUSTER).

[Screenshot: Timeline Overview]

The feature provides:

  • Built-in anomaly detection
  • The ability to extend detection via custom Prometheus queries

Built-in Anomaly Detection

The Randoli Platform provides out-of-the-box anomaly detection by identifying spikes in the following metrics:

  • Pod memory usage
  • Postgres - query latency and number of active connections
  • Java - heap memory usage
  • Golang - heap memory usage
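
The exact queries behind these detections are internal to the platform, but as a rough illustration of what a "spike" means here, a pod memory spike could be expressed as a Prometheus rule along the following lines. This is only a sketch, not the query Randoli actually uses:

```yaml
# Illustration only -- NOT Randoli's internal query.
# Flags pods whose working-set memory runs well above their recent average.
groups:
  - name: memory-spike-illustration
    rules:
      - alert: PodMemorySpike
        expr: |
          container_memory_working_set_bytes{container!=""}
            > 1.5 * avg_over_time(container_memory_working_set_bytes{container!=""}[1h])
        for: 5m
```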
tip

You can turn off the built-in detection by setting annomalyDetection.defaultQueries.enabled to false.
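
For example, if you installed the agent via Helm, the override might look like the following sketch. The key path is taken verbatim from the tip above; check your install documentation for the exact chart structure:

```yaml
# values.yaml sketch -- assumes a Helm-based agent install
annomalyDetection:
  defaultQueries:
    enabled: false
```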

The following pre-conditions need to be met:

  1. You have configured the Prometheus endpoint during the agent install.
  2. Your workloads are publishing metrics to Prometheus.
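
For the second pre-condition, how your workloads publish metrics depends on your Prometheus setup. One common community pattern, which only works if your Prometheus scrape configuration discovers targets via these annotations, is to annotate the pod template. The port and path below are placeholders for your workload's actual metrics endpoint:

```yaml
# Pod template annotations -- a common convention, effective only if your
# Prometheus scrape config honors prometheus.io/* annotations.
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"      # placeholder: port serving metrics
    prometheus.io/path: "/metrics"  # placeholder: metrics endpoint path
```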

Extend Anomaly Detection via Custom Prometheus Queries

This extension provides a powerful and flexible mechanism for setting up anomaly detection based on your workload's unique characteristics.

info

Please get in touch with our customer support team via the Support Portal for more information.

Here are some possible examples; a configuration sketch follows the list.

  1. Workload A reads from Kafka topic topic-a, processes the messages, and writes to the database. You can configure a Prometheus query to detect spikes in consumer lag for topic-a. This can highlight a slowdown in consuming messages from the topic, which could be due to a number of reasons:

    • The workload is experiencing memory issues, which may be highlighted by an increase in memory consumption.
    • Database operations are taking longer. A spike in Postgres query latency or in the number of active connections may provide some clues.
  2. Microservice A publishes response times as a custom metric. The service fetches data from Redis and, if it is not available there, from the database. You might want to understand when response times increase beyond normal and what the possible causes are:

    • You can set up detection for spikes in response times.
    • You can also set up detection for a drop in the cache hit ratio, indicating that more requests are going to the database because the data is not in the cache.
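
The configuration schema for custom queries is available from the support team (see the note above). Purely as an illustration, the PromQL involved in the two examples might look like this, assuming the community kafka-exporter and redis_exporter metric names; the surrounding customQueries structure is hypothetical:

```yaml
# Hypothetical structure -- the real schema is available from Randoli support.
customQueries:
  - name: topic-a-consumer-lag
    # Example 1: consumer lag for topic-a (kafka-exporter metric)
    query: sum(kafka_consumergroup_lag{topic="topic-a"})
  - name: redis-cache-hit-ratio
    # Example 2: cache hit ratio = hits / (hits + misses)
    # (redis_exporter metrics)
    query: |
      rate(redis_keyspace_hits_total[5m])
        / (rate(redis_keyspace_hits_total[5m]) + rate(redis_keyspace_misses_total[5m]))
```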