Anomaly Detection in Payment Systems

6 min read · Feb 25, 2024

Anomaly detection is an important, well-studied problem across many research areas and application domains. Anomaly detection models provide crucial information both about the fidelity of the data and about deviations in the process that generates it. A key concept in detecting anomalies is “robustness”: designing models and representations that are less sensitive to small changes in the distribution of the underlying data. These techniques are widely applied in fraud detection, cybersecurity, finance, healthcare and diagnosis, video surveillance, and drug discovery.

“An anomaly is an observation or a sequence of observations which deviates remarkably from the general distribution of data. The set of the anomalies form a very small part of the dataset”, as described by Braei and Wagner (2020).

Success Rate %

Uptime in payment systems is measured directly by the success rate: the number of successful transactions divided by the total number of transactions in a given interval. The problem arises when users start facing failed transactions, which is a really bad user experience. Imagine you’re trying to book a movie ticket at the last moment and your payment doesn’t go through :(

This issue becomes really critical in time-sensitive situations. Below is a great example of how a transaction flows in a payment system.

Example flow of a transaction initiated on Swiggy with Razorpay as a payment gateway
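To make the success rate metric concrete, here is a minimal sketch of how it could be computed per time interval. The function name `success_rate_by_interval`, the `(timestamp, succeeded)` input shape, and the sample data are all assumptions made for illustration; they are not taken from any real payment system.

```python
from collections import defaultdict


def success_rate_by_interval(transactions, interval_seconds=60):
    """Return {interval_start: success rate in %} for (timestamp, succeeded) pairs."""
    totals = defaultdict(int)
    successes = defaultdict(int)
    for timestamp, succeeded in transactions:
        # Assign each transaction to the interval that contains its timestamp.
        bucket = int(timestamp // interval_seconds) * interval_seconds
        totals[bucket] += 1
        if succeeded:
            successes[bucket] += 1
    # Success rate % = successful transactions / total transactions * 100.
    return {b: 100.0 * successes[b] / totals[b] for b in sorted(totals)}


# Hypothetical sample: (seconds since start, did the payment succeed?)
sample = [(0, True), (10, True), (20, False), (30, True), (65, True), (70, False)]
rates = success_rate_by_interval(sample, interval_seconds=60)
print(rates)
```

With this sample, the first minute has 3 successes out of 4 transactions and the second minute has 1 out of 2. The resulting series of per-interval rates is what an anomaly detector would monitor.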

In the context of payment systems, a drop in success rate % can be termed an anomaly. These anomalies can be very tricky to track: they can be spikes of short duration, as shown in the figure below; there can be a trend shift in the time series; or they can be contextual, caused by some known confounder in the system.

the y-axis is the time series of success rate % and the x-axis is the time in seconds

Can we identify anomalies using machine learning algorithms?

Here is a very high-level abstraction of how we can formulate an anomaly detection problem, taken from the KDD workshop on anomaly detection.

First, we learn how the underlying data is structured using either forecasting, clustering, or deep learning techniques.
Second, we detect the anomaly using an anomaly score from the learned distribution.
Third, we apply a decision threshold to alert systems or take precautionary actions.
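The three steps above can be sketched with a deliberately simple stand-in model. This is not the method used in this post: it swaps in a trailing median and median absolute deviation (MAD) as the "learned" structure, and the function names, window size, threshold, and `min_mad` floor are all hypothetical choices for illustration.

```python
import statistics


def drop_scores(series, window=5, min_mad=0.5):
    """Score each point by how far it drops below its trailing-window median."""
    scores = []
    for i, value in enumerate(series):
        history = series[max(0, i - window):i]
        if len(history) < window:
            scores.append(0.0)  # not enough history yet to judge this point
            continue
        # Step 1: learn what "normal" looks like from recent history.
        baseline = statistics.median(history)
        mad = statistics.median([abs(x - baseline) for x in history])
        # Step 2: anomaly score = size of the drop in units of MAD.
        # min_mad (an assumed floor) avoids dividing by zero on flat history.
        scores.append((baseline - value) / max(mad, min_mad))
    return scores


def flag_drops(series, window=5, threshold=5.0, min_mad=0.5):
    """Step 3: apply a decision threshold to the anomaly scores."""
    scores = drop_scores(series, window, min_mad)
    return [i for i, score in enumerate(scores) if score > threshold]


# Hypothetical success rate % series with one sharp drop at index 6.
success_rate = [95, 96, 94, 95, 96, 95, 60, 95, 94, 96]
flagged = flag_drops(success_rate)
print(flagged)
```

Using the median and MAD rather than the mean and standard deviation keeps the baseline from being dragged down by the anomaly itself, which is one simple way to get the robustness mentioned earlier.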

Estimating the model parameters can be a daunting task; how hard it is depends on the structure of the underlying data and on the modeling technique being used. It can be an integrated approach, where both the anomaly score and the threshold parameter are learned together, or it can be 1-phase, 2-phase, or non-parametric, as in the case of clustering algorithms.

Quantile regression for Anomaly Detection

In regression tasks, we do not always want only a single accurate prediction. In fact, a point prediction is never exactly right, so instead of chasing absolute precision we sometimes need a prediction interval, and that is where quantile regression comes in. A typical regression aims to fit the mean of the distribution: we approximate the conditional mean of the response variable y given certain values of the predictor variable X. In this context, the objective is to minimize the sum of the squared errors, sum_i (y_i - yhat_i)^2.

*y_i* is the ith value of the variable to be predicted, and yhat_i is the predicted value of y_i.

Loss Function

The major difference between quantile regression and ordinary regression lies in the loss function, which is called pinball loss or quantile loss. There is a good explanation of pinball loss here. In quantile regression, we want to make a set of predictions Q so that, for a given quantile q, a fraction q of the true values are less than Q. In this case, we try to minimize the following loss function, written here in its standard form for a true value y and a prediction Q: L_q(y, Q) = q * max(y - Q, 0) + (1 - q) * max(Q - y, 0).

The special case q = 0.5 corresponds to median regression, where the cost function is the absolute deviation (up to a constant factor). For other values of q the penalty is asymmetric: for a high quantile, over-predictions are punished less than under-predictions, so the loss function encourages higher predicted values, and vice versa for a low quantile.

*Cost function for linear regression and quantile regression for different values of q*
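To see the asymmetry of the pinball loss in action, here is a small sketch in plain Python. It implements the standard pinball loss and then searches for the constant prediction that minimizes it on a toy dataset. The helper `best_constant` and the data are assumptions for illustration only; a real model would predict a different value for each input.

```python
def pinball_loss(y_true, y_pred, q):
    """Mean pinball (quantile) loss for a quantile q in (0, 1)."""
    total = 0.0
    for y, pred in zip(y_true, y_pred):
        error = y - pred
        # Under-prediction (error >= 0) costs q per unit of error;
        # over-prediction (error < 0) costs (1 - q) per unit of error.
        total += q * error if error >= 0 else (q - 1) * error
    return total / len(y_true)


# Hypothetical targets: the integers 1..99.
values = list(range(1, 100))


def best_constant(q):
    """Constant prediction, chosen from the data, with the lowest pinball loss."""
    return min(values, key=lambda c: pinball_loss(values, [c] * len(values), q))


low, median, high = best_constant(0.1), best_constant(0.5), best_constant(0.9)
print(low, median, high)
```

On this toy data the minimizers come out at 10, 50, and 90, which are the 10th, 50th, and 90th percentiles of the targets. That is the sense in which minimizing pinball loss yields a quantile rather than a mean.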

After training, below is the distribution of inferred scores for a time series of transactions. For intuition, the predicted anomaly score in a time interval represents the probability of a transaction being successful.

Observe the left-skewed distribution of the non-anomalous data, with a mean score of 0.82; it is also magnitudes higher than that of the anomalous ones.

y-axis is the count of anomaly (Red) and non-anomaly (Blue) and x-axis is the anomaly score

The anomalous intervals follow a Gaussian distribution with a mean score of 0.56. You can choose a decision threshold based on your tolerance for taking action to mitigate the anomalies.

the y-axis is the count of anomaly (Red) and the x-axis is the anomaly score
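One possible way to turn that tolerance into a threshold is sketched below: pick the largest threshold whose false-alarm rate on non-anomalous intervals stays within a budget. This is an assumption about how one might do it, not the procedure used in this post; `pick_threshold`, the tolerance value, and the score lists are hypothetical. Since the score represents the probability of success, an interval is flagged when its score falls below the threshold.

```python
def pick_threshold(normal_scores, anomalous_scores, max_false_alarm_rate=0.1):
    """Return (threshold, false_alarm_rate, recall) for the largest threshold
    whose false-alarm rate stays within the tolerance, or None if none does.

    An interval is flagged as anomalous when its score is below the threshold.
    """
    best = None
    # Only observed scores need to be tried as candidate thresholds.
    for t in sorted(set(normal_scores) | set(anomalous_scores)):
        false_alarm_rate = sum(s < t for s in normal_scores) / len(normal_scores)
        if false_alarm_rate <= max_false_alarm_rate:
            recall = sum(s < t for s in anomalous_scores) / len(anomalous_scores)
            # Candidates are visited in ascending order, so the last one
            # that fits the tolerance is the largest.
            best = (t, false_alarm_rate, recall)
    return best


# Hypothetical scores, loosely shaped like the two histograms above.
normal = [0.70, 0.78, 0.80, 0.82, 0.84, 0.85, 0.86, 0.88, 0.90, 0.92]
anomalous = [0.45, 0.52, 0.56, 0.60, 0.72]

threshold, false_alarm_rate, recall = pick_threshold(normal, anomalous)
print(threshold, false_alarm_rate, recall)
```

A stricter tolerance pushes the threshold down and misses more anomalies, while a looser one catches more at the cost of extra alerts. Where to sit on that trade-off depends on how expensive a false alert is compared with a missed drop.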

Fortunately, the powerful LightGBM has made quantile prediction really easy; one can follow this tutorial to implement the same.

Check out this amazing video by Datadog as well: Link

How we took it to production!

Let's keep that for another blog, but let me shed some light on what’s required to keep a production service running.

Service uptime in machine learning systems is just the tip of the iceberg. It depends on two major components: data health and model health. The model can give wrong outputs, or the entire system can go down, if there is an issue in either of them.

The system design of such machine learning systems requires keeping these abstractions in mind, along with robust CI/CD pipelines for 100% availability and agile development.