Anomaly Detection at Scale

Sachin Bansal
3 min read · Apr 14, 2021

Why anomaly detection at scale is hard, expensive, and noisy.


Say you work for an online retailer. Your store sells 1000 products. You want to run anomaly detection on daily orders for each of these 1000 products. This means the following:

Number of Metrics = 1 (Orders)
Number of Dimension Values = 1000 (1000 products)
Number of Metric Combinations = 1000 (1 metric * 1000 dimension values)

This means the anomaly detection algorithm runs 1000 times every day, once for each metric combination.

Let’s say you want to monitor Orders by another dimension — State (50 unique values). You also want to monitor the metric by the combinations of these two dimensions. This means the anomaly detection algorithm runs 51,050 times every day.

51,050 = (1 metric * 1000 products) + (1 metric * 50 states) + (1 metric * 1000 products * 50 states)
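If you want to sanity-check that number, here is a minimal Python sketch (using the product and state counts from the example above) that adds up the metric combinations across every non-empty subset of dimensions:

```python
from itertools import combinations
from math import prod

# Cardinality of each dimension for the Orders metric (from the example above)
dimension_cardinalities = {"product": 1000, "state": 50}

def count_combinations(cardinalities: dict) -> int:
    """Count metric combinations across every non-empty subset of dimensions."""
    dims = list(cardinalities.values())
    total = 0
    for r in range(1, len(dims) + 1):
        for subset in combinations(dims, r):
            total += prod(subset)  # one time series per distinct dimension-value tuple
    return total

print(count_combinations(dimension_cardinalities))  # 1000 + 50 + 1000*50 = 51050
```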

Now let’s say you want to monitor tens of metrics. Each metric has tens of dimensions. Each dimension has hundreds or thousands of dimension values. You are now looking at tens or hundreds of millions of metric combinations to be monitored on a daily basis.

Anomaly Detection at Scale is Expensive

To get a sense of infrastructure cost, look at the pricing of the AWS ML-powered Anomaly Detection service. Per its pricing example, AWS will charge you $1,900 per month to track 221K metric combinations on a daily basis.

This is just the charge for running the algorithm. Before each run, you also need to run a query to pull data from your data warehouse. Millions of metric combinations mean millions of queries against your data warehouse every day, which adds further cost.
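To see where that query cost comes from, here is a rough sketch of a naive daily pipeline. The orders table, the warehouse client, and the detect_anomaly function are placeholders I made up for illustration, not any particular product's API:

```python
from itertools import product

products = [f"product_{i}" for i in range(1000)]  # 1000 products
states = [f"state_{i}" for i in range(50)]        # 50 states

# Hypothetical warehouse query: daily order counts for one (product, state) pair
QUERY_TEMPLATE = """
SELECT order_date, COUNT(*) AS orders
FROM orders
WHERE product_id = '{product}' AND state = '{state}'
GROUP BY order_date
ORDER BY order_date
"""

def run_daily_detection(warehouse, detect_anomaly):
    """Naive loop: one warehouse query and one detector run per metric combination."""
    for product_id, state in product(products, states):  # 50,000 combinations
        series = warehouse.execute(QUERY_TEMPLATE.format(product=product_id, state=state))
        detect_anomaly(series)
```

Even this toy version issues 50,000 warehouse queries per day for a single metric sliced by just two dimensions; with tens of metrics and dimensions, the query count climbs into the millions.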

And this is just the cloud infrastructure cost.

Anomaly Detection at Scale is Noisy

Let’s go back to the example at the start. You are monitoring the Daily Orders metric by two dimensions — product and state. Each of these 51K metric combinations has the potential to be an anomaly.

Even if just 0.1% of these combinations turn out to be anomalies, you are looking at 51 anomalies every day, for a single metric. Will you or your team have the bandwidth to take action on so many anomalies every single day? Probably not. And if you don’t act on them, they add no business value. They just add to your cost.

Anomaly Detection at Scale is Hard

Now you probably agree with me that anomaly detection at scale is expensive and noisy. But there’s a solution to this problem, and that’s what we are building at Cuebook. It was a hard problem, but lots of fun to work on. I won’t share the full solution here because it’s complex and cuts across too many domains — querying, data storage, cloud infrastructure, user experience, etc.

At a very high level, these are some of the problems we solved and the approaches we took:

  1. We started with consumption of anomalies, rather than production. If an end user doesn’t consume an anomaly, then that anomaly is of no business value.
  2. To enable consumption, we must ensure an anomaly is relevant and significant to the user.
  3. We cannot expect the user to explicitly tell us what is relevant and significant to her. What is relevant and significant today might not remain so 2 months later.
  4. We also cannot expect the analyst or the developer to configure anomalies for the user. They won’t be able to. One user might only be interested in California as a state while another user might be interested in Texas and New York.
  5. Anomaly detection is not a technical or an ML algorithm problem. It’s a product and a user experience problem.
  6. Boiling the ocean of data will need lots of fuel, irrespective of the type of fuel. Boiling the ocean only benefits the cloud infrastructure providers.
