Network devices overheat monitoring

Problem to solve

Monitoring our network devices is necessary to ensure a good quality of service for our clients. In particular, avoiding any overheat safety shutdowns is critical, even though we ensure failure resilience in our infrastructure with device redundancy. In the best scenario, if a device fails, its twin is not impacted: it is able to handle the full load and the redundancy is operational. In such a case, the twin device can properly make up for the temporary loss of the first one. Even if the infrastructure is temporarily at risk, end users are not impacted, and this is what matters.


The problem is that two twin devices have often been put in service at the same time: in other words, their hardware cooling factors may be identical (fan obstruction, thermal paste efficiency loss, dust…). In addition, since they are often located in close proximity to one another, their environment is often identical (surrounding air temperature). So, if the first one gets too hot, there is a good chance that, once loaded with the additional traffic, its twin will overheat too. Even though we have redundancy, overheating is definitely something we want to keep an eye on.

This post outlines the way we monitor our network devices and avoid overheat throttles and/or shutdowns.

Facts

  • 21000+: the number of network devices we have in our datacenters
  • 150+: the number of distinct vendor models – to avoid any confusion with machine learning models, from here on we won’t talk about ‘models’; instead, we will use the term ‘device series’ to refer to vendor models.
  • 4000: the number of distinct sensors – each device series comes with its own set of supported sensors, essentially meant to monitor the electrical power supply and the temperature at strategic places/components in the device (CPU core temperature, outgoing airflow temperature…).
Huge number and variety of devices to monitor

How to find a unified way to monitor all these different devices?

For any given series, its associated vendor provides hard-coded sensor thresholds (sometimes thresholds can be configured by a network administrator). These thresholds trigger alerts, which can be categorized into two types:

  • ‘Soft’ thresholds: when reached, an event is recorded in the device log, with no further hardware safety measures taken. This can be seen as a warning that your device’s temperature is not yet critical but abnormally high.
  • ‘Hard’ thresholds: when reached, hardware safety measures are triggered. For example, on your home computer, if your CPU gets too hot, it will be throttled, meaning its frequency will be reduced in an attempt to decrease its temperature. For a network device, on the other hand, the action is often not a CPU throttle but a full shutdown. There are two reasons for such a drastic safety measure:
    1. Cost of a single device: better safe than sorry
    2. Redundancy: better to power off a device completely and rely on its twin to take over rather than throttling its CPU and operating in a degraded state

So a question arises: why not stick to the aforementioned vendor thresholds and wait for the soft thresholds to trigger before taking action?

This is how we once monitored our devices. And actually, we still keep an eye on these soft/hard alerts – vendors know their hardware best, so their thresholds should be carefully watched.

But it turned out this was not enough, mainly because if you just wait for these vendor alerts to trigger, you can end up in the following situation: you’ve had no alerts in months, and then, on a particularly hot day, you get all the alarms at once, because your environment is hotter than usual. So once vendor alerts are triggered, it’s actually too late. Datacenter operators become overwhelmed and have no way to prioritize alerts, because among all the alerts, you do not always know which ones are the most critical and will actually result in safety shutdowns. This is for the soft thresholds. For the hard thresholds, you have roughly 30 seconds to intervene before safety measures are taken.

To avoid being overwhelmed by alerts on hot days, we wanted a couple of things:

  • Crash prediction model: during disaster situations (hot days), we needed to be able to better forecast crashes/safety shutdowns, to give datacenter operators more time to intervene and to provide them with a way to sort alerts according to the device crash risk they represent. We mined our device sensor data and crash records to find out which vendor series were most sensitive to overheating and which sensors (or sensor combinations) we could use to predict crashes, and we built a supervised machine learning model from the crash examples we had, which learned to predict crashes up to 2 hours before they occurred – this left datacenter operators with enough time to intervene in such emergency cases.
  • Preventive planned maintenance: this is a continuous effort. The purpose is to make sure we detect and continuously maintain devices that could cause us trouble on hot days, so that we do not become overwhelmed. To achieve this, we built an unsupervised conditional outlier detection model, to learn how to detect devices which operate abnormally hot given certain relevant factors (environment temperature, load…).

Let’s take a look at both points in more detail.

Available data

[Table 1 – sensor records per device, Table 2 – device details, Table 3 – environmental data per location]

As you can see in Table 1, for each device we get a multivariate and sparse time series of sensor records. The resulting matrix is sparse because, for any given device, only a small subset of the 4000 possible sensors is supported.

Alongside these time series, we have device details (Table 2), allowing us to retrieve each device’s vendor series and location.

Finally, we get environmental data (temperature, humidity…) time series per location in Table 3, which we can join to the first time series thanks to the device details association table.
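To make this concrete, here is a minimal sketch of how the three tables could be joined with pandas. The file and column names (device_id, timestamp, sensor_name, location…) are assumptions for illustration; the actual schemas are not shown in this post.

```python
import pandas as pd

# Hypothetical schemas, for illustration only.
sensors = pd.read_parquet("sensor_records.parquet")    # device_id, timestamp, sensor_name, value
devices = pd.read_parquet("device_details.parquet")    # device_id, vendor_series, location
environment = pd.read_parquet("environment.parquet")   # location, timestamp, ambient_temp, humidity

# Attach the vendor series and location to every sensor record (Table 2),
# then join the environmental time series (Table 3) on (location, timestamp).
df = (
    sensors
    .merge(devices, on="device_id", how="left")
    .merge(environment, on=["location", "timestamp"], how="left")
)
```

In practice, pd.merge_asof would be a better fit if the sensor and environmental timestamps are not perfectly aligned.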

Device series clustering

From the first table, we can see several possible approaches for our crash prediction. We could keep the raw data (high-dimensional and sparse time series) and build one machine learning model to handle the more than 150 device series we have.

By doing so, we would have to face the curse of dimensionality:

  • Heavy compute and memory cost
  • Overfitting risk: for some device series, we only have a few devices, with no crash example at all. As this is not statistically significant, the model might just learn to predict that no crash will occur whenever it detects such a series signature (from the supported sensor set, for example). We may prefer to ignore these devices for now, since we have no positive examples for them. In addition, keeping too many features relative to the number of distinct samples also increases the overfitting risk.

At the other extreme, instead of building one single model to rule them all, we could build one crash prediction model per device series. This would lead to more than 150 distinct models to maintain in production, which is unaffordable.

So, we tried to cluster device series by similarity (defined below), making a trade-off between the two extremes mentioned above.

We tried the following approaches:

  • For each sensor of each device series, we started by estimating the 10%, 30%, 50%, 70% and 90% quantiles of the values taken by the sensor, and used these quantiles to roughly describe the shape of its scaled distribution. Using these features, we could compute distances (Euclidean or cosine similarity), roughly a distribution similarity, and cluster (series, sensor) tuples accordingly. We quickly gave up on this approach because, as previously mentioned, some device series had too few samples to compute relevant quantiles and describe the sensor distribution properly. Furthermore, we could tell, from their supported sensor name sets, that some of these series were close to other device series with far more samples.
  • Using the observation made in the previous approach (some device series appearing to be close to others based on their supported sensor sets), we decided to cluster sensors based on their names: we leveraged the assumption that close device series have close hardware/firmware components and therefore close sensor names for the same sensor function. To do so, we grouped sensors using a DBScan algorithm coupled with the Levenshtein distance metric.

Recursive definition of the Levenshtein distance (Wikipedia):

lev(a, b) =
    |a|                                    if |b| = 0
    |b|                                    if |a| = 0
    lev(tail(a), tail(b))                  if a[0] = b[0]
    1 + min( lev(tail(a), b),
             lev(a, tail(b)),
             lev(tail(a), tail(b)) )       otherwise

where |x| is the length of string x, x[0] is its first character and tail(x) is x without its first character.
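As an illustration, here is a minimal sketch of the sensor name clustering described above, using scikit-learn’s DBSCAN on a precomputed Levenshtein distance matrix. The sensor names and the eps/min_samples values are made up, and the post does not say which Levenshtein implementation was used (the python-Levenshtein package is one option).

```python
import numpy as np
import Levenshtein                      # pip install python-Levenshtein
from sklearn.cluster import DBSCAN

# Illustrative sensor names; the real list holds ~4000 entries.
sensor_names = ["CPU Core Temp", "CPU1 Core Temp", "Outlet Flow Temp", "Inlet Temp", "PSU Voltage"]

# Pairwise Levenshtein distance matrix between sensor names.
n = len(sensor_names)
dist = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        dist[i, j] = dist[j, i] = Levenshtein.distance(sensor_names[i], sensor_names[j])

# DBSCAN over the precomputed distances; eps and min_samples need tuning.
sensor_group = DBSCAN(eps=3, min_samples=2, metric="precomputed").fit_predict(dist)
```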

This gave us sensor groups like the ones described in the table below:

[Table: sensor groups obtained by clustering sensor names]

Then, for every device series, we built the subset of supported sensor groups and clustered the device series using DBScan.
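A possible sketch of this second clustering step, assuming each series is represented by the set of sensor groups it supports and using a Jaccard distance between these sets (the post only mentions DBScan, so the distance choice here is an assumption):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical mapping: vendor series -> set of supported sensor groups
# (the group ids come from the sensor name clustering above).
series_to_groups = {
    "series_A": {0, 1, 4},
    "series_B": {0, 1, 4, 7},
    "series_C": {2, 3},
}
series = list(series_to_groups)

def jaccard_distance(a, b):
    return 1.0 - len(a & b) / len(a | b)

n = len(series)
dist = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        d = jaccard_distance(series_to_groups[series[i]], series_to_groups[series[j]])
        dist[i, j] = dist[j, i] = d

series_cluster = DBSCAN(eps=0.4, min_samples=2, metric="precomputed").fit_predict(dist)
```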

[Figure: device series clusters built from their supported sensor groups]

Then using these computed clusters, we split the original dataset:

[Figure: the original dataset split into per-cluster diagonal blocks]

Instead of building one huge model, we built smaller models whose data scope became diagonal blocks of the original matrix. We obtained the trade-off we were looking for, overcoming the sparsity and the curse of dimensionality while keeping the total number of distinct machine learning models to maintain under control.

Once we have defined our device series clusters, we can focus on each cluster for which we have crash examples, to build our supervised crash prediction models.

2-hour crash prediction

Collecting the ground truth

Collecting the ground truth (i.e. labeling your data) is a challenge that is often ignored when talking about supervised learning models. Yet, having reliable labels is critical to obtain a good dataset and subsequently a correct classifier. In many supervised machine learning examples, the labeled dataset is already available, so the most tedious part, which consists in collecting and labeling data, is often understated.

The first challenge we encountered was to gather the crashed devices along with their timestamps from our data history. For every single network device crash, we have a postmortem and a crash report. However, it is an unstructured text blob. We would need to extract the device identifier, date, and crash reason from the text. In order to do so, we would need to build a model specialised in Named Entity Recognition (NER) on these specific blobs. Not only would the effort have been quite huge, especially for people who are not experts in Natural Language Processing, it would also likely have failed at being a robust way to label our data, given the small number of reports our NER model would have been trained on.

Given that we only had around one hundred positive examples or fewer, manually extracting positive samples was definitely an option we considered, and not so tedious (a few hours of effort). The drawback was that we would then have had no way to automatically extract and refresh the data later.

Fortunately, thanks to our sensor data, we also had the device uptime at our disposal. So, we designed the following workflow to automatically build our labelled dataset:

Workflow to automatically build dataset

First, using the uptime data we had at hand for our network devices, we could quickly see whether a device had rebooted. If not, it had not crashed (label: 0). If it had, it did not necessarily reboot because of overheating (a planned maintenance is more likely). We then looked at the sensors, more specifically we looked for device outliers by examining sensors individually. Note that this may seem surprising to data scientists, and incorrect at first glance (because you then only look for outliers along your predefined axes/features, not in the whole vector space). But this saved us some computing time and was acceptable in this specific case, under the assumption that most vendors implement their hardware safety measures against individual sensor thresholds. If the device was not an outlier for any of the temperature sensors, no conclusion was drawn yet (label: -1). If the record was an outlier, it was pushed further along the labeling workflow. Finally, if the device had rebooted along with many other devices, it was most likely a planned maintenance reboot and no conclusion was drawn (label: -1). Otherwise, the sample was labeled as a crash (label: 1).

We ended up with an intermediate, semi-labeled dataset. To finish labeling it, we needed a strategy to fill in the -1 labels (choose between 0 and 1). We opted for a trivial label spreading strategy, which consisted in filling the -1 labels with 0. We stuck to this strategy as it gave us decent enough results, as can be seen below.
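A minimal sketch of this labeling logic, with hypothetical flags summarising the checks described above (uptime drop, per-sensor temperature outlier, simultaneous reboots of many devices):

```python
def label_record(rebooted, temp_outlier, mass_reboot):
    """Return 1 (overheat crash), 0 (no crash) or -1 (no conclusion) for one device/time window."""
    if not rebooted:
        return 0        # no reboot -> no crash
    if not temp_outlier:
        return -1       # rebooted, but no temperature sensor looked abnormal -> no conclusion
    if mass_reboot:
        return -1       # many devices rebooted together -> likely planned maintenance
    return 1            # isolated reboot of an abnormally hot device -> overheat crash

def spread_labels(labels):
    """Trivial label spreading: fill the undecided labels (-1) with 0."""
    return [0 if label == -1 else label for label in labels]
```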

Now we have a labeled dataset. We also have a way to automatically detect overheat crashes, to refresh our data whenever needed and, perhaps even more importantly, to detect crashes missed by our crash prediction models once in production.

Since we want to build a 2-hour crash prediction model, for each positive sample we spread the positive label across the timeline, onto the records covering the 2 hours preceding the device’s crash.
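For example, with pandas, the positive label can be spread over the records of the 2 hours preceding each crash along these lines (the column names device_id, timestamp and crash_ts are assumptions):

```python
import pandas as pd

def label_positive_window(records, crashes, horizon=pd.Timedelta(hours=2)):
    """records: one row per (device_id, timestamp); crashes: one row per (device_id, crash_ts)."""
    records = records.copy()
    records["label"] = 0
    for device_id, crash_ts in crashes.itertuples(index=False):
        mask = (
            (records["device_id"] == device_id)
            & (records["timestamp"] >= crash_ts - horizon)
            & (records["timestamp"] < crash_ts)
        )
        records.loc[mask, "label"] = 1
    return records
```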

Now our data looks like this:

[Table: labeled dataset, with positive labels spread over the 2 hours preceding each crash]

Note that rather than applying sequence approaches (like RNNs), we just kept our tabular data format, using sensor records from one or several previous time steps as additional features.
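A sketch of such lag features on a pandas DataFrame, shifted per device so that one device’s history never leaks into another’s (column names are hypothetical):

```python
def add_lag_features(df, sensor_cols, n_lags=3):
    """df: pandas DataFrame with device_id, timestamp and sensor columns."""
    df = df.sort_values(["device_id", "timestamp"]).copy()
    for col in sensor_cols:
        for lag in range(1, n_lags + 1):
            # Previous readings of the same device become extra tabular features.
            df[f"{col}_lag{lag}"] = df.groupby("device_id")[col].shift(lag)
    return df
```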

Undersampling the negative class in the training set

The obtained dataset is highly imbalanced (positive ratio on the order of 10⁻⁵). SMOTE was considered to oversample the positive class. As usual, we started with an even simpler approach: we undersampled the negative class (on the training set only, not the test set!). We kept all the records of devices that had a positive record, and completed the negative class with randomly selected records from devices with no crashes, so that we obtained a positive ratio of around 0.01 on the training set.

Then we trained a classifier on this labeled and undersampled data.

In this case, a random forest.

[Figure: famous decision tree]
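A minimal sketch of this undersampling plus training step, under the assumption that the labelled records live in a DataFrame train_df with device_id, timestamp, label and sensor feature columns (all names hypothetical):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def undersample_negatives(train, target_positive_ratio=0.01, seed=0):
    """Keep every record of devices that crashed; subsample records of devices that never crashed."""
    crashed = train.loc[train["label"] == 1, "device_id"].unique()
    kept = train[train["device_id"].isin(crashed)]
    others = train[~train["device_id"].isin(crashed)]

    n_pos = int((kept["label"] == 1).sum())
    n_extra = max(0, int(n_pos / target_positive_ratio) - len(kept))
    others = others.sample(n=min(n_extra, len(others)), random_state=seed)
    return pd.concat([kept, others]).sample(frac=1, random_state=seed)   # shuffle

train_balanced = undersample_negatives(train_df)
feature_cols = [c for c in train_balanced.columns if c not in ("label", "device_id", "timestamp")]

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(train_balanced[feature_cols], train_balanced["label"])
```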

Evaluation on test set

Finally, we evaluated it on the still imbalanced test set. Here is an overview of the metrics obtained for a classifier related to a specific group of series. We won’t give the exact vendor series, but keep in mind that they were particularly vulnerable to overheat issues:

Metric                 Score
ROC AUC                0.999
PR AUC                 0.230
Accuracy               0.999
Precision              0.040
Recall                 1.000
False positive rate    0.001

The almost perfect ROC AUC and accuracy values should not be given much consideration, given the highly imbalanced nature of the dataset. Instead, we focused on the positive class: the PR AUC may seem bad at first glance, only 0.23, but keep in mind that a random/untrained classifier would have obtained the positive ratio of the test set, here about 10⁻⁵, which means our classifier gained 4 orders of magnitude! Not bad. Our recall was excellent in this case, but at the cost of a high false positive rate and a low precision (only 4%).
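For reference, these metrics can be computed with scikit-learn on the untouched (still imbalanced) test set; clf, X_test and y_test below are the hypothetical names from the sketch above:

```python
from sklearn.metrics import (accuracy_score, average_precision_score,
                             precision_score, recall_score, roc_auc_score)

proba = clf.predict_proba(X_test)[:, 1]
pred = (proba >= 0.5).astype(int)

print("ROC AUC  ", roc_auc_score(y_test, proba))
print("PR AUC   ", average_precision_score(y_test, proba))  # ~ positive ratio for a random classifier
print("Accuracy ", accuracy_score(y_test, pred))
print("Precision", precision_score(y_test, pred))
print("Recall   ", recall_score(y_test, pred))
```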

Still, in our case we prefer having a good recall at the cost of a high false positive rate. As said above, the solution is viable if the number of raised alerts:

  1. provides enough time for datacenter operators to intervene
  2. does not overwhelm datacenter operators, especially on hot days

Real time simulation

Rather than focusing on these metrics on our test set, we took our freshly built model and used it for a monitoring simulation in our datacenters (on a different day than the one we had used to train it, of course). We conducted the simulation on 2019-06-29, a particularly hot day, on which we encountered many network device crashes in our datacenters for the considered device series. Here are the simulation results (we lowered the positive detection threshold from 0.5 to 0.4 in this simulation; any device crash predicted less than 30 minutes before it occurs is considered missed):

KPI                                          Value
Missed                                       0
Detected                                     10
Alerts                                       33
Precision                                    0.3
Recall                                       1
Proactivity mean (hours)                     1.7
Proactivity min (hours)                      0.5
Proactivity std (hours)                      0.85
Mean alerts per hour                         1.375
Max alerts per hour in a given datacenter    2.6

We still get the recall of 1 we want, and a nice surprise regarding precision, higher than the one we obtained during the evaluation (0.3 > 0.04). Also note that some devices considered as false positives on this day (because they did not crash) actually crashed during the next heat wave (2019-07-24). Strictly speaking they are false positives, but maintaining them would nonetheless have been a good call. The mean proactivity time, 1.7 hours, is a bit less than the announced 2 hours, with a high standard deviation, and the minimal proactivity value is 30 minutes (the success condition for a prediction in our simulation, as mentioned above). The 33 alerts were primarily concentrated during the hot hours of 2019-06-29 (3PM-7PM), across 3 distinct datacenter locations (Roubaix, Gravelines and Strasbourg), and at the alert peak, 2.6 alerts would have had to be handled per hour. A lot, but manageable now that alerts are prioritized by the crash risk estimation we provide 🙂
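As a sketch, the simulation KPIs above can be derived from alert and crash timestamps along these lines (the 30-minute rule comes from the text; the DataFrame layouts and column names are assumptions):

```python
import pandas as pd

def simulation_kpis(alerts, crashes, min_lead=pd.Timedelta(minutes=30)):
    """alerts: (device_id, alert_ts) rows; crashes: (device_id, crash_ts) rows."""
    detected, missed, lead_hours = 0, 0, []
    for device_id, crash_ts in crashes.itertuples(index=False):
        device_alerts = alerts.loc[alerts["device_id"] == device_id, "alert_ts"]
        early = device_alerts[device_alerts <= crash_ts - min_lead]
        if early.empty:
            missed += 1          # predicted too late, or not at all
        else:
            detected += 1
            lead_hours.append((crash_ts - early.min()).total_seconds() / 3600)

    n_alerted = alerts["device_id"].nunique()
    return {
        "missed": missed,
        "detected": detected,
        "alerts": n_alerted,
        "precision": detected / n_alerted if n_alerted else float("nan"),
        "proactivity_mean_hours": sum(lead_hours) / len(lead_hours) if lead_hours else None,
    }
```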

On 2019-06-24, the system was not deployed yet, but here is a visualization of what the system would have done:

[Figure: predicted crash probability over time on 2019-06-24]

As you can see, the crash occurred around 2:30PM UTC. Provided we set the detection threshold to 0.5 (assuming operators were already highly loaded; otherwise we keep it lower, to intervene even more proactively), the device crash would have been predicted at 12:00PM, which would have left more than enough time to take preventive action (fan checks and dust cleaning are quick interventions that usually have a significant benefit on the device’s temperature).

Feature importance and knowledge extraction for preventive maintenance

When digging into the models’ interpretability, through feature importance and decision boundary visualization, let’s face it, we realized the decision making was actually trivial: each time it focused on one or two sensors, and consisted in a regression on their successive values in time, evaluated against a given threshold (which physically corresponds to the actual hardcoded vendor thresholds for these particular sensors).

In addition to giving us prediction models, this work provided us with a way to mine the large quantity of sensor data and gain knowledge about which few significant sensors to focus on and monitor in priority for each device series cluster. This helped us build a relevant preventive maintenance plan in a second step, which we describe below.
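A sketch of how such knowledge can be extracted from the trained forest, either globally with feature_importances_ or per prediction with the treeinterpreter package listed in the references (clf, feature_cols and X_test are the hypothetical names used earlier):

```python
import pandas as pd
from treeinterpreter import treeinterpreter as ti

# Global, impurity-based ranking of the sensors the forest relies on.
importances = pd.Series(clf.feature_importances_, index=feature_cols).sort_values(ascending=False)
print(importances.head(10))

# Per-prediction breakdown: prediction = bias + sum of per-feature contributions.
prediction, bias, contributions = ti.predict(clf, X_test.values[:1])
```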

Preventive maintenance

What is it?

This approach should be seen as a best effort. Rather than passively waiting for crashes to be predicted on hot days, we use the identified critical sensors to watch over our devices all year round. Contrary to the previous supervised models, which we could only build for clusters of series for which we had positive samples, this approach can be applied to any device series cluster, though for clusters affected by crashes we benefit from the knowledge previously gained regarding the sensors to focus on.


If this preventive maintenance plan is efficient and we follow its maintenance recommendations, we should, in particular, see fewer alerts reported by the crash prediction models built above.

Outlier detection here consists in detecting devices whose temperature is considered abnormally high, and planning maintenance operations on them.

How does it work?

We assume the temperatures measured by sensors depend on many factors/features. Among others, we could have: the device series (model), the dirt level on fans, the surrounding airflow temperature in the cold aisle or in the hot aisle, surrounding humidity, and of course the device traffic load.

An easy feature to act on is the dirt level, since it only takes a matter of minutes to fix, and it is generally a huge game changer for temperature. No infrastructure-scale operation is required, as would be the case if, for example, we needed to act on the device load.

If we built a classical outlier detection model with all the aforementioned features, the model would not only detect devices that are abnormally hot, but also those which are, for example, in a hotter or damper environment than usual, and we are not interested in that here. So, we cannot do it this way. What we actually want is to retrieve the devices which are abnormally hot, conditionally on the state of the features we cannot easily act on (device load, aisle temperature).

To get rid of the dependence on one or several of these features, one way we found was to partition the feature space into small parts (buckets). We then projected our samples into these buckets and, within a given bucket, performed a classical outlier detection on temperatures. While doing so, you can project on as many features as you want. The only constraint when splitting the space into buckets is that you need to end up with enough samples in every bucket to perform a reliable outlier detection. For example, Gaussian mixtures would require you to properly estimate the means and standard deviations of the Gaussian components of your signal. In our case, we simply estimated temperature quantiles, so we made no assumption on the shape of the distribution at all; but to estimate reliable quantiles (especially the extreme ones, which are the ones of interest for outlier detection), you need enough samples. Because of this constraint, we only projected our samples into buckets built from the surrounding hot aisle temperature feature.
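Here is a minimal sketch of this conditional outlier detection, assuming a DataFrame with device_id, aisle_temp and sensor_temp columns (hypothetical names): samples are bucketed by hot aisle temperature and flagged against an empirical quantile within each bucket.

```python
import pandas as pd

def conditional_outliers(df, bucket_width=1.0, quantile=0.99, min_samples=500):
    """Flag records that are abnormally hot *given* the surrounding hot aisle temperature."""
    df = df.dropna(subset=["aisle_temp", "sensor_temp"]).copy()
    df["bucket"] = (df["aisle_temp"] // bucket_width).astype(int)

    flagged = []
    for _, group in df.groupby("bucket"):
        if len(group) < min_samples:
            continue  # not enough samples to estimate an extreme quantile reliably
        threshold = group["sensor_temp"].quantile(quantile)
        flagged.append(group[group["sensor_temp"] > threshold])
    return pd.concat(flagged) if flagged else df.iloc[0:0]
```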

In short, the efficiency of this method relies on the assumption that a device which is abnormally hot when the surrounding airflow temperature is 27°C is likely to also be abnormally hot when the temperature is only 20°C (so it will show up as an outlier in winter as well).

Below is an example for a sensor of interest identified during the crash prediction supervised learning step through feature importance.

[Figure: device temperature vs. surrounding hot aisle temperature, with detected outliers in red]

The detected outliers are the red points on the graph. We report them as devices that need maintenance.

The graph above is interesting, as we can see that projecting on only one feature is not enough to separate the two modes (the two crocodile ‘jaws’ you can see). It would definitely be interesting, in a next iteration, to try to project the samples conditionally on a second feature, like the device’s load. The two modes we observe might correspond to globally idle vs globally loaded devices.

Overall results

To measure the impact of these two monitoring methods, we can compare device crash counts due to overheating before and after we put them into production.

In 2019, we had several particularly hot days (2019-06-24, 2019-06-29, 2019-07-24, 2019-07-25). For two particular device series, we encountered 39 crashes (35+4); fortunately, most of them had no impact on end users thanks to redundancy.

The system was put into production in early 2020 and, ever since, we have not encountered any crashes for these particular series. Note, however, that the results may be biased because, apart from this monitoring, other improvements were made regarding cooling in the datacenters.

References

Clustering

https://en.wikipedia.org/wiki/DBSCAN

https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html

https://en.wikipedia.org/wiki/Levenshtein_distance

Supervised learning

https://en.wikipedia.org/wiki/Random_forest

https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

Oversampling

https://en.wikipedia.org/wiki/Oversampling_and_undersampling_in_data_analysis

https://imbalanced-learn.org/stable/

Interpretability for decision tree based models

https://pypi.org/project/treeinterpreter/

Outlier detection

https://en.wikipedia.org/wiki/Anomaly_detection

https://scikit-learn.org/stable/modules/mixture.html

https://scikit-learn.org/stable/modules/outlier_detection.html

