The Switch Shut Down at 66°C. The Dashboard Said 104°C Was the Limit. Both Were Right

I have worked on production networks for quite some time, and troubleshooting a failed device usually follows the same script: the equipment fails, everyone assumes a hardware fault, someone power-cycles it, and then there is intense pressure to close the ticket because the backlog is at an all-time high. I did not close this ticket, because what I had seen did not match my experience with this piece of equipment.

Everything was fine until it was not.

After I powered the device back up and ran my usual checks, I found the CPU package at 32°C, PSU1 at 30°C, PSU2 at 38°C, and the main board between 28°C and 34°C. Every sensor was within specification, and nothing pointed to a hardware problem.

After the hardware checks, I reviewed the logs. As anticipated, I found the sequence.

| Feb 3 18:55:27 WARNING pmon#thermalctld: Temperature of CPU Package Temp changed too fast, from 32.0 to 66.0 |
|----|
| Feb 3 18:55:46 CRIT - Monitor CPU Temp, temperature is 66.0. Temperature is over 66.0. Need to shut down the DUT. |

Sixty-six degrees Celsius is a plausible enough temperature for a device to shut itself down, but I was less sure what to make of the 82°C high-water mark for the thermal daemon. Going back over the data, I realized there was a discrepancy between the temperature at which the machine actually shut down (66°C) and the critical threshold it reported (104°C).
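As a cross-check on the two log entries quoted above, here is a minimal Python sketch that pulls the temperature jump and the warning-to-shutdown interval out of them. The regular expressions are my own assumptions about the line layout, based only on these two lines, not a documented log format. The year is an arbitrary placeholder because syslog-style timestamps omit it.

```python
import re
from datetime import datetime

# The two entries from the incident, transcribed from the log excerpt.
LOG_LINES = [
    "Feb 3 18:55:27 WARNING pmon#thermalctld: Temperature of CPU Package Temp "
    "changed too fast, from 32.0 to 66.0",
    "Feb 3 18:55:46 CRIT - Monitor CPU Temp, temperature is 66.0. "
    "Temperature is over 66.0. Need to shut down the DUT.",
]

# Assumed patterns, inferred from the two lines above only.
TIMESTAMP = re.compile(r"^(\w{3}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})")
JUMP = re.compile(r"from (\d+(?:\.\d+)?) to (\d+(?:\.\d+)?)")


def parse_time(line: str) -> datetime:
    """Parse the leading timestamp; the year is a placeholder."""
    match = TIMESTAMP.match(line)
    if match is None:
        raise ValueError(f"no timestamp in: {line!r}")
    # Collapse whitespace so "Feb  3" and "Feb 3" parse the same way.
    stamp = " ".join(match.group(1).split())
    return datetime.strptime(f"2000 {stamp}", "%Y %b %d %H:%M:%S")


def summarize(warning_line: str, shutdown_line: str) -> dict:
    """Return the temperature rise and the seconds between the two events."""
    jump = JUMP.search(warning_line)
    if jump is None:
        raise ValueError("warning line has no 'from X to Y' pair")
    start_c, end_c = float(jump.group(1)), float(jump.group(2))
    gap = parse_time(shutdown_line) - parse_time(warning_line)
    return {
        "rise_c": end_c - start_c,
        "seconds_warning_to_shutdown": gap.total_seconds(),
    }


summary = summarize(*LOG_LINES)
print(summary)
```

For these two lines the sketch reports a 34-degree rise and a 19-second interval between the warning and the shutdown.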
The difference between the temperature at which the switch shut down (66°C) and the limit the software displayed (104°C) is 38°C.

Device Functionality

The switch did not malfunction or fail erratically. It operated exactly as configured; nobody on our team, myself included, understood what that configuration actually was.

The Temperature Issue

Identifying the underlying cause took some research. The logs showed that the CPU temperature had been about 32°C before it reached the 66°C shutdown temperature in less than 2.5 seconds. There was no gradual climb: the spike was fast enough to trigger a "changed too fast" warning 19 seconds before the shutdown event.

This was not what I would expect from sustained load. So I set aside an identical piece of hardware and ran a properly disciplined trial on it: I held CPU utilization at 99% across all cores for an extended period to gain confidence in the results. The highest temperature I recorded was just 50°C. The device was never at risk of reaching the shutdown temperature under normal extended load.

The Missing Piece

Part of this investigation is honesty. By the time I saw the device back up and running, the spike had already passed. The logs showed that there had been a spike, but not what caused it, because no continuous telemetry had been collected before the failure.

Dashboards provide a snapshot of what is happening now. Logs give you fragments of what happened in the past. Telemetry lets you merge the two into a continuous view of what happened and when.
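To make "continuous view" concrete, here is a minimal Python sketch of a rolling telemetry buffer that keeps the last couple of minutes of timestamped readings in memory. Everything in it is hypothetical: the class name, the `read_sensors` callable, and the 120-second window are illustration choices, not part of any SONiC tooling or of the monitoring package described in this post.

```python
import time
from collections import deque
from typing import Callable, Deque, Dict, List, Tuple

Sample = Tuple[float, Dict[str, float]]  # (unix timestamp, readings)


class RollingTelemetry:
    """Keep only the most recent `window_s` seconds of readings."""

    def __init__(
        self,
        read_sensors: Callable[[], Dict[str, float]],  # hypothetical reader
        window_s: float = 120.0,
        clock: Callable[[], float] = time.time,
    ) -> None:
        self._read = read_sensors
        self._window_s = window_s
        self._clock = clock
        self._samples: Deque[Sample] = deque()

    def sample(self) -> Sample:
        """Take one timestamped reading and evict anything too old."""
        now = self._clock()
        entry = (now, dict(self._read()))
        self._samples.append(entry)
        while self._samples and now - self._samples[0][0] > self._window_s:
            self._samples.popleft()
        return entry

    def before(self, event_time: float, seconds: float) -> List[Sample]:
        """Return samples from the `seconds` leading up to `event_time`."""
        return [
            s for s in self._samples
            if event_time - seconds <= s[0] <= event_time
        ]
```

On a real device, `read_sensors` would wrap whatever sensor source is available, and `sample()` would be called on a short timer. The buffer would also need to be written to persistent storage, since an in-memory copy does not survive the shutdown it is meant to explain.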
Without telemetry, the cause of the CPU spike is nothing more than a guess made after the fact. With telemetry, you can see temperature, CPU performance, fan performance, and CPU load in the seconds before the failure occurred.

| That gap was the real problem in this investigation. Not the threshold mismatch. Not the CPU spike. The absence of anything running in the background to capture what the logs could not. |
|----|

How I Fixed the Issues

Two immediate actions came out of this investigation.

1. I installed a monitoring package on the device. One command, less than 2 seconds to execute, no impact to service.

| sudo dpkg -i systatmonit1.0amd64.deb |
|----|

Once installed, the monitoring package does three things. First, it corrects the thermal policy mismatch so that the platform output accurately represents the shutdown threshold. Second, it logs system resource information every five minutes. Third, it automatically includes the logged data in support dumps, so if this happens again there will be recorded data to go back to.

2. I escalated the display discrepancy to the R&D team. The fact that "show platform temperature" shows 104°C as the critical threshold while the system shuts down at 66°C indicates a design flaw. A fix is planned for a future software release.

Two Things to Do Before This Happens to You

If you manage SONiC-based switches, do not rely on the show platform temperature command alone for temperature thresholds. The thresholds the thermal daemon actually enforces may differ from what the show command reports. In our case they were 38 degrees apart, and nobody knew until a production device went dark. The first step is to compare the two sources, the show platform temperature output and the thermal configuration, to make sure they agree.
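A minimal Python sketch of that comparison follows. It assumes you have already copied the critical thresholds from the show platform temperature output and from the thermal configuration into two dictionaries by hand. The function name and the dictionary layout are hypothetical, and the sketch does not parse either source itself.

```python
from typing import Dict, List, Tuple


def find_threshold_mismatches(
    displayed: Dict[str, float],
    enforced: Dict[str, float],
    tolerance_c: float = 0.0,
) -> List[Tuple[str, float, float]]:
    """Return (sensor, displayed, enforced) for every sensor that disagrees.

    Only sensors present in both dictionaries are compared.
    """
    mismatches = []
    for sensor in sorted(set(displayed) & set(enforced)):
        if abs(displayed[sensor] - enforced[sensor]) > tolerance_c:
            mismatches.append((sensor, displayed[sensor], enforced[sensor]))
    return mismatches


# The values from this incident, entered by hand.
displayed = {"CPU Package Temp": 104.0}  # critical threshold on the display
enforced = {"CPU Package Temp": 66.0}    # temperature that triggered shutdown

for sensor, shown, actual in find_threshold_mismatches(displayed, enforced):
    print(
        f"{sensor}: display says {shown:.0f} C, "
        f"shutdown happens at {actual:.0f} C "
        f"(off by {shown - actual:.0f} C)"
    )
```

An empty result means the two sources agree for every sensor they have in common. Sensors that appear in only one source are skipped, so it is worth checking for those separately.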
This comparison takes about five minutes.

The second item is telemetry. It should be in place before an incident occurs. To find the cause of a gradual trend, 5-minute polling is enough. If the event is intermittent, with sudden, unexpected swings in the monitored metric, use shorter polling intervals to catch it. In this case, the spike that caused the outage was over before anyone began investigating. The only way to have seen it would have been through time-synchronized telemetry recorded before the incident.

| For production networks, telemetry is not an optional troubleshooting tool. It is the evidence layer. The most important question when a switch shuts down is not only what threshold did it hit. It is what was happening in the seconds before it got there. |
|----|

Things I Learned from This

A power cycle normally brings equipment back up cleanly, and it did here. But the display showed a different threshold from the one the system actually enforced. Whatever else happened on the switch in those seconds went unrecorded, because no software was running to capture it.

There was nothing wrong with the switch. It just needed to be monitored for a longer period after the power cycle before the ticket was closed.