From aa4ab834ca71cbbe1032b05851ea95e43879406d Mon Sep 17 00:00:00 2001
From: Blake Hildebrand
@@ -39,26 +39,26 @@ The three core metrics that we cared about at Pebble were the following:
- Average battery life
- Percentage of time the watch was connected via Bluetooth to the phone (we’d often have connectivity regressions!)
-The metric that was the easiest to improve upon was the average time between crashes since we had a pretty slick diagnostics system akin to Memfault’s product offering. Internally at Memfault, we’ve been calling the various metrics related to crashes as “Crashiness”, so without further ado, let’s dig into some Crashiness metrics.
+The easiest metric to improve upon was the average time between crashes since we had a pretty slick diagnostics system akin to Memfault’s product offering. Internally at Memfault, we’ve been calling the various metrics related to crashes “Crashiness,” so without further ado, let’s dig into some Crashiness metrics.
-> This article primarily talks about crashes. If you'd like to track other types of failures, read up on how you might [adjust these metrics](#non-crash-failures) to work for you.
+> This article primarily talks about crashes. If you'd like to track other failures, read up on how you might [adjust these metrics](#non-crash-failures) to work for you.
## Crashiness Metrics
-In an ideal world, the firmware on a device never crashes. For most modern firmware that is operating on even the most basic MCUs, this isn’t realistic, especially since we keep writing in C, which lacks robust compile-time checks and memory safety features. The best we have is [offensive programming pratices]({% post_url 2020-12-15-defensive-and-offensive-programming %}) and liberal usage of [asserts]({% post_url 2019-11-05-asserts-in-embedded-systems %}).
+In an ideal world, the firmware on a device never crashes. For most modern firmware, even on the most basic MCUs, this isn’t realistic, especially since we keep writing in C, which lacks robust compile-time checks and memory safety features. The best we have is [offensive programming practices]({% post_url 2020-12-15-defensive-and-offensive-programming %}) and liberal usage of [asserts]({% post_url 2019-11-05-asserts-in-embedded-systems %}).
-With this acknowledged, we need a way to measure how often our devices crash in the field. Sounds simple! I only wish it was. To compare the different types of metrics we can collect on the device and compute in a data warehouse, we’ll come up with a few criteria.
+With this acknowledged, we need a way to measure how often our devices crash in the field. Sounds simple! I only wish it was. To compare the different metrics we can collect on the device and compute in a data warehouse, we’ll develop a few criteria.
We want to collect a crashiness metric that:
-- **Can quickly assess the reliability of a group of devices:** We want to be able to get signal from a metric within hours and days after releasing a new firmware version, not wait weeks or months. We also want to be able to compare this metric with previous firmware releases so we know whether there is a regression.
+- **Can quickly assess the reliability of a group of devices:** We want to get a signal from a metric within hours and days after releasing a new firmware version, not wait weeks or months. We also want to compare this metric with previous firmware releases to determine whether there is a regression.
- **Handles expected vs unexpected reboots:** We want to be able to separate crashes from user shutdowns or the battery being depleted.
-- **Not susceptible to a small subset of misbehaving devices skewing the metric:** A handful of devices crashing many times an hour should not drastically skew the metric if 99.9% of all other devices are behaving normally.
-- **Works well with session-based devices:** The metric needs to handle devices that are used or powered on intermittently. An example of an intermittently used device is a handheld gaming device or a home espresso machine. The gaming device would track a game session, and a coffee machine would track a brew session.
+- **Not susceptible to a small subset of misbehaving devices skewing the metric:** A handful of devices crashing many times an hour should not drastically skew the metric if 99.9% of all other devices behave normally.
+- **Works well with session-based devices:** The metric needs to handle devices that are used or powered on intermittently, such as a handheld gaming device or a home espresso machine. The gaming device would track a game session, and the espresso machine would track a brew session.
Getting metrics from the device can be relatively simple, but if you are looking for a place to start, I recommend reading a previous post on [heartbeat metrics]({% post_url 2020-09-02-device-heartbeat-metrics %}).
-Before we dig into each possible metric to collect from the device and how to aggregate it, here’s a summary table of all the ones I’ll talk about today, and their strengths and weaknesses.
+Before we dig into each possible metric to collect from the device and how to aggregate it, here’s a summary table of all the ones I’ll talk about today and their strengths and weaknesses.
| Metric Criteria | Uptime | Mean Time Between Failure | Crash-Free Sessions | Crash-Free Hours | Crash-Free Devices |
| ------------------------------------------------------------------------------- | ------ | ------------------------- | ------------------- | ---------------- | ------------------ |
@@ -67,9 +67,9 @@ Before we dig into each possible metric to collect from the device and how to ag
| Not susceptible to a small subset of misbehaving devices skewing the metric | ❌ | ❌ | ❌ | ✅ | ✅ |
| Works well with session-based devices | ❌ | ⚠️ | ✅ | ⚠️ | ✅ |
-- ✅ - works great
-- ⚠️ - works well with a caveat
-- ❌ - does not work well
+- ✅ - works great
+- ⚠️ - works well with a caveat
+- ❌ - does not work well
### Uptime
@@ -80,7 +80,7 @@ $ uptime
15:52 up 14 days, 7:10, 2 users, load averages: 2.04 1.97 1.93
```
-To calculate average uptime, just add up all of the uptime measurements (assuming only one per boot is sent) and divide by the number of measurements.
+To calculate average uptime, add all the uptime measurements (assuming only one per boot is sent) and divide by the number of measurements.
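To make the aggregation concrete, here’s a minimal warehouse-side sketch - assuming a hypothetical `reboot_events` table with one `uptime_seconds` row per boot:
```sql
-- Hypothetical schema: reboot_events(device_id, fw_version, uptime_seconds),
-- one row per boot, reporting the uptime of the boot that just ended.
SELECT
  fw_version,
  AVG(uptime_seconds) / 3600.0 AS avg_uptime_hours
FROM reboot_events
GROUP BY fw_version;
```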
@@ -92,44 +92,44 @@ Here’s a diagram to show how some uptime measurements could be collected and a
### Mean Time Between Failure (MTBF)
-[MTBF](https://en.wikipedia.org/wiki/Mean_time_between_failures) is a standard metric for assessing how long it takes for a device or component to fail catastrophically. This metric is most useful for evaluating individual component failures, like a light bulb burning out or a sensor malfunction, or for large machines on assembly lines that need human intervention for repairs. The device being measured is either **working** or **not working**, and a **failure** prevents the device from functioning.
+[MTBF](https://en.wikipedia.org/wiki/Mean_time_between_failures) is a standard metric for assessing how long it takes for a device or component to fail catastrophically. This metric is most helpful for evaluating individual component failures, like a light bulb burning out, a sensor malfunctioning, or a large assembly-line machine needing human intervention for repairs. The device being measured is either **working** or **not working**, and a **failure** prevents the device from functioning.
-MTBF is almost the same as the uptime metric above, but this isn’t obvious at first glance. To calculate MTBF, devices will send how often they crash and the total amount of operating hours. The total number of operating hours is divided by the number of crashes, producing MTBF.
+MTBF is almost the same as the uptime metric above, but this isn’t obvious at first glance. To calculate MTBF, devices send how often they crash and their total operating hours. The total operating hours are divided by the number of crashes, producing MTBF.
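To make the math concrete, here’s a minimal warehouse-side sketch - assuming a hypothetical `heartbeats` table where each device periodically reports its `operating_hours` and `crash_count`:
```sql
-- Hypothetical schema: heartbeats(device_id, fw_version, operating_hours, crash_count)
SELECT
  fw_version,
  SUM(operating_hours) / NULLIF(SUM(crash_count), 0) AS mtbf_hours
FROM heartbeats
GROUP BY fw_version;
```
The `NULLIF` guards against dividing by zero when a fleet reports no crashes at all.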
@@ -141,22 +141,22 @@ To collect MTBF from the devices, record the last boot’s uptime according to t
#### Summary
-I do not recommend MTBF as a reporting metric, and would instead opt to use any one of the metrics listed later in this article.
+I do not recommend MTBF as a reporting metric and would instead opt for any of the metrics listed later in this article.
| Criteria | Rating | Notes |
| ------------------------------------------------------------------------------- | ------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
-| Can assess the reliability of devices & software quickly after firmware updates | ❌ | Just like uptime, we need to wait N days before being confident our devices can go N days between crashes. |
+| Can assess the reliability of devices & software quickly after firmware updates | ❌ | Like uptime, we need to wait N days before being confident our devices can go N days between crashes. |
| Handles expected vs unexpected reboots | ✅ | Expected reboots performed by the user are properly ignored. |
| Not susceptible to a small subset of misbehaving devices skewing the metric | ❌ | One device resetting often will cause the metric to skew lower. |
-| Works well with session-based devices | ⚠️ | MTBF can work well with devices that are session-based, but it’s not intuitive. If a device is used 1 hour a day and its MTBF metric is 24 hours, it’s expected to crash every 24 days, not every day. |
+| Works well with session-based devices | ⚠️ | MTBF can work well with session-based devices, but it’s not intuitive. If a device is used 1 hour a day and its MTBF metric is 24 hours, it’s expected to crash every 24 days, not every day. |
### Crash Free Sessions
-This is a metric that is typically used in the mobile and web application world. A session is defined as the time between a user opening and closing an application, or navigating to and then away from a website. If a crash did not occur in that time window, it’s a crash free session!
+This metric is typically used in the mobile and web application world. A session is the time between a user opening and closing an application or navigating to and away from a website. If a crash did not occur in that time window, it’s a crash free session!
-IoT devices often operate within a session world. These devices might include an e-bike, a smart coffee maker, or a pair of headphones. For these devices, tracking crash free sessions would be ideal, as the ultimate metric the manufacturer of these devices wants to know is did the device crash when the user was using the product. If it did, it probably resulted in the user noticing - You wouldn’t want an e-bike to ‘crash’ while you’re riding it!
+IoT devices often operate within a session world. These devices might include an e-bike, a smart coffee maker, or a pair of headphones. For these devices, tracking crash free sessions would be ideal, as the ultimate question the manufacturer wants answered is whether the device crashed while the user was using the product. If it did, the user probably noticed - you wouldn’t want an e-bike to ‘crash’ while you’re riding it!
Calculating the percentage of crash free sessions is self-explanatory - divide crash free sessions by the total number of sessions across all devices.
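As a sketch of that aggregation - assuming a hypothetical `sessions` table with one row per session and a boolean `crashed` flag:
```sql
-- Hypothetical schema: sessions(device_id, fw_version, crashed)
SELECT
  fw_version,
  100.0 * SUM(CASE WHEN crashed THEN 0 ELSE 1 END) / COUNT(*) AS pct_crash_free_sessions
FROM sessions
GROUP BY fw_version;
```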
@@ -164,7 +164,7 @@ Calculating the percentage of crash free sessions is self-explanatory - divide c
-The best part of crash free hours is that it prevents devices that are rebooting multiple times per hour from skewing the metric drastically. If a device crashes once a minute for an hour, it does not report 60 crashes for that hour. It only reports that it was not a crash free hour.
+The best part of crash free hours is that it prevents devices that are rebooting multiple times per hour from skewing the metric drastically. If a device crashes once a minute for an hour, it does not report 60 crashes. It only reports that it was not a crash free hour.
The second reason I like crash free hours compared to the other metrics is that you don’t need to wait very long until the data can be aggregated - just a few hours! This is because the metric is gathered **hourly** instead of daily, by sessions, or after rebooting.
@@ -215,7 +215,7 @@ The second reason I like crash free hours compared to the other metrics is that
-The one thing I don’t love about crash free hours, is that even a lousy rating (99% for example) feels like a good stat to the uneducated. This can be solved with either education or by inverting the stat and essentially showing MTBF (but this time with devices crashing a lot not wreaking havoc on the metric).
+The one thing I don’t love about crash free hours is that even a lousy rating (99%, for example) feels like a good stat to the uneducated. This can be solved either by education or by inverting the stat and essentially showing MTBF (but this time without devices that crash a lot wreaking havoc on the metric).
- 95% crash free hours = 1 crash a day
- 99.4% = 1 crash a week
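(Each figure is just the fraction of hours containing a crash: one crash a day leaves 23 of 24 hours crash free, roughly 95%; one crash a week leaves 167 of 168, about 99.4%.) As a sketch of the hourly aggregation - assuming a hypothetical `hourly_heartbeats` table with one row per device per hour of operation:
```sql
-- Hypothetical schema: hourly_heartbeats(device_id, fw_version, hour, crash_count)
SELECT
  fw_version,
  100.0 * SUM(CASE WHEN crash_count = 0 THEN 1 ELSE 0 END) / COUNT(*) AS pct_crash_free_hours
FROM hourly_heartbeats
GROUP BY fw_version;
```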
@@ -286,7 +286,7 @@ To calculate crash free devices, you need three things: a chosen time window to
-Each device will send data points about when it crashed, and the heavy lifting is done in the data warehouse. To calculate crash free devices in the last 7 days, one can use something like the following SQL query.
+Each device will send data points about when it crashed, and the heavy lifting is done in the data warehouse. One can use something like the following SQL query to calculate crash-free devices in the last 7 days.
```sql
SELECT
@@ -303,15 +303,15 @@ Below is an example of what the data from the device might look like and the res
-This metric will show your firmware’s true reliability score, as it only takes one crash on each device to get to 0%! The other metrics are not this sensitive, so only show this metric to your manager and CTO if you are really committed to improving it.
+This metric will show your firmware’s true reliability score, as it only takes one crash on each device to get to 0%! The other metrics are not this sensitive, so only show this metric to your manager and CTO if you are committed to improving it.
#### Collection
-Upon boot, send whether the device reset due to a crash and also a time stamp to help place the crash into the right day bucket.
+Upon boot, send whether the device reset due to a crash, along with a timestamp to help place the crash into the correct day bucket.
#### Summary
-Below is the standard comparison chart, but there is one final note about crash free devices that I need to make. This metric does not work well to compare device reliability _between firmware releases._ To determine whether your firmware update is better or worse than the one it’s replacing, be sure to use a different metric for that.
+Below is the standard comparison chart, but there is one final note about crash free devices: this metric does not work well for comparing device reliability _between firmware releases._ To determine whether your firmware update is better or worse than the one it’s replacing, use a different metric.
| Criteria | Rating | Notes |
| ------------------------------------------------------------------------------- | ------ | ----------------------------------------------------------------------------- |
@@ -324,37 +324,37 @@ Below is the standard comparison chart, but there is one final note about crash
## Metrics to Give the Boss
-If I was Head of Hardware or Firmware at an IoT company, I would want the following information in real-time:
+If I were Head of Hardware or Firmware at an IoT company, I would want the following information in real time:
-1. How many devices have experienced a crash in the last 1 day and last week? A device that crashes directly impacts the customer experience and a crash that occurs on a large percentage of the fleet might crater (gasp!) Amazon review ratings or the perception of quality. **Shoot for 99% crash-free devices as a north star**.
+1. How many devices have experienced a crash in the last day and the last week? A device that crashes directly impacts the customer experience, and a crash that occurs on a large percentage of the fleet might crater (gasp!) Amazon review ratings or the perception of quality. **Shoot for 99% crash-free devices as a north star**.
2. Of all sessions on firmware version X, how many end in a crash? **Shoot for 99.9% crash-free sessions** - it’s the standard in the mobile application world.
-3. How long, on average, for firmware version X, does it take for a device to experience a crash? If devices are crashing more than a few times a month, users are going to notice and care.
+3. How long, on average, does it take for a device on firmware version X to experience a crash? If devices crash more than a few times a month, users will notice and care.
-With this information, I would know when the firmware is stable enough to release to beta users or to the entire fleet. I would also be able to know whether new firmware versions that are shipped are improving reliability or making it worse. If I see that many devices are experiencing crashes, I could then roll back the firmware release, gather more metrics and coredumps from the updated devices, and create a subsequent release with the necessary fixes.
+With this information, I would know when the firmware is stable enough to release to beta users or the entire fleet, and whether newly shipped firmware versions are improving or hurting reliability. If I see that many devices are experiencing crashes, I can roll back the firmware release, gather more metrics and coredumps from the updated devices, and create a subsequent release with the necessary fixes.
## Fixing the Crashes
Once you take steps to monitor how often devices in the field crash, you’ll see that the number is non-zero. It might be a relatively small number of crashes, or it could be astronomical and surprising. Regardless of the result, you’ll need to figure out how to fix the firmware crash. Here are a couple of articles to help you out with this stage:
- Embedded Artistry - [Ending the Embedded Software Dark Ages: Let’s Start With Processor Fault Debugging!](https://embeddedartistry.com/blog/2021/01/11/hard-fault-debugging/) This article talks about how to collect coredumps from the field in a roll-your-own manner.
-- Interrupt - [How to debug a HardFault on an ARM Cortex-M MCU]({% post_url 2019-11-20-cortex-m-hardfault-debug %}). This post is relevant once you have a crash reproduced locally or captured in a coredump.
+- Interrupt - [How to debug a HardFault on an ARM Cortex-M MCU]({% post_url 2019-11-20-cortex-m-hardfault-debug %}). This post is relevant once a crash is reproduced locally or captured in a coredump.
- Interrupt - [Using Asserts in Embedded Systems]({% post_url 2019-11-05-asserts-in-embedded-systems %}). To help track down memory corruption issues, [stack overflows]({% post_url 2023-06-14-measuring-stack-usage %}), and [watchdogs]({% post_url 2020-02-18-firmware-watchdog-best-practices %}), use asserts liberally!
## Tracking Failures That Aren’t Crashes {#non-crash-failures}
-This article is focused on tracking crashes to assess reliability. However, what is great about these fundamental metrics is they can apply to any failure that your company wants to monitor closely.
+This article is focused on tracking crashes to assess reliability. However, what is great about these fundamental metrics is that they can apply to any failure your company wants to monitor closely.
-For example, if my company makes an IoT weather sensor that needs to send data back every minute, I will want to track how often it fails to do so. Instead of recording crashes as the failure, I would record the number of times the device fails to send a weather-related reading. Then I would calculate “weather sync failure” free hours and “weather sync failure” free devices.
+For example, if my company makes an IoT weather sensor that needs to send data back every minute, I will want to track how often it fails to do so. Instead of recording crashes as the failure, I would record the number of times the device fails to send a weather-related reading. Then, I would calculate “weather sync failure” free hours and “weather sync failure” free devices.
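As a sketch, the crash free hours query from earlier only needs its failure condition swapped - assuming a hypothetical `sync_failure_count` column per device-hour:
```sql
-- Hypothetical schema: hourly_heartbeats(device_id, hour, sync_failure_count)
SELECT
  100.0 * SUM(CASE WHEN sync_failure_count = 0 THEN 1 ELSE 0 END) / COUNT(*) AS pct_sync_failure_free_hours
FROM hourly_heartbeats;
```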
As long as there is an event that can fail, and you have a count of the number of attempts, you can use this methodology to measure any type of failure!
## Towards Crash Free Firmware
-By collecting and constantly obsessing over these metrics at Pebble, we were able to produce reliable firmware despite the complexity being crammed into a 1MB flash part full of C code. Our firmware did still crash every so often, but our average was around 14 days between crashes, which I thought was pretty good.
+By collecting and constantly obsessing over these metrics at Pebble, we produced reliable firmware despite all the complexity crammed into a 1MB flash part full of C code. Our firmware still crashed occasionally, but our average was around 14 days between crashes, which I thought was pretty good.
-I hope this post was a good primer on how you can measure device reliability as it relates to crashes, and that you have the information and tools you need to start getting these metrics into your firmware and data warehouse.
+I hope this post was a good primer on measuring device reliability as it relates to crashes, and that you have the information and tools to start getting these metrics into your firmware and data warehouse.
-If you’re feeling overwhelmed by having to build all the intricate libraries in hooks in firmware, the serialization and protocol, processing and data pipeline, SQL queries, and dashboarding-fu necessary to surface these crashiness metrics, [reach out to us](mailto:hello@memfault.com) at Memfault. We’d love to help or at least steer you in the right direction.
+If you’re feeling overwhelmed by having to build all the intricate libraries and hooks in firmware, the serialization and protocol, the processing and data pipeline, and the SQL queries and dashboarding-fu necessary to surface these crashiness metrics, [reach out to us](mailto:hello@memfault.com) at Memfault. We’d love to help or at least steer you in the right direction.