Ross edited this page Oct 18, 2023 · 6 revisions

Life Table statistical methods for disk failure times.

Abstract

Life table statistical methods offer a probabilistic framework for understanding failures over time in the hard drive and solid state drive tracking records published by a commercial cloud storage vendor.

Background

Backblaze drive storage failure data have been published quarterly for more than 10 years, providing objective data about hard disk (HD) and solid state drive (SSD) reliability under sustained load in a commercial setting. Regular blog posts provide opinions based on descriptive measures such as annualised failure rates.

Statistical challenges

The daily status of hundreds of thousands of drive units is automatically monitored, with failures counted each day and published every quarter. Only a small minority of the drive units under observation fail during each period. In practice, the failure time is censored at the time of analysis for every drive remaining in service. Even if no new drive units were added to the existing pool, it would take decades of patient observation before they had all failed.

Specifically, these are right censored data, because the future failure times of all non-failed drives remain unknown at the time of analysis. Most conventional statistical models become unreliable when a substantial proportion of the data is incomplete or missing. Specialised statistical methods based on life tables can extract reliable information from this kind of data by taking the censoring at the time of observation into account.

The right censored drives still in service at the end of the observation period have not failed, but they contribute information to the denominator of failure rate estimates. Information from failed and surviving drives can be combined to estimate instantaneous failure rates reliably at each point in time when an individual drive fails and is removed from further observation. The uncertainty surrounding these instantaneous rate estimates can also be estimated, allowing hypotheses about different groups of drives to be tested. Each individual drive's service start date is available as the date when its unique drive identifier first appears in the data stream, so there is no left censoring of these data, and models such as the non-parametric Kaplan-Meier model or the semi-parametric Cox proportional hazards model are applicable.
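For reference, the standard Kaplan-Meier estimate and its usual (Greenwood) variance approximation can be written as follows. The notation is introduced here for illustration and does not appear in the source data: t_i are the distinct times at which failures occur, d_i is the number of drives failing at t_i, and n_i is the number of drives still under observation just before t_i, which includes drives that are later censored.

```latex
\hat{S}(t) = \prod_{i \,:\, t_i \le t} \left( 1 - \frac{d_i}{n_i} \right),
\qquad
\widehat{\mathrm{Var}}\!\left[ \hat{S}(t) \right]
  \approx \hat{S}(t)^2 \sum_{i \,:\, t_i \le t} \frac{d_i}{n_i \,(n_i - d_i)}
```

Each factor d_i / n_i is the estimated failure rate at t_i, so censored drives enter only through the denominators n_i, as described above.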

Life table methods

Each drive's first date of service in the raw data is noted. When a unit fails, the failure and its date are noted; if a unit has not failed, the number of days between its first appearance and the last observation date is noted. A life table is then created, with the total number of drive-days of observation and the instantaneous failure rate. Each row in this table represents a day when one or more individual drives fail and are permanently removed from further observation. On days without any failures the table does not change, so no new row is added.
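The construction described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the analysis: the function name `build_life_table`, the `(days_observed, failed)` input pairs and the row keys are all assumptions made for the example. It assumes time is measured as days in service, and that drives censored on a given day are still counted as at risk for failures on that day.

```python
from collections import Counter


def build_life_table(drives):
    """Build a simple life table from (days_observed, failed) pairs.

    One row is produced for each day in service on which at least
    one drive failed; days without failures add no row.
    """
    drives = list(drives)
    # Number of failures recorded at each day in service.
    failures = Counter(days for days, failed in drives if failed)
    # Every drive leaves the risk set on its last observed day,
    # whether it failed or was censored.
    exits = Counter(days for days, _ in drives)

    at_risk = len(drives)
    survival = 1.0
    table = []
    for day in sorted(exits):
        n_failed = failures.get(day, 0)
        if n_failed:
            hazard = n_failed / at_risk   # failure rate on this day
            survival *= 1.0 - hazard      # running product-limit estimate
            table.append({
                "day": day,
                "at_risk": at_risk,
                "failed": n_failed,
                "hazard": hazard,
                "survival": survival,
            })
        # Remove failed and censored drives from further observation.
        at_risk -= exits[day]
    return table
```

For example, `build_life_table([(100, True), (200, False), (200, True), (300, False), (400, True)])` produces rows only for days 100, 200 and 400. The drive censored at day 300 adds no row, but it is still counted in the at-risk denominator for the failure at day 200.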

After failing, a drive contributes no additional information. Statistical methods based on life tables have a wide range of applications, including optimising engineering methods to make components longer lasting, or comparing the effects of treatments on cancer patient survival.

A suitable input for life table statistical methods is a table in which each row summarises one individual drive at the time of analysis. The analysis is based on the total duration of observation and the status at last observation of each drive. Both can be determined by sequentially scanning all the Backblaze records for a period, updating each drive's summary as a new record is found, then writing out the much smaller summary table needed for analysis.
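A rough sketch of that sequential scan is shown below. It assumes each daily record is available as a mapping with `date` (ISO format), `serial_number` and `failure` (0 or 1) fields; the function name `summarise_drives` and the output layout are hypothetical and chosen only for this example. Duration is taken as the number of days between first and last appearance, so a drive seen on a single day has a duration of 0.

```python
from datetime import date


def summarise_drives(records):
    """Reduce daily drive records to one summary row per drive.

    records: iterable of mappings with 'date', 'serial_number'
    and 'failure' keys (assumed field names).
    Returns {serial_number: {'days': int, 'failed': bool}}.
    """
    seen = {}
    for rec in records:
        day = date.fromisoformat(rec["date"])
        failed = int(rec["failure"]) == 1
        entry = seen.get(rec["serial_number"])
        if entry is None:
            # First appearance of this drive in the data stream.
            seen[rec["serial_number"]] = {
                "first": day, "last": day, "failed": failed,
            }
        else:
            # Update the running summary; record order does not matter.
            entry["first"] = min(entry["first"], day)
            entry["last"] = max(entry["last"], day)
            entry["failed"] = entry["failed"] or failed
    return {
        serial: {
            "days": (entry["last"] - entry["first"]).days,
            "failed": entry["failed"],
        }
        for serial, entry in seen.items()
    }
```

Because only one small entry is kept per drive, the scan can stream through the daily files (for instance with `csv.DictReader`) without holding the raw records in memory.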