Hardware vs. Software Failures – is the balance changing?

NOTE:  these are notes about a post and not really a post per se.  I was thinking that as our software becomes more reliable, and as the rate at which persistent data accumulates keeps increasing (in addition to the actual amount increasing), hardware failure may well become a more important problem for us than software failures (ours or the OS's).

My suspicion is that the MTBF for persistent-storage tech (e.g. disk drives) is not increasing nearly as quickly as the amount of stored data is growing.  The potential side effect for data-bound systems (like ours) is that customers will increasingly see hardware failures that affect data quality rather than simple software failures.

Since people seem to tend to display a tendency to group apparent causal factors into a single, coordinated, apparently consistent logical model based on observational correlation, we’ll get some of the blame.  By the way, I agree with you, that previous sentence was simply horrible, but I rewrote it four times and it still says what I was trying to say, which is that when people correlate the presence of our software and hardware failures, I’d bet they’ll think we’re somehow responsible.  More on this point later.

Oddly, this was the situation for first-generation computing technologies, too, although they tended to have little if any persistent storage as we think of it.  They just failed a lot, taking work with them.  I can only imagine similar problems as we move to IPv6, as we deal with what amounts to brand-new technology once again.

Note:  I am aware that a geometric progression is a form of exponential progression, but I'm using the term here to indicate something that increases faster than a linear model, yet more slowly than one growing by orders of magnitude.  Math is not my friend, I'm afraid.

Never forget, MathWorld is your friend:  http://mathworld.wolfram.com/
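
To make that distinction concrete, here's a quick, entirely made-up numeric sketch (the 100 GB starting point and the growth rates are arbitrary assumptions, not measurements):

```python
# Three ways 100 GB of stored data could grow over five years (toy numbers).
years = 5
linear = [100 + 50 * t for t in range(years + 1)]               # +50 GB/year
geometric = [100 * 1.5 ** t for t in range(years + 1)]          # x1.5 per year
order_of_magnitude = [100 * 10 ** t for t in range(years + 1)]  # x10 per year

for label, series in [("linear", linear),
                      ("geometric", geometric),
                      ("order of magnitude", order_of_magnitude)]:
    print(f"{label:>18}: " + ", ".join(f"{v:,.0f}" for v in series))
# Geometric ends around 759 GB -- well past linear's 350 GB, but nowhere
# near the 10,000,000 GB of the order-of-magnitude case.
```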

Variables (a back-of-the-envelope sketch combining these follows the list):

  • amount of data
  • data stored per fixed interval
  • rate of increase of data storage for a given interval
  • storage density (data / drive)
  • increase in storage capacity
    • constant cost
    • constant performance
  • rate of increase in storage capacity
  • other hardware failures affecting storage reliability (e.g.  SAN availability)
  • MTBF for storage media
  • change in MTBF
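
Purely as a back-of-the-envelope exercise, here's how those variables might fit together; every figure below is invented to show the shape of the problem, not measured from anything real:

```python
# Back-of-the-envelope: expected drive failures per year as data grows.
# Every figure here is invented for illustration.

data_tb = 50.0            # amount of data today (TB)
data_growth = 1.5         # data grows x1.5 per year (geometric)
drive_capacity_tb = 2.0   # storage density: data per drive (TB)
capacity_growth = 1.3     # capacity at constant cost grows x1.3 per year
mtbf_hours = 500_000      # MTBF for storage media (hours)
mtbf_growth = 1.05        # MTBF improves ~5% per year

HOURS_PER_YEAR = 8766

for year in range(6):
    drives = data_tb / drive_capacity_tb
    # With N drives each failing at rate 1/MTBF, expected failures per year
    # is roughly N * hours_per_year / MTBF (ignoring RAID, SAN outages, etc.).
    expected_failures = drives * HOURS_PER_YEAR / mtbf_hours
    print(f"year {year}: {drives:6.1f} drives, "
          f"~{expected_failures:4.2f} expected failures/year")
    data_tb *= data_growth
    drive_capacity_tb *= capacity_growth
    mtbf_hours *= mtbf_growth
```

The point is just that if the data-growth factor outpaces the product of the capacity-growth and MTBF-growth factors, the number of failures a customer sees each year climbs even though every individual drive got more reliable.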

 

Current Unsubstantiated Beliefs and Opinions (of mine)

  • MTBF is increasing linearly (or so) but probably follows a logistic curve (http://en.wikipedia.org/wiki/Logistic_growth) for a given class of technology; in other words, there comes a point when it’s not worth the investment to make something more reliable, although this can sometimes be overcome (for service-providing tech) by providing access to parallel service providers via a logically-singular access point (e.g. two web-site front ends with one DNS entry, either of which talks to either of two back-end systems, which are themselves kept in sync by a further set of systems.  Regress ad infinitum if you’re not careful).  There’s a rough sketch of the logistic-vs-geometric shapes after this list.
  • storage capacity (at fixed cost over time) is increasing geometrically.  This is also probably a logistic function, since you either have to make disks bigger, have more of them, and/or increase storage density, and all of those approaches will tend to decrease MTBF as they approach their theoretical limits.  Think of airplane propellers, which can only be so big before their tips break the sound barrier, which limits their RPM, which limits their lift.  Adding more blades increases the mass in motion, and curving the blades only pushes the same limits out a little.
  • storage requirements are increasing at a greater-than-linear but less-than-exponential rate.  For us, this is driven by the increase in the number of devices, the number of points per device, the frequency with which data is collected, and increasing resolution.  Some devices (e.g. netcams, audio) have serious implications for storage growth, even with compression.
  • SAN- and cluster-based failures will reflect poorly on us, even though they aren’t literally our fault, because our products are the primary factor driving the acquisition of storage tech.  That is to say, we’re the reason those failures are increasingly visible, but we’re not responsible for them, and it’s not clear there’s any technical solution other than increasing the scale and reliability of back-end systems through parallel architecture.
  • Despite the combination of these factors, the amount of data collected from SCADA systems will not approach a singularity in any meaningful sense because it will tend to be distributed in static pools, typically organized within individual corporations.
  • We store data at a much higher resolution than is technically or legally required, which means we could probably cut storage requirements significantly by adopting lossy (or lossier) storage mechanisms (see the downsampling sketch after this list).
  • Our architecture is going to have to evolve to deal with infrastructure that appears increasingly unreliable, although that’s not statistically the case when viewed on a per-item basis.
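
To illustrate the first two bullets, here's a rough sketch of logistic versus geometric growth; the ceiling, midpoint, and growth factors are invented purely to show the shapes of the curves:

```python
import math

def logistic(t, ceiling=2_000_000, midpoint=10, steepness=0.4):
    """Logistic curve: roughly S-shaped; growth tails off as it approaches
    a ceiling (here, a hypothetical MTBF cap in hours)."""
    return ceiling / (1 + math.exp(-steepness * (t - midpoint)))

def geometric(t, start=100.0, factor=1.5):
    """Geometric curve: multiplies by a constant factor each year
    (here, stored data in TB)."""
    return start * factor ** t

for year in range(0, 21, 5):
    print(f"year {year:2d}: MTBF ~{logistic(year):>12,.0f} h, "
          f"data ~{geometric(year):>12,.1f} TB")
# The MTBF curve flattens out near its ceiling while the data curve keeps
# compounding -- that crossover is the thing I'm worried about.
```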
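
And for the lossy-storage bullet, a minimal sketch of what downsampling a time series might look like; the 1 Hz sample rate, one-minute buckets, and plain averaging are all assumptions for illustration, not a statement about what we'd actually be allowed to throw away:

```python
import random

# Hypothetical example: collapse one hour of 1-second samples (1 Hz) into
# one-minute averages.  What can actually be discarded depends on the
# technical/legal requirements; this just shows the mechanics.
samples = [20.0 + random.uniform(-0.5, 0.5) for _ in range(3600)]

def downsample(values, bucket_size):
    """Replace each bucket of raw samples with its average (lossy)."""
    return [sum(values[i:i + bucket_size]) / bucket_size
            for i in range(0, len(values), bucket_size)]

per_minute = downsample(samples, 60)   # 3600 points -> 60 points
print(f"raw points: {len(samples)}, stored points: {len(per_minute)} "
      f"({len(samples) // len(per_minute)}x reduction)")
```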

Useful charts to find/create

  • need graph of historical MTBF for storage technologies
  • SCADA data statistics by industry, over time, i.e.  in 1970 the gas/oil exploration industry stored N megabytes of data/day in the US. 
  • graph of storage densities over time