ECC memory

Error-correcting code memory (Error Checking & Correction, ECC memory) is a type of computer data storage that can detect and correct the most common kinds of internal data corruption. ECC memory is used in most computers where data corruption cannot be tolerated under any circumstances, such as for scientific or financial computing.

ECC memory maintains a memory system immune to single-bit errors: the data that is read from each word is always the same as the data that had been written to it, even if one of the bits actually stored has been flipped to the wrong state. Most non-ECC memory cannot detect errors, although some non-ECC memory with parity support allows detection but not correction.

Problem background

Electrical or magnetic interference inside a computer system can cause a single bit of dynamic random-access memory (DRAM) to spontaneously flip to the opposite state. It was initially thought that this was mainly due to alpha particles emitted by contaminants in chip packaging material, but research has shown that the majority of one-off soft errors in DRAM chips occur as a result of background radiation, chiefly neutrons from cosmic ray secondaries, which may change the contents of one or more memory cells or interfere with the circuitry used to read/write them.

There was some concern that, as DRAM density increased further and the components on chips became smaller while operating voltages continued to fall, DRAM chips would be affected by such radiation more frequently, since lower-energy particles would be able to change a memory cell's state. On the other hand, smaller cells make smaller targets, and moves to technologies such as SOI may make individual cells less susceptible and so counteract, or even reverse, this trend. Recent studies show that single event upsets due to cosmic radiation have been dropping dramatically with process geometry and that previous concerns over increasing bit cell error rates are unfounded.

The spacecraft Cassini–Huygens, launched in 1997, contains two identical flight recorders, each of which contains 2.5 gigabits of memory in the form of arrays of commercial DRAM chips. Its engineering telemetry reports the number of (correctable) single-bit-per-word errors and (uncorrectable) double-bit-per-word errors. In the vicinity of Earth, and when the sun is "quiet", it reported a nearly constant single-bit error rate of about 280 errors per day. The maximum hourly error report from Cassini–Huygens in the first month in space was 3072 single-bit errors per day during a weak solar flare. If the flight recorders had been designed with EDAC words assembled from widely-separated bits, the number of (uncorrectable) multiple-bit errors should average less than one per year.

Work published between 2007 and 2009 showed widely varying error rates with over 7 orders of magnitude difference, ranging from 10⁻¹⁰ to 10⁻¹⁷ error/bit·h, that is, from roughly one bit error per hour per gigabyte of memory to one bit error per millennium per gigabyte of memory. A very large-scale study based on Google's very large number of servers was presented at the SIGMETRICS/Performance'09 conference. The actual error rate found was several orders of magnitude higher than in previous small-scale or laboratory studies, with 25,000 to 70,000 errors per billion device hours per megabit (about 2.5–7 × 10⁻¹¹ error/bit·h, i.e. about 5 single-bit errors in 8 gigabytes of RAM per hour using the top-end error rate), and more than 8% of DIMM memory modules affected by errors per year.
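The unit conversions quoted above can be checked with a few lines of arithmetic. The following is a minimal sketch, not taken from the cited studies: the function names are illustrative, and it assumes a binary gigabyte (2^30 bytes) and a decimal megabit (10^6 bits).

```python
HOURS_PER_YEAR = 24 * 365.25
BITS_PER_GIGABYTE = 8 * 2**30   # assumption: binary gigabyte
BITS_PER_MEGABIT = 10**6        # assumption: decimal megabit


def errors_per_hour(rate_per_bit_hour, gigabytes):
    """Expected bit errors per hour for a memory of the given size."""
    return rate_per_bit_hour * gigabytes * BITS_PER_GIGABYTE


def years_between_errors(rate_per_bit_hour, gigabytes):
    """Mean time between bit errors, in years."""
    return 1.0 / errors_per_hour(rate_per_bit_hour, gigabytes) / HOURS_PER_YEAR


def per_bit_hour(errors_per_billion_device_hours_per_mbit):
    """Convert 'errors per billion device hours per megabit' to error/bit·h."""
    return errors_per_billion_device_hours_per_mbit / 1e9 / BITS_PER_MEGABIT


if __name__ == "__main__":
    # High end of the 2007-2009 range: a little under one error per hour per gigabyte.
    print(errors_per_hour(1e-10, 1))
    # Low end of the range: on the order of a thousand years between errors.
    print(years_between_errors(1e-17, 1))
    # Field-study range expressed in error/bit·h.
    print(per_bit_hour(25_000), per_bit_hour(70_000))
    # Top-end field-study rate applied to 8 gigabytes: close to 5 errors per hour.
    print(errors_per_hour(per_bit_hour(70_000), 8))
```

Under these assumptions the two ends of the range differ by a factor of 10^7, which matches the "7 orders of magnitude" quoted in the text.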

The consequence of a memory error is system-dependent. In systems without ECC, an error can lead either to a crash or to corruption of data; in large-scale production sites, memory errors are one of the most common hardware causes of machine crashes. Memory errors can cause security vulnerabilities. A memory error can have no consequences if it changes a bit which neither causes observable malfunctioning nor affects data used in calculations or saved. A 2010 simulation study showed that, for a web browser, only a small fraction of memory errors caused data corruption, although, as many memory errors are intermittent and correlated, the effects of memory errors were greater than would be expected for independent soft errors.

An example of a single-bit error that would be ignored by a system with no error-checking, would halt a machine with parity checking, or would be invisibly corrected by ECC: a single bit is stuck at 1 due to a faulty chip, or becomes changed to 1 due to background or cosmic radiation; a spreadsheet storing numbers in ASCII format is loaded, and the digit "8" is stored in the byte which contains the stuck bit as its eighth (least significant) bit; then a change is made to the spreadsheet and it is saved. As a result, the "8" (00111000 binary) has silently become a "9" (00111001).
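The spreadsheet example can be reproduced in a few lines. This is only an illustration of the arithmetic; the helper name `stick_bit_at_one` is hypothetical, and bit index 0 is taken to mean the least significant bit.

```python
def stick_bit_at_one(byte, bit_index):
    """Return the byte with the given bit (0 = least significant) forced to 1."""
    return byte | (1 << bit_index)


stored = ord("8")                          # ASCII code of the digit "8"
read_back = stick_bit_at_one(stored, 0)    # the faulty cell always reads 1

print(f"{stored:08b} -> {read_back:08b}")  # 00111000 -> 00111001
print(chr(stored), "->", chr(read_back))   # 8 -> 9
```

A digit whose lowest bit is already 1, such as "9", would pass through the same faulty cell unchanged, which is one reason such faults can go unnoticed for a long time.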

Solutions

Several approaches have been developed to deal with unwanted bit-flips.

This problem can be mitigated by using DRAM modules that include extra memory bits and memory controllers that exploit these bits. These extra bits are used to record parity or to use an error-correcting code (ECC). Parity allows the detection of all single-bit errors (actually, any odd number of wrong bits). The most common error correcting code, a SECDED Hamming code, allows a single-bit error to be corrected and (in the usual configuration, with an extra parity bit) double-bit errors to be detected. Chipkill ECC is a more effective version that also corrects for multiple bit errors, including the loss of an entire memory chip.
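The single-error-correcting, double-error-detecting (SECDED) behaviour described above can be illustrated with a toy extended Hamming code that protects 4 data bits with 4 check bits. This is a sketch for illustration only: real ECC memory protects much wider words, the bit layout chosen here is an assumption, and the names `encode`, `decode` and `flip` are hypothetical.

```python
DATA_POSITIONS = (3, 5, 6, 7)   # Hamming positions that hold the 4 data bits


def encode(data):
    """Encode a 4-bit value as an 8-bit extended Hamming codeword (list of bits).

    Index 0 holds the extra overall parity bit; indices 1, 2 and 4 hold the
    Hamming check bits; the remaining indices hold data.
    """
    if not 0 <= data < 16:
        raise ValueError("data must fit in 4 bits")
    code = [0] * 8
    for bit, pos in enumerate(DATA_POSITIONS):
        code[pos] = (data >> bit) & 1
    for p in (1, 2, 4):
        # Each check bit makes the parity of the positions it covers even.
        code[p] = sum(code[i] for i in range(1, 8) if i & p) % 2
    code[0] = sum(code[1:]) % 2   # overall parity over the other 7 bits
    return code


def decode(code):
    """Return (data, status); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for i in range(1, 8):
        if code[i]:
            syndrome ^= i         # XOR of the positions of all set bits
    overall = sum(code) % 2       # 1 means an odd number of bits flipped
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Treated as a single-bit error; the syndrome is its position
        # (0 means the overall parity bit itself was hit).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even overall parity: two bits flipped.
        return None, "uncorrectable"
    data = sum(code[pos] << bit for bit, pos in enumerate(DATA_POSITIONS))
    return data, status


def flip(code, *positions):
    """Return a copy of the codeword with the given bit positions inverted."""
    out = list(code)
    for p in positions:
        out[p] ^= 1
    return out


word = encode(0b1011)
print(decode(word))               # (11, 'ok')
print(decode(flip(word, 5)))      # (11, 'corrected')
print(decode(flip(word, 2, 6)))   # (None, 'uncorrectable')
```

Without the bit at index 0, a double-bit error would produce a non-zero syndrome that is indistinguishable from a single-bit error and would be "corrected" into wrong data; the extra parity bit is what allows the two cases to be told apart.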

Implementations

Seymour Cray famously said "parity is for farmers" when asked why he left this out of the CDC 6600. He included parity in the CDC 7600, which caused pundits to remark that "apparently a lot of farmers buy computers" (see parity bit#History). The original IBM PC and all PCs until the early 1990s used parity checking. Later ones mostly did not. Wider memory buses make parity and especially ECC more affordable. Many current microprocessor memory controllers, including almost all AMD 64-bit offerings, support ECC, but many motherboards, in particular those using low-end chipsets, do not.

In a few cases, systems with a non-ECC memory controller can still gain most of the benefits of ECC memory by using EOS memory modules.

An ECC-capable memory controller as used in many modern PCs (mostly medium- to high-end workstation and server-class) can detect and correct errors of a single bit per 64-bit "word" (the unit of bus transfer), and detect (but not correct) errors of two bits per 64-bit word. The BIOS in some computers, when matched with operating systems such as some versions of Linux, Mac OS, and Windows, allows counting of detected and corrected memory errors, in part to help identify failing memory modules before the problem becomes catastrophic.

Error detection and correction depends on an expectation of the kinds of errors that occur. Implicitly, it has been assumed that the failure of each bit in a word of memory is independent and hence that two simultaneous errors are improbable. This used to be the case when memory chips were one bit wide (typical in the first half of the 1980s). Now many bits of a word are stored in the same chip, so a single chip fault can affect several of them. This weakness is addressed by various technologies: Chipkill (IBM), Extended ECC (Sun Microsystems), Chipspare (Hewlett Packard) and SDDC, Single Device Data Correction (Intel).

DRAM may provide increased protection against soft errors by relying on error correcting codes. Such error-correcting memory, known as ECC or EDAC-protected memory, is particularly desirable for highly fault-tolerant applications, such as servers, as well as deep-space applications due to increased radiation. Some systems also "scrub" the memory, periodically reading all addresses and writing back corrected versions if necessary to remove soft errors.

Interleaving distributes the effect of a single cosmic ray, which may upset several physically neighboring bits, across multiple words by assigning neighboring bits to different words. As long as a single event upset (SEU) does not exceed the error threshold (e.g., a single error) in any particular word between accesses, it can be corrected (e.g., by a single-bit error correcting code), and an effectively error-free memory system may be maintained.
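One simple way to picture interleaving is a row of cells in which consecutive physical positions belong to different words. The mapping below is a hypothetical layout chosen for illustration, not the scheme of any particular memory device.

```python
def physical_index(word, bit, num_words):
    """Physical position of logical bit `bit` of `word` in an interleaved row."""
    return bit * num_words + word


def logical_location(index, num_words):
    """Inverse mapping: the (word, bit) pair stored at a physical position."""
    return index % num_words, index // num_words


def errors_per_word(first, length, num_words):
    """Count how many bits of each word a burst of adjacent upsets hits."""
    counts = [0] * num_words
    for index in range(first, first + length):
        word, _ = logical_location(index, num_words)
        counts[word] += 1
    return counts


# Four words interleaved: a burst of four adjacent cells hits one bit per word.
print(errors_per_word(first=9, length=4, num_words=4))   # [1, 1, 1, 1]
# A longer burst puts two errors into one word, beyond single-bit correction.
print(errors_per_word(first=9, length=5, num_words=4))   # [1, 2, 1, 1]
```

With this layout, any burst no longer than the number of interleaved words leaves at most one flipped bit per word, which a single-bit error correcting code can then repair.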

Error-correcting memory controllers traditionally use Hamming codes, although some use triple modular redundancy (TMR). The latter is sometimes preferred because its hardware is faster than Hamming error correction hardware, although it requires storing three copies of the data. Space satellite systems often use TMR, although satellite RAM usually uses Hamming error correction.
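Triple modular redundancy can be sketched as a bitwise majority vote over three stored copies of a word. The function name `tmr_vote` is illustrative; hardware performs the same vote with simple gates, which is why it can be fast.

```python
def tmr_vote(a, b, c):
    """Bitwise majority of three copies of a word."""
    return (a & b) | (a & c) | (b & c)


word = 0b10110010
corrupted = word ^ 0b00001000      # one copy suffers a single bit flip

# The two intact copies outvote the corrupted one.
print(tmr_vote(word, corrupted, word) == word)        # True
# If two copies are wrong in the same bit, the vote returns the wrong value.
print(tmr_vote(corrupted, corrupted, word) == word)   # False
```

Because each bit is voted on independently, errors in different bit positions of different copies are all masked; only the same bit failing in two copies defeats the scheme.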

Many early implementations of ECC memory mask correctable errors, acting "as if" the error never occurred, and only report uncorrectable errors. Modern implementations log both correctable errors (CE) and uncorrectable errors (UE). Some people proactively replace memory modules that exhibit high error rates, in order to reduce the likelihood of uncorrectable error events.
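On Linux systems with an EDAC driver loaded, the correctable and uncorrectable error counters are exposed as text files under sysfs. The sketch below assumes the usual layout, `/sys/devices/system/edac/mc/mc*/ce_count` and `ue_count`; whether these files exist depends on the kernel, the driver and the hardware, and the function name `read_error_counts` is illustrative.

```python
from pathlib import Path

# Assumed location of the EDAC memory-controller entries on Linux.
EDAC_ROOT = Path("/sys/devices/system/edac/mc")


def read_error_counts(root=EDAC_ROOT):
    """Return {controller: (correctable, uncorrectable)} for each controller found.

    Returns an empty dict if the directory does not exist, for example when
    no EDAC driver is loaded or the platform has no ECC support.
    """
    root = Path(root)
    if not root.is_dir():
        return {}
    counts = {}
    for controller in sorted(root.glob("mc[0-9]*")):
        try:
            ce = int((controller / "ce_count").read_text())
            ue = int((controller / "ue_count").read_text())
        except (OSError, ValueError):
            continue   # skip entries that cannot be read or parsed
        counts[controller.name] = (ce, ue)
    return counts


if __name__ == "__main__":
    for name, (ce, ue) in read_error_counts().items():
        print(f"{name}: {ce} correctable, {ue} uncorrectable")
```

Polling such counters periodically and comparing successive readings is one way to spot a module whose correctable error rate is rising.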

Many ECC memory systems use an "external" EDAC circuit between the CPU and the memory. Some DRAM chips include "internal" on-chip error correction circuits. A few systems with ECC memory use both internal and external EDAC systems; the external EDAC system should be designed to correct certain errors that the internal EDAC system is unable to correct.

Modern desktop and server CPUs integrate the EDAC circuit into the CPU, especially with the shift towards CPU-integrated memory controllers, which are associated with the NUMA architecture.

Registered memory

Main article: Registered memory

Registered, or buffered, memory is not the same as ECC; these strategies perform different functions. It is usual for memory used in servers to be both registered, to allow many memory modules to be used without electrical problems, and ECC, for data integrity. Memory used in desktop computers is usually neither, for economy. However, unbuffered (not-registered) ECC memory is available, and some non-server motherboards support ECC functionality of such modules when used with a CPU that supports ECC. Registered memory does not work reliably in motherboards without buffering circuitry, and vice versa.

Pros and cons of ECC

Ultimately, there is a trade-off between protection against unusual loss of data and a higher cost.

ECC protects against undetected data corruption, and is used in computers where such corruption is unacceptable, as with some scientific and financial computing applications and as file servers. ECC also reduces the number of crashes, particularly unacceptable in multi-user server applications and maximum-availability systems.

Most motherboards and many processors for less critical applications are not designed to support ECC, for economy. Some such boards and processors are able to support unbuffered (not registered) ECC, but will also work with non-ECC memory; a BIOS setting enables ECC functionality if ECC RAM is fitted.

ECC memory costs more, as each bank requires 9 memory chips compared to 8 for non-ECC memory. In some cases the price ratio is close to, or even below, 9/8: for example, on 2008/11/30, on Crucial.com, an ECC CL5 unbuffered 2GB DDR2-667 DIMM cost $30 while the corresponding non-ECC part cost $28, a difference of about 7%. However, some ECC modules cost twice as much as their non-ECC equivalents (Crucial CT12872Z40B and CT12864Z40B, Jan 2009). ECC-supporting motherboards, chipsets, and processors may also be more expensive.
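The figure of 9 memory chips compared to 8 is consistent with the number of check bits a SECDED Hamming code needs for a 64-bit word. The sketch below assumes the standard Hamming bound plus one overall parity bit, and assumes chips that each contribute 8 bits of the word; the function name is illustrative.

```python
def secded_check_bits(data_bits):
    """Check bits for a SECDED Hamming code over `data_bits` data bits.

    A Hamming code needs the smallest r with 2**r >= data_bits + r + 1;
    one extra overall parity bit adds double-error detection.
    """
    r = 0
    while 2**r < data_bits + r + 1:
        r += 1
    return r + 1


for m in (8, 16, 32, 64, 128):
    k = secded_check_bits(m)
    print(f"{m:3d} data bits + {k} check bits = {m + k:3d} bits "
          f"(ratio {(m + k) / m:.3f})")
```

For 64 data bits this gives 8 check bits, or 72 bits in total, which is exactly 9/8 of the unprotected word. The loop also shows why wider words make ECC proportionally cheaper: the check bits grow much more slowly than the data bits.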

ECC may lower memory performance by around 2–3 percent on some systems, depending on application and implementation, due to the additional time needed for ECC memory controllers to perform error checking. However, modern systems integrate ECC testing into the CPU, generating no additional delay to memory accesses.

References

Single Event Upset at Ground Level, Eugene Normand, Member, IEEE, Boeing Defense & Space Group, Seattle, WA 98124-2499
Borucki, "Comparison of Accelerated DRAM Soft Error Rates Measured at Component and System Level", 46th Annual International Reliability Physics Symposium, Phoenix, 2008, pp. 482–487
Gary M. Swift and Steven M. Guertin. "In-Flight Observations of Multiple-Bit Upset in DRAMs". Jet Propulsion Laboratory
Schroeder, Bianca; Pinheiro, Eduardo; Weber, Wolf-Dietrich (2009). "DRAM Errors in the Wild: A Large-Scale Field Study" (PDF). SIGMETRICS/Performance. ACM. ISBN 978-1-60558-511-6.
http://www.ece.rochester.edu/~xinli/usenix07/
Li, Huang, Shen, Chu (2010). "A Realistic Evaluation of Memory Hardware Errors and Software System Susceptibility" (PDF). Usenix Annual Tech Conference 2010.
"CDC 6600". Microsoft Research. Retrieved 2011-11-23.
"Parity Checking". Pcguide.com. 2001-04-17. Retrieved 2011-11-23.
"Using StrongArm SA-1110 in the On-Board Computer of Nanosatellite". Tsinghua Space Center, Tsinghua University, Beijing. Retrieved 2009-02-16.
"Actel engineers use triple-module redundancy in new rad-hard FPGA". Military & Aerospace Electronics. Retrieved 2009-02-16.
"SEU Hardening of Field Programmable Gate Arrays (FPGAs) For Space Applications and Device Characterization". Klabs.org. 2010-02-03. Retrieved 2011-11-23.
"FPGAs in Space". Techfocusmedia.net. Retrieved 2011-11-23.
"Commercial Microelectronics Technologies for Applications in the Satellite Radiation Environment". Radhome.gsfc.nasa.gov. Retrieved 2011-11-23.
Doug Thompson, Mauro Carvalho Chehab. "EDAC - Error Detection And Correction". 2005 - 2009. "The 'edac' kernel module goal is to detect and report errors that occur within the computer system running under linux."
A. H. Johnston. "Space Radiation Effects in Advanced Flash Memories". NASA Electronic Parts and Packaging Program (NEPP). 2001.
AMD-762™ System Controller Software/BIOS Design Guide, p. 179
Typical unbuffered ECC RAM module: Crucial CT25672BA1067
Specification of desktop motherboard that supports both ECC and non-ECC unbuffered RAM with compatible CPUs
"Discussion of ECC on pcguide". Pcguide.com. 2001-04-17. Retrieved 2011-11-23.
Benchmark of AMD-762/Athlon platform with and without ECC
