Equipment Reliability Institute
your reliability newsletter
February, 2001

Larry George disagreed with some things Kirk Gray and I said in the Fall 2000 issue. He agreed to write an article, and here it is.

Some readers will find value in "Machinery Metal Fatigue" by Guil Cornejo. His answers to a question I asked him last year made so much sense that I asked him to amplify those answers into an article.

Commencing now, we hope each issue will carry a practical suggestion from one of our specialists, a solution to some difficulty he has often observed in his consulting work.

Best wishes,
Wayne

*******************************

Measure Field Reliability with Statistics
by Larry George

Wayne Tustin kindly sends me the newsletters. I had the impertinence to dispute a few statements made in the previous newsletter by Gray and Tustin. Wayne invited me to expound. His tolerance is commendable.

Gray and Tustin are right about the need for HALT and HASS, but reliability statistics are valuable too. Statistics convert data into actionable information, information that helps you decide whether to do anything and what to do, to what, when, and how much. This information can save half of your field service costs, double profit from service, or have unexpected consequences that companies don't disclose. Misunderstandings about what reliability is and which data is necessary to measure it limit the value of reliability statistics. This article describes reliability prediction and estimation from data required by generally accepted accounting principles.

Blanchard says reliability is "the probability that a system or product will perform in a satisfactory manner for a given period of time when used under specified operating conditions." The military standard definition of reliability is, "the probability that an item will perform a required function without failure under stated conditions for a stated period of time."

Probability has stood the test of time as a useful measure of survival randomness, so reliability is P[Life > age] for ages within the useful product life, whether for hardware, electronics, or humans. The time variable is age for most products, whether in calendar hours or operating hours, miles, cycles, etc. As far as customers are concerned, the only appropriate operating conditions are field conditions, not in a laboratory. Reliability is not MTBF.

Reliability engineers and their managers believe they have to test to measure reliability. Have you ever said, "We need to test at least n units for at least t hours to verify P[R > .95] > .9"? You figure out the smallest n and t you can possibly test (http://www.sre.org/sresoft.htm). Then your manager says you can test only half as many units for half the time. Typically, n and t are based on an incorrect constant-failure-rate assumption, thereby eliminating any chance of learning actionable information. (A constant failure rate implies the absence of infant mortality, wearout, and any need for maintenance.)
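To make the n-and-t arithmetic concrete, here is a minimal sketch of the usual zero-failure demonstration calculation. It deliberately uses the constant-failure-rate assumption criticized above, because that is what such test plans typically rest on. The 1000-hour mission time and the function names are assumptions for illustration only.

```python
import math


def zero_failure_unit_hours(reliability, confidence, mission_hours):
    """Failure-free unit-hours (n * t) needed to show
    P[R(mission_hours) >= reliability] >= confidence,
    ASSUMING a constant failure rate (exponential life)."""
    return mission_hours * math.log(1.0 - confidence) / math.log(reliability)


def achieved_confidence(reliability, mission_hours, unit_hours_tested):
    """Confidence actually demonstrated by a failure-free test of the
    given total unit-hours, under the same constant-rate assumption."""
    return 1.0 - reliability ** (unit_hours_tested / mission_hours)


# Hypothetical case: show R >= 0.95 at 1000 hours with 90% confidence.
required = zero_failure_unit_hours(0.95, 0.90, 1000.0)
units_at_1000_h = math.ceil(required / 1000.0)  # units if each runs 1000 hours

# The manager allows half as many units for half the time.
halved_unit_hours = (units_at_1000_h // 2) * 500.0
halved_confidence = achieved_confidence(0.95, 1000.0, halved_unit_hours)

print(units_at_1000_h, round(halved_confidence, 3))
```

With these assumed numbers the plan calls for 45 units, and the halved test demonstrates only about 43% confidence instead of 90%. Neither figure says anything about infant mortality or wearout, because the assumption rules them out.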

People believe that it is necessary to track at least a sample by serial number from birth to death to estimate field reliability. This data gives ages at failures and survivors' ages, which are sufficient but not necessary. Most companies have given up tracking parts by serial number because of errors, data storage requirements, and failure to use actionable reliability information. Fortunately, tracking parts by serial number is not necessary for either field reliability prediction or estimation.

Reliability Prediction?

People make MTBF predictions, argue about them, and compare lies. "My MTBF is bigger than yours." Most MTBFs are predictions, seldom verified. They are predictions of averages, not age-specific reliability. Have you seen predictions in the range of 500,000 hours for computer hardware? That's 250 years for a computer operated 2000 hours per year, M-F, 9-4, or more than 50 years for continuous operation.
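The arithmetic behind those figures can be checked in a few lines. The sketch below also adds one standard fact that is not in the paragraph above: if the failure rate really were constant, only about 37% of units would survive to the MTBF. Variable names are illustrative.

```python
import math

MTBF_HOURS = 500_000
OFFICE_HOURS_PER_YEAR = 2_000            # the article's office-use figure
CONTINUOUS_HOURS_PER_YEAR = 24 * 365     # 8760 hours

years_office_use = MTBF_HOURS / OFFICE_HOURS_PER_YEAR
years_continuous = MTBF_HOURS / CONTINUOUS_HOURS_PER_YEAR

# Under a constant failure rate, P[Life > MTBF] = exp(-1).
survive_to_mtbf = math.exp(-1.0)

print(years_office_use, round(years_continuous, 1), round(survive_to_mtbf, 3))
```

An MTBF is an average, so even taken at face value it is not the age that a typical unit reaches.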

To predict age-specific field reliability, use field reliability of comparable products. Designs may change, but other factors (process, environment, and customers) that determine field reliability don't. The field reliability of comparable products provides a reasonable, relative reliability prediction. Scale the fielded products' age-specific failure rates to take changes in MTBF predictions into account to make an age-specific reliability prediction [George and Langfeldt].
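One simple reading of "scale the fielded products' age-specific failure rates" is a proportional adjustment by the ratio of the MTBF predictions. The sketch below shows that reading only; it is an assumption and not necessarily the procedure in George and Langfeldt. The function name and the numbers are hypothetical.

```python
def scale_failure_rates(field_rates, mtbf_pred_old, mtbf_pred_new):
    """Scale a comparable product's age-specific failure rates by the ratio
    of MTBF predictions (old design / new design).

    ASSUMPTION: proportional scaling. A new design predicted to have twice
    the MTBF gets half the failure rate at every age, so the shape of the
    field failure rate function is preserved.
    """
    factor = mtbf_pred_old / mtbf_pred_new
    # Rates are probabilities per age interval, so cap them at 1.
    return [min(1.0, rate * factor) for rate in field_rates]


# Hypothetical monthly failure rates observed on the fielded product.
old_field_rates = [0.020, 0.010, 0.008, 0.008]
predicted = scale_failure_rates(old_field_rates, 100_000.0, 200_000.0)
print(predicted)
```

The point of the design is that the field data supplies the shape (infant mortality, wearout) while the MTBF predictions supply only a relative level.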

Alternatives to Test and MTBF Prediction

It is a waste of time and credibility to track annual failure rate (AFR) and argue about wiggles in monthly AFR charts. AFR, annual returns divided by the installed base, is an average and provides little actionable information, too late, and too imprecisely. It is a waste of talent, ability, and initiative not to use actionable information from available data.

Several clients have asked for age-specific reliability predictions, because their customers asked. They wanted to know the probability of being dead on arrival, the probability of failure in the first month, first three months, six months, year, etc. Age-specific reliability predictions provide actionable information because, although designs change, age-specific reliability doesn't change, much. Designs change, but manufacturing, packaging, shipping, installation, environment, and customers don't. Until there's field experience with new products, age-specific reliability predictions help plan warranty, service, spares production, and burn-in and assist the designers.
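The questions those customers asked map directly onto age-specific failure rates. As a rough sketch, assuming a dead-on-arrival probability and monthly failure rates are already in hand (the values below are made up for illustration), the probability of failure by each age follows from multiplying survival probabilities:

```python
def failure_probability_by_age(doa_prob, monthly_rates, ages=(1, 3, 6, 12)):
    """Return {age_in_months: P[failed by that age]}.

    doa_prob      -- probability of being dead on arrival (age 0)
    monthly_rates -- failure rate in month 1, 2, ..., conditional on
                     having survived to the start of that month
    """
    survival = 1.0 - doa_prob
    result = {0: doa_prob}
    for month, rate in enumerate(monthly_rates, start=1):
        survival *= 1.0 - rate
        if month in ages:
            result[month] = 1.0 - survival
    return result


# Hypothetical inputs: 1% dead on arrival, then a decreasing failure rate.
rates = [0.020, 0.015, 0.012, 0.010, 0.010, 0.010,
         0.008, 0.008, 0.008, 0.008, 0.008, 0.008]
for age, prob in failure_probability_by_age(0.01, rates).items():
    print(f"P[failed by month {age}] = {prob:.3f}")
```

The same table feeds warranty reserves, spares production, and burn-in decisions, which an average such as AFR cannot do.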

It's not necessary to track products and parts by serial number to estimate age-specific reliability. Tracking parts by serial number requires about 1000 times as much data storage capacity and probably incurs more than 1000 times as many errors, compared to ships and returns data (table 1). Generally accepted accounting principles require ships and returns data, which is sufficient for estimating age-specific reliability. That means that your company has sufficient data [George]. Ships and returns are population data, so reliability estimates from them have no sample uncertainty.

Table 1. 1988 Ford V-8 460-cubic-inch Drivetrain Ships and Returns

Month     Shipments   Monthly returns
Aug-87          213                18
Sep-87         6439               797
Oct-87         6951              1291
Nov-87         5715              1511
Dec-87         5390              1791
Jan-88         6336              2282
Feb-88         6319              2628
Etc.           Etc.              Etc.
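Why ships and returns are enough can be seen from the relationship between them: each month's returns are the sum, over every earlier shipment cohort, of that cohort's size times the return rate at its current age. The sketch below shows only that forward relationship, using the seven months listed in table 1. It is not the estimator from the George reference, and the trial rates are hypothetical. Estimation amounts to finding the age-specific rates that make the expected returns match the observed ones.

```python
SHIPMENTS = [213, 6439, 6951, 5715, 5390, 6336, 6319]   # Aug-87 .. Feb-88
RETURNS = [18, 797, 1291, 1511, 1791, 2282, 2628]


def expected_returns(shipments, return_rates_by_age):
    """Expected returns in month t = sum over cohorts s <= t of
    shipments[s] * return_rates_by_age[t - s] (age 0 = month of shipment)."""
    out = []
    for t in range(len(shipments)):
        total = 0.0
        for s in range(t + 1):
            age = t - s
            if age < len(return_rates_by_age):
                total += shipments[s] * return_rates_by_age[age]
        out.append(total)
    return out


def squared_error(observed, expected):
    """Mismatch between observed and expected monthly returns."""
    return sum((o - e) ** 2 for o, e in zip(observed, expected))


# HYPOTHETICAL trial rates by age in months, for illustration only.
trial_rates = [0.08, 0.10, 0.06, 0.05, 0.05, 0.04, 0.04]
fit = expected_returns(SHIPMENTS, trial_rates)
print([round(x) for x in fit], round(squared_error(RETURNS, fit)))
```

Because these returns include repeat returns of the same drivetrain, the rates here are return rates by age, not first-failure probabilities.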

Figure 1 shows the field reliability estimated from the ships and returns data in table 1. It shows two reliability functions, one for the age at first warranty return and one for the age between subsequent returns. The probability of a drivetrain's being returned in the first month was more than 15% initially and 18% subsequently. The former indicates that many were defective practically from delivery. The latter indicates that the problems didn't get fixed. The 1988 Ford V-8 460-cubic-inch engine was the last Ford engine with a carburetor, a very unhappy engine.

Figure 1. 1988 Ford V-8 460-cubic-inch drivetrain field reliability

Age-specific failure rates help failure analysis

The failure rate function shows what's happening (see figure 2). Process defects cause infant mortality, evidenced by an initially decreasing failure rate. Design defects cause prematurely increasing failure rates. Other phenomena, such as warranty expiration anticipation, preventive maintenance, and periodic inspections, also manifest themselves.

Figure 2. Age-specific failure rates per month and their possible causes

Engineers regard design defects as more significant than process defects. They assume that their designs will be produced, packaged, shipped, installed, and operated in a manner that achieves inherent reliability. Design defects cause premature wearout, which becomes apparent pretty early in the product life cycle, although not as early as process defects, which cause infant mortality. Engineers should be reassured to know that, for most products, retirement occurs before wearout, so the failure rate function decreases with age.

Conclusion

Don't give up on statistics, even for reliability predictions. Population statistics eliminate sample uncertainty and help you predict, measure, and use age-specific field reliability, without tracking parts by serial number. Which do you prefer, randomness with uncertainty or without? Uncertainty means you're gambling without knowing the odds.

References

  • Gray, Kirk and Wayne Tustin, "Electronics Testing into the 21st Century: Success in Test Is in Capabilities, Not Specifications," ERI News - Reliability Newsletter, Equipment Reliability Institute, Nov. 2000.
  • George, L. L., "Field Reliability Estimation Without Life Data," ASA, SPES Newsletter, Dec. 1999, http://web.utk.edu/~asaqp/newsletters/1299newsletter.pdf.
  • George, L. L. and Eva Langfeldt, "Age-Specific Reliability Prediction," to appear in ASQ Reliability Review, 2001.

Larry George is an ASQ Certified Reliability Engineer. He has a Ph. D. in industrial engineering and operations research, with a minor in statistics. He taught for 11 years, worked for a national laboratory for 11 years, and has worked in the real world for more than 20 years. ASQ just elected him as a Fellow. Contact him at pstlarry@home.com, 925 447 4969, or http://members.home.net/pstlarry.

Eva Langfeldt edited this article and made it readable. Eva has her own editorial services business, Text Support, with clients in high-tech, publishing, and marketing. Contact her at 925 314 9588 or eva@megapathdsl.net.

Larry offers training in field reliability analysis and applications. It helps you estimate your products' and service parts' field reliability from your data and use that information to help solve your problems. Contact Wayne or Larry for course information. Send data for free samples of field reliability estimates and their applications.

*******************************

Machinery metal fatigue, brief practical notes
by Guil Cornejo

Some people refer to high cycle fatigue (HCF) fracture as representative of a fatigue induced by many stress cycles. This could be caused by a self resonance or a near resonance of a component such as a turbine blade. Either a low stress or high stress condition goes together with HCF resonance or its counterpart, low cycle fatigue (LCF).

HCF resonance fracture is usually accompanied by characteristic "beach marks" at the point of fracture. What is implicit here is the fact that for a structure (blade, rotor, etc.) to resonate it must be stressed within the elastic range of the stress-strain diagram or within Young's modulus of elasticity. This type of failure is not usually accompanied by heat marks like blueing or metal discoloration. Fractures sometimes show that parts rubbed together. The shiny "fingerprint" shows the effect of rubbing (lapping) between the two sides of the initiating crack. The fracture point then usually reveals "beach marking" like what we see at the ocean beach as each wave comes in and then leaves the shore.

Low cycle fatigue means a couple of things. Primarily LCF is a failure induced by stressing a component just above the Young's Modulus range or way into the material's plastic range. This type of failure, if pure plastic, will have a typical characteristic failure pattern at the fracture point (like breaking a nail by bending it back and forth, in about 3 or 5 cycles). Or the failure may show signs of heat discoloration at the fracture point. The metal grain (austenitic, gamma etc.) is usually revealed under the microscope and it is unique, recognizable and often clearly undisturbed.

Unfortunately, there is often a combination of both HCF and LCF at a fracture point. As the stress increases in a component, as it fractures during HCF, the failure may accelerate from HCF to LCF. This is where the insight of the engineer and the metallurgist comes into play, defining the fracture. Sometimes, the failure can be very complex; after a couple of LCF bends, the failure accelerates into HCF. Finally, the "high" and the "low" cycle adjectives are sometimes relative. For a rotor that is crunching rock at very low speed, a high cycle excitation would be close to LCF frequency range. In contrast, for a gas turbine rotor running say at 400 CPS or 24000 RPM the distinction between HCF and LCF would be much easier since the excitation is way above the clear cut LCF range.

Guillermo "Guil" Cornejo is president of RPM & Predictive Engineering Inc., a California, USA, corporation. He has 20 years of factory and worldwide field experience solving turbomachinery/powertrain vibration problems in industrial and marine applications. If you want to know more about Guil, visit www.equipment-reliability.com and click Consulting then Specialists. You can e-mail Guil at Gcornejo@equipment-reliability.com.

*******************************

Questions our readers have asked...

You saw Guil Cornejo's answer to my question, and below is Kirk Gray's answer to a reader's question. Now it's your turn. What question would you like to ask one of ERI's specialists? Send it to webmaster@equipment-reliability.com

In this issue Kirk Gray (KG) tells us about an email he got from a Quality Engineer in Israel. Their initials (KG and SS) mark who is speaking in each paragraph.

(SS) My name is Samuel Sela (SS) and I am a Quality Engineer - CQE of ASQ - (Mech. Eng. -M.Sc.- in the past) in Rafael, Israel. I found your address in the Equipment Reliability Institute site in the Web, and I'll be very grateful if you'll accept to present your opinion in a few controversial issues among our people. So, for ESS of electronic boxes, containing PCB's, for military use, which fundamental specification do you recommend to apply?

Temperature cycling:

  • Temp. range
    (KG) The optimum temperature range should be determined from the last iteration of HALT. That is the HALT that is completed after the last improvement of the operating capabilities. Most electronics should be able to be cycled over a temperature range of 100 C. A larger temperature cycling range can increase the screening strength (that is, the ability to precipitate latent defects into patent defects) in a shorter screening time. In other words, the wider the range of temperature cycling, the fewer cycles required to find the same defects, and the shorter the HASS cycle.
  • Number of cycles
    (KG) It all depends on the temp range and rate of change. Initially a HASS process should be developed based on 4 to 6 thermal cycles. Since the product in HASS is continuously monitored the "HASS Chamber lifetime bathtub curve" should flatten out before the last screen cycle. If very few defects are found in the last thermal cycles, then the screen is too long and can be shortened. If a significant number of defects are found in the last thermal cycle then the screen is too short.
  • Rate
    (KG) Again, as in thermal cycles, more is better. The evidence from a Hughes? Rome Air Development Study done in the 1980s showed that the higher the rate of change, the higher the screening strength. Again, the higher the rate of change, the faster cycling can occur, the faster the HASS process and the lower the cost of HASS. In liquid immersion ESS, electronics are subjected to rates of thermal change of around 500 C per minute without damage to defect-free circuit boards. A good target for the thermal change rate is around 40 - 50 degrees C per minute on the product, not air temperature.
  • Time in max. temperature
    (KG) The benefit in thermal cycling is in inducing fatigue damage through the expansion and contraction of all the thousands of differing material interfaces that may have flaws in them. Therefore when most or all of the product reaches equilibrium at the extremes of the thermal cycling range, the next phase of the cycle can start. The typical dwell for any temperature in a HALT/HASS chamber (which forces the product temperature by high volume air circulation and overdriving the air temperature) is 10 - 20 minutes. Of course the biggest factor in the rate of change of the product is its thermal mass and power dissipation. Power supplies with heavy discrete analog components would take much longer to reach equilibrium than a handheld PDA, so again each HASS should be developed based on the product capabilities, not a "standard" process.
  • Functional test
    (KG) It is critical to continuously monitor the device under test, because industry evidence suggests that about 30% of latent defects are observable only while stress is applied; the product will be functional, and the defect undetectable, once the stress is removed. An example is a break in a metal (solder or wire trace) connection. During the cold contraction the metal break will separate enough to make an open circuit, yet when the product reaches room temperature the metal expands again to make contact and close the circuit. Sometimes a thermal interlock cannot be overridden for a HASS process. In that case it is acceptable to go beyond the (designed) operating temperature, with monitoring occurring when the temperature falls below the designed-in shutdown point.
  • Does the box have to be open?
    (KG) It is best in most cases if the box can be opened to allow for faster thermal changes. By helping reduce the mass, you can increase the thermal rate of change. Many of those doing HASS use a dedicated fixture to hold and operate the circuit cards and only use vibration on the final assembly to verify there are no loose connectors or fasteners. Again, each HASS should be developed based on the characteristics of the product and the production volumes and throughput.
  • PSD & frequency range
    (KG) HALT and HASS chambers use multi-axis repetitive shock producing a pseudo-random, broad frequency range without frequency control (typical range is from 200 Hz to 10,000 Hz with most of the energy in the range from 500 - 3000 Hz depending on the vibration table manufacturer). There are methods that can be used to dampen out certain frequencies, but it is better to improve the design to eliminate sensitivities or a weakness for a certain frequency if possible before changing the vibration spectrum. The level of vibration intensity during HASS should be based on the Upper Destruct Limit (UDL) found in HALT. A good rule of thumb is to use half the UDL level found in HALT. It is very important to verify that the level will not use a significant portion of fatigue life through a proof-of-screen (POS) before implementing a HASS process.
  • Duration
    (KG) See the explanation of number of cycles. The criteria should be the same after some data is gathered on the first 100 - 1000 units. If defects are being found towards the end of the last cycle, the screen may be too short; if no defects are found in the last cycle(s), the screen might be too long. Most screens are between 30 minutes and 2 hours, but then again it is specific to the product. Again, whatever HASS process is developed must use a POS to ensure you are not taking out a significant portion of fatigue life.
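Kirk's rules of thumb above can be put into a short sketch for a first-pass HASS profile. This is only an illustration under stated assumptions. The cycle-time formula (two ramps plus two dwells) is a simplification, the 10% threshold for a "significant" share of defects is hypothetical, and every number must come from your own HALT results and proof-of-screen.

```python
def thermal_cycle_minutes(temp_range_c, rate_c_per_min, dwell_min):
    """One thermal cycle: ramp up, dwell, ramp down, dwell.
    ASSUMES linear ramps at the product rate of change."""
    ramp = temp_range_c / rate_c_per_min
    return 2 * ramp + 2 * dwell_min


def hass_vibration_level(upper_destruct_limit):
    """Rule of thumb from the text: start at half the UDL found in HALT
    (same units as the UDL)."""
    return 0.5 * upper_destruct_limit


def assess_screen_length(defects_per_cycle, significant_fraction=0.10):
    """Compare defects found in the last cycle with the total found.
    significant_fraction is a HYPOTHETICAL threshold, not from the text."""
    total = sum(defects_per_cycle)
    if total == 0:
        return "no defects found; no basis to adjust"
    if defects_per_cycle[-1] / total >= significant_fraction:
        return "too short"
    return "may be shortened"


# Example: 100 C range, 40 C/min on the product, 10-minute dwells, 4 cycles.
cycle = thermal_cycle_minutes(100.0, 40.0, 10.0)
print(cycle, 4 * cycle)                      # minutes per cycle, total minutes
print(hass_vibration_level(40.0))            # hypothetical UDL of 40
print(assess_screen_length([12, 5, 2, 0]))   # defects found in cycles 1..4
```

With those example inputs the four-cycle screen runs 100 minutes, inside the 30-minute to 2-hour span Kirk mentions. The starting profile still has to pass a POS before it is used in production.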

(SS) Honestly, we use a few versions, originated from various references and it's difficult to get a common agreement for a standard specification.

(KG) The paradigm shift from traditional reliability test processes to HALT and HASS is a significant one that makes understanding and acceptance of these processes very difficult for those who have been following a "cookbook" of burn-in testing. The major shift is away from basing testing on maximum operational component and material specifications from the supplier, or on simulation of worst case end-use environmental conditions. HALT and HASS approach the problem with the following fundamental shift in perspective:

1) The best screen for an electronic assembly is one based on that assembly's operating and destruct limits,

2) That electronics has more than enough life to exceed its technological obsolescence and that rapidly using a small portion of fatigue life is the best method for removing the front end of the life-cycle bathtub curve or what is sometimes called "infant mortality". Infant mortality is simply a result of latent defects either in the design or the manufacturing process.

3) That the only "standard" in HALT and HASS is using step-stress to find the weak links and removing or improving them, and then developing a HASS based on destruct and operation limits. This requires a much better understanding of the product and the Physics-of-failure of latent defects than has been used in the past in electronics. The understanding and acceptance of the HALT and HASS approach still suffers from its success. Very few companies want to let their competitors know how they are able to reduce warranty returns 90% and reduce the time in test by 95% (Advanced Energy Industries, Inc. has documented this gain on one product type). The best manufacturing and test methods usually are considered to be a competitive advantage and are not the first to be published after an electronics company has discovered them.

Kirk Gray has over 21 years in the electronics manufacturing industry and the last 11 years in the application of HALT and HASS processes. Kirk is Vice Chairman of the Denver Chapter of the IEEE Reliability Society, Chairman of the IEEE/CPMT Technical Committee 7 on Reliability, and Registration Chairman for the annual IEEE/CPMT Workshops on Accelerated Stress Testing (AST) held in the fall each year. If you would like to contact him, please send an email to gray@equipment-reliability.com.


Reliability ain't free


Would you rather invest a little up front? Or pay a great deal later?

Invest a little extra now, to give your designers and test people the tools to do it right? Or pay dearly, later on, in delays, in performance penalties, in excessive warranty and other expenses?

"It" refers to the creation of reliable products.

Our products (home, office, factory, telecom, satellite, military, other) are increasingly complex. As a society, we are increasingly vulnerable when those products do not work reliably. A computer failure forces airport closures. Another closes grocery checkout lines. Another idles or mislocates an expensive satellite. Another stops my car in traffic.

Reliability requires a small training investment up-front. Your designers and your test people know what training they need. Approve their training requests. Invest in them.

 
Participate at ERI News

You are invited to send news of reliability-oriented events to supplement ERI's newsletter. Please send an email to the webmaster.
 
Vibration and Shock courses coming up


Wayne Tustin will teach the following short courses in vibration and shock measurement, analysis, calibration, testing, HALT, ESS and HASS:

ERI classes

Huntsville, Alabama, February 20-22, 2001

Hillsboro (Portland), Oregon, March 20-22, 2001

Pico Rivera (LA), CA, May 16-18, 2001

Farmingdale (LI), NY, June 6-8, 2001

In addition, Wayne will present a super-concentrated 1-day version at Grand Rapids, MI, March 27, 2001.
Details are available from Vibration Research, phone 616-669-3028, or send an email to john@vibrationresearch.com

*******

Society of Automotive Engineers

Troy, MI
April 18-20, 2001

*******

Applied Technology Institute

College Park, MD
April 9-12, 2001 (get more information from Wayne Tustin)

 
Announcements


European EMC Instructor

ERI is seeking an authority on European EMC directives to teach USA EMC practitioners. Please e-mail tustin@equipment-reliability.com

 
Contact information


ERI - Equipment Reliability Institute
1520 Santa Rosa Av.
Santa Barbara - CA - 93109
Tel/Fax: (805) 564-1260

Wayne Tustin tustin@equipment-reliability.com

Webmaster webmaster@equipment-reliability.com

Website http://www.equipment-reliability.com

 
Free Newsletter


ERI News is sent in both html and plain text formats. If you had any problems reading this newsletter, please let us know. Send an email to the webmaster, reporting your difficulties.

If you would like to subscribe to ERI News, go to our website, fill in the form "Free Newsletter" and hit the Submit button.

If you do not want to receive ERI's quarterly newsletter, please send a reply to this message with "remove" as subject.