Methods for the treatment of common cause failures in redundant systems

Berg Heinz-Peter; Görtz Rudolf; Kesten Jürgen

In memory of Sergei Antonov

Sergei Antonov graduated from the Mathematics Department of Moscow State University and worked at the Russian Academy of Sciences. He was a Master of Sport in alpinism and a multiple alpinism champion of Moscow and Russia.

He was one of the first members of the Gnedenko Forum and carried out a number of joint projects with I. Ushakov.

Sergei died in the Pamir mountains while climbing.

(1960-2006)

Sergei ANTONOV

SPARE SUPPLY SYSTEM FOR WORLDWIDE TELECOMMUNICATION SYSTEM GLOBALSTAR

Igor Ushakov, Sergei Antonov, Sumantra Chakravarty, Asad Hamid, Thomas Keliinoi


ABSTRACT

This work describes the Optimal Spare Allocator (OSA), a software tool for Globalstar, which is a worldwide satellite telecommunication system designed at QUALCOMM (San Diego, USA). The Globalstar spare supply system is hierarchical and has three levels: Central Spare Stock (CSS), Regional Spare Stocks (RSS) and On-Site Spare Stocks (OSS). The tool allows solving direct and inverse problems of optimal redundancy. The OSA computer model has a user-friendly interface and a convenient reporting utility.

KEYWORDS

Spare allocation, reliability, optimization, cost, software tool, steepest descent algorithm

GENERAL DESCRIPTION OF THE SPARE SUPPORT SYSTEM

We consider a hierarchical spare supply system for the Globalstar satellite telecommunication system. Globalstar is expected to have a number of base stations (gateways) dispersed all over the world. Successful operation of such a complex system depends on the ability to restore it quickly after a failure, and fast, effective gateway restoration in turn depends on a stock of field replaceable units (FRU). For this purpose, a hierarchical spare supply system (HSSS) is being designed (Ushakov 1994). The HSSS includes the central spare stock (CSS), regional spare stocks (RSS), and on-site spare stocks (OSS).

The diversity of gateways and the addition of new ones call for a computer tool capable of optimal spare allocation. The problems that arise are: (1) determining the optimal allocation of spares at each OSS depending on the size of the gateway, (2) determining the location and size of each RSS, and (3) determining the size of the CSS.

Figure 1. A hierarchical spare supply system

Gateway equipment consists of replaceable units. After each failure, a spare unit from the OSS replaces the failed unit, and the failed unit is sent to the repair base. Regional and central stocks are usually replenished periodically (with a priority request for refilling if a stock has reached some critical level). On-site stocks are small and use advance delivery: the OSS site sends a request to the RSS after each failure. The structure of the Globalstar HSSS is presented in Fig. 1.

FORMULATION OF GENERAL PROBLEM OF OPTIMAL SPARE ALLOCATION

Let the operational system (gateway) consist of N different types of spare units. Requests for a spare unit of type k, k = 1, 2, ..., N, arrive at the stock in accordance with a Poisson process with (failure) intensity $\lambda_k$. The unit costs $c_k$ are assumed to be known. A spare stock contains $x_1, x_2, \ldots, x_N$ units of the different types. The problem is to find the optimal allocation satisfying requirements on the stock reliability or on the total cost.

Let $X = (x_1, x_2, \ldots, x_N)$ be the vector of spares at the stock site, where $x_i$ is the number of spares of type i; let $P(X, \theta)$ be the reliability index characterizing a spare stock with X spares over a period of time $\theta$; and let $C(X)$ be the cost of the spares. Two optimization tasks (Gnedenko and Ushakov 1995) can be formulated as:

Direct: minimize the total cost of spares at the stock under the condition that the stock reliability index is not less than the required level $P^*$, i.e.,

$$\min_{X} \{\, C(X) \mid P(X, \theta) \ge P^* \,\}. \qquad (1)$$

Inverse: maximize the stock reliability index under the condition that the total cost of spares at the stock does not exceed an admissible level $C^*$, i.e.,

$$\max_{X} \{\, P(X, \theta) \mid C(X) \le C^* \,\}. \qquad (2)$$

ON-SITE SPARE STOCK

We assume that gateways are highly reliable and that their units are independent, so we neglect the possibility of overlapping system down times due to different causes. For highly reliable systems, the approximate formula for the OSS unreliability coefficient, $Q_{OSS}$, is

$$Q_{OSS} \approx \sum_{1 \le k \le N} \beta_k\, q_k(x_k). \qquad (3)$$

The weights in Eq. 3 are defined as $\beta_k = \lambda_k n_k \left( \sum_{1 \le k \le N} \lambda_k n_k \right)^{-1}$, $q_k(x_k)$ is the unreliability coefficient of units of type k (cumulative Poisson function with parameter $a_k = n_k \lambda_k \theta$), and $x_k$ is the number of spares of type k in the OSS. For highly reliable systems, the approximate formula for the OSS unavailability coefficient, $U_{OSS}$, is

$$U_{OSS} \approx \sum_{k=1}^{N} \frac{n_k\, q_k(x_k)}{x_k + 1}, \qquad (4)$$

where $\theta$ is the time delay corresponding to advance delivery.
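As an illustration of Eq. 3, the following minimal Python sketch evaluates the weighted unreliability of an on-site stock. It rests on assumptions that are not stated in the text: $q_k(x_k)$ is read as the Poisson tail probability that the demand within the delay exceeds $x_k$, and $n_k$ is read as the number of operating units of type k. The function names and the numerical values in the usage note are hypothetical.

```python
import math


def poisson_tail(a, x):
    """P(N > x) for N ~ Poisson(a): chance that demand exceeds x spares (assumed reading of q_k)."""
    cdf = sum(math.exp(-a) * a**j / math.factorial(j) for j in range(x + 1))
    return max(0.0, 1.0 - cdf)


def oss_unreliability(lam, n, x, theta):
    """Weighted stock unreliability in the spirit of Eq. 3.

    lam[k] : failure intensity of one unit of type k
    n[k]   : number of operating units of type k (assumed meaning of n_k)
    x[k]   : number of spares of type k
    theta  : delay, in the same time unit as 1/lam[k]
    """
    total_rate = sum(l * m for l, m in zip(lam, n))
    q_oss = 0.0
    for l, m, spares in zip(lam, n, x):
        beta = l * m / total_rate      # weight beta_k
        a = m * l * theta              # Poisson parameter a_k
        q_oss += beta * poisson_tail(a, spares)
    return q_oss
```

For example, `oss_unreliability([1/350000, 1/150000], [12, 4], [1, 1], 240.0)` evaluates a two-type stock with one spare of each type; adding spares can only lower the result.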

PREDICTING APPROXIMATE TRENDS

In many cases of practical interest, we are faced with the problem of sparing a highly reliable system. For a highly reliable system, $\max_k\{a_k\} \ll 1$, and the cumulative Poisson function may be approximated by its leading term (in practice, $\max_k\{a_k\} < 0.1$ is acceptable). Typically, there is at least one spare for every type of unit in a commercially deployed system. On the other hand, the total money allocated for sparing is generally limited. If $1 \le x_i \le 5$, $\ln(x_i!) \approx 0.9(x_i - 1)$ is a workable approximation in the Poisson function. These two simplifications linearize the Lagrange equation determining the optimal values of $x_i$ (Ushakov and Chakravarty 1998). For the goal function shown in Eq. 3 and for the inverse problem of redundancy we obtain

$$x_k = \mathrm{round}\left( \frac{K - \ln\big(c_k/(\beta_k C^*)\big) - a_k + \ln(0.9 - \ln(a_k))}{\ln(0.9 - \ln(a_k))} \right). \qquad (5)$$

The constant K in Eq. 5 can be found from the cost constraint $C(X) = \sum_k c_k x_k = C^*$.

REGIONAL AND CENTRAL STOCKS

An RSS is periodically refilled from the Central Spare Stock (CSS). The number and location of gateways served by a particular RSS may change over time as Globalstar develops. The best index characterizing the RSS appears to be its unreliability coefficient (3). The same might be said about the CSS, which is replenished by production (probably with different periods for different types of units). In principle, the solution for these cases is similar to the previous one, with the difference that the advance delivery period starts with the installation of a failed unit.

OPTIMAL SPARE ALLOCATOR (OSA) SOFTWARE TOOL

The "Optimal Spare Allocator" software tool has been developed for use in Globalstar gateways. Globalstar is a worldwide satellite telecommunication system that has gateways dispersed all over the world.

Figure 2. OSA tool: Map of hierarchical stock system.

Its spare supply system is hierarchical in nature. OSA is a GUI-driven, user-friendly tool designed to solve the direct and inverse problems of optimal redundancy for a multilevel hierarchical spare supply system.

Figure 3. OSA tool: a hierarchical spare supply system structure

It uses the relative increments of the goal function with respect to a unit of cost (steepest descent method) to solve the optimization problem.
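The idea of ranking candidate spares by the improvement of the goal function per unit of cost can be sketched in Python as below. This is not the OSA implementation, only a hedged illustration for the inverse problem with the goal function of Eq. 3; the function names are hypothetical, $q_k$ is assumed to be a Poisson tail, $n_k$ is assumed to be the number of operating units of type k, and all intensities are assumed positive.

```python
import math


def poisson_pmf(a, j):
    """P(N = j) for N ~ Poisson(a), computed in log space (requires a > 0)."""
    return math.exp(-a + j * math.log(a) - math.lgamma(j + 1))


def greedy_allocation(lam, n, cost, theta, budget):
    """Add spares one at a time, always picking the best gain per unit of cost.

    Adding the (x+1)-th spare of type k lowers the assumed tail P(N > x)
    by P(N = x + 1), so the weighted gain is beta_k * pmf(a_k, x + 1).
    """
    total_rate = sum(l * m for l, m in zip(lam, n))
    beta = [l * m / total_rate for l, m in zip(lam, n)]
    a = [m * l * theta for l, m in zip(lam, n)]
    x = [0] * len(lam)
    spent = 0.0
    while True:
        best, best_gain = None, 0.0
        for k in range(len(lam)):
            if spent + cost[k] > budget:
                continue  # this unit type is no longer affordable
            gain = beta[k] * poisson_pmf(a[k], x[k] + 1) / cost[k]
            if gain > best_gain:
                best, best_gain = k, gain
        if best is None:
            break  # nothing affordable improves the goal function
        x[best] += 1
        spent += cost[best]
    return x, spent
```

A greedy procedure of this kind yields a sequence of allocations that is cheap to compute; it is shown here only to make the "relative increment per unit of cost" criterion concrete.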

A PC with Windows 95 or NT operating system is needed for installing and running the OSA tool. The program's main window includes a menu of all available commands and a toolbar with the most frequently used operations. It has other windows that depict the "hierarchical tree" of the stock supply system (Fig. 3), a table of parameters characterizing a particular stock, including a list of units and quantities used, embedded spares if any, their cost, their mean time between failures etc.

Figure 4. OSA tool: calculation options.

Figure 5. OSA tool: Unit database (screenshot of the unit table with columns Part No, Name, MTBF, Cost and Comment).

Figure 6. OSA tool: Gateway specification (screenshot listing the operating units of the base station for on-site stock GW25, with columns Part No, Name, Qty and Standby).

QUALCOMM OPTIMAL SPARE ALLOCATION STOCKS

Stock: GW47. Level: On-Site. Spare unit delivery time / repair time: 240. Reliability: 0.995388.

Part No      Name          MTBF    Cost    Spare  Spare Cost
20-14074-1   TFU           350000  123.45  2      246.90
20-14703-1   CCA           150000  12.34   3      37.02
20-14875-1   Control Unit  200000  456.78  2      913.56
20-14917-1   ATM IC CCA    100000  111.11  3      333.33

Figure 7. OSA tool: sample of reporting

The OSA tool is flexible and offers various calculation options to the user. It is able to solve the direct and inverse problems of optimal redundancy with two different goal functions. It also offers two separate replenishment policies and lets the user set a minimum number of spares consistent with the total cost. The results of the calculations are presented in a report whose layout can be specified by the user. Reports generated by the OSA tool may be saved in ASCII format for further processing or documentation.

REFERENCES

1. Gnedenko, B. and I. Ushakov (1995). Probabilistic Reliability Engineering. John Wiley, New York.

2. Ushakov, I. (1994). Handbook of Reliability Engineering. John Wiley, New York.

SSARS 2007

22-29 July 2007 Sopot Poland

Summer Safety & Reliability Seminars

Special Issue # 2 on SSARS 2007

Invited Editors E. Zio and K. Kolowrocki

METHODS FOR THE TREATMENT OF COMMON CAUSE FAILURES IN REDUNDANT SYSTEMS

Berg Heinz-Peter, Görtz Rudolf, Kesten Jürgen

Bundesamt für Strahlenschutz, Salzgitter, Germany

Keywords

nuclear power plant, probabilistic safety assessment, simulation, common cause failure, modelling

Abstract

Dependent failures are extremely important in reliability analysis and must be given adequate treatment to avoid gross overestimation of reliability. German regulatory guidance documents for PSA stipulate that model parameters used for calculating frequencies should be derived from operating experience in a transparent manner. Progress has been made with the process oriented simulation (POS) model for common cause failure (CCF) quantification. A number of applications are presented for which results obtained from established CCF models are available, focusing on cases with a high degree of redundancy and small numbers of observed events.

1. Common cause failure analysis in the frame of probabilistic safety assessment

Systems are designed, operated and maintained so as to minimize potential failures, whether random, systematic or dependent. Dependent failures comprise secondary failures caused, e.g., by violation of operational conditions, and so-called commanded failures, in which a component fails due to violation of interface conditions. The residual part of the group of dependent failures is called common cause failures (CCF). To identify dependent failures, approaches have been extended to encompass potential interdependencies between systems or components. Secondary and commanded failures are supposed to be modelled explicitly as far as possible in fault tree models of the system, whereas common cause failures are taken into account in probabilistic safety assessment implicitly by parametric models.

In general, the most important defence against accidental component or system failures is the implementation of principles such as separation, diversity and redundancy. However, experience has shown that redundancy itself is not sufficient to avoid undesired events just because of possible dependent failures.

CCF of redundant safety-relevant systems have been of concern since quantitative estimation of the reliability of these systems began in the early 1970s, because this type of failure significantly affects their availability and reliability, leading in the worst case to a simultaneous loss of all redundancies.

Typical examples of CCF are miscalibration of sensors, incorrect maintenance, environmental impact on the field device and use of an inappropriate process fluid that plugs valves in different redundancies. Experience from numerous probabilistic safety assessments has shown that, especially for highly redundant systems in nuclear power plants, common cause failures tend to dominate the results of these assessments, such as the core damage frequency or the large early release frequency. Because the defences against common cause failures in place are generally rather effective, the number of events actually observed in nuclear power plants is limited, in particular with respect to events involving failures of all or at least many redundant components. However, the operational experience contains some information on potential common cause failures, i.e., partial failures that could have evolved into the complete failure of the common cause component group within a short period of time. Using this information requires, in one way or another, an extrapolation based on parametric models, which is extremely difficult to verify.

Despite these difficulties, significant progress has been made in recent years due to increasing operational experience, more systematic data collection and analysis, growing experience in probabilistic safety assessment and an enhanced exchange of data and methods both nationally and internationally.

Although the use of plant-specific data in probabilistic safety assessment is preferred, where events or information are lacking it is helpful to provide a generic data base taking into account all national experience and appropriate international data. Data bases like that of the OECD/NEA International Common Cause Failure Data Exchange Project allow data on many different components, such as valves, pumps and diesel generators, to be collected and analysed. The results of analysing these data also make it possible to assess and improve the effectiveness of defences against common cause failure events. For that purpose, data and information on events observed in operational experience have to be provided in sufficient detail.

In general, the treatment of common cause failures within probabilistic safety assessment requires four main steps: development of a system logic model, identification of common cause component groups, common cause modelling and data analysis, and quantification and interpretation of the results. For the quantitative part of the common cause failure assessment, models still have to be developed further, in particular with respect to applicability to highly redundant systems, suitability and traceability.

2. German practice

Probabilistic safety analyses (PSA) have been performed for all operating German nuclear power plants. Experience has shown that CCF in many cases tends to dominate the results of the PSA. Therefore, methods and results of CCF analyses receive a lot of attention in the discussions between regulator, technical experts, utilities and analysts.

Regulatory guidance is available in Germany for level 1+ PSA (a level 1+ analysis is understood to end at the onset of core damage but to take into account active containment functions) as part of periodic safety reviews of nuclear power plants. In view of the importance of CCF, a chapter in the German regulatory guidance documents is dedicated to dependent failures [6]-[7]. These failures comprise secondary failures caused by violation of operational or environmental conditions as well as commanded failures, in which an intact component fails due to violation of interface conditions, for example in the case of erroneous signals or failed energy supply. The residual part of the group of dependent failures is the common cause failures mentioned above. Secondary and commanded failures are to be modelled explicitly as far as possible in the fault tree models of the system. CCF, on the other hand, are taken into account in PSA by parameter models [2].

The guidelines mentioned above are currently undergoing the final steps of revision, in view of the fact that the Atomic Energy Act as amended in 2002 makes Periodic Safety Reviews (including PSA) mandatory. They do not prescribe specific CCF models. Rather, they demand that the parameters of any model used be derived in a clearly described way from operating experience. Thus, in German PSA practice, a variety of models have been used [1], [9], [10].

3. A process oriented simulation model (POS) for CCF quantification

3.1. Rationale and objectives

The question can be raised whether the established modelling, which mostly aims at failure probabilities, could be supported and complemented by an approach that models the entire CCF process in a more mechanistic manner, from the point in time of the root cause impact to the failures taking effect or being detected in the common cause component group (CCCG). Such a process oriented modelling approach is described and discussed in this paper. It represents a further elaboration of the modelling stages described in [3]-[4].

3.2. Model description

The method of stochastic simulation offers a convenient way to describe the model and to quantify its results. The sequence of stochastic variables displayed in Table 1 is supposed to adequately describe the CCF process.

Based on simulation of this sequence, the associated unavailabilities can be calculated.

The following fixed-value parameters are used throughout a simulation sequence:

• operation time TB

• number of components in the CCCG: r

• time between functional tests TFT

The sequence of variables and calculations defines a single simulation of the common cause failure process. Table 1 describes how each variable is either drawn from a stochastic assumption or calculated deterministically.

The probabilities W(m,r) of the event that the common cause impact affects exactly m out of r components are calculated by a recursive scheme that is detailed in [3]. Here, only the formulae up to r = 4 are given. Model parameters are a and $r_0$.

$$W(2,2) = 1 \qquad (1)$$

$$W(3,3) = a \qquad (2)$$

$$W(2,3) = 1 - a \qquad (3)$$

$$W(4,4) = a \cdot \left( a + (1 - a) \cdot \left( 1 - e^{-3/r_0} \right) \right) \qquad (4)$$

$$W(2,4) = (1 - a)^2 \qquad (5)$$

$$W(3,4) = 1 - W(4,4) - W(2,4) \qquad (6)$$

To facilitate handling of the necessary equations, model parameter r0 is replaced by:

$$c = \exp(1/r_0). \qquad (7)$$
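For concreteness, formulae (1) to (6) can be written as a small Python function. This is a sketch under stated assumptions: the function name is hypothetical, and the exponent in W(4,4) is read as $-3/r_0$, which is an interpretation of the printed formula rather than something confirmed by [3]. Larger groups would need the recursive scheme of [3], which is not reproduced here.

```python
import math


def impact_probabilities(r, a, r0):
    """Return {m: W(m, r)} for common cause component groups of size r = 2, 3 or 4."""
    if r == 2:
        return {2: 1.0}
    if r == 3:
        return {2: 1.0 - a, 3: a}
    if r == 4:
        # Assumed reading of formula (4): the exponent is -3 / r0.
        w44 = a * (a + (1.0 - a) * (1.0 - math.exp(-3.0 / r0)))
        w24 = (1.0 - a) ** 2
        return {2: w24, 3: 1.0 - w44 - w24, 4: w44}
    raise ValueError("only r = 2, 3, 4 are covered by formulae (1) to (6)")
```

By construction the probabilities for a given r sum to one, and W(3,4) stays non-negative for any a between 0 and 1 because W(4,4) never exceeds a.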

In the applications presented here, a model version has been used that is based on a simplified assumption regarding CCF identification. It is assumed that non-staggered testing is applied and that a CCF event is identified at the functional test following the first component failure. It is well known that conditions in the field are more complex. To account for this, effective test intervals for the POS analyses have been estimated from the information provided in the literature sources. The model assumptions can be modified in a straightforward manner to account for other situations such as staggered testing. As the prime purpose of this paper is to demonstrate key features of the POS model, such refinements have been postponed.
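A single run of the simulation sequence of Table 1, with the simplified identification rule described above, might look like the following Python sketch. It is not the authors' code. Assumptions made here: exactly one common cause impact occurs within the operation time, functional tests take place at integer multiples of $T_{FT}$, the process ends at the identifying test, and all function and parameter names are hypothetical.

```python
import math
import random


def simulate_ccf_event(w, t_b, t_ft, w_inst, r_min, r_max, rng):
    """Simulate one CCF process.

    w maps m to W(m, r). Returns (m, durations), where durations[i] is the
    time between impact and identification with exactly i failed components.
    """
    t_cci = rng.uniform(0.0, t_b)                       # time of common cause impact
    sizes = sorted(w)
    m = rng.choices(sizes, weights=[w[s] for s in sizes])[0]
    if rng.random() < w_inst:
        failure_times = [t_cci] * m                     # instantaneous failure of all m
    else:
        rate = math.exp(rng.uniform(math.log(r_min), math.log(r_max)))  # log-uniform rate
        failure_times = sorted(t_cci + rng.expovariate(rate) for _ in range(m))
    # First functional test strictly after the first failure (assumed test grid).
    t_id = (math.floor(failure_times[0] / t_ft) + 1) * t_ft
    durations = [0.0] * (m + 1)
    previous = t_cci
    for failed, t in enumerate(failure_times):
        t = min(t, t_id)                                # nothing is counted after identification
        durations[failed] += t - previous
        previous = t
    durations[m] += t_id - previous
    return m, durations


def average_time_fractions(n_runs, w, t_b, t_ft, w_inst, r_min, r_max, seed=0):
    """Average durations[i] / t_b over many runs, conditional on one impact per run."""
    rng = random.Random(seed)
    totals = [0.0] * (max(w) + 1)
    for _ in range(n_runs):
        _, durations = simulate_ccf_event(w, t_b, t_ft, w_inst, r_min, r_max, rng)
        for i, d in enumerate(durations):
            totals[i] += d / t_b
    return [t / n_runs for t in totals]
```

The averages are conditional on an impact having occurred; how they are combined with the impact rate to give an absolute unavailability is left open here.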

3.3. Parameter estimation for the process oriented simulation model

The parameter estimation routine used here is closely related to the one described in [4]. It has, however, been simplified without significantly lowering its precision.

3.3.1. Frequency

The model has essentially four parameters that have to be estimated. The first is the frequency of CCF-events for which the usual estimator for failure rates is used.

3.3.2. Number of impacted components

The approach selected consists of an estimation of the distribution of the number of impacted components based on the observed events:

$$W_{est}(m, r) = \frac{N_m + \frac{1}{r-1}}{K}. \qquad (8)$$

The constant term 1/(r-1) is introduced into the estimator to avoid vanishing probabilities, which in practice are not expected. K serves for normalization. $N_m$ is the number of events for CCCG size r and with number of impacted components m.

On the other hand, the probabilities can be calculated as functions of the model parameters. It can be shown that

$$W(2, r) = (1 - a)^{r-2}. \qquad (9)$$

Table 1. Overview of the POS model

Each entry lists a stochastic variable of the sequence, its model parameters and the modelling assumption.

1. Time $t_{CCI}$ of common cause impact. Model parameter: rate of common cause impacts $r_{CCI}$. Assumption: equally distributed in $T_B$, with $r_{CCI} \cdot T_B \ll 1$.

2. Number m < r + 1 of impacted components. Model parameters: a, $r_0$. Assumption: probability W(m, r), see formulae (1) to (6) and [3].

3. Failure rate R of the impacted components. Model parameters: probability $W_{inst}$ of instantaneous failure of all impacted components; interval $R_{min}$ to $R_{max}$ for the rates of non-instantaneous failures. Assumption: according to $W_{inst}$, the m components either fail instantaneously or their failure rate is logarithmically equally distributed in the interval $R_{min}$ to $R_{max}$.

4. Times of failure $t_F(m)$ of the impacted components. Assumption: either all impacted components fail at $t_{CCI}$, or the times of failure are exponentially distributed with rate R.

5. Identification of the CCF process by the functional test. Assumption: for times > $t_F(i)$, the failure and the common cause process are identified; the components are immediately repaired and are as good as new.

6. Time of CCF identification $t_{ID}$. Model parameter: $T_{FT}$. Assumption: the functional tests are performed at intervals $T_{FT}$. The first test time after the first failure, which occurs at the minimum of the $t_F(m)$, is equal to $t_{ID}$.

Finally, from the failure times $t_F(i)$ (i = 1, ..., m) in the time interval between $t_{CCI}$ and $t_{ID}$, the time periods A(i) (i = 0, 1, 2, ..., m) are calculated in which zero, one, two, ... up to at most m components are failed. The average of A(i) / $T_B$ (i > 1) over many simulations is the unavailability.

Relation (9) suggests the following estimator:

$$a_{est} = 1 - W_{est}(2, r)^{1/(r-2)}. \qquad (10)$$

In a second step, parameter c is estimated based on the mean of m:

$$\langle m \rangle_{est} = \sum_m m \cdot W_{est}(m, r). \qquad (11)$$

Again, the mean of m can be calculated as a function y of the model parameters a and c:

$$\langle m \rangle = y(a, c). \qquad (12)$$

This can be used to estimate c based on the estimates $a_{est}$ and $W_{est}(m, r)$ already obtained:

$$c_{est} = y^{-1}(a_{est}, \langle m \rangle_{est}). \qquad (13)$$

Here, $y^{-1}$ denotes the function y(a,c) inverted with respect to c.
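The estimators (8), (10) and (11) are simple enough to sketch in Python. The sketch makes assumptions that the text does not state explicitly: m runs from 2 to r, K is the divisor that makes the estimated probabilities sum to one, and r is larger than 2 when estimating a. The function names and the event counts in the usage note are hypothetical, and the inversion (13) is not attempted because the function y is only available through the recursive scheme of [3].

```python
def estimate_impact_distribution(counts, r):
    """Estimator (8): counts[m] is N_m, the number of observed events with m impacted components."""
    offset = 1.0 / (r - 1)                      # constant term that avoids zero probabilities
    raw = {m: counts.get(m, 0) + offset for m in range(2, r + 1)}
    k = sum(raw.values())                       # normalization constant K (assumed divisor)
    return {m: value / k for m, value in raw.items()}


def estimate_a(w_est, r):
    """Estimator (10), obtained by solving W(2, r) = (1 - a)^(r - 2) for a; requires r > 2."""
    return 1.0 - w_est[2] ** (1.0 / (r - 2))


def mean_impacted(w_est):
    """Estimator (11): mean number of impacted components."""
    return sum(m * p for m, p in w_est.items())
```

With, say, three observed events affecting two components and one affecting all four in a group of size 4, `estimate_impact_distribution({2: 3, 4: 1}, 4)` still assigns a small non-zero probability to m = 3, which is the purpose of the constant term.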

There are, however, cases in which the non-linear equation (13) for cest does not have a meaningful solution. This is avoided by applying the following transformation to the estimated <m>est:
