In memory of Sergei Antonov
Graduated from the Mathematics Department of Moscow State University. Worked at the Russian Academy of Sciences. Master of Sport in alpinism. Multiple-time alpinism champion of Moscow and Russia.
One of the first members of the GNEDENKO FORUM; carried out a number of joint projects with I. Ushakov.
Sergei died in the Pamir mountains while climbing.
(1960-2006)
Sergei ANTONOV
SPARE SUPPLY SYSTEM FOR WORLDWIDE TELECOMMUNICATION SYSTEM GLOBALSTAR
Igor Ushakov, Sergei Antonov, Sumantra Chakravarty,
Asad Hamid, Thomas Keliinoi
ABSTRACT
This work describes the Optimal Spare Allocator (OSA), a software tool for Globalstar, which is a worldwide satellite telecommunication system designed at QUALCOMM (San Diego, USA). The Globalstar spare supply system is hierarchical and has three levels: Central Spare Stock (CSS), Regional Spare Stocks (RSS) and On-Site Spare Stocks (OSS). The tool allows solving direct and inverse problems of optimal redundancy. The OSA computer model has a user-friendly interface and a convenient reporting utility.
KEYWORDS
Spare allocation, reliability, optimization, cost, software tool, steepest descent algorithm
GENERAL DESCRIPTION OF THE SPARE SUPPORT SYSTEM
We consider a hierarchical spare supply system for the Globalstar satellite telecommunication system. Globalstar is expected to have a number of base stations (gateways) dispersed all over the world. Successful operation of such a complex system depends on the ability to restore it quickly after a failure. Fast and effective gateway restoration after a failure depends on a stock of field replaceable units (FRUs). For this purpose, a hierarchical spare supply system (HSSS) is being designed (Ushakov 1994). The HSSS includes the central spare stock (CSS), regional spare stocks (RSS), and on-site spare stocks (OSS).
Diversity of gateways and addition of new ones lead to the necessity of a computer tool capable of optimal spare allocation. The problems that arise are: (1) determination of optimal allocation of spares at each OSS depending on the size of a gateway, (2) determination of location and size of each RSS, and (3) determination of size of the CSS.
Figure 1. A hierarchical spare supply system
Gateway equipment consists of replaceable units. After each failure, a spare unit from the OSS replaces the failed unit. The failed unit is sent to the repair base. Regional and Central stocks are usually supplied periodically (with a priority request for refilling if the stock has reached some critical level). On-site stocks are small and use advance delivery: the OSS site sends a request to the RSS after each failure. The structure of the Globalstar HSSS is presented in Fig. 1.
FORMULATION OF GENERAL PROBLEM OF OPTIMAL SPARE ALLOCATION
Let the operational system (gateway) consist of N different types of units. Requests for a spare unit of type k, k = 1, 2, ..., N, arrive at the stock in accordance with a Poisson process with (failure) intensity $\lambda_k$. The costs of units, $c_k$, are assumed to be known. A spare stock contains $x_1, x_2, \ldots, x_N$ units of the different types. The problem is to find the optimal allocation satisfying requirements on the stock reliability or the total cost.
Let $X = (x_1, x_2, \ldots, x_N)$ be the vector of spares at the stock site, where $x_i$ is the number of spares of type i; let $P(X, \theta)$ be the reliability index characterizing the spare stock with X spares for a period of time $\theta$; and let $C(X)$ be the cost of spares. Two optimization tasks (Gnedenko and Ushakov 1995) can be formulated as:
Direct: To minimize the total cost of spares at the stock under the condition that the stock reliability index is not less than the required level $P^*$, i.e.,
$$\min_{X} \{\, C(X) \mid P(X, \theta) \ge P^* \,\}. \tag{1}$$
Inverse: To maximize the stock reliability index under the condition that the total cost of spares at the stock is not larger than an admissible level $C^*$, i.e.,
$$\max_{X} \{\, P(X, \theta) \mid C(X) \le C^* \,\}. \tag{2}$$
ON-SITE SPARE STOCK
We assume that gateways are highly reliable and their units are independent, so we neglect the possibility of overlapping system down times due to different causes. For highly reliable systems, the approximate formula for the OSS unreliability coefficient, $Q_{OSS}$, is
$$Q_{OSS} \approx \sum_{1 \le k \le N} \beta_k\, q_k(x_k). \tag{3}$$
The weights in Eq. 3 are defined as $\beta_k = \lambda_k n_k \left( \sum_{1 \le k \le N} \lambda_k n_k \right)^{-1}$, $q_k(x_k)$ is the unreliability coefficient of units of type k (cumulative Poisson function with parameter $a_k = n_k \lambda_k \theta$), and $x_k$ is the number of spares of type k in the OSS. For highly reliable systems, the approximate formula for the OSS unavailability coefficient, $U_{OSS}$, is
$$U_{OSS} \approx \sum_{k=1}^{N} \lambda_k n_k\, q_k(x_k)\, \frac{\theta}{x_k + 1}, \tag{4}$$
where $\theta$ is the time delay corresponding to advance delivery.
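As an illustration of Eq. 3, the sketch below evaluates the OSS unreliability coefficient for a given allocation. It assumes that the unreliability coefficient of type k is the Poisson probability of more than x_k requests during the delivery delay, which is one reading of "cumulative Poisson function"; the function names and the numbers in the usage line are hypothetical.

```python
import math

def poisson_tail(x, a):
    """P(demand > x) for Poisson demand with mean a (assumed form of q_k)."""
    term = math.exp(-a)
    for j in range(1, x + 2):
        term *= a / j              # after the loop: P(demand = x + 1)
    tail = 0.0
    for j in range(x + 2, x + 202):
        tail += term               # accumulate P(demand = j - 1)
        term *= a / j              # advance to P(demand = j)
    return tail

def q_oss(x, lam, n, theta):
    """Eq. 3: sum of per-type unreliabilities weighted by beta_k."""
    total_rate = sum(lk * nk for lk, nk in zip(lam, n))
    result = 0.0
    for xk, lk, nk in zip(x, lam, n):
        beta_k = lk * nk / total_rate      # weight of unit type k
        a_k = nk * lk * theta              # Poisson parameter of type k
        result += beta_k * poisson_tail(xk, a_k)
    return result

# Hypothetical data: two unit types, failure rates per hour, theta in hours.
print(q_oss(x=[1, 2], lam=[1e-5, 5e-6], n=[12, 4], theta=240.0))
```

Summing the tail directly, rather than subtracting a cumulative sum from one, keeps the result accurate when the unreliability is very small, which is the regime of interest for highly reliable systems.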
PREDICTING APPROXIMATE TRENDS
In many cases of practical interest, we are faced with the problem of sparing a highly reliable system. For a highly reliable system, $\max_k\{a_k\} \ll 1$ and the cumulative Poisson function may be approximated by its leading term (in practice, $\max_k\{a_k\} < 0.1$ is acceptable). Typically, there is at least one spare for every type of unit in a commercially deployed system. On the other hand, the total money allocated for sparing is generally limited. If $1 \le x_i \le 5$, $\ln(x_i!) \approx 0.9(x_i - 1)$ is a workable approximation in the Poisson function. These two simplifications linearize the Lagrange equation determining the optimal values of $x_i$ (Ushakov and Chakravarty 1998). For the goal function shown in Eq. 3 and for the inverse problem of redundancy we obtain
$$x_k = \operatorname{round}\left( \frac{K - \ln\bigl(c_k / (\beta_k C^*)\bigr) - a_k + \ln(0.9 - \ln a_k)}{\ln(0.9 - \ln a_k)} \right). \tag{5}$$
The constant K in Eq. 5 can be found from the cost constraint $C(X) = \sum_k c_k x_k = C^*$.
REGIONAL AND CENTRAL STOCKS
An RSS is periodically refilled from the Central Spare Stock (CSS). The number and location of gateways served by a particular RSS may change over time with the development of Globalstar. It seems that the best index characterizing the RSS is its unreliability coefficient (3). The same might be said about the CSS, which is replenished by production (probably with a different period for different types of units). In principle, the solution for these cases is similar to the previous one, with the difference that the advance delivery period starts with the installation of a failed unit.
OPTIMAL SPARE ALLOCATOR (OSA) SOFTWARE TOOL
The "Optimal Spare Allocator" software tool has been developed for use in Globalstar gateways. Globalstar is a worldwide satellite telecommunication system that has gateways dispersed all over the world.
Figure 2. OSA tool: Map of hierarchical stock system.
Its spare supply system is hierarchical in nature. OSA is a GUI driven user-friendly tool designed to solve the direct and inverse problems of optimal redundancy for a multi level hierarchical spare supply system.
Figure 3. OSA tool: a hierarchical spare supply system structure
It uses the relative increments of the goal function with respect to a unit of cost (steepest descent method) to solve the optimization problem.
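A minimal sketch of such a scheme for the inverse problem: at each step, add the spare that gives the largest decrease of the goal function per unit of cost, until nothing more fits into the budget. This is an illustration under stated assumptions, not the OSA implementation: the goal function is taken to be Eq. 3 with the unreliability of each type modelled as a Poisson tail probability, and all names and numbers are hypothetical.

```python
import math

def poisson_tail(x, a):
    """P(demand > x) for Poisson demand with mean a (assumed unreliability)."""
    term = math.exp(-a)
    for j in range(1, x + 2):
        term *= a / j              # after the loop: P(demand = x + 1)
    tail = 0.0
    for j in range(x + 2, x + 202):
        tail += term
        term *= a / j
    return tail

def allocate_spares(a, beta, cost, budget):
    """Greedy allocation by largest unreliability decrease per unit of cost."""
    x = [0] * len(a)
    spent = 0.0
    while True:
        best, best_gain = None, 0.0
        for k in range(len(a)):
            if spent + cost[k] > budget:
                continue                       # this unit no longer fits
            drop = poisson_tail(x[k], a[k]) - poisson_tail(x[k] + 1, a[k])
            gain = beta[k] * drop / cost[k]    # relative increment per unit cost
            if gain > best_gain:
                best, best_gain = k, gain
        if best is None:
            return x, spent                    # budget exhausted or no gain left
        x[best] += 1
        spent += cost[best]

# Hypothetical data: three unit types and a total budget of 1000.
print(allocate_spares(a=[0.05, 0.02, 0.08], beta=[0.3, 0.2, 0.5],
                      cost=[123.45, 456.78, 111.11], budget=1000.0))
```

The direct problem can be handled by the same loop if the stopping rule is changed from a cost limit to reaching the required reliability level.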
A PC with Windows 95 or NT operating system is needed for installing and running the OSA tool. The program's main window includes a menu of all available commands and a toolbar with the most frequently used operations. It has other windows that depict the "hierarchical tree" of the stock supply system (Fig. 3), a table of parameters characterizing a particular stock, including a list of units and quantities used, embedded spares if any, their cost, their mean time between failures etc.
Figure 4. OSA tool: calculation options.
[Screenshot: "Set of Units" dialog listing the current set of units (119 types) with the columns Part No, Name, MTBF, Cost and Comment, and with Edit, Delete, Import and Export controls.]
Figure 5. OSA tool: Unit database.
[Screenshot: "Operating Units of the Base Station" dialog for on-site stock GW25, listing the units in the corresponding base station (184 types) with the columns Part No, Name, Qty and Standby.]
Figure 6. OSA tool: Gateway specification
QUALCOMM OPTIMAL SPARE ALLOCATION STOCKS
Stock: GW47. Unit data (level, spare unit delivery time, repair time): On-Site, 240. Reliability: 0,995388

| Part No | Name | MTBF | Cost | Spare | Spare Cost |
|---|---|---|---|---|---|
| 20-14074-1 | TFU | 350000 | 123,45 | 2 | 246,90 |
| 20-14703-1 | CCA | 150000 | 12,34 | 3 | 37,02 |
| 20-14875-1 | Control Unit | 200000 | 456,78 | 2 | 913,56 |
| 20-14917-1 | ATM IC CCA | 100000 | 111,11 | 3 | 333,33 |
Figure 7. OSA tool: sample of reporting
The OSA tool is flexible and offers various calculation options to the user. It is able to solve the direct and inverse problems of optimal redundancy with two different goal functions. It also offers two separate replenishment policies, and lets the user choose minimum number of spares consistent with the total cost. Results of calculations are presented in a report whose layout can be specified by the user. Reports generated by the OSA tool may be saved in ASCII format for further processing or documentation.
REFERENCES
1. Gnedenko, B. and I. Ushakov (1995). Probabilistic Reliability Engineering. John Wiley, New York.
2. Ushakov, I. (1994). Handbook of Reliability Engineering. John Wiley, New York.
SSARS 2007
22-29 July 2007 Sopot Poland
Summer Safety & Reliability Seminars
Special Issue # 2 on SSARS 2007
Invited Editors E. Zio and K. Kolowrocki
METHODS FOR THE TREATMENT OF COMMON CAUSE FAILURES IN REDUNDANT SYSTEMS
Berg Heinz-Peter, Görtz Rudolf, Kesten Jürgen
Bundesamt für Strahlenschutz, Salzgitter, Germany
Keywords
nuclear power plant, probabilistic safety assessment, simulation, common cause failure, modelling
Abstract
Dependent failures are extremely important in reliability analysis and must be given adequate treatment so as to avoid gross overestimation of reliability. German regulatory guidance documents for PSA stipulate that model parameters used for calculating frequencies should be derived from operating experience in a transparent manner. Progress has been made with the process oriented simulation (POS) model for common cause failure (CCF) quantification. A number of applications are presented for which results obtained from established CCF models are available, focusing on cases with a high degree of redundancy and small numbers of observed events.
1. Common cause failure analysis in the frame of probabilistic safety assessment
Design, operation and maintenance of systems are performed so as to minimize potential failures such as random, systematic and dependent failures. Dependent failures comprise secondary failures caused, e.g., by violation of operational conditions, and so-called commanded failures, such as a component failing due to violation of interface conditions. The residual part of the group of dependent failures is called common cause failures (CCF). To identify dependent failures, approaches have been extended to encompass potential interdependencies between systems or components. Secondary and commanded failures are supposed to be modelled explicitly as far as possible in fault tree models of the system, whereas common cause failures are taken into account in probabilistic safety assessment implicitly by parametric models.
In general, the most important defence against accidental component or system failures is the implementation of principles such as separation, diversity and redundancy. However, experience has shown that redundancy itself is not sufficient to avoid undesired events, precisely because of possible dependent failures.
CCF of redundant safety-relevant systems have been of concern since quantitative estimation of the reliability of these systems began in the early 1970s, because this type of failure significantly affects their availability and reliability, leading in the worst case to a simultaneous loss of all redundancies.
Typical examples of CCF are miscalibration of sensors, incorrect maintenance, environmental impact on the field device and use of an inappropriate process fluid, which plugs valves in different redundancies. Experience from numerous probabilistic safety assessments has shown that, especially for highly redundant systems in nuclear power plants, common cause failures tend to dominate the results of these assessments, such as the core damage frequency or large early release frequency. As a consequence of the generally rather effective defences against common cause failures in place, the number of actually observed events in nuclear power plants is limited, in particular with respect to events involving failures of all or at least many redundant components. However, the operational experience contains some information on potential common cause failures, i.e., partial failures that could have evolved into the complete failure of the common cause component group within a short period of time. This in turn requires, in one way or another, an extrapolation based on parametric models, which is extremely difficult to verify.
Despite these difficulties, significant progress has been made in recent years due to increasing operational experience, more systematic data collection and analysis, growing experience in probabilistic safety assessment and an enhanced exchange of data and methods both nationally and internationally.
Although the use of plant-specific data in probabilistic safety assessment is preferred, where events or information are lacking it is helpful to provide a generic data base taking into account all national experience and appropriate international data. Data bases like the OECD/NEA International Common Cause Failure Data Exchange Project allow collecting and analysing data on many different components such as valves, pumps and diesel generators. Results of the analysis of these data also make it possible to assess and improve the effectiveness of defences against common cause failure events. For that purpose, data and information related to events observed in the operational experience have to be provided with sufficiently detailed content.
In general, the treatment of common cause failures within probabilistic safety assessment requires four main steps: development of a system logic model, identification of common cause component groups, common cause modelling and data analysis as well as quantification and interpretation of the results. For the quantitative part of the common cause failure assessment, models have still to be further developed, in particular with respect to applicability to highly redundant systems, suitability and traceability.
2. German practice
Probabilistic safety analyses (PSA) have been performed for all operating German nuclear power plants. Experience has shown that CCF in many cases tends to dominate the results of the PSA. Therefore, methods and results of CCF analyses receive a lot of attention in the discussions between regulator, technical experts, utilities and analysts.
Regulatory guidance is available in Germany for level 1+ PSA (a level 1+ analysis is understood to end at the onset of core damage but to take into account active containment functions) as part of periodic safety reviews of nuclear power plants. Given the importance of CCF, a chapter in the German regulatory guidance documents is dedicated to dependent failures [6]-[7]. These failures comprise secondary failures caused by violation of operational or environmental conditions as well as commanded failures (an intact component failing due to violation of interface conditions, for example in the case of erroneous signals or failed energy supply). The residual part of the group of dependent failures is the common cause failures mentioned before. Secondary and commanded failures are to be modelled explicitly as far as possible in the fault tree models of the system. CCF, on the other hand, are taken into account in PSA by parameter models [2].
The guidelines mentioned before - they are currently undergoing final steps of revision in view of the fact that the Atomic Energy Act as amended in 2002 makes Periodic Safety Reviews (including PSA) mandatory - do not prescribe specific CCF models. Rather, they demand that the parameters of any model used are to be derived in a clearly described way from operating experience. Thus, in German PSA practice, a variety of models have been used [1], [9], [10].
3. A process oriented simulation model (POS) for CCF quantification
3.1. Rationale and objectives
The question can be raised whether an approach aiming at modelling the entire CCF process from the point in time of the root cause impact to failures taking effect or being detected in the common cause component group (CCCG) in a more mechanistic manner could support and complement the established modelling which is mostly aiming at failure probabilities. Such a process oriented modelling approach is described and discussed in this paper. It represents a further elaboration of the modelling stages described in [3]-[4].
3.2. Model description
The method of stochastic simulation offers a convenient way to describe the model and to quantify its results. The sequence of stochastic variables displayed in Table 1 is supposed to adequately describe the CCF process.
Based on simulation of this sequence, the associated unavailabilities can be calculated.
The following fixed-value parameters are used throughout a simulation sequence:
• operation time TB
• number of components in the CCCG: r
• time between functional tests TFT
The sequence of variables and calculations defines a single simulation of the common cause failure process. It is described how the variables are either derived from a stochastic assumption or are calculated deterministically.
The probabilities W(m,r) of the event that the common cause impact affects exactly m out of r components are calculated by a recursive scheme that is detailed in [3]. Here, only the formulae up to r = 4 are given. Model parameters are a and r0.
$$W(2,2) = 1 \tag{1}$$
$$W(3,3) = a \tag{2}$$
$$W(2,3) = 1 - a \tag{3}$$
$$W(4,4) = a \cdot \bigl(a + (1 - a) \cdot (1 - e^{-3/r_0})\bigr) \tag{4}$$
$$W(2,4) = (1 - a)^2 \tag{5}$$
$$W(3,4) = 1 - W(4,4) - W(2,4) \tag{6}$$
To facilitate handling of the necessary equations, model parameter r0 is replaced by:
$$c = \exp(1/r_0). \tag{7}$$
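Formulae (1) to (6) can be checked numerically. The sketch below returns the probabilities W(m,r) for r up to 4 and lets one verify that they sum to one; the function name and the example values of a and r0 are hypothetical, and larger groups would require the recursive scheme of [3], which is not reproduced here.

```python
import math

def impact_probabilities(r, a, r0):
    """W(m, r) for m = 2..r according to formulae (1) to (6); r <= 4 only."""
    if r == 2:
        return {2: 1.0}
    if r == 3:
        return {2: 1.0 - a, 3: a}
    if r == 4:
        w4 = a * (a + (1.0 - a) * (1.0 - math.exp(-3.0 / r0)))
        w2 = (1.0 - a) ** 2
        return {2: w2, 3: 1.0 - w4 - w2, 4: w4}
    raise ValueError("only r = 2, 3, 4 are covered by formulae (1) to (6)")

# Hypothetical parameter values.
w = impact_probabilities(4, a=0.3, r0=2.0)
print(w, sum(w.values()))
```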
In the applications presented here, a model version has been used that is based on a simplified assumption regarding the CCF identification. It is assumed that non-staggered testing is applied and that a CCF event is identified at the functional test following the first component failure. It is well known that conditions in the field are more complex. To account for this, effective test intervals have been estimated for the POS analyses from the information provided in the literature sources. The model assumptions can be modified in a straightforward manner to account for other situations such as staggered testing. As the prime purpose of this paper is to demonstrate key features of the POS model, such refinements have been postponed.
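The simulation sequence of Table 1 can be sketched in a few lines under the simplified assumptions just stated (non-staggered tests at multiples of TFT, identification at the first test after the first failure). This is an illustrative reading of the table, not the authors' code: all function and parameter names are hypothetical, the result is conditional on one common cause impact occurring in TB (no weighting by the impact rate), and the identification time is not clipped at TB.

```python
import math
import random

def simulate_once(rng, w, w_inst, r_min, r_max, t_b, t_ft):
    """One CCF process; durations[i] = time with exactly i failed components."""
    t_cci = rng.uniform(0.0, t_b)                  # time of common cause impact
    sizes = sorted(w)
    m = rng.choices(sizes, weights=[w[k] for k in sizes])[0]
    if rng.random() < w_inst:
        fail_times = [t_cci] * m                   # all m components fail at once
    else:
        # failure rate R drawn log-uniformly from [r_min, r_max]
        rate = math.exp(rng.uniform(math.log(r_min), math.log(r_max)))
        fail_times = sorted(t_cci + rng.expovariate(rate) for _ in range(m))
    # first functional test strictly after the first failure
    t_id = (math.floor(fail_times[0] / t_ft) + 1) * t_ft
    durations = [0.0] * (m + 1)
    durations[0] = fail_times[0] - t_cci
    for i in range(1, m + 1):
        start = min(fail_times[i - 1], t_id)
        end = min(fail_times[i], t_id) if i < m else t_id
        durations[i] = end - start
    return durations

def estimate_unavailability(n_sims, r, w, w_inst, r_min, r_max, t_b, t_ft, seed=0):
    """Average of durations[i] / t_b over many simulated CCF processes."""
    rng = random.Random(seed)
    totals = [0.0] * (r + 1)
    for _ in range(n_sims):
        run = simulate_once(rng, w, w_inst, r_min, r_max, t_b, t_ft)
        for i, d in enumerate(run):
            totals[i] += d
    return [t / (n_sims * t_b) for t in totals]

# Hypothetical input: group of 4, times in hours, rates per hour.
print(estimate_unavailability(10000, 4, {2: 0.49, 3: 0.3, 4: 0.21},
                              w_inst=0.2, r_min=1e-4, r_max=1e-2,
                              t_b=8760.0, t_ft=730.0))
```

Here `w` is a mapping from the number of impacted components m to W(m,r); entry i of the result is the estimated fraction of TB spent with exactly i components failed.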
3.3. Parameter estimation for the process oriented simulation model
The parameter estimation routine used here is closely related to the one described in [4]. It has, however, been simplified without significantly lowering its precision.
3.3.1. Frequency
The model has essentially four parameters that have to be estimated. The first is the frequency of CCF-events for which the usual estimator for failure rates is used.
3.3.2. Number of impacted components
The approach selected consists of an estimation of the distribution of the number of impacted components based on the observed events:
$$W_{est}(m, r) = \frac{N_m + \frac{1}{r-1}}{K}. \tag{8}$$
The constant term 1/(r-1) is introduced into the estimator to avoid vanishing probabilities, which in practice are not expected. K serves for normalization. Nm is the number of events for CCCG size r and with number of impacted components m.
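A small sketch of this estimator, assuming that m runs from 2 to r and that K is simply the sum of the unnormalized terms, so that the estimates add up to one. The function name and the event counts are hypothetical.

```python
def estimate_w(counts, r):
    """Estimate W(m, r) from event counts N_m with the offset 1/(r-1)."""
    raw = {m: counts.get(m, 0) + 1.0 / (r - 1) for m in range(2, r + 1)}
    k = sum(raw.values())          # normalization constant K
    return {m: value / k for m, value in raw.items()}

# Hypothetical observations for a group of 4 components:
# three events with 2 impacted components, one event with 4.
w_est = estimate_w({2: 3, 4: 1}, r=4)
print(w_est)
```

With these counts no event with m = 3 was observed, yet the estimate for m = 3 stays positive, which is the purpose of the constant term.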
On the other hand, the probabilities can be calculated as functions of the model parameters. It can be shown that
$$W(2, r) = (1 - a)^{r-2}. \tag{9}$$
Table 1. Overview of the POS model

| Sequence of stochastic variables | Modelling assumptions: model parameter | Modelling assumptions: assumption |
|---|---|---|
| Time tCCI of common cause impact | Rate of common cause impacts rCCI | Equally distributed in TB; rCCI · TB << 1 |
| Number m < r + 1 of impacted components | a, r0 | Probability W(m, r), see formulae (1) to (6) and [3] |
| Failure rate R of the impacted components | Probability Winst of instantaneous failure of all impacted components; interval Rmin to Rmax for rates of non-instantaneous failures | According to Winst, the m components either fail instantaneously or their failure rate is logarithmically equally distributed in the interval Rmin to Rmax |
| Times of failure tF(m) of the impacted components | | Either all impacted components fail at tCCI, or the times of failure are exponentially distributed with rate R |
| Identification of the CCF process by the functional test | | For times > tF(i) the failure and the common cause process are identified; the components are immediately repaired and are as good as new |
| Time of CCF identification tID | TFT | The functional tests are performed at intervals TFT. The first test time after the first failure, which occurs at the minimum of the tF(m), is equal to tID |

Finally, from the failure times tF(i) (i = 1, ..., m) in the time interval between tCCI and tID, the time periods are calculated in which zero, one, two, ... up to at most m components are failed: A(i) (i = 0, 1, 2, ..., m). The average of A(i) / TB (i > 1) over many simulations is the unavailability.
This relation suggests the following estimator:
$$a_{est} = 1 - W_{est}(2, r)^{1/(r-2)}. \tag{10}$$
In a second step, parameter c is estimated based on the mean of m:
$$\langle m \rangle_{est} = \sum_{m} m \cdot W_{est}(m, r). \tag{11}$$
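The two estimation steps can be written out directly. The sketch assumes that the estimated distribution is available as a mapping from m to the estimated probability; the names and the example distribution are hypothetical.

```python
def estimate_a(w_est, r):
    """Estimator (10): invert W(2, r) = (1 - a)**(r - 2) for a."""
    if r <= 2:
        raise ValueError("a cannot be estimated from W(2, r) for r <= 2")
    return 1.0 - w_est[2] ** (1.0 / (r - 2))

def mean_impacted(w_est):
    """Estimator (11): mean number of impacted components."""
    return sum(m * p for m, p in w_est.items())

# Hypothetical estimated distribution for a group of 4 components.
w_est = {2: 2.0 / 3.0, 3: 1.0 / 15.0, 4: 4.0 / 15.0}
print(estimate_a(w_est, 4), mean_impacted(w_est))
```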
Again, the mean of m can be calculated as a function y of the model parameters a and c:
$$\langle m \rangle = y(a, c). \tag{12}$$
This can be used to estimate c based on the estimates $a_{est}$ and $W_{est}(m, r)$ already obtained:
$$c_{est} = y^{-1}(a_{est}, \langle m \rangle_{est}). \tag{13}$$
Here, $y^{-1}$ denotes the function y(a, c) inverted with respect to c.
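For r = 4 the function y can be assembled from formulae (4) to (6), writing exp(-3/r0) as c to the power -3 according to (7), and inverted numerically. The sketch below uses plain bisection, which is an assumed choice of method rather than the routine of [4]; the function names and the upper search bound are hypothetical. It returns None when the target mean lies outside the range that y can reach for the given a.

```python
def mean_impacted_r4(a, c):
    """y(a, c) for r = 4, built from formulae (4) to (6) with c = exp(1/r0)."""
    w4 = a * (a + (1.0 - a) * (1.0 - c ** -3))
    w2 = (1.0 - a) ** 2
    w3 = 1.0 - w4 - w2
    return 2.0 * w2 + 3.0 * w3 + 4.0 * w4

def invert_for_c(a, mean_m, c_hi=1.0e6, iterations=200):
    """Solve y(a, c) = mean_m for c in [1, c_hi] by bisection, or return None."""
    lo, hi = 1.0, c_hi
    if not (mean_impacted_r4(a, lo) <= mean_m <= mean_impacted_r4(a, hi)):
        return None                    # no meaningful solution for this input
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if mean_impacted_r4(a, mid) < mean_m:
            lo = mid                   # y increases with c, so move right
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Round trip with hypothetical values a = 0.5, c = 2.
target = mean_impacted_r4(0.5, 2.0)
print(invert_for_c(0.5, target))
```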
There are, however, cases in which the non-linear equation (13) for cest does not have a meaningful solution. This is avoided by applying the following transformation to the estimated <m>est: