Section 2. Information technology
https://doi.org/10.29013/AJT-19-11.12-7-14
Dolhikh Anastasiia Olekhivna, postgraduate student, the Faculty of Applied Mathematics, Dnipro National University named after Oles Honchar
E-mail: [email protected]
Baybuz Oleg Grigorovich, Doctor of Technical Sciences, Professor, Head of the Department of Mathematical Support of Calculating Machines, the Faculty of Applied Mathematics, Dnipro National University named after Oles Honchar
E-mail: [email protected]
THE SOFTWARE TECHNOLOGY OF TIME SERIES OUTLIERS' IDENTIFICATION
Abstract. The article analyzes modern methods for identifying outliers in time series. A new model for detecting anomalous values is introduced, and the results of applying the proposed model are presented.
Keywords: time series, outliers, adaptive models, residuals, interquartile criterion.
1. Introduction
An outlier is a value that lies significantly far from other similar points. There are two reasons why such values need to be correctly identified. Firstly, their emergence may indicate an atypical state of the process under study. For example, in the financial sector, timely identification of outliers is necessary to track bank fraud attempts and to detect unusual credit card transactions or cases of card theft [7]. Secondly, such values can significantly affect the performance of the built model, which decreases the accuracy of forecasting future time series values [7-8]. The current work is devoted to the development of a procedure for identifying anomalous time series values that not only finds outliers but also determines their type. The article presents the results of the proposed method on series of financial indicators.
2. Analysis of recent research
According to the IBM Knowledge Center resource [9], at least six types of time series outliers are currently distinguished. The most important from a business standpoint are the following:
1. Additive outlier, or AO - an unexpectedly large or small value that is very different from the other levels of the series (Fig. 1). Such outliers appear once and have no effect on other values [10].
2. Temporary change, or TC - large or small values, uncharacteristic of the time series dynamics, that appear over a short period of time (Fig. 2) [10].
3. Level shift, or LS - a sharp change in the level of a series. All observations that occur after this kind of outlier move to a new level (Fig. 3).
Figure 1. Additive outlier
Figure 2. Temporary changes
Figure 3. Level shift
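The three outlier types can be illustrated on a synthetic series. The sketch below is only an illustration of the definitions given above: the helper names, the flat base series and the effect size are all assumptions made for the example, and the temporary change is modelled as a short run of shifted values, following the description in the text.

```python
def add_additive_outlier(series, t, size):
    """AO: a single level is shifted, all other levels are untouched."""
    out = list(series)
    out[t] += size
    return out


def add_temporary_change(series, t, size, length):
    """TC: a short run of `length` consecutive levels is shifted."""
    out = list(series)
    for i in range(t, min(t + length, len(out))):
        out[i] += size
    return out


def add_level_shift(series, t, size):
    """LS: every level from t onwards moves to a new level."""
    out = list(series)
    for i in range(t, len(out)):
        out[i] += size
    return out


base = [10.0] * 12  # hypothetical flat series
ao = add_additive_outlier(base, 5, 4.0)
tc = add_temporary_change(base, 5, 4.0, 3)
ls = add_level_shift(base, 5, 4.0)
print(ao, tc, ls, sep="\n")
```

The only difference between the three cases is how many levels after the point of occurrence are affected: one, a few, or all of them.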
An analysis of recent research in the field of time series analysis shows that anomaly detection methods can be divided into three groups.
The first group includes statistical approaches based on the standard deviation [12]. They are used to detect outliers in a sample of random numbers. However, applying such approaches to time series often does not produce the desired results, since time series differ significantly from random samples [1]. A better solution is to identify outliers using the Irwin method [11]. Its main idea is not to compare the series levels directly but to examine the differences between the current series values and the previous ones. The criterion is easy to understand and implement. However, according to [11], this approach cannot correctly identify two outliers that follow one another, nor can it distinguish a "normal" value that immediately follows an anomalous one from an outlier.
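The drawback of the Irwin method can be seen in a small sketch. It assumes the commonly stated form of the criterion, in which the absolute difference between neighbouring levels is divided by the sample standard deviation and compared with a tabulated critical value. The critical value 1.5 and the sample series below are illustrative assumptions, not values taken from [11].

```python
import statistics


def irwin_flags(series, critical=1.5):
    """Flag level t when |y_t - y_(t-1)| / s exceeds the critical value."""
    s = statistics.stdev(series)
    flags = [False]  # the first level has no predecessor to compare with
    for prev, cur in zip(series, series[1:]):
        flags.append(abs(cur - prev) / s > critical)
    return flags


# Hypothetical series with a single spike at index 4.
series = [10.0, 10.2, 9.9, 10.1, 25.0, 10.0, 10.2, 9.8, 10.1, 10.0]
flagged = [i for i, flag in enumerate(irwin_flags(series)) if flag]
print(flagged)
```

In this example both the spike and the ordinary level that follows it are flagged, because the return to the normal level produces a difference of the same magnitude as the jump itself.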
The second group comprises methods that use data classification algorithms, such as decision trees and neural networks, to search for time series outliers [10]. This approach requires a training sequence that has been divided beforehand into two classes, "anomalies" and "usual values", which significantly restricts the applicability of the methods in this group.
The third group consists of methods that fit the most plausible model of the input time series and identify outliers not in the original sequence but in the series of residuals. The methods of this group give good results and are among the most widely applied today. The algorithm proposed by American researchers in [2] has become widespread. Its implementation is available in the tsoutliers package [6] for the R statistical environment. The main idea of this method is to build the ARIMA(p, d, q) model that best fits the input time series and then to detect outliers in the series of residuals. To achieve good results, it is necessary to find the optimal values of the autoregression order, p, the integration order, d, and the moving average order, q. According to the authors of [4], the number of models that need to be built and evaluated with this approach can reach 488. This can take a long time, especially for large flows of input data.
Based on this analysis, a promising area of research is the development of faster models for outlier identification. A topical issue is the use of adaptive forecasting methods to interpolate time series values. They do not require a large number of parameters to be calculated, while producing good forecasting results.
3. Setting of the problem
To develop a procedure for identifying time series outliers based on adaptive forecasting models, the following tasks have been solved:
- implement adaptive models of time series forecasting;
- develop a model for anomalous values detection in the series of residuals;
- develop an algorithm for the abnormal value type determination ("AO", "TC", or "LS");
- evaluate the quality of the proposed model on the time series of financial indicators.
4. A new procedure of time series outliers' identification
To construct the optimal time series model, adaptive forecasting methods have been used. Despite their simple implementation, they give good results [3, 14]. To train the adaptive models, a genetic algorithm with real-valued encoding has been used [3]. Once the optimal model of the original time series has been found, the anomaly identification procedure can be run on the residuals time series. Below is a detailed description of the proposed approach.
1. Calculate the predicted values of the time series, {û_t}.
2. Calculate the series of residuals {e_t}, e_t = u_t - û_t, where u_t is the observed level of the series and û_t is its predicted value.
3. Identify the outliers of the residuals time series. To do this, the interquartile distance procedure is used [5]. Unlike other statistical outlier criteria, it does not require the analyzed data to be normally distributed. This is an important factor, since this condition is rarely met in real financial analysis tasks. On the first iteration of the algorithm, the adjusting coefficient of the interquartile distance procedure, Coefficient, is set to 3. This value is usually used to find the outer boundaries of the data set. All values that lie outside these boundaries are considered significant outliers. On further iterations of the algorithm, this value is updated.
4. Determine the type of each found outlier: "AO", "TC" or "LS". To do this, calculate outliers_length, the length of the sequence of consecutive abnormal values to which the current outlier belongs. If outliers_length = 1, the type of the outlier is "AO". If 1 < outliers_length < threshold, the outlier is of the "TC" type. Otherwise, if neither of the above conditions holds, the type of the found outlier is "LS". The threshold value is threshold = 0.02 • N, where N is the length of the original time series.
5. After the types of all found outliers are determined, outliers of the input series need to be replaced by the predicted values of the constructed model.
6. Reduce the outer limits of the data set using the following formula:
Coefficient = Coefficient - (1 - reduce),
reduce = 0.14813.
7. If the iteration number has reached the maximum allowed value, MAX_ITER, or no new outliers have been found at the current iteration, the algorithm terminates. Otherwise, repeat the outlier detection procedure from step 1, using the adjusted time series (i.e. the series obtained at step 5) as the input series.
By default, the maximum number of iterations, MAX_ITER, equals three. It has been found empirically that reducing this value causes the program to miss some important anomalies, while increasing it causes too many series levels to be flagged as outliers, including values that fully correspond to the process dynamics.
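The steps of the procedure can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation. The trained adaptive model is replaced by a hypothetical stand-in, median_forecast, that predicts the series median at every point; quartiles are taken from statistics.quantiles with its default method; already found outliers are not reported again; and the coefficient update follows the formula of step 6 exactly as it is printed. The function names and the sample series are invented for the example.

```python
import statistics

MAX_ITER = 3      # default number of iterations from the text
REDUCE = 0.14813  # constant from step 6


def iqr_outliers(residuals, coefficient):
    """Indices of residuals outside [Q1 - c*IQR, Q3 + c*IQR] (step 3)."""
    q1, _, q3 = statistics.quantiles(residuals, n=4)
    spread = q3 - q1
    low, high = q1 - coefficient * spread, q3 + coefficient * spread
    return [i for i, e in enumerate(residuals) if e < low or e > high]


def classify(indices, n):
    """Label each outlier by the length of its run of consecutive indices (step 4)."""
    threshold = 0.02 * n
    labels, run = {}, []
    for i in sorted(indices) + [None]:  # None flushes the last run
        if run and (i is None or i != run[-1] + 1):
            if len(run) == 1:
                kind = "AO"
            elif len(run) < threshold:
                kind = "TC"
            else:
                kind = "LS"
            labels.update((j, kind) for j in run)
            run = []
        if i is not None:
            run.append(i)
    return labels


def median_forecast(series):
    """Hypothetical stand-in model: predicts the series median at every point."""
    return [statistics.median(series)] * len(series)


def detect_outliers(series, forecast=median_forecast, coefficient=3.0):
    series, found = list(series), {}
    for _ in range(MAX_ITER):                                   # step 7
        predicted = forecast(series)                            # step 1
        residuals = [u - p for u, p in zip(series, predicted)]  # step 2
        new = [i for i in iqr_outliers(residuals, coefficient) if i not in found]
        if not new:
            break
        found.update(classify(new, len(series)))                # step 4
        for i in new:
            series[i] = predicted[i]                            # step 5
        coefficient -= 1 - REDUCE                               # step 6, as printed
    return found


levels = [9.0, 10.0, 11.0, 10.0] * 10  # hypothetical series
levels[20] = 25.0                      # inject a single spike
print(detect_outliers(levels))
```

The forecast argument is where a trained adaptive model would be plugged in; any callable that returns one predicted value per level of the series fits this sketch.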
5. The results of the proposed method of the outliers' detection
This section summarizes the results of the proposed model testing on the financial time series.
Figure 4. CAT time series
Fig. 4 shows the opening prices of Caterpillar Inc. stock from January 2017 to January 2018. The data have been downloaded from the "Yahoo! Finance" resource [13].
During the preliminary analysis of the time series, the Spearman test for the presence of a trend and the serial correlation test for the existence of a periodic component have been conducted. According to their results, there is a trend component in the series, and there is no pronounced seasonality or auto-oscillation.
To interpolate the time series levels, the Theil-Wage adaptive model has been used. Fig. 5 shows the time series of residuals.
Figure 5. A series of residuals
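How a residual series of this kind is obtained from an adaptive model with a trend component can be sketched as follows. Holt's linear exponential smoothing is used here purely as a familiar stand-in; it is not claimed to be the adaptive model used in the article. The smoothing parameters alpha and beta are arbitrary assumptions, whereas in the article the model parameters are fitted by a genetic algorithm [3].

```python
def holt_residuals(series, alpha=0.5, beta=0.5):
    """One-step-ahead residuals e_t = u_t - forecast_t for Holt's linear method."""
    level, trend = series[0], series[1] - series[0]
    residuals = [0.0]  # no forecast exists for the first level
    for value in series[1:]:
        forecast = level + trend
        residuals.append(value - forecast)
        # Adapt the level and the trend to the newly observed value.
        new_level = alpha * value + (1 - alpha) * forecast
        trend = beta * (new_level - level) + (1 - beta) * trend
        level = new_level
    return residuals


# Hypothetical series: a linear trend with one disturbed level at index 6.
series = [5.0 + 2.0 * t for t in range(10)]
series[6] += 10.0
print(holt_residuals(series))
```

On the undisturbed part of the trend the residuals are zero, the disturbed level produces a large residual, and the levels after it show smaller residuals while the model adapts back to the trend.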
Having analyzed Fig. 5, one can conclude that the series of residuals is stationary: its mean fluctuates near zero, and the residuals look like random numbers that do not depend on time. These considerations have been confirmed by statistical tests. The one-sample Student's t-test has been used to test whether the mean equals zero. The absence of a trend has been confirmed using the Spearman test for randomness. An important step in analyzing the model's residuals is to check for the absence of autocorrelation between them. To do this, the Durbin-Watson test has been applied. The results showed that there was no interdependence between the levels of the series. In our case, such properties of the residuals are important not only because they confirm the quality of the built model, but also because the residuals series is used for anomaly identification.
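Two of the statistics mentioned above are simple enough to compute directly. The sketch below shows the one-sample t statistic for a zero mean and the Durbin-Watson statistic; it returns the statistics only, and comparing them with critical values is left to the usual tables. The function names and the sample residuals are assumptions made for illustration.

```python
import math
import statistics


def t_statistic_zero_mean(residuals):
    """One-sample t statistic for the hypothesis that the mean equals zero."""
    n = len(residuals)
    return statistics.mean(residuals) / (statistics.stdev(residuals) / math.sqrt(n))


def durbin_watson(residuals):
    """Sum of squared successive differences divided by the sum of squares."""
    num = sum((b - a) ** 2 for a, b in zip(residuals, residuals[1:]))
    den = sum(e * e for e in residuals)
    return num / den


residuals = [0.5, -0.4, 0.3, -0.6, 0.2, -0.1, 0.4, -0.3]  # hypothetical values
print(t_statistic_zero_mean(residuals), durbin_watson(residuals))
```

The Durbin-Watson statistic lies between 0 and 4; values close to 2 are consistent with the absence of first-order autocorrelation, while values near 0 or 4 point to positive or negative autocorrelation respectively.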
Figure 6. Box-plot diagram of the original series (a) and residuals series (b)
Table 1. List of CAT series outliers
No. | Series level number | Series level value | Outlier type
1 | 63 | 102.88 | LS
2 | 126 | 113.24 | LS
3 | 190 | 140.06 | TC
4 | 226 | 148.85 | AO
As one can see from the box-and-whisker plots (Fig. 6), anomalous values cannot be detected in the original series. However, after the transition to the residuals series, candidate outliers can be easily distinguished. In Fig. 6 b they are shown as black dots outside the outer limits of the data set.
Figure 7. CAT time series with outliers
Table 1 and Fig. 7 show the results of applying the proposed outlier identification procedure to the CAT time series.
The proposed anomaly identification procedure revealed four anomalous values. The first two outliers are of the "LS" type. The third one has been identified by the procedure as a "TC" outlier because it covers only a few values. The last anomalous value is a single one and does not have any effect on the future values of the series, so its type is defined as "AO". The tso procedure of the R package tsoutliers [6] shows similar results.
Table 2 compares the time required to run the proposed procedure and the tso function of the tsoutliers package for R [6] on series of stock prices of well-known American companies from 2017 to 2018.
Table 2. Time costs required to detect outliers using the adaptive-model and ARIMA-based approaches
No. | Time series | Procedure based on adaptive models, ms | tso procedure from the R package tsoutliers, ms
1 | AAON | 0.379 | 0.405
2 | CAT | 0.726 | 1.122
3 | DBD | 0.372 | 2.025
4 | MSFT | 0.44 | 0.686
5 | CSCO | 0.317 | 0.379
6 | BGS | 0.623 | 6.556
7 | IBM | 0.694 | 0.47
8 | ABC | 0.656 | 3.108
9 | BK | 0.529 | 0.388
10 | CAKE | 0.521 | 7.147
Table 2 shows that for most time series the proposed procedure reduces the time required to identify anomalous values; the only exceptions are the IBM and BK series.
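The comparison can be checked with simple arithmetic on the values of Table 2. The short script below only restates the table and computes, for each series, the ratio of the tso time to the time of the proposed procedure; the variable names are chosen for the example.

```python
# (adaptive procedure, tso) times in ms, copied from Table 2.
times_ms = {
    "AAON": (0.379, 0.405),
    "CAT": (0.726, 1.122),
    "DBD": (0.372, 2.025),
    "MSFT": (0.44, 0.686),
    "CSCO": (0.317, 0.379),
    "BGS": (0.623, 6.556),
    "IBM": (0.694, 0.47),
    "ABC": (0.656, 3.108),
    "BK": (0.529, 0.388),
    "CAKE": (0.521, 7.147),
}

faster = [name for name, (adaptive, tso) in times_ms.items() if adaptive < tso]
slower = [name for name, (adaptive, tso) in times_ms.items() if adaptive >= tso]

for name, (adaptive, tso) in times_ms.items():
    print(f"{name}: tso / adaptive = {tso / adaptive:.2f}")
print(f"faster on {len(faster)} of {len(times_ms)} series; slower on {slower}")
```

A ratio above 1 means that the proposed procedure was faster on that series.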
6. Conclusions
During the study, a new procedure for identifying time series outliers has been developed. Its advantage over many other approaches is that it not only detects anomalies but also determines their type. The scientific novelty of the proposed method is that it uses adaptive forecasting models to interpolate the series values and the interquartile distance criterion to detect outliers in the residuals. The proposed procedure makes it possible to correctly identify a large number of anomalous values and to reduce the time required to do this. A prospect for further research is testing the feasibility of the proposed approach in big data analysis.
References:
1. Biloborodko O. I., Yemelyanenko T. G. Dynamic Series Analysis. Dnipro: Dnipro National University named after Oles Honchar, 2014. 80 p.
2. Chen C., Liu L. Joint Estimation of Model Parameters and Outlier Effects in Time Series. Journal of the American Statistical Association. 1993. Vol. 88. P. 284-297.
3. Dolhikh A. O., Biloborodko O. I., Baybuz O. G. Finding the optimum values of adaptive models parameters for time series forecasting using genetic algorithm. Actual problems of automation and informational technologies. 2016. Vol. 20. P. 11-22.
4. Hyndman R. J., Khandakar Y. Automatic time series forecasting: The forecast package for R. Journal of Statistical Software. 2008. Vol. 26. No. 3.
5. Inter-quartile range, outliers, boxplots. URL: https://www.sfu.ca/~jackd/Stat203_2011/Wk02_1_Full.pdf
6. Javier Lopez-de-Lacalle. Detection of Outliers in Time Series. Package 'tsoutliers'. 2019. URL: https://cran.r-project.org/web/packages/tsoutliers/tsoutliers.pdf (accessed: 06.12.2019)
7. Mab Alam. Outlier detection with time-series data mining. Data Science Central - The online resource for big data practitioners. 2018. URL: https://www.datasciencecentral.com/profiles/blogs/outlier-detection-with-time-series-data-mining (accessed: 06.12.2019)
8. Nosov S. S., Belovodsky V. N. Preliminary processing of meteo time series: methods, experiments, results. Informatics and computer technologies. 2011. URL: http://ea.donntu.edu.ua/bitstream/123456789/12786/1/%D0%9D%D0%BE%D1%81%D0%BE%D0%B2%20%D0%A1.%D0%A1.%20.pdf (accessed: 06.12.2019)
9. Outliers. IBM Knowledge Center. URL: https://www.ibm.com/support/knowledgecenter/en/SS3RA7_15.0.0/com.ibm.spss.modeler.help/ts_outliers_overview.htm (accessed: 06.12.2019)
10. Pavel Tiunov. Time Series Anomaly Detection Algorithms. Stats and Bots. URL: https://blog.statsbot.co/time-series-anomaly-detection-algorithms-1cef5519aef2 (accessed: 06.12.2019)
11. Trofimenko S., Marshalov A., Grib N., Kolodeznikov I. Modification of the Irwin method for the detection of abnormal levels of time series: methodology and numerical experiments. Modern problems of science and education. 2014. Vol. 5.
12. Will Badr. 5 Ways to Detect Outliers/Anomalies That Every Data Scientist Should Know (Python Code). Towards Data Science. URL: https://towardsdatascience.com/5-ways-to-detect-outliers-that-every-data-scientist-should-know-python-code-70a54335a623 (accessed: 06.12.2019)
13. Yahoo! Finance. URL: https://finance.yahoo.com/ (accessed: 21.04.2019)
14. Yu. P. Lukashin. Adaptive methods for short-term time series forecasting. M., 2003. 416 p.