Informatics, Computer Engineering and Control
DOI: 10.14529/cmse160403
SUPERCOMPUTER APPLICATION INTEGRAL CHARACTERISTICS ANALYSIS FOR THE WHOLE QUEUED JOB COLLECTION OF LARGE-SCALE
HPC SYSTEMS*
© 2016 D.A. Nikitenko, V.V. Voevodin, A.M. Teplov, S.A. Zhumatiy, Vad.V. Voevodin, K.S. Stefanov, P.A. Shvets
Research Computing Center, M. V. Lomonosov Moscow State University (Leninskie
Gory 1, Moscow, 119991 Russia)
E-mail: [email protected], [email protected], [email protected], [email protected], [email protected], shvets.pavel.srcc@gmail.com
Received: 11.04.2016
The efficient use and high output of any supercomputer depend on a great number of factors. Controlling how granted resources are utilized is one of them, and it becomes especially noticeable when many user projects work concurrently. It is important to provide users with detailed information on the peculiarities of their executed jobs. At the same time, it is important to give project managers detailed information on resource utilization by project members through access to detailed job analysis. Unfortunately, such information is rarely available. We propose to close this gap with an approach to the analysis of supercomputer application integral characteristics for the whole queued job collection of large-scale HPC systems. The approach is based on managing and studying system monitoring data, building integral job characteristics, and revealing job categories and the peculiarities of single job runs.
Keywords: supercomputer, efficiency, system monitoring, job categories, integral job characteristics, queued job collection, job queue, resource utilization control.
FOR CITATION
Nikitenko D.A., Voevodin V.V., Teplov A.M., Zhumatiy S.A., Voevodin Vad.V., Stefanov K.S., Shvets P.A. Supercomputer Application Integral Characteristics Analysis for the Whole Queued Job Collection of Large-Scale HPC Systems. Bulletin of the South Ural State University. Series: Computational Mathematics and Software Engineering. 2016. vol. 5, no. 4. pp. 32-45. DOI: 10.14529/cmse160403.
Introduction
Securing efficient resource utilization of HPC systems is one of the most important and challenging tasks given the rapid growth in the scale and capabilities of modern supercomputers [1, 2]. A variety of approaches aim to analyze how efficiently certain HPC system components, or systems as a whole, are utilized. Some of them are based on system monitoring data analysis [3, 4]. This type of approach sets especially strict requirements on monitoring system implementation and configuration [5], as well as on the means of data storage and access. At the same time, these approaches possess a number of fundamental advantages.
*The paper was recommended for publication by the program committee of the International Scientific Conference "Parallel Computing Technologies - 2016".
First, the analyzed data reflects physical, real levels of HPC system components and appropriate resource utilization.
Second, filtering system monitoring data by a known set of components and a known period of time allows binding this data to certain jobs. This makes it possible to analyze the history and trends of resource utilization by certain applications, users, projects, partitions, and so on.
Third, it is typically possible to configure a monitoring system that collects data from the whole system so that it induces acceptable overhead. Data can then be collected at a coarser granularity where possible, yet one still sufficient for basic analysis of resource utilization by every job. To obtain more detailed information on a certain job, the monitoring system should preferably support on-the-fly reconfiguration of the data acquisition rate for specified sensor sets and sources. If it does not, most monitoring systems can be started in a higher-granularity mode to record the activity of a certain job and restarted in normal mode afterwards, although this is much less efficient. Other options are available as well, for example, data aggregation that is precise for the first, say, 30 seconds of job execution (to study short jobs) and later switches to a coarse mode (for longer jobs). In any case, a number of techniques are available to study the behavior of a certain application.
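The two-stage aggregation mentioned above can be sketched as follows. This is a minimal illustration and not the behavior of any particular monitoring system: the function name `aggregate_two_stage`, the 1-second fine step and the 60-second coarse step are assumptions, and only the 30-second fine window comes from the text.

```python
from statistics import mean

def aggregate_two_stage(samples, fine_window=30.0, fine_step=1.0, coarse_step=60.0):
    """Average (time, value) samples into buckets: fine at job start, coarse later.

    samples: iterable of (seconds_since_job_start, value) pairs.
    Returns a list of (bucket_start, mean_value) sorted by bucket start.
    The step sizes are illustrative assumptions, not values from the paper.
    """
    buckets = {}
    for t, value in samples:
        if t < fine_window:
            # Fine mode: one bucket per fine_step seconds.
            start = (t // fine_step) * fine_step
        else:
            # Coarse mode: one bucket per coarse_step seconds after the fine window.
            start = fine_window + ((t - fine_window) // coarse_step) * coarse_step
        buckets.setdefault(start, []).append(value)
    return [(start, mean(values)) for start, values in sorted(buckets.items())]

# Hypothetical CPU_user samples (seconds since job start, percent).
samples = [(0.2, 10), (0.7, 20), (1.5, 30), (31, 40), (80, 60), (95, 80)]
result = aggregate_two_stage(samples)
print(result)
```

Short jobs keep their full detail because they end inside the fine window, while long jobs contribute only one value per coarse step after the first 30 seconds.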
The existing methods and techniques based on system monitoring data analysis allow analysis both of the dynamic characteristics of certain application runs and of the peculiarities of resource utilization within system partitions and systems as a whole [6]. With a project-oriented workflow, when a number of users run jobs as part of one applied research project, it is very useful to give the administrator and the system manager a clear view of how resource utilization is distributed within the workgroup, so that permissions or the workflow inside the workgroup can be adjusted to meet the granted resource limits [7, 8]. Nevertheless, there is still a need for specialized tools and techniques to analyze the available system monitoring data. In fact, one such tool is needed as a valuable addition to the set of approaches used in the everyday practice of the MSU Supercomputer Center [9-12]: a tool for job queue analysis based on system monitoring that would allow revealing job categories and grouping jobs by various criteria, from belonging to a user or project domain and other resource manager specific characteristics to the levels and peculiarities of HPC system resource utilization, or combinations of these. A tagging system seems to be an adequate basic technique for such grouping: a special tag is assigned to a job description as soon as the job characteristics meet the criteria of that tag. Tagging is widely and successfully used for categorizing and searching huge collections of data on the Internet (news, videos, photos, notes, and so forth), which is quite close to the challenge being tackled.
The paper is organized as follows. Section 1 is devoted to job categories and tagging principles. Section 2 describes implementation. Section 3 provides examples and use cases. Conclusion section includes summary as well as future work overview.
1. Job categories and tagging
The combined analysis of system monitoring data and resource manager log data, as already mentioned, allows binding raw system monitoring data to certain jobs. This provides the means to analyze job dynamics as far as data granularity allows. To analyze the average rate of application resource utilization, every dynamic characteristic can serve as the basis for calculating minimum, maximum, average and median values. Values of these types are often called integral job characteristics.
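As a minimal sketch of how integral characteristics can be derived from one dynamic characteristic, the following example collapses a series of monitoring samples into the four values named above. The function name `integral_characteristics` and the sample data are hypothetical.

```python
from statistics import mean, median

def integral_characteristics(series):
    """Collapse one dynamic characteristic of a job into integral values."""
    if not series:
        raise ValueError("no monitoring samples for this job")
    return {
        "min": min(series),
        "max": max(series),
        "avg": mean(series),
        "median": median(series),
    }

# Hypothetical CPU_user samples of a single job, in percent.
cpu_user = [12.0, 35.5, 80.0, 78.5, 14.0]
summary = integral_characteristics(cpu_user)
print(summary)
```

In this example the average (44.0) and the median (35.5) differ noticeably, which is one reason to store several integral values rather than the average alone.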
When one looks at the whole scope of executed jobs in order to analyze the job queue structure, study application run sequences, compare jobs, or even search for outstanding behavior of a single job, it becomes obvious that tools for revealing job categories based on various criteria would be very useful.
This functionality can be implemented by introducing special tags. Every tag has its own criteria, based on a single integral job characteristic or a combination of them, on job-related information from the resource manager, and on any other information available from the data sources in use. For example, tags can correspond to certain average rates of utilization of various resources, job ownership, job duration, resource utilization specifics, special execution modes, availability of detailed system monitoring data, and so forth.
The approach provides the means for efficient grouping and filtering of the whole job queue history by any ad hoc combination of the specified tags. Drawing on the experience of studying application efficiency and scalability through system monitoring data analysis, the authors propose introducing the following job categories at the first stage of implementation. Tag names are designed to be self-explanatory; nevertheless, every tag must have a detailed full-format description available.
1.1. System monitoring data based categories
CPU utilization
• Tag name: avg_CPU_user LOW
Category: Low CPU user utilization.
Criteria: Average value of CPU_user doesn't exceed 20%.
• Tag name: avg_CPU_user HIGH
Category: High CPU user utilization.
Criteria: Average value of CPU_user exceeds 40%.
• Tag name: avg_CPU_idle TOO HIGH
Category: CPU is idle for a considerable time.
Criteria: Average CPU_idle value exceeds 25%.
Competition of processes for CPU cores
• Tag name: avg_LA LOW
Category: User job is almost out of action, almost no utilization of CPU.
Criteria: Average Load Average is below 1.
• Tag name: avg_LA SINGLE CORE
Category: Only one process per node is active as an average.
Criteria: Average Load Average is approximately 1.
• Tag name: avg_LA NORMAL
Category: Optimal competition of processes.
Criteria: Average Load Average is approximately equal to the number of cores per node.
• Tag name: avg_LA HYPERTHREADED
Category: Normal process competition with hyperthreading on.
Criteria: Average Load Average is approximately equal to twice the number of CPU cores per node.
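The CPU utilization and Load Average criteria listed above can be evaluated with a few comparisons. The sketch below is illustrative only: the function name `cpu_and_la_tags` is hypothetical, and the 15% relative tolerance is an assumed reading of "approximately", which the criteria do not quantify. The sketch also assumes that a value close to 1 is tagged SINGLE CORE rather than LOW.

```python
def cpu_and_la_tags(avg_cpu_user, avg_cpu_idle, avg_la, cores_per_node, rel_tol=0.15):
    """Return the CPU and Load Average tags whose criteria a job meets.

    rel_tol is an assumption: "approximately equal" is read as
    "within 15% of the target value".
    """
    def approx(value, target):
        return abs(value - target) <= rel_tol * target

    tags = []
    if avg_cpu_user <= 20:
        tags.append("avg_CPU_user LOW")
    if avg_cpu_user > 40:
        tags.append("avg_CPU_user HIGH")
    if avg_cpu_idle > 25:
        tags.append("avg_CPU_idle TOO HIGH")
    # Assumed precedence: "approximately 1" wins over "below 1".
    if approx(avg_la, 1):
        tags.append("avg_LA SINGLE CORE")
    elif avg_la < 1:
        tags.append("avg_LA LOW")
    if approx(avg_la, cores_per_node):
        tags.append("avg_LA NORMAL")
    if approx(avg_la, 2 * cores_per_node):
        tags.append("avg_LA HYPERTHREADED")
    return tags

# Hypothetical jobs on a node with 8 cores.
print(cpu_and_la_tags(5.0, 90.0, 1.02, 8))
print(cpu_and_la_tags(85.0, 3.0, 8.3, 8))
```

A job with average CPU_user between 20% and 40% receives neither CPU_user tag, which follows directly from the thresholds given in the criteria.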
Floating point operations
• Tag name: avg_Flops HIGH
Category: Intensive CPU floating point operations.
Criteria: Average floating point operation rate exceeds 10% of the theoretical CPU peak.
Interconnect activity
• Tag name: avg_IB_packages_num LOW
Category: Low number of inter-node data transmissions.
Criteria: Average packet send rate does not exceed 10^3 packets per second.
• Tag name: avg_IB_packages_size TOO LOW
Category: Small size of packets.
Criteria: Average packet send rate exceeds 10^3 packets per second while the average data transmission rate is below 2 kilobytes per second.
• Tag name: avg_IB_speed HIGH
Category: High data transmission intensity.
Criteria: Average data transmission rate is over 0.2 gigabytes per second and up to 1 gigabyte per second.
• Tag name: avg_IB_speed TOO HIGH
Category: Very high data transmission intensity.
Criteria: Average data transmission rate is over 1 gigabyte per second.
Memory utilization
• Tag name: avg_cache_L1/L3 TOO LOW
Category: Very low efficiency of cache stack utilization.
Criteria: Ratio of the number of L1 misses to the number of L3 misses is below 5.
• Tag name: avg_cache_L1/L3 LOW
Category: Reduced efficiency of cache stack utilization.
Criteria: Ratio of the number of L1 misses to the number of L3 misses is below 10.
• Tag name: avg_cache_L1/L3 HIGH
Category: Good efficiency of cache stack utilization.
Criteria: Ratio of the number of L1 misses to the number of L3 misses exceeds 10.
• Tag name: avg_mem/cache_L1 LOW
Category: Reduced efficiency of L1 cache utilization.
Criteria: Ratio of the total number of memory operations to the number of L1 misses does not exceed 15.
• Tag name: avg_memload HIGH
Category: Intensive memory operations.
Criteria: Average number of memory operations exceeds 10^9 operations per second.
1.2. Resource manager based categories
Job execution status
• Tag name: job_status COMPLETED
Category: Job is successfully finished.
• Tag name: job_status FAILED
Category: Job is finished with an error in program.
• Tag name: job_status CANCELED
Category: Job was cancelled by user.
• Tag name: job_status TIMEOUT
Category: Job was cancelled by exceeding time limit.
• Tag name: job_status NODE_FAIL
Category: Job is finished with system error.
Job submission details
• Tag name: job_time_limit CUSTOM
Category: Requested time limit is custom.
• Tag name: job_start_script CUSTOM
Category: Job batch file is custom.
• Tag name: job_cores_requested FEW
Category: Not all available CPU cores per node requested.
• Tag name: job_cores_requested SINGLE
Category: Just a single CPU core per node requested.
• Tag name: job_MPI INTEL
Category: MPI type used: Intel MPI.
• Tag name: job_MPI OpenMPI
Category: MPI type used: OpenMPI.
• Tag name: job_nnodes SINGLE
Category: Job used a single node.
• Tag name: job_nnodes FEW
Category: Job used from 2 up to 8 nodes.
• Tag name: job_nnodes MANY
Category: Job used 8 nodes and above.
System-dependent peculiarities and partition usage
Illustrated by the example of "Lomonosov" supercomputer partitions.
• Tag name: job_partition REGULAR4
Category: Job allocated to REGULAR4 partition.
• Tag name: job_partition REGULAR6
Category: Job allocated to REGULAR6 partition.
• Tag name: job_partition HDD4
Category: Job allocated to HDD4 partition.
• Tag name: job_partition HDD6
Category: Job allocated to HDD6 partition.
• Tag name: job_partition SMP
Category: Job allocated to SMP partition.
• Tag name: job_partition GPU
Category: Job allocated to GPU partition.
• Tag name: job_partition TEST
Category: Job allocated to TEST partition.
• Tag name: job_partition GPUTEST
Category: Job allocated to GPUTEST partition.
• Tag name: job_partition EXCEPT TEST
Category: Job allocated to regular or high priority partition.
• Tag name: job_priority HIGH
Category: Job allocated to partitions with a higher priority (queues reg4prio, gpu_p, dedicated6).
Matching partition specifics
• Tag name: job_accel GPU
Category: User application uses accelerators. Accelerator type: GPU.
• Tag name: job_accel GPU UNUSED
Category: Job is run on GPU partition, but never uses GPUs.
• Tag name: job_disks UNUSED
Category: Job is run on HDD-equipped partition, but never uses local I/O.
• Tag name: job_disks TOO LOW
Category: Job is run on HDD-equipped partition, but I/O rate is very low.
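Several of the resource manager based tags in this section reduce to simple comparisons on the job submission record. The following sketch covers the node-count and core-request tags. The function name `submission_tags` and its parameters are hypothetical. Since the listed categories leave the 8-node boundary ambiguous, the sketch assumes that FEW covers 2 to 7 nodes and MANY covers 8 and more; it also assumes that a single-core request is tagged SINGLE and not additionally FEW.

```python
def submission_tags(nnodes, cores_requested_per_node, cores_per_node):
    """Derive node-count and core-request tags from a job submission record.

    Assumptions (not stated in the paper): the 8-node boundary belongs to MANY,
    and SINGLE excludes FEW for core requests.
    """
    tags = []
    if nnodes == 1:
        tags.append("job_nnodes SINGLE")
    elif nnodes < 8:
        tags.append("job_nnodes FEW")
    else:
        tags.append("job_nnodes MANY")

    if cores_requested_per_node == 1:
        tags.append("job_cores_requested SINGLE")
    elif cores_requested_per_node < cores_per_node:
        tags.append("job_cores_requested FEW")
    return tags

# Hypothetical submissions on nodes with 8 cores each.
print(submission_tags(1, 1, 8))
print(submission_tags(4, 6, 8))
print(submission_tags(8, 8, 8))
```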
1.3. Other categories
Beyond the tags that can be assigned automatically, it is possible to introduce manually set tags. This is useful when the criteria cannot be determined automatically. Two manual setting options are available. The first and most typical is selecting a tag from the list of known tags, or introducing a new one with a human-readable and a formal description. The second is pushing some tags, like "higher system monitoring rate for the job", via the command line when submitting a job.
Manual tags apply, for instance, to a general job description characterizing the type of data processing, which is usually known a priori or is determined by a specialist in the course of studying job behavior: job_behavior DATA MINING, job_behavior MASTER-SLAVE, job_behavior COMMUNICATION, job_behavior ITERATIVE, etc.
In the same manner, typical anomalies encountered in the course of analysis can be specified: job_bug DEADLOCK, job_bug DATA RACE, etc.
It is very useful to specify whether a widely used algorithm implementation or software package is used. This can contribute greatly to scalability and algorithm study projects such as AlgoWiki [13]: job_sw VASP, job_sw FIREFLY, job_sw GROMACS, etc.
If detailed reports on job efficiency analysis or issues are available, or a specific standard report like JobDigest is available, it is useful to mark this with another tag, for example: job_analized, job_analized JobDigest.
2. Implementation
We adhere to the principle of building a tool that can be deployed at any supercomputer center with minimal effort. We currently support the Slurm [14] and Cleo [15] resource managers and the Ganglia [16], Collectd [17], Clustrx [18] and DiMMon [5] (the most promising) monitoring systems.
For integral job characteristics derivation and tagging, PostgreSQL is used as the data storage for coarsened system monitoring data and for job information saved from resource managers. The saved job information is processed with JavaScript, jQuery with jQuery UI [19] and TagIt [20].
A tag can be assigned to a job only if it is already declared in the tag description table. This table includes the tag id, name, human-readable description, criteria (a specification of an SQL request for automatic processing), comments, and an availability flag that can be set only by the administrator. Any user can suggest a new tag, but it becomes available only after administrator approval. Information on the author of a new tag is saved in the comments attribute, together with the tag description and motivation suggested by the user.
All tags can be assigned in two ways: automatically and manually. Any tag set by mistake or error can be manually removed from a job.
Automatic mode. In this mode, the tags are automatically assigned:
• to all finished jobs according to SQL-based criteria regarding saved integral job characteristics data, information from resource manager and other available saved data;
• as a result of running a special script that processes whole saved job collection info.
In this mode, a special attribute indicates that the tag was set in the batch (automatic) mode of tag assignment.
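The tag description table and the automatic mode can be illustrated with a small self-contained sketch. It uses Python's built-in sqlite3 module as a stand-in for PostgreSQL so that it runs without a server. All table names, column names and sample rows are hypothetical and do not reproduce the actual schema of the tool; only the set of tag attributes and the idea of SQL-based criteria follow the description above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE job (id INTEGER PRIMARY KEY, avg_cpu_user REAL, avg_la REAL);
CREATE TABLE tag (
    id INTEGER PRIMARY KEY,
    name TEXT UNIQUE NOT NULL,
    description TEXT,
    criteria TEXT,                        -- SQL condition; NULL for manual tags
    comments TEXT,
    available INTEGER NOT NULL DEFAULT 0  -- set by the administrator only
);
CREATE TABLE job_tag (
    job_id INTEGER REFERENCES job(id),
    tag_id INTEGER REFERENCES tag(id),
    auto INTEGER NOT NULL,                -- 1 if set in automatic mode
    author TEXT,                          -- who set a manual tag
    PRIMARY KEY (job_id, tag_id)
);
""")
# Hypothetical integral characteristics of three finished jobs.
conn.executemany("INSERT INTO job VALUES (?, ?, ?)",
                 [(1, 12.0, 1.0), (2, 85.0, 8.1), (3, 55.0, 0.3)])
conn.executemany(
    "INSERT INTO tag (name, description, criteria, available) VALUES (?, ?, ?, ?)",
    [("avg_CPU_user LOW", "Low CPU user utilization", "avg_cpu_user <= 20", 1),
     ("avg_CPU_user HIGH", "High CPU user utilization", "avg_cpu_user > 40", 1),
     # Suggested but not yet approved by the administrator, so never applied.
     ("avg_LA LOW", "Almost no CPU utilization", "avg_la < 1", 0)])

def assign_automatic_tags(conn):
    """Apply every approved tag that has a criterion to all matching jobs."""
    rows = conn.execute(
        "SELECT id, criteria FROM tag WHERE available = 1 AND criteria IS NOT NULL"
    ).fetchall()
    for tag_id, criteria in rows:
        # The criterion comes from the administrator-approved tag table, so it is
        # spliced in as trusted SQL; this must never be done with raw user input.
        conn.execute(
            f"INSERT OR IGNORE INTO job_tag (job_id, tag_id, auto) "
            f"SELECT id, ?, 1 FROM job WHERE {criteria}", (tag_id,))
    conn.commit()

assign_automatic_tags(conn)
tagged = conn.execute(
    "SELECT jt.job_id, t.name FROM job_tag jt JOIN tag t ON t.id = jt.tag_id "
    "ORDER BY jt.job_id, t.name").fetchall()
print(tagged)
```

Storing the criterion as data rather than code means a new automatic tag needs only a new row in the tag table, which matches the suggest-then-approve workflow described above.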
Manual mode. Manual tag assignment is usually done by user, project manager or administrator in the following cases:
• as a result of certain job analysis (specifying algorithm implemented, etc.);
• as a result of specifying the tag via command line when submitting a job;
• any tag in a user-specific tag space (marking out important job runs as a part of the project, etc.).
In this mode, an attribute identifying the tag author is set, which also allows finding jobs marked as part of a certain project or by a certain user.
A user-specific tag space consists of regular tags and custom user tags. Tags manually assigned by a user are visible only within the scope of the project and to system administrators. Members of other projects see their own tag spaces, and the general tag space is available to all users.
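The visibility rule for tag spaces can be sketched as a simple filter. The function name `visible_tags` and the record layout are hypothetical; in particular, encoding the general tag space as `project=None` is an assumption made for the illustration.

```python
def visible_tags(assignments, viewer_project, is_admin=False):
    """Filter tag assignments down to what a given viewer may see.

    assignments: list of dicts with keys 'job', 'tag', 'project', where
    'project' is None for tags in the general tag space (assumed encoding).
    """
    return [a for a in assignments
            if is_admin or a["project"] is None or a["project"] == viewer_project]

# Hypothetical tag assignments.
assignments = [
    {"job": 1, "tag": "avg_CPU_user LOW", "project": None},       # general space
    {"job": 1, "tag": "important run", "project": "projA"},       # manual, project A
    {"job": 2, "tag": "job_bug DEADLOCK", "project": "projB"},    # manual, project B
]
seen_by_a = [a["tag"] for a in visible_tags(assignments, "projA")]
seen_by_admin = [a["tag"] for a in visible_tags(assignments, None, is_admin=True)]
print(seen_by_a)
print(seen_by_admin)
```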
3. Use cases
Of course, real-life use cases are very diverse. In this section, we share our experience of the everyday use of the proposed technique during the trial deployment of the developed tool at the Supercomputer Center of Moscow State University, using a few examples to give a general idea of it.
3.1. Revealing jobs, users and projects that practice inappropriate resource utilization
One of the problems in the everyday practice of a large-scale supercomputer center with a number of heterogeneous resources and a considerable number of users competing for them is unacceptable efficiency or inappropriate resource utilization. This is a higher priority for specific limited resources, such as compute nodes equipped with specialized accelerators, local disks, extra memory, or other hardware and software that is critical for some applications, because these nodes can still be used by applications that do not need the specific resources the nodes possess. Such nodes usually have high potential for specific resource-demanding applications, and in large systems like "Lomonosov" they are usually managed as a separate partition with special queuing options that allow submitting jobs to the appropriate partition. This is vital for projects whose computations are possible only thanks to the advantages of such partitions: by queuing to the desired partition, users get a guarantee that their application will have all the necessary resources at its disposal.
Nevertheless, analysis of the whole job collection for such partitions reveals numerous job runs that do not use any of the benefits of the partition facilities. Of course, depending on algorithm peculiarities, applications can use resources with very different intensity, but further analysis usually shows that the majority of suspicious jobs never use any benefit of such partitions. The reasons vary, but usually it is a shorter wait time in the queue.
This can be seen on GPU partitions with user job runs that never use GPUs. A slightly different situation is seen on HDD-equipped nodes with absent or extremely low disk usage. Finally, single-process applications that do not benefit from multiple CPU cores can be seen on almost any partition, regardless of hardware and software.
It is important to find the root cause of such application behavior. As soon as the reason is found and the user or administrator applies changes, the share of such jobs can be lowered, which immediately raises HPC system efficiency and overall throughput.
The most common reasons are:
• Problems inside the application, program or algorithm. The user is sure that the application needs the resources, but in practice the application does not utilize them at all or utilizes them at extremely low rates.
• Problems of the HPC system. The declared resources are not available on the nodes.
• Inappropriate job allocation. This can be either a mistake or cheating to get a shorter job wait time.
Regardless of the real reason, these job runs lead to a higher wait time for the jobs that really need the specific resources.
The search for such jobs can be automated using integral job characteristics and some of the introduced tags.
For example, to filter the jobs allocated to the GPU partition that make no use of the accelerator, one can use the tags job_partition GPU and job_accel GPU UNUSED at the same time. Next, one can cut off jobs allocated to the test partition as being of no interest. The remaining jobs that are assigned the job_status COMPLETED tag probably do not need GPUs at all, as they finished successfully with no registered GPU usage. At this point there are two possibilities: either it is a mistake (by the user or the system), or the user did it intentionally to reduce job wait time, as the wait time in specific partitions is sometimes shorter than in regular ones.
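The tag combination described above amounts to set operations over the tags of each job. The following sketch is an assumption-laden illustration: the function name `filter_jobs`, the job ids and the tag sets are hypothetical, and the sample data assumes that a job in the GPU test partition carries both partition tags.

```python
def filter_jobs(job_tags, require=(), exclude=()):
    """Return ids of jobs that carry all `require` tags and none of `exclude`."""
    require, exclude = set(require), set(exclude)
    return sorted(job_id for job_id, tags in job_tags.items()
                  if require <= tags and not (exclude & tags))

# Hypothetical job ids and tag sets.
job_tags = {
    101: {"job_partition GPU", "job_accel GPU UNUSED", "job_status COMPLETED"},
    102: {"job_partition GPU", "job_accel GPU", "job_status COMPLETED"},
    # Assumed: a GPU test-partition job carries both partition tags.
    103: {"job_partition GPU", "job_partition GPUTEST",
          "job_accel GPU UNUSED", "job_status COMPLETED"},
    104: {"job_partition GPU", "job_accel GPU UNUSED", "job_status FAILED"},
}

suspects = filter_jobs(
    job_tags,
    require=["job_partition GPU", "job_accel GPU UNUSED", "job_status COMPLETED"],
    exclude=["job_partition TEST", "job_partition GPUTEST"],
)
print(suspects)
```

Only job 101 remains: job 102 uses the GPU, job 103 ran in a test partition, and job 104 did not complete successfully.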
A very similar situation is seen for HDD-equipped nodes. Jobs tagged with job_partition HDD4 or job_partition HDD6 together with job_disks UNUSED or job_disks TOO LOW can potentially be executed successfully on regular partitions. Note that here very low resource utilization is also an option. This means that disk operations might easily be replaced with network file system operations with minimal additional overhead or even without it.
For jobs tagged with job_nnodes SINGLE and avg_LA SINGLE CORE or avg_LA LOW, it is quite reasonable to ask why they were submitted to the supercomputer at all. Such jobs use a single node and a single core (or just a few processes per node) and can potentially run well on a desktop. Unfortunately, such jobs are encountered very often.
Users who submit the types of jobs mentioned above must be contacted to find out the reasons for the inappropriate and inefficient resource utilization revealed. The problems found should be resolved. If cheating is detected, or it is proved that the executed jobs do not really need HPC resources, quotas for the corresponding user accounts and projects can be reduced, up to blocking.
Let us look at a real-life example. Figure 1 shows the filtered list of jobs allocated to regular partitions with the automatically assigned avg_LA_SINGLE_CORE tag. It is clearly seen that the jobs have a low LoadAverage close to 1, as filtered by the tag, and at the same time a very low CPU_user. Note that this is not a test partition and all jobs run on a single node, grabbing 8 cores on the regular4 and hdd4 partitions!
A closer look at the owner of the longest job, which was cancelled by timeout, shows that this user always runs such single-node, even single-process, jobs regardless of partition (Figure 2).
[Figure 1: screenshot of the job list filtered by the auto_avg_LA_SINGLE_CORE tag; highlighted columns: single node, CPU_user, LoadAvg]
Fig. 1. Filtered single-process jobs found in real job queue in various regular partitions
[Figure 2: screenshot of a filtered job list; highlighted columns: single node, CPU_user, LoadAvg]
Fig. 2. Filtered single-process jobs found in real job queue in various regular partitions
3.2. Finding jobs with high Flops intensity and high efficiency
Apart from finding problem cases, there is the task of finding well-optimized jobs that utilize HPC resources with high efficiency. The owners of these jobs are usually experienced users, highly qualified in parallel programming and fine tuning of software, which secures a highly efficient supercomputer load. Contact with such users is very important, first and foremost to learn the techniques they use and share them with novice users, contribute to the FAQ/wiki sections of the helpdesk, and so on. Experienced users are very likely to be invited to give public lectures, present at seminars, and join other educational activities.
As a rule, most jobs with high floating point intensity are tagged with avg_Flops HIGH and avg_CPU_user HIGH, and the most efficient applications in terms of memory utilization are tagged with avg_cache_L1/L3 HIGH. Jobs filtered by these criteria usually have good data locality, are well balanced, and show high performance.
Deeper analysis of such jobs allows revealing optimal command line and compiler options for a variety of categories of standard applications and algorithm implementations. Once such a job is confirmed to be a well-optimized typical example of the use of a software package or of an algorithm implementation, a proper tag corresponding to this category can be set (like job_sw VASP). This provides the means for comparative analysis of similar jobs. It can also serve as a good basis for more detailed analysis of the whole job collection and for revealing inefficient applications and users that use resources inefficiently.
3.3. Finding applications with special need for large amounts of memory
Many users of the supercomputer complex run applications that demand a large amount of memory per process. Such applications are usually efficient enough, but are often scaled down in different ways to fit the available memory, for example by reducing the number of MPI processes per node.
Such applications usually run on a considerable number of nodes, with LoadAvg values below the number of CPU cores per node. These jobs are usually tagged avg_memload HIGH together with tags related to node and core usage: avg_LA SINGLE CORE, job_nnodes FEW or job_nnodes MANY.
If such a job is found in the "Regular 6" partition with 6-core CPUs (tagged job_partition REGULAR6), even moving it to the "Regular 4" partition can be an optimization, reducing the number of idle CPU cores per node by 4 (2*6-2*4).
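The arithmetic behind this estimate can be written out as a short Python sketch. It assumes two CPUs per node in both partitions (as implied by the expression 2*6-2*4) and one single-threaded process per occupied core; the function names are hypothetical.

```python
SOCKETS_PER_NODE = 2  # assumed from the expression 2*6 - 2*4 in the text
CORES_PER_CPU = {"REGULAR6": 6, "REGULAR4": 4}


def cores_per_node(partition):
    """Total CPU cores on one node of the given partition."""
    return SOCKETS_PER_NODE * CORES_PER_CPU[partition]


def idle_cores_per_node(partition, procs_per_node):
    """Cores left idle when each process occupies exactly one core."""
    total = cores_per_node(partition)
    if not 0 < procs_per_node <= total:
        raise ValueError("procs_per_node does not fit on a node of this partition")
    return total - procs_per_node


def idle_core_reduction(src, dst, procs_per_node):
    """How many fewer cores per node stay idle after moving from src to dst."""
    return idle_cores_per_node(src, procs_per_node) - idle_cores_per_node(dst, procs_per_node)


# A job tagged avg_LA SINGLE CORE keeps roughly one core busy per node.
saved = idle_core_reduction("REGULAR6", "REGULAR4", procs_per_node=1)
print(saved)
```

The reduction equals 12 - 8 = 4 for any number of processes per node that fits on a "Regular 4" node, which matches the figure given above.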
Some of these applications can also benefit from moving to hybrid MPI+OpenMP or MPI+Cilk models. If such a model cannot be applied, some of the applications can be moved to the SMP partition, where much more memory is available.
3.4. Revealing categories of issues and inefficient behavior
Accumulating statistics and knowledge on the problems of parallel applications is one of the most important components of job collection analysis in an HPC center. For this purpose it is useful to be able to tag analyzed jobs with the implementation issues found, as well as with tags that mark inefficient use of computing resources, hardware problems or other features found in the course of analyzing job execution characteristics.
When analyzing inefficient or abnormal application behavior, whether of a single run or of a sequence of jobs based on a certain software package, it is often necessary to contact the user, the application owner, who can provide additional information on program details: the algorithm implementation used, program architecture and structure, the computing model and so on, up to dependencies on input data and command-line options. All this information should be recorded to aid further analysis of similar applications and categories.
Whenever an application run is analyzed, it is useful to tag the details of the system software used, first of all the math libraries, the compiler and compiler options, the MPI type, etc. This provides the basis for the comparative analysis of similar jobs or sequences of jobs. If differences in behavior are found, one can investigate their origin in more depth: the reaction of the user application, the system software configuration, etc.
Widespread issues such as data races, deadlocks and so forth can be marked with special tags (job_bug DATA RACE, job_bug DEADLOCK, etc.). This helps in the further analysis of other jobs: one can compare strange program behavior with previously analyzed profiles marked as having specific issues. Once similar behavior is found, it can be a key to resolving the problems of the job under analysis.
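One possible way to automate such a comparison is a nearest-profile lookup, sketched below in Python. The text does not specify a similarity measure, so the Euclidean distance over characteristics normalized to [0, 1], the profile layout and all numeric values are assumptions chosen for illustration.

```python
import math


def profile_distance(a, b):
    """Euclidean distance over the characteristics present in both profiles."""
    shared = sorted(set(a) & set(b))
    if not shared:
        raise ValueError("profiles share no characteristics")
    return math.sqrt(sum((a[name] - b[name]) ** 2 for name in shared))


def closest_issue(profile, tagged_profiles):
    """Return (issue tag, distance) of the most similar tagged reference profile."""
    tag = min(tagged_profiles, key=lambda t: profile_distance(profile, tagged_profiles[t]))
    return tag, profile_distance(profile, tagged_profiles[tag])


# Hypothetical reference profiles with values normalized to [0, 1].
tagged_profiles = {
    "job_bug DEADLOCK": {"avg_CPU_user": 0.02, "avg_LA": 0.05, "avg_memload": 0.25},
    "job_bug DATA RACE": {"avg_CPU_user": 0.90, "avg_LA": 0.95, "avg_memload": 0.40},
}

# Profile of the job under analysis (illustrative values).
suspect = {"avg_CPU_user": 0.05, "avg_LA": 0.10, "avg_memload": 0.30}

issue, dist = closest_issue(suspect, tagged_profiles)
print(issue)
```

A match of this kind is only a hint for the analyst: it points to a tagged case worth inspecting, and the actual cause still has to be confirmed with the application owner.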
Conclusion and future work
Near-future plans include implementing the Octoshell [7, 8] module for full project-oriented workflow support and authentication, thus making the proposed service accessible to any user. We expect it to be ready by the middle of 2016. By that time we also plan to extend the supported tag set and, if needed, adjust the criteria for existing tags.
To sum up, a user-friendly and effective technique for filtering, grouping and further analysis of the whole queued job collection of large-scale HPC systems, based on system monitoring and resource manager data, has been proposed and implemented. The developed tool is being evaluated in the everyday practice of the Supercomputer Center of Lomonosov Moscow State University, providing means for effective analysis of every user application run. The valuable collection of information on all finished jobs has already been growing in 24/7 mode for several months.
The work was funded in part by the Russian Foundation for Basic Research (grants №16-01-00912A, №13-01-00186A), Russian Presidential study grant (SP-1981.2016.5), and by the Ministry of Education and Science of the Russian Federation, Agreement No. 14-601.21.0006 (unique identifier RFMEFI607UX0006).
References
1. Top50 Supercomputers of Russia and CIS. Available at: http://top50.supercomputers.ru/ (accessed: 15.02.2016).
2. Top500 Supercomputer Sites. Available at: http://top500.org/ (accessed: 15.02.2016).
3. Antonov A., Zhumatiy S., Nikitenko D., Stefanov K., Teplov A., Shvets P. Analysis of Dynamic Characteristics of Job Stream on Supercomputer System. Numerical Methods and Programming. 2013. vol. 14, no. 2. pp. 104-108.
4. Safonov A., Kostenetskiy P., Borodulin K., Melekhin F. A Monitoring System for Supercomputers of SUSU. Russian Supercomputing Days International Conference, Moscow, Russian Federation, 28-29 September, 2015, Proceedings. CEUR Workshop Proceedings, 2015. vol. 1482. pp. 662-666.
5. Stefanov K. et al. Dynamically Reconfigurable Distributed Modular Monitoring System for Supercomputers (DiMMon). Procedia Computer Science / Elsevier B.V. 2015. vol. 66. pp. 625-634. DOI: 10.1016/j.procs.2015.11.071.
6. Nikitenko D. Complex Approach to Performance Analysis of Supercomputer Systems Based on System Monitoring Data. Numerical Methods and Programming. 2014. vol. 15. pp. 85-97.
7. Voevodin V., Zhumatiy S., Nikitenko D. Octoshell: Large Supercomputer Complex Administration System. Russian Supercomputing Days International Conference, Moscow, Russian Federation, 28-29 September, 2015, Proceedings. CEUR Workshop Proceedings, 2015. vol. 1482. pp. 69-83.
8. Nikitenko D., Voevodin V., Zhumatiy S. Resolving Frontier Problems of Mastering Large-Scale Supercomputer Complexes. Proceedings of the ACM International Conference on Computing Frontiers (CF'16), Como, Italy, 16-18 May, 2016. ACM New York, NY, USA, 2016. pp. 349-352. DOI: 10.1145/2903150.2903481.
9. Voevodin Vl., Antonov A., Bryzgalov P., Nikitenko D., Zhumatiy S., Sobolev S., Stefanov K., Voevodin Vad. Practice of "Lomonosov" Supercomputer. Open Systems. 2012. no. 7. pp. 36-39.
10. Zhumatiy S., Nikitenko D. Approach to Flexible Supercomputers Management. International Supercomputing Conference Scientific Services & Internet: All Parallelism Edges, Novorossiysk, Russian Federation, 23-28 September, 2013, Proceedings. MSU, 2013. pp. 296-300.
11. Voevodin Vl. Supercomputer Situational Screen. Open Systems. 2014. no. 3. pp. 36-39.
12. Shvets P., Antonov A., Nikitenko D., Sobolev S., Stefanov K., Voevodin Vad., Voevodin V., Zhumatiy S. An Approach for Ensuring Reliable Functioning of a Supercomputer Based on a Formal Model. Parallel Processing and Applied Mathematics. 11th International Conference, PPAM 2015, Krakow, Poland, September 6-9, 2015. Springer International Publishing, vol. 9573. pp. 12-22. DOI: 10.1007/978-3-319-32149-3_2.
13. Voevodin V., Antonov A., Dongarra J. AlgoWiki: an Open Encyclopedia of Parallel Algorithmic Features. Supercomputing Frontiers and Innovations. 2015. vol. 2, no. 1. pp. 4-18. DOI: 10.14529/jsfi150101.
14. SLURM Workload Manager. Available at: http://slurm.schedmd.com/ (accessed: 15.02.2016).
15. Cleo Cluster Batch System. Available at: http://sourceforge.net/projects/cleo-bs/ (accessed: 15.02.2016).
16. Ganglia Monitoring System. Available at: http://ganglia.sourceforge.net/ (accessed: 15.02.2016).
17. Collectd - The System Statistics Collection Daemon. Available at: https://collectd.org/ (accessed: 15.02.2016).
18. Clustrx. Available at: http://www.t-platforms.ru/products/software/clustrxproductfamily/clustrxwatch.html (accessed: 15.02.2016).
19. jQuery & jQuery UI. Available at: http://jqueryui.com/ (accessed: 15.02.2016).
20. TagIt. Available at: http://aehlke.github.io/tag-it/ (accessed: 15.02.2016).
UDC 004.457, 004.382.2 DOI: 10.14529/cmse160403
STUDY OF INTEGRAL CHARACTERISTICS OF SUPERCOMPUTER APPLICATIONS FOR THE WHOLE JOB STREAM OF LARGE COMPUTING SYSTEMS
© 2016 D.A. Nikitenko, V.V. Voevodin, A.M. Teplov, S.A. Zhumatiy, Vad.V. Voevodin, K.S. Stefanov, P.A. Shvets
Lomonosov Moscow State University (Leninskie Gory 1, Moscow, 119991 Russia) E-mail: [email protected], [email protected], [email protected], [email protected], cstef@parallel.ru, shvets.pavel.srcc@gmail.com Received: 11.04.2016
The efficiency of supercomputer systems depends on many factors. When many users work concurrently, control over the use of resources allocated for computations plays a special role. It is important that users have detailed information on the properties of their completed jobs. In collaborative work on applied problems, the project manager additionally needs to control resource utilization by project members. Unfortunately, such information is currently not available as a rule. This gap is to be filled by the approach developed by the authors to obtaining and studying integral characteristics of supercomputer applications for the whole job stream of large supercomputer systems. The approach is based on the use of system monitoring data, building integral characteristics of individual runs for the whole set of executed jobs, dividing them into classes, and revealing peculiarities of runs.
Keywords: supercomputer, efficiency, system monitoring, job classes, integral job characteristics, job stream, control of computing resource utilization.
FOR CITATION
Nikitenko D.A., Voevodin V.V., Teplov A.M., Zhumatiy S.A., Voevodin Vad.V., Stefanov K.S., Shvets P.A. Supercomputer Application Integral Characteristics Analysis for the Whole Queued Job Collection of Large-Scale HPC Systems. Bulletin of the South Ural State University. Series: Computational Mathematics and Software Engineering. 2016. vol. 5, no. 4. pp. 32-45. DOI: 10.14529/cmse160403.
Литература
1. ТорБО Supercomputers of Russia and CIS. URL: http://top50.supercomputers.ru/ (дата обращения: 15.02.2016).
2. Top500 Supercomputer sites. URL: http://top500.org/ (дата обращения: 15.02.2016).
3. Antonov A., Zhumatiy S., Nikitenko D., Stefanov K., Teplov A., Shvets P. Analysis of dynamic characteristics of job stream on supercomputer system Numerical Methods and Programming, 2013. Vol. 14, No. 2. P. 104-108.
4. Safonov A., Kostenetskiy P., Borodulin K., Melekhin F. A monitoring system for supercomputers of SUSU // Russian Supercomputing Days International Conference, Moscow, Russian Federation, 28-29 September, 2015, Proceedings. CEUR Workshop Proceedings, 2015. Vol. 1482. P. 662-666.
5. Stefanov К. et al. Dynamically Reconfigurable Distributed Modular Monitoring System for Supercomputers (DiMMon) // Proccdia Computer Science / Elsevier B.V., 2015. Vol. 66. P. 625-634.
6. Nikitenko D. Complex approach to performance analysis of supercomputer systems based on system monitoring data // Numerical Methods and Programming, 2014. Vol. 15. P. 85—97.
7. Voevodin V., Zhumatiy S., Nikitenko D. Octoshell: Large Supercomputer Complex Administration System // Russian Supercomputing Days International Conference, Moscow, Russian Federation, 28-29 September, 2015, Proceedings. CEUR Workshop Proceedings, 2015. Vol. 1482. P. 69-83.
8. Nikitenko D., Voevodin V., Zhumatiy S. Resolving frontier problems of mastering large-scale supercomputer complexes j j Proceedings of the ACM International Conference on Computing Frontiers (CF'16), Como, Italy, 16-18 May, 2016. ACM New York, NY, USA, 2016. P. 349-352.
9. Voevodin VI., Antonov A., Bryzgalov P., Nikitenko D., Zhumatiy S., Sobolev S., Stefanov K., Voevodin Vad. Practice of "Lomonosov" Supercomputer // Open systems, 2012. No. 7. P. 36-3!).
10. Zhumatiy S., Nikitenko D. Approach to flexible supercomputers management. // International supercomputing conference Scientific Services h Internet: all parallelism edges, Novorossiysk, Russian Federation, 23-28 September, 2013, Proceedings. MSU, 2013. P. 296300.
11. Voevodin VI. Supercomputer situational screen // Open systems, 2014. No. 3. P. 36-39.
12. Shvets P. , Antonov A., Nikitenko D., Sobolev S., Stefanov K., Voevodin Vad., Voevodin V., Zhumatiy S. An Approach for Ensuring Reliable Functioning of a Supercomputer Based on a Formal Model // Parallel Processing and Applied Mathematics, llt.h International Conference, PPAM 2015, Krakow, Poland, September 6-9, 2015- Springer International Publishing. Vol. 9573. P. 12-22.
13. Voevodin V., Antonov A., Dongarra J. AlgoWiki: an Open Encyclopedia of Parallel Algorithmic Features // Supercomputing Frontiers and Innovations, 2015. Vol. 2, No.l. P. 4 18.
14. SLURM workload manager. URL: http://slurm.schedmd.com/ (дата обращения: 15.02.2016).
15. Cleo cluster batch system. URL: http://sourceforge.net/projects/cleo-bs/ (дата обращения: 15.02.2016).
16. Ganglia Monitoring System. URL: http://ganglia.sourceforge.net/ (дата обращения: 15.02.2016).
17. Collectd - The system statistics collection daemon. URL: https://collectd.org/ (дата обращения: 15.02.2016).
18. Clustrx. URL: http://www.t-platforms.rH/products/software/chistrxproductfainily/ clustrxwatch.html (дата обращения: 15.02.2016).
19. jQuery h jQuerv UI. URL: http://jqueryui.com/ (дата обращения: 15.02.2016).
20. Taglt. URL: http://achlke.githnb.io/tag-it/ (дата обращения: 15.02.2016).