
9.4 Performance Measurements (TR-1)


All performance measurements for marquee benchmarks are stated as “figures of merit” (FOM). All other benchmarks use wall clock execution time as their performance measurement.

The first four marquee benchmarks print a raw figure of merit at the end of the benchmark run. For the LAMMPS benchmark, the raw figure of merit is calculated by multiplying the total number of atoms in the simulation by the total number of time steps (100) and dividing by the total wall-clock loop time (in seconds).
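As an illustration only (not taken from the benchmark source, and using made-up run values), the LAMMPS raw FOM calculation described above can be sketched in C as follows:

/* Illustrative sketch only: computes the LAMMPS raw FOM as
 * (atoms * time steps) / loop wall-clock seconds.
 * The atom count and loop time below are hypothetical example values. */
#include <stdio.h>

int main(void)
{
    double atoms      = 4194304000.0;  /* hypothetical total atom count        */
    double time_steps = 100.0;         /* fixed step count described above     */
    double loop_secs  = 80.0;          /* hypothetical measured loop time (s)  */

    double raw_fom = atoms * time_steps / loop_secs;
    printf("LAMMPS raw FOM = %.3e atom-steps/s\n", raw_fom);
    return 0;
}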

Each marquee code uses its own “raw” FOM, because each benchmark implements a different kind of physics/algorithm, and the “natural” figures of merit for each differ in type and magnitude. For example, the SPhot and AMG benchmarks perform very little floating point arithmetic, making FLOP/s (FLoating point OPerations per second) a poor performance metric; their performance is dominated by integer arithmetic, array indexing, and branching. For SPhot, the correct balance of floating point arithmetic, integer arithmetic, array indexing, and branching is captured in the sequence of instructions that update the location of a Monte Carlo particle to its next “event,” be that a scattering collision, absorption, reaching a zone boundary, etc. Thus, particle “track updates” per second is the natural figure of merit for meaningful comparison between computers.

For all benchmarks, the raw FOM has been chosen to also factor out the change in the difficulty of the problems as the size of the problems is increased. For example, the number of iterations to converge an answer by the IRS, UMT, and AMG benchmarks increases as system size increases. This is a characteristic of the algorithms, independent of the hardware used. So rather than use wall clock time to completion as the FOM, these three benchmarks use solution updates per second, defined as system size (e.g., the number of fundamental state variables) times the number of iterations performed, divided by the wall clock time (in seconds), as their raw figures of merit. Each benchmark’s raw FOM is then multiplied by a separate weight (see Table 9-4 below) that balances the importance of the benchmarks both with respect to each other and with respect to the peak FLOP/s (floating point operations per second) of the system.
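As a non-authoritative sketch of the same idea for the iterative benchmarks, the following C fragment computes a “solution updates per second” raw FOM and applies one of the Table 9-4 weights; the system size, iteration count, and wall-clock time are hypothetical values, not benchmark output:

/* Illustrative sketch only: "solution updates per second" raw FOM for the
 * iterative benchmarks (IRS, UMT, AMG), followed by the weighting step.
 * All run quantities below are hypothetical. */
#include <stdio.h>

int main(void)
{
    double state_vars = 1.0e9;    /* hypothetical system size (state variables) */
    double iterations = 50.0;     /* hypothetical iteration count to converge   */
    double wall_secs  = 200.0;    /* hypothetical measured wall-clock time (s)  */
    double weight     = 203200.0; /* IRS weight from Table 9-4                  */

    double raw_fom      = state_vars * iterations / wall_secs;
    double weighted_fom = weight * raw_fom;

    printf("raw FOM      = %.3e updates/s\n", raw_fom);
    printf("weighted FOM = %.3e\n", weighted_fom);
    return 0;
}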



Benchmark    SPhot     UMT      AMG       IRS       LAMMPS
Weights      57,512    12,240   269,200   203,200   37,840

Table 9-4: Sequoia Marquee Benchmarks FOM Weights


The weighted FOM performance of the four IDC workload marquee benchmarks on the ASC Purple system for 1,024 (1 Ki) to 8,192 (8 Ki) processors is shown in Figure 9-12.



Figure 9-12: Sequoia Marquee benchmark scaling to 8,192 MPI tasks on Purple.


The Offeror must commit to a specific sustained, aggregate, weighted figure of merit while running under conditions described in Section 9.4.2.

The raw FOM performance of the LAMMPS science workload marquee benchmark on the ASC BG/L system up to 128 Ki processors is shown in Figure 9-13. (NOTE: For processor counts up to and including 64 Ki processors, LAMMPS was run using only one of the two processors on a node. For 128 Ki processors, both processors were used on the same 64 Ki nodes. The 64 Ki to 128 Ki speedup was 1.84.)





Figure 9-13: LAMMPS scaling up to 131,072 (128 Ki) MPI tasks on BG/L.



9.4.1 Modifications


The source code and compile scripts downloaded from the Sequoia benchmark web site may be modified as necessary to get the benchmarks to compile and run on the Offeror’s system. Once this is accomplished, a full set of benchmark runs must be reported with this “as is” source code.

Beyond this, the benchmarks can be optimized as desired by the Offeror. The highest value optimizations are those obtained from standard compiler flags and other compiler flag hints. Next in value are performance improvements from pragma-style guidance in C, C++, and Fortran source files. Changes in the OpenMP and MPI implementation are allowed, as they are likely to benefit the performance of LLNS’s benchmarks on many platforms. Wholesale algorithm changes, or even manual rewriting of loops to make them strongly architecture specific, are of less value, because the ASC program’s large installed code base makes extensive code rewrites prohibitively expensive, both in manpower and schedule.

Offeror will continue its efforts to improve the efficiency and scalability of the benchmarks. Offeror’s goal in these improvement efforts is to emphasize higher level optimizations as well as compiler optimization technology improvements while maintaining readable and maintainable code.

If Python is not available on all systems that are used by Offeror to run the benchmarks for the RFP, LLNS will help the Offeror create a non-Python version to use when running the RFP benchmarks. However, in the event of subcontract award, the selected Offeror must demonstrate a Python-based version of the UMT benchmark on the Dawn system prior to shipment.


9.4.2 Sequoia Execution Requirements


The conditions for running the sustained aggregate, weighted figure of merit for the final Sequoia delivery are as follows. For the IDC workload, six separate (but identical) problems will be run simultaneously using each of the four marquee benchmark codes, for a total of 24 simultaneously running IDC problems. (See Section 9.3 for the specific problems to be run.)

Each IDC problem is sized to be the largest possible problem that can be run on the ASC Purple machine for that benchmark (8,192 MPI tasks). Simultaneous with these runs, a single LAMMPS run with at least 83,886,080,000 atoms = 20 * 131,072 MPI tasks * 32,000 atoms/MPI task (i.e., 20x the BG/L run) will be performed. Thus, the aggregate throughput of the Sequoia system during the sustained performance test must be 24x Purple for the IDC codes plus 20x BG/L for the science workload. The “raw” figures of merit printed by each of the four marquee IDC benchmarks on 8,192 Purple processors (MPI-only), and by LAMMPS on 131,072 BG/L processors, are shown in Table 9-5.



Benchmark    SPhot      UMT        AMG        IRS       LAMMPS
“Raw” FOM    11.59e+9   54.39e+9   2.484e+9   3.27e+9   5.28e+9

Table 9-5: Raw FOM for Sequoia marquee benchmarks on their reference system.


Using the weighting factors in Table 9-4 and the raw figures of merit for each of the five marquee benchmark codes on their reference systems in Table 9-5 above, the final sustained figure of merit may be no less than 20.0e+15.
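The 20.0e+15 threshold can be cross-checked with a short calculation, under the assumption (not stated explicitly above) that the aggregate sustained FOM is the sum, over all simultaneously running jobs, of each job’s Table 9-4 weight times its measured raw FOM. Substituting the Table 9-5 reference values for the measurements, with six copies of each IDC code and one LAMMPS job at 20x the BG/L raw FOM, gives approximately 20.0e+15:

/* Illustrative cross-check only. Assumes the aggregate sustained FOM is the
 * sum over simultaneously running jobs of (Table 9-4 weight) * (raw FOM),
 * using the Purple/BG/L reference raw FOMs from Table 9-5 in place of
 * measurements: six copies of each IDC code, plus one LAMMPS job at 20x the
 * BG/L raw FOM. The result is roughly 2.0e+16. */
#include <stdio.h>

int main(void)
{
    const double weights[4]  = { 57512.0, 12240.0, 269200.0, 203200.0 }; /* SPhot, UMT, AMG, IRS */
    const double raw_foms[4] = { 11.59e9, 54.39e9, 2.484e9,  3.27e9   };
    const double lammps_weight = 37840.0, lammps_raw_fom = 5.28e9;

    double aggregate = 0.0;
    for (int i = 0; i < 4; ++i)
        aggregate += 6.0 * weights[i] * raw_foms[i];        /* 24 IDC jobs      */
    aggregate += 20.0 * lammps_weight * lammps_raw_fom;     /* 20x BG/L LAMMPS  */

    printf("aggregate weighted FOM = %.3e (requirement: >= 20.0e+15)\n", aggregate);
    return 0;
}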

If 8,192 MPI tasks (MPI only) in the IDC benchmarks are insufficient to achieve the required aggregate, weighted FOM, more cores and threads may be used via thread parallelism.


9.4.2.1 Sequoia14 Execution Requirements


If the Sequoia14 System Performance Mandatory Option (Section 2.12.3) is exercised, then the procedure for measuring the aggregate sustained performance described in Section 9.4.2 will be adjusted as follows. The LAMMPS problem size will be decreased to a total of 58,720,256,000 atoms = 14 * 131,072 MPI tasks * 32,000 atoms/MPI task. The number of simultaneously running jobs for each IDC code will be decreased from six to four, making a total of 16 simultaneously running IDC jobs, which run simultaneously with the LAMMPS job. All multipliers for raw figures of merit will remain the same.

End of Section 9


