fortran - Calculating MFLOP/s of an HPC application using memory bandwidth info
I want to calculate the MFLOPS (millions of operations per second per processor) of an HPC application (NAS benchmark) without running the application. I have measured the memory bandwidth of each core of our system (a supercomputer) using the STREAM benchmark. I'm wondering how I can estimate the MFLOPS per processor of the application, given the memory bandwidth info of the cores. My node has 64 GiB of memory (16 cores, 2 sockets) and 58 GiB/s of aggregated bandwidth using all physical cores. The memory bandwidth of the cores varied from 2728.1204 MB/s to 10948.8962 MB/s for the Triad function, which must be because of the NUMA architecture.
Any help would be appreciated.
You can't estimate the MFLOPS/GFLOPS of a benchmark from the memory bandwidth results of STREAM alone. You need to know two more parameters: the peak MFLOPS/GFLOPS of a CPU core (better: the maximum FLOP operations per clock cycle for each variant of vector instructions, plus the CPU frequency limits: min, mean, max), and the GFLOPS/GByte ratio (flops-to-bytes ratio, or arithmetic intensity) of every program you need to estimate (every NAS benchmark).
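As a rough illustration of the first parameter, the theoretical peak of one core is the clock frequency multiplied by the FLOPs it can retire per cycle. The sketch below is only an assumption-laden example: the function name `peak_gflops_per_core`, the 2.6 GHz clock and the FLOPs-per-cycle figures are hypothetical and are not taken from the question, so substitute the numbers for your actual CPU.

```python
def peak_gflops_per_core(freq_ghz, flops_per_cycle):
    """Theoretical peak in GFLOP/s: clock rate (GHz) times FLOPs per cycle."""
    return freq_ghz * flops_per_cycle


# Hypothetical FP64 FLOPs per cycle for different vectorization levels.
# These values are placeholders, not data for any specific CPU.
FLOPS_PER_CYCLE = {
    "scalar": 2,
    "4-wide vector": 8,
}

# Hypothetical clock frequency in GHz (check min/mean/max for your CPU).
FREQ_GHZ = 2.6

if __name__ == "__main__":
    for level, fpc in FLOPS_PER_CYCLE.items():
        peak = peak_gflops_per_core(FREQ_GHZ, fpc)
        print(f"{level}: {peak:.1f} GFLOP/s peak per core")
```

Running the same calculation for the minimum, mean and maximum frequency gives a range for the peak rather than a single number.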
The STREAM benchmark has very low arithmetic intensity (0 DP = FP64 flops per 2 double operands = 2*8 bytes in Copy, 1 flop per 16 bytes in Scale, 1 flop per 24 bytes in Add, and 2 flops per 24 bytes in Triad). So the STREAM benchmark is limited by memory bandwidth in correct runs (and by cache bandwidth in incorrect runs). Many benchmarks may have higher arithmetic intensity.
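The per-kernel ratios can be reproduced with a few lines of Python. This is a minimal sketch that counts only the explicit loads and stores of 8-byte doubles per loop iteration, as the figures above do; the names `STREAM_KERNELS` and `arithmetic_intensity` are made up for this example.

```python
BYTES_PER_DOUBLE = 8

# kernel name: (FP64 flops per iteration, doubles read or written per iteration)
STREAM_KERNELS = {
    "Copy": (0, 2),   # a[i] = b[i]
    "Scale": (1, 2),  # a[i] = q * b[i]
    "Add": (1, 3),    # a[i] = b[i] + c[i]
    "Triad": (2, 3),  # a[i] = b[i] + q * c[i]
}


def arithmetic_intensity(flops, doubles_moved):
    """Flops per byte of memory traffic for one loop iteration."""
    return flops / (doubles_moved * BYTES_PER_DOUBLE)


if __name__ == "__main__":
    for name, (flops, doubles) in STREAM_KERNELS.items():
        ai = arithmetic_intensity(flops, doubles)
        print(f"{name}: {flops} flops / {doubles * BYTES_PER_DOUBLE} bytes = {ai:.4f} flops/byte")
```

For a NAS benchmark you would need the same two counts (flops and bytes moved) for its dominant loops, which is harder to obtain than for STREAM.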
With this data (memory bandwidth, max GFLOPS/GHz at different vectorization levels, normal/maximal/low frequency of the CPU, arithmetic intensity of the test) you can start to use the Roofline performance model: https://crd.lbl.gov/departments/computer-science/par/research/roofline/
With Roofline you have an x axis of flops/byte and a y axis of GFLOP/s (both on a logarithmic scale). The "roof" line consists of two parts for every CPU (or machine).
The first part is inclined and corresponds to low arithmetic intensity. Applications in this part have to wait for data to be loaded from memory; they do not have enough data to operate at the full GFLOP/s speed of the CPU, so these tests are limited by memory. This line is defined by the STREAM benchmark.
The second part of the line is horizontal and corresponds to higher intensity. Tasks here are not limited by memory bandwidth but by the available flops. On a modern CPU most of the flops are available only through wide vector instructions (instruction-level parallelism), and not all tasks can use the widest vectors.
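The two parts of the roof can be written as a single expression: attainable GFLOP/s is the smaller of the compute peak and arithmetic intensity times memory bandwidth. Here is a minimal sketch of that bound. The function names are invented for this example, the 20 GFLOP/s peak is a hypothetical placeholder (the question does not give one), and only the two Triad bandwidths come from the question.

```python
def roofline_gflops(intensity, peak_gflops, bandwidth_gb_s):
    """Upper bound on GFLOP/s for a kernel with the given flops/byte ratio."""
    return min(peak_gflops, intensity * bandwidth_gb_s)


def ridge_point(peak_gflops, bandwidth_gb_s):
    """Arithmetic intensity (flops/byte) where the inclined part meets the flat part."""
    return peak_gflops / bandwidth_gb_s


# Per-core Triad bandwidths reported in the question, converted from MB/s to GB/s.
SLOW_CORE_GB_S = 2728.1204 / 1000.0
FAST_CORE_GB_S = 10948.8962 / 1000.0

# ASSUMPTION: hypothetical per-core peak, not a measured or documented value.
PEAK_GFLOPS = 20.0

if __name__ == "__main__":
    triad_intensity = 2 / 24  # 2 flops per 24 bytes, as for STREAM Triad
    for label, bw in (("slowest core", SLOW_CORE_GB_S), ("fastest core", FAST_CORE_GB_S)):
        bound = roofline_gflops(triad_intensity, PEAK_GFLOPS, bw)
        ridge = ridge_point(PEAK_GFLOPS, bw)
        print(f"{label}: bound {bound:.3f} GFLOP/s, ridge at {ridge:.2f} flops/byte")
```

A kernel whose intensity is below the ridge point sits on the inclined part and is memory bound; above it, the bound is the peak for whatever vectorization level the code actually reaches. The result is an upper limit, not a prediction of what a NAS benchmark will achieve.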