fortran - Calculating MFLOP/s of an HPC application using memory bandwidth info


I want to calculate the MFLOPS (millions of operations per second per processor) of an HPC application (a NAS benchmark) without running the application. I have measured the memory bandwidth of each core of the system (a supercomputer) using the STREAM benchmark. I'm wondering how I can get the MFLOPS per processor of the application, given the memory bandwidth info of the cores. Each node has 64 GiB of memory (16 cores across 2 sockets) and 58 GiB/s of aggregated bandwidth using all physical cores. The memory bandwidth of the individual cores varied from 2728.1204 MB/s to 10948.8962 MB/s for the Triad function, which I think must be because of the NUMA architecture.

Any help is appreciated.

You can't estimate the MFLOPS/GFLOPS of a benchmark from the memory bandwidth results of STREAM alone. You need to know two more parameters: the peak MFLOPS/GFLOPS of your CPU core (better still, the maximum number of FLOP operations per clock cycle for each variant of vector instructions, plus the CPU frequency limits: min, mean, max), and the GFLOPS/GByte ratio (flops-to-bytes ratio, also called arithmetic intensity) of every program you need to estimate (every NAS benchmark).
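As a rough sketch of the first parameter, the peak of one core is the number of FLOPs it can retire per cycle multiplied by its clock frequency. The per-cycle counts and frequencies below are hypothetical placeholders (the question does not name the CPU), so substitute the values from your processor's documentation.

```python
def peak_gflops(flops_per_cycle: float, freq_ghz: float) -> float:
    """Peak GFLOP/s of one core: FLOPs per cycle times cycles per nanosecond (GHz)."""
    return flops_per_cycle * freq_ghz


# ASSUMPTION: illustrative double-precision FLOPs per cycle for three
# vectorization levels. These are not taken from the question.
FLOPS_PER_CYCLE = {"scalar": 2, "sse2": 4, "avx": 8}

# ASSUMPTION: illustrative min / mean / max core frequencies in GHz.
FREQ_GHZ = {"min": 1.2, "mean": 2.6, "max": 3.3}


def peak_table(flops_per_cycle=FLOPS_PER_CYCLE, freq_ghz=FREQ_GHZ):
    """Return {(vector_level, freq_name): peak GFLOP/s} for every combination."""
    return {
        (isa, fname): peak_gflops(fpc, ghz)
        for isa, fpc in flops_per_cycle.items()
        for fname, ghz in freq_ghz.items()
    }


if __name__ == "__main__":
    for (isa, fname), gflops in sorted(peak_table().items()):
        print(f"{isa:6s} @ {fname:4s} frequency: {gflops:5.1f} GFLOP/s per core")
```

The table has one entry per combination because a program that does not vectorize, or that runs while the core is clocked down, sees a much lower roof than the headline peak.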

The STREAM benchmark has very low arithmetic intensity (0 DP = FP64 flops per 2 double operands = 2*8 bytes in Copy, 1 flop per 16 bytes in Scale, 1 flop per 24 bytes in Add, and 2 flops per 24 bytes in Triad). So the STREAM benchmark is limited by memory bandwidth in correct runs (and by cache bandwidth in incorrect runs). Many benchmarks may have higher arithmetic intensity.
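The per-kernel ratios quoted above can be reproduced with a few lines of arithmetic. This is only a sketch: the flop and operand counts are the ones stated in the text, each operand is assumed to be an 8-byte double, and any extra traffic such as write-allocate is ignored.

```python
BYTES_PER_DOUBLE = 8

# (flops per iteration, double operands moved per iteration), as stated in the text.
STREAM_KERNELS = {
    "copy":  (0, 2),  # a[i] = b[i]
    "scale": (1, 2),  # a[i] = q * b[i]
    "add":   (1, 3),  # a[i] = b[i] + c[i]
    "triad": (2, 3),  # a[i] = b[i] + q * c[i]
}


def arithmetic_intensity(flops: int, operands: int) -> float:
    """Flops per byte of memory traffic for one loop iteration."""
    return flops / (operands * BYTES_PER_DOUBLE)


if __name__ == "__main__":
    for name, (flops, operands) in STREAM_KERNELS.items():
        ai = arithmetic_intensity(flops, operands)
        print(f"{name:5s}: {flops} flops / {operands * BYTES_PER_DOUBLE} bytes = {ai:.4f} flops/byte")
```

All four kernels stay below 0.1 flops/byte, which is why STREAM measures the memory system rather than the floating-point units.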

With this data (memory bandwidth, max GFLOPS/GHz at different vectorization levels, the normal/maximal/low frequency of the CPU, and the arithmetic intensity of the test) you can start to use the Roofline performance model: https://crd.lbl.gov/departments/computer-science/par/research/roofline/

Figure: Roofline model example; memory-limited part

With Roofline you have an x axis of flops/byte and a y axis of GFLOP/s (both on a logarithmic scale). The line of the "roof" consists of two parts for every CPU (or machine).

The first part is inclined and corresponds to low arithmetic intensity. Applications in this part have to wait for data to be loaded from memory, so they have no data to operate on at the full GFLOP/s speed of the CPU; such tests are limited by memory. This line is defined by the STREAM benchmark.

The second part of the line is horizontal and corresponds to higher intensity. Tasks here are not limited by memory bandwidth, but by the available FLOPS. And on a modern CPU most FLOPS are available only with wide vector instructions (instruction-level parallelism), and not all tasks can use the widest vectors:

Figure: Roofline model; GFLOPS (ILP) limited part
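The two parts of the roof combine into a single bound: attainable GFLOP/s is the smaller of the compute peak and bandwidth times arithmetic intensity. The sketch below uses the best per-core Triad bandwidth from the question (STREAM reports MB/s as 10^6 bytes/s); the peak of 20.8 GFLOP/s is a hypothetical placeholder, since the question does not name the CPU, and the function names are mine.

```python
def attainable_gflops(ai_flops_per_byte: float, bw_gbytes_s: float, peak_gflops: float) -> float:
    """Roofline bound: min(compute roof, memory bandwidth * arithmetic intensity)."""
    return min(peak_gflops, ai_flops_per_byte * bw_gbytes_s)


def ridge_point(bw_gbytes_s: float, peak_gflops: float) -> float:
    """Arithmetic intensity (flops/byte) where the inclined part meets the flat part."""
    return peak_gflops / bw_gbytes_s


# From the question: best per-core STREAM Triad result, converted from MB/s to GB/s.
BW_GB_S = 10948.8962 / 1000.0

# ASSUMPTION: hypothetical per-core peak; replace with your CPU's real value.
PEAK_GFLOPS = 20.8

if __name__ == "__main__":
    print(f"ridge point: {ridge_point(BW_GB_S, PEAK_GFLOPS):.2f} flops/byte")
    # Sample intensities: STREAM Triad (2 flops / 24 bytes) and a few higher values.
    for ai in (2 / 24, 0.5, 1.0, 4.0):
        bound = attainable_gflops(ai, BW_GB_S, PEAK_GFLOPS)
        print(f"AI = {ai:6.3f} flops/byte -> at most {bound:6.2f} GFLOP/s")
```

To estimate a NAS benchmark this way you still need its own arithmetic intensity; the result is an upper bound per core, not a prediction of measured MFLOPS.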

