C++ - Can sqrtsd in inline assembler be faster than sqrt()?
I am creating a testing utility that requires heavy use of the sqrt() function. After digging into possible optimisations, I have decided to try inline assembler in C++. The code is:
#include <iostream>
#include <cstdlib>
#include <cmath>
#include <ctime>

using namespace std;

// `iterations` is a macro supplied at compile time (-Diterations=N).

// Reference loop: calls the library sqrt() `iterations` times.
volatile double normalsqrt(double a){
    double b = 0;
    for(int i = 0; i < iterations; i++){
        b = sqrt(a);
    }
    return b;
}

// Same loop, but issuing the sqrtsd instruction directly.
volatile double asmsqrt(double a){
    double b = 0;
    for(int i = 0; i < iterations; i++){
        asm volatile(
            "movq %1, %%xmm0 \n"
            "sqrtsd %%xmm0, %%xmm1 \n"
            "movq %%xmm1, %0 \n"
            : "=r"(b)
            : "g"(a)
            : "xmm0", "xmm1", "memory"
        );
    }
    return b;
}

int main(int argc, char *argv[]){
    double a = atoi(argv[1]);
    double c;
    std::clock_t start;
    double duration;

    // Time the inline-assembler version.
    start = std::clock();
    c = asmsqrt(a);
    duration = std::clock() - start;
    cout << "asm sqrt: " << c << endl;
    cout << duration << " clocks" << endl;
    cout << "start: " << start << " end: " << start + duration << endl;

    // Time the library version.
    start = std::clock();
    c = normalsqrt(a);
    duration = std::clock() - start;
    cout << endl << "builtin sqrt: " << c << endl;
    cout << duration << " clocks" << endl;
    cout << "start: " << start << " end: " << start + duration << endl;
    return 0;
}
I am compiling the code using a script that sets the number of iterations, starts profiling, and opens the profiling output in vim:
#!/bin/bash
default_iterations=1000000
if [ $# -eq 1 ]; then
    echo "setting iterations $1"
    default_iterations=$1
else
    echo "using default value: $default_iterations"
fi

rm -rf asd
g++ -msse4 -std=c++11 -O0 -ggdb -pg -Diterations=$default_iterations test.cpp -o asd
./asd 16
gprof asd gmon.out > output.txt
vim -o output.txt
true
The output is:
using default value: 1000000
asm sqrt: 4
3802 clocks
start: 1532 end: 5334

builtin sqrt: 4
5501 clocks
start: 5402 end: 10903
The question is: why does the sqrtsd instruction take only 3802 clocks to compute the square root of 16, while sqrt() takes 5501 clocks? Does it have something to do with the hardware implementation of certain instructions? Thank you.
CPU:
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                4
On-line CPU(s) list:   0-3
Thread(s) per core:    2
Core(s) per socket:    2
Socket(s):             1
NUMA node(s):          1
Vendor ID:             AuthenticAMD
CPU family:            21
Model:                 48
Model name:            AMD A8-7600 Radeon R7, 10 Compute Cores 4C+6G
Stepping:              1
CPU MHz:               3100.000
CPU max MHz:           3100,0000
CPU min MHz:           1400,0000
BogoMIPS:              6188.43
Virtualization:        AMD-V
L1d cache:             16K
L1i cache:             96K
L2 cache:              2048K
NUMA node0 CPU(s):     0-3
Floating point arithmetic has to take rounding into consideration. Most C/C++ compilers adopt IEEE 754, so they have an "ideal" algorithm to perform operations such as square root. They are free to optimize, but they must return the same result down to the last decimal, in all cases. So their freedom to optimize is not complete; in fact it is severely constrained.
Your algorithm is probably off by a digit or 2 part of the time. That could be completely negligible for some users, but it could also cause nasty bugs for others, so it's not allowed by default.
If you care more about speed than standard compliance, try poking around with the options of your compiler. For instance, in GCC the first thing I'd try is -funsafe-math-optimizations, which should enable optimizations that disregard strict standard compliance. Once you tweak it enough, you should come closer to, and possibly pass, your handmade implementation's speed.