I can't find any white papers on SSE4.1. SSE = Streaming SIMD Extensions. These instructions speed up mathematical operations such as sqrt. SSE2 was useful. SSE3 is useless. SSE4 didn't bring a speed increase (I think). SSE4.1? I dunno. Basically, they help optimize mathematical operations in a CPU. Hope that helps. AFAIK Auto Gordian Knot (DivX encoder) can use it. Software has to be written to use these instructions in order to benefit from them. Again, I don't know programming at an advanced level such as C/C++, so I can't say how it is done.
Edit: yay, found it.
But I can't understand it :S
I have been trying to use the Intel DPPS instruction with either EXTRACTPS or BLENDPS. Essentially I have a loop in which
x1 = dot-product(y1,z1)
x2 = dot-product(y2,z2)
x3 = dot-product(y3,z3)
x4 = x1/(sqrt(x2)*sqrt(x3))
I can do x1, x2, x3 with the DPPS instruction and then use EXTRACTPS, so 3 DPPS with 3 EXTRACTPS. It turns out I did not get any improvement in performance. To use fewer EXTRACTPS, I used BLENDPS.
x1_sse = dpps(y1,z1,241)
x2_sse = dpps(y2,z2,242)
x2_sse = blendps(x1_sse, x2_sse, 2)
x3_sse = dpps(y3,z3, 244)
x3_sse = blendps(x2_sse, x3_sse, 4)
storeps(x3_sse, x3_array)
x1 = x3_array[0]
x2 = x3_array[1]
x3 = x3_array[2]
It turns out there is no improvement from this either; in fact, there is a slight degradation. All loads and stores are aligned. I am using icpc -ipo -xT -O3 -no-prec-div -static -funroll-loops (so -fast without -ipo since -ipo does not work with SSE4.1 instructions). Any comments on how I could do this better, or are these instruction latencies just too long for my use? I guess I am disappointed with the performance of SSE4.1 so far.
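To make the immediate masks in the listing above easier to follow, here is a small scalar model of DPPS and BLENDPS in plain C. It is only a sketch: `vec4`, `dpps_model`, `blendps_model` and `pack_three_dots` are made-up names for illustration, not intrinsics, and the real DPPS may add the products in a different order, so rounding can differ. The mask meanings are: for DPPS, bits 7..4 pick which lane products go into the sum and bits 3..0 pick which result lanes receive it; for BLENDPS, a set bit i takes lane i from the second operand.

```c
#include <assert.h>
#include <stdio.h>

/* Four packed floats, standing in for one XMM register (hypothetical type). */
typedef struct { float lane[4]; } vec4;

/* Scalar model of DPPS. Bits 7..4 of imm8 select the lane products that are
   summed; bits 3..0 select the result lanes that receive the sum. Unselected
   result lanes are zero. Summation order here is simply left to right. */
static vec4 dpps_model(vec4 a, vec4 b, unsigned imm8)
{
    assert(imm8 <= 0xFFu);
    float sum = 0.0f;
    for (int i = 0; i < 4; ++i)
        if (imm8 & (0x10u << i))
            sum += a.lane[i] * b.lane[i];

    vec4 r;
    for (int i = 0; i < 4; ++i)
        r.lane[i] = (imm8 & (1u << i)) ? sum : 0.0f;
    return r;
}

/* Scalar model of BLENDPS. Bit i of imm8 set: lane i comes from b,
   otherwise from a. */
static vec4 blendps_model(vec4 a, vec4 b, unsigned imm8)
{
    assert(imm8 <= 0x0Fu);
    vec4 r;
    for (int i = 0; i < 4; ++i)
        r.lane[i] = (imm8 & (1u << i)) ? b.lane[i] : a.lane[i];
    return r;
}

/* Same sequence as in the post: three dot products end up in lanes 0, 1, 2.
   241 = 0xF1, 242 = 0xF2, 244 = 0xF4. */
static vec4 pack_three_dots(vec4 y1, vec4 z1, vec4 y2, vec4 z2,
                            vec4 y3, vec4 z3)
{
    vec4 x1 = dpps_model(y1, z1, 241);   /* sum lands in lane 0 */
    vec4 x2 = dpps_model(y2, z2, 242);   /* sum lands in lane 1 */
    x2 = blendps_model(x1, x2, 2);       /* lane 1 from x2, rest from x1 */
    vec4 x3 = dpps_model(y3, z3, 244);   /* sum lands in lane 2 */
    x3 = blendps_model(x2, x3, 4);       /* lane 2 from x3, rest from x2 */
    return x3;
}
```

So the masks do what the post intends: after the second blend, lane 0 holds x1, lane 1 holds x2, lane 2 holds x3, and lane 3 is zero. The model says nothing about latency or throughput, which is what the question is really about.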
This is from the Intel Software Network; have a look at it. icpc -xS supports automatic selection of SSE4.1 instructions where the compiler deems them beneficial. Fully unrolled dpps "vectorization" of an inner loop inhibits auto-vectorization of a containing loop, which would seem a likely application of it. In the case where traditional "re-rolling" of a long, partially unrolled dot product loop avoids the compiler's selection of dpps, that is the better way to full performance.
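As a rough idea of what a "re-rolled" version might look like, here is a sketch in plain C with the three dot products from the question written as one ordinary loop over arrays of length n. The names `dots3` and `three_dots` are made up for this example, and the array length n is an assumption since the original post does not give it. Whether a given compiler actually vectorizes the loop depends on its flags and floating-point model, so treat it as something to try and measure, not a guaranteed win.

```c
#include <assert.h>
#include <stddef.h>

/* Results of the three dot products from the question (hypothetical type). */
typedef struct { float x1, x2, x3; } dots3;

/* One plain loop, no manual unrolling and no dpps. The three running sums
   are independent reductions, which is the shape auto-vectorizers look for. */
static dots3 three_dots(const float *y1, const float *z1,
                        const float *y2, const float *z2,
                        const float *y3, const float *z3, size_t n)
{
    assert(y1 && z1 && y2 && z2 && y3 && z3);

    float s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        s1 += y1[i] * z1[i];
        s2 += y2[i] * z2[i];
        s3 += y3[i] * z3[i];
    }

    dots3 r = { s1, s2, s3 };
    return r;
}
```

From the result, x4 follows as x1/(sqrt(x2)*sqrt(x3)), same as in the question.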
Much as ad writers love to get paid for writing about new instructions, more significant performance improvements of Penryn CPUs are realized in SSE2 code, for example, by the improved performance of IEEE divide and square root (both serial and parallel versions), and by the higher supported FSB ratings.


Just go ahead and post it anyway, man

and welcome to EK 