SIMD: Accumulate Adjacent Pairs


I'm learning how to use SIMD intrinsics and autovectorization. Luckily, I have a useful project I'm working on that seems extremely amenable to SIMD, but is still tricky for a newbie like me.



I'm writing a filter for images that computes the average of 2x2 pixels. I'm doing part of the computation by accumulating the sum of two pixels into a single pixel.



template <typename T, typename U>
inline void accumulate_2x2_x_pass(
    T* channel, U* accum,
    const size_t sx, const size_t sy,
    const size_t osx, const size_t osy,
    const size_t yoff, const size_t oyoff
) {
    const bool odd_x = (sx & 0x01);

    size_t i_idx, o_idx;

    // Should be vectorizable somehow...
    for (size_t x = 0, ox = 0; x < sx - (size_t)odd_x; x += 2, ox++) {
        i_idx = x + yoff;
        o_idx = ox + oyoff;
        accum[o_idx] += channel[i_idx];
        accum[o_idx] += channel[i_idx + 1];
    }

    if (odd_x) {
        // * 2 bc we need to multiply by two on the edge
        // to avoid darkening during render
        accum[(osx - 1) + oyoff] += (U)(channel[(sx - 1) + yoff]) * 2;
    }
}




However, Godbolt shows that my loop does not auto-vectorize (https://godbolt.org/z/qZxvof). How would I construct SIMD intrinsics to solve this issue? I have control of the alignment for accum, but not channel.



(I know there's an average intrinsic, but it's not appropriate here because I need to generate multiple mip levels and that command would cause loss of precision for the next level.)



Thanks everyone. :)










    Looks like a use-case for SSSE3 _mm_hadd_epi32, or _mm_hadd_epi16 if T is int16_t instead of int. It costs the same as 2 shuffles + a vertical add, but you need that anyway to pack 2 inputs to 1. If you want to work around a shuffle-port bottleneck on Intel CPUs, you could consider using qword shifts on the inputs and then shuffling together the result with shufps.

    – Peter Cordes
    Mar 8 at 6:55












    Wow, that's pretty cool! I had been under the impression that "horizontal" operations were not possible in SIMD. I'll give this a try tomorrow. For what it's worth, the dominant use case for this operation is uint8_t -> uint16_t

    – SapphireSun
    Mar 8 at 6:57












    I didn't realize you were widening, that changes things entirely. (Also, you showed short -> int on Godbolt; what SSE/AVX version are you targeting? You used -march=native on Godbolt, which is Skylake-AVX512 = AVX512BW). Anyway, _mm_hadd_* isn't useful when U is not the same width as T. You probably want pmaddwd or pmaddubsw with a multiplier of 1 to add horizontal pairs into a wider result.

    – Peter Cordes
    Mar 8 at 7:19











    You should always use -march=haswell or similar if you know that's what you're targeting. That sets important tuning options as well as instruction sets. And don't use -march=corei7, it's kind of meaningless / confusing because it's basically -march=nehalem (the first generation of core i7).

    – Peter Cordes
    Mar 8 at 7:36












    On Godbolt you can #include <stddef.h> and use size_t like a normal person. And notice that gcc did auto-vectorize your code for uint8_t -> uint16_t. Not particularly well, but it did it.

    – Peter Cordes
    Mar 8 at 7:39















c++ sse simd intrinsics avx

asked Mar 8 at 6:36 by SapphireSun
edited Mar 8 at 7:56 by Peter Cordes

1 Answer
































The widening case with the narrow type T = uint8_t or uint16_t is probably best implemented with SSSE3 pmaddubsw or SSE2 pmaddwd with a multiplier of 1. (Intrinsics guide) Those instructions are single-uop and do exactly the horizontal widening add you need, more efficiently than shuffling.



If you can do so without losing precision, do the vertical add between rows first, before the widening horizontal add. (e.g. 10-, 12-, or 14-bit pixel components in [u]int16_t can't overflow.) Load and vertical add have (at least) 2 per clock throughput on most CPUs, vs. 1 per clock for pmadd* (it only has 2-per-clock throughput on Skylake and later). And it means you only need 1x add + 1x pmadd vs. 2x pmadd + 1x add, so it's a significant win even on Skylake. (For the 2nd way, both loads can fold into memory operands for pmadd, if you have AVX. For the add-before-pmadd way, you'll need a pure load first and then fold the 2nd load into add, so you might not save front-end uops unless you use indexed addressing modes and they un-laminate.)



And ideally you don't need to += into an accumulator array, and instead can just read 2 rows in parallel and accumulator is write-only, so your loop has only 2 input streams and 1 output stream.



// SSSE3
__m128i hadd_widen8_to_16(__m128i a) {
    // uint8_t, int8_t (doesn't matter when the multiplier is +1)
    return _mm_maddubs_epi16(a, _mm_set1_epi8(1));
}

// SSE2
__m128i hadd_widen16_to_32(__m128i a) {
    // int16_t, int16_t
    return _mm_madd_epi16(a, _mm_set1_epi16(1));
}



These port to 256-bit AVX2 directly, because the input and output width is the same. No shuffle needed to fix up in-lane packing.



Yes really, they're both _epi16. Intel can be wildly inconsistent with intrinsic names. asm mnemonics are more consistent and easier to remember what's what. (ubsw = unsigned byte to signed word, except that one of the inputs is signed bytes. pmaddwd is packed multiply add word to dword, same naming scheme as punpcklwd etc.)




The T=U case with uint16_t or uint32_t is a use-case for SSSE3 _mm_hadd_epi16 or _mm_hadd_epi32. It costs the same as 2 shuffles + a vertical add, but you need those anyway to pack 2 inputs to 1.



If you want to work around a shuffle-port bottleneck on Haswell and later, you could consider using qword shifts on the inputs and then shuffling together the result with shufps (_mm_shuffle_ps + some casting). This could possibly be a win on Skylake (with 2 per clock shift throughput), even though it costs more: 5 total uops instead of 3. It can run at best 5/3 cycles per vector of output instead of 2 cycles per vector, if there's no front-end bottleneck.



// UNTESTED

// Only any good with AVX, otherwise the extra movdqa instructions kill this
// Only worth considering for Skylake, not Haswell (1/c shifts) or Sandybridge (2/c shuffle)
__m128i hadd32_emulated(__m128i a, __m128i b) {
    __m128i a_shift = _mm_srli_epi64(a, 32);
    __m128i b_shift = _mm_srli_epi64(b, 32);
    a = _mm_add_epi32(a, a_shift);
    b = _mm_add_epi32(b, b_shift);
    __m128 combined = _mm_shuffle_ps(_mm_castsi128_ps(a), _mm_castsi128_ps(b),
                                     _MM_SHUFFLE(2,0,2,0));
    return _mm_castps_si128(combined);
}




For an AVX2 version you'd need a lane-crossing shuffle to fixup a vphadd result. So emulating hadd with shifts might be a bigger win.



// 3x shuffle, 1x add uops
__m256i hadd32_avx2(__m256i a, __m256i b) {
    __m256i hadd = _mm256_hadd_epi32(a, b);  // 2x in-lane hadd
    return _mm256_permute4x64_epi64(hadd, _MM_SHUFFLE(3,1,2,0));
}

// UNTESTED
// 2x shift, 2x add, 1x blend-immediate (any ALU port), 1x shuffle
__m256i hadd32_emulated_avx2(__m256i a, __m256i b) {
    __m256i a_shift = _mm256_srli_epi64(a, 32);  // useful result in the low half of each qword
    __m256i b_shift = _mm256_slli_epi64(b, 32);  // ... high half of each qword
    a = _mm256_add_epi32(a, a_shift);
    b = _mm256_add_epi32(b, b_shift);
    __m256i blended = _mm256_blend_epi32(a, b, 0b10101010);  // alternating low/high results
    return _mm256_permutevar8x32_epi32(blended, _mm256_set_epi32(7,5,3,1, 6,4,2,0));
}



On Haswell and Skylake, hadd32_emulated_avx2 can run at 1 per 2 clocks (saturating all vector ALU ports). The extra add_epi32 to sum into accum[] will slow it down to at best 7/3 cycles per 256-bit vector of results, and you'll need to unroll (or use a compiler that unrolls) to not just bottleneck on the front-end.



hadd32_avx2 can run at 1 per 3 clocks (bottlenecked on port 5 for shuffles). The load + store + extra add_epi32 uops to implement your loop can run in the shadow of that easily.



(https://agner.org/optimize/, and see https://stackoverflow.com/tags/x86/info)






share|improve this answer
























    Your Answer






    StackExchange.ifUsing("editor", function ()
    StackExchange.using("externalEditor", function ()
    StackExchange.using("snippets", function ()
    StackExchange.snippets.init();
    );
    );
    , "code-snippets");

    StackExchange.ready(function()
    var channelOptions =
    tags: "".split(" "),
    id: "1"
    ;
    initTagRenderer("".split(" "), "".split(" "), channelOptions);

    StackExchange.using("externalEditor", function()
    // Have to fire editor after snippets, if snippets enabled
    if (StackExchange.settings.snippets.snippetsEnabled)
    StackExchange.using("snippets", function()
    createEditor();
    );

    else
    createEditor();

    );

    function createEditor()
    StackExchange.prepareEditor(
    heartbeatType: 'answer',
    autoActivateHeartbeat: false,
    convertImagesToLinks: true,
    noModals: true,
    showLowRepImageUploadWarning: true,
    reputationToPostImages: 10,
    bindNavPrevention: true,
    postfix: "",
    imageUploader:
    brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
    contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
    allowUrls: true
    ,
    onDemand: true,
    discardSelector: ".discard-answer"
    ,immediatelyShowMarkdownHelp:true
    );



    );













    draft saved

    draft discarded


















    StackExchange.ready(
    function ()
    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f55057933%2fsimd-accumulate-adjacent-pairs%23new-answer', 'question_page');

    );

    Post as a guest















    Required, but never shown

























    1 Answer
    1






    active

    oldest

    votes








    1 Answer
    1






    active

    oldest

    votes









    active

    oldest

    votes






    active

    oldest

    votes









    3














    The widening case with the narrow type T = uint8_t or uint16_t is probably best implemented with SSSE3 pmaddubsw or SSE2 pmaddwd with a multiplier of 1. (Intrinsics guide) Those instructions single-uop and do exactly the horizontal widening add you need more efficiently than shuffling.



    If you can do so without losing precision, do the vertical add between rows first, before widening horizontal add. (e.g. 10, 12, or 14-bit pixel components in [u]int16_t can't overflow). Load and vertical-add have (at least) 2 per clock throughput on most CPUs, vs. 1 per clock for pmadd* only having 2-per-clock throughput on Skylake and later. And it means you only need 1x add + 1x pmadd vs. 2x pmadd + 1x add so it's a significant win even on Skylake. (For the 2nd way, both loads can fold into memory operands for pmadd, if you have AVX. For the add before pmadd way, you'll need a pure load first and then fold the 2nd load into add, so you might not save front-end uops, unless you use indexed addressing modes and they un-laminate.)



    And ideally you don't need to += into an accumulator array, and instead can just read 2 rows in parallel and accumulator is write-only, so your loop has only 2 input streams and 1 output stream.



    // SSSE3
    __m128i hadd_widen8_to_16(__m128i a)
    // uint8_t, int8_t (doesn't matter when multiplier is +1)
    return _mm_maddubs_epi16(a, _mm_set_epi8(1));


    // SSE2
    __m128i hadd_widen16_to_32(__m128i a)
    // int16_t, int16_t
    return _mm_madd_epi16(a, _mm_set_epi16(1));



    These port to 256-bit AVX2 directly, because the input and output width is the same. No shuffle needed to fix up in-lane packing.



    Yes really, they're both _epi16. Intel can be wildly inconsistent with intrinsic names. asm mnemonics are more consistent and easier to remember what's what. (ubsw = unsigned byte to signed word, except that one of the inputs is signed bytes. pmaddwd is packed multiply add word to dword, same naming scheme as punpcklwd etc.)




    The T=U case with uint16_t or uint32_t is a a use-case for SSSE3 _mm_hadd_epi16 or _mm_hadd_epi32. It costs the same as 2 shuffles + a vertical add, but you need that anyway to pack 2 inputs to 1.



    If you want to work around a shuffle-port bottleneck on Haswell and later, you could consider using qword shifts on the inputs and then shuffling together the result with shufps (_mm_shuffle_ps + some casting). This could possibly be a win on Skylake (with 2 per clock shift throughput), even though it costs more 5 total uops instead of 3. It can run at best 5/3 cycles per vector of output instead of 2 cycles per vector if there's no front-end bottleneck



    // UNTESTED

    //Only any good with AVX, otherwise the extra movdqa instructions kill this
    //Only worth considering for Skylake, not Haswell (1/c shifts) or Sandybridge (2/c shuffle)
    __m128i hadd32_emulated(__m128i a, __m128i b)
    __m128i a_shift = _mm_srli_epi64(a, 32);
    __m128i b_shift = _mm_srli_epi64(b, 32);
    a = _mm_add_epi32(a, a_shift);
    b = _mm_add_epi32(b, b_shift);
    __m128 combined = _mm_shuffle_ps(_mm_castsi128_ps(a), _mm_castsi128_ps(b), _MM_SHUFFLE(2,0,2,0));
    return _mm_castps_si128(combined);




    For an AVX2 version you'd need a lane-crossing shuffle to fixup a vphadd result. So emulating hadd with shifts might be a bigger win.



    // 3x shuffle 1x add uops
    __m256i hadd32_avx2(__m256i a, __m256i b)
    __m256i hadd = _mm256_hadd_epi32(a, b); // 2x in-lane hadd
    return _mm256_permutex_epi64( hadd, _MM_SHUFFLE(3,1,2,0) );


    // UNTESTED
    // 2x shift, 2x add, 1x blend-immediate (any ALU port), 1x shuffle
    __m256i hadd32_emulated_avx2(__m256i a, __m256i b)

    __m256i a_shift = _mm256_srli_epi64(a, 32); // useful result in the low half of each qword
    __m256i b_shift = _mm256_slli_epi64(b, 32); // ... high half of each qword
    a = _mm256_add_epi32(a, a_shift);
    b = _mm256_add_epi32(b, b_shift);
    __m256i blended = _mm256_blend_epi32(a,b, 0b10101010); // alternating low/high results
    return _mm256_permutexvar_epi32(_mm256_set_epi32(7,5,3,1, 6,4,2,0), blended);



    On Haswell and Skylake, hadd32_emulated_avx2 can run at 1 per 2 clocks (saturating all vector ALU ports). The extra add_epi32 to sum into accum[] will slow it down to at best 7/3 cycles per 256-bit vector of results, and you'll need to unroll (or use a compiler that unrolls) to not just bottleneck on the front-end.



    hadd32_avx2 can run at 1 per 3 clocks (bottlenecked on port 5 for shuffles). The load + store + extra add_epi32 uops to implement your loop can run in the shadow of that easily.



    (https://agner.org/optimize/, and see https://stackoverflow.com/tags/x86/info)






    share|improve this answer





























      3














      The widening case with the narrow type T = uint8_t or uint16_t is probably best implemented with SSSE3 pmaddubsw or SSE2 pmaddwd with a multiplier of 1. (Intrinsics guide) Those instructions single-uop and do exactly the horizontal widening add you need more efficiently than shuffling.



      If you can do so without losing precision, do the vertical add between rows first, before widening horizontal add. (e.g. 10, 12, or 14-bit pixel components in [u]int16_t can't overflow). Load and vertical-add have (at least) 2 per clock throughput on most CPUs, vs. 1 per clock for pmadd* only having 2-per-clock throughput on Skylake and later. And it means you only need 1x add + 1x pmadd vs. 2x pmadd + 1x add so it's a significant win even on Skylake. (For the 2nd way, both loads can fold into memory operands for pmadd, if you have AVX. For the add before pmadd way, you'll need a pure load first and then fold the 2nd load into add, so you might not save front-end uops, unless you use indexed addressing modes and they un-laminate.)



      And ideally you don't need to += into an accumulator array, and instead can just read 2 rows in parallel and accumulator is write-only, so your loop has only 2 input streams and 1 output stream.



      // SSSE3
      __m128i hadd_widen8_to_16(__m128i a)
      // uint8_t, int8_t (doesn't matter when multiplier is +1)
      return _mm_maddubs_epi16(a, _mm_set_epi8(1));


      // SSE2
      __m128i hadd_widen16_to_32(__m128i a)
      // int16_t, int16_t
      return _mm_madd_epi16(a, _mm_set_epi16(1));



      These port to 256-bit AVX2 directly, because the input and output width is the same. No shuffle needed to fix up in-lane packing.



      Yes really, they're both _epi16. Intel can be wildly inconsistent with intrinsic names. asm mnemonics are more consistent and easier to remember what's what. (ubsw = unsigned byte to signed word, except that one of the inputs is signed bytes. pmaddwd is packed multiply add word to dword, same naming scheme as punpcklwd etc.)




      The T=U case with uint16_t or uint32_t is a use-case for SSSE3 _mm_hadd_epi16 or _mm_hadd_epi32. It costs the same as 2 shuffles + a vertical add, but you need that anyway to pack 2 inputs into 1.
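      To make the lane ordering explicit, here's a scalar model (hypothetical name) of what _mm_hadd_epi32(a, b) produces: the low two output lanes are the pairwise sums of a, the high two are the pairwise sums of b, so one hadd packs two inputs into one vector of pair-sums:

```cpp
#include <array>
#include <cstdint>

// Scalar model of _mm_hadd_epi32(a, b): lane 0 is a[0]+a[1], lane 1 is
// a[2]+a[3], lane 2 is b[0]+b[1], lane 3 is b[2]+b[3].
std::array<int32_t, 4> hadd_epi32_model(const std::array<int32_t, 4>& a,
                                        const std::array<int32_t, 4>& b)
{
    return { a[0] + a[1], a[2] + a[3], b[0] + b[1], b[2] + b[3] };
}
```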



      If you want to work around a shuffle-port bottleneck on Haswell and later, you could consider using qword shifts on the inputs and then shuffling the results together with shufps (_mm_shuffle_ps + some casting). This could possibly be a win on Skylake (with 2-per-clock shift throughput), even though it costs 5 total uops instead of 3. It can run at best 5/3 cycles per vector of output instead of 2 cycles per vector, if there's no front-end bottleneck.



      // UNTESTED
      // Only any good with AVX; otherwise the extra movdqa instructions kill this.
      // Only worth considering for Skylake, not Haswell (1/c shifts) or Sandybridge (2/c shuffle).
      __m128i hadd32_emulated(__m128i a, __m128i b) {
          __m128i a_shift = _mm_srli_epi64(a, 32);
          __m128i b_shift = _mm_srli_epi64(b, 32);
          a = _mm_add_epi32(a, a_shift);
          b = _mm_add_epi32(b, b_shift);
          __m128 combined = _mm_shuffle_ps(_mm_castsi128_ps(a), _mm_castsi128_ps(b),
                                           _MM_SHUFFLE(2,0,2,0));
          return _mm_castps_si128(combined);
      }




      For an AVX2 version you'd need a lane-crossing shuffle to fix up a vphadd result, so emulating hadd with shifts might be a bigger win there.



      // 3x shuffle + 1x add uops
      __m256i hadd32_avx2(__m256i a, __m256i b) {
          __m256i hadd = _mm256_hadd_epi32(a, b);  // 2x in-lane hadd
          return _mm256_permute4x64_epi64(hadd, _MM_SHUFFLE(3,1,2,0));
      }


      // UNTESTED
      // 2x shift, 2x add, 1x blend-immediate (any ALU port), 1x shuffle
      __m256i hadd32_emulated_avx2(__m256i a, __m256i b) {
          __m256i a_shift = _mm256_srli_epi64(a, 32); // useful result in the low half of each qword
          __m256i b_shift = _mm256_slli_epi64(b, 32); // ... high half of each qword
          a = _mm256_add_epi32(a, a_shift);
          b = _mm256_add_epi32(b, b_shift);
          __m256i blended = _mm256_blend_epi32(a, b, 0b10101010); // alternating low/high results
          return _mm256_permutexvar_epi32(_mm256_set_epi32(7,5,3,1, 6,4,2,0), blended);
      }
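      Since that code is untested, it's worth sanity-checking the blend + permute lane bookkeeping with a scalar model (hypothetical name; this only checks the data movement, not the real intrinsics):

```cpp
#include <array>
#include <cstdint>

// Scalar model of hadd32_emulated_avx2: after a += (a >> 32), the even dword
// lanes hold a's pair sums; after b += (b << 32), the odd lanes hold b's.
// The blend interleaves them, and the vpermd index vector 7,5,3,1, 6,4,2,0
// (lane 0 first: 0,2,4,6,1,3,5,7) sorts a's sums into the low half and b's
// into the high half.
std::array<int32_t, 8> hadd32_emulated_avx2_model(const std::array<int32_t, 8>& a,
                                                  const std::array<int32_t, 8>& b)
{
    std::array<int32_t, 8> blended{};
    for (int q = 0; q < 4; ++q) {                    // 4 qwords of 2 dwords each
        blended[2*q]     = a[2*q] + a[2*q + 1];      // from a + srli_epi64(a, 32)
        blended[2*q + 1] = b[2*q] + b[2*q + 1];      // from b + slli_epi64(b, 32)
    }
    const int idx[8] = {0, 2, 4, 6, 1, 3, 5, 7};     // permutexvar, lane 0 first
    std::array<int32_t, 8> out{};
    for (int i = 0; i < 8; ++i)
        out[i] = blended[idx[i]];
    return out;
}
```

      The model confirms the output ordering matches hadd32_avx2: all four pair-sums of a in the low 128-bit lane, all four of b in the high lane.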



      On Haswell and Skylake, hadd32_emulated_avx2 can run at 1 per 2 clocks (saturating all vector ALU ports). The extra add_epi32 to sum into accum[] will slow it down to at best 7/3 cycles per 256-bit vector of results, and you'll need to unroll (or use a compiler that unrolls) to not just bottleneck on the front-end.



      hadd32_avx2 can run at 1 per 3 clocks (bottlenecked on port 5 for shuffles). The load + store + extra add_epi32 uops to implement your loop can run in the shadow of that easily.



      (https://agner.org/optimize/, and see https://stackoverflow.com/tags/x86/info)






      answered Mar 8 at 7:52 by Peter Cordes, edited Mar 9 at 11:11




























