SIMD: Accumulate Adjacent Pairs

I'm learning how to use SIMD intrinsics and autovectorization. Luckily, I have a useful project I'm working on that seems extremely amenable to SIMD, but is still tricky for a newbie like me.



I'm writing a filter for images that computes the average of 2x2 pixels. I'm doing part of the computation by accumulating the sum of two pixels into a single pixel.



template <typename T, typename U>
inline void accumulate_2x2_x_pass(
    T* channel, U* accum,
    const size_t sx, const size_t sy,
    const size_t osx, const size_t osy,
    const size_t yoff, const size_t oyoff
) {
    const bool odd_x = (sx & 0x01);

    size_t i_idx, o_idx;

    // Should be vectorizable somehow...
    for (size_t x = 0, ox = 0; x < sx - (size_t)odd_x; x += 2, ox++) {
        i_idx = x + yoff;
        o_idx = ox + oyoff;
        accum[o_idx] += channel[i_idx];
        accum[o_idx] += channel[i_idx + 1];
    }

    if (odd_x) {
        // << 1 bc we need to multiply by two on the edge
        // to avoid darkening during render
        accum[(osx - 1) + oyoff] += (U)(channel[(sx - 1) + yoff]) * 2;
    }
}




However, godbolt shows that my loop is not autovectorizable. (https://godbolt.org/z/qZxvof) How would I construct SIMD intrinsics to solve this issue? I have control of the alignment for accum, but not channel.



(I know there's an average intrinsic, but it's not appropriate here because I need to generate multiple mip levels and that command would cause loss of precision for the next level.)



Thanks everyone. :)










c++ sse simd intrinsics avx

edited Mar 8 at 7:56 by Peter Cordes
asked Mar 8 at 6:36 by SapphireSun

  • Looks like a use-case for SSSE3 _mm_hadd_epi32, or _mm_hadd_epi16 if T is int16_t instead of int. It costs the same as 2 shuffles + a vertical add, but you need that anyway to pack 2 inputs to 1. If you want to work around a shuffle-port bottleneck on Intel CPUs, you could consider using qword shifts on the inputs and then shuffling together the result with shufps.
    – Peter Cordes
    Mar 8 at 6:55

  • Wow, that's pretty cool! I had been under the impression that "horizontal" operations were not possible in SIMD. I'll give this a try tomorrow. For what it's worth, the dominant use case for this operation is uint8_t -> uint16_t.
    – SapphireSun
    Mar 8 at 6:57

  • I didn't realize you were widening, that changes things entirely. (Also, you showed short -> int on Godbolt; what SSE/AVX version are you targeting? You used -march=native on Godbolt, which is Skylake-AVX512 = AVX512BW). Anyway, _mm_hadd_* isn't useful when U is not the same width as T. You probably want pmaddwd or pmaddubsw with a multiplier of 1 to add horizontal pairs into a wider result.
    – Peter Cordes
    Mar 8 at 7:19

  • You should always use -march=haswell or similar if you know that's what you're targeting. That sets important tuning options as well as instruction sets. And don't use -march=corei7, it's kind of meaningless / confusing because it's basically -march=nehalem (the first generation of core i7).
    – Peter Cordes
    Mar 8 at 7:36

  • On Godbolt you can #include <stddef.h> and use size_t like a normal person. And notice that gcc did auto-vectorize your code for uint8_t -> uint16_t. Not particularly well, but it did it.
    – Peter Cordes
    Mar 8 at 7:39

1 Answer

The widening case with the narrow type T = uint8_t or uint16_t is probably best implemented with SSSE3 pmaddubsw or SSE2 pmaddwd with a multiplier of 1. (Intrinsics guide) Those instructions are single-uop and do exactly the horizontal widening add you need, more efficiently than shuffling.



If you can do so without losing precision, do the vertical add between rows first, before the widening horizontal add (e.g. 10-, 12-, or 14-bit pixel components in [u]int16_t can't overflow). Load and vertical-add have (at least) 2-per-clock throughput on most CPUs, vs. 1 per clock for pmadd* (it only gains 2-per-clock throughput on Skylake and later). And it means you only need 1x add + 1x pmadd vs. 2x pmadd + 1x add, so it's a significant win even on Skylake. (For the 2nd way, both loads can fold into memory operands for pmadd, if you have AVX. For the add-before-pmadd way, you'll need a pure load first and then fold the 2nd load into the add, so you might not save front-end uops, unless you use indexed addressing modes and they un-laminate.)



And ideally you don't need to += into an accumulator array; instead you can just read 2 rows in parallel with a write-only accumulator, so your loop has only 2 input streams and 1 output stream.
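
As a rough (untested) sketch of that shape for uint16_t -> uint32_t — the row0/row1/out names and the multiple-of-8 width are illustrative assumptions, not from the question, and it relies on components being at most 14 bits so the vertical sums stay within int16_t range:

#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

void sum_2x2_u16_to_u32(const uint16_t* row0, const uint16_t* row1,
                        uint32_t* out, size_t sx)
{
    const __m128i ones = _mm_set1_epi16(1);
    for (size_t x = 0; x + 8 <= sx; x += 8) {
        __m128i a = _mm_loadu_si128((const __m128i*)(row0 + x));  // 8 pixels from row y
        __m128i b = _mm_loadu_si128((const __m128i*)(row1 + x));  // 8 pixels from row y+1
        __m128i v = _mm_add_epi16(a, b);        // 1x vertical add
        __m128i h = _mm_madd_epi16(v, ones);    // 1x widening horizontal pair add
        _mm_storeu_si128((__m128i*)(out + x / 2), h);  // 4 widened 2x2 sums
    }
}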



// SSSE3
__m128i hadd_widen8_to_16(__m128i a) {
    // uint8_t, int8_t (doesn't matter when multiplier is +1)
    return _mm_maddubs_epi16(a, _mm_set1_epi8(1));
}

// SSE2
__m128i hadd_widen16_to_32(__m128i a) {
    // int16_t, int16_t
    return _mm_madd_epi16(a, _mm_set1_epi16(1));
}



These port to 256-bit AVX2 directly, because the input and output width is the same. No shuffle needed to fix up in-lane packing.



Yes really, they're both _epi16. Intel can be wildly inconsistent with intrinsic names. asm mnemonics are more consistent and easier to remember what's what. (ubsw = unsigned byte to signed word, except that one of the inputs is signed bytes. pmaddwd is packed multiply add word to dword, same naming scheme as punpcklwd etc.)
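
As a usage sketch for the question's dominant uint8_t -> uint16_t case (untested; it reuses the headers from the earlier sketch and hadd_widen8_to_16 from above, and the row0/row1/out names and multiple-of-16 width are assumptions, not from the question):

void sum_2x2_u8_to_u16(const uint8_t* row0, const uint8_t* row1,
                       uint16_t* out, size_t sx)
{
    for (size_t x = 0; x + 16 <= sx; x += 16) {
        // widen each row's horizontal pairs (2x pmaddubsw), then one vertical add
        __m128i ha = hadd_widen8_to_16(_mm_loadu_si128((const __m128i*)(row0 + x)));
        __m128i hb = hadd_widen8_to_16(_mm_loadu_si128((const __m128i*)(row1 + x)));
        _mm_storeu_si128((__m128i*)(out + x / 2), _mm_add_epi16(ha, hb));  // 8 full 2x2 sums
    }
}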




The T=U case with uint16_t or uint32_t is a use-case for SSSE3 _mm_hadd_epi16 or _mm_hadd_epi32. It costs the same as 2 shuffles + a vertical add, but you need that anyway to pack 2 inputs to 1.
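
A minimal baseline for that route (assuming the same headers as above; the wrapper name is illustrative):

// straightforward SSSE3 version: 1 instruction, but decodes to 2 shuffles + 1 vertical add
__m128i hadd32_ssse3(__m128i a, __m128i b)
{
    return _mm_hadd_epi32(a, b);   // a0+a1, a2+a3, b0+b1, b2+b3
}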



If you want to work around a shuffle-port bottleneck on Haswell and later, you could consider using qword shifts on the inputs and then shuffling together the result with shufps (_mm_shuffle_ps + some casting). This could possibly be a win on Skylake (with 2-per-clock shift throughput), even though it costs more: 5 total uops instead of 3. It can run at best 5/3 cycles per vector of output instead of 2 cycles per vector, if there's no front-end bottleneck.



// UNTESTED

// Only any good with AVX, otherwise the extra movdqa instructions kill this
// Only worth considering for Skylake, not Haswell (1/c shifts) or Sandybridge (2/c shuffle)
__m128i hadd32_emulated(__m128i a, __m128i b)
{
    __m128i a_shift = _mm_srli_epi64(a, 32);
    __m128i b_shift = _mm_srli_epi64(b, 32);
    a = _mm_add_epi32(a, a_shift);
    b = _mm_add_epi32(b, b_shift);
    __m128 combined = _mm_shuffle_ps(_mm_castsi128_ps(a), _mm_castsi128_ps(b), _MM_SHUFFLE(2,0,2,0));
    return _mm_castps_si128(combined);
}




For an AVX2 version you'd need a lane-crossing shuffle to fix up a vphadd result. So emulating hadd with shifts might be a bigger win.



// 3x shuffle 1x add uops
__m256i hadd32_avx2(__m256i a, __m256i b)
{
    __m256i hadd = _mm256_hadd_epi32(a, b);  // 2x in-lane hadd
    return _mm256_permute4x64_epi64(hadd, _MM_SHUFFLE(3,1,2,0));
}

// UNTESTED
// 2x shift, 2x add, 1x blend-immediate (any ALU port), 1x shuffle
__m256i hadd32_emulated_avx2(__m256i a, __m256i b)
{
    __m256i a_shift = _mm256_srli_epi64(a, 32);   // useful result in the low half of each qword
    __m256i b_shift = _mm256_slli_epi64(b, 32);   // ... high half of each qword
    a = _mm256_add_epi32(a, a_shift);
    b = _mm256_add_epi32(b, b_shift);
    __m256i blended = _mm256_blend_epi32(a, b, 0b10101010);  // alternating low/high results
    return _mm256_permutevar8x32_epi32(blended, _mm256_set_epi32(7,5,3,1, 6,4,2,0));
}



On Haswell and Skylake, hadd32_emulated_avx2 can run at 1 per 2 clocks (saturating all vector ALU ports). The extra add_epi32 to sum into accum[] will slow it down to at best 7/3 cycles per 256-bit vector of results, and you'll need to unroll (or use a compiler that unrolls) to not just bottleneck on the front-end.



hadd32_avx2 can run at 1 per 3 clocks (bottlenecked on port 5 for shuffles). The load + store + extra add_epi32 uops to implement your loop can run in the shadow of that easily.
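
Tying that back to the question's loop, a sketch of what a T = U = uint32_t x-pass could look like with hadd32_avx2 from above (untested; the simplified signature and the missing alignment/remainder handling are assumptions, not a drop-in replacement for accumulate_2x2_x_pass):

void accumulate_2x2_x_pass_u32(const uint32_t* channel, uint32_t* accum,
                               size_t sx, size_t yoff, size_t oyoff)
{
    size_t x = 0, ox = 0;
    for (; x + 16 <= sx; x += 16, ox += 8) {
        __m256i a = _mm256_loadu_si256((const __m256i*)(channel + yoff + x));      // elements 0..7
        __m256i b = _mm256_loadu_si256((const __m256i*)(channel + yoff + x + 8));  // elements 8..15
        __m256i pairs = hadd32_avx2(a, b);                                   // 8 horizontal pair sums
        __m256i acc = _mm256_loadu_si256((const __m256i*)(accum + oyoff + ox));
        _mm256_storeu_si256((__m256i*)(accum + oyoff + ox),
                            _mm256_add_epi32(acc, pairs));                   // the extra add_epi32
    }
    // a scalar tail would handle the remaining sx - x elements (and the odd_x edge case)
}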



(https://agner.org/optimize/, and see https://stackoverflow.com/tags/x86/info)






answered Mar 8 at 7:52 by Peter Cordes, edited Mar 9 at 11:11