SIMD: Accumulate Adjacent Pairs
I'm learning how to use SIMD intrinsics and autovectorization. Luckily, I have a useful project I'm working on that seems extremely amenable to SIMD, but is still tricky for a newbie like me.
I'm writing a filter for images that computes the average of 2x2 pixels. I'm doing part of the computation by accumulating the sum of two pixels into a single pixel.
template <typename T, typename U>
inline void accumulate_2x2_x_pass(
    T* channel, U* accum,
    const size_t sx, const size_t sy,
    const size_t osx, const size_t osy,
    const size_t yoff, const size_t oyoff
) {
    const bool odd_x = (sx & 0x01);

    // Should be vectorizable somehow...
    for (size_t x = 0, ox = 0; x < sx - (size_t)odd_x; x += 2, ox++) {
        const size_t i_idx = x + yoff;
        const size_t o_idx = ox + oyoff;
        accum[o_idx] += channel[i_idx];
        accum[o_idx] += channel[i_idx + 1];
    }

    if (odd_x) {
        // Multiply the lone edge column by two
        // to avoid darkening during render.
        accum[(osx - 1) + oyoff] += (U)(channel[(sx - 1) + yoff]) * 2;
    }
}
However, Godbolt shows that my loop does not get auto-vectorized (https://godbolt.org/z/qZxvof). How would I construct SIMD intrinsics to solve this? I have control of the alignment for accum, but not channel.
(I know there's an average intrinsic, but it's not appropriate here because I need to generate multiple mip levels and that command would cause loss of precision for the next level.)
Thanks everyone. :)
c++ sse simd intrinsics avx
asked Mar 8 at 6:36 by SapphireSun, edited Mar 8 at 7:56 by Peter Cordes

Looks like a use-case for SSSE3 _mm_hadd_epi32, or _mm_hadd_epi16 if T is int16_t instead of int. It costs the same as 2 shuffles + a vertical add, but you need that anyway to pack 2 inputs to 1. If you want to work around a shuffle-port bottleneck on Intel CPUs, you could consider using qword shifts on the inputs and then shuffling together the result with shufps.
– Peter Cordes, Mar 8 at 6:55

Wow, that's pretty cool! I had been under the impression that "horizontal" operations were not possible in SIMD. I'll give this a try tomorrow. For what it's worth, the dominant use case for this operation is uint8_t -> uint16_t.
– SapphireSun, Mar 8 at 6:57

I didn't realize you were widening; that changes things entirely. (Also, you showed short -> int on Godbolt; what SSE/AVX version are you targeting? You used -march=native on Godbolt, which is Skylake-AVX512 = AVX512BW.) Anyway, _mm_hadd_* isn't useful when U is not the same width as T. You probably want pmaddwd or pmaddubsw with a multiplier of 1 to add horizontal pairs into a wider result.
– Peter Cordes, Mar 8 at 7:19

You should always use -march=haswell or similar if you know that's what you're targeting. That sets important tuning options as well as instruction sets. And don't use -march=corei7; it's kind of meaningless / confusing because it's basically -march=nehalem (the first generation of Core i7).
– Peter Cordes, Mar 8 at 7:36

On Godbolt you can #include <stddef.h> and use size_t like a normal person. And notice that gcc did auto-vectorize your code for uint8_t -> uint16_t. Not particularly well, but it did it.
– Peter Cordes, Mar 8 at 7:39
1
Looks like a use-case for SSSE3_mm_hadd_epi32
or_mm_hadd_epi16
T isint16_t
instead ofint
. It costs the same as 2 shuffles + a vertical add, but you need that anyway to pack 2 inputs to 1. If you want to work around a shuffle-port bottleneck on Intel CPUs, you could consider using qword shifts on the inputs and then shuffling together the result withshufps
.
– Peter Cordes
Mar 8 at 6:55
Wow, that's pretty cool! I had been under the impression that "horizontal" operations were not possible in SIMD. I'll give this a try tomorrow. For what it's worth, the dominant use case for this operation is uint8_t -> uint16_t
– SapphireSun
Mar 8 at 6:57
1
I didn't realize you were widening, that changes things entirely. (Also, you showed short -> int on Godbolt; what SSE/AVX version are you targeting? You used-march=native
on Godbolt, which is Skylake-AVX512 = AVX512BW). Anyway,_mm_hadd_*
isn't useful whenU
is not the same width asT
. You probably wantpmaddwd
orpmaddubsw
with a multiplier of 1 to add horizontal pairs into a wider result.
– Peter Cordes
Mar 8 at 7:19
1
You should always use-march=haswell
or similar if you know that's what you're targeting. That sets important tuning options as well as instruction sets. And don't use-march=corei7
, it's kind of meaningless / confusing because it's basically-march=nehalem
(the first generation of core i7).
– Peter Cordes
Mar 8 at 7:36
1
On Godbolt you can#include <stddef.h>
and usesize_t
like a normal person. And notice that gcc did auto-vectorize your code for uint8_t -> uint16_t. Not particularly well, but it did it.
– Peter Cordes
Mar 8 at 7:39
|
show 3 more comments
1
Looks like a use-case for SSSE3_mm_hadd_epi32
or_mm_hadd_epi16
T isint16_t
instead ofint
. It costs the same as 2 shuffles + a vertical add, but you need that anyway to pack 2 inputs to 1. If you want to work around a shuffle-port bottleneck on Intel CPUs, you could consider using qword shifts on the inputs and then shuffling together the result withshufps
.
– Peter Cordes
Mar 8 at 6:55
Wow, that's pretty cool! I had been under the impression that "horizontal" operations were not possible in SIMD. I'll give this a try tomorrow. For what it's worth, the dominant use case for this operation is uint8_t -> uint16_t
– SapphireSun
Mar 8 at 6:57
1
I didn't realize you were widening, that changes things entirely. (Also, you showed short -> int on Godbolt; what SSE/AVX version are you targeting? You used-march=native
on Godbolt, which is Skylake-AVX512 = AVX512BW). Anyway,_mm_hadd_*
isn't useful whenU
is not the same width asT
. You probably wantpmaddwd
orpmaddubsw
with a multiplier of 1 to add horizontal pairs into a wider result.
– Peter Cordes
Mar 8 at 7:19
1
You should always use-march=haswell
or similar if you know that's what you're targeting. That sets important tuning options as well as instruction sets. And don't use-march=corei7
, it's kind of meaningless / confusing because it's basically-march=nehalem
(the first generation of core i7).
– Peter Cordes
Mar 8 at 7:36
1
On Godbolt you can#include <stddef.h>
and usesize_t
like a normal person. And notice that gcc did auto-vectorize your code for uint8_t -> uint16_t. Not particularly well, but it did it.
– Peter Cordes
Mar 8 at 7:39
1
1
Looks like a use-case for SSSE3
_mm_hadd_epi32
or _mm_hadd_epi16
T is int16_t
instead of int
. It costs the same as 2 shuffles + a vertical add, but you need that anyway to pack 2 inputs to 1. If you want to work around a shuffle-port bottleneck on Intel CPUs, you could consider using qword shifts on the inputs and then shuffling together the result with shufps
.– Peter Cordes
Mar 8 at 6:55
Looks like a use-case for SSSE3
_mm_hadd_epi32
or _mm_hadd_epi16
T is int16_t
instead of int
. It costs the same as 2 shuffles + a vertical add, but you need that anyway to pack 2 inputs to 1. If you want to work around a shuffle-port bottleneck on Intel CPUs, you could consider using qword shifts on the inputs and then shuffling together the result with shufps
.– Peter Cordes
Mar 8 at 6:55
Wow, that's pretty cool! I had been under the impression that "horizontal" operations were not possible in SIMD. I'll give this a try tomorrow. For what it's worth, the dominant use case for this operation is uint8_t -> uint16_t
– SapphireSun
Mar 8 at 6:57
Wow, that's pretty cool! I had been under the impression that "horizontal" operations were not possible in SIMD. I'll give this a try tomorrow. For what it's worth, the dominant use case for this operation is uint8_t -> uint16_t
– SapphireSun
Mar 8 at 6:57
1
1
I didn't realize you were widening, that changes things entirely. (Also, you showed short -> int on Godbolt; what SSE/AVX version are you targeting? You used
-march=native
on Godbolt, which is Skylake-AVX512 = AVX512BW). Anyway, _mm_hadd_*
isn't useful when U
is not the same width as T
. You probably want pmaddwd
or pmaddubsw
with a multiplier of 1 to add horizontal pairs into a wider result.– Peter Cordes
Mar 8 at 7:19
I didn't realize you were widening, that changes things entirely. (Also, you showed short -> int on Godbolt; what SSE/AVX version are you targeting? You used
-march=native
on Godbolt, which is Skylake-AVX512 = AVX512BW). Anyway, _mm_hadd_*
isn't useful when U
is not the same width as T
. You probably want pmaddwd
or pmaddubsw
with a multiplier of 1 to add horizontal pairs into a wider result.– Peter Cordes
Mar 8 at 7:19
1
1
You should always use
-march=haswell
or similar if you know that's what you're targeting. That sets important tuning options as well as instruction sets. And don't use -march=corei7
, it's kind of meaningless / confusing because it's basically -march=nehalem
(the first generation of core i7).– Peter Cordes
Mar 8 at 7:36
You should always use
-march=haswell
or similar if you know that's what you're targeting. That sets important tuning options as well as instruction sets. And don't use -march=corei7
, it's kind of meaningless / confusing because it's basically -march=nehalem
(the first generation of core i7).– Peter Cordes
Mar 8 at 7:36
1
1
On Godbolt you can
#include <stddef.h>
and use size_t
like a normal person. And notice that gcc did auto-vectorize your code for uint8_t -> uint16_t. Not particularly well, but it did it.– Peter Cordes
Mar 8 at 7:39
On Godbolt you can
#include <stddef.h>
and use size_t
like a normal person. And notice that gcc did auto-vectorize your code for uint8_t -> uint16_t. Not particularly well, but it did it.– Peter Cordes
Mar 8 at 7:39
|
show 3 more comments
1 Answer
The widening case with the narrow type T = uint8_t or uint16_t is probably best implemented with SSSE3 pmaddubsw or SSE2 pmaddwd with a multiplier of 1 (see Intel's Intrinsics Guide). Those instructions are single-uop and do exactly the horizontal widening add you need, more efficiently than shuffling.
If you can do so without losing precision, do the vertical add between rows first, before the widening horizontal add (e.g. 10, 12, or 14-bit pixel components in [u]int16_t can't overflow). Loads and vertical adds have (at least) 2-per-clock throughput on most CPUs, while pmadd* only has 2-per-clock throughput on Skylake and later (1 per clock before that). And it means you only need 1x add + 1x pmadd vs. 2x pmadd + 1x add, so it's a significant win even on Skylake. (For the 2nd way, both loads can fold into memory operands for pmadd, if you have AVX. For the add-before-pmadd way, you'll need a pure load first and then fold the 2nd load into the add, so you might not save front-end uops, unless you use indexed addressing modes and they un-laminate.)
And ideally you don't need to += into an accumulator array at all: just read 2 rows in parallel and make the accumulator write-only, so your loop has only 2 input streams and 1 output stream, as in the sketch below.
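For example, a minimal untested sketch of that loop shape for uint16_t components of 14 bits or narrower (so row0 + row1 can't overflow int16_t); the function name and the multiple-of-8 width assumption are illustrative, not from the original answer:

#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

// 2x2 sums for one row pair: vertical add first, then one pmaddwd
// for the widening horizontal pair add. 2 input streams, 1 write-only
// output stream. Assumes width % 8 == 0 and components <= 14 bits.
void sum_2x2_rows_u16_to_u32(const uint16_t* row0, const uint16_t* row1,
                             uint32_t* out, size_t width) {
    const __m128i ones = _mm_set1_epi16(1);
    for (size_t x = 0; x < width; x += 8) {
        __m128i r0 = _mm_loadu_si128((const __m128i*)(row0 + x));
        __m128i r1 = _mm_loadu_si128((const __m128i*)(row1 + x));
        __m128i v  = _mm_add_epi16(r0, r1);       // vertical add: 2 rows -> 1
        __m128i s  = _mm_madd_epi16(v, ones);     // widening horizontal pair add
        _mm_storeu_si128((__m128i*)(out + x / 2), s);  // write-only accumulator
    }
}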
// SSSE3: widen pairs of bytes to 16-bit sums
__m128i hadd_widen8_to_16(__m128i a) {
    // a holds uint8_t elements, the multiplier vector is int8_t;
    // signedness doesn't matter when the multiplier is +1
    return _mm_maddubs_epi16(a, _mm_set1_epi8(1));
}

// SSE2: widen pairs of int16_t to 32-bit sums
__m128i hadd_widen16_to_32(__m128i a) {
    // int16_t inputs, int16_t multiplier
    return _mm_madd_epi16(a, _mm_set1_epi16(1));
}
These port to 256-bit AVX2 directly, because the input and output width is the same; no shuffle is needed to fix up in-lane packing.
Yes, really, they're both _epi16. Intel can be wildly inconsistent with intrinsic names; the asm mnemonics are more consistent and easier to remember. (ubsw = unsigned byte to signed word, except that one of the inputs is signed bytes. pmaddwd is packed multiply-add word to dword, same naming scheme as punpcklwd etc.)
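For reference, an untested sketch of those same helpers at 256-bit width (the _avx2 suffixes are illustrative names, not from the original answer):

// AVX2: identical operation per 128-bit lane, so no lane-fixup shuffle needed
__m256i hadd_widen8_to_16_avx2(__m256i a) {
    return _mm256_maddubs_epi16(a, _mm256_set1_epi8(1));
}
__m256i hadd_widen16_to_32_avx2(__m256i a) {
    return _mm256_madd_epi16(a, _mm256_set1_epi16(1));
}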
The T=U case with uint16_t or uint32_t is a use-case for SSSE3 _mm_hadd_epi16 or _mm_hadd_epi32. It costs the same as 2 shuffles + a vertical add, but you need that anyway to pack 2 inputs to 1.
If you want to work around a shuffle-port bottleneck on Haswell and later, you could consider using qword shifts on the inputs and then shuffling together the result with shufps (_mm_shuffle_ps + some casting). This could possibly be a win on Skylake (with 2-per-clock shift throughput), even though it costs more uops: 5 total instead of 3. It can run at best 5/3 cycles per vector of output instead of 2 cycles per vector, if there's no front-end bottleneck.
// UNTESTED
// Only any good with AVX, otherwise the extra movdqa instructions kill this.
// Only worth considering for Skylake, not Haswell (1/c shifts) or Sandybridge (2/c shuffle).
__m128i hadd32_emulated(__m128i a, __m128i b) {
    __m128i a_shift = _mm_srli_epi64(a, 32);
    __m128i b_shift = _mm_srli_epi64(b, 32);
    a = _mm_add_epi32(a, a_shift);
    b = _mm_add_epi32(b, b_shift);
    __m128 combined = _mm_shuffle_ps(_mm_castsi128_ps(a), _mm_castsi128_ps(b),
                                     _MM_SHUFFLE(2,0,2,0));
    return _mm_castps_si128(combined);
}
For an AVX2 version you'd need a lane-crossing shuffle to fix up a vphadd result, so emulating hadd with shifts might be a bigger win there.
// 3x shuffle + 1x add uops
__m256i hadd32_avx2(__m256i a, __m256i b) {
    __m256i hadd = _mm256_hadd_epi32(a, b);  // 2x in-lane hadd
    // vpermq (_mm256_permute4x64_epi64 is the AVX2 intrinsic for it)
    return _mm256_permute4x64_epi64(hadd, _MM_SHUFFLE(3,1,2,0));
}
// UNTESTED
// 2x shift, 2x add, 1x blend-immediate (any ALU port), 1x shuffle
__m256i hadd32_emulated_avx2(__m256i a, __m256i b) {
    __m256i a_shift = _mm256_srli_epi64(a, 32);  // useful result in the low half of each qword
    __m256i b_shift = _mm256_slli_epi64(b, 32);  // ... high half of each qword
    a = _mm256_add_epi32(a, a_shift);
    b = _mm256_add_epi32(b, b_shift);
    __m256i blended = _mm256_blend_epi32(a, b, 0b10101010);  // alternating low/high results
    // vpermd (_mm256_permutevar8x32_epi32 is the AVX2 intrinsic for it)
    return _mm256_permutevar8x32_epi32(blended,
                                       _mm256_set_epi32(7,5,3,1, 6,4,2,0));
}
On Haswell and Skylake, hadd32_emulated_avx2 can run at 1 per 2 clocks (saturating all vector ALU ports). The extra add_epi32 to sum into accum[] slows it down to at best 7/3 cycles per 256-bit vector of results, and you'll need to unroll (or use a compiler that unrolls) to avoid bottlenecking on the front-end.
hadd32_avx2 can run at 1 per 3 clocks (bottlenecked on port 5 for shuffles); the load + store + extra add_epi32 uops to implement your loop can run in the shadow of that easily.
(See https://agner.org/optimize/ and the x86 tag wiki: https://stackoverflow.com/tags/x86/info)
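To connect this back to the question's dominant uint8_t -> uint16_t case, an untested usage sketch built on the SSSE3 helper above; the function name and the multiple-of-16 width assumption are illustrative, not from the original answer:

// One x-pass over a row: 16 uint8_t pixels -> 8 uint16_t pair sums per iteration.
// Assumes sx % 16 == 0; pair sums max out at 510, so no 16-bit saturation.
void accumulate_pairs_u8_to_u16(const uint8_t* channel, uint16_t* accum, size_t sx) {
    const __m128i ones = _mm_set1_epi8(1);
    for (size_t x = 0; x < sx; x += 16) {
        __m128i px   = _mm_loadu_si128((const __m128i*)(channel + x));
        __m128i sums = _mm_maddubs_epi16(px, ones);  // adjacent-pair sums, widened
        __m128i acc  = _mm_loadu_si128((const __m128i*)(accum + x / 2));
        _mm_storeu_si128((__m128i*)(accum + x / 2), _mm_add_epi16(acc, sums));
    }
}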
answered Mar 8 at 7:52 by Peter Cordes, edited Mar 9 at 11:11