Low 'Average Physical Core Utilization' according to VTune when using OpenMP, not sure what the bigger picture is
I have been optimizing a ray tracer, and to get a nice speed up, I used OpenMP generally like follows (C++):
    Accelerator accelerator; // Has the data to make tracing way faster
    Rays rays;               // Makes the rays so they're ready to go

    #pragma omp parallel for
    for (int y = 0; y < window->height; y++) {
        for (int x = 0; x < window->width; x++) {
            Ray& ray = rays.get(x, y);
            accelerator.trace(ray);
        }
    }
I gained 4.85x performance on a 6 core/12 thread CPU. I thought I'd get more than that, maybe something like 6-8x... especially when this eats up >= 99% of the processing time of the application.
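For reference, Amdahl's law gives a rough upper bound on this number. The sketch below assumes that the ">= 99%" figure is the parallelizable fraction p and that n is the 6 physical cores; both are assumptions, and hyperthreads share execution units, so n = 12 is unlikely to be a realistic value.

```latex
S(n) = \frac{1}{(1 - p) + \frac{p}{n}}, \qquad
S(6) = \frac{1}{0.01 + \frac{0.99}{6}} = \frac{1}{0.175} \approx 5.71
```

Under those assumptions, the measured 4.85x is roughly 85% of the bound for 6 cores.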
I want to find out where my performance bottleneck is, so I opened VTune and profiled. Note that I am new to profiling, so maybe this is normal but this is the graph I got:

In particular, this is the 2nd biggest time consumer:

where the 58% is the microarchitecture usage.
Trying to solve this on my own, I went looking for information on this, but the most I could find was on Intel's VTune wiki pages:
Average Physical Core Utilization
Metric Description
The metric shows average physical cores utilization by computations of the application. Spin and Overhead time are not counted. Ideal average CPU utilization is equal to the number of physical CPU cores.
I'm not sure what this is trying to tell me, which leads me to my question:
Is this normal for a result like this? Or is something going wrong somewhere? Is it okay to only see a 4.8x speedup (compared to a theoretical max of 12.0) for something that is embarrassingly parallel?
While ray tracing itself can be unfriendly due to the rays bouncing everywhere, I have done what I can to compact the memory and be as cache friendly as possible, used libraries that utilize SIMD for calculations, tried countless implementations from the literature to speed things up, avoided branching as much as possible, and used no recursion. I also parallelized the rays so that there's no false sharing AFAIK: each row is handled by one thread, so no two threads should be writing to the same cache line (especially since ray traversal is all const). The framebuffer is also row major, so I was hoping false sharing wouldn't be an issue there either.
I do not know whether a profiler will pick up the main loop that is threaded with OpenMP and this is an expected result, or whether I have made some kind of newbie mistake and I'm not getting the throughput that I want. I also checked that OpenMP spawns 12 threads, and it does.
I guess tl;dr, am I screwing up using OpenMP? From what I gathered, the average physical core utilization is supposed to be up near the average logical core utilization, but I almost certainly have no idea what I'm talking about.
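One way to read the metric description quoted above is as total effective compute time across physical cores divided by elapsed time. The sketch below shows that reading only; `average_core_utilization` and its parameters are hypothetical names, and this is not VTune's actual implementation.

```cpp
#include <numeric>
#include <vector>

// Hypothetical helper (an assumption, not VTune's code): interprets
// "average physical core utilization" as the sum of per-core effective
// compute seconds (spin and overhead time excluded) divided by the
// elapsed wall-clock seconds. The ideal value equals the core count.
double average_core_utilization(const std::vector<double>& effective_seconds_per_core,
                                double elapsed_seconds) {
    if (elapsed_seconds <= 0.0) {
        return 0.0; // Avoid dividing by zero for an empty measurement.
    }
    const double total = std::accumulate(effective_seconds_per_core.begin(),
                                         effective_seconds_per_core.end(), 0.0);
    return total / elapsed_seconds;
}
```

With 6 cores that each compute for the whole run, this returns 6.0, the ideal. If each core spends half of the run stalled or waiting, it returns 3.0.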
multithreading performance optimization openmp vtune
Your idea of theoretical speedup when running more threads than cores seems mistaken. If you don't pin threads to individual cores, your speedup will be limited by the threads not being spread evenly over the cores and by losing cached data prematurely. GNU libgomp on Windows doesn't support pinning (e.g. OMP_PLACES); depending on your application, Linux may pin better than Windows.
– tim18
Mar 8 at 9:36
@tim18 I am using Ubuntu with gcc 7.3.0. I don't use Windows at all (at least right now; this wouldn't even compile on Windows currently). I've tried changing the values for pinning threads to cores, playing with affinity through environment variables, etc. It does nothing to improve performance. Is there something I'm missing?
– Water
Mar 8 at 15:31
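To check whether the pinning settings discussed in these comments actually take effect, each thread can report the OpenMP place it is bound to. This is a minimal sketch that assumes an OpenMP 4.5 runtime; `thread_places` is a hypothetical helper name, and without OpenMP the function simply returns a single -1.

```cpp
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical helper: returns, for every thread of one parallel region,
// the index of the OpenMP place the thread is bound to. A value of -1
// means the thread is not bound (or OpenMP 4.5 is not available).
std::vector<int> thread_places() {
    std::vector<int> places(1, -1);
#if defined(_OPENMP) && _OPENMP >= 201511
    #pragma omp parallel
    {
        // One thread sizes the vector; the implicit barrier at the end of
        // the single construct makes the new size visible to all threads.
        #pragma omp single
        places.assign(omp_get_num_threads(), -1);

        // Each thread writes only its own slot.
        places[omp_get_thread_num()] = omp_get_place_num();
    }
#endif
    return places;
}
```

Built with `g++ -fopenmp` and run with, for example, `OMP_PLACES=cores OMP_PROC_BIND=close OMP_NUM_THREADS=6`, the returned values should be distinct place indices if pinning is active. If every entry is -1, the threads are not bound.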
asked Mar 8 at 4:18 by Water