How to know if a machine in a Spark cluster participates in a job


I want to know when it is safe to remove a node (machine) from a cluster.

My assumption is that it is safe to remove a machine if it is not running any containers and does not store any useful data.

Using the ResourceManager REST API documented at https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html, we can call

 GET http://<rm http address:port>/ws/v1/cluster/nodes

to get information about each node, for example:

<node>
  <rack>/default-rack</rack>
  <state>RUNNING</state>
  <id>host1.domain.com:54158</id>
  <nodeHostName>host1.domain.com</nodeHostName>
  <nodeHTTPAddress>host1.domain.com:8042</nodeHTTPAddress>
  <lastHealthUpdate>1476995346399</lastHealthUpdate>
  <version>3.0.0-SNAPSHOT</version>
  <healthReport></healthReport>
  <numContainers>0</numContainers>
  <usedMemoryMB>0</usedMemoryMB>
  <availMemoryMB>8192</availMemoryMB>
  <usedVirtualCores>0</usedVirtualCores>
  <availableVirtualCores>8</availableVirtualCores>
  <resourceUtilization>
    <nodePhysicalMemoryMB>1027</nodePhysicalMemoryMB>
    <nodeVirtualMemoryMB>1027</nodeVirtualMemoryMB>
    <nodeCPUUsage>0.006664445623755455</nodeCPUUsage>
    <aggregatedContainersPhysicalMemoryMB>0</aggregatedContainersPhysicalMemoryMB>
    <aggregatedContainersVirtualMemoryMB>0</aggregatedContainersVirtualMemoryMB>
    <containersCPUUsage>0.0</containersCPUUsage>
  </resourceUtilization>
</node>

If numContainers is 0, I assume the node is not running any containers. However, can it still hold data on disk that downstream tasks may need to read?
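The numContainers check can be sketched in Python roughly as follows. This is only an illustration: the `<nodes>` wrapper element, the second host, and the helper names `fetch_nodes_xml` and `idle_nodes` are assumptions made for the example, and a zero container count says nothing about data still sitting on the node's disk.

```python
import urllib.request
import xml.etree.ElementTree as ET

# Sample response, trimmed from the node record shown above. The <nodes>
# wrapper and host2 are assumptions added for illustration.
SAMPLE = """<nodes>
  <node>
    <state>RUNNING</state>
    <nodeHostName>host1.domain.com</nodeHostName>
    <numContainers>0</numContainers>
  </node>
  <node>
    <state>RUNNING</state>
    <nodeHostName>host2.domain.com</nodeHostName>
    <numContainers>3</numContainers>
  </node>
</nodes>"""


def fetch_nodes_xml(rm_address):
    """Fetch the cluster nodes listing as XML.

    rm_address is a hypothetical "host:port" of the ResourceManager web UI.
    """
    request = urllib.request.Request(
        f"http://{rm_address}/ws/v1/cluster/nodes",
        headers={"Accept": "application/xml"},
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


def idle_nodes(xml_text):
    """Return host names of RUNNING nodes that report zero containers."""
    root = ET.fromstring(xml_text)
    hosts = []
    for node in root.iter("node"):
        count = node.findtext("numContainers")
        if count is None:
            continue  # skip records without a container count
        if node.findtext("state") == "RUNNING" and int(count) == 0:
            hosts.append(node.findtext("nodeHostName"))
    return hosts


print(idle_nodes(SAMPLE))
```

Against a live cluster, `idle_nodes(fetch_nodes_xml("<rm http address:port>"))` would give the candidate list, which still has to be checked for leftover task output.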

I could not find out whether Spark exposes this. I assume that if a machine still stores data useful to the running job, it may maintain a heartbeat with the Spark driver or some central controller. Can we check this by scanning TCP or UDP connections?

Is there any other way to check whether a machine in a Spark cluster participates in a job?

asked Mar 8 at 22:40

Joe C


apache-spark hadoop autoscaling


1 Answer

I am not sure whether you just want to know if a node is running any task (if that is what you mean by 'participate') or whether you want to know if it is safe to remove a node from the Spark cluster.

I will try to explain the latter point.

Spark can recover from failures, and this also covers a node being removed from the cluster.
The removed node can be hosting an executor or an application master.

If the node hosting the application master is removed, the entire job fails. But if you are using YARN as the resource manager, the job is retried and YARN allocates a new application master. The maximum number of attempts is configured by:

yarn.resourcemanager.am.max-attempts

By default, this value is 2.
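To see which value a given cluster actually uses, one option is to read the property from the Hadoop configuration file. The sketch below assumes the usual `<configuration>/<property>` layout of `yarn-site.xml`; the sample content and the helper name `am_max_attempts` are made up for illustration, and the fallback of 2 is the default mentioned above.

```python
import xml.etree.ElementTree as ET

PROPERTY_NAME = "yarn.resourcemanager.am.max-attempts"
DEFAULT_AM_MAX_ATTEMPTS = 2  # default stated above

# Hypothetical yarn-site.xml content; a real file would be read from disk.
SAMPLE_YARN_SITE = """<configuration>
  <property>
    <name>yarn.resourcemanager.am.max-attempts</name>
    <value>4</value>
  </property>
</configuration>"""


def am_max_attempts(config_xml):
    """Return the configured attempt limit, or the default if it is unset."""
    root = ET.fromstring(config_xml)
    for prop in root.iter("property"):
        name = (prop.findtext("name") or "").strip()
        value = prop.findtext("value")
        if name == PROPERTY_NAME and value is not None:
            return int(value.strip())
    return DEFAULT_AM_MAX_ATTEMPTS


print(am_max_attempts(SAMPLE_YARN_SITE))
print(am_max_attempts("<configuration/>"))
```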

If a node on which a task is running is removed, the resource manager (part of YARN) stops receiving heartbeats from that node. The application master learns that it must reschedule the failed task because it no longer receives progress updates from that node. It then requests resources from the resource manager and reschedules the task.

As far as the data on these nodes is concerned, you need to understand how tasks and their output are handled. Every node has its own local storage, which holds the output of the tasks running on it. After a task completes successfully, the OutputCommitter moves its output from local storage to the job's shared storage (HDFS), from where the data is picked up for the next step of the job.
When a task fails (perhaps because the node running it failed or was removed), the task is rerun on another available node.

In fact, the application master will also rerun tasks that had already completed successfully on this node, because their output stored on the node's local storage is no longer available.
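The reasoning above can be illustrated with a toy model. This is not Spark or YARN code: the `Task` record, its `committed` flag, and `tasks_to_rerun` are hypothetical names, and the model simply assumes that a task must be rerun when it was placed on the removed node and its output has not yet been moved to shared storage.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    task_id: str
    node: str
    committed: bool  # True once the output has been moved to shared storage


def tasks_to_rerun(tasks, removed_node):
    """Return ids of tasks whose output would be lost with removed_node."""
    return [
        task.task_id
        for task in tasks
        if task.node == removed_node and not task.committed
    ]


tasks = [
    Task("t1", "host1", committed=True),   # output already in shared storage
    Task("t2", "host1", committed=False),  # output only on host1's local disk
    Task("t3", "host1", committed=False),  # still running on host1
    Task("t4", "host2", committed=False),  # unaffected, runs elsewhere
]

print(tasks_to_rerun(tasks, "host1"))
```

Under these assumptions, a node is cheap to remove only when this list is empty for it.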

answered Mar 10 at 16:44

Tej


My goal is to remove an idle node without introducing more retries, since retries slow a run down. It sounds like once the OutputCommitter has run, downstream nodes no longer need to read data from the node, so we can remove it. Can we tell whether a node might still run the OutputCommitter? While a node is running the OutputCommitter, does it show "<numContainers>0</numContainers>"? I am trying to work out when we know the node can be removed.

– Joe C
Mar 10 at 18:09
