



How to know if a machine in a Spark cluster participates in a job


Related:

• Apache Spark: The number of cores vs. the number of executors
• Spark on yarn concept understanding
• Deploying Spark and Hadoop on different cluster/machines
• YARN REST API - Spark job submission
• How to run spark-shell with YARN in client mode?
• Hadoop Capacity Scheduler and Spark
• How to submit a spark job on a remote master node in yarn client mode?
• Standalone Spark cluster on Mesos accessing HDFS data in a different Hadoop cluster
• Spark failed to submit jobs to remote yarn cluster with java.lang.NumberFormatException
• Spark job reading from S3 on Spark cluster gives IllegalAccessError: tried to access method MutableCounterLong




























I want to know when it is safe to remove a node (machine) from a cluster.



My assumption is that it is safe to remove a machine if it is not running any containers and does not store any data that is still needed.



Using the ResourceManager REST API documented at https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html, we can issue



 GET http://<rm http address:port>/ws/v1/cluster/nodes


to get information about each node, like:



<node>
<rack>/default-rack</rack>
<state>RUNNING</state>
<id>host1.domain.com:54158</id>
<nodeHostName>host1.domain.com</nodeHostName>
<nodeHTTPAddress>host1.domain.com:8042</nodeHTTPAddress>
<lastHealthUpdate>1476995346399</lastHealthUpdate>
<version>3.0.0-SNAPSHOT</version>
<healthReport></healthReport>
<numContainers>0</numContainers>
<usedMemoryMB>0</usedMemoryMB>
<availMemoryMB>8192</availMemoryMB>
<usedVirtualCores>0</usedVirtualCores>
<availableVirtualCores>8</availableVirtualCores>
<resourceUtilization>
<nodePhysicalMemoryMB>1027</nodePhysicalMemoryMB>
<nodeVirtualMemoryMB>1027</nodeVirtualMemoryMB>
<nodeCPUUsage>0.006664445623755455</nodeCPUUsage>
<aggregatedContainersPhysicalMemoryMB>0</aggregatedContainersPhysicalMemoryMB>
<aggregatedContainersVirtualMemoryMB>0</aggregatedContainersVirtualMemoryMB>
<containersCPUUsage>0.0</containersCPUUsage>
</resourceUtilization>
</node>


If numContainers is 0, I assume the node is not running any containers. However, can it still store data on disk that downstream tasks will read?



I could not find whether Spark exposes this. I assume that if a machine still stores data useful for a running job, it maintains a heartbeat with the Spark driver or some central controller. Can we check this by scanning TCP or UDP connections?



Is there any other way to check whether a machine in a Spark cluster participates in a job?
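As a rough sketch (mine, not part of the original question): the nodes endpoint above can also return JSON, which makes the numContainers check easy to script. The ResourceManager URL is a placeholder for your cluster's RM, and note the caveat in the docstring — a zero container count says nothing about data still sitting on the node's disks.

```python
def idle_nodes(nodes_payload):
    """Return hostnames of RUNNING nodes that report zero containers.

    `nodes_payload` is the parsed JSON body of
    GET http://<rm>:8088/ws/v1/cluster/nodes (send an
    "Accept: application/json" header to get JSON instead of XML).

    Caveat: numContainers == 0 only means no containers are running at
    this instant; it says nothing about shuffle or cached data that may
    still sit on the node's local disks.
    """
    nodes = nodes_payload.get("nodes", {}).get("node", [])
    return [
        n["nodeHostName"]
        for n in nodes
        if n["state"] == "RUNNING" and n["numContainers"] == 0
    ]

# Fetching live data would look like this (requires cluster access):
#   import json, urllib.request
#   req = urllib.request.Request(
#       "http://rm.example.com:8088/ws/v1/cluster/nodes",
#       headers={"Accept": "application/json"})
#   print(idle_nodes(json.load(urllib.request.urlopen(req))))
```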










apache-spark hadoop autoscaling

asked Mar 8 at 22:40 by Joe C
1 Answer






















I am not sure whether you just want to know if a node is running any task (if that is what you mean by 'participate') or whether it is safe to remove a node from the Spark cluster.



I will try to explain the latter.



Spark has the ability to recover from failures, and this also applies to a node being removed from the cluster.
The removed node can host an executor or the application master.



1. If the application master is removed, the entire job fails. But if you are using YARN as the resource manager, the job is retried and YARN provides a new application master. The number of retries is configured in:


          yarn.resourcemanager.am.max-attempts




By default, this value is 2.
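For reference, that property lives in yarn-site.xml; the snippet below just restates the default. (On the Spark side, spark.yarn.maxAppAttempts, if set, should be no larger than this YARN limit.)

```xml
<property>
  <name>yarn.resourcemanager.am.max-attempts</name>
  <value>2</value>
</property>
```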



2. If a node on which a task is running is removed, the resource manager (YARN) will stop getting heartbeats from that node. The application master will know it needs to reschedule the failed tasks, since it will no longer receive progress updates from that node. It will then request resources from the resource manager and reschedule the tasks.

As far as data on these nodes is concerned, you need to understand how tasks and their output are handled. Every node has its own local storage for the output of the tasks running on it. After a task completes successfully, the OutputCommitter moves its output from local storage to the job's shared storage (HDFS), from where the data is picked up for the next stage of the job.
When a task fails (perhaps because the node running it failed or was removed), the task is rerun on another available node.



In fact, the application master will also rerun tasks that completed successfully on the removed node, because their output, stored on that node's local storage, will no longer be available.
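A complementary check from the Spark side (my sketch, not part of the answer): the driver exposes a monitoring REST API under /api/v1, so you can ask which hosts an application currently has executors on. If a host no longer appears in this set, Spark holds no executor (and hence no block-manager storage) there; output already committed to HDFS is unaffected either way. The driver URL and port below are the usual defaults, not guaranteed for your deployment.

```python
def executor_hosts(executors_payload):
    """Return the set of hosts holding executors for one application.

    `executors_payload` is the parsed JSON list from
    GET http://<driver>:4040/api/v1/applications/<app-id>/executors
    (Spark's monitoring REST API). Each entry carries a "hostPort"
    such as "host1.domain.com:35777"; the synthetic "driver" entry
    is excluded so only executor hosts are returned.
    """
    return {
        e["hostPort"].rsplit(":", 1)[0]
        for e in executors_payload
        if e.get("id") != "driver"
    }
```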





























• My goal is to remove an idle node without introducing more retries, since retries slow a run down. It sounds like after the OutputCommitter runs, downstream nodes no longer need to read data from the node, so we can remove it. How do we know whether a node could still run an OutputCommitter? While a node runs the OutputCommitter, does it show "<numContainers>0</numContainers>"? I am wondering how we know a node can be removed.

  – Joe C
  Mar 10 at 18:09
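A note beyond the answer above: if the goal is to drain a node rather than remove it abruptly, YARN's graceful decommissioning is the usual mechanism — the node stops accepting new containers and is removed only after running work finishes. A sketch, assuming a Hadoop release with graceful decommission support and yarn.resourcemanager.nodes.exclude-path pointing at the exclude file below (both paths are cluster-specific assumptions):

```shell
# Add the host to the ResourceManager's exclude file.
echo "host1.domain.com" >> /etc/hadoop/conf/yarn.exclude

# Re-read the node lists and gracefully decommission excluded hosts,
# waiting up to 3600 seconds for their containers to finish.
yarn rmadmin -refreshNodes -g 3600 -client
```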












answered Mar 10 at 16:44 by Tej











