Does integer encoding of strings, used as input to a decision tree (sklearn), make the splitting attributes discrete or continuous?


I have to use a decision tree classifier to classify certain data. However, the attribute values are strings, and according to https://datascience.stackexchange.com/questions/5226/strings-as-features-in-decision-tree-random-forest, strings cannot be used as input. Hence I used integer encoding for the strings.



In the question "Passing categorical data to Sklearn Decision Tree", I found out that passing integer-encoded data may give a wrong answer, since sklearn assumes an ordering among the values. So, the only way out is using the OneHotEncoder module.
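To make the ordering concern concrete, here is a minimal sketch. The integer encoding (low=0, med=1, high=2) and the labels are made up for illustration. It fits a tree on a single integer-encoded column and collects the split thresholds, which shows that the tree treats the codes as numeric values and splits with conditions of the form `feature <= threshold`.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical integer encoding of 'price': low=0, med=1, high=2.
X = np.array([[0], [1], [2], [0], [1], [2]])
# Hypothetical target: only 'med' belongs to class 1.
y = np.array([0, 1, 0, 0, 1, 0])

clf = DecisionTreeClassifier(random_state=0).fit(X, y)

# Internal nodes have a feature index >= 0; leaves use a negative marker.
is_split = clf.tree_.feature >= 0
thresholds = sorted(clf.tree_.threshold[is_split].tolist())
print(thresholds)
```

In this toy case the tree needs two threshold splits (between the codes 0 and 1, and between 1 and 2) to isolate 'med', because it cannot test "equals med" directly on an integer-encoded column.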



Using the OneHotEncoder module increases the number of features. For example, if there is an attribute 'price' with values ['high', 'med', 'low'], one-hot encoding replaces it with 3 attributes, which can be interpreted as ['price-high', 'price-med', 'price-low'], each taking the value 1 or 0 depending on the data. I don't want this, since I have to print the decision tree in a certain format that requires the original features (e.g. I need 'price').
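As a small sketch of the expansion described above, assuming a hypothetical 'price' column with four rows, `OneHotEncoder` turns the single column into one column per category:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical 'price' column using the three values from the example above.
price = np.array([['high'], ['med'], ['low'], ['high']])

enc = OneHotEncoder()
# fit_transform returns a sparse matrix by default, so convert it for display.
encoded = enc.fit_transform(price).toarray()

# The learned categories are stored in sorted order, one array per input column.
categories = enc.categories_[0].tolist()
print(encoded.shape)
print(categories)
```

The encoder keeps the category list per input column in `categories_`, so the link between each new column and the original attribute is not lost.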



Is there a way out of this?










      python scikit-learn decision-tree
















      asked Mar 8 at 11:26









      Sarthak Chakraborty























          1 Answer














          I think pd.get_dummies would be useful here, since it keeps the original feature name as a prefix of each one-hot column.

          Example:

          import pandas as pd

          df = pd.DataFrame({'price': ['high', 'medium', 'high', 'low'], 'some_feature': ['b', 'a', 'c', 'a']})
          # dtype=int keeps the 0/1 output shown below (newer pandas versions default to booleans)
          pd.get_dummies(df, columns=['price', 'some_feature'], dtype=int)

             price_high  price_low  price_medium  some_feature_a  some_feature_b  some_feature_c
          0           1          0             0               0               1               0
          1           0          0             1               1               0               0
          2           1          0             0               0               0               1
          3           0          1             0               1               0               0

          When you feed this dataframe to the decision tree, each split is named after both the original feature and the category it tests, which should make the tree easier to understand.
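A minimal sketch of that last step, under some assumptions: the `label` column is a hypothetical target added only for illustration, and the `origin` dictionary is a helper made up here to map each dummy column back to the feature it came from. The tree is printed with `sklearn.tree.export_text`.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy data from the example above; 'label' is a hypothetical target.
df = pd.DataFrame({
    'price': ['high', 'medium', 'high', 'low'],
    'some_feature': ['b', 'a', 'c', 'a'],
    'label': [0, 0, 0, 1],
})

cols = ['price', 'some_feature']
X = pd.get_dummies(df[cols], columns=cols, dtype=int)
y = df['label']

clf = DecisionTreeClassifier(random_state=0).fit(X, y)

# Splits are reported against dummy columns such as 'price_low'.
report = export_text(clf, feature_names=list(X.columns))
print(report)

# Hypothetical helper: map each dummy column back to its original feature.
origin = {f"{c}_{v}": c for c in cols for v in df[c].unique()}
used = {origin[X.columns[i]] for i in clf.tree_.feature if i >= 0}
print(used)
```

Building the mapping from the known column and category names avoids having to parse the dummy names, which would be ambiguous for features such as `some_feature` that already contain the separator.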






          answered Mar 9 at 12:38









          AI_Learning













          • Sure. That would convert the data to one-hot-encoded form. But the decision tree will be constructed on the new features (e.g. price_high, price_low, etc.). So when printing the decision tree, the features would not be "price" or "some_feature", but "price_high", "price_low", etc.

            – Sarthak Chakraborty
            Mar 10 at 16:42











          • Yes. Why do you want to see just price as the feature name when we have already created dummies for it? I think having it as price_high better explains how the split was made in the decision tree.

            – AI_Learning
            Mar 10 at 17:59
















