Does integer encoding of strings, used as input to a decision tree (sklearn), make the splitting attributes discrete or continuous?
I have to use a decision tree classifier to classify certain data. However, the attribute values are strings, and as I found here: https://datascience.stackexchange.com/questions/5226/strings-as-features-in-decision-tree-random-forest, strings cannot be used as input. Hence I used integer encoding for the strings.
In this article, Passing categorical data to Sklearn Decision Tree, I found that passing integer-encoded data may give a wrong answer, since sklearn assumes an ordering among the values. So the only way out seems to be using OneHotEncoder.
Using OneHotEncoder increases the number of features. For example, if there is an attribute 'price' with values ['high', 'med', 'low'], one-hot encoding replaces it with 3 attributes, which can be interpreted as ['price-high', 'price-med', 'price-low'], each taking the value 1 or 0 depending on the data. I don't want this, since I have to print the decision tree in a certain format that requires the original features (e.g. I need 'price').
Is there a way out of this?
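To make the ordering concern concrete, here is a minimal sketch. The mapping low=0, med=1, high=2 and the toy labels are assumptions invented for illustration, not data from my real problem. As far as I understand, sklearn's tree treats every column as numeric and splits on thresholds, so an integer-encoded category behaves like a continuous, ordered value:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical integer encoding of 'price': low=0, med=1, high=2.
X = np.array([[0], [1], [2], [0], [1], [2]])
# Toy labels: only 'med' belongs to class 1, so no single threshold isolates it.
y = np.array([0, 1, 0, 0, 1, 0])

tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# Collect the thresholds of the internal (non-leaf) nodes; leaves have feature < 0.
thresholds = sorted(
    t for f, t in zip(tree.tree_.feature, tree.tree_.threshold) if f >= 0
)
print(thresholds)
print(export_text(tree, feature_names=["price"]))
```

The splits come out as numeric comparisons on 'price' (midpoints between the encoded values), and isolating the middle category takes two of them, which is the ordering effect described above.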
python scikit-learn decision-tree
asked Mar 8 at 11:26
Sarthak Chakraborty
1 Answer
I think pd.get_dummies would be useful here, since it keeps the original feature name as the prefix of every one-hot column it creates.
Example:
    import pandas as pd

    df = pd.DataFrame({'price': ['high', 'medium', 'high', 'low'],
                       'some_feature': ['b', 'a', 'c', 'a']})
    # dtype=int keeps the 0/1 output shown below; newer pandas versions default to booleans
    pd.get_dummies(df, columns=['price', 'some_feature'], dtype=int)

       price_high  price_low  price_medium  some_feature_a  some_feature_b  some_feature_c
    0           1          0             0               0               1               0
    1           0          0             1               1               0               0
    2           1          0             0               0               0               1
    3           0          1             0               1               0               0
When you feed this dataframe to the decision tree, each split is labelled with both the original feature name and the category it tests, which should make the tree easier to understand.
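As a rough sketch of that last step, the dummies can be passed straight to DecisionTreeClassifier and printed with export_text. The target labels y below are invented for illustration, since the question does not show any:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

df = pd.DataFrame({'price': ['high', 'medium', 'high', 'low'],
                   'some_feature': ['b', 'a', 'c', 'a']})
# Hypothetical labels: only the 'low' price row is class 1.
y = [0, 0, 0, 1]

# One-hot encode; column names keep the original feature as a prefix.
X = pd.get_dummies(df, columns=['price', 'some_feature'], dtype=int)

clf = DecisionTreeClassifier(random_state=0).fit(X, y)

# Print the fitted tree using the dummy column names as feature names.
report = export_text(clf, feature_names=list(X.columns))
print(report)
```

With these toy labels the only column that separates the classes is price_low, so the printed tree has a single split on it.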
Sure. That would convert the data to one-hot-encoded form. But, the Decision Tree will be constructed on the new features (e.g. price_high, price_low, etc). So while printing the Decision Tree, the features would not be "price" or "some_feature", but "price_high", "price_low", etc.
– Sarthak Chakraborty
Mar 10 at 16:42
Yes. Why do you want to see just price as the feature name when we have already created dummies for it? I think having it as price_high explains more about how the split was made in the decision tree.
– AI_Learning
Mar 10 at 17:59
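If the printed tree really has to name the original feature, one possible workaround is to walk the fitted tree and turn each dummy column back into a "feature == value" test. This is only a sketch under assumptions: the helper describe is hypothetical (not part of sklearn), the labels y are invented, and it relies on using '=' as prefix_sep with no '=' inside the original feature names:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

df = pd.DataFrame({'price': ['high', 'medium', 'high', 'low'],
                   'some_feature': ['b', 'a', 'c', 'a']})
y = [0, 0, 0, 1]  # hypothetical labels, invented for this sketch

# Use '=' as separator so a column such as 'price=low' can be split back apart.
X = pd.get_dummies(df, columns=['price', 'some_feature'],
                   prefix_sep='=', dtype=int)
clf = DecisionTreeClassifier(random_state=0).fit(X, y)


def describe(model, columns, node=0, depth=0):
    """Return text lines for the tree, showing dummy splits as tests on the original feature."""
    t = model.tree_
    pad = "    " * depth
    if t.children_left[node] == -1:  # -1 marks a leaf node
        label = model.classes_[t.value[node][0].argmax()]
        return [f"{pad}class: {label}"]
    feature, value = columns[t.feature[node]].split('=', 1)
    # Left branch: dummy <= 0.5, i.e. the row does NOT have this category.
    lines = [f"{pad}{feature} != {value}"]
    lines += describe(model, columns, t.children_left[node], depth + 1)
    # Right branch: dummy > 0.5, i.e. the row has this category.
    lines.append(f"{pad}{feature} == {value}")
    lines += describe(model, columns, t.children_right[node], depth + 1)
    return lines


lines = describe(clf, list(X.columns))
print("\n".join(lines))
```

The tree is still built on the one-hot columns; only the printout is rewritten, so every test reads in terms of 'price' or 'some_feature' and one of its categories.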
answered Mar 9 at 12:38
AI_Learning