Training and test accuracy plot shows strange behavior
I'm trying to build a decision tree classifier for a binary classification problem. My dataset was imbalanced (1 = 173, 0 = 354), so I used the resample approach to upsample the minority class and balance it; a sketch of that step is below. I then built a model using GridSearchCV.
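For context, a minimal sketch of the upsampling step with sklearn.utils.resample (the DataFrame and column names here are illustrative, not my exact code):

import pandas as pd
from sklearn.utils import resample

minority = df[df['y'] == 1]   # 173 rows
majority = df[df['y'] == 0]   # 354 rows

# Upsample the minority class with replacement to match the majority size
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=42)
df_balanced = pd.concat([majority, minority_up])

Here is the model-building code: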
from sklearn import model_selection
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import (confusion_matrix, classification_report,
                             roc_curve, auc, accuracy_score)

x = df_balanced["x"]
y = df_balanced['y']
X_train, X_test, Y_train, Y_test = model_selection.train_test_split(
    x, y, stratify=y, random_state=42, test_size=0.25)

pipeline = Pipeline([
    ('vectorizer', CountVectorizer(stop_words='english')),
    ('classifier', DecisionTreeClassifier(random_state=42))])

grid = {
    'vectorizer__ngram_range': [(1, 1), (1, 2)],
    'vectorizer__analyzer': ('word', 'char'),
    'classifier__max_depth': [15, 20, 25, 30]
}

grid_search = GridSearchCV(pipeline, param_grid=grid, scoring='accuracy', n_jobs=-1, cv=5)
grid_search.fit(X_train, Y_train)

print(grid_search.best_estimator_, "\n")
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(grid.keys()):
    print("\t{0}: {1}".format(param_name, best_parameters[param_name]))

best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
confusion = confusion_matrix(Y_test, y_pred)
report = classification_report(Y_test, y_pred)
# Note: roc_curve is given hard class predictions here, not probabilities
false_positive_rate, true_positive_rate, thresholds = roc_curve(Y_test, y_pred)
roc_auc = auc(false_positive_rate, true_positive_rate)

print("Confusion matrix\n", confusion, "\n")
print("Classification_report\n", report, "\n")
print("Train Accuracy", accuracy_score(Y_train, best_model.predict(X_train)))
print("Test Accuracy:", accuracy_score(Y_test, y_pred))
print("roc_auc_score", roc_auc)
And the output:
Confusion matrix
[[82 7]
[13 75]]
Classification_report
precision recall f1-score support
0 0.86 0.92 0.89 89
1 0.91 0.85 0.88 88
micro avg 0.89 0.89 0.89 177
macro avg 0.89 0.89 0.89 177
weighted avg 0.89 0.89 0.89 177
Train Accuracy 0.9510357815442562
Test Accuracy: 0.8870056497175142
roc_auc_score 0.8868105209397344
To check whether I have an overfitting problem, I calculated the train and test accuracy, and I don't think I'm overfitting too badly.
Then I plotted accuracy against the depth of the tree, since depth may lead to overfitting. The code:
import numpy as np
import matplotlib.pyplot as plt

# Set up arrays to store train and test accuracies
dep = np.arange(1, 50)
train_accuracy = np.empty(len(dep))
test_accuracy = np.empty(len(dep))

# Loop over different values of k (the tree depth)
for i, k in enumerate(dep):
    model = best_model.fit(X_train, Y_train)
    y_pred = model.predict(X_test)
    # Compute accuracy on the training set
    train_accuracy[i] = model.score(X_train, Y_train)
    # Compute accuracy on the testing set
    test_accuracy[i] = model.score(X_test, Y_test)

# Generate plot
plt.title('clf: Varying depth of tree')
plt.plot(dep, test_accuracy, label='Testing Accuracy')
plt.plot(dep, train_accuracy, label='Training Accuracy')
plt.legend()
plt.xlabel('Depth of tree')
plt.ylabel('Accuracy')
plt.show()
The plot is very strange and I cannot explain it. Any help, please?
python machine-learning scikit-learn decision-tree
asked Mar 8 at 17:35 by Muna Gazzai · edited Mar 8 at 17:56 by desertnaut
1 Answer
Looking closely at your for loop, you will realize that you always fit the exact same model; the line

model = best_model.fit(X_train, Y_train)

does not depend in any way on your k, and it does not affect the max_depth parameter as you actually intend it to. Consequently, all your (training & testing) accuracy values are the same, hence the "strange" straight lines (i.e. constant values).

What I guess you want is to get the performance metrics for the best parameters you found from your CV at different depths; but max_depth is already included in your best_parameters, so your methodology looks rather vague.
answered Mar 8 at 17:55 by desertnaut · edited Mar 8 at 18:18
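For reference, a minimal sketch of a loop that actually varies the depth, cloning the tuned pipeline and overriding only the classifier__max_depth parameter via sklearn's clone and set_params (variable names follow the question's code):

from sklearn.base import clone
import numpy as np

dep = np.arange(1, 50)
train_accuracy = np.empty(len(dep))
test_accuracy = np.empty(len(dep))

for i, k in enumerate(dep):
    # Fresh, unfitted copy of the tuned pipeline with only the tree depth overridden
    model = clone(best_model).set_params(classifier__max_depth=k)
    model.fit(X_train, Y_train)
    train_accuracy[i] = model.score(X_train, Y_train)
    test_accuracy[i] = model.score(X_test, Y_test)

With this, the two curves vary with depth and the plotting code from the question can be reused unchanged. Alternatively, the cross-validated score for each depth the grid search already tried can be read straight from grid_search.cv_results_:

import pandas as pd

# Mean CV accuracy for every parameter combination the grid search evaluated
res = pd.DataFrame(grid_search.cv_results_)
print(res[['param_classifier__max_depth', 'mean_test_score']])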