Training and test accuracy plot shows strange behavior



I'm trying to build a decision tree classifier for a binary classification problem. My dataset was unbalanced (173 samples of class 1 and 354 of class 0), so I used the resample approach to upsample the minority class and balance the classes.
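For context, the balancing step is not shown in the code below; a minimal sketch of how such upsampling can be done with sklearn.utils.resample (the DataFrame name df and its column layout are illustrative assumptions, not taken from my actual code):

    from sklearn.utils import resample
    import pandas as pd

    # Illustrative: df holds the raw, unbalanced data with label column 'y'
    df_majority = df[df['y'] == 0]   # 354 samples
    df_minority = df[df['y'] == 1]   # 173 samples

    # Upsample the minority class with replacement to the majority size
    df_minority_upsampled = resample(df_minority, replace=True,
                                     n_samples=len(df_majority),
                                     random_state=42)

    df_balanced = pd.concat([df_majority, df_minority_upsampled])

I built the model using GridSearchCV; here is my code: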



x=df_balanced["x"]
y=df_balanced['y']

X_train, X_test, Y_train, Y_test = model_selection.train_test_split( x, y, stratify=y, random_state=42,test_size=0.25)

pipeline = Pipeline([
('vectorizer',CountVectorizer(stop_words='english')),
('classifier', DecisionTreeClassifier(random_state=42))])


grid =
'vectorizer__ngram_range': [(1, 1), (1, 2)],
'vectorizer__analyzer':('word', 'char'),
'classifier__max_depth':[15,20,25,30]


grid_search = GridSearchCV(pipeline, param_grid=grid, scoring='accuracy', n_jobs=-1, cv=5)
grid_search.fit(X_train,Y_train)
print(grid_search.best_estimator_,"n")

best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(list(grid.keys())):
print("t0: 1".format(param_name, best_parameters[param_name]))
best_model = grid_search.best_estimator_
y_pred=best_model.predict(X_test)

confusion=confusion_matrix(Y_test, y_pred)
report=classification_report(Y_test,y_pred)
false_positive_rate, true_positive_rate, thresholds = roc_curve(Y_test, y_pred)
roc_auc = auc(false_positive_rate, true_positive_rate)
print("Confusion matrix n",confusion,"n")
print("Classification_report n ",report,"n")
print("Train Accuracy",accuracy_score(Y_train, best_model.predict(X_train)))
print("Test Accuracy:",accuracy_score(Y_test,y_pred))
print("roc_auc_score",roc_auc)


Here is the output:



    Confusion matrix
    [[82  7]
     [13 75]]

    Classification_report
                  precision    recall  f1-score   support

               0       0.86      0.92      0.89        89
               1       0.91      0.85      0.88        88

       micro avg       0.89      0.89      0.89       177
       macro avg       0.89      0.89      0.89       177
    weighted avg       0.89      0.89      0.89       177

    Train Accuracy: 0.9510357815442562
    Test Accuracy: 0.8870056497175142
    roc_auc_score: 0.8868105209397344


To check for overfitting, I computed both the train and test accuracy, and I don't think the model is overfitting too badly.



I then plotted accuracy against tree depth, since excessive depth can lead to overfitting. Here is the code:



    # Set up arrays to store train and test accuracies
    dep = np.arange(1, 50)
    train_accuracy = np.empty(len(dep))
    test_accuracy = np.empty(len(dep))

    # Loop over different values of k
    for i, k in enumerate(dep):

        model = best_model.fit(X_train, Y_train)
        y_pred = model.predict(X_test)

        # Compute accuracy on the training set
        train_accuracy[i] = model.score(X_train, Y_train)

        # Compute accuracy on the testing set
        test_accuracy[i] = model.score(X_test, Y_test)

    # Generate plot
    plt.title('clf: Varying depth of tree')
    plt.plot(dep, test_accuracy, label='Testing Accuracy')
    plt.plot(dep, train_accuracy, label='Training Accuracy')
    plt.legend()
    plt.xlabel('Depth of tree')
    plt.ylabel('Accuracy')
    plt.show()


The plot is very strange and I cannot explain it.



[Plot: the training and testing accuracy curves are flat horizontal lines across all tree depths]



Any help would be appreciated.
Tags: python machine-learning scikit-learn decision-tree






asked Mar 8 at 17:35 by Muna Gazzai (edited Mar 8 at 17:56 by desertnaut)
1 Answer
Looking closely at your for loop, you will realize that you always fit exactly the same model; the following line:

    model = best_model.fit(X_train, Y_train)

does not depend in any way on your k, and it does not affect the max_depth parameter at all, as you actually intend it to.

Consequently, all the values of your (training & testing) accuracy are the same, hence the "strange" straight lines (i.e. constant values).

What I guess you want is to get the performance metrics for the best parameters you have found from your CV at different depths; but the issue here is that max_depth is already included in your best_parameters, so your methodology looks rather vague...
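If you do want to see how accuracy varies with depth while keeping the other CV-selected parameters, a minimal sketch would be to re-fit a fresh clone of the best pipeline at each depth, overriding only classifier__max_depth (clone and set_params are standard scikit-learn utilities; dep, train_accuracy, etc. are as in your code):

    from sklearn.base import clone

    for i, k in enumerate(dep):
        # Fresh, unfitted copy of the best pipeline with the depth overridden
        model = clone(best_model).set_params(classifier__max_depth=k)
        model.fit(X_train, Y_train)

        train_accuracy[i] = model.score(X_train, Y_train)
        test_accuracy[i] = model.score(X_test, Y_test)

With max_depth actually changing at each iteration, the two curves should no longer be constant.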






answered Mar 8 at 17:55 by desertnaut (edited Mar 8 at 18:18)