Training and test accuracy plot shows strange behavior



I'm trying to build a decision tree classifier for a binary classification problem. My dataset was unbalanced (173 samples of class 1 and 354 of class 0), so I used the resample approach to upsample the minority class and balance the classes.
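For context, the balancing step is not shown in the code below; a minimal sketch of how such upsampling can be done with sklearn.utils.resample (the DataFrame name df and its column layout are illustrative assumptions, not taken from my actual code):

    from sklearn.utils import resample
    import pandas as pd

    # Illustrative: df holds the raw, unbalanced data with label column 'y'
    df_majority = df[df['y'] == 0]   # 354 samples
    df_minority = df[df['y'] == 1]   # 173 samples

    # Upsample the minority class with replacement to the majority size
    df_minority_upsampled = resample(df_minority, replace=True,
                                     n_samples=len(df_majority),
                                     random_state=42)

    df_balanced = pd.concat([df_majority, df_minority_upsampled])

I built the model using GridSearchCV; here is my code: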



x=df_balanced["x"]
y=df_balanced['y']

X_train, X_test, Y_train, Y_test = model_selection.train_test_split( x, y, stratify=y, random_state=42,test_size=0.25)

pipeline = Pipeline([
('vectorizer',CountVectorizer(stop_words='english')),
('classifier', DecisionTreeClassifier(random_state=42))])


grid =
'vectorizer__ngram_range': [(1, 1), (1, 2)],
'vectorizer__analyzer':('word', 'char'),
'classifier__max_depth':[15,20,25,30]


grid_search = GridSearchCV(pipeline, param_grid=grid, scoring='accuracy', n_jobs=-1, cv=5)
grid_search.fit(X_train,Y_train)
print(grid_search.best_estimator_,"n")

best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(list(grid.keys())):
print("t0: 1".format(param_name, best_parameters[param_name]))
best_model = grid_search.best_estimator_
y_pred=best_model.predict(X_test)

confusion=confusion_matrix(Y_test, y_pred)
report=classification_report(Y_test,y_pred)
false_positive_rate, true_positive_rate, thresholds = roc_curve(Y_test, y_pred)
roc_auc = auc(false_positive_rate, true_positive_rate)
print("Confusion matrix n",confusion,"n")
print("Classification_report n ",report,"n")
print("Train Accuracy",accuracy_score(Y_train, best_model.predict(X_train)))
print("Test Accuracy:",accuracy_score(Y_test,y_pred))
print("roc_auc_score",roc_auc)


Here is the output:



    Confusion matrix
    [[82  7]
     [13 75]]

    Classification_report
                  precision    recall  f1-score   support

               0       0.86      0.92      0.89        89
               1       0.91      0.85      0.88        88

       micro avg       0.89      0.89      0.89       177
       macro avg       0.89      0.89      0.89       177
    weighted avg       0.89      0.89      0.89       177

    Train Accuracy: 0.9510357815442562
    Test Accuracy: 0.8870056497175142
    roc_auc_score: 0.8868105209397344


To check for overfitting, I computed both the train and test accuracy, and I don't think the model is overfitting too badly.



I then plotted accuracy against tree depth, since excessive depth can lead to overfitting. Here is the code:



    # Set up arrays to store train and test accuracies
    dep = np.arange(1, 50)
    train_accuracy = np.empty(len(dep))
    test_accuracy = np.empty(len(dep))

    # Loop over different values of k
    for i, k in enumerate(dep):

        model = best_model.fit(X_train, Y_train)
        y_pred = model.predict(X_test)

        # Compute accuracy on the training set
        train_accuracy[i] = model.score(X_train, Y_train)

        # Compute accuracy on the testing set
        test_accuracy[i] = model.score(X_test, Y_test)

    # Generate plot
    plt.title('clf: Varying depth of tree')
    plt.plot(dep, test_accuracy, label='Testing Accuracy')
    plt.plot(dep, train_accuracy, label='Training Accuracy')
    plt.legend()
    plt.xlabel('Depth of tree')
    plt.ylabel('Accuracy')
    plt.show()


The plot is very strange and I cannot explain it.



[Plot: the training and testing accuracy curves are flat horizontal lines across all tree depths]



Any help would be appreciated.
Tags: python machine-learning scikit-learn decision-tree






asked Mar 8 at 17:35 by Muna Gazzai (edited Mar 8 at 17:56 by desertnaut)
1 Answer
Looking closely at your for loop, you will realize that you always fit exactly the same model; the following line:

    model = best_model.fit(X_train, Y_train)

does not depend in any way on your k, and it does not affect the max_depth parameter at all, as you actually intend it to.

Consequently, all the values of your (training & testing) accuracy are the same, hence the "strange" straight lines (i.e. constant values).

What I guess you want is to get the performance metrics for the best parameters you have found from your CV at different depths; but the issue here is that max_depth is already included in your best_parameters, so your methodology looks rather vague...
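If you do want to see how accuracy varies with depth while keeping the other CV-selected parameters, a minimal sketch would be to re-fit a fresh clone of the best pipeline at each depth, overriding only classifier__max_depth (clone and set_params are standard scikit-learn utilities; dep, train_accuracy, etc. are as in your code):

    from sklearn.base import clone

    for i, k in enumerate(dep):
        # Fresh, unfitted copy of the best pipeline with the depth overridden
        model = clone(best_model).set_params(classifier__max_depth=k)
        model.fit(X_train, Y_train)

        train_accuracy[i] = model.score(X_train, Y_train)
        test_accuracy[i] = model.score(X_test, Y_test)

With max_depth actually changing at each iteration, the two curves should no longer be constant.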






answered Mar 8 at 17:55 by desertnaut (edited Mar 8 at 18:18)