Voting is an ensemble method that helps when the individual models are complex and easy to overfit. We can take a majority vote over, or average, the predictions of several different models. It is not necessary to include every model, only those estimators that give the best results. For hyperparameter optimization I applied GridSearchCV, and for cross-validation I used StratifiedKFold. The best estimators are stored in a list, and VotingClassifier is then applied to combine them into the final ensemble.
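To make "vote or average" concrete before any tuning: hard voting takes the majority of the predicted labels, while soft voting averages the predicted probabilities, so the two can disagree. A toy sketch with made-up probabilities (not this project's data):

import numpy as np

# Toy predicted probabilities for class 1 from three models on one sample.
probs = np.array([0.9, 0.4, 0.45])
labels = (probs >= 0.5).astype(int)        # each model's hard prediction: [1, 0, 0]

hard_vote = np.bincount(labels).argmax()   # majority label -> 0
soft_vote = int(probs.mean() >= 0.5)       # averaged probability ~0.583 -> 1
print(hard_vote, soft_vote)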
Import necessary packages
import numpy as np
import pandas as pd
import seaborn as sns

from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import BernoulliNB
Here is the code to find the best hyperparameters for each model.
Set the parameter grids for GridSearchCV
random_state = 42
classifiers = [DecisionTreeClassifier(random_state=random_state),
               SVC(random_state=random_state),
               RandomForestClassifier(random_state=random_state),
               LogisticRegression(random_state=random_state),
               KNeighborsClassifier(),
               BernoulliNB()]
dt_param_grid = {"min_samples_split": range(10, 500, 20),
                 "max_depth": range(1, 30, 2)}
svc_param_grid = {"kernel": ["rbf"],
                  "gamma": [0.001, 0.01, 0.1, 1],
                  "C": [1, 10, 50, 100, 200, 300, 1000]}
rf_param_grid = {"max_features": [1, 3, 5, 10],
                 "min_samples_split": [2, 3, 10],
                 "min_samples_leaf": [1, 3, 10],
                 "bootstrap": [False, True],
                 "n_estimators": [100, 300],
                 "criterion": ["gini"]}
# liblinear supports both l1 and l2 penalties; the default lbfgs solver
# would fail on l1.
logreg_param_grid = {"C": np.logspace(-3, 3, 7),
                     "penalty": ["l1", "l2"],
                     "solver": ["liblinear"]}
knn_param_grid = {"n_neighbors": np.linspace(1, 30, 10, dtype=int).tolist(),
                  "weights": ["uniform", "distance"],
                  "metric": ["euclidean", "manhattan"]}
# alpha is a BernoulliNB smoothing parameter (GaussianNB has no alpha),
# so the grid is named accordingly.
bernoulli_nb_param_grid = {"alpha": [0.3, 0.5, 0.7, 1.0]}
classifier_param = [dt_param_grid,
                    svc_param_grid,
                    rf_param_grid,
                    logreg_param_grid,
                    knn_param_grid,
                    bernoulli_nb_param_grid]
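Some of these grids multiply out to hundreds of candidate settings, so it can be worth counting them before launching the search. sklearn's ParameterGrid gives the count directly; a quick optional sanity check:

from sklearn.model_selection import ParameterGrid

# Each entry prints the number of hyperparameter combinations to be tried.
for grid in classifier_param:
    print(len(ParameterGrid(grid)))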
Perform GridSearchCV
cv_result = []
best_estimators = []
for clf_base, param_grid in zip(classifiers, classifier_param):
    clf = GridSearchCV(clf_base, param_grid=param_grid,
                       cv=StratifiedKFold(n_splits=10),
                       scoring="accuracy", n_jobs=-1, verbose=1)
    clf.fit(x_train, y_train)
    cv_result.append(clf.best_score_)
    best_estimators.append(clf.best_estimator_)
    print(clf.best_score_)
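If you also want to see which settings won for each model, every scikit-learn estimator exposes get_params(), so the refit estimators kept above can be inspected directly (a small optional check):

# Optional: print the tuned settings of each selected estimator.
for est in best_estimators:
    print(type(est).__name__, est.get_params())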
Visualize the result
cv_results = pd.DataFrame({"Cross Validation Means": cv_result,
                           "ML Models": ["DecisionTreeClassifier",
                                         "SVC",
                                         "RandomForestClassifier",
                                         "LogisticRegression",
                                         "KNeighborsClassifier",
                                         "BernoulliNB"]})
g = sns.barplot(x="Cross Validation Means", y="ML Models", data=cv_results)
g.set_xlabel("Mean Accuracy")
g.set_title("Cross Validation Scores")
Training and Prediction
# Soft voting averages the predicted class probabilities, so every
# estimator in the ensemble must implement predict_proba (the decision
# tree, random forest, and BernoulliNB all do).
votingC = VotingClassifier(estimators=[("dt", best_estimators[0]),
                                       ("rfc", best_estimators[2]),
                                       ("bb", best_estimators[5])],
                           voting="soft", n_jobs=-1)
votingC = votingC.fit(x_train, y_train)
print(accuracy_score(y_test, votingC.predict(x_test)))
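For comparison, voting="hard" takes a majority vote over the predicted labels instead of averaging probabilities; a sketch reusing the same tuned estimators:

# Hard voting: majority vote over predicted class labels.
votingC_hard = VotingClassifier(estimators=[("dt", best_estimators[0]),
                                            ("rfc", best_estimators[2]),
                                            ("bb", best_estimators[5])],
                                voting="hard", n_jobs=-1)
votingC_hard.fit(x_train, y_train)
print(accuracy_score(y_test, votingC_hard.predict(x_test)))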
Making Predictions
test["Survived"] = votingC.predict(test_new[features_use])