python - Low precision, recall, F1-score and accuracy for the minority class in a binary classification case with scikit-learn?
First of all, thanks for your help.
I'm developing an empirical study of dimensionality reduction methodologies for classification problems as my final degree project at university. For that purpose, I'm using a medical dataset in order to predict whether a patient has a disease or not (a binary-class case, 0 or 1).
My dataset is imbalanced, so I'm applying oversampling together with different dimensionality reduction algorithms. I'm comparing the performance obtained by the classification algorithms before and after processing the dataset and applying the dimensionality reduction algorithms. I'm interested in the classification report, but the minority class obtains a pretty bad score and I'm wondering why. How can I improve it if I'm doing something wrong?
This is the code:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.neural_network import MLPClassifier
from sklearn import cross_validation, metrics
from sklearn.metrics import f1_score, accuracy_score, confusion_matrix, classification_report
from imblearn.over_sampling import SMOTE
from skfeature.function.information_theoretical_based import MRMR
import numpy as np
import matplotlib.pyplot as plt

# prepare models
models = []
models.append(('DTC', DecisionTreeClassifier()))
models.append(('ETC', ExtraTreesClassifier()))
models.append(('LR', LogisticRegression()))
models.append(('LSVC', LinearSVC()))
models.append(('NN', MLPClassifier()))
models.append(('RFC', RandomForestClassifier()))

# evaluate each model in turn
n_samples, n_features = data.shape
num_fea = 10
resultsTotal = []
names = []
scoring = 'accuracy'
for name, model in models:
    kfold = cross_validation.KFold(n_samples, n_folds=10, shuffle=True, random_state=seed)
    results = []
    print results
    for train, test in kfold:
        # Obtain the indices of the selected features on the training set only
        # (numpy arrays have to be passed here). The key idea: cross-validation
        # estimates the generalisation performance of the whole model-building
        # process, so the whole process has to be repeated in each fold.
        # Otherwise you end up with a biased estimate, or an under-estimate of
        # the variance of the estimate (or both).
        idx = MRMR.mrmr(data[train], targetNar[train], n_selected_features=num_fea)
        # keep only the selected features
        features = data[:, idx[0:num_fea]]
        # apply oversampling on the training data only, not on the test data
        featuresOv, targetOv = SMOTE(kind='svm').fit_sample(features[train], targetNar[train])
        X_train, X_test = featuresOv, features[test]
        y_train, y_test = targetOv, targetNar[test]
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        results.append(accuracy_score(y_test, y_pred))
        print "f1-score: %f " % (f1_score(y_test, y_pred))
        print confusion_matrix(y_test, y_pred)
        fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred)
        print "AUC: %f " % (metrics.auc(fpr, tpr))
        print "accuracy: %f " % (accuracy_score(y_test, y_pred))
        report = classification_report(y_test, y_pred)
        print(report)
    # performing cross-validation on featuresOv here would be wrong; it would have to be modified or changed
    print results
    #cv_results = model_selection.cross_val_score(model, featuresOv, targetOv, cv=kfold, scoring=scoring)
    #print cv_results
    resultsTotal.append(np.asarray(results))
    names.append(name)
    msg = "%s: %f (%f)" % (name, np.asarray(results).mean(), np.asarray(results).std())
    print(msg)

# boxplot algorithm comparison
fig = plt.figure()
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
plt.boxplot(resultsTotal)
ax.set_xticklabels(names)
plt.show()
As you can see, I'm using k-fold cross-validation to evaluate the models, and inside the cross-validation loop I'm doing 3 steps: first applying the mRMR feature selection algorithm, then applying oversampling on the training part of the dataset only, and finally training and evaluating the model.
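In case it makes the intent clearer, here is roughly the same per-fold logic written with an imblearn Pipeline, which refits the feature selection and the oversampling inside every fold automatically. This is only a sketch, not my actual code: SelectKBest stands in for mRMR, a plain SMOTE replaces kind='svm', and the synthetic data is just a placeholder for my dataset.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import StratifiedKFold, cross_val_score
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline  # imblearn's Pipeline applies samplers during fit only

# placeholder imbalanced data (about 10% positives), standing in for the real dataset
X, y = make_classification(n_samples=1000, n_features=30, weights=[0.9, 0.1], random_state=0)

pipe = Pipeline([
    ('select', SelectKBest(f_classif, k=10)),        # feature selection, fitted on the training fold only
    ('smote', SMOTE(random_state=0)),                # oversampling, applied to the training fold only
    ('clf', RandomForestClassifier(random_state=0)),
])

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
# 'f1' scores the positive (minority) class directly
f1_per_fold = cross_val_score(pipe, X, y, cv=cv, scoring='f1')
print(np.round(f1_per_fold, 3))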
My results are these:
f1-score: 0.267148
[[570 166]
 [ 37  37]]
AUC: 0.637228
accuracy: 0.749383
             precision    recall  f1-score   support

          0       0.94      0.77      0.85       736
          1       0.18      0.50      0.27        74

avg / total       0.87      0.75      0.80       810

f1-score: 0.210145
[[563 203]
 [ 15  29]]
AUC: 0.697039
accuracy: 0.730864
             precision    recall  f1-score   support

          0       0.97      0.73      0.84       766
          1       0.12      0.66      0.21        44

avg / total       0.93      0.73      0.80       810

f1-score: 0.242678
[[600 159]
 [ 22  29]]
AUC: 0.679571
accuracy: 0.776543
             precision    recall  f1-score   support

          0       0.96      0.79      0.87       759
          1       0.15      0.57      0.24        51

avg / total       0.91      0.78      0.83       810

f1-score: 0.264151
[[534 203]
 [ 31  42]]
AUC: 0.649951
accuracy: 0.711111
             precision    recall  f1-score   support

          0       0.95      0.72      0.82       737
          1       0.17      0.58      0.26        73

avg / total       0.88      0.71      0.77       810
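For instance, the 0.18 precision of class 1 in the first fold follows directly from its confusion matrix, since the classifier produces far more false positives than true positives (a quick check with the numbers printed above):

# first fold: confusion matrix [[570 166], [37 37]]
tn, fp, fn, tp = 570, 166, 37, 37
precision_1 = tp / float(tp + fp)   # 37 / 203, about 0.18
recall_1 = tp / float(tp + fn)      # 37 / 74, i.e. 0.50
print(round(precision_1, 2))
print(round(recall_1, 2))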
As you can see, I obtain a bad result for the minority class and I don't know what else to do; I think I applied the right steps for this kind of classification.
Thanks!