[프로젝트로 배우는 데이터사이언스] pima_classification_baseline_03


Dataset source

Data composition

  • Pregnancies : number of pregnancies
  • Glucose : plasma glucose concentration at 2 hours in an oral glucose tolerance test
  • BloodPressure : diastolic blood pressure (mm Hg)
  • SkinThickness : triceps skin fold thickness (mm), used to estimate body fat
  • Insulin : 2-hour serum insulin (mu U/ml)
  • BMI : body mass index (weight in kg / (height in m)^2)
  • DiabetesPedigreeFunction : diabetes pedigree function, a score of diabetes likelihood based on family history
  • Age : age (years)
  • Outcome : class variable (0 or 1); 268 of the 768 records are 1, the rest are 0

Load the required libraries

In [ ]:
# Load pandas for data analysis and numpy for numerical computation,
# plus seaborn and matplotlib.pyplot for visualization.

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline

Load the dataset

In [ ]:
df = pd.read_csv("data/diabetes_feature.csv")
df.shape
Out[ ]:
(768, 16)
In [ ]:
# Preview the dataset.

df.head()
Out[ ]:
   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  DiabetesPedigreeFunction  Age  Outcome  Pregnancies_high  Age_low  Age_middle  Age_high  Insulin_nan  Insulin_log  low_glu_insulin
0            6      148             72             35        0  33.6                     0.627   50        1             False    False        True     False        169.5     5.138735            False
1            1       85             66             29        0  26.6                     0.351   31        0             False    False        True     False        102.5     4.639572             True
2            8      183             64              0        0  23.3                     0.672   32        1              True    False        True     False        169.5     5.138735            False
3            1       89             66             23       94  28.1                     0.167   21        0             False     True       False     False         94.0     4.553877             True
4            0      137             40             35      168  43.1                     2.288   33        1             False    False        True     False        168.0     5.129899            False
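
The Outcome split described in the data composition above (268 positives out of 768) is easy to confirm from the loaded frame; a quick check like the following shows the class imbalance:

In [ ]:
# Check the class balance of the target; 500 of the 768 records are 0.
df["Outcome"].value_counts()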

Build the datasets for training and prediction

In [ ]:
df.columns
Out[ ]:
Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
       'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome', 'Pregnancies_high',
       'Age_low', 'Age_middle', 'Age_high', 'Insulin_nan', 'Insulin_log',
       'low_glu_insulin'],
      dtype='object')
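
The seven columns after Outcome (Pregnancies_high, the three Age bands, Insulin_nan, Insulin_log, low_glu_insulin) were engineered in the earlier notebooks of this series. Below is a minimal sketch of how columns like these could be derived; the exact thresholds are assumptions, not necessarily the values used upstream:

In [ ]:
# Illustrative reconstruction of the derived columns; thresholds are assumptions.
df["Pregnancies_high"] = df["Pregnancies"] > 6
df["Age_low"] = df["Age"] < 25
df["Age_middle"] = (df["Age"] >= 25) & (df["Age"] <= 60)
df["Age_high"] = df["Age"] > 60
# Treat Insulin == 0 as missing and impute with the per-Outcome median.
df["Insulin_nan"] = df["Insulin"].replace(0, np.nan)
df["Insulin_nan"] = df["Insulin_nan"].fillna(
    df.groupby("Outcome")["Insulin_nan"].transform("median"))
# Log-transform to reduce the right skew of insulin values.
df["Insulin_log"] = np.log(df["Insulin_nan"] + 1)
# Flag records where both glucose and insulin are on the low side.
df["low_glu_insulin"] = (df["Glucose"] <= 100) & (df["Insulin_nan"] <= 102.5)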
In [ ]:
X = df[['Glucose', 'BloodPressure', 'SkinThickness',
       'BMI', 'DiabetesPedigreeFunction', 'Age', 'Pregnancies_high',
       'Insulin_nan', 'low_glu_insulin']]
X.shape
Out[ ]:
(768, 9)
In [ ]:
y = df['Outcome']
y.shape
Out[ ]:
(768,)
In [ ]:
# Split the data with train_test_split from scikit-learn's model_selection.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
In [ ]:
# Check the number of samples in the train set's features and labels.

X_train.shape, y_train.shape
Out[ ]:
((614, 9), (614,))
In [ ]:
# Check the number of samples in the test set's features and labels.

X_test.shape, y_test.shape
Out[ ]:
((154, 9), (154,))
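
One note on the split: since only 268 of the 768 records are positive, a plain random split can leave the train and test sets with slightly different class ratios. Passing stratify=y preserves the ratio in both sets, as in the sketch below; the figures in the rest of this notebook, however, come from the unstratified split above.

In [ ]:
# Variant of the split that keeps the Outcome ratio identical
# in the train and test sets (separate names to avoid clobbering).
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)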

Using a machine learning algorithm

In [ ]:
# Import the DecisionTree classifier.

from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier(max_depth=11, random_state=42)
model
Out[ ]:
DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=11, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=42, splitter='best')

Finding the optimal max_depth value

In [ ]:
from sklearn.metrics import accuracy_score

for max_depth in range(3, 12):
    model = DecisionTreeClassifier(max_depth=max_depth, random_state=42)
    y_predict = model.fit(X_train, y_train).predict(X_test)
    score = accuracy_score(y_test, y_predict) * 100
    print(max_depth, score)
3 85.06493506493507
4 87.66233766233766
5 85.71428571428571
6 81.81818181818183
7 81.81818181818183
8 81.81818181818183
9 83.76623376623377
10 79.22077922077922
11 81.81818181818183
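
Test accuracy peaks at max_depth=4 and tends to fall as the tree grows deeper, the usual overfitting pattern. Plotting the loop's results makes the trend easier to read; a minimal sketch reusing the loop above:

In [ ]:
# Re-run the depth sweep, collect the scores, and plot accuracy vs depth.
depths = range(3, 12)
scores = []
for d in depths:
    model = DecisionTreeClassifier(max_depth=d, random_state=42)
    y_predict = model.fit(X_train, y_train).predict(X_test)
    scores.append(accuracy_score(y_test, y_predict) * 100)

plt.plot(depths, scores, marker="o")
plt.xlabel("max_depth")
plt.ylabel("test accuracy (%)")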
In [ ]:
from sklearn.model_selection import GridSearchCV


model = DecisionTreeClassifier(random_state=42)
param_grid = {"max_depth": range(3, 12),
              "max_features": [0.3, 0.5, 0.7, 0.9, 1]}
clf = GridSearchCV(model, param_grid=param_grid, n_jobs=-1, cv=5, verbose=2)
clf.fit(X_train, y_train)
Fitting 5 folds for each of 45 candidates, totalling 225 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    3.5s
[Parallel(n_jobs=-1)]: Done 225 out of 225 | elapsed:    4.0s finished
Out[ ]:
GridSearchCV(cv=5, error_score=nan,
             estimator=DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features=None,
                                              max_leaf_nodes=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              presort='deprecated',
                                              random_state=42,
                                              splitter='best'),
             iid='deprecated', n_jobs=-1,
             param_grid={'max_depth': range(3, 12),
                         'max_features': [0.3, 0.5, 0.7, 0.9, 1]},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=2)
In [ ]:
clf.best_params_
Out[ ]:
{'max_depth': 5, 'max_features': 0.7}
In [ ]:
clf.best_estimator_
Out[ ]:
DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=5, max_features=0.7, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=42, splitter='best')
In [ ]:
clf.best_score_
Out[ ]:
0.8664934026389444
In [ ]:
pd.DataFrame(clf.cv_results_).sort_values(by="rank_test_score").head()
Out[ ]:
    mean_fit_time  std_fit_time  mean_score_time  std_score_time param_max_depth param_max_features                                 params  split0_test_score  split1_test_score  split2_test_score  split3_test_score  split4_test_score  mean_test_score  std_test_score  rank_test_score
12       0.006471      0.000485         0.002686        0.000658               5                0.7  {'max_depth': 5, 'max_features': 0.7}           0.878049           0.910569           0.813008           0.837398           0.893443         0.866493        0.036082                1
7        0.006573      0.000491         0.002635        0.000312               4                0.7  {'max_depth': 4, 'max_features': 0.7}           0.813008           0.886179           0.829268           0.861789           0.918033         0.861655        0.037935                2
8        0.006613      0.000522         0.002681        0.000637               4                0.9  {'max_depth': 4, 'max_features': 0.9}           0.821138           0.886179           0.853659           0.853659           0.893443         0.861615        0.026005                3
18       0.006644      0.000947         0.002781        0.000535               6                0.9  {'max_depth': 6, 'max_features': 0.9}           0.829268           0.894309           0.821138           0.878049           0.877049         0.859963        0.029149                4
27       0.006094      0.000273         0.002468        0.000401               8                0.7  {'max_depth': 8, 'max_features': 0.7}           0.861789           0.878049           0.837398           0.853659           0.860656         0.858310        0.013162                5
In [ ]:
clf.predict(X_test)
Out[ ]:
array([1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1,
       0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0,
       0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1,
       0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1,
       0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0])
In [ ]:
clf.score(X_test, y_test)
Out[ ]:
0.8701298701298701
In [ ]:
model
Out[ ]:
DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=None, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=42, splitter='best')
In [ ]:
max_depth = np.random.randint(3, 20, 10)
max_depth
Out[ ]:
array([ 3,  5,  8,  5,  5, 13,  5, 16, 18, 13])
In [ ]:
max_features = np.random.uniform(0.7, 1.0, 100)
In [ ]:
param_distributions = {"max_depth" :max_depth,
                       "max_features": max_features,
                       "min_samples_split" : list(range(2, 7))
                      }
param_distributions
Out[ ]:
{'max_depth': array([ 3,  5,  8,  5,  5, 13,  5, 16, 18, 13]),
 'max_features': array([0.99106896, 0.81855538, 0.80462666, 0.98977458, 0.73282047,
        0.82150853, 0.84194348, 0.88743413, 0.99886695, 0.87119353,
        0.76534319, 0.80883467, 0.9150284 , 0.85429453, 0.77911168,
        0.9210062 , 0.79638464, 0.99281375, 0.99780265, 0.84750491,
        0.70992737, 0.9583918 , 0.75906882, 0.83963015, 0.75394577,
        0.93916341, 0.86509081, 0.81393384, 0.91599864, 0.7415089 ,
        0.85352664, 0.89897825, 0.83836258, 0.87044418, 0.72702334,
        0.85906734, 0.8030263 , 0.81262126, 0.83317934, 0.84651679,
        0.95067863, 0.77310569, 0.97842667, 0.78061133, 0.86059737,
        0.83677233, 0.80382649, 0.8432214 , 0.9875819 , 0.82981433,
        0.83095953, 0.87872657, 0.7961058 , 0.8113246 , 0.91099763,
        0.79671905, 0.97237741, 0.90046943, 0.92105654, 0.82158186,
        0.81755379, 0.8291375 , 0.85454379, 0.77120579, 0.7363162 ,
        0.98492692, 0.75220163, 0.82488327, 0.87990592, 0.79613856,
        0.94869687, 0.97880518, 0.81160924, 0.94770462, 0.71661541,
        0.80342747, 0.89439379, 0.9379838 , 0.71543596, 0.87932605,
        0.83239537, 0.97706006, 0.95202297, 0.75126108, 0.89816093,
        0.93328139, 0.81298917, 0.89519292, 0.8643087 , 0.79294457,
        0.91578112, 0.99709119, 0.78497154, 0.7200114 , 0.84726556,
        0.80067696, 0.92617871, 0.82913264, 0.764748  , 0.75956455]),
 'min_samples_split': [2, 3, 4, 5, 6]}
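
Here the candidate values were pre-sampled into fixed NumPy arrays, so the search can only ever pick from those samples. RandomizedSearchCV also accepts scipy.stats distributions and draws a fresh value each iteration; a sketch of the same search space in that style (using a separate name so the dictionary above is untouched):

In [ ]:
# Same search space expressed as distributions rather than fixed arrays.
from scipy.stats import randint, uniform

param_distributions_alt = {
    "max_depth": randint(3, 20),         # integers in [3, 20)
    "max_features": uniform(0.7, 0.3),   # floats in [0.7, 1.0)
    "min_samples_split": randint(2, 7),  # integers in [2, 7)
}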
In [ ]:
from sklearn.model_selection import RandomizedSearchCV

clf = RandomizedSearchCV(model,
                   param_distributions,
                   n_iter=1000,
                   scoring="accuracy",
                   n_jobs=-1,
                   cv=5,
                   random_state=42
                  )

clf.fit(X_train, y_train)
Out[ ]:
RandomizedSearchCV(cv=5, error_score=nan,
                   estimator=DecisionTreeClassifier(ccp_alpha=0.0,
                                                    class_weight=None,
                                                    criterion='gini',
                                                    max_depth=None,
                                                    max_features=None,
                                                    max_leaf_nodes=None,
                                                    min_impurity_decrease=0.0,
                                                    min_impurity_split=None,
                                                    min_samples_leaf=1,
                                                    min_samples_split=2,
                                                    min_weight_fraction_leaf=0.0,
                                                    presort='deprecated',
                                                    random_state=42,
                                                    splitter='best'),
                   iid...
       0.83239537, 0.97706006, 0.95202297, 0.75126108, 0.89816093,
       0.93328139, 0.81298917, 0.89519292, 0.8643087 , 0.79294457,
       0.91578112, 0.99709119, 0.78497154, 0.7200114 , 0.84726556,
       0.80067696, 0.92617871, 0.82913264, 0.764748  , 0.75956455]),
                                        'min_samples_split': [2, 3, 4, 5, 6]},
                   pre_dispatch='2*n_jobs', random_state=42, refit=True,
                   return_train_score=False, scoring='accuracy', verbose=0)
In [ ]:
clf.best_params_
Out[ ]:
{'min_samples_split': 4, 'max_features': 0.7415089045216909, 'max_depth': 5}
In [ ]:
clf.best_score_
Out[ ]:
0.8697454351592697
In [ ]:
clf.score(X_test, y_test)
Out[ ]:
0.8701298701298701
In [ ]:
pd.DataFrame(clf.cv_results_).sort_values(by="rank_test_score").head()
Out[ ]:
     mean_fit_time  std_fit_time  mean_score_time  std_score_time param_min_samples_split param_max_features param_max_depth                                             params  split0_test_score  split1_test_score  split2_test_score  split3_test_score  split4_test_score  mean_test_score  std_test_score  rank_test_score
772       0.006956      0.001394         0.002171        0.000143                       4           0.752202               5  {'min_samples_split': 4, 'max_features': 0.752...           0.878049           0.910569           0.813008           0.853659           0.893443         0.869745        0.033985                1
731       0.022266      0.008578         0.005994        0.004555                       4           0.709927               5  {'min_samples_split': 4, 'max_features': 0.709...           0.878049           0.910569           0.813008           0.853659           0.893443         0.869745        0.033985                1
983       0.024347      0.034837         0.008484        0.007401                       4            0.73282               5  {'min_samples_split': 4, 'max_features': 0.732...           0.878049           0.910569           0.813008           0.853659           0.893443         0.869745        0.033985                1
687       0.005775      0.000105         0.002380        0.000491                       4           0.765343               5  {'min_samples_split': 4, 'max_features': 0.765...           0.878049           0.910569           0.813008           0.853659           0.893443         0.869745        0.033985                1
326       0.006005      0.001090         0.005060        0.005768                       4           0.751261               5  {'min_samples_split': 4, 'max_features': 0.751...           0.878049           0.910569           0.813008           0.853659           0.893443         0.869745        0.033985                1

Training and predicting

In [ ]:
# Train the model.
model.fit(X_train, y_train)
Out[ ]:
DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=None, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=42, splitter='best')
In [ ]:
feature_names = X_train.columns.tolist()
In [ ]:
from sklearn.tree import plot_tree

plt.figure(figsize=(15, 15))
tree = plot_tree(model, feature_names=feature_names, fontsize=10, filled=True)

# Make predictions and store the result in y_predict.
y_predict = model.predict(X_test)
y_predict
Out[ ]:
array([0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0,
       0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1,
       0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0,
       0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0,
       0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0,
       0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0])
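
Besides drawing the whole tree, the fitted estimator exposes feature_importances_, which summarizes how much each column contributed to the splits; a short sketch using the seaborn import from the top of the notebook:

In [ ]:
# Rank features by their impurity-based importance in the fitted tree.
sns.barplot(x=model.feature_importances_, y=feature_names)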

Measuring accuracy

In [ ]:
# Count how many predictions differ from the true labels.

abs(y_predict - y_test).sum()
Out[ ]:
28
In [ ]:
# Compute the accuracy score.

accuracy_score(y_test, y_predict) * 100
Out[ ]:
81.81818181818183
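
With imbalanced classes, accuracy alone can hide whether the errors are false positives or false negatives; scikit-learn's confusion matrix and classification report break the 28 misclassifications down per class. A brief sketch:

In [ ]:
# Per-class breakdown of the 28 misclassified test samples.
from sklearn.metrics import confusion_matrix, classification_report

print(confusion_matrix(y_test, y_predict))
print(classification_report(y_test, y_predict))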