[Project-Based Data Science] pima_classification_baseline_03
Dataset source
- Pima Indians Diabetes Database | Kaggle
- https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_diabetes.html
Data composition
- Pregnancies : number of times pregnant
- Glucose : plasma glucose concentration at 2 hours in an oral glucose tolerance test
- BloodPressure : diastolic blood pressure (mm Hg)
- SkinThickness : triceps skin fold thickness (mm), used to estimate body fat
- Insulin : 2-hour serum insulin (mu U/ml)
- BMI : body mass index (weight in kg / height in m^2)
- DiabetesPedigreeFunction : diabetes pedigree function (a score of diabetes likelihood based on family history)
- Age : age in years
- Outcome : class variable (0 or 1); 268 of the 768 records are 1 and the rest are 0
Load the required libraries
In [ ]:
# pandas for data analysis, numpy for numerical computation,
# and seaborn / matplotlib.pyplot for visualization.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
Load the dataset
In [ ]:
df = pd.read_csv("data/diabetes_feature.csv")
df.shape
Out[ ]:
(768, 16)
In [ ]:
# Preview the dataset.
df.head()
Out[ ]:
| Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome | Pregnancies_high | Age_low | Age_middle | Age_high | Insulin_nan | Insulin_log | low_glu_insulin |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 | False | False | True | False | 169.5 | 5.138735 | False |
| 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 | False | False | True | False | 102.5 | 4.639572 | True |
| 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 | True | False | True | False | 169.5 | 5.138735 | False |
| 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 | False | True | False | False | 94.0 | 4.553877 | True |
| 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 | False | False | True | False | 168.0 | 5.129899 | False |
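The columns after Outcome (Pregnancies_high, the Age_* bands, Insulin_nan, Insulin_log, low_glu_insulin) were engineered in the earlier pima_diabetes_preprocessed notebook. The cell below is only a rough sketch of how such columns could be derived; the thresholds, the group-wise fill rule, and the raw file path data/diabetes.csv are assumptions, not that notebook's confirmed code.
In [ ]:
# Rough reconstruction of the engineered columns (assumptions, not the original preprocessing code).
raw = pd.read_csv("data/diabetes.csv")                       # original Kaggle file; path assumed
raw["Pregnancies_high"] = raw["Pregnancies"] > 6             # threshold assumed
raw["Age_low"] = raw["Age"] < 25                             # age bands assumed
raw["Age_middle"] = (raw["Age"] >= 25) & (raw["Age"] <= 60)
raw["Age_high"] = raw["Age"] > 60
# Treat Insulin == 0 as missing and fill per Outcome group (fill rule assumed).
raw["Insulin_nan"] = raw["Insulin"].replace(0, np.nan)
raw["Insulin_nan"] = raw.groupby("Outcome")["Insulin_nan"].transform(lambda s: s.fillna(s.mean()))
raw["Insulin_log"] = np.log(raw["Insulin_nan"] + 1)          # log(x + 1) transform, consistent with the sample values above
raw["low_glu_insulin"] = (raw["Glucose"] < 100) & (raw["Insulin_nan"] <= 102.5)  # cut-offs assumed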
Build the datasets used for training and prediction
In [ ]:
df.columns
Out[ ]:
Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome', 'Pregnancies_high',
'Age_low', 'Age_middle', 'Age_high', 'Insulin_nan', 'Insulin_log',
'low_glu_insulin'],
dtype='object')
In [ ]:
X = df[['Glucose', 'BloodPressure', 'SkinThickness',
        'BMI', 'DiabetesPedigreeFunction', 'Age', 'Pregnancies_high',
        'Insulin_nan', 'low_glu_insulin']]
X.shape
Out[ ]:
(768, 9)
In [ ]:
y = df['Outcome']
y.shape
Out[ ]:
(768,)
In [ ]:
# Create the split with train_test_split from sklearn.model_selection.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
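A variant worth noting (not what this notebook uses): passing stratify=y keeps the 268:500 class ratio identical in both splits, which can matter for an imbalanced target like Outcome.
In [ ]:
# Variant sketch only — the split above does not stratify. Different variable
# names are used so the notebook's actual split is left untouched.
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)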
In [ ]:
# Check the number of samples in the training features and labels.
X_train.shape, y_train.shape
Out[ ]:
((614, 9), (614,))
In [ ]:
# Check the number of samples in the test features and labels.
X_test.shape, y_test.shape
Out[ ]:
((154, 9), (154,))
Apply a machine learning algorithm
In [ ]:
# Import DecisionTreeClassifier.
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier(max_depth=11, random_state=42)
model
Out[ ]:
DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
max_depth=11, max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort='deprecated',
random_state=42, splitter='best')
Find the best max_depth value
In [ ]:
from sklearn.metrics import accuracy_score
for max_depth in range(3, 12):
    model = DecisionTreeClassifier(max_depth=max_depth, random_state=42)
    y_predict = model.fit(X_train, y_train).predict(X_test)
    score = accuracy_score(y_test, y_predict) * 100
    print(max_depth, score)
3 85.06493506493507
4 87.66233766233766
5 85.71428571428571
6 81.81818181818183
7 81.81818181818183
8 81.81818181818183
9 83.76623376623377
10 79.22077922077922
11 81.81818181818183
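The loop above selects max_depth by scoring on the held-out test set, so the test data leaks into the hyperparameter choice. A variant sketch (not the notebook's code) keeps the test set untouched by cross-validating on the training set instead, which is also what GridSearchCV below does internally.
In [ ]:
# Variant sketch: score each max_depth with 5-fold CV on the training data only.
from sklearn.model_selection import cross_val_score

for max_depth in range(3, 12):
    cv_model = DecisionTreeClassifier(max_depth=max_depth, random_state=42)
    cv_scores = cross_val_score(cv_model, X_train, y_train, cv=5, scoring="accuracy")
    print(max_depth, cv_scores.mean() * 100)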
In [ ]:
from sklearn.model_selection import GridSearchCV
model = DecisionTreeClassifier(random_state=42)
param_grid = {"max_depth": range(3, 12),
              "max_features": [0.3, 0.5, 0.7, 0.9, 1]}
clf = GridSearchCV(model, param_grid=param_grid, n_jobs=-1, cv=5, verbose=2)
clf.fit(X_train, y_train)
Fitting 5 folds for each of 45 candidates, totalling 225 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done 34 tasks | elapsed: 3.5s
[Parallel(n_jobs=-1)]: Done 225 out of 225 | elapsed: 4.0s finished
Out[ ]:
GridSearchCV(cv=5, error_score=nan,
estimator=DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None,
criterion='gini', max_depth=None,
max_features=None,
max_leaf_nodes=None,
min_impurity_decrease=0.0,
min_impurity_split=None,
min_samples_leaf=1,
min_samples_split=2,
min_weight_fraction_leaf=0.0,
presort='deprecated',
random_state=42,
splitter='best'),
iid='deprecated', n_jobs=-1,
param_grid={'max_depth': range(3, 12),
'max_features': [0.3, 0.5, 0.7, 0.9, 1]},
pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
scoring=None, verbose=2)
In [ ]:
clf.best_params_
Out[ ]:
{'max_depth': 5, 'max_features': 0.7}
In [ ]:
clf.best_estimator_
Out[ ]:
DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
max_depth=5, max_features=0.7, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort='deprecated',
random_state=42, splitter='best')
In [ ]:
clf.best_score_
Out[ ]:
0.8664934026389444
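best_score_ is the mean 5-fold cross-validation accuracy on the training folds for the best parameter combination (the mean_test_score of the top-ranked row below), not the accuracy on the held-out test set.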
In [ ]:
pd.DataFrame(clf.cv_results_).sort_values(by="rank_test_score").head()
Out[ ]:
| mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_max_depth | param_max_features | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0.006471 | 0.000485 | 0.002686 | 0.000658 | 5 | 0.7 | {'max_depth': 5, 'max_features': 0.7} | 0.878049 | 0.910569 | 0.813008 | 0.837398 | 0.893443 | 0.866493 | 0.036082 | 1 |
| 0.006573 | 0.000491 | 0.002635 | 0.000312 | 4 | 0.7 | {'max_depth': 4, 'max_features': 0.7} | 0.813008 | 0.886179 | 0.829268 | 0.861789 | 0.918033 | 0.861655 | 0.037935 | 2 |
| 0.006613 | 0.000522 | 0.002681 | 0.000637 | 4 | 0.9 | {'max_depth': 4, 'max_features': 0.9} | 0.821138 | 0.886179 | 0.853659 | 0.853659 | 0.893443 | 0.861615 | 0.026005 | 3 |
| 0.006644 | 0.000947 | 0.002781 | 0.000535 | 6 | 0.9 | {'max_depth': 6, 'max_features': 0.9} | 0.829268 | 0.894309 | 0.821138 | 0.878049 | 0.877049 | 0.859963 | 0.029149 | 4 |
| 0.006094 | 0.000273 | 0.002468 | 0.000401 | 8 | 0.7 | {'max_depth': 8, 'max_features': 0.7} | 0.861789 | 0.878049 | 0.837398 | 0.853659 | 0.860656 | 0.858310 | 0.013162 | 5 |
In [ ]:
clf.predict(X_test)
Out[ ]:
array([1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0,
0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1,
0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0,
0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1,
0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1,
0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1,
0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0])
In [ ]:
clf.score(X_test, y_test)
Out[ ]:
0.8701298701298701
Random Search
In [ ]:
model
Out[ ]:
DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
max_depth=None, max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort='deprecated',
random_state=42, splitter='best')
In [ ]:
max_depth = np.random.randint(3, 20, 10)
max_depth
Out[ ]:
array([ 3, 5, 8, 5, 5, 13, 5, 16, 18, 13])
In [ ]:
max_features = np.random.uniform(0.7, 1.0, 100)
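Note that these candidate values are drawn without a fixed NumPy seed, so the sampled arrays (and with them the search results) can change between runs unless np.random.seed is called first.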
In [ ]:
param_distributions = {"max_depth": max_depth,
                       "max_features": max_features,
                       "min_samples_split": list(range(2, 7))}
param_distributions
Out[ ]:
{'max_depth': array([ 3, 5, 8, 5, 5, 13, 5, 16, 18, 13]),
'max_features': array([0.99106896, 0.81855538, 0.80462666, 0.98977458, 0.73282047,
0.82150853, 0.84194348, 0.88743413, 0.99886695, 0.87119353,
0.76534319, 0.80883467, 0.9150284 , 0.85429453, 0.77911168,
0.9210062 , 0.79638464, 0.99281375, 0.99780265, 0.84750491,
0.70992737, 0.9583918 , 0.75906882, 0.83963015, 0.75394577,
0.93916341, 0.86509081, 0.81393384, 0.91599864, 0.7415089 ,
0.85352664, 0.89897825, 0.83836258, 0.87044418, 0.72702334,
0.85906734, 0.8030263 , 0.81262126, 0.83317934, 0.84651679,
0.95067863, 0.77310569, 0.97842667, 0.78061133, 0.86059737,
0.83677233, 0.80382649, 0.8432214 , 0.9875819 , 0.82981433,
0.83095953, 0.87872657, 0.7961058 , 0.8113246 , 0.91099763,
0.79671905, 0.97237741, 0.90046943, 0.92105654, 0.82158186,
0.81755379, 0.8291375 , 0.85454379, 0.77120579, 0.7363162 ,
0.98492692, 0.75220163, 0.82488327, 0.87990592, 0.79613856,
0.94869687, 0.97880518, 0.81160924, 0.94770462, 0.71661541,
0.80342747, 0.89439379, 0.9379838 , 0.71543596, 0.87932605,
0.83239537, 0.97706006, 0.95202297, 0.75126108, 0.89816093,
0.93328139, 0.81298917, 0.89519292, 0.8643087 , 0.79294457,
0.91578112, 0.99709119, 0.78497154, 0.7200114 , 0.84726556,
0.80067696, 0.92617871, 0.82913264, 0.764748 , 0.75956455]),
'min_samples_split': [2, 3, 4, 5, 6]}
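Instead of pre-sampling NumPy arrays, RandomizedSearchCV also accepts scipy.stats distributions and draws a fresh value for each iteration. Below is an alternative sketch with ranges mirroring the ones above, not the notebook's code.
In [ ]:
# Alternative sketch: pass scipy.stats distributions instead of fixed arrays.
from scipy.stats import randint, uniform

param_distributions_alt = {
    "max_depth": randint(3, 20),         # integers in [3, 20)
    "max_features": uniform(0.7, 0.3),   # floats in [0.7, 1.0)
    "min_samples_split": randint(2, 7),  # integers in [2, 7)
}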
In [ ]:
from sklearn.model_selection import RandomizedSearchCV
clf = RandomizedSearchCV(model,
                         param_distributions,
                         n_iter=1000,
                         scoring="accuracy",
                         n_jobs=-1,
                         cv=5,
                         random_state=42)
clf.fit(X_train, y_train)
Out[ ]:
RandomizedSearchCV(cv=5, error_score=nan,
estimator=DecisionTreeClassifier(ccp_alpha=0.0,
class_weight=None,
criterion='gini',
max_depth=None,
max_features=None,
max_leaf_nodes=None,
min_impurity_decrease=0.0,
min_impurity_split=None,
min_samples_leaf=1,
min_samples_split=2,
min_weight_fraction_leaf=0.0,
presort='deprecated',
random_state=42,
splitter='best'),
iid...
0.83239537, 0.97706006, 0.95202297, 0.75126108, 0.89816093,
0.93328139, 0.81298917, 0.89519292, 0.8643087 , 0.79294457,
0.91578112, 0.99709119, 0.78497154, 0.7200114 , 0.84726556,
0.80067696, 0.92617871, 0.82913264, 0.764748 , 0.75956455]),
'min_samples_split': [2, 3, 4, 5, 6]},
pre_dispatch='2*n_jobs', random_state=42, refit=True,
return_train_score=False, scoring='accuracy', verbose=0)
In [ ]:
clf.best_params_
Out[ ]:
{'min_samples_split': 4, 'max_features': 0.7415089045216909, 'max_depth': 5}
In [ ]:
clf.best_score_
Out[ ]:
0.8697454351592697
In [ ]:
clf.score(X_test, y_test)
Out[ ]:
0.8701298701298701
In [ ]:
pd.DataFrame(clf.cv_results_).sort_values(by="rank_test_score").head()
Out[ ]:
| mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_min_samples_split | param_max_features | param_max_depth | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0.006956 | 0.001394 | 0.002171 | 0.000143 | 4 | 0.752202 | 5 | {'min_samples_split': 4, 'max_features': 0.752... | 0.878049 | 0.910569 | 0.813008 | 0.853659 | 0.893443 | 0.869745 | 0.033985 | 1 |
| 0.022266 | 0.008578 | 0.005994 | 0.004555 | 4 | 0.709927 | 5 | {'min_samples_split': 4, 'max_features': 0.709... | 0.878049 | 0.910569 | 0.813008 | 0.853659 | 0.893443 | 0.869745 | 0.033985 | 1 |
| 0.024347 | 0.034837 | 0.008484 | 0.007401 | 4 | 0.73282 | 5 | {'min_samples_split': 4, 'max_features': 0.732... | 0.878049 | 0.910569 | 0.813008 | 0.853659 | 0.893443 | 0.869745 | 0.033985 | 1 |
| 0.005775 | 0.000105 | 0.002380 | 0.000491 | 4 | 0.765343 | 5 | {'min_samples_split': 4, 'max_features': 0.765... | 0.878049 | 0.910569 | 0.813008 | 0.853659 | 0.893443 | 0.869745 | 0.033985 | 1 |
| 0.006005 | 0.001090 | 0.005060 | 0.005768 | 4 | 0.751261 | 5 | {'min_samples_split': 4, 'max_features': 0.751... | 0.878049 | 0.910569 | 0.813008 | 0.853659 | 0.893443 | 0.869745 | 0.033985 | 1 |
Training and predicting
In [ ]:
# Fit the model. Note: this is the plain DecisionTreeClassifier(random_state=42)
# defined earlier, not the best estimator found by the searches above.
model.fit(X_train, y_train)
Out[ ]:
DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
max_depth=None, max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort='deprecated',
random_state=42, splitter='best')
In [ ]:
feature_names = X_train.columns.tolist()
In [ ]:
from sklearn.tree import plot_tree
plt.figure(figsize=(15, 15))
tree = plot_tree(model, feature_names=feature_names, fontsize=10, filled=True)
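Besides plotting the tree, a quick look at the fitted model's feature importances shows which inputs drive the splits. The cell below is an added sketch, not part of the original notebook.
In [ ]:
# Added sketch: rank the fitted tree's feature importances.
fi = pd.Series(model.feature_importances_, index=feature_names).sort_values(ascending=False)
fi.plot.barh(figsize=(8, 4))
fi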
In [ ]:
# Make predictions and store the result in y_predict.
y_predict = model.predict(X_test)
y_predict
Out[ ]:
array([0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0,
0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1,
0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0,
0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0,
0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0,
0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0])
Measuring accuracy
In [ ]:
# Count how many predictions differ from the true labels and assign it to diff_count.
diff_count = abs(y_predict - y_test).sum()
diff_count
Out[ ]:
28
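With 154 test samples, 28 mismatches correspond to an accuracy of (154 - 28) / 154 ≈ 81.8 %, which matches the score below.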
In [ ]:
# Compute the accuracy score.
accuracy_score(y_test, y_predict) * 100
Out[ ]:
81.81818181818183