[Learning Data Science through Projects] pima_classification_baseline_04

2024. 7. 3. 23:22 · MOOC

Load the required libraries

In [ ]:
# Load pandas for data analysis and numpy for numerical computation,
# plus seaborn and matplotlib.pyplot for visualization.

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline

Load the dataset

In [ ]:
df = pd.read_csv("data/diabetes_feature.csv")
df.shape
Out[ ]:
(768, 16)
In [ ]:
# Preview the dataset.

df.head()
Out[ ]:
   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  DiabetesPedigreeFunction  Age  Outcome  Pregnancies_high  Age_low  Age_middle  Age_high  Insulin_nan  Insulin_log  low_glu_insulin
0            6      148             72             35        0  33.6                     0.627   50        1             False    False        True     False        169.5     5.138735            False
1            1       85             66             29        0  26.6                     0.351   31        0             False    False        True     False        102.5     4.639572             True
2            8      183             64              0        0  23.3                     0.672   32        1              True    False        True     False        169.5     5.138735            False
3            1       89             66             23       94  28.1                     0.167   21        0             False     True       False     False         94.0     4.553877             True
4            0      137             40             35      168  43.1                     2.288   33        1             False    False        True     False        168.0     5.129899            False

Create the datasets for training and prediction

In [ ]:
df.columns
Out[ ]:
Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
       'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome', 'Pregnancies_high',
       'Age_low', 'Age_middle', 'Age_high', 'Insulin_nan', 'Insulin_log',
       'low_glu_insulin'],
      dtype='object')
In [ ]:
X = df[['Glucose', 'BloodPressure', 'SkinThickness',
       'BMI', 'DiabetesPedigreeFunction', 'Age', 'Pregnancies_high',
       'Insulin_nan', 'low_glu_insulin']]
X.shape
Out[ ]:
(768, 9)
In [ ]:
y = df['Outcome']
y.shape
Out[ ]:
(768,)
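
As a side note (not in the original notebook), the same feature matrix can be built by dropping the label and the unused columns instead of listing the kept ones; a minimal sketch:

In [ ]:
# Equivalent feature selection by dropping columns: the label (Outcome),
# the raw columns replaced by engineered ones (Pregnancies, Insulin),
# the one-hot Age bins, and Insulin_log.
drop_cols = ['Pregnancies', 'Insulin', 'Outcome',
             'Age_low', 'Age_middle', 'Age_high', 'Insulin_log']
X_alt = df.drop(columns=drop_cols)
X_alt.shape  # (768, 9), same columns as X above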
In [ ]:
# Split the data with train_test_split from scikit-learn's model_selection module.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
In [ ]:
# Check the number of samples in the train set's features and labels.

X_train.shape, y_train.shape
Out[ ]:
((614, 9), (614,))
In [ ]:
# Check the number of samples in the test set's features and labels.

X_test.shape, y_test.shape
Out[ ]:
((154, 9), (154,))
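
The split above is purely random; since only about a third of the rows have Outcome == 1, it can be worth stratifying so both splits keep the same class ratio. A hedged variant, not used in the rest of this notebook:

In [ ]:
# stratify=y keeps the positive/negative ratio of Outcome the same in the
# train and test sets, which reduces variance on imbalanced data.
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

y_train_s.mean(), y_test_s.mean()  # the class ratios should now match closely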

Single tree

In [ ]:
# from sklearn.tree import DecisionTreeClassifier

# model = DecisionTreeClassifier(random_state=42)
# model

Bagging

In [ ]:
# from sklearn.ensemble import RandomForestClassifier

# model = RandomForestClassifier(n_estimators=100, random_state=42)
# model
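
The two commented-out baselines above are fitted the same way as the boosting model below. A sketch that runs all three on the same split and prints each accuracy; the scores should match the DT/RF/GB numbers noted in the accuracy section:

In [ ]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score

estimators = [
    DecisionTreeClassifier(random_state=42),                    # single tree
    RandomForestClassifier(n_estimators=100, random_state=42),  # bagging
    GradientBoostingClassifier(random_state=42),                # boosting
]

for est in estimators:
    est.fit(X_train, y_train)
    acc = accuracy_score(y_test, est.predict(X_test))
    print(f"{est.__class__.__name__}: {acc:.3f}")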

Boosting

Reference: What is the difference between Bagging and Boosting? ⋆ Quantdare (quantdare.com) — "Bagging and Boosting are both ensemble methods in Machine Learning, but what is the key behind them? Are they equivalent, similar, not related?"

In [ ]:
from sklearn.ensemble import GradientBoostingClassifier

model = GradientBoostingClassifier(random_state=42)
model
Out[ ]:
GradientBoostingClassifier(ccp_alpha=0.0, criterion='friedman_mse', init=None,
                           learning_rate=0.1, loss='deviance', max_depth=3,
                           max_features=None, max_leaf_nodes=None,
                           min_impurity_decrease=0.0, min_impurity_split=None,
                           min_samples_leaf=1, min_samples_split=2,
                           min_weight_fraction_leaf=0.0, n_estimators=100,
                           n_iter_no_change=None, presort='deprecated',
                           random_state=42, subsample=1.0, tol=0.0001,
                           validation_fraction=0.1, verbose=0,
                           warm_start=False)
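
The model above runs with default hyperparameters. As a hedged follow-up (not part of the original notebook), a natural next step is a small grid search over the parameters that usually matter most for gradient boosting; the grid values below are illustrative assumptions, not tuned choices:

In [ ]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [100, 200],
    "learning_rate": [0.05, 0.1],
    "max_depth": [2, 3],
}
search = GridSearchCV(GradientBoostingClassifier(random_state=42),
                      param_grid, cv=5, n_jobs=-1)
search.fit(X_train, y_train)
search.best_params_, search.best_score_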

Train and predict

In [ ]:
# Fit the model on the training set.
model.fit(X_train, y_train)
Out[ ]:
GradientBoostingClassifier(ccp_alpha=0.0, criterion='friedman_mse', init=None,
                           learning_rate=0.1, loss='deviance', max_depth=3,
                           max_features=None, max_leaf_nodes=None,
                           min_impurity_decrease=0.0, min_impurity_split=None,
                           min_samples_leaf=1, min_samples_split=2,
                           min_weight_fraction_leaf=0.0, n_estimators=100,
                           n_iter_no_change=None, presort='deprecated',
                           random_state=42, subsample=1.0, tol=0.0001,
                           validation_fraction=0.1, verbose=0,
                           warm_start=False)
In [ ]:
model.feature_importances_
Out[ ]:
array([1.06409948e-01, 7.04053663e-03, 6.57236021e-02, 3.73742490e-02,
       2.62954875e-02, 9.47259928e-02, 1.56708089e-04, 6.62273475e-01,
       0.00000000e+00])
In [ ]:
feature_names = X_train.columns.tolist()
In [ ]:
sns.barplot(x=model.feature_importances_, y=feature_names)

# Make predictions and store the results in y_predict.
y_predict = model.predict(X_test)
y_predict[:5]
Out[ ]:
array([1, 0, 0, 0, 0])
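
The importances above show Insulin_nan carrying most of the weight (about 0.66), with Glucose and Age next. To make the barplot easier to read, the values can be paired with the column names and sorted first; a minimal sketch using the model and feature_names defined above:

In [ ]:
# Pair the importances with their column names and sort descending,
# so the most influential features appear at the top of the plot.
importances = pd.Series(model.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)

sns.barplot(x=importances.values, y=importances.index)
plt.xlabel("feature importance")
plt.show()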

Measure accuracy

In [ ]:
# Count the predictions that differ from the true labels and assign the
# result to diff_count.
# DT : 28
# RF : 20
# GB : 24
diff_count = (y_predict != y_test).sum()
diff_count
Out[ ]:
24
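
The miss count maps directly to the accuracy reported below: 24 misses out of 154 test samples gives 1 - 24/154 ≈ 0.844.

In [ ]:
# Accuracy derived from the miss count: correct / total = 1 - misses / total.
1 - diff_count / len(y_test)  # 1 - 24/154 ≈ 0.8442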
In [ ]:
# Compute the accuracy score.
# DT: 0.818
# RF: 0.870
# GB: 0.844
from sklearn.metrics import accuracy_score

accuracy_score(y_test, y_predict)
Out[ ]:
0.8441558441558441
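
A hedged closing note (not in the original notebook): a single 80/20 split on 768 rows is fairly noisy, so cross-validation gives a steadier estimate of the same model's accuracy.

In [ ]:
from sklearn.model_selection import cross_val_score

# 5-fold cross-validation over the full dataset with the same model settings.
scores = cross_val_score(GradientBoostingClassifier(random_state=42), X, y, cv=5)
scores.mean(), scores.std()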