[프로젝트로 배우는 데이터사이언스]pima_classification_baseline_04
2024. 7. 3. 23:22ㆍMOOC
필요한 라이브러리 로드
In [ ]:
# 데이터 분석을 위한 pandas, 수치계산을 위한 numpy
# 시각화를 위한 seaborn, matplotlib.pyplot 을 로드합니다.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
데이터셋 로드
In [ ]:
df = pd.read_csv("data/diabetes_feature.csv")
df.shape
Out[ ]:
(768, 16)
In [ ]:
# 데이터셋을 미리보기 합니다.
df.head()
Out[ ]:
PregnanciesGlucoseBloodPressureSkinThicknessInsulinBMIDiabetesPedigreeFunctionAgeOutcomePregnancies_highAge_lowAge_middleAge_highInsulin_nanInsulin_loglow_glu_insulin01234
6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 | False | False | True | False | 169.5 | 5.138735 | False |
1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 | False | False | True | False | 102.5 | 4.639572 | True |
8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 | True | False | True | False | 169.5 | 5.138735 | False |
1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 | False | True | False | False | 94.0 | 4.553877 | True |
0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 | False | False | True | False | 168.0 | 5.129899 | False |
학습과 예측에 사용할 데이터셋 만들기
In [ ]:
df.columns
Out[ ]:
Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome', 'Pregnancies_high',
'Age_low', 'Age_middle', 'Age_high', 'Insulin_nan', 'Insulin_log',
'low_glu_insulin'],
dtype='object')
In [ ]:
X = df[['Glucose', 'BloodPressure', 'SkinThickness',
'BMI', 'DiabetesPedigreeFunction', 'Age', 'Pregnancies_high',
'Insulin_nan', 'low_glu_insulin']]
X.shape
Out[ ]:
(768, 9)
In [ ]:
y = df['Outcome']
y.shape
Out[ ]:
(768,)
In [ ]:
# 사이킷런에서 제공하는 model_selection 의 train_test_split 으로 만듭니다.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42)
In [ ]:
# train 세트의 문제와 정답의 데이터 수를 확인해 주세요.
X_train.shape, y_train.shape
Out[ ]:
((614, 9), (614,))
In [ ]:
# test 세트의 문제와 정답의 데이터 수를 확인해 주세요.
X_test.shape, y_test.shape
Out[ ]:
((154, 9), (154,))
Single tree
In [ ]:
# from sklearn.tree import DecisionTreeClassifier
# model = DecisionTreeClassifier(random_state=42)
# model
Bagging
In [ ]:
# from sklearn.ensemble import RandomForestClassifier
# model = RandomForestClassifier(n_estimators=100, random_state=42)
# model
Boosting
What is the difference between Bagging and Boosting? ⋆ Quantdare
Bagging and Boosting are both ensemble methods in Machine Learning, but what is the key behind them? Are they equivalent, similar, not related?
quantdare.com
from sklearn.ensemble import GradientBoostingClassifier
model = GradientBoostingClassifier(random_state=42)
model
Out[ ]:
GradientBoostingClassifier(ccp_alpha=0.0, criterion='friedman_mse', init=None,
learning_rate=0.1, loss='deviance', max_depth=3,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=100,
n_iter_no_change=None, presort='deprecated',
random_state=42, subsample=1.0, tol=0.0001,
validation_fraction=0.1, verbose=0,
warm_start=False)
학습과 예측하기
In [ ]:
# 학습을 시킵니다.
model.fit(X_train, y_train)
Out[ ]:
GradientBoostingClassifier(ccp_alpha=0.0, criterion='friedman_mse', init=None,
learning_rate=0.1, loss='deviance', max_depth=3,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=100,
n_iter_no_change=None, presort='deprecated',
random_state=42, subsample=1.0, tol=0.0001,
validation_fraction=0.1, verbose=0,
warm_start=False)
In [ ]:
model.feature_importances_
Out[ ]:
array([1.06409948e-01, 7.04053663e-03, 6.57236021e-02, 3.73742490e-02,
2.62954875e-02, 9.47259928e-02, 1.56708089e-04, 6.62273475e-01,
0.00000000e+00])
In [ ]:
feature_names = X_train.columns.tolist()
In [ ]:
sns.barplot(x=model.feature_importances_, y=feature_names)
# 예측을 하고 결과를 y_predict에 담습니다.
y_predict = model.predict(X_test)
y_predict[:5]
Out[ ]:
array([1, 0, 0, 0, 0])
정확도(Accuracy) 측정하기
In [ ]:
# 다르게 예측한 갯수를 구해서 diff_count 에 할당해 줍니다.
# DT : 28
# RF : 20
# GB : 24
(y_predict != y_test).sum()
Out[ ]:
24
In [ ]:
# accuracy score를 구합니다.
# DT: 0.818
# RF: 0.870
# GBL 0.844
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_predict)
Out[ ]:
0.8441558441558441
'MOOC' 카테고리의 다른 글
[컴퓨터 비전의 모든 것] Computer Vision 이란? (2) | 2024.12.19 |
---|---|
[Mooc] 부스트코스 Mooc 강의 수료증 (0) | 2024.11.03 |
[프로젝트로 배우는 데이터사이언스]pima_classification_baseline_03 (0) | 2024.07.03 |
[프로젝트로 배우는 데이터사이언스]pima_diabetes_preprocessed (1) | 2024.06.28 |
[프로젝트로 배우는 데이터사이언스]pima_diabetes_eda (0) | 2024.06.27 |