[Learning Data Science through Projects] pima_classification_baseline_04

2024. 7. 3. 23:22 · MOOC

Load the required libraries

In [ ]:
# Load pandas for data analysis and numpy for numerical computation,
# plus seaborn and matplotlib.pyplot for visualization.

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline

Load the dataset

In [ ]:
df = pd.read_csv("data/diabetes_feature.csv")
df.shape
Out[ ]:
(768, 16)
In [ ]:
# Preview the dataset.

df.head()
Out[ ]:
   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  DiabetesPedigreeFunction  Age  Outcome  Pregnancies_high  Age_low  Age_middle  Age_high  Insulin_nan  Insulin_log  low_glu_insulin
0            6      148             72             35        0  33.6                     0.627   50        1             False    False        True     False        169.5     5.138735            False
1            1       85             66             29        0  26.6                     0.351   31        0             False    False        True     False        102.5     4.639572             True
2            8      183             64              0        0  23.3                     0.672   32        1              True    False        True     False        169.5     5.138735            False
3            1       89             66             23       94  28.1                     0.167   21        0             False     True       False     False         94.0     4.553877             True
4            0      137             40             35      168  43.1                     2.288   33        1             False    False        True     False        168.0     5.129899            False

Create the datasets for training and prediction

In [ ]:
df.columns
Out[ ]:
Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
       'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome', 'Pregnancies_high',
       'Age_low', 'Age_middle', 'Age_high', 'Insulin_nan', 'Insulin_log',
       'low_glu_insulin'],
      dtype='object')
In [ ]:
X = df[['Glucose', 'BloodPressure', 'SkinThickness',
       'BMI', 'DiabetesPedigreeFunction', 'Age', 'Pregnancies_high',
       'Insulin_nan', 'low_glu_insulin']]
X.shape
Out[ ]:
(768, 9)
In [ ]:
y = df['Outcome']
y.shape
Out[ ]:
(768,)
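
As a side note (not in the original notebook), the same feature matrix can be built by dropping the label and the unused columns instead of listing the kept ones; a minimal sketch:

In [ ]:
# Equivalent feature selection by dropping columns: the label (Outcome),
# the raw columns replaced by engineered ones (Pregnancies, Insulin),
# the one-hot Age bins, and Insulin_log.
drop_cols = ['Pregnancies', 'Insulin', 'Outcome',
             'Age_low', 'Age_middle', 'Age_high', 'Insulin_log']
X_alt = df.drop(columns=drop_cols)
X_alt.shape  # (768, 9), same columns as X above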
In [ ]:
# Split the data with train_test_split from scikit-learn's model_selection module.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
In [ ]:
# Check the number of samples in the train set's features and labels.

X_train.shape, y_train.shape
Out[ ]:
((614, 9), (614,))
In [ ]:
# Check the number of samples in the test set's features and labels.

X_test.shape, y_test.shape
Out[ ]:
((154, 9), (154,))
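
The split above is purely random; since only about a third of the rows have Outcome == 1, it can be worth stratifying so both splits keep the same class ratio. A hedged variant, not used in the rest of this notebook:

In [ ]:
# stratify=y keeps the positive/negative ratio of Outcome the same in the
# train and test sets, which reduces variance on imbalanced data.
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

y_train_s.mean(), y_test_s.mean()  # the class ratios should now match closely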

Single tree

In [ ]:
# from sklearn.tree import DecisionTreeClassifier

# model = DecisionTreeClassifier(random_state=42)
# model

Bagging

In [ ]:
# from sklearn.ensemble import RandomForestClassifier

# model = RandomForestClassifier(n_estimators=100, random_state=42)
# model
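
The two commented-out baselines above are fitted the same way as the boosting model below. A sketch that runs all three on the same split and prints each accuracy; the scores should match the DT/RF/GB numbers noted in the accuracy section:

In [ ]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score

estimators = [
    DecisionTreeClassifier(random_state=42),                    # single tree
    RandomForestClassifier(n_estimators=100, random_state=42),  # bagging
    GradientBoostingClassifier(random_state=42),                # boosting
]

for est in estimators:
    est.fit(X_train, y_train)
    acc = accuracy_score(y_test, est.predict(X_test))
    print(f"{est.__class__.__name__}: {acc:.3f}")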

Boosting

Reference: What is the difference between Bagging and Boosting? ⋆ Quantdare (quantdare.com) — "Bagging and Boosting are both ensemble methods in Machine Learning, but what is the key behind them? Are they equivalent, similar, not related?"

In [ ]:
from sklearn.ensemble import GradientBoostingClassifier

model = GradientBoostingClassifier(random_state=42)
model
Out[ ]:
GradientBoostingClassifier(ccp_alpha=0.0, criterion='friedman_mse', init=None,
                           learning_rate=0.1, loss='deviance', max_depth=3,
                           max_features=None, max_leaf_nodes=None,
                           min_impurity_decrease=0.0, min_impurity_split=None,
                           min_samples_leaf=1, min_samples_split=2,
                           min_weight_fraction_leaf=0.0, n_estimators=100,
                           n_iter_no_change=None, presort='deprecated',
                           random_state=42, subsample=1.0, tol=0.0001,
                           validation_fraction=0.1, verbose=0,
                           warm_start=False)
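
The model above runs with default hyperparameters. As a hedged follow-up (not part of the original notebook), a natural next step is a small grid search over the parameters that usually matter most for gradient boosting; the grid values below are illustrative assumptions, not tuned choices:

In [ ]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [100, 200],
    "learning_rate": [0.05, 0.1],
    "max_depth": [2, 3],
}
search = GridSearchCV(GradientBoostingClassifier(random_state=42),
                      param_grid, cv=5, n_jobs=-1)
search.fit(X_train, y_train)
search.best_params_, search.best_score_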

Train and predict

In [ ]:
# Fit the model on the training set.
model.fit(X_train, y_train)
Out[ ]:
GradientBoostingClassifier(ccp_alpha=0.0, criterion='friedman_mse', init=None,
                           learning_rate=0.1, loss='deviance', max_depth=3,
                           max_features=None, max_leaf_nodes=None,
                           min_impurity_decrease=0.0, min_impurity_split=None,
                           min_samples_leaf=1, min_samples_split=2,
                           min_weight_fraction_leaf=0.0, n_estimators=100,
                           n_iter_no_change=None, presort='deprecated',
                           random_state=42, subsample=1.0, tol=0.0001,
                           validation_fraction=0.1, verbose=0,
                           warm_start=False)
In [ ]:
model.feature_importances_
Out[ ]:
array([1.06409948e-01, 7.04053663e-03, 6.57236021e-02, 3.73742490e-02,
       2.62954875e-02, 9.47259928e-02, 1.56708089e-04, 6.62273475e-01,
       0.00000000e+00])
In [ ]:
feature_names = X_train.columns.tolist()
In [ ]:
sns.barplot(x=model.feature_importances_, y=feature_names)

# Make predictions and store the results in y_predict.
y_predict = model.predict(X_test)
y_predict[:5]
Out[ ]:
array([1, 0, 0, 0, 0])
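
The importances above show Insulin_nan carrying most of the weight (about 0.66), with Glucose and Age next. To make the barplot easier to read, the values can be paired with the column names and sorted first; a minimal sketch using the model and feature_names defined above:

In [ ]:
# Pair the importances with their column names and sort descending,
# so the most influential features appear at the top of the plot.
importances = pd.Series(model.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)

sns.barplot(x=importances.values, y=importances.index)
plt.xlabel("feature importance")
plt.show()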

Measure accuracy

In [ ]:
# Count the predictions that differ from the true labels and assign the
# result to diff_count.
# DT : 28
# RF : 20
# GB : 24
diff_count = (y_predict != y_test).sum()
diff_count
Out[ ]:
24
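
The miss count maps directly to the accuracy reported below: 24 misses out of 154 test samples gives 1 - 24/154 ≈ 0.844.

In [ ]:
# Accuracy derived from the miss count: correct / total = 1 - misses / total.
1 - diff_count / len(y_test)  # 1 - 24/154 ≈ 0.8442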
In [ ]:
# Compute the accuracy score.
# DT: 0.818
# RF: 0.870
# GB: 0.844
from sklearn.metrics import accuracy_score

accuracy_score(y_test, y_predict)
Out[ ]:
0.8441558441558441
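
A hedged closing note (not in the original notebook): a single 80/20 split on 768 rows is fairly noisy, so cross-validation gives a steadier estimate of the same model's accuracy.

In [ ]:
from sklearn.model_selection import cross_val_score

# 5-fold cross-validation over the full dataset with the same model settings.
scores = cross_val_score(GradientBoostingClassifier(random_state=42), X, y, cv=5)
scores.mean(), scores.std()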