[Learning Data Science Through Projects] pima_diabetes_preprocessed

2024. 6. 28. 02:14 · MOOC

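Dataset overview

Pregnancies : number of pregnancies
Glucose : plasma glucose concentration at 2 hours in an oral glucose tolerance test
BloodPressure : diastolic blood pressure (mm Hg)
SkinThickness : triceps skinfold thickness (mm), a value used to estimate body fat
Insulin : 2-hour serum insulin (mu U/ml)
BMI : body mass index (weight in kg / (height in m)^2)
DiabetesPedigreeFunction : diabetes pedigree function
Age : age (years)
Outcome : class variable (0 or 1); 268 of the 768 rows are 1 and the rest are 0.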

Loading the required libraries

In [ ]:

# Load pandas for data analysis, numpy for numerical computation,
# and seaborn and matplotlib.pyplot for visualization.

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline

Loading the dataset

In [ ]:

df = pd.read_csv("http://bit.ly/data-diabetes-csv")
df.shape

Out[ ]:

(768, 9)

In [ ]:

df.head()

Out[ ]:

	Pregnancies	Glucose	BloodPressure	SkinThickness	Insulin	BMI	DiabetesPedigreeFunction	Age	Outcome
0	6	148	72	35	0	33.6	0.627	50	1
1	1	85	66	29	0	26.6	0.351	31	0
2	8	183	64	0	0	23.3	0.672	32	1
3	1	89	66	23	94	28.1	0.167	21	0
4	0	137	40	35	168	43.1	2.288	33	1

Feature Engineering

Converting numeric variables to categorical variables

In [ ]:

df["Pregnancies_high"] = df["Pregnancies"] > 6
df[["Pregnancies", "Pregnancies_high"]].head()

Out[ ]:

	Pregnancies	Pregnancies_high
0	6	False
1	1	False
2	8	True
3	1	False
4	0	False

In [ ]:

# One-Hot-Encoding
# numeric => categorical => numeric
df["Age_low"] = df["Age"] < 30
df["Age_middle"] = (df["Age"] >= 30) & (df["Age"] <= 60)
df["Age_high"] = df["Age"] > 60
df[["Age", "Age_low", "Age_middle", "Age_high"]].head()

Out[ ]:

	Age	Age_low	Age_middle	Age_high
0	50	False	True	False
1	31	False	True	False
2	32	False	True	False
3	21	True	False	False
4	33	False	True	False
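The same three age buckets can also be produced by binning and then one-hot encoding. The sketch below is one possible way to do it, assuming integer ages. The `ages` frame is toy data: the first five values come from the output above and the rest are made up to exercise the bin edges.

```python
import pandas as pd

# Toy ages: the first five match the head() output above, the rest are made up.
ages = pd.DataFrame({"Age": [50, 31, 32, 21, 33, 29, 30, 60, 61]})

# Bin edges chosen to reproduce the rules above for integer ages:
# low: Age < 30, middle: 30 <= Age <= 60, high: Age > 60.
# pd.cut uses right-closed bins by default, i.e. (-inf, 29], (29, 60], (60, inf).
age_group = pd.cut(
    ages["Age"],
    bins=[-float("inf"), 29, 60, float("inf")],
    labels=["low", "middle", "high"],
)

# One-hot encode the categorical bins into boolean indicator columns.
dummies = pd.get_dummies(age_group, prefix="Age").astype(bool)
result = pd.concat([ages, dummies], axis=1)
print(result)
```

Each row ends up with exactly one `True` among `Age_low`, `Age_middle`, and `Age_high`, which is the property that makes this a one-hot encoding.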

In [ ]:

sns.countplot(data=df, x="Age_high", hue="Outcome")

Handling missing values

In [ ]:

df.isnull().sum()

Out[ ]:

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
Pregnancies_high            0
Age_low                     0
Age_middle                  0
Age_high                    0
dtype: int64

In [ ]:

df.describe()

Out[ ]:

	Pregnancies	Glucose	BloodPressure	SkinThickness	Insulin	BMI	DiabetesPedigreeFunction	Age	Outcome
count	768.000000	768.000000	768.000000	768.000000	768.000000	768.000000	768.000000	768.000000	768.000000
mean	3.845052	120.894531	69.105469	20.536458	79.799479	31.992578	0.471876	33.240885	0.348958
std	3.369578	31.972618	19.355807	15.952218	115.244002	7.884160	0.331329	11.760232	0.476951
min	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.078000	21.000000	0.000000
25%	1.000000	99.000000	62.000000	0.000000	0.000000	27.300000	0.243750	24.000000	0.000000
50%	3.000000	117.000000	72.000000	23.000000	30.500000	32.000000	0.372500	29.000000	0.000000
75%	6.000000	140.250000	80.000000	32.000000	127.250000	36.600000	0.626250	41.000000	1.000000
max	17.000000	199.000000	122.000000	99.000000	846.000000	67.100000	2.420000	81.000000	1.000000

In [ ]:

df["Insulin_nan"] = df["Insulin"].replace(0, np.nan)
df[["Insulin", "Insulin_nan"]].head()

Out[ ]:

	Insulin	Insulin_nan
0	0	NaN
1	0	NaN
2	0	NaN
3	94	94.0
4	168	168.0

In [ ]:

df["Insulin_nan"].isnull().sum()

Out[ ]:

374

In [ ]:

# Ratio of missing values
df["Insulin_nan"].isnull().mean()

Out[ ]:

0.4869791666666667

In [ ]:

df.groupby(["Outcome"])[["Insulin", "Insulin_nan"]].agg(["mean", "median"])

Out[ ]:

Outcome	Insulin mean	Insulin median	Insulin_nan mean	Insulin_nan median
0	68.792000	39	130.287879	102.5
1	100.335821	0	206.846154	169.5

In [ ]:

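# Fill the missing values with the Insulin_nan median of each Outcome group
# (102.5 for Outcome 0, 169.5 for Outcome 1, taken from the output above).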
df.loc[(df["Outcome"] == 0) & (df["Insulin_nan"].isnull()), "Insulin_nan"] = 102.5
df.loc[(df["Outcome"] == 1) & (df["Insulin_nan"].isnull()), "Insulin_nan"] = 169.5
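The two hard-coded medians above could also be computed on the fly. The following is a minimal sketch on made-up toy data (the `toy` frame and its values are assumptions for illustration, not the Pima data), using `groupby(...).transform("median")` so that every row receives the median of its own Outcome group.

```python
import numpy as np
import pandas as pd

# Made-up toy data; a 0 in "Insulin" stands for a missing measurement.
toy = pd.DataFrame({
    "Outcome": [0, 0, 0, 0, 1, 1, 1, 1],
    "Insulin": [0, 90, 110, 0, 0, 150, 190, 0],
})

# Treat zeros as missing, as done for the Insulin_nan column above.
toy["Insulin_nan"] = toy["Insulin"].replace(0, np.nan)

# Median of the observed values within each Outcome group, aligned to every row.
group_median = toy.groupby("Outcome")["Insulin_nan"].transform("median")

# Fill only the missing entries with the median of their own group.
toy["Insulin_filled"] = toy["Insulin_nan"].fillna(group_median)
print(toy)
```

Because the fill value depends on Outcome, this kind of imputation only works for rows whose label is already known.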

Making the distribution closer to normal

Skewness and kurtosis

In [ ]:

# sns.distplot(df.loc[df["Insulin"] > 0, "Insulin"])
# seaborn 0.11.0 or later
sns.displot(df.loc[df["Insulin"] > 0, "Insulin"], kde=True)

Insulin_log = np.log(df.loc[df["Insulin"] > 0, "Insulin"] + 1)
# sns.distplot(Insulin_log)
# seaborn 0.11.0 or later
sns.displot(Insulin_log, kde=True)

sns.displot(df, x="Insulin_nan", kde=True)

df["Insulin_log"] = np.log(df["Insulin_nan"] + 1)
sns.displot(df, x="Insulin_log", kde=True)
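The plots show the effect of the log transform visually. Skewness and kurtosis can also be checked numerically. The sketch below uses a made-up right-skewed sample (a lognormal draw whose parameters are arbitrary assumptions) rather than the actual insulin column, and compares the statistics before and after `log(x + 1)`.

```python
import numpy as np
import pandas as pd

# Made-up right-skewed sample standing in for the insulin values.
rng = np.random.default_rng(42)
insulin_like = pd.Series(rng.lognormal(mean=4.8, sigma=0.6, size=1000))

# np.log1p(x) computes log(1 + x), the same transform as np.log(x + 1) above.
insulin_log = np.log1p(insulin_like)

# Skewness near 0 indicates a roughly symmetric distribution.
# pandas reports excess kurtosis, which is near 0 for a normal distribution.
print("skew     before / after:", insulin_like.skew(), insulin_log.skew())
print("kurtosis before / after:", insulin_like.kurt(), insulin_log.kurt())
```

On the real data, the same two calls could be applied to `df["Insulin_nan"]` and `df["Insulin_log"]`.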

Creating derived features

We create derived features based on the correlation analysis done in the EDA.

In [ ]:

sns.lmplot(data=df, x="Insulin_nan", y="Glucose", hue="Outcome")

df["low_glu_insulin"] = (df["Glucose"] < 100) & (df["Insulin_nan"] <= 102.5)
df["low_glu_insulin"].head()

Out[ ]:

0    False
1     True
2    False
3     True
4    False
Name: low_glu_insulin, dtype: bool

In [ ]:

pd.crosstab(df["Outcome"], df["low_glu_insulin"])

Out[ ]:

low_glu_insulin	False	True
Outcome
0	332	168
1	263	5
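To see how strongly the new feature separates the classes, the counts can be turned into per-column shares. This is a small sketch that re-enters the four counts from the crosstab above by hand. The string column labels are a simplification for the example.

```python
import pandas as pd

# Counts copied from the crosstab above (rows: Outcome, columns: low_glu_insulin).
counts = pd.DataFrame(
    {"False": [332, 263], "True": [168, 5]},
    index=pd.Index([0, 1], name="Outcome"),
)
counts.columns.name = "low_glu_insulin"

# Share of each Outcome within each low_glu_insulin column.
share = counts / counts.sum(axis=0)
print(share.round(3))
```

Among the 173 rows where `low_glu_insulin` is True, only 5 have Outcome 1. On the original frame, `pd.crosstab(..., normalize="columns")` gives the same shares directly.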

Handling outliers

https://ko.wikipedia.org/wiki/%EC%83%81%EC%9E%90_%EC%88%98%EC%97%BC_%EA%B7%B8%EB%A6%BC

In [ ]:

plt.figure(figsize=(15, 2))
sns.boxplot(data=df, x="Insulin_nan")

df["Insulin_nan"].describe()

Out[ ]:

count    768.000000
mean     141.753906
std       89.100847
min       14.000000
25%      102.500000
50%      102.500000
75%      169.500000
max      846.000000
Name: Insulin_nan, dtype: float64

In [ ]:

# Despite their names, IQR3 and IQR1 hold the third and first quartiles (Q3 and Q1).
IQR3 = df["Insulin_nan"].quantile(0.75)
IQR1 = df["Insulin_nan"].quantile(0.25)
IQR = IQR3 - IQR1
IQR

Out[ ]:

67.0

In [ ]:

# Upper fence: values above Q3 + 1.5 * IQR are treated as outliers.
OUT = IQR3 + (IQR * 1.5)
OUT

Out[ ]:

270.0

In [ ]:

df[df["Insulin_nan"] > OUT].shape

Out[ ]:

(51, 16)

In [ ]:

df[df["Insulin_nan"] > 600].shape

Out[ ]:

(3, 16)
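The upper-fence calculation above can be wrapped in a small helper. The sketch below is illustrative: `iqr_upper_fence` is a hypothetical name, and the sample series is made up so that its quartiles equal the 102.5 and 169.5 reported by describe().

```python
import pandas as pd


def iqr_upper_fence(values: pd.Series, k: float = 1.5) -> float:
    """Return Q3 + k * IQR, the upper whisker limit of a box plot."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    return q3 + k * (q3 - q1)


# Made-up sample whose quartiles are Q1 = 102.5 and Q3 = 169.5.
sample = pd.Series([14.0, 102.5, 102.5, 102.5, 102.5, 169.5, 169.5, 300.0, 846.0])

fence = iqr_upper_fence(sample)
outliers = sample[sample > fence]
print(fence)
print(outliers.tolist())
```

With these quartiles the fence is 169.5 + 1.5 * 67.0 = 270.0, the same value as OUT above.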

Scaling

In [ ]:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(df[["Glucose", "DiabetesPedigreeFunction"]])
scale = scaler.transform(df[["Glucose", "DiabetesPedigreeFunction"]])
scale

Out[ ]:

array([[ 0.84832379,  0.46849198],
       [-1.12339636, -0.36506078],
       [ 1.94372388,  0.60439732],
       ...,
       [ 0.00330087, -0.68519336],
       [ 0.1597866 , -0.37110101],
       [-0.8730192 , -0.47378505]])
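As a sanity check, the first scaled Glucose value can be reproduced by hand from the describe() output. StandardScaler divides by the population standard deviation (ddof=0), while describe() reports the sample standard deviation (ddof=1), so the sketch below converts between the two. The numbers are copied from the outputs above.

```python
import math

n = 768                  # number of rows
mean = 120.894531        # Glucose mean from describe()
sample_std = 31.972618   # Glucose std from describe(), computed with ddof=1

# Convert the sample standard deviation to the population one (ddof=0).
population_std = sample_std * math.sqrt((n - 1) / n)

# Standardize the first Glucose value (148): subtract the mean, divide by the std.
z_first = (148 - mean) / population_std
print(round(z_first, 4))
```

The result should land very close to the 0.848 shown in the first row of the scaled array.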

In [ ]:

# df[["Glucose", "DiabetesPedigreeFunction"]] = scale
# df[["Glucose", "DiabetesPedigreeFunction"]].head()

In [ ]:

h = df[["Glucose", "DiabetesPedigreeFunction"]].hist(figsize=(15, 3))

Saving to a CSV file

In [ ]:

# Assumes the data/ directory already exists.
df.to_csv("data/diabetes_feature.csv", index=False)

In [ ]:

pd.read_csv("data/diabetes_feature.csv").head()

Out[ ]:

	Pregnancies	Glucose	BloodPressure	SkinThickness	Insulin	BMI	DiabetesPedigreeFunction	Age	Outcome	Pregnancies_high	Age_low	Age_middle	Age_high	Insulin_nan	Insulin_log	low_glu_insulin
0	6	148	72	35	0	33.6	0.627	50	1	False	False	True	False	169.5	5.138735	False
1	1	85	66	29	0	26.6	0.351	31	0	False	False	True	False	102.5	4.639572	True
2	8	183	64	0	0	23.3	0.672	32	1	True	False	True	False	169.5	5.138735	False
3	1	89	66	23	94	28.1	0.167	21	0	False	True	False	False	94.0	4.553877	True
4	0	137	40	35	168	43.1	2.288	33	1	False	False	True	False	168.0	5.129899	False

Splitting into training and test datasets

In [ ]:

# To split 8:2, compute the row position at 80% of the data and store it in split_count.
split_count = int(df.shape[0] * 0.8)
split_count

Out[ ]:

614

In [ ]:

# Split the data into train and test by slicing.
train = df[:split_count].copy()
train.shape

Out[ ]:

(614, 16)

In [ ]:

train[train["Insulin_nan"] < 600].shape

Out[ ]:

(610, 16)

In [ ]:

train = train[train["Insulin_nan"] < 600]
train.shape

Out[ ]:

(610, 16)

In [ ]:

test = df[split_count:].copy()
test.shape

Out[ ]:

(154, 16)
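The split above takes the first 80% of the rows in file order. If the row order were not random, one option would be to shuffle before slicing. The sketch below shows that variant on a made-up frame of the same length. It is an alternative under that assumption, not what the notebook itself does.

```python
import pandas as pd

# Made-up frame standing in for df; only the row count matters here.
toy = pd.DataFrame({"x": range(768), "Outcome": [0, 1] * 384})

# Shuffle all rows reproducibly, then slice at the 80% position as above.
shuffled = toy.sample(frac=1, random_state=42)
split_count = int(len(shuffled) * 0.8)

train_part = shuffled.iloc[:split_count].copy()
test_part = shuffled.iloc[split_count:].copy()
print(train_part.shape, test_part.shape)
```

The two parts keep their original index labels, so they can still be traced back to rows of the source frame.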

Columns used for training and prediction

In [ ]:

# Store the names of the columns used for training and prediction in feature_names.
feature_names = train.columns.tolist()
feature_names.remove("Pregnancies")
feature_names.remove("Outcome")
feature_names.remove("Age_low")
feature_names.remove("Age_middle")
feature_names.remove("Age_high")
feature_names.remove("Insulin")
feature_names.remove("Insulin_log")
feature_names

Out[ ]:

['Glucose',
 'BloodPressure',
 'SkinThickness',
 'BMI',
 'DiabetesPedigreeFunction',
 'Age',
 'Pregnancies_high',
 'Insulin_nan',
 'low_glu_insulin']
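The chain of `remove` calls can also be written as a single filter over the column list. The sketch below hard-codes the 16 column names from the saved CSV above so that it runs on its own. With the real frame, `columns` would be `train.columns`.

```python
# Column names in the order they appear in the saved CSV above.
columns = [
    "Pregnancies", "Glucose", "BloodPressure", "SkinThickness", "Insulin",
    "BMI", "DiabetesPedigreeFunction", "Age", "Outcome", "Pregnancies_high",
    "Age_low", "Age_middle", "Age_high", "Insulin_nan", "Insulin_log",
    "low_glu_insulin",
]

# Columns left out of the feature set: the label and the replaced variants.
excluded = {
    "Pregnancies", "Outcome", "Age_low", "Age_middle", "Age_high",
    "Insulin", "Insulin_log",
}

# Keep the original column order while dropping the excluded names.
feature_names = [col for col in columns if col not in excluded]
print(feature_names)
```

Unlike `list.remove`, this form does not raise an error when an excluded name is absent from the frame.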

The answer, i.e. the value to predict

In [ ]:

# Store the name of the column to predict in label_name.

label_name = "Outcome"
label_name

Out[ ]:

'Outcome'

Building the training and test datasets

In [ ]:

# Build the training set. Analogy: past exam questions.

X_train = train[feature_names]
print(X_train.shape)
X_train.head()

(610, 9)

Out[ ]:

	Glucose	BloodPressure	SkinThickness	BMI	DiabetesPedigreeFunction	Age	Pregnancies_high	Insulin_nan	low_glu_insulin
0	148	72	35	33.6	0.627	50	False	169.5	False
1	85	66	29	26.6	0.351	31	False	102.5	True
2	183	64	0	23.3	0.672	32	True	169.5	False
3	89	66	23	28.1	0.167	21	False	94.0	True
4	137	40	35	43.1	2.288	33	False	168.0	False

In [ ]:

# Build the answers. Analogy: the answer key for the past exam questions.
y_train = train[label_name]
print(y_train.shape)
y_train.head()

(610,)

Out[ ]:

0    1
1    0
2    1
3    0
4    1
Name: Outcome, dtype: int64

In [ ]:

# Build the dataset used for prediction. Analogy: the actual exam questions.

X_test = test[feature_names]
print(X_test.shape)
X_test.head()

(154, 9)

Out[ ]:

	Glucose	BloodPressure	SkinThickness	BMI	DiabetesPedigreeFunction	Age	Pregnancies_high	Insulin_nan	low_glu_insulin
614	138	74	26	36.1	0.557	50	True	144.0	False
615	106	72	0	25.8	0.207	27	False	102.5	False
616	117	96	0	28.7	0.157	30	False	102.5	False
617	68	62	13	20.1	0.257	23	False	15.0	True
618	112	82	24	28.2	1.282	50	True	169.5	False

In [ ]:

# Answers for the prediction. Analogy: the answer key for the actual exam.
y_test = test[label_name]

print(y_test.shape)
y_test.head()

(154,)

Out[ ]:

614    1
615    0
616    0
617    0
618    1
Name: Outcome, dtype: int64

Loading a machine learning algorithm

In [ ]:

from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier(random_state=42)
model

Out[ ]:

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=None, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=42, splitter='best')

Training

This is like studying for an exam by going over past questions (X_train) together with their answers (y_train).

In [ ]:

model.fit(X_train, y_train)

Out[ ]:

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=None, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=42, splitter='best')

Prediction

Think of X_test as the actual exam questions. This time the answers are predicted directly.

In [ ]:

y_predict = model.predict(X_test)
y_predict[:5]

Out[ ]:

array([1, 0, 0, 0, 1])

Analyzing the tree algorithm

We visualize the decision tree.

In [ ]:

from sklearn.tree import plot_tree

plt.figure(figsize=(20, 20))
tree = plot_tree(model,
                 feature_names=feature_names,
                 max_depth=4,
                 filled=True,
                 fontsize=10)

# Visualize the tree with graphviz.
# graphviz requires a separate installation.
# Two things must be installed: graphviz itself and the package that makes it usable from Python.
# import graphviz
# from sklearn.tree import export_graphviz

# dot_tree = export_graphviz(model,
#                            feature_names = feature_names,
#                            filled=True)
# graphviz.Source(dot_tree)

In [ ]:

# Extract the feature importances

model.feature_importances_

Out[ ]:

array([0.10720708, 0.03829317, 0.02739544, 0.08008031, 0.02662991,
       0.08272508, 0.        , 0.63283861, 0.0048304 ])

In [ ]:

# Visualize the feature importances

sns.barplot(x=model.feature_importances_, y=feature_names)
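The raw importance array is easier to read once each value is paired with its feature name and sorted. The sketch below re-enters the values printed above by hand so that it runs without the fitted model. With the model available, `model.feature_importances_` would take the place of the literal list.

```python
import pandas as pd

feature_names = [
    "Glucose", "BloodPressure", "SkinThickness", "BMI",
    "DiabetesPedigreeFunction", "Age", "Pregnancies_high",
    "Insulin_nan", "low_glu_insulin",
]

# Values copied from the feature_importances_ output above.
importances = [
    0.10720708, 0.03829317, 0.02739544, 0.08008031, 0.02662991,
    0.08272508, 0.0, 0.63283861, 0.0048304,
]

# Pair each importance with its feature name and sort from largest to smallest.
importance = pd.Series(importances, index=feature_names).sort_values(ascending=False)
print(importance)
```

Sorted this way, Insulin_nan stands out with about 0.63 of the total importance, and Pregnancies_high contributes nothing to this tree.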

Measuring accuracy

In [ ]:

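# Subtracting the predicted values from the actual values gives 0 wherever they match.
# After taking the absolute value, entries equal to 1 are the wrong predictions.
# 44 => 39 => 49 (age threshold 25) => 55 (age threshold 30)
# => 23 (insulin missing values replaced with the mean) => 16 (insulin missing values replaced with the median)
# => 15 (insulin & glucose derived feature added)
# => 15 (outliers with insulin of 600 or more removed)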
diff_count = abs(y_test - y_predict).sum()
diff_count

Out[ ]:

15

In [ ]:

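# Compute the prediction accuracy. Think of it as the score out of 100.
# 71 => => 85 (insulin missing values replaced with the mean) => 89 (insulin missing values replaced with the median)
# => 90 (derived feature created from the insulin & glucose correlation)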
(len(y_test) - diff_count) / len(y_test) * 100

Out[ ]:

90.25974025974025

In [ ]:

# We can compute it by hand as above, but here we use a ready-made implementation.

from sklearn.metrics import accuracy_score

accuracy_score(y_test, y_predict) * 100

Out[ ]:

90.25974025974025

In [ ]:

# Compute the score with the model's score method.
model.score(X_test, y_test) * 100

Out[ ]:

90.25974025974025
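All three calculations above give the same number because accuracy is simply the share of correct predictions. Here 15 of the 154 test rows are wrong, so 139 / 154 * 100 is about 90.26. The sketch below repeats the two manual routes on made-up labels and predictions (not the notebook's actual results) to show that they agree.

```python
import numpy as np

# Made-up labels and predictions, not the notebook's actual results.
y_true = np.array([1, 0, 0, 0, 1, 1, 0, 1])
y_pred = np.array([1, 0, 0, 1, 1, 0, 0, 1])

# For 0/1 labels, |y_true - y_pred| is 1 exactly where the prediction is wrong.
diff_count = np.abs(y_true - y_pred).sum()
accuracy_from_diff = (len(y_true) - diff_count) / len(y_true) * 100

# Equivalent route: the mean of the element-wise "is correct" mask.
accuracy_from_mask = (y_true == y_pred).mean() * 100

print(diff_count, accuracy_from_diff, accuracy_from_mask)
```

The absolute-difference trick relies on the labels being 0 and 1. The comparison mask works for any label encoding.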
