[Learning Data Science Through Projects] pima_diabetes

[Learning Data Science Through Projects] pima_diabetes_eda

2024. 6. 27. 02:13 · MOOC

Exploratory Data Analysis

Dataset source

Pima Indians Diabetes Database | Kaggle

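Dataset fields

Pregnancies : number of pregnancies
Glucose : plasma glucose concentration at 2 hours in an oral glucose tolerance test
BloodPressure : diastolic blood pressure (mm Hg)
SkinThickness : triceps skinfold thickness (mm), used to estimate body fat
Insulin : 2-hour serum insulin (mu U / ml)
BMI : body mass index (weight in kg / (height in m)^2)
DiabetesPedigreeFunction : diabetes pedigree function
Age : age in years
Outcome : class variable (0 or 1); 268 of the 768 rows are 1 and the rest are 0.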

Loading libraries

In [1]:

# Load pandas for data analysis, numpy for numerical computation,
# and seaborn and matplotlib.pyplot for visualization.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline

Loading data

In [2]:

df = pd.read_csv("C:/Users/82106/Desktop/데이터 분석 프로젝트 2/당뇨병 머신러닝/데이터/archive/diabetes.csv")
df.shape

Out[2]:

(768, 9)

In [3]:

# Preview the first five rows.
df.head()

Out[3]:

   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  DiabetesPedigreeFunction  Age  Outcome
0            6      148             72             35        0  33.6                     0.627   50        1
1            1       85             66             29        0  26.6                     0.351   31        0
2            8      183             64              0        0  23.3                     0.672   32        1
3            1       89             66             23       94  28.1                     0.167   21        0
4            0      137             40             35      168  43.1                     2.288   33        1

In [4]:

# Use info() to check data types, missing values, and memory usage.
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB

In [5]:

# Check for missing values.

df_null = df.isnull()
df_null.head()

Out[5]:

   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin    BMI  DiabetesPedigreeFunction    Age  Outcome
0        False    False          False          False    False  False                     False  False    False
1        False    False          False          False    False  False                     False  False    False
2        False    False          False          False    False  False                     False  False    False
3        False    False          False          False    False  False                     False  False    False
4        False    False          False          False    False  False                     False  False    False

In [6]:

df_null.sum()

Out[6]:

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64

In [7]:

# Look at summary statistics for the numeric columns.

df.describe()

Out[7]:

       Pregnancies     Glucose  BloodPressure  SkinThickness     Insulin         BMI  DiabetesPedigreeFunction         Age     Outcome
count   768.000000  768.000000     768.000000     768.000000  768.000000  768.000000                768.000000  768.000000  768.000000
mean      3.845052  120.894531      69.105469      20.536458   79.799479   31.992578                  0.471876   33.240885    0.348958
std       3.369578   31.972618      19.355807      15.952218  115.244002    7.884160                  0.331329   11.760232    0.476951
min       0.000000    0.000000       0.000000       0.000000    0.000000    0.000000                  0.078000   21.000000    0.000000
25%       1.000000   99.000000      62.000000       0.000000    0.000000   27.300000                  0.243750   24.000000    0.000000
50%       3.000000  117.000000      72.000000      23.000000   30.500000   32.000000                  0.372500   29.000000    0.000000
75%       6.000000  140.250000      80.000000      32.000000  127.250000   36.600000                  0.626250   41.000000    1.000000
max      17.000000  199.000000     122.000000      99.000000  846.000000   67.100000                  2.420000   81.000000    1.000000

In [8]:

# Outcome, the last column, is the label, so exclude it
# and collect the columns used for training and prediction
# in a variable named feature_columns.

feature_columns = df.columns[:-1].tolist()
feature_columns

Out[8]:

['Pregnancies',
 'Glucose',
 'BloodPressure',
 'SkinThickness',
 'Insulin',
 'BMI',
 'DiabetesPedigreeFunction',
 'Age']

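Visualizing missing values

The summary shows that several columns have a minimum of 0.

For some variables 0 is a legitimate value, but for measurements such as insulin or blood pressure a value of 0 can reasonably be read as a missing value.

So we treat 0 as missing and visualize the result.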

In [9]:

cols = feature_columns[1:]
cols

Out[9]:

['Glucose',
 'BloodPressure',
 'SkinThickness',
 'Insulin',
 'BMI',
 'DiabetesPedigreeFunction',
 'Age']

In [10]:

# Build a dataframe that flags missing values.
# Assuming 0 means missing, compute the flags for the columns other than
# the label (target) and store them in a dataframe named df_null.
df_null = df[cols].replace(0, np.nan)
df_null = df_null.isnull()
df_null.sum()

Out[10]:

Glucose                       5
BloodPressure                35
SkinThickness               227
Insulin                     374
BMI                          11
DiabetesPedigreeFunction      0
Age                           0
dtype: int64
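
As a cross-check on the counts above, the zeros can also be counted directly, without going through NaN. The sketch below uses a small made-up frame (`toy` and its values are illustrative assumptions, not rows taken from the dataset) and shows that both routes agree as long as the columns contain no real NaN values to begin with.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: zeros stand in for unrecorded measurements.
toy = pd.DataFrame({
    "Glucose": [148, 0, 183, 89],
    "Insulin": [0, 0, 94, 168],
    "BMI": [33.6, 26.6, 0.0, 28.1],
})

# Route 1 (as in the notebook): turn zeros into NaN, then count NaN.
via_nan = toy.replace(0, np.nan).isnull().sum()

# Route 2: count the zeros directly.
direct = (toy == 0).sum()

print(via_nan)
print(direct)
```

If a column already held genuine NaN values, route 1 would count them as well, so the two results would differ.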

In [11]:

df_null.mean() * 100

Out[11]:

Glucose                      0.651042
BloodPressure                4.557292
SkinThickness               29.557292
Insulin                     48.697917
BMI                          1.432292
DiabetesPedigreeFunction     0.000000
Age                          0.000000
dtype: float64

In [12]:

# Count the missing values and plot them as a horizontal bar chart.
df_null.sum().plot.barh()

# Visualize the missing values as a heatmap.
plt.figure(figsize=(15, 4))
sns.heatmap(df_null, cmap="Greys_r")

Target values

These are also called the target or the label.

In [14]:

# Count the values of the target, Outcome.

df["Outcome"].value_counts()

Out[14]:

Outcome
0    500
1    268
Name: count, dtype: int64

In [15]:

# Look at the proportions of the target, Outcome.

df["Outcome"].value_counts(normalize=True)

Out[15]:

Outcome
0    0.651042
1    0.348958
Name: proportion, dtype: float64

In [16]:

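# Look at the target together with another variable.
# Compare the number of pregnancies with the target.
# Group by "Pregnancies" and compute the Outcome rate (mean) and the group size.
# Store the result in a variable named df_po.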

df_po = df.groupby(["Pregnancies"])["Outcome"].agg(["mean", "count"]).reset_index()
df_po

Out[16]:

    Pregnancies      mean  count
0             0  0.342342    111
1             1  0.214815    135
2             2  0.184466    103
3             3  0.360000     75
4             4  0.338235     68
5             5  0.368421     57
6             6  0.320000     50
7             7  0.555556     45
8             8  0.578947     38
9             9  0.642857     28
10           10  0.416667     24
11           11  0.636364     11
12           12  0.444444      9
13           13  0.500000     10
14           14  1.000000      2
15           15  1.000000      1
16           17  1.000000      1
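
Some of the rates in this table rest on very few observations: the groups with 14, 15, and 17 pregnancies contain only one or two rows each, so a mean of 1.0 there says little. One way to avoid over-reading them is to keep only groups above a minimum size. The sketch below rebuilds a few rows of the table by hand; `df_po_sample` and the threshold `min_count = 10` are arbitrary illustrative choices, not something the original analysis uses.

```python
import pandas as pd

# A few rows copied from the table above (Pregnancies, mean, count).
df_po_sample = pd.DataFrame({
    "Pregnancies": [0, 7, 13, 14, 17],
    "mean": [0.342342, 0.555556, 0.500000, 1.000000, 1.000000],
    "count": [111, 45, 10, 2, 1],
})

# Hypothetical threshold: ignore groups with fewer rows than this.
min_count = 10
reliable = df_po_sample[df_po_sample["count"] >= min_count]

print(reliable)
```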

In [17]:

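# Diabetes rate by number of pregnancies.
# Use Pregnancies as the x axis; the default positional index (0-16)
# would label the 17-pregnancy group as 16.
df_po.set_index("Pregnancies")["mean"].plot.bar(rot=0)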

countplot

In [18]:

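# Revisit the diabetes rate computed above,
# this time comparing the frequency of each outcome.

sns.countplot(data=df, x="Outcome")

# Compare outcome frequencies by number of pregnancies.

sns.countplot(data=df, x="Pregnancies", hue="Outcome")

# Create a Pregnancies_high variable that flags a high number of pregnancies (more than 6).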

df["Pregnancies_high"] = df["Pregnancies"] > 6
df[["Pregnancies", "Pregnancies_high"]].head()

Out[20]:

   Pregnancies  Pregnancies_high
0            6             False
1            1             False
2            8              True
3            1             False
4            0             False

In [21]:

# Draw the frequency of Pregnancies_high as a countplot,
# using a different color for each Outcome value.

sns.countplot(data=df, x="Pregnancies_high", hue="Outcome")

barplot

With the default settings, the height of each bar is the estimated mean of the y variable for that group.
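
To see what the bar height corresponds to, the same numbers can be reproduced with a plain `groupby`. The sketch below uses a small made-up frame (`toy` is an assumption for illustration only). seaborn additionally draws an error bar around each mean, which by default is a bootstrapped 95% confidence interval.

```python
import pandas as pd

# Hypothetical toy data with two outcome groups.
toy = pd.DataFrame({
    "Outcome": [0, 0, 0, 1, 1],
    "BMI": [26.6, 28.1, 25.6, 33.6, 43.1],
})

# The height of each bar in sns.barplot(data=toy, x="Outcome", y="BMI")
# corresponds to the per-group mean computed here.
bar_heights = toy.groupby("Outcome")["BMI"].mean()
print(bar_heights)
```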

In [22]:

# Compare BMI by diabetes outcome.

sns.barplot(data=df, x="Outcome", y="BMI")

# Compare glucose (Glucose) levels by diabetes outcome.

sns.barplot(data=df, x="Outcome", y="Glucose")

# Compare insulin by diabetes outcome, using only observations with Insulin greater than 0.

sns.barplot(data=df[df["Insulin"] > 0], x="Outcome", y="Insulin")

# Compare the diabetes rate by number of pregnancies.

sns.barplot(data=df, x="Pregnancies", y="Outcome")

# Visualize glucose (Glucose) by number of pregnancies (Pregnancies), split by diabetes outcome (Outcome).

sns.barplot(data=df, x="Pregnancies", y="Glucose", hue="Outcome")

# Visualize body mass index (BMI) by number of pregnancies (Pregnancies), split by diabetes outcome (Outcome).

sns.barplot(data=df, x="Pregnancies", y="BMI", hue="Outcome")

# Visualize insulin (Insulin) by number of pregnancies (Pregnancies), split by diabetes outcome (Outcome).
# Insulin has many missing values, so plot only the values greater than 0.

sns.barplot(data=df[df["Insulin"] > 0],
            x="Pregnancies", y="Insulin", hue="Outcome")

boxplot

In [29]:

# Visualize insulin (Insulin) by number of pregnancies (Pregnancies), split by diabetes outcome (Outcome).
# Insulin has many missing values, so plot only the values greater than 0.

sns.boxplot(data=df[df["Insulin"] > 0],
            x="Pregnancies", y="Insulin", hue="Outcome")

violinplot

In [30]:

# Draw the plot above as a violinplot.
plt.figure(figsize=(15, 4))
sns.violinplot(data=df[df["Insulin"] > 0],
            x="Pregnancies", y="Insulin", hue="Outcome", split=True)

swarmplot

In [31]:

# Draw the plot above as a swarmplot.

plt.figure(figsize=(15, 4))
sns.swarmplot(data=df[df["Insulin"] > 0],
            x="Pregnancies", y="Insulin", hue="Outcome")

c:\Users\82106\AppData\Local\Programs\Python\Python310\lib\site-packages\seaborn\categorical.py:3399: UserWarning: 5.3% of the points cannot be placed; you may want to decrease the size of the markers or use stripplot.
  warnings.warn(msg, UserWarning)
c:\Users\82106\AppData\Local\Programs\Python\Python310\lib\site-packages\seaborn\categorical.py:3399: UserWarning: 26.6% of the points cannot be placed; you may want to decrease the size of the markers or use stripplot.
  warnings.warn(msg, UserWarning)
c:\Users\82106\AppData\Local\Programs\Python\Python310\lib\site-packages\seaborn\categorical.py:3399: UserWarning: 17.2% of the points cannot be placed; you may want to decrease the size of the markers or use stripplot.
  warnings.warn(msg, UserWarning)

Out[31]:

<Axes: xlabel='Pregnancies', ylabel='Insulin'>

c:\Users\82106\AppData\Local\Programs\Python\Python310\lib\site-packages\seaborn\categorical.py:3399: UserWarning: 15.6% of the points cannot be placed; you may want to decrease the size of the markers or use stripplot.
  warnings.warn(msg, UserWarning)

distplot

In [32]:

df_0 = df[df["Outcome"] == 0]
df_1 = df[df["Outcome"] == 1]
df_0.shape, df_1.shape

Out[32]:

((500, 10), (268, 10))

In [33]:

# Visualize the distribution of the number of pregnancies for each diabetes outcome.

sns.distplot(df_0["Pregnancies"])
sns.distplot(df_1["Pregnancies"])

C:\Users\82106\AppData\Local\Temp\ipykernel_7008\906901944.py:3: UserWarning: 

`distplot` is a deprecated function and will be removed in seaborn v0.14.0.

Please adapt your code to use either `displot` (a figure-level function with
similar flexibility) or `histplot` (an axes-level function for histograms).

For a guide to updating your code to use the new functions, please see
https://gist.github.com/mwaskom/de44147ed2974457ad6372750bbe5751

  sns.distplot(df_0["Pregnancies"])

# Visualize the age distribution for each diabetes outcome.

sns.distplot(df_0["Age"], hist=False, rug=True, label=0)
sns.distplot(df_1["Age"], hist=False, rug=True, label=1)

C:\Users\82106\AppData\Local\Temp\ipykernel_7008\2424226668.py:3: UserWarning: 

`distplot` is a deprecated function and will be removed in seaborn v0.14.0.

Please adapt your code to use either `displot` (a figure-level function with
similar flexibility) or `kdeplot` (an axes-level function for kernel density plots).

For a guide to updating your code to use the new functions, please see
https://gist.github.com/mwaskom/de44147ed2974457ad6372750bbe5751

  sns.distplot(df_0["Age"], hist=False, rug=True, label=0)

Subplots

Drawing histograms with pandas

With pandas, histograms of all numeric variables are drawn as subplots in a single call.

In [35]:

df["Pregnancies_high"] = df["Pregnancies_high"].astype(int)
h = df.hist(figsize=(15, 15), bins=20)

Drawing subplots with a loop

distplot

In [36]:

# Loop over the columns to draw one subplot per column.
cols = df.columns[:-1].tolist()
cols

Out[36]:

['Pregnancies',
 'Glucose',
 'BloodPressure',
 'SkinThickness',
 'Insulin',
 'BMI',
 'DiabetesPedigreeFunction',
 'Age',
 'Outcome']

In [37]:

# Draw the subplots with distplot.

fig, axes = plt.subplots(nrows=3, ncols=3, figsize=(15, 15))

for i, col_name in enumerate(cols):
    row = i // 3
    col = i % 3
    sns.distplot(df[col_name], ax=axes[row][col])
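
The `row = i // 3` and `col = i % 3` arithmetic maps a running index onto a grid position. A minimal sketch of the same mapping with `divmod`, using the nine column names listed in `cols` above and no plotting (`positions` is a name introduced here for illustration):

```python
cols = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
        "Insulin", "BMI", "DiabetesPedigreeFunction", "Age", "Outcome"]
ncols = 3

# divmod(i, ncols) returns (i // ncols, i % ncols) in one call.
positions = {name: divmod(i, ncols) for i, name in enumerate(cols)}

for name, (row, col) in positions.items():
    print(f"{name:<26} -> axes[{row}][{col}]")
```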

df_0

Out[38]:

     Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  DiabetesPedigreeFunction  Age  Outcome  Pregnancies_high
1              1       85             66             29        0  26.6                     0.351   31        0             False
3              1       89             66             23       94  28.1                     0.167   21        0             False
5              5      116             74              0        0  25.6                     0.201   30        0             False
7             10      115              0              0        0  35.3                     0.134   29        0              True
10             4      110             92              0        0  37.6                     0.191   30        0             False
..           ...      ...            ...            ...      ...   ...                       ...  ...      ...               ...
762            9       89             62              0        0  22.5                     0.142   33        0              True
763           10      101             76             48      180  32.9                     0.171   63        0              True
764            2      122             70             27        0  36.8                     0.340   27        0             False
765            5      121             72             23      112  26.2                     0.245   30        0             False
767            1       93             70             31        0  30.4                     0.315   23        0             False

500 rows × 10 columns

In [39]:

# Draw a distplot of every variable, overlaying the two outcome groups.

fig, axes = plt.subplots(nrows=4, ncols=2, figsize=(15, 15))

for i, col_name in enumerate(cols[:-1]):
    row = i // 2
    col = i % 2
    sns.distplot(df_0[col_name], ax=axes[row][col])
    sns.distplot(df_1[col_name], ax=axes[row][col])

violinplot

In [40]:

# Draw the subplots with violinplot.


fig, axes = plt.subplots(nrows=4, ncols=2, figsize=(15, 15))

for i, col_name in enumerate(cols[:-1]):
    row = i // 2
    col = i % 2
    sns.violinplot(data=df, x="Outcome", y=col_name, ax=axes[row][col])

lmplot

Visualize two variables that have a high correlation coefficient.

In [41]:

# Plot Glucose against Insulin, split by Outcome.

sns.lmplot(data=df, x="Glucose", y="Insulin", hue="Outcome")

# Plot again using only rows with Insulin greater than 0.

sns.lmplot(data=df[df["Insulin"] > 0], x="Glucose", y="Insulin", hue="Outcome")

pairplot

In [43]:

# Use PairGrid to draw scatterplots of every pair of variables, colored by Outcome.

g = sns.PairGrid(df, hue="Outcome")
g.map(plt.scatter)

Correlation analysis

Correlation analysis (상관 분석) - Korean Wikipedia

In [44]:

df_matrix = df.iloc[:, :-2].replace(0, np.nan)
df_matrix["Outcome"] = df["Outcome"]
df_matrix.head()

Out[44]:

   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  DiabetesPedigreeFunction  Age  Outcome
0          6.0    148.0           72.0           35.0      NaN  33.6                     0.627   50        1
1          1.0     85.0           66.0           29.0      NaN  26.6                     0.351   31        0
2          8.0    183.0           64.0            NaN      NaN  23.3                     0.672   32        1
3          1.0     89.0           66.0           23.0     94.0  28.1                     0.167   21        0
4          NaN    137.0           40.0           35.0    168.0  43.1                     2.288   33        1
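
Replacing zeros across every feature also marks `Pregnancies == 0` as missing (see the last row of the preview above), even though zero pregnancies is a plausible value. If that is not intended, one alternative is to limit the replacement to columns where zero is implausible. The sketch below uses a made-up frame, and `zero_as_missing` is a hypothetical column list chosen for illustration; the correlation values reported in this post come from the original code, not from this variant.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data shaped like a few columns of the dataset.
toy = pd.DataFrame({
    "Pregnancies": [6, 1, 0],
    "Glucose": [148, 85, 137],
    "Insulin": [0, 0, 168],
    "Outcome": [1, 0, 1],
})

# Hypothetical choice of columns in which 0 is read as "not measured".
zero_as_missing = ["Glucose", "Insulin"]

toy_matrix = toy.copy()
toy_matrix[zero_as_missing] = toy_matrix[zero_as_missing].replace(0, np.nan)
print(toy_matrix)
```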

In [45]:

# df_matrix has zeros marked as missing in the feature columns (everything except the Outcome label).
# Compute the correlation coefficients.

df_corr = df_matrix.corr()
df_corr.style.background_gradient()

Out[45]:

                          Pregnancies   Glucose  BloodPressure  SkinThickness   Insulin       BMI  DiabetesPedigreeFunction       Age   Outcome
Pregnancies                  1.000000  0.166329       0.285013       0.167298  0.104081  0.128207                 -0.006459  0.550525  0.268218
Glucose                      0.166329  1.000000       0.223192       0.228043  0.581186  0.232771                  0.137246  0.267136  0.494650
BloodPressure                0.285013  0.223192       1.000000       0.226839  0.098272  0.289230                 -0.002805  0.330107  0.170589
SkinThickness                0.167298  0.228043       0.226839       1.000000  0.184888  0.648214                  0.115016  0.166816  0.259491
Insulin                      0.104081  0.581186       0.098272       0.184888  1.000000  0.228050                  0.130395  0.220261  0.303454
BMI                          0.128207  0.232771       0.289230       0.648214  0.228050  1.000000                  0.155382  0.025841  0.313680
DiabetesPedigreeFunction    -0.006459  0.137246      -0.002805       0.115016  0.130395  0.155382                  1.000000  0.033561  0.173844
Age                          0.550525  0.267136       0.330107       0.166816  0.220261  0.025841                  0.033561  1.000000  0.238356
Outcome                      0.268218  0.494650       0.170589       0.259491  0.303454  0.313680                  0.173844  0.238356  1.000000

In [46]:

# Visualize the correlation coefficients computed above as a heatmap.
plt.figure(figsize=(15, 8))
sns.heatmap(df_corr, annot=True, vmax=1, vmin=-1, cmap="coolwarm")

# Look only at the correlation coefficients with Outcome.

df_corr["Outcome"]

Out[47]:

Pregnancies                 0.268218
Glucose                     0.494650
BloodPressure               0.170589
SkinThickness               0.259491
Insulin                     0.303454
BMI                         0.313680
DiabetesPedigreeFunction    0.173844
Age                         0.238356
Outcome                     1.000000
Name: Outcome, dtype: float64
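
To read this list more easily, the features can be ranked by their correlation with the target. The sketch below copies the values from the output above into a Series (`outcome_corr` and `ranked` are names introduced here for illustration) so that it runs without the dataset; with the notebook loaded, the same two method calls could be applied to `df_corr["Outcome"]` directly.

```python
import pandas as pd

# Values copied from the Outcome column of the correlation matrix above.
outcome_corr = pd.Series({
    "Pregnancies": 0.268218,
    "Glucose": 0.494650,
    "BloodPressure": 0.170589,
    "SkinThickness": 0.259491,
    "Insulin": 0.303454,
    "BMI": 0.313680,
    "DiabetesPedigreeFunction": 0.173844,
    "Age": 0.238356,
    "Outcome": 1.000000,
}, name="Outcome")

# Drop the self-correlation and rank the features from high to low.
ranked = outcome_corr.drop("Outcome").sort_values(ascending=False)
print(ranked)
```

Glucose, BMI, and Insulin come out on top, in that order.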

Looking at highly correlated variables together

In [48]:

# Draw a regplot of Insulin and Glucose.
sns.regplot(data=df, x="Insulin", y="Glucose")

# Draw the same regplot using df_matrix,
# the dataframe in which zeros are treated as missing.

sns.regplot(data=df_matrix, x="Insulin", y="Glucose")

sns.lmplot(data=df_matrix, x="Insulin", y="Glucose", hue="Outcome")

# Draw a regplot of Age and Pregnancies.

sns.regplot(data=df, x="Age", y="Pregnancies")

# Draw an lmplot of Age and Pregnancies, with a separate color and panel for each Outcome.

sns.lmplot(data=df, x="Age", y="Pregnancies", hue="Outcome", col="Outcome")
