Key points: EDA / missing-value visualization / regression

01. Data Acquisition

https://www.kaggle.com/datasets/simranjain17/insurance

In [30]:

# Required libraries
import pandas as pd

import seaborn as sns
import matplotlib.pyplot as plt

import missingno

import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler

from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

from sklearn.metrics import r2_score

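02. Data EDA and Preprocessing

What question are we trying to answer or disprove?
Are there duplicate entries?
What kinds of data are there, and how will we handle the different data types?
Is anything missing from the data, and if so, how will we handle it?
Where are the outliers? Are they data we should care about?
Are the (numeric) variables correlated?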

In [31]:

# Practice data: insurance.csv
# import pandas as pd
filepath = "C:/Users/KANG/PYTHON/py_study/py_proj/data_proj/"
data = pd.read_csv("".join([filepath, "insurance.csv"]))

In [32]:

data.head()

Out[32]:

	age	sex	bmi	children	smoker	region	charges
0	19	female	27.900	0	yes	southwest	16884.92400
1	18	male	33.770	1	no	southeast	1725.55230
2	28	male	33.000	3	no	southeast	4449.46200
3	33	male	22.705	0	no	northwest	21984.47061
4	32	male	28.880	0	no	northwest	3866.85520

1. What question are we trying to answer or disprove?

We want to build a model that predicts insurance premiums from the insurer's customer information.

In [33]:

print(data.shape)

(1338, 7)

In [34]:

# First 15 rows
print(data.head(15))

    age     sex     bmi  children smoker     region      charges
0    19  female  27.900         0    yes  southwest  16884.92400
1    18    male  33.770         1     no  southeast   1725.55230
2    28    male  33.000         3     no  southeast   4449.46200
3    33    male  22.705         0     no  northwest  21984.47061
4    32    male  28.880         0     no  northwest   3866.85520
5    31  female  25.740         0     no  southeast   3756.62160
6    46  female  33.440         1     no  southeast   8240.58960
7    37  female  27.740         3     no  northwest   7281.50560
8    37    male  29.830         2     no  northeast   6406.41070
9    60  female  25.840         0     no  northwest  28923.13692
10   25    male  26.220         0     no  northeast   2721.32080
11   62  female  26.290         0    yes  southeast  27808.72510
12   23    male  34.400         0     no  southwest   1826.84300
13   56  female  39.820         0     no  southeast  11090.71780
14   27    male  42.130         0    yes  southeast  39611.75770

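Is there any column, such as a customer ID, that is clearly unrelated to the premium?
Is there any column whose meaning is unclear?
Is anything written as an abbreviation or in jargon?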

2. Are there duplicate entries?

In [35]:

# Number of duplicate rows
# df.duplicated() returns a boolean Series
print("Number of duplicate rows: ", len(data[data.duplicated()]))

Number of duplicate rows:  1

In [36]:

# Inspect the duplicate rows
# sort_values(by='column(s) to sort on')
print(data[data.duplicated(keep=False)].sort_values(by=list(data.columns)).head())

     age   sex    bmi  children smoker     region    charges
195   19  male  30.59         0     no  northwest  1639.5631
581   19  male  30.59         0     no  northwest  1639.5631

In [37]:

# Remove duplicate rows
# df.drop_duplicates(inplace=True)
data.drop_duplicates(inplace=True, keep ='first', ignore_index = True)

In [38]:

print(list(data.loc[195]))

[19, 'male', 30.59, 0, 'no', 'northwest', 1639.5631]
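To make the `keep` argument concrete, here is a minimal sketch on a made-up three-row frame. The `toy` data is hypothetical and not taken from insurance.csv; it only illustrates how `duplicated()` and `drop_duplicates()` flag and remove rows.

```python
import pandas as pd

# Hypothetical toy frame: rows 0 and 2 are identical.
toy = pd.DataFrame({"age": [19, 25, 19], "bmi": [30.59, 22.0, 30.59]})

# Default keep='first': only the later copy is flagged as a duplicate.
flag_later = toy.duplicated().tolist()

# keep=False: every member of a duplicate group is flagged,
# which is handy for inspecting the copies side by side.
flag_all = toy.duplicated(keep=False).tolist()

# Dropping duplicates keeps the first copy; ignore_index renumbers the rows.
deduped = toy.drop_duplicates(keep="first", ignore_index=True)

print(flag_later)
print(flag_all)
print(deduped)
```

Because `ignore_index=True` renumbers the rows, index labels after the removed row shift down by one.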

3. What kinds of data are there, and how will we handle the different data types?

Check the total number of columns and the data type of each column

In [39]:

# Check column names and data types
print(data.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1337 entries, 0 to 1336
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1337 non-null   int64  
 1   sex       1337 non-null   object 
 2   bmi       1337 non-null   float64
 3   children  1337 non-null   int64  
 4   smoker    1337 non-null   object 
 5   region    1337 non-null   object 
 6   charges   1337 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.2+ KB
None

In [40]:

# Count columns per data type: dtypes
# "per X" -> groupby
# aggregate -> agg('func')
dtype_data = data.dtypes.reset_index()
dtype_data.columns = ["Count", "Column Type"]
dtype_data = dtype_data.groupby('Column Type').agg('count').reset_index()
print(dtype_data)

  Column Type  Count
0       int64      2
1     float64      2
2      object      3
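The same per-dtype count can probably be obtained more directly with `value_counts()`. This is a sketch on a hypothetical three-column frame (`toy` is made up for illustration), not a replacement for the cell above.

```python
import pandas as pd

# Hypothetical toy frame with one integer, one string, and one float column.
toy = pd.DataFrame({
    "age": [19, 18],
    "sex": ["female", "male"],
    "bmi": [27.9, 33.77],
})

# dtypes is a Series of dtype objects; converting them to strings
# lets value_counts() tally how many columns share each dtype.
dtype_counts = toy.dtypes.astype(str).value_counts()
print(dtype_counts)
```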

Are any of the numeric columns actually categorical?
Are there categorical variables?

In [41]:

# Number of unique values per categorical variable
# Count unique values: nunique()
print(data.select_dtypes(include=['object','category']).nunique())

sex       2
smoker    2
region    4
dtype: int64

In [42]:

# Plot the counts for each categorical variable
# Columns of a given type: select_dtypes(include= or exclude=).columns
# import seaborn as sns
# import matplotlib.pyplot as plt
category_data = data.select_dtypes(include=['object','category']).columns
for col in category_data:
    fig = sns.catplot(x=col, kind="count", data = data, hue = None)
    fig.set_xticklabels(rotation=90)
    plt.show()

Sex and smoking status each have only two categories, so we use LabelEncoder for them; for region we use OneHotEncoder.

Converting Categorical Variables

In [43]:

data.head(2)

Out[43]:

	age	sex	bmi	children	smoker	region	charges
0	19	female	27.90	0	yes	southwest	16884.9240
1	18	male	33.77	1	no	southeast	1725.5523

In [44]:

## Use sklearn's LabelEncoder and OneHotEncoder
## LabelEncoder: maps each category to a different integer
## Convert the sex and smoker columns to ndarrays for label encoding.
# import numpy as np
# from sklearn.preprocessing import OneHotEncoder
# from sklearn.preprocessing import LabelEncoder

sex = data.iloc[:,1:2].values
smoker = data.iloc[:,4:5].values

In [45]:

### Sex ###
# 1. Create the LabelEncoder() object
le = LabelEncoder()

# 2. Pass the sex column to LabelEncoder's fit_transform.
sex[:,0] = le.fit_transform(sex[:,0])
sex = pd.DataFrame(sex)
sex.columns = ['sex']
print(sex)

# 3. Build a dict of the category-to-label mapping
le_sex_mapping = dict(zip(le.classes_, le.transform(le.classes_)))
print("Label encoding for sex: ")
print(le_sex_mapping)

     sex
0      0
1      1
2      1
3      1
4      1
...   ..
1332   1
1333   0
1334   0
1335   0
1336   0

[1337 rows x 1 columns]
Label encoding for sex: 
{'female': 0, 'male': 1}

In [46]:

### Smoking status ###
# 1. Create the LabelEncoder() object
le = LabelEncoder()

# 2. Pass the smoker column to LabelEncoder's fit_transform.
smoker[:,0] = le.fit_transform(smoker[:,0])
smoker = pd.DataFrame(smoker)
smoker.columns = ['smoker']
print(smoker)

# 3. Build a dict of the category-to-label mapping
le_smoker_mapping = dict(zip(le.classes_, le.transform(le.classes_)))
print("Label encoding for smoking status: ")
print(le_smoker_mapping)

     smoker
0         1
1         0
2         0
3         0
4         0
...     ...
1332      0
1333      0
1334      0
1335      0
1336      1

[1337 rows x 1 columns]
Label encoding for smoking status: 
{'no': 0, 'yes': 1}

In [47]:

## OneHotEncoder: gives each category its own 0/1 indicator column
## Convert the region column to an ndarray for one-hot encoding
region = data.iloc[:,5:6].values

### Region ###
# 1. Create the OneHotEncoder() object
ohe = OneHotEncoder() 

# 2. Pass region to OneHotEncoder's fit_transform
region = ohe.fit_transform(region).toarray()
region = pd.DataFrame(region)
region.columns = ['northeast', 'northwest', 'southeast', 'southwest']
print("One-hot encoding result for region: ")  
print(region[:10])

One-hot encoding result for region: 
   northeast  northwest  southeast  southwest
0        0.0        0.0        0.0        1.0
1        0.0        0.0        1.0        0.0
2        0.0        0.0        1.0        0.0
3        0.0        1.0        0.0        0.0
4        0.0        1.0        0.0        0.0
5        0.0        0.0        1.0        0.0
6        0.0        0.0        1.0        0.0
7        0.0        1.0        0.0        0.0
8        1.0        0.0        0.0        0.0
9        0.0        1.0        0.0        0.0
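As an alternative to encoding each column by hand, the three categorical columns could be handled in one step with scikit-learn's `ColumnTransformer`. This is only a sketch under assumptions: the four-row `toy` frame is hypothetical, and `OneHotEncoder(drop="if_binary")` is swapped in for `LabelEncoder` because it yields the same single 0/1 column for two-category features.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy rows shaped like the insurance data.
toy = pd.DataFrame({
    "age": [19, 18, 28, 33],
    "sex": ["female", "male", "male", "male"],
    "smoker": ["yes", "no", "no", "no"],
    "region": ["southwest", "southeast", "southeast", "northwest"],
})

encoder = ColumnTransformer(
    transformers=[
        # Two-category columns collapse to one 0/1 column each.
        ("binary", OneHotEncoder(drop="if_binary"), ["sex", "smoker"]),
        # Region keeps one indicator column per category seen in the data.
        ("region", OneHotEncoder(), ["region"]),
    ],
    remainder="passthrough",  # leave numeric columns such as age untouched
    sparse_threshold=0,       # always return a dense array
)

# Column order: sex, smoker, region indicators (alphabetical), then age.
encoded = encoder.fit_transform(toy)
print(encoded)
```

Fitting the encoder once and reusing it with `transform` on new rows keeps the category-to-column mapping consistent between training and prediction.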

4. Is anything missing from the data, and if so, how will we handle it?

Find the columns that contain NULL values -> fill them with each column's mean (imputation or interpolation)

In [48]:

# Count how many NULL values each column contains
count_nan = data.isnull().sum()
print(count_nan[count_nan > 0])

Series([], dtype: int64)

In [49]:

# Missing-value visualization 1: missingno package
# import missingno
missingno.matrix(data, figsize = (30,10))

Out[49]:

<AxesSubplot: >

In [50]:

# Missing-value visualization 2: seaborn heatmap
sns.heatmap(data.isnull(), cbar = False, yticklabels=False, cmap = 'viridis')

Out[50]:

<AxesSubplot: >

In [51]:

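# Replace NULL values with the column mean
# fillna(value) returns the filled column, so assign it back
# (this dataset has no missing values, so nothing changes here)
data['bmi'] = data['bmi'].fillna(data['bmi'].mean())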
print(data.head(15))

    age     sex     bmi  children smoker     region      charges
0    19  female  27.900         0    yes  southwest  16884.92400
1    18    male  33.770         1     no  southeast   1725.55230
2    28    male  33.000         3     no  southeast   4449.46200
3    33    male  22.705         0     no  northwest  21984.47061
4    32    male  28.880         0     no  northwest   3866.85520
5    31  female  25.740         0     no  southeast   3756.62160
6    46  female  33.440         1     no  southeast   8240.58960
7    37  female  27.740         3     no  northwest   7281.50560
8    37    male  29.830         2     no  northeast   6406.41070
9    60  female  25.840         0     no  northwest  28923.13692
10   25    male  26.220         0     no  northeast   2721.32080
11   62  female  26.290         0    yes  southeast  27808.72510
12   23    male  34.400         0     no  southwest   1826.84300
13   56  female  39.820         0     no  southeast  11090.71780
14   27    male  42.130         0    yes  southeast  39611.75770

In [52]:

# Verify
count_nan = data.isnull().sum()
print(count_nan[count_nan > 0])

# Verify visually
missingno.matrix(data, figsize=(30,10))

Series([], dtype: int64)

Out[52]:

<AxesSubplot: >
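Since insurance.csv has no missing values, the mean-fill step has nothing to do here. The following sketch shows the same find-then-fill steps on a hypothetical frame that does contain a NaN (the `toy` values are invented for illustration).

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing bmi value.
toy = pd.DataFrame({"age": [19, 18, 28], "bmi": [27.9, np.nan, 33.0]})

# Step 1: count missing values per column and keep only affected columns.
missing_before = toy.isnull().sum()
print(missing_before[missing_before > 0])

# Step 2: fill the gaps with the column mean (NaN is ignored by mean()).
toy["bmi"] = toy["bmi"].fillna(toy["bmi"].mean())

# Step 3: verify that nothing is missing any more.
missing_after = toy.isnull().sum().sum()
print(toy)
```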

5. Where are the outliers? Are they data we should care about?

Check summary statistics for each numeric column

In [53]:

# Summary statistics per column
display(data.describe().T)

	count	mean	std	min	25%	50%	75%	max
age	1337.0	39.222139	14.044333	18.0000	27.000	39.0000	51.00000	64.00000
bmi	1337.0	30.663452	6.100468	15.9600	26.290	30.4000	34.70000	53.13000
children	1337.0	1.095737	1.205571	0.0000	0.000	1.0000	2.00000	5.00000
charges	1337.0	13279.121487	12110.359656	1121.8739	4746.344	9386.1613	16657.71745	63770.42801

In [54]:

# Histogram of a single column
data.age.plot.hist()

Out[54]:

<AxesSubplot: ylabel='Frequency'>

In [55]:

import scipy # scientific computing package
scipy.__version__

Out[55]:

'1.9.3'

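Checking the skewness of the numeric data - a hint at outliers

: a skewed distribution means many points that could be treated as outliers lie far from where the bulk of the data is concentrated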
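Skewness can also be checked numerically rather than only by eye. A minimal sketch with pandas' `Series.skew()` on two invented series (both hypothetical, chosen so that one is symmetric and the other has a long right tail):

```python
import pandas as pd

# Hypothetical values: one symmetric series, one with a long right tail.
symmetric = pd.Series([1, 2, 3, 4, 5])
right_skewed = pd.Series([1, 1, 1, 2, 10])

# skew() is about 0 for symmetric data and positive for a right tail.
skew_symmetric = symmetric.skew()
skew_right = right_skewed.skew()

print(skew_symmetric, skew_right)
# With a right tail the mean is pulled above the median.
print(right_skewed.mean(), right_skewed.median())
```

On the insurance data the same call would be `numeric_data.skew()`, assuming `numeric_data` holds the numeric columns.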

In [56]:

# Columns whose dtype is np.number
numeric_data = data.select_dtypes(include=np.number)

# Names of the np.number columns
l = numeric_data.columns.values
number_of_columns = 4

# Histogram per column
# for i in range(0,len(l)):
#     sns.displot(numeric_data[l[i]],kde=True) # kde : kernel density

In [57]:

# Histogram per column
# select the columns to be plotted
cols = ['age', 'bmi', 'children', 'charges']

# create the figure and axes
fig, axes = plt.subplots(1, 4, figsize= (20,10))
axes = axes.ravel()  # flattening the array makes indexing easier

for col, ax in zip(cols, axes):
    sns.histplot(data=numeric_data[col], kde=True, stat='count', ax=ax)

# fig.tight_layout()
plt.show()

Box plots of the numeric data - checking for outliers

In [58]:

# Get the columns whose dtype is np.number
# enumerate(): yields the index and the element together
columns = data.select_dtypes(include=np.number).columns
figure = plt.figure(figsize=(20, 10))
figure.add_subplot(1, len(columns), 1)
for index, col in enumerate(columns):
    if index > 0:
        figure.add_subplot(1, len(columns), index + 1)
    sns.boxplot(y=col, data = data, boxprops={'facecolor':'None'}) 
    # boxprops={'facecolor':'None'} removes the box fill color
figure.tight_layout() # automatically adjusts subplot parameters to fit the padding
plt.show()

In [59]:

# For reference: the same box plots laid out with plt.subplot
number_of_rows = (len(l)-1)//number_of_columns
plt.figure(figsize=(20,20))
sns.set_style('whitegrid')
for i in range(0,len(l)):
    plt.subplot(number_of_rows + 1,number_of_columns,i+1)
    sns.boxplot(y=numeric_data[l[i]],color='green')
plt.tight_layout()
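The points drawn beyond the whiskers of a box plot follow, by default, the 1.5 x IQR rule. The sketch below applies that rule directly to list the flagged values; `iqr_outliers` is a hypothetical helper and the sample series is invented.

```python
import pandas as pd


def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values lying more than k * IQR outside the quartiles."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]


# Hypothetical sample: six similar values and one extreme one.
toy = pd.Series([10, 12, 11, 13, 12, 11, 100])
outliers = iqr_outliers(toy)
print(outliers)
```

Whether flagged values should be removed is a separate decision; in premium data, very high charges may be genuine and informative.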

Violin plots by categorical variable

In [60]:

# Control the size with height and aspect (width/height)
if len(data.select_dtypes(include=['object','category']).columns) > 0:
    for col_num in data.select_dtypes(include=np.number).columns:
        for col in data.select_dtypes(include=['object','category']).columns:
            fig = sns.catplot(x=col, y=col_num, kind='violin', data = data, height = 5, aspect= 2 )
            fig.set_xticklabels(rotation=90)
            plt.show()

6. Are the variables correlated?

Visualizing the pairwise joint distributions of the numeric data

In [61]:

numeric_data

Out[61]:

	age	bmi	children	charges
0	19	27.900	0	16884.92400
1	18	33.770	1	1725.55230
2	28	33.000	3	4449.46200
3	33	22.705	0	21984.47061
4	32	28.880	0	3866.85520
...	...	...	...	...
1332	50	30.970	3	10600.54830
1333	18	31.920	0	2205.98080
1334	18	36.850	0	1629.83350
1335	21	25.800	0	2007.94500
1336	61	29.070	0	29141.36030

1337 rows × 4 columns

In [62]:

# Correlation visualization: seaborn heatmap
numeric_data = data.select_dtypes(include=np.number)
plt.figure(figsize=(6,4))
sns.heatmap(numeric_data.corr(), cmap="Blues", annot = True)

Out[62]:

<AxesSubplot: >

In [63]:

# Correlation matrix ordered by correlation with charges
k = 4 # number of variables to show: charges plus the three numeric features
# nlargest: sort and take the top k
cols = numeric_data.corr().nlargest(k, 'charges')['charges'].index
cm = numeric_data[cols].corr()
plt.figure(figsize=(10,6))
sns.heatmap(cm, annot=True, cmap= 'viridis')

Out[63]:

<AxesSubplot: >
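Each cell of the heatmap is a Pearson correlation coefficient, which is what `DataFrame.corr()` computes by default. As a sketch, the coefficient can be written out by hand and compared with pandas; the five rows below are hypothetical values, not rows from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical age/charges pairs for illustration only.
toy = pd.DataFrame({
    "age": [19, 28, 33, 46, 60],
    "charges": [1700.0, 4400.0, 3900.0, 8200.0, 12900.0],
})

x = toy["age"].to_numpy(dtype=float)
y = toy["charges"].to_numpy(dtype=float)

# Pearson r: covariance of x and y divided by the product of their spreads.
dx, dy = x - x.mean(), y - y.mean()
r_manual = np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2))

# pandas computes the same quantity.
r_pandas = toy["age"].corr(toy["charges"])
print(r_manual, r_pandas)
```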

In [64]:

# Pair plot of the numeric columns
sns.pairplot(data.select_dtypes(include=np.number))
plt.show()

The same plot colored by a categorical variable

https://seaborn.pydata.org/examples/index.html

In [65]:

hue = 'smoker'
sns.pairplot(data.select_dtypes(include=np.number).join(data[[hue]]), hue=hue)
plt.show()

03. Predicting Premiums with Various Regression Models

https://scikit-learn.org/stable/

Splitting into Training and Test Data

In [66]:

# Copy only the numeric columns with copy()
X_num = data[['age','bmi','children']].copy()

# Concatenate them with the categorical columns converted earlier
X_final = pd.concat([X_num, region, sex, smoker], axis = 1)

# Use the charges column as the target y
y_final = data[['charges']].copy()

# Split into training and test sets with train_test_split
# (Training:Test = 2:1)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_final, y_final, test_size = 0.33, random_state = 0)
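To see what `test_size = 0.33` and `random_state = 0` do in practice, here is a small sketch on placeholder arrays with the same number of rows as the deduplicated data (1337). The arrays are stand-ins, not the real features.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with 1337 rows, matching the deduplicated dataset size.
X = np.arange(1337).reshape(-1, 1)
y = np.arange(1337)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.33, random_state=0)

# The test share is rounded up, so the split is 895 / 442 rows.
print(len(X_tr), len(X_te))

# The same random_state reproduces exactly the same split.
X_tr2, X_te2, _, _ = train_test_split(X, y, test_size=0.33, random_state=0)
same_split = np.array_equal(X_te, X_te2)
print(same_split)
```

This reproducibility is why the later cells can call `train_test_split` again with the same arguments and still compare models on identical rows.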

In [67]:

X_train[0:10]

Out[67]:

	age	bmi	children	northeast	northwest	southeast	southwest	sex	smoker
905	27	32.585	3	1.0	0.0	0.0	0.0	1	0
2	28	33.000	3	0.0	0.0	1.0	0.0	1	0
405	52	38.380	2	1.0	0.0	0.0	0.0	0	0
481	49	37.510	2	0.0	0.0	1.0	0.0	1	0
338	50	32.300	1	1.0	0.0	0.0	0.0	1	1
356	46	43.890	3	0.0	0.0	1.0	0.0	1	0
1258	52	23.180	0	1.0	0.0	0.0	0.0	0	0
182	22	19.950	3	1.0	0.0	0.0	0.0	1	0
461	42	30.000	0	0.0	0.0	0.0	1.0	1	1
1058	32	33.820	1	0.0	1.0	0.0	0.0	1	0

In [68]:

X_test[0:10]

Out[68]:

	age	bmi	children	northeast	northwest	southeast	southwest	sex	smoker
1247	18	39.820	0	0.0	0.0	1.0	0.0	0	0
609	47	29.370	1	0.0	0.0	1.0	0.0	0	0
393	49	31.350	1	1.0	0.0	0.0	0.0	1	0
503	19	30.250	0	0.0	0.0	1.0	0.0	1	1
198	51	18.050	0	0.0	1.0	0.0	0.0	0	0
820	26	17.670	0	0.0	1.0	0.0	0.0	1	0
31	18	26.315	0	1.0	0.0	0.0	0.0	0	0
1250	19	19.800	0	0.0	0.0	0.0	1.0	1	0
1298	19	25.745	1	0.0	1.0	0.0	0.0	0	0
1150	58	36.480	0	0.0	1.0	0.0	0.0	0	0

Feature Scaling

Makes values across multiple dimensions easier to compare and analyze.
Needed when the variables are measured in different units.
Helps prevent overflow and underflow.

StandardScaler

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html?highlight=standardscaler#sklearn.preprocessing.StandardScaler

In [69]:

# from sklearn.preprocessing import MinMaxScaler
# from sklearn.preprocessing import StandardScaler
## When using MinMaxScaler:
# with outliers present, the transformed values can be squeezed into a very narrow range

#n_scaler = MinMaxScaler()
#X_train = n_scaler.fit_transform(X_train.astype(np.float64))
#X_test= n_scaler.transform(X_test.astype(np.float64))

## When using StandardScaler:
# with outliers present, a balanced result is hard to guarantee

s_scaler = StandardScaler()
X_train = s_scaler.fit_transform(X_train.astype(np.float64))
X_test= s_scaler.transform(X_test.astype(np.float64))

## Alternative - RobustScaler:
# minimizes the influence of outliers.
# It uses the median and IQR, so after scaling the same values end up spread more widely than with StandardScaler.
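The effect of an outlier on the three scalers is easy to see on a tiny example. This sketch uses one hypothetical column containing four ordinary values and a single extreme one.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Hypothetical column: four ordinary values and one outlier.
values = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# MinMaxScaler maps min -> 0 and max -> 1, so the outlier squeezes
# the ordinary values into a narrow band near 0.
minmax = MinMaxScaler().fit_transform(values).ravel()

# StandardScaler centers on the mean and divides by the standard deviation,
# both of which are pulled by the outlier.
standard = StandardScaler().fit_transform(values).ravel()

# RobustScaler centers on the median and divides by the IQR,
# so the ordinary values keep a usable spread.
robust = RobustScaler().fit_transform(values).ravel()

print(minmax)
print(standard)
print(robust)
```

As in the cell above, any scaler should be fitted on the training data only and then applied to the test data with `transform`.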

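Regression Workflow Summary

...Regression(): create the estimator (LinearRegression, RandomForestRegressor, and so on)
fit(): train on the training data
predict(): predict the target values
score(): evaluate the model; for regressors this returns R²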
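The four steps can be sketched end to end on synthetic data. The data here is an invented noise-free line (y = 3x + 2), chosen only so the expected result is obvious; it is not the insurance data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free data on the line y = 3x + 2.
X = np.arange(10, dtype=float).reshape(-1, 1)
y = 3.0 * X[:, 0] + 2.0

model = LinearRegression()                       # 1. create the estimator
model.fit(X, y)                                  # 2. fit on training data
predictions = model.predict(np.array([[10.0]]))  # 3. predict for new input
r2 = model.score(X, y)                           # 4. score (R² for regressors)

print(model.coef_, model.intercept_, predictions, r2)
```

Every scikit-learn regressor used in this post follows the same four calls, which is what makes swapping models straightforward.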

Applying Linear Regression

In [70]:

# from sklearn.linear_model import LinearRegression
# fit model
lr = LinearRegression()
lr.fit(X_train,y_train)

# predict
y_train_pred = lr.predict(X_train)
y_test_pred = lr.predict(X_test)

# Check the results
# Coefficients (slopes)
print("lr.coef_: {}".format(lr.coef_))
# Intercept
print("lr.intercept_: {}".format(lr.intercept_))
# Evaluation
print('lr train score %.3f, lr test score: %.3f' % (
lr.score(X_train,y_train),
lr.score(X_test, y_test)))

lr.coef_: [[3358.74406798 1770.11632553  605.66602843  276.56724498   20.18780658
   -26.06561878 -262.61820082  -75.42314863 9509.77832038]]
lr.intercept_: [13098.07379314]
lr train score 0.743, lr test score: 0.759
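For regressors, `score()` returns the coefficient of determination R², which is also what `sklearn.metrics.r2_score` computes. The sketch below works it out by hand on four hypothetical true/predicted pairs.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical true values and predictions.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 9.5])

# R² = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

r2_sklearn = r2_score(y_true, y_pred)
print(r2_manual, r2_sklearn)
```

A value of 1 means perfect predictions, and 0 means the model does no better than always predicting the mean of the targets.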

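Applying Polynomial Regression

● When the relationship in the data is nonlinear, add powers of each feature (PolynomialFeatures also adds their interaction terms) and train an ordinary linear regression model on the expanded features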
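To show what the feature expansion produces, here is a sketch applying `PolynomialFeatures` to a single hypothetical row with two features, a = 2 and b = 3. It also counts how many columns a degree-3 expansion of nine input features yields, nine being the width of the feature table used in this post.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One hypothetical row with two features: a = 2, b = 3.
row = np.array([[2.0, 3.0]])

# Degree 2 yields the columns [1, a, b, a^2, a*b, b^2].
expanded = PolynomialFeatures(degree=2).fit_transform(row)
print(expanded)

# With 9 input features, degree 3 produces many more columns.
n_features_degree3 = (
    PolynomialFeatures(degree=3).fit_transform(np.zeros((1, 9))).shape[1]
)
print(n_features_degree3)
```

The rapid growth in column count is worth keeping in mind: more columns give the linear model more flexibility but also more room to overfit.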

In [71]:

# from sklearn.preprocessing import PolynomialFeatures
# Transform into polynomial features
poly = PolynomialFeatures(degree = 3)
X_poly = poly.fit_transform(X_final)

# Split the data
X_train,X_test,y_train,y_test = train_test_split(X_poly,y_final, test_size = 0.33, random_state = 0)

# Standard Scaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train.astype(np.float64))
X_test= sc.transform(X_test.astype(np.float64))

# fit model
poly_lr = LinearRegression()
poly_lr.fit(X_train,y_train)

# predict
y_train_pred = poly_lr.predict(X_train)
y_test_pred = poly_lr.predict(X_test)

# Check the score
print('poly train score %.3f, poly test score: %.3f' % (
poly_lr.score(X_train,y_train),
poly_lr.score(X_test, y_test)))

poly train score 0.835, poly test score: 0.835

Applying Support Vector Regression

In [72]:

# from sklearn.svm import SVR
svr = SVR(kernel='linear', C = 300)

X_train, X_test, y_train, y_test = train_test_split(X_final, y_final, test_size = 0.33, random_state = 0 )

# Standard Scaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train.astype(np.float64))
X_test= sc.transform(X_test.astype(np.float64))

# fit model
svr = svr.fit(X_train,y_train.values.ravel())
y_train_pred = svr.predict(X_train)
y_test_pred = svr.predict(X_test)

# Check the score
print('svr train score %.3f, svr test score: %.3f' % (
svr.score(X_train,y_train),
svr.score(X_test, y_test)))

svr train score 0.715, svr test score: 0.719

Applying RandomForest Regression

In [73]:

# from sklearn.ensemble import RandomForestRegressor
forest = RandomForestRegressor(n_estimators = 100,
                              criterion = 'squared_error',
                              random_state = 1,
                              n_jobs = -1)

X_train, X_test, y_train, y_test = train_test_split(X_final, y_final, test_size = 0.33, random_state = 0 )

# Standard Scaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train.astype(np.float64))
X_test= sc.transform(X_test.astype(np.float64))

# fit model
forest.fit(X_train,y_train.values.ravel())
y_train_pred = forest.predict(X_train)
y_test_pred = forest.predict(X_test)

# Check the score
print('forest train score %.3f, forest test score: %.3f' % (
forest.score(X_train, y_train),
forest.score(X_test, y_test)))

forest train score 0.975, forest test score: 0.851

Applying Decision Tree Regression

In [74]:

# from sklearn.tree import DecisionTreeRegressor

dt = DecisionTreeRegressor(random_state=0)

X_train, X_test, y_train, y_test = train_test_split(X_final, y_final, test_size = 0.33, random_state = 0 )

# Standard Scaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train.astype(np.float64))
X_test= sc.transform(X_test.astype(np.float64))

# fit model
dt = dt.fit(X_train,y_train.values.ravel())
y_train_pred = dt.predict(X_train)
y_test_pred = dt.predict(X_test)

# Check the score
print('dt train score %.3f, dt test score: %.3f' % (
dt.score(X_train,y_train),
dt.score(X_test, y_test)))

dt train score 0.999, dt test score: 0.721
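A training score near 1 alongside a much lower test score is the usual sign of an overfitted tree: an unrestricted tree can keep splitting until it memorizes the training rows. The sketch below illustrates this on synthetic noisy data (invented here, not the insurance data); `max_depth=3` is an arbitrary illustrative setting, not a tuned value.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: a linear trend plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 1))
y = 2.0 * X[:, 0] + rng.normal(0, 3.0, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.33, random_state=0)

# Unrestricted depth: the tree can isolate every training point.
deep = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)

# Limited depth: at most 8 leaves, so it cannot memorize the noise.
shallow = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_tr, y_tr)

deep_train = deep.score(X_tr, y_tr)
shallow_train = shallow.score(X_tr, y_tr)

print("deep    train/test:", deep_train, deep.score(X_te, y_te))
print("shallow train/test:", shallow_train, shallow.score(X_te, y_te))
```

Limiting depth typically narrows the gap between train and test scores, though the right setting depends on the data and is best chosen by validation.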

Comparing the Performance of the Models

In [75]:

# from sklearn.metrics import r2_score
# linear_model
lr = LinearRegression().fit(X_train,y_train)

# poly_model
poly = PolynomialFeatures(degree = 3)
X_poly = poly.fit_transform(X_final)
X_train,X_test,y_train,y_test = train_test_split(X_poly,y_final, test_size = 0.33, random_state = 0)
# Standard Scaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train.astype(np.float64))
X_test= sc.transform(X_test.astype(np.float64))
# fit model
poly_lr = LinearRegression().fit(X_train,y_train)

# SVM_model
svr = SVR(kernel='linear', C = 300)
X_train, X_test, y_train, y_test = train_test_split(X_final, y_final, test_size = 0.33, random_state = 0 )
# Standard Scaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train.astype(np.float64))
X_test= sc.transform(X_test.astype(np.float64))
# fit model
svr = svr.fit(X_train,y_train.values.ravel())

# forest_model
forest = RandomForestRegressor(n_estimators = 100,
                              criterion = 'squared_error',
                              random_state = 1,
                              n_jobs = -1)
X_train, X_test, y_train, y_test = train_test_split(X_final, y_final, test_size = 0.33, random_state = 0 )
# Standard Scaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train.astype(np.float64))
X_test= sc.transform(X_test.astype(np.float64))
# fit model
forest.fit(X_train,y_train.values.ravel())

# decision_tree_model
dt = DecisionTreeRegressor(random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X_final, y_final, test_size = 0.33, random_state = 0 )
# Standard Scaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train.astype(np.float64))
X_test= sc.transform(X_test.astype(np.float64))
# fit model
dt = dt.fit(X_train,y_train.values.ravel())

In [76]:

# Bundle each regressor with its label and the feature matrices it should be trained on.
# X_train / X_test hold the scaled original features from the last split above;
# the polynomial model needs its own degree-3 features, so rebuild them here.
Xp_train, Xp_test, _, _ = train_test_split(poly.transform(X_final), y_final, test_size = 0.33, random_state = 0)
poly_sc = StandardScaler()
Xp_train = poly_sc.fit_transform(Xp_train.astype(np.float64))
Xp_test = poly_sc.transform(Xp_test.astype(np.float64))

regressors = [(lr, 'Linear Regression', X_train, X_test),
                (poly_lr, 'Polynomial Regression', Xp_train, Xp_test),
                (svr, 'SupportVector Regression', X_train, X_test),
                (forest, 'RandomForest Regression', X_train, X_test),
                (dt, 'DecisionTree', X_train, X_test)]

# Run fit -> predict -> score for each regressor on its own features
for reg, label, X_tr, X_te in regressors:
    print(80*'_', '\n')
    reg = reg.fit(X_tr, y_train.values.ravel())
    y_train_pred = reg.predict(X_tr)
    y_test_pred = reg.predict(X_te)
    print(f'{label} train score %.3f, {label} test score: %.3f' % (
    reg.score(X_tr, y_train),
    reg.score(X_te, y_test)))

________________________________________________________________________________ 

Linear Regression train score 0.743, Linear Regression test score: 0.759
________________________________________________________________________________ 

Polynomial Regression train score 0.835, Polynomial Regression test score: 0.835
________________________________________________________________________________ 

SupportVector Regression train score 0.715, SupportVector Regression test score: 0.719
________________________________________________________________________________ 

RandomForest Regression train score 0.975, RandomForest Regression test score: 0.851
________________________________________________________________________________ 

DecisionTree train score 0.999, DecisionTree test score: 0.721


Kang's Note

[Hands-on Practice] Insurance Premium Prediction (insurance)
