
PCA

Data source:

https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients

In [1]:

import pandas as pd

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# PCA
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# model
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

In [2]:

import os
os.chdir("C:/Users/KANG/Desktop/pyml/pyml_data/")

PCA transformation of the credit card dataset

In [3]:

# header=1 skips the meaningless first row; iloc drops the existing ID column
df = pd.read_excel('default of credit card clients.xls', header=1, sheet_name='Data').iloc[:, 1:]
print(df.shape)
df.head(3)

(30000, 24)

Out[3]:

	LIMIT_BAL	SEX	EDUCATION	MARRIAGE	AGE	PAY_0	PAY_2	PAY_3	PAY_4	PAY_5	...	BILL_AMT4	BILL_AMT5	BILL_AMT6	PAY_AMT1	PAY_AMT2	PAY_AMT3	PAY_AMT4	PAY_AMT5	PAY_AMT6	default payment next month
0	20000	2	2	1	24	2	2	-1	-1	-2	...	0	0	0	0	689	0	0	0	0	1
1	120000	2	2	2	26	-1	2	0	0	0	...	3272	3455	3261	0	1000	1000	1000	0	2000	1
2	90000	2	2	2	34	0	0	0	0	0	...	14331	14948	15549	1518	1500	1000	1000	1000	5000	0

3 rows × 24 columns

In [4]:

df.rename(columns={'PAY_0':'PAY_1','default payment next month':'default'}, inplace=True)
y_target = df['default']
X_features = df.drop('default', axis=1)

In [5]:

X_features.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30000 entries, 0 to 29999
Data columns (total 23 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   LIMIT_BAL  30000 non-null  int64
 1   SEX        30000 non-null  int64
 2   EDUCATION  30000 non-null  int64
 3   MARRIAGE   30000 non-null  int64
 4   AGE        30000 non-null  int64
 5   PAY_1      30000 non-null  int64
 6   PAY_2      30000 non-null  int64
 7   PAY_3      30000 non-null  int64
 8   PAY_4      30000 non-null  int64
 9   PAY_5      30000 non-null  int64
 10  PAY_6      30000 non-null  int64
 11  BILL_AMT1  30000 non-null  int64
 12  BILL_AMT2  30000 non-null  int64
 13  BILL_AMT3  30000 non-null  int64
 14  BILL_AMT4  30000 non-null  int64
 15  BILL_AMT5  30000 non-null  int64
 16  BILL_AMT6  30000 non-null  int64
 17  PAY_AMT1   30000 non-null  int64
 18  PAY_AMT2   30000 non-null  int64
 19  PAY_AMT3   30000 non-null  int64
 20  PAY_AMT4   30000 non-null  int64
 21  PAY_AMT5   30000 non-null  int64
 22  PAY_AMT6   30000 non-null  int64
dtypes: int64(23)
memory usage: 5.3 MB

In [6]:

corr = X_features.corr()
plt.figure(figsize=(14,14))
# fmt='.1g' is the general format with 1 significant digit; '.2f' would show 2 fixed decimals
sns.heatmap(corr, annot=True, fmt='.1g', cmap='Blues')

Out[6]:

<AxesSubplot: >

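PCA relies on the fact that the covariance matrix of the input data can be decomposed into eigenvectors and eigenvalues; the resulting eigenvectors are then used to linearly transform the input data.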
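The decomposition described above can be sketched with NumPy alone. This is a minimal illustration on synthetic data (the array `X` and its mixing matrix are assumptions made up for the example, not the credit card dataset), and it is not how scikit-learn's `PCA` is implemented internally.

```python
import numpy as np

# Synthetic stand-in data (an assumption for illustration only)
rng = np.random.default_rng(0)
mixing = np.array([[2.0, 0.5, 0.0],
                   [0.0, 1.0, 0.3],
                   [0.0, 0.0, 0.2]])
X = rng.normal(size=(200, 3)) @ mixing

# 1. Center each feature at zero
X_centered = X - X.mean(axis=0)

# 2. Covariance matrix of the features (columns are variables)
cov = np.cov(X_centered, rowvar=False)

# 3. Eigen-decomposition; eigh is meant for symmetric matrices
#    and returns eigenvalues in ascending order
eigvals, eigvecs = np.linalg.eigh(cov)

# 4. Reorder so the largest eigenvalue (most variance) comes first
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 5. Linear transform: project the centered data onto the top-2 eigenvectors
X_projected = X_centered @ eigvecs[:, :2]

# Share of the total variance carried by each eigenvector
explained_variance_ratio = eigvals / eigvals.sum()
print(explained_variance_ratio)
```

The variance of each projected column equals the corresponding eigenvalue, which is why the eigenvalues can be read as the variance explained by each component.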

Because PCA combines the values of several features, it is sensitive to feature scale, so every feature should be brought to the same scale before the PCA transform.
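To make the scale sensitivity concrete, here is a small sketch with two independent synthetic features on very different scales. The feature names and distributions are hypothetical and only loosely modeled on a credit limit and an age column; the helper `explained_variance_ratio` is defined here for the example.

```python
import numpy as np

def explained_variance_ratio(X):
    """Normalized eigenvalues of the covariance matrix, largest first."""
    eigvals = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]
    return eigvals / eigvals.sum()

rng = np.random.default_rng(0)
# Two independent, hypothetical features on very different scales
limit = rng.normal(loc=150_000, scale=100_000, size=1000)
age = rng.normal(loc=35, scale=9, size=1000)
X = np.column_stack([limit, age])

# Without scaling, the large-scale feature dominates the first component
ratio_raw = explained_variance_ratio(X)

# Standardize to zero mean and unit variance (what StandardScaler does)
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
ratio_scaled = explained_variance_ratio(X_scaled)

print('raw   :', ratio_raw)
print('scaled:', ratio_scaled)
```

On the raw data nearly all of the variance is attributed to the first component simply because one column has larger numbers; after standardization the two independent features share the variance roughly equally.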

In [7]:

# Build the list of columns to transform (e.g. BILL_AMT1 through BILL_AMT6)
cols_bill = ['BILL_AMT' + str(i) for i in range(1, 7)]
cols_pay = ['PAY_' + str(i) for i in range(1, 7)]
cols_amt = ['PAY_AMT' + str(i) for i in range(1, 7)]
cols_bill.extend(cols_pay)
cols_bill.extend(cols_amt)
print('Target columns:', cols_bill)

# Standardize the selected columns (zero mean, unit variance)
scaler = StandardScaler()
df_cols_scaled = scaler.fit_transform(X_features[cols_bill])

# Replace the original columns with the scaled values
# (column assignment avoids writing floats into int64 columns in place)
X_features[cols_bill] = df_cols_scaled

# Create a PCA object with 2 components and call fit() to compute explained_variance_ratio_
pca = PCA(n_components=2)
pca.fit(df_cols_scaled)
print('Explained variance ratio per PCA component:', pca.explained_variance_ratio_)

Target columns: ['BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6', 'PAY_1', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'PAY_AMT1', 'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6']
Explained variance ratio per PCA component: [0.36180187 0.20618472]
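The two ratios printed above add up to about 0.568, so two components keep a little over half of the variance of the 18 scaled columns. One common way to choose the number of components is to look at the cumulative explained variance. The sketch below does this on synthetic data; the latent-factor construction and the 0.90 threshold are assumptions chosen for illustration, not values from this notebook.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical data: 18 correlated features built from 4 latent factors plus noise
latent = rng.normal(size=(500, 4))
X = latent @ rng.normal(size=(4, 18)) + 0.3 * rng.normal(size=(500, 18))

X_scaled = StandardScaler().fit_transform(X)

# Fit PCA without n_components so that all components are kept
pca_full = PCA().fit(X_scaled)

# Running total of the variance explained by the first k components
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest k whose cumulative ratio reaches the (illustrative) threshold
threshold = 0.90
n_components = int(np.argmax(cumulative >= threshold)) + 1
print(n_components, cumulative[:n_components])
```

Applied to `df_cols_scaled`, the same `cumulative` array would show how many components are needed before the curve flattens out.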

In [8]:

# pca was already fitted in the previous cell, so transform() is sufficient here
X_principal = pd.DataFrame(pca.transform(df_cols_scaled), columns=['P1', 'P2'])
X_principal

Out[8]:

	P1	P2
0	-1.774457	-0.613836
1	-0.660334	-2.051522
2	-0.766104	-0.934008
3	-0.114276	-0.627801
4	-0.851808	0.029701
...	...	...
29995	2.418658	0.828907
29996	-1.834585	0.068665
29997	0.475637	-3.053813
29998	0.602878	1.101630
29999	-0.121365	-0.631337

30000 rows × 2 columns

In [9]:

rcf = RandomForestClassifier(n_estimators=300, random_state=2023)
scores = cross_val_score(rcf, X_features, y_target, scoring='accuracy', cv=3 )

print('Accuracy for each fold (cv=3):', scores)
print('Mean accuracy: {0:.4f}'.format(np.mean(scores)))

Accuracy for each fold (cv=3): [0.8065 0.8222 0.821 ]
Mean accuracy: 0.8166

In [10]:

# Apply StandardScaler to the full feature set first
scaler = StandardScaler()
df_scaled = scaler.fit_transform(X_features)

# Run a PCA transform with 6 components, then classify with cross_val_score()
pca = PCA(n_components=6)
df_pca = pca.fit_transform(df_scaled)
scores_pca = cross_val_score(rcf, df_pca, y_target, scoring='accuracy', cv=3)

print('Accuracy for each fold on PCA-transformed data (cv=3):', scores_pca)
print('Mean accuracy on PCA-transformed data: {0:.4f}'.format(np.mean(scores_pca)))

Accuracy for each fold on PCA-transformed data (cv=3): [0.7928 0.7976 0.8013]
Mean accuracy on PCA-transformed data: 0.7972
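In the cells above, the scaler and PCA are fitted on all rows before cross-validation. A possible variation, shown here only as a sketch and not as what this notebook does, is to wrap the steps in a scikit-learn `Pipeline` so that scaling and PCA are re-fitted on the training part of each fold. The data comes from `make_classification` and the smaller `n_estimators=50` is an assumption to keep the example fast.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 23 credit card features (an assumption)
X, y = make_classification(n_samples=600, n_features=23,
                           n_informative=6, random_state=0)

# Scaling, PCA and the classifier are fitted together inside each CV fold
pipe = make_pipeline(
    StandardScaler(),
    PCA(n_components=6),
    RandomForestClassifier(n_estimators=50, random_state=0),
)

scores = cross_val_score(pipe, X, y, scoring='accuracy', cv=3)
print('Accuracy for each fold:', scores)
print('Mean accuracy: {0:.4f}'.format(np.mean(scores)))
```

With the real data, `X` and `y` would be replaced by `X_features` and `y_target`, and the pipeline would be passed to `cross_val_score` in place of `rcf`.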


Kang's Note

[Dimension Reduction] Transformation based on PCA components
