
https://knote.tistory.com/185

[Python] How to extract and load compressed data from GitHub

Extracting and loading data from GitHub https://docs.python.org/ko/3/library/urllib.request.html https://docs.python.org/ko/3/library/tarfile.html parents: True option: if True, missing parent paths are created; if False, ...

https://knote.tistory.com/186

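[Python] How to import and use your own function file (.py)

1) If the function .py file is in the same folder as the code file: from <your module file name> import <function name> 2) If the function .py file is in a different folder, add its path: a file at any physical location can then be used from anywhere ...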

In [1]:

from extract_data import *

In [2]:

housing = load_housing_data()
housing.head()

Out[2]:

	longitude	latitude	housing_median_age	total_rooms	total_bedrooms	population	households	median_income	median_house_value	ocean_proximity
0	-122.23	37.88	41.0	880.0	129.0	322.0	126.0	8.3252	452600.0	NEAR BAY
1	-122.22	37.86	21.0	7099.0	1106.0	2401.0	1138.0	8.3014	358500.0	NEAR BAY
2	-122.24	37.85	52.0	1467.0	190.0	496.0	177.0	7.2574	352100.0	NEAR BAY
3	-122.25	37.85	52.0	1274.0	235.0	558.0	219.0	5.6431	341300.0	NEAR BAY
4	-122.25	37.85	52.0	1627.0	280.0	565.0	259.0	3.8462	342200.0	NEAR BAY

In [3]:

housing.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           20640 non-null  float64
 1   latitude            20640 non-null  float64
 2   housing_median_age  20640 non-null  float64
 3   total_rooms         20640 non-null  float64
 4   total_bedrooms      20433 non-null  float64
 5   population          20640 non-null  float64
 6   households          20640 non-null  float64
 7   median_income       20640 non-null  float64
 8   median_house_value  20640 non-null  float64
 9   ocean_proximity     20640 non-null  object 
dtypes: float64(9), object(1)
memory usage: 1.6+ MB

In [4]:

housing['ocean_proximity'].value_counts()

Out[4]:

<1H OCEAN     9136
INLAND        6551
NEAR OCEAN    2658
NEAR BAY      2290
ISLAND           5
Name: ocean_proximity, dtype: int64

In [5]:

housing.describe()

Out[5]:

	longitude	latitude	housing_median_age	total_rooms	total_bedrooms	population	households	median_income	median_house_value
count	20640.000000	20640.000000	20640.000000	20640.000000	20433.000000	20640.000000	20640.000000	20640.000000	20640.000000
mean	-119.569704	35.631861	28.639486	2635.763081	537.870553	1425.476744	499.539680	3.870671	206855.816909
std	2.003532	2.135952	12.585558	2181.615252	421.385070	1132.462122	382.329753	1.899822	115395.615874
min	-124.350000	32.540000	1.000000	2.000000	1.000000	3.000000	1.000000	0.499900	14999.000000
25%	-121.800000	33.930000	18.000000	1447.750000	296.000000	787.000000	280.000000	2.563400	119600.000000
50%	-118.490000	34.260000	29.000000	2127.000000	435.000000	1166.000000	409.000000	3.534800	179700.000000
75%	-118.010000	37.710000	37.000000	3148.000000	647.000000	1725.000000	605.000000	4.743250	264725.000000
max	-114.310000	41.950000	52.000000	39320.000000	6445.000000	35682.000000	6082.000000	15.000100	500001.000000

In [6]:

# Take a quick look at the shape of the data
import matplotlib.pyplot as plt
housing.hist(bins = 50, figsize = (20,15));

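Median income (median_income): the values do not appear to be expressed in US dollars. They have already been scaled (the unit is tens of thousands of dollars).
Median house value (median_house_value): this is the target attribute, and its range appears to be capped. Values top out at about 500,000, so to predict accurately beyond that cap you must either obtain the true values for the capped districts or remove those districts.
The features have very different scales, so feature scaling is needed.
Many of the histograms are skewed. Because this makes patterns harder to detect, the features should be transformed toward a more bell-shaped distribution.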
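
The cap on the target can be checked by counting how many districts sit at the maximum value. The sketch below uses a small made-up frame named `toy` (an assumption for illustration, not the real data) and treats the maximum as the cap, which is only a reasonable guess when the histogram shows a spike at the right edge.

```python
import pandas as pd

# Toy stand-in for the housing DataFrame; these values are illustrative only.
toy = pd.DataFrame({
    "median_house_value": [452600.0, 500001.0, 179700.0, 500001.0, 14999.0],
})

cap = toy["median_house_value"].max()        # assume the cap shows up as the maximum
at_cap = toy["median_house_value"] >= cap    # boolean mask of capped districts
print("capped districts:", int(at_cap.sum()))

# One of the two options mentioned above: drop the capped districts.
uncapped = toy[~at_cap]
print("remaining districts:", len(uncapped))
```

On the real data the same mask would be built from `housing["median_house_value"]`.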

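Creating a test set

Before analyzing the data, set a test set aside and never look at it.

If you look at the test set in advance, you may be misled by some pattern in it into choosing a particular machine learning model, which makes it harder to build a model that generalizes to new data.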

In [7]:

from sklearn.model_selection import train_test_split

train_set, test_set = train_test_split(housing, test_size = 0.2, random_state = 42)

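Stratified sampling

Suppose that median income is a very important factor in predicting the median house value.
The test set should then be representative of the various income categories in the whole dataset.
We therefore do stratified sampling based on the income category.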

In [8]:

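import numpy as np
import pandas as pd  # pd is used below; import it explicitly instead of relying on the star import

# Bin median_income into five income categories
housing["income_cat"] = pd.cut(housing["median_income"], bins = [0., 1.5, 3.0, 4.5, 6., np.inf],
                             labels = [1, 2, 3, 4, 5])
housing["income_cat"].hist()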

Out[8]:

<AxesSubplot: >

In [9]:

from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]

In [10]:

# Check the income category proportions in the test set
strat_test_set['income_cat'].value_counts() / len(strat_test_set)

Out[10]:

3    0.350533
2    0.318798
4    0.176357
5    0.114341
1    0.039971
Name: income_cat, dtype: float64
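
To see what stratification buys, one can compare the category shares in the test set with those in the full data. The sketch below uses synthetic category counts (made up for illustration) and `train_test_split` with its `stratify` argument, a shorter route to the same kind of split as `StratifiedShuffleSplit` with `n_splits=1`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic income categories (made-up counts, 1000 rows in total).
income_cat = np.repeat([1, 2, 3, 4, 5], [40, 320, 350, 175, 115])
data = pd.DataFrame({"income_cat": income_cat})

# stratify= keeps each category's share the same in both splits.
train, test = train_test_split(
    data, test_size=0.2, random_state=42, stratify=data["income_cat"]
)

overall = data["income_cat"].value_counts(normalize=True).sort_index()
in_test = test["income_cat"].value_counts(normalize=True).sort_index()
max_gap = (overall - in_test).abs().max()
print(pd.DataFrame({"overall": overall, "stratified test": in_test}))
print("largest gap:", max_gap)
```

A purely random split of the same size would typically show larger gaps, especially for the rare categories.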

In [11]:

# Drop the temporary income_cat column
for set_ in (strat_train_set, strat_test_set):
    set_.drop("income_cat", axis = 1, inplace = True)

Exploring only the training set

In [12]:

housing = strat_train_set.copy()

In [13]:

# Geographic information (latitude and longitude)
housing.plot(kind="scatter", x="longitude", y = "latitude", alpha = 0.1 );

C:\ProgramData\Anaconda3\lib\site-packages\pandas\plotting\_matplotlib\core.py:1070: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  scatter = ax.scatter(

Parameter s: marker size (the circle's area in points squared, not its radius)
Parameter c: color
Parameter cmap: colormap

In [14]:

# Red means high prices, blue means low prices, and larger circles mark districts with larger populations.
housing.plot(kind = 'scatter', x= 'longitude', y= 'latitude', alpha = 0.4,
            s = housing["population"]/100, label = "population", figsize=(10,8),
            c= 'median_house_value', cmap = plt.get_cmap('jet'), colorbar = True,
            sharex = False)
plt.legend()

Out[14]:

<matplotlib.legend.Legend at 0x20626efeca0>

Correlations between features

Method 1: standard correlation coefficient (Pearson's r)

In [15]:

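# numeric_only=True skips the text column ocean_proximity (required on pandas 2.x)
corr_matrix = housing.corr(numeric_only=True)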

In [16]:

corr_matrix["median_house_value"].sort_values(ascending = False)

Out[16]:

median_house_value    1.000000
median_income         0.687151
total_rooms           0.135140
housing_median_age    0.114146
households            0.064590
total_bedrooms        0.047781
population           -0.026882
longitude            -0.047466
latitude             -0.142673
Name: median_house_value, dtype: float64
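
The standard correlation coefficient ranges from -1 to 1 and only measures linear relationships. The sketch below computes it by hand with NumPy on made-up numbers; `pearson_r` is a hypothetical helper written for this illustration, not part of pandas.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson's r: covariance divided by the product of the standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

# Made-up numbers for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
linear = 2.0 * x + 1.0        # perfectly linear in x
curved = (x - 3.0) ** 2       # depends on x, but not linearly

r_linear = pearson_r(x, linear)
r_curved = pearson_r(x, curved)
print(r_linear, r_curved)
```

The second value is zero even though `curved` is fully determined by `x`, which is why a coefficient near zero in the table above does not rule out a nonlinear relationship.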

Method 2: scatter plots between numeric features

In [17]:

from pandas.plotting import scatter_matrix

attributes = ['median_house_value','median_income','total_rooms','housing_median_age']
scatter_matrix(housing[attributes], figsize = (12,8));

In [18]:

# Median income versus median house value
housing.plot(kind="scatter", x="median_income", y="median_house_value", alpha = 0.1)

Out[18]:

<AxesSubplot: xlabel='median_income', ylabel='median_house_value'>

In [19]:

housing['rooms_per_household'] = housing['total_rooms']/housing['households']
housing['bedrooms_per_room'] = housing['total_bedrooms']/housing['total_rooms']
housing['population_per_household'] = housing['population']/housing['households']

In [20]:

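# numeric_only=True skips the text column ocean_proximity (required on pandas 2.x)
corr_matrix = housing.corr(numeric_only=True)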
corr_matrix['median_house_value'].sort_values(ascending = False)

Out[20]:

median_house_value          1.000000
median_income               0.687151
rooms_per_household         0.146255
total_rooms                 0.135140
housing_median_age          0.114146
households                  0.064590
total_bedrooms              0.047781
population_per_household   -0.021991
population                 -0.026882
longitude                  -0.047466
latitude                   -0.142673
bedrooms_per_room          -0.259952
Name: median_house_value, dtype: float64

Preparing the data for machine learning

Separating features and labels

In [21]:

housing = strat_train_set.drop("median_house_value", axis = 1)
housing_labels = strat_train_set['median_house_value'].copy()

Numeric features

In [22]:

# Handle missing values by filling them with the median
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='median')

In [23]:

# Exclude the text feature: the median can only be computed on numeric features
housing_num = housing.drop('ocean_proximity', axis = 1)

In [24]:

imputer.fit(housing_num)

Out[24]:

SimpleImputer(strategy='median')

In [25]:

imputer.statistics_

Out[25]:

array([-118.51   ,   34.26   ,   29.     , 2119.     ,  433.     ,
       1164.     ,  408.     ,    3.54155])

In [26]:

X = imputer.transform(housing_num)

In [27]:

housing_tr = pd.DataFrame(X, columns = housing_num.columns, index = housing_num.index)
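
The fit and transform steps above amount to learning each column's median on the training data and then filling gaps with it. A rough pandas equivalent, on toy frames named `toy_train` and `toy_new` (both made up for illustration), might look like this:

```python
import numpy as np
import pandas as pd

# Toy numeric frames with missing values (illustrative values only).
toy_train = pd.DataFrame({"total_bedrooms": [129.0, 1106.0, np.nan, 235.0, 280.0]})
toy_new = pd.DataFrame({"total_bedrooms": [np.nan, 190.0]})

# "fit": learn the medians from the training data only.
medians = toy_train.median()

# "transform": fill gaps in any data set with those training medians.
toy_train_filled = toy_train.fillna(medians)
toy_new_filled = toy_new.fillna(medians)
print(medians["total_bedrooms"])
print(toy_new_filled)
```

Keeping the learned medians separate from the data is what lets the same values be reused later on the test set or on new data.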

Categorical features

In [28]:

housing_cat = housing[['ocean_proximity']]

In [29]:

from sklearn.preprocessing import OrdinalEncoder
ordinal_encoder = OrdinalEncoder()
housing_cat_encoded = ordinal_encoder.fit_transform(housing_cat)

In [30]:

ordinal_encoder.categories_
# Limitation: ML algorithms will treat two nearby codes as more similar than two distant ones
# This is fine for ordered categories (such as good, excellent)

Out[30]:

[array(['<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN'],
       dtype=object)]
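
The limitation noted in the comments can be made concrete by comparing distances. In the sketch below, ordinal codes place '<1H OCEAN' much closer to 'INLAND' than to 'NEAR OCEAN', while one-hot vectors keep every pair of categories equally far apart. The arrays are built by hand with NumPy for illustration rather than taken from the encoders.

```python
import numpy as np

categories = ['<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN']

# Ordinal encoding: each category becomes its position in the list.
ordinal = np.arange(len(categories), dtype=float)
# One-hot encoding: each category becomes a row of the identity matrix.
one_hot = np.eye(len(categories))

a, b, c = 0, 1, 4   # '<1H OCEAN', 'INLAND', 'NEAR OCEAN'
ordinal_ab = abs(ordinal[a] - ordinal[b])
ordinal_ac = abs(ordinal[a] - ordinal[c])
onehot_ab = np.linalg.norm(one_hot[a] - one_hot[b])
onehot_ac = np.linalg.norm(one_hot[a] - one_hot[c])
print("ordinal distances:", ordinal_ab, ordinal_ac)
print("one-hot distances:", onehot_ab, onehot_ac)
```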

In [31]:

from sklearn.preprocessing import OneHotEncoder
cat_encoder = OneHotEncoder()
housing_cat_1hot = cat_encoder.fit_transform(housing_cat)

In [32]:

housing_cat_1hot.toarray()

Out[32]:

array([[0., 1., 0., 0., 0.],
       [0., 0., 0., 0., 1.],
       [0., 1., 0., 0., 0.],
       ...,
       [1., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0.]])

In [33]:

cat_encoder.categories_

Out[33]:

[array(['<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN'],
       dtype=object)]

Feature scaling

Transformation pipeline

In [34]:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

In [35]:

num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('std_scaler', StandardScaler())
])

housing_num_tr = num_pipeline.fit_transform(housing_num)
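
The StandardScaler step standardizes each column by subtracting its mean and dividing by its standard deviation. A hand-written NumPy sketch of the same formula, using the first four rows of total_rooms and median_income from the head() output above:

```python
import numpy as np

# total_rooms and median_income for the first four rows of housing.head().
X = np.array([[880.0, 8.3252],
              [7099.0, 8.3014],
              [1467.0, 7.2574],
              [1274.0, 5.6431]])

mean = X.mean(axis=0)
std = X.std(axis=0)            # population standard deviation (ddof=0)
X_scaled = (X - mean) / std    # z = (x - mean) / std, column by column

print(X_scaled.mean(axis=0))
print(X_scaled.std(axis=0))
```

After scaling, both columns have mean 0 and standard deviation 1 even though their original ranges differ by several orders of magnitude.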

In [36]:

from sklearn.compose import ColumnTransformer

num_attribs = list(housing_num)
cat_attribs = ['ocean_proximity']

full_pipeline = ColumnTransformer([
    ('num', num_pipeline, num_attribs),
    ('cat', OneHotEncoder(), cat_attribs)
])
housing_prepared = full_pipeline.fit_transform(housing)

Selecting and training a model

Linear regression model

In [37]:

from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(housing_prepared, housing_labels)

Out[37]:

LinearRegression()

In [38]:

# Try the model on a few training samples
sample_data = housing.iloc[:5]
sample_labels = housing_labels.iloc[:5]
sample_data_prepared = full_pipeline.transform(sample_data)
print("predict:", lin_reg.predict(sample_data_prepared))
print('labels:', list(sample_labels))

predict: [ 88983.14806384 305351.35385026 153334.71183453 184302.55162102
 246840.18988841]
labels: [72100.0, 279600.0, 82700.0, 112500.0, 238300.0]

In [39]:

# Evaluation metric: RMSE on the training set
from sklearn.metrics import mean_squared_error
housing_predictions = lin_reg.predict(housing_prepared)
lin_mse = mean_squared_error(housing_labels, housing_predictions)
lin_rmse = np.sqrt(lin_mse)
# The prediction error is large: the model is underfitting.
# Try adding more features or a more complex model.
lin_rmse

Out[39]:

69050.56219504567
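
RMSE is the square root of the mean of the squared prediction errors. As a small worked example, the sketch below applies the formula to the five predictions and labels printed in the sample-prediction cell above, using only the standard library.

```python
import math

# The five predictions and labels printed in the sample-prediction cell above.
predictions = [88983.14806384, 305351.35385026, 153334.71183453,
               184302.55162102, 246840.18988841]
labels = [72100.0, 279600.0, 82700.0, 112500.0, 238300.0]

squared_errors = [(p - y) ** 2 for p, y in zip(predictions, labels)]
mse = sum(squared_errors) / len(squared_errors)   # mean squared error
rmse = math.sqrt(mse)                             # back in dollars
print(round(rmse))
```

Because the errors are squared before averaging, the two samples that miss by about 70,000 dominate the result.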

DecisionTreeRegressor

In [40]:

from sklearn.tree import DecisionTreeRegressor

tree_reg = DecisionTreeRegressor()
tree_reg.fit(housing_prepared, housing_labels)

Out[40]:

DecisionTreeRegressor()

In [41]:

housing_predictions = tree_reg.predict(housing_prepared)
tree_mse = mean_squared_error(housing_labels, housing_predictions)
tree_rmse = np.sqrt(tree_mse)
tree_rmse

Out[41]:

0.0

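The model reports no error at all. It is doubtful that this is really a perfect model; it is far more likely that it has badly overfit the training data.

We do not want to touch the test set until we have a model we are confident enough to launch.

Instead, we will train on part of the training set and use another part for model validation.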

K-fold cross-validation

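The training set is randomly split into subsets called folds.
Each time, a different fold is picked for evaluation and all the remaining folds are used for training.
The result is an array containing k evaluation scores.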
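
The splitting scheme described above can be sketched in a few lines of plain Python. `k_fold_indices` is a hypothetical helper written for this illustration; it shuffles first, matching the description, whereas scikit-learn's integer `cv` option, as far as I am aware, uses consecutive unshuffled folds unless a `KFold(shuffle=True)` splitter is passed.

```python
import random

def k_fold_indices(n_samples, k, seed=42):
    """Yield (train_idx, val_idx) pairs for k roughly equal random folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, starting at position i.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        val_idx = folds[i]
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, val_idx

splits = list(k_fold_indices(n_samples=10, k=5))
for train_idx, val_idx in splits:
    print("validate on", sorted(val_idx), "train on", len(train_idx), "samples")
```

Every sample is used for validation exactly once and for training k - 1 times.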

In [42]:

from sklearn.model_selection import cross_val_score
scores = cross_val_score(tree_reg, housing_prepared, housing_labels,
                        scoring = "neg_mean_squared_error", cv = 10)
tree_rmse_scores = np.sqrt(-scores)

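scikit-learn's cross-validation expects the scoring parameter to be a utility function (greater is better) rather than a cost function.
neg_mean_squared_error: this is the negative of the mean squared error, so the sign is flipped with -scores before taking the square root.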

In [43]:

def display_scores(scores):
    print("Scores:", scores)
    print("Mean:", scores.mean())
    print("Standard deviation:", scores.std())
display_scores(tree_rmse_scores)

Scores: [69390.13474142 69818.47103039 64559.89006816 69840.36377624
 67951.91785872 68219.99677652 72019.71405803 71012.87892019
 66881.7078513  71548.92712786]
Mean: 69124.4002208828
Standard deviation: 2166.784718222486

In [44]:

# For comparison: the same cross-validation with linear regression
lin_scores = cross_val_score(lin_reg, housing_prepared, housing_labels,
                            scoring = 'neg_mean_squared_error', cv=10)
lin_rmse_scores = np.sqrt(-lin_scores)
display_scores(lin_rmse_scores)

Scores: [72229.03469752 65318.2240289  67706.39604745 69368.53738998
 66767.61061621 73003.75273869 70522.24414582 69440.77896541
 66930.32945876 70756.31946074]
Mean: 69204.32275494763
Standard deviation: 2372.0707910559213

RandomForestRegressor

In [45]:

from sklearn.ensemble import RandomForestRegressor
forest_reg = RandomForestRegressor()
forest_reg.fit(housing_prepared, housing_labels)
housing_predictions = forest_reg.predict(housing_prepared)
forest_mse = mean_squared_error(housing_labels, housing_predictions)
forest_rmse = np.sqrt(forest_mse)
forest_rmse

Out[45]:

18366.486650164195

In [46]:

scores = cross_val_score(forest_reg, housing_prepared, housing_labels,
                        scoring = "neg_mean_squared_error", cv = 10)
forest_rmse_scores = np.sqrt(-scores)
display_scores(forest_rmse_scores)

Scores: [50428.06815503 49222.42101743 46191.65369825 50501.18027831
 46993.31216438 49240.46940105 51599.03403672 48941.87528812
 47297.56401643 53207.55374291]
Mean: 49362.31317986178
Standard deviation: 2058.185300846608

Saving the model

In [49]:

# Use the pickle module or the joblib library
import joblib

joblib.dump(lin_reg, "my_model.pkl")

# load
# my_model_loaded = joblib.load('my_model.pkl')

Out[49]:

['my_model.pkl']
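
The pickle module mentioned in the comment follows the same dump and load pattern. A minimal round-trip sketch, using a plain dictionary named `model_like` as a stand-in for a trained model (an assumption for illustration) and a temporary directory so nothing is left on disk:

```python
import os
import pickle
import tempfile

# Stand-in for a trained model; any picklable Python object works the same way.
model_like = {"coef": [0.5, -1.2], "intercept": 3.0}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "my_model.pkl")
    with open(path, "wb") as f:
        pickle.dump(model_like, f)      # save
    with open(path, "rb") as f:
        restored = pickle.load(f)       # load

print(restored == model_like)
```

Only load pickle files from sources you trust, since unpickling can execute arbitrary code.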

Fine-tuning the model

In [53]:

# Few combinations: GridSearchCV; many combinations: RandomizedSearchCV
from sklearn.model_selection import GridSearchCV

param_grid = [
    {'n_estimators': [3, 10, 30],'max_features':[2,4,6,8]},
    {'bootstrap':[False], 'n_estimators':[3,10], 'max_features':[2,3,4]}
]

forest_reg = RandomForestRegressor()

grid_search = GridSearchCV(forest_reg, param_grid, cv = 5,
                           scoring = 'neg_mean_squared_error',
                           return_train_score = True)

grid_search.fit(housing_prepared, housing_labels)
# A common approach is to try consecutive powers of 10, or smaller steps for a finer search.

Out[53]:

GridSearchCV(cv=5, estimator=RandomForestRegressor(),
             param_grid=[{'max_features': [2, 4, 6, 8],
                          'n_estimators': [3, 10, 30]},
                         {'bootstrap': [False], 'max_features': [2, 3, 4],
                          'n_estimators': [3, 10]}],
             return_train_score=True, scoring='neg_mean_squared_error')
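
It helps to know how much work a grid implies before running it. The sketch below expands the same `param_grid` with `itertools.product` and counts the combinations and cross-validation fits; `expand` is a hypothetical helper written for this illustration.

```python
from itertools import product

param_grid = [
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
]

def expand(grid):
    """List every parameter combination in one grid dictionary."""
    keys = sorted(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]

combinations = [params for grid in param_grid for params in expand(grid)]
cv = 5
total_fits = len(combinations) * cv
print(len(combinations), "combinations,", total_fits, "cross-validation fits")
```

The first dictionary contributes 3 × 4 = 12 combinations and the second 2 × 3 = 6. With the default `refit=True`, the search also trains one final model on the whole training set.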

In [54]:

# Best combination of hyperparameters
grid_search.best_params_

Out[54]:

{'max_features': 6, 'n_estimators': 30}

In [56]:

grid_search.best_estimator_

Out[56]:

RandomForestRegressor(max_features=6, n_estimators=30)

In [57]:

cvres = grid_search.cv_results_

In [58]:

for mean_score, params in zip(cvres['mean_test_score'], cvres['params']):
    print(np.sqrt(-mean_score), params)

64100.50739759063 {'max_features': 2, 'n_estimators': 3}
55765.0987319143 {'max_features': 2, 'n_estimators': 10}
52307.466182445314 {'max_features': 2, 'n_estimators': 30}
60261.39971672139 {'max_features': 4, 'n_estimators': 3}
52877.03071206763 {'max_features': 4, 'n_estimators': 10}
50368.915458169184 {'max_features': 4, 'n_estimators': 30}
59844.63860401046 {'max_features': 6, 'n_estimators': 3}
52016.74165502729 {'max_features': 6, 'n_estimators': 10}
50091.246410598615 {'max_features': 6, 'n_estimators': 30}
58065.03668517956 {'max_features': 8, 'n_estimators': 3}
51683.85492328109 {'max_features': 8, 'n_estimators': 10}
50210.02780909483 {'max_features': 8, 'n_estimators': 30}
61987.172799861415 {'bootstrap': False, 'max_features': 2, 'n_estimators': 3}
53898.775488719584 {'bootstrap': False, 'max_features': 2, 'n_estimators': 10}
59822.771781725525 {'bootstrap': False, 'max_features': 3, 'n_estimators': 3}
52724.25053622563 {'bootstrap': False, 'max_features': 3, 'n_estimators': 10}
58717.78125682323 {'bootstrap': False, 'max_features': 4, 'n_estimators': 3}
51826.21262946235 {'bootstrap': False, 'max_features': 4, 'n_estimators': 10}

Checking the relative importance of each feature with the best model: RandomForestRegressor

In [62]:

feature_importances = grid_search.best_estimator_.feature_importances_
feature_importances

Out[62]:

array([1.17934434e-01, 1.07259297e-01, 4.72949603e-02, 3.62695678e-02,
       3.01461898e-02, 4.31874979e-02, 2.82123625e-02, 4.16269022e-01,
       1.15006629e-02, 1.53123710e-01, 7.42608942e-05, 2.66916591e-03,
       6.05886939e-03])

In [64]:

# The pipeline in this post adds no derived columns, so the 13 importances map to
# the 8 numeric features followed by the 5 one-hot categories of ocean_proximity.
cat_encoder = full_pipeline.named_transformers_['cat']
cat_one_hot_attribs = list(cat_encoder.categories_[0])
attributes = num_attribs + cat_one_hot_attribs
sorted(zip(feature_importances, attributes), reverse=True)

Out[64]:

[(0.416269021927223, 'median_income'),
 (0.15312370956640659, 'INLAND'),
 (0.11793443425350791, 'longitude'),
 (0.10725929681873321, 'latitude'),
 (0.04729496031357122, 'housing_median_age'),
 (0.043187497852368056, 'population'),
 (0.03626956780898215, 'total_rooms'),
 (0.030146189824382594, 'total_bedrooms'),
 (0.028212362527238823, 'households'),
 (0.011500662919772708, '<1H OCEAN'),
 (0.006058869387255515, 'NEAR OCEAN'),
 (0.002669165906349746, 'NEAR BAY'),
 (7.426089420868927e-05, 'ISLAND')]

Evaluating the system on the test set

In [65]:

# We must not fit on the test set, so call transform(), not fit_transform()
final_model = grid_search.best_estimator_

X_test = strat_test_set.drop('median_house_value', axis = 1)
y_test = strat_test_set['median_house_value'].copy()

X_test_prepared = full_pipeline.transform(X_test)

final_predictions = final_model.predict(X_test_prepared)

final_mse = mean_squared_error(y_test, final_predictions)
final_rmse = np.sqrt(final_mse)

Confidence interval

https://angeloyeo.github.io/2021/01/05/confidence_interval.html

In [ ]:

from scipy import stats
confidence = 0.95
squared_errors = (final_predictions - y_test) ** 2
# 95% t-interval for the mean squared error, converted to RMSE with the square root
np.sqrt(stats.t.interval(confidence, len(squared_errors) - 1,
                        loc = squared_errors.mean(),
                        scale = stats.sem(squared_errors)))

Probability distributions with scipy

https://datascienceschool.net/02%20mathematics/08.01%20%EC%82%AC%EC%9D%B4%ED%8C%8C%EC%9D%B4%EB%A5%BC%20%EC%9D%B4%EC%9A%A9%ED%95%9C%20%ED%99%95%EB%A5%A0%EB%B6%84%ED%8F%AC%20%EB%B6%84%EC%84%9D.html

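loc: usually the expected value (mean) of the distribution
scale: usually the standard deviation of the distribution
stats.sem: standard error of the mean
squared_errors.mean(): the sample mean of the squared errors (the test-set MSE), used as the center of the interval
Degrees of freedom: len(squared_errors) - 1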
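
The interval is the sample mean plus or minus a quantile times the standard error, with the square root taken at the end to get back to RMSE. The sketch below shows that arithmetic using only the standard library. It swaps the t distribution for the normal distribution (`statistics.NormalDist`), which is nearly identical for samples this large, and it uses synthetic squared errors rather than the real test-set errors.

```python
import math
import random
from statistics import NormalDist, mean, stdev

# Synthetic squared errors, for illustration only (not the real test-set errors).
rng = random.Random(42)
squared_errors = [rng.gauss(0.0, 50000.0) ** 2 for _ in range(4000)]

confidence = 0.95
n = len(squared_errors)
center = mean(squared_errors)                    # sample mean of squared errors (MSE)
sem = stdev(squared_errors) / math.sqrt(n)       # standard error of that mean
z = NormalDist().inv_cdf(0.5 + confidence / 2)   # about 1.96 for 95%

lower = math.sqrt(center - z * sem)
upper = math.sqrt(center + z * sem)
rmse = math.sqrt(center)
print(lower, rmse, upper)
```

With few samples the t quantile is noticeably larger than z, which is why the scipy version above passes the degrees of freedom explicitly.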

Kang's Note

[Study 01] Predicting Housing Prices
