Feature Engineering and Dealing with Missing Data
In [46]:
import pandas as pd
In [48]:
employee_df = pd.read_csv('Human_Resources_Employee.csv')
employee_df.head()
Out[48]:
| | Age | Attrition | BusinessTravel | DailyRate | Department | DistanceFromHome | Education | EducationField | EmployeeCount | EmployeeNumber | ... | RelationshipSatisfaction | StandardHours | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 41 | Yes | Travel_Rarely | 1102 | Sales | 1 | 2 | Life Sciences | 1 | 1.0 | ... | 1 | 80 | 0 | 8 | 0 | 1 | 6 | 4 | 0 | 5 |
1 | 49 | No | Travel_Frequently | 279 | Research & Development | 8 | 1 | Life Sciences | 1 | 2.0 | ... | 4 | 80 | 1 | 10 | 3 | 3 | 10 | 7 | 1 | 7 |
2 | 37 | Yes | Travel_Rarely | 1373 | Research & Development | 2 | 2 | Other | 1 | 4.0 | ... | 2 | 80 | 0 | 7 | 3 | 3 | 0 | 0 | 0 | 0 |
3 | 33 | No | Travel_Frequently | 1392 | Research & Development | 3 | 4 | Life Sciences | 1 | 5.0 | ... | 3 | 80 | 0 | 8 | 3 | 3 | 8 | 7 | 3 | 0 |
4 | 27 | No | Travel_Rarely | 591 | Research & Development | 2 | 1 | Medical | 1 | 7.0 | ... | 4 | 80 | 1 | 6 | 3 | 3 | 2 | 2 | 2 | 2 |
5 rows × 35 columns
In [ ]:
employee_df.shape
In [49]:
# Count the null values in each column
employee_df.isnull().sum()
Out[49]:
Age 0
Attrition 0
BusinessTravel 0
DailyRate 0
Department 1
DistanceFromHome 0
Education 0
EducationField 1
EmployeeCount 0
EmployeeNumber 1
EnvironmentSatisfaction 0
Gender 1
HourlyRate 0
JobInvolvement 0
JobLevel 0
JobRole 1
JobSatisfaction 0
MaritalStatus 1
MonthlyIncome 3
MonthlyRate 2
NumCompaniesWorked 0
Over18 0
OverTime 0
PercentSalaryHike 1
PerformanceRating 1
RelationshipSatisfaction 0
StandardHours 0
StockOptionLevel 0
TotalWorkingYears 0
TrainingTimesLastYear 0
WorkLifeBalance 0
YearsAtCompany 0
YearsInCurrentRole 0
YearsSinceLastPromotion 0
YearsWithCurrManager 0
dtype: int64
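With 35 columns, the full null count is long to scan. A minimal sketch of narrowing it to the columns that actually contain missing values follows; `toy_df` and its values are hypothetical stand-ins for `employee_df`, not rows from the real file.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for employee_df
toy_df = pd.DataFrame({
    'Age': [41, 49, 37, 33],
    'Department': ['Sales', None, 'Research & Development', 'Sales'],
    'MonthlyIncome': [5993.0, np.nan, 2090.0, np.nan],
})

# Count nulls per column, then keep only the columns with at least one null
null_counts = toy_df.isnull().sum()
columns_with_nulls = null_counts[null_counts > 0]
print(columns_with_nulls)
```

The same boolean filter should work on `employee_df.isnull().sum()` directly.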
In [50]:
employee_df.loc[employee_df['Department'].isnull()]
Out[50]:
| | Age | Attrition | BusinessTravel | DailyRate | Department | DistanceFromHome | Education | EducationField | EmployeeCount | EmployeeNumber | ... | RelationshipSatisfaction | StandardHours | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
7 | 30 | No | Travel_Rarely | 1358 | NaN | 24 | 1 | Life Sciences | 1 | 11.0 | ... | 2 | 80 | 1 | 1 | 2 | 3 | 1 | 0 | 0 | 0 |
1 rows × 35 columns
In [51]:
# Drop any row that contains a null value
employee_df.dropna(how='any', inplace=True)
In [52]:
# 7 rows are gone
employee_df.shape
Out[52]:
(1463, 35)
In [53]:
# Reload the original data
employee_df = pd.read_csv('Human_Resources_Employee.csv')
In [54]:
# Drop rows only if one of the columns we are interested in is null
employee_df.dropna(how='any', inplace=True, subset=['MonthlyIncome', 'PercentSalaryHike'])
In [55]:
employee_df.shape
Out[55]:
(1467, 35)
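The two `dropna` calls above keep different numbers of rows because `subset` limits the null check to the listed columns. A small sketch of that difference, using a hypothetical `toy_df` with made-up values rather than the real HR file:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one null in Department, one null in MonthlyIncome
toy_df = pd.DataFrame({
    'Department': ['Sales', None, 'Sales', 'Sales'],
    'MonthlyIncome': [5993.0, 5130.0, np.nan, 2909.0],
    'PercentSalaryHike': [11.0, 23.0, 15.0, 11.0],
})

# Checks every column: both rows containing a null are dropped
dropped_any = toy_df.dropna(how='any')

# Checks only the listed columns: the row with a null Department survives
dropped_subset = toy_df.dropna(how='any', subset=['MonthlyIncome', 'PercentSalaryHike'])

print(dropped_any.shape)     # (2, 3)
print(dropped_subset.shape)  # (3, 3)
```

Without `inplace=True`, `dropna` returns a new DataFrame and leaves `toy_df` unchanged, which avoids reloading the CSV between experiments.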
In [56]:
# Reload the original data before filling the missing values
employee_df = pd.read_csv('Human_Resources_Employee.csv')
In [57]:
# Calculate the average monthly income
employee_df['MonthlyIncome'].mean()
Out[57]:
6505.155419222904
In [58]:
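# Fill missing monthly incomes with the column mean; assigning the result back
# avoids the chained inplace call, which recent pandas versions warn about
employee_df['MonthlyIncome'] = employee_df['MonthlyIncome'].fillna(employee_df['MonthlyIncome'].mean())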
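To check that mean imputation behaves as expected, here is a minimal sketch on a hypothetical single-column `toy_df` (the incomes are invented for illustration). It assigns the filled Series back to the column and then confirms that no nulls remain.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing incomes
toy_df = pd.DataFrame({'MonthlyIncome': [5000.0, np.nan, 7000.0, np.nan]})

# mean() skips NaN by default, so only 5000 and 7000 are averaged
mean_income = toy_df['MonthlyIncome'].mean()

# Assign the filled Series back to the column
toy_df['MonthlyIncome'] = toy_df['MonthlyIncome'].fillna(mean_income)

remaining_nulls = int(toy_df['MonthlyIncome'].isnull().sum())
print(mean_income)      # 6000.0
print(remaining_nulls)  # 0
```

For skewed columns such as income, the median may be a reasonable alternative to the mean; that would be a modeling choice, not something this notebook does.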