Dataset 1: flight data
Because of its large size, it is stored in xlsb format.
After extracting the archive, open it in Excel, press F12 (Save As), and save it as CSV for use in the analysis.
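The Excel round-trip can also be done programmatically: pandas can read xlsb files directly through the `pyxlsb` engine (`pip install pyxlsb`). A minimal sketch — `flights.xlsb` is a hypothetical path for illustration:

```python
from pathlib import Path
import pandas as pd

# 'flights.xlsb' is a placeholder; substitute the actual extracted file.
src = Path('flights.xlsb')
if src.exists():
    # requires the optional pyxlsb dependency
    df = pd.read_excel(src, engine='pyxlsb')
    df.to_csv(src.with_suffix('.csv'), index=False)
```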
Dataset 2: HR data
In [68]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
catplot
In [69]:
df = sns.load_dataset('tips')
In [70]:
sns.catplot(x='day',y='total_bill', data = df)
Out[70]:
<seaborn.axisgrid.FacetGrid at 0x7f8b956c8d30>
relplot : scatter + line
In [71]:
penguins = sns.load_dataset('penguins'); penguins
Out[71]:
 | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex
---|---|---|---|---|---|---|---
0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male |
1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female |
2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female |
3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN |
4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | Female |
... | ... | ... | ... | ... | ... | ... | ... |
339 | Gentoo | Biscoe | NaN | NaN | NaN | NaN | NaN |
340 | Gentoo | Biscoe | 46.8 | 14.3 | 215.0 | 4850.0 | Female |
341 | Gentoo | Biscoe | 50.4 | 15.7 | 222.0 | 5750.0 | Male |
342 | Gentoo | Biscoe | 45.2 | 14.8 | 212.0 | 5200.0 | Female |
343 | Gentoo | Biscoe | 49.9 | 16.1 | 213.0 | 5400.0 | Male |
344 rows × 7 columns
In [72]:
sns.set_theme(style="whitegrid", palette="pastel")
# sns.set_theme('talk')
# sns.set_style("whitegrid")
g =sns.relplot(data = penguins, x = 'flipper_length_mm', y='bill_length_mm', col = 'island', hue='sex', style = 'species') # kind='line'
# passing a categorical column name to `col` splits the figure into one facet per category of that column
g.set_xticklabels(rotation=45)
# https://stackoverflow.com/questions/32542957/control-tick-labels-in-python-seaborn-package
Out[72]:
<seaborn.axisgrid.FacetGrid at 0x7f8b957cc460>
lineplot
In [73]:
sns.set_theme(style = 'whitegrid')
# example
random_state = np.random.RandomState(365)
# creates a RandomState object; drawing random numbers through it makes the results reproducible
values = random_state.randn(365, 4).cumsum(axis=0)
# randn(m, n): generates an (m, n) array of draws from the standard normal distribution (mean 0, std 1)
dates = pd.date_range('2 2 2020', periods = 365, freq='D')
df = pd.DataFrame(values, index=dates, columns=['A', 'B', 'C', 'D'])
df = df.rolling(7).mean()
# rolling(): computes a moving-window statistic; here, a 7-day moving average
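The randn → cumsum → rolling pipeline above can be checked at a smaller scale. A sketch with a 30-step random walk (names here are illustrative):

```python
import numpy as np
import pandas as pd

rs = np.random.RandomState(365)           # seeded generator for reproducibility
values = rs.randn(30, 2).cumsum(axis=0)   # 30-step random walk in 2 columns
dates = pd.date_range('2020-02-02', periods=30, freq='D')
walk = pd.DataFrame(values, index=dates, columns=['A', 'B'])

smoothed = walk.rolling(7).mean()  # 7-day moving average
# the first 6 rows have no full 7-day window, so they are NaN
print(smoothed['A'].isna().sum())  # → 6
```

The leading NaNs are why the lineplot below starts a week after the first date; `rolling(7, min_periods=1)` would fill them with partial-window averages instead.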
rc : runtime configuration
These are settings that configure matplotlib at runtime.
Passing an rc dict (e.g. to sns.set) overrides individual matplotlib options.
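Under the hood, `sns.set(rc={...})` writes into matplotlib's `rcParams` dictionary; setting the same key directly shows the mechanism (a sketch using the headless Agg backend so it runs without a display):

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; safe without a display
import matplotlib.pyplot as plt

# figure.figsize is the same rc key that sns.set(rc={'figure.figsize': ...}) targets
plt.rcParams['figure.figsize'] = (5, 7.49)
fig, ax = plt.subplots()  # new figures pick up the rc default
w, h = fig.get_size_inches()
plt.close(fig)
```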
In [74]:
sns.set(rc={'figure.figsize':(5,7.49)})
sns.lineplot(data = df, palette = 'tab10', linewidth = 1 )
Out[74]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f8b977e6df0>
In [75]:
df.fillna(1)
Out[75]:
 | A | B | C | D
---|---|---|---|---
2020-02-02 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
2020-02-03 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
2020-02-04 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
2020-02-05 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
2020-02-06 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
... | ... | ... | ... | ... |
2021-01-27 | -18.482826 | 10.330142 | -12.108625 | 14.878444 |
2021-01-28 | -18.693797 | 10.391382 | -12.020502 | 15.376387 |
2021-01-29 | -18.752957 | 10.062616 | -11.685921 | 15.996722 |
2021-01-30 | -18.918042 | 9.957435 | -11.244617 | 16.669990 |
2021-01-31 | -19.239433 | 10.189316 | -10.733219 | 17.307552 |
365 rows × 4 columns
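As the output above shows, `fillna(1)` replaces every NaN (including the leading rolling-window NaNs) with 1. A minimal sketch of the behavior:

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 2.0, np.nan, 4.0])
filled = s.fillna(1)
print(filled.tolist())  # → [1.0, 2.0, 1.0, 4.0]
```

Note that `fillna` returns a new object: calling `df.fillna(1)` without assigning the result, as in the cell above, displays the filled frame but leaves `df` itself unchanged.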
In [76]:
sns.set(rc={'figure.figsize':(15,12)})
sns.lineplot(data = df, palette = 'tab10', linewidth = 3.5 )
Out[76]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f8b977dcca0>
In [92]:
from pandas_datareader import data as pdr
import yfinance as yf
from datetime import datetime
pandas-datareader
https://pandas-datareader.readthedocs.io/en/latest/remote_data.html
pandas-datareader issue
In [96]:
yf.pdr_override()
stock = ['GOOG','MSFT','TSLA','META']
df = pdr.get_data_yahoo(stock, start = "2019-01-01", end = "2021-04-30")
[*********************100%***********************] 4 of 4 completed
In [97]:
df.head(100)
Out[97]:
(Output: a DataFrame with MultiIndex columns — Adj Close, Close, High, Low, Open, Volume, each broken out by ticker GOOG, META, MSFT, TSLA — indexed by trading date starting 2019-01-02.)
100 rows × 24 columns
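`get_data_yahoo` returns a frame with two-level (MultiIndex) columns, price field × ticker, which is why `df['Close']` below yields a one-column-per-ticker sub-frame. A toy illustration with hypothetical values:

```python
import pandas as pd

# two price fields × two tickers → 4 MultiIndex columns
cols = pd.MultiIndex.from_product([['Close', 'Volume'], ['GOOG', 'MSFT']])
df = pd.DataFrame([[1.0, 2.0, 10, 20],
                   [1.5, 2.5, 11, 21]], columns=cols)

close = df['Close']            # selecting the outer level keeps one column per ticker
print(list(close.columns))    # → ['GOOG', 'MSFT']
print(close['GOOG'].tolist())  # → [1.0, 1.5]
```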
In [101]:
sns.set_theme(style = 'whitegrid')
sns.set_style('darkgrid')
fig, ax = plt.subplots()
fig.set_size_inches(12,9)
sns.lineplot(data =df['Close'], palette = 'bright', linewidth = 3.5)
Out[101]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f8b94518970>
Practical example 1
In [124]:
cd /content/
/content
In [104]:
flights_df = pd.read_csv('flights.csv')
In [105]:
flights_df.head(10)
Out[105]:
 | YEAR | MONTH | DAY | DAY_OF_WEEK | AIRLINE | FLIGHT_NUMBER | TAIL_NUMBER | ORIGIN_AIRPORT | DESTINATION_AIRPORT | SCHEDULED_DEPARTURE | ... | ARRIVAL_TIME | ARRIVAL_DELAY | DIVERTED | CANCELLED | CANCELLATION_REASON | AIR_SYSTEM_DELAY | SECURITY_DELAY | AIRLINE_DELAY | LATE_AIRCRAFT_DELAY | WEATHER_DELAY
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 2015 | 1 | 1 | 4 | AS | 98 | N407AS | ANC | SEA | 5.0 | ... | 408.0 | -22.0 | 0.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN |
1 | 2015 | 1 | 1 | 4 | AA | 2336 | N3KUAA | LAX | PBI | 10.0 | ... | 741.0 | -9.0 | 0.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN |
2 | 2015 | 1 | 1 | 4 | US | 840 | N171US | SFO | CLT | 20.0 | ... | 811.0 | 5.0 | 0.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN |
3 | 2015 | 1 | 1 | 4 | AA | 258 | N3HYAA | LAX | MIA | 20.0 | ... | 756.0 | -9.0 | 0.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN |
4 | 2015 | 1 | 1 | 4 | AS | 135 | N527AS | SEA | ANC | 25.0 | ... | 259.0 | -21.0 | 0.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN |
5 | 2015 | 1 | 1 | 4 | DL | 806 | N3730B | SFO | MSP | 25.0 | ... | 610.0 | 8.0 | 0.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN |
6 | 2015 | 1 | 1 | 4 | NK | 612 | N635NK | LAS | MSP | 25.0 | ... | 509.0 | -17.0 | 0.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN |
7 | 2015 | 1 | 1 | 4 | US | 2013 | N584UW | LAX | CLT | 30.0 | ... | 753.0 | -10.0 | 0.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN |
8 | 2015 | 1 | 1 | 4 | AA | 1112 | N3LAAA | SFO | DFW | 30.0 | ... | 532.0 | -13.0 | 0.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN |
9 | 2015 | 1 | 1 | 4 | DL | 1173 | N826DN | LAS | ATL | 30.0 | ... | 656.0 | -15.0 | 0.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN |
10 rows × 31 columns
In [108]:
flights_df.dtypes
Out[108]:
YEAR int64
MONTH int64
DAY int64
DAY_OF_WEEK int64
AIRLINE object
FLIGHT_NUMBER int64
TAIL_NUMBER object
ORIGIN_AIRPORT object
DESTINATION_AIRPORT object
SCHEDULED_DEPARTURE float64
DEPARTURE_TIME float64
DEPARTURE_DELAY float64
TAXI_OUT float64
WHEELS_OFF float64
SCHEDULED_TIME float64
ELAPSED_TIME float64
AIR_TIME float64
DISTANCE float64
WHEELS_ON float64
TAXI_IN float64
SCHEDULED_ARRIVAL float64
ARRIVAL_TIME float64
ARRIVAL_DELAY float64
DIVERTED float64
CANCELLED float64
CANCELLATION_REASON object
AIR_SYSTEM_DELAY float64
SECURITY_DELAY float64
AIRLINE_DELAY float64
LATE_AIRCRAFT_DELAY float64
WEATHER_DELAY float64
dtype: object
In [111]:
flights_df['DEPARTURE_TIME'].isnull().values.any()
Out[111]:
True
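`isnull().values.any()` is a quick boolean check for whether a column contains any NaN; `isnull().sum()` gives per-column counts. A sketch on a toy frame mimicking the flight columns:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'DEPARTURE_TIME': [5.0, np.nan, 30.0],
                   'AIRLINE': ['AS', 'AA', 'US']})

has_nan = df['DEPARTURE_TIME'].isnull().values.any()  # True: at least one NaN
nan_counts = df.isnull().sum()                        # per-column NaN counts
```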
In [112]:
test_flights = flights_df.fillna(1)
In [116]:
test_flights.columns
Out[116]:
Index(['YEAR', 'MONTH', 'DAY', 'DAY_OF_WEEK', 'AIRLINE', 'FLIGHT_NUMBER',
'TAIL_NUMBER', 'ORIGIN_AIRPORT', 'DESTINATION_AIRPORT',
'SCHEDULED_DEPARTURE', 'DEPARTURE_TIME', 'DEPARTURE_DELAY', 'TAXI_OUT',
'WHEELS_OFF', 'SCHEDULED_TIME', 'ELAPSED_TIME', 'AIR_TIME', 'DISTANCE',
'WHEELS_ON', 'TAXI_IN', 'SCHEDULED_ARRIVAL', 'ARRIVAL_TIME',
'ARRIVAL_DELAY', 'DIVERTED', 'CANCELLED', 'CANCELLATION_REASON',
'AIR_SYSTEM_DELAY', 'SECURITY_DELAY', 'AIRLINE_DELAY',
'LATE_AIRCRAFT_DELAY', 'WEATHER_DELAY'],
dtype='object')
In [123]:
sns.set_theme(style = 'whitegrid')
cmap = sns.cubehelix_palette(rot=-.3, as_cmap=True)
g = sns.relplot(data = test_flights, x="DISTANCE", y='AIR_TIME', hue = "FLIGHT_NUMBER", size = 'DEPARTURE_DELAY',palette = cmap )
g.fig.set_size_inches(15,10)
g.set(xscale = 'log', yscale = 'log')
g.ax.xaxis.grid(True, 'minor', linewidth = .25)
g.ax.yaxis.grid(True, 'minor', linewidth = .25)
g.despine(left=True, bottom=True)  # remove the left and bottom spines (the lines connecting the axis tick marks)
Out[123]:
<seaborn.axisgrid.FacetGrid at 0x7f8b919bafd0>
Practical example 2
jointplot
In [125]:
hr_df = pd.read_csv('aug_train.csv')
In [126]:
hr_df.head(10)
Out[126]:
 | enrollee_id | city | city_development_index | gender | relevent_experience | enrolled_university | education_level | major_discipline | experience | company_size | company_type | last_new_job | training_hours | target
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 8949 | city_103 | 0.920 | Male | Has relevent experience | no_enrollment | Graduate | STEM | >20 | NaN | NaN | 1 | 36 | 1.0 |
1 | 29725 | city_40 | 0.776 | Male | No relevent experience | no_enrollment | Graduate | STEM | 15 | 50-99 | Pvt Ltd | >4 | 47 | 0.0 |
2 | 11561 | city_21 | 0.624 | NaN | No relevent experience | Full time course | Graduate | STEM | 5 | NaN | NaN | never | 83 | 0.0 |
3 | 33241 | city_115 | 0.789 | NaN | No relevent experience | NaN | Graduate | Business Degree | <1 | NaN | Pvt Ltd | never | 52 | 1.0 |
4 | 666 | city_162 | 0.767 | Male | Has relevent experience | no_enrollment | Masters | STEM | >20 | 50-99 | Funded Startup | 4 | 8 | 0.0 |
5 | 21651 | city_176 | 0.764 | NaN | Has relevent experience | Part time course | Graduate | STEM | 11 | NaN | NaN | 1 | 24 | 1.0 |
6 | 28806 | city_160 | 0.920 | Male | Has relevent experience | no_enrollment | High School | NaN | 5 | 50-99 | Funded Startup | 1 | 24 | 0.0 |
7 | 402 | city_46 | 0.762 | Male | Has relevent experience | no_enrollment | Graduate | STEM | 13 | <10 | Pvt Ltd | >4 | 18 | 1.0 |
8 | 27107 | city_103 | 0.920 | Male | Has relevent experience | no_enrollment | Graduate | STEM | 7 | 50-99 | Pvt Ltd | 1 | 46 | 1.0 |
9 | 699 | city_103 | 0.920 | NaN | Has relevent experience | no_enrollment | Graduate | STEM | 17 | 10000+ | Pvt Ltd | >4 | 123 | 0.0 |
In [127]:
hr_df.isnull().sum()
Out[127]:
enrollee_id 0
city 0
city_development_index 0
gender 4508
relevent_experience 0
enrolled_university 386
education_level 460
major_discipline 2813
experience 65
company_size 5938
company_type 6140
last_new_job 423
training_hours 0
target 0
dtype: int64
In [128]:
df_clean = hr_df.dropna()
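`dropna()` removes every row that contains at least one NaN, so with sparsely populated columns like `company_size` and `company_type` it can discard a large share of the data — worth checking `isnull().sum()` first, as above. A sketch of the row loss on a toy frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'gender': ['Male', np.nan, 'Female', np.nan],
                   'training_hours': [36, 47, 83, 52]})

clean = df.dropna()          # drops any row containing a NaN
print(len(df), len(clean))   # → 4 2
```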
In [129]:
df_clean.head(10)
Out[129]:
 | enrollee_id | city | city_development_index | gender | relevent_experience | enrolled_university | education_level | major_discipline | experience | company_size | company_type | last_new_job | training_hours | target
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
1 | 29725 | city_40 | 0.776 | Male | No relevent experience | no_enrollment | Graduate | STEM | 15 | 50-99 | Pvt Ltd | >4 | 47 | 0.0 |
4 | 666 | city_162 | 0.767 | Male | Has relevent experience | no_enrollment | Masters | STEM | >20 | 50-99 | Funded Startup | 4 | 8 | 0.0 |
7 | 402 | city_46 | 0.762 | Male | Has relevent experience | no_enrollment | Graduate | STEM | 13 | <10 | Pvt Ltd | >4 | 18 | 1.0 |
8 | 27107 | city_103 | 0.920 | Male | Has relevent experience | no_enrollment | Graduate | STEM | 7 | 50-99 | Pvt Ltd | 1 | 46 | 1.0 |
11 | 23853 | city_103 | 0.920 | Male | Has relevent experience | no_enrollment | Graduate | STEM | 5 | 5000-9999 | Pvt Ltd | 1 | 108 | 0.0 |
12 | 25619 | city_61 | 0.913 | Male | Has relevent experience | no_enrollment | Graduate | STEM | >20 | 1000-4999 | Pvt Ltd | 3 | 23 | 0.0 |
15 | 6588 | city_114 | 0.926 | Male | Has relevent experience | no_enrollment | Graduate | STEM | 16 | 10/49 | Pvt Ltd | >4 | 18 | 0.0 |
20 | 31972 | city_159 | 0.843 | Male | Has relevent experience | no_enrollment | Masters | STEM | 11 | 100-500 | Pvt Ltd | 1 | 68 | 0.0 |
21 | 19061 | city_114 | 0.926 | Male | Has relevent experience | no_enrollment | Masters | STEM | 11 | 100-500 | Pvt Ltd | 2 | 50 | 0.0 |
23 | 7041 | city_40 | 0.776 | Male | Has relevent experience | no_enrollment | Graduate | Humanities | <1 | 1000-4999 | Pvt Ltd | 1 | 65 | 0.0 |
In [131]:
sns.set_theme()
sns.set_context('paper')
g = sns.jointplot(data = df_clean, x = 'city_development_index', y = 'training_hours', hue = 'company_type', kind = 'kde')
g.fig.set_size_inches(15,15)
In [134]:
from google.colab import files
g.figure.savefig('seaborn_test.png')
files.download('seaborn_test.png')