
Data Analytics with python / [Data Analysis] (63 posts)

outlier: Removing outliers based on the variable most highly correlated with the target. The preview draws a correlation heatmap of the card_df features (plt.figure(figsize=(9,9)); sns.heatmap(card_df.corr(), cmap='RdBu')), then begins a helper get_outlier(df=None, column=None, weight=1.5) that takes the Class==1 rows of a column, computes the 25th and 75th percentiles with np.percentile, forms the IQR, and scales it by the weight to build cut-off bounds (preview truncated). 2023. 2. 18.
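The get_outlier helper in the preview above is cut off mid-function. Below is a minimal sketch of how an IQR-based outlier filter like this is typically completed; the percentile logic, the 'Class' filter, and weight=1.5 come from the snippet, while the bound variable names, the returned index, and the 'V14' column in the usage comment are assumptions.

import numpy as np
import pandas as pd

def get_outlier(df=None, column=None, weight=1.5):
    # look only at the positive-class (Class == 1) rows of the chosen column
    data = df[df['Class'] == 1][column]
    quantile_25 = np.percentile(data.values, 25)
    quantile_75 = np.percentile(data.values, 75)
    iqr = quantile_75 - quantile_25
    iqr_weight = iqr * weight
    # anything outside [Q1 - weight*IQR, Q3 + weight*IQR] is treated as an outlier
    lowest_val = quantile_25 - iqr_weight
    highest_val = quantile_75 + iqr_weight
    outlier_index = data[(data < lowest_val) | (data > highest_val)].index
    return outlier_index

# usage sketch: drop the flagged rows of the most correlated feature, e.g. 'V14' (assumed)
# outlier_index = get_outlier(df=card_df, column='V14', weight=1.5)
# card_df = card_df.drop(outlier_index, axis=0)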
[Visualization] Plotly_Part2: plotly https://plotly.com/python/ Imports plotly.express as px, pandas, and numpy for interactive graphs. Practical example dataset: https://www.kaggle.com/datasets/davincermak/quarterly-census-of-employment-and-wages-may-2020?resource=download choropleth_mapbox: loads data.csv (unemployment data for May 2020) into df_practice and previews df_practice.head(1), whose columns begin with area_fips, area_title, may2020 (preview truncated). 2023. 1. 22.
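The choropleth_mapbox step itself is cut off in the preview. A minimal sketch of a px.choropleth_mapbox call on this kind of county-level frame; df_practice, 'data.csv', and 'area_fips' come from the snippet, while the GeoJSON file, the featureidkey, the color column, and the map settings are assumptions.

import json
import pandas as pd
import plotly.express as px

df_practice = pd.read_csv('data.csv')        # employment/unemployment data from the snippet

with open('counties.geojson') as f:          # assumed GeoJSON of county boundaries
    counties = json.load(f)

fig = px.choropleth_mapbox(
    df_practice,
    geojson=counties,
    locations='area_fips',                   # column name from the snippet
    featureidkey='id',                       # depends on how the GeoJSON stores the FIPS code
    color='may2020_unemployment',            # assumed column name (truncated in the preview)
    mapbox_style='carto-positron',           # open tile style, no Mapbox token required
    zoom=3,
    center={'lat': 37.8, 'lon': -96},
    opacity=0.6,
)
fig.show()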
[Visualization] Plotly_Part1: plotly https://plotly.com/python/ and plotly/datasets https://github.com/plotly/datasets Imports plotly.express as px, pandas, and numpy for interactive graphs, then loads the built-in Gapminder data with df = px.data.gapminder(); its columns are country, continent, year, lifeExp, pop, gdpPercap, iso_alpha, and iso_num (preview truncated). 2023. 1. 22.
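The charts built from the Gapminder frame are not visible in the truncated preview. A small sketch of the kind of interactive scatter plot plotly.express produces from this dataset; the specific chart type, year filter, and encodings below are assumptions.

import plotly.express as px

df = px.data.gapminder()

# one year of the Gapminder data as an interactive bubble chart
fig = px.scatter(
    df[df['year'] == 2007],
    x='gdpPercap',
    y='lifeExp',
    size='pop',
    color='continent',
    hover_name='country',
    log_x=True,
)
fig.show()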
[Visualization] seaborn: Dataset 1 is flight data; because of its size it is distributed in xlsb format, so extract it, open it in Excel, and save it as csv (F12) before using it for analysis. Dataset 2 is HR data. Data sources: https://www.kaggle.com/datasets/usdot/flight-delays and https://www.kaggle.com/datasets/arashnic/hr-analytics-job-change-of-data-scientists?resource=download Seaborn https://seaborn.pydata.org/tutorial/introduction Imports matplotlib.pyplot, seaborn, and pandas (preview truncated). 2023. 1. 22.
[Visualization] matplotlib: Data source: https://www.kaggle.com/datasets/yasserh/bitcoin-prices-dataset?resource=download matplotlib tutorial: https://matplotlib.org/stable/tutorials/introductory/usage.html Imports matplotlib.pyplot, seaborn (built on top of matplotlib), pandas, and numpy with %matplotlib inline, then draws a first line chart with plt.plot([1,2,3,4,5,6], [9,7,8,2,4,6]) before moving on to the axes section of the tutorial (preview truncated). 2023. 1. 22.
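The preview stops right where the axes material begins. A minimal sketch of the same line plot drawn through matplotlib's explicit Figure/Axes interface; the figure size, labels, and title are assumptions.

import matplotlib.pyplot as plt

# object-oriented interface: create a Figure with one explicit Axes
fig, ax = plt.subplots(figsize=(8, 4))
ax.plot([1, 2, 3, 4, 5, 6], [9, 7, 8, 2, 4, 6])   # same data as the snippet
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_title('Line plot on an explicit Axes')
plt.show()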
[Visualization] Basic_for _visualization: Builds small practice frames for the plotting sections: a numerical frame num_df = pd.DataFrame(np.random.randint(0, 100, size=(10,5)), columns=list('ABCDE')) filled with random integers, and a categorical frame cat_df with a "Color" column (preview truncated). 2023. 1. 22.
[Text]S8_08_Word_Cloud: Data source: https://www.kaggle.com/datasets/PromptCloudHQ/amazon-echo-dot-2-reviews-dataset Imports pandas, string, nltk (natural language processing) with stopwords from nltk.corpus, gensim (tokenization) with simple_preprocess from gensim.utils, matplotlib.pyplot, and seaborn, then loads Echodot2_Reviews.csv into echo_df (preview truncated). 2023. 1. 21.
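The word-cloud step itself is not visible in the preview. A minimal sketch of building one from the review text with the wordcloud package; the 'Review Text' column name and the styling options are assumptions.

import pandas as pd
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS   # pip install wordcloud

echo_df = pd.read_csv('Echodot2_Reviews.csv', encoding='utf-8')

# join every review into one long string; 'Review Text' is an assumed column name
text = ' '.join(echo_df['Review Text'].dropna().astype(str))

wc = WordCloud(width=800, height=400,
               background_color='white',
               stopwords=STOPWORDS).generate(text)

plt.figure(figsize=(12, 6))
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()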
[Text]S8_07_Text_visualization: Data source: https://www.kaggle.com/datasets/PromptCloudHQ/amazon-echo-dot-2-reviews-dataset Uses the same stack as the word-cloud notebook (pandas, string, nltk with stopwords, gensim with simple_preprocess, matplotlib.pyplot, seaborn) and loads Echodot2_Reviews.csv into echo_df (preview truncated). 2023. 1. 21.
[Text]S8_06_Text_tokenization: Data source: https://www.kaggle.com/datasets/PromptCloudHQ/amazon-echo-dot-2-reviews-dataset Imports pandas, string, nltk with stopwords, gensim with simple_preprocess (tokenization), and matplotlib.pyplot, then loads Echodot2_Reviews.csv into echo_df (preview truncated). 2023. 1. 21.
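The tokenization call is cut off in the preview. A minimal sketch of tokenizing the reviews with gensim's simple_preprocess; the 'Review Text' column name and the deacc=True option are assumptions.

import pandas as pd
from gensim.utils import simple_preprocess

echo_df = pd.read_csv('Echodot2_Reviews.csv', encoding='utf-8')

# split each review into lowercase word tokens;
# 'Review Text' is an assumed column name
echo_df['tokens'] = (
    echo_df['Review Text']
    .fillna('')
    .astype(str)
    .apply(lambda doc: simple_preprocess(doc, deacc=True))  # deacc=True also strips accents
)
print(echo_df['tokens'].head())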
[Text]S8_04_Text_cleaning(removing_punctuation): Data source: https://www.kaggle.com/datasets/PromptCloudHQ/amazon-echo-dot-2-reviews-dataset Imports pandas, matplotlib.pyplot, and seaborn, inspects the punctuation characters exposed by string.punctuation, and loads Echodot2_Reviews.csv into echo_df, previewing the first rows with echo_df.head() (preview truncated). 2023. 1. 21.
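The punctuation-removal step is truncated. A minimal sketch of stripping string.punctuation characters from the review text; the 'Review Text' column name, the helper name, and the output column are assumptions.

import string
import pandas as pd

echo_df = pd.read_csv('Echodot2_Reviews.csv', encoding='utf-8')

def remove_punctuation(text):
    # drop every character listed in string.punctuation
    return ''.join(ch for ch in str(text) if ch not in string.punctuation)

# 'Review Text' is an assumed column name
echo_df['review_clean'] = echo_df['Review Text'].apply(remove_punctuation)
print(echo_df['review_clean'].head())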
[Text]S8_03_Text_in_pandas_2: Data source: https://www.kaggle.com/datasets/PromptCloudHQ/amazon-echo-dot-2-reviews-dataset Imports pandas, matplotlib.pyplot, and seaborn, then loads Echodot2_Reviews.csv into echo_df and previews it with echo_df.head() (preview truncated). 2023. 1. 21.
[Text]S8_02_Text_in_pandas_1: Data source: https://www.kaggle.com/datasets/PromptCloudHQ/amazon-echo-dot-2-reviews-dataset Imports pandas, matplotlib.pyplot, and seaborn, then loads Echodot2_Reviews.csv into echo_df and previews it with echo_df.head() (preview truncated). 2023. 1. 21.
[Text]S8_01_upper_lower: Data source: https://www.kaggle.com/datasets/PromptCloudHQ/amazon-echo-dot-2-reviews-dataset Imports pandas, matplotlib.pyplot, and seaborn, then loads Echodot2_Reviews.csv into echo_df and previews it with echo_df.head() before working with upper- and lower-casing of the review text; a sketch of the pandas .str methods these S8_01 to S8_03 posts revolve around follows below (preview truncated). 2023. 1. 21.
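A minimal sketch of the vectorized .str accessor operations (upper/lower-casing, length, substring search) that the S8_01 to S8_03 previews above work through; the 'Review Text' column name, the new column names, and the 'alexa' search term are assumptions.

import pandas as pd

echo_df = pd.read_csv('Echodot2_Reviews.csv', encoding='utf-8')

# vectorized string methods through the .str accessor;
# 'Review Text' is an assumed column name
reviews = echo_df['Review Text'].fillna('').astype(str)

echo_df['review_upper'] = reviews.str.upper()      # all caps
echo_df['review_lower'] = reviews.str.lower()      # all lowercase
echo_df['review_length'] = reviews.str.len()       # character count per review
echo_df['mentions_alexa'] = reviews.str.lower().str.contains('alexa')

print(echo_df[['review_upper', 'review_length', 'mentions_alexa']].head())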
[datetime]S7_05_Practical_example3: Imports pandas and datetime as dt, then loads Avocado.csv into avo_df with columns Date, AveragePrice, Total Volume, type, and region (e.g. 2015-12-27, 1.33, 64236.62, conventional, Albany) (preview truncated). 2023. 1. 21.
[datetime]S7_04_Practical_example2: Imports pandas and datetime as dt, then loads Avocado.csv into avo_df with columns Date, AveragePrice, Total Volume, type, and region (e.g. 2015-12-27, 1.33, 64236.62, conventional, Albany) (preview truncated). 2023. 1. 21.
[datetime]S7_03_Practical_example1: Imports pandas and datetime as dt, then loads Avocado.csv into avo_df with columns Date, AveragePrice, Total Volume, type, and region (e.g. 2015-12-27, 1.33, 64236.62, conventional, Albany); a datetime-handling sketch covering these three practical examples follows below (preview truncated). 2023. 1. 21.
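The actual datetime work in these three practical-example previews is cut off. A minimal sketch of the usual pandas workflow on this file: parsing the Date column, pulling out components with the .dt accessor, and aggregating by month; the monthly AveragePrice aggregation is an assumption, not something visible in the previews.

import pandas as pd

avo_df = pd.read_csv('Avocado.csv')

# parse the Date strings into real datetimes and sort chronologically
avo_df['Date'] = pd.to_datetime(avo_df['Date'])
avo_df = avo_df.sort_values('Date')

# pull out components with the .dt accessor
avo_df['year'] = avo_df['Date'].dt.year
avo_df['month'] = avo_df['Date'].dt.month

# assumed aggregation: mean price per calendar month
monthly_price = avo_df.groupby(avo_df['Date'].dt.to_period('M'))['AveragePrice'].mean()
print(monthly_price.head())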
[datetime]S7_02_Timestamp: https://pandas.pydata.org/docs/reference/api/pandas.Timestamp.html Imports pandas and datetime as dt. pd.Timestamp('2023, 3, 30') gives Timestamp('2023-03-30 00:00:00'); a Timestamp can also be built from a datetime, so pd.Timestamp(dt.datetime(2022, 3, 31, 8, 0, 15)) gives Timestamp('2022-03-31 08:00:15'). The preview ends while computing the difference between two dates, starting from day_1 = pd.Timestamp('1990, 3, 31, 11') (preview truncated). 2023. 1. 21.
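A minimal sketch of how the date-difference example is typically finished: subtracting two Timestamps yields a Timedelta. The first date follows the preview's day_1 (written here in ISO form); the second date and the printed attributes are assumptions.

import pandas as pd

day_1 = pd.Timestamp('1990-03-31 11:00')   # the preview's day_1, in ISO form
day_2 = pd.Timestamp('2023-01-21 11:00')   # assumed second date

delta = day_2 - day_1                      # a pandas Timedelta
print(delta)                               # days and hours between the two dates
print(delta.days)                          # whole-day count as an integer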
[datetime]S7_01_datetime: date objects define dates only, without a time component (year, month, day), while datetime objects define dates and times together (year, month, day, hour, second, microsecond). Imports pandas and datetime as dt, creates date_ex = dt.date(2022, 1, 1) (type datetime.date), and then converts a value named now into a string with str(now) to view it (preview truncated). 2023. 1. 21.
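A minimal sketch of the date vs. datetime distinction the preview describes; the now variable is assumed to come from dt.datetime.now(), since its definition is not visible in the truncated preview.

import datetime as dt

# date: calendar date only (year, month, day)
date_ex = dt.date(2022, 1, 1)
print(type(date_ex))        # <class 'datetime.date'>

# datetime: date and time together, down to microseconds
now = dt.datetime.now()     # assumed definition of 'now'
print(type(now))            # <class 'datetime.datetime'>

# convert to strings to view them
print(str(date_ex))         # '2022-01-01'
print(str(now))             # e.g. '2023-01-21 10:15:30.123456'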
[seaborn]S6_02_pairplot,displot,heatmap(correlations): Seaborn https://seaborn.pydata.org/examples/index.html pandas: data manipulation using dataframes; numpy: statistical analysis; matplotlib: data visualisation; seaborn: statistical data visualization. Data: https://www.kaggle.com/datasets/yasserh/breast-cancer-dataset?resource=download Notes that seaborn offers enhanced features compared to matplotlib and imports pandas and the rest of the stack (preview truncated). 2023. 1. 21.
[seaborn]S6_01_scatter&count_plot: Seaborn https://seaborn.pydata.org/examples/index.html pandas: data manipulation using dataframes; numpy: statistical analysis; matplotlib: data visualisation; seaborn: statistical data visualization. Data: https://www.kaggle.com/datasets/yasserh/breast-cancer-dataset?resource=download Notes that seaborn offers enhanced features compared to matplotlib and imports pandas and the rest of the stack; a seaborn sketch covering these two posts follows below (preview truncated). 2023. 1. 21.
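A minimal sketch of the kinds of plots the two seaborn previews above describe (scatter plot, count plot, correlation heatmap, pair plot) on the breast-cancer data; the file name 'breast-cancer.csv' and the column names 'diagnosis', 'radius_mean', 'texture_mean' are assumptions based on the standard WDBC schema.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

cancer_df = pd.read_csv('breast-cancer.csv')   # assumed file name

# scatter plot of two features, colored by the class label
sns.scatterplot(data=cancer_df, x='radius_mean', y='texture_mean', hue='diagnosis')
plt.show()

# count plot of the class label
sns.countplot(data=cancer_df, x='diagnosis')
plt.show()

# heatmap of pairwise correlations between the numeric features
plt.figure(figsize=(10, 8))
sns.heatmap(cancer_df.select_dtypes('number').corr(), cmap='coolwarm')
plt.show()

# pair plot of a few features against the class label
sns.pairplot(cancer_df[['radius_mean', 'texture_mean', 'diagnosis']], hue='diagnosis')
plt.show()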