Exam 17

Attention

캐글에 μ—…λ‘œλ“œλœ λ‹€λ₯Έ λΆ„λ“€ μ½”λ“œ λ³΄λŸ¬κ°€κΈ°
데이터셋 링크
문제였λ₯˜, μ½”λ“œμ˜€λ₯˜ λŒ“κΈ€λ‘œ ν”Όλ“œλ°±μ£Όμ„Έμš”

Attention

Question 1

Data description : several house-related numeric features plus the house price; predict the log1p-transformed price column
Data source : https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data?select=train.csv (partially preprocessed)
data Url : https://raw.githubusercontent.com/Datamanim/datarepo/main/adp/p3/problem1.csv

import pandas as pd

df = pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/adp/p3/problem1.csv')
df.head()
Id LotArea LotFrontage YearBuilt 1stFlrSF 2ndFlrSF YearRemodAdd TotRmsAbvGrd KitchenAbvGr BedroomAbvGr GarageCars GarageArea price
0 1 8450 65.0 2003 856 854 2003 8 1 3 2 548 12.247699
1 2 9600 80.0 1976 1262 0 1976 6 1 3 2 460 12.109016
2 3 11250 68.0 2001 920 866 2002 6 1 3 2 608 12.317171
3 4 9550 60.0 1915 961 756 1970 7 1 3 3 642 11.849405
4 5 14260 84.0 2000 1145 1053 2000 9 1 4 3 836 12.429220
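Since price is the log1p of the original sale price, predictions can be mapped back to the dollar scale with np.expm1; a quick sanity-check sketch on the df loaded above:

import numpy as np

# expm1 undoes log1p: the first row's 12.247699 maps back to roughly 208,500 on the original price scale
print(np.expm1(df['price'].iloc[0]))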

Question 1-1

Perform EDA on the data and present a meaningful exploration from an analyst's point of view

  • Present visualizations and summary statistics

print(df.info())
display(df.describe())
print('''
λͺ¨λ“  μ»¬λŸΌμ€ numeric λ³€μˆ˜μ΄λ‹€. μ΄μƒμΉ˜κ°€ μ‘΄μž¬ν•˜λŠ” μ»¬λŸΌμ€ ~~ 이닀. (μ€‘λž΅)
''')

import matplotlib.pyplot as plt
df.plot(kind='box',subplots=True,layout=(2,len(df.columns)//2+1),figsize=(20,10))
plt.tight_layout()
plt.show()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Id            1460 non-null   int64  
 1   LotArea       1460 non-null   int64  
 2   LotFrontage   1201 non-null   float64
 3   YearBuilt     1460 non-null   int64  
 4   1stFlrSF      1460 non-null   int64  
 5   2ndFlrSF      1460 non-null   int64  
 6   YearRemodAdd  1460 non-null   int64  
 7   TotRmsAbvGrd  1460 non-null   int64  
 8   KitchenAbvGr  1460 non-null   int64  
 9   BedroomAbvGr  1460 non-null   int64  
 10  GarageCars    1460 non-null   int64  
 11  GarageArea    1460 non-null   int64  
 12  price         1460 non-null   float64
dtypes: float64(2), int64(11)
memory usage: 148.4 KB
None
Id LotArea LotFrontage YearBuilt 1stFlrSF 2ndFlrSF YearRemodAdd TotRmsAbvGrd KitchenAbvGr BedroomAbvGr GarageCars GarageArea price
count 1460.000000 1460.000000 1201.000000 1460.000000 1460.000000 1460.000000 1460.000000 1460.000000 1460.000000 1460.000000 1460.000000 1460.000000 1460.000000
mean 730.500000 10516.828082 70.049958 1971.267808 1162.626712 346.992466 1984.865753 6.517808 1.046575 2.866438 1.767123 472.980137 12.024057
std 421.610009 9981.264932 24.284752 30.202904 386.587738 436.528436 20.645407 1.625393 0.220338 0.815778 0.747315 213.804841 0.399449
min 1.000000 1300.000000 21.000000 1872.000000 334.000000 0.000000 1950.000000 2.000000 0.000000 0.000000 0.000000 0.000000 10.460271
25% 365.750000 7553.500000 59.000000 1954.000000 882.000000 0.000000 1967.000000 5.000000 1.000000 2.000000 1.000000 334.500000 11.775105
50% 730.500000 9478.500000 69.000000 1973.000000 1087.000000 0.000000 1994.000000 6.000000 1.000000 3.000000 2.000000 480.000000 12.001512
75% 1095.250000 11601.500000 80.000000 2000.000000 1391.250000 728.000000 2004.000000 7.000000 1.000000 3.000000 2.000000 576.000000 12.273736
max 1460.000000 215245.000000 313.000000 2010.000000 4692.000000 2065.000000 2010.000000 14.000000 3.000000 8.000000 4.000000 1418.000000 13.534474
λͺ¨λ“  μ»¬λŸΌμ€ numeric λ³€μˆ˜μ΄λ‹€. μ΄μƒμΉ˜κ°€ μ‘΄μž¬ν•˜λŠ” μ»¬λŸΌμ€ ~~ 이닀. (μ€‘λž΅)
[Figure: box plots of each column]
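The note above leaves the outlier columns unstated; a simple IQR rule is one way to back the claim with numbers (a sketch of one possible criterion, applied to the df loaded above):

# count values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] for each column
q1, q3 = df.quantile(0.25), df.quantile(0.75)
iqr = q3 - q1
outliers = ((df < q1 - 1.5 * iqr) | (df > q3 + 1.5 * iqr)).sum()
print(outliers.sort_values(ascending=False))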

Question 1-2

Split the data into Train, Valid, and Test sets and present visualizations

df2 = df.copy()

# statsmodels ols raises an error when a column name starts with a digit
df2 = df2.rename(columns={'1stFlrSF':'first','2ndFlrSF':'second'})

# For the year columns, replace the value with the number of years before the most recent year
df2['YearBuilt']  = abs(df2['YearBuilt'] - df2['YearBuilt'].max())
df2['YearRemodAdd']  = abs(df2['YearRemodAdd'] - df2['YearRemodAdd'].max())



X = df2.drop(columns=['Id','price','LotFrontage'])
y = df2['price']

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_train, X_test , y_train, y_test = train_test_split(X,y)

sc = StandardScaler()
sc.fit(X_train)

X_train_sc = sc.transform(X_train)
X_test_sc = sc.transform(X_test)

print('μŠ€μΌ€μΌλ§ μ „ μ‹œκ°ν™”')
X_train.plot(kind='box',subplots=True,layout=(2,len(df.columns)//2+1),figsize=(20,10))
plt.tight_layout()
plt.show()
print('μŠ€μΌ€μΌλ§ ν›„ μ‹œκ°ν™”')
pd.DataFrame(X_train_sc,columns=X_train.columns).plot(kind='box',subplots=True,layout=(2,len(df.columns)//2+1),figsize=(20,10))
plt.tight_layout()
plt.show()


print('''
Explanation ~~ (omitted) -> in this regression analysis, leaving the data unscaled produced a higher R-squared value
''')
μŠ€μΌ€μΌλ§ μ „ μ‹œκ°ν™”
../../../_images/p3_8_1.png
μŠ€μΌ€μΌλ§ ν›„ μ‹œκ°ν™”
../../../_images/p3_8_3.png
μ„€λͺ… ~~(μƒλž΅)) -> νšŒκ·€ λΆ„μ„μ‹œ μŠ€μΌ€μΌλ§ ν•˜μ§€ μ•ŠλŠ” 것이 r-squred 값이 더 λ†’κ²Œλ‚˜μ˜΄
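The prompt asks for Train/Valid/Test while the cell above only creates Train/Test; a sketch of also carving out a validation set with a second split (the 60/20/20 ratio and random_state are assumptions):

from sklearn.model_selection import train_test_split

# hold out 20% as test, then 25% of the remainder as validation (60/20/20 overall)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X_tr, y_tr, test_size=0.25, random_state=0)
print(len(X_tr), len(X_val), len(X_te))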

Question 1-3

2μ°¨ κ΅ν˜Έμž‘μš©ν•­ κΉŒμ§€ κ³ λ €ν•œ νšŒκ·€λΆ„μ„ μˆ˜ν–‰ 및 λ³€μˆ˜ 선택 κ³Όμ • μ œμ‹œ

from itertools import combinations

# combinations instead of permutations: patsy treats a:b:c the same regardless of term order,
# so ordered triples would only add duplicate terms to the formula
comb = list(combinations(X_train.columns, 3))
len(comb)

variables = '+ '.join(list(X_train.columns)) + '+' + '+'.join([':'.join(list(y)) for y in comb])
# Python's modules feel a little unfriendly when it comes to building regression formulas like this
# From the columns below, which include all the interaction terms, select variables according to your own criteria (see the selection sketch after the summary output)
# Including all of the variables gives a higher R-squared than the plain regression without interaction terms
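For reference, patsy (which the statsmodels formula API uses) can expand interactions on its own; a sketch of the ** shorthand, noting that (…)**3 also adds the two-way terms, so it is not identical to the hand-built string above:

# (x1 + x2 + ...)**3 expands to main effects plus all 2-way and 3-way interactions
variables_alt = '(' + ' + '.join(X_train.columns) + ')**3'
print(variables_alt)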

from statsmodels.formula.api import ols

#'+ '.join(list(X_train.columns))
res = ols(f'price ~ {variables}', data=pd.concat([X_train,y_train],axis=1)).fit()
res.summary()
OLS Regression Results
Dep. Variable: price R-squared: 0.886
Model: OLS Adj. R-squared: 0.871
Method: Least Squares F-statistic: 57.61
Date: Wed, 21 Dec 2022 Prob (F-statistic): 0.00
Time: 01:16:19 Log-Likelihood: 643.13
No. Observations: 1095 AIC: -1024.
Df Residuals: 964 BIC: -369.4
Df Model: 130
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 11.1578 0.155 71.968 0.000 10.854 11.462
LotArea 7.105e-06 5.78e-06 1.230 0.219 -4.23e-06 1.84e-05
YearBuilt -0.0031 0.001 -3.560 0.000 -0.005 -0.001
first 0.0007 0.000 5.077 0.000 0.000 0.001
second 0.0004 0.000 3.585 0.000 0.000 0.001
YearRemodAdd -0.0031 0.002 -2.047 0.041 -0.006 -0.000
TotRmsAbvGrd 0.0182 0.027 0.685 0.493 -0.034 0.070
KitchenAbvGr -0.1529 0.126 -1.212 0.226 -0.400 0.095
BedroomAbvGr 0.0218 0.035 0.618 0.537 -0.047 0.091
GarageCars 0.1429 0.072 1.974 0.049 0.001 0.285
GarageArea 7.146e-06 0.000 0.027 0.979 -0.001 0.001
LotArea:YearBuilt:first -4.009e-10 2.29e-10 -1.752 0.080 -8.5e-10 4.81e-11
LotArea:YearBuilt:second 1.651e-10 2.52e-10 0.656 0.512 -3.29e-10 6.59e-10
LotArea:YearBuilt:YearRemodAdd -7.982e-09 3.35e-09 -2.386 0.017 -1.45e-08 -1.42e-09
LotArea:YearBuilt:TotRmsAbvGrd 1.608e-07 7.01e-08 2.295 0.022 2.33e-08 2.98e-07
LotArea:YearBuilt:KitchenAbvGr 8.162e-07 2.99e-07 2.726 0.007 2.29e-07 1.4e-06
LotArea:YearBuilt:BedroomAbvGr -3.453e-07 1.19e-07 -2.909 0.004 -5.78e-07 -1.12e-07
LotArea:YearBuilt:GarageCars -2.565e-07 1.84e-07 -1.395 0.163 -6.17e-07 1.04e-07
LotArea:YearBuilt:GarageArea 3.271e-10 6.19e-10 0.528 0.597 -8.88e-10 1.54e-09
LotArea:first:second -1.82e-11 1.58e-11 -1.156 0.248 -4.91e-11 1.27e-11
LotArea:first:YearRemodAdd -2.937e-10 3.13e-10 -0.937 0.349 -9.09e-10 3.21e-10
LotArea:first:TotRmsAbvGrd -2.585e-09 2.48e-09 -1.043 0.297 -7.45e-09 2.28e-09
LotArea:first:KitchenAbvGr -3.083e-08 2.22e-08 -1.390 0.165 -7.44e-08 1.27e-08
LotArea:first:BedroomAbvGr 2.349e-08 8.21e-09 2.860 0.004 7.37e-09 3.96e-08
LotArea:first:GarageCars -6.4e-09 9.49e-09 -0.674 0.500 -2.5e-08 1.22e-08
LotArea:first:GarageArea 2.607e-11 2.45e-11 1.064 0.288 -2.2e-11 7.41e-11
LotArea:second:YearRemodAdd -5.768e-10 3.25e-10 -1.775 0.076 -1.21e-09 6.09e-11
LotArea:second:TotRmsAbvGrd -6.54e-10 3.2e-09 -0.205 0.838 -6.93e-09 5.62e-09
LotArea:second:KitchenAbvGr -4.56e-08 2.93e-08 -1.555 0.120 -1.03e-07 1.19e-08
LotArea:second:BedroomAbvGr 3.259e-08 8.36e-09 3.901 0.000 1.62e-08 4.9e-08
LotArea:second:GarageCars -2.455e-09 1.7e-08 -0.145 0.885 -3.58e-08 3.09e-08
LotArea:second:GarageArea -2.17e-11 5.27e-11 -0.412 0.681 -1.25e-10 8.18e-11
LotArea:YearRemodAdd:TotRmsAbvGrd -1.97e-08 9.19e-08 -0.214 0.830 -2e-07 1.61e-07
LotArea:YearRemodAdd:KitchenAbvGr -5.82e-07 3.53e-07 -1.651 0.099 -1.27e-06 1.1e-07
LotArea:YearRemodAdd:BedroomAbvGr 4.925e-07 1.4e-07 3.518 0.000 2.18e-07 7.67e-07
LotArea:YearRemodAdd:GarageCars 3.294e-07 2.89e-07 1.138 0.255 -2.38e-07 8.97e-07
LotArea:YearRemodAdd:GarageArea -5.006e-10 9.76e-10 -0.513 0.608 -2.42e-09 1.41e-09
LotArea:TotRmsAbvGrd:KitchenAbvGr 3.767e-06 6.03e-06 0.625 0.532 -8.06e-06 1.56e-05
LotArea:TotRmsAbvGrd:BedroomAbvGr -3.77e-06 1.7e-06 -2.221 0.027 -7.1e-06 -4.39e-07
LotArea:TotRmsAbvGrd:GarageCars 9.679e-06 4.96e-06 1.951 0.051 -5.86e-08 1.94e-05
LotArea:TotRmsAbvGrd:GarageArea -2.417e-08 1.46e-08 -1.654 0.098 -5.28e-08 4.5e-09
LotArea:KitchenAbvGr:BedroomAbvGr -4.794e-07 9.78e-06 -0.049 0.961 -1.97e-05 1.87e-05
LotArea:KitchenAbvGr:GarageCars 2.182e-05 2.02e-05 1.078 0.281 -1.79e-05 6.15e-05
LotArea:KitchenAbvGr:GarageArea -5.645e-08 7.28e-08 -0.776 0.438 -1.99e-07 8.63e-08
LotArea:BedroomAbvGr:GarageCars -2.299e-05 6.54e-06 -3.513 0.000 -3.58e-05 -1.01e-05
LotArea:BedroomAbvGr:GarageArea 5.323e-08 2.25e-08 2.370 0.018 9.16e-09 9.73e-08
LotArea:GarageCars:GarageArea 3.955e-09 5.75e-09 0.688 0.491 -7.32e-09 1.52e-08
YearBuilt:first:second 3.187e-09 3.26e-09 0.978 0.328 -3.21e-09 9.58e-09
YearBuilt:first:YearRemodAdd 4.164e-08 5.51e-08 0.755 0.450 -6.66e-08 1.5e-07
YearBuilt:first:TotRmsAbvGrd -3.609e-07 7.13e-07 -0.506 0.613 -1.76e-06 1.04e-06
YearBuilt:first:KitchenAbvGr 5.498e-06 3.97e-06 1.386 0.166 -2.28e-06 1.33e-05
YearBuilt:first:BedroomAbvGr -6.066e-07 1.51e-06 -0.401 0.689 -3.58e-06 2.36e-06
YearBuilt:first:GarageCars -7.33e-07 2.85e-06 -0.257 0.797 -6.33e-06 4.87e-06
YearBuilt:first:GarageArea 1.572e-09 9.58e-09 0.164 0.870 -1.72e-08 2.04e-08
YearBuilt:second:YearRemodAdd -3.504e-08 5.05e-08 -0.694 0.488 -1.34e-07 6.41e-08
YearBuilt:second:TotRmsAbvGrd -6.647e-07 6.43e-07 -1.034 0.301 -1.93e-06 5.97e-07
YearBuilt:second:KitchenAbvGr 7.001e-06 3.79e-06 1.845 0.065 -4.45e-07 1.44e-05
YearBuilt:second:BedroomAbvGr -2.496e-07 1.17e-06 -0.214 0.830 -2.54e-06 2.04e-06
YearBuilt:second:GarageCars -3.235e-06 2.23e-06 -1.454 0.146 -7.6e-06 1.13e-06
YearBuilt:second:GarageArea 7.045e-10 8.18e-09 0.086 0.931 -1.54e-08 1.68e-08
YearBuilt:YearRemodAdd:TotRmsAbvGrd 1.83e-05 1.23e-05 1.483 0.138 -5.91e-06 4.25e-05
YearBuilt:YearRemodAdd:KitchenAbvGr -0.0001 3.46e-05 -3.674 0.000 -0.000 -5.92e-05
YearBuilt:YearRemodAdd:BedroomAbvGr 6.027e-06 1.94e-05 0.311 0.756 -3.2e-05 4.41e-05
YearBuilt:YearRemodAdd:GarageCars -4.087e-05 3.32e-05 -1.231 0.219 -0.000 2.43e-05
YearBuilt:YearRemodAdd:GarageArea 1.983e-07 1.2e-07 1.647 0.100 -3.8e-08 4.35e-07
YearBuilt:TotRmsAbvGrd:KitchenAbvGr -0.0016 0.001 -2.253 0.025 -0.003 -0.000
YearBuilt:TotRmsAbvGrd:BedroomAbvGr 4.456e-05 0.000 0.192 0.848 -0.000 0.000
YearBuilt:TotRmsAbvGrd:GarageCars 0.0013 0.001 1.480 0.139 -0.000 0.003
YearBuilt:TotRmsAbvGrd:GarageArea -3.781e-06 2.94e-06 -1.285 0.199 -9.56e-06 1.99e-06
YearBuilt:KitchenAbvGr:BedroomAbvGr 0.0011 0.001 0.865 0.387 -0.001 0.004
YearBuilt:KitchenAbvGr:GarageCars -0.0020 0.003 -0.735 0.463 -0.007 0.003
YearBuilt:KitchenAbvGr:GarageArea -7.963e-06 9.58e-06 -0.832 0.406 -2.68e-05 1.08e-05
YearBuilt:BedroomAbvGr:GarageCars -0.0003 0.001 -0.228 0.820 -0.003 0.002
YearBuilt:BedroomAbvGr:GarageArea 7.721e-06 4.55e-06 1.695 0.090 -1.22e-06 1.67e-05
YearBuilt:GarageCars:GarageArea -8.145e-08 1.64e-06 -0.050 0.961 -3.31e-06 3.15e-06
first:second:YearRemodAdd 1.258e-08 5.17e-09 2.433 0.015 2.43e-09 2.27e-08
first:second:TotRmsAbvGrd 3.229e-08 4.27e-08 0.756 0.450 -5.16e-08 1.16e-07
first:second:KitchenAbvGr -6.114e-07 3.05e-07 -2.004 0.045 -1.21e-06 -1.28e-08
first:second:BedroomAbvGr -1.191e-07 9.06e-08 -1.315 0.189 -2.97e-07 5.87e-08
first:second:GarageCars -2.4e-09 2.09e-07 -0.011 0.991 -4.13e-07 4.08e-07
first:second:GarageArea 1.098e-09 6.6e-10 1.665 0.096 -1.96e-10 2.39e-09
first:YearRemodAdd:TotRmsAbvGrd -6.076e-07 1.07e-06 -0.570 0.569 -2.7e-06 1.48e-06
first:YearRemodAdd:KitchenAbvGr 7.265e-06 6.33e-06 1.149 0.251 -5.15e-06 1.97e-05
first:YearRemodAdd:BedroomAbvGr -8.677e-07 2.14e-06 -0.406 0.685 -5.06e-06 3.33e-06
first:YearRemodAdd:GarageCars -5.605e-06 4.85e-06 -1.156 0.248 -1.51e-05 3.91e-06
first:YearRemodAdd:GarageArea 1.035e-08 1.66e-08 0.623 0.533 -2.22e-08 4.29e-08
first:TotRmsAbvGrd:KitchenAbvGr -4.385e-05 4.33e-05 -1.013 0.311 -0.000 4.11e-05
first:TotRmsAbvGrd:BedroomAbvGr 1.291e-05 1.11e-05 1.161 0.246 -8.91e-06 3.47e-05
first:TotRmsAbvGrd:GarageCars -3.17e-05 3.7e-05 -0.857 0.392 -0.000 4.09e-05
first:TotRmsAbvGrd:GarageArea 1.852e-07 1.18e-07 1.567 0.117 -4.67e-08 4.17e-07
first:KitchenAbvGr:BedroomAbvGr -8.223e-05 0.000 -0.742 0.459 -0.000 0.000
first:KitchenAbvGr:GarageCars 5.371e-05 0.000 0.181 0.857 -0.001 0.001
first:KitchenAbvGr:GarageArea 1.37e-06 1.09e-06 1.253 0.210 -7.75e-07 3.52e-06
first:BedroomAbvGr:GarageCars 0.0002 0.000 1.731 0.084 -2.48e-05 0.000
first:BedroomAbvGr:GarageArea -1.124e-06 3.89e-07 -2.890 0.004 -1.89e-06 -3.61e-07
first:GarageCars:GarageArea -2.199e-07 1.33e-07 -1.656 0.098 -4.81e-07 4.07e-08
second:YearRemodAdd:TotRmsAbvGrd -1.286e-06 1.02e-06 -1.266 0.206 -3.28e-06 7.07e-07
second:YearRemodAdd:KitchenAbvGr -5.54e-06 5.93e-06 -0.935 0.350 -1.72e-05 6.09e-06
second:YearRemodAdd:BedroomAbvGr 8.785e-07 1.87e-06 0.470 0.638 -2.79e-06 4.55e-06
second:YearRemodAdd:GarageCars 3.536e-06 4.02e-06 0.879 0.380 -4.36e-06 1.14e-05
second:YearRemodAdd:GarageArea -5.12e-09 1.45e-08 -0.352 0.725 -3.37e-08 2.34e-08
second:TotRmsAbvGrd:KitchenAbvGr 0.0001 5.87e-05 2.102 0.036 8.2e-06 0.000
second:TotRmsAbvGrd:BedroomAbvGr 1.177e-06 1.17e-05 0.101 0.920 -2.18e-05 2.41e-05
second:TotRmsAbvGrd:GarageCars -4.553e-05 4.18e-05 -1.089 0.276 -0.000 3.65e-05
second:TotRmsAbvGrd:GarageArea -3.217e-08 1.4e-07 -0.230 0.818 -3.07e-07 2.42e-07
second:KitchenAbvGr:BedroomAbvGr -0.0002 0.000 -1.361 0.174 -0.000 6.98e-05
second:KitchenAbvGr:GarageCars 0.0003 0.000 0.948 0.343 -0.000 0.001
second:KitchenAbvGr:GarageArea 2.178e-07 1e-06 0.218 0.828 -1.75e-06 2.18e-06
second:BedroomAbvGr:GarageCars 4.715e-05 9.33e-05 0.506 0.613 -0.000 0.000
second:BedroomAbvGr:GarageArea -3.533e-07 3.15e-07 -1.123 0.262 -9.71e-07 2.64e-07
second:GarageCars:GarageArea 3.037e-08 1.53e-07 0.199 0.842 -2.69e-07 3.3e-07
YearRemodAdd:TotRmsAbvGrd:KitchenAbvGr 0.0012 0.001 1.123 0.262 -0.001 0.003
YearRemodAdd:TotRmsAbvGrd:BedroomAbvGr -0.0003 0.000 -0.883 0.377 -0.001 0.000
YearRemodAdd:TotRmsAbvGrd:GarageCars -0.0015 0.001 -1.090 0.276 -0.004 0.001
YearRemodAdd:TotRmsAbvGrd:GarageArea 5.439e-06 4.78e-06 1.139 0.255 -3.93e-06 1.48e-05
YearRemodAdd:KitchenAbvGr:BedroomAbvGr -0.0007 0.002 -0.393 0.695 -0.004 0.003
YearRemodAdd:KitchenAbvGr:GarageCars 0.0052 0.005 1.082 0.280 -0.004 0.015
YearRemodAdd:KitchenAbvGr:GarageArea -3.205e-06 1.65e-05 -0.195 0.846 -3.55e-05 2.91e-05
YearRemodAdd:BedroomAbvGr:GarageCars 0.0034 0.002 1.891 0.059 -0.000 0.007
YearRemodAdd:BedroomAbvGr:GarageArea -1.564e-05 6.47e-06 -2.418 0.016 -2.83e-05 -2.95e-06
YearRemodAdd:GarageCars:GarageArea -4.013e-06 2.52e-06 -1.593 0.111 -8.96e-06 9.29e-07
TotRmsAbvGrd:KitchenAbvGr:BedroomAbvGr 0.0172 0.014 1.227 0.220 -0.010 0.045
TotRmsAbvGrd:KitchenAbvGr:GarageCars -0.0201 0.062 -0.323 0.747 -0.142 0.102
TotRmsAbvGrd:KitchenAbvGr:GarageArea -9.975e-05 0.000 -0.387 0.699 -0.001 0.000
TotRmsAbvGrd:BedroomAbvGr:GarageCars -0.0065 0.023 -0.280 0.780 -0.052 0.039
TotRmsAbvGrd:BedroomAbvGr:GarageArea 5.19e-05 8.26e-05 0.628 0.530 -0.000 0.000
TotRmsAbvGrd:GarageCars:GarageArea -1.463e-05 4.1e-05 -0.357 0.721 -9.51e-05 6.58e-05
KitchenAbvGr:BedroomAbvGr:GarageCars -0.1440 0.093 -1.553 0.121 -0.326 0.038
KitchenAbvGr:BedroomAbvGr:GarageArea 0.0004 0.000 1.121 0.263 -0.000 0.001
KitchenAbvGr:GarageCars:GarageArea -0.0002 0.000 -1.475 0.141 -0.001 7.82e-05
BedroomAbvGr:GarageCars:GarageArea 0.0002 5.89e-05 3.060 0.002 6.46e-05 0.000
Omnibus: 281.963 Durbin-Watson: 1.968
Prob(Omnibus): 0.000 Jarque-Bera (JB): 1549.607
Skew: -1.071 Prob(JB): 0.00
Kurtosis: 8.420 Cond. No. 8.76e+11


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 8.76e+11. This might indicate that there are
strong multicollinearity or other numerical problems.
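As one concrete variable-selection step on top of the full model above, insignificant terms can be pruned by p-value and the model refit (a sketch of a simple criterion, not the only defensible one):

# keep only terms with p-value < 0.05 and refit the reduced model
pvals = res.pvalues.drop('Intercept')
keep = pvals[pvals < 0.05].index.tolist()
res_reduced = ols('price ~ ' + ' + '.join(keep), data=pd.concat([X_train, y_train], axis=1)).fit()
print(len(keep), 'terms kept / adj. R-squared:', round(res_reduced.rsquared_adj, 3))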

Question 1-4

Present three machine-learning models suited to the data, including penalized (regularized) and ensemble models.
(Check all of MSE, MAPE, and R2 as evaluation metrics.)

# lasso , ridge , randomforest
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error , r2_score
def MAPE(y_test, y_pred):
    return np.mean(np.abs((y_test - y_pred) / y_test)) * 100 
    

ls = Lasso()
rd = Ridge()
rf = RandomForestRegressor()


def modelpipe(model):

    model.fit(X_train,y_train)
    model_pred = model.predict(X_test)
    mse = mean_squared_error(y_test,model_pred)
    r2score = r2_score(y_test,model_pred)
    mape = MAPE(y_test,model_pred)
    
    metrics= [mse,r2score,mape]
    return metrics
    
ls_result =modelpipe(ls)
rd_result =modelpipe(rd)
rf_result =modelpipe(rf)


result = pd.DataFrame([ls_result,rd_result,rf_result],columns = ['mse','r2','mape'],index=['lasso','ridge','randomForest'])
result
mse r2 mape
lasso 0.049039 0.697203 1.246075
ridge 0.041681 0.742635 1.175805
randomForest 0.037110 0.770859 1.094485
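For reference, scikit-learn 0.24+ ships a built-in MAPE; note it returns a fraction rather than the percentage used by the custom MAPE above (a sketch reusing the fitted rf model):

from sklearn.metrics import mean_absolute_percentage_error

# sklearn's MAPE is a ratio; multiply by 100 to compare with the table above
print(mean_absolute_percentage_error(y_test, rf.predict(X_test)) * 100)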

Attention

Question 2
Data description : country-level COVID-19 data; carry out the modelling
Data source : https://www.kaggle.com/imdevskp/corona-virus-report (partially post-processed)
data Url : https://raw.githubusercontent.com/Datamanim/datarepo/main/adp/p3/problem2.csv
location : region (country) name
date : date
total_cases : cumulative confirmed cases
total_deaths : cumulative deaths
new_tests : new tests
population : population
new_vaccinations : new vaccinations

import pandas as pd
df =pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/adp/p3/problem2.csv')
df.head()
location date total_cases total_deaths new_tests population new_vaccinations
0 Afghanistan 2020-02-24 5.0 NaN NaN 39835428.0 NaN
1 Afghanistan 2020-02-25 5.0 NaN NaN 39835428.0 NaN
2 Afghanistan 2020-02-26 5.0 NaN NaN 39835428.0 NaN
3 Afghanistan 2020-02-27 5.0 NaN NaN 39835428.0 NaN
4 Afghanistan 2020-02-28 5.0 NaN NaN 39835428.0 NaN
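A quick look at missing values and the covered date range (a small sanity-check sketch on the df just loaded) helps justify the drops made in question 2-1 below:

# count of NaN per column and the first/last date in the file
print(df.isna().sum())
print(df['date'].min(), '~', df['date'].max())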

Question 2-1

Using the last date, find the top 5 countries with the highest ratio of confirmed cases to population.
For those top 5 countries, plot cumulative cases, daily cases, cumulative deaths, and daily deaths, using graphs and legends so the result is easy to read.

df =pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/adp/p3/problem2.csv')
df['ratio'] = df['total_cases'] / df['population']


# Check missing values and the daily cases/deaths over the whole data
# Drop 2021-11-30 because new_tests and new_vaccinations are NaN on that date
# Drop rows where the population is 0
import matplotlib.pyplot as plt 
df = df.fillna(0)
df['date']  = pd.to_datetime(df['date'])
df = df[df.date != pd.to_datetime('2021-11-30')]
df = df[df.population !=0]

for location in df.location.unique():
    lo = df[df.location == location]
    df.loc[lo.index,'new_cases'] =lo.total_cases.diff().values
    df.loc[lo.index[0], 'new_cases'] = lo['total_cases'].values[0]

    df.loc[lo.index,'new_deaths'] =lo.total_deaths.diff().values
    df.loc[lo.index[0], 'new_deaths'] = lo['total_deaths'].values[0]
    
    df.loc[lo.index, 'total_vacciantions'] = lo['new_vaccinations'].cumsum().values
    # 7-day rolling sum of daily confirmed cases (used for the risk index in question 2-2)
    df.loc[lo.index, '7days_new_case'] = df.loc[lo.index, 'new_cases'].rolling(7).sum().fillna(0).values
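The per-location loop above can also be written with groupby; a short cross-check sketch (it assumes each location's rows are already sorted by date, as in this file):

# vectorised equivalent of the loop: per-location daily differences and 7-day sums
chk_cases = df.groupby('location')['total_cases'].diff().fillna(df['total_cases'])
chk_week = df.groupby('location')['new_cases'].transform(lambda s: s.rolling(7).sum()).fillna(0)
print((chk_cases == df['new_cases']).all(), (chk_week == df['7days_new_case']).all())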

import seaborn as sns
import matplotlib.pyplot as plt


locations = df.groupby(['location']).tail(1).sort_values('ratio',ascending=False).location.head(5).values
target = df[df.location.isin(locations)].reset_index(drop=True)
for v in ['total_cases','new_cases','total_deaths','new_deaths']:
    plt.figure(figsize = (15,5))
    plt.title(v)
    sns.lineplot(data=target,x= 'date',y=v,hue='location')
    plt.show()
[Figures: line plots by country of total_cases, new_cases, total_deaths, and new_deaths for the top-5 countries]

Question 2-2

Create your own COVID risk index, explain it, pick the 10 countries with the highest index values, and visualize them

# Risk index = ( (confirmed cases over the last 7 days / population) + (daily deaths / population) - (cumulative vaccinations / population) * correction constant ) * correction constant
print('''
The COVID risk index expresses how severe a country's COVID-19 situation is. Because of how COVID spreads, the number of confirmed cases in the most recent week influences the following week.
Daily deaths represent the current fatality level within the country. The level of risk can be reduced by the cumulative vaccinated population.
To compare countries, each term is divided by the country's population to put it on a common scale, and correction constants between the terms are used to normalize the index.
''')

def ratio_index(x):
    value = (x['7days_new_case'] / x['population'] + x['new_deaths'] / x['population'] - x['total_vacciantions'] / x['population']*0.001) *100
    return value


df['ratio_index'] = df.apply(ratio_index,axis=1)


locations = df.groupby(['location']).tail(1).sort_values('ratio_index',ascending=False).location.head(10).values
target = df[df.location.isin(locations)].reset_index(drop=True)
for v in ['total_cases','new_cases','ratio_index']:
    plt.figure(figsize = (15,5))
    plt.title(v)
    sns.lineplot(data=target,x= 'date',y=v,hue='location')
    plt.show()
The COVID risk index expresses how severe a country's COVID-19 situation is. Because of how COVID spreads, the number of confirmed cases in the most recent week influences the following week.
Daily deaths represent the current fatality level within the country. The level of risk can be reduced by the cumulative vaccinated population.
To compare countries, each term is divided by the country's population to put it on a common scale, and correction constants between the terms are used to normalize the index.
[Figures: line plots by country of total_cases, new_cases, and ratio_index for the top-10 countries]

Question 2-3

ν•œκ΅­μ˜ μ½”λ‘œλ‚˜ μ‹ κ·œ ν™•μ§„μž μ˜ˆμΈ‘ν•΄λΌ(μ„ ν˜• μ‹œκ³„μ—΄λͺ¨λΈ + λΉ„μ„ ν˜•μ‹œκ³„μ—΄ 각각 ν•œκ°œμ”© λ§Œλ“€μ–΄λΌ)
μ„ ν˜•μ‹œκ³„μ—΄ - arma λΉ„μ„ ν˜• μ‹œκ³„μ—΄ - arima

ko = df[df.location =='South Korea'].reset_index(drop=True)
ko.head()
location date total_cases total_deaths new_tests population new_vaccinations ratio new_cases new_deaths total_vacciantions 7days_new_case ratio_index
0 South Korea 2020-01-21 0.0 0.0 0.0 51305184.0 0.0 0.000000e+00 0.0 0.0 0.0 0.0 0.0
1 South Korea 2020-01-22 1.0 0.0 5.0 51305184.0 0.0 1.949121e-08 1.0 0.0 0.0 0.0 0.0
2 South Korea 2020-01-23 1.0 0.0 0.0 51305184.0 0.0 1.949121e-08 0.0 0.0 0.0 0.0 0.0
3 South Korea 2020-01-24 2.0 0.0 0.0 51305184.0 0.0 3.898242e-08 1.0 0.0 0.0 0.0 0.0
4 South Korea 2020-01-25 2.0 0.0 0.0 51305184.0 0.0 3.898242e-08 0.0 0.0 0.0 0.0 0.0
# μ„ ν˜•λͺ¨λΈ - arma.

from statsmodels.tsa.ar_model import AutoReg
mod = AutoReg(ko.new_cases, 3, old_names=False)
res = mod.fit()
print(res.summary())
fig = res.plot_predict(1,700)

# Non-linear model - using ARIMA
from statsmodels.tsa.arima.model import ARIMA

model = ARIMA(ko.new_cases, order=(0,1,1))
model_fit = model.fit()
print(model_fit.summary())

forecast = model_fit.forecast(steps=24*7)

plt.figure(figsize=(10,5))
plt.plot(ko.new_cases)
plt.plot(forecast)
                            AutoReg Model Results                             
==============================================================================
Dep. Variable:              new_cases   No. Observations:                  679
Model:                     AutoReg(3)   Log Likelihood               -4376.552
Method:               Conditional MLE   S.D. of innovations            156.844
Date:                Wed, 21 Dec 2022   AIC                           8763.103
Time:                        01:16:32   BIC                           8785.684
Sample:                             3   HQIC                          8771.846
                                  679                                         
================================================================================
                   coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------
const           10.0652      7.966      1.264      0.206      -5.547      25.678
new_cases.L1     0.9978      0.037     27.163      0.000       0.926       1.070
new_cases.L2    -0.3117      0.052     -6.002      0.000      -0.413      -0.210
new_cases.L3     0.3080      0.038      8.196      0.000       0.234       0.382
                                    Roots                                    
=============================================================================
                  Real          Imaginary           Modulus         Frequency
-----------------------------------------------------------------------------
AR.1            1.0045           -0.0000j            1.0045           -0.0000
AR.2            0.0037           -1.7978j            1.7978           -0.2497
AR.3            0.0037           +1.7978j            1.7978            0.2497
-----------------------------------------------------------------------------
                               SARIMAX Results                                
==============================================================================
Dep. Variable:              new_cases   No. Observations:                  679
Model:                 ARIMA(0, 1, 1)   Log Likelihood               -4422.919
Date:                Wed, 21 Dec 2022   AIC                           8849.837
Time:                        01:16:32   BIC                           8858.876
Sample:                             0   HQIC                          8853.336
                                - 679                                         
Covariance Type:                  opg                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
ma.L1          0.0072      0.025      0.286      0.775      -0.042       0.057
sigma2       2.73e+04    486.188     56.156      0.000    2.63e+04    2.83e+04
===================================================================================
Ljung-Box (L1) (Q):                   0.01   Jarque-Bera (JB):              8521.33
Prob(Q):                              0.94   Prob(JB):                         0.00
Heteroskedasticity (H):              21.35   Skew:                             2.60
Prob(H) (two-sided):                  0.00   Kurtosis:                        19.57
===================================================================================

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
[<matplotlib.lines.Line2D at 0x7f871a7941c0>]
[Figures: AutoReg prediction plot; ARIMA forecast plotted over new_cases]
forecast = model_fit.forecast(steps=24*7)

plt.figure(figsize=(10,5))
plt.plot(ko.new_cases)
plt.plot(forecast)
[<matplotlib.lines.Line2D at 0x7f8702fd0370>]
[Figure: ARIMA forecast plotted over new_cases]
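The prompt names ARMA for the linear model while the cell above fits an AR(3) via AutoReg; an explicit ARMA can be obtained from the same ARIMA class with d=0 (a sketch; the (2, 0, 1) order is an assumption):

from statsmodels.tsa.arima.model import ARIMA

# ARMA(2,1) is ARIMA with no differencing: order=(p, 0, q)
arma_fit = ARIMA(ko.new_cases, order=(2, 0, 1)).fit()
print(arma_fit.summary())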

Attention

Question 3
Survey data
Data source : created by the author
data Url : https://raw.githubusercontent.com/Datamanim/datarepo/main/adp/p3/problem3.csv
Data description : Groups A through D each answered the same survey, consisting of items 1-1, 1-2, 1-3 … 5-1, 5-4.
The items are divided into domains; there are 5 domains in total (1~5).
Each domain has 4 sub-items (1-1, 1-2, 1-3, 1-4, …), and some of them are reverse-worded items. For example, if item 1-1 is "I keep my appointments on time.", then item 1-3 is the reverse item "I do not keep my appointments on time."
Item 3 of each domain is the reverse item of item 1 of that domain. All answers are on a 5-point scale. Before solving the problems, every reverse item's score needs to be converted (by subtracting it from 6).

import pandas as pd
df =pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/adp/p3/problem3.csv')
df.head()
userid group Q1-1 Q1-2 Q1-3 Q1-4 Q2-1 Q2-2 Q2-3 Q2-4 ... Q3-3 Q3-4 Q4-1 Q4-2 Q4-3 Q4-4 Q5-1 Q5-2 Q5-3 Q5-4
0 0 A 5 2 1 2 4 5 3 3 ... 1 1 5 2 5 3 3 4 3 4
1 1 A 2 2 3 3 4 3 1 4 ... 2 3 4 3 5 3 1 2 1 1
2 2 A 1 3 4 4 2 1 4 4 ... 4 2 1 3 4 1 3 3 2 5
3 3 A 3 3 4 2 2 4 4 3 ... 2 3 3 4 2 4 1 1 3 2
4 4 A 3 1 2 3 4 3 4 1 ... 5 1 3 2 3 1 3 2 5 4

5 rows Γ— 22 columns

Question 3-1

After converting the reverse items, compute the mean, standard deviation, skewness, and kurtosis of the responses for each group (A~D) and each domain (Q1~Q5). (Create a 4x5 dataframe for each statistic.)

# Convert the reverse items (6 - score)
for num in range(1,6):
    df[f'Q{num}-3'] =6 -df[f'Q{num}-3']
    
for num in range(1,6):
    col_lst = ['group']
    for col in range(1,5):
        col_lst.append(f'Q{num}-{col}')
        
    target = df[col_lst]
    
    targetdf =target.set_index('group').unstack().to_frame().reset_index()[['group',0]].rename(columns ={0: f'Q{num}'})
    
    display(targetdf.groupby('group').agg(['mean','std','skew',pd.DataFrame.kurt]))
Q1
mean std skew kurt
group
A 3.016 1.263860 -0.077803 -1.087887
B 3.042 1.242489 -0.126751 -1.022905
C 3.030 1.243642 -0.050626 -1.033246
D 2.991 1.264325 -0.069421 -1.081406
Q2
mean std skew kurt
group
A 3.058 1.236999 -0.129390 -0.997133
B 3.048 1.266215 -0.111043 -1.060834
C 3.063 1.256427 -0.122030 -1.046603
D 3.091 1.249913 -0.166334 -1.018150
Q3
mean std skew kurt
group
A 2.992 1.268679 -0.061600 -1.098330
B 3.050 1.238965 -0.117158 -1.035672
C 3.023 1.248210 -0.102330 -0.988577
D 3.034 1.255556 -0.128043 -1.043094
Q4
mean std skew kurt
group
A 3.043 1.255678 -0.090314 -1.028166
B 3.041 1.240507 -0.071541 -1.014676
C 3.014 1.283531 -0.074531 -1.100094
D 3.080 1.268546 -0.144620 -1.006126
Q5
mean std skew kurt
group
A 3.088 1.256119 -0.102638 -1.053632
B 2.983 1.272136 -0.055805 -1.080934
C 2.987 1.260325 -0.068696 -1.071557
D 2.989 1.250777 -0.065315 -1.055332
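The prompt asks for one 4x5 (group x domain) frame per statistic; a sketch that reshapes the same reverse-coded data that way with melt and unstack:

# one group x domain table per statistic
qcols = [c for c in df.columns if c.startswith('Q')]
long = df.melt(id_vars='group', value_vars=qcols)
long['domain'] = long['variable'].str.split('-').str[0]
for stat in ['mean', 'std', 'skew', pd.Series.kurt]:
    display(long.groupby(['group', 'domain'])['value'].agg(stat).unstack())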

Question 3-2

Run an ANOVA to test whether item Q1-1 differs between the groups

from scipy.stats import shapiro
a = df[df.group =='A']['Q1-1']
b = df[df.group =='B']['Q1-1']
c = df[df.group =='C']['Q1-1']
d = df[df.group =='D']['Q1-1']

print('a p-value',shapiro(a)[1])
print('b p-value',shapiro(b)[1])
print('c p-value',shapiro(c)[1])
print('d p-value',shapiro(d)[1])


from scipy.stats import levene

# Equal variances are satisfied (Levene's test)
print(levene(a,b,c,d))
print()

# μ •κ·œμ„±μ„ λ§Œμ‘±ν•˜μ§€ μ•ŠκΈ° λ•Œλ¬Έμ— kruskal-wallis H testλ₯Ό 톡해 λΆ„μ‚° 뢄석 진행
from scipy.stats import kruskal
kruskal(a,b,c,d)

# The four groups show no statistically significant difference
a p-value 4.089666539447423e-12
b p-value 1.2895768654319628e-11
c p-value 1.4126045819184974e-11
d p-value 4.2081052184506085e-12
LeveneResult(statistic=0.24718103455049822, pvalue=0.8633690011210747)
KruskalResult(statistic=4.567127187870985, pvalue=0.20638028098088249)
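Since the prompt literally asks for an ANOVA, the parametric one-way ANOVA can be reported alongside the Kruskal-Wallis result (a sketch; with 5-point Likert data and non-normal groups its assumptions are debatable):

from scipy.stats import f_oneway

# parametric one-way ANOVA on the same four groups, for comparison
print(f_oneway(a, b, c, d))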

Question 3-3

Perform an exploratory factor analysis and visualize the results

ana = df.drop(columns = ['userid','group'])
# included in the package list allowed in the actual ADP exam environment
#!pip install factor-analyzer

from factor_analyzer.factor_analyzer import calculate_bartlett_sphericity
chi_square_value,p_value=calculate_bartlett_sphericity(ana)
chi_square_value, p_value

# μš”μΈμ„± 평가 κ²°κ³Ό μš”μΈμ„± 평가에 μ ν•©ν•œ p-value( <0.05)λ₯Ό 확인

from factor_analyzer import FactorAnalyzer
from factor_analyzer.factor_analyzer import calculate_kmo
kmo_all,kmo_model=calculate_kmo(ana)
kmo_model
# A KMO value of 0.6 or below is considered unsuitable for factor analysis


fa = FactorAnalyzer(n_factors=25,rotation=None)
fa.fit(ana)
# Check the eigenvalues
ev, v = fa.get_eigenvalues()
plt.scatter(range(1,ana.shape[1]+1),ev)
plt.plot(range(1,ana.shape[1]+1),ev)
plt.title('Scree Plot')
plt.xlabel('Factors')
plt.ylabel('Eigenvalue')
plt.grid()
plt.show()

# The point where the eigenvalue reaches 1 suggests about 10 factors as a suitable number to keep



fa = FactorAnalyzer(n_factors=10, rotation="varimax") # ml : maximum-likelihood estimation method
fa.fit(ana)
efa_result= pd.DataFrame(fa.loadings_, index=ana.columns)
plt.figure(figsize=(6,10))
sns.heatmap(efa_result, cmap="Blues", annot=True, fmt='.2f')
[Figure: scree plot of eigenvalues]
<AxesSubplot:>
[Figure: heatmap of varimax factor loadings]
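To complement the loadings heatmap, the variance explained by the 10 factors can be summarised with get_factor_variance (a short sketch using the fitted fa above):

# SS loadings, proportion of variance, and cumulative proportion for each factor
var_table = pd.DataFrame(fa.get_factor_variance(),
                         index=['SS Loadings', 'Proportion Var', 'Cumulative Var'])
display(var_table)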