Exam 17

Attention

캐글에 μ—…λ‘œλ“œλœ λ‹€λ₯Έ λΆ„λ“€ μ½”λ“œ λ³΄λŸ¬κ°€κΈ°
데이터셋 링크
문제였λ₯˜, μ½”λ“œμ˜€λ₯˜ λŒ“κΈ€λ‘œ ν”Όλ“œλ°±μ£Όμ„Έμš”

Attention

Question 1

Data description : several house-related numeric features plus the house price; predict the log1p-transformed price column
Data source : https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data?select=train.csv (partially preprocessed)
data Url : https://raw.githubusercontent.com/Datamanim/datarepo/main/adp/p3/problem1.csv

import pandas as pd

df = pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/adp/p3/problem1.csv')
df.head()
Id LotArea LotFrontage YearBuilt 1stFlrSF 2ndFlrSF YearRemodAdd TotRmsAbvGrd KitchenAbvGr BedroomAbvGr GarageCars GarageArea price
0 1 8450 65.0 2003 856 854 2003 8 1 3 2 548 12.247699
1 2 9600 80.0 1976 1262 0 1976 6 1 3 2 460 12.109016
2 3 11250 68.0 2001 920 866 2002 6 1 3 2 608 12.317171
3 4 9550 60.0 1915 961 756 1970 7 1 3 3 642 11.849405
4 5 14260 84.0 2000 1145 1053 2000 9 1 4 3 836 12.429220
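Since price is the log1p of the original sale price, predictions can be mapped back to the dollar scale with np.expm1; a quick sanity-check sketch on the df loaded above:

import numpy as np

# expm1 undoes log1p: the first row's 12.247699 maps back to roughly 208,500 on the original price scale
print(np.expm1(df['price'].iloc[0]))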

Question 1-1

Perform EDA on the data and present a meaningful exploration from an analyst's point of view

  • Present visualizations and summary statistics

print(df.info())
display(df.describe())
print('''
λͺ¨λ“  μ»¬λŸΌμ€ numeric λ³€μˆ˜μ΄λ‹€. μ΄μƒμΉ˜κ°€ μ‘΄μž¬ν•˜λŠ” μ»¬λŸΌμ€ ~~ 이닀. (μ€‘λž΅)
''')

import matplotlib.pyplot as plt
df.plot(kind='box',subplots=True,layout=(2,len(df.columns)//2+1),figsize=(20,10))
plt.tight_layout()
plt.show()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Id            1460 non-null   int64  
 1   LotArea       1460 non-null   int64  
 2   LotFrontage   1201 non-null   float64
 3   YearBuilt     1460 non-null   int64  
 4   1stFlrSF      1460 non-null   int64  
 5   2ndFlrSF      1460 non-null   int64  
 6   YearRemodAdd  1460 non-null   int64  
 7   TotRmsAbvGrd  1460 non-null   int64  
 8   KitchenAbvGr  1460 non-null   int64  
 9   BedroomAbvGr  1460 non-null   int64  
 10  GarageCars    1460 non-null   int64  
 11  GarageArea    1460 non-null   int64  
 12  price         1460 non-null   float64
dtypes: float64(2), int64(11)
memory usage: 148.4 KB
None
Id LotArea LotFrontage YearBuilt 1stFlrSF 2ndFlrSF YearRemodAdd TotRmsAbvGrd KitchenAbvGr BedroomAbvGr GarageCars GarageArea price
count 1460.000000 1460.000000 1201.000000 1460.000000 1460.000000 1460.000000 1460.000000 1460.000000 1460.000000 1460.000000 1460.000000 1460.000000 1460.000000
mean 730.500000 10516.828082 70.049958 1971.267808 1162.626712 346.992466 1984.865753 6.517808 1.046575 2.866438 1.767123 472.980137 12.024057
std 421.610009 9981.264932 24.284752 30.202904 386.587738 436.528436 20.645407 1.625393 0.220338 0.815778 0.747315 213.804841 0.399449
min 1.000000 1300.000000 21.000000 1872.000000 334.000000 0.000000 1950.000000 2.000000 0.000000 0.000000 0.000000 0.000000 10.460271
25% 365.750000 7553.500000 59.000000 1954.000000 882.000000 0.000000 1967.000000 5.000000 1.000000 2.000000 1.000000 334.500000 11.775105
50% 730.500000 9478.500000 69.000000 1973.000000 1087.000000 0.000000 1994.000000 6.000000 1.000000 3.000000 2.000000 480.000000 12.001512
75% 1095.250000 11601.500000 80.000000 2000.000000 1391.250000 728.000000 2004.000000 7.000000 1.000000 3.000000 2.000000 576.000000 12.273736
max 1460.000000 215245.000000 313.000000 2010.000000 4692.000000 2065.000000 2010.000000 14.000000 3.000000 8.000000 4.000000 1418.000000 13.534474
λͺ¨λ“  μ»¬λŸΌμ€ numeric λ³€μˆ˜μ΄λ‹€. μ΄μƒμΉ˜κ°€ μ‘΄μž¬ν•˜λŠ” μ»¬λŸΌμ€ ~~ 이닀. (μ€‘λž΅)
[Figure: box plots of each column]
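The note above leaves the outlier columns unstated; a simple IQR rule is one way to back the claim with numbers (a sketch of one possible criterion, applied to the df loaded above):

# count values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] for each column
q1, q3 = df.quantile(0.25), df.quantile(0.75)
iqr = q3 - q1
outliers = ((df < q1 - 1.5 * iqr) | (df > q3 + 1.5 * iqr)).sum()
print(outliers.sort_values(ascending=False))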

Question 1-2

Split the data into Train, Valid, and Test sets and present visualizations

df2 = df.copy()

# statsmodels ols raises an error when a column name starts with a digit
df2 = df2.rename(columns={'1stFlrSF':'first','2ndFlrSF':'second'})

# For the year columns, replace the value with the number of years before the most recent year
df2['YearBuilt']  = abs(df2['YearBuilt'] - df2['YearBuilt'].max())
df2['YearRemodAdd']  = abs(df2['YearRemodAdd'] - df2['YearRemodAdd'].max())



X = df2.drop(columns=['Id','price','LotFrontage'])
y = df2['price']

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_train, X_test , y_train, y_test = train_test_split(X,y)

sc = StandardScaler()
sc.fit(X_train)

X_train_sc = sc.transform(X_train)
X_test_sc = sc.transform(X_test)

print('μŠ€μΌ€μΌλ§ μ „ μ‹œκ°ν™”')
X_train.plot(kind='box',subplots=True,layout=(2,len(df.columns)//2+1),figsize=(20,10))
plt.tight_layout()
plt.show()
print('μŠ€μΌ€μΌλ§ ν›„ μ‹œκ°ν™”')
pd.DataFrame(X_train_sc,columns=X_train.columns).plot(kind='box',subplots=True,layout=(2,len(df.columns)//2+1),figsize=(20,10))
plt.tight_layout()
plt.show()


print('''
Explanation ~~ (omitted) -> in this regression analysis, leaving the data unscaled produced a higher R-squared value
''')
μŠ€μΌ€μΌλ§ μ „ μ‹œκ°ν™”
../../../_images/p3_8_1.png
μŠ€μΌ€μΌλ§ ν›„ μ‹œκ°ν™”
../../../_images/p3_8_3.png
μ„€λͺ… ~~(μƒλž΅)) -> νšŒκ·€ λΆ„μ„μ‹œ μŠ€μΌ€μΌλ§ ν•˜μ§€ μ•ŠλŠ” 것이 r-squred 값이 더 λ†’κ²Œλ‚˜μ˜΄
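The prompt asks for Train/Valid/Test while the cell above only creates Train/Test; a sketch of also carving out a validation set with a second split (the 60/20/20 ratio and random_state are assumptions):

from sklearn.model_selection import train_test_split

# hold out 20% as test, then 25% of the remainder as validation (60/20/20 overall)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X_tr, y_tr, test_size=0.25, random_state=0)
print(len(X_tr), len(X_val), len(X_te))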

Question 1-3

2μ°¨ κ΅ν˜Έμž‘μš©ν•­ κΉŒμ§€ κ³ λ €ν•œ νšŒκ·€λΆ„μ„ μˆ˜ν–‰ 및 λ³€μˆ˜ 선택 κ³Όμ • μ œμ‹œ

from itertools import combinations

# combinations instead of permutations: patsy treats a:b:c the same regardless of term order,
# so ordered triples would only add duplicate terms to the formula
comb = list(combinations(X_train.columns, 3))
len(comb)

variables = '+ '.join(list(X_train.columns)) + '+' + '+'.join([':'.join(list(y)) for y in comb])
# Python's modules feel a little unfriendly when it comes to building regression formulas like this
# From the columns below, which include all the interaction terms, select variables according to your own criteria (see the selection sketch after the summary output)
# Including all of the variables gives a higher R-squared than the plain regression without interaction terms
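For reference, patsy (which the statsmodels formula API uses) can expand interactions on its own; a sketch of the ** shorthand, noting that (…)**3 also adds the two-way terms, so it is not identical to the hand-built string above:

# (x1 + x2 + ...)**3 expands to main effects plus all 2-way and 3-way interactions
variables_alt = '(' + ' + '.join(X_train.columns) + ')**3'
print(variables_alt)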

from statsmodels.formula.api import ols

#'+ '.join(list(X_train.columns))
res = ols(f'price ~ {variables}', data=pd.concat([X_train,y_train],axis=1)).fit()
res.summary()
OLS Regression Results
Dep. Variable: price R-squared: 0.886
Model: OLS Adj. R-squared: 0.871
Method: Least Squares F-statistic: 57.61
Date: Wed, 21 Dec 2022 Prob (F-statistic): 0.00
Time: 01:16:19 Log-Likelihood: 643.13
No. Observations: 1095 AIC: -1024.
Df Residuals: 964 BIC: -369.4
Df Model: 130
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 11.1578 0.155 71.968 0.000 10.854 11.462
LotArea 7.105e-06 5.78e-06 1.230 0.219 -4.23e-06 1.84e-05
YearBuilt -0.0031 0.001 -3.560 0.000 -0.005 -0.001
first 0.0007 0.000 5.077 0.000 0.000 0.001
second 0.0004 0.000 3.585 0.000 0.000 0.001
YearRemodAdd -0.0031 0.002 -2.047 0.041 -0.006 -0.000
TotRmsAbvGrd 0.0182 0.027 0.685 0.493 -0.034 0.070
KitchenAbvGr -0.1529 0.126 -1.212 0.226 -0.400 0.095
BedroomAbvGr 0.0218 0.035 0.618 0.537 -0.047 0.091
GarageCars 0.1429 0.072 1.974 0.049 0.001 0.285
GarageArea 7.146e-06 0.000 0.027 0.979 -0.001 0.001
LotArea:YearBuilt:first -4.009e-10 2.29e-10 -1.752 0.080 -8.5e-10 4.81e-11
LotArea:YearBuilt:second 1.651e-10 2.52e-10 0.656 0.512 -3.29e-10 6.59e-10
LotArea:YearBuilt:YearRemodAdd -7.982e-09 3.35e-09 -2.386 0.017 -1.45e-08 -1.42e-09
LotArea:YearBuilt:TotRmsAbvGrd 1.608e-07 7.01e-08 2.295 0.022 2.33e-08 2.98e-07
LotArea:YearBuilt:KitchenAbvGr 8.162e-07 2.99e-07 2.726 0.007 2.29e-07 1.4e-06
LotArea:YearBuilt:BedroomAbvGr -3.453e-07 1.19e-07 -2.909 0.004 -5.78e-07 -1.12e-07
LotArea:YearBuilt:GarageCars -2.565e-07 1.84e-07 -1.395 0.163 -6.17e-07 1.04e-07
LotArea:YearBuilt:GarageArea 3.271e-10 6.19e-10 0.528 0.597 -8.88e-10 1.54e-09
LotArea:first:second -1.82e-11 1.58e-11 -1.156 0.248 -4.91e-11 1.27e-11
LotArea:first:YearRemodAdd -2.937e-10 3.13e-10 -0.937 0.349 -9.09e-10 3.21e-10
LotArea:first:TotRmsAbvGrd -2.585e-09 2.48e-09 -1.043 0.297 -7.45e-09 2.28e-09
LotArea:first:KitchenAbvGr -3.083e-08 2.22e-08 -1.390 0.165 -7.44e-08 1.27e-08
LotArea:first:BedroomAbvGr 2.349e-08 8.21e-09 2.860 0.004 7.37e-09 3.96e-08
LotArea:first:GarageCars -6.4e-09 9.49e-09 -0.674 0.500 -2.5e-08 1.22e-08
LotArea:first:GarageArea 2.607e-11 2.45e-11 1.064 0.288 -2.2e-11 7.41e-11
LotArea:second:YearRemodAdd -5.768e-10 3.25e-10 -1.775 0.076 -1.21e-09 6.09e-11
LotArea:second:TotRmsAbvGrd -6.54e-10 3.2e-09 -0.205 0.838 -6.93e-09 5.62e-09
LotArea:second:KitchenAbvGr -4.56e-08 2.93e-08 -1.555 0.120 -1.03e-07 1.19e-08
LotArea:second:BedroomAbvGr 3.259e-08 8.36e-09 3.901 0.000 1.62e-08 4.9e-08
LotArea:second:GarageCars -2.455e-09 1.7e-08 -0.145 0.885 -3.58e-08 3.09e-08
LotArea:second:GarageArea -2.17e-11 5.27e-11 -0.412 0.681 -1.25e-10 8.18e-11
LotArea:YearRemodAdd:TotRmsAbvGrd -1.97e-08 9.19e-08 -0.214 0.830 -2e-07 1.61e-07
LotArea:YearRemodAdd:KitchenAbvGr -5.82e-07 3.53e-07 -1.651 0.099 -1.27e-06 1.1e-07
LotArea:YearRemodAdd:BedroomAbvGr 4.925e-07 1.4e-07 3.518 0.000 2.18e-07 7.67e-07
LotArea:YearRemodAdd:GarageCars 3.294e-07 2.89e-07 1.138 0.255 -2.38e-07 8.97e-07
LotArea:YearRemodAdd:GarageArea -5.006e-10 9.76e-10 -0.513 0.608 -2.42e-09 1.41e-09
LotArea:TotRmsAbvGrd:KitchenAbvGr 3.767e-06 6.03e-06 0.625 0.532 -8.06e-06 1.56e-05
LotArea:TotRmsAbvGrd:BedroomAbvGr -3.77e-06 1.7e-06 -2.221 0.027 -7.1e-06 -4.39e-07
LotArea:TotRmsAbvGrd:GarageCars 9.679e-06 4.96e-06 1.951 0.051 -5.86e-08 1.94e-05
LotArea:TotRmsAbvGrd:GarageArea -2.417e-08 1.46e-08 -1.654 0.098 -5.28e-08 4.5e-09
LotArea:KitchenAbvGr:BedroomAbvGr -4.794e-07 9.78e-06 -0.049 0.961 -1.97e-05 1.87e-05
LotArea:KitchenAbvGr:GarageCars 2.182e-05 2.02e-05 1.078 0.281 -1.79e-05 6.15e-05
LotArea:KitchenAbvGr:GarageArea -5.645e-08 7.28e-08 -0.776 0.438 -1.99e-07 8.63e-08
LotArea:BedroomAbvGr:GarageCars -2.299e-05 6.54e-06 -3.513 0.000 -3.58e-05 -1.01e-05
LotArea:BedroomAbvGr:GarageArea 5.323e-08 2.25e-08 2.370 0.018 9.16e-09 9.73e-08
LotArea:GarageCars:GarageArea 3.955e-09 5.75e-09 0.688 0.491 -7.32e-09 1.52e-08
YearBuilt:first:second 3.187e-09 3.26e-09 0.978 0.328 -3.21e-09 9.58e-09
YearBuilt:first:YearRemodAdd 4.164e-08 5.51e-08 0.755 0.450 -6.66e-08 1.5e-07
YearBuilt:first:TotRmsAbvGrd -3.609e-07 7.13e-07 -0.506 0.613 -1.76e-06 1.04e-06
YearBuilt:first:KitchenAbvGr 5.498e-06 3.97e-06 1.386 0.166 -2.28e-06 1.33e-05
YearBuilt:first:BedroomAbvGr -6.066e-07 1.51e-06 -0.401 0.689 -3.58e-06 2.36e-06
YearBuilt:first:GarageCars -7.33e-07 2.85e-06 -0.257 0.797 -6.33e-06 4.87e-06
YearBuilt:first:GarageArea 1.572e-09 9.58e-09 0.164 0.870 -1.72e-08 2.04e-08
YearBuilt:second:YearRemodAdd -3.504e-08 5.05e-08 -0.694 0.488 -1.34e-07 6.41e-08
YearBuilt:second:TotRmsAbvGrd -6.647e-07 6.43e-07 -1.034 0.301 -1.93e-06 5.97e-07
YearBuilt:second:KitchenAbvGr 7.001e-06 3.79e-06 1.845 0.065 -4.45e-07 1.44e-05
YearBuilt:second:BedroomAbvGr -2.496e-07 1.17e-06 -0.214 0.830 -2.54e-06 2.04e-06
YearBuilt:second:GarageCars -3.235e-06 2.23e-06 -1.454 0.146 -7.6e-06 1.13e-06
YearBuilt:second:GarageArea 7.045e-10 8.18e-09 0.086 0.931 -1.54e-08 1.68e-08
YearBuilt:YearRemodAdd:TotRmsAbvGrd 1.83e-05 1.23e-05 1.483 0.138 -5.91e-06 4.25e-05
YearBuilt:YearRemodAdd:KitchenAbvGr -0.0001 3.46e-05 -3.674 0.000 -0.000 -5.92e-05
YearBuilt:YearRemodAdd:BedroomAbvGr 6.027e-06 1.94e-05 0.311 0.756 -3.2e-05 4.41e-05
YearBuilt:YearRemodAdd:GarageCars -4.087e-05 3.32e-05 -1.231 0.219 -0.000 2.43e-05
YearBuilt:YearRemodAdd:GarageArea 1.983e-07 1.2e-07 1.647 0.100 -3.8e-08 4.35e-07
YearBuilt:TotRmsAbvGrd:KitchenAbvGr -0.0016 0.001 -2.253 0.025 -0.003 -0.000
YearBuilt:TotRmsAbvGrd:BedroomAbvGr 4.456e-05 0.000 0.192 0.848 -0.000 0.000
YearBuilt:TotRmsAbvGrd:GarageCars 0.0013 0.001 1.480 0.139 -0.000 0.003
YearBuilt:TotRmsAbvGrd:GarageArea -3.781e-06 2.94e-06 -1.285 0.199 -9.56e-06 1.99e-06
YearBuilt:KitchenAbvGr:BedroomAbvGr 0.0011 0.001 0.865 0.387 -0.001 0.004
YearBuilt:KitchenAbvGr:GarageCars -0.0020 0.003 -0.735 0.463 -0.007 0.003
YearBuilt:KitchenAbvGr:GarageArea -7.963e-06 9.58e-06 -0.832 0.406 -2.68e-05 1.08e-05
YearBuilt:BedroomAbvGr:GarageCars -0.0003 0.001 -0.228 0.820 -0.003 0.002
YearBuilt:BedroomAbvGr:GarageArea 7.721e-06 4.55e-06 1.695 0.090 -1.22e-06 1.67e-05
YearBuilt:GarageCars:GarageArea -8.145e-08 1.64e-06 -0.050 0.961 -3.31e-06 3.15e-06
first:second:YearRemodAdd 1.258e-08 5.17e-09 2.433 0.015 2.43e-09 2.27e-08
first:second:TotRmsAbvGrd 3.229e-08 4.27e-08 0.756 0.450 -5.16e-08 1.16e-07
first:second:KitchenAbvGr -6.114e-07 3.05e-07 -2.004 0.045 -1.21e-06 -1.28e-08
first:second:BedroomAbvGr -1.191e-07 9.06e-08 -1.315 0.189 -2.97e-07 5.87e-08
first:second:GarageCars -2.4e-09 2.09e-07 -0.011 0.991 -4.13e-07 4.08e-07
first:second:GarageArea 1.098e-09 6.6e-10 1.665 0.096 -1.96e-10 2.39e-09
first:YearRemodAdd:TotRmsAbvGrd -6.076e-07 1.07e-06 -0.570 0.569 -2.7e-06 1.48e-06
first:YearRemodAdd:KitchenAbvGr 7.265e-06 6.33e-06 1.149 0.251 -5.15e-06 1.97e-05
first:YearRemodAdd:BedroomAbvGr -8.677e-07 2.14e-06 -0.406 0.685 -5.06e-06 3.33e-06
first:YearRemodAdd:GarageCars -5.605e-06 4.85e-06 -1.156 0.248 -1.51e-05 3.91e-06
first:YearRemodAdd:GarageArea 1.035e-08 1.66e-08 0.623 0.533 -2.22e-08 4.29e-08
first:TotRmsAbvGrd:KitchenAbvGr -4.385e-05 4.33e-05 -1.013 0.311 -0.000 4.11e-05
first:TotRmsAbvGrd:BedroomAbvGr 1.291e-05 1.11e-05 1.161 0.246 -8.91e-06 3.47e-05
first:TotRmsAbvGrd:GarageCars -3.17e-05 3.7e-05 -0.857 0.392 -0.000 4.09e-05
first:TotRmsAbvGrd:GarageArea 1.852e-07 1.18e-07 1.567 0.117 -4.67e-08 4.17e-07
first:KitchenAbvGr:BedroomAbvGr -8.223e-05 0.000 -0.742 0.459 -0.000 0.000
first:KitchenAbvGr:GarageCars 5.371e-05 0.000 0.181 0.857 -0.001 0.001
first:KitchenAbvGr:GarageArea 1.37e-06 1.09e-06 1.253 0.210 -7.75e-07 3.52e-06
first:BedroomAbvGr:GarageCars 0.0002 0.000 1.731 0.084 -2.48e-05 0.000
first:BedroomAbvGr:GarageArea -1.124e-06 3.89e-07 -2.890 0.004 -1.89e-06 -3.61e-07
first:GarageCars:GarageArea -2.199e-07 1.33e-07 -1.656 0.098 -4.81e-07 4.07e-08
second:YearRemodAdd:TotRmsAbvGrd -1.286e-06 1.02e-06 -1.266 0.206 -3.28e-06 7.07e-07
second:YearRemodAdd:KitchenAbvGr -5.54e-06 5.93e-06 -0.935 0.350 -1.72e-05 6.09e-06
second:YearRemodAdd:BedroomAbvGr 8.785e-07 1.87e-06 0.470 0.638 -2.79e-06 4.55e-06
second:YearRemodAdd:GarageCars 3.536e-06 4.02e-06 0.879 0.380 -4.36e-06 1.14e-05
second:YearRemodAdd:GarageArea -5.12e-09 1.45e-08 -0.352 0.725 -3.37e-08 2.34e-08
second:TotRmsAbvGrd:KitchenAbvGr 0.0001 5.87e-05 2.102 0.036 8.2e-06 0.000
second:TotRmsAbvGrd:BedroomAbvGr 1.177e-06 1.17e-05 0.101 0.920 -2.18e-05 2.41e-05
second:TotRmsAbvGrd:GarageCars -4.553e-05 4.18e-05 -1.089 0.276 -0.000 3.65e-05
second:TotRmsAbvGrd:GarageArea -3.217e-08 1.4e-07 -0.230 0.818 -3.07e-07 2.42e-07
second:KitchenAbvGr:BedroomAbvGr -0.0002 0.000 -1.361 0.174 -0.000 6.98e-05
second:KitchenAbvGr:GarageCars 0.0003 0.000 0.948 0.343 -0.000 0.001
second:KitchenAbvGr:GarageArea 2.178e-07 1e-06 0.218 0.828 -1.75e-06 2.18e-06
second:BedroomAbvGr:GarageCars 4.715e-05 9.33e-05 0.506 0.613 -0.000 0.000
second:BedroomAbvGr:GarageArea -3.533e-07 3.15e-07 -1.123 0.262 -9.71e-07 2.64e-07
second:GarageCars:GarageArea 3.037e-08 1.53e-07 0.199 0.842 -2.69e-07 3.3e-07
YearRemodAdd:TotRmsAbvGrd:KitchenAbvGr 0.0012 0.001 1.123 0.262 -0.001 0.003
YearRemodAdd:TotRmsAbvGrd:BedroomAbvGr -0.0003 0.000 -0.883 0.377 -0.001 0.000
YearRemodAdd:TotRmsAbvGrd:GarageCars -0.0015 0.001 -1.090 0.276 -0.004 0.001
YearRemodAdd:TotRmsAbvGrd:GarageArea 5.439e-06 4.78e-06 1.139 0.255 -3.93e-06 1.48e-05
YearRemodAdd:KitchenAbvGr:BedroomAbvGr -0.0007 0.002 -0.393 0.695 -0.004 0.003
YearRemodAdd:KitchenAbvGr:GarageCars 0.0052 0.005 1.082 0.280 -0.004 0.015
YearRemodAdd:KitchenAbvGr:GarageArea -3.205e-06 1.65e-05 -0.195 0.846 -3.55e-05 2.91e-05
YearRemodAdd:BedroomAbvGr:GarageCars 0.0034 0.002 1.891 0.059 -0.000 0.007
YearRemodAdd:BedroomAbvGr:GarageArea -1.564e-05 6.47e-06 -2.418 0.016 -2.83e-05 -2.95e-06
YearRemodAdd:GarageCars:GarageArea -4.013e-06 2.52e-06 -1.593 0.111 -8.96e-06 9.29e-07
TotRmsAbvGrd:KitchenAbvGr:BedroomAbvGr 0.0172 0.014 1.227 0.220 -0.010 0.045
TotRmsAbvGrd:KitchenAbvGr:GarageCars -0.0201 0.062 -0.323 0.747 -0.142 0.102
TotRmsAbvGrd:KitchenAbvGr:GarageArea -9.975e-05 0.000 -0.387 0.699 -0.001 0.000
TotRmsAbvGrd:BedroomAbvGr:GarageCars -0.0065 0.023 -0.280 0.780 -0.052 0.039
TotRmsAbvGrd:BedroomAbvGr:GarageArea 5.19e-05 8.26e-05 0.628 0.530 -0.000 0.000
TotRmsAbvGrd:GarageCars:GarageArea -1.463e-05 4.1e-05 -0.357 0.721 -9.51e-05 6.58e-05
KitchenAbvGr:BedroomAbvGr:GarageCars -0.1440 0.093 -1.553 0.121 -0.326 0.038
KitchenAbvGr:BedroomAbvGr:GarageArea 0.0004 0.000 1.121 0.263 -0.000 0.001
KitchenAbvGr:GarageCars:GarageArea -0.0002 0.000 -1.475 0.141 -0.001 7.82e-05
BedroomAbvGr:GarageCars:GarageArea 0.0002 5.89e-05 3.060 0.002 6.46e-05 0.000
Omnibus: 281.963 Durbin-Watson: 1.968
Prob(Omnibus): 0.000 Jarque-Bera (JB): 1549.607
Skew: -1.071 Prob(JB): 0.00
Kurtosis: 8.420 Cond. No. 8.76e+11


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 8.76e+11. This might indicate that there are
strong multicollinearity or other numerical problems.
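As one concrete variable-selection step on top of the full model above, insignificant terms can be pruned by p-value and the model refit (a sketch of a simple criterion, not the only defensible one):

# keep only terms with p-value < 0.05 and refit the reduced model
pvals = res.pvalues.drop('Intercept')
keep = pvals[pvals < 0.05].index.tolist()
res_reduced = ols('price ~ ' + ' + '.join(keep), data=pd.concat([X_train, y_train], axis=1)).fit()
print(len(keep), 'terms kept / adj. R-squared:', round(res_reduced.rsquared_adj, 3))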

Question 1-4

Present three machine-learning models suited to the data, including penalized (regularized) and ensemble models.
(Check all of MSE, MAPE, and R2 as evaluation metrics.)

# lasso , ridge , randomforest
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error , r2_score
def MAPE(y_test, y_pred):
    return np.mean(np.abs((y_test - y_pred) / y_test)) * 100 
    

ls = Lasso()
rd = Ridge()
rf = RandomForestRegressor()


def modelpipe(model):

    model.fit(X_train,y_train)
    model_pred = model.predict(X_test)
    mse = mean_squared_error(y_test,model_pred)
    r2score = r2_score(y_test,model_pred)
    mape = MAPE(y_test,model_pred)
    
    metrics= [mse,r2score,mape]
    return metrics
    
ls_result =modelpipe(ls)
rd_result =modelpipe(rd)
rf_result =modelpipe(rf)


result = pd.DataFrame([ls_result,rd_result,rf_result],columns = ['mse','r2','mape'],index=['lasso','ridge','randomForest'])
result
mse r2 mape
lasso 0.049039 0.697203 1.246075
ridge 0.041681 0.742635 1.175805
randomForest 0.037110 0.770859 1.094485
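For reference, scikit-learn 0.24+ ships a built-in MAPE; note it returns a fraction rather than the percentage used by the custom MAPE above (a sketch reusing the fitted rf model):

from sklearn.metrics import mean_absolute_percentage_error

# sklearn's MAPE is a ratio; multiply by 100 to compare with the table above
print(mean_absolute_percentage_error(y_test, rf.predict(X_test)) * 100)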

Attention

Question 2
Data description : country-level COVID-19 data; carry out the modelling
Data source : https://www.kaggle.com/imdevskp/corona-virus-report (partially post-processed)
data Url : https://raw.githubusercontent.com/Datamanim/datarepo/main/adp/p3/problem2.csv
location : region (country) name
date : date
total_cases : cumulative confirmed cases
total_deaths : cumulative deaths
new_tests : new tests
population : population
new_vaccinations : new vaccinations

import pandas as pd
df =pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/adp/p3/problem2.csv')
df.head()
location date total_cases total_deaths new_tests population new_vaccinations
0 Afghanistan 2020-02-24 5.0 NaN NaN 39835428.0 NaN
1 Afghanistan 2020-02-25 5.0 NaN NaN 39835428.0 NaN
2 Afghanistan 2020-02-26 5.0 NaN NaN 39835428.0 NaN
3 Afghanistan 2020-02-27 5.0 NaN NaN 39835428.0 NaN
4 Afghanistan 2020-02-28 5.0 NaN NaN 39835428.0 NaN
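A quick look at missing values and the covered date range (a small sanity-check sketch on the df just loaded) helps justify the drops made in question 2-1 below:

# count of NaN per column and the first/last date in the file
print(df.isna().sum())
print(df['date'].min(), '~', df['date'].max())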

Question 2-1

Using the last date, find the top 5 countries with the highest ratio of confirmed cases to population.
For those top 5 countries, plot cumulative cases, daily cases, cumulative deaths, and daily deaths, using graphs and legends so the result is easy to read.

df =pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/adp/p3/problem2.csv')
df['ratio'] = df['total_cases'] / df['population']


# Check missing values and the daily cases/deaths over the whole data
# Drop 2021-11-30 because new_tests and new_vaccinations are NaN on that date
# Drop rows where the population is 0
import matplotlib.pyplot as plt 
df = df.fillna(0)
df['date']  = pd.to_datetime(df['date'])
df = df[df.date != pd.to_datetime('2021-11-30')]
df = df[df.population !=0]

for location in df.location.unique():
    lo = df[df.location == location]
    df.loc[lo.index,'new_cases'] =lo.total_cases.diff().values
    df.loc[lo.index[0], 'new_cases'] = lo['total_cases'].values[0]

    df.loc[lo.index,'new_deaths'] =lo.total_deaths.diff().values
    df.loc[lo.index[0], 'new_deaths'] = lo['total_deaths'].values[0]
    
    df.loc[lo.index, 'total_vacciantions'] = lo['new_vaccinations'].cumsum().values
    # 7-day rolling sum of daily confirmed cases (used for the risk index in question 2-2)
    df.loc[lo.index, '7days_new_case'] = df.loc[lo.index, 'new_cases'].rolling(7).sum().fillna(0).values
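The per-location loop above can also be written with groupby; a short cross-check sketch (it assumes each location's rows are already sorted by date, as in this file):

# vectorised equivalent of the loop: per-location daily differences and 7-day sums
chk_cases = df.groupby('location')['total_cases'].diff().fillna(df['total_cases'])
chk_week = df.groupby('location')['new_cases'].transform(lambda s: s.rolling(7).sum()).fillna(0)
print((chk_cases == df['new_cases']).all(), (chk_week == df['7days_new_case']).all())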

import seaborn as sns
import matplotlib.pyplot as plt


locations = df.groupby(['location']).tail(1).sort_values('ratio',ascending=False).location.head(5).values
target = df[df.location.isin(locations)].reset_index(drop=True)
for v in ['total_cases','new_cases','total_deaths','new_deaths']:
    plt.figure(figsize = (15,5))
    plt.title(v)
    sns.lineplot(data=target,x= 'date',y=v,hue='location')
    plt.show()
[Figures: line plots by country of total_cases, new_cases, total_deaths, and new_deaths for the top-5 countries]

Question 2-2

Create your own COVID risk index, explain it, pick the 10 countries with the highest index values, and visualize them

# Risk index = ( (confirmed cases over the last 7 days / population) + (daily deaths / population) - (cumulative vaccinations / population) * correction constant ) * correction constant
print('''
The COVID risk index expresses how severe a country's COVID-19 situation is. Because of how COVID spreads, the number of confirmed cases in the most recent week influences the following week.
Daily deaths represent the current fatality level within the country. The level of risk can be reduced by the cumulative vaccinated population.
To compare countries, each term is divided by the country's population to put it on a common scale, and correction constants between the terms are used to normalize the index.
''')

def ratio_index(x):
    value = (x['7days_new_case'] / x['population'] + x['new_deaths'] / x['population'] - x['total_vacciantions'] / x['population']*0.001) *100
    return value


df['ratio_index'] = df.apply(ratio_index,axis=1)


locations = df.groupby(['location']).tail(1).sort_values('ratio_index',ascending=False).location.head(10).values
target = df[df.location.isin(locations)].reset_index(drop=True)
for v in ['total_cases','new_cases','ratio_index']:
    plt.figure(figsize = (15,5))
    plt.title(v)
    sns.lineplot(data=target,x= 'date',y=v,hue='location')
    plt.show()
The COVID risk index expresses how severe a country's COVID-19 situation is. Because of how COVID spreads, the number of confirmed cases in the most recent week influences the following week.
Daily deaths represent the current fatality level within the country. The level of risk can be reduced by the cumulative vaccinated population.
To compare countries, each term is divided by the country's population to put it on a common scale, and correction constants between the terms are used to normalize the index.
[Figures: line plots by country of total_cases, new_cases, and ratio_index for the top-10 countries]

Question 2-3

ν•œκ΅­μ˜ μ½”λ‘œλ‚˜ μ‹ κ·œ ν™•μ§„μž μ˜ˆμΈ‘ν•΄λΌ(μ„ ν˜• μ‹œκ³„μ—΄λͺ¨λΈ + λΉ„μ„ ν˜•μ‹œκ³„μ—΄ 각각 ν•œκ°œμ”© λ§Œλ“€μ–΄λΌ)
μ„ ν˜•μ‹œκ³„μ—΄ - arma λΉ„μ„ ν˜• μ‹œκ³„μ—΄ - arima

ko = df[df.location =='South Korea'].reset_index(drop=True)
ko.head()
location date total_cases total_deaths new_tests population new_vaccinations ratio new_cases new_deaths total_vacciantions 7days_new_case ratio_index
0 South Korea 2020-01-21 0.0 0.0 0.0 51305184.0 0.0 0.000000e+00 0.0 0.0 0.0 0.0 0.0
1 South Korea 2020-01-22 1.0 0.0 5.0 51305184.0 0.0 1.949121e-08 1.0 0.0 0.0 0.0 0.0
2 South Korea 2020-01-23 1.0 0.0 0.0 51305184.0 0.0 1.949121e-08 0.0 0.0 0.0 0.0 0.0
3 South Korea 2020-01-24 2.0 0.0 0.0 51305184.0 0.0 3.898242e-08 1.0 0.0 0.0 0.0 0.0
4 South Korea 2020-01-25 2.0 0.0 0.0 51305184.0 0.0 3.898242e-08 0.0 0.0 0.0 0.0 0.0
# μ„ ν˜•λͺ¨λΈ - arma.

from statsmodels.tsa.ar_model import AutoReg
mod = AutoReg(ko.new_cases, 3, old_names=False)
res = mod.fit()
print(res.summary())
fig = res.plot_predict(1,700)

# Non-linear model - using ARIMA
from statsmodels.tsa.arima.model import ARIMA

model = ARIMA(ko.new_cases, order=(0,1,1))
model_fit = model.fit()
print(model_fit.summary())

forecast = model_fit.forecast(steps=24*7)

plt.figure(figsize=(10,5))
plt.plot(ko.new_cases)
plt.plot(forecast)
                            AutoReg Model Results                             
==============================================================================
Dep. Variable:              new_cases   No. Observations:                  679
Model:                     AutoReg(3)   Log Likelihood               -4376.552
Method:               Conditional MLE   S.D. of innovations            156.844
Date:                Wed, 21 Dec 2022   AIC                           8763.103
Time:                        01:16:32   BIC                           8785.684
Sample:                             3   HQIC                          8771.846
                                  679                                         
================================================================================
                   coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------
const           10.0652      7.966      1.264      0.206      -5.547      25.678
new_cases.L1     0.9978      0.037     27.163      0.000       0.926       1.070
new_cases.L2    -0.3117      0.052     -6.002      0.000      -0.413      -0.210
new_cases.L3     0.3080      0.038      8.196      0.000       0.234       0.382
                                    Roots                                    
=============================================================================
                  Real          Imaginary           Modulus         Frequency
-----------------------------------------------------------------------------
AR.1            1.0045           -0.0000j            1.0045           -0.0000
AR.2            0.0037           -1.7978j            1.7978           -0.2497
AR.3            0.0037           +1.7978j            1.7978            0.2497
-----------------------------------------------------------------------------
                               SARIMAX Results                                
==============================================================================
Dep. Variable:              new_cases   No. Observations:                  679
Model:                 ARIMA(0, 1, 1)   Log Likelihood               -4422.919
Date:                Wed, 21 Dec 2022   AIC                           8849.837
Time:                        01:16:32   BIC                           8858.876
Sample:                             0   HQIC                          8853.336
                                - 679                                         
Covariance Type:                  opg                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
ma.L1          0.0072      0.025      0.286      0.775      -0.042       0.057
sigma2       2.73e+04    486.188     56.156      0.000    2.63e+04    2.83e+04
===================================================================================
Ljung-Box (L1) (Q):                   0.01   Jarque-Bera (JB):              8521.33
Prob(Q):                              0.94   Prob(JB):                         0.00
Heteroskedasticity (H):              21.35   Skew:                             2.60
Prob(H) (two-sided):                  0.00   Kurtosis:                        19.57
===================================================================================

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
[<matplotlib.lines.Line2D at 0x7f871a7941c0>]
[Figures: AutoReg prediction plot; ARIMA forecast plotted over new_cases]
forecast = model_fit.forecast(steps=24*7)

plt.figure(figsize=(10,5))
plt.plot(ko.new_cases)
plt.plot(forecast)
[<matplotlib.lines.Line2D at 0x7f8702fd0370>]
[Figure: ARIMA forecast plotted over new_cases]
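The prompt names ARMA for the linear model while the cell above fits an AR(3) via AutoReg; an explicit ARMA can be obtained from the same ARIMA class with d=0 (a sketch; the (2, 0, 1) order is an assumption):

from statsmodels.tsa.arima.model import ARIMA

# ARMA(2,1) is ARIMA with no differencing: order=(p, 0, q)
arma_fit = ARIMA(ko.new_cases, order=(2, 0, 1)).fit()
print(arma_fit.summary())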

Attention

Question 3
Survey data
Data source : created by the author
data Url : https://raw.githubusercontent.com/Datamanim/datarepo/main/adp/p3/problem3.csv
Data description : Groups A through D each answered the same survey, consisting of items 1-1, 1-2, 1-3 … 5-1, 5-4.
The items are divided into domains; there are 5 domains in total (1~5).
Each domain has 4 sub-items (1-1, 1-2, 1-3, 1-4, …), and some of them are reverse-worded items. For example, if item 1-1 is "I keep my appointments on time.", then item 1-3 is the reverse item "I do not keep my appointments on time."
Item 3 of each domain is the reverse item of item 1 of that domain. All answers are on a 5-point scale. Before solving the problems, every reverse item's score needs to be converted (by subtracting it from 6).

import pandas as pd
df =pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/adp/p3/problem3.csv')
df.head()
userid group Q1-1 Q1-2 Q1-3 Q1-4 Q2-1 Q2-2 Q2-3 Q2-4 ... Q3-3 Q3-4 Q4-1 Q4-2 Q4-3 Q4-4 Q5-1 Q5-2 Q5-3 Q5-4
0 0 A 5 2 1 2 4 5 3 3 ... 1 1 5 2 5 3 3 4 3 4
1 1 A 2 2 3 3 4 3 1 4 ... 2 3 4 3 5 3 1 2 1 1
2 2 A 1 3 4 4 2 1 4 4 ... 4 2 1 3 4 1 3 3 2 5
3 3 A 3 3 4 2 2 4 4 3 ... 2 3 3 4 2 4 1 1 3 2
4 4 A 3 1 2 3 4 3 4 1 ... 5 1 3 2 3 1 3 2 5 4

5 rows Γ— 22 columns

Question 3-1

After converting the reverse items, compute the mean, standard deviation, skewness, and kurtosis of the responses for each group (A~D) and each domain (Q1~Q5). (Create a 4x5 dataframe for each statistic.)

# Convert the reverse items (6 - score)
for num in range(1,6):
    df[f'Q{num}-3'] =6 -df[f'Q{num}-3']
    
for num in range(1,6):
    col_lst = ['group']
    for col in range(1,5):
        col_lst.append(f'Q{num}-{col}')
        
    target = df[col_lst]
    
    targetdf =target.set_index('group').unstack().to_frame().reset_index()[['group',0]].rename(columns ={0: f'Q{num}'})
    
    display(targetdf.groupby('group').agg(['mean','std','skew',pd.DataFrame.kurt]))
Q1
mean std skew kurt
group
A 3.016 1.263860 -0.077803 -1.087887
B 3.042 1.242489 -0.126751 -1.022905
C 3.030 1.243642 -0.050626 -1.033246
D 2.991 1.264325 -0.069421 -1.081406
Q2
mean std skew kurt
group
A 3.058 1.236999 -0.129390 -0.997133
B 3.048 1.266215 -0.111043 -1.060834
C 3.063 1.256427 -0.122030 -1.046603
D 3.091 1.249913 -0.166334 -1.018150
Q3
mean std skew kurt
group
A 2.992 1.268679 -0.061600 -1.098330
B 3.050 1.238965 -0.117158 -1.035672
C 3.023 1.248210 -0.102330 -0.988577
D 3.034 1.255556 -0.128043 -1.043094
Q4
mean std skew kurt
group
A 3.043 1.255678 -0.090314 -1.028166
B 3.041 1.240507 -0.071541 -1.014676
C 3.014 1.283531 -0.074531 -1.100094
D 3.080 1.268546 -0.144620 -1.006126
Q5
mean std skew kurt
group
A 3.088 1.256119 -0.102638 -1.053632
B 2.983 1.272136 -0.055805 -1.080934
C 2.987 1.260325 -0.068696 -1.071557
D 2.989 1.250777 -0.065315 -1.055332
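The prompt asks for one 4x5 (group x domain) frame per statistic; a sketch that reshapes the same reverse-coded data that way with melt and unstack:

# one group x domain table per statistic
qcols = [c for c in df.columns if c.startswith('Q')]
long = df.melt(id_vars='group', value_vars=qcols)
long['domain'] = long['variable'].str.split('-').str[0]
for stat in ['mean', 'std', 'skew', pd.Series.kurt]:
    display(long.groupby(['group', 'domain'])['value'].agg(stat).unstack())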

Question 3-2

Run an ANOVA to test whether item Q1-1 differs between the groups

from scipy.stats import shapiro
a = df[df.group =='A']['Q1-1']
b = df[df.group =='B']['Q1-1']
c = df[df.group =='C']['Q1-1']
d = df[df.group =='D']['Q1-1']

print('a p-value',shapiro(a)[1])
print('b p-value',shapiro(b)[1])
print('c p-value',shapiro(c)[1])
print('d p-value',shapiro(d)[1])


from scipy.stats import levene

# Equal variances are satisfied (Levene's test)
print(levene(a,b,c,d))
print()

# μ •κ·œμ„±μ„ λ§Œμ‘±ν•˜μ§€ μ•ŠκΈ° λ•Œλ¬Έμ— kruskal-wallis H testλ₯Ό 톡해 λΆ„μ‚° 뢄석 진행
from scipy.stats import kruskal
kruskal(a,b,c,d)

# The four groups show no statistically significant difference
a p-value 4.089666539447423e-12
b p-value 1.2895768654319628e-11
c p-value 1.4126045819184974e-11
d p-value 4.2081052184506085e-12
LeveneResult(statistic=0.24718103455049822, pvalue=0.8633690011210747)
KruskalResult(statistic=4.567127187870985, pvalue=0.20638028098088249)
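Since the prompt literally asks for an ANOVA, the parametric one-way ANOVA can be reported alongside the Kruskal-Wallis result (a sketch; with 5-point Likert data and non-normal groups its assumptions are debatable):

from scipy.stats import f_oneway

# parametric one-way ANOVA on the same four groups, for comparison
print(f_oneway(a, b, c, d))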

Question 3-3

Perform an exploratory factor analysis and visualize the results

ana = df.drop(columns = ['userid','group'])
# included in the package list allowed in the actual ADP exam environment
#!pip install factor-analyzer

from factor_analyzer.factor_analyzer import calculate_bartlett_sphericity
chi_square_value,p_value=calculate_bartlett_sphericity(ana)
chi_square_value, p_value

# μš”μΈμ„± 평가 κ²°κ³Ό μš”μΈμ„± 평가에 μ ν•©ν•œ p-value( <0.05)λ₯Ό 확인

from factor_analyzer import FactorAnalyzer
from factor_analyzer.factor_analyzer import calculate_kmo
kmo_all,kmo_model=calculate_kmo(ana)
kmo_model
# A KMO value of 0.6 or below is considered unsuitable for factor analysis


fa = FactorAnalyzer(n_factors=25,rotation=None)
fa.fit(ana)
# Check the eigenvalues
ev, v = fa.get_eigenvalues()
plt.scatter(range(1,ana.shape[1]+1),ev)
plt.plot(range(1,ana.shape[1]+1),ev)
plt.title('Scree Plot')
plt.xlabel('Factors')
plt.ylabel('Eigenvalue')
plt.grid()
plt.show()

# The point where the eigenvalue reaches 1 suggests about 10 factors as a suitable number to keep



fa = FactorAnalyzer(n_factors=10, rotation="varimax") # ml : maximum-likelihood estimation method
fa.fit(ana)
efa_result= pd.DataFrame(fa.loadings_, index=ana.columns)
plt.figure(figsize=(6,10))
sns.heatmap(efa_result, cmap="Blues", annot=True, fmt='.2f')
[Figure: scree plot of eigenvalues]
<AxesSubplot:>
[Figure: heatmap of varimax factor loadings]
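To complement the loadings heatmap, the variance explained by the 10 factors can be summarised with get_factor_variance (a short sketch using the fitted fa above):

# SS loadings, proportion of variance, and cumulative proportion for each factor
var_table = pd.DataFrame(fa.get_factor_variance(),
                         index=['SS Loadings', 'Proportion Var', 'Cumulative Var'])
display(var_table)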