모의고사 1회차#

Hits

광고 한번 눌러주시겠습니까

작업 1유형#

Attention

데이터 출처 : https://archive.ics.uci.edu/ml/datasets/Bank+Marketing (후처리 작업)
데이터 설명 : 은행의 전화 마케팅에 대해 고객의 반응 여부
dataurl : https://raw.githubusercontent.com/Datamanim/datarepo/main/bank/train.csv

import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/bank/train.csv')
df.head()
ID age job marital education default balance housing loan contact day month campaign pdays previous poutcome y
0 13829 29 technician single tertiary no 18254 no no cellular 11 may 2 -1 0 unknown no
1 22677 26 services single secondary no 512 yes yes unknown 5 jun 3 -1 0 unknown no
2 10541 30 management single secondary no 135 no no cellular 14 aug 2 -1 0 unknown no
3 13689 41 technician married unknown no 30 yes no cellular 10 jul 1 -1 0 unknown no
4 11304 27 admin. single secondary no 321 no yes unknown 2 sep 1 -1 0 unknown no

Question1

마케팅 응답 고객들의 나이를 10살 단위로 변환 했을 때, 가장 많은 인원을 가진 나이대는? (0~9 : 0 , 10~19 : 10)

Hide code cell source
result = (df.age//10 *10).value_counts().index[0]
print(result)
30

Question2

마케팅 응답 고객들의 나이를 10살 단위로 변환 했을 때, 가장 많은 나이대 구간의 인원은 몇명인가?

Hide code cell source
result = (df.age//10 *10).value_counts().values[0]
print(result)
5056

Question3

나이가 25살 이상 29살 미만인 응답 고객들중 housing컬럼의 값이 yes인 고객의 수는?

Hide code cell source
result = df[(df.age >=25) & (df.age<29) & (df.housing =='yes')].shape[0]
print(result)
504

Question4

numeric한 값을 가지지 않은 컬럼들중 unique한 값을 가장 많이 가지는 컬럼은?

pd.DataFrame(lst).sort_values(1,ascending=False)
0 1
0 job 12
7 month 12
2 education 4
8 poutcome 4
1 marital 3
6 contact 3
3 default 2
4 housing 2
5 loan 2
9 y 2
Hide code cell source
lst= [] 
for col in df.select_dtypes(exclude='int'):
    target = df[col]
    lst.append([col,target.nunique()])

dfs  = pd.DataFrame(lst).sort_values(1,ascending=False)
result = dfs[dfs[1] ==dfs[1].max()][0].values
print(result)
['job' 'month']

Question5

balance 컬럼값들의 평균값 이상을 가지는 데이터를 ID값을 기준으로 내림차순 정렬했을때 상위 100개 데이터의 balance값의 평균은?

Hide code cell source
result = df[df.balance >= df.balance.mean()].sort_values('ID',ascending=False).head(100).balance.mean()
print(result)
3473.73

Question6

가장 많은 광고를 집행했던 날짜는 언제인가? (데이터 그대로 일(숫자),달(영문)으로 표기)

Hide code cell source
result = df[['day','month']].value_counts().index[0]
print(result)
(15, 'may')

Question7

데이터의 job이 unknown 상태인 고객들의 age 컬럼 값의 정규성을 검정하고자 한다. 샤피로 검정의 p-value값을 구하여라

Hide code cell source
from scipy.stats import shapiro
result = shapiro(df[df.job =='unknown'].age)[1]
print(result)
0.1961131989955902

Question8

age와 balance의 상관계수를 구하여라

Hide code cell source
result = df[['age','balance']].corr().iloc[0,1]
print(result)
0.10198734763981472

Question9

y 변수와 education 변수는 독립인지 카이제곱검정을 통해 확인하려한다. p-value값을 출력하라

df
ID age job marital education default balance housing loan contact day month campaign pdays previous poutcome y
0 13829 29 technician single tertiary no 18254 no no cellular 11 may 2 -1 0 unknown no
1 22677 26 services single secondary no 512 yes yes unknown 5 jun 3 -1 0 unknown no
2 10541 30 management single secondary no 135 no no cellular 14 aug 2 -1 0 unknown no
3 13689 41 technician married unknown no 30 yes no cellular 10 jul 1 -1 0 unknown no
4 11304 27 admin. single secondary no 321 no yes unknown 2 sep 1 -1 0 unknown no
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
12865 14023 47 technician married secondary no 1167 yes no cellular 30 apr 1 87 5 failure yes
12866 17259 31 unknown married secondary no 111 no no cellular 21 nov 2 93 2 failure yes
12867 15200 37 unemployed single tertiary no 1316 yes no cellular 18 nov 1 172 2 failure no
12868 13775 42 management married tertiary no 479 yes no unknown 28 may 2 -1 0 unknown no
12869 20137 24 services single secondary no 0 no no unknown 16 may 1 -1 0 unknown no

12870 rows × 17 columns

Hide code cell source
v = pd.crosstab(df.y,df.education)
from scipy.stats import chi2_contingency
chi2 , p ,dof, expected = chi2_contingency(v)
display(v)
print(p)
education primary secondary tertiary unknown
y
no 1424 4555 2559 365
yes 456 1813 1516 182
7.901201277473551e-29

Question10

각 job에 따라 divorced/married 인원의 비율을 확인 했을 때 그 값이 가장 높은 값은?

Hide code cell source
t = df.groupby(['job','marital']).size().reset_index()
pivotdf = t.pivot_table(index='job',columns='marital')[0]
pivotdf = pivotdf.fillna(0)
pivotdf['ratio'] = pivotdf['divorced'] / pivotdf['married']

result = pivotdf.sort_values('ratio').ratio.values[-1]
print(result)
0.2831050228310502

작업 2유형#

import pandas as pd
train= pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/bank/train.csv')
test= pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/bank/test.csv')
submission= pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/bank/submission.csv')

display(train.head())
display(test.head())
display(submission.head())
ID age job marital education default balance housing loan contact day month campaign pdays previous poutcome y
0 13829 29 technician single tertiary no 18254 no no cellular 11 may 2 -1 0 unknown no
1 22677 26 services single secondary no 512 yes yes unknown 5 jun 3 -1 0 unknown no
2 10541 30 management single secondary no 135 no no cellular 14 aug 2 -1 0 unknown no
3 13689 41 technician married unknown no 30 yes no cellular 10 jul 1 -1 0 unknown no
4 11304 27 admin. single secondary no 321 no yes unknown 2 sep 1 -1 0 unknown no
ID age job marital education default balance housing loan contact day month campaign pdays previous poutcome
0 53608 32 management single tertiary no 12569 no no cellular 1 jul 2 295 2 success
1 51055 25 services single secondary no 801 no no cellular 5 jun 2 -1 0 unknown
2 52573 46 blue-collar married secondary no 1728 yes no unknown 26 may 2 -1 0 unknown
3 50458 39 management divorced secondary no 51 no no unknown 17 jun 2 -1 0 unknown
4 52272 31 services single tertiary no 1626 no no unknown 31 jul 1 -1 0 unknown
ID predict
0 53608 0.0
1 51055 0.0
2 52573 0.0
3 50458 0.0
4 52272 0.0

간단한 모델링 작업

모델링 및 submission파일 생성까지

Hide code cell source
from sklearn.model_selection import train_test_split

x = train.drop(columns =['ID','y'])
xd = pd.get_dummies(x)
y = train['y']

x_train,x_test,y_train,y_test = train_test_split(xd,y,stratify =y ,random_state=1)

from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()
rf.fit(x_train,y_train)
pred = rf.predict_proba(x_test)

from sklearn.metrics import roc_auc_score,classification_report

print('test roc score : ',roc_auc_score(y_test,pred[:,1]))


test_pred = rf.predict_proba(pd.get_dummies(test.drop(columns=['ID'])))
submission['predict'] = test_pred[:,1]

print('submission file')
display(submission.head())
submission.to_csv('00000000000000.csv',index=False)
test roc score :  0.77612408703591
submission file
ID predict
0 53608 0.73
1 51055 0.80
2 52573 0.01
3 50458 0.23
4 52272 0.33