Round 4 Modified Past Exam Questions#

  • Statistics questions are reflected in Task Type 3


Task Type 1#

import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/krdatacertificate/e4_p1_1.csv')
df.head(5)
Store Date Weekly_Sales Holiday_Flag Temperature Fuel_Price CPI Unemployment
0 1 05-02-2010 1643690.90 0 42.31 2.572 211.096358 8.106
1 1 12-02-2010 1641957.44 1 38.51 2.548 211.242170 8.106
2 1 19-02-2010 1611968.17 0 39.93 2.514 211.289143 8.106
3 1 26-02-2010 1409727.59 0 46.63 2.561 211.319643 8.106
4 1 05-03-2010 1554806.68 0 46.5 2.625 211.350143 8.106

1-1

Remove the non-numeric characters from the Temperature column, convert it to a numeric type, and compute the difference between the 3rd quartile and the 1st quartile to 2 decimal places.

# strip the stray '*' characters literally, then cast to float
df.loc[:,'Temperature_f'] = df['Temperature'].str.replace('*','',regex=False).astype('float')
# IQR = Q3 - Q1, rounded to 2 decimal places as asked
result = round(df['Temperature_f'].quantile(0.75) - df['Temperature_f'].quantile(0.25),2)
print(result)
27.48
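As a cross-check, a more general cleaning step strips every character that is not part of a number and lets pandas coerce whatever remains, without assuming that '*' is the only stray character. A sketch (Temperature_f2 is just an illustrative column name):

tmp = df['Temperature'].str.replace(r'[^0-9.\-]', '', regex=True)  # keep digits, dots and minus signs
df['Temperature_f2'] = pd.to_numeric(tmp, errors='coerce')
print(round(df['Temperature_f2'].quantile(0.75) - df['Temperature_f2'].quantile(0.25), 2))
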
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/krdatacertificate/e4_p1_2.csv')
df.head(5)
Title Video ID Published At Keyword Likes Comments Views
0 Apple Pay Is Killing the Physical Wallet After... wAZZ-UWGVHI 2022-08-23 tech 3407.0 672.0 135612.0
1 The most EXPENSIVE thing I own. b3x28s61q3c 2022-08-24 tech 76779.0 4306.0 1758063.0
2 My New House Gaming Setup is SICK! 4mgePWWCAmA 2022-08-23 tech 63825.0 3338.0 1564007.0
3 Petrol Vs Liquid Nitrogen | Freezing Experimen... kXiYSI7H2b0 2022-08-23 tech 71566.0 1426.0 922918.0
4 Best Back to School Tech 2022! ErMwWXQxHp0 2022-08-08 tech 96513.0 5155.0 1855644.0

1-2

For videos whose Likes-to-Comments ratio is 20 or more and whose Keyword value is minecraft, find the mean of the Views values as an integer.

result = int(df[((df['Likes'] / df['Comments']) >=20) & (df.Keyword =='minecraft')].Views.mean())
print(result)
1789084
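One caveat: a row with Comments equal to 0 yields an infinite ratio, which would still pass the >= 20 filter. If such rows exist in the data, a defensive variant excludes them first (a sketch):

valid = df[df['Comments'] > 0]
mask = ((valid['Likes'] / valid['Comments']) >= 20) & (valid['Keyword'] == 'minecraft')
print(int(valid.loc[mask, 'Views'].mean()))
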
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/krdatacertificate/e4_p1_3.csv')
df.head(5)
show_id type title director cast country date_added release_year rating duration listed_in description
0 s1 Movie Dick Johnson Is Dead Kirsten Johnson NaN United States September 25, 2021 2020 PG-13 90 min Documentaries As her father nears the end of his life, filmm...
1 s2 TV Show Blood & Water NaN Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban... South Africa September 24, 2021 2021 TV-MA 2 Seasons International TV Shows, TV Dramas, TV Mysteries After crossing paths at a party, a Cape Town t...
2 s3 TV Show Ganglands Julien Leclercq Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi... NaN September 24, 2021 2021 TV-MA 1 Season Crime TV Shows, International TV Shows, TV Act... To protect his family from a powerful drug lor...
3 s4 TV Show Jailbirds New Orleans NaN NaN NaN September 24, 2021 2021 TV-MA 1 Season Docuseries, Reality TV Feuds, flirtations and toilet talk go down amo...
4 s5 TV Show Kota Factory NaN Mayur More, Jitendra Kumar, Ranjan Raj, Alam K... India September 24, 2021 2021 TV-MA 2 Seasons International TV Shows, Romantic TV Shows, TV ... In a city of coaching centers known to train I...

1-3

Count the number of records whose date_added falls in January 2018 and whose country is solely United Kingdom.

# parse date_added to datetime first, then filter on the year-month and the single-country match
df['date_added_dt'] = pd.to_datetime(df['date_added'])
result = df[(df.date_added_dt.dt.strftime("%Y-%m") =='2018-01') & (df.country =='United Kingdom')].shape[0]
print(result)
6
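The single-country condition is why the solution uses exact equality on country; a substring match would also count co-productions that merely include the United Kingdom. A comparison sketch (reusing the date_added_dt column created above):

solo_uk = (df['country'] == 'United Kingdom')                       # UK as the only producer
any_uk = df['country'].str.contains('United Kingdom', na=False)     # UK alone or as co-producer
jan_2018 = (df['date_added_dt'].dt.strftime('%Y-%m') == '2018-01')
print((jan_2018 & solo_uk).sum(), (jan_2018 & any_uk).sum())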

Task Type 2#

Attention

Customer segmentation: https://www.kaggle.com/datasets/kaushiksuresh147/customer-segmentation
train = https://raw.githubusercontent.com/Datamanim/datarepo/main/krdatacertificate/e4_p2_train.csv
test = https://raw.githubusercontent.com/Datamanim/datarepo/main/krdatacertificate/e4_p2_test.csv

Target variable: Segmentation. For test.csv, predict the Segmentation class for each ID, save the predictions, and submit them. The submission file must contain exactly two columns, ID and Segmentation. The evaluation metric is macro F1 score.
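Macro F1 is the unweighted mean of the per-class F1 scores, so every class counts equally regardless of its size. A minimal illustration with made-up labels (the values below are hypothetical, not from this dataset):

from sklearn.metrics import f1_score

y_true = ['A', 'B', 'C', 'A', 'B', 'C']
y_pred = ['A', 'B', 'C', 'A', 'A', 'C']
# per-class F1 scores are 0.8 (A), 0.667 (B) and 1.0 (C); their unweighted mean is about 0.822
print(f1_score(y_true, y_pred, average='macro'))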

import pandas as pd
train = pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/krdatacertificate/e4_p2_train.csv')
test = pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/krdatacertificate/e4_p2_test.csv')

display(train.head(2))
test.head(2)
ID Gender Ever_Married Age Graduated Profession Work_Experience Spending_Score Family_Size Var_1 Segmentation
0 464357 Male No 40 Yes Artist 7.0 Low 1.0 Cat_6 A
1 459624 Male No 18 No Healthcare NaN Low 5.0 Cat_4 D
ID Gender Ever_Married Age Graduated Profession Work_Experience Spending_Score Family_Size Var_1
0 460406 Male Yes 36 Yes Healthcare 3.0 Low 2.0 Cat_6
1 466890 Male Yes 47 Yes Artist 0.0 Average 6.0 Cat_7
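Before the imputation step below, it can help to see which columns actually contain missing values (a quick sketch):

# count missing values per column in the training set
print(train.isnull().sum())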
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from sklearn.ensemble import RandomForestClassifier


# Fill missing values: most frequent value for categorical columns, mean for Work_Experience
MAX_EVER_MARRIED = train['Ever_Married'].value_counts().index[0]
MAX_GRADUATED = train['Graduated'].value_counts().index[0]
MAX_PROFESSION = train['Profession'].value_counts().index[0]
MEAN_Work_Experience = train['Work_Experience'].mean()
MAX_Family_Size = train['Family_Size'].value_counts().index[0]
MAX_Var_1 = train['Var_1'].value_counts().index[0]

fillna_dic = {
    'Ever_Married':MAX_EVER_MARRIED,
    'Graduated':MAX_GRADUATED,
    'Profession':MAX_PROFESSION,
    'Work_Experience':MEAN_Work_Experience,
    'Family_Size':MAX_Family_Size,
    'Var_1':MAX_Var_1
}

for col in fillna_dic.keys():
    train.loc[:,col] = train.loc[:,col].fillna(fillna_dic[col])
    test.loc[:,col] = test.loc[:,col].fillna(fillna_dic[col])
    
    
    
# Treat Family_Size as a categorical feature by casting it to string
train['Family_Size'] = train['Family_Size'].astype('str')
test['Family_Size'] = test['Family_Size'].astype('str')


# Keep the test IDs for the submission file, then drop the ID and target columns
IDS = test['ID']
train_drop = train.drop(columns =['ID','Segmentation'])
train_y = train['Segmentation']

test_drop = test.drop(columns =['ID'])

train_dum = pd.get_dummies(train_drop)
# align the test dummy columns to the training columns; categories missing in test are filled with 0
test_dum = pd.get_dummies(test_drop).reindex(columns=train_dum.columns, fill_value=0)


# Split the training data into train and validation sets

x_train, x_test, y_train,y_test = train_test_split(train_dum , train_y,random_state =1,stratify=train_y)


# Modeling: random forest classifier
rf = RandomForestClassifier(random_state =1)
rf.fit(x_train,y_train)

pred = rf.predict(x_test)
print('validation macro f1 score',f1_score(y_test,pred,average='macro'))


pred_test = rf.predict(test_dum)


# Build the submission file with exactly the two required columns
submission = pd.DataFrame()
submission['ID'] = IDS
submission['Segmentation'] = pred_test
submission.to_csv('000000000.csv',index=False)
validation macro f1 score 0.4107784704089142
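A quick sanity check on the saved file confirms the required two-column format and row count (a sketch; the filename matches the one written above):

check = pd.read_csv('000000000.csv')
print(check.columns.tolist())    # expected: ['ID', 'Segmentation']
print(len(check) == len(test))   # one prediction per test row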

Task Type 3#

Attention

This is data on the department and gender of incoming students at a university. DataUrl = https://raw.githubusercontent.com/Datamanim/datarepo/main/krdatacertificate/e4_p3_1_.csv Using this data, we want to run a chi-square test of independence to check whether students' department and gender are independent of each other.

import pandas as pd 
df= pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/krdatacertificate/e4_p3_1_.csv')
df.head()
학번 학과 성별
0 DFSKJK_1 경영학과
1 DFSKJK_2 사회학과
2 DFSKJK_3 기계공학과
3 DFSKJK_4 경제학과
4 DFSKJK_5 기계공학과

3-1-a

Compute the average number of students per department, to 3 decimal places.

result= round(df.groupby(['학과']).size().mean(),3)
print(result)
170.333

3-1-b

Compute the chi-square test of independence statistic, to 3 decimal places.

from scipy.stats import chi2_contingency


crosstab = pd.crosstab(df['학과'], df['성별'])
s, p, _, _ = chi2_contingency(crosstab)


round_s = round(s,3)
print(round_s)
5.646
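For reference, the same statistic can be reproduced from the definition of the chi-square statistic, the sum of (observed - expected)^2 / expected, with the expected counts taken from the independence assumption (a sketch):

import numpy as np

observed = crosstab.values
# expected count for each cell under independence: row total * column total / grand total
expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()
print(round(((observed - expected) ** 2 / expected).sum(), 3))  # should match the statistic above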

3-1-c

Compute the p-value of the chi-square independence test to 3 decimal places. At a significance level of 0.05, print whichever of the null hypothesis ('귀무') and the alternative hypothesis ('대립') is supported.

round_p = round(p,3)
print(round_p)
print('귀무')  # p >= 0.05, so we fail to reject the null hypothesis
0.342
귀무
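The printed conclusion follows directly from comparing the p-value with the significance level (a short sketch):

alpha = 0.05
print('귀무' if p >= alpha else '대립')  # retain H0 ('귀무') when p >= alpha, otherwise reject it ('대립')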

Attention

At a school, 60 of the 100 students who took a math exam scored 60 points or higher. We want to test, at the 95% confidence level, whether the proportion of students scoring 60 or higher is greater than 50%.

3-2-a

Compute the test statistic, to 3 decimal places.

import numpy as np
n = 100        # sample size
p_hat = 0.6    # observed proportion of students scoring 60 or higher
p = 0.5        # proportion under the null hypothesis
alpha = 0.05   # significance level

# one-sample z statistic for a proportion, rounded to 3 decimal places
z = round((p_hat - p) / np.sqrt(p * (1 - p) / n),3)
print(z)
2.0

3-2-b

Compute the p-value to 3 decimal places, and print whichever of the null hypothesis ('귀무') and the alternative hypothesis ('대립') is supported.

from scipy.stats import norm
# one-sided (right-tailed) p-value for the z statistic
p_value = round(1 - norm.cdf(z),3)

print(p_value)
print('대립')  # p = 0.023 < 0.05, so the null hypothesis is rejected in favor of the alternative ('대립')
0.023
대립
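As a cross-check, an exact binomial test (assuming SciPy 1.7 or later for binomtest) gives a p-value close to, though not identical to, the normal-approximation value above:

from scipy.stats import binomtest

# exact test of H0: p = 0.5 against H1: p > 0.5, with 60 successes out of 100 trials
res = binomtest(60, 100, p=0.5, alternative='greater')
print(round(res.pvalue, 3))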