모의고사 3회차#

Hits

.......

유튜브 링크

캐글 데이터셋 링크
유튜브링크
6월 4일(토) 오전 9시~ 11시 , 6월 5일(일) 오후 19시~21시 라이브 문제 풀이 진행

작업 1유형#

Attention

데이터 출처 : link (후처리 작업)
데이터 설명 : 2010-2019 스포티파이 TOP100 노래
dataurl : https://raw.githubusercontent.com/Datamanim/datarepo/main/spotify/spotify.csv

import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/spotify/spotify.csv')
df.head()
title artist top genre year released added bpm nrgy dnce dB live val dur acous spch pop top year artist type
0 STARSTRUKK (feat. Katy Perry) 3OH!3 dance pop 2009.0 2022‑02‑17 140.0 81.0 61.0 -6.0 23.0 23.0 203.0 0.0 6.0 70.0 2010.0 Duo
1 My First Kiss (feat. Ke$ha) 3OH!3 dance pop 2010.0 2022‑02‑17 138.0 89.0 68.0 -4.0 36.0 83.0 192.0 1.0 8.0 68.0 2010.0 Duo
2 I Need A Dollar Aloe Blacc pop soul 2010.0 2022‑02‑17 95.0 48.0 84.0 -7.0 9.0 96.0 243.0 20.0 3.0 72.0 2010.0 Solo
3 Airplanes (feat. Hayley Williams of Paramore) B.o.B atl hip hop 2010.0 2022‑02‑17 93.0 87.0 66.0 -4.0 4.0 38.0 180.0 11.0 12.0 80.0 2010.0 Solo
4 Nothin' on You (feat. Bruno Mars) B.o.B atl hip hop 2010.0 2022‑02‑17 104.0 85.0 69.0 -6.0 9.0 74.0 268.0 39.0 5.0 79.0 2010.0 Solo

Question1

데이터는 현재 년도별 100곡이 인기순으로 정렬되어 있다. 각 년도별 1~100위의 랭킹을 나타내는 rank컬럼을 만들고 매년도 1위의 bpm컬럼의 평균값을 구하여라

Hide code cell source
df = df.dropna()
df.loc[:,'rank'] = list(range(1,101))*10
result = df[df['rank'] ==1].bpm.mean()
print(result)
125.6

Question2

2015년도에 가장많은 top100곡을 올린 artist는 누구인가?

Hide code cell source
result = df[df['top year'] ==2015].artist.value_counts().index[0]
print(result)
The Weeknd

Question3

년도별 rank값이 1~10위 까지의 곡들 중 두번째로 많은 top genre는 무엇인가?

Hide code cell source
result = df[df['rank'].isin(range(1,11))]['top genre'].value_counts().index[2]
print(result)
british soul

Question4

피처링의 경우 title에 표시된다. 피처링을 가장 많이 해준 가수는 누구인가?

Hide code cell source
result = df.title.str.split('feat.').str[1].dropna().str[:-1].str.strip().value_counts().index[0]
print(result)
Bruno Mars

Question5

top year 년도를 기준으로 발매일(year released)과 top100에 진입한 일자 (top year)가 다른 곡의 숫자를 count 했을때 가장 많은 년도는?

Hide code cell source
year = int(df[df['year released'] != df['top year']]['top year'].value_counts().index[0])
print(year)
2016

Question6

artist 컬럼의 값에 대소문자 상관없이 q 단어가 들어가는 아티스트는 몇명인가?

Hide code cell source
result = df[df.artist.str.lower().str.contains('q')].artist.nunique()
print(result)
6

Question7

년도 상관없이 전체데이터에서 1~50위와 51~100위간의 dur 컬럼의 평균값의 차이는?

Hide code cell source
result = df[df['rank'].isin(range(1,51))].dur.mean() - df[df['rank'].isin(range(51,101))].dur.mean()
print(result)
0.896000000000015

Question8

title을 띄어쓰기 단어로 구분 했을때 가장 많이 나온 단어는 무엇인가? (대소문자 구분 x)

Hide code cell source
result = df.title.str.split('\(feat').str[0].str.split().explode().str.lower().value_counts().index[0]
print(result)
the
df
title artist top genre year released added bpm nrgy dnce dB live val dur acous spch pop top year artist type rank
0 STARSTRUKK (feat. Katy Perry) 3OH!3 dance pop 2009.0 2022‑02‑17 140.0 81.0 61.0 -6.0 23.0 23.0 203.0 0.0 6.0 70.0 2010.0 Duo 1
1 My First Kiss (feat. Ke$ha) 3OH!3 dance pop 2010.0 2022‑02‑17 138.0 89.0 68.0 -4.0 36.0 83.0 192.0 1.0 8.0 68.0 2010.0 Duo 2
2 I Need A Dollar Aloe Blacc pop soul 2010.0 2022‑02‑17 95.0 48.0 84.0 -7.0 9.0 96.0 243.0 20.0 3.0 72.0 2010.0 Solo 3
3 Airplanes (feat. Hayley Williams of Paramore) B.o.B atl hip hop 2010.0 2022‑02‑17 93.0 87.0 66.0 -4.0 4.0 38.0 180.0 11.0 12.0 80.0 2010.0 Solo 4
4 Nothin' on You (feat. Bruno Mars) B.o.B atl hip hop 2010.0 2022‑02‑17 104.0 85.0 69.0 -6.0 9.0 74.0 268.0 39.0 5.0 79.0 2010.0 Solo 5
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
995 SICKO MODE Travis Scott hip hop 2018.0 2020‑06‑22 155.0 73.0 83.0 -4.0 12.0 45.0 313.0 1.0 22.0 86.0 2019.0 Solo 96
996 EARFQUAKE Tyler, The Creator hip hop 2019.0 2020‑06‑22 80.0 50.0 55.0 -9.0 80.0 41.0 190.0 23.0 7.0 85.0 2019.0 Solo 97
997 Boasty (feat. Idris Elba) Wiley grime 2019.0 2020‑06‑22 103.0 77.0 89.0 -5.0 9.0 46.0 177.0 1.0 7.0 68.0 2019.0 Solo 98
998 Strike a Pose (feat. Aitch) Young T & Bugsey afroswing 2019.0 2020‑08‑20 138.0 58.0 53.0 -6.0 10.0 59.0 214.0 1.0 10.0 67.0 2019.0 Duo 99
999 The London (feat. J. Cole & Travis Scott) Young Thug atl hip hop 2019.0 2020‑06‑22 98.0 59.0 80.0 -7.0 13.0 18.0 200.0 2.0 15.0 75.0 2019.0 Solo 100

1000 rows × 18 columns

Question9

년도별 nrgy값의 평균값을 구할때 최대 평균값과 최소 평균값의 차이를 구하여라

Hide code cell source
m = df.groupby(['top year']).nrgy.mean().sort_values().values
result = m[-1] - m[0]
print(result)
13.860000000000007

Question10

artist중 artist type 타입을 여러개 가지고 있는 artist는 누구인가

Hide code cell source
result = df[['artist','artist type']].value_counts().reset_index().artist.value_counts().index[0]
print(result)
Rudimental

작업 2유형#

Attention

데이터 설명 : 센서데이터로 동작 유형 분류 (종속변수 pose : 0 ,1 구분)
x_train: https://raw.githubusercontent.com/Datamanim/datarepo/main/muscle/x_train.csv
y_train: https://raw.githubusercontent.com/Datamanim/datarepo/main/muscle/y_train.csv
x_test: https://raw.githubusercontent.com/Datamanim/datarepo/main/muscle/x_test.csv
출처(참고, 데이터 수정)

import pandas as pd
#데이터 로드
x_train = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/muscle/x_train.csv")
y_train = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/muscle/y_train.csv")
test= pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/muscle/x_test.csv")


display(x_train.head())
display(y_train.head())
ID motion_0 motion_1 motion_2 motion_3 motion_4 motion_5 motion_6 motion_7 motion_8 ... motion_54 motion_55 motion_56 motion_57 motion_58 motion_59 motion_60 motion_61 motion_62 motion_63
0 0 1.0 -2.0 -1.0 4.0 -5.0 -4.0 1.0 0.0 -15.0 ... 0.0 -1.0 -13.0 -3.0 1.0 -1.0 -32.0 -22.0 -2.0 -3.0
1 2 20.0 0.0 0.0 1.0 5.0 6.0 -52.0 18.0 15.0 ... -70.0 -55.0 -38.0 -14.0 -12.0 -8.0 -34.0 -63.0 -87.0 -77.0
2 4 1.0 -1.0 1.0 4.0 -5.0 -8.0 1.0 -3.0 -14.0 ... 1.0 12.0 -25.0 0.0 0.0 3.0 2.0 -27.0 1.0 0.0
3 5 13.0 2.0 1.0 -3.0 1.0 3.0 28.0 3.0 12.0 ... 0.0 -21.0 -17.0 -2.0 0.0 -4.0 -17.0 -21.0 -21.0 25.0
4 6 -2.0 -7.0 -4.0 -8.0 16.0 44.0 1.0 3.0 -16.0 ... -1.0 2.0 -1.0 1.0 4.0 4.0 -17.0 -38.0 -3.0 3.0

5 rows × 65 columns

ID pose
0 0 1
1 2 0
2 4 1
3 5 0
4 6 1
Hide code cell source
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.metrics import classification_report

x = x_train.drop(columns = ['ID'])

test_drop = test.drop(columns = ['ID'])



sc = StandardScaler()
sc.fit(x)


xs = sc.transform(x)
x_test_scaler = sc.transform(test_drop)

X_train, X_test, y_train, y_test = train_test_split(xs, y_train['pose'], test_size=0.33, random_state=42)

lr = LogisticRegression()
lr.fit(X_train,y_train)

pred = lr.predict_proba(X_test)
print('validation_auc : ',roc_auc_score(y_test,pred[:,1]))



# # 아래 코드 예측변수와 수험번호를 개인별로 변경하여 활용
# # pd.DataFrame({'id': test.id, 'stroke': pred}).to_csv('003000000.csv', index=False)
pd.DataFrame({'id': test.ID, 'pose': lr.predict_proba(x_test_scaler)[:,1]}).to_csv('003000000.csv', index=False)
validation_auc :  0.5633199559973924