Round 20

Attention

See other people's code uploaded to Kaggle
Please report problem errors and code errors in the comments

Attention

Problem 1
Weather temperature prediction; dependent variable: actual (daily maximum temperature)
Data source : https://towardsdatascience.com/random-forest-in-python-24d0893d51c0
Data path : /kaggle/input/adp-kr-p2/problem1.csv
temp_1 : maximum temperature one day earlier
temp_2 : maximum temperature two days earlier
friend : temperature predicted by a friend

Problem 1-1

Data inspection and preprocessing

  • Perform EDA on the data

  • Check for missing values and discuss how to handle them

  • Explain how the data is split

  • Argue that the final dataset is appropriate

import pandas as pd 
import seaborn as sns
import matplotlib.pyplot as plt
df =pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/adp/p2/problem1.csv')
print(df.head())  # first 5 rows
print(df.shape)  # dataset dimensions
display(sns.pairplot(df)) # pairwise relationships between variables
print(df.info()) # data type of each column
print(df.describe()) # summary statistics
print(df.isnull().sum()) # missing-value check


df['date'] =df['year'].astype('str')+'-'+df['month'].astype('str')+'-'+df['day'].astype('str')
df['date'] = pd.to_datetime(df['date'])

v = pd.DataFrame(pd.date_range(start=df['date'].dt.strftime('%Y-%m-%d').min(), end=df['date'].dt.strftime('%Y-%m-%d').max()))[0].dt.strftime('%Y-%m-%d').values
a=set(v) - set(df['date'].dt.strftime('%Y-%m-%d'))
print(a)
len(a)

display(df.corr(numeric_only=True))  # correlate numeric columns only (skips week and date)

# data split
dfd = pd.get_dummies(df)
df_drop = dfd.drop(columns=['year','month','day','friend','date'])

X = df_drop.drop(columns=['actual'])
y = df_drop['actual']

from sklearn.model_selection import train_test_split

X_train,X_test , y_train,y_test = train_test_split(X,y,random_state=2,test_size=0.2)


plt.show()
print('''
Answer
There are no missing numeric values in the data. Viewed as a time series, however, 18 dates are missing.
Because this solution does not take a time-series approach, the missing dates are left untreated.
Under a time-series interpretation, the missing rows could be filled by mean interpolation.

The visualization shows several correlated columns as well as columns with a periodic trend.

The year, month, day, and date columns are dropped as unnecessary. The week column is one-hot encoded and kept as a feature.
The train and test sets are split 8:2 for modeling.
The friend column has a relatively low correlation with the target, so it is excluded from training.
''')
   year  month  day  week  temp_2  temp_1  average  actual  forecast_noaa  \
0  2016      1    1   Fri      45      45     45.6      45             43   
1  2016      1    2   Sat      44      45     45.7      44             41   
2  2016      1    3   Sun      45      44     45.8      41             43   
3  2016      1    4   Mon      44      41     45.9      40             44   
4  2016      1    5  Tues      41      40     46.0      44             46   

   forecast_acc  forecast_under  friend  
0            50              44      29  
1            50              44      61  
2            46              47      56  
3            48              46      53  
4            46              46      41  
(348, 12)
<seaborn.axisgrid.PairGrid at 0x7f93869a7820>
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 348 entries, 0 to 347
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   year            348 non-null    int64  
 1   month           348 non-null    int64  
 2   day             348 non-null    int64  
 3   week            348 non-null    object 
 4   temp_2          348 non-null    int64  
 5   temp_1          348 non-null    int64  
 6   average         348 non-null    float64
 7   actual          348 non-null    int64  
 8   forecast_noaa   348 non-null    int64  
 9   forecast_acc    348 non-null    int64  
 10  forecast_under  348 non-null    int64  
 11  friend          348 non-null    int64  
dtypes: float64(1), int64(10), object(1)
memory usage: 32.8+ KB
None
         year       month         day      temp_2      temp_1     average  \
count   348.0  348.000000  348.000000  348.000000  348.000000  348.000000   
mean   2016.0    6.477011   15.514368   62.652299   62.701149   59.760632   
std       0.0    3.498380    8.772982   12.165398   12.120542   10.527306   
min    2016.0    1.000000    1.000000   35.000000   35.000000   45.100000   
25%    2016.0    3.000000    8.000000   54.000000   54.000000   49.975000   
50%    2016.0    6.000000   15.000000   62.500000   62.500000   58.200000   
75%    2016.0   10.000000   23.000000   71.000000   71.000000   69.025000   
max    2016.0   12.000000   31.000000  117.000000  117.000000   77.400000   

           actual  forecast_noaa  forecast_acc  forecast_under      friend  
count  348.000000     348.000000    348.000000      348.000000  348.000000  
mean    62.543103      57.238506     62.373563       59.772989   60.034483  
std     11.794146      10.605746     10.549381       10.705256   15.626179  
min     35.000000      41.000000     46.000000       44.000000   28.000000  
25%     54.000000      48.000000     53.000000       50.000000   47.750000  
50%     62.500000      56.000000     61.000000       58.000000   60.000000  
75%     71.000000      66.000000     72.000000       69.000000   71.000000  
max     92.000000      77.000000     82.000000       79.000000   95.000000  
year              0
month             0
day               0
week              0
temp_2            0
temp_1            0
average           0
actual            0
forecast_noaa     0
forecast_acc      0
forecast_under    0
friend            0
dtype: int64
{'2016-08-25', '2016-02-14', '2016-08-26', '2016-02-13', '2016-08-29', '2016-08-21', '2016-08-18', '2016-08-27', '2016-09-02', '2016-08-24', '2016-10-30', '2016-08-17', '2016-02-29', '2016-09-01', '2016-08-19', '2016-08-31', '2016-08-22', '2016-08-20'}
year month day temp_2 temp_1 average actual forecast_noaa forecast_acc forecast_under friend
year NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
month NaN 1.000000 -0.000412 0.047651 0.032664 0.120806 0.004529 0.131141 0.127436 0.119786 0.048145
day NaN -0.000412 1.000000 -0.046194 -0.000691 -0.021136 -0.021675 -0.021393 -0.030605 -0.013727 0.024592
temp_2 NaN 0.047651 -0.046194 1.000000 0.857800 0.821560 0.805835 0.813134 0.817374 0.819576 0.583758
temp_1 NaN 0.032664 -0.000691 0.857800 1.000000 0.819328 0.877880 0.810672 0.815162 0.815943 0.541282
average NaN 0.120806 -0.021136 0.821560 0.819328 1.000000 0.848365 0.990340 0.990705 0.994373 0.689278
actual NaN 0.004529 -0.021675 0.805835 0.877880 0.848365 1.000000 0.838639 0.842135 0.838946 0.569145
forecast_noaa NaN 0.131141 -0.021393 0.813134 0.810672 0.990340 0.838639 1.000000 0.979863 0.985670 0.669221
forecast_acc NaN 0.127436 -0.030605 0.817374 0.815162 0.990705 0.842135 0.979863 1.000000 0.983910 0.696054
forecast_under NaN 0.119786 -0.013727 0.819576 0.815943 0.994373 0.838946 0.985670 0.983910 1.000000 0.691177
friend NaN 0.048145 0.024592 0.583758 0.541282 0.689278 0.569145 0.669221 0.696054 0.691177 1.000000
../../../_images/p2_5_5.png
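
The missing-date check and the interpolation option mentioned in the answer can be sketched on a small example. The toy frame below is hypothetical (it is not the exam data), and linear interpolation is used as one plausible reading of "mean interpolation"; for a single missing day it equals the mean of the two neighbouring values.

```python
import pandas as pd

# Hypothetical toy data: daily maximum temperatures with three dates left out.
toy = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2016-01-01", "2016-01-02", "2016-01-04", "2016-01-07"]
        ),
        "actual": [45.0, 44.0, 40.0, 46.0],
    }
)

# Every calendar day between the first and last observation.
full_range = pd.date_range(toy["date"].min(), toy["date"].max(), freq="D")

# Dates that are in the full calendar but absent from the data.
missing_dates = full_range.difference(pd.DatetimeIndex(toy["date"]))
print(list(missing_dates.strftime("%Y-%m-%d")))

# Time-series treatment: reindex onto the full calendar, then fill the gaps
# by linear interpolation between the neighbouring observations.
filled = toy.set_index("date").reindex(full_range).interpolate(method="linear")
print(filled)
```

Reindexing first makes the gaps explicit as NaN rows, so the same frame could also be passed to other fill strategies if linear interpolation is not appropriate.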

Problem 1-2

Random Forest model fitting and validation

  • Train a Random Forest and interpret the prediction results

  • Test and interpret the prediction results; identify the important variables

  • Analyze variable importance and plot it

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
import time
import matplotlib.pyplot as plt

result = []
rf = RandomForestRegressor(random_state=22)
start = time.time()
rf.fit(X_train,y_train)
end = time.time()

pred = rf.predict(X_test)
print('RandomForest r2_score : ',r2_score(y_test,pred))
print('learning time ',end-start)
importances = rf.feature_importances_
forest_importances = pd.Series(importances, index=X_train.columns)

fig, ax = plt.subplots()
forest_importances.plot.bar( ax=ax)
ax.set_title("Feature importances")
fig.tight_layout()

print('Feature importance ranks temp_1, average, and forecast_acc highest, in that order')

result.append([end-start,r2_score(y_test,pred)])
RandomForest r2_score :  0.8399186619591019
learning time  0.1347370147705078
Feature importance ranks temp_1, average, and forecast_acc highest, in that order
../../../_images/p2_7_1.png
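
Reading a ranking off a bar chart is easier when the importances are sorted first. The following is a minimal sketch on synthetic data; `make_regression`, the feature names `f0` to `f4`, and the variable names are assumptions for illustration and are not part of the exam data.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Hypothetical synthetic data: 5 features, only 2 of which drive the target.
X_arr, y_arr = make_regression(
    n_samples=300, n_features=5, n_informative=2, noise=5.0, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(5)])

rf_demo = RandomForestRegressor(n_estimators=100, random_state=22)
rf_demo.fit(X_demo, y_arr)

# Impurity-based importances sum to 1; sorting makes the ranking explicit.
ranked = pd.Series(
    rf_demo.feature_importances_, index=X_demo.columns
).sort_values(ascending=False)
print(ranked)
```

Calling `ranked.plot.bar()` on the sorted series would give the same chart as above with the bars in descending order.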

Problem 1-3

SVM (Support Vector Machine) model fitting and validation

  • Train an SVM and interpret the prediction results

  • Test and interpret the prediction results; identify the important variables

  • Analyze variable importance and plot it

from sklearn.svm import SVR
from sklearn.metrics import r2_score
import time
svm = SVR()

start = time.time()
svm.fit(X_train,y_train)
end = time.time()

pred = svm.predict(X_test)
print('svm r2_score : ',r2_score(y_test,pred))
print('learning time ',end-start)
print('SVR with the default RBF kernel does not expose feature importances directly. Its r2_score is lower than that of RandomForest')


result.append([end-start,r2_score(y_test,pred)])
svm r2_score :  0.8138036782618503
learning time  0.008105278015136719
SVR with the default RBF kernel does not expose feature importances directly. Its r2_score is lower than that of RandomForest
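
Although SVR has no built-in importance attribute for the RBF kernel, a model-agnostic estimate is still possible. The sketch below uses scikit-learn's `permutation_importance` on synthetic data; the data, the feature names, and the added `StandardScaler` step are assumptions for illustration and differ from the solution above, which fits `SVR()` on unscaled features.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Hypothetical synthetic data standing in for the weather features.
X_arr, y_arr = make_regression(
    n_samples=300, n_features=4, n_informative=2, noise=5.0, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=["f0", "f1", "f2", "f3"])
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_arr, test_size=0.2, random_state=2
)

# Scaling is an added assumption: SVR is sensitive to feature scale.
svr_demo = make_pipeline(StandardScaler(), SVR())
svr_demo.fit(X_tr, y_tr)

# Shuffle one column at a time and measure how much the test score drops.
perm = permutation_importance(svr_demo, X_te, y_te, n_repeats=10, random_state=0)
perm_ranked = pd.Series(
    perm.importances_mean, index=X_demo.columns
).sort_values(ascending=False)
print(perm_ranked)
```

Because the same function works for any fitted estimator, it could also be applied to the random forest to compare both models on one importance scale.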

Problem 1-4

Model comparison and future improvements

  • Compare the Random Forest and SVM results and select the final model

  • Analyze the strengths and weaknesses of both models; which one would you choose from an operational standpoint?

  • Propose future improvements to the modeling

result_df = pd.DataFrame(result,columns = ['learning time','r2_score'])
result_df.index = ['RandomForest','Svm']
display(result_df)

print('''
With default models (no parameter tuning), the random forest takes longer to train than the SVM.
The r2 score on the test set is higher for the random forest.
If training time is the priority, the SVM is the better choice. However, the random forest exposes feature importances and is more accurate,
so the random forest is selected as the final model.
''')
              learning time  r2_score
RandomForest       0.134737  0.839919
Svm                0.008105  0.813804
With default models (no parameter tuning), the random forest takes longer to train than the SVM.
The r2 score on the test set is higher for the random forest.
If training time is the priority, the SVM is the better choice. However, the random forest exposes feature importances and is more accurate,
so the random forest is selected as the final model.
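
Since both models above use default parameters, one possible improvement is hyperparameter tuning with cross-validation. The following is a rough sketch on synthetic data; the parameter grid, the fold count, and the data are assumptions chosen only to keep the example small.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Hypothetical synthetic data standing in for the training set.
X_demo, y_demo = make_regression(
    n_samples=200, n_features=6, noise=10.0, random_state=0
)

# A deliberately small grid; a real search would cover more values.
param_grid = {"n_estimators": [50, 100], "max_depth": [None, 5]}

search = GridSearchCV(
    RandomForestRegressor(random_state=22),
    param_grid,
    scoring="r2",
    cv=3,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))
```

The cross-validated score is also a steadier basis for comparing the two models than a single 8:2 split on 348 rows.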

Attention

Problem 2
Household power consumption data recorded at 5-minute intervals
Data source : self-generated
Data path : https://raw.githubusercontent.com/Datamanim/datarepo/main/adp/p2/problem2.csv

import matplotlib.pyplot as plt
ttt= pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/adp/p2/problem2.csv')

Problem 2-1

Data preprocessing
Compute each household's total power consumption per 15-minute interval, cluster the resulting data into 5 clusters, and print the result in the form shown in the figure below.
Explain why the data was structured this way for clustering.
(The values in the Cluster column may differ depending on the clustering method.)

image

tt = ttt.sort_values(['houseCode','date']).reset_index(drop=True)
tt['date'] = pd.to_datetime(tt['date'])
tg = tt.groupby(['houseCode']).resample('15min', on='date')['power consumption'].sum().reset_index()
tg = tg.rename(columns= {'power consumption':'power consumption sum'})
tgg = tg.copy()

tgg['c'] =tgg['houseCode'].str[-2:].astype('int')  # household number taken from the house code
tgg['d'] =tgg['date'].dt.hour  # hour of day
tgg['e'] =tgg['date'].dt.day  # day of month

from sklearn.cluster import KMeans 

# run k-means clustering
kmeans = KMeans(n_clusters=5)
kmeans.fit(tgg.iloc[:,2:].values)

tg['Cluster'] =kmeans.labels_

tg
houseCode date power consumption sum Cluster
0 house_00 2050-01-01 00:00:00 136.249952 4
1 house_00 2050-01-01 00:15:00 98.283387 4
2 house_00 2050-01-01 00:30:00 53.967679 4
3 house_00 2050-01-01 00:45:00 204.821270 1
4 house_00 2050-01-01 01:00:00 150.760786 1
... ... ... ... ...
133915 house_44 2050-01-31 22:45:00 334.675717 0
133916 house_44 2050-01-31 23:00:00 463.419892 3
133917 house_44 2050-01-31 23:15:00 369.930740 0
133918 house_44 2050-01-31 23:30:00 237.713030 2
133919 house_44 2050-01-31 23:45:00 184.888439 1

133920 rows × 4 columns
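
The 15-minute aggregation step can be checked by hand on a tiny example. The toy frame below is hypothetical: two households with one hour of 5-minute readings each, using the same column names as the exam data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two households, twelve 5-minute readings each.
times = pd.date_range("2050-01-01 00:00", periods=12, freq="5min")
toy = pd.DataFrame(
    {
        "houseCode": ["house_00"] * 12 + ["house_01"] * 12,
        "date": list(times) * 2,
        "power consumption": np.arange(24, dtype=float),
    }
)

# Per household, sum every three consecutive 5-minute readings
# into one 15-minute bin.
agg = (
    toy.groupby("houseCode")
    .resample("15min", on="date")["power consumption"]
    .sum()
    .reset_index()
)
print(agg)
```

Each household contributes four 15-minute rows here, so the row count of the aggregated frame is one third of the input, which is a quick sanity check to run on the real data as well.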

Problem 2-2

Heatmap visualization
Using the data from 2-1, compute each cluster's total power consumption by day of the week and 15-minute interval, then visualize it as shown below.
(The values may not match exactly; the main check is whether the 2-1 data was correctly converted into an image like the one below.)

image

import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

tg['day'] = tg.date.dt.day_name()

tg['min'] = tg.date.dt.strftime('%H:%M')

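# sum only the numeric target column; summing houseCode/date fails on recent pandas
pv = tg.groupby(['Cluster','day','min'],as_index=False)['power consumption sum'].sum()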
for v in range(5):
    plt.figure(figsize=(20,3))
    target = pv.loc[pv.Cluster==v]
    pvt = target.pivot(index='day',columns='min',values='power consumption sum').reindex(['Sunday','Saturday','Friday','Thursday','Wednesday','Tuesday','Monday'])
    plt.pcolor(pvt)
    plt.title('Cluster'+str(v))
    plt.xticks(range(len(pvt.columns)),pvt.columns,rotation=90)
    plt.yticks(np.arange(len(pvt.index))+0.5,pvt.index)
../../../_images/p2_19_0.png ../../../_images/p2_19_1.png ../../../_images/p2_19_2.png ../../../_images/p2_19_3.png ../../../_images/p2_19_4.png
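
The group-then-pivot step that feeds each heatmap can also be written as a single `pivot_table` call. This is a minimal sketch on hypothetical data (one week of 15-minute totals for a single cluster, every value set to 1) so that the resulting grid is easy to verify.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one week of 15-minute totals for a single cluster.
times = pd.date_range("2050-01-03 00:00", periods=7 * 96, freq="15min")
toy = pd.DataFrame(
    {"date": times, "power consumption sum": np.ones(len(times))}
)
toy["day"] = toy["date"].dt.day_name()
toy["min"] = toy["date"].dt.strftime("%H:%M")

day_order = [
    "Monday", "Tuesday", "Wednesday", "Thursday",
    "Friday", "Saturday", "Sunday",
]

# Rows are weekdays, columns are 15-minute slots, cells are summed consumption.
grid = toy.pivot_table(
    index="day",
    columns="min",
    values="power consumption sum",
    aggfunc="sum",
).reindex(day_order)
print(grid.shape)
```

A grid with 7 rows and 96 columns confirms that every weekday and every 15-minute slot is present before the array is passed to `plt.pcolor`.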

Attention

Problem 3
Solar power data
Data source : https://www.kaggle.com/cheedcheed/california-renewable-production-20102018
Data path : https://raw.githubusercontent.com/Datamanim/datarepo/main/adp/p2/problem3.csv
Target variable : SOLAR PV

import pandas as pd
df= pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/adp/p2/problem3.csv')
df.head()
             TIMESTAMP  BIOGAS  BIOMASS  GEOTHERMAL  Hour  SMALL HYDRO  SOLAR  SOLAR PV  SOLAR THERMAL  WIND TOTAL
0  2012-11-26 00:00:00   208.0    354.0       926.0   1.0        208.0    NaN       0.0            0.0        57.0
1  2012-11-26 01:00:00   207.0    354.0       927.0   2.0        207.0    NaN       0.0            0.0        76.0
2  2012-11-26 02:00:00   208.0    353.0       927.0   3.0        208.0    NaN       0.0            0.0       100.0
3  2012-11-26 03:00:00   208.0    350.0       927.0   4.0        209.0    NaN       0.0            0.0       111.0
4  2012-11-26 04:00:00   209.0    352.0       927.0   5.0        209.0    NaN       0.0            0.0       131.0

Problem 3-1

Dataset split and result validation

  • Split the dataset 7:3

  • Preprocess the data and build a prediction model

  • Validate model performance with RMSE, R-squared, and accuracy (computed as described below)

  • For accuracy, use (1 - predicted/actual) when actual > predicted and (1 - actual/predicted) when actual < predicted; average these values and subtract the average from 1.
    When the denominator of the fraction is 0, treat the accuracy as 0.5.

  • Final submission : round to 3 decimal places
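
The accuracy rule above can be sketched as a small vectorized function. The name `ratio_accuracy` is hypothetical, and the sketch rests on two assumptions: values are non-negative, and "denominator is 0" is read literally, which happens only when the larger of the two values is 0. Other readings of the zero rule are possible.

```python
import numpy as np


def ratio_accuracy(y_true, y_pred):
    """Return 1 - mean(per-sample error), rounded to 3 decimals.

    Per-sample error is 1 - smaller/larger; it is set to 0.5 when the
    denominator (the larger value) is 0. Assumes non-negative inputs.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)

    larger = np.maximum(y_true, y_pred)
    smaller = np.minimum(y_true, y_pred)

    # Replace zero denominators with 1 so the division is always defined;
    # those positions are overwritten with 0.5 in the next step.
    safe = np.where(larger == 0, 1.0, larger)
    err = np.where(larger == 0, 0.5, 1.0 - smaller / safe)

    return round(float(1.0 - err.mean()), 3)


# Errors are 0.5, 0.5, 0.5 (zero denominator) and 0.0, so the mean is 0.375.
print(ratio_accuracy([10, 5, 0, 8], [5, 10, 0, 8]))
```

Taking the smaller value over the larger one covers both branches of the rule at once, and equal non-zero values give an error of 0.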

from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score,accuracy_score, mean_squared_error
from sklearn.ensemble import RandomForestRegressor


df = df.drop(columns =['SOLAR'])
def suntimeChecker(x):
    if pd.to_datetime(x).hour in list(range(6,18)):
        return 1
    else:
        return 0

df['TIMESTAMP'] = pd.to_datetime(df['TIMESTAMP'])
df['suntime'] = df['TIMESTAMP'].apply(suntimeChecker)

X = df.drop(columns=['TIMESTAMP','Hour','SOLAR PV'])
y= df['SOLAR PV']

X_train,X_test ,y_train,y_test = train_test_split(X,y,random_state =2 , test_size =0.3)

rf =RandomForestRegressor()
rf.fit(X_train,y_train)

pred = rf.predict(X_test)

def getEachAccuracy(y_true,y_pred):
    if y_true ==0:
        return 0.5
    if y_pred ==0:
        return 0.5
    
    if y_true > y_pred:
        return 1-(y_pred/y_true)
    else:
        return 1-(y_true/y_pred)
    
acc = []
for i,v in enumerate(y_test):
    acc.append(getEachAccuracy(v,pred[i]))
    
# Preprocessing: the timestamp column and the column containing only NaN values were dropped.
# A derived variable flags the hours with daylight (06:00 through 17:59).
# The accuracy is computed as follows.

print('RMSE',round(mean_squared_error(y_test, pred)**0.5,3))
print('r2',round(r2_score(y_test, pred),3))
print('acc',round(1 - sum(acc)/len(acc),3))  # round after subtracting to avoid float artifacts
RMSE 702.12
r2 0.914
acc 0.623
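
The daylight flag built with `apply` above can also be computed without a Python-level loop. This is a minimal sketch on a hypothetical series of 24 hourly timestamps; it assumes the same 06 to 17 hour window as `suntimeChecker`.

```python
import pandas as pd

# Hypothetical series: one full day of hourly timestamps.
ts = pd.Series(
    pd.date_range("2012-11-26 00:00", periods=24, freq=pd.Timedelta(hours=1))
)

# between() is inclusive on both ends by default, matching range(6, 18).
suntime = ts.dt.hour.between(6, 17).astype(int)
print(suntime.tolist())
```

On a frame with many rows, the vectorized form avoids calling `pd.to_datetime` once per row.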