기초 100문제#

Hits

Attention

아래 6가지 패키지 베이스로 문제 풀었습니다. 설치 후 진행해주세요 dplyr, reshape2, stringr, tidyr, lubridate, zoo

01 Getting & Knowing Data#

Question 1

데이터를 로드하라. 데이터는 \t을 기준으로 구분되어있다.

Hide code cell source
df <- read.csv('https://raw.githubusercontent.com/Datamanim/pandas/main/lol.csv',sep='\t')
Hide code cell source
class(df)
'data.frame'

Question 2

데이터의 상위 5개 행을 출력하라

Hide code cell source
head(df)
A data.frame: 6 × 61
gameIdcreationTimegameDurationseasonIdwinnerfirstBloodfirstTowerfirstInhibitorfirstBaronfirstDragont2_towerKillst2_inhibitorKillst2_baronKillst2_dragonKillst2_riftHeraldKillst2_ban1t2_ban2t2_ban3t2_ban4t2_ban5
<dbl><dbl><int><int><int><int><int><int><int><int><int><int><int><int><int><int><int><int><int><int>
133260865141.504279e+121949912111150011114 67 43 16 51
232295660291.497849e+121851911110120000 11 67238 51420
333273635041.504360e+121493912111220010157238121 57 28
433268565981.504349e+121758911111100000164 18141 40 51
533300807621.504554e+122094912111130010 86 11201122 18
632874357051.501668e+122059912211260030119134154 63 31

Question 3

데이터의 행과 열의 갯수를 파악하라

Hide code cell source
dim(df)
  1. 51490
  2. 61

Question 4

전체 컬럼을 출력하라

Hide code cell source
colnames(df)
  1. 'gameId'
  2. 'creationTime'
  3. 'gameDuration'
  4. 'seasonId'
  5. 'winner'
  6. 'firstBlood'
  7. 'firstTower'
  8. 'firstInhibitor'
  9. 'firstBaron'
  10. 'firstDragon'
  11. 'firstRiftHerald'
  12. 't1_champ1id'
  13. 't1_champ1_sum1'
  14. 't1_champ1_sum2'
  15. 't1_champ2id'
  16. 't1_champ2_sum1'
  17. 't1_champ2_sum2'
  18. 't1_champ3id'
  19. 't1_champ3_sum1'
  20. 't1_champ3_sum2'
  21. 't1_champ4id'
  22. 't1_champ4_sum1'
  23. 't1_champ4_sum2'
  24. 't1_champ5id'
  25. 't1_champ5_sum1'
  26. 't1_champ5_sum2'
  27. 't1_towerKills'
  28. 't1_inhibitorKills'
  29. 't1_baronKills'
  30. 't1_dragonKills'
  31. 't1_riftHeraldKills'
  32. 't1_ban1'
  33. 't1_ban2'
  34. 't1_ban3'
  35. 't1_ban4'
  36. 't1_ban5'
  37. 't2_champ1id'
  38. 't2_champ1_sum1'
  39. 't2_champ1_sum2'
  40. 't2_champ2id'
  41. 't2_champ2_sum1'
  42. 't2_champ2_sum2'
  43. 't2_champ3id'
  44. 't2_champ3_sum1'
  45. 't2_champ3_sum2'
  46. 't2_champ4id'
  47. 't2_champ4_sum1'
  48. 't2_champ4_sum2'
  49. 't2_champ5id'
  50. 't2_champ5_sum1'
  51. 't2_champ5_sum2'
  52. 't2_towerKills'
  53. 't2_inhibitorKills'
  54. 't2_baronKills'
  55. 't2_dragonKills'
  56. 't2_riftHeraldKills'
  57. 't2_ban1'
  58. 't2_ban2'
  59. 't2_ban3'
  60. 't2_ban4'
  61. 't2_ban5'

Question 5

6번째 컬럼명을 출력하라

Hide code cell source
Ans <-names(df[c(6)])
print(Ans)
[1] "firstBlood"

Question 6

6번째 컬럼의 데이터 타입을 확인하라

Hide code cell source
Ans <- sapply(df[c(6)], class)
print(Ans)
firstBlood 
 "integer" 

Question 7

데이터셋의 인덱스 구성은 어떤가

Hide code cell source
Ans <- summary(rownames(df))
print(Ans)
   Length     Class      Mode 
    51490 character character 

Question 8

6번째 컬럼의 3번째 값은 무엇인가?

Hide code cell source
Ans <- df[3,6]
print(Ans)
# A tibble: 1 x 1
  근무인구
     <dbl>
1   96921.

Attention

제주 날씨,인구에 따른 교통량데이터 : 출처 제주 데이터 허브 DataUrl = ‘https://raw.githubusercontent.com/Datamanim/pandas/main/Jeju.csv

Question 9

데이터를 로드하라. 컬럼이 한글이기에 적절한 처리해줘야함

Hide code cell source
library(readr)
df =  read_csv("https://raw.githubusercontent.com/Datamanim/pandas/main/Jeju.csv",locale=locale(encoding="EUC-KR"),show_col_types = FALSE)
class(df)
  1. 'spec_tbl_df'
  2. 'tbl_df'
  3. 'tbl'
  4. 'data.frame'

Question 10

데이터 마지막 3개행을 출력하라

Hide code cell source
Ans<-tail(df,3)
Ans
A tibble: 3 × 13
id일자시도명읍면동명거주인구근무인구방문인구총 유동인구평균 속도평균 소요 시간평균 기온일강수량평균 풍속
<dbl><date><chr><chr><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl>
320662020-04-30제주시 도두동 28397.48 3144.895 84052.7 115595.141.05329.42120.303.0
320672020-04-30서귀포시안덕면 348037.8529106.286251129.7 628273.846.59549.18917.603.5
320682020-04-30제주시 연동 1010643.3765673.477447622.11523938.940.86327.76514.104.8

Question 11

수치형 변수를 가진 컬럼만 필터하여 데이터프레임을 만들고 상위 5행을 출력하라

Hide code cell source
# install.packages("dplyr")  # Install dplyr
library(dplyr)           # Load dplyr

Ans <- select_if(df, is.numeric)             
head(Ans)                                            
A tibble: 6 × 10
id거주인구근무인구방문인구총 유동인구평균 속도평균 소요 시간평균 기온일강수량평균 풍속
<dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl>
22448 32249.99 3418.266102709.09 138377.339.55629.1675.002.5
22449 213501.0010341.172112692.79 336535.032.90030.9005.002.5
224501212382.2296920.834541194.481850497.529.53835.6922.902.4
22451 33991.65 6034.253 72155.92 112181.830.00023.5002.902.4
22452 155036.92 9403.969150882.41 315323.341.58314.3755.102.3
22453 119524.38 5616.131 77338.34 202478.938.22228.0004.901.9

Question 12

범주형 변수를 가진 컬럼만 필터하여 데이터프레임을 만들고 상위 5행을 출력하라

Hide code cell source
# install.packages("dplyr")  # Install dplyr
library("dplyr")           # Load dplyr

Ans <- select_if(df, is.character)             
head(Ans)                                            
A tibble: 6 × 2
시도명읍면동명
<chr><chr>
제주시 도두동
제주시 외도동
제주시 이도2동
제주시 일도1동
서귀포시대천동
서귀포시서홍동

Question 13

각 컬럼의 결측치 숫자를 파악하라

Hide code cell source
Ans = colSums(is.na(df))
Ans
id
0
일자
0
시도명
0
읍면동명
0
거주인구
0
근무인구
0
방문인구
0
총 유동인구
0
평균 속도
0
평균 소요 시간
0
평균 기온
0
일강수량
0
평균 풍속
0

Question 14

각 컬럼의 데이터수, 데이터타입을 한번에 확인하라

Hide code cell source
# install.packages("psych")
library("psych")
Ans = describe(df)
Ans
Warning message in FUN(newX[, i], ...):
“min에 전달되는 인자들 중 누락이 있어 Inf를 반환합니다”
Warning message in FUN(newX[, i], ...):
“max에 전달되는 인자들 중 누락이 있어 -Inf를 반환합니다”
A psych: 13 × 13
varsnmeansdmediantrimmedmadminmaxrangeskewkurtosisse
<int><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl>
id 196212.725800e+042.777488e+03 27258.002.725800e+04 3565.6530022448.000 32068.000 9620.000 0.00000000 -1.20037422.831666e+01
일자 29621 NaN NA NA NaN NA Inf -Inf -Inf NA NA NA
시도명* 396211.599314e+004.900629e-01 2.001.624139e+00 0.00000 1.000 2.000 1.000-0.40526907 -1.83594784.996222e-03
읍면동명* 496212.134809e+011.176629e+01 22.002.143381e+01 14.82600 1.000 41.000 40.000-0.04626972 -1.16607581.199581e-01
거주인구 596213.174315e+052.982079e+05222110.462.610080e+05210521.47714 9305.5521364503.9131355198.361 1.54926880 1.70872173.040249e+03
근무인구 696213.547120e+044.038121e+04 21960.932.625729e+04 18357.90013 1407.936 263476.965 262069.029 2.52411926 6.59189624.116890e+02
방문인구 796211.958896e+051.407061e+05152805.331.688951e+05 89397.9956811538.322 723459.209 711920.887 1.60209993 1.95918241.434507e+03
총 유동인구 896215.487922e+054.608802e+05386693.474.575345e+05302944.5965422251.8102066483.8672044232.057 1.62316887 1.92852584.698703e+03
평균 속도 996214.110908e+018.758631e+00 39.644.095279e+01 11.12395 24.333 103.000 78.667 0.21585864 -0.73610918.929479e-02
평균 소요 시간1096213.721587e+011.299379e+01 34.503.655343e+01 14.08470 12.667 172.200 159.533 0.54943108 0.85561891.324725e-01
평균 기온1196211.355083e+017.745515e+00 13.401.358641e+01 9.04386 -9.600 30.400 40.000-0.03559627 -0.79609857.896601e-02
일강수량1296216.972426e+002.761726e+01 0.001.273733e+00 0.00000 0.000 587.500 587.500 9.15918621120.99583772.815597e-01
평균 풍속1396212.753171e+001.498538e+00 2.402.567136e+00 1.18608 0.000 13.333 13.333 1.67037647 4.86987401.527769e-02

Question 15

각 수치형 변수의 분포(사분위, 평균, 최대 , 최소)를 확인하라

Hide code cell source
Ans = summary(df)
Ans
       id             일자               시도명            읍면동명        
 Min.   :22448   Min.   :2018-01-01   Length:9621        Length:9621       
 1st Qu.:24853   1st Qu.:2018-08-10   Class :character   Class :character  
 Median :27258   Median :2019-03-23   Mode  :character   Mode  :character  
 Mean   :27258   Mean   :2019-03-13                                        
 3rd Qu.:29663   3rd Qu.:2019-10-13                                        
 Max.   :32068   Max.   :2020-04-30                                        
    거주인구          근무인구         방문인구       총 유동인구     
 Min.   :   9306   Min.   :  1408   Min.   : 11538   Min.   :  22252  
 1st Qu.:  95399   1st Qu.: 12074   1st Qu.: 99632   1st Qu.: 221691  
 Median : 222110   Median : 21961   Median :152805   Median : 386694  
 Mean   : 317432   Mean   : 35471   Mean   :195890   Mean   : 548792  
 3rd Qu.: 410667   3rd Qu.: 40192   3rd Qu.:236325   3rd Qu.: 640692  
 Max.   :1364504   Max.   :263477   Max.   :723459   Max.   :2066484  
   평균 속도      평균 소요 시간     평균 기온        일강수량      
 Min.   : 24.33   Min.   : 12.67   Min.   :-9.60   Min.   :  0.000  
 1st Qu.: 34.25   1st Qu.: 27.89   1st Qu.: 7.60   1st Qu.:  0.000  
 Median : 39.64   Median : 34.50   Median :13.40   Median :  0.000  
 Mean   : 41.11   Mean   : 37.22   Mean   :13.55   Mean   :  6.972  
 3rd Qu.: 49.10   3rd Qu.: 46.18   3rd Qu.:19.70   3rd Qu.:  1.500  
 Max.   :103.00   Max.   :172.20   Max.   :30.40   Max.   :587.500  
   평균 풍속     
 Min.   : 0.000  
 1st Qu.: 1.700  
 Median : 2.400  
 Mean   : 2.753  
 3rd Qu.: 3.400  
 Max.   :13.333  

Question 16

거주인구 컬럼의 값들을 출력하라

Hide code cell source
Ans = df['거주인구']
head(Ans)
A tibble: 6 × 1
거주인구
<dbl>
32249.99
213501.00
1212382.22
33991.65
155036.92
119524.38

Question 17

평균 속도 컬럼의 4분위 범위(IQR) 값을 구하여라

Hide code cell source
IQR(df$`평균 속도`)
14.855

Question 18

읍면동명 컬럼의 유일값 갯수를 출력하라

Hide code cell source
Ans <- length(unique(df$읍면동명))
print(Ans)
[1] 41

Question 19

읍면동명 컬럼의 유일값을 모두 출력하라

Hide code cell source
Ans <- unique(df$읍면동명)
print(Ans)
 [1] "도두동"  "외도동"  "이도2동" "일도1동" "대천동"  "서홍동"  "한경면" 
 [8] "송산동"  "조천읍"  "일도2동" "영천동"  "예래동"  "대륜동"  "삼도1동"
[15] "이호동"  "건입동"  "중앙동"  "삼양동"  "삼도2동" "이도1동" "남원읍" 
[22] "대정읍"  "정방동"  "효돈동"  "아라동"  "한림읍"  "구좌읍"  "용담1동"
[29] "오라동"  "화북동"  "연동"    "표선면"  "중문동"  "성산읍"  "안덕면" 
[36] "천지동"  "노형동"  "동홍동"  "용담2동" "봉개동"  "애월읍" 

02 Filtering & Sorting#

Question 20

데이터를 로드하라.

Hide code cell source
df <- read.csv('https://raw.githubusercontent.com/Datamanim/pandas/main/chipo.csv', na.strings=c(""))
Ans <- head(df)
Ans
A data.frame: 6 × 5
order_idquantityitem_namechoice_descriptionitem_price
<int><int><chr><chr><chr>
111Chips and Fresh Tomato Salsa NA $2.39
211Izze [Clementine] $3.39
311Nantucket Nectar [Apple] $3.39
411Chips and Tomatillo-Green Chili SalsaNA $2.39
522Chicken Bowl [Tomatillo-Red Chili Salsa (Hot), [Black Beans, Rice, Cheese, Sour Cream]] $16.98
631Chicken Bowl [Fresh Tomato Salsa (Mild), [Rice, Cheese, Sour Cream, Guacamole, Lettuce]]$10.98

Question 21

quantity컬럼 값이 3인 데이터를 추출하여 첫 5행을 출력하라

Hide code cell source
Ans <- subset(df,quantity ==3)
head(Ans)
A data.frame: 6 × 5
order_idquantityitem_namechoice_descriptionitem_price
<int><int><chr><chr><chr>
4101783Chicken Bowl [[Fresh Tomato Salsa (Mild), Tomatillo-Green Chili Salsa (Medium), Roasted Chili Corn Salsa (Medium)], [Black Beans, Rice, Fajita Veggies, Cheese, Guacamole, Lettuce]]$32.94
4461933Bowl [Braised Carnitas, Pinto Beans, [Sour Cream, Cheese, Cilantro-Lime Rice]] $22.20
6902843Canned Soft Drink[Diet Coke] $3.75
8193383Bottled Water NA $3.27
8513503Canned Soft Drink[Sprite] $3.75
9183793Canned Soft Drink[Lemonade] $3.75

Question 22

quantity컬럼 값이 3인 데이터를 추출하여 index를 1부터 정렬하고 첫 5행을 출력하라

Hide code cell source
Ans <- subset(df,quantity ==3)
rownames(Ans) <- 1:nrow(Ans)
head(Ans)
A data.frame: 6 × 5
order_idquantityitem_namechoice_descriptionitem_price
<int><int><chr><chr><chr>
11783Chicken Bowl [[Fresh Tomato Salsa (Mild), Tomatillo-Green Chili Salsa (Medium), Roasted Chili Corn Salsa (Medium)], [Black Beans, Rice, Fajita Veggies, Cheese, Guacamole, Lettuce]]$32.94
21933Bowl [Braised Carnitas, Pinto Beans, [Sour Cream, Cheese, Cilantro-Lime Rice]] $22.20
32843Canned Soft Drink[Diet Coke] $3.75
43383Bottled Water NA $3.27
53503Canned Soft Drink[Sprite] $3.75
63793Canned Soft Drink[Lemonade] $3.75

Question 23

quantity , item_price 두개의 컬럼으로 구성된 새로운 데이터 프레임을 정의하라

Hide code cell source
new_df <- df[,c('quantity','item_price')]
head(new_df)
A data.frame: 6 × 2
quantityitem_price
<int><chr>
11$2.39
21$3.39
31$3.39
41$2.39
52$16.98
61$10.98

Question 24

원본데이터에서 item_price 컬럼의 달러표시 문자를 제거하고 float 타입으로 저장하여 new_price 컬럼에 저장하라

Hide code cell source
df$new_price<- as.numeric(gsub("\\$","",new_df$item_price))
head(df)
A data.frame: 6 × 6
order_idquantityitem_namechoice_descriptionitem_pricenew_price
<int><int><chr><chr><chr><dbl>
111Chips and Fresh Tomato Salsa NA $2.39 2.39
211Izze [Clementine] $3.39 3.39
311Nantucket Nectar [Apple] $3.39 3.39
411Chips and Tomatillo-Green Chili SalsaNA $2.39 2.39
522Chicken Bowl [Tomatillo-Red Chili Salsa (Hot), [Black Beans, Rice, Cheese, Sour Cream]] $16.98 16.98
631Chicken Bowl [Fresh Tomato Salsa (Mild), [Rice, Cheese, Sour Cream, Guacamole, Lettuce]]$10.98 10.98

Question 25

new_price 컬럼이 5이하의 값을 가지는 데이터프레임을 추출하고, 전체 갯수를 구하여라

Hide code cell source
t <-subset(df,new_price <=5)
Ans <- nrow(t)
print(Ans)
[1] 1652

Question 26

item_name명이 Chicken Salad Bowl 인 데이터 프레임을 추출하라고 index 값을 초기화 하여라

Hide code cell source
t<-subset(df,item_name =='Chicken Salad Bowl')
rownames(t)<- 1:nrow(t)
Ans <-t

head(Ans)
A data.frame: 6 × 6
order_idquantityitem_namechoice_descriptionitem_pricenew_price
<int><int><chr><chr><chr><dbl>
1 201Chicken Salad Bowl[Fresh Tomato Salsa, [Fajita Vegetables, Pinto Beans, Lettuce]] $8.75 8.75
2 602Chicken Salad Bowl[Tomatillo Green Chili Salsa, [Sour Cream, Cheese, Guacamole]] $22.50 22.50
3 942Chicken Salad Bowl[Fresh Tomato Salsa, [Fajita Vegetables, Pinto Beans, Guacamole]] $22.50 22.50
41111Chicken Salad Bowl[Fresh Tomato Salsa, [Fajita Vegetables, Rice, Cheese, Sour Cream, Lettuce]]$8.75 8.75
51372Chicken Salad Bowl[Fresh Tomato Salsa, Fajita Vegetables] $17.50 17.50
62201Chicken Salad Bowl[Roasted Chili Corn Salsa, [Black Beans, Sour Cream, Cheese, Lettuce]] $8.75 8.75

Question 27

new_price값이 9 이하이고 item_name 값이 Chicken Salad Bowl 인 데이터 프레임을 추출하라

Hide code cell source
library(dplyr)

Ans <- df %>% filter(new_price <=9 & item_name =='Chicken Salad Bowl')
head(Ans)
A data.frame: 6 × 6
order_idquantityitem_namechoice_descriptionitem_pricenew_price
<int><int><chr><chr><chr><dbl>
1 201Chicken Salad Bowl[Fresh Tomato Salsa, [Fajita Vegetables, Pinto Beans, Lettuce]] $8.75 8.75
21111Chicken Salad Bowl[Fresh Tomato Salsa, [Fajita Vegetables, Rice, Cheese, Sour Cream, Lettuce]] $8.75 8.75
32201Chicken Salad Bowl[Roasted Chili Corn Salsa, [Black Beans, Sour Cream, Cheese, Lettuce]] $8.75 8.75
42211Chicken Salad Bowl[Tomatillo Green Chili Salsa, [Fajita Vegetables, Black Beans, Pinto Beans, Lettuce]]$8.75 8.75
52211Chicken Salad Bowl[Tomatillo Green Chili Salsa, [Fajita Vegetables, Rice, Cheese, Sour Cream, Lettuce]]$8.75 8.75
62341Chicken Salad Bowl[Fresh Tomato Salsa, Fajita Vegetables] $8.75 8.75

Question 28

df의 new_price 컬럼 값에 따라 오름차순으로 정리하고 index를 초기화 하여라

Hide code cell source
Ans<-df[order(df$new_price),]
rownames(Ans) <- 1:nrow(Ans)

head(Ans)
A data.frame: 6 × 6
order_idquantityitem_namechoice_descriptionitem_pricenew_price
<int><int><chr><chr><chr><dbl>
1141Canned Soda [Dr. Pepper] $1.09 1.09
2171Bottled WaterNA $1.09 1.09
3241Canned Soda [Sprite] $1.09 1.09
4381Bottled WaterNA $1.09 1.09
5471Canned Soda [Dr. Pepper] $1.09 1.09
6511Canned Soda [Diet Dr. Pepper]$1.09 1.09

Question 29

df의 item_name 컬럼 값중 Chips 포함하는 경우의 데이터를 출력하라

Hide code cell source
Ans<-df[grepl('Chips',df$item_name),]
head(Ans)
A data.frame: 6 × 6
order_idquantityitem_namechoice_descriptionitem_pricenew_price
<int><int><chr><chr><chr><dbl>
111Chips and Fresh Tomato Salsa NA$2.39 2.39
411Chips and Tomatillo-Green Chili SalsaNA$2.39 2.39
731Side of Chips NA$1.69 1.69
1151Chips and Guacamole NA$4.45 4.45
1571Chips and Guacamole NA$4.45 4.45
1681Chips and Tomatillo-Green Chili SalsaNA$2.39 2.39

Question 30

df의 짝수번째 컬럼만을 포함하는 데이터프레임을 출력하라

Hide code cell source
odd_col <-seq_len(ncol(df))%%2
Ans<-df[,odd_col == 0 ]
head(Ans)
A data.frame: 6 × 3
quantitychoice_descriptionnew_price
<int><chr><dbl>
11NA 2.39
21[Clementine] 3.39
31[Apple] 3.39
41NA 2.39
52[Tomatillo-Red Chili Salsa (Hot), [Black Beans, Rice, Cheese, Sour Cream]] 16.98
61[Fresh Tomato Salsa (Mild), [Rice, Cheese, Sour Cream, Guacamole, Lettuce]]10.98

Question 31

df의 new_price 컬럼 값에 따라 내림차순으로 정리하고 index를 초기화 하여라

Hide code cell source
Ans<- df[order(df$new_price,decreasing=TRUE),]
rownames(Ans) <- 1:nrow(Ans)
head(Ans)
A data.frame: 6 × 6
order_idquantityitem_namechoice_descriptionitem_pricenew_price
<int><int><chr><chr><chr><dbl>
1144315Chips and Fresh Tomato SalsaNA $44.25 44.25
21398 3Carnitas Bowl [Roasted Chili Corn Salsa, [Fajita Vegetables, Rice, Black Beans, Cheese, Sour Cream, Guacamole, Lettuce]] $35.25 35.25
3 511 4Chicken Burrito [Fresh Tomato Salsa, [Fajita Vegetables, Rice, Black Beans, Cheese, Lettuce]] $35.00 35.00
41443 4Chicken Burrito [Fresh Tomato Salsa, [Rice, Black Beans, Cheese, Sour Cream]] $35.00 35.00
51443 3Veggie Burrito [Fresh Tomato Salsa, [Fajita Vegetables, Rice, Black Beans, Cheese, Sour Cream, Guacamole]] $33.75 33.75
6 178 3Chicken Bowl [[Fresh Tomato Salsa (Mild), Tomatillo-Green Chili Salsa (Medium), Roasted Chili Corn Salsa (Medium)], [Black Beans, Rice, Fajita Veggies, Cheese, Guacamole, Lettuce]]$32.94 32.94

Question 32

df의 item_name 컬럼 값이 Steak Salad 또는 Bowl 인 데이터를 인덱싱하라

Hide code cell source
Ans<-subset(df,item_name =='Steak Salad' |item_name =='Bowl')
head(Ans)
A data.frame: 6 × 6
order_idquantityitem_namechoice_descriptionitem_pricenew_price
<int><int><chr><chr><chr><dbl>
446 1933Bowl [Braised Carnitas, Pinto Beans, [Sour Cream, Cheese, Cilantro-Lime Rice]] $22.20 22.20
665 2761Steak Salad[Tomatillo-Red Chili Salsa (Hot), [Black Beans, Rice, Fajita Veggies, Cheese, Lettuce]] $8.99 8.99
674 2791Bowl [Adobo-Marinated and Grilled Steak, [Sour Cream, Salsa, Cheese, Cilantro-Lime Rice, Guacamole]]$7.40 7.40
753 3111Steak Salad[Tomatillo-Red Chili Salsa (Hot), [Black Beans, Rice, Fajita Veggies, Cheese, Lettuce]] $8.99 8.99
894 3691Steak Salad[Fresh Tomato Salsa (Mild), [Rice, Cheese, Sour Cream, Lettuce]] $8.99 8.99
350314061Steak Salad[[Lettuce, Fajita Veggies]] $8.69 8.69

Question 33

df의 item_name 컬럼 값이 Steak Salad 또는 Bowl 인 데이터를 데이터 프레임화 한 후, item_name를 기준으로 중복행이 있으면 제거하되 첫번째 케이스만 남겨라

Hide code cell source
sub<-subset(df,item_name =='Steak Salad' |item_name =='Bowl')
Ans<-distinct(sub,item_name,.keep_all =TRUE)
Ans
A data.frame: 2 × 6
order_idquantityitem_namechoice_descriptionitem_pricenew_price
<int><int><chr><chr><chr><dbl>
1933Bowl [Braised Carnitas, Pinto Beans, [Sour Cream, Cheese, Cilantro-Lime Rice]] $22.20 22.20
2761Steak Salad[Tomatillo-Red Chili Salsa (Hot), [Black Beans, Rice, Fajita Veggies, Cheese, Lettuce]]$8.99 8.99

Question 34

df의 item_name 컬럼 값이 Steak Salad 또는 Bowl 인 데이터를 데이터 프레임화 한 후, item_name를 기준으로 중복행이 있으면 제거하되 마지막 케이스만 남겨라

Hide code cell source
sub<-subset(df,item_name =='Steak Salad' |item_name =='Bowl')
sub <- sub[order(-sub$order_id),]
Ans<- sub[!duplicated(sub$item_name),]
Ans
A data.frame: 2 × 6
order_idquantityitem_namechoice_descriptionitem_pricenew_price
<int><int><chr><chr><chr><dbl>
350314061Steak Salad[[Lettuce, Fajita Veggies]] $8.69 8.69
674 2791Bowl [Adobo-Marinated and Grilled Steak, [Sour Cream, Salsa, Cheese, Cilantro-Lime Rice, Guacamole]]$7.40 7.40

Question 35

df의 데이터 중 new_price값이 new_price값의 평균값 이상을 가지는 데이터들을 인덱싱하라 (dplyr 패키지의 filter 이용)

Hide code cell source
library(dplyr)

Ans<- filter(df,new_price >=mean(df$new_price))
head(Ans)
A data.frame: 6 × 6
order_idquantityitem_namechoice_descriptionitem_pricenew_price
<int><int><chr><chr><chr><dbl>
122Chicken Bowl [Tomatillo-Red Chili Salsa (Hot), [Black Beans, Rice, Cheese, Sour Cream]] $16.98 16.98
231Chicken Bowl [Fresh Tomato Salsa (Mild), [Rice, Cheese, Sour Cream, Guacamole, Lettuce]] $10.98 10.98
341Steak Burrito [Tomatillo Red Chili Salsa, [Fajita Vegetables, Black Beans, Pinto Beans, Cheese, Sour Cream, Guacamole, Lettuce]]$11.75 11.75
441Steak Soft Tacos [Tomatillo Green Chili Salsa, [Pinto Beans, Cheese, Sour Cream, Lettuce]] $9.25 9.25
551Steak Burrito [Fresh Tomato Salsa, [Rice, Black Beans, Pinto Beans, Cheese, Sour Cream, Lettuce]] $9.25 9.25
661Chicken Crispy Tacos[Roasted Chili Corn Salsa, [Fajita Vegetables, Rice, Black Beans, Cheese, Sour Cream]] $8.75 8.75

Question 36

df의 데이터 중 item_name의 값이 Izze 데이터를 Fizzy Lizzy로 수정하라

Hide code cell source
# install.packages("stringr")
library(stringr)

df$item_name<- str_replace(df$item_name,'Izze','Fizzy Lizzy')
head(df)
A data.frame: 6 × 6
order_idquantityitem_namechoice_descriptionitem_pricenew_price
<int><int><chr><chr><chr><dbl>
111Chips and Fresh Tomato Salsa NA $2.39 2.39
211Fizzy Lizzy [Clementine] $3.39 3.39
311Nantucket Nectar [Apple] $3.39 3.39
411Chips and Tomatillo-Green Chili SalsaNA $2.39 2.39
522Chicken Bowl [Tomatillo-Red Chili Salsa (Hot), [Black Beans, Rice, Cheese, Sour Cream]] $16.98 16.98
631Chicken Bowl [Fresh Tomato Salsa (Mild), [Rice, Cheese, Sour Cream, Guacamole, Lettuce]]$10.98 10.98

Question 37

df의 데이터 중 choice_description 값이 NaN 인 데이터의 갯수를 구하여라

Hide code cell source
Ans<-sum(is.na(df$choice_description))
Ans
1246

Question 38

df의 데이터 중 choice_description 값이 NaN 인 데이터를 NoData 값으로 대체하라

Hide code cell source
df$choice_description[is.na(df$choice_description)] = 'NoData'
head(df)
A data.frame: 6 × 6
order_idquantityitem_namechoice_descriptionitem_pricenew_price
<int><int><chr><chr><chr><dbl>
111Chips and Fresh Tomato Salsa NoData $2.39 2.39
211Fizzy Lizzy [Clementine] $3.39 3.39
311Nantucket Nectar [Apple] $3.39 3.39
411Chips and Tomatillo-Green Chili SalsaNoData $2.39 2.39
522Chicken Bowl [Tomatillo-Red Chili Salsa (Hot), [Black Beans, Rice, Cheese, Sour Cream]] $16.98 16.98
631Chicken Bowl [Fresh Tomato Salsa (Mild), [Rice, Cheese, Sour Cream, Guacamole, Lettuce]]$10.98 10.98

Question 39

df의 데이터 중 choice_description 값에 Black이 들어가는 경우를 인덱싱하라

Hide code cell source
Ans<-df[grepl('Black',df$choice_description),]
head(Ans)
A data.frame: 6 × 6
order_idquantityitem_namechoice_descriptionitem_pricenew_price
<int><int><chr><chr><chr><dbl>
522Chicken Bowl [Tomatillo-Red Chili Salsa (Hot), [Black Beans, Rice, Cheese, Sour Cream]] $16.98 16.98
841Steak Burrito [Tomatillo Red Chili Salsa, [Fajita Vegetables, Black Beans, Pinto Beans, Cheese, Sour Cream, Guacamole, Lettuce]]$11.75 11.75
1051Steak Burrito [Fresh Tomato Salsa, [Rice, Black Beans, Pinto Beans, Cheese, Sour Cream, Lettuce]] $9.25 9.25
1261Chicken Crispy Tacos[Roasted Chili Corn Salsa, [Fajita Vegetables, Rice, Black Beans, Cheese, Sour Cream]] $8.75 8.75
1361Chicken Soft Tacos [Roasted Chili Corn Salsa, [Rice, Black Beans, Cheese, Sour Cream]] $8.75 8.75
1891Chicken Burrito [Fresh Tomato Salsa (Mild), [Black Beans, Rice, Cheese, Sour Cream, Lettuce]] $8.49 8.49

Question 40

df의 데이터 중 choice_description 값에 Vegetables 들어가지 않는 경우의 갯수를 출력하라

Hide code cell source
sub<-df[!grepl('Vegetables',df$choice_description),]
Ans<-nrow(sub)
print(Ans)
[1] 3900

Question 41

df의 데이터 중 item_name 값이 N으로 시작하는 데이터를 모두 추출하라

Hide code cell source
Ans<- df[grepl('^N',df$item_name),] ## 정규표현식
head(Ans)
A data.frame: 6 × 6
order_idquantityitem_namechoice_descriptionitem_pricenew_price
<int><int><chr><chr><chr><dbl>
3 11Nantucket Nectar[Apple] $3.39 3.39
23 111Nantucket Nectar[Pomegranate Cherry] $3.39 3.39
106 461Nantucket Nectar[Pineapple Orange Banana]$3.39 3.39
174 771Nantucket Nectar[Apple] $3.39 3.39
206 911Nantucket Nectar[Peach Orange] $3.39 3.39
4371891Nantucket Nectar[Pomegranate Cherry] $3.39 3.39

Question 42

df의 데이터 중 item_name 값의 단어갯수가 15개 이상인 데이터를 인덱싱하라

Hide code cell source
Ans<- subset(df,nchar(df$item_name)>=15)
head(Ans)
A data.frame: 6 × 6
order_idquantityitem_namechoice_descriptionitem_pricenew_price
<int><int><chr><chr><chr><dbl>
111Chips and Fresh Tomato Salsa NoData $2.39 2.39
311Nantucket Nectar [Apple] $3.39 3.39
411Chips and Tomatillo-Green Chili SalsaNoData $2.39 2.39
941Steak Soft Tacos [Tomatillo Green Chili Salsa, [Pinto Beans, Cheese, Sour Cream, Lettuce]] $9.25 9.25
1151Chips and Guacamole NoData $4.45 4.45
1261Chicken Crispy Tacos [Roasted Chili Corn Salsa, [Fajita Vegetables, Rice, Black Beans, Cheese, Sour Cream]]$8.75 8.75

Question 43

df의 데이터 중 new_price값이 아래 lst에 해당하는 경우의 데이터 프레임을 구하고 그 갯수를 출력하라 lst =[1.69, 2.39, 3.39, 4.45, 9.25, 10.98, 11.75, 16.98]

Hide code cell source
lst <- c(1.69, 2.39, 3.39, 4.45, 9.25, 10.98, 11.75, 16.98)
Ans<-subset(df,new_price %in% lst)
head(Ans)
A data.frame: 6 × 6
order_idquantityitem_namechoice_descriptionitem_pricenew_price
<int><int><chr><chr><chr><dbl>
111Chips and Fresh Tomato Salsa NoData $2.39 2.39
211Fizzy Lizzy [Clementine] $3.39 3.39
311Nantucket Nectar [Apple] $3.39 3.39
411Chips and Tomatillo-Green Chili SalsaNoData $2.39 2.39
522Chicken Bowl [Tomatillo-Red Chili Salsa (Hot), [Black Beans, Rice, Cheese, Sour Cream]] $16.98 16.98
631Chicken Bowl [Fresh Tomato Salsa (Mild), [Rice, Cheese, Sour Cream, Guacamole, Lettuce]]$10.98 10.98

03_Grouping#

Question 44

데이터를 로드하고 상위 5개 컬럼을 출력하라

Hide code cell source
df<- read.csv('https://raw.githubusercontent.com/Datamanim/pandas/main/AB_NYC_2019.csv')
head(df)
A data.frame: 6 × 16
idnamehost_idhost_nameneighbourhood_groupneighbourhoodlatitudelongituderoom_typepriceminimum_nightsnumber_of_reviewslast_reviewreviews_per_monthcalculated_host_listings_countavailability_365
<int><chr><int><chr><chr><chr><dbl><dbl><chr><int><int><int><chr><dbl><int><int>
12539Clean & quiet apt home by the park 2787John Brooklyn Kensington 40.64749-73.97237Private room 149 1 92018-10-190.216365
22595Skylit Midtown Castle 2845Jennifer ManhattanMidtown 40.75362-73.98377Entire home/apt225 1 452019-05-210.382355
33647THE VILLAGE OF HARLEM....NEW YORK ! 4632Elisabeth ManhattanHarlem 40.80902-73.94190Private room 150 3 0 NA1365
43831Cozy Entire Floor of Brownstone 4869LisaRoxanneBrooklyn Clinton Hill40.68514-73.95976Entire home/apt 89 12702019-07-054.641194
55022Entire Apt: Spacious Studio/Loft by central park7192Laura ManhattanEast Harlem 40.79851-73.94399Entire home/apt 8010 92018-11-190.101 0
65099Large Cozy 1 BR Apartment In Midtown East 7322Chris ManhattanMurray Hill 40.74767-73.97500Entire home/apt200 3 742019-06-220.591129

Question 45

데이터의 각 host_name의 빈도수를 구하고 host_name으로 정렬하여 상위 5개를 출력하라

Hide code cell source
library(dplyr)

Ans <- head(count(df,host_name,sort=TRUE))
Ans
A data.frame: 6 × 2
host_namen
<chr><int>
1Michael 417
2David 403
3Sonder (NYC)327
4John 294
5Alex 279
6Blueground 232

Question 46

데이터의 각 host_name의 빈도수를 구하고 빈도수 기준 내림차순 정렬한 데이터 프레임을 만들어라. 빈도수 컬럼은 counts로 명명하라

Hide code cell source
library(dplyr)
Ans <- head(count(df,host_name,sort=TRUE))
colnames(Ans) <- c('host_name','counts')
head(Ans)
A data.frame: 6 × 2
host_namecounts
<chr><int>
1Michael 417
2David 403
3Sonder (NYC)327
4John 294
5Alex 279
6Blueground 232

Question 47

neighbourhood_group의 값에 따른 neighbourhood컬럼 값의 unique한 갯수를 구하여라

Hide code cell source
library(dplyr)

Ans<- df %>% 
    group_by(neighbourhood_group,neighbourhood) %>% 
    count()

head(Ans)
A grouped_df: 6 × 3
neighbourhood_groupneighbourhoodn
<chr><chr><int>
BronxAllerton 42
BronxBaychester 7
BronxBelmont 24
BronxBronxdale 19
BronxCastle Hill 9
BronxCity Island18

Question 48

neighbourhood_group의 값에 따른 neighbourhood컬럼 값들의 unique한 갯수 중 neighbourhood_group그룹을 기준으로 최댓값을 가지는 데이터들을 출력하라

Hide code cell source
Ans <-df %>% 
    group_by(neighbourhood_group,neighbourhood) %>% 
    count() %>%
    group_by(neighbourhood_group) %>% 
    slice(which.max(n))

Ans
A grouped_df: 5 × 3
neighbourhood_groupneighbourhoodn
<chr><chr><int>
Bronx Kingsbridge 70
Brooklyn Williamsburg3920
Manhattan Harlem 2658
Queens Astoria 900
Staten IslandSt. George 48

Question 49

neighbourhood_group 값에 따른 price값의 평균, 분산, 최대, 최소 값을 구하여라

Hide code cell source
Ans<-df %>% 
    group_by(neighbourhood_group) %>%
    summarize(mean = mean(price)
              ,var = var(price)
              ,std=sd(price)
              ,max=max(price)
              ,min=min(price))

Ans
A tibble: 5 × 6
neighbourhood_groupmeanvarstdmaxmin
<chr><dbl><dbl><dbl><int><int>
Bronx 87.4967911386.89106.7093 2500 0
Brooklyn 124.3832134921.72186.873510000 0
Manhattan 196.8758184904.16291.383210000 0
Queens 99.5176527923.13167.10221000010
Staten Island114.8123377073.09277.6204 500013

Question 50

neighbourhood_group 값에 따른 reviews_per_month 평균, 분산, 최대, 최소 값을 구하여라

Hide code cell source
Ans<-df %>% 
    group_by(neighbourhood_group) %>%
    summarize(mean = mean(reviews_per_month, na.rm = TRUE)
              ,var = var(reviews_per_month, na.rm = TRUE)
              ,std=sd(reviews_per_month, na.rm = TRUE)
              ,max=max(reviews_per_month, na.rm = TRUE)
              ,min=min(reviews_per_month, na.rm = TRUE))

Ans
A tibble: 5 × 6
neighbourhood_groupmeanvarstdmaxmin
<chr><dbl><dbl><dbl><dbl><dbl>
Bronx 1.8378312.7998781.67328410.340.02
Brooklyn 1.2832122.2990401.51625914.000.01
Manhattan 1.2721312.6512061.62825258.500.01
Queens 1.9412004.8978482.21310820.940.01
Staten Island1.8725802.8408951.68549510.120.02

Question 51

neighbourhood 값과 neighbourhood_group 값에 따른 price 의 평균을 구하라

Hide code cell source
Ans<- df %>% 
    group_by(neighbourhood,neighbourhood_group) %>%
    summarize(mean = mean(price),.groups = "drop_last")
head(Ans)
A grouped_df: 6 × 3
neighbourhoodneighbourhood_groupmean
<chr><chr><dbl>
Allerton Bronx 87.59524
Arden HeightsStaten Island 67.25000
Arrochar Staten Island115.00000
Arverne Queens 171.77922
Astoria Queens 117.18778
Bath Beach Brooklyn 81.76471

Question 52

neighbourhood 값과 neighbourhood_group 값에 따른 price 의 평균을 계층적 indexing 없이 구하라

Hide code cell source
# install.packages('reshape')
library(reshape)

Ans<- df %>% 
    group_by(neighbourhood,neighbourhood_group) %>%
    summarize(mean = mean(price),.groups = "drop_last")

res <-cast(data = Ans, neighbourhood ~ neighbourhood_group,value='mean')
head(res)
A cast_df: 6 × 6
neighbourhoodBronxBrooklynManhattanQueensStaten Island
<chr><dbl><dbl><dbl><dbl><dbl>
1Allerton 87.59524 NANA NA NA
2Arden Heights NA NANA NA 67.25
3Arrochar NA NANA NA115.00
4Arverne NA NANA171.7792 NA
5Astoria NA NANA117.1878 NA
6Bath Beach NA81.76471NA NA NA

Question 53

neighbourhood 값과 neighbourhood_group 값에 따른 price 의 평균을 계층적 indexing 없이 구하고 nan 값은 -999값으로 채워라

Hide code cell source
# install.packages('reshape')
library(reshape)

Ans<- df %>% 
    group_by(neighbourhood,neighbourhood_group) %>%
    summarize(mean = mean(price),.groups = "drop_last")

res <-cast(data = Ans, neighbourhood ~ neighbourhood_group,fill=-999,value='mean')
head(res)
A cast_df: 6 × 6
neighbourhoodBronxBrooklynManhattanQueensStaten Island
<chr><dbl><dbl><dbl><dbl><dbl>
1Allerton 87.59524-999.00000-999-999.0000-999.00
2Arden Heights-999.00000-999.00000-999-999.0000 67.25
3Arrochar -999.00000-999.00000-999-999.0000 115.00
4Arverne -999.00000-999.00000-999 171.7792-999.00
5Astoria -999.00000-999.00000-999 117.1878-999.00
6Bath Beach -999.00000 81.76471-999-999.0000-999.00

Question 54

데이터중 neighbourhood_group 값이 Queens값을 가지는 데이터들 중 neighbourhood 그룹별로 price값의 평균, 분산, 최대, 최소값을 구하라

Hide code cell source
library(dplyr)
Ans<-df %>% 
    filter(neighbourhood_group =='Queens') %>%
    group_by(neighbourhood) %>%
    summarize(mean = mean(price)
            ,var = var(price, na.rm = TRUE)
            ,std=sd(price, na.rm = TRUE)
            ,max=max(price, na.rm = TRUE)
            ,min=min(price, na.rm = TRUE))

head(Ans)
A tibble: 6 × 6
neighbourhoodmeanvarstdmaxmin
<chr><dbl><dbl><dbl><int><int>
Arverne 171.77922 37383.411193.34790 150035
Astoria 117.18778122428.811349.898291000025
Bay Terrace 142.00000 6816.400 82.56149 25832
Bayside 157.94872166106.471407.56162 260030
Bayswater 87.47059 2330.890 48.27929 23045
Belle Harbor171.50000 8226.571 90.70045 35085

Question 55

데이터중 neighbourhood_group 값에 따른 room_type 컬럼의 unique value의 각 갯수를 구하고 neighbourhood_group 값을 기준으로 각 값의 비율을 구하여라

Hide code cell source
library(dplyr)
ra <-df %>%
    group_by(neighbourhood_group,room_type) %>%
    count()


res <-cast(data = ra, neighbourhood_group ~ room_type,value='n')
Ans <-cbind(res[,1],prop.table(data.matrix(res[,c(2:4)]),1))

Ans
A matrix: 5 × 4 of type chr
Entire home/aptPrivate roomShared room
1Bronx 0.3473877176901920.5976168652612280.0549954170485793
2Brooklyn 0.4754775169120570.5039793076004770.0205431754874652
3Manhattan 0.6093439822722870.3684963759752550.0221596417524583
4Queens 0.3699258736321920.5951288386869040.0349452876809036
5Staten Island0.4718498659517430.50402144772118 0.0241286863270777

04_Apply , Map#

Question 56

데이터를 로드하고 데이터 행과 열의 갯수를 출력하라

Hide code cell source
df <- read.csv('https://raw.githubusercontent.com/Datamanim/pandas/main/BankChurnersUp.csv')
dim(df)
  1. 10127
  2. 19

Question 57

Income_Category의 카테고리를 stringr패키지의 str_replace 함수를 이용하여 다음과 같이 변경하여 newIncome 컬럼에 매핑하라 Unknown : N
Less than $40K : a
$40K - $60K : b
$60K - $80K : c
$80K - $120K : d
$120K +’ : e

Hide code cell source
library(stringr)

df$newIncome = str_replace(df$Income_Category,'Less than \\$40K','a')
df$newIncome = str_replace(df$newIncome,'\\$40K - \\$60K','b')
df$newIncome = str_replace(df$newIncome,'\\$60K - \\$80K','c')
df$newIncome = str_replace(df$newIncome,'\\$80K - \\$120K','d')
df$newIncome = str_replace(df$newIncome,'\\$120K \\+','e')
df$newIncome = str_replace(df$newIncome,'Unknown','N')

head(df[,c('Income_Category','newIncome')])
A data.frame: 6 × 2
Income_CategorynewIncome
<chr><chr>
1$60K - $80K c
2Less than $40Ka
3$80K - $120K d
4Less than $40Ka
5$60K - $80K c
6$40K - $60K b

Question 59

Customer_Age의 값을 이용하여 나이 구간을 AgeState 컬럼으로 정의하라. (0~9 : 0 , 10~19 :10 , 20~29 :20 … 각 구간의 빈도수를 출력하라

Hide code cell source
library(dplyr)
df$AgeState <- df$Customer_Age %%10 *10
Ans<-count(df, AgeState)
Ans
A data.frame: 10 × 2
AgeStaten
<dbl><int>
01011
10 961
201001
301053
40 996
501050
601053
70 998
80 963
901041

Question 60

Education_Level의 값중 Graduate단어가 포함되는 값은 1 그렇지 않은 경우에는 0으로 변경하여 newEduLevel 컬럼을 정의하고 빈도수를 출력하라

Hide code cell source
library(stringr)
library(dplyr)

df$newEduLevel <- lapply(str_detect("Graduate",df$Education_Level),as.numeric)
count(df,newEduLevel)
A data.frame: 2 × 2
newEduLeveln
<list><int>
06999
13128

Question 61

Credit_Limit 컬럼값이 4500 이상인 경우 1 그외의 경우에는 모두 0으로 하는 newLimit 정의하라. newLimit 각 값들의 빈도수를 출력하라

Hide code cell source
df$newLimit <- lapply(df$Credit_Limit >=4500,as.numeric)
count(df,newLimit)
A data.frame: 2 × 2
newLimitn
<list><int>
15096
05031

Question 62

Marital_Status 컬럼값이 Married 이고 Card_Category 컬럼의 값이 Platinum인 경우 1 그외의 경우에는 모두 0으로 하는 newState컬럼을 정의하라. newState의 각 값들의 빈도수를 출력하라

Hide code cell source
df$newState <- lapply(df$Marital_Status =='Married'& df$Card_Category =='Platinum',as.numeric)
count(df,newState)
A data.frame: 2 × 2
newStaten
<list><int>
010120
1 7

Question 63

Gender 컬럼값 M인 경우 male F인 경우 female로 값을 변경하여 Gender 컬럼에 새롭게 정의하라. 각 value의 빈도를 출력하라

Hide code cell source
library(stringr)
library(dplyr)
df$Gender = str_replace(df$Gender,'M','male')
df$Gender = str_replace(df$Gender,'F','female')
count(df,Gender)
A data.frame: 2 × 2
Gendern
<chr><int>
female5358
male 4769

05_Time_Series#

Question 64

데이터를 로드하고 첫 5행을 출력하라

Hide code cell source
df<-read.csv('https://raw.githubusercontent.com/Datamanim/pandas/main/timeTest.csv')
head(df)
A data.frame: 6 × 13
Yr_Mo_DyRPTVALROSKILSHABIRDUBCLAMULCLOBELMAL
<chr><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl>
12061-01-0115.0414.9613.17 9.29 NA9.8713.6710.2510.8312.5818.5015.04
22061-01-0214.71 NA10.83 6.5012.627.6711.5010.04 9.79 9.6717.5413.83
32061-01-0318.5016.8812.3310.1311.176.1711.25 NA 8.50 7.6712.7512.71
42061-01-0410.58 6.6311.75 4.58 4.542.88 8.63 1.79 5.83 5.88 5.4610.88
52061-01-0513.3313.2511.42 6.1710.718.2111.92 6.5410.9210.3412.9211.83
62061-01-0613.21 8.12 9.96 6.67 5.374.5010.67 4.42 7.17 7.50 8.1213.17

Question 65

Yr_Mo_Dy을 date타입으로 변경하라

Hide code cell source
df$Yr_Mo_Dy<-as.Date(df$Yr_Mo_Dy)
head(df)
A data.frame: 6 × 13
Yr_Mo_DyRPTVALROSKILSHABIRDUBCLAMULCLOBELMAL
<date><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl>
12061-01-0115.0414.9613.17 9.29 NA9.8713.6710.2510.8312.5818.5015.04
22061-01-0214.71 NA10.83 6.5012.627.6711.5010.04 9.79 9.6717.5413.83
32061-01-0318.5016.8812.3310.1311.176.1711.25 NA 8.50 7.6712.7512.71
42061-01-0410.58 6.6311.75 4.58 4.542.88 8.63 1.79 5.83 5.88 5.4610.88
52061-01-0513.3313.2511.42 6.1710.718.2111.92 6.5410.9210.3412.9211.83
62061-01-0613.21 8.12 9.96 6.67 5.374.5010.67 4.42 7.17 7.50 8.1213.17

Question 66

Yr_Mo_Dy에 존재하는 년도의 유일값을 모두 출력하라

Hide code cell source
Ans<-unique(format(df$Yr_Mo_Dy,'%Y'))
Ans
  1. '2061'
  2. '2062'
  3. '2063'
  4. '2064'
  5. '2065'
  6. '2066'
  7. '2067'
  8. '2068'
  9. '2069'
  10. '2070'
  11. '1971'
  12. '1972'
  13. '1973'
  14. '1974'
  15. '1975'
  16. '1976'
  17. '1977'
  18. '1978'

Question 67

Yr_Mo_Dy에 년도가 2061년 이상의 경우에는 모두 잘못된 데이터이다. 해당경우의 값은 년도에서 100을 빼서 새롭게 날짜를 Yr_Mo_Dy 컬럼에 정의하라

# install.packages('lubridate')
library(lubridate)
library(dplyr)
Hide code cell source
df$Yr_Mo_Dy<- if_else(format(df$Yr_Mo_Dy,'%Y') >=2061,df$Yr_Mo_Dy-years(100),df$Yr_Mo_Dy)
head(df)
A data.frame: 6 × 13
Yr_Mo_DyRPTVALROSKILSHABIRDUBCLAMULCLOBELMAL
<date><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl>
11961-01-0115.0414.9613.17 9.29 NA9.8713.6710.2510.8312.5818.5015.04
21961-01-0214.71 NA10.83 6.5012.627.6711.5010.04 9.79 9.6717.5413.83
31961-01-0318.5016.8812.3310.1311.176.1711.25 NA 8.50 7.6712.7512.71
41961-01-0410.58 6.6311.75 4.58 4.542.88 8.63 1.79 5.83 5.88 5.4610.88
51961-01-0513.3313.2511.42 6.1710.718.2111.92 6.5410.9210.3412.9211.83
61961-01-0613.21 8.12 9.96 6.67 5.374.5010.67 4.42 7.17 7.50 8.1213.17

Question 68

년도별 각컬럼의 평균값을 구하여라

Hide code cell source
Ans<-df %>%
    group_by(year=format(df$Yr_Mo_Dy,'%Y')) %>%
    summarise_all("mean", na.rm = TRUE) %>%
    select(-Yr_Mo_Dy) 

head(Ans)
A tibble: 6 × 13
yearRPTVALROSKILSHABIRDUBCLAMULCLOBELMAL
<chr><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl>
196112.2995810.3518011.362376.95822710.881767.729726 9.733923 8.8587888.647652 9.83557713.5027913.68077
196212.2469210.1104411.732716.96044010.657927.39306811.020712 8.7937538.316822 9.67624712.9306814.32396
196312.8134510.8369912.541157.33005511.724118.43471211.07569910.3365488.90358910.22443813.6388814.99901
196412.3636610.9201612.104376.78778711.454487.57087410.259153 9.4673507.78901610.20795113.7405514.91030
196512.4513711.0755311.848776.85846611.024797.47811010.618712 8.8799187.907425 9.91808212.9642515.59164
196613.4619711.5572112.020637.34572611.805047.79367110.579808 8.8350968.514438 9.76895914.2658416.30726

Question 69

weekday컬럼을 만들고 요일별로 매핑하라 ( 일요일: 1 ~ 토요일 :7)

Hide code cell source
df$weekday <- wday(df$Yr_Mo_Dy)
head(df)
A data.frame: 6 × 14
Yr_Mo_DyRPTVALROSKILSHABIRDUBCLAMULCLOBELMALweekday
<date><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl>
11961-01-0115.0414.9613.17 9.29 NA9.8713.6710.2510.8312.5818.5015.041
21961-01-0214.71 NA10.83 6.5012.627.6711.5010.04 9.79 9.6717.5413.832
31961-01-0318.5016.8812.3310.1311.176.1711.25 NA 8.50 7.6712.7512.713
41961-01-0410.58 6.6311.75 4.58 4.542.88 8.63 1.79 5.83 5.88 5.4610.884
51961-01-0513.3313.2511.42 6.1710.718.2111.92 6.5410.9210.3412.9211.835
61961-01-0613.21 8.12 9.96 6.67 5.374.5010.67 4.42 7.17 7.50 8.1213.176

Question 70

weekday컬럼을 기준으로 주말이면 1 평일이면 0의 값을 가지는 WeekCheck 컬럼을 만들어라

Hide code cell source
w <- c(1,7)
df$WeekCheck <- lapply(df$weekday %in% w,as.numeric)
head(df)
A data.frame: 6 × 15
Yr_Mo_DyRPTVALROSKILSHABIRDUBCLAMULCLOBELMALweekdayWeekCheck
<date><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><list>
11961-01-0115.0414.9613.17 9.29 NA9.8713.6710.2510.8312.5818.5015.0411
21961-01-0214.71 NA10.83 6.5012.627.6711.5010.04 9.79 9.6717.5413.8320
31961-01-0318.5016.8812.3310.1311.176.1711.25 NA 8.50 7.6712.7512.7130
41961-01-0410.58 6.6311.75 4.58 4.542.88 8.63 1.79 5.83 5.88 5.4610.8840
51961-01-0513.3313.2511.42 6.1710.718.2111.92 6.5410.9210.3412.9211.8350
61961-01-0613.21 8.12 9.96 6.67 5.374.5010.67 4.42 7.17 7.50 8.1213.1760

Question 71

년도, 일자 상관없이 모든 컬럼의 각 달의 평균을 구하여라

Hide code cell source
Ans<-df %>%
    select(-WeekCheck) %>%
    group_by(year=format(df$Yr_Mo_Dy,'%m')) %>%
    summarise_all("mean", na.rm = TRUE) %>%
    select(-Yr_Mo_Dy) 

head(Ans)
A tibble: 6 × 14
yearRPTVALROSKILSHABIRDUBCLAMULCLOBELMALweekday
<chr><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl>
0114.8473212.91456013.299627.19949811.6677348.05483911.8193559.5120479.54320810.05356614.5505218.028763.985663
0213.7109112.11112212.879136.94241111.5517727.63385811.2060249.3414379.313169 9.51805113.7289017.156144.000000
0313.1586911.50584212.648127.26590711.5545167.95940911.3101799.6358969.70032410.09695313.8106116.909324.008961
0412.5556510.42975912.204816.89803710.6776677.44138910.2213158.9090568.930870 9.15801912.6647614.937614.000000
0511.7240310.14561911.550396.30748710.2243016.942061 8.7977388.4529038.040806 8.52485712.7672613.736043.991039
0610.45132 8.94970410.361315.652278 9.5299266.410093 8.0095567.9207967.639796 7.72918512.2464112.861824.009259

Question 72

모든 결측치는 컬럼기준 직전의 값으로 대체하고 첫번째 행에 결측치가 있을경우 뒤에있는 값으로 대채하라

Hide code cell source
library(tidyr)

Ans<-df %>%
    fill(everything(),.direction='down') %>%
    fill(everything(),.direction='up')
head(Ans)
A data.frame: 6 × 15
Yr_Mo_DyRPTVALROSKILSHABIRDUBCLAMULCLOBELMALweekdayWeekCheck
<date><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><list>
11961-01-0115.0414.9613.17 9.2912.629.8713.6710.2510.8312.5818.5015.0411
21961-01-0214.7114.9610.83 6.5012.627.6711.5010.04 9.79 9.6717.5413.8320
31961-01-0318.5016.8812.3310.1311.176.1711.2510.04 8.50 7.6712.7512.7130
41961-01-0410.58 6.6311.75 4.58 4.542.88 8.63 1.79 5.83 5.88 5.4610.8840
51961-01-0513.3313.2511.42 6.1710.718.2111.92 6.5410.9210.3412.9211.8350
61961-01-0613.21 8.12 9.96 6.67 5.374.5010.67 4.42 7.17 7.50 8.1213.1760

Question 73

년도 - 월을 기준으로 모든 컬럼의 평균값을 구하여라

Hide code cell source
Ans<-df %>%
    select(-WeekCheck) %>%
    group_by(year=format(df$Yr_Mo_Dy,'%Y-%m')) %>%
    summarise_all("mean", na.rm = TRUE) %>%
    select(-Yr_Mo_Dy) 

head(Ans)
A tibble: 6 × 14
yearRPTVALROSKILSHABIRDUBCLAMULCLOBELMALweekday
<chr><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl>
1961-0114.84133311.98833313.431617.73677411.072759 8.58806511.184839 9.245333 9.08580610.10741913.8809714.703233.806452
1961-0216.26928614.97535714.441489.23074113.85214310.93750011.89071411.84607111.82142912.71428618.5832115.411794.000000
1961-0310.89000011.29645210.752907.28400010.509355 8.866774 9.644194 9.82967710.29413811.25193516.4109715.720004.096774
1961-0410.722667 9.427667 9.998005.830667 8.435000 6.495000 6.925333 7.094667 7.342333 7.23700011.1473310.278334.000000
1961-05 9.860968 8.85000010.818065.905333 9.490323 6.574839 7.604000 8.177097 8.039355 8.49935511.9003212.011613.903226
1961-06 9.904138 8.520333 8.867006.08300010.824000 6.707333 9.095667 8.849333 9.086667 9.94033313.9950014.553794.100000

Question 74

RPT 컬럼의 값을 일자별 기준으로 1차 차분하라

Hide code cell source
Ans<-diff(df$RPT,differences=1)
head(Ans)
  1. -0.329999999999998
  2. 3.79
  3. -7.92
  4. 2.75
  5. -0.119999999999999
  6. 0.289999999999999

Question 75

RPT와 VAL의 컬럼을 일주일 간격으로 각각 이동평균한값을 구하여라

Hide code cell source
library(zoo)
Ans<-rollmean(df[,c(2:13)], k = 13, fill = NA, align = "center")
head(Ans,10)
A matrix: 10 × 12 of type dbl
RPTVALROSKILSHABIRDUBCLAMULCLOBELMAL
NA NA NA NA NA NA NA NA NA NA NA NA
NA NA NA NA NA NA NA NA NA NA NA NA
NA NA NA NA NA NA NA NA NA NA NA NA
NA NA NA NA NA NA NA NA NA NA NA NA
NA NA NA NA NA NA NA NA NA NA NA NA
NA NA NA NA NA NA NA NA NA NA NA NA
13.54077 NA11.650776.429231 NA6.60923110.636154 NA8.0469238.98153813.0030813.30462
13.07923 NA11.182315.7661548.4923085.956154 9.821538 NA7.2523088.21923112.1315412.54538
12.8738510.13000011.253085.4484628.0892315.606923 9.129231 NA6.8646157.90846211.3623112.00077
12.71385 9.69692311.510005.0315388.1023085.664615 8.9753856.636.8423087.88230811.3846211.71846

Question 76

년-월-일:시 컬럼을 dttm 형태로 변경하라. 서울시의 제공데이터의 경우 0시가 24시로 표현된다. 데이터프레임명은 df로 하라

Hide code cell source
df<- read.csv('https://raw.githubusercontent.com/Datamanim/pandas/main/seoul_pm.csv')

library("stringr")
a<-as.data.frame(str_split_fixed(df$`X.년.월.일.시.`,':',2))
colnames(a) <- c('date','hour')

df$`X.년.월.일.시.` <-a %>%
    mutate(total = paste(date,' ',hour)) %>%
    mutate(newdate = ifelse(hour==24,
                        paste(as.character(as.Date(date) + days(1)),' ','00'),
                        paste(as.character(as.Date(date)) ,' ',hour)   
                           )) %>%
    select(newdate) %>%
    pull(newdate)

df$`X.년.월.일.시.`<- strptime(df$`X.년.월.일.시.`,format='%Y-%m-%d %H')
head(df)
A data.frame: 6 × 13
X.년.월.일.시.PM10등급PM10PM2.5등급PM2.5오존등급오존이산화질소등급이산화질소일산화탄소등급일산화탄소아황산가스등급아황산가스
<dttm><chr><int><chr><int><chr><dbl><chr><dbl><chr><dbl><chr><dbl>
12021-05-15 15:00:00보통47보통19좋음0.017좋음0.023좋음0.4좋음0.003
22021-05-15 14:00:00보통43보통20좋음0.024좋음0.019좋음0.3좋음0.003
32021-05-15 13:00:00보통34보통24보통0.035좋음0.017좋음0.4좋음0.004
42021-05-15 12:00:00보통41보통27보통0.037좋음0.020좋음0.4좋음0.004
52021-05-15 11:00:00보통51보통34보통0.033좋음0.023좋음0.4좋음0.005
62021-05-15 10:00:00보통47보통31좋음0.028좋음0.029좋음0.5좋음0.005

Question 77

일자별 한글요일 이름을 dayName 컬럼에 저장하라

Hide code cell source
df$dayName <- weekdays(as.Date(df$`X.년.월.일.시.`))
head(df)
A data.frame: 6 × 14
X.년.월.일.시.PM10등급PM10PM2.5등급PM2.5오존등급오존이산화질소등급이산화질소일산화탄소등급일산화탄소아황산가스등급아황산가스dayName
<dttm><chr><int><chr><int><chr><dbl><chr><dbl><chr><dbl><chr><dbl><chr>
12021-05-15 15:00:00보통47보통19좋음0.017좋음0.023좋음0.4좋음0.003토요일
22021-05-15 14:00:00보통43보통20좋음0.024좋음0.019좋음0.3좋음0.003토요일
32021-05-15 13:00:00보통34보통24보통0.035좋음0.017좋음0.4좋음0.004토요일
42021-05-15 12:00:00보통41보통27보통0.037좋음0.020좋음0.4좋음0.004토요일
52021-05-15 11:00:00보통51보통34보통0.033좋음0.023좋음0.4좋음0.005토요일
62021-05-15 10:00:00보통47보통31좋음0.028좋음0.029좋음0.5좋음0.005토요일

Question 78

일자별 각 PM10등급의 빈도수를 파악하라

Hide code cell source
Ans<-df %>%
    group_by(dayName) %>%
    count(PM10등급)
head(Ans)
A grouped_df: 6 × 3
dayNamePM10등급n
<chr><chr><int>
금요일 3
금요일나쁨 31
금요일매우나쁨 17
금요일보통 120
금요일좋음 21
목요일 1

Question 79

시간이 연속적으로 존재하며 결측치가 없는지 확인하라

Hide code cell source
sum(diff(df$`X.년.월.일.시.`) !=-1)
0

Question 80

오전 10시와 오후 10시(22시)의 PM10의 평균값을 각각 구하여라

Hide code cell source
Ans<-df %>%
    filter(hour(`X.년.월.일.시.`)== 10 | hour(`X.년.월.일.시.`)== 22) %>%
    group_by(hour=hour(`X.년.월.일.시.`)) %>%
    summarize(mean = mean(PM10))
Ans
A tibble: 2 × 2
hourmean
<int><dbl>
1070.38462
2269.94118

Question 81

날짜 컬럼을 index로 만들어서 df2에 저장하라

Hide code cell source
df2<-df
rownames(df2) <-df2$`X.년.월.일.시.` 
df2<-df2[,c(2:14)]
head(df2)
A data.frame: 6 × 13
PM10등급PM10PM2.5등급PM2.5오존등급오존이산화질소등급이산화질소일산화탄소등급일산화탄소아황산가스등급아황산가스dayName
<chr><int><chr><int><chr><dbl><chr><dbl><chr><dbl><chr><dbl><chr>
2021-05-15 15:00:00보통47보통19좋음0.017좋음0.023좋음0.4좋음0.003토요일
2021-05-15 14:00:00보통43보통20좋음0.024좋음0.019좋음0.3좋음0.003토요일
2021-05-15 13:00:00보통34보통24보통0.035좋음0.017좋음0.4좋음0.004토요일
2021-05-15 12:00:00보통41보통27보통0.037좋음0.020좋음0.4좋음0.004토요일
2021-05-15 11:00:00보통51보통34보통0.033좋음0.023좋음0.4좋음0.005토요일
2021-05-15 10:00:00보통47보통31좋음0.028좋음0.029좋음0.5좋음0.005토요일

Question 82

df 데이터를 주단위로 뽑아서 최소,최대 평균, 표준표차를 구하여라

Hide code cell source
Ans<-df %>%
    mutate(week= week(df$`X.년.월.일.시.`)) %>%
    select_if(is.numeric) %>%
    group_by(week) %>%
    summarise_all(mean) 

Ans
A tibble: 9 × 7
weekPM10PM2.5오존이산화질소일산화탄소아황산가스
<dbl><dbl><dbl><dbl><dbl><dbl><dbl>
12 72.1739137.695650.035173910.040173910.49565220.003521739
13113.64286 NA0.023089290.037744050.56428570.003017857
14 38.2440517.744050.034065480.023315480.38869050.002672619
15 37.4583319.482140.036422620.023220240.38273810.002619048
16 NA NA NA NA NA NA
17 NA17.934520.037059520.022178570.39285710.002702381
18 NA NA NA NA NA NA
19 NA NA NA NA NA NA
20 63.5250038.100000.032350000.032850000.50000000.004500000

06_Reshape(Pivot)#

Question 83

Indicator을 삭제하고 First Tooltip 컬럼에서 신뢰구간에 해당하는 표현([~~])을 지우고 first 컬럼에 실수형으로 타입을 변경한 후 추가하라

df<-read.csv('https://raw.githubusercontent.com/Datamanim/pandas/main/under5MortalityRate.csv')
Hide code cell source
library(dplyr)
library(stringr)

df<- df %>%
    select(-c(Indicator))  %>%
    mutate(first = as.numeric(sapply(strsplit(df$First.Tooltip, "\\["), head, 1))) %>%
    select(-c(First.Tooltip))

head(df)
A data.frame: 6 × 4
LocationPeriodDim1first
<chr><int><chr><dbl>
1Afghanistan2019Both sexes60.27
2Afghanistan2019Male 63.83
3Afghanistan2019Female 56.57
4Afghanistan2018Both sexes62.54
5Afghanistan2018Male 66.08
6Afghanistan2018Female 58.84

Question 84

년도가 2015년 이상, Dim1이 Both sexes인 케이스만 추출하라

Hide code cell source
df2<-df %>%
    filter(Period >=2015 & Dim1 =='Both sexes')
head(df2)
A data.frame: 6 × 4
LocationPeriodDim1first
<chr><int><chr><dbl>
1Afghanistan2019Both sexes60.27
2Afghanistan2018Both sexes62.54
3Afghanistan2017Both sexes64.94
4Afghanistan2016Both sexes67.57
5Afghanistan2015Both sexes70.44
6Albania 2019Both sexes 9.68

Question 85

84번 문제에서 추출한 데이터로 아래와 같이 나라에 따른 년도별 사망률을 데이터 프레임화 하라

Hide code cell source
library(reshape2)
Ans<-df2 %>%
    select(-c(Dim1)) %>%
    group_by(Location,first) %>%
    dcast(Location~Period,value.var='first')
head(Ans)
A data.frame: 6 × 6
Location20152016201720182019
<chr><dbl><dbl><dbl><dbl><dbl>
1Afghanistan 70.4467.5764.9462.5460.27
2Albania 9.57 9.42 9.42 9.53 9.68
3Algeria 25.1824.7924.3223.8123.26
4Andorra 3.53 3.37 3.22 3.09 2.97
5Angola 88.2084.2180.6277.6774.69
6Antigua and Barbuda 7.75 7.42 7.12 6.85 6.61

Question 86

전체 데이터(df)에서 Dim1에 따른 년도별 사망비율의 평균을 구하라

Hide code cell source
Ans<-df %>%
    select(-c(Location)) %>%
    group_by(Dim1,Period) %>%
    summarise_all(mean) %>%
    ungroup() %>%
    dcast(Period~Dim1,value.var='first')

head(Ans)
A data.frame: 6 × 4
PeriodBoth sexesFemaleMale
<int><dbl><dbl><dbl>
11950147.7008140.9098154.1512
21951155.5375149.2102161.5382
31952157.8111151.5161163.7608
41953156.1472150.2509161.7421
51954154.5399148.6883160.0810
61955155.7972149.8432161.4569

Question 87

데이터에서 Country가 한국(KOR) 데이터만 추출하라

Hide code cell source
df <-read.csv('https://raw.githubusercontent.com/Datamanim/pandas/main/winter.csv')

kor <- df %>%
        filter(Country =='KOR')
head(kor)
A data.frame: 6 × 9
YearCitySportDisciplineAthleteCountryGenderEventMedal
<int><chr><chr><chr><chr><chr><chr><chr><chr>
11992AlbertvilleSkatingShort Track Speed SkatingLEE, Jun-Ho KORMen1000M Bronze
21992AlbertvilleSkatingShort Track Speed SkatingKIM, Ki-Hoon KORMen1000M Gold
31992AlbertvilleSkatingShort Track Speed SkatingKIM, Ki-Hoon KORMen5000M RelayGold
41992AlbertvilleSkatingShort Track Speed SkatingLEE, Jun-Ho KORMen5000M RelayGold
51992AlbertvilleSkatingShort Track Speed SkatingMO, Ji-Soo KORMen5000M RelayGold
61992AlbertvilleSkatingShort Track Speed SkatingSONG, Jae-KunKORMen5000M RelayGold

Question 88

한국 올림픽 메달리스트 데이터에서 년도에 따른 medal 종류별 갯수를 데이터프레임화 하라

Hide code cell source
Ans<-kor %>% 
    group_by(Year,Medal) %>%
    count() %>%
    ungroup() %>%
    dcast(Year~Medal,value.var='n')
Ans
A data.frame: 7 × 4
YearBronzeGoldSilver
<int><int><int><int>
1992 1 5 1
1994 1 8 1
1998 2 6 4
2002NA 5 2
2006 214 3
2010 2 610
2014 2 7 5

Question 89

전체 데이터에서 sport종류에 따른 성별수를 구하여라

Hide code cell source
Ans<-df %>%
    group_by(Sport,Gender) %>%
    count() %>%
    dcast(Sport~Gender,value.var='n')

Ans
A data.frame: 7 × 3
SportMenWomen
<chr><int><int>
Biathlon 270150
Bobsleigh 416 36
Curling 97 75
Ice Hockey1231305
Luge 135 45
Skating 665564
Skiing 1130651

Question 90

전체 데이터에서 Discipline종류에 따른 따른 Medal수를 구하여라

Hide code cell source
Ans<-df %>%
    group_by(Discipline,Medal) %>%
    count() %>%
    dcast(Discipline~Medal,value.var='n')


head(Ans)
A data.frame: 6 × 4
DisciplineBronzeGoldSilver
<chr><int><int><int>
1Alpine Skiing 141143144
2Biathlon 139140141
3Bobsleigh 147134141
4Cross Country Skiing263264262
5Curling 56 58 58
6Figure skating 118122119

07_Merge , Concat#

Question 91

df1과 df2 데이터를 하나의 데이터 프레임으로 합쳐라

df<-read.csv('https://raw.githubusercontent.com/Datamanim/pandas/main/mergeTEst.csv',row.names = 1)

df1<-df[c(1,2,3,4),]
df2<-df[c(5:nrow(df)),]
Hide code cell source
Ans<-rbind(df1,df2)
head(Ans)
A data.frame: 6 × 10
X2010X2011X2012X2013X2014X2015X2016X2017X2018X2019
<dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl>
Afghanistan64.02361.64059.36757.17055.0853.10751.26749.56047.98346.453
Albania11.80310.807 9.943 9.267 8.79 8.493 8.363 8.363 8.453 8.597
Algeria23.54022.90722.45022.11721.8521.58721.25720.85020.40719.930
Andorra 4.240 4.033 3.843 3.667 3.49 3.330 3.187 3.060 2.933 2.827
Angola75.71371.28067.23363.57060.4357.75755.51053.46051.75750.093
Antigua and Barbuda 8.667 8.223 7.807 7.420 7.07 6.757 6.483 6.230 6.000 5.783

Question 92

df3과 df4 데이터를 하나의 데이터 프레임으로 합쳐라. 둘다 포함하고 있는 년도에 대해서만 고려한다

df3 <- df[c(1,2,3),c(1,2,3,4)]
df4 <- df[c(5:nrow(df)),c(3:ncol(df))]
df3
A data.frame: 3 × 4
X2010X2011X2012X2013
<dbl><dbl><dbl><dbl>
Afghanistan64.02361.64059.36757.170
Albania11.80310.807 9.943 9.267
Algeria23.54022.90722.45022.117
df4
A data.frame: 6 × 8
X2012X2013X2014X2015X2016X2017X2018X2019
<dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl>
Angola67.23363.57060.43057.75755.51053.46051.75750.093
Antigua and Barbuda 7.807 7.420 7.070 6.757 6.483 6.230 6.000 5.783
Argentina11.84011.28310.73310.203 9.683 9.177 8.680 8.227
Armenia14.89714.17013.47712.81712.18311.58311.00710.497
Australia 3.623 3.467 3.343 3.253 3.183 3.137 3.090 3.047
Austria 3.333 3.210 3.113 3.043 2.987 2.943 2.897 2.843

Question 93

df3과 df4 데이터를 하나의 데이터 프레임으로 합쳐라. 모든 컬럼을 포함하고, 결측치는 0으로 대체한다

Question 94

df5과 df6 데이터를 하나의 데이터 프레임으로 merge함수를 이용하여 합쳐라. Algeria컬럼을 key로 하고 두 데이터 모두 포함하는 데이터만 출력하라

df5<-t(df)[c(1:7),c(1:3)]
df6<-t(df)[c(6:nrow(t(df))),c(2,3,4,5)]

Question 95

df5과 df6 데이터를 하나의 데이터 프레임으로 merge함수를 이용하여 합쳐라. Algeria컬럼을 key로 하고 합집합으로 합쳐라

머지 잘모르겠어요.. 답아시면 깃헙에 코드 공유 부탁드려요 ㅠㅠ#