1. R markdown (data path, loading data, 5 checks, summarizeColumns, mytable, mycsv)
Full summary
# install.packages("moonBook")
library(moonBook) # ***
data(acs) # ***
getwd()
setwd('C:/Users/is2js/')
# a = read.table('txt_data.txt', header = T) # pass header = T when the first line of the txt file is the header ***
# df = read.csv('C:/Users/is2js/R_da/R_Camp/data/week1/csv_data.csv', stringsAsFactors = FALSE) #***
# If a file created on Windows has encoding problems when opened on a Mac, add fileEncoding = "CP949", encoding = "UTF-8"
dim(acs)
head(acs)
str(acs)
summary(acs)
apply(is.na(acs), MARGIN = 2, FUN = 'sum')
# install.packages("mlr")
library(mlr)
summarizeColumns(iris)
mytable(sex ~ age, data=acs)
mytable(sex ~ BMI, data=acs)
mytable(obesity ~ age+BMI+smoking , data = acs)
mytable(obesity + sex ~ age+BMI+smoking+height , data = acs)
csv <- mytable(obesity + sex ~ age+BMI+smoking+height , data = acs)
mycsv(csv, file = 'obesity and sex.csv')
latex <- mytable(obesity ~ age+BMI+smoking , data = acs)
mylatex(latex)
Descriptive statistics
Numerically - basic statistics
a <- c(1, 2, 3, 10)
mean(a) # measure of center: mean
sd(a) # measure of spread: standard deviation
min(a)
max(a)
summary(a) # see all the basic statistics at once ***
boxplot(a)
boxplot(a, horizontal = TRUE) # draw the box plot horizontally ***
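To see where the numbers printed by summary(a) come from, the sketch below recomputes them with quantile() and mean(). It assumes R's default quantile method (type 7), which is what summary() uses for a numeric vector; the name by_hand is only for illustration.

```r
a <- c(1, 2, 3, 10)

# Quartiles with R's default interpolation method (type 7)
q <- quantile(a, probs = c(0.25, 0.5, 0.75))

# The six values summary(a) reports: Min, 1st Qu., Median, Mean, 3rd Qu., Max
by_hand <- c(min(a), q[[1]], q[[2]], mean(a), q[[3]], max(a))
by_hand
# values: 1, 1.75, 2.5, 4, 4.75, 10

# The single large value (10) pulls the mean (4) above the median (2.5)
mean(a) > median(a)
```

The gap between mean and median is also what the box plot shows visually: the median line sits toward the low end of the box.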
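Setting the data path ***
Check the working directory ***
On Windows, a folder path copied from Explorer uses backslashes (\), so always convert them to / before using it in R.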
getwd()
## [1] "C:/Users/is2js/R_da/myR"
setwd('C:/Users/is2js/R_da/R_Camp/data/week1')
a = read.table('txt_data.txt', header = T) # pass header = T when the first line of the txt file is the header ***
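The backslash-to-slash conversion can also be done in code instead of by hand. This is a minimal sketch using the week1 folder from above as it would look when pasted from Windows Explorer; the variable names win_path, r_path and csv_file are made up for the example.

```r
# Path as copied from Windows Explorer; inside an R string every
# backslash has to be written twice
win_path <- "C:\\Users\\is2js\\R_da\\R_Camp\\data\\week1"

# Replace each backslash with a forward slash (fixed = TRUE: plain text, no regex)
r_path <- gsub("\\", "/", win_path, fixed = TRUE)
r_path
# "C:/Users/is2js/R_da/R_Camp/data/week1"

# file.path() joins the pieces with "/", so no separator is typed by hand
csv_file <- file.path(r_path, "csv_data.csv")
csv_file
```

The result can be passed to setwd() or read.csv() directly.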
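Checking the data ***
- dim() : number of rows (observations) and columns (variables)
- head() : look at the first few rows directly
- str() : data structure and column types
- summary() : basic statistics, plus frequencies for categorical columns
- apply() : count NAs per column with apply + MARGIN = 2 + sum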
dim(a) # ***
## [1] 10 3
head(a)
## age gender group
## 1 25 M a
## 2 30 M a
## 3 30 F a
## 4 25 F a
## 5 30 M b
## 6 40 F b
str(a)
## 'data.frame': 10 obs. of 3 variables:
## $ age : int 25 30 30 25 30 40 35 50 40 45
## $ gender: Factor w/ 2 levels "F","M": 2 2 1 1 2 1 2 1 2 2
## $ group : Factor w/ 2 levels "a","b": 1 1 1 1 2 2 2 1 1 2
summary(a)
## age gender group
## Min. :25.0 F:4 a:6
## 1st Qu.:30.0 M:6 b:4
## Median :32.5
## Mean :35.0
## 3rd Qu.:40.0
## Max. :50.0
apply(is.na(a), MARGIN = 2, FUN = 'sum') # ***
## age gender group
## 0 0 0
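Since the sample file has no missing values, the NA check is easier to see on a small data frame that does. The sketch below uses made-up values and also shows colSums(), a base R function that gives the same per-column counts as the apply() call.

```r
# Toy data frame (hypothetical values) with a few missing entries
toy <- data.frame(
  age    = c(25, NA, 30, 40),
  gender = c("M", "F", NA, NA),
  stringsAsFactors = FALSE
)

# is.na() returns a TRUE/FALSE matrix; summing over columns (MARGIN = 2) counts the NAs
na_apply <- apply(is.na(toy), MARGIN = 2, FUN = sum)

# Same counts with the dedicated column-sum function
na_colsums <- colSums(is.na(toy))

na_apply
na_colsums  # age: 1, gender: 2 in both cases
```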
CSV files + preventing strings from being converted to factors ***
The trade-off of stringsAsFactors = FALSE: summary() no longer shows frequencies for the categorical columns.
df = read.csv('C:/Users/is2js/R_da/R_Camp/data/week1/csv_data.csv', stringsAsFactors = FALSE) #***
dim(df) # ***
## [1] 1003 17
head(df)
## id gender age height weight cancer cancer_onset HP diabetes 혈당1 혈당2
## 1 1 M 22 184 79 N N N 89 118
## 2 2 M 70 185 67 N N N 116 120
## 3 3 F 32 179 75 N Y Y 104 88
## 4 3 F 32 179 75 N Y Y 104 88
## 5 4 F 29 166 45 Y 2016-01-25 N N 88 115
## 6 5 F 53 164 50 Y 2016-04-29 N N 106 98
## 혈당3 body_temp spo2 맥박 sbp dbp
## 1 129 39 39 94 123 84
## 2 130 35 36 102 131 88
## 3 124 36 37 83 103 64
## 4 124 36 37 83 103 64
## 5 87 39 35 73 101 85
## 6 83 39 37 74 121 61
str(df)
## 'data.frame': 1003 obs. of 17 variables:
## $ id : int 1 2 3 3 4 5 6 7 8 9 ...
## $ gender : chr "M" "M" "F" "F" ...
## $ age : int 22 70 32 32 29 53 63 50 57 35 ...
## $ height : int 184 185 179 179 166 164 166 165 181 185 ...
## $ weight : int 79 67 75 75 45 50 97 53 66 67 ...
## $ cancer : chr "N" "N" "N" "N" ...
## $ cancer_onset: chr "" "" "" "" ...
## $ HP : chr "N" "N" "Y" "Y" ...
## $ diabetes : chr "N" "N" "Y" "Y" ...
## $ 혈당1 : int 89 116 104 104 88 106 99 98 90 99 ...
## $ 혈당2 : int 118 120 88 88 115 98 127 99 96 83 ...
## $ 혈당3 : int 129 130 124 124 87 83 133 101 109 81 ...
## $ body_temp : int 39 35 36 36 39 39 38 37 39 35 ...
## $ spo2 : int 39 36 37 37 35 37 37 37 39 38 ...
## $ 맥박 : int 94 102 83 83 73 74 75 93 110 85 ...
## $ sbp : int 123 131 103 103 101 121 130 131 109 117 ...
## $ dbp : int 84 88 64 64 85 61 98 76 88 74 ...
summary(df)
## id gender age height
## Min. : 1.0 Length:1003 Min. :-25.00 Min. :160.0
## 1st Qu.: 250.5 Class :character 1st Qu.: 35.00 1st Qu.:168.0
## Median : 501.0 Mode :character Median : 50.00 Median :174.0
## Mean : 501.0 Mean : 49.72 Mean :173.8
## 3rd Qu.: 751.5 3rd Qu.: 65.00 3rd Qu.:181.0
## Max. :1000.0 Max. : 80.00 Max. :187.0
## NA's :2
## weight cancer cancer_onset HP
## Min. : 40.00 Length:1003 Length:1003 Length:1003
## 1st Qu.: 56.00 Class :character Class :character Class :character
## Median : 71.00 Mode :character Mode :character Mode :character
## Mean : 70.99
## 3rd Qu.: 86.00
## Max. :100.00
##
## diabetes 혈당1 혈당2 혈당3
## Length:1003 Min. : 80.00 Min. : 80.0 Min. : 80
## Class :character 1st Qu.: 90.00 1st Qu.: 91.0 1st Qu.: 94
## Mode :character Median : 99.00 Median :104.0 Median :111
## Mean : 99.79 Mean :104.5 Mean :110
## 3rd Qu.:110.00 3rd Qu.:117.0 3rd Qu.:125
## Max. :120.00 Max. :130.0 Max. :140
##
## body_temp spo2 맥박 sbp
## Min. : 35.00 Min. :35.0 Min. : 60.00 Min. :100.0
## 1st Qu.: 36.00 1st Qu.:36.0 1st Qu.: 73.00 1st Qu.:110.0
## Median : 37.00 Median :37.0 Median : 85.00 Median :120.0
## Mean : 37.49 Mean :37.5 Mean : 85.49 Mean :120.2
## 3rd Qu.: 39.00 3rd Qu.:39.0 3rd Qu.: 99.00 3rd Qu.:130.0
## Max. :120.00 Max. :40.0 Max. :110.00 Max. :140.0
##
## dbp
## Min. : 60.00
## 1st Qu.: 70.00
## Median : 80.00
## Mean : 80.06
## 3rd Qu.: 90.00
## Max. :100.00
##
apply(is.na(df), MARGIN = 2, FUN = 'sum') # ***
## id gender age height weight
## 0 0 0 2 0
## cancer cancer_onset HP diabetes 혈당1
## 0 0 0 0 0
## 혈당2 혈당3 body_temp spo2 맥박
## 0 0 0 0 0
## sbp dbp
## 0 0
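The summary() output above shows only Length/Class/Mode for the character columns. If the frequencies are still wanted, two common options are table() on the column or converting just that column to a factor. A minimal sketch on a toy data frame (the values are made up, only the column names mirror the file):

```r
toy <- data.frame(
  gender = c("M", "M", "F", "F", "M"),
  age    = c(22, 70, 32, 29, 53),
  stringsAsFactors = FALSE
)

# Option 1: count the categories directly, the column stays character
freq <- table(toy$gender)
freq

# Option 2: convert one column to factor, then summary() reports frequencies again
toy$gender <- factor(toy$gender)
summary(toy$gender)
# F: 2, M: 3
```

Converting only the columns that really are categorical keeps free-text columns such as dates as plain strings.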
Installing the moonBook package and using mytable ***
Formula pattern: <column> grouping category ~ <rows 1, 2, 3> numeric (descriptive statistics) + categorical (frequency) + ..., data=
# install.packages("moonBook")
library(moonBook) # ***
# require(moonBook)
data(acs) # ***
getwd()
## [1] "C:/Users/is2js/R_da/myR"
setwd('C:/Users/is2js/R_da/R_Camp/data/week1')
dim(acs)
## [1] 857 17
head(acs)
## age sex cardiogenicShock entry Dx EF height weight
## 1 62 Male No Femoral STEMI 18.0 168 72
## 2 78 Female No Femoral STEMI 18.4 148 48
## 3 76 Female Yes Femoral STEMI 20.0 NA NA
## 4 89 Female No Femoral STEMI 21.8 165 50
## 5 56 Male No Radial NSTEMI 21.8 162 64
## 6 73 Female No Radial Unstable Angina 22.0 153 59
## BMI obesity TC LDLC HDLC TG DM HBP smoking
## 1 25.51020 Yes 215 154 35 155 Yes No Smoker
## 2 21.91381 No NA NA NA 166 No Yes Never
## 3 NA No NA NA NA NA No Yes Never
## 4 18.36547 No 121 73 20 89 No No Never
## 5 24.38653 No 195 151 36 63 Yes Yes Smoker
## 6 25.20398 Yes 184 112 38 137 Yes Yes Never
str(acs)
## 'data.frame': 857 obs. of 17 variables:
## $ age : int 62 78 76 89 56 73 58 62 59 71 ...
## $ sex : chr "Male" "Female" "Female" "Female" ...
## $ cardiogenicShock: chr "No" "No" "Yes" "No" ...
## $ entry : chr "Femoral" "Femoral" "Femoral" "Femoral" ...
## $ Dx : chr "STEMI" "STEMI" "STEMI" "STEMI" ...
## $ EF : num 18 18.4 20 21.8 21.8 22 24.7 26.6 28.5 31.1 ...
## $ height : num 168 148 NA 165 162 153 167 160 152 168 ...
## $ weight : num 72 48 NA 50 64 59 78 50 67 60 ...
## $ BMI : num 25.5 21.9 NA 18.4 24.4 ...
## $ obesity : chr "Yes" "No" "No" "No" ...
## $ TC : num 215 NA NA 121 195 184 161 136 239 169 ...
## $ LDLC : int 154 NA NA 73 151 112 91 88 161 88 ...
## $ HDLC : int 35 NA NA 20 36 38 34 33 34 54 ...
## $ TG : int 155 166 NA 89 63 137 196 30 118 141 ...
## $ DM : chr "Yes" "No" "No" "No" ...
## $ HBP : chr "No" "Yes" "Yes" "No" ...
## $ smoking : chr "Smoker" "Never" "Never" "Never" ...
summary(acs)
## age sex cardiogenicShock entry
## Min. :28.00 Length:857 Length:857 Length:857
## 1st Qu.:55.00 Class :character Class :character Class :character
## Median :64.00 Mode :character Mode :character Mode :character
## Mean :63.31
## 3rd Qu.:72.00
## Max. :91.00
##
## Dx EF height weight
## Length:857 Min. :18.00 Min. :130.0 Min. : 30.00
## Class :character 1st Qu.:50.45 1st Qu.:158.0 1st Qu.: 58.00
## Mode :character Median :58.10 Median :165.0 Median : 65.00
## Mean :55.83 Mean :163.2 Mean : 64.84
## 3rd Qu.:62.35 3rd Qu.:170.0 3rd Qu.: 72.00
## Max. :79.00 Max. :185.0 Max. :112.00
## NA's :134 NA's :93 NA's :91
## BMI obesity TC LDLC
## Min. :15.62 Length:857 Min. : 25.0 Min. : 15.0
## 1st Qu.:22.13 Class :character 1st Qu.:154.0 1st Qu.: 88.0
## Median :24.16 Mode :character Median :183.0 Median :114.0
## Mean :24.28 Mean :185.2 Mean :116.6
## 3rd Qu.:26.17 3rd Qu.:213.0 3rd Qu.:141.0
## Max. :41.42 Max. :493.0 Max. :366.0
## NA's :93 NA's :23 NA's :24
## HDLC TG DM HBP
## Min. : 4.00 Min. : 11.0 Length:857 Length:857
## 1st Qu.:32.00 1st Qu.: 68.0 Class :character Class :character
## Median :38.00 Median :105.5 Mode :character Mode :character
## Mean :38.24 Mean :125.2
## 3rd Qu.:45.00 3rd Qu.:154.0
## Max. :89.00 Max. :877.0
## NA's :23 NA's :15
## smoking
## Length:857
## Class :character
## Mode :character
##
##
##
##
apply(is.na(acs), MARGIN = 2, FUN = 'sum')
## age sex cardiogenicShock entry
## 0 0 0 0
## Dx EF height weight
## 0 134 93 91
## BMI obesity TC LDLC
## 93 0 23 24
## HDLC TG DM HBP
## 23 15 0 0
## smoking
## 0
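# mytable() builds the Table 1 (descriptive statistics) of a medical paper ***
# It lays out demographic information in Table 1 form.
# (column (categorical) ~ row (numeric or categorical) variable1 + variable2 + ..., data=)
# 1. y : descriptive statistics of age (int) by sex (categorical)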
mytable(sex ~ age, data=acs)
##
## Descriptive Statistics by 'sex'
## __________________________________
## Female Male p
## (N=287) (N=570)
## ----------------------------------
## age 68.7 ± 10.7 60.6 ± 11.2 0.000
## ----------------------------------
# 2. y : descriptive statistics of BMI (num) by sex (categorical)
mytable(sex ~ BMI, data=acs)
##
## Descriptive Statistics by 'sex'
## __________________________________
## Female Male p
## (N=287) (N=570)
## ----------------------------------
## BMI 24.2 ± 3.6 24.3 ± 3.2 0.611
## ----------------------------------
# 3. y : by obesity (categorical) ~ age (int), BMI (num), smoking (categorical)
mytable(obesity ~ age+BMI+smoking , data = acs)
##
## Descriptive Statistics by 'obesity'
## ____________________________________________
## No Yes p
## (N=567) (N=290)
## --------------------------------------------
## age 64.7 ± 11.7 60.6 ± 11.3 0.000
## BMI 22.3 ± 2.0 27.5 ± 2.5 0.000
## smoking 0.688
## - Ex-smoker 130 (22.9%) 74 (25.5%)
## - Never 221 (39.0%) 111 (38.3%)
## - Smoker 216 (38.1%) 105 (36.2%)
## --------------------------------------------
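# 4. y : by obesity (categorical) + sex (categorical) ~ numeric + categorical
# - With two grouping variables, the table is stratified hierarchically in the order they are given.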
mytable(obesity + sex ~ age+BMI+smoking+height , data = acs)
##
## Descriptive Statistics stratified by 'obesity' and 'sex'
## ______________________________________________________________________________
## Yes No
## ------------------------------- -------------------------------
## Female Male p Female Male p
## (N=93) (N=197) (N=194) (N=373)
## ------------------------------------------------------------------------------
## age 66.9 ± 10.4 57.6 ± 10.4 0.000 69.5 ± 10.8 62.2 ± 11.3 0.000
## BMI 27.7 ± 2.7 27.4 ± 2.3 0.310 22.2 ± 2.1 22.4 ± 1.9 0.281
## smoking 0.000 0.000
## - Ex-smoker 16 (17.2%) 58 (29.4%) 33 (17.0%) 97 (26.0%)
## - Never 68 (73.1%) 43 (21.8%) 141 (72.7%) 80 (21.4%)
## - Smoker 9 ( 9.7%) 96 (48.7%) 20 (10.3%) 196 (52.5%)
## height 153.5 ± 6.4 168.1 ± 6.8 0.000 153.9 ± 6.2 167.8 ± 5.7 0.000
## ------------------------------------------------------------------------------
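Judging from the tables above, a numeric row is shown as mean ± sd per group and a categorical row as count (column %) per group. The base R sketch below reproduces those two kinds of cells on a toy data frame with made-up values; it is an approximation of the layout only, not moonBook's implementation, and it does not compute the p column.

```r
# Toy data (hypothetical): one grouping column, one numeric, one categorical
toy <- data.frame(
  group = c("A", "A", "A", "B", "B", "B"),
  value = c(1, 2, 3, 4, 6, 8),
  kind  = c("x", "y", "x", "y", "y", "y"),
  stringsAsFactors = FALSE
)

# Numeric row: mean and standard deviation within each group
grp_mean <- tapply(toy$value, toy$group, mean)
grp_sd   <- tapply(toy$value, toy$group, sd)
sprintf("%.1f ± %.1f", grp_mean, grp_sd)

# Categorical row: counts and column percentages within each group
counts  <- table(toy$kind, toy$group)
col_pct <- prop.table(counts, margin = 2) * 100
counts
round(col_pct, 1)
```

The same check works on the acs output: 130 ex-smokers out of N=567 non-obese patients is the 22.9% printed in table 3.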
moonBook - mylatex() function
- Store the result of mytable() in a variable, then pass it to mylatex(variable)
latex <- mytable(obesity ~ age+BMI+smoking , data = acs)
mylatex(latex)
## \begin{table}[!hbp]
## \begin{normalsize}
## \begin{tabular}{lccc}
## \multicolumn{4}{c}{Descriptive Statistics by obesity}\\
## \hline
## & No & Yes & \multirow{2}{*}{p}\\
## & (N=567) & (N=290) & \\
## \hline
## age & 64.7 ± 11.7 & 60.6 ± 11.3 & 0.000\\
## BMI & 22.3 ± 2.0 & 27.5 ± 2.5 & 0.000\\
## smoking & & & 0.688\\
## - Ex-smoker & 130 (22.9\%) & 74 (25.5\%) & \\
## - Never & 221 (39.0\%) & 111 (38.3\%) & \\
## - Smoker & 216 (38.1\%) & 105 (36.2\%) & \\
## \hline
## \end{tabular}
## \end{normalsize}
## \end{table}
moonBook - mycsv() function
- Store the result of mytable() in a variable, then call mycsv(variable, file = 'filename.csv')
csv <- mytable(obesity + sex ~ age+BMI+smoking+height , data = acs)
mycsv(csv, file = 'obesity and sex.csv')
Installing the mlr package and summarizeColumns()
- Like str(), it shows column name/type, plus NA count, basic descriptive statistics such as the mean, and the number of factor levels
# install.packages("mlr")
library(mlr)
## Loading required package: ParamHelpers
summarizeColumns(iris)
## name type na mean disp median mad min max
## 1 Sepal.Length numeric 0 5.843333 0.8280661 5.80 1.03782 4.3 7.9
## 2 Sepal.Width numeric 0 3.057333 0.4358663 3.00 0.44478 2.0 4.4
## 3 Petal.Length numeric 0 3.758000 1.7652982 4.35 1.85325 1.0 6.9
## 4 Petal.Width numeric 0 1.199333 0.7622377 1.30 1.03782 0.1 2.5
## 5 Species factor 0 NA 0.6666667 NA NA 50.0 50.0
## nlevs
## 1 0
## 2 0
## 3 0
## 4 0
## 5 3
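When mlr is not installed, part of this overview can be approximated in base R. The function summarize_cols below is a hypothetical helper written for this note, not part of mlr; it only covers the name, type, na, mean and nlevs columns and leaves out disp, median, mad, min and max.

```r
# Hypothetical helper: a reduced, base R imitation of the overview table
summarize_cols <- function(d) {
  data.frame(
    name  = names(d),
    type  = vapply(d, function(x) class(x)[1], character(1)),
    na    = vapply(d, function(x) sum(is.na(x)), integer(1)),
    # mean only makes sense for numeric columns; others get NA
    mean  = vapply(d, function(x) if (is.numeric(x)) mean(x, na.rm = TRUE) else NA_real_,
                   numeric(1)),
    # number of levels for factors, 0 otherwise
    nlevs = vapply(d, function(x) if (is.factor(x)) nlevels(x) else 0L, integer(1)),
    row.names = NULL,
    stringsAsFactors = FALSE
  )
}

res <- summarize_cols(iris)
res
```

On iris the mean and nlevs columns should line up with the summarizeColumns() output above, for example 5.843333 for Sepal.Length and 3 levels for Species.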