국씨의 메모장: [Machine Learning] 4. Supervised Learning

Classification

- Predictive modeling for categorical or discrete values (or class)
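- Determines which group each case belongs to.
. Response to a promotional mail (whether the recipient replies)
. Whether a patient is a suitable candidate for a given surgical procedure
. Whether a credit rating is good or bad (or which grade it falls in)
- The model is built from training data, and its predictions are evaluated on data that was not used for training.
. Usually the data is split into training data and testing data.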
- Classification models
. Decision Trees
. Neural Networks
. Support Vector Machine
. Discriminant Analysis
. Logistic regression
. K-nearest neighbor

This post focuses mainly on logistic regression.
(Decision Tree, Neural Network, and Support Vector Machine will be covered in later posts.)

Logistic Regression

I was going to write out the equations myself but got stuck, so I looked up the wiki, and it explains things well.
~~(Ugh, should I just give up?)~~

https://ko.wikipedia.org/wiki/%EB%A1%9C%EC%A7%80%EC%8A%A4%ED%8B%B1_%ED%9A%8C%EA%B7%80

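Reading the wiki page up to, but not including, section 4 (Model fitting) is enough.
For the model fitting itself, unlike the method presented on the wiki, the coefficients will be computed with the gradient descent method. (By the computer.)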
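
To make the gradient descent idea concrete, here is a minimal sketch in R. The simulated data, `true_beta`, `learning_rate`, and the iteration count are all my own assumptions for illustration, not values from the course material. Note that R's `glm()` fits the model by iteratively reweighted least squares rather than gradient descent; it is used here only as a reference to check that both approaches reach roughly the same coefficients.

```r
# Minimal sketch: batch gradient descent for logistic regression.
# Data, learning rate, and iteration count are illustrative assumptions.
set.seed(1)
n <- 500
x <- cbind(1, rnorm(n), rnorm(n))          # design matrix with an intercept column
true_beta <- c(-0.5, 1.0, -2.0)            # hypothetical coefficients used to simulate y
y <- rbinom(n, 1, plogis(x %*% true_beta))

sigmoid <- function(z) 1 / (1 + exp(-z))

beta <- rep(0, ncol(x))                    # start from all zeros
learning_rate <- 0.5
for (i in 1:5000) {
  p <- sigmoid(x %*% beta)                 # predicted probabilities
  gradient <- t(x) %*% (p - y) / n         # gradient of the mean negative log-likelihood
  beta <- beta - learning_rate * as.vector(gradient)
}

# Reference fit: glm() solves the same problem by iteratively reweighted least squares
glm_beta <- coef(glm(y ~ x[, 2] + x[, 3], family = binomial()))
print(round(cbind(gradient_descent = beta, glm = glm_beta), 3))
```

The two columns printed at the end should agree closely, since both methods maximize the same likelihood.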

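R code example:
> # Read the tab-separated file; the first row holds the column names
> testdata <- read.table('buytest.txt', sep = '\t', header = TRUE)
> # Fit a logistic regression of RESPOND on X1..X10 (binomial family, logit link)
> lgstResult <- glm(RESPOND ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8 + X9 + X10,
+                   family = binomial(), data = testdata)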

As explained in the previous post (http://justkook.blogspot.kr/2017/04/machine-learning-3-supervised-learning.html), logistic regression is likewise not fitted just once: a method for selecting the X variables is needed in order to arrive at the best-fitting model.
(Backward Elimination, Forward Selection, Stepwise Selection)
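
The three selection strategies can be tried with R's built-in `step()` function, roughly as sketched here. The data frame `sim` is simulated (it is not buytest.txt), and the "true" effects of X1 and X2 are assumptions made up for the example. `step()` compares candidate models by AIC, which may not be the same criterion used in the previous post.

```r
# Sketch of variable selection with step(); data are simulated, not buytest.txt.
set.seed(2)
n <- 400
sim <- data.frame(X1 = rnorm(n), X2 = rnorm(n), X3 = rnorm(n), X4 = rnorm(n))
# Hypothetical truth: only X1 and X2 influence the response
sim$RESPOND <- rbinom(n, 1, plogis(-0.3 + 1.5 * sim$X1 - 1.2 * sim$X2))

full_model <- glm(RESPOND ~ X1 + X2 + X3 + X4, family = binomial(), data = sim)
null_model <- glm(RESPOND ~ 1, family = binomial(), data = sim)

# Backward elimination: start from the full model and drop terms
backward <- step(full_model, direction = "backward", trace = 0)

# Forward selection: start from the intercept-only model and add terms
forward <- step(null_model, scope = formula(full_model),
                direction = "forward", trace = 0)

# Stepwise selection: terms may be added or dropped at each step
stepwise <- step(null_model, scope = formula(full_model),
                 direction = "both", trace = 0)

print(formula(backward))
print(formula(forward))
print(formula(stepwise))
```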

Model Comparison

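The catch is that, unlike linear regression, logistic regression has no ordinary R-square. Coefficient p-values are still reported, but they do not tell how well the model classifies, so a different way of comparing models is needed.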

For that reason, the data is divided into three sets.
- Training data : used to build the regression model
- Validation data : used for model comparison
- Testing data : used to verify the selected model (final model)
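
A three-way split could look something like the sketch below. The 60/20/20 proportions and the simulated data frame `dat` are assumptions picked for illustration; nothing in the notes fixes the actual ratio.

```r
# Sketch: random 60/20/20 split into training, validation, and testing sets.
# Proportions and data are illustrative assumptions.
set.seed(3)
n <- 1000
dat <- data.frame(X1 = rnorm(n), RESPOND = rbinom(n, 1, 0.3))

# Shuffle a vector of set labels so each row lands in exactly one set
set_label <- sample(rep(c("train", "valid", "test"), times = c(600, 200, 200)))
train <- dat[set_label == "train", ]
valid <- dat[set_label == "valid", ]
test  <- dat[set_label == "test", ]

print(c(train = nrow(train), valid = nrow(valid), test = nrow(test)))
```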

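A few points need attention when choosing the training dataset.
- Draw a random sample so that it represents the whole data set.
- When the model targets a very rare response, enough observations of that response must be included.
. stratified sampling : when a plain random sample is not enough, the data is sometimes sampled so that each class appears in a fixed proportion.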
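
As a rough illustration of stratified sampling for a rare response, the sketch below draws the same number of rows from each class. The 2% response rate, the 100 rows per class, and the data frame `dat` are assumptions of mine.

```r
# Sketch: stratified sampling when the response is rare.
# Response rate and sample sizes are illustrative assumptions.
set.seed(4)
n <- 10000
dat <- data.frame(X1 = rnorm(n), RESPOND = rbinom(n, 1, 0.02))  # about 2% responders

# Draw a fixed number of rows from each class instead of sampling all rows at once
n_per_class <- 100
idx_by_class <- split(seq_len(nrow(dat)), dat$RESPOND)
sampled_idx <- unlist(lapply(idx_by_class, function(idx) sample(idx, n_per_class)))
train <- dat[sampled_idx, ]

print(table(dat$RESPOND))    # original class counts
print(table(train$RESPOND))  # balanced class counts after stratified sampling
```

Keep in mind that a sample balanced this way no longer reflects the original response rate, so predicted probabilities from a model trained on it are shifted accordingly.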

Model Comparison in Classification
- Accuracy
- Lift chart
- Profit chart
- ROC curve (AUROC)
- K-S statistics

Confusion Matrix

                     predicted class
                     0                 1
actual class   0     True Negative     False Positive
               1     False Negative    True Positive

- Accuracy = (true positive + true negative) / all cases
- Error = 1 - accuracy
- Sensitivity = true positive / (true positive + false negative)
- Specificity = true negative / (true negative + false positive)
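
A small worked example, using the standard definitions (accuracy is the share of all cases classified correctly, sensitivity is the share of actual positives that are caught, specificity is the share of actual negatives that are caught). The ten `actual` and `predicted` labels are made up for the example.

```r
# Sketch: confusion matrix metrics on ten made-up cases.
actual    <- c(0, 0, 0, 0, 0, 0, 1, 1, 1, 1)
predicted <- c(0, 0, 0, 0, 1, 1, 1, 1, 1, 0)

cm <- table(actual = actual, predicted = predicted)
tn <- cm["0", "0"]; fp <- cm["0", "1"]
fn <- cm["1", "0"]; tp <- cm["1", "1"]

accuracy    <- (tp + tn) / sum(cm)   # correct predictions over all cases
error       <- 1 - accuracy
sensitivity <- tp / (tp + fn)        # share of actual positives predicted as 1
specificity <- tn / (tn + fp)        # share of actual negatives predicted as 0

print(cm)
print(c(accuracy = accuracy, error = error,
        sensitivity = sensitivity, specificity = specificity))
```

Here TN = 4, FP = 2, FN = 1, TP = 3, so accuracy is 7/10, sensitivity is 3/4, and specificity is 4/6.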

Lift Chart
- Cumulative table
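
One common way to build the cumulative table behind a lift chart is to sort cases by predicted score, cut them into deciles, and accumulate the responders. The sketch below does that on simulated data; `score` is only a stand-in for predicted probabilities, and the relationship between `score` and `actual` is an assumption.

```r
# Sketch: cumulative lift table by score decile (simulated data).
set.seed(5)
n <- 1000
score <- runif(n)                       # stand-in for predicted probabilities
actual <- rbinom(n, 1, score * 0.4)     # responders are more likely at high scores

ord <- order(score, decreasing = TRUE)
decile <- rep(1:10, each = n / 10)      # decile 1 holds the highest scores
responders <- as.vector(tapply(actual[ord], decile, sum))

lift_table <- data.frame(
  decile = 1:10,
  responders = responders,
  cum_responders = cumsum(responders),
  # share of all responders captured so far
  cum_pct_captured = cumsum(responders) / sum(actual),
  # cumulative response rate relative to the overall response rate
  cum_lift = (cumsum(responders) / (1:10 * n / 10)) / mean(actual)
)
print(lift_table)
```

A cumulative lift above 1 in the top deciles means the model concentrates responders there better than random selection would; by the last decile it always returns to 1.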

Profit Chart
: Used to choose the cut-off value from the chart
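
To show how a cut-off might be picked from profit figures, here is a rough sketch. `revenue_per_response` and `cost_per_contact` are entirely hypothetical numbers, and the data are simulated; the point is only the mechanics of computing profit at each cut-off and taking the maximum.

```r
# Sketch: profit at each cut-off value (all monetary values are hypothetical).
set.seed(6)
n <- 1000
score <- runif(n)                       # stand-in for predicted probabilities
actual <- rbinom(n, 1, score * 0.4)

revenue_per_response <- 50   # hypothetical revenue from one responder
cost_per_contact     <- 5    # hypothetical cost of contacting one case

cutoffs <- seq(0.05, 0.95, by = 0.05)
profit <- sapply(cutoffs, function(cutoff) {
  contacted <- score >= cutoff          # contact only cases at or above the cut-off
  sum(actual[contacted]) * revenue_per_response - sum(contacted) * cost_per_contact
})

best_cutoff <- cutoffs[which.max(profit)]
print(data.frame(cutoff = cutoffs, profit = profit))
print(best_cutoff)
```

Plotting `profit` against `cutoffs` (for example with `plot(cutoffs, profit, type = "b")`) gives the chart itself.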

ROC Curve
- ROC stands for Receiver Operating Characteristic

Graph of ROC
y = True positive rate (Sensitivity)
x = False positive rate (1 - Specificity)

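AUROC (Area Under the ROC curve) is derived from this graph.
AUROC plays a role similar to R-square in linear regression
and can be thought of as a measure of how far the fitted model can be trusted: 0.5 corresponds to random guessing and 1 to perfect separation.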
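
The ROC curve and AUROC can be computed by hand in base R, roughly as follows. The data are simulated, and for brevity the scores are evaluated on the same data the model was fitted to; in practice the validation data would be used. The trapezoidal rule for the area is my choice for this sketch.

```r
# Sketch: ROC points and AUROC in base R (simulated data, in-sample scores).
set.seed(7)
n <- 500
x <- rnorm(n)
actual <- rbinom(n, 1, plogis(1.5 * x))
score <- fitted(glm(actual ~ x, family = binomial()))   # predicted probabilities

# Sweep the cut-off from high to low and record both rates at each step
cutoffs <- c(Inf, sort(unique(score), decreasing = TRUE))
tpr <- sapply(cutoffs, function(cutoff) sum(score >= cutoff & actual == 1) / sum(actual == 1))
fpr <- sapply(cutoffs, function(cutoff) sum(score >= cutoff & actual == 0) / sum(actual == 0))

# Area under the curve by the trapezoidal rule
auroc <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)
print(auroc)
```

The curve itself is `tpr` plotted against `fpr`, running from (0, 0) at the strictest cut-off to (1, 1) at the loosest.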

Thursday, April 20, 2017
