Big Data Analysis Engineer exam / Task Type 2 (ML)

Binary classification: evaluating performance and saving predictions

유방울 2023. 6. 6. 19:12

Building models with rf and xgb, the two that perform well

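from sklearn.ensemble import RandomForestClassifier
# make_sample and ModelTrain are helper functions defined outside this section
data = make_sample(seedno=1234, size=50000)
model_rf = RandomForestClassifier(n_estimators=500)
ModelTrain(model_rf, data)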

[(65243, 3), (16311, 3), (65243,), (16311,)]
train performance: 1.0
test performance: 0.9983446753724481

from xgboost import XGBClassifier
model_xgb = XGBClassifier(n_estimators=500)
ModelTrain(model_xgb, data)

train performance: 1.0
test performance: 0.9984084231145935

Confusion matrix

sklearn.metrics.confusion_matrix(y_true, y_pred, *, labels=None, sample_weight=None, normalize=None)

y_true : actual (ground-truth) values

y_pred : predicted values
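To make the roles of y_true and y_pred concrete, here is a minimal plain-Python sketch of the kind of tabulation confusion_matrix performs. The helper name tabulate_confusion and the toy labels are hypothetical, introduced only for illustration; it is not the library's implementation.

```python
from collections import Counter

def tabulate_confusion(y_true, y_pred, labels):
    """Count (actual, predicted) pairs into a square table.

    Rows follow the actual label and columns follow the predicted label,
    which is the layout sklearn's confusion_matrix uses.
    """
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(actual, predicted)] for predicted in labels]
            for actual in labels]

# Hypothetical toy labels: 0 = fail, 1 = pass
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 0, 1, 1, 1, 0, 1, 0]

matrix = tabulate_confusion(y_true, y_pred, labels=[0, 1])
for row in matrix:
    print(row)
```

The diagonal cells hold the correct predictions, and the off-diagonal cells hold the two kinds of error.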

 

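import pandas as pd
from sklearn.metrics import confusion_matrix

label = ['불합격', '합격']  # '불합격' = fail, '합격' = pass
print(model_rf.score(X, Y))  # overall accuracy
y_pred = model_rf.predict(X)
a = confusion_matrix(Y, y_pred)  # rows: actual, columns: predicted
b = pd.DataFrame(a, columns=label, index=label)
b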

0.9968727585433771


          불합격    합격
불합격    839456    1684
합격        1538  187623

# Now check the XGBClassifier's accuracy for each class.
from sklearn.metrics import confusion_matrix
label = ['불합격', '합격']
print(model_xgb.score(X, Y))
y_pred = model_xgb.predict(X)
a = confusion_matrix(Y, y_pred)
b = pd.DataFrame(a, columns=label, index=label)
b

0.9977666720696184
          불합격    합격
불합격    840049    1091
합격        1210  187951
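The accuracy printed by score can be recovered from each table, and the same four counts also give precision and recall for the 합격 (pass) class. The sketch below copies the counts from the two tables above; the helper name summarize is hypothetical, and it assumes rows are actual labels and columns are predicted labels.

```python
def summarize(matrix):
    """Derive accuracy, precision and recall for the positive class
    from a 2x2 confusion matrix laid out as [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = matrix
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,   # correct predictions / all rows
        "precision": tp / (tp + fp),     # of predicted passes, how many were real
        "recall": tp / (tp + fn),        # of real passes, how many were found
    }

# Counts copied from the two tables above (rows: actual, columns: predicted)
cm_rf = [[839456, 1684], [1538, 187623]]
cm_xgb = [[840049, 1091], [1210, 187951]]

for name, cm in [("rf", cm_rf), ("xgb", cm_xgb)]:
    stats = summarize(cm)
    print(name, {key: round(value, 4) for key, value in stats.items()})
```

The accuracy values agree with the scores printed above, which is a quick way to confirm that the table and the score come from the same predictions.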

model.predict_proba

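# Get the probability of passing (using different data)
data = make_sample(seedno=1234, size=6)
x_test = data[['국어', '영어', '수학']]  # columns: Korean, English, math
y_test = data['합격여부']  # pass/fail label
print(y_test.to_numpy())  # actual values
print(model_xgb.predict(x_test))  # predicted values
proba = model_xgb.predict_proba(x_test)  # important: one probability per class for each row
print(proba)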

[0 1 0 0 0 0 1 1 1 1]
[0 1 0 0 0 0 1 1 1 1]
[[1.0000000e+00 4.4052449e-11]
 [0.0000000e+00 1.0000000e+00]
 [9.9985445e-01 1.4554607e-04]
 [1.0000000e+00 1.4111021e-09]
 [9.9999529e-01 4.7255057e-06]
 [1.0000000e+00 6.1994535e-11]
 [0.0000000e+00 1.0000000e+00]
 [0.0000000e+00 1.0000000e+00]
 [0.0000000e+00 1.0000000e+00]
 [0.0000000e+00 1.0000000e+00]]
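The link between predict_proba and predict can be checked by hand. The sketch below copies the second column (probability of 합격) from the output printed above and applies a 0.5 cut-off, on the assumption that a two-class predict simply picks the more probable class.

```python
# Class-1 (합격) probabilities copied from the second column printed above
p_pass = [4.4052449e-11, 1.0, 1.4554607e-04, 1.4111021e-09, 4.7255057e-06,
          6.1994535e-11, 1.0, 1.0, 1.0, 1.0]

# Assumed cut-off: with two classes, p > 0.5 means class 1 is the more likely one
threshold = 0.5
labels = [int(p > threshold) for p in p_pass]
print(labels)
```

The resulting labels match the predicted values printed above. Raising or lowering the threshold is one way to trade precision against recall when the raw probabilities are available.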

Saving the predictions

submission = pd.DataFrame()
submission['id'] = pd.RangeIndex(1, len(X) + 1)  # 1-based row ids
submission['prob'] = model_xgb.predict_proba(X)[:, 1]  # second column: probability of 합격 (pass)
submission.to_csv('submission.csv', index=False)
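Before handing in a file like this, it can help to read it back and confirm its shape. The following is a minimal sketch under stated assumptions: the probabilities are hypothetical stand-ins for predict_proba(X)[:, 1], and an in-memory buffer replaces 'submission.csv' so that no file is written.

```python
import io

import pandas as pd

# Hypothetical class-1 probabilities standing in for predict_proba(X)[:, 1]
probs = [0.02, 0.97, 0.51]

submission = pd.DataFrame({
    "id": range(1, len(probs) + 1),  # 1-based ids, as in the code above
    "prob": probs,
})

# Write to an in-memory buffer instead of 'submission.csv'
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
buffer.seek(0)

# Read the CSV back and run a few sanity checks
check = pd.read_csv(buffer)
assert list(check.columns) == ["id", "prob"]  # index=False leaves no extra column
assert check["id"].tolist() == [1, 2, 3]
assert check["prob"].between(0, 1).all()      # values are valid probabilities
print(check)
```

Without index=False, to_csv would also write the DataFrame index as an unnamed first column, which the column check above would catch.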