Gender prediction model from customer purchase data


유방울 2023. 6. 7. 00:23

1. Loading the data

x_train

import pandas as pd

X = pd.read_csv('X_train', encoding='cp949')
print(X.head(2))

 cust_id      총구매액     최대구매액         환불금액 주구매상품 주구매지점  내점일수  내점당구매건수  \
0        0  68282840  11264000 6860000.0000    기타   강남점    19   3.8947   
1        1   2136000   2136000  300000.0000   스포츠   잠실점     2   1.5000   

   주말방문비율  구매주기  
0  0.5270    17  
1  0.0000     1

y_train

Y = pd.read_csv('y_train.csv')
print(Y.head(2))

   cust_id  gender
0        0       0
1        1       0

x_test

X_submission = pd.read_csv('X_test', encoding='cp949')
print(X_submission.head(2))

   cust_id       총구매액     최대구매액          환불금액 주구매상품 주구매지점  내점일수  내점당구매건수  \
0     3500   70900400  22000000  4050000.0000    골프  부산본점    13   1.4615   
1     3501  310533100  38558000 48034700.0000   농산물   잠실점    90   2.4333   

   주말방문비율  구매주기  
0  0.7895    26  
1  0.3699     3

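2. Data preprocessing

Combine X and X_submission

A plain concat keeps each frame's own index, so the row labels repeat (info() below reports 5982 entries labeled 0 to 2481).

object columns -> 주구매상품, 주구매지점 -> these need encoding!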

dfX = pd.concat([X, X_submission])
dfX.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5982 entries, 0 to 2481
Data columns (total 10 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   cust_id  5982 non-null   int64  
 1   총구매액     5982 non-null   int64  
 2   최대구매액    5982 non-null   int64  
 3   환불금액     2076 non-null   float64
 4   주구매상품    5982 non-null   object 
 5   주구매지점    5982 non-null   object 
 6   내점일수     5982 non-null   int64  
 7   내점당구매건수  5982 non-null   float64
 8   주말방문비율   5982 non-null   float64
 9   구매주기     5982 non-null   int64  
dtypes: float64(3), int64(5), object(2)
memory usage: 514.1+ KB

dfX = pd.concat([X, X_submission], ignore_index = True)
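To see why the second concat passes ignore_index, here is a minimal sketch on two toy frames. The frames a and b and their values are made up for illustration; they only stand in for X and X_submission.

```python
import pandas as pd

# Hypothetical stand-ins for X and X_submission
a = pd.DataFrame({"cust_id": [0, 1, 2]})
b = pd.DataFrame({"cust_id": [3500, 3501]})

kept = pd.concat([a, b])                      # each frame keeps its own row labels
reset = pd.concat([a, b], ignore_index=True)  # labels are renumbered from 0

print(kept.index.tolist())
print(reset.index.tolist())
print(kept.index.is_unique, reset.index.is_unique)
```

With repeated labels, label-based operations such as .loc[0] return more than one row, so resetting the index up front avoids surprises later.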

환불금액 has 3906 missing values

# [6] Check the missing values per column in dfX
dfX.isnull().sum()

cust_id       0
총구매액          0
최대구매액         0
환불금액       3906
주구매상품         0
주구매지점         0
내점일수          0
내점당구매건수       0
주말방문비율        0
구매주기          0
dtype: int64

Check the mean 환불금액 for each 주구매상품

temp = dfX.groupby('주구매상품')['환불금액'].transform('mean')
print(temp)
0      18435255.9690
1      13091245.5882
2       5945761.5385
3      18435255.9690
4       3200000.0000
            ...     
5977   34571470.9615
5978    5305093.3333
5979    5945761.5385
5980   16213347.0588
5981   20808626.1716
Name: 환불금액, Length: 5982, dtype: float64

# Rows 0 and 3 are both 기타, so they get the same value
dfX['주구매상품'].head(5)

0        기타
1       스포츠
2    남성 캐주얼
3        기타
4        보석
Name: 주구매상품, dtype: object

2-1 Handling missing values

dfX['환불금액'] = dfX['환불금액'].mask(dfX['환불금액'].isna(), temp)  # fill missing 환불금액 with the per-product means in temp

dfX[dfX['환불금액'].isna()] 


cust_id	총구매액	최대구매액	환불금액	주구매상품	주구매지점	내점일수	내점당구매건수	주말방문비율	구매주기
1021	1021	3190800	2494800	nan	통신/컴퓨터	영등포점	2	1.5000	0.6667	61
1521	1521	178000	178000	nan	소형가전	본 점	1	1.0000	1.0000	0
1712	1712	4578000	3948000	nan	통신/컴퓨터	잠실점	2	1.0000	0.5000	0
2035	2035	260000	260000	nan	소형가전	잠실점	1	1.0000	0.0000	0
3003	3003	5850000	4200000	nan	악기	잠실점	2	1.0000	0.5000	5
3256	3256	39100000	39100000	nan	통신/컴퓨터	울산점	1	1.0000	0.0000	0
3434	3434	898000	836000	nan	악기	광주점	2	1.5000	0.6667	88
3764	3764	5006320	3718000	nan	통신/컴퓨터	부산본점	2	1.0000	1.0000	0
3960	3960	5013000	5013000	nan	통신/컴퓨터	본 점	1	1.0000	1.0000	0
5450	5450	2256000	2256000	nan	통신/컴퓨터	본 점	1	1.0000	1.0000	0
5937	5937	365230000	164700000	nan	악기	영등포점	4	1.7500	0.4286	64

# [10] Where no per-주구매상품 mean exists, fill 환불금액 with its overall mean.
# Then confirm that no missing values remain.
dfX['환불금액'] = dfX['환불금액'].fillna(dfX['환불금액'].mean())
dfX.isna().sum().sum()

0
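The two-step fill above (group mean first, overall mean as a fallback) can be traced on a toy frame. This is only a sketch: the column names product and refund and all values are hypothetical, chosen so that one group has no observed value at all.

```python
import numpy as np
import pandas as pd

# Hypothetical data: group C has no observed refund, like the 11 leftover rows above
df = pd.DataFrame({
    "product": ["A", "A", "B", "B", "C"],
    "refund": [100.0, np.nan, 300.0, 500.0, np.nan],
})

# Step 1: per-group mean, aligned row by row (A -> 100, B -> 400, C -> NaN)
group_mean = df.groupby("product")["refund"].transform("mean")
df["refund"] = df["refund"].mask(df["refund"].isna(), group_mean)

# Step 2: anything still missing gets the overall mean of the column
df["refund"] = df["refund"].fillna(df["refund"].mean())

print(df)
```

Group C stays NaN after step 1 because a mean over only missing values is itself missing, which is exactly why the fallback step is needed.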

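Check correlations

# If two features X1 and X2 are highly correlated, drop one of them: a coefficient near -1 or 1 is not good
# Drop one of the pair if something like 0.98 shows up
print(dfX.corr(numeric_only=True))  # numeric_only skips the object columns; pandas 2.x raises without it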

         cust_id    총구매액   최대구매액    환불금액    내점일수  내점당구매건수  주말방문비율    구매주기
cust_id   1.0000  0.0206  0.0210  0.0113 -0.0017  -0.0055 -0.0179 -0.0029
총구매액      0.0206  1.0000  0.6826  0.3773  0.6484   0.1050  0.0160 -0.2126
최대구매액     0.0210  0.6826  1.0000  0.3726  0.3602   0.0291  0.0163 -0.1128
환불금액      0.0113  0.3773  0.3726  1.0000  0.2362  -0.0270 -0.0174 -0.0754
내점일수     -0.0017  0.6484  0.3602  0.2362  1.0000   0.2303 -0.0036 -0.2953
내점당구매건수  -0.0055  0.1050  0.0291 -0.0270  0.2303   1.0000  0.0110 -0.0781
주말방문비율   -0.0179  0.0160  0.0163 -0.0174 -0.0036   0.0110  1.0000 -0.0135
구매주기     -0.0029 -0.2126 -0.1128 -0.0754 -0.2953  -0.0781 -0.0135  1.0000
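Reading the matrix by eye works for 8 columns; for wider data the same check can be scripted. The sketch below uses a made-up frame and a threshold of 0.9, which is my assumption rather than a value from the exercise. In the matrix above the largest off-diagonal value is 0.6826, so nothing would be flagged there.

```python
import numpy as np
import pandas as pd

# Hypothetical data: b is exactly 2 * a, c is only weakly related to a
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
})

corr = df.corr().abs()
# Keep only the upper triangle so each pair appears once and the diagonal is dropped
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()
high = pairs[pairs > 0.9]   # assumed threshold

print(high)
```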

Use stratified sampling

# [12] Check the distribution of Y['gender']: female (62.4%), male (37.6%)
# The two classes are not balanced, so plan on a stratified split!
temp = Y['gender'].value_counts(normalize=True)
temp

0   0.6240
1   0.3760
Name: gender, dtype: float64
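What stratification buys can be checked directly. This is a small sketch on synthetic labels with the same 62.4% / 37.6% ratio; the array sizes and the dummy feature column are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the ratio reported above: 624 zeros, 376 ones
y = np.array([0] * 624 + [1] * 376)
X = np.arange(len(y)).reshape(-1, 1)   # dummy feature, not used for anything

_, _, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Share of class 1 in each split; both should sit close to 0.376
print(y_tr.mean(), y_te.mean())
```

Without stratify, the class shares in the two splits can drift apart by chance, which makes the test scores harder to compare across runs.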

2-2 Encoding

Check each column's unique values before encoding

-> nominal variables with no order

# [17] List the distinct values of '주구매지점' in dfX
# A nominal variable with no order
A = dfX['주구매지점'].unique()
print(A)

['강남점' '잠실점' '관악점' '광주점' '본  점' '일산점' '대전점' '부산본점' '분당점' '영등포점' '미아점'
 '청량리점' '안양점' '부평점' '동래점' '포항점' '노원점' '창원점' '센텀시티점' '인천점' '대구점' '전주점'
 '울산점' '상인점']
 
B = dfX['주구매상품'].unique()
print(B)

['기타' '스포츠' '남성 캐주얼' '보석' '디자이너' '시티웨어' '명품' '농산물' '화장품' '골프' '구두' '가공식품'
 '수산품' '아동' '차/커피' '캐주얼' '섬유잡화' '육류' '축산가공' '젓갈/반찬' '액세서리' '피혁잡화' '일용잡화'
 '주방가전' '주방용품' '건강식품' '가구' '주류' '모피/피혁' '남성 트랜디' '셔츠' '남성정장' '생활잡화'
 '트래디셔널' '란제리/내의' '커리어' '침구/수예' '대형가전' '통신/컴퓨터' '식기' '소형가전' '악기']

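Label encoding

# [19] Apply label encoding to '주구매지점' and '주구매상품'
# With a large amount of data, one-hot encoding is an option
# The categories have no order and there are not many records here, so use label encoding!
# (this is the case where the categories have no order)
dfX['주구매지점'] = dfX['주구매지점'].astype('category').cat.codes
dfX['주구매상품'] = dfX['주구매상품'].astype('category').cat.codes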
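For comparison, here is a rough sketch of what cat.codes produces next to one-hot encoding. The romanized branch names are hypothetical stand-ins for the real values.

```python
import pandas as pd

# Hypothetical branch names
s = pd.Series(["gangnam", "jamsil", "gwanak", "jamsil"], name="branch")

# Label encoding: categories are sorted, then numbered 0, 1, 2, ...
codes = s.astype("category").cat.codes
print(codes.tolist())

# One-hot encoding: one indicator column per category
onehot = pd.get_dummies(s)
print(onehot.shape)
```

The codes follow sorted order, not any real ranking, so distance-based models such as KNN may read meaning into gaps that are arbitrary. One-hot avoids that at the cost of one column per category (24 branches and 42 products in this data).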

3. Splitting the data, building and training the models

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.metrics import roc_auc_score

def get_scores(model, xtrain, xtest, ytrain, ytest):
    A = model.score(xtrain, ytrain)             # accuracy on the training split
    B = model.score(xtest, ytest)               # accuracy on the held-out split
    ypred = model.predict_proba(xtest)[:, 1]    # predicted probability of class 1
    C = roc_auc_score(ytest, ypred)             # ROC AUC on the held-out split
    return '{:.4f} {:.4f} {:.4f}'.format(A, B, C)
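ROC AUC is computed from scores, not hard labels, which is why the probability column is the one to pass in. A minimal sketch on synthetic data follows; make_classification and every parameter value here are my assumptions, not part of the exercise.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary problem, for illustration only
X, y = make_classification(n_samples=500, n_features=5, random_state=0)
xtr, xte, ytr, yte = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)
model = LogisticRegression(max_iter=500).fit(xtr, ytr)

labels = model.predict(xte)        # 1-D array of 0/1 labels
proba = model.predict_proba(xte)   # 2-D array: one column per class

print(labels.shape, proba.shape)
auc = roc_auc_score(yte, proba[:, 1])   # column 1 = probability of class 1
print(round(auc, 4))
```

Because predict returns a 1-D array, indexing it with [:, 1] raises an IndexError; the two-column output only exists on predict_proba.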

def make_models(xtrain, xtest, ytrain, ytest):
    model1 = LogisticRegression(max_iter=500).fit(xtrain, ytrain)
    print('model1', get_scores(model1, xtrain, xtest, ytrain, ytest))

    # k-nearest neighbors for k = 1..9
    for k in range(1, 10):
        model2 = KNeighborsClassifier(k).fit(xtrain, ytrain)
        print('model2', k, get_scores(model2, xtrain, xtest, ytrain, ytest))

    # decision tree: unlimited depth first, then max_depth = 3..7
    model3 = DecisionTreeClassifier(random_state=0).fit(xtrain, ytrain)
    print('model3', get_scores(model3, xtrain, xtest, ytrain, ytest))
    for d in range(3, 8):
        model3 = DecisionTreeClassifier(max_depth=d, random_state=0).fit(xtrain, ytrain)
        print('model3', d, get_scores(model3, xtrain, xtest, ytrain, ytest))

    # random forest: defaults first, then max_depth = 3..7
    model4 = RandomForestClassifier(random_state=0).fit(xtrain, ytrain)
    print('model4', get_scores(model4, xtrain, xtest, ytrain, ytest))
    for rf in range(3, 8):
        # n_estimators=500 is borrowed from the final model below; the original loop did not state it
        model4 = RandomForestClassifier(500, max_depth=rf, random_state=0).fit(xtrain, ytrain)
        print('model4', rf, get_scores(model4, xtrain, xtest, ytrain, ytest))

    model5 = XGBClassifier(eval_metric='logloss', use_label_encoder=False).fit(xtrain, ytrain)
    print('model5', get_scores(model5, xtrain, xtest, ytrain, ytest))

The names carry a 1 suffix (X1_use, Y1) because more than one version of the training data may be prepared.

def get_data(dfX, Y):
    X = dfX.drop(columns=['cust_id'])
    X_use = X.iloc[:3500, :]           # rows that came from X (these have labels)
    X_submission = X.iloc[3500:, :]    # rows that came from X_submission
    Y1 = Y['gender']
    scaler = StandardScaler()
    X1_use = scaler.fit_transform(X_use)              # fit the scaler on the labeled rows only
    X1_submission = scaler.transform(X_submission)    # reuse the same scaling
    print(X1_use.shape, X1_submission.shape, Y1.shape)
    return X1_use, X1_submission, Y1
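The fit_transform / transform pair in get_data can be checked on a tiny example. The arrays below are made up; the point is that the second array is scaled with the mean and standard deviation learned from the first.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[1.0], [2.0], [3.0]])   # hypothetical labeled rows
new = np.array([[2.0], [5.0]])            # hypothetical submission rows

scaler = StandardScaler().fit(train)       # learns mean 2.0 from train only
train_scaled = scaler.transform(train)
new_scaled = scaler.transform(new)         # scaled with train's statistics

print(scaler.mean_)
print(new_scaled.ravel())
```

Calling fit_transform again on the submission rows would give them a different scale from the one the model was trained on.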

# Split the data
X1_use, X1_submission, Y1 = get_data(dfX,Y)

xtrain, xtest, ytrain, ytest = train_test_split(X1_use, Y1, test_size=0.3, stratify=Y1, random_state=0)

make_models(xtrain, xtest, ytrain, ytest)

(3500, 9) (2482, 9) (3500,)
model1 0.6322 0.6410 0.6582
model2 1 1.0000 0.5676 0.5369
model2 2 0.7955 0.6114 0.5560
model2 3 0.7951 0.6019 0.5770
model2 4 0.7404 0.6305 0.5967
model2 5 0.7449 0.6181 0.6009
model2 6 0.7188 0.6190 0.5953
model2 7 0.7159 0.6000 0.6042
model2 8 0.7143 0.6248 0.6155
model2 9 0.7045 0.6219 0.6138
model3 1.0000 0.5533 0.5314
model3 3 0.6486 0.6657 0.6781
model3 4 0.6645 0.6467 0.6717
model3 5 0.6812 0.6410 0.6542
model3 6 0.7024 0.6343 0.6522
model3 7 0.7078 0.6381 0.6494
model4 1.0000 0.6543 0.6509
model4 3 0.6649 0.6448 0.6892
model4 4 0.6739 0.6514 0.6963
model4 5 0.6959 0.6686 0.6971
model4 6 0.7314 0.6590 0.6989
model4 7 0.7669 0.6648 0.6984
model5 0.9902 0.6410 0.6307

Select the final model

model = RandomForestClassifier(500, max_depth=6,random_state=0).fit(xtrain, ytrain)
print('final model', get_scores(model, xtrain, xtest, ytrain, ytest))
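One way to read the scores printed earlier is to rank the candidates by held-out ROC AUC and then look at the gap between train and test accuracy. The sketch below copies a few rows from that output; the depth= labels are my own naming for the number printed after the model name.

```python
# (name, train accuracy, test accuracy, test ROC AUC), copied from the printed results
results = [
    ("model1", 0.6322, 0.6410, 0.6582),
    ("model3 depth=3", 0.6486, 0.6657, 0.6781),
    ("model4 depth=5", 0.6959, 0.6686, 0.6971),
    ("model4 depth=6", 0.7314, 0.6590, 0.6989),
    ("model4 depth=7", 0.7669, 0.6648, 0.6984),
    ("model5", 0.9902, 0.6410, 0.6307),
]

best = max(results, key=lambda r: r[3])    # highest held-out AUC
gap = best[1] - best[2]                    # train minus test accuracy

print(best[0], round(gap, 4))
```

model5 shows the opposite pattern: near-perfect training accuracy with a lower AUC, which points to overfitting rather than a better model.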

Create the submission data

Check the columns

X_submission.columns

Index(['cust_id', '총구매액', '최대구매액', '환불금액', '주구매상품', '주구매지점', '내점일수', '내점당구매건수',
       '주말방문비율', '구매주기'],
      dtype='object')

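Only the first column, cust_id, is kept!

# Predict on the scaled array X1_submission; the raw X_submission frame still holds unencoded text columns
pred = model.predict_proba(X1_submission)[:, 1]
submission = pd.DataFrame({'cust_id': X_submission['cust_id'],
                           'gender': pred})
submission.to_csv('submission.csv', index=False)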
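Before handing in the file it is worth looking at what actually gets written. A small sketch with made-up ids and probabilities; calling to_csv without a path returns the CSV text instead of writing a file.

```python
import numpy as np
import pandas as pd

# Hypothetical ids and predict_proba output (columns: class 0, class 1)
cust_id = pd.Series([3500, 3501, 3502], name="cust_id")
proba = np.array([[0.7, 0.3], [0.4, 0.6], [0.9, 0.1]])

submission = pd.DataFrame({"cust_id": cust_id, "gender": proba[:, 1]})
csv_text = submission.to_csv(index=False)   # no path given, so the text is returned

print(csv_text)
```

With index=False the file has exactly two columns; leaving it out adds an unnamed index column in front.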
