Applying Decision Trees, SVM, and Random Forests to Credit Rating

Below is our dataset for this exercise (the German Credit data): a set of features plus a credit-rating label, making this a binary classification problem.

This article compares how decision trees, SVM, and random forests perform on this dataset.

Import the data and check for missing values

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
data = pd.read_excel('./GermanCredit.xls', sheet_name='Data')  # read the Data sheet of the xls file
data.head()
num_features = ['DURATION','AMOUNT','INSTALL_RATE','AGE','NUM_CREDITS','NUM_DEPENDENTS']
cat_features = data.columns.drop(num_features + ['OBS#'])
data.isnull().sum()
# no missing values anywhere
OBS#                0
CHK_ACCT            0
DURATION            0
HISTORY             0
NEW_CAR             0
USED_CAR            0
FURNITURE           0
RADIO/TV            0
EDUCATION           0
RETRAINING          0
AMOUNT              0
SAV_ACCT            0
EMPLOYMENT          0
INSTALL_RATE        0
MALE_DIV            0
MALE_SINGLE         0
MALE_MAR_or_WID     0
CO-APPLICANT        0
GUARANTOR           0
PRESENT_RESIDENT    0
REAL_ESTATE         0
PROP_UNKN_NONE      0
AGE                 0
OTHER_INSTALL       0
RENT                0
OWN_RES             0
NUM_CREDITS         0
JOB                 0
NUM_DEPENDENTS      0
TELEPHONE           0
FOREIGN             0
RESPONSE            0
dtype: int64

Discretize the continuous features

DURATION is the loan term, spread between 4 and 72 months, with what looks like a right-skewed distribution (most of the mass at short terms, a long right tail); a histogram makes this clearer:

plt.hist(data['DURATION'])

(array([171., 262., 337.,  57.,  86.,  17.,  54.,   2.,  13.,   1.]),
 array([ 4. , 10.8, 17.6, 24.4, 31.2, 38. , 44.8, 51.6, 58.4, 65.2, 72. ]),
 <a list of 10 Patch objects>)

We discretize DURATION into four ordinal levels, converting it into a categorical feature with the cut points used in the loop below:

x ≤ 20       dua_rank = 1
20 < x ≤ 40  dua_rank = 2
40 < x < 60  dua_rank = 3
x ≥ 60       dua_rank = 4

We store the result as a new feature, dua_rank, in new_data; the same binning could also be done with sklearn's KBinsDiscretizer (a sketch follows the AMOUNT binning below).
dua_rank = []
duration = data['DURATION']
for i in duration:
    if i <= 20:
        dua_rank.append(1)
    elif i <= 40:
        dua_rank.append(2)
    elif i < 60:
        dua_rank.append(3)
    else:
        dua_rank.append(4)

Most of the durations fall into ranks 1 and 2:

plt.hist(dua_rank,bins = 4)
(array([554., 365.,  67.,  14.]),
 array([1.  , 1.75, 2.5 , 3.25, 4.  ]),
 <a list of 4 Patch objects>)

new_data = data.copy()
new_data['dua_rank'] = dua_rank
new_data.head()
[new_data.head() output: 5 rows × 33 columns, the original frame plus the new dua_rank column]

plt.hist(data['AMOUNT'])
(array([445., 293.,  97.,  80.,  38.,  19.,  14.,   8.,   5.,   1.]),
 array([  250. ,  2067.4,  3884.8,  5702.2,  7519.6,  9337. , 11154.4,
        12971.8, 14789.2, 16606.6, 18424. ]),
 <a list of 10 Patch objects>)

We likewise split AMOUNT into ranks 1-10, using the deciles as cut points.

Again, this binning could be done with sklearn's KBinsDiscretizer; see the sketch below.

percent = np.percentile(data['AMOUNT'], [i * 10 for i in range(1,10)])
amount_rank = []
# rank = 1 + number of deciles <= the value; equivalent to the original 10-way if/elif chain
amount_rank = np.searchsorted(percent, data['AMOUNT'], side='right') + 1
new_data['amount_rank'] = amount_rank
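As mentioned, sklearn's KBinsDiscretizer can express the same decile binning; a minimal sketch (encode='ordinal' yields bins 0-9, and values sitting exactly on a decile edge may land one rank away from the manual chain):

from sklearn.preprocessing import KBinsDiscretizer

# strategy='quantile' computes the same decile cut points as np.percentile above
kbd = KBinsDiscretizer(n_bins=10, encode='ordinal', strategy='quantile')
amount_rank_alt = kbd.fit_transform(data[['AMOUNT']]).astype(int).ravel() + 1  # ranks 1-10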
data['INSTALL_RATE'].value_counts()
4    476
2    231
3    157
1    136
Name: INSTALL_RATE, dtype: int64
INSTALL_RATE, the installment rate as a percentage of disposable income, can already be treated as a discrete variable, so it is left untouched.
AGE is discretized the same way as the features above:
data['AGE'].describe()
count    1000.000000
mean       35.546000
std        11.375469
min        19.000000
25%        27.000000
50%        33.000000
75%        42.000000
max        75.000000
Name: AGE, dtype: float64
percent = np.percentile(data['AGE'], [25, 50, 75])
age_rank = []
for i in data['AGE']:
    if i <= percent[0]:
        age_rank.append(1)
    elif i <= percent[1]:
        age_rank.append(2)
    elif i <= percent[2]:
        age_rank.append(3)
    else:
        age_rank.append(4)
new_data['age_rank'] = age_rank      
new_data.head()
[new_data.head() output: 5 rows × 35 columns, now also including amount_rank and age_rank]

data['NUM_CREDITS'].describe()
count    1000.000000
mean        1.407000
std         0.577654
min         1.000000
25%         1.000000
50%         1.000000
75%         2.000000
max         4.000000
Name: NUM_CREDITS, dtype: float64
NUM_CREDITS is the number of existing credits; it only takes values from 1 to 4, so it can also be left as-is:
plt.hist(data['NUM_CREDITS'],bins = 4)
(array([633., 333.,  28.,   6.]),
 array([1.  , 1.75, 2.5 , 3.25, 4.  ]),
 <a list of 4 Patch objects>)

data['NUM_DEPENDENTS'].describe()
count    1000.000000
mean        1.155000
std         0.362086
min         1.000000
25%         1.000000
50%         1.000000
75%         1.000000
max         2.000000
Name: NUM_DEPENDENTS, dtype: float64
NUM_DEPENDENTS takes only two values, so it needs no processing either:
plt.hist(data['NUM_DEPENDENTS'],bins = 2)
(array([845., 155.]), array([1. , 1.5, 2. ]), <a list of 2 Patch objects>)

Now drop the continuous features we just discretized and let the rank features stand in for them:
new_data.drop(['AGE','AMOUNT','DURATION'], axis = 1,inplace=True)
Features with no inherent ordering are one-hot encoded so that the model does not read magnitude into their codes.

So we one-hot encode the HISTORY and JOB features:

history = data['HISTORY']
new_history = pd.get_dummies(history,prefix='histor')
job = data['JOB']
new_job = pd.get_dummies(job, prefix= 'job')
new_data = pd.concat([new_data, new_history, new_job], axis = 1)
new_data.drop(['HISTORY', 'JOB'], axis = 1,inplace=True)
new_data
[new_data output: 1000 rows × 39 columns; HISTORY and JOB replaced by one-hot columns histor_0-histor_4 and job_0-job_3]
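As an aside, pandas can do the encode-concat-drop sequence above in a single call; a sketch equivalent to those five lines, not meant to be run in addition to them:

# get_dummies with columns= encodes the listed columns and drops the originals
new_data = pd.get_dummies(new_data, columns=['HISTORY', 'JOB'],
                          prefix=['histor', 'job'])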

new_data.info()
[output truncated: new_data has 1000 entries and 39 columns]
from sklearn import tree
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
import graphviz
from sklearn.svm import SVC
x = new_data.drop(['RESPONSE'], axis = 1)
y = new_data.loc[:,['RESPONSE']]
train_x, test_x, train_y, test_y = train_test_split(x, y, test_size = 0.2, random_state = 777)

Decision tree classifier

We use GridSearchCV to sweep over the candidate hyperparameters and keep the best model; the decision tree, SVM, and random forest below are all tuned this way, yielding the best parameters and the best estimator for each.

model = DecisionTreeClassifier(criterion='gini',max_depth=4,min_samples_split=4,max_features=6)
params = {'criterion': ['gini', 'entropy'],
          'max_depth': range(1, 30),
          'min_samples_split': range(2, 10),
          'min_samples_leaf': range(1, 6)}
cv = GridSearchCV(model,param_grid= params,n_jobs= -1,verbose=1,scoring='accuracy', cv = 5)
cv.fit(data.iloc[:,:-1], data.iloc[:,-1])
Fitting 5 folds for each of 2320 candidates, totalling 11600 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done 1348 tasks      | elapsed:    2.4s
[Parallel(n_jobs=-1)]: Done 8248 tasks      | elapsed:   15.7s
[Parallel(n_jobs=-1)]: Done 11600 out of 11600 | elapsed:   22.5s finished
GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=DecisionTreeClassifier(class_weight=None, criterion='gini',
                                              max_depth=4, max_features=6,
                                              max_leaf_nodes=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1, min_samples_split=4,
                                              min_weight_fraction_leaf=0.0,
                                              presort=False, random_state=None,
                                              splitter='best'),
             iid='warn', n_jobs=-1,
             param_grid={'criterion': ['gini', 'entropy'], 'max_depth': range(1, 30),
                         'min_samples_leaf': range(1, 6),
                         'min_samples_split': range(2, 10)},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring='accuracy', verbose=1)
model1 = cv.best_estimator_
cv.best_params_,cv.best_score_
({'criterion': 'gini', 'max_depth': 5, 'min_samples_leaf': 2, 'min_samples_split': 5},
 0.734)
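Worth noting: the train/test split created earlier is never actually consumed, because every GridSearchCV call here fits on the full frame with its own internal 5-fold CV. A minimal sketch of a genuinely held-out check for the tuned tree (tr_x, te_x, and best_tree are illustrative names, not from the original run):

from sklearn.base import clone

# Hold out 20% that the refit never sees, then score the tuned tree on it
tr_x, te_x, tr_y, te_y = train_test_split(data.iloc[:, :-1], data.iloc[:, -1],
                                          test_size=0.2, random_state=777)
best_tree = clone(model1).fit(tr_x, tr_y)
print('held-out accuracy:', best_tree.score(te_x, te_y))

Since the hyperparameters were tuned with CV over the full frame, this estimate is still mildly optimistic, but it is closer to deployment conditions than the CV score alone.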

Visualizing the feature importances of the decision tree

The most important features turn out to be CHK_ACCT, DURATION, AGE, and HISTORY:
plt.figure(figsize= (9,6))
plt.bar(data.iloc[:,:-1].columns, model1.feature_importances_)
plt.xticks(rotation = 90)
plt.show()
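A sorted listing makes the same ranking more precise; a small sketch using only what is already imported:

# Pair each column with its importance and list the strongest first
imp = pd.Series(model1.feature_importances_, index=data.iloc[:, :-1].columns)
print(imp.sort_values(ascending=False).head(10))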

cv.fit(x, y)
model2 = cv.best_estimator_
cv.best_params_,cv.best_score_
Fitting 5 folds for each of 2320 candidates, totalling 11600 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done 1328 tasks      | elapsed:    3.0s
[Parallel(n_jobs=-1)]: Done 7928 tasks      | elapsed:   18.5s
[Parallel(n_jobs=-1)]: Done 11600 out of 11600 | elapsed:   27.5s finished
({'criterion': 'gini', 'max_depth': 5, 'min_samples_leaf': 2, 'min_samples_split': 2},
 0.728)
model2
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=5,
                       max_features=6, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=2, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')

Running the decision tree on the preprocessed features, CHK_ACCT is still by far the most important feature while the others barely register, and the model performs slightly worse than on the unprocessed data:

plt.figure(figsize= (9,6))
plt.bar(x.columns, model2.feature_importances_)
plt.xticks(rotation = 90)
plt.show()

Visualizing the decision tree

graph_data = tree.export_graphviz(model2, out_file=None, feature_names=x.columns, filled=True, rounded=True)
graph = graphviz.Source(graph_data,)
graph
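To write the tree to an image file instead of displaying it inline, graphviz.Source offers render (this assumes the Graphviz system binaries are installed and on the PATH):

graph.render('credit_tree', format='png', cleanup=True)  # writes credit_tree.png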

SVM classifier

The SVM does not predict well on a dataset made up almost entirely of discrete variables, falling short of the decision tree: accuracy reaches only 0.7 on the unpreprocessed data and 0.653 on the discretized data.
model = SVC()
params = {'C':range(1,10)}
cv = GridSearchCV(model,param_grid=params, verbose = 1,cv = 5,scoring='accuracy',n_jobs=-1)
cv.fit(x, y)
model1 = cv.best_estimator_
cv.best_score_
Fitting 5 folds for each of 9 candidates, totalling 45 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  38 out of  45 | elapsed:    1.3s remaining:    0.2s
[Parallel(n_jobs=-1)]: Done  45 out of  45 | elapsed:    1.5s finished
F:\Anaconda3\lib\site-packages\sklearn\utils\validation.py:724: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)
F:\Anaconda3\lib\site-packages\sklearn\svm\base.py:193: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
0.653
cv.fit(data.iloc[:,:-1], data.iloc[:,-1])
model2 = cv.best_estimator_
cv.best_score_
Fitting 5 folds for each of 9 candidates, totalling 45 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  38 out of  45 | elapsed:    1.6s remaining:    0.2s
[Parallel(n_jobs=-1)]: Done  45 out of  45 | elapsed:    1.8s finished
F:\Anaconda3\lib\site-packages\sklearn\svm\base.py:193: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
0.7
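One caveat: SVMs are sensitive to feature scale, and neither run above standardizes the columns (the FutureWarning about gamma points at the same issue). A hedged sketch of retrying with scaling folded into a pipeline; svc__C is the pipeline-prefixed name of the same C parameter:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Scaling happens inside each CV fold, so nothing leaks from the validation split
svm_pipe = make_pipeline(StandardScaler(), SVC(gamma='scale'))
cv_svm = GridSearchCV(svm_pipe, param_grid={'svc__C': range(1, 10)},
                      scoring='accuracy', cv=5, n_jobs=-1)
cv_svm.fit(x, y.values.ravel())
print(cv_svm.best_score_)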

Random forest classifier

Ensemble classifiers clearly do better: best CV accuracy is 0.759 on the preprocessed data and 0.773 on the unprocessed data.
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=500, random_state=2)
params = {'n_estimators':range(1,1000)
}
cv = GridSearchCV(model, param_grid=params ,verbose = 1,n_jobs=-1, scoring='accuracy')
cv.fit(x,y)
rfc1= cv.best_estimator_
F:\Anaconda3\lib\site-packages\sklearn\model_selection\_split.py:1978: FutureWarning: The default value of cv will change from 3 to 5 in version 0.22. Specify it explicitly to silence this warning.
  warnings.warn(CV_WARNING, FutureWarning)
Fitting 3 folds for each of 999 candidates, totalling 2997 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  75 tasks      | elapsed:    3.7s
[Parallel(n_jobs=-1)]: Done 526 tasks      | elapsed:   30.7s
[Parallel(n_jobs=-1)]: Done 776 tasks      | elapsed:  1.1min
[Parallel(n_jobs=-1)]: Done 1126 tasks      | elapsed:  2.3min
[Parallel(n_jobs=-1)]: Done 1576 tasks      | elapsed:  4.5min
[Parallel(n_jobs=-1)]: Done 2126 tasks      | elapsed:  8.2min
[Parallel(n_jobs=-1)]: Done 2776 tasks      | elapsed: 14.1min
[Parallel(n_jobs=-1)]: Done 2997 out of 2997 | elapsed: 16.4min finished
F:\Anaconda3\lib\site-packages\sklearn\model_selection\_search.py:715: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().
  self.best_estimator_.fit(X, y, **fit_params)
rfc = cv.best_estimator_
rfc
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=673,
                       n_jobs=None, oob_score=False, random_state=2, verbose=0,
                       warm_start=False)
cv.best_score_
0.759
cv.fit(data.iloc[:,:-1],y.iloc[:,-1])
rfc2= cv.best_estimator_
F:\Anaconda3\lib\site-packages\sklearn\model_selection\_split.py:1978: FutureWarning: The default value of cv will change from 3 to 5 in version 0.22. Specify it explicitly to silence this warning.
  warnings.warn(CV_WARNING, FutureWarning)
Fitting 3 folds for each of 999 candidates, totalling 2997 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  75 tasks      | elapsed:    4.0s
[Parallel(n_jobs=-1)]: Done 423 tasks      | elapsed:   21.4s
[Parallel(n_jobs=-1)]: Done 673 tasks      | elapsed:   49.5s
[Parallel(n_jobs=-1)]: Done 1023 tasks      | elapsed:  1.9min
[Parallel(n_jobs=-1)]: Done 1473 tasks      | elapsed:  4.0min
[Parallel(n_jobs=-1)]: Done 2023 tasks      | elapsed:  7.5min
[Parallel(n_jobs=-1)]: Done 2673 tasks      | elapsed: 13.2min
[Parallel(n_jobs=-1)]: Done 2997 out of 2997 | elapsed: 16.5min finished
cv.best_score_
0.773
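The n_estimators grid over range(1, 1000) is what makes each search take roughly 16 minutes. A cheaper cross-check, sketched here under the same random_state, is the out-of-bag estimate that bagging gives almost for free:

# OOB accuracy approximates held-out accuracy without fitting 999 candidate forests
rf = RandomForestClassifier(n_estimators=500, oob_score=True,
                            random_state=2, n_jobs=-1)
rf.fit(x, y.values.ravel())
print('OOB accuracy:', rf.oob_score_)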

Summary

1. For the decision tree, the random forest, and the SVM alike, the preprocessed data actually scores lower than the raw data; a plausible cause is that discretization erased some of the detail in the original features, leaving the models underfit.
2. The SVM underperforms the tree models on a dataset where most features are 0/1 indicators.
3. Ensemble models such as the random forest beat single models, but need considerably more compute and time.
4. The feature with the greatest influence on the loan decision is CHK_ACCT, the checking-account status.
5. Tree models such as decision trees and random forests are also more interpretable in practice.

Overall performance is not great; next I want to try boosting tree ensembles such as XGBoost and LightGBM (a starting sketch follows).
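As a minimal starting point for that follow-up, assuming the xgboost package is installed (LightGBM's sklearn wrapper would slot in the same way):

from xgboost import XGBClassifier  # assumption: xgboost is installed
from sklearn.model_selection import cross_val_score

# 5-fold CV accuracy for a small boosted ensemble; parameters are illustrative, not tuned
xgb = XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1)
scores = cross_val_score(xgb, x, y.values.ravel(), cv=5, scoring='accuracy')
print('mean CV accuracy:', scores.mean())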
