
对kaggle:travel
1.导入数据
from matplotlib import pyplot as plt
%matplotlib inline
plt.style.use('fivethirtyeight')
import seaborn as sns
import pandas as pd
sns.set()
import warnings
warnings.filterwarnings('ignore')
导入数据时出现错误,参考:
df = pd.read_csv(r'C:/Users/admin/Desktop/论文/kaggle1/travel-insurance.csv',engine='python')
df.head()
| Agency | Agency Type | Distribution Channel | Product Name | Claim | Duration | Destination | Net Sales | Commision (in value) | Gender | Age |
---|
0 | CBH | Travel Agency | Offline | Comprehensive Plan | No | 186 | MALAYSIA | -29.0 | 9.57 | F | 81 |
---|
1 | CBH | Travel Agency | Offline | Comprehensive Plan | No | 186 | MALAYSIA | -29.0 | 9.57 | F | 71 |
---|
2 | CWT | Travel Agency | Online | Rental Vehicle Excess Insurance | No | 65 | AUSTRALIA | -49.5 | 29.70 | NaN | 32 |
---|
3 | CWT | Travel Agency | Online | Rental Vehicle Excess Insurance | No | 60 | AUSTRALIA | -39.6 | 23.76 | NaN | 32 |
---|
4 | CWT | Travel Agency | Online | Rental Vehicle Excess Insurance | No | 79 | ITALY | -19.8 | 11.88 | NaN | 41 |
---|
2.属性介绍及分析
order = ['Agency', 'Agency Type', 'Distribution Channel', 'Product Name', 'Duration','Destination','Net Sales', 'Commision (in value)', 'Age','Gender','Claim']
df = df[order]
df.columns = ['Agency', 'AgencyType', 'DistributionChannel', 'ProductName', 'Duration','Destination','NetSales', 'Commision', 'Age','Gender','Claim']
df.head()
| Agency | AgencyType | DistributionChannel | ProductName | Duration | Destination | NetSales | Commision | Age | Gender | Claim |
---|
0 | CBH | Travel Agency | Offline | Comprehensive Plan | 186 | MALAYSIA | -29.0 | 9.57 | 81 | F | No |
---|
1 | CBH | Travel Agency | Offline | Comprehensive Plan | 186 | MALAYSIA | -29.0 | 9.57 | 71 | F | No |
---|
2 | CWT | Travel Agency | Online | Rental Vehicle Excess Insurance | 65 | AUSTRALIA | -49.5 | 29.70 | 32 | NaN | No |
---|
3 | CWT | Travel Agency | Online | Rental Vehicle Excess Insurance | 60 | AUSTRALIA | -39.6 | 23.76 | 32 | NaN | No |
---|
4 | CWT | Travel Agency | Online | Rental Vehicle Excess Insurance | 79 | ITALY | -19.8 | 11.88 | 41 | NaN | No |
---|
2.1数据的基本了解
df.shape
(63326, 11)
df.info()
<class frame.DataFrame'>
RangeIndex: 63326 entries, 0 to 63325
Data columns (total 11 columns):
Agency 63326 non-null object
AgencyType 63326 non-null object
DistributionChannel 63326 non-null object
ProductName 63326 non-null object
Duration 63326 non-null int64
Destination 63326 non-null object
NetSales 63326 non-null float64
Commision 63326 non-null float64
Age 63326 non-null int64
Gender 18219 non-null object
Claim 63326 non-null object
dtypes: float64(2), int64(2), object(7)
memory usage: 5.3+ MB
df.isnull().sum()
Agency 0
AgencyType 0
DistributionChannel 0
ProductName 0
Duration 0
Destination 0
NetSales 0
Commision 0
Age 0
Gender 45107
Claim 0
dtype: int64
数据总共63326条,但只有性别这个属性存在较多缺失值
df.describe()
| Duration | NetSales | Commision | Age |
---|
count | 63326.000000 | 63326.000000 | 63326.000000 | 63326.000000 |
---|
mean | 49.317074 | 40.702018 | 9.809992 | 39.969981 |
---|
std | 101.791566 | 48.845637 | 19.804388 | 14.017010 |
---|
min | -2.000000 | -389.000000 | 0.000000 | 0.000000 |
---|
25% | 9.000000 | 18.000000 | 0.000000 | 35.000000 |
---|
50% | 22.000000 | 26.530000 | 0.000000 | 36.000000 |
---|
75% | 53.000000 | 48.000000 | 11.550000 | 43.000000 |
---|
max | 4881.000000 | 810.000000 | 283.500000 | 118.000000 |
---|
从上面可以看出Duration的最小值是-2,但是旅行的时间应该是正值,所以应考虑把负值的样本删除。
2.2 对属性进行分析
属性Claim
df['Claim'].value_counts()
No 62399
Yes 927
Name: Claim, dtype: int64
属性Agency
df['Agency'].value_counts()
EPX 35119
CWT 8580
C2B 8267
JZI 6329
SSI 1056
JWT 749
RAB 725
LWC 689
TST 528
KML 392
ART 331
CCR 194
CBH 101
TTW 98
CSR 86
ADM 82
Name: Agency, dtype: int64
plt.figure(figsize=(15,7))
untplot(x='Agency', data=df,orient='h')
plt.show()
df['Claim'].groupby(df['Agency']).value_counts()
Agency Claim
ADM No 82
ART No 330Yes 1
C2B No 7720Yes 547
CBH No 100Yes 1
CCR No 191Yes 3
CSR No 85Yes 1
CWT No 8494Yes 86
EPX No 34924Yes 195
JWT No 745Yes 4
JZI No 6298Yes 31
KML No 384Yes 8
LWC No 653Yes 36
RAB No 724Yes 1
SSI No 1049Yes 7
TST No 526Yes 2
TTW No 94Yes 4
Name: Claim, dtype: int64
pd.crosstab(df.Agency,df.Claim,margins=True).style.background_gradient()
Claim | No | Yes | All |
---|
Agency | | | |
---|
ADM | 82 | 0 | 82 |
---|
ART | 330 | 1 | 331 |
---|
C2B | 7720 | 547 | 8267 |
---|
CBH | 100 | 1 | 101 |
---|
CCR | 191 | 3 | 194 |
---|
CSR | 85 | 1 | 86 |
---|
CWT | 8494 | 86 | 8580 |
---|
EPX | 34924 | 195 | 35119 |
---|
JWT | 745 | 4 | 749 |
---|
JZI | 6298 | 31 | 6329 |
---|
KML | 384 | 8 | 392 |
---|
LWC | 653 | 36 | 689 |
---|
RAB | 724 | 1 | 725 |
---|
SSI | 1049 | 7 | 1056 |
---|
TST | 526 | 2 | 528 |
---|
TTW | 94 | 4 | 98 |
---|
All | 62399 | 927 | 63326 |
---|
AGency = pd.read_excel('C:/Users/admin/Desktop/论文/kaggle1/Agency.xlsx')
AGency['Percent']=AGency['Yes']/AGency['All']
AGency.sort_values("Percent",inplace=True)
AGency
| Agency | No | Yes | All | Percent |
---|
0 | ADM | 54 | 0 | 54 | 0.000000 |
---|
12 | RAB | 709 | 1 | 710 | 0.001408 |
---|
1 | ART | 315 | 1 | 316 | 0.003165 |
---|
14 | TST | 517 | 2 | 519 | 0.003854 |
---|
9 | JZI | 6128 | 30 | 6158 | 0.004872 |
---|
8 | JWT | 716 | 4 | 720 | 0.005556 |
---|
7 | EPX | 34188 | 193 | 34381 | 0.005614 |
---|
13 | SSI | 1024 | 7 | 1031 | 0.006790 |
---|
3 | CBH | 93 | 1 | 94 | 0.010638 |
---|
5 | CSR | 85 | 1 | 86 | 0.011628 |
---|
6 | CWT | 7132 | 86 | 7218 | 0.011915 |
---|
4 | CCR | 187 | 3 | 190 | 0.015789 |
---|
10 | KML | 380 | 8 | 388 | 0.020619 |
---|
15 | TTW | 94 | 4 | 98 | 0.040816 |
---|
11 | LWC | 623 | 36 | 659 | 0.054628 |
---|
2 | C2B | 7531 | 547 | 8078 | 0.067715 |
---|
从以上数据可以看到索赔率高的代理商可能是索赔率低的代理商的20倍(0.067715是0.003165的20倍左右),
所以不同代理商之间还是有明显差异的。应保留这个特征,同时有些代理商的样本数量足够多,
可以考虑直接对某一个代理商的样本进行建模。
属性Agency Type
df['AgencyType'].value_counts()
Travel Agency 45869
Airlines 17457
Name: AgencyType, dtype: int64
pd.crosstab(df.AgencyType,df.Claim,margins=True).style.background_gradient()
Claim | No | Yes | All |
---|
AgencyType | | | |
---|
Airlines | 16866 | 591 | 17457 |
---|
Travel Agency | 45533 | 336 | 45869 |
---|
All | 62399 | 927 | 63326 |
---|
AgencyType这个属性里面:Travel Agency的数量多,但是索赔人数少;Airlines的人数少,但是索赔人数多。
所以代理商的类别是航空、旅行社对索赔有明显影响,此属性保留。
属性Distribution Channel
df['DistributionChannel'].value_counts()
Online 62219
Offline 1107
Name: DistributionChannel, dtype: int64
pd.crosstab(df.DistributionChannel,df.Claim,margins=True).style.background_gradient()
Claim | No | Yes | All |
---|
DistributionChannel | | | |
---|
Offline | 1090 | 17 | 1107 |
---|
Online | 61309 | 910 | 62219 |
---|
All | 62399 | 927 | 63326 |
---|
忽略其他因素,分别索赔率:
Offline:17/1107=0.0153
Online:910/62219=0.0146
All:927/63326=0.0146
差异不是很大,结合现实也可以知道,购买保险人是否索赔应该与自己的经历有关,而与当初购买的方式关系不大。故此属性删除
属性ProductName
df['ProductName'].value_counts()
Cancellation Plan 18630
2 way Comprehensive Plan 13158
Rental Vehicle Excess Insurance 8580
Basic Plan 5469
Bronze Plan 4049
1 way Comprehensive Plan 3331
Value Plan 2715
Silver Plan 2249
Annual Silver Plan 1423
Ticket Protector 1056
Travel Cruise Protect 527
Comprehensive Plan 364
Gold Plan 352
24 Protect 247
Single Trip Travel Protect Gold 204
Annual Gold Plan 194
Premier Plan 194
Single Trip Travel Protect Silver 173
Annual Travel Protect Gold 100
Annual Travel Protect Silver 86
Individual Comprehensive Plan 74
Single Trip Travel Protect Platinum 73
Annual Travel Protect Platinum 53
Spouse or Parents Comprehensive Plan 15
Child Comprehensive Plan 9
Travel Cruise Protect Family 1
Name: ProductName, dtype: int64
pd.crosstab(df.ProductName,df.Claim,margins=True).style.background_gradient()
Claim | No | Yes | All |
---|
ProductName | | | |
---|
1 way Comprehensive Plan | 3322 | 9 | 3331 |
---|
2 way Comprehensive Plan | 13016 | 142 | 13158 |
---|
24 Protect | 247 | 0 | 247 |
---|
Annual Gold Plan | 173 | 21 | 194 |
---|
Annual Silver Plan | 1266 | 157 | 1423 |
---|
Annual Travel Protect Gold | 90 | 10 | 100 |
---|
Annual Travel Protect Platinum | 50 | 3 | 53 |
---|
Annual Travel Protect Silver | 82 | 4 | 86 |
---|
Basic Plan | 5446 | 23 | 5469 |
---|
Bronze Plan | 3839 | 210 | 4049 |
---|
Cancellation Plan | 18586 | 44 | 18630 |
---|
Child Comprehensive Plan | 9 | 0 | 9 |
---|
Comprehensive Plan | 359 | 5 | 364 |
---|
Gold Plan | 334 | 18 | 352 |
---|
Individual Comprehensive Plan | 71 | 3 | 74 |
---|
Premier Plan | 191 | 3 | 194 |
---|
Rental Vehicle Excess Insurance | 8494 | 86 | 8580 |
---|
Silver Plan | 2108 | 141 | 2249 |
---|
Single Trip Travel Protect Gold | 194 | 10 | 204 |
---|
Single Trip Travel Protect Platinum | 68 | 5 | 73 |
---|
Single Trip Travel Protect Silver | 169 | 4 | 173 |
---|
Spouse or Parents Comprehensive Plan | 14 | 1 | 15 |
---|
Ticket Protector | 1049 | 7 | 1056 |
---|
Travel Cruise Protect | 525 | 2 | 527 |
---|
Travel Cruise Protect Family | 1 | 0 | 1 |
---|
Value Plan | 2696 | 19 | 2715 |
---|
All | 62399 | 927 | 63326 |
---|
Productnames = pd.read_excel('C:/Users/admin/Desktop/论文/kaggle1/Product Name.xlsx')
Productnames['Percent']=Productnames['Yes']/Productnames['All']
Productnames.sort_values("Percent",inplace=True)
Productnames
| Claim | No | Yes | All | Percent |
---|
2 | 24 Protect | 247 | 0 | 247 | 0.000000 |
---|
24 | Travel Cruise Protect Family | 1 | 0 | 1 | 0.000000 |
---|
11 | Child Comprehensive Plan | 9 | 0 | 9 | 0.000000 |
---|
10 | Cancellation Plan | 18586 | 44 | 18630 | 0.002362 |
---|
0 | 1 way Comprehensive Plan | 3322 | 9 | 3331 | 0.002702 |
---|
23 | Travel Cruise Protect | 525 | 2 | 527 | 0.003795 |
---|
8 | Basic Plan | 5446 | 23 | 5469 | 0.004206 |
---|
22 | Ticket Protector | 1049 | 7 | 1056 | 0.006629 |
---|
25 | Value Plan | 2696 | 19 | 2715 | 0.006998 |
---|
16 | Rental Vehicle Excess Insurance | 8494 | 86 | 8580 | 0.010023 |
---|
1 | 2 way Comprehensive Plan | 13016 | 142 | 13158 | 0.010792 |
---|
12 | Comprehensive Plan | 359 | 5 | 364 | 0.013736 |
---|
15 | Premier Plan | 191 | 3 | 194 | 0.015464 |
---|
20 | Single Trip Travel Protect Silver | 169 | 4 | 173 | 0.023121 |
---|
14 | Individual Comprehensive Plan | 71 | 3 | 74 | 0.040541 |
---|
7 | Annual Travel Protect Silver | 82 | 4 | 86 | 0.046512 |
---|
18 | Single Trip Travel Protect Gold | 194 | 10 | 204 | 0.049020 |
---|
13 | Gold Plan | 334 | 18 | 352 | 0.051136 |
---|
9 | Bronze Plan | 3839 | 210 | 4049 | 0.051865 |
---|
6 | Annual Travel Protect Platinum | 50 | 3 | 53 | 0.056604 |
---|
17 | Silver Plan | 2108 | 141 | 2249 | 0.062695 |
---|
21 | Spouse or Parents Comprehensive Plan | 14 | 1 | 15 | 0.066667 |
---|
19 | Single Trip Travel Protect Platinum | 68 | 5 | 73 | 0.068493 |
---|
5 | Annual Travel Protect Gold | 90 | 10 | 100 | 0.100000 |
---|
3 | Annual Gold Plan | 173 | 21 | 194 | 0.108247 |
---|
4 | Annual Silver Plan | 1266 | 157 | 1423 | 0.110330 |
---|
同理,容易看出不同保险产品之间的索赔率差异较大,此属性保留。
属性Destination
df['Destination'].value_counts().head(10)
SINGAPORE 13255
MALAYSIA 5930
THAILAND 5894
CHINA 4796
AUSTRALIA 3694
INDONESIA 3452
UNITED STATES 2530
PHILIPPINES 2490
HONG KONG 2411
INDIA 2251
Name: Destination, dtype: int64
由于国家有149个,这里就不列出了,发现前10名的目的地去的人数很多,后续可以考虑仅针对去某个国家的客户进行研究。
属性Age
plt.figure(figsize=(15,7))
df['Age'].hist(bins=12)
plt.xlim(0,120)
plt.xlabel('Age',fontsize=14)
plt.ylabel('count',fontsize=14)
plt.show()
年龄分布范围比较广泛,所以保留比较好
3.数据清洗与准备
df.drop(["Destination","Gender","DistributionChannel"],inplace=True,axis=1)
df = df.loc[(df["Duration"]>0) & (df["NetSales"]>0)]
df.shape
(60700, 8)
删除的样本数量:63326-60700=2626
df1 = df.loc[(df["Agency"]=="EPX")]
df1.drop(["Agency"],inplace=True,axis=1)
df1.head()
| AgencyType | ProductName | Duration | NetSales | Commision | Age | Claim |
---|
33 | Travel Agency | 2 way Comprehensive Plan | 9 | 25.0 | 0.0 | 36 | No |
---|
34 | Travel Agency | 2 way Comprehensive Plan | 12 | 24.0 | 0.0 | 36 | No |
---|
35 | Travel Agency | 2 way Comprehensive Plan | 45 | 20.0 | 0.0 | 36 | No |
---|
36 | Travel Agency | 2 way Comprehensive Plan | 35 | 27.0 | 0.0 | 36 | No |
---|
37 | Travel Agency | 2 way Comprehensive Plan | 28 | 20.0 | 0.0 | 44 | No |
---|
df1['Claim'].value_counts()
No 34188
Yes 193
Name: Claim, dtype: int64
df1['ProductName'].value_counts()
Cancellation Plan 18212
2 way Comprehensive Plan 12907
1 way Comprehensive Plan 3262
Name: ProductName, dtype: int64
df1['Commision'].value_counts() #删除
0.0 34381
Name: Commision, dtype: int64
将分类变量转为示性变量
dummyX = pd.get_dummies(df1[['ProductName']])
dummyX.head()
| ProductName_1 way Comprehensive Plan | ProductName_2 way Comprehensive Plan | ProductName_Cancellation Plan |
---|
33 | 0 | 1 | 0 |
---|
34 | 0 | 1 | 0 |
---|
35 | 0 | 1 | 0 |
---|
36 | 0 | 1 | 0 |
---|
37 | 0 | 1 | 0 |
---|
from sklearn import preprocessing
# 标签矩阵
labelList= df1["Claim"]
# 将标签矩阵二值化
lb = preprocessing.LabelBinarizer()
Y=lb.fit_transform(labelList)
Y
array([[0],[0],[0],..., [0],[0],[0]])
合并特征
X = pd.concat([dummyX,df1[['Duration','NetSales','Age']]], axis=1)
X.head()
| ProductName_1 way Comprehensive Plan | ProductName_2 way Comprehensive Plan | ProductName_Cancellation Plan | Duration | NetSales | Age |
---|
33 | 0 | 1 | 0 | 9 | 25.0 | 36 |
---|
34 | 0 | 1 | 0 | 12 | 24.0 | 36 |
---|
35 | 0 | 1 | 0 | 45 | 20.0 | 36 |
---|
36 | 0 | 1 | 0 | 35 | 27.0 | 36 |
---|
37 | 0 | 1 | 0 | 28 | 20.0 | 44 |
---|
平衡样本
from imblearn.over_sampling import SMOTE
sm = SMOTE(random_state=42)
df_x,df_y = sm.fit_sample(X,Y)
df_x= pd.DataFrame(df_x)
lumns = X.columns
df_y = pd.DataFrame(df_y)
lumns = ["Claim"]
X = df_x
Y = df_y
决策树
del_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.2, random_state = 1)
# Decision Tree's
import DecisionTreeClassifier
classifier = DecisionTreeClassifier()
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
ics import accuracy_score
print('accuracy is',accuracy_score(y_pred,y_test))
accuracy is 0.983620941796
可视化直接崩掉!!,电脑卡住了,样本数太大了。
import graphviz
import pydotplus
from sklearn import tree
dot_data = port_graphviz(classifier, out_file=None,feature_nameslumns,class_names=['0', '1'], #类别为0,1,也可以设置为yes,nofilled=True, rounded=True, #rounded=True,框是圆边special_characters=True)
graph = aph_from_dot_data(dot_data)
with open('1.png', 'wb') as f:f.ate_png())