对kaggle:travel

阅读: 评论:0

对kaggle:travel

对kaggle:travel

1.导入数据
from matplotlib import pyplot as plt
%matplotlib inline
plt.style.use('fivethirtyeight')
import seaborn as sns
import pandas as pd
sns.set()
import warnings
warnings.filterwarnings('ignore')
导入数据时出现错误,参考:
df = pd.read_csv(r'C:/Users/admin/Desktop/论文/kaggle1/travel-insurance.csv',engine='python')
df.head()
AgencyAgency TypeDistribution ChannelProduct NameClaimDurationDestinationNet SalesCommision (in value)GenderAge
0CBHTravel AgencyOfflineComprehensive PlanNo186MALAYSIA-29.09.57F81
1CBHTravel AgencyOfflineComprehensive PlanNo186MALAYSIA-29.09.57F71
2CWTTravel AgencyOnlineRental Vehicle Excess InsuranceNo65AUSTRALIA-49.529.70NaN32
3CWTTravel AgencyOnlineRental Vehicle Excess InsuranceNo60AUSTRALIA-39.623.76NaN32
4CWTTravel AgencyOnlineRental Vehicle Excess InsuranceNo79ITALY-19.811.88NaN41
2.属性介绍及分析
order = ['Agency', 'Agency Type', 'Distribution Channel', 'Product Name', 'Duration','Destination','Net Sales', 'Commision (in value)', 'Age','Gender','Claim']
df = df[order]
df.columns = ['Agency', 'AgencyType', 'DistributionChannel', 'ProductName', 'Duration','Destination','NetSales', 'Commision', 'Age','Gender','Claim']
df.head()
AgencyAgencyTypeDistributionChannelProductNameDurationDestinationNetSalesCommisionAgeGenderClaim
0CBHTravel AgencyOfflineComprehensive Plan186MALAYSIA-29.09.5781FNo
1CBHTravel AgencyOfflineComprehensive Plan186MALAYSIA-29.09.5771FNo
2CWTTravel AgencyOnlineRental Vehicle Excess Insurance65AUSTRALIA-49.529.7032NaNNo
3CWTTravel AgencyOnlineRental Vehicle Excess Insurance60AUSTRALIA-39.623.7632NaNNo
4CWTTravel AgencyOnlineRental Vehicle Excess Insurance79ITALY-19.811.8841NaNNo
2.1数据的基本了解
df.shape
(63326, 11)
df.info()
<class &#frame.DataFrame'>
RangeIndex: 63326 entries, 0 to 63325
Data columns (total 11 columns):
Agency                 63326 non-null object
AgencyType             63326 non-null object
DistributionChannel    63326 non-null object
ProductName            63326 non-null object
Duration               63326 non-null int64
Destination            63326 non-null object
NetSales               63326 non-null float64
Commision              63326 non-null float64
Age                    63326 non-null int64
Gender                 18219 non-null object
Claim                  63326 non-null object
dtypes: float64(2), int64(2), object(7)
memory usage: 5.3+ MB
df.isnull().sum()
Agency                     0
AgencyType                 0
DistributionChannel        0
ProductName                0
Duration                   0
Destination                0
NetSales                   0
Commision                  0
Age                        0
Gender                 45107
Claim                      0
dtype: int64
数据总共63326条,但只有性别这个属性存在较多缺失值
df.describe()
DurationNetSalesCommisionAge
count63326.00000063326.00000063326.00000063326.000000
mean49.31707440.7020189.80999239.969981
std101.79156648.84563719.80438814.017010
min-2.000000-389.0000000.0000000.000000
25%9.00000018.0000000.00000035.000000
50%22.00000026.5300000.00000036.000000
75%53.00000048.00000011.55000043.000000
max4881.000000810.000000283.500000118.000000
从上面可以看出Duration的最小值是-2,但是旅行的时间应该是正值,所以应考虑把负值的样本删除。
2.2 对属性进行分析
属性Claim
df['Claim'].value_counts()
No     62399
Yes      927
Name: Claim, dtype: int64
属性Agency
df['Agency'].value_counts()
EPX    35119
CWT     8580
C2B     8267
JZI     6329
SSI     1056
JWT      749
RAB      725
LWC      689
TST      528
KML      392
ART      331
CCR      194
CBH      101
TTW       98
CSR       86
ADM       82
Name: Agency, dtype: int64
plt.figure(figsize=(15,7))
untplot(x='Agency', data=df,orient='h')
plt.show()

df['Claim'].groupby(df['Agency']).value_counts()
Agency  Claim
ADM     No          82
ART     No         330Yes          1
C2B     No        7720Yes        547
CBH     No         100Yes          1
CCR     No         191Yes          3
CSR     No          85Yes          1
CWT     No        8494Yes         86
EPX     No       34924Yes        195
JWT     No         745Yes          4
JZI     No        6298Yes         31
KML     No         384Yes          8
LWC     No         653Yes         36
RAB     No         724Yes          1
SSI     No        1049Yes          7
TST     No         526Yes          2
TTW     No          94Yes          4
Name: Claim, dtype: int64
pd.crosstab(df.Agency,df.Claim,margins=True).style.background_gradient()
ClaimNoYesAll
Agency
ADM82082
ART3301331
C2B77205478267
CBH1001101
CCR1913194
CSR85186
CWT8494868580
EPX3492419535119
JWT7454749
JZI6298316329
KML3848392
LWC65336689
RAB7241725
SSI104971056
TST5262528
TTW94498
All6239992763326
AGency = pd.read_excel('C:/Users/admin/Desktop/论文/kaggle1/Agency.xlsx')
AGency['Percent']=AGency['Yes']/AGency['All']
AGency.sort_values("Percent",inplace=True)
AGency
AgencyNoYesAllPercent
0ADM540540.000000
12RAB70917100.001408
1ART31513160.003165
14TST51725190.003854
9JZI61283061580.004872
8JWT71647200.005556
7EPX34188193343810.005614
13SSI1024710310.006790
3CBH931940.010638
5CSR851860.011628
6CWT71328672180.011915
4CCR18731900.015789
10KML38083880.020619
15TTW944980.040816
11LWC623366590.054628
2C2B753154780780.067715
从以上数据可以看到索赔率高的代理商可能是索赔率低的代理商的20倍(0.067715是0.003165的20倍左右),
所以不同代理商之间还是有明显差异的。应保留这个特征,同时有些代理商的样本数量足够多,
可以考虑直接对某一个代理商的样本进行建模。
属性Agency Type
df['AgencyType'].value_counts()
Travel Agency    45869
Airlines         17457
Name: AgencyType, dtype: int64
pd.crosstab(df.AgencyType,df.Claim,margins=True).style.background_gradient()
ClaimNoYesAll
AgencyType
Airlines1686659117457
Travel Agency4553333645869
All6239992763326
AgencyType这个属性里面:Travel Agency的数量多,但是索赔人数少;Airlines的人数少,但是索赔人数多。
所以代理商的类别是航空、旅行社对索赔有明显影响,此属性保留。
属性Distribution Channel
df['DistributionChannel'].value_counts()
Online     62219
Offline     1107
Name: DistributionChannel, dtype: int64
pd.crosstab(df.DistributionChannel,df.Claim,margins=True).style.background_gradient()
ClaimNoYesAll
DistributionChannel
Offline1090171107
Online6130991062219
All6239992763326
忽略其他因素,分别索赔率:
Offline:17/1107=0.0153
Online:910/62219=0.0146
All:927/63326=0.0146
差异不是很大,结合现实也可以知道,购买保险人是否索赔应该与自己的经历有关,而与当初购买的方式关系不大。故此属性删除
属性ProductName
df['ProductName'].value_counts()
Cancellation Plan                       18630
2 way Comprehensive Plan                13158
Rental Vehicle Excess Insurance          8580
Basic Plan                               5469
Bronze Plan                              4049
1 way Comprehensive Plan                 3331
Value Plan                               2715
Silver Plan                              2249
Annual Silver Plan                       1423
Ticket Protector                         1056
Travel Cruise Protect                     527
Comprehensive Plan                        364
Gold Plan                                 352
24 Protect                                247
Single Trip Travel Protect Gold           204
Annual Gold Plan                          194
Premier Plan                              194
Single Trip Travel Protect Silver         173
Annual Travel Protect Gold                100
Annual Travel Protect Silver               86
Individual Comprehensive Plan              74
Single Trip Travel Protect Platinum        73
Annual Travel Protect Platinum             53
Spouse or Parents Comprehensive Plan       15
Child Comprehensive Plan                    9
Travel Cruise Protect Family                1
Name: ProductName, dtype: int64
pd.crosstab(df.ProductName,df.Claim,margins=True).style.background_gradient()
ClaimNoYesAll
ProductName
1 way Comprehensive Plan332293331
2 way Comprehensive Plan1301614213158
24 Protect2470247
Annual Gold Plan17321194
Annual Silver Plan12661571423
Annual Travel Protect Gold9010100
Annual Travel Protect Platinum50353
Annual Travel Protect Silver82486
Basic Plan5446235469
Bronze Plan38392104049
Cancellation Plan185864418630
Child Comprehensive Plan909
Comprehensive Plan3595364
Gold Plan33418352
Individual Comprehensive Plan71374
Premier Plan1913194
Rental Vehicle Excess Insurance8494868580
Silver Plan21081412249
Single Trip Travel Protect Gold19410204
Single Trip Travel Protect Platinum68573
Single Trip Travel Protect Silver1694173
Spouse or Parents Comprehensive Plan14115
Ticket Protector104971056
Travel Cruise Protect5252527
Travel Cruise Protect Family101
Value Plan2696192715
All6239992763326
Productnames = pd.read_excel('C:/Users/admin/Desktop/论文/kaggle1/Product Name.xlsx')
Productnames['Percent']=Productnames['Yes']/Productnames['All']
Productnames.sort_values("Percent",inplace=True)
Productnames
ClaimNoYesAllPercent
224 Protect24702470.000000
24Travel Cruise Protect Family1010.000000
11Child Comprehensive Plan9090.000000
10Cancellation Plan1858644186300.002362
01 way Comprehensive Plan3322933310.002702
23Travel Cruise Protect52525270.003795
8Basic Plan54462354690.004206
22Ticket Protector1049710560.006629
25Value Plan26961927150.006998
16Rental Vehicle Excess Insurance84948685800.010023
12 way Comprehensive Plan13016142131580.010792
12Comprehensive Plan35953640.013736
15Premier Plan19131940.015464
20Single Trip Travel Protect Silver16941730.023121
14Individual Comprehensive Plan713740.040541
7Annual Travel Protect Silver824860.046512
18Single Trip Travel Protect Gold194102040.049020
13Gold Plan334183520.051136
9Bronze Plan383921040490.051865
6Annual Travel Protect Platinum503530.056604
17Silver Plan210814122490.062695
21Spouse or Parents Comprehensive Plan141150.066667
19Single Trip Travel Protect Platinum685730.068493
5Annual Travel Protect Gold90101000.100000
3Annual Gold Plan173211940.108247
4Annual Silver Plan126615714230.110330
同理,容易看出不同保险产品之间的索赔率差异较大,此属性保留。
属性Destination
df['Destination'].value_counts().head(10)
SINGAPORE        13255
MALAYSIA          5930
THAILAND          5894
CHINA             4796
AUSTRALIA         3694
INDONESIA         3452
UNITED STATES     2530
PHILIPPINES       2490
HONG KONG         2411
INDIA             2251
Name: Destination, dtype: int64
由于国家有149个,这里就不列出了,发现前10名的目的地去的人数很多,后续可以考虑仅针对去某个国家的客户进行研究。
属性Age
plt.figure(figsize=(15,7))
df['Age'].hist(bins=12)
plt.xlim(0,120)
plt.xlabel('Age',fontsize=14)
plt.ylabel('count',fontsize=14)
plt.show()

年龄分布范围比较广泛,所以保留比较好
3.数据清洗与准备
df.drop(["Destination","Gender","DistributionChannel"],inplace=True,axis=1)
df = df.loc[(df["Duration"]>0) & (df["NetSales"]>0)]
df.shape
(60700, 8)
删除的样本数量:63326-60700=2626
df1 = df.loc[(df["Agency"]=="EPX")]
df1.drop(["Agency"],inplace=True,axis=1)
df1.head()
AgencyTypeProductNameDurationNetSalesCommisionAgeClaim
33Travel Agency2 way Comprehensive Plan925.00.036No
34Travel Agency2 way Comprehensive Plan1224.00.036No
35Travel Agency2 way Comprehensive Plan4520.00.036No
36Travel Agency2 way Comprehensive Plan3527.00.036No
37Travel Agency2 way Comprehensive Plan2820.00.044No
df1['Claim'].value_counts()
No     34188
Yes      193
Name: Claim, dtype: int64
df1['ProductName'].value_counts()
Cancellation Plan           18212
2 way Comprehensive Plan    12907
1 way Comprehensive Plan     3262
Name: ProductName, dtype: int64
df1['Commision'].value_counts()   #删除
0.0    34381
Name: Commision, dtype: int64
将分类变量转为示性变量
dummyX = pd.get_dummies(df1[['ProductName']])
dummyX.head()
ProductName_1 way Comprehensive PlanProductName_2 way Comprehensive PlanProductName_Cancellation Plan
33010
34010
35010
36010
37010
from sklearn import preprocessing
# 标签矩阵
labelList= df1["Claim"]
# 将标签矩阵二值化
lb = preprocessing.LabelBinarizer()
Y=lb.fit_transform(labelList)
Y
array([[0],[0],[0],..., [0],[0],[0]])
合并特征
X = pd.concat([dummyX,df1[['Duration','NetSales','Age']]], axis=1)
X.head()
ProductName_1 way Comprehensive PlanProductName_2 way Comprehensive PlanProductName_Cancellation PlanDurationNetSalesAge
33010925.036
340101224.036
350104520.036
360103527.036
370102820.044
平衡样本
from imblearn.over_sampling import SMOTE
sm = SMOTE(random_state=42)
df_x,df_y = sm.fit_sample(X,Y)
df_x= pd.DataFrame(df_x)
lumns = X.columns
df_y = pd.DataFrame(df_y)
lumns = ["Claim"]
X = df_x
Y = df_y
决策树
del_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.2, random_state = 1)
# Decision Tree's
 import DecisionTreeClassifier
classifier = DecisionTreeClassifier()
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
ics import accuracy_score
print('accuracy is',accuracy_score(y_pred,y_test))
accuracy is 0.983620941796
可视化直接崩掉!!,电脑卡住了,样本数太大了。
import graphviz
import pydotplus
from sklearn import tree
dot_data = port_graphviz(classifier, out_file=None,feature_names&#lumns,class_names=['0', '1'],     #类别为0,1,也可以设置为yes,nofilled=True, rounded=True,  #rounded=True,框是圆边special_characters=True)
graph = aph_from_dot_data(dot_data)
with open('1.png', 'wb') as f:f.ate_png())

本文发布于:2024-02-03 01:48:39,感谢您对本站的认可!

本文链接:https://www.4u4v.net/it/170689611747836.html

版权声明:本站内容均来自互联网,仅供演示用,请勿用于商业和其他非法用途。如果侵犯了您的权益请与我们联系,我们将在24小时内删除。

标签:kaggle   travel
留言与评论(共有 0 条评论)
   
验证码:

Copyright ©2019-2022 Comsenz Inc.Powered by ©

网站地图1 网站地图2 网站地图3 网站地图4 网站地图5 网站地图6 网站地图7 网站地图8 网站地图9 网站地图10 网站地图11 网站地图12 网站地图13 网站地图14 网站地图15 网站地图16 网站地图17 网站地图18 网站地图19 网站地图20 网站地图21 网站地图22/a> 网站地图23