对kaggle：travel

阅读：评论：0

对kaggle：travel

1.导入数据

from matplotlib import pyplot as plt
%matplotlib inline
plt.style.use('fivethirtyeight')
import seaborn as sns
import pandas as pd
sns.set()
import warnings
warnings.filterwarnings('ignore')

导入数据时出现错误，参考：

df = pd.read_csv(r'C:/Users/admin/Desktop/论文/kaggle1/travel-insurance.csv',engine='python')
df.head()

	Agency	Agency Type	Distribution Channel	Product Name	Claim	Duration	Destination	Net Sales	Commision (in value)	Gender	Age
0	CBH	Travel Agency	Offline	Comprehensive Plan	No	186	MALAYSIA	-29.0	9.57	F	81
1	CBH	Travel Agency	Offline	Comprehensive Plan	No	186	MALAYSIA	-29.0	9.57	F	71
2	CWT	Travel Agency	Online	Rental Vehicle Excess Insurance	No	65	AUSTRALIA	-49.5	29.70	NaN	32
3	CWT	Travel Agency	Online	Rental Vehicle Excess Insurance	No	60	AUSTRALIA	-39.6	23.76	NaN	32
4	CWT	Travel Agency	Online	Rental Vehicle Excess Insurance	No	79	ITALY	-19.8	11.88	NaN	41

2.属性介绍及分析

order = ['Agency', 'Agency Type', 'Distribution Channel', 'Product Name', 'Duration','Destination','Net Sales', 'Commision (in value)', 'Age','Gender','Claim']
df = df[order]
df.columns = ['Agency', 'AgencyType', 'DistributionChannel', 'ProductName', 'Duration','Destination','NetSales', 'Commision', 'Age','Gender','Claim']
df.head()

	Agency	AgencyType	DistributionChannel	ProductName	Duration	Destination	NetSales	Commision	Age	Gender	Claim
0	CBH	Travel Agency	Offline	Comprehensive Plan	186	MALAYSIA	-29.0	9.57	81	F	No
1	CBH	Travel Agency	Offline	Comprehensive Plan	186	MALAYSIA	-29.0	9.57	71	F	No
2	CWT	Travel Agency	Online	Rental Vehicle Excess Insurance	65	AUSTRALIA	-49.5	29.70	32	NaN	No
3	CWT	Travel Agency	Online	Rental Vehicle Excess Insurance	60	AUSTRALIA	-39.6	23.76	32	NaN	No
4	CWT	Travel Agency	Online	Rental Vehicle Excess Insurance	79	ITALY	-19.8	11.88	41	NaN	No

2.1数据的基本了解

df.shape

(63326, 11)

df.info()

<class &#frame.DataFrame'>
RangeIndex: 63326 entries, 0 to 63325
Data columns (total 11 columns):
Agency                 63326 non-null object
AgencyType             63326 non-null object
DistributionChannel    63326 non-null object
ProductName            63326 non-null object
Duration               63326 non-null int64
Destination            63326 non-null object
NetSales               63326 non-null float64
Commision              63326 non-null float64
Age                    63326 non-null int64
Gender                 18219 non-null object
Claim                  63326 non-null object
dtypes: float64(2), int64(2), object(7)
memory usage: 5.3+ MB

df.isnull().sum()

Agency                     0
AgencyType                 0
DistributionChannel        0
ProductName                0
Duration                   0
Destination                0
NetSales                   0
Commision                  0
Age                        0
Gender                 45107
Claim                      0
dtype: int64

数据总共63326条，但只有性别这个属性存在较多缺失值

df.describe()

	Duration	NetSales	Commision	Age
count	63326.000000	63326.000000	63326.000000	63326.000000
mean	49.317074	40.702018	9.809992	39.969981
std	101.791566	48.845637	19.804388	14.017010
min	-2.000000	-389.000000	0.000000	0.000000
25%	9.000000	18.000000	0.000000	35.000000
50%	22.000000	26.530000	0.000000	36.000000
75%	53.000000	48.000000	11.550000	43.000000
max	4881.000000	810.000000	283.500000	118.000000

从上面可以看出Duration的最小值是-2，但是旅行的时间应该是正值，所以应考虑把负值的样本删除。

2.2 对属性进行分析

属性Claim

df['Claim'].value_counts()

No     62399
Yes      927
Name: Claim, dtype: int64

属性Agency

df['Agency'].value_counts()

EPX    35119
CWT     8580
C2B     8267
JZI     6329
SSI     1056
JWT      749
RAB      725
LWC      689
TST      528
KML      392
ART      331
CCR      194
CBH      101
TTW       98
CSR       86
ADM       82
Name: Agency, dtype: int64

plt.figure(figsize=(15,7))
untplot(x='Agency', data=df,orient='h')
plt.show()

df['Claim'].groupby(df['Agency']).value_counts()

Agency  Claim
ADM     No          82
ART     No         330Yes          1
C2B     No        7720Yes        547
CBH     No         100Yes          1
CCR     No         191Yes          3
CSR     No          85Yes          1
CWT     No        8494Yes         86
EPX     No       34924Yes        195
JWT     No         745Yes          4
JZI     No        6298Yes         31
KML     No         384Yes          8
LWC     No         653Yes         36
RAB     No         724Yes          1
SSI     No        1049Yes          7
TST     No         526Yes          2
TTW     No          94Yes          4
Name: Claim, dtype: int64

pd.crosstab(df.Agency,df.Claim,margins=True).style.background_gradient()

Claim	No	Yes	All
Agency
ADM	82	0	82
ART	330	1	331
C2B	7720	547	8267
CBH	100	1	101
CCR	191	3	194
CSR	85	1	86
CWT	8494	86	8580
EPX	34924	195	35119
JWT	745	4	749
JZI	6298	31	6329
KML	384	8	392
LWC	653	36	689
RAB	724	1	725
SSI	1049	7	1056
TST	526	2	528
TTW	94	4	98
All	62399	927	63326

AGency = pd.read_excel('C:/Users/admin/Desktop/论文/kaggle1/Agency.xlsx')
AGency['Percent']=AGency['Yes']/AGency['All']

AGency.sort_values("Percent",inplace=True)
AGency

	Agency	No	Yes	All	Percent
0	ADM	54	0	54	0.000000
12	RAB	709	1	710	0.001408
1	ART	315	1	316	0.003165
14	TST	517	2	519	0.003854
9	JZI	6128	30	6158	0.004872
8	JWT	716	4	720	0.005556
7	EPX	34188	193	34381	0.005614
13	SSI	1024	7	1031	0.006790
3	CBH	93	1	94	0.010638
5	CSR	85	1	86	0.011628
6	CWT	7132	86	7218	0.011915
4	CCR	187	3	190	0.015789
10	KML	380	8	388	0.020619
15	TTW	94	4	98	0.040816
11	LWC	623	36	659	0.054628
2	C2B	7531	547	8078	0.067715

从以上数据可以看到索赔率高的代理商可能是索赔率低的代理商的20倍（0.067715是0.003165的20倍左右），
所以不同代理商之间还是有明显差异的。应保留这个特征，同时有些代理商的样本数量足够多，
可以考虑直接对某一个代理商的样本进行建模。

属性Agency Type

df['AgencyType'].value_counts()

Travel Agency    45869
Airlines         17457
Name: AgencyType, dtype: int64

pd.crosstab(df.AgencyType,df.Claim,margins=True).style.background_gradient()

Claim	No	Yes	All
AgencyType
Airlines	16866	591	17457
Travel Agency	45533	336	45869
All	62399	927	63326

AgencyType这个属性里面：Travel Agency的数量多，但是索赔人数少；Airlines的人数少，但是索赔人数多。
所以代理商的类别是航空、旅行社对索赔有明显影响，此属性保留。

属性Distribution Channel

df['DistributionChannel'].value_counts()

Online     62219
Offline     1107
Name: DistributionChannel, dtype: int64

pd.crosstab(df.DistributionChannel,df.Claim,margins=True).style.background_gradient()

Claim	No	Yes	All
DistributionChannel
Offline	1090	17	1107
Online	61309	910	62219
All	62399	927	63326

忽略其他因素，分别索赔率：
Offline:17/1107=0.0153
Online:910/62219=0.0146
All:927/63326=0.0146
差异不是很大，结合现实也可以知道，购买保险人是否索赔应该与自己的经历有关，而与当初购买的方式关系不大。故此属性删除

属性ProductName

df['ProductName'].value_counts()

Cancellation Plan                       18630
2 way Comprehensive Plan                13158
Rental Vehicle Excess Insurance          8580
Basic Plan                               5469
Bronze Plan                              4049
1 way Comprehensive Plan                 3331
Value Plan                               2715
Silver Plan                              2249
Annual Silver Plan                       1423
Ticket Protector                         1056
Travel Cruise Protect                     527
Comprehensive Plan                        364
Gold Plan                                 352
24 Protect                                247
Single Trip Travel Protect Gold           204
Annual Gold Plan                          194
Premier Plan                              194
Single Trip Travel Protect Silver         173
Annual Travel Protect Gold                100
Annual Travel Protect Silver               86
Individual Comprehensive Plan              74
Single Trip Travel Protect Platinum        73
Annual Travel Protect Platinum             53
Spouse or Parents Comprehensive Plan       15
Child Comprehensive Plan                    9
Travel Cruise Protect Family                1
Name: ProductName, dtype: int64

pd.crosstab(df.ProductName,df.Claim,margins=True).style.background_gradient()

Claim	No	Yes	All
ProductName
1 way Comprehensive Plan	3322	9	3331
2 way Comprehensive Plan	13016	142	13158
24 Protect	247	0	247
Annual Gold Plan	173	21	194
Annual Silver Plan	1266	157	1423
Annual Travel Protect Gold	90	10	100
Annual Travel Protect Platinum	50	3	53
Annual Travel Protect Silver	82	4	86
Basic Plan	5446	23	5469
Bronze Plan	3839	210	4049
Cancellation Plan	18586	44	18630
Child Comprehensive Plan	9	0	9
Comprehensive Plan	359	5	364
Gold Plan	334	18	352
Individual Comprehensive Plan	71	3	74
Premier Plan	191	3	194
Rental Vehicle Excess Insurance	8494	86	8580
Silver Plan	2108	141	2249
Single Trip Travel Protect Gold	194	10	204
Single Trip Travel Protect Platinum	68	5	73
Single Trip Travel Protect Silver	169	4	173
Spouse or Parents Comprehensive Plan	14	1	15
Ticket Protector	1049	7	1056
Travel Cruise Protect	525	2	527
Travel Cruise Protect Family	1	0	1
Value Plan	2696	19	2715
All	62399	927	63326

Productnames = pd.read_excel('C:/Users/admin/Desktop/论文/kaggle1/Product Name.xlsx')
Productnames['Percent']=Productnames['Yes']/Productnames['All']
Productnames.sort_values("Percent",inplace=True)
Productnames

	Claim	No	Yes	All	Percent
2	24 Protect	247	0	247	0.000000
24	Travel Cruise Protect Family	1	0	1	0.000000
11	Child Comprehensive Plan	9	0	9	0.000000
10	Cancellation Plan	18586	44	18630	0.002362
0	1 way Comprehensive Plan	3322	9	3331	0.002702
23	Travel Cruise Protect	525	2	527	0.003795
8	Basic Plan	5446	23	5469	0.004206
22	Ticket Protector	1049	7	1056	0.006629
25	Value Plan	2696	19	2715	0.006998
16	Rental Vehicle Excess Insurance	8494	86	8580	0.010023
1	2 way Comprehensive Plan	13016	142	13158	0.010792
12	Comprehensive Plan	359	5	364	0.013736
15	Premier Plan	191	3	194	0.015464
20	Single Trip Travel Protect Silver	169	4	173	0.023121
14	Individual Comprehensive Plan	71	3	74	0.040541
7	Annual Travel Protect Silver	82	4	86	0.046512
18	Single Trip Travel Protect Gold	194	10	204	0.049020
13	Gold Plan	334	18	352	0.051136
9	Bronze Plan	3839	210	4049	0.051865
6	Annual Travel Protect Platinum	50	3	53	0.056604
17	Silver Plan	2108	141	2249	0.062695
21	Spouse or Parents Comprehensive Plan	14	1	15	0.066667
19	Single Trip Travel Protect Platinum	68	5	73	0.068493
5	Annual Travel Protect Gold	90	10	100	0.100000
3	Annual Gold Plan	173	21	194	0.108247
4	Annual Silver Plan	1266	157	1423	0.110330

同理，容易看出不同保险产品之间的索赔率差异较大，此属性保留。

属性Destination

df['Destination'].value_counts().head(10)

SINGAPORE        13255
MALAYSIA          5930
THAILAND          5894
CHINA             4796
AUSTRALIA         3694
INDONESIA         3452
UNITED STATES     2530
PHILIPPINES       2490
HONG KONG         2411
INDIA             2251
Name: Destination, dtype: int64

由于国家有149个，这里就不列出了，发现前10名的目的地去的人数很多，后续可以考虑仅针对去某个国家的客户进行研究。

属性Age

plt.figure(figsize=(15,7))
df['Age'].hist(bins=12)
plt.xlim(0,120)
plt.xlabel('Age',fontsize=14)
plt.ylabel('count',fontsize=14)
plt.show()

年龄分布范围比较广泛，所以保留比较好

3.数据清洗与准备

df.drop(["Destination","Gender","DistributionChannel"],inplace=True,axis=1)

df = df.loc[(df["Duration"]>0) & (df["NetSales"]>0)]
df.shape

(60700, 8)

删除的样本数量：63326-60700=2626

df1 = df.loc[(df["Agency"]=="EPX")]
df1.drop(["Agency"],inplace=True,axis=1)
df1.head()

	AgencyType	ProductName	Duration	NetSales	Age	Claim
33	Travel Agency	2 way Comprehensive Plan	9	25.0	36	No
34	Travel Agency	2 way Comprehensive Plan	12	24.0	36	No
35	Travel Agency	2 way Comprehensive Plan	45	20.0	36	No
36	Travel Agency	2 way Comprehensive Plan	35	27.0	36	No
37	Travel Agency	2 way Comprehensive Plan	28	20.0	44	No

df1['Claim'].value_counts()

No     34188
Yes      193
Name: Claim, dtype: int64

df1['ProductName'].value_counts()

Cancellation Plan           18212
2 way Comprehensive Plan    12907
1 way Comprehensive Plan     3262
Name: ProductName, dtype: int64

df1['Commision'].value_counts()   #删除

0.0    34381
Name: Commision, dtype: int64

将分类变量转为示性变量

dummyX = pd.get_dummies(df1[['ProductName']])
dummyX.head()

	ProductName_1 way Comprehensive Plan	ProductName_2 way Comprehensive Plan	ProductName_Cancellation Plan
33	0	1	0
34	0	1	0
35	0	1	0
36	0	1	0
37	0	1	0

from sklearn import preprocessing
# 标签矩阵
labelList= df1["Claim"]
# 将标签矩阵二值化
lb = preprocessing.LabelBinarizer()
Y=lb.fit_transform(labelList)
Y

array([[0],[0],[0],..., [0],[0],[0]])

合并特征

X = pd.concat([dummyX,df1[['Duration','NetSales','Age']]], axis=1)
X.head()

	ProductName_2 way Comprehensive Plan	Duration	NetSales	Age
33	1	9	25.0	36
34	1	12	24.0	36
35	1	45	20.0	36
36	1	35	27.0	36
37	1	28	20.0	44

平衡样本

from imblearn.over_sampling import SMOTE
sm = SMOTE(random_state=42)
df_x,df_y = sm.fit_sample(X,Y)

df_x= pd.DataFrame(df_x)
lumns = X.columns
df_y = pd.DataFrame(df_y)
lumns = ["Claim"]
X = df_x
Y = df_y

决策树

del_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.2, random_state = 1)
# Decision Tree's
 import DecisionTreeClassifier
classifier = DecisionTreeClassifier()
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
ics import accuracy_score
print('accuracy is',accuracy_score(y_pred,y_test))

accuracy is 0.983620941796

可视化直接崩掉！！，电脑卡住了，样本数太大了。

import graphviz
import pydotplus
from sklearn import tree

dot_data = port_graphviz(classifier, out_file=None,feature_names&#lumns,class_names=['0', '1'],     #类别为0,1，也可以设置为yes，nofilled=True, rounded=True,  #rounded=True，框是圆边special_characters=True)
graph = aph_from_dot_data(dot_data)
with open('1.png', 'wb') as f:f.ate_png())

本文发布于:2024-02-03 01:48:39，感谢您对本站的认可！

本文链接：https://www.4u4v.net/it/170689611747836.html

上一篇：Failed to execute goal on project travel

下一篇：《Two Dozen Short Lessons in Haskell》所有习题的索引

标签：kaggle travel

留言与评论（共有 0 条评论）