I have only just started with machine learning; this post is a set of study notes based on the official scikit-learn documentation, mixed with my own understanding. Corrections are welcome if anything is wrong.
Using OrdinalEncoder
OrdinalEncoder turns each categorical feature into a new feature of ordered integers, taking values from 0 to n_categories - 1. It is suited to ordinal variables; applying it to nominal variables imposes an ordering on categories that should not have one.
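For a genuinely ordinal feature you can pass the order explicitly through the categories parameter. A small sketch with a made-up 'size' feature (not from the official docs):
from sklearn import preprocessing

# explicit order small < medium < large, so the codes are 0, 1, 2
enc = preprocessing.OrdinalEncoder(categories=[['small', 'medium', 'large']])
enc.fit_transform([['medium'], ['small'], ['large']])
# array([[1.],
#        [0.],
#        [2.]])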
By default, missing values represented by np.nan are passed through unchanged; you can also encode them explicitly with the encoded_missing_value parameter:
from sklearn import preprocessing
import numpy as np

enc = preprocessing.OrdinalEncoder(encoded_missing_value=-1)
X = [['male'], ['female'], [np.nan], ['female']]
enc.fit_transform(X)
This saves you from building a Pipeline with a SimpleImputer; the above is equivalent to:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

enc = Pipeline(steps=[
    ("encoder", preprocessing.OrdinalEncoder()),
    ("imputer", SimpleImputer(strategy="constant", fill_value=-1)),
])
enc.fit_transform(X)
ordinal_encoder = preprocessing.OrdinalEncoder(
    handle_unknown='use_encoded_value', unknown_value=-1,
    encoded_missing_value=-2, min_frequency=10, max_categories=7)
ordinal_encoder.fit(train_df[['Title']])  # fit on the training set only, never on the test set
train_df['Title'] = ordinal_encoder.transform(train_df[['Title']])
test_df['Title'] = ordinal_encoder.transform(test_df[['Title']])  # transforming the test set here just prepares it for later evaluation; it is not part of training
print(ordinal_encoder.categories_)
print(ordinal_encoder.infrequent_categories_)
# print the elements of ordinal_encoder.categories_ that are not in ordinal_encoder.infrequent_categories_
print([item for item in ordinal_encoder.categories_[0] if item not in ordinal_encoder.infrequent_categories_[0]])
[array(['Billiard', 'Capt', 'Carlo', 'Col', 'Cruyssen', 'Don', 'Dr',
        'Gordon', 'Impe', 'Jonkheer', 'Major', 'Master', 'Melkebeke',
        'Messemaeker', 'Miss', 'Mr', 'Mrs', 'Mulder', 'Pelsmaeker',
        'Planke', 'Rev', 'Shawah', 'Steen', 'Velde', 'Walle', 'der', 'the',
        'y'], dtype=object)]
[array(['Billiard', 'Capt', 'Carlo', 'Col', 'Cruyssen', 'Don', 'Dr',
        'Gordon', 'Impe', 'Jonkheer', 'Major', 'Melkebeke', 'Messemaeker',
        'Mulder', 'Pelsmaeker', 'Planke', 'Rev', 'Shawah', 'Steen',
        'Velde', 'Walle', 'der', 'the', 'y'], dtype=object)]
['Master', 'Miss', 'Mr', 'Mrs']
import seaborn as sns

train_title_nums = len(train_df['Title'].unique())
print(ordinal_encoder.inverse_transform([[i] for i in range(train_title_nums)]))  # the original string each integer code maps back to
sns.displot(train_df, x="Title", hue="Survived", height=3, bins=train_title_nums)
Output: a histogram of the encoded Title values, split by Survived (figure not reproduced here).
Using OneHotEncoder
OneHotEncoder one-hot encodes categorical variables: the binary position corresponding to the observed value is 1 and all the others are 0.
>>> enc = preprocessing.OneHotEncoder()
>>> X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]
>>> enc.fit(X)
OneHotEncoder()
>>> enc.transform([['female', 'from US', 'uses Safari'],
...                ['male', 'from Europe', 'uses Safari']]).toarray()
array([[1., 0., 0., 1., 0., 1.],
       [0., 1., 1., 0., 0., 1.]])
The categories parameter lets you specify the category order (and therefore the order of the binary positions) explicitly:
>>> genders = ['female', 'male']
>>> locations = ['from Africa', 'from Asia', 'from Europe', 'from US']
>>> browsers = ['uses Chrome', 'uses Firefox', 'uses IE', 'uses Safari']
>>> enc = preprocessing.OneHotEncoder(categories=[genders, locations, browsers])
>>> # Note that the category lists for the 2nd and 3rd features
>>> # include values that do not appear in X
>>> X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]
>>> enc.fit(X)
OneHotEncoder(categories=[['female', 'male'],
                          ['from Africa', 'from Asia', 'from Europe', 'from US'],
                          ['uses Chrome', 'uses Firefox', 'uses IE', 'uses Safari']])
>>> enc.transform([['female', 'from Asia', 'uses Chrome']]).toarray()
array([[1., 0., 0., 1., 0., 0., 1., 0., 0., 0.]])
Note that "unknown" and "missing" are different concepts. If unknown values may appear at transform time, you can set handle_unknown='infrequent_if_exist' instead of listing every category manually as above; unknown values are then encoded as all zeros, or grouped into the infrequent category if infrequent categories are enabled:
>>> enc = preprocessing.OneHotEncoder(handle_unknown='infrequent_if_exist')
>>> X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]
>>> enc.fit(X)
OneHotEncoder(handle_unknown='infrequent_if_exist')
>>> enc.transform([['female', 'from Asia', 'uses Chrome']]).toarray()
array([[1., 0., 0., 0., 0., 0.]])
Here 'from Asia' and 'uses Chrome' are both unknown values, since they were never seen during fitting.
The drop parameter specifies, for each feature, one category value to discard (the "dropped value"). The feature is then represented with n_categories - 1 dummy variables instead of n_categories, which avoids perfect multicollinearity. A dummy variable is just one of the binary positions described above, so the dummy-variable trick from regression and one-hot encoding are essentially the same idea.
>>> drop_enc = preprocessing.OneHotEncoder(drop='first').fit(X)
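For reference, the fitted encoder now produces n_categories - 1 columns per feature; on the two-row X above:
>>> drop_enc.categories_
[array(['female', 'male'], dtype=object), array(['from Europe', 'from US'], dtype=object), array(['uses Firefox', 'uses Safari'], dtype=object)]
>>> drop_enc.transform(X).toarray()
array([[1., 1., 1.],
       [0., 0., 0.]])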
You can also use drop='if_binary' to drop a value only for binary features. In practice drop='first' already covers that, so if_binary is a bit redundant, because features with more than two categories also need protection against multicollinearity. In the following example X is redefined so that only the first feature is binary:
>>> X = [['male', 'US', 'Safari'],
...      ['female', 'Europe', 'Firefox'],
...      ['female', 'Asia', 'Chrome']]
>>> drop_enc = preprocessing.OneHotEncoder(drop='if_binary').fit(X)
When handle_unknown='ignore' and drop is not None, unknown values are encoded as all zeros. (The encoder is normally fitted on the training set only.) Unknown values in the test set therefore become all-zero rows, which means they share the same encoding as the dropped value. OneHotEncoder.inverse_transform behaves accordingly: when inverting, it maps an all-zero block either to the dropped value's category or to None (indicating an unknown value rather than the dropped one):
>>> drop_enc = preprocessing.OneHotEncoder(drop='if_binary', sparse_output=False,
... handle_unknown='ignore').fit(X)
>>> X_test = [['unknown', 'America', 'IE']]
>>> X_trans = drop_enc.transform(X_test)
>>> X_trans
array([[0., 0., 0., 0., 0., 0., 0.]])
>>> drop_enc.inverse_transform(X_trans)
array([['female', None, None]], dtype=object)
OneHotEncoder can also treat missing values as categories of their own. Both np.nan and None are handled this way, and they are treated as two separate categories (in the example below one feature contains np.nan and the other contains None):
>>> X = [['male', 'Safari'],
... ['female', None],
... [np.nan, 'Firefox']]
>>> enc = preprocessing.OneHotEncoder(handle_unknown='error').fit(X)
>>> enc.categories_
[array(['female', 'male', nan], dtype=object),array(['Firefox', 'Safari', None], dtype=object)]
>>> enc.transform(X).toarray()
array([[0., 1., 0., 0., 1., 0.],
       [1., 0., 0., 0., 0., 1.],
       [0., 0., 1., 1., 0., 0.]])
So when a single feature contains both None and np.nan, two extra categories are created:
>>> X = [['Safari'], [None], [np.nan], ['Firefox']]
>>> enc = preprocessing.OneHotEncoder(handle_unknown='error').fit(X)
>>> enc.categories_
[array(['Firefox', 'Safari', None, nan], dtype=object)]
>>> enc.transform(X).toarray()
array([[0., 1., 0., 0.],
       [0., 0., 1., 0.],
       [0., 0., 0., 1.],
       [1., 0., 0., 0.]])
columns_to_encode = ["Sex","Embarked"]
onehot_encoder = preprocessing.OneHotEncoder(handle_unknown='error',drop='first',sparse_output=False).fit(train_df[columns_to_encode])result_df = pd.DataFrame(ansform(train_df[["Sex","Embarked"]]),columns=_feature_names_out(columns_to_encode))
train_df = pd.concat([train_df,result_df],axis=1)
# .drop(columns_to_encode,axis=1) # 这里不用删除,因为后面要用到这两列result_df = pd.DataFrame(ansform(test_df[["Sex","Embarked"]]),columns=_feature_names_out(columns_to_encode))
# .drop(columns_to_encode,axis=1) # 这里不用删除,因为后面要用到这两列
test_df = pd.concat([test_df,result_df],axis=1)
train_df:
Survived Pclass Sex Age SibSp Parch Fare Embarked Title
0 0 3 male 22.0 1 0 7.2500 S 2.0
1 1 1 female 38.0 1 0 71.2833 C 3.0
2 1 3 female 26.0 0 0 7.9250 S 1.0
3 1 1 female 35.0 1 0 53.1000 S 3.0
4 0 3 male 35.0 0 0 8.0500 S 2.0
Sex_male Embarked_Q Embarked_S Embarked_nan
0 1.0 0.0 1.0 0.0
1 0.0 0.0 0.0 0.0
2 0.0 0.0 1.0 0.0
3 0.0 0.0 1.0 0.0
4 1.0 0.0 1.0 0.0
test_df:
PassengerId Pclass Sex Age SibSp Parch Fare Embarked Title
0 892 3 male 34.5 0 0 7.8292 Q 2.0
1 893 3 female 47.0 1 0 7.0000 S 3.0
2 894 2 male 62.0 0 0 9.6875 Q 2.0
3 895 3 male 27.0 0 0 8.6625 S 2.0
4 896 3 female 22.0 1 1 12.2875 S 3.0
Sex_male Embarked_Q Embarked_S Embarked_nan
0 1.0 1.0 0.0 0.0
1 0.0 0.0 1.0 0.0
2 1.0 1.0 0.0 0.0
3 1.0 0.0 1.0 0.0
4 0.0 0.0 1.0 0.0
(891, 13) (418, 13)
Infrequent categories
Both OneHotEncoder and OrdinalEncoder support grouping low-frequency values into an infrequent category, which effectively acts as one extra category.
min_frequency controls which categories count as infrequent: it can be an integer (categories seen fewer times than this are infrequent) or a float in (0, 1) (categories whose relative frequency is below this fraction are infrequent). By default no infrequent categories are created:
>>> X = np.array([['dog'] * 5 + ['cat'] * 20 + ['rabbit'] * 10 +
... ['snake'] * 3], dtype=object).T
>>> enc = preprocessing.OrdinalEncoder(min_frequency=6).fit(X)
>>> enc.infrequent_categories_
[array(['dog', 'snake'], dtype=object)]
>>> enc.transform(np.array([['dog'], ['cat'], ['rabbit'], ['snake']]))
array([[2.],
       [0.],
       [1.],
       [2.]])
max_categories is an integer that caps the number of encoded categories per feature, and the cap includes the infrequent group itself: the most frequent categories are kept, and once the cap is reached everything that is left is merged into a single infrequent category. Equivalently, the least frequent categories are folded into the infrequent group until at most max_categories categories remain. Missing and unknown values are not counted here, so if they are each given their own code as well, you can end up with up to max_categories + 2 distinct codes:
>>> X_train = np.array(
... [["a"] * 5 + ["b"] * 20 + ["c"] * 10 + ["d"] * 3 + [np.nan]],
... dtype=object).T
>>> enc = preprocessing.OrdinalEncoder(
... handle_unknown="use_encoded_value", unknown_value=3,
... max_categories=3, encoded_missing_value=4)
>>> _ = enc.fit(X_train)
>>> X_test = np.array([["a"], ["b"], ["c"], ["d"], ["e"], [np.nan]], dtype=object)
>>> enc.transform(X_test)
array([[2.],
       [0.],
       [1.],
       [2.],
       [3.],
       [4.]])
There are five distinct codes here, two more than max_categories=3, because missing and unknown values are also present, encoded as 4 and 3 respectively.
When infrequent categories exist, get_feature_names_out includes 'infrequent' in the feature name generated for the infrequent group.
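For example, a OneHotEncoder fitted on the animal array X from above (a small sketch assuming min_frequency=6, so that 'dog' and 'snake' are infrequent):
>>> enc = preprocessing.OneHotEncoder(min_frequency=6).fit(X)
>>> enc.get_feature_names_out()
array(['x0_cat', 'x0_rabbit', 'x0_infrequent_sklearn'], dtype=object)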
As mentioned earlier, setting handle_unknown='infrequent_if_exist' makes unknown values be treated like low-frequency values, i.e. they are mapped to the infrequent category. When inverse-transforming, the infrequent group is reported as 'infrequent_sklearn'.
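A small sketch of both behaviours, again reusing the animal array X (the outputs assume min_frequency=6, so that 'dog' and 'snake' form the infrequent group):
>>> enc = preprocessing.OneHotEncoder(handle_unknown='infrequent_if_exist',
...                                   sparse_output=False, min_frequency=6).fit(X)
>>> encoded = enc.transform([['dragon']])   # 'dragon' was never seen -> goes to the infrequent column
>>> encoded
array([[0., 0., 1.]])
>>> enc.inverse_transform(encoded)          # the infrequent group is labelled 'infrequent_sklearn'
array([['infrequent_sklearn']], dtype=object)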
If both max_categories and min_frequency are set, min_frequency is evaluated first and then max_categories:
>>> enc = preprocessing.OneHotEncoder(min_frequency=4, max_categories=3, sparse_output=False)
>>> enc = enc.fit(X)
>>> enc.transform([['dog'], ['cat'], ['rabbit'], ['snake']])
array([[0., 0., 1.],
       [1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])
In this example min_frequency=4 marks only 'snake' as infrequent, but max_categories=3 forces 'dog' into the infrequent group as well.
If there is a tie at the max_categories cutoff, lexicographic order is used to decide which categories are kept:
>>> X = np.asarray([["a"] * 20 + ["b"] * 10 + ["c"] * 10 + ["d"] * 10], dtype=object).T
>>> enc = preprocessing.OneHotEncoder(max_categories=3).fit(X)
>>> enc.infrequent_categories_
[array(['b', 'c'], dtype=object)]
Here 'b', 'c' and 'd' all have the same frequency; with max_categories=3 only one of them can remain a frequent category, and the tie is broken by lexicographic order, so 'b' and 'c' end up in the infrequent group.
If a feature has a very large number of categories, one-hot encoding stops being practical; use TargetEncoder instead.
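A minimal sketch of what that could look like on the Title column from the Titanic example above (hypothetical: it assumes scikit-learn >= 1.3, that Title still holds the raw strings, and the column name Title_te is made up):
from sklearn.preprocessing import TargetEncoder

target_encoder = TargetEncoder(smooth="auto")
# fit_transform uses internal cross fitting on the training data to limit target leakage
train_df['Title_te'] = target_encoder.fit_transform(
    train_df[['Title']], train_df['Survived']).ravel()
test_df['Title_te'] = target_encoder.transform(test_df[['Title']]).ravel()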
Reference: the scikit-learn official documentation.