
Hands-On Machine Learning with Scikit-Learn and TensorFlow: Reading Notes, Chapter 1, The Machine Learning Landscape


Chapter 1: The Machine Learning Landscape


01 Hands-on code

OECD data: 3292 × 17, i.e. 3292 rows and 17 columns.

Keep only the rows where INEQUALITY is TOT. The other rows cover only women or only men, so mixing them in could bias the statistics. This leaves 888 × 17.
oecd_bli = oecd_bli.pivot(index="Country", columns="Indicator", values="Value")

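After this operation the table is 37 × 24: 37 rows, one per Country entry (these are countries, not cities), and 24 Indicator columns such as education level and rooms per person. Each cell holds the corresponding Value.
Keeping the part where the indicators have values: 37 × 24.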
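To make the filter-then-pivot step concrete, here is a minimal sketch on a made-up table. The country names, the values, and the "MN"/"WMN" codes for the men-only and women-only rows are assumptions for illustration; only the column names come from the notes above.

```python
import pandas as pd

# Toy long-format table in the same layout as the OECD file (values are made up).
toy = pd.DataFrame({
    "Country": ["A", "A", "A", "B", "B", "B"],
    "Indicator": ["Life satisfaction", "Life satisfaction", "Rooms per person",
                  "Life satisfaction", "Life satisfaction", "Rooms per person"],
    "INEQUALITY": ["TOT", "MN", "TOT", "TOT", "WMN", "TOT"],  # assumed codes
    "Value": [7.0, 7.1, 2.0, 6.0, 5.9, 1.5],
})

# Keep only the rows that describe the total population.
toy_tot = toy[toy["INEQUALITY"] == "TOT"]

# Reshape: one row per country, one column per indicator.
wide = toy_tot.pivot(index="Country", columns="Indicator", values="Value")
print(wide)
```

The filter has to come first: with the men-only and women-only rows still present, the same (Country, Indicator) pair would appear more than once and pivot could not place the values in a single cell.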

GDP data: 190 × 7.

After using Country as the index, the part with GDP data is 190 × 6.
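The drop from 7 to 6 columns is what happens when a column becomes the index. A small sketch on made-up data (the "Units" column, the country names, and all numbers are hypothetical) shows the rename, the index change, and the index-on-index merge used in these notes:

```python
import pandas as pd

# Hypothetical GDP table: 3 rows, 3 columns.
gdp = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "2015": [50000.0, 20000.0, 9000.0],
    "Units": ["USD", "USD", "USD"],
})

# Hypothetical life-satisfaction table, already indexed by Country.
lifesat = pd.DataFrame({"Life satisfaction": [7.0, 6.0]},
                       index=pd.Index(["A", "B"], name="Country"))

# Rename the year column, then move Country from the columns into the index.
gdp = gdp.rename(columns={"2015": "GDP per capita"}).set_index("Country")
print(gdp.shape)  # one column fewer than before

# Join on the two indexes; by default only countries present in both are kept.
merged = pd.merge(left=lifesat, right=gdp, left_index=True, right_index=True)
merged = merged.sort_values(by="GDP per capita")
print(merged)
```

Country "C" has no life-satisfaction row, so it drops out of the merged table.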

# OECD life-satisfaction data and IMF GDP data
import os
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import sklearn.linear_model

def prepare_country_stats(oecd_bli, gdp_per_capita):
    # Keep only the TOT rows; drop the women-only and men-only rows
    oecd_bli = oecd_bli[oecd_bli["INEQUALITY"] == "TOT"]
    # Index by Country, one column per Indicator, cells filled from Value
    oecd_bli = oecd_bli.pivot(index="Country", columns="Indicator", values="Value")
    # Rename the 2015 column to "GDP per capita"; inplace modifies the original frame
    gdp_per_capita.rename(columns={"2015": "GDP per capita"}, inplace=True)
    # Use Country as the index
    gdp_per_capita.set_index("Country", inplace=True)
    # Match the OECD rows and the GDP rows by country
    full_country_stats = pd.merge(left=oecd_bli, right=gdp_per_capita,
                                  left_index=True, right_index=True)
    # Sort by GDP per capita
    full_country_stats.sort_values(by="GDP per capita", inplace=True)
    # Deliberately hold out a few unusual rows before fitting
    remove_indices = [0, 1, 6, 8, 33, 34, 35]
    keep_indices = list(set(range(36)) - set(remove_indices))
    # Return only the GDP and life-satisfaction columns
    return full_country_stats[["GDP per capita", "Life satisfaction"]].iloc[keep_indices]

# Path to the dataset directory
datapath = os.path.join("datasets", "lifesat", "")

# Load the data (the GDP file is tab-separated)
oecd_bli = pd.read_csv(datapath + "oecd_bli_2015.csv", thousands=',')
gdp_per_capita = pd.read_csv(datapath + "gdp_per_capita.csv", thousands=',',
                             delimiter='\t', encoding='latin1', na_values="n/a")

# Prepare the data
country_stats = prepare_country_stats(oecd_bli, gdp_per_capita)
X = np.c_[country_stats["GDP per capita"]]
y = np.c_[country_stats["Life satisfaction"]]

# Visualize the data
country_stats.plot(kind='scatter', x="GDP per capita", y='Life satisfaction')
plt.show()

# Select a linear model
model = sklearn.linear_model.LinearRegression()

# Train the model
model.fit(X, y)

# Make a prediction for Cyprus
X_new = [[22587]]  # Cyprus' GDP per capita
print(model.predict(X_new))  # outputs [[ 5.96242338]]

The OECD life-satisfaction data looks like this:

LOCATION is the country code; Country is the country name.
The Indicator values are: student skills, self-reported health, personal earnings, voter turnout, employment rate, long-term unemployment rate, household net adjusted disposable income, life satisfaction, quality of support network, time devoted to leisure and personal care, assault rate, educational attainment, homicide rate, employees working very long hours, job security, water quality, life expectancy, years in education, household net financial wealth, housing expenditure, air pollution, dwellings without basic facilities, rooms per person, consultation on rule-making.

The GDP data looks like this:

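1. Linear model

A few unusual points (missing_data) are removed, and what remains is sample_data. The model is trained on sample_data.

When the unusual points are put back instead of being held out and the model is trained on everything, the result shows that univariate linear regression is easily swayed by outliers. The fitted line depends heavily on which data it sees.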
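The effect of a single unusual point can be seen on synthetic data. This is only a sketch: the numbers are invented and it uses NumPy's least-squares line fit rather than the book's data.

```python
import numpy as np

# Ten points lying exactly on the line y = 2x + 1.
x = np.arange(10, dtype=float)
y = 2.0 * x + 1.0

# Least-squares line through the clean data.
slope_clean, intercept_clean = np.polyfit(x, y, deg=1)

# Same data, but the last point is pushed far off the line.
y_out = y.copy()
y_out[-1] += 30.0
slope_out, intercept_out = np.polyfit(x, y_out, deg=1)

print("slope without outlier:", slope_clean)
print("slope with outlier:   ", slope_out)
```

One point out of ten is enough to pull the slope visibly away from 2, because least squares penalizes large residuals quadratically and an extreme point at the end of the range has high leverage.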

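1.1 Nonrepresentative training data

Pay special attention when there is sampling bias. For example, when the ratio of men to women matters, use stratified sampling. The same applies to election polls: the sample must be representative of the population.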
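As a sketch of stratified sampling, the snippet below draws a sample that preserves a group ratio using scikit-learn's train_test_split with its stratify argument. The population (80 "F" and 20 "M") is hypothetical, and the split is used here purely as a sampling tool.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical population of 100 people: 80% in group "F", 20% in group "M".
groups = np.array(["F"] * 80 + ["M"] * 20)
ids = np.arange(len(groups))

# Draw 20 people while preserving the group proportions.
sample_ids, _, sample_groups, _ = train_test_split(
    ids, groups, train_size=20, stratify=groups, random_state=0)

share_m = np.mean(sample_groups == "M")
print("share of M in the sample:", share_m)
```

A purely random draw of 20 could easily contain 1 or 8 members of the smaller group; stratifying fixes the proportion in advance.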

1.2 Overfitting the training data

Polynomial fitting.

Regularization helps reduce overfitting.
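A rough sketch of both ideas on synthetic data: a degree-8 polynomial fitted to 12 noisy points, once with plain least squares and once with ridge (L2) regularization. The data, the degree, and alpha=1.0 are arbitrary choices for illustration, and ridge is only one of several regularization methods.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data: a straight line plus noise.
rng = np.random.default_rng(0)
X = np.linspace(0, 1, 12).reshape(-1, 1)
y = 2.0 * X.ravel() + rng.normal(scale=0.2, size=12)

def fit_poly(regressor, degree=8):
    """Fit a polynomial of the given degree using the supplied regressor."""
    model = make_pipeline(PolynomialFeatures(degree, include_bias=False),
                          StandardScaler(), regressor)
    return model.fit(X, y)

plain = fit_poly(LinearRegression())   # unregularized polynomial fit
ridge = fit_poly(Ridge(alpha=1.0))     # same features with an L2 penalty

# Size of the learned coefficient vectors (last pipeline step is the regressor).
plain_norm = np.linalg.norm(plain.steps[-1][1].coef_)
ridge_norm = np.linalg.norm(ridge.steps[-1][1].coef_)
print("coefficient norm, plain:", plain_norm)
print("coefficient norm, ridge:", ridge_norm)
```

The penalty shrinks the coefficients, which keeps the fitted curve from bending to chase individual noisy points. How much to shrink (alpha) is a hyperparameter that would normally be tuned on held-out data.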

Published: 2024-01-29 11:08:21

Article link: https://www.4u4v.net/it/170649770614850.html

