
Text Data Analysis and Visualization of Reuters Articles


Author|Manmohan Singh Translator|VK Source|Towards Data Science

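If I asked you to interpret a body of text data, what would you do? What steps would you take to build a text visualization?

This article gives you the information you need to build visualizations and interpret text data.

Insights drawn from text data help us discover connections between articles and detect trends and patterns. Analyzing the text filters out noise and surfaces previously unknown information.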

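This process is also called exploratory text analysis (ETA). We analyze the text data with methods such as K-means, TF-IDF, and word frequency. ETA is also useful during data cleaning.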
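
As a small illustration of the simplest of these methods, word frequency, the sketch below counts tokens using only the standard library. The sample sentences and the helper name `term_frequencies` are made up for illustration; they are not taken from the Reuters data or from the original analysis.

```python
import re
from collections import Counter


def term_frequencies(documents):
    """Count how often each lowercase word appears across all documents."""
    counts = Counter()
    for doc in documents:
        # Treat every run of ASCII letters as one token.
        counts.update(re.findall(r"[a-z]+", doc.lower()))
    return counts


# Hypothetical sample text, not taken from the Reuters articles.
sample_docs = [
    "Oil prices rose as oil demand grew.",
    "Grain prices fell.",
]
freq = term_frequencies(sample_docs)
print(freq.most_common(2))
```

A Counter keeps the example short; for a real corpus you would probably also remove stop words before counting.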

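We also use the Matplotlib, seaborn, and Plotly libraries to visualize the results as graphs, word clouds, and plots.

Before analyzing the text data, complete the following preprocessing tasks.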

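Retrieving data from a data source

Plenty of unstructured text data is available for analysis. You can obtain data from the following sources.

Twitter text datasets from Kaggle.
Reddit and Twitter data retrieved through their APIs.
Articles scraped from websites with BeautifulSoup.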

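I will use Reuters articles in SGML format. To make the analysis easier, I will use the BeautifulSoup library to extract the date, title, and article body from the data files.

Use the code below to extract the data from all the data files and store the output in a single CSV file.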

from bs4 import BeautifulSoup
import pandas as pd
import csv

article_dict = {}
i = 0

# Build the zero-padded file suffixes "000" to "021"
list_of_data_num = []
for j in range(0, 22):
    if j < 10:
        list_of_data_num.append("00" + str(j))
    else:
        list_of_data_num.append("0" + str(j))

# Loop over all articles to extract the date, title, and article body
for num in list_of_data_num:
    try:
        with open("data/reut2-" + num + ".sgm") as f:
            soup = BeautifulSoup(f, features='lxml')
    except (OSError, UnicodeDecodeError):
        # Skip files that are missing or cannot be decoded
        continue
    print(num)
    data_reuters = soup.find_all('reuters')
    for data in data_reuters:
        article_dict[i] = {}
        for date in data.find_all('date'):
            try:
                article_dict[i]["date"] = str(date.contents[0]).strip()
            except IndexError:
                article_dict[i]["date"] = None
        for title in data.find_all('title'):
            article_dict[i]["title"] = str(title.contents[0]).strip()
        for text in data.find_all('text'):
            try:
                # The fifth child of <text> holds the article body
                article_dict[i]["text"] = str(text.contents[4]).strip()
            except IndexError:
                article_dict[i]["text"] = None
        i += 1

# One row per article, then write everything to a single CSV file
dataframe_article = pd.DataFrame(article_dict).T
dataframe_article.to_csv('articles_data.csv', header=True, index=False, quoting=csv.QUOTE_ALL)
print(dataframe_article)

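You can also use the regex and os libraries to combine or loop over all the data files.
Each article starts with a <REUTERS> tag (the parser lowercases tag names), so we use find_all('reuters').
You can also save the data with the pickle module instead of CSV.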
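
The first and last notes can be sketched as follows. This is only one possible approach: the file-name pattern is an assumption based on names such as reut2-000.sgm, the helper names `find_sgm_files`, `save_pickle`, and `load_pickle` are hypothetical, and the demo runs against empty placeholder files in a temporary directory rather than the real data.

```python
import os
import pickle
import re
import tempfile

# Assumed file-name pattern, based on names such as reut2-000.sgm.
SGM_PATTERN = re.compile(r"^reut2-\d{3}\.sgm$")


def find_sgm_files(data_dir):
    """Return the sorted paths of all matching .sgm files in data_dir."""
    names = sorted(n for n in os.listdir(data_dir) if SGM_PATTERN.match(n))
    return [os.path.join(data_dir, n) for n in names]


def save_pickle(obj, path):
    """Serialize obj to path with pickle."""
    with open(path, "wb") as fh:
        pickle.dump(obj, fh)


def load_pickle(path):
    """Load an object previously written by save_pickle."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


# Demonstration with empty placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("reut2-001.sgm", "reut2-000.sgm", "README.txt"):
        open(os.path.join(tmp, name), "w").close()
    found = [os.path.basename(p) for p in find_sgm_files(tmp)]

    # Hypothetical record in the same shape as article_dict.
    articles = {0: {"date": None, "title": "example", "text": "body"}}
    pickle_path = os.path.join(tmp, "articles_data.pkl")
    save_pickle(articles, pickle_path)
    restored = load_pickle(pickle_path)

print(found)
```

Discovering the files this way avoids hard-coding the number of data files, and pickle preserves Python values such as None without the quoting concerns of CSV. Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.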

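Cleaning the data

In this section, we remove noise such as null values, punctuation, and numbers from the text data. First, we drop the rows whose text column is null. Then we handle the null values in the other columns.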

import re
import string

import pandas as pd

articles_data = pd.read_csv('articles_data.csv')

# Count the null values in each column
print(articles_data.apply(lambda x: sum(x.isnull())))

# Drop rows without article text and rebuild the index
articles_nonNull = articles_data.dropna(subset=['text']).reset_index()

def clean_text(text):
    '''Make text lowercase, remove markup tags in angle brackets, remove newlines,
    remove punctuation and remove words containing numbers.'''
    text = str(text).lower()
    text = re.sub(r'<.*?>+', '', text)
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    # Replace newlines with a space so words on adjacent lines do not merge
    text = re.sub(r'\n', ' ', text)
    text = re.sub(r'\w*\d\w*', '', text)
    return text

articles_nonNull['text_clean'] = articles_nonNull['text'].apply(clean_text)
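
To see what each cleaning step does, the sketch below applies the same kinds of substitutions one at a time to a single string. The sample sentence is invented. Two choices here are assumptions of this sketch rather than part of the article: newlines are replaced with a space so that neighbouring words stay separate, and a final step collapses repeated whitespace.

```python
import re
import string

# Hypothetical raw text with markup, digits, punctuation and a newline.
raw = "<BODY>Profits rose 12% in 1987,\nanalysts said.</BODY>"

lowered = raw.lower()
# Remove anything that looks like a markup tag.
no_tags = re.sub(r"<.*?>+", "", lowered)
# Remove every ASCII punctuation character.
no_punct = re.sub("[%s]" % re.escape(string.punctuation), "", no_tags)
# Assumption of this sketch: turn newlines into spaces.
no_newlines = re.sub(r"\n", " ", no_punct)
# Remove every word that contains at least one digit.
no_numbers = re.sub(r"\w*\d\w*", "", no_newlines)
# Extra step, not in the article: collapse the leftover double spaces.
cleaned = " ".join(no_numbers.split())

for label, value in [("lowered", lowered), ("no_tags", no_tags),
                     ("no_punct", no_punct), ("no_newlines", no_newlines),
                     ("no_numbers", no_numbers), ("cleaned", cleaned)]:
    print(f"{label:12} {value!r}")
```

Removing the numeric words leaves gaps behind, which is why the whitespace is normalised at the end.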