Sentiment Prediction with Python (Part 1)


1 Dataset

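The dataset consists of user reviews of various mobile phones.
Reading the reviews from an Excel file (the code in this post targets Python 2):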

    import xlrd

    def get_excel_data(filepath, sheetnum, colnum, para):
        # sheetnum and colnum are 1-based
        table = xlrd.open_workbook(filepath)
        print(len(table.sheets()))
        sheet = table.sheets()[sheetnum-1]
        data = sheet.col_values(colnum-1)
        rownum = sheet.nrows
        if para == 'data':
            return data
        elif para == 'rownum':
            return rownum

    from Preprocessing_module import textprocessing as tp

    data = tp.get_excel_data(r"D:tomcatreview_protectionReview setHTC Z710t_review_2013.6.5.xlsx", 3, 1, 'data')
    print(data[0])

Note that the workbook has 3 sheets.

Reading data from a txt file:

    def get_txt_data(filepath, para):
        if para == 'lines':
            # Return the whole file as a list of unicode lines
            txt_file1 = open(filepath, 'r')
            txt_tmp1 = txt_file1.readlines()
            txt_tmp2 = ''.join(txt_tmp1)
            txt_data1 = txt_tmp2.decode('utf-8').split('\n')
            txt_file1.close()
            return txt_data1
        elif para == 'line':
            # Return only the first line as unicode
            txt_file2 = open(filepath, 'r')
            txt_tmp = txt_file2.readline()
            txt_data2 = txt_tmp.decode('utf-8')
            txt_file2.close()
            return txt_data2
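
On Python 3 the manual decode step is not needed, because open() can decode the file while reading it. The following is a minimal sketch of an equivalent reader, assuming UTF-8 input. The name read_txt and its mode argument are illustrative and are not part of the original module.

```python
import os
import tempfile


def read_txt(filepath, mode="lines"):
    """Read a UTF-8 text file (hypothetical Python 3 counterpart of get_txt_data)."""
    with open(filepath, "r", encoding="utf-8") as f:
        if mode == "lines":
            # One list entry per line, without the trailing newline characters
            return f.read().splitlines()
        if mode == "line":
            # Only the first line, newline included (as readline() returns it)
            return f.readline()
    raise ValueError("mode must be 'lines' or 'line'")


# Small demonstration on a temporary file with made-up content
with tempfile.NamedTemporaryFile("w", encoding="utf-8", suffix=".txt", delete=False) as tmp:
    tmp.write("的\n了\n是\n")
    demo_path = tmp.name

stopword_lines = read_txt(demo_path, "lines")
first_line = read_txt(demo_path, "line")
os.remove(demo_path)
```

One difference from the original: split('\n') leaves an empty string at the end of the list when the file ends with a newline, while splitlines() does not.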

Word segmentation:

    import jieba

    def segmentation(sentence, para):
        if para == 'str':
            # Return the words joined by spaces
            seg_list = jieba.cut(sentence)
            seg_result = ' '.join(seg_list)
            return seg_result
        elif para == 'list':
            # Return the words as a list
            seg_list2 = jieba.cut(sentence)
            seg_result2 = []
            for w in seg_list2:
                seg_result2.append(w)
            return seg_result2

Word segmentation with part-of-speech tagging:

    import jieba.posseg

    def postagger(sentence, para):
        if para == 'list':
            pos_data1 = jieba.posseg.cut(sentence)
            pos_list = []
            for w in pos_data1:
                pos_list.append((w.word, w.flag))  # make every word and tag as a tuple and add them to a list
            return pos_list
        elif para == 'str':
            pos_data2 = jieba.posseg.cut(sentence)
            pos_list2 = []
            for w2 in pos_data2:
                # pos_list2.extend([w2.word.encode('utf-8'), w2.flag])
                pos_list2.extend([w2.word, w2.flag])
            pos_str = ' '.join(pos_list2)
            return pos_str

Splitting a review into clauses at punctuation marks:

    def cut_sentence_2(words):
        # words = (words).decode('utf8')
        start = 0
        i = 0  # i is the position of words
        token = 'meaningless'
        sents = []
        punt_list = ',.!?;~，。！？；～… '.decode('utf-8')
        for word in words:
            if word not in punt_list:
                i += 1
                token = list(words[start:i+2]).pop()
                # print token
            elif word in punt_list and token in punt_list:
                i += 1
                token = list(words[start:i+2]).pop()
            else:
                sents.append(words[start:i+1])
                start = i+1
                i += 1
        if start < len(words):
            sents.append(words[start:])
        return sents

For example, given the review:

这款手机大小合适，配置也还可以，很好用，只是屏幕有点小。。。总之，戴妃+是一款值得购买的智能手机。

the split result is:

这款手机大小合适，
配置也还可以，
很好用，
只是屏幕有点小。。。
总之，
戴妃+是一款值得购买的智能手机。
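
The same kind of split can also be written with a regular expression. The sketch below is a Python 3 alternative, not the author's method: it treats a clause as a run of non-punctuation characters followed by any punctuation attached to it. The names PUNCTUATION, CLAUSE_PATTERN and split_clauses are illustrative.

```python
import re

# Same punctuation set as punt_list in cut_sentence_2 (note the trailing space)
PUNCTUATION = ",.!?;~，。！？；～… "

# A clause: one or more non-punctuation characters, then any punctuation that follows
_escaped = re.escape(PUNCTUATION)
CLAUSE_PATTERN = re.compile("[^" + _escaped + "]+[" + _escaped + "]*")


def split_clauses(text):
    """Split text into clauses, keeping trailing punctuation attached to each clause."""
    return CLAUSE_PATTERN.findall(text)


review = "这款手机大小合适，配置也还可以，很好用，只是屏幕有点小。。。总之，戴妃+是一款值得购买的智能手机。"
clauses = split_clauses(review)
for clause in clauses:
    print(clause)
```

On the example review this yields the six clauses listed above. Punctuation at the very start of the text is dropped by this pattern, so it is not a drop-in replacement for every input.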

Reading the reviews from Excel, segmenting them, and filtering out stopwords:

    def seg_fil_excel(filepath, sheetnum, colnum):
        # Read product review data from excel file and segment every review
        review_data = []
        for cell in get_excel_data(filepath, sheetnum, colnum, 'data')[0:get_excel_data(filepath, sheetnum, colnum, 'rownum')]:
            review_data.append(segmentation(cell, 'list'))  # Segment every review

        # Read the txt file that contains the stopwords
        stopwords = get_txt_data(r'D:tomcatreview_protectionPreprocessing_', 'lines')

        # Filter stopwords from reviews
        seg_fil_result = []
        for review in review_data:
            fil = [word for word in review if word not in stopwords and word != ' ']
            seg_fil_result.append(fil)
            fil = []

        # Return filtered segmented reviews
        return seg_fil_result
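
The filtering step on its own can be sketched without any file access. The version below is a Python 3 sketch with hypothetical sample data; it converts the stopword list to a set, since a membership test against a list scans the whole list for every word. The name filter_stopwords is illustrative.

```python
def filter_stopwords(segmented_reviews, stopwords):
    """Drop stopwords and bare spaces from each segmented review."""
    stopword_set = set(stopwords)  # set membership is O(1) on average
    return [
        [word for word in review if word not in stopword_set and word != " "]
        for review in segmented_reviews
    ]


# Hypothetical sample data: two segmented reviews and a tiny stopword list
sample_reviews = [["手机", "很", "好", " ", "的"], ["屏幕", "有点", "小", "了"]]
sample_stopwords = ["的", "了"]
filtered = filter_stopwords(sample_reviews, sample_stopwords)
print(filtered)
```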

2 Feature extraction
Counting the number of adjectives, adverbs, and verbs in each review:

    def count_adj_adv(dataset):
        adj_adv_num = []
        a = 0
        d = 0
        v = 0
        for review in dataset:
            pos = tp.postagger(review, 'list')
            for i in pos:
                # jieba tags: 'a' adjective, 'd' adverb, 'v' verb
                if i[1] == 'a':
                    a += 1
                elif i[1] == 'd':
                    d += 1
                elif i[1] == 'v':
                    v += 1
            adj_adv_num.append((a, d, v))
            a = 0
            d = 0
            v = 0
        return adj_adv_num
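
The counting logic can be tried without jieba by starting from a review that is already tagged. The sketch below assumes input in the (word, flag) form that postagger(..., 'list') returns; the tags in the sample are assigned by hand for illustration and are not actual jieba output. The name count_pos is illustrative.

```python
from collections import Counter


def count_pos(tagged_review, tags=("a", "d", "v")):
    """Return the counts of the given POS tags, in the order of `tags`."""
    counts = Counter(flag for _word, flag in tagged_review)
    return tuple(counts[tag] for tag in tags)


# Hypothetical tagged review; the tags are hand-assigned
tagged = [("手机", "n"), ("很", "d"), ("好", "a"), ("很", "d"), ("喜欢", "v")]
adj_adv_verb = count_pos(tagged)
print(adj_adv_verb)  # (adjectives, adverbs, verbs)
```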

Computing the centroid score:

    import logging
    from gensim import corpora, models

    def centroid(datapath, storepath):
        logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

        # Read review data from txt file
        class MyCorpus(object):
            def __iter__(self):
                for line in open(datapath):
                    yield line.split()

        # Change review data to gensim corpus format
        Corp = MyCorpus()
        dictionary = corpora.Dictionary(Corp)
        corpus = [dictionary.doc2bow(text) for text in Corp]

        # Make the corpus become a tf-idf model
        tfidf = models.TfidfModel(corpus)

        # Compute every word's tf-idf score
        corpus_tfidf = tfidf[corpus]

        # Compute review centroid score by combining every word's tf-idf score
        centroid = 0
        review_centroid = []
        for doc in corpus_tfidf:
            for token in doc:
                centroid += token[1]
            review_centroid.append(centroid)
            centroid = 0

        # Store review centroid score into a txt file
        centroid_file = open(storepath, 'w')
        for i in review_centroid:
            centroid_file.write(str(i)+'\n')
        centroid_file.close()

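In other words, the centroid score of a review is the sum of the tf-idf weights of its segmented words. For segmented reviews such as: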

手机 很 好 很 喜欢 三防 出色 操作系统 垃圾 
Defy 用过 3 年 感受 
刚买 很 兴奋 当时 还 流行 机 还 很 贵

the result is:

2.0
2.0
2.2360679775
2.7136021012
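
To see where such numbers come from, the score can be recomputed without gensim. The sketch below is written for Python 3 and assumes gensim's default TfidfModel weighting: term frequency times log2(N / df), followed by L2 normalisation of each document vector. The name tfidf_sums is illustrative, and the three sample reviews are the ones shown above.

```python
import math
from collections import Counter


def tfidf_sums(docs):
    """Sum of L2-normalised tf-idf weights for each tokenised document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term
    df = Counter(term for doc in docs for term in set(doc))
    sums = []
    for doc in docs:
        tf = Counter(doc)
        weights = [count * math.log2(n_docs / df[term]) for term, count in tf.items()]
        norm = math.sqrt(sum(w * w for w in weights))
        # If every term occurs in every document, all weights are zero
        sums.append(sum(weights) / norm if norm else 0.0)
    return sums


reviews = [
    "手机 很 好 很 喜欢 三防 出色 操作系统 垃圾".split(),
    "Defy 用过 3 年 感受".split(),
    "刚买 很 兴奋 当时 还 流行 机 还 很 贵".split(),
]
scores = tfidf_sums(reviews)
for score in scores:
    print(score)
```

Under these assumptions, a review whose n distinct terms all share the same weight scores the square root of n. The second sample review has five terms that occur in no other review, so it scores sqrt(5), roughly 2.236, which is the value 2.2360679775 that appears in the results above.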

Computing the similarity of reviews:

    import logging
    from gensim import corpora, models, similarities

    def gensim(datapath, querypath, storepath):
        logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

        # Read review data from txt file
        class MyCorpus(object):
            def __iter__(self):
                for line in open(datapath):
                    yield line.split()

        # Change review data to gensim corpus format
        Corp = MyCorpus()
        dictionary = corpora.Dictionary(Corp)
        corpus = [dictionary.doc2bow(text) for text in Corp]

        # Make the corpus become a tf-idf model
        tfidf = models.TfidfModel(corpus)

        # Compute every word's tf-idf score
        corpus_tfidf = tfidf[corpus]

        # Read filtered editorial review from txt file
        q_file = open(querypath, 'r')
        query = q_file.readline()
        q_file.close()

        # Based on the review tf-idf model, compute its tf-idf score
        vec_bow = dictionary.doc2bow(query.split())
        vec_tfidf = tfidf[vec_bow]

        # Compute similarity
        index = similarities.MatrixSimilarity(corpus_tfidf)
        sims = index[vec_tfidf]
        similarity = list(sims)

        # Store similarity score into a txt file
        sim_file = open(storepath, 'w')
        for i in similarity:
            sim_file.write(str(i)+'\n')
        sim_file.close()
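
MatrixSimilarity scores each review by its cosine similarity to the query vector. The sketch below shows that measure in plain Python 3 on sparse vectors stored as dicts. To stay short it uses raw term counts rather than tf-idf weights, and the query and reviews are made-up samples, so the numbers are not the ones the function above would produce.

```python
import math
from collections import Counter


def cosine_similarity(vec_a, vec_b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * vec_b.get(term, 0.0) for term, weight in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical query and reviews, represented by raw term counts
query = Counter("三防 手机 系统 升级".split())
sample_reviews = [
    Counter("手机 很 好 三防 出色".split()),
    Counter("Defy 用过 3 年 感受".split()),
]
similarity = [cosine_similarity(query, review) for review in sample_reviews]
for value in similarity:
    print(value)
```

The first sample review shares two terms with the query and gets a positive score; the second shares none and scores 0.0.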

 一款 畅销  产品   创造 出来   一种  迎合 需求  解决 消费者 现实  需求  受到 欢迎  现在 绝大部分  手机 属于  一种    一种   创造 需求   用户   新  需求   类产品 总是 可遇  不可 求  去年 经典  摩托罗拉 ME525  Andorid  基础 上 加入  民用 三防功   受到  消费者  热烈 追捧  时隔 一年 摩托罗拉  上 月 刚刚 发布  最新  升级 版本 ME525+  今天 ZOL 手机 频道 全国 独家 拿到  这款 产品  第一 时间  大家 带来 相关  评测 文章  摩托罗拉 ME525  大名 已经 不用   说  三防  Andorid2  2 系统 摆放  大家 面前   火爆 已经  时间  问题    编辑  水煮 Defy   一系列 暴力 测试 之后  款 产品   长期 盘踞 ZOL 手机 频道 热门 手 机 排行榜 前三甲  位置  ME525  成为  摩托罗拉  2010 年 最为 成功  中端 产品  摩托罗拉 ME525+  ME525  升级 版本  新 产品  处理器 升级   1GHz   系统 版本  升级   全新  Andorid2  3 版本  当然 经典  外观 造型  没有 改变  三防  特性 仍然 得到  经典  保留   今天   评测 将会   一年  恪守  改变   主题   词   会 穿插 进  整个 评测 文章 当中   小巧 精致  仍然  此次 ME525+  主题  整个 机器  外观 基本 仍然 延续  原机  设计  正面 更是 显示 出  温柔  一面  白色  边框 加上 黑色  屏幕显示 出  亲民  一面   正面 ME525+ 采用  3.7 英寸  TFT 屏幕  分辨率  854x480 级别     ME811  DROID  X    分辨率    屏幕 更 小 反而 更加  清晰  这块 屏幕 采用  康宁 公司 大猩猩 玻璃  具备 很强  防刮功   非常  坚固  注意   大屏幕 都 不 具备 抗 摔  功能    康宁 公司  专门  汽车  直升机 生产 前 挡风玻璃  厂商  此次 ME525+ 采用  大猩猩 玻璃  一种 环保 性  碱 铝硅酸盐 薄层 玻璃  具备 很强  清洁 能力    触控 之后 很难 留下 指纹    浸水 之后  很快  水滴 汇集 流出  非常  方便  当然 整个 机器 外表 最大  改变 仍然   颜色 上 做文章  白色  黑色  主色调 仍然 不 能够 丢掉   这次 摩托罗拉  更加  亲民 加入  粉色  设计

The text above is the filtered editorial review used as the query. The reviews to compare against it:

 手机 很 好 很 喜欢 三防 出色 操作系统 垃圾 
Defy 用过 3 年 感受 
刚买 很 兴奋 当时 还 流行 机 还 很 贵

The results are as follows:

To be continued...

Published: 2024-01-30 22:30:27
