[TensorFlow Deep Learning] 3. Traffic Data Preprocessing (Converting Between Strings, CSV, DataFrame, Dictionaries, and Tensors)


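This column records what the author has learned while studying deep learning with TensorFlow.

This section gives a brief introduction to data preprocessing, mainly the conversion between strings, CSV files, DataFrames, dictionaries, and tensors. Using the formatting of raw request strings into a tensor as the running example, it walks through each step of the process.

The Jupyter notebook for this section has been uploaded to Gitee for study and discussion: my Gitee repository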

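Contents

  • 1 Storing and loading DataFrame data
    • Converting dictionary data into a DataFrame
    • Exporting DataFrame data to a CSV file
  • 2 Formatting strings into dictionaries
    • Demo
      • Storing the request in a list
      • Getting the method, URL, and protocol
      • Getting the headers
      • Adding the headers to a dictionary
      • Getting the request body
      • Assembling the request into a dictionary
    • Full implementation
  • 3 Converting dictionaries into a DataFrame
  • 4 Handling missing values by imputation
  • 5 Converting a DataFrame into a tensor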

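To apply deep learning to real-world problems, we usually start by preprocessing raw data rather than with data already prepared in tensor format. We use Python's pandas package to preprocess the raw data and convert it into tensor format.

The data used below is taken from the HTTP DATASET CSIC 2010 dataset: /. The dataset contains tens of thousands of automatically generated web requests and is mainly used for testing web attack protection systems.

1 Storing and loading DataFrame data

In this part we get to know the DataFrame. A DataFrame is a data structure in the pandas library that resembles a table or spreadsheet. It can be viewed as a two-dimensional structure in which data is organized in rows and columns. DataFrame provides rich functionality for cleaning, analyzing, and manipulating data.

Converting dictionary data into a DataFrame

# Build a DataFrame from a dictionary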
import pandas as pd
request_dict={'Method': 'POST','URL': 'localhost:8080/tienda1/publico/anadir.jsp','Protocol': 'HTTP/1.1','User-Agent': 'Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.8 (like Gecko)','Pragma': 'no-cache','Cache-control': 'no-cache','Accept': 'text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5','Accept-Encoding': 'x-gzip, x-deflate, gzip, deflate','Accept-Charset': 'utf-8, utf-8;q=0.5, *;q=0.5','Accept-Language': 'en','Host': 'localhost','Cookie': 'JSESSIONID=933185092E0B668B90676E0A2B0767AF','Content-Type': 'application/x-www-form-urlencoded','Connection': 'close','Content-Length': '68','Body': 'id=3&nombre=Vino+Rioja&precio=100&cantidad=55&B1=A%F1adir+al+carrito'}
df = pd.DataFrame([request_dict])  # A DataFrame can be viewed as a list of rows, one list element per row, so the data must be passed as a list: wrap the dictionary in []
df

Result:

	Method	URL	Protocol	User-Agent	Pragma	Cache-control	Accept	Accept-Encoding	Accept-Charset	Accept-Language	Host	Cookie	Content-Type	Connection	Content-Length	Body
0	POST	localhost:8080/tienda1/publico/anadir.jsp	HTTP/1.1	Mozilla/5.0 (compatible; Konqueror/3.5; Linux)...	no-cache	no-cache	text/xml,application/xml,application/xhtml+...	x-gzip, x-deflate, gzip, deflate	utf-8, utf-8;q=0.5, *;q=0.5	en	localhost	JSESSIONID=933185092E0B668B90676E0A2B0767AF	application/x-www-form-urlencoded	close	68	id=3&nombre=Vino+Rioja&precio=100&cantidad=55&...

Here [request_dict] has the following format:

[{'Method': 'POST','URL': 'localhost:8080/tienda1/publico/anadir.jsp','Protocol': 'HTTP/1.1','User-Agent': 'Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.8 (like Gecko)','Pragma': 'no-cache','Cache-control': 'no-cache','Accept': 'text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5','Accept-Encoding': 'x-gzip, x-deflate, gzip, deflate','Accept-Charset': 'utf-8, utf-8;q=0.5, *;q=0.5','Accept-Language': 'en','Host': 'localhost','Cookie': 'JSESSIONID=933185092E0B668B90676E0A2B0767AF','Content-Type': 'application/x-www-form-urlencoded','Connection': 'close','Content-Length': '68','Body': 'id=3&nombre=Vino+Rioja&precio=100&cantidad=55&B1=A%F1adir+al+carrito'}]
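When several dictionaries are passed in one list, pandas uses the union of their keys as columns and fills absent keys with NaN. A minimal sketch with two shortened, made-up request dictionaries (the names get_request, post_request, and demo_df are illustrative and not part of the original notebook):

```python
import pandas as pd

# Two shortened, hypothetical request dictionaries with different key sets.
get_request = {'Method': 'GET', 'URL': 'localhost:8080/tienda1/index.jsp'}
post_request = {'Method': 'POST',
                'URL': 'localhost:8080/tienda1/publico/anadir.jsp',
                'Body': 'id=3'}

# Each dictionary becomes one row; the columns are the union of all keys.
demo_df = pd.DataFrame([get_request, post_request])

print(demo_df.columns.tolist())        # column names in first-seen order
print(demo_df['Body'].isna().tolist()) # the GET row has no Body, so it is NaN
```

This is why the list wrapper matters: one dictionary in a list gives one row, and more dictionaries simply add more rows.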

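Exporting DataFrame data to a CSV file

# Export the DataFrame data to a CSV file
import os
os.makedirs(os.path.join('.', 'data'), exist_ok=True)  # Create the directory ./data/
data_file = os.path.join('.', 'data', 'Traffic.csv')
# to_csv opens and writes the file itself, so no separate open() is needed
df.to_csv(data_file, index=False)

The to_csv method saves the data in the DataFrame to a file named Traffic.csv. The parameter index=False means the row index is not saved (by default, the row index is also written to the CSV file).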
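The effect of the index parameter can be checked with an in-memory round trip. The sketch below uses io.StringIO instead of a file on disk, and demo_df is a made-up two-row frame rather than the traffic data:

```python
import io
import pandas as pd

demo_df = pd.DataFrame([{'Method': 'GET'}, {'Method': 'POST'}])

# index=True (the default) writes the row labels as an extra, unnamed first column.
with_index = io.StringIO()
demo_df.to_csv(with_index, index=True)

# index=False writes only the data columns.
without_index = io.StringIO()
demo_df.to_csv(without_index, index=False)

# Reading both versions back shows the difference in columns.
cols_with = pd.read_csv(io.StringIO(with_index.getvalue())).columns.tolist()
cols_without = pd.read_csv(io.StringIO(without_index.getvalue())).columns.tolist()
print(cols_with)     # ['Unnamed: 0', 'Method']
print(cols_without)  # ['Method']
```

If the index is written and the file is read back without further options, the old index shows up as a column named Unnamed: 0, which is usually not wanted in a feature table.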

Reading CSV file data into a DataFrame

# Read the data from the CSV file into a DataFrame
data = pd.read_csv(data_file)
data

Result:

	Method	URL	Protocol	User-Agent	Pragma	Cache-control	Accept	Accept-Encoding	Accept-Charset	Accept-Language	Host	Cookie	Content-Type	Connection	Content-Length	Body
0	POST	localhost:8080/tienda1/publico/anadir.jsp	HTTP/1.1	Mozilla/5.0 (compatible; Konqueror/3.5; Linux)...	no-cache	no-cache	text/xml,application/xml,application/xhtml+...	x-gzip, x-deflate, gzip, deflate	utf-8, utf-8;q=0.5, *;q=0.5	en	localhost	JSESSIONID=933185092E0B668B90676E0A2B0767AF	application/x-www-form-urlencoded	close	68	id=3&nombre=Vino+Rioja&precio=100&cantidad=55&...

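2 Formatting strings into dictionaries

Of course, our data may well come from a txt file as a series of strings, in which case the strings need to be processed first.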

# Data: three raw requests, separated from each other by two blank lines;
# the POST body is separated from its headers by one blank line
requests='''GET localhost:8080/tienda1/index.jsp HTTP/1.1
User-Agent: Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.8 (like Gecko)
Pragma: no-cache
Cache-control: no-cache
Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
Accept-Encoding: x-gzip, x-deflate, gzip, deflate
Accept-Charset: utf-8, utf-8;q=0.5, *;q=0.5
Accept-Language: en
Host: localhost:8080
Cookie: JSESSIONID=1F767F17239C9B670A39E9B10C3825F4
Connection: close


GET localhost:8080/tienda1/publico/anadir.jsp?id=3&nombre=Vino+Rioja&precio=100&cantidad=55&B1=A%F1adir+al+carrito HTTP/1.1
User-Agent: Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.8 (like Gecko)
Pragma: no-cache
Cache-control: no-cache
Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
Accept-Encoding: x-gzip, x-deflate, gzip, deflate
Accept-Charset: utf-8, utf-8;q=0.5, *;q=0.5
Accept-Language: en
Host: localhost:8080
Cookie: JSESSIONID=81761ACA043B0E6014CA42A4BCD06AB5
Connection: close


POST localhost:8080/tienda1/publico/anadir.jsp HTTP/1.1
User-Agent: Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.8 (like Gecko)
Pragma: no-cache
Cache-control: no-cache
Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
Accept-Encoding: x-gzip, x-deflate, gzip, deflate
Accept-Charset: utf-8, utf-8;q=0.5, *;q=0.5
Accept-Language: en
Host: localhost:8080
Cookie: JSESSIONID=933185092E0B668B90676E0A2B0767AF
Content-Type: application/x-www-form-urlencoded
Connection: close
Content-Length: 68

id=3&nombre=Vino+Rioja&precio=100&cantidad=55&B1=A%F1adir+al+carrito'''
# Split the data into individual requests at the two blank lines
request_list=requests.split("\n\n\n")
request_list

Result:

['GET localhost:8080/tienda1/index.jsp HTTP/1.1\nUser-Agent: Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.8 (like Gecko)\nPragma: no-cache\nCache-control: no-cache\nAccept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5\nAccept-Encoding: x-gzip, x-deflate, gzip, deflate\nAccept-Charset: utf-8, utf-8;q=0.5, *;q=0.5\nAccept-Language: en\nHost: localhost:8080\nCookie: JSESSIONID=1F767F17239C9B670A39E9B10C3825F4\nConnection: close',
 'GET localhost:8080/tienda1/publico/anadir.jsp?id=3&nombre=Vino+Rioja&precio=100&cantidad=55&B1=A%F1adir+al+carrito HTTP/1.1\nUser-Agent: Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.8 (like Gecko)\nPragma: no-cache\nCache-control: no-cache\nAccept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5\nAccept-Encoding: x-gzip, x-deflate, gzip, deflate\nAccept-Charset: utf-8, utf-8;q=0.5, *;q=0.5\nAccept-Language: en\nHost: localhost:8080\nCookie: JSESSIONID=81761ACA043B0E6014CA42A4BCD06AB5\nConnection: close',
 'POST localhost:8080/tienda1/publico/anadir.jsp HTTP/1.1\nUser-Agent: Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.8 (like Gecko)\nPragma: no-cache\nCache-control: no-cache\nAccept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5\nAccept-Encoding: x-gzip, x-deflate, gzip, deflate\nAccept-Charset: utf-8, utf-8;q=0.5, *;q=0.5\nAccept-Language: en\nHost: localhost:8080\nCookie: JSESSIONID=933185092E0B668B90676E0A2B0767AF\nContent-Type: application/x-www-form-urlencoded\nConnection: close\nContent-Length: 68\n\nid=3&nombre=Vino+Rioja&precio=100&cantidad=55&B1=A%F1adir+al+carrito']

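Demo

The following demo uses the third request (the POST request) as an example to make the steps easier to follow. Readers who want the full implementation can skip to the next part.

Storing the request in a list

request=request_list[2]
# Split the request into lines at each newline
lines = request.split("\n")
lines

Result: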

['POST localhost:8080/tienda1/publico/anadir.jsp HTTP/1.1','User-Agent: Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.8 (like Gecko)','Pragma: no-cache','Cache-control: no-cache','Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5','Accept-Encoding: x-gzip, x-deflate, gzip, deflate','Accept-Charset: utf-8, utf-8;q=0.5, *;q=0.5','Accept-Language: en','Host: localhost:8080','Cookie: JSESSIONID=933185092E0B668B90676E0A2B0767AF','Content-Type: application/x-www-form-urlencoded','Connection: close','Content-Length: 68','','id=3&nombre=Vino+Rioja&precio=100&cantidad=55&B1=A%F1adir+al+carrito']

Getting the method, URL, and protocol

# The first line (the request line) holds three space-separated fields
method,url,protocol= lines[0].split(" ")
method,url,protocol

Result:

('POST', 'localhost:8080/tienda1/publico/anadir.jsp', 'HTTP/1.1')

Getting the headers

# Skip the request line at the start, and the blank line and body at the end
headers=lines[1:-2]
headers

Result:

['User-Agent: Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.8 (like Gecko)','Pragma: no-cache','Cache-control: no-cache','Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5','Accept-Encoding: x-gzip, x-deflate, gzip, deflate','Accept-Charset: utf-8, utf-8;q=0.5, *;q=0.5','Accept-Language: en','Host: localhost:8080','Cookie: JSESSIONID=933185092E0B668B90676E0A2B0767AF','Content-Type: application/x-www-form-urlencoded','Connection: close','Content-Length: 68']

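Adding the headers to a dictionary

# Note: split(":") cuts at every colon, so [1] keeps only the text up to the next colon
# (for 'Host: localhost:8080' the port is dropped); split(":", 1) would keep the full value
headers_dict = {header.split(":")[0]: header.split(":")[1].strip() for header in headers}
headers_dict

Result: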

{'User-Agent': 'Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.8 (like Gecko)','Pragma': 'no-cache','Cache-control': 'no-cache','Accept': 'text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5','Accept-Encoding': 'x-gzip, x-deflate, gzip, deflate','Accept-Charset': 'utf-8, utf-8;q=0.5, *;q=0.5','Accept-Language': 'en','Host': 'localhost','Cookie': 'JSESSIONID=933185092E0B668B90676E0A2B0767AF','Content-Type': 'application/x-www-form-urlencoded','Connection': 'close','Content-Length': '68'}
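In the result above the Host value has lost its port: split(":") cuts at every colon, so index [1] of 'Host: localhost:8080' is only 'localhost'. A possible variant, not part of the original notebook, splits at the first colon only by using str.partition. The function name parse_headers and the two sample lines are illustrative:

```python
def parse_headers(header_lines):
    """Turn 'Name: value' lines into a dict, splitting at the first colon only."""
    parsed = {}
    for line in header_lines:
        name, sep, value = line.partition(":")
        if not sep:
            continue  # skip lines that contain no colon at all
        parsed[name.strip()] = value.strip()
    return parsed

sample_headers = ['Host: localhost:8080', 'Connection: close']
print(parse_headers(sample_headers))
# {'Host': 'localhost:8080', 'Connection': 'close'}
```

With this variant the Host field would read localhost:8080 instead of localhost wherever it appears in later results.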

Getting the request body

# The body is the last line of the POST request
body=lines[-1]
body

Result:

'id=3&nombre=Vino+Rioja&precio=100&cantidad=55&B1=A%F1adir+al+carrito'

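Assembling the request into a dictionary

# Look up each header in headers_dict; a missing header falls back to ''
request_dict = {
    'Method': method,
    'URL': url,
    'Protocol': protocol,
    'User-Agent': headers_dict.get('User-Agent', ''),
    'Pragma': headers_dict.get('Pragma', ''),
    'Cache-control': headers_dict.get('Cache-control', ''),
    'Accept': headers_dict.get('Accept', ''),
    'Accept-Encoding': headers_dict.get('Accept-Encoding', ''),
    'Accept-Charset': headers_dict.get('Accept-Charset', ''),
    'Accept-Language': headers_dict.get('Accept-Language', ''),
    'Host': headers_dict.get('Host', ''),
    'Cookie': headers_dict.get('Cookie', ''),
    'Content-Type': headers_dict.get('Content-Type', ''),
    'Connection': headers_dict.get('Connection', ''),
    'Content-Length': headers_dict.get('Content-Length', ''),
    'Body': body,
}
request_dict

Result: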

{'Method': 'POST','URL': 'localhost:8080/tienda1/publico/anadir.jsp','Protocol': 'HTTP/1.1','User-Agent': 'Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.8 (like Gecko)','Pragma': 'no-cache','Cache-control': 'no-cache','Accept': 'text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5','Accept-Encoding': 'x-gzip, x-deflate, gzip, deflate','Accept-Charset': 'utf-8, utf-8;q=0.5, *;q=0.5','Accept-Language': 'en','Host': 'localhost','Cookie': 'JSESSIONID=933185092E0B668B90676E0A2B0767AF','Content-Type': 'application/x-www-form-urlencoded','Connection': 'close','Content-Length': '68','Body': 'id=3&nombre=Vino+Rioja&precio=100&cantidad=55&B1=A%F1adir+al+carrito'}

Full implementation

This version processes multiple requests.

requests='''GET localhost:8080/tienda1/index.jsp HTTP/1.1
User-Agent: Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.8 (like Gecko)
Pragma: no-cache
Cache-control: no-cache
Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
Accept-Encoding: x-gzip, x-deflate, gzip, deflate
Accept-Charset: utf-8, utf-8;q=0.5, *;q=0.5
Accept-Language: en
Host: localhost:8080
Cookie: JSESSIONID=1F767F17239C9B670A39E9B10C3825F4
Connection: close


GET localhost:8080/tienda1/publico/anadir.jsp?id=3&nombre=Vino+Rioja&precio=100&cantidad=55&B1=A%F1adir+al+carrito HTTP/1.1
User-Agent: Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.8 (like Gecko)
Pragma: no-cache
Cache-control: no-cache
Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
Accept-Encoding: x-gzip, x-deflate, gzip, deflate
Accept-Charset: utf-8, utf-8;q=0.5, *;q=0.5
Accept-Language: en
Host: localhost:8080
Cookie: JSESSIONID=81761ACA043B0E6014CA42A4BCD06AB5
Connection: close


POST localhost:8080/tienda1/publico/anadir.jsp HTTP/1.1
User-Agent: Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.8 (like Gecko)
Pragma: no-cache
Cache-control: no-cache
Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
Accept-Encoding: x-gzip, x-deflate, gzip, deflate
Accept-Charset: utf-8, utf-8;q=0.5, *;q=0.5
Accept-Language: en
Host: localhost:8080
Cookie: JSESSIONID=933185092E0B668B90676E0A2B0767AF
Content-Type: application/x-www-form-urlencoded
Connection: close
Content-Length: 68

id=3&nombre=Vino+Rioja&precio=100&cantidad=55&B1=A%F1adir+al+carrito'''
# Split into individual requests at the two blank lines
request_list=requests.split("\n\n\n")
requests_list=[]
for request in request_list:
    # Store the request in a list of lines
    lines = request.split("\n")
    # Get the method, URL, and protocol
    method,url,protocol= lines[0].split(" ")
    # Start the dictionary for this request
    request_dict = {'Method': method,'URL': url,'Protocol': protocol}
    if(method=='GET'):
        # Get the headers (a GET request has no body)
        headers=lines[1:]
    elif(method=='POST'):
        # Get the headers (skip the blank line and the body at the end)
        headers=lines[1:-2]
        # Get the request body
        body=lines[-1]
        request_dict.update({'Body' : body})
    # Add the headers to the dictionary
    headers_dict = {header.split(":")[0]: header.split(":")[1].strip() for header in headers}
    request_dict.update(headers_dict)
    requests_list.append(request_dict)
requests_list

Result:

[{'Method': 'GET','URL': 'localhost:8080/tienda1/index.jsp','Protocol': 'HTTP/1.1','User-Agent': 'Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.8 (like Gecko)','Pragma': 'no-cache','Cache-control': 'no-cache','Accept': 'text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5','Accept-Encoding': 'x-gzip, x-deflate, gzip, deflate','Accept-Charset': 'utf-8, utf-8;q=0.5, *;q=0.5','Accept-Language': 'en','Host': 'localhost','Cookie': 'JSESSIONID=1F767F17239C9B670A39E9B10C3825F4','Connection': 'close'},{'Method': 'GET','URL': 'localhost:8080/tienda1/publico/anadir.jsp?id=3&nombre=Vino+Rioja&precio=100&cantidad=55&B1=A%F1adir+al+carrito','Protocol': 'HTTP/1.1','User-Agent': 'Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.8 (like Gecko)','Pragma': 'no-cache','Cache-control': 'no-cache','Accept': 'text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5','Accept-Encoding': 'x-gzip, x-deflate, gzip, deflate','Accept-Charset': 'utf-8, utf-8;q=0.5, *;q=0.5','Accept-Language': 'en','Host': 'localhost','Cookie': 'JSESSIONID=81761ACA043B0E6014CA42A4BCD06AB5','Connection': 'close'},{'Method': 'POST','URL': 'localhost:8080/tienda1/publico/anadir.jsp','Protocol': 'HTTP/1.1','Body': 'id=3&nombre=Vino+Rioja&precio=100&cantidad=55&B1=A%F1adir+al+carrito','User-Agent': 'Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.8 (like Gecko)','Pragma': 'no-cache','Cache-control': 'no-cache','Accept': 'text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5','Accept-Encoding': 'x-gzip, x-deflate, gzip, deflate','Accept-Charset': 'utf-8, utf-8;q=0.5, *;q=0.5','Accept-Language': 'en','Host': 'localhost','Cookie': 'JSESSIONID=933185092E0B668B90676E0A2B0767AF','Content-Type': 'application/x-www-form-urlencoded','Connection': 'close','Content-Length': '68'}]
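The loop above decides where the body starts from the method name. An alternative sketch, assuming the body is always separated from the headers by exactly one empty line (as in the POST request here), splits on that empty line instead and so needs no branch per method. The function name parse_request and the shortened sample request are hypothetical and not taken from the original notebook:

```python
def parse_request(raw_request):
    """Parse one raw HTTP request string into a flat dictionary."""
    # Everything before the first empty line is the head; the rest is the body.
    head, _, body = raw_request.partition("\n\n")
    lines = head.split("\n")
    method, url, protocol = lines[0].split(" ")
    parsed = {'Method': method, 'URL': url, 'Protocol': protocol}
    for line in lines[1:]:
        # Split at the first colon only, so values containing ':' stay whole.
        name, _, value = line.partition(":")
        parsed[name.strip()] = value.strip()
    if body:
        parsed['Body'] = body
    return parsed

# Shortened, illustrative request (fewer headers and a shorter body than the data above).
sample = ("POST localhost:8080/tienda1/publico/anadir.jsp HTTP/1.1\n"
          "Host: localhost:8080\n"
          "Content-Type: application/x-www-form-urlencoded\n"
          "\n"
          "id=3&nombre=Vino+Rioja")
result = parse_request(sample)
print(result['Method'], result['Host'], result['Body'])
```

A request without a body simply produces no Body key, so the same function handles the GET requests as well.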

3 Converting dictionaries into a DataFrame

We use loc to add the data. loc is the pandas accessor for locating and accessing data in a DataFrame by label.

import pandas as pd
# Initialize df with the expected columns
df = pd.DataFrame(columns=['Method', 'URL' , 'Protocol', 'User-Agent', 'Pragma', 'Cache-control', 'Accept', 'Accept-Encoding','Accept-Charset', 'Accept-Language', 'Host', 'Cookie', 'Content-Type', 'Connection','Content-Length', 'Body'])
# Use loc to append each request as a new row; keys missing from a request become NaN
for request_dict in requests_list:
    df.loc[len(df)] = request_dict
# The following line clears df
#df.drop(df.index, inplace=True)
df

Result:

	Method	URL	Protocol	User-Agent	Pragma	Cache-control	Accept	Accept-Encoding	Accept-Charset	Accept-Language	Host	Cookie	Content-Type	Connection	Content-Length	Body
0	GET	localhost:8080/tienda1/index.jsp	HTTP/1.1	Mozilla/5.0 (compatible; Konqueror/3.5; Linux)...	no-cache	no-cache	text/xml,application/xml,application/xhtml+...	x-gzip, x-deflate, gzip, deflate	utf-8, utf-8;q=0.5, *;q=0.5	en	localhost	JSESSIONID=1F767F17239C9B670A39E9B10C3825F4	NaN	close	NaN	NaN
1	GET	localhost:8080/tienda1/publico/	HTTP/1.1	Mozilla/5.0 (compatible; Konqueror/3.5; Linux)...	no-cache	no-cache	text/xml,application/xml,application/xhtml+...	x-gzip, x-deflate, gzip, deflate	utf-8, utf-8;q=0.5, *;q=0.5	en	localhost	JSESSIONID=81761ACA043B0E6014CA42A4BCD06AB5	NaN	close	NaN	NaN
2	POST	localhost:8080/tienda1/publico/anadir.jsp	HTTP/1.1	Mozilla/5.0 (compatible; Konqueror/3.5; Linux)...	no-cache	no-cache	text/xml,application/xml,application/xhtml+...	x-gzip, x-deflate, gzip, deflate	utf-8, utf-8;q=0.5, *;q=0.5	en	localhost	JSESSIONID=933185092E0B668B90676E0A2B0767AF	application/x-www-form-urlencoded	close	68	id=3&nombre=Vino+Rioja&precio=100&cantidad=55&...

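Appending rows one at a time with loc works, but for a list of dictionaries the DataFrame constructor can also build the whole table in one call. A sketch with two shortened, hypothetical dictionaries standing in for requests_list; passing columns fixes the column order and creates NaN for keys that no request provides:

```python
import pandas as pd

# Shortened stand-ins for the parsed requests (illustrative values only).
sample_requests = [
    {'Method': 'GET', 'URL': 'localhost:8080/tienda1/index.jsp'},
    {'Method': 'POST', 'URL': 'localhost:8080/tienda1/publico/anadir.jsp', 'Body': 'id=3'},
]
columns = ['Method', 'URL', 'Content-Length', 'Body']

# One call builds all rows; missing keys are filled with NaN.
batch_df = pd.DataFrame(sample_requests, columns=columns)
print(batch_df.isna().sum().tolist())  # missing values per column
```

For large request files this avoids growing the DataFrame row by row, which gets slow as the table grows.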
Export the data in the DataFrame to a CSV file:

import os
os.makedirs(os.path.join('.', 'data'), exist_ok=True)  # Create the directory ./data/
data_file = os.path.join('.', 'data', 'Traffic.csv')
# Save the data as a CSV file without the row index
df.to_csv(data_file, index=False)

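4 Handling missing values by imputation

NaN values represent missing values. Methods for handling them include imputation and deletion: imputation replaces a missing value with a substitute, while deletion simply ignores it. Here we consider imputation.

# Read the data from the CSV file
df = pd.read_csv(data_file)
# Select the numeric columns
numeric_cols = df.select_dtypes(include=['float64']).columns
# # Select the non-numeric columns
# non_numeric_cols = df.select_dtypes(exclude=['float64']).columns
# Fill missing values in the numeric columns with the column mean
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())
# # Handle only the Content-Type column
# df = pd.get_dummies(df , columns=['Content-Type'] , dummy_na=True)
# Handle the non-numeric columns by one-hot encoding them, with NaN as its own category
# (for this dataset doing so is not meaningful; this part only demonstrates the operation)
df = pd.get_dummies(df , dummy_na=True)
df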

Result:

	Content-Length	Method_GET	Method_POST	Method_nan	URL_localhost:8080/tienda1/index.jsp	URL_localhost:8080/tienda1/publico/anadir.jsp	URL_localhost:8080/tienda1/publico/anadir.jsp?id=3&nombre=Vino+Rioja&precio=100&cantidad=55&B1=A%F1adir+al+carrito	URL_nan	Protocol_HTTP/1.1	Protocol_nan	...	Cookie_JSESSIONID=1F767F17239C9B670A39E9B10C3825F4	Cookie_JSESSIONID=81761ACA043B0E6014CA42A4BCD06AB5	Cookie_JSESSIONID=933185092E0B668B90676E0A2B0767AF	Cookie_nan	Content-Type_application/x-www-form-urlencoded	Content-Type_nan	Connection_close	Connection_nan	Body_id=3&nombre=Vino+Rioja&precio=100&cantidad=55&B1=A%F1adir+al+carrito	Body_nan
0	68.0	True	False	False	True	False	False	False	True	False	...	True	False	False	False	False	True	True	False	False	True
1	68.0	True	False	False	False	False	True	False	True	False	...	False	True	False	False	False	True	True	False	False	True
2	68.0	False	True	False	False	True	False	False	True	False	...	False	False	True	False	True	False	True	False	True	False

3 rows × 36 columns
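For comparison, the deletion approach mentioned at the start of this section can be sketched with dropna. The small frame below is made up for illustration and is not the traffic data:

```python
import pandas as pd

toy_df = pd.DataFrame({
    'Content-Length': [68.0, None, 20.0],
    'Method': ['POST', 'GET', 'POST'],
})

# Imputation: replace the missing value with the column mean, (68 + 20) / 2 = 44.
imputed = toy_df.fillna({'Content-Length': toy_df['Content-Length'].mean()})

# Deletion: drop every row that contains a missing value.
deleted = toy_df.dropna()

print(imputed['Content-Length'].tolist())  # [68.0, 44.0, 20.0]
print(len(deleted))                        # 2
```

On the three requests used in this section, deletion would discard both GET rows, because they have no Content-Type, Content-Length, or Body; imputation keeps every row.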

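5 Converting a DataFrame into a tensor

Only a DataFrame of numeric type can be converted into tensor format. If the traffic above were to be used as a dataset for training an intrusion detection model, the scheme used above for turning non-numeric items into numbers would certainly not work: the machine cannot learn the features of the traffic from it.
As far as the author knows, one way to turn traffic into numeric data is to convert it into images and train a convolutional network on them. The author will continue studying intrusion detection in that direction.
Once the data is in tensor format, it can be manipulated with tensor functions.

import tensorflow as tf
# Convert the DataFrame to a float NumPy array, then wrap it in a tensor
X = tf.constant(df.to_numpy(dtype=float))
X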

Result:

<tf.Tensor: shape=(3, 36), dtype=float64, numpy=
array([[68.,  1.,  0.,  0.,  1.,  0.,  0.,  0.,  1.,  0.,  1.,  0.,  1.,
         0.,  1.,  0.,  1.,  0.,  1.,  0.,  1.,  0.,  1.,  0.,  1.,  0.,
         1.,  0.,  0.,  0.,  0.,  1.,  1.,  0.,  0.,  1.],
       [68.,  1.,  0.,  0.,  0.,  0.,  1.,  0.,  1.,  0.,  1.,  0.,  1.,
         0.,  1.,  0.,  1.,  0.,  1.,  0.,  1.,  0.,  1.,  0.,  1.,  0.,
         0.,  1.,  0.,  0.,  0.,  1.,  1.,  0.,  0.,  1.],
       [68.,  0.,  1.,  0.,  0.,  1.,  0.,  0.,  1.,  0.,  1.,  0.,  1.,
         0.,  1.,  0.,  1.,  0.,  1.,  0.,  1.,  0.,  1.,  0.,  1.,  0.,
         0.,  0.,  1.,  0.,  1.,  0.,  1.,  0.,  1.,  0.]])>
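The text above mentions turning traffic into images without describing an encoding. One possible reading, shown here purely as an assumption and not as the author's method, is to treat the bytes of a request as pixel intensities and lay them out on a square grid. The function name request_to_grid, the zero padding, and the scaling to [0, 1] are all illustrative choices:

```python
import math

def request_to_grid(raw_request, side=None):
    """Map the bytes of a request string to a square grid of values in [0, 1]."""
    data = raw_request.encode('utf-8')
    if side is None:
        # Smallest square that can hold every byte.
        side = math.ceil(math.sqrt(len(data)))
    # Truncate to the grid size, then zero-pad on the right if the data is shorter.
    padded = data[:side * side].ljust(side * side, b'\x00')
    return [[padded[r * side + c] / 255 for c in range(side)] for r in range(side)]

grid = request_to_grid("GET /index.jsp HTTP/1.1")
print(len(grid), len(grid[0]))  # 23 bytes fit into a 5 x 5 grid
```

A nested list like this can be passed to tf.constant in the same way as the NumPy array above; with a fixed side, all requests produce grids of the same shape.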

Published: 2024-01-29 10:25:10

Article link: https://www.4u4v.net/it/170649511514625.html
