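This column records what the author has learned while studying deep learning with TensorFlow.
This section gives a brief introduction to data preprocessing, mainly converting between strings, CSV files, DataFrames, dictionaries, and tensors. As a running example, it walks step by step through turning raw request text into a tensor.
The Jupyter notebook for this section has been uploaded to gitee for study and discussion: my gitee repository
The data used below is taken from the HTTP DATASET CSIC 2010 dataset: /. The dataset contains tens of thousands of automatically generated web requests and is mainly used to test web attack protection systems.
In this part we need to get to know the DataFrame. A DataFrame is a data structure from the pandas library, similar to a table or spreadsheet: a two-dimensional structure in which data is organized in rows and columns. It offers rich functionality for cleaning, analyzing, and manipulating data.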
# Build a DataFrame from a dict of request fields (one dict becomes one row)
import pandas as pd
request_dict={'Method': 'POST','URL': 'localhost:8080/tienda1/publico/anadir.jsp','Protocol': 'HTTP/1.1','User-Agent': 'Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.8 (like Gecko)','Pragma': 'no-cache','Cache-control': 'no-cache','Accept': 'text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5','Accept-Encoding': 'x-gzip, x-deflate, gzip, deflate','Accept-Charset': 'utf-8, utf-8;q=0.5, *;q=0.5','Accept-Language': 'en','Host': 'localhost','Cookie': 'JSESSIONID=933185092E0B668B90676E0A2B0767AF','Content-Type': 'application/x-www-form-urlencoded','Connection': 'close','Content-Length': '68','Body': 'id=3&nombre=Vino+Rioja&precio=100&cantidad=55&B1=A%F1adir+al+carrito'}
# The rows of a DataFrame can be passed as a list in which each element is one row,
# so the single dict is wrapped in [] to form a one-row list.
df = pd.DataFrame([request_dict])
df
Result:
| | Method | URL | Protocol | User-Agent | Pragma | Cache-control | Accept | Accept-Encoding | Accept-Charset | Accept-Language | Host | Cookie | Content-Type | Connection | Content-Length | Body |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | POST | localhost:8080/tienda1/publico/anadir.jsp | HTTP/1.1 | Mozilla/5.0 (compatible; Konqueror/3.5; Linux)... | no-cache | no-cache | text/xml,application/xml,application/xhtml... | x-gzip, x-deflate, gzip, deflate | utf-8, utf-8;q=0.5, *;q=0.5 | en | localhost | JSESSIONID=933185092E0B668B90676E0A2B0767AF | application/x-www-form-urlencoded | close | 68 | id=3&nombre=Vino+Rioja&precio=100&cantidad=55&... |
Here [request_dict] has the following format:
[{'Method': 'POST','URL': 'localhost:8080/tienda1/publico/anadir.jsp','Protocol': 'HTTP/1.1','User-Agent': 'Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.8 (like Gecko)','Pragma': 'no-cache','Cache-control': 'no-cache','Accept': 'text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5','Accept-Encoding': 'x-gzip, x-deflate, gzip, deflate','Accept-Charset': 'utf-8, utf-8;q=0.5, *;q=0.5','Accept-Language': 'en','Host': 'localhost','Cookie': 'JSESSIONID=933185092E0B668B90676E0A2B0767AF','Content-Type': 'application/x-www-form-urlencoded','Connection': 'close','Content-Length': '68','Body': 'id=3&nombre=Vino+Rioja&precio=100&cantidad=55&B1=A%F1adir+al+carrito'}]
# Export the DataFrame to a CSV file
import os
os.makedirs(os.path.join('.', 'data'), exist_ok=True)  # create the directory "./data/"
data_file = os.path.join('.', 'data', 'Traffic.csv')
df.to_csv(data_file, index=False)
The to_csv method saves the data in the DataFrame to a file named Traffic.csv. The argument index=False means the row index is not saved (by default, the row index is also written to the CSV file).
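To make the effect of the index argument concrete, here is a minimal sketch that writes a small frame to an in-memory buffer instead of a file and reads it back. The names `df_demo` and `round_trip` are made up for this illustration and are not part of the original notebook.

```python
import io
import pandas as pd

# A tiny stand-in frame (hypothetical data, not the full request table)
df_demo = pd.DataFrame([
    {"Method": "GET", "Protocol": "HTTP/1.1"},
    {"Method": "POST", "Protocol": "HTTP/1.1"},
])

def round_trip(frame, keep_index):
    """Write the frame as CSV text into a buffer and read it back."""
    buffer = io.StringIO()
    frame.to_csv(buffer, index=keep_index)
    buffer.seek(0)
    return pd.read_csv(buffer)

with_index = round_trip(df_demo, keep_index=True)
without_index = round_trip(df_demo, keep_index=False)

print(list(with_index.columns))     # ['Unnamed: 0', 'Method', 'Protocol']
print(list(without_index.columns))  # ['Method', 'Protocol']
```

When the index is written, it comes back as an extra unnamed column unless read_csv is told which column holds the index, which is why the tutorial saves with index=False.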
Loading CSV file data into a DataFrame
# Load the CSV file into a DataFrame
data = pd.read_csv(data_file)
data
Result:
| | Method | URL | Protocol | User-Agent | Pragma | Cache-control | Accept | Accept-Encoding | Accept-Charset | Accept-Language | Host | Cookie | Content-Type | Connection | Content-Length | Body |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | POST | localhost:8080/tienda1/publico/anadir.jsp | HTTP/1.1 | Mozilla/5.0 (compatible; Konqueror/3.5; Linux)... | no-cache | no-cache | text/xml,application/xml,application/xhtml... | x-gzip, x-deflate, gzip, deflate | utf-8, utf-8;q=0.5, *;q=0.5 | en | localhost | JSESSIONID=933185092E0B668B90676E0A2B0767AF | application/x-www-form-urlencoded | close | 68 | id=3&nombre=Vino+Rioja&precio=100&cantidad=55&... |
Of course, our data may well come from a txt file as a series of strings, in which case the strings have to be processed first.
# Data: consecutive requests are separated by two blank lines,
# and a POST body is separated from its headers by one blank line
requests='''GET localhost:8080/tienda1/index.jsp HTTP/1.1
User-Agent: Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.8 (like Gecko)
Pragma: no-cache
Cache-control: no-cache
Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
Accept-Encoding: x-gzip, x-deflate, gzip, deflate
Accept-Charset: utf-8, utf-8;q=0.5, *;q=0.5
Accept-Language: en
Host: localhost:8080
Cookie: JSESSIONID=1F767F17239C9B670A39E9B10C3825F4
Connection: close


GET localhost:8080/tienda1/publico/anadir.jsp?id=3&nombre=Vino+Rioja&precio=100&cantidad=55&B1=A%F1adir+al+carrito HTTP/1.1
User-Agent: Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.8 (like Gecko)
Pragma: no-cache
Cache-control: no-cache
Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
Accept-Encoding: x-gzip, x-deflate, gzip, deflate
Accept-Charset: utf-8, utf-8;q=0.5, *;q=0.5
Accept-Language: en
Host: localhost:8080
Cookie: JSESSIONID=81761ACA043B0E6014CA42A4BCD06AB5
Connection: close


POST localhost:8080/tienda1/publico/anadir.jsp HTTP/1.1
User-Agent: Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.8 (like Gecko)
Pragma: no-cache
Cache-control: no-cache
Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
Accept-Encoding: x-gzip, x-deflate, gzip, deflate
Accept-Charset: utf-8, utf-8;q=0.5, *;q=0.5
Accept-Language: en
Host: localhost:8080
Cookie: JSESSIONID=933185092E0B668B90676E0A2B0767AF
Content-Type: application/x-www-form-urlencoded
Connection: close
Content-Length: 68

id=3&nombre=Vino+Rioja&precio=100&cantidad=55&B1=A%F1adir+al+carrito'''
# Split the data into individual requests
request_list=requests.split("\n\n\n")
request_list
Result:
['GET localhost:8080/tienda1/index.jsp HTTP/1.1\nUser-Agent: Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.8 (like Gecko)\nPragma: no-cache\nCache-control: no-cache\nAccept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5\nAccept-Encoding: x-gzip, x-deflate, gzip, deflate\nAccept-Charset: utf-8, utf-8;q=0.5, *;q=0.5\nAccept-Language: en\nHost: localhost:8080\nCookie: JSESSIONID=1F767F17239C9B670A39E9B10C3825F4\nConnection: close',
 'GET localhost:8080/tienda1/publico/anadir.jsp?id=3&nombre=Vino+Rioja&precio=100&cantidad=55&B1=A%F1adir+al+carrito HTTP/1.1\nUser-Agent: Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.8 (like Gecko)\nPragma: no-cache\nCache-control: no-cache\nAccept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5\nAccept-Encoding: x-gzip, x-deflate, gzip, deflate\nAccept-Charset: utf-8, utf-8;q=0.5, *;q=0.5\nAccept-Language: en\nHost: localhost:8080\nCookie: JSESSIONID=81761ACA043B0E6014CA42A4BCD06AB5\nConnection: close',
 'POST localhost:8080/tienda1/publico/anadir.jsp HTTP/1.1\nUser-Agent: Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.8 (like Gecko)\nPragma: no-cache\nCache-control: no-cache\nAccept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5\nAccept-Encoding: x-gzip, x-deflate, gzip, deflate\nAccept-Charset: utf-8, utf-8;q=0.5, *;q=0.5\nAccept-Language: en\nHost: localhost:8080\nCookie: JSESSIONID=933185092E0B668B90676E0A2B0767AF\nContent-Type: application/x-www-form-urlencoded\nConnection: close\nContent-Length: 68\n\nid=3&nombre=Vino+Rioja&precio=100&cantidad=55&B1=A%F1adir+al+carrito']
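The following demo uses the third request (the POST) as an example to make the steps easier to follow. To go straight to the full implementation, skip to the next part.
request=request_list[2]
lines = request.split("\n")
lines
Result: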
['POST localhost:8080/tienda1/publico/anadir.jsp HTTP/1.1','User-Agent: Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.8 (like Gecko)','Pragma: no-cache','Cache-control: no-cache','Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5','Accept-Encoding: x-gzip, x-deflate, gzip, deflate','Accept-Charset: utf-8, utf-8;q=0.5, *;q=0.5','Accept-Language: en','Host: localhost:8080','Cookie: JSESSIONID=933185092E0B668B90676E0A2B0767AF','Content-Type: application/x-www-form-urlencoded','Connection: close','Content-Length: 68','','id=3&nombre=Vino+Rioja&precio=100&cantidad=55&B1=A%F1adir+al+carrito']
method,url,protocol= lines[0].split(" ")
method,url,protocol
Result:
('POST', 'localhost:8080/tienda1/publico/anadir.jsp', 'HTTP/1.1')
headers=lines[1:-2]
headers
Result:
['User-Agent: Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.8 (like Gecko)','Pragma: no-cache','Cache-control: no-cache','Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5','Accept-Encoding: x-gzip, x-deflate, gzip, deflate','Accept-Charset: utf-8, utf-8;q=0.5, *;q=0.5','Accept-Language: en','Host: localhost:8080','Cookie: JSESSIONID=933185092E0B668B90676E0A2B0767AF','Content-Type: application/x-www-form-urlencoded','Connection: close','Content-Length: 68']
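# Note: split(":") splits at every colon, so a value that itself contains ":" is cut short
# ('Host: localhost:8080' becomes 'localhost'); split(":", 1) would keep the full value.
headers_dict = {header.split(":")[0]: header.split(":")[1].strip() for header in headers}
headers_dict
Result: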
{'User-Agent': 'Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.8 (like Gecko)','Pragma': 'no-cache','Cache-control': 'no-cache','Accept': 'text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5','Accept-Encoding': 'x-gzip, x-deflate, gzip, deflate','Accept-Charset': 'utf-8, utf-8;q=0.5, *;q=0.5','Accept-Language': 'en','Host': 'localhost','Cookie': 'JSESSIONID=933185092E0B668B90676E0A2B0767AF','Content-Type': 'application/x-www-form-urlencoded','Connection': 'close','Content-Length': '68'}
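The Host value in the result above has lost its port because the header line was split at every colon. A minimal sketch of a variant that splits only at the first colon is shown here; `parse_headers` and `sample_headers` are hypothetical names introduced for this illustration and are not part of the original notebook.

```python
def parse_headers(header_lines):
    """Parse 'Name: value' lines into a dict, splitting only at the first colon."""
    parsed = {}
    for line in header_lines:
        name, sep, value = line.partition(":")
        if not sep:
            continue  # skip lines that contain no colon at all
        parsed[name.strip()] = value.strip()
    return parsed

# Three header lines taken from the POST request above
sample_headers = ["Host: localhost:8080", "Connection: close", "Content-Length: 68"]
parsed = parse_headers(sample_headers)
print(parsed["Host"])  # localhost:8080
```

Using this variant would change the Host values in the later results from 'localhost' to 'localhost:8080', so the outputs shown in the rest of this section correspond to the original one-line version.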
body=lines[-1]
body
Result:
'id=3&nombre=Vino+Rioja&precio=100&cantidad=55&B1=A%F1adir+al+carrito'
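# Assemble the full record; headers missing from this request default to an empty string
request_dict = {
    'Method': method,
    'URL': url,
    'Protocol': protocol,
    'User-Agent': headers_dict.get('User-Agent', ''),
    'Pragma': headers_dict.get('Pragma', ''),
    'Cache-control': headers_dict.get('Cache-control', ''),
    'Accept': headers_dict.get('Accept', ''),
    'Accept-Encoding': headers_dict.get('Accept-Encoding', ''),
    'Accept-Charset': headers_dict.get('Accept-Charset', ''),
    'Accept-Language': headers_dict.get('Accept-Language', ''),
    'Host': headers_dict.get('Host', ''),
    'Cookie': headers_dict.get('Cookie', ''),
    'Content-Type': headers_dict.get('Content-Type', ''),
    'Connection': headers_dict.get('Connection', ''),
    'Content-Length': headers_dict.get('Content-Length', ''),
    'Body': body
}
request_dict
Result: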
{'Method': 'POST','URL': 'localhost:8080/tienda1/publico/anadir.jsp','Protocol': 'HTTP/1.1','User-Agent': 'Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.8 (like Gecko)','Pragma': 'no-cache','Cache-control': 'no-cache','Accept': 'text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5','Accept-Encoding': 'x-gzip, x-deflate, gzip, deflate','Accept-Charset': 'utf-8, utf-8;q=0.5, *;q=0.5','Accept-Language': 'en','Host': 'localhost','Cookie': 'JSESSIONID=933185092E0B668B90676E0A2B0767AF','Content-Type': 'application/x-www-form-urlencoded','Connection': 'close','Content-Length': '68','Body': 'id=3&nombre=Vino+Rioja&precio=100&cantidad=55&B1=A%F1adir+al+carrito'}
Processing multiple requests
requests='''GET localhost:8080/tienda1/index.jsp HTTP/1.1
User-Agent: Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.8 (like Gecko)
Pragma: no-cache
Cache-control: no-cache
Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
Accept-Encoding: x-gzip, x-deflate, gzip, deflate
Accept-Charset: utf-8, utf-8;q=0.5, *;q=0.5
Accept-Language: en
Host: localhost:8080
Cookie: JSESSIONID=1F767F17239C9B670A39E9B10C3825F4
Connection: close


GET localhost:8080/tienda1/publico/anadir.jsp?id=3&nombre=Vino+Rioja&precio=100&cantidad=55&B1=A%F1adir+al+carrito HTTP/1.1
User-Agent: Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.8 (like Gecko)
Pragma: no-cache
Cache-control: no-cache
Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
Accept-Encoding: x-gzip, x-deflate, gzip, deflate
Accept-Charset: utf-8, utf-8;q=0.5, *;q=0.5
Accept-Language: en
Host: localhost:8080
Cookie: JSESSIONID=81761ACA043B0E6014CA42A4BCD06AB5
Connection: close


POST localhost:8080/tienda1/publico/anadir.jsp HTTP/1.1
User-Agent: Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.8 (like Gecko)
Pragma: no-cache
Cache-control: no-cache
Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
Accept-Encoding: x-gzip, x-deflate, gzip, deflate
Accept-Charset: utf-8, utf-8;q=0.5, *;q=0.5
Accept-Language: en
Host: localhost:8080
Cookie: JSESSIONID=933185092E0B668B90676E0A2B0767AF
Content-Type: application/x-www-form-urlencoded
Connection: close
Content-Length: 68

id=3&nombre=Vino+Rioja&precio=100&cantidad=55&B1=A%F1adir+al+carrito'''
request_list=requests.split("\n\n\n")
requests_list=[]
for request in request_list:
    # Split the request into lines
    lines = request.split("\n")
    # Get method, url, protocol from the request line
    method,url,protocol= lines[0].split(" ")
    # Start assembling the request into a dict
    request_dict = {
        'Method': method,
        'URL': url,
        'Protocol': protocol,
    }
    if(method=='GET'):
        # Get the headers (a GET request has no body)
        headers=lines[1:]
    elif(method=='POST'):
        # Get the headers (drop the blank line and the body at the end)
        headers=lines[1:-2]
        # Get the request body
        body=lines[-1]
        request_dict.update({'Body' : body})
    # Add the headers to the dict
    headers_dict = {header.split(":")[0]: header.split(":")[1].strip() for header in headers}
    request_dict.update(headers_dict)
    requests_list.append(request_dict)
requests_list
Result:
[{'Method': 'GET','URL': 'localhost:8080/tienda1/index.jsp','Protocol': 'HTTP/1.1','User-Agent': 'Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.8 (like Gecko)','Pragma': 'no-cache','Cache-control': 'no-cache','Accept': 'text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5','Accept-Encoding': 'x-gzip, x-deflate, gzip, deflate','Accept-Charset': 'utf-8, utf-8;q=0.5, *;q=0.5','Accept-Language': 'en','Host': 'localhost','Cookie': 'JSESSIONID=1F767F17239C9B670A39E9B10C3825F4','Connection': 'close'},{'Method': 'GET','URL': 'localhost:8080/tienda1/publico/anadir.jsp?id=3&nombre=Vino+Rioja&precio=100&cantidad=55&B1=A%F1adir+al+carrito','Protocol': 'HTTP/1.1','User-Agent': 'Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.8 (like Gecko)','Pragma': 'no-cache','Cache-control': 'no-cache','Accept': 'text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5','Accept-Encoding': 'x-gzip, x-deflate, gzip, deflate','Accept-Charset': 'utf-8, utf-8;q=0.5, *;q=0.5','Accept-Language': 'en','Host': 'localhost','Cookie': 'JSESSIONID=81761ACA043B0E6014CA42A4BCD06AB5','Connection': 'close'},{'Method': 'POST','URL': 'localhost:8080/tienda1/publico/anadir.jsp','Protocol': 'HTTP/1.1','Body': 'id=3&nombre=Vino+Rioja&precio=100&cantidad=55&B1=A%F1adir+al+carrito','User-Agent': 'Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.8 (like Gecko)','Pragma': 'no-cache','Cache-control': 'no-cache','Accept': 'text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5','Accept-Encoding': 'x-gzip, x-deflate, gzip, deflate','Accept-Charset': 'utf-8, utf-8;q=0.5, *;q=0.5','Accept-Language': 'en','Host': 'localhost','Cookie': 'JSESSIONID=933185092E0B668B90676E0A2B0767AF','Content-Type': 'application/x-www-form-urlencoded','Connection': 'close','Content-Length': '68'}]
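The loop above decides where the body is by checking the method name. As an alternative sketch, the blank line between headers and body can be used instead, so the same code handles a request with or without a body. The function name `parse_request` and the shortened `raw_text` sample are assumptions made for this illustration. Unlike the loop above, this version splits each header only at its first colon, so Host keeps its port.

```python
def parse_request(raw):
    """Split one raw request into request line, headers and optional body."""
    # Everything before the first blank line is the head; the rest is the body
    head, _, body = raw.partition("\n\n")
    lines = head.split("\n")
    method, url, protocol = lines[0].split(" ")
    record = {"Method": method, "URL": url, "Protocol": protocol}
    for line in lines[1:]:
        name, sep, value = line.partition(":")
        if sep:
            record[name.strip()] = value.strip()
    if body:
        record["Body"] = body
    return record

# Shortened sample with the same separators as the tutorial data
raw_text = (
    "GET localhost:8080/tienda1/index.jsp HTTP/1.1\n"
    "Host: localhost:8080\n"
    "Connection: close\n\n\n"
    "POST localhost:8080/tienda1/publico/anadir.jsp HTTP/1.1\n"
    "Host: localhost:8080\n"
    "Content-Length: 68\n\n"
    "id=3&nombre=Vino+Rioja&precio=100&cantidad=55&B1=A%F1adir+al+carrito"
)
records = [parse_request(chunk) for chunk in raw_text.split("\n\n\n")]
print(records[1]["Body"])
```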
Next we use loc to add the data to a DataFrame. loc is the pandas accessor for locating and accessing data in a DataFrame by label.
import pandas as pd
# Initialize df with the expected columns
df = pd.DataFrame(columns=['Method', 'URL' , 'Protocol', 'User-Agent', 'Pragma', 'Cache-control', 'Accept', 'Accept-Encoding','Accept-Charset', 'Accept-Language', 'Host', 'Cookie', 'Content-Type', 'Connection','Content-Length', 'Body'])
# Use loc to append each request as a new row of the DataFrame
for request_dict in requests_list:
    df.loc[len(df)] = request_dict
# The following line empties df
#df.drop(df.index, inplace=True)
df
Result:
| | Method | URL | Protocol | User-Agent | Pragma | Cache-control | Accept | Accept-Encoding | Accept-Charset | Accept-Language | Host | Cookie | Content-Type | Connection | Content-Length | Body |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | GET | localhost:8080/tienda1/index.jsp | HTTP/1.1 | Mozilla/5.0 (compatible; Konqueror/3.5; Linux)... | no-cache | no-cache | text/xml,application/xml,application/xhtml... | x-gzip, x-deflate, gzip, deflate | utf-8, utf-8;q=0.5, *;q=0.5 | en | localhost | JSESSIONID=1F767F17239C9B670A39E9B10C3825F4 | NaN | close | NaN | NaN |
| 1 | GET | localhost:8080/tienda1/publico/... | HTTP/1.1 | Mozilla/5.0 (compatible; Konqueror/3.5; Linux)... | no-cache | no-cache | text/xml,application/xml,application/xhtml... | x-gzip, x-deflate, gzip, deflate | utf-8, utf-8;q=0.5, *;q=0.5 | en | localhost | JSESSIONID=81761ACA043B0E6014CA42A4BCD06AB5 | NaN | close | NaN | NaN |
| 2 | POST | localhost:8080/tienda1/publico/anadir.jsp | HTTP/1.1 | Mozilla/5.0 (compatible; Konqueror/3.5; Linux)... | no-cache | no-cache | text/xml,application/xml,application/xhtml... | x-gzip, x-deflate, gzip, deflate | utf-8, utf-8;q=0.5, *;q=0.5 | en | localhost | JSESSIONID=933185092E0B668B90676E0A2B0767AF | application/x-www-form-urlencoded | close | 68 | id=3&nombre=Vino+Rioja&precio=100&cantidad=55&... |
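Appending rows one at a time with loc works, but when the whole list of dicts is already available, the DataFrame constructor can take it directly. The sketch below uses a shortened, made-up `records` list and a reduced `columns` list rather than the full request data; keys that a record lacks come out as NaN, just as in the table above.

```python
import pandas as pd

# Reduced column set and records, for illustration only
columns = ["Method", "URL", "Protocol", "Host", "Content-Length", "Body"]
records = [
    {"Method": "GET", "URL": "localhost:8080/tienda1/index.jsp",
     "Protocol": "HTTP/1.1", "Host": "localhost"},
    {"Method": "POST", "URL": "localhost:8080/tienda1/publico/anadir.jsp",
     "Protocol": "HTTP/1.1", "Host": "localhost",
     "Content-Length": "68", "Body": "id=3&nombre=Vino+Rioja"},
]

# Passing columns= fixes the column order; missing keys become NaN
df_direct = pd.DataFrame(records, columns=columns)
print(df_direct.isna().sum())
```

For large inputs this usually avoids the cost of growing the frame row by row.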
import os
os.makedirs(os.path.join('.', 'data'), exist_ok=True)  # create the directory "./data/"
data_file = os.path.join('.', 'data', 'Traffic.csv')
# Save the data as a CSV file
df.to_csv(data_file, index=False)
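NaN values represent missing data. Missing values can be handled by imputation or by deletion: imputation fills each gap with a substitute value, while deletion simply drops the missing entries. Here we consider imputation.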
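# Load the data from the CSV file
df = pd.read_csv(data_file)
# Select the numeric columns (columns containing NaN are read as float64)
numeric_cols = df.select_dtypes(include=['float64']).columns
# # Select the non-numeric columns
# non_numeric_cols = df.select_dtypes(exclude=['float64']).columns
# Fill missing values in the numeric columns with the column mean
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())
# # Encode only the Content-Type column
# df = pd.get_dummies(df , columns=['Content-Type'] , dummy_na=True)
# Encode the non-numeric columns (for this dataset doing so is not meaningful; it only demonstrates the operation)
df = pd.get_dummies(df , dummy_na=True)
df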
Result:
| | Content-Length | Method_GET | Method_POST | Method_nan | URL_localhost:8080/tienda1/index.jsp | URL_localhost:8080/tienda1/publico/anadir.jsp | URL_localhost:8080/tienda1/publico/anadir.jsp?id=3&nombre=Vino+Rioja&precio=100&cantidad=55&B1=A%F1adir+al+carrito | URL_nan | Protocol_HTTP/1.1 | Protocol_nan | ... | Cookie_JSESSIONID=1F767F17239C9B670A39E9B10C3825F4 | Cookie_JSESSIONID=81761ACA043B0E6014CA42A4BCD06AB5 | Cookie_JSESSIONID=933185092E0B668B90676E0A2B0767AF | Cookie_nan | Content-Type_application/x-www-form-urlencoded | Content-Type_nan | Connection_close | Connection_nan | Body_id=3&nombre=Vino+Rioja&precio=100&cantidad=55&B1=A%F1adir+al+carrito | Body_nan |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 68.0 | True | False | False | True | False | False | False | True | False | ... | True | False | False | False | False | True | True | False | False | True |
| 1 | 68.0 | True | False | False | False | False | True | False | True | False | ... | False | True | False | False | False | True | True | False | False | True |
| 2 | 68.0 | False | True | False | False | True | False | False | True | False | ... | False | False | True | False | True | False | True | False | True | False |

3 rows × 36 columns
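The text mentions deletion as the other way of handling missing values but only demonstrates imputation. For comparison, here is a minimal sketch of both on a made-up three-row frame; `df_missing` and the other variable names are assumptions for this illustration.

```python
import numpy as np
import pandas as pd

# A tiny frame with the same pattern of gaps as the request table
df_missing = pd.DataFrame({
    "Method": ["GET", "GET", "POST"],
    "Content-Length": [np.nan, np.nan, 68.0],
    "Body": [None, None, "id=3"],
})

# Deletion: drop every row that has a missing value ...
rows_dropped = df_missing.dropna()
# ... or drop every column that has a missing value
cols_dropped = df_missing.dropna(axis=1)

# Imputation: fill the numeric column with its mean, as in the tutorial code
imputed = df_missing.fillna({"Content-Length": df_missing["Content-Length"].mean()})

print(rows_dropped.shape, list(cols_dropped.columns))
```

On data like this, deleting rows would discard every GET request, which is one reason to prefer imputation or column-wise handling here.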
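Only a DataFrame whose columns are all numeric can be converted to a tensor. If the traffic above were used as a dataset for training intrusion detection, however, the approach used above for turning non-numeric fields into numbers would not work: almost every distinct URL, cookie, or body becomes its own column, so a model cannot learn the features of the traffic itself.
As for how to turn traffic into numeric data, as far as the author knows, traffic can be converted into images and trained with a convolutional network. The author plans to continue studying intrusion detection in that direction.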
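The text does not say how the traffic would be turned into images. One possible reading, shown purely as an assumption, is to treat the bytes of each request as pixel intensities and pad or truncate them to a fixed square size. The function name `request_to_image` and the 32×32 size are hypothetical choices for this sketch.

```python
import numpy as np

def request_to_image(raw, side=32):
    """Encode a request string as a (side, side) uint8 array, zero-padded or truncated."""
    data = np.frombuffer(raw.encode("utf-8"), dtype=np.uint8)
    flat = np.zeros(side * side, dtype=np.uint8)
    n = min(data.size, flat.size)
    flat[:n] = data[:n]  # copy the bytes; the remainder stays zero
    return flat.reshape(side, side)

image = request_to_image("GET localhost:8080/tienda1/index.jsp HTTP/1.1")
print(image.shape, image.dtype)  # (32, 32) uint8
```

An array of this shape could then be stacked into a batch and fed to a convolutional network, though whether such an encoding is useful for intrusion detection is not something this section establishes.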
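Once the data is in tensor form, it can be manipulated with tensor functions.
import tensorflow as tf
# Convert the all-numeric DataFrame to a NumPy array, then to a tensor
X = tf.constant(df.to_numpy(dtype=float))
X
Result: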
<tf.Tensor: shape=(3, 36), dtype=float64, numpy=
array([[68., 1., 0., 0., 1., 0., 0., 0., 1., 0., 1., 0., 1., 0., 1., 0., 1., 0., 1., 0., 1., 0., 1., 0., 1., 0., 1., 0., 0., 0., 0., 1., 1., 0., 0., 1.],
       [68., 1., 0., 0., 0., 0., 1., 0., 1., 0., 1., 0., 1., 0., 1., 0., 1., 0., 1., 0., 1., 0., 1., 0., 1., 0., 0., 1., 0., 0., 0., 1., 1., 0., 0., 1.],
       [68., 0., 1., 0., 0., 1., 0., 0., 1., 0., 1., 0., 1., 0., 1., 0., 1., 0., 1., 0., 1., 0., 1., 0., 1., 0., 0., 0., 1., 0., 1., 0., 1., 0., 1., 0.]])>
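As a small illustration of what operating on the data with tensor functions can look like, the sketch below applies a few standard TensorFlow operations to a made-up 3×3 tensor. `X_demo` is a stand-in with invented values, not the 3×36 tensor above.

```python
import tensorflow as tf

# A small stand-in tensor (hypothetical values)
X_demo = tf.constant([[68.0, 1.0, 0.0],
                      [68.0, 1.0, 0.0],
                      [68.0, 0.0, 1.0]], dtype=tf.float64)

# Sum each column across the rows
column_sums = tf.reduce_sum(X_demo, axis=0)
# Cast to float32, the dtype most Keras layers use by default
X_float32 = tf.cast(X_demo, tf.float32)
# Index the first row, as with a NumPy array
first_row = X_demo[0]

print(column_sums.numpy())
```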
Published: 2024-01-29 10:25:10
Article link: https://www.4u4v.net/it/170649511514625.html