Python Encoding Study Notes

Summary

These notes cover text encoding in Python: an overview of common encodings, how Python 2 and Python 3 differ, and how str and bytes relate.

Common encodings

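Name	Length	Notes
ASCII	1 B	128 characters
GB2312	1-2 B	6,763 Chinese characters
GBK	1-2 B	21,886 characters, of which 21,003 are Chinese characters; default encoding on Chinese-language Windows 10
UTF-8	1-4 B	An encoding of Unicode; default on Ubuntu 18.04 and default source encoding in Python 3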

>>> s = '2019年11月10日'
>>> len(s)
11
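>>> len(s.encode('gb2312'))  # ASCII characters take 1 byte, Chinese characters 2 bytes
14
>>> len(s.encode('gbk'))  # same as gb2312
14
>>> len(s.encode('utf-8'))  # ASCII characters take 1 byte, Chinese characters 3 bytes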
17
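
The same comparison can be scripted. The following is a minimal sketch; the helper name byte_lengths is hypothetical and only serves this illustration. It also shows that ASCII, with its 128 characters, cannot encode the Chinese characters at all.

```python
def byte_lengths(text, encodings=("gb2312", "gbk", "utf-8")):
    """Hypothetical helper: return {encoding: number of encoded bytes}."""
    return {name: len(text.encode(name)) for name in encodings}


sample = "2019年11月10日"
lengths = byte_lengths(sample)
print(lengths)

# ASCII has no code for Chinese characters, so encoding fails outright.
try:
    sample.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print(ascii_ok)
```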

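References:
"Why is encoding in Python such a pain?" - answer by 刘志军 on Zhihu
"What are the main differences between the GB2312, GBK, and GB18030 character sets?" - answer by Tuxify on Zhihu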

Python encoding

Differences between Python 2 and Python 3

Python 2

>>> import sys
>>> sys.getdefaultencoding()
'ascii'
>>> str
<type 'str'>
>>> bytes
<type 'str'>

Python 3

>>> import sys
>>> sys.getdefaultencoding()
'utf-8'
>>> str
<class 'str'>
>>> bytes
<class 'bytes'>

Using encodings in Python 3

The relationship between str and bytes

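str is a string object and bytes is a byte-sequence object;
every str has the same form, a sequence of Unicode code points, independent of the source file's encoding (UTF-8 by default);
str.encode() converts a str to bytes, and bytes.decode() converts bytes to a str;
the encoding passed to str.encode() and bytes.decode() is the encoding of the bytes.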
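
The points above can be sketched as a round trip. This is only an illustration; the variable names are arbitrary. The same str yields different bytes under different codecs, and decoding works only with the codec that produced the bytes.

```python
text = "2019年11月10日"

# Encode: str -> bytes. The codec determines the byte layout of the result.
utf8_bytes = text.encode("utf-8")
gbk_bytes = text.encode("gbk")

# Decode: bytes -> str. The codec must match the one that produced the bytes.
round_trip_utf8 = utf8_bytes.decode("utf-8")
round_trip_gbk = gbk_bytes.decode("gbk")

# Decoding GBK bytes as UTF-8 fails here, because the bytes are not valid UTF-8.
try:
    gbk_bytes.decode("utf-8")
    wrong_codec_failed = False
except UnicodeDecodeError:
    wrong_codec_failed = True

print(round_trip_utf8 == text, round_trip_gbk == text, wrong_codec_failed)
```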

Why printing bytes in the Python terminal shows readable characters

>>> s
'2019年11月10日'
>>> s_bytes = s.encode()
>>> type(s_bytes)
<class 'bytes'>
>>> s_bytes
b'2019\xe5\xb9\xb411\xe6\x9c\x8810\xe6\x97\xa5'

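This happens because the repr of a bytes object shows every byte that corresponds to a printable ASCII character (such as the digits in "2019") as that character, and every other byte as a \xNN escape. A bytes object is not a string. It is a sequence of bytes, which can hold text but can just as well hold an image or a video.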
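
A short sketch makes the distinction visible: the readable characters are only a display convention, while the underlying values are integers from 0 to 255.

```python
s_bytes = "2019年".encode("utf-8")

# repr() shows printable ASCII bytes as characters, everything else as \xNN.
shown = repr(s_bytes)
print(shown)            # b'2019\xe5\xb9\xb4'

# Underneath, a bytes object is a sequence of integers in range(256).
values = list(s_bytes)
print(values)           # [50, 48, 49, 57, 229, 185, 180]

# hex() gives a view with no special treatment of ASCII.
hex_view = s_bytes.hex()
print(hex_view)         # 32303139e5b9b4
```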

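Programming in Chinese with Python 3

Python 3 source files are UTF-8 by default and identifiers may contain non-ASCII Unicode characters, so Chinese variable names are supported.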

>>> 最大值 = max
>>> 张三分数 = 87
>>> 李四分数 = 91
>>> 王五分数 = 88
>>> 最大值(张三分数, 李四分数, 王五分数)
91

base64

>>> import base64
>>> base64.b64encode(b'abc')
b'YWJj'
>>> base64.b64encode('abc')
...
TypeError: a bytes-like object is required, not 'str'

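Why does base64 encoding accept only bytes as input? As I understand it, there are two reasons:

base64 is not only used to encode strings; it is sometimes needed for binary data such as images;
Python code should avoid handling str (Unicode) and bytes data together.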

If you attempt to write processing functions that accept both Unicode and byte strings, you will find your program vulnerable to bugs wherever you combine the two different kinds of strings. There is no automatic encoding or decoding: if you do e.g. str + bytes, a TypeError will be raised.
Unicode HOWTO
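
Both reasons can be illustrated with a small sketch. The four bytes in binary_blob are arbitrary values chosen for this example, standing in for image or other binary data.

```python
import base64

text = "abc"
data = b"abc"

# Mixing the two types fails instead of guessing an encoding.
try:
    text + data
    mixed_ok = True
except TypeError:
    mixed_ok = False

# Convert explicitly at the boundary, then work with a single type.
combined = text.encode("utf-8") + data

# base64 works on arbitrary bytes, not only on encoded text.
binary_blob = bytes([0x00, 0xFF, 0x10, 0x80])  # arbitrary illustrative data
encoded_blob = base64.b64encode(binary_blob)
decoded_blob = base64.b64decode(encoded_blob)

print(mixed_ok, combined, decoded_blob == binary_blob)
```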

The correct way to base64-encode a string:

>>> s
'2019年11月10日'
>>> base64.b64encode(s.encode('gb2312'))
b'MjAxOcTqMTHUwjEwyNU='
>>> base64.b64encode(s.encode('gbk'))
b'MjAxOcTqMTHUwjEwyNU='
>>> base64.b64encode(s.encode('utf-8'))
b'MjAxOeW5tDEx5pyIMTDml6U='

Note that base64.b64encode() returns bytes. Because base64 output consists only of printable ASCII characters, the repr of the result shows the encoded text directly.
To convert the base64 result to a string:

>>> base64.b64encode(s.encode('utf-8')).decode('utf-8')
'MjAxOeW5tDEx5pyIMTDml6U='
>>> base64.b64encode(s.encode('utf-8')).decode('ascii')
'MjAxOeW5tDEx5pyIMTDml6U='
>>> base64.b64encode(s.encode('utf-8')).decode('gb2312')
'MjAxOeW5tDEx5pyIMTDml6U='
>>> base64.b64encode(s.encode('utf-8')).decode('gbk')
'MjAxOeW5tDEx5pyIMTDml6U='

base64 output uses only ASCII characters, and all of the encodings above are ASCII-compatible, so every call returns the same correct result.
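
The encode and decode steps can be wrapped into a pair of helpers. This is a minimal sketch; the function names text_to_base64 and base64_to_text are hypothetical and not part of the base64 module.

```python
import base64


def text_to_base64(text, encoding="utf-8"):
    """Hypothetical helper: str -> bytes (via `encoding`) -> base64 str."""
    raw = text.encode(encoding)
    # base64 output is pure ASCII, so decoding it as ASCII is always safe.
    return base64.b64encode(raw).decode("ascii")


def base64_to_text(encoded, encoding="utf-8"):
    """Hypothetical helper: reverse of text_to_base64."""
    raw = base64.b64decode(encoded)
    return raw.decode(encoding)


s = "2019年11月10日"
token = text_to_base64(s)
print(token)
print(base64_to_text(token) == s)
```

The text encoding has to be the same on both sides: a token produced with encoding="gbk" must be turned back into text with encoding="gbk".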

Open question

Environment: Python 3, default encoding.

>>> import sys
>>> sys.getsizeof("甲")
76
>>> sys.getsizeof("甲乙")
78
>>> len("乙".encode('utf-8'))
3

Why does "乙" take up only 2 bytes here, when its UTF-8 encoding takes 3? Is a Python 3 str stored in memory as Unicode or as UTF-8?
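
A likely explanation, assuming CPython 3.3 or later, is the flexible string representation introduced by PEP 393: a str is stored with 1, 2, or 4 bytes per code point, depending on the widest code point it contains, rather than as UTF-8. The sketch below measures the per-character growth; the helper name bytes_per_char is hypothetical, and the exact object sizes are a CPython implementation detail that other interpreters need not share.

```python
import sys


def bytes_per_char(ch):
    """Hypothetical helper: storage growth when one more `ch` is appended."""
    return sys.getsizeof(ch * 2) - sys.getsizeof(ch)


ascii_width = bytes_per_char("a")             # code points below U+0100
bmp_width = bytes_per_char("乙")               # code points up to U+FFFF
astral_width = bytes_per_char(chr(0x1F600))   # code points above U+FFFF

print(ascii_width, bmp_width, astral_width)
```

Under that assumption, "乙" costs 2 bytes in memory because the whole string is stored with 2 bytes per code point, while its UTF-8 encoding on disk or on the wire still takes 3 bytes.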

Published: 2024-01-28 03:52:39

Article link: https://www.4u4v.net/it/17063851664580.html
