
new String(byte[]) gives different results in JDK 7 and JDK 8

Java

www说 2022-11-10 16:22:48

For some byte arrays, new String(byte[], "UTF-8") returns different results in JDK 1.7 and JDK 1.8.

byte[] bytes1 = {55, 93, 97, -13, 4, 8, 29, 26, -68, -4, -26, -94, -37, 32, -41, 88};
String str1 = new String(bytes1, "UTF-8");
System.out.println(str1.length());
byte[] out1 = str1.getBytes("UTF-8");
System.out.println(out1.length);
System.out.println(Arrays.toString(out1));

byte[] bytes2 = {65, -103, -103, 73, 32, 68, 49, 73, -1, -30, -1, -103, -92, 11, -32, -30};
String str2 = new String(bytes2, "UTF-8");
System.out.println(str2.length());
byte[] out2 = str2.getBytes("UTF-8");
System.out.println(out2.length);
System.out.println(Arrays.toString(out2));

For bytes2, the result of new String(byte[], "UTF-8") (str2) differs between JDK 7 and JDK 8, but for bytes1 it is the same. What is special about bytes2? When I test with "ISO-8859-1" instead, the result for bytes2 in JDK 1.8 is the same!

jdk1.7.0_80:
15
27
[55, 93, 97, -17, -65, -67, 4, 8, 29, 26, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, 32, -17, -65, -67, 88]
15
31
[65, -17, -65, -67, -17, -65, -67, 73, 32, 68, 49, 73, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, 11, -17, -65, -67]

jdk1.8.0_201:
15
27
[55, 93, 97, -17, -65, -67, 4, 8, 29, 26, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, 32, -17, -65, -67, 88]
16
34
[65, -17, -65, -67, -17, -65, -67, 73, 32, 68, 49, 73, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, 11, -17, -65, -67, -17, -65, -67]


1 answer

森林海


Short answer:

The last 2 bytes of the second byte array, [-32, -30] (0b11100000_11100010), are decoded and then re-encoded as:

By JDK 7: [-17, -65, -67], which is the Unicode character 0xFFFD (the replacement character used for invalid input),

By JDK 8: [-17, -65, -67, -17, -65, -67], which is two 0xFFFD characters.
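To check this on a particular JDK, the two trailing bytes can be decoded in isolation. The following is only a minimal sketch: the class name TrailingBytesDemo and the helper countReplacementChars are made up for illustration. Going by the output in the question, JDK 7 should report one replacement character and JDK 8 two, so no fixed output is shown here.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; the names are illustrative, not from the question.
final class TrailingBytesDemo {

    /** Counts U+FFFD replacement characters in a decoded string. */
    static int countReplacementChars(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) == '\uFFFD') {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // 0xE0 0xE2: the last two bytes of bytes2 from the question.
        byte[] tail = {-32, -30};
        String decoded = new String(tail, StandardCharsets.UTF_8);
        byte[] reencoded = decoded.getBytes(StandardCharsets.UTF_8);

        // The count depends on the JDK's decoder, so print the version too.
        System.out.println("java.version      = " + System.getProperty("java.version"));
        System.out.println("replacement chars = " + countReplacementChars(decoded));
        System.out.println("re-encoded bytes  = " + Arrays.toString(reencoded));
    }
}
```

Each replacement character re-encodes to 3 bytes, which is where the 31 versus 34 byte lengths in the question come from.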

Long answer:

Some of the byte sequences in the arrays do not appear to be valid UTF-8 sequences. Consider this code:

byte[] bb = {55, 93, 97, -13, 4, 8, 29, 26, -68, -4, -26, -94, -37, 32, -41, 88};

for (byte b : bb) System.out.println(Integer.toBinaryString(b & 0xff));

It prints (I added the leading underscores by hand for readability):

__110111

_1011101

_1100001

11110011

_____100

____1000

___11101

___11010

10111100

11111100

11100110

10100010

11011011

__100000

11010111

_1011000

As you can read in the Wikipedia article on UTF-8, UTF-8 encoded strings use the following binary sequences:

0xxxxxxx -- for ASCII characters

110xxxxx 10xxxxxx -- for 0x0080 to 0x07ff

1110xxxx 10xxxxxx 10xxxxxx -- for 0x0800 to 0xFFFF

... and so on
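The bit patterns above can be turned into a small classifier to see which role each byte of the first array could play. This is a rough sketch with made-up names (Utf8ByteKinds, classify), and it only inspects the leading bits of a single byte. It does not check that lead bytes are followed by the right number of continuation bytes, nor does it reject overlong or out-of-range values.

```java
// Hypothetical helper; it looks at the leading bits of one byte only.
final class Utf8ByteKinds {

    /** Classifies a byte by its leading bits, following the UTF-8 bit patterns. */
    static String classify(byte b) {
        int v = b & 0xFF; // treat the byte as unsigned
        if ((v & 0x80) == 0x00) return "ASCII";                   // 0xxxxxxx
        if ((v & 0xC0) == 0x80) return "continuation";            // 10xxxxxx
        if ((v & 0xE0) == 0xC0) return "lead of 2-byte sequence"; // 110xxxxx
        if ((v & 0xF0) == 0xE0) return "lead of 3-byte sequence"; // 1110xxxx
        if ((v & 0xF8) == 0xF0) return "lead of 4-byte sequence"; // 11110xxx
        return "never valid in UTF-8";                            // 11111xxx
    }

    public static void main(String[] args) {
        byte[] bb = {55, 93, 97, -13, 4, 8, 29, 26, -68, -4, -26, -94, -37, 32, -41, 88};
        for (byte b : bb) {
            System.out.println(b + " -> " + classify(b));
        }
    }
}
```

A lead byte that is not followed by the expected continuation bytes, or a continuation byte with no lead byte before it, is what makes a sequence malformed.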

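So every byte sequence that does not follow this encoding scheme is replaced by these 3 bytes:

[-17, -65, -67]

In binary: 1110 1111 10 111111 10 111101

The Unicode code point bits are 0b11111111_11111101

In hex that is 0xFFFD (U+FFFD, the Unicode replacement character)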
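The arithmetic can be checked by applying the 3-byte pattern 1110xxxx 10xxxxxx 10xxxxxx to 0xFFFD by hand and comparing the result with the JDK's own encoder. This is a sketch only: encodeThreeByte is a hypothetical helper and, as a simplification, it does not exclude the surrogate range 0xD800 to 0xDFFF.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class showing where [-17, -65, -67] comes from.
final class ReplacementCharBytes {

    /** Encodes a code point in 0x0800..0xFFFF as 1110xxxx 10xxxxxx 10xxxxxx. */
    static byte[] encodeThreeByte(int codePoint) {
        if (codePoint < 0x0800 || codePoint > 0xFFFF) {
            throw new IllegalArgumentException("not in the 3-byte range: " + codePoint);
        }
        return new byte[] {
            (byte) (0xE0 | (codePoint >> 12)),         // top 4 bits
            (byte) (0x80 | ((codePoint >> 6) & 0x3F)), // middle 6 bits
            (byte) (0x80 | (codePoint & 0x3F))         // low 6 bits
        };
    }

    public static void main(String[] args) {
        // Manual encoding of U+FFFD.
        System.out.println(Arrays.toString(encodeThreeByte(0xFFFD)));
        // The same character encoded by the JDK.
        System.out.println(Arrays.toString("\uFFFD".getBytes(StandardCharsets.UTF_8)));
        // Both lines print [-17, -65, -67]
    }
}
```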

The only difference between the arrays printed by your code is how the following 2 bytes at the end of the second array are handled:

[-32, -30] is 0b11100000_11100010, and this is not valid UTF-8

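JDK 7 produces a single 0xFFFD character for this sequence.

JDK 8 produces two 0xFFFD characters for this sequence.

RFC 3629 gives no explicit instructions on how to handle invalid sequences, so it seems that in JDK 8 they decided to produce a 0xFFFD for each invalid byte here, which seems more correct.

A separate question: why are you trying to parse these raw, non-UTF-8 bytes as UTF-8 when you should not be?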
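If the bytes are not guaranteed to be UTF-8, one option is to decode strictly so that malformed input is reported instead of silently replaced, which avoids depending on any JDK's replacement policy. The sketch below uses the standard CharsetDecoder API; the class and method names (StrictUtf8Demo, decodeStrict, isValidUtf8) are made up for illustration. It also shows why the "ISO-8859-1" test in the question behaves the same everywhere: that charset maps every byte to exactly one character, so the round trip is lossless.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; only the java.nio.charset calls are real API.
final class StrictUtf8Demo {

    /** Decodes UTF-8 strictly: throws instead of substituting U+FFFD. */
    static String decodeStrict(byte[] bytes) throws CharacterCodingException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        return decoder.decode(ByteBuffer.wrap(bytes)).toString();
    }

    /** Returns true only if the whole array is well-formed UTF-8. */
    static boolean isValidUtf8(byte[] bytes) {
        try {
            decodeStrict(bytes);
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] bytes2 = {65, -103, -103, 73, 32, 68, 49, 73, -1, -30, -1, -103, -92, 11, -32, -30};

        // bytes2 is not well-formed UTF-8, so strict decoding rejects it.
        System.out.println(isValidUtf8(bytes2)); // prints false

        // ISO-8859-1 maps each byte to one char, so nothing is lost.
        String latin1 = new String(bytes2, StandardCharsets.ISO_8859_1);
        byte[] back = latin1.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(Arrays.equals(bytes2, back)); // prints true
    }
}
```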

Answered 2022-11-10
