Fixing Data Loss When Reading a UTF-8 File with NIO

Problem Description

When reading a file with NIO, data is read from a Channel into a ByteBuffer. With the ByteBuffer size set to 1024, reading a file larger than 1024 bytes takes at least two (and in general more) read calls.

For a UTF-8 file, converting the bytes directly produces garbled text. To fix that, I used the CharsetDecoder.decode(ByteBuffer in, CharBuffer out, boolean endOfInput) method to decode each chunk, and the garbled output went away.

However, the output was missing data compared with the source file. After stepping through with the debugger, I found that the decode method was returning a MALFORMED[1] error result.
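A minimal sketch of the kind of loop described above, assuming a FileChannel read into a 1024-byte buffer and a per-chunk decode (the file name, class name, and loop structure are my own reconstruction, not the original code):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class LossyUtf8Read {
    public static void main(String[] args) throws IOException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
        ByteBuffer buf = ByteBuffer.allocate(1024);
        CharBuffer chars = CharBuffer.allocate(1024);
        try (FileChannel channel = FileChannel.open(Paths.get("test.txt"), StandardOpenOption.READ)) {
            while (channel.read(buf) != -1) {
                buf.flip();
                // Decode this 1024-byte chunk on its own. If the chunk ends in the
                // middle of a multi-byte UTF-8 character, decode() stops before the
                // partial bytes; stepping through in a debugger, the result turns
                // into MALFORMED[1] once a chunk starts with the continuation bytes
                // of a character split by the previous read.
                CoderResult result = decoder.decode(buf, chars, false);
                chars.flip();
                System.out.print(chars);
                chars.clear();
                buf.clear(); // discards the unconsumed partial bytes -> data loss
            }
        }
    }
}
```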

Solution Approach

After some Googling, I pinned the problem down: each decode call has to start at the beginning of a valid UTF-8 sequence, not in the middle of a multi-byte character. As a test, I located each read boundary and padded in a few extra letters so that no character was split, and the data loss disappeared; obviously that is not a workable fix. Here is what Google turned up:

From: http://stackoverflow.com/questions/29559842/what-is-charsetdecoder-decodebytebuffer-charbuffer-endofinput

Apparently it does not work correctly if you start decoding in the middle of a character. The decoder expects that the first thing it reads is the start of a valid UTF-8 sequence.

edit - When the decoder reports UNDERFLOW, it expects you to add more data to the input buffer and then try to call decode() again, but you must re-offer it the data from the start of the UTF-8 sequence that you are trying to decode. You can't continue in the middle of an UTF-8 sequence.
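One way to "re-offer the data from the start of the UTF-8 sequence" in practice is to compact() the input buffer after each decode instead of clearing it, so the unconsumed partial bytes stay at the front and are completed by the next read. This is a sketch under my own assumptions, reusing the channel, decoder, buf, and chars from the snippet above; it is not code from the answer:

```java
while (channel.read(buf) != -1) {
    buf.flip();
    decoder.decode(buf, chars, false); // may leave a partial UTF-8 sequence unconsumed
    chars.flip();
    System.out.print(chars);
    chars.clear();
    buf.compact(); // keep the leftover bytes at the front instead of discarding them
}
// End of stream: tell the decoder there is no more input and flush its state.
buf.flip();
decoder.decode(buf, chars, true);
decoder.flush(chars);
chars.flip();
System.out.print(chars);
```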

Later I found another answer:

From: http://stackoverflow.com/questions/14792654/decoding-multibyte-utf8-symbols-with-charset-decoder-in-byte-by-byte-manner

If my theory is correct then the answer is that you cannot decode UTF-8 by providing the decoder with byte buffers containing exactly one byte. But you could implement byte-by-byte decoding by starting with a ByteBuffer containing one byte and adding extra bytes until the decoder succeeds in outputing a character. Just make sure that you don't clobber input bytes that haven't been consumed yet.
  Note that decoding like this is not efficient. The API design is optimized for decoding a large number of bytes in one go.

Thoughts

The last sentence of the second answer says it: the API design is optimized for decoding a large number of bytes in one go.

After increasing the initial size of the ByteBuffer, the problem went away. Odd ( ̄∇ ̄)