问题描述
使用NIO读取一个文件时,从Channel
中读取数据到ByteBuffer
中,指定ByteBuffer
的大小为1024,当读取一个大于1024的文件是,至少需要2+n的读取次数。
对于utf-8
格式的文件,直接读取会产生乱码,为解决乱码,使用CharsetDecoder
的decode(ByteBuffer in, CharBuffer out, boolean endOfInput)
方法进行编码,乱码问题解决。
但是输出内容相对于源文件有数据缺失现象,一步一步跟debug后,发现decode方法返回MALFORMED[1]
错误代码。
问题解决思路
在进行一番Google后,定位到了问题是由于每次编码的第一个字节不能为utf-8格式的。找到每次读取的边界,添加几个字母后数据不丢失了,这种方法显然是行不通的。Google到的内容如下:
来自: http://stackoverflow.com/questions/29559842/what-is-charsetdecoder-decodebytebuffer-charbuffer-endofinput
1
2
3
|
Apparently it does not work correctly if you start decoding in the middle of a character. The decoder expects that the first thing it reads is the start of a valid UTF-8 sequence.
edit - When the decoder reports UNDERFLOW, it expects you to add more data to the input buffer and then try to call decode() again, but you must re-offer it the data from the start of the UTF-8 sequence that you are trying to decode. You can't continue in the middle of an UTF-8 sequence.
|
后来又搜索到另一个答案:
来自:http://stackoverflow.com/questions/14792654/decoding-multibyte-utf8-symbols-with-charset-decoder-in-byte-by-byte-manner
1
2
|
If my theory is correct then the answer is that you cannot decode UTF-8 by providing the decoder with byte buffers containing exactly one byte. But you could implement byte-by-byte decoding by starting with a ByteBuffer containing one byte and adding extra bytes until the decoder succeeds in outputing a character. Just make sure that you don't clobber input bytes that haven't been consumed yet.
Note that decoding like this is not efficient. The API design is optimized for decoding a large number of bytes in one go.
|
思考
第二段答案中最后一句话:最佳化API设计是每一次编码使用一个较大的bytes接收。
修改ByteBuffer的初始化大小后问题解决。怪怪得( ̄∇ ̄)