A Summary of ZooKeeper's Serialization Mechanism

2016-10-01

Introduction

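Jute is the serialization component used in ZooKeeper. It started out as a serialization component of Hadoop, under the name Hadoop Record IO. Because Apache Avro offers better cross-language support, richer data structures, support for MapReduce, and easy use in RPC calls, Hadoop deprecated Record IO in favor of Avro. Record IO was then split out into a standalone serialization component and renamed Jute.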

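ZooKeeper has used Jute as its serialization tool since its earliest versions, and zookeeper-3.4.9, the latest release at the time of writing, still does. It has not switched to a faster, more general-purpose component such as Apache Avro, Thrift, or Protobuf for two main reasons: replacing the serialization component would raise compatibility problems between old and new versions, and Jute has never been ZooKeeper's bottleneck. The rest of this post walks through how Jute is used and analyzes parts of its source code.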

Basic Usage

Let's start with a simple example to get a first impression of Jute:

1. Provide a bean that implements the Record interface

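import java.io.IOException;

import org.apache.jute.InputArchive;
import org.apache.jute.OutputArchive;
import org.apache.jute.Record;

public class TestBean implements Record {

  private int intV;
  private String stringV;

  // No-arg constructor, used to create an empty bean before deserialize()
  public TestBean() {
  }

  public TestBean(int intV, String stringV) {
      this.intV = intV;
      this.stringV = stringV;
  }

  // Getters and setters
  public int getIntV() { return intV; }
  public void setIntV(int intV) { this.intV = intV; }
  public String getStringV() { return stringV; }
  public void setStringV(String stringV) { this.stringV = stringV; }

  // Read the fields back in the same order they were written
  @Override
  public void deserialize(InputArchive archive, String tag)
          throws IOException {
      archive.startRecord(tag);
      this.intV = archive.readInt("intV");
      this.stringV = archive.readString("stringV");
      archive.endRecord(tag);
  }

  // Write each field, labelled with its own tag, between startRecord and endRecord
  @Override
  public void serialize(OutputArchive archive, String tag) throws IOException {
      archive.startRecord(this, tag);
      archive.writeInt(intV, "intV");
      archive.writeString(stringV, "stringV");
      archive.endRecord(this, tag);
  }

}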

Implementing the Record interface mainly means implementing two methods: deserialize and serialize.

2. Serialize and deserialize

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;

import org.apache.jute.BinaryInputArchive;
import org.apache.jute.BinaryOutputArchive;

public class BinaryTest1 {

  public static void main(String[] args) throws IOException {
      // Serialize a bean into an in-memory byte array
      ByteArrayOutputStream baos = new ByteArrayOutputStream();
      BinaryOutputArchive boa = BinaryOutputArchive.getArchive(baos);
      new TestBean(1, "testbean1").serialize(boa, "tag1");
      byte[] array = baos.toByteArray();

      // Deserialize the bytes into a fresh bean
      ByteArrayInputStream bais = new ByteArrayInputStream(array);
      BinaryInputArchive bia = BinaryInputArchive.getArchive(bais);
      TestBean newBean1 = new TestBean();
      newBean1.deserialize(bia, "tag1");

      System.out.println("intV = " + newBean1.getIntV() + ",stringV = "
              + newBean1.getStringV());
      bais.close();
      baos.close();
  }
}

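The example creates a serializer, BinaryOutputArchive, and a deserializer, BinaryInputArchive. It serializes and deserializes a TestBean under the tag tag1, then prints the deserialized fields so they can be compared with the values before serialization.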
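To make the round trip less of a black box, here is a minimal sketch that reproduces the same kind of byte layout using only the JDK's DataOutputStream and DataInputStream, with no ZooKeeper jar on the classpath. It assumes, based on my reading of the 3.4.x sources, that the binary archive writes an int as 4 big-endian bytes, writes a string as a 4-byte length followed by its UTF-8 bytes, and emits nothing for startRecord and endRecord. The class BinaryLayoutSketch and its methods are made up for illustration and are not part of Jute.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Jute: mimics the assumed binary layout
// of TestBean(intV, stringV) with plain JDK streams.
class BinaryLayoutSketch {

    // Encode: 4-byte int, then 4-byte string length, then the UTF-8 bytes.
    static byte[] encode(int intV, String stringV) {
        try {
            ByteArrayOutputStream baos = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(baos);
            out.writeInt(intV);
            byte[] utf8 = stringV.getBytes(StandardCharsets.UTF_8);
            out.writeInt(utf8.length);
            out.write(utf8);
            out.flush();
            return baos.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Decode the fields in the same order they were written.
    static String decode(byte[] data) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
            int intV = in.readInt();
            byte[] utf8 = new byte[in.readInt()];
            in.readFully(utf8);
            return "intV = " + intV + ",stringV = "
                    + new String(utf8, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = encode(1, "testbean1");
        System.out.println(data.length + " bytes"); // 4 + 4 + 9 = 17
        System.out.println(decode(data));
    }
}
```

Under these assumptions the bean takes 17 bytes: 4 for intV, 4 for the string length, and 9 for "testbean1". Nothing in the stream names the fields, so the reader has to consume them in exactly the order the writer produced them.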

Usage Analysis

The Record interface

public interface Record {
  public void serialize(OutputArchive archive, String tag)
      throws IOException;
  public void deserialize(InputArchive archive, String tag)
      throws IOException;
}

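OutputArchive represents the serializer and InputArchive the deserializer. The tag identifies the object: a single serializer can serialize several objects, so each object needs an identifier.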

OutputArchive and InputArchive

The OutputArchive interface

public interface OutputArchive {
  public void writeByte(byte b, String tag) throws IOException;
  public void writeBool(boolean b, String tag) throws IOException;
  public void writeInt(int i, String tag) throws IOException;
  public void writeLong(long l, String tag) throws IOException;
  public void writeFloat(float f, String tag) throws IOException;
  public void writeDouble(double d, String tag) throws IOException;
  public void writeString(String s, String tag) throws IOException;
  public void writeBuffer(byte buf[], String tag)
      throws IOException;
  public void writeRecord(Record r, String tag) throws IOException;
  public void startRecord(Record r, String tag) throws IOException;
  public void endRecord(Record r, String tag) throws IOException;
  public void startVector(List v, String tag) throws IOException;
  public void endVector(List v, String tag) throws IOException;
  public void startMap(TreeMap v, String tag) throws IOException;
  public void endMap(TreeMap v, String tag) throws IOException;

}

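The interface defines the types that can be serialized:

Primitive types: byte, boolean, int, long, float, double

Non-primitive types: string, byte[], nested records, vector, TreeMap

The InputArchive interface mirrors these methods on the read side.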
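Unlike the scalar types, vectors and maps only get start and end calls in the interface; the caller writes the elements in between. The sketch below shows how a list of ints could be laid out under that scheme, assuming the binary archive writes the element count as a 4-byte int at the start (and -1 for a null list) and writes nothing at the end. VectorLayoutSketch is a hypothetical name, and the code uses only JDK streams rather than Jute itself.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Jute: a length-prefixed list of ints.
class VectorLayoutSketch {

    static byte[] encode(List<Integer> v) {
        try {
            ByteArrayOutputStream baos = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(baos);
            if (v == null) {
                out.writeInt(-1);          // assumed marker for a null vector
            } else {
                out.writeInt(v.size());    // "startVector": element count
                for (int x : v) {
                    out.writeInt(x);       // each element, written by the caller
                }
                // "endVector": nothing to write in this layout
            }
            out.flush();
            return baos.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static List<Integer> decode(byte[] data) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
            int n = in.readInt();
            if (n == -1) {
                return null;
            }
            List<Integer> v = new ArrayList<>(n);
            for (int i = 0; i < n; i++) {
                v.add(in.readInt());
            }
            return v;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = encode(Arrays.asList(7, 8, 9));
        System.out.println(data.length + " bytes"); // 4 + 3 * 4 = 16
        System.out.println(decode(data));
    }
}
```

A map would follow the same pattern, with a count followed by alternating keys and values.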

Implementations

As the code structure shows, OutputArchive and InputArchive have the following main implementations:

OutputArchive implementations: BinaryOutputArchive, CsvOutputArchive, and XmlOutputArchive

InputArchive implementations: BinaryInputArchive, CsvInputArchive, and XmlInputArchive

Purpose

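BinaryOutputArchive: used for network transfer and for storage on local disk

CsvOutputArchive: mostly a convenient way to display data objects in readable form

XmlOutputArchive and XmlInputArchive: save data as XML and restore it

ZooKeeper mostly serializes data for network transfer and local disk storage, so BinaryOutputArchive is the most widely used. The example above uses BinaryOutputArchive as its serializer for the same reason.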
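The value of having several archive implementations behind one interface is that a bean's serialize code never changes when the output format does. The toy sketch below illustrates the idea with a cut-down, hypothetical interface (MiniOutputArchive) and two implementations: a binary one that ignores tags and a text one that uses them as labels. All names are invented for this sketch, and the text format is made up; it is not Jute's CSV or XML output.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical, cut-down stand-in for Jute's OutputArchive.
interface MiniOutputArchive {
    void writeInt(int i, String tag);
    void writeString(String s, String tag);
}

// Compact binary form: the tag is ignored, only the values are written.
class MiniBinaryArchive implements MiniOutputArchive {
    private final ByteArrayOutputStream baos = new ByteArrayOutputStream();
    private final DataOutputStream out = new DataOutputStream(baos);

    public void writeInt(int i, String tag) {
        try {
            out.writeInt(i);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public void writeString(String s, String tag) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        try {
            out.writeInt(utf8.length);
            out.write(utf8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    byte[] toByteArray() {
        return baos.toByteArray();
    }
}

// Human-readable form (invented format): the tag labels each value.
class MiniTextArchive implements MiniOutputArchive {
    private final StringBuilder sb = new StringBuilder();

    public void writeInt(int i, String tag) {
        append(tag, Integer.toString(i));
    }

    public void writeString(String s, String tag) {
        append(tag, s);
    }

    private void append(String tag, String value) {
        if (sb.length() > 0) {
            sb.append(';');
        }
        sb.append(tag).append('=').append(value);
    }

    String text() {
        return sb.toString();
    }
}

class ArchiveFormatsSketch {
    // The "bean" side: one method that is unaware of the output format.
    static void writeTestBean(MiniOutputArchive a, int intV, String stringV) {
        a.writeInt(intV, "intV");
        a.writeString(stringV, "stringV");
    }

    public static void main(String[] args) {
        MiniBinaryArchive bin = new MiniBinaryArchive();
        MiniTextArchive txt = new MiniTextArchive();
        writeTestBean(bin, 1, "testbean1");
        writeTestBean(txt, 1, "testbean1");
        System.out.println(bin.toByteArray().length + " bytes");
        System.out.println(txt.text());
    }
}
```

The binary form is the compact one suited to the wire and to disk, while the text form is easier to read when inspecting data, which is the same trade-off the three Jute archive families make.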

Data Description Language

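Most serialization tools support a data description language (DDL), and Jute is no exception. If you have read the ZooKeeper source code, you will have noticed that many classes begin with the comment: // File generated by hadoop record compiler. Do not edit. Every class carrying that comment was generated from Jute's data description language.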

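Many classes in ZooKeeper are generated from the description language. The corresponding description file ships with ZooKeeper: zookeeper.jute under zookeeper-3.4.9/src. It declares all the beans that ZooKeeper needs to generate, and you can open it to have a look.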

Jute's compiler includes JavaGenerator, CSharpGenerator, CppGenerator, and CGenerator, which generate class files for Java, C#, C++, and C respectively.

A Brief Comparison with Protobuf

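This section compares Jute with protobuf3 on serialization time, deserialization time, and serialized size. The versions used are:

Protobuf: protobuf-3.0.0

Jute: zookeeper-3.4.9

The protobuf description file: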

syntax = "proto3";
option java_package = "protobuf.clazz"; 
option java_outer_classname = "GoodsPicInfo";

message PicInfo { 
 int32 ID = 1; 
 int64 GoodID = 2;        
 string Url = 3; 
 string Guid = 4; 
 string Type = 5; 
 int32 Order = 6; 
}

The Jute description file:

module test {
  class PicInfo {
      int ID;
      long GoodID;
      ustring Url;
      ustring Guid;
      ustring Type;
      int Order;
  }
}

The Protobuf test code:

import com.google.protobuf.InvalidProtocolBufferException;
import com.google.protobuf.MessageLite;

import protobuf.clazz.GoodsPicInfo;

public class Protobuf_Test {
  public static void main(String[] args)
          throws InvalidProtocolBufferException {
      // Serialize the same message 50000 times
      long startTime = System.currentTimeMillis();
      byte[] result = null;
      for (int i = 0; i < 50000; i++) {
          GoodsPicInfo.PicInfo.Builder builder = GoodsPicInfo.PicInfo
                  .newBuilder();
          builder.setGoodID(100);
          builder.setGuid("11111-22222-3333-444");
          builder.setOrder(0);
          builder.setType("ITEM");
          builder.setID(10);
          builder.setUrl("http://xxx.jpg");
          GoodsPicInfo.PicInfo info = builder.build();
          result = info.toByteArray();
      }
      long endTime = System.currentTimeMillis();
      System.out.println("Size in bytes: " + result.length + ", serialization time: "
              + (endTime - startTime) + "ms");

      // Deserialize the last serialized message 50000 times
      for (int i = 0; i < 50000; i++) {
          GoodsPicInfo.PicInfo newBean = GoodsPicInfo.PicInfo
                  .getDefaultInstance();
          MessageLite prototype = newBean.getDefaultInstanceForType();
          newBean = (GoodsPicInfo.PicInfo) prototype.newBuilderForType()
                  .mergeFrom(result).build();
      }
      long endTime2 = System.currentTimeMillis();
      System.out.println("Deserialization time: " + (endTime2 - endTime) + "ms");
  }
}

The Jute test code:

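import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;

import org.apache.jute.BinaryInputArchive;
import org.apache.jute.BinaryOutputArchive;

// PicInfo is the class generated from the Jute description file above;
// import it from wherever the generated sources live in your project.
public class Jute_test {
  public static void main(String[] args) throws IOException {
      // Serialize the same record 50000 times
      long startTime = System.currentTimeMillis();
      byte[] array = null;
      for (int i = 0; i < 50000; i++) {
          ByteArrayOutputStream baos = new ByteArrayOutputStream();
          BinaryOutputArchive boa = BinaryOutputArchive.getArchive(baos);
          new PicInfo(10, 100, "http://xxx.jpg", "11111-22222-3333-444",
                  "ITEM", 0).serialize(boa, "tag" + i);
          array = baos.toByteArray();
      }
      long endTime = System.currentTimeMillis();
      System.out.println("Size in bytes: " + array.length + ", serialization time: "
              + (endTime - startTime) + "ms");

      // Deserialize the last serialized record 50000 times.
      // The binary archive does not write tags, so the tag used here
      // need not match the one used when serializing.
      for (int i = 0; i < 50000; i++) {
          ByteArrayInputStream bais = new ByteArrayInputStream(array);
          BinaryInputArchive bia = BinaryInputArchive.getArchive(bais);
          PicInfo newBean = new PicInfo();
          newBean.deserialize(bia, "tag1");
      }
      long endTime2 = System.currentTimeMillis();
      System.out.println("Deserialization time: " + (endTime2 - endTime) + "ms");
  }
}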

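Each test runs 50,000 serializations and 50,000 deserializations. The results:

Protobuf: 48 bytes, serialization 141 ms, deserialization 62 ms

Jute: 66 bytes, serialization 94 ms, deserialization 62 ms

Jute has a modest edge in serialization time, but its serialized size is less impressive.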
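The two sizes reported above can be reproduced by hand, which also shows where the 18-byte gap comes from. The sketch below is a back-of-the-envelope calculation using only the JDK. It assumes Jute's binary form uses fixed-width numbers (4-byte int, 8-byte long) and a 4-byte length prefix per string, while proto3 uses a 1-byte field tag for field numbers 1 to 15, base-128 varints for numbers and string lengths, and omits fields that hold their default value. SizeBreakdownSketch and its methods are hypothetical names, not part of either library.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical calculator, not part of Jute or Protobuf.
class SizeBreakdownSketch {

    static int utf8Len(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Assumed Jute binary layout: 4-byte int, 8-byte long,
    // strings as a 4-byte length plus their UTF-8 bytes.
    static int juteSize(int id, long goodId, String url, String guid,
                        String type, int order) {
        return 4 + 8
                + (4 + utf8Len(url))
                + (4 + utf8Len(guid))
                + (4 + utf8Len(type))
                + 4;
    }

    // Number of bytes a value occupies as a base-128 varint.
    static int varintSize(long v) {
        int n = 1;
        while ((v >>>= 7) != 0) {
            n++;
        }
        return n;
    }

    // proto3 numeric field: 1-byte tag plus varint, omitted when zero.
    static int protoNumber(long v) {
        return v == 0 ? 0 : 1 + varintSize(v);
    }

    // proto3 string field: 1-byte tag, varint length, UTF-8 bytes;
    // omitted when empty.
    static int protoString(String s) {
        int len = utf8Len(s);
        return len == 0 ? 0 : 1 + varintSize(len) + len;
    }

    static int protoSize(int id, long goodId, String url, String guid,
                         String type, int order) {
        return protoNumber(id) + protoNumber(goodId)
                + protoString(url) + protoString(guid) + protoString(type)
                + protoNumber(order);
    }

    public static void main(String[] args) {
        System.out.println("jute: " + juteSize(10, 100L, "http://xxx.jpg",
                "11111-22222-3333-444", "ITEM", 0));
        System.out.println("protobuf: " + protoSize(10, 100L, "http://xxx.jpg",
                "11111-22222-3333-444", "ITEM", 0));
    }
}
```

With the test data, the three numeric fields cost 16 bytes in Jute but only 4 in Protobuf, because small values fit in one varint byte and Order = 0 is not written at all. The three strings cost 50 bytes in Jute against 44 in Protobuf, because each length prefix shrinks from 4 bytes to 1 even after adding the field tag. Larger numeric values or messages dominated by long strings would narrow the gap.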

Summary

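This post started from a simple example and then looked at a few core classes to see which data types and languages Jute supports, and that its serialization and deserialization are built on the JDK's DataOutput and DataInput. It then covered Jute's data description language and finally compared Jute with Protobuf, which showed that Jute still has strengths of its own. I suspect that is part of the reason ZooKeeper has kept Jute as its serialization tool.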
