阅读Hive Orc官方文档

Introduction

Orc格式支持自Hive 0.11引入。

The Optimized Row Columnar (ORC) 文件格式提供了更高效地存储Hive数据存储。其设计初衷是克服其他Hive文件格式的局限。

使用Orc文件提升了Hive的读写和处理性能。

相比RCFile，ORC文件格式由如下优点：

每个Task输出单一文件，减少了NameNode的负载
包括对datatime，decimal及复杂类型的支持(struct,list,map,union)
文件内存储轻量化的索引：跳过被过滤掉的row group；索引到指定行。
根据数据类型进行基于块的压缩：整数列进行行程编码；对字符串列进行字典编码。
使用独立的RecordReader并行读取同一文件。
不必扫描标识就能对文件进行切分
限定读写需要的内存
元数据以Protocol Buffer形式存储，允许添加删除域。

Orc使用类型相关的Reader和Writer来提供轻量化的压缩技术，例如字典编码，行程编码等。

可以使用通用的压缩方式，如zlib或Snappy等。存储上的节省只是一方面的优势。

Orc对于每一列能够只读取指定的一部分字节数据。

另外Orc文件还包含了轻量化的索引，对于每一定量(默认为10000行)数据，包含了每一列的最大值和最小值。

使用Hive的下推过滤器时，Orc文件的Reader能够根据索引信息跳过本次查询不需要的行。

文件结构

一个ORC文件主体由一系列称作stripes的行数据的分组以及一份称作file footer的额外信息数据组成。

在文件末尾包含一个称为postscript的部分用于保存压缩的参数以及被压缩的footer的大小。

默认的stripe大小为250MB，大的stripe大小利于数据更高效的从HDFS读取。

文件的footer包含了如下信息：

该文件所包含的stripe的列表
每个stripe的行数
每一列的数据类型
列级别的聚合信息，count，min，max以及sum。

Stripe结构

Orc的每个stripe包含了index数据，行数据以及一个stripe footer。

stripe footer包含了数据流位置的目录。row data在表的扫描过程中会用到。

index data包含了每一列的最小值，最大值以及每一列中最大最小值出现的行位置。

行索引条目提供了偏移，用于在被解压的块中定位正确的压缩块和字节，

注意：Orc索引值用于选出目标stripe和row group，并不用于返回查询结果。

即使文件数据量很大，使用row index能够跳过不需要的行，从而实现快速读取。

关于ORC格式的完整描述见 ORC格式官方说明

File Tail

HDFS不支持写入文件后改变文件中的数据，因此ORC将全局的索引存在文件末尾。

文件的整体结构如下图所示：

文件由三部分组成，文件metadata，文件footer和postscript。

ORC的metadata以Protocol Buffers的形式存储，能够方便地添加新的域。

PostScript

.proto文件描述如下：

message PostScript {
 // the length of the footer section in bytes
 optional uint64 footerLength = 1;
 // the kind of generic compression used
 optional CompressionKind compression = 2;
 // the maximum size of each compression chunk
 optional uint64 compressionBlockSize = 3;
 // the version of the writer
 repeated uint32 version = 4 [packed = true];
 // the length of the metadata section in bytes
 optional uint64 metadataLength = 5;
 // the fixed string "ORC"
 optional string magic = 8000;
}
enum CompressionKind {
 NONE = 0;
 ZLIB = 1;
 SNAPPY = 2;
 LZO = 3;
}

Postscript section提供了解读文件其他部分的必要信息。

包括文件Footer和Metadata部分的长度，文件的版本，使用的压缩方法。

Postscript不会被压缩并且在文件末尾会有一个标志结束的字节。

Postscript中存储这能保证正常读取文件的最低Hive版本，由主版本和副版本号组成。

如[0,11]对应Hive 0.11，[0,12]对应Hive 0.12 ~0.14。

读取一个Orc文件的过程是从文件末尾反向进行的。

Orc Reader会一口气读取文件最后的16k字节，期望将Footer和Postscript都包括进去。

文件的最后一个字节包含了Postscript的序列化后的长度，该长度不能超过256字节。

Postscript解析后就能得到压缩后的Footer序列化长度，从而能够对Footer进行解压缩和解析。

.proto文件描述如下：

message Footer {
 // the length of the file header in bytes (always 3)
 optional uint64 headerLength = 1;
 // the length of the file header and body in bytes
 optional uint64 contentLength = 2;
 // the information about the stripes
 repeated StripeInformation stripes = 3;
 // the schema information
 repeated Type types = 4;
 // the user metadata that was added
 repeated UserMetadataItem metadata = 5;
 // the total number of rows in the file
 optional uint64 numberOfRows = 6;
 // the statistics of each column across the file
 repeated ColumnStatistics statistics = 7;
 // the maximum number of rows in each index entry
 optional uint32 rowIndexStride = 8;
}

Footer包含了文件的整体结构信息，schema类型信息，行数以及关于每一列的统计信息。

整个Orc文件分为3部分，头部Header，主体Body和尾部Tail。头部包含了”ORC”三个字节用于表示文件类型。

主体包含了行和索引，尾部给出了文件级别的描述信息。

Stripe Information

文件的body部分被划分为stripes。每个stripe是自包含的，读取时只需要用到stripe内的信息和文件Footer及Postscript的信息。

每个stripe仅包含一系列完整的行，因此行不回跨越stripe边界。

每个stripe包含三部分：

stripe内部的行索引集合
行数据
stripe footer

索引部分和数据部分均为按列划分的，以便于只读出需要的列。

message StripeInformation {
 // the start of the stripe within the file
 optional uint64 offset = 1;
 // the length of the indexes in bytes
 optional uint64 indexLength = 2;
 // the length of the data in bytes
 optional uint64 dataLength = 3;
 // the length of the footer in bytes
 optional uint64 footerLength = 4;
 // the number of rows in the stripe
 optional uint64 numberOfRows = 5;
}

Type Information

Orc文件中的所有行的schema都相同。逻辑上schema表达为一棵树，复合类型具有子列。如下所示：

create table Foobar (
 myInt int,
 myMap map<string,
 struct<myString : string,
 myDouble: double>>,
 myTime timestamp
);

类型树通过先序遍历的方式打平为一个列表，并按照遍历的顺序分配域的id。

树的根节点的id为0。.proto文件描述如下：

message Type {
 enum Kind {
 BOOLEAN = 0;
 BYTE = 1;
 SHORT = 2;
 INT = 3;
 LONG = 4;
 FLOAT = 5;
 DOUBLE = 6;
 STRING = 7;
 BINARY = 8;
 TIMESTAMP = 9;
 LIST = 10;
 MAP = 11;
 STRUCT = 12;
 UNION = 13;
 DECIMAL = 14;
 DATE = 15;
 VARCHAR = 16;
 CHAR = 17;
 }
 // the kind of this type
 required Kind kind = 1;
 // the type ids of any subcolumns for list, map, struct, or union
 repeated uint32 subtypes = 2 [packed=true];
 // the list of field names for struct
 repeated string fieldNames = 3;
 // the maximum length of the type for varchar or char
 optional uint32 maximumLength = 4;
 // the precision and scale for decimal
 optional uint32 precision = 5;
 optional uint32 scale = 6;
}

Column Statistics

列统计的目标是对于每一列，writer记录计数并根据数据类型记录一些有用的域。

对大多数简单数据类型，它记录最小值和最大值。对于数值类型计算总和。

从Hive 1.1.0之后，基于列的统计也会通过设置hasNull标志位记录row group中是否有空值(null)

hasNull标记用于在’IS NULL’类型的query中获得更好的查询性能。

message ColumnStatistics {
 // the number of values
 optional uint64 numberOfValues = 1;
 // At most one of these has a value for any column
 optional IntegerStatistics intStatistics = 2;
 optional DoubleStatistics doubleStatistics = 3;
 optional StringStatistics stringStatistics = 4;
 optional BucketStatistics bucketStatistics = 5;
 optional DecimalStatistics decimalStatistics = 6;
 optional DateStatistics dateStatistics = 7;
 optional BinaryStatistics binaryStatistics = 8;
 optional TimestampStatistics timestampStatistics = 9;
 optional bool hasNull = 10;
}
For integer types (tinyint, smallint, int, bigint), the column
statistics includes the minimum, maximum, and sum. If the sum
overflows long at any point during the calculation, no sum is
recorded.
message IntegerStatistics {
 optional sint64 minimum = 1;
 optional sint64 maximum = 2;
 optional sint64 sum = 3;
} 

不同数据类型的统计字段由.proto文件描述如下：

message IntegerStatistics {
 optional sint64 minimum = 1;
 optional sint64 maximum = 2;
 optional sint64 sum = 3;
}
message DoubleStatistics {
 optional double minimum = 1;
 optional double maximum = 2;
 // If the sum overflows a double, no sum is recorded.
 optional double sum = 3; 
}
message StringStatistics {
 optional string minimum = 1;
 optional string maximum = 2;
 // sum will store the total length of all strings
 optional sint64 sum = 3;
}
message BucketStatistics {
 // For booleans, the statistics include the count of 
 // false and true values.
 repeated uint64 count = 1 [packed=true];
}
 
message DecimalStatistics {
 optional string minimum = 1;
 optional string maximum = 2;
 optional string sum = 3;
}
message DateStatistics {
 // min,max values saved as days since epoch
 optional sint32 minimum = 1;
 optional sint32 maximum = 2;
}
message TimestampStatistics {
 // min,max values saved as milliseconds since epoch
 optional sint64 minimum = 1;
 optional sint64 maximum = 2;
}
message BinaryStatistics {
 // sum will store the total binary blob length
 optional sint64 sum = 1;
}

User Metadata

用户可以在写入Orc文件时添加任意的键值对形式的元数据信息，数据内容可以由应用程序指定。

但键限定是字符串类型，值限定为二进制类型。

用户应当确保键是唯一的并且通常会以机构代码为前缀。

message UserMetadataItem {
 // the user defined key
 required string name = 1;
 // the user defined binary value
 required bytes value = 2;
}

File Metadata

文件元数据部分包含了stripe粒度的列统计信息。

这些统计信息使得输入分片能够基于每个stripe的下推过滤裁剪掉不需要读取的分片信息。

message StripeStatistics {
 repeated ColumnStatistics colStats = 1;
}
message Metadata {
 repeated StripeStatistics stripeStats = 1;
}

Compression Streams

如果指定了Orc文件的某种压缩编码方式(zlib或snappy)，除了Postscript外Orc文件的每部分都会被以该编码方式压缩。

然而我们需要Orc的reader能够做到不解压完整的数据流就能跳过那些被压缩的字节。

为了实现这一点，Orc将压缩的数据流以带header的chunk为单位进行写入。如下图所示：

为了处理未压缩的数据，如果被压缩的数据比原来的文件还大,则数据以原文件的形式存储并用isOriginal形式标记。

每个header均为3字节并以小尾端存储，包括了压缩后的长度CompressedLength及一位isOriginal标志位。

每个压缩快独立地进行压缩，因此只要一个decompressor从header开始读取，它就能不依赖前面的字节流对当前块进行解压。

默认的压缩块大小为256K，Writer可以配置该参数，最大不超过2^23。

chunk越大压缩效果越好，但需要消耗更多的内存。

chunks大小记录在Postscript中，reader会根据它分配合适的buffer大小。

没有进行压缩的Orc文件则直接写入数据流，不带header。

Run Length Encoding

Orc对基本数据类型采用了行程编码的方式进行压缩。

针对字节型，布尔型，整型采用了不同的行程编码方式进行压缩。

详细规范建 ORC格式官方说明

Stripes

Orc文件的主体由一系列stripes构成。Stripes通常较大(一般约200MB)，彼此独立且常由不同的Task处理。

在Orc文件中，每一列由文件中相邻存储的多个stream组成。

例如对于一个整型列，存储为两个stream：

一个称为PRESENT，每个value占1位用于标记是否为空。
一个称为DATA，用于存储非空的值。
如果一个Stripe的一列中所有的值均为非空的，则PRESENT这个stream可以省略。

对于二进制列，存储为三个stream：

一个称为PRESENT，每个value占1位用于标记是否为空。
一个称为DATA，用于存储非空的值。
一个称为LENGTH，存储每个值的长度。

Stripe Footer

Stripe Footer包含了对每一列以及列的stream信息的编码后的数据。

message StripeFooter {
 // the location of each stream
 repeated Stream streams = 1;
 // the encoding of each column
 repeated ColumnEncoding columns = 2;
}

为了描述每个stream，Orc存储了stream的种类，对应列的id以及stream的大小(以字节为单位)。

每个stream内存储的信息依赖该列的数据类型和编码。

message Stream {
 enum Kind {
 // boolean stream of whether the next value is non-null
 PRESENT = 0;
 // the primary data stream
 DATA = 1;
 // the length of each value for variable length data
 LENGTH = 2;
 // the dictionary blob
 DICTIONARY\_DATA = 3;
 // deprecated prior to Hive 0.11
 // It was used to store the number of instances of each value in the
 // dictionary
 DICTIONARY_COUNT = 4;
 // a secondary data stream
 SECONDARY = 5;
 // the index for seeking to particular row groups
 ROW_INDEX = 6;
 }
 required Kind kind = 1;
 // the column id
 optional uint32 column = 2;
 // the number of bytes in the file
 optional uint64 length = 3;
}

根据数据类型由若干可选的编码类型。

可分为直接编码还是基于字典分类的编码，进一步可细分为行程编码采用v1还是v2。

message ColumnEncoding {
 enum Kind {
 // the encoding is mapped directly to the stream using RLE v1
 DIRECT = 0;
 // the encoding uses a dictionary of unique values using RLE v1
 DICTIONARY = 1;
 // the encoding is direct using RLE v2
 DIRECT\_V2 = 2;
 // the encoding is dictionary-based using RLE v2
 DICTIONARY\_V2 = 3;
 }
 required Kind kind = 1;
 // for dictionary encodings, record the size of the dictionary
 optional uint32 dictionarySize = 2;
}

Column Encodings

SmallInt, Int, and BigInt Columns

整型列编码方式如下：

Encoding	Stream Kind	Optional	Contents
DIRECT	PRESENT	Yes	Boolean RLE
	DATA	No	Signed Integer RLE v1
DIRECT_V2	PRESENT	Yes	Boolean RLE
	DATA	No	Signed Integer RLE v2

Float and Double Columns

浮点型列编码方式如下：

Encoding	Stream Kind	Optional	Contents
DIRECT	PRESENT	Yes	Boolean RLE
	DATA	No	IEEE 754 floating point representation

String, Char, and VarChar Columns

字符串列编码是自适应的，根据前10000个值是否相似度较低决定用何种编码方式。

字符串列编码方式如下：

Encoding	Stream Kind	Optional	Contents
DIRECT	PRESENT	Yes	Boolean RLE
	DATA	No	String contents
	LENGTH	No	Unsigned Integer RLE v1
DICTIONARY	PRESENT	Yes	Boolean RLE
	DATA	No	Unsigned Integer RLE v1
	DICTIONARY_DATA	No	String contents
	LENGTH	No	Unsigned Integer RLE v1
DIRECT_V2	PRESENT	Yes	Boolean RLE
	DATA	No	String Contents
	LENGTH	No	Unsigned Integer RLE v2
DICTIONARY_V2	PRESENT	Yes	Boolean RLE
	DATA	No	Unsigned Integer RLE v2
	DICTIONARY_DATA	No	String contents
	LENGTH	No	Unsigned Integer RLE v2

Boolean Columns

布尔型列编码方式如下：

Encoding	Stream Kind	Optional	Contents
DIRECT	PRESENT	Yes	Boolean RLE
	DATA	No	Boolean RLE

TinyInt Columns

小整型列采用字节行程编码：

Encoding	Stream Kind	Optional	Contents
DIRECT	PRESENT	Yes	Boolean RLE
	DATA	No	Byte RLE

Binary Columns

二进制数据编码方式如下：

Encoding	Stream Kind	Optional	Contents
DIRECT	PRESENT	Yes	Boolean RLE
	DATA	No	Binary contents
	LENGTH	No	Unsigned Integer RLE v1
DIRECT_V2	PRESENT	Yes	Boolean RLE
	DATA	No	Binary contents
	LENGTH	No	Unsigned Integer RLE v2

Decimal Columns

Decimal类型的列编码方式如下：

Encoding	Stream Kind	Optional	Contents
DIRECT	PRESENT	Yes	Boolean RLE
	DATA	No	Unbounded base 128 varints
	SECONDARY	No	Unsigned Integer RLE v1
DIRECT_V2	PRESENT	Yes	Boolean RLE
	DATA	No	Unbounded base 128 varints
	SECONDARY	No	Unsigned Integer RLE v2

Date Columns

日期列编码如下：

Encoding	Stream Kind	Optional	Contents
DIRECT	PRESENT	Yes	Boolean RLE
	DATA	No	Signed Integer RLE v1
DIRECT_V2	PRESENT	Yes	Boolean RLE
	DATA	No	Signed Integer RLE v2

Timestamp Columns

时间戳列编码如下：

Encoding	Stream Kind	Optional	Contents
DIRECT	PRESENT	Yes	Boolean RLE
	DATA	No	Signed Integer RLE v1
	SECONDARY	No	Unsigned Integer RLE v1
DIRECT_V2	PRESENT	Yes	Boolean RLE
	DATA	No	Signed Integer RLE v2
	SECONDARY	No	Unsigned Integer RLE v2

Struct Columns

Structs类型的列编码方式如下：

Encoding	Stream Kind	Optional	Contents
DIRECT	PRESENT	Yes	Boolean RLE

List Columns

列表类型的列编码方式如下：

Encoding	Stream Kind	Optional	Contents
DIRECT	PRESENT	Yes	Boolean RLE
	LENGTH	No	Unsigned Integer RLE v1
DIRECT_V2	PRESENT	Yes	Boolean RLE
	LENGTH	No	Unsigned Integer RLE v2

Map Columns

Map类型的列编码方式如下：

Encoding	Stream Kind	Optional	Contents
DIRECT	PRESENT	Yes	Boolean RLE
	LENGTH	No	Unsigned Integer RLE v1
DIRECT_V2	PRESENT	Yes	Boolean RLE
	LENGTH	No	Unsigned Integer RLE v2

Union Columns

Union类型的列编码方式如下：

Encoding	Stream Kind	Optional	Contents
DIRECT	PRESENT	Yes	Boolean RLE
	DATA	No	Byte RLE

Indexes

Row Group Index

Row Group索引由一个对应于每个简单类型的列的ROW\_INDEX stream组成。

针对每个Row Group有一个entry。Row group的大小由Writer控制，默认为10000行。

每个RowIndexEntry记录了列的每个stream的位置和该Row Group的统计信息。

索引stream位于stripe的前部，这是因为默认情况下它们并不需要被读取。

只有当使用下推裁剪功能或reader寻找特定的某一行时这些stream才会被加载。

message RowIndexEntry {
 repeated uint64 positions = 1 [packed=true];
 optional ColumnStatistics statistics = 2;
}
message RowIndex {
 repeated RowIndexEntry entry = 1;
}

对于包含多个stream的列，位置信息的数字序列会连接起来。

Bloom Filter Index

布隆过滤器(Bloom Filters)是从Hive 0.12以后引入到ORC索引中的。

谓词下推可以使用Bloom Filters来更好地过裁剪掉那些不满足过滤条件的row group。

Orc中Bloom Filters由一个BLOOM_FILTER stream组成，该stream对应每个由”orc.bloom.filter.columns”参数指定的列。

Bloom Filters stream为每个row group(默认10000行)记录了一个bloom filter entry。

只有row index索引满足min/max过滤条件的那些row group才会继续使用bloom filter index进行过滤。

每个BloomFilterEntry存储了一系列哈希函数和一个位向量。

message BloomFilter {
 optional uint32 numHashFunctions = 1;
 repeated fixed64 bitset = 2;
}
message BloomFilterIndex {
 repeated BloomFilter bloomFilter = 1;
}

布隆过滤器在内部使用两种不同的hash函数来将一个key映射到位向量的一个位置。

对tinyint, smallint, int, bigint, float，double types使用Thomas Wang 64-bit整型hash函数。

对于字符串和二进制数据类型，使用Murmur3 64-bit hash算法。

布隆过滤器stream同row group index交错分布，这样方便在单次读操作中读取bloom filter stream和row index stream。