Changeset View
Changeset View
Standalone View
Standalone View
src/leveldb/doc/table_format.md
- This file was added.
leveldb File format | |||||
=================== | |||||
<beginning_of_file> | |||||
[data block 1] | |||||
[data block 2] | |||||
... | |||||
[data block N] | |||||
[meta block 1] | |||||
... | |||||
[meta block K] | |||||
[metaindex block] | |||||
[index block] | |||||
[Footer] (fixed size; starts at file_size - sizeof(Footer)) | |||||
<end_of_file> | |||||
The file contains internal pointers. Each such pointer is called | |||||
a BlockHandle and contains the following information: | |||||
offset: varint64 | |||||
size: varint64 | |||||
See [varints](https://developers.google.com/protocol-buffers/docs/encoding#varints) | |||||
for an explanation of varint64 format. | |||||
1. The sequence of key/value pairs in the file are stored in sorted | |||||
order and partitioned into a sequence of data blocks. These blocks | |||||
come one after another at the beginning of the file. Each data block | |||||
is formatted according to the code in `block_builder.cc`, and then | |||||
optionally compressed. | |||||
2. After the data blocks we store a bunch of meta blocks. The | |||||
supported meta block types are described below. More meta block types | |||||
may be added in the future. Each meta block is again formatted using | |||||
`block_builder.cc` and then optionally compressed. | |||||
3. A "metaindex" block. It contains one entry for every other meta | |||||
block where the key is the name of the meta block and the value is a | |||||
BlockHandle pointing to that meta block. | |||||
4. An "index" block. This block contains one entry per data block, | |||||
where the key is a string >= last key in that data block and before | |||||
the first key in the successive data block. The value is the | |||||
BlockHandle for the data block. | |||||
5. At the very end of the file is a fixed length footer that contains | |||||
the BlockHandle of the metaindex and index blocks as well as a magic number. | |||||
metaindex_handle: char[p]; // Block handle for metaindex | |||||
index_handle: char[q]; // Block handle for index | |||||
padding: char[40-p-q];// zeroed bytes to make fixed length | |||||
// (40==2*BlockHandle::kMaxEncodedLength) | |||||
magic: fixed64; // == 0xdb4775248b80fb57 (little-endian) | |||||
## "filter" Meta Block | |||||
If a `FilterPolicy` was specified when the database was opened, a | |||||
filter block is stored in each table. The "metaindex" block contains | |||||
an entry that maps from `filter.<N>` to the BlockHandle for the filter | |||||
block where `<N>` is the string returned by the filter policy's | |||||
`Name()` method. | |||||
The filter block stores a sequence of filters, where filter i contains | |||||
the output of `FilterPolicy::CreateFilter()` on all keys that are stored | |||||
in a block whose file offset falls within the range | |||||
[ i*base ... (i+1)*base-1 ] | |||||
Currently, "base" is 2KB. So for example, if blocks X and Y start in | |||||
the range `[ 0KB .. 2KB-1 ]`, all of the keys in X and Y will be | |||||
converted to a filter by calling `FilterPolicy::CreateFilter()`, and the | |||||
resulting filter will be stored as the first filter in the filter | |||||
block. | |||||
The filter block is formatted as follows: | |||||
[filter 0] | |||||
[filter 1] | |||||
[filter 2] | |||||
... | |||||
[filter N-1] | |||||
[offset of filter 0] : 4 bytes | |||||
[offset of filter 1] : 4 bytes | |||||
[offset of filter 2] : 4 bytes | |||||
... | |||||
[offset of filter N-1] : 4 bytes | |||||
[offset of beginning of offset array] : 4 bytes | |||||
lg(base) : 1 byte | |||||
The offset array at the end of the filter block allows efficient | |||||
mapping from a data block offset to the corresponding filter. | |||||
## "stats" Meta Block | |||||
This meta block contains a bunch of stats. The key is the name | |||||
of the statistic. The value contains the statistic. | |||||
TODO(postrelease): record following stats. | |||||
data size | |||||
index size | |||||
key size (uncompressed) | |||||
value size (uncompressed) | |||||
number of entries | |||||
number of data blocks |