Conversation
src/main/flatbuf/parquet3.fbs
Outdated
| @@ -0,0 +1,224 @@ | |||
| namespace parquet.format3; | |||
There was a problem hiding this comment.
lets just name this parquet.format for now?
src/main/flatbuf/parquet3.fbs
Outdated
| // 1. Statistics are stored in integral types if their size is fixed, otherwise prefix + suffix | ||
| // 2. ColumnMetaData.encoding_stats are removed, they are replaced with | ||
| // ColumnMetaData.is_fully_dict_encoded. | ||
| // 3. RowGroups are limited to 2GB in size, so we can use int for sizes. |
There was a problem hiding this comment.
I think this and the item below are out of date (we are using long now) and can keep things absolute?
src/main/flatbuf/parquet3.fbs
Outdated
| @@ -0,0 +1,224 @@ | |||
| namespace parquet.format3; | |||
|
|
|||
| // Optimization notes | |||
There was a problem hiding this comment.
Can we expand this comment to be explicit about the relationship between this FBS and parquet.thrift.
| // Note: Match the thrift enum values so that we can cast between them. | ||
| enum Encoding : byte { | ||
| PLAIN = 0, | ||
| // GROUP_VAR_INT = 1, |
There was a problem hiding this comment.
Call out commented out entries as deprecated to make it clear why they are commented out?
src/main/flatbuf/parquet3.fbs
Outdated
| GZIP = 2, | ||
| LZO = 3, | ||
| BROTLI = 4, | ||
| // LZ4 = 5, |
There was a problem hiding this comment.
same comment on deprecation.
src/main/flatbuf/parquet3.fbs
Outdated
| scale: int; | ||
| } | ||
| enum TimeUnit : byte { | ||
| MS = 0, |
There was a problem hiding this comment.
Can we please make these match parquet.thrift for names (Millisecond, Microsecond, Nanosecond)?
| // Logical types. | ||
| /////////////////////////////////////////////////////////////////////////////////////////////////// | ||
|
|
||
| table Empty {} |
There was a problem hiding this comment.
I think we want detailed docs (same level as parquet.thrift if we intend this to be the new footer)?
src/main/flatbuf/parquet3.fbs
Outdated
| /////////////////////////////////////////////////////////////////////////////////////////////////// | ||
|
|
||
| table Empty {} | ||
| table DecimalOpts { |
There was a problem hiding this comment.
| table DecimalOpts { | |
| table DecimalOptions { |
Should be spell out type names to make it easer on reader?
src/main/flatbuf/parquet3.fbs
Outdated
| // - BYTE_ARRAY: | ||
| // prefix: the longest common prefix of min/max | ||
| // lo8+hi8 zero padded 16 bytes (big-endian) of the suffix | ||
| // len: the length for the suffix of the value after removing the prefix. If > 16 then the |
There was a problem hiding this comment.
| // len: the length for the suffix of the value after removing the prefix. If > 16 then the | |
| // min_len/max_len: the length for the suffix of the value after removing the prefix. If > 16 then the |
src/main/flatbuf/parquet3.fbs
Outdated
| // prefix: the longest common prefix of min/max | ||
| // lo8+hi8 zero padded 16 bytes (big-endian) of the suffix | ||
| // len: the length for the suffix of the value after removing the prefix. If > 16 then the | ||
| // value is inexact |
There was a problem hiding this comment.
| // value is inexact | |
| // value is inexact (it is exact otherwise). |
src/main/flatbuf/parquet3.fbs
Outdated
| // - BOOLEAN: none | ||
| // - INT32/FLOAT: lo4 (little-endian) | ||
| // - INT64/DOUBLE: lo8 (little-endian) | ||
| // - INT96: lo4+lo8 (little-endian) |
There was a problem hiding this comment.
for composite values, I think this is complicated enough that providing concrete examples would be belpful for implementors?
src/main/flatbuf/parquet3.fbs
Outdated
| DATA_PAGE_V2 = 3, | ||
| } | ||
|
|
||
| table KV { |
There was a problem hiding this comment.
nit
| table KV { | |
| table KeyValue { |
Lets keep name consistent if possible?
| codec: CompressionCodec; | ||
| num_values: long = null; // only present if not equal to rg.num_rows | ||
| total_uncompressed_size: long; | ||
| total_compressed_size: long; |
There was a problem hiding this comment.
It would be nice to keep total unencoded size here which I think is generally useful? But I suppose it can be added after?
src/main/flatbuf/parquet3.fbs
Outdated
| dictionary_page_offset: long = null; | ||
| statistics: Statistics; | ||
| is_fully_dict_encoded: bool; | ||
| bloom_filter_offset: long = null; |
There was a problem hiding this comment.
Should we this be made a struct/value to make the bloom filter info more self contained?
src/main/flatbuf/parquet3.fbs
Outdated
| row_groups: [RowGroup]; | ||
| kv: [KV]; | ||
| created_by: string; | ||
| // column_orders: [ColumnOrder]; // moved to SchemaElement |
There was a problem hiding this comment.
remove this row for now?
emkornfield
left a comment
There was a problem hiding this comment.
I think we also need to add an apache header here, and CI to make sure this compiles?
|
Hi @rok and @emkornfield , could you help to have another look of this pr? |
Rationale for this change
Improve wide table support.
What changes are included in this PR?
Add parquet flatbuf schema.
Do these changes have PoC implementations?
apache/arrow#48431