Multi-Character Field Delimiter in Apache Hive Table

Multi-character field delimiter is not implemented in LazySimpleSerDe and OpenCSVSerde text file SerDe classes. There are two options to use multi-character field delimiter in Hive. The first option is MultiDelimitSerDe class specially developed to handle multi-character field delimiters. The second one is to use RegexSerDe class as a workaround. Those two options don't work as expected because of limitations.

MultiDelimitSerDe implementation

MultiDelimitSerDeSerDe is considered as experimental one until Hive release 4.0.0. It's included in hive-contrib-<version>.jar library and you have to add the library to the class path. If hive-contrib-<version>.jar library is not included in the class path, the functionality is limited to run only SELECT * FROM table_name; queries. The limitation is caused by map/reduce jobs which don't have access to the library. The issue should be fixed in Hive 4.0.0 when MultiDelimitSerDeclass is supposed to be included in org.apache.hadoop.hive.serde2 library. Currently, MultiDelimitSerDeclass is a part of org.apache.hadoop.hive.contrib.serde2 library.

One more limitation is that skip header lines functionality (TBLPROPERTIES ("skip.header.line.count"="1")) doesn't work.

Sample of experimental version with ~| field delimiter. The code is run on Hive 1.1.0.

CREATE EXTERNAL TABLE `sample_table`(
   `column1` string,
   `column2` int,
   `column3` decimal(10,2),
   `column4` timestamp)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe' 
WITH SERDEPROPERTIES ("field.delim"="~|");
LOCATION
 '/folder/folder2'

Sample of final version with ~| field delimiter. The code can't be validated.

CREATE EXTERNAL TABLE `sample_table`(
   `column1` string,
   `column2` int,
   `column3` decimal(10,2),
   `column4` timestamp)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.MultiDelimitSerDe' 
WITH SERDEPROPERTIES ("field.delim"="~|");
LOCATION
 '/folder/folder2'

RegexSerDe implementation

RegexSerDe SerDe limitation is to support only string data type in Hive tables. Also, It's expected performance overhead.

Sample with ~| field delimiter.

CREATE EXTERNAL TABLE `sample_table`(
   `column1` string,
   `column2` string,
   `column3` string,
   `column4` string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "(.*)[~][|](.*)[~][|](.*)[~][|](.*)"
)
LOCATION
 '/folder/folder2'
TBLPROPERTIES (
 "skip.header.line.count"="1"
);

Multi-Character Field Delimiter in Apache Hive Table

MultiDelimitSerDe implementation

RegexSerDe implementation

Resources

Comments