我正在尝试从已上传到 HDFS 目录的 CSV 在 Impala 中创建一个表。 CSV 包含用逗号括在引号内的值。
例子:
1.66.96.0/19,"NTT Docomo,INC.","Ntt Docomo",9605,"NTT DOCOMO, INC."
1.66.128.0/17,"NTT Docomo,INC.","Ntt Docomo",9605,"NTT DOCOMO, INC."
1.67.0.0/17,"NTT Docomo,INC.","Ntt Docomo",9605,"NTT DOCOMO, INC."
1.67.128.0/18,"NTT Docomo,INC.","Ntt Docomo",9605,"NTT DOCOMO, INC."
1.67.192.0/19,"NTT Docomo,INC.","Ntt Docomo",9605,"NTT DOCOMO, INC."
Impala documentation 表示这可以通过 ESCAPED BY
子句来解决。这是我当前的代码:
DROP TABLE IF EXISTS GeoIP2_ISP_Blocks_IPv4;
CREATE TABLE GeoIP2_ISP_Blocks_IPv4 (
network STRING
,isp STRING
,organization STRING
,autonomous_system_number STRING
,autonomous_system_organization STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ESCAPED BY ''
LOCATION 'hdfs://.../GeoIP2_ISP_Blocks_IPv4/';
INVALIDATE METADATA GeoIP2_ISP_Blocks_IPv4;
LOAD DATA INPATH 'hdfs://.../GeoIP2_ISP_Blocks_IPv4/'
INTO TABLE GeoIP2_ISP_Blocks_IPv4;
我也尝试过使用 ESCAPED BY '"'
子句。在这两种情况下,Impala 都将引号中的逗号用作分隔符,将值分成两列。
关于如何修复代码以免发生这种情况的任何想法?
编辑(2015 年 6 月 9 日)
因此,根据@KS Nidhin 和@JTUP 的建议,我经历了以下变化。但是,每个变体返回的结果与不使用 SERDEPROPERTIES
运算符编写的查询相同,逗号仍然导致值出现在错误的列中:
Variation 1
DROP TABLE IF EXISTS GeoIP2_ISP_Blocks_IPv4;
CREATE TABLE GeoIP2_ISP_Blocks_IPv4 (
network STRING
,isp STRING
,organization STRING
,autonomous_system_number STRING
,autonomous_system_organization STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
WITH SERDEPROPERTIES ( "quoteChar" = "'", "escapeChar" = "" )
LOCATION 'hdfs://.../GeoIP2_ISP_Blocks_IPv4/';
INVALIDATE METADATA GeoIP2_ISP_Blocks_IPv4;
LOAD DATA INPATH 'hdfs://.../GeoIP2_ISP_Blocks_IPv4/'
INTO TABLE GeoIP2_ISP_Blocks_IPv4;
Variation 2
DROP TABLE IF EXISTS GeoIP2_ISP_Blocks_IPv4;
CREATE TABLE GeoIP2_ISP_Blocks_IPv4 (
network STRING
,isp STRING
,organization STRING
,autonomous_system_number STRING
,autonomous_system_organization STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ESCAPED BY ''
WITH SERDEPROPERTIES ( 'quoteChar' = '"', 'escapeChar' = '' )
LOCATION 'hdfs://.../GeoIP2_ISP_Blocks_IPv4/';
INVALIDATE METADATA GeoIP2_ISP_Blocks_IPv4;
LOAD DATA INPATH 'hdfs://.../GeoIP2_ISP_Blocks_IPv4/'
INTO TABLE GeoIP2_ISP_Blocks_IPv4;
Variation 3
DROP TABLE IF EXISTS GeoIP2_ISP_Blocks_IPv4;
CREATE TABLE GeoIP2_ISP_Blocks_IPv4 (
network STRING
,isp STRING
,organization STRING
,autonomous_system_number STRING
,autonomous_system_organization STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ESCAPED BY ''
WITH SERDEPROPERTIES (
"separatorChar" = ",",
"quoteChar" = """
)
LOCATION 'hdfs://.../GeoIP2_ISP_Blocks_IPv4/';
INVALIDATE METADATA GeoIP2_ISP_Blocks_IPv4;
LOAD DATA INPATH 'hdfs://.../GeoIP2_ISP_Blocks_IPv4/'
INTO TABLE GeoIP2_ISP_Blocks_IPv4;
任何其他想法,或 SERDEPROPERTIES
运算符的进一步变体尝试?
编辑(2016 年 6 月 10 日)
我能够使用 SERDE
和 SERDEPROPERTIES
运算符获得查询的不同变体以在 Hive 中工作(基于 Hive Documentation 中提供的代码),并创建了正确的表:
DROP TABLE IF EXISTS GeoIP2_ISP_Blocks_IPv4;
CREATE TABLE GeoIP2_ISP_Blocks_IPv4(network STRING
,isp STRING
,organization STRING
,autonomous_system_number STRING
,autonomous_system_organization STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'separatorChar' = ',',
'quoteChar' = '"',
'escapeChar' = ''
)
STORED AS TEXTFILE;
LOAD DATA INPATH 'hdfs://.../GeoIP2_ISP_Blocks_IPv4/'
INTO TABLE GeoIP2_ISP_Blocks_IPv4;
由于 SERDE
运算符在 Impala 中不可用,因此该解决方案在那里不起作用。我很好地在 Hive 中创建表,但在 Impala 中找不到可行的解决方案仍然很烦人。
添加 SERDEPROPERTIES 应该可以解决问题