新聞中心
寫在前面
很多時(shí)候,由于SQL邏輯復(fù)雜,加之對SQL執(zhí)行邏輯理解不透徹,很容易產(chǎn)生一些莫名其妙的結(jié)果,這些結(jié)果看似不符合預(yù)期,殊不知這就是真實(shí)結(jié)果。本文整理了幾個(gè)常見的SQL問題,我們在實(shí)際書寫SQL腳本時(shí),需要多加注意,希望本文對你有所幫助。

關(guān)于LEFT JOIN
外連接是我們書寫SQL時(shí)經(jīng)常使用的多表連接方式,使用起來也是十分的簡單。值得注意的是,越是簡單的東西,越是容易被忽略細(xì)節(jié)。通常我們都是這樣理解LEFT JOIN的:
語義是滿足Join on條件的直接返回,但不滿足情況下,需要返回Left Outer Join的left 表所有列,同時(shí)右表的列全部填null
上述對于LEFT JOIN的理解是沒有任何問題的,但是里面有一個(gè)誤區(qū):謂詞下推。具體看下面的實(shí)例:
假設(shè)有如下的三張表:
--建表
create table t1(id int, value int) partitioned by (ds string);
create table t2(id int, value int) partitioned by (ds string);
create table t3(c1 int, c2 int, c3 int);
--數(shù)據(jù)裝載,t1表
insert overwrite table t1 partition(ds='20220120') select '1','2022';
insert overwrite table t1 partition(ds='20220121') select '2','2022';
insert overwrite table t1 partition(ds='20220122') select '2','2022';
--數(shù)據(jù)裝載,t2表
insert overwrite table t2 partition(ds='20220120') select '1','120';
當(dāng)我們執(zhí)行如下的SQL查詢時(shí),會(huì)返回什么數(shù)據(jù)呢?
SELECT *
FROM t1
LEFT JOIN t2
ON t1.id = t2.id
AND t1.ds = '20220120'
;
結(jié)果1:
1 2022 20220120 1 120 20220120
結(jié)果2:
1 2022 20220120 1 120 20220120
2 2022 20220121 NULL NULL NULL
1 2022 20220122 NULL NULL NULL
相信對于很多初學(xué)者,甚至是一個(gè)有開發(fā)經(jīng)驗(yàn)的人來說,會(huì)認(rèn)為結(jié)果1是正確的返回結(jié)果。其實(shí)結(jié)果1的并不是正確的結(jié)果,真正的返回值是結(jié)果2.
是不是跟預(yù)期的結(jié)果不一致呢?很多初學(xué)者會(huì)認(rèn)為上述查詢SQL中AND t1.ds = '20220120'會(huì)進(jìn)行謂詞下推,從而得到結(jié)果2。其實(shí),SQL本身的語義不是這樣的,如果需要獲取結(jié)果1的數(shù)據(jù),正確的查詢方式是下面這樣:
--方式1:
SELECT *
FROM t1
LEFT OUTER JOIN t2
ON t1.id = t2.id
WHERE t1.ds = '20220120'
;
--方式2:
SELECT *
FROM (
SELECT *
FROM t1
WHERE ds = '20220120'
) t1
LEFT OUTER JOIN t2
ON t1.id = t2.id
;
細(xì)心的你看出差異了嗎?重點(diǎn)是在WHERE t1.ds = '20220120'過濾條件上,最上面的查詢方式是ON t1.ds = '20220120',所以按照LEFT JOIN的語義,如果沒有過濾條件,那么左表的數(shù)據(jù)應(yīng)該全部返回,右表匹配不上則補(bǔ)null。
執(zhí)行計(jì)劃
我們先來看看沒有謂詞下推的查詢SQL的執(zhí)行計(jì)劃
正常LEFT JOIN
查看執(zhí)行計(jì)劃
EXPLAIN
SELECT *
FROM t1
LEFT JOIN t2
ON t1.id = t2.id
AND t1.ds = '20220120'
;
執(zhí)行計(jì)劃結(jié)果
hive> EXPLAIN
> SELECT *
> FROM t1
> LEFT JOIN t2
> ON t1.id = t2.id
> AND t1.ds = '20220120'
> ;
OK
STAGE DEPENDENCIES:
Stage-4 is a root stage
Stage-3 depends on stages: Stage-4
Stage-0 depends on stages: Stage-3
STAGE PLANS:
Stage: Stage-4
Map Reduce Local Work
Alias -> Map Local Tables:
$hdt$_1:t2
Fetch Operator
limit: -1
Alias -> Map Local Operator Tree:
$hdt$_1:t2
TableScan
alias: t2
Statistics: Num rows: 1 Data size: 5 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: id (type: int), value (type: int), ds (type: string)
outputColumnNames: _col0, _col1, _col2
Statistics: Num rows: 1 Data size: 5 Basic stats: COMPLETE Column stats: NONE
HashTable Sink Operator
filter predicates:
0 {(_col2 = '20220120')}
1
keys:
0 _col0 (type: int)
1 _col0 (type: int)
Stage: Stage-3
Map Reduce
Map Operator Tree:
TableScan
alias: t1
Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: id (type: int), value (type: int), ds (type: string)
outputColumnNames: _col0, _col1, _col2
Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE
Map Join Operator
condition map:
Left Outer Join0 to 1
filter predicates:
0 {(_col2 = '20220120')}
1
keys:
0 _col0 (type: int)
1 _col0 (type: int)
outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
Statistics: Num rows: 3 Data size: 19 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 3 Data size: 19 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Local Work:
Map Reduce Local Work
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
從上面的執(zhí)行計(jì)劃可以看出:總共有3個(gè)stage,
STAGE DEPENDENCIES: Stage-4 is a root stage Stage-3 depends on stages: Stage-4 Stage-0 depends on stages: Stage-3
其中stage4是map任務(wù)讀取t2表,將t2表加載成HashTable,用于map端join。t2表數(shù)據(jù)量為1行。
Select Operator expressions: id (type: int), value (type: int), ds (type: string) outputColumnNames: _col0, _col1, _col2 Statistics: Num rows: 1 Data size: 5 Basic stats: COMPLETE Column stats: NONE HashTable Sink Operator
stage3是map任務(wù)讀取t1表數(shù)據(jù)并執(zhí)行map端join。t1表數(shù)量為3行,可見并沒有進(jìn)行過濾操作。
Map Operator Tree:
TableScan
alias: t1
Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: id (type: int), value (type: int), ds (type: string)
outputColumnNames: _col0, _col1, _col2
Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE
Stage-0進(jìn)行結(jié)果輸出,最終并未執(zhí)行過濾操作。
Stage: Stage-0 Fetch Operator limit: -1 Processor Tree: ListSink
謂詞下推的LEFT JOIN
- 查看執(zhí)行計(jì)劃
EXPLAIN
SELECT *
FROM t1
LEFT OUTER JOIN t2
ON t1.id = t2.id
WHERE t1.ds = '20220120'
;
執(zhí)行計(jì)劃結(jié)果
STAGE DEPENDENCIES:
Stage-4 is a root stage
Stage-3 depends on stages: Stage-4
Stage-0 depends on stages: Stage-3
STAGE PLANS:
Stage: Stage-4
Map Reduce Local Work
Alias -> Map Local Tables:
$hdt$_1:t2
Fetch Operator
limit: -1
Alias -> Map Local Operator Tree:
$hdt$_1:t2
TableScan
alias: t2
Statistics: Num rows: 1 Data size: 5 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: id (type: int), value (type: int), ds (type: string)
outputColumnNames: _col0, _col1, _col2
Statistics: Num rows: 1 Data size: 5 Basic stats: COMPLETE Column stats: NONE
HashTable Sink Operator
keys:
0 _col0 (type: int)
1 _col0 (type: int)
Stage: Stage-3
Map Reduce
Map Operator Tree:
TableScan
alias: t1
Statistics: Num rows: 1 Data size: 6 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: id (type: int), value (type: int)
outputColumnNames: _col0, _col1
Statistics: Num rows: 1 Data size: 6 Basic stats: COMPLETE Column stats: NONE
Map Join Operator
condition map:
Left Outer Join0 to 1
keys:
0 _col0 (type: int)
1 _col0 (type: int)
outputColumnNames: _col0, _col1, _col3, _col4, _col5
Statistics: Num rows: 1 Data size: 6 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: _col0 (type: int), _col1 (type: int), '20220120' (type: string), _col3 (type: int), _col4 (type: int), _col5 (type: string)
outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
Statistics: Num rows: 1 Data size: 6 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 1 Data size: 6 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Local Work:
Map Reduce Local Work
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
從上面的執(zhí)行計(jì)劃可以看出:總共有3個(gè)stage,
STAGE DEPENDENCIES: Stage-4 is a root stage Stage-3 depends on stages: Stage-4 Stage-0 depends on stages: Stage-3
其中stage4是map任務(wù)讀取t2表,將t2表加載成HashTable,用于map端join。t2表數(shù)據(jù)量為1行。
TableScan
alias: t2
Statistics: Num rows: 1 Data size: 5 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: id (type: int), value (type: int), ds (type: string)
outputColumnNames: _col0, _col1, _col2
Statistics: Num rows: 1 Data size: 5 Basic stats: COMPLETE Column stats: NONE
HashTable Sink Operator
stage3是map任務(wù)讀取t1表數(shù)據(jù)并執(zhí)行map端join。t1表數(shù)量為1行,執(zhí)行了過濾操作。
TableScan
alias: t1
Statistics: Num rows: 1 Data size: 6 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: id (type: int), value (type: int)
outputColumnNames: _col0, _col1
Statistics: Num rows: 1 Data size: 6 Basic stats: COMPLETE Column stats: NONE
Map Join Operator
condition map:
Left Outer Join0 to 1
keys:
0 _col0 (type: int)
1 _col0 (type: int)
outputColumnNames: _col0, _col1, _col3, _col4, _col5
Statistics: Num rows: 1 Data size: 6 Basic stats: COMPLETE Column stats: NONE
Stage-0進(jìn)行結(jié)果輸出,最終并未執(zhí)行過操作。
Stage: Stage-0 Fetch Operator limit: -1 Processor Tree: ListSink
總結(jié)本文主要結(jié)合具體的使用示例,對HiveSQL的LEFT JOIN操作進(jìn)行了詳細(xì)解釋。主要包括兩種比較常見的LEFT JOIN方式,一種是正常的LEFT JOIN,也就是只包含ON條件,這種情況沒有過濾操作,即左表的數(shù)據(jù)會(huì)全部返回。另一種方式是有謂詞下推,即關(guān)聯(lián)的時(shí)候使用了WHERE條件,這個(gè)時(shí)候會(huì)會(huì)對數(shù)據(jù)進(jìn)行過濾。所以在寫SQL的時(shí)候,尤其需要注意這些細(xì)節(jié)問題,以免出現(xiàn)意想不到的錯(cuò)誤結(jié)果。
文章題目:關(guān)于HiveSQL常見的LeftJoin誤區(qū),你知道嗎
標(biāo)題網(wǎng)址:http://m.fisionsoft.com.cn/article/cohpspd.html


咨詢
建站咨詢
