MySQL subquery much faster than join
I have the following two queries. Both return the same result and row count:
select *
from (
    select unix_timestamp(network_time) * 1000 as epoch_network_datetime,
        hbrl.business_rule_id,
        display_advertiser_id,
        hbrl.campaign_id,
        truncate(sum(coalesce(hbrl.ad_spend_network, 0))/100000.0, 2) as demand_ad_spend_network,
        sum(coalesce(hbrl.ad_view, 0)) as demand_ad_view,
        sum(coalesce(hbrl.ad_click, 0)) as demand_ad_click,
        truncate(coalesce(case when sum(hbrl.ad_view) = 0 then 0 else 100*sum(hbrl.ad_click)/sum(hbrl.ad_view) end, 0), 2) as ctr_percent,
        truncate(coalesce(case when sum(hbrl.ad_view) = 0 then 0 else sum(hbrl.ad_spend_network)/100.0/sum(hbrl.ad_view) end, 0), 2) as ecpm,
        truncate(coalesce(case when sum(hbrl.ad_click) = 0 then 0 else sum(hbrl.ad_spend_network)/100000.0/sum(hbrl.ad_click) end, 0), 2) as ecpc
    from hourly_business_rule_level hbrl
    where (publisher_network_id = 31534)
        and network_time between str_to_date('2017-08-13 17:00:00.000000', '%Y-%m-%d %H:%i:%s.%f')
            and str_to_date('2017-08-14 16:59:59.999000', '%Y-%m-%d %H:%i:%s.%f')
        and (network_time is not null and display_advertiser_id > 0)
    group by network_time, hbrl.campaign_id, hbrl.business_rule_id
    having demand_ad_spend_network > 0
        or demand_ad_view > 0
        or demand_ad_click > 0
        or ctr_percent > 0
        or ecpm > 0
        or ecpc > 0
    order by epoch_network_datetime
) atb
left join dim_demand demand
    on atb.display_advertiser_id = demand.advertiser_dsp_id
    and atb.campaign_id = demand.campaign_id
    and atb.business_rule_id = demand.business_rule_id;
I ran EXPLAIN EXTENDED on it and got these results:
+----+-------------+----------------------------+------+-------------------------------------------------------------------------------+---------+---------+-----------------+---------+----------+----------------------------------------------+
| id | select_type | table                      | type | possible_keys                                                                 | key     | key_len | ref             | rows    | filtered | Extra                                        |
+----+-------------+----------------------------+------+-------------------------------------------------------------------------------+---------+---------+-----------------+---------+----------+----------------------------------------------+
|  1 | PRIMARY     | <derived2>                 | ALL  | NULL                                                                          | NULL    | NULL    | NULL            | 1451739 |   100.00 | NULL                                         |
|  1 | PRIMARY     | demand                     | ref  | PRIMARY,join_index                                                            | PRIMARY | 4       | atb.campaign_id |       1 |   100.00 | Using where                                  |
|  2 | DERIVED     | hourly_business_rule_level | ALL  | _hourly_business_rule_level_supply_idx,_hourly_business_rule_level_demand_idx | NULL    | NULL    | NULL            | 1494447 |    97.14 | Using where; Using temporary; Using filesort |
+----+-------------+----------------------------+------+-------------------------------------------------------------------------------+---------+---------+-----------------+---------+----------+----------------------------------------------+
The other query is:
select unix_timestamp(network_time) * 1000 as epoch_network_datetime,
    hbrl.business_rule_id,
    display_advertiser_id,
    hbrl.campaign_id,
    truncate(sum(coalesce(hbrl.ad_spend_network, 0))/100000.0, 2) as demand_ad_spend_network,
    sum(coalesce(hbrl.ad_view, 0)) as demand_ad_view,
    sum(coalesce(hbrl.ad_click, 0)) as demand_ad_click,
    truncate(coalesce(case when sum(hbrl.ad_view) = 0 then 0 else 100*sum(hbrl.ad_click)/sum(hbrl.ad_view) end, 0), 2) as ctr_percent,
    truncate(coalesce(case when sum(hbrl.ad_view) = 0 then 0 else sum(hbrl.ad_spend_network)/100.0/sum(hbrl.ad_view) end, 0), 2) as ecpm,
    truncate(coalesce(case when sum(hbrl.ad_click) = 0 then 0 else sum(hbrl.ad_spend_network)/100000.0/sum(hbrl.ad_click) end, 0), 2) as ecpc
from hourly_business_rule_level hbrl
join dim_demand demand
    on hbrl.display_advertiser_id = demand.advertiser_dsp_id
    and hbrl.campaign_id = demand.campaign_id
    and hbrl.business_rule_id = demand.business_rule_id
where (publisher_network_id = 31534)
    and network_time between str_to_date('2017-08-13 17:00:00.000000', '%Y-%m-%d %H:%i:%s.%f')
        and str_to_date('2017-08-14 16:59:59.999000', '%Y-%m-%d %H:%i:%s.%f')
    and (network_time is not null and display_advertiser_id > 0)
group by network_time, hbrl.campaign_id, hbrl.business_rule_id
having demand_ad_spend_network > 0
    or demand_ad_view > 0
    or demand_ad_click > 0
    or ctr_percent > 0
    or ecpm > 0
    or ecpc > 0
order by epoch_network_datetime;
And these are the results for the second query:
+----+-------------+----------------------------+------+-------------------------------------------------------------------------------+---------+---------+---------------------------------------------------------------+---------+----------+----------------------------------------------+
| id | select_type | table                      | type | possible_keys                                                                 | key     | key_len | ref                                                           | rows    | filtered | Extra                                        |
+----+-------------+----------------------------+------+-------------------------------------------------------------------------------+---------+---------+---------------------------------------------------------------+---------+----------+----------------------------------------------+
|  1 | SIMPLE      | hourly_business_rule_level | ALL  | _hourly_business_rule_level_supply_idx,_hourly_business_rule_level_demand_idx | NULL    | NULL    | NULL                                                          | 1494447 |    97.14 | Using where; Using temporary; Using filesort |
|  1 | SIMPLE      | demand                     | ref  | PRIMARY,join_index                                                            | PRIMARY | 4       | my6sense_datawarehouse.hourly_business_rule_level.campaign_id |       1 |   100.00 | Using where; Using index                     |
+----+-------------+----------------------------+------+-------------------------------------------------------------------------------+---------+---------+---------------------------------------------------------------+---------+----------+----------------------------------------------+
The first one takes 2 seconds, while the second one takes about 2 minutes!
Why is the second query taking so long? What am I missing here?
Thanks.
One possible reason is the number of rows that have to be joined with the second table.
The GROUP BY and HAVING clauses limit the number of rows returned by the subquery. Only those rows are then used for the join.
Without the subquery there is no clause limiting the number of rows before the join. The join is done before the GROUP BY and HAVING clauses are processed. Depending on the group size and the selectivity of the HAVING conditions, there can be many more rows that need to be joined.
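In schematic form the two shapes look like this (a minimal sketch with placeholder names t1, t2, grp, val and name, not your actual schema):

    -- aggregate first, join later: only the grouped rows reach the join
    select x.grp, x.total, t2.name
    from (
        select grp, sum(val) as total
        from t1
        group by grp
        having total > 0
    ) x
    left join t2 on t2.grp = x.grp;

    -- join first, aggregate later: every filtered row of t1 is joined
    -- before GROUP BY and HAVING can reduce the row count
    select t1.grp, t2.name, sum(t1.val) as total
    from t1
    left join t2 on t2.grp = t1.grp
    group by t1.grp, t2.name
    having total > 0;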
Consider the following simplified example:
We have a table users with 1000 entries and the columns id and email.
create table users(
    id smallint auto_increment primary key,
    email varchar(50) unique
);
Then we have a (huge) log table user_actions with 1,000,000 entries and the columns id, user_id, timestamp and action.
create table user_actions(
    id mediumint auto_increment primary key,
    user_id smallint not null,
    timestamp timestamp,
    action varchar(50),
    index (timestamp, user_id)
);
The task is to find users who have at least 900 entries in the log table since 2017-02-01.
The subquery solution:
select a.user_id, a.cnt, u.email
from (
    select a.user_id, count(*) as cnt
    from user_actions a
    where a.timestamp >= '2017-02-01 00:00:00'
    group by a.user_id
    having cnt >= 900
) a
left join users u on u.id = a.user_id;
The subquery returns 135 rows (users). Only those rows are joined with the users table. The subquery runs in 0.375 seconds. The time needed for the join is close to zero, so the full query also runs in 0.375 seconds.
The solution without a subquery:
select a.user_id, count(*) as cnt, u.email
from user_actions a
left join users u on u.id = a.user_id
where a.timestamp >= '2017-02-01 00:00:00'
group by a.user_id
having cnt >= 900;
The WHERE condition filters the table down to 866,081 rows. The join has to be done for all those 866K rows. After the join the GROUP BY and HAVING clauses are processed, which limit the result to 135 rows. This query needs 0.815 seconds.
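You can verify those two row counts yourself (a sketch against the example schema above):

    -- rows entering the join in the version without a subquery
    select count(*)
    from user_actions
    where timestamp >= '2017-02-01 00:00:00';

    -- rows entering the join in the subquery version
    select count(*)
    from (
        select user_id, count(*) as cnt
        from user_actions
        where timestamp >= '2017-02-01 00:00:00'
        group by user_id
        having cnt >= 900
    ) t;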
So as you can see, a subquery can improve the performance.
But let's make things worse and drop the primary key in the users table. This way we have no index that can be used for the join. Now the first query runs in 0.455 seconds, but the second query needs 40 seconds, about 100 times slower.
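For reference, a sketch of how the primary key can be dropped for this test. MySQL requires an auto_increment column to be part of a key, so that attribute has to be removed first:

    alter table users
        modify id smallint not null,  -- remove the auto_increment attribute
        drop primary key;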
Notes
It's difficult to say if the same applies to your case. The reasons are:
- Your queries are quite complex and far away from being an MCVE (Minimal, Complete, and Verifiable Example).
- I don't see anything being selected from the demand table, so it's unclear why you are joining it at all.
- You use a LEFT JOIN in one query and an INNER JOIN in the other one.
- The relation between the two tables is unclear.
- There is no information about indexes. You should provide the CREATE statements (SHOW CREATE TABLE table_name), as shown below.
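For example (using the table names from your queries):

    show create table hourly_business_rule_level;
    show create table dim_demand;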
Test setup
drop table if exists users;
create table users(
    id smallint auto_increment primary key,
    email varchar(50) unique
)
select seq as id
     , rand(1) as email
from seq_1_to_1000;

drop table if exists user_actions;
create table user_actions(
    id mediumint auto_increment primary key,
    user_id smallint not null,
    timestamp timestamp,
    action varchar(50),
    index (timestamp, user_id)
)
select seq as id
     , floor(rand(2)*1000)+1 as user_id
#    , '2017-01-01 00:00:00' + interval seq*20 second as timestamp
     , from_unixtime(unix_timestamp('2017-01-01 00:00:00') + seq*20) as timestamp
     , rand(3) as action
from seq_1_to_1000000;
I'm using MariaDB 10.0.19 with the Sequence plugin.
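The seq_1_to_1000 and seq_1_to_1000000 tables used above come from that plugin; each behaves like a read-only table containing the consecutive integers in its name, e.g.:

    select seq from seq_1_to_5;  -- returns the rows 1, 2, 3, 4, 5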