Cleaning Up Bot Registrations in WooCommerce

…or Optimizing Slow Sub-Queries in WordPress

Bot registrations are a nuisance in many WooCommerce sites. Cleaning them up seems to be a trivial task: just delete all users without a placed order from a month ago and backwards.

select * from wp_users where user_registered < "2020-07-01 00:00:00";                                                                                                         
47665 rows in set (0.06 sec)

select meta_value from wp_postmeta where meta_value = '_customer_user';                                                                                                       
51253 rows in set (0.44 sec)

Okay, so we almost 50 thousand customers and a bit over 50 thousand orders.

The query to delete all the users that have no order is seemingly a simple one:

delete from wp_users where user_registered < "2020-07-01 00:00:00"
and id not in (select meta_value from wp_postmeta where meta_key = '_customer_user' group by meta_value);

Great. Yet there's a huge issue: Query OK, (59 min 7.22 sec)

Ooomph! This won't effing do!

Why the heck would it be so slow?

+------+--------------------+-------------+------+---------------+----------+---------+-------+-------+----------+-------------+
| id   | select_type        | table       | type | possible_keys | key      | key_len | ref   | rows  | filtered | Extra       |
+------+--------------------+-------------+------+---------------+----------+---------+-------+-------+----------+-------------+
|    1 | PRIMARY            | wp_users    | ALL  | NULL          | NULL     | NULL    | NULL  | 53102 |   100.00 | Using where |
|    2 | DEPENDENT SUBQUERY | wp_postmeta | ref  | meta_key      | meta_key | 767     | const | 87152 |   100.00 | Using where |
+------+--------------------+-------------+------+---------------+----------+---------+-------+-------+----------+-------------+

Obviously because the subquery is run every single time for each user. So 0.1 sec * 50k = 5 thousand seconds pretty much. So no surprise why we get a query that is running for an hour or so.

What can we do? Caching.

Unfortunately built-in subquery caching does not work in the above case. DEPENDENT SUBQUERY is reevaluated every single time.

This is because the query optimizer transforms a NOT IN construct into a NOT EXISTS ... WHERE construct. The "beauty" of antijoins.

show status where Variable_name like 'Subquery%';
+---------------------+-------+
| Variable_name       | Value |
+---------------------+-------+
| Subquery_cache_hit  | 0     |
| Subquery_cache_miss | 1927  |
+---------------------+-------+

The query itself is pretty lightweight when selecting:

select id from wp_users where id not in
(select id from wp_users left join wp_postmeta on meta_value=id where user_registered < "2020-07-01 00:00:00" and meta_key = '_customer_user'
union select id from wp_users where user_registered >= "2020-07-01 00:00:00");

But to actually run a delete query you need to alias the subquery. Confusing.

delete from wp_users where id not in
(select id from (select id from wp_users u left join wp_postmeta on meta_value=u.id where user_registered < "2020-07-01 00:00:00" and meta_key = '_customer_user'
union select id from wp_users where user_registered >= "2020-07-01 00:00:00"
union select user_id from wp_usermeta where meta_key = 'wp_capabilities' and meta_value like '%admin%'
) as u);

Done.

Query OK, 46140 rows affected (1.73 sec)

Bye.