Question 13 · Section 1

When to Use HAVING?

Proper separation of filters is a sign of good SQL:

Language versions: English Russian Ukrainian

🟢 Junior Level

HAVING is used when you need to filter results after grouping by aggregate functions (COUNT, SUM, AVG, etc.).

Simple rule:

  • Filter by regular fields → WHERE
  • Filter by aggregation results → HAVING

Example:

-- Find users with more than 5 orders
SELECT user_id, COUNT(*) as orders_count
FROM orders
GROUP BY user_id
HAVING COUNT(*) > 5;  -- Filter AFTER grouping

-- Find departments with average salary > 100,000
SELECT department, AVG(salary) as avg_sal
FROM employees
GROUP BY department
HAVING AVG(salary) > 100000;

Common scenarios:

  • Finding duplicates (HAVING COUNT(*) > 1)
  • Finding groups exceeding a threshold
  • Filtering by average/total values

🟡 Middle Level

Architectural Pattern: WHERE + HAVING

Proper separation of filters is a sign of good SQL:

SELECT customer_id, AVG(amount) as avg_amount
FROM orders
WHERE status = 'delivered'           -- 1. Filter BEFORE grouping (index!)
  AND created_at > '2024-01-01'      -- 2. Reduce input data
GROUP BY customer_id
HAVING COUNT(*) > 10                 -- 3. Filter AFTER grouping
   AND AVG(amount) > 5000;           -- 4. Business logic

Why this matters:

  • WHERE reduces the number of rows BEFORE GROUP BY → less work
  • HAVING filters already grouped data

Main Scenarios

1. Aggregated thresholds:

-- Sellers with revenue > 1M
SELECT seller_id, SUM(price) as revenue
FROM sales
GROUP BY seller_id
HAVING SUM(price) > 1000000;

2. Finding duplicates:

-- Duplicate emails
SELECT email, COUNT(*) as cnt
FROM users
GROUP BY email
HAVING COUNT(*) > 1;

3. Checking group properties:

-- Projects with issues (more than 2 blocked tasks)
SELECT project_id
FROM tasks
GROUP BY project_id
HAVING COUNT(*) > 5
   AND COUNT(*) FILTER (WHERE status = 'blocked') > 2;

4. HAVING without GROUP BY:

-- Check condition on the entire table
SELECT 'Warning: High avg salary'
FROM employees
HAVING AVG(salary) > 200000;
-- Returns 1 row if condition is true, otherwise 0 rows

Common Mistakes

  1. Filtering non-aggregate field in HAVING
    -- ❌ BAD: city is not an aggregate
    SELECT city, COUNT(*) FROM users
    GROUP BY city
    HAVING city = 'Moscow';
    
    -- ✅ GOOD: in WHERE
    SELECT city, COUNT(*) FROM users
    WHERE city = 'Moscow'
    GROUP BY city;
    
  2. Redundancy in ORM
    -- ORM often generates:
    SELECT city, COUNT(*) FROM users GROUP BY city HAVING city = 'Moscow';
    -- This is an anti-pattern! City should be in WHERE
    -- Allows using an index and reducing data for aggregation
    

WHERE vs HAVING Comparison

Scenario WHERE HAVING
status = 'ACTIVE' ❌ (if no GROUP BY)
COUNT(*) > 10 ❌ Cannot
SUM(amount) > 1000 ❌ Cannot
city = 'Moscow' ⚠️ Possible, but bad

🔴 Senior Level

When NOT to Use HAVING

  1. For non-aggregate filtersHAVING city = 'Moscow' groups ALL cities, then filters. Move to WHERE
  2. For filtering by window functions — window functions are computed AFTER HAVING. Use a subquery/CTE
  3. When there are too many groups — millions of groups → spill to disk. Pre-filter in WHERE or use Materialized Views

The Huge Number of Groups Problem

If after WHERE there are 100M rows and 10M groups:

-- Problem: database must allocate memory for ALL 10M groups
SELECT user_id, SUM(amount)
FROM orders
WHERE created_at > '2023-01-01'  -- 100M rows
GROUP BY user_id                  -- 10M groups
HAVING SUM(amount) > 10000;       -- Filter after

What happens:

  1. HashAggregate tries to place 10M groups in work_mem (work_mem — memory for one operation, default 4MB)
  2. If it doesn’t fit → spill to disk (temp_files — temporary files on disk)
  3. Performance drops 10-100x (disk operations are millions of times slower than RAM)

Solutions:

-- 1. Pre-filtering via CTE
WITH filtered_orders AS (
    SELECT user_id, amount
    FROM orders
    WHERE created_at > '2023-01-01'
      AND amount > 100  -- Additional filter
)
SELECT user_id, SUM(amount)
FROM filtered_orders
GROUP BY user_id
HAVING SUM(amount) > 10000;

-- 2. Materialized view for frequent queries
CREATE MATERIALIZED VIEW mv_user_stats AS
SELECT user_id, SUM(amount) as total, COUNT(*) as cnt
FROM orders
GROUP BY user_id;

-- Then a fast query
SELECT user_id FROM mv_user_stats WHERE total > 10000;

FILTER in HAVING for Declarativeness

-- Elegant checking of complex conditions
SELECT project_id
FROM tasks
GROUP BY project_id
HAVING
    COUNT(*) > 10                                      -- Minimum 10 tasks
    AND COUNT(*) FILTER (WHERE status = 'done') >= 8   -- 80% completed
    AND COUNT(*) FILTER (WHERE status = 'blocked') = 0 -- No blockers
    AND MAX(due_date) > CURRENT_DATE;                  -- Deadline hasn't passed

Redundancy and the Optimizer

PostgreSQL tries to do Filter Push-down — move conditions from HAVING to WHERE:

-- Query
SELECT city, COUNT(*) FROM users GROUP BY city HAVING city = 'Moscow';

-- Optimizer transforms into:
SELECT city, COUNT(*) FROM users WHERE city = 'Moscow' GROUP BY city;

BUT: This doesn’t always work!

  • Complex expressions with subqueries
  • Inside Views
  • CTEs with MATERIALIZED

Golden rule: Always write non-aggregate filters in WHERE explicitly.

SQL Standards and Portability

-- ⚠️ Works in PostgreSQL, does NOT work in SQL Server/Oracle
SELECT city, COUNT(*) as cnt FROM users
GROUP BY city
HAVING cnt > 10;  -- cnt is an alias from SELECT

-- ✅ Portable across all DBMS
SELECT city, COUNT(*) FROM users
GROUP BY city
HAVING COUNT(*) > 10;

Edge Cases

  1. HAVING with a subquery
    -- Users whose order total is above average
    SELECT user_id, SUM(amount) as total
    FROM orders
    GROUP BY user_id
    HAVING SUM(amount) > (SELECT AVG(total) FROM (
        SELECT SUM(amount) as total FROM orders GROUP BY user_id
    ) sub);
    
  2. HAVING with window functions — NOT ALLOWED
    -- ❌ ERROR: window functions are computed AFTER HAVING
    SELECT user_id, amount,
           ROW_NUMBER() OVER(PARTITION BY user_id ORDER BY created_at DESC) as rn
    FROM orders
    GROUP BY user_id, amount, created_at
    HAVING rn = 1;  -- Error!
    
    -- ✅ Solution: CTE
    WITH ranked AS (
        SELECT user_id, amount,
               ROW_NUMBER() OVER(PARTITION BY user_id ORDER BY created_at DESC) as rn
        FROM orders
    )
    SELECT user_id, amount FROM ranked WHERE rn = 1;
    
  3. HAVING with multiple GROUP BY
    -- ROLLUP + HAVING
    SELECT department, city, COUNT(*)
    FROM employees
    GROUP BY ROLLUP(department, city)
    HAVING COUNT(*) > 5;
    -- Filters ALL levels (detailed and total rows)
    

Performance Impact

-- Monitoring work_mem pressure
EXPLAIN (ANALYZE, BUFFERS)
SELECT status, COUNT(*) FROM orders GROUP BY status HAVING COUNT(*) > 100;

-- Look for:
-- HashAggregate (memory usage: X kB)  → if X is close to work_mem → spill risk
-- HashAggregate (disk: Y kB)          → spill happened! Increase work_mem

Production Experience

Real scenario:

  • E-commerce: sales report (500M rows)
  • ORM generated HAVING status = 'COMPLETED' instead of WHERE
  • HashAggregate: 500M rows → spill 30 GB to disk
  • Time: 15 minutes
  • Solution: WHERE status = 'COMPLETED' before GROUP BY
  • Result: 10 seconds (90x speedup)
  • Why: ORM generated HAVING status = 'COMPLETED' — the database first grouped all 500M rows (including cancelled, returned, etc.), creating millions of groups, and only then filtered. Moving to WHERE reduced input data for GROUP BY 10x + allowed using an index on status.

Monitoring

-- 1. Find queries with HAVING in pg_stat_statements
SELECT query, mean_exec_time, calls, shared_blks_read
FROM pg_stat_statements
WHERE query LIKE '%HAVING%'
ORDER BY mean_exec_time DESC
LIMIT 10;

-- 2. Check for spill
EXPLAIN (ANALYZE, BUFFERS)
SELECT ... GROUP BY ... HAVING ...;
-- "Disk: X kB" → problem!

-- 3. Check Filter Push-down
EXPLAIN (VERBOSE)
SELECT city, COUNT(*) FROM users GROUP BY city HAVING city = 'Moscow';
-- In the plan you'll see if the filter was pushed

Best Practices

  1. WHERE for non-aggregate filters, HAVING for aggregate filters
  2. FILTER (WHERE …) inside HAVING for complex conditions
  3. Avoid aliases from SELECT in HAVING (not portable)
  4. Monitor work_mem for large GROUP BY
  5. CTE/Materialized Views for pre-filtering
  6. Check the plan for spill (Disk usage)
  7. Remember: HAVING doesn’t save from memory overflow during GROUP BY

Summary for Senior

  • HAVING is the last line of filtering after GROUP BY
  • Doesn’t save resources during aggregation — all groups are already created
  • Filter Push-down (PG 12+) doesn’t always work — write WHERE explicitly
  • work_mem pressure: HAVING filters OUTPUT, not INPUT
  • Window functions are not available in HAVING → CTE/subquery
  • FILTER for declarativeness of complex conditions
  • Portability: don’t use aliases from SELECT in HAVING

🎯 Interview Cheat Sheet

Must know:

  • HAVING filters the RESULT of GROUP BY, not input data
  • WHERE reduces data BEFORE GROUP BY → less work_mem, faster
  • Filter Push-down (PG 12+): automatic move from HAVING to WHERE, but NOT always
  • Huge number of groups: 10M groups → spill to disk → CTE/Materialized View
  • FILTER in HAVING: COUNT(*) FILTER (WHERE status = 'blocked') > 2
  • Window functions are computed AFTER HAVING → subquery/CTE
  • Aliases from SELECT in HAVING: NOT portable (SQL Server/Oracle don’t support)
  • HAVING with subquery: possible, but slow

Frequent follow-up questions:

  • “Why does ORM generate HAVING instead of WHERE?” → ORM anti-pattern, doesn’t optimize
  • “How to avoid spill with 10M groups?” → Filter in WHERE, work_mem ↑, Materialized View
  • “Can FILTER be used in HAVING?” → Yes, elegant for complex conditions
  • “When is HAVING with subquery justified?” → Comparing group to overall average

Red flags (DO NOT say):

  • ❌ “HAVING reduces data for GROUP BY” (no, it filters OUTPUT)
  • ❌ “Can rely on Filter Push-down” (doesn’t always work!)
  • ❌ “Window functions can go in HAVING” (computed AFTER!)

Related topics:

  • [[What is the difference between WHERE and HAVING]] → execution order
  • [[What does GROUP BY do]] → huge number of groups
  • [[What are Window Functions]] → why they can’t go in HAVING