mm, page_alloc: do not wake kswapd with zone lock held
authorMel Gorman <mgorman@techsingularity.net>
Tue, 8 Jan 2019 23:23:39 +0000 (15:23 -0800)
committerLinus Torvalds <torvalds@linux-foundation.org>
Wed, 9 Jan 2019 01:15:11 +0000 (17:15 -0800)
syzbot reported the following regression in the latest merge window and
it was confirmed by Qian Cai that a similar bug was visible from a
different context.

  ======================================================
  WARNING: possible circular locking dependency detected
  4.20.0+ #297 Not tainted
  ------------------------------------------------------
  syz-executor0/8529 is trying to acquire lock:
  000000005e7fb829 (&pgdat->kswapd_wait){....}, at:
  __wake_up_common_lock+0x19e/0x330 kernel/sched/wait.c:120

  but task is already holding lock:
  000000009bb7bae0 (&(&zone->lock)->rlock){-.-.}, at: spin_lock
  include/linux/spinlock.h:329 [inline]
  000000009bb7bae0 (&(&zone->lock)->rlock){-.-.}, at: rmqueue_bulk
  mm/page_alloc.c:2548 [inline]
  000000009bb7bae0 (&(&zone->lock)->rlock){-.-.}, at: __rmqueue_pcplist
  mm/page_alloc.c:3021 [inline]
  000000009bb7bae0 (&(&zone->lock)->rlock){-.-.}, at: rmqueue_pcplist
  mm/page_alloc.c:3050 [inline]
  000000009bb7bae0 (&(&zone->lock)->rlock){-.-.}, at: rmqueue
  mm/page_alloc.c:3072 [inline]
  000000009bb7bae0 (&(&zone->lock)->rlock){-.-.}, at:
  get_page_from_freelist+0x1bae/0x52a0 mm/page_alloc.c:3491

It appears to be a false positive in that the only way the lock ordering
should be inverted is if kswapd is waking itself and the wakeup
allocates debugging objects which should already be allocated if it's
kswapd doing the waking.  Nevertheless, the possibility exists and so
it's best to avoid the problem.

This patch flags a zone as needing a kswapd using the, surprisingly,
unused zone flag field.  The flag is read without the lock held to do
the wakeup.  It's possible that the flag setting context is not the same
as the flag clearing context or for small races to occur.  However, each
race possibility is harmless and there is no visible degredation in
fragmentation treatment.

While zone->flag could have continued to be unused, there is potential
for moving some existing fields into the flags field instead.
Particularly read-mostly ones like zone->initialized and
zone->contiguous.

Link: http://lkml.kernel.org/r/20190103225712.GJ31517@techsingularity.net
Fixes: 1c30844d2dfe ("mm: reclaim small amounts of memory when an external fragmentation event occurs")
Reported-by: syzbot+93d94a001cfbce9e60e1@syzkaller.appspotmail.com
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Tested-by: Qian Cai <cai@lca.pw>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
include/linux/mmzone.h
mm/page_alloc.c

index cc4a507d7ca45bb3e88cce09dad4d3ed8d4e40f7..842f9189537bb04d51285c3b237d9263273fc559 100644 (file)
@@ -520,6 +520,12 @@ enum pgdat_flags {
        PGDAT_RECLAIM_LOCKED,           /* prevents concurrent reclaim */
 };
 
+enum zone_flags {
+       ZONE_BOOSTED_WATERMARK,         /* zone recently boosted watermarks.
+                                        * Cleared when kswapd is woken.
+                                        */
+};
+
 static inline unsigned long zone_managed_pages(struct zone *zone)
 {
        return (unsigned long)atomic_long_read(&zone->managed_pages);
index cde5dac6229a7811541ccedab88d408f609df743..d295c9bc01a898e94a16c077f8834a1492fca9cb 100644 (file)
@@ -2214,7 +2214,7 @@ static void steal_suitable_fallback(struct zone *zone, struct page *page,
         */
        boost_watermark(zone);
        if (alloc_flags & ALLOC_KSWAPD)
-               wakeup_kswapd(zone, 0, 0, zone_idx(zone));
+               set_bit(ZONE_BOOSTED_WATERMARK, &zone->flags);
 
        /* We are not allowed to try stealing from the whole block */
        if (!whole_block)
@@ -3102,6 +3102,12 @@ struct page *rmqueue(struct zone *preferred_zone,
        local_irq_restore(flags);
 
 out:
+       /* Separate test+clear to avoid unnecessary atomics */
+       if (test_bit(ZONE_BOOSTED_WATERMARK, &zone->flags)) {
+               clear_bit(ZONE_BOOSTED_WATERMARK, &zone->flags);
+               wakeup_kswapd(zone, 0, 0, zone_idx(zone));
+       }
+
        VM_BUG_ON_PAGE(page && bad_range(zone, page), page);
        return page;