md-cluster: fix locking when node joins cluster during message broadcast
author Guoqing Jiang <gqjiang@suse.com>
Mon, 2 May 2016 15:33:12 +0000 (11:33 -0400)
committer Shaohua Li <shli@fb.com>
Wed, 4 May 2016 19:39:35 +0000 (12:39 -0700)
If a node joins the cluster while a message broadcast
is under way, a lock issue could happen as follows.

Consider a two-node cluster in which node A has called
__sendmsg but has not yet up-converted its lock on ack
from CR to EX, and node B has already released its CR
lock on ack. If a new node C joins the cluster at this
point, it has not received the message A sent, and it
can take CR on ack before A up-converts CR to EX.

So a node joining the cluster should first take an EX
lock on "token" to ensure that no broadcast is ongoing,
and release it only after it holds CR on ack.
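
The ordering described above can be illustrated with a small user-space
model. This is only a sketch under stated assumptions, not the kernel code:
the names struct lockres, try_lock, unlock and join_cluster are hypothetical
stand-ins, the model is non-blocking (a request either succeeds or fails
immediately), and it assumes the broadcaster holds EX on "token" for the
whole broadcast. Only the NL, CR and EX modes are modelled.

```c
#include <stdbool.h>

/* Toy model, NOT the kernel DLM API: only the modes used in this commit. */
enum mode { NL, CR, EX };

#define MAX_NODES 4

struct lockres {
	enum mode held[MAX_NODES];	/* mode each node currently holds */
};

/* NL is compatible with everything; CR with CR; EX only with NL. */
static bool compatible(enum mode a, enum mode b)
{
	if (a == NL || b == NL)
		return true;
	return a == CR && b == CR;
}

/* Grant or convert only if the request is compatible with all other holders. */
static bool try_lock(struct lockres *res, int node, enum mode want)
{
	for (int i = 0; i < MAX_NODES; i++)
		if (i != node && !compatible(res->held[i], want))
			return false;
	res->held[node] = want;
	return true;
}

static void unlock(struct lockres *res, int node)
{
	res->held[node] = NL;
}

/*
 * Hypothetical join sequence mirroring the fix: take EX on "token" first,
 * then CR on "ack", then drop "token". If a broadcaster holds EX on
 * "token", the join fails here instead of slipping a CR onto "ack".
 */
static bool join_cluster(struct lockres *token, struct lockres *ack, int node)
{
	if (!try_lock(token, node, EX))
		return false;	/* broadcast in flight; caller would retry */
	bool ok = try_lock(ack, node, CR);
	unlock(token, node);
	return ok;
}
```

In this model, while node 0 holds EX on token, join_cluster() for node 2
fails and leaves ack untouched, so node 0 can still convert its ack lock
from CR to EX. Once node 0 drops token, the same join succeeds.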

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
drivers/md/md-cluster.c

index 76f88f731aa1b121f1fd138eaffc7267973b47cd..30f1160142c15353d5d09a9ff7277a1c7b4a9c42 100644
@@ -781,17 +781,24 @@ static int join(struct mddev *mddev, int nodes)
        cinfo->token_lockres = lockres_init(mddev, "token", NULL, 0);
        if (!cinfo->token_lockres)
                goto err;
-       cinfo->ack_lockres = lockres_init(mddev, "ack", ack_bast, 0);
-       if (!cinfo->ack_lockres)
-               goto err;
        cinfo->no_new_dev_lockres = lockres_init(mddev, "no-new-dev", NULL, 0);
        if (!cinfo->no_new_dev_lockres)
                goto err;
 
+       ret = dlm_lock_sync(cinfo->token_lockres, DLM_LOCK_EX);
+       if (ret) {
+               ret = -EAGAIN;
+               pr_err("md-cluster: can't join cluster to avoid lock issue\n");
+               goto err;
+       }
+       cinfo->ack_lockres = lockres_init(mddev, "ack", ack_bast, 0);
+       if (!cinfo->ack_lockres)
+               goto err;
        /* get sync CR lock on ACK. */
        if (dlm_lock_sync(cinfo->ack_lockres, DLM_LOCK_CR))
                pr_err("md-cluster: failed to get a sync CR lock on ACK!(%d)\n",
                                ret);
+       dlm_unlock_sync(cinfo->token_lockres);
        /* get sync CR lock on no-new-dev. */
        if (dlm_lock_sync(cinfo->no_new_dev_lockres, DLM_LOCK_CR))
                pr_err("md-cluster: failed to get a sync CR lock on no-new-dev!(%d)\n", ret);