Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
author    Paolo Abeni <pabeni@redhat.com>    Tue, 6 Sep 2022 21:21:14 +0000 (23:21 +0200)
committer Paolo Abeni <pabeni@redhat.com>    Tue, 6 Sep 2022 21:21:18 +0000 (23:21 +0200)
Daniel Borkmann says:

====================
pull-request: bpf-next 2022-09-05

The following pull-request contains BPF updates for your *net-next* tree.

We've added 106 non-merge commits during the last 18 day(s) which contain
a total of 159 files changed, 5225 insertions(+), 1358 deletions(-).

There are two small merge conflicts; resolve them as follows:

1) tools/testing/selftests/bpf/DENYLIST.s390x

  Commit 27e23836ce22 ("selftests/bpf: Add lru_bug to s390x deny list") in
  bpf tree was needed to get BPF CI green on s390x, but it conflicted with
  newly added tests on bpf-next. Resolve by adding both hunks, result:

  [...]
  lru_bug                                  # prog 'printk': failed to auto-attach: -524
  setget_sockopt                           # attach unexpected error: -524                                               (trampoline)
  cb_refs                                  # expected error message unexpected error: -524                               (trampoline)
  cgroup_hierarchical_stats                # JIT does not support calling kernel function                                (kfunc)
  htab_update                              # failed to attach: ERROR: strerror_r(-524)=22                                (trampoline)
  [...]

2) net/core/filter.c

  Commit 1227c1771dd2 ("net: Fix data-races around sysctl_[rw]mem_(max|default).")
  from net tree conflicts with commit 29003875bd5b ("bpf: Change bpf_setsockopt(SOL_SOCKET)
  to reuse sk_setsockopt()") from bpf-next tree. Take the code as it is from
  bpf-next tree, result:

  [...]
  if (getopt) {
          if (optname == SO_BINDTODEVICE)
                  return -EINVAL;
          return sk_getsockopt(sk, SOL_SOCKET, optname,
                               KERNEL_SOCKPTR(optval),
                               KERNEL_SOCKPTR(optlen));
  }

  return sk_setsockopt(sk, SOL_SOCKET, optname,
                       KERNEL_SOCKPTR(optval), *optlen);
  [...]

The main changes are:

1) Add an any-context BPF-specific memory allocator which is useful in particular for
   BPF tracing, with the bonus of performance equal to full prealloc, from Alexei
   Starovoitov.

2) Big batch to remove duplicated code from bpf_{get,set}sockopt() helpers as an effort
   to reuse the existing core socket code as much as possible, from Martin KaFai Lau.
   A BPF-side usage sketch follows this list.

3) Extend the BPF flow dissector so that BPF programs can just augment the in-kernel
   dissector with custom logic. In other words, allow for partial replacement, from
   Shmulik Ladkani. See the sketch after this list.

4) Add a new cgroup iterator to BPF with different traversal options, from Hao Luo
   (sketched after this list).

5) Support for BPF to collect hierarchical cgroup statistics efficiently through BPF
   integration with the rstat framework, from Yosry Ahmed.

6) Support bpf_{g,s}et_retval() under more BPF cgroup hooks, from Stanislav Fomichev
   (sketched after this list).

7) BPF hash table and local storage fixes under fully preemptible kernels, from Hou Tao.

8) Add various improvements to BPF selftests and libbpf for compilation with gcc BPF
   backend, from James Hilliard.

9) Fix verifier helper permissions and reference state management for synchronous
   callbacks, from Kumar Kartikeya Dwivedi.

10) Add support for BPF selftest's xskxceiver to also be used against real devices that
    support MAC loopback, from Maciej Fijalkowski.

11) Various fixes to the bpf-helpers(7) man page generation script, from Quentin Monnet.

12) Document BPF verifier's tnum_in(tnum_range(), ...) gotchas, from Shung-Hsi Yu.

13) Various minor misc improvements all over the place.
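
To illustrate 2), a minimal BPF-side sketch; the series itself only reworks the
kernel plumbing of these helpers to funnel into sk_setsockopt()/sk_getsockopt(),
so the hook, option and fallback #defines below are illustrative assumptions:

  /* Sketch only: a sockops program calling bpf_setsockopt(), whose kernel-side
   * implementation now goes through the shared sk_setsockopt() path.
   */
  #include <linux/bpf.h>
  #include <bpf/bpf_helpers.h>

  #ifndef SOL_SOCKET
  #define SOL_SOCKET   1  /* assumed values, normally from <sys/socket.h> */
  #endif
  #ifndef SO_KEEPALIVE
  #define SO_KEEPALIVE 9
  #endif

  SEC("sockops")
  int enable_keepalive(struct bpf_sock_ops *skops)
  {
          int one = 1;

          /* Illustrative policy: turn on keepalive once the connection is up. */
          if (skops->op == BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB ||
              skops->op == BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB)
                  bpf_setsockopt(skops, SOL_SOCKET, SO_KEEPALIVE, &one, sizeof(one));

          return 1;
  }

  char _license[] SEC("license") = "GPL";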
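
For 3), the new BPF_FLOW_DISSECTOR_CONTINUE return code (see the
include/uapi/linux/bpf.h hunk further down) lets a dissector program decline a
packet and fall back to the in-kernel dissector. A sketch, assuming a uapi
header that already carries the new value; the IPv4-only policy is made up:

  #include <linux/bpf.h>
  #include <linux/if_ether.h>
  #include <bpf/bpf_helpers.h>
  #include <bpf/bpf_endian.h>

  SEC("flow_dissector")
  int dissect(struct __sk_buff *skb)
  {
          struct bpf_flow_keys *keys = skb->flow_keys;

          /* Hand everything that is not IPv4 back to the standard dissector. */
          if (keys->n_proto != bpf_htons(ETH_P_IP))
                  return BPF_FLOW_DISSECTOR_CONTINUE;

          /* Custom IPv4 dissection would fill in the remaining flow keys here;
           * BPF_OK tells the kernel the keys are complete.
           */
          return BPF_OK;
  }

  char _license[] SEC("license") = "GPL";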
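
For 4), the traversal order (self only, descendants in pre- or post-order,
ancestors up) is chosen at link creation time via the new
bpf_iter_link_info.cgroup member shown in the uapi hunk below; the program side
is an ordinary iterator. A sketch, assuming vmlinux.h/CO-RE for the cgroup
field access:

  #include "vmlinux.h"
  #include <bpf/bpf_helpers.h>

  char _license[] SEC("license") = "GPL";

  SEC("iter/cgroup")
  int dump_cgroup_ids(struct bpf_iter__cgroup *ctx)
  {
          struct seq_file *seq = ctx->meta->seq;
          struct cgroup *cgrp = ctx->cgroup;

          /* The iterator signals the end of the walk with a NULL cgroup. */
          if (!cgrp)
                  return 0;

          /* One kernfs node id per visited cgroup. */
          BPF_SEQ_PRINTF(seq, "%llu\n", cgrp->kn->id);
          return 0;
  }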
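
For 6), a hedged sketch of exporting a specific errno from a cgroup hook via
bpf_set_retval(); the hook and errno below are illustrative assumptions, not
taken from this series:

  #include <linux/bpf.h>
  #include <bpf/bpf_helpers.h>

  #define EACCES 13    /* assumed value, normally from <errno.h> */

  SEC("cgroup/sock_create")
  int deny_sock_create(struct bpf_sock *ctx)
  {
          /* Ask the cgroup hook machinery to report -EACCES instead of the
           * default -EPERM for the rejected socket(2) call.
           */
          bpf_set_retval(-EACCES);
          return 0;   /* 0 == reject for this hook */
  }

  char _license[] SEC("license") = "GPL";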

* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (106 commits)
  bpf: Optimize rcu_barrier usage between hash map and bpf_mem_alloc.
  bpf: Remove usage of kmem_cache from bpf_mem_cache.
  bpf: Remove prealloc-only restriction for sleepable bpf programs.
  bpf: Prepare bpf_mem_alloc to be used by sleepable bpf programs.
  bpf: Remove tracing program restriction on map types
  bpf: Convert percpu hash map to per-cpu bpf_mem_alloc.
  bpf: Add percpu allocation support to bpf_mem_alloc.
  bpf: Batch call_rcu callbacks instead of SLAB_TYPESAFE_BY_RCU.
  bpf: Adjust low/high watermarks in bpf_mem_cache
  bpf: Optimize call_rcu in non-preallocated hash map.
  bpf: Optimize element count in non-preallocated hash map.
  bpf: Relax the requirement to use preallocated hash maps in tracing progs.
  samples/bpf: Reduce syscall overhead in map_perf_test.
  selftests/bpf: Improve test coverage of test_maps
  bpf: Convert hash map to bpf_mem_alloc.
  bpf: Introduce any context BPF specific memory allocator.
  selftest/bpf: Add test for bpf_getsockopt()
  bpf: Change bpf_getsockopt(SOL_IPV6) to reuse do_ipv6_getsockopt()
  bpf: Change bpf_getsockopt(SOL_IP) to reuse do_ip_getsockopt()
  bpf: Change bpf_getsockopt(SOL_TCP) to reuse do_tcp_getsockopt()
  ...
====================

Link: https://lore.kernel.org/r/20220905161136.9150-1-daniel@iogearbox.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
12 files changed:
include/linux/skbuff.h
include/net/sock.h
include/uapi/linux/bpf.h
kernel/bpf/cgroup.c
kernel/bpf/syscall.c
kernel/bpf/verifier.c
net/core/filter.c
net/core/sock.c
net/ipv4/ip_sockglue.c
net/ipv4/tcp.c
net/ipv6/ipv6_sockglue.c
tools/testing/selftests/bpf/DENYLIST.s390x

diff --combined include/linux/skbuff.h
index b51d07a727c9f23f783f1512657a2aa7684e6179,87921996175cfc834e2e04ea03596ff90af59ff9..43c37385f1e99d919523420a70943395d1178655
@@@ -1195,8 -1195,7 +1195,8 @@@ static inline bool skb_unref(struct sk_
        return true;
  }
  
 -void kfree_skb_reason(struct sk_buff *skb, enum skb_drop_reason reason);
 +void __fix_address
 +kfree_skb_reason(struct sk_buff *skb, enum skb_drop_reason reason);
  
  /**
   *    kfree_skb - free an sk_buff with 'NOT_SPECIFIED' reason
@@@ -1461,8 -1460,8 +1461,8 @@@ void skb_flow_dissector_init(struct flo
                             unsigned int key_count);
  
  struct bpf_flow_dissector;
- bool bpf_flow_dissect(struct bpf_prog *prog, struct bpf_flow_dissector *ctx,
-                     __be16 proto, int nhoff, int hlen, unsigned int flags);
+ u32 bpf_flow_dissect(struct bpf_prog *prog, struct bpf_flow_dissector *ctx,
+                    __be16 proto, int nhoff, int hlen, unsigned int flags);
  
  bool __skb_flow_dissect(const struct net *net,
                        const struct sk_buff *skb,
diff --combined include/net/sock.h
index ca469980e006c3fa8f6cee00e93d6f81afcae1d7,ea79655241331c73b727c1c779e5cf92e7612782..96a31026e35df226f69c8404090824cca29d9811
@@@ -348,7 -348,6 +348,7 @@@ struct sk_filter
    *   @sk_txtime_report_errors: set report errors mode for SO_TXTIME
    *   @sk_txtime_unused: unused txtime flags
    *   @ns_tracker: tracker for netns reference
 +  *   @sk_bind2_node: bind node in the bhash2 table
    */
  struct sock {
        /*
  #endif
        struct rcu_head         sk_rcu;
        netns_tracker           ns_tracker;
 +      struct hlist_node       sk_bind2_node;
  };
  
  enum sk_pacing {
@@@ -579,31 -577,6 +579,31 @@@ static inline bool sk_user_data_is_noco
  
  #define __sk_user_data(sk) ((*((void __rcu **)&(sk)->sk_user_data)))
  
 +/**
 + * __locked_read_sk_user_data_with_flags - return the pointer
 + * only if argument flags all has been set in sk_user_data. Otherwise
 + * return NULL
 + *
 + * @sk: socket
 + * @flags: flag bits
 + *
 + * The caller must be holding sk->sk_callback_lock.
 + */
 +static inline void *
 +__locked_read_sk_user_data_with_flags(const struct sock *sk,
 +                                    uintptr_t flags)
 +{
 +      uintptr_t sk_user_data =
 +              (uintptr_t)rcu_dereference_check(__sk_user_data(sk),
 +                                               lockdep_is_held(&sk->sk_callback_lock));
 +
 +      WARN_ON_ONCE(flags & SK_USER_DATA_PTRMASK);
 +
 +      if ((sk_user_data & flags) == flags)
 +              return (void *)(sk_user_data & SK_USER_DATA_PTRMASK);
 +      return NULL;
 +}
 +
  /**
   * __rcu_dereference_sk_user_data_with_flags - return the pointer
   * only if argument flags all has been set in sk_user_data. Otherwise
@@@ -872,16 -845,6 +872,16 @@@ static inline void sk_add_bind_node(str
        hlist_add_head(&sk->sk_bind_node, list);
  }
  
 +static inline void __sk_del_bind2_node(struct sock *sk)
 +{
 +      __hlist_del(&sk->sk_bind2_node);
 +}
 +
 +static inline void sk_add_bind2_node(struct sock *sk, struct hlist_head *list)
 +{
 +      hlist_add_head(&sk->sk_bind2_node, list);
 +}
 +
  #define sk_for_each(__sk, list) \
        hlist_for_each_entry(__sk, list, sk_node)
  #define sk_for_each_rcu(__sk, list) \
        hlist_for_each_entry_safe(__sk, tmp, list, sk_node)
  #define sk_for_each_bound(__sk, list) \
        hlist_for_each_entry(__sk, list, sk_bind_node)
 +#define sk_for_each_bound_bhash2(__sk, list) \
 +      hlist_for_each_entry(__sk, list, sk_bind2_node)
  
  /**
   * sk_for_each_entry_offset_rcu - iterate over a list at a given struct offset
@@@ -1788,6 -1749,11 +1788,11 @@@ static inline void unlock_sock_fast(str
        }
  }
  
+ void sockopt_lock_sock(struct sock *sk);
+ void sockopt_release_sock(struct sock *sk);
+ bool sockopt_ns_capable(struct user_namespace *ns, int cap);
+ bool sockopt_capable(int cap);
  /* Used by processes to "lock" a socket state, so that
   * interrupts and bottom half handlers won't change it
   * from under us. It essentially blocks any incoming
@@@ -1862,9 -1828,13 +1867,13 @@@ void sock_pfree(struct sk_buff *skb)
  #define sock_edemux sock_efree
  #endif
  
+ int sk_setsockopt(struct sock *sk, int level, int optname,
+                 sockptr_t optval, unsigned int optlen);
  int sock_setsockopt(struct socket *sock, int level, int op,
                    sockptr_t optval, unsigned int optlen);
  
+ int sk_getsockopt(struct sock *sk, int level, int optname,
+                 sockptr_t optval, sockptr_t optlen);
  int sock_getsockopt(struct socket *sock, int level, int op,
                    char __user *optval, int __user *optlen);
  int sock_gettstamp(struct socket *sock, void __user *userstamp,
diff --combined include/uapi/linux/bpf.h
index 1d6085e15fc8e6a0e34af04220602fa455f5b687,837c0f9b7fdd18a4ba4778df0d1706a5ee71c8c0..793103b10eabf1c204d6119dc7aec7c0b686c46d
@@@ -79,7 -79,7 +79,7 @@@ struct bpf_insn 
  /* Key of an a BPF_MAP_TYPE_LPM_TRIE entry */
  struct bpf_lpm_trie_key {
        __u32   prefixlen;      /* up to 32 for AF_INET, 128 for AF_INET6 */
 -      __u8    data[]; /* Arbitrary size */
 +      __u8    data[0];        /* Arbitrary size */
  };
  
  struct bpf_cgroup_storage_key {
        __u32   attach_type;            /* program attach type (enum bpf_attach_type) */
  };
  
+ enum bpf_cgroup_iter_order {
+       BPF_CGROUP_ITER_ORDER_UNSPEC = 0,
+       BPF_CGROUP_ITER_SELF_ONLY,              /* process only a single object. */
+       BPF_CGROUP_ITER_DESCENDANTS_PRE,        /* walk descendants in pre-order. */
+       BPF_CGROUP_ITER_DESCENDANTS_POST,       /* walk descendants in post-order. */
+       BPF_CGROUP_ITER_ANCESTORS_UP,           /* walk ancestors upward. */
+ };
  union bpf_iter_link_info {
        struct {
                __u32   map_fd;
        } map;
+       struct {
+               enum bpf_cgroup_iter_order order;
+               /* At most one of cgroup_fd and cgroup_id can be non-zero. If
+                * both are zero, the walk starts from the default cgroup v2
+                * root. For walking v1 hierarchy, one should always explicitly
+                * specify cgroup_fd.
+                */
+               __u32   cgroup_fd;
+               __u64   cgroup_id;
+       } cgroup;
  };
  
  /* BPF syscall commands, see bpf(2) man-page for more details. */
@@@ -4437,7 -4456,7 +4456,7 @@@ union bpf_attr 
   *
   *            **-EEXIST** if the option already exists.
   *
-  *            **-EFAULT** on failrue to parse the existing header options.
+  *            **-EFAULT** on failure to parse the existing header options.
   *
   *            **-EPERM** if the helper cannot be used under the current
   *            *skops*\ **->op**.
   *            a *map* with *task* as the **key**.  From this
   *            perspective,  the usage is not much different from
   *            **bpf_map_lookup_elem**\ (*map*, **&**\ *task*) except this
-  *            helper enforces the key must be an task_struct and the map must also
+  *            helper enforces the key must be a task_struct and the map must also
   *            be a **BPF_MAP_TYPE_TASK_STORAGE**.
   *
   *            Underneath, the value is stored locally at *task* instead of
   *
   * long bpf_ima_inode_hash(struct inode *inode, void *dst, u32 size)
   *    Description
-  *            Returns the stored IMA hash of the *inode* (if it's avaialable).
+  *            Returns the stored IMA hash of the *inode* (if it's available).
   *            If the hash is larger than *size*, then only *size*
   *            bytes will be copied to *dst*
   *    Return
   *
   *            The argument *len_diff* can be used for querying with a planned
   *            size change. This allows to check MTU prior to changing packet
-  *            ctx. Providing an *len_diff* adjustment that is larger than the
+  *            ctx. Providing a *len_diff* adjustment that is larger than the
   *            actual packet size (resulting in negative packet size) will in
-  *            principle not exceed the MTU, why it is not considered a
-  *            failure.  Other BPF-helpers are needed for performing the
-  *            planned size change, why the responsability for catch a negative
-  *            packet size belong in those helpers.
+  *            principle not exceed the MTU, which is why it is not considered
+  *            a failure.  Other BPF helpers are needed for performing the
+  *            planned size change; therefore the responsibility for catching
+  *            a negative packet size belongs in those helpers.
   *
   *            Specifying *ifindex* zero means the MTU check is performed
   *            against the current net device.  This is practical if this isn't
   *
   * int bpf_get_retval(void)
   *    Description
-  *            Get the syscall's return value that will be returned to userspace.
+  *            Get the BPF program's return value that will be returned to the upper layers.
   *
-  *            This helper is currently supported by cgroup programs only.
+  *            This helper is currently supported by cgroup programs and only by the hooks
+  *            where BPF program's return value is returned to the userspace via errno.
   *    Return
-  *            The syscall's return value.
+  *            The BPF program's return value.
   *
   * int bpf_set_retval(int retval)
   *    Description
-  *            Set the syscall's return value that will be returned to userspace.
+  *            Set the BPF program's return value that will be returned to the upper layers.
+  *
+  *            This helper is currently supported by cgroup programs and only by the hooks
+  *            where BPF program's return value is returned to the userspace via errno.
+  *
+  *            Note that there is the following corner case where the program exports an error
+  *            via bpf_set_retval but signals success via 'return 1':
+  *
+  *                    bpf_set_retval(-EPERM);
+  *                    return 1;
+  *
+  *            In this case, the BPF program's return value will use helper's -EPERM. This
+  *            still holds true for cgroup/bind{4,6} which supports extra 'return 3' success case.
   *
-  *            This helper is currently supported by cgroup programs only.
   *    Return
   *            0 on success, or a negative error in case of failure.
   *
@@@ -5628,6 -5659,11 +5659,11 @@@ enum 
        BPF_F_SEQ_NUMBER                = (1ULL << 3),
  };
  
+ /* BPF_FUNC_skb_get_tunnel_key flags. */
+ enum {
+       BPF_F_TUNINFO_FLAGS             = (1ULL << 4),
+ };
  /* BPF_FUNC_perf_event_output, BPF_FUNC_perf_event_read and
   * BPF_FUNC_perf_event_read_value flags.
   */
@@@ -5817,7 -5853,10 +5853,10 @@@ struct bpf_tunnel_key 
        };
        __u8 tunnel_tos;
        __u8 tunnel_ttl;
-       __u16 tunnel_ext;       /* Padding, future use. */
+       union {
+               __u16 tunnel_ext;       /* compat */
+               __be16 tunnel_flags;
+       };
        __u32 tunnel_label;
        union {
                __u32 local_ipv4;
@@@ -5861,6 -5900,11 +5900,11 @@@ enum bpf_ret_code 
         *    represented by BPF_REDIRECT above).
         */
        BPF_LWT_REROUTE = 128,
+       /* BPF_FLOW_DISSECTOR_CONTINUE: used by BPF_PROG_TYPE_FLOW_DISSECTOR
+        *   to indicate that no custom dissection was performed, and
+        *   fallback to standard dissector is requested.
+        */
+       BPF_FLOW_DISSECTOR_CONTINUE = 129,
  };
  
  struct bpf_sock {
@@@ -6159,11 -6203,22 +6203,22 @@@ struct bpf_link_info 
                struct {
                        __aligned_u64 target_name; /* in/out: target_name buffer ptr */
                        __u32 target_name_len;     /* in/out: target_name buffer len */
+                       /* If the iter specific field is 32 bits, it can be put
+                        * in the first or second union. Otherwise it should be
+                        * put in the second union.
+                        */
                        union {
                                struct {
                                        __u32 map_id;
                                } map;
                        };
+                       union {
+                               struct {
+                                       __u64 cgroup_id;
+                                       __u32 order;
+                               } cgroup;
+                       };
                } iter;
                struct  {
                        __u32 netns_ino;
diff --combined kernel/bpf/cgroup.c
index 4a400cd6373167d99ba2ce9c27101b9e95282c7f,121b5a5edb6420305c6f93c4939311d306068c59..00c7f864900e6b8966264c65b96a6a7f6c8b81d4
@@@ -921,10 -921,8 +921,10 @@@ static void purge_effective_progs(struc
                                pos++;
                        }
                }
 +
 +              /* no link or prog match, skip the cgroup of this layer */
 +              continue;
  found:
 -              BUG_ON(!cg);
                progs = rcu_dereference_protected(
                                desc->bpf.effective[atype],
                                lockdep_is_held(&cgroup_mutex));
@@@ -1529,6 -1527,37 +1529,37 @@@ int __cgroup_bpf_check_dev_permission(s
        return ret;
  }
  
+ BPF_CALL_2(bpf_get_local_storage, struct bpf_map *, map, u64, flags)
+ {
+       /* flags argument is not used now,
+        * but provides an ability to extend the API.
+        * verifier checks that its value is correct.
+        */
+       enum bpf_cgroup_storage_type stype = cgroup_storage_type(map);
+       struct bpf_cgroup_storage *storage;
+       struct bpf_cg_run_ctx *ctx;
+       void *ptr;
+       /* get current cgroup storage from BPF run context */
+       ctx = container_of(current->bpf_ctx, struct bpf_cg_run_ctx, run_ctx);
+       storage = ctx->prog_item->cgroup_storage[stype];
+       if (stype == BPF_CGROUP_STORAGE_SHARED)
+               ptr = &READ_ONCE(storage->buf)->data[0];
+       else
+               ptr = this_cpu_ptr(storage->percpu_buf);
+       return (unsigned long)ptr;
+ }
+ const struct bpf_func_proto bpf_get_local_storage_proto = {
+       .func           = bpf_get_local_storage,
+       .gpl_only       = false,
+       .ret_type       = RET_PTR_TO_MAP_VALUE,
+       .arg1_type      = ARG_CONST_MAP_PTR,
+       .arg2_type      = ARG_ANYTHING,
+ };
  BPF_CALL_0(bpf_get_retval)
  {
        struct bpf_cg_run_ctx *ctx =
@@@ -1560,32 -1589,26 +1591,26 @@@ const struct bpf_func_proto bpf_set_ret
  };
  
  static const struct bpf_func_proto *
- cgroup_base_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
+ cgroup_dev_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
  {
+       const struct bpf_func_proto *func_proto;
+       func_proto = cgroup_common_func_proto(func_id, prog);
+       if (func_proto)
+               return func_proto;
+       func_proto = cgroup_current_func_proto(func_id, prog);
+       if (func_proto)
+               return func_proto;
        switch (func_id) {
-       case BPF_FUNC_get_current_uid_gid:
-               return &bpf_get_current_uid_gid_proto;
-       case BPF_FUNC_get_local_storage:
-               return &bpf_get_local_storage_proto;
-       case BPF_FUNC_get_current_cgroup_id:
-               return &bpf_get_current_cgroup_id_proto;
        case BPF_FUNC_perf_event_output:
                return &bpf_event_output_data_proto;
-       case BPF_FUNC_get_retval:
-               return &bpf_get_retval_proto;
-       case BPF_FUNC_set_retval:
-               return &bpf_set_retval_proto;
        default:
                return bpf_base_func_proto(func_id);
        }
  }
  
- static const struct bpf_func_proto *
- cgroup_dev_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
- {
-       return cgroup_base_func_proto(func_id, prog);
- }
  static bool cgroup_dev_is_valid_access(int off, int size,
                                       enum bpf_access_type type,
                                       const struct bpf_prog *prog,
@@@ -2098,11 -2121,17 +2123,17 @@@ static const struct bpf_func_proto bpf_
  static const struct bpf_func_proto *
  sysctl_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
  {
+       const struct bpf_func_proto *func_proto;
+       func_proto = cgroup_common_func_proto(func_id, prog);
+       if (func_proto)
+               return func_proto;
+       func_proto = cgroup_current_func_proto(func_id, prog);
+       if (func_proto)
+               return func_proto;
        switch (func_id) {
-       case BPF_FUNC_strtol:
-               return &bpf_strtol_proto;
-       case BPF_FUNC_strtoul:
-               return &bpf_strtoul_proto;
        case BPF_FUNC_sysctl_get_name:
                return &bpf_sysctl_get_name_proto;
        case BPF_FUNC_sysctl_get_current_value:
                return &bpf_sysctl_set_new_value_proto;
        case BPF_FUNC_ktime_get_coarse_ns:
                return &bpf_ktime_get_coarse_ns_proto;
+       case BPF_FUNC_perf_event_output:
+               return &bpf_event_output_data_proto;
        default:
-               return cgroup_base_func_proto(func_id, prog);
+               return bpf_base_func_proto(func_id);
        }
  }
  
@@@ -2235,6 -2266,16 +2268,16 @@@ static const struct bpf_func_proto bpf_
  static const struct bpf_func_proto *
  cg_sockopt_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
  {
+       const struct bpf_func_proto *func_proto;
+       func_proto = cgroup_common_func_proto(func_id, prog);
+       if (func_proto)
+               return func_proto;
+       func_proto = cgroup_current_func_proto(func_id, prog);
+       if (func_proto)
+               return func_proto;
        switch (func_id) {
  #ifdef CONFIG_NET
        case BPF_FUNC_get_netns_cookie:
        case BPF_FUNC_tcp_sock:
                return &bpf_tcp_sock_proto;
  #endif
+       case BPF_FUNC_perf_event_output:
+               return &bpf_event_output_data_proto;
        default:
-               return cgroup_base_func_proto(func_id, prog);
+               return bpf_base_func_proto(func_id);
        }
  }
  
@@@ -2422,3 -2465,69 +2467,69 @@@ const struct bpf_verifier_ops cg_sockop
  
  const struct bpf_prog_ops cg_sockopt_prog_ops = {
  };
+ /* Common helpers for cgroup hooks. */
+ const struct bpf_func_proto *
+ cgroup_common_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
+ {
+       switch (func_id) {
+       case BPF_FUNC_get_local_storage:
+               return &bpf_get_local_storage_proto;
+       case BPF_FUNC_get_retval:
+               switch (prog->expected_attach_type) {
+               case BPF_CGROUP_INET_INGRESS:
+               case BPF_CGROUP_INET_EGRESS:
+               case BPF_CGROUP_SOCK_OPS:
+               case BPF_CGROUP_UDP4_RECVMSG:
+               case BPF_CGROUP_UDP6_RECVMSG:
+               case BPF_CGROUP_INET4_GETPEERNAME:
+               case BPF_CGROUP_INET6_GETPEERNAME:
+               case BPF_CGROUP_INET4_GETSOCKNAME:
+               case BPF_CGROUP_INET6_GETSOCKNAME:
+                       return NULL;
+               default:
+                       return &bpf_get_retval_proto;
+               }
+       case BPF_FUNC_set_retval:
+               switch (prog->expected_attach_type) {
+               case BPF_CGROUP_INET_INGRESS:
+               case BPF_CGROUP_INET_EGRESS:
+               case BPF_CGROUP_SOCK_OPS:
+               case BPF_CGROUP_UDP4_RECVMSG:
+               case BPF_CGROUP_UDP6_RECVMSG:
+               case BPF_CGROUP_INET4_GETPEERNAME:
+               case BPF_CGROUP_INET6_GETPEERNAME:
+               case BPF_CGROUP_INET4_GETSOCKNAME:
+               case BPF_CGROUP_INET6_GETSOCKNAME:
+                       return NULL;
+               default:
+                       return &bpf_set_retval_proto;
+               }
+       default:
+               return NULL;
+       }
+ }
+ /* Common helpers for cgroup hooks with valid process context. */
+ const struct bpf_func_proto *
+ cgroup_current_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
+ {
+       switch (func_id) {
+       case BPF_FUNC_get_current_uid_gid:
+               return &bpf_get_current_uid_gid_proto;
+       case BPF_FUNC_get_current_pid_tgid:
+               return &bpf_get_current_pid_tgid_proto;
+       case BPF_FUNC_get_current_comm:
+               return &bpf_get_current_comm_proto;
+       case BPF_FUNC_get_current_cgroup_id:
+               return &bpf_get_current_cgroup_id_proto;
+       case BPF_FUNC_get_current_ancestor_cgroup_id:
+               return &bpf_get_current_ancestor_cgroup_id_proto;
+ #ifdef CONFIG_CGROUP_NET_CLASSID
+       case BPF_FUNC_get_cgroup_classid:
+               return &bpf_get_cgroup_classid_curr_proto;
+ #endif
+       default:
+               return NULL;
+       }
+ }
diff --combined kernel/bpf/syscall.c
index 27760627370db821a5fc2741f73130e4e2bedb2f,074c901fbb4e3977961297d94edee8357cdfba28..4fb08c43420d19fd9e5c59bcd8800494e3d5fc01
@@@ -638,7 -638,10 +638,10 @@@ static void __bpf_map_put(struct bpf_ma
                bpf_map_free_id(map, do_idr_lock);
                btf_put(map->btf);
                INIT_WORK(&map->work, bpf_map_free_deferred);
-               schedule_work(&map->work);
+               /* Avoid spawning kworkers, since they all might contend
+                * for the same mutex like slab_mutex.
+                */
+               queue_work(system_unbound_wq, &map->work);
        }
  }
  
@@@ -1437,9 -1440,9 +1440,9 @@@ err_put
  
  #define BPF_MAP_DELETE_ELEM_LAST_FIELD key
  
- static int map_delete_elem(union bpf_attr *attr)
+ static int map_delete_elem(union bpf_attr *attr, bpfptr_t uattr)
  {
-       void __user *ukey = u64_to_user_ptr(attr->key);
+       bpfptr_t ukey = make_bpfptr(attr->key, uattr.is_kernel);
        int ufd = attr->map_fd;
        struct bpf_map *map;
        struct fd f;
                goto err_put;
        }
  
-       key = __bpf_copy_key(ukey, map->key_size);
+       key = ___bpf_copy_key(ukey, map->key_size);
        if (IS_ERR(key)) {
                err = PTR_ERR(key);
                goto err_put;
@@@ -4941,7 -4944,7 +4944,7 @@@ static int __sys_bpf(int cmd, bpfptr_t 
                err = map_update_elem(&attr, uattr);
                break;
        case BPF_MAP_DELETE_ELEM:
-               err = map_delete_elem(&attr);
+               err = map_delete_elem(&attr, uattr);
                break;
        case BPF_MAP_GET_NEXT_KEY:
                err = map_get_next_key(&attr);
@@@ -5073,8 -5076,10 +5076,10 @@@ BPF_CALL_3(bpf_sys_bpf, int, cmd, unio
  {
        switch (cmd) {
        case BPF_MAP_CREATE:
+       case BPF_MAP_DELETE_ELEM:
        case BPF_MAP_UPDATE_ELEM:
        case BPF_MAP_FREEZE:
+       case BPF_MAP_GET_FD_BY_ID:
        case BPF_PROG_LOAD:
        case BPF_BTF_LOAD:
        case BPF_LINK_CREATE:
@@@ -5197,7 -5202,7 +5202,7 @@@ syscall_prog_func_proto(enum bpf_func_i
  {
        switch (func_id) {
        case BPF_FUNC_sys_bpf:
 -              return &bpf_sys_bpf_proto;
 +              return !perfmon_capable() ? NULL : &bpf_sys_bpf_proto;
        case BPF_FUNC_btf_find_by_name_kind:
                return &bpf_btf_find_by_name_kind_proto;
        case BPF_FUNC_sys_close:
diff --combined kernel/bpf/verifier.c
index 8d07493a477c794c6252a5fa5bec6aa483b4d3ee,068b20ed34d2c9894fb378a23b24d5661b80329e..003f7ba195581de4ad90a77625adbdbc00673d2c
@@@ -1092,6 -1092,7 +1092,7 @@@ static int acquire_reference_state(stru
        id = ++env->id_gen;
        state->refs[new_ofs].id = id;
        state->refs[new_ofs].insn_idx = insn_idx;
+       state->refs[new_ofs].callback_ref = state->in_callback_fn ? state->frameno : 0;
  
        return id;
  }
@@@ -1104,6 -1105,9 +1105,9 @@@ static int release_reference_state(stru
        last_idx = state->acquired_refs - 1;
        for (i = 0; i < state->acquired_refs; i++) {
                if (state->refs[i].id == ptr_id) {
+                       /* Cannot release caller references in callbacks */
+                       if (state->in_callback_fn && state->refs[i].callback_ref != state->frameno)
+                               return -EINVAL;
                        if (last_idx && i != last_idx)
                                memcpy(&state->refs[i], &state->refs[last_idx],
                                       sizeof(*state->refs));
@@@ -6072,9 -6076,6 +6076,9 @@@ skip_type_check
                        return -EACCES;
                }
                meta->mem_size = reg->var_off.value;
 +              err = mark_chain_precision(env, regno);
 +              if (err)
 +                      return err;
                break;
        case ARG_PTR_TO_INT:
        case ARG_PTR_TO_LONG:
@@@ -6918,10 -6919,17 +6922,17 @@@ static int prepare_func_exit(struct bpf
                caller->regs[BPF_REG_0] = *r0;
        }
  
-       /* Transfer references to the caller */
-       err = copy_reference_state(caller, callee);
-       if (err)
-               return err;
+       /* callback_fn frame should have released its own additions to parent's
+        * reference state at this point, or check_reference_leak would
+        * complain, hence it must be the same as the caller. There is no need
+        * to copy it back.
+        */
+       if (!callee->in_callback_fn) {
+               /* Transfer references to the caller */
+               err = copy_reference_state(caller, callee);
+               if (err)
+                       return err;
+       }
  
        *insn_idx = callee->callsite + 1;
        if (env->log.level & BPF_LOG_LEVEL) {
@@@ -7010,7 -7018,8 +7021,7 @@@ record_func_key(struct bpf_verifier_en
        struct bpf_insn_aux_data *aux = &env->insn_aux_data[insn_idx];
        struct bpf_reg_state *regs = cur_regs(env), *reg;
        struct bpf_map *map = meta->map_ptr;
 -      struct tnum range;
 -      u64 val;
 +      u64 val, max;
        int err;
  
        if (func_id != BPF_FUNC_tail_call)
                return -EINVAL;
        }
  
 -      range = tnum_range(0, map->max_entries - 1);
        reg = &regs[BPF_REG_3];
 +      val = reg->var_off.value;
 +      max = map->max_entries;
  
 -      if (!register_is_const(reg) || !tnum_in(range, reg->var_off)) {
 +      if (!(register_is_const(reg) && val < max)) {
                bpf_map_key_store(aux, BPF_MAP_KEY_POISON);
                return 0;
        }
        err = mark_chain_precision(env, BPF_REG_3);
        if (err)
                return err;
 -
 -      val = reg->var_off.value;
        if (bpf_map_key_unseen(aux))
                bpf_map_key_store(aux, val);
        else if (!bpf_map_key_poisoned(aux) &&
  static int check_reference_leak(struct bpf_verifier_env *env)
  {
        struct bpf_func_state *state = cur_func(env);
+       bool refs_lingering = false;
        int i;
  
+       if (state->frameno && !state->in_callback_fn)
+               return 0;
        for (i = 0; i < state->acquired_refs; i++) {
+               if (state->in_callback_fn && state->refs[i].callback_ref != state->frameno)
+                       continue;
                verbose(env, "Unreleased reference id=%d alloc_insn=%d\n",
                        state->refs[i].id, state->refs[i].insn_idx);
+               refs_lingering = true;
        }
-       return state->acquired_refs ? -EINVAL : 0;
+       return refs_lingering ? -EINVAL : 0;
  }
  
  static int check_bpf_snprintf_call(struct bpf_verifier_env *env,
@@@ -12338,6 -12355,16 +12356,16 @@@ static int do_check(struct bpf_verifier
                                        return -EINVAL;
                                }
  
+                               /* We must do check_reference_leak here before
+                                * prepare_func_exit to handle the case when
+                                * state->curframe > 0, it may be a callback
+                                * function, for which reference_state must
+                                * match caller reference state when it exits.
+                                */
+                               err = check_reference_leak(env);
+                               if (err)
+                                       return err;
                                if (state->curframe) {
                                        /* exit from nested function */
                                        err = prepare_func_exit(env, &env->insn_idx);
                                        continue;
                                }
  
-                               err = check_reference_leak(env);
-                               if (err)
-                                       return err;
                                err = check_return_code(env);
                                if (err)
                                        return err;
@@@ -12563,14 -12586,6 +12587,6 @@@ err_put
        return err;
  }
  
- static int check_map_prealloc(struct bpf_map *map)
- {
-       return (map->map_type != BPF_MAP_TYPE_HASH &&
-               map->map_type != BPF_MAP_TYPE_PERCPU_HASH &&
-               map->map_type != BPF_MAP_TYPE_HASH_OF_MAPS) ||
-               !(map->map_flags & BPF_F_NO_PREALLOC);
- }
  static bool is_tracing_prog_type(enum bpf_prog_type type)
  {
        switch (type) {
        }
  }
  
- static bool is_preallocated_map(struct bpf_map *map)
- {
-       if (!check_map_prealloc(map))
-               return false;
-       if (map->inner_map_meta && !check_map_prealloc(map->inner_map_meta))
-               return false;
-       return true;
- }
  static int check_map_prog_compatibility(struct bpf_verifier_env *env,
                                        struct bpf_map *map,
                                        struct bpf_prog *prog)
  
  {
        enum bpf_prog_type prog_type = resolve_prog_type(prog);
-       /*
-        * Validate that trace type programs use preallocated hash maps.
-        *
-        * For programs attached to PERF events this is mandatory as the
-        * perf NMI can hit any arbitrary code sequence.
-        *
-        * All other trace types using preallocated hash maps are unsafe as
-        * well because tracepoint or kprobes can be inside locked regions
-        * of the memory allocator or at a place where a recursion into the
-        * memory allocator would see inconsistent state.
-        *
-        * On RT enabled kernels run-time allocation of all trace type
-        * programs is strictly prohibited due to lock type constraints. On
-        * !RT kernels it is allowed for backwards compatibility reasons for
-        * now, but warnings are emitted so developers are made aware of
-        * the unsafety and can fix their programs before this is enforced.
-        */
-       if (is_tracing_prog_type(prog_type) && !is_preallocated_map(map)) {
-               if (prog_type == BPF_PROG_TYPE_PERF_EVENT) {
-                       verbose(env, "perf_event programs can only use preallocated hash map\n");
-                       return -EINVAL;
-               }
-               if (IS_ENABLED(CONFIG_PREEMPT_RT)) {
-                       verbose(env, "trace type programs can only use preallocated hash map\n");
-                       return -EINVAL;
-               }
-               WARN_ONCE(1, "trace type BPF program uses run-time allocation\n");
-               verbose(env, "trace type programs with run-time allocated hash maps are unsafe. Switch to preallocated hash maps.\n");
-       }
  
        if (map_value_has_spin_lock(map)) {
                if (prog_type == BPF_PROG_TYPE_SOCKET_FILTER) {
                case BPF_MAP_TYPE_LRU_PERCPU_HASH:
                case BPF_MAP_TYPE_ARRAY_OF_MAPS:
                case BPF_MAP_TYPE_HASH_OF_MAPS:
-                       if (!is_preallocated_map(map)) {
-                               verbose(env,
-                                       "Sleepable programs can only use preallocated maps\n");
-                               return -EINVAL;
-                       }
-                       break;
                case BPF_MAP_TYPE_RINGBUF:
                case BPF_MAP_TYPE_INODE_STORAGE:
                case BPF_MAP_TYPE_SK_STORAGE:
diff --combined net/core/filter.c
index c191db80ce93c7efbe8a623d64c8485fb7591ca3,ee768bb5b5ab0c1230403049e0bce5c6aa878301..e872f45399b0a00ed47227f063bc2b9e25f9e106
@@@ -1214,11 -1214,10 +1214,11 @@@ void sk_filter_uncharge(struct sock *sk
  static bool __sk_filter_charge(struct sock *sk, struct sk_filter *fp)
  {
        u32 filter_size = bpf_prog_size(fp->prog->len);
 +      int optmem_max = READ_ONCE(sysctl_optmem_max);
  
        /* same check as in sock_kmalloc() */
 -      if (filter_size <= sysctl_optmem_max &&
 -          atomic_read(&sk->sk_omem_alloc) + filter_size < sysctl_optmem_max) {
 +      if (filter_size <= optmem_max &&
 +          atomic_read(&sk->sk_omem_alloc) + filter_size < optmem_max) {
                atomic_add(filter_size, &sk->sk_omem_alloc);
                return true;
        }
@@@ -1549,7 -1548,7 +1549,7 @@@ int sk_reuseport_attach_filter(struct s
        if (IS_ERR(prog))
                return PTR_ERR(prog);
  
 -      if (bpf_prog_size(prog->len) > sysctl_optmem_max)
 +      if (bpf_prog_size(prog->len) > READ_ONCE(sysctl_optmem_max))
                err = -ENOMEM;
        else
                err = reuseport_attach_prog(sk, prog);
@@@ -1616,7 -1615,7 +1616,7 @@@ int sk_reuseport_attach_bpf(u32 ufd, st
                }
        } else {
                /* BPF_PROG_TYPE_SOCKET_FILTER */
 -              if (bpf_prog_size(prog->len) > sysctl_optmem_max) {
 +              if (bpf_prog_size(prog->len) > READ_ONCE(sysctl_optmem_max)) {
                        err = -ENOMEM;
                        goto err_prog_put;
                }
@@@ -3010,7 -3009,7 +3010,7 @@@ BPF_CALL_0(bpf_get_cgroup_classid_curr
        return __task_get_classid(current);
  }
  
- static const struct bpf_func_proto bpf_get_cgroup_classid_curr_proto = {
+ const struct bpf_func_proto bpf_get_cgroup_classid_curr_proto = {
        .func           = bpf_get_cgroup_classid_curr,
        .gpl_only       = false,
        .ret_type       = RET_INTEGER,
@@@ -4489,7 -4488,8 +4489,8 @@@ BPF_CALL_4(bpf_skb_get_tunnel_key, stru
        void *to_orig = to;
        int err;
  
-       if (unlikely(!info || (flags & ~(BPF_F_TUNINFO_IPV6)))) {
+       if (unlikely(!info || (flags & ~(BPF_F_TUNINFO_IPV6 |
+                                        BPF_F_TUNINFO_FLAGS)))) {
                err = -EINVAL;
                goto err_clear;
        }
@@@ -4521,7 -4521,10 +4522,10 @@@ set_compat
        to->tunnel_id = be64_to_cpu(info->key.tun_id);
        to->tunnel_tos = info->key.tos;
        to->tunnel_ttl = info->key.ttl;
-       to->tunnel_ext = 0;
+       if (flags & BPF_F_TUNINFO_FLAGS)
+               to->tunnel_flags = info->key.tun_flags;
+       else
+               to->tunnel_ext = 0;
  
        if (flags & BPF_F_TUNINFO_IPV6) {
                memcpy(to->remote_ipv6, &info->key.u.ipv6.src,
@@@ -5014,359 -5017,259 +5018,259 @@@ static const struct bpf_func_proto bpf_
        .arg1_type      = ARG_PTR_TO_CTX,
  };
  
- static int __bpf_setsockopt(struct sock *sk, int level, int optname,
-                           char *optval, int optlen)
- {
-       char devname[IFNAMSIZ];
-       int val, valbool;
-       struct net *net;
-       int ifindex;
-       int ret = 0;
-       if (!sk_fullsock(sk))
+ static int sol_socket_sockopt(struct sock *sk, int optname,
+                             char *optval, int *optlen,
+                             bool getopt)
+ {
+       switch (optname) {
+       case SO_REUSEADDR:
+       case SO_SNDBUF:
+       case SO_RCVBUF:
+       case SO_KEEPALIVE:
+       case SO_PRIORITY:
+       case SO_REUSEPORT:
+       case SO_RCVLOWAT:
+       case SO_MARK:
+       case SO_MAX_PACING_RATE:
+       case SO_BINDTOIFINDEX:
+       case SO_TXREHASH:
+               if (*optlen != sizeof(int))
+                       return -EINVAL;
+               break;
+       case SO_BINDTODEVICE:
+               break;
+       default:
                return -EINVAL;
+       }
  
-       if (level == SOL_SOCKET) {
-               if (optlen != sizeof(int) && optname != SO_BINDTODEVICE)
+       if (getopt) {
+               if (optname == SO_BINDTODEVICE)
                        return -EINVAL;
-               val = *((int *)optval);
-               valbool = val ? 1 : 0;
-               /* Only some socketops are supported */
-               switch (optname) {
-               case SO_RCVBUF:
-                       val = min_t(u32, val, READ_ONCE(sysctl_rmem_max));
-                       val = min_t(int, val, INT_MAX / 2);
-                       sk->sk_userlocks |= SOCK_RCVBUF_LOCK;
-                       WRITE_ONCE(sk->sk_rcvbuf,
-                                  max_t(int, val * 2, SOCK_MIN_RCVBUF));
-                       break;
-               case SO_SNDBUF:
-                       val = min_t(u32, val, READ_ONCE(sysctl_wmem_max));
-                       val = min_t(int, val, INT_MAX / 2);
-                       sk->sk_userlocks |= SOCK_SNDBUF_LOCK;
-                       WRITE_ONCE(sk->sk_sndbuf,
-                                  max_t(int, val * 2, SOCK_MIN_SNDBUF));
-                       break;
-               case SO_MAX_PACING_RATE: /* 32bit version */
-                       if (val != ~0U)
-                               cmpxchg(&sk->sk_pacing_status,
-                                       SK_PACING_NONE,
-                                       SK_PACING_NEEDED);
-                       sk->sk_max_pacing_rate = (val == ~0U) ?
-                                                ~0UL : (unsigned int)val;
-                       sk->sk_pacing_rate = min(sk->sk_pacing_rate,
-                                                sk->sk_max_pacing_rate);
-                       break;
-               case SO_PRIORITY:
-                       sk->sk_priority = val;
-                       break;
-               case SO_RCVLOWAT:
-                       if (val < 0)
-                               val = INT_MAX;
-                       if (sk->sk_socket && sk->sk_socket->ops->set_rcvlowat)
-                               ret = sk->sk_socket->ops->set_rcvlowat(sk, val);
-                       else
-                               WRITE_ONCE(sk->sk_rcvlowat, val ? : 1);
-                       break;
-               case SO_MARK:
-                       if (sk->sk_mark != val) {
-                               sk->sk_mark = val;
-                               sk_dst_reset(sk);
-                       }
-                       break;
-               case SO_BINDTODEVICE:
-                       optlen = min_t(long, optlen, IFNAMSIZ - 1);
-                       strncpy(devname, optval, optlen);
-                       devname[optlen] = 0;
+               return sk_getsockopt(sk, SOL_SOCKET, optname,
+                                    KERNEL_SOCKPTR(optval),
+                                    KERNEL_SOCKPTR(optlen));
+       }
  
-                       ifindex = 0;
-                       if (devname[0] != '\0') {
-                               struct net_device *dev;
+       return sk_setsockopt(sk, SOL_SOCKET, optname,
+                            KERNEL_SOCKPTR(optval), *optlen);
+ }
  
-                               ret = -ENODEV;
+ static int bpf_sol_tcp_setsockopt(struct sock *sk, int optname,
+                                 char *optval, int optlen)
+ {
+       struct tcp_sock *tp = tcp_sk(sk);
+       unsigned long timeout;
+       int val;
  
-                               net = sock_net(sk);
-                               dev = dev_get_by_name(net, devname);
-                               if (!dev)
-                                       break;
-                               ifindex = dev->ifindex;
-                               dev_put(dev);
-                       }
-                       fallthrough;
-               case SO_BINDTOIFINDEX:
-                       if (optname == SO_BINDTOIFINDEX)
-                               ifindex = val;
-                       ret = sock_bindtoindex(sk, ifindex, false);
-                       break;
-               case SO_KEEPALIVE:
-                       if (sk->sk_prot->keepalive)
-                               sk->sk_prot->keepalive(sk, valbool);
-                       sock_valbool_flag(sk, SOCK_KEEPOPEN, valbool);
-                       break;
-               case SO_REUSEPORT:
-                       sk->sk_reuseport = valbool;
-                       break;
-               case SO_TXREHASH:
-                       if (val < -1 || val > 1) {
-                               ret = -EINVAL;
-                               break;
-                       }
-                       sk->sk_txrehash = (u8)val;
-                       break;
-               default:
-                       ret = -EINVAL;
-               }
- #ifdef CONFIG_INET
-       } else if (level == SOL_IP) {
-               if (optlen != sizeof(int) || sk->sk_family != AF_INET)
-                       return -EINVAL;
+       if (optlen != sizeof(int))
+               return -EINVAL;
  
-               val = *((int *)optval);
-               /* Only some options are supported */
-               switch (optname) {
-               case IP_TOS:
-                       if (val < -1 || val > 0xff) {
-                               ret = -EINVAL;
-                       } else {
-                               struct inet_sock *inet = inet_sk(sk);
+       val = *(int *)optval;
  
-                               if (val == -1)
-                                       val = 0;
-                               inet->tos = val;
-                       }
-                       break;
-               default:
-                       ret = -EINVAL;
-               }
- #if IS_ENABLED(CONFIG_IPV6)
-       } else if (level == SOL_IPV6) {
-               if (optlen != sizeof(int) || sk->sk_family != AF_INET6)
+       /* Only some options are supported */
+       switch (optname) {
+       case TCP_BPF_IW:
+               if (val <= 0 || tp->data_segs_out > tp->syn_data)
+                       return -EINVAL;
+               tcp_snd_cwnd_set(tp, val);
+               break;
+       case TCP_BPF_SNDCWND_CLAMP:
+               if (val <= 0)
+                       return -EINVAL;
+               tp->snd_cwnd_clamp = val;
+               tp->snd_ssthresh = val;
+               break;
+       case TCP_BPF_DELACK_MAX:
+               timeout = usecs_to_jiffies(val);
+               if (timeout > TCP_DELACK_MAX ||
+                   timeout < TCP_TIMEOUT_MIN)
+                       return -EINVAL;
+               inet_csk(sk)->icsk_delack_max = timeout;
+               break;
+       case TCP_BPF_RTO_MIN:
+               timeout = usecs_to_jiffies(val);
+               if (timeout > TCP_RTO_MIN ||
+                   timeout < TCP_TIMEOUT_MIN)
                        return -EINVAL;
+               inet_csk(sk)->icsk_rto_min = timeout;
+               break;
+       default:
+               return -EINVAL;
+       }
  
-               val = *((int *)optval);
-               /* Only some options are supported */
-               switch (optname) {
-               case IPV6_TCLASS:
-                       if (val < -1 || val > 0xff) {
-                               ret = -EINVAL;
-                       } else {
-                               struct ipv6_pinfo *np = inet6_sk(sk);
+       return 0;
+ }
  
-                               if (val == -1)
-                                       val = 0;
-                               np->tclass = val;
-                       }
-                       break;
-               default:
-                       ret = -EINVAL;
-               }
- #endif
-       } else if (level == SOL_TCP &&
-                  sk->sk_prot->setsockopt == tcp_setsockopt) {
-               if (optname == TCP_CONGESTION) {
-                       char name[TCP_CA_NAME_MAX];
+ static int sol_tcp_sockopt(struct sock *sk, int optname,
+                          char *optval, int *optlen,
+                          bool getopt)
+ {
+       if (sk->sk_prot->setsockopt != tcp_setsockopt)
+               return -EINVAL;
  
-                       strncpy(name, optval, min_t(long, optlen,
-                                                   TCP_CA_NAME_MAX-1));
-                       name[TCP_CA_NAME_MAX-1] = 0;
-                       ret = tcp_set_congestion_control(sk, name, false, true);
-               } else {
-                       struct inet_connection_sock *icsk = inet_csk(sk);
+       switch (optname) {
+       case TCP_NODELAY:
+       case TCP_MAXSEG:
+       case TCP_KEEPIDLE:
+       case TCP_KEEPINTVL:
+       case TCP_KEEPCNT:
+       case TCP_SYNCNT:
+       case TCP_WINDOW_CLAMP:
+       case TCP_THIN_LINEAR_TIMEOUTS:
+       case TCP_USER_TIMEOUT:
+       case TCP_NOTSENT_LOWAT:
+       case TCP_SAVE_SYN:
+               if (*optlen != sizeof(int))
+                       return -EINVAL;
+               break;
+       case TCP_CONGESTION:
+               if (*optlen < 2)
+                       return -EINVAL;
+               break;
+       case TCP_SAVED_SYN:
+               if (*optlen < 1)
+                       return -EINVAL;
+               break;
+       default:
+               if (getopt)
+                       return -EINVAL;
+               return bpf_sol_tcp_setsockopt(sk, optname, optval, *optlen);
+       }
+       if (getopt) {
+               if (optname == TCP_SAVED_SYN) {
                        struct tcp_sock *tp = tcp_sk(sk);
-                       unsigned long timeout;
  
-                       if (optlen != sizeof(int))
+                       if (!tp->saved_syn ||
+                           *optlen > tcp_saved_syn_len(tp->saved_syn))
                                return -EINVAL;
+                       memcpy(optval, tp->saved_syn->data, *optlen);
+                       /* It cannot free tp->saved_syn here because it
+                        * does not know if the user space still needs it.
+                        */
+                       return 0;
+               }
  
-                       val = *((int *)optval);
-                       /* Only some options are supported */
-                       switch (optname) {
-                       case TCP_BPF_IW:
-                               if (val <= 0 || tp->data_segs_out > tp->syn_data)
-                                       ret = -EINVAL;
-                               else
-                                       tcp_snd_cwnd_set(tp, val);
-                               break;
-                       case TCP_BPF_SNDCWND_CLAMP:
-                               if (val <= 0) {
-                                       ret = -EINVAL;
-                               } else {
-                                       tp->snd_cwnd_clamp = val;
-                                       tp->snd_ssthresh = val;
-                               }
-                               break;
-                       case TCP_BPF_DELACK_MAX:
-                               timeout = usecs_to_jiffies(val);
-                               if (timeout > TCP_DELACK_MAX ||
-                                   timeout < TCP_TIMEOUT_MIN)
-                                       return -EINVAL;
-                               inet_csk(sk)->icsk_delack_max = timeout;
-                               break;
-                       case TCP_BPF_RTO_MIN:
-                               timeout = usecs_to_jiffies(val);
-                               if (timeout > TCP_RTO_MIN ||
-                                   timeout < TCP_TIMEOUT_MIN)
-                                       return -EINVAL;
-                               inet_csk(sk)->icsk_rto_min = timeout;
-                               break;
-                       case TCP_SAVE_SYN:
-                               if (val < 0 || val > 1)
-                                       ret = -EINVAL;
-                               else
-                                       tp->save_syn = val;
-                               break;
-                       case TCP_KEEPIDLE:
-                               ret = tcp_sock_set_keepidle_locked(sk, val);
-                               break;
-                       case TCP_KEEPINTVL:
-                               if (val < 1 || val > MAX_TCP_KEEPINTVL)
-                                       ret = -EINVAL;
-                               else
-                                       tp->keepalive_intvl = val * HZ;
-                               break;
-                       case TCP_KEEPCNT:
-                               if (val < 1 || val > MAX_TCP_KEEPCNT)
-                                       ret = -EINVAL;
-                               else
-                                       tp->keepalive_probes = val;
-                               break;
-                       case TCP_SYNCNT:
-                               if (val < 1 || val > MAX_TCP_SYNCNT)
-                                       ret = -EINVAL;
-                               else
-                                       icsk->icsk_syn_retries = val;
-                               break;
-                       case TCP_USER_TIMEOUT:
-                               if (val < 0)
-                                       ret = -EINVAL;
-                               else
-                                       icsk->icsk_user_timeout = val;
-                               break;
-                       case TCP_NOTSENT_LOWAT:
-                               tp->notsent_lowat = val;
-                               sk->sk_write_space(sk);
-                               break;
-                       case TCP_WINDOW_CLAMP:
-                               ret = tcp_set_window_clamp(sk, val);
-                               break;
-                       default:
-                               ret = -EINVAL;
-                       }
+               if (optname == TCP_CONGESTION) {
+                       if (!inet_csk(sk)->icsk_ca_ops)
+                               return -EINVAL;
+                       /* BPF expects NULL-terminated tcp-cc string */
+                       optval[--(*optlen)] = '\0';
                }
- #endif
-       } else {
-               ret = -EINVAL;
+               return do_tcp_getsockopt(sk, SOL_TCP, optname,
+                                        KERNEL_SOCKPTR(optval),
+                                        KERNEL_SOCKPTR(optlen));
        }
-       return ret;
+       return do_tcp_setsockopt(sk, SOL_TCP, optname,
+                                KERNEL_SOCKPTR(optval), *optlen);
  }
  
- static int _bpf_setsockopt(struct sock *sk, int level, int optname,
-                          char *optval, int optlen)
+ static int sol_ip_sockopt(struct sock *sk, int optname,
+                         char *optval, int *optlen,
+                         bool getopt)
  {
-       if (sk_fullsock(sk))
-               sock_owned_by_me(sk);
-       return __bpf_setsockopt(sk, level, optname, optval, optlen);
+       if (sk->sk_family != AF_INET)
+               return -EINVAL;
+       switch (optname) {
+       case IP_TOS:
+               if (*optlen != sizeof(int))
+                       return -EINVAL;
+               break;
+       default:
+               return -EINVAL;
+       }
+       if (getopt)
+               return do_ip_getsockopt(sk, SOL_IP, optname,
+                                       KERNEL_SOCKPTR(optval),
+                                       KERNEL_SOCKPTR(optlen));
+       return do_ip_setsockopt(sk, SOL_IP, optname,
+                               KERNEL_SOCKPTR(optval), *optlen);
  }
  
- static int __bpf_getsockopt(struct sock *sk, int level, int optname,
-                           char *optval, int optlen)
+ static int sol_ipv6_sockopt(struct sock *sk, int optname,
+                           char *optval, int *optlen,
+                           bool getopt)
  {
-       if (!sk_fullsock(sk))
-               goto err_clear;
+       if (sk->sk_family != AF_INET6)
+               return -EINVAL;
  
-       if (level == SOL_SOCKET) {
-               if (optlen != sizeof(int))
-                       goto err_clear;
+       switch (optname) {
+       case IPV6_TCLASS:
+       case IPV6_AUTOFLOWLABEL:
+               if (*optlen != sizeof(int))
+                       return -EINVAL;
+               break;
+       default:
+               return -EINVAL;
+       }
  
-               switch (optname) {
-               case SO_RCVBUF:
-                       *((int *)optval) = sk->sk_rcvbuf;
-                       break;
-               case SO_SNDBUF:
-                       *((int *)optval) = sk->sk_sndbuf;
-                       break;
-               case SO_MARK:
-                       *((int *)optval) = sk->sk_mark;
-                       break;
-               case SO_PRIORITY:
-                       *((int *)optval) = sk->sk_priority;
-                       break;
-               case SO_BINDTOIFINDEX:
-                       *((int *)optval) = sk->sk_bound_dev_if;
-                       break;
-               case SO_REUSEPORT:
-                       *((int *)optval) = sk->sk_reuseport;
-                       break;
-               case SO_TXREHASH:
-                       *((int *)optval) = sk->sk_txrehash;
-                       break;
-               default:
-                       goto err_clear;
-               }
- #ifdef CONFIG_INET
-       } else if (level == SOL_TCP && sk->sk_prot->getsockopt == tcp_getsockopt) {
-               struct inet_connection_sock *icsk;
-               struct tcp_sock *tp;
+       if (getopt)
+               return ipv6_bpf_stub->ipv6_getsockopt(sk, SOL_IPV6, optname,
+                                                     KERNEL_SOCKPTR(optval),
+                                                     KERNEL_SOCKPTR(optlen));
  
-               switch (optname) {
-               case TCP_CONGESTION:
-                       icsk = inet_csk(sk);
+       return ipv6_bpf_stub->ipv6_setsockopt(sk, SOL_IPV6, optname,
+                                             KERNEL_SOCKPTR(optval), *optlen);
+ }
  
-                       if (!icsk->icsk_ca_ops || optlen <= 1)
-                               goto err_clear;
-                       strncpy(optval, icsk->icsk_ca_ops->name, optlen);
-                       optval[optlen - 1] = 0;
-                       break;
-               case TCP_SAVED_SYN:
-                       tp = tcp_sk(sk);
+ static int __bpf_setsockopt(struct sock *sk, int level, int optname,
+                           char *optval, int optlen)
+ {
+       if (!sk_fullsock(sk))
+               return -EINVAL;
  
-                       if (optlen <= 0 || !tp->saved_syn ||
-                           optlen > tcp_saved_syn_len(tp->saved_syn))
-                               goto err_clear;
-                       memcpy(optval, tp->saved_syn->data, optlen);
-                       break;
-               default:
-                       goto err_clear;
-               }
-       } else if (level == SOL_IP) {
-               struct inet_sock *inet = inet_sk(sk);
+       if (level == SOL_SOCKET)
+               return sol_socket_sockopt(sk, optname, optval, &optlen, false);
+       else if (IS_ENABLED(CONFIG_INET) && level == SOL_IP)
+               return sol_ip_sockopt(sk, optname, optval, &optlen, false);
+       else if (IS_ENABLED(CONFIG_IPV6) && level == SOL_IPV6)
+               return sol_ipv6_sockopt(sk, optname, optval, &optlen, false);
+       else if (IS_ENABLED(CONFIG_INET) && level == SOL_TCP)
+               return sol_tcp_sockopt(sk, optname, optval, &optlen, false);
  
-               if (optlen != sizeof(int) || sk->sk_family != AF_INET)
-                       goto err_clear;
+       return -EINVAL;
+ }
  
-               /* Only some options are supported */
-               switch (optname) {
-               case IP_TOS:
-                       *((int *)optval) = (int)inet->tos;
-                       break;
-               default:
-                       goto err_clear;
-               }
- #if IS_ENABLED(CONFIG_IPV6)
-       } else if (level == SOL_IPV6) {
-               struct ipv6_pinfo *np = inet6_sk(sk);
+ static int _bpf_setsockopt(struct sock *sk, int level, int optname,
+                          char *optval, int optlen)
+ {
+       if (sk_fullsock(sk))
+               sock_owned_by_me(sk);
+       return __bpf_setsockopt(sk, level, optname, optval, optlen);
+ }
  
-               if (optlen != sizeof(int) || sk->sk_family != AF_INET6)
-                       goto err_clear;
+ static int __bpf_getsockopt(struct sock *sk, int level, int optname,
+                           char *optval, int optlen)
+ {
+       int err, saved_optlen = optlen;
  
-               /* Only some options are supported */
-               switch (optname) {
-               case IPV6_TCLASS:
-                       *((int *)optval) = (int)np->tclass;
-                       break;
-               default:
-                       goto err_clear;
-               }
- #endif
- #endif
-       } else {
-               goto err_clear;
+       if (!sk_fullsock(sk)) {
+               err = -EINVAL;
+               goto done;
        }
-       return 0;
- err_clear:
-       memset(optval, 0, optlen);
-       return -EINVAL;
+       if (level == SOL_SOCKET)
+               err = sol_socket_sockopt(sk, optname, optval, &optlen, true);
+       else if (IS_ENABLED(CONFIG_INET) && level == SOL_TCP)
+               err = sol_tcp_sockopt(sk, optname, optval, &optlen, true);
+       else if (IS_ENABLED(CONFIG_INET) && level == SOL_IP)
+               err = sol_ip_sockopt(sk, optname, optval, &optlen, true);
+       else if (IS_ENABLED(CONFIG_IPV6) && level == SOL_IPV6)
+               err = sol_ipv6_sockopt(sk, optname, optval, &optlen, true);
+       else
+               err = -EINVAL;
+ done:
+       if (err)
+               optlen = 0;
+       if (optlen < saved_optlen)
+               memset(optval + optlen, 0, saved_optlen - optlen);
+       return err;
  }
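
Note the error path here: any bytes the handler did not write are cleared before returning, so a failed or shortened bpf_getsockopt() never leaks stale kernel memory into the BPF program. Continuing the hedged sketch above, inside the same sockops program the option can be read back with:

          char cc[16];    /* TCP_CA_NAME_MAX */

          if (!bpf_getsockopt(skops, SOL_TCP, TCP_CONGESTION, cc, sizeof(cc)))
                  bpf_printk("cc: %s", cc);  /* NUL-terminated by sol_tcp_sockopt() */
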
  
  static int _bpf_getsockopt(struct sock *sk, int level, int optname,
@@@ -7667,34 -7570,23 +7571,23 @@@ const struct bpf_func_proto bpf_sk_stor
  static const struct bpf_func_proto *
  sock_filter_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
  {
+       const struct bpf_func_proto *func_proto;
+
+       func_proto = cgroup_common_func_proto(func_id, prog);
+       if (func_proto)
+               return func_proto;
+
+       func_proto = cgroup_current_func_proto(func_id, prog);
+       if (func_proto)
+               return func_proto;
+
        switch (func_id) {
-       /* inet and inet6 sockets are created in a process
-        * context so there is always a valid uid/gid
-        */
-       case BPF_FUNC_get_current_uid_gid:
-               return &bpf_get_current_uid_gid_proto;
-       case BPF_FUNC_get_local_storage:
-               return &bpf_get_local_storage_proto;
        case BPF_FUNC_get_socket_cookie:
                return &bpf_get_socket_cookie_sock_proto;
        case BPF_FUNC_get_netns_cookie:
                return &bpf_get_netns_cookie_sock_proto;
        case BPF_FUNC_perf_event_output:
                return &bpf_event_output_data_proto;
-       case BPF_FUNC_get_current_pid_tgid:
-               return &bpf_get_current_pid_tgid_proto;
-       case BPF_FUNC_get_current_comm:
-               return &bpf_get_current_comm_proto;
- #ifdef CONFIG_CGROUPS
-       case BPF_FUNC_get_current_cgroup_id:
-               return &bpf_get_current_cgroup_id_proto;
-       case BPF_FUNC_get_current_ancestor_cgroup_id:
-               return &bpf_get_current_ancestor_cgroup_id_proto;
- #endif
- #ifdef CONFIG_CGROUP_NET_CLASSID
-       case BPF_FUNC_get_cgroup_classid:
-               return &bpf_get_cgroup_classid_curr_proto;
- #endif
        case BPF_FUNC_sk_storage_get:
                return &bpf_sk_storage_get_cg_sock_proto;
        case BPF_FUNC_ktime_get_coarse_ns:
  static const struct bpf_func_proto *
  sock_addr_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
  {
+       const struct bpf_func_proto *func_proto;
+
+       func_proto = cgroup_common_func_proto(func_id, prog);
+       if (func_proto)
+               return func_proto;
+
+       func_proto = cgroup_current_func_proto(func_id, prog);
+       if (func_proto)
+               return func_proto;
+
        switch (func_id) {
-       /* inet and inet6 sockets are created in a process
-        * context so there is always a valid uid/gid
-        */
-       case BPF_FUNC_get_current_uid_gid:
-               return &bpf_get_current_uid_gid_proto;
        case BPF_FUNC_bind:
                switch (prog->expected_attach_type) {
                case BPF_CGROUP_INET4_CONNECT:
                return &bpf_get_socket_cookie_sock_addr_proto;
        case BPF_FUNC_get_netns_cookie:
                return &bpf_get_netns_cookie_sock_addr_proto;
-       case BPF_FUNC_get_local_storage:
-               return &bpf_get_local_storage_proto;
        case BPF_FUNC_perf_event_output:
                return &bpf_event_output_data_proto;
-       case BPF_FUNC_get_current_pid_tgid:
-               return &bpf_get_current_pid_tgid_proto;
-       case BPF_FUNC_get_current_comm:
-               return &bpf_get_current_comm_proto;
- #ifdef CONFIG_CGROUPS
-       case BPF_FUNC_get_current_cgroup_id:
-               return &bpf_get_current_cgroup_id_proto;
-       case BPF_FUNC_get_current_ancestor_cgroup_id:
-               return &bpf_get_current_ancestor_cgroup_id_proto;
- #endif
- #ifdef CONFIG_CGROUP_NET_CLASSID
-       case BPF_FUNC_get_cgroup_classid:
-               return &bpf_get_cgroup_classid_curr_proto;
- #endif
  #ifdef CONFIG_INET
        case BPF_FUNC_sk_lookup_tcp:
                return &bpf_sock_addr_sk_lookup_tcp_proto;
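
The per-program-type switches above no longer list the current-task and cgroup helpers; those are now resolved through the shared cgroup_common_func_proto()/cgroup_current_func_proto() lookups, so existing programs keep working unchanged. A hedged example (not from the patch, same conventions as the earlier sketch) of a program relying on one of the moved helpers:

  SEC("cgroup/connect4")
  int deny_non_root_connect(struct bpf_sock_addr *ctx)
  {
          /* uid is in the lower 32 bits of the returned value */
          if ((__u32)bpf_get_current_uid_gid() != 0)
                  return 0;       /* reject connect() for non-root */
          return 1;
  }
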
@@@ -7823,9 -7704,13 +7705,13 @@@ const struct bpf_func_proto bpf_sk_stor
  static const struct bpf_func_proto *
  cg_skb_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
  {
+       const struct bpf_func_proto *func_proto;
+
+       func_proto = cgroup_common_func_proto(func_id, prog);
+       if (func_proto)
+               return func_proto;
+
        switch (func_id) {
-       case BPF_FUNC_get_local_storage:
-               return &bpf_get_local_storage_proto;
        case BPF_FUNC_sk_fullsock:
                return &bpf_sk_fullsock_proto;
        case BPF_FUNC_sk_storage_get:
@@@ -8065,6 -7950,12 +7951,12 @@@ const struct bpf_func_proto bpf_sock_ha
  static const struct bpf_func_proto *
  sock_ops_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
  {
+       const struct bpf_func_proto *func_proto;
+
+       func_proto = cgroup_common_func_proto(func_id, prog);
+       if (func_proto)
+               return func_proto;
+
        switch (func_id) {
        case BPF_FUNC_setsockopt:
                return &bpf_sock_ops_setsockopt_proto;
                return &bpf_sock_hash_update_proto;
        case BPF_FUNC_get_socket_cookie:
                return &bpf_get_socket_cookie_sock_ops_proto;
-       case BPF_FUNC_get_local_storage:
-               return &bpf_get_local_storage_proto;
        case BPF_FUNC_perf_event_output:
                return &bpf_event_output_data_proto;
        case BPF_FUNC_sk_storage_get:
@@@ -10812,14 -10701,13 +10702,13 @@@ int sk_detach_filter(struct sock *sk
  }
  EXPORT_SYMBOL_GPL(sk_detach_filter);
  
- int sk_get_filter(struct sock *sk, struct sock_filter __user *ubuf,
-                 unsigned int len)
+ int sk_get_filter(struct sock *sk, sockptr_t optval, unsigned int len)
  {
        struct sock_fprog_kern *fprog;
        struct sk_filter *filter;
        int ret = 0;
  
-       lock_sock(sk);
+       sockopt_lock_sock(sk);
        filter = rcu_dereference_protected(sk->sk_filter,
                                           lockdep_sock_is_held(sk));
        if (!filter)
                goto out;
  
        ret = -EFAULT;
-       if (copy_to_user(ubuf, fprog->filter, bpf_classic_proglen(fprog)))
+       if (copy_to_sockptr(optval, fprog->filter, bpf_classic_proglen(fprog)))
                goto out;
  
        /* Instead of bytes, the API requests to return the number
         */
        ret = fprog->len;
  out:
-       release_sock(sk);
+       sockopt_release_sock(sk);
        return ret;
  }
  
diff --combined net/core/sock.c
index 788c1372663cbabdd3d2dd0d0274d60b7d63dd2c,68e4662eb2ebf6b51845718cb86a6943be302d9b..eeb6cbac6f4998dbc41fc686e7e882135e45b9e3
@@@ -703,15 -703,17 +703,17 @@@ static int sock_setbindtodevice(struct 
                        goto out;
        }
  
-       return sock_bindtoindex(sk, index, true);
+       sockopt_lock_sock(sk);
+       ret = sock_bindtoindex_locked(sk, index);
+       sockopt_release_sock(sk);
  out:
  #endif
  
        return ret;
  }
  
- static int sock_getbindtodevice(struct sock *sk, char __user *optval,
-                               int __user *optlen, int len)
+ static int sock_getbindtodevice(struct sock *sk, sockptr_t optval,
+                               sockptr_t optlen, int len)
  {
        int ret = -ENOPROTOOPT;
  #ifdef CONFIG_NETDEVICES
        len = strlen(devname) + 1;
  
        ret = -EFAULT;
-       if (copy_to_user(optval, devname, len))
+       if (copy_to_sockptr(optval, devname, len))
                goto out;
  
  zero:
        ret = -EFAULT;
-       if (put_user(len, optlen))
+       if (copy_to_sockptr(optlen, &len, sizeof(int)))
                goto out;
  
        ret = 0;
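
The conversions in this file replace the raw __user pointer plumbing with sockptr_t, so the same getsockopt/setsockopt code can be handed either user or kernel memory. A hedged sketch of the behaviour relied on here (demo_write_int() is hypothetical; see include/linux/sockptr.h for the real helpers):

  #include <linux/sockptr.h>

  static int demo_write_int(sockptr_t dst, int val)
  {
          /* copy_to_user() for USER_SOCKPTR(), plain memcpy() for KERNEL_SOCKPTR() */
          return copy_to_sockptr(dst, &val, sizeof(val));
  }

  /* callers:
   *   demo_write_int(USER_SOCKPTR(uptr), 1);     syscall path
   *   demo_write_int(KERNEL_SOCKPTR(&kval), 1);  BPF / in-kernel path
   */
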
@@@ -1036,17 -1038,51 +1038,51 @@@ static int sock_reserve_memory(struct s
        return 0;
  }
  
+ void sockopt_lock_sock(struct sock *sk)
+ {
+       /* When current->bpf_ctx is set, the setsockopt is called from
+        * a bpf prog.  bpf has ensured the sk lock has been
+        * acquired before calling setsockopt().
+        */
+       if (has_current_bpf_ctx())
+               return;
+       lock_sock(sk);
+ }
+ EXPORT_SYMBOL(sockopt_lock_sock);
+
+ void sockopt_release_sock(struct sock *sk)
+ {
+       if (has_current_bpf_ctx())
+               return;
+       release_sock(sk);
+ }
+ EXPORT_SYMBOL(sockopt_release_sock);
+
+ bool sockopt_ns_capable(struct user_namespace *ns, int cap)
+ {
+       return has_current_bpf_ctx() || ns_capable(ns, cap);
+ }
+ EXPORT_SYMBOL(sockopt_ns_capable);
+
+ bool sockopt_capable(int cap)
+ {
+       return has_current_bpf_ctx() || capable(cap);
+ }
+ EXPORT_SYMBOL(sockopt_capable);
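
A hedged sketch (hypothetical option handler, not a kernel function) of the pattern these helpers enable: the same code serves the setsockopt() syscall, which must take the socket lock and check capabilities, and the BPF helper path, which already holds the lock and was permission-checked at program load time:

  #include <net/sock.h>
  #include <linux/sockptr.h>

  static int demo_setsockopt(struct sock *sk, sockptr_t optval, unsigned int optlen)
  {
          int val;

          if (optlen < sizeof(val))
                  return -EINVAL;
          if (copy_from_sockptr(&val, optval, sizeof(val)))
                  return -EFAULT;
          if (!sockopt_capable(CAP_NET_ADMIN))    /* always true from a BPF prog */
                  return -EPERM;

          sockopt_lock_sock(sk);                  /* no-op when called from BPF */
          /* ... modify sk under the socket lock ... */
          sockopt_release_sock(sk);
          return 0;
  }
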
  /*
   *    This is meant for all protocols to use and covers goings on
   *    at the socket level. Everything here is generic.
   */
  
- int sock_setsockopt(struct socket *sock, int level, int optname,
-                   sockptr_t optval, unsigned int optlen)
+ int sk_setsockopt(struct sock *sk, int level, int optname,
+                 sockptr_t optval, unsigned int optlen)
  {
        struct so_timestamping timestamping;
+       struct socket *sock = sk->sk_socket;
        struct sock_txtime sk_txtime;
-       struct sock *sk = sock->sk;
        int val;
        int valbool;
        struct linger ling;
  
        valbool = val ? 1 : 0;
  
-       lock_sock(sk);
+       sockopt_lock_sock(sk);
  
        switch (optname) {
        case SO_DEBUG:
-               if (val && !capable(CAP_NET_ADMIN))
+               if (val && !sockopt_capable(CAP_NET_ADMIN))
                        ret = -EACCES;
                else
                        sock_valbool_flag(sk, SOCK_DBG, valbool);
                 * play 'guess the biggest size' games. RCVBUF/SNDBUF
                 * are treated in BSD as hints
                 */
 -              val = min_t(u32, val, sysctl_wmem_max);
 +              val = min_t(u32, val, READ_ONCE(sysctl_wmem_max));
  set_sndbuf:
                /* Ensure val * 2 fits into an int, to prevent max_t()
                 * from treating it as a negative value.
                break;
  
        case SO_SNDBUFFORCE:
-               if (!capable(CAP_NET_ADMIN)) {
+               if (!sockopt_capable(CAP_NET_ADMIN)) {
                        ret = -EPERM;
                        break;
                }
                 * play 'guess the biggest size' games. RCVBUF/SNDBUF
                 * are treated in BSD as hints
                 */
 -              __sock_set_rcvbuf(sk, min_t(u32, val, sysctl_rmem_max));
 +              __sock_set_rcvbuf(sk, min_t(u32, val, READ_ONCE(sysctl_rmem_max)));
                break;
  
        case SO_RCVBUFFORCE:
-               if (!capable(CAP_NET_ADMIN)) {
+               if (!sockopt_capable(CAP_NET_ADMIN)) {
                        ret = -EPERM;
                        break;
                }
  
        case SO_PRIORITY:
                if ((val >= 0 && val <= 6) ||
-                   ns_capable(sock_net(sk)->user_ns, CAP_NET_RAW) ||
-                   ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN))
+                   sockopt_ns_capable(sock_net(sk)->user_ns, CAP_NET_RAW) ||
+                   sockopt_ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN))
                        sk->sk_priority = val;
                else
                        ret = -EPERM;
        case SO_RCVLOWAT:
                if (val < 0)
                        val = INT_MAX;
-               if (sock->ops->set_rcvlowat)
+               if (sock && sock->ops->set_rcvlowat)
                        ret = sock->ops->set_rcvlowat(sk, val);
                else
                        WRITE_ONCE(sk->sk_rcvlowat, val ? : 1);
                        clear_bit(SOCK_PASSSEC, &sock->flags);
                break;
        case SO_MARK:
-               if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_RAW) &&
-                   !ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN)) {
+               if (!sockopt_ns_capable(sock_net(sk)->user_ns, CAP_NET_RAW) &&
+                   !sockopt_ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN)) {
                        ret = -EPERM;
                        break;
                }
                __sock_set_mark(sk, val);
                break;
        case SO_RCVMARK:
-               if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_RAW) &&
-                   !ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN)) {
+               if (!sockopt_ns_capable(sock_net(sk)->user_ns, CAP_NET_RAW) &&
+                   !sockopt_ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN)) {
                        ret = -EPERM;
                        break;
                }
  #ifdef CONFIG_NET_RX_BUSY_POLL
        case SO_BUSY_POLL:
                /* allow unprivileged users to decrease the value */
-               if ((val > sk->sk_ll_usec) && !capable(CAP_NET_ADMIN))
+               if ((val > sk->sk_ll_usec) && !sockopt_capable(CAP_NET_ADMIN))
                        ret = -EPERM;
                else {
                        if (val < 0)
                }
                break;
        case SO_PREFER_BUSY_POLL:
-               if (valbool && !capable(CAP_NET_ADMIN))
+               if (valbool && !sockopt_capable(CAP_NET_ADMIN))
                        ret = -EPERM;
                else
                        WRITE_ONCE(sk->sk_prefer_busy_poll, valbool);
                break;
        case SO_BUSY_POLL_BUDGET:
-               if (val > READ_ONCE(sk->sk_busy_poll_budget) && !capable(CAP_NET_ADMIN)) {
+               if (val > READ_ONCE(sk->sk_busy_poll_budget) && !sockopt_capable(CAP_NET_ADMIN)) {
                        ret = -EPERM;
                } else {
                        if (val < 0 || val > U16_MAX)
                 * scheduler has enough safe guards.
                 */
                if (sk_txtime.clockid != CLOCK_MONOTONIC &&
-                   !ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN)) {
+                   !sockopt_ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN)) {
                        ret = -EPERM;
                        break;
                }
                ret = -ENOPROTOOPT;
                break;
        }
-       release_sock(sk);
+       sockopt_release_sock(sk);
        return ret;
  }
+ int sock_setsockopt(struct socket *sock, int level, int optname,
+                   sockptr_t optval, unsigned int optlen)
+ {
+       return sk_setsockopt(sock->sk, level, optname,
+                            optval, optlen);
+ }
  EXPORT_SYMBOL(sock_setsockopt);
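
With sock_setsockopt() reduced to a thin wrapper, built-in kernel code that holds a struct sock (most notably the BPF helpers above) can call sk_setsockopt() directly with a kernel pointer. A hedged illustration only; the option and value are made up for the example:

  /* somewhere in built-in kernel code that already holds a valid sk */
  static void demo_enable_keepalive(struct sock *sk)
  {
          int one = 1;

          sk_setsockopt(sk, SOL_SOCKET, SO_KEEPALIVE,
                        KERNEL_SOCKPTR(&one), sizeof(one));
  }
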
  
  static const struct cred *sk_get_peer_cred(struct sock *sk)
@@@ -1525,22 -1568,25 +1568,25 @@@ static void cred_to_ucred(struct pid *p
        }
  }
  
- static int groups_to_user(gid_t __user *dst, const struct group_info *src)
+ static int groups_to_user(sockptr_t dst, const struct group_info *src)
  {
        struct user_namespace *user_ns = current_user_ns();
        int i;
  
-       for (i = 0; i < src->ngroups; i++)
-               if (put_user(from_kgid_munged(user_ns, src->gid[i]), dst + i))
+       for (i = 0; i < src->ngroups; i++) {
+               gid_t gid = from_kgid_munged(user_ns, src->gid[i]);
+               if (copy_to_sockptr_offset(dst, i * sizeof(gid), &gid, sizeof(gid)))
                        return -EFAULT;
+       }
  
        return 0;
  }
  
- int sock_getsockopt(struct socket *sock, int level, int optname,
-                   char __user *optval, int __user *optlen)
+ int sk_getsockopt(struct sock *sk, int level, int optname,
+                 sockptr_t optval, sockptr_t optlen)
  {
-       struct sock *sk = sock->sk;
+       struct socket *sock = sk->sk_socket;
  
        union {
                int val;
        int lv = sizeof(int);
        int len;
  
-       if (get_user(len, optlen))
+       if (copy_from_sockptr(&len, optlen, sizeof(int)))
                return -EFAULT;
        if (len < 0)
                return -EINVAL;
                cred_to_ucred(sk->sk_peer_pid, sk->sk_peer_cred, &peercred);
                spin_unlock(&sk->sk_peer_lock);
  
-               if (copy_to_user(optval, &peercred, len))
+               if (copy_to_sockptr(optval, &peercred, len))
                        return -EFAULT;
                goto lenout;
        }
                if (len < n * sizeof(gid_t)) {
                        len = n * sizeof(gid_t);
                        put_cred(cred);
-                       return put_user(len, optlen) ? -EFAULT : -ERANGE;
+                       return copy_to_sockptr(optlen, &len, sizeof(int)) ? -EFAULT : -ERANGE;
                }
                len = n * sizeof(gid_t);
  
-               ret = groups_to_user((gid_t __user *)optval, cred->group_info);
+               ret = groups_to_user(optval, cred->group_info);
                put_cred(cred);
                if (ret)
                        return ret;
                        return -ENOTCONN;
                if (lv < len)
                        return -EINVAL;
-               if (copy_to_user(optval, address, len))
+               if (copy_to_sockptr(optval, address, len))
                        return -EFAULT;
                goto lenout;
        }
                break;
  
        case SO_PEERSEC:
-               return security_socket_getpeersec_stream(sock, optval, optlen, len);
+               return security_socket_getpeersec_stream(sock, optval.user, optlen.user, len);
  
        case SO_MARK:
                v.val = sk->sk_mark;
                return sock_getbindtodevice(sk, optval, optlen, len);
  
        case SO_GET_FILTER:
-               len = sk_get_filter(sk, (struct sock_filter __user *)optval, len);
+               len = sk_get_filter(sk, optval, len);
                if (len < 0)
                        return len;
  
                sk_get_meminfo(sk, meminfo);
  
                len = min_t(unsigned int, len, sizeof(meminfo));
-               if (copy_to_user(optval, &meminfo, len))
+               if (copy_to_sockptr(optval, &meminfo, len))
                        return -EFAULT;
  
                goto lenout;
  
        if (len > lv)
                len = lv;
-       if (copy_to_user(optval, &v, len))
+       if (copy_to_sockptr(optval, &v, len))
                return -EFAULT;
  lenout:
-       if (put_user(len, optlen))
+       if (copy_to_sockptr(optlen, &len, sizeof(int)))
                return -EFAULT;
        return 0;
  }
  
+ int sock_getsockopt(struct socket *sock, int level, int optname,
+                   char __user *optval, int __user *optlen)
+ {
+       return sk_getsockopt(sock->sk, level, optname,
+                            USER_SOCKPTR(optval),
+                            USER_SOCKPTR(optlen));
+ }
  /*
   * Initialize an sk_lock.
   *
@@@ -2536,7 -2590,7 +2590,7 @@@ struct sk_buff *sock_omalloc(struct soc
  
        /* small safe race: SKB_TRUESIZE may differ from final skb->truesize */
        if (atomic_read(&sk->sk_omem_alloc) + SKB_TRUESIZE(size) >
 -          sysctl_optmem_max)
 +          READ_ONCE(sysctl_optmem_max))
                return NULL;
  
        skb = alloc_skb(size, priority);
   */
  void *sock_kmalloc(struct sock *sk, int size, gfp_t priority)
  {
 -      if ((unsigned int)size <= sysctl_optmem_max &&
 -          atomic_read(&sk->sk_omem_alloc) + size < sysctl_optmem_max) {
 +      int optmem_max = READ_ONCE(sysctl_optmem_max);
 +
 +      if ((unsigned int)size <= optmem_max &&
 +          atomic_read(&sk->sk_omem_alloc) + size < optmem_max) {
                void *mem;
                /* First do the add, to avoid the race if kmalloc
                 * might sleep.
@@@ -3311,8 -3363,8 +3365,8 @@@ void sock_init_data(struct socket *sock
        timer_setup(&sk->sk_timer, NULL, 0);
  
        sk->sk_allocation       =       GFP_KERNEL;
 -      sk->sk_rcvbuf           =       sysctl_rmem_default;
 -      sk->sk_sndbuf           =       sysctl_wmem_default;
 +      sk->sk_rcvbuf           =       READ_ONCE(sysctl_rmem_default);
 +      sk->sk_sndbuf           =       READ_ONCE(sysctl_wmem_default);
        sk->sk_state            =       TCP_CLOSE;
        sk_set_socket(sk, sock);
  
  
  #ifdef CONFIG_NET_RX_BUSY_POLL
        sk->sk_napi_id          =       0;
 -      sk->sk_ll_usec          =       sysctl_net_busy_read;
 +      sk->sk_ll_usec          =       READ_ONCE(sysctl_net_busy_read);
  #endif
  
        sk->sk_max_pacing_rate = ~0UL;
diff --combined net/ipv4/ip_sockglue.c
index e49a61a053a68838d9fb7fc0b1a9012477e19fd6,47830f3fea1be4f4135c07fa33debe2046074415..6e19cad154f5cdf7e4aecb745e0c85002d27369e
@@@ -772,7 -772,7 +772,7 @@@ static int ip_set_mcast_msfilter(struc
  
        if (optlen < GROUP_FILTER_SIZE(0))
                return -EINVAL;
 -      if (optlen > sysctl_optmem_max)
 +      if (optlen > READ_ONCE(sysctl_optmem_max))
                return -ENOBUFS;
  
        gsf = memdup_sockptr(optval, optlen);
@@@ -808,7 -808,7 +808,7 @@@ static int compat_ip_set_mcast_msfilter
  
        if (optlen < size0)
                return -EINVAL;
 -      if (optlen > sysctl_optmem_max - 4)
 +      if (optlen > READ_ONCE(sysctl_optmem_max) - 4)
                return -ENOBUFS;
  
        p = kmalloc(optlen + 4, GFP_KERNEL);
@@@ -888,8 -888,8 +888,8 @@@ static int compat_ip_mcast_join_leave(s
  
  DEFINE_STATIC_KEY_FALSE(ip4_min_ttl);
  
- static int do_ip_setsockopt(struct sock *sk, int level, int optname,
-               sockptr_t optval, unsigned int optlen)
+ int do_ip_setsockopt(struct sock *sk, int level, int optname,
+                    sockptr_t optval, unsigned int optlen)
  {
        struct inet_sock *inet = inet_sk(sk);
        struct net *net = sock_net(sk);
        err = 0;
        if (needs_rtnl)
                rtnl_lock();
-       lock_sock(sk);
+       sockopt_lock_sock(sk);
  
        switch (optname) {
        case IP_OPTIONS:
  
                if (optlen < IP_MSFILTER_SIZE(0))
                        goto e_inval;
 -              if (optlen > sysctl_optmem_max) {
 +              if (optlen > READ_ONCE(sysctl_optmem_max)) {
                        err = -ENOBUFS;
                        break;
                }
        case IP_IPSEC_POLICY:
        case IP_XFRM_POLICY:
                err = -EPERM;
-               if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN))
+               if (!sockopt_ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN))
                        break;
                err = xfrm_user_policy(sk, optname, optval, optlen);
                break;
  
        case IP_TRANSPARENT:
-               if (!!val && !ns_capable(sock_net(sk)->user_ns, CAP_NET_RAW) &&
-                   !ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN)) {
+               if (!!val && !sockopt_ns_capable(sock_net(sk)->user_ns, CAP_NET_RAW) &&
+                   !sockopt_ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN)) {
                        err = -EPERM;
                        break;
                }
                err = -ENOPROTOOPT;
                break;
        }
-       release_sock(sk);
+       sockopt_release_sock(sk);
        if (needs_rtnl)
                rtnl_unlock();
        return err;
  
  e_inval:
-       release_sock(sk);
+       sockopt_release_sock(sk);
        if (needs_rtnl)
                rtnl_unlock();
        return -EINVAL;
@@@ -1462,37 -1462,37 +1462,37 @@@ static bool getsockopt_needs_rtnl(int o
        return false;
  }
  
- static int ip_get_mcast_msfilter(struct sock *sk, void __user *optval,
-               int __user *optlen, int len)
+ static int ip_get_mcast_msfilter(struct sock *sk, sockptr_t optval,
+                                sockptr_t optlen, int len)
  {
        const int size0 = offsetof(struct group_filter, gf_slist_flex);
-       struct group_filter __user *p = optval;
        struct group_filter gsf;
-       int num;
+       int num, gsf_size;
        int err;
  
        if (len < size0)
                return -EINVAL;
-       if (copy_from_user(&gsf, p, size0))
+       if (copy_from_sockptr(&gsf, optval, size0))
                return -EFAULT;
  
        num = gsf.gf_numsrc;
-       err = ip_mc_gsfget(sk, &gsf, p->gf_slist_flex);
+       err = ip_mc_gsfget(sk, &gsf, optval,
+                          offsetof(struct group_filter, gf_slist_flex));
        if (err)
                return err;
        if (gsf.gf_numsrc < num)
                num = gsf.gf_numsrc;
-       if (put_user(GROUP_FILTER_SIZE(num), optlen) ||
-           copy_to_user(p, &gsf, size0))
+       gsf_size = GROUP_FILTER_SIZE(num);
+       if (copy_to_sockptr(optlen, &gsf_size, sizeof(int)) ||
+           copy_to_sockptr(optval, &gsf, size0))
                return -EFAULT;
        return 0;
  }
  
- static int compat_ip_get_mcast_msfilter(struct sock *sk, void __user *optval,
-               int __user *optlen, int len)
+ static int compat_ip_get_mcast_msfilter(struct sock *sk, sockptr_t optval,
+                                       sockptr_t optlen, int len)
  {
        const int size0 = offsetof(struct compat_group_filter, gf_slist_flex);
-       struct compat_group_filter __user *p = optval;
        struct compat_group_filter gf32;
        struct group_filter gf;
        int num;
  
        if (len < size0)
                return -EINVAL;
-       if (copy_from_user(&gf32, p, size0))
+       if (copy_from_sockptr(&gf32, optval, size0))
                return -EFAULT;
  
        gf.gf_interface = gf32.gf_interface;
        num = gf.gf_numsrc = gf32.gf_numsrc;
        gf.gf_group = gf32.gf_group;
  
-       err = ip_mc_gsfget(sk, &gf, p->gf_slist_flex);
+       err = ip_mc_gsfget(sk, &gf, optval,
+                          offsetof(struct compat_group_filter, gf_slist_flex));
        if (err)
                return err;
        if (gf.gf_numsrc < num)
                num = gf.gf_numsrc;
        len = GROUP_FILTER_SIZE(num) - (sizeof(gf) - sizeof(gf32));
-       if (put_user(len, optlen) ||
-           put_user(gf.gf_fmode, &p->gf_fmode) ||
-           put_user(gf.gf_numsrc, &p->gf_numsrc))
+       if (copy_to_sockptr(optlen, &len, sizeof(int)) ||
+           copy_to_sockptr_offset(optval, offsetof(struct compat_group_filter, gf_fmode),
+                                  &gf.gf_fmode, sizeof(gf.gf_fmode)) ||
+           copy_to_sockptr_offset(optval, offsetof(struct compat_group_filter, gf_numsrc),
+                                  &gf.gf_numsrc, sizeof(gf.gf_numsrc)))
                return -EFAULT;
        return 0;
  }
  
- static int do_ip_getsockopt(struct sock *sk, int level, int optname,
-                           char __user *optval, int __user *optlen)
+ int do_ip_getsockopt(struct sock *sk, int level, int optname,
+                    sockptr_t optval, sockptr_t optlen)
  {
        struct inet_sock *inet = inet_sk(sk);
        bool needs_rtnl = getsockopt_needs_rtnl(optname);
        if (ip_mroute_opt(optname))
                return ip_mroute_getsockopt(sk, optname, optval, optlen);
  
-       if (get_user(len, optlen))
+       if (copy_from_sockptr(&len, optlen, sizeof(int)))
                return -EFAULT;
        if (len < 0)
                return -EINVAL;
  
        if (needs_rtnl)
                rtnl_lock();
-       lock_sock(sk);
+       sockopt_lock_sock(sk);
  
        switch (optname) {
        case IP_OPTIONS:
                        memcpy(optbuf, &inet_opt->opt,
                               sizeof(struct ip_options) +
                               inet_opt->opt.optlen);
-               release_sock(sk);
+               sockopt_release_sock(sk);
  
-               if (opt->optlen == 0)
-                       return put_user(0, optlen);
+               if (opt->optlen == 0) {
+                       len = 0;
+                       return copy_to_sockptr(optlen, &len, sizeof(int));
+               }
  
                ip_options_undo(opt);
  
                len = min_t(unsigned int, len, opt->optlen);
-               if (put_user(len, optlen))
+               if (copy_to_sockptr(optlen, &len, sizeof(int)))
                        return -EFAULT;
-               if (copy_to_user(optval, opt->__data, len))
+               if (copy_to_sockptr(optval, opt->__data, len))
                        return -EFAULT;
                return 0;
        }
                        dst_release(dst);
                }
                if (!val) {
-                       release_sock(sk);
+                       sockopt_release_sock(sk);
                        return -ENOTCONN;
                }
                break;
                struct in_addr addr;
                len = min_t(unsigned int, len, sizeof(struct in_addr));
                addr.s_addr = inet->mc_addr;
-               release_sock(sk);
+               sockopt_release_sock(sk);
  
-               if (put_user(len, optlen))
+               if (copy_to_sockptr(optlen, &len, sizeof(int)))
                        return -EFAULT;
-               if (copy_to_user(optval, &addr, len))
+               if (copy_to_sockptr(optval, &addr, len))
                        return -EFAULT;
                return 0;
        }
                        err = -EINVAL;
                        goto out;
                }
-               if (copy_from_user(&msf, optval, IP_MSFILTER_SIZE(0))) {
+               if (copy_from_sockptr(&msf, optval, IP_MSFILTER_SIZE(0))) {
                        err = -EFAULT;
                        goto out;
                }
-               err = ip_mc_msfget(sk, &msf,
-                                  (struct ip_msfilter __user *)optval, optlen);
+               err = ip_mc_msfget(sk, &msf, optval, optlen);
                goto out;
        }
        case MCAST_MSFILTER:
        {
                struct msghdr msg;
  
-               release_sock(sk);
+               sockopt_release_sock(sk);
  
                if (sk->sk_type != SOCK_STREAM)
                        return -ENOPROTOOPT;
  
-               msg.msg_control_is_user = true;
-               msg.msg_control_user = optval;
+               if (optval.is_kernel) {
+                       msg.msg_control_is_user = false;
+                       msg.msg_control = optval.kernel;
+               } else {
+                       msg.msg_control_is_user = true;
+                       msg.msg_control_user = optval.user;
+               }
                msg.msg_controllen = len;
                msg.msg_flags = in_compat_syscall() ? MSG_CMSG_COMPAT : 0;
  
                        put_cmsg(&msg, SOL_IP, IP_TOS, sizeof(tos), &tos);
                }
                len -= msg.msg_controllen;
-               return put_user(len, optlen);
+               return copy_to_sockptr(optlen, &len, sizeof(int));
        }
        case IP_FREEBIND:
                val = inet->freebind;
                val = inet->min_ttl;
                break;
        default:
-               release_sock(sk);
+               sockopt_release_sock(sk);
                return -ENOPROTOOPT;
        }
-       release_sock(sk);
+       sockopt_release_sock(sk);
  
        if (len < sizeof(int) && len > 0 && val >= 0 && val <= 255) {
                unsigned char ucval = (unsigned char)val;
                len = 1;
-               if (put_user(len, optlen))
+               if (copy_to_sockptr(optlen, &len, sizeof(int)))
                        return -EFAULT;
-               if (copy_to_user(optval, &ucval, 1))
+               if (copy_to_sockptr(optval, &ucval, 1))
                        return -EFAULT;
        } else {
                len = min_t(unsigned int, sizeof(int), len);
-               if (put_user(len, optlen))
+               if (copy_to_sockptr(optlen, &len, sizeof(int)))
                        return -EFAULT;
-               if (copy_to_user(optval, &val, len))
+               if (copy_to_sockptr(optval, &val, len))
                        return -EFAULT;
        }
        return 0;
  
  out:
-       release_sock(sk);
+       sockopt_release_sock(sk);
        if (needs_rtnl)
                rtnl_unlock();
        return err;
@@@ -1767,7 -1776,8 +1776,8 @@@ int ip_getsockopt(struct sock *sk, int 
  {
        int err;
  
-       err = do_ip_getsockopt(sk, level, optname, optval, optlen);
+       err = do_ip_getsockopt(sk, level, optname,
+                              USER_SOCKPTR(optval), USER_SOCKPTR(optlen));
  
  #if IS_ENABLED(CONFIG_BPFILTER_UMH)
        if (optname >= BPFILTER_IPT_SO_GET_INFO &&
diff --combined net/ipv4/tcp.c
index 306b94dedc8deb019f43f0c64966b8dc1de2bdda,a822cc627e2ad33b4bfc76ac2330d81c15eeec10..52b8879e7d20e20adf6b0802601e60845cce9681
@@@ -1000,7 -1000,7 +1000,7 @@@ new_segment
  
        i = skb_shinfo(skb)->nr_frags;
        can_coalesce = skb_can_coalesce(skb, i, page, offset);
 -      if (!can_coalesce && i >= sysctl_max_skb_frags) {
 +      if (!can_coalesce && i >= READ_ONCE(sysctl_max_skb_frags)) {
                tcp_mark_push(tp, skb);
                goto new_segment;
        }
@@@ -1354,7 -1354,7 +1354,7 @@@ new_segment
  
                        if (!skb_can_coalesce(skb, i, pfrag->page,
                                              pfrag->offset)) {
 -                              if (i >= sysctl_max_skb_frags) {
 +                              if (i >= READ_ONCE(sysctl_max_skb_frags)) {
                                        tcp_mark_push(tp, skb);
                                        goto new_segment;
                                }
@@@ -1567,11 -1567,17 +1567,11 @@@ static int tcp_peek_sndq(struct sock *s
   * calculation of whether or not we must ACK for the sake of
   * a window update.
   */
 -void tcp_cleanup_rbuf(struct sock *sk, int copied)
 +static void __tcp_cleanup_rbuf(struct sock *sk, int copied)
  {
        struct tcp_sock *tp = tcp_sk(sk);
        bool time_to_ack = false;
  
 -      struct sk_buff *skb = skb_peek(&sk->sk_receive_queue);
 -
 -      WARN(skb && !before(tp->copied_seq, TCP_SKB_CB(skb)->end_seq),
 -           "cleanup rbuf bug: copied %X seq %X rcvnxt %X\n",
 -           tp->copied_seq, TCP_SKB_CB(skb)->end_seq, tp->rcv_nxt);
 -
        if (inet_csk_ack_scheduled(sk)) {
                const struct inet_connection_sock *icsk = inet_csk(sk);
  
                tcp_send_ack(sk);
  }
  
 +void tcp_cleanup_rbuf(struct sock *sk, int copied)
 +{
 +      struct sk_buff *skb = skb_peek(&sk->sk_receive_queue);
 +      struct tcp_sock *tp = tcp_sk(sk);
 +
 +      WARN(skb && !before(tp->copied_seq, TCP_SKB_CB(skb)->end_seq),
 +           "cleanup rbuf bug: copied %X seq %X rcvnxt %X\n",
 +           tp->copied_seq, TCP_SKB_CB(skb)->end_seq, tp->rcv_nxt);
 +      __tcp_cleanup_rbuf(sk, copied);
 +}
 +
  static void tcp_eat_recv_skb(struct sock *sk, struct sk_buff *skb)
  {
        __skb_unlink(skb, &sk->sk_receive_queue);
@@@ -1761,26 -1756,34 +1761,26 @@@ int tcp_read_skb(struct sock *sk, skb_r
        if (sk->sk_state == TCP_LISTEN)
                return -ENOTCONN;
  
 -      while ((skb = tcp_recv_skb(sk, seq, &offset)) != NULL) {
 -              int used;
 -
 -              __skb_unlink(skb, &sk->sk_receive_queue);
 -              used = recv_actor(sk, skb);
 -              if (used <= 0) {
 -                      if (!copied)
 -                              copied = used;
 -                      break;
 -              }
 -              seq += used;
 -              copied += used;
 +      skb = tcp_recv_skb(sk, seq, &offset);
 +      if (!skb)
 +              return 0;
  
 -              if (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN) {
 -                      consume_skb(skb);
 +      __skb_unlink(skb, &sk->sk_receive_queue);
 +      WARN_ON(!skb_set_owner_sk_safe(skb, sk));
 +      copied = recv_actor(sk, skb);
 +      if (copied >= 0) {
 +              seq += copied;
 +              if (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN)
                        ++seq;
 -                      break;
 -              }
 -              consume_skb(skb);
 -              break;
        }
 +      consume_skb(skb);
        WRITE_ONCE(tp->copied_seq, seq);
  
        tcp_rcv_space_adjust(sk);
  
        /* Clean up data we have read: This will do ACK frames. */
        if (copied > 0)
 -              tcp_cleanup_rbuf(sk, copied);
 +              __tcp_cleanup_rbuf(sk, copied);
  
        return copied;
  }
@@@ -3199,7 -3202,7 +3199,7 @@@ EXPORT_SYMBOL(tcp_disconnect)
  
  static inline bool tcp_can_repair_sock(const struct sock *sk)
  {
-       return ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN) &&
+       return sockopt_ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN) &&
                (sk->sk_state != TCP_LISTEN);
  }
  
@@@ -3476,8 -3479,8 +3476,8 @@@ int tcp_set_window_clamp(struct sock *s
  /*
   *    Socket option code for TCP.
   */
- static int do_tcp_setsockopt(struct sock *sk, int level, int optname,
-               sockptr_t optval, unsigned int optlen)
+ int do_tcp_setsockopt(struct sock *sk, int level, int optname,
+                     sockptr_t optval, unsigned int optlen)
  {
        struct tcp_sock *tp = tcp_sk(sk);
        struct inet_connection_sock *icsk = inet_csk(sk);
                        return -EFAULT;
                name[val] = 0;
  
-               lock_sock(sk);
-               err = tcp_set_congestion_control(sk, name, true,
-                                                ns_capable(sock_net(sk)->user_ns,
-                                                           CAP_NET_ADMIN));
-               release_sock(sk);
+               sockopt_lock_sock(sk);
+               err = tcp_set_congestion_control(sk, name, !has_current_bpf_ctx(),
+                                                sockopt_ns_capable(sock_net(sk)->user_ns,
+                                                                   CAP_NET_ADMIN));
+               sockopt_release_sock(sk);
                return err;
        }
        case TCP_ULP: {
                        return -EFAULT;
                name[val] = 0;
  
-               lock_sock(sk);
+               sockopt_lock_sock(sk);
                err = tcp_set_ulp(sk, name);
-               release_sock(sk);
+               sockopt_release_sock(sk);
                return err;
        }
        case TCP_FASTOPEN_KEY: {
        if (copy_from_sockptr(&val, optval, sizeof(val)))
                return -EFAULT;
  
-       lock_sock(sk);
+       sockopt_lock_sock(sk);
  
        switch (optname) {
        case TCP_MAXSEG:
                break;
        }
  
-       release_sock(sk);
+       sockopt_release_sock(sk);
        return err;
  }
  
@@@ -4040,15 -4043,15 +4040,15 @@@ struct sk_buff *tcp_get_timestamping_op
        return stats;
  }
  
- static int do_tcp_getsockopt(struct sock *sk, int level,
-               int optname, char __user *optval, int __user *optlen)
+ int do_tcp_getsockopt(struct sock *sk, int level,
+                     int optname, sockptr_t optval, sockptr_t optlen)
  {
        struct inet_connection_sock *icsk = inet_csk(sk);
        struct tcp_sock *tp = tcp_sk(sk);
        struct net *net = sock_net(sk);
        int val, len;
  
-       if (get_user(len, optlen))
+       if (copy_from_sockptr(&len, optlen, sizeof(int)))
                return -EFAULT;
  
        len = min_t(unsigned int, len, sizeof(int));
        case TCP_INFO: {
                struct tcp_info info;
  
-               if (get_user(len, optlen))
+               if (copy_from_sockptr(&len, optlen, sizeof(int)))
                        return -EFAULT;
  
                tcp_get_info(sk, &info);
  
                len = min_t(unsigned int, len, sizeof(info));
-               if (put_user(len, optlen))
+               if (copy_to_sockptr(optlen, &len, sizeof(int)))
                        return -EFAULT;
-               if (copy_to_user(optval, &info, len))
+               if (copy_to_sockptr(optval, &info, len))
                        return -EFAULT;
                return 0;
        }
                size_t sz = 0;
                int attr;
  
-               if (get_user(len, optlen))
+               if (copy_from_sockptr(&len, optlen, sizeof(int)))
                        return -EFAULT;
  
                ca_ops = icsk->icsk_ca_ops;
                        sz = ca_ops->get_info(sk, ~0U, &attr, &info);
  
                len = min_t(unsigned int, len, sz);
-               if (put_user(len, optlen))
+               if (copy_to_sockptr(optlen, &len, sizeof(int)))
                        return -EFAULT;
-               if (copy_to_user(optval, &info, len))
+               if (copy_to_sockptr(optval, &info, len))
                        return -EFAULT;
                return 0;
        }
                break;
  
        case TCP_CONGESTION:
-               if (get_user(len, optlen))
+               if (copy_from_sockptr(&len, optlen, sizeof(int)))
                        return -EFAULT;
                len = min_t(unsigned int, len, TCP_CA_NAME_MAX);
-               if (put_user(len, optlen))
+               if (copy_to_sockptr(optlen, &len, sizeof(int)))
                        return -EFAULT;
-               if (copy_to_user(optval, icsk->icsk_ca_ops->name, len))
+               if (copy_to_sockptr(optval, icsk->icsk_ca_ops->name, len))
                        return -EFAULT;
                return 0;
  
        case TCP_ULP:
-               if (get_user(len, optlen))
+               if (copy_from_sockptr(&len, optlen, sizeof(int)))
                        return -EFAULT;
                len = min_t(unsigned int, len, TCP_ULP_NAME_MAX);
                if (!icsk->icsk_ulp_ops) {
-                       if (put_user(0, optlen))
+                       len = 0;
+                       if (copy_to_sockptr(optlen, &len, sizeof(int)))
                                return -EFAULT;
                        return 0;
                }
-               if (put_user(len, optlen))
+               if (copy_to_sockptr(optlen, &len, sizeof(int)))
                        return -EFAULT;
-               if (copy_to_user(optval, icsk->icsk_ulp_ops->name, len))
+               if (copy_to_sockptr(optval, icsk->icsk_ulp_ops->name, len))
                        return -EFAULT;
                return 0;
  
                u64 key[TCP_FASTOPEN_KEY_BUF_LENGTH / sizeof(u64)];
                unsigned int key_len;
  
-               if (get_user(len, optlen))
+               if (copy_from_sockptr(&len, optlen, sizeof(int)))
                        return -EFAULT;
  
                key_len = tcp_fastopen_get_cipher(net, icsk, key) *
                                TCP_FASTOPEN_KEY_LENGTH;
                len = min_t(unsigned int, len, key_len);
-               if (put_user(len, optlen))
+               if (copy_to_sockptr(optlen, &len, sizeof(int)))
                        return -EFAULT;
-               if (copy_to_user(optval, key, len))
+               if (copy_to_sockptr(optval, key, len))
                        return -EFAULT;
                return 0;
        }
        case TCP_REPAIR_WINDOW: {
                struct tcp_repair_window opt;
  
-               if (get_user(len, optlen))
+               if (copy_from_sockptr(&len, optlen, sizeof(int)))
                        return -EFAULT;
  
                if (len != sizeof(opt))
                opt.rcv_wnd     = tp->rcv_wnd;
                opt.rcv_wup     = tp->rcv_wup;
  
-               if (copy_to_user(optval, &opt, len))
+               if (copy_to_sockptr(optval, &opt, len))
                        return -EFAULT;
                return 0;
        }
                val = tp->save_syn;
                break;
        case TCP_SAVED_SYN: {
-               if (get_user(len, optlen))
+               if (copy_from_sockptr(&len, optlen, sizeof(int)))
                        return -EFAULT;
  
-               lock_sock(sk);
+               sockopt_lock_sock(sk);
                if (tp->saved_syn) {
                        if (len < tcp_saved_syn_len(tp->saved_syn)) {
-                               if (put_user(tcp_saved_syn_len(tp->saved_syn),
-                                            optlen)) {
-                                       release_sock(sk);
+                               len = tcp_saved_syn_len(tp->saved_syn);
+                               if (copy_to_sockptr(optlen, &len, sizeof(int))) {
+                                       sockopt_release_sock(sk);
                                        return -EFAULT;
                                }
-                               release_sock(sk);
+                               sockopt_release_sock(sk);
                                return -EINVAL;
                        }
                        len = tcp_saved_syn_len(tp->saved_syn);
-                       if (put_user(len, optlen)) {
-                               release_sock(sk);
+                       if (copy_to_sockptr(optlen, &len, sizeof(int))) {
+                               sockopt_release_sock(sk);
                                return -EFAULT;
                        }
-                       if (copy_to_user(optval, tp->saved_syn->data, len)) {
-                               release_sock(sk);
+                       if (copy_to_sockptr(optval, tp->saved_syn->data, len)) {
+                               sockopt_release_sock(sk);
                                return -EFAULT;
                        }
                        tcp_saved_syn_free(tp);
-                       release_sock(sk);
+                       sockopt_release_sock(sk);
                } else {
-                       release_sock(sk);
+                       sockopt_release_sock(sk);
                        len = 0;
-                       if (put_user(len, optlen))
+                       if (copy_to_sockptr(optlen, &len, sizeof(int)))
                                return -EFAULT;
                }
                return 0;
                struct tcp_zerocopy_receive zc = {};
                int err;
  
-               if (get_user(len, optlen))
+               if (copy_from_sockptr(&len, optlen, sizeof(int)))
                        return -EFAULT;
                if (len < 0 ||
                    len < offsetofend(struct tcp_zerocopy_receive, length))
                        return -EINVAL;
                if (unlikely(len > sizeof(zc))) {
-                       err = check_zeroed_user(optval + sizeof(zc),
-                                               len - sizeof(zc));
+                       err = check_zeroed_sockptr(optval, sizeof(zc),
+                                                  len - sizeof(zc));
                        if (err < 1)
                                return err == 0 ? -EINVAL : err;
                        len = sizeof(zc);
-                       if (put_user(len, optlen))
+                       if (copy_to_sockptr(optlen, &len, sizeof(int)))
                                return -EFAULT;
                }
-               if (copy_from_user(&zc, optval, len))
+               if (copy_from_sockptr(&zc, optval, len))
                        return -EFAULT;
                if (zc.reserved)
                        return -EINVAL;
                if (zc.msg_flags &  ~(TCP_VALID_ZC_MSG_FLAGS))
                        return -EINVAL;
-               lock_sock(sk);
+               sockopt_lock_sock(sk);
                err = tcp_zerocopy_receive(sk, &zc, &tss);
                err = BPF_CGROUP_RUN_PROG_GETSOCKOPT_KERN(sk, level, optname,
                                                          &zc, &len, err);
-               release_sock(sk);
+               sockopt_release_sock(sk);
                if (len >= offsetofend(struct tcp_zerocopy_receive, msg_flags))
                        goto zerocopy_rcv_cmsg;
                switch (len) {
@@@ -4351,7 -4355,7 +4352,7 @@@ zerocopy_rcv_sk_err
  zerocopy_rcv_inq:
                zc.inq = tcp_inq_hint(sk);
  zerocopy_rcv_out:
-               if (!err && copy_to_user(optval, &zc, len))
+               if (!err && copy_to_sockptr(optval, &zc, len))
                        err = -EFAULT;
                return err;
        }
                return -ENOPROTOOPT;
        }
  
-       if (put_user(len, optlen))
+       if (copy_to_sockptr(optlen, &len, sizeof(int)))
                return -EFAULT;
-       if (copy_to_user(optval, &val, len))
+       if (copy_to_sockptr(optval, &val, len))
                return -EFAULT;
        return 0;
  }
@@@ -4387,7 -4391,8 +4388,8 @@@ int tcp_getsockopt(struct sock *sk, in
        if (level != SOL_TCP)
                return icsk->icsk_af_ops->getsockopt(sk, level, optname,
                                                     optval, optlen);
-       return do_tcp_getsockopt(sk, level, optname, optval, optlen);
+       return do_tcp_getsockopt(sk, level, optname, USER_SOCKPTR(optval),
+                                USER_SOCKPTR(optlen));
  }
  EXPORT_SYMBOL(tcp_getsockopt);
  
@@@ -4433,16 -4438,12 +4435,16 @@@ static void __tcp_alloc_md5sig_pool(voi
         * to memory. See smp_rmb() in tcp_get_md5sig_pool()
         */
        smp_wmb();
 -      tcp_md5sig_pool_populated = true;
 +      /* Paired with READ_ONCE() from tcp_alloc_md5sig_pool()
 +       * and tcp_get_md5sig_pool().
 +      */
 +      WRITE_ONCE(tcp_md5sig_pool_populated, true);
  }
  
  bool tcp_alloc_md5sig_pool(void)
  {
 -      if (unlikely(!tcp_md5sig_pool_populated)) {
 +      /* Paired with WRITE_ONCE() from __tcp_alloc_md5sig_pool() */
 +      if (unlikely(!READ_ONCE(tcp_md5sig_pool_populated))) {
                mutex_lock(&tcp_md5sig_mutex);
  
                if (!tcp_md5sig_pool_populated) {
  
                mutex_unlock(&tcp_md5sig_mutex);
        }
 -      return tcp_md5sig_pool_populated;
 +      /* Paired with WRITE_ONCE() from __tcp_alloc_md5sig_pool() */
 +      return READ_ONCE(tcp_md5sig_pool_populated);
  }
  EXPORT_SYMBOL(tcp_alloc_md5sig_pool);
  
@@@ -4470,8 -4470,7 +4472,8 @@@ struct tcp_md5sig_pool *tcp_get_md5sig_
  {
        local_bh_disable();
  
 -      if (tcp_md5sig_pool_populated) {
 +      /* Paired with WRITE_ONCE() from __tcp_alloc_md5sig_pool() */
 +      if (READ_ONCE(tcp_md5sig_pool_populated)) {
                /* coupled with smp_wmb() in __tcp_alloc_md5sig_pool() */
                smp_rmb();
                return this_cpu_ptr(&tcp_md5sig_pool);
@@@ -4742,12 -4741,6 +4744,12 @@@ void __init tcp_init(void
                                  SLAB_HWCACHE_ALIGN | SLAB_PANIC |
                                  SLAB_ACCOUNT,
                                  NULL);
 +      tcp_hashinfo.bind2_bucket_cachep =
 +              kmem_cache_create("tcp_bind2_bucket",
 +                                sizeof(struct inet_bind2_bucket), 0,
 +                                SLAB_HWCACHE_ALIGN | SLAB_PANIC |
 +                                SLAB_ACCOUNT,
 +                                NULL);
  
        /* Size and allocate the main established and bind bucket
         * hash tables.
                panic("TCP: failed to alloc ehash_locks");
        tcp_hashinfo.bhash =
                alloc_large_system_hash("TCP bind",
 -                                      sizeof(struct inet_bind_hashbucket),
 +                                      2 * sizeof(struct inet_bind_hashbucket),
                                        tcp_hashinfo.ehash_mask + 1,
                                        17, /* one slot per 128 KB of memory */
                                        0,
                                        0,
                                        64 * 1024);
        tcp_hashinfo.bhash_size = 1U << tcp_hashinfo.bhash_size;
 +      tcp_hashinfo.bhash2 = tcp_hashinfo.bhash + tcp_hashinfo.bhash_size;
        for (i = 0; i < tcp_hashinfo.bhash_size; i++) {
                spin_lock_init(&tcp_hashinfo.bhash[i].lock);
                INIT_HLIST_HEAD(&tcp_hashinfo.bhash[i].chain);
 +              spin_lock_init(&tcp_hashinfo.bhash2[i].lock);
 +              INIT_HLIST_HEAD(&tcp_hashinfo.bhash2[i].chain);
        }
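
  Note on the conversions in the tcp.c hunks above: do_tcp_getsockopt() now takes
  sockptr_t for both optval and optlen, so one implementation can serve the
  getsockopt() syscall path (user pointers) as well as in-kernel callers such as
  bpf_getsockopt() (kernel pointers). A minimal sketch of that abstraction,
  assuming the helpers from include/linux/sockptr.h (simplified, not the exact
  kernel source):

  /* Tagged pointer: either a kernel address or a __user address. */
  typedef struct {
          union {
                  void            *kernel;
                  void __user     *user;
          };
          bool    is_kernel : 1;
  } sockptr_t;

  static inline sockptr_t KERNEL_SOCKPTR(void *p)
  {
          return (sockptr_t) { .kernel = p, .is_kernel = true };
  }

  static inline sockptr_t USER_SOCKPTR(void __user *p)
  {
          return (sockptr_t) { .user = p };
  }

  /* copy_to_sockptr()/copy_from_sockptr() dispatch on the tag, so the
   * sockopt code no longer needs to know who called it.
   */
  static inline int copy_to_sockptr(sockptr_t dst, const void *src, size_t size)
  {
          if (dst.is_kernel) {
                  memcpy(dst.kernel, src, size);
                  return 0;
          }
          return copy_to_user(dst.user, src, size) ? -EFAULT : 0;
  }
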
  
  
diff --combined net/ipv6/ipv6_sockglue.c
index e0dcc7a193df2a9f70350ec448f810c06a9e5330,1193f5a5247d1bc133fc348ba57439bf0a852998..2d2f4dd9e5dfa8278f5dbad0bfd5a2e16a77406d
@@@ -210,7 -210,7 +210,7 @@@ static int ipv6_set_mcast_msfilter(stru
  
        if (optlen < GROUP_FILTER_SIZE(0))
                return -EINVAL;
 -      if (optlen > sysctl_optmem_max)
 +      if (optlen > READ_ONCE(sysctl_optmem_max))
                return -ENOBUFS;
  
        gsf = memdup_sockptr(optval, optlen);
@@@ -244,7 -244,7 +244,7 @@@ static int compat_ipv6_set_mcast_msfilt
  
        if (optlen < size0)
                return -EINVAL;
 -      if (optlen > sysctl_optmem_max - 4)
 +      if (optlen > READ_ONCE(sysctl_optmem_max) - 4)
                return -ENOBUFS;
  
        p = kmalloc(optlen + 4, GFP_KERNEL);
@@@ -327,7 -327,7 +327,7 @@@ static int ipv6_set_opt_hdr(struct soc
        int err;
  
        /* hop-by-hop / destination options are privileged option */
-       if (optname != IPV6_RTHDR && !ns_capable(net->user_ns, CAP_NET_RAW))
+       if (optname != IPV6_RTHDR && !sockopt_ns_capable(net->user_ns, CAP_NET_RAW))
                return -EPERM;
  
        /* remove any sticky options header with a zero option
@@@ -391,8 -391,8 +391,8 @@@ sticky_done
        return err;
  }
  
- static int do_ipv6_setsockopt(struct sock *sk, int level, int optname,
-                  sockptr_t optval, unsigned int optlen)
+ int do_ipv6_setsockopt(struct sock *sk, int level, int optname,
+                      sockptr_t optval, unsigned int optlen)
  {
        struct ipv6_pinfo *np = inet6_sk(sk);
        struct net *net = sock_net(sk);
  
        if (needs_rtnl)
                rtnl_lock();
-       lock_sock(sk);
+       sockopt_lock_sock(sk);
  
        switch (optname) {
  
                break;
  
        case IPV6_TRANSPARENT:
-               if (valbool && !ns_capable(net->user_ns, CAP_NET_RAW) &&
-                   !ns_capable(net->user_ns, CAP_NET_ADMIN)) {
+               if (valbool && !sockopt_ns_capable(net->user_ns, CAP_NET_RAW) &&
+                   !sockopt_ns_capable(net->user_ns, CAP_NET_ADMIN)) {
                        retv = -EPERM;
                        break;
                }
@@@ -946,7 -946,7 +946,7 @@@ done
        case IPV6_IPSEC_POLICY:
        case IPV6_XFRM_POLICY:
                retv = -EPERM;
-               if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
+               if (!sockopt_ns_capable(net->user_ns, CAP_NET_ADMIN))
                        break;
                retv = xfrm_user_policy(sk, optname, optval, optlen);
                break;
                break;
        }
  
-       release_sock(sk);
+       sockopt_release_sock(sk);
        if (needs_rtnl)
                rtnl_unlock();
  
        return retv;
  
  e_inval:
-       release_sock(sk);
+       sockopt_release_sock(sk);
        if (needs_rtnl)
                rtnl_unlock();
        return -EINVAL;
@@@ -1030,7 -1030,7 +1030,7 @@@ int ipv6_setsockopt(struct sock *sk, in
  EXPORT_SYMBOL(ipv6_setsockopt);
  
  static int ipv6_getsockopt_sticky(struct sock *sk, struct ipv6_txoptions *opt,
-                                 int optname, char __user *optval, int len)
+                                 int optname, sockptr_t optval, int len)
  {
        struct ipv6_opt_hdr *hdr;
  
                return 0;
  
        len = min_t(unsigned int, len, ipv6_optlen(hdr));
-       if (copy_to_user(optval, hdr, len))
+       if (copy_to_sockptr(optval, hdr, len))
                return -EFAULT;
        return len;
  }
  
- static int ipv6_get_msfilter(struct sock *sk, void __user *optval,
-               int __user *optlen, int len)
+ static int ipv6_get_msfilter(struct sock *sk, sockptr_t optval,
+                            sockptr_t optlen, int len)
  {
        const int size0 = offsetof(struct group_filter, gf_slist_flex);
-       struct group_filter __user *p = optval;
        struct group_filter gsf;
        int num;
        int err;
  
        if (len < size0)
                return -EINVAL;
-       if (copy_from_user(&gsf, p, size0))
+       if (copy_from_sockptr(&gsf, optval, size0))
                return -EFAULT;
        if (gsf.gf_group.ss_family != AF_INET6)
                return -EADDRNOTAVAIL;
        num = gsf.gf_numsrc;
-       lock_sock(sk);
-       err = ip6_mc_msfget(sk, &gsf, p->gf_slist_flex);
+       sockopt_lock_sock(sk);
+       err = ip6_mc_msfget(sk, &gsf, optval, size0);
        if (!err) {
                if (num > gsf.gf_numsrc)
                        num = gsf.gf_numsrc;
-               if (put_user(GROUP_FILTER_SIZE(num), optlen) ||
-                   copy_to_user(p, &gsf, size0))
+               len = GROUP_FILTER_SIZE(num);
+               if (copy_to_sockptr(optlen, &len, sizeof(int)) ||
+                   copy_to_sockptr(optval, &gsf, size0))
                        err = -EFAULT;
        }
-       release_sock(sk);
+       sockopt_release_sock(sk);
        return err;
  }
  
- static int compat_ipv6_get_msfilter(struct sock *sk, void __user *optval,
-               int __user *optlen)
+ static int compat_ipv6_get_msfilter(struct sock *sk, sockptr_t optval,
+                                   sockptr_t optlen, int len)
  {
        const int size0 = offsetof(struct compat_group_filter, gf_slist_flex);
-       struct compat_group_filter __user *p = optval;
        struct compat_group_filter gf32;
        struct group_filter gf;
-       int len, err;
+       int err;
        int num;
  
-       if (get_user(len, optlen))
-               return -EFAULT;
        if (len < size0)
                return -EINVAL;
  
-       if (copy_from_user(&gf32, p, size0))
+       if (copy_from_sockptr(&gf32, optval, size0))
                return -EFAULT;
        gf.gf_interface = gf32.gf_interface;
        gf.gf_fmode = gf32.gf_fmode;
        if (gf.gf_group.ss_family != AF_INET6)
                return -EADDRNOTAVAIL;
  
-       lock_sock(sk);
-       err = ip6_mc_msfget(sk, &gf, p->gf_slist_flex);
-       release_sock(sk);
+       sockopt_lock_sock(sk);
+       err = ip6_mc_msfget(sk, &gf, optval, size0);
+       sockopt_release_sock(sk);
        if (err)
                return err;
        if (num > gf.gf_numsrc)
                num = gf.gf_numsrc;
        len = GROUP_FILTER_SIZE(num) - (sizeof(gf)-sizeof(gf32));
-       if (put_user(len, optlen) ||
-           put_user(gf.gf_fmode, &p->gf_fmode) ||
-           put_user(gf.gf_numsrc, &p->gf_numsrc))
+       if (copy_to_sockptr(optlen, &len, sizeof(int)) ||
+           copy_to_sockptr_offset(optval, offsetof(struct compat_group_filter, gf_fmode),
+                                  &gf.gf_fmode, sizeof(gf32.gf_fmode)) ||
+           copy_to_sockptr_offset(optval, offsetof(struct compat_group_filter, gf_numsrc),
+                                  &gf.gf_numsrc, sizeof(gf32.gf_numsrc)))
                return -EFAULT;
        return 0;
  }
  
- static int do_ipv6_getsockopt(struct sock *sk, int level, int optname,
-                   char __user *optval, int __user *optlen, unsigned int flags)
+ int do_ipv6_getsockopt(struct sock *sk, int level, int optname,
+                      sockptr_t optval, sockptr_t optlen)
  {
        struct ipv6_pinfo *np = inet6_sk(sk);
        int len;
        if (ip6_mroute_opt(optname))
                return ip6_mroute_getsockopt(sk, optname, optval, optlen);
  
-       if (get_user(len, optlen))
+       if (copy_from_sockptr(&len, optlen, sizeof(int)))
                return -EFAULT;
        switch (optname) {
        case IPV6_ADDRFORM:
                break;
        case MCAST_MSFILTER:
                if (in_compat_syscall())
-                       return compat_ipv6_get_msfilter(sk, optval, optlen);
+                       return compat_ipv6_get_msfilter(sk, optval, optlen, len);
                return ipv6_get_msfilter(sk, optval, optlen, len);
        case IPV6_2292PKTOPTIONS:
        {
                if (sk->sk_type != SOCK_STREAM)
                        return -ENOPROTOOPT;
  
-               msg.msg_control_user = optval;
+               if (optval.is_kernel) {
+                       msg.msg_control_is_user = false;
+                       msg.msg_control = optval.kernel;
+               } else {
+                       msg.msg_control_is_user = true;
+                       msg.msg_control_user = optval.user;
+               }
                msg.msg_controllen = len;
-               msg.msg_flags = flags;
-               msg.msg_control_is_user = true;
+               msg.msg_flags = 0;
  
-               lock_sock(sk);
+               sockopt_lock_sock(sk);
                skb = np->pktoptions;
                if (skb)
                        ip6_datagram_recv_ctl(sk, &msg, skb);
-               release_sock(sk);
+               sockopt_release_sock(sk);
                if (!skb) {
                        if (np->rxopt.bits.rxinfo) {
                                struct in6_pktinfo src_info;
                        }
                }
                len -= msg.msg_controllen;
-               return put_user(len, optlen);
+               return copy_to_sockptr(optlen, &len, sizeof(int));
        }
        case IPV6_MTU:
        {
        {
                struct ipv6_txoptions *opt;
  
-               lock_sock(sk);
+               sockopt_lock_sock(sk);
                opt = rcu_dereference_protected(np->opt,
                                                lockdep_sock_is_held(sk));
                len = ipv6_getsockopt_sticky(sk, opt, optname, optval, len);
-               release_sock(sk);
+               sockopt_release_sock(sk);
                /* check if ipv6_getsockopt_sticky() returns err code */
                if (len < 0)
                        return len;
-               return put_user(len, optlen);
+               return copy_to_sockptr(optlen, &len, sizeof(int));
        }
  
        case IPV6_RECVHOPOPTS:
                if (!mtuinfo.ip6m_mtu)
                        return -ENOTCONN;
  
-               if (put_user(len, optlen))
+               if (copy_to_sockptr(optlen, &len, sizeof(int)))
                        return -EFAULT;
-               if (copy_to_user(optval, &mtuinfo, len))
+               if (copy_to_sockptr(optval, &mtuinfo, len))
                        return -EFAULT;
  
                return 0;
                if (len < sizeof(freq))
                        return -EINVAL;
  
-               if (copy_from_user(&freq, optval, sizeof(freq)))
+               if (copy_from_sockptr(&freq, optval, sizeof(freq)))
                        return -EFAULT;
  
                if (freq.flr_action != IPV6_FL_A_GET)
                if (val < 0)
                        return val;
  
-               if (put_user(len, optlen))
+               if (copy_to_sockptr(optlen, &len, sizeof(int)))
                        return -EFAULT;
-               if (copy_to_user(optval, &freq, len))
+               if (copy_to_sockptr(optval, &freq, len))
                        return -EFAULT;
  
                return 0;
                return -ENOPROTOOPT;
        }
        len = min_t(unsigned int, sizeof(int), len);
-       if (put_user(len, optlen))
+       if (copy_to_sockptr(optlen, &len, sizeof(int)))
                return -EFAULT;
-       if (copy_to_user(optval, &val, len))
+       if (copy_to_sockptr(optval, &val, len))
                return -EFAULT;
        return 0;
  }
@@@ -1492,7 -1496,8 +1496,8 @@@ int ipv6_getsockopt(struct sock *sk, in
        if (level != SOL_IPV6)
                return -ENOPROTOOPT;
  
-       err = do_ipv6_getsockopt(sk, level, optname, optval, optlen, 0);
+       err = do_ipv6_getsockopt(sk, level, optname,
+                                USER_SOCKPTR(optval), USER_SOCKPTR(optlen));
  #ifdef CONFIG_NETFILTER
        /* we need to exclude all possible ENOPROTOOPTs except default case */
        if (err == -ENOPROTOOPT && optname != IPV6_2292PKTOPTIONS) {
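
  The lock_sock()/release_sock() and ns_capable() calls in the hunks above are
  replaced with sockopt_*() wrappers so the same code can also run from a BPF
  program, where the socket lock is already held by the BPF hook. Roughly, and
  assuming the helpers added to net/core/sock.c in this series (illustrative
  sketch, not the exact implementation):

  void sockopt_lock_sock(struct sock *sk)
  {
          /* A {get,set}sockopt reached from a BPF program
           * (current->bpf_ctx set) already holds the socket lock.
           */
          if (has_current_bpf_ctx())
                  return;
          lock_sock(sk);
  }

  void sockopt_release_sock(struct sock *sk)
  {
          if (has_current_bpf_ctx())
                  return;
          release_sock(sk);
  }

  bool sockopt_ns_capable(struct user_namespace *ns, int cap)
  {
          /* BPF callers were already privilege-checked at program load
           * time, so skip the per-call capability check for them.
           */
          return has_current_bpf_ctx() || ns_capable(ns, cap);
  }
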
diff --combined tools/testing/selftests/bpf/DENYLIST.s390x
index 5cadfbdadf360977b93215aad2c104e345e71024,ba02b559ca681c38dde0ac75d5da3b044c482856..18fbb6eab1e2b24294277558e621876b53b7e8c2
@@@ -65,4 -65,7 +65,8 @@@ send_signa
  select_reuseport                         # intermittently fails on new s390x setup
  xdp_synproxy                             # JIT does not support calling kernel function                                (kfunc)
  unpriv_bpf_disabled                      # fentry
 +lru_bug                                  # prog 'printk': failed to auto-attach: -524
+ setget_sockopt                           # attach unexpected error: -524                                               (trampoline)
+ cb_refs                                  # expected error message unexpected error: -524                               (trampoline)
+ cgroup_hierarchical_stats                # JIT does not support calling kernel function                                (kfunc)
+ htab_update                              # failed to attach: ERROR: strerror_r(-524)=22                                (trampoline)