From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Tue, 9 Jul 2019 19:34:26 +0000 (-0700)
Subject: Merge tag 'docs-5.3' of git://git.lwn.net/linux
X-Git-Tag: v5.3-rc1~156
X-Git-Url: https://git.kernel.dk/?a=commitdiff_plain;h=e9a83bd2322035ed9d7dcf35753d3f984d76c6a5;p=linux-2.6-block.git

Merge tag 'docs-5.3' of git://git.lwn.net/linux

Pull Documentation updates from Jonathan Corbet:
 "It's been a relatively busy cycle for docs:

   - A fair pile of RST conversions, many from Mauro. These create more
     than the usual number of simple but annoying merge conflicts with
     other trees, unfortunately. He has a lot more of these waiting in
     the wings that, I think, will go to you directly later on.

   - A new document on how to use merges and rebases in kernel repos,
     and one on Spectre vulnerabilities.

   - Various improvements to the build system, including automatic
     markup of function() references because some people, for reasons I
     will never understand, were of the opinion that
     :c:func:``function()`` is unattractive and not fun to type.

   - We now recommend using sphinx 1.7, but still support back to 1.4.

   - Lots of smaller improvements, warning fixes, typo fixes, etc"

* tag 'docs-5.3' of git://git.lwn.net/linux: (129 commits)
  docs: automarkup.py: ignore exceptions when seeking for xrefs
  docs: Move binderfs to admin-guide
  Disable Sphinx SmartyPants in HTML output
  doc: RCU callback locks need only _bh, not necessarily _irq
  docs: format kernel-parameters -- as code
  Doc : doc-guide : Fix a typo
  platform: x86: get rid of a non-existent document
  Add the RCU docs to the core-api manual
  Documentation: RCU: Add TOC tree hooks
  Documentation: RCU: Rename txt files to rst
  Documentation: RCU: Convert RCU UP systems to reST
  Documentation: RCU: Convert RCU linked list to reST
  Documentation: RCU: Convert RCU basic concepts to reST
  docs: filesystems: Remove uneeded .rst extension on toctables
  scripts/sphinx-pre-install: fix out-of-tree build
  docs: zh_CN: submitting-drivers.rst: Remove a duplicated Documentation/
  Documentation: PGP: update for newer HW devices
  Documentation: Add section about CPU vulnerabilities for Spectre
  Documentation: platform: Delete x86-laptop-drivers.txt
  docs: Note that :c:func: should no longer be used
  ...
---

e9a83bd2322035ed9d7dcf35753d3f984d76c6a5
diff --cc Documentation/arm64/elf_hwcaps.rst
index 000000000000,c7cbf4b571c0..91f79529c58c
mode 000000,100644..100644
--- a/Documentation/arm64/elf_hwcaps.rst
+++ b/Documentation/arm64/elf_hwcaps.rst
@@@ -1,0 -1,201 +1,209 @@@
+ ================
+ ARM64 ELF hwcaps
+ ================
+ 
+ This document describes the usage and semantics of the arm64 ELF hwcaps.
+ 
+ 
+ 1. Introduction
+ ---------------
+ 
+ Some hardware or software features are only available on some CPU
+ implementations, and/or with certain kernel configurations, but have no
+ architected discovery mechanism available to userspace code at EL0. The
+ kernel exposes the presence of these features to userspace through a set
+ of flags called hwcaps, exposed in the auxiliary vector.
+ 
+ Userspace software can test for features by acquiring the AT_HWCAP or
+ AT_HWCAP2 entry of the auxiliary vector, and testing whether the relevant
+ flags are set, e.g.::
+ 
+ 	bool floating_point_is_present(void)
+ 	{
+ 		unsigned long hwcaps = getauxval(AT_HWCAP);
+ 		if (hwcaps & HWCAP_FP)
+ 			return true;
+ 
+ 		return false;
+ 	}
+ 
+ Where software relies on a feature described by a hwcap, it should check
+ the relevant hwcap flag to verify that the feature is present before
+ attempting to make use of the feature.
+ 
+ Features cannot be probed reliably through other means. When a feature
+ is not available, attempting to use it may result in unpredictable
+ behaviour, and is not guaranteed to result in any reliable indication
+ that the feature is unavailable, such as a SIGILL.
+ 
+ 
+ 2. Interpretation of hwcaps
+ ---------------------------
+ 
+ The majority of hwcaps are intended to indicate the presence of features
+ which are described by architected ID registers inaccessible to
+ userspace code at EL0. These hwcaps are defined in terms of ID register
+ fields, and should be interpreted with reference to the definition of
+ these fields in the ARM Architecture Reference Manual (ARM ARM).
+ 
+ Such hwcaps are described below in the form::
+ 
+     Functionality implied by idreg.field == val.
+ 
+ Such hwcaps indicate the availability of functionality that the ARM ARM
+ defines as being present when idreg.field has value val, but do not
+ indicate that idreg.field is precisely equal to val, nor do they
+ indicate the absence of functionality implied by other values of
+ idreg.field.
+ 
+ Other hwcaps may indicate the presence of features which cannot be
+ described by ID registers alone. These may be described without
+ reference to ID registers, and may refer to other documentation.
+ 
+ 
+ 3. The hwcaps exposed in AT_HWCAP
+ ---------------------------------
+ 
+ HWCAP_FP
+     Functionality implied by ID_AA64PFR0_EL1.FP == 0b0000.
+ 
+ HWCAP_ASIMD
+     Functionality implied by ID_AA64PFR0_EL1.AdvSIMD == 0b0000.
+ 
+ HWCAP_EVTSTRM
+     The generic timer is configured to generate events at a frequency of
+     approximately 100KHz.
+ 
+ HWCAP_AES
+     Functionality implied by ID_AA64ISAR0_EL1.AES == 0b0001.
+ 
+ HWCAP_PMULL
+     Functionality implied by ID_AA64ISAR0_EL1.AES == 0b0010.
+ 
+ HWCAP_SHA1
+     Functionality implied by ID_AA64ISAR0_EL1.SHA1 == 0b0001.
+ 
+ HWCAP_SHA2
+     Functionality implied by ID_AA64ISAR0_EL1.SHA2 == 0b0001.
+ 
+ HWCAP_CRC32
+     Functionality implied by ID_AA64ISAR0_EL1.CRC32 == 0b0001.
+ 
+ HWCAP_ATOMICS
+     Functionality implied by ID_AA64ISAR0_EL1.Atomic == 0b0010.
+ 
+ HWCAP_FPHP
+     Functionality implied by ID_AA64PFR0_EL1.FP == 0b0001.
+ 
+ HWCAP_ASIMDHP
+     Functionality implied by ID_AA64PFR0_EL1.AdvSIMD == 0b0001.
+ 
+ HWCAP_CPUID
+     EL0 access to certain ID registers is available, to the extent
+     described by Documentation/arm64/cpu-feature-registers.rst.
+ 
+     These ID registers may imply the availability of features.
+ 
+ HWCAP_ASIMDRDM
+     Functionality implied by ID_AA64ISAR0_EL1.RDM == 0b0001.
+ 
+ HWCAP_JSCVT
+     Functionality implied by ID_AA64ISAR1_EL1.JSCVT == 0b0001.
+ 
+ HWCAP_FCMA
+     Functionality implied by ID_AA64ISAR1_EL1.FCMA == 0b0001.
+ 
+ HWCAP_LRCPC
+     Functionality implied by ID_AA64ISAR1_EL1.LRCPC == 0b0001.
+ 
+ HWCAP_DCPOP
+     Functionality implied by ID_AA64ISAR1_EL1.DPB == 0b0001.
+ 
+ HWCAP2_DCPODP
+ 
+     Functionality implied by ID_AA64ISAR1_EL1.DPB == 0b0010.
+ 
+ HWCAP_SHA3
+     Functionality implied by ID_AA64ISAR0_EL1.SHA3 == 0b0001.
+ 
+ HWCAP_SM3
+     Functionality implied by ID_AA64ISAR0_EL1.SM3 == 0b0001.
+ 
+ HWCAP_SM4
+     Functionality implied by ID_AA64ISAR0_EL1.SM4 == 0b0001.
+ 
+ HWCAP_ASIMDDP
+     Functionality implied by ID_AA64ISAR0_EL1.DP == 0b0001.
+ 
+ HWCAP_SHA512
+     Functionality implied by ID_AA64ISAR0_EL1.SHA2 == 0b0010.
+ 
+ HWCAP_SVE
+     Functionality implied by ID_AA64PFR0_EL1.SVE == 0b0001.
+ 
+ HWCAP2_SVE2
+ 
+     Functionality implied by ID_AA64ZFR0_EL1.SVEVer == 0b0001.
+ 
+ HWCAP2_SVEAES
+ 
+     Functionality implied by ID_AA64ZFR0_EL1.AES == 0b0001.
+ 
+ HWCAP2_SVEPMULL
+ 
+     Functionality implied by ID_AA64ZFR0_EL1.AES == 0b0010.
+ 
+ HWCAP2_SVEBITPERM
+ 
+     Functionality implied by ID_AA64ZFR0_EL1.BitPerm == 0b0001.
+ 
+ HWCAP2_SVESHA3
+ 
+     Functionality implied by ID_AA64ZFR0_EL1.SHA3 == 0b0001.
+ 
+ HWCAP2_SVESM4
+ 
+     Functionality implied by ID_AA64ZFR0_EL1.SM4 == 0b0001.
+ 
+ HWCAP_ASIMDFHM
+     Functionality implied by ID_AA64ISAR0_EL1.FHM == 0b0001.
+ 
+ HWCAP_DIT
+     Functionality implied by ID_AA64PFR0_EL1.DIT == 0b0001.
+ 
+ HWCAP_USCAT
+     Functionality implied by ID_AA64MMFR2_EL1.AT == 0b0001.
+ 
+ HWCAP_ILRCPC
+     Functionality implied by ID_AA64ISAR1_EL1.LRCPC == 0b0010.
+ 
+ HWCAP_FLAGM
+     Functionality implied by ID_AA64ISAR0_EL1.TS == 0b0001.
+ 
++HWCAP2_FLAGM2
++
++    Functionality implied by ID_AA64ISAR0_EL1.TS == 0b0010.
++
+ HWCAP_SSBS
+     Functionality implied by ID_AA64PFR1_EL1.SSBS == 0b0010.
+ 
+ HWCAP_PACA
+     Functionality implied by ID_AA64ISAR1_EL1.APA == 0b0001 or
+     ID_AA64ISAR1_EL1.API == 0b0001, as described by
+     Documentation/arm64/pointer-authentication.rst.
+ 
+ HWCAP_PACG
+     Functionality implied by ID_AA64ISAR1_EL1.GPA == 0b0001 or
+     ID_AA64ISAR1_EL1.GPI == 0b0001, as described by
+     Documentation/arm64/pointer-authentication.rst.
+ 
++HWCAP2_FRINT
++
++    Functionality implied by ID_AA64ISAR1_EL1.FRINTTS == 0b0001.
++
+ 
+ 4. Unused AT_HWCAP bits
+ -----------------------
+ 
+ For interoperation with userspace, the kernel guarantees that bits 62
+ and 63 of AT_HWCAP will always be returned as 0.
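
The hwcap test described in the elf_hwcaps.rst hunk above can be generalised to several flags and to AT_HWCAP2. The following is a minimal sketch, not part of the patch: the helper names `hwcaps_have` and `cpu_has` are hypothetical, and the flag values are passed in as parameters so that the helpers do not depend on the arm64-only `<asm/hwcap.h>` header.

```c
#include <stdbool.h>
#include <sys/auxv.h>	/* getauxval(), AT_HWCAP, AT_HWCAP2 */

/*
 * Hypothetical helper: true only if every bit in 'flags' is set in
 * 'hwcaps'.  Kept free of system calls so it can be checked in isolation.
 */
bool hwcaps_have(unsigned long hwcaps, unsigned long flags)
{
	return (hwcaps & flags) == flags;
}

/*
 * Hypothetical helper: query the aux vector of the running process.
 * 'type' is AT_HWCAP or AT_HWCAP2; getauxval() returns 0 when the
 * entry is absent, in which case no flag is reported as present.
 */
bool cpu_has(unsigned long type, unsigned long flags)
{
	return hwcaps_have(getauxval(type), flags);
}
```

On arm64, with `<asm/hwcap.h>` included, a caller might write `cpu_has(AT_HWCAP, HWCAP_AES | HWCAP_PMULL)` before selecting an AES-GCM code path, or `cpu_has(AT_HWCAP2, HWCAP2_SVE2)` for the flags reported in AT_HWCAP2. Requiring all bits at once avoids using a feature pair when only one half is present.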
diff --cc Documentation/arm64/sve.rst
index 000000000000,38422ab249dd..5689c74c8082
mode 000000,100644..100644
--- a/Documentation/arm64/sve.rst
+++ b/Documentation/arm64/sve.rst
@@@ -1,0 -1,529 +1,545 @@@
+ ===================================================
+ Scalable Vector Extension support for AArch64 Linux
+ ===================================================
+ 
+ Author: Dave Martin <Dave.Martin@arm.com>
+ 
+ Date:   4 August 2017
+ 
+ This document outlines briefly the interface provided to userspace by Linux in
+ order to support use of the ARM Scalable Vector Extension (SVE).
+ 
+ This is an outline of the most important features and issues only and not
+ intended to be exhaustive.
+ 
+ This document does not aim to describe the SVE architecture or programmer's
+ model.  To aid understanding, a minimal description of relevant programmer's
+ model features for SVE is included in Appendix A.
+ 
+ 
+ 1.  General
+ -----------
+ 
+ * SVE registers Z0..Z31, P0..P15 and FFR, and the current vector length VL, are
+   tracked per-thread.
+ 
+ * The presence of SVE is reported to userspace via HWCAP_SVE in the aux vector
+   AT_HWCAP entry.  Presence of this flag implies the presence of the SVE
+   instructions and registers, and the Linux-specific system interfaces
+   described in this document.  SVE is reported in /proc/cpuinfo as "sve".
+ 
+ * Support for the execution of SVE instructions in userspace can also be
+   detected by reading the CPU ID register ID_AA64PFR0_EL1 using an MRS
+   instruction, and checking that the value of the SVE field is nonzero. [3]
+ 
+   It does not guarantee the presence of the system interfaces described in the
+   following sections: software that needs to verify that those interfaces are
+   present must check for HWCAP_SVE instead.
+ 
+ * On hardware that supports the SVE2 extensions, HWCAP2_SVE2 will also
+   be reported in the AT_HWCAP2 aux vector entry.  In addition to this,
+   optional extensions to SVE2 may be reported by the presence of:
+ 
+ 	HWCAP2_SVE2
+ 	HWCAP2_SVEAES
+ 	HWCAP2_SVEPMULL
+ 	HWCAP2_SVEBITPERM
+ 	HWCAP2_SVESHA3
+ 	HWCAP2_SVESM4
+ 
+   This list may be extended over time as the SVE architecture evolves.
+ 
+   These extensions are also reported via the CPU ID register ID_AA64ZFR0_EL1,
+   which userspace can read using an MRS instruction.  See elf_hwcaps.rst and
+   cpu-feature-registers.rst for details.
+ 
+ * Debuggers should restrict themselves to interacting with the target via the
+   NT_ARM_SVE regset.  The recommended way of detecting support for this regset
+   is to connect to a target process first and then attempt a
+   ptrace(PTRACE_GETREGSET, pid, NT_ARM_SVE, &iov).
+ 
++* Whenever SVE scalable register values (Zn, Pn, FFR) are exchanged in memory
++  between userspace and the kernel, the register value is encoded in memory in
++  an endianness-invariant layout, with bits [(8 * i + 7) : (8 * i)] encoded at
++  byte offset i from the start of the memory representation.  This affects for
++  example the signal frame (struct sve_context) and ptrace interface
++  (struct user_sve_header) and associated data.
++
++  Beware that on big-endian systems this results in a different byte order than
++  for the FPSIMD V-registers, which are stored as single host-endian 128-bit
++  values, with bits [(127 - 8 * i) : (120 - 8 * i)] of the register encoded at
++  byte offset i.  (struct fpsimd_context, struct user_fpsimd_state).
++
+ 
+ 2.  Vector length terminology
+ -----------------------------
+ 
+ The size of an SVE vector (Z) register is referred to as the "vector length".
+ 
+ To avoid confusion about the units used to express vector length, the kernel
+ adopts the following conventions:
+ 
+ * Vector length (VL) = size of a Z-register in bytes
+ 
+ * Vector quadwords (VQ) = size of a Z-register in units of 128 bits
+ 
+ (So, VL = 16 * VQ.)
+ 
+ The VQ convention is used where the underlying granularity is important, such
+ as in data structure definitions.  In most other situations, the VL convention
+ is used.  This is consistent with the meaning of the "VL" pseudo-register in
+ the SVE instruction set architecture.
+ 
+ 
+ 3.  System call behaviour
+ -------------------------
+ 
+ * On syscall, V0..V31 are preserved (as without SVE).  Thus, bits [127:0] of
+   Z0..Z31 are preserved.  All other bits of Z0..Z31, and all of P0..P15 and FFR
+   become unspecified on return from a syscall.
+ 
+ * The SVE registers are not used to pass arguments to or receive results from
+   any syscall.
+ 
+ * In practice the affected registers/bits will be preserved or will be replaced
+   with zeros on return from a syscall, but userspace should not make
+   assumptions about this.  The kernel behaviour may vary on a case-by-case
+   basis.
+ 
+ * All other SVE state of a thread, including the currently configured vector
+   length, the state of the PR_SVE_VL_INHERIT flag, and the deferred vector
+   length (if any), is preserved across all syscalls, subject to the specific
+   exceptions for execve() described in section 6.
+ 
+   In particular, on return from a fork() or clone(), the parent and new child
+   process or thread share identical SVE configuration, matching that of the
+   parent before the call.
+ 
+ 
+ 4.  Signal handling
+ -------------------
+ 
+ * A new signal frame record sve_context encodes the SVE registers on signal
+   delivery. [1]
+ 
+ * This record is supplementary to fpsimd_context.  The FPSR and FPCR registers
+   are only present in fpsimd_context.  For convenience, the content of V0..V31
+   is duplicated between sve_context and fpsimd_context.
+ 
+ * The signal frame record for SVE always contains basic metadata, in particular
+   the thread's vector length (in sve_context.vl).
+ 
+ * The SVE registers may or may not be included in the record, depending on
+   whether the registers are live for the thread.  The registers are present if
+   and only if:
+   sve_context.head.size >= SVE_SIG_CONTEXT_SIZE(sve_vq_from_vl(sve_context.vl)).
+ 
+ * If the registers are present, the remainder of the record has a vl-dependent
+   size and layout.  Macros SVE_SIG_* are defined [1] to facilitate access to
+   the members.
+ 
++* Each scalable register (Zn, Pn, FFR) is stored in an endianness-invariant
++  layout, with bits [(8 * i + 7) : (8 * i)] stored at byte offset i from the
++  start of the register's representation in memory.
++
+ * If the SVE context is too big to fit in sigcontext.__reserved[], then extra
+   space is allocated on the stack and an extra_context record is written in
+   __reserved[] referencing this space.  sve_context is then written in the
+   extra space.  Refer to [1] for further details about this mechanism.
+ 
+ 
+ 5.  Signal return
+ -----------------
+ 
+ When returning from a signal handler:
+ 
+ * If there is no sve_context record in the signal frame, or if the record is
+   present but contains no register data as described in the previous section,
+   then the SVE registers/bits become non-live and take unspecified values.
+ 
+ * If sve_context is present in the signal frame and contains full register
+   data, the SVE registers become live and are populated with the specified
+   data.  However, for backward compatibility reasons, bits [127:0] of Z0..Z31
+   are always restored from the corresponding members of fpsimd_context.vregs[]
+   and not from sve_context.  The remaining bits are restored from sve_context.
+ 
+ * Inclusion of fpsimd_context in the signal frame remains mandatory,
+   irrespective of whether sve_context is present or not.
+ 
+ * The vector length cannot be changed via signal return.  If sve_context.vl in
+   the signal frame does not match the current vector length, the signal return
+   attempt is treated as illegal, resulting in a forced SIGSEGV.
+ 
+ 
+ 6.  prctl extensions
+ --------------------
+ 
+ Some new prctl() calls are added to allow programs to manage the SVE vector
+ length:
+ 
+ prctl(PR_SVE_SET_VL, unsigned long arg)
+ 
+     Sets the vector length of the calling thread and related flags, where
+     arg == vl | flags.  Other threads of the calling process are unaffected.
+ 
+     vl is the desired vector length, where sve_vl_valid(vl) must be true.
+ 
+     flags:
+ 
+ 	PR_SVE_VL_INHERIT
+ 
+ 	    Inherit the current vector length across execve().  Otherwise, the
+ 	    vector length is reset to the system default at execve().  (See
+ 	    Section 9.)
+ 
+ 	PR_SVE_SET_VL_ONEXEC
+ 
+ 	    Defer the requested vector length change until the next execve()
+ 	    performed by this thread.
+ 
+ 	    The effect is equivalent to implicit execution of the following
+ 	    call immediately after the next execve() (if any) by the thread:
+ 
+ 		prctl(PR_SVE_SET_VL, arg & ~PR_SVE_SET_VL_ONEXEC)
+ 
+ 	    This allows launching of a new program with a different vector
+ 	    length, while avoiding runtime side effects in the caller.
+ 
+ 
+ 	    Without PR_SVE_SET_VL_ONEXEC, the requested change takes effect
+ 	    immediately.
+ 
+ 
+     Return value: a nonnegative value on success, or a negative value on error:
+ 	EINVAL: SVE not supported, invalid vector length requested, or
+ 	    invalid flags.
+ 
+ 
+     On success:
+ 
+     * Either the calling thread's vector length or the deferred vector length
+       to be applied at the next execve() by the thread (dependent on whether
+       PR_SVE_SET_VL_ONEXEC is present in arg), is set to the largest value
+       supported by the system that is less than or equal to vl.  If vl ==
+       SVE_VL_MAX, the value set will be the largest value supported by the
+       system.
+ 
+     * Any previously outstanding deferred vector length change in the calling
+       thread is cancelled.
+ 
+     * The returned value describes the resulting configuration, encoded as for
+       PR_SVE_GET_VL.  The vector length reported in this value is the new
+       current vector length for this thread if PR_SVE_SET_VL_ONEXEC was not
+       present in arg; otherwise, the reported vector length is the deferred
+       vector length that will be applied at the next execve() by the calling
+       thread.
+ 
+     * Changing the vector length causes all of P0..P15, FFR and all bits of
+       Z0..Z31 except for Z0 bits [127:0] .. Z31 bits [127:0] to become
+       unspecified.  Calling PR_SVE_SET_VL with vl equal to the thread's current
+       vector length, or calling PR_SVE_SET_VL with the PR_SVE_SET_VL_ONEXEC
+       flag, does not constitute a change to the vector length for this purpose.
+ 
+ 
+ prctl(PR_SVE_GET_VL)
+ 
+     Gets the vector length of the calling thread.
+ 
+     The following flag may be OR-ed into the result:
+ 
+ 	PR_SVE_VL_INHERIT
+ 
+ 	    Vector length will be inherited across execve().
+ 
+     There is no way to determine whether there is an outstanding deferred
+     vector length change (which would only normally be the case between a
+     fork() or vfork() and the corresponding execve() in typical use).
+ 
+     To extract the vector length from the result, bitwise-AND it with
+     PR_SVE_VL_LEN_MASK.
+ 
+     Return value: a nonnegative value on success, or a negative value on error:
+ 	EINVAL: SVE not supported.
+ 
+ 
+ 7.  ptrace extensions
+ ---------------------
+ 
+ * A new regset NT_ARM_SVE is defined for use with PTRACE_GETREGSET and
+   PTRACE_SETREGSET.
+ 
+   Refer to [2] for definitions.
+ 
+ The regset data starts with struct user_sve_header, containing:
+ 
+     size
+ 
+ 	Size of the complete regset, in bytes.
+ 	This depends on vl and possibly on other things in the future.
+ 
+ 	If a call to PTRACE_GETREGSET requests less data than the value of
+ 	size, the caller can allocate a larger buffer and retry in order to
+ 	read the complete regset.
+ 
+     max_size
+ 
+ 	Maximum size in bytes that the regset can grow to for the target
+ 	thread.  The regset won't grow bigger than this even if the target
+ 	thread changes its vector length etc.
+ 
+     vl
+ 
+ 	Target thread's current vector length, in bytes.
+ 
+     max_vl
+ 
+ 	Maximum possible vector length for the target thread.
+ 
+     flags
+ 
+ 	either
+ 
+ 	    SVE_PT_REGS_FPSIMD
+ 
+ 		SVE registers are not live (GETREGSET) or are to be made
+ 		non-live (SETREGSET).
+ 
+ 		The payload is of type struct user_fpsimd_state, with the same
+ 		meaning as for NT_PRFPREG, starting at offset
+ 		SVE_PT_FPSIMD_OFFSET from the start of user_sve_header.
+ 
+ 		Extra data might be appended in the future: the size of the
+ 		payload should be obtained using SVE_PT_FPSIMD_SIZE(vq, flags).
+ 
+ 		vq should be obtained using sve_vq_from_vl(vl).
+ 
+ 		or
+ 
+ 	    SVE_PT_REGS_SVE
+ 
+ 		SVE registers are live (GETREGSET) or are to be made live
+ 		(SETREGSET).
+ 
+ 		The payload contains the SVE register data, starting at offset
+ 		SVE_PT_SVE_OFFSET from the start of user_sve_header, and with
+ 		size SVE_PT_SVE_SIZE(vq, flags);
+ 
+ 	... OR-ed with zero or more of the following flags, which have the same
+ 	meaning and behaviour as the corresponding PR_SVE_* flags:
+ 
+ 	    SVE_PT_VL_INHERIT
+ 
+ 	    SVE_PT_VL_ONEXEC (SETREGSET only).
+ 
+ * The effects of changing the vector length and/or flags are equivalent to
+   those documented for PR_SVE_SET_VL.
+ 
+   The caller must make a further GETREGSET call if it needs to know what VL is
+   actually set by SETREGSET, unless it is known in advance that the requested
+   VL is supported.
+ 
+ * In the SVE_PT_REGS_SVE case, the size and layout of the payload depends on
+   the header fields.  The SVE_PT_SVE_*() macros are provided to facilitate
+   access to the members.
+ 
+ * In either case, for SETREGSET it is permissible to omit the payload, in which
+   case only the vector length and flags are changed (along with any
+   consequences of those changes).
+ 
+ * For SETREGSET, if an SVE_PT_REGS_SVE payload is present and the
+   requested VL is not supported, the effect will be the same as if the
+   payload were omitted, except that an EIO error is reported.  No
+   attempt is made to translate the payload data to the correct layout
+   for the vector length actually set.  The thread's FPSIMD state is
+   preserved, but the remaining bits of the SVE registers become
+   unspecified.  It is up to the caller to translate the payload layout
+   for the actual VL and retry.
+ 
+ * The effect of writing a partial, incomplete payload is unspecified.
+ 
+ 
+ 8.  ELF coredump extensions
+ ---------------------------
+ 
+ * A NT_ARM_SVE note will be added to each coredump for each thread of the
+   dumped process.  The contents will be equivalent to the data that would have
+   been read if a PTRACE_GETREGSET of NT_ARM_SVE were executed for each thread
+   when the coredump was generated.
+ 
+ 
+ 9.  System runtime configuration
+ --------------------------------
+ 
+ * To mitigate the ABI impact of expansion of the signal frame, a policy
+   mechanism is provided for administrators, distro maintainers and developers
+   to set the default vector length for userspace processes:
+ 
+ /proc/sys/abi/sve_default_vector_length
+ 
+     Writing the text representation of an integer to this file sets the system
+     default vector length to the specified value, unless the value is greater
+     than the maximum vector length supported by the system in which case the
+     default vector length is set to that maximum.
+ 
+     The result can be determined by reopening the file and reading its
+     contents.
+ 
+     At boot, the default vector length is initially set to 64 or the maximum
+     supported vector length, whichever is smaller.  This determines the initial
+     vector length of the init process (PID 1).
+ 
+     Reading this file returns the current system default vector length.
+ 
+ * At every execve() call, the new vector length of the new process is set to
+   the system default vector length, unless
+ 
+     * PR_SVE_VL_INHERIT (or equivalently SVE_PT_VL_INHERIT) is set for the
+       calling thread, or
+ 
+     * a deferred vector length change is pending, established via the
+       PR_SVE_SET_VL_ONEXEC flag (or SVE_PT_VL_ONEXEC).
+ 
+ * Modifying the system default vector length does not affect the vector length
+   of any existing process or thread that does not make an execve() call.
+ 
+ 
+ Appendix A.  SVE programmer's model (informative)
+ =================================================
+ 
+ This section provides a minimal description of the additions made by SVE to the
+ ARMv8-A programmer's model that are relevant to this document.
+ 
+ Note: This section is for information only and not intended to be complete or
+ to replace any architectural specification.
+ 
+ A.1.  Registers
+ ---------------
+ 
+ In A64 state, SVE adds the following:
+ 
+ * 32 8VL-bit vector registers Z0..Z31
+   For each Zn, Zn bits [127:0] alias the ARMv8-A vector register Vn.
+ 
+   A register write using a Vn register name zeros all bits of the corresponding
+   Zn except for bits [127:0].
+ 
+ * 16 VL-bit predicate registers P0..P15
+ 
+ * 1 VL-bit special-purpose predicate register FFR (the "first-fault register")
+ 
+ * a VL "pseudo-register" that determines the size of each vector register
+ 
+   The SVE instruction set architecture provides no way to write VL directly.
+   Instead, it can be modified only by EL1 and above, by writing appropriate
+   system registers.
+ 
+ * The value of VL can be configured at runtime by EL1 and above:
+   16 <= VL <= VLmax, where VL must be a multiple of 16.
+ 
+ * The maximum vector length is determined by the hardware:
+   16 <= VLmax <= 256.
+ 
+   (The SVE architecture specifies 256, but permits future architecture
+   revisions to raise this limit.)
+ 
+ * FPSR and FPCR are retained from ARMv8-A, and interact with SVE floating-point
+   operations in a similar way to the way in which they interact with ARMv8
+   floating-point operations::
+ 
+          8VL-1                       128               0  bit index
+         +----          ////            -----------------+
+      Z0 |                               :       V0      |
+       :                                          :
+      Z7 |                               :       V7      |
+      Z8 |                               :     * V8      |
+       :                                       :  :
+     Z15 |                               :     *V15      |
+     Z16 |                               :      V16      |
+       :                                          :
+     Z31 |                               :      V31      |
+         +----          ////            -----------------+
+                                                  31    0
+          VL-1                  0                +-------+
+         +----       ////      --+          FPSR |       |
+      P0 |                       |               +-------+
+       : |                       |         *FPCR |       |
+     P15 |                       |               +-------+
+         +----       ////      --+
+     FFR |                       |               +-----+
+         +----       ////      --+            VL |     |
+                                                 +-----+
+ 
+ (*) callee-save:
+     This only applies to bits [63:0] of Z-/V-registers.
+     FPCR contains callee-save and caller-save bits.  See [4] for details.
+ 
+ 
+ A.2.  Procedure call standard
+ -----------------------------
+ 
+ The ARMv8-A base procedure call standard is extended as follows with respect to
+ the additional SVE register state:
+ 
+ * All SVE register bits that are not shared with FP/SIMD are caller-save.
+ 
+ * Z8 bits [63:0] .. Z15 bits [63:0] are callee-save.
+ 
+   This follows from the way these bits are mapped to V8..V15, which are
+   callee-save in the base procedure call standard.
+ 
+ 
+ Appendix B.  ARMv8-A FP/SIMD programmer's model
+ ===============================================
+ 
+ Note: This section is for information only and not intended to be complete or
+ to replace any architectural specification.
+ 
+ Refer to [4] for more information.
+ 
+ ARMv8-A defines the following floating-point / SIMD register state:
+ 
+ * 32 128-bit vector registers V0..V31
+ * 2 32-bit status/control registers FPSR, FPCR
+ 
+ ::
+ 
+          127           0  bit index
+         +---------------+
+      V0 |               |
+       : :               :
+      V7 |               |
+    * V8 |               |
+    :  : :               :
+    *V15 |               |
+     V16 |               |
+       : :               :
+     V31 |               |
+         +---------------+
+ 
+                  31    0
+                 +-------+
+            FPSR |       |
+                 +-------+
+           *FPCR |       |
+                 +-------+
+ 
+ (*) callee-save:
+     This only applies to bits [63:0] of V-registers.
+     FPCR contains a mixture of callee-save and caller-save bits.
+ 
+ 
+ References
+ ==========
+ 
+ [1] arch/arm64/include/uapi/asm/sigcontext.h
+     AArch64 Linux signal ABI definitions
+ 
+ [2] arch/arm64/include/uapi/asm/ptrace.h
+     AArch64 Linux ptrace ABI definitions
+ 
+ [3] Documentation/arm64/cpu-feature-registers.rst
+ 
+ [4] ARM IHI0055C
+     http://infocenter.arm.com/help/topic/com.arm.doc.ihi0055c/IHI0055C_beta_aapcs64.pdf
+     http://infocenter.arm.com/help/topic/com.arm.doc.subset.swdev.abi/index.html
+     Procedure Call Standard for the ARM 64-bit Architecture (AArch64)
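
Two points from the sve.rst text above lend themselves to a short sketch: decoding the value returned by PR_SVE_GET_VL (sections 2 and 6), and reading a scalable register from its endianness-invariant memory layout (section 1). This is an illustration under stated assumptions, not part of the patch. All `example_*` names are hypothetical, and the `#ifndef` fallbacks restate the values from `<linux/prctl.h>`, which spells the inherit flag PR_SVE_VL_INHERIT.

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/prctl.h>

/* Fallbacks for older userspace headers; values as in <linux/prctl.h>. */
#ifndef PR_SVE_GET_VL
#define PR_SVE_GET_VL		51
#endif
#ifndef PR_SVE_VL_LEN_MASK
#define PR_SVE_VL_LEN_MASK	0xffff
#endif
#ifndef PR_SVE_VL_INHERIT
#define PR_SVE_VL_INHERIT	(1 << 17)
#endif

/* VL is in bytes and VQ in units of 128 bits, so VL = 16 * VQ. */
unsigned int example_vq_from_vl(unsigned int vl)
{
	return vl / 16;
}

/* Vector length in bytes from a successful PR_SVE_GET_VL result. */
unsigned int example_vl_from_result(long ret)
{
	return (unsigned int)(ret & PR_SVE_VL_LEN_MASK);
}

/* Nonzero if the result says the VL is inherited across execve(). */
int example_vl_inherited(long ret)
{
	return (ret & PR_SVE_VL_INHERIT) != 0;
}

/* Current VL in bytes, or -1 if prctl() fails (e.g. EINVAL: no SVE). */
long example_current_vl(void)
{
	long ret = prctl(PR_SVE_GET_VL);

	return ret < 0 ? -1 : (long)example_vl_from_result(ret);
}

/*
 * Read 64-bit element 'lane' of a scalable register image.  Byte offset
 * i holds register bits [(8 * i + 7) : (8 * i)], so bytes are combined
 * least-significant first regardless of host endianness.
 */
uint64_t example_load_lane64(const unsigned char *reg, size_t lane)
{
	uint64_t v = 0;
	size_t i;

	for (i = 0; i < 8; i++)
		v |= (uint64_t)reg[8 * lane + i] << (8 * i);
	return v;
}
```

Assembling the value byte by byte, rather than casting the buffer to a `uint64_t` pointer, is what keeps the reader correct on big-endian hosts, where the FPSIMD V-register layout differs as the text notes.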
diff --cc Documentation/core-api/timekeeping.rst
index 20ee447a50f3,5f87d9c8b04d..c0ffa30c7c37
--- a/Documentation/core-api/timekeeping.rst
+++ b/Documentation/core-api/timekeeping.rst
@@@ -113,9 -108,10 +113,9 @@@ Some additional variants exist for mor
  		void ktime_get_coarse_boottime_ts64( struct timespec64 * )
  		void ktime_get_coarse_real_ts64( struct timespec64 * )
  		void ktime_get_coarse_clocktai_ts64( struct timespec64 * )
 -		void ktime_get_coarse_raw_ts64( struct timespec64 * )
  
  	These are quicker than the non-coarse versions, but less accurate,
- 	corresponding to CLOCK_MONONOTNIC_COARSE and CLOCK_REALTIME_COARSE
+ 	corresponding to CLOCK_MONOTONIC_COARSE and CLOCK_REALTIME_COARSE
  	in user space, along with the equivalent boottime/tai/raw
  	timebase not available in user space.
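
The ktime_get_coarse_*() functions in the hunk above are kernel-internal, but the user-space clocks the text names can be exercised directly. The following is a minimal user-space sketch, not part of the patch, and the `example_*` helper names are hypothetical. It reads a clock through clock_gettime() and reports its resolution through clock_getres().

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <time.h>

/* Resolution of 'clk' in nanoseconds, or -1 if the clock is unavailable. */
long long example_clock_res_ns(clockid_t clk)
{
	struct timespec res;

	if (clock_getres(clk, &res) != 0)
		return -1;
	return (long long)res.tv_sec * 1000000000LL + res.tv_nsec;
}

/* Current value of 'clk' in nanoseconds, or -1 if the read fails. */
long long example_clock_now_ns(clockid_t clk)
{
	struct timespec ts;

	if (clock_gettime(clk, &ts) != 0)
		return -1;
	return (long long)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}
```

Calling `example_clock_res_ns(CLOCK_MONOTONIC_COARSE)` typically reports a tick-sized resolution, while `CLOCK_MONOTONIC` typically reports a much finer one. That difference is the accuracy trade-off the text describes.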
  
diff --cc Documentation/fault-injection/nvme-fault-injection.rst
index 000000000000,bbb1bf3e8650..cdb2e829228e
mode 000000,100644..100644
--- a/Documentation/fault-injection/nvme-fault-injection.rst
+++ b/Documentation/fault-injection/nvme-fault-injection.rst
@@@ -1,0 -1,120 +1,178 @@@
+ NVMe Fault Injection
+ ====================
+ Linux's fault injection framework provides a systematic way to support
+ error injection via debugfs in the /sys/kernel/debug directory. When
+ enabled, the default NVME_SC_INVALID_OPCODE with no retry will be
+ injected into the nvme_end_request. Users can change the default status
+ code and no retry flag via the debugfs. The list of Generic Command
+ Status can be found in include/linux/nvme.h.
+ 
+ The following examples show how to inject an error into an NVMe device.
+ 
+ First, enable the CONFIG_FAULT_INJECTION_DEBUG_FS kernel config option and
+ recompile the kernel. After booting the new kernel, do the
+ following.
+ 
+ Example 1: Inject default status code with no retry
+ ---------------------------------------------------
+ 
+ ::
+ 
+   mount /dev/nvme0n1 /mnt
+   echo 1 > /sys/kernel/debug/nvme0n1/fault_inject/times
+   echo 100 > /sys/kernel/debug/nvme0n1/fault_inject/probability
+   cp a.file /mnt
+ 
+ Expected Result::
+ 
+   cp: cannot stat ‘/mnt/a.file’: Input/output error
+ 
+ Message from dmesg::
+ 
+   FAULT_INJECTION: forcing a failure.
+   name fault_inject, interval 1, probability 100, space 0, times 1
+   CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.15.0-rc8+ #2
+   Hardware name: innotek GmbH VirtualBox/VirtualBox,
+   BIOS VirtualBox 12/01/2006
+   Call Trace:
+     <IRQ>
+     dump_stack+0x5c/0x7d
+     should_fail+0x148/0x170
+     nvme_should_fail+0x2f/0x50 [nvme_core]
+     nvme_process_cq+0xe7/0x1d0 [nvme]
+     nvme_irq+0x1e/0x40 [nvme]
+     __handle_irq_event_percpu+0x3a/0x190
+     handle_irq_event_percpu+0x30/0x70
+     handle_irq_event+0x36/0x60
+     handle_fasteoi_irq+0x78/0x120
+     handle_irq+0xa7/0x130
+     ? tick_irq_enter+0xa8/0xc0
+     do_IRQ+0x43/0xc0
+     common_interrupt+0xa2/0xa2
+     </IRQ>
+   RIP: 0010:native_safe_halt+0x2/0x10
+   RSP: 0018:ffffffff82003e90 EFLAGS: 00000246 ORIG_RAX: ffffffffffffffdd
+   RAX: ffffffff817a10c0 RBX: ffffffff82012480 RCX: 0000000000000000
+   RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
+   RBP: 0000000000000000 R08: 000000008e38ce64 R09: 0000000000000000
+   R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff82012480
+   R13: ffffffff82012480 R14: 0000000000000000 R15: 0000000000000000
+     ? __sched_text_end+0x4/0x4
+     default_idle+0x18/0xf0
+     do_idle+0x150/0x1d0
+     cpu_startup_entry+0x6f/0x80
+     start_kernel+0x4c4/0x4e4
+     ? set_init_arg+0x55/0x55
+     secondary_startup_64+0xa5/0xb0
+     print_req_error: I/O error, dev nvme0n1, sector 9240
+   EXT4-fs error (device nvme0n1): ext4_find_entry:1436:
+   inode #2: comm cp: reading directory lblock 0
+ 
+ Example 2: Inject default status code with retry
+ ------------------------------------------------
+ 
+ ::
+ 
+   mount /dev/nvme0n1 /mnt
+   echo 1 > /sys/kernel/debug/nvme0n1/fault_inject/times
+   echo 100 > /sys/kernel/debug/nvme0n1/fault_inject/probability
+   echo 1 > /sys/kernel/debug/nvme0n1/fault_inject/status
+   echo 0 > /sys/kernel/debug/nvme0n1/fault_inject/dont_retry
+ 
+   cp a.file /mnt
+ 
+ Expected Result::
+ 
+   command success without error
+ 
+ Message from dmesg::
+ 
+   FAULT_INJECTION: forcing a failure.
+   name fault_inject, interval 1, probability 100, space 0, times 1
+   CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.15.0-rc8+ #4
+   Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
+   Call Trace:
+     <IRQ>
+     dump_stack+0x5c/0x7d
+     should_fail+0x148/0x170
+     nvme_should_fail+0x30/0x60 [nvme_core]
+     nvme_loop_queue_response+0x84/0x110 [nvme_loop]
+     nvmet_req_complete+0x11/0x40 [nvmet]
+     nvmet_bio_done+0x28/0x40 [nvmet]
+     blk_update_request+0xb0/0x310
+     blk_mq_end_request+0x18/0x60
+     flush_smp_call_function_queue+0x3d/0xf0
+     smp_call_function_single_interrupt+0x2c/0xc0
+     call_function_single_interrupt+0xa2/0xb0
+     </IRQ>
+   RIP: 0010:native_safe_halt+0x2/0x10
+   RSP: 0018:ffffc9000068bec0 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff04
+   RAX: ffffffff817a10c0 RBX: ffff88011a3c9680 RCX: 0000000000000000
+   RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
+   RBP: 0000000000000001 R08: 000000008e38c131 R09: 0000000000000000
+   R10: 0000000000000000 R11: 0000000000000000 R12: ffff88011a3c9680
+   R13: ffff88011a3c9680 R14: 0000000000000000 R15: 0000000000000000
+     ? __sched_text_end+0x4/0x4
+     default_idle+0x18/0xf0
+     do_idle+0x150/0x1d0
+     cpu_startup_entry+0x6f/0x80
+     start_secondary+0x187/0x1e0
+     secondary_startup_64+0xa5/0xb0
++
++Example 3: Inject an error into the 10th admin command
++------------------------------------------------------
++
++::
++
++  echo 100 > /sys/kernel/debug/nvme0/fault_inject/probability
++  echo 10 > /sys/kernel/debug/nvme0/fault_inject/space
++  echo 1 > /sys/kernel/debug/nvme0/fault_inject/times
++  nvme reset /dev/nvme0
++
++Expected Result::
++
++  After NVMe controller reset, the reinitialization may or may not succeed.
++  It depends on which admin command is actually forced to fail.
++
++Message from dmesg::
++
++  nvme nvme0: resetting controller
++  FAULT_INJECTION: forcing a failure.
++  name fault_inject, interval 1, probability 100, space 1, times 1
++  CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.2.0-rc2+ #2
++  Hardware name: MSI MS-7A45/B150M MORTAR ARCTIC (MS-7A45), BIOS 1.50 04/25/2017
++  Call Trace:
++   <IRQ>
++   dump_stack+0x63/0x85
++   should_fail+0x14a/0x170
++   nvme_should_fail+0x38/0x80 [nvme_core]
++   nvme_irq+0x129/0x280 [nvme]
++   ? blk_mq_end_request+0xb3/0x120
++   __handle_irq_event_percpu+0x84/0x1a0
++   handle_irq_event_percpu+0x32/0x80
++   handle_irq_event+0x3b/0x60
++   handle_edge_irq+0x7f/0x1a0
++   handle_irq+0x20/0x30
++   do_IRQ+0x4e/0xe0
++   common_interrupt+0xf/0xf
++   </IRQ>
++  RIP: 0010:cpuidle_enter_state+0xc5/0x460
++  Code: ff e8 8f 5f 86 ff 80 7d c7 00 74 17 9c 58 0f 1f 44 00 00 f6 c4 02 0f 85 69 03 00 00 31 ff e8 62 aa 8c ff fb 66 0f 1f 44 00 00 <45> 85 ed 0f 88 37 03 00 00 4c 8b 45 d0 4c 2b 45 b8 48 ba cf f7 53
++  RSP: 0018:ffffffff88c03dd0 EFLAGS: 00000246 ORIG_RAX: ffffffffffffffdc
++  RAX: ffff9dac25a2ac80 RBX: ffffffff88d53760 RCX: 000000000000001f
++  RDX: 0000000000000000 RSI: 000000002d958403 RDI: 0000000000000000
++  RBP: ffffffff88c03e18 R08: fffffff75e35ffb7 R09: 00000a49a56c0b48
++  R10: ffffffff88c03da0 R11: 0000000000001b0c R12: ffff9dac25a34d00
++  R13: 0000000000000006 R14: 0000000000000006 R15: ffffffff88d53760
++   cpuidle_enter+0x2e/0x40
++   call_cpuidle+0x23/0x40
++   do_idle+0x201/0x280
++   cpu_startup_entry+0x1d/0x20
++   rest_init+0xaa/0xb0
++   arch_call_rest_init+0xe/0x1b
++   start_kernel+0x51c/0x53b
++   x86_64_start_reservations+0x24/0x26
++   x86_64_start_kernel+0x74/0x77
++   secondary_startup_64+0xa4/0xb0
++  nvme nvme0: Could not set queue count (16385)
++  nvme nvme0: IO queues not created
diff --cc Documentation/scheduler/sched-deadline.rst
index 000000000000,873fb2775ca6..3391e86d810c
mode 000000,100644..100644
--- a/Documentation/scheduler/sched-deadline.rst
+++ b/Documentation/scheduler/sched-deadline.rst
@@@ -1,0 -1,888 +1,888 @@@
+ ========================
+ Deadline Task Scheduling
+ ========================
+ 
+ .. CONTENTS
+ 
+     0. WARNING
+     1. Overview
+     2. Scheduling algorithm
+       2.1 Main algorithm
+       2.2 Bandwidth reclaiming
+       2.3 Energy-aware scheduling
+     3. Scheduling Real-Time Tasks
+       3.1 Definitions
+       3.2 Schedulability Analysis for Uniprocessor Systems
+       3.3 Schedulability Analysis for Multiprocessor Systems
+       3.4 Relationship with SCHED_DEADLINE Parameters
+     4. Bandwidth management
+       4.1 System-wide settings
+       4.2 Task interface
+       4.3 Default behavior
+       4.4 Behavior of sched_yield()
+     5. Tasks CPU affinity
+       5.1 SCHED_DEADLINE and cpusets HOWTO
+     6. Future plans
+     A. Test suite
+     B. Minimal main()
+ 
+ 
+ 0. WARNING
+ ==========
+ 
+  Fiddling with these settings can result in unpredictable or even unstable
+  system behavior. As for -rt (group) scheduling, it is assumed that root users
+  know what they're doing.
+ 
+ 
+ 1. Overview
+ ===========
+ 
+  The SCHED_DEADLINE policy contained inside the sched_dl scheduling class is
+  basically an implementation of the Earliest Deadline First (EDF) scheduling
+  algorithm, augmented with a mechanism (called Constant Bandwidth Server, CBS)
+  that makes it possible to isolate the behavior of tasks from each other.
+ 
+ 
+ 2. Scheduling algorithm
+ =======================
+ 
+ 2.1 Main algorithm
+ ------------------
+ 
+  SCHED_DEADLINE [18] uses three parameters, named "runtime", "period", and
+  "deadline", to schedule tasks. A SCHED_DEADLINE task should receive
+  "runtime" microseconds of execution time every "period" microseconds, and
+  these "runtime" microseconds are available within "deadline" microseconds
+  from the beginning of the period.  In order to implement this behavior,
+  every time the task wakes up, the scheduler computes a "scheduling deadline"
+  consistent with the guarantee (using the CBS[2,3] algorithm). Tasks are then
+  scheduled using EDF[1] on these scheduling deadlines (the task with the
+  earliest scheduling deadline is selected for execution). Notice that the
+  task actually receives "runtime" time units within "deadline" if a proper
+  "admission control" strategy (see Section "4. Bandwidth management") is used
+  (clearly, if the system is overloaded this guarantee cannot be respected).
+ 
+  Summing up, the CBS[2,3] algorithm assigns scheduling deadlines to tasks so
+  that each task runs for at most its runtime every period, avoiding any
+  interference between different tasks (bandwidth isolation), while the EDF[1]
+  algorithm selects the task with the earliest scheduling deadline as the one
+  to be executed next. Thanks to this feature, tasks that do not strictly comply
+  with the "traditional" real-time task model (see Section 3) can effectively
+  use the new policy.
+ 
+  In more detail, the CBS algorithm assigns scheduling deadlines to
+  tasks in the following way:
+ 
+   - Each SCHED_DEADLINE task is characterized by the "runtime",
+     "deadline", and "period" parameters;
+ 
+   - The state of the task is described by a "scheduling deadline", and
+     a "remaining runtime". These two parameters are initially set to 0;
+ 
+   - When a SCHED_DEADLINE task wakes up (becomes ready for execution),
+     the scheduler checks if::
+ 
+                  remaining runtime                  runtime
+         ----------------------------------    >    ---------
+         scheduling deadline - current time           period
+ 
+     then, if the scheduling deadline is smaller than the current time, or
+     this condition is verified, the scheduling deadline and the
+     remaining runtime are re-initialized as::
+ 
+          scheduling deadline = current time + deadline
+          remaining runtime = runtime
+ 
+     otherwise, the scheduling deadline and the remaining runtime are
+     left unchanged;
+ 
+   - When a SCHED_DEADLINE task executes for an amount of time t, its
+     remaining runtime is decreased as::
+ 
+          remaining runtime = remaining runtime - t
+ 
+     (technically, the runtime is decreased at every tick, or when the
+     task is descheduled / preempted);
+ 
+   - When the remaining runtime becomes less than or equal to 0, the task is
+     said to be "throttled" (also known as "depleted" in real-time literature)
+     and cannot be scheduled until its scheduling deadline. The "replenishment
+     time" for this task (see next item) is set to be equal to the current
+     value of the scheduling deadline;
+ 
+   - When the current time is equal to the replenishment time of a
+     throttled task, the scheduling deadline and the remaining runtime are
+     updated as::
+ 
+          scheduling deadline = scheduling deadline + period
+          remaining runtime = remaining runtime + runtime
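+ 
+  As an illustration of the wake-up rule above, consider a hypothetical task
+  with runtime = 10, deadline = 100 and period = 100 (arbitrary time units;
+  these values are assumptions chosen only for this example) that has a
+  scheduling deadline of 105 and a remaining runtime of 6 when it blocks::
+ 
+          wake-up at current time = 40:
+              6 / (105 - 40) = 0.092...  which is not > 10 / 100 = 0.1
+              => scheduling deadline and remaining runtime are left unchanged
+ 
+          wake-up at current time = 50:
+              6 / (105 - 50) = 0.109...  which is > 10 / 100 = 0.1
+              => scheduling deadline = 50 + 100 = 150
+                 remaining runtime   = 10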
+ 
+  The SCHED_FLAG_DL_OVERRUN flag in sched_attr's sched_flags field allows a task
+  to get informed about runtime overruns through the delivery of SIGXCPU
+  signals.
+ 
+ 
+ 2.2 Bandwidth reclaiming
+ ------------------------
+ 
+  Bandwidth reclaiming for deadline tasks is based on the GRUB (Greedy
+  Reclamation of Unused Bandwidth) algorithm [15, 16, 17] and it is enabled
+  when flag SCHED_FLAG_RECLAIM is set.
+ 
+  The following diagram illustrates the state names for tasks handled by GRUB::
+ 
+                              ------------
+                  (d)        |   Active   |
+               ------------->|            |
+               |             | Contending |
+               |              ------------
+               |                A      |
+           ----------           |      |
+          |          |          |      |
+          | Inactive |          |(b)   | (a)
+          |          |          |      |
+           ----------           |      |
+               A                |      V
+               |              ------------
+               |             |   Active   |
+               --------------|     Non    |
+                  (c)        | Contending |
+                              ------------
+ 
+  A task can be in one of the following states:
+ 
+   - ActiveContending: if it is ready for execution (or executing);
+ 
+   - ActiveNonContending: if it just blocked and has not yet surpassed the 0-lag
+     time;
+ 
+   - Inactive: if it is blocked and has surpassed the 0-lag time.
+ 
+  State transitions:
+ 
+   (a) When a task blocks, it does not become immediately inactive since its
+       bandwidth cannot be immediately reclaimed without breaking the
+       real-time guarantees. It therefore enters a transitional state called
+       ActiveNonContending. The scheduler arms the "inactive timer" to fire at
+       the 0-lag time, when the task's bandwidth can be reclaimed without
+       breaking the real-time guarantees.
+ 
+       The 0-lag time for a task entering the ActiveNonContending state is
+       computed as::
+ 
+                         (runtime * dl_period)
+              deadline - ---------------------
+                              dl_runtime
+ 
+       where runtime is the remaining runtime, while dl_runtime and dl_period
+       are the reservation parameters.
+ 
+   (b) If the task wakes up before the inactive timer fires, the task re-enters
+       the ActiveContending state and the "inactive timer" is canceled.
+       In addition, if the task wakes up on a different runqueue, then
+       the task's utilization must be removed from the previous runqueue's active
+       utilization and must be added to the new runqueue's active utilization.
+       In order to avoid races between a task waking up on a runqueue while the
+       "inactive timer" is running on a different CPU, the "dl_non_contending"
+       flag is used to indicate that a task is not on a runqueue but is active
+       (so, the flag is set when the task blocks and is cleared when the
+       "inactive timer" fires or when the task  wakes up).
+ 
+   (c) When the "inactive timer" fires, the task enters the Inactive state and
+       its utilization is removed from the runqueue's active utilization.
+ 
+   (d) When an inactive task wakes up, it enters the ActiveContending state and
+       its utilization is added to the active utilization of the runqueue where
+       it has been enqueued.
+ 
+  For each runqueue, the GRUB algorithm keeps track of two different bandwidths:
+ 
+   - Active bandwidth (running_bw): this is the sum of the bandwidths of all
+     tasks in active state (i.e., ActiveContending or ActiveNonContending);
+ 
+   - Total bandwidth (this_bw): this is the sum of the bandwidths of all tasks
+     "belonging" to the runqueue, including the tasks in Inactive state.
+ 
+ 
+  The algorithm reclaims the bandwidth of the tasks in Inactive state.
+  It does so by decrementing the runtime of the executing task Ti at a pace equal
+  to
+ 
+            dq = -max{ Ui / Umax, (1 - Uinact - Uextra) } dt
+ 
+  where:
+ 
+   - Ui is the bandwidth of task Ti;
+   - Umax is the maximum reclaimable utilization (subject to RT throttling
+     limits);
+   - Uinact is the (per runqueue) inactive utilization, computed as
+     (this_bw - running_bw);
+   - Uextra is the (per runqueue) extra reclaimable utilization
+     (subject to RT throttling limits).
+ 
+ 
+  Let's now see a trivial example of two deadline tasks with runtime equal
+  to 4 and period equal to 8 (i.e., bandwidth equal to 0.5)::
+ 
+          A            Task T1
+          |
+          |                               |
+          |                               |
+          |--------                       |----
+          |       |                       V
+          |---|---|---|---|---|---|---|---|--------->t
+          0   1   2   3   4   5   6   7   8
+ 
+ 
+          A            Task T2
+          |
+          |                               |
+          |                               |
+          |       ------------------------|
+          |       |                       V
+          |---|---|---|---|---|---|---|---|--------->t
+          0   1   2   3   4   5   6   7   8
+ 
+ 
+          A            running_bw
+          |
+        1 -----------------               ------
+          |               |               |
+       0.5-               -----------------
+          |                               |
+          |---|---|---|---|---|---|---|---|--------->t
+          0   1   2   3   4   5   6   7   8
+ 
+ 
+   - Time t = 0:
+ 
+     Both tasks are ready for execution and therefore in ActiveContending state.
+     Suppose Task T1 is the first task to start execution.
+     Since there are no inactive tasks, its runtime is decreased as dq = -1 dt.
+ 
+   - Time t = 2:
+ 
+     Suppose that task T1 blocks.
+     Task T1 therefore enters the ActiveNonContending state. Since its remaining
+     runtime is equal to 2, its 0-lag time is equal to t = 4.
+     Task T2 starts execution, with runtime still decreased as dq = -1 dt since
+     there are no inactive tasks.
+ 
+   - Time t = 4:
+ 
+     This is the 0-lag time for Task T1. Since it has not woken up in the
+     meantime, it enters the Inactive state. Its bandwidth is removed from
+     running_bw.
+     Task T2 continues its execution. However, its runtime is now decreased as
+     dq = - 0.5 dt because Uinact = 0.5.
+     Task T2 therefore reclaims the bandwidth unused by Task T1.
+ 
+   - Time t = 8:
+ 
+     Task T1 wakes up. It enters the ActiveContending state again, and the
+     running_bw is incremented.
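+ 
+  The values used in the example above can be checked against the formulas of
+  this section. This is only a sketch: it assumes that each task has a
+  relative deadline equal to its period (so T1's scheduling deadline is 8),
+  that Umax = 1 and that Uextra = 0, none of which is stated explicitly in the
+  example::
+ 
+          0-lag time of T1 at t = 2 (remaining runtime 2):
+              8 - (2 * 8) / 4 = 4
+ 
+          t in [0, 4):  Uinact = 0
+              dq = -max{ 0.5 / 1, (1 - 0 - 0) } dt   = -1 dt
+ 
+          t in [4, 8):  Uinact = this_bw - running_bw = 1 - 0.5 = 0.5
+              dq = -max{ 0.5 / 1, (1 - 0.5 - 0) } dt = -0.5 dt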
+ 
+ 
+ 2.3 Energy-aware scheduling
+ ---------------------------
+ 
+  When cpufreq's schedutil governor is selected, SCHED_DEADLINE implements the
+  GRUB-PA [19] algorithm, reducing the CPU operating frequency to the minimum
+  value that still allows the deadlines to be met. This behavior is currently
+  implemented only for ARM architectures.
+ 
+  Particular care must be taken when the time needed to change the frequency
+  is of the same order of magnitude as the reservation period. In such cases,
+  setting a fixed CPU frequency results in fewer deadline misses.
+ 
+ 
+ 3. Scheduling Real-Time Tasks
+ =============================
+ 
+ 
+ 
+  ..  BIG FAT WARNING ******************************************************
+ 
+  .. warning::
+ 
+    This section contains a (not-thorough) summary of classical deadline
+    scheduling theory, and how it applies to SCHED_DEADLINE.
+    The reader can "safely" skip to Section 4 if only interested in seeing
+    how the scheduling policy can be used. Anyway, we strongly recommend
+    coming back here and continuing to read (once the urge for testing is
+    satisfied :P) to be sure of fully understanding all technical details.
+ 
+  .. ************************************************************************
+ 
+  There are no limitations on what kind of task can exploit this new
+  scheduling discipline, even if it must be said that it is particularly
+  suited for periodic or sporadic real-time tasks that need guarantees on their
+  timing behavior, e.g., multimedia, streaming, control applications, etc.
+ 
+ 3.1 Definitions
+ ---------------
+ 
+  A typical real-time task is composed of a repetition of computation phases
+  (task instances, or jobs) which are activated in a periodic or sporadic
+  fashion.
+  Each job J_j (where J_j is the j^th job of the task) is characterized by an
+  arrival time r_j (the time when the job starts), an amount of computation
+  time c_j needed to finish the job, and a job absolute deadline d_j, which
+  is the time within which the job should be finished. The maximum execution
+  time max{c_j} is called "Worst Case Execution Time" (WCET) for the task.
+  A real-time task can be periodic with period P if r_{j+1} = r_j + P, or
+  sporadic with minimum inter-arrival time P if r_{j+1} >= r_j + P. Finally,
+  d_j = r_j + D, where D is the task's relative deadline.
+  Summing up, a real-time task can be described as
+ 
+ 	Task = (WCET, D, P)
+ 
+  The utilization of a real-time task is defined as the ratio between its
+  WCET and its period (or minimum inter-arrival time), and represents
+  the fraction of CPU time needed to execute the task.
+ 
+  If the total utilization U=sum(WCET_i/P_i) is larger than M (with M equal
+  to the number of CPUs), then the scheduler is unable to respect all the
+  deadlines.
+  Note that total utilization is defined as the sum of the utilizations
+  WCET_i/P_i over all the real-time tasks in the system. When considering
+  multiple real-time tasks, the parameters of the i-th task are indicated
+  with the "_i" suffix.
+  Moreover, if the total utilization is larger than M, then we risk starving
+  non-real-time tasks by real-time tasks.
+  If, instead, the total utilization is smaller than M, then non-real-time
+  tasks will not be starved and the system might be able to respect all the
+  deadlines.
+  As a matter of fact, in this case it is possible to provide an upper bound
+  for tardiness (defined as the maximum between 0 and the difference
+  between the finishing time of a job and its absolute deadline).
+  More precisely, it can be proven that using a global EDF scheduler the
+  maximum tardiness of each task is smaller than or equal to
+ 
+ 	((M − 1) · WCET_max − WCET_min)/(M − (M − 2) · U_max) + WCET_max
+ 
+  where WCET_max = max{WCET_i} is the maximum WCET, WCET_min=min{WCET_i}
+  is the minimum WCET, and U_max = max{WCET_i/P_i} is the maximum
+  utilization[12].
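+ 
+  For instance, with the hypothetical values M = 2, WCET_max = 4, WCET_min = 2
+  and U_max = 0.5 (chosen only for illustration, in the same time unit as the
+  WCETs), the bound evaluates to::
+ 
+ 	((2 - 1) * 4 - 2)/(2 - (2 - 2) * 0.5) + 4 = 2/2 + 4 = 5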
+ 
+ 3.2 Schedulability Analysis for Uniprocessor Systems
+ ----------------------------------------------------
+ 
+  If M=1 (uniprocessor system), or in case of partitioned scheduling (each
+  real-time task is statically assigned to one and only one CPU), it is
+  possible to formally check if all the deadlines are respected.
+  If D_i = P_i for all tasks, then EDF is able to respect all the deadlines
+  of all the tasks executing on a CPU if and only if the total utilization
+  of the tasks running on such a CPU is smaller than or equal to 1.
+  If D_i != P_i for some task, then it is possible to define the density of
+  a task as WCET_i/min{D_i,P_i}, and EDF is able to respect all the deadlines
+  of all the tasks running on a CPU if the sum of the densities of the tasks
+  running on such a CPU is smaller than or equal to 1:
+ 
+ 	sum(WCET_i / min{D_i, P_i}) <= 1
+ 
+  It is important to notice that this condition is only sufficient, and not
+  necessary: there are task sets that are schedulable, but do not respect the
+  condition. For example, consider the task set {Task_1,Task_2} composed of
+  Task_1=(50ms,50ms,100ms) and Task_2=(10ms,100ms,100ms).
+  EDF is clearly able to schedule the two tasks without missing any deadline
+  (Task_1 is scheduled as soon as it is released, and finishes just in time
+  to respect its deadline; Task_2 is scheduled immediately after Task_1, hence
+  its response time cannot be larger than 50ms + 10ms = 60ms) even if
+ 
+ 	50 / min{50,100} + 10 / min{100, 100} = 50 / 50 + 10 / 100 = 1.1
+ 
+  Of course it is possible to test the exact schedulability of tasks with
+  D_i != P_i (checking a condition that is both sufficient and necessary),
+  but this cannot be done by comparing the total utilization or density with
+  a constant. Instead, the so called "processor demand" approach can be used,
+  computing the total amount of CPU time h(t) needed by all the tasks to
+  respect all of their deadlines in a time interval of size t, and comparing
+  such a time with the interval size t. If h(t) is smaller than t (that is,
+  the amount of time needed by the tasks in a time interval of size t is
+  smaller than the size of the interval) for all the possible values of t, then
+  EDF is able to schedule the tasks respecting all of their deadlines. Since
+  performing this check for all possible values of t is impossible, it has been
+  proven[4,5,6] that it is sufficient to perform the test for values of t
+  between 0 and a maximum value L. The cited papers contain all of the
+  mathematical details and explain how to compute h(t) and L.
+  In any case, this kind of analysis is too complex as well as too
+  time-consuming to be performed on-line. Hence, as explained in Section
+  4, Linux uses an admission test based on the tasks' utilizations.
+ 
+ 3.3 Schedulability Analysis for Multiprocessor Systems
+ ------------------------------------------------------
+ 
+  On multiprocessor systems with global EDF scheduling (non partitioned
+  systems), a sufficient test for schedulability cannot be based on the
+  utilizations or densities: it can be shown that even if D_i = P_i, task
+  sets with utilizations slightly larger than 1 can miss deadlines regardless
+  of the number of CPUs.
+ 
+  Consider a set {Task_1,...Task_{M+1}} of M+1 tasks on a system with M
+  CPUs, with the first task Task_1=(P,P,P) having period, relative deadline
+  and WCET equal to P. The remaining M tasks Task_i=(e,P-1,P-1) have an
+  arbitrarily small worst case execution time (indicated as "e" here) and a
+  period smaller than the one of the first task. Hence, if all the tasks
+  activate at the same time t, global EDF schedules these M tasks first
+  (because their absolute deadlines are equal to t + P - 1, hence they are
+  smaller than the absolute deadline of Task_1, which is t + P). As a
+  result, Task_1 can be scheduled only at time t + e, and will finish at
+  time t + e + P, after its absolute deadline. The total utilization of the
+  task set is U = M · e / (P - 1) + P / P = M · e / (P - 1) + 1, and for small
+  values of e this can become very close to 1. This is known as "Dhall's
+  effect"[7]. Note: the example in the original paper by Dhall has been
+  slightly simplified here (for example, Dhall more correctly computed
+  lim_{e->0}U).
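+ 
+  As a numerical illustration (the values are hypothetical), take M = 4,
+  P = 100 and e = 1, with all the tasks activating at time t = 0::
+ 
+ 	absolute deadline of Task_2 ... Task_5:  0 + 100 - 1 = 99
+ 	absolute deadline of Task_1:             0 + 100     = 100
+ 	finishing time of Task_1:                0 + 1 + 100 = 101  (> 100)
+ 	U = 4 * 1 / (100 - 1) + 1 = 1.0404...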
+ 
+  More complex schedulability tests for global EDF have been developed in
+  real-time literature[8,9], but they are not based on a simple comparison
+  between total utilization (or density) and a fixed constant. If all tasks
+  have D_i = P_i, a sufficient schedulability condition can be expressed in
+  a simple way:
+ 
+ 	sum(WCET_i / P_i) <= M - (M - 1) · U_max
+ 
+  where U_max = max{WCET_i / P_i}[10]. Notice that for U_max = 1,
+  M - (M - 1) · U_max becomes M - M + 1 = 1 and this schedulability condition
+  just confirms Dhall's effect. A more complete survey of the literature
+  about schedulability tests for multi-processor real-time scheduling can be
+  found in [11].
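+ 
+  For example, assuming M = 4 CPUs and a task set with D_i = P_i and
+  U_max = 0.5 (hypothetical values), the condition becomes::
+ 
+ 	sum(WCET_i / P_i) <= 4 - (4 - 1) * 0.5 = 2.5
+ 
+  so such a task set passes this sufficient test as long as its total
+  utilization does not exceed 2.5, which is well below the number of CPUs.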
+ 
+  As seen, enforcing that the total utilization is smaller than M does not
+  guarantee that global EDF schedules the tasks without missing any deadline
+  (in other words, global EDF is not an optimal scheduling algorithm). However,
+  a total utilization smaller than M is enough to guarantee that non real-time
+  tasks are not starved and that the tardiness of real-time tasks has an upper
+  bound[12] (as previously noted). Different bounds on the maximum tardiness
+  experienced by real-time tasks have been developed in various papers[13,14],
+  but the theoretical result that is important for SCHED_DEADLINE is that if
+  the total utilization is smaller than or equal to M then the response times of
+  the tasks are limited.
+ 
+ 3.4 Relationship with SCHED_DEADLINE Parameters
+ -----------------------------------------------
+ 
+  Finally, it is important to understand the relationship between the
+  SCHED_DEADLINE scheduling parameters described in Section 2 (runtime,
+  deadline and period) and the real-time task parameters (WCET, D, P)
+  described in this section. Note that a task's temporal constraints are
+  represented by its absolute deadlines d_j = r_j + D described above, while
+  SCHED_DEADLINE schedules the tasks according to scheduling deadlines (see
+  Section 2).
+  If an admission test is used to guarantee that the scheduling deadlines
+  are respected, then SCHED_DEADLINE can be used to schedule real-time tasks
+  guaranteeing that all the jobs' deadlines of a task are respected.
+  In order to do this, a task must be scheduled by setting:
+ 
+   - runtime >= WCET
+   - deadline = D
+   - period <= P
+ 
+  IOW, if runtime >= WCET and if period is <= P, then the scheduling deadlines
+  and the absolute deadlines (d_j) coincide, so proper admission control
+  makes it possible to respect the jobs' absolute deadlines for this task (this
+  is what is called "hard schedulability property" and is an extension of
+  Lemma 1 of [2]).
+  Notice that if runtime > deadline the admission control will surely reject
+  this task, as it is not possible to respect its temporal constraints.
+ 
+  References:
+ 
+   1 - C. L. Liu and J. W. Layland. Scheduling algorithms for
+       multiprogramming in a hard-real-time environment. Journal of the
+       Association for Computing Machinery, 20(1), 1973.
+   2 - L. Abeni, G. Buttazzo. Integrating Multimedia Applications in Hard
+       Real-Time Systems. Proceedings of the 19th IEEE Real-time Systems
+       Symposium, 1998. http://retis.sssup.it/~giorgio/paps/1998/rtss98-cbs.pdf
+   3 - L. Abeni. Server Mechanisms for Multimedia Applications. ReTiS Lab
+       Technical Report. http://disi.unitn.it/~abeni/tr-98-01.pdf
+   4 - J. Y. Leung and M. L. Merrill. A Note on Preemptive Scheduling of
+       Periodic, Real-Time Tasks. Information Processing Letters, vol. 11,
+       no. 3, pp. 115-118, 1980.
+   5 - S. K. Baruah, A. K. Mok and L. E. Rosier. Preemptively Scheduling
+       Hard-Real-Time Sporadic Tasks on One Processor. Proceedings of the
+       11th IEEE Real-time Systems Symposium, 1990.
+   6 - S. K. Baruah, L. E. Rosier and R. R. Howell. Algorithms and Complexity
+       Concerning the Preemptive Scheduling of Periodic Real-Time tasks on
+       One Processor. Real-Time Systems Journal, vol. 4, no. 2, pp 301-324,
+       1990.
+   7 - S. J. Dhall and C. L. Liu. On a real-time scheduling problem. Operations
+       research, vol. 26, no. 1, pp 127-140, 1978.
+   8 - T. Baker. Multiprocessor EDF and Deadline Monotonic Schedulability
+       Analysis. Proceedings of the 24th IEEE Real-Time Systems Symposium, 2003.
+   9 - T. Baker. An Analysis of EDF Schedulability on a Multiprocessor.
+       IEEE Transactions on Parallel and Distributed Systems, vol. 16, no. 8,
+       pp 760-768, 2005.
+   10 - J. Goossens, S. Funk and S. Baruah, Priority-Driven Scheduling of
+        Periodic Task Systems on Multiprocessors. Real-Time Systems Journal,
+        vol. 25, no. 2–3, pp. 187–205, 2003.
+   11 - R. Davis and A. Burns. A Survey of Hard Real-Time Scheduling for
+        Multiprocessor Systems. ACM Computing Surveys, vol. 43, no. 4, 2011.
+        http://www-users.cs.york.ac.uk/~robdavis/papers/MPSurveyv5.0.pdf
+   12 - U. C. Devi and J. H. Anderson. Tardiness Bounds under Global EDF
+        Scheduling on a Multiprocessor. Real-Time Systems Journal, vol. 32,
+        no. 2, pp 133-189, 2008.
+   13 - P. Valente and G. Lipari. An Upper Bound to the Lateness of Soft
+        Real-Time Tasks Scheduled by EDF on Multiprocessors. Proceedings of
+        the 26th IEEE Real-Time Systems Symposium, 2005.
+   14 - J. Erickson, U. Devi and S. Baruah. Improved tardiness bounds for
+        Global EDF. Proceedings of the 22nd Euromicro Conference on
+        Real-Time Systems, 2010.
+   15 - G. Lipari, S. Baruah, Greedy reclamation of unused bandwidth in
+        constant-bandwidth servers, 12th IEEE Euromicro Conference on Real-Time
+        Systems, 2000.
+   16 - L. Abeni, J. Lelli, C. Scordino, L. Palopoli, Greedy CPU reclaiming for
+        SCHED DEADLINE. In Proceedings of the Real-Time Linux Workshop (RTLWS),
+        Dusseldorf, Germany, 2014.
+   17 - L. Abeni, G. Lipari, A. Parri, Y. Sun, Multicore CPU reclaiming: parallel
+        or sequential?. In Proceedings of the 31st Annual ACM Symposium on Applied
+        Computing, 2016.
+   18 - J. Lelli, C. Scordino, L. Abeni, D. Faggioli, Deadline scheduling in the
+        Linux kernel, Software: Practice and Experience, 46(6): 821-839, June
+        2016.
+   19 - C. Scordino, L. Abeni, J. Lelli, Energy-Aware Real-Time Scheduling in
+        the Linux Kernel, 33rd ACM/SIGAPP Symposium On Applied Computing (SAC
+        2018), Pau, France, April 2018.
+ 
+ 
+ 4. Bandwidth management
+ =======================
+ 
+  As previously mentioned, in order for -deadline scheduling to be
+  effective and useful (that is, to be able to provide "runtime" time units
+  within "deadline"), it is important to have some method to keep the allocation
+  of the available fractions of CPU time to the various tasks under control.
+  This is usually called "admission control" and if it is not performed, then
+  no guarantee can be given on the actual scheduling of the -deadline tasks.
+ 
+  As already stated in Section 3, a necessary condition to be respected to
+  correctly schedule a set of real-time tasks is that the total utilization
+  is smaller than M. When talking about -deadline tasks, this requires that
+  the sum of the ratio between runtime and period for all tasks is smaller
+  than M. Notice that the ratio runtime/period is equivalent to the utilization
+  of a "traditional" real-time task, and is also often referred to as
+  "bandwidth".
+  The interface used to control the CPU bandwidth that can be allocated
+  to -deadline tasks is similar to the one already used for -rt
+  tasks with real-time group scheduling (a.k.a. RT-throttling - see
+  Documentation/scheduler/sched-rt-group.rst), and is based on readable/
+  writable control files located in procfs (for system wide settings).
+  Notice that per-group settings (controlled through cgroupfs) are still not
+  defined for -deadline tasks, because more discussion is needed in order to
+  figure out how we want to manage SCHED_DEADLINE bandwidth at the task group
+  level.
+ 
+  A main difference between deadline bandwidth management and RT-throttling
+  is that -deadline tasks have bandwidth on their own (while -rt ones don't!),
+  and thus we don't need a higher level throttling mechanism to enforce the
+  desired bandwidth. In other words, this means that interface parameters are
+  only used at admission control time (i.e., when the user calls
+  sched_setattr()). Scheduling is then performed considering actual tasks'
+  parameters, so that CPU bandwidth is allocated to SCHED_DEADLINE tasks
+  respecting their needs in terms of granularity. Therefore, using this simple
+  interface we can put a cap on total utilization of -deadline tasks (i.e.,
+  \Sum (runtime_i / period_i) < global_dl_utilization_cap).
+ 
+ 4.1 System wide settings
+ ------------------------
+ 
+  The system wide settings are configured under the /proc virtual file system.
+ 
+  For now the -rt knobs are used for -deadline admission control and the
+  -deadline runtime is accounted against the -rt runtime. We realize that this
+  isn't entirely desirable; however, it is better to have a small interface for
+  now, and be able to change it easily later. The ideal situation (see 5.) is to
+  run -rt tasks from a -deadline server; in which case the -rt bandwidth is a
+  direct subset of dl_bw.
+ 
+  This means that, for a root_domain comprising M CPUs, -deadline tasks
+  can be created while the sum of their bandwidths stays below:
+ 
+    M * (sched_rt_runtime_us / sched_rt_period_us)
+ 
+  It is also possible to disable this bandwidth management logic, and
+  thus be free to oversubscribe the system up to any arbitrary level.
+  This is done by writing -1 in /proc/sys/kernel/sched_rt_runtime_us.
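+ 
+  As a rough illustration of the check described above, the following C
+  sketch models the system-wide cap in user space. The names dl_task and
+  dl_admit() are hypothetical and are not kernel interfaces; the kernel keeps
+  its own accounting, while this sketch uses floating point for brevity and
+  assumes that writing -1 simply skips the test.
+ 

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of a -deadline reservation (not a kernel structure). */
struct dl_task {
	uint64_t runtime_ns;
	uint64_t period_ns;
};

/*
 * Return true if the summed bandwidth (runtime / period) of the tasks stays
 * below ncpus * (rt_runtime_us / rt_period_us). A runtime of -1 disables
 * the check, mirroring the sched_rt_runtime_us knob described in the text.
 */
bool dl_admit(const struct dl_task *tasks, int n, int ncpus,
	      int64_t rt_runtime_us, int64_t rt_period_us)
{
	double total = 0.0;
	int i;

	if (rt_runtime_us == -1)
		return true;
	assert(rt_period_us > 0);

	for (i = 0; i < n; i++) {
		assert(tasks[i].period_ns > 0);
		total += (double)tasks[i].runtime_ns /
			 (double)tasks[i].period_ns;
	}
	return total < ncpus * ((double)rt_runtime_us / (double)rt_period_us);
}
```

+ 
+  With the default knobs (950000/1000000) and a single CPU, two tasks with a
+  10ms/30ms reservation pass this check, while a third one does not.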
+ 
+ 
+ 4.2 Task interface
+ ------------------
+ 
+  Specifying a periodic/sporadic task that executes for a given amount of
+  runtime at each instance, and that is scheduled according to the urgency of
+  its own timing constraints needs, in general, a way of declaring:
+ 
+   - a (maximum/typical) instance execution time,
+   - a minimum interval between consecutive instances,
+   - a time constraint by which each instance must be completed.
+ 
+  Therefore:
+ 
+   * a new struct sched_attr, containing all the necessary fields, is
+     provided;
+   * the new scheduling related syscalls that manipulate it, i.e.,
+     sched_setattr() and sched_getattr(), are implemented.
+ 
+  For debugging purposes, the leftover runtime and absolute deadline of a
+  SCHED_DEADLINE task can be retrieved through /proc/<pid>/sched (entries
+  dl.runtime and dl.deadline, both values in ns). A programmatic way to
+  retrieve these values from production code is under discussion.
+ 
+ 
+ 4.3 Default behavior
+ ---------------------
+ 
+  The default value for SCHED_DEADLINE bandwidth is to have rt_runtime equal to
+  950000. With rt_period equal to 1000000, by default, it means that -deadline
+  tasks can use at most 95%, multiplied by the number of CPUs that compose the
+  root_domain, for each root_domain.
+  This means that non -deadline tasks will receive at least 5% of the CPU time,
+  and that -deadline tasks will receive their runtime with a guaranteed
+  worst-case delay with respect to the "deadline" parameter. If "deadline" = "period"
+  and the cpuset mechanism is used to implement partitioned scheduling (see
+  Section 5), then this simple setting of the bandwidth management is able to
+  deterministically guarantee that -deadline tasks will receive their runtime
+  in a period.
+ 
+  Finally, notice that in order not to jeopardize the admission control a
+  -deadline task cannot fork.
+ 
+ 
+ 4.4 Behavior of sched_yield()
+ -----------------------------
+ 
+  When a SCHED_DEADLINE task calls sched_yield(), it gives up its
+  remaining runtime and is immediately throttled, until the next
+  period, when its runtime will be replenished (a special flag
+  dl_yielded is set and used to correctly handle throttling and runtime
+  replenishment after a call to sched_yield()).
+ 
+  This behavior of sched_yield() allows the task to wake up exactly at
+  the beginning of the next period. Also, this may be useful in the
+  future with bandwidth reclaiming mechanisms, where sched_yield() will
+  make the leftover runtime available for reclamation by other
+  SCHED_DEADLINE tasks.
+ 
+ 
+ 5. Tasks CPU affinity
+ =====================
+ 
+  -deadline tasks cannot have an affinity mask smaller than the entire
+  root_domain they are created on. However, affinities can be specified
 - through the cpuset facility (Documentation/cgroup-v1/cpusets.txt).
++ through the cpuset facility (Documentation/cgroup-v1/cpusets.rst).
+ 
+ 5.1 SCHED_DEADLINE and cpusets HOWTO
+ ------------------------------------
+ 
+  An example of a simple configuration (pin a -deadline task to CPU0)
+  follows (rt-app is used to create a -deadline task)::
+ 
+    mkdir /dev/cpuset
+    mount -t cgroup -o cpuset cpuset /dev/cpuset
+    cd /dev/cpuset
+    mkdir cpu0
+    echo 0 > cpu0/cpuset.cpus
+    echo 0 > cpu0/cpuset.mems
+    echo 1 > cpuset.cpu_exclusive
+    echo 0 > cpuset.sched_load_balance
+    echo 1 > cpu0/cpuset.cpu_exclusive
+    echo 1 > cpu0/cpuset.mem_exclusive
+    echo $$ > cpu0/tasks
+    rt-app -t 100000:10000:d:0 -D5 # it is now actually superfluous to specify
+ 				  # task affinity
+ 
+ 6. Future plans
+ ===============
+ 
+  Still missing:
+ 
+   - programmatic way to retrieve current runtime and absolute deadline
+   - refinements to deadline inheritance, especially regarding the possibility
+     of retaining bandwidth isolation among non-interacting tasks. This is
+     being studied from both theoretical and practical points of view, and
+     hopefully we should be able to produce some demonstrative code soon;
+   - (c)group based bandwidth management, and maybe scheduling;
+   - access control for non-root users (and related security concerns to
+     address): what is the best way to allow unprivileged use of the mechanisms,
+     and how do we prevent non-root users from "cheating" the system?
+ 
+  As already discussed, we are also planning to merge this work with the EDF
+  throttling patches [https://lkml.org/lkml/2010/2/23/239] but we still are in
+  the preliminary phases of the merge and we really seek feedback that would
+  help us decide on the direction it should take.
+ 
+ Appendix A. Test suite
+ ======================
+ 
+  The SCHED_DEADLINE policy can be easily tested using two applications that
+  are part of a wider Linux Scheduler validation suite. The suite is
+  available as a GitHub repository: https://github.com/scheduler-tools.
+ 
+  The first testing application is called rt-app and can be used to
+  start multiple threads with specific parameters. rt-app supports
+  SCHED_{OTHER,FIFO,RR,DEADLINE} scheduling policies and their related
+  parameters (e.g., niceness, priority, runtime/deadline/period). rt-app
+  is a valuable tool, as it can be used to synthetically recreate certain
+  workloads (maybe mimicking real use-cases) and evaluate how the scheduler
+  behaves under such workloads. In this way, results are easily reproducible.
+  rt-app is available at: https://github.com/scheduler-tools/rt-app.
+ 
+  Thread parameters can be specified from the command line, with something like
+  this::
+ 
+   # rt-app -t 100000:10000:d -t 150000:20000:f:10 -D5
+ 
+  The above creates 2 threads. The first one, scheduled by SCHED_DEADLINE,
+  executes for 10ms every 100ms. The second one, scheduled at SCHED_FIFO
+  priority 10, executes for 20ms every 150ms. The test will run for a total
+  of 5 seconds.
+ 
+  More interestingly, configurations can be described with a json file that
+  can be passed as input to rt-app with something like this::
+ 
+   # rt-app my_config.json
+ 
+  The parameters that can be specified with the second method are a superset
+  of the command line options. Please refer to rt-app documentation for more
+  details (`<rt-app-sources>/doc/*.json`).
+ 
+  The second testing application is a modification of schedtool, called
+  schedtool-dl, which can be used to set up SCHED_DEADLINE parameters for a
+  certain pid/application. schedtool-dl is available at:
+  https://github.com/scheduler-tools/schedtool-dl.git.
+ 
+  The usage is straightforward::
+ 
+   # schedtool -E -t 10000000:100000000 -e ./my_cpuhog_app
+ 
+  With this, my_cpuhog_app is put to run inside a SCHED_DEADLINE reservation
+  of 10ms every 100ms (note that parameters are expressed in microseconds).
+  You can also use schedtool to create a reservation for an already running
+  application, given that you know its pid::
+ 
+   # schedtool -E -t 10000000:100000000 my_app_pid
+ 
+ Appendix B. Minimal main()
+ ==========================
+ 
+  We provide in what follows a simple (ugly) self-contained code snippet
+  showing how SCHED_DEADLINE reservations can be created by a real-time
+  application developer::
+ 
+    #define _GNU_SOURCE
+    #include <unistd.h>
+    #include <stdio.h>
+    #include <stdlib.h>
+    #include <string.h>
+    #include <time.h>
+    #include <linux/unistd.h>
+    #include <linux/kernel.h>
+    #include <linux/types.h>
+    #include <sys/syscall.h>
+    #include <pthread.h>
+ 
+    #define gettid() syscall(__NR_gettid)
+ 
+    #define SCHED_DEADLINE	6
+ 
+    /* XXX use the proper syscall numbers */
+    #ifdef __x86_64__
+    #define __NR_sched_setattr		314
+    #define __NR_sched_getattr		315
+    #endif
+ 
+    #ifdef __i386__
+    #define __NR_sched_setattr		351
+    #define __NR_sched_getattr		352
+    #endif
+ 
+    #ifdef __arm__
+    #define __NR_sched_setattr		380
+    #define __NR_sched_getattr		381
+    #endif
+ 
+    static volatile int done;
+ 
+    struct sched_attr {
+ 	__u32 size;
+ 
+ 	__u32 sched_policy;
+ 	__u64 sched_flags;
+ 
+ 	/* SCHED_NORMAL, SCHED_BATCH */
+ 	__s32 sched_nice;
+ 
+ 	/* SCHED_FIFO, SCHED_RR */
+ 	__u32 sched_priority;
+ 
+ 	/* SCHED_DEADLINE (nsec) */
+ 	__u64 sched_runtime;
+ 	__u64 sched_deadline;
+ 	__u64 sched_period;
+    };
+ 
+    int sched_setattr(pid_t pid,
+ 		  const struct sched_attr *attr,
+ 		  unsigned int flags)
+    {
+ 	return syscall(__NR_sched_setattr, pid, attr, flags);
+    }
+ 
+    int sched_getattr(pid_t pid,
+ 		  struct sched_attr *attr,
+ 		  unsigned int size,
+ 		  unsigned int flags)
+    {
+ 	return syscall(__NR_sched_getattr, pid, attr, size, flags);
+    }
+ 
+    void *run_deadline(void *data)
+    {
+ 	struct sched_attr attr;
+ 	int x = 0;
+ 	int ret;
+ 	unsigned int flags = 0;
+ 
+ 	printf("deadline thread started [%ld]\n", gettid());
+ 
+ 	attr.size = sizeof(attr);
+ 	attr.sched_flags = 0;
+ 	attr.sched_nice = 0;
+ 	attr.sched_priority = 0;
+ 
+ 	/* This creates a 10ms/30ms reservation */
+ 	attr.sched_policy = SCHED_DEADLINE;
+ 	attr.sched_runtime = 10 * 1000 * 1000;
+ 	attr.sched_period = attr.sched_deadline = 30 * 1000 * 1000;
+ 
+ 	ret = sched_setattr(0, &attr, flags);
+ 	if (ret < 0) {
+ 		done = 0;
+ 		perror("sched_setattr");
+ 		exit(-1);
+ 	}
+ 
+ 	while (!done) {
+ 		x++;
+ 	}
+ 
+ 	printf("deadline thread dies [%ld]\n", gettid());
+ 	return NULL;
+    }
+ 
+    int main (int argc, char **argv)
+    {
+ 	pthread_t thread;
+ 
+ 	printf("main thread [%ld]\n", gettid());
+ 
+ 	pthread_create(&thread, NULL, run_deadline, NULL);
+ 
+ 	sleep(10);
+ 
+ 	done = 1;
+ 	pthread_join(thread, NULL);
+ 
+ 	printf("main dies [%ld]\n", gettid());
+ 	return 0;
+    }
diff --cc Documentation/scheduler/sched-design-CFS.rst
index 000000000000,82406685365a..53b30d1967cf
mode 000000,100644..100644
--- a/Documentation/scheduler/sched-design-CFS.rst
+++ b/Documentation/scheduler/sched-design-CFS.rst
@@@ -1,0 -1,249 +1,249 @@@
+ =============
+ CFS Scheduler
+ =============
+ 
+ 
+ 1.  OVERVIEW
+ ============
+ 
+ CFS stands for "Completely Fair Scheduler," and is the new "desktop" process
+ scheduler implemented by Ingo Molnar and merged in Linux 2.6.23.  It is the
+ replacement for the previous vanilla scheduler's SCHED_OTHER interactivity
+ code.
+ 
+ 80% of CFS's design can be summed up in a single sentence: CFS basically models
+ an "ideal, precise multi-tasking CPU" on real hardware.
+ 
+ "Ideal multi-tasking CPU" is a (non-existent  :-)) CPU that has 100% physical
+ power and which can run each task at precisely equal speed, in parallel, each at
+ 1/nr_running speed.  For example: if there are 2 tasks running, then it runs
+ each at 50% physical power --- i.e., actually in parallel.
+ 
+ On real hardware, we can run only a single task at once, so we have to
+ introduce the concept of "virtual runtime."  The virtual runtime of a task
+ specifies when its next timeslice would start execution on the ideal
+ multi-tasking CPU described above.  In practice, the virtual runtime of a task
+ is its actual runtime normalized to the total number of running tasks.
+ 
+ 
+ 
+ 2.  FEW IMPLEMENTATION DETAILS
+ ==============================
+ 
+ In CFS the virtual runtime is expressed and tracked via the per-task
+ p->se.vruntime (nanosec-unit) value.  This way, it's possible to accurately
+ timestamp and measure the "expected CPU time" a task should have gotten.
+ 
+ [ small detail: on "ideal" hardware, at any time all tasks would have the same
+   p->se.vruntime value --- i.e., tasks would execute simultaneously and no task
+   would ever get "out of balance" from the "ideal" share of CPU time.  ]
+ 
+ CFS's task picking logic is based on this p->se.vruntime value and it is thus
+ very simple: it always tries to run the task with the smallest p->se.vruntime
+ value (i.e., the task which executed least so far).  CFS always tries to split
+ up CPU time between runnable tasks as close to "ideal multitasking hardware" as
+ possible.
+ 
+ Most of the rest of CFS's design just falls out of this really simple concept,
+ with a few add-on embellishments like nice levels, multiprocessing and various
+ algorithm variants to recognize sleepers.
+ 
+ 
+ 
+ 3.  THE RBTREE
+ ==============
+ 
+ CFS's design is quite radical: it does not use the old data structures for the
+ runqueues, but it uses a time-ordered rbtree to build a "timeline" of future
+ task execution, and thus has no "array switch" artifacts (by which both the
+ previous vanilla scheduler and RSDL/SD are affected).
+ 
+ CFS also maintains the rq->cfs.min_vruntime value, which is a monotonically
+ increasing value tracking the smallest vruntime among all tasks in the
+ runqueue.  The total amount of work done by the system is tracked using
+ min_vruntime; that value is used to place newly activated entities on the left
+ side of the tree as much as possible.
+ 
+ The total number of running tasks in the runqueue is accounted through the
+ rq->cfs.load value, which is the sum of the weights of the tasks queued on the
+ runqueue.
+ 
+ CFS maintains a time-ordered rbtree, where all runnable tasks are sorted by the
+ p->se.vruntime key. CFS picks the "leftmost" task from this tree and sticks to it.
+ As the system progresses forwards, the executed tasks are put into the tree
+ more and more to the right --- slowly but surely giving a chance for every task
+ to become the "leftmost task" and thus get on the CPU within a deterministic
+ amount of time.
+ 
+ Summing up, CFS works like this: it runs a task a bit, and when the task
+ schedules (or a scheduler tick happens) the task's CPU usage is "accounted
+ for": the (small) time it just spent using the physical CPU is added to
+ p->se.vruntime.  Once p->se.vruntime gets high enough so that another task
+ becomes the "leftmost task" of the time-ordered rbtree it maintains (plus a
+ small amount of "granularity" distance relative to the leftmost task so that we
+ do not over-schedule tasks and thrash the cache), then the new leftmost task is
+ picked and the current task is preempted.
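+ 
+ The loop summarized above can be sketched with a toy model. This is only an
+ illustration under simplifying assumptions: it scans a plain array instead of
+ maintaining an rbtree, it ignores task weights and the "granularity" distance,
+ and the names toy_task, toy_pick_next() and toy_run_slice() are hypothetical.
+ The vruntime field plays the role of p->se.vruntime.
+ 

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct toy_task {
	uint64_t vruntime;	/* ns; stands in for p->se.vruntime */
	uint64_t ran;		/* total ns actually executed */
};

/* Return the index of the task with the smallest vruntime ("leftmost"). */
size_t toy_pick_next(const struct toy_task *t, size_t n)
{
	size_t i, best = 0;

	assert(n > 0);
	for (i = 1; i < n; i++)
		if (t[i].vruntime < t[best].vruntime)
			best = i;
	return best;
}

/* Run the leftmost task for slice_ns and account for the time it used. */
size_t toy_run_slice(struct toy_task *t, size_t n, uint64_t slice_ns)
{
	size_t cur = toy_pick_next(t, n);

	t[cur].vruntime += slice_ns;
	t[cur].ran += slice_ns;
	return cur;
}
```

+ 
+ Calling toy_run_slice() repeatedly on tasks that all start with the same
+ vruntime hands out the CPU in turn, so every task ends up with the same
+ amount of executed time.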
+ 
+ 
+ 
+ 4.  SOME FEATURES OF CFS
+ ========================
+ 
+ CFS uses nanosecond granularity accounting and does not rely on any jiffies or
+ other HZ detail.  Thus the CFS scheduler has no notion of "timeslices" in the
+ way the previous scheduler had, and has no heuristics whatsoever.  There is
+ only one central tunable (you have to switch on CONFIG_SCHED_DEBUG):
+ 
+    /proc/sys/kernel/sched_min_granularity_ns
+ 
+ which can be used to tune the scheduler from "desktop" (i.e., low latencies) to
+ "server" (i.e., good batching) workloads.  It defaults to a setting suitable
+ for desktop workloads.  SCHED_BATCH is handled by the CFS scheduler module too.
+ 
+ Due to its design, the CFS scheduler is not prone to any of the "attacks" that
+ exist today against the heuristics of the stock scheduler: fiftyp.c, thud.c,
+ chew.c, ring-test.c, massive_intr.c all work fine and do not impact
+ interactivity and produce the expected behavior.
+ 
+ The CFS scheduler has a much stronger handling of nice levels and SCHED_BATCH
+ than the previous vanilla scheduler: both types of workloads are isolated much
+ more aggressively.
+ 
+ SMP load-balancing has been reworked/sanitized: the runqueue-walking
+ assumptions are gone from the load-balancing code now, and iterators of the
+ scheduling modules are used.  The balancing code got quite a bit simpler as a
+ result.
+ 
+ 
+ 
+ 5. Scheduling policies
+ ======================
+ 
+ CFS implements three scheduling policies:
+ 
+   - SCHED_NORMAL (traditionally called SCHED_OTHER): The scheduling
+     policy that is used for regular tasks.
+ 
+   - SCHED_BATCH: Does not preempt nearly as often as regular tasks
+     would, thereby allowing tasks to run longer and make better use of
+     caches but at the cost of interactivity. This is well suited for
+     batch jobs.
+ 
+   - SCHED_IDLE: This is even weaker than nice 19, but it's not a true
+     idle timer scheduler, in order to avoid priority inversion
+     problems which would deadlock the machine.
+ 
+ SCHED_FIFO/_RR are implemented in sched/rt.c and are as specified by
+ POSIX.
+ 
+ The command chrt from util-linux-ng 2.13.1.1 can set all of these except
+ SCHED_IDLE.
+ 
+ 
+ 
+ 6.  SCHEDULING CLASSES
+ ======================
+ 
+ The new CFS scheduler has been designed in such a way as to introduce "Scheduling
+ Classes," an extensible hierarchy of scheduler modules.  These modules
+ encapsulate scheduling policy details and are handled by the scheduler core
+ without the core code assuming too much about them.
+ 
+ sched/fair.c implements the CFS scheduler described above.
+ 
+ sched/rt.c implements SCHED_FIFO and SCHED_RR semantics, in a simpler way than
+ the previous vanilla scheduler did.  It uses 100 runqueues (for all 100 RT
+ priority levels, instead of 140 in the previous scheduler) and it needs no
+ expired array.
+ 
+ Scheduling classes are implemented through the sched_class structure, which
+ contains hooks to functions that must be called whenever an interesting event
+ occurs.
+ 
+ This is the (partial) list of the hooks:
+ 
+  - enqueue_task(...)
+ 
+    Called when a task enters a runnable state.
+    It puts the scheduling entity (task) into the red-black tree and
+    increments the nr_running variable.
+ 
+  - dequeue_task(...)
+ 
+    When a task is no longer runnable, this function is called to keep the
+    corresponding scheduling entity out of the red-black tree.  It decrements
+    the nr_running variable.
+ 
+  - yield_task(...)
+ 
+    This function is basically just a dequeue followed by an enqueue, unless the
+    compat_yield sysctl is turned on; in that case, it places the scheduling
+    entity at the right-most end of the red-black tree.
+ 
+  - check_preempt_curr(...)
+ 
+    This function checks if a task that entered the runnable state should
+    preempt the currently running task.
+ 
+  - pick_next_task(...)
+ 
+    This function chooses the most appropriate task eligible to run next.
+ 
+  - set_curr_task(...)
+ 
+    This function is called when a task changes its scheduling class or changes
+    its task group.
+ 
+  - task_tick(...)
+ 
+    This function is mostly called from time tick functions; it might lead to
+    a process switch.  This drives the running preemption.
+ 
+ 
+ 
+ 
+ 7.  GROUP SCHEDULER EXTENSIONS TO CFS
+ =====================================
+ 
+ Normally, the scheduler operates on individual tasks and strives to provide
+ fair CPU time to each task.  Sometimes, it may be desirable to group tasks and
+ provide fair CPU time to each such task group.  For example, it may be
+ desirable to first provide fair CPU time to each user on the system and then to
+ each task belonging to a user.
+ 
+ CONFIG_CGROUP_SCHED strives to achieve exactly that.  It lets tasks be
+ grouped and divides CPU time fairly among such groups.
+ 
+ CONFIG_RT_GROUP_SCHED permits grouping of real-time (i.e., SCHED_FIFO and
+ SCHED_RR) tasks.
+ 
+ CONFIG_FAIR_GROUP_SCHED permits grouping of CFS (i.e., SCHED_NORMAL and
+ SCHED_BATCH) tasks.
+ 
+    These options need CONFIG_CGROUPS to be defined, and let the administrator
+    create arbitrary groups of tasks, using the "cgroup" pseudo filesystem.  See
 -   Documentation/cgroup-v1/cgroups.txt for more information about this filesystem.
++   Documentation/cgroup-v1/cgroups.rst for more information about this filesystem.
+ 
+ When CONFIG_FAIR_GROUP_SCHED is defined, a "cpu.shares" file is created for each
+ group created using the pseudo filesystem.  See example steps below to create
+ task groups and modify their CPU share using the "cgroups" pseudo filesystem::
+ 
+ 	# mount -t tmpfs cgroup_root /sys/fs/cgroup
+ 	# mkdir /sys/fs/cgroup/cpu
+ 	# mount -t cgroup -ocpu none /sys/fs/cgroup/cpu
+ 	# cd /sys/fs/cgroup/cpu
+ 
+ 	# mkdir multimedia	# create "multimedia" group of tasks
+ 	# mkdir browser		# create "browser" group of tasks
+ 
+ 	# #Configure the multimedia group to receive twice the CPU bandwidth
+ 	# #that of browser group
+ 
+ 	# echo 2048 > multimedia/cpu.shares
+ 	# echo 1024 > browser/cpu.shares
+ 
+ 	# firefox &	# Launch firefox and move it to "browser" group
+ 	# echo <firefox_pid> > browser/tasks
+ 
+ 	# #Launch gmplayer (or your favourite movie player)
+ 	# echo <movie_player_pid> > multimedia/tasks
diff --cc Documentation/scheduler/sched-rt-group.rst
index 000000000000,79b30a21c51a..d27d3f3712fd
mode 000000,100644..100644
--- a/Documentation/scheduler/sched-rt-group.rst
+++ b/Documentation/scheduler/sched-rt-group.rst
@@@ -1,0 -1,185 +1,185 @@@
+ ==========================
+ Real-Time group scheduling
+ ==========================
+ 
+ .. CONTENTS
+ 
+    0. WARNING
+    1. Overview
+      1.1 The problem
+      1.2 The solution
+    2. The interface
+      2.1 System-wide settings
+      2.2 Default behaviour
+      2.3 Basis for grouping tasks
+    3. Future plans
+ 
+ 
+ 0. WARNING
+ ==========
+ 
+  Fiddling with these settings can result in an unstable system. The knobs are
+  root only and assume root knows what he is doing.
+ 
+ Most notable:
+ 
+  * very small values in sched_rt_period_us can result in an unstable
+    system when the period is smaller than either the available hrtimer
+    resolution, or the time it takes to handle the budget refresh itself.
+ 
+  * very small values in sched_rt_runtime_us can result in an unstable
+    system when the runtime is so small the system has difficulty making
+    forward progress (NOTE: the migration thread and kstopmachine both
+    are real-time processes).
+ 
+ 1. Overview
+ ===========
+ 
+ 
+ 1.1 The problem
+ ---------------
+ 
+ Realtime scheduling is all about determinism: a group has to be able to rely on
+ the amount of bandwidth (e.g., CPU time) being constant. In order to schedule
+ multiple groups of realtime tasks, each group must be assigned a fixed portion
+ of the CPU time available.  Without a minimum guarantee a realtime group can
+ obviously fall short. A fuzzy upper limit is of no use since it cannot be
+ relied upon. Which leaves us with just the single fixed portion.
+ 
+ 1.2 The solution
+ ----------------
+ 
+ CPU time is divided by means of specifying how much time can be spent running
+ in a given period. We allocate this "run time" for each realtime group which
+ the other realtime groups will not be permitted to use.
+ 
+ Any time not allocated to a realtime group will be used to run normal priority
+ tasks (SCHED_OTHER). Any allocated run time not used will also be picked up by
+ SCHED_OTHER.
+ 
+ Let's consider an example: a fixed-frame-rate realtime renderer must deliver 25
+ frames a second, which yields a period of 0.04s per frame. Now say it will also
+ have to play some music and respond to input, leaving it with around 80% CPU
+ time dedicated for the graphics. We can then give this group a run time of 0.8
+ * 0.04s = 0.032s.
+ 
+ This way the graphics group will have a 0.04s period with a 0.032s run time
+ limit. Now if the audio thread needs to refill the DMA buffer every 0.005s, but
+ needs only about 3% CPU time to do so, it can make do with a run time of
+ 0.03 * 0.005s = 0.00015s. So this group can be scheduled with a period of
+ 0.005s and a run time of 0.00015s.
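+ 
+ The arithmetic of this example can be written down as a small C helper. The
+ function name share_to_runtime_us() is hypothetical; it only restates
+ "run time = CPU share * period" in microseconds, the unit used by the
+ interface described in Section 2.
+ 

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: convert a period (in microseconds) and a CPU share
 * (in percent) into the run time (in microseconds) to give to a group.
 */
uint64_t share_to_runtime_us(uint64_t period_us, unsigned int cpu_percent)
{
	assert(cpu_percent <= 100);
	return period_us * cpu_percent / 100;
}
```

+ 
+ For the graphics group, share_to_runtime_us(40000, 80) gives 32000us
+ (0.032s); for the audio group, share_to_runtime_us(5000, 3) gives 150us
+ (0.00015s).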
+ 
+ The remaining CPU time will be used for user input and other tasks. Because
+ realtime tasks have explicitly allocated the CPU time they need to perform
+ their tasks, buffer underruns in the graphics or audio can be eliminated.
+ 
+ NOTE: the above example is not fully implemented yet. We still
+ lack an EDF scheduler to make non-uniform periods usable.
+ 
+ 
+ 2. The Interface
+ ================
+ 
+ 
+ 2.1 System wide settings
+ ------------------------
+ 
+ The system wide settings are configured under the /proc virtual file system:
+ 
+ /proc/sys/kernel/sched_rt_period_us:
+   The scheduling period that is equivalent to 100% CPU bandwidth
+ 
+ /proc/sys/kernel/sched_rt_runtime_us:
+   A global limit on how much time realtime scheduling may use.  Even without
+   CONFIG_RT_GROUP_SCHED enabled, this will limit time reserved to realtime
+   processes. With CONFIG_RT_GROUP_SCHED it signifies the total bandwidth
+   available to all realtime groups.
+ 
+   * Time is specified in us because the interface is s32. This gives an
+     operating range from 1us to about 35 minutes.
+   * sched_rt_period_us takes values from 1 to INT_MAX.
+   * sched_rt_runtime_us takes values from -1 to (INT_MAX - 1).
+   * A run time of -1 specifies runtime == period, i.e., no limit.
+ 
+ 
+ 2.2 Default behaviour
+ ---------------------
+ 
+ The default values are sched_rt_period_us = 1000000 (1s) and
+ sched_rt_runtime_us = 950000 (0.95s).  This gives 0.05s to be used by
+ SCHED_OTHER (non-RT tasks). These defaults were chosen so that runaway
+ realtime tasks will not lock up the machine but leave a little time to recover
+ it.  By setting runtime to -1 you'd get the old behaviour back.
+ 
+ By default all bandwidth is assigned to the root group and new groups get the
+ period from /proc/sys/kernel/sched_rt_period_us and a run time of 0. If you
+ want to assign bandwidth to another group, reduce the root group's bandwidth
+ and assign some or all of the difference to another group.
+ 
+ Realtime group scheduling means you have to assign a portion of total CPU
+ bandwidth to the group before it will accept realtime tasks. Therefore you will
+ not be able to run realtime tasks as any user other than root until you have
+ done that, even if the user has the rights to run processes with realtime
+ priority!
+ 
+ 
+ 2.3 Basis for grouping tasks
+ ----------------------------
+ 
+ Enabling CONFIG_RT_GROUP_SCHED lets you explicitly allocate real
+ CPU bandwidth to task groups.
+ 
+ This uses the cgroup virtual file system and "<cgroup>/cpu.rt_runtime_us"
+ to control the CPU time reserved for each control group.
+ 
+ For more information on working with control groups, you should read
 -Documentation/cgroup-v1/cgroups.txt as well.
++Documentation/cgroup-v1/cgroups.rst as well.
+ 
+ Group settings are checked against the following limits in order to keep the
+ configuration schedulable:
+ 
+    \Sum_{i} runtime_{i} / global_period <= global_runtime / global_period
+ 
+ For now, this can be simplified to just the following (but see Future plans):
+ 
+    \Sum_{i} runtime_{i} <= global_runtime
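+ 
+ A minimal sketch of the simplified check, assuming all groups share the
+ global period and all values are given in microseconds. The function name
+ rt_groups_fit() is hypothetical and does not correspond to a kernel symbol.
+ 

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: return true if the sum of the per-group run times
 * does not exceed the global run time (all values in microseconds).
 */
bool rt_groups_fit(const int64_t *runtime_us, int n, int64_t global_runtime_us)
{
	int64_t sum = 0;
	int i;

	for (i = 0; i < n; i++) {
		assert(runtime_us[i] >= 0);
		sum += runtime_us[i];
	}
	return sum <= global_runtime_us;
}
```

+ 
+ With the default global run time of 950000us, two groups with 500000us and
+ 400000us fit, while two groups with 500000us each do not.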
+ 
+ 
+ 3. Future plans
+ ===============
+ 
+ There is work in progress to make the scheduling period for each group
+ ("<cgroup>/cpu.rt_period_us") configurable as well.
+ 
+ The constraint on the period is that a subgroup must have a smaller or
+ equal period to its parent. But realistically it's not very useful _yet_
+ as it's prone to starvation without deadline scheduling.
+ 
+ Consider two sibling groups A and B; both have 50% bandwidth, but A's
+ period is twice the length of B's.
+ 
+ * group A: period=100000us, runtime=50000us
+ 
+ 	- this runs for 0.05s once every 0.1s
+ 
+ * group B: period= 50000us, runtime=25000us
+ 
+ 	- this runs for 0.025s twice every 0.1s (or once every 0.05 sec).
+ 
+ This means that currently a while (1) loop in A will run for the full period of
+ B and can starve B's tasks (assuming they are of lower priority) for a whole
+ period.
+ 
+ The next project will be SCHED_EDF (Earliest Deadline First scheduling) to bring
+ full deadline scheduling to the linux kernel. Deadline scheduling the above
+ groups and treating end of the period as a deadline will ensure that they both
+ get their allocated time.
+ 
+ Implementing SCHED_EDF might take a while to complete. Priority Inheritance is
+ the biggest challenge as the current linux PI infrastructure is geared towards
+ the limited static priority levels 0-99. With deadline scheduling you need to
+ do deadline inheritance (since priority is inversely proportional to the
+ deadline delta (deadline - now)).
+ 
+ This means the whole PI machinery will have to be reworked - and that is one of
+ the most complex pieces of code we have.
diff --cc Documentation/x86/x86_64/fake-numa-for-cpusets.rst
index a6926cd40f70,04df57b9aa3f..30108684ae87
--- a/Documentation/x86/x86_64/fake-numa-for-cpusets.rst
+++ b/Documentation/x86/x86_64/fake-numa-for-cpusets.rst
@@@ -15,10 -15,10 +15,10 @@@ assign them to cpusets and their attach
  amount of system memory that are available to a certain class of tasks.
  
  For more information on the features of cpusets, see
 -Documentation/cgroup-v1/cpusets.txt.
 +Documentation/cgroup-v1/cpusets.rst.
  There are a number of different configurations you can use for your needs.  For
  more information on the numa=fake command line option and its various ways of
- configuring fake nodes, see Documentation/x86/x86_64/boot-options.txt.
+ configuring fake nodes, see Documentation/x86/x86_64/boot-options.rst.
  
  For the purposes of this introduction, we'll assume a very primitive NUMA
  emulation setup of "numa=fake=4*512,".  This will split our system memory into