(e.g. thinkpad_acpi, sony_acpi, etc.) instead
of the ACPI video.ko driver.
+ acpi_force_32bit_fadt_addr
+ force FADT to use 32 bit addresses rather than the
+ 64 bit X_* addresses. Some firmware have broken 64
+ bit addresses for force ACPI ignore these and use
+ the older legacy 32 bit addresses.
+
acpica_no_return_repair [HW, ACPI]
Disable AML predefined validation mechanism
This mechanism can repair the evaluation result to make
cut the overhead, others just disable the usage. So
only cgroup_disable=memory is actually worthy}
+ cgroup_no_v1= [KNL] Disable one, multiple, all cgroup controllers in v1
+ Format: { controller[,controller...] | "all" }
+ Like cgroup_disable, but only applies to cgroup v1;
+ the blacklisted controllers remain available in cgroup2.
+
cgroup.memory= [KNL] Pass options to the cgroup memory controller.
Format: <string>
nosocket -- Disable socket memory accounting.
See Documentation/x86/intel_mpx.txt for more
information about the feature.
+ nopku [X86] Disable Memory Protection Keys CPU feature found
+ in some Intel CPUs.
+
eagerfpu= [X86]
on enable eager fpu restore
off disable eager fpu restore
A valid base address must be provided, and the serial
port must already be setup and configured.
+ armada3700_uart,<addr>
+ Start an early, polled-mode console on the
+ Armada 3700 serial port at the specified
+ address. The serial port must already be setup
+ and configured. Options are not yet supported.
+
earlyprintk= [X86,SH,BLACKFIN,ARM,M68k]
earlyprintk=vga
earlyprintk=efi
ip= [IP_PNP]
See Documentation/filesystems/nfs/nfsroot.txt.
+ irqaffinity= [SMP] Set the default irq affinity mask
+ Format:
+ <cpu number>,...,<cpu number>
+ or
+ <cpu number>-<cpu number>
+ (must be a positive range in ascending order)
+ or a mixture
+ <cpu number>,...,<cpu number>-<cpu number>
+
irqfixup [HW]
When an interrupt is not handled search all handlers
for it. Intended to get systems with badly broken
keepinitrd [HW,ARM]
- kernelcore=nn[KMG] [KNL,X86,IA-64,PPC] This parameter
+ kernelcore= [KNL,X86,IA-64,PPC]
+ Format: nn[KMGTPE] | "mirror"
+ This parameter
specifies the amount of memory usable by the kernel
for non-movable allocations. The requested amount is
spread evenly throughout all nodes in the system. The
use the HighMem zone if it exists, and the Normal
zone if it does not.
+ Instead of specifying the amount of memory (nn[KMGTPE]),
+ you can specify "mirror" option. In case "mirror"
+ option is specified, mirrored (reliable) memory is used
+ for non-movable allocations and remaining memory is used
+ for Movable pages. nn[KMGTPE] and "mirror" are exclusive,
+ so you can NOT specify nn[KMGTPE] and "mirror" at the same
+ time.
+
kgdbdbgp= [KGDB,HW] kgdb over EHCI usb debug port.
Format: <Controller#>[,poll interval]
The controller # is the number of the ehci usb debug
nolapic_timer [X86-32,APIC] Do not use the local APIC timer.
noltlbs [PPC] Do not use large page/tlb entries for kernel
- lowmem mapping on PPC40x.
+ lowmem mapping on PPC40x and PPC8xx
nomca [IA-64] Disable machine check abort handling
we can turn it on.
on: enable the feature
+ page_poison= [KNL] Boot-time parameter changing the state of
+ poisoning on the buddy allocator.
+ off: turn off poisoning
+ on: turn on poisoning
+
panic= [KNL] Kernel behaviour on panic: delay <timeout>
timeout > 0: seconds before rebooting
timeout = 0: wait forever
ro [KNL] Mount root device read-only on boot
+ rodata= [KNL]
+ on Mark read-only kernel memory as read-only (default).
+ off Leave read-only kernel memory writable for debugging.
+
+ rockchip.usb_uart
+ Enable the uart passthrough on the designated usb port
+ on Rockchip SoCs. When active, the signals of the
+ debug-uart get routed to the D+ and D- pins of the usb
+ port and the regular usb controller gets disabled.
+
root= [KNL] Root filesystem
See name_to_dev_t comment in init/do_mounts.c.
sched_debug [KNL] Enables verbose scheduler debug messages.
+ schedstats= [KNL,X86] Enable or disable scheduled statistics.
+ Allowed values are enable and disable. This feature
+ incurs a small amount of overhead in the scheduler
+ but is useful for debugging and performance tuning.
+
skew_tick= [KNL] Offset the periodic timer tick per cpu to mitigate
xtime_lock contention on larger systems, and/or RCU lock
contention on all systems with CONFIG_MAXSMP set.
{
VM_BUG_ON(page != compound_head(page));
VM_BUG_ON(page_count(page) == 0);
- atomic_add(nr, &page->_count);
+ page_ref_add(page, nr);
SetPageReferenced(page);
}
start += nr << PAGE_SHIFT;
pages += nr;
- ret = get_user_pages_unlocked(current, mm, start,
- (end - start) >> PAGE_SHIFT,
+ ret = get_user_pages_unlocked(start, (end - start) >> PAGE_SHIFT,
write, 0, pages);
/* Have to be a bit careful with return values */
static inline int init_new_context(struct task_struct *tsk,
struct mm_struct *mm)
{
+ spin_lock_init(&mm->context.list_lock);
+ INIT_LIST_HEAD(&mm->context.pgtable_list);
+ INIT_LIST_HEAD(&mm->context.gmap_list);
cpumask_clear(&mm->context.cpu_attach_mask);
atomic_set(&mm->context.attach_count, 0);
mm->context.flush_mm = 0;
- mm->context.asce_bits = _ASCE_TABLE_LENGTH | _ASCE_USER_BITS;
- mm->context.asce_bits |= _ASCE_TYPE_REGION3;
#ifdef CONFIG_PGSTE
mm->context.alloc_pgste = page_table_allocate_pgste;
mm->context.has_pgste = 0;
mm->context.use_skey = 0;
#endif
- mm->context.asce_limit = STACK_TOP_MAX;
+ if (mm->context.asce_limit == 0) {
+ /* context created by exec, set asce limit to 4TB */
+ mm->context.asce_bits = _ASCE_TABLE_LENGTH |
+ _ASCE_USER_BITS | _ASCE_TYPE_REGION3;
+ mm->context.asce_limit = STACK_TOP_MAX;
+ } else if (mm->context.asce_limit == (1UL << 31)) {
+ mm_inc_nr_pmds(mm);
+ }
crst_table_init((unsigned long *) mm->pgd, pgd_entry_type(mm));
return 0;
}
static inline void arch_dup_mmap(struct mm_struct *oldmm,
struct mm_struct *mm)
{
- if (oldmm->context.asce_limit < mm->context.asce_limit)
- crst_table_downgrade(mm, oldmm->context.asce_limit);
}
static inline void arch_exit_mmap(struct mm_struct *mm)
{
}
+ static inline bool arch_vma_access_permitted(struct vm_area_struct *vma,
+ bool write, bool execute, bool foreign)
+ {
+ /* by default, allow everything */
+ return true;
+ }
+
+ static inline bool arch_pte_access_permitted(pte_t pte, bool write)
+ {
+ /* by default, allow everything */
+ return true;
+ }
#endif /* __S390_MMU_CONTEXT_H */
select VIRT_TO_BUS
select X86_DEV_DMA_OPS if X86_64
select X86_FEATURE_NAMES if PROC_FS
+ select HAVE_STACK_VALIDATION if X86_64
+ select ARCH_USES_HIGH_VMA_FLAGS if X86_INTEL_MEMORY_PROTECTION_KEYS
+ select ARCH_HAS_PKEYS if X86_INTEL_MEMORY_PROTECTION_KEYS
config INSTRUCTION_DECODER
def_bool y
config FIX_EARLYCON_MEM
def_bool y
+config DEBUG_RODATA
+ def_bool y
+
config PGTABLE_LEVELS
int
default 4 if X86_64
HPET is the next generation timer replacing legacy 8254s.
The HPET provides a stable time base on SMP
systems, unlike the TSC, but it is more expensive to access,
- as it is off-chip. You can find the HPET spec at
- <http://www.intel.com/hardwaredesign/hpetspec_1.pdf>.
+ as it is off-chip. The interface used is documented
+ in the HPET spec, revision 1.
You can safely choose Y here. However, HPET will only be
activated if the platform and the BIOS support this feature.
bool "CPU microcode loading support"
default y
depends on CPU_SUP_AMD || CPU_SUP_INTEL
- depends on BLK_DEV_INITRD
select FW_LOADER
---help---
-
If you say Y here, you will be able to update the microcode on
- certain Intel and AMD processors. The Intel support is for the
- IA32 family, e.g. Pentium Pro, Pentium II, Pentium III, Pentium 4,
- Xeon etc. The AMD support is for families 0x10 and later. You will
- obviously need the actual microcode binary data itself which is not
- shipped with the Linux kernel.
+ Intel and AMD processors. The Intel support is for the IA32 family,
+ e.g. Pentium Pro, Pentium II, Pentium III, Pentium 4, Xeon etc. The
+ AMD support is for families 0x10 and later. You will obviously need
+ the actual microcode binary data itself which is not shipped with
+ the Linux kernel.
- This option selects the general module only, you need to select
- at least one vendor specific module as well.
+ The preferred method to load microcode from a detached initrd is described
+ in Documentation/x86/early-microcode.txt. For that you need to enable
+ CONFIG_BLK_DEV_INITRD in order for the loader to be able to scan the
+ initrd for microcode blobs.
- To compile this driver as a module, choose M here: the module
- will be called microcode.
+ In addition, you can build-in the microcode into the kernel. For that you
+ need to enable FIRMWARE_IN_KERNEL and add the vendor-supplied microcode
+ to the CONFIG_EXTRA_FIRMWARE config option.
config MICROCODE_INTEL
bool "Intel microcode loading support"
If unsure, say N.
+ config X86_INTEL_MEMORY_PROTECTION_KEYS
+ prompt "Intel Memory Protection Keys"
+ def_bool y
+ # Note: only available in 64-bit mode
+ depends on CPU_SUP_INTEL && X86_64
+ ---help---
+ Memory Protection Keys provides a mechanism for enforcing
+ page-based protections, but without requiring modification of the
+ page tables when an application changes protection domains.
+
+ For details, see Documentation/x86/protection-keys.txt
+
+ If unsure, say y.
+
config EFI
bool "EFI runtime service support"
depends on ACPI
You should say N unless you know you need this.
-source "drivers/pci/pcie/Kconfig"
-
source "drivers/pci/Kconfig"
# x86_64 have no ISA slots, but can have ISA-style DMA.
source "drivers/pcmcia/Kconfig"
-source "drivers/pci/hotplug/Kconfig"
-
config RAPIDIO
tristate "RapidIO support"
depends on PCI
/*
* Defines x86 CPU feature bits
*/
- #define NCAPINTS 16 /* N 32-bit words worth of info */
+ #define NCAPINTS 17 /* N 32-bit words worth of info */
#define NBUGINTS 1 /* N 32-bit bug flags */
/*
#define X86_FEATURE_P4 ( 3*32+ 7) /* "" P4 */
#define X86_FEATURE_CONSTANT_TSC ( 3*32+ 8) /* TSC ticks at a constant rate */
#define X86_FEATURE_UP ( 3*32+ 9) /* smp kernel running on up */
-/* free, was #define X86_FEATURE_FXSAVE_LEAK ( 3*32+10) * "" FXSAVE leaks FOP/FIP/FOP */
+#define X86_FEATURE_ART ( 3*32+10) /* Platform has always running timer (ART) */
#define X86_FEATURE_ARCH_PERFMON ( 3*32+11) /* Intel Architectural PerfMon */
#define X86_FEATURE_PEBS ( 3*32+12) /* Precise-Event Based Sampling */
#define X86_FEATURE_BTS ( 3*32+13) /* Branch Trace Store */
#define X86_FEATURE_APERFMPERF ( 3*32+28) /* APERFMPERF */
#define X86_FEATURE_EAGER_FPU ( 3*32+29) /* "eagerfpu" Non lazy FPU restore */
#define X86_FEATURE_NONSTOP_TSC_S3 ( 3*32+30) /* TSC doesn't stop in S3 state */
+#define X86_FEATURE_MCE_RECOVERY ( 3*32+31) /* cpu has recoverable machine checks */
/* Intel-defined CPU features, CPUID level 0x00000001 (ecx), word 4 */
#define X86_FEATURE_XMM3 ( 4*32+ 0) /* "pni" SSE-3 */
#define X86_FEATURE_CQM ( 9*32+12) /* Cache QoS Monitoring */
#define X86_FEATURE_MPX ( 9*32+14) /* Memory Protection Extension */
#define X86_FEATURE_AVX512F ( 9*32+16) /* AVX-512 Foundation */
+#define X86_FEATURE_AVX512DQ ( 9*32+17) /* AVX-512 DQ (Double/Quad granular) Instructions */
#define X86_FEATURE_RDSEED ( 9*32+18) /* The RDSEED instruction */
#define X86_FEATURE_ADX ( 9*32+19) /* The ADCX and ADOX instructions */
#define X86_FEATURE_SMAP ( 9*32+20) /* Supervisor Mode Access Prevention */
#define X86_FEATURE_AVX512ER ( 9*32+27) /* AVX-512 Exponential and Reciprocal */
#define X86_FEATURE_AVX512CD ( 9*32+28) /* AVX-512 Conflict Detection */
#define X86_FEATURE_SHA_NI ( 9*32+29) /* SHA1/SHA256 Instruction Extensions */
+#define X86_FEATURE_AVX512BW ( 9*32+30) /* AVX-512 BW (Byte/Word granular) Instructions */
+#define X86_FEATURE_AVX512VL ( 9*32+31) /* AVX-512 VL (128/256 Vector Length) Extensions */
/* Extended state features, CPUID level 0x0000000d:1 (eax), word 10 */
#define X86_FEATURE_XSAVEOPT (10*32+ 0) /* XSAVEOPT */
#define X86_FEATURE_PFTHRESHOLD (15*32+12) /* pause filter threshold */
#define X86_FEATURE_AVIC (15*32+13) /* Virtual Interrupt Controller */
+ /* Intel-defined CPU features, CPUID level 0x00000007:0 (ecx), word 16 */
+ #define X86_FEATURE_PKU (16*32+ 3) /* Protection Keys for Userspace */
+ #define X86_FEATURE_OSPKE (16*32+ 4) /* OS Protection Keys Enable */
+
/*
* BUG word(s)
*/
#define X86_BUG_CLFLUSH_MONITOR X86_BUG(7) /* AAI65, CLFLUSH required before MONITOR */
#define X86_BUG_SYSRET_SS_ATTRS X86_BUG(8) /* SYSRET doesn't fix up SS attrs */
+#ifdef CONFIG_X86_32
+/*
+ * 64-bit kernels don't use X86_BUG_ESPFIX. Make the define conditional
+ * to avoid confusion.
+ */
+#define X86_BUG_ESPFIX X86_BUG(9) /* "" IRET to 16-bit SS corrupts ESP/RSP high bits */
+#endif
+
#endif /* _ASM_X86_CPUFEATURES_H */
/* Supported features which support lazy state saving */
#define XFEATURE_MASK_LAZY (XFEATURE_MASK_FP | \
- XFEATURE_MASK_SSE)
-
-/* Supported features which require eager state saving */
-#define XFEATURE_MASK_EAGER (XFEATURE_MASK_BNDREGS | \
- XFEATURE_MASK_BNDCSR | \
+ XFEATURE_MASK_SSE | \
XFEATURE_MASK_YMM | \
XFEATURE_MASK_OPMASK | \
XFEATURE_MASK_ZMM_Hi256 | \
- XFEATURE_MASK_Hi16_ZMM)
+ XFEATURE_MASK_Hi16_ZMM | \
+ XFEATURE_MASK_PKRU)
+/* Supported features which require eager state saving */
+#define XFEATURE_MASK_EAGER (XFEATURE_MASK_BNDREGS | XFEATURE_MASK_BNDCSR)
+
/* All currently supported features */
#define XCNTXT_MASK (XFEATURE_MASK_LAZY | XFEATURE_MASK_EAGER)
lo |= 0x200000;
wrmsr(MSR_IA32_BBL_CR_CTL, lo, hi);
- printk(KERN_NOTICE "CPU serial number disabled.\n");
+ pr_notice("CPU serial number disabled.\n");
clear_cpu_cap(c, X86_FEATURE_PN);
/* Disabling the serial number may affect the cpuid level */
}
}
+ /*
+ * Protection Keys are not available in 32-bit mode.
+ */
+ static bool pku_disabled;
+
+ static __always_inline void setup_pku(struct cpuinfo_x86 *c)
+ {
+ if (!cpu_has(c, X86_FEATURE_PKU))
+ return;
+ if (pku_disabled)
+ return;
+
+ cr4_set_bits(X86_CR4_PKE);
+ /*
+ * Seting X86_CR4_PKE will cause the X86_FEATURE_OSPKE
+ * cpuid bit to be set. We need to ensure that we
+ * update that bit in this CPU's "cpu_info".
+ */
+ get_cpu_cap(c);
+ }
+
+ #ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
+ static __init int setup_disable_pku(char *arg)
+ {
+ /*
+ * Do not clear the X86_FEATURE_PKU bit. All of the
+ * runtime checks are against OSPKE so clearing the
+ * bit does nothing.
+ *
+ * This way, we will see "pku" in cpuinfo, but not
+ * "ospke", which is exactly what we want. It shows
+ * that the CPU has PKU, but the OS has not enabled it.
+ * This happens to be exactly how a system would look
+ * if we disabled the config option.
+ */
+ pr_info("x86: 'nopku' specified, disabling Memory Protection Keys\n");
+ pku_disabled = true;
+ return 1;
+ }
+ __setup("nopku", setup_disable_pku);
+ #endif /* CONFIG_X86_64 */
+
/*
* Some CPU features depend on higher CPUID levels, which may not always
* be available due to CPUID level capping or broken virtualization
if (!warn)
continue;
- printk(KERN_WARNING
- "CPU: CPU feature " X86_CAP_FMT " disabled, no CPUID level 0x%x\n",
- x86_cap_flag(df->feature), df->level);
+ pr_warn("CPU: CPU feature " X86_CAP_FMT " disabled, no CPUID level 0x%x\n",
+ x86_cap_flag(df->feature), df->level);
}
}
smp_num_siblings = (ebx & 0xff0000) >> 16;
if (smp_num_siblings == 1) {
- printk_once(KERN_INFO "CPU0: Hyper-Threading is disabled\n");
+ pr_info_once("CPU0: Hyper-Threading is disabled\n");
goto out;
}
out:
if (!printed && (c->x86_max_cores * smp_num_siblings) > 1) {
- printk(KERN_INFO "CPU: Physical Processor ID: %d\n",
- c->phys_proc_id);
- printk(KERN_INFO "CPU: Processor Core ID: %d\n",
- c->cpu_core_id);
+ pr_info("CPU: Physical Processor ID: %d\n",
+ c->phys_proc_id);
+ pr_info("CPU: Processor Core ID: %d\n",
+ c->cpu_core_id);
printed = 1;
}
#endif
}
}
- printk_once(KERN_ERR
- "CPU: vendor_id '%s' unknown, using generic init.\n" \
- "CPU: Your system may be unstable.\n", v);
+ pr_err_once("CPU: vendor_id '%s' unknown, using generic init.\n" \
+ "CPU: Your system may be unstable.\n", v);
c->x86_vendor = X86_VENDOR_UNKNOWN;
this_cpu = &default_cpu;
c->x86_capability[CPUID_7_0_EBX] = ebx;
c->x86_capability[CPUID_6_EAX] = cpuid_eax(0x00000006);
+ c->x86_capability[CPUID_7_ECX] = ecx;
}
/* Extended state features: level 0x0000000d */
int count = 0;
#ifdef CONFIG_PROCESSOR_SELECT
- printk(KERN_INFO "KERNEL supported cpus:\n");
+ pr_info("KERNEL supported cpus:\n");
#endif
for (cdev = __x86_cpu_dev_start; cdev < __x86_cpu_dev_end; cdev++) {
for (j = 0; j < 2; j++) {
if (!cpudev->c_ident[j])
continue;
- printk(KERN_INFO " %s %s\n", cpudev->c_vendor,
+ pr_info(" %s %s\n", cpudev->c_vendor,
cpudev->c_ident[j]);
}
}
clear_cpu_cap(c, X86_FEATURE_NOPL);
#else
set_cpu_cap(c, X86_FEATURE_NOPL);
+#endif
+
+ /*
+ * ESPFIX is a strange bug. All real CPUs have it. Paravirt
+ * systems that run Linux at CPL > 0 may or may not have the
+ * issue, but, even if they have the issue, there's absolutely
+ * nothing we can do about it because we can't use the real IRET
+ * instruction.
+ *
+ * NB: For the time being, only 32-bit kernels support
+ * X86_BUG_ESPFIX as such. 64-bit kernels directly choose
+ * whether to apply espfix using paravirt hooks. If any
+ * non-paravirt system ever shows up that does *not* have the
+ * ESPFIX issue, we can change this.
+ */
+#ifdef CONFIG_X86_32
+#ifdef CONFIG_PARAVIRT
+ do {
+ extern void native_iret(void);
+ if (pv_cpu_ops.iret == native_iret)
+ set_cpu_bug(c, X86_BUG_ESPFIX);
+ } while (0);
+#else
+ set_cpu_bug(c, X86_BUG_ESPFIX);
+#endif
#endif
}
init_hypervisor(c);
x86_init_rdrand(c);
x86_init_cache_qos(c);
+ setup_pku(c);
/*
* Clear/Set all flags overriden by options, need do it
#ifdef CONFIG_NUMA
numa_add_cpu(smp_processor_id());
#endif
+ /* The boot/hotplug time assigment got cleared, restore it */
+ c->logical_proc_id = topology_phys_to_logical_pkg(c->phys_proc_id);
}
/*
for (index = index_min; index < index_max; index++) {
if (rdmsrl_safe(index, &val))
continue;
- printk(KERN_INFO " MSR%08x: %016llx\n", index, val);
+ pr_info(" MSR%08x: %016llx\n", index, val);
}
}
}
}
if (vendor && !strstr(c->x86_model_id, vendor))
- printk(KERN_CONT "%s ", vendor);
+ pr_cont("%s ", vendor);
if (c->x86_model_id[0])
- printk(KERN_CONT "%s", c->x86_model_id);
+ pr_cont("%s", c->x86_model_id);
else
- printk(KERN_CONT "%d86", c->x86);
+ pr_cont("%d86", c->x86);
- printk(KERN_CONT " (family: 0x%x, model: 0x%x", c->x86, c->x86_model);
+ pr_cont(" (family: 0x%x, model: 0x%x", c->x86, c->x86_model);
if (c->x86_mask || c->cpuid_level >= 0)
- printk(KERN_CONT ", stepping: 0x%x)\n", c->x86_mask);
+ pr_cont(", stepping: 0x%x)\n", c->x86_mask);
else
- printk(KERN_CONT ")\n");
+ pr_cont(")\n");
print_cpu_msr(c);
}
show_ucode_info_early();
- printk(KERN_INFO "Initializing CPU#%d\n", cpu);
+ pr_info("Initializing CPU#%d\n", cpu);
if (cpu_feature_enabled(X86_FEATURE_VME) ||
cpu_has_tsc ||
}
}
+ /*
+ * This function must be called before we write the current
+ * task's fpstate.
+ *
+ * This call gets the current FPU register state and moves
+ * it in to the 'fpstate'. Preemption is disabled so that
+ * no writes to the 'fpstate' can occur from context
+ * swiches.
+ *
+ * Must be followed by a fpu__current_fpstate_write_end().
+ */
+ void fpu__current_fpstate_write_begin(void)
+ {
+ struct fpu *fpu = ¤t->thread.fpu;
+
+ /*
+ * Ensure that the context-switching code does not write
+ * over the fpstate while we are doing our update.
+ */
+ preempt_disable();
+
+ /*
+ * Move the fpregs in to the fpu's 'fpstate'.
+ */
+ fpu__activate_fpstate_read(fpu);
+
+ /*
+ * The caller is about to write to 'fpu'. Ensure that no
+ * CPU thinks that its fpregs match the fpstate. This
+ * ensures we will not be lazy and skip a XRSTOR in the
+ * future.
+ */
+ fpu->last_cpu = -1;
+ }
+
+ /*
+ * This function must be paired with fpu__current_fpstate_write_begin()
+ *
+ * This will ensure that the modified fpstate gets placed back in
+ * the fpregs if necessary.
+ *
+ * Note: This function may be called whether or not an _actual_
+ * write to the fpstate occurred.
+ */
+ void fpu__current_fpstate_write_end(void)
+ {
+ struct fpu *fpu = ¤t->thread.fpu;
+
+ /*
+ * 'fpu' now has an updated copy of the state, but the
+ * registers may still be out of date. Update them with
+ * an XRSTOR if they are active.
+ */
+ if (fpregs_active())
+ copy_kernel_to_fpregs(&fpu->state);
+
+ /*
+ * Our update is done and the fpregs/fpstate are in sync
+ * if necessary. Context switches can happen again.
+ */
+ preempt_enable();
+ }
+
/*
* 'fpu__restore()' is called to copy FPU registers from
* the FPU fpstate to the live hw registers and to activate
{
if (use_xsave())
copy_kernel_to_xregs(&init_fpstate.xsave, -1);
- else
+ else if (static_cpu_has(X86_FEATURE_FXSR))
copy_kernel_to_fxregs(&init_fpstate.fxsave);
+ else
+ copy_kernel_to_fregs(&init_fpstate.fsave);
}
/*
*/
#include <linux/compat.h>
#include <linux/cpu.h>
+ #include <linux/pkeys.h>
#include <asm/fpu/api.h>
#include <asm/fpu/internal.h>
#include <asm/tlbflush.h>
+ /*
+ * Although we spell it out in here, the Processor Trace
+ * xfeature is completely unused. We use other mechanisms
+ * to save/restore PT state in Linux.
+ */
static const char *xfeature_names[] =
{
"x87 floating point registers" ,
"AVX-512 opmask" ,
"AVX-512 Hi256" ,
"AVX-512 ZMM_Hi256" ,
+ "Processor Trace (unused)" ,
+ "Protection Keys User registers",
"unknown xstate feature" ,
};
setup_clear_cpu_cap(X86_FEATURE_AVX512PF);
setup_clear_cpu_cap(X86_FEATURE_AVX512ER);
setup_clear_cpu_cap(X86_FEATURE_AVX512CD);
+ setup_clear_cpu_cap(X86_FEATURE_AVX512DQ);
+ setup_clear_cpu_cap(X86_FEATURE_AVX512BW);
+ setup_clear_cpu_cap(X86_FEATURE_AVX512VL);
setup_clear_cpu_cap(X86_FEATURE_MPX);
setup_clear_cpu_cap(X86_FEATURE_XGETBV1);
+ setup_clear_cpu_cap(X86_FEATURE_PKU);
}
/*
const char *feature_name;
if (cpu_has_xfeatures(xstate_mask, &feature_name))
- pr_info("x86/fpu: Supporting XSAVE feature 0x%02Lx: '%s'\n", xstate_mask, feature_name);
+ pr_info("x86/fpu: Supporting XSAVE feature 0x%03Lx: '%s'\n", xstate_mask, feature_name);
}
/*
print_xstate_feature(XFEATURE_MASK_OPMASK);
print_xstate_feature(XFEATURE_MASK_ZMM_Hi256);
print_xstate_feature(XFEATURE_MASK_Hi16_ZMM);
+ print_xstate_feature(XFEATURE_MASK_PKRU);
}
/*
XCHECK_SZ(sz, nr, XFEATURE_OPMASK, struct avx_512_opmask_state);
XCHECK_SZ(sz, nr, XFEATURE_ZMM_Hi256, struct avx_512_zmm_uppers_state);
XCHECK_SZ(sz, nr, XFEATURE_Hi16_ZMM, struct avx_512_hi16_state);
+ XCHECK_SZ(sz, nr, XFEATURE_PKRU, struct pkru_state);
/*
* Make *SURE* to add any feature numbers in below if
* numbers.
*/
if ((nr < XFEATURE_YMM) ||
- (nr >= XFEATURE_MAX)) {
+ (nr >= XFEATURE_MAX) ||
+ (nr == XFEATURE_PT_UNIMPLEMENTED_SO_FAR)) {
WARN_ONCE(1, "no structure for xstate: %d\n", nr);
XSTATE_WARN_ON(1);
}
xsetbv(XCR_XFEATURE_ENABLED_MASK, xfeatures_mask);
}
+ /*
+ * Given an xstate feature mask, calculate where in the xsave
+ * buffer the state is. Callers should ensure that the buffer
+ * is valid.
+ *
+ * Note: does not work for compacted buffers.
+ */
+ void *__raw_xsave_addr(struct xregs_state *xsave, int xstate_feature_mask)
+ {
+ int feature_nr = fls64(xstate_feature_mask) - 1;
+
+ return (void *)xsave + xstate_comp_offsets[feature_nr];
+ }
/*
* Given the xsave area and a state inside, this function returns the
* address of the state.
*/
void *get_xsave_addr(struct xregs_state *xsave, int xstate_feature)
{
- int feature_nr = fls64(xstate_feature) - 1;
/*
* Do we even *have* xsave state?
*/
if (!(xsave->header.xfeatures & xstate_feature))
return NULL;
- return (void *)xsave + xstate_comp_offsets[feature_nr];
+ return __raw_xsave_addr(xsave, xstate_feature);
}
EXPORT_SYMBOL_GPL(get_xsave_addr);
return get_xsave_addr(&fpu->state.xsave, xsave_state);
}
+
+
+ /*
+ * Set xfeatures (aka XSTATE_BV) bit for a feature that we want
+ * to take out of its "init state". This will ensure that an
+ * XRSTOR actually restores the state.
+ */
+ static void fpu__xfeature_set_non_init(struct xregs_state *xsave,
+ int xstate_feature_mask)
+ {
+ xsave->header.xfeatures |= xstate_feature_mask;
+ }
+
+ /*
+ * This function is safe to call whether the FPU is in use or not.
+ *
+ * Note that this only works on the current task.
+ *
+ * Inputs:
+ * @xsave_state: state which is defined in xsave.h (e.g. XFEATURE_MASK_FP,
+ * XFEATURE_MASK_SSE, etc...)
+ * @xsave_state_ptr: a pointer to a copy of the state that you would
+ * like written in to the current task's FPU xsave state. This pointer
+ * must not be located in the current tasks's xsave area.
+ * Output:
+ * address of the state in the xsave area or NULL if the state
+ * is not present or is in its 'init state'.
+ */
+ static void fpu__xfeature_set_state(int xstate_feature_mask,
+ void *xstate_feature_src, size_t len)
+ {
+ struct xregs_state *xsave = ¤t->thread.fpu.state.xsave;
+ struct fpu *fpu = ¤t->thread.fpu;
+ void *dst;
+
+ if (!boot_cpu_has(X86_FEATURE_XSAVE)) {
+ WARN_ONCE(1, "%s() attempted with no xsave support", __func__);
+ return;
+ }
+
+ /*
+ * Tell the FPU code that we need the FPU state to be in
+ * 'fpu' (not in the registers), and that we need it to
+ * be stable while we write to it.
+ */
+ fpu__current_fpstate_write_begin();
+
+ /*
+ * This method *WILL* *NOT* work for compact-format
+ * buffers. If the 'xstate_feature_mask' is unset in
+ * xcomp_bv then we may need to move other feature state
+ * "up" in the buffer.
+ */
+ if (xsave->header.xcomp_bv & xstate_feature_mask) {
+ WARN_ON_ONCE(1);
+ goto out;
+ }
+
+ /* find the location in the xsave buffer of the desired state */
+ dst = __raw_xsave_addr(&fpu->state.xsave, xstate_feature_mask);
+
+ /*
+ * Make sure that the pointer being passed in did not
+ * come from the xsave buffer itself.
+ */
+ WARN_ONCE(xstate_feature_src == dst, "set from xsave buffer itself");
+
+ /* put the caller-provided data in the location */
+ memcpy(dst, xstate_feature_src, len);
+
+ /*
+ * Mark the xfeature so that the CPU knows there is state
+ * in the buffer now.
+ */
+ fpu__xfeature_set_non_init(xsave, xstate_feature_mask);
+ out:
+ /*
+ * We are done writing to the 'fpu'. Reenable preeption
+ * and (possibly) move the fpstate back in to the fpregs.
+ */
+ fpu__current_fpstate_write_end();
+ }
+
+ #define NR_VALID_PKRU_BITS (CONFIG_NR_PROTECTION_KEYS * 2)
+ #define PKRU_VALID_MASK (NR_VALID_PKRU_BITS - 1)
+
+ /*
+ * This will go out and modify the XSAVE buffer so that PKRU is
+ * set to a particular state for access to 'pkey'.
+ *
+ * PKRU state does affect kernel access to user memory. We do
+ * not modfiy PKRU *itself* here, only the XSAVE state that will
+ * be restored in to PKRU when we return back to userspace.
+ */
+ int arch_set_user_pkey_access(struct task_struct *tsk, int pkey,
+ unsigned long init_val)
+ {
+ struct xregs_state *xsave = &tsk->thread.fpu.state.xsave;
+ struct pkru_state *old_pkru_state;
+ struct pkru_state new_pkru_state;
+ int pkey_shift = (pkey * PKRU_BITS_PER_PKEY);
+ u32 new_pkru_bits = 0;
+
+ /*
+ * This check implies XSAVE support. OSPKE only gets
+ * set if we enable XSAVE and we enable PKU in XCR0.
+ */
+ if (!boot_cpu_has(X86_FEATURE_OSPKE))
+ return -EINVAL;
+
+ /* Set the bits we need in PKRU */
+ if (init_val & PKEY_DISABLE_ACCESS)
+ new_pkru_bits |= PKRU_AD_BIT;
+ if (init_val & PKEY_DISABLE_WRITE)
+ new_pkru_bits |= PKRU_WD_BIT;
+
+ /* Shift the bits in to the correct place in PKRU for pkey. */
+ new_pkru_bits <<= pkey_shift;
+
+ /* Locate old copy of the state in the xsave buffer */
+ old_pkru_state = get_xsave_addr(xsave, XFEATURE_MASK_PKRU);
+
+ /*
+ * When state is not in the buffer, it is in the init
+ * state, set it manually. Otherwise, copy out the old
+ * state.
+ */
+ if (!old_pkru_state)
+ new_pkru_state.pkru = 0;
+ else
+ new_pkru_state.pkru = old_pkru_state->pkru;
+
+ /* mask off any old bits in place */
+ new_pkru_state.pkru &= ~((PKRU_AD_BIT|PKRU_WD_BIT) << pkey_shift);
+ /* Set the newly-requested bits */
+ new_pkru_state.pkru |= new_pkru_bits;
+
+ /*
+ * We could theoretically live without zeroing pkru.pad.
+ * The current XSAVE feature state definition says that
+ * only bytes 0->3 are used. But we do not want to
+ * chance leaking kernel stack out to userspace in case a
+ * memcpy() of the whole xsave buffer was done.
+ *
+ * They're in the same cacheline anyway.
+ */
+ new_pkru_state.pad = 0;
+
+ fpu__xfeature_set_state(XFEATURE_MASK_PKRU, &new_pkru_state,
+ sizeof(new_pkru_state));
+
+ return 0;
+ }
#include <asm/alternative.h>
#include <asm/prom.h>
#include <asm/microcode.h>
+ #include <asm/mmu_context.h>
/*
* max_low_pfn_mapped: highest direct mapped pfn under 4GB
.name = "Kernel data",
.start = 0,
.end = 0,
- .flags = IORESOURCE_BUSY | IORESOURCE_MEM
+ .flags = IORESOURCE_BUSY | IORESOURCE_SYSTEM_RAM
};
static struct resource code_resource = {
.name = "Kernel code",
.start = 0,
.end = 0,
- .flags = IORESOURCE_BUSY | IORESOURCE_MEM
+ .flags = IORESOURCE_BUSY | IORESOURCE_SYSTEM_RAM
};
static struct resource bss_resource = {
.name = "Kernel bss",
.start = 0,
.end = 0,
- .flags = IORESOURCE_BUSY | IORESOURCE_MEM
+ .flags = IORESOURCE_BUSY | IORESOURCE_SYSTEM_RAM
};
return 0;
}
__initcall(register_kernel_offset_dumper);
+
+ void arch_show_smap(struct seq_file *m, struct vm_area_struct *vma)
+ {
+ if (!boot_cpu_has(X86_FEATURE_OSPKE))
+ return;
+
+ seq_printf(m, "ProtectionKey: %8u\n", vma_pkey(vma));
+ }
#include <linux/context_tracking.h> /* exception_enter(), ... */
#include <linux/uaccess.h> /* faulthandler_disabled() */
+ #include <asm/cpufeature.h> /* boot_cpu_has, ... */
#include <asm/traps.h> /* dotraplinkage, ... */
#include <asm/pgalloc.h> /* pgd_*(), ... */
#include <asm/kmemcheck.h> /* kmemcheck_*(), ... */
#include <asm/fixmap.h> /* VSYSCALL_ADDR */
#include <asm/vsyscall.h> /* emulate_vsyscall */
#include <asm/vm86.h> /* struct vm86 */
+ #include <asm/mmu_context.h> /* vma_pkey() */
#define CREATE_TRACE_POINTS
#include <asm/trace/exceptions.h>
* bit 2 == 0: kernel-mode access 1: user-mode access
* bit 3 == 1: use of reserved bit detected
* bit 4 == 1: fault was an instruction fetch
+ * bit 5 == 1: protection keys block access
*/
enum x86_pf_error_code {
PF_USER = 1 << 2,
PF_RSVD = 1 << 3,
PF_INSTR = 1 << 4,
+ PF_PK = 1 << 5,
};
/*
return prefetch;
}
+ /*
+ * A protection key fault means that the PKRU value did not allow
+ * access to some PTE. Userspace can figure out what PKRU was
+ * from the XSAVE state, and this function fills out a field in
+ * siginfo so userspace can discover which protection key was set
+ * on the PTE.
+ *
+ * If we get here, we know that the hardware signaled a PF_PK
+ * fault and that there was a VMA once we got in the fault
+ * handler. It does *not* guarantee that the VMA we find here
+ * was the one that we faulted on.
+ *
+ * 1. T1 : mprotect_key(foo, PAGE_SIZE, pkey=4);
+ * 2. T1 : set PKRU to deny access to pkey=4, touches page
+ * 3. T1 : faults...
+ * 4. T2: mprotect_key(foo, PAGE_SIZE, pkey=5);
+ * 5. T1 : enters fault handler, takes mmap_sem, etc...
+ * 6. T1 : reaches here, sees vma_pkey(vma)=5, when we really
+ * faulted on a pte with its pkey=4.
+ */
+ static void fill_sig_info_pkey(int si_code, siginfo_t *info,
+ struct vm_area_struct *vma)
+ {
+ /* This is effectively an #ifdef */
+ if (!boot_cpu_has(X86_FEATURE_OSPKE))
+ return;
+
+ /* Fault not from Protection Keys: nothing to do */
+ if (si_code != SEGV_PKUERR)
+ return;
+ /*
+ * force_sig_info_fault() is called from a number of
+ * contexts, some of which have a VMA and some of which
+ * do not. The PF_PK handing happens after we have a
+ * valid VMA, so we should never reach this without a
+ * valid VMA.
+ */
+ if (!vma) {
+ WARN_ONCE(1, "PKU fault with no VMA passed in");
+ info->si_pkey = 0;
+ return;
+ }
+ /*
+ * si_pkey should be thought of as a strong hint, but not
+ * absolutely guranteed to be 100% accurate because of
+ * the race explained above.
+ */
+ info->si_pkey = vma_pkey(vma);
+ }
+
static void
force_sig_info_fault(int si_signo, int si_code, unsigned long address,
- struct task_struct *tsk, int fault)
+ struct task_struct *tsk, struct vm_area_struct *vma,
+ int fault)
{
unsigned lsb = 0;
siginfo_t info;
lsb = PAGE_SHIFT;
info.si_addr_lsb = lsb;
+ fill_sig_info_pkey(si_code, &info, vma);
+
force_sig_info(si_signo, &info, tsk);
}
if (!pmd_k)
return -1;
+ if (pmd_huge(*pmd_k))
+ return 0;
+
pte_k = pte_offset_kernel(pmd_k, address);
if (!pte_present(*pte_k))
return -1;
* 64-bit:
*
* Handle a fault on the vmalloc area
- *
- * This assumes no large pages in there.
*/
static noinline int vmalloc_fault(unsigned long address)
{
if (pud_none(*pud_ref))
return -1;
- if (pud_none(*pud) || pud_page_vaddr(*pud) != pud_page_vaddr(*pud_ref))
+ if (pud_none(*pud) || pud_pfn(*pud) != pud_pfn(*pud_ref))
BUG();
+ if (pud_huge(*pud))
+ return 0;
+
pmd = pmd_offset(pud, address);
pmd_ref = pmd_offset(pud_ref, address);
if (pmd_none(*pmd_ref))
return -1;
- if (pmd_none(*pmd) || pmd_page(*pmd) != pmd_page(*pmd_ref))
+ if (pmd_none(*pmd) || pmd_pfn(*pmd) != pmd_pfn(*pmd_ref))
BUG();
+ if (pmd_huge(*pmd))
+ return 0;
+
pte_ref = pte_offset_kernel(pmd_ref, address);
if (!pte_present(*pte_ref))
return -1;
struct task_struct *tsk = current;
unsigned long flags;
int sig;
+ /* No context means no VMA to pass down */
+ struct vm_area_struct *vma = NULL;
/* Are we prepared to handle this kernel fault? */
- if (fixup_exception(regs)) {
+ if (fixup_exception(regs, X86_TRAP_PF)) {
/*
* Any interrupt that takes a fault gets the fixup. This makes
* the below recursive fault logic only apply to a faults from
tsk->thread.cr2 = address;
/* XXX: hwpoison faults will set the wrong code. */
- force_sig_info_fault(signal, si_code, address, tsk, 0);
+ force_sig_info_fault(signal, si_code, address,
+ tsk, vma, 0);
}
/*
static void
__bad_area_nosemaphore(struct pt_regs *regs, unsigned long error_code,
- unsigned long address, int si_code)
+ unsigned long address, struct vm_area_struct *vma,
+ int si_code)
{
struct task_struct *tsk = current;
tsk->thread.error_code = error_code;
tsk->thread.trap_nr = X86_TRAP_PF;
- force_sig_info_fault(SIGSEGV, si_code, address, tsk, 0);
+ force_sig_info_fault(SIGSEGV, si_code, address, tsk, vma, 0);
return;
}
static noinline void
bad_area_nosemaphore(struct pt_regs *regs, unsigned long error_code,
- unsigned long address)
+ unsigned long address, struct vm_area_struct *vma)
{
- __bad_area_nosemaphore(regs, error_code, address, SEGV_MAPERR);
+ __bad_area_nosemaphore(regs, error_code, address, vma, SEGV_MAPERR);
}
static void
__bad_area(struct pt_regs *regs, unsigned long error_code,
- unsigned long address, int si_code)
+ unsigned long address, struct vm_area_struct *vma, int si_code)
{
struct mm_struct *mm = current->mm;
*/
up_read(&mm->mmap_sem);
- __bad_area_nosemaphore(regs, error_code, address, si_code);
+ __bad_area_nosemaphore(regs, error_code, address, vma, si_code);
}
static noinline void
bad_area(struct pt_regs *regs, unsigned long error_code, unsigned long address)
{
- __bad_area(regs, error_code, address, SEGV_MAPERR);
+ __bad_area(regs, error_code, address, NULL, SEGV_MAPERR);
+ }
+
+ static inline bool bad_area_access_from_pkeys(unsigned long error_code,
+ struct vm_area_struct *vma)
+ {
+ /* This code is always called on the current mm */
+ bool foreign = false;
+
+ if (!boot_cpu_has(X86_FEATURE_OSPKE))
+ return false;
+ if (error_code & PF_PK)
+ return true;
+ /* this checks permission keys on the VMA: */
+ if (!arch_vma_access_permitted(vma, (error_code & PF_WRITE),
+ (error_code & PF_INSTR), foreign))
+ return true;
+ return false;
}
static noinline void
bad_area_access_error(struct pt_regs *regs, unsigned long error_code,
- unsigned long address)
+ unsigned long address, struct vm_area_struct *vma)
{
- __bad_area(regs, error_code, address, SEGV_ACCERR);
+ /*
+ * This OSPKE check is not strictly necessary at runtime.
+ * But, doing it this way allows compiler optimizations
+ * if pkeys are compiled out.
+ */
+ if (bad_area_access_from_pkeys(error_code, vma))
+ __bad_area(regs, error_code, address, vma, SEGV_PKUERR);
+ else
+ __bad_area(regs, error_code, address, vma, SEGV_ACCERR);
}
static void
do_sigbus(struct pt_regs *regs, unsigned long error_code, unsigned long address,
- unsigned int fault)
+ struct vm_area_struct *vma, unsigned int fault)
{
struct task_struct *tsk = current;
int code = BUS_ADRERR;
code = BUS_MCEERR_AR;
}
#endif
- force_sig_info_fault(SIGBUS, code, address, tsk, fault);
+ force_sig_info_fault(SIGBUS, code, address, tsk, vma, fault);
}
static noinline void
mm_fault_error(struct pt_regs *regs, unsigned long error_code,
- unsigned long address, unsigned int fault)
+ unsigned long address, struct vm_area_struct *vma,
+ unsigned int fault)
{
if (fatal_signal_pending(current) && !(error_code & PF_USER)) {
no_context(regs, error_code, address, 0, 0);
} else {
if (fault & (VM_FAULT_SIGBUS|VM_FAULT_HWPOISON|
VM_FAULT_HWPOISON_LARGE))
- do_sigbus(regs, error_code, address, fault);
+ do_sigbus(regs, error_code, address, vma, fault);
else if (fault & VM_FAULT_SIGSEGV)
- bad_area_nosemaphore(regs, error_code, address);
+ bad_area_nosemaphore(regs, error_code, address, vma);
else
BUG();
}
if ((error_code & PF_INSTR) && !pte_exec(*pte))
return 0;
+ /*
+ * Note: We do not do lazy flushing on protection key
+ * changes, so no spurious fault will ever set PF_PK.
+ */
+ if ((error_code & PF_PK))
+ return 1;
return 1;
}
static inline int
access_error(unsigned long error_code, struct vm_area_struct *vma)
{
+ /* This is only called for the current mm, so: */
+ bool foreign = false;
+ /*
+ * Make sure to check the VMA so that we do not perform
+ * faults just to hit a PF_PK as soon as we fill in a
+ * page.
+ */
+ if (!arch_vma_access_permitted(vma, (error_code & PF_WRITE),
+ (error_code & PF_INSTR), foreign))
+ return 1;
+
if (error_code & PF_WRITE) {
/* write, present and write, not present: */
if (unlikely(!(vma->vm_flags & VM_WRITE)))
* Don't take the mm semaphore here. If we fixup a prefetch
* fault we could otherwise deadlock:
*/
- bad_area_nosemaphore(regs, error_code, address);
+ bad_area_nosemaphore(regs, error_code, address, NULL);
return;
}
pgtable_bad(regs, error_code, address);
if (unlikely(smap_violation(error_code, regs))) {
- bad_area_nosemaphore(regs, error_code, address);
+ bad_area_nosemaphore(regs, error_code, address, NULL);
return;
}
* in a region with pagefaults disabled then we must not take the fault
*/
if (unlikely(faulthandler_disabled() || !mm)) {
- bad_area_nosemaphore(regs, error_code, address);
+ bad_area_nosemaphore(regs, error_code, address, NULL);
return;
}
if (error_code & PF_WRITE)
flags |= FAULT_FLAG_WRITE;
+ if (error_code & PF_INSTR)
+ flags |= FAULT_FLAG_INSTRUCTION;
/*
* When running in the kernel we expect faults to occur only to
if (unlikely(!down_read_trylock(&mm->mmap_sem))) {
if ((error_code & PF_USER) == 0 &&
!search_exception_tables(regs->ip)) {
- bad_area_nosemaphore(regs, error_code, address);
+ bad_area_nosemaphore(regs, error_code, address, NULL);
return;
}
retry:
*/
good_area:
if (unlikely(access_error(error_code, vma))) {
- bad_area_access_error(regs, error_code, address);
+ bad_area_access_error(regs, error_code, address, vma);
return;
}
up_read(&mm->mmap_sem);
if (unlikely(fault & VM_FAULT_ERROR)) {
- mm_fault_error(regs, error_code, address, fault);
+ mm_fault_error(regs, error_code, address, vma, fault);
return;
}
#include <linux/swap.h>
#include <linux/memremap.h>
+ #include <asm/mmu_context.h>
#include <asm/pgtable.h>
static inline pte_t gup_get_pte(pte_t *ptep)
}
}
+ /*
+ * 'pteval' can come from a pte, pmd or pud. We only check
+ * _PAGE_PRESENT, _PAGE_USER, and _PAGE_RW in here which are the
+ * same value on all 3 types.
+ */
+ static inline int pte_allows_gup(unsigned long pteval, int write)
+ {
+ unsigned long need_pte_bits = _PAGE_PRESENT|_PAGE_USER;
+
+ if (write)
+ need_pte_bits |= _PAGE_RW;
+
+ if ((pteval & need_pte_bits) != need_pte_bits)
+ return 0;
+
+ /* Check memory protection keys permissions. */
+ if (!__pkru_allows_pkey(pte_flags_pkey(pteval), write))
+ return 0;
+
+ return 1;
+ }
+
/*
* The performance critical leaf functions are made noinline otherwise gcc
* inlines everything into a single function which results in too much
unsigned long end, int write, struct page **pages, int *nr)
{
struct dev_pagemap *pgmap = NULL;
- unsigned long mask;
int nr_start = *nr;
pte_t *ptep;
- mask = _PAGE_PRESENT|_PAGE_USER;
- if (write)
- mask |= _PAGE_RW;
-
ptep = pte_offset_map(&pmd, addr);
do {
pte_t pte = gup_get_pte(ptep);
return 0;
}
- page = pte_page(pte);
if (pte_devmap(pte)) {
pgmap = get_dev_pagemap(pte_pfn(pte), pgmap);
if (unlikely(!pgmap)) {
pte_unmap(ptep);
return 0;
}
- } else if ((pte_flags(pte) & (mask | _PAGE_SPECIAL)) != mask) {
+ } else if (!pte_allows_gup(pte_val(pte), write) ||
+ pte_special(pte)) {
pte_unmap(ptep);
return 0;
}
VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
+ page = pte_page(pte);
get_page(page);
put_dev_pagemap(pgmap);
SetPageReferenced(page);
{
VM_BUG_ON_PAGE(page != compound_head(page), page);
VM_BUG_ON_PAGE(page_count(page) == 0, page);
- atomic_add(nr, &page->_count);
+ page_ref_add(page, nr);
SetPageReferenced(page);
}
static noinline int gup_huge_pmd(pmd_t pmd, unsigned long addr,
unsigned long end, int write, struct page **pages, int *nr)
{
- unsigned long mask;
struct page *head, *page;
int refs;
- mask = _PAGE_PRESENT|_PAGE_USER;
- if (write)
- mask |= _PAGE_RW;
- if ((pmd_flags(pmd) & mask) != mask)
+ if (!pte_allows_gup(pmd_val(pmd), write))
return 0;
VM_BUG_ON(!pfn_valid(pmd_pfn(pmd)));
static noinline int gup_huge_pud(pud_t pud, unsigned long addr,
unsigned long end, int write, struct page **pages, int *nr)
{
- unsigned long mask;
struct page *head, *page;
int refs;
- mask = _PAGE_PRESENT|_PAGE_USER;
- if (write)
- mask |= _PAGE_RW;
- if ((pud_flags(pud) & mask) != mask)
+ if (!pte_allows_gup(pud_val(pud), write))
return 0;
/* hugepages are never "special" */
VM_BUG_ON(pud_flags(pud) & _PAGE_SPECIAL);
start += nr << PAGE_SHIFT;
pages += nr;
- ret = get_user_pages_unlocked(current, mm, start,
+ ret = get_user_pages_unlocked(start,
(end - start) >> PAGE_SHIFT,
write, 0, pages);
break;
}
- if (regno > nr_registers) {
+ if (regno >= nr_registers) {
WARN_ONCE(1, "decoded an instruction with an invalid register");
return -EINVAL;
}
int nr_pages = 1;
int force = 0;
- gup_ret = get_user_pages(current, current->mm, (unsigned long)addr,
- nr_pages, write, force, NULL, NULL);
+ gup_ret = get_user_pages((unsigned long)addr, nr_pages, write,
+ force, NULL, NULL);
/*
* get_user_pages() returns number of pages gotten.
* 0 means we failed to fault in and get anything,
uint64_t userptr = gtt->userptr + pinned * PAGE_SIZE;
struct page **pages = ttm->pages + pinned;
- r = get_user_pages(current, current->mm, userptr, num_pages,
- write, 0, pages, NULL);
+ r = get_user_pages(userptr, num_pages, write, 0, pages, NULL);
if (r < 0)
goto release_pages;
0, PAGE_SIZE,
PCI_DMA_BIDIRECTIONAL);
if (pci_dma_mapping_error(adev->pdev, gtt->ttm.dma_address[i])) {
- while (--i) {
+ while (i--) {
pci_unmap_page(adev->pdev, gtt->ttm.dma_address[i],
PAGE_SIZE, PCI_DMA_BIDIRECTIONAL);
gtt->ttm.dma_address[i] = 0;
uint64_t userptr = gtt->userptr + pinned * PAGE_SIZE;
struct page **pages = ttm->pages + pinned;
- r = get_user_pages(current, current->mm, userptr, num_pages,
- write, 0, pages, NULL);
+ r = get_user_pages(userptr, num_pages, write, 0, pages, NULL);
if (r < 0)
goto release_pages;
0, PAGE_SIZE,
PCI_DMA_BIDIRECTIONAL);
if (pci_dma_mapping_error(rdev->pdev, gtt->ttm.dma_address[i])) {
- while (--i) {
+ while (i--) {
pci_unmap_page(rdev->pdev, gtt->ttm.dma_address[i],
PAGE_SIZE, PCI_DMA_BIDIRECTIONAL);
gtt->ttm.dma_address[i] = 0;
}
pinned_pages->nr_pages = get_user_pages(
- current,
- mm,
(u64)addr,
nr_pages,
!!(prot & SCIF_PROT_WRITE),
if ((map_flags & SCIF_MAP_FIXED) &&
((ALIGN(offset, PAGE_SIZE) != offset) ||
(offset < 0) ||
- (offset + (off_t)len < offset)))
+ (len > LONG_MAX - offset)))
return -EINVAL;
might_sleep();
if ((map_flags & SCIF_MAP_FIXED) &&
((ALIGN(offset, PAGE_SIZE) != offset) ||
(offset < 0) ||
- (offset + (off_t)len < offset)))
+ (len > LONG_MAX - offset)))
return -EINVAL;
/* Unsupported protection requested */
/* Offset is not page aligned or offset+len wraps around */
if ((ALIGN(offset, PAGE_SIZE) != offset) ||
- (offset + (off_t)len < offset))
+ (offset < 0) ||
+ (len > LONG_MAX - offset))
return -EINVAL;
err = scif_verify_epd(ep);
Steve Hirsch, Andreas Koppenh"ofer, Michael Leodolter, Eyal Lebedinsky,
Michael Schaefer, J"org Weule, and Eric Youngdale.
- Copyright 1992 - 2010 Kai Makisara
+ Copyright 1992 - 2016 Kai Makisara
email Kai.Makisara@kolumbus.fi
Some small formal changes - aeb, 950809
Last modified: 18-JAN-1998 Richard Gooch <rgooch@atnf.csiro.au> Devfs support
*/
-static const char *verstr = "20101219";
+static const char *verstr = "20160209";
#include <linux/module.h>
#define PP_OFF_RESERVED 7
#define PP_BIT_IDP 0x20
+#define PP_BIT_FDP 0x80
#define PP_MSK_PSUM_MB 0x10
+#define PP_MSK_PSUM_UNITS 0x18
+#define PP_MSK_POFM 0x04
/* Get the number of partitions on the tape. As a side effect reads the
mode page into the tape buffer. */
}
+static int format_medium(struct scsi_tape *STp, int format)
+{
+ int result = 0;
+ int timeout = STp->long_timeout;
+ unsigned char scmd[MAX_COMMAND_SIZE];
+ struct st_request *SRpnt;
+
+ memset(scmd, 0, MAX_COMMAND_SIZE);
+ scmd[0] = FORMAT_UNIT;
+ scmd[2] = format;
+ if (STp->immediate) {
+ scmd[1] |= 1; /* Don't wait for completion */
+ timeout = STp->device->request_queue->rq_timeout;
+ }
+ DEBC_printk(STp, "Sending FORMAT MEDIUM\n");
+ SRpnt = st_do_scsi(NULL, STp, scmd, 0, DMA_NONE,
+ timeout, MAX_RETRIES, 1);
+ if (!SRpnt)
+ result = STp->buffer->syscall_result;
+ return result;
+}
+
+
/* Partition the tape into two partitions if size > 0 or one partition if
size == 0.
and 10 when 1 partition is defined (information from Eric Lee Green). This is
is acceptable also to some other old drives and enforced if the first partition
size field is used for the first additional partition size.
+
+ For drives that advertize SCSI-3 or newer, use the SSC-3 methods.
*/
static int partition_tape(struct scsi_tape *STp, int size)
{
int result;
+ int target_partition;
+ bool scsi3 = STp->device->scsi_level >= SCSI_3, needs_format = false;
int pgo, psd_cnt, psdo;
+ int psum = PP_MSK_PSUM_MB, units = 0;
unsigned char *bp;
result = read_mode_page(STp, PART_PAGE, 0);
DEBC_printk(STp, "Can't read partition mode page.\n");
return result;
}
+ target_partition = 1;
+ if (size < 0) {
+ target_partition = 0;
+ size = -size;
+ }
+
/* The mode page is in the buffer. Let's modify it and write it. */
bp = (STp->buffer)->b_data;
pgo = MODE_HEADER_LENGTH + bp[MH_OFF_BDESCS_LENGTH];
bp[pgo + MP_OFF_PAGE_LENGTH] + 2);
psd_cnt = (bp[pgo + MP_OFF_PAGE_LENGTH] + 2 - PART_PAGE_FIXED_LENGTH) / 2;
+
+ if (scsi3) {
+ needs_format = (bp[pgo + PP_OFF_FLAGS] & PP_MSK_POFM) != 0;
+ if (needs_format && size == 0) {
+ /* No need to write the mode page when clearing
+ * partitioning
+ */
+ DEBC_printk(STp, "Formatting tape with one partition.\n");
+ result = format_medium(STp, 0);
+ goto out;
+ }
+ if (needs_format) /* Leave the old value for HP DATs claiming SCSI_3 */
+ psd_cnt = 2;
+ if ((bp[pgo + PP_OFF_FLAGS] & PP_MSK_PSUM_UNITS) == PP_MSK_PSUM_UNITS) {
+ /* Use units scaling for large partitions if the device
+ * suggests it and no precision lost. Required for IBM
+ * TS1140/50 drives that don't support MB units.
+ */
+ if (size >= 1000 && (size % 1000) == 0) {
+ size /= 1000;
+ psum = PP_MSK_PSUM_UNITS;
+ units = 9; /* GB */
+ }
+ }
+ /* Try it anyway if too large to specify in MB */
+ if (psum == PP_MSK_PSUM_MB && size >= 65534) {
+ size /= 1000;
+ psum = PP_MSK_PSUM_UNITS;
+ units = 9; /* GB */
+ }
+ }
+
+ if (size >= 65535 || /* Does not fit into two bytes */
+ (target_partition == 0 && psd_cnt < 2)) {
+ result = -EINVAL;
+ goto out;
+ }
+
psdo = pgo + PART_PAGE_FIXED_LENGTH;
- if (psd_cnt > bp[pgo + PP_OFF_MAX_ADD_PARTS]) {
- bp[psdo] = bp[psdo + 1] = 0xff; /* Rest of the tape */
+ /* The second condition is for HP DDS which use only one partition size
+ * descriptor
+ */
+ if (target_partition > 0 &&
+ (psd_cnt > bp[pgo + PP_OFF_MAX_ADD_PARTS] ||
+ bp[pgo + PP_OFF_MAX_ADD_PARTS] != 1)) {
+ bp[psdo] = bp[psdo + 1] = 0xff; /* Rest to partition 0 */
psdo += 2;
}
memset(bp + psdo, 0, bp[pgo + PP_OFF_NBR_ADD_PARTS] * 2);
psd_cnt, bp[pgo + PP_OFF_MAX_ADD_PARTS],
bp[pgo + PP_OFF_NBR_ADD_PARTS]);
- if (size <= 0) {
+ if (size == 0) {
bp[pgo + PP_OFF_NBR_ADD_PARTS] = 0;
if (psd_cnt <= bp[pgo + PP_OFF_MAX_ADD_PARTS])
bp[pgo + MP_OFF_PAGE_LENGTH] = 6;
} else {
bp[psdo] = (size >> 8) & 0xff;
bp[psdo + 1] = size & 0xff;
+ if (target_partition == 0)
+ bp[psdo + 2] = bp[psdo + 3] = 0xff;
bp[pgo + 3] = 1;
if (bp[pgo + MP_OFF_PAGE_LENGTH] < 8)
bp[pgo + MP_OFF_PAGE_LENGTH] = 8;
- DEBC_printk(STp, "Formatting tape with two partitions "
- "(1 = %d MB).\n", size);
+ DEBC_printk(STp,
+ "Formatting tape with two partitions (%i = %d MB).\n",
+ target_partition, units > 0 ? size * 1000 : size);
}
bp[pgo + PP_OFF_PART_UNITS] = 0;
bp[pgo + PP_OFF_RESERVED] = 0;
- bp[pgo + PP_OFF_FLAGS] = PP_BIT_IDP | PP_MSK_PSUM_MB;
+ if (size != 1 || units != 0) {
+ bp[pgo + PP_OFF_FLAGS] = PP_BIT_IDP | psum |
+ (bp[pgo + PP_OFF_FLAGS] & 0x07);
+ bp[pgo + PP_OFF_PART_UNITS] = units;
+ } else
+ bp[pgo + PP_OFF_FLAGS] = PP_BIT_FDP |
+ (bp[pgo + PP_OFF_FLAGS] & 0x1f);
+ bp[pgo + MP_OFF_PAGE_LENGTH] = 6 + psd_cnt * 2;
result = write_mode_page(STp, PART_PAGE, 1);
+
+ if (!result && needs_format)
+ result = format_medium(STp, 1);
+
if (result) {
st_printk(KERN_INFO, STp, "Partitioning of tape failed.\n");
result = (-EIO);
}
+out:
return result;
}
\f
retval = (-EINVAL);
goto out;
}
- if ((i = st_int_ioctl(STp, MTREW, 0)) < 0 ||
- (i = partition_tape(STp, mtc.mt_count)) < 0) {
+ i = do_load_unload(STp, file, 1);
+ if (i < 0) {
+ retval = i;
+ goto out;
+ }
+ i = partition_tape(STp, mtc.mt_count);
+ if (i < 0) {
retval = i;
goto out;
}
STp->ps[i].last_block_valid = 0;
}
STp->partition = STp->new_partition = 0;
- STp->nbr_partitions = 1; /* Bad guess ?-) */
+ STp->nbr_partitions = mtc.mt_count != 0 ? 2 : 1;
STps->drv_block = STps->drv_file = 0;
retval = 0;
goto out;
/* Try to fault in all of the necessary pages */
/* rw==READ means read from drive, write into memory area */
res = get_user_pages_unlocked(
- current,
- current->mm,
uaddr,
nr_pages,
rw == READ,
#define range_on_lru(range) \
((range)->purged == ASHMEM_NOT_PURGED)
-#define page_range_subsumes_range(range, start, end) \
- (((range)->pgstart >= (start)) && ((range)->pgend <= (end)))
+static inline int page_range_subsumes_range(struct ashmem_range *range,
+ size_t start, size_t end)
+{
+ return (((range)->pgstart >= (start)) && ((range)->pgend <= (end)));
+}
-#define page_range_subsumed_by_range(range, start, end) \
- (((range)->pgstart <= (start)) && ((range)->pgend >= (end)))
+static inline int page_range_subsumed_by_range(struct ashmem_range *range,
+ size_t start, size_t end)
+{
+ return (((range)->pgstart <= (start)) && ((range)->pgend >= (end)));
+}
-#define page_in_range(range, page) \
- (((range)->pgstart <= (page)) && ((range)->pgend >= (page)))
+static inline int page_in_range(struct ashmem_range *range, size_t page)
+{
+ return (((range)->pgstart <= (page)) && ((range)->pgend >= (page)));
+}
-#define page_range_in_range(range, start, end) \
- (page_in_range(range, start) || page_in_range(range, end) || \
- page_range_subsumes_range(range, start, end))
+static inline int page_range_in_range(struct ashmem_range *range,
+ size_t start, size_t end)
+{
+ return (page_in_range(range, start) || page_in_range(range, end) ||
+ page_range_subsumes_range(range, start, end));
+}
-#define range_before_page(range, page) \
- ((range)->pgend < (page))
+static inline int range_before_page(struct ashmem_range *range, size_t page)
+{
+ return ((range)->pgend < (page));
+}
#define PROT_MASK (PROT_EXEC | PROT_READ | PROT_WRITE)
}
/* requested protection bits must match our allowed protection mask */
- if (unlikely((vma->vm_flags & ~calc_vm_prot_bits(asma->prot_mask)) &
- calc_vm_prot_bits(PROT_MASK))) {
+ if (unlikely((vma->vm_flags & ~calc_vm_prot_bits(asma->prot_mask, 0)) &
+ calc_vm_prot_bits(PROT_MASK, 0))) {
ret = -EPERM;
goto out;
}
if (!(sc->gfp_mask & __GFP_FS))
return SHRINK_STOP;
- mutex_lock(&ashmem_mutex);
+ if (!mutex_trylock(&ashmem_mutex))
+ return -1;
+
list_for_each_entry_safe(range, next, &ashmem_lru_list, lru) {
loff_t start = range->pgstart * PAGE_SIZE;
loff_t end = (range->pgend + 1) * PAGE_SIZE;
if (page_range_subsumed_by_range(range, pgstart, pgend))
return 0;
if (page_range_in_range(range, pgstart, pgend)) {
- pgstart = min_t(size_t, range->pgstart, pgstart);
- pgend = max_t(size_t, range->pgend, pgend);
+ pgstart = min(range->pgstart, pgstart);
+ pgend = max(range->pgend, pgend);
purged |= range->purged;
range_del(range);
goto restart;
#include <linux/pipe_fs_i.h>
#include <linux/oom.h>
#include <linux/compat.h>
+#include <linux/vmalloc.h>
#include <asm/uaccess.h>
#include <asm/mmu_context.h>
return NULL;
}
#endif
- ret = get_user_pages(current, bprm->mm, pos,
- 1, write, 1, &page, NULL);
+ /*
+ * We are doing an exec(). 'current' is the process
+ * doing the exec and bprm->mm is the new process's mm.
+ */
+ ret = get_user_pages_remote(current, bprm->mm, pos, 1, write,
+ 1, &page, NULL);
if (ret <= 0)
return NULL;
EXPORT_SYMBOL(kernel_read);
+int kernel_read_file(struct file *file, void **buf, loff_t *size,
+ loff_t max_size, enum kernel_read_file_id id)
+{
+ loff_t i_size, pos;
+ ssize_t bytes = 0;
+ int ret;
+
+ if (!S_ISREG(file_inode(file)->i_mode) || max_size < 0)
+ return -EINVAL;
+
+ ret = security_kernel_read_file(file, id);
+ if (ret)
+ return ret;
+
+ i_size = i_size_read(file_inode(file));
+ if (max_size > 0 && i_size > max_size)
+ return -EFBIG;
+ if (i_size <= 0)
+ return -EINVAL;
+
+ *buf = vmalloc(i_size);
+ if (!*buf)
+ return -ENOMEM;
+
+ pos = 0;
+ while (pos < i_size) {
+ bytes = kernel_read(file, pos, (char *)(*buf) + pos,
+ i_size - pos);
+ if (bytes < 0) {
+ ret = bytes;
+ goto out;
+ }
+
+ if (bytes == 0)
+ break;
+ pos += bytes;
+ }
+
+ if (pos != i_size) {
+ ret = -EIO;
+ goto out;
+ }
+
+ ret = security_kernel_post_read_file(file, *buf, i_size, id);
+ if (!ret)
+ *size = pos;
+
+out:
+ if (ret < 0) {
+ vfree(*buf);
+ *buf = NULL;
+ }
+ return ret;
+}
+EXPORT_SYMBOL_GPL(kernel_read_file);
+
+int kernel_read_file_from_path(char *path, void **buf, loff_t *size,
+ loff_t max_size, enum kernel_read_file_id id)
+{
+ struct file *file;
+ int ret;
+
+ if (!path || !*path)
+ return -EINVAL;
+
+ file = filp_open(path, O_RDONLY, 0);
+ if (IS_ERR(file))
+ return PTR_ERR(file);
+
+ ret = kernel_read_file(file, buf, size, max_size, id);
+ fput(file);
+ return ret;
+}
+EXPORT_SYMBOL_GPL(kernel_read_file_from_path);
+
+int kernel_read_file_from_fd(int fd, void **buf, loff_t *size, loff_t max_size,
+ enum kernel_read_file_id id)
+{
+ struct fd f = fdget(fd);
+ int ret = -EBADF;
+
+ if (!f.file)
+ goto out;
+
+ ret = kernel_read_file(f.file, buf, size, max_size, id);
+out:
+ fdput(f);
+ return ret;
+}
+EXPORT_SYMBOL_GPL(kernel_read_file_from_fd);
+
ssize_t read_code(struct file *file, unsigned long addr, loff_t pos, size_t len)
{
ssize_t res = vfs_read(file, (void __user *)addr, len, &pos);
#include <linux/resource.h>
#include <linux/page_ext.h>
#include <linux/err.h>
+#include <linux/page_ref.h>
struct mempolicy;
struct anon_vma;
#define mm_forbids_zeropage(X) (0)
#endif
+/*
+ * Default maximum number of active map areas, this limits the number of vmas
+ * per mm struct. Users can overwrite this number by sysctl but there is a
+ * problem.
+ *
+ * When a program's coredump is generated as ELF format, a section is created
+ * per a vma. In ELF, the number of sections is represented in unsigned short.
+ * This means the number of sections should be smaller than 65535 at coredump.
+ * Because the kernel adds some informative sections to a image of program at
+ * generating coredump, we need some margin. The number of extra sections is
+ * 1-3 now and depends on arch. We use "5" as safe margin, here.
+ *
+ * ELF extended numbering allows more than 65535 sections, so 16-bit bound is
+ * not a hard limit any more. Although some userspace tools can be surprised by
+ * that.
+ */
+#define MAPCOUNT_ELF_CORE_MARGIN (5)
+#define DEFAULT_MAX_MAP_COUNT (USHRT_MAX - MAPCOUNT_ELF_CORE_MARGIN)
+
+extern int sysctl_max_map_count;
+
extern unsigned long sysctl_user_reserve_kbytes;
extern unsigned long sysctl_admin_reserve_kbytes;
/*
* vm_flags in vm_area_struct, see mm_types.h.
+ * When changing, update also include/trace/events/mmflags.h
*/
#define VM_NONE 0x00000000
#define VM_NOHUGEPAGE 0x40000000 /* MADV_NOHUGEPAGE marked this vma */
#define VM_MERGEABLE 0x80000000 /* KSM may merge identical pages */
+ #ifdef CONFIG_ARCH_USES_HIGH_VMA_FLAGS
+ #define VM_HIGH_ARCH_BIT_0 32 /* bit only usable on 64-bit architectures */
+ #define VM_HIGH_ARCH_BIT_1 33 /* bit only usable on 64-bit architectures */
+ #define VM_HIGH_ARCH_BIT_2 34 /* bit only usable on 64-bit architectures */
+ #define VM_HIGH_ARCH_BIT_3 35 /* bit only usable on 64-bit architectures */
+ #define VM_HIGH_ARCH_0 BIT(VM_HIGH_ARCH_BIT_0)
+ #define VM_HIGH_ARCH_1 BIT(VM_HIGH_ARCH_BIT_1)
+ #define VM_HIGH_ARCH_2 BIT(VM_HIGH_ARCH_BIT_2)
+ #define VM_HIGH_ARCH_3 BIT(VM_HIGH_ARCH_BIT_3)
+ #endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */
+
#if defined(CONFIG_X86)
# define VM_PAT VM_ARCH_1 /* PAT reserves whole VMA at once (x86) */
+ #if defined (CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS)
+ # define VM_PKEY_SHIFT VM_HIGH_ARCH_BIT_0
+ # define VM_PKEY_BIT0 VM_HIGH_ARCH_0 /* A protection key is a 4-bit value */
+ # define VM_PKEY_BIT1 VM_HIGH_ARCH_1
+ # define VM_PKEY_BIT2 VM_HIGH_ARCH_2
+ # define VM_PKEY_BIT3 VM_HIGH_ARCH_3
+ #endif
#elif defined(CONFIG_PPC)
# define VM_SAO VM_ARCH_1 /* Strong Access Ordering (powerpc) */
#elif defined(CONFIG_PARISC)
#define FAULT_FLAG_KILLABLE 0x10 /* The fault task is in SIGKILL killable region */
#define FAULT_FLAG_TRIED 0x20 /* Second try */
#define FAULT_FLAG_USER 0x40 /* The fault originated in userspace */
+ #define FAULT_FLAG_REMOTE 0x80 /* faulting for non current tsk/mm */
+ #define FAULT_FLAG_INSTRUCTION 0x100 /* The fault was during an instruction fetch */
/*
* vm_fault is filled by the the pagefault handler and passed to the vma's
*/
static inline int put_page_testzero(struct page *page)
{
- VM_BUG_ON_PAGE(atomic_read(&page->_count) == 0, page);
- return atomic_dec_and_test(&page->_count);
+ VM_BUG_ON_PAGE(page_ref_count(page) == 0, page);
+ return page_ref_dec_and_test(page);
}
/*
*/
static inline int get_page_unless_zero(struct page *page)
{
- return atomic_inc_not_zero(&page->_count);
+ return page_ref_add_unless(page, 1, 0);
}
extern int page_is_ram(unsigned long pfn);
REGION_MIXED,
};
-int region_intersects(resource_size_t offset, size_t size, const char *type);
+int region_intersects(resource_size_t offset, size_t size, unsigned long flags,
+ unsigned long desc);
/* Support for virtually mapped pages */
struct page *vmalloc_to_page(const void *addr);
}
#endif
-static inline int page_count(struct page *page)
-{
- return atomic_read(&compound_head(page)->_count);
-}
-
static inline struct page *virt_to_head_page(const void *x)
{
struct page *page = virt_to_page(x);
return compound_head(page);
}
-/*
- * Setup the page count before being freed into the page allocator for
- * the first time (boot or memory hotplug)
- */
-static inline void init_page_count(struct page *page)
-{
- atomic_set(&page->_count, 1);
-}
-
void __put_page(struct page *page);
void put_pages_list(struct list_head *pages);
* Getting a normal page or the head of a compound page
* requires to already have an elevated page->_count.
*/
- VM_BUG_ON_PAGE(atomic_read(&page->_count) <= 0, page);
- atomic_inc(&page->_count);
+ VM_BUG_ON_PAGE(page_ref_count(page) <= 0, page);
+ page_ref_inc(page);
if (unlikely(is_zone_device_page(page)))
get_zone_device_page(page);
{
return page->mem_cgroup;
}
-
-static inline void set_page_memcg(struct page *page, struct mem_cgroup *memcg)
-{
- page->mem_cgroup = memcg;
-}
#else
static inline struct mem_cgroup *page_memcg(struct page *page)
{
return NULL;
}
-
-static inline void set_page_memcg(struct page *page, struct mem_cgroup *memcg)
-{
-}
#endif
/*
* just gets major/minor fault counters bumped up.
*/
-#define VM_FAULT_MINOR 0 /* For backwards compat. Remove me quickly. */
-
#define VM_FAULT_OOM 0x0001
#define VM_FAULT_SIGBUS 0x0002
#define VM_FAULT_MAJOR 0x0004
unsigned long start, unsigned long nr_pages,
unsigned int foll_flags, struct page **pages,
struct vm_area_struct **vmas, int *nonblocking);
- long get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
- unsigned long start, unsigned long nr_pages,
- int write, int force, struct page **pages,
- struct vm_area_struct **vmas);
- long get_user_pages_locked(struct task_struct *tsk, struct mm_struct *mm,
- unsigned long start, unsigned long nr_pages,
- int write, int force, struct page **pages,
- int *locked);
+ long get_user_pages_remote(struct task_struct *tsk, struct mm_struct *mm,
+ unsigned long start, unsigned long nr_pages,
+ int write, int force, struct page **pages,
+ struct vm_area_struct **vmas);
+ long get_user_pages6(unsigned long start, unsigned long nr_pages,
+ int write, int force, struct page **pages,
+ struct vm_area_struct **vmas);
+ long get_user_pages_locked6(unsigned long start, unsigned long nr_pages,
+ int write, int force, struct page **pages, int *locked);
long __get_user_pages_unlocked(struct task_struct *tsk, struct mm_struct *mm,
unsigned long start, unsigned long nr_pages,
int write, int force, struct page **pages,
unsigned int gup_flags);
- long get_user_pages_unlocked(struct task_struct *tsk, struct mm_struct *mm,
- unsigned long start, unsigned long nr_pages,
+ long get_user_pages_unlocked5(unsigned long start, unsigned long nr_pages,
int write, int force, struct page **pages);
int get_user_pages_fast(unsigned long start, int nr_pages, int write,
struct page **pages);
+ /* suppress warnings from use in EXPORT_SYMBOL() */
+ #ifndef __DISABLE_GUP_DEPRECATED
+ #define __gup_deprecated __deprecated
+ #else
+ #define __gup_deprecated
+ #endif
+ /*
+ * These macros provide backward-compatibility with the old
+ * get_user_pages() variants which took tsk/mm. These
+ * functions/macros provide both compile-time __deprecated so we
+ * can catch old-style use and not break the build. The actual
+ * functions also have WARN_ON()s to let us know at runtime if
+ * the get_user_pages() should have been the "remote" variant.
+ *
+ * These are hideous, but temporary.
+ *
+ * If you run into one of these __deprecated warnings, look
+ * at how you are calling get_user_pages(). If you are calling
+ * it with current/current->mm as the first two arguments,
+ * simply remove those arguments. The behavior will be the same
+ * as it is now. If you are calling it on another task, use
+ * get_user_pages_remote() instead.
+ *
+ * Any questions? Ask Dave Hansen <dave@sr71.net>
+ */
+ long
+ __gup_deprecated
+ get_user_pages8(struct task_struct *tsk, struct mm_struct *mm,
+ unsigned long start, unsigned long nr_pages,
+ int write, int force, struct page **pages,
+ struct vm_area_struct **vmas);
+ #define GUP_MACRO(_1, _2, _3, _4, _5, _6, _7, _8, get_user_pages, ...) \
+ get_user_pages
+ #define get_user_pages(...) GUP_MACRO(__VA_ARGS__, \
+ get_user_pages8, x, \
+ get_user_pages6, x, x, x, x, x)(__VA_ARGS__)
+
+ __gup_deprecated
+ long get_user_pages_locked8(struct task_struct *tsk, struct mm_struct *mm,
+ unsigned long start, unsigned long nr_pages,
+ int write, int force, struct page **pages,
+ int *locked);
+ #define GUPL_MACRO(_1, _2, _3, _4, _5, _6, _7, _8, get_user_pages_locked, ...) \
+ get_user_pages_locked
+ #define get_user_pages_locked(...) GUPL_MACRO(__VA_ARGS__, \
+ get_user_pages_locked8, x, \
+ get_user_pages_locked6, x, x, x, x)(__VA_ARGS__)
+
+ __gup_deprecated
+ long get_user_pages_unlocked7(struct task_struct *tsk, struct mm_struct *mm,
+ unsigned long start, unsigned long nr_pages,
+ int write, int force, struct page **pages);
+ #define GUPU_MACRO(_1, _2, _3, _4, _5, _6, _7, get_user_pages_unlocked, ...) \
+ get_user_pages_unlocked
+ #define get_user_pages_unlocked(...) GUPU_MACRO(__VA_ARGS__, \
+ get_user_pages_unlocked7, x, \
+ get_user_pages_unlocked5, x, x, x, x)(__VA_ARGS__)
+
/* Container for pinned pfns / pages */
struct frame_vector {
unsigned int nr_allocated; /* Number of frames we have space for */
int __set_page_dirty_no_writeback(struct page *page);
int redirty_page_for_writepage(struct writeback_control *wbc,
struct page *page);
-void account_page_dirtied(struct page *page, struct address_space *mapping,
- struct mem_cgroup *memcg);
+void account_page_dirtied(struct page *page, struct address_space *mapping);
void account_page_cleaned(struct page *page, struct address_space *mapping,
- struct mem_cgroup *memcg, struct bdi_writeback *wb);
+ struct bdi_writeback *wb);
int set_page_dirty(struct page *page);
int set_page_dirty_lock(struct page *page);
void cancel_dirty_page(struct page *page);
}
#endif
-int __pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
- pmd_t *pmd, unsigned long address);
+int __pte_alloc(struct mm_struct *mm, pmd_t *pmd, unsigned long address);
int __pte_alloc_kernel(pmd_t *pmd, unsigned long address);
/*
pte_unmap(pte); \
} while (0)
-#define pte_alloc_map(mm, vma, pmd, address) \
- ((unlikely(pmd_none(*(pmd))) && __pte_alloc(mm, vma, \
- pmd, address))? \
- NULL: pte_offset_map(pmd, address))
+#define pte_alloc(mm, pmd, address) \
+ (unlikely(pmd_none(*(pmd))) && __pte_alloc(mm, pmd, address))
+
+#define pte_alloc_map(mm, pmd, address) \
+ (pte_alloc(mm, pmd, address) ? NULL : pte_offset_map(pmd, address))
#define pte_alloc_map_lock(mm, pmd, address, ptlp) \
- ((unlikely(pmd_none(*(pmd))) && __pte_alloc(mm, NULL, \
- pmd, address))? \
- NULL: pte_offset_map_lock(mm, pmd, address, ptlp))
+ (pte_alloc(mm, pmd, address) ? \
+ NULL : pte_offset_map_lock(mm, pmd, address, ptlp))
#define pte_alloc_kernel(pmd, address) \
((unlikely(pmd_none(*(pmd))) && __pte_alloc_kernel(pmd, address))? \
extern void mem_init(void);
extern void __init mmap_init(void);
extern void show_mem(unsigned int flags);
+extern long si_mem_available(void);
extern void si_meminfo(struct sysinfo * val);
extern void si_meminfo_node(struct sysinfo *val, int nid);
/* page_alloc.c */
extern int min_free_kbytes;
+extern int watermark_scale_factor;
/* nommu.c */
extern atomic_long_t mmap_pages_allocated;
#define FOLL_MIGRATION 0x400 /* wait for page to replace migration entry */
#define FOLL_TRIED 0x800 /* a retry, previous pass started an IO */
#define FOLL_MLOCK 0x1000 /* lock present pages */
+ #define FOLL_REMOTE 0x2000 /* we are working on non-current tsk/mm */
typedef int (*pte_fn_t)(pte_t *pte, pgtable_t token, unsigned long addr,
void *data);
unsigned long size, pte_fn_t fn, void *data);
+#ifdef CONFIG_PAGE_POISONING
+extern bool page_poisoning_enabled(void);
+extern void kernel_poison_pages(struct page *page, int numpages, int enable);
+extern bool page_is_poisoned(struct page *page);
+#else
+static inline bool page_poisoning_enabled(void) { return false; }
+static inline void kernel_poison_pages(struct page *page, int numpages,
+ int enable) { }
+static inline bool page_is_poisoned(struct page *page) { return false; }
+#endif
+
#ifdef CONFIG_DEBUG_PAGEALLOC
extern bool _debug_pagealloc_enabled;
extern void __kernel_map_pages(struct page *page, int numpages, int enable);
}
#ifdef CONFIG_HIBERNATION
extern bool kernel_page_present(struct page *page);
-#endif /* CONFIG_HIBERNATION */
-#else
+#endif /* CONFIG_HIBERNATION */
+#else /* CONFIG_DEBUG_PAGEALLOC */
static inline void
kernel_map_pages(struct page *page, int numpages, int enable) {}
#ifdef CONFIG_HIBERNATION
static inline bool kernel_page_present(struct page *page) { return true; }
-#endif /* CONFIG_HIBERNATION */
-#endif
+#endif /* CONFIG_HIBERNATION */
+static inline bool debug_pagealloc_enabled(void)
+{
+ return false;
+}
+#endif /* CONFIG_DEBUG_PAGEALLOC */
#ifdef __HAVE_ARCH_GATE_AREA
extern struct vm_area_struct *get_gate_vma(struct mm_struct *mm);
retry:
/* Read the page with vaddr into memory */
- ret = get_user_pages(NULL, mm, vaddr, 1, 0, 1, &old_page, &vma);
+ ret = get_user_pages_remote(NULL, mm, vaddr, 1, 0, 1, &old_page, &vma);
if (ret <= 0)
return ret;
goto free_area;
area->xol_mapping.name = "[uprobes]";
+ area->xol_mapping.fault = NULL;
area->xol_mapping.pages = area->pages;
area->pages[0] = alloc_page(GFP_HIGHUSER);
if (!area->pages[0])
if (likely(result == 0))
goto out;
- result = get_user_pages(NULL, mm, vaddr, 1, 0, 1, &page, NULL);
+ /*
+ * The NULL 'tsk' here ensures that any faults that occur here
+ * will not be accounted to the task. 'mm' *is* current->mm,
+ * but we treat this as a 'remote' access since it is
+ * essentially a kernel access to the memory.
+ */
+ result = get_user_pages_remote(NULL, mm, vaddr, 1, 0, 1, &page, NULL);
if (result < 0)
return result;
bool "Allow for memory hot-add"
depends on SPARSEMEM || X86_64_ACPI_NUMA
depends on ARCH_ENABLE_MEMORY_HOTPLUG
- depends on (IA64 || X86 || PPC_BOOK3S_64 || SUPERH || S390)
config MEMORY_HOTPLUG_SPARSE
def_bool y
config ZONE_DEVICE
bool "Device memory (pmem, etc...) hotplug support" if EXPERT
- default !ZONE_DMA
- depends on !ZONE_DMA
depends on MEMORY_HOTPLUG
depends on MEMORY_HOTREMOVE
+ depends on SPARSEMEM_VMEMMAP
depends on X86_64 #arch_add_memory() comprehends device memory
help
config FRAME_VECTOR
bool
+
+ config ARCH_USES_HIGH_VMA_FLAGS
+ bool
+ config ARCH_HAS_PKEYS
+ bool
#include <linux/userfaultfd_k.h>
#include <asm/io.h>
+ #include <asm/mmu_context.h>
#include <asm/pgalloc.h>
#include <asm/uaccess.h>
#include <asm/tlb.h>
}
}
-int __pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
- pmd_t *pmd, unsigned long address)
+int __pte_alloc(struct mm_struct *mm, pmd_t *pmd, unsigned long address)
{
spinlock_t *ptl;
pgtable_t new = pte_alloc_one(mm, address);
return;
}
if (nr_unshown) {
- printk(KERN_ALERT
- "BUG: Bad page map: %lu messages suppressed\n",
- nr_unshown);
+ pr_alert("BUG: Bad page map: %lu messages suppressed\n",
+ nr_unshown);
nr_unshown = 0;
}
nr_shown = 0;
mapping = vma->vm_file ? vma->vm_file->f_mapping : NULL;
index = linear_page_index(vma, addr);
- printk(KERN_ALERT
- "BUG: Bad page map in process %s pte:%08llx pmd:%08llx\n",
- current->comm,
- (long long)pte_val(pte), (long long)pmd_val(*pmd));
+ pr_alert("BUG: Bad page map in process %s pte:%08llx pmd:%08llx\n",
+ current->comm,
+ (long long)pte_val(pte), (long long)pmd_val(*pmd));
if (page)
dump_page(page, "bad pte");
- printk(KERN_ALERT
- "addr:%p vm_flags:%08lx anon_vma:%p mapping:%p index:%lx\n",
- (void *)addr, vma->vm_flags, vma->anon_vma, mapping, index);
+ pr_alert("addr:%p vm_flags:%08lx anon_vma:%p mapping:%p index:%lx\n",
+ (void *)addr, vma->vm_flags, vma->anon_vma, mapping, index);
/*
* Choose text because data symbols depend on CONFIG_KALLSYMS_ALL=y
*/
unsigned long end = addr + size;
int err;
- BUG_ON(addr >= end);
+ if (WARN_ON(addr >= end))
+ return -EINVAL;
+
pgd = pgd_offset(mm, addr);
do {
next = pgd_addr_end(addr, end);
unsigned long address, pte_t *page_table, pmd_t *pmd,
unsigned int flags, pte_t orig_pte)
{
- pgoff_t pgoff = (((address & PAGE_MASK)
- - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
+ pgoff_t pgoff = linear_page_index(vma, address);
pte_unmap(page_table);
/* The VMA was not fully populated on mmap() or missing VM_DONTEXPAND */
pmd_t *pmd;
pte_t *pte;
+ if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE,
+ flags & FAULT_FLAG_INSTRUCTION,
+ flags & FAULT_FLAG_REMOTE))
+ return VM_FAULT_SIGSEGV;
+
if (unlikely(is_vm_hugetlb_page(vma)))
return hugetlb_fault(mm, vma, address, flags);
}
/*
- * Use __pte_alloc instead of pte_alloc_map, because we can't
+ * Use pte_alloc() instead of pte_alloc_map, because we can't
* run pte_offset_map on the pmd, if an huge pmd could
* materialize from under us from a different thread.
*/
- if (unlikely(pmd_none(*pmd)) &&
- unlikely(__pte_alloc(mm, vma, pmd, address)))
+ if (unlikely(pte_alloc(mm, pmd, address)))
return VM_FAULT_OOM;
- /* if an huge pmd materialized from under us just retry later */
- if (unlikely(pmd_trans_huge(*pmd) || pmd_devmap(*pmd)))
+ /*
+ * If a huge pmd materialized under us just retry later. Use
+ * pmd_trans_unstable() instead of pmd_trans_huge() to ensure the pmd
+ * didn't become pmd_trans_huge under us and then back to pmd_none, as
+ * a result of MADV_DONTNEED running immediately after a huge pmd fault
+ * in a different thread of this mm, in turn leading to a misleading
+ * pmd_trans_huge() retval. All we have to ensure is that it is a
+ * regular pmd that we can walk with pte_offset_map() and we can do that
+ * through an atomic read in C, which is what pmd_trans_unstable()
+ * provides.
+ */
+ if (unlikely(pmd_trans_unstable(pmd) || pmd_devmap(*pmd)))
return 0;
/*
* A regular pmd is established and it can't morph into a huge pmd
void *maddr;
struct page *page = NULL;
- ret = get_user_pages(tsk, mm, addr, 1,
+ ret = get_user_pages_remote(tsk, mm, addr, 1,
write, 1, &page, &vma);
if (ret <= 0) {
#ifndef CONFIG_HAVE_IOREMAP_PROT
nid = page_to_nid(page);
if (node_isset(nid, *qp->nmask) == !!(flags & MPOL_MF_INVERT))
continue;
- if (PageTail(page) && PageAnon(page)) {
+ if (PageTransCompound(page) && PageAnon(page)) {
get_page(page);
pte_unmap_unlock(pte, ptl);
lock_page(page);
if (flags & MPOL_MF_LAZY) {
/* Similar to task_numa_work, skip inaccessible VMAs */
- if (vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE))
+ if (!is_vm_hugetlb_page(vma) &&
+ (vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE)) &&
+ !(vma->vm_flags & VM_MIXEDMAP))
change_prot_numa(vma, start, endvma);
return 1;
}
}
}
- static int lookup_node(struct mm_struct *mm, unsigned long addr)
+ static int lookup_node(unsigned long addr)
{
struct page *p;
int err;
- err = get_user_pages(current, mm, addr & PAGE_MASK, 1, 0, 0, &p, NULL);
+ err = get_user_pages(addr & PAGE_MASK, 1, 0, 0, &p, NULL);
if (err >= 0) {
err = page_to_nid(p);
put_page(p);
if (flags & MPOL_F_NODE) {
if (flags & MPOL_F_ADDR) {
- err = lookup_node(mm, addr);
+ err = lookup_node(addr);
if (err < 0)
goto out;
*policy = err;
set_numabalancing_state(numabalancing_override == 1);
if (num_online_nodes() > 1 && !numabalancing_override) {
- pr_info("%s automatic NUMA balancing. "
- "Configure with numa_balancing= or the "
- "kernel.numa_balancing sysctl",
+ pr_info("%s automatic NUMA balancing. Configure with numa_balancing= or the kernel.numa_balancing sysctl\n",
numabalancing_default ? "Enabling" : "Disabling");
set_numabalancing_state(numabalancing_default);
}
#include <linux/khugepaged.h>
#include <linux/uprobes.h>
#include <linux/rbtree_augmented.h>
-#include <linux/sched/sysctl.h>
#include <linux/notifier.h>
#include <linux/memory.h>
#include <linux/printk.h>
#include <linux/userfaultfd_k.h>
#include <linux/moduleparam.h>
+ #include <linux/pkeys.h>
#include <asm/uaccess.h>
#include <asm/cacheflush.h>
}
}
-
-int sysctl_overcommit_memory __read_mostly = OVERCOMMIT_GUESS; /* heuristic overcommit */
-int sysctl_overcommit_ratio __read_mostly = 50; /* default is 50% */
-unsigned long sysctl_overcommit_kbytes __read_mostly;
-int sysctl_max_map_count __read_mostly = DEFAULT_MAX_MAP_COUNT;
-unsigned long sysctl_user_reserve_kbytes __read_mostly = 1UL << 17; /* 128MB */
-unsigned long sysctl_admin_reserve_kbytes __read_mostly = 1UL << 13; /* 8MB */
-/*
- * Make sure vm_committed_as in one cacheline and not cacheline shared with
- * other variables. It can be updated by several CPUs frequently.
- */
-struct percpu_counter vm_committed_as ____cacheline_aligned_in_smp;
-
-/*
- * The global memory commitment made in the system can be a metric
- * that can be used to drive ballooning decisions when Linux is hosted
- * as a guest. On Hyper-V, the host implements a policy engine for dynamically
- * balancing memory across competing virtual machines that are hosted.
- * Several metrics drive this policy engine including the guest reported
- * memory commitment.
- */
-unsigned long vm_memory_committed(void)
-{
- return percpu_counter_read_positive(&vm_committed_as);
-}
-EXPORT_SYMBOL_GPL(vm_memory_committed);
-
-/*
- * Check that a process has enough memory to allocate a new virtual
- * mapping. 0 means there is enough memory for the allocation to
- * succeed and -ENOMEM implies there is not.
- *
- * We currently support three overcommit policies, which are set via the
- * vm.overcommit_memory sysctl. See Documentation/vm/overcommit-accounting
- *
- * Strict overcommit modes added 2002 Feb 26 by Alan Cox.
- * Additional code 2002 Jul 20 by Robert Love.
- *
- * cap_sys_admin is 1 if the process has admin privileges, 0 otherwise.
- *
- * Note this is a helper function intended to be used by LSMs which
- * wish to use this logic.
- */
-int __vm_enough_memory(struct mm_struct *mm, long pages, int cap_sys_admin)
-{
- long free, allowed, reserve;
-
- VM_WARN_ONCE(percpu_counter_read(&vm_committed_as) <
- -(s64)vm_committed_as_batch * num_online_cpus(),
- "memory commitment underflow");
-
- vm_acct_memory(pages);
-
- /*
- * Sometimes we want to use more memory than we have
- */
- if (sysctl_overcommit_memory == OVERCOMMIT_ALWAYS)
- return 0;
-
- if (sysctl_overcommit_memory == OVERCOMMIT_GUESS) {
- free = global_page_state(NR_FREE_PAGES);
- free += global_page_state(NR_FILE_PAGES);
-
- /*
- * shmem pages shouldn't be counted as free in this
- * case, they can't be purged, only swapped out, and
- * that won't affect the overall amount of available
- * memory in the system.
- */
- free -= global_page_state(NR_SHMEM);
-
- free += get_nr_swap_pages();
-
- /*
- * Any slabs which are created with the
- * SLAB_RECLAIM_ACCOUNT flag claim to have contents
- * which are reclaimable, under pressure. The dentry
- * cache and most inode caches should fall into this
- */
- free += global_page_state(NR_SLAB_RECLAIMABLE);
-
- /*
- * Leave reserved pages. The pages are not for anonymous pages.
- */
- if (free <= totalreserve_pages)
- goto error;
- else
- free -= totalreserve_pages;
-
- /*
- * Reserve some for root
- */
- if (!cap_sys_admin)
- free -= sysctl_admin_reserve_kbytes >> (PAGE_SHIFT - 10);
-
- if (free > pages)
- return 0;
-
- goto error;
- }
-
- allowed = vm_commit_limit();
- /*
- * Reserve some for root
- */
- if (!cap_sys_admin)
- allowed -= sysctl_admin_reserve_kbytes >> (PAGE_SHIFT - 10);
-
- /*
- * Don't let a single process grow so big a user can't recover
- */
- if (mm) {
- reserve = sysctl_user_reserve_kbytes >> (PAGE_SHIFT - 10);
- allowed -= min_t(long, mm->total_vm / 32, reserve);
- }
-
- if (percpu_counter_read_positive(&vm_committed_as) < allowed)
- return 0;
-error:
- vm_unacct_memory(pages);
-
- return -ENOMEM;
-}
-
/*
* Requires inode->i_mapping->i_mmap_rwsem
*/
unsigned long pgoff, unsigned long *populate)
{
struct mm_struct *mm = current->mm;
+ int pkey = 0;
*populate = 0;
if (offset_in_page(addr))
return addr;
+ if (prot == PROT_EXEC) {
+ pkey = execute_only_pkey(mm);
+ if (pkey < 0)
+ pkey = 0;
+ }
+
/* Do simple checking here so the lower-level routines won't have
* to. we assume access permissions have been handled by the open
* of the memory object, so we don't do any here.
*/
- vm_flags |= calc_vm_prot_bits(prot) | calc_vm_flag_bits(flags) |
+ vm_flags |= calc_vm_prot_bits(prot, pkey) | calc_vm_flag_bits(flags) |
mm->def_flags | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC;
if (flags & MAP_LOCKED)
unsigned long ret = -EINVAL;
struct file *file;
- pr_warn_once("%s (%d) uses deprecated remap_file_pages() syscall. "
- "See Documentation/vm/remap_file_pages.txt.\n",
- current->comm, current->pid);
+ pr_warn_once("%s (%d) uses deprecated remap_file_pages() syscall. See Documentation/vm/remap_file_pages.txt.\n",
+ current->comm, current->pid);
if (prot)
return ret;
if (!vma || !(vma->vm_flags & VM_SHARED))
goto out;
- if (start < vma->vm_start || start + size > vma->vm_end)
+ if (start < vma->vm_start)
goto out;
- if (pgoff == linear_page_index(vma, start)) {
- ret = 0;
- goto out;
+ if (start + size > vma->vm_end) {
+ struct vm_area_struct *next;
+
+ for (next = vma->vm_next; next; next = next->vm_next) {
+ /* hole between vmas ? */
+ if (next->vm_start != next->vm_prev->vm_end)
+ goto out;
+
+ if (next->vm_file != vma->vm_file)
+ goto out;
+
+ if (next->vm_flags != vma->vm_flags)
+ goto out;
+
+ if (start + size <= next->vm_end)
+ break;
+ }
+
+ if (!next)
+ goto out;
}
prot |= vma->vm_flags & VM_READ ? PROT_READ : 0;
flags &= MAP_NONBLOCK;
flags |= MAP_SHARED | MAP_FIXED | MAP_POPULATE;
if (vma->vm_flags & VM_LOCKED) {
+ struct vm_area_struct *tmp;
flags |= MAP_LOCKED;
+
/* drop PG_Mlocked flag for over-mapped range */
- munlock_vma_pages_range(vma, start, start + size);
+ for (tmp = vma; tmp->vm_start >= start + size;
+ tmp = tmp->vm_next) {
+ munlock_vma_pages_range(tmp,
+ max(tmp->vm_start, start),
+ min(tmp->vm_end, start + size));
+ }
}
file = get_file(vma->vm_file);
if (is_data_mapping(flags) &&
mm->data_vm + npages > rlimit(RLIMIT_DATA) >> PAGE_SHIFT) {
if (ignore_rlimit_data)
- pr_warn_once("%s (%d): VmData %lu exceed data ulimit "
- "%lu. Will be forbidden soon.\n",
+ pr_warn_once("%s (%d): VmData %lu exceed data ulimit %lu. Will be forbidden soon.\n",
current->comm, current->pid,
(mm->data_vm + npages) << PAGE_SHIFT,
rlimit(RLIMIT_DATA));
#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+ #define __DISABLE_GUP_DEPRECATED
+
#include <linux/export.h>
#include <linux/mm.h>
#include <linux/vmacache.h>
#include <linux/security.h>
#include <linux/syscalls.h>
#include <linux/audit.h>
-#include <linux/sched/sysctl.h>
#include <linux/printk.h>
#include <asm/uaccess.h>
unsigned long max_mapnr;
EXPORT_SYMBOL(max_mapnr);
unsigned long highest_memmap_pfn;
-struct percpu_counter vm_committed_as;
-int sysctl_overcommit_memory = OVERCOMMIT_GUESS; /* heuristic overcommit */
-int sysctl_overcommit_ratio = 50; /* default is 50% */
-unsigned long sysctl_overcommit_kbytes __read_mostly;
-int sysctl_max_map_count = DEFAULT_MAX_MAP_COUNT;
int sysctl_nr_trim_pages = CONFIG_NOMMU_INITIAL_TRIM_EXCESS;
-unsigned long sysctl_user_reserve_kbytes __read_mostly = 1UL << 17; /* 128MB */
-unsigned long sysctl_admin_reserve_kbytes __read_mostly = 1UL << 13; /* 8MB */
int heap_stack_gap = 0;
atomic_long_t mmap_pages_allocated;
-/*
- * The global memory commitment made in the system can be a metric
- * that can be used to drive ballooning decisions when Linux is hosted
- * as a guest. On Hyper-V, the host implements a policy engine for dynamically
- * balancing memory across competing virtual machines that are hosted.
- * Several metrics drive this policy engine including the guest reported
- * memory commitment.
- */
-unsigned long vm_memory_committed(void)
-{
- return percpu_counter_read_positive(&vm_committed_as);
-}
-
-EXPORT_SYMBOL_GPL(vm_memory_committed);
-
EXPORT_SYMBOL(mem_map);
/* list of mapped, potentially shareable regions */
* slab page or a secondary page from a compound page
* - don't permit access to VMAs that don't support it, such as I/O mappings
*/
- long get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
- unsigned long start, unsigned long nr_pages,
+ long get_user_pages6(unsigned long start, unsigned long nr_pages,
int write, int force, struct page **pages,
struct vm_area_struct **vmas)
{
if (force)
flags |= FOLL_FORCE;
- return __get_user_pages(tsk, mm, start, nr_pages, flags, pages, vmas,
- NULL);
+ return __get_user_pages(current, current->mm, start, nr_pages, flags,
+ pages, vmas, NULL);
}
- EXPORT_SYMBOL(get_user_pages);
+ EXPORT_SYMBOL(get_user_pages6);
- long get_user_pages_locked(struct task_struct *tsk, struct mm_struct *mm,
- unsigned long start, unsigned long nr_pages,
- int write, int force, struct page **pages,
- int *locked)
+ long get_user_pages_locked6(unsigned long start, unsigned long nr_pages,
+ int write, int force, struct page **pages,
+ int *locked)
{
- return get_user_pages(tsk, mm, start, nr_pages, write, force,
- pages, NULL);
+ return get_user_pages6(start, nr_pages, write, force, pages, NULL);
}
- EXPORT_SYMBOL(get_user_pages_locked);
+ EXPORT_SYMBOL(get_user_pages_locked6);
long __get_user_pages_unlocked(struct task_struct *tsk, struct mm_struct *mm,
unsigned long start, unsigned long nr_pages,
{
long ret;
down_read(&mm->mmap_sem);
- ret = get_user_pages(tsk, mm, start, nr_pages, write, force,
- pages, NULL);
+ ret = __get_user_pages(tsk, mm, start, nr_pages, gup_flags, pages,
+ NULL, NULL);
up_read(&mm->mmap_sem);
return ret;
}
EXPORT_SYMBOL(__get_user_pages_unlocked);
- long get_user_pages_unlocked(struct task_struct *tsk, struct mm_struct *mm,
- unsigned long start, unsigned long nr_pages,
+ long get_user_pages_unlocked5(unsigned long start, unsigned long nr_pages,
int write, int force, struct page **pages)
{
- return __get_user_pages_unlocked(tsk, mm, start, nr_pages, write,
- force, pages, 0);
+ return __get_user_pages_unlocked(current, current->mm, start, nr_pages,
+ write, force, pages, 0);
}
- EXPORT_SYMBOL(get_user_pages_unlocked);
+ EXPORT_SYMBOL(get_user_pages_unlocked5);
/**
* follow_pfn - look up PFN at a user virtual address
{
unsigned long vm_flags;
- vm_flags = calc_vm_prot_bits(prot) | calc_vm_flag_bits(flags);
+ vm_flags = calc_vm_prot_bits(prot, 0) | calc_vm_flag_bits(flags);
/* vm_flags |= mm->def_flags; */
if (!(capabilities & NOMMU_MAP_DIRECT)) {
}
EXPORT_SYMBOL(unmap_mapping_range);
-/*
- * Check that a process has enough memory to allocate a new virtual
- * mapping. 0 means there is enough memory for the allocation to
- * succeed and -ENOMEM implies there is not.
- *
- * We currently support three overcommit policies, which are set via the
- * vm.overcommit_memory sysctl. See Documentation/vm/overcommit-accounting
- *
- * Strict overcommit modes added 2002 Feb 26 by Alan Cox.
- * Additional code 2002 Jul 20 by Robert Love.
- *
- * cap_sys_admin is 1 if the process has admin privileges, 0 otherwise.
- *
- * Note this is a helper function intended to be used by LSMs which
- * wish to use this logic.
- */
-int __vm_enough_memory(struct mm_struct *mm, long pages, int cap_sys_admin)
-{
- long free, allowed, reserve;
-
- vm_acct_memory(pages);
-
- /*
- * Sometimes we want to use more memory than we have
- */
- if (sysctl_overcommit_memory == OVERCOMMIT_ALWAYS)
- return 0;
-
- if (sysctl_overcommit_memory == OVERCOMMIT_GUESS) {
- free = global_page_state(NR_FREE_PAGES);
- free += global_page_state(NR_FILE_PAGES);
-
- /*
- * shmem pages shouldn't be counted as free in this
- * case, they can't be purged, only swapped out, and
- * that won't affect the overall amount of available
- * memory in the system.
- */
- free -= global_page_state(NR_SHMEM);
-
- free += get_nr_swap_pages();
-
- /*
- * Any slabs which are created with the
- * SLAB_RECLAIM_ACCOUNT flag claim to have contents
- * which are reclaimable, under pressure. The dentry
- * cache and most inode caches should fall into this
- */
- free += global_page_state(NR_SLAB_RECLAIMABLE);
-
- /*
- * Leave reserved pages. The pages are not for anonymous pages.
- */
- if (free <= totalreserve_pages)
- goto error;
- else
- free -= totalreserve_pages;
-
- /*
- * Reserve some for root
- */
- if (!cap_sys_admin)
- free -= sysctl_admin_reserve_kbytes >> (PAGE_SHIFT - 10);
-
- if (free > pages)
- return 0;
-
- goto error;
- }
-
- allowed = vm_commit_limit();
- /*
- * Reserve some 3% for root
- */
- if (!cap_sys_admin)
- allowed -= sysctl_admin_reserve_kbytes >> (PAGE_SHIFT - 10);
-
- /*
- * Don't let a single process grow so big a user can't recover
- */
- if (mm) {
- reserve = sysctl_user_reserve_kbytes >> (PAGE_SHIFT - 10);
- allowed -= min_t(long, mm->total_vm / 32, reserve);
- }
-
- if (percpu_counter_read_positive(&vm_committed_as) < allowed)
- return 0;
-
-error:
- vm_unacct_memory(pages);
-
- return -ENOMEM;
-}
-
int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
BUG();
return 0;
}
subsys_initcall(init_admin_reserve);
+
+ long get_user_pages8(struct task_struct *tsk, struct mm_struct *mm,
+ unsigned long start, unsigned long nr_pages,
+ int write, int force, struct page **pages,
+ struct vm_area_struct **vmas)
+ {
+ return get_user_pages6(start, nr_pages, write, force, pages, vmas);
+ }
+ EXPORT_SYMBOL(get_user_pages8);
+
+ long get_user_pages_locked8(struct task_struct *tsk, struct mm_struct *mm,
+ unsigned long start, unsigned long nr_pages,
+ int write, int force, struct page **pages,
+ int *locked)
+ {
+ return get_user_pages_locked6(start, nr_pages, write,
+ force, pages, locked);
+ }
+ EXPORT_SYMBOL(get_user_pages_locked8);
+
+ long get_user_pages_unlocked7(struct task_struct *tsk, struct mm_struct *mm,
+ unsigned long start, unsigned long nr_pages,
+ int write, int force, struct page **pages)
+ {
+ return get_user_pages_unlocked5(start, nr_pages, write, force, pages);
+ }
+ EXPORT_SYMBOL(get_user_pages_unlocked7);
+
int __weak get_user_pages_fast(unsigned long start,
int nr_pages, int write, struct page **pages)
{
- struct mm_struct *mm = current->mm;
- return get_user_pages_unlocked(current, mm, start, nr_pages,
- write, 0, pages);
+ return get_user_pages_unlocked(start, nr_pages, write, 0, pages);
}
EXPORT_SYMBOL_GPL(get_user_pages_fast);
}
EXPORT_SYMBOL_GPL(__page_mapcount);
+int sysctl_overcommit_memory __read_mostly = OVERCOMMIT_GUESS;
+int sysctl_overcommit_ratio __read_mostly = 50;
+unsigned long sysctl_overcommit_kbytes __read_mostly;
+int sysctl_max_map_count __read_mostly = DEFAULT_MAX_MAP_COUNT;
+unsigned long sysctl_user_reserve_kbytes __read_mostly = 1UL << 17; /* 128MB */
+unsigned long sysctl_admin_reserve_kbytes __read_mostly = 1UL << 13; /* 8MB */
+
int overcommit_ratio_handler(struct ctl_table *table, int write,
void __user *buffer, size_t *lenp,
loff_t *ppos)
return allowed;
}
+/*
+ * Make sure vm_committed_as in one cacheline and not cacheline shared with
+ * other variables. It can be updated by several CPUs frequently.
+ */
+struct percpu_counter vm_committed_as ____cacheline_aligned_in_smp;
+
+/*
+ * The global memory commitment made in the system can be a metric
+ * that can be used to drive ballooning decisions when Linux is hosted
+ * as a guest. On Hyper-V, the host implements a policy engine for dynamically
+ * balancing memory across competing virtual machines that are hosted.
+ * Several metrics drive this policy engine including the guest reported
+ * memory commitment.
+ */
+unsigned long vm_memory_committed(void)
+{
+ return percpu_counter_read_positive(&vm_committed_as);
+}
+EXPORT_SYMBOL_GPL(vm_memory_committed);
+
+/*
+ * Check that a process has enough memory to allocate a new virtual
+ * mapping. 0 means there is enough memory for the allocation to
+ * succeed and -ENOMEM implies there is not.
+ *
+ * We currently support three overcommit policies, which are set via the
+ * vm.overcommit_memory sysctl. See Documentation/vm/overcommit-accounting
+ *
+ * Strict overcommit modes added 2002 Feb 26 by Alan Cox.
+ * Additional code 2002 Jul 20 by Robert Love.
+ *
+ * cap_sys_admin is 1 if the process has admin privileges, 0 otherwise.
+ *
+ * Note this is a helper function intended to be used by LSMs which
+ * wish to use this logic.
+ */
+int __vm_enough_memory(struct mm_struct *mm, long pages, int cap_sys_admin)
+{
+ long free, allowed, reserve;
+
+ VM_WARN_ONCE(percpu_counter_read(&vm_committed_as) <
+ -(s64)vm_committed_as_batch * num_online_cpus(),
+ "memory commitment underflow");
+
+ vm_acct_memory(pages);
+
+ /*
+ * Sometimes we want to use more memory than we have
+ */
+ if (sysctl_overcommit_memory == OVERCOMMIT_ALWAYS)
+ return 0;
+
+ if (sysctl_overcommit_memory == OVERCOMMIT_GUESS) {
+ free = global_page_state(NR_FREE_PAGES);
+ free += global_page_state(NR_FILE_PAGES);
+
+ /*
+ * shmem pages shouldn't be counted as free in this
+ * case, they can't be purged, only swapped out, and
+ * that won't affect the overall amount of available
+ * memory in the system.
+ */
+ free -= global_page_state(NR_SHMEM);
+
+ free += get_nr_swap_pages();
+
+ /*
+ * Any slabs which are created with the
+ * SLAB_RECLAIM_ACCOUNT flag claim to have contents
+ * which are reclaimable, under pressure. The dentry
+ * cache and most inode caches should fall into this
+ */
+ free += global_page_state(NR_SLAB_RECLAIMABLE);
+
+ /*
+ * Leave reserved pages. The pages are not for anonymous pages.
+ */
+ if (free <= totalreserve_pages)
+ goto error;
+ else
+ free -= totalreserve_pages;
+
+ /*
+ * Reserve some for root
+ */
+ if (!cap_sys_admin)
+ free -= sysctl_admin_reserve_kbytes >> (PAGE_SHIFT - 10);
+
+ if (free > pages)
+ return 0;
+
+ goto error;
+ }
+
+ allowed = vm_commit_limit();
+ /*
+ * Reserve some for root
+ */
+ if (!cap_sys_admin)
+ allowed -= sysctl_admin_reserve_kbytes >> (PAGE_SHIFT - 10);
+
+ /*
+ * Don't let a single process grow so big a user can't recover
+ */
+ if (mm) {
+ reserve = sysctl_user_reserve_kbytes >> (PAGE_SHIFT - 10);
+ allowed -= min_t(long, mm->total_vm / 32, reserve);
+ }
+
+ if (percpu_counter_read_positive(&vm_committed_as) < allowed)
+ return 0;
+error:
+ vm_unacct_memory(pages);
+
+ return -ENOMEM;
+}
+
/**
* get_cmdline() - copy the cmdline value to a buffer.
* @task: the task whose cmdline value to copy.
might_sleep();
- get_user_pages_unlocked(NULL, mm, addr, 1, 1, 0, NULL);
+ /*
+ * This work is run asynchromously to the task which owns
+ * mm and might be done in another context, so we must
+ * use FOLL_REMOTE.
+ */
+ __get_user_pages_unlocked(NULL, mm, addr, 1, 1, 0, NULL, FOLL_REMOTE);
+
kvm_async_page_present_sync(vcpu, apf);
spin_lock(&vcpu->async_pf.lock);
* This memory barrier pairs with prepare_to_wait's set_current_state()
*/
smp_mb();
- if (waitqueue_active(&vcpu->wq))
- wake_up_interruptible(&vcpu->wq);
+ if (swait_active(&vcpu->wq))
+ swake_up(&vcpu->wq);
mmput(mm);
kvm_put_kvm(vcpu->kvm);
/* cancel outstanding work queue item */
while (!list_empty(&vcpu->async_pf.queue)) {
struct kvm_async_pf *work =
- list_entry(vcpu->async_pf.queue.next,
- typeof(*work), queue);
+ list_first_entry(&vcpu->async_pf.queue,
+ typeof(*work), queue);
list_del(&work->queue);
#ifdef CONFIG_KVM_ASYNC_PF_SYNC
spin_lock(&vcpu->async_pf.lock);
while (!list_empty(&vcpu->async_pf.done)) {
struct kvm_async_pf *work =
- list_entry(vcpu->async_pf.done.next,
- typeof(*work), link);
+ list_first_entry(&vcpu->async_pf.done,
+ typeof(*work), link);
list_del(&work->link);
kmem_cache_free(async_pf_cache, work);
}
* do alloc nowait since if we are going to sleep anyway we
* may as well sleep faulting in page
*/
- work = kmem_cache_zalloc(async_pf_cache, GFP_NOWAIT);
+ work = kmem_cache_zalloc(async_pf_cache, GFP_NOWAIT | __GFP_NOWARN);
if (!work)
return 0;
/* Default doubles per-vcpu halt_poll_ns. */
static unsigned int halt_poll_ns_grow = 2;
-module_param(halt_poll_ns_grow, int, S_IRUGO);
+module_param(halt_poll_ns_grow, uint, S_IRUGO | S_IWUSR);
/* Default resets per-vcpu halt_poll_ns . */
static unsigned int halt_poll_ns_shrink;
-module_param(halt_poll_ns_shrink, int, S_IRUGO);
+module_param(halt_poll_ns_shrink, uint, S_IRUGO | S_IWUSR);
/*
* Ordering of locks:
vcpu->kvm = kvm;
vcpu->vcpu_id = id;
vcpu->pid = NULL;
- vcpu->halt_poll_ns = 0;
- init_waitqueue_head(&vcpu->wq);
+ init_swait_queue_head(&vcpu->wq);
kvm_async_pf_vcpu_init(vcpu);
vcpu->pre_pcpu = -1;
static void kvm_destroy_devices(struct kvm *kvm)
{
- struct list_head *node, *tmp;
+ struct kvm_device *dev, *tmp;
- list_for_each_safe(node, tmp, &kvm->devices) {
- struct kvm_device *dev =
- list_entry(node, struct kvm_device, vm_node);
-
- list_del(node);
+ list_for_each_entry_safe(dev, tmp, &kvm->devices, vm_node) {
+ list_del(&dev->vm_node);
dev->ops->destroy(dev);
}
}
return gfn_to_hva_memslot_prot(slot, gfn, writable);
}
- static int get_user_page_nowait(struct task_struct *tsk, struct mm_struct *mm,
- unsigned long start, int write, struct page **page)
+ static int get_user_page_nowait(unsigned long start, int write,
+ struct page **page)
{
int flags = FOLL_TOUCH | FOLL_NOWAIT | FOLL_HWPOISON | FOLL_GET;
if (write)
flags |= FOLL_WRITE;
- return __get_user_pages(tsk, mm, start, 1, flags, page, NULL, NULL);
+ return __get_user_pages(current, current->mm, start, 1, flags, page,
+ NULL, NULL);
}
static inline int check_user_page_hwpoison(unsigned long addr)
if (async) {
down_read(¤t->mm->mmap_sem);
- npages = get_user_page_nowait(current, current->mm,
- addr, write_fault, page);
+ npages = get_user_page_nowait(addr, write_fault, page);
up_read(¤t->mm->mmap_sem);
} else
npages = __get_user_pages_unlocked(current, current->mm, addr, 1,
{
unsigned long addr = __gfn_to_hva_many(slot, gfn, NULL, write_fault);
- if (addr == KVM_HVA_ERR_RO_BAD)
+ if (addr == KVM_HVA_ERR_RO_BAD) {
+ if (writable)
+ *writable = false;
return KVM_PFN_ERR_RO_FAULT;
+ }
- if (kvm_is_error_hva(addr))
+ if (kvm_is_error_hva(addr)) {
+ if (writable)
+ *writable = false;
return KVM_PFN_NOSLOT;
+ }
/* Do not map writable pfn in the readonly memslot. */
if (writable && memslot_is_readonly(slot)) {
static void grow_halt_poll_ns(struct kvm_vcpu *vcpu)
{
- int old, val;
+ unsigned int old, val, grow;
old = val = vcpu->halt_poll_ns;
+ grow = READ_ONCE(halt_poll_ns_grow);
/* 10us base */
- if (val == 0 && halt_poll_ns_grow)
+ if (val == 0 && grow)
val = 10000;
else
- val *= halt_poll_ns_grow;
+ val *= grow;
+
+ if (val > halt_poll_ns)
+ val = halt_poll_ns;
vcpu->halt_poll_ns = val;
trace_kvm_halt_poll_ns_grow(vcpu->vcpu_id, val, old);
static void shrink_halt_poll_ns(struct kvm_vcpu *vcpu)
{
- int old, val;
+ unsigned int old, val, shrink;
old = val = vcpu->halt_poll_ns;
- if (halt_poll_ns_shrink == 0)
+ shrink = READ_ONCE(halt_poll_ns_shrink);
+ if (shrink == 0)
val = 0;
else
- val /= halt_poll_ns_shrink;
+ val /= shrink;
vcpu->halt_poll_ns = val;
trace_kvm_halt_poll_ns_shrink(vcpu->vcpu_id, val, old);
void kvm_vcpu_block(struct kvm_vcpu *vcpu)
{
ktime_t start, cur;
- DEFINE_WAIT(wait);
+ DECLARE_SWAITQUEUE(wait);
bool waited = false;
u64 block_ns;
kvm_arch_vcpu_blocking(vcpu);
for (;;) {
- prepare_to_wait(&vcpu->wq, &wait, TASK_INTERRUPTIBLE);
+ prepare_to_swait(&vcpu->wq, &wait, TASK_INTERRUPTIBLE);
if (kvm_vcpu_check_block(vcpu) < 0)
break;
schedule();
}
- finish_wait(&vcpu->wq, &wait);
+ finish_swait(&vcpu->wq, &wait);
cur = ktime_get();
kvm_arch_vcpu_unblocking(vcpu);
{
int me;
int cpu = vcpu->cpu;
- wait_queue_head_t *wqp;
+ struct swait_queue_head *wqp;
wqp = kvm_arch_vcpu_wq(vcpu);
- if (waitqueue_active(wqp)) {
- wake_up_interruptible(wqp);
+ if (swait_active(wqp)) {
+ swake_up(wqp);
++vcpu->stat.halt_wakeup;
}
continue;
if (vcpu == me)
continue;
- if (waitqueue_active(&vcpu->wq) && !kvm_arch_vcpu_runnable(vcpu))
+ if (swait_active(&vcpu->wq) && !kvm_arch_vcpu_runnable(vcpu))
continue;
if (!kvm_vcpu_eligible_for_directed_yield(vcpu))
continue;