From: Linus Torvalds Date: Tue, 17 May 2016 02:17:22 +0000 (-0700) Subject: Merge tag 'pm-4.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm X-Git-Tag: v4.7-rc1~171 X-Git-Url: https://git.kernel.dk/?a=commitdiff_plain;h=d57d39431924d1628ac9b93a2de7f806fc80680a;hp=-c;p=linux-2.6-block.git Merge tag 'pm-4.7-rc1' of git://git./linux/kernel/git/rafael/linux-pm Pull power management updates from Rafael Wysocki: "The majority of changes go into the cpufreq subsystem this time. To me, quite obviously, the biggest ticket item is the new "schedutil" governor. Interestingly enough, it's the first new cpufreq governor since the beginning of the git era (except for some out-of-the-tree ones). There are two main differences between it and the existing governors. First, it uses the information provided by the scheduler directly for making its decisions, so it doesn't have to track anything by itself. Second, it can invoke drivers (supporting that feature) to adjust CPU performance right away without having to spawn work items to be executed in process context or similar. Currently, the acpi-cpufreq driver is the only one supporting that mode of operation, but then it is used on a large number of systems. The "schedutil" governor as included here is very simple and mostly regarded as a foundation for future work on the integration of the scheduler with CPU power management (in fact, there is work in progress on top of it already). Nevertheless it works and the preliminary results obtained with it are encouraging. There also is some consolidation of CPU frequency management for ARM platforms that can add their machine IDs the the new stub dt-platdev driver now and that will take care of creating the requisite platform device for cpufreq-dt, so it is not necessary to do that in platform code any more. Several ARM platforms are switched over to using this generic mechanism. In addition to that, the intel_pstate driver is now going to respect CPU frequency limits set by the platform firmware (or a BMC) and provided via the ACPI _PPC object. The devfreq subsystem is getting a new "passive" governor for SoCs subsystems that will depend on somebody else to manage their voltage rails and its support for Samsung Exynos SoCs is consolidated. The rest is support for new hardware (Intel Broxton support in intel_idle for one example), bug fixes, optimizations and cleanups in a number of places. Specifics: - New cpufreq "schedutil" governor (making decisions based on CPU utilization information provided by the scheduler and capable of switching CPU frequencies right away if the underlying driver supports that) and support for fast frequency switching in the acpi-cpufreq driver (Rafael Wysocki) - Consolidation of CPU frequency management on ARM platforms allowing them to get rid of some platform-specific boilerplate code if they are going to use the cpufreq-dt driver (Viresh Kumar, Finley Xiao, Marc Gonzalez) - Support for ACPI _PPC and CPU frequency limits in the intel_pstate driver (Srinivas Pandruvada) - Fixes and cleanups in the cpufreq core and generic governor code (Rafael Wysocki, Sai Gurrappadi) - intel_pstate driver optimizations and cleanups (Rafael Wysocki, Philippe Longepe, Chen Yu, Joe Perches) - cpufreq powernv driver fixes and cleanups (Akshay Adiga, Shilpasri Bhat) - cpufreq qoriq driver fixes and cleanups (Jia Hongtao) - ACPI cpufreq driver cleanups (Viresh Kumar) - Assorted cpufreq driver updates (Ashwin Chaugule, Geliang Tang, Javier Martinez Canillas, Paul Gortmaker, Sudeep Holla) - Assorted cpufreq fixes and cleanups (Joe Perches, Arnd Bergmann) - Fixes and cleanups in the OPP (Operating Performance Points) framework, mostly related to OPP sharing, and reorganization of OF-dependent code in it (Viresh Kumar, Arnd Bergmann, Sudeep Holla) - New "passive" governor for devfreq (for SoC subsystems that will rely on someone else for the management of their power resources) and consolidation of devfreq support for Exynos platforms, coding style and typo fixes for devfreq (Chanwoo Choi, MyungJoo Ham) - PM core fixes and cleanups, mostly to make it work better with the generic power domains (genpd) framework, and updates for that framework (Ulf Hansson, Thierry Reding, Colin Ian King) - Intel Broxton support for the intel_idle driver (Len Brown) - cpuidle core optimization and fix (Daniel Lezcano, Dave Gerlach) - ARM cpuidle cleanups (Jisheng Zhang) - Intel Kabylake support for the RAPL power capping driver (Jacob Pan) - AVS (Adaptive Voltage Switching) rockchip-io driver update (Heiko Stuebner) - Updates for the cpupower tool (Arjun Sreedharan, Colin Ian King, Mattia Dongili, Thomas Renninger)" * tag 'pm-4.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (112 commits) intel_pstate: Clean up get_target_pstate_use_performance() intel_pstate: Use sample.core_avg_perf in get_avg_pstate() intel_pstate: Clarify average performance computation intel_pstate: Avoid unnecessary synchronize_sched() during initialization cpufreq: schedutil: Make default depend on CONFIG_SMP cpufreq: powernv: del_timer_sync when global and local pstate are equal cpufreq: powernv: Move smp_call_function_any() out of irq safe block intel_pstate: Clean up intel_pstate_get() cpufreq: schedutil: Make it depend on CONFIG_SMP cpufreq: governor: Fix handling of special cases in dbs_update() PM / OPP: Move CONFIG_OF dependent code in a separate file cpufreq: intel_pstate: Ignore _PPC processing under HWP cpufreq: arm_big_little: use generic OPP functions for {init, free}_opp_table PM / OPP: add non-OF versions of dev_pm_opp_{cpumask_, }remove_table cpufreq: tango: Use generic platdev driver PM / OPP: pass cpumask by reference cpufreq: Fix GOV_LIMITS handling for the userspace governor cpupower: fix potential memory leak PM / devfreq: style/typo fixes PM / devfreq: exynos: Add the detailed correlation for Exynos5422 bus .. --- d57d39431924d1628ac9b93a2de7f806fc80680a diff --combined Documentation/kernel-parameters.txt index 09126c2307d2,52292b28b291..1b322c0c8dff --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@@ -131,7 -131,6 +131,7 @@@ parameter is applicable More X86-64 boot options can be found in Documentation/x86/x86_64/boot-options.txt . X86 Either 32-bit or 64-bit x86 (same as X86-32+X86-64) + X86_UV SGI UV support is enabled. XEN Xen support is enabled In addition, the following text indicates that the option: @@@ -168,18 -167,16 +168,18 @@@ bytes respectively. Such letter suffixe acpi= [HW,ACPI,X86,ARM64] Advanced Configuration and Power Interface - Format: { force | off | strict | noirq | rsdt | + Format: { force | on | off | strict | noirq | rsdt | copy_dsdt } force -- enable ACPI if default was off + on -- enable ACPI but allow fallback to DT [arm64] off -- disable ACPI if default was on noirq -- do not use ACPI for IRQ routing strict -- Be less tolerant of platforms that are not strictly ACPI specification compliant. rsdt -- prefer RSDT over (default) XSDT copy_dsdt -- copy DSDT to memory - For ARM64, ONLY "acpi=off" or "acpi=force" are available + For ARM64, ONLY "acpi=off", "acpi=on" or "acpi=force" + are available See also Documentation/power/runtime_pm.txt, pci=noacpi @@@ -545,13 -542,6 +545,13 @@@ Format: (must be >=0) Default: 64 + bau= [X86_UV] Enable the BAU on SGI UV. The default + behavior is to disable the BAU (i.e. bau=0). + Format: { "0" | "1" } + 0 - Disable the BAU. + 1 - Enable the BAU. + unset - Disable the BAU. + baycom_epp= [HW,AX25] Format: , @@@ -1671,6 -1661,11 +1671,11 @@@ hwp_only Only load intel_pstate on systems which support hardware P state control (HWP) if available. + support_acpi_ppc + Enforce ACPI _PPC performance limits. If the Fixed ACPI + Description Table, specifies preferred power management + profile as "Enterprise Server" or "Performance Server", + then this feature is turned on by default. intremap= [X86-64, Intel-IOMMU] on enable Interrupt Remapping (default) @@@ -3294,44 -3289,6 +3299,44 @@@ Lazy RCU callbacks are those which RCU can prove do nothing more than free memory. + rcuperf.gp_exp= [KNL] + Measure performance of expedited synchronous + grace-period primitives. + + rcuperf.holdoff= [KNL] + Set test-start holdoff period. The purpose of + this parameter is to delay the start of the + test until boot completes in order to avoid + interference. + + rcuperf.nreaders= [KNL] + Set number of RCU readers. The value -1 selects + N, where N is the number of CPUs. A value + "n" less than -1 selects N-n+1, where N is again + the number of CPUs. For example, -2 selects N + (the number of CPUs), -3 selects N+1, and so on. + A value of "n" less than or equal to -N selects + a single reader. + + rcuperf.nwriters= [KNL] + Set number of RCU writers. The values operate + the same as for rcuperf.nreaders. + N, where N is the number of CPUs + + rcuperf.perf_runnable= [BOOT] + Start rcuperf running at boot time. + + rcuperf.shutdown= [KNL] + Shut the system down after performance tests + complete. This is useful for hands-off automated + testing. + + rcuperf.perf_type= [KNL] + Specify the RCU implementation to test. + + rcuperf.verbose= [KNL] + Enable additional printk() statements. + rcutorture.cbflood_inter_holdoff= [KNL] Set holdoff time (jiffies) between successive callback-flood tests. diff --combined MAINTAINERS index d2b9b80ffe30,945b395f4bb1..f40d40d42030 --- a/MAINTAINERS +++ b/MAINTAINERS @@@ -1322,6 -1322,7 +1322,7 @@@ F: drivers/rtc/rtc-armada38x. F: arch/arm/boot/dts/armada* F: arch/arm/boot/dts/kirkwood* F: arch/arm64/boot/dts/marvell/armada* + F: drivers/cpufreq/mvebu-cpufreq.c ARM/Marvell Berlin SoC support @@@ -3539,6 -3540,15 +3540,15 @@@ F: drivers/devfreq/devfreq-event. F: include/linux/devfreq-event.h F: Documentation/devicetree/bindings/devfreq/event/ + BUS FREQUENCY DRIVER FOR SAMSUNG EXYNOS + M: Chanwoo Choi + L: linux-pm@vger.kernel.org + L: linux-samsung-soc@vger.kernel.org + T: git git://git.kernel.org/pub/scm/linux/kernel/git/mzx/devfreq.git + S: Maintained + F: drivers/devfreq/exynos-bus.c + F: Documentation/devicetree/bindings/devfreq/exynos-bus.txt + DEVICE NUMBER REGISTRY M: Torben Mathiasen W: http://lanana.org/docs/device-list/index.html @@@ -7020,9 -7030,9 +7030,9 @@@ M: Chanwoo Choi L: linux-kernel@vger.kernel.org S: Supported -F: drivers/*/max14577.c +F: drivers/*/max14577*.c F: drivers/*/max77686*.c -F: drivers/*/max77693.c +F: drivers/*/max77693*.c F: drivers/extcon/extcon-max14577.c F: drivers/extcon/extcon-max77693.c F: drivers/rtc/rtc-max77686.c @@@ -11246,13 -11256,14 +11256,13 @@@ S: Maintaine F: drivers/media/i2c/tc358743* F: include/media/i2c/tc358743.h -TMIO MMC DRIVER -M: Ian Molton +TMIO/SDHI MMC DRIVER +M: Wolfram Sang L: linux-mmc@vger.kernel.org -S: Maintained +S: Supported F: drivers/mmc/host/tmio_mmc* F: drivers/mmc/host/sh_mobile_sdhi.c -F: include/linux/mmc/tmio.h -F: include/linux/mmc/sh_mobile_sdhi.h +F: include/linux/mfd/tmio.h TMP401 HARDWARE MONITOR DRIVER M: Guenter Roeck @@@ -11317,20 -11328,6 +11327,20 @@@ F: include/trace F: kernel/trace/ F: tools/testing/selftests/ftrace/ +TRACING MMIO ACCESSES (MMIOTRACE) +M: Steven Rostedt +M: Ingo Molnar +R: Karol Herbst +R: Pekka Paalanen +S: Maintained +L: linux-kernel@vger.kernel.org +L: nouveau@lists.freedesktop.org +F: kernel/trace/trace_mmiotrace.c +F: include/linux/mmiotrace.h +F: arch/x86/mm/kmmio.c +F: arch/x86/mm/mmio-mod.c +F: arch/x86/mm/testmmiotrace.c + TRIVIAL PATCHES M: Jiri Kosina T: git git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial.git @@@ -12010,9 -12007,7 +12020,9 @@@ L: linux-kernel@vger.kernel.or W: http://www.slimlogic.co.uk/?p=48 T: git git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator.git S: Supported +F: Documentation/devicetree/bindings/regulator/ F: drivers/regulator/ +F: include/dt-bindings/regulator/ F: include/linux/regulator/ VRF diff --combined drivers/cpufreq/longhaul.c index 247bfa8eaddb,beae5cf5c62c..c46a12df40dd --- a/drivers/cpufreq/longhaul.c +++ b/drivers/cpufreq/longhaul.c @@@ -21,6 -21,8 +21,8 @@@ * BIG FAT DISCLAIMER: Work in progress code. Possibly *dangerous* */ + #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt + #include #include #include @@@ -40,8 -42,6 +42,6 @@@ #include "longhaul.h" - #define PFX "longhaul: " - #define TYPE_LONGHAUL_V1 1 #define TYPE_LONGHAUL_V2 2 #define TYPE_POWERSAVER 3 @@@ -347,14 -347,13 +347,13 @@@ retry_loop freqs.new = calc_speed(longhaul_get_cpu_mult()); /* Check if requested frequency is set. */ if (unlikely(freqs.new != speed)) { - printk(KERN_INFO PFX "Failed to set requested frequency!\n"); + pr_info("Failed to set requested frequency!\n"); /* Revision ID = 1 but processor is expecting revision key * equal to 0. Jumpers at the bottom of processor will change * multiplier and FSB, but will not change bits in Longhaul * MSR nor enable voltage scaling. */ if (!revid_errata) { - printk(KERN_INFO PFX "Enabling \"Ignore Revision ID\" " - "option.\n"); + pr_info("Enabling \"Ignore Revision ID\" option\n"); revid_errata = 1; msleep(200); goto retry_loop; @@@ -364,11 -363,10 +363,10 @@@ * but it doesn't change frequency. I tried poking various * bits in northbridge registers, but without success. */ if (longhaul_flags & USE_ACPI_C3) { - printk(KERN_INFO PFX "Disabling ACPI C3 support.\n"); + pr_info("Disabling ACPI C3 support\n"); longhaul_flags &= ~USE_ACPI_C3; if (revid_errata) { - printk(KERN_INFO PFX "Disabling \"Ignore " - "Revision ID\" option.\n"); + pr_info("Disabling \"Ignore Revision ID\" option\n"); revid_errata = 0; } msleep(200); @@@ -379,7 -377,7 +377,7 @@@ * RevID = 1. RevID errata will make things right. Just * to be 100% sure. */ if (longhaul_version == TYPE_LONGHAUL_V2) { - printk(KERN_INFO PFX "Switching to Longhaul ver. 1\n"); + pr_info("Switching to Longhaul ver. 1\n"); longhaul_version = TYPE_LONGHAUL_V1; msleep(200); goto retry_loop; @@@ -387,8 -385,7 +385,7 @@@ } if (!bm_timeout) { - printk(KERN_INFO PFX "Warning: Timeout while waiting for " - "idle PCI bus.\n"); + pr_info("Warning: Timeout while waiting for idle PCI bus\n"); return -EBUSY; } @@@ -433,12 -430,12 +430,12 @@@ static int longhaul_get_ranges(void /* Get current frequency */ mult = longhaul_get_cpu_mult(); if (mult == -1) { - printk(KERN_INFO PFX "Invalid (reserved) multiplier!\n"); + pr_info("Invalid (reserved) multiplier!\n"); return -EINVAL; } fsb = guess_fsb(mult); if (fsb == 0) { - printk(KERN_INFO PFX "Invalid (reserved) FSB!\n"); + pr_info("Invalid (reserved) FSB!\n"); return -EINVAL; } /* Get max multiplier - as we always did. @@@ -468,11 -465,11 +465,11 @@@ print_speed(highest_speed/1000)); if (lowest_speed == highest_speed) { - printk(KERN_INFO PFX "highestspeed == lowest, aborting.\n"); + pr_info("highestspeed == lowest, aborting\n"); return -EINVAL; } if (lowest_speed > highest_speed) { - printk(KERN_INFO PFX "nonsense! lowest (%d > %d) !\n", + pr_info("nonsense! lowest (%d > %d) !\n", lowest_speed, highest_speed); return -EINVAL; } @@@ -538,16 -535,16 +535,16 @@@ static void longhaul_setup_voltagescali rdmsrl(MSR_VIA_LONGHAUL, longhaul.val); if (!(longhaul.bits.RevisionID & 1)) { - printk(KERN_INFO PFX "Voltage scaling not supported by CPU.\n"); + pr_info("Voltage scaling not supported by CPU\n"); return; } if (!longhaul.bits.VRMRev) { - printk(KERN_INFO PFX "VRM 8.5\n"); + pr_info("VRM 8.5\n"); vrm_mV_table = &vrm85_mV[0]; mV_vrm_table = &mV_vrm85[0]; } else { - printk(KERN_INFO PFX "Mobile VRM\n"); + pr_info("Mobile VRM\n"); if (cpu_model < CPU_NEHEMIAH) return; vrm_mV_table = &mobilevrm_mV[0]; @@@ -558,27 -555,21 +555,21 @@@ maxvid = vrm_mV_table[longhaul.bits.MaximumVID]; if (minvid.mV == 0 || maxvid.mV == 0 || minvid.mV > maxvid.mV) { - printk(KERN_INFO PFX "Bogus values Min:%d.%03d Max:%d.%03d. " - "Voltage scaling disabled.\n", - minvid.mV/1000, minvid.mV%1000, - maxvid.mV/1000, maxvid.mV%1000); + pr_info("Bogus values Min:%d.%03d Max:%d.%03d - Voltage scaling disabled\n", + minvid.mV/1000, minvid.mV%1000, + maxvid.mV/1000, maxvid.mV%1000); return; } if (minvid.mV == maxvid.mV) { - printk(KERN_INFO PFX "Claims to support voltage scaling but " - "min & max are both %d.%03d. " - "Voltage scaling disabled\n", - maxvid.mV/1000, maxvid.mV%1000); + pr_info("Claims to support voltage scaling but min & max are both %d.%03d - Voltage scaling disabled\n", + maxvid.mV/1000, maxvid.mV%1000); return; } /* How many voltage steps*/ numvscales = maxvid.pos - minvid.pos + 1; - printk(KERN_INFO PFX - "Max VID=%d.%03d " - "Min VID=%d.%03d, " - "%d possible voltage scales\n", + pr_info("Max VID=%d.%03d Min VID=%d.%03d, %d possible voltage scales\n", maxvid.mV/1000, maxvid.mV%1000, minvid.mV/1000, minvid.mV%1000, numvscales); @@@ -617,12 -608,12 +608,12 @@@ pos = minvid.pos; freq_pos->driver_data |= mV_vrm_table[pos] << 8; vid = vrm_mV_table[mV_vrm_table[pos]]; - printk(KERN_INFO PFX "f: %d kHz, index: %d, vid: %d mV\n", + pr_info("f: %d kHz, index: %d, vid: %d mV\n", speed, (int)(freq_pos - longhaul_table), vid.mV); } can_scale_voltage = 1; - printk(KERN_INFO PFX "Voltage scaling enabled.\n"); + pr_info("Voltage scaling enabled\n"); } @@@ -720,8 -711,7 +711,7 @@@ static int enable_arbiter_disable(void pci_write_config_byte(dev, reg, pci_cmd); pci_read_config_byte(dev, reg, &pci_cmd); if (!(pci_cmd & 1<<7)) { - printk(KERN_ERR PFX - "Can't enable access to port 0x22.\n"); + pr_err("Can't enable access to port 0x22\n"); status = 0; } } @@@ -758,8 -748,7 +748,7 @@@ static int longhaul_setup_southbridge(v if (pci_cmd & 1 << 7) { pci_read_config_dword(dev, 0x88, &acpi_regs_addr); acpi_regs_addr &= 0xff00; - printk(KERN_INFO PFX "ACPI I/O at 0x%x\n", - acpi_regs_addr); + pr_info("ACPI I/O at 0x%x\n", acpi_regs_addr); } pci_dev_put(dev); @@@ -853,14 -842,14 +842,14 @@@ static int longhaul_cpu_init(struct cpu longhaul_version = TYPE_LONGHAUL_V1; } - printk(KERN_INFO PFX "VIA %s CPU detected. ", cpuname); + pr_info("VIA %s CPU detected. ", cpuname); switch (longhaul_version) { case TYPE_LONGHAUL_V1: case TYPE_LONGHAUL_V2: - printk(KERN_CONT "Longhaul v%d supported.\n", longhaul_version); + pr_cont("Longhaul v%d supported\n", longhaul_version); break; case TYPE_POWERSAVER: - printk(KERN_CONT "Powersaver supported.\n"); + pr_cont("Powersaver supported\n"); break; }; @@@ -889,15 -878,14 +878,14 @@@ if (!(longhaul_flags & USE_ACPI_C3 || longhaul_flags & USE_NORTHBRIDGE) && ((pr == NULL) || !(pr->flags.bm_control))) { - printk(KERN_ERR PFX - "No ACPI support. Unsupported northbridge.\n"); + pr_err("No ACPI support: Unsupported northbridge\n"); return -ENODEV; } if (longhaul_flags & USE_NORTHBRIDGE) - printk(KERN_INFO PFX "Using northbridge support.\n"); + pr_info("Using northbridge support\n"); if (longhaul_flags & USE_ACPI_C3) - printk(KERN_INFO PFX "Using ACPI support.\n"); + pr_info("Using ACPI support\n"); ret = longhaul_get_ranges(); if (ret != 0) @@@ -934,20 -922,18 +922,18 @@@ static int __init longhaul_init(void return -ENODEV; if (!enable) { - printk(KERN_ERR PFX "Option \"enable\" not set. Aborting.\n"); + pr_err("Option \"enable\" not set - Aborting\n"); return -ENODEV; } #ifdef CONFIG_SMP if (num_online_cpus() > 1) { - printk(KERN_ERR PFX "More than 1 CPU detected, " - "longhaul disabled.\n"); + pr_err("More than 1 CPU detected, longhaul disabled\n"); return -ENODEV; } #endif #ifdef CONFIG_X86_IO_APIC - if (cpu_has_apic) { + if (boot_cpu_has(X86_FEATURE_APIC)) { - printk(KERN_ERR PFX "APIC detected. Longhaul is currently " - "broken in this configuration.\n"); + pr_err("APIC detected. Longhaul is currently broken in this configuration.\n"); return -ENODEV; } #endif @@@ -955,7 -941,7 +941,7 @@@ case 6 ... 9: return cpufreq_register_driver(&longhaul_driver); case 10: - printk(KERN_ERR PFX "Use acpi-cpufreq driver for VIA C7\n"); + pr_err("Use acpi-cpufreq driver for VIA C7\n"); default: ; } diff --combined drivers/powercap/intel_rapl.c index f2201d42a9cd,470bb622fd8c..b2766b867b0e --- a/drivers/powercap/intel_rapl.c +++ b/drivers/powercap/intel_rapl.c @@@ -34,9 -34,6 +34,9 @@@ #include #include +/* Local defines */ +#define MSR_PLATFORM_POWER_LIMIT 0x0000065C + /* bitmasks for RAPL MSRs, used by primitive access functions */ #define ENERGY_STATUS_MASK 0xffffffff @@@ -89,7 -86,6 +89,7 @@@ enum rapl_domain_type RAPL_DOMAIN_PP0, /* core power plane */ RAPL_DOMAIN_PP1, /* graphics uncore */ RAPL_DOMAIN_DRAM,/* DRAM control_type */ + RAPL_DOMAIN_PLATFORM, /* PSys control_type */ RAPL_DOMAIN_MAX, }; @@@ -255,11 -251,9 +255,11 @@@ static const char * const rapl_domain_n "core", "uncore", "dram", + "psys", }; static struct powercap_control_type *control_type; /* PowerCap Controller */ +static struct rapl_domain *platform_rapl_domain; /* Platform (PSys) domain */ /* caller to ensure CPU hotplug lock is held */ static struct rapl_package *find_package_by_id(int id) @@@ -415,14 -409,6 +415,14 @@@ static const struct powercap_zone_ops z .set_enable = set_domain_enable, .get_enable = get_domain_enable, }, + /* RAPL_DOMAIN_PLATFORM */ + { + .get_energy_uj = get_energy_counter, + .get_max_energy_range_uj = get_max_energy_counter, + .release = release_zone, + .set_enable = set_domain_enable, + .get_enable = get_domain_enable, + }, }; static int set_power_limit(struct powercap_zone *power_zone, int id, @@@ -1115,6 -1101,8 +1115,8 @@@ static const struct x86_cpu_id rapl_ids RAPL_CPU(0X5C, rapl_defaults_core),/* Broxton */ RAPL_CPU(0x5E, rapl_defaults_core),/* Skylake-H/S */ RAPL_CPU(0x57, rapl_defaults_hsw_server),/* Knights Landing */ + RAPL_CPU(0x8E, rapl_defaults_core),/* Kabylake */ + RAPL_CPU(0x9E, rapl_defaults_core),/* Kabylake */ {} }; MODULE_DEVICE_TABLE(x86cpu, rapl_ids); @@@ -1174,13 -1162,6 +1176,13 @@@ static int rapl_unregister_powercap(voi powercap_unregister_zone(control_type, &rd_package->power_zone); } + + if (platform_rapl_domain) { + powercap_unregister_zone(control_type, + &platform_rapl_domain->power_zone); + kfree(platform_rapl_domain); + } + powercap_unregister_control_type(control_type); return 0; @@@ -1260,47 -1241,6 +1262,47 @@@ err_cleanup return ret; } +static int rapl_register_psys(void) +{ + struct rapl_domain *rd; + struct powercap_zone *power_zone; + u64 val; + + if (rdmsrl_safe_on_cpu(0, MSR_PLATFORM_ENERGY_STATUS, &val) || !val) + return -ENODEV; + + if (rdmsrl_safe_on_cpu(0, MSR_PLATFORM_POWER_LIMIT, &val) || !val) + return -ENODEV; + + rd = kzalloc(sizeof(*rd), GFP_KERNEL); + if (!rd) + return -ENOMEM; + + rd->name = rapl_domain_names[RAPL_DOMAIN_PLATFORM]; + rd->id = RAPL_DOMAIN_PLATFORM; + rd->msrs[0] = MSR_PLATFORM_POWER_LIMIT; + rd->msrs[1] = MSR_PLATFORM_ENERGY_STATUS; + rd->rpl[0].prim_id = PL1_ENABLE; + rd->rpl[0].name = pl1_name; + rd->rpl[1].prim_id = PL2_ENABLE; + rd->rpl[1].name = pl2_name; + rd->rp = find_package_by_id(0); + + power_zone = powercap_register_zone(&rd->power_zone, control_type, + "psys", NULL, + &zone_ops[RAPL_DOMAIN_PLATFORM], + 2, &constraint_ops); + + if (IS_ERR(power_zone)) { + kfree(rd); + return PTR_ERR(power_zone); + } + + platform_rapl_domain = rd; + + return 0; +} + static int rapl_register_powercap(void) { struct rapl_domain *rd; @@@ -1317,10 -1257,6 +1319,10 @@@ list_for_each_entry(rp, &rapl_packages, plist) if (rapl_package_register_powercap(rp)) goto err_cleanup_package; + + /* Don't bail out if PSys is not supported */ + rapl_register_psys(); + return ret; err_cleanup_package: @@@ -1355,9 -1291,6 +1357,9 @@@ static int rapl_check_domain(int cpu, i case RAPL_DOMAIN_DRAM: msr = MSR_DRAM_ENERGY_STATUS; break; + case RAPL_DOMAIN_PLATFORM: + /* PSYS(PLATFORM) is not a CPU domain, so avoid printng error */ + return -EINVAL; default: pr_err("invalid domain id %d\n", domain); return -EINVAL; diff --combined include/linux/sched.h index 6cc0df970f1a,8344e1947eec..31bd0d97d178 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@@ -40,6 -40,7 +40,6 @@@ struct sched_param #include #include #include -#include #include #include #include @@@ -177,11 -178,9 +177,11 @@@ extern void get_iowait_load(unsigned lo extern void calc_global_load(unsigned long ticks); #if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ_COMMON) -extern void update_cpu_load_nohz(int active); +extern void cpu_load_update_nohz_start(void); +extern void cpu_load_update_nohz_stop(void); #else -static inline void update_cpu_load_nohz(int active) { } +static inline void cpu_load_update_nohz_start(void) { } +static inline void cpu_load_update_nohz_stop(void) { } #endif extern void dump_cpu_task(int cpu); @@@ -373,15 -372,6 +373,15 @@@ extern void cpu_init (void) extern void trap_init(void); extern void update_process_times(int user); extern void scheduler_tick(void); +extern int sched_cpu_starting(unsigned int cpu); +extern int sched_cpu_activate(unsigned int cpu); +extern int sched_cpu_deactivate(unsigned int cpu); + +#ifdef CONFIG_HOTPLUG_CPU +extern int sched_cpu_dying(unsigned int cpu); +#else +# define sched_cpu_dying NULL +#endif extern void sched_show_task(struct task_struct *p); @@@ -944,20 -934,10 +944,20 @@@ enum cpu_idle_type CPU_MAX_IDLE_TYPES }; +/* + * Integer metrics need fixed point arithmetic, e.g., sched/fair + * has a few: load, load_avg, util_avg, freq, and capacity. + * + * We define a basic fixed point arithmetic range, and then formalize + * all these metrics based on that basic range. + */ +# define SCHED_FIXEDPOINT_SHIFT 10 +# define SCHED_FIXEDPOINT_SCALE (1L << SCHED_FIXEDPOINT_SHIFT) + /* * Increase resolution of cpu_capacity calculations */ -#define SCHED_CAPACITY_SHIFT 10 +#define SCHED_CAPACITY_SHIFT SCHED_FIXEDPOINT_SHIFT #define SCHED_CAPACITY_SCALE (1L << SCHED_CAPACITY_SHIFT) /* @@@ -1219,56 -1199,18 +1219,56 @@@ struct load_weight }; /* - * The load_avg/util_avg accumulates an infinite geometric series. - * 1) load_avg factors frequency scaling into the amount of time that a - * sched_entity is runnable on a rq into its weight. For cfs_rq, it is the - * aggregated such weights of all runnable and blocked sched_entities. - * 2) util_avg factors frequency and cpu scaling into the amount of time - * that a sched_entity is running on a CPU, in the range [0..SCHED_LOAD_SCALE]. - * For cfs_rq, it is the aggregated such times of all runnable and + * The load_avg/util_avg accumulates an infinite geometric series + * (see __update_load_avg() in kernel/sched/fair.c). + * + * [load_avg definition] + * + * load_avg = runnable% * scale_load_down(load) + * + * where runnable% is the time ratio that a sched_entity is runnable. + * For cfs_rq, it is the aggregated load_avg of all runnable and * blocked sched_entities. - * The 64 bit load_sum can: - * 1) for cfs_rq, afford 4353082796 (=2^64/47742/88761) entities with - * the highest weight (=88761) always runnable, we should not overflow - * 2) for entity, support any load.weight always runnable + * + * load_avg may also take frequency scaling into account: + * + * load_avg = runnable% * scale_load_down(load) * freq% + * + * where freq% is the CPU frequency normalized to the highest frequency. + * + * [util_avg definition] + * + * util_avg = running% * SCHED_CAPACITY_SCALE + * + * where running% is the time ratio that a sched_entity is running on + * a CPU. For cfs_rq, it is the aggregated util_avg of all runnable + * and blocked sched_entities. + * + * util_avg may also factor frequency scaling and CPU capacity scaling: + * + * util_avg = running% * SCHED_CAPACITY_SCALE * freq% * capacity% + * + * where freq% is the same as above, and capacity% is the CPU capacity + * normalized to the greatest capacity (due to uarch differences, etc). + * + * N.B., the above ratios (runnable%, running%, freq%, and capacity%) + * themselves are in the range of [0, 1]. To do fixed point arithmetics, + * we therefore scale them to as large a range as necessary. This is for + * example reflected by util_avg's SCHED_CAPACITY_SCALE. + * + * [Overflow issue] + * + * The 64-bit load_sum can have 4353082796 (=2^64/47742/88761) entities + * with the highest load (=88761), always runnable on a single cfs_rq, + * and should not overflow as the number already hits PID_MAX_LIMIT. + * + * For all other cases (including 32-bit kernels), struct load_weight's + * weight will overflow first before we do, because: + * + * Max(load_avg) <= Max(load.weight) + * + * Then it is the load_weight's responsibility to consider overflow + * issues. */ struct sched_avg { u64 last_update_time, load_sum; @@@ -1654,7 -1596,6 +1654,7 @@@ struct task_struct unsigned long sas_ss_sp; size_t sas_ss_size; + unsigned sas_ss_flags; struct callback_head *task_works; @@@ -1930,11 -1871,6 +1930,11 @@@ extern int arch_task_struct_size __read /* Future-safe accessor for struct task_struct's cpus_allowed. */ #define tsk_cpus_allowed(tsk) (&(tsk)->cpus_allowed) +static inline int tsk_nr_cpus_allowed(struct task_struct *p) +{ + return p->nr_cpus_allowed; +} + #define TNF_MIGRATED 0x01 #define TNF_NO_GROUP 0x02 #define TNF_SHARED 0x04 @@@ -2367,6 -2303,8 +2367,6 @@@ extern unsigned long long notrace sched /* * See the comment in kernel/sched/clock.c */ -extern u64 cpu_clock(int cpu); -extern u64 local_clock(void); extern u64 running_clock(void); extern u64 sched_clock_cpu(int cpu); @@@ -2385,16 -2323,6 +2385,16 @@@ static inline void sched_clock_idle_sle static inline void sched_clock_idle_wakeup_event(u64 delta_ns) { } + +static inline u64 cpu_clock(int cpu) +{ + return sched_clock(); +} + +static inline u64 local_clock(void) +{ + return sched_clock(); +} #else /* * Architectures can set this to 1 if they have specified @@@ -2409,26 -2337,6 +2409,26 @@@ extern void clear_sched_clock_stable(vo extern void sched_clock_tick(void); extern void sched_clock_idle_sleep_event(void); extern void sched_clock_idle_wakeup_event(u64 delta_ns); + +/* + * As outlined in clock.c, provides a fast, high resolution, nanosecond + * time source that is monotonic per cpu argument and has bounded drift + * between cpus. + * + * ######################### BIG FAT WARNING ########################## + * # when comparing cpu_clock(i) to cpu_clock(j) for i != j, time can # + * # go backwards !! # + * #################################################################### + */ +static inline u64 cpu_clock(int cpu) +{ + return sched_clock_cpu(cpu); +} + +static inline u64 local_clock(void) +{ + return sched_clock_cpu(raw_smp_processor_id()); +} #endif #ifdef CONFIG_IRQ_TIME_ACCOUNTING @@@ -2667,18 -2575,6 +2667,18 @@@ static inline int kill_cad_pid(int sig */ static inline int on_sig_stack(unsigned long sp) { + /* + * If the signal stack is SS_AUTODISARM then, by construction, we + * can't be on the signal stack unless user code deliberately set + * SS_AUTODISARM when we were already on it. + * + * This improves reliability: if user state gets corrupted such that + * the stack pointer points very close to the end of the signal stack, + * then this check will enable the signal to be handled anyway. + */ + if (current->sas_ss_flags & SS_AUTODISARM) + return 0; + #ifdef CONFIG_STACK_GROWSUP return sp >= current->sas_ss_sp && sp - current->sas_ss_sp < current->sas_ss_size; @@@ -2696,13 -2592,6 +2696,13 @@@ static inline int sas_ss_flags(unsigne return on_sig_stack(sp) ? SS_ONSTACK : 0; } +static inline void sas_ss_reset(struct task_struct *p) +{ + p->sas_ss_sp = 0; + p->sas_ss_size = 0; + p->sas_ss_flags = SS_DISABLE; +} + static inline unsigned long sigsp(unsigned long sp, struct ksignal *ksig) { if (unlikely((ksig->ka.sa.sa_flags & SA_ONSTACK)) && ! sas_ss_flags(sp)) @@@ -3351,7 -3240,10 +3351,10 @@@ struct update_util_data u64 time, unsigned long util, unsigned long max); }; - void cpufreq_set_update_util_data(int cpu, struct update_util_data *data); + void cpufreq_add_update_util_hook(int cpu, struct update_util_data *data, + void (*func)(struct update_util_data *data, u64 time, + unsigned long util, unsigned long max)); + void cpufreq_remove_update_util_hook(int cpu); #endif /* CONFIG_CPU_FREQ */ #endif diff --combined kernel/sched/sched.h index e51145e76807,921d6e5d33b7..72f1f3087b04 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@@ -31,9 -31,9 +31,9 @@@ extern void calc_global_load_tick(struc extern long calc_load_fold_active(struct rq *this_rq); #ifdef CONFIG_SMP -extern void update_cpu_load_active(struct rq *this_rq); +extern void cpu_load_update_active(struct rq *this_rq); #else -static inline void update_cpu_load_active(struct rq *this_rq) { } +static inline void cpu_load_update_active(struct rq *this_rq) { } #endif /* @@@ -49,32 -49,25 +49,32 @@@ * and does not change the user-interface for setting shares/weights. * * We increase resolution only if we have enough bits to allow this increased - * resolution (i.e. BITS_PER_LONG > 32). The costs for increasing resolution - * when BITS_PER_LONG <= 32 are pretty high and the returns do not justify the - * increased costs. + * resolution (i.e. 64bit). The costs for increasing resolution when 32bit are + * pretty high and the returns do not justify the increased costs. + * + * Really only required when CONFIG_FAIR_GROUP_SCHED is also set, but to + * increase coverage and consistency always enable it on 64bit platforms. */ -#if 0 /* BITS_PER_LONG > 32 -- currently broken: it increases power usage under light load */ -# define SCHED_LOAD_RESOLUTION 10 -# define scale_load(w) ((w) << SCHED_LOAD_RESOLUTION) -# define scale_load_down(w) ((w) >> SCHED_LOAD_RESOLUTION) +#ifdef CONFIG_64BIT +# define NICE_0_LOAD_SHIFT (SCHED_FIXEDPOINT_SHIFT + SCHED_FIXEDPOINT_SHIFT) +# define scale_load(w) ((w) << SCHED_FIXEDPOINT_SHIFT) +# define scale_load_down(w) ((w) >> SCHED_FIXEDPOINT_SHIFT) #else -# define SCHED_LOAD_RESOLUTION 0 +# define NICE_0_LOAD_SHIFT (SCHED_FIXEDPOINT_SHIFT) # define scale_load(w) (w) # define scale_load_down(w) (w) #endif -#define SCHED_LOAD_SHIFT (10 + SCHED_LOAD_RESOLUTION) -#define SCHED_LOAD_SCALE (1L << SCHED_LOAD_SHIFT) - -#define NICE_0_LOAD SCHED_LOAD_SCALE -#define NICE_0_SHIFT SCHED_LOAD_SHIFT +/* + * Task weight (visible to users) and its load (invisible to users) have + * independent resolution, but they should be well calibrated. We use + * scale_load() and scale_load_down(w) to convert between them. The + * following must be true: + * + * scale_load(sched_prio_to_weight[USER_PRIO(NICE_TO_PRIO(0))]) == NICE_0_LOAD + * + */ +#define NICE_0_LOAD (1L << NICE_0_LOAD_SHIFT) /* * Single value that decides SCHED_DEADLINE internal math precision. @@@ -592,13 -585,11 +592,13 @@@ struct rq #endif #define CPU_LOAD_IDX_MAX 5 unsigned long cpu_load[CPU_LOAD_IDX_MAX]; - unsigned long last_load_update_tick; #ifdef CONFIG_NO_HZ_COMMON +#ifdef CONFIG_SMP + unsigned long last_load_update_tick; +#endif /* CONFIG_SMP */ u64 nohz_stamp; unsigned long nohz_flags; -#endif +#endif /* CONFIG_NO_HZ_COMMON */ #ifdef CONFIG_NO_HZ_FULL unsigned long last_sched_tick; #endif @@@ -863,7 -854,7 +863,7 @@@ DECLARE_PER_CPU(struct sched_domain *, struct sched_group_capacity { atomic_t ref; /* - * CPU capacity of this group, SCHED_LOAD_SCALE being max capacity + * CPU capacity of this group, SCHED_CAPACITY_SCALE being max capacity * for a single CPU. */ unsigned int capacity; @@@ -1168,7 -1159,7 +1168,7 @@@ extern const u32 sched_prio_to_wmult[40 * * ENQUEUE_HEAD - place at front of runqueue (tail if not specified) * ENQUEUE_REPLENISH - CBS (replenish runtime and postpone deadline) - * ENQUEUE_WAKING - sched_class::task_waking was called + * ENQUEUE_MIGRATED - the task was migrated during wakeup * */ @@@ -1183,9 -1174,9 +1183,9 @@@ #define ENQUEUE_HEAD 0x08 #define ENQUEUE_REPLENISH 0x10 #ifdef CONFIG_SMP -#define ENQUEUE_WAKING 0x20 +#define ENQUEUE_MIGRATED 0x20 #else -#define ENQUEUE_WAKING 0x00 +#define ENQUEUE_MIGRATED 0x00 #endif #define RETRY_TASK ((void *)-1UL) @@@ -1209,14 -1200,14 +1209,14 @@@ struct sched_class * tasks. */ struct task_struct * (*pick_next_task) (struct rq *rq, - struct task_struct *prev); + struct task_struct *prev, + struct pin_cookie cookie); void (*put_prev_task) (struct rq *rq, struct task_struct *p); #ifdef CONFIG_SMP int (*select_task_rq)(struct task_struct *p, int task_cpu, int sd_flag, int flags); void (*migrate_task_rq)(struct task_struct *p); - void (*task_waking) (struct task_struct *task); void (*task_woken) (struct rq *this_rq, struct task_struct *task); void (*set_cpus_allowed)(struct task_struct *p, @@@ -1322,7 -1313,6 +1322,7 @@@ extern void init_dl_task_timer(struct s unsigned long to_ratio(u64 period, u64 runtime); extern void init_entity_runnable_average(struct sched_entity *se); +extern void post_init_entity_util_avg(struct sched_entity *se); #ifdef CONFIG_NO_HZ_FULL extern bool sched_can_stop_tick(struct rq *rq); @@@ -1458,32 -1448,86 +1458,32 @@@ static inline void sched_rt_avg_update( static inline void sched_avg_update(struct rq *rq) { } #endif -/* - * __task_rq_lock - lock the rq @p resides on. - */ -static inline struct rq *__task_rq_lock(struct task_struct *p) - __acquires(rq->lock) -{ - struct rq *rq; - - lockdep_assert_held(&p->pi_lock); - - for (;;) { - rq = task_rq(p); - raw_spin_lock(&rq->lock); - if (likely(rq == task_rq(p) && !task_on_rq_migrating(p))) { - lockdep_pin_lock(&rq->lock); - return rq; - } - raw_spin_unlock(&rq->lock); - - while (unlikely(task_on_rq_migrating(p))) - cpu_relax(); - } -} +struct rq_flags { + unsigned long flags; + struct pin_cookie cookie; +}; -/* - * task_rq_lock - lock p->pi_lock and lock the rq @p resides on. - */ -static inline struct rq *task_rq_lock(struct task_struct *p, unsigned long *flags) +struct rq *__task_rq_lock(struct task_struct *p, struct rq_flags *rf) + __acquires(rq->lock); +struct rq *task_rq_lock(struct task_struct *p, struct rq_flags *rf) __acquires(p->pi_lock) - __acquires(rq->lock) -{ - struct rq *rq; - - for (;;) { - raw_spin_lock_irqsave(&p->pi_lock, *flags); - rq = task_rq(p); - raw_spin_lock(&rq->lock); - /* - * move_queued_task() task_rq_lock() - * - * ACQUIRE (rq->lock) - * [S] ->on_rq = MIGRATING [L] rq = task_rq() - * WMB (__set_task_cpu()) ACQUIRE (rq->lock); - * [S] ->cpu = new_cpu [L] task_rq() - * [L] ->on_rq - * RELEASE (rq->lock) - * - * If we observe the old cpu in task_rq_lock, the acquire of - * the old rq->lock will fully serialize against the stores. - * - * If we observe the new cpu in task_rq_lock, the acquire will - * pair with the WMB to ensure we must then also see migrating. - */ - if (likely(rq == task_rq(p) && !task_on_rq_migrating(p))) { - lockdep_pin_lock(&rq->lock); - return rq; - } - raw_spin_unlock(&rq->lock); - raw_spin_unlock_irqrestore(&p->pi_lock, *flags); - - while (unlikely(task_on_rq_migrating(p))) - cpu_relax(); - } -} + __acquires(rq->lock); -static inline void __task_rq_unlock(struct rq *rq) +static inline void __task_rq_unlock(struct rq *rq, struct rq_flags *rf) __releases(rq->lock) { - lockdep_unpin_lock(&rq->lock); + lockdep_unpin_lock(&rq->lock, rf->cookie); raw_spin_unlock(&rq->lock); } static inline void -task_rq_unlock(struct rq *rq, struct task_struct *p, unsigned long *flags) +task_rq_unlock(struct rq *rq, struct task_struct *p, struct rq_flags *rf) __releases(rq->lock) __releases(p->pi_lock) { - lockdep_unpin_lock(&rq->lock); + lockdep_unpin_lock(&rq->lock, rf->cookie); raw_spin_unlock(&rq->lock); - raw_spin_unlock_irqrestore(&p->pi_lock, *flags); + raw_spin_unlock_irqrestore(&p->pi_lock, rf->flags); } #ifdef CONFIG_SMP @@@ -1699,10 -1743,6 +1699,10 @@@ enum rq_nohz_flag_bits }; #define nohz_flags(cpu) (&cpu_rq(cpu)->nohz_flags) + +extern void nohz_balance_exit_idle(unsigned int cpu); +#else +static inline void nohz_balance_exit_idle(unsigned int cpu) { } #endif #ifdef CONFIG_IRQ_TIME_ACCOUNTING @@@ -1802,6 -1842,14 +1802,14 @@@ static inline void cpufreq_update_util( static inline void cpufreq_trigger_update(u64 time) {} #endif /* CONFIG_CPU_FREQ */ + #ifdef arch_scale_freq_capacity + #ifndef arch_scale_freq_invariant + #define arch_scale_freq_invariant() (true) + #endif + #else /* arch_scale_freq_capacity */ + #define arch_scale_freq_invariant() (false) + #endif + static inline void account_reset_rq(struct rq *rq) { #ifdef CONFIG_IRQ_TIME_ACCOUNTING