From patchwork Mon Apr 22 13:26:57 2019
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Mauro Carvalho Chehab
X-Patchwork-Id: 1088702
X-Patchwork-Delegate: davem@davemloft.net
From: Mauro Carvalho Chehab
To: Linux Doc Mailing List
Cc: Mauro Carvalho Chehab, Mauro Carvalho Chehab,
    linux-kernel@vger.kernel.org, Jonathan Corbet, Thomas Gleixner,
    Ingo Molnar, Borislav Petkov, "H. Peter Anvin", x86@kernel.org,
    Jens Axboe, Tejun Heo, Li Zefan, Johannes Weiner,
    Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
    Yonghong Song, James Morris, "Serge E. Hallyn",
    linux-block@vger.kernel.org, cgroups@vger.kernel.org,
    netdev@vger.kernel.org, bpf@vger.kernel.org,
    linux-security-module@vger.kernel.org
Subject: [PATCH v2 08/79] docs: cgroup-v1: convert docs to ReST and rename to *.rst
Date: Mon, 22 Apr 2019 10:26:57 -0300
X-Mailer: git-send-email 2.20.1

Convert the cgroup-v1 files to ReST format, in order to
allow a later addition to the admin-guide.

The conversion is actually:
  - add blank lines and indentation in order to identify paragraphs;
  - fix tables markups;
  - add some lists markups;
  - mark literal blocks;
  - adjust title markups.

At its new index.rst, let's add a :orphan: while this is not linked to
the main index.rst file, in order to avoid build warnings.

Signed-off-by: Mauro Carvalho Chehab
Acked-by: Tejun Heo
---
 .../admin-guide/kernel-parameters.txt         |   4 +-
 Documentation/admin-guide/l1tf.rst            |   2 +-
 .../admin-guide/mm/numa_memory_policy.rst     |   2 +-
 Documentation/block/bfq-iosched.txt           |   2 +-
 ...io-controller.txt => blkio-controller.rst} |  96 ++--
 .../cgroup-v1/{cgroups.txt => cgroups.rst}    | 184 +++----
 .../cgroup-v1/{cpuacct.txt => cpuacct.rst}    |  15 +-
 .../cgroup-v1/{cpusets.txt => cpusets.rst}    | 205 ++++----
 .../cgroup-v1/{devices.txt => devices.rst}    |  40 +-
 ...er-subsystem.txt => freezer-subsystem.rst} |  14 +-
 .../cgroup-v1/{hugetlb.txt => hugetlb.rst}    |  31 +-
 Documentation/cgroup-v1/index.rst             |  30 ++
 .../{memcg_test.txt => memcg_test.rst}        | 261 ++++++----
 .../cgroup-v1/{memory.txt => memory.rst}      | 449 +++++++++++-------
 .../cgroup-v1/{net_cls.txt => net_cls.rst}    |  37 +-
 .../cgroup-v1/{net_prio.txt => net_prio.rst}  |  24 +-
 .../cgroup-v1/{pids.txt => pids.rst}          |  78 +--
 .../cgroup-v1/{rdma.txt => rdma.rst}          |  66 +--
 Documentation/filesystems/tmpfs.txt           |   2 +-
 Documentation/scheduler/sched-deadline.txt    |   2 +-
 Documentation/scheduler/sched-design-CFS.txt  |   2 +-
 Documentation/scheduler/sched-rt-group.txt    |   2 +-
 Documentation/vm/numa.rst                     |   4 +-
 Documentation/vm/page_migration.rst           |   2 +-
 Documentation/vm/unevictable-lru.rst          |   2 +-
 .../x86/x86_64/fake-numa-for-cpusets          |   4 +-
 MAINTAINERS                                   |   2 +-
 block/Kconfig                                 |   2 +-
 include/linux/cgroup-defs.h                   |   2 +-
 include/uapi/linux/bpf.h                      |   2 +-
 init/Kconfig                                  |   2 +-
 kernel/cgroup/cpuset.c                        |   2 +-
 security/device_cgroup.c                      |   2 +-
 tools/include/uapi/linux/bpf.h                |   2 +-
 34 files changed, 947 insertions(+), 629 deletions(-)
 rename Documentation/cgroup-v1/{blkio-controller.txt => blkio-controller.rst} (90%)
 rename Documentation/cgroup-v1/{cgroups.txt => cgroups.rst} (88%)
 rename Documentation/cgroup-v1/{cpuacct.txt => cpuacct.rst} (90%)
 rename Documentation/cgroup-v1/{cpusets.txt => cpusets.rst} (90%)
 rename Documentation/cgroup-v1/{devices.txt => devices.rst} (88%)
 rename
Documentation/cgroup-v1/{freezer-subsystem.txt => freezer-subsystem.rst} (95%) rename Documentation/cgroup-v1/{hugetlb.txt => hugetlb.rst} (74%) create mode 100644 Documentation/cgroup-v1/index.rst rename Documentation/cgroup-v1/{memcg_test.txt => memcg_test.rst} (62%) rename Documentation/cgroup-v1/{memory.txt => memory.rst} (71%) rename Documentation/cgroup-v1/{net_cls.txt => net_cls.rst} (50%) rename Documentation/cgroup-v1/{net_prio.txt => net_prio.rst} (71%) rename Documentation/cgroup-v1/{pids.txt => pids.rst} (62%) rename Documentation/cgroup-v1/{rdma.txt => rdma.rst} (79%) diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 308af3b62f8d..0376e7e7dfa3 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -4010,7 +4010,7 @@ relax_domain_level= [KNL, SMP] Set scheduler's default relax_domain_level. - See Documentation/cgroup-v1/cpusets.txt. + See Documentation/cgroup-v1/cpusets.rst. reserve= [KNL,BUGS] Force kernel to ignore I/O ports or memory Format: ,[,,,...] @@ -4520,7 +4520,7 @@ swapaccount=[0|1] [KNL] Enable accounting of swap in memory resource controller if no parameter or 1 is given or disable - it if 0 is given (See Documentation/cgroup-v1/memory.txt) + it if 0 is given (See Documentation/cgroup-v1/memory.rst) swiotlb= [ARM,IA-64,PPC,MIPS,X86] Format: { | force | noforce } diff --git a/Documentation/admin-guide/l1tf.rst b/Documentation/admin-guide/l1tf.rst index 9af977384168..f5b2a54a0dc2 100644 --- a/Documentation/admin-guide/l1tf.rst +++ b/Documentation/admin-guide/l1tf.rst @@ -241,7 +241,7 @@ Guest mitigation mechanisms For further information about confining guests to a single or to a group of cores consult the cpusets documentation: - https://www.kernel.org/doc/Documentation/cgroup-v1/cpusets.txt + https://www.kernel.org/doc/Documentation/cgroup-v1/cpusets.rst .. 
_interrupt_isolation: diff --git a/Documentation/admin-guide/mm/numa_memory_policy.rst b/Documentation/admin-guide/mm/numa_memory_policy.rst index d78c5b315f72..546f174e5d6a 100644 --- a/Documentation/admin-guide/mm/numa_memory_policy.rst +++ b/Documentation/admin-guide/mm/numa_memory_policy.rst @@ -15,7 +15,7 @@ document attempts to describe the concepts and APIs of the 2.6 memory policy support. Memory policies should not be confused with cpusets -(``Documentation/cgroup-v1/cpusets.txt``) +(``Documentation/cgroup-v1/cpusets.rst``) which is an administrative mechanism for restricting the nodes from which memory may be allocated by a set of processes. Memory policies are a programming interface that a NUMA-aware application can take advantage of. When diff --git a/Documentation/block/bfq-iosched.txt b/Documentation/block/bfq-iosched.txt index 1a0f2ac02eb6..b2265cf6c9c3 100644 --- a/Documentation/block/bfq-iosched.txt +++ b/Documentation/block/bfq-iosched.txt @@ -539,7 +539,7 @@ As for cgroups-v1 (blkio controller), the exact set of stat files created, and kept up-to-date by bfq, depends on whether CONFIG_DEBUG_BLK_CGROUP is set. If it is set, then bfq creates all the stat files documented in -Documentation/cgroup-v1/blkio-controller.txt. If, instead, +Documentation/cgroup-v1/blkio-controller.rst. 
If, instead, CONFIG_DEBUG_BLK_CGROUP is not set, then bfq creates only the files blkio.bfq.io_service_bytes blkio.bfq.io_service_bytes_recursive diff --git a/Documentation/cgroup-v1/blkio-controller.txt b/Documentation/cgroup-v1/blkio-controller.rst similarity index 90% rename from Documentation/cgroup-v1/blkio-controller.txt rename to Documentation/cgroup-v1/blkio-controller.rst index 673dc34d3f78..2c1b907afc14 100644 --- a/Documentation/cgroup-v1/blkio-controller.txt +++ b/Documentation/cgroup-v1/blkio-controller.rst @@ -1,5 +1,7 @@ - Block IO Controller - =================== +=================== +Block IO Controller +=================== + Overview ======== cgroup subsys "blkio" implements the block io controller. There seems to be @@ -22,28 +24,35 @@ Proportional Weight division of bandwidth You can do a very simple testing of running two dd threads in two different cgroups. Here is what you can do. -- Enable Block IO controller +- Enable Block IO controller:: + CONFIG_BLK_CGROUP=y -- Enable group scheduling in CFQ +- Enable group scheduling in CFQ: + + CONFIG_CFQ_GROUP_IOSCHED=y - Compile and boot into kernel and mount IO controller (blkio); see cgroups.txt, Why are cgroups needed?. + :: + mount -t tmpfs cgroup_root /sys/fs/cgroup mkdir /sys/fs/cgroup/blkio mount -t cgroup -o blkio none /sys/fs/cgroup/blkio -- Create two cgroups +- Create two cgroups:: + mkdir -p /sys/fs/cgroup/blkio/test1/ /sys/fs/cgroup/blkio/test2 -- Set weights of group test1 and test2 +- Set weights of group test1 and test2:: + echo 1000 > /sys/fs/cgroup/blkio/test1/blkio.weight echo 500 > /sys/fs/cgroup/blkio/test2/blkio.weight - Create two same size files (say 512MB each) on same disk (file1, file2) and - launch two dd threads in different cgroup to read those files. + launch two dd threads in different cgroup to read those files:: sync echo 3 > /proc/sys/vm/drop_caches @@ -65,24 +74,27 @@ cgroups. Here is what you can do. 
Throttling/Upper Limit policy ----------------------------- -- Enable Block IO controller +- Enable Block IO controller:: + CONFIG_BLK_CGROUP=y -- Enable throttling in block layer +- Enable throttling in block layer:: + CONFIG_BLK_DEV_THROTTLING=y -- Mount blkio controller (see cgroups.txt, Why are cgroups needed?) +- Mount blkio controller (see cgroups.txt, Why are cgroups needed?):: + mount -t cgroup -o blkio none /sys/fs/cgroup/blkio - Specify a bandwidth rate on particular device for root group. The format - for policy is ": ". + for policy is ": ":: echo "8:16 1048576" > /sys/fs/cgroup/blkio/blkio.throttle.read_bps_device Above will put a limit of 1MB/second on reads happening for root group on device having major/minor number 8:16. -- Run dd to read a file and see if rate is throttled to 1MB/s or not. +- Run dd to read a file and see if rate is throttled to 1MB/s or not:: # dd iflag=direct if=/mnt/common/zerofile of=/dev/null bs=4K count=1024 1024+0 records in @@ -99,7 +111,7 @@ throttling's hierarchy support is enabled iff "sane_behavior" is enabled from cgroup side, which currently is a development option and not publicly available. -If somebody created a hierarchy like as follows. +If somebody created a hierarchy like as follows:: root / \ @@ -115,7 +127,7 @@ directly generated by tasks in that cgroup. Throttling without "sane_behavior" enabled from cgroup side will practically treat all groups at same level as if it looks like the -following. +following:: pivot / / \ \ @@ -152,27 +164,31 @@ Proportional weight policy files These rules override the default value of group weight as specified by blkio.weight. - Following is the format. 
+ Following is the format:: - # echo dev_maj:dev_minor weight > blkio.weight_device - Configure weight=300 on /dev/sdb (8:16) in this cgroup - # echo 8:16 300 > blkio.weight_device - # cat blkio.weight_device - dev weight - 8:16 300 + # echo dev_maj:dev_minor weight > blkio.weight_device - Configure weight=500 on /dev/sda (8:0) in this cgroup - # echo 8:0 500 > blkio.weight_device - # cat blkio.weight_device - dev weight - 8:0 500 - 8:16 300 + Configure weight=300 on /dev/sdb (8:16) in this cgroup:: - Remove specific weight for /dev/sda in this cgroup - # echo 8:0 0 > blkio.weight_device - # cat blkio.weight_device - dev weight - 8:16 300 + # echo 8:16 300 > blkio.weight_device + # cat blkio.weight_device + dev weight + 8:16 300 + + Configure weight=500 on /dev/sda (8:0) in this cgroup:: + + # echo 8:0 500 > blkio.weight_device + # cat blkio.weight_device + dev weight + 8:0 500 + 8:16 300 + + Remove specific weight for /dev/sda in this cgroup:: + + # echo 8:0 0 > blkio.weight_device + # cat blkio.weight_device + dev weight + 8:16 300 - blkio.leaf_weight[_device] - Equivalents of blkio.weight[_device] for the purpose of @@ -297,30 +313,30 @@ Throttling/Upper limit policy files - blkio.throttle.read_bps_device - Specifies upper limit on READ rate from the device. IO rate is specified in bytes per second. Rules are per device. Following is - the format. + the format:: - echo ": " > /cgrp/blkio.throttle.read_bps_device + echo ": " > /cgrp/blkio.throttle.read_bps_device - blkio.throttle.write_bps_device - Specifies upper limit on WRITE rate to the device. IO rate is specified in bytes per second. Rules are per device. Following is - the format. + the format:: - echo ": " > /cgrp/blkio.throttle.write_bps_device + echo ": " > /cgrp/blkio.throttle.write_bps_device - blkio.throttle.read_iops_device - Specifies upper limit on READ rate from the device. IO rate is specified in IO per second. Rules are per device. Following is - the format. 
+ the format:: - echo ": " > /cgrp/blkio.throttle.read_iops_device + echo ": " > /cgrp/blkio.throttle.read_iops_device - blkio.throttle.write_iops_device - Specifies upper limit on WRITE rate to the device. IO rate is specified in io per second. Rules are per device. Following is - the format. + the format:: - echo ": " > /cgrp/blkio.throttle.write_iops_device + echo ": " > /cgrp/blkio.throttle.write_iops_device Note: If both BW and IOPS rules are specified for a device, then IO is subjected to both the constraints. diff --git a/Documentation/cgroup-v1/cgroups.txt b/Documentation/cgroup-v1/cgroups.rst similarity index 88% rename from Documentation/cgroup-v1/cgroups.txt rename to Documentation/cgroup-v1/cgroups.rst index 059f7063eea6..46bbe7e022d4 100644 --- a/Documentation/cgroup-v1/cgroups.txt +++ b/Documentation/cgroup-v1/cgroups.rst @@ -1,35 +1,39 @@ - CGROUPS - ------- +============== +Control Groups +============== Written by Paul Menage based on -Documentation/cgroup-v1/cpusets.txt +Documentation/cgroup-v1/cpusets.rst Original copyright statements from cpusets.txt: + Portions Copyright (C) 2004 BULL SA. + Portions Copyright (c) 2004-2006 Silicon Graphics, Inc. + Modified by Paul Jackson + Modified by Christoph Lameter -CONTENTS: -========= +.. CONTENTS: -1. Control Groups - 1.1 What are cgroups ? - 1.2 Why are cgroups needed ? - 1.3 How are cgroups implemented ? - 1.4 What does notify_on_release do ? - 1.5 What does clone_children do ? - 1.6 How do I use cgroups ? -2. Usage Examples and Syntax - 2.1 Basic Usage - 2.2 Attaching processes - 2.3 Mounting hierarchies by name -3. Kernel API - 3.1 Overview - 3.2 Synchronization - 3.3 Subsystem API -4. Extended attributes usage -5. Questions + 1. Control Groups + 1.1 What are cgroups ? + 1.2 Why are cgroups needed ? + 1.3 How are cgroups implemented ? + 1.4 What does notify_on_release do ? + 1.5 What does clone_children do ? + 1.6 How do I use cgroups ? + 2. 
Usage Examples and Syntax + 2.1 Basic Usage + 2.2 Attaching processes + 2.3 Mounting hierarchies by name + 3. Kernel API + 3.1 Overview + 3.2 Synchronization + 3.3 Subsystem API + 4. Extended attributes usage + 5. Questions 1. Control Groups ================= @@ -72,7 +76,7 @@ On their own, the only use for cgroups is for simple job tracking. The intention is that other subsystems hook into the generic cgroup support to provide new attributes for cgroups, such as accounting/limiting the resources which processes in a cgroup can -access. For example, cpusets (see Documentation/cgroup-v1/cpusets.txt) allow +access. For example, cpusets (see Documentation/cgroup-v1/cpusets.rst) allow you to associate a set of CPUs and a set of memory nodes with the tasks in each cgroup. @@ -108,7 +112,7 @@ As an example of a scenario (originally proposed by vatsa@in.ibm.com) that can benefit from multiple hierarchies, consider a large university server with various users - students, professors, system tasks etc. The resource planning for this server could be along the -following lines: +following lines:: CPU : "Top cpuset" / \ @@ -136,7 +140,7 @@ depending on who launched it (prof/student). With the ability to classify tasks differently for different resources (by putting those resource subsystems in different hierarchies), the admin can easily set up a script which receives exec notifications -and depending on who is launching the browser he can +and depending on who is launching the browser he can:: # echo browser_pid > /sys/fs/cgroup///tasks @@ -151,7 +155,7 @@ wants to do online gaming :)) OR give one of the student's simulation apps enhanced CPU power. With ability to write PIDs directly to resource classes, it's just a -matter of: +matter of:: # echo pid > /sys/fs/cgroup/network//tasks (after some time) @@ -306,7 +310,7 @@ configuration from the parent during initialization. 
-------------------------- To start a new job that is to be contained within a cgroup, using -the "cpuset" cgroup subsystem, the steps are something like: +the "cpuset" cgroup subsystem, the steps are something like:: 1) mount -t tmpfs cgroup_root /sys/fs/cgroup 2) mkdir /sys/fs/cgroup/cpuset @@ -320,7 +324,7 @@ the "cpuset" cgroup subsystem, the steps are something like: For example, the following sequence of commands will setup a cgroup named "Charlie", containing just CPUs 2 and 3, and Memory Node 1, -and then start a subshell 'sh' in that cgroup: +and then start a subshell 'sh' in that cgroup:: mount -t tmpfs cgroup_root /sys/fs/cgroup mkdir /sys/fs/cgroup/cpuset @@ -345,8 +349,9 @@ and then start a subshell 'sh' in that cgroup: Creating, modifying, using cgroups can be done through the cgroup virtual filesystem. -To mount a cgroup hierarchy with all available subsystems, type: -# mount -t cgroup xxx /sys/fs/cgroup +To mount a cgroup hierarchy with all available subsystems, type:: + + # mount -t cgroup xxx /sys/fs/cgroup The "xxx" is not interpreted by the cgroup code, but will appear in /proc/mounts so may be any useful identifying string that you like. @@ -355,18 +360,19 @@ Note: Some subsystems do not work without some user input first. For instance, if cpusets are enabled the user will have to populate the cpus and mems files for each new cgroup created before that group can be used. -As explained in section `1.2 Why are cgroups needed?' you should create +As explained in section `1.2 Why are cgroups needed?` you should create different hierarchies of cgroups for each single resource or group of resources you want to control. Therefore, you should mount a tmpfs on /sys/fs/cgroup and create directories for each cgroup resource or resource -group. 
+group:: -# mount -t tmpfs cgroup_root /sys/fs/cgroup -# mkdir /sys/fs/cgroup/rg1 + # mount -t tmpfs cgroup_root /sys/fs/cgroup + # mkdir /sys/fs/cgroup/rg1 To mount a cgroup hierarchy with just the cpuset and memory -subsystems, type: -# mount -t cgroup -o cpuset,memory hier1 /sys/fs/cgroup/rg1 +subsystems, type:: + + # mount -t cgroup -o cpuset,memory hier1 /sys/fs/cgroup/rg1 While remounting cgroups is currently supported, it is not recommend to use it. Remounting allows changing bound subsystems and @@ -375,9 +381,10 @@ hierarchy is empty and release_agent itself should be replaced with conventional fsnotify. The support for remounting will be removed in the future. -To Specify a hierarchy's release_agent: -# mount -t cgroup -o cpuset,release_agent="/sbin/cpuset_release_agent" \ - xxx /sys/fs/cgroup/rg1 +To Specify a hierarchy's release_agent:: + + # mount -t cgroup -o cpuset,release_agent="/sbin/cpuset_release_agent" \ + xxx /sys/fs/cgroup/rg1 Note that specifying 'release_agent' more than once will return failure. @@ -390,32 +397,39 @@ Then under /sys/fs/cgroup/rg1 you can find a tree that corresponds to the tree of the cgroups in the system. For instance, /sys/fs/cgroup/rg1 is the cgroup that holds the whole system. -If you want to change the value of release_agent: -# echo "/sbin/new_release_agent" > /sys/fs/cgroup/rg1/release_agent +If you want to change the value of release_agent:: + + # echo "/sbin/new_release_agent" > /sys/fs/cgroup/rg1/release_agent It can also be changed via remount. -If you want to create a new cgroup under /sys/fs/cgroup/rg1: -# cd /sys/fs/cgroup/rg1 -# mkdir my_cgroup +If you want to create a new cgroup under /sys/fs/cgroup/rg1:: -Now you want to do something with this cgroup. 
-# cd my_cgroup + # cd /sys/fs/cgroup/rg1 + # mkdir my_cgroup -In this directory you can find several files: -# ls -cgroup.procs notify_on_release tasks -(plus whatever files added by the attached subsystems) +Now you want to do something with this cgroup: -Now attach your shell to this cgroup: -# /bin/echo $$ > tasks + # cd my_cgroup + +In this directory you can find several files:: + + # ls + cgroup.procs notify_on_release tasks + (plus whatever files added by the attached subsystems) + +Now attach your shell to this cgroup:: + + # /bin/echo $$ > tasks You can also create cgroups inside your cgroup by using mkdir in this -directory. -# mkdir my_sub_cs +directory:: -To remove a cgroup, just use rmdir: -# rmdir my_sub_cs + # mkdir my_sub_cs + +To remove a cgroup, just use rmdir:: + + # rmdir my_sub_cs This will fail if the cgroup is in use (has cgroups inside, or has processes attached, or is held alive by other subsystem-specific @@ -424,19 +438,21 @@ reference). 2.2 Attaching processes ----------------------- -# /bin/echo PID > tasks +:: + + # /bin/echo PID > tasks Note that it is PID, not PIDs. You can only attach ONE task at a time. -If you have several tasks to attach, you have to do it one after another: +If you have several tasks to attach, you have to do it one after another:: -# /bin/echo PID1 > tasks -# /bin/echo PID2 > tasks - ... -# /bin/echo PIDn > tasks + # /bin/echo PID1 > tasks + # /bin/echo PID2 > tasks + ... + # /bin/echo PIDn > tasks -You can attach the current shell task by echoing 0: +You can attach the current shell task by echoing 0:: -# echo 0 > tasks + # echo 0 > tasks You can use the cgroup.procs file instead of the tasks file to move all threads in a threadgroup at once. Echoing the PID of any task in a @@ -529,7 +545,7 @@ Each subsystem may export the following methods. The only mandatory methods are css_alloc/free. Any others that are null are presumed to be successful no-ops. 
-struct cgroup_subsys_state *css_alloc(struct cgroup *cgrp) +``struct cgroup_subsys_state *css_alloc(struct cgroup *cgrp)`` (cgroup_mutex held by caller) Called to allocate a subsystem state object for a cgroup. The @@ -544,7 +560,7 @@ identified by the passed cgroup object having a NULL parent (since it's the root of the hierarchy) and may be an appropriate place for initialization code. -int css_online(struct cgroup *cgrp) +``int css_online(struct cgroup *cgrp)`` (cgroup_mutex held by caller) Called after @cgrp successfully completed all allocations and made @@ -554,7 +570,7 @@ callback can be used to implement reliable state sharing and propagation along the hierarchy. See the comment on cgroup_for_each_descendant_pre() for details. -void css_offline(struct cgroup *cgrp); +``void css_offline(struct cgroup *cgrp);`` (cgroup_mutex held by caller) This is the counterpart of css_online() and called iff css_online() @@ -564,7 +580,7 @@ all references it's holding on @cgrp. When all references are dropped, cgroup removal will proceed to the next step - css_free(). After this callback, @cgrp should be considered dead to the subsystem. -void css_free(struct cgroup *cgrp) +``void css_free(struct cgroup *cgrp)`` (cgroup_mutex held by caller) The cgroup system is about to free @cgrp; the subsystem should free @@ -573,7 +589,7 @@ is completely unused; @cgrp->parent is still valid. (Note - can also be called for a newly-created cgroup if an error occurs after this subsystem's create() method has been called for the new cgroup). -int can_attach(struct cgroup *cgrp, struct cgroup_taskset *tset) +``int can_attach(struct cgroup *cgrp, struct cgroup_taskset *tset)`` (cgroup_mutex held by caller) Called prior to moving one or more tasks into a cgroup; if the @@ -594,7 +610,7 @@ fork. If this method returns 0 (success) then this should remain valid while the caller holds cgroup_mutex and it is ensured that either attach() or cancel_attach() will be called in future. 
-void css_reset(struct cgroup_subsys_state *css) +``void css_reset(struct cgroup_subsys_state *css)`` (cgroup_mutex held by caller) An optional operation which should restore @css's configuration to the @@ -608,7 +624,7 @@ This prevents unexpected resource control from a hidden css and ensures that the configuration is in the initial state when it is made visible again later. -void cancel_attach(struct cgroup *cgrp, struct cgroup_taskset *tset) +``void cancel_attach(struct cgroup *cgrp, struct cgroup_taskset *tset)`` (cgroup_mutex held by caller) Called when a task attach operation has failed after can_attach() has succeeded. @@ -617,26 +633,26 @@ function, so that the subsystem can implement a rollback. If not, not necessary. This will be called only about subsystems whose can_attach() operation have succeeded. The parameters are identical to can_attach(). -void attach(struct cgroup *cgrp, struct cgroup_taskset *tset) +``void attach(struct cgroup *cgrp, struct cgroup_taskset *tset)`` (cgroup_mutex held by caller) Called after the task has been attached to the cgroup, to allow any post-attachment activity that requires memory allocations or blocking. The parameters are identical to can_attach(). -void fork(struct task_struct *task) +``void fork(struct task_struct *task)`` Called when a task is forked into a cgroup. -void exit(struct task_struct *task) +``void exit(struct task_struct *task)`` Called during task exit. -void free(struct task_struct *task) +``void free(struct task_struct *task)`` Called when the task_struct is freed. -void bind(struct cgroup *root) +``void bind(struct cgroup *root)`` (cgroup_mutex held by caller) Called when a cgroup subsystem is rebound to a different hierarchy @@ -649,6 +665,7 @@ that is being created/destroyed (and hence has no sub-cgroups). cgroup filesystem supports certain types of extended attributes in its directories and files. 
The current supported types are: + - Trusted (XATTR_TRUSTED) - Security (XATTR_SECURITY) @@ -666,12 +683,13 @@ in containers and systemd for assorted meta data like main PID in a cgroup 5. Questions ============ -Q: what's up with this '/bin/echo' ? -A: bash's builtin 'echo' command does not check calls to write() against - errors. If you use it in the cgroup file system, you won't be - able to tell whether a command succeeded or failed. +:: -Q: When I attach processes, only the first of the line gets really attached ! -A: We can only return one error code per call to write(). So you should also - put only ONE PID. + Q: what's up with this '/bin/echo' ? + A: bash's builtin 'echo' command does not check calls to write() against + errors. If you use it in the cgroup file system, you won't be + able to tell whether a command succeeded or failed. + Q: When I attach processes, only the first of the line gets really attached ! + A: We can only return one error code per call to write(). So you should also + put only ONE PID. diff --git a/Documentation/cgroup-v1/cpuacct.txt b/Documentation/cgroup-v1/cpuacct.rst similarity index 90% rename from Documentation/cgroup-v1/cpuacct.txt rename to Documentation/cgroup-v1/cpuacct.rst index 9d73cc0cadb9..d30ed81d2ad7 100644 --- a/Documentation/cgroup-v1/cpuacct.txt +++ b/Documentation/cgroup-v1/cpuacct.rst @@ -1,5 +1,6 @@ +========================= CPU Accounting Controller -------------------------- +========================= The CPU accounting controller is used to group tasks using cgroups and account the CPU usage of these groups of tasks. @@ -8,9 +9,9 @@ The CPU accounting controller supports multi-hierarchy groups. An accounting group accumulates the CPU usage of all of its child groups and the tasks directly present in its group. -Accounting groups can be created by first mounting the cgroup filesystem. 
+Accounting groups can be created by first mounting the cgroup filesystem:: -# mount -t cgroup -ocpuacct none /sys/fs/cgroup + # mount -t cgroup -ocpuacct none /sys/fs/cgroup With the above step, the initial or the parent accounting group becomes visible at /sys/fs/cgroup. At bootup, this group includes all the tasks in @@ -19,11 +20,11 @@ the system. /sys/fs/cgroup/tasks lists the tasks in this cgroup. by this group which is essentially the CPU time obtained by all the tasks in the system. -New accounting groups can be created under the parent group /sys/fs/cgroup. +New accounting groups can be created under the parent group /sys/fs/cgroup:: -# cd /sys/fs/cgroup -# mkdir g1 -# echo $$ > g1/tasks + # cd /sys/fs/cgroup + # mkdir g1 + # echo $$ > g1/tasks The above steps create a new group g1 and move the current shell process (bash) into it. CPU time consumed by this bash and its children diff --git a/Documentation/cgroup-v1/cpusets.txt b/Documentation/cgroup-v1/cpusets.rst similarity index 90% rename from Documentation/cgroup-v1/cpusets.txt rename to Documentation/cgroup-v1/cpusets.rst index 8402dd6de8df..b6a42cdea72b 100644 --- a/Documentation/cgroup-v1/cpusets.txt +++ b/Documentation/cgroup-v1/cpusets.rst @@ -1,35 +1,36 @@ - CPUSETS - ------- +======= +CPUSETS +======= Copyright (C) 2004 BULL SA. + Written by Simon.Derr@bull.net -Portions Copyright (c) 2004-2006 Silicon Graphics, Inc. -Modified by Paul Jackson -Modified by Christoph Lameter -Modified by Paul Menage -Modified by Hidetoshi Seto +- Portions Copyright (c) 2004-2006 Silicon Graphics, Inc. +- Modified by Paul Jackson +- Modified by Christoph Lameter +- Modified by Paul Menage +- Modified by Hidetoshi Seto -CONTENTS: -========= +.. CONTENTS: -1. Cpusets - 1.1 What are cpusets ? - 1.2 Why are cpusets needed ? - 1.3 How are cpusets implemented ? - 1.4 What are exclusive cpusets ? - 1.5 What is memory_pressure ? - 1.6 What is memory spread ? - 1.7 What is sched_load_balance ? 
- 1.8 What is sched_relax_domain_level ? - 1.9 How do I use cpusets ? -2. Usage Examples and Syntax - 2.1 Basic Usage - 2.2 Adding/removing cpus - 2.3 Setting flags - 2.4 Attaching processes -3. Questions -4. Contact + 1. Cpusets + 1.1 What are cpusets ? + 1.2 Why are cpusets needed ? + 1.3 How are cpusets implemented ? + 1.4 What are exclusive cpusets ? + 1.5 What is memory_pressure ? + 1.6 What is memory spread ? + 1.7 What is sched_load_balance ? + 1.8 What is sched_relax_domain_level ? + 1.9 How do I use cpusets ? + 2. Usage Examples and Syntax + 2.1 Basic Usage + 2.2 Adding/removing cpus + 2.3 Setting flags + 2.4 Attaching processes + 3. Questions + 4. Contact 1. Cpusets ========== @@ -48,7 +49,7 @@ hooks, beyond what is already present, required to manage dynamic job placement on large systems. Cpusets use the generic cgroup subsystem described in -Documentation/cgroup-v1/cgroups.txt. +Documentation/cgroup-v1/cgroups.rst. Requests by a task, using the sched_setaffinity(2) system call to include CPUs in its CPU affinity mask, and using the mbind(2) and @@ -157,7 +158,7 @@ modifying cpusets is via this cpuset file system. The /proc//status file for each task has four added lines, displaying the task's cpus_allowed (on which CPUs it may be scheduled) and mems_allowed (on which Memory Nodes it may obtain memory), -in the two formats seen in the following example: +in the two formats seen in the following example:: Cpus_allowed: ffffffff,ffffffff,ffffffff,ffffffff Cpus_allowed_list: 0-127 @@ -181,6 +182,7 @@ files describing that cpuset: - cpuset.sched_relax_domain_level: the searching range when migrating tasks In addition, only the root cpuset has the following file: + - cpuset.memory_pressure_enabled flag: compute memory_pressure? New cpusets are created using the mkdir system call or shell @@ -266,7 +268,8 @@ to monitor a cpuset for signs of memory pressure. It's up to the batch manager or other user code to decide what to do about it and take action. 
-==> Unless this feature is enabled by writing "1" to the special file +==> + Unless this feature is enabled by writing "1" to the special file /dev/cpuset/memory_pressure_enabled, the hook in the rebalance code of __alloc_pages() for this metric reduces to simply noticing that the cpuset_memory_pressure_enabled flag is zero. So only @@ -399,6 +402,7 @@ have tasks running on them unless explicitly assigned. This default load balancing across all CPUs is not well suited for the following two situations: + 1) On large systems, load balancing across many CPUs is expensive. If the system is managed using cpusets to place independent jobs on separate sets of CPUs, full load balancing is unnecessary. @@ -501,6 +505,7 @@ all the CPUs that must be load balanced. The cpuset code builds a new such partition and passes it to the scheduler sched domain setup code, to have the sched domains rebuilt as necessary, whenever: + - the 'cpuset.sched_load_balance' flag of a cpuset with non-empty CPUs changes, - or CPUs come or go from a cpuset with this flag enabled, - or 'cpuset.sched_relax_domain_level' value of a cpuset with non-empty CPUs @@ -553,13 +558,15 @@ this searching range as you like. This file takes int value which indicates size of searching range in levels ideally as follows, otherwise initial value -1 that indicates the cpuset has no request. - -1 : no request. use system default or follow request of others. - 0 : no search. - 1 : search siblings (hyperthreads in a core). - 2 : search cores in a package. - 3 : search cpus in a node [= system wide on non-NUMA system] - 4 : search nodes in a chunk of node [on NUMA system] - 5 : search system wide [on NUMA system] +====== =========================================================== + -1 no request. use system default or follow request of others. + 0 no search. + 1 search siblings (hyperthreads in a core). + 2 search cores in a package. 
+ 3 search cpus in a node [= system wide on non-NUMA system] + 4 search nodes in a chunk of node [on NUMA system] + 5 search system wide [on NUMA system] +====== =========================================================== The system default is architecture dependent. The system default can be changed using the relax_domain_level= boot parameter. @@ -578,13 +585,14 @@ and whether it is acceptable or not depends on your situation. Don't modify this file if you are not sure. If your situation is: + - The migration costs between each cpu can be assumed considerably small(for you) due to your special application's behavior or special hardware support for CPU cache etc. - The searching cost doesn't have impact(for you) or you can make the searching cost enough small by managing cpuset to compact etc. - The latency is required even it sacrifices cache hit rate etc. -then increasing 'sched_relax_domain_level' would benefit you. + then increasing 'sched_relax_domain_level' would benefit you. 1.9 How do I use cpusets ? @@ -678,7 +686,7 @@ To start a new job that is to be contained within a cpuset, the steps are: For example, the following sequence of commands will setup a cpuset named "Charlie", containing just CPUs 2 and 3, and Memory Node 1, -and then start a subshell 'sh' in that cpuset: +and then start a subshell 'sh' in that cpuset:: mount -t cgroup -ocpuset cpuset /sys/fs/cgroup/cpuset cd /sys/fs/cgroup/cpuset @@ -693,6 +701,7 @@ and then start a subshell 'sh' in that cpuset: cat /proc/self/cpuset There are ways to query or modify cpusets: + - via the cpuset file system directly, using the various cd, mkdir, echo, cat, rmdir commands from the shell, or their equivalent from C. - via the C library libcpuset. @@ -722,115 +731,133 @@ Then under /sys/fs/cgroup/cpuset you can find a tree that corresponds to the tree of the cpusets in the system. For instance, /sys/fs/cgroup/cpuset is the cpuset that holds the whole system. 
-If you want to create a new cpuset under /sys/fs/cgroup/cpuset: -# cd /sys/fs/cgroup/cpuset -# mkdir my_cpuset +If you want to create a new cpuset under /sys/fs/cgroup/cpuset:: -Now you want to do something with this cpuset. -# cd my_cpuset + # cd /sys/fs/cgroup/cpuset + # mkdir my_cpuset -In this directory you can find several files: -# ls -cgroup.clone_children cpuset.memory_pressure -cgroup.event_control cpuset.memory_spread_page -cgroup.procs cpuset.memory_spread_slab -cpuset.cpu_exclusive cpuset.mems -cpuset.cpus cpuset.sched_load_balance -cpuset.mem_exclusive cpuset.sched_relax_domain_level -cpuset.mem_hardwall notify_on_release -cpuset.memory_migrate tasks +Now you want to do something with this cpuset:: + + # cd my_cpuset + +In this directory you can find several files:: + + # ls + cgroup.clone_children cpuset.memory_pressure + cgroup.event_control cpuset.memory_spread_page + cgroup.procs cpuset.memory_spread_slab + cpuset.cpu_exclusive cpuset.mems + cpuset.cpus cpuset.sched_load_balance + cpuset.mem_exclusive cpuset.sched_relax_domain_level + cpuset.mem_hardwall notify_on_release + cpuset.memory_migrate tasks Reading them will give you information about the state of this cpuset: the CPUs and Memory Nodes it can use, the processes that are using it, its properties. By writing to these files you can manipulate the cpuset. -Set some flags: -# /bin/echo 1 > cpuset.cpu_exclusive +Set some flags:: -Add some cpus: -# /bin/echo 0-7 > cpuset.cpus + # /bin/echo 1 > cpuset.cpu_exclusive -Add some mems: -# /bin/echo 0-7 > cpuset.mems +Add some cpus:: -Now attach your shell to this cpuset: -# /bin/echo $$ > tasks + # /bin/echo 0-7 > cpuset.cpus + +Add some mems:: + + # /bin/echo 0-7 > cpuset.mems + +Now attach your shell to this cpuset:: + + # /bin/echo $$ > tasks You can also create cpusets inside your cpuset by using mkdir in this -directory. 
-# mkdir my_sub_cs +directory:: + + # mkdir my_sub_cs + +To remove a cpuset, just use rmdir:: + + # rmdir my_sub_cs -To remove a cpuset, just use rmdir: -# rmdir my_sub_cs This will fail if the cpuset is in use (has cpusets inside, or has processes attached). Note that for legacy reasons, the "cpuset" filesystem exists as a wrapper around the cgroup filesystem. -The command +The command:: -mount -t cpuset X /sys/fs/cgroup/cpuset + mount -t cpuset X /sys/fs/cgroup/cpuset -is equivalent to +is equivalent to:: -mount -t cgroup -ocpuset,noprefix X /sys/fs/cgroup/cpuset -echo "/sbin/cpuset_release_agent" > /sys/fs/cgroup/cpuset/release_agent + mount -t cgroup -ocpuset,noprefix X /sys/fs/cgroup/cpuset + echo "/sbin/cpuset_release_agent" > /sys/fs/cgroup/cpuset/release_agent 2.2 Adding/removing cpus ------------------------ This is the syntax to use when writing in the cpus or mems files -in cpuset directories: +in cpuset directories:: -# /bin/echo 1-4 > cpuset.cpus -> set cpus list to cpus 1,2,3,4 -# /bin/echo 1,2,3,4 > cpuset.cpus -> set cpus list to cpus 1,2,3,4 + # /bin/echo 1-4 > cpuset.cpus -> set cpus list to cpus 1,2,3,4 + # /bin/echo 1,2,3,4 > cpuset.cpus -> set cpus list to cpus 1,2,3,4 To add a CPU to a cpuset, write the new list of CPUs including the -CPU to be added. To add 6 to the above cpuset: +CPU to be added. To add 6 to the above cpuset:: -# /bin/echo 1-4,6 > cpuset.cpus -> set cpus list to cpus 1,2,3,4,6 + # /bin/echo 1-4,6 > cpuset.cpus -> set cpus list to cpus 1,2,3,4,6 Similarly to remove a CPU from a cpuset, write the new list of CPUs without the CPU to be removed. 
-To remove all the CPUs: +To remove all the CPUs:: -# /bin/echo "" > cpuset.cpus -> clear cpus list + # /bin/echo "" > cpuset.cpus -> clear cpus list 2.3 Setting flags ----------------- -The syntax is very simple: +The syntax is very simple:: -# /bin/echo 1 > cpuset.cpu_exclusive -> set flag 'cpuset.cpu_exclusive' -# /bin/echo 0 > cpuset.cpu_exclusive -> unset flag 'cpuset.cpu_exclusive' + # /bin/echo 1 > cpuset.cpu_exclusive -> set flag 'cpuset.cpu_exclusive' + # /bin/echo 0 > cpuset.cpu_exclusive -> unset flag 'cpuset.cpu_exclusive' 2.4 Attaching processes ----------------------- -# /bin/echo PID > tasks +:: + + # /bin/echo PID > tasks Note that it is PID, not PIDs. You can only attach ONE task at a time. -If you have several tasks to attach, you have to do it one after another: +If you have several tasks to attach, you have to do it one after another:: -# /bin/echo PID1 > tasks -# /bin/echo PID2 > tasks + # /bin/echo PID1 > tasks + # /bin/echo PID2 > tasks ... -# /bin/echo PIDn > tasks + # /bin/echo PIDn > tasks 3. Questions ============ -Q: what's up with this '/bin/echo' ? -A: bash's builtin 'echo' command does not check calls to write() against +Q: + what's up with this '/bin/echo' ? + +A: + bash's builtin 'echo' command does not check calls to write() against errors. If you use it in the cpuset file system, you won't be able to tell whether a command succeeded or failed. -Q: When I attach processes, only the first of the line gets really attached ! -A: We can only return one error code per call to write(). So you should also +Q: + When I attach processes, only the first of the line gets really attached ! + +A: + We can only return one error code per call to write(). So you should also put only ONE pid. 4. 
Contact diff --git a/Documentation/cgroup-v1/devices.txt b/Documentation/cgroup-v1/devices.rst similarity index 88% rename from Documentation/cgroup-v1/devices.txt rename to Documentation/cgroup-v1/devices.rst index 3c1095ca02ea..e1886783961e 100644 --- a/Documentation/cgroup-v1/devices.txt +++ b/Documentation/cgroup-v1/devices.rst @@ -1,6 +1,9 @@ +=========================== Device Whitelist Controller +=========================== -1. Description: +1. Description +============== Implement a cgroup to track and enforce open and mknod restrictions on device files. A device cgroup associates a device access @@ -16,24 +19,26 @@ devices from the whitelist or add new entries. A child cgroup can never receive a device access which is denied by its parent. 2. User Interface +================= An entry is added using devices.allow, and removed using -devices.deny. For instance +devices.deny. For instance:: echo 'c 1:3 mr' > /sys/fs/cgroup/1/devices.allow allows cgroup 1 to read and mknod the device usually known as -/dev/null. Doing +/dev/null. Doing:: echo a > /sys/fs/cgroup/1/devices.deny -will remove the default 'a *:* rwm' entry. Doing +will remove the default 'a *:* rwm' entry. Doing:: echo a > /sys/fs/cgroup/1/devices.allow will add the 'a *:* rwm' entry to the whitelist. 3. Security +=========== Any task can move itself between cgroups. This clearly won't suffice, but we can decide the best way to adequately restrict @@ -50,6 +55,7 @@ A cgroup may not be granted more permissions than the cgroup's parent has. 4. Hierarchy +============ device cgroups maintain hierarchy by making sure a cgroup never has more access permissions than its parent. Every time an entry is written to @@ -58,7 +64,8 @@ from their whitelist and all the locally set whitelist entries will be re-evaluated. In case one of the locally set whitelist entries would provide more access than the cgroup's parent, it'll be removed from the whitelist. 
-Example: +Example:: + A / \ B @@ -67,10 +74,12 @@ Example: A allow "b 8:* rwm", "c 116:1 rw" B deny "c 1:3 rwm", "c 116:2 rwm", "b 3:* rwm" -If a device is denied in group A: +If a device is denied in group A:: + # echo "c 116:* r" > A/devices.deny + it'll propagate down and after revalidating B's entries, the whitelist entry -"c 116:2 rwm" will be removed: +"c 116:2 rwm" will be removed:: group whitelist entries denied devices A all "b 8:* rwm", "c 116:* rw" @@ -79,7 +88,8 @@ it'll propagate down and after revalidating B's entries, the whitelist entry In case parent's exceptions change and local exceptions are not allowed anymore, they'll be deleted. -Notice that new whitelist entries will not be propagated: +Notice that new whitelist entries will not be propagated:: + A / \ B @@ -88,24 +98,30 @@ Notice that new whitelist entries will not be propagated: A "c 1:3 rwm", "c 1:5 r" all the rest B "c 1:3 rwm", "c 1:5 r" all the rest -when adding "c *:3 rwm": +when adding ``c *:3 rwm``:: + # echo "c *:3 rwm" >A/devices.allow -the result: +the result:: + group whitelist entries denied devices A "c *:3 rwm", "c 1:5 r" all the rest B "c 1:3 rwm", "c 1:5 r" all the rest -but now it'll be possible to add new entries to B: +but now it'll be possible to add new entries to B:: + # echo "c 2:3 rwm" >B/devices.allow # echo "c 50:3 r" >B/devices.allow -or even + +or even:: + # echo "c *:3 rwm" >B/devices.allow Allowing or denying all by writing 'a' to devices.allow or devices.deny will not be possible once the device cgroups has children. 4.1 Hierarchy (internal implementation) +--------------------------------------- device cgroups is implemented internally using a behavior (ALLOW, DENY) and a list of exceptions. 
The internal state is controlled using the same user diff --git a/Documentation/cgroup-v1/freezer-subsystem.txt b/Documentation/cgroup-v1/freezer-subsystem.rst similarity index 95% rename from Documentation/cgroup-v1/freezer-subsystem.txt rename to Documentation/cgroup-v1/freezer-subsystem.rst index e831cb2b8394..582d3427de3f 100644 --- a/Documentation/cgroup-v1/freezer-subsystem.txt +++ b/Documentation/cgroup-v1/freezer-subsystem.rst @@ -1,3 +1,7 @@ +============== +Cgroup Freezer +============== + The cgroup freezer is useful to batch job management system which start and stop sets of tasks in order to schedule the resources of a machine according to the desires of a system administrator. This sort of program @@ -23,7 +27,7 @@ blocked, or ignored it can be seen by waiting or ptracing parent tasks. SIGCONT is especially unsuitable since it can be caught by the task. Any programs designed to watch for SIGSTOP and SIGCONT could be broken by attempting to use SIGSTOP and SIGCONT to stop and resume tasks. We can -demonstrate this problem using nested bash shells: +demonstrate this problem using nested bash shells:: $ echo $$ 16644 @@ -93,19 +97,19 @@ The following cgroupfs files are created by cgroup freezer. The root cgroup is non-freezable and the above interface files don't exist. 
-* Examples of usage : +* Examples of usage:: # mkdir /sys/fs/cgroup/freezer # mount -t cgroup -ofreezer freezer /sys/fs/cgroup/freezer # mkdir /sys/fs/cgroup/freezer/0 # echo $some_pid > /sys/fs/cgroup/freezer/0/tasks -to get status of the freezer subsystem : +to get status of the freezer subsystem:: # cat /sys/fs/cgroup/freezer/0/freezer.state THAWED -to freeze all tasks in the container : +to freeze all tasks in the container:: # echo FROZEN > /sys/fs/cgroup/freezer/0/freezer.state # cat /sys/fs/cgroup/freezer/0/freezer.state @@ -113,7 +117,7 @@ to freeze all tasks in the container : # cat /sys/fs/cgroup/freezer/0/freezer.state FROZEN -to unfreeze all tasks in the container : +to unfreeze all tasks in the container:: # echo THAWED > /sys/fs/cgroup/freezer/0/freezer.state # cat /sys/fs/cgroup/freezer/0/freezer.state diff --git a/Documentation/cgroup-v1/hugetlb.txt b/Documentation/cgroup-v1/hugetlb.rst similarity index 74% rename from Documentation/cgroup-v1/hugetlb.txt rename to Documentation/cgroup-v1/hugetlb.rst index 106245c3aecc..7056a185914b 100644 --- a/Documentation/cgroup-v1/hugetlb.txt +++ b/Documentation/cgroup-v1/hugetlb.rst @@ -1,5 +1,6 @@ +================== HugeTLB Controller -------------------- +================== The HugeTLB controller allows to limit the HugeTLB usage per control group and enforces the controller limit during page fault. Since HugeTLB doesn't @@ -16,16 +17,16 @@ With the above step, the initial or the parent HugeTLB group becomes visible at /sys/fs/cgroup. At bootup, this group includes all the tasks in the system. /sys/fs/cgroup/tasks lists the tasks in this cgroup. -New groups can be created under the parent group /sys/fs/cgroup. +New groups can be created under the parent group /sys/fs/cgroup:: -# cd /sys/fs/cgroup -# mkdir g1 -# echo $$ > g1/tasks + # cd /sys/fs/cgroup + # mkdir g1 + # echo $$ > g1/tasks The above steps create a new group g1 and move the current shell process (bash) into it. 
-Brief summary of control files +Brief summary of control files:: hugetlb..limit_in_bytes # set/show limit of "hugepagesize" hugetlb usage hugetlb..max_usage_in_bytes # show max "hugepagesize" hugetlb usage recorded @@ -33,13 +34,13 @@ Brief summary of control files hugetlb..failcnt # show the number of allocation failure due to HugeTLB limit For a system supporting two hugepage size (16M and 16G) the control -files include: +files include:: -hugetlb.16GB.limit_in_bytes -hugetlb.16GB.max_usage_in_bytes -hugetlb.16GB.usage_in_bytes -hugetlb.16GB.failcnt -hugetlb.16MB.limit_in_bytes -hugetlb.16MB.max_usage_in_bytes -hugetlb.16MB.usage_in_bytes -hugetlb.16MB.failcnt + hugetlb.16GB.limit_in_bytes + hugetlb.16GB.max_usage_in_bytes + hugetlb.16GB.usage_in_bytes + hugetlb.16GB.failcnt + hugetlb.16MB.limit_in_bytes + hugetlb.16MB.max_usage_in_bytes + hugetlb.16MB.usage_in_bytes + hugetlb.16MB.failcnt diff --git a/Documentation/cgroup-v1/index.rst b/Documentation/cgroup-v1/index.rst new file mode 100644 index 000000000000..fe76d42edc11 --- /dev/null +++ b/Documentation/cgroup-v1/index.rst @@ -0,0 +1,30 @@ +:orphan: + +======================== +Control Groups version 1 +======================== + +.. toctree:: + :maxdepth: 1 + + cgroups + + blkio-controller + cpuacct + cpusets + devices + freezer-subsystem + hugetlb + memcg_test + memory + net_cls + net_prio + pids + rdma + +.. only:: subproject and html + + Indices + ======= + + * :ref:`genindex` diff --git a/Documentation/cgroup-v1/memcg_test.txt b/Documentation/cgroup-v1/memcg_test.rst similarity index 62% rename from Documentation/cgroup-v1/memcg_test.txt rename to Documentation/cgroup-v1/memcg_test.rst index 621e29ffb358..9d1de6600c45 100644 --- a/Documentation/cgroup-v1/memcg_test.txt +++ b/Documentation/cgroup-v1/memcg_test.rst @@ -1,32 +1,43 @@ -Memory Resource Controller(Memcg) Implementation Memo. 
+===================================================== +Memory Resource Controller(Memcg) Implementation Memo +===================================================== + Last Updated: 2010/2 + Base Kernel Version: based on 2.6.33-rc7-mm(candidate for 34). Because VM is getting complex (one of reasons is memcg...), memcg's behavior is complex. This is a document for memcg's internal behavior. Please note that implementation details can be changed. -(*) Topics on API should be in Documentation/cgroup-v1/memory.txt) +(*) Topics on API should be in Documentation/cgroup-v1/memory.rst) 0. How to record usage ? +======================== + 2 objects are used. page_cgroup ....an object per page. + Allocated at boot or memory hotplug. Freed at memory hot removal. swap_cgroup ... an entry per swp_entry. + Allocated at swapon(). Freed at swapoff(). The page_cgroup has USED bit and double count against a page_cgroup never occurs. swap_cgroup is used only when a charged page is swapped-out. 1. Charge +========= a page/swp_entry may be charged (usage += PAGE_SIZE) at mem_cgroup_try_charge() 2. Uncharge +=========== + a page/swp_entry may be uncharged (usage -= PAGE_SIZE) by mem_cgroup_uncharge() @@ -37,9 +48,12 @@ Please note that implementation details can be changed. disappears. 3. charge-commit-cancel +======================= + Memcg pages are charged in two steps: - mem_cgroup_try_charge() - mem_cgroup_commit_charge() or mem_cgroup_cancel_charge() + + - mem_cgroup_try_charge() + - mem_cgroup_commit_charge() or mem_cgroup_cancel_charge() At try_charge(), there are no flags to say "this page is charged". at this point, usage += PAGE_SIZE. @@ -51,6 +65,8 @@ Please note that implementation details can be changed. Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y. 4. Anonymous +============ + Anonymous page is newly allocated at - page fault into MAP_ANONYMOUS mapping. - Copy-On-Write. @@ -78,34 +94,45 @@ Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y. 
(e) zap_pte() is called and swp_entry's refcnt -=1 -> 0. 5. Page Cache +============= + Page Cache is charged at - add_to_page_cache_locked(). The logic is very clear. (About migration, see below) - Note: __remove_from_page_cache() is called by remove_from_page_cache() - and __remove_mapping(). + + Note: + __remove_from_page_cache() is called by remove_from_page_cache() + and __remove_mapping(). 6. Shmem(tmpfs) Page Cache +=========================== + The best way to understand shmem's page state transition is to read mm/shmem.c. + But brief explanation of the behavior of memcg around shmem will be helpful to understand the logic. Shmem's page (just leaf page, not direct/indirect block) can be on + - radix-tree of shmem's inode. - SwapCache. - Both on radix-tree and SwapCache. This happens at swap-in and swap-out, It's charged when... + - A new page is added to shmem's radix-tree. - A swp page is read. (move a charge from swap_cgroup to page_cgroup) 7. Page Migration +================= mem_cgroup_migrate() 8. LRU +====== Each memcg has its own private LRU. Now, its handling is under global VM's control (means that it's handled under global pgdat->lru_lock). Almost all routines around memcg's LRU is called by global LRU's @@ -114,163 +141,211 @@ Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y. A special function is mem_cgroup_isolate_pages(). This scans memcg's private LRU and call __isolate_lru_page() to extract a page from LRU. + (By __isolate_lru_page(), the page is removed from both of global and - private LRU.) + private LRU.) 9. Typical Tests. +================= Tests for racy cases. - 9.1 Small limit to memcg. +9.1 Small limit to memcg. +------------------------- + When you do test to do racy case, it's good test to set memcg's limit to be very small rather than GB. Many races found in the test under xKB or xxMB limits. + (Memory behavior under GB and Memory behavior under MB shows very - different situation.) + different situation.) 
+ +9.2 Shmem +--------- - 9.2 Shmem Historically, memcg's shmem handling was poor and we saw some amount of troubles here. This is because shmem is page-cache but can be SwapCache. Test with shmem/tmpfs is always good test. - 9.3 Migration +9.3 Migration +------------- + For NUMA, migration is an another special case. To do easy test, cpuset - is useful. Following is a sample script to do migration. + is useful. Following is a sample script to do migration:: - mount -t cgroup -o cpuset none /opt/cpuset + mount -t cgroup -o cpuset none /opt/cpuset - mkdir /opt/cpuset/01 - echo 1 > /opt/cpuset/01/cpuset.cpus - echo 0 > /opt/cpuset/01/cpuset.mems - echo 1 > /opt/cpuset/01/cpuset.memory_migrate - mkdir /opt/cpuset/02 - echo 1 > /opt/cpuset/02/cpuset.cpus - echo 1 > /opt/cpuset/02/cpuset.mems - echo 1 > /opt/cpuset/02/cpuset.memory_migrate + mkdir /opt/cpuset/01 + echo 1 > /opt/cpuset/01/cpuset.cpus + echo 0 > /opt/cpuset/01/cpuset.mems + echo 1 > /opt/cpuset/01/cpuset.memory_migrate + mkdir /opt/cpuset/02 + echo 1 > /opt/cpuset/02/cpuset.cpus + echo 1 > /opt/cpuset/02/cpuset.mems + echo 1 > /opt/cpuset/02/cpuset.memory_migrate In above set, when you moves a task from 01 to 02, page migration to node 0 to node 1 will occur. Following is a script to migrate all - under cpuset. - -- - move_task() - { - for pid in $1 - do - /bin/echo $pid >$2/tasks 2>/dev/null - echo -n $pid - echo -n " " - done - echo END - } + under cpuset.:: + + -- + move_task() + { + for pid in $1 + do + /bin/echo $pid >$2/tasks 2>/dev/null + echo -n $pid + echo -n " " + done + echo END + } + + G1_TASK=`cat ${G1}/tasks` + G2_TASK=`cat ${G2}/tasks` + move_task "${G1_TASK}" ${G2} & + -- + +9.4 Memory hotplug +------------------ - G1_TASK=`cat ${G1}/tasks` - G2_TASK=`cat ${G2}/tasks` - move_task "${G1_TASK}" ${G2} & - -- - 9.4 Memory hotplug. memory hotplug test is one of good test. - to offline memory, do following. 
- # echo offline > /sys/devices/system/memory/memoryXXX/state + + to offline memory, do following:: + + # echo offline > /sys/devices/system/memory/memoryXXX/state + (XXX is the place of memory) + This is an easy way to test page migration, too. - 9.5 mkdir/rmdir +9.5 mkdir/rmdir +--------------- + When using hierarchy, mkdir/rmdir test should be done. - Use tests like the following. + Use tests like the following:: - echo 1 >/opt/cgroup/01/memory/use_hierarchy - mkdir /opt/cgroup/01/child_a - mkdir /opt/cgroup/01/child_b + echo 1 >/opt/cgroup/01/memory/use_hierarchy + mkdir /opt/cgroup/01/child_a + mkdir /opt/cgroup/01/child_b - set limit to 01. - add limit to 01/child_b - run jobs under child_a and child_b + set limit to 01. + add limit to 01/child_b + run jobs under child_a and child_b - create/delete following groups at random while jobs are running. - /opt/cgroup/01/child_a/child_aa - /opt/cgroup/01/child_b/child_bb - /opt/cgroup/01/child_c + create/delete following groups at random while jobs are running:: + + /opt/cgroup/01/child_a/child_aa + /opt/cgroup/01/child_b/child_bb + /opt/cgroup/01/child_c running new jobs in new group is also good. - 9.6 Mount with other subsystems. +9.6 Mount with other subsystems +------------------------------- + Mounting with other subsystems is a good test because there is a race and lock dependency with other cgroup subsystems. - example) - # mount -t cgroup none /cgroup -o cpuset,memory,cpu,devices + example:: + + # mount -t cgroup none /cgroup -o cpuset,memory,cpu,devices and do task move, mkdir, rmdir etc...under this. - 9.7 swapoff. +9.7 swapoff +----------- + Besides management of swap is one of complicated parts of memcg, call path of swap-in at swapoff is not same as usual swap-in path.. It's worth to be tested explicitly. - For example, test like following is good. 
- (Shell-A) - # mount -t cgroup none /cgroup -o memory - # mkdir /cgroup/test - # echo 40M > /cgroup/test/memory.limit_in_bytes - # echo 0 > /cgroup/test/tasks + For example, test like following is good: + + (Shell-A):: + + # mount -t cgroup none /cgroup -o memory + # mkdir /cgroup/test + # echo 40M > /cgroup/test/memory.limit_in_bytes + # echo 0 > /cgroup/test/tasks + Run malloc(100M) program under this. You'll see 60M of swaps. - (Shell-B) - # move all tasks in /cgroup/test to /cgroup - # /sbin/swapoff -a - # rmdir /cgroup/test - # kill malloc task. + + (Shell-B):: + + # move all tasks in /cgroup/test to /cgroup + # /sbin/swapoff -a + # rmdir /cgroup/test + # kill malloc task. Of course, tmpfs v.s. swapoff test should be tested, too. - 9.8 OOM-Killer +9.8 OOM-Killer +-------------- + Out-of-memory caused by memcg's limit will kill tasks under the memcg. When hierarchy is used, a task under hierarchy will be killed by the kernel. + In this case, panic_on_oom shouldn't be invoked and tasks in other groups shouldn't be killed. It's not difficult to cause OOM under memcg as following. - Case A) when you can swapoff - #swapoff -a - #echo 50M > /memory.limit_in_bytes + + Case A) when you can swapoff:: + + #swapoff -a + #echo 50M > /memory.limit_in_bytes + run 51M of malloc - Case B) when you use mem+swap limitation. - #echo 50M > memory.limit_in_bytes - #echo 50M > memory.memsw.limit_in_bytes + Case B) when you use mem+swap limitation:: + + #echo 50M > memory.limit_in_bytes + #echo 50M > memory.memsw.limit_in_bytes + run 51M of malloc - 9.9 Move charges at task migration +9.9 Move charges at task migration +---------------------------------- + Charges associated with a task can be moved along with task migration. - (Shell-A) - #mkdir /cgroup/A - #echo $$ >/cgroup/A/tasks + (Shell-A):: + + #mkdir /cgroup/A + #echo $$ >/cgroup/A/tasks + run some programs which uses some amount of memory in /cgroup/A. 
- (Shell-B) - #mkdir /cgroup/B - #echo 1 >/cgroup/B/memory.move_charge_at_immigrate - #echo "pid of the program running in group A" >/cgroup/B/tasks + (Shell-B):: - You can see charges have been moved by reading *.usage_in_bytes or + #mkdir /cgroup/B + #echo 1 >/cgroup/B/memory.move_charge_at_immigrate + #echo "pid of the program running in group A" >/cgroup/B/tasks + + You can see charges have been moved by reading ``*.usage_in_bytes`` or memory.stat of both A and B. - See 8.2 of Documentation/cgroup-v1/memory.txt to see what value should be - written to move_charge_at_immigrate. - 9.10 Memory thresholds + See 8.2 of Documentation/cgroup-v1/memory.rst to see what value should + be written to move_charge_at_immigrate. + +9.10 Memory thresholds +---------------------- + Memory controller implements memory thresholds using cgroups notification API. You can use tools/cgroup/cgroup_event_listener.c to test it. - (Shell-A) Create cgroup and run event listener - # mkdir /cgroup/A - # ./cgroup_event_listener /cgroup/A/memory.usage_in_bytes 5M + (Shell-A) Create cgroup and run event listener:: - (Shell-B) Add task to cgroup and try to allocate and free memory - # echo $$ >/cgroup/A/tasks - # a="$(dd if=/dev/zero bs=1M count=10)" - # a= + # mkdir /cgroup/A + # ./cgroup_event_listener /cgroup/A/memory.usage_in_bytes 5M + + (Shell-B) Add task to cgroup and try to allocate and free memory:: + + # echo $$ >/cgroup/A/tasks + # a="$(dd if=/dev/zero bs=1M count=10)" + # a= You will see message from cgroup_event_listener every time you cross the thresholds. 
diff --git a/Documentation/cgroup-v1/memory.txt b/Documentation/cgroup-v1/memory.rst similarity index 71% rename from Documentation/cgroup-v1/memory.txt rename to Documentation/cgroup-v1/memory.rst index a33cedf85427..41bdc038dad9 100644 --- a/Documentation/cgroup-v1/memory.txt +++ b/Documentation/cgroup-v1/memory.rst @@ -1,22 +1,26 @@ +========================== Memory Resource Controller +========================== -NOTE: This document is hopelessly outdated and it asks for a complete +NOTE: + This document is hopelessly outdated and it asks for a complete rewrite. It still contains a useful information so we are keeping it here but make sure to check the current code if you need a deeper understanding. -NOTE: The Memory Resource Controller has generically been referred to as the +NOTE: + The Memory Resource Controller has generically been referred to as the memory controller in this document. Do not confuse memory controller used here with the memory controller that is used in hardware. -(For editors) -In this document: +(For editors) In this document: When we mention a cgroup (cgroupfs's directory) with memory controller, we call it "memory cgroup". When you see git-log and source code, you'll see patch's title and function names tend to use "memcg". In this document, we avoid using it. Benefits and Purpose of the memory controller +============================================= The memory controller isolates the memory behaviour of a group of tasks from the rest of the system. The article on LWN [12] mentions some probable @@ -38,6 +42,7 @@ e. There are several other use cases; find one or use the controller just Current Status: linux-2.6.34-mmotm(development version of 2010/April) Features: + - accounting anonymous pages, file caches, swap caches usage and limiting them. - pages are linked to per-memcg LRU exclusively, and there is no global LRU. - optionally, memory+swap usage can be accounted and limited. 
@@ -54,41 +59,48 @@ Features: Brief summary of control files. - tasks # attach a task(thread) and show list of threads - cgroup.procs # show list of processes - cgroup.event_control # an interface for event_fd() - memory.usage_in_bytes # show current usage for memory - (See 5.5 for details) - memory.memsw.usage_in_bytes # show current usage for memory+Swap - (See 5.5 for details) - memory.limit_in_bytes # set/show limit of memory usage - memory.memsw.limit_in_bytes # set/show limit of memory+Swap usage - memory.failcnt # show the number of memory usage hits limits - memory.memsw.failcnt # show the number of memory+Swap hits limits - memory.max_usage_in_bytes # show max memory usage recorded - memory.memsw.max_usage_in_bytes # show max memory+Swap usage recorded - memory.soft_limit_in_bytes # set/show soft limit of memory usage - memory.stat # show various statistics - memory.use_hierarchy # set/show hierarchical account enabled - memory.force_empty # trigger forced page reclaim - memory.pressure_level # set memory pressure notifications - memory.swappiness # set/show swappiness parameter of vmscan - (See sysctl's vm.swappiness) - memory.move_charge_at_immigrate # set/show controls of moving charges - memory.oom_control # set/show oom controls. 
- memory.numa_stat # show the number of memory usage per numa node +==================================== ========================================== + tasks attach a task(thread) and show list of + threads + cgroup.procs show list of processes + cgroup.event_control an interface for event_fd() + memory.usage_in_bytes show current usage for memory + (See 5.5 for details) + memory.memsw.usage_in_bytes show current usage for memory+Swap + (See 5.5 for details) + memory.limit_in_bytes set/show limit of memory usage + memory.memsw.limit_in_bytes set/show limit of memory+Swap usage + memory.failcnt show the number of memory usage hits limits + memory.memsw.failcnt show the number of memory+Swap hits limits + memory.max_usage_in_bytes show max memory usage recorded + memory.memsw.max_usage_in_bytes show max memory+Swap usage recorded + memory.soft_limit_in_bytes set/show soft limit of memory usage + memory.stat show various statistics + memory.use_hierarchy set/show hierarchical account enabled + memory.force_empty trigger forced page reclaim + memory.pressure_level set memory pressure notifications + memory.swappiness set/show swappiness parameter of vmscan + (See sysctl's vm.swappiness) + memory.move_charge_at_immigrate set/show controls of moving charges + memory.oom_control set/show oom controls. 
+ memory.numa_stat show the number of memory usage per numa + node - memory.kmem.limit_in_bytes # set/show hard limit for kernel memory - memory.kmem.usage_in_bytes # show current kernel memory allocation - memory.kmem.failcnt # show the number of kernel memory usage hits limits - memory.kmem.max_usage_in_bytes # show max kernel memory usage recorded + memory.kmem.limit_in_bytes set/show hard limit for kernel memory + memory.kmem.usage_in_bytes show current kernel memory allocation + memory.kmem.failcnt show the number of kernel memory usage + hits limits + memory.kmem.max_usage_in_bytes show max kernel memory usage recorded - memory.kmem.tcp.limit_in_bytes # set/show hard limit for tcp buf memory - memory.kmem.tcp.usage_in_bytes # show current tcp buf memory allocation - memory.kmem.tcp.failcnt # show the number of tcp buf memory usage hits limits - memory.kmem.tcp.max_usage_in_bytes # show max tcp buf memory usage recorded + memory.kmem.tcp.limit_in_bytes set/show hard limit for tcp buf memory + memory.kmem.tcp.usage_in_bytes show current tcp buf memory allocation + memory.kmem.tcp.failcnt show the number of tcp buf memory usage + hits limits + memory.kmem.tcp.max_usage_in_bytes show max tcp buf memory usage recorded +==================================== ========================================== 1. History +========== The memory controller has a long history. A request for comments for the memory controller was posted by Balbir Singh [1]. At the time the RFC was posted @@ -103,6 +115,7 @@ at version 6; it combines both mapped (RSS) and unmapped Page Cache Control [11]. 2. Memory Control +================= Memory is a unique resource in the sense that it is present in a limited amount. If a task requires a lot of CPU processing, the task can spread @@ -120,6 +133,7 @@ are: The memory controller is the first controller developed. 2.1. Design +----------- The core of the design is a counter called the page_counter. 
The page_counter tracks the current memory usage and limit of the group of @@ -127,6 +141,9 @@ processes associated with the controller. Each cgroup has a memory controller specific data structure (mem_cgroup) associated with it. 2.2. Accounting +--------------- + +:: +--------------------+ | mem_cgroup | @@ -165,6 +182,7 @@ updated. page_cgroup has its own LRU on cgroup. (*) page_cgroup structure is allocated at boot/memory-hotplug time. 2.2.1 Accounting details +------------------------ All mapped anon pages (RSS) and cache pages (Page Cache) are accounted. Some pages which are never reclaimable and will not be on the LRU @@ -191,6 +209,7 @@ Note: we just account pages-on-LRU because our purpose is to control amount of used pages; not-on-LRU pages tend to be out-of-control from VM view. 2.3 Shared Page Accounting +-------------------------- Shared pages are accounted on the basis of the first touch approach. The cgroup that first touches a page is accounted for the page. The principle @@ -207,11 +226,13 @@ be backed into memory in force, charges for pages are accounted against the caller of swapoff rather than the users of shmem. 2.4 Swap Extension (CONFIG_MEMCG_SWAP) +-------------------------------------- Swap Extension allows you to record charge for swap. A swapped-in page is charged back to original page allocator if possible. When swap is accounted, following files are added. + - memory.memsw.usage_in_bytes. - memory.memsw.limit_in_bytes. @@ -224,14 +245,16 @@ In this case, setting memsw.limit_in_bytes=3G will prevent bad use of swap. By using the memsw limit, you can avoid system OOM which can be caused by swap shortage. -* why 'memory+swap' rather than swap. +**why 'memory+swap' rather than swap** + The global LRU(kswapd) can swap out arbitrary pages. Swap-out means to move account from memory to swap...there is no change in usage of memory+swap. 
In other words, when we want to limit the usage of swap without affecting global LRU, memory+swap limit is better than just limiting swap from an OS point of view. -* What happens when a cgroup hits memory.memsw.limit_in_bytes +**What happens when a cgroup hits memory.memsw.limit_in_bytes** + When a cgroup hits memory.memsw.limit_in_bytes, it's useless to do swap-out in this cgroup. Then, swap-out will not be done by cgroup routine and file caches are dropped. But as mentioned above, global LRU can do swapout memory @@ -239,6 +262,7 @@ from it for sanity of the system's memory management state. You can't forbid it by cgroup. 2.5 Reclaim +----------- Each cgroup maintains a per cgroup LRU which has the same structure as global VM. When a cgroup goes over its limit, we first try @@ -251,29 +275,36 @@ The reclaim algorithm has not been modified for cgroups, except that pages that are selected for reclaiming come from the per-cgroup LRU list. -NOTE: Reclaim does not work for the root cgroup, since we cannot set any -limits on the root cgroup. +NOTE: + Reclaim does not work for the root cgroup, since we cannot set any + limits on the root cgroup. -Note2: When panic_on_oom is set to "2", the whole system will panic. +Note2: + When panic_on_oom is set to "2", the whole system will panic. When oom event notifier is registered, event will be delivered. (See oom_control section) 2.6 Locking +----------- lock_page_cgroup()/unlock_page_cgroup() should not be called under the i_pages lock. Other lock order is following: + PG_locked. - mm->page_table_lock - pgdat->lru_lock - lock_page_cgroup. + mm->page_table_lock + pgdat->lru_lock + lock_page_cgroup. + In many cases, just lock_page_cgroup() is called. + per-zone-per-cgroup LRU (cgroup's private LRU) is just guarded by pgdat->lru_lock, it has no lock of its own. 
2.7 Kernel Memory Extension (CONFIG_MEMCG_KMEM) +----------------------------------------------- With the Kernel memory extension, the Memory Controller is able to limit the amount of kernel memory used by the system. Kernel memory is fundamentally @@ -288,6 +319,7 @@ Kernel memory limits are not imposed for the root cgroup. Usage for the root cgroup may or may not be accounted. The memory used is accumulated into memory.kmem.usage_in_bytes, or in a separate counter when it makes sense. (currently only for tcp). + The main "kmem" counter is fed into the main counter, so kmem charges will also be visible from the user counter. @@ -295,36 +327,42 @@ Currently no soft limit is implemented for kernel memory. It is future work to trigger slab reclaim when those limits are reached. 2.7.1 Current Kernel Memory resources accounted +----------------------------------------------- -* stack pages: every process consumes some stack pages. By accounting into -kernel memory, we prevent new processes from being created when the kernel -memory usage is too high. +stack pages: + every process consumes some stack pages. By accounting into + kernel memory, we prevent new processes from being created when the kernel + memory usage is too high. -* slab pages: pages allocated by the SLAB or SLUB allocator are tracked. A copy -of each kmem_cache is created every time the cache is touched by the first time -from inside the memcg. The creation is done lazily, so some objects can still be -skipped while the cache is being created. All objects in a slab page should -belong to the same memcg. This only fails to hold when a task is migrated to a -different memcg during the page allocation by the cache. +slab pages: + pages allocated by the SLAB or SLUB allocator are tracked. A copy + of each kmem_cache is created every time the cache is touched by the first time + from inside the memcg. The creation is done lazily, so some objects can still be + skipped while the cache is being created. 
All objects in a slab page should + belong to the same memcg. This only fails to hold when a task is migrated to a + different memcg during the page allocation by the cache. -* sockets memory pressure: some sockets protocols have memory pressure -thresholds. The Memory Controller allows them to be controlled individually -per cgroup, instead of globally. +sockets memory pressure: + some sockets protocols have memory pressure + thresholds. The Memory Controller allows them to be controlled individually + per cgroup, instead of globally. -* tcp memory pressure: sockets memory pressure for the tcp protocol. +tcp memory pressure: + sockets memory pressure for the tcp protocol. 2.7.2 Common use cases +---------------------- Because the "kmem" counter is fed to the main user counter, kernel memory can never be limited completely independently of user memory. Say "U" is the user limit, and "K" the kernel limit. There are three possible ways limits can be set: - U != 0, K = unlimited: +U != 0, K = unlimited: This is the standard memcg limitation mechanism already present before kmem accounting. Kernel memory is completely ignored. - U != 0, K < U: +U != 0, K < U: Kernel memory is a subset of the user memory. This setup is useful in deployments where the total amount of memory per-cgroup is overcommited. Overcommiting kernel memory limits is definitely not recommended, since the @@ -332,19 +370,23 @@ set: In this case, the admin could set up K so that the sum of all groups is never greater than the total memory, and freely set U at the cost of his QoS. - WARNING: In the current implementation, memory reclaim will NOT be + +WARNING: + In the current implementation, memory reclaim will NOT be triggered for a cgroup when it hits K while staying below U, which makes this setup impractical. - U != 0, K >= U: +U != 0, K >= U: Since kmem charges will also be fed to the user counter and reclaim will be triggered for the cgroup for both kinds of memory. 
This setup gives the admin a unified view of memory, and it is also useful for people who just want to track kernel memory usage. 3. User Interface +================= 3.0. Configuration +------------------ a. Enable CONFIG_CGROUPS b. Enable CONFIG_MEMCG @@ -352,39 +394,53 @@ c. Enable CONFIG_MEMCG_SWAP (to use swap extension) d. Enable CONFIG_MEMCG_KMEM (to use kmem extension) 3.1. Prepare the cgroups (see cgroups.txt, Why are cgroups needed?) -# mount -t tmpfs none /sys/fs/cgroup -# mkdir /sys/fs/cgroup/memory -# mount -t cgroup none /sys/fs/cgroup/memory -o memory +------------------------------------------------------------------- -3.2. Make the new group and move bash into it -# mkdir /sys/fs/cgroup/memory/0 -# echo $$ > /sys/fs/cgroup/memory/0/tasks +:: -Since now we're in the 0 cgroup, we can alter the memory limit: -# echo 4M > /sys/fs/cgroup/memory/0/memory.limit_in_bytes + # mount -t tmpfs none /sys/fs/cgroup + # mkdir /sys/fs/cgroup/memory + # mount -t cgroup none /sys/fs/cgroup/memory -o memory -NOTE: We can use a suffix (k, K, m, M, g or G) to indicate values in kilo, -mega or gigabytes. (Here, Kilo, Mega, Giga are Kibibytes, Mebibytes, Gibibytes.) +3.2. Make the new group and move bash into it:: -NOTE: We can write "-1" to reset the *.limit_in_bytes(unlimited). -NOTE: We cannot set limits on the root cgroup any more. + # mkdir /sys/fs/cgroup/memory/0 + # echo $$ > /sys/fs/cgroup/memory/0/tasks -# cat /sys/fs/cgroup/memory/0/memory.limit_in_bytes -4194304 +Since now we're in the 0 cgroup, we can alter the memory limit:: -We can check the usage: -# cat /sys/fs/cgroup/memory/0/memory.usage_in_bytes -1216512 + # echo 4M > /sys/fs/cgroup/memory/0/memory.limit_in_bytes + +NOTE: + We can use a suffix (k, K, m, M, g or G) to indicate values in kilo, + mega or gigabytes. (Here, Kilo, Mega, Giga are Kibibytes, Mebibytes, + Gibibytes.) + +NOTE: + We can write "-1" to reset the ``*.limit_in_bytes`` (unlimited).
+ +NOTE: + We cannot set limits on the root cgroup any more. + +:: + + # cat /sys/fs/cgroup/memory/0/memory.limit_in_bytes + 4194304 + +We can check the usage:: + + # cat /sys/fs/cgroup/memory/0/memory.usage_in_bytes + 1216512 A successful write to this file does not guarantee a successful setting of this limit to the value written into the file. This can be due to a number of factors, such as rounding up to page boundaries or the total availability of memory on the system. The user is required to re-read -this file after a write to guarantee the value committed by the kernel. +this file after a write to guarantee the value committed by the kernel:: -# echo 1 > memory.limit_in_bytes -# cat memory.limit_in_bytes -4096 + # echo 1 > memory.limit_in_bytes + # cat memory.limit_in_bytes + 4096 The memory.failcnt field gives the number of times that the cgroup limit was exceeded. @@ -393,6 +449,7 @@ The memory.stat file gives accounting information. Now, the number of caches, RSS and Active pages/Inactive pages are shown. 4. Testing +========== For testing features and implementation, see memcg_test.txt. @@ -408,6 +465,7 @@ But the above two are testing extreme situations. Trying usual test under memory controller is always helpful. 4.1 Troubleshooting +------------------- Sometimes a user might find that the application under a cgroup is terminated by the OOM killer. There are several causes for this: @@ -422,6 +480,7 @@ To know what happens, disabling OOM_Kill as per "10. OOM Control" (below) and seeing what happens will be helpful. 4.2 Task migration +------------------ When a task migrates from one cgroup to another, its charge is not carried forward by default. The pages allocated from the original cgroup still @@ -432,6 +491,7 @@ You can move charges of a task along with task migration. See 8. 
"Move charges at task migration" 4.3 Removing a cgroup +--------------------- A cgroup can be removed by rmdir, but as discussed in sections 4.1 and 4.2, a cgroup might have some charge associated with it, even though all @@ -448,13 +508,15 @@ will be charged as a new owner of it. About use_hierarchy, see Section 6. -5. Misc. interfaces. +5. Misc. interfaces +=================== 5.1 force_empty +--------------- memory.force_empty interface is provided to make cgroup's memory usage empty. - When writing anything to this + When writing anything to this:: - # echo 0 > memory.force_empty + # echo 0 > memory.force_empty the cgroup will be reclaimed and as many pages reclaimed as possible. @@ -471,50 +533,61 @@ About use_hierarchy, see Section 6. About use_hierarchy, see Section 6. 5.2 stat file +------------- memory.stat file includes following statistics -# per-memory cgroup local status -cache - # of bytes of page cache memory. -rss - # of bytes of anonymous and swap cache memory (includes +per-memory cgroup local status +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +=============== =============================================================== +cache # of bytes of page cache memory. +rss # of bytes of anonymous and swap cache memory (includes transparent hugepages). -rss_huge - # of bytes of anonymous transparent hugepages. -mapped_file - # of bytes of mapped file (includes tmpfs/shmem) -pgpgin - # of charging events to the memory cgroup. The charging +rss_huge # of bytes of anonymous transparent hugepages. +mapped_file # of bytes of mapped file (includes tmpfs/shmem) +pgpgin # of charging events to the memory cgroup. The charging event happens each time a page is accounted as either mapped anon page(RSS) or cache page(Page Cache) to the cgroup. -pgpgout - # of uncharging events to the memory cgroup. The uncharging +pgpgout # of uncharging events to the memory cgroup. The uncharging event happens each time a page is unaccounted from the cgroup. 
-swap - # of bytes of swap usage -dirty - # of bytes that are waiting to get written back to the disk. -writeback - # of bytes of file/anon cache that are queued for syncing to +swap # of bytes of swap usage +dirty # of bytes that are waiting to get written back to the disk. +writeback # of bytes of file/anon cache that are queued for syncing to disk. -inactive_anon - # of bytes of anonymous and swap cache memory on inactive +inactive_anon # of bytes of anonymous and swap cache memory on inactive LRU list. -active_anon - # of bytes of anonymous and swap cache memory on active +active_anon # of bytes of anonymous and swap cache memory on active LRU list. -inactive_file - # of bytes of file-backed memory on inactive LRU list. -active_file - # of bytes of file-backed memory on active LRU list. -unevictable - # of bytes of memory that cannot be reclaimed (mlocked etc). +inactive_file # of bytes of file-backed memory on inactive LRU list. +active_file # of bytes of file-backed memory on active LRU list. +unevictable # of bytes of memory that cannot be reclaimed (mlocked etc). +=============== =============================================================== -# status considering hierarchy (see memory.use_hierarchy settings) +status considering hierarchy (see memory.use_hierarchy settings) +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -hierarchical_memory_limit - # of bytes of memory limit with regard to hierarchy - under which the memory cgroup is -hierarchical_memsw_limit - # of bytes of memory+swap limit with regard to - hierarchy under which memory cgroup is. +========================= =================================================== +hierarchical_memory_limit # of bytes of memory limit with regard to hierarchy + under which the memory cgroup is +hierarchical_memsw_limit # of bytes of memory+swap limit with regard to + hierarchy under which memory cgroup is. 
-total_ - # hierarchical version of , which in - addition to the cgroup's own value includes the - sum of all hierarchical children's values of - , i.e. total_cache +total_ # hierarchical version of , which in + addition to the cgroup's own value includes the + sum of all hierarchical children's values of + , i.e. total_cache +========================= =================================================== -# The following additional stats are dependent on CONFIG_DEBUG_VM. +The following additional stats are dependent on CONFIG_DEBUG_VM +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -recent_rotated_anon - VM internal parameter. (see mm/vmscan.c) -recent_rotated_file - VM internal parameter. (see mm/vmscan.c) -recent_scanned_anon - VM internal parameter. (see mm/vmscan.c) -recent_scanned_file - VM internal parameter. (see mm/vmscan.c) +========================= ======================================== +recent_rotated_anon VM internal parameter. (see mm/vmscan.c) +recent_rotated_file VM internal parameter. (see mm/vmscan.c) +recent_scanned_anon VM internal parameter. (see mm/vmscan.c) +recent_scanned_file VM internal parameter. (see mm/vmscan.c) +========================= ======================================== Memo: recent_rotated means recent frequency of LRU rotation. @@ -525,12 +598,15 @@ Note: Only anonymous and swap cache memory is listed as part of 'rss' stat. This should not be confused with the true 'resident set size' or the amount of physical memory used by the cgroup. + 'rss + mapped_file" will give you resident set size of cgroup. + (Note: file and shmem may be shared among other cgroups. In that case, - mapped_file is accounted only when the memory cgroup is owner of page - cache.) + mapped_file is accounted only when the memory cgroup is owner of page + cache.) 5.3 swappiness +-------------- Overrides /proc/sys/vm/swappiness for the particular group. The tunable in the root cgroup corresponds to the global swappiness setting. 
@@ -541,16 +617,19 @@ there is a swap storage available. This might lead to memcg OOM killer if there are no file pages to reclaim. 5.4 failcnt +----------- A memory cgroup provides memory.failcnt and memory.memsw.failcnt files. This failcnt(== failure count) shows the number of times that a usage counter hit its limit. When a memory cgroup hits a limit, failcnt increases and memory under it will be reclaimed. -You can reset failcnt by writing 0 to failcnt file. -# echo 0 > .../memory.failcnt +You can reset failcnt by writing 0 to failcnt file:: + + # echo 0 > .../memory.failcnt 5.5 usage_in_bytes +------------------ For efficiency, as other kernel components, memory cgroup uses some optimization to avoid unnecessary cacheline false sharing. usage_in_bytes is affected by the @@ -560,6 +639,7 @@ If you want to know more exact memory usage, you should use RSS+CACHE(+SWAP) value in memory.stat(see 5.2). 5.6 numa_stat +------------- This is similar to numa_maps but operates on a per-memcg basis. This is useful for providing visibility into the numa locality information within @@ -571,22 +651,23 @@ Each memcg's numa_stat file includes "total", "file", "anon" and "unevictable" per-node page counts including "hierarchical_" which sums up all hierarchical children's values in addition to the memcg's own value. -The output format of memory.numa_stat is: +The output format of memory.numa_stat is:: -total= N0= N1= ... -file= N0= N1= ... -anon= N0= N1= ... -unevictable= N0= N1= ... -hierarchical_= N0= N1= ... + total= N0= N1= ... + file= N0= N1= ... + anon= N0= N1= ... + unevictable= N0= N1= ... + hierarchical_= N0= N1= ... The "total" count is sum of file + anon + unevictable. 6. Hierarchy support +==================== The memory controller supports a deep hierarchy and hierarchical accounting. The hierarchy is created by creating the appropriate cgroups in the cgroup filesystem. 
Consider for example, the following cgroup filesystem -hierarchy +hierarchy:: root / | \ @@ -603,24 +684,28 @@ limit, the reclaim algorithm reclaims from the tasks in the ancestor and the children of the ancestor. 6.1 Enabling hierarchical accounting and reclaim +------------------------------------------------ A memory cgroup by default disables the hierarchy feature. Support -can be enabled by writing 1 to memory.use_hierarchy file of the root cgroup +can be enabled by writing 1 to memory.use_hierarchy file of the root cgroup:: -# echo 1 > memory.use_hierarchy + # echo 1 > memory.use_hierarchy -The feature can be disabled by +The feature can be disabled by:: -# echo 0 > memory.use_hierarchy + # echo 0 > memory.use_hierarchy -NOTE1: Enabling/disabling will fail if either the cgroup already has other +NOTE1: + Enabling/disabling will fail if either the cgroup already has other cgroups created below it, or if the parent cgroup has use_hierarchy enabled. -NOTE2: When panic_on_oom is set to "2", the whole system will panic in +NOTE2: + When panic_on_oom is set to "2", the whole system will panic in case of an OOM event in any cgroup. 7. Soft limits +============== Soft limits allow for greater sharing of memory. The idea behind soft limits is to allow control groups to use as much of the memory as needed, provided @@ -640,22 +725,26 @@ hints/setup. Currently soft limit based reclaim is set up such that it gets invoked from balance_pgdat (kswapd). 
7.1 Interface +------------- Soft limits can be setup by using the following commands (in this example we -assume a soft limit of 256 MiB) +assume a soft limit of 256 MiB):: -# echo 256M > memory.soft_limit_in_bytes + # echo 256M > memory.soft_limit_in_bytes -If we want to change this to 1G, we can at any time use +If we want to change this to 1G, we can at any time use:: -# echo 1G > memory.soft_limit_in_bytes + # echo 1G > memory.soft_limit_in_bytes -NOTE1: Soft limits take effect over a long period of time, since they involve +NOTE1: + Soft limits take effect over a long period of time, since they involve reclaiming memory for balancing between memory cgroups -NOTE2: It is recommended to set the soft limit always below the hard limit, +NOTE2: + It is recommended to set the soft limit always below the hard limit, otherwise the hard limit will take precedence. 8. Move charges at task migration +================================= Users can move charges associated with a task along with task migration, that is, uncharge task's pages from the old cgroup and charge them to the new cgroup. @@ -663,60 +752,71 @@ This feature is not supported in !CONFIG_MMU environments because of lack of page tables. 8.1 Interface +------------- This feature is disabled by default. It can be enabled (and disabled again) by writing to memory.move_charge_at_immigrate of the destination cgroup. -If you want to enable it: +If you want to enable it:: -# echo (some positive value) > memory.move_charge_at_immigrate + # echo (some positive value) > memory.move_charge_at_immigrate -Note: Each bits of move_charge_at_immigrate has its own meaning about what type +Note: + Each bits of move_charge_at_immigrate has its own meaning about what type of charges should be moved. See 8.2 for details. -Note: Charges are moved only when you move mm->owner, in other words, +Note: + Charges are moved only when you move mm->owner, in other words, a leader of a thread group. 
-Note: If we cannot find enough space for the task in the destination cgroup, we +Note: + If we cannot find enough space for the task in the destination cgroup, we try to make space by reclaiming memory. Task migration may fail if we cannot make enough space. -Note: It can take several seconds if you move charges much. +Note: + It can take several seconds if you move charges much. -And if you want disable it again: +And if you want disable it again:: -# echo 0 > memory.move_charge_at_immigrate + # echo 0 > memory.move_charge_at_immigrate 8.2 Type of charges which can be moved +-------------------------------------- Each bit in move_charge_at_immigrate has its own meaning about what type of charges should be moved. But in any case, it must be noted that an account of a page or a swap can be moved only when it is charged to the task's current (old) memory cgroup. - bit | what type of charges would be moved ? - -----+------------------------------------------------------------------------ - 0 | A charge of an anonymous page (or swap of it) used by the target task. - | You must enable Swap Extension (see 2.4) to enable move of swap charges. - -----+------------------------------------------------------------------------ - 1 | A charge of file pages (normal file, tmpfs file (e.g. ipc shared memory) - | and swaps of tmpfs file) mmapped by the target task. Unlike the case of - | anonymous pages, file pages (and swaps) in the range mmapped by the task - | will be moved even if the task hasn't done page fault, i.e. they might - | not be the task's "RSS", but other task's "RSS" that maps the same file. - | And mapcount of the page is ignored (the page can be moved even if - | page_mapcount(page) > 1). You must enable Swap Extension (see 2.4) to - | enable move of swap charges. ++---+--------------------------------------------------------------------------+ +|bit| what type of charges would be moved ? 
| ++===+==========================================================================+ +| 0 | A charge of an anonymous page (or swap of it) used by the target task. | +| | You must enable Swap Extension (see 2.4) to enable move of swap charges. | ++---+--------------------------------------------------------------------------+ +| 1 | A charge of file pages (normal file, tmpfs file (e.g. ipc shared memory) | +| | and swaps of tmpfs file) mmapped by the target task. Unlike the case of | +| | anonymous pages, file pages (and swaps) in the range mmapped by the task | +| | will be moved even if the task hasn't done page fault, i.e. they might | +| | not be the task's "RSS", but other task's "RSS" that maps the same file. | +| | And mapcount of the page is ignored (the page can be moved even if | +| | page_mapcount(page) > 1). You must enable Swap Extension (see 2.4) to | +| | enable move of swap charges. | ++---+--------------------------------------------------------------------------+ 8.3 TODO +-------- - All of moving charge operations are done under cgroup_mutex. It's not good behavior to hold the mutex too long, so we may need some trick. 9. Memory thresholds +==================== Memory cgroup implements memory thresholds using the cgroups notification API (see cgroups.txt). It allows to register multiple memory and memsw thresholds and gets notifications when it crosses. To register a threshold, an application must: + - create an eventfd using eventfd(2); - open memory.usage_in_bytes or memory.memsw.usage_in_bytes; - write string like " " to @@ -728,6 +828,7 @@ threshold in any direction. It's applicable for root and non-root cgroup. 10. OOM Control +=============== memory.oom_control file is for OOM notification and other controls. @@ -736,6 +837,7 @@ API (See cgroups.txt). It allows to register multiple OOM notification delivery and gets notification when OOM happens. 
To register a notifier, an application must: + - create an eventfd using eventfd(2) - open memory.oom_control file - write string like " " to @@ -752,8 +854,11 @@ If OOM-killer is disabled, tasks under cgroup will hang/sleep in memory cgroup's OOM-waitqueue when they request accountable memory. For running them, you have to relax the memory cgroup's OOM status by + * enlarge limit or reduce usage. + To reduce usage, + * kill some tasks. * move some tasks to other group with account migration. * remove some files (on tmpfs?) @@ -761,11 +866,14 @@ To reduce usage, Then, stopped tasks will work again. At reading, current status of OOM is shown. - oom_kill_disable 0 or 1 (if 1, oom-killer is disabled) - under_oom 0 or 1 (if 1, the memory cgroup is under OOM, tasks may - be stopped.) + + - oom_kill_disable 0 or 1 + (if 1, oom-killer is disabled) + - under_oom 0 or 1 + (if 1, the memory cgroup is under OOM, tasks may be stopped.) 11. Memory Pressure +=================== The pressure level notifications can be used to monitor the memory allocation cost; based on the pressure, applications can implement @@ -840,21 +948,22 @@ Test: Here is a small script example that makes a new cgroup, sets up a memory limit, sets up a notification in the cgroup and then makes child - cgroup experience a critical pressure: + cgroup experience a critical pressure:: - # cd /sys/fs/cgroup/memory/ - # mkdir foo - # cd foo - # cgroup_event_listener memory.pressure_level low,hierarchy & - # echo 8000000 > memory.limit_in_bytes - # echo 8000000 > memory.memsw.limit_in_bytes - # echo $$ > tasks - # dd if=/dev/zero | read x + # cd /sys/fs/cgroup/memory/ + # mkdir foo + # cd foo + # cgroup_event_listener memory.pressure_level low,hierarchy & + # echo 8000000 > memory.limit_in_bytes + # echo 8000000 > memory.memsw.limit_in_bytes + # echo $$ > tasks + # dd if=/dev/zero | read x (Expect a bunch of notifications, and eventually, the oom-killer will trigger.) 12. TODO +======== 1. 
Make per-cgroup scanner reclaim not-shared pages first 2. Teach controller to account for shared-pages @@ -862,11 +971,13 @@ Test: not yet hit but the usage is getting closer Summary +======= Overall, the memory controller has been a stable controller and has been commented and discussed quite extensively in the community. References +========== 1. Singh, Balbir. RFC: Memory Controller, http://lwn.net/Articles/206697/ 2. Singh, Balbir. Memory Controller (RSS Control), diff --git a/Documentation/cgroup-v1/net_cls.txt b/Documentation/cgroup-v1/net_cls.rst similarity index 50% rename from Documentation/cgroup-v1/net_cls.txt rename to Documentation/cgroup-v1/net_cls.rst index ec182346dea2..a2cf272af7a0 100644 --- a/Documentation/cgroup-v1/net_cls.txt +++ b/Documentation/cgroup-v1/net_cls.rst @@ -1,5 +1,6 @@ +========================= Network classifier cgroup -------------------------- +========================= The Network classifier cgroup provides an interface to tag network packets with a class identifier (classid). @@ -17,23 +18,27 @@ values is 0xAAAABBBB; AAAA is the major handle number and BBBB is the minor handle number. Reading net_cls.classid yields a decimal result. -Example: -mkdir /sys/fs/cgroup/net_cls -mount -t cgroup -onet_cls net_cls /sys/fs/cgroup/net_cls -mkdir /sys/fs/cgroup/net_cls/0 -echo 0x100001 > /sys/fs/cgroup/net_cls/0/net_cls.classid - - setting a 10:1 handle. 
+Example:: -cat /sys/fs/cgroup/net_cls/0/net_cls.classid -1048577 + mkdir /sys/fs/cgroup/net_cls + mount -t cgroup -onet_cls net_cls /sys/fs/cgroup/net_cls + mkdir /sys/fs/cgroup/net_cls/0 + echo 0x100001 > /sys/fs/cgroup/net_cls/0/net_cls.classid -configuring tc: -tc qdisc add dev eth0 root handle 10: htb +- setting a 10:1 handle:: -tc class add dev eth0 parent 10: classid 10:1 htb rate 40mbit - - creating traffic class 10:1 + cat /sys/fs/cgroup/net_cls/0/net_cls.classid + 1048577 -tc filter add dev eth0 parent 10: protocol ip prio 10 handle 1: cgroup +- configuring tc:: -configuring iptables, basic example: -iptables -A OUTPUT -m cgroup ! --cgroup 0x100001 -j DROP + tc qdisc add dev eth0 root handle 10: htb + tc class add dev eth0 parent 10: classid 10:1 htb rate 40mbit + +- creating traffic class 10:1:: + + tc filter add dev eth0 parent 10: protocol ip prio 10 handle 1: cgroup + +configuring iptables, basic example:: + + iptables -A OUTPUT -m cgroup ! --cgroup 0x100001 -j DROP diff --git a/Documentation/cgroup-v1/net_prio.txt b/Documentation/cgroup-v1/net_prio.rst similarity index 71% rename from Documentation/cgroup-v1/net_prio.txt rename to Documentation/cgroup-v1/net_prio.rst index a82cbd28ea8a..b40905871c64 100644 --- a/Documentation/cgroup-v1/net_prio.txt +++ b/Documentation/cgroup-v1/net_prio.rst @@ -1,5 +1,6 @@ +======================= Network priority cgroup -------------------------- +======================= The Network priority cgroup provides an interface to allow an administrator to dynamically set the priority of network traffic generated by various @@ -14,9 +15,9 @@ SO_PRIORITY socket option. This however, is not always possible because: This cgroup allows an administrator to assign a process to a group which defines the priority of egress traffic on a given interface. Network priority groups can -be created by first mounting the cgroup filesystem. 
+be created by first mounting the cgroup filesystem:: -# mount -t cgroup -onet_prio none /sys/fs/cgroup/net_prio + # mount -t cgroup -onet_prio none /sys/fs/cgroup/net_prio With the above step, the initial group acting as the parent accounting group becomes visible at '/sys/fs/cgroup/net_prio'. This group includes all tasks in @@ -25,17 +26,18 @@ the system. '/sys/fs/cgroup/net_prio/tasks' lists the tasks in this cgroup. Each net_prio cgroup contains two files that are subsystem specific net_prio.prioidx -This file is read-only, and is simply informative. It contains a unique integer -value that the kernel uses as an internal representation of this cgroup. + This file is read-only, and is simply informative. It contains a unique + integer value that the kernel uses as an internal representation of this + cgroup. net_prio.ifpriomap -This file contains a map of the priorities assigned to traffic originating from -processes in this group and egressing the system on various interfaces. It -contains a list of tuples in the form . Contents of this file -can be modified by echoing a string into the file using the same tuple format. -for example: + This file contains a map of the priorities assigned to traffic originating + from processes in this group and egressing the system on various interfaces. + It contains a list of tuples in the form . Contents of this + file can be modified by echoing a string into the file using the same tuple + format. 
For example:: -echo "eth0 5" > /sys/fs/cgroups/net_prio/iscsi/net_prio.ifpriomap + echo "eth0 5" > /sys/fs/cgroups/net_prio/iscsi/net_prio.ifpriomap This command would force any traffic originating from processes belonging to the iscsi net_prio cgroup and egressing on interface eth0 to have the priority of diff --git a/Documentation/cgroup-v1/pids.txt b/Documentation/cgroup-v1/pids.rst similarity index 62% rename from Documentation/cgroup-v1/pids.txt rename to Documentation/cgroup-v1/pids.rst index e105d708ccde..6acebd9e72c8 100644 --- a/Documentation/cgroup-v1/pids.txt +++ b/Documentation/cgroup-v1/pids.rst @@ -1,5 +1,6 @@ - Process Number Controller - ========================= +========================= +Process Number Controller +========================= Abstract -------- @@ -34,55 +35,58 @@ pids.current tracks all child cgroup hierarchies, so parent/pids.current is a superset of parent/child/pids.current. The pids.events file contains event counters: + - max: Number of times fork failed because limit was hit. 
Example ------- -First, we mount the pids controller: -# mkdir -p /sys/fs/cgroup/pids -# mount -t cgroup -o pids none /sys/fs/cgroup/pids +First, we mount the pids controller:: -Then we create a hierarchy, set limits and attach processes to it: -# mkdir -p /sys/fs/cgroup/pids/parent/child -# echo 2 > /sys/fs/cgroup/pids/parent/pids.max -# echo $$ > /sys/fs/cgroup/pids/parent/cgroup.procs -# cat /sys/fs/cgroup/pids/parent/pids.current -2 -# + # mkdir -p /sys/fs/cgroup/pids + # mount -t cgroup -o pids none /sys/fs/cgroup/pids + +Then we create a hierarchy, set limits and attach processes to it:: + + # mkdir -p /sys/fs/cgroup/pids/parent/child + # echo 2 > /sys/fs/cgroup/pids/parent/pids.max + # echo $$ > /sys/fs/cgroup/pids/parent/cgroup.procs + # cat /sys/fs/cgroup/pids/parent/pids.current + 2 + # It should be noted that attempts to overcome the set limit (2 in this case) will -fail: +fail:: -# cat /sys/fs/cgroup/pids/parent/pids.current -2 -# ( /bin/echo "Here's some processes for you." | cat ) -sh: fork: Resource temporary unavailable -# + # cat /sys/fs/cgroup/pids/parent/pids.current + 2 + # ( /bin/echo "Here's some processes for you." | cat ) + sh: fork: Resource temporary unavailable + # Even if we migrate to a child cgroup (which doesn't have a set limit), we will not be able to overcome the most stringent limit in the hierarchy (in this case, -parent's): +parent's):: -# echo $$ > /sys/fs/cgroup/pids/parent/child/cgroup.procs -# cat /sys/fs/cgroup/pids/parent/pids.current -2 -# cat /sys/fs/cgroup/pids/parent/child/pids.current -2 -# cat /sys/fs/cgroup/pids/parent/child/pids.max -max -# ( /bin/echo "Here's some processes for you." | cat ) -sh: fork: Resource temporary unavailable -# + # echo $$ > /sys/fs/cgroup/pids/parent/child/cgroup.procs + # cat /sys/fs/cgroup/pids/parent/pids.current + 2 + # cat /sys/fs/cgroup/pids/parent/child/pids.current + 2 + # cat /sys/fs/cgroup/pids/parent/child/pids.max + max + # ( /bin/echo "Here's some processes for you." 
| cat ) + sh: fork: Resource temporary unavailable + # We can set a limit that is smaller than pids.current, which will stop any new processes from being forked at all (note that the shell itself counts towards -pids.current): +pids.current):: -# echo 1 > /sys/fs/cgroup/pids/parent/pids.max -# /bin/echo "We can't even spawn a single process now." -sh: fork: Resource temporary unavailable -# echo 0 > /sys/fs/cgroup/pids/parent/pids.max -# /bin/echo "We can't even spawn a single process now." -sh: fork: Resource temporary unavailable -# + # echo 1 > /sys/fs/cgroup/pids/parent/pids.max + # /bin/echo "We can't even spawn a single process now." + sh: fork: Resource temporary unavailable + # echo 0 > /sys/fs/cgroup/pids/parent/pids.max + # /bin/echo "We can't even spawn a single process now." + sh: fork: Resource temporary unavailable + # diff --git a/Documentation/cgroup-v1/rdma.txt b/Documentation/cgroup-v1/rdma.rst similarity index 79% rename from Documentation/cgroup-v1/rdma.txt rename to Documentation/cgroup-v1/rdma.rst index 9bdb7fd03f83..2fcb0a9bf790 100644 --- a/Documentation/cgroup-v1/rdma.txt +++ b/Documentation/cgroup-v1/rdma.rst @@ -1,16 +1,17 @@ - RDMA Controller - ---------------- +=============== +RDMA Controller +=============== -Contents --------- +.. Contents -1. Overview - 1-1. What is RDMA controller? - 1-2. Why RDMA controller needed? - 1-3. How is RDMA controller implemented? -2. Usage Examples + 1. Overview + 1-1. What is RDMA controller? + 1-2. Why RDMA controller needed? + 1-3. How is RDMA controller implemented? + 2. Usage Examples 1. Overview +=========== 1-1. What is RDMA controller? ----------------------------- @@ -83,27 +84,34 @@ what is configured by user for a given cgroup and what is supported by IB device. Following resources can be accounted by rdma controller. + + ========== ============================= hca_handle Maximum number of HCA Handles hca_object Maximum number of HCA Objects + ========== ============================= 2. 
Usage Examples ------------------ - -(a) Configure resource limit: -echo mlx4_0 hca_handle=2 hca_object=2000 > /sys/fs/cgroup/rdma/1/rdma.max -echo ocrdma1 hca_handle=3 > /sys/fs/cgroup/rdma/2/rdma.max - -(b) Query resource limit: -cat /sys/fs/cgroup/rdma/2/rdma.max -#Output: -mlx4_0 hca_handle=2 hca_object=2000 -ocrdma1 hca_handle=3 hca_object=max - -(c) Query current usage: -cat /sys/fs/cgroup/rdma/2/rdma.current -#Output: -mlx4_0 hca_handle=1 hca_object=20 -ocrdma1 hca_handle=1 hca_object=23 - -(d) Delete resource limit: -echo echo mlx4_0 hca_handle=max hca_object=max > /sys/fs/cgroup/rdma/1/rdma.max +================= + +(a) Configure resource limit:: + + echo mlx4_0 hca_handle=2 hca_object=2000 > /sys/fs/cgroup/rdma/1/rdma.max + echo ocrdma1 hca_handle=3 > /sys/fs/cgroup/rdma/2/rdma.max + +(b) Query resource limit:: + + cat /sys/fs/cgroup/rdma/2/rdma.max + #Output: + mlx4_0 hca_handle=2 hca_object=2000 + ocrdma1 hca_handle=3 hca_object=max + +(c) Query current usage:: + + cat /sys/fs/cgroup/rdma/2/rdma.current + #Output: + mlx4_0 hca_handle=1 hca_object=20 + ocrdma1 hca_handle=1 hca_object=23 + +(d) Delete resource limit:: + + echo echo mlx4_0 hca_handle=max hca_object=max > /sys/fs/cgroup/rdma/1/rdma.max diff --git a/Documentation/filesystems/tmpfs.txt b/Documentation/filesystems/tmpfs.txt index d06e9a59a9f4..cad797a8a39e 100644 --- a/Documentation/filesystems/tmpfs.txt +++ b/Documentation/filesystems/tmpfs.txt @@ -98,7 +98,7 @@ A memory policy with a valid NodeList will be saved, as specified, for use at file creation time. When a task allocates a file in the file system, the mount option memory policy will be applied with a NodeList, if any, modified by the calling task's cpuset constraints -[See Documentation/cgroup-v1/cpusets.txt] and any optional flags, listed +[See Documentation/cgroup-v1/cpusets.rst] and any optional flags, listed below. 
If the resulting NodeLists is the empty set, the effective memory policy for the file will revert to "default" policy. diff --git a/Documentation/scheduler/sched-deadline.txt b/Documentation/scheduler/sched-deadline.txt index b14e03ff3528..a7514343b660 100644 --- a/Documentation/scheduler/sched-deadline.txt +++ b/Documentation/scheduler/sched-deadline.txt @@ -652,7 +652,7 @@ CONTENTS -deadline tasks cannot have an affinity mask smaller that the entire root_domain they are created on. However, affinities can be specified - through the cpuset facility (Documentation/cgroup-v1/cpusets.txt). + through the cpuset facility (Documentation/cgroup-v1/cpusets.rst). 5.1 SCHED_DEADLINE and cpusets HOWTO ------------------------------------ diff --git a/Documentation/scheduler/sched-design-CFS.txt b/Documentation/scheduler/sched-design-CFS.txt index edd861c94c1b..d1328890ef28 100644 --- a/Documentation/scheduler/sched-design-CFS.txt +++ b/Documentation/scheduler/sched-design-CFS.txt @@ -215,7 +215,7 @@ SCHED_BATCH) tasks. These options need CONFIG_CGROUPS to be defined, and let the administrator create arbitrary groups of tasks, using the "cgroup" pseudo filesystem. See - Documentation/cgroup-v1/cgroups.txt for more information about this filesystem. + Documentation/cgroup-v1/cgroups.rst for more information about this filesystem. When CONFIG_FAIR_GROUP_SCHED is defined, a "cpu.shares" file is created for each group created using the pseudo filesystem. See example steps below to create diff --git a/Documentation/scheduler/sched-rt-group.txt b/Documentation/scheduler/sched-rt-group.txt index d8fce3e78457..c09f7a3fee66 100644 --- a/Documentation/scheduler/sched-rt-group.txt +++ b/Documentation/scheduler/sched-rt-group.txt @@ -133,7 +133,7 @@ This uses the cgroup virtual file system and "/cpu.rt_runtime_us" to control the CPU time reserved for each control group. For more information on working with control groups, you should read -Documentation/cgroup-v1/cgroups.txt as well. 
+Documentation/cgroup-v1/cgroups.rst as well. Group settings are checked against the following limits in order to keep the configuration schedulable: diff --git a/Documentation/vm/numa.rst b/Documentation/vm/numa.rst index 5cae13e9a08b..0d830edae8fe 100644 --- a/Documentation/vm/numa.rst +++ b/Documentation/vm/numa.rst @@ -67,7 +67,7 @@ nodes. Each emulated node will manage a fraction of the underlying cells' physical memory. NUMA emluation is useful for testing NUMA kernel and application features on non-NUMA platforms, and as a sort of memory resource management mechanism when used together with cpusets. -[see Documentation/cgroup-v1/cpusets.txt] +[see Documentation/cgroup-v1/cpusets.rst] For each node with memory, Linux constructs an independent memory management subsystem, complete with its own free page lists, in-use page lists, usage @@ -114,7 +114,7 @@ allocation behavior using Linux NUMA memory policy. [see System administrators can restrict the CPUs and nodes' memories that a non- privileged user can specify in the scheduling or NUMA commands and functions -using control groups and CPUsets. [see Documentation/cgroup-v1/cpusets.txt] +using control groups and CPUsets. [see Documentation/cgroup-v1/cpusets.rst] On architectures that do not hide memoryless nodes, Linux will include only zones [nodes] with memory in the zonelists. This means that for a memoryless diff --git a/Documentation/vm/page_migration.rst b/Documentation/vm/page_migration.rst index f68d61335abb..35bba27d5fff 100644 --- a/Documentation/vm/page_migration.rst +++ b/Documentation/vm/page_migration.rst @@ -41,7 +41,7 @@ locations. Larger installations usually partition the system using cpusets into sections of nodes. Paul Jackson has equipped cpusets with the ability to move pages when a task is moved to another cpuset (See -Documentation/cgroup-v1/cpusets.txt). +Documentation/cgroup-v1/cpusets.rst). Cpusets allows the automation of process locality. 
If a task is moved to a new cpuset then also all its pages are moved with it so that the performance of the process does not sink dramatically. Also the pages diff --git a/Documentation/vm/unevictable-lru.rst b/Documentation/vm/unevictable-lru.rst index b8e29f977f2d..c6d94118fbcc 100644 --- a/Documentation/vm/unevictable-lru.rst +++ b/Documentation/vm/unevictable-lru.rst @@ -98,7 +98,7 @@ Memory Control Group Interaction -------------------------------- The unevictable LRU facility interacts with the memory control group [aka -memory controller; see Documentation/cgroup-v1/memory.txt] by extending the +memory controller; see Documentation/cgroup-v1/memory.rst] by extending the lru_list enum. The memory controller data structure automatically gets a per-zone unevictable diff --git a/Documentation/x86/x86_64/fake-numa-for-cpusets b/Documentation/x86/x86_64/fake-numa-for-cpusets index 4b09f18831f8..10b73bbea8eb 100644 --- a/Documentation/x86/x86_64/fake-numa-for-cpusets +++ b/Documentation/x86/x86_64/fake-numa-for-cpusets @@ -8,7 +8,7 @@ assign them to cpusets and their attached tasks. This is a way of limiting the amount of system memory that are available to a certain class of tasks. For more information on the features of cpusets, see -Documentation/cgroup-v1/cpusets.txt. +Documentation/cgroup-v1/cpusets.rst. There are a number of different configurations you can use for your needs. For more information on the numa=fake command line option and its various ways of configuring fake nodes, see Documentation/x86/x86_64/boot-options.txt. @@ -33,7 +33,7 @@ A machine may be split as follows with "numa=fake=4*512," as reported by dmesg: On node 3 totalpages: 131072 Now following the instructions for mounting the cpusets filesystem from -Documentation/cgroup-v1/cpusets.txt, you can assign fake nodes (i.e. contiguous memory +Documentation/cgroup-v1/cpusets.rst, you can assign fake nodes (i.e. 
contiguous memory address spaces) to individual cpusets: [root@xroads /]# mkdir exampleset diff --git a/MAINTAINERS b/MAINTAINERS index c8eebc8da565..1595b65e5249 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -4053,7 +4053,7 @@ W: http://www.bullopensource.org/cpuset/ W: http://oss.sgi.com/projects/cpusets/ T: git git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git S: Maintained -F: Documentation/cgroup-v1/cpusets.txt +F: Documentation/cgroup-v1/cpusets.rst F: include/linux/cpuset.h F: kernel/cgroup/cpuset.c diff --git a/block/Kconfig b/block/Kconfig index 1b220101a9cb..78374cb03114 100644 --- a/block/Kconfig +++ b/block/Kconfig @@ -88,7 +88,7 @@ config BLK_DEV_THROTTLING one needs to mount and use blkio cgroup controller for creating cgroups and specifying per device IO rate policies. - See Documentation/cgroup-v1/blkio-controller.txt for more information. + See Documentation/cgroup-v1/blkio-controller.rst for more information. config BLK_DEV_THROTTLING_LOW bool "Block throttling .low limit interface support (EXPERIMENTAL)" diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h index 53669fdd5fad..380a8f2d8e02 100644 --- a/include/linux/cgroup-defs.h +++ b/include/linux/cgroup-defs.h @@ -590,7 +590,7 @@ struct cftype { /* * Control Group subsystem type. - * See Documentation/cgroup-v1/cgroups.txt for details + * See Documentation/cgroup-v1/cgroups.rst for details */ struct cgroup_subsys { struct cgroup_subsys_state *(*css_alloc)(struct cgroup_subsys_state *parent_css); diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index eaf2d3284248..b8e159a7fdf1 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -781,7 +781,7 @@ union bpf_attr { * based on a user-provided identifier for all traffic coming from * the tasks belonging to the related cgroup. See also the related * kernel documentation, available from the Linux sources in file - * *Documentation/cgroup-v1/net_cls.txt*. 
+ * *Documentation/cgroup-v1/net_cls.rst*. * * The Linux kernel has two versions for cgroups: there are * cgroups v1 and cgroups v2. Both are available to users, who can diff --git a/init/Kconfig b/init/Kconfig index b050890f69dc..9b52c958fd92 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -788,7 +788,7 @@ config BLK_CGROUP CONFIG_CFQ_GROUP_IOSCHED=y; for enabling throttling policy, set CONFIG_BLK_DEV_THROTTLING=y. - See Documentation/cgroup-v1/blkio-controller.txt for more information. + See Documentation/cgroup-v1/blkio-controller.rst for more information. config DEBUG_BLK_CGROUP bool "IO controller debugging" diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c index 4834c4214e9c..44025fea3169 100644 --- a/kernel/cgroup/cpuset.c +++ b/kernel/cgroup/cpuset.c @@ -729,7 +729,7 @@ static inline int nr_cpusets(void) * load balancing domains (sched domains) as specified by that partial * partition. * - * See "What is sched_load_balance" in Documentation/cgroup-v1/cpusets.txt + * See "What is sched_load_balance" in Documentation/cgroup-v1/cpusets.rst * for a background explanation of this. * * Does not return errors, on the theory that the callers of this diff --git a/security/device_cgroup.c b/security/device_cgroup.c index dc28914fa72e..c07196502577 100644 --- a/security/device_cgroup.c +++ b/security/device_cgroup.c @@ -509,7 +509,7 @@ static inline int may_allow_all(struct dev_cgroup *parent) * This is one of the three key functions for hierarchy implementation. * This function is responsible for re-evaluating all the cgroup's active * exceptions due to a parent's exception change. - * Refer to Documentation/cgroup-v1/devices.txt for more details. + * Refer to Documentation/cgroup-v1/devices.rst for more details. 
*/ static void revalidate_active_exceptions(struct dev_cgroup *devcg) { diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h index 704bb69514a2..6d8bda1a5d68 100644 --- a/tools/include/uapi/linux/bpf.h +++ b/tools/include/uapi/linux/bpf.h @@ -781,7 +781,7 @@ union bpf_attr { * based on a user-provided identifier for all traffic coming from * the tasks belonging to the related cgroup. See also the related * kernel documentation, available from the Linux sources in file - * *Documentation/cgroup-v1/net_cls.txt*. + * *Documentation/cgroup-v1/net_cls.rst*. * * The Linux kernel has two versions for cgroups: there are * cgroups v1 and cgroups v2. Both are available to users, who can From patchwork Mon Apr 22 13:27:12 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Mauro Carvalho Chehab X-Patchwork-Id: 1088707 X-Patchwork-Delegate: davem@davemloft.net Return-Path: X-Original-To: patchwork-incoming-netdev@ozlabs.org Delivered-To: patchwork-incoming-netdev@ozlabs.org Authentication-Results: ozlabs.org; spf=none (mailfrom) smtp.mailfrom=vger.kernel.org (client-ip=209.132.180.67; helo=vger.kernel.org; envelope-from=netdev-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dmarc=fail (p=none dis=none) header.from=kernel.org Authentication-Results: ozlabs.org; dkim=fail reason="signature verification failed" (2048-bit key; unprotected) header.d=infradead.org header.i=@infradead.org header.b="R5fJWr3W"; dkim-atps=neutral Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by ozlabs.org (Postfix) with ESMTP id 44nnf130Ygz9sPY for ; Mon, 22 Apr 2019 23:35:53 +1000 (AEST) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727441AbfDVN2P (ORCPT ); Mon, 22 Apr 2019 09:28:15 -0400 Received: from bombadil.infradead.org ([198.137.202.133]:36868 "EHLO bombadil.infradead.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id 
S1727399AbfDVN2O (ORCPT ); Mon, 22 Apr 2019 09:28:14 -0400 DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=infradead.org; s=bombadil.20170209; h=Sender:Content-Transfer-Encoding: MIME-Version:References:In-Reply-To:Message-Id:Date:Subject:Cc:To:From: Reply-To:Content-Type:Content-ID:Content-Description:Resent-Date:Resent-From: Resent-Sender:Resent-To:Resent-Cc:Resent-Message-ID:List-Id:List-Help: List-Unsubscribe:List-Subscribe:List-Post:List-Owner:List-Archive; bh=GECUijow27B5LO3QbZmnOSuGvkmLy8YlG7BfwPjabio=; b=R5fJWr3WWddOKK3ZSE/DYhh05G UeJO2kZeMZ7NNymocSBPX+Bu2udtn2VKiXlmvYV2XNXSDsyYGL2C5ZdokLRp53d232fklQud06Y0V pzvT7tfWnyaLyqbbWRfAIqdahVtU3t9KE+lV6Ld11G17NYnfDl/lldAndJU94T7AWbFnyOyty1B8O vJRXTnvsLrwjwVM0vHxqp9eBE3Eg5J6Bn8KPZNw24TXlo07gjBemUjrWalUx7/Fo2uuXX6ZKc0U+F 9hGqcx6h85aIhc2tkGZv87zOMQXI5NhVsR8ILg6kn454LmLLtFDiRcOQAg7WojX4Ew4HwE81gS6Md igDAFfhw==; Received: from 179.176.125.229.dynamic.adsl.gvt.net.br ([179.176.125.229] helo=bombadil.infradead.org) by bombadil.infradead.org with esmtpsa (Exim 4.90_1 #2 (Red Hat Linux)) id 1hIYzV-0005Hp-5G; Mon, 22 Apr 2019 13:28:13 +0000 Received: from mchehab by bombadil.infradead.org with local (Exim 4.92) (envelope-from ) id 1hIYzS-0005lE-UL; Mon, 22 Apr 2019 10:28:10 -0300 From: Mauro Carvalho Chehab To: Linux Doc Mailing List Cc: Mauro Carvalho Chehab , Mauro Carvalho Chehab , linux-kernel@vger.kernel.org, Jonathan Corbet , Paul Moore , netdev@vger.kernel.org, linux-security-module@vger.kernel.org Subject: [PATCH v2 23/79] docs: netlabel: convert docs to ReST and rename to *.rst Date: Mon, 22 Apr 2019 10:27:12 -0300 Message-Id: <72133d276dd8cde2d1ee8528b4e87ab2a614cbd0.1555938376.git.mchehab+samsung@kernel.org> X-Mailer: git-send-email 2.20.1 In-Reply-To: References: MIME-Version: 1.0 Sender: netdev-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org Convert netlabel documentation to ReST. This was trivial: just add proper title markups. 
At its new index.rst, let's add a :orphan: while this is not linked to the main index.rst file, in order to avoid build warnings. Signed-off-by: Mauro Carvalho Chehab Acked-by: Paul Moore --- .../{cipso_ipv4.txt => cipso_ipv4.rst} | 19 +++++++++++------ Documentation/netlabel/draft_ietf.rst | 5 +++++ Documentation/netlabel/index.rst | 21 +++++++++++++++++++ .../{introduction.txt => introduction.rst} | 16 +++++++++----- .../{lsm_interface.txt => lsm_interface.rst} | 16 +++++++++----- 5 files changed, 61 insertions(+), 16 deletions(-) rename Documentation/netlabel/{cipso_ipv4.txt => cipso_ipv4.rst} (87%) create mode 100644 Documentation/netlabel/draft_ietf.rst create mode 100644 Documentation/netlabel/index.rst rename Documentation/netlabel/{introduction.txt => introduction.rst} (91%) rename Documentation/netlabel/{lsm_interface.txt => lsm_interface.rst} (88%) diff --git a/Documentation/netlabel/cipso_ipv4.txt b/Documentation/netlabel/cipso_ipv4.rst similarity index 87% rename from Documentation/netlabel/cipso_ipv4.txt rename to Documentation/netlabel/cipso_ipv4.rst index a6075481fd60..cbd3f3231221 100644 --- a/Documentation/netlabel/cipso_ipv4.txt +++ b/Documentation/netlabel/cipso_ipv4.rst @@ -1,10 +1,13 @@ +=================================== NetLabel CIPSO/IPv4 Protocol Engine -============================================================================== +=================================== + Paul Moore, paul.moore@hp.com May 17, 2006 - * Overview +Overview +======== The NetLabel CIPSO/IPv4 protocol engine is based on the IETF Commercial IP Security Option (CIPSO) draft from July 16, 1992. A copy of this @@ -13,7 +16,8 @@ draft can be found in this directory it to an RFC standard it has become a de-facto standard for labeled networking and is used in many trusted operating systems. 
- * Outbound Packet Processing +Outbound Packet Processing +========================== The CIPSO/IPv4 protocol engine applies the CIPSO IP option to packets by adding the CIPSO label to the socket. This causes all packets leaving the @@ -24,7 +28,8 @@ label by using the NetLabel security module API; if the NetLabel "domain" is configured to use CIPSO for packet labeling then a CIPSO IP option will be generated and attached to the socket. - * Inbound Packet Processing +Inbound Packet Processing +========================= The CIPSO/IPv4 protocol engine validates every CIPSO IP option it finds at the IP layer without any special handling required by the LSM. However, in order @@ -33,7 +38,8 @@ NetLabel security module API to extract the security attributes of the packet. This is typically done at the socket layer using the 'socket_sock_rcv_skb()' LSM hook. - * Label Translation +Label Translation +================= The CIPSO/IPv4 protocol engine contains a mechanism to translate CIPSO security attributes such as sensitivity level and category to values which are @@ -42,7 +48,8 @@ Domain Of Interpretation (DOI) definition and are configured through the NetLabel user space communication layer. Each DOI definition can have a different security attribute mapping table. - * Label Translation Cache +Label Translation Cache +======================= The NetLabel system provides a framework for caching security attribute mappings from the network labels to the corresponding LSM identifiers. The diff --git a/Documentation/netlabel/draft_ietf.rst b/Documentation/netlabel/draft_ietf.rst new file mode 100644 index 000000000000..5ed39ab8234b --- /dev/null +++ b/Documentation/netlabel/draft_ietf.rst @@ -0,0 +1,5 @@ +Draft IETF CIPSO IP Security +---------------------------- + + .. 
include:: draft-ietf-cipso-ipsecurity-01.txt + :literal: diff --git a/Documentation/netlabel/index.rst b/Documentation/netlabel/index.rst new file mode 100644 index 000000000000..47f1e0e5acd1 --- /dev/null +++ b/Documentation/netlabel/index.rst @@ -0,0 +1,21 @@ +:orphan: + +======== +NetLabel +======== + +.. toctree:: + :maxdepth: 1 + + introduction + cipso_ipv4 + lsm_interface + + draft_ietf + +.. only:: subproject and html + + Indices + ======= + + * :ref:`genindex` diff --git a/Documentation/netlabel/introduction.txt b/Documentation/netlabel/introduction.rst similarity index 91% rename from Documentation/netlabel/introduction.txt rename to Documentation/netlabel/introduction.rst index 3caf77bcff0f..9333bbb0adc1 100644 --- a/Documentation/netlabel/introduction.txt +++ b/Documentation/netlabel/introduction.rst @@ -1,10 +1,13 @@ +===================== NetLabel Introduction -============================================================================== +===================== + Paul Moore, paul.moore@hp.com August 2, 2006 - * Overview +Overview +======== NetLabel is a mechanism which can be used by kernel security modules to attach security attributes to outgoing network packets generated from user space @@ -12,7 +15,8 @@ applications and read security attributes from incoming network packets. It is composed of three main components, the protocol engines, the communication layer, and the kernel security module API. - * Protocol Engines +Protocol Engines +================ The protocol engines are responsible for both applying and retrieving the network packet's security attributes. If any translation between the network @@ -24,7 +28,8 @@ the NetLabel kernel security module API described below. Detailed information about each NetLabel protocol engine can be found in this directory. - * Communication Layer +Communication Layer +=================== The communication layer exists to allow NetLabel configuration and monitoring from user space. 
The NetLabel communication layer uses a message based @@ -33,7 +38,8 @@ formatting of these NetLabel messages as well as the Generic NETLINK family names can be found in the 'net/netlabel/' directory as comments in the header files as well as in 'include/net/netlabel.h'. - * Security Module API +Security Module API +=================== The purpose of the NetLabel security module API is to provide a protocol independent interface to the underlying NetLabel protocol engines. In addition diff --git a/Documentation/netlabel/lsm_interface.txt b/Documentation/netlabel/lsm_interface.rst similarity index 88% rename from Documentation/netlabel/lsm_interface.txt rename to Documentation/netlabel/lsm_interface.rst index 638c74f7de7f..026fc267f798 100644 --- a/Documentation/netlabel/lsm_interface.txt +++ b/Documentation/netlabel/lsm_interface.rst @@ -1,10 +1,13 @@ +======================================== NetLabel Linux Security Module Interface -============================================================================== +======================================== + Paul Moore, paul.moore@hp.com May 17, 2006 - * Overview +Overview +======== NetLabel is a mechanism which can set and retrieve security attributes from network packets. It is intended to be used by LSM developers who want to make @@ -12,7 +15,8 @@ use of a common code base for several different packet labeling protocols. The NetLabel security module API is defined in 'include/net/netlabel.h' but a brief overview is given below. - * NetLabel Security Attributes +NetLabel Security Attributes +============================ Since NetLabel supports multiple different packet labeling protocols and LSMs it uses the concept of security attributes to refer to the packet's security @@ -24,7 +28,8 @@ configuration. It is up to the LSM developer to translate the NetLabel security attributes into whatever security identifiers are in use for their particular LSM. 
- * NetLabel LSM Protocol Operations
+NetLabel LSM Protocol Operations
+================================
 
 These are the functions which allow the LSM developer to manipulate the labels
 on outgoing packets as well as read the labels on incoming packets. Functions
@@ -32,7 +37,8 @@ exist to operate both on sockets as well as the sk_buffs directly. These high
 level functions are translated into low level protocol operations based on how
 the administrator has configured the NetLabel subsystem.
 
- * NetLabel Label Mapping Cache Operations
+NetLabel Label Mapping Cache Operations
+=======================================
 
 Depending on the exact configuration, translation between the network packet
 label and the internal LSM security identifier can be time consuming. The

From patchwork Mon Apr 22 13:27:14 2019
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: Mauro Carvalho Chehab
X-Patchwork-Id: 1088696
X-Patchwork-Delegate: davem@davemloft.net
From: Mauro Carvalho Chehab
To: Linux Doc Mailing List
Cc: Mauro Carvalho Chehab, Mauro Carvalho Chehab,
 linux-kernel@vger.kernel.org, Jonathan Corbet, Sebastian Reichel,
 Bjorn Helgaas, "Rafael J. Wysocki", Viresh Kumar, Len Brown,
 Pavel Machek, Nishanth Menon, Stephen Boyd, Liam Girdwood, Mark Brown,
 Mathieu Poirier, Suzuki K Poulose, Harry Wei, Alex Shi, Thomas Gleixner,
 Ingo Molnar, Borislav Petkov, "H. Peter Anvin", x86@kernel.org,
 Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, David Airlie, Daniel Vetter,
 Johannes Berg, "David S. Miller", linux-pm@vger.kernel.org,
 linux-pci@vger.kernel.org, linux-arm-kernel@lists.infradead.org,
 intel-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org,
 linux-wireless@vger.kernel.org, netdev@vger.kernel.org
Subject: [PATCH v2 25/79] docs: convert docs to ReST and rename to *.rst
Date: Mon, 22 Apr 2019 10:27:14 -0300
Message-Id: <7adf9035ae06ecc6c7e46b51cb677f0a8f61d19a.1555938376.git.mchehab+samsung@kernel.org>
X-Mailer: git-send-email 2.20.1
X-Mailing-List: netdev@vger.kernel.org

Convert the PM documents to ReST, in order to allow them to
build with Sphinx.

The conversion is actually:
  - add blank lines and indentation in order to identify paragraphs;
  - fix tables markups;
  - add some lists markups;
  - mark literal blocks;
  - adjust title markups.

At its new index.rst, let's add a :orphan: while this is not linked to
the main index.rst file, in order to avoid build warnings.
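
For reference, a minimal sketch of what such an orphan index could look
like. It relies only on standard Sphinx behaviour: a ``:orphan:`` field at
the very top of a file is file-wide metadata that suppresses the "document
isn't included in any toctree" warning. The title and the three toctree
entries are illustrative, picked from file names renamed in this patch;
they are an assumption, not the actual contents of the new
Documentation/power/index.rst.

```rst
:orphan:

.. Illustrative sketch only: the title and entries below are assumed,
   not copied from the real Documentation/power/index.rst.

================
Power Management
================

.. toctree::
    :maxdepth: 1

    apm-acpi
    basic-pm-debugging
    charger-manager
```

Once the directory is linked from a parent toctree, the ``:orphan:`` line
is no longer needed and can be dropped.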
Signed-off-by: Mauro Carvalho Chehab
Acked-by: Mark Brown
---
 .../ABI/testing/sysfs-class-powercap          |   2 +-
 Documentation/PCI/pci.txt                     |   2 +-
 .../admin-guide/kernel-parameters.txt         |   6 +-
 Documentation/cpu-freq/core.rst               |   2 +-
 Documentation/driver-api/pm/devices.rst       |   6 +-
 .../driver-api/usb/power-management.rst       |   2 +-
 .../power/{apm-acpi.txt => apm-acpi.rst}      |  10 +-
 ...m-debugging.txt => basic-pm-debugging.rst} |  79 +--
 ...harger-manager.txt => charger-manager.rst} | 101 ++--
 ...rivers-testing.txt => drivers-testing.rst} |  15 +-
 .../{energy-model.txt => energy-model.rst}    | 101 ++--
 ...ing-of-tasks.txt => freezing-of-tasks.rst} |  91 ++--
 Documentation/power/index.rst                 |  46 ++
 .../power/{interface.txt => interface.rst}    |  24 +-
 Documentation/power/{opp.txt => opp.rst}      | 175 +++---
 Documentation/power/{pci.txt => pci.rst}      |  87 ++-
 ...qos_interface.txt => pm_qos_interface.rst} | 127 +++--
 Documentation/power/power_supply_class.rst    | 282 ++++++++++
 Documentation/power/power_supply_class.txt    | 231 --------
 Documentation/power/powercap/powercap.rst     | 257 +++++++++
 Documentation/power/powercap/powercap.txt     | 236 ---------
 .../regulator/{consumer.txt => consumer.rst}  | 141 ++---
 .../regulator/{design.txt => design.rst}      |   9 +-
 .../regulator/{machine.txt => machine.rst}    |  47 +-
 .../regulator/{overview.txt => overview.rst}  |  57 +-
 Documentation/power/regulator/regulator.rst   |  32 ++
 Documentation/power/regulator/regulator.txt   |  30 --
 .../power/{runtime_pm.txt => runtime_pm.rst}  | 234 ++++----
 Documentation/power/{s2ram.txt => s2ram.rst}  |  20 +-
 ...hotplug.txt => suspend-and-cpuhotplug.rst} |  42 +-
 ...errupts.txt => suspend-and-interrupts.rst} |   2 +
 ...ap-files.txt => swsusp-and-swap-files.rst} |  17 +-
 ...{swsusp-dmcrypt.txt => swsusp-dmcrypt.rst} | 120 ++---
 Documentation/power/swsusp.rst                | 501 ++++++++++++++++++
 Documentation/power/swsusp.txt                | 446 ----------------
 .../power/{tricks.txt => tricks.rst}          |   6 +-
 ...serland-swsusp.txt => userland-swsusp.rst} |  55 +-
 Documentation/power/{video.txt => video.rst}  | 156 +++---
 Documentation/process/submitting-drivers.rst  |   2 +-
 Documentation/scheduler/sched-energy.txt      |   6 +-
 Documentation/trace/coresight-cpu-debug.txt   |   2 +-
 .../zh_CN/process/submitting-drivers.rst      |   2 +-
 MAINTAINERS                                   |   4 +-
 arch/x86/Kconfig                              |   2 +-
 drivers/gpu/drm/i915/i915_drv.h               |   2 +-
 drivers/opp/Kconfig                           |   2 +-
 drivers/power/supply/power_supply_core.c      |   2 +-
 include/linux/interrupt.h                     |   2 +-
 include/linux/pm.h                            |   2 +-
 kernel/power/Kconfig                          |   6 +-
 net/wireless/Kconfig                          |   2 +-
 51 files changed, 2126 insertions(+), 1707 deletions(-)
 rename Documentation/power/{apm-acpi.txt => apm-acpi.rst} (87%)
 rename Documentation/power/{basic-pm-debugging.txt => basic-pm-debugging.rst} (87%)
 rename Documentation/power/{charger-manager.txt => charger-manager.rst} (78%)
 rename Documentation/power/{drivers-testing.txt => drivers-testing.rst} (86%)
 rename Documentation/power/{energy-model.txt => energy-model.rst} (74%)
 rename Documentation/power/{freezing-of-tasks.txt => freezing-of-tasks.rst} (75%)
 create mode 100644 Documentation/power/index.rst
 rename Documentation/power/{interface.txt => interface.rst} (84%)
 rename Documentation/power/{opp.txt => opp.rst} (78%)
 rename Documentation/power/{pci.txt => pci.rst} (97%)
 rename Documentation/power/{pm_qos_interface.txt => pm_qos_interface.rst} (62%)
 create mode 100644 Documentation/power/power_supply_class.rst
 delete mode 100644 Documentation/power/power_supply_class.txt
 create mode 100644 Documentation/power/powercap/powercap.rst
 delete mode 100644 Documentation/power/powercap/powercap.txt
 rename Documentation/power/regulator/{consumer.txt => consumer.rst} (61%)
 rename Documentation/power/regulator/{design.txt => design.rst} (86%)
 rename Documentation/power/regulator/{machine.txt => machine.rst} (75%)
 rename Documentation/power/regulator/{overview.txt => overview.rst} (79%)
 create mode 100644 Documentation/power/regulator/regulator.rst
 delete mode 100644 Documentation/power/regulator/regulator.txt
 rename Documentation/power/{runtime_pm.txt => runtime_pm.rst} (89%)
 rename Documentation/power/{s2ram.txt => s2ram.rst} (92%)
 rename Documentation/power/{suspend-and-cpuhotplug.txt => suspend-and-cpuhotplug.rst} (90%)
 rename Documentation/power/{suspend-and-interrupts.txt => suspend-and-interrupts.rst} (98%)
 rename Documentation/power/{swsusp-and-swap-files.txt => swsusp-and-swap-files.rst} (83%)
 rename Documentation/power/{swsusp-dmcrypt.txt => swsusp-dmcrypt.rst} (67%)
 create mode 100644 Documentation/power/swsusp.rst
 delete mode 100644 Documentation/power/swsusp.txt
 rename Documentation/power/{tricks.txt => tricks.rst} (93%)
 rename Documentation/power/{userland-swsusp.txt => userland-swsusp.rst} (85%)
 rename Documentation/power/{video.txt => video.rst} (56%)

diff --git a/Documentation/ABI/testing/sysfs-class-powercap b/Documentation/ABI/testing/sysfs-class-powercap
index db3b3ff70d84..742dfd966592 100644
--- a/Documentation/ABI/testing/sysfs-class-powercap
+++ b/Documentation/ABI/testing/sysfs-class-powercap
@@ -5,7 +5,7 @@ Contact: linux-pm@vger.kernel.org
 Description:
 		The powercap/ class sub directory belongs to the power cap
 		subsystem. Refer to
-		Documentation/power/powercap/powercap.txt for details.
+		Documentation/power/powercap/powercap.rst for details.
 
 What:		/sys/class/powercap/
 Date:		September 2013
diff --git a/Documentation/PCI/pci.txt b/Documentation/PCI/pci.txt
index badb26ac33dc..bbbae19f10b0 100644
--- a/Documentation/PCI/pci.txt
+++ b/Documentation/PCI/pci.txt
@@ -110,7 +110,7 @@ initialization with a pointer to a structure describing the driver
 	resume_early	Wake device from low power state.
 	resume		Wake device from low power state.
 
-		(Please see Documentation/power/pci.txt for descriptions
+		(Please see Documentation/power/pci.rst for descriptions
 		of PCI Power Management and the related functions.)
 
 	shutdown	Hook into reboot_notifier_list (kernel/sys.c).
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index a4e8e6435fff..fdc04f23d093 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -13,7 +13,7 @@ For ARM64, ONLY "acpi=off", "acpi=on" or "acpi=force" are available - See also Documentation/power/runtime_pm.txt, pci=noacpi + See also Documentation/power/runtime_pm.rst, pci=noacpi acpi_apic_instance= [ACPI, IOAPIC] Format: @@ -223,7 +223,7 @@ acpi_sleep= [HW,ACPI] Sleep options Format: { s3_bios, s3_mode, s3_beep, s4_nohwsig, old_ordering, nonvs, sci_force_enable, nobl } - See Documentation/power/video.txt for information on + See Documentation/power/video.rst for information on s3_bios and s3_mode. s3_beep is for debugging; it makes the PC's speaker beep as soon as the kernel's real-mode entry point is called. @@ -4040,7 +4040,7 @@ Specify the offset from the beginning of the partition given by "resume=" at which the swap header is located, in units (needed only for swap files). - See Documentation/power/swsusp-and-swap-files.txt + See Documentation/power/swsusp-and-swap-files.rst resumedelay= [HIBERNATION] Delay (in seconds) to pause before attempting to read the resume files diff --git a/Documentation/cpu-freq/core.rst b/Documentation/cpu-freq/core.rst index c719e3cb700c..003faebd42c2 100644 --- a/Documentation/cpu-freq/core.rst +++ b/Documentation/cpu-freq/core.rst @@ -89,7 +89,7 @@ flags flags of the cpufreq driver 3. 
CPUFreq Table Generation with Operating Performance Point (OPP) ================================================================== -For details about OPP, see Documentation/power/opp.txt +For details about OPP, see Documentation/power/opp.rst dev_pm_opp_init_cpufreq_table This function provides a ready to use conversion routine to translate diff --git a/Documentation/driver-api/pm/devices.rst b/Documentation/driver-api/pm/devices.rst index 30835683616a..f66c7b9126ea 100644 --- a/Documentation/driver-api/pm/devices.rst +++ b/Documentation/driver-api/pm/devices.rst @@ -225,7 +225,7 @@ system-wide transition to a sleep state even though its :c:member:`runtime_auto` flag is clear. For more information about the runtime power management framework, refer to -:file:`Documentation/power/runtime_pm.txt`. +:file:`Documentation/power/runtime_pm.rst`. Calling Drivers to Enter and Leave System Sleep States @@ -728,7 +728,7 @@ it into account in any way. Devices may be defined as IRQ-safe which indicates to the PM core that their runtime PM callbacks may be invoked with disabled interrupts (see -:file:`Documentation/power/runtime_pm.txt` for more information). If an +:file:`Documentation/power/runtime_pm.rst` for more information). If an IRQ-safe device belongs to a PM domain, the runtime PM of the domain will be disallowed, unless the domain itself is defined as IRQ-safe. However, it makes sense to define a PM domain as IRQ-safe only if all the devices in it @@ -795,7 +795,7 @@ so on) and the final state of the device must reflect the "active" runtime PM status in that case. During system-wide resume from a sleep state it's easiest to put devices into -the full-power state, as explained in :file:`Documentation/power/runtime_pm.txt`. +the full-power state, as explained in :file:`Documentation/power/runtime_pm.rst`. [Refer to that document for more information regarding this particular issue as well as for information on the device runtime power management framework in general.] 
diff --git a/Documentation/driver-api/usb/power-management.rst b/Documentation/driver-api/usb/power-management.rst index 79beb807996b..4f1f0e7e5d9f 100644 --- a/Documentation/driver-api/usb/power-management.rst +++ b/Documentation/driver-api/usb/power-management.rst @@ -46,7 +46,7 @@ device is turned off while the system as a whole remains running, we call it a "dynamic suspend" (also known as a "runtime suspend" or "selective suspend"). This document concentrates mostly on how dynamic PM is implemented in the USB subsystem, although system PM is -covered to some extent (see ``Documentation/power/*.txt`` for more +covered to some extent (see ``Documentation/power/*.rst`` for more information about system PM). System PM support is present only if the kernel was built with diff --git a/Documentation/power/apm-acpi.txt b/Documentation/power/apm-acpi.rst similarity index 87% rename from Documentation/power/apm-acpi.txt rename to Documentation/power/apm-acpi.rst index 6cc423d3662e..5b90d947126d 100644 --- a/Documentation/power/apm-acpi.txt +++ b/Documentation/power/apm-acpi.rst @@ -1,5 +1,7 @@ +============ APM or ACPI? ------------- +============ + If you have a relatively recent x86 mobile, desktop, or server system, odds are it supports either Advanced Power Management (APM) or Advanced Configuration and Power Interface (ACPI). ACPI is the newer @@ -28,5 +30,7 @@ and be sure that they are started sometime in the system boot process. Go ahead and start both. If ACPI or APM is not available on your system the associated daemon will exit gracefully. 
- apmd: http://ftp.debian.org/pool/main/a/apmd/ - acpid: http://acpid.sf.net/ + ===== ======================================= + apmd http://ftp.debian.org/pool/main/a/apmd/ + acpid http://acpid.sf.net/ + ===== ======================================= diff --git a/Documentation/power/basic-pm-debugging.txt b/Documentation/power/basic-pm-debugging.rst similarity index 87% rename from Documentation/power/basic-pm-debugging.txt rename to Documentation/power/basic-pm-debugging.rst index 708f87f78a75..69862e759c30 100644 --- a/Documentation/power/basic-pm-debugging.txt +++ b/Documentation/power/basic-pm-debugging.rst @@ -1,12 +1,16 @@ +================================= Debugging hibernation and suspend +================================= + (C) 2007 Rafael J. Wysocki , GPL 1. Testing hibernation (aka suspend to disk or STD) +=================================================== -To check if hibernation works, you can try to hibernate in the "reboot" mode: +To check if hibernation works, you can try to hibernate in the "reboot" mode:: -# echo reboot > /sys/power/disk -# echo disk > /sys/power/state + # echo reboot > /sys/power/disk + # echo disk > /sys/power/state and the system should create a hibernation image, reboot, resume and get back to the command prompt where you have started the transition. If that happens, @@ -15,20 +19,21 @@ test at least a couple of times in a row for confidence. [This is necessary, because some problems only show up on a second attempt at suspending and resuming the system.] Moreover, hibernating in the "reboot" and "shutdown" modes causes the PM core to skip some platform-related callbacks which on ACPI -systems might be necessary to make hibernation work. Thus, if your machine fails -to hibernate or resume in the "reboot" mode, you should try the "platform" mode: +systems might be necessary to make hibernation work. 
Thus, if your machine +fails to hibernate or resume in the "reboot" mode, you should try the +"platform" mode:: -# echo platform > /sys/power/disk -# echo disk > /sys/power/state + # echo platform > /sys/power/disk + # echo disk > /sys/power/state which is the default and recommended mode of hibernation. Unfortunately, the "platform" mode of hibernation does not work on some systems with broken BIOSes. In such cases the "shutdown" mode of hibernation might -work: +work:: -# echo shutdown > /sys/power/disk -# echo disk > /sys/power/state + # echo shutdown > /sys/power/disk + # echo disk > /sys/power/state (it is similar to the "reboot" mode, but it requires you to press the power button to make the system resume). @@ -37,6 +42,7 @@ If neither "platform" nor "shutdown" hibernation mode works, you will need to identify what goes wrong. a) Test modes of hibernation +---------------------------- To find out why hibernation fails on your system, you can use a special testing facility available if the kernel is compiled with CONFIG_PM_DEBUG set. Then, @@ -44,36 +50,38 @@ there is the file /sys/power/pm_test that can be used to make the hibernation core run in a test mode. 
There are 5 test modes available: freezer -- test the freezing of processes + - test the freezing of processes devices -- test the freezing of processes and suspending of devices + - test the freezing of processes and suspending of devices platform -- test the freezing of processes, suspending of devices and platform - global control methods(*) + - test the freezing of processes, suspending of devices and platform + global control methods [1]_ processors -- test the freezing of processes, suspending of devices, platform - global control methods(*) and the disabling of nonboot CPUs + - test the freezing of processes, suspending of devices, platform + global control methods [1]_ and the disabling of nonboot CPUs core -- test the freezing of processes, suspending of devices, platform global - control methods(*), the disabling of nonboot CPUs and suspending of - platform/system devices + - test the freezing of processes, suspending of devices, platform global + control methods\ [1]_, the disabling of nonboot CPUs and suspending + of platform/system devices -(*) the platform global control methods are only available on ACPI systems +.. [1] + + the platform global control methods are only available on ACPI systems and are only tested if the hibernation mode is set to "platform" To use one of them it is necessary to write the corresponding string to /sys/power/pm_test (eg. "devices" to test the freezing of processes and suspending devices) and issue the standard hibernation commands. 
For example, to use the "devices" test mode along with the "platform" mode of hibernation, -you should do the following: +you should do the following:: -# echo devices > /sys/power/pm_test -# echo platform > /sys/power/disk -# echo disk > /sys/power/state + # echo devices > /sys/power/pm_test + # echo platform > /sys/power/disk + # echo disk > /sys/power/state Then, the kernel will try to freeze processes, suspend devices, wait a few seconds (5 by default, but configurable by the suspend.pm_test_delay module @@ -108,11 +116,12 @@ If the "devices" test fails, most likely there is a driver that cannot suspend or resume its device (in the latter case the system may hang or become unstable after the test, so please take that into consideration). To find this driver, you can carry out a binary search according to the rules: + - if the test fails, unload a half of the drivers currently loaded and repeat -(that would probably involve rebooting the system, so always note what drivers -have been loaded before the test), + (that would probably involve rebooting the system, so always note what drivers + have been loaded before the test), - if the test succeeds, load a half of the drivers you have unloaded most -recently and repeat. + recently and repeat. Once you have found the failing driver (there can be more than just one of them), you have to unload it every time before hibernation. In that case please @@ -146,6 +155,7 @@ indicates a serious problem that very well may be related to the hardware, but please report it anyway. b) Testing minimal configuration +-------------------------------- If all of the hibernation test modes work, you can boot the system with the "init=/bin/bash" command line parameter and attempt to hibernate in the @@ -165,14 +175,15 @@ Again, if you find the offending module(s), it(they) must be unloaded every time before hibernation, and please report the problem with it(them). 
c) Using the "test_resume" hibernation option +--------------------------------------------- /sys/power/disk generally tells the kernel what to do after creating a hibernation image. One of the available options is "test_resume" which causes the just created image to be used for immediate restoration. Namely, -after doing: +after doing:: -# echo test_resume > /sys/power/disk -# echo disk > /sys/power/state + # echo test_resume > /sys/power/disk + # echo disk > /sys/power/state a hibernation image will be created and a resume from it will be triggered immediately without involving the platform firmware in any way. @@ -190,6 +201,7 @@ to resume may be related to the differences between the restore and image kernels. d) Advanced debugging +--------------------- In case that hibernation does not work on your system even in the minimal configuration and compiling more drivers as modules is not practical or some @@ -200,9 +212,10 @@ kernel messages using the serial console. This may provide you with some information about the reasons of the suspend (resume) failure. Alternatively, it may be possible to use a FireWire port for debugging with firescope (http://v3.sk/~lkundrak/firescope/). On x86 it is also possible to -use the PM_TRACE mechanism documented in Documentation/power/s2ram.txt . +use the PM_TRACE mechanism documented in Documentation/power/s2ram.rst . 2. Testing suspend to RAM (STR) +=============================== To verify that the STR works, it is generally more convenient to use the s2ram tool available from http://suspend.sf.net and documented at @@ -230,7 +243,8 @@ you will have to unload them every time before an STR transition (ie. before you run s2ram), and please report the problems with them. There is a debugfs entry which shows the suspend to RAM statistics. Here is an -example of its output. 
+example of its output:: + # mount -t debugfs none /sys/kernel/debug # cat /sys/kernel/debug/suspend_stats success: 20 @@ -248,6 +262,7 @@ example of its output. -16 last_failed_step: suspend suspend + Field success means the success number of suspend to RAM, and field fail means the failure number. Others are the failure number of different steps of suspend to RAM. suspend_stats just lists the last 2 failed devices, error number and diff --git a/Documentation/power/charger-manager.txt b/Documentation/power/charger-manager.rst similarity index 78% rename from Documentation/power/charger-manager.txt rename to Documentation/power/charger-manager.rst index 9ff1105e58d6..84fab9376792 100644 --- a/Documentation/power/charger-manager.txt +++ b/Documentation/power/charger-manager.rst @@ -1,4 +1,7 @@ +=============== Charger Manager +=============== + (C) 2011 MyungJoo Ham , GPL Charger Manager provides in-kernel battery charger management that @@ -55,41 +58,39 @@ Charger Manager supports the following: notification to users with UEVENT. 2. Global Charger-Manager Data related with suspend_again -======================================================== +========================================================= In order to setup Charger Manager with suspend-again feature (in-suspend monitoring), the user should provide charger_global_desc -with setup_charger_manager(struct charger_global_desc *). +with setup_charger_manager(`struct charger_global_desc *`). This charger_global_desc data for in-suspend monitoring is global as the name suggests. Thus, the user needs to provide only once even if there are multiple batteries. If there are multiple batteries, the multiple instances of Charger Manager share the same charger_global_desc and it will manage in-suspend monitoring for all instances of Charger Manager. 
-The user needs to provide all the three entries properly in order to activate -in-suspend monitoring: +The user needs to provide all the three entries to `struct charger_global_desc` +properly in order to activate in-suspend monitoring: -struct charger_global_desc { - -char *rtc_name; - : The name of rtc (e.g., "rtc0") used to wakeup the system from +`char *rtc_name;` + The name of rtc (e.g., "rtc0") used to wakeup the system from suspend for Charger Manager. The alarm interrupt (AIE) of the rtc should be able to wake up the system from suspend. Charger Manager saves and restores the alarm value and use the previously-defined alarm if it is going to go off earlier than Charger Manager so that Charger Manager does not interfere with previously-defined alarms. -bool (*rtc_only_wakeup)(void); - : This callback should let CM know whether +`bool (*rtc_only_wakeup)(void);` + This callback should let CM know whether the wakeup-from-suspend is caused only by the alarm of "rtc" in the same struct. If there is any other wakeup source triggered the wakeup, it should return false. If the "rtc" is the only wakeup reason, it should return true. -bool assume_timer_stops_in_suspend; - : if true, Charger Manager assumes that +`bool assume_timer_stops_in_suspend;` + if true, Charger Manager assumes that the timer (CM uses jiffies as timer) stops during suspend. Then, CM assumes that the suspend-duration is same as the alarm length. -}; + 3. How to setup suspend_again ============================= @@ -109,26 +110,28 @@ if the system was woken up by Charger Manager and the polling ============================================= For each battery charged independently from other batteries (if a series of batteries are charged by a single charger, they are counted as one independent -battery), an instance of Charger Manager is attached to it. +battery), an instance of Charger Manager is attached to it. 
The following -struct charger_desc { +struct charger_desc elements: -char *psy_name; - : The power-supply-class name of the battery. Default is +`char *psy_name;` + The power-supply-class name of the battery. Default is "battery" if psy_name is NULL. Users can access the psy entries at "/sys/class/power_supply/[psy_name]/". -enum polling_modes polling_mode; - : CM_POLL_DISABLE: do not poll this battery. - CM_POLL_ALWAYS: always poll this battery. - CM_POLL_EXTERNAL_POWER_ONLY: poll this battery if and only if - an external power source is attached. - CM_POLL_CHARGING_ONLY: poll this battery if and only if the - battery is being charged. +`enum polling_modes polling_mode;` + CM_POLL_DISABLE: + do not poll this battery. + CM_POLL_ALWAYS: + always poll this battery. + CM_POLL_EXTERNAL_POWER_ONLY: + poll this battery if and only if an external power + source is attached. + CM_POLL_CHARGING_ONLY: + poll this battery if and only if the battery is being charged. -unsigned int fullbatt_vchkdrop_ms; -unsigned int fullbatt_vchkdrop_uV; - : If both have non-zero values, Charger Manager will check the +`unsigned int fullbatt_vchkdrop_ms; / unsigned int fullbatt_vchkdrop_uV;` + If both have non-zero values, Charger Manager will check the battery voltage drop fullbatt_vchkdrop_ms after the battery is fully charged. If the voltage drop is over fullbatt_vchkdrop_uV, Charger Manager will try to recharge the battery by disabling and enabling @@ -136,50 +139,52 @@ unsigned int fullbatt_vchkdrop_uV; condition) is needed to be implemented with hardware interrupts from fuel gauges or charger devices/chips. -unsigned int fullbatt_uV; - : If specified with a non-zero value, Charger Manager assumes +`unsigned int fullbatt_uV;` + If specified with a non-zero value, Charger Manager assumes that the battery is full (capacity = 100) if the battery is not being charged and the battery voltage is equal to or greater than fullbatt_uV. 
-unsigned int polling_interval_ms; - : Required polling interval in ms. Charger Manager will poll +`unsigned int polling_interval_ms;` + Required polling interval in ms. Charger Manager will poll this battery every polling_interval_ms or more frequently. -enum data_source battery_present; - : CM_BATTERY_PRESENT: assume that the battery exists. - CM_NO_BATTERY: assume that the battery does not exists. - CM_FUEL_GAUGE: get battery presence information from fuel gauge. - CM_CHARGER_STAT: get battery presence from chargers. +`enum data_source battery_present;` + CM_BATTERY_PRESENT: + assume that the battery exists. + CM_NO_BATTERY: + assume that the battery does not exists. + CM_FUEL_GAUGE: + get battery presence information from fuel gauge. + CM_CHARGER_STAT: + get battery presence from chargers. -char **psy_charger_stat; - : An array ending with NULL that has power-supply-class names of +`char **psy_charger_stat;` + An array ending with NULL that has power-supply-class names of chargers. Each power-supply-class should provide "PRESENT" (if battery_present is "CM_CHARGER_STAT"), "ONLINE" (shows whether an external power source is attached or not), and "STATUS" (shows whether the battery is {"FULL" or not FULL} or {"FULL", "Charging", "Discharging", "NotCharging"}). -int num_charger_regulators; -struct regulator_bulk_data *charger_regulators; - : Regulators representing the chargers in the form for +`int num_charger_regulators; / struct regulator_bulk_data *charger_regulators;` + Regulators representing the chargers in the form for regulator framework's bulk functions. -char *psy_fuel_gauge; - : Power-supply-class name of the fuel gauge. +`char *psy_fuel_gauge;` + Power-supply-class name of the fuel gauge. 
-int (*temperature_out_of_range)(int *mC); -bool measure_battery_temp; - : This callback returns 0 if the temperature is safe for charging, +`int (*temperature_out_of_range)(int *mC); / bool measure_battery_temp;` + This callback returns 0 if the temperature is safe for charging, a positive number if it is too hot to charge, and a negative number if it is too cold to charge. With the variable mC, the callback returns the temperature in 1/1000 of centigrade. The source of temperature can be battery or ambient one according to the value of measure_battery_temp. -}; + 5. Notify Charger-Manager of charger events: cm_notify_event() -========================================================= +============================================================== If there is an charger event is required to notify Charger Manager, a charger device driver that triggers the event can call cm_notify_event(psy, type, msg) to notify the corresponding Charger Manager. diff --git a/Documentation/power/drivers-testing.txt b/Documentation/power/drivers-testing.rst similarity index 86% rename from Documentation/power/drivers-testing.txt rename to Documentation/power/drivers-testing.rst index 638afdf4d6b8..e53f1999fc39 100644 --- a/Documentation/power/drivers-testing.txt +++ b/Documentation/power/drivers-testing.rst @@ -1,7 +1,11 @@ +==================================================== Testing suspend and resume support in device drivers +==================================================== + (C) 2007 Rafael J. Wysocki , GPL 1. Preparing the test system +============================ Unfortunately, to effectively test the support for the system-wide suspend and resume transitions in a driver, it is necessary to suspend and resume a fully @@ -14,19 +18,20 @@ the machine's BIOS. Of course, for this purpose the test system has to be known to suspend and resume without the driver being tested. 
Thus, if possible, you should first resolve all suspend/resume-related problems in the test system before you start -testing the new driver. Please see Documentation/power/basic-pm-debugging.txt +testing the new driver. Please see Documentation/power/basic-pm-debugging.rst for more information about the debugging of suspend/resume functionality. 2. Testing the driver +===================== Once you have resolved the suspend/resume-related problems with your test system without the new driver, you are ready to test it: a) Build the driver as a module, load it and try the test modes of hibernation - (see: Documentation/power/basic-pm-debugging.txt, 1). + (see: Documentation/power/basic-pm-debugging.rst, 1). b) Load the driver and attempt to hibernate in the "reboot", "shutdown" and - "platform" modes (see: Documentation/power/basic-pm-debugging.txt, 1). + "platform" modes (see: Documentation/power/basic-pm-debugging.rst, 1). c) Compile the driver directly into the kernel and try the test modes of hibernation. @@ -34,12 +39,12 @@ c) Compile the driver directly into the kernel and try the test modes of d) Attempt to hibernate with the driver compiled directly into the kernel in the "reboot", "shutdown" and "platform" modes. -e) Try the test modes of suspend (see: Documentation/power/basic-pm-debugging.txt, +e) Try the test modes of suspend (see: Documentation/power/basic-pm-debugging.rst, 2). [As far as the STR tests are concerned, it should not matter whether or not the driver is built as a module.] f) Attempt to suspend to RAM using the s2ram tool with the driver loaded - (see: Documentation/power/basic-pm-debugging.txt, 2). + (see: Documentation/power/basic-pm-debugging.rst, 2). Each of the above tests should be repeated several times and the STD tests should be mixed with the STR tests. 
If any of them fails, the driver cannot be diff --git a/Documentation/power/energy-model.txt b/Documentation/power/energy-model.rst similarity index 74% rename from Documentation/power/energy-model.txt rename to Documentation/power/energy-model.rst index a2b0ae4c76bd..90a345d57ae9 100644 --- a/Documentation/power/energy-model.txt +++ b/Documentation/power/energy-model.rst @@ -1,6 +1,6 @@ - ==================== - Energy Model of CPUs - ==================== +==================== +Energy Model of CPUs +==================== 1. Overview ----------- @@ -20,7 +20,7 @@ kernel, hence enabling to avoid redundant work. The figure below depicts an example of drivers (Arm-specific here, but the approach is applicable to any architecture) providing power costs to the EM -framework, and interested clients reading the data from it. +framework, and interested clients reading the data from it:: +---------------+ +-----------------+ +---------------+ | Thermal (IPA) | | Scheduler (EAS) | | Other | @@ -58,15 +58,17 @@ micro-architectures. 2. Core APIs ------------ - 2.1 Config options +2.1 Config options +^^^^^^^^^^^^^^^^^^ CONFIG_ENERGY_MODEL must be enabled to use the EM framework. - 2.2 Registration of performance domains +2.2 Registration of performance domains +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Drivers are expected to register performance domains into the EM framework by -calling the following API: +calling the following API:: int em_register_perf_domain(cpumask_t *span, unsigned int nr_states, struct em_data_callback *cb); @@ -80,7 +82,8 @@ callback, and kernel/power/energy_model.c for further documentation on this API. - 2.3 Accessing performance domains +2.3 Accessing performance domains +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Subsystems interested in the energy model of a CPU can retrieve it using the em_cpu_get() API. 
The energy model tables are allocated once upon creation of @@ -99,46 +102,46 @@ More details about the above APIs can be found in include/linux/energy_model.h. This section provides a simple example of a CPUFreq driver registering a performance domain in the Energy Model framework using the (fake) 'foo' protocol. The driver implements an est_power() function to be provided to the -EM framework. +EM framework:: - -> drivers/cpufreq/foo_cpufreq.c + -> drivers/cpufreq/foo_cpufreq.c -01 static int est_power(unsigned long *mW, unsigned long *KHz, int cpu) -02 { -03 long freq, power; -04 -05 /* Use the 'foo' protocol to ceil the frequency */ -06 freq = foo_get_freq_ceil(cpu, *KHz); -07 if (freq < 0); -08 return freq; -09 -10 /* Estimate the power cost for the CPU at the relevant freq. */ -11 power = foo_estimate_power(cpu, freq); -12 if (power < 0); -13 return power; -14 -15 /* Return the values to the EM framework */ -16 *mW = power; -17 *KHz = freq; -18 -19 return 0; -20 } -21 -22 static int foo_cpufreq_init(struct cpufreq_policy *policy) -23 { -24 struct em_data_callback em_cb = EM_DATA_CB(est_power); -25 int nr_opp, ret; -26 -27 /* Do the actual CPUFreq init work ... */ -28 ret = do_foo_cpufreq_init(policy); -29 if (ret) -30 return ret; -31 -32 /* Find the number of OPPs for this policy */ -33 nr_opp = foo_get_nr_opp(policy); -34 -35 /* And register the new performance domain */ -36 em_register_perf_domain(policy->cpus, nr_opp, &em_cb); -37 -38 return 0; -39 } + 01 static int est_power(unsigned long *mW, unsigned long *KHz, int cpu) + 02 { + 03 long freq, power; + 04 + 05 /* Use the 'foo' protocol to ceil the frequency */ + 06 freq = foo_get_freq_ceil(cpu, *KHz); + 07 if (freq < 0) + 08 return freq; + 09 + 10 /* Estimate the power cost for the CPU at the relevant freq.
*/ + 11 power = foo_estimate_power(cpu, freq); + 12 if (power < 0) + 13 return power; + 14 + 15 /* Return the values to the EM framework */ + 16 *mW = power; + 17 *KHz = freq; + 18 + 19 return 0; + 20 } + 21 + 22 static int foo_cpufreq_init(struct cpufreq_policy *policy) + 23 { + 24 struct em_data_callback em_cb = EM_DATA_CB(est_power); + 25 int nr_opp, ret; + 26 + 27 /* Do the actual CPUFreq init work ... */ + 28 ret = do_foo_cpufreq_init(policy); + 29 if (ret) + 30 return ret; + 31 + 32 /* Find the number of OPPs for this policy */ + 33 nr_opp = foo_get_nr_opp(policy); + 34 + 35 /* And register the new performance domain */ + 36 em_register_perf_domain(policy->cpus, nr_opp, &em_cb); + 37 + 38 return 0; + 39 } diff --git a/Documentation/power/freezing-of-tasks.txt b/Documentation/power/freezing-of-tasks.rst similarity index 75% rename from Documentation/power/freezing-of-tasks.txt rename to Documentation/power/freezing-of-tasks.rst index cd283190855a..ef110fe55e82 100644 --- a/Documentation/power/freezing-of-tasks.txt +++ b/Documentation/power/freezing-of-tasks.rst @@ -1,13 +1,18 @@ +================= Freezing of tasks - (C) 2007 Rafael J. Wysocki , GPL +================= + +(C) 2007 Rafael J. Wysocki , GPL I. What is the freezing of tasks? +================================= The freezing of tasks is a mechanism by which user space processes and some kernel threads are controlled during hibernation or system-wide suspend (on some architectures). II. How does it work? +===================== There are three per-task flags used for that, PF_NOFREEZE, PF_FROZEN and PF_FREEZER_SKIP (the last one is auxiliary). The tasks that have @@ -41,7 +46,7 @@ explicitly in suitable places or use the wait_event_freezable() or wait_event_freezable_timeout() macros (defined in include/linux/freezer.h) that combine interruptible sleep with checking if the task is to be frozen and calling try_to_freeze(). The main loop of a freezable kernel thread may look
The main loop of a freezable kernel thread may look -like the following one: +like the following one:: set_freezable(); do { @@ -65,7 +70,7 @@ order to clear the PF_FROZEN flag for each frozen task. Then, the tasks that have been frozen leave __refrigerator() and continue running. -Rationale behind the functions dealing with freezing and thawing of tasks: +Rationale behind the functions dealing with freezing and thawing of tasks ------------------------------------------------------------------------- freeze_processes(): @@ -86,6 +91,7 @@ thaw_processes(): III. Which kernel threads are freezable? +======================================== Kernel threads are not freezable by default. However, a kernel thread may clear PF_NOFREEZE for itself by calling set_freezable() (the resetting of PF_NOFREEZE @@ -93,37 +99,39 @@ directly is not allowed). From this point it is regarded as freezable and must call try_to_freeze() in a suitable place. IV. Why do we do that? +====================== Generally speaking, there is a couple of reasons to use the freezing of tasks: 1. The principal reason is to prevent filesystems from being damaged after -hibernation. At the moment we have no simple means of checkpointing -filesystems, so if there are any modifications made to filesystem data and/or -metadata on disks, we cannot bring them back to the state from before the -modifications. At the same time each hibernation image contains some -filesystem-related information that must be consistent with the state of the -on-disk data and metadata after the system memory state has been restored from -the image (otherwise the filesystems will be damaged in a nasty way, usually -making them almost impossible to repair). We therefore freeze tasks that might -cause the on-disk filesystems' data and metadata to be modified after the -hibernation image has been created and before the system is finally powered off. 
-The majority of these are user space processes, but if any of the kernel threads -may cause something like this to happen, they have to be freezable. + hibernation. At the moment we have no simple means of checkpointing + filesystems, so if there are any modifications made to filesystem data and/or + metadata on disks, we cannot bring them back to the state from before the + modifications. At the same time each hibernation image contains some + filesystem-related information that must be consistent with the state of the + on-disk data and metadata after the system memory state has been restored + from the image (otherwise the filesystems will be damaged in a nasty way, + usually making them almost impossible to repair). We therefore freeze + tasks that might cause the on-disk filesystems' data and metadata to be + modified after the hibernation image has been created and before the + system is finally powered off. The majority of these are user space + processes, but if any of the kernel threads may cause something like this + to happen, they have to be freezable. 2. Next, to create the hibernation image we need to free a sufficient amount of -memory (approximately 50% of available RAM) and we need to do that before -devices are deactivated, because we generally need them for swapping out. Then, -after the memory for the image has been freed, we don't want tasks to allocate -additional memory and we prevent them from doing that by freezing them earlier. -[Of course, this also means that device drivers should not allocate substantial -amounts of memory from their .suspend() callbacks before hibernation, but this -is a separate issue.] + memory (approximately 50% of available RAM) and we need to do that before + devices are deactivated, because we generally need them for swapping out. + Then, after the memory for the image has been freed, we don't want tasks + to allocate additional memory and we prevent them from doing that by + freezing them earlier. 
[Of course, this also means that device drivers + should not allocate substantial amounts of memory from their .suspend() + callbacks before hibernation, but this is a separate issue.] 3. The third reason is to prevent user space processes and some kernel threads -from interfering with the suspending and resuming of devices. A user space -process running on a second CPU while we are suspending devices may, for -example, be troublesome and without the freezing of tasks we would need some -safeguards against race conditions that might occur in such a case. + from interfering with the suspending and resuming of devices. A user space + process running on a second CPU while we are suspending devices may, for + example, be troublesome and without the freezing of tasks we would need some + safeguards against race conditions that might occur in such a case. Although Linus Torvalds doesn't like the freezing of tasks, he said this in one of the discussions on LKML (http://lkml.org/lkml/2007/4/27/608): @@ -132,7 +140,7 @@ of the discussions on LKML (http://lkml.org/lkml/2007/4/27/608): Linus: In many ways, 'at all'. -I _do_ realize the IO request queue issues, and that we cannot actually do +I **do** realize the IO request queue issues, and that we cannot actually do s2ram with some devices in the middle of a DMA. So we want to be able to avoid *that*, there's no question about that. And I suspect that stopping user threads and then waiting for a sync is practically one of the easier @@ -150,17 +158,18 @@ thawed after the driver's .resume() callback has run, so it won't be accessing the device while it's suspended. 4. Another reason for freezing tasks is to prevent user space processes from -realizing that hibernation (or suspend) operation takes place. Ideally, user -space processes should not notice that such a system-wide operation has occurred -and should continue running without any problems after the restore (or resume -from suspend). 
Unfortunately, in the most general case this is quite difficult -to achieve without the freezing of tasks. Consider, for example, a process -that depends on all CPUs being online while it's running. Since we need to -disable nonboot CPUs during the hibernation, if this process is not frozen, it -may notice that the number of CPUs has changed and may start to work incorrectly -because of that. + realizing that hibernation (or suspend) operation takes place. Ideally, user + space processes should not notice that such a system-wide operation has + occurred and should continue running without any problems after the restore + (or resume from suspend). Unfortunately, in the most general case this + is quite difficult to achieve without the freezing of tasks. Consider, + for example, a process that depends on all CPUs being online while it's + running. Since we need to disable nonboot CPUs during the hibernation, + if this process is not frozen, it may notice that the number of CPUs has + changed and may start to work incorrectly because of that. V. Are there any problems related to the freezing of tasks? +=========================================================== Yes, there are. @@ -172,11 +181,12 @@ may be undesirable. That's why kernel threads are not freezable by default. Second, there are the following two problems related to the freezing of user space processes: + 1. Putting processes into an uninterruptible sleep distorts the load average. 2. Now that we have FUSE, plus the framework for doing device drivers in -userspace, it gets even more complicated because some userspace processes are -now doing the sorts of things that kernel threads do -(https://lists.linux-foundation.org/pipermail/linux-pm/2007-May/012309.html). + userspace, it gets even more complicated because some userspace processes are + now doing the sorts of things that kernel threads do + (https://lists.linux-foundation.org/pipermail/linux-pm/2007-May/012309.html). The problem 1. 
seems to be fixable, although it hasn't been fixed so far. The other one is more serious, but it seems that we can work around it by using @@ -201,6 +211,7 @@ requested early enough using the suspend notifier API described in Documentation/driver-api/pm/notifiers.rst. VI. Are there any precautions to be taken to prevent freezing failures? +======================================================================= Yes, there are. @@ -226,6 +237,8 @@ So, to summarize, use [un]lock_system_sleep() instead of directly using mutex_[un]lock(&system_transition_mutex). That would prevent freezing failures. V. Miscellaneous +================ + /sys/power/pm_freeze_timeout controls how long it will cost at most to freeze all user space processes or all freezable kernel threads, in unit of millisecond. The default value is 20000, with range of unsigned integer. diff --git a/Documentation/power/index.rst b/Documentation/power/index.rst new file mode 100644 index 000000000000..20415f21e48a --- /dev/null +++ b/Documentation/power/index.rst @@ -0,0 +1,46 @@ +:orphan: + +================ +Power Management +================ + +.. toctree:: + :maxdepth: 1 + + apm-acpi + basic-pm-debugging + charger-manager + drivers-testing + energy-model + freezing-of-tasks + interface + opp + pci + pm_qos_interface + power_supply_class + runtime_pm + s2ram + suspend-and-cpuhotplug + suspend-and-interrupts + swsusp-and-swap-files + swsusp-dmcrypt + swsusp + video + tricks + + userland-swsusp + + powercap/powercap + + regulator/consumer + regulator/design + regulator/machine + regulator/overview + regulator/regulator + +.. 
only:: subproject and html + + Indices + ======= + + * :ref:`genindex` diff --git a/Documentation/power/interface.txt b/Documentation/power/interface.rst similarity index 84% rename from Documentation/power/interface.txt rename to Documentation/power/interface.rst index 27df7f98668a..8d270ed27228 100644 --- a/Documentation/power/interface.txt +++ b/Documentation/power/interface.rst @@ -1,4 +1,6 @@ +=========================================== Power Management Interface for System Sleep +=========================================== Copyright (c) 2016 Intel Corp., Rafael J. Wysocki @@ -11,10 +13,10 @@ mounted at /sys). Reading from it returns a list of supported sleep states, encoded as: -'freeze' (Suspend-to-Idle) -'standby' (Power-On Suspend) -'mem' (Suspend-to-RAM) -'disk' (Suspend-to-Disk) +- 'freeze' (Suspend-to-Idle) +- 'standby' (Power-On Suspend) +- 'mem' (Suspend-to-RAM) +- 'disk' (Suspend-to-Disk) Suspend-to-Idle is always supported. Suspend-to-Disk is always supported too as long the kernel has been configured to support hibernation at all @@ -32,18 +34,18 @@ Specifically, it tells the kernel what to do after creating a hibernation image. Reading from it returns a list of supported options encoded as: -'platform' (put the system into sleep using a platform-provided method) -'shutdown' (shut the system down) -'reboot' (reboot the system) -'suspend' (trigger a Suspend-to-RAM transition) -'test_resume' (resume-after-hibernation test mode) +- 'platform' (put the system into sleep using a platform-provided method) +- 'shutdown' (shut the system down) +- 'reboot' (reboot the system) +- 'suspend' (trigger a Suspend-to-RAM transition) +- 'test_resume' (resume-after-hibernation test mode) The currently selected option is printed in square brackets. The 'platform' option is only available if the platform provides a special mechanism to put the system to sleep after creating a hibernation image (ACPI does that, for example). 
The 'suspend' option is available if Suspend-to-RAM -is supported. Refer to Documentation/power/basic-pm-debugging.txt for the +is supported. Refer to Documentation/power/basic-pm-debugging.rst for the description of the 'test_resume' option. To select an option, write the string representing it to /sys/power/disk. @@ -71,7 +73,7 @@ If /sys/power/pm_trace contains '1', the fingerprint of each suspend/resume event point in turn will be stored in the RTC memory (overwriting the actual RTC information), so it will survive a system crash if one occurs right after storing it and it can be used later to identify the driver that caused the crash -to happen (see Documentation/power/s2ram.txt for more information). +to happen (see Documentation/power/s2ram.rst for more information). Initially it contains '0' which may be changed to '1' by writing a string representing a nonzero integer into it. diff --git a/Documentation/power/opp.txt b/Documentation/power/opp.rst similarity index 78% rename from Documentation/power/opp.txt rename to Documentation/power/opp.rst index 0c007e250cd1..b3cf1def9dee 100644 --- a/Documentation/power/opp.txt +++ b/Documentation/power/opp.rst @@ -1,20 +1,23 @@ +========================================== Operating Performance Points (OPP) Library ========================================== (C) 2009-2010 Nishanth Menon , Texas Instruments Incorporated -Contents --------- -1. Introduction -2. Initial OPP List Registration -3. OPP Search Functions -4. OPP Availability Control Functions -5. OPP Data Retrieval Functions -6. Data Structures +.. Contents + + 1. Introduction + 2. Initial OPP List Registration + 3. OPP Search Functions + 4. OPP Availability Control Functions + 5. OPP Data Retrieval Functions + 6. Data Structures 1. Introduction =============== + 1.1 What is an Operating Performance Point (OPP)? +------------------------------------------------- Complex SoCs of today consists of a multiple sub-modules working in conjunction. 
In an operational system executing varied use cases, not all modules in the SoC @@ -28,16 +31,19 @@ the device will support per domain are called Operating Performance Points or OPPs. As an example: + Let us consider an MPU device which supports the following: {300MHz at minimum voltage of 1V}, {800MHz at minimum voltage of 1.2V}, {1GHz at minimum voltage of 1.3V} We can represent these as three OPPs as the following {Hz, uV} tuples: -{300000000, 1000000} -{800000000, 1200000} -{1000000000, 1300000} + +- {300000000, 1000000} +- {800000000, 1200000} +- {1000000000, 1300000} 1.2 Operating Performance Points Library +---------------------------------------- OPP library provides a set of helper functions to organize and query the OPP information. The library is located in drivers/base/power/opp.c and the header @@ -46,9 +52,10 @@ CONFIG_PM_OPP from power management menuconfig menu. OPP library depends on CONFIG_PM as certain SoCs such as Texas Instrument's OMAP framework allows to optionally boot at a certain OPP without needing cpufreq. -Typical usage of the OPP library is as follows: -(users) -> registers a set of default OPPs -> (library) -SoC framework -> modifies on required cases certain OPPs -> OPP layer +Typical usage of the OPP library is as follows:: + + (users) -> registers a set of default OPPs -> (library) + SoC framework -> modifies on required cases certain OPPs -> OPP layer -> queries to search/retrieve information -> OPP layer expects each domain to be represented by a unique device pointer. SoC @@ -57,8 +64,9 @@ list is expected to be an optimally small number typically around 5 per device. This initial list contains a set of OPPs that the framework expects to be safely enabled by default in the system. 
-Note on OPP Availability: ------------------------- +Note on OPP Availability +^^^^^^^^^^^^^^^^^^^^^^^^ + As the system proceeds to operate, SoC framework may choose to make certain OPPs available or not available on each device based on various external factors. Example usage: Thermal management or other exceptional situations where @@ -88,7 +96,8 @@ registering the OPPs is maintained by OPP library throughout the device operation. The SoC framework can subsequently control the availability of the OPPs dynamically using the dev_pm_opp_enable / disable functions. -dev_pm_opp_add - Add a new OPP for a specific domain represented by the device pointer. +dev_pm_opp_add + Add a new OPP for a specific domain represented by the device pointer. The OPP is defined using the frequency and voltage. Once added, the OPP is assumed to be available and control of it's availability can be done with the dev_pm_opp_enable/disable functions. OPP library internally stores @@ -96,9 +105,11 @@ dev_pm_opp_add - Add a new OPP for a specific domain represented by the device p used by SoC framework to define a optimal list as per the demands of SoC usage environment. - WARNING: Do not use this function in interrupt context. + WARNING: + Do not use this function in interrupt context. + + Example:: - Example: soc_pm_init() { /* Do things */ @@ -125,12 +136,15 @@ Callers of these functions shall call dev_pm_opp_put() after they have used the OPP. Otherwise the memory for the OPP will never get freed and result in memleak. -dev_pm_opp_find_freq_exact - Search for an OPP based on an *exact* frequency and +dev_pm_opp_find_freq_exact + Search for an OPP based on an *exact* frequency and availability. This function is especially useful to enable an OPP which is not available by default. 
Example: In a case when SoC framework detects a situation where a higher frequency could be made available, it can use this function to - find the OPP prior to call the dev_pm_opp_enable to actually make it available. + find the OPP prior to call the dev_pm_opp_enable to actually make + it available:: + opp = dev_pm_opp_find_freq_exact(dev, 1000000000, false); dev_pm_opp_put(opp); /* dont operate on the pointer.. just do a sanity check.. */ @@ -141,27 +155,34 @@ dev_pm_opp_find_freq_exact - Search for an OPP based on an *exact* frequency and dev_pm_opp_enable(dev,1000000000); } - NOTE: This is the only search function that operates on OPPs which are - not available. + NOTE: + This is the only search function that operates on OPPs which are + not available. -dev_pm_opp_find_freq_floor - Search for an available OPP which is *at most* the +dev_pm_opp_find_freq_floor + Search for an available OPP which is *at most* the provided frequency. This function is useful while searching for a lesser match OR operating on OPP information in the order of decreasing frequency. - Example: To find the highest opp for a device: + Example: To find the highest opp for a device:: + freq = ULONG_MAX; opp = dev_pm_opp_find_freq_floor(dev, &freq); dev_pm_opp_put(opp); -dev_pm_opp_find_freq_ceil - Search for an available OPP which is *at least* the +dev_pm_opp_find_freq_ceil + Search for an available OPP which is *at least* the provided frequency. This function is useful while searching for a higher match OR operating on OPP information in the order of increasing frequency. - Example 1: To find the lowest opp for a device: + Example 1: To find the lowest opp for a device:: + freq = 0; opp = dev_pm_opp_find_freq_ceil(dev, &freq); dev_pm_opp_put(opp); - Example 2: A simplified implementation of a SoC cpufreq_driver->target: + + Example 2: A simplified implementation of a SoC cpufreq_driver->target:: + soc_cpufreq_target(..) { /* Do stuff like policy checks etc. 
*/ @@ -184,12 +205,15 @@ fine grained dynamic control of which sets of OPPs are operationally available. These functions are intended to *temporarily* remove an OPP in conditions such as thermal considerations (e.g. don't use OPPx until the temperature drops). -WARNING: Do not use these functions in interrupt context. +WARNING: + Do not use these functions in interrupt context. -dev_pm_opp_enable - Make a OPP available for operation. +dev_pm_opp_enable + Make a OPP available for operation. Example: Lets say that 1GHz OPP is to be made available only if the SoC temperature is lower than a certain threshold. The SoC framework - implementation might choose to do something as follows: + implementation might choose to do something as follows:: + if (cur_temp < temp_low_thresh) { /* Enable 1GHz if it was disabled */ opp = dev_pm_opp_find_freq_exact(dev, 1000000000, false); @@ -201,10 +225,12 @@ dev_pm_opp_enable - Make a OPP available for operation. goto try_something_else; } -dev_pm_opp_disable - Make an OPP to be not available for operation +dev_pm_opp_disable + Make an OPP to be not available for operation Example: Lets say that 1GHz OPP is to be disabled if the temperature exceeds a threshold value. The SoC framework implementation might - choose to do something as follows: + choose to do something as follows:: + if (cur_temp > temp_high_thresh) { /* Disable 1GHz if it was enabled */ opp = dev_pm_opp_find_freq_exact(dev, 1000000000, true); @@ -223,11 +249,13 @@ information from the OPP structure is necessary. Once an OPP pointer is retrieved using the search functions, the following functions can be used by SoC framework to retrieve the information represented inside the OPP layer. -dev_pm_opp_get_voltage - Retrieve the voltage represented by the opp pointer. +dev_pm_opp_get_voltage + Retrieve the voltage represented by the opp pointer. 
Example: At a cpufreq transition to a different frequency, SoC framework requires to set the voltage represented by the OPP using the regulator framework to the Power Management chip providing the - voltage. + voltage:: + soc_switch_to_freq_voltage(freq) { /* do things */ @@ -239,10 +267,12 @@ dev_pm_opp_get_voltage - Retrieve the voltage represented by the opp pointer. /* do other things */ } -dev_pm_opp_get_freq - Retrieve the freq represented by the opp pointer. +dev_pm_opp_get_freq + Retrieve the freq represented by the opp pointer. Example: Lets say the SoC framework uses a couple of helper functions we could pass opp pointers instead of doing additional parameters to - handle quiet a bit of data parameters. + handle quite a bit of data parameters:: + soc_cpufreq_target(..) { /* do things.. */ @@ -264,9 +294,11 @@ dev_pm_opp_get_freq - Retrieve the freq represented by the opp pointer. /* do things.. */ } -dev_pm_opp_get_opp_count - Retrieve the number of available opps for a device +dev_pm_opp_get_opp_count + Retrieve the number of available opps for a device Example: Lets say a co-processor in the SoC needs to know the available - frequencies in a table, the main processor can notify as following: + frequencies in a table, the main processor can notify as follows:: + soc_notify_coproc_available_frequencies() { /* Do things */ @@ -289,54 +321,59 @@ dev_pm_opp_get_opp_count - Retrieve the number of available opps for a device ================== Typically an SoC contains multiple voltage domains which are variable. Each domain is represented by a device pointer. The relationship to OPP can be -represented as follows: -SoC - |- device 1 - | |- opp 1 (availability, freq, voltage) - | |- opp 2 .. - ... ... - | `- opp n .. - |- device 2 - ... - `- device m +represented as follows:: + + SoC + |- device 1 + | |- opp 1 (availability, freq, voltage) + | |- opp 2 .. + ... ... + | `- opp n .. + |- device 2 + ...
+ `- device m OPP library maintains a internal list that the SoC framework populates and accessed by various functions as described above. However, the structures representing the actual OPPs and domains are internal to the OPP library itself to allow for suitable abstraction reusable across systems. -struct dev_pm_opp - The internal data structure of OPP library which is used to +struct dev_pm_opp + The internal data structure of OPP library which is used to represent an OPP. In addition to the freq, voltage, availability information, it also contains internal book keeping information required for the OPP library to operate on. Pointer to this structure is provided back to the users such as SoC framework to be used as a identifier for OPP in the interactions with OPP layer. - WARNING: The struct dev_pm_opp pointer should not be parsed or modified by the - users. The defaults of for an instance is populated by dev_pm_opp_add, but the - availability of the OPP can be modified by dev_pm_opp_enable/disable functions. + WARNING: + The struct dev_pm_opp pointer should not be parsed or modified by the + users. The defaults of for an instance is populated by + dev_pm_opp_add, but the availability of the OPP can be modified + by dev_pm_opp_enable/disable functions. -struct device - This is used to identify a domain to the OPP layer. The +struct device + This is used to identify a domain to the OPP layer. The nature of the device and it's implementation is left to the user of OPP library such as the SoC framework. 
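As an illustration only, the per-device OPP list and the search semantics described above (ceil, floor, and exact match with an availability flag) can be modelled in plain userspace C. This is a sketch under stated assumptions, not the OPP library's implementation: every mock_* name is hypothetical, the table holds the three example {Hz, uV} tuples from section 1.1, and no reference counting (dev_pm_opp_put()) or locking is modelled.

```c
/*
 * Userspace model of a single-device OPP table; NOT kernel code.
 * All mock_* identifiers are hypothetical and exist only for this sketch.
 */
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

struct mock_opp {
	unsigned long freq;	/* frequency in Hz */
	unsigned long u_volt;	/* voltage in uV */
	bool available;		/* may this OPP be used right now? */
};

/* The three example OPPs from section 1.1, sorted by increasing frequency. */
static struct mock_opp mock_table[] = {
	{  300000000UL, 1000000UL, true },
	{  800000000UL, 1200000UL, true },
	{ 1000000000UL, 1300000UL, true },
};

#define MOCK_NR_OPPS (sizeof(mock_table) / sizeof(mock_table[0]))

/* Lowest available OPP with freq >= *freq; updates *freq on success. */
struct mock_opp *mock_find_freq_ceil(unsigned long *freq)
{
	size_t i;

	assert(freq != NULL);
	for (i = 0; i < MOCK_NR_OPPS; i++) {
		if (mock_table[i].available && mock_table[i].freq >= *freq) {
			*freq = mock_table[i].freq;
			return &mock_table[i];
		}
	}
	return NULL;
}

/* Highest available OPP with freq <= *freq; updates *freq on success. */
struct mock_opp *mock_find_freq_floor(unsigned long *freq)
{
	size_t i;

	assert(freq != NULL);
	for (i = MOCK_NR_OPPS; i-- > 0; ) {
		if (mock_table[i].available && mock_table[i].freq <= *freq) {
			*freq = mock_table[i].freq;
			return &mock_table[i];
		}
	}
	return NULL;
}

/* Exact match on both frequency and availability; may return an unavailable OPP. */
struct mock_opp *mock_find_freq_exact(unsigned long freq, bool available)
{
	size_t i;

	for (i = 0; i < MOCK_NR_OPPS; i++) {
		if (mock_table[i].freq == freq &&
		    mock_table[i].available == available)
			return &mock_table[i];
	}
	return NULL;
}

/* Change the availability of the OPP at 'freq'; 0 on success, -1 if absent. */
int mock_set_available(unsigned long freq, bool available)
{
	size_t i;

	for (i = 0; i < MOCK_NR_OPPS; i++) {
		if (mock_table[i].freq == freq) {
			mock_table[i].available = available;
			return 0;
		}
	}
	return -1;
}
```

In this model, a floor search starting from ULONG_MAX yields the 1 GHz entry. After mock_set_available(1000000000UL, false), the same search yields the 800 MHz entry, and only mock_find_freq_exact() called with available == false still finds the 1 GHz one.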
Overall, in a simplistic view, the data structure operations is represented as -following: +following:: -Initialization / modification: - +-----+ /- dev_pm_opp_enable -dev_pm_opp_add --> | opp | <------- - | +-----+ \- dev_pm_opp_disable - \-------> domain_info(device) + Initialization / modification: + +-----+ /- dev_pm_opp_enable + dev_pm_opp_add --> | opp | <------- + | +-----+ \- dev_pm_opp_disable + \-------> domain_info(device) -Search functions: - /-- dev_pm_opp_find_freq_ceil ---\ +-----+ -domain_info<---- dev_pm_opp_find_freq_exact -----> | opp | - \-- dev_pm_opp_find_freq_floor ---/ +-----+ + Search functions: + /-- dev_pm_opp_find_freq_ceil ---\ +-----+ + domain_info<---- dev_pm_opp_find_freq_exact -----> | opp | + \-- dev_pm_opp_find_freq_floor ---/ +-----+ -Retrieval functions: -+-----+ /- dev_pm_opp_get_voltage -| opp | <--- -+-----+ \- dev_pm_opp_get_freq + Retrieval functions: + +-----+ /- dev_pm_opp_get_voltage + | opp | <--- + +-----+ \- dev_pm_opp_get_freq -domain_info <- dev_pm_opp_get_opp_count + domain_info <- dev_pm_opp_get_opp_count diff --git a/Documentation/power/pci.txt b/Documentation/power/pci.rst similarity index 97% rename from Documentation/power/pci.txt rename to Documentation/power/pci.rst index 8eaf9ee24d43..0e2ef7429304 100644 --- a/Documentation/power/pci.txt +++ b/Documentation/power/pci.rst @@ -1,4 +1,6 @@ +==================== PCI Power Management +==================== Copyright (c) 2010 Rafael J. Wysocki , Novell Inc. @@ -9,14 +11,14 @@ management. Based on previous work by Patrick Mochel This document only covers the aspects of power management specific to PCI devices. For general description of the kernel's interfaces related to device power management refer to Documentation/driver-api/pm/devices.rst and -Documentation/power/runtime_pm.txt. +Documentation/power/runtime_pm.rst. ---------------------------------------------------------------------------- +.. contents: -1. 
Hardware and Platform Support for PCI Power Management -2. PCI Subsystem and Device Power Management -3. PCI Device Drivers and Power Management -4. Resources + 1. Hardware and Platform Support for PCI Power Management + 2. PCI Subsystem and Device Power Management + 3. PCI Device Drivers and Power Management + 4. Resources 1. Hardware and Platform Support for PCI Power Management @@ -24,6 +26,7 @@ Documentation/power/runtime_pm.txt. 1.1. Native and Platform-Based Power Management ----------------------------------------------- + In general, power management is a feature allowing one to save energy by putting devices into states in which they draw less power (low-power states) at the price of reduced functionality or performance. @@ -67,6 +70,7 @@ mechanisms have to be used simultaneously to obtain the desired result. 1.2. Native PCI Power Management -------------------------------- + The PCI Bus Power Management Interface Specification (PCI PM Spec) was introduced between the PCI 2.1 and PCI 2.2 Specifications. It defined a standard interface for performing various operations related to power @@ -134,6 +138,7 @@ sufficiently active to generate a wakeup signal. 1.3. ACPI Device Power Management --------------------------------- + The platform firmware support for the power management of PCI devices is system-specific. However, if the system in question is compliant with the Advanced Configuration and Power Interface (ACPI) Specification, like the @@ -194,6 +199,7 @@ enabled for the device to be able to generate wakeup signals. 1.4. Wakeup Signaling --------------------- + Wakeup signals generated by PCI devices, either as native PCI PMEs, or as a result of the execution of the _DSW (or _PSW) ACPI control method before putting the device into a low-power state, have to be caught and handled as @@ -265,14 +271,15 @@ the native PCI Express PME signaling cannot be used by the kernel in that case. 2.1. 
Device Power Management Callbacks -------------------------------------- + The PCI Subsystem participates in the power management of PCI devices in a number of ways. First of all, it provides an intermediate code layer between the device power management core (PM core) and PCI device drivers. Specifically, the pm field of the PCI subsystem's struct bus_type object, pci_bus_type, points to a struct dev_pm_ops object, pci_dev_pm_ops, containing -pointers to several device power management callbacks: +pointers to several device power management callbacks:: -const struct dev_pm_ops pci_dev_pm_ops = { + const struct dev_pm_ops pci_dev_pm_ops = { .prepare = pci_pm_prepare, .complete = pci_pm_complete, .suspend = pci_pm_suspend, @@ -290,7 +297,7 @@ const struct dev_pm_ops pci_dev_pm_ops = { .runtime_suspend = pci_pm_runtime_suspend, .runtime_resume = pci_pm_runtime_resume, .runtime_idle = pci_pm_runtime_idle, -}; + }; These callbacks are executed by the PM core in various situations related to device power management and they, in turn, execute power management callbacks @@ -299,9 +306,9 @@ involving some standard configuration registers of PCI devices that device drivers need not know or care about. The structure representing a PCI device, struct pci_dev, contains several fields -that these callbacks operate on: +that these callbacks operate on:: -struct pci_dev { + struct pci_dev { ... pci_power_t current_state; /* Current operating state. */ int pm_cap; /* PM capability offset in the @@ -315,13 +322,14 @@ struct pci_dev { unsigned int wakeup_prepared:1; /* Device prepared for wake up */ unsigned int d3_delay; /* D3->D0 transition time in ms */ ... -}; + }; They also indirectly use some fields of the struct device that is embedded in struct pci_dev. 2.2. 
Device Initialization -------------------------- + The PCI subsystem's first task related to device power management is to prepare the device for power management and initialize the fields of struct pci_dev used for this purpose. This happens in two functions defined in @@ -348,10 +356,11 @@ during system-wide transitions to a sleep state and back to the working state. 2.3. Runtime Device Power Management ------------------------------------ + The PCI subsystem plays a vital role in the runtime power management of PCI devices. For this purpose it uses the general runtime power management -(runtime PM) framework described in Documentation/power/runtime_pm.txt. -Namely, it provides subsystem-level callbacks: +(runtime PM) framework described in Documentation/power/runtime_pm.rst. +Namely, it provides subsystem-level callbacks:: pci_pm_runtime_suspend() pci_pm_runtime_resume() @@ -425,13 +434,14 @@ to the given subsystem before the next phase begins. These phases always run after tasks have been frozen. 2.4.1. System Suspend +^^^^^^^^^^^^^^^^^^^^^ When the system is going into a sleep state in which the contents of memory will be preserved, such as one of the ACPI sleep states S1-S3, the phases are: prepare, suspend, suspend_noirq. -The following PCI bus type's callbacks, respectively, are used in these phases: +The following PCI bus type's callbacks, respectively, are used in these phases:: pci_pm_prepare() pci_pm_suspend() @@ -492,6 +502,7 @@ this purpose). PCI device drivers are not encouraged to do that, but in some rare cases doing that in the driver may be the optimum approach. 2.4.2. System Resume +^^^^^^^^^^^^^^^^^^^^ When the system is undergoing a transition from a sleep state in which the contents of memory have been preserved, such as one of the ACPI sleep states @@ -500,7 +511,7 @@ S1-S3, into the working state (ACPI S0), the phases are: resume_noirq, resume, complete. 
The following PCI bus type's callbacks, respectively, are executed in these -phases: +phases:: pci_pm_resume_noirq() pci_pm_resume() @@ -539,6 +550,7 @@ The pci_pm_complete() routine only executes the device driver's pm->complete() callback, if defined. 2.4.3. System Hibernation +^^^^^^^^^^^^^^^^^^^^^^^^^ System hibernation is more complicated than system suspend, because it requires a system image to be created and written into a persistent storage medium. The @@ -551,7 +563,7 @@ to be free) in the following three phases: prepare, freeze, freeze_noirq -that correspond to the PCI bus type's callbacks: +that correspond to the PCI bus type's callbacks:: pci_pm_prepare() pci_pm_freeze() @@ -580,7 +592,7 @@ back to the fully functional state and this is done in the following phases: thaw_noirq, thaw, complete -using the following PCI bus type's callbacks: +using the following PCI bus type's callbacks:: pci_pm_thaw_noirq() pci_pm_thaw() @@ -608,7 +620,7 @@ three phases: where the prepare phase is exactly the same as for system suspend. The other two phases are analogous to the suspend and suspend_noirq phases, respectively. -The PCI subsystem-level callbacks they correspond to +The PCI subsystem-level callbacks they correspond to:: pci_pm_poweroff() pci_pm_poweroff_noirq() @@ -618,6 +630,7 @@ although they don't attempt to save the device's standard configuration registers. 2.4.4. System Restore +^^^^^^^^^^^^^^^^^^^^^ System restore requires a hibernation image to be loaded into memory and the pre-hibernation memory contents to be restored before the pre-hibernation system @@ -653,7 +666,7 @@ phases: The first two of these are analogous to the resume_noirq and resume phases described above, respectively, and correspond to the following PCI subsystem -callbacks: +callbacks:: pci_pm_restore_noirq() pci_pm_restore() @@ -671,6 +684,7 @@ resume. 3.1. 
Power Management Callbacks ------------------------------- + PCI device drivers participate in power management by providing callbacks to be executed by the PCI subsystem's power management routines described above and by controlling the runtime power management of their devices. @@ -698,6 +712,7 @@ defined, though, they are expected to behave as described in the following subsections. 3.1.1. prepare() +^^^^^^^^^^^^^^^^ The prepare() callback is executed during system suspend, during hibernation (when a hibernation image is about to be created), during power-off after @@ -716,6 +731,7 @@ preallocated earlier, for example in a suspend/hibernate notifier as described in Documentation/driver-api/pm/notifiers.rst). 3.1.2. suspend() +^^^^^^^^^^^^^^^^ The suspend() callback is only executed during system suspend, after prepare() callbacks have been executed for all devices in the system. @@ -742,6 +758,7 @@ operations relying on the driver's ability to handle interrupts should be carried out in this callback. 3.1.3. suspend_noirq() +^^^^^^^^^^^^^^^^^^^^^^ The suspend_noirq() callback is only executed during system suspend, after suspend() callbacks have been executed for all devices in the system and @@ -753,6 +770,7 @@ suspend_noirq() can carry out operations that would cause race conditions to arise if they were performed in suspend(). 3.1.4. freeze() +^^^^^^^^^^^^^^^ The freeze() callback is hibernation-specific and is executed in two situations, during hibernation, after prepare() callbacks have been executed for all devices @@ -770,6 +788,7 @@ or put it into a low-power state. Still, either it or freeze_noirq() should save the device's standard configuration registers using pci_save_state(). 3.1.5. freeze_noirq() +^^^^^^^^^^^^^^^^^^^^^ The freeze_noirq() callback is hibernation-specific. 
It is executed during hibernation, after prepare() and freeze() callbacks have been executed for all @@ -786,6 +805,7 @@ The difference between freeze_noirq() and freeze() is analogous to the difference between suspend_noirq() and suspend(). 3.1.6. poweroff() +^^^^^^^^^^^^^^^^^ The poweroff() callback is hibernation-specific. It is executed when the system is about to be powered off after saving a hibernation image to a persistent @@ -802,6 +822,7 @@ into a low-power state, respectively, but it need not save the device's standard configuration registers. 3.1.7. poweroff_noirq() +^^^^^^^^^^^^^^^^^^^^^^^ The poweroff_noirq() callback is hibernation-specific. It is executed after poweroff() callbacks have been executed for all devices in the system. @@ -814,6 +835,7 @@ The difference between poweroff_noirq() and poweroff() is analogous to the difference between suspend_noirq() and suspend(). 3.1.8. resume_noirq() +^^^^^^^^^^^^^^^^^^^^^ The resume_noirq() callback is only executed during system resume, after the PM core has enabled the non-boot CPUs. The driver's interrupt handler will not @@ -827,6 +849,7 @@ it should only be used for performing operations that would lead to race conditions if carried out by resume(). 3.1.9. resume() +^^^^^^^^^^^^^^^ The resume() callback is only executed during system resume, after resume_noirq() callbacks have been executed for all devices in the system and @@ -837,6 +860,7 @@ device and bringing it back to the fully functional state. The device should be able to process I/O in a usual way after resume() has returned. 3.1.10. thaw_noirq() +^^^^^^^^^^^^^^^^^^^^ The thaw_noirq() callback is hibernation-specific. It is executed after a system image has been created and the non-boot CPUs have been enabled by the PM @@ -851,6 +875,7 @@ freeze() and freeze_noirq(), so in general it does not need to modify the contents of the device's registers. 3.1.11. thaw() +^^^^^^^^^^^^^^ The thaw() callback is hibernation-specific. 
It is executed after thaw_noirq() callbacks have been executed for all devices in the system and after device @@ -860,6 +885,7 @@ This callback is responsible for restoring the pre-freeze configuration of the device, so that it will work in a usual way after thaw() has returned. 3.1.12. restore_noirq() +^^^^^^^^^^^^^^^^^^^^^^^ The restore_noirq() callback is hibernation-specific. It is executed in the restore_noirq phase of hibernation, when the boot kernel has passed control to @@ -875,6 +901,7 @@ For the vast majority of PCI device drivers there is no difference between resume_noirq() and restore_noirq(). 3.1.13. restore() +^^^^^^^^^^^^^^^^^ The restore() callback is hibernation-specific. It is executed after restore_noirq() callbacks have been executed for all devices in the system and @@ -888,14 +915,17 @@ For the vast majority of PCI device drivers there is no difference between resume() and restore(). 3.1.14. complete() +^^^^^^^^^^^^^^^^^^ The complete() callback is executed in the following situations: + - during system resume, after resume() callbacks have been executed for all devices, - during hibernation, before saving the system image, after thaw() callbacks have been executed for all devices, - during system restore, when the system is going back to its pre-hibernation state, after restore() callbacks have been executed for all devices. + It also may be executed if the loading of a hibernation image into memory fails (in that case it is run after thaw() callbacks have been executed for all devices that have drivers in the boot kernel). @@ -904,6 +934,7 @@ This callback is entirely optional, although it may be necessary if the prepare() callback performs operations that need to be reversed. 3.1.15. runtime_suspend() +^^^^^^^^^^^^^^^^^^^^^^^^^ The runtime_suspend() callback is specific to device runtime power management (runtime PM). 
It is executed by the PM core's runtime PM framework when the @@ -915,6 +946,7 @@ put into a low-power state, but it must allow the PCI subsystem to perform all of the PCI-specific actions necessary for suspending the device. 3.1.16. runtime_resume() +^^^^^^^^^^^^^^^^^^^^^^^^ The runtime_resume() callback is specific to device runtime PM. It is executed by the PM core's runtime PM framework when the device is about to be resumed @@ -927,6 +959,7 @@ The device is expected to be able to process I/O in the usual way after runtime_resume() has returned. 3.1.17. runtime_idle() +^^^^^^^^^^^^^^^^^^^^^^ The runtime_idle() callback is specific to device runtime PM. It is executed by the PM core's runtime PM framework whenever it may be desirable to suspend @@ -939,6 +972,7 @@ PCI subsystem will call pm_runtime_suspend() for the device, which in turn will cause the driver's runtime_suspend() callback to be executed. 3.1.18. Pointing Multiple Callback Pointers to One Routine +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Although in principle each of the callbacks described in the previous subsections can be defined as a separate function, it often is convenient to @@ -962,6 +996,7 @@ dev_pm_ops to indicate that one suspend routine is to be pointed to by the be pointed to by the .resume(), .thaw(), and .restore() members. 3.1.19. Driver Flags for Power Management +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ The PM core allows device drivers to set flags that influence the handling of power management for the devices by the core itself and by middle layer code @@ -1007,6 +1042,7 @@ it. 3.2. Device Runtime Power Management ------------------------------------ + In addition to providing device power management callbacks PCI device drivers are responsible for controlling the runtime power management (runtime PM) of their devices. 
@@ -1073,22 +1109,27 @@ device the PM core automatically queues a request to check if the device is idle), device drivers are generally responsible for queuing power management requests for their devices. For this purpose they should use the runtime PM helper functions provided by the PM core, discussed in -Documentation/power/runtime_pm.txt. +Documentation/power/runtime_pm.rst. Devices can also be suspended and resumed synchronously, without placing a request into pm_wq. In the majority of cases this also is done by their drivers that use helper functions provided by the PM core for this purpose. For more information on the runtime PM of devices refer to -Documentation/power/runtime_pm.txt. +Documentation/power/runtime_pm.rst. 4. Resources ============ PCI Local Bus Specification, Rev. 3.0 + PCI Bus Power Management Interface Specification, Rev. 1.2 + Advanced Configuration and Power Interface (ACPI) Specification, Rev. 3.0b + PCI Express Base Specification, Rev. 2.0 + Documentation/driver-api/pm/devices.rst -Documentation/power/runtime_pm.txt + +Documentation/power/runtime_pm.rst diff --git a/Documentation/power/pm_qos_interface.txt b/Documentation/power/pm_qos_interface.rst similarity index 62% rename from Documentation/power/pm_qos_interface.txt rename to Documentation/power/pm_qos_interface.rst index 19c5f7b1a7ba..945fc6d760c9 100644 --- a/Documentation/power/pm_qos_interface.txt +++ b/Documentation/power/pm_qos_interface.rst @@ -1,4 +1,6 @@ -PM Quality Of Service Interface. +=============================== +PM Quality Of Service Interface +=============================== This interface provides a kernel and user mode interface for registering performance expectations by drivers, subsystems and user space applications on @@ -11,6 +13,7 @@ memory_bandwidth. constraints and PM QoS flags. Each parameters have defined units: + * latency: usec * timeout: usec * throughput: kbs (kilo bit / sec) @@ -18,6 +21,7 @@ Each parameters have defined units: 1. 
PM QoS framework +=================== The infrastructure exposes multiple misc device nodes one per implemented parameter. The set of parameters implement is defined by pm_qos_power_init() @@ -37,38 +41,39 @@ reading the aggregated value does not require any locking mechanism. From kernel mode the use of this interface is simple: void pm_qos_add_request(handle, param_class, target_value): -Will insert an element into the list for that identified PM QoS class with the -target value. Upon change to this list the new target is recomputed and any -registered notifiers are called only if the target value is now different. -Clients of pm_qos need to save the returned handle for future use in other -pm_qos API functions. + Will insert an element into the list for that identified PM QoS class with the + target value. Upon change to this list the new target is recomputed and any + registered notifiers are called only if the target value is now different. + Clients of pm_qos need to save the returned handle for future use in other + pm_qos API functions. void pm_qos_update_request(handle, new_target_value): -Will update the list element pointed to by the handle with the new target value -and recompute the new aggregated target, calling the notification tree if the -target is changed. + Will update the list element pointed to by the handle with the new target value + and recompute the new aggregated target, calling the notification tree if the + target is changed. void pm_qos_remove_request(handle): -Will remove the element. After removal it will update the aggregate target and -call the notification tree if the target was changed as a result of removing -the request. + Will remove the element. After removal it will update the aggregate target and + call the notification tree if the target was changed as a result of removing + the request. int pm_qos_request(param_class): -Returns the aggregated value for a given PM QoS class. 
+ Returns the aggregated value for a given PM QoS class. int pm_qos_request_active(handle): -Returns if the request is still active, i.e. it has not been removed from a -PM QoS class constraints list. + Returns if the request is still active, i.e. it has not been removed from a + PM QoS class constraints list. int pm_qos_add_notifier(param_class, notifier): -Adds a notification callback function to the PM QoS class. The callback is -called when the aggregated value for the PM QoS class is changed. + Adds a notification callback function to the PM QoS class. The callback is + called when the aggregated value for the PM QoS class is changed. int pm_qos_remove_notifier(int param_class, notifier): -Removes the notification callback function for the PM QoS class. + Removes the notification callback function for the PM QoS class. From user mode: + Only processes can register a pm_qos request. To provide for automatic cleanup of a process, the interface requires the process to register its parameter requests in the following way: @@ -89,6 +94,7 @@ node. 2. PM QoS per-device latency and flags framework +================================================ For each device, there are three lists of PM QoS requests. Two of them are maintained along with the aggregated targets of resume latency and active @@ -107,73 +113,80 @@ the aggregated value does not require any locking mechanism. From kernel mode the use of this interface is the following: int dev_pm_qos_add_request(device, handle, type, value): -Will insert an element into the list for that identified device with the -target value. Upon change to this list the new target is recomputed and any -registered notifiers are called only if the target value is now different. -Clients of dev_pm_qos need to save the handle for future use in other -dev_pm_qos API functions. + Will insert an element into the list for that identified device with the + target value. 
Upon change to this list the new target is recomputed and any + registered notifiers are called only if the target value is now different. + Clients of dev_pm_qos need to save the handle for future use in other + dev_pm_qos API functions. int dev_pm_qos_update_request(handle, new_value): -Will update the list element pointed to by the handle with the new target value -and recompute the new aggregated target, calling the notification trees if the -target is changed. + Will update the list element pointed to by the handle with the new target + value and recompute the new aggregated target, calling the notification + trees if the target is changed. int dev_pm_qos_remove_request(handle): -Will remove the element. After removal it will update the aggregate target and -call the notification trees if the target was changed as a result of removing -the request. + Will remove the element. After removal it will update the aggregate target + and call the notification trees if the target was changed as a result of + removing the request. s32 dev_pm_qos_read_value(device): -Returns the aggregated value for a given device's constraints list. + Returns the aggregated value for a given device's constraints list. enum pm_qos_flags_status dev_pm_qos_flags(device, mask) -Check PM QoS flags of the given device against the given mask of flags. -The meaning of the return values is as follows: - PM_QOS_FLAGS_ALL: All flags from the mask are set - PM_QOS_FLAGS_SOME: Some flags from the mask are set - PM_QOS_FLAGS_NONE: No flags from the mask are set - PM_QOS_FLAGS_UNDEFINED: The device's PM QoS structure has not been - initialized or the list of requests is empty. + Check PM QoS flags of the given device against the given mask of flags. 
+ The meaning of the return values is as follows: + + PM_QOS_FLAGS_ALL: + All flags from the mask are set + PM_QOS_FLAGS_SOME: + Some flags from the mask are set + PM_QOS_FLAGS_NONE: + No flags from the mask are set + PM_QOS_FLAGS_UNDEFINED: + The device's PM QoS structure has not been initialized + or the list of requests is empty. int dev_pm_qos_add_ancestor_request(dev, handle, type, value) -Add a PM QoS request for the first direct ancestor of the given device whose -power.ignore_children flag is unset (for DEV_PM_QOS_RESUME_LATENCY requests) -or whose power.set_latency_tolerance callback pointer is not NULL (for -DEV_PM_QOS_LATENCY_TOLERANCE requests). + Add a PM QoS request for the first direct ancestor of the given device whose + power.ignore_children flag is unset (for DEV_PM_QOS_RESUME_LATENCY requests) + or whose power.set_latency_tolerance callback pointer is not NULL (for + DEV_PM_QOS_LATENCY_TOLERANCE requests). int dev_pm_qos_expose_latency_limit(device, value) -Add a request to the device's PM QoS list of resume latency constraints and -create a sysfs attribute pm_qos_resume_latency_us under the device's power -directory allowing user space to manipulate that request. + Add a request to the device's PM QoS list of resume latency constraints and + create a sysfs attribute pm_qos_resume_latency_us under the device's power + directory allowing user space to manipulate that request. void dev_pm_qos_hide_latency_limit(device) -Drop the request added by dev_pm_qos_expose_latency_limit() from the device's -PM QoS list of resume latency constraints and remove sysfs attribute -pm_qos_resume_latency_us from the device's power directory. + Drop the request added by dev_pm_qos_expose_latency_limit() from the device's + PM QoS list of resume latency constraints and remove sysfs attribute + pm_qos_resume_latency_us from the device's power directory. 
int dev_pm_qos_expose_flags(device, value) -Add a request to the device's PM QoS list of flags and create sysfs attribute -pm_qos_no_power_off under the device's power directory allowing user space to -change the value of the PM_QOS_FLAG_NO_POWER_OFF flag. + Add a request to the device's PM QoS list of flags and create sysfs attribute + pm_qos_no_power_off under the device's power directory allowing user space to + change the value of the PM_QOS_FLAG_NO_POWER_OFF flag. void dev_pm_qos_hide_flags(device) -Drop the request added by dev_pm_qos_expose_flags() from the device's PM QoS list -of flags and remove sysfs attribute pm_qos_no_power_off from the device's power -directory. + Drop the request added by dev_pm_qos_expose_flags() from the device's PM QoS list + of flags and remove sysfs attribute pm_qos_no_power_off from the device's power + directory. Notification mechanisms: + The per-device PM QoS framework has a per-device notification tree. int dev_pm_qos_add_notifier(device, notifier): -Adds a notification callback function for the device. -The callback is called when the aggregated value of the device constraints list -is changed (for resume latency device PM QoS only). + Adds a notification callback function for the device. + The callback is called when the aggregated value of the device constraints list + is changed (for resume latency device PM QoS only). int dev_pm_qos_remove_notifier(device, notifier): -Removes the notification callback function for the device. + Removes the notification callback function for the device. Active state latency tolerance +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ This device PM QoS type is used to support systems in which hardware may switch to energy-saving operation modes on the fly. 
In those systems, if the operation diff --git a/Documentation/power/power_supply_class.rst b/Documentation/power/power_supply_class.rst new file mode 100644 index 000000000000..3f2c3fe38a61 --- /dev/null +++ b/Documentation/power/power_supply_class.rst @@ -0,0 +1,282 @@ +======================== +Linux power supply class +======================== + +Synopsis +~~~~~~~~ +Power supply class used to represent battery, UPS, AC or DC power supply +properties to user-space. + +It defines core set of attributes, which should be applicable to (almost) +every power supply out there. Attributes are available via sysfs and uevent +interfaces. + +Each attribute has well defined meaning, up to unit of measure used. While +the attributes provided are believed to be universally applicable to any +power supply, specific monitoring hardware may not be able to provide them +all, so any of them may be skipped. + +Power supply class is extensible, and allows to define drivers own attributes. +The core attribute set is subject to the standard Linux evolution (i.e. +if it will be found that some attribute is applicable to many power supply +types or their drivers, it can be added to the core set). + +It also integrates with LED framework, for the purpose of providing +typically expected feedback of battery charging/fully charged status and +AC/USB power supply online status. (Note that specific details of the +indication (including whether to use it at all) are fully controllable by +user and/or specific machine defaults, per design principles of LED +framework). + + +Attributes/properties +~~~~~~~~~~~~~~~~~~~~~ +Power supply class has predefined set of attributes, this eliminates code +duplication across drivers. Power supply class insist on reusing its +predefined attributes *and* their units. + +So, userspace gets predictable set of attributes and their units for any +kind of power supply, and can process/present them to a user in consistent +manner. 
Results for different power supplies and machines are also directly +comparable. + +See drivers/power/supply/ds2760_battery.c and drivers/power/supply/pda_power.c +for the example how to declare and handle attributes. + + +Units +~~~~~ +Quoting include/linux/power_supply.h: + + All voltages, currents, charges, energies, time and temperatures in µV, + µA, µAh, µWh, seconds and tenths of degree Celsius unless otherwise + stated. It's driver's job to convert its raw values to units in which + this class operates. + + +Attributes/properties detailed +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + ++--------------------------------------------------------------------------+ +| **Charge/Energy/Capacity - how to not confuse** | ++--------------------------------------------------------------------------+ +| **Because both "charge" (µAh) and "energy" (µWh) represents "capacity" | +| of battery, this class distinguish these terms. Don't mix them!** | +| | +| - `CHARGE_*` | +| attributes represents capacity in µAh only. | +| - `ENERGY_*` | +| attributes represents capacity in µWh only. | +| - `CAPACITY` | +| attribute represents capacity in *percents*, from 0 to 100. | ++--------------------------------------------------------------------------+ + +Postfixes: + +_AVG + *hardware* averaged value, use it if your hardware is really able to + report averaged values. +_NOW + momentary/instantaneous values. + +STATUS + this attribute represents operating status (charging, full, + discharging (i.e. powering a load), etc.). This corresponds to + `BATTERY_STATUS_*` values, as defined in battery.h. + +CHARGE_TYPE + batteries can typically charge at different rates. + This defines trickle and fast charges. For batteries that + are already charged or discharging, 'n/a' can be displayed (or + 'unknown', if the status is not known). + +AUTHENTIC + indicates the power supply (battery or charger) connected + to the platform is authentic(1) or non authentic(0). 
+ +HEALTH + represents health of the battery, values corresponds to + POWER_SUPPLY_HEALTH_*, defined in battery.h. + +VOLTAGE_OCV + open circuit voltage of the battery. + +VOLTAGE_MAX_DESIGN, VOLTAGE_MIN_DESIGN + design values for maximal and minimal power supply voltages. + Maximal/minimal means values of voltages when battery considered + "full"/"empty" at normal conditions. Yes, there is no direct relation + between voltage and battery capacity, but some dumb + batteries use voltage for very approximated calculation of capacity. + Battery driver also can use this attribute just to inform userspace + about maximal and minimal voltage thresholds of a given battery. + +VOLTAGE_MAX, VOLTAGE_MIN + same as _DESIGN voltage values except that these ones should be used + if hardware could only guess (measure and retain) the thresholds of a + given power supply. + +VOLTAGE_BOOT + Reports the voltage measured during boot + +CURRENT_BOOT + Reports the current measured during boot + +CHARGE_FULL_DESIGN, CHARGE_EMPTY_DESIGN + design charge values, when battery considered full/empty. + +ENERGY_FULL_DESIGN, ENERGY_EMPTY_DESIGN + same as above but for energy. + +CHARGE_FULL, CHARGE_EMPTY + These attributes means "last remembered value of charge when battery + became full/empty". It also could mean "value of charge when battery + considered full/empty at given conditions (temperature, age)". + I.e. these attributes represents real thresholds, not design values. + +ENERGY_FULL, ENERGY_EMPTY + same as above but for energy. + +CHARGE_COUNTER + the current charge counter (in µAh). This could easily + be negative; there is no empty or full value. It is only useful for + relative, time-based measurements. + +PRECHARGE_CURRENT + the maximum charge current during precharge phase of charge cycle + (typically 20% of battery capacity). + +CHARGE_TERM_CURRENT + Charge termination current. 
The charge cycle terminates when battery + voltage is above recharge threshold, and charge current is below + this setting (typically 10% of battery capacity). + +CONSTANT_CHARGE_CURRENT + constant charge current programmed by charger. + + +CONSTANT_CHARGE_CURRENT_MAX + maximum charge current supported by the power supply object. + +CONSTANT_CHARGE_VOLTAGE + constant charge voltage programmed by charger. +CONSTANT_CHARGE_VOLTAGE_MAX + maximum charge voltage supported by the power supply object. + +INPUT_CURRENT_LIMIT + input current limit programmed by charger. Indicates + the current drawn from a charging source. + +CHARGE_CONTROL_LIMIT + current charge control limit setting +CHARGE_CONTROL_LIMIT_MAX + maximum charge control limit setting + +CALIBRATE + battery or coulomb counter calibration status + +CAPACITY + capacity in percents. +CAPACITY_ALERT_MIN + minimum capacity alert value in percents. +CAPACITY_ALERT_MAX + maximum capacity alert value in percents. +CAPACITY_LEVEL + capacity level. This corresponds to POWER_SUPPLY_CAPACITY_LEVEL_*. + +TEMP + temperature of the power supply. +TEMP_ALERT_MIN + minimum battery temperature alert. +TEMP_ALERT_MAX + maximum battery temperature alert. +TEMP_AMBIENT + ambient temperature. +TEMP_AMBIENT_ALERT_MIN + minimum ambient temperature alert. +TEMP_AMBIENT_ALERT_MAX + maximum ambient temperature alert. +TEMP_MIN + minimum operatable temperature +TEMP_MAX + maximum operatable temperature + +TIME_TO_EMPTY + seconds left for battery to be considered empty + (i.e. while battery powers a load) +TIME_TO_FULL + seconds left for battery to be considered full + (i.e. while battery is charging) + + +Battery <-> external power supply interaction +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Often power supplies are acting as supplies and supplicants at the same +time. Batteries are good example. So, batteries usually care if they're +externally powered or not. 
+ +For that case, power supply class implements notification mechanism for +batteries. + +External power supply (AC) lists supplicants (batteries) names in +"supplied_to" struct member, and each power_supply_changed() call +issued by external power supply will notify supplicants via +external_power_changed callback. + + +Devicetree battery characteristics +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Drivers should call power_supply_get_battery_info() to obtain battery +characteristics from a devicetree battery node, defined in +Documentation/devicetree/bindings/power/supply/battery.txt. This is +implemented in drivers/power/supply/bq27xxx_battery.c. + +Properties in struct power_supply_battery_info and their counterparts in the +battery node have names corresponding to elements in enum power_supply_property, +for naming consistency between sysfs attributes and battery node properties. + + +QA +~~ + +Q: + Where is POWER_SUPPLY_PROP_XYZ attribute? +A: + If you cannot find attribute suitable for your driver needs, feel free + to add it and send patch along with your driver. + + The attributes available currently are the ones currently provided by the + drivers written. + + Good candidates to add in future: model/part#, cycle_time, manufacturer, + etc. + + +Q: + I have some very specific attribute (e.g. battery color), should I add + this attribute to standard ones? +A: + Most likely, no. Such attribute can be placed in the driver itself, if + it is useful. Of course, if the attribute in question applicable to + large set of batteries, provided by many drivers, and/or comes from + some general battery specification/standard, it may be a candidate to + be added to the core attribute set. + + +Q: + Suppose, my battery monitoring chip/firmware does not provides capacity + in percents, but provides charge_{now,full,empty}. Should I calculate + percentage capacity manually, inside the driver, and register CAPACITY + attribute? The same question about time_to_empty/time_to_full. 
+A: + Most likely, no. This class is designed to export properties which are + directly measurable by the specific hardware available. + + Inferring not available properties using some heuristics or mathematical + model is not subject of work for a battery driver. Such functionality + should be factored out, and in fact, apm_power, the driver to serve + legacy APM API on top of power supply class, uses a simple heuristic of + approximating remaining battery capacity based on its charge, current, + voltage and so on. But full-fledged battery model is likely not subject + for kernel at all, as it would require floating point calculation to deal + with things like differential equations and Kalman filters. This is + better be handled by batteryd/libbattery, yet to be written. diff --git a/Documentation/power/power_supply_class.txt b/Documentation/power/power_supply_class.txt deleted file mode 100644 index 300d37896e51..000000000000 --- a/Documentation/power/power_supply_class.txt +++ /dev/null @@ -1,231 +0,0 @@ -Linux power supply class -======================== - -Synopsis -~~~~~~~~ -Power supply class used to represent battery, UPS, AC or DC power supply -properties to user-space. - -It defines core set of attributes, which should be applicable to (almost) -every power supply out there. Attributes are available via sysfs and uevent -interfaces. - -Each attribute has well defined meaning, up to unit of measure used. While -the attributes provided are believed to be universally applicable to any -power supply, specific monitoring hardware may not be able to provide them -all, so any of them may be skipped. - -Power supply class is extensible, and allows to define drivers own attributes. -The core attribute set is subject to the standard Linux evolution (i.e. -if it will be found that some attribute is applicable to many power supply -types or their drivers, it can be added to the core set). 
- -It also integrates with LED framework, for the purpose of providing -typically expected feedback of battery charging/fully charged status and -AC/USB power supply online status. (Note that specific details of the -indication (including whether to use it at all) are fully controllable by -user and/or specific machine defaults, per design principles of LED -framework). - - -Attributes/properties -~~~~~~~~~~~~~~~~~~~~~ -Power supply class has predefined set of attributes, this eliminates code -duplication across drivers. Power supply class insist on reusing its -predefined attributes *and* their units. - -So, userspace gets predictable set of attributes and their units for any -kind of power supply, and can process/present them to a user in consistent -manner. Results for different power supplies and machines are also directly -comparable. - -See drivers/power/supply/ds2760_battery.c and drivers/power/supply/pda_power.c -for the example how to declare and handle attributes. - - -Units -~~~~~ -Quoting include/linux/power_supply.h: - - All voltages, currents, charges, energies, time and temperatures in µV, - µA, µAh, µWh, seconds and tenths of degree Celsius unless otherwise - stated. It's driver's job to convert its raw values to units in which - this class operates. - - -Attributes/properties detailed -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -~ ~ ~ ~ ~ ~ ~ Charge/Energy/Capacity - how to not confuse ~ ~ ~ ~ ~ ~ ~ -~ ~ -~ Because both "charge" (µAh) and "energy" (µWh) represents "capacity" ~ -~ of battery, this class distinguish these terms. Don't mix them! ~ -~ ~ -~ CHARGE_* attributes represents capacity in µAh only. ~ -~ ENERGY_* attributes represents capacity in µWh only. ~ -~ CAPACITY attribute represents capacity in *percents*, from 0 to 100. ~ -~ ~ -~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ - -Postfixes: -_AVG - *hardware* averaged value, use it if your hardware is really able to -report averaged values. 
-_NOW - momentary/instantaneous values. - -STATUS - this attribute represents operating status (charging, full, -discharging (i.e. powering a load), etc.). This corresponds to -BATTERY_STATUS_* values, as defined in battery.h. - -CHARGE_TYPE - batteries can typically charge at different rates. -This defines trickle and fast charges. For batteries that -are already charged or discharging, 'n/a' can be displayed (or -'unknown', if the status is not known). - -AUTHENTIC - indicates the power supply (battery or charger) connected -to the platform is authentic(1) or non authentic(0). - -HEALTH - represents health of the battery, values corresponds to -POWER_SUPPLY_HEALTH_*, defined in battery.h. - -VOLTAGE_OCV - open circuit voltage of the battery. - -VOLTAGE_MAX_DESIGN, VOLTAGE_MIN_DESIGN - design values for maximal and -minimal power supply voltages. Maximal/minimal means values of voltages -when battery considered "full"/"empty" at normal conditions. Yes, there is -no direct relation between voltage and battery capacity, but some dumb -batteries use voltage for very approximated calculation of capacity. -Battery driver also can use this attribute just to inform userspace -about maximal and minimal voltage thresholds of a given battery. - -VOLTAGE_MAX, VOLTAGE_MIN - same as _DESIGN voltage values except that -these ones should be used if hardware could only guess (measure and -retain) the thresholds of a given power supply. - -VOLTAGE_BOOT - Reports the voltage measured during boot - -CURRENT_BOOT - Reports the current measured during boot - -CHARGE_FULL_DESIGN, CHARGE_EMPTY_DESIGN - design charge values, when -battery considered full/empty. - -ENERGY_FULL_DESIGN, ENERGY_EMPTY_DESIGN - same as above but for energy. - -CHARGE_FULL, CHARGE_EMPTY - These attributes means "last remembered value -of charge when battery became full/empty". It also could mean "value of -charge when battery considered full/empty at given conditions (temperature, -age)". I.e. 
these attributes represents real thresholds, not design values. - -ENERGY_FULL, ENERGY_EMPTY - same as above but for energy. - -CHARGE_COUNTER - the current charge counter (in µAh). This could easily -be negative; there is no empty or full value. It is only useful for -relative, time-based measurements. - -PRECHARGE_CURRENT - the maximum charge current during precharge phase -of charge cycle (typically 20% of battery capacity). -CHARGE_TERM_CURRENT - Charge termination current. The charge cycle -terminates when battery voltage is above recharge threshold, and charge -current is below this setting (typically 10% of battery capacity). - -CONSTANT_CHARGE_CURRENT - constant charge current programmed by charger. -CONSTANT_CHARGE_CURRENT_MAX - maximum charge current supported by the -power supply object. - -CONSTANT_CHARGE_VOLTAGE - constant charge voltage programmed by charger. -CONSTANT_CHARGE_VOLTAGE_MAX - maximum charge voltage supported by the -power supply object. - -INPUT_CURRENT_LIMIT - input current limit programmed by charger. Indicates -the current drawn from a charging source. - -CHARGE_CONTROL_LIMIT - current charge control limit setting -CHARGE_CONTROL_LIMIT_MAX - maximum charge control limit setting - -CALIBRATE - battery or coulomb counter calibration status - -CAPACITY - capacity in percents. -CAPACITY_ALERT_MIN - minimum capacity alert value in percents. -CAPACITY_ALERT_MAX - maximum capacity alert value in percents. -CAPACITY_LEVEL - capacity level. This corresponds to -POWER_SUPPLY_CAPACITY_LEVEL_*. - -TEMP - temperature of the power supply. -TEMP_ALERT_MIN - minimum battery temperature alert. -TEMP_ALERT_MAX - maximum battery temperature alert. -TEMP_AMBIENT - ambient temperature. -TEMP_AMBIENT_ALERT_MIN - minimum ambient temperature alert. -TEMP_AMBIENT_ALERT_MAX - maximum ambient temperature alert. 
-TEMP_MIN - minimum operatable temperature -TEMP_MAX - maximum operatable temperature - -TIME_TO_EMPTY - seconds left for battery to be considered empty (i.e. -while battery powers a load) -TIME_TO_FULL - seconds left for battery to be considered full (i.e. -while battery is charging) - - -Battery <-> external power supply interaction -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -Often power supplies are acting as supplies and supplicants at the same -time. Batteries are good example. So, batteries usually care if they're -externally powered or not. - -For that case, power supply class implements notification mechanism for -batteries. - -External power supply (AC) lists supplicants (batteries) names in -"supplied_to" struct member, and each power_supply_changed() call -issued by external power supply will notify supplicants via -external_power_changed callback. - - -Devicetree battery characteristics -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -Drivers should call power_supply_get_battery_info() to obtain battery -characteristics from a devicetree battery node, defined in -Documentation/devicetree/bindings/power/supply/battery.txt. This is -implemented in drivers/power/supply/bq27xxx_battery.c. - -Properties in struct power_supply_battery_info and their counterparts in the -battery node have names corresponding to elements in enum power_supply_property, -for naming consistency between sysfs attributes and battery node properties. - - -QA -~~ -Q: Where is POWER_SUPPLY_PROP_XYZ attribute? -A: If you cannot find attribute suitable for your driver needs, feel free - to add it and send patch along with your driver. - - The attributes available currently are the ones currently provided by the - drivers written. - - Good candidates to add in future: model/part#, cycle_time, manufacturer, - etc. - - -Q: I have some very specific attribute (e.g. battery color), should I add - this attribute to standard ones? -A: Most likely, no. 
Such attribute can be placed in the driver itself, if - it is useful. Of course, if the attribute in question applicable to - large set of batteries, provided by many drivers, and/or comes from - some general battery specification/standard, it may be a candidate to - be added to the core attribute set. - - -Q: Suppose, my battery monitoring chip/firmware does not provides capacity - in percents, but provides charge_{now,full,empty}. Should I calculate - percentage capacity manually, inside the driver, and register CAPACITY - attribute? The same question about time_to_empty/time_to_full. -A: Most likely, no. This class is designed to export properties which are - directly measurable by the specific hardware available. - - Inferring not available properties using some heuristics or mathematical - model is not subject of work for a battery driver. Such functionality - should be factored out, and in fact, apm_power, the driver to serve - legacy APM API on top of power supply class, uses a simple heuristic of - approximating remaining battery capacity based on its charge, current, - voltage and so on. But full-fledged battery model is likely not subject - for kernel at all, as it would require floating point calculation to deal - with things like differential equations and Kalman filters. This is - better be handled by batteryd/libbattery, yet to be written. diff --git a/Documentation/power/powercap/powercap.rst b/Documentation/power/powercap/powercap.rst new file mode 100644 index 000000000000..7ae3b44c7624 --- /dev/null +++ b/Documentation/power/powercap/powercap.rst @@ -0,0 +1,257 @@ +======================= +Power Capping Framework +======================= + +The power capping framework provides a consistent interface between the kernel +and the user space that allows power capping drivers to expose the settings to +user space in a uniform way. 
+ +Terminology +=========== + +The framework exposes power capping devices to user space via sysfs in the +form of a tree of objects. The objects at the root level of the tree represent +'control types', which correspond to different methods of power capping. For +example, the intel-rapl control type represents the Intel "Running Average +Power Limit" (RAPL) technology, whereas the 'idle-injection' control type +corresponds to the use of idle injection for controlling power. + +Power zones represent different parts of the system, which can be controlled and +monitored using the power capping method determined by the control type the +given zone belongs to. They each contain attributes for monitoring power, as +well as controls represented in the form of power constraints. If the parts of +the system represented by different power zones are hierarchical (that is, one +bigger part consists of multiple smaller parts that each have their own power +controls), those power zones may also be organized in a hierarchy with one +parent power zone containing multiple subzones and so on to reflect the power +control topology of the system. In that case, it is possible to apply power +capping to a set of devices together using the parent power zone and if more +fine grained control is required, it can be applied through the subzones. 
+ + +Example sysfs interface tree:: + + /sys/devices/virtual/powercap + └──intel-rapl + ├──intel-rapl:0 + │   ├──constraint_0_name + │   ├──constraint_0_power_limit_uw + │   ├──constraint_0_time_window_us + │   ├──constraint_1_name + │   ├──constraint_1_power_limit_uw + │   ├──constraint_1_time_window_us + │   ├──device -> ../../intel-rapl + │   ├──energy_uj + │   ├──intel-rapl:0:0 + │   │   ├──constraint_0_name + │   │   ├──constraint_0_power_limit_uw + │   │   ├──constraint_0_time_window_us + │   │   ├──constraint_1_name + │   │   ├──constraint_1_power_limit_uw + │   │   ├──constraint_1_time_window_us + │   │   ├──device -> ../../intel-rapl:0 + │   │   ├──energy_uj + │   │   ├──max_energy_range_uj + │   │   ├──name + │   │   ├──enabled + │   │   ├──power + │   │   │   ├──async + │   │   │   [] + │   │   ├──subsystem -> ../../../../../../class/power_cap + │   │   └──uevent + │   ├──intel-rapl:0:1 + │   │   ├──constraint_0_name + │   │   ├──constraint_0_power_limit_uw + │   │   ├──constraint_0_time_window_us + │   │   ├──constraint_1_name + │   │   ├──constraint_1_power_limit_uw + │   │   ├──constraint_1_time_window_us + │   │   ├──device -> ../../intel-rapl:0 + │   │   ├──energy_uj + │   │   ├──max_energy_range_uj + │   │   ├──name + │   │   ├──enabled + │   │   ├──power + │   │   │   ├──async + │   │   │   [] + │   │   ├──subsystem -> ../../../../../../class/power_cap + │   │   └──uevent + │   ├──max_energy_range_uj + │   ├──max_power_range_uw + │   ├──name + │   ├──enabled + │   ├──power + │   │   ├──async + │   │   [] + │   ├──subsystem -> ../../../../../class/power_cap + │   ├──enabled + │   ├──uevent + ├──intel-rapl:1 + │   ├──constraint_0_name + │   ├──constraint_0_power_limit_uw + │   ├──constraint_0_time_window_us + │   ├──constraint_1_name + │   ├──constraint_1_power_limit_uw + │   ├──constraint_1_time_window_us + │   ├──device -> ../../intel-rapl + │   ├──energy_uj + │   ├──intel-rapl:1:0 + │   │   ├──constraint_0_name + │   │   
├──constraint_0_power_limit_uw + │   │   ├──constraint_0_time_window_us + │   │   ├──constraint_1_name + │   │   ├──constraint_1_power_limit_uw + │   │   ├──constraint_1_time_window_us + │   │   ├──device -> ../../intel-rapl:1 + │   │   ├──energy_uj + │   │   ├──max_energy_range_uj + │   │   ├──name + │   │   ├──enabled + │   │   ├──power + │   │   │   ├──async + │   │   │   [] + │   │   ├──subsystem -> ../../../../../../class/power_cap + │   │   └──uevent + │   ├──intel-rapl:1:1 + │   │   ├──constraint_0_name + │   │   ├──constraint_0_power_limit_uw + │   │   ├──constraint_0_time_window_us + │   │   ├──constraint_1_name + │   │   ├──constraint_1_power_limit_uw + │   │   ├──constraint_1_time_window_us + │   │   ├──device -> ../../intel-rapl:1 + │   │   ├──energy_uj + │   │   ├──max_energy_range_uj + │   │   ├──name + │   │   ├──enabled + │   │   ├──power + │   │   │   ├──async + │   │   │   [] + │   │   ├──subsystem -> ../../../../../../class/power_cap + │   │   └──uevent + │   ├──max_energy_range_uj + │   ├──max_power_range_uw + │   ├──name + │   ├──enabled + │   ├──power + │   │   ├──async + │   │   [] + │   ├──subsystem -> ../../../../../class/power_cap + │   ├──uevent + ├──power + │   ├──async + │   [] + ├──subsystem -> ../../../../class/power_cap + ├──enabled + └──uevent + +The above example illustrates a case in which the Intel RAPL technology, +available in Intel® IA-64 and IA-32 Processor Architectures, is used. There is one +control type called intel-rapl which contains two power zones, intel-rapl:0 and +intel-rapl:1, representing CPU packages. Each of these power zones contains +two subzones, intel-rapl:j:0 and intel-rapl:j:1 (j = 0, 1), representing the +"core" and the "uncore" parts of the given CPU package, respectively. 
All of +the zones and subzones contain energy monitoring attributes (energy_uj, +max_energy_range_uj) and constraint attributes (constraint_*) allowing controls +to be applied (the constraints in the 'package' power zones apply to the whole +CPU packages and the subzone constraints only apply to the respective parts of +the given package individually). Since Intel RAPL doesn't provide instantaneous +power value, there is no power_uw attribute. + +In addition to that, each power zone contains a name attribute, allowing the +part of the system represented by that zone to be identified. +For example:: + + cat /sys/class/power_cap/intel-rapl/intel-rapl:0/name + +package-0 +--------- + +The Intel RAPL technology allows two constraints, short term and long term, +with two different time windows to be applied to each power zone. Thus for +each zone there are 2 attributes representing the constraint names, 2 power +limits and 2 attributes representing the sizes of the time windows. Such that, +constraint_j_* attributes correspond to the jth constraint (j = 0,1). + +For example:: + + constraint_0_name + constraint_0_power_limit_uw + constraint_0_time_window_us + constraint_1_name + constraint_1_power_limit_uw + constraint_1_time_window_us + +Power Zone Attributes +===================== + +Monitoring attributes +--------------------- + +energy_uj (rw) + Current energy counter in micro joules. Write "0" to reset. + If the counter can not be reset, then this attribute is read only. + +max_energy_range_uj (ro) + Range of the above energy counter in micro-joules. + +power_uw (ro) + Current power in micro watts. + +max_power_range_uw (ro) + Range of the above power value in micro-watts. + +name (ro) + Name of this power zone. + +It is possible that some domains have both power ranges and energy counter ranges; +however, only one is mandatory. 
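As an illustration of the monitoring attributes above: since Intel RAPL exposes no power_uw attribute, user space can derive an average power figure from two energy_uj readings. The following is a minimal sketch, not part of the kernel interface. The helper names (`read_uj`, `average_power_uw`, `sample_average_power_uw`) are hypothetical, and it assumes that the counter wraps at most once between samples and that adding max_energy_range_uj is an adequate correction for a wrap.

```python
import time
from pathlib import Path


def read_uj(zone_dir, attribute):
    """Read an integer attribute (e.g. energy_uj) from a power zone directory."""
    return int(Path(zone_dir, attribute).read_text().strip())


def average_power_uw(energy_start_uj, energy_end_uj, elapsed_s, max_energy_range_uj):
    """Average power in microwatts between two energy counter samples."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    delta_uj = energy_end_uj - energy_start_uj
    if delta_uj < 0:
        # Assumption: the counter wrapped exactly once between the samples.
        delta_uj += max_energy_range_uj
    # Microjoules per second are microwatts.
    return delta_uj / elapsed_s


def sample_average_power_uw(zone_dir, interval_s=1.0):
    """Sample energy_uj twice, interval_s apart, and return the average power."""
    max_range = read_uj(zone_dir, "max_energy_range_uj")
    start_uj = read_uj(zone_dir, "energy_uj")
    start_time = time.monotonic()
    time.sleep(interval_s)
    end_uj = read_uj(zone_dir, "energy_uj")
    elapsed_s = time.monotonic() - start_time
    return average_power_uw(start_uj, end_uj, elapsed_s, max_range)
```

On a system with the layout shown in the example tree, `sample_average_power_uw("/sys/class/power_cap/intel-rapl/intel-rapl:0")` would report the average power of package 0 over roughly one second. Reading energy_uj may require elevated privileges on some systems.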
+ +Constraints +----------- + +constraint_X_power_limit_uw (rw) + Power limit in micro watts, which should be applicable for the + time window specified by "constraint_X_time_window_us". + +constraint_X_time_window_us (rw) + Time window in micro seconds. + +constraint_X_name (ro) + An optional name of the constraint + +constraint_X_max_power_uw(ro) + Maximum allowed power in micro watts. + +constraint_X_min_power_uw(ro) + Minimum allowed power in micro watts. + +constraint_X_max_time_window_us(ro) + Maximum allowed time window in micro seconds. + +constraint_X_min_time_window_us(ro) + Minimum allowed time window in micro seconds. + +Except power_limit_uw and time_window_us other fields are optional. + +Common zone and control type attributes +--------------------------------------- + +enabled (rw): Enable/Disable controls at zone level or for all zones using +a control type. + +Power Cap Client Driver Interface +================================= + +The API summary: + +Call powercap_register_control_type() to register control type object. +Call powercap_register_zone() to register a power zone (under a given +control type), either as a top-level power zone or as a subzone of another +power zone registered earlier. +The number of constraints in a power zone and the corresponding callbacks have +to be defined prior to calling powercap_register_zone() to register that zone. + +To Free a power zone call powercap_unregister_zone(). +To free a control type object call powercap_unregister_control_type(). +Detailed API can be generated using kernel-doc on include/linux/powercap.h. 
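From the user space side, the constraint_X_* attributes listed under "Constraints" can be collected per constraint index. The following is a small sketch under stated assumptions: `read_constraints` is a hypothetical helper, not part of the powercap API, and it assumes every attribute file in the zone directory is readable and holds a single line of text.

```python
import re
from pathlib import Path

# Matches names such as constraint_0_power_limit_uw:
# group 1 is the constraint index X, group 2 is the field name.
CONSTRAINT_RE = re.compile(r"^constraint_(\d+)_(\w+)$")


def read_constraints(zone_dir):
    """Return {index: {field: value}} for all constraint_X_* files in zone_dir."""
    constraints = {}
    for entry in sorted(Path(zone_dir).iterdir()):
        match = CONSTRAINT_RE.match(entry.name)
        if match is None or not entry.is_file():
            continue
        index, field = int(match.group(1)), match.group(2)
        # Values are kept as strings; optional fields are simply absent.
        constraints.setdefault(index, {})[field] = entry.read_text().strip()
    return constraints
```

Because only power_limit_uw and time_window_us are mandatory, callers should treat any other field as possibly missing, for example with `constraints[0].get("name")`.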
diff --git a/Documentation/power/powercap/powercap.txt b/Documentation/power/powercap/powercap.txt deleted file mode 100644 index 1e6ef164e07a..000000000000 --- a/Documentation/power/powercap/powercap.txt +++ /dev/null @@ -1,236 +0,0 @@ -Power Capping Framework -================================== - -The power capping framework provides a consistent interface between the kernel -and the user space that allows power capping drivers to expose the settings to -user space in a uniform way. - -Terminology -========================= -The framework exposes power capping devices to user space via sysfs in the -form of a tree of objects. The objects at the root level of the tree represent -'control types', which correspond to different methods of power capping. For -example, the intel-rapl control type represents the Intel "Running Average -Power Limit" (RAPL) technology, whereas the 'idle-injection' control type -corresponds to the use of idle injection for controlling power. - -Power zones represent different parts of the system, which can be controlled and -monitored using the power capping method determined by the control type the -given zone belongs to. They each contain attributes for monitoring power, as -well as controls represented in the form of power constraints. If the parts of -the system represented by different power zones are hierarchical (that is, one -bigger part consists of multiple smaller parts that each have their own power -controls), those power zones may also be organized in a hierarchy with one -parent power zone containing multiple subzones and so on to reflect the power -control topology of the system. In that case, it is possible to apply power -capping to a set of devices together using the parent power zone and if more -fine grained control is required, it can be applied through the subzones. - - -Example sysfs interface tree: - -/sys/devices/virtual/powercap -??? intel-rapl - ??? intel-rapl:0 - ?   ??? constraint_0_name - ?   ??? 
constraint_0_power_limit_uw - ?   ??? constraint_0_time_window_us - ?   ??? constraint_1_name - ?   ??? constraint_1_power_limit_uw - ?   ??? constraint_1_time_window_us - ?   ??? device -> ../../intel-rapl - ?   ??? energy_uj - ?   ??? intel-rapl:0:0 - ?   ?   ??? constraint_0_name - ?   ?   ??? constraint_0_power_limit_uw - ?   ?   ??? constraint_0_time_window_us - ?   ?   ??? constraint_1_name - ?   ?   ??? constraint_1_power_limit_uw - ?   ?   ??? constraint_1_time_window_us - ?   ?   ??? device -> ../../intel-rapl:0 - ?   ?   ??? energy_uj - ?   ?   ??? max_energy_range_uj - ?   ?   ??? name - ?   ?   ??? enabled - ?   ?   ??? power - ?   ?   ?   ??? async - ?   ?   ?   [] - ?   ?   ??? subsystem -> ../../../../../../class/power_cap - ?   ?   ??? uevent - ?   ??? intel-rapl:0:1 - ?   ?   ??? constraint_0_name - ?   ?   ??? constraint_0_power_limit_uw - ?   ?   ??? constraint_0_time_window_us - ?   ?   ??? constraint_1_name - ?   ?   ??? constraint_1_power_limit_uw - ?   ?   ??? constraint_1_time_window_us - ?   ?   ??? device -> ../../intel-rapl:0 - ?   ?   ??? energy_uj - ?   ?   ??? max_energy_range_uj - ?   ?   ??? name - ?   ?   ??? enabled - ?   ?   ??? power - ?   ?   ?   ??? async - ?   ?   ?   [] - ?   ?   ??? subsystem -> ../../../../../../class/power_cap - ?   ?   ??? uevent - ?   ??? max_energy_range_uj - ?   ??? max_power_range_uw - ?   ??? name - ?   ??? enabled - ?   ??? power - ?   ?   ??? async - ?   ?   [] - ?   ??? subsystem -> ../../../../../class/power_cap - ?   ??? enabled - ?   ??? uevent - ??? intel-rapl:1 - ?   ??? constraint_0_name - ?   ??? constraint_0_power_limit_uw - ?   ??? constraint_0_time_window_us - ?   ??? constraint_1_name - ?   ??? constraint_1_power_limit_uw - ?   ??? constraint_1_time_window_us - ?   ??? device -> ../../intel-rapl - ?   ??? energy_uj - ?   ??? intel-rapl:1:0 - ?   ?   ??? constraint_0_name - ?   ?   ??? constraint_0_power_limit_uw - ?   ?   ??? constraint_0_time_window_us - ?   ?   ??? 
constraint_1_name - ?   ?   ??? constraint_1_power_limit_uw - ?   ?   ??? constraint_1_time_window_us - ?   ?   ??? device -> ../../intel-rapl:1 - ?   ?   ??? energy_uj - ?   ?   ??? max_energy_range_uj - ?   ?   ??? name - ?   ?   ??? enabled - ?   ?   ??? power - ?   ?   ?   ??? async - ?   ?   ?   [] - ?   ?   ??? subsystem -> ../../../../../../class/power_cap - ?   ?   ??? uevent - ?   ??? intel-rapl:1:1 - ?   ?   ??? constraint_0_name - ?   ?   ??? constraint_0_power_limit_uw - ?   ?   ??? constraint_0_time_window_us - ?   ?   ??? constraint_1_name - ?   ?   ??? constraint_1_power_limit_uw - ?   ?   ??? constraint_1_time_window_us - ?   ?   ??? device -> ../../intel-rapl:1 - ?   ?   ??? energy_uj - ?   ?   ??? max_energy_range_uj - ?   ?   ??? name - ?   ?   ??? enabled - ?   ?   ??? power - ?   ?   ?   ??? async - ?   ?   ?   [] - ?   ?   ??? subsystem -> ../../../../../../class/power_cap - ?   ?   ??? uevent - ?   ??? max_energy_range_uj - ?   ??? max_power_range_uw - ?   ??? name - ?   ??? enabled - ?   ??? power - ?   ?   ??? async - ?   ?   [] - ?   ??? subsystem -> ../../../../../class/power_cap - ?   ??? uevent - ??? power - ?   ??? async - ?   [] - ??? subsystem -> ../../../../class/power_cap - ??? enabled - ??? uevent - -The above example illustrates a case in which the Intel RAPL technology, -available in Intel® IA-64 and IA-32 Processor Architectures, is used. There is one -control type called intel-rapl which contains two power zones, intel-rapl:0 and -intel-rapl:1, representing CPU packages. Each of these power zones contains -two subzones, intel-rapl:j:0 and intel-rapl:j:1 (j = 0, 1), representing the -"core" and the "uncore" parts of the given CPU package, respectively. 
All of -the zones and subzones contain energy monitoring attributes (energy_uj, -max_energy_range_uj) and constraint attributes (constraint_*) allowing controls -to be applied (the constraints in the 'package' power zones apply to the whole -CPU packages and the subzone constraints only apply to the respective parts of -the given package individually). Since Intel RAPL doesn't provide instantaneous -power value, there is no power_uw attribute. - -In addition to that, each power zone contains a name attribute, allowing the -part of the system represented by that zone to be identified. -For example: - -cat /sys/class/power_cap/intel-rapl/intel-rapl:0/name -package-0 - -The Intel RAPL technology allows two constraints, short term and long term, -with two different time windows to be applied to each power zone. Thus for -each zone there are 2 attributes representing the constraint names, 2 power -limits and 2 attributes representing the sizes of the time windows. Such that, -constraint_j_* attributes correspond to the jth constraint (j = 0,1). - -For example: - constraint_0_name - constraint_0_power_limit_uw - constraint_0_time_window_us - constraint_1_name - constraint_1_power_limit_uw - constraint_1_time_window_us - -Power Zone Attributes -================================= -Monitoring attributes ----------------------- - -energy_uj (rw): Current energy counter in micro joules. Write "0" to reset. -If the counter can not be reset, then this attribute is read only. - -max_energy_range_uj (ro): Range of the above energy counter in micro-joules. - -power_uw (ro): Current power in micro watts. - -max_power_range_uw (ro): Range of the above power value in micro-watts. - -name (ro): Name of this power zone. - -It is possible that some domains have both power ranges and energy counter ranges; -however, only one is mandatory. 
- -Constraints ----------------- -constraint_X_power_limit_uw (rw): Power limit in micro watts, which should be -applicable for the time window specified by "constraint_X_time_window_us". - -constraint_X_time_window_us (rw): Time window in micro seconds. - -constraint_X_name (ro): An optional name of the constraint - -constraint_X_max_power_uw(ro): Maximum allowed power in micro watts. - -constraint_X_min_power_uw(ro): Minimum allowed power in micro watts. - -constraint_X_max_time_window_us(ro): Maximum allowed time window in micro seconds. - -constraint_X_min_time_window_us(ro): Minimum allowed time window in micro seconds. - -Except power_limit_uw and time_window_us other fields are optional. - -Common zone and control type attributes ----------------------------------------- -enabled (rw): Enable/Disable controls at zone level or for all zones using -a control type. - -Power Cap Client Driver Interface -================================== -The API summary: - -Call powercap_register_control_type() to register control type object. -Call powercap_register_zone() to register a power zone (under a given -control type), either as a top-level power zone or as a subzone of another -power zone registered earlier. -The number of constraints in a power zone and the corresponding callbacks have -to be defined prior to calling powercap_register_zone() to register that zone. - -To Free a power zone call powercap_unregister_zone(). -To free a control type object call powercap_unregister_control_type(). -Detailed API can be generated using kernel-doc on include/linux/powercap.h. 
diff --git a/Documentation/power/regulator/consumer.txt b/Documentation/power/regulator/consumer.rst similarity index 61% rename from Documentation/power/regulator/consumer.txt rename to Documentation/power/regulator/consumer.rst index e51564c1a140..0cd8cc1275a7 100644 --- a/Documentation/power/regulator/consumer.txt +++ b/Documentation/power/regulator/consumer.rst @@ -1,3 +1,4 @@ +=================================== Regulator Consumer Driver Interface =================================== @@ -8,73 +9,77 @@ Please see overview.txt for a description of the terms used in this text. 1. Consumer Regulator Access (static & dynamic drivers) ======================================================= -A consumer driver can get access to its supply regulator by calling :- +A consumer driver can get access to its supply regulator by calling :: -regulator = regulator_get(dev, "Vcc"); + regulator = regulator_get(dev, "Vcc"); The consumer passes in its struct device pointer and power supply ID. The core then finds the correct regulator by consulting a machine specific lookup table. If the lookup is successful then this call will return a pointer to the struct regulator that supplies this consumer. -To release the regulator the consumer driver should call :- +To release the regulator the consumer driver should call :: -regulator_put(regulator); + regulator_put(regulator); Consumers can be supplied by more than one regulator e.g. codec consumer with -analog and digital supplies :- +analog and digital supplies :: -digital = regulator_get(dev, "Vcc"); /* digital core */ -analog = regulator_get(dev, "Avdd"); /* analog */ + digital = regulator_get(dev, "Vcc"); /* digital core */ + analog = regulator_get(dev, "Avdd"); /* analog */ The regulator access functions regulator_get() and regulator_put() will usually be called in your device drivers probe() and remove() respectively. 2. 
Regulator Output Enable & Disable (static & dynamic drivers) -==================================================================== +=============================================================== -A consumer can enable its power supply by calling:- -int regulator_enable(regulator); +A consumer can enable its power supply by calling:: -NOTE: The supply may already be enabled before regulator_enabled() is called. -This may happen if the consumer shares the regulator or the regulator has been -previously enabled by bootloader or kernel board initialization code. + int regulator_enable(regulator); -A consumer can determine if a regulator is enabled by calling :- +NOTE: + The supply may already be enabled before regulator_enabled() is called. + This may happen if the consumer shares the regulator or the regulator has been + previously enabled by bootloader or kernel board initialization code. -int regulator_is_enabled(regulator); +A consumer can determine if a regulator is enabled by calling:: + + int regulator_is_enabled(regulator); This will return > zero when the regulator is enabled. -A consumer can disable its supply when no longer needed by calling :- +A consumer can disable its supply when no longer needed by calling:: -int regulator_disable(regulator); + int regulator_disable(regulator); -NOTE: This may not disable the supply if it's shared with other consumers. The -regulator will only be disabled when the enabled reference count is zero. +NOTE: + This may not disable the supply if it's shared with other consumers. The + regulator will only be disabled when the enabled reference count is zero. -Finally, a regulator can be forcefully disabled in the case of an emergency :- +Finally, a regulator can be forcefully disabled in the case of an emergency:: -int regulator_force_disable(regulator); + int regulator_force_disable(regulator); -NOTE: this will immediately and forcefully shutdown the regulator output. All -consumers will be powered off. 
+NOTE: + this will immediately and forcefully shutdown the regulator output. All + consumers will be powered off. 3. Regulator Voltage Control & Status (dynamic drivers) -====================================================== +======================================================= Some consumer drivers need to be able to dynamically change their supply voltage to match system operating points. e.g. CPUfreq drivers can scale voltage along with frequency to save power, SD drivers may need to select the correct card voltage, etc. -Consumers can control their supply voltage by calling :- +Consumers can control their supply voltage by calling:: -int regulator_set_voltage(regulator, min_uV, max_uV); + int regulator_set_voltage(regulator, min_uV, max_uV); Where min_uV and max_uV are the minimum and maximum acceptable voltages in microvolts. @@ -84,47 +89,50 @@ when enabled, then the voltage changes instantly, otherwise the voltage configuration changes and the voltage is physically set when the regulator is next enabled. -The regulators configured voltage output can be found by calling :- +The regulators configured voltage output can be found by calling:: -int regulator_get_voltage(regulator); + int regulator_get_voltage(regulator); -NOTE: get_voltage() will return the configured output voltage whether the -regulator is enabled or disabled and should NOT be used to determine regulator -output state. However this can be used in conjunction with is_enabled() to -determine the regulator physical output voltage. +NOTE: + get_voltage() will return the configured output voltage whether the + regulator is enabled or disabled and should NOT be used to determine regulator + output state. However this can be used in conjunction with is_enabled() to + determine the regulator physical output voltage. 4. 
Regulator Current Limit Control & Status (dynamic drivers) -=========================================================== +============================================================= Some consumer drivers need to be able to dynamically change their supply current limit to match system operating points. e.g. LCD backlight driver can change the current limit to vary the backlight brightness, USB drivers may want to set the limit to 500mA when supplying power. -Consumers can control their supply current limit by calling :- +Consumers can control their supply current limit by calling:: -int regulator_set_current_limit(regulator, min_uA, max_uA); + int regulator_set_current_limit(regulator, min_uA, max_uA); Where min_uA and max_uA are the minimum and maximum acceptable current limit in microamps. -NOTE: this can be called when the regulator is enabled or disabled. If called -when enabled, then the current limit changes instantly, otherwise the current -limit configuration changes and the current limit is physically set when the -regulator is next enabled. +NOTE: + this can be called when the regulator is enabled or disabled. If called + when enabled, then the current limit changes instantly, otherwise the current + limit configuration changes and the current limit is physically set when the + regulator is next enabled. -A regulators current limit can be found by calling :- +A regulators current limit can be found by calling:: -int regulator_get_current_limit(regulator); + int regulator_get_current_limit(regulator); -NOTE: get_current_limit() will return the current limit whether the regulator -is enabled or disabled and should not be used to determine regulator current -load. +NOTE: + get_current_limit() will return the current limit whether the regulator + is enabled or disabled and should not be used to determine regulator current + load. 5. 
Regulator Operating Mode Control & Status (dynamic drivers) -============================================================= +============================================================== Some consumers can further save system power by changing the operating mode of their supply regulator to be more efficient when the consumers operating state @@ -135,9 +143,9 @@ Regulator operating mode can be changed indirectly or directly. Indirect operating mode control. -------------------------------- Consumer drivers can request a change in their supply regulator operating mode -by calling :- +by calling:: -int regulator_set_load(struct regulator *regulator, int load_uA); + int regulator_set_load(struct regulator *regulator, int load_uA); This will cause the core to recalculate the total load on the regulator (based on all its consumers) and change operating mode (if necessary and permitted) @@ -153,12 +161,13 @@ consumers. Direct operating mode control. ------------------------------ + Bespoke or tightly coupled drivers may want to directly control regulator operating mode depending on their operating point. This can be achieved by -calling :- +calling:: -int regulator_set_mode(struct regulator *regulator, unsigned int mode); -unsigned int regulator_get_mode(struct regulator *regulator); + int regulator_set_mode(struct regulator *regulator, unsigned int mode); + unsigned int regulator_get_mode(struct regulator *regulator); Direct mode will only be used by consumers that *know* about the regulator and are not sharing the regulator with other consumers. @@ -166,24 +175,26 @@ are not sharing the regulator with other consumers. 6. Regulator Events =================== + Regulators can notify consumers of external events. Events could be received by consumers under regulator stress or failure conditions. 
-Consumers can register interest in regulator events by calling :- +Consumers can register interest in regulator events by calling:: -int regulator_register_notifier(struct regulator *regulator, - struct notifier_block *nb); + int regulator_register_notifier(struct regulator *regulator, + struct notifier_block *nb); -Consumers can unregister interest by calling :- +Consumers can unregister interest by calling:: -int regulator_unregister_notifier(struct regulator *regulator, - struct notifier_block *nb); + int regulator_unregister_notifier(struct regulator *regulator, + struct notifier_block *nb); Regulators use the kernel notifier framework to send event to their interested consumers. 7. Regulator Direct Register Access =================================== + Some kinds of power management hardware or firmware are designed such that they need to do low-level hardware access to regulators, with no involvement from the kernel. Examples of such devices are: @@ -199,20 +210,20 @@ to it. The regulator framework provides the following helpers for querying these details. Bus-specific details, like I2C addresses or transfer rates are handled by the -regmap framework. To get the regulator's regmap (if supported), use :- +regmap framework. 
To get the regulator's regmap (if supported), use:: -struct regmap *regulator_get_regmap(struct regulator *regulator); + struct regmap *regulator_get_regmap(struct regulator *regulator); To obtain the hardware register offset and bitmask for the regulator's voltage -selector register, use :- +selector register, use:: -int regulator_get_hardware_vsel_register(struct regulator *regulator, - unsigned *vsel_reg, - unsigned *vsel_mask); + int regulator_get_hardware_vsel_register(struct regulator *regulator, + unsigned *vsel_reg, + unsigned *vsel_mask); To convert a regulator framework voltage selector code (used by regulator_list_voltage) to a hardware-specific voltage selector that can be -directly written to the voltage selector register, use :- +directly written to the voltage selector register, use:: -int regulator_list_hardware_vsel(struct regulator *regulator, - unsigned selector); + int regulator_list_hardware_vsel(struct regulator *regulator, + unsigned selector); diff --git a/Documentation/power/regulator/design.txt b/Documentation/power/regulator/design.rst similarity index 86% rename from Documentation/power/regulator/design.txt rename to Documentation/power/regulator/design.rst index fdd919b96830..3b09c6841dc4 100644 --- a/Documentation/power/regulator/design.txt +++ b/Documentation/power/regulator/design.rst @@ -1,3 +1,4 @@ +========================== Regulator API design notes ========================== @@ -14,7 +15,9 @@ Safety have different power requirements, and not all components with power requirements are visible to software. - => The API should make no changes to the hardware state unless it has +.. note:: + + The API should make no changes to the hardware state unless it has specific knowledge that these changes are safe to perform on this particular system. @@ -28,6 +31,8 @@ Consumer use cases - Many of the power supplies in the system will be shared between many different consumers. 
- => The consumer API should be structured so that these use cases are +.. note:: + + The consumer API should be structured so that these use cases are very easy to handle and so that consumers will work with shared supplies without any additional effort. diff --git a/Documentation/power/regulator/machine.txt b/Documentation/power/regulator/machine.rst similarity index 75% rename from Documentation/power/regulator/machine.txt rename to Documentation/power/regulator/machine.rst index eff4dcaaa252..22fffefaa3ad 100644 --- a/Documentation/power/regulator/machine.txt +++ b/Documentation/power/regulator/machine.rst @@ -1,10 +1,11 @@ +================================== Regulator Machine Driver Interface -=================================== +================================== The regulator machine driver interface is intended for board/machine specific initialisation code to configure the regulator subsystem. -Consider the following machine :- +Consider the following machine:: Regulator-1 -+-> Regulator-2 --> [Consumer A @ 1.8 - 2.0V] | @@ -13,31 +14,31 @@ Consider the following machine :- The drivers for consumers A & B must be mapped to the correct regulator in order to control their power supplies. This mapping can be achieved in machine initialisation code by creating a struct regulator_consumer_supply for -each regulator. +each regulator:: -struct regulator_consumer_supply { + struct regulator_consumer_supply { const char *dev_name; /* consumer dev_name() */ const char *supply; /* consumer supply - e.g. "vcc" */ -}; + }; -e.g. for the machine above +e.g. 
for the machine above:: -static struct regulator_consumer_supply regulator1_consumers[] = { + static struct regulator_consumer_supply regulator1_consumers[] = { REGULATOR_SUPPLY("Vcc", "consumer B"), -}; + }; -static struct regulator_consumer_supply regulator2_consumers[] = { + static struct regulator_consumer_supply regulator2_consumers[] = { REGULATOR_SUPPLY("Vcc", "consumer A"), -}; + }; This maps Regulator-1 to the 'Vcc' supply for Consumer B and maps Regulator-2 to the 'Vcc' supply for Consumer A. Constraints can now be registered by defining a struct regulator_init_data for each regulator power domain. This structure also maps the consumers -to their supply regulators :- +to their supply regulators:: -static struct regulator_init_data regulator1_data = { + static struct regulator_init_data regulator1_data = { .constraints = { .name = "Regulator-1", .min_uV = 3300000, @@ -46,7 +47,7 @@ static struct regulator_init_data regulator1_data = { }, .num_consumer_supplies = ARRAY_SIZE(regulator1_consumers), .consumer_supplies = regulator1_consumers, -}; + }; The name field should be set to something that is usefully descriptive for the board for configuration of supplies for other regulators and @@ -57,9 +58,9 @@ name is provided then the subsystem will choose one. Regulator-1 supplies power to Regulator-2. This relationship must be registered with the core so that Regulator-1 is also enabled when Consumer A enables its supply (Regulator-2). The supply regulator is set by the supply_regulator -field below and co:- +field below and co:: -static struct regulator_init_data regulator2_data = { + static struct regulator_init_data regulator2_data = { .supply_regulator = "Regulator-1", .constraints = { .min_uV = 1800000, @@ -69,11 +70,11 @@ static struct regulator_init_data regulator2_data = { }, .num_consumer_supplies = ARRAY_SIZE(regulator2_consumers), .consumer_supplies = regulator2_consumers, -}; + }; -Finally the regulator devices must be registered in the usual manner. 
+Finally the regulator devices must be registered in the usual manner:: -static struct platform_device regulator_devices[] = { + static struct platform_device regulator_devices[] = { { .name = "regulator", .id = DCDC_1, @@ -88,9 +89,9 @@ static struct platform_device regulator_devices[] = { .platform_data = &regulator2_data, }, }, -}; -/* register regulator 1 device */ -platform_device_register(&regulator_devices[0]); + }; + /* register regulator 1 device */ + platform_device_register(&regulator_devices[0]); -/* register regulator 2 device */ -platform_device_register(&regulator_devices[1]); + /* register regulator 2 device */ + platform_device_register(&regulator_devices[1]); diff --git a/Documentation/power/regulator/overview.txt b/Documentation/power/regulator/overview.rst similarity index 79% rename from Documentation/power/regulator/overview.txt rename to Documentation/power/regulator/overview.rst index 721b4739ec32..ee494c70a7c4 100644 --- a/Documentation/power/regulator/overview.txt +++ b/Documentation/power/regulator/overview.rst @@ -1,3 +1,4 @@ +============================================= Linux voltage and current regulator framework ============================================= @@ -13,26 +14,30 @@ regulators (where voltage output is controllable) and current sinks (where current limit is controllable). (C) 2008 Wolfson Microelectronics PLC. + Author: Liam Girdwood Nomenclature ============ -Some terms used in this document:- +Some terms used in this document: - o Regulator - Electronic device that supplies power to other devices. + - Regulator + - Electronic device that supplies power to other devices. Most regulators can enable and disable their output while some can control their output voltage and or current. Input Voltage -> Regulator -> Output Voltage - o PMIC - Power Management IC. An IC that contains numerous regulators - and often contains other subsystems. + - PMIC + - Power Management IC.
An IC that contains numerous + regulators and often contains other subsystems. - o Consumer - Electronic device that is supplied power by a regulator. + - Consumer + - Electronic device that is supplied power by a regulator. Consumers can be classified into two types:- Static: consumer does not change its supply voltage or @@ -44,46 +49,48 @@ Some terms used in this document:- current limit to meet operation demands. - o Power Domain - Electronic circuit that is supplied its input power by the + - Power Domain + - Electronic circuit that is supplied its input power by the output power of a regulator, switch or by another power domain. - The supply regulator may be behind a switch(s). i.e. + The supply regulator may be behind a switch(s). i.e.:: - Regulator -+-> Switch-1 -+-> Switch-2 --> [Consumer A] - | | - | +-> [Consumer B], [Consumer C] - | - +-> [Consumer D], [Consumer E] + Regulator -+-> Switch-1 -+-> Switch-2 --> [Consumer A] + | | + | +-> [Consumer B], [Consumer C] + | + +-> [Consumer D], [Consumer E] That is one regulator and three power domains: - Domain 1: Switch-1, Consumers D & E. - Domain 2: Switch-2, Consumers B & C. - Domain 3: Consumer A. + - Domain 1: Switch-1, Consumers D & E. + - Domain 2: Switch-2, Consumers B & C. + - Domain 3: Consumer A. and this represents a "supplies" relationship: Domain-1 --> Domain-2 --> Domain-3. A power domain may have regulators that are supplied power - by other regulators. i.e. + by other regulators. i.e.:: - Regulator-1 -+-> Regulator-2 -+-> [Consumer A] - | - +-> [Consumer B] + Regulator-1 -+-> Regulator-2 -+-> [Consumer A] + | + +-> [Consumer B] This gives us two regulators and two power domains: - Domain 1: Regulator-2, Consumer B. - Domain 2: Consumer A. + - Domain 1: Regulator-2, Consumer B. + - Domain 2: Consumer A. 
and a "supplies" relationship: Domain-1 --> Domain-2 - o Constraints - Constraints are used to define power levels for performance + - Constraints + - Constraints are used to define power levels for performance and hardware protection. Constraints exist at three levels: Regulator Level: This is defined by the regulator hardware @@ -141,7 +148,7 @@ relevant to non SoC devices and is split into the following four interfaces:- limit. This also compiles out if not in use so drivers can be reused in systems with no regulator based power control. - See Documentation/power/regulator/consumer.txt + See Documentation/power/regulator/consumer.rst 2. Regulator driver interface. @@ -149,7 +156,7 @@ relevant to non SoC devices and is split into the following four interfaces:- operations to the core. It also has a notifier call chain for propagating regulator events to clients. - See Documentation/power/regulator/regulator.txt + See Documentation/power/regulator/regulator.rst 3. Machine interface. @@ -160,7 +167,7 @@ relevant to non SoC devices and is split into the following four interfaces:- allows the creation of a regulator tree whereby some regulators are supplied by others (similar to a clock tree). - See Documentation/power/regulator/machine.txt + See Documentation/power/regulator/machine.rst 4. Userspace ABI. diff --git a/Documentation/power/regulator/regulator.rst b/Documentation/power/regulator/regulator.rst new file mode 100644 index 000000000000..794b3256fbb9 --- /dev/null +++ b/Documentation/power/regulator/regulator.rst @@ -0,0 +1,32 @@ +========================== +Regulator Driver Interface +========================== + +The regulator driver interface is relatively simple and designed to allow +regulator drivers to register their services with the core framework. 
+ + +Registration +============ + +Drivers can register a regulator by calling:: + + struct regulator_dev *regulator_register(struct regulator_desc *regulator_desc, + const struct regulator_config *config); + +This will register the regulator's capabilities and operations to the regulator +core. + +Regulators can be unregistered by calling:: + + void regulator_unregister(struct regulator_dev *rdev); + + +Regulator Events +================ + +Regulators can send events (e.g. overtemperature, undervoltage, etc) to +consumer drivers by calling:: + + int regulator_notifier_call_chain(struct regulator_dev *rdev, + unsigned long event, void *data); diff --git a/Documentation/power/regulator/regulator.txt b/Documentation/power/regulator/regulator.txt deleted file mode 100644 index b17e5833ce21..000000000000 --- a/Documentation/power/regulator/regulator.txt +++ /dev/null @@ -1,30 +0,0 @@ -Regulator Driver Interface -========================== - -The regulator driver interface is relatively simple and designed to allow -regulator drivers to register their services with the core framework. - - -Registration -============ - -Drivers can register a regulator by calling :- - -struct regulator_dev *regulator_register(struct regulator_desc *regulator_desc, - const struct regulator_config *config); - -This will register the regulator's capabilities and operations to the regulator -core. - -Regulators can be unregistered by calling :- - -void regulator_unregister(struct regulator_dev *rdev); - - -Regulator Events -================ -Regulators can send events (e.g. 
overtemperature, undervoltage, etc) to -consumer drivers by calling :- - -int regulator_notifier_call_chain(struct regulator_dev *rdev, - unsigned long event, void *data); diff --git a/Documentation/power/runtime_pm.txt b/Documentation/power/runtime_pm.rst similarity index 89% rename from Documentation/power/runtime_pm.txt rename to Documentation/power/runtime_pm.rst index 937e33c46211..2c2ec99b5088 100644 --- a/Documentation/power/runtime_pm.txt +++ b/Documentation/power/runtime_pm.rst @@ -1,10 +1,15 @@ +================================================== Runtime Power Management Framework for I/O Devices +================================================== (C) 2009-2011 Rafael J. Wysocki , Novell Inc. + (C) 2010 Alan Stern + (C) 2014 Intel Corp., Rafael J. Wysocki 1. Introduction +=============== Support for runtime power management (runtime PM) of I/O devices is provided at the power management core (PM core) level by means of: @@ -33,16 +38,17 @@ fields of 'struct dev_pm_info' and the core helper functions provided for runtime PM are described below. 2. Device Runtime PM Callbacks +============================== -There are three device runtime PM callbacks defined in 'struct dev_pm_ops': +There are three device runtime PM callbacks defined in 'struct dev_pm_ops':: -struct dev_pm_ops { + struct dev_pm_ops { ... int (*runtime_suspend)(struct device *dev); int (*runtime_resume)(struct device *dev); int (*runtime_idle)(struct device *dev); ... -}; + }; The ->runtime_suspend(), ->runtime_resume() and ->runtime_idle() callbacks are executed by the PM core for the device's subsystem that may be either of @@ -112,7 +118,7 @@ low-power state during the execution of the suspend callback, it is expected that remote wakeup will be enabled for the device. Generally, remote wakeup should be enabled for all input devices put into low-power states at run time. 
-The subsystem-level resume callback, if present, is _entirely_ _responsible_ for +The subsystem-level resume callback, if present, is **entirely responsible** for handling the resume of the device as appropriate, which may, but need not include executing the device driver's own ->runtime_resume() callback (from the PM core's point of view it is not necessary to implement a ->runtime_resume() @@ -197,95 +203,96 @@ rules: except for scheduled autosuspends. 3. Runtime PM Device Fields +=========================== The following device runtime PM fields are present in 'struct dev_pm_info', as defined in include/linux/pm.h: - struct timer_list suspend_timer; + `struct timer_list suspend_timer;` - timer used for scheduling (delayed) suspend and autosuspend requests - unsigned long timer_expires; + `unsigned long timer_expires;` - timer expiration time, in jiffies (if this is different from zero, the timer is running and will expire at that time, otherwise the timer is not running) - struct work_struct work; + `struct work_struct work;` - work structure used for queuing up requests (i.e. work items in pm_wq) - wait_queue_head_t wait_queue; + `wait_queue_head_t wait_queue;` - wait queue used if any of the helper functions needs to wait for another one to complete - spinlock_t lock; + `spinlock_t lock;` - lock used for synchronization - atomic_t usage_count; + `atomic_t usage_count;` - the usage counter of the device - atomic_t child_count; + `atomic_t child_count;` - the count of 'active' children of the device - unsigned int ignore_children; + `unsigned int ignore_children;` - if set, the value of child_count is ignored (but still updated) - unsigned int disable_depth; + `unsigned int disable_depth;` - used for disabling the helper functions (they work normally if this is equal to zero); the initial value of it is 1 (i.e. 
runtime PM is initially disabled for all devices) - int runtime_error; + `int runtime_error;` - if set, there was a fatal error (one of the callbacks returned error code as described in Section 2), so the helper functions will not work until this flag is cleared; this is the error code returned by the failing callback - unsigned int idle_notification; + `unsigned int idle_notification;` - if set, ->runtime_idle() is being executed - unsigned int request_pending; + `unsigned int request_pending;` - if set, there's a pending request (i.e. a work item queued up into pm_wq) - enum rpm_request request; + `enum rpm_request request;` - type of request that's pending (valid if request_pending is set) - unsigned int deferred_resume; + `unsigned int deferred_resume;` - set if ->runtime_resume() is about to be run while ->runtime_suspend() is being executed for that device and it is not practical to wait for the suspend to complete; means "start a resume as soon as you've suspended" - enum rpm_status runtime_status; + `enum rpm_status runtime_status;` - the runtime PM status of the device; this field's initial value is RPM_SUSPENDED, which means that each device is initially regarded by the PM core as 'suspended', regardless of its real hardware status - unsigned int runtime_auto; + `unsigned int runtime_auto;` - if set, indicates that the user space has allowed the device driver to power manage the device at run time via the /sys/devices/.../power/control - interface; it may only be modified with the help of the pm_runtime_allow() + `interface;` it may only be modified with the help of the pm_runtime_allow() and pm_runtime_forbid() helper functions - unsigned int no_callbacks; + `unsigned int no_callbacks;` - indicates that the device does not use the runtime PM callbacks (see Section 8); it may be modified only by the pm_runtime_no_callbacks() helper function - unsigned int irq_safe; + `unsigned int irq_safe;` - indicates that the ->runtime_suspend() and ->runtime_resume() 
callbacks will be invoked with the spinlock held and interrupts disabled - unsigned int use_autosuspend; + `unsigned int use_autosuspend;` - indicates that the device's driver supports delayed autosuspend (see Section 9); it may be modified only by the pm_runtime{_dont}_use_autosuspend() helper functions - unsigned int timer_autosuspends; + `unsigned int timer_autosuspends;` - indicates that the PM core should attempt to carry out an autosuspend when the timer expires rather than a normal suspend - int autosuspend_delay; + `int autosuspend_delay;` - the delay time (in milliseconds) to be used for autosuspend - unsigned long last_busy; + `unsigned long last_busy;` - the time (in jiffies) when the pm_runtime_mark_last_busy() helper function was last called for this device; used in calculating inactivity periods for autosuspend @@ -293,37 +300,38 @@ defined in include/linux/pm.h: All of the above fields are members of the 'power' member of 'struct device'. 4. Runtime PM Device Helper Functions +===================================== The following runtime PM helper functions are defined in drivers/base/power/runtime.c and include/linux/pm_runtime.h: - void pm_runtime_init(struct device *dev); + `void pm_runtime_init(struct device *dev);` - initialize the device runtime PM fields in 'struct dev_pm_info' - void pm_runtime_remove(struct device *dev); + `void pm_runtime_remove(struct device *dev);` - make sure that the runtime PM of the device will be disabled after removing the device from device hierarchy - int pm_runtime_idle(struct device *dev); + `int pm_runtime_idle(struct device *dev);` - execute the subsystem-level idle callback for the device; returns an error code on failure, where -EINPROGRESS means that ->runtime_idle() is already being executed; if there is no callback or the callback returns 0 then run pm_runtime_autosuspend(dev) and return its result - int pm_runtime_suspend(struct device *dev); + `int pm_runtime_suspend(struct device *dev);` - execute the 
subsystem-level suspend callback for the device; returns 0 on success, 1 if the device's runtime PM status was already 'suspended', or error code on failure, where -EAGAIN or -EBUSY means it is safe to attempt to suspend the device again in future and -EACCES means that 'power.disable_depth' is different from 0 - int pm_runtime_autosuspend(struct device *dev); + `int pm_runtime_autosuspend(struct device *dev);` - same as pm_runtime_suspend() except that the autosuspend delay is taken - into account; if pm_runtime_autosuspend_expiration() says the delay has + `into account;` if pm_runtime_autosuspend_expiration() says the delay has not yet expired then an autosuspend is scheduled for the appropriate time and 0 is returned - int pm_runtime_resume(struct device *dev); + `int pm_runtime_resume(struct device *dev);` - execute the subsystem-level resume callback for the device; returns 0 on success, 1 if the device's runtime PM status was already 'active' or error code on failure, where -EAGAIN means it may be safe to attempt to @@ -331,17 +339,17 @@ drivers/base/power/runtime.c and include/linux/pm_runtime.h: checked additionally, and -EACCES means that 'power.disable_depth' is different from 0 - int pm_request_idle(struct device *dev); + `int pm_request_idle(struct device *dev);` - submit a request to execute the subsystem-level idle callback for the device (the request is represented by a work item in pm_wq); returns 0 on success or error code if the request has not been queued up - int pm_request_autosuspend(struct device *dev); + `int pm_request_autosuspend(struct device *dev);` - schedule the execution of the subsystem-level suspend callback for the device when the autosuspend delay has expired; if the delay has already expired then the work item is queued up immediately - int pm_schedule_suspend(struct device *dev, unsigned int delay); + `int pm_schedule_suspend(struct device *dev, unsigned int delay);` - schedule the execution of the subsystem-level suspend 
callback for the device in future, where 'delay' is the time to wait before queuing up a suspend work item in pm_wq, in milliseconds (if 'delay' is zero, the work @@ -351,58 +359,58 @@ drivers/base/power/runtime.c and include/linux/pm_runtime.h: ->runtime_suspend() is already scheduled and not yet expired, the new value of 'delay' will be used as the time to wait - int pm_request_resume(struct device *dev); + `int pm_request_resume(struct device *dev);` - submit a request to execute the subsystem-level resume callback for the device (the request is represented by a work item in pm_wq); returns 0 on success, 1 if the device's runtime PM status was already 'active', or error code if the request hasn't been queued up - void pm_runtime_get_noresume(struct device *dev); + `void pm_runtime_get_noresume(struct device *dev);` - increment the device's usage counter - int pm_runtime_get(struct device *dev); + `int pm_runtime_get(struct device *dev);` - increment the device's usage counter, run pm_request_resume(dev) and return its result - int pm_runtime_get_sync(struct device *dev); + `int pm_runtime_get_sync(struct device *dev);` - increment the device's usage counter, run pm_runtime_resume(dev) and return its result - int pm_runtime_get_if_in_use(struct device *dev); + `int pm_runtime_get_if_in_use(struct device *dev);` - return -EINVAL if 'power.disable_depth' is nonzero; otherwise, if the runtime PM status is RPM_ACTIVE and the runtime PM usage counter is nonzero, increment the counter and return 1; otherwise return 0 without changing the counter - void pm_runtime_put_noidle(struct device *dev); + `void pm_runtime_put_noidle(struct device *dev);` - decrement the device's usage counter - int pm_runtime_put(struct device *dev); + `int pm_runtime_put(struct device *dev);` - decrement the device's usage counter; if the result is 0 then run pm_request_idle(dev) and return its result - int pm_runtime_put_autosuspend(struct device *dev); + `int 
pm_runtime_put_autosuspend(struct device *dev);` - decrement the device's usage counter; if the result is 0 then run pm_request_autosuspend(dev) and return its result - int pm_runtime_put_sync(struct device *dev); + `int pm_runtime_put_sync(struct device *dev);` - decrement the device's usage counter; if the result is 0 then run pm_runtime_idle(dev) and return its result - int pm_runtime_put_sync_suspend(struct device *dev); + `int pm_runtime_put_sync_suspend(struct device *dev);` - decrement the device's usage counter; if the result is 0 then run pm_runtime_suspend(dev) and return its result - int pm_runtime_put_sync_autosuspend(struct device *dev); + `int pm_runtime_put_sync_autosuspend(struct device *dev);` - decrement the device's usage counter; if the result is 0 then run pm_runtime_autosuspend(dev) and return its result - void pm_runtime_enable(struct device *dev); + `void pm_runtime_enable(struct device *dev);` - decrement the device's 'power.disable_depth' field; if that field is equal to zero, the runtime PM helper functions can execute subsystem-level callbacks described in Section 2 for the device - int pm_runtime_disable(struct device *dev); + `int pm_runtime_disable(struct device *dev);` - increment the device's 'power.disable_depth' field (if the value of that field was previously zero, this prevents subsystem-level runtime PM callbacks from being run for the device), make sure that all of the @@ -411,7 +419,7 @@ drivers/base/power/runtime.c and include/linux/pm_runtime.h: necessary to execute the subsystem-level resume callback for the device to satisfy that request, otherwise 0 is returned - int pm_runtime_barrier(struct device *dev); + `int pm_runtime_barrier(struct device *dev);` - check if there's a resume request pending for the device and resume it (synchronously) in that case, cancel any other pending runtime PM requests regarding it and wait for all runtime PM operations on it in progress to @@ -419,10 +427,10 @@ drivers/base/power/runtime.c 
and include/linux/pm_runtime.h: necessary to execute the subsystem-level resume callback for the device to satisfy that request, otherwise 0 is returned - void pm_suspend_ignore_children(struct device *dev, bool enable); + `void pm_suspend_ignore_children(struct device *dev, bool enable);` - set/unset the power.ignore_children flag of the device - int pm_runtime_set_active(struct device *dev); + `int pm_runtime_set_active(struct device *dev);` - clear the device's 'power.runtime_error' flag, set the device's runtime PM status to 'active' and update its parent's counter of 'active' children as appropriate (it is only valid to use this function if @@ -430,61 +438,61 @@ drivers/base/power/runtime.c and include/linux/pm_runtime.h: zero); it will fail and return error code if the device has a parent which is not active and the 'power.ignore_children' flag of which is unset - void pm_runtime_set_suspended(struct device *dev); + `void pm_runtime_set_suspended(struct device *dev);` - clear the device's 'power.runtime_error' flag, set the device's runtime PM status to 'suspended' and update its parent's counter of 'active' children as appropriate (it is only valid to use this function if 'power.runtime_error' is set or 'power.disable_depth' is greater than zero) - bool pm_runtime_active(struct device *dev); + `bool pm_runtime_active(struct device *dev);` - return true if the device's runtime PM status is 'active' or its 'power.disable_depth' field is not equal to zero, or false otherwise - bool pm_runtime_suspended(struct device *dev); + `bool pm_runtime_suspended(struct device *dev);` - return true if the device's runtime PM status is 'suspended' and its 'power.disable_depth' field is equal to zero, or false otherwise - bool pm_runtime_status_suspended(struct device *dev); + `bool pm_runtime_status_suspended(struct device *dev);` - return true if the device's runtime PM status is 'suspended' - void pm_runtime_allow(struct device *dev); + `void pm_runtime_allow(struct 
device *dev);` - set the power.runtime_auto flag for the device and decrease its usage counter (used by the /sys/devices/.../power/control interface to effectively allow the device to be power managed at run time) - void pm_runtime_forbid(struct device *dev); + `void pm_runtime_forbid(struct device *dev);` - unset the power.runtime_auto flag for the device and increase its usage counter (used by the /sys/devices/.../power/control interface to effectively prevent the device from being power managed at run time) - void pm_runtime_no_callbacks(struct device *dev); + `void pm_runtime_no_callbacks(struct device *dev);` - set the power.no_callbacks flag for the device and remove the runtime PM attributes from /sys/devices/.../power (or prevent them from being added when the device is registered) - void pm_runtime_irq_safe(struct device *dev); + `void pm_runtime_irq_safe(struct device *dev);` - set the power.irq_safe flag for the device, causing the runtime-PM callbacks to be invoked with interrupts off - bool pm_runtime_is_irq_safe(struct device *dev); + `bool pm_runtime_is_irq_safe(struct device *dev);` - return true if power.irq_safe flag was set for the device, causing the runtime-PM callbacks to be invoked with interrupts off - void pm_runtime_mark_last_busy(struct device *dev); + `void pm_runtime_mark_last_busy(struct device *dev);` - set the power.last_busy field to the current time - void pm_runtime_use_autosuspend(struct device *dev); + `void pm_runtime_use_autosuspend(struct device *dev);` - set the power.use_autosuspend flag, enabling autosuspend delays; call pm_runtime_get_sync if the flag was previously cleared and power.autosuspend_delay is negative - void pm_runtime_dont_use_autosuspend(struct device *dev); + `void pm_runtime_dont_use_autosuspend(struct device *dev);` - clear the power.use_autosuspend flag, disabling autosuspend delays; decrement the device's usage counter if the flag was previously set and power.autosuspend_delay is negative; call 
pm_runtime_idle - void pm_runtime_set_autosuspend_delay(struct device *dev, int delay); + `void pm_runtime_set_autosuspend_delay(struct device *dev, int delay);` - set the power.autosuspend_delay value to 'delay' (expressed in milliseconds); if 'delay' is negative then runtime suspends are prevented; if power.use_autosuspend is set, pm_runtime_get_sync may be @@ -493,7 +501,7 @@ drivers/base/power/runtime.c and include/linux/pm_runtime.h: changed to or from a negative value; if power.use_autosuspend is clear, pm_runtime_idle is called - unsigned long pm_runtime_autosuspend_expiration(struct device *dev); + `unsigned long pm_runtime_autosuspend_expiration(struct device *dev);` - calculate the time when the current autosuspend delay period will expire, based on power.last_busy and power.autosuspend_delay; if the delay time is 1000 ms or larger then the expiration time is rounded up to the @@ -503,36 +511,37 @@ drivers/base/power/runtime.c and include/linux/pm_runtime.h: It is safe to execute the following helper functions from interrupt context: -pm_request_idle() -pm_request_autosuspend() -pm_schedule_suspend() -pm_request_resume() -pm_runtime_get_noresume() -pm_runtime_get() -pm_runtime_put_noidle() -pm_runtime_put() -pm_runtime_put_autosuspend() -pm_runtime_enable() -pm_suspend_ignore_children() -pm_runtime_set_active() -pm_runtime_set_suspended() -pm_runtime_suspended() -pm_runtime_mark_last_busy() -pm_runtime_autosuspend_expiration() +- pm_request_idle() +- pm_request_autosuspend() +- pm_schedule_suspend() +- pm_request_resume() +- pm_runtime_get_noresume() +- pm_runtime_get() +- pm_runtime_put_noidle() +- pm_runtime_put() +- pm_runtime_put_autosuspend() +- pm_runtime_enable() +- pm_suspend_ignore_children() +- pm_runtime_set_active() +- pm_runtime_set_suspended() +- pm_runtime_suspended() +- pm_runtime_mark_last_busy() +- pm_runtime_autosuspend_expiration() If pm_runtime_irq_safe() has been called for a device then the following helper functions may also be 
used in interrupt context: -pm_runtime_idle() -pm_runtime_suspend() -pm_runtime_autosuspend() -pm_runtime_resume() -pm_runtime_get_sync() -pm_runtime_put_sync() -pm_runtime_put_sync_suspend() -pm_runtime_put_sync_autosuspend() +- pm_runtime_idle() +- pm_runtime_suspend() +- pm_runtime_autosuspend() +- pm_runtime_resume() +- pm_runtime_get_sync() +- pm_runtime_put_sync() +- pm_runtime_put_sync_suspend() +- pm_runtime_put_sync_autosuspend() 5. Runtime PM Initialization, Device Probing and Removal +======================================================== Initially, the runtime PM is disabled for all devices, which means that the majority of the runtime PM helper functions described in Section 4 will return @@ -608,6 +617,7 @@ manage the device at run time, the driver may confuse it by using pm_runtime_forbid() this way. 6. Runtime PM and System Sleep +============================== Runtime PM and system sleep (i.e., system suspend and hibernation, also known as suspend-to-RAM and suspend-to-disk) interact with each other in a couple of @@ -647,9 +657,9 @@ brought back to full power during resume, then its runtime PM status will have to be updated to reflect the actual post-system sleep status. The way to do this is: - pm_runtime_disable(dev); - pm_runtime_set_active(dev); - pm_runtime_enable(dev); + - pm_runtime_disable(dev); + - pm_runtime_set_active(dev); + - pm_runtime_enable(dev); The PM core always increments the runtime usage counter before calling the ->suspend() callback and decrements it after calling the ->resume() callback. 
@@ -705,66 +715,66 @@ Subsystems may wish to conserve code space by using the set of generic power management callbacks provided by the PM core, defined in driver/base/power/generic_ops.c: - int pm_generic_runtime_suspend(struct device *dev); + `int pm_generic_runtime_suspend(struct device *dev);` - invoke the ->runtime_suspend() callback provided by the driver of this device and return its result, or return 0 if not defined - int pm_generic_runtime_resume(struct device *dev); + `int pm_generic_runtime_resume(struct device *dev);` - invoke the ->runtime_resume() callback provided by the driver of this device and return its result, or return 0 if not defined - int pm_generic_suspend(struct device *dev); + `int pm_generic_suspend(struct device *dev);` - if the device has not been suspended at run time, invoke the ->suspend() callback provided by its driver and return its result, or return 0 if not defined - int pm_generic_suspend_noirq(struct device *dev); + `int pm_generic_suspend_noirq(struct device *dev);` - if pm_runtime_suspended(dev) returns "false", invoke the ->suspend_noirq() callback provided by the device's driver and return its result, or return 0 if not defined - int pm_generic_resume(struct device *dev); + `int pm_generic_resume(struct device *dev);` - invoke the ->resume() callback provided by the driver of this device and, if successful, change the device's runtime PM status to 'active' - int pm_generic_resume_noirq(struct device *dev); + `int pm_generic_resume_noirq(struct device *dev);` - invoke the ->resume_noirq() callback provided by the driver of this device - int pm_generic_freeze(struct device *dev); + `int pm_generic_freeze(struct device *dev);` - if the device has not been suspended at run time, invoke the ->freeze() callback provided by its driver and return its result, or return 0 if not defined - int pm_generic_freeze_noirq(struct device *dev); + `int pm_generic_freeze_noirq(struct device *dev);` - if pm_runtime_suspended(dev) returns 
"false", invoke the ->freeze_noirq() callback provided by the device's driver and return its result, or return 0 if not defined - int pm_generic_thaw(struct device *dev); + `int pm_generic_thaw(struct device *dev);` - if the device has not been suspended at run time, invoke the ->thaw() callback provided by its driver and return its result, or return 0 if not defined - int pm_generic_thaw_noirq(struct device *dev); + `int pm_generic_thaw_noirq(struct device *dev);` - if pm_runtime_suspended(dev) returns "false", invoke the ->thaw_noirq() callback provided by the device's driver and return its result, or return 0 if not defined - int pm_generic_poweroff(struct device *dev); + `int pm_generic_poweroff(struct device *dev);` - if the device has not been suspended at run time, invoke the ->poweroff() callback provided by its driver and return its result, or return 0 if not defined - int pm_generic_poweroff_noirq(struct device *dev); + `int pm_generic_poweroff_noirq(struct device *dev);` - if pm_runtime_suspended(dev) returns "false", run the ->poweroff_noirq() callback provided by the device's driver and return its result, or return 0 if not defined - int pm_generic_restore(struct device *dev); + `int pm_generic_restore(struct device *dev);` - invoke the ->restore() callback provided by the driver of this device and, if successful, change the device's runtime PM status to 'active' - int pm_generic_restore_noirq(struct device *dev); + `int pm_generic_restore_noirq(struct device *dev);` - invoke the ->restore_noirq() callback provided by the device's driver These functions are the defaults used by the PM core, if a subsystem doesn't @@ -781,6 +791,7 @@ UNIVERSAL_DEV_PM_OPS macro defined in include/linux/pm.h (possibly setting its last argument to NULL). 8. "No-Callback" Devices +======================== Some "devices" are only logical sub-devices of their parent and cannot be power-managed on their own. (The prototype example is a USB interface. 
Entire @@ -807,6 +818,7 @@ parent must take responsibility for telling the device's driver when the parent's power state changes. 9. Autosuspend, or automatically-delayed suspends +================================================= Changing a device's power state isn't free; it requires both time and energy. A device should be put in a low-power state only when there's some reason to @@ -832,8 +844,8 @@ registration the length should be controlled by user space, using the In order to use autosuspend, subsystems or drivers must call pm_runtime_use_autosuspend() (preferably before registering the device), and -thereafter they should use the various *_autosuspend() helper functions instead -of the non-autosuspend counterparts: +thereafter they should use the various `*_autosuspend()` helper functions +instead of the non-autosuspend counterparts:: Instead of: pm_runtime_suspend use: pm_runtime_autosuspend; Instead of: pm_schedule_suspend use: pm_request_autosuspend; @@ -858,7 +870,7 @@ The implementation is well suited for asynchronous use in interrupt contexts. However such use inevitably involves races, because the PM core can't synchronize ->runtime_suspend() callbacks with the arrival of I/O requests. This synchronization must be handled by the driver, using its private lock. 
-Here is a schematic pseudo-code example: +Here is a schematic pseudo-code example:: foo_read_or_write(struct foo_priv *foo, void *data) { diff --git a/Documentation/power/s2ram.txt b/Documentation/power/s2ram.rst similarity index 92% rename from Documentation/power/s2ram.txt rename to Documentation/power/s2ram.rst index 4685aee197fd..d739aa7c742c 100644 --- a/Documentation/power/s2ram.txt +++ b/Documentation/power/s2ram.rst @@ -1,7 +1,9 @@ - How to get s2ram working - ~~~~~~~~~~~~~~~~~~~~~~~~ - 2006 Linus Torvalds - 2006 Pavel Machek +======================== +How to get s2ram working +======================== + +2006 Linus Torvalds +2006 Pavel Machek 1) Check suspend.sf.net, program s2ram there has long whitelist of "known ok" machines, along with tricks to use on each one. @@ -12,8 +14,8 @@ 3) You can use Linus' TRACE_RESUME infrastructure, described below. - Using TRACE_RESUME - ~~~~~~~~~~~~~~~~~~ +Using TRACE_RESUME +~~~~~~~~~~~~~~~~~~ I've been working at making the machines I have able to STR, and almost always it's a driver that is buggy. Thank God for the suspend/resume @@ -27,7 +29,7 @@ machine that doesn't boot) is: - enable PM_DEBUG, and PM_TRACE - - use a script like this: + - use a script like this:: #!/bin/sh sync @@ -38,7 +40,7 @@ machine that doesn't boot) is: - if it doesn't come back up (which is usually the problem), reboot by holding the power button down, and look at the dmesg output for things - like + like:: Magic number: 4:156:725 hash matches drivers/base/power/resume.c:28 @@ -52,7 +54,7 @@ machine that doesn't boot) is: If no device matches the hash (or any matches appear to be false positives), the culprit may be a device from a loadable kernel module that is not loaded until after the hash is checked. 
You can check the hash against the current - devices again after more modules are loaded using sysfs: + devices again after more modules are loaded using sysfs:: cat /sys/power/pm_trace_dev_match diff --git a/Documentation/power/suspend-and-cpuhotplug.txt b/Documentation/power/suspend-and-cpuhotplug.rst similarity index 90% rename from Documentation/power/suspend-and-cpuhotplug.txt rename to Documentation/power/suspend-and-cpuhotplug.rst index a8751b8df10e..9df664f5423a 100644 --- a/Documentation/power/suspend-and-cpuhotplug.txt +++ b/Documentation/power/suspend-and-cpuhotplug.rst @@ -1,10 +1,15 @@ +==================================================================== Interaction of Suspend code (S3) with the CPU hotplug infrastructure +==================================================================== - (C) 2011 - 2014 Srivatsa S. Bhat +(C) 2011 - 2014 Srivatsa S. Bhat -I. How does the regular CPU hotplug code differ from how the Suspend-to-RAM - infrastructure uses it internally? And where do they share common code? +I. Differences between CPU hotplug and Suspend-to-RAM +====================================================== + +How does the regular CPU hotplug code differ from how the Suspend-to-RAM +infrastructure uses it internally? And where do they share common code? Well, a picture is worth a thousand words... So ASCII art follows :-) @@ -16,13 +21,13 @@ of describing where they take different paths and where they share code. What happens when regular CPU hotplug and Suspend-to-RAM race with each other is not depicted here.] 
-On a high level, the suspend-resume cycle goes like this: +On a high level, the suspend-resume cycle goes like this:: -|Freeze| -> |Disable nonboot| -> |Do suspend| -> |Enable nonboot| -> |Thaw | -|tasks | | cpus | | | | cpus | |tasks| + |Freeze| -> |Disable nonboot| -> |Do suspend| -> |Enable nonboot| -> |Thaw | + |tasks | | cpus | | | | cpus | |tasks| -More details follow: +More details follow:: Suspend call path ----------------- @@ -87,7 +92,9 @@ More details follow: Resuming back is likewise, with the counterparts being (in the order of execution during resume): -* enable_nonboot_cpus() which involves: + +* enable_nonboot_cpus() which involves:: + | Acquire cpu_add_remove_lock | Decrease cpu_hotplug_disabled, thereby enabling regular cpu hotplug | Call _cpu_up() [for all those cpus in the frozen_cpus mask, in a loop] @@ -101,7 +108,7 @@ execution during resume): It is to be noted here that the system_transition_mutex lock is acquired at the very beginning, when we are just starting out to suspend, and then released only -after the entire cycle is complete (i.e., suspend + resume). +after the entire cycle is complete (i.e., suspend + resume):: @@ -152,16 +159,16 @@ with the 'tasks_frozen' argument set to 1. Important files and functions/entry points: ------------------------------------------- +------------------------------------------- -kernel/power/process.c : freeze_processes(), thaw_processes() -kernel/power/suspend.c : suspend_prepare(), suspend_enter(), suspend_finish() -kernel/cpu.c: cpu_[up|down](), _cpu_[up|down](), [disable|enable]_nonboot_cpus() +- kernel/power/process.c : freeze_processes(), thaw_processes() +- kernel/power/suspend.c : suspend_prepare(), suspend_enter(), suspend_finish() +- kernel/cpu.c: cpu_[up|down](), _cpu_[up|down](), [disable|enable]_nonboot_cpus() II. What are the issues involved in CPU hotplug? 
- ------------------------------------------- +------------------------------------------------ There are some interesting situations involving CPU hotplug and microcode update on the CPUs, as discussed below: @@ -243,8 +250,11 @@ d. Handling microcode update during suspend/hibernate: cycles). -III. Are there any known problems when regular CPU hotplug and suspend race - with each other? +III. Known problems +=================== + +Are there any known problems when regular CPU hotplug and suspend race +with each other? Yes, they are listed below: diff --git a/Documentation/power/suspend-and-interrupts.txt b/Documentation/power/suspend-and-interrupts.rst similarity index 98% rename from Documentation/power/suspend-and-interrupts.txt rename to Documentation/power/suspend-and-interrupts.rst index 8afb29a8604a..4cda6617709a 100644 --- a/Documentation/power/suspend-and-interrupts.txt +++ b/Documentation/power/suspend-and-interrupts.rst @@ -1,4 +1,6 @@ +==================================== System Suspend and Device Interrupts +==================================== Copyright (C) 2014 Intel Corp. Author: Rafael J. Wysocki diff --git a/Documentation/power/swsusp-and-swap-files.txt b/Documentation/power/swsusp-and-swap-files.rst similarity index 83% rename from Documentation/power/swsusp-and-swap-files.txt rename to Documentation/power/swsusp-and-swap-files.rst index f281886de490..a33a2919dbe4 100644 --- a/Documentation/power/swsusp-and-swap-files.txt +++ b/Documentation/power/swsusp-and-swap-files.rst @@ -1,4 +1,7 @@ +=============================================== Using swap files with software suspend (swsusp) +=============================================== + (C) 2006 Rafael J. Wysocki The Linux kernel handles swap files almost in the same way as it handles swap @@ -21,20 +24,20 @@ units. In order to use a swap file with swsusp, you need to: -1) Create the swap file and make it active, eg. 
+1) Create the swap file and make it active, eg.:: -# dd if=/dev/zero of= bs=1024 count= -# mkswap -# swapon + # dd if=/dev/zero of= bs=1024 count= + # mkswap + # swapon 2) Use an application that will bmap the swap file with the help of the FIBMAP ioctl and determine the location of the file's swap header, as the offset, in units, from the beginning of the partition which holds the swap file. -3) Add the following parameters to the kernel command line: +3) Add the following parameters to the kernel command line:: -resume= resume_offset= + resume= resume_offset= where is the partition on which the swap file is located and is the offset of the swap header determined by the @@ -46,7 +49,7 @@ OR Use a userland suspend application that will set the partition and offset with the help of the SNAPSHOT_SET_SWAP_AREA ioctl described in -Documentation/power/userland-swsusp.txt (this is the only method to suspend +Documentation/power/userland-swsusp.rst (this is the only method to suspend to a swap file allowing the resume to be initiated from an initrd or initramfs image). diff --git a/Documentation/power/swsusp-dmcrypt.txt b/Documentation/power/swsusp-dmcrypt.rst similarity index 67% rename from Documentation/power/swsusp-dmcrypt.txt rename to Documentation/power/swsusp-dmcrypt.rst index b802fbfd95ef..426df59172cd 100644 --- a/Documentation/power/swsusp-dmcrypt.txt +++ b/Documentation/power/swsusp-dmcrypt.rst @@ -1,13 +1,15 @@ +======================================= +How to use dm-crypt and swsusp together +======================================= + Author: Andreas Steinmetz -How to use dm-crypt and swsusp together: -======================================== Some prerequisites: You know how dm-crypt works. If not, visit the following web page: http://www.saout.de/misc/dm-crypt/ -You have read Documentation/power/swsusp.txt and understand it. +You have read Documentation/power/swsusp.rst and understand it. 
You did read Documentation/admin-guide/initrd.rst and know how an initrd works. You know how to create or how to modify an initrd. @@ -29,23 +31,23 @@ a way that the swap device you suspend to/resume from has always the same major/minor within the initrd as well as within your running system. The easiest way to achieve this is to always set up this swap device first with dmsetup, so that -it will always look like the following: +it will always look like the following:: -brw------- 1 root root 254, 0 Jul 28 13:37 /dev/mapper/swap0 + brw------- 1 root root 254, 0 Jul 28 13:37 /dev/mapper/swap0 Now set up your kernel to use /dev/mapper/swap0 as the default -resume partition, so your kernel .config contains: +resume partition, so your kernel .config contains:: -CONFIG_PM_STD_PARTITION="/dev/mapper/swap0" + CONFIG_PM_STD_PARTITION="/dev/mapper/swap0" Prepare your boot loader to use the initrd you will create or modify. For lilo the simplest setup looks like the following -lines: +lines:: -image=/boot/vmlinuz -initrd=/boot/initrd.gz -label=linux -append="root=/dev/ram0 init=/linuxrc rw" + image=/boot/vmlinuz + initrd=/boot/initrd.gz + label=linux + append="root=/dev/ram0 init=/linuxrc rw" Finally you need to create or modify your initrd. Lets assume you create an initrd that reads the required dm-crypt setup @@ -53,66 +55,66 @@ from a pcmcia flash disk card. The card is formatted with an ext2 fs which resides on /dev/hde1 when the card is inserted. The card contains at least the encrypted swap setup in a file named "swapkey". 
/etc/fstab of your initrd contains something -like the following: +like the following:: -/dev/hda1 /mnt ext3 ro 0 0 -none /proc proc defaults,noatime,nodiratime 0 0 -none /sys sysfs defaults,noatime,nodiratime 0 0 + /dev/hda1 /mnt ext3 ro 0 0 + none /proc proc defaults,noatime,nodiratime 0 0 + none /sys sysfs defaults,noatime,nodiratime 0 0 /dev/hda1 contains an unencrypted mini system that sets up all of your crypto devices, again by reading the setup from the pcmcia flash disk. What follows now is a /linuxrc for your initrd that allows you to resume from encrypted swap and that continues boot with your mini system on /dev/hda1 if resume -does not happen: +does not happen:: -#!/bin/sh -PATH=/sbin:/bin:/usr/sbin:/usr/bin -mount /proc -mount /sys -mapped=0 -noresume=`grep -c noresume /proc/cmdline` -if [ "$*" != "" ] -then - noresume=1 -fi -dmesg -n 1 -/sbin/cardmgr -q -for i in 1 2 3 4 5 6 7 8 9 0 -do - if [ -f /proc/ide/hde/media ] + #!/bin/sh + PATH=/sbin:/bin:/usr/sbin:/usr/bin + mount /proc + mount /sys + mapped=0 + noresume=`grep -c noresume /proc/cmdline` + if [ "$*" != "" ] then + noresume=1 + fi + dmesg -n 1 + /sbin/cardmgr -q + for i in 1 2 3 4 5 6 7 8 9 0 + do + if [ -f /proc/ide/hde/media ] + then + usleep 500000 + mount -t ext2 -o ro /dev/hde1 /mnt + if [ -f /mnt/swapkey ] + then + dmsetup create swap0 /mnt/swapkey > /dev/null 2>&1 && mapped=1 + fi + umount /mnt + break + fi usleep 500000 - mount -t ext2 -o ro /dev/hde1 /mnt - if [ -f /mnt/swapkey ] + done + killproc /sbin/cardmgr + dmesg -n 6 + if [ $mapped = 1 ] + then + if [ $noresume != 0 ] then - dmsetup create swap0 /mnt/swapkey > /dev/null 2>&1 && mapped=1 + mkswap /dev/mapper/swap0 > /dev/null 2>&1 fi - umount /mnt - break + echo 254:0 > /sys/power/resume + dmsetup remove swap0 fi - usleep 500000 -done -killproc /sbin/cardmgr -dmesg -n 6 -if [ $mapped = 1 ] -then - if [ $noresume != 0 ] - then - mkswap /dev/mapper/swap0 > /dev/null 2>&1 - fi - echo 254:0 > /sys/power/resume - dmsetup remove 
swap0 -fi -umount /sys -mount /mnt -umount /proc -cd /mnt -pivot_root . mnt -mount /proc -umount -l /mnt -umount /proc -exec chroot . /sbin/init $* < dev/console > dev/console 2>&1 + umount /sys + mount /mnt + umount /proc + cd /mnt + pivot_root . mnt + mount /proc + umount -l /mnt + umount /proc + exec chroot . /sbin/init $* < dev/console > dev/console 2>&1 Please don't mind the weird loop above, busybox's msh doesn't know the let statement. Now, what is happening in the script? diff --git a/Documentation/power/swsusp.rst b/Documentation/power/swsusp.rst new file mode 100644 index 000000000000..d000312f6965 --- /dev/null +++ b/Documentation/power/swsusp.rst @@ -0,0 +1,501 @@ +============ +Swap suspend +============ + +Some warnings, first. + +.. warning:: + + **BIG FAT WARNING** + + If you touch anything on disk between suspend and resume... + ...kiss your data goodbye. + + If you do resume from initrd after your filesystems are mounted... + ...bye bye root partition. + + [this is actually same case as above] + + If you have unsupported ( ) devices using DMA, you may have some + problems. If your disk driver does not support suspend... (IDE does), + it may cause some problems, too. If you change kernel command line + between suspend and resume, it may do something wrong. If you change + your hardware while system is suspended... well, it was not good idea; + but it will probably only crash. + + ( ) suspend/resume support is needed to make it safe. + + If you have any filesystems on USB devices mounted before software suspend, + they won't be accessible after resume and you may lose data, as though + you have unplugged the USB devices with mounted filesystems on them; + see the FAQ below for details. (This is not true for more traditional + power states like "standby", which normally don't turn USB off.) + +Swap partition: + You need to append resume=/dev/your_swap_partition to kernel command + line or specify it using /sys/power/resume. 
+ +Swap file: + If using a swapfile you can also specify a resume offset using + resume_offset= on the kernel command line or specify it + in /sys/power/resume_offset. + +After preparing, you suspend by:: + + echo shutdown > /sys/power/disk; echo disk > /sys/power/state + +- If you feel ACPI works pretty well on your system, you might try:: + + echo platform > /sys/power/disk; echo disk > /sys/power/state + +- If you would like to write the hibernation image to swap and then suspend + to RAM (provided your platform supports it), you can try:: + + echo suspend > /sys/power/disk; echo disk > /sys/power/state + +- If you have SATA disks, you'll need recent kernels with SATA suspend + support. For suspend and resume to work, make sure your disk drivers + are built into the kernel -- not modules. [There's a way to make + suspend/resume work with modular disk drivers, see FAQ, but you probably + should not do that.] + +If you want to limit the suspend image size to N bytes, do:: + + echo N > /sys/power/image_size + +before suspend (it is limited to around 2/5 of available RAM by default). + +- The resume process checks for the presence of the resume device; + if found, it then checks the contents for the hibernation image signature. + If both are found, it resumes the hibernation image. + +- The resume process may be triggered in two ways: + + 1) During lateinit: If resume=/dev/your_swap_partition is specified on + the kernel command line, lateinit runs the resume process. If the + resume device has not been probed yet, the resume process fails and + bootup continues. + 2) Manually from an initrd or initramfs: May be run from + the init script by using the /sys/power/resume file. It is vital + that this be done prior to remounting any filesystems (even as + read-only), otherwise data may be corrupted.
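The manual resume path described in item 2) can be sketched as a small helper for an initrd or initramfs init script. This is only an illustration under stated assumptions: `try_resume` and the `SYSFS_POWER` override are hypothetical names introduced here (the override exists so the function can be exercised against a scratch directory), and the `major:minor` form written to the resume file follows the `echo 254:0 > /sys/power/resume` usage shown in swsusp-dmcrypt.rst.

```shell
#!/bin/sh
# Sketch only: manual resume from an initrd/initramfs init script.
# SYSFS_POWER is a hypothetical override for testing; on a real
# system the interface lives in /sys/power.
SYSFS_POWER="${SYSFS_POWER:-/sys/power}"

# try_resume MAJOR:MINOR
# Writes the device number of the resume device to the resume file.
# This must happen before any filesystem is mounted or remounted,
# even read-only. If no hibernation image is found, the write simply
# returns and the normal boot can continue.
try_resume() {
    dev="$1"
    # Accept only strings of the form <major>:<minor>.
    case "$dev" in
        *[!0-9:]*|*:*:*|:*|*:|"")
            echo "try_resume: bad device number '$dev'" >&2
            return 2 ;;
        *:*) ;;
        *)
            echo "try_resume: bad device number '$dev'" >&2
            return 2 ;;
    esac
    # The resume file is absent if hibernation support is not built in.
    if [ ! -w "$SYSFS_POWER/resume" ]; then
        echo "try_resume: $SYSFS_POWER/resume is not writable" >&2
        return 1
    fi
    echo "$dev" > "$SYSFS_POWER/resume"
}
```

An init script would call something like `try_resume 254:0` as early as possible, before mounting the root filesystem, and fall through to the regular boot sequence when the call returns.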
+ +Article about goals and implementation of Software Suspend for Linux +==================================================================== + +Author: Gábor Kuti +Last revised: 2003-10-20 by Pavel Machek + +Idea and goals to achieve +------------------------- + +Nowadays it is common for laptops to have a suspend button. It +saves the state of the machine to a filesystem or to a partition and switches +to standby mode. When the machine is later resumed, the saved state is loaded +back to RAM and the machine can continue its work. This has two real benefits. +First, we save ourselves the time the machine takes to go down and later boot +up; energy costs are really high when running from batteries. The other gain is +that we don't have to interrupt our programs, so processes that are calculating +something for a long time don't need to be written to be interruptible. + +swsusp saves the state of the machine into active swaps and then reboots or +powers down. You must explicitly specify the swap partition to resume from with +the `resume=` kernel option. If the signature is found, it loads and restores +the saved state. If the option `noresume` is specified as a boot parameter, it +skips resuming. If the option `hibernate=nocompress` is specified as a boot +parameter, it saves the hibernation image without compression. + +While the system is suspended, you should not add/remove any +hardware, write to the filesystems, etc.
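The boot options mentioned above (`resume=`, `noresume`, `hibernate=nocompress`) can be inspected from a script. The following is a minimal sketch, not part of swsusp itself: `cmdline_has` and `cmdline_value` are hypothetical helper names, and the command line is passed in as a string so the helpers can be tried on sample input. Unlike a plain `grep -c noresume`, they match whole space-separated options only, so `noresume` is not confused with a substring of another parameter. Options separated by tabs are not handled.

```shell
#!/bin/sh
# Sketch only: whole-word lookup of options in a kernel command line.

# cmdline_has OPTION CMDLINE
# Succeeds if OPTION appears as a complete space-separated word.
cmdline_has() {
    case " $2 " in
        *" $1 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# cmdline_value KEY CMDLINE
# Prints the value of the last KEY=value option; fails if KEY is absent.
cmdline_value() {
    padded=" $2 "
    case "$padded" in
        *" $1="*)
            rest="${padded##*" $1="}"   # text after the last " KEY="
            printf '%s\n' "${rest%% *}" # cut at the next space
            ;;
        *) return 1 ;;
    esac
}
```

On a running system the second argument would typically be `"$(cat /proc/cmdline)"`, for example `cmdline_has noresume "$(cat /proc/cmdline)"`.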
+ +Sleep states summary +==================== + +There are three different interfaces you can use, /proc/acpi should +work like this: + +In a really perfect world:: + + echo 1 > /proc/acpi/sleep # for standby + echo 2 > /proc/acpi/sleep # for suspend to ram + echo 3 > /proc/acpi/sleep # for suspend to ram, but with more power conservative + echo 4 > /proc/acpi/sleep # for suspend to disk + echo 5 > /proc/acpi/sleep # for shutdown unfriendly the system + +and perhaps:: + + echo 4b > /proc/acpi/sleep # for suspend to disk via s4bios + +Frequently Asked Questions +========================== + +Q: + well, suspending a server is IMHO a really stupid thing, + but... (Diego Zuccato): + +A: + You bought new UPS for your server. How do you install it without + bringing machine down? Suspend to disk, rearrange power cables, + resume. + + You have your server on UPS. Power died, and UPS is indicating 30 + seconds to failure. What do you do? Suspend to disk. + + +Q: + Maybe I'm missing something, but why don't the regular I/O paths work? + +A: + We do use the regular I/O paths. However we cannot restore the data + to its original location as we load it. That would create an + inconsistent kernel state which would certainly result in an oops. + Instead, we load the image into unused memory and then atomically copy + it back to it original location. This implies, of course, a maximum + image size of half the amount of memory. + + There are two solutions to this: + + * require half of memory to be free during suspend. That way you can + read "new" data onto free spots, then cli and copy + + * assume we had special "polling" ide driver that only uses memory + between 0-640KB. That way, I'd have to make sure that 0-640KB is free + during suspending, but otherwise it would work... + + suspend2 shares this fundamental limitation, but does not include user + data and disk caches into "used memory" by saving them in + advance. That means that the limitation goes away in practice. 
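The half-of-memory bound from the answer above can be made concrete with a rough calculation. This is a sketch only: it treats the `MemTotal` figure from `/proc/meminfo`-style text as "the amount of memory", which is an assumption (the kernel's own accounting is more involved), and the function name `max_image_kb` is hypothetical.

```shell
# Hypothetical helper: estimate the largest image (in kB) that still leaves
# room to load it into unused memory before copying it back atomically.
# Reads /proc/meminfo-formatted text on standard input.
max_image_kb() {
    awk '/^MemTotal:/ { print int($2 / 2) }'
}
```

On a running system this would be used as `max_image_kb < /proc/meminfo`; the result is an upper bound for orientation, not a value the kernel reports.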
+ +Q: + Does linux support ACPI S4? + +A: + Yes. That's what echo platform > /sys/power/disk does. + +Q: + What is 'suspend2'? + +A: + suspend2 is 'Software Suspend 2', a forked implementation of + suspend-to-disk which is available as separate patches for 2.4 and 2.6 + kernels from swsusp.sourceforge.net. It includes support for SMP, 4GB + highmem and preemption. It also has a extensible architecture that + allows for arbitrary transformations on the image (compression, + encryption) and arbitrary backends for writing the image (eg to swap + or an NFS share[Work In Progress]). Questions regarding suspend2 + should be sent to the mailing list available through the suspend2 + website, and not to the Linux Kernel Mailing List. We are working + toward merging suspend2 into the mainline kernel. + +Q: + What is the freezing of tasks and why are we using it? + +A: + The freezing of tasks is a mechanism by which user space processes and some + kernel threads are controlled during hibernation or system-wide suspend (on some + architectures). See freezing-of-tasks.txt for details. + +Q: + What is the difference between "platform" and "shutdown"? + +A: + shutdown: + save state in linux, then tell bios to powerdown + + platform: + save state in linux, then tell bios to powerdown and blink + "suspended led" + + "platform" is actually right thing to do where supported, but + "shutdown" is most reliable (except on ACPI systems). + +Q: + I do not understand why you have such strong objections to idea of + selective suspend. + +A: + Do selective suspend during runtime power management, that's okay. But + it's useless for suspend-to-disk. (And I do not see how you could use + it for suspend-to-ram, I hope you do not want that). 
+ + Lets see, so you suggest to + + * SUSPEND all but swap device and parents + * Snapshot + * Write image to disk + * SUSPEND swap device and parents + * Powerdown + + Oh no, that does not work, if swap device or its parents uses DMA, + you've corrupted data. You'd have to do + + * SUSPEND all but swap device and parents + * FREEZE swap device and parents + * Snapshot + * UNFREEZE swap device and parents + * Write + * SUSPEND swap device and parents + + Which means that you still need that FREEZE state, and you get more + complicated code. (And I have not yet introduce details like system + devices). + +Q: + There don't seem to be any generally useful behavioral + distinctions between SUSPEND and FREEZE. + +A: + Doing SUSPEND when you are asked to do FREEZE is always correct, + but it may be unnecessarily slow. If you want your driver to stay simple, + slowness may not matter to you. It can always be fixed later. + + For devices like disk it does matter, you do not want to spindown for + FREEZE. + +Q: + After resuming, system is paging heavily, leading to very bad interactivity. + +A: + Try running:: + + cat /proc/[0-9]*/maps | grep / | sed 's:.* /:/:' | sort -u | while read file + do + test -f "$file" && cat "$file" > /dev/null + done + + after resume. swapoff -a; swapon -a may also be useful. + +Q: + What happens to devices during swsusp? They seem to be resumed + during system suspend? + +A: + That's correct. We need to resume them if we want to write image to + disk. 
Whole sequence goes like + + **Suspend part** + + running system, user asks for suspend-to-disk + + user processes are stopped + + suspend(PMSG_FREEZE): devices are frozen so that they don't interfere + with state snapshot + + state snapshot: copy of whole used memory is taken with interrupts disabled + + resume(): devices are woken up so that we can write image to swap + + write image to swap + + suspend(PMSG_SUSPEND): suspend devices so that we can power off + + turn the power off + + **Resume part** + + (is actually pretty similar) + + running system, user asks for suspend-to-disk + + user processes are stopped (in common case there are none, + but with resume-from-initrd, no one knows) + + read image from disk + + suspend(PMSG_FREEZE): devices are frozen so that they don't interfere + with image restoration + + image restoration: rewrite memory with image + + resume(): devices are woken up so that system can continue + + thaw all user processes + +Q: + What is this 'Encrypt suspend image' for? + +A: + First of all: it is not a replacement for dm-crypt encrypted swap. + It cannot protect your computer while it is suspended. Instead it does + protect from leaking sensitive data after resume from suspend. + + Think of the following: you suspend while an application is running + that keeps sensitive data in memory. The application itself prevents + the data from being swapped out. Suspend, however, must write these + data to swap to be able to resume later on. Without suspend encryption + your sensitive data are then stored in plaintext on disk. This means + that after resume your sensitive data are accessible to all + applications having direct access to the swap device which was used + for suspend. If you don't need swap after resume these data can remain + on disk virtually forever. Thus it can happen that your system gets + broken in weeks later and sensitive data which you thought were + encrypted and protected are retrieved and stolen from the swap device. 
+ To prevent this situation you should use 'Encrypt suspend image'. + + During suspend a temporary key is created and this key is used to + encrypt the data written to disk. When, during resume, the data was + read back into memory the temporary key is destroyed which simply + means that all data written to disk during suspend are then + inaccessible so they can't be stolen later on. The only thing that + you must then take care of is that you call 'mkswap' for the swap + partition used for suspend as early as possible during regular + boot. This asserts that any temporary key from an oopsed suspend or + from a failed or aborted resume is erased from the swap device. + + As a rule of thumb use encrypted swap to protect your data while your + system is shut down or suspended. Additionally use the encrypted + suspend image to prevent sensitive data from being stolen after + resume. + +Q: + Can I suspend to a swap file? + +A: + Generally, yes, you can. However, it requires you to use the "resume=" and + "resume_offset=" kernel command line parameters, so the resume from a swap file + cannot be initiated from an initrd or initramfs image. See + swsusp-and-swap-files.txt for details. + +Q: + Is there a maximum system RAM size that is supported by swsusp? + +A: + It should work okay with highmem. + +Q: + Does swsusp (to disk) use only one swap partition or can it use + multiple swap partitions (aggregate them into one logical space)? + +A: + Only one swap partition, sorry. + +Q: + If my application(s) causes lots of memory & swap space to be used + (over half of the total system RAM), is it correct that it is likely + to be useless to try to suspend to disk while that app is running? + +A: + No, it should work okay, as long as your app does not mlock() + it. Just prepare big enough swap partition. + +Q: + What information is useful for debugging suspend-to-disk problems? + +A: + Well, last messages on the screen are always useful. 
If something + is broken, it is usually some kernel driver, therefore trying with as + little as possible modules loaded helps a lot. I also prefer people to + suspend from console, preferably without X running. Booting with + init=/bin/bash, then swapon and starting suspend sequence manually + usually does the trick. Then it is good idea to try with latest + vanilla kernel. + +Q: + How can distributions ship a swsusp-supporting kernel with modular + disk drivers (especially SATA)? + +A: + Well, it can be done, load the drivers, then do echo into + /sys/power/resume file from initrd. Be sure not to mount + anything, not even read-only mount, or you are going to lose your + data. + +Q: + How do I make suspend more verbose? + +A: + If you want to see any non-error kernel messages on the virtual + terminal the kernel switches to during suspend, you have to set the + kernel console loglevel to at least 4 (KERN_WARNING), for example by + doing:: + + # save the old loglevel + read LOGLEVEL DUMMY < /proc/sys/kernel/printk + # set the loglevel so we see the progress bar. + # if the level is higher than needed, we leave it alone. + if [ $LOGLEVEL -lt 5 ]; then + echo 5 > /proc/sys/kernel/printk + fi + + IMG_SZ=0 + read IMG_SZ < /sys/power/image_size + echo -n disk > /sys/power/state + RET=$? + # + # the logic here is: + # if image_size > 0 (without kernel support, IMG_SZ will be zero), + # then try again with image_size set to zero. + if [ $RET -ne 0 -a $IMG_SZ -ne 0 ]; then # try again with minimal image size + echo 0 > /sys/power/image_size + echo -n disk > /sys/power/state + RET=$? + fi + + # restore previous loglevel + echo $LOGLEVEL > /proc/sys/kernel/printk + exit $RET + +Q: + Is this true that if I have a mounted filesystem on a USB device and + I suspend to disk, I can lose data unless the filesystem has been mounted + with "sync"? + +A: + That's right ... if you disconnect that device, you may lose data. 
+ In fact, even with "-o sync" you can lose data if your programs have + information in buffers they haven't written out to a disk you disconnect, + or if you disconnect before the device finished saving data you wrote. + + Software suspend normally powers down USB controllers, which is equivalent + to disconnecting all USB devices attached to your system. + + Your system might well support low-power modes for its USB controllers + while the system is asleep, maintaining the connection, using true sleep + modes like "suspend-to-RAM" or "standby". (Don't write "disk" to the + /sys/power/state file; write "standby" or "mem".) We've not seen any + hardware that can use these modes through software suspend, although in + theory some systems might support "platform" modes that won't break the + USB connections. + + Remember that it's always a bad idea to unplug a disk drive containing a + mounted filesystem. That's true even when your system is asleep! The + safest thing is to unmount all filesystems on removable media (such USB, + Firewire, CompactFlash, MMC, external SATA, or even IDE hotplug bays) + before suspending; then remount them after resuming. + + There is a work-around for this problem. For more information, see + Documentation/driver-api/usb/persist.rst. + +Q: + Can I suspend-to-disk using a swap partition under LVM? + +A: + Yes and No. You can suspend successfully, but the kernel will not be able + to resume on its own. You need an initramfs that can recognize the resume + situation, activate the logical volume containing the swap volume (but not + touch any filesystems!), and eventually call:: + + echo -n "$major:$minor" > /sys/power/resume + + where $major and $minor are the respective major and minor device numbers of + the swap volume. + + uswsusp works with LVM, too. See http://suspend.sourceforge.net/ + +Q: + I upgraded the kernel from 2.6.15 to 2.6.16. Both kernels were + compiled with the similar configuration files. 
Anyway I found that + suspend to disk (and resume) is much slower on 2.6.16 compared to + 2.6.15. Any idea for why that might happen or how can I speed it up? + +A: + This is because the size of the suspend image is now greater than + for 2.6.15 (by saving more data we can get more responsive system + after resume). + + There's the /sys/power/image_size knob that controls the size of the + image. If you set it to 0 (eg. by echo 0 > /sys/power/image_size as + root), the 2.6.15 behavior should be restored. If it is still too + slow, take a look at suspend.sf.net -- userland suspend is faster and + supports LZF compression to speed it up further. diff --git a/Documentation/power/swsusp.txt b/Documentation/power/swsusp.txt deleted file mode 100644 index 236d1fb13640..000000000000 --- a/Documentation/power/swsusp.txt +++ /dev/null @@ -1,446 +0,0 @@ -Some warnings, first. - - * BIG FAT WARNING ********************************************************* - * - * If you touch anything on disk between suspend and resume... - * ...kiss your data goodbye. - * - * If you do resume from initrd after your filesystems are mounted... - * ...bye bye root partition. - * [this is actually same case as above] - * - * If you have unsupported (*) devices using DMA, you may have some - * problems. If your disk driver does not support suspend... (IDE does), - * it may cause some problems, too. If you change kernel command line - * between suspend and resume, it may do something wrong. If you change - * your hardware while system is suspended... well, it was not good idea; - * but it will probably only crash. - * - * (*) suspend/resume support is needed to make it safe. - * - * If you have any filesystems on USB devices mounted before software suspend, - * they won't be accessible after resume and you may lose data, as though - * you have unplugged the USB devices with mounted filesystems on them; - * see the FAQ below for details. 
(This is not true for more traditional - * power states like "standby", which normally don't turn USB off.) - -Swap partition: -You need to append resume=/dev/your_swap_partition to kernel command -line or specify it using /sys/power/resume. - -Swap file: -If using a swapfile you can also specify a resume offset using -resume_offset= on the kernel command line or specify it -in /sys/power/resume_offset. - -After preparing then you suspend by - -echo shutdown > /sys/power/disk; echo disk > /sys/power/state - -. If you feel ACPI works pretty well on your system, you might try - -echo platform > /sys/power/disk; echo disk > /sys/power/state - -. If you would like to write hibernation image to swap and then suspend -to RAM (provided your platform supports it), you can try - -echo suspend > /sys/power/disk; echo disk > /sys/power/state - -. If you have SATA disks, you'll need recent kernels with SATA suspend -support. For suspend and resume to work, make sure your disk drivers -are built into kernel -- not modules. [There's way to make -suspend/resume with modular disk drivers, see FAQ, but you probably -should not do that.] - -If you want to limit the suspend image size to N bytes, do - -echo N > /sys/power/image_size - -before suspend (it is limited to around 2/5 of available RAM by default). - -. The resume process checks for the presence of the resume device, -if found, it then checks the contents for the hibernation image signature. -If both are found, it resumes the hibernation image. - -. The resume process may be triggered in two ways: - 1) During lateinit: If resume=/dev/your_swap_partition is specified on - the kernel command line, lateinit runs the resume process. If the - resume device has not been probed yet, the resume process fails and - bootup continues. - 2) Manually from an initrd or initramfs: May be run from - the init script by using the /sys/power/resume file. 
It is vital - that this be done prior to remounting any filesystems (even as - read-only) otherwise data may be corrupted. - -Article about goals and implementation of Software Suspend for Linux -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -Author: Gábor Kuti -Last revised: 2003-10-20 by Pavel Machek - -Idea and goals to achieve - -Nowadays it is common in several laptops that they have a suspend button. It -saves the state of the machine to a filesystem or to a partition and switches -to standby mode. Later resuming the machine the saved state is loaded back to -ram and the machine can continue its work. It has two real benefits. First we -save ourselves the time machine goes down and later boots up, energy costs -are real high when running from batteries. The other gain is that we don't have to -interrupt our programs so processes that are calculating something for a long -time shouldn't need to be written interruptible. - -swsusp saves the state of the machine into active swaps and then reboots or -powerdowns. You must explicitly specify the swap partition to resume from with -``resume='' kernel option. If signature is found it loads and restores saved -state. If the option ``noresume'' is specified as a boot parameter, it skips -the resuming. If the option ``hibernate=nocompress'' is specified as a boot -parameter, it saves hibernation image without compression. - -In the meantime while the system is suspended you should not add/remove any -of the hardware, write to the filesystems, etc. 
- -Sleep states summary -==================== - -There are three different interfaces you can use, /proc/acpi should -work like this: - -In a really perfect world: -echo 1 > /proc/acpi/sleep # for standby -echo 2 > /proc/acpi/sleep # for suspend to ram -echo 3 > /proc/acpi/sleep # for suspend to ram, but with more power conservative -echo 4 > /proc/acpi/sleep # for suspend to disk -echo 5 > /proc/acpi/sleep # for shutdown unfriendly the system - -and perhaps -echo 4b > /proc/acpi/sleep # for suspend to disk via s4bios - -Frequently Asked Questions -========================== - -Q: well, suspending a server is IMHO a really stupid thing, -but... (Diego Zuccato): - -A: You bought new UPS for your server. How do you install it without -bringing machine down? Suspend to disk, rearrange power cables, -resume. - -You have your server on UPS. Power died, and UPS is indicating 30 -seconds to failure. What do you do? Suspend to disk. - - -Q: Maybe I'm missing something, but why don't the regular I/O paths work? - -A: We do use the regular I/O paths. However we cannot restore the data -to its original location as we load it. That would create an -inconsistent kernel state which would certainly result in an oops. -Instead, we load the image into unused memory and then atomically copy -it back to it original location. This implies, of course, a maximum -image size of half the amount of memory. - -There are two solutions to this: - -* require half of memory to be free during suspend. That way you can -read "new" data onto free spots, then cli and copy - -* assume we had special "polling" ide driver that only uses memory -between 0-640KB. That way, I'd have to make sure that 0-640KB is free -during suspending, but otherwise it would work... - -suspend2 shares this fundamental limitation, but does not include user -data and disk caches into "used memory" by saving them in -advance. That means that the limitation goes away in practice. - -Q: Does linux support ACPI S4? - -A: Yes. 
That's what echo platform > /sys/power/disk does. - -Q: What is 'suspend2'? - -A: suspend2 is 'Software Suspend 2', a forked implementation of -suspend-to-disk which is available as separate patches for 2.4 and 2.6 -kernels from swsusp.sourceforge.net. It includes support for SMP, 4GB -highmem and preemption. It also has a extensible architecture that -allows for arbitrary transformations on the image (compression, -encryption) and arbitrary backends for writing the image (eg to swap -or an NFS share[Work In Progress]). Questions regarding suspend2 -should be sent to the mailing list available through the suspend2 -website, and not to the Linux Kernel Mailing List. We are working -toward merging suspend2 into the mainline kernel. - -Q: What is the freezing of tasks and why are we using it? - -A: The freezing of tasks is a mechanism by which user space processes and some -kernel threads are controlled during hibernation or system-wide suspend (on some -architectures). See freezing-of-tasks.txt for details. - -Q: What is the difference between "platform" and "shutdown"? - -A: - -shutdown: save state in linux, then tell bios to powerdown - -platform: save state in linux, then tell bios to powerdown and blink - "suspended led" - -"platform" is actually right thing to do where supported, but -"shutdown" is most reliable (except on ACPI systems). - -Q: I do not understand why you have such strong objections to idea of -selective suspend. - -A: Do selective suspend during runtime power management, that's okay. But -it's useless for suspend-to-disk. (And I do not see how you could use -it for suspend-to-ram, I hope you do not want that). - -Lets see, so you suggest to - -* SUSPEND all but swap device and parents -* Snapshot -* Write image to disk -* SUSPEND swap device and parents -* Powerdown - -Oh no, that does not work, if swap device or its parents uses DMA, -you've corrupted data. 
You'd have to do - -* SUSPEND all but swap device and parents -* FREEZE swap device and parents -* Snapshot -* UNFREEZE swap device and parents -* Write -* SUSPEND swap device and parents - -Which means that you still need that FREEZE state, and you get more -complicated code. (And I have not yet introduce details like system -devices). - -Q: There don't seem to be any generally useful behavioral -distinctions between SUSPEND and FREEZE. - -A: Doing SUSPEND when you are asked to do FREEZE is always correct, -but it may be unnecessarily slow. If you want your driver to stay simple, -slowness may not matter to you. It can always be fixed later. - -For devices like disk it does matter, you do not want to spindown for -FREEZE. - -Q: After resuming, system is paging heavily, leading to very bad interactivity. - -A: Try running - -cat /proc/[0-9]*/maps | grep / | sed 's:.* /:/:' | sort -u | while read file -do - test -f "$file" && cat "$file" > /dev/null -done - -after resume. swapoff -a; swapon -a may also be useful. - -Q: What happens to devices during swsusp? They seem to be resumed -during system suspend? - -A: That's correct. We need to resume them if we want to write image to -disk. 
Whole sequence goes like - - Suspend part - ~~~~~~~~~~~~ - running system, user asks for suspend-to-disk - - user processes are stopped - - suspend(PMSG_FREEZE): devices are frozen so that they don't interfere - with state snapshot - - state snapshot: copy of whole used memory is taken with interrupts disabled - - resume(): devices are woken up so that we can write image to swap - - write image to swap - - suspend(PMSG_SUSPEND): suspend devices so that we can power off - - turn the power off - - Resume part - ~~~~~~~~~~~ - (is actually pretty similar) - - running system, user asks for suspend-to-disk - - user processes are stopped (in common case there are none, but with resume-from-initrd, no one knows) - - read image from disk - - suspend(PMSG_FREEZE): devices are frozen so that they don't interfere - with image restoration - - image restoration: rewrite memory with image - - resume(): devices are woken up so that system can continue - - thaw all user processes - -Q: What is this 'Encrypt suspend image' for? - -A: First of all: it is not a replacement for dm-crypt encrypted swap. -It cannot protect your computer while it is suspended. Instead it does -protect from leaking sensitive data after resume from suspend. - -Think of the following: you suspend while an application is running -that keeps sensitive data in memory. The application itself prevents -the data from being swapped out. Suspend, however, must write these -data to swap to be able to resume later on. Without suspend encryption -your sensitive data are then stored in plaintext on disk. This means -that after resume your sensitive data are accessible to all -applications having direct access to the swap device which was used -for suspend. If you don't need swap after resume these data can remain -on disk virtually forever. Thus it can happen that your system gets -broken in weeks later and sensitive data which you thought were -encrypted and protected are retrieved and stolen from the swap device. 
-To prevent this situation you should use 'Encrypt suspend image'. - -During suspend a temporary key is created and this key is used to -encrypt the data written to disk. When, during resume, the data was -read back into memory the temporary key is destroyed which simply -means that all data written to disk during suspend are then -inaccessible so they can't be stolen later on. The only thing that -you must then take care of is that you call 'mkswap' for the swap -partition used for suspend as early as possible during regular -boot. This asserts that any temporary key from an oopsed suspend or -from a failed or aborted resume is erased from the swap device. - -As a rule of thumb use encrypted swap to protect your data while your -system is shut down or suspended. Additionally use the encrypted -suspend image to prevent sensitive data from being stolen after -resume. - -Q: Can I suspend to a swap file? - -A: Generally, yes, you can. However, it requires you to use the "resume=" and -"resume_offset=" kernel command line parameters, so the resume from a swap file -cannot be initiated from an initrd or initramfs image. See -swsusp-and-swap-files.txt for details. - -Q: Is there a maximum system RAM size that is supported by swsusp? - -A: It should work okay with highmem. - -Q: Does swsusp (to disk) use only one swap partition or can it use -multiple swap partitions (aggregate them into one logical space)? - -A: Only one swap partition, sorry. - -Q: If my application(s) causes lots of memory & swap space to be used -(over half of the total system RAM), is it correct that it is likely -to be useless to try to suspend to disk while that app is running? - -A: No, it should work okay, as long as your app does not mlock() -it. Just prepare big enough swap partition. - -Q: What information is useful for debugging suspend-to-disk problems? - -A: Well, last messages on the screen are always useful. 
If something -is broken, it is usually some kernel driver, therefore trying with as -little as possible modules loaded helps a lot. I also prefer people to -suspend from console, preferably without X running. Booting with -init=/bin/bash, then swapon and starting suspend sequence manually -usually does the trick. Then it is good idea to try with latest -vanilla kernel. - -Q: How can distributions ship a swsusp-supporting kernel with modular -disk drivers (especially SATA)? - -A: Well, it can be done, load the drivers, then do echo into -/sys/power/resume file from initrd. Be sure not to mount -anything, not even read-only mount, or you are going to lose your -data. - -Q: How do I make suspend more verbose? - -A: If you want to see any non-error kernel messages on the virtual -terminal the kernel switches to during suspend, you have to set the -kernel console loglevel to at least 4 (KERN_WARNING), for example by -doing - - # save the old loglevel - read LOGLEVEL DUMMY < /proc/sys/kernel/printk - # set the loglevel so we see the progress bar. - # if the level is higher than needed, we leave it alone. - if [ $LOGLEVEL -lt 5 ]; then - echo 5 > /proc/sys/kernel/printk - fi - - IMG_SZ=0 - read IMG_SZ < /sys/power/image_size - echo -n disk > /sys/power/state - RET=$? - # - # the logic here is: - # if image_size > 0 (without kernel support, IMG_SZ will be zero), - # then try again with image_size set to zero. - if [ $RET -ne 0 -a $IMG_SZ -ne 0 ]; then # try again with minimal image size - echo 0 > /sys/power/image_size - echo -n disk > /sys/power/state - RET=$? - fi - - # restore previous loglevel - echo $LOGLEVEL > /proc/sys/kernel/printk - exit $RET - -Q: Is this true that if I have a mounted filesystem on a USB device and -I suspend to disk, I can lose data unless the filesystem has been mounted -with "sync"? - -A: That's right ... if you disconnect that device, you may lose data. 
-In fact, even with "-o sync" you can lose data if your programs have -information in buffers they haven't written out to a disk you disconnect, -or if you disconnect before the device finished saving data you wrote. - -Software suspend normally powers down USB controllers, which is equivalent -to disconnecting all USB devices attached to your system. - -Your system might well support low-power modes for its USB controllers -while the system is asleep, maintaining the connection, using true sleep -modes like "suspend-to-RAM" or "standby". (Don't write "disk" to the -/sys/power/state file; write "standby" or "mem".) We've not seen any -hardware that can use these modes through software suspend, although in -theory some systems might support "platform" modes that won't break the -USB connections. - -Remember that it's always a bad idea to unplug a disk drive containing a -mounted filesystem. That's true even when your system is asleep! The -safest thing is to unmount all filesystems on removable media (such USB, -Firewire, CompactFlash, MMC, external SATA, or even IDE hotplug bays) -before suspending; then remount them after resuming. - -There is a work-around for this problem. For more information, see -Documentation/driver-api/usb/persist.rst. - -Q: Can I suspend-to-disk using a swap partition under LVM? - -A: Yes and No. You can suspend successfully, but the kernel will not be able -to resume on its own. You need an initramfs that can recognize the resume -situation, activate the logical volume containing the swap volume (but not -touch any filesystems!), and eventually call - -echo -n "$major:$minor" > /sys/power/resume - -where $major and $minor are the respective major and minor device numbers of -the swap volume. - -uswsusp works with LVM, too. See http://suspend.sourceforge.net/ - -Q: I upgraded the kernel from 2.6.15 to 2.6.16. Both kernels were -compiled with the similar configuration files. 
Anyway I found that -suspend to disk (and resume) is much slower on 2.6.16 compared to -2.6.15. Any idea for why that might happen or how can I speed it up? - -A: This is because the size of the suspend image is now greater than -for 2.6.15 (by saving more data we can get more responsive system -after resume). - -There's the /sys/power/image_size knob that controls the size of the -image. If you set it to 0 (eg. by echo 0 > /sys/power/image_size as -root), the 2.6.15 behavior should be restored. If it is still too -slow, take a look at suspend.sf.net -- userland suspend is faster and -supports LZF compression to speed it up further. diff --git a/Documentation/power/tricks.txt b/Documentation/power/tricks.rst similarity index 93% rename from Documentation/power/tricks.txt rename to Documentation/power/tricks.rst index a1b8f7249f4c..ca787f142c3f 100644 --- a/Documentation/power/tricks.txt +++ b/Documentation/power/tricks.rst @@ -1,5 +1,7 @@ - swsusp/S3 tricks - ~~~~~~~~~~~~~~~~ +================ +swsusp/S3 tricks +================ + Pavel Machek If you want to trick swsusp/S3 into working, you might want to try: diff --git a/Documentation/power/userland-swsusp.txt b/Documentation/power/userland-swsusp.rst similarity index 85% rename from Documentation/power/userland-swsusp.txt rename to Documentation/power/userland-swsusp.rst index bbfcd1bbedc5..a0fa51bb1a4d 100644 --- a/Documentation/power/userland-swsusp.txt +++ b/Documentation/power/userland-swsusp.rst @@ -1,4 +1,7 @@ +===================================================== Documentation for userland software suspend interface +===================================================== + (C) 2006 Rafael J. Wysocki First, the warnings at the beginning of swsusp.txt still apply. @@ -30,13 +33,16 @@ called. 
 The ioctl() commands recognized by the device are:
 
-SNAPSHOT_FREEZE - freeze user space processes (the current process is
+SNAPSHOT_FREEZE
+ freeze user space processes (the current process is
  not frozen); this is required for SNAPSHOT_CREATE_IMAGE
  and SNAPSHOT_ATOMIC_RESTORE to succeed
 
-SNAPSHOT_UNFREEZE - thaw user space processes frozen by SNAPSHOT_FREEZE
+SNAPSHOT_UNFREEZE
+ thaw user space processes frozen by SNAPSHOT_FREEZE
 
-SNAPSHOT_CREATE_IMAGE - create a snapshot of the system memory; the
+SNAPSHOT_CREATE_IMAGE
+ create a snapshot of the system memory; the
  last argument of ioctl() should be a pointer to an int variable,
  the value of which will indicate whether the call returned after
  creating the snapshot (1) or after restoring the system memory state
@@ -45,48 +51,59 @@ SNAPSHOT_CREATE_IMAGE - create a snapshot of the system memory; the
  has been created the read() operation can be used to transfer
  it out of the kernel
 
-SNAPSHOT_ATOMIC_RESTORE - restore the system memory state from the
+SNAPSHOT_ATOMIC_RESTORE
+ restore the system memory state from the
  uploaded snapshot image; before calling it you should transfer
  the system memory snapshot back to the kernel using the write()
  operation; this call will not succeed if the snapshot
  image is not available to the kernel
 
-SNAPSHOT_FREE - free memory allocated for the snapshot image
+SNAPSHOT_FREE
+ free memory allocated for the snapshot image
 
-SNAPSHOT_PREF_IMAGE_SIZE - set the preferred maximum size of the image
+SNAPSHOT_PREF_IMAGE_SIZE
+ set the preferred maximum size of the image
  (the kernel will do its best to ensure the image size will not exceed
  this number, but if it turns out to be impossible, the kernel will
  create the smallest image possible)
 
-SNAPSHOT_GET_IMAGE_SIZE - return the actual size of the hibernation image
+SNAPSHOT_GET_IMAGE_SIZE
+ return the actual size of the hibernation image
 
-SNAPSHOT_AVAIL_SWAP_SIZE - return the amount of available swap in bytes (the
+SNAPSHOT_AVAIL_SWAP_SIZE
+ return the amount of available swap in bytes (the
  last argument should be a pointer to an unsigned int variable that will
  contain the result if the call is successful).
 
-SNAPSHOT_ALLOC_SWAP_PAGE - allocate a swap page from the resume partition
+SNAPSHOT_ALLOC_SWAP_PAGE
+ allocate a swap page from the resume partition
  (the last argument should be a pointer to a loff_t variable that
  will contain the swap page offset if the call is successful)
 
-SNAPSHOT_FREE_SWAP_PAGES - free all swap pages allocated by
+SNAPSHOT_FREE_SWAP_PAGES
+ free all swap pages allocated by
  SNAPSHOT_ALLOC_SWAP_PAGE
 
-SNAPSHOT_SET_SWAP_AREA - set the resume partition and the offset (in
+SNAPSHOT_SET_SWAP_AREA
+ set the resume partition and the offset (in
  units) from the beginning of the partition at which the swap header is
  located (the last ioctl() argument should point to a struct
  resume_swap_area, as defined in kernel/power/suspend_ioctls.h,
  containing the resume device specification and the offset); for swap
  partitions the offset is always 0, but it is different from zero for
- swap files (see Documentation/power/swsusp-and-swap-files.txt for
+ swap files (see Documentation/power/swsusp-and-swap-files.rst for
  details).
 
-SNAPSHOT_PLATFORM_SUPPORT - enable/disable the hibernation platform support,
+SNAPSHOT_PLATFORM_SUPPORT
+ enable/disable the hibernation platform support,
  depending on the argument value (enable, if the argument is nonzero)
 
-SNAPSHOT_POWER_OFF - make the kernel transition the system to the hibernation
+SNAPSHOT_POWER_OFF
+ make the kernel transition the system to the hibernation
  state (eg. ACPI S4) using the platform (eg. ACPI) driver
 
-SNAPSHOT_S2RAM - suspend to RAM; using this call causes the kernel to
+SNAPSHOT_S2RAM
+ suspend to RAM; using this call causes the kernel to
  immediately enter the suspend-to-RAM state, so this call must always
  be preceded by the SNAPSHOT_FREEZE call and it is also necessary
  to use the SNAPSHOT_UNFREEZE call after the system wakes up. This call
@@ -98,10 +115,11 @@ SNAPSHOT_S2RAM - suspend to RAM; using this call causes the kernel to
 
 The device's read() operation can be used to transfer the snapshot image from
 the kernel. It has the following limitations:
+
 - you cannot read() more than one virtual memory page at a time
 - read()s across page boundaries are impossible (ie. if you read() 1/2 of
- a page in the previous call, you will only be able to read()
- _at_ _most_ 1/2 of the page in the next call)
+ a page in the previous call, you will only be able to read()
+ **at most** 1/2 of the page in the next call)
 
 The device's write() operation is used for uploading the system memory
 snapshot into the kernel. It has the same limitations as the read() operation.
@@ -143,8 +161,10 @@ preferably using mlockall(), before calling SNAPSHOT_FREEZE.
 The suspending utility MUST check the value stored by SNAPSHOT_CREATE_IMAGE
 in the memory location pointed to by the last argument of ioctl() and proceed
 in accordance with it:
+
 1. If the value is 1 (ie. the system memory snapshot has just been
  created and the system is ready for saving it):
+
  (a) The suspending utility MUST NOT close the snapshot device
  _unless_ the whole suspend procedure is to be cancelled, in
  which case, if the snapshot image has already been saved, the
@@ -158,6 +178,7 @@ in accordance with it:
  called. However, it MAY mount a file system that was not
  mounted at that time and perform some operations on it (eg.
  use it for saving the image).
+
 2. If the value is 0 (ie. the system state has just been restored from
  the snapshot image), the suspending utility MUST close the snapshot
  device. Afterwards it will be treated as a regular userland process,
diff --git a/Documentation/power/video.txt b/Documentation/power/video.rst
similarity index 56%
rename from Documentation/power/video.txt
rename to Documentation/power/video.rst
index 3e6272bc4472..337a2ba9f32f 100644
--- a/Documentation/power/video.txt
+++ b/Documentation/power/video.rst
@@ -1,7 +1,8 @@
+===========================
+Video issues with S3 resume
+===========================
 
- Video issues with S3 resume
- ~~~~~~~~~~~~~~~~~~~~~~~~~~~
- 2003-2006, Pavel Machek
+2003-2006, Pavel Machek
 
 During S3 resume, hardware needs to be reinitialized. For most
 devices, this is easy, and kernel driver knows how to do
@@ -41,37 +42,37 @@ There are a few types of systems where video works after S3 resume:
 (1) systems where video state is preserved over S3.
 
 (2) systems where it is possible to call the video BIOS during S3
- resume. Unfortunately, it is not correct to call the video BIOS at
- that point, but it happens to work on some machines. Use
- acpi_sleep=s3_bios.
+ resume. Unfortunately, it is not correct to call the video BIOS at
+ that point, but it happens to work on some machines. Use
+ acpi_sleep=s3_bios.
 
 (3) systems that initialize video card into vga text mode and where
- the BIOS works well enough to be able to set video mode. Use
- acpi_sleep=s3_mode on these.
+ the BIOS works well enough to be able to set video mode. Use
+ acpi_sleep=s3_mode on these.
 
 (4) on some systems s3_bios kicks video into text mode, and
- acpi_sleep=s3_bios,s3_mode is needed.
+ acpi_sleep=s3_bios,s3_mode is needed.
 
 (5) radeon systems, where X can soft-boot your video card. You'll need
- a new enough X, and a plain text console (no vesafb or radeonfb). See
- http://www.doesi.gmxhome.de/linux/tm800s3/s3.html for more information.
- Alternatively, you should use vbetool (6) instead.
+ a new enough X, and a plain text console (no vesafb or radeonfb). See
+ http://www.doesi.gmxhome.de/linux/tm800s3/s3.html for more information.
+ Alternatively, you should use vbetool (6) instead.
 
 (6) other radeon systems, where vbetool is enough to bring system back
- to life. It needs text console to be working. Do vbetool vbestate
- save > /tmp/delme; echo 3 > /proc/acpi/sleep; vbetool post; vbetool
- vbestate restore < /tmp/delme; setfont , and your video
- should work.
+ to life. It needs text console to be working. Do vbetool vbestate
+ save > /tmp/delme; echo 3 > /proc/acpi/sleep; vbetool post; vbetool
+ vbestate restore < /tmp/delme; setfont , and your video
+ should work.
 
 (7) on some systems, it is possible to boot most of kernel, and then
- POSTing bios works. Ole Rohne has patch to do just that at
- http://dev.gentoo.org/~marineam/patch-radeonfb-2.6.11-rc2-mm2.
+ POSTing bios works. Ole Rohne has patch to do just that at
+ http://dev.gentoo.org/~marineam/patch-radeonfb-2.6.11-rc2-mm2.
 
-(8) on some systems, you can use the video_post utility and or
- do echo 3 > /sys/power/state && /usr/sbin/video_post - which will
- initialize the display in console mode. If you are in X, you can switch
- to a virtual terminal and back to X using CTRL+ALT+F1 - CTRL+ALT+F7 to get
- the display working in graphical mode again.
+(8) on some systems, you can use the video_post utility and or
+ do echo 3 > /sys/power/state && /usr/sbin/video_post - which will
+ initialize the display in console mode. If you are in X, you can switch
+ to a virtual terminal and back to X using CTRL+ALT+F1 - CTRL+ALT+F7 to get
+ the display working in graphical mode again.
 
 Now, if you pass acpi_sleep=something, and it does not work with your
 bios, you'll get a hard crash during resume. Be careful. Also it is
@@ -87,99 +88,126 @@ chance of working.
 Table of known working notebooks:
 
+
+=============================== ===============================================
 Model                           hack (or "how to do it")
-------------------------------------------------------------------------------
+=============================== ===============================================
 Acer Aspire 1406LC              ole's late BIOS init (7), turn off DRI
 Acer TM 230                     s3_bios (2)
 Acer TM 242FX                   vbetool (6)
 Acer TM C110                    video_post (8)
-Acer TM C300                    vga=normal (only suspend on console, not in X), vbetool (6) or video_post (8)
+Acer TM C300                    vga=normal (only suspend on console, not in X),
+                                vbetool (6) or video_post (8)
 Acer TM 4052LCi                 s3_bios (2)
 Acer TM 636Lci                  s3_bios,s3_mode (4)
-Acer TM 650 (Radeon M7)         vga=normal plus boot-radeon (5) gets text console back
-Acer TM 660                     ??? (*)
-Acer TM 800                     vga=normal, X patches, see webpage (5) or vbetool (6)
-Acer TM 803                     vga=normal, X patches, see webpage (5) or vbetool (6)
+Acer TM 650 (Radeon M7)         vga=normal plus boot-radeon (5) gets text
+                                console back
+Acer TM 660                     ??? [#f1]_
+Acer TM 800                     vga=normal, X patches, see webpage (5)
+                                or vbetool (6)
+Acer TM 803                     vga=normal, X patches, see webpage (5)
+                                or vbetool (6)
 Acer TM 803LCi                  vga=normal, vbetool (6)
 Arima W730a                     vbetool needed (6)
-Asus L2400D                     s3_mode (3)(***) (S1 also works OK)
+Asus L2400D                     s3_mode (3) [#f2]_ (S1 also works OK)
 Asus L3350M (SiS 740)           (6)
 Asus L3800C (Radeon M7)         s3_bios (2) (S1 also works OK)
-Asus M6887Ne                    vga=normal, s3_bios (2), use radeon driver instead of fglrx in x.org
+Asus M6887Ne                    vga=normal, s3_bios (2), use radeon driver
+                                instead of fglrx in x.org
 Athlon64 desktop prototype      s3_bios (2)
-Compal CL-50                    ??? (*)
+Compal CL-50                    ??? [#f1]_
 Compaq Armada E500 - P3-700     none (1) (S1 also works OK)
 Compaq Evo N620c                vga=normal, s3_bios (2)
 Dell 600m, ATI R250 Lf          none (1), but needs xorg-x11-6.8.1.902-1
 Dell D600, ATI RV250            vga=normal and X, or try vbestate (6)
-Dell D610                       vga=normal and X (possibly vbestate (6) too, but not tested)
-Dell Inspiron 4000              ??? (*)
-Dell Inspiron 500m              ??? (*)
+Dell D610                       vga=normal and X (possibly vbestate (6) too,
+                                but not tested)
+Dell Inspiron 4000              ??? [#f1]_
+Dell Inspiron 500m              ??? [#f1]_
 Dell Inspiron 510m              ???
 Dell Inspiron 5150              vbetool needed (6)
-Dell Inspiron 600m              ??? (*)
-Dell Inspiron 8200              ??? (*)
-Dell Inspiron 8500              ??? (*)
-Dell Inspiron 8600              ??? (*)
-eMachines athlon64 machines     vbetool needed (6) (someone please get me model #s)
-HP NC6000                       s3_bios, may not use radeonfb (2); or vbetool (6)
-HP NX7000                       ??? (*)
-HP Pavilion ZD7000              vbetool post needed, need open-source nv driver for X
+Dell Inspiron 600m              ??? [#f1]_
+Dell Inspiron 8200              ??? [#f1]_
+Dell Inspiron 8500              ??? [#f1]_
+Dell Inspiron 8600              ??? [#f1]_
+eMachines athlon64 machines     vbetool needed (6) (someone please get
+                                me model #s)
+HP NC6000                       s3_bios, may not use radeonfb (2);
+                                or vbetool (6)
+HP NX7000                       ??? [#f1]_
+HP Pavilion ZD7000              vbetool post needed, need open-source nv
+                                driver for X
 HP Omnibook XE3 athlon version  none (1)
 HP Omnibook XE3GC               none (1), video is S3 Savage/IX-MV
 HP Omnibook XE3L-GF             vbetool (6)
 HP Omnibook 5150                none (1), (S1 also works OK)
-IBM TP T20, model 2647-44G      none (1), video is S3 Inc. 86C270-294 Savage/IX-MV, vesafb gets "interesting" but X work.
-IBM TP A31 / Type 2652-M5G      s3_mode (3) [works ok with BIOS 1.04 2002-08-23, but not at all with BIOS 1.11 2004-11-05 :-(]
+IBM TP T20, model 2647-44G      none (1), video is S3 Inc. 86C270-294
+                                Savage/IX-MV, vesafb gets "interesting"
+                                but X work.
+IBM TP A31 / Type 2652-M5G      s3_mode (3) [works ok with
+                                BIOS 1.04 2002-08-23, but not at all with
+                                BIOS 1.11 2004-11-05 :-(]
 IBM TP R32 / Type 2658-MMG      none (1)
-IBM TP R40 2722B3G              ??? (*)
+IBM TP R40 2722B3G              ??? [#f1]_
 IBM TP R50p / Type 1832-22U     s3_bios (2)
 IBM TP R51                      none (1)
-IBM TP T30 236681A              ??? (*)
+IBM TP T30 236681A              ??? [#f1]_
 IBM TP T40 / Type 2373-MU4      none (1)
 IBM TP T40p                     none (1)
 IBM TP R40p                     s3_bios (2)
 IBM TP T41p                     s3_bios (2), switch to X after resume
 IBM TP T42                      s3_bios (2)
 IBM ThinkPad T42p (2373-GTG)    s3_bios (2)
-IBM TP X20                      ??? (*)
+IBM TP X20                      ??? [#f1]_
 IBM TP X30                      s3_bios, s3_mode (4)
-IBM TP X31 / Type 2672-XXH      none (1), use radeontool (http://fdd.com/software/radeon/) to turn off backlight.
-IBM TP X32                      none (1), but backlight is on and video is trashed after long suspend. s3_bios,s3_mode (4) works too. Perhaps that gets better results?
+IBM TP X31 / Type 2672-XXH      none (1), use radeontool
+                                (http://fdd.com/software/radeon/) to
+                                turn off backlight.
+IBM TP X32                      none (1), but backlight is on and video is
+                                trashed after long suspend. s3_bios,
+                                s3_mode (4) works too. Perhaps that gets
+                                better results?
 IBM Thinkpad X40 Type 2371-7JG  s3_bios,s3_mode (4)
-IBM TP 600e                     none(1), but a switch to console and back to X is needed
-Medion MD4220                   ??? (*)
+IBM TP 600e                     none(1), but a switch to console and
+                                back to X is needed
+Medion MD4220                   ??? [#f1]_
 Samsung P35                     vbetool needed (6)
 Sharp PC-AR10 (ATI rage)        none (1), backlight does not switch off
 Sony Vaio PCG-C1VRX/K           s3_bios (2)
-Sony Vaio PCG-F403              ??? (*)
+Sony Vaio PCG-F403              ??? [#f1]_
 Sony Vaio PCG-GRT995MP          none (1), works with 'nv' X driver
-Sony Vaio PCG-GR7/K             none (1), but needs radeonfb, use radeontool (http://fdd.com/software/radeon/) to turn off backlight.
-Sony Vaio PCG-N505SN            ??? (*)
+Sony Vaio PCG-GR7/K             none (1), but needs radeonfb, use
+                                radeontool (http://fdd.com/software/radeon/)
+                                to turn off backlight.
+Sony Vaio PCG-N505SN            ??? [#f1]_
 Sony Vaio vgn-s260              X or boot-radeon can init it (5)
-Sony Vaio vgn-S580BH            vga=normal, but suspend from X. Console will be blank unless you return to X.
+Sony Vaio vgn-S580BH            vga=normal, but suspend from X. Console will
+                                be blank unless you return to X.
 Sony Vaio vgn-FS115B            s3_bios (2),s3_mode (4)
 Toshiba Libretto L5             none (1)
 Toshiba Libretto 100CT/110CT    vbetool (6)
 Toshiba Portege 3020CT          s3_mode (3)
 Toshiba Satellite 4030CDT       s3_mode (3) (S1 also works OK)
 Toshiba Satellite 4080XCDT      s3_mode (3) (S1 also works OK)
-Toshiba Satellite 4090XCDT      ??? (*)
-Toshiba Satellite P10-554       s3_bios,s3_mode (4)(****)
+Toshiba Satellite 4090XCDT      ??? [#f1]_
+Toshiba Satellite P10-554       s3_bios,s3_mode (4)[#f3]_
 Toshiba M30                     (2) xor X with nvidia driver using internal AGP
-Uniwill 244IIO                  ??? (*)
+Uniwill 244IIO                  ??? [#f1]_
+=============================== ===============================================
 
 Known working desktop systems
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
+=================== ============================= ========================
 Mainboard           Graphics card                 hack (or "how to do it")
-------------------------------------------------------------------------------
+=================== ============================= ========================
 Asus A7V8X          nVidia RIVA TNT2 model 64     s3_bios,s3_mode (4)
+=================== ============================= ========================
 
-(*) from https://wiki.ubuntu.com/HoaryPMResults, not sure
- which options to use. If you know, please tell me.
+.. [#f1] from https://wiki.ubuntu.com/HoaryPMResults, not sure
+ which options to use. If you know, please tell me.
 
-(***) To be tested with a newer kernel.
+.. [#f2] To be tested with a newer kernel.
 
-(****) Not with SMP kernel, UP only.
+.. [#f3] Not with SMP kernel, UP only.
diff --git a/Documentation/process/submitting-drivers.rst b/Documentation/process/submitting-drivers.rst
index 58bc047e7b95..1acaa14903d6 100644
--- a/Documentation/process/submitting-drivers.rst
+++ b/Documentation/process/submitting-drivers.rst
@@ -117,7 +117,7 @@ PM support:
  implemented") error. You should also try to make sure that your
  driver uses as little power as possible when it's not doing
  anything.
  For the driver testing instructions see
- Documentation/power/drivers-testing.txt and for a relatively
+ Documentation/power/drivers-testing.rst and for a relatively
  complete overview of the power management issues related to
  drivers see :ref:`Documentation/driver-api/pm/devices.rst `.
 
diff --git a/Documentation/scheduler/sched-energy.txt b/Documentation/scheduler/sched-energy.txt
index 197d81f4b836..d97207b9accb 100644
--- a/Documentation/scheduler/sched-energy.txt
+++ b/Documentation/scheduler/sched-energy.txt
@@ -22,7 +22,7 @@ the highest.
 
 The actual EM used by EAS is _not_ maintained by the scheduler, but by a
 dedicated framework. For details about this framework and what it provides,
-please refer to its documentation (see Documentation/power/energy-model.txt).
+please refer to its documentation (see Documentation/power/energy-model.rst).
 
 
 2. Background and Terminology
@@ -81,7 +81,7 @@ through the arch_scale_cpu_capacity() callback.
 
 The rest of platform knowledge used by EAS is directly read from the Energy
 Model (EM) framework. The EM of a platform is composed of a power cost table
-per 'performance domain' in the system (see Documentation/power/energy-model.txt
+per 'performance domain' in the system (see Documentation/power/energy-model.rst
 for futher details about performance domains).
 
 The scheduler manages references to the EM objects in the topology code when the
@@ -352,7 +352,7 @@ could be amended in the future if proven otherwise.
 EAS uses the EM of a platform to estimate the impact of scheduling decisions on
 energy. So, your platform must provide power cost tables to the EM framework in
 order to make EAS start. To do so, please refer to documentation of the
-independent EM framework in Documentation/power/energy-model.txt.
+independent EM framework in Documentation/power/energy-model.rst.
 
 Please also note that the scheduling domains need to be re-built after the EM
 has been registered in order to start EAS.
diff --git a/Documentation/trace/coresight-cpu-debug.txt b/Documentation/trace/coresight-cpu-debug.txt
index f07e38094b40..1a660a39e3c0 100644
--- a/Documentation/trace/coresight-cpu-debug.txt
+++ b/Documentation/trace/coresight-cpu-debug.txt
@@ -151,7 +151,7 @@ At the runtime you can disable idle states with below methods:
 
 It is possible to disable CPU idle states by way of the PM QoS
 subsystem, more specifically by using the "/dev/cpu_dma_latency"
-interface (see Documentation/power/pm_qos_interface.txt for more
+interface (see Documentation/power/pm_qos_interface.rst for more
 details). As specified in the PM QoS documentation the requested
 parameter will stay in effect until the file descriptor is released.
 For example:
diff --git a/Documentation/translations/zh_CN/process/submitting-drivers.rst b/Documentation/translations/zh_CN/process/submitting-drivers.rst
index 72c6cd935821..f1c3906c69a8 100644
--- a/Documentation/translations/zh_CN/process/submitting-drivers.rst
+++ b/Documentation/translations/zh_CN/process/submitting-drivers.rst
@@ -97,7 +97,7 @@ Linux 2.6:
  函数定义成返回 -ENOSYS(功能未实现)错误。你还应该尝试确
  保你的驱动在什么都不干的情况下将耗电降到最低。要获得驱动
  程序测试的指导,请参阅
- Documentation/power/drivers-testing.txt。有关驱动程序电
+ Documentation/power/drivers-testing.rst。有关驱动程序电
  源管理问题相对全面的概述,请参阅
  Documentation/driver-api/pm/devices.rst。
 
diff --git a/MAINTAINERS b/MAINTAINERS
index 3dd988519a11..518b73924d7e 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -6393,7 +6393,7 @@ M: "Rafael J. Wysocki"
 M: Pavel Machek
 L: linux-pm@vger.kernel.org
 S: Supported
-F: Documentation/power/freezing-of-tasks.txt
+F: Documentation/power/freezing-of-tasks.rst
 F: include/linux/freezer.h
 F: kernel/freezer.c
 
@@ -11675,7 +11675,7 @@ S: Maintained
 T: git git://git.kernel.org/pub/scm/linux/kernel/git/vireshk/pm.git
 F: drivers/opp/
 F: include/linux/pm_opp.h
-F: Documentation/power/opp.txt
+F: Documentation/power/opp.rst
 F: Documentation/devicetree/bindings/opp/
 
 OPL4 DRIVER
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index bd8dea466b04..bf8cb068acf8 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -2448,7 +2448,7 @@ menuconfig APM
  machines with more than one CPU.
 
  In order to use APM, you will need supporting software. For location
- and more information, read
+ and more information, read
  and the Battery Powered Linux mini-HOWTO, available from
  .
 
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 066fd2a12851..10d040e2e807 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -1175,7 +1175,7 @@ struct skl_wm_params {
  * to be disabled. This shouldn't happen and we'll print some error messages in
  * case it happens.
  *
- * For more, read the Documentation/power/runtime_pm.txt.
+ * For more, read the Documentation/power/runtime_pm.rst.
  */
 struct i915_runtime_pm {
  atomic_t wakeref_count;
diff --git a/drivers/opp/Kconfig b/drivers/opp/Kconfig
index a7fbb93f302c..1f64a3d46c8a 100644
--- a/drivers/opp/Kconfig
+++ b/drivers/opp/Kconfig
@@ -10,4 +10,4 @@ config PM_OPP
  OPP layer organizes the data internally using device pointers
  representing individual voltage domains and provides SOC
  implementations a ready to use framework to manage OPPs.
- For more information, read
+ For more information, read
diff --git a/drivers/power/supply/power_supply_core.c b/drivers/power/supply/power_supply_core.c
index 874495c6face..1ae561180051 100644
--- a/drivers/power/supply/power_supply_core.c
+++ b/drivers/power/supply/power_supply_core.c
@@ -607,7 +607,7 @@ int power_supply_get_battery_info(struct power_supply *psy,
 
  /* The property and field names below must correspond to elements
  * in enum power_supply_property. For reasoning, see
- * Documentation/power/power_supply_class.txt.
+ * Documentation/power/power_supply_class.rst.
  */
 
  of_property_read_u32(battery_np, "energy-full-design-microwatt-hours",
diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index c7eef32e7739..5b8328a99b2a 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -52,7 +52,7 @@
  * irq line disabled until the threaded handler has been run.
  * IRQF_NO_SUSPEND - Do not disable this IRQ during suspend. Does not guarantee
  * that this interrupt will wake the system from a suspended
- * state. See Documentation/power/suspend-and-interrupts.txt
+ * state. See Documentation/power/suspend-and-interrupts.rst
  * IRQF_FORCE_RESUME - Force enable it on resume even if IRQF_NO_SUSPEND is set
  * IRQF_NO_THREAD - Interrupt cannot be threaded
  * IRQF_EARLY_RESUME - Resume IRQ early during syscore instead of at device
diff --git a/include/linux/pm.h b/include/linux/pm.h
index 66c19a65a514..c14ad8bc1a41 100644
--- a/include/linux/pm.h
+++ b/include/linux/pm.h
@@ -284,7 +284,7 @@ typedef struct pm_message {
  * actions to be performed by a device driver's callbacks generally depend on
  * the platform and subsystem the device belongs to.
  *
- * Refer to Documentation/power/runtime_pm.txt for more information about the
+ * Refer to Documentation/power/runtime_pm.rst for more information about the
  * role of the @runtime_suspend(), @runtime_resume() and @runtime_idle()
  * callbacks in device runtime power management.
  */
diff --git a/kernel/power/Kconfig b/kernel/power/Kconfig
index f8fe57d1022e..45b502a1748e 100644
--- a/kernel/power/Kconfig
+++ b/kernel/power/Kconfig
@@ -65,7 +65,7 @@ config HIBERNATION
  need to run mkswap against the swap partition used for the suspend.
 
  It also works with swap files to a limited extent (for details see
- ).
+ ).
 
  Right now you may boot without resuming and resume later but in the
  meantime you cannot use the swap partition(s)/file(s) involved in
@@ -74,7 +74,7 @@ config HIBERNATION
  MOUNT any journaled filesystems mounted before the suspend or they
  will get corrupted in a nasty way.
 
- For more information take a look at .
+ For more information take a look at .
 
 config ARCH_SAVE_PAGE_KEYS
  bool
@@ -246,7 +246,7 @@ config APM_EMULATION
  notification of APM "events" (e.g. battery status change).
 
  In order to use APM, you will need supporting software. For location
- and more information, read
+ and more information, read
  and the Battery Powered Linux mini-HOWTO, available from
  .
 
diff --git a/net/wireless/Kconfig b/net/wireless/Kconfig
index 41722046b937..0cd26289bfbc 100644
--- a/net/wireless/Kconfig
+++ b/net/wireless/Kconfig
@@ -165,7 +165,7 @@ config CFG80211_DEFAULT_PS
 
  If this causes your applications to misbehave you should fix your
  applications instead -- they need to register their network
- latency requirement, see Documentation/power/pm_qos_interface.txt.
+ latency requirement, see Documentation/power/pm_qos_interface.rst.
 config CFG80211_DEBUGFS
  bool "cfg80211 DebugFS entries"

From patchwork Mon Apr 22 13:27:17 2019
X-Patchwork-Submitter: Mauro Carvalho Chehab
X-Patchwork-Id: 1088710
From: Mauro Carvalho Chehab
To: Linux Doc Mailing List
Cc: Mauro Carvalho Chehab, Mauro Carvalho Chehab, linux-kernel@vger.kernel.org, Jonathan Corbet, Richard Cochran, "David S. Miller", netdev@vger.kernel.org
Subject: [PATCH v2 28/79] docs: ptp.txt: convert to ReST and move to driver-api
Date: Mon, 22 Apr 2019 10:27:17 -0300
Message-Id: <21d7862dfaaa7a587469aacf7b8a2e55cf705bfd.1555938376.git.mchehab+samsung@kernel.org>

The conversion is trivial: just adjust title markups.

In order to avoid conflicts, let's add an :orphan: tag to it, to be removed when this file gets added to the driver-api book.

Signed-off-by: Mauro Carvalho Chehab
Acked-by: Richard Cochran
---
 .../{ptp/ptp.txt => driver-api/ptp.rst}   | 26 +++++++++++++------
 Documentation/networking/timestamping.txt |  2 +-
 MAINTAINERS                               |  2 +-
 3 files changed, 20 insertions(+), 10 deletions(-)
 rename Documentation/{ptp/ptp.txt => driver-api/ptp.rst} (88%)

diff --git a/Documentation/ptp/ptp.txt b/Documentation/driver-api/ptp.rst
similarity index 88%
rename from Documentation/ptp/ptp.txt
rename to Documentation/driver-api/ptp.rst
index 11e904ee073f..b6e65d66d37a 100644
--- a/Documentation/ptp/ptp.txt
+++ b/Documentation/driver-api/ptp.rst
@@ -1,5 +1,8 @@
+:orphan:
 
-* PTP hardware clock infrastructure for Linux
+===========================================
+PTP hardware clock infrastructure for Linux
+===========================================
 
    This patch set introduces support for IEEE 1588 PTP clocks in
    Linux. Together with the SO_TIMESTAMPING socket options, this
@@ -22,7 +25,8 @@
      - Period output signals configurable from user space
      - Synchronization of the Linux system time via the PPS subsystem
 
-** PTP hardware clock kernel API
+PTP hardware clock kernel API
+=============================
 
    A PTP clock driver registers itself with the class driver. The
    class driver handles all of the dealings with user space. The
@@ -36,7 +40,8 @@
    development, it can be useful to have more than one clock in a
    single system, in order to allow performance comparisons.
 
-** PTP hardware clock user space API
+PTP hardware clock user space API
+=================================
 
    The class driver also creates a character device for each
    registered clock. User space can use an open file descriptor from
@@ -49,7 +54,8 @@
    ancillary clock features. User space can receive time stamped
    events via blocking read() and poll().
 
-** Writing clock drivers
+Writing clock drivers
+=====================
 
    Clock drivers include include/linux/ptp_clock_kernel.h and register
    themselves by presenting a 'struct ptp_clock_info' to the
@@ -66,14 +72,17 @@
    class driver, since the lock may also be needed by the clock
    driver's interrupt service routine.
 
-** Supported hardware
+Supported hardware
+==================
+
+   * Freescale eTSEC gianfar
 
-   + Freescale eTSEC gianfar
      - 2 Time stamp external triggers, programmable polarity (opt. interrupt)
      - 2 Alarm registers (optional interrupt)
      - 3 Periodic signals (optional interrupt)
 
-   + National DP83640
+   * National DP83640
+
      - 6 GPIOs programmable as inputs or outputs
      - 6 GPIOs with dedicated functions (LED/JTAG/clock) can also be
        used as general inputs or outputs
@@ -81,6 +90,7 @@
      - GPIO outputs can produce periodic signals
      - 1 interrupt pin
 
-   + Intel IXP465
+   * Intel IXP465
+
      - Auxiliary Slave/Master Mode Snapshot (optional interrupt)
      - Target Time (optional interrupt)
diff --git a/Documentation/networking/timestamping.txt b/Documentation/networking/timestamping.txt
index bbdaf8990031..8dd6333c3270 100644
--- a/Documentation/networking/timestamping.txt
+++ b/Documentation/networking/timestamping.txt
@@ -368,7 +368,7 @@ ts[1] used to hold hardware timestamps converted to system time.
 Instead, expose the hardware clock device on the NIC directly as
 a HW PTP clock source, to allow time conversion in userspace and
 optionally synchronize system time with a userspace PTP stack such
-as linuxptp. For the PTP clock API, see Documentation/ptp/ptp.txt.
+as linuxptp. For the PTP clock API, see Documentation/driver-api/ptp.rst.
Note that if the SO_TIMESTAMP or SO_TIMESTAMPNS option is enabled together with SO_TIMESTAMPING using SOF_TIMESTAMPING_SOFTWARE, a false diff --git a/MAINTAINERS b/MAINTAINERS index be3d80397956..e4c26dc67668 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -12649,7 +12649,7 @@ L: netdev@vger.kernel.org S: Maintained W: http://linuxptp.sourceforge.net/ F: Documentation/ABI/testing/sysfs-ptp -F: Documentation/ptp/* +F: Documentation/driver-api/ptp.rst F: drivers/net/phy/dp83640* F: drivers/ptp/* F: include/linux/ptp_cl* From patchwork Mon Apr 22 13:27:37 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Mauro Carvalho Chehab X-Patchwork-Id: 1088694 X-Patchwork-Delegate: davem@davemloft.net Return-Path: X-Original-To: patchwork-incoming-netdev@ozlabs.org Delivered-To: patchwork-incoming-netdev@ozlabs.org Authentication-Results: ozlabs.org; spf=none (mailfrom) smtp.mailfrom=vger.kernel.org (client-ip=209.132.180.67; helo=vger.kernel.org; envelope-from=netdev-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dmarc=fail (p=none dis=none) header.from=kernel.org Authentication-Results: ozlabs.org; dkim=fail reason="signature verification failed" (2048-bit key; unprotected) header.d=infradead.org header.i=@infradead.org header.b="g9hcvzJI"; dkim-atps=neutral Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by ozlabs.org (Postfix) with ESMTP id 44nnWT2Nx4z9s9G for ; Mon, 22 Apr 2019 23:30:13 +1000 (AEST) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727360AbfDVNaH (ORCPT ); Mon, 22 Apr 2019 09:30:07 -0400 Received: from bombadil.infradead.org ([198.137.202.133]:38078 "EHLO bombadil.infradead.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727616AbfDVN2Z (ORCPT ); Mon, 22 Apr 2019 09:28:25 -0400 DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=infradead.org; s=bombadil.20170209; 
h=Sender:Content-Transfer-Encoding: MIME-Version:References:In-Reply-To:Message-Id:Date:Subject:Cc:To:From: Reply-To:Content-Type:Content-ID:Content-Description:Resent-Date:Resent-From: Resent-Sender:Resent-To:Resent-Cc:Resent-Message-ID:List-Id:List-Help: List-Unsubscribe:List-Subscribe:List-Post:List-Owner:List-Archive; bh=ip8JfyX00rP9YC3LIDu6YBR9Wqjk6kfz4ExxCPaRC/I=; b=g9hcvzJINA/yOgvqekTkFsI4B8 cJmMXuUQZFf2HmBzmePChmvNBJXJr5E5AvfeObKyt/wLWuYkfKQ1x7y2Oox9Gg2mZIfqTlvJPKPFN 4euLKUWStV97iUtGoJb2NEMe4Acqj00yve9F1Qxc8uMM5brVTRrtXAQMLrT+1w7BxOGqmMSI+5+2L nXtH2PMsJO6qG7ClpwiPyTvjlV78zjQ4pN5gR7vDjIHKkKwVTx+l4WAV0/72KKz5BdBiV3ZYfN7QM Kn9lk5mwJnX+fmFqsJCHMCTnBNNRqhowUeIqrj/2BUpgtUy2YEwjBiIVCcLX/QR3iLjZSpUy3tFfU 1vWJ5RLA==; Received: from 179.176.125.229.dynamic.adsl.gvt.net.br ([179.176.125.229] helo=bombadil.infradead.org) by bombadil.infradead.org with esmtpsa (Exim 4.90_1 #2 (Red Hat Linux)) id 1hIYza-0005I1-Sf; Mon, 22 Apr 2019 13:28:19 +0000 Received: from mchehab by bombadil.infradead.org with local (Exim 4.92) (envelope-from ) id 1hIYzT-0005nN-Mh; Mon, 22 Apr 2019 10:28:11 -0300 From: Mauro Carvalho Chehab To: Linux Doc Mailing List Cc: Mauro Carvalho Chehab , Mauro Carvalho Chehab , linux-kernel@vger.kernel.org, Jonathan Corbet , Linus Walleij , Bartosz Golaszewski , Jean Delvare , Guenter Roeck , Greg Kroah-Hartman , "Rafael J. Wysocki" , Jeff Kirsher , "David S. 
Miller" , Julia Lawall , Gilles Muller , Nicolas Palix , Michal Marek , linux-gpio@vger.kernel.org, linux-hwmon@vger.kernel.org, intel-wired-lan@lists.osuosl.org, netdev@vger.kernel.org, cocci@systeme.lip6.fr
Subject: [PATCH v2 48/79] docs: driver-model: convert docs to ReST and rename to *.rst
Date: Mon, 22 Apr 2019 10:27:37 -0300
Message-Id:
X-Mailer: git-send-email 2.20.1
In-Reply-To:
References:
MIME-Version: 1.0
Sender: netdev-owner@vger.kernel.org
Precedence: bulk
List-ID:
X-Mailing-List: netdev@vger.kernel.org

Convert the various documents at the driver-model, preparing
them to be part of the driver-api book.

The conversion is actually:
  - add blank lines and indentation in order to identify paragraphs;
  - fix tables markups;
  - add some lists markups;
  - mark literal blocks;
  - adjust title markups.

At its new index.rst, let's add a :orphan: while this is not linked to
the main index.rst file, in order to avoid build warnings.

Signed-off-by: Mauro Carvalho Chehab
Acked-by: Julia Lawall
Acked-by: Guenter Roeck
---
 Documentation/driver-api/gpio/driver.rst | 2 +-
 .../driver-model/{binding.txt => binding.rst} | 20 +-
 .../driver-model/{bus.txt => bus.rst} | 69 ++--
 .../driver-model/{class.txt => class.rst} | 74 ++--
 ...esign-patterns.txt => design-patterns.rst} | 106 +++---
 .../driver-model/{device.txt => device.rst} | 57 +--
 .../driver-model/{devres.txt => devres.rst} | 50 +--
 .../driver-model/{driver.txt => driver.rst} | 112 +++---
 Documentation/driver-model/index.rst | 26 ++
 .../{overview.txt => overview.rst} | 37 +-
 .../{platform.txt => platform.rst} | 30 +-
 .../driver-model/{porting.txt => porting.rst} | 333 +++++++++---------
 Documentation/eisa.txt | 4 +-
 Documentation/hwmon/submitting-patches.rst | 2 +-
 drivers/base/platform.c | 2 +-
 drivers/gpio/gpio-cs5535.c | 2 +-
 drivers/net/ethernet/intel/ice/ice_main.c | 2 +-
 scripts/coccinelle/free/devm_free.cocci | 2 +-
 18 files changed, 489 insertions(+), 441 deletions(-)
 rename Documentation/driver-model/{binding.txt
=> binding.rst} (92%) rename Documentation/driver-model/{bus.txt => bus.rst} (76%) rename Documentation/driver-model/{class.txt => class.rst} (75%) rename Documentation/driver-model/{design-patterns.txt => design-patterns.rst} (59%) rename Documentation/driver-model/{device.txt => device.rst} (71%) rename Documentation/driver-model/{devres.txt => devres.rst} (93%) rename Documentation/driver-model/{driver.txt => driver.rst} (75%) create mode 100644 Documentation/driver-model/index.rst rename Documentation/driver-model/{overview.txt => overview.rst} (90%) rename Documentation/driver-model/{platform.txt => platform.rst} (95%) rename Documentation/driver-model/{porting.txt => porting.rst} (62%) diff --git a/Documentation/driver-api/gpio/driver.rst b/Documentation/driver-api/gpio/driver.rst index 3043167fc557..0eb083baab9e 100644 --- a/Documentation/driver-api/gpio/driver.rst +++ b/Documentation/driver-api/gpio/driver.rst @@ -303,7 +303,7 @@ symbol: the struct gpio_chip* for the chip to all IRQ callbacks, so the callbacks need to embed the gpio_chip in its state container and obtain a pointer to the container using container_of(). - (See Documentation/driver-model/design-patterns.txt) + (See Documentation/driver-model/design-patterns.rst) * gpiochip_irqchip_add_nested(): adds a nested irqchip to a gpiochip. Apart from that it works exactly like the chained irqchip. diff --git a/Documentation/driver-model/binding.txt b/Documentation/driver-model/binding.rst similarity index 92% rename from Documentation/driver-model/binding.txt rename to Documentation/driver-model/binding.rst index abfc8e290d53..7ea1d7a41e1d 100644 --- a/Documentation/driver-model/binding.txt +++ b/Documentation/driver-model/binding.rst @@ -1,5 +1,6 @@ - +============== Driver Binding +============== Driver binding is the process of associating a device with a device driver that can control it. 
Bus drivers have typically handled this @@ -25,7 +26,7 @@ device_register When a new device is added, the bus's list of drivers is iterated over to find one that supports it. In order to determine that, the device ID of the device must match one of the device IDs that the driver -supports. The format and semantics for comparing IDs is bus-specific. +supports. The format and semantics for comparing IDs is bus-specific. Instead of trying to derive a complex state machine and matching algorithm, it is up to the bus driver to provide a callback to compare a device against the IDs of a driver. The bus returns 1 if a match was @@ -36,14 +37,14 @@ int match(struct device * dev, struct device_driver * drv); If a match is found, the device's driver field is set to the driver and the driver's probe callback is called. This gives the driver a chance to verify that it really does support the hardware, and that -it's in a working state. +it's in a working state. Device Class ~~~~~~~~~~~~ Upon the successful completion of probe, the device is registered with the class to which it belongs. Device drivers belong to one and only one -class, and that is set in the driver's devclass field. +class, and that is set in the driver's devclass field. devclass_add_device is called to enumerate the device within the class and actually register it with the class, which happens with the class's register_dev callback. @@ -53,7 +54,7 @@ Driver ~~~~~~ When a driver is attached to a device, the device is inserted into the -driver's list of devices. +driver's list of devices. sysfs @@ -67,18 +68,18 @@ to the device's directory in the physical hierarchy. A directory for the device is created in the class's directory. A symlink is created in that directory that points to the device's -physical location in the sysfs tree. +physical location in the sysfs tree. 
A symlink can be created (though this isn't done yet) in the device's physical directory to either its class directory, or the class's top-level directory. One can also be created to point to its driver's -directory also. +directory also. driver_register ~~~~~~~~~~~~~~~ -The process is almost identical for when a new driver is added. +The process is almost identical for when a new driver is added. The bus's list of devices is iterated over to find a match. Devices that already have a driver are skipped. All the devices are iterated over, to bind as many devices as possible to the driver. @@ -94,5 +95,4 @@ of the driver is decremented. All symlinks between the two are removed. When a driver is removed, the list of devices that it supports is iterated over, and the driver's remove callback is called for each -one. The device is removed from that list and the symlinks removed. - +one. The device is removed from that list and the symlinks removed. diff --git a/Documentation/driver-model/bus.txt b/Documentation/driver-model/bus.rst similarity index 76% rename from Documentation/driver-model/bus.txt rename to Documentation/driver-model/bus.rst index c247b488a567..016b15a6e8ea 100644 --- a/Documentation/driver-model/bus.txt +++ b/Documentation/driver-model/bus.rst @@ -1,5 +1,6 @@ - -Bus Types +========= +Bus Types +========= Definition ~~~~~~~~~~ @@ -13,12 +14,12 @@ Declaration Each bus type in the kernel (PCI, USB, etc) should declare one static object of this type. They must initialize the name field, and may -optionally initialize the match callback. +optionally initialize the match callback:: -struct bus_type pci_bus_type = { - .name = "pci", - .match = pci_bus_match, -}; + struct bus_type pci_bus_type = { + .name = "pci", + .match = pci_bus_match, + }; The structure should be exported to drivers in a header file: @@ -30,8 +31,8 @@ Registration When a bus driver is initialized, it calls bus_register. 
This initializes the rest of the fields in the bus object and inserts it -into a global list of bus types. Once the bus object is registered, -the fields in it are usable by the bus driver. +into a global list of bus types. Once the bus object is registered, +the fields in it are usable by the bus driver. Callbacks @@ -43,17 +44,17 @@ match(): Attaching Drivers to Devices The format of device ID structures and the semantics for comparing them are inherently bus-specific. Drivers typically declare an array of device IDs of devices they support that reside in a bus-specific -driver structure. +driver structure. The purpose of the match callback is to give the bus an opportunity to determine if a particular driver supports a particular device by comparing the device IDs the driver supports with the device ID of a particular device, without sacrificing bus-specific functionality or -type-safety. +type-safety. When a driver is registered with the bus, the bus's list of devices is iterated over, and the match callback is called for each device that -does not have a driver associated with it. +does not have a driver associated with it. @@ -64,22 +65,23 @@ The lists of devices and drivers are intended to replace the local lists that many buses keep. They are lists of struct devices and struct device_drivers, respectively. Bus drivers are free to use the lists as they please, but conversion to the bus-specific type may be -necessary. +necessary. -The LDM core provides helper functions for iterating over each list. 
+The LDM core provides helper functions for iterating over each list:: -int bus_for_each_dev(struct bus_type * bus, struct device * start, void * data, - int (*fn)(struct device *, void *)); + int bus_for_each_dev(struct bus_type * bus, struct device * start, + void * data, + int (*fn)(struct device *, void *)); -int bus_for_each_drv(struct bus_type * bus, struct device_driver * start, - void * data, int (*fn)(struct device_driver *, void *)); + int bus_for_each_drv(struct bus_type * bus, struct device_driver * start, + void * data, int (*fn)(struct device_driver *, void *)); These helpers iterate over the respective list, and call the callback for each device or driver in the list. All list accesses are synchronized by taking the bus's lock (read currently). The reference count on each object in the list is incremented before the callback is called; it is decremented after the next object has been obtained. The -lock is not held when calling the callback. +lock is not held when calling the callback. sysfs @@ -87,14 +89,14 @@ sysfs There is a top-level directory named 'bus'. Each bus gets a directory in the bus directory, along with two default -directories: +directories:: /sys/bus/pci/ |-- devices `-- drivers Drivers registered with the bus get a directory in the bus's drivers -directory: +directory:: /sys/bus/pci/ |-- devices @@ -106,7 +108,7 @@ directory: Each device that is discovered on a bus of that type gets a symlink in the bus's devices directory to the device's directory in the physical -hierarchy: +hierarchy:: /sys/bus/pci/ |-- devices @@ -118,26 +120,27 @@ hierarchy: Exporting Attributes ~~~~~~~~~~~~~~~~~~~~ -struct bus_attribute { + +:: + + struct bus_attribute { struct attribute attr; ssize_t (*show)(struct bus_type *, char * buf); ssize_t (*store)(struct bus_type *, const char * buf, size_t count); -}; + }; Bus drivers can export attributes using the BUS_ATTR_RW macro that works similarly to the DEVICE_ATTR_RW macro for devices. 
For example, a -definition like this: +definition like this:: -static BUS_ATTR_RW(debug); + static BUS_ATTR_RW(debug); -is equivalent to declaring: +is equivalent to declaring:: -static bus_attribute bus_attr_debug; + static bus_attribute bus_attr_debug; This can then be used to add and remove the attribute from the bus's -sysfs directory using: - -int bus_create_file(struct bus_type *, struct bus_attribute *); -void bus_remove_file(struct bus_type *, struct bus_attribute *); - +sysfs directory using:: + int bus_create_file(struct bus_type *, struct bus_attribute *); + void bus_remove_file(struct bus_type *, struct bus_attribute *); diff --git a/Documentation/driver-model/class.txt b/Documentation/driver-model/class.rst similarity index 75% rename from Documentation/driver-model/class.txt rename to Documentation/driver-model/class.rst index 1fefc480a80b..fff55b80e86a 100644 --- a/Documentation/driver-model/class.txt +++ b/Documentation/driver-model/class.rst @@ -1,6 +1,6 @@ - +============== Device Classes - +============== Introduction ~~~~~~~~~~~~ @@ -13,37 +13,37 @@ device. The following device classes have been identified: Each device class defines a set of semantics and a programming interface that devices of that class adhere to. Device drivers are the implementation of that programming interface for a particular device on -a particular bus. +a particular bus. Device classes are agnostic with respect to what bus a device resides -on. +on. Programming Interface ~~~~~~~~~~~~~~~~~~~~~ -The device class structure looks like: +The device class structure looks like:: -typedef int (*devclass_add)(struct device *); -typedef void (*devclass_remove)(struct device *); + typedef int (*devclass_add)(struct device *); + typedef void (*devclass_remove)(struct device *); See the kerneldoc for the struct class. 
-A typical device class definition would look like: +A typical device class definition would look like:: -struct device_class input_devclass = { + struct device_class input_devclass = { .name = "input", .add_device = input_add_device, .remove_device = input_remove_device, -}; + }; Each device class structure should be exported in a header file so it can be used by drivers, extensions and interfaces. -Device classes are registered and unregistered with the core using: +Device classes are registered and unregistered with the core using:: -int devclass_register(struct device_class * cls); -void devclass_unregister(struct device_class * cls); + int devclass_register(struct device_class * cls); + void devclass_unregister(struct device_class * cls); Devices @@ -52,16 +52,16 @@ As devices are bound to drivers, they are added to the device class that the driver belongs to. Before the driver model core, this would typically happen during the driver's probe() callback, once the device has been initialized. It now happens after the probe() callback -finishes from the core. +finishes from the core. The device is enumerated in the class. Each time a device is added to the class, the class's devnum field is incremented and assigned to the device. The field is never decremented, so if the device is removed from the class and re-added, it will receive a different enumerated -value. +value. The class is allowed to create a class-specific structure for the -device and store it in the device's class_data pointer. +device and store it in the device's class_data pointer. There is no list of devices in the device class. Each driver has a list of devices that it supports. The device class has a list of @@ -73,15 +73,15 @@ Device Drivers ~~~~~~~~~~~~~~ Device drivers are added to device classes when they are registered with the core. A driver specifies the class it belongs to by setting -the struct device_driver::devclass field. +the struct device_driver::devclass field. 
sysfs directory structure ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -There is a top-level sysfs directory named 'class'. +There is a top-level sysfs directory named 'class'. Each class gets a directory in the class directory, along with two -default subdirectories: +default subdirectories:: class/ `-- input @@ -89,8 +89,8 @@ default subdirectories: `-- drivers -Drivers registered with the class get a symlink in the drivers/ directory -that points to the driver's directory (under its bus directory): +Drivers registered with the class get a symlink in the drivers/ directory +that points to the driver's directory (under its bus directory):: class/ `-- input @@ -99,8 +99,8 @@ that points to the driver's directory (under its bus directory): `-- usb:usb_mouse -> ../../../bus/drivers/usb_mouse/ -Each device gets a symlink in the devices/ directory that points to the -device's directory in the physical hierarchy: +Each device gets a symlink in the devices/ directory that points to the +device's directory in the physical hierarchy:: class/ `-- input @@ -111,37 +111,39 @@ device's directory in the physical hierarchy: Exporting Attributes ~~~~~~~~~~~~~~~~~~~~ -struct devclass_attribute { + +:: + + struct devclass_attribute { struct attribute attr; ssize_t (*show)(struct device_class *, char * buf, size_t count, loff_t off); ssize_t (*store)(struct device_class *, const char * buf, size_t count, loff_t off); -}; + }; Class drivers can export attributes using the DEVCLASS_ATTR macro that works -similarly to the DEVICE_ATTR macro for devices. For example, a definition -like this: +similarly to the DEVICE_ATTR macro for devices. 
For example, a definition +like this:: -static DEVCLASS_ATTR(debug,0644,show_debug,store_debug); + static DEVCLASS_ATTR(debug,0644,show_debug,store_debug); -is equivalent to declaring: +is equivalent to declaring:: -static devclass_attribute devclass_attr_debug; + static devclass_attribute devclass_attr_debug; The bus driver can add and remove the attribute from the class's -sysfs directory using: +sysfs directory using:: -int devclass_create_file(struct device_class *, struct devclass_attribute *); -void devclass_remove_file(struct device_class *, struct devclass_attribute *); + int devclass_create_file(struct device_class *, struct devclass_attribute *); + void devclass_remove_file(struct device_class *, struct devclass_attribute *); In the example above, the file will be named 'debug' in placed in the -class's directory in sysfs. +class's directory in sysfs. Interfaces ~~~~~~~~~~ There may exist multiple mechanisms for accessing the same device of a -particular class type. Device interfaces describe these mechanisms. +particular class type. Device interfaces describe these mechanisms. When a device is added to a device class, the core attempts to add it to every interface that is registered with the device class. - diff --git a/Documentation/driver-model/design-patterns.txt b/Documentation/driver-model/design-patterns.rst similarity index 59% rename from Documentation/driver-model/design-patterns.txt rename to Documentation/driver-model/design-patterns.rst index ba7b2df64904..41eb8f41f7dd 100644 --- a/Documentation/driver-model/design-patterns.txt +++ b/Documentation/driver-model/design-patterns.rst @@ -1,6 +1,6 @@ - +============================= Device Driver Design Patterns -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +============================= This document describes a few common design patterns found in device drivers. 
It is likely that subsystem maintainers will ask driver developers to @@ -19,23 +19,23 @@ that the device the driver binds to will appear in several instances. This means that the probe() function and all callbacks need to be reentrant. The most common way to achieve this is to use the state container design -pattern. It usually has this form: +pattern. It usually has this form:: -struct foo { - spinlock_t lock; /* Example member */ - (...) -}; + struct foo { + spinlock_t lock; /* Example member */ + (...) + }; -static int foo_probe(...) -{ - struct foo *foo; + static int foo_probe(...) + { + struct foo *foo; - foo = devm_kzalloc(dev, sizeof(*foo), GFP_KERNEL); - if (!foo) - return -ENOMEM; - spin_lock_init(&foo->lock); - (...) -} + foo = devm_kzalloc(dev, sizeof(*foo), GFP_KERNEL); + if (!foo) + return -ENOMEM; + spin_lock_init(&foo->lock); + (...) + } This will create an instance of struct foo in memory every time probe() is called. This is our state container for this instance of the device driver. @@ -43,21 +43,21 @@ Of course it is then necessary to always pass this instance of the state around to all functions that need access to the state and its members. For example, if the driver is registering an interrupt handler, you would -pass around a pointer to struct foo like this: +pass around a pointer to struct foo like this:: -static irqreturn_t foo_handler(int irq, void *arg) -{ - struct foo *foo = arg; - (...) -} + static irqreturn_t foo_handler(int irq, void *arg) + { + struct foo *foo = arg; + (...) + } -static int foo_probe(...) -{ - struct foo *foo; + static int foo_probe(...) + { + struct foo *foo; - (...) - ret = request_irq(irq, foo_handler, 0, "foo", foo); -} + (...) + ret = request_irq(irq, foo_handler, 0, "foo", foo); + } This way you always get a pointer back to the correct instance of foo in your interrupt handler. @@ -66,38 +66,38 @@ your interrupt handler. 2. 
container_of() ~~~~~~~~~~~~~~~~~ -Continuing on the above example we add an offloaded work: +Continuing on the above example we add an offloaded work:: -struct foo { - spinlock_t lock; - struct workqueue_struct *wq; - struct work_struct offload; - (...) -}; + struct foo { + spinlock_t lock; + struct workqueue_struct *wq; + struct work_struct offload; + (...) + }; -static void foo_work(struct work_struct *work) -{ - struct foo *foo = container_of(work, struct foo, offload); + static void foo_work(struct work_struct *work) + { + struct foo *foo = container_of(work, struct foo, offload); - (...) -} + (...) + } -static irqreturn_t foo_handler(int irq, void *arg) -{ - struct foo *foo = arg; + static irqreturn_t foo_handler(int irq, void *arg) + { + struct foo *foo = arg; - queue_work(foo->wq, &foo->offload); - (...) -} + queue_work(foo->wq, &foo->offload); + (...) + } -static int foo_probe(...) -{ - struct foo *foo; + static int foo_probe(...) + { + struct foo *foo; - foo->wq = create_singlethread_workqueue("foo-wq"); - INIT_WORK(&foo->offload, foo_work); - (...) -} + foo->wq = create_singlethread_workqueue("foo-wq"); + INIT_WORK(&foo->offload, foo_work); + (...) + } The design pattern is the same for an hrtimer or something similar that will return a single argument which is a pointer to a struct member in the diff --git a/Documentation/driver-model/device.txt b/Documentation/driver-model/device.rst similarity index 71% rename from Documentation/driver-model/device.txt rename to Documentation/driver-model/device.rst index 2403eb856187..2b868d49d349 100644 --- a/Documentation/driver-model/device.txt +++ b/Documentation/driver-model/device.rst @@ -1,6 +1,6 @@ - +========================== The Basic Device Structure -~~~~~~~~~~~~~~~~~~~~~~~~~~ +========================== See the kerneldoc for the struct device. @@ -8,9 +8,9 @@ See the kerneldoc for the struct device. 
Programming Interface ~~~~~~~~~~~~~~~~~~~~~ The bus driver that discovers the device uses this to register the -device with the core: +device with the core:: -int device_register(struct device * dev); + int device_register(struct device * dev); The bus should initialize the following fields: @@ -20,30 +20,33 @@ The bus should initialize the following fields: - bus A device is removed from the core when its reference count goes to -0. The reference count can be adjusted using: +0. The reference count can be adjusted using:: -struct device * get_device(struct device * dev); -void put_device(struct device * dev); + struct device * get_device(struct device * dev); + void put_device(struct device * dev); get_device() will return a pointer to the struct device passed to it if the reference is not already 0 (if it's in the process of being removed already). -A driver can access the lock in the device structure using: +A driver can access the lock in the device structure using:: -void lock_device(struct device * dev); -void unlock_device(struct device * dev); + void lock_device(struct device * dev); + void unlock_device(struct device * dev); Attributes ~~~~~~~~~~ -struct device_attribute { + +:: + + struct device_attribute { struct attribute attr; ssize_t (*show)(struct device *dev, struct device_attribute *attr, char *buf); ssize_t (*store)(struct device *dev, struct device_attribute *attr, const char *buf, size_t count); -}; + }; Attributes of devices can be exported by a device driver through sysfs. @@ -54,39 +57,39 @@ As explained in Documentation/kobject.txt, device attributes must be created before the KOBJ_ADD uevent is generated. The only way to realize that is by defining an attribute group. 
-Attributes are declared using a macro called DEVICE_ATTR: +Attributes are declared using a macro called DEVICE_ATTR:: -#define DEVICE_ATTR(name,mode,show,store) + #define DEVICE_ATTR(name,mode,show,store) -Example: +Example:: -static DEVICE_ATTR(type, 0444, show_type, NULL); -static DEVICE_ATTR(power, 0644, show_power, store_power); + static DEVICE_ATTR(type, 0444, show_type, NULL); + static DEVICE_ATTR(power, 0644, show_power, store_power); This declares two structures of type struct device_attribute with respective names 'dev_attr_type' and 'dev_attr_power'. These two attributes can be -organized as follows into a group: +organized as follows into a group:: -static struct attribute *dev_attrs[] = { + static struct attribute *dev_attrs[] = { &dev_attr_type.attr, &dev_attr_power.attr, NULL, -}; + }; -static struct attribute_group dev_attr_group = { + static struct attribute_group dev_attr_group = { .attrs = dev_attrs, -}; + }; -static const struct attribute_group *dev_attr_groups[] = { + static const struct attribute_group *dev_attr_groups[] = { &dev_attr_group, NULL, -}; + }; This array of groups can then be associated with a device by setting the -group pointer in struct device before device_register() is invoked: +group pointer in struct device before device_register() is invoked:: - dev->groups = dev_attr_groups; - device_register(dev); + dev->groups = dev_attr_groups; + device_register(dev); The device_register() function will use the 'groups' pointer to create the device attributes and the device_unregister() function will use this pointer diff --git a/Documentation/driver-model/devres.txt b/Documentation/driver-model/devres.rst similarity index 93% rename from Documentation/driver-model/devres.txt rename to Documentation/driver-model/devres.rst index 99994a461359..4f88bdd7d555 100644 --- a/Documentation/driver-model/devres.txt +++ b/Documentation/driver-model/devres.rst @@ -1,3 +1,4 @@ +================================ Devres - Managed Device Resource
================================ @@ -5,17 +6,18 @@ Tejun Heo First draft 10 January 2007 +.. contents -1. Intro : Huh? Devres? -2. Devres : Devres in a nutshell -3. Devres Group : Group devres'es and release them together -4. Details : Life time rules, calling context, ... -5. Overhead : How much do we have to pay for this? -6. List of managed interfaces : Currently implemented managed interfaces + 1. Intro : Huh? Devres? + 2. Devres : Devres in a nutshell + 3. Devres Group : Group devres'es and release them together + 4. Details : Life time rules, calling context, ... + 5. Overhead : How much do we have to pay for this? + 6. List of managed interfaces: Currently implemented managed interfaces - 1. Intro - -------- +1. Intro +-------- devres came up while trying to convert libata to use iomap. Each iomapped address should be kept and unmapped on driver detach. For @@ -42,8 +44,8 @@ would leak resources or even cause oops when failure occurs. iomap adds more to this mix. So do msi and msix. - 2. Devres - --------- +2. Devres +--------- devres is basically linked list of arbitrarily sized memory areas associated with a struct device. Each devres entry is associated with @@ -58,7 +60,7 @@ using dma_alloc_coherent(). The managed version is called dmam_alloc_coherent(). It is identical to dma_alloc_coherent() except for the DMA memory allocated using it is managed and will be automatically released on driver detach. Implementation looks like -the following. +the following:: struct dma_devres { size_t size; @@ -98,7 +100,7 @@ If a driver uses dmam_alloc_coherent(), the area is guaranteed to be freed whether initialization fails half-way or the device gets detached. If most resources are acquired using managed interface, a driver can have much simpler init and exit code. Init path basically -looks like the following. +looks like the following:: my_init_one() { @@ -119,7 +121,7 @@ looks like the following. 
return register_to_upper_layer(d); } -And exit path, +And exit path:: my_remove_one() { @@ -140,13 +142,13 @@ on you. In some cases this may mean introducing checks that were not necessary before moving to the managed devm_* calls. - 3. Devres group - --------------- +3. Devres group +--------------- Devres entries can be grouped using devres group. When a group is released, all contained normal devres entries and properly nested groups are released. One usage is to rollback series of acquired -resources on failure. For example, +resources on failure. For example:: if (!devres_open_group(dev, NULL, GFP_KERNEL)) return -ENOMEM; @@ -172,7 +174,7 @@ like above are usually useful in midlayer driver (e.g. libata core layer) where interface function shouldn't have side effect on failure. For LLDs, just returning error code suffices in most cases. -Each group is identified by void *id. It can either be explicitly +Each group is identified by `void *id`. It can either be explicitly specified by @id argument to devres_open_group() or automatically created by passing NULL as @id as in the above example. In both cases, devres_open_group() returns the group's id. The returned id @@ -180,7 +182,7 @@ can be passed to other devres functions to select the target group. If NULL is given to those functions, the latest open group is selected. -For example, you can do something like the following. +For example, you can do something like the following:: int my_midlayer_create_something() { @@ -199,8 +201,8 @@ For example, you can do something like the following. } - 4. Details - ---------- +4. Details +---------- Lifetime of a devres entry begins on devres allocation and finishes when it is released or destroyed (removed and freed) - no reference @@ -220,8 +222,8 @@ All devres interface functions can be called without context if the right gfp mask is given. - 5. Overhead - ----------- +5. Overhead +----------- Each devres bookkeeping info is allocated together with requested data area. 
With debug option turned off, bookkeeping info occupies 16 @@ -237,8 +239,8 @@ and 400 bytes on 32bit machine after naive conversion (we can certainly invest a bit more effort into libata core layer). - 6. List of managed interfaces - ----------------------------- +6. List of managed interfaces +----------------------------- CLOCK devm_clk_get() diff --git a/Documentation/driver-model/driver.txt b/Documentation/driver-model/driver.rst similarity index 75% rename from Documentation/driver-model/driver.txt rename to Documentation/driver-model/driver.rst index d661e6f7e6a0..11d281506a04 100644 --- a/Documentation/driver-model/driver.txt +++ b/Documentation/driver-model/driver.rst @@ -1,5 +1,6 @@ - +============== Device Drivers +============== See the kerneldoc for the struct device_driver. @@ -26,50 +27,50 @@ Declaration As stated above, struct device_driver objects are statically allocated. Below is an example declaration of the eepro100 driver. This declaration is hypothetical only; it relies on the driver -being converted completely to the new model. +being converted completely to the new model:: -static struct device_driver eepro100_driver = { - .name = "eepro100", - .bus = &pci_bus_type, - - .probe = eepro100_probe, - .remove = eepro100_remove, - .suspend = eepro100_suspend, - .resume = eepro100_resume, -}; + static struct device_driver eepro100_driver = { + .name = "eepro100", + .bus = &pci_bus_type, + + .probe = eepro100_probe, + .remove = eepro100_remove, + .suspend = eepro100_suspend, + .resume = eepro100_resume, + }; Most drivers will not be able to be converted completely to the new model because the bus they belong to has a bus-specific structure with -bus-specific fields that cannot be generalized. +bus-specific fields that cannot be generalized. The most common example of this are device ID structures. A driver typically defines an array of device IDs that it supports. 
The format of these structures and the semantics for comparing device IDs are completely bus-specific. Defining them as bus-specific entities would -sacrifice type-safety, so we keep bus-specific structures around. +sacrifice type-safety, so we keep bus-specific structures around. Bus-specific drivers should include a generic struct device_driver in -the definition of the bus-specific driver. Like this: +the definition of the bus-specific driver. Like this:: -struct pci_driver { - const struct pci_device_id *id_table; - struct device_driver driver; -}; + struct pci_driver { + const struct pci_device_id *id_table; + struct device_driver driver; + }; A definition that included bus-specific fields would look like -(using the eepro100 driver again): +(using the eepro100 driver again):: -static struct pci_driver eepro100_driver = { - .id_table = eepro100_pci_tbl, - .driver = { + static struct pci_driver eepro100_driver = { + .id_table = eepro100_pci_tbl, + .driver = { .name = "eepro100", .bus = &pci_bus_type, .probe = eepro100_probe, .remove = eepro100_remove, .suspend = eepro100_suspend, .resume = eepro100_resume, - }, -}; + }, + }; Some may find the syntax of embedded struct initialization awkward or even a bit ugly. So far, it's the best way we've found to do what we want... @@ -77,12 +78,14 @@ even a bit ugly. So far, it's the best way we've found to do what we want... Registration ~~~~~~~~~~~~ -int driver_register(struct device_driver * drv); +:: + + int driver_register(struct device_driver *drv); The driver registers the structure on startup. For drivers that have no bus-specific fields (i.e. don't have a bus-specific driver structure), they would use driver_register and pass a pointer to their -struct device_driver object. +struct device_driver object. Most drivers, however, will have a bus-specific structure and will need to register with the bus using something like pci_driver_register. 
@@ -101,7 +104,7 @@ By defining wrapper functions, the transition to the new model can be made easier. Drivers can ignore the generic structure altogether and let the bus wrapper fill in the fields. For the callbacks, the bus can define generic callbacks that forward the call to the bus-specific -callbacks of the drivers. +callbacks of the drivers. This solution is intended to be only temporary. In order to get class information in the driver, the drivers must be modified anyway. Since @@ -113,16 +116,16 @@ Access ~~~~~~ Once the object has been registered, it may access the common fields of -the object, like the lock and the list of devices. +the object, like the lock and the list of devices:: -int driver_for_each_dev(struct device_driver * drv, void * data, - int (*callback)(struct device * dev, void * data)); + int driver_for_each_dev(struct device_driver *drv, void *data, + int (*callback)(struct device *dev, void *data)); The devices field is a list of all the devices that have been bound to the driver. The LDM core provides a helper function to operate on all the devices a driver controls. This helper locks the driver on each node access, and does proper reference counting on each device as it -accesses it. +accesses it. sysfs @@ -142,7 +145,9 @@ supports. Callbacks ~~~~~~~~~ - int (*probe) (struct device * dev); +:: + + int (*probe) (struct device *dev); The probe() entry is called in task context, with the bus's rwsem locked and the driver partially bound to the device. Drivers commonly use @@ -162,9 +167,9 @@ the driver to that device. A driver's probe() may return a negative errno value to indicate that the driver did not bind to this device, in which case it should have -released all resources it allocated. +released all resources it allocated:: - int (*remove) (struct device * dev); + int (*remove) (struct device *dev); remove is called to unbind a driver from a device. 
This may be called if a device is physically removed from the system, if the @@ -173,43 +178,46 @@ in other cases. It is up to the driver to determine if the device is present or not. It should free any resources allocated specifically for the -device; i.e. anything in the device's driver_data field. +device; i.e. anything in the device's driver_data field. If the device is still present, it should quiesce the device and place -it into a supported low-power state. +it into a supported low-power state:: - int (*suspend) (struct device * dev, pm_message_t state); + int (*suspend) (struct device *dev, pm_message_t state); -suspend is called to put the device in a low power state. +suspend is called to put the device in a low power state:: - int (*resume) (struct device * dev); + int (*resume) (struct device *dev); Resume is used to bring a device back from a low power state. Attributes ~~~~~~~~~~ -struct driver_attribute { - struct attribute attr; - ssize_t (*show)(struct device_driver *driver, char *buf); - ssize_t (*store)(struct device_driver *, const char * buf, size_t count); -}; -Device drivers can export attributes via their sysfs directories. +:: + + struct driver_attribute { + struct attribute attr; + ssize_t (*show)(struct device_driver *driver, char *buf); + ssize_t (*store)(struct device_driver *, const char *buf, size_t count); + }; + +Device drivers can export attributes via their sysfs directories. Drivers can declare attributes using a DRIVER_ATTR_RW and DRIVER_ATTR_RO macro that works identically to the DEVICE_ATTR_RW and DEVICE_ATTR_RO macros. 
-Example: +Example:: -DRIVER_ATTR_RW(debug); + DRIVER_ATTR_RW(debug); -This is equivalent to declaring: +This is equivalent to declaring:: -struct driver_attribute driver_attr_debug; + struct driver_attribute driver_attr_debug; This can then be used to add and remove the attribute from the -driver's directory using: +driver's directory using:: -int driver_create_file(struct device_driver *, const struct driver_attribute *); -void driver_remove_file(struct device_driver *, const struct driver_attribute *); + int driver_create_file(struct device_driver *, const struct driver_attribute *); + void driver_remove_file(struct device_driver *, const struct driver_attribute *); diff --git a/Documentation/driver-model/index.rst b/Documentation/driver-model/index.rst new file mode 100644 index 000000000000..9f85d579ce56 --- /dev/null +++ b/Documentation/driver-model/index.rst @@ -0,0 +1,26 @@ +:orphan: + +============ +Driver Model +============ + +.. toctree:: + :maxdepth: 1 + + binding + bus + class + design-patterns + device + devres + driver + overview + platform + porting + +.. only:: subproject and html + + Indices + ======= + + * :ref:`genindex` diff --git a/Documentation/driver-model/overview.txt b/Documentation/driver-model/overview.rst similarity index 90% rename from Documentation/driver-model/overview.txt rename to Documentation/driver-model/overview.rst index 6a8f9a8075d8..d4d1e9b40e0c 100644 --- a/Documentation/driver-model/overview.txt +++ b/Documentation/driver-model/overview.rst @@ -1,4 +1,6 @@ +============================= The Linux Kernel Device Model +============================= Patrick Mochel @@ -41,14 +43,14 @@ data structure. These fields must still be accessed by the bus layers, and sometimes by the device-specific drivers. Other bus layers are encouraged to do what has been done for the PCI layer. -struct pci_dev now looks like this: +struct pci_dev now looks like this:: -struct pci_dev { + struct pci_dev { ... 
struct device dev; /* Generic device interface */ ... -}; + }; Note first that the struct device dev within the struct pci_dev is statically allocated. This means only one allocation on device discovery. @@ -80,26 +82,26 @@ easy. This has been accomplished by implementing a special purpose virtual file system named sysfs. Almost all mainstream Linux distros mount this filesystem automatically; you -can see some variation of the following in the output of the "mount" command: +can see some variation of the following in the output of the "mount" command:: -$ mount -... -none on /sys type sysfs (rw,noexec,nosuid,nodev) -... -$ + $ mount + ... + none on /sys type sysfs (rw,noexec,nosuid,nodev) + ... + $ The auto-mounting of sysfs is typically accomplished by an entry similar to -the following in the /etc/fstab file: +the following in the /etc/fstab file:: -none /sys sysfs defaults 0 0 + none /sys sysfs defaults 0 0 -or something similar in the /lib/init/fstab file on Debian-based systems: +or something similar in the /lib/init/fstab file on Debian-based systems:: -none /sys sysfs nodev,noexec,nosuid 0 0 + none /sys sysfs nodev,noexec,nosuid 0 0 -If sysfs is not automatically mounted, you can always do it manually with: +If sysfs is not automatically mounted, you can always do it manually with:: -# mount -t sysfs sysfs /sys + # mount -t sysfs sysfs /sys Whenever a device is inserted into the tree, a directory is created for it. This directory may be populated at each layer of discovery - the global layer, @@ -108,7 +110,7 @@ the bus layer, or the device layer. The global layer currently creates two files - 'name' and 'power'. The former only reports the name of the device. The latter reports the current power state of the device. It will also be used to set the current -power state. +power state. The bus layer may also create files for the devices it finds while probing the bus. 
For example, the PCI layer currently creates 'irq' and 'resource' files @@ -118,6 +120,5 @@ A device-specific driver may also export files in its directory to expose device-specific data or tunable interfaces. More information about the sysfs directory layout can be found in -the other documents in this directory and in the file +the other documents in this directory and in the file Documentation/filesystems/sysfs.txt. - diff --git a/Documentation/driver-model/platform.txt b/Documentation/driver-model/platform.rst similarity index 95% rename from Documentation/driver-model/platform.txt rename to Documentation/driver-model/platform.rst index 9d9e47dfc013..334dd4071ae4 100644 --- a/Documentation/driver-model/platform.txt +++ b/Documentation/driver-model/platform.rst @@ -1,5 +1,7 @@ +============================ Platform Devices and Drivers -~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +============================ + See for the driver model interface to the platform bus: platform_device, and platform_driver. This pseudo-bus is used to connect devices on busses with minimal infrastructure, @@ -19,15 +21,15 @@ be connected through a segment of some other kind of bus; but its registers will still be directly addressable. Platform devices are given a name, used in driver binding, and a -list of resources such as addresses and IRQs. +list of resources such as addresses and IRQs:: -struct platform_device { + struct platform_device { const char *name; u32 id; struct device dev; u32 num_resources; struct resource *resource; -}; + }; Platform drivers @@ -35,9 +37,9 @@ Platform drivers Platform drivers follow the standard driver model convention, where discovery/enumeration is handled outside the drivers, and drivers provide probe() and remove() methods. They support power management -and shutdown notifications using the standard conventions. 
+and shutdown notifications using the standard conventions:: -struct platform_driver { + struct platform_driver { int (*probe)(struct platform_device *); int (*remove)(struct platform_device *); void (*shutdown)(struct platform_device *); @@ -46,25 +48,25 @@ struct platform_driver { int (*resume_early)(struct platform_device *); int (*resume)(struct platform_device *); struct device_driver driver; -}; + }; Note that probe() should in general verify that the specified device hardware actually exists; sometimes platform setup code can't be sure. The probing can use device resources, including clocks, and device platform_data. -Platform drivers register themselves the normal way: +Platform drivers register themselves the normal way:: int platform_driver_register(struct platform_driver *drv); Or, in common situations where the device is known not to be hot-pluggable, the probe() routine can live in an init section to reduce the driver's -runtime memory footprint: +runtime memory footprint:: int platform_driver_probe(struct platform_driver *drv, int (*probe)(struct platform_device *)) Kernel modules can be composed of several platform drivers. The platform core -provides helpers to register and unregister an array of drivers: +provides helpers to register and unregister an array of drivers:: int __platform_register_drivers(struct platform_driver * const *drivers, unsigned int count, struct module *owner); @@ -73,7 +75,7 @@ provides helpers to register and unregister an array of drivers: If one of the drivers fails to register, all drivers registered up to that point will be unregistered in reverse order. 
Note that there is a convenience -macro that passes THIS_MODULE as owner parameter: +macro that passes THIS_MODULE as owner parameter:: #define platform_register_drivers(drivers, count) @@ -81,7 +83,7 @@ macro that passes THIS_MODULE as owner parameter: Device Enumeration ~~~~~~~~~~~~~~~~~~ As a rule, platform specific (and often board-specific) setup code will -register platform devices: +register platform devices:: int platform_device_register(struct platform_device *pdev); @@ -133,14 +135,14 @@ tend to already have "normal" modes, such as ones using device nodes that were created by PNP or by platform device setup. None the less, there are some APIs to support such legacy drivers. Avoid -using these calls except with such hotplug-deficient drivers. +using these calls except with such hotplug-deficient drivers:: struct platform_device *platform_device_alloc( const char *name, int id); You can use platform_device_alloc() to dynamically allocate a device, which you will then initialize with resources and platform_device_register(). -A better solution is usually: +A better solution is usually:: struct platform_device *platform_device_register_simple( const char *name, int id, diff --git a/Documentation/driver-model/porting.txt b/Documentation/driver-model/porting.rst similarity index 62% rename from Documentation/driver-model/porting.txt rename to Documentation/driver-model/porting.rst index 453053f1661f..ae4bf843c1d6 100644 --- a/Documentation/driver-model/porting.txt +++ b/Documentation/driver-model/porting.rst @@ -1,5 +1,6 @@ - +======================================= Porting Drivers to the New Driver Model +======================================= Patrick Mochel @@ -8,8 +9,8 @@ Patrick Mochel Overview -Please refer to Documentation/driver-model/*.txt for definitions of -various driver types and concepts. +Please refer to `Documentation/driver-model/*.rst` for definitions of +various driver types and concepts. 
Most of the work of porting devices drivers to the new model happens at the bus driver layer. This was intentional, to minimize the @@ -18,11 +19,11 @@ of bus drivers. In a nutshell, the driver model consists of a set of objects that can be embedded in larger, bus-specific objects. Fields in these generic -objects can replace fields in the bus-specific objects. +objects can replace fields in the bus-specific objects. The generic objects must be registered with the driver model core. By doing so, they will exported via the sysfs filesystem. sysfs can be -mounted by doing +mounted by doing:: # mount -t sysfs sysfs /sys @@ -30,108 +31,109 @@ mounted by doing The Process -Step 0: Read include/linux/device.h for object and function definitions. +Step 0: Read include/linux/device.h for object and function definitions. -Step 1: Registering the bus driver. +Step 1: Registering the bus driver. -- Define a struct bus_type for the bus driver. +- Define a struct bus_type for the bus driver:: -struct bus_type pci_bus_type = { - .name = "pci", -}; + struct bus_type pci_bus_type = { + .name = "pci", + }; - Register the bus type. + This should be done in the initialization function for the bus type, - which is usually the module_init(), or equivalent, function. + which is usually the module_init(), or equivalent, function:: -static int __init pci_driver_init(void) -{ - return bus_register(&pci_bus_type); -} + static int __init pci_driver_init(void) + { + return bus_register(&pci_bus_type); + } -subsys_initcall(pci_driver_init); + subsys_initcall(pci_driver_init); The bus type may be unregistered (if the bus driver may be compiled - as a module) by doing: + as a module) by doing:: bus_unregister(&pci_bus_type); -- Export the bus type for others to use. +- Export the bus type for others to use. - Other code may wish to reference the bus type, so declare it in a + Other code may wish to reference the bus type, so declare it in a shared header file and export the symbol. 
-From include/linux/pci.h: +From include/linux/pci.h:: -extern struct bus_type pci_bus_type; + extern struct bus_type pci_bus_type; -From file the above code appears in: +From file the above code appears in:: -EXPORT_SYMBOL(pci_bus_type); + EXPORT_SYMBOL(pci_bus_type); - This will cause the bus to show up in /sys/bus/pci/ with two - subdirectories: 'devices' and 'drivers'. + subdirectories: 'devices' and 'drivers':: -# tree -d /sys/bus/pci/ -/sys/bus/pci/ -|-- devices -`-- drivers + # tree -d /sys/bus/pci/ + /sys/bus/pci/ + |-- devices + `-- drivers -Step 2: Registering Devices. +Step 2: Registering Devices. struct device represents a single device. It mainly contains metadata -describing the relationship the device has to other entities. +describing the relationship the device has to other entities. -- Embed a struct device in the bus-specific device type. +- Embed a struct device in the bus-specific device type:: -struct pci_dev { - ... - struct device dev; /* Generic device interface */ - ... -}; + struct pci_dev { + ... + struct device dev; /* Generic device interface */ + ... + }; - It is recommended that the generic device not be the first item in + It is recommended that the generic device not be the first item in the struct to discourage programmers from doing mindless casts between the object types. Instead macros, or inline functions, - should be created to convert from the generic object type. + should be created to convert from the generic object type:: -#define to_pci_dev(n) container_of(n, struct pci_dev, dev) + #define to_pci_dev(n) container_of(n, struct pci_dev, dev) -or + or -static inline struct pci_dev * to_pci_dev(struct kobject * kobj) -{ + static inline struct pci_dev * to_pci_dev(struct kobject * kobj) + { return container_of(n, struct pci_dev, dev); -} + } - This allows the compiler to verify type-safety of the operations + This allows the compiler to verify type-safety of the operations that are performed (which is Good). 
- Initialize the device on registration. - When devices are discovered or registered with the bus type, the + When devices are discovered or registered with the bus type, the bus driver should initialize the generic device. The most important things to initialize are the bus_id, parent, and bus fields. The bus_id is an ASCII string that contains the device's address on the bus. The format of this string is bus-specific. This is - necessary for representing devices in sysfs. + necessary for representing devices in sysfs. parent is the physical parent of the device. It is important that - the bus driver sets this field correctly. + the bus driver sets this field correctly. The driver model maintains an ordered list of devices that it uses for power management. This list must be in order to guarantee that @@ -140,13 +142,13 @@ static inline struct pci_dev * to_pci_dev(struct kobject * kobj) devices. Also, the location of the device's sysfs directory depends on a - device's parent. sysfs exports a directory structure that mirrors + device's parent. sysfs exports a directory structure that mirrors the device hierarchy. Accurately setting the parent guarantees that sysfs will accurately represent the hierarchy. The device's bus field is a pointer to the bus type the device belongs to. This should be set to the bus_type that was declared - and initialized before. + and initialized before. Optionally, the bus driver may set the device's name and release fields. @@ -155,107 +157,107 @@ static inline struct pci_dev * to_pci_dev(struct kobject * kobj) "ATI Technologies Inc Radeon QD" - The release field is a callback that the driver model core calls - when the device has been removed, and all references to it have + The release field is a callback that the driver model core calls + when the device has been removed, and all references to it have been released. More on this in a moment. -- Register the device. +- Register the device. 
Once the generic device has been initialized, it can be registered - with the driver model core by doing: + with the driver model core by doing:: device_register(&dev->dev); - It can later be unregistered by doing: + It can later be unregistered by doing:: device_unregister(&dev->dev); - This should happen on buses that support hotpluggable devices. + This should happen on buses that support hotpluggable devices. If a bus driver unregisters a device, it should not immediately free - it. It should instead wait for the driver model core to call the - device's release method, then free the bus-specific object. + it. It should instead wait for the driver model core to call the + device's release method, then free the bus-specific object. (There may be other code that is currently referencing the device - structure, and it would be rude to free the device while that is + structure, and it would be rude to free the device while that is happening). - When the device is registered, a directory in sysfs is created. - The PCI tree in sysfs looks like: + When the device is registered, a directory in sysfs is created. + The PCI tree in sysfs looks like:: -/sys/devices/pci0/ -|-- 00:00.0 -|-- 00:01.0 -| `-- 01:00.0 -|-- 00:02.0 -| `-- 02:1f.0 -| `-- 03:00.0 -|-- 00:1e.0 -| `-- 04:04.0 -|-- 00:1f.0 -|-- 00:1f.1 -| |-- ide0 -| | |-- 0.0 -| | `-- 0.1 -| `-- ide1 -| `-- 1.0 -|-- 00:1f.2 -|-- 00:1f.3 -`-- 00:1f.5 + /sys/devices/pci0/ + |-- 00:00.0 + |-- 00:01.0 + | `-- 01:00.0 + |-- 00:02.0 + | `-- 02:1f.0 + | `-- 03:00.0 + |-- 00:1e.0 + | `-- 04:04.0 + |-- 00:1f.0 + |-- 00:1f.1 + | |-- ide0 + | | |-- 0.0 + | | `-- 0.1 + | `-- ide1 + | `-- 1.0 + |-- 00:1f.2 + |-- 00:1f.3 + `-- 00:1f.5 Also, symlinks are created in the bus's 'devices' directory - that point to the device's directory in the physical hierarchy. 
+ that point to the device's directory in the physical hierarchy:: -/sys/bus/pci/devices/ -|-- 00:00.0 -> ../../../devices/pci0/00:00.0 -|-- 00:01.0 -> ../../../devices/pci0/00:01.0 -|-- 00:02.0 -> ../../../devices/pci0/00:02.0 -|-- 00:1e.0 -> ../../../devices/pci0/00:1e.0 -|-- 00:1f.0 -> ../../../devices/pci0/00:1f.0 -|-- 00:1f.1 -> ../../../devices/pci0/00:1f.1 -|-- 00:1f.2 -> ../../../devices/pci0/00:1f.2 -|-- 00:1f.3 -> ../../../devices/pci0/00:1f.3 -|-- 00:1f.5 -> ../../../devices/pci0/00:1f.5 -|-- 01:00.0 -> ../../../devices/pci0/00:01.0/01:00.0 -|-- 02:1f.0 -> ../../../devices/pci0/00:02.0/02:1f.0 -|-- 03:00.0 -> ../../../devices/pci0/00:02.0/02:1f.0/03:00.0 -`-- 04:04.0 -> ../../../devices/pci0/00:1e.0/04:04.0 + /sys/bus/pci/devices/ + |-- 00:00.0 -> ../../../devices/pci0/00:00.0 + |-- 00:01.0 -> ../../../devices/pci0/00:01.0 + |-- 00:02.0 -> ../../../devices/pci0/00:02.0 + |-- 00:1e.0 -> ../../../devices/pci0/00:1e.0 + |-- 00:1f.0 -> ../../../devices/pci0/00:1f.0 + |-- 00:1f.1 -> ../../../devices/pci0/00:1f.1 + |-- 00:1f.2 -> ../../../devices/pci0/00:1f.2 + |-- 00:1f.3 -> ../../../devices/pci0/00:1f.3 + |-- 00:1f.5 -> ../../../devices/pci0/00:1f.5 + |-- 01:00.0 -> ../../../devices/pci0/00:01.0/01:00.0 + |-- 02:1f.0 -> ../../../devices/pci0/00:02.0/02:1f.0 + |-- 03:00.0 -> ../../../devices/pci0/00:02.0/02:1f.0/03:00.0 + `-- 04:04.0 -> ../../../devices/pci0/00:1e.0/04:04.0 Step 3: Registering Drivers. struct device_driver is a simple driver structure that contains a set -of operations that the driver model core may call. +of operations that the driver model core may call. -- Embed a struct device_driver in the bus-specific driver. +- Embed a struct device_driver in the bus-specific driver. - Just like with devices, do something like: + Just like with devices, do something like:: -struct pci_driver { - ... - struct device_driver driver; -}; + struct pci_driver { + ... + struct device_driver driver; + }; -- Initialize the generic driver structure. 
+- Initialize the generic driver structure. When the driver registers with the bus (e.g. doing pci_register_driver()), initialize the necessary fields of the driver: the name and bus - fields. + fields. - Register the driver. - After the generic driver has been initialized, call + After the generic driver has been initialized, call:: driver_register(&drv->driver); to register the driver with the core. When the driver is unregistered from the bus, unregister it from the - core by doing: + core by doing:: driver_unregister(&drv->driver); @@ -265,15 +267,15 @@ struct pci_driver { - Sysfs representation. - Drivers are exported via sysfs in their bus's 'driver's directory. - For example: + Drivers are exported via sysfs in their bus's 'driver's directory. + For example:: -/sys/bus/pci/drivers/ -|-- 3c59x -|-- Ensoniq AudioPCI -|-- agpgart-amdk7 -|-- e100 -`-- serial + /sys/bus/pci/drivers/ + |-- 3c59x + |-- Ensoniq AudioPCI + |-- agpgart-amdk7 + |-- e100 + `-- serial Step 4: Define Generic Methods for Drivers. @@ -281,30 +283,30 @@ Step 4: Define Generic Methods for Drivers. struct device_driver defines a set of operations that the driver model core calls. Most of these operations are probably similar to operations the bus already defines for drivers, but taking different -parameters. +parameters. It would be difficult and tedious to force every driver on a bus to simultaneously convert their drivers to generic format. Instead, the bus driver should define single instances of the generic methods that -forward call to the bus-specific drivers. For instance: +forward call to the bus-specific drivers. 
For instance:: -static int pci_device_remove(struct device * dev) -{ - struct pci_dev * pci_dev = to_pci_dev(dev); - struct pci_driver * drv = pci_dev->driver; + static int pci_device_remove(struct device * dev) + { + struct pci_dev * pci_dev = to_pci_dev(dev); + struct pci_driver * drv = pci_dev->driver; - if (drv) { - if (drv->remove) - drv->remove(pci_dev); - pci_dev->driver = NULL; - } - return 0; -} + if (drv) { + if (drv->remove) + drv->remove(pci_dev); + pci_dev->driver = NULL; + } + return 0; + } The generic driver should be initialized with these methods before it -is registered. +is registered:: /* initialize common driver fields */ drv->driver.name = drv->name; @@ -320,23 +322,23 @@ is registered. Ideally, the bus should only initialize the fields if they are not already set. This allows the drivers to implement their own generic -methods. +methods. -Step 5: Support generic driver binding. +Step 5: Support generic driver binding. The model assumes that a device or driver can be dynamically registered with the bus at any time. When registration happens, devices must be bound to a driver, or drivers must be bound to all -devices that it supports. +devices that it supports. A driver typically contains a list of device IDs that it supports. The -bus driver compares these IDs to the IDs of devices registered with it. +bus driver compares these IDs to the IDs of devices registered with it. The format of the device IDs, and the semantics for comparing them are -bus-specific, so the generic model does attempt to generalize them. +bus-specific, so the generic model does attempt to generalize them. Instead, a bus may supply a method in struct bus_type that does the -comparison: +comparison:: int (*match)(struct device * dev, struct device_driver * drv); @@ -346,59 +348,59 @@ and zero otherwise. It may also return error code (for example not possible. When a device is registered, the bus's list of drivers is iterated -over. 
bus->match() is called for each one until a match is found. +over. bus->match() is called for each one until a match is found. When a driver is registered, the bus's list of devices is iterated over. bus->match() is called for each device that is not already -claimed by a driver. +claimed by a driver. When a device is successfully bound to a driver, device->driver is set, the device is added to a per-driver list of devices, and a symlink is created in the driver's sysfs directory that points to the -device's physical directory: +device's physical directory:: -/sys/bus/pci/drivers/ -|-- 3c59x -| `-- 00:0b.0 -> ../../../../devices/pci0/00:0b.0 -|-- Ensoniq AudioPCI -|-- agpgart-amdk7 -| `-- 00:00.0 -> ../../../../devices/pci0/00:00.0 -|-- e100 -| `-- 00:0c.0 -> ../../../../devices/pci0/00:0c.0 -`-- serial + /sys/bus/pci/drivers/ + |-- 3c59x + | `-- 00:0b.0 -> ../../../../devices/pci0/00:0b.0 + |-- Ensoniq AudioPCI + |-- agpgart-amdk7 + | `-- 00:00.0 -> ../../../../devices/pci0/00:00.0 + |-- e100 + | `-- 00:0c.0 -> ../../../../devices/pci0/00:0c.0 + `-- serial This driver binding should replace the existing driver binding -mechanism the bus currently uses. +mechanism the bus currently uses. Step 6: Supply a hotplug callback. Whenever a device is registered with the driver model core, the -userspace program /sbin/hotplug is called to notify userspace. +userspace program /sbin/hotplug is called to notify userspace. Users can define actions to perform when a device is inserted or -removed. +removed. The driver model core passes several arguments to userspace via environment variables, including - ACTION: set to 'add' or 'remove' -- DEVPATH: set to the device's physical path in sysfs. +- DEVPATH: set to the device's physical path in sysfs. A bus driver may also supply additional parameters for userspace to consume. 
To do this, a bus must implement the 'hotplug' method in -struct bus_type: +struct bus_type:: - int (*hotplug) (struct device *dev, char **envp, + int (*hotplug) (struct device *dev, char **envp, int num_envp, char *buffer, int buffer_size); -This is called immediately before /sbin/hotplug is executed. +This is called immediately before /sbin/hotplug is executed. Step 7: Cleaning up the bus driver. The generic bus, device, and driver structures provide several fields -that can replace those defined privately to the bus driver. +that can replace those defined privately to the bus driver. - Device list. @@ -407,36 +409,36 @@ type. This includes all devices on all instances of that bus type. An internal list that the bus uses may be removed, in favor of using this one. -The core provides an iterator to access these devices. +The core provides an iterator to access these devices:: -int bus_for_each_dev(struct bus_type * bus, struct device * start, - void * data, int (*fn)(struct device *, void *)); + int bus_for_each_dev(struct bus_type * bus, struct device * start, + void * data, int (*fn)(struct device *, void *)); - Driver list. struct bus_type also contains a list of all drivers registered with -it. An internal list of drivers that the bus driver maintains may -be removed in favor of using the generic one. +it. An internal list of drivers that the bus driver maintains may +be removed in favor of using the generic one. -The drivers may be iterated over, like devices: +The drivers may be iterated over, like devices:: -int bus_for_each_drv(struct bus_type * bus, struct device_driver * start, - void * data, int (*fn)(struct device_driver *, void *)); + int bus_for_each_drv(struct bus_type * bus, struct device_driver * start, + void * data, int (*fn)(struct device_driver *, void *)); Please see drivers/base/bus.c for more information. -- rwsem +- rwsem struct bus_type contains an rwsem that protects all core accesses to the device and driver lists. 
This can be used by the bus driver internally, and should be used when accessing the device or driver -lists the bus maintains. +lists the bus maintains. -- Device and driver fields. +- Device and driver fields. Some of the fields in struct device and struct device_driver duplicate fields in the bus-specific representations of these objects. Feel free @@ -444,4 +446,3 @@ to remove the bus-specific ones and favor the generic ones. Note though, that this will likely mean fixing up all the drivers that reference the bus-specific fields (though those should all be 1-line changes). - diff --git a/Documentation/eisa.txt b/Documentation/eisa.txt index 2806e5544e43..f388545a85a7 100644 --- a/Documentation/eisa.txt +++ b/Documentation/eisa.txt @@ -103,7 +103,7 @@ id_table an array of NULL terminated EISA id strings, (driver_data). driver a generic driver, such as described in - Documentation/driver-model/driver.txt. Only .name, + Documentation/driver-model/driver.rst. Only .name, .probe and .remove members are mandatory. =============== ==================================================== @@ -152,7 +152,7 @@ state set of flags indicating the state of the device. Current flags are EISA_CONFIG_ENABLED and EISA_CONFIG_FORCED. res set of four 256 bytes I/O regions allocated to this device dma_mask DMA mask set from the parent device. -dev generic device (see Documentation/driver-model/device.txt) +dev generic device (see Documentation/driver-model/device.rst) ======== ============================================================ You can get the 'struct eisa_device' from 'struct device' using the diff --git a/Documentation/hwmon/submitting-patches.rst b/Documentation/hwmon/submitting-patches.rst index f9796b9d9db6..d5b05d3e54ba 100644 --- a/Documentation/hwmon/submitting-patches.rst +++ b/Documentation/hwmon/submitting-patches.rst @@ -89,7 +89,7 @@ increase the chances of your change being accepted. console. Excessive logging can seriously affect system performance. 
* Use devres functions whenever possible to allocate resources. For rationale - and supported functions, please see Documentation/driver-model/devres.txt. + and supported functions, please see Documentation/driver-model/devres.rst. If a function is not supported by devres, consider using devm_add_action(). * If the driver has a detect function, make sure it is silent. Debug messages diff --git a/drivers/base/platform.c b/drivers/base/platform.c index dab0a5abc391..f051d22f6e9f 100644 --- a/drivers/base/platform.c +++ b/drivers/base/platform.c @@ -5,7 +5,7 @@ * Copyright (c) 2002-3 Patrick Mochel * Copyright (c) 2002-3 Open Source Development Labs * - * Please see Documentation/driver-model/platform.txt for more + * Please see Documentation/driver-model/platform.rst for more * information. */ diff --git a/drivers/gpio/gpio-cs5535.c b/drivers/gpio/gpio-cs5535.c index 8814c8f47e57..0cb568b3fac9 100644 --- a/drivers/gpio/gpio-cs5535.c +++ b/drivers/gpio/gpio-cs5535.c @@ -44,7 +44,7 @@ MODULE_PARM_DESC(mask, "GPIO channel mask."); /* * FIXME: convert this singleton driver to use the state container - * design pattern, see Documentation/driver-model/design-patterns.txt + * design pattern, see Documentation/driver-model/design-patterns.rst */ static struct cs5535_gpio_chip { struct gpio_chip chip; diff --git a/drivers/net/ethernet/intel/ice/ice_main.c b/drivers/net/ethernet/intel/ice/ice_main.c index f7073e046979..9b627ce970f4 100644 --- a/drivers/net/ethernet/intel/ice/ice_main.c +++ b/drivers/net/ethernet/intel/ice/ice_main.c @@ -2220,7 +2220,7 @@ ice_probe(struct pci_dev *pdev, const struct pci_device_id __always_unused *ent) struct ice_hw *hw; int err; - /* this driver uses devres, see Documentation/driver-model/devres.txt */ + /* this driver uses devres, see Documentation/driver-model/devres.rst */ err = pcim_enable_device(pdev); if (err) return err; diff --git a/scripts/coccinelle/free/devm_free.cocci b/scripts/coccinelle/free/devm_free.cocci index 
b2a2cf8bf81f..e32236a979a8 100644 --- a/scripts/coccinelle/free/devm_free.cocci +++ b/scripts/coccinelle/free/devm_free.cocci @@ -2,7 +2,7 @@ /// functions. Values allocated using the devm_functions are freed when /// the device is detached, and thus the use of the standard freeing /// function would cause a double free. -/// See Documentation/driver-model/devres.txt for more information. +/// See Documentation/driver-model/devres.rst for more information. /// /// A difficulty of detecting this problem is that the standard freeing /// function might be called from a different function than the one
From patchwork Mon Apr 22 13:27:50 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Mauro Carvalho Chehab X-Patchwork-Id: 1088697 X-Patchwork-Delegate: davem@davemloft.net From: Mauro Carvalho Chehab To: Linux Doc Mailing List Cc: Mauro Carvalho Chehab , Mauro Carvalho Chehab , linux-kernel@vger.kernel.org, Jonathan Corbet , Vadim Pasternak , Jacek Anaszewski , Pavel Machek , Pablo Neira Ayuso , Jozsef Kadlecsik , Florian Westphal , "David S. Miller" , linux-leds@vger.kernel.org, netfilter-devel@vger.kernel.org, coreteam@netfilter.org, netdev@vger.kernel.org Subject: [PATCH v2 61/79] docs: leds: convert to ReST Date: Mon, 22 Apr 2019 10:27:50 -0300 X-Mailer: git-send-email 2.20.1 Sender: netdev-owner@vger.kernel.org Precedence: bulk X-Mailing-List: netdev@vger.kernel.org Rename the leds documentation files to ReST, add an index for them and adjust in order to produce a nice html output via the Sphinx build system.
At its new index.rst, let's add a :orphan: while this is not linked to the main index.rst file, in order to avoid build warnings.

Signed-off-by: Mauro Carvalho Chehab
Acked-by: Jacek Anaszewski
---
 Documentation/laptops/thinkpad-acpi.txt       |   4 +-
 Documentation/leds/index.rst                  |  25 ++
 .../leds/{leds-blinkm.txt => leds-blinkm.rst} |  63 ++---
 ...s-class-flash.txt => leds-class-flash.rst} |  49 ++--
 .../leds/{leds-class.txt => leds-class.rst}   |  15 +-
 .../leds/{leds-lm3556.txt => leds-lm3556.rst} | 100 ++++++--
 .../leds/{leds-lp3944.txt => leds-lp3944.rst} |  23 +-
 Documentation/leds/leds-lp5521.rst            | 115 +++++++++
 Documentation/leds/leds-lp5521.txt            | 101 --------
 Documentation/leds/leds-lp5523.rst            | 147 ++++++++++++
 Documentation/leds/leds-lp5523.txt            | 130 ----------
 Documentation/leds/leds-lp5562.rst            | 137 +++++++++++
 Documentation/leds/leds-lp5562.txt            | 120 ----------
 Documentation/leds/leds-lp55xx.rst            | 224 ++++++++++++++++++
 Documentation/leds/leds-lp55xx.txt            | 194 ---------------
 Documentation/leds/leds-mlxcpld.rst           | 118 +++++++++
 Documentation/leds/leds-mlxcpld.txt           | 110 ---------
 ...edtrig-oneshot.txt => ledtrig-oneshot.rst} |  11 +-
 ...ig-transient.txt => ledtrig-transient.rst} |  63 +++--
 ...edtrig-usbport.txt => ledtrig-usbport.rst} |  11 +-
 Documentation/leds/{uleds.txt => uleds.rst}   |   5 +-
 MAINTAINERS                                   |   2 +-
 drivers/leds/trigger/Kconfig                  |   2 +-
 drivers/leds/trigger/ledtrig-transient.c      |   2 +-
 net/netfilter/Kconfig                         |   2 +-
 25 files changed, 996 insertions(+), 777 deletions(-)
 create mode 100644 Documentation/leds/index.rst
 rename Documentation/leds/{leds-blinkm.txt => leds-blinkm.rst} (56%)
 rename Documentation/leds/{leds-class-flash.txt => leds-class-flash.rst} (74%)
 rename Documentation/leds/{leds-class.txt => leds-class.rst} (92%)
 rename Documentation/leds/{leds-lm3556.txt => leds-lm3556.rst} (70%)
 rename Documentation/leds/{leds-lp3944.txt => leds-lp3944.rst} (78%)
 create mode 100644 Documentation/leds/leds-lp5521.rst
 delete mode 100644 Documentation/leds/leds-lp5521.txt
 create mode 100644 Documentation/leds/leds-lp5523.rst
 delete mode 100644 Documentation/leds/leds-lp5523.txt
 create mode 100644 Documentation/leds/leds-lp5562.rst
 delete mode 100644 Documentation/leds/leds-lp5562.txt
 create mode 100644 Documentation/leds/leds-lp55xx.rst
 delete mode 100644 Documentation/leds/leds-lp55xx.txt
 create mode 100644 Documentation/leds/leds-mlxcpld.rst
 delete mode 100644 Documentation/leds/leds-mlxcpld.txt
 rename Documentation/leds/{ledtrig-oneshot.txt => ledtrig-oneshot.rst} (90%)
 rename Documentation/leds/{ledtrig-transient.txt => ledtrig-transient.rst} (81%)
 rename Documentation/leds/{ledtrig-usbport.txt => ledtrig-usbport.rst} (86%)
 rename Documentation/leds/{uleds.txt => uleds.rst} (95%)

diff --git a/Documentation/laptops/thinkpad-acpi.txt b/Documentation/laptops/thinkpad-acpi.txt
index 3de3c95f01f6..65719384fc36 100644
--- a/Documentation/laptops/thinkpad-acpi.txt
+++ b/Documentation/laptops/thinkpad-acpi.txt
@@ -679,7 +679,7 @@ status as "unknown". The available commands are:
 sysfs notes:

 The ThinkLight sysfs interface is documented by the LED class
-documentation, in Documentation/leds/leds-class.txt. The ThinkLight LED name
+documentation, in Documentation/leds/leds-class.rst. The ThinkLight LED name
 is "tpacpi::thinklight".

 Due to limitations in the sysfs LED class, if the status of the ThinkLight
@@ -779,7 +779,7 @@ All of the above can be turned on and off and can be made to blink.
 sysfs notes:

 The ThinkPad LED sysfs interface is described in detail by the LED class
-documentation, in Documentation/leds/leds-class.txt.
+documentation, in Documentation/leds/leds-class.rst.
 The LEDs are named (in LED ID order, from 0 to 12):
 "tpacpi::power", "tpacpi:orange:batt", "tpacpi:green:batt",
diff --git a/Documentation/leds/index.rst b/Documentation/leds/index.rst
new file mode 100644
index 000000000000..9885f7c1b75d
--- /dev/null
+++ b/Documentation/leds/index.rst
@@ -0,0 +1,25 @@
+:orphan:
+
+====
+LEDs
+====
+
+.. toctree::
+   :maxdepth: 1
+
+   leds-class
+   leds-class-flash
+   ledtrig-oneshot
+   ledtrig-transient
+   ledtrig-usbport
+
+   uleds
+
+   leds-blinkm
+   leds-lm3556
+   leds-lp3944
+   leds-lp5521
+   leds-lp5523
+   leds-lp5562
+   leds-lp55xx
+   leds-mlxcpld
diff --git a/Documentation/leds/leds-blinkm.txt b/Documentation/leds/leds-blinkm.rst
similarity index 56%
rename from Documentation/leds/leds-blinkm.txt
rename to Documentation/leds/leds-blinkm.rst
index 9dd92f4cf4e1..4c970b7d21cd 100644
--- a/Documentation/leds/leds-blinkm.txt
+++ b/Documentation/leds/leds-blinkm.rst
@@ -1,3 +1,7 @@
+==================
+Leds BlinkM driver
+==================
+
 The leds-blinkm driver supports the devices of the BlinkM family.
They are RGB-LED modules driven by a (AT)tiny microcontroller and @@ -14,35 +18,36 @@ The interface this driver provides is 2-fold: a) LED class interface for use with triggers ############################################ -The registration follows the scheme: -blinkm--- +The registration follows the scheme:: -$ ls -h /sys/class/leds/blinkm-6-* -/sys/class/leds/blinkm-6-9-blue: -brightness device max_brightness power subsystem trigger uevent + blinkm--- -/sys/class/leds/blinkm-6-9-green: -brightness device max_brightness power subsystem trigger uevent + $ ls -h /sys/class/leds/blinkm-6-* + /sys/class/leds/blinkm-6-9-blue: + brightness device max_brightness power subsystem trigger uevent -/sys/class/leds/blinkm-6-9-red: -brightness device max_brightness power subsystem trigger uevent + /sys/class/leds/blinkm-6-9-green: + brightness device max_brightness power subsystem trigger uevent + + /sys/class/leds/blinkm-6-9-red: + brightness device max_brightness power subsystem trigger uevent (same is /sys/bus/i2c/devices/6-0009/leds) We can control the colors separated into red, green and blue and assign triggers on each color. -E.g.: +E.g.:: -$ cat blinkm-6-9-blue/brightness -05 + $ cat blinkm-6-9-blue/brightness + 05 -$ echo 200 > blinkm-6-9-blue/brightness -$ + $ echo 200 > blinkm-6-9-blue/brightness + $ -$ modprobe ledtrig-heartbeat -$ echo heartbeat > blinkm-6-9-green/trigger -$ + $ modprobe ledtrig-heartbeat + $ echo heartbeat > blinkm-6-9-green/trigger + $ b) Sysfs group to control rgb, fade, hsb, scripts ... @@ -52,25 +57,25 @@ This extended interface is available as folder blinkm in the sysfs folder of the I2C device. E.g. below /sys/bus/i2c/devices/6-0009/blinkm -$ ls -h /sys/bus/i2c/devices/6-0009/blinkm/ -blue green red test + $ ls -h /sys/bus/i2c/devices/6-0009/blinkm/ + blue green red test Currently supported is just setting red, green, blue and a test sequence. 
-E.g.: +E.g.:: -$ cat * -00 -00 -00 -#Write into test to start test sequence!# + $ cat * + 00 + 00 + 00 + #Write into test to start test sequence!# -$ echo 1 > test -$ + $ echo 1 > test + $ -$ echo 255 > red -$ + $ echo 255 > red + $ diff --git a/Documentation/leds/leds-class-flash.txt b/Documentation/leds/leds-class-flash.rst similarity index 74% rename from Documentation/leds/leds-class-flash.txt rename to Documentation/leds/leds-class-flash.rst index 8da3c6f4b60b..6ec12c5a1a0e 100644 --- a/Documentation/leds/leds-class-flash.txt +++ b/Documentation/leds/leds-class-flash.rst @@ -1,9 +1,9 @@ - +============================== Flash LED handling under Linux ============================== Some LED devices provide two modes - torch and flash. In the LED subsystem -those modes are supported by LED class (see Documentation/leds/leds-class.txt) +those modes are supported by LED class (see Documentation/leds/leds-class.rst) and LED Flash class respectively. The torch mode related features are enabled by default and the flash ones only if a driver declares it by setting LED_DEV_CAP_FLASH flag. @@ -14,6 +14,7 @@ registered in the LED subsystem with led_classdev_flash_register function. Following sysfs attributes are exposed for controlling flash LED devices: (see Documentation/ABI/testing/sysfs-class-led-flash) + - flash_brightness - max_flash_brightness - flash_timeout @@ -31,30 +32,46 @@ be defined in the kernel config. The driver must call the v4l2_flash_init function to get registered in the V4L2 subsystem. The function takes six arguments: -- dev : flash device, e.g. an I2C device -- of_node : of_node of the LED, may be NULL if the same as device's -- fled_cdev : LED flash class device to wrap -- iled_cdev : LED flash class device representing indicator LED associated with - fled_cdev, may be NULL -- ops : V4L2 specific ops - * external_strobe_set - defines the source of the flash LED strobe - + +- dev: + flash device, e.g. 
an I2C device +- of_node: + of_node of the LED, may be NULL if the same as device's +- fled_cdev: + LED flash class device to wrap +- iled_cdev: + LED flash class device representing indicator LED associated with + fled_cdev, may be NULL +- ops: + V4L2 specific ops + + * external_strobe_set + defines the source of the flash LED strobe - V4L2_CID_FLASH_STROBE control or external source, typically a sensor, which makes it possible to synchronise the flash strobe start with exposure start, - * intensity_to_led_brightness and led_brightness_to_intensity - perform + * intensity_to_led_brightness and led_brightness_to_intensity + perform enum led_brightness <-> V4L2 intensity conversion in a device specific manner - they can be used for devices with non-linear LED current scale. -- config : configuration for V4L2 Flash sub-device - * dev_name - the name of the media entity, unique in the system, - * flash_faults - bitmask of flash faults that the LED flash class +- config: + configuration for V4L2 Flash sub-device + + * dev_name + the name of the media entity, unique in the system, + * flash_faults + bitmask of flash faults that the LED flash class device can report; corresponding LED_FAULT* bit definitions are available in , - * torch_intensity - constraints for the LED in TORCH mode + * torch_intensity + constraints for the LED in TORCH mode in microamperes, - * indicator_intensity - constraints for the indicator LED + * indicator_intensity + constraints for the indicator LED in microamperes, - * has_external_strobe - determines whether the flash strobe source + * has_external_strobe + determines whether the flash strobe source can be switched to external, On remove the v4l2_flash_release function has to be called, which takes one diff --git a/Documentation/leds/leds-class.txt b/Documentation/leds/leds-class.rst similarity index 92% rename from Documentation/leds/leds-class.txt rename to Documentation/leds/leds-class.rst index 8b39cc6b03ee..df0120a1ee3c 100644 --- 
a/Documentation/leds/leds-class.txt +++ b/Documentation/leds/leds-class.rst @@ -1,4 +1,4 @@ - +======================== LED handling under Linux ======================== @@ -43,7 +43,7 @@ LED Device Naming Is currently of the form: -"devicename:colour:function" + "devicename:colour:function" There have been calls for LED properties such as colour to be exported as individual led class attributes. As a solution which doesn't incur as much @@ -57,9 +57,12 @@ Brightness setting API LED subsystem core exposes following API for setting brightness: - - led_set_brightness : it is guaranteed not to sleep, passing LED_OFF stops + - led_set_brightness: + it is guaranteed not to sleep, passing LED_OFF stops blinking, - - led_set_brightness_sync : for use cases when immediate effect is desired - + + - led_set_brightness_sync: + for use cases when immediate effect is desired - it can block the caller for the time required for accessing device registers and can sleep, passing LED_OFF stops hardware blinking, returns -EBUSY if software blink fallback is enabled. @@ -70,7 +73,7 @@ LED registration API A driver wanting to register a LED classdev for use by other drivers / userspace needs to allocate and fill a led_classdev struct and then call -[devm_]led_classdev_register. If the non devm version is used the driver +`[devm_]led_classdev_register`. If the non devm version is used the driver must call led_classdev_unregister from its remove function before free-ing the led_classdev struct. @@ -94,7 +97,7 @@ with brightness value LED_OFF, which should stop any software timers that may have been required for blinking. The blink_set() function should choose a user friendly blinking value -if it is called with *delay_on==0 && *delay_off==0 parameters. In this +if it is called with `*delay_on==0` && `*delay_off==0` parameters. In this case the driver should give back the chosen value through delay_on and delay_off parameters to the leds subsystem. 
diff --git a/Documentation/leds/leds-lm3556.txt b/Documentation/leds/leds-lm3556.rst similarity index 70% rename from Documentation/leds/leds-lm3556.txt rename to Documentation/leds/leds-lm3556.rst index 62278e871b50..1ef17d7d800e 100644 --- a/Documentation/leds/leds-lm3556.txt +++ b/Documentation/leds/leds-lm3556.rst @@ -1,68 +1,118 @@ +======================== Kernel driver for lm3556 ======================== -*Texas Instrument: - 1.5 A Synchronous Boost LED Flash Driver w/ High-Side Current Source +* Texas Instrument: + 1.5 A Synchronous Boost LED Flash Driver w/ High-Side Current Source * Datasheet: http://www.national.com/ds/LM/LM3556.pdf Authors: - Daniel Jeong + - Daniel Jeong + Contact:Daniel Jeong(daniel.jeong-at-ti.com, gshark.jeong-at-gmail.com) Description ----------- There are 3 functions in LM3556, Flash, Torch and Indicator. -FLASH MODE +Flash Mode +^^^^^^^^^^ + In Flash Mode, the LED current source(LED) provides 16 target current levels from 93.75 mA to 1500 mA.The Flash currents are adjusted via the CURRENT CONTROL REGISTER(0x09).Flash mode is activated by the ENABLE REGISTER(0x0A), or by pulling the STROBE pin HIGH. + LM3556 Flash can be controlled through sys/class/leds/flash/brightness file + * if STROBE pin is enabled, below example control brightness only, and -ON / OFF will be controlled by STROBE pin. + ON / OFF will be controlled by STROBE pin. Flash Example: -OFF : #echo 0 > sys/class/leds/flash/brightness -93.75 mA: #echo 1 > sys/class/leds/flash/brightness -... ..... -1500 mA: #echo 16 > sys/class/leds/flash/brightness -TORCH MODE +OFF:: + + #echo 0 > sys/class/leds/flash/brightness + +93.75 mA:: + + #echo 1 > sys/class/leds/flash/brightness + +... + +1500 mA:: + + #echo 16 > sys/class/leds/flash/brightness + +Torch Mode +^^^^^^^^^^ + In Torch Mode, the current source(LED) is programmed via the CURRENT CONTROL REGISTER(0x09).Torch Mode is activated by the ENABLE REGISTER(0x0A) or by the hardware TORCH input. 
+ LM3556 torch can be controlled through sys/class/leds/torch/brightness file. * if TORCH pin is enabled, below example control brightness only, and ON / OFF will be controlled by TORCH pin. Torch Example: -OFF : #echo 0 > sys/class/leds/torch/brightness -46.88 mA: #echo 1 > sys/class/leds/torch/brightness -... ..... -375 mA : #echo 8 > sys/class/leds/torch/brightness -INDICATOR MODE +OFF:: + + #echo 0 > sys/class/leds/torch/brightness + +46.88 mA:: + + #echo 1 > sys/class/leds/torch/brightness + +... + +375 mA:: + + #echo 8 > sys/class/leds/torch/brightness + +Indicator Mode +^^^^^^^^^^^^^^ + Indicator pattern can be set through sys/class/leds/indicator/pattern file, and 4 patterns are pre-defined in indicator_pattern array. + According to N-lank, Pulse time and N Period values, different pattern wiill be generated.If you want new patterns for your own device, change indicator_pattern array with your own values and INDIC_PATTERN_SIZE. + Please refer datasheet for more detail about N-Blank, Pulse time and N Period. Indicator pattern example: -pattern 0: #echo 0 > sys/class/leds/indicator/pattern -.... -pattern 3: #echo 3 > sys/class/leds/indicator/pattern + +pattern 0:: + + #echo 0 > sys/class/leds/indicator/pattern + +... + +pattern 3:: + + #echo 3 > sys/class/leds/indicator/pattern Indicator brightness can be controlled through sys/class/leds/indicator/brightness file. Example: -OFF : #echo 0 > sys/class/leds/indicator/brightness -5.86 mA : #echo 1 > sys/class/leds/indicator/brightness -........ -46.875mA : #echo 8 > sys/class/leds/indicator/brightness + +OFF:: + + #echo 0 > sys/class/leds/indicator/brightness + +5.86 mA:: + + #echo 1 > sys/class/leds/indicator/brightness + +... + +46.875mA:: + + #echo 8 > sys/class/leds/indicator/brightness Notes ----- @@ -70,7 +120,8 @@ Driver expects it is registered using the i2c_board_info mechanism. 
To register the chip at address 0x63 on specific adapter, set the platform data according to include/linux/platform_data/leds-lm3556.h, set the i2c board info -Example: +Example:: + static struct i2c_board_info board_i2c_ch4[] __initdata = { { I2C_BOARD_INFO(LM3556_NAME, 0x63), @@ -80,6 +131,7 @@ Example: and register it in the platform init function -Example: +Example:: + board_register_i2c_bus(4, 400, board_i2c_ch4, ARRAY_SIZE(board_i2c_ch4)); diff --git a/Documentation/leds/leds-lp3944.txt b/Documentation/leds/leds-lp3944.rst similarity index 78% rename from Documentation/leds/leds-lp3944.txt rename to Documentation/leds/leds-lp3944.rst index e88ac3b60c08..c2f87dc1a3a9 100644 --- a/Documentation/leds/leds-lp3944.txt +++ b/Documentation/leds/leds-lp3944.rst @@ -1,14 +1,20 @@ +==================== Kernel driver lp3944 ==================== * National Semiconductor LP3944 Fun-light Chip + Prefix: 'lp3944' + Addresses scanned: None (see the Notes section below) - Datasheet: Publicly available at the National Semiconductor website - http://www.national.com/pf/LP/LP3944.html + + Datasheet: + + Publicly available at the National Semiconductor website + http://www.national.com/pf/LP/LP3944.html Authors: - Antonio Ospite + Antonio Ospite Description @@ -19,8 +25,11 @@ is used as a led controller. The DIM modes are used to set _blink_ patterns for leds, the pattern is specified supplying two parameters: - - period: from 0s to 1.6s - - duty cycle: percentage of the period the led is on, from 0 to 100 + + - period: + from 0s to 1.6s + - duty cycle: + percentage of the period the led is on, from 0 to 100 Setting a led in DIM0 or DIM1 mode makes it blink according to the pattern. See the datasheet for details. @@ -35,7 +44,7 @@ The chip is used mainly in embedded contexts, so this driver expects it is registered using the i2c_board_info mechanism. 
To register the chip at address 0x60 on adapter 0, set the platform data -according to include/linux/leds-lp3944.h, set the i2c board info: +according to include/linux/leds-lp3944.h, set the i2c board info:: static struct i2c_board_info a910_i2c_board_info[] __initdata = { { @@ -44,7 +53,7 @@ according to include/linux/leds-lp3944.h, set the i2c board info: }, }; -and register it in the platform init function +and register it in the platform init function:: i2c_register_board_info(0, a910_i2c_board_info, ARRAY_SIZE(a910_i2c_board_info)); diff --git a/Documentation/leds/leds-lp5521.rst b/Documentation/leds/leds-lp5521.rst new file mode 100644 index 000000000000..0432615b083d --- /dev/null +++ b/Documentation/leds/leds-lp5521.rst @@ -0,0 +1,115 @@ +======================== +Kernel driver for lp5521 +======================== + +* National Semiconductor LP5521 led driver chip +* Datasheet: http://www.national.com/pf/LP/LP5521.html + +Authors: Mathias Nyman, Yuri Zaporozhets, Samu Onkalo + +Contact: Samu Onkalo (samu.p.onkalo-at-nokia.com) + +Description +----------- + +LP5521 can drive up to 3 channels. Leds can be controlled directly via +the led class control interface. Channels have generic names: +lp5521:channelx, where x is 0 .. 2 + +All three channels can be also controlled using the engine micro programs. +More details of the instructions can be found from the public data sheet. + +LP5521 has the internal program memory for running various LED patterns. +There are two ways to run LED patterns. + +1) Legacy interface - enginex_mode and enginex_load + Control interface for the engines: + + x is 1 .. 
3 + + enginex_mode: + disabled, load, run + enginex_load: + store program (visible only in engine load mode) + + Example (start to blink the channel 2 led):: + + cd /sys/class/leds/lp5521:channel2/device + echo "load" > engine3_mode + echo "037f4d0003ff6000" > engine3_load + echo "run" > engine3_mode + + To stop the engine:: + + echo "disabled" > engine3_mode + +2) Firmware interface - LP55xx common interface + +For the details, please refer to 'firmware' section in leds-lp55xx.txt + +sysfs contains a selftest entry. + +The test communicates with the chip and checks that +the clock mode is automatically set to the requested one. + +Each channel has its own led current settings. + +- /sys/class/leds/lp5521:channel0/led_current - RW +- /sys/class/leds/lp5521:channel0/max_current - RO + +Format: 10x mA i.e 10 means 1.0 mA + +example platform data:: + + static struct lp55xx_led_config lp5521_led_config[] = { + { + .name = "red", + .chan_nr = 0, + .led_current = 50, + .max_current = 130, + }, { + .name = "green", + .chan_nr = 1, + .led_current = 0, + .max_current = 130, + }, { + .name = "blue", + .chan_nr = 2, + .led_current = 0, + .max_current = 130, + } + }; + + static int lp5521_setup(void) + { + /* setup HW resources */ + } + + static void lp5521_release(void) + { + /* Release HW resources */ + } + + static void lp5521_enable(bool state) + { + /* Control of chip enable signal */ + } + + static struct lp55xx_platform_data lp5521_platform_data = { + .led_config = lp5521_led_config, + .num_channels = ARRAY_SIZE(lp5521_led_config), + .clock_mode = LP55XX_CLOCK_EXT, + .setup_resources = lp5521_setup, + .release_resources = lp5521_release, + .enable = lp5521_enable, + }; + +Note: + chan_nr can have values between 0 and 2. + The name of each channel can be configurable. 
+ If the name field is not defined, the default name will be set to 'xxxx:channelN' + (XXXX : pdata->label or i2c client name, N : channel number) + + +If the current is set to 0 in the platform data, that channel is +disabled and it is not visible in the sysfs. diff --git a/Documentation/leds/leds-lp5521.txt b/Documentation/leds/leds-lp5521.txt deleted file mode 100644 index d08d8c179f85..000000000000 --- a/Documentation/leds/leds-lp5521.txt +++ /dev/null @@ -1,101 +0,0 @@ -Kernel driver for lp5521 -======================== - -* National Semiconductor LP5521 led driver chip -* Datasheet: http://www.national.com/pf/LP/LP5521.html - -Authors: Mathias Nyman, Yuri Zaporozhets, Samu Onkalo -Contact: Samu Onkalo (samu.p.onkalo-at-nokia.com) - -Description ------------ - -LP5521 can drive up to 3 channels. Leds can be controlled directly via -the led class control interface. Channels have generic names: -lp5521:channelx, where x is 0 .. 2 - -All three channels can be also controlled using the engine micro programs. -More details of the instructions can be found from the public data sheet. - -LP5521 has the internal program memory for running various LED patterns. -There are two ways to run LED patterns. - -1) Legacy interface - enginex_mode and enginex_load - Control interface for the engines: - x is 1 .. 3 - enginex_mode : disabled, load, run - enginex_load : store program (visible only in engine load mode) - - Example (start to blink the channel 2 led): - cd /sys/class/leds/lp5521:channel2/device - echo "load" > engine3_mode - echo "037f4d0003ff6000" > engine3_load - echo "run" > engine3_mode - - To stop the engine: - echo "disabled" > engine3_mode - -2) Firmware interface - LP55xx common interface - For the details, please refer to 'firmware' section in leds-lp55xx.txt - -sysfs contains a selftest entry. -The test communicates with the chip and checks that -the clock mode is automatically set to the requested one. - -Each channel has its own led current settings. 
-/sys/class/leds/lp5521:channel0/led_current - RW -/sys/class/leds/lp5521:channel0/max_current - RO -Format: 10x mA i.e 10 means 1.0 mA - -example platform data: - -Note: chan_nr can have values between 0 and 2. -The name of each channel can be configurable. -If the name field is not defined, the default name will be set to 'xxxx:channelN' -(XXXX : pdata->label or i2c client name, N : channel number) - -static struct lp55xx_led_config lp5521_led_config[] = { - { - .name = "red", - .chan_nr = 0, - .led_current = 50, - .max_current = 130, - }, { - .name = "green", - .chan_nr = 1, - .led_current = 0, - .max_current = 130, - }, { - .name = "blue", - .chan_nr = 2, - .led_current = 0, - .max_current = 130, - } -}; - -static int lp5521_setup(void) -{ - /* setup HW resources */ -} - -static void lp5521_release(void) -{ - /* Release HW resources */ -} - -static void lp5521_enable(bool state) -{ - /* Control of chip enable signal */ -} - -static struct lp55xx_platform_data lp5521_platform_data = { - .led_config = lp5521_led_config, - .num_channels = ARRAY_SIZE(lp5521_led_config), - .clock_mode = LP55XX_CLOCK_EXT, - .setup_resources = lp5521_setup, - .release_resources = lp5521_release, - .enable = lp5521_enable, -}; - -If the current is set to 0 in the platform data, that channel is -disabled and it is not visible in the sysfs. diff --git a/Documentation/leds/leds-lp5523.rst b/Documentation/leds/leds-lp5523.rst new file mode 100644 index 000000000000..7d7362a1dd57 --- /dev/null +++ b/Documentation/leds/leds-lp5523.rst @@ -0,0 +1,147 @@ +======================== +Kernel driver for lp5523 +======================== + +* National Semiconductor LP5523 led driver chip +* Datasheet: http://www.national.com/pf/LP/LP5523.html + +Authors: Mathias Nyman, Yuri Zaporozhets, Samu Onkalo +Contact: Samu Onkalo (samu.p.onkalo-at-nokia.com) + +Description +----------- +LP5523 can drive up to 9 channels. Leds can be controlled directly via +the led class control interface. 
+The name of each channel is configurable in the platform data - name and label. +There are three options to make the channel name. + +a) Define the 'name' in the platform data + +To make specific channel name, then use 'name' platform data. + +- /sys/class/leds/R1 (name: 'R1') +- /sys/class/leds/B1 (name: 'B1') + +b) Use the 'label' with no 'name' field + +For one device name with channel number, then use 'label'. +- /sys/class/leds/RGB:channelN (label: 'RGB', N: 0 ~ 8) + +c) Default + +If both fields are NULL, 'lp5523' is used by default. +- /sys/class/leds/lp5523:channelN (N: 0 ~ 8) + +LP5523 has the internal program memory for running various LED patterns. +There are two ways to run LED patterns. + +1) Legacy interface - enginex_mode, enginex_load and enginex_leds + + Control interface for the engines: + + x is 1 .. 3 + + enginex_mode: + disabled, load, run + enginex_load: + microcode load + enginex_leds: + led mux control + + :: + + cd /sys/class/leds/lp5523:channel2/device + echo "load" > engine3_mode + echo "9d80400004ff05ff437f0000" > engine3_load + echo "111111111" > engine3_leds + echo "run" > engine3_mode + + To stop the engine:: + + echo "disabled" > engine3_mode + +2) Firmware interface - LP55xx common interface + +For the details, please refer to 'firmware' section in leds-lp55xx.txt + +LP5523 has three master faders. If a channel is mapped to one of +the master faders, its output is dimmed based on the value of the master +fader. + +For example:: + + echo "123000123" > master_fader_leds + +creates the following channel-fader mappings:: + + channel 0,6 to master_fader1 + channel 1,7 to master_fader2 + channel 2,8 to master_fader3 + +Then, to have 25% of the original output on channel 0,6:: + + echo 64 > master_fader1 + +To have 0% of the original output (i.e. no output) channel 1,7:: + + echo 0 > master_fader2 + +To have 100% of the original output (i.e. 
no dimming) on channel 2,8:: + + echo 255 > master_fader3 + +To clear all master fader controls:: + + echo "000000000" > master_fader_leds + +Selftest uses always the current from the platform data. + +Each channel contains led current settings. +- /sys/class/leds/lp5523:channel2/led_current - RW +- /sys/class/leds/lp5523:channel2/max_current - RO + +Format: 10x mA i.e 10 means 1.0 mA + +Example platform data:: + + static struct lp55xx_led_config lp5523_led_config[] = { + { + .name = "D1", + .chan_nr = 0, + .led_current = 50, + .max_current = 130, + }, + ... + { + .chan_nr = 8, + .led_current = 50, + .max_current = 130, + } + }; + + static int lp5523_setup(void) + { + /* Setup HW resources */ + } + + static void lp5523_release(void) + { + /* Release HW resources */ + } + + static void lp5523_enable(bool state) + { + /* Control chip enable signal */ + } + + static struct lp55xx_platform_data lp5523_platform_data = { + .led_config = lp5523_led_config, + .num_channels = ARRAY_SIZE(lp5523_led_config), + .clock_mode = LP55XX_CLOCK_EXT, + .setup_resources = lp5523_setup, + .release_resources = lp5523_release, + .enable = lp5523_enable, + }; + +Note + chan_nr can have values between 0 and 8. diff --git a/Documentation/leds/leds-lp5523.txt b/Documentation/leds/leds-lp5523.txt deleted file mode 100644 index 0961a060fc4d..000000000000 --- a/Documentation/leds/leds-lp5523.txt +++ /dev/null @@ -1,130 +0,0 @@ -Kernel driver for lp5523 -======================== - -* National Semiconductor LP5523 led driver chip -* Datasheet: http://www.national.com/pf/LP/LP5523.html - -Authors: Mathias Nyman, Yuri Zaporozhets, Samu Onkalo -Contact: Samu Onkalo (samu.p.onkalo-at-nokia.com) - -Description ------------ -LP5523 can drive up to 9 channels. Leds can be controlled directly via -the led class control interface. -The name of each channel is configurable in the platform data - name and label. -There are three options to make the channel name. 
- -a) Define the 'name' in the platform data -To make specific channel name, then use 'name' platform data. -/sys/class/leds/R1 (name: 'R1') -/sys/class/leds/B1 (name: 'B1') - -b) Use the 'label' with no 'name' field -For one device name with channel number, then use 'label'. -/sys/class/leds/RGB:channelN (label: 'RGB', N: 0 ~ 8) - -c) Default -If both fields are NULL, 'lp5523' is used by default. -/sys/class/leds/lp5523:channelN (N: 0 ~ 8) - -LP5523 has the internal program memory for running various LED patterns. -There are two ways to run LED patterns. - -1) Legacy interface - enginex_mode, enginex_load and enginex_leds - Control interface for the engines: - x is 1 .. 3 - enginex_mode : disabled, load, run - enginex_load : microcode load - enginex_leds : led mux control - - cd /sys/class/leds/lp5523:channel2/device - echo "load" > engine3_mode - echo "9d80400004ff05ff437f0000" > engine3_load - echo "111111111" > engine3_leds - echo "run" > engine3_mode - - To stop the engine: - echo "disabled" > engine3_mode - -2) Firmware interface - LP55xx common interface - For the details, please refer to 'firmware' section in leds-lp55xx.txt - -LP5523 has three master faders. If a channel is mapped to one of -the master faders, its output is dimmed based on the value of the master -fader. - -For example, - - echo "123000123" > master_fader_leds - -creates the following channel-fader mappings: - - channel 0,6 to master_fader1 - channel 1,7 to master_fader2 - channel 2,8 to master_fader3 - -Then, to have 25% of the original output on channel 0,6: - - echo 64 > master_fader1 - -To have 0% of the original output (i.e. no output) channel 1,7: - - echo 0 > master_fader2 - -To have 100% of the original output (i.e. no dimming) on channel 2,8: - - echo 255 > master_fader3 - -To clear all master fader controls: - - echo "000000000" > master_fader_leds - -Selftest uses always the current from the platform data. - -Each channel contains led current settings. 
-/sys/class/leds/lp5523:channel2/led_current - RW -/sys/class/leds/lp5523:channel2/max_current - RO -Format: 10x mA i.e 10 means 1.0 mA - -Example platform data: - -Note - chan_nr can have values between 0 and 8. - -static struct lp55xx_led_config lp5523_led_config[] = { - { - .name = "D1", - .chan_nr = 0, - .led_current = 50, - .max_current = 130, - }, -... - { - .chan_nr = 8, - .led_current = 50, - .max_current = 130, - } -}; - -static int lp5523_setup(void) -{ - /* Setup HW resources */ -} - -static void lp5523_release(void) -{ - /* Release HW resources */ -} - -static void lp5523_enable(bool state) -{ - /* Control chip enable signal */ -} - -static struct lp55xx_platform_data lp5523_platform_data = { - .led_config = lp5523_led_config, - .num_channels = ARRAY_SIZE(lp5523_led_config), - .clock_mode = LP55XX_CLOCK_EXT, - .setup_resources = lp5523_setup, - .release_resources = lp5523_release, - .enable = lp5523_enable, -}; diff --git a/Documentation/leds/leds-lp5562.rst b/Documentation/leds/leds-lp5562.rst new file mode 100644 index 000000000000..79bbb2487ff6 --- /dev/null +++ b/Documentation/leds/leds-lp5562.rst @@ -0,0 +1,137 @@ +======================== +Kernel driver for lp5562 +======================== + +* TI LP5562 LED Driver + +Author: Milo(Woogyom) Kim + +Description +=========== + + LP5562 can drive up to 4 channels. R/G/B and White. + LEDs can be controlled directly via the led class control interface. + + All four channels can be also controlled using the engine micro programs. + LP5562 has the internal program memory for running various LED patterns. + For the details, please refer to 'firmware' section in leds-lp55xx.txt + +Device attribute +================ + +engine_mux + 3 Engines are allocated in LP5562, but the number of channel is 4. + Therefore each channel should be mapped to the engine number. + + Value: RGB or W + + This attribute is used for programming LED data with the firmware interface. 
+ Unlike the LP5521/LP5523/55231, LP5562 has a unique feature for the engine mux, + so an additional sysfs attribute is required. + + LED Map + + ===== === =============================== + Red ... Engine 1 (fixed) + Green ... Engine 2 (fixed) + Blue ... Engine 3 (fixed) + White ... Engine 1 or 2 or 3 (selective) + ===== === =============================== + +How to load the program data using engine_mux +============================================= + + Before loading the LP5562 program data, engine_mux should be written between + the engine selection and loading the firmware. + Engine mux has two different modes, RGB and W. + RGB is used for loading RGB program data, W is used for W program data. + + For example, run blinking green channel pattern:: + + echo 2 > /sys/bus/i2c/devices/xxxx/select_engine # 2 is for green channel + echo "RGB" > /sys/bus/i2c/devices/xxxx/engine_mux # engine mux for RGB + echo 1 > /sys/class/firmware/lp5562/loading + echo "4000600040FF6000" > /sys/class/firmware/lp5562/data + echo 0 > /sys/class/firmware/lp5562/loading + echo 1 > /sys/bus/i2c/devices/xxxx/run_engine + + To run a blinking white pattern:: + + echo 1 or 2 or 3 > /sys/bus/i2c/devices/xxxx/select_engine + echo "W" > /sys/bus/i2c/devices/xxxx/engine_mux + echo 1 > /sys/class/firmware/lp5562/loading + echo "4000600040FF6000" > /sys/class/firmware/lp5562/data + echo 0 > /sys/class/firmware/lp5562/loading + echo 1 > /sys/bus/i2c/devices/xxxx/run_engine + +How to load the predefined patterns +=================================== + + Please refer to 'leds-lp55xx.txt' + +Setting Current of Each Channel +=============================== + + Like LP5521 and LP5523/55231, LP5562 provides LED current settings. + The 'led_current' and 'max_current' are used.
+ +Example of Platform data +======================== + +:: + + static struct lp55xx_led_config lp5562_led_config[] = { + { + .name = "R", + .chan_nr = 0, + .led_current = 20, + .max_current = 40, + }, + { + .name = "G", + .chan_nr = 1, + .led_current = 20, + .max_current = 40, + }, + { + .name = "B", + .chan_nr = 2, + .led_current = 20, + .max_current = 40, + }, + { + .name = "W", + .chan_nr = 3, + .led_current = 20, + .max_current = 40, + }, + }; + + static int lp5562_setup(void) + { + /* setup HW resources */ + } + + static void lp5562_release(void) + { + /* Release HW resources */ + } + + static void lp5562_enable(bool state) + { + /* Control of chip enable signal */ + } + + static struct lp55xx_platform_data lp5562_platform_data = { + .led_config = lp5562_led_config, + .num_channels = ARRAY_SIZE(lp5562_led_config), + .setup_resources = lp5562_setup, + .release_resources = lp5562_release, + .enable = lp5562_enable, + }; + +To configure the platform specific data, lp55xx_platform_data structure is used + + +If the current is set to 0 in the platform data, that channel is +disabled and it is not visible in the sysfs. diff --git a/Documentation/leds/leds-lp5562.txt b/Documentation/leds/leds-lp5562.txt deleted file mode 100644 index 5a823ff6b393..000000000000 --- a/Documentation/leds/leds-lp5562.txt +++ /dev/null @@ -1,120 +0,0 @@ -Kernel driver for LP5562 -======================== - -* TI LP5562 LED Driver - -Author: Milo(Woogyom) Kim - -Description - - LP5562 can drive up to 4 channels. R/G/B and White. - LEDs can be controlled directly via the led class control interface. - - All four channels can be also controlled using the engine micro programs. - LP5562 has the internal program memory for running various LED patterns. - For the details, please refer to 'firmware' section in leds-lp55xx.txt - -Device attribute: engine_mux - - 3 Engines are allocated in LP5562, but the number of channel is 4. - Therefore each channel should be mapped to the engine number. 
- Value : RGB or W - - This attribute is used for programming LED data with the firmware interface. - Unlike the LP5521/LP5523/55231, LP5562 has unique feature for the engine mux, - so additional sysfs is required. - - LED Map - Red ... Engine 1 (fixed) - Green ... Engine 2 (fixed) - Blue ... Engine 3 (fixed) - White ... Engine 1 or 2 or 3 (selective) - -How to load the program data using engine_mux - - Before loading the LP5562 program data, engine_mux should be written between - the engine selection and loading the firmware. - Engine mux has two different mode, RGB and W. - RGB is used for loading RGB program data, W is used for W program data. - - For example, run blinking green channel pattern, - echo 2 > /sys/bus/i2c/devices/xxxx/select_engine # 2 is for green channel - echo "RGB" > /sys/bus/i2c/devices/xxxx/engine_mux # engine mux for RGB - echo 1 > /sys/class/firmware/lp5562/loading - echo "4000600040FF6000" > /sys/class/firmware/lp5562/data - echo 0 > /sys/class/firmware/lp5562/loading - echo 1 > /sys/bus/i2c/devices/xxxx/run_engine - - To run a blinking white pattern, - echo 1 or 2 or 3 > /sys/bus/i2c/devices/xxxx/select_engine - echo "W" > /sys/bus/i2c/devices/xxxx/engine_mux - echo 1 > /sys/class/firmware/lp5562/loading - echo "4000600040FF6000" > /sys/class/firmware/lp5562/data - echo 0 > /sys/class/firmware/lp5562/loading - echo 1 > /sys/bus/i2c/devices/xxxx/run_engine - -How to load the predefined patterns - - Please refer to 'leds-lp55xx.txt" - -Setting Current of Each Channel - - Like LP5521 and LP5523/55231, LP5562 provides LED current settings. - The 'led_current' and 'max_current' are used. - -(Example of Platform data) - -To configure the platform specific data, lp55xx_platform_data structure is used. 
- -static struct lp55xx_led_config lp5562_led_config[] = { - { - .name = "R", - .chan_nr = 0, - .led_current = 20, - .max_current = 40, - }, - { - .name = "G", - .chan_nr = 1, - .led_current = 20, - .max_current = 40, - }, - { - .name = "B", - .chan_nr = 2, - .led_current = 20, - .max_current = 40, - }, - { - .name = "W", - .chan_nr = 3, - .led_current = 20, - .max_current = 40, - }, -}; - -static int lp5562_setup(void) -{ - /* setup HW resources */ -} - -static void lp5562_release(void) -{ - /* Release HW resources */ -} - -static void lp5562_enable(bool state) -{ - /* Control of chip enable signal */ -} - -static struct lp55xx_platform_data lp5562_platform_data = { - .led_config = lp5562_led_config, - .num_channels = ARRAY_SIZE(lp5562_led_config), - .setup_resources = lp5562_setup, - .release_resources = lp5562_release, - .enable = lp5562_enable, -}; - -If the current is set to 0 in the platform data, that channel is -disabled and it is not visible in the sysfs. diff --git a/Documentation/leds/leds-lp55xx.rst b/Documentation/leds/leds-lp55xx.rst new file mode 100644 index 000000000000..632e41cec0b5 --- /dev/null +++ b/Documentation/leds/leds-lp55xx.rst @@ -0,0 +1,224 @@ +================================================= +LP5521/LP5523/LP55231/LP5562/LP8501 Common Driver +================================================= + +Authors: Milo(Woogyom) Kim + +Description +----------- +LP5521, LP5523/55231, LP5562 and LP8501 have common features as below. + + Register access via the I2C + Device initialization/deinitialization + Create LED class devices for multiple output channels + Device attributes for user-space interface + Program memory for running LED patterns + +The LP55xx common driver provides these features using exported functions. 
+ + lp55xx_init_device() / lp55xx_deinit_device() + lp55xx_register_leds() / lp55xx_unregister_leds() + lp55xx_register_sysfs() / lp55xx_unregister_sysfs() + +( Driver Structure Data ) + +In lp55xx common driver, two different data structures are used. + +* lp55xx_led + control multi output LED channels such as led current, channel index. +* lp55xx_chip + general chip control such as the I2C and platform data. + +For example, LP5521 has a maximum of 3 LED channels. +LP5523/55231 has 9 output channels:: + + lp55xx_chip for LP5521 ... lp55xx_led #1 + lp55xx_led #2 + lp55xx_led #3 + + lp55xx_chip for LP5523 ... lp55xx_led #1 + lp55xx_led #2 + . + . + lp55xx_led #9 + +( Chip Dependent Code ) + +To support device specific configurations, the special structure +'lp55xx_device_config' is used. + + - Maximum number of channels + - Reset command, chip enable command + - Chip specific initialization + - Brightness control register access + - Setting LED output current + - Program memory address access for running patterns + - Additional device specific attributes + +( Firmware Interface ) + +LP55xx family devices have the internal program memory for running +various LED patterns. + +This pattern data is saved as a file in the user-land or +hex byte string is written into the memory through the I2C. + +LP55xx common driver supports the firmware interface. + +LP55xx chips have three program engines. + +To load and run the pattern, the programming sequence is as follows. + + (1) Select an engine number (1/2/3) + (2) Mode change to load + (3) Write pattern data into selected area + (4) Mode change to run + +The LP55xx common driver provides simple interfaces as below. + +select_engine: + Select which engine is used for running program +run_engine: + Start program which is loaded via the firmware interface +firmware: + Load program data + +In case of LP5523, one more command is required, 'enginex_leds'. +It is used for selecting LED output(s) at each engine number.
+For more details, please refer to 'leds-lp5523.txt'. + +For example, run blinking pattern in engine #1 of LP5521:: + + echo 1 > /sys/bus/i2c/devices/xxxx/select_engine + echo 1 > /sys/class/firmware/lp5521/loading + echo "4000600040FF6000" > /sys/class/firmware/lp5521/data + echo 0 > /sys/class/firmware/lp5521/loading + echo 1 > /sys/bus/i2c/devices/xxxx/run_engine + +For example, run blinking pattern in engine #3 of LP55231 + +Two LEDs are configured as pattern output channels:: + + echo 3 > /sys/bus/i2c/devices/xxxx/select_engine + echo 1 > /sys/class/firmware/lp55231/loading + echo "9d0740ff7e0040007e00a0010000" > /sys/class/firmware/lp55231/data + echo 0 > /sys/class/firmware/lp55231/loading + echo "000001100" > /sys/bus/i2c/devices/xxxx/engine3_leds + echo 1 > /sys/bus/i2c/devices/xxxx/run_engine + +To start blinking patterns in engine #2 and #3 simultaneously:: + + for idx in 2 3 + do + echo $idx > /sys/class/leds/red/device/select_engine + sleep 0.1 + echo 1 > /sys/class/firmware/lp5521/loading + echo "4000600040FF6000" > /sys/class/firmware/lp5521/data + echo 0 > /sys/class/firmware/lp5521/loading + done + echo 1 > /sys/class/leds/red/device/run_engine + +Here is another example for LP5523. + +Full LED strings are selected by 'engine2_leds':: + + echo 2 > /sys/bus/i2c/devices/xxxx/select_engine + echo 1 > /sys/class/firmware/lp5523/loading + echo "9d80400004ff05ff437f0000" > /sys/class/firmware/lp5523/data + echo 0 > /sys/class/firmware/lp5523/loading + echo "111111111" > /sys/bus/i2c/devices/xxxx/engine2_leds + echo 1 > /sys/bus/i2c/devices/xxxx/run_engine + +As soon as 'loading' is set to 0, the registered callback is called. +Inside the callback, the selected engine is loaded and memory is updated. +To run the programmed pattern, 'run_engine' attribute should be enabled. + +The pattern sequence of LP8501 is similar to LP5523. + +However, the pattern data is specific.
+ +Ex 1) Engine 1 is used:: + + echo 1 > /sys/bus/i2c/devices/xxxx/select_engine + echo 1 > /sys/class/firmware/lp8501/loading + echo "9d0140ff7e0040007e00a001c000" > /sys/class/firmware/lp8501/data + echo 0 > /sys/class/firmware/lp8501/loading + echo 1 > /sys/bus/i2c/devices/xxxx/run_engine + +Ex 2) Engine 2 and 3 are used at the same time:: + + echo 2 > /sys/bus/i2c/devices/xxxx/select_engine + sleep 1 + echo 1 > /sys/class/firmware/lp8501/loading + echo "9d0140ff7e0040007e00a001c000" > /sys/class/firmware/lp8501/data + echo 0 > /sys/class/firmware/lp8501/loading + sleep 1 + echo 3 > /sys/bus/i2c/devices/xxxx/select_engine + sleep 1 + echo 1 > /sys/class/firmware/lp8501/loading + echo "9d0340ff7e0040007e00a001c000" > /sys/class/firmware/lp8501/data + echo 0 > /sys/class/firmware/lp8501/loading + sleep 1 + echo 1 > /sys/class/leds/d1/device/run_engine + +( 'run_engine' and 'firmware_cb' ) + +The sequence of running the program data is common. + +But each device has own specific register addresses for commands. + +To support this, 'run_engine' and 'firmware_cb' are configurable in each driver. + +run_engine: + Control the selected engine +firmware_cb: + The callback function after loading the firmware is done. + + Chip specific commands for loading and updating program memory. + +( Predefined pattern data ) + +Without the firmware interface, LP55xx driver provides another method for +loading a LED pattern. That is 'predefined' pattern. + +A predefined pattern is defined in the platform data and load it(or them) +via the sysfs if needed. + +To use the predefined pattern concept, 'patterns' and 'num_patterns' should be +configured. 
+ +Example of predefined pattern data:: + + /* mode_1: blinking data */ + static const u8 mode_1[] = { + 0x40, 0x00, 0x60, 0x00, 0x40, 0xFF, 0x60, 0x00, + }; + + /* mode_2: always on */ + static const u8 mode_2[] = { 0x40, 0xFF, }; + + struct lp55xx_predef_pattern board_led_patterns[] = { + { + .r = mode_1, + .size_r = ARRAY_SIZE(mode_1), + }, + { + .b = mode_2, + .size_b = ARRAY_SIZE(mode_2), + }, + }; + + struct lp55xx_platform_data lp5562_pdata = { + ... + .patterns = board_led_patterns, + .num_patterns = ARRAY_SIZE(board_led_patterns), + }; + +Then, mode_1 and mode_2 can be run through the sysfs:: + + echo 1 > /sys/bus/i2c/devices/xxxx/led_pattern # red blinking LED pattern + echo 2 > /sys/bus/i2c/devices/xxxx/led_pattern # blue LED always on + +To stop the running pattern:: + + echo 0 > /sys/bus/i2c/devices/xxxx/led_pattern diff --git a/Documentation/leds/leds-lp55xx.txt b/Documentation/leds/leds-lp55xx.txt deleted file mode 100644 index e23fa91ea722..000000000000 --- a/Documentation/leds/leds-lp55xx.txt +++ /dev/null @@ -1,194 +0,0 @@ -LP5521/LP5523/LP55231/LP5562/LP8501 Common Driver -================================================= - -Authors: Milo(Woogyom) Kim - -Description ------------ -LP5521, LP5523/55231, LP5562 and LP8501 have common features as below. - - Register access via the I2C - Device initialization/deinitialization - Create LED class devices for multiple output channels - Device attributes for user-space interface - Program memory for running LED patterns - -The LP55xx common driver provides these features using exported functions. - lp55xx_init_device() / lp55xx_deinit_device() - lp55xx_register_leds() / lp55xx_unregister_leds() - lp55xx_regsister_sysfs() / lp55xx_unregister_sysfs() - -( Driver Structure Data ) - -In lp55xx common driver, two different data structure is used. - -o lp55xx_led - control multi output LED channels such as led current, channel index. -o lp55xx_chip - general chip control such like the I2C and platform data.
- -For example, LP5521 has maximum 3 LED channels. -LP5523/55231 has 9 output channels. - -lp55xx_chip for LP5521 ... lp55xx_led #1 - lp55xx_led #2 - lp55xx_led #3 - -lp55xx_chip for LP5523 ... lp55xx_led #1 - lp55xx_led #2 - . - . - lp55xx_led #9 - -( Chip Dependent Code ) - -To support device specific configurations, special structure -'lpxx_device_config' is used. - - Maximum number of channels - Reset command, chip enable command - Chip specific initialization - Brightness control register access - Setting LED output current - Program memory address access for running patterns - Additional device specific attributes - -( Firmware Interface ) - -LP55xx family devices have the internal program memory for running -various LED patterns. -This pattern data is saved as a file in the user-land or -hex byte string is written into the memory through the I2C. -LP55xx common driver supports the firmware interface. - -LP55xx chips have three program engines. -To load and run the pattern, the programming sequence is following. - (1) Select an engine number (1/2/3) - (2) Mode change to load - (3) Write pattern data into selected area - (4) Mode change to run - -The LP55xx common driver provides simple interfaces as below. -select_engine : Select which engine is used for running program -run_engine : Start program which is loaded via the firmware interface -firmware : Load program data - -In case of LP5523, one more command is required, 'enginex_leds'. -It is used for selecting LED output(s) at each engine number. -In more details, please refer to 'leds-lp5523.txt'. 
- -For example, run blinking pattern in engine #1 of LP5521 -echo 1 > /sys/bus/i2c/devices/xxxx/select_engine -echo 1 > /sys/class/firmware/lp5521/loading -echo "4000600040FF6000" > /sys/class/firmware/lp5521/data -echo 0 > /sys/class/firmware/lp5521/loading -echo 1 > /sys/bus/i2c/devices/xxxx/run_engine - -For example, run blinking pattern in engine #3 of LP55231 -Two LEDs are configured as pattern output channels. -echo 3 > /sys/bus/i2c/devices/xxxx/select_engine -echo 1 > /sys/class/firmware/lp55231/loading -echo "9d0740ff7e0040007e00a0010000" > /sys/class/firmware/lp55231/data -echo 0 > /sys/class/firmware/lp55231/loading -echo "000001100" > /sys/bus/i2c/devices/xxxx/engine3_leds -echo 1 > /sys/bus/i2c/devices/xxxx/run_engine - -To start blinking patterns in engine #2 and #3 simultaneously, -for idx in 2 3 -do - echo $idx > /sys/class/leds/red/device/select_engine - sleep 0.1 - echo 1 > /sys/class/firmware/lp5521/loading - echo "4000600040FF6000" > /sys/class/firmware/lp5521/data - echo 0 > /sys/class/firmware/lp5521/loading -done -echo 1 > /sys/class/leds/red/device/run_engine - -Here is another example for LP5523. -Full LED strings are selected by 'engine2_leds'. -echo 2 > /sys/bus/i2c/devices/xxxx/select_engine -echo 1 > /sys/class/firmware/lp5523/loading -echo "9d80400004ff05ff437f0000" > /sys/class/firmware/lp5523/data -echo 0 > /sys/class/firmware/lp5523/loading -echo "111111111" > /sys/bus/i2c/devices/xxxx/engine2_leds -echo 1 > /sys/bus/i2c/devices/xxxx/run_engine - -As soon as 'loading' is set to 0, registered callback is called. -Inside the callback, the selected engine is loaded and memory is updated. -To run programmed pattern, 'run_engine' attribute should be enabled. - -The pattern sequence of LP8501 is similar to LP5523. -However pattern data is specific. 
-Ex 1) Engine 1 is used -echo 1 > /sys/bus/i2c/devices/xxxx/select_engine -echo 1 > /sys/class/firmware/lp8501/loading -echo "9d0140ff7e0040007e00a001c000" > /sys/class/firmware/lp8501/data -echo 0 > /sys/class/firmware/lp8501/loading -echo 1 > /sys/bus/i2c/devices/xxxx/run_engine - -Ex 2) Engine 2 and 3 are used at the same time -echo 2 > /sys/bus/i2c/devices/xxxx/select_engine -sleep 1 -echo 1 > /sys/class/firmware/lp8501/loading -echo "9d0140ff7e0040007e00a001c000" > /sys/class/firmware/lp8501/data -echo 0 > /sys/class/firmware/lp8501/loading -sleep 1 -echo 3 > /sys/bus/i2c/devices/xxxx/select_engine -sleep 1 -echo 1 > /sys/class/firmware/lp8501/loading -echo "9d0340ff7e0040007e00a001c000" > /sys/class/firmware/lp8501/data -echo 0 > /sys/class/firmware/lp8501/loading -sleep 1 -echo 1 > /sys/class/leds/d1/device/run_engine - -( 'run_engine' and 'firmware_cb' ) -The sequence of running the program data is common. -But each device has own specific register addresses for commands. -To support this, 'run_engine' and 'firmware_cb' are configurable in each driver. -run_engine : Control the selected engine -firmware_cb : The callback function after loading the firmware is done. - Chip specific commands for loading and updating program memory. - -( Predefined pattern data ) - -Without the firmware interface, LP55xx driver provides another method for -loading a LED pattern. That is 'predefined' pattern. -A predefined pattern is defined in the platform data and load it(or them) -via the sysfs if needed. -To use the predefined pattern concept, 'patterns' and 'num_patterns' should be -configured. 
- - Example of predefined pattern data: - - /* mode_1: blinking data */ - static const u8 mode_1[] = { - 0x40, 0x00, 0x60, 0x00, 0x40, 0xFF, 0x60, 0x00, - }; - - /* mode_2: always on */ - static const u8 mode_2[] = { 0x40, 0xFF, }; - - struct lp55xx_predef_pattern board_led_patterns[] = { - { - .r = mode_1, - .size_r = ARRAY_SIZE(mode_1), - }, - { - .b = mode_2, - .size_b = ARRAY_SIZE(mode_2), - }, - } - - struct lp55xx_platform_data lp5562_pdata = { - ... - .patterns = board_led_patterns, - .num_patterns = ARRAY_SIZE(board_led_patterns), - }; - -Then, mode_1 and mode_2 can be run via through the sysfs. - - echo 1 > /sys/bus/i2c/devices/xxxx/led_pattern # red blinking LED pattern - echo 2 > /sys/bus/i2c/devices/xxxx/led_pattern # blue LED always on - -To stop running pattern, - echo 0 > /sys/bus/i2c/devices/xxxx/led_pattern diff --git a/Documentation/leds/leds-mlxcpld.rst b/Documentation/leds/leds-mlxcpld.rst new file mode 100644 index 000000000000..528582429e0b --- /dev/null +++ b/Documentation/leds/leds-mlxcpld.rst @@ -0,0 +1,118 @@ +======================================= +Kernel driver for Mellanox systems LEDs +======================================= + +Provide system LED support for the nex Mellanox systems: +"msx6710", "msx6720", "msb7700", "msn2700", "msx1410", +"msn2410", "msb7800", "msn2740", "msn2100". 
+ +Description +----------- +Driver provides the following LEDs for the systems "msx6710", "msx6720", +"msb7700", "msn2700", "msx1410", "msn2410", "msb7800", "msn2740": + + - mlxcpld:fan1:green + - mlxcpld:fan1:red + - mlxcpld:fan2:green + - mlxcpld:fan2:red + - mlxcpld:fan3:green + - mlxcpld:fan3:red + - mlxcpld:fan4:green + - mlxcpld:fan4:red + - mlxcpld:psu:green + - mlxcpld:psu:red + - mlxcpld:status:green + - mlxcpld:status:red + + "status" + - CPLD reg offset: 0x20 + - Bits [3:0] + + "psu" + - CPLD reg offset: 0x20 + - Bits [7:4] + + "fan1" + - CPLD reg offset: 0x21 + - Bits [3:0] + + "fan2" + - CPLD reg offset: 0x21 + - Bits [7:4] + + "fan3" + - CPLD reg offset: 0x22 + - Bits [3:0] + + "fan4" + - CPLD reg offset: 0x22 + - Bits [7:4] + + Color mask for all the above LEDs: + + [bit3,bit2,bit1,bit0] or + [bit7,bit6,bit5,bit4]: + + - [0,0,0,0] = LED OFF + - [0,1,0,1] = Red static ON + - [1,1,0,1] = Green static ON + - [0,1,1,0] = Red blink 3Hz + - [1,1,1,0] = Green blink 3Hz + - [0,1,1,1] = Red blink 6Hz + - [1,1,1,1] = Green blink 6Hz + +Driver provides the following LEDs for the system "msn2100": + + - mlxcpld:fan:green + - mlxcpld:fan:red + - mlxcpld:psu1:green + - mlxcpld:psu1:red + - mlxcpld:psu2:green + - mlxcpld:psu2:red + - mlxcpld:status:green + - mlxcpld:status:red + - mlxcpld:uid:blue + + "status" + - CPLD reg offset: 0x20 + - Bits [3:0] + + "fan" + - CPLD reg offset: 0x21 + - Bits [3:0] + + "psu1" + - CPLD reg offset: 0x23 + - Bits [3:0] + + "psu2" + - CPLD reg offset: 0x23 + - Bits [7:4] + + "uid" + - CPLD reg offset: 0x24 + - Bits [3:0] + + Color mask for all the above LEDs, except uid: + + [bit3,bit2,bit1,bit0] or + [bit7,bit6,bit5,bit4]: + + - [0,0,0,0] = LED OFF + - [0,1,0,1] = Red static ON + - [1,1,0,1] = Green static ON + - [0,1,1,0] = Red blink 3Hz + - [1,1,1,0] = Green blink 3Hz + - [0,1,1,1] = Red blink 6Hz + - [1,1,1,1] = Green blink 6Hz + + Color mask for uid LED: + [bit3,bit2,bit1,bit0]: + + - [0,0,0,0] = LED OFF + - [1,1,0,1] = Blue
static ON + - [1,1,1,0] = Blue blink 3Hz + - [1,1,1,1] = Blue blink 6Hz + +Driver supports HW blinking at 3Hz and 6Hz frequency (50% duty cycle). +For 3Hz the duty cycle is about 167 msec, for 6Hz it is about 83 msec. diff --git a/Documentation/leds/leds-mlxcpld.txt b/Documentation/leds/leds-mlxcpld.txt deleted file mode 100644 index a0e8fd457117..000000000000 --- a/Documentation/leds/leds-mlxcpld.txt +++ /dev/null @@ -1,110 +0,0 @@ -Kernel driver for Mellanox systems LEDs -======================================= - -Provide system LED support for the nex Mellanox systems: -"msx6710", "msx6720", "msb7700", "msn2700", "msx1410", -"msn2410", "msb7800", "msn2740", "msn2100". - -Description ------------ -Driver provides the following LEDs for the systems "msx6710", "msx6720", -"msb7700", "msn2700", "msx1410", "msn2410", "msb7800", "msn2740": - mlxcpld:fan1:green - mlxcpld:fan1:red - mlxcpld:fan2:green - mlxcpld:fan2:red - mlxcpld:fan3:green - mlxcpld:fan3:red - mlxcpld:fan4:green - mlxcpld:fan4:red - mlxcpld:psu:green - mlxcpld:psu:red - mlxcpld:status:green - mlxcpld:status:red - - "status" - CPLD reg offset: 0x20 - Bits [3:0] - - "psu" - CPLD reg offset: 0x20 - Bits [7:4] - - "fan1" - CPLD reg offset: 0x21 - Bits [3:0] - - "fan2" - CPLD reg offset: 0x21 - Bits [7:4] - - "fan3" - CPLD reg offset: 0x22 - Bits [3:0] - - "fan4" - CPLD reg offset: 0x22 - Bits [7:4] - - Color mask for all the above LEDs: - [bit3,bit2,bit1,bit0] or - [bit7,bit6,bit5,bit4]: - [0,0,0,0] = LED OFF - [0,1,0,1] = Red static ON - [1,1,0,1] = Green static ON - [0,1,1,0] = Red blink 3Hz - [1,1,1,0] = Green blink 3Hz - [0,1,1,1] = Red blink 6Hz - [1,1,1,1] = Green blink 6Hz - -Driver provides the following LEDs for the system "msn2100": - mlxcpld:fan:green - mlxcpld:fan:red - mlxcpld:psu1:green - mlxcpld:psu1:red - mlxcpld:psu2:green - mlxcpld:psu2:red - mlxcpld:status:green - mlxcpld:status:red - mlxcpld:uid:blue - - "status" - CPLD reg offset: 0x20 - Bits [3:0] - - "fan" - CPLD reg offset: 0x21 - Bits
[3:0] - - "psu1" - CPLD reg offset: 0x23 - Bits [3:0] - - "psu2" - CPLD reg offset: 0x23 - Bits [7:4] - - "uid" - CPLD reg offset: 0x24 - Bits [3:0] - - Color mask for all the above LEDs, excepted uid: - [bit3,bit2,bit1,bit0] or - [bit7,bit6,bit5,bit4]: - [0,0,0,0] = LED OFF - [0,1,0,1] = Red static ON - [1,1,0,1] = Green static ON - [0,1,1,0] = Red blink 3Hz - [1,1,1,0] = Green blink 3Hz - [0,1,1,1] = Red blink 6Hz - [1,1,1,1] = Green blink 6Hz - - Color mask for uid LED: - [bit3,bit2,bit1,bit0]: - [0,0,0,0] = LED OFF - [1,1,0,1] = Blue static ON - [1,1,1,0] = Blue blink 3Hz - [1,1,1,1] = Blue blink 6Hz - -Driver supports HW blinking at 3Hz and 6Hz frequency (50% duty cycle). -For 3Hz duty cylce is about 167 msec, for 6Hz is about 83 msec. diff --git a/Documentation/leds/ledtrig-oneshot.txt b/Documentation/leds/ledtrig-oneshot.rst similarity index 90% rename from Documentation/leds/ledtrig-oneshot.txt rename to Documentation/leds/ledtrig-oneshot.rst index fe57474a12e2..69fa3ea1d554 100644 --- a/Documentation/leds/ledtrig-oneshot.txt +++ b/Documentation/leds/ledtrig-oneshot.rst @@ -1,3 +1,4 @@ +==================== One-shot LED Trigger ==================== @@ -17,27 +18,27 @@ additional "invert" property specifies if the LED has to stay off (normal) or on (inverted) when not rearmed. 
The trigger can be activated from user space on led class devices as shown -below: +below:: echo oneshot > trigger This adds sysfs attributes to the LED that are documented in: Documentation/ABI/testing/sysfs-class-led-trigger-oneshot -Example use-case: network devices, initialization: +Example use-case: network devices, initialization:: echo oneshot > trigger # set trigger for this led echo 33 > delay_on # blink at 1 / (33 + 33) Hz on continuous traffic echo 33 > delay_off -interface goes up: +interface goes up:: echo 1 > invert # set led as normally-on, turn the led on -packet received/transmitted: +packet received/transmitted:: echo 1 > shot # led starts blinking, ignored if already blinking -interface goes down +interface goes down:: echo 0 > invert # set led as normally-off, turn the led off diff --git a/Documentation/leds/ledtrig-transient.txt b/Documentation/leds/ledtrig-transient.rst similarity index 81% rename from Documentation/leds/ledtrig-transient.txt rename to Documentation/leds/ledtrig-transient.rst index 3bd38b487df1..d921dc830cd0 100644 --- a/Documentation/leds/ledtrig-transient.txt +++ b/Documentation/leds/ledtrig-transient.rst @@ -1,3 +1,4 @@ +===================== LED Transient Trigger ===================== @@ -62,12 +63,13 @@ non-transient state. When driver gets suspended, irrespective of the transient state, the LED state changes to LED_OFF. Transient trigger can be enabled and disabled from user space on led class -devices, that support this trigger as shown below: +devices, that support this trigger as shown below:: -echo transient > trigger -echo none > trigger + echo transient > trigger + echo none > trigger -NOTE: Add a new property trigger state to control the state. +NOTE: + Add a new property trigger state to control the state. This trigger exports three properties, activate, state, and duration. When transient trigger is activated these properties are set to default values. 
@@ -79,7 +81,8 @@ transient trigger is activated these properties are set to default values. - state allows user to specify a transient state to be held for the specified duration. - activate - one shot timer activate mechanism. + activate + - one shot timer activate mechanism. 1 when activated, 0 when deactivated. default value is zero when transient trigger is enabled, to allow duration to be set. @@ -89,12 +92,14 @@ transient trigger is activated these properties are set to default values. deactivated state indicates that there is no active timer running. - duration - one shot timer value. When activate is set, duration value + duration + - one shot timer value. When activate is set, duration value is used to start a timer that runs once. This value doesn't get changed by the trigger unless user does a set via echo new_value > duration - state - transient state to be held. It has two values 0 or 1. 0 maps + state + - transient state to be held. It has two values 0 or 1. 0 maps to LED_OFF and 1 maps to LED_FULL. The specified state is held for the duration of the one shot timer and then the state gets changed to the non-transient state which is the @@ -114,39 +119,49 @@ When timer expires activate goes back to deactivated state, duration is left at the set value to be used when activate is set at a future time. This will allow user app to set the time once and activate it to run it once for the specified value as needed. When timer expires, state is restored to the -non-transient state which is the inverse of the transient state. +non-transient state which is the inverse of the transient state: - echo 1 > activate - starts timer = duration when duration is not 0. - echo 0 > activate - cancels currently running timer. - echo n > duration - stores timer value to be used upon next - activate. Currently active timer if - any, continues to run for the specified time. - echo 0 > duration - stores timer value to be used upon next - activate. 
Currently active timer if any, - continues to run for the specified time. - echo 1 > state - stores desired transient state LED_FULL to be + ================= =============================================== + echo 1 > activate starts timer = duration when duration is not 0. + echo 0 > activate cancels currently running timer. + echo n > duration stores timer value to be used upon next + activate. Currently active timer if + any, continues to run for the specified time. + echo 0 > duration stores timer value to be used upon next + activate. Currently active timer if any, + continues to run for the specified time. + echo 1 > state stores desired transient state LED_FULL to be held for the specified duration. - echo 0 > state - stores desired transient state LED_OFF to be + echo 0 > state stores desired transient state LED_OFF to be held for the specified duration. + ================= =============================================== + +What is not supported +===================== -What is not supported: -====================== - Timer activation is one shot and extending and/or shortening the timer is not supported. -Example use-case 1: +Examples +======== + +use-case 1:: + echo transient > trigger echo n > duration echo 1 > state -repeat the following step as needed: + +repeat the following step as needed:: + echo 1 > activate - start timer = duration to run once echo 1 > activate - start timer = duration to run once echo none > trigger This trigger is intended to be used for for the following example use cases: + - Control of vibrate (phones, tablets etc.) hardware by user space app. - Use of LED by user space app as activity indicator. - Use of LED by user space app as a kind of watchdog indicator -- as - long as the app is alive, it can keep the LED illuminated, if it dies - the LED will be extinguished automatically. + long as the app is alive, it can keep the LED illuminated, if it dies + the LED will be extinguished automatically. 
- Use by any user space app that needs a transient GPIO output. diff --git a/Documentation/leds/ledtrig-usbport.txt b/Documentation/leds/ledtrig-usbport.rst similarity index 86% rename from Documentation/leds/ledtrig-usbport.txt rename to Documentation/leds/ledtrig-usbport.rst index 69f54bfb4789..37c2505bfd57 100644 --- a/Documentation/leds/ledtrig-usbport.txt +++ b/Documentation/leds/ledtrig-usbport.rst @@ -1,3 +1,4 @@ +==================== USB port LED trigger ==================== @@ -10,14 +11,18 @@ listed as separated entries in a "ports" subdirectory. Selecting is handled by echoing "1" to a chosen port. Please note that this trigger allows selecting multiple USB ports for a single -LED. This can be useful in two cases: +LED. + +This can be useful in two cases: 1) Device with single USB LED and few physical ports +==================================================== In such a case LED will be turned on as long as there is at least one connected USB device. 2) Device with a physical port handled by few controllers +========================================================= Some devices may have one controller per PHY standard. E.g. USB 3.0 physical port may be handled by ohci-platform, ehci-platform and xhci-hcd. If there is @@ -25,14 +30,14 @@ only one LED user will most likely want to assign ports from all 3 hubs. 
This trigger can be activated from user space on led class devices as shown -below: +below:: echo usbport > trigger This adds sysfs attributes to the LED that are documented in: Documentation/ABI/testing/sysfs-class-led-trigger-usbport -Example use-case: +Example use-case:: echo usbport > trigger echo 1 > ports/usb1-port1 diff --git a/Documentation/leds/uleds.txt b/Documentation/leds/uleds.rst similarity index 95% rename from Documentation/leds/uleds.txt rename to Documentation/leds/uleds.rst index 13e375a580f9..83221098009c 100644 --- a/Documentation/leds/uleds.txt +++ b/Documentation/leds/uleds.rst @@ -1,3 +1,4 @@ +============== Userspace LEDs ============== @@ -10,12 +11,12 @@ Usage When the driver is loaded, a character device is created at /dev/uleds. To create a new LED class device, open /dev/uleds and write a uleds_user_dev -structure to it (found in kernel public header file linux/uleds.h). +structure to it (found in kernel public header file linux/uleds.h):: #define LED_MAX_NAME_SIZE 64 struct uleds_user_dev { - char name[LED_MAX_NAME_SIZE]; + char name[LED_MAX_NAME_SIZE]; }; A new LED class device will be created with the name given. The name can be diff --git a/MAINTAINERS b/MAINTAINERS index ddd526efcb46..6696779c0826 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -10045,7 +10045,7 @@ L: linux-leds@vger.kernel.org S: Supported F: drivers/leds/leds-mlxcpld.c F: drivers/leds/leds-mlxreg.c -F: Documentation/leds/leds-mlxcpld.txt +F: Documentation/leds/leds-mlxcpld.rst MELLANOX PLATFORM DRIVER M: Vadim Pasternak diff --git a/drivers/leds/trigger/Kconfig b/drivers/leds/trigger/Kconfig index 23cc85e2e0e5..24e36eef95d3 100644 --- a/drivers/leds/trigger/Kconfig +++ b/drivers/leds/trigger/Kconfig @@ -14,7 +14,7 @@ config LEDS_TRIGGER_TIMER This allows LEDs to be controlled by a programmable timer via sysfs. Some LED hardware can be programmed to start blinking the LED without any further software interaction. 
- For more details read Documentation/leds/leds-class.txt. + For more details read Documentation/leds/leds-class.rst. If unsure, say Y. diff --git a/drivers/leds/trigger/ledtrig-transient.c b/drivers/leds/trigger/ledtrig-transient.c index a80bb82aacc2..80635183fac8 100644 --- a/drivers/leds/trigger/ledtrig-transient.c +++ b/drivers/leds/trigger/ledtrig-transient.c @@ -3,7 +3,7 @@ // LED Kernel Transient Trigger // // Transient trigger allows one shot timer activation. Please refer to -// Documentation/leds/ledtrig-transient.txt for details +// Documentation/leds/ledtrig-transient.rst for details // Copyright (C) 2012 Shuah Khan // // Based on Richard Purdie's ledtrig-timer.c and Atsushi Nemoto's diff --git a/net/netfilter/Kconfig b/net/netfilter/Kconfig index 1f4a4d9f80b4..21b13e75b0a9 100644 --- a/net/netfilter/Kconfig +++ b/net/netfilter/Kconfig @@ -905,7 +905,7 @@ config NETFILTER_XT_TARGET_LED echo netfilter-ssh > /sys/class/leds//trigger For more information on the LEDs available on your system, see - Documentation/leds/leds-class.txt + Documentation/leds/leds-class.rst config NETFILTER_XT_TARGET_LOG tristate "LOG target support" From patchwork Mon Apr 22 13:28:07 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Mauro Carvalho Chehab X-Patchwork-Id: 1088699 X-Patchwork-Delegate: davem@davemloft.net Return-Path: X-Original-To: patchwork-incoming-netdev@ozlabs.org Delivered-To: patchwork-incoming-netdev@ozlabs.org Authentication-Results: ozlabs.org; spf=none (mailfrom) smtp.mailfrom=vger.kernel.org (client-ip=209.132.180.67; helo=vger.kernel.org; envelope-from=netdev-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dmarc=fail (p=none dis=none) header.from=kernel.org Authentication-Results: ozlabs.org; dkim=fail reason="signature verification failed" (2048-bit key; unprotected) header.d=infradead.org header.i=@infradead.org header.b="GQOj8t3g"; dkim-atps=neutral Received: 
from vger.kernel.org (vger.kernel.org [209.132.180.67]) by ozlabs.org (Postfix) with ESMTP id 44nnZJ3lM2z9sP8 for ; Mon, 22 Apr 2019 23:32:40 +1000 (AEST) From: Mauro Carvalho Chehab To: Linux Doc Mailing List Cc: Mauro Carvalho Chehab , Mauro Carvalho Chehab , linux-kernel@vger.kernel.org, Jonathan Corbet , "David S.
Miller" , netdev@vger.kernel.org, linux-mm@kvack.org Subject: [PATCH v2 78/79] docs: sysctl: convert to ReST Date: Mon, 22 Apr 2019 10:28:07 -0300 Message-Id: X-Mailer: git-send-email 2.20.1 In-Reply-To: References: MIME-Version: 1.0 Sender: netdev-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org Rename the /proc/sys/ documentation files to ReST, using the README file as a template for an index.rst, adding the other files there via TOC markup. Although the files were written at different times and in different styles, try to make them somewhat coherent, with a similar look and feel, ensuring that they look nice both as raw text files and in the HTML output produced by the Sphinx build system. In the new index.rst, add :orphan: while the file is not yet linked to the main index.rst file, in order to avoid build warnings. Signed-off-by: Mauro Carvalho Chehab --- .../admin-guide/kernel-parameters.txt | 2 +- Documentation/admin-guide/mm/index.rst | 2 +- Documentation/admin-guide/mm/ksm.rst | 2 +- Documentation/core-api/printk-formats.rst | 2 +- Documentation/networking/ip-sysctl.txt | 2 +- Documentation/sysctl/abi.rst | 67 ++++ Documentation/sysctl/abi.txt | 54 --- Documentation/sysctl/{fs.txt => fs.rst} | 141 ++++--- Documentation/sysctl/{README => index.rst} | 36 +- .../sysctl/{kernel.txt => kernel.rst} | 374 ++++++++++-------- Documentation/sysctl/{net.txt => net.rst} | 141 ++++--- .../sysctl/{sunrpc.txt => sunrpc.rst} | 13 +- Documentation/sysctl/{user.txt => user.rst} | 32 +- Documentation/sysctl/{vm.txt => vm.rst} | 258 ++++++------ Documentation/vm/unevictable-lru.rst | 2 +- kernel/panic.c | 2 +- mm/swap.c | 2 +- 17 files changed, 651 insertions(+), 481 deletions(-) create mode 100644 Documentation/sysctl/abi.rst delete mode 100644 Documentation/sysctl/abi.txt rename Documentation/sysctl/{fs.txt => fs.rst} (77%) rename Documentation/sysctl/{README => index.rst} (78%) rename Documentation/sysctl/{kernel.txt => kernel.rst} (79%) 
rename Documentation/sysctl/{net.txt => net.rst} (85%) rename Documentation/sysctl/{sunrpc.txt => sunrpc.rst} (62%) rename Documentation/sysctl/{user.txt => user.rst} (77%) rename Documentation/sysctl/{vm.txt => vm.rst} (85%) diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index a6297aff5598..30689c08bdc3 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -3074,7 +3074,7 @@ numa_zonelist_order= [KNL, BOOT] Select zonelist order for NUMA. 'node', 'default' can be specified This can be set from sysctl after boot. - See Documentation/sysctl/vm.txt for details. + See Documentation/sysctl/vm.rst for details. ohci1394_dma=early [HW] enable debugging via the ohci1394 driver. See Documentation/debugging-via-ohci1394.rst for more diff --git a/Documentation/admin-guide/mm/index.rst b/Documentation/admin-guide/mm/index.rst index 8edb35f11317..ea33ca6a5aca 100644 --- a/Documentation/admin-guide/mm/index.rst +++ b/Documentation/admin-guide/mm/index.rst @@ -11,7 +11,7 @@ processes address space and many other cool things. Linux memory management is a complex system with many configurable settings. Most of these settings are available via ``/proc`` filesystem and can be quired and adjusted using ``sysctl``. These APIs -are described in Documentation/sysctl/vm.txt and in `man 5 proc`_. +are described in Documentation/sysctl/vm.rst and in `man 5 proc`_. .. _man 5 proc: http://man7.org/linux/man-pages/man5/proc.5.html diff --git a/Documentation/admin-guide/mm/ksm.rst b/Documentation/admin-guide/mm/ksm.rst index 9303786632d1..7b2b8767c0b4 100644 --- a/Documentation/admin-guide/mm/ksm.rst +++ b/Documentation/admin-guide/mm/ksm.rst @@ -59,7 +59,7 @@ MADV_UNMERGEABLE is applied to a range which was never MADV_MERGEABLE. 
If a region of memory must be split into at least one new MADV_MERGEABLE or MADV_UNMERGEABLE region, the madvise may return ENOMEM if the process -will exceed ``vm.max_map_count`` (see Documentation/sysctl/vm.txt). +will exceed ``vm.max_map_count`` (see Documentation/sysctl/vm.rst). Like other madvise calls, they are intended for use on mapped areas of the user address space: they will report ENOMEM if the specified range diff --git a/Documentation/core-api/printk-formats.rst b/Documentation/core-api/printk-formats.rst index c37ec7cd9c06..2222c5e56dfd 100644 --- a/Documentation/core-api/printk-formats.rst +++ b/Documentation/core-api/printk-formats.rst @@ -111,7 +111,7 @@ Kernel Pointers For printing kernel pointers which should be hidden from unprivileged users. The behaviour of %pK depends on the kptr_restrict sysctl - see -Documentation/sysctl/kernel.txt for more details. +Documentation/sysctl/kernel.rst for more details. Unmodified Addresses -------------------- diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt index e28f765da570..878e37fbd376 100644 --- a/Documentation/networking/ip-sysctl.txt +++ b/Documentation/networking/ip-sysctl.txt @@ -2222,7 +2222,7 @@ addr_scope_policy - INTEGER /proc/sys/net/core/* - Please see: Documentation/sysctl/net.txt for descriptions of these entries. + Please see: Documentation/sysctl/net.rst for descriptions of these entries. /proc/sys/net/unix/* diff --git a/Documentation/sysctl/abi.rst b/Documentation/sysctl/abi.rst new file mode 100644 index 000000000000..599bcde7f0b7 --- /dev/null +++ b/Documentation/sysctl/abi.rst @@ -0,0 +1,67 @@ +================================ +Documentation for /proc/sys/abi/ +================================ + +kernel version 2.6.0.test2 + +Copyright (c) 2003, Fabian Frederick + +For general info: index.rst. 
+ +------------------------------------------------------------------------------ + +This path is binary emulation relevant aka personality types aka abi. +When a process is executed, it's linked to an exec_domain whose +personality is defined using values available from /proc/sys/abi. +You can find further details about abi in include/linux/personality.h. + +Here are the files featuring in 2.6 kernel: + +- defhandler_coff +- defhandler_elf +- defhandler_lcall7 +- defhandler_libcso +- fake_utsname +- trace + +defhandler_coff +--------------- + +defined value: + PER_SCOSVR3:: + + 0x0003 | STICKY_TIMEOUTS | WHOLE_SECONDS | SHORT_INODE + +defhandler_elf +-------------- + +defined value: + PER_LINUX:: + + 0 + +defhandler_lcall7 +----------------- + +defined value: + PER_SVR4:: + + 0x0001 | STICKY_TIMEOUTS | MMAP_PAGE_ZERO, + +defhandler_libcso +----------------- + +defined value: + PER_SVR4:: + + 0x0001 | STICKY_TIMEOUTS | MMAP_PAGE_ZERO, + +fake_utsname +------------ + +Unused + +trace +----- + +Unused diff --git a/Documentation/sysctl/abi.txt b/Documentation/sysctl/abi.txt deleted file mode 100644 index 63f4ebcf652c..000000000000 --- a/Documentation/sysctl/abi.txt +++ /dev/null @@ -1,54 +0,0 @@ -Documentation for /proc/sys/abi/* kernel version 2.6.0.test2 - (c) 2003, Fabian Frederick - -For general info : README. - -============================================================== - -This path is binary emulation relevant aka personality types aka abi. -When a process is executed, it's linked to an exec_domain whose -personality is defined using values available from /proc/sys/abi. -You can find further details about abi in include/linux/personality.h. 
- -Here are the files featuring in 2.6 kernel : - -- defhandler_coff -- defhandler_elf -- defhandler_lcall7 -- defhandler_libcso -- fake_utsname -- trace - -=========================================================== -defhandler_coff: -defined value : -PER_SCOSVR3 -0x0003 | STICKY_TIMEOUTS | WHOLE_SECONDS | SHORT_INODE - -=========================================================== -defhandler_elf: -defined value : -PER_LINUX -0 - -=========================================================== -defhandler_lcall7: -defined value : -PER_SVR4 -0x0001 | STICKY_TIMEOUTS | MMAP_PAGE_ZERO, - -=========================================================== -defhandler_libsco: -defined value: -PER_SVR4 -0x0001 | STICKY_TIMEOUTS | MMAP_PAGE_ZERO, - -=========================================================== -fake_utsname: -Unused - -=========================================================== -trace: -Unused - -=========================================================== diff --git a/Documentation/sysctl/fs.txt b/Documentation/sysctl/fs.rst similarity index 77% rename from Documentation/sysctl/fs.txt rename to Documentation/sysctl/fs.rst index ebc679bcb2dc..94ad19e879c2 100644 --- a/Documentation/sysctl/fs.txt +++ b/Documentation/sysctl/fs.rst @@ -1,10 +1,16 @@ -Documentation for /proc/sys/fs/* kernel version 2.2.10 - (c) 1998, 1999, Rik van Riel - (c) 2009, Shen Feng +=============================== +Documentation for /proc/sys/fs/ +=============================== -For general info and legal blurb, please look in README. +kernel version 2.2.10 -============================================================== +Copyright (c) 1998, 1999, Rik van Riel + +Copyright (c) 2009, Shen Feng + +For general info and legal blurb, please look in index.rst. + +------------------------------------------------------------------------------ This file contains documentation for the sysctl files in /proc/sys/fs/ and is valid for Linux kernel version 2.2. 
@@ -16,9 +22,10 @@ system, it is advisable to read both documentation and source before actually making adjustments. 1. /proc/sys/fs ----------------------------------------------------------- +=============== Currently, these files are in /proc/sys/fs: + - aio-max-nr - aio-nr - dentry-state @@ -42,9 +49,9 @@ Currently, these files are in /proc/sys/fs: - super-max - super-nr -============================================================== -aio-nr & aio-max-nr: +aio-nr & aio-max-nr +------------------- aio-nr is the running total of the number of events specified on the io_setup system call for all currently active aio contexts. If aio-nr @@ -52,21 +59,20 @@ reaches aio-max-nr then io_setup will fail with EAGAIN. Note that raising aio-max-nr does not result in the pre-allocation or re-sizing of any kernel data structures. -============================================================== -dentry-state: +dentry-state +------------ -From linux/include/linux/dcache.h: --------------------------------------------------------------- -struct dentry_stat_t dentry_stat { +From linux/include/linux/dcache.h:: + + struct dentry_stat_t dentry_stat { int nr_dentry; int nr_unused; int age_limit; /* age in seconds */ int want_pages; /* pages requested by system */ int nr_negative; /* # of unused negative dentries */ int dummy; /* Reserved for future use */ -}; --------------------------------------------------------------- + }; Dentries are dynamically allocated and deallocated. @@ -84,9 +90,9 @@ negative dentries which do not map to any files. Instead, they help speeding up rejection of non-existing files provided by the users. -============================================================== -dquot-max & dquot-nr: +dquot-max & dquot-nr +-------------------- The file dquot-max shows the maximum number of cached disk quota entries. 
@@ -98,9 +104,9 @@ If the number of free cached disk quotas is very low and you have some awesome number of simultaneous system users, you might want to raise the limit. -============================================================== -file-max & file-nr: +file-max & file-nr +------------------ The value in file-max denotes the maximum number of file- handles that the Linux kernel will allocate. When you get lots @@ -119,18 +125,19 @@ used file handles. Attempts to allocate more file descriptors than file-max are reported with printk, look for "VFS: file-max limit reached". -============================================================== -nr_open: + +nr_open +------- This denotes the maximum number of file-handles a process can allocate. Default value is 1024*1024 (1048576) which should be enough for most machines. Actual limit depends on RLIMIT_NOFILE resource limit. -============================================================== -inode-max, inode-nr & inode-state: +inode-max, inode-nr & inode-state +--------------------------------- As with file handles, the kernel allocates the inode structures dynamically, but can't free them yet. @@ -157,9 +164,9 @@ preshrink is nonzero when the nr_inodes > inode-max and the system needs to prune the inode list instead of allocating more. -============================================================== -overflowgid & overflowuid: +overflowgid & overflowuid +------------------------- Some filesystems only support 16-bit UIDs and GIDs, although in Linux UIDs and GIDs are 32 bits. When one of these filesystems is mounted @@ -169,18 +176,18 @@ to a fixed value before being written to disk. These sysctls allow you to change the value of the fixed UID and GID. The default is 65534. -============================================================== -pipe-user-pages-hard: +pipe-user-pages-hard +-------------------- Maximum total number of pages a non-privileged user may allocate for pipes. 
Once this limit is reached, no new pipes may be allocated until usage goes below the limit again. When set to 0, no limit is applied, which is the default setting. -============================================================== -pipe-user-pages-soft: +pipe-user-pages-soft +-------------------- Maximum total number of pages a non-privileged user may allocate for pipes before the pipe size gets limited to a single page. Once this limit is reached, @@ -190,9 +197,9 @@ denied until usage goes below the limit again. The default value allows to allocate up to 1024 pipes at their default size. When set to 0, no limit is applied. -============================================================== -protected_fifos: +protected_fifos +--------------- The intent of this protection is to avoid unintentional writes to an attacker-controlled FIFO, where a program expected to create a regular @@ -208,9 +215,9 @@ When set to "2" it also applies to group writable sticky directories. This protection is based on the restrictions in Openwall. -============================================================== -protected_hardlinks: +protected_hardlinks +-------------------- A long-standing class of security issues is the hardlink-based time-of-check-time-of-use race, most commonly seen in world-writable @@ -228,9 +235,9 @@ already own the source file, or do not have read/write access to it. This protection is based on the restrictions in Openwall and grsecurity. -============================================================== -protected_regular: +protected_regular +----------------- This protection is similar to protected_fifos, but it avoids writes to an attacker-controlled regular file, where a program @@ -244,9 +251,9 @@ owned by the owner of the directory. When set to "2" it also applies to group writable sticky directories. 
-============================================================== -protected_symlinks: +protected_symlinks +------------------ A long-standing class of security issues is the symlink-based time-of-check-time-of-use race, most commonly seen in world-writable @@ -264,34 +271,38 @@ follower match, or when the directory owner matches the symlink's owner. This protection is based on the restrictions in Openwall and grsecurity. -============================================================== suid_dumpable: +-------------- This value can be used to query and set the core dump mode for setuid or otherwise protected/tainted binaries. The modes are -0 - (default) - traditional behaviour. Any process which has changed - privilege levels or is execute only will not be dumped. -1 - (debug) - all processes dump core when possible. The core dump is - owned by the current user and no security is applied. This is - intended for system debugging situations only. Ptrace is unchecked. - This is insecure as it allows regular users to examine the memory - contents of privileged processes. -2 - (suidsafe) - any binary which normally would not be dumped is dumped - anyway, but only if the "core_pattern" kernel sysctl is set to - either a pipe handler or a fully qualified path. (For more details - on this limitation, see CVE-2006-2451.) This mode is appropriate - when administrators are attempting to debug problems in a normal - environment, and either have a core dump pipe handler that knows - to treat privileged core dumps with care, or specific directory - defined for catching core dumps. If a core dump happens without - a pipe handler or fully qualifid path, a message will be emitted - to syslog warning about the lack of a correct setting. += ========== =============================================================== +0 (default) traditional behaviour. Any process which has changed + privilege levels or is execute only will not be dumped. +1 (debug) all processes dump core when possible. 
The core dump is + owned by the current user and no security is applied. This is + intended for system debugging situations only. + Ptrace is unchecked. + This is insecure as it allows regular users to examine the + memory contents of privileged processes. +2 (suidsafe) any binary which normally would not be dumped is dumped + anyway, but only if the "core_pattern" kernel sysctl is set to + either a pipe handler or a fully qualified path. (For more + details on this limitation, see CVE-2006-2451.) This mode is + appropriate when administrators are attempting to debug + problems in a normal environment, and either have a core dump + pipe handler that knows to treat privileged core dumps with + care, or specific directory defined for catching core dumps. + If a core dump happens without a pipe handler or fully + qualified path, a message will be emitted to syslog warning + about the lack of a correct setting. += ========== =============================================================== -============================================================== -super-max & super-nr: +super-max & super-nr +-------------------- These numbers control the maximum number of superblocks, and thus the maximum number of mounted filesystems the kernel @@ -299,33 +310,33 @@ can have. You only need to increase super-max if you need to mount more filesystems than the current value in super-max allows you to. -============================================================== -aio-nr & aio-max-nr: +aio-nr & aio-max-nr +------------------- aio-nr shows the current system-wide number of asynchronous io requests. aio-max-nr allows you to change the maximum value aio-nr can grow to. -============================================================== -mount-max: +mount-max +--------- This denotes the maximum number of mounts that may exist in a mount namespace. -============================================================== 2. 
/proc/sys/fs/binfmt_misc ----------------------------------------------------------- +=========================== Documentation for the files in /proc/sys/fs/binfmt_misc is in Documentation/admin-guide/binfmt-misc.rst. 3. /proc/sys/fs/mqueue - POSIX message queues filesystem ----------------------------------------------------------- +======================================================== + The "mqueue" filesystem provides the necessary kernel features to enable the creation of a user space library that implements the POSIX message queues @@ -356,7 +367,7 @@ the default message size value if attr parameter of mq_open(2) is NULL. If it exceed msgsize_max, the default value is initialized msgsize_max. 4. /proc/sys/fs/epoll - Configuration options for the epoll interface --------------------------------------------------------- +===================================================================== This directory contains configuration options for the epoll(7) interface. diff --git a/Documentation/sysctl/README b/Documentation/sysctl/index.rst similarity index 78% rename from Documentation/sysctl/README rename to Documentation/sysctl/index.rst index d5f24ab0ecc3..efbcde8c1c9c 100644 --- a/Documentation/sysctl/README +++ b/Documentation/sysctl/index.rst @@ -1,5 +1,12 @@ -Documentation for /proc/sys/ kernel version 2.2.10 - (c) 1998, 1999, Rik van Riel +:orphan: + +=========================== +Documentation for /proc/sys +=========================== + +Copyright (c) 1998, 1999, Rik van Riel + +------------------------------------------------------------------------------ 'Why', I hear you ask, 'would anyone even _want_ documentation for them sysctl files? If anybody really needs it, it's all in @@ -12,11 +19,12 @@ have the time or knowledge to read the source code. 
Furthermore, the programmers who built sysctl have built it to be actually used, not just for the fun of programming it :-) -============================================================== +------------------------------------------------------------------------------ Legal blurb: As usual, there are two main things to consider: + 1. you get what you pay for 2. it's free @@ -35,15 +43,17 @@ stories to: Rik van Riel. -============================================================== +-------------------------------------------------------------- -Introduction: +Introduction +============ Sysctl is a means of configuring certain aspects of the kernel at run-time, and the /proc/sys/ directory is there so that you don't even need special tools to do it! In fact, there are only four things needed to use these config facilities: + - a running Linux system - root access - common sense (this is especially hard to come by these days) @@ -54,7 +64,9 @@ several (arch-dependent?) subdirs. Each subdir is mainly about one part of the kernel, so you can do configuration on a piece by piece basis, or just some 'thematic frobbing'. -The subdirs are about: +This documentation is about: + +=============== =============================================================== abi/ execution domains & personalities debug/ dev/ device specific information (eg dev/cdrom/info) @@ -70,7 +82,19 @@ sunrpc/ SUN Remote Procedure Call (NFS) vm/ memory management tuning buffer and cache management user/ Per user per user namespace limits +=============== =============================================================== These are the subdirs I have on my system. There might be more or other subdirs in another setup. If you see another dir, I'd really like to hear about it :-) + +.. 
toctree:: + :maxdepth: 1 + + abi + fs + kernel + net + sunrpc + user + vm diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.rst similarity index 79% rename from Documentation/sysctl/kernel.txt rename to Documentation/sysctl/kernel.rst index 86fd3e35afa7..1e87935e6e96 100644 --- a/Documentation/sysctl/kernel.txt +++ b/Documentation/sysctl/kernel.rst @@ -1,10 +1,16 @@ -Documentation for /proc/sys/kernel/* kernel version 2.2.10 - (c) 1998, 1999, Rik van Riel - (c) 2009, Shen Feng +=================================== +Documentation for /proc/sys/kernel/ +=================================== -For general info and legal blurb, please look in README. +kernel version 2.2.10 -============================================================== +Copyright (c) 1998, 1999, Rik van Riel + +Copyright (c) 2009, Shen Feng + +For general info and legal blurb, please look in index.rst. + +------------------------------------------------------------------------------ This file contains documentation for the sysctl files in /proc/sys/kernel/ and is valid for Linux kernel version 2.2. @@ -102,9 +108,9 @@ show up in /proc/sys/kernel: - watchdog_thresh - version -============================================================== acct: +===== highwater lowwater frequency @@ -119,18 +125,18 @@ That is, suspend accounting if there left <= 2% free; resume it if we got >=4%; consider information about amount of free space valid for 30 seconds. -============================================================== acpi_video_flags: +================= flags See Doc*/kernel/power/video.txt, it allows mode of video boot to be set during run time. -============================================================== auto_msgmni: +============ This variable has no effect and may be removed in future kernel releases. Reading it always returns 0. @@ -140,9 +146,8 @@ Echoing "1" into this file enabled msgmni automatic recomputing. Echoing "0" turned it off. auto_msgmni default value was 1. 
-============================================================== - bootloader_type: +================ x86 bootloader identification @@ -157,9 +162,9 @@ the value 340 = 0x154. See the type_of_loader and ext_loader_type fields in Documentation/x86/boot.txt for additional information. -============================================================== bootloader_version: +=================== x86 bootloader version @@ -169,9 +174,9 @@ file will contain the value 564 = 0x234. See the type_of_loader and ext_loader_ver fields in Documentation/x86/boot.txt for additional information. -============================================================== callhome: +========= Controls the kernel's callhome behavior in case of a kernel panic. @@ -184,27 +189,31 @@ the complete kernel oops message is send to the IBM customer service organization in case the mainframe the Linux operating system is running on has a service contract with IBM. -============================================================== -cap_last_cap +cap_last_cap: +============= Highest valid capability of the running kernel. Exports CAP_LAST_CAP from the kernel. -============================================================== core_pattern: +============= core_pattern is used to specify a core dumpfile pattern name. -. max length 127 characters; default value is "core" -. core_pattern is used as a pattern template for the output filename; + +* max length 127 characters; default value is "core" +* core_pattern is used as a pattern template for the output filename; certain string patterns (beginning with '%') are substituted with their actual values. -. backward compatibility with core_uses_pid: +* backward compatibility with core_uses_pid: + If core_pattern does not include "%p" (default does not) and core_uses_pid is set, then .PID will be appended to the filename. -. 
corename format specifiers: + +* corename format specifiers:: + % '%' is dropped %% output one '%' %p pid @@ -221,13 +230,14 @@ core_pattern is used to specify a core dumpfile pattern name. %e executable filename (may be shortened) %E executable path % both are dropped -. If the first character of the pattern is a '|', the kernel will treat + +* If the first character of the pattern is a '|', the kernel will treat the rest of the pattern as a command to run. The core dump will be written to the standard input of that program instead of to a file. -============================================================== core_pipe_limit: +================ This sysctl is only applicable when core_pattern is configured to pipe core files to a user space helper (when the first character of @@ -248,9 +258,9 @@ parallel, but that no waiting will take place (i.e. the collecting process is not guaranteed access to /proc//). This value defaults to 0. -============================================================== core_uses_pid: +============== The default coredump filename is "core". By setting core_uses_pid to 1, the coredump filename becomes core.PID. @@ -258,9 +268,9 @@ If core_pattern does not include "%p" (default does not) and core_uses_pid is set, then .PID will be appended to the filename. -============================================================== ctrl-alt-del: +============= When the value in this file is 0, ctrl-alt-del is trapped and sent to the init(1) program to handle a graceful restart. @@ -268,14 +278,15 @@ When, however, the value is > 0, Linux's reaction to a Vulcan Nerve Pinch (tm) will be an immediate reboot, without even syncing its dirty buffers. -Note: when a program (like dosemu) has the keyboard in 'raw' -mode, the ctrl-alt-del is intercepted by the program before it -ever reaches the kernel tty layer, and it's up to the program -to decide what to do with it. 
+Note: + when a program (like dosemu) has the keyboard in 'raw' + mode, the ctrl-alt-del is intercepted by the program before it + ever reaches the kernel tty layer, and it's up to the program + to decide what to do with it. -============================================================== dmesg_restrict: +=============== This toggle indicates whether unprivileged users are prevented from using dmesg(8) to view messages from the kernel's log buffer. @@ -286,18 +297,21 @@ dmesg(8). The kernel config option CONFIG_SECURITY_DMESG_RESTRICT sets the default value of dmesg_restrict. -============================================================== domainname & hostname: +====================== These files can be used to set the NIS/YP domainname and the hostname of your box in exactly the same way as the commands -domainname and hostname, i.e.: -# echo "darkstar" > /proc/sys/kernel/hostname -# echo "mydomain" > /proc/sys/kernel/domainname -has the same effect as -# hostname "darkstar" -# domainname "mydomain" +domainname and hostname, i.e.:: + + # echo "darkstar" > /proc/sys/kernel/hostname + # echo "mydomain" > /proc/sys/kernel/domainname + +has the same effect as:: + + # hostname "darkstar" + # domainname "mydomain" Note, however, that the classic darkstar.frop.org has the hostname "darkstar" and DNS (Internet Domain Name Server) @@ -306,8 +320,9 @@ Information Service) or YP (Yellow Pages) domainname. These two domain names are in general different. For a detailed discussion see the hostname(1) man page. -============================================================== + hardlockup_all_cpu_backtrace: +============================= This value controls the hard lockup detector behavior when a hard lockup condition is detected as to whether or not to gather further @@ -317,9 +332,10 @@ will be initiated. 0: do nothing. This is the default behavior. 1: on detection capture more debug information. 
-============================================================== + hardlockup_panic: +================= This parameter can be used to control whether the kernel panics when a hard lockup is detected. @@ -330,16 +346,16 @@ when a hard lockup is detected. See Documentation/lockup-watchdogs.rst for more information. This can also be set using the nmi_watchdog kernel parameter. -============================================================== hotplug: +======== Path for the hotplug policy agent. Default value is "/sbin/hotplug". -============================================================== hung_task_panic: +================ Controls the kernel's behavior when a hung task is detected. This file shows up if CONFIG_DETECT_HUNG_TASK is enabled. @@ -348,27 +364,28 @@ This file shows up if CONFIG_DETECT_HUNG_TASK is enabled. 1: panic immediately. -============================================================== hung_task_check_count: +====================== The upper bound on the number of tasks that are checked. This file shows up if CONFIG_DETECT_HUNG_TASK is enabled. -============================================================== hung_task_timeout_secs: +======================= When a task in D state did not get scheduled for more than this value report a warning. This file shows up if CONFIG_DETECT_HUNG_TASK is enabled. 0: means infinite timeout - no checking done. + Possible values to set are in range {0..LONG_MAX/HZ}. -============================================================== hung_task_check_interval_secs: +============================== Hung task check interval. If hung task checking is enabled (see hung_task_timeout_secs), the check is done every @@ -378,9 +395,9 @@ This file shows up if CONFIG_DETECT_HUNG_TASK is enabled. 0 (default): means use hung_task_timeout_secs as checking interval. Possible values to set are in range {0..LONG_MAX/HZ}. 
-============================================================== hung_task_warnings: +=================== The maximum number of warnings to report. During a check interval if a hung task is detected, this value is decreased by 1. @@ -389,9 +406,9 @@ This file shows up if CONFIG_DETECT_HUNG_TASK is enabled. -1: report an infinite number of warnings. -============================================================== hyperv_record_panic_msg: +======================== Controls whether the panic kmsg data should be reported to Hyper-V. @@ -399,9 +416,9 @@ Controls whether the panic kmsg data should be reported to Hyper-V. 1: report the panic kmsg data. This is the default behavior. -============================================================== kexec_load_disabled: +==================== A toggle indicating if the kexec_load syscall has been disabled. This value defaults to 0 (false: kexec_load enabled), but can be set to 1 @@ -411,9 +428,9 @@ loaded before disabling the syscall, allowing a system to set up (and later use) an image without it being altered. Generally used together with the "modules_disabled" sysctl. -============================================================== kptr_restrict: +============== This toggle indicates whether restrictions are placed on exposing kernel addresses via /proc and other interfaces. @@ -436,16 +453,16 @@ values to unprivileged users is a concern. When kptr_restrict is set to (2), kernel pointers printed using %pK will be replaced with 0's regardless of privileges. -============================================================== l2cr: (PPC only) +================ This flag controls the L2 cache of G3 processor boards. If 0, the cache is disabled. Enabled if nonzero. -============================================================== modules_disabled: +================= A toggle value indicating if modules are allowed to be loaded in an otherwise modular kernel. This toggle defaults to off @@ -453,9 +470,9 @@ in an otherwise modular kernel. 
This toggle defaults to off neither loaded nor unloaded, and the toggle cannot be set back to false. Generally used with the "kexec_load_disabled" toggle. -============================================================== msg_next_id, sem_next_id, and shm_next_id: +========================================== These three toggles allows to specify desired id for next allocated IPC object: message, semaphore or shared memory respectively. @@ -464,21 +481,22 @@ By default they are equal to -1, which means generic allocation logic. Possible values to set are in range {0..INT_MAX}. Notes: -1) kernel doesn't guarantee, that new object will have desired id. So, -it's up to userspace, how to handle an object with "wrong" id. -2) Toggle with non-default value will be set back to -1 by kernel after -successful IPC object allocation. If an IPC object allocation syscall -fails, it is undefined if the value remains unmodified or is reset to -1. + 1) kernel doesn't guarantee, that new object will have desired id. So, + it's up to userspace, how to handle an object with "wrong" id. + 2) Toggle with non-default value will be set back to -1 by kernel after + successful IPC object allocation. If an IPC object allocation syscall + fails, it is undefined if the value remains unmodified or is reset to -1. -============================================================== nmi_watchdog: +============= This parameter can be used to control the NMI watchdog (i.e. the hard lockup detector) on x86 systems. - 0 - disable the hard lockup detector - 1 - enable the hard lockup detector +0 - disable the hard lockup detector + +1 - enable the hard lockup detector The hard lockup detector monitors each CPU for its ability to respond to timer interrupts. The mechanism utilizes CPU performance counter registers @@ -486,15 +504,15 @@ that are programmed to generate Non-Maskable Interrupts (NMIs) periodically while a CPU is busy. Hence, the alternative name 'NMI watchdog'. 
The NMI watchdog is disabled by default if the kernel is running as a guest -in a KVM virtual machine. This default can be overridden by adding +in a KVM virtual machine. This default can be overridden by adding:: nmi_watchdog=1 to the guest kernel command line (see Documentation/admin-guide/kernel-parameters.rst). -============================================================== -numa_balancing +numa_balancing: +=============== Enables/disables automatic page fault based NUMA memory balancing. Memory is moved automatically to nodes @@ -516,10 +534,9 @@ faults may be controlled by the numa_balancing_scan_period_min_ms, numa_balancing_scan_delay_ms, numa_balancing_scan_period_max_ms, numa_balancing_scan_size_mb, and numa_balancing_settle_count sysctls. -============================================================== +numa_balancing_scan_period_min_ms, numa_balancing_scan_delay_ms, numa_balancing_scan_period_max_ms, numa_balancing_scan_size_mb +=============================================================================================================================== -numa_balancing_scan_period_min_ms, numa_balancing_scan_delay_ms, -numa_balancing_scan_period_max_ms, numa_balancing_scan_size_mb Automatic NUMA balancing scans tasks address space and unmaps pages to detect if pages are properly placed or if the data should be migrated to a @@ -555,16 +572,18 @@ rate for each task. numa_balancing_scan_size_mb is how many megabytes worth of pages are scanned for a given scan. -============================================================== osrelease, ostype & version: +============================ -# cat osrelease -2.1.88 -# cat ostype -Linux -# cat version -#5 Wed Feb 25 21:49:24 MET 1998 +:: + + # cat osrelease + 2.1.88 + # cat ostype + Linux + # cat version + #5 Wed Feb 25 21:49:24 MET 1998 The files osrelease and ostype should be clear enough. Version needs a little more clarification however. 
The '#5' means that @@ -572,9 +591,9 @@ this is the fifth kernel built from this source base and the date behind it indicates the time the kernel was built. The only way to tune these values is to rebuild the kernel :-) -============================================================== overflowgid & overflowuid: +========================== if your architecture did not always support 32-bit UIDs (i.e. arm, i386, m68k, sh, and sparc32), a fixed UID and GID will be returned to @@ -584,17 +603,17 @@ actual UID or GID would exceed 65535. These sysctls allow you to change the value of the fixed UID and GID. The default is 65534. -============================================================== panic: +====== The value in this file represents the number of seconds the kernel waits before rebooting on a panic. When you use the software watchdog, the recommended setting is 60. -============================================================== panic_on_io_nmi: +================ Controls the kernel's behavior when a CPU receives an NMI caused by an IO error. @@ -607,20 +626,20 @@ an IO error. servers issue this sort of NMI when the dump button is pushed, and you can use this option to take a crash dump. -============================================================== panic_on_oops: +============== Controls the kernel's behaviour when an oops or BUG is encountered. 0: try to continue operation -1: panic immediately. If the `panic' sysctl is also non-zero then the +1: panic immediately. If the `panic` sysctl is also non-zero then the machine will be rebooted. -============================================================== panic_on_stackoverflow: +======================= Controls the kernel's behavior when detecting the overflows of kernel, IRQ and exception stacks except a user stack. @@ -630,9 +649,9 @@ This file shows up if CONFIG_DEBUG_STACKOVERFLOW is enabled. 1: panic immediately. 
-============================================================== panic_on_unrecovered_nmi: +========================= The default Linux behaviour on an NMI of either memory or unknown is to continue operation. For many environments such as scientific @@ -643,9 +662,9 @@ A small number of systems do generate NMI's for bizarre random reasons such as power management so the default is off. That sysctl works like the existing panic controls already in that directory. -============================================================== panic_on_warn: +============== Calls panic() in the WARN() path when set to 1. This is useful to avoid a kernel rebuild when attempting to kdump at the location of a WARN(). @@ -654,25 +673,28 @@ a kernel rebuild when attempting to kdump at the location of a WARN(). 1: call panic() after printing out WARN() location. -============================================================== panic_print: +============ Bitmask for printing system info when panic happens. User can chose combination of the following bits: -bit 0: print all tasks info -bit 1: print system memory info -bit 2: print timer info -bit 3: print locks info if CONFIG_LOCKDEP is on -bit 4: print ftrace buffer +===== ======================================== +bit 0 print all tasks info +bit 1 print system memory info +bit 2 print timer info +bit 3 print locks info if CONFIG_LOCKDEP is on +bit 4 print ftrace buffer +===== ======================================== + +So for example to print tasks and memory info on panic, user can:: -So for example to print tasks and memory info on panic, user can: echo 3 > /proc/sys/kernel/panic_print -============================================================== panic_on_rcu_stall: +=================== When set to 1, calls panic() after RCU stall detection messages. This is useful to define the root cause of RCU stalls using a vmcore. @@ -681,9 +703,9 @@ is useful to define the root cause of RCU stalls using a vmcore. 
1: panic() after printing RCU stall messages. -============================================================== perf_cpu_time_max_percent: +========================== Hints to the kernel how much CPU time it should be allowed to use to handle perf sampling events. If the perf subsystem @@ -696,10 +718,12 @@ unexpectedly take too long to execute, the NMIs can become stacked up next to each other so much that nothing else is allowed to execute. -0: disable the mechanism. Do not monitor or correct perf's +0: + disable the mechanism. Do not monitor or correct perf's sampling rate no matter how CPU time it takes. -1-100: attempt to throttle perf's sample rate to this +1-100: + attempt to throttle perf's sample rate to this percentage of CPU. Note: the kernel calculates an "expected" length of each sample event. 100 here means 100% of that expected length. Even if this is set to @@ -707,23 +731,30 @@ allowed to execute. length is exceeded. Set to 0 if you truly do not care how much CPU is consumed. -============================================================== perf_event_paranoid: +==================== Controls use of the performance events system by unprivileged users (without CAP_SYS_ADMIN). The default value is 2. 
- -1: Allow use of (almost) all events by all users +=== ================================================================== + -1 Allow use of (almost) all events by all users + Ignore mlock limit after perf_event_mlock_kb without CAP_IPC_LOCK ->=0: Disallow ftrace function tracepoint by users without CAP_SYS_ADMIN + +>=0 Disallow ftrace function tracepoint by users without CAP_SYS_ADMIN + Disallow raw tracepoint access by users without CAP_SYS_ADMIN ->=1: Disallow CPU event access by users without CAP_SYS_ADMIN ->=2: Disallow kernel profiling by users without CAP_SYS_ADMIN -============================================================== +>=1 Disallow CPU event access by users without CAP_SYS_ADMIN + +>=2 Disallow kernel profiling by users without CAP_SYS_ADMIN +=== ================================================================== + perf_event_max_stack: +===================== Controls maximum number of stack frames to copy for (attr.sample_type & PERF_SAMPLE_CALLCHAIN) configured events, for instance, when using @@ -734,17 +765,17 @@ enabled, otherwise writing to this file will return -EBUSY. The default value is 127. -============================================================== perf_event_mlock_kb: +==================== Control size of per-cpu ring buffer not counted agains mlock limit. The default value is 512 + 1 page -============================================================== perf_event_max_contexts_per_stack: +================================== Controls maximum number of stack frame context entries for (attr.sample_type & PERF_SAMPLE_CALLCHAIN) configured events, for @@ -755,25 +786,25 @@ enabled, otherwise writing to this file will return -EBUSY. The default value is 8. -============================================================== pid_max: +======== PID allocation wrap value. When the kernel's next PID value reaches this value, it wraps back to a minimum PID value. PIDs of value pid_max or larger are not allocated. 
-============================================================== ns_last_pid: +============ The last pid allocated in the current (the one task using this sysctl lives in) pid namespace. When selecting a pid for a next task on fork kernel tries to allocate a number starting from this one. -============================================================== powersave-nap: (PPC only) +========================= If set, Linux-PPC will use the 'nap' mode of powersaving, otherwise the 'doze' mode will be used. @@ -781,6 +812,7 @@ otherwise the 'doze' mode will be used. ============================================================== printk: +======= The four values in printk denote: console_loglevel, default_message_loglevel, minimum_console_loglevel and @@ -790,25 +822,29 @@ These values influence printk() behavior when printing or logging error messages. See 'man 2 syslog' for more info on the different loglevels. -- console_loglevel: messages with a higher priority than - this will be printed to the console -- default_message_loglevel: messages without an explicit priority - will be printed with this priority -- minimum_console_loglevel: minimum (highest) value to which - console_loglevel can be set -- default_console_loglevel: default value for console_loglevel +- console_loglevel: + messages with a higher priority than + this will be printed to the console +- default_message_loglevel: + messages without an explicit priority + will be printed with this priority +- minimum_console_loglevel: + minimum (highest) value to which + console_loglevel can be set +- default_console_loglevel: + default value for console_loglevel -============================================================== printk_delay: +============= Delay each printk message in printk_delay milliseconds Value from 0 - 10000 is allowed. -============================================================== printk_ratelimit: +================= Some warning messages are rate limited. 
printk_ratelimit specifies the minimum length of time between these messages (in jiffies), by @@ -816,48 +852,52 @@ default we allow one every 5 seconds. A value of 0 will disable rate limiting. -============================================================== printk_ratelimit_burst: +======================= While long term we enforce one message per printk_ratelimit seconds, we do allow a burst of messages to pass through. printk_ratelimit_burst specifies the number of messages we can send before ratelimiting kicks in. -============================================================== printk_devkmsg: +=============== Control the logging to /dev/kmsg from userspace: -ratelimit: default, ratelimited +ratelimit: + default, ratelimited + on: unlimited logging to /dev/kmsg from userspace + off: logging to /dev/kmsg disabled The kernel command line parameter printk.devkmsg= overrides this and is a one-time setting until next reboot: once set, it cannot be changed by this sysctl interface anymore. -============================================================== randomize_va_space: +=================== This option can be used to select the type of process address space randomization that is used in the system, for architectures that support this feature. -0 - Turn the process address space randomization off. This is the +== =========================================================================== +0 Turn the process address space randomization off. This is the default for architectures that do not support this feature anyways, and kernels that are booted with the "norandmaps" parameter. -1 - Make the addresses of mmap base, stack and VDSO page randomized. +1 Make the addresses of mmap base, stack and VDSO page randomized. This, among other things, implies that shared libraries will be loaded to random addresses. Also for PIE-linked binaries, the location of code start is randomized. This is the default if the CONFIG_COMPAT_BRK option is enabled. 
-2 - Additionally enable heap randomization. This is the default if +2 Additionally enable heap randomization. This is the default if CONFIG_COMPAT_BRK is disabled. There are a few legacy applications out there (such as some ancient @@ -870,18 +910,19 @@ that support this feature. Systems with ancient and/or broken binaries should be configured with CONFIG_COMPAT_BRK enabled, which excludes the heap from process address space randomization. +== =========================================================================== -============================================================== reboot-cmd: (Sparc only) +======================== ??? This seems to be a way to give an argument to the Sparc ROM/Flash boot loader. Maybe to tell it what to do after rebooting. ??? -============================================================== rtsig-max & rtsig-nr: +===================== The file rtsig-max can be used to tune the maximum number of POSIX realtime (queued) signals that can be outstanding @@ -889,9 +930,9 @@ in the system. rtsig-nr shows the number of RT signals currently queued. -============================================================== sched_energy_aware: +=================== Enables/disables Energy Aware Scheduling (EAS). EAS starts automatically on platforms where it can run (that is, @@ -900,17 +941,17 @@ Model available). If your platform happens to meet the requirements for EAS but you do not want to use it, change this value to 0. -============================================================== sched_schedstats: +================= Enables/disables scheduler statistics. Enabling this feature incurs a small amount of overhead in the scheduler but is useful for debugging and performance tuning. -============================================================== sg-big-buff: +============ This file shows the size of the generic SCSI (sg) buffer. 
You can't tune it just yet, but you could change it on @@ -921,9 +962,9 @@ There shouldn't be any reason to change this value. If you can come up with one, you probably know what you are doing anyway :) -============================================================== shmall: +======= This parameter sets the total amount of shared memory pages that can be used system wide. Hence, SHMALL should always be at least @@ -932,20 +973,20 @@ ceil(shmmax/PAGE_SIZE). If you are not sure what the default PAGE_SIZE is on your Linux system, you can run the following command: -# getconf PAGE_SIZE + # getconf PAGE_SIZE -============================================================== shmmax: +======= This value can be used to query and set the run time limit on the maximum shared memory segment size that can be created. Shared memory segments up to 1Gb are now supported in the kernel. This value defaults to SHMMAX. -============================================================== shm_rmid_forced: +================ Linux lets you set resource limits, including how much memory one process can consume, via setrlimit(2). Unfortunately, shared memory @@ -964,28 +1005,30 @@ need this. Note that if you change this from 0 to 1, already created segments without users and with a dead originative process will be destroyed. -============================================================== sysctl_writes_strict: +===================== Control how file position affects the behavior of updating sysctl values via the /proc/sys interface: - -1 - Legacy per-write sysctl value handling, with no printk warnings. + == ====================================================================== + -1 Legacy per-write sysctl value handling, with no printk warnings. Each write syscall must fully contain the sysctl value to be written, and multiple writes on the same sysctl file descriptor will rewrite the sysctl value, regardless of file position. 
- 0 - Same behavior as above, but warn about processes that perform writes + 0 Same behavior as above, but warn about processes that perform writes to a sysctl file descriptor when the file position is not 0. - 1 - (default) Respect file position when writing sysctl strings. Multiple + 1 (default) Respect file position when writing sysctl strings. Multiple writes will append to the sysctl value buffer. Anything past the max length of the sysctl value buffer will be ignored. Writes to numeric sysctl entries must always be at file position 0 and the value must be fully contained in the buffer sent in the write syscall. + == ====================================================================== -============================================================== softlockup_all_cpu_backtrace: +============================= This value controls the soft lockup detector thread's behavior when a soft lockup condition is detected as to whether or not @@ -999,13 +1042,14 @@ NMI. 1: on detection capture more debug information. -============================================================== -soft_watchdog +soft_watchdog: +============== This parameter can be used to control the soft lockup detector. 0 - disable the soft lockup detector + 1 - enable the soft lockup detector The soft lockup detector monitors CPUs for threads that are hogging the CPUs @@ -1015,9 +1059,9 @@ interrupts which are needed for the 'watchdog/N' threads to be woken up by the watchdog timer function, otherwise the NMI watchdog - if enabled - can detect a hard lockup condition. -============================================================== -stack_erasing +stack_erasing: +============== This parameter can be used to control kernel stack erasing at the end of syscalls for kernels built with CONFIG_GCC_PLUGIN_STACKLEAK. @@ -1031,37 +1075,40 @@ compilation sees a 1% slowdown, other systems and workloads may vary. 
1: kernel stack erasing is enabled (default), it is performed before returning to the userspace at the end of syscalls. -============================================================== + tainted +======= Non-zero if the kernel has been tainted. Numeric values, which can be ORed together. The letters are seen in "Tainted" line of Oops reports. - 1 (P): proprietary module was loaded - 2 (F): module was force loaded - 4 (S): SMP kernel oops on an officially SMP incapable processor - 8 (R): module was force unloaded - 16 (M): processor reported a Machine Check Exception (MCE) - 32 (B): bad page referenced or some unexpected page flags - 64 (U): taint requested by userspace application - 128 (D): kernel died recently, i.e. there was an OOPS or BUG - 256 (A): an ACPI table was overridden by user - 512 (W): kernel issued warning - 1024 (C): staging driver was loaded - 2048 (I): workaround for bug in platform firmware applied - 4096 (O): externally-built ("out-of-tree") module was loaded - 8192 (E): unsigned module was loaded - 16384 (L): soft lockup occurred - 32768 (K): kernel has been live patched - 65536 (X): Auxiliary taint, defined and used by for distros -131072 (T): The kernel was built with the struct randomization plugin +====== ===== ============================================================== + 1 `(P)` proprietary module was loaded + 2 `(F)` module was force loaded + 4 `(S)` SMP kernel oops on an officially SMP incapable processor + 8 `(R)` module was force unloaded + 16 `(M)` processor reported a Machine Check Exception (MCE) + 32 `(B)` bad page referenced or some unexpected page flags + 64 `(U)` taint requested by userspace application + 128 `(D)` kernel died recently, i.e. 
there was an OOPS or BUG + 256 `(A)` an ACPI table was overridden by user + 512 `(W)` kernel issued warning + 1024 `(C)` staging driver was loaded + 2048 `(I)` workaround for bug in platform firmware applied + 4096 `(O)` externally-built ("out-of-tree") module was loaded + 8192 `(E)` unsigned module was loaded + 16384 `(L)` soft lockup occurred + 32768 `(K)` kernel has been live patched + 65536 `(X)` Auxiliary taint, defined and used by for distros +131072 `(T)` The kernel was built with the struct randomization plugin +====== ===== ============================================================== See Documentation/admin-guide/tainted-kernels.rst for more information. -============================================================== -threads-max +threads-max: +============ This value controls the maximum number of threads that can be created using fork(). @@ -1071,8 +1118,10 @@ maximum number of threads is created, the thread structures occupy only a part (1/8th) of the available RAM pages. The minimum value that can be written to threads-max is 20. + The maximum value that can be written to threads-max is given by the constant FUTEX_TID_MASK (0x3fffffff). + If a value outside of this range is written to threads-max an error EINVAL occurs. @@ -1080,9 +1129,9 @@ The value written is checked against the available RAM pages. If the thread structures would occupy too much (more than 1/8th) of the available RAM pages threads-max is reduced accordingly. -============================================================== unknown_nmi_panic: +================== The value in this file affects behavior of handling NMI. When the value is non-zero, unknown NMI is trapped and then panic occurs. At @@ -1091,28 +1140,29 @@ that time, kernel debugging information is displayed on console. NMI switch that most IA32 servers have fires unknown NMI up, for example. If a system hangs up, try pressing the NMI switch. 
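As a reading aid for the tainted table earlier in this file, here is a minimal sketch of how an ORed value could be split back into its letters. The bit values and letters are copied from the table; the names `TAINT_FLAGS` and `decode_tainted` are hypothetical and not part of the kernel or of this patch.

```python
# Bit values and letters copied from the "tainted" table in this patch.
# TAINT_FLAGS and decode_tainted are illustrative names, not kernel APIs.
TAINT_FLAGS = [
    (1, "P", "proprietary module was loaded"),
    (2, "F", "module was force loaded"),
    (4, "S", "SMP kernel oops on an officially SMP incapable processor"),
    (8, "R", "module was force unloaded"),
    (16, "M", "processor reported a Machine Check Exception (MCE)"),
    (32, "B", "bad page referenced or some unexpected page flags"),
    (64, "U", "taint requested by userspace application"),
    (128, "D", "kernel died recently, i.e. there was an OOPS or BUG"),
    (256, "A", "an ACPI table was overridden by user"),
    (512, "W", "kernel issued warning"),
    (1024, "C", "staging driver was loaded"),
    (2048, "I", "workaround for bug in platform firmware applied"),
    (4096, "O", 'externally-built ("out-of-tree") module was loaded'),
    (8192, "E", "unsigned module was loaded"),
    (16384, "L", "soft lockup occurred"),
    (32768, "K", "kernel has been live patched"),
    (65536, "X", "auxiliary taint, defined and used by distros"),
    (131072, "T", "kernel was built with the struct randomization plugin"),
]


def decode_tainted(value):
    """Return (letter, description) pairs for every taint bit set in value."""
    return [(letter, text) for bit, letter, text in TAINT_FLAGS if value & bit]


if __name__ == "__main__":
    # 4097 = 4096 + 1, chosen only as an example value.
    for letter, text in decode_tainted(4097):
        print(letter, text)
```

On a live system the input would be the integer read from /proc/sys/kernel/tainted; the sketch takes it as an argument so that it runs anywhere.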
-============================================================== watchdog: +========= This parameter can be used to disable or enable the soft lockup detector _and_ the NMI watchdog (i.e. the hard lockup detector) at the same time. 0 - disable both lockup detectors + 1 - enable both lockup detectors The soft lockup detector and the NMI watchdog can also be disabled or enabled individually, using the soft_watchdog and nmi_watchdog parameters. -If the watchdog parameter is read, for example by executing +If the watchdog parameter is read, for example by executing:: cat /proc/sys/kernel/watchdog the output of this command (0 or 1) shows the logical OR of soft_watchdog and nmi_watchdog. -============================================================== watchdog_cpumask: +================= This value can be used to control on which cpus the watchdog may run. The default cpumask is all possible cores, but if NO_HZ_FULL is @@ -1127,13 +1177,13 @@ if a kernel lockup was suspected on those cores. The argument value is the standard cpulist format for cpumasks, so for example to enable the watchdog on cores 0, 2, 3, and 4 you -might say: +might say:: echo 0,2-4 > /proc/sys/kernel/watchdog_cpumask -============================================================== watchdog_thresh: +================ This value can be used to control the frequency of hrtimer and NMI events and the soft and hard lockup thresholds. The default threshold @@ -1141,5 +1191,3 @@ is 10 seconds. The softlockup threshold is (2 * watchdog_thresh). Setting this tunable to zero will disable lockup detection altogether. 
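The cpulist value used in the watchdog_cpumask example above ("0,2-4") can be expanded into individual core numbers. The sketch below is a simplified illustration that handles only single numbers and inclusive ranges, which is all the example needs; `parse_cpulist` is a hypothetical helper name, not a kernel interface.

```python
def parse_cpulist(text):
    """Expand a simple cpulist string such as "0,2-4" into a sorted list.

    Assumption: only comma-separated single CPUs and "low-high" ranges
    are handled; any other syntax raises ValueError via int().
    """
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-", 1)
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


if __name__ == "__main__":
    print(parse_cpulist("0,2-4"))
```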
- -============================================================== diff --git a/Documentation/sysctl/net.txt b/Documentation/sysctl/net.rst similarity index 85% rename from Documentation/sysctl/net.txt rename to Documentation/sysctl/net.rst index 2ae91d3873bb..a7d44e71019d 100644 --- a/Documentation/sysctl/net.txt +++ b/Documentation/sysctl/net.rst @@ -1,12 +1,25 @@ -Documentation for /proc/sys/net/* - (c) 1999 Terrehon Bowden - Bodo Bauer - (c) 2000 Jorge Nerin - (c) 2009 Shen Feng +================================ +Documentation for /proc/sys/net/ +================================ -For general info and legal blurb, please look in README. +Copyright -============================================================== +Copyright (c) 1999 + + - Terrehon Bowden + - Bodo Bauer + +Copyright (c) 2000 + + - Jorge Nerin + +Copyright (c) 2009 + + - Shen Feng + +For general info and legal blurb, please look in index.rst. + +------------------------------------------------------------------------------ This file contains the documentation for the sysctl files in /proc/sys/net @@ -17,20 +30,22 @@ see only some of them, depending on your kernel's configuration. Table : Subdirectories in /proc/sys/net -.............................................................................. - Directory Content Directory Content - core General parameter appletalk Appletalk protocol - unix Unix domain sockets netrom NET/ROM - 802 E802 protocol ax25 AX25 - ethernet Ethernet protocol rose X.25 PLP layer - ipv4 IP version 4 x25 X.25 protocol - ipx IPX token-ring IBM token ring - bridge Bridging decnet DEC net - ipv6 IP version 6 tipc TIPC -.............................................................................. 
+ + ========= =================== = ========== ================== + Directory Content Directory Content + ========= =================== = ========== ================== + core General parameter appletalk Appletalk protocol + unix Unix domain sockets netrom NET/ROM + 802 E802 protocol ax25 AX25 + ethernet Ethernet protocol rose X.25 PLP layer + ipv4 IP version 4 x25 X.25 protocol + ipx IPX token-ring IBM token ring + bridge Bridging decnet DEC net + ipv6 IP version 6 tipc TIPC + ========= =================== = ========== ================== 1. /proc/sys/net/core - Network core options -------------------------------------------------------- +============================================ bpf_jit_enable -------------- @@ -44,6 +59,7 @@ restricted C into a sequence of BPF instructions. After program load through bpf(2) and passing a verifier in the kernel, a JIT will then translate these BPF proglets into native CPU instructions. There are two flavors of JITs, the newer eBPF JIT currently supported on: + - x86_64 - x86_32 - arm64 @@ -55,6 +71,7 @@ two flavors of JITs, the newer eBPF JIT currently supported on: - riscv And the older cBPF JIT supported on the following archs: + - mips - ppc - sparc @@ -65,10 +82,11 @@ compile them transparently. Older cBPF JITs can only translate tcpdump filters, seccomp rules, etc, but not mentioned eBPF programs loaded through bpf(2). -Values : - 0 - disable the JIT (default value) - 1 - enable the JIT - 2 - enable the JIT and ask the compiler to emit traces on kernel log. +Values: + + - 0 - disable the JIT (default value) + - 1 - enable the JIT + - 2 - enable the JIT and ask the compiler to emit traces on kernel log. bpf_jit_harden -------------- @@ -76,10 +94,12 @@ bpf_jit_harden This enables hardening for the BPF JIT compiler. Supported are eBPF JIT backends. Enabling hardening trades off performance, but can mitigate JIT spraying. 
-Values : - 0 - disable JIT hardening (default value) - 1 - enable JIT hardening for unprivileged users only - 2 - enable JIT hardening for all users + +Values: + + - 0 - disable JIT hardening (default value) + - 1 - enable JIT hardening for unprivileged users only + - 2 - enable JIT hardening for all users bpf_jit_kallsyms ---------------- @@ -89,9 +109,11 @@ addresses to the kernel, meaning they neither show up in traces nor in /proc/kallsyms. This enables export of these addresses, which can be used for debugging/tracing. If bpf_jit_harden is enabled, this feature is disabled. + Values : - 0 - disable JIT kallsyms export (default value) - 1 - enable JIT kallsyms export for privileged users only + + - 0 - disable JIT kallsyms export (default value) + - 1 - enable JIT kallsyms export for privileged users only bpf_jit_limit ------------- @@ -102,7 +124,7 @@ been surpassed. bpf_jit_limit contains the value of the global limit in bytes. dev_weight --------------- +---------- The maximum number of packets that kernel can handle on a NAPI interrupt, it's a Per-CPU variable. For drivers that support LRO or GRO_HW, a hardware @@ -111,7 +133,7 @@ aggregated packet is counted as one packet in this context. Default: 64 dev_weight_rx_bias --------------- +------------------ RPS (e.g. RFS, aRFS) processing is competing with the registered NAPI poll function of the driver for the per softirq cycle netdev_budget. This parameter influences @@ -120,19 +142,22 @@ processing during RX softirq cycles. It is further meant for making current dev_weight adaptable for asymmetric CPU needs on RX/TX side of the network stack. (see dev_weight_tx_bias) It is effective on a per CPU basis. Determination is based on dev_weight and is calculated multiplicative (dev_weight * dev_weight_rx_bias). + Default: 1 dev_weight_tx_bias --------------- +------------------ Scales the maximum number of packets that can be processed during a TX softirq cycle. Effective on a per CPU basis. 
Allows scaling of current dev_weight for asymmetric net stack processing needs. Be careful to avoid making TX softirq processing a CPU hog. + Calculation is based on dev_weight (dev_weight * dev_weight_tx_bias). + Default: 1 default_qdisc --------------- +------------- The default queuing discipline to use for network devices. This allows overriding the default of pfifo_fast with an alternative. Since the default @@ -144,17 +169,21 @@ which require setting up classes and bandwidths. Note that physical multiqueue interfaces still use mq as root qdisc, which in turn uses this default for its leaves. Virtual devices (like e.g. lo or veth) ignore this setting and instead default to noqueue. + Default: pfifo_fast busy_read ----------------- +--------- + Low latency busy poll timeout for socket reads. (needs CONFIG_NET_RX_BUSY_POLL) Approximate time in us to busy loop waiting for packets on the device queue. This sets the default value of the SO_BUSY_POLL socket option. Can be set or overridden per socket by setting socket option SO_BUSY_POLL, which is the preferred method of enabling. If you need to enable the feature globally via sysctl, a value of 50 is recommended. + Will increase power usage. + Default: 0 (off) busy_poll @@ -167,7 +196,9 @@ For more than that you probably want to use epoll. Note that only sockets with SO_BUSY_POLL set will be busy polled, so you want to either selectively set SO_BUSY_POLL on those sockets or set sysctl.net.busy_read globally. + Will increase power usage. + Default: 0 (off) rmem_default @@ -185,6 +216,7 @@ tstamp_allow_data Allow processes to receive tx timestamps looped together with the original packet contents. If disabled, transmit timestamp requests from unprivileged processes are dropped unless socket option SOF_TIMESTAMPING_OPT_TSONLY is set. + Default: 1 (on) @@ -250,19 +282,24 @@ randomly generated. Some user space might need to gather its content even if drivers do not provide ethtool -x support yet. 
-myhost:~# cat /proc/sys/net/core/netdev_rss_key -84:50:f4:00:a8:15:d1:a7:e9:7f:1d:60:35:c7:47:25:42:97:74:ca:56:bb:b6:a1:d8: ... (52 bytes total) +:: + + myhost:~# cat /proc/sys/net/core/netdev_rss_key + 84:50:f4:00:a8:15:d1:a7:e9:7f:1d:60:35:c7:47:25:42:97:74:ca:56:bb:b6:a1:d8: ... (52 bytes total) File contains nul bytes if no driver ever called netdev_rss_key_fill() function. + Note: -/proc/sys/net/core/netdev_rss_key contains 52 bytes of key, -but most drivers only use 40 bytes of it. + /proc/sys/net/core/netdev_rss_key contains 52 bytes of key, + but most drivers only use 40 bytes of it. -myhost:~# ethtool -x eth0 -RX flow hash indirection table for eth0 with 8 RX ring(s): - 0: 0 1 2 3 4 5 6 7 -RSS hash key: -84:50:f4:00:a8:15:d1:a7:e9:7f:1d:60:35:c7:47:25:42:97:74:ca:56:bb:b6:a1:d8:43:e3:c9:0c:fd:17:55:c2:3a:4d:69:ed:f1:42:89 +:: + + myhost:~# ethtool -x eth0 + RX flow hash indirection table for eth0 with 8 RX ring(s): + 0: 0 1 2 3 4 5 6 7 + RSS hash key: + 84:50:f4:00:a8:15:d1:a7:e9:7f:1d:60:35:c7:47:25:42:97:74:ca:56:bb:b6:a1:d8:43:e3:c9:0c:fd:17:55:c2:3a:4d:69:ed:f1:42:89 netdev_tstamp_prequeue ---------------------- @@ -293,7 +330,7 @@ user space is responsible for creating them if needed. Default : 0 (for compatibility reasons) devconf_inherit_init_net ----------------------------- +------------------------ Controls if a new network namespace should inherit all current settings under /proc/sys/net/{ipv4,ipv6}/conf/{all,default}/. By @@ -307,7 +344,7 @@ forced to reset to their default values. Default : 0 (for compatibility reasons) 2. /proc/sys/net/unix - Parameters for Unix domain sockets -------------------------------------------------------- +---------------------------------------------------------- There is only one file in this directory. unix_dgram_qlen limits the max number of datagrams queued in Unix domain @@ -315,13 +352,13 @@ socket's buffer. It will not take effect unless PF_UNIX flag is specified. 3. 
/proc/sys/net/ipv4 - IPV4 settings -------------------------------------------------------- +------------------------------------- Please see: Documentation/networking/ip-sysctl.txt and ipvs-sysctl.txt for descriptions of these entries. 4. Appletalk -------------------------------------------------------- +------------ The /proc/sys/net/appletalk directory holds the Appletalk configuration data when Appletalk is loaded. The configurable parameters are: @@ -366,7 +403,7 @@ route flags, and the device the route is using. 5. IPX -------------------------------------------------------- +------ The IPX protocol has no tunable values in proc/sys/net. @@ -391,14 +428,16 @@ gives the destination network, the router node (or Directly) and the network address of the router (or Connected) for internal networks. 6. TIPC -------------------------------------------------------- +------- tipc_rmem ----------- +--------- The TIPC protocol now has a tunable for the receive memory, similar to the tcp_rmem - i.e. a vector of 3 INTEGERs: (min, default, max) +:: + # cat /proc/sys/net/tipc/tipc_rmem 4252725 34021800 68043600 # @@ -409,7 +448,7 @@ is not at this point in time used in any meaningful way, but the triplet is preserved in order to be consistent with things like tcp_rmem. named_timeout --------------- +------------- TIPC name table updates are distributed asynchronously in a cluster, without any form of transaction handling. 
This means that different race scenarios are
diff --git a/Documentation/sysctl/sunrpc.txt b/Documentation/sysctl/sunrpc.rst
similarity index 62%
rename from Documentation/sysctl/sunrpc.txt
rename to Documentation/sysctl/sunrpc.rst
index ae1ecac6f85a..09780a682afd 100644
--- a/Documentation/sysctl/sunrpc.txt
+++ b/Documentation/sysctl/sunrpc.rst
@@ -1,9 +1,14 @@
-Documentation for /proc/sys/sunrpc/* kernel version 2.2.10
- (c) 1998, 1999, Rik van Riel
+===================================
+Documentation for /proc/sys/sunrpc/
+===================================

-For general info and legal blurb, please look in README.
+kernel version 2.2.10

-==============================================================
+Copyright (c) 1998, 1999, Rik van Riel
+
+For general info and legal blurb, please look in index.rst.
+
+------------------------------------------------------------------------------

 This file contains the documentation for the sysctl files in
 /proc/sys/sunrpc and is valid for Linux kernel version 2.2.
diff --git a/Documentation/sysctl/user.txt b/Documentation/sysctl/user.rst
similarity index 77%
rename from Documentation/sysctl/user.txt
rename to Documentation/sysctl/user.rst
index a5882865836e..650eaa03f15e 100644
--- a/Documentation/sysctl/user.txt
+++ b/Documentation/sysctl/user.rst
@@ -1,7 +1,12 @@
-Documentation for /proc/sys/user/* kernel version 4.9.0
- (c) 2016 Eric Biederman
+=================================
+Documentation for /proc/sys/user/
+=================================

-==============================================================
+kernel version 4.9.0
+
+Copyright (c) 2016 Eric Biederman
+
+------------------------------------------------------------------------------

 This file contains the documentation for the sysctl files in
 /proc/sys/user.
@@ -30,37 +35,44 @@ user namespace does not allow a user to escape their current limits.
 Currently, these files are in /proc/sys/user:

-- max_cgroup_namespaces
+max_cgroup_namespaces
+=====================

   The maximum number of cgroup namespaces that any user in the current
   user namespace may create.

-- max_ipc_namespaces
+max_ipc_namespaces
+==================

   The maximum number of ipc namespaces that any user in the current
   user namespace may create.

-- max_mnt_namespaces
+max_mnt_namespaces
+==================

   The maximum number of mount namespaces that any user in the current
   user namespace may create.

-- max_net_namespaces
+max_net_namespaces
+==================

   The maximum number of network namespaces that any user in the
   current user namespace may create.

-- max_pid_namespaces
+max_pid_namespaces
+==================

   The maximum number of pid namespaces that any user in the current
   user namespace may create.

-- max_user_namespaces
+max_user_namespaces
+===================

   The maximum number of user namespaces that any user in the current
   user namespace may create.

-- max_uts_namespaces
+max_uts_namespaces
+==================

   The maximum number of user namespaces that any user in the current
   user namespace may create.
diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.rst
similarity index 85%
rename from Documentation/sysctl/vm.txt
rename to Documentation/sysctl/vm.rst
index f10245b06b0e..e0b145dfe266 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.rst
@@ -1,10 +1,16 @@
-Documentation for /proc/sys/vm/* kernel version 2.6.29
- (c) 1998, 1999, Rik van Riel
- (c) 2008 Peter W. Morreale
+===============================
+Documentation for /proc/sys/vm/
+===============================

-For general info and legal blurb, please look in README.
+kernel version 2.6.29

-==============================================================
+Copyright (c) 1998, 1999, Rik van Riel
+
+Copyright (c) 2008 Peter W. Morreale
+
+For general info and legal blurb, please look in index.rst.
+ +------------------------------------------------------------------------------ This file contains the documentation for the sysctl files in /proc/sys/vm and is valid for Linux kernel version 2.6.29. @@ -68,9 +74,9 @@ Currently, these files are in /proc/sys/vm: - watermark_scale_factor - zone_reclaim_mode -============================================================== admin_reserve_kbytes +==================== The amount of free memory in the system that should be reserved for users with the capability cap_sys_admin. @@ -97,25 +103,25 @@ On x86_64 this is about 128MB. Changing this takes effect whenever an application requests memory. -============================================================== block_dump +========== block_dump enables block I/O debugging when set to a nonzero value. More information on block I/O debugging is in Documentation/laptops/laptop-mode.rst. -============================================================== compact_memory +============== Available only when CONFIG_COMPACTION is set. When 1 is written to the file, all zones are compacted such that free memory is available in contiguous blocks where possible. This can be important for example in the allocation of huge pages although processes will also directly compact memory as required. -============================================================== compact_unevictable_allowed +=========================== Available only when CONFIG_COMPACTION is set. When set to 1, compaction is allowed to examine the unevictable lru (mlocked pages) for pages to compact. @@ -123,21 +129,22 @@ This should be used on systems where stalls for minor page faults are an acceptable trade for large contiguous free memory. Set to 0 to prevent compaction from moving pages that are unevictable. Default value is 1. 
-============================================================== dirty_background_bytes +====================== Contains the amount of dirty memory at which the background kernel flusher threads will start writeback. -Note: dirty_background_bytes is the counterpart of dirty_background_ratio. Only -one of them may be specified at a time. When one sysctl is written it is -immediately taken into account to evaluate the dirty memory limits and the -other appears as 0 when read. +Note: + dirty_background_bytes is the counterpart of dirty_background_ratio. Only + one of them may be specified at a time. When one sysctl is written it is + immediately taken into account to evaluate the dirty memory limits and the + other appears as 0 when read. -============================================================== dirty_background_ratio +====================== Contains, as a percentage of total available memory that contains free pages and reclaimable pages, the number of pages at which the background kernel @@ -145,9 +152,9 @@ flusher threads will start writing out dirty data. The total available memory is not equal to total system memory. -============================================================== dirty_bytes +=========== Contains the amount of dirty memory at which a process generating disk writes will itself start writeback. @@ -161,18 +168,18 @@ Note: the minimum value allowed for dirty_bytes is two pages (in bytes); any value lower than this limit will be ignored and the old configuration will be retained. -============================================================== dirty_expire_centisecs +====================== This tunable is used to define when dirty data is old enough to be eligible for writeout by the kernel flusher threads. It is expressed in 100'ths of a second. Data which has been dirty in-memory for longer than this interval will be written out next time a flusher thread wakes up. 
-============================================================== dirty_ratio +=========== Contains, as a percentage of total available memory that contains free pages and reclaimable pages, the number of pages at which a process which is @@ -180,9 +187,9 @@ generating disk writes will itself start writing out dirty data. The total available memory is not equal to total system memory. -============================================================== dirtytime_expire_seconds +======================== When a lazytime inode is constantly having its pages dirtied, the inode with an updated timestamp will never get chance to be written out. And, if the @@ -192,34 +199,39 @@ eventually gets pushed out to disk. This tunable is used to define when dirty inode is old enough to be eligible for writeback by the kernel flusher threads. And, it is also used as the interval to wakeup dirtytime_writeback thread. -============================================================== dirty_writeback_centisecs +========================= -The kernel flusher threads will periodically wake up and write `old' data +The kernel flusher threads will periodically wake up and write `old` data out to disk. This tunable expresses the interval between those wakeups, in 100'ths of a second. Setting this to zero disables periodic writeback altogether. -============================================================== drop_caches +=========== Writing to this will cause the kernel to drop clean caches, as well as reclaimable slab objects like dentries and inodes. Once dropped, their memory becomes free. 
-To free pagecache: +To free pagecache:: + echo 1 > /proc/sys/vm/drop_caches -To free reclaimable slab objects (includes dentries and inodes): + +To free reclaimable slab objects (includes dentries and inodes):: + echo 2 > /proc/sys/vm/drop_caches -To free slab objects and pagecache: + +To free slab objects and pagecache:: + echo 3 > /proc/sys/vm/drop_caches This is a non-destructive operation and will not free any dirty objects. To increase the number of objects freed by this operation, the user may run -`sync' prior to writing to /proc/sys/vm/drop_caches. This will minimize the +`sync` prior to writing to /proc/sys/vm/drop_caches. This will minimize the number of dirty objects on the system and create more candidates to be dropped. @@ -233,16 +245,16 @@ dropped objects, especially if they were under heavy use. Because of this, use outside of a testing or debugging environment is not recommended. You may see informational messages in your kernel log when this file is -used: +used:: cat (1234): drop_caches: 3 These are informational only. They do not mean that anything is wrong with your system. To disable them, echo 4 (bit 2) into drop_caches. -============================================================== extfrag_threshold +================= This parameter affects whether the kernel will compact memory or direct reclaim to satisfy a high-order allocation. The extfrag/extfrag_index file in @@ -254,9 +266,9 @@ implies that the allocation will succeed as long as watermarks are met. The kernel will not compact memory in a zone if the fragmentation index is <= extfrag_threshold. The default value is 500. -============================================================== highmem_is_dirtyable +==================== Available only for systems with CONFIG_HIGHMEM enabled (32b systems). @@ -274,30 +286,30 @@ OOM killer because some writers (e.g. direct block device writes) can only use the low memory and they can fill it up with dirty data without any throttling. 
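The drop_caches values described above (1, 2, 3, and 4 for "bit 2") read naturally as a bitmask. The following sketch spells that reading out; it is an interpretation of the text in this file, not a statement of how the kernel implements the knob, and `describe_drop_caches` is a hypothetical helper name.

```python
def describe_drop_caches(value):
    """List the effects the documentation associates with each set bit.

    Assumption: the documented values combine as a bitmask
    (1 = pagecache, 2 = reclaimable slab objects, 4 = silence messages).
    """
    effects = []
    if value & 1:
        effects.append("free pagecache")
    if value & 2:
        effects.append("free reclaimable slab objects")
    if value & 4:
        effects.append("disable informational messages")
    return effects


if __name__ == "__main__":
    for value in (1, 2, 3, 4):
        print(value, describe_drop_caches(value))
```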
-============================================================== hugetlb_shm_group +================= hugetlb_shm_group contains group id that is allowed to create SysV shared memory segment using hugetlb page. -============================================================== laptop_mode +=========== laptop_mode is a knob that controls "laptop mode". All the things that are controlled by this knob are discussed in Documentation/laptops/laptop-mode.rst. -============================================================== legacy_va_layout +================ If non-zero, this sysctl disables the new 32-bit mmap layout - the kernel will use the legacy (2.4) layout for all processes. -============================================================== lowmem_reserve_ratio +==================== For some specialised workloads on highmem machines it is dangerous for the kernel to allow process memory to be allocated from the "lowmem" @@ -308,7 +320,7 @@ And on large highmem machines this lack of reclaimable lowmem memory can be fatal. So the Linux page allocator has a mechanism which prevents allocations -which _could_ use highmem from using too much lowmem. This means that +which *could* use highmem from using too much lowmem. This means that a certain amount of lowmem is defended from the possibility of being captured into pinned user memory. @@ -316,39 +328,37 @@ captured into pinned user memory. mechanism will also defend that region from allocations which could use highmem or lowmem). -The `lowmem_reserve_ratio' tunable determines how aggressive the kernel is +The `lowmem_reserve_ratio` tunable determines how aggressive the kernel is in defending these lower zones. If you have a machine which uses highmem or ISA DMA and your applications are using mlock(), or if you are running with no swap then you probably should change the lowmem_reserve_ratio setting. -The lowmem_reserve_ratio is an array. You can see them by reading this file. 
-- -% cat /proc/sys/vm/lowmem_reserve_ratio -256 256 32 -- +The lowmem_reserve_ratio is an array. You can see them by reading this file:: + + % cat /proc/sys/vm/lowmem_reserve_ratio + 256 256 32 But, these values are not used directly. The kernel calculates # of protection pages for each zones from them. These are shown as array of protection pages in /proc/zoneinfo like followings. (This is an example of x86-64 box). -Each zone has an array of protection pages like this. +Each zone has an array of protection pages like this:: -- -Node 0, zone DMA - pages free 1355 - min 3 - low 3 - high 4 + Node 0, zone DMA + pages free 1355 + min 3 + low 3 + high 4 : : - numa_other 0 - protection: (0, 2004, 2004, 2004) + numa_other 0 + protection: (0, 2004, 2004, 2004) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - pagesets - cpu: 0 pcp: 0 - : -- + pagesets + cpu: 0 pcp: 0 + : + These protections are added to score to judge whether this zone should be used for page allocation or should be reclaimed. @@ -359,20 +369,24 @@ not be used because pages_free(1355) is smaller than watermark + protection[2] normal page requirement. If requirement is DMA zone(index=0), protection[0] (=0) is used. -zone[i]'s protection[j] is calculated by following expression. +zone[i]'s protection[j] is calculated by following expression:: -(i < j): - zone[i]->protection[j] - = (total sums of managed_pages from zone[i+1] to zone[j] on the node) - / lowmem_reserve_ratio[i]; -(i = j): - (should not be protected. = 0; -(i > j): - (not necessary, but looks 0) + (i < j): + zone[i]->protection[j] + = (total sums of managed_pages from zone[i+1] to zone[j] on the node) + / lowmem_reserve_ratio[i]; + (i = j): + (should not be protected. = 0; + (i > j): + (not necessary, but looks 0) The default values of lowmem_reserve_ratio[i] are + + === ==================================== 256 (if zone[i] means DMA or DMA32 zone) - 32 (others). 
+ 32 (others) + === ==================================== + As above expression, they are reciprocal number of ratio. 256 means 1/256. # of protection pages becomes about "0.39%" of total managed pages of higher zones on the node. @@ -381,9 +395,9 @@ If you would like to protect more pages, smaller values are effective. The minimum value is 1 (1/1 -> 100%). The value less than 1 completely disables protection of the pages. -============================================================== max_map_count: +============== This file contains the maximum number of memory map areas a process may have. Memory map areas are used as a side-effect of calling @@ -396,9 +410,9 @@ e.g., up to one or two maps per allocation. The default value is 65536. -============================================================= memory_failure_early_kill: +========================== Control how to kill processes when uncorrected memory error (typically a 2bit error in a memory module) is detected in the background by hardware @@ -424,9 +438,9 @@ check handling and depends on the hardware capabilities. Applications can override this setting individually with the PR_MCE_KILL prctl -============================================================== memory_failure_recovery +======================= Enable memory failure recovery (when supported by the platform) @@ -434,9 +448,9 @@ Enable memory failure recovery (when supported by the platform) 0: Always panic on a memory failure. -============================================================== -min_free_kbytes: +min_free_kbytes +=============== This is used to force the Linux VM to keep a minimum number of kilobytes free. The VM uses this number to compute a @@ -450,9 +464,9 @@ become subtly broken, and prone to deadlock under high loads. Setting this too high will OOM your machine instantly. -============================================================= -min_slab_ratio: +min_slab_ratio +============== This is available only on NUMA kernels. 
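
A worked illustration of the lowmem_reserve_ratio expression converted in the hunk above. This is a minimal sketch and not kernel code: the function name protection_pages and the managed-page counts are invented for the example. Only the formula, the integer ratios, and the "ratio below 1 disables protection" rule come from the documentation text.

```python
def protection_pages(managed_pages, ratios):
    """Return protection[i][j] for every zone pair on one node.

    managed_pages[k] -- managed page count of zone k (lowest zone first)
    ratios[k]        -- lowmem_reserve_ratio entry for zone k
    """
    n = len(managed_pages)
    protection = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Sum of managed pages from zone[i+1] up to and including zone[j].
            higher = sum(managed_pages[i + 1 : j + 1])
            # A ratio below 1 disables protection, so the entry stays 0.
            if ratios[i] >= 1:
                protection[i][j] = higher // ratios[i]
    return protection


# Hypothetical node with DMA, DMA32 and Normal zones; sizes are made up.
managed = [4000, 500000, 1500000]
ratios = [256, 256, 32]
table = protection_pages(managed, ratios)
for row in table:
    print(row)
```

With a ratio of 256, each protection entry is roughly 0.39% (1/256) of the managed pages in the higher zones, which matches the figure quoted in the text. Entries with i >= j stay at 0.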
@@ -468,9 +482,9 @@ Note that slab reclaim is triggered in a per zone / node fashion. The process of reclaiming slab memory is currently not node specific and may not be fast. -============================================================= -min_unmapped_ratio: +min_unmapped_ratio +================== This is available only on NUMA kernels. @@ -485,9 +499,9 @@ files and similar are considered. The default is 1 percent. -============================================================== mmap_min_addr +============= This file indicates the amount of address space which a user process will be restricted from mmapping. Since kernel null dereference bugs could @@ -498,9 +512,9 @@ security module. Setting this value to something like 64k will allow the vast majority of applications to work correctly and provide defense in depth against future potential kernel bugs. -============================================================== -mmap_rnd_bits: +mmap_rnd_bits +============= This value can be used to select the number of bits to use to determine the random offset to the base address of vma regions @@ -511,9 +525,9 @@ by the architecture's minimum and maximum supported values. This value can be changed after boot using the /proc/sys/vm/mmap_rnd_bits tunable -============================================================== -mmap_rnd_compat_bits: +mmap_rnd_compat_bits +==================== This value can be used to select the number of bits to use to determine the random offset to the base address of vma regions @@ -525,35 +539,35 @@ architecture's minimum and maximum supported values. This value can be changed after boot using the /proc/sys/vm/mmap_rnd_compat_bits tunable -============================================================== nr_hugepages +============ Change the minimum size of the hugepage pool. 
See Documentation/admin-guide/mm/hugetlbpage.rst -============================================================== nr_hugepages_mempolicy +====================== Change the size of the hugepage pool at run-time on a specific set of NUMA nodes. See Documentation/admin-guide/mm/hugetlbpage.rst -============================================================== nr_overcommit_hugepages +======================= Change the maximum size of the hugepage pool. The maximum is nr_hugepages + nr_overcommit_hugepages. See Documentation/admin-guide/mm/hugetlbpage.rst -============================================================== nr_trim_pages +============= This is available only on NOMMU kernels. @@ -568,16 +582,17 @@ The default value is 1. See Documentation/nommu-mmap.rst for more information. -============================================================== numa_zonelist_order +=================== This sysctl is only for NUMA and it is deprecated. Anything but Node order will fail! 'where the memory is allocated from' is controlled by zonelists. + (This documentation ignores ZONE_HIGHMEM/ZONE_DMA32 for simple explanation. - you may be able to read ZONE_DMA as ZONE_DMA32...) +you may be able to read ZONE_DMA as ZONE_DMA32...) In non-NUMA case, a zonelist for GFP_KERNEL is ordered as following. ZONE_NORMAL -> ZONE_DMA @@ -585,10 +600,10 @@ This means that a memory allocation request for GFP_KERNEL will get memory from ZONE_DMA only when ZONE_NORMAL is not available. In NUMA case, you can think of following 2 types of order. -Assume 2 node NUMA and below is zonelist of Node(0)'s GFP_KERNEL +Assume 2 node NUMA and below is zonelist of Node(0)'s GFP_KERNEL:: -(A) Node(0) ZONE_NORMAL -> Node(0) ZONE_DMA -> Node(1) ZONE_NORMAL -(B) Node(0) ZONE_NORMAL -> Node(1) ZONE_NORMAL -> Node(0) ZONE_DMA. + (A) Node(0) ZONE_NORMAL -> Node(0) ZONE_DMA -> Node(1) ZONE_NORMAL + (B) Node(0) ZONE_NORMAL -> Node(1) ZONE_NORMAL -> Node(0) ZONE_DMA. 
Type(A) offers the best locality for processes on Node(0), but ZONE_DMA will be used before ZONE_NORMAL exhaustion. This increases possibility of @@ -616,9 +631,9 @@ order will be selected. Default order is recommended unless this is causing problems for your system/application. -============================================================== oom_dump_tasks +============== Enables a system-wide task dump (excluding kernel threads) to be produced when the kernel performs an OOM-killing and includes such information as @@ -638,9 +653,9 @@ OOM killer actually kills a memory-hogging task. The default value is 1 (enabled). -============================================================== oom_kill_allocating_task +======================== This enables or disables killing the OOM-triggering task in out-of-memory situations. @@ -659,9 +674,9 @@ is used in oom_kill_allocating_task. The default value is 0. -============================================================== -overcommit_kbytes: +overcommit_kbytes +================= When overcommit_memory is set to 2, the committed address space is not permitted to exceed swap plus this amount of physical RAM. See below. @@ -670,9 +685,9 @@ Note: overcommit_kbytes is the counterpart of overcommit_ratio. Only one of them may be specified at a time. Setting one disables the other (which then appears as 0 when read). -============================================================== -overcommit_memory: +overcommit_memory +================= This value contains a flag that enables memory overcommitment. @@ -695,17 +710,17 @@ The default value is 0. See Documentation/vm/overcommit-accounting.rst and mm/util.c::__vm_enough_memory() for more information. -============================================================== -overcommit_ratio: +overcommit_ratio +================ When overcommit_memory is set to 2, the committed address space is not permitted to exceed swap plus this percentage of physical RAM. See above. 
-============================================================== page-cluster +============ page-cluster controls the number of pages up to which consecutive pages are read in from swap in a single attempt. This is the swap counterpart @@ -725,9 +740,9 @@ Lower values mean lower latencies for initial faults, but at the same time extra faults and I/O delays for following faults if they would have been part of that consecutive pages readahead would have brought in. -============================================================= panic_on_oom +============ This enables or disables panic on out-of-memory feature. @@ -747,14 +762,16 @@ above-mentioned. Even oom happens under memory cgroup, the whole system panics. The default value is 0. + 1 and 2 are for failover of clustering. Please select either according to your policy of failover. + panic_on_oom=2+kdump gives you very strong tool to investigate why oom happens. You can get snapshot. -============================================================= percpu_pagelist_fraction +======================== This is the fraction of pages at most (high mark pcp->high) in each zone that are allocated for each per cpu page list. The min value for this is 8. It @@ -770,16 +787,16 @@ The initial value is zero. Kernel does not use this value at boot time to set the high water marks for each per cpu page list. If the user writes '0' to this sysctl, it will revert to this default behavior. -============================================================== stat_interval +============= The time interval between which vm statistics are updated. The default is 1 second. -============================================================== stat_refresh +============ Any read or write (by root only) flushes all the per-cpu vm statistics into their global totals, for more accurate reports when testing @@ -790,24 +807,26 @@ as 0) and "fails" with EINVAL if any are found, with a warning in dmesg. 
(At time of writing, a few stats are known sometimes to be found negative, with no ill effects: errors and warnings on these stats are suppressed.) -============================================================== numa_stat +========= This interface allows runtime configuration of numa statistics. When page allocation performance becomes a bottleneck and you can tolerate some possible tool breakage and decreased numa counter precision, you can -do: +do:: + echo 0 > /proc/sys/vm/numa_stat When page allocation performance is not a bottleneck and you want all -tooling to work, you can do: +tooling to work, you can do:: + echo 1 > /proc/sys/vm/numa_stat -============================================================== swappiness +========== This control is used to define how aggressive the kernel will swap memory pages. Higher values will increase aggressiveness, lower values @@ -817,9 +836,9 @@ than the high water mark in a zone. The default value is 60. -============================================================== unprivileged_userfaultfd +======================== This flag controls whether unprivileged users can use the userfaultfd system calls. Set this to 1 to allow unprivileged users to use the @@ -828,9 +847,9 @@ privileged users (with SYS_CAP_PTRACE capability). The default value is 1. -============================================================== -- user_reserve_kbytes +user_reserve_kbytes +=================== When overcommit_memory is set to 2, "never overcommit" mode, reserve min(3% of current process size, user_reserve_kbytes) of free memory. @@ -846,10 +865,9 @@ Any subsequent attempts to execute a command will result in Changing this takes effect whenever an application requests memory. -============================================================== vfs_cache_pressure ------------------- +================== This percentage value controls the tendency of the kernel to reclaim the memory which is used for caching of directory and inode objects. 
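
The overcommit_kbytes, overcommit_ratio and user_reserve_kbytes sections touched by the hunks above each describe a small piece of arithmetic for overcommit_memory=2. The sketch below restates those sentences as code under simplifying assumptions: the function names and all sample numbers are hypothetical, and the real accounting in mm/util.c::__vm_enough_memory() includes further adjustments that are ignored here.

```python
def commit_limit_kb(swap_kb, ram_kb, overcommit_ratio, overcommit_kbytes=0):
    """Upper bound on committed address space, in kB (simplified).

    Only one of overcommit_kbytes / overcommit_ratio is in effect at a
    time; a non-zero overcommit_kbytes takes the place of the ratio.
    """
    if overcommit_kbytes:
        return swap_kb + overcommit_kbytes
    return swap_kb + ram_kb * overcommit_ratio // 100


def user_reserve_kb(process_size_kb, user_reserve_kbytes):
    """min(3% of current process size, user_reserve_kbytes), as in the text."""
    return min(process_size_kb * 3 // 100, user_reserve_kbytes)


# Made-up machine: 8 GB of RAM and 2 GB of swap, expressed in kB.
print(commit_limit_kb(2_000_000, 8_000_000, overcommit_ratio=50))
print(commit_limit_kb(2_000_000, 8_000_000, overcommit_ratio=0,
                      overcommit_kbytes=1_000_000))
print(user_reserve_kb(1_000_000, 131072))
```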
@@ -867,9 +885,9 @@ performance impact. Reclaim code needs to take various locks to find freeable directory and inode objects. With vfs_cache_pressure=1000, it will look for ten times more freeable objects than there are. -============================================================= -watermark_boost_factor: +watermark_boost_factor +====================== This factor controls the level of reclaim when memory is being fragmented. It defines the percentage of the high watermark of a zone that will be @@ -887,9 +905,9 @@ recent past. If this value is smaller than a pageblock then a pageblocks worth of pages will be reclaimed (e.g. 2MB on 64-bit x86). A boost factor of 0 will disable the feature. -============================================================= -watermark_scale_factor: +watermark_scale_factor +====================== This factor controls the aggressiveness of kswapd. It defines the amount of memory left in a node/system before kswapd is woken up and @@ -905,20 +923,22 @@ that the number of free pages kswapd maintains for latency reasons is too small for the allocation bursts occurring in the system. This knob can then be used to tune kswapd aggressiveness accordingly. -============================================================== -zone_reclaim_mode: +zone_reclaim_mode +================= Zone_reclaim_mode allows someone to set more or less aggressive approaches to reclaim memory when a zone runs out of memory. If it is set to zero then no zone reclaim occurs. Allocations will be satisfied from other zones / nodes in the system. -This is value ORed together of +This is value OR'ed together of -1 = Zone reclaim on -2 = Zone reclaim writes dirty pages out -4 = Zone reclaim swaps pages += =================================== +1 Zone reclaim on +2 Zone reclaim writes dirty pages out +4 Zone reclaim swaps pages += =================================== zone_reclaim_mode is disabled by default. 
For file servers or workloads that benefit from having their data cached, zone_reclaim_mode should be
@@ -942,5 +962,3 @@ of other processes running on other nodes will not be affected.
 Allowing regular swap effectively restricts allocations to the local
 node unless explicitly overridden by memory policies or cpuset
 configurations.
-
-============ End of Document =================================
diff --git a/Documentation/vm/unevictable-lru.rst b/Documentation/vm/unevictable-lru.rst
index c6d94118fbcc..8ba656f37cd8 100644
--- a/Documentation/vm/unevictable-lru.rst
+++ b/Documentation/vm/unevictable-lru.rst
@@ -439,7 +439,7 @@ Compacting MLOCKED Pages
 
 The unevictable LRU can be scanned for compactable regions and the default
 behavior is to do so. /proc/sys/vm/compact_unevictable_allowed controls
-this behavior (see Documentation/sysctl/vm.txt). Once scanning of the
+this behavior (see Documentation/sysctl/vm.rst). Once scanning of the
 unevictable LRU is enabled, the work of compaction is mostly handled by
 the page migration code and the same work flow as described in MIGRATING
 MLOCKED PAGES will apply.
diff --git a/kernel/panic.c b/kernel/panic.c
index cd73af35ec66..63ee20704fe6 100644
--- a/kernel/panic.c
+++ b/kernel/panic.c
@@ -372,7 +372,7 @@ const struct taint_flag taint_flags[TAINT_FLAGS_COUNT] = {
 /**
  * print_tainted - return a string to represent the kernel taint state.
  *
- * For individual taint flag meanings, see Documentation/sysctl/kernel.txt
+ * For individual taint flag meanings, see Documentation/sysctl/kernel.rst
  *
  * The string is overwritten by the next call to print_tainted(),
  * but is always NULL terminated.
diff --git a/mm/swap.c b/mm/swap.c
index 3a75722e68a9..6fa43f17bcbc 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -7,7 +7,7 @@
 /*
  * This file contains the default values for the operation of the
  * Linux VM subsystem. Fine-tuning documentation can be found in
- * Documentation/sysctl/vm.txt.
+ * Documentation/sysctl/vm.rst.
  * Started 18.12.91
  * Swap aging added 23.2.95, Stephen Tweedie.
  * Buffermem limits added 12.3.98, Rik van Riel.
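
The zone_reclaim_mode table converted in the Documentation/sysctl/vm.rst part of this patch describes a value built by OR'ing three bits together. The following sketch decodes such a value. The dictionary name and the describe() helper are hypothetical; the bit values and their descriptions are taken from the table.

```python
# Bit meanings as listed in the zone_reclaim_mode table.
ZONE_RECLAIM_BITS = {
    1: "Zone reclaim on",
    2: "Zone reclaim writes dirty pages out",
    4: "Zone reclaim swaps pages",
}


def describe(mode):
    """List the behaviours enabled by a zone_reclaim_mode value."""
    return [text for bit, text in ZONE_RECLAIM_BITS.items() if mode & bit]


# 0 means no zone reclaim at all; 3 is 1 | 2; 5 is 1 | 4.
for value in (0, 3, 5):
    print(value, describe(value))
```

On a live system the current value could be read from /proc/sys/vm/zone_reclaim_mode and passed to the helper as an integer.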