diff mbox

[v9,1/9] Basic kernel memory functionality for the Memory Controller

Message ID 1323676029-5890-2-git-send-email-glommer@parallels.com
State Accepted, archived
Delegated to: David Miller
Headers show

Commit Message

Glauber Costa Dec. 12, 2011, 7:47 a.m. UTC
This patch lays down the foundation for the kernel memory component
of the Memory Controller.

As of today, I am only laying down the following files:

 * memory.independent_kmem_limit
 * memory.kmem.limit_in_bytes (currently ignored)
 * memory.kmem.usage_in_bytes (always zero)

Signed-off-by: Glauber Costa <glommer@parallels.com>
CC: Kirill A. Shutemov <kirill@shutemov.name>
CC: Paul Menage <paul@paulmenage.org>
CC: Greg Thelen <gthelen@google.com>
CC: Johannes Weiner <jweiner@redhat.com>
CC: Michal Hocko <mhocko@suse.cz>
---
 Documentation/cgroups/memory.txt |   40 ++++++++++++++-
 init/Kconfig                     |   11 ++++
 mm/memcontrol.c                  |  105 ++++++++++++++++++++++++++++++++++++--
 3 files changed, 149 insertions(+), 7 deletions(-)

Comments

Michal Hocko Dec. 14, 2011, 5:04 p.m. UTC | #1
[Now with the current patch version, I hope]

On Mon 12-12-11 11:47:01, Glauber Costa wrote:
> This patch lays down the foundation for the kernel memory component
> of the Memory Controller.
> 
> As of today, I am only laying down the following files:
> 
>  * memory.independent_kmem_limit

Maybe has been already discussed but the name is rather awkward and it
would deserve more clarification. It is independent in the way that it
doesn't add up to the standard (user) allocations or it enables/disables
accounting?

>  * memory.kmem.limit_in_bytes (currently ignored)

What happens if we reach the limit? Are all kernel allocations
considered or only selected caches? How do I find out which are those?

AFAIU you have implemented it for network buffers at this stage but I
guess that dentries will follow...

>  * memory.kmem.usage_in_bytes (always zero)
> 
> Signed-off-by: Glauber Costa <glommer@parallels.com>
> CC: Kirill A. Shutemov <kirill@shutemov.name>
> CC: Paul Menage <paul@paulmenage.org>
> CC: Greg Thelen <gthelen@google.com>
> CC: Johannes Weiner <jweiner@redhat.com>
> CC: Michal Hocko <mhocko@suse.cz>
> ---
>  Documentation/cgroups/memory.txt |   40 ++++++++++++++-
>  init/Kconfig                     |   11 ++++
>  mm/memcontrol.c                  |  105 ++++++++++++++++++++++++++++++++++++--
>  3 files changed, 149 insertions(+), 7 deletions(-)
> 
> diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
> index cc0ebc5..f245324 100644
> --- a/Documentation/cgroups/memory.txt
> +++ b/Documentation/cgroups/memory.txt
> @@ -44,8 +44,9 @@ Features:
>   - oom-killer disable knob and oom-notifier
>   - Root cgroup has no limit controls.
>  
> - Kernel memory and Hugepages are not under control yet. We just manage
> - pages on LRU. To add more controls, we have to take care of performance.
> + Hugepages is not under control yet. We just manage pages on LRU. To add more

Hugepages are not
Anyway this sounds outdated as we track both THP and hugetlb, right?

> + controls, we have to take care of performance. Kernel memory support is work
> + in progress, and the current version provides basically functionality.

s/basically/basic/

>  
>  Brief summary of control files.
>  
> @@ -56,8 +57,11 @@ Brief summary of control files.
>  				 (See 5.5 for details)
>   memory.memsw.usage_in_bytes	 # show current res_counter usage for memory+Swap
>  				 (See 5.5 for details)
> + memory.kmem.usage_in_bytes	 # show current res_counter usage for kmem only.
> +				 (See 2.7 for details)
>   memory.limit_in_bytes		 # set/show limit of memory usage
>   memory.memsw.limit_in_bytes	 # set/show limit of memory+Swap usage
> + memory.kmem.limit_in_bytes	 # if allowed, set/show limit of kernel memory
>   memory.failcnt			 # show the number of memory usage hits limits
>   memory.memsw.failcnt		 # show the number of memory+Swap hits limits
>   memory.max_usage_in_bytes	 # show max memory usage recorded
> @@ -72,6 +76,9 @@ Brief summary of control files.
>   memory.oom_control		 # set/show oom controls.
>   memory.numa_stat		 # show the number of memory usage per numa node
>  
> + memory.independent_kmem_limit	 # select whether or not kernel memory limits are
> +				   independent of user limits
> +

It is not clear what happens in enabled/disabled cases. Let's say they
are not independent. Does it form a single limit with user charges or it
toggles kmem charging on/off.

>  1. History
>  
>  The memory controller has a long history. A request for comments for the memory
> @@ -255,6 +262,35 @@ When oom event notifier is registered, event will be delivered.
>    per-zone-per-cgroup LRU (cgroup's private LRU) is just guarded by
>    zone->lru_lock, it has no lock of its own.
>  
> +2.7 Kernel Memory Extension (CONFIG_CGROUP_MEM_RES_CTLR_KMEM)
> +
> +With the Kernel memory extension, the Memory Controller is able to limit
> +the amount of kernel memory used by the system. Kernel memory is fundamentally
> +different than user memory, since it can't be swapped out, which makes it
> +possible to DoS the system by consuming too much of this precious resource.
> +
> +Some kernel memory resources may be accounted and limited separately from the
> +main "kmem" resource. For instance, a slab cache that is considered important
> +enough to be limited separately may have its own knobs.

How do you tell which are those that are accounted to the "main kmem"?

> +
> +Kernel memory limits are not imposed for the root cgroup. Usage for the root
> +cgroup may or may not be accounted.
> +
> +Memory limits as specified by the standard Memory Controller may or may not
> +take kernel memory into consideration. This is achieved through the file
> +memory.independent_kmem_limit. A Value different than 0 will allow for kernel
> +memory to be controlled separately.

Separately from user space allocations, right?
What happens if we reach the limit in both cases?

> @@ -344,9 +353,14 @@ enum charge_type {
>  };
>  
>  /* for encoding cft->private value on file */
> -#define _MEM			(0)
> -#define _MEMSWAP		(1)
> -#define _OOM_TYPE		(2)
> +
> +enum mem_type {
> +	_MEM = 0,
> +	_MEMSWAP,
> +	_OOM_TYPE,
> +	_KMEM,
> +};
> +

Probably in a separate (cleanup) patch?

>  #define MEMFILE_PRIVATE(x, val)	(((x) << 16) | (val))
>  #define MEMFILE_TYPE(val)	(((val) >> 16) & 0xffff)
>  #define MEMFILE_ATTR(val)	((val) & 0xffff)
> @@ -3848,10 +3862,17 @@ static inline u64 mem_cgroup_usage(struct mem_cgroup *memcg, bool swap)
>  	u64 val;
>  
>  	if (!mem_cgroup_is_root(memcg)) {
> +		val = 0;
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
> +		if (!memcg->kmem_independent_accounting)
> +			val = res_counter_read_u64(&memcg->kmem, RES_USAGE);
> +#endif
>  		if (!swap)
> -			return res_counter_read_u64(&memcg->res, RES_USAGE);
> +			val += res_counter_read_u64(&memcg->res, RES_USAGE);
>  		else
> -			return res_counter_read_u64(&memcg->memsw, RES_USAGE);
> +			val += res_counter_read_u64(&memcg->memsw, RES_USAGE);
> +
> +		return val;
>  	}

So you report kmem+user but we do not consider kmem during charge so one
can easily end up with usage_in_bytes over limit but no reclaim is going
on. Not good, I would say.

OK, so to sum it up. The biggest problem I see is the (non)independent
accounting. We simply cannot mix user+kernel limits otherwise we would
see issues (like kernel resource hog would force memcg-oom and innocent
members would die because their rss is much bigger).
It is also not clear to me what should happen when we hit the kmem
limit. I guess it will be kmem cache dependent.
Glauber Costa Dec. 15, 2011, 12:29 p.m. UTC | #2
On 12/14/2011 09:04 PM, Michal Hocko wrote:
> [Now with the current patch version, I hope]
>
> On Mon 12-12-11 11:47:01, Glauber Costa wrote:
>> This patch lays down the foundation for the kernel memory component
>> of the Memory Controller.
>>
>> As of today, I am only laying down the following files:
>>
>>   * memory.independent_kmem_limit
>
> Maybe has been already discussed but the name is rather awkward and it
> would deserve more clarification. It is independent in the way that it
> doesn't add up to the standard (user) allocations or it enables/disables
> accounting?

If turned on, it doesn't add up to the user allocations.
As for the name, this is marked experimental, so I don't think anyone 
will be relying on it for a while. We can change it, if you have a 
better suggestion.

>>   * memory.kmem.limit_in_bytes (currently ignored)
>
> What happens if we reach the limit? Are all kernel allocations
> considered or only selected caches? How do I find out which are those?
>
> AFAIU you have implemented it for network buffers at this stage but I
> guess that dentries will follow...

Further allocations should fail.

About other caches, tcp is a bit different because we are concerned with 
conditions that applies after the allocation already took place. It is 
not clear to me if we will treat the other caches as a single entity, or 
separate them.

>>   * memory.kmem.usage_in_bytes (always zero)
>>
>> Signed-off-by: Glauber Costa<glommer@parallels.com>
>> CC: Kirill A. Shutemov<kirill@shutemov.name>
>> CC: Paul Menage<paul@paulmenage.org>
>> CC: Greg Thelen<gthelen@google.com>
>> CC: Johannes Weiner<jweiner@redhat.com>
>> CC: Michal Hocko<mhocko@suse.cz>
>> ---
>>   Documentation/cgroups/memory.txt |   40 ++++++++++++++-
>>   init/Kconfig                     |   11 ++++
>>   mm/memcontrol.c                  |  105 ++++++++++++++++++++++++++++++++++++--
>>   3 files changed, 149 insertions(+), 7 deletions(-)
>>
>> diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
>> index cc0ebc5..f245324 100644
>> --- a/Documentation/cgroups/memory.txt
>> +++ b/Documentation/cgroups/memory.txt
>> @@ -44,8 +44,9 @@ Features:
>>    - oom-killer disable knob and oom-notifier
>>    - Root cgroup has no limit controls.
>>
>> - Kernel memory and Hugepages are not under control yet. We just manage
>> - pages on LRU. To add more controls, we have to take care of performance.
>> + Hugepages is not under control yet. We just manage pages on LRU. To add more
>
> Hugepages are not
> Anyway this sounds outdated as we track both THP and hugetlb, right?
>
>> + controls, we have to take care of performance. Kernel memory support is work
>> + in progress, and the current version provides basically functionality.
>
> s/basically/basic/
>
>>
>>   Brief summary of control files.
>>
>> @@ -56,8 +57,11 @@ Brief summary of control files.
>>   				 (See 5.5 for details)
>>    memory.memsw.usage_in_bytes	 # show current res_counter usage for memory+Swap
>>   				 (See 5.5 for details)
>> + memory.kmem.usage_in_bytes	 # show current res_counter usage for kmem only.
>> +				 (See 2.7 for details)
>>    memory.limit_in_bytes		 # set/show limit of memory usage
>>    memory.memsw.limit_in_bytes	 # set/show limit of memory+Swap usage
>> + memory.kmem.limit_in_bytes	 # if allowed, set/show limit of kernel memory
>>    memory.failcnt			 # show the number of memory usage hits limits
>>    memory.memsw.failcnt		 # show the number of memory+Swap hits limits
>>    memory.max_usage_in_bytes	 # show max memory usage recorded
>> @@ -72,6 +76,9 @@ Brief summary of control files.
>>    memory.oom_control		 # set/show oom controls.
>>    memory.numa_stat		 # show the number of memory usage per numa node
>>
>> + memory.independent_kmem_limit	 # select whether or not kernel memory limits are
>> +				   independent of user limits
>> +
>
> It is not clear what happens in enabled/disabled cases. Let's say they
> are not independent. Does it form a single limit with user charges or it
> toggles kmem charging on/off.
>
>>   1. History
>>
>>   The memory controller has a long history. A request for comments for the memory
>> @@ -255,6 +262,35 @@ When oom event notifier is registered, event will be delivered.
>>     per-zone-per-cgroup LRU (cgroup's private LRU) is just guarded by
>>     zone->lru_lock, it has no lock of its own.
>>
>> +2.7 Kernel Memory Extension (CONFIG_CGROUP_MEM_RES_CTLR_KMEM)
>> +
>> +With the Kernel memory extension, the Memory Controller is able to limit
>> +the amount of kernel memory used by the system. Kernel memory is fundamentally
>> +different than user memory, since it can't be swapped out, which makes it
>> +possible to DoS the system by consuming too much of this precious resource.
>> +
>> +Some kernel memory resources may be accounted and limited separately from the
>> +main "kmem" resource. For instance, a slab cache that is considered important
>> +enough to be limited separately may have its own knobs.
>
> How do you tell which are those that are accounted to the "main kmem"?

Besides being in this list, they should have they own files, like tcp.
>
>> +
>> +Kernel memory limits are not imposed for the root cgroup. Usage for the root
>> +cgroup may or may not be accounted.
>> +
>> +Memory limits as specified by the standard Memory Controller may or may not
>> +take kernel memory into consideration. This is achieved through the file
>> +memory.independent_kmem_limit. A Value different than 0 will allow for kernel
>> +memory to be controlled separately.
>
> Separately from user space allocations, right?
Yes.
> What happens if we reach the limit in both cases?
For kernel memory, further allocations should fail.

>
>> @@ -344,9 +353,14 @@ enum charge_type {
>>   };
>>
>>   /* for encoding cft->private value on file */
>> -#define _MEM			(0)
>> -#define _MEMSWAP		(1)
>> -#define _OOM_TYPE		(2)
>> +
>> +enum mem_type {
>> +	_MEM = 0,
>> +	_MEMSWAP,
>> +	_OOM_TYPE,
>> +	_KMEM,
>> +};
>> +
>
> Probably in a separate (cleanup) patch?
>
>>   #define MEMFILE_PRIVATE(x, val)	(((x)<<  16) | (val))
>>   #define MEMFILE_TYPE(val)	(((val)>>  16)&  0xffff)
>>   #define MEMFILE_ATTR(val)	((val)&  0xffff)
>> @@ -3848,10 +3862,17 @@ static inline u64 mem_cgroup_usage(struct mem_cgroup *memcg, bool swap)
>>   	u64 val;
>>
>>   	if (!mem_cgroup_is_root(memcg)) {
>> +		val = 0;
>> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
>> +		if (!memcg->kmem_independent_accounting)
>> +			val = res_counter_read_u64(&memcg->kmem, RES_USAGE);
>> +#endif
>>   		if (!swap)
>> -			return res_counter_read_u64(&memcg->res, RES_USAGE);
>> +			val += res_counter_read_u64(&memcg->res, RES_USAGE);
>>   		else
>> -			return res_counter_read_u64(&memcg->memsw, RES_USAGE);
>> +			val += res_counter_read_u64(&memcg->memsw, RES_USAGE);
>> +
>> +		return val;
>>   	}
>
> So you report kmem+user but we do not consider kmem during charge so one
> can easily end up with usage_in_bytes over limit but no reclaim is going
> on. Not good, I would say.
>
> OK, so to sum it up. The biggest problem I see is the (non)independent
> accounting. We simply cannot mix user+kernel limits otherwise we would
> see issues (like kernel resource hog would force memcg-oom and innocent
> members would die because their rss is much bigger).
> It is also not clear to me what should happen when we hit the kmem
> limit. I guess it will be kmem cache dependent.

So right now, tcp is completely independent, since it is not accounted 
to kmem. In summary, we still never do non-independent accounting. When 
we start doing it for the other caches, We will have to add a test at 
charge time as well.

We still need to keep it separate though, in case the independent flag 
is turned on/off

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Greg Thelen Dec. 16, 2011, 6:20 a.m. UTC | #3
On Sun, Dec 11, 2011 at 11:47 PM, Glauber Costa <glommer@parallels.com> wrote:
> +Memory limits as specified by the standard Memory Controller may or may not
> +take kernel memory into consideration. This is achieved through the file
> +memory.independent_kmem_limit. A Value different than 0 will allow for kernel

s/Value/value/

It is probably worth documenting the default value for
memory.independent_kmem_limit?  I figure it would be zero at root and
and inherited from parents.  But I think the implementation differs.

> @@ -277,6 +281,11 @@ struct mem_cgroup {
>         */
>        unsigned long   move_charge_at_immigrate;
>        /*
> +        * Should kernel memory limits be stabilished independently
> +        * from user memory ?
> +        */
> +       int             kmem_independent_accounting;

I have no serious objection, but a full int seems like overkill for a
boolean value.

> +static int register_kmem_files(struct cgroup *cont, struct cgroup_subsys *ss)
> +{
> +       int ret = 0;
> +
> +       ret = cgroup_add_files(cont, ss, kmem_cgroup_files,
> +                              ARRAY_SIZE(kmem_cgroup_files));
> +       return ret;

If you want to this function could be condensed down to:
  return cgroup_add_files(...);
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Michal Hocko Dec. 16, 2011, 12:32 p.m. UTC | #4
On Thu 15-12-11 16:29:18, Glauber Costa wrote:
> On 12/14/2011 09:04 PM, Michal Hocko wrote:
> >[Now with the current patch version, I hope]
> >On Mon 12-12-11 11:47:01, Glauber Costa wrote:
[...]
> >>@@ -3848,10 +3862,17 @@ static inline u64 mem_cgroup_usage(struct mem_cgroup *memcg, bool swap)
> >>  	u64 val;
> >>
> >>  	if (!mem_cgroup_is_root(memcg)) {
> >>+		val = 0;
> >>+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
> >>+		if (!memcg->kmem_independent_accounting)
> >>+			val = res_counter_read_u64(&memcg->kmem, RES_USAGE);
> >>+#endif
> >>  		if (!swap)
> >>-			return res_counter_read_u64(&memcg->res, RES_USAGE);
> >>+			val += res_counter_read_u64(&memcg->res, RES_USAGE);
> >>  		else
> >>-			return res_counter_read_u64(&memcg->memsw, RES_USAGE);
> >>+			val += res_counter_read_u64(&memcg->memsw, RES_USAGE);
> >>+
> >>+		return val;
> >>  	}
> >
> >So you report kmem+user but we do not consider kmem during charge so one
> >can easily end up with usage_in_bytes over limit but no reclaim is going
> >on. Not good, I would say.

I find this a problem and one of the reason I do not like !independent
accounting.

> >
> >OK, so to sum it up. The biggest problem I see is the (non)independent
> >accounting. We simply cannot mix user+kernel limits otherwise we would
> >see issues (like kernel resource hog would force memcg-oom and innocent
> >members would die because their rss is much bigger).
> >It is also not clear to me what should happen when we hit the kmem
> >limit. I guess it will be kmem cache dependent.
> 
> So right now, tcp is completely independent, since it is not
> accounted to kmem. 

So why do we need kmem accounting when tcp (the only user at the moment)
doesn't use it? 

> In summary, we still never do non-independent accounting. When we
> start doing it for the other caches, We will have to add a test at
> charge time as well.

So we shouldn't do it as a part of this patchset because the further
usage is not clear and I think there will be some real issues with
user+kmem accounting (e.g. a proper memcg-oom implementation).
Can you just drop this patch?

> We still need to keep it separate though, in case the independent
> flag is turned on/off

I don't mind to have kmem.tcp.* knobs.
Glauber Costa Dec. 16, 2011, 1:02 p.m. UTC | #5
On 12/16/2011 04:32 PM, Michal Hocko wrote:
> On Thu 15-12-11 16:29:18, Glauber Costa wrote:
>> On 12/14/2011 09:04 PM, Michal Hocko wrote:
>>> [Now with the current patch version, I hope]
>>> On Mon 12-12-11 11:47:01, Glauber Costa wrote:
> [...]
>>>> @@ -3848,10 +3862,17 @@ static inline u64 mem_cgroup_usage(struct mem_cgroup *memcg, bool swap)
>>>>   	u64 val;
>>>>
>>>>   	if (!mem_cgroup_is_root(memcg)) {
>>>> +		val = 0;
>>>> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
>>>> +		if (!memcg->kmem_independent_accounting)
>>>> +			val = res_counter_read_u64(&memcg->kmem, RES_USAGE);
>>>> +#endif
>>>>   		if (!swap)
>>>> -			return res_counter_read_u64(&memcg->res, RES_USAGE);
>>>> +			val += res_counter_read_u64(&memcg->res, RES_USAGE);
>>>>   		else
>>>> -			return res_counter_read_u64(&memcg->memsw, RES_USAGE);
>>>> +			val += res_counter_read_u64(&memcg->memsw, RES_USAGE);
>>>> +
>>>> +		return val;
>>>>   	}
>>>
>>> So you report kmem+user but we do not consider kmem during charge so one
>>> can easily end up with usage_in_bytes over limit but no reclaim is going
>>> on. Not good, I would say.
>
> I find this a problem and one of the reason I do not like !independent
> accounting.
>
>>>
>>> OK, so to sum it up. The biggest problem I see is the (non)independent
>>> accounting. We simply cannot mix user+kernel limits otherwise we would
>>> see issues (like kernel resource hog would force memcg-oom and innocent
>>> members would die because their rss is much bigger).
>>> It is also not clear to me what should happen when we hit the kmem
>>> limit. I guess it will be kmem cache dependent.
>>
>> So right now, tcp is completely independent, since it is not
>> accounted to kmem.
>
> So why do we need kmem accounting when tcp (the only user at the moment)
> doesn't use it?

Well, a bit historical. I needed a basic placeholder for it, since it 
tcp is officially kmem. As the time passed, I took most of the stuff out 
of this patch to leave just the basics I would need for tcp.
Turns out I ended up focusing on the rest, and some of the stuff was 
left here.

At one point I merged tcp data into kmem, but then reverted this 
behavior. the kmem counter stayed.

I agree deferring the whole behavior would be better.

>> In summary, we still never do non-independent accounting. When we
>> start doing it for the other caches, We will have to add a test at
>> charge time as well.
>
> So we shouldn't do it as a part of this patchset because the further
> usage is not clear and I think there will be some real issues with
> user+kmem accounting (e.g. a proper memcg-oom implementation).
> Can you just drop this patch?

Yes, but the whole set is in the net tree already. (All other patches 
are tcp-related but this) Would you mind if I'd send a follow up patch 
removing the kmem files, and leaving just the registration functions and 
basic documentation? (And sorry for that as well in advance)

>> We still need to keep it separate though, in case the independent
>> flag is turned on/off
>
> I don't mind to have kmem.tcp.* knobs.
>

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Michal Hocko Dec. 16, 2011, 1:30 p.m. UTC | #6
On Fri 16-12-11 17:02:51, Glauber Costa wrote:
> On 12/16/2011 04:32 PM, Michal Hocko wrote:
[...]
> >So why do we need kmem accounting when tcp (the only user at the moment)
> >doesn't use it?
> 
> Well, a bit historical. I needed a basic placeholder for it, since
> it tcp is officially kmem. As the time passed, I took most of the
> stuff out of this patch to leave just the basics I would need for
> tcp.
> Turns out I ended up focusing on the rest, and some of the stuff was
> left here.
> 
> At one point I merged tcp data into kmem, but then reverted this
> behavior. the kmem counter stayed.
> 
> I agree deferring the whole behavior would be better.
> 
> >>In summary, we still never do non-independent accounting. When we
> >>start doing it for the other caches, We will have to add a test at
> >>charge time as well.
> >
> >So we shouldn't do it as a part of this patchset because the further
> >usage is not clear and I think there will be some real issues with
> >user+kmem accounting (e.g. a proper memcg-oom implementation).
> >Can you just drop this patch?
> 
> Yes, but the whole set is in the net tree already. 

Isn't it only in some for-next branch? Can that one be updated?

> (All other patches are tcp-related but this) Would you mind if I'd
> send a follow up patch removing the kmem files, and leaving just the
> registration functions and basic documentation? (And sorry for that as
> well in advance)

Yes a followup patch would work as well.
diff mbox

Patch

diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index cc0ebc5..f245324 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -44,8 +44,9 @@  Features:
  - oom-killer disable knob and oom-notifier
  - Root cgroup has no limit controls.
 
- Kernel memory and Hugepages are not under control yet. We just manage
- pages on LRU. To add more controls, we have to take care of performance.
+ Hugepages is not under control yet. We just manage pages on LRU. To add more
+ controls, we have to take care of performance. Kernel memory support is work
+ in progress, and the current version provides basically functionality.
 
 Brief summary of control files.
 
@@ -56,8 +57,11 @@  Brief summary of control files.
 				 (See 5.5 for details)
  memory.memsw.usage_in_bytes	 # show current res_counter usage for memory+Swap
 				 (See 5.5 for details)
+ memory.kmem.usage_in_bytes	 # show current res_counter usage for kmem only.
+				 (See 2.7 for details)
  memory.limit_in_bytes		 # set/show limit of memory usage
  memory.memsw.limit_in_bytes	 # set/show limit of memory+Swap usage
+ memory.kmem.limit_in_bytes	 # if allowed, set/show limit of kernel memory
  memory.failcnt			 # show the number of memory usage hits limits
  memory.memsw.failcnt		 # show the number of memory+Swap hits limits
  memory.max_usage_in_bytes	 # show max memory usage recorded
@@ -72,6 +76,9 @@  Brief summary of control files.
  memory.oom_control		 # set/show oom controls.
  memory.numa_stat		 # show the number of memory usage per numa node
 
+ memory.independent_kmem_limit	 # select whether or not kernel memory limits are
+				   independent of user limits
+
 1. History
 
 The memory controller has a long history. A request for comments for the memory
@@ -255,6 +262,35 @@  When oom event notifier is registered, event will be delivered.
   per-zone-per-cgroup LRU (cgroup's private LRU) is just guarded by
   zone->lru_lock, it has no lock of its own.
 
+2.7 Kernel Memory Extension (CONFIG_CGROUP_MEM_RES_CTLR_KMEM)
+
+With the Kernel memory extension, the Memory Controller is able to limit
+the amount of kernel memory used by the system. Kernel memory is fundamentally
+different than user memory, since it can't be swapped out, which makes it
+possible to DoS the system by consuming too much of this precious resource.
+
+Some kernel memory resources may be accounted and limited separately from the
+main "kmem" resource. For instance, a slab cache that is considered important
+enough to be limited separately may have its own knobs.
+
+Kernel memory limits are not imposed for the root cgroup. Usage for the root
+cgroup may or may not be accounted.
+
+Memory limits as specified by the standard Memory Controller may or may not
+take kernel memory into consideration. This is achieved through the file
+memory.independent_kmem_limit. A Value different than 0 will allow for kernel
+memory to be controlled separately.
+
+When kernel memory limits are not independent, the limit values set in
+memory.kmem files are ignored.
+
+Currently no soft limit is implemented for kernel memory. It is future work
+to trigger slab reclaim when those limits are reached.
+
+2.7.1 Current Kernel Memory resources accounted
+
+None
+
 3. User Interface
 
 0. Configuration
diff --git a/init/Kconfig b/init/Kconfig
index 43298f9..b8930d5 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -689,6 +689,17 @@  config CGROUP_MEM_RES_CTLR_SWAP_ENABLED
 	  For those who want to have the feature enabled by default should
 	  select this option (if, for some reason, they need to disable it
 	  then swapaccount=0 does the trick).
+config CGROUP_MEM_RES_CTLR_KMEM
+	bool "Memory Resource Controller Kernel Memory accounting (EXPERIMENTAL)"
+	depends on CGROUP_MEM_RES_CTLR && EXPERIMENTAL
+	default n
+	help
+	  The Kernel Memory extension for Memory Resource Controller can limit
+	  the amount of memory used by kernel objects in the system. Those are
+	  fundamentally different from the entities handled by the standard
+	  Memory Controller, which are page-based, and can be swapped. Users of
+	  the kmem extension can use it to guarantee that no group of processes
+	  will ever exhaust kernel resources alone.
 
 config CGROUP_PERF
 	bool "Enable perf_event per-cpu per-container group (cgroup) monitoring"
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6aff93c..9fbcff7 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -227,6 +227,10 @@  struct mem_cgroup {
 	 */
 	struct res_counter memsw;
 	/*
+	 * the counter to account for kmem usage.
+	 */
+	struct res_counter kmem;
+	/*
 	 * Per cgroup active and inactive list, similar to the
 	 * per zone LRU lists.
 	 */
@@ -277,6 +281,11 @@  struct mem_cgroup {
 	 */
 	unsigned long 	move_charge_at_immigrate;
 	/*
+	 * Should kernel memory limits be stabilished independently
+	 * from user memory ?
+	 */
+	int		kmem_independent_accounting;
+	/*
 	 * percpu counter.
 	 */
 	struct mem_cgroup_stat_cpu *stat;
@@ -344,9 +353,14 @@  enum charge_type {
 };
 
 /* for encoding cft->private value on file */
-#define _MEM			(0)
-#define _MEMSWAP		(1)
-#define _OOM_TYPE		(2)
+
+enum mem_type {
+	_MEM = 0,
+	_MEMSWAP,
+	_OOM_TYPE,
+	_KMEM,
+};
+
 #define MEMFILE_PRIVATE(x, val)	(((x) << 16) | (val))
 #define MEMFILE_TYPE(val)	(((val) >> 16) & 0xffff)
 #define MEMFILE_ATTR(val)	((val) & 0xffff)
@@ -3848,10 +3862,17 @@  static inline u64 mem_cgroup_usage(struct mem_cgroup *memcg, bool swap)
 	u64 val;
 
 	if (!mem_cgroup_is_root(memcg)) {
+		val = 0;
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
+		if (!memcg->kmem_independent_accounting)
+			val = res_counter_read_u64(&memcg->kmem, RES_USAGE);
+#endif
 		if (!swap)
-			return res_counter_read_u64(&memcg->res, RES_USAGE);
+			val += res_counter_read_u64(&memcg->res, RES_USAGE);
 		else
-			return res_counter_read_u64(&memcg->memsw, RES_USAGE);
+			val += res_counter_read_u64(&memcg->memsw, RES_USAGE);
+
+		return val;
 	}
 
 	val = mem_cgroup_recursive_stat(memcg, MEM_CGROUP_STAT_CACHE);
@@ -3884,6 +3905,11 @@  static u64 mem_cgroup_read(struct cgroup *cont, struct cftype *cft)
 		else
 			val = res_counter_read_u64(&memcg->memsw, name);
 		break;
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
+	case _KMEM:
+		val = res_counter_read_u64(&memcg->kmem, name);
+		break;
+#endif
 	default:
 		BUG();
 		break;
@@ -4612,6 +4638,69 @@  static int mem_control_numa_stat_open(struct inode *unused, struct file *file)
 }
 #endif /* CONFIG_NUMA */
 
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
+static u64 kmem_limit_independent_read(struct cgroup *cgroup, struct cftype *cft)
+{
+	return mem_cgroup_from_cont(cgroup)->kmem_independent_accounting;
+}
+
+static int kmem_limit_independent_write(struct cgroup *cgroup, struct cftype *cft,
+					u64 val)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgroup);
+	struct mem_cgroup *parent = parent_mem_cgroup(memcg);
+
+	val = !!val;
+
+	/*
+	 * This follows the same hierarchy restrictions than
+	 * mem_cgroup_hierarchy_write()
+	 */
+	if (!parent || !parent->use_hierarchy) {
+		if (list_empty(&cgroup->children))
+			memcg->kmem_independent_accounting = val;
+		else
+			return -EBUSY;
+	}
+	else
+		return -EINVAL;
+
+	return 0;
+}
+static struct cftype kmem_cgroup_files[] = {
+	{
+		.name = "independent_kmem_limit",
+		.read_u64 = kmem_limit_independent_read,
+		.write_u64 = kmem_limit_independent_write,
+	},
+	{
+		.name = "kmem.usage_in_bytes",
+		.private = MEMFILE_PRIVATE(_KMEM, RES_USAGE),
+		.read_u64 = mem_cgroup_read,
+	},
+	{
+		.name = "kmem.limit_in_bytes",
+		.private = MEMFILE_PRIVATE(_KMEM, RES_LIMIT),
+		.read_u64 = mem_cgroup_read,
+	},
+};
+
+static int register_kmem_files(struct cgroup *cont, struct cgroup_subsys *ss)
+{
+	int ret = 0;
+
+	ret = cgroup_add_files(cont, ss, kmem_cgroup_files,
+			       ARRAY_SIZE(kmem_cgroup_files));
+	return ret;
+};
+
+#else
+static int register_kmem_files(struct cgroup *cont, struct cgroup_subsys *ss)
+{
+	return 0;
+}
+#endif
+
 static struct cftype mem_cgroup_files[] = {
 	{
 		.name = "usage_in_bytes",
@@ -4925,6 +5014,7 @@  mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
 	if (parent && parent->use_hierarchy) {
 		res_counter_init(&memcg->res, &parent->res);
 		res_counter_init(&memcg->memsw, &parent->memsw);
+		res_counter_init(&memcg->kmem, &parent->kmem);
 		/*
 		 * We increment refcnt of the parent to ensure that we can
 		 * safely access it on res_counter_charge/uncharge.
@@ -4935,6 +5025,7 @@  mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
 	} else {
 		res_counter_init(&memcg->res, NULL);
 		res_counter_init(&memcg->memsw, NULL);
+		res_counter_init(&memcg->kmem, NULL);
 	}
 	memcg->last_scanned_child = 0;
 	memcg->last_scanned_node = MAX_NUMNODES;
@@ -4978,6 +5069,10 @@  static int mem_cgroup_populate(struct cgroup_subsys *ss,
 
 	if (!ret)
 		ret = register_memsw_files(cont, ss);
+
+	if (!ret)
+		ret = register_kmem_files(cont, ss);
+
 	return ret;
 }