mbox series

[net-next,0/5] rtnetlink: add IFA_IF_NETNSID for RTM_GETADDR

Message ID 20180828231859.29758-1-christian@brauner.io
Headers show
Series rtnetlink: add IFA_IF_NETNSID for RTM_GETADDR | expand

Message

Christian Brauner Aug. 28, 2018, 11:18 p.m. UTC
From: Christian Brauner <christian.brauner@ubuntu.com>

Hey,

A while back we introduced and enabled IFLA_IF_NETNSID in
RTM_{DEL,GET,NEW}LINK requests (cf. [1], [2], [3], [4], [5]). This has led
to signficant performance increases since it allows userspace to avoid
taking the hit of a setns(netns_fd, CLONE_NEWNET), then getting the
interfaces from the netns associated with the netns_fd. Especially when a
lot of network namespaces are in use, using setns() becomes increasingly
problematic when performance matters.
Usually, RTML_GETLINK requests are followed by RTM_GETADDR requests (cf.
getifaddrs() style functions and friends). But currently, RTM_GETADDR
requests do not support a similar property like IFLA_IF_NETNSID for
RTM_*LINK requests.
This is problematic since userspace can retrieve interfaces from another
network namespace by sending a IFLA_IF_NETNSID property along but
RTM_GETLINK request but is still forced to use the legacy setns() style of
retrieving interfaces in RTM_GETADDR requests.

The goal of this series is to make it possible to perform RTM_GETADDR
requests on different network namespaces. To this end a new IFA_IF_NETNSID
property for RTM_*ADDR requests is introduced. It can be used to send a
network namespace identifier along in RTM_*ADDR requests.  The network
namespace identifier will be used to retrieve the target network namespace
in which the request is supposed to be fulfilled.  This aligns the behavior
of RTM_*ADDR requests with the behavior of RTM_*LINK requests.

Security:
- The caller must have assigned a valid network namespace identifier for
  the target network namespace.
- The caller must have CAP_NET_ADMIN in the owning user namespace of the
  target network namespace.

Thanks!
Christian

[1]: commit 7973bfd8758d ("rtnetlink: remove check for IFLA_IF_NETNSID")
[2]: commit 5bb8ed075428 ("rtnetlink: enable IFLA_IF_NETNSID for RTM_NEWLINK")
[3]: commit b61ad68a9fe8 ("rtnetlink: enable IFLA_IF_NETNSID for RTM_DELLINK")
[4]: commit c310bfcb6e1b ("rtnetlink: enable IFLA_IF_NETNSID for RTM_SETLINK")
[5]: commit 7c4f63ba8243 ("rtnetlink: enable IFLA_IF_NETNSID in do_setlink()")

Christian Brauner (5):
  rtnetlink: add rtnl_get_net_ns_capable()
  if_addr: add IFA_IF_NETNSID
  ipv4: enable IFA_IF_NETNSID for RTM_GETADDR
  ipv6: enable IFA_IF_NETNSID for RTM_GETADDR
  rtnetlink: move type calculation out of loop

 include/net/rtnetlink.h      |  1 +
 include/uapi/linux/if_addr.h |  1 +
 net/core/rtnetlink.c         | 15 +++++---
 net/ipv4/devinet.c           | 38 +++++++++++++++-----
 net/ipv6/addrconf.c          | 70 ++++++++++++++++++++++++++++--------
 5 files changed, 97 insertions(+), 28 deletions(-)

Comments

Kirill Tkhai Aug. 29, 2018, 8:30 a.m. UTC | #1
Hi, Christian,

On 29.08.2018 02:18, Christian Brauner wrote:
> From: Christian Brauner <christian.brauner@ubuntu.com>
> 
> Hey,
> 
> A while back we introduced and enabled IFLA_IF_NETNSID in
> RTM_{DEL,GET,NEW}LINK requests (cf. [1], [2], [3], [4], [5]). This has led
> to signficant performance increases since it allows userspace to avoid
> taking the hit of a setns(netns_fd, CLONE_NEWNET), then getting the
> interfaces from the netns associated with the netns_fd. Especially when a
> lot of network namespaces are in use, using setns() becomes increasingly
> problematic when performance matters.

could you please give a real example, when setns()+socket(AF_NETLINK) cause
problems with the performance? You should do this only once on application
startup, and then you have created netlink sockets in any net namespaces you
need. What is the problem here?

> Usually, RTML_GETLINK requests are followed by RTM_GETADDR requests (cf.
> getifaddrs() style functions and friends). But currently, RTM_GETADDR
> requests do not support a similar property like IFLA_IF_NETNSID for
> RTM_*LINK requests.
> This is problematic since userspace can retrieve interfaces from another
> network namespace by sending a IFLA_IF_NETNSID property along but
> RTM_GETLINK request but is still forced to use the legacy setns() style of
> retrieving interfaces in RTM_GETADDR requests.
> 
> The goal of this series is to make it possible to perform RTM_GETADDR
> requests on different network namespaces. To this end a new IFA_IF_NETNSID
> property for RTM_*ADDR requests is introduced. It can be used to send a
> network namespace identifier along in RTM_*ADDR requests.  The network
> namespace identifier will be used to retrieve the target network namespace
> in which the request is supposed to be fulfilled.  This aligns the behavior
> of RTM_*ADDR requests with the behavior of RTM_*LINK requests.
> 
> Security:
> - The caller must have assigned a valid network namespace identifier for
>   the target network namespace.
> - The caller must have CAP_NET_ADMIN in the owning user namespace of the
>   target network namespace.
> 
> Thanks!
> Christian
> 
> [1]: commit 7973bfd8758d ("rtnetlink: remove check for IFLA_IF_NETNSID")
> [2]: commit 5bb8ed075428 ("rtnetlink: enable IFLA_IF_NETNSID for RTM_NEWLINK")
> [3]: commit b61ad68a9fe8 ("rtnetlink: enable IFLA_IF_NETNSID for RTM_DELLINK")
> [4]: commit c310bfcb6e1b ("rtnetlink: enable IFLA_IF_NETNSID for RTM_SETLINK")
> [5]: commit 7c4f63ba8243 ("rtnetlink: enable IFLA_IF_NETNSID in do_setlink()")
> 
> Christian Brauner (5):
>   rtnetlink: add rtnl_get_net_ns_capable()
>   if_addr: add IFA_IF_NETNSID
>   ipv4: enable IFA_IF_NETNSID for RTM_GETADDR
>   ipv6: enable IFA_IF_NETNSID for RTM_GETADDR
>   rtnetlink: move type calculation out of loop
> 
>  include/net/rtnetlink.h      |  1 +
>  include/uapi/linux/if_addr.h |  1 +
>  net/core/rtnetlink.c         | 15 +++++---
>  net/ipv4/devinet.c           | 38 +++++++++++++++-----
>  net/ipv6/addrconf.c          | 70 ++++++++++++++++++++++++++++--------
>  5 files changed, 97 insertions(+), 28 deletions(-)
>
Christian Brauner Aug. 29, 2018, 6:13 p.m. UTC | #2
Hi Kirill,

Thanks for the question!

On Wed, Aug 29, 2018 at 11:30:37AM +0300, Kirill Tkhai wrote:
> Hi, Christian,
> 
> On 29.08.2018 02:18, Christian Brauner wrote:
> > From: Christian Brauner <christian@brauner.io>
> > 
> > Hey,
> > 
> > A while back we introduced and enabled IFLA_IF_NETNSID in
> > RTM_{DEL,GET,NEW}LINK requests (cf. [1], [2], [3], [4], [5]). This has led
> > to signficant performance increases since it allows userspace to avoid
> > taking the hit of a setns(netns_fd, CLONE_NEWNET), then getting the
> > interfaces from the netns associated with the netns_fd. Especially when a
> > lot of network namespaces are in use, using setns() becomes increasingly
> > problematic when performance matters.
> 
> could you please give a real example, when setns()+socket(AF_NETLINK) cause
> problems with the performance? You should do this only once on application
> startup, and then you have created netlink sockets in any net namespaces you
> need. What is the problem here?

So we have a daemon (LXD) that is often running thousands of containers.
When users issue a lxc list request against the daemon it returns a list
of all containers including all of the interfaces and addresses for each
container. To retrieve those addresses we currently rely on setns() +
getifaddrs() for each of those containers. That has horrible
performance.
The problem with what you're proposing is that the daemon would need to
cache a socket file descriptor for each container which is something
that we unfortunately cannot do since we can't excessively cache file
descriptors because we can easily hit the open file limit. We also
refrain from caching file descriptors for a long time for security
reasons.

For the case where users just request a list of the interfaces we
can already use RTM_GETLINK + IFLA_IF_NETNS which has way better
performance. But we can't do the same with RTM_GETADDR requests which
was an oversight on my part when I wrote the original patchset for the
RTM_*LINK requests. This just rectifies this and aligns RTM_GETLINK +
RTM_GETADDR.
Based on this patchset I have written a userspace POC that is basically
a netns namespace aware getifaddr() or - as I like to call it -
netns_getifaddr().

> 
> > Usually, RTML_GETLINK requests are followed by RTM_GETADDR requests (cf.
> > getifaddrs() style functions and friends). But currently, RTM_GETADDR
> > requests do not support a similar property like IFLA_IF_NETNSID for
> > RTM_*LINK requests.
> > This is problematic since userspace can retrieve interfaces from another
> > network namespace by sending a IFLA_IF_NETNSID property along but
> > RTM_GETLINK request but is still forced to use the legacy setns() style of
> > retrieving interfaces in RTM_GETADDR requests.
> > 
> > The goal of this series is to make it possible to perform RTM_GETADDR
> > requests on different network namespaces. To this end a new IFA_IF_NETNSID
> > property for RTM_*ADDR requests is introduced. It can be used to send a
> > network namespace identifier along in RTM_*ADDR requests.  The network
> > namespace identifier will be used to retrieve the target network namespace
> > in which the request is supposed to be fulfilled.  This aligns the behavior
> > of RTM_*ADDR requests with the behavior of RTM_*LINK requests.
> > 
> > Security:
> > - The caller must have assigned a valid network namespace identifier for
> >   the target network namespace.
> > - The caller must have CAP_NET_ADMIN in the owning user namespace of the
> >   target network namespace.
> > 
> > Thanks!
> > Christian
> > 
> > [1]: commit 7973bfd8758d ("rtnetlink: remove check for IFLA_IF_NETNSID")
> > [2]: commit 5bb8ed075428 ("rtnetlink: enable IFLA_IF_NETNSID for RTM_NEWLINK")
> > [3]: commit b61ad68a9fe8 ("rtnetlink: enable IFLA_IF_NETNSID for RTM_DELLINK")
> > [4]: commit c310bfcb6e1b ("rtnetlink: enable IFLA_IF_NETNSID for RTM_SETLINK")
> > [5]: commit 7c4f63ba8243 ("rtnetlink: enable IFLA_IF_NETNSID in do_setlink()")
> > 
> > Christian Brauner (5):
> >   rtnetlink: add rtnl_get_net_ns_capable()
> >   if_addr: add IFA_IF_NETNSID
> >   ipv4: enable IFA_IF_NETNSID for RTM_GETADDR
> >   ipv6: enable IFA_IF_NETNSID for RTM_GETADDR
> >   rtnetlink: move type calculation out of loop
> > 
> >  include/net/rtnetlink.h      |  1 +
> >  include/uapi/linux/if_addr.h |  1 +
> >  net/core/rtnetlink.c         | 15 +++++---
> >  net/ipv4/devinet.c           | 38 +++++++++++++++-----
> >  net/ipv6/addrconf.c          | 70 ++++++++++++++++++++++++++++--------
> >  5 files changed, 97 insertions(+), 28 deletions(-)
> >
Kirill Tkhai Aug. 30, 2018, 8:49 a.m. UTC | #3
On 29.08.2018 21:13, Christian Brauner wrote:
> Hi Kirill,
> 
> Thanks for the question!
> 
> On Wed, Aug 29, 2018 at 11:30:37AM +0300, Kirill Tkhai wrote:
>> Hi, Christian,
>>
>> On 29.08.2018 02:18, Christian Brauner wrote:
>>> From: Christian Brauner <christian@brauner.io>
>>>
>>> Hey,
>>>
>>> A while back we introduced and enabled IFLA_IF_NETNSID in
>>> RTM_{DEL,GET,NEW}LINK requests (cf. [1], [2], [3], [4], [5]). This has led
>>> to signficant performance increases since it allows userspace to avoid
>>> taking the hit of a setns(netns_fd, CLONE_NEWNET), then getting the
>>> interfaces from the netns associated with the netns_fd. Especially when a
>>> lot of network namespaces are in use, using setns() becomes increasingly
>>> problematic when performance matters.
>>
>> could you please give a real example, when setns()+socket(AF_NETLINK) cause
>> problems with the performance? You should do this only once on application
>> startup, and then you have created netlink sockets in any net namespaces you
>> need. What is the problem here?
> 
> So we have a daemon (LXD) that is often running thousands of containers.
> When users issue a lxc list request against the daemon it returns a list
> of all containers including all of the interfaces and addresses for each
> container. To retrieve those addresses we currently rely on setns() +
> getifaddrs() for each of those containers. That has horrible
> performance.

Could you please provide some numbers showing that setns()
introduces signify performance decrease in the application?

I worry about all this just because of netlink interface is
already overloaded, while this patch makes it even less modular.
In case of one day we finally reach rtnl unscalability trap,
every common interface like this may be a decisive nail in
a coffin lid of possibility to overwrite everything.

> The problem with what you're proposing is that the daemon would need to
> cache a socket file descriptor for each container which is something
> that we unfortunately cannot do since we can't excessively cache file
> descriptors because we can easily hit the open file limit. We also
> refrain from caching file descriptors for a long time for security
> reasons.
> 
> For the case where users just request a list of the interfaces we
> can already use RTM_GETLINK + IFLA_IF_NETNS which has way better
> performance. But we can't do the same with RTM_GETADDR requests which
> was an oversight on my part when I wrote the original patchset for the
> RTM_*LINK requests. This just rectifies this and aligns RTM_GETLINK +
> RTM_GETADDR.
> Based on this patchset I have written a userspace POC that is basically
> a netns namespace aware getifaddr() or - as I like to call it -
> netns_getifaddr().
> 
>>
>>> Usually, RTML_GETLINK requests are followed by RTM_GETADDR requests (cf.
>>> getifaddrs() style functions and friends). But currently, RTM_GETADDR
>>> requests do not support a similar property like IFLA_IF_NETNSID for
>>> RTM_*LINK requests.
>>> This is problematic since userspace can retrieve interfaces from another
>>> network namespace by sending a IFLA_IF_NETNSID property along but
>>> RTM_GETLINK request but is still forced to use the legacy setns() style of
>>> retrieving interfaces in RTM_GETADDR requests.
>>>
>>> The goal of this series is to make it possible to perform RTM_GETADDR
>>> requests on different network namespaces. To this end a new IFA_IF_NETNSID
>>> property for RTM_*ADDR requests is introduced. It can be used to send a
>>> network namespace identifier along in RTM_*ADDR requests.  The network
>>> namespace identifier will be used to retrieve the target network namespace
>>> in which the request is supposed to be fulfilled.  This aligns the behavior
>>> of RTM_*ADDR requests with the behavior of RTM_*LINK requests.
>>>
>>> Security:
>>> - The caller must have assigned a valid network namespace identifier for
>>>   the target network namespace.
>>> - The caller must have CAP_NET_ADMIN in the owning user namespace of the
>>>   target network namespace.
>>>
>>> Thanks!
>>> Christian
>>>
>>> [1]: commit 7973bfd8758d ("rtnetlink: remove check for IFLA_IF_NETNSID")
>>> [2]: commit 5bb8ed075428 ("rtnetlink: enable IFLA_IF_NETNSID for RTM_NEWLINK")
>>> [3]: commit b61ad68a9fe8 ("rtnetlink: enable IFLA_IF_NETNSID for RTM_DELLINK")
>>> [4]: commit c310bfcb6e1b ("rtnetlink: enable IFLA_IF_NETNSID for RTM_SETLINK")
>>> [5]: commit 7c4f63ba8243 ("rtnetlink: enable IFLA_IF_NETNSID in do_setlink()")
>>>
>>> Christian Brauner (5):
>>>   rtnetlink: add rtnl_get_net_ns_capable()
>>>   if_addr: add IFA_IF_NETNSID
>>>   ipv4: enable IFA_IF_NETNSID for RTM_GETADDR
>>>   ipv6: enable IFA_IF_NETNSID for RTM_GETADDR
>>>   rtnetlink: move type calculation out of loop
>>>
>>>  include/net/rtnetlink.h      |  1 +
>>>  include/uapi/linux/if_addr.h |  1 +
>>>  net/core/rtnetlink.c         | 15 +++++---
>>>  net/ipv4/devinet.c           | 38 +++++++++++++++-----
>>>  net/ipv6/addrconf.c          | 70 ++++++++++++++++++++++++++++--------
>>>  5 files changed, 97 insertions(+), 28 deletions(-)
>>>
Christian Brauner Aug. 30, 2018, 2:45 p.m. UTC | #4
On Thu, Aug 30, 2018 at 11:49:31AM +0300, Kirill Tkhai wrote:
> On 29.08.2018 21:13, Christian Brauner wrote:
> > Hi Kirill,
> > 
> > Thanks for the question!
> > 
> > On Wed, Aug 29, 2018 at 11:30:37AM +0300, Kirill Tkhai wrote:
> >> Hi, Christian,
> >>
> >> On 29.08.2018 02:18, Christian Brauner wrote:
> >>> From: Christian Brauner <christian@brauner.io>
> >>>
> >>> Hey,
> >>>
> >>> A while back we introduced and enabled IFLA_IF_NETNSID in
> >>> RTM_{DEL,GET,NEW}LINK requests (cf. [1], [2], [3], [4], [5]). This has led
> >>> to signficant performance increases since it allows userspace to avoid
> >>> taking the hit of a setns(netns_fd, CLONE_NEWNET), then getting the
> >>> interfaces from the netns associated with the netns_fd. Especially when a
> >>> lot of network namespaces are in use, using setns() becomes increasingly
> >>> problematic when performance matters.
> >>
> >> could you please give a real example, when setns()+socket(AF_NETLINK) cause
> >> problems with the performance? You should do this only once on application
> >> startup, and then you have created netlink sockets in any net namespaces you
> >> need. What is the problem here?
> > 
> > So we have a daemon (LXD) that is often running thousands of containers.
> > When users issue a lxc list request against the daemon it returns a list
> > of all containers including all of the interfaces and addresses for each
> > container. To retrieve those addresses we currently rely on setns() +
> > getifaddrs() for each of those containers. That has horrible
> > performance.
> 
> Could you please provide some numbers showing that setns()
> introduces signify performance decrease in the application?

Sure, might take a few days++ though since I'm traveling.

> 
> I worry about all this just because of netlink interface is
> already overloaded, while this patch makes it even less modular.

Introducing the IFA_IF_NETNSID property will not make the netlink
interface less modular. It is a clean, RTM_*ADDR-request specific
property using network namespace identifiers which we discussed in prior
patches are the way to go forward.

You can already get interfaces via GETLINK from another network
namespaces than the one you reside in (Which we enabled just a few
months back.) but you can't do the same for GETADDR. Those two are
almost always used together. When you want to get the links you usually
also want to get the addresses associated with it right after.
In a prior discussion we agreed that network namespace identifiers are
the way to go forward but that any other propery, i.e. PIDs and fds
should never be ported into other parts of the codebase and that is
indeed something I agree with.

> In case of one day we finally reach rtnl unscalability trap,
> every common interface like this may be a decisive nail in
> a coffin lid of possibility to overwrite everything.
> 
> > The problem with what you're proposing is that the daemon would need to
> > cache a socket file descriptor for each container which is something
> > that we unfortunately cannot do since we can't excessively cache file
> > descriptors because we can easily hit the open file limit. We also
> > refrain from caching file descriptors for a long time for security
> > reasons.
> > 
> > For the case where users just request a list of the interfaces we
> > can already use RTM_GETLINK + IFLA_IF_NETNS which has way better
> > performance. But we can't do the same with RTM_GETADDR requests which
> > was an oversight on my part when I wrote the original patchset for the
> > RTM_*LINK requests. This just rectifies this and aligns RTM_GETLINK +
> > RTM_GETADDR.
> > Based on this patchset I have written a userspace POC that is basically
> > a netns namespace aware getifaddr() or - as I like to call it -
> > netns_getifaddr().
> > 
> >>
> >>> Usually, RTML_GETLINK requests are followed by RTM_GETADDR requests (cf.
> >>> getifaddrs() style functions and friends). But currently, RTM_GETADDR
> >>> requests do not support a similar property like IFLA_IF_NETNSID for
> >>> RTM_*LINK requests.
> >>> This is problematic since userspace can retrieve interfaces from another
> >>> network namespace by sending a IFLA_IF_NETNSID property along but
> >>> RTM_GETLINK request but is still forced to use the legacy setns() style of
> >>> retrieving interfaces in RTM_GETADDR requests.
> >>>
> >>> The goal of this series is to make it possible to perform RTM_GETADDR
> >>> requests on different network namespaces. To this end a new IFA_IF_NETNSID
> >>> property for RTM_*ADDR requests is introduced. It can be used to send a
> >>> network namespace identifier along in RTM_*ADDR requests.  The network
> >>> namespace identifier will be used to retrieve the target network namespace
> >>> in which the request is supposed to be fulfilled.  This aligns the behavior
> >>> of RTM_*ADDR requests with the behavior of RTM_*LINK requests.
> >>>
> >>> Security:
> >>> - The caller must have assigned a valid network namespace identifier for
> >>>   the target network namespace.
> >>> - The caller must have CAP_NET_ADMIN in the owning user namespace of the
> >>>   target network namespace.
> >>>
> >>> Thanks!
> >>> Christian
> >>>
> >>> [1]: commit 7973bfd8758d ("rtnetlink: remove check for IFLA_IF_NETNSID")
> >>> [2]: commit 5bb8ed075428 ("rtnetlink: enable IFLA_IF_NETNSID for RTM_NEWLINK")
> >>> [3]: commit b61ad68a9fe8 ("rtnetlink: enable IFLA_IF_NETNSID for RTM_DELLINK")
> >>> [4]: commit c310bfcb6e1b ("rtnetlink: enable IFLA_IF_NETNSID for RTM_SETLINK")
> >>> [5]: commit 7c4f63ba8243 ("rtnetlink: enable IFLA_IF_NETNSID in do_setlink()")
> >>>
> >>> Christian Brauner (5):
> >>>   rtnetlink: add rtnl_get_net_ns_capable()
> >>>   if_addr: add IFA_IF_NETNSID
> >>>   ipv4: enable IFA_IF_NETNSID for RTM_GETADDR
> >>>   ipv6: enable IFA_IF_NETNSID for RTM_GETADDR
> >>>   rtnetlink: move type calculation out of loop
> >>>
> >>>  include/net/rtnetlink.h      |  1 +
> >>>  include/uapi/linux/if_addr.h |  1 +
> >>>  net/core/rtnetlink.c         | 15 +++++---
> >>>  net/ipv4/devinet.c           | 38 +++++++++++++++-----
> >>>  net/ipv6/addrconf.c          | 70 ++++++++++++++++++++++++++++--------
> >>>  5 files changed, 97 insertions(+), 28 deletions(-)
> >>>
Nicolas Dichtel Aug. 30, 2018, 3:49 p.m. UTC | #5
Le 30/08/2018 à 16:45, Christian Brauner a écrit :
[snip]
> Introducing the IFA_IF_NETNSID property will not make the netlink
> interface less modular. It is a clean, RTM_*ADDR-request specific
> property using network namespace identifiers which we discussed in prior
> patches are the way to go forward.
> 
> You can already get interfaces via GETLINK from another network
> namespaces than the one you reside in (Which we enabled just a few
> months back.) but you can't do the same for GETADDR. Those two are
> almost always used together. When you want to get the links you usually
> also want to get the addresses associated with it right after.
> In a prior discussion we agreed that network namespace identifiers are
> the way to go forward but that any other propery, i.e. PIDs and fds
> should never be ported into other parts of the codebase and that is
> indeed something I agree with.
Yes, I agree with this and I think this series go to the right direction.

Maybe I would choose a more generic name for the attribute, something that can
be used in other netlink families (xfrm, netfilter, ...) also.
What about IFA_TARGET_NSID?
David Miller Sept. 1, 2018, 12:58 a.m. UTC | #6
From: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Date: Thu, 30 Aug 2018 17:49:12 +0200

> Maybe I would choose a more generic name for the attribute,
> something that can be used in other netlink families (xfrm,
> netfilter, ...) also.  What about IFA_TARGET_NSID?

Agreed, let's at least try to think ahead about potential future
uses :)
Christian Brauner Sept. 1, 2018, 1:34 a.m. UTC | #7
On Thu, Aug 30, 2018 at 04:45:45PM +0200, Christian Brauner wrote:
> On Thu, Aug 30, 2018 at 11:49:31AM +0300, Kirill Tkhai wrote:
> > On 29.08.2018 21:13, Christian Brauner wrote:
> > > Hi Kirill,
> > > 
> > > Thanks for the question!
> > > 
> > > On Wed, Aug 29, 2018 at 11:30:37AM +0300, Kirill Tkhai wrote:
> > >> Hi, Christian,
> > >>
> > >> On 29.08.2018 02:18, Christian Brauner wrote:
> > >>> From: Christian Brauner <christian@brauner.io>
> > >>>
> > >>> Hey,
> > >>>
> > >>> A while back we introduced and enabled IFLA_IF_NETNSID in
> > >>> RTM_{DEL,GET,NEW}LINK requests (cf. [1], [2], [3], [4], [5]). This has led
> > >>> to signficant performance increases since it allows userspace to avoid
> > >>> taking the hit of a setns(netns_fd, CLONE_NEWNET), then getting the
> > >>> interfaces from the netns associated with the netns_fd. Especially when a
> > >>> lot of network namespaces are in use, using setns() becomes increasingly
> > >>> problematic when performance matters.
> > >>
> > >> could you please give a real example, when setns()+socket(AF_NETLINK) cause
> > >> problems with the performance? You should do this only once on application
> > >> startup, and then you have created netlink sockets in any net namespaces you
> > >> need. What is the problem here?
> > > 
> > > So we have a daemon (LXD) that is often running thousands of containers.
> > > When users issue a lxc list request against the daemon it returns a list
> > > of all containers including all of the interfaces and addresses for each
> > > container. To retrieve those addresses we currently rely on setns() +
> > > getifaddrs() for each of those containers. That has horrible
> > > performance.
> > 
> > Could you please provide some numbers showing that setns()
> > introduces signify performance decrease in the application?
> 
> Sure, might take a few days++ though since I'm traveling.

Hey Kirill,

As promised here's some code [1] that compares performance. I basically
did a setns() to the network namespace and called getifaddrs() and
compared this to the scenario where I use the newly introduced property.
I did this 1 million times and calculated the mean getifaddrs()
retrieval time based on that.
My patch cuts the time in half.

brauner@wittgenstein:~/netns_getifaddrs$ ./getifaddrs_perf 0 1178
Mean time in microseconds (netnsid): 81
Mean time in microseconds (setns): 162

Christian

I'm only appending the main file since the netsns_getifaddrs() code I
used is pretty long:

[1]:

#define _GNU_SOURCE
#define __STDC_FORMAT_MACROS
#include <fcntl.h>
#include <inttypes.h>
#include <linux/types.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

#include "netns_getifaddrs.h"
#include "print_getifaddrs.h"

#define ITERATIONS 1000000
#define SEC_TO_MICROSEC(x) (1000000 * (x))

int main(int argc, char *argv[])
{
	int i, ret;
	__s32 netns_id;
	pid_t netns_pid;
	char path[1024];
	intmax_t times[ITERATIONS];
	struct timeval t1, t2;
	intmax_t time_in_mcs;
	int fret = EXIT_FAILURE;
	intmax_t sum = 0;
	int host_netns_fd = -1, netns_fd = -1;

	struct ifaddrs *ifaddrs = NULL;

	if (argc != 3)
		goto on_error;

	netns_id = atoi(argv[1]);
	netns_pid = atoi(argv[2]);
	printf("%d\n", netns_id);
	printf("%d\n", netns_pid);

	for (i = 0; i < ITERATIONS; i++) {
		ret = gettimeofday(&t1, NULL);
		if (ret < 0)
			goto on_error;

		ret = netns_getifaddrs(&ifaddrs, netns_id);
		freeifaddrs(ifaddrs);
		if (ret < 0)
			goto on_error;

		ret = gettimeofday(&t2, NULL);
		if (ret < 0)
			goto on_error;

		time_in_mcs = (SEC_TO_MICROSEC(t2.tv_sec) + t2.tv_usec) -
			      (SEC_TO_MICROSEC(t1.tv_sec) + t1.tv_usec);
		times[i] = time_in_mcs;
	}

	for (i = 0; i < ITERATIONS; i++)
		sum += times[i];

	printf("Mean time in microseconds (netnsid): %ju\n",
	       sum / ITERATIONS);

	ret = snprintf(path, sizeof(path), "/proc/%d/ns/net", netns_pid);
	if (ret < 0 || (size_t)ret >= sizeof(path))
		goto on_error;

	netns_fd = open(path, O_RDONLY | O_CLOEXEC);
	if (netns_fd < 0)
		goto on_error;

	host_netns_fd = open("/proc/self/ns/net", O_RDONLY | O_CLOEXEC);
	if (host_netns_fd < 0)
		goto on_error;

	memset(times, 0, sizeof(times));
	for (i = 0; i < ITERATIONS; i++) {
		ret = gettimeofday(&t1, NULL);
		if (ret < 0)
			goto on_error;

		ret = setns(netns_fd, CLONE_NEWNET);
		if (ret < 0)
			goto on_error;

		ret = getifaddrs(&ifaddrs);
		freeifaddrs(ifaddrs);
		if (ret < 0)
			goto on_error;

		ret = gettimeofday(&t2, NULL);
		if (ret < 0)
			goto on_error;

		ret = setns(host_netns_fd, CLONE_NEWNET);
		if (ret < 0)
			goto on_error;

		time_in_mcs = (SEC_TO_MICROSEC(t2.tv_sec) + t2.tv_usec) -
			      (SEC_TO_MICROSEC(t1.tv_sec) + t1.tv_usec);
		times[i] = time_in_mcs;
	}

	for (i = 0; i < ITERATIONS; i++)
		sum += times[i];

	printf("Mean time in microseconds (setns): %ju\n",
	       sum / ITERATIONS);

	fret = EXIT_SUCCESS;

on_error:
	if (netns_fd >= 0)
		close(netns_fd);

	if (host_netns_fd >= 0)
		close(host_netns_fd);

	exit(fret);
}

> 
> > 
> > I worry about all this just because of netlink interface is
> > already overloaded, while this patch makes it even less modular.
> 
> Introducing the IFA_IF_NETNSID property will not make the netlink
> interface less modular. It is a clean, RTM_*ADDR-request specific
> property using network namespace identifiers which we discussed in prior
> patches are the way to go forward.
> 
> You can already get interfaces via GETLINK from another network
> namespaces than the one you reside in (Which we enabled just a few
> months back.) but you can't do the same for GETADDR. Those two are
> almost always used together. When you want to get the links you usually
> also want to get the addresses associated with it right after.
> In a prior discussion we agreed that network namespace identifiers are
> the way to go forward but that any other propery, i.e. PIDs and fds
> should never be ported into other parts of the codebase and that is
> indeed something I agree with.
> 
> > In case of one day we finally reach rtnl unscalability trap,
> > every common interface like this may be a decisive nail in
> > a coffin lid of possibility to overwrite everything.
> > 
> > > The problem with what you're proposing is that the daemon would need to
> > > cache a socket file descriptor for each container which is something
> > > that we unfortunately cannot do since we can't excessively cache file
> > > descriptors because we can easily hit the open file limit. We also
> > > refrain from caching file descriptors for a long time for security
> > > reasons.
> > > 
> > > For the case where users just request a list of the interfaces we
> > > can already use RTM_GETLINK + IFLA_IF_NETNS which has way better
> > > performance. But we can't do the same with RTM_GETADDR requests which
> > > was an oversight on my part when I wrote the original patchset for the
> > > RTM_*LINK requests. This just rectifies this and aligns RTM_GETLINK +
> > > RTM_GETADDR.
> > > Based on this patchset I have written a userspace POC that is basically
> > > a netns namespace aware getifaddr() or - as I like to call it -
> > > netns_getifaddr().
> > > 
> > >>
> > >>> Usually, RTML_GETLINK requests are followed by RTM_GETADDR requests (cf.
> > >>> getifaddrs() style functions and friends). But currently, RTM_GETADDR
> > >>> requests do not support a similar property like IFLA_IF_NETNSID for
> > >>> RTM_*LINK requests.
> > >>> This is problematic since userspace can retrieve interfaces from another
> > >>> network namespace by sending a IFLA_IF_NETNSID property along but
> > >>> RTM_GETLINK request but is still forced to use the legacy setns() style of
> > >>> retrieving interfaces in RTM_GETADDR requests.
> > >>>
> > >>> The goal of this series is to make it possible to perform RTM_GETADDR
> > >>> requests on different network namespaces. To this end a new IFA_IF_NETNSID
> > >>> property for RTM_*ADDR requests is introduced. It can be used to send a
> > >>> network namespace identifier along in RTM_*ADDR requests.  The network
> > >>> namespace identifier will be used to retrieve the target network namespace
> > >>> in which the request is supposed to be fulfilled.  This aligns the behavior
> > >>> of RTM_*ADDR requests with the behavior of RTM_*LINK requests.
> > >>>
> > >>> Security:
> > >>> - The caller must have assigned a valid network namespace identifier for
> > >>>   the target network namespace.
> > >>> - The caller must have CAP_NET_ADMIN in the owning user namespace of the
> > >>>   target network namespace.
> > >>>
> > >>> Thanks!
> > >>> Christian
> > >>>
> > >>> [1]: commit 7973bfd8758d ("rtnetlink: remove check for IFLA_IF_NETNSID")
> > >>> [2]: commit 5bb8ed075428 ("rtnetlink: enable IFLA_IF_NETNSID for RTM_NEWLINK")
> > >>> [3]: commit b61ad68a9fe8 ("rtnetlink: enable IFLA_IF_NETNSID for RTM_DELLINK")
> > >>> [4]: commit c310bfcb6e1b ("rtnetlink: enable IFLA_IF_NETNSID for RTM_SETLINK")
> > >>> [5]: commit 7c4f63ba8243 ("rtnetlink: enable IFLA_IF_NETNSID in do_setlink()")
> > >>>
> > >>> Christian Brauner (5):
> > >>>   rtnetlink: add rtnl_get_net_ns_capable()
> > >>>   if_addr: add IFA_IF_NETNSID
> > >>>   ipv4: enable IFA_IF_NETNSID for RTM_GETADDR
> > >>>   ipv6: enable IFA_IF_NETNSID for RTM_GETADDR
> > >>>   rtnetlink: move type calculation out of loop
> > >>>
> > >>>  include/net/rtnetlink.h      |  1 +
> > >>>  include/uapi/linux/if_addr.h |  1 +
> > >>>  net/core/rtnetlink.c         | 15 +++++---
> > >>>  net/ipv4/devinet.c           | 38 +++++++++++++++-----
> > >>>  net/ipv6/addrconf.c          | 70 ++++++++++++++++++++++++++++--------
> > >>>  5 files changed, 97 insertions(+), 28 deletions(-)
> > >>>
Christian Brauner Sept. 1, 2018, 6:47 p.m. UTC | #8
On Fri, Aug 31, 2018 at 05:58:47PM -0700, David Miller wrote:
> From: Nicolas Dichtel <nicolas.dichtel@6wind.com>
> Date: Thu, 30 Aug 2018 17:49:12 +0200
> 
> > Maybe I would choose a more generic name for the attribute,
> > something that can be used in other netlink families (xfrm,
> > netfilter, ...) also.  What about IFA_TARGET_NSID?
> 
> Agreed, let's at least try to think ahead about potential future
> uses :)

Sorry for the late reply. Yup, sounds good to me.
But maybe IFA_TARGET_NETNSID to indicate that we're talking network
namespaces here? Seems to me that NSID might give the wrong impression.
I'll send v1 soon. I expect tomorrow or sometime next week.

Thanks!
Christian
Jiri Benc Sept. 2, 2018, 9:58 a.m. UTC | #9
On Sat, 1 Sep 2018 20:47:05 +0200, Christian Brauner wrote:
> Sorry for the late reply. Yup, sounds good to me.
> But maybe IFA_TARGET_NETNSID to indicate that we're talking network
> namespaces here? Seems to me that NSID might give the wrong impression.
> I'll send v1 soon. I expect tomorrow or sometime next week.

On the other hand, we currently have IFLA_IF_NETNSID for the link
operations. IFA_IF_NETNSID is more consistent with the existing
attribute. It may be confusing to authors of user space programs to
have attribute names doing the same thing constructed differently for
different calls (IFLA_IF_NETNSID and IFA_TARGET_NETNSID).

As for the patch set itself, it makes sense to me, whatever the
final attribute name is.

Thanks,

 Jiri
Nicolas Dichtel Sept. 3, 2018, 7:50 a.m. UTC | #10
Le 02/09/2018 à 11:58, Jiri Benc a écrit :
> On Sat, 1 Sep 2018 20:47:05 +0200, Christian Brauner wrote:
>> Sorry for the late reply. Yup, sounds good to me.
>> But maybe IFA_TARGET_NETNSID to indicate that we're talking network
>> namespaces here? Seems to me that NSID might give the wrong impression.
>> I'll send v1 soon. I expect tomorrow or sometime next week.
> 
> On the other hand, we currently have IFLA_IF_NETNSID for the link
> operations. IFA_IF_NETNSID is more consistent with the existing
> attribute. It may be confusing to authors of user space programs to
> have attribute names doing the same thing constructed differently for
> different calls (IFLA_IF_NETNSID and IFA_TARGET_NETNSID).
Yep, I also thought to this. It had sense because it was the link family.
What about adding an alias (#define IFLA_TARGET_NETNSID IFLA_IF_NETNSID) in the
uapi to better document this?
Christian Brauner Sept. 3, 2018, 9:32 a.m. UTC | #11
On Mon, Sep 03, 2018 at 09:50:00AM +0200, Nicolas Dichtel wrote:
> Le 02/09/2018 à 11:58, Jiri Benc a écrit :
> > On Sat, 1 Sep 2018 20:47:05 +0200, Christian Brauner wrote:
> >> Sorry for the late reply. Yup, sounds good to me.
> >> But maybe IFA_TARGET_NETNSID to indicate that we're talking network
> >> namespaces here? Seems to me that NSID might give the wrong impression.
> >> I'll send v1 soon. I expect tomorrow or sometime next week.
> > 
> > On the other hand, we currently have IFLA_IF_NETNSID for the link
> > operations. IFA_IF_NETNSID is more consistent with the existing
> > attribute. It may be confusing to authors of user space programs to
> > have attribute names doing the same thing constructed differently for
> > different calls (IFLA_IF_NETNSID and IFA_TARGET_NETNSID).
> Yep, I also thought to this. It had sense because it was the link family.
> What about adding an alias (#define IFLA_TARGET_NETNSID IFLA_IF_NETNSID) in the
> uapi to better document this?

Sounds good to me.
Kirill Tkhai Sept. 3, 2018, 1:41 p.m. UTC | #12
On 01.09.2018 04:34, Christian Brauner wrote:
> On Thu, Aug 30, 2018 at 04:45:45PM +0200, Christian Brauner wrote:
>> On Thu, Aug 30, 2018 at 11:49:31AM +0300, Kirill Tkhai wrote:
>>> On 29.08.2018 21:13, Christian Brauner wrote:
>>>> Hi Kirill,
>>>>
>>>> Thanks for the question!
>>>>
>>>> On Wed, Aug 29, 2018 at 11:30:37AM +0300, Kirill Tkhai wrote:
>>>>> Hi, Christian,
>>>>>
>>>>> On 29.08.2018 02:18, Christian Brauner wrote:
>>>>>> From: Christian Brauner <christian@brauner.io>
>>>>>>
>>>>>> Hey,
>>>>>>
>>>>>> A while back we introduced and enabled IFLA_IF_NETNSID in
>>>>>> RTM_{DEL,GET,NEW}LINK requests (cf. [1], [2], [3], [4], [5]). This has led
>>>>>> to signficant performance increases since it allows userspace to avoid
>>>>>> taking the hit of a setns(netns_fd, CLONE_NEWNET), then getting the
>>>>>> interfaces from the netns associated with the netns_fd. Especially when a
>>>>>> lot of network namespaces are in use, using setns() becomes increasingly
>>>>>> problematic when performance matters.
>>>>>
>>>>> could you please give a real example, when setns()+socket(AF_NETLINK) cause
>>>>> problems with the performance? You should do this only once on application
>>>>> startup, and then you have created netlink sockets in any net namespaces you
>>>>> need. What is the problem here?
>>>>
>>>> So we have a daemon (LXD) that is often running thousands of containers.
>>>> When users issue a lxc list request against the daemon it returns a list
>>>> of all containers including all of the interfaces and addresses for each
>>>> container. To retrieve those addresses we currently rely on setns() +
>>>> getifaddrs() for each of those containers. That has horrible
>>>> performance.
>>>
>>> Could you please provide some numbers showing that setns()
>>> introduces signify performance decrease in the application?
>>
>> Sure, might take a few days++ though since I'm traveling.
> 
> Hey Kirill,
> 
> As promised here's some code [1] that compares performance. I basically
> did a setns() to the network namespace and called getifaddrs() and
> compared this to the scenario where I use the newly introduced property.
> I did this 1 million times and calculated the mean getifaddrs()
> retrieval time based on that.
> My patch cuts the time in half.
> 
> brauner@wittgenstein:~/netns_getifaddrs$ ./getifaddrs_perf 0 1178
> Mean time in microseconds (netnsid): 81
> Mean time in microseconds (setns): 162
> 
> Christian
> 
> I'm only appending the main file since the netsns_getifaddrs() code I
> used is pretty long:
> 
> [1]:
> 
> #define _GNU_SOURCE
> #define __STDC_FORMAT_MACROS
> #include <fcntl.h>
> #include <inttypes.h>
> #include <linux/types.h>
> #include <sched.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <sys/stat.h>
> #include <sys/time.h>
> #include <sys/types.h>
> #include <unistd.h>
> 
> #include "netns_getifaddrs.h"
> #include "print_getifaddrs.h"
> 
> #define ITERATIONS 1000000
> #define SEC_TO_MICROSEC(x) (1000000 * (x))
> 
> int main(int argc, char *argv[])
> {
> 	int i, ret;
> 	__s32 netns_id;
> 	pid_t netns_pid;
> 	char path[1024];
> 	intmax_t times[ITERATIONS];
> 	struct timeval t1, t2;
> 	intmax_t time_in_mcs;
> 	int fret = EXIT_FAILURE;
> 	intmax_t sum = 0;
> 	int host_netns_fd = -1, netns_fd = -1;
> 
> 	struct ifaddrs *ifaddrs = NULL;
> 
> 	if (argc != 3)
> 		goto on_error;
> 
> 	netns_id = atoi(argv[1]);
> 	netns_pid = atoi(argv[2]);
> 	printf("%d\n", netns_id);
> 	printf("%d\n", netns_pid);
> 
> 	for (i = 0; i < ITERATIONS; i++) {
> 		ret = gettimeofday(&t1, NULL);
> 		if (ret < 0)
> 			goto on_error;
> 
> 		ret = netns_getifaddrs(&ifaddrs, netns_id);
> 		freeifaddrs(ifaddrs);
> 		if (ret < 0)
> 			goto on_error;
> 
> 		ret = gettimeofday(&t2, NULL);
> 		if (ret < 0)
> 			goto on_error;
> 
> 		time_in_mcs = (SEC_TO_MICROSEC(t2.tv_sec) + t2.tv_usec) -
> 			      (SEC_TO_MICROSEC(t1.tv_sec) + t1.tv_usec);
> 		times[i] = time_in_mcs;
> 	}
> 
> 	for (i = 0; i < ITERATIONS; i++)
> 		sum += times[i];
> 
> 	printf("Mean time in microseconds (netnsid): %ju\n",
> 	       sum / ITERATIONS);
> 
> 	ret = snprintf(path, sizeof(path), "/proc/%d/ns/net", netns_pid);
> 	if (ret < 0 || (size_t)ret >= sizeof(path))
> 		goto on_error;
> 
> 	netns_fd = open(path, O_RDONLY | O_CLOEXEC);
> 	if (netns_fd < 0)
> 		goto on_error;
> 
> 	host_netns_fd = open("/proc/self/ns/net", O_RDONLY | O_CLOEXEC);
> 	if (host_netns_fd < 0)
> 		goto on_error;
> 
> 	memset(times, 0, sizeof(times));
> 	for (i = 0; i < ITERATIONS; i++) {
> 		ret = gettimeofday(&t1, NULL);
> 		if (ret < 0)
> 			goto on_error;
> 
> 		ret = setns(netns_fd, CLONE_NEWNET);
> 		if (ret < 0)
> 			goto on_error;
> 
> 		ret = getifaddrs(&ifaddrs);
> 		freeifaddrs(ifaddrs);
> 		if (ret < 0)
> 			goto on_error;
> 
> 		ret = gettimeofday(&t2, NULL);
> 		if (ret < 0)
> 			goto on_error;
> 
> 		ret = setns(host_netns_fd, CLONE_NEWNET);
> 		if (ret < 0)
> 			goto on_error;
> 
> 		time_in_mcs = (SEC_TO_MICROSEC(t2.tv_sec) + t2.tv_usec) -
> 			      (SEC_TO_MICROSEC(t1.tv_sec) + t1.tv_usec);
> 		times[i] = time_in_mcs;
> 	}
> 
> 	for (i = 0; i < ITERATIONS; i++)
> 		sum += times[i];
> 
> 	printf("Mean time in microseconds (setns): %ju\n",
> 	       sum / ITERATIONS);
> 
> 	fret = EXIT_SUCCESS;
> 
> on_error:
> 	if (netns_fd >= 0)
> 		close(netns_fd);
> 
> 	if (host_netns_fd >= 0)
> 		close(host_netns_fd);
> 
> 	exit(fret);
> }

But this is a synthetic test, while I asked about real workflow.
Is this real problem for lxd, and there is observed performance
decrease?

I see, there are already nsid use in existing code, but I have to say,
that adding new types of variables as a system call arguments make it
less modular. When you request RTM_GETADDR for a specific nsid, this
obligates the kernel to make everything unchangable during the call,
doesn't it?

We may look at existing code as example, what problems this may cause.
Look at do_setlink(). There are many different types of variables,
and all of them should be dereferenced atomically. So, all the function
is executed under global rtnl. And this causes delays in another config
places, which are sensitive to rtnl. So, adding more dimensions to RTM_GETADDR
may turn it in the same overloaded function as do_setlink() is. And one
day, when we reach the state, when we must rework all of this, we won't
be able to do this. I'm not sure, now is not too late.

I just say about this, because it's possible we should consider another
approach in rtnl communication in general, and stop to overload it.

Kirill
Jiri Benc Sept. 3, 2018, 1:50 p.m. UTC | #13
On Mon, 3 Sep 2018 16:41:45 +0300, Kirill Tkhai wrote:
> But this is a synthetic test, while I asked about real workflow.
> Is this real problem for lxd, and there is observed performance
> decrease?

It's actually not as much a performance problem but rather the only way
to get the data in some situations. Namely, when you have only netnsid.
This happens e.g. when you want to query a veth peer in another netns.

setns() requires a file descriptor which you don't have. Nor there is
a way to convert netnsid to a fd.

While developing the IFLA_IF_NETNSID patch, I was first thinking about
implementing an API doing the conversion. The problem is there's no
good place to put this into. It can't be done over netlink: netlink is
unreliable and you can't have the kernel open a fd for you and lose it.
There's no ioctl to use. So we'd be left with a procfs/sysfs or a
syscall.

Using netnsid to refer to the target netns seems to be a nice solution -
after all, netnsid is the identifier to use in netlink.

 Jiri
Christian Brauner Sept. 3, 2018, 2:22 p.m. UTC | #14
On Mon, Sep 03, 2018 at 04:41:45PM +0300, Kirill Tkhai wrote:
> On 01.09.2018 04:34, Christian Brauner wrote:
> > On Thu, Aug 30, 2018 at 04:45:45PM +0200, Christian Brauner wrote:
> >> On Thu, Aug 30, 2018 at 11:49:31AM +0300, Kirill Tkhai wrote:
> >>> On 29.08.2018 21:13, Christian Brauner wrote:
> >>>> Hi Kirill,
> >>>>
> >>>> Thanks for the question!
> >>>>
> >>>> On Wed, Aug 29, 2018 at 11:30:37AM +0300, Kirill Tkhai wrote:
> >>>>> Hi, Christian,
> >>>>>
> >>>>> On 29.08.2018 02:18, Christian Brauner wrote:
> >>>>>> From: Christian Brauner <christian@brauner.io>
> >>>>>>
> >>>>>> Hey,
> >>>>>>
> >>>>>> A while back we introduced and enabled IFLA_IF_NETNSID in
> >>>>>> RTM_{DEL,GET,NEW}LINK requests (cf. [1], [2], [3], [4], [5]). This has led
> >>>>>> to signficant performance increases since it allows userspace to avoid
> >>>>>> taking the hit of a setns(netns_fd, CLONE_NEWNET), then getting the
> >>>>>> interfaces from the netns associated with the netns_fd. Especially when a
> >>>>>> lot of network namespaces are in use, using setns() becomes increasingly
> >>>>>> problematic when performance matters.
> >>>>>
> >>>>> could you please give a real example, when setns()+socket(AF_NETLINK) cause
> >>>>> problems with the performance? You should do this only once on application
> >>>>> startup, and then you have created netlink sockets in any net namespaces you
> >>>>> need. What is the problem here?
> >>>>
> >>>> So we have a daemon (LXD) that is often running thousands of containers.
> >>>> When users issue a lxc list request against the daemon it returns a list
> >>>> of all containers including all of the interfaces and addresses for each
> >>>> container. To retrieve those addresses we currently rely on setns() +
> >>>> getifaddrs() for each of those containers. That has horrible
> >>>> performance.
> >>>
> >>> Could you please provide some numbers showing that setns()
> >>> introduces signify performance decrease in the application?
> >>
> >> Sure, might take a few days++ though since I'm traveling.
> > 
> > Hey Kirill,
> > 
> > As promised here's some code [1] that compares performance. I basically
> > did a setns() to the network namespace and called getifaddrs() and
> > compared this to the scenario where I use the newly introduced property.
> > I did this 1 million times and calculated the mean getifaddrs()
> > retrieval time based on that.
> > My patch cuts the time in half.
> > 
> > brauner@wittgenstein:~/netns_getifaddrs$ ./getifaddrs_perf 0 1178
> > Mean time in microseconds (netnsid): 81
> > Mean time in microseconds (setns): 162
> > 
> > Christian
> > 
> > I'm only appending the main file since the netsns_getifaddrs() code I
> > used is pretty long:
> > 
> > [1]:
> > 
> > #define _GNU_SOURCE
> > #define __STDC_FORMAT_MACROS
> > #include <fcntl.h>
> > #include <inttypes.h>
> > #include <linux/types.h>
> > #include <sched.h>
> > #include <stdio.h>
> > #include <stdlib.h>
> > #include <string.h>
> > #include <sys/stat.h>
> > #include <sys/time.h>
> > #include <sys/types.h>
> > #include <unistd.h>
> > 
> > #include "netns_getifaddrs.h"
> > #include "print_getifaddrs.h"
> > 
> > #define ITERATIONS 1000000
> > #define SEC_TO_MICROSEC(x) (1000000 * (x))
> > 
> > int main(int argc, char *argv[])
> > {
> > 	int i, ret;
> > 	__s32 netns_id;
> > 	pid_t netns_pid;
> > 	char path[1024];
> > 	intmax_t times[ITERATIONS];
> > 	struct timeval t1, t2;
> > 	intmax_t time_in_mcs;
> > 	int fret = EXIT_FAILURE;
> > 	intmax_t sum = 0;
> > 	int host_netns_fd = -1, netns_fd = -1;
> > 
> > 	struct ifaddrs *ifaddrs = NULL;
> > 
> > 	if (argc != 3)
> > 		goto on_error;
> > 
> > 	netns_id = atoi(argv[1]);
> > 	netns_pid = atoi(argv[2]);
> > 	printf("%d\n", netns_id);
> > 	printf("%d\n", netns_pid);
> > 
> > 	for (i = 0; i < ITERATIONS; i++) {
> > 		ret = gettimeofday(&t1, NULL);
> > 		if (ret < 0)
> > 			goto on_error;
> > 
> > 		ret = netns_getifaddrs(&ifaddrs, netns_id);
> > 		freeifaddrs(ifaddrs);
> > 		if (ret < 0)
> > 			goto on_error;
> > 
> > 		ret = gettimeofday(&t2, NULL);
> > 		if (ret < 0)
> > 			goto on_error;
> > 
> > 		time_in_mcs = (SEC_TO_MICROSEC(t2.tv_sec) + t2.tv_usec) -
> > 			      (SEC_TO_MICROSEC(t1.tv_sec) + t1.tv_usec);
> > 		times[i] = time_in_mcs;
> > 	}
> > 
> > 	for (i = 0; i < ITERATIONS; i++)
> > 		sum += times[i];
> > 
> > 	printf("Mean time in microseconds (netnsid): %ju\n",
> > 	       sum / ITERATIONS);
> > 
> > 	ret = snprintf(path, sizeof(path), "/proc/%d/ns/net", netns_pid);
> > 	if (ret < 0 || (size_t)ret >= sizeof(path))
> > 		goto on_error;
> > 
> > 	netns_fd = open(path, O_RDONLY | O_CLOEXEC);
> > 	if (netns_fd < 0)
> > 		goto on_error;
> > 
> > 	host_netns_fd = open("/proc/self/ns/net", O_RDONLY | O_CLOEXEC);
> > 	if (host_netns_fd < 0)
> > 		goto on_error;
> > 
> > 	memset(times, 0, sizeof(times));
> > 	for (i = 0; i < ITERATIONS; i++) {
> > 		ret = gettimeofday(&t1, NULL);
> > 		if (ret < 0)
> > 			goto on_error;
> > 
> > 		ret = setns(netns_fd, CLONE_NEWNET);
> > 		if (ret < 0)
> > 			goto on_error;
> > 
> > 		ret = getifaddrs(&ifaddrs);
> > 		freeifaddrs(ifaddrs);
> > 		if (ret < 0)
> > 			goto on_error;
> > 
> > 		ret = gettimeofday(&t2, NULL);
> > 		if (ret < 0)
> > 			goto on_error;
> > 
> > 		ret = setns(host_netns_fd, CLONE_NEWNET);
> > 		if (ret < 0)
> > 			goto on_error;
> > 
> > 		time_in_mcs = (SEC_TO_MICROSEC(t2.tv_sec) + t2.tv_usec) -
> > 			      (SEC_TO_MICROSEC(t1.tv_sec) + t1.tv_usec);
> > 		times[i] = time_in_mcs;
> > 	}
> > 
> > 	for (i = 0; i < ITERATIONS; i++)
> > 		sum += times[i];
> > 
> > 	printf("Mean time in microseconds (setns): %ju\n",
> > 	       sum / ITERATIONS);
> > 
> > 	fret = EXIT_SUCCESS;
> > 
> > on_error:
> > 	if (netns_fd >= 0)
> > 		close(netns_fd);
> > 
> > 	if (host_netns_fd >= 0)
> > 		close(host_netns_fd);
> > 
> > 	exit(fret);
> > }
> 
> But this is a synthetic test, while I asked about real workflow.
> Is this real problem for lxd, and there is observed performance
> decrease?

As you can see in this mail I explicitly stated that it is a real
performacne issue we see with LXD. You asked for numbers I gave you
numbers by writing a test-program just per your request. The benefit of
this "synthetic" case is that it allows us to clearly see the
performance benefit. Expecting me to hack all of this into LXD just to
get some perf numbers that will show the exact same thing per your
request is - and I hope I'm not being unreasonable here - expecting a
bit much.

> 
> I see, there are already nsid use in existing code, but I have to say,
> that adding new types of variables as a system call arguments make it
> less modular. When you request RTM_GETADDR for a specific nsid, this
> obligates the kernel to make everything unchangable during the call,
> doesn't it?
> 
> We may look at existing code as example, what problems this may cause.
> Look at do_setlink(). There are many different types of variables,
> and all of them should be dereferenced atomically. So, all the function
> is executed under global rtnl. And this causes delays in another config
> places, which are sensitive to rtnl. So, adding more dimensions to RTM_GETADDR
> may turn it in the same overloaded function as do_setlink() is. And one
> day, when we reach the state, when we must rework all of this, we won't
> be able to do this. I'm not sure, now is not too late.
> 
> I just say about this, because it's possible we should consider another
> approach in rtnl communication in general, and stop to overload it.

While I sympathize with your concerns this all seems very vague. There
is a real-world use case that is solved by this patchset.
Kirill Tkhai Sept. 3, 2018, 2:53 p.m. UTC | #15
On 03.09.2018 16:50, Jiri Benc wrote:
> On Mon, 3 Sep 2018 16:41:45 +0300, Kirill Tkhai wrote:
>> But this is a synthetic test, while I asked about real workflow.
>> Is this real problem for lxd, and there is observed performance
>> decrease?
> 
> It's actually not as much a performance problem but rather the only way
> to get the data in some situations. Namely, when you have only netnsid.
> This happens e.g. when you want to query a veth peer in another netns.
> 
> setns() requires a file descriptor which you don't have. Nor there is
> a way to convert netnsid to a fd.
> 
> While developing the IFLA_IF_NETNSID patch, I was first thinking about
> implementing an API doing the conversion. The problem is there's no
> good place to put this into. It can't be done over netlink: netlink is
> unreliable and you can't have the kernel open a fd for you and lose it.
> There's no ioctl to use. So we'd be left with a procfs/sysfs or a
> syscall.
> 
> Using netnsid to refer to the target netns seems to be a nice solution -
> after all, netnsid is the identifier to use in netlink.

Ok, I have no objections, just mentioning that generic problem.

Thanks,
Kirill