Message ID: 20240530201616.1316526-6-almasrymina@google.com
State: New
Series: Device Memory TCP
On Thu, 2024-05-30 at 20:16 +0000, Mina Almasry wrote:
> diff --git a/net/core/devmem.c b/net/core/devmem.c
> index d82f92d7cf9ce..d5fac8edf621d 100644
> --- a/net/core/devmem.c
> +++ b/net/core/devmem.c
> @@ -32,6 +32,14 @@ static void net_devmem_dmabuf_free_chunk_owner(struct gen_pool *genpool,
>  	kfree(owner);
>  }
>
> +static inline dma_addr_t net_devmem_get_dma_addr(const struct net_iov *niov)

Minor nit: please no 'inline' keyword in c files.

Thanks,

Paolo
On Tue, 04 Jun 2024 12:13:15 +0200
Paolo Abeni <pabeni@redhat.com> wrote:

> On Thu, 2024-05-30 at 20:16 +0000, Mina Almasry wrote:
> > diff --git a/net/core/devmem.c b/net/core/devmem.c
> > index d82f92d7cf9ce..d5fac8edf621d 100644
> > --- a/net/core/devmem.c
> > +++ b/net/core/devmem.c
> > @@ -32,6 +32,14 @@ static void net_devmem_dmabuf_free_chunk_owner(struct gen_pool *genpool,
> >  	kfree(owner);
> >  }
> >
> > +static inline dma_addr_t net_devmem_get_dma_addr(const struct net_iov *niov)
>
> Minor nit: please no 'inline' keyword in c files.

I'm curious. Is this a networking rule? I use 'inline' in my C code all the
time.

-- Steve
On Tue, Jun 04, 2024 at 12:15:51PM -0400, Steven Rostedt wrote:
> On Tue, 04 Jun 2024 12:13:15 +0200
> Paolo Abeni <pabeni@redhat.com> wrote:
>
> > On Thu, 2024-05-30 at 20:16 +0000, Mina Almasry wrote:
> > > diff --git a/net/core/devmem.c b/net/core/devmem.c
> > > index d82f92d7cf9ce..d5fac8edf621d 100644
> > > --- a/net/core/devmem.c
> > > +++ b/net/core/devmem.c
> > > @@ -32,6 +32,14 @@ static void net_devmem_dmabuf_free_chunk_owner(struct gen_pool *genpool,
> > >  	kfree(owner);
> > >  }
> > >
> > > +static inline dma_addr_t net_devmem_get_dma_addr(const struct net_iov *niov)
> >
> > Minor nit: please no 'inline' keyword in c files.
>
> I'm curious. Is this a networking rule? I use 'inline' in my C code all the
> time.

It mostly comes from Documentation/process/coding-style.rst:

15) The inline disease
----------------------

There appears to be a common misperception that gcc has a magic "make me
faster" speedup option called ``inline``. While the use of inlines can be
appropriate (for example as a means of replacing macros, see Chapter 12), it
very often is not. Abundant use of the inline keyword leads to a much bigger
kernel, which in turn slows the system as a whole down, due to a bigger
icache footprint for the CPU and simply because there is less memory
available for the pagecache. Just think about it; a pagecache miss causes a
disk seek, which easily takes 5 milliseconds. There are a LOT of cpu cycles
that can go into these 5 milliseconds.

A reasonable rule of thumb is to not put inline at functions that have more
than 3 lines of code in them. An exception to this rule are the cases where
a parameter is known to be a compiletime constant, and as a result of this
constantness you *know* the compiler will be able to optimize most of your
function away at compile time. For a good example of this later case, see
the kmalloc() inline function.

Often people argue that adding inline to functions that are static and used
only once is always a win since there is no space tradeoff. While this is
technically correct, gcc is capable of inlining these automatically without
help, and the maintenance issue of removing the inline when a second user
appears outweighs the potential value of the hint that tells gcc to do
something it would have done anyway.

Jason
On Tue, 4 Jun 2024 13:31:58 -0300
Jason Gunthorpe <jgg@ziepe.ca> wrote:

> On Tue, Jun 04, 2024 at 12:15:51PM -0400, Steven Rostedt wrote:
> > I'm curious. Is this a networking rule? I use 'inline' in my C code all the
> > time.
>
> It mostly comes from Documentation/process/coding-style.rst:
>
> 15) The inline disease
> ----------------------
>
> There appears to be a common misperception that gcc has a magic "make me
> faster" speedup option called ``inline``. While the use of inlines can be
> appropriate (for example as a means of replacing macros, see Chapter 12), it
> very often is not. Abundant use of the inline keyword leads to a much bigger
> kernel, which in turn slows the system as a whole down, due to a bigger
> icache footprint for the CPU and simply because there is less memory
> available for the pagecache. Just think about it; a pagecache miss causes a
> disk seek, which easily takes 5 milliseconds. There are a LOT of cpu cycles
> that can go into these 5 milliseconds.
>
> A reasonable rule of thumb is to not put inline at functions that have more
> than 3 lines of code in them. An exception to this rule are the cases where
> a parameter is known to be a compiletime constant, and as a result of this
> constantness you *know* the compiler will be able to optimize most of your
> function away at compile time. For a good example of this later case, see
> the kmalloc() inline function.
>
> Often people argue that adding inline to functions that are static and used
> only once is always a win since there is no space tradeoff. While this is
> technically correct, gcc is capable of inlining these automatically without
> help, and the maintenance issue of removing the inline when a second user
> appears outweighs the potential value of the hint that tells gcc to do
> something it would have done anyway.

Interesting, as I sped up the ftrace ring buffer by a substantial amount by
adding strategic __always_inline, noinline, likely() and unlikely()
throughout the code. It had to do with what was considered the fast path
and slow path, and not actually the size of the function. gcc got it
horribly wrong.

-- Steve
> Interesting, as I sped up the ftrace ring buffer by a substantial amount by
> adding strategic __always_inline, noinline, likely() and unlikely()
> throughout the code. It had to do with what was considered the fast path
> and slow path, and not actually the size of the function. gcc got it
> horribly wrong.

And what did the compiler people say when you reported gcc was getting
it wrong?

Our assumption is, the compiler is better than a human at deciding
this. Or at least, a human who does not spend a long time profiling
and tuning. If this assumption is not true, we probably should be
trying to figure out why, and improving the compiler when
possible. That will benefit everybody.

Andrew
On Wed, 5 Jun 2024 01:44:37 +0200
Andrew Lunn <andrew@lunn.ch> wrote:

> > Interesting, as I sped up the ftrace ring buffer by a substantial amount by
> > adding strategic __always_inline, noinline, likely() and unlikely()
> > throughout the code. It had to do with what was considered the fast path
> > and slow path, and not actually the size of the function. gcc got it
> > horribly wrong.
>
> And what did the compiler people say when you reported gcc was getting
> it wrong?
>
> Our assumption is, the compiler is better than a human at deciding
> this. Or at least, a human who does not spend a long time profiling
> and tuning. If this assumption is not true, we probably should be
> trying to figure out why, and improving the compiler when
> possible. That will benefit everybody.

How is the compiler going to know which path is going to be taken the most?
There's two main paths in the ring buffer logic. One when an event stays on
the sub-buffer, the other when the event crosses over to a new sub-buffer.
As there's 100s of events that happen on the same sub-buffer for every one
time there's a crossover, I optimized the paths that stayed on the
sub-buffer, which caused the time for those events to go from 250ns down to
150ns! That's a 40% speedup.

I added the unlikely/likely and 'always_inline' and 'noinline' paths to
make sure the "staying on the buffer" path was always the hot path, and
keeping it tight in cache.

How is a compiler going to know that?

-- Steve
> How is the compiler going to know which path is going to be taken the most?
> There's two main paths in the ring buffer logic. One when an event stays on
> the sub-buffer, the other when the event crosses over to a new sub buffer.
> As there's 100s of events that happen on the same sub-buffer for every one
> time there's a cross over, I optimized the paths that stayed on the
> sub-buffer, which caused the time for those events to go from 250ns down to
> 150ns! That's a 40% speed up.
>
> I added the unlikely/likely and 'always_inline' and 'noinline' paths to
> make sure the "staying on the buffer" path was always the hot path, and
> keeping it tight in cache.
>
> How is a compiler going to know that?

It might have some heuristics to try to guess unlikely/likely, but
that is not what we are talking about here.

How much difference did 'always_inline' and 'noinline' make? Hopefully
the likely is enough of a clue that it should prefer to inline whatever is
in that branch, whereas for the unlikely case it can do a function
call.

But compilers are not my thing, which is why I would reach out to the
compiler people and ask them: is it expected to get this wrong, and could
it be made better?

Andrew
On Wed, 5 Jun 2024 02:52:29 +0200
Andrew Lunn <andrew@lunn.ch> wrote:

> > How is a compiler going to know that?
>
> It might have some heuristics to try to guess unlikely/likely, but
> that is not what we are talking about here.
>
> How much difference did 'always_inline' and 'noinline' make? Hopefully
> the likely is enough of a clue it should prefer to inline whatever is
> in that branch, where as for the unlikely case it can do a function
> call.

Perhaps, but one of the issues was that I have lots of small functions
that are used all over the place, and gcc tends to change them to function
calls, instead of duplicating them. I did this analysis back in 2016, so
maybe it became better.

> But compilers is not my thing, which is why i would reach out to the
> compiler people and ask them, is it expected to get this wrong, could
> it be made better?

Well, I actually do work with the compiler folks, and we are actually
trying to get a session at GNU Cauldron where Linux kernel folks can talk
with the gcc compiler folks.

I've stared at so many objdump outputs, that I can now pretty much see the
assembly that my C code makes ;-)

-- Steve
On Tue, 2024-06-04 at 20:27 -0400, Steven Rostedt wrote:
> On Wed, 5 Jun 2024 01:44:37 +0200
> Andrew Lunn <andrew@lunn.ch> wrote:
>
> > > Interesting, as I sped up the ftrace ring buffer by a substantial amount by
> > > adding strategic __always_inline, noinline, likely() and unlikely()
> > > throughout the code. It had to do with what was considered the fast path
> > > and slow path, and not actually the size of the function. gcc got it
> > > horribly wrong.
> >
> > And what did the compiler people say when you reported gcc was getting
> > it wrong?
> >
> > Our assumption is, the compiler is better than a human at deciding
> > this. Or at least, a human who does not spend a long time profiling
> > and tuning. If this assumption is not true, we probably should be
> > trying to figure out why, and improving the compiler when
> > possible. That will benefit everybody.
>
> How is the compiler going to know which path is going to be taken the most?
> There's two main paths in the ring buffer logic. One when an event stays on
> the sub-buffer, the other when the event crosses over to a new sub buffer.
> As there's 100s of events that happen on the same sub-buffer for every one
> time there's a cross over, I optimized the paths that stayed on the
> sub-buffer, which caused the time for those events to go from 250ns down to
> 150ns! That's a 40% speed up.
>
> I added the unlikely/likely and 'always_inline' and 'noinline' paths to
> make sure the "staying on the buffer" path was always the hot path, and
> keeping it tight in cache.
>
> How is a compiler going to know that?
>
> -- Steve

Isn't this basically a perfect example of something where profile-guided
optimization should work?

Thanks,
Niklas
diff --git a/include/net/devmem.h b/include/net/devmem.h
index fa03bdabdffd9..cd3186f5d1fbd 100644
--- a/include/net/devmem.h
+++ b/include/net/devmem.h
@@ -68,7 +68,20 @@ int net_devmem_bind_dmabuf(struct net_device *dev, unsigned int dmabuf_fd,
 void net_devmem_unbind_dmabuf(struct net_devmem_dmabuf_binding *binding);
 int net_devmem_bind_dmabuf_to_queue(struct net_device *dev, u32 rxq_idx,
 				    struct net_devmem_dmabuf_binding *binding);
+struct net_iov *
+net_devmem_alloc_dmabuf(struct net_devmem_dmabuf_binding *binding);
+void net_devmem_free_dmabuf(struct net_iov *ppiov);
 #else
+static inline struct net_iov *
+net_devmem_alloc_dmabuf(struct net_devmem_dmabuf_binding *binding)
+{
+	return NULL;
+}
+
+static inline void net_devmem_free_dmabuf(struct net_iov *ppiov)
+{
+}
+
 static inline void
 __net_devmem_dmabuf_binding_free(struct net_devmem_dmabuf_binding *binding)
 {
diff --git a/include/net/netmem.h b/include/net/netmem.h
index 72e932a1a9489..01dbdd216fae7 100644
--- a/include/net/netmem.h
+++ b/include/net/netmem.h
@@ -14,8 +14,26 @@
 struct net_iov {
 	struct dmabuf_genpool_chunk_owner *owner;
+	unsigned long dma_addr;
 };
 
+static inline struct dmabuf_genpool_chunk_owner *
+net_iov_owner(const struct net_iov *niov)
+{
+	return niov->owner;
+}
+
+static inline unsigned int net_iov_idx(const struct net_iov *niov)
+{
+	return niov - net_iov_owner(niov)->niovs;
+}
+
+static inline struct net_devmem_dmabuf_binding *
+net_iov_binding(const struct net_iov *niov)
+{
+	return net_iov_owner(niov)->binding;
+}
+
 /* netmem */
 
 /**
diff --git a/net/core/devmem.c b/net/core/devmem.c
index d82f92d7cf9ce..d5fac8edf621d 100644
--- a/net/core/devmem.c
+++ b/net/core/devmem.c
@@ -32,6 +32,14 @@ static void net_devmem_dmabuf_free_chunk_owner(struct gen_pool *genpool,
 	kfree(owner);
 }
 
+static inline dma_addr_t net_devmem_get_dma_addr(const struct net_iov *niov)
+{
+	struct dmabuf_genpool_chunk_owner *owner = net_iov_owner(niov);
+
+	return owner->base_dma_addr +
+	       ((dma_addr_t)net_iov_idx(niov) << PAGE_SHIFT);
+}
+
 void __net_devmem_dmabuf_binding_free(struct net_devmem_dmabuf_binding *binding)
 {
 	size_t size, avail;
@@ -54,6 +62,42 @@ void __net_devmem_dmabuf_binding_free(struct net_devmem_dmabuf_binding *binding)
 	kfree(binding);
 }
 
+struct net_iov *
+net_devmem_alloc_dmabuf(struct net_devmem_dmabuf_binding *binding)
+{
+	struct dmabuf_genpool_chunk_owner *owner;
+	unsigned long dma_addr;
+	struct net_iov *niov;
+	ssize_t offset;
+	ssize_t index;
+
+	dma_addr = gen_pool_alloc_owner(binding->chunk_pool, PAGE_SIZE,
+					(void **)&owner);
+	if (!dma_addr)
+		return NULL;
+
+	offset = dma_addr - owner->base_dma_addr;
+	index = offset / PAGE_SIZE;
+	niov = &owner->niovs[index];
+
+	niov->dma_addr = 0;
+
+	net_devmem_dmabuf_binding_get(binding);
+
+	return niov;
+}
+
+void net_devmem_free_dmabuf(struct net_iov *niov)
+{
+	struct net_devmem_dmabuf_binding *binding = net_iov_binding(niov);
+	unsigned long dma_addr = net_devmem_get_dma_addr(niov);
+
+	if (gen_pool_has_addr(binding->chunk_pool, dma_addr, PAGE_SIZE))
+		gen_pool_free(binding->chunk_pool, dma_addr, PAGE_SIZE);
+
+	net_devmem_dmabuf_binding_put(binding);
+}
+
 /* Protected by rtnl_lock() */
 static DEFINE_XARRAY_FLAGS(net_devmem_dmabuf_bindings, XA_FLAGS_ALLOC1);