e1000e: Don't zero out buffer address in rx descriptor

Message ID	1476657307-26165-1-git-send-email-mail@kevin-wolf.de
State	New
Headers	show Return-Path: <qemu-devel-bounces+incoming=patchwork.ozlabs.org@nongnu.org> From: Kevin Wolf <mail@kevin-wolf.de> To: qemu-devel@nongnu.org Date: Mon, 17 Oct 2016 00:35:07 +0200 Message-Id: <1476657307-26165-1-git-send-email-mail@kevin-wolf.de> Subject: [Qemu-devel] [PATCH] e1000e: Don't zero out buffer address in rx descriptor Precedence: list Cc: kwolf@redhat.com, dmitry@daynix.com, jasowang@redhat.com, mail@kevin-wolf.de Errors-To: qemu-devel-bounces+incoming=patchwork.ozlabs.org@nongnu.org Sender: "Qemu-devel" <qemu-devel-bounces+incoming=patchwork.ozlabs.org@nongnu.org>

Kevin Wolf Oct. 16, 2016, 10:35 p.m. UTC

The e1000e emulation zeroes out any used rx descriptor and then writes a
completely newly constructed value there. By doing this, it doesn't only
update the write-back area of the descriptors (as it's supposed to do),
but it also clears the buffer address, which real hardware doesn't do.

The spec explicitly mentions in chapter 7.1.8 that it is valid for a
driver to reuse a descriptor and only update the status field while
doing so, i.e. reusing the old buffer address:

    If software statically allocates buffers, and uses memory read to
    check for completed descriptors, it simply has to zero the status
    byte in the descriptor to make it ready for reuse by hardware.

This patch fixes the behaviour to leave the buffer address in
descriptors unchanged even after the descriptor has been used.

Signed-off-by: Kevin Wolf <mail@kevin-wolf.de>
---
 hw/net/e1000e_core.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

Dmitry Fleytman Oct. 18, 2016, 2:10 p.m. UTC | #1

> On 17 Oct 2016, at 01:35 AM, Kevin Wolf <mail@kevin-wolf.de> wrote:
> 
> The e1000e emulation zeroes out any used rx descriptor and then writes a
> completely newly constructed value there. By doing this, it doesn't only
> update the write-back area of the descriptors (as it's supposed to do),
> but it also clears the buffer address, which real hardware doesn't do.
> 
> The spec explicitly mentions in chapter 7.1.8 that it is valid for a
> driver to reuse a descriptor and only update the status field while
> doing so, i.e. reusing the old buffer address:
> 
>    If software statically allocates buffers, and uses memory read to
>    check for completed descriptors, it simply has to zero the status
>    byte in the descriptor to make it ready for reuse by hardware.
> 
> This patch fixes the behaviour to leave the buffer address in
> descriptors unchanged even after the descriptor has been used.

Hi Kevin,

Reviewed-by: Dmitry Fleytman <dmitry@daynix.com>

Thanks for catching this!

~Dmitry

> 
> Signed-off-by: Kevin Wolf <mail@kevin-wolf.de>
> ---
> hw/net/e1000e_core.c | 8 ++++----
> 1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/hw/net/e1000e_core.c b/hw/net/e1000e_core.c
> index 9fa4116..a5ca97d 100644
> --- a/hw/net/e1000e_core.c
> +++ b/hw/net/e1000e_core.c
> @@ -1280,11 +1280,10 @@ e1000e_write_lgcy_rx_descr(E1000ECore *core, uint8_t *desc,
> 
>     struct e1000_rx_desc *d = (struct e1000_rx_desc *) desc;
> 
> -    memset(d, 0, sizeof(*d));
> -
>     assert(!rss_info->enabled);
> 
>     d->length = cpu_to_le16(length);
> +    d->csum = 0;
> 
>     e1000e_build_rx_metadata(core, pkt, pkt != NULL,
>                              rss_info,
> @@ -1293,6 +1292,7 @@ e1000e_write_lgcy_rx_descr(E1000ECore *core, uint8_t *desc,
>                              &d->special);
>     d->errors = (uint8_t) (le32_to_cpu(status_flags) >> 24);
>     d->status = (uint8_t) le32_to_cpu(status_flags);
> +    d->special = 0;
> }
> 
> static inline void
> @@ -1303,7 +1303,7 @@ e1000e_write_ext_rx_descr(E1000ECore *core, uint8_t *desc,
> {
>     union e1000_rx_desc_extended *d = (union e1000_rx_desc_extended *) desc;
> 
> -    memset(d, 0, sizeof(*d));
> +    memset(&d->wb, 0, sizeof(d->wb));
> 
>     d->wb.upper.length = cpu_to_le16(length);
> 
> @@ -1327,7 +1327,7 @@ e1000e_write_ps_rx_descr(E1000ECore *core, uint8_t *desc,
>     union e1000_rx_desc_packet_split *d =
>         (union e1000_rx_desc_packet_split *) desc;
> 
> -    memset(d, 0, sizeof(*d));
> +    memset(&d->wb, 0, sizeof(d->wb));
> 
>     d->wb.middle.length0 = cpu_to_le16((*written)[0]);
> 
> -- 
> 2.1.4
>

Kevin Wolf Oct. 18, 2016, 2:27 p.m. UTC | #2

Am 18.10.2016 um 16:10 hat Dmitry Fleytman geschrieben:
>     On 17 Oct 2016, at 01:35 AM, Kevin Wolf <mail@kevin-wolf.de> wrote:
> 
>     The e1000e emulation zeroes out any used rx descriptor and then writes a
>     completely newly constructed value there. By doing this, it doesn't only
>     update the write-back area of the descriptors (as it's supposed to do),
>     but it also clears the buffer address, which real hardware doesn't do.
> 
>     The spec explicitly mentions in chapter 7.1.8 that it is valid for a
>     driver to reuse a descriptor and only update the status field while
>     doing so, i.e. reusing the old buffer address:
> 
>        If software statically allocates buffers, and uses memory read to
>        check for completed descriptors, it simply has to zero the status
>        byte in the descriptor to make it ready for reuse by hardware.
> 
>     This patch fixes the behaviour to leave the buffer address in
>     descriptors unchanged even after the descriptor has been used.
> 
> 
> Hi Kevin,
> 
> Reviewed-by: Dmitry Fleytman <dmitry@daynix.com>
> 
> Thanks for catching this!

Thanks, Dmitry!

I assume that your R-b implies that you don't send a pull request
yourself. If so, what's the process for getting the patch merged? Is
Jason going to pick it up normally or should I send a pull request of my
own?

Another related thing that I noticed while debugging this and turning on
tracing is that the interrupt throttling timers kept firing even if
there was no activity at all. Something might be wrong, there, too.

Next thing I wondered why throttling was enabled at all because the spec
says the default is 0 (turned off). So one thing that I'm pretty sure is
just a misunderstanding is the following defintion:

#define E1000E_MIN_XITR     (500) /* No more then 7813 interrupts per
                                     second according to spec 10.2.4.2 */

As I understand it, the spec is just giving an example there and lower
values are valid as well. At the very least, 0 should be accepted as a
special case because it means "disabled" and it's specified to be the
default.

As this wasn't blocking me after I had patched the above constant
locally to 0 so that I could see the actually meaningful trace events, I
didn't dig deeper, but I suspect that my local workaround for the
trace point spam may actually be a valid fix.

Kevin

Dmitry Fleytman Oct. 19, 2016, 6:48 a.m. UTC | #3

> On 18 Oct 2016, at 17:27 PM, Kevin Wolf <kwolf@redhat.com> wrote:
> 
> Am 18.10.2016 um 16:10 hat Dmitry Fleytman geschrieben:
>>    On 17 Oct 2016, at 01:35 AM, Kevin Wolf <mail@kevin-wolf.de> wrote:
>> 
>>    The e1000e emulation zeroes out any used rx descriptor and then writes a
>>    completely newly constructed value there. By doing this, it doesn't only
>>    update the write-back area of the descriptors (as it's supposed to do),
>>    but it also clears the buffer address, which real hardware doesn't do.
>> 
>>    The spec explicitly mentions in chapter 7.1.8 that it is valid for a
>>    driver to reuse a descriptor and only update the status field while
>>    doing so, i.e. reusing the old buffer address:
>> 
>>       If software statically allocates buffers, and uses memory read to
>>       check for completed descriptors, it simply has to zero the status
>>       byte in the descriptor to make it ready for reuse by hardware.
>> 
>>    This patch fixes the behaviour to leave the buffer address in
>>    descriptors unchanged even after the descriptor has been used.
>> 
>> 
>> Hi Kevin,
>> 
>> Reviewed-by: Dmitry Fleytman <dmitry@daynix.com>
>> 
>> Thanks for catching this!
> 
> Thanks, Dmitry!
> 
> I assume that your R-b implies that you don't send a pull request

Right.

> yourself. If so, what's the process for getting the patch merged? Is
> Jason going to pick it up normally or should I send a pull request of my
> own?

Jason usually picks up patches like these.

> 
> 
> Another related thing that I noticed while debugging this and turning on
> tracing is that the interrupt throttling timers kept firing even if
> there was no activity at all. Something might be wrong, there, too.
> 
> Next thing I wondered why throttling was enabled at all because the spec
> says the default is 0 (turned off). So one thing that I'm pretty sure is
> just a misunderstanding is the following defintion:
> 
> #define E1000E_MIN_XITR     (500) /* No more then 7813 interrupts per
>                                     second according to spec 10.2.4.2 */
> 
> As I understand it, the spec is just giving an example there and lower
> values are valid as well. At the very least, 0 should be accepted as a
> special case because it means "disabled" and it's specified to be the
> default.

Right, this according to the spec this value should be 0 by default and throttling should be disabled.

Current device implementation does not allow specification of throttling interval less than 500
and treats interval 0 as throttling enabled with interval 500.

This is done by intention because according to the spec (10.2.4.2) device cannot produce more than
7813 interrupts per second even when throttling is disabled. Therefore, even in case of interrupt storm
(continuous interrupt re-injection by device), number of interrupts produced by device is limited
and CPU (driver) has enough time to do its job and handle problematic interrupt state.

Opposed to this, virtual device is able to raise interrupts with rate limited by CPU speed only
therefore driver has no chance to fix interrupt storm condition. 

Windows e1000e drivers rely on upper limit for number of interrupts per second in some cases
and absence of this limit leads to infinite interrupt storms.

To summarise, while usage of throttling mechanisms is a little bit different from what specification says,
effective emulated device behavior is totally compliant to the real device.

> 
> As this wasn't blocking me after I had patched the above constant
> locally to 0 so that I could see the actually meaningful trace events, I
> didn't dig deeper, but I suspect that my local workaround for the
> trace point spam may actually be a valid fix.

This part that looks suspicious.

Even when throttling is enabled, timers should be idle
unless there are packets on corresponding path.

In case you see times firing when device is idle - something
might be wrong in throttling mechanisms.

~Dmitry

> 
> Kevin

Kevin Wolf Oct. 19, 2016, 7:25 a.m. UTC | #4

Am 19.10.2016 um 08:48 hat Dmitry Fleytman geschrieben:
>     Another related thing that I noticed while debugging this and turning on
>     tracing is that the interrupt throttling timers kept firing even if
>     there was no activity at all. Something might be wrong, there, too.
> 
>     Next thing I wondered why throttling was enabled at all because the spec
>     says the default is 0 (turned off). So one thing that I'm pretty sure is
>     just a misunderstanding is the following defintion:
> 
>     #define E1000E_MIN_XITR     (500) /* No more then 7813 interrupts per
>                                         second according to spec 10.2.4.2 */
> 
> 
>     As I understand it, the spec is just giving an example there and lower
>     values are valid as well. At the very least, 0 should be accepted as a
>     special case because it means "disabled" and it's specified to be the
>     default.
> 
> 
> Right, this according to the spec this value should be 0 by default and
> throttling should be disabled.
> 
> Current device implementation does not allow specification of
> throttling interval less than 500 and treats interval 0 as throttling
> enabled with interval 500.
> 
> This is done by intention because according to the spec (10.2.4.2)
> device cannot produce more than 7813 interrupts per second even when
> throttling is disabled. Therefore, even in case of interrupt storm
> (continuous interrupt re-injection by device), number of interrupts
> produced by device is limited and CPU (driver) has enough time to do
> its job and handle problematic interrupt state.

I think you're misinterpreting the spec here. This is the paragraph
we're talking about, right?

    For example, if the interval is programmed to 500 (decimal), the
    82574 guarantees the CPU is not interrupted by it for 128 µs from
    the last interrupt. The maximum observable interrupt rate from the
    82574 should never exceed 7813 interrupts/sec.

It says "for example", so this is just demonstrating how you can
calculate the effects of a specific throttling setting. It says that
_if_ you set ITR to 500, you get an interrupt at most every
500 * 256 ns = 128 µs. And 1 / 128 µs = 7821.5 Hz, so this is the
effective maximum frequency that _this specific_ ITR setting allows.

I also don't think it would make any sense for hardware to be unable to
trigger interrupts more often than that. Triggering an interrupt is not
a complex operation that involves a lot of calculation or anything.

> Opposed to this, virtual device is able to raise interrupts with rate
> limited by CPU speed only therefore driver has no chance to fix
> interrupt storm condition. 
> 
> Windows e1000e drivers rely on upper limit for number of interrupts
> per second in some cases and absence of this limit leads to infinite
> interrupt storms.
> 
> To summarise, while usage of throttling mechanisms is a little bit
> different from what specification says, effective emulated device
> behavior is totally compliant to the real device.

So Windows doesn't configure ITR (i.e. it is 0) even though it can't
handle unlimited interrupts? That would be a driver bug then, and
perhaps an important enough one to keep a workaround in our code. But
then let's be explicit that this is a workaround for a Windows bug and
not mandated by the spec.

I'm not sure in what setup you produced this error, but possibly a
reason why this doesn't happen with real hardware isn't the NIC itself
but the backend: Communication with the host can obviously be faster
than talking to a physical network (so if you were doing the latter, the
rate in the VM wouldn't be limited by the CPU, but by the physical
network).

>     As this wasn't blocking me after I had patched the above constant
>     locally to 0 so that I could see the actually meaningful trace events, I
>     didn't dig deeper, but I suspect that my local workaround for the
>     trace point spam may actually be a valid fix.
> 
> 
> This part that looks suspicious.
> 
> Even when throttling is enabled, timers should be idle
> unless there are packets on corresponding path.
> 
> In case you see times firing when device is idle - something
> might be wrong in throttling mechanisms.

I saw a lot of trace events related to the timer between the really
interesting information (that is, multiple timer invocations with
nothing else in between), but I think it may not have been always. I
didn't really look too closely at the cause of this one because it was
easy enough to patch it out of the way.

Kevin

Dmitry Fleytman Oct. 19, 2016, 7:57 a.m. UTC | #5

> On 19 Oct 2016, at 10:25 AM, Kevin Wolf <kwolf@redhat.com> wrote:
> 
> Am 19.10.2016 um 08:48 hat Dmitry Fleytman geschrieben:
>>    Another related thing that I noticed while debugging this and turning on
>>    tracing is that the interrupt throttling timers kept firing even if
>>    there was no activity at all. Something might be wrong, there, too.
>> 
>>    Next thing I wondered why throttling was enabled at all because the spec
>>    says the default is 0 (turned off). So one thing that I'm pretty sure is
>>    just a misunderstanding is the following defintion:
>> 
>>    #define E1000E_MIN_XITR     (500) /* No more then 7813 interrupts per
>>                                        second according to spec 10.2.4.2 */
>> 
>> 
>>    As I understand it, the spec is just giving an example there and lower
>>    values are valid as well. At the very least, 0 should be accepted as a
>>    special case because it means "disabled" and it's specified to be the
>>    default.
>> 
>> 
>> Right, this according to the spec this value should be 0 by default and
>> throttling should be disabled.
>> 
>> Current device implementation does not allow specification of
>> throttling interval less than 500 and treats interval 0 as throttling
>> enabled with interval 500.
>> 
>> This is done by intention because according to the spec (10.2.4.2)
>> device cannot produce more than 7813 interrupts per second even when
>> throttling is disabled. Therefore, even in case of interrupt storm
>> (continuous interrupt re-injection by device), number of interrupts
>> produced by device is limited and CPU (driver) has enough time to do
>> its job and handle problematic interrupt state.
> 
> I think you're misinterpreting the spec here. This is the paragraph
> we're talking about, right?
> 
>    For example, if the interval is programmed to 500 (decimal), the
>    82574 guarantees the CPU is not interrupted by it for 128 µs from
>    the last interrupt. The maximum observable interrupt rate from the
>    82574 should never exceed 7813 interrupts/sec.
> 
> It says "for example", so this is just demonstrating how you can
> calculate the effects of a specific throttling setting. It says that
> _if_ you set ITR to 500, you get an interrupt at most every
> 500 * 256 ns = 128 µs. And 1 / 128 µs = 7821.5 Hz, so this is the
> effective maximum frequency that _this specific_ ITR setting allows.
> 
> I also don't think it would make any sense for hardware to be unable to
> trigger interrupts more often than that. Triggering an interrupt is not
> a complex operation that involves a lot of calculation or anything.

Hi Kevin,

Yes, I assume that sentence

“The maximum observable interrupt rate from the
82574 should never exceed 7813 interrupts/sec."

is not a related to a specific case, but describes a generic limitation,
however it might be I’m misreading the spec indeed.

> 
>> Opposed to this, virtual device is able to raise interrupts with rate
>> limited by CPU speed only therefore driver has no chance to fix
>> interrupt storm condition. 
>> 
>> Windows e1000e drivers rely on upper limit for number of interrupts
>> per second in some cases and absence of this limit leads to infinite
>> interrupt storms.
>> 
>> To summarise, while usage of throttling mechanisms is a little bit
>> different from what specification says, effective emulated device
>> behavior is totally compliant to the real device.
> 
> So Windows doesn't configure ITR (i.e. it is 0) even though it can't
> handle unlimited interrupts? That would be a driver bug then, and
> perhaps an important enough one to keep a workaround in our code. But
> then let's be explicit that this is a workaround for a Windows bug and
> not mandated by the spec.
> 
> I'm not sure in what setup you produced this error, but possibly a
> reason why this doesn't happen with real hardware isn't the NIC itself
> but the backend: Communication with the host can obviously be faster
> than talking to a physical network (so if you were doing the latter, the
> rate in the VM wouldn't be limited by the CPU, but by the physical
> network).

This issue is reproduced on device disable and not related 
to intensive device/backend communication. One RX packet with
right timing is enough to trigged the problem.

The same issue was fixed in e1000 device some time ago as well.

~Dmitry

> 
>>    As this wasn't blocking me after I had patched the above constant
>>    locally to 0 so that I could see the actually meaningful trace events, I
>>    didn't dig deeper, but I suspect that my local workaround for the
>>    trace point spam may actually be a valid fix.
>> 
>> 
>> This part that looks suspicious.
>> 
>> Even when throttling is enabled, timers should be idle
>> unless there are packets on corresponding path.
>> 
>> In case you see times firing when device is idle - something
>> might be wrong in throttling mechanisms.
> 
> I saw a lot of trace events related to the timer between the really
> interesting information (that is, multiple timer invocations with
> nothing else in between), but I think it may not have been always. I
> didn't really look too closely at the cause of this one because it was
> easy enough to patch it out of the way.
> 
> Kevin

Kevin Wolf Oct. 19, 2016, 10:07 a.m. UTC | #6

Am 19.10.2016 um 09:57 hat Dmitry Fleytman geschrieben:
> 
>     On 19 Oct 2016, at 10:25 AM, Kevin Wolf <kwolf@redhat.com> wrote:
> 
>     Am 19.10.2016 um 08:48 hat Dmitry Fleytman geschrieben:
> 
>            Another related thing that I noticed while debugging this and
>         turning on
>            tracing is that the interrupt throttling timers kept firing even if
>            there was no activity at all. Something might be wrong, there, too.
> 
>            Next thing I wondered why throttling was enabled at all because the
>         spec
>            says the default is 0 (turned off). So one thing that I'm pretty
>         sure is
>            just a misunderstanding is the following defintion:
> 
>            #define E1000E_MIN_XITR     (500) /* No more then 7813 interrupts
>         per
>                                                second according to spec
>         10.2.4.2 */
> 
> 
>            As I understand it, the spec is just giving an example there and
>         lower
>            values are valid as well. At the very least, 0 should be accepted as
>         a
>            special case because it means "disabled" and it's specified to be
>         the
>            default.
> 
> 
>         Right, this according to the spec this value should be 0 by default and
>         throttling should be disabled.
> 
>         Current device implementation does not allow specification of
>         throttling interval less than 500 and treats interval 0 as throttling
>         enabled with interval 500.
> 
>         This is done by intention because according to the spec (10.2.4.2)
>         device cannot produce more than 7813 interrupts per second even when
>         throttling is disabled. Therefore, even in case of interrupt storm
>         (continuous interrupt re-injection by device), number of interrupts
>         produced by device is limited and CPU (driver) has enough time to do
>         its job and handle problematic interrupt state.
> 
> 
>     I think you're misinterpreting the spec here. This is the paragraph
>     we're talking about, right?
> 
>        For example, if the interval is programmed to 500 (decimal), the
>        82574 guarantees the CPU is not interrupted by it for 128 µs from
>        the last interrupt. The maximum observable interrupt rate from the
>        82574 should never exceed 7813 interrupts/sec.
> 
>     It says "for example", so this is just demonstrating how you can
>     calculate the effects of a specific throttling setting. It says that
>     _if_ you set ITR to 500, you get an interrupt at most every
>     500 * 256 ns = 128 µs. And 1 / 128 µs = 7821.5 Hz, so this is the
>     effective maximum frequency that _this specific_ ITR setting allows.
> 
>     I also don't think it would make any sense for hardware to be unable to
>     trigger interrupts more often than that. Triggering an interrupt is not
>     a complex operation that involves a lot of calculation or anything.
> 
> 
> Hi Kevin,
> 
> Yes, I assume that sentence
> 
> “The maximum observable interrupt rate from the
> 82574 should never exceed 7813 interrupts/sec."
> 
> is not a related to a specific case, but describes a generic limitation,
> however it might be I’m misreading the spec indeed.

For me everything hints at this being only an example: Not only do the
numbers match the example made in the previous sentence (which is
explicitly called an example) and look weird as a real limit, but it's
also in the same paragraph as the explicit example and the spec is
generally good at starting a new paragraph when talking about a new
aspect.

I don't care enough to actually make you change anything, but I wanted
you to be aware that the interpretation of the spec as coded into our
emulation isn't clear at all (in fact, I would think it's clear that
it's _not_ meant this way) and that real hardware probably doesn't do
the same thing as we do.

What we're doing may still have merit, as a workaround for a guest
driver bug.

>         Opposed to this, virtual device is able to raise interrupts with rate
>         limited by CPU speed only therefore driver has no chance to fix
>         interrupt storm condition.
> 
>         Windows e1000e drivers rely on upper limit for number of interrupts
>         per second in some cases and absence of this limit leads to infinite
>         interrupt storms.
> 
>         To summarise, while usage of throttling mechanisms is a little bit
>         different from what specification says, effective emulated device
>         behavior is totally compliant to the real device.
> 
> 
>     So Windows doesn't configure ITR (i.e. it is 0) even though it can't
>     handle unlimited interrupts? That would be a driver bug then, and
>     perhaps an important enough one to keep a workaround in our code. But
>     then let's be explicit that this is a workaround for a Windows bug and
>     not mandated by the spec.
> 
>     I'm not sure in what setup you produced this error, but possibly a
>     reason why this doesn't happen with real hardware isn't the NIC itself
>     but the backend: Communication with the host can obviously be faster
>     than talking to a physical network (so if you were doing the latter, the
>     rate in the VM wouldn't be limited by the CPU, but by the physical
>     network).
> 
> 
> This issue is reproduced on device disable and not related 
> to intensive device/backend communication. One RX packet with
> right timing is enough to trigged the problem.
> 
> The same issue was fixed in e1000 device some time ago as well.

Commit 9596ef7c was good in flagging it as a guest driver bug. Only a
later series brought in the questionable spec interpretation.

Kevin

Dmitry Fleytman Oct. 19, 2016, 10:15 a.m. UTC | #7

> On 19 Oct 2016, at 13:07 PM, Kevin Wolf <kwolf@redhat.com> wrote:
> 
> Am 19.10.2016 um 09:57 hat Dmitry Fleytman geschrieben:
>> 
>>    On 19 Oct 2016, at 10:25 AM, Kevin Wolf <kwolf@redhat.com> wrote:
>> 
>>    Am 19.10.2016 um 08:48 hat Dmitry Fleytman geschrieben:
>> 
>>           Another related thing that I noticed while debugging this and
>>        turning on
>>           tracing is that the interrupt throttling timers kept firing even if
>>           there was no activity at all. Something might be wrong, there, too.
>> 
>>           Next thing I wondered why throttling was enabled at all because the
>>        spec
>>           says the default is 0 (turned off). So one thing that I'm pretty
>>        sure is
>>           just a misunderstanding is the following defintion:
>> 
>>           #define E1000E_MIN_XITR     (500) /* No more then 7813 interrupts
>>        per
>>                                               second according to spec
>>        10.2.4.2 */
>> 
>> 
>>           As I understand it, the spec is just giving an example there and
>>        lower
>>           values are valid as well. At the very least, 0 should be accepted as
>>        a
>>           special case because it means "disabled" and it's specified to be
>>        the
>>           default.
>> 
>> 
>>        Right, this according to the spec this value should be 0 by default and
>>        throttling should be disabled.
>> 
>>        Current device implementation does not allow specification of
>>        throttling interval less than 500 and treats interval 0 as throttling
>>        enabled with interval 500.
>> 
>>        This is done by intention because according to the spec (10.2.4.2)
>>        device cannot produce more than 7813 interrupts per second even when
>>        throttling is disabled. Therefore, even in case of interrupt storm
>>        (continuous interrupt re-injection by device), number of interrupts
>>        produced by device is limited and CPU (driver) has enough time to do
>>        its job and handle problematic interrupt state.
>> 
>> 
>>    I think you're misinterpreting the spec here. This is the paragraph
>>    we're talking about, right?
>> 
>>       For example, if the interval is programmed to 500 (decimal), the
>>       82574 guarantees the CPU is not interrupted by it for 128 µs from
>>       the last interrupt. The maximum observable interrupt rate from the
>>       82574 should never exceed 7813 interrupts/sec.
>> 
>>    It says "for example", so this is just demonstrating how you can
>>    calculate the effects of a specific throttling setting. It says that
>>    _if_ you set ITR to 500, you get an interrupt at most every
>>    500 * 256 ns = 128 µs. And 1 / 128 µs = 7821.5 Hz, so this is the
>>    effective maximum frequency that _this specific_ ITR setting allows.
>> 
>>    I also don't think it would make any sense for hardware to be unable to
>>    trigger interrupts more often than that. Triggering an interrupt is not
>>    a complex operation that involves a lot of calculation or anything.
>> 
>> 
>> Hi Kevin,
>> 
>> Yes, I assume that sentence
>> 
>> “The maximum observable interrupt rate from the
>> 82574 should never exceed 7813 interrupts/sec."
>> 
>> is not a related to a specific case, but describes a generic limitation,
>> however it might be I’m misreading the spec indeed.
> 
> For me everything hints at this being only an example: Not only do the
> numbers match the example made in the previous sentence (which is
> explicitly called an example) and look weird as a real limit, but it's
> also in the same paragraph as the explicit example and the spec is
> generally good at starting a new paragraph when talking about a new
> aspect.

I tend to agree.

> 
> I don't care enough to actually make you change anything, but I wanted
> you to be aware that the interpretation of the spec as coded into our
> emulation isn't clear at all (in fact, I would think it's clear that
> it's _not_ meant this way) and that real hardware probably doesn't do
> the same thing as we do.

Thanks, Kevin.

> 
> What we're doing may still have merit, as a workaround for a guest
> driver bug.
> 
>>        Opposed to this, virtual device is able to raise interrupts with rate
>>        limited by CPU speed only therefore driver has no chance to fix
>>        interrupt storm condition.
>> 
>>        Windows e1000e drivers rely on upper limit for number of interrupts
>>        per second in some cases and absence of this limit leads to infinite
>>        interrupt storms.
>> 
>>        To summarise, while usage of throttling mechanisms is a little bit
>>        different from what specification says, effective emulated device
>>        behavior is totally compliant to the real device.
>> 
>> 
>>    So Windows doesn't configure ITR (i.e. it is 0) even though it can't
>>    handle unlimited interrupts? That would be a driver bug then, and
>>    perhaps an important enough one to keep a workaround in our code. But
>>    then let's be explicit that this is a workaround for a Windows bug and
>>    not mandated by the spec.
>> 
>>    I'm not sure in what setup you produced this error, but possibly a
>>    reason why this doesn't happen with real hardware isn't the NIC itself
>>    but the backend: Communication with the host can obviously be faster
>>    than talking to a physical network (so if you were doing the latter, the
>>    rate in the VM wouldn't be limited by the CPU, but by the physical
>>    network).
>> 
>> 
>> This issue is reproduced on device disable and not related 
>> to intensive device/backend communication. One RX packet with
>> right timing is enough to trigged the problem.
>> 
>> The same issue was fixed in e1000 device some time ago as well.
> 
> Commit 9596ef7c was good in flagging it as a guest driver bug. Only a
> later series brought in the questionable spec interpretation.
> 
> Kevin

Jason Wang Oct. 21, 2016, 1:58 a.m. UTC | #8

On 2016年10月18日 22:10, Dmitry Fleytman wrote:
>> On 17 Oct 2016, at 01:35 AM, Kevin Wolf <mail@kevin-wolf.de> wrote:
>>
>> The e1000e emulation zeroes out any used rx descriptor and then writes a
>> completely newly constructed value there. By doing this, it doesn't only
>> update the write-back area of the descriptors (as it's supposed to do),
>> but it also clears the buffer address, which real hardware doesn't do.
>>
>> The spec explicitly mentions in chapter 7.1.8 that it is valid for a
>> driver to reuse a descriptor and only update the status field while
>> doing so, i.e. reusing the old buffer address:
>>
>>     If software statically allocates buffers, and uses memory read to
>>     check for completed descriptors, it simply has to zero the status
>>     byte in the descriptor to make it ready for reuse by hardware.
>>
>> This patch fixes the behaviour to leave the buffer address in
>> descriptors unchanged even after the descriptor has been used.
> Hi Kevin,
>
> Reviewed-by: Dmitry Fleytman <dmitry@daynix.com>
>
> Thanks for catching this!
>
> ~Dmitry

Applied, thanks.

>
>> Signed-off-by: Kevin Wolf <mail@kevin-wolf.de>
>> ---
>> hw/net/e1000e_core.c | 8 ++++----
>> 1 file changed, 4 insertions(+), 4 deletions(-)
>>
>> diff --git a/hw/net/e1000e_core.c b/hw/net/e1000e_core.c
>> index 9fa4116..a5ca97d 100644
>> --- a/hw/net/e1000e_core.c
>> +++ b/hw/net/e1000e_core.c
>> @@ -1280,11 +1280,10 @@ e1000e_write_lgcy_rx_descr(E1000ECore *core, uint8_t *desc,
>>
>>      struct e1000_rx_desc *d = (struct e1000_rx_desc *) desc;
>>
>> -    memset(d, 0, sizeof(*d));
>> -
>>      assert(!rss_info->enabled);
>>
>>      d->length = cpu_to_le16(length);
>> +    d->csum = 0;
>>
>>      e1000e_build_rx_metadata(core, pkt, pkt != NULL,
>>                               rss_info,
>> @@ -1293,6 +1292,7 @@ e1000e_write_lgcy_rx_descr(E1000ECore *core, uint8_t *desc,
>>                               &d->special);
>>      d->errors = (uint8_t) (le32_to_cpu(status_flags) >> 24);
>>      d->status = (uint8_t) le32_to_cpu(status_flags);
>> +    d->special = 0;
>> }
>>
>> static inline void
>> @@ -1303,7 +1303,7 @@ e1000e_write_ext_rx_descr(E1000ECore *core, uint8_t *desc,
>> {
>>      union e1000_rx_desc_extended *d = (union e1000_rx_desc_extended *) desc;
>>
>> -    memset(d, 0, sizeof(*d));
>> +    memset(&d->wb, 0, sizeof(d->wb));
>>
>>      d->wb.upper.length = cpu_to_le16(length);
>>
>> @@ -1327,7 +1327,7 @@ e1000e_write_ps_rx_descr(E1000ECore *core, uint8_t *desc,
>>      union e1000_rx_desc_packet_split *d =
>>          (union e1000_rx_desc_packet_split *) desc;
>>
>> -    memset(d, 0, sizeof(*d));
>> +    memset(&d->wb, 0, sizeof(d->wb));
>>
>>      d->wb.middle.length0 = cpu_to_le16((*written)[0]);
>>
>> -- 
>> 2.1.4
>>

Jason Wang Oct. 21, 2016, 2:01 a.m. UTC | #9

On 2016年10月18日 22:27, Kevin Wolf wrote:
> Am 18.10.2016 um 16:10 hat Dmitry Fleytman geschrieben:
>> >     On 17 Oct 2016, at 01:35 AM, Kevin Wolf<mail@kevin-wolf.de>  wrote:
>> >
>> >     The e1000e emulation zeroes out any used rx descriptor and then writes a
>> >     completely newly constructed value there. By doing this, it doesn't only
>> >     update the write-back area of the descriptors (as it's supposed to do),
>> >     but it also clears the buffer address, which real hardware doesn't do.
>> >
>> >     The spec explicitly mentions in chapter 7.1.8 that it is valid for a
>> >     driver to reuse a descriptor and only update the status field while
>> >     doing so, i.e. reusing the old buffer address:
>> >
>> >        If software statically allocates buffers, and uses memory read to
>> >        check for completed descriptors, it simply has to zero the status
>> >        byte in the descriptor to make it ready for reuse by hardware.
>> >
>> >     This patch fixes the behaviour to leave the buffer address in
>> >     descriptors unchanged even after the descriptor has been used.
>> >
>> >
>> >Hi Kevin,
>> >
>> >Reviewed-by: Dmitry Fleytman<dmitry@daynix.com>
>> >
>> >Thanks for catching this!
> Thanks, Dmitry!
>
> I assume that your R-b implies that you don't send a pull request
> yourself. If so, what's the process for getting the patch merged? Is
> Jason going to pick it up normally or should I send a pull request of my
> own?
>

I've picked this in my tree.

Thanks

e1000e: Don't zero out buffer address in rx descriptor

Commit Message

Comments

Patch