Message ID | 20210825150408.54367-1-fbarrat@linux.ibm.com |
---|---|
State | New |
Headers | show |
Series | phb4/5: Escalate page-level TCE kills | expand |
Sounds like also for stable? Sent from my iPhone > On Aug 25, 2021, at 8:09 AM, Frederic Barrat <fbarrat@linux.ibm.com> wrote: > > An hw issue was found on P10 (HW560152) where a page-level TCE kill > can be dropped if there are enough TCE kill requests already being > processed. The net effect is that data integrity is not > guaranteed. The circumvention is to stay away from page-level kills > and escalate those to PE kills. Which hurts performance. > It also affects P9. > > Signed-off-by: Frederic Barrat <fbarrat@linux.ibm.com> > --- > hw/phb4.c | 8 ++++++++ > 1 file changed, 8 insertions(+) > > diff --git a/hw/phb4.c b/hw/phb4.c > index 79083d4a..ddaa18f8 100644 > --- a/hw/phb4.c > +++ b/hw/phb4.c > @@ -1051,6 +1051,14 @@ static int64_t phb4_tce_kill(struct phb *phb, uint32_t kill_type, > uint64_t val; > int64_t rc; > > + /* > + * HW560152: a page-level kill can be dropped if the > + * processing queue is backed-up, which can cause data > + * integrity issues > + */ > + if (kill_type == OPAL_PCI_TCE_KILL_PAGES) > + kill_type = OPAL_PCI_TCE_KILL_PE; > + > sync(); > switch(kill_type) { > case OPAL_PCI_TCE_KILL_PAGES: > -- > 2.31.1 > > _______________________________________________ > Skiboot mailing list > Skiboot@lists.ozlabs.org > https://lists.ozlabs.org/listinfo/skiboot
On 25/08/2021 18:37, Stewart Smith wrote: > Sounds like also for stable? I sent it to the stable list but didn't explicitly add the "cc" field in the commit log. Vasant: could you add it or should I resend? Fred > Sent from my iPhone > >> On Aug 25, 2021, at 8:09 AM, Frederic Barrat <fbarrat@linux.ibm.com> wrote: >> >> An hw issue was found on P10 (HW560152) where a page-level TCE kill >> can be dropped if there are enough TCE kill requests already being >> processed. The net effect is that data integrity is not >> guaranteed. The circumvention is to stay away from page-level kills >> and escalate those to PE kills. Which hurts performance. >> It also affects P9. >> >> Signed-off-by: Frederic Barrat <fbarrat@linux.ibm.com> >> --- >> hw/phb4.c | 8 ++++++++ >> 1 file changed, 8 insertions(+) >> >> diff --git a/hw/phb4.c b/hw/phb4.c >> index 79083d4a..ddaa18f8 100644 >> --- a/hw/phb4.c >> +++ b/hw/phb4.c >> @@ -1051,6 +1051,14 @@ static int64_t phb4_tce_kill(struct phb *phb, uint32_t kill_type, >> uint64_t val; >> int64_t rc; >> >> + /* >> + * HW560152: a page-level kill can be dropped if the >> + * processing queue is backed-up, which can cause data >> + * integrity issues >> + */ >> + if (kill_type == OPAL_PCI_TCE_KILL_PAGES) >> + kill_type = OPAL_PCI_TCE_KILL_PE; >> + >> sync(); >> switch(kill_type) { >> case OPAL_PCI_TCE_KILL_PAGES: >> -- >> 2.31.1 >> >> _______________________________________________ >> Skiboot mailing list >> Skiboot@lists.ozlabs.org >> https://lists.ozlabs.org/listinfo/skiboot >
On Thu, Aug 26, 2021 at 1:09 AM Frederic Barrat <fbarrat@linux.ibm.com> wrote: > > An hw issue was found on P10 (HW560152) where a page-level TCE kill > can be dropped if there are enough TCE kill requests already being > processed. The net effect is that data integrity is not > guaranteed. Hmm, what is the actual problem? Is there a race between when the bit in TCE_KILL says there's a free queue slot and when one actually comes available? If so, how big is that race window? > The circumvention is to stay away from page-level kills > and escalate those to PE kills. Which hurts performance. understatement > It also affects P9. lol > > Signed-off-by: Frederic Barrat <fbarrat@linux.ibm.com> > --- > hw/phb4.c | 8 ++++++++ > 1 file changed, 8 insertions(+) > > diff --git a/hw/phb4.c b/hw/phb4.c > index 79083d4a..ddaa18f8 100644 > --- a/hw/phb4.c > +++ b/hw/phb4.c > @@ -1051,6 +1051,14 @@ static int64_t phb4_tce_kill(struct phb *phb, uint32_t kill_type, > uint64_t val; > int64_t rc; > > + /* > + * HW560152: a page-level kill can be dropped if the > + * processing queue is backed-up, which can cause data > + * integrity issues > + */ > + if (kill_type == OPAL_PCI_TCE_KILL_PAGES) > + kill_type = OPAL_PCI_TCE_KILL_PE; > + > sync(); > switch(kill_type) { > case OPAL_PCI_TCE_KILL_PAGES: > -- > 2.31.1 > > -- > Skiboot-stable mailing list > Skiboot-stable@lists.ozlabs.org > https://lists.ozlabs.org/listinfo/skiboot-stable
On 26/08/2021 17:47, Oliver O'Halloran wrote: > On Thu, Aug 26, 2021 at 1:09 AM Frederic Barrat <fbarrat@linux.ibm.com> wrote: >> >> An hw issue was found on P10 (HW560152) where a page-level TCE kill >> can be dropped if there are enough TCE kill requests already being >> processed. The net effect is that data integrity is not >> guaranteed. > > Hmm, what is the actual problem? Is there a race between when the bit > in TCE_KILL says there's a free queue slot and when one actually comes > available? If so, how big is that race window? Not quite. We need to have the queue backed up with a mix of PE-level and page-level kills. And if there's the right sequence of those in the queue, then a page level kill is dropped. Fred >> The circumvention is to stay away from page-level kills >> and escalate those to PE kills. Which hurts performance. > > understatement > >> It also affects P9. > > lol > > >> >> Signed-off-by: Frederic Barrat <fbarrat@linux.ibm.com> >> --- >> hw/phb4.c | 8 ++++++++ >> 1 file changed, 8 insertions(+) >> >> diff --git a/hw/phb4.c b/hw/phb4.c >> index 79083d4a..ddaa18f8 100644 >> --- a/hw/phb4.c >> +++ b/hw/phb4.c >> @@ -1051,6 +1051,14 @@ static int64_t phb4_tce_kill(struct phb *phb, uint32_t kill_type, >> uint64_t val; >> int64_t rc; >> >> + /* >> + * HW560152: a page-level kill can be dropped if the >> + * processing queue is backed-up, which can cause data >> + * integrity issues >> + */ >> + if (kill_type == OPAL_PCI_TCE_KILL_PAGES) >> + kill_type = OPAL_PCI_TCE_KILL_PE; >> + >> sync(); >> switch(kill_type) { >> case OPAL_PCI_TCE_KILL_PAGES: >> -- >> 2.31.1 >> >> -- >> Skiboot-stable mailing list >> Skiboot-stable@lists.ozlabs.org >> https://lists.ozlabs.org/listinfo/skiboot-stable
On 8/25/21 8:34 PM, Frederic Barrat wrote: > An hw issue was found on P10 (HW560152) where a page-level TCE kill > can be dropped if there are enough TCE kill requests already being > processed. The net effect is that data integrity is not > guaranteed. The circumvention is to stay away from page-level kills > and escalate those to PE kills. Which hurts performance. > It also affects P9. Thanks! Merged to master as 2b0b6a1a. -Vasant
diff --git a/hw/phb4.c b/hw/phb4.c index 79083d4a..ddaa18f8 100644 --- a/hw/phb4.c +++ b/hw/phb4.c @@ -1051,6 +1051,14 @@ static int64_t phb4_tce_kill(struct phb *phb, uint32_t kill_type, uint64_t val; int64_t rc; + /* + * HW560152: a page-level kill can be dropped if the + * processing queue is backed-up, which can cause data + * integrity issues + */ + if (kill_type == OPAL_PCI_TCE_KILL_PAGES) + kill_type = OPAL_PCI_TCE_KILL_PE; + sync(); switch(kill_type) { case OPAL_PCI_TCE_KILL_PAGES:
An hw issue was found on P10 (HW560152) where a page-level TCE kill can be dropped if there are enough TCE kill requests already being processed. The net effect is that data integrity is not guaranteed. The circumvention is to stay away from page-level kills and escalate those to PE kills. Which hurts performance. It also affects P9. Signed-off-by: Frederic Barrat <fbarrat@linux.ibm.com> --- hw/phb4.c | 8 ++++++++ 1 file changed, 8 insertions(+)