[RFC,0/3] Asynchronous EEH recovery

Message ID: 20220816032716.108297-1-ganeshgr@linux.ibm.com

Message

Ganesh Goudar Aug. 16, 2022, 3:27 a.m. UTC
Hi,

EEH recovery is currently serialized, and these patches shorten
the time it takes by allowing recovery to run in parallel. The
original author of these patches is Sam Bobroff; I have rebased
and tested them.

On powervm with 64 VFs I see approximately a 48% reduction in the
time taken for EEH recovery; this is yet to be tested on powernv.
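
To give a rough idea of the direction, here is a minimal sketch
(illustrative only, not code from the patches; all names below are
placeholders for the real EEH structures): each error event is queued
to an unbound workqueue so that independent PEs can be recovered
concurrently, instead of being handled one at a time by a single
event loop.

/*
 * Hypothetical sketch only -- not taken from these patches.  The
 * struct and function names are placeholders for the real EEH types.
 */
#include <linux/errno.h>
#include <linux/init.h>
#include <linux/slab.h>
#include <linux/workqueue.h>

struct demo_eeh_event {
        struct work_struct work;
        void *pe;                       /* the PE to recover */
};

static struct workqueue_struct *demo_recovery_wq;

static void demo_recover_pe(void *pe)
{
        /* per-PE recovery steps (reset, driver callbacks, ...) */
}

static void demo_event_fn(struct work_struct *work)
{
        struct demo_eeh_event *ev =
                container_of(work, struct demo_eeh_event, work);

        demo_recover_pe(ev->pe);        /* may run concurrently with other PEs */
        kfree(ev);
}

/* Queue one recovery work item per PE instead of processing serially. */
static int demo_queue_recovery(void *pe)
{
        struct demo_eeh_event *ev = kzalloc(sizeof(*ev), GFP_ATOMIC);

        if (!ev)
                return -ENOMEM;
        ev->pe = pe;
        INIT_WORK(&ev->work, demo_event_fn);
        queue_work(demo_recovery_wq, &ev->work);
        return 0;
}

static int __init demo_recovery_init(void)
{
        demo_recovery_wq = alloc_workqueue("eeh_recovery", WQ_UNBOUND, 0);
        return demo_recovery_wq ? 0 : -ENOMEM;
}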

These patches were originally posted as separate RFCs; I think
posting them as a single series is more helpful. I know the
patches are too big, and I will try to divide them more logically
in the next iteration.

Thanks 

Ganesh Goudar (3):
  powerpc/eeh: Synchronization for safety
  powerpc/eeh: Provide a unique ID for each EEH recovery
  powerpc/eeh: Asynchronous recovery

 arch/powerpc/include/asm/eeh.h               |   7 +-
 arch/powerpc/include/asm/eeh_event.h         |  10 +-
 arch/powerpc/include/asm/pci-bridge.h        |   3 +
 arch/powerpc/include/asm/ppc-pci.h           |   2 +-
 arch/powerpc/kernel/eeh.c                    | 154 +++--
 arch/powerpc/kernel/eeh_driver.c             | 578 +++++++++++++++----
 arch/powerpc/kernel/eeh_event.c              |  71 ++-
 arch/powerpc/kernel/eeh_pe.c                 |  33 +-
 arch/powerpc/platforms/powernv/eeh-powernv.c |  12 +-
 arch/powerpc/platforms/pseries/eeh_pseries.c |   5 +-
 arch/powerpc/platforms/pseries/pci_dlpar.c   |   5 +-
 drivers/pci/hotplug/pnv_php.c                |   5 +-
 drivers/pci/hotplug/rpadlpar_core.c          |   2 +
 drivers/vfio/vfio_spapr_eeh.c                |  10 +-
 14 files changed, 685 insertions(+), 212 deletions(-)

Comments

Oliver O'Halloran Aug. 17, 2022, 7:16 a.m. UTC | #1
On Tue, Aug 16, 2022 at 1:29 PM Ganesh Goudar <ganeshgr@linux.ibm.com> wrote:
>
> Hi,
>
> EEH recovery is currently serialized, and these patches shorten
> the time it takes by allowing recovery to run in parallel. The
> original author of these patches is Sam Bobroff; I have rebased
> and tested them.
>
> On powervm with 64 VFs I see approximately a 48% reduction in the
> time taken for EEH recovery;

Can you elaborate on how you're testing this? A VM with 64 separate
VFs on 64 separate PHBs I guess? How are you injecting errors?

> this is yet to be tested on powernv.

If you're not testing on powernv you might as well not be testing. EEH
support on pseries is relatively straightforward since managing PCI
bridges, etc. is all hidden away in the hypervisor. Most of the things
about EEH which give me headaches simply don't happen on pseries since
the PCI topology seen by the guest is flat.

> These patches were originally posted as separate RFCs, I think
> posting them as single series would be more helpful, I know the
> patches are too big, I will try to logically divide in next
> iterations.
>
> Thanks
>
> Ganesh Goudar (3):
>   powerpc/eeh: Synchronization for safety

I didn't pick that patch up after Sam left for a reason. I don't
recall the exact details, but I think I decided that the lock being
added had the same (or mostly the same) scope as the pci rescan lock.
All modifications to the EEH PE tree have to occur with the rescan
lock held since the per-device EEH setup occurs while configuring the
pci_dev (similarly the teardown happens when we remove the pci_dev
from its bus). I don't think that was true when Sam wrote the patch
originally though, since it pre-dates my big EEH init re-work.
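
Roughly, the convention being described looks like the sketch below
(hypothetical code, not from the series or the current kernel;
demo_eeh_pe_attach() is a stand-in for the real per-device EEH setup).
The point is that PE-tree updates already happen under the global PCI
rescan/remove lock, which a separate EEH lock would largely duplicate.

#include <linux/pci.h>

/* Stand-in for the real per-device EEH setup / PE-tree insertion. */
static void demo_eeh_pe_attach(struct pci_dev *pdev)
{
        /* add pdev's PE to the EEH PE tree */
}

static void demo_configure_device(struct pci_dev *pdev)
{
        pci_lock_rescan_remove();       /* serializes PCI topology changes */
        demo_eeh_pe_attach(pdev);       /* PE-tree edit under the same lock */
        pci_unlock_rescan_remove();
}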

It's possible (probable even!) I got that wrong, or that it was
something I was trying to make true through re-works rather than an
accurate assessment of the current state of things. I'll try to page some
of this back into my brain later. I think I left some comments on
Sam's original RFC patch which might still be valid.

>   powerpc/eeh: Asynchronous recovery

I remember not hating the idea, but I think the implementation leaves
a lot to be desired. If you only really care about pseries then just
moving to a model where each PHB has its own recovery thread would get
you most of the benefit without needing to turn the already
complicated recovery path into an async trainwreck. IIRC Sam's
motivation for that patch was to make recovery faster for powernv
systems when an error was injected on a PF with a lot of child VFs.
Certain drivers took forever to recover each VF (might have been mlx5)
so doing it in parallel would have sped things up considerably.
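
As a rough illustration of that per-PHB model (again hypothetical, not
code from the series): giving each PHB an ordered workqueue keeps
recovery on one PHB serialized while letting different PHBs recover in
parallel. struct demo_phb stands in for the real controller structure.

#include <linux/errno.h>
#include <linux/workqueue.h>

struct demo_phb {
        int index;
        struct workqueue_struct *recovery_wq;
};

static int demo_phb_init_recovery(struct demo_phb *phb)
{
        /* one single-threaded (ordered) queue per PHB */
        phb->recovery_wq = alloc_ordered_workqueue("eeh-phb%d", 0, phb->index);
        return phb->recovery_wq ? 0 : -ENOMEM;
}

/* Same-PHB events run in order; different PHBs recover concurrently. */
static void demo_phb_queue_recovery(struct demo_phb *phb, struct work_struct *w)
{
        queue_work(phb->recovery_wq, w);
}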

Oliver
Jason Gunthorpe Sept. 2, 2022, 12:19 a.m. UTC | #2
On Tue, Aug 16, 2022 at 08:57:13AM +0530, Ganesh Goudar wrote:
> Hi,
> 
> EEH recovery is currently serialized, and these patches shorten
> the time it takes by allowing recovery to run in parallel. The
> original author of these patches is Sam Bobroff; I have rebased
> and tested them.

How did you test this?

I understand that VFIO on 6.0 does not work at all on power?

I am waiting for power maintainers to pick up this series to fix it:

https://lore.kernel.org/kvm/20220714081822.3717693-1-aik@ozlabs.ru/

Jason
Ganesh Goudar Sept. 15, 2022, 10:15 a.m. UTC | #3
On 9/2/22 05:49, Jason Gunthorpe wrote:

> On Tue, Aug 16, 2022 at 08:57:13AM +0530, Ganesh Goudar wrote:
>> Hi,
>>
>> EEH recovery is currently serialized, and these patches shorten
>> the time it takes by allowing recovery to run in parallel. The
>> original author of these patches is Sam Bobroff; I have rebased
>> and tested them.
> How did you test this?

This is tested on SRIOV VFs.

>
> I understand that VFIO on 6.0 does not work at all on power?
>
> I am waiting for power maintainers to pick up this series to fix it:
>
> https://lore.kernel.org/kvm/20220714081822.3717693-1-aik@ozlabs.ru/
>
> Jason
Christophe Leroy Jan. 25, 2023, 2:04 p.m. UTC | #4
Hi,

On 15/09/2022 at 12:15, Ganesh wrote:
> On 9/2/22 05:49, Jason Gunthorpe wrote:
> 
>> On Tue, Aug 16, 2022 at 08:57:13AM +0530, Ganesh Goudar wrote:
>>> Hi,
>>>
>>> EEH recovery is currently serialized, and these patches shorten
>>> the time it takes by allowing recovery to run in parallel. The
>>> original author of these patches is Sam Bobroff; I have rebased
>>> and tested them.
>> How did you test this?
> 
> This is tested on SRIOV VFs.
> 
>> I understand that VFIO on 6.0 does not work at all on power?
>>
>> I am waiting for power maintainers to pick up this series to fix it:
>>
>> https://lore.kernel.org/kvm/20220714081822.3717693-1-aik@ozlabs.ru/
>>


This series doesn't apply anymore; for instance,
drivers/vfio/vfio_spapr_eeh.c doesn't exist anymore.

If your series is still relevant, please resubmit.

Christophe