Message ID | 7728fa4b6027269d468f6a0665017c187471a9cd.camel@wefi.net |
---|---|
State | New |
Headers | show |
Series | ata: Add Elkhart Lake AHCI controller | expand |
On 8/28/23 20:55, Werner Fischer wrote: > Elkhart Lake is the successor of Apollo Lake and Gemini Lake. These > CPUs and their PCHs are used in mobile and embedded environments. > > With this patch I suggest that Elkhart Lake SATA controllers [1] should > use the default LPM policy for mobile chipsets. > The disadvantage of missing hot-plug support with this setting should > not be an issue, as those CPUs are used in embedded environments and > not in servers with hot-plug backplanes. > > We discovered that the Elkhart Lake SATA controllers have been missing > in ahci.c after a customer reported the throttling of his SATA SSD > after a short period of higher I/O. We determined the high temperature > of the SSD controller in idle mode as the root cause for that. > > Depending on the used SSD, we have seen up to 1.8 Watt lower system > idle power usage and up to 30°C lower SSD controller temperatures in > our tests, when we set med_power_with_dipm manually. I have provided a > table showing seven different SATA SSDs from ATP, Intel/Solidigm and > Samsung [2]. > > Intel lists a total of 3 SATA controller IDs (4B60, 4B62, 4B63) in [1] > for those mobile PCHs. > This commit just adds 0x4b63 as I do not have test systems with 0x4b60 > and 0x4b62 SATA controllers. Adding a mention about the other 2 IDs in a comment would be nice. > I have tested this patch with a system which uses 0x4b63 as SATA > controller. > > [1] https://sata-io.org/product/8803 > [2] https://www.thomas-krenn.com/en/wiki/SATA_Link_Power_Management#Example_LES_v4 > > Signed-off-by: Werner Fischer <devlists@wefi.net> > Cc: stable@vger.kernel.org > > --- a/drivers/ata/ahci.c 2023-07-27 11:45:21.141511943 +0200 > +++ b/drivers/ata/ahci.c 2023-07-27 11:44:57.054711402 +0200 > @@ -421,6 +421,7 @@ > { PCI_VDEVICE(INTEL, 0x34d3), board_ahci_low_power }, /* Ice Lake LP AHCI */ > { PCI_VDEVICE(INTEL, 0x02d3), board_ahci_low_power }, /* Comet Lake PCH-U AHCI */ > { PCI_VDEVICE(INTEL, 0x02d7), board_ahci_low_power }, /* Comet Lake PCH RAID */ > + { PCI_VDEVICE(INTEL, 0x4b63), board_ahci_low_power }, /* Elkhart Lake AHCI */ > > /* JMicron 360/1/3/5/6, match class to avoid IDE function */ > { PCI_VENDOR_ID_JMICRON, PCI_ANY_ID, PCI_ANY_ID, PCI_ANY_ID, >
On Tue, 2023-08-29 at 15:09 +0900, Damien Le Moal wrote: > > Intel lists a total of 3 SATA controller IDs (4B60, 4B62, 4B63) in > > [1] for those mobile PCHs. > > This commit just adds 0x4b63 as I do not have test systems with > > 0x4b60 and 0x4b62 SATA controllers. > > Adding a mention about the other 2 IDs in a comment would be nice. Thank you for your time reviewing the patch. I will send a v2 with an added comment about the other 2 IDs.
On Tue, Aug 29, 2023 at 11:35:54AM +0200, Werner Fischer wrote: > On Tue, 2023-08-29 at 15:09 +0900, Damien Le Moal wrote: > > > Intel lists a total of 3 SATA controller IDs (4B60, 4B62, 4B63) in > > > [1] for those mobile PCHs. > > > This commit just adds 0x4b63 as I do not have test systems with > > > 0x4b60 and 0x4b62 SATA controllers. > > > > Adding a mention about the other 2 IDs in a comment would be nice. > > Thank you for your time reviewing the patch. > I will send a v2 with an added comment about the other 2 IDs. Considering the big mess that was introduced the last time we tried to enable LPM in an Intel AHCI controller, see: 104ff59af73a ("ata: ahci: Add Tiger Lake UP{3,4} AHCI controller") and 6210038aeaf4 ("ata: ahci: Revert "ata: ahci: Add Tiger Lake UP{3,4} AHCI controller"") This patch makes me feel quite uncomfortable. Yes, an AHCI controller should be able to have LPM enabled, but in reality there are so many SATA devices with broken LPM support, that we would need to add hundreds of quirks for all the broken devices. The simple fact that LPM is commonly broken in the wild, makes me think that systems should need to opt-in explicitly to enable it. Yes, our default LPM policy is 0: 0 => Keep firmware settings But in reality most distros use: 3 => Medium power with Device Initiated PM enabled And in the end, we end up breaking users systems: https://bugzilla.kernel.org/show_bug.cgi?id=217114 And if their root partition is on the SATA drive... they won't be able to boot. Yes, one problem is that you need to use the board_ahci_low_power in order for the LPM polcies to apply. But it feels to me that accepting such a change is too intrusive (because of the distros default of LPM policy 3). If we somehow had a way where regular users are less affected... E.g. if there was a way to try to link up with LPM enabled in the AHCI controller by default, but if it fails X amount of times, then we disable LPM in the controller (or maybe try to set policy to ATA_LPM_MAX_POWER). Kind of like how we do for link speed, where we try a lower link speed on the final attempt to establish a link: https://github.com/torvalds/linux/blob/v6.5/drivers/ata/libata-eh.c#L3599 Another option might be to do the opposite.. Establish a link with LPM disabled (or maybe with policy set to ATA_LPM_MAX_POWER), if we can establish a link, make a record of it, and then enable LPM and reset the drive. If we can't establish a link with LPM enabled, and we have a record that the link was established without LPM, disable LPM again and reset the drive. Not sure which is most feasible to implement. Kind regards, Niklas
On Tue, 2023-08-29 at 13:04 +0000, Niklas Cassel wrote: > Considering the big mess that was introduced the last time we tried > to enable LPM in an Intel AHCI controller, see: > 104ff59af73a ("ata: ahci: Add Tiger Lake UP{3,4} AHCI controller") > and > 6210038aeaf4 ("ata: ahci: Revert "ata: ahci: Add Tiger Lake UP{3,4} > AHCI controller"") > > This patch makes me feel quite uncomfortable. For Tiger Lake I think the problems could be related to the chipset or BIOS. The "Intel 500 Series Chipset Family Platform Controller Hub (PCH)" Spec Update https://cdrdv2.intel.com/v1/dl/getContent/635220 mentions the following update on page 20: "BIOS should set this bit to 0 as APST is not supported." So maybe older BIOS version have this bit set to 1, which may cause issues? > Yes, an AHCI controller should be able to have LPM enabled, but in > reality there are so many SATA devices with broken LPM support, that > we would need to add hundreds of quirks for all the broken devices. In the error reports regarding Tiger Lake at least two different SSDs are mentioned: "Netac SSD 1TB" https://bbs.archlinux.org/viewtopic.php?pid=2087227#p2087227 "SATA 3 drive (Samsung)" https://bbs.archlinux.org/viewtopic.php?pid=2087334#p2087334 Currently, I'm not aware of Samsung drives having issues with LPM, but of course I could be missing something here. Unfortunately, the posting above does not state which Samsung drive it is. With my Elkhart Lake test system, I have tested the following Samsung drives successfully: 850 PRO, PM893 https://www.thomas-krenn.com/en/wiki/SATA_Link_Power_Management#Example_LES_v4 > Yes, one problem is that you need to use the board_ahci_low_power in > order for the LPM polcies to apply. But it feels to me that accepting > such a change is too intrusive (because of the distros default of LPM > policy 3). There are already many SATA controllers marked as "board_ahci_low_power", including Bay Trail, Apollo Lake, Comet Lake PCH-U (I have testsystem of these three platforms, LPM is working without issues here for me). So I'm not sure whether it is the SATA devices (drives) or chipsets having issues in the wild. Where I have seen issues on the other hand are cases, when a SATA controller is used for different platforms as mentioned here: https://www.spinics.net/lists/linux-ide/msg63375.html 1527f69204fe ("ata: ahci: Add Green Sardine vendor ID as board_ahci_mobile") has set 0x7901 as board_ahci_mobile. As this SATA controller is also used in EPYC servers, SATA Hot-Plug is not working e.g. in Ubuntu by default (because the distro defaults to LPM policy 3): https://www.thomas-krenn.com/en/wiki/AMD_EPYC_Server_with_Ubuntu_-_Enable_SATA_Hot-Swap Maybe including additional code which helps to distinguish whether it's a mobile system or server system with hot-swap could help here. > If we somehow had a way where regular users are less affected... > E.g. if there was a way to try to link up with LPM enabled in the > AHCI controller by default, but if it fails X amount of times, then > we disable LPM in the controller (or maybe try to set policy to > ATA_LPM_MAX_POWER). > Kind of like how we do for link speed, where we try a lower link > speed on the final attempt to establish a link: > https://github.com/torvalds/linux/blob/v6.5/drivers/ata/libata-eh.c#L3599 > > Another option might be to do the opposite.. Establish a link with > LPM disabled (or maybe with policy set to ATA_LPM_MAX_POWER), if we > can establish a link, make a record of it, and then enable LPM and > reset the drive. > If we can't establish a link with LPM enabled, and we have a record > that the link was established without LPM, disable LPM again and > reset the drive. > > Not sure which is most feasible to implement. Thank you very much for your ideas and your time. I will think about them (although I'm not capable of coding such changes). I may give further feedback in the upcoming days. Maybe other developers have input in here, too. Kind regards, Werner
--- a/drivers/ata/ahci.c 2023-07-27 11:45:21.141511943 +0200 +++ b/drivers/ata/ahci.c 2023-07-27 11:44:57.054711402 +0200 @@ -421,6 +421,7 @@ { PCI_VDEVICE(INTEL, 0x34d3), board_ahci_low_power }, /* Ice Lake LP AHCI */ { PCI_VDEVICE(INTEL, 0x02d3), board_ahci_low_power }, /* Comet Lake PCH-U AHCI */ { PCI_VDEVICE(INTEL, 0x02d7), board_ahci_low_power }, /* Comet Lake PCH RAID */ + { PCI_VDEVICE(INTEL, 0x4b63), board_ahci_low_power }, /* Elkhart Lake AHCI */ /* JMicron 360/1/3/5/6, match class to avoid IDE function */ { PCI_VENDOR_ID_JMICRON, PCI_ANY_ID, PCI_ANY_ID, PCI_ANY_ID,
Elkhart Lake is the successor of Apollo Lake and Gemini Lake. These CPUs and their PCHs are used in mobile and embedded environments. With this patch I suggest that Elkhart Lake SATA controllers [1] should use the default LPM policy for mobile chipsets. The disadvantage of missing hot-plug support with this setting should not be an issue, as those CPUs are used in embedded environments and not in servers with hot-plug backplanes. We discovered that the Elkhart Lake SATA controllers have been missing in ahci.c after a customer reported the throttling of his SATA SSD after a short period of higher I/O. We determined the high temperature of the SSD controller in idle mode as the root cause for that. Depending on the used SSD, we have seen up to 1.8 Watt lower system idle power usage and up to 30°C lower SSD controller temperatures in our tests, when we set med_power_with_dipm manually. I have provided a table showing seven different SATA SSDs from ATP, Intel/Solidigm and Samsung [2]. Intel lists a total of 3 SATA controller IDs (4B60, 4B62, 4B63) in [1] for those mobile PCHs. This commit just adds 0x4b63 as I do not have test systems with 0x4b60 and 0x4b62 SATA controllers. I have tested this patch with a system which uses 0x4b63 as SATA controller. [1] https://sata-io.org/product/8803 [2] https://www.thomas-krenn.com/en/wiki/SATA_Link_Power_Management#Example_LES_v4 Signed-off-by: Werner Fischer <devlists@wefi.net> Cc: stable@vger.kernel.org