Message ID: 20230119011653.311675-6-jacob.e.keller@intel.com
State: Changes Requested
Delegated to: Anthony Nguyen
Series: ice: various virtualization cleanups
> -----Original Message-----
> From: Intel-wired-lan <intel-wired-lan-bounces@osuosl.org> On Behalf Of
> Jacob Keller
> Sent: Thursday, January 19, 2023 2:17 AM
> To: Intel Wired LAN <intel-wired-lan@lists.osuosl.org>
> Cc: Nguyen, Anthony L <anthony.l.nguyen@intel.com>
> Subject: [Intel-wired-lan] [PATCH net-next v2 05/13] ice: Fix RDMA latency
> issue by allowing write-combining
>
> The current method of mapping the entire BAR region as a single
> uncacheable region does not allow RDMA to use write combining (WC). This
> results in increased latency with RDMA.
>
> To fix this, we initially planned to reduce the size of the map made by the PF
> driver to include only up to the beginning of the RDMA space.
> Unfortunately this will not work in the future as there are some hardware
> features which use registers beyond the RDMA area. This includes Scalable
> IOV, a virtualization feature being worked on currently.
>
> Instead of simply reducing the size of the map, we need a solution which will
> allow access to all areas of the address space while leaving the RDMA area
> open to be mapped with write combining.
>
> To allow for this, and fix the RDMA latency issue without blocking the higher
> areas of the BAR, we need to create multiple separate memory maps.
> Doing so will create a sparse mapping rather than a contiguous single area.
>
> Replace the void *hw_addr with a special ice_hw_addr structure which
> represents the multiple mappings as a flexible array.
>
> Based on the available BAR size, map up to 3 regions:
>
> * The space before the RDMA section
> * The RDMA section which wants write combining behavior
> * The space after the RDMA section
>
> Add an ice_get_hw_addr function which converts a register offset into the
> appropriate kernel address based on which chunk it falls into. This does cost
> us slightly more computation overhead for register access as we now must
> check the table each access. However, we can pre-compute the addresses
> where this would most be a problem.
>
> With this change, the RDMA driver is now free to map the RDMA register
> section as write-combined without impacting access to other device registers
> used by the main PF driver.
>
> Reported-by: Dave Ertman <david.m.ertman@intel.com>
> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
> ---
> Changes since v1:
> * Export ice_get_hw_addr
> * Use ice_get_hw_addr in iRDMA driver
> * Fix the WARN_ON to use %pa instead of %llx for printing a resource_size_t
>
>  drivers/infiniband/hw/irdma/main.c           |   2 +-
>  drivers/net/ethernet/intel/ice/ice.h         |   4 +-
>  drivers/net/ethernet/intel/ice/ice_base.c    |   5 +-
>  drivers/net/ethernet/intel/ice/ice_ethtool.c |   3 +-
>  drivers/net/ethernet/intel/ice/ice_main.c    | 177 +++++++++++++++++--
>  drivers/net/ethernet/intel/ice/ice_osdep.h   |  48 ++++-
>  drivers/net/ethernet/intel/ice/ice_txrx.h    |   2 +-
>  drivers/net/ethernet/intel/ice/ice_type.h    |   2 +-
>  8 files changed, 219 insertions(+), 24 deletions(-)

Tested-by: Jakub Andrysiak <jakub.andrysiak@intel.com>

_______________________________________________
Intel-wired-lan mailing list
Intel-wired-lan@osuosl.org
https://lists.osuosl.org/mailman/listinfo/intel-wired-lan
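The region split described in the commit message can be modeled in plain userspace C. This is a hedged sketch of the bounds arithmetic only: the constants mirror ICE_BAR_RDMA_WC_START/ICE_BAR_RDMA_WC_END from the patch, while `split_bar` and `struct map_bounds` are hypothetical stand-ins, not the driver's actual code.

```c
#include <assert.h>
#include <stdint.h>

/* RDMA write-combining window inside BAR 0, per the patch. */
#define ICE_BAR_RDMA_WC_START 0x0800000ULL
#define ICE_BAR_RDMA_WC_END   0x1000000ULL

/* Simplified stand-in for the patch's ice_hw_addr_map: offsets only. */
struct map_bounds {
	uint64_t start; /* inclusive register offset */
	uint64_t end;   /* exclusive register offset */
};

/*
 * Model of the bounds logic in the patch's ice_map_all_hw_addr(): always
 * map the space before the RDMA window, and, if the BAR extends past the
 * window, also map the space after it.  The WC window itself is left
 * unmapped.  Returns the number of maps written into 'maps'.
 */
unsigned int split_bar(uint64_t bar_len, struct map_bounds maps[2])
{
	unsigned int nr = 0;

	maps[nr].start = 0;
	maps[nr].end = bar_len < ICE_BAR_RDMA_WC_START ?
		       bar_len : ICE_BAR_RDMA_WC_START;
	nr++;

	if (bar_len > ICE_BAR_RDMA_WC_END) {
		maps[nr].start = ICE_BAR_RDMA_WC_END;
		maps[nr].end = bar_len;
		nr++;
	}

	return nr;
}
```

For a 32 MB BAR (0x2000000) this yields two uncacheable maps, [0, 0x0800000) and [0x1000000, 0x2000000), leaving the hole [0x0800000, 0x1000000) free for a write-combined mapping.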
From: Jacob Keller <jacob.e.keller@intel.com>
Date: Wed, 18 Jan 2023 17:16:45 -0800

> The current method of mapping the entire BAR region as a single uncacheable
> region does not allow RDMA to use write combining (WC). This results in
> increased latency with RDMA.
>
> To fix this, we initially planned to reduce the size of the map made by the
> PF driver to include only up to the beginning of the RDMA space.
> Unfortunately this will not work in the future as there are some hardware
> features which use registers beyond the RDMA area. This includes Scalable
> IOV, a virtualization feature being worked on currently.
>
> Instead of simply reducing the size of the map, we need a solution which
> will allow access to all areas of the address space while leaving the RDMA
> area open to be mapped with write combining.
>
> To allow for this, and fix the RDMA latency issue without blocking the
> higher areas of the BAR, we need to create multiple separate memory maps.
> Doing so will create a sparse mapping rather than a contiguous single area.
>
> Replace the void *hw_addr with a special ice_hw_addr structure which
> represents the multiple mappings as a flexible array.
>
> Based on the available BAR size, map up to 3 regions:
>
> * The space before the RDMA section
> * The RDMA section which wants write combining behavior
> * The space after the RDMA section

Please don't.

You have[0]:

* io_mapping_init_wc() (+ io_mapping_fini());
* io_mapping_create_wc() (+ io_mapping_free());

^ they do the same (the second just allocates a struct ad-hoc, but it
can be allocated manually or embedded into a driver structure),

* arch_phys_wc_add() (+ arch_phys_wc_del())[1];

^ optional to make MTRR happy

-- precisely for the case when you need to remap *a part* of BAR in a
different mode.

Splitting BARs, dropping pcim_iomap_regions() and so on, is very wrong.
Not speaking of that it's PCI driver which must own and map all the
memory the device advertises in its PCI config space, and in case of
ice, PCI driver is combined with Ethernet, so it's ice which must own
and map all the memory.
Not speaking of that using a structure with a flex array and creating a
static inline to calculate the pointer each time you need to read/write
a register, hurts performance and looks properly ugly.

The interfaces above must be used by the RDMA driver, right before
mapping its part in WC mode. PCI driver has no idea that someone else
wants to remap its memory differently, so the code doesn't belong here.
I'd drop the patch and let the RDMA team fix/improve their driver.

> Add an ice_get_hw_addr function which converts a register offset into the
> appropriate kernel address based on which chunk it falls into. This does
> cost us slightly more computation overhead for register access as we now
> must check the table each access. However, we can pre-compute the addresses
> where this would most be a problem.
>
> With this change, the RDMA driver is now free to map the RDMA register
> section as write-combined without impacting access to other device
> registers used by the main PF driver.
>
> Reported-by: Dave Ertman <david.m.ertman@intel.com>
> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
> ---
> Changes since v1:
> * Export ice_get_hw_addr
> * Use ice_get_hw_addr in iRDMA driver
> * Fix the WARN_ON to use %pa instead of %llx for printing a resource_size_t
>
>  drivers/infiniband/hw/irdma/main.c           |   2 +-
>  drivers/net/ethernet/intel/ice/ice.h         |   4 +-
>  drivers/net/ethernet/intel/ice/ice_base.c    |   5 +-
>  drivers/net/ethernet/intel/ice/ice_ethtool.c |   3 +-
>  drivers/net/ethernet/intel/ice/ice_main.c    | 177 +++++++++++++++++--
>  drivers/net/ethernet/intel/ice/ice_osdep.h   |  48 ++++-
>  drivers/net/ethernet/intel/ice/ice_txrx.h    |   2 +-
>  drivers/net/ethernet/intel/ice/ice_type.h    |   2 +-
>  8 files changed, 219 insertions(+), 24 deletions(-)

[0] https://elixir.bootlin.com/linux/v6.2-rc6/source/include/linux/io-mapping.h#L42
[1] https://elixir.bootlin.com/linux/v6.2-rc6/source/arch/x86/include/asm/io.h#L339

Thanks,
Olek
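The approach suggested in this review is that only the RDMA consumer remap its slice of BAR 0 as write-combined, via io_mapping_init_wc() and optionally arch_phys_wc_add(). Those interfaces exist only in kernel context, so what follows is just a hedged userspace sketch of the address arithmetic such a driver would perform first; `rdma_wc_window` and the constants are illustrative names, not real irdma code.

```c
#include <assert.h>
#include <stdint.h>

/* Offsets of the RDMA doorbell window inside BAR 0 (from the ice patch). */
#define RDMA_WC_START 0x0800000ULL
#define RDMA_WC_END   0x1000000ULL

struct wc_window {
	uint64_t phys_base; /* physical base to hand to io_mapping_init_wc() */
	uint64_t size;      /* length of the WC mapping */
};

/*
 * Compute the physical base and size of just the RDMA slice of the BAR.
 * In the real (kernel) driver, bar0_phys would be pci_resource_start(pdev, 0),
 * and the result would feed io_mapping_init_wc() plus, optionally,
 * arch_phys_wc_add() to keep MTRR happy on x86.
 */
struct wc_window rdma_wc_window(uint64_t bar0_phys)
{
	struct wc_window w = {
		.phys_base = bar0_phys + RDMA_WC_START,
		.size = RDMA_WC_END - RDMA_WC_START,
	};
	return w;
}
```

The point of the design is that the owner (ice) keeps its one full-BAR uncacheable mapping, while the RDMA driver independently creates a second, WC-attributed mapping of only its window.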
> -----Original Message-----
> From: Lobakin, Alexandr <alexandr.lobakin@intel.com>
> Sent: Monday, January 30, 2023 2:03 AM
> To: Keller, Jacob E <jacob.e.keller@intel.com>
> Cc: Intel Wired LAN <intel-wired-lan@lists.osuosl.org>; Nguyen, Anthony L
> <anthony.l.nguyen@intel.com>; Linga, Pavan Kumar
> <pavan.kumar.linga@intel.com>; netdev@vger.kernel.org
> Subject: Re: [Intel-wired-lan] [PATCH net-next v2 05/13] ice: Fix RDMA latency
> issue by allowing write-combining
>
> From: Jacob Keller <jacob.e.keller@intel.com>
> Date: Wed, 18 Jan 2023 17:16:45 -0800
>
> > The current method of mapping the entire BAR region as a single uncacheable
> > region does not allow RDMA to use write combining (WC). This results in
> > increased latency with RDMA.
> >
> > [...]
> >
> > Based on the available BAR size, map up to 3 regions:
> >
> > * The space before the RDMA section
> > * The RDMA section which wants write combining behavior
> > * The space after the RDMA section
>
> Please don't.
>
> You have[0]:
>
> * io_mapping_init_wc() (+ io_mapping_fini());
> * io_mapping_create_wc() (+ io_mapping_free());
>
> ^ they do the same (the second just allocates a struct ad-hoc, but it
> can be allocated manually or embedded into a driver structure),
>
> * arch_phys_wc_add() (+ arch_phys_wc_del())[1];
>
> ^ optional to make MTRR happy
>
> -- precisely for the case when you need to remap *a part* of BAR in a
> different mode.
>
> Splitting BARs, dropping pcim_iomap_regions() and so on, is very wrong.
> Not speaking of that it's PCI driver which must own and map all the
> memory the device advertises in its PCI config space, and in case of
> ice, PCI driver is combined with Ethernet, so it's ice which must own
> and map all the memory.
> Not speaking of that using a structure with a flex array and creating a
> static inline to calculate the pointer each time you need to read/write
> a register, hurts performance and looks properly ugly.
>
> The interfaces above must be used by the RDMA driver, right before
> mapping its part in WC mode. PCI driver has no idea that someone else
> wants to remap its memory differently, so the code doesn't belong here.
> I'd drop the patch and let the RDMA team fix/improve their driver.
>

Appreciate the review! I proposed this option after the original change was to
simply reduce the initial size of our bar mapping, resulting in losing access
to the registers beyond the RDMA section, which was a non-starter for us once
we finish implementing Scalable IOV support.

Searching for io_mapping_init_wc and io_mapping_create_wc there are only a
handful of users and not much documentation so no wonder I had trouble
locating it! Thanks for helping me learn about it.

@Dave.Ertman@intel.com, @Saleem, Shiraz it looks like we need to drop this
patch and modify the iRDMA driver's method of requesting write combined
regions to use these new interfaces.

@Nguyen, Anthony L Can you drop this patch from the series on IWL or should I
send a v3?

Thanks,
Jake
On 1/30/2023 3:34 PM, Keller, Jacob E wrote:
>
> @Nguyen, Anthony L Can you drop this patch from the series on IWL or should I send a v3?

It looks like it cleanly drops out of the series; I can take care of it, no
need for v3.

Thanks,
Tony
On 1/30/2023 3:34 PM, Keller, Jacob E wrote:
>
>> -----Original Message-----
>> From: Lobakin, Alexandr <alexandr.lobakin@intel.com>
>> Sent: Monday, January 30, 2023 2:03 AM
>> To: Keller, Jacob E <jacob.e.keller@intel.com>
>> Cc: Intel Wired LAN <intel-wired-lan@lists.osuosl.org>; Nguyen, Anthony L
>> <anthony.l.nguyen@intel.com>; Linga, Pavan Kumar
>> <pavan.kumar.linga@intel.com>; netdev@vger.kernel.org
>> Subject: Re: [Intel-wired-lan] [PATCH net-next v2 05/13] ice: Fix RDMA latency
>> issue by allowing write-combining
>>
>> [...]
>>
>> The interfaces above must be used by the RDMA driver, right before
>> mapping its part in WC mode. PCI driver has no idea that someone else
>> wants to remap its memory differently, so the code doesn't belong here.
>> I'd drop the patch and let the RDMA team fix/improve their driver.
>>
>
> Appreciate the review! I proposed this option after the original change was
> to simply reduce the initial size of our bar mapping, resulting in losing
> access to the registers beyond the RDMA section, which was a non-starter
> for us once we finish implementing Scalable IOV support.
>
> Searching for io_mapping_init_wc and io_mapping_create_wc there are only a
> handful of users and not much documentation so no wonder I had trouble
> locating it! Thanks for helping me learn about it.
>
> @Dave.Ertman@intel.com, @Saleem, Shiraz it looks like we need to drop this
> patch and modify the iRDMA driver's method of requesting write combined
> regions to use these new interfaces.
>
> @Nguyen, Anthony L Can you drop this patch from the series on IWL or should
> I send a v3?
>
> Thanks,
> Jake

Mustafa,

It looks like Shiraz is on vacation and you're his coverage. Can you help
investigate this and the proposed io_mapping solution from Olek?

Dave, I tried to add you on the original thread above but it got lost due to
an incorrect email.

Thanks,
Jake
diff --git a/drivers/infiniband/hw/irdma/main.c b/drivers/infiniband/hw/irdma/main.c
index 514453777e07..37a2650abbbb 100644
--- a/drivers/infiniband/hw/irdma/main.c
+++ b/drivers/infiniband/hw/irdma/main.c
@@ -228,7 +228,7 @@ static void irdma_fill_device_info(struct irdma_device *iwdev, struct ice_pf *pf
 	rf->cdev = pf;
 	rf->gen_ops.register_qset = irdma_lan_register_qset;
 	rf->gen_ops.unregister_qset = irdma_lan_unregister_qset;
-	rf->hw.hw_addr = pf->hw.hw_addr;
+	rf->hw.hw_addr = ice_get_hw_addr(&pf->hw, 0);
 	rf->pcidev = pf->pdev;
 	rf->msix_count = pf->num_rdma_msix;
 	rf->pf_id = pf->hw.pf_id;
diff --git a/drivers/net/ethernet/intel/ice/ice.h b/drivers/net/ethernet/intel/ice/ice.h
index 51a1a89f7b5a..cd81974822cc 100644
--- a/drivers/net/ethernet/intel/ice/ice.h
+++ b/drivers/net/ethernet/intel/ice/ice.h
@@ -75,7 +75,9 @@
 #include "ice_vsi_vlan_ops.h"
 #include "ice_gnss.h"
 
-#define ICE_BAR0		0
+#define ICE_BAR0		0
+#define ICE_BAR_RDMA_WC_START	0x0800000
+#define ICE_BAR_RDMA_WC_END	0x1000000
 #define ICE_REQ_DESC_MULTIPLE	32
 #define ICE_MIN_NUM_DESC	64
 #define ICE_MAX_NUM_DESC	8160
diff --git a/drivers/net/ethernet/intel/ice/ice_base.c b/drivers/net/ethernet/intel/ice/ice_base.c
index 554095b25f44..332d5a1b326c 100644
--- a/drivers/net/ethernet/intel/ice/ice_base.c
+++ b/drivers/net/ethernet/intel/ice/ice_base.c
@@ -480,7 +480,7 @@ static int ice_setup_rx_ctx(struct ice_rx_ring *ring)
 	ring->rx_offset = ice_rx_offset(ring);
 
 	/* init queue specific tail register */
-	ring->tail = hw->hw_addr + QRX_TAIL(pf_q);
+	ring->tail = ice_get_hw_addr(hw, QRX_TAIL(pf_q));
 	writel(0, ring->tail);
 
 	return 0;
@@ -790,8 +790,7 @@ ice_vsi_cfg_txq(struct ice_vsi *vsi, struct ice_tx_ring *ring,
 	/* init queue specific tail reg. It is referred as
 	 * transmit comm scheduler queue doorbell.
 	 */
-	ring->tail = hw->hw_addr + QTX_COMM_DBELL(pf_q);
-
+	ring->tail = ice_get_hw_addr(hw, QTX_COMM_DBELL(pf_q));
 	if (IS_ENABLED(CONFIG_DCB))
 		tc = ring->dcb_tc;
 	else
diff --git a/drivers/net/ethernet/intel/ice/ice_ethtool.c b/drivers/net/ethernet/intel/ice/ice_ethtool.c
index 936f0e0c553d..b54f470be8d7 100644
--- a/drivers/net/ethernet/intel/ice/ice_ethtool.c
+++ b/drivers/net/ethernet/intel/ice/ice_ethtool.c
@@ -3085,7 +3085,8 @@ ice_set_ringparam(struct net_device *netdev, struct ethtool_ringparam *ring,
 		/* this is to allow wr32 to have something to write to
 		 * during early allocation of Rx buffers
 		 */
-		rx_rings[i].tail = vsi->back->hw.hw_addr + PRTGEN_STATUS;
+		rx_rings[i].tail = ice_get_hw_addr(&vsi->back->hw,
+						   PRTGEN_STATUS);
 
 		err = ice_setup_rx_ring(&rx_rings[i]);
 		if (err)
diff --git a/drivers/net/ethernet/intel/ice/ice_main.c b/drivers/net/ethernet/intel/ice/ice_main.c
index 4165fde0106d..3b98721fd9d8 100644
--- a/drivers/net/ethernet/intel/ice/ice_main.c
+++ b/drivers/net/ethernet/intel/ice/ice_main.c
@@ -596,6 +596,163 @@ ice_prepare_for_reset(struct ice_pf *pf, enum ice_reset_req reset_type)
 	set_bit(ICE_PREPARED_FOR_RESET, pf->state);
 }
 
+/**
+ * ice_get_hw_addr - Get memory address for a given device register
+ * @hw: pointer to the HW struct
+ * @reg: the register to get address of
+ *
+ * Convert a register offset into the appropriate memory mapped kernel
+ * address.
+ *
+ * Returns the pointer address or an ERR_PTR on failure.
+ */
+void __iomem *ice_get_hw_addr(struct ice_hw *hw, resource_size_t reg)
+{
+	struct ice_hw_addr *hw_addr = (struct ice_hw_addr *)hw->hw_addr;
+	struct ice_hw_addr_map *map;
+	unsigned int i;
+
+	if (WARN_ON(!hw_addr))
+		return (void __iomem *)ERR_PTR(-EIO);
+
+	for (i = 0, map = hw_addr->maps; i < hw_addr->nr; i++, map++)
+		if (reg >= map->start && reg < map->end)
+			return (u8 __iomem *)map->addr + (reg - map->start);
+
+	WARN_ONCE(1, "Unable to map register address %pa to kernel address",
+		  &reg);
+
+	return (void __iomem *)ERR_PTR(-EFAULT);
+}
+EXPORT_SYMBOL_GPL(ice_get_hw_addr);
+
+/**
+ * ice_map_hw_addr - map a region of device registers to memory
+ * @pdev: the PCI device
+ * @map: the address map structure
+ *
+ * Map the specified section of the hardware registers into memory, storing
+ * the memory mapped address in the provided structure.
+ *
+ * Returns 0 on success or an error code on failure.
+ */
+static int ice_map_hw_addr(struct pci_dev *pdev, struct ice_hw_addr_map *map)
+{
+	struct device *dev = &pdev->dev;
+	resource_size_t size, base;
+	void __iomem *addr;
+
+	if (WARN_ON(map->end <= map->start))
+		return -EIO;
+
+	size = map->end - map->start;
+	base = pci_resource_start(pdev, map->bar) + map->start;
+	addr = ioremap(base, size);
+	if (!addr) {
+		dev_err(dev, "%s: remap at offset %llu failed\n",
+			__func__, map->start);
+		return -EIO;
+	}
+
+	map->addr = addr;
+
+	return 0;
+}
+
+/**
+ * ice_map_all_hw_addr - Request and map PCI BAR memory
+ * @pf: pointer to the PF structure
+ *
+ * Request and reserve all PCI BAR regions. Memory map chunks of the PCI BAR
+ * 0 into a sparse memory map to allow the RDMA region to be mapped with write
+ * combining.
+ *
+ * Returns 0 on success or an error code on failure.
+ */
+static int ice_map_all_hw_addr(struct ice_pf *pf)
+{
+	struct pci_dev *pdev = pf->pdev;
+	struct device *dev = &pdev->dev;
+	struct ice_hw_addr *hw_addr;
+	resource_size_t bar_len;
+	unsigned int nr_maps;
+	int err;
+
+	bar_len = pci_resource_len(pdev, 0);
+	if (bar_len > ICE_BAR_RDMA_WC_END)
+		nr_maps = 2;
+	else
+		nr_maps = 1;
+
+	hw_addr = kzalloc(struct_size(hw_addr, maps, nr_maps), GFP_KERNEL);
+	if (!hw_addr)
+		return -ENOMEM;
+
+	hw_addr->nr = nr_maps;
+
+	err = pci_request_mem_regions(pdev, dev_driver_string(dev));
+	if (err) {
+		dev_err(dev, "pci_request_mem_regions failed, err %pe\n",
+			ERR_PTR(err));
+		goto err_free_hw_addr;
+	}
+
+	/* Map the start of the BAR as uncachable */
+	hw_addr->maps[0].bar = 0;
+	hw_addr->maps[0].start = 0;
+	hw_addr->maps[0].end = min_t(resource_size_t, bar_len,
+				     ICE_BAR_RDMA_WC_START);
+	err = ice_map_hw_addr(pdev, &hw_addr->maps[0]);
+	if (err)
+		goto err_release_mem_regions;
+
+	/* Map everything past the RDMA section as uncachable */
+	if (nr_maps > 1) {
+		hw_addr->maps[1].bar = 0;
+		hw_addr->maps[1].start = ICE_BAR_RDMA_WC_END;
+		hw_addr->maps[1].end = bar_len;
+		err = ice_map_hw_addr(pdev, &hw_addr->maps[1]);
+		if (err)
+			goto err_unmap_bar_start;
+	}
+
+	pf->hw.hw_addr = (typeof(pf->hw.hw_addr))hw_addr;
+
+	return 0;
+
+err_unmap_bar_start:
+	iounmap(hw_addr->maps[0].addr);
+err_release_mem_regions:
+	pci_release_mem_regions(pdev);
+err_free_hw_addr:
+	kfree(hw_addr);
+
+	return err;
+}
+
+/**
+ * ice_unmap_all_hw_addr - Release device register memory maps
+ * @pf: pointer to the PF structure
+ *
+ * Release all PCI memory maps and regions.
+ */
+static void ice_unmap_all_hw_addr(struct ice_pf *pf)
+{
+	struct ice_hw_addr *hw_addr = (struct ice_hw_addr *)pf->hw.hw_addr;
+	struct pci_dev *pdev = pf->pdev;
+	unsigned int i;
+
+	if (WARN_ON(!hw_addr))
+		return;
+
+	pf->hw.hw_addr = NULL;
+	for (i = 0; i < hw_addr->nr; i++)
+		iounmap(hw_addr->maps[i].addr);
+	kfree(hw_addr);
+
+	pci_release_mem_regions(pdev);
+}
+
 /**
  * ice_do_reset - Initiate one of many types of resets
  * @pf: board private structure
@@ -5101,19 +5258,10 @@ ice_probe(struct pci_dev *pdev, const struct pci_device_id __always_unused *ent)
 		return -EINVAL;
 	}
 
-	/* this driver uses devres, see
-	 * Documentation/driver-api/driver-model/devres.rst
-	 */
-	err = pcim_enable_device(pdev);
+	err = pci_enable_device(pdev);
 	if (err)
 		return err;
 
-	err = pcim_iomap_regions(pdev, BIT(ICE_BAR0), dev_driver_string(dev));
-	if (err) {
-		dev_err(dev, "BAR0 I/O map error %d\n", err);
-		return err;
-	}
-
 	pf = ice_allocate_pf(dev);
 	if (!pf)
 		return -ENOMEM;
@@ -5138,7 +5286,11 @@ ice_probe(struct pci_dev *pdev, const struct pci_device_id __always_unused *ent)
 	set_bit(ICE_SERVICE_DIS, pf->state);
 
 	hw = &pf->hw;
-	hw->hw_addr = pcim_iomap_table(pdev)[ICE_BAR0];
+
+	err = ice_map_all_hw_addr(pf);
+	if (err)
+		goto err_init_iomap_fail;
+
 	pci_save_state(pdev);
 
 	hw->back = pf;
@@ -5186,6 +5338,8 @@ ice_probe(struct pci_dev *pdev, const struct pci_device_id __always_unused *ent)
 err_init_eth:
 	ice_deinit(pf);
 err_init:
+	ice_unmap_all_hw_addr(pf);
+err_init_iomap_fail:
 	pci_disable_pcie_error_reporting(pdev);
 	pci_disable_device(pdev);
 	return err;
@@ -5295,6 +5449,7 @@ static void ice_remove(struct pci_dev *pdev)
 	 */
 	ice_reset(&pf->hw, ICE_RESET_PFR);
 	pci_wait_for_pending_transaction(pdev);
+	ice_unmap_all_hw_addr(pf);
 	pci_disable_pcie_error_reporting(pdev);
 	pci_disable_device(pdev);
 }
diff --git a/drivers/net/ethernet/intel/ice/ice_osdep.h b/drivers/net/ethernet/intel/ice/ice_osdep.h
index 82bc54fec7f3..4b16ff489c3a 100644
--- a/drivers/net/ethernet/intel/ice/ice_osdep.h
+++ b/drivers/net/ethernet/intel/ice/ice_osdep.h
@@ -18,10 +18,49 @@
 #endif
 #include <net/udp_tunnel.h>
 
-#define wr32(a, reg, value)	writel((value), ((a)->hw_addr + (reg)))
-#define rd32(a, reg)		readl((a)->hw_addr + (reg))
-#define wr64(a, reg, value)	writeq((value), ((a)->hw_addr + (reg)))
-#define rd64(a, reg)		readq((a)->hw_addr + (reg))
+struct ice_hw;
+
+/**
+ * struct ice_hw_addr_map - a single hardware address memory map
+ * @addr: iomem address of the start of this map
+ * @start: register offset at the start of this map, inclusive bound
+ * @end: register offset at the end of this map, exclusive bound
+ * @bar: the BAR this map is for
+ *
+ * Structure representing one map of a device BAR register space. Stored as
+ * part of the ice_hw_addr structure in an array ordered by the start offset.
+ *
+ * The addr value is an iomem address returned by ioremap. The start indicates
+ * the first register offset this map is valid for. The end indicates the end
+ * of the map, and is an exclusive bound.
+ */
+struct ice_hw_addr_map {
+	void __iomem *addr;
+	resource_size_t start;
+	resource_size_t end;
+	int bar;
+};
+
+/**
+ * struct ice_hw_addr - a list of hardware address memory maps
+ * @nr: the number of maps made
+ * @maps: flexible array of maps made during device initialization
+ *
+ * Structure representing a series of sparse maps of the device BAR 0 address
+ * space to kernel addresses. Users must convert a register offset to an iomem
+ * address using ice_get_hw_addr.
+ */
+struct ice_hw_addr {
+	unsigned int nr;
+	struct ice_hw_addr_map maps[];
+};
+
+void __iomem *ice_get_hw_addr(struct ice_hw *hw, resource_size_t reg);
+
+#define wr32(a, reg, value)	writel((value), ice_get_hw_addr((a), (reg)))
+#define rd32(a, reg)		readl(ice_get_hw_addr((a), (reg)))
+#define wr64(a, reg, value)	writeq((value), ice_get_hw_addr((a), (reg)))
+#define rd64(a, reg)		readq(ice_get_hw_addr((a), (reg)))
 
 #define ice_flush(a)		rd32((a), GLGEN_STAT)
 #define ICE_M(m, s)		((m) << (s))
@@ -32,7 +71,6 @@ struct ice_dma_mem {
 	size_t size;
 };
 
-struct ice_hw;
 struct device *ice_hw_to_dev(struct ice_hw *hw);
 
 #ifdef CONFIG_DYNAMIC_DEBUG
diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.h b/drivers/net/ethernet/intel/ice/ice_txrx.h
index 4fd0e5d0a313..3d2834673903 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx.h
+++ b/drivers/net/ethernet/intel/ice/ice_txrx.h
@@ -272,7 +272,7 @@ struct ice_rx_ring {
 	struct net_device *netdev;	/* netdev ring maps to */
 	struct ice_vsi *vsi;		/* Backreference to associated VSI */
 	struct ice_q_vector *q_vector;	/* Backreference to associated vector */
-	u8 __iomem *tail;
+	void __iomem *tail;
 	union {
 		struct ice_rx_buf *rx_buf;
 		struct xdp_buff **xdp_buf;
diff --git a/drivers/net/ethernet/intel/ice/ice_type.h b/drivers/net/ethernet/intel/ice/ice_type.h
index e3f622cad425..f34975efeed7 100644
--- a/drivers/net/ethernet/intel/ice/ice_type.h
+++ b/drivers/net/ethernet/intel/ice/ice_type.h
@@ -821,7 +821,7 @@ struct ice_mbx_data {
 
 /* Port hardware description */
 struct ice_hw {
-	u8 __iomem *hw_addr;
+	void *hw_addr;
 	void *back;
 	struct ice_aqc_layer_props *layer_info;
 	struct ice_port_info *port_info;
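The ice_get_hw_addr() lookup added by the patch can be modeled outside the kernel to illustrate the chunk-table translation. A minimal sketch, assuming two uncacheable chunks around an unmapped WC hole; `get_hw_addr` returns NULL where the kernel version returns an ERR_PTR, and the struct names are simplified stand-ins for the patch's ice_hw_addr types.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace model of the patch's sparse register map table. */
struct hw_addr_map {
	uint8_t *addr;  /* mapped virtual base of this chunk */
	uint64_t start; /* inclusive register offset */
	uint64_t end;   /* exclusive register offset */
};

struct hw_addr {
	unsigned int nr;
	struct hw_addr_map maps[2];
};

/*
 * Mirror of ice_get_hw_addr(): walk the chunk table and translate a
 * register offset into an address within the chunk that covers it.
 * Offsets that fall in a gap (e.g. the RDMA WC hole) are unmapped.
 */
uint8_t *get_hw_addr(const struct hw_addr *hw, uint64_t reg)
{
	unsigned int i;

	for (i = 0; i < hw->nr; i++)
		if (reg >= hw->maps[i].start && reg < hw->maps[i].end)
			return hw->maps[i].addr + (reg - hw->maps[i].start);

	return NULL;
}
```

This is the per-access cost the review objects to: every rd32/wr32 walks the table instead of doing one pointer addition, which is why hot paths in the patch pre-compute addresses such as ring->tail once.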
The current method of mapping the entire BAR region as a single uncacheable
region does not allow RDMA to use write combining (WC). This results in
increased latency with RDMA.

To fix this, we initially planned to reduce the size of the map made by the
PF driver to include only up to the beginning of the RDMA space.
Unfortunately this will not work in the future as there are some hardware
features which use registers beyond the RDMA area. This includes Scalable
IOV, a virtualization feature being worked on currently.

Instead of simply reducing the size of the map, we need a solution which
will allow access to all areas of the address space while leaving the RDMA
area open to be mapped with write combining.

To allow for this, and fix the RDMA latency issue without blocking the
higher areas of the BAR, we need to create multiple separate memory maps.
Doing so will create a sparse mapping rather than a contiguous single area.

Replace the void *hw_addr with a special ice_hw_addr structure which
represents the multiple mappings as a flexible array.

Based on the available BAR size, map up to 3 regions:

* The space before the RDMA section
* The RDMA section which wants write combining behavior
* The space after the RDMA section

Add an ice_get_hw_addr function which converts a register offset into the
appropriate kernel address based on which chunk it falls into. This does
cost us slightly more computation overhead for register access as we now
must check the table each access. However, we can pre-compute the addresses
where this would most be a problem.

With this change, the RDMA driver is now free to map the RDMA register
section as write-combined without impacting access to other device
registers used by the main PF driver.

Reported-by: Dave Ertman <david.m.ertman@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
---
Changes since v1:
* Export ice_get_hw_addr
* Use ice_get_hw_addr in iRDMA driver
* Fix the WARN_ON to use %pa instead of %llx for printing a resource_size_t

 drivers/infiniband/hw/irdma/main.c           |   2 +-
 drivers/net/ethernet/intel/ice/ice.h         |   4 +-
 drivers/net/ethernet/intel/ice/ice_base.c    |   5 +-
 drivers/net/ethernet/intel/ice/ice_ethtool.c |   3 +-
 drivers/net/ethernet/intel/ice/ice_main.c    | 177 +++++++++++++++++--
 drivers/net/ethernet/intel/ice/ice_osdep.h   |  48 ++++-
 drivers/net/ethernet/intel/ice/ice_txrx.h    |   2 +-
 drivers/net/ethernet/intel/ice/ice_type.h    |   2 +-
 8 files changed, 219 insertions(+), 24 deletions(-)