Message ID | 20210721104205.885115-2-damien.lemoal@wdc.com |
---|---|
State | New |
Headers | show |
Series | Initial support for multi-actuator HDDs | expand |
On 7/21/21 12:42 PM, Damien Le Moal wrote: > The Concurrent Positioning Ranges VPD page (for SCSI) and Log (for ATA) > contain parameters describing the number of sets of contiguous LBAs that > can be served independently by a single LUN multi-actuator disk. This > patch provides the blk_queue_set_cranges() function allowing a device > driver to signal to the block layer that a disk has multiple actuators, > each one serving a contiguous range of sectors. To describe the set > of sector ranges representing the different actuators of a device, the > data type struct blk_cranges is introduced. > > For a device with multiple actuators, a struct blk_cranges is attached > to the device request queue by the blk_queue_set_cranges() function. The > function blk_alloc_cranges() is provided for drivers to allocate this > structure. > > The blk_cranges structure contains kobjects (struct kobject) to register > with sysfs the set of sector ranges defined by a device. On initial > device scan, this registration is done from blk_register_queue() using > the block layer internal function blk_register_cranges(). If a driver > calls blk_queue_set_cranges() for a registered queue, e.g. when a device > is revalidated, blk_queue_set_cranges() will execute > blk_register_cranges() to update the queue sysfs attribute files. > > The sysfs file structure created starts from the cranges sub-directory > and contains the start sector and number of sectors served by an > actuator, with the information for each actuator grouped in one > directory per actuator. E.g. for a dual actuator drive, we have: > > $ tree /sys/block/sdk/queue/cranges/ > /sys/block/sdk/queue/cranges/ > |-- 0 > | |-- nr_sectors > | `-- sector > `-- 1 > |-- nr_sectors > `-- sector > > For a regular single actuator device, the cranges directory does not > exist. > Hmm. I do wonder if we need such a detailed layout. In my patch I have just: + unsigned int num_lba_ranges; + sector_t *lba_ranges; to hold the information; after all, I don't expect this data to be changing very often during the lifetime of the OS. > Device revalidation may lead to changes to this structure and to the > attribute values. When manipulated, the queue sysfs_lock and > sysfs_dir_lock are held for atomicity, similarly to how the blk-mq and > elevator sysfs queue sub-directories are protected. > Indeed, the values might change with revalidation. But not independently; a change in _one_ range will typically affect the adjacent ranges, too. So it should be sufficient to drop them completely and redo from scratch for revalidation. > The code related to the management of cranges is added in the new > file block/blk-cranges.c. > > Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> > --- > block/Makefile | 2 +- > block/blk-cranges.c | 286 ++++++++++++++++++++++++++++++++++++++ > block/blk-sysfs.c | 13 ++ > block/blk.h | 3 + > drivers/ata/libata-scsi.c | 1 + > include/linux/blkdev.h | 29 ++++ > 6 files changed, 333 insertions(+), 1 deletion(-) > create mode 100644 block/blk-cranges.c > > diff --git a/block/Makefile b/block/Makefile > index bfbe4e13ca1e..e477e6ca9ea6 100644 > --- a/block/Makefile > +++ b/block/Makefile > @@ -9,7 +9,7 @@ obj-$(CONFIG_BLOCK) := bio.o elevator.o blk-core.o blk-sysfs.o \ > blk-lib.o blk-mq.o blk-mq-tag.o blk-stat.o \ > blk-mq-sysfs.o blk-mq-cpumap.o blk-mq-sched.o ioctl.o \ > genhd.o ioprio.o badblocks.o partitions/ blk-rq-qos.o \ > - disk-events.o > + disk-events.o blk-cranges.o > > obj-$(CONFIG_BOUNCE) += bounce.o > obj-$(CONFIG_BLK_SCSI_REQUEST) += scsi_ioctl.o > diff --git a/block/blk-cranges.c b/block/blk-cranges.c > new file mode 100644 > index 000000000000..e8ff0d40a5c9 > --- /dev/null > +++ b/block/blk-cranges.c > @@ -0,0 +1,286 @@ > +// SPDX-License-Identifier: GPL-2.0 > +/* > + * Block device concurrent positioning ranges. > + * > + * Copyright (C) 2021 Western Digital Corporation or its Affiliates. > + */ > +#include <linux/kernel.h> > +#include <linux/blkdev.h> > +#include <linux/slab.h> > +#include <linux/init.h> > + > +#include "blk.h" > + > +static ssize_t blk_crange_sector_show(struct blk_crange *cr, char *page) > +{ > + return sprintf(page, "%llu\n", cr->sector); > +} > + > +static ssize_t blk_crange_nr_sectors_show(struct blk_crange *cr, char *page) > +{ > + return sprintf(page, "%llu\n", cr->nr_sectors); > +} > + > +struct blk_crange_sysfs_entry { > + struct attribute attr; > + ssize_t (*show)(struct blk_crange *cr, char *page); > +}; > + > +static struct blk_crange_sysfs_entry blk_crange_sector_entry = { > + .attr = { .name = "sector", .mode = 0444 }, > + .show = blk_crange_sector_show, > +}; > + > +static struct blk_crange_sysfs_entry blk_crange_nr_sectors_entry = { > + .attr = { .name = "nr_sectors", .mode = 0444 }, > + .show = blk_crange_nr_sectors_show, > +}; > + > +static struct attribute *blk_crange_attrs[] = { > + &blk_crange_sector_entry.attr, > + &blk_crange_nr_sectors_entry.attr, > + NULL, > +}; > +ATTRIBUTE_GROUPS(blk_crange); > + > +static ssize_t blk_crange_sysfs_show(struct kobject *kobj, > + struct attribute *attr, char *page) > +{ > + struct blk_crange_sysfs_entry *entry; > + struct blk_crange *cr; > + struct request_queue *q; > + ssize_t ret; > + > + entry = container_of(attr, struct blk_crange_sysfs_entry, attr); > + cr = container_of(kobj, struct blk_crange, kobj); > + q = cr->queue; > + > + mutex_lock(&q->sysfs_lock); > + ret = entry->show(cr, page); > + mutex_unlock(&q->sysfs_lock); > + > + return ret; > +} > + > +static const struct sysfs_ops blk_crange_sysfs_ops = { > + .show = blk_crange_sysfs_show, > +}; > + > +/* > + * Dummy release function to make kobj happy. > + */ > +static void blk_cranges_sysfs_nop_release(struct kobject *kobj) > +{ > +} > + > +static struct kobj_type blk_crange_ktype = { > + .sysfs_ops = &blk_crange_sysfs_ops, > + .default_groups = blk_crange_groups, > + .release = blk_cranges_sysfs_nop_release, > +}; > + > +static struct kobj_type blk_cranges_ktype = { > + .release = blk_cranges_sysfs_nop_release, > +}; > + > +/** > + * blk_register_cranges - register with sysfs a set of concurrent ranges > + * @disk: Target disk > + * @new_cranges: New set of concurrent ranges > + * > + * Register with sysfs a set of concurrent ranges for @disk. If @new_cranges > + * is not NULL, this set of concurrent ranges is registered and the > + * old set specified by q->cranges is unregistered. Otherwise, q->cranges > + * is registered if it is not already. > + */ > +int blk_register_cranges(struct gendisk *disk, struct blk_cranges *new_cranges) > +{ > + struct request_queue *q = disk->queue; > + struct blk_cranges *cranges; > + int i, ret; > + > + lockdep_assert_held(&q->sysfs_dir_lock); > + lockdep_assert_held(&q->sysfs_lock); > + > + /* If a new range set is specified, unregister the old one */ > + if (new_cranges) { > + if (q->cranges) > + blk_unregister_cranges(disk); > + q->cranges = new_cranges; > + } > + > + cranges = q->cranges; > + if (!cranges) > + return 0; > + > + /* > + * At this point, q->cranges is the new set of sector ranges that needs > + * to be registered with sysfs. > + */ > + WARN_ON(cranges->sysfs_registered); > + ret = kobject_init_and_add(&cranges->kobj, &blk_cranges_ktype, > + &q->kobj, "%s", "cranges"); > + if (ret) > + goto free; > + > + for (i = 0; i < cranges->nr_ranges; i++) { > + cranges->ranges[i].queue = q; > + ret = kobject_init_and_add(&cranges->ranges[i].kobj, > + &blk_crange_ktype, &cranges->kobj, > + "%d", i); > + if (ret) > + goto delete_obj; > + } > + > + cranges->sysfs_registered = true; > + Do you really need independent kobj here? As discussed above, the individual ranges are not independent, and it might be easier to just lump them into one structure, seeing that all of them will have to be redone during revalidation anyway. > + return 0; > + > +delete_obj: > + while (--i >= 0) > + kobject_del(&cranges->ranges[i].kobj); > + kobject_del(&cranges->kobj); > +free: > + kfree(cranges); > + q->cranges = NULL; > + return ret; > +} > + > +void blk_unregister_cranges(struct gendisk *disk) > +{ > + struct request_queue *q = disk->queue; > + struct blk_cranges *cranges = q->cranges; > + int i; > + > + lockdep_assert_held(&q->sysfs_dir_lock); > + lockdep_assert_held(&q->sysfs_lock); > + > + if (!cranges) > + return; > + > + if (cranges->sysfs_registered) { > + for (i = 0; i < cranges->nr_ranges; i++) > + kobject_del(&cranges->ranges[i].kobj); > + kobject_del(&cranges->kobj); > + } > + > + kfree(cranges); > + q->cranges = NULL; > +} > + > +static bool blk_check_ranges(struct gendisk *disk, struct blk_cranges *cr) > +{ > + sector_t capacity = get_capacity(disk); > + sector_t min_sector = (sector_t)-1; > + sector_t max_sector = 0; > + int i; > + > + /* > + * Sector ranges may overlap but should overall contain all sectors > + * within the disk capacity. > + */ > + for (i = 0; i < cr->nr_ranges; i++) { > + min_sector = min(min_sector, cr->ranges[i].sector); > + max_sector = max(max_sector, cr->ranges[i].sector + > + cr->ranges[i].nr_sectors); > + } > + Warm fuzzy feelings. Sector ranges may overlap? Seriously? The only way this could happen is when a given LBA will be present in _two_ ranges. And this is valid? > + if (min_sector != 0 || max_sector < capacity) { > + pr_warn("Invalid concurrent ranges: missing sectors\n"); > + return false; > + } > + > + if (max_sector > capacity) { > + pr_warn("Invalid concurrent ranges: beyond capacity\n"); > + return false; > + } > + > + return true; > +} > + > +static bool blk_cranges_changed(struct gendisk *disk, struct blk_cranges *new) > +{ > + struct blk_cranges *old = disk->queue->cranges; > + int i; > + > + if (!old || old->nr_ranges != new->nr_ranges) > + return true; > + > + for (i = 0; i < new->nr_ranges; i++) { > + if (old->ranges[i].sector != new->ranges[i].sector || > + old->ranges[i].nr_sectors != new->ranges[i].nr_sectors) > + return true; > + } > + > + return false; > +} > + As mentioned above: I doubt we need to go into such detail. Just redo the ranges on revalidation should be sufficient. > +/** > + * blk_alloc_cranges - Allocate a concurrent positioning range structure > + * @disk: target disk > + * @nr_ranges: Number of concurrent ranges > + * > + * Allocate a struct blk_cranges structure with @nr_ranges range descriptors. > + */ > +struct blk_cranges *blk_alloc_cranges(struct gendisk *disk, int nr_ranges) > +{ > + struct blk_cranges *cr; > + > + cr = kzalloc_node(struct_size(cr, ranges, nr_ranges), GFP_KERNEL, > + disk->queue->node); > + if (cr) > + cr->nr_ranges = nr_ranges; > + return cr; > +} > +EXPORT_SYMBOL_GPL(blk_alloc_cranges); > + > +/** > + * blk_queue_set_cranges - Set a disk concurrent positioning ranges > + * @disk: target disk > + * @cr: concurrent ranges structure > + * > + * Set the concurrant positioning ranges information of the request queue > + * of @disk to @cr. If @cr is NULL and the concurrent ranges structure > + * already set, if any, is cleared. If there are no differences between > + * @cr and the concurrent ranges structure already set, @cr is freed. > + */ > +void blk_queue_set_cranges(struct gendisk *disk, struct blk_cranges *cr) > +{ > + struct request_queue *q = disk->queue; > + > + if (WARN_ON_ONCE(cr && !cr->nr_ranges)) { > + kfree(cr); > + cr = NULL; > + } > + > + mutex_lock(&q->sysfs_dir_lock); > + mutex_lock(&q->sysfs_lock); > + And this is what you get for all the complexity. Having to take _two_ locks, and have a detailed description how the function should be used. I still think that redoing everything is simpler. > + if (cr && !blk_check_ranges(disk, cr)) { > + kfree(cr); > + cr = NULL; > + } > + > + if (!blk_cranges_changed(disk, cr)) { > + kfree(cr); > + goto unlock; > + } > + > + /* > + * This may be called for a registered queue. E.g. during a device > + * revalidation. If that is the case, we need to register the new set > + * of concurrent ranges. If the queue is not already registered, the > + * device request queue registration will register the ranges. > + */ > + if (blk_queue_registered(q)) { > + blk_register_cranges(disk, cr); > + } else { > + swap(q->cranges, cr); > + kfree(cr); > + } > + > +unlock: > + mutex_unlock(&q->sysfs_lock); > + mutex_unlock(&q->sysfs_dir_lock); > +} > +EXPORT_SYMBOL_GPL(blk_queue_set_cranges); > diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c > index 370d83c18057..aeac98ecc5a0 100644 > --- a/block/blk-sysfs.c > +++ b/block/blk-sysfs.c > @@ -899,9 +899,21 @@ int blk_register_queue(struct gendisk *disk) > } > > mutex_lock(&q->sysfs_lock); > + > + ret = blk_register_cranges(disk, NULL); > + if (ret) { > + mutex_unlock(&q->sysfs_lock); > + mutex_unlock(&q->sysfs_dir_lock); > + kobject_del(&q->kobj); > + blk_trace_remove_sysfs(dev); > + kobject_put(&dev->kobj); > + return ret; > + } > + > if (q->elevator) { > ret = elv_register_queue(q, false); > if (ret) { > + blk_unregister_cranges(disk); > mutex_unlock(&q->sysfs_lock); > mutex_unlock(&q->sysfs_dir_lock); > kobject_del(&q->kobj); > @@ -985,6 +997,7 @@ void blk_unregister_queue(struct gendisk *disk) > mutex_lock(&q->sysfs_lock); > if (q->elevator) > elv_unregister_queue(q); > + blk_unregister_cranges(disk); > mutex_unlock(&q->sysfs_lock); > mutex_unlock(&q->sysfs_dir_lock); > > diff --git a/block/blk.h b/block/blk.h > index 4b885c0f6708..650c0d87987c 100644 > --- a/block/blk.h > +++ b/block/blk.h > @@ -368,4 +368,7 @@ extern struct device_attribute dev_attr_events; > extern struct device_attribute dev_attr_events_async; > extern struct device_attribute dev_attr_events_poll_msecs; > > +int blk_register_cranges(struct gendisk *disk, struct blk_cranges *new_cranges); > +void blk_unregister_cranges(struct gendisk *disk); > + > #endif /* BLK_INTERNAL_H */ > diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c > index b9588c52815d..144e8cac44ba 100644 > --- a/drivers/ata/libata-scsi.c > +++ b/drivers/ata/libata-scsi.c > @@ -1947,6 +1947,7 @@ static unsigned int ata_scsiop_inq_00(struct ata_scsi_args *args, u8 *rbuf) > 0xb1, /* page 0xb1, block device characteristics page */ > 0xb2, /* page 0xb2, thin provisioning page */ > 0xb6, /* page 0xb6, zoned block device characteristics */ > + 0xb9, /* page 0xb9, concurrent positioning ranges */ > }; > > num_pages = sizeof(pages); And this hunk is here because ... ? > diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h > index 3177181c4326..cd58aa1090a3 100644 > --- a/include/linux/blkdev.h > +++ b/include/linux/blkdev.h > @@ -378,6 +378,29 @@ static inline int blkdev_zone_mgmt_ioctl(struct block_device *bdev, > > #endif /* CONFIG_BLK_DEV_ZONED */ > > +/* > + * Concurrent sector ranges: struct blk_crange describes range of > + * contiguous sectors that can be served by independent resources on the > + * device. The set of ranges defined in struct blk_cranges must overall > + * include all sectors within the device capacity. > + * For a device with multiple ranges, e.g. a single LUN multi-actuator HDD, > + * requests targeting sectors in different ranges can be executed in parallel. > + * A request can straddle a range boundary. > + */ > +struct blk_crange { > + struct kobject kobj; > + struct request_queue *queue; > + sector_t sector; > + sector_t nr_sectors; > +}; > + > +struct blk_cranges { > + struct kobject kobj; > + bool sysfs_registered; > + unsigned int nr_ranges; > + struct blk_crange ranges[]; > +}; > + > struct request_queue { > struct request *last_merge; > struct elevator_queue *elevator; > @@ -570,6 +593,9 @@ struct request_queue { > > #define BLK_MAX_WRITE_HINTS 5 > u64 write_hints[BLK_MAX_WRITE_HINTS]; > + > + /* Concurrent sector ranges */ > + struct blk_cranges *cranges; > }; > > /* Keep blk_queue_flag_name[] in sync with the definitions below */ > @@ -1163,6 +1189,9 @@ extern void blk_queue_required_elevator_features(struct request_queue *q, > extern bool blk_queue_can_use_dma_map_merging(struct request_queue *q, > struct device *dev); > > +struct blk_cranges *blk_alloc_cranges(struct gendisk *disk, int nr_ranges); > +void blk_queue_set_cranges(struct gendisk *disk, struct blk_cranges *cr); > + > /* > * Number of physical segments as sent to the device. > * > Cheers, Hannes
On 2021/07/23 1:11, Hannes Reinecke wrote: > On 7/21/21 12:42 PM, Damien Le Moal wrote: >> The Concurrent Positioning Ranges VPD page (for SCSI) and Log (for ATA) >> contain parameters describing the number of sets of contiguous LBAs that >> can be served independently by a single LUN multi-actuator disk. This >> patch provides the blk_queue_set_cranges() function allowing a device >> driver to signal to the block layer that a disk has multiple actuators, >> each one serving a contiguous range of sectors. To describe the set >> of sector ranges representing the different actuators of a device, the >> data type struct blk_cranges is introduced. >> >> For a device with multiple actuators, a struct blk_cranges is attached >> to the device request queue by the blk_queue_set_cranges() function. The >> function blk_alloc_cranges() is provided for drivers to allocate this >> structure. >> >> The blk_cranges structure contains kobjects (struct kobject) to register >> with sysfs the set of sector ranges defined by a device. On initial >> device scan, this registration is done from blk_register_queue() using >> the block layer internal function blk_register_cranges(). If a driver >> calls blk_queue_set_cranges() for a registered queue, e.g. when a device >> is revalidated, blk_queue_set_cranges() will execute >> blk_register_cranges() to update the queue sysfs attribute files. >> >> The sysfs file structure created starts from the cranges sub-directory >> and contains the start sector and number of sectors served by an >> actuator, with the information for each actuator grouped in one >> directory per actuator. E.g. for a dual actuator drive, we have: >> >> $ tree /sys/block/sdk/queue/cranges/ >> /sys/block/sdk/queue/cranges/ >> |-- 0 >> | |-- nr_sectors >> | `-- sector >> `-- 1 >> |-- nr_sectors >> `-- sector >> >> For a regular single actuator device, the cranges directory does not >> exist. >> > Hmm. I do wonder if we need such a detailed layout. > In my patch I have just: > > + unsigned int num_lba_ranges; > + sector_t *lba_ranges; > > to hold the information; after all, I don't expect this data to be > changing very often during the lifetime of the OS. Indeed, it will not. Except after a reformat that change the LBA size or that depops heads. I was also split between what I sent here and a flat list. In the end, I went for this one as I think it is more elegant and also allows simply swaping the old cranges with the new on revalidation (more on that below). Otherwise, with the flat list, we will have to mess with the visibility of the ranges OR add/remove attribute files to the queue sysfs directory when something changes (and the number of ranges can actually change with head depop formats...). >> Device revalidation may lead to changes to this structure and to the >> attribute values. When manipulated, the queue sysfs_lock and >> sysfs_dir_lock are held for atomicity, similarly to how the blk-mq and >> elevator sysfs queue sub-directories are protected. >> > > Indeed, the values might change with revalidation. But not > independently; a change in _one_ range will typically affect the > adjacent ranges, too. > So it should be sufficient to drop them completely and redo from scratch > for revalidation. Which is what blk_queue_set_cranges() does :) If the new cranges specified as argument has a change from the old one already set as q->cranges, the old one is dropped and the new one registred in its place. >> The code related to the management of cranges is added in the new >> file block/blk-cranges.c. >> >> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> >> --- >> block/Makefile | 2 +- >> block/blk-cranges.c | 286 ++++++++++++++++++++++++++++++++++++++ >> block/blk-sysfs.c | 13 ++ >> block/blk.h | 3 + >> drivers/ata/libata-scsi.c | 1 + >> include/linux/blkdev.h | 29 ++++ >> 6 files changed, 333 insertions(+), 1 deletion(-) >> create mode 100644 block/blk-cranges.c >> >> diff --git a/block/Makefile b/block/Makefile >> index bfbe4e13ca1e..e477e6ca9ea6 100644 >> --- a/block/Makefile >> +++ b/block/Makefile >> @@ -9,7 +9,7 @@ obj-$(CONFIG_BLOCK) := bio.o elevator.o blk-core.o blk-sysfs.o \ >> blk-lib.o blk-mq.o blk-mq-tag.o blk-stat.o \ >> blk-mq-sysfs.o blk-mq-cpumap.o blk-mq-sched.o ioctl.o \ >> genhd.o ioprio.o badblocks.o partitions/ blk-rq-qos.o \ >> - disk-events.o >> + disk-events.o blk-cranges.o >> >> obj-$(CONFIG_BOUNCE) += bounce.o >> obj-$(CONFIG_BLK_SCSI_REQUEST) += scsi_ioctl.o >> diff --git a/block/blk-cranges.c b/block/blk-cranges.c >> new file mode 100644 >> index 000000000000..e8ff0d40a5c9 >> --- /dev/null >> +++ b/block/blk-cranges.c >> @@ -0,0 +1,286 @@ >> +// SPDX-License-Identifier: GPL-2.0 >> +/* >> + * Block device concurrent positioning ranges. >> + * >> + * Copyright (C) 2021 Western Digital Corporation or its Affiliates. >> + */ >> +#include <linux/kernel.h> >> +#include <linux/blkdev.h> >> +#include <linux/slab.h> >> +#include <linux/init.h> >> + >> +#include "blk.h" >> + >> +static ssize_t blk_crange_sector_show(struct blk_crange *cr, char *page) >> +{ >> + return sprintf(page, "%llu\n", cr->sector); >> +} >> + >> +static ssize_t blk_crange_nr_sectors_show(struct blk_crange *cr, char *page) >> +{ >> + return sprintf(page, "%llu\n", cr->nr_sectors); >> +} >> + >> +struct blk_crange_sysfs_entry { >> + struct attribute attr; >> + ssize_t (*show)(struct blk_crange *cr, char *page); >> +}; >> + >> +static struct blk_crange_sysfs_entry blk_crange_sector_entry = { >> + .attr = { .name = "sector", .mode = 0444 }, >> + .show = blk_crange_sector_show, >> +}; >> + >> +static struct blk_crange_sysfs_entry blk_crange_nr_sectors_entry = { >> + .attr = { .name = "nr_sectors", .mode = 0444 }, >> + .show = blk_crange_nr_sectors_show, >> +}; >> + >> +static struct attribute *blk_crange_attrs[] = { >> + &blk_crange_sector_entry.attr, >> + &blk_crange_nr_sectors_entry.attr, >> + NULL, >> +}; >> +ATTRIBUTE_GROUPS(blk_crange); >> + >> +static ssize_t blk_crange_sysfs_show(struct kobject *kobj, >> + struct attribute *attr, char *page) >> +{ >> + struct blk_crange_sysfs_entry *entry; >> + struct blk_crange *cr; >> + struct request_queue *q; >> + ssize_t ret; >> + >> + entry = container_of(attr, struct blk_crange_sysfs_entry, attr); >> + cr = container_of(kobj, struct blk_crange, kobj); >> + q = cr->queue; >> + >> + mutex_lock(&q->sysfs_lock); >> + ret = entry->show(cr, page); >> + mutex_unlock(&q->sysfs_lock); >> + >> + return ret; >> +} >> + >> +static const struct sysfs_ops blk_crange_sysfs_ops = { >> + .show = blk_crange_sysfs_show, >> +}; >> + >> +/* >> + * Dummy release function to make kobj happy. >> + */ >> +static void blk_cranges_sysfs_nop_release(struct kobject *kobj) >> +{ >> +} >> + >> +static struct kobj_type blk_crange_ktype = { >> + .sysfs_ops = &blk_crange_sysfs_ops, >> + .default_groups = blk_crange_groups, >> + .release = blk_cranges_sysfs_nop_release, >> +}; >> + >> +static struct kobj_type blk_cranges_ktype = { >> + .release = blk_cranges_sysfs_nop_release, >> +}; >> + >> +/** >> + * blk_register_cranges - register with sysfs a set of concurrent ranges >> + * @disk: Target disk >> + * @new_cranges: New set of concurrent ranges >> + * >> + * Register with sysfs a set of concurrent ranges for @disk. If @new_cranges >> + * is not NULL, this set of concurrent ranges is registered and the >> + * old set specified by q->cranges is unregistered. Otherwise, q->cranges >> + * is registered if it is not already. >> + */ >> +int blk_register_cranges(struct gendisk *disk, struct blk_cranges *new_cranges) >> +{ >> + struct request_queue *q = disk->queue; >> + struct blk_cranges *cranges; >> + int i, ret; >> + >> + lockdep_assert_held(&q->sysfs_dir_lock); >> + lockdep_assert_held(&q->sysfs_lock); >> + >> + /* If a new range set is specified, unregister the old one */ >> + if (new_cranges) { >> + if (q->cranges) >> + blk_unregister_cranges(disk); >> + q->cranges = new_cranges; >> + } >> + >> + cranges = q->cranges; >> + if (!cranges) >> + return 0; >> + >> + /* >> + * At this point, q->cranges is the new set of sector ranges that needs >> + * to be registered with sysfs. >> + */ >> + WARN_ON(cranges->sysfs_registered); >> + ret = kobject_init_and_add(&cranges->kobj, &blk_cranges_ktype, >> + &q->kobj, "%s", "cranges"); >> + if (ret) >> + goto free; >> + > + for (i = 0; i < cranges->nr_ranges; i++) { >> + cranges->ranges[i].queue = q; >> + ret = kobject_init_and_add(&cranges->ranges[i].kobj, >> + &blk_crange_ktype, &cranges->kobj, >> + "%d", i); >> + if (ret) >> + goto delete_obj; >> + } >> + >> + cranges->sysfs_registered = true; >> + > > Do you really need independent kobj here? > As discussed above, the individual ranges are not independent, and it > might be easier to just lump them into one structure, seeing that all of > them will have to be redone during revalidation anyway. For the structure lifetime, no, we do not need separate kobjs. But with sysfs, one kobj == one directory. SO I have these for the queue/cranges sysfs directory and its sub directories (one per range). Note that there are no kobj_get|put() for these. It is not needed. They are only for the sub-directories. >> + return 0; >> + >> +delete_obj: >> + while (--i >= 0) >> + kobject_del(&cranges->ranges[i].kobj); >> + kobject_del(&cranges->kobj); >> +free: >> + kfree(cranges); >> + q->cranges = NULL; >> + return ret; >> +} >> + >> +void blk_unregister_cranges(struct gendisk *disk) >> +{ >> + struct request_queue *q = disk->queue; >> + struct blk_cranges *cranges = q->cranges; >> + int i; >> + >> + lockdep_assert_held(&q->sysfs_dir_lock); >> + lockdep_assert_held(&q->sysfs_lock); >> + >> + if (!cranges) >> + return; >> + >> + if (cranges->sysfs_registered) { >> + for (i = 0; i < cranges->nr_ranges; i++) >> + kobject_del(&cranges->ranges[i].kobj); >> + kobject_del(&cranges->kobj); >> + } >> + >> + kfree(cranges); >> + q->cranges = NULL; >> +} >> + >> +static bool blk_check_ranges(struct gendisk *disk, struct blk_cranges *cr) >> +{ >> + sector_t capacity = get_capacity(disk); >> + sector_t min_sector = (sector_t)-1; >> + sector_t max_sector = 0; >> + int i; >> + >> + /* >> + * Sector ranges may overlap but should overall contain all sectors >> + * within the disk capacity. >> + */ >> + for (i = 0; i < cr->nr_ranges; i++) { >> + min_sector = min(min_sector, cr->ranges[i].sector); >> + max_sector = max(max_sector, cr->ranges[i].sector + >> + cr->ranges[i].nr_sectors); >> + } >> + > > Warm fuzzy feelings. > Sector ranges may overlap? Seriously? Yes they can. The standards do not forbid it, and for a good reason: split-actuator drives vs multi-actuator drives. These two terms are often used interchangably, but they actually refer to something different: 1) Split-actuator drives: On these, the head stack (actuator) is split in 2 on the same axis. This creates 2 subsets of heads, each serving different disk platters, so different LBAs. In this case, there will be no overlap. 2) Multi-actuator drives: With these, the drives have 2 distinct actuators (head stacks) and each stack can potentially access the entire LBA range (all disk platters). In this case, the ranges for each one will be the entire disk capacity. Which makes reporting that rather pointless I guess, at least in terms of optimization for the storage stack since we will be better off having the disk do its own scheduling in this case. So following the standard, both are allowed here so I only check that all sectors are covered. No check on overlap. File systems or IO schedulers who want to optimize for multiple cranges will have to be careful about that. > The only way this could happen is when a given LBA will be present in > _two_ ranges. And this is valid? Yep. Dual-actuator case. >> + if (min_sector != 0 || max_sector < capacity) { >> + pr_warn("Invalid concurrent ranges: missing sectors\n"); >> + return false; >> + } >> + >> + if (max_sector > capacity) { >> + pr_warn("Invalid concurrent ranges: beyond capacity\n"); >> + return false; >> + } >> + >> + return true; >> +} >> + >> +static bool blk_cranges_changed(struct gendisk *disk, struct blk_cranges *new) >> +{ >> + struct blk_cranges *old = disk->queue->cranges; >> + int i; >> + >> + if (!old || old->nr_ranges != new->nr_ranges) >> + return true; >> + >> + for (i = 0; i < new->nr_ranges; i++) { >> + if (old->ranges[i].sector != new->ranges[i].sector || >> + old->ranges[i].nr_sectors != new->ranges[i].nr_sectors) >> + return true; >> + } >> + >> + return false; >> +} >> + > > As mentioned above: I doubt we need to go into such detail. Just redo > the ranges on revalidation should be sufficient. Yes, we could. But checking for the change is trivial with this function and it will return true most of the time. If it does, nothing is done and the cranges structure passed by blk_queue_set_cranges() is simply freed and sysfs remains untouched. I prefer that over constantly changing the cranges structure for nothing. after all, revalidation is not exactly a rare event... >> +/** >> + * blk_alloc_cranges - Allocate a concurrent positioning range structure >> + * @disk: target disk >> + * @nr_ranges: Number of concurrent ranges >> + * >> + * Allocate a struct blk_cranges structure with @nr_ranges range descriptors. >> + */ >> +struct blk_cranges *blk_alloc_cranges(struct gendisk *disk, int nr_ranges) >> +{ >> + struct blk_cranges *cr; >> + >> + cr = kzalloc_node(struct_size(cr, ranges, nr_ranges), GFP_KERNEL, >> + disk->queue->node); >> + if (cr) >> + cr->nr_ranges = nr_ranges; >> + return cr; >> +} >> +EXPORT_SYMBOL_GPL(blk_alloc_cranges); >> + >> +/** >> + * blk_queue_set_cranges - Set a disk concurrent positioning ranges >> + * @disk: target disk >> + * @cr: concurrent ranges structure >> + * >> + * Set the concurrant positioning ranges information of the request queue >> + * of @disk to @cr. If @cr is NULL and the concurrent ranges structure >> + * already set, if any, is cleared. If there are no differences between >> + * @cr and the concurrent ranges structure already set, @cr is freed. >> + */ >> +void blk_queue_set_cranges(struct gendisk *disk, struct blk_cranges *cr) >> +{ >> + struct request_queue *q = disk->queue; >> + >> + if (WARN_ON_ONCE(cr && !cr->nr_ranges)) { >> + kfree(cr); >> + cr = NULL; >> + } >> + >> + mutex_lock(&q->sysfs_dir_lock); >> + mutex_lock(&q->sysfs_lock); >> + > > And this is what you get for all the complexity. > Having to take _two_ locks, and have a detailed description how the > function should be used. > > I still think that redoing everything is simpler. It is redoing everything, but only in the case of a difference between the old and new cranges. q->sysfs_dir_lock is taken because the structure has sub-directories in sysfs. We would not need to take it with a flat list of attribute files that are updated, unless the number of of actuators actually changes. In which case, attribute files would need to be removed and q->sysfs_dir_lock taken. Or we need to muck with attribute visibility, which I find ugly personnally... But if you really insist, we can swith to that. > >> + if (cr && !blk_check_ranges(disk, cr)) { >> + kfree(cr); >> + cr = NULL; >> + } >> + >> + if (!blk_cranges_changed(disk, cr)) { >> + kfree(cr); >> + goto unlock; >> + } >> + >> + /* >> + * This may be called for a registered queue. E.g. during a device >> + * revalidation. If that is the case, we need to register the new set >> + * of concurrent ranges. If the queue is not already registered, the >> + * device request queue registration will register the ranges. >> + */ >> + if (blk_queue_registered(q)) { >> + blk_register_cranges(disk, cr); >> + } else { >> + swap(q->cranges, cr); >> + kfree(cr); >> + } >> + >> +unlock: >> + mutex_unlock(&q->sysfs_lock); >> + mutex_unlock(&q->sysfs_dir_lock); >> +} >> +EXPORT_SYMBOL_GPL(blk_queue_set_cranges); >> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c >> index 370d83c18057..aeac98ecc5a0 100644 >> --- a/block/blk-sysfs.c >> +++ b/block/blk-sysfs.c >> @@ -899,9 +899,21 @@ int blk_register_queue(struct gendisk *disk) >> } >> >> mutex_lock(&q->sysfs_lock); >> + >> + ret = blk_register_cranges(disk, NULL); >> + if (ret) { >> + mutex_unlock(&q->sysfs_lock); >> + mutex_unlock(&q->sysfs_dir_lock); >> + kobject_del(&q->kobj); >> + blk_trace_remove_sysfs(dev); >> + kobject_put(&dev->kobj); >> + return ret; >> + } >> + >> if (q->elevator) { >> ret = elv_register_queue(q, false); >> if (ret) { >> + blk_unregister_cranges(disk); >> mutex_unlock(&q->sysfs_lock); >> mutex_unlock(&q->sysfs_dir_lock); >> kobject_del(&q->kobj); >> @@ -985,6 +997,7 @@ void blk_unregister_queue(struct gendisk *disk) >> mutex_lock(&q->sysfs_lock); >> if (q->elevator) >> elv_unregister_queue(q); >> + blk_unregister_cranges(disk); >> mutex_unlock(&q->sysfs_lock); >> mutex_unlock(&q->sysfs_dir_lock); >> >> diff --git a/block/blk.h b/block/blk.h >> index 4b885c0f6708..650c0d87987c 100644 >> --- a/block/blk.h >> +++ b/block/blk.h >> @@ -368,4 +368,7 @@ extern struct device_attribute dev_attr_events; >> extern struct device_attribute dev_attr_events_async; >> extern struct device_attribute dev_attr_events_poll_msecs; >> >> +int blk_register_cranges(struct gendisk *disk, struct blk_cranges *new_cranges); >> +void blk_unregister_cranges(struct gendisk *disk); >> + >> #endif /* BLK_INTERNAL_H */ >> diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c >> index b9588c52815d..144e8cac44ba 100644 >> --- a/drivers/ata/libata-scsi.c >> +++ b/drivers/ata/libata-scsi.c >> @@ -1947,6 +1947,7 @@ static unsigned int ata_scsiop_inq_00(struct ata_scsi_args *args, u8 *rbuf) >> 0xb1, /* page 0xb1, block device characteristics page */ >> 0xb2, /* page 0xb2, thin provisioning page */ >> 0xb6, /* page 0xb6, zoned block device characteristics */ >> + 0xb9, /* page 0xb9, concurrent positioning ranges */ >> }; >> >> num_pages = sizeof(pages); > > And this hunk is here because ... ? Oops... That belongs to patch 3. Thanks for the review. Sending V2.
On 2021/07/23 1:11, Hannes Reinecke wrote: > On 7/21/21 12:42 PM, Damien Le Moal wrote: >> The Concurrent Positioning Ranges VPD page (for SCSI) and Log (for ATA) >> contain parameters describing the number of sets of contiguous LBAs that >> can be served independently by a single LUN multi-actuator disk. This >> patch provides the blk_queue_set_cranges() function allowing a device >> driver to signal to the block layer that a disk has multiple actuators, >> each one serving a contiguous range of sectors. To describe the set >> of sector ranges representing the different actuators of a device, the >> data type struct blk_cranges is introduced. >> >> For a device with multiple actuators, a struct blk_cranges is attached >> to the device request queue by the blk_queue_set_cranges() function. The >> function blk_alloc_cranges() is provided for drivers to allocate this >> structure. >> >> The blk_cranges structure contains kobjects (struct kobject) to register >> with sysfs the set of sector ranges defined by a device. On initial >> device scan, this registration is done from blk_register_queue() using >> the block layer internal function blk_register_cranges(). If a driver >> calls blk_queue_set_cranges() for a registered queue, e.g. when a device >> is revalidated, blk_queue_set_cranges() will execute >> blk_register_cranges() to update the queue sysfs attribute files. >> >> The sysfs file structure created starts from the cranges sub-directory >> and contains the start sector and number of sectors served by an >> actuator, with the information for each actuator grouped in one >> directory per actuator. E.g. for a dual actuator drive, we have: >> >> $ tree /sys/block/sdk/queue/cranges/ >> /sys/block/sdk/queue/cranges/ >> |-- 0 >> | |-- nr_sectors >> | `-- sector >> `-- 1 >> |-- nr_sectors >> `-- sector >> >> For a regular single actuator device, the cranges directory does not >> exist. >> > Hmm. I do wonder if we need such a detailed layout. > In my patch I have just: > > + unsigned int num_lba_ranges; > + sector_t *lba_ranges; > > to hold the information; after all, I don't expect this data to be > changing very often during the lifetime of the OS. Another thing about this sysfs structure: it allows easily adding attributes for each actuator if needed. For instance, the VPD and Log page advertize a "number of storage elements" for each actuator. This is not exposed here as I do not find that information particularly useful. But we could add anything that the standards defines. Or purely software things like a request counter to see how well balanced accesses are. > >> Device revalidation may lead to changes to this structure and to the >> attribute values. When manipulated, the queue sysfs_lock and >> sysfs_dir_lock are held for atomicity, similarly to how the blk-mq and >> elevator sysfs queue sub-directories are protected. >> > > Indeed, the values might change with revalidation. But not > independently; a change in _one_ range will typically affect the > adjacent ranges, too. > So it should be sufficient to drop them completely and redo from scratch > for revalidation. > >> The code related to the management of cranges is added in the new >> file block/blk-cranges.c. >> >> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> >> --- >> block/Makefile | 2 +- >> block/blk-cranges.c | 286 ++++++++++++++++++++++++++++++++++++++ >> block/blk-sysfs.c | 13 ++ >> block/blk.h | 3 + >> drivers/ata/libata-scsi.c | 1 + >> include/linux/blkdev.h | 29 ++++ >> 6 files changed, 333 insertions(+), 1 deletion(-) >> create mode 100644 block/blk-cranges.c >> >> diff --git a/block/Makefile b/block/Makefile >> index bfbe4e13ca1e..e477e6ca9ea6 100644 >> --- a/block/Makefile >> +++ b/block/Makefile >> @@ -9,7 +9,7 @@ obj-$(CONFIG_BLOCK) := bio.o elevator.o blk-core.o blk-sysfs.o \ >> blk-lib.o blk-mq.o blk-mq-tag.o blk-stat.o \ >> blk-mq-sysfs.o blk-mq-cpumap.o blk-mq-sched.o ioctl.o \ >> genhd.o ioprio.o badblocks.o partitions/ blk-rq-qos.o \ >> - disk-events.o >> + disk-events.o blk-cranges.o >> >> obj-$(CONFIG_BOUNCE) += bounce.o >> obj-$(CONFIG_BLK_SCSI_REQUEST) += scsi_ioctl.o >> diff --git a/block/blk-cranges.c b/block/blk-cranges.c >> new file mode 100644 >> index 000000000000..e8ff0d40a5c9 >> --- /dev/null >> +++ b/block/blk-cranges.c >> @@ -0,0 +1,286 @@ >> +// SPDX-License-Identifier: GPL-2.0 >> +/* >> + * Block device concurrent positioning ranges. >> + * >> + * Copyright (C) 2021 Western Digital Corporation or its Affiliates. >> + */ >> +#include <linux/kernel.h> >> +#include <linux/blkdev.h> >> +#include <linux/slab.h> >> +#include <linux/init.h> >> + >> +#include "blk.h" >> + >> +static ssize_t blk_crange_sector_show(struct blk_crange *cr, char *page) >> +{ >> + return sprintf(page, "%llu\n", cr->sector); >> +} >> + >> +static ssize_t blk_crange_nr_sectors_show(struct blk_crange *cr, char *page) >> +{ >> + return sprintf(page, "%llu\n", cr->nr_sectors); >> +} >> + >> +struct blk_crange_sysfs_entry { >> + struct attribute attr; >> + ssize_t (*show)(struct blk_crange *cr, char *page); >> +}; >> + >> +static struct blk_crange_sysfs_entry blk_crange_sector_entry = { >> + .attr = { .name = "sector", .mode = 0444 }, >> + .show = blk_crange_sector_show, >> +}; >> + >> +static struct blk_crange_sysfs_entry blk_crange_nr_sectors_entry = { >> + .attr = { .name = "nr_sectors", .mode = 0444 }, >> + .show = blk_crange_nr_sectors_show, >> +}; >> + >> +static struct attribute *blk_crange_attrs[] = { >> + &blk_crange_sector_entry.attr, >> + &blk_crange_nr_sectors_entry.attr, >> + NULL, >> +}; >> +ATTRIBUTE_GROUPS(blk_crange); >> + >> +static ssize_t blk_crange_sysfs_show(struct kobject *kobj, >> + struct attribute *attr, char *page) >> +{ >> + struct blk_crange_sysfs_entry *entry; >> + struct blk_crange *cr; >> + struct request_queue *q; >> + ssize_t ret; >> + >> + entry = container_of(attr, struct blk_crange_sysfs_entry, attr); >> + cr = container_of(kobj, struct blk_crange, kobj); >> + q = cr->queue; >> + >> + mutex_lock(&q->sysfs_lock); >> + ret = entry->show(cr, page); >> + mutex_unlock(&q->sysfs_lock); >> + >> + return ret; >> +} >> + >> +static const struct sysfs_ops blk_crange_sysfs_ops = { >> + .show = blk_crange_sysfs_show, >> +}; >> + >> +/* >> + * Dummy release function to make kobj happy. >> + */ >> +static void blk_cranges_sysfs_nop_release(struct kobject *kobj) >> +{ >> +} >> + >> +static struct kobj_type blk_crange_ktype = { >> + .sysfs_ops = &blk_crange_sysfs_ops, >> + .default_groups = blk_crange_groups, >> + .release = blk_cranges_sysfs_nop_release, >> +}; >> + >> +static struct kobj_type blk_cranges_ktype = { >> + .release = blk_cranges_sysfs_nop_release, >> +}; >> + >> +/** >> + * blk_register_cranges - register with sysfs a set of concurrent ranges >> + * @disk: Target disk >> + * @new_cranges: New set of concurrent ranges >> + * >> + * Register with sysfs a set of concurrent ranges for @disk. If @new_cranges >> + * is not NULL, this set of concurrent ranges is registered and the >> + * old set specified by q->cranges is unregistered. Otherwise, q->cranges >> + * is registered if it is not already. >> + */ >> +int blk_register_cranges(struct gendisk *disk, struct blk_cranges *new_cranges) >> +{ >> + struct request_queue *q = disk->queue; >> + struct blk_cranges *cranges; >> + int i, ret; >> + >> + lockdep_assert_held(&q->sysfs_dir_lock); >> + lockdep_assert_held(&q->sysfs_lock); >> + >> + /* If a new range set is specified, unregister the old one */ >> + if (new_cranges) { >> + if (q->cranges) >> + blk_unregister_cranges(disk); >> + q->cranges = new_cranges; >> + } >> + >> + cranges = q->cranges; >> + if (!cranges) >> + return 0; >> + >> + /* >> + * At this point, q->cranges is the new set of sector ranges that needs >> + * to be registered with sysfs. >> + */ >> + WARN_ON(cranges->sysfs_registered); >> + ret = kobject_init_and_add(&cranges->kobj, &blk_cranges_ktype, >> + &q->kobj, "%s", "cranges"); >> + if (ret) >> + goto free; >> + > + for (i = 0; i < cranges->nr_ranges; i++) { >> + cranges->ranges[i].queue = q; >> + ret = kobject_init_and_add(&cranges->ranges[i].kobj, >> + &blk_crange_ktype, &cranges->kobj, >> + "%d", i); >> + if (ret) >> + goto delete_obj; >> + } >> + >> + cranges->sysfs_registered = true; >> + > > Do you really need independent kobj here? > As discussed above, the individual ranges are not independent, and it > might be easier to just lump them into one structure, seeing that all of > them will have to be redone during revalidation anyway. > >> + return 0; >> + >> +delete_obj: >> + while (--i >= 0) >> + kobject_del(&cranges->ranges[i].kobj); >> + kobject_del(&cranges->kobj); >> +free: >> + kfree(cranges); >> + q->cranges = NULL; >> + return ret; >> +} >> + >> +void blk_unregister_cranges(struct gendisk *disk) >> +{ >> + struct request_queue *q = disk->queue; >> + struct blk_cranges *cranges = q->cranges; >> + int i; >> + >> + lockdep_assert_held(&q->sysfs_dir_lock); >> + lockdep_assert_held(&q->sysfs_lock); >> + >> + if (!cranges) >> + return; >> + >> + if (cranges->sysfs_registered) { >> + for (i = 0; i < cranges->nr_ranges; i++) >> + kobject_del(&cranges->ranges[i].kobj); >> + kobject_del(&cranges->kobj); >> + } >> + >> + kfree(cranges); >> + q->cranges = NULL; >> +} >> + >> +static bool blk_check_ranges(struct gendisk *disk, struct blk_cranges *cr) >> +{ >> + sector_t capacity = get_capacity(disk); >> + sector_t min_sector = (sector_t)-1; >> + sector_t max_sector = 0; >> + int i; >> + >> + /* >> + * Sector ranges may overlap but should overall contain all sectors >> + * within the disk capacity. >> + */ >> + for (i = 0; i < cr->nr_ranges; i++) { >> + min_sector = min(min_sector, cr->ranges[i].sector); >> + max_sector = max(max_sector, cr->ranges[i].sector + >> + cr->ranges[i].nr_sectors); >> + } >> + > > Warm fuzzy feelings. > Sector ranges may overlap? Seriously? > > The only way this could happen is when a given LBA will be present in > _two_ ranges. And this is valid? > >> + if (min_sector != 0 || max_sector < capacity) { >> + pr_warn("Invalid concurrent ranges: missing sectors\n"); >> + return false; >> + } >> + >> + if (max_sector > capacity) { >> + pr_warn("Invalid concurrent ranges: beyond capacity\n"); >> + return false; >> + } >> + >> + return true; >> +} >> + >> +static bool blk_cranges_changed(struct gendisk *disk, struct blk_cranges *new) >> +{ >> + struct blk_cranges *old = disk->queue->cranges; >> + int i; >> + >> + if (!old || old->nr_ranges != new->nr_ranges) >> + return true; >> + >> + for (i = 0; i < new->nr_ranges; i++) { >> + if (old->ranges[i].sector != new->ranges[i].sector || >> + old->ranges[i].nr_sectors != new->ranges[i].nr_sectors) >> + return true; >> + } >> + >> + return false; >> +} >> + > > As mentioned above: I doubt we need to go into such detail. Just redo > the ranges on revalidation should be sufficient. > >> +/** >> + * blk_alloc_cranges - Allocate a concurrent positioning range structure >> + * @disk: target disk >> + * @nr_ranges: Number of concurrent ranges >> + * >> + * Allocate a struct blk_cranges structure with @nr_ranges range descriptors. >> + */ >> +struct blk_cranges *blk_alloc_cranges(struct gendisk *disk, int nr_ranges) >> +{ >> + struct blk_cranges *cr; >> + >> + cr = kzalloc_node(struct_size(cr, ranges, nr_ranges), GFP_KERNEL, >> + disk->queue->node); >> + if (cr) >> + cr->nr_ranges = nr_ranges; >> + return cr; >> +} >> +EXPORT_SYMBOL_GPL(blk_alloc_cranges); >> + >> +/** >> + * blk_queue_set_cranges - Set a disk concurrent positioning ranges >> + * @disk: target disk >> + * @cr: concurrent ranges structure >> + * >> + * Set the concurrant positioning ranges information of the request queue >> + * of @disk to @cr. If @cr is NULL and the concurrent ranges structure >> + * already set, if any, is cleared. If there are no differences between >> + * @cr and the concurrent ranges structure already set, @cr is freed. >> + */ >> +void blk_queue_set_cranges(struct gendisk *disk, struct blk_cranges *cr) >> +{ >> + struct request_queue *q = disk->queue; >> + >> + if (WARN_ON_ONCE(cr && !cr->nr_ranges)) { >> + kfree(cr); >> + cr = NULL; >> + } >> + >> + mutex_lock(&q->sysfs_dir_lock); >> + mutex_lock(&q->sysfs_lock); >> + > > And this is what you get for all the complexity. > Having to take _two_ locks, and have a detailed description how the > function should be used. > > I still think that redoing everything is simpler. > >> + if (cr && !blk_check_ranges(disk, cr)) { >> + kfree(cr); >> + cr = NULL; >> + } >> + >> + if (!blk_cranges_changed(disk, cr)) { >> + kfree(cr); >> + goto unlock; >> + } >> + >> + /* >> + * This may be called for a registered queue. E.g. during a device >> + * revalidation. If that is the case, we need to register the new set >> + * of concurrent ranges. If the queue is not already registered, the >> + * device request queue registration will register the ranges. >> + */ >> + if (blk_queue_registered(q)) { >> + blk_register_cranges(disk, cr); >> + } else { >> + swap(q->cranges, cr); >> + kfree(cr); >> + } >> + >> +unlock: >> + mutex_unlock(&q->sysfs_lock); >> + mutex_unlock(&q->sysfs_dir_lock); >> +} >> +EXPORT_SYMBOL_GPL(blk_queue_set_cranges); >> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c >> index 370d83c18057..aeac98ecc5a0 100644 >> --- a/block/blk-sysfs.c >> +++ b/block/blk-sysfs.c >> @@ -899,9 +899,21 @@ int blk_register_queue(struct gendisk *disk) >> } >> >> mutex_lock(&q->sysfs_lock); >> + >> + ret = blk_register_cranges(disk, NULL); >> + if (ret) { >> + mutex_unlock(&q->sysfs_lock); >> + mutex_unlock(&q->sysfs_dir_lock); >> + kobject_del(&q->kobj); >> + blk_trace_remove_sysfs(dev); >> + kobject_put(&dev->kobj); >> + return ret; >> + } >> + >> if (q->elevator) { >> ret = elv_register_queue(q, false); >> if (ret) { >> + blk_unregister_cranges(disk); >> mutex_unlock(&q->sysfs_lock); >> mutex_unlock(&q->sysfs_dir_lock); >> kobject_del(&q->kobj); >> @@ -985,6 +997,7 @@ void blk_unregister_queue(struct gendisk *disk) >> mutex_lock(&q->sysfs_lock); >> if (q->elevator) >> elv_unregister_queue(q); >> + blk_unregister_cranges(disk); >> mutex_unlock(&q->sysfs_lock); >> mutex_unlock(&q->sysfs_dir_lock); >> >> diff --git a/block/blk.h b/block/blk.h >> index 4b885c0f6708..650c0d87987c 100644 >> --- a/block/blk.h >> +++ b/block/blk.h >> @@ -368,4 +368,7 @@ extern struct device_attribute dev_attr_events; >> extern struct device_attribute dev_attr_events_async; >> extern struct device_attribute dev_attr_events_poll_msecs; >> >> +int blk_register_cranges(struct gendisk *disk, struct blk_cranges *new_cranges); >> +void blk_unregister_cranges(struct gendisk *disk); >> + >> #endif /* BLK_INTERNAL_H */ >> diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c >> index b9588c52815d..144e8cac44ba 100644 >> --- a/drivers/ata/libata-scsi.c >> +++ b/drivers/ata/libata-scsi.c >> @@ -1947,6 +1947,7 @@ static unsigned int ata_scsiop_inq_00(struct ata_scsi_args *args, u8 *rbuf) >> 0xb1, /* page 0xb1, block device characteristics page */ >> 0xb2, /* page 0xb2, thin provisioning page */ >> 0xb6, /* page 0xb6, zoned block device characteristics */ >> + 0xb9, /* page 0xb9, concurrent positioning ranges */ >> }; >> >> num_pages = sizeof(pages); > > And this hunk is here because ... ? > >> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h >> index 3177181c4326..cd58aa1090a3 100644 >> --- a/include/linux/blkdev.h >> +++ b/include/linux/blkdev.h >> @@ -378,6 +378,29 @@ static inline int blkdev_zone_mgmt_ioctl(struct block_device *bdev, >> >> #endif /* CONFIG_BLK_DEV_ZONED */ >> >> +/* >> + * Concurrent sector ranges: struct blk_crange describes range of >> + * contiguous sectors that can be served by independent resources on the >> + * device. The set of ranges defined in struct blk_cranges must overall >> + * include all sectors within the device capacity. >> + * For a device with multiple ranges, e.g. a single LUN multi-actuator HDD, >> + * requests targeting sectors in different ranges can be executed in parallel. >> + * A request can straddle a range boundary. >> + */ >> +struct blk_crange { >> + struct kobject kobj; >> + struct request_queue *queue; >> + sector_t sector; >> + sector_t nr_sectors; >> +}; >> + >> +struct blk_cranges { >> + struct kobject kobj; >> + bool sysfs_registered; >> + unsigned int nr_ranges; >> + struct blk_crange ranges[]; >> +}; >> + >> struct request_queue { >> struct request *last_merge; >> struct elevator_queue *elevator; >> @@ -570,6 +593,9 @@ struct request_queue { >> >> #define BLK_MAX_WRITE_HINTS 5 >> u64 write_hints[BLK_MAX_WRITE_HINTS]; >> + >> + /* Concurrent sector ranges */ >> + struct blk_cranges *cranges; >> }; >> >> /* Keep blk_queue_flag_name[] in sync with the definitions below */ >> @@ -1163,6 +1189,9 @@ extern void blk_queue_required_elevator_features(struct request_queue *q, >> extern bool blk_queue_can_use_dma_map_merging(struct request_queue *q, >> struct device *dev); >> >> +struct blk_cranges *blk_alloc_cranges(struct gendisk *disk, int nr_ranges); >> +void blk_queue_set_cranges(struct gendisk *disk, struct blk_cranges *cr); >> + >> /* >> * Number of physical segments as sent to the device. >> * >> > > Cheers, > > Hannes >
diff --git a/block/Makefile b/block/Makefile index bfbe4e13ca1e..e477e6ca9ea6 100644 --- a/block/Makefile +++ b/block/Makefile @@ -9,7 +9,7 @@ obj-$(CONFIG_BLOCK) := bio.o elevator.o blk-core.o blk-sysfs.o \ blk-lib.o blk-mq.o blk-mq-tag.o blk-stat.o \ blk-mq-sysfs.o blk-mq-cpumap.o blk-mq-sched.o ioctl.o \ genhd.o ioprio.o badblocks.o partitions/ blk-rq-qos.o \ - disk-events.o + disk-events.o blk-cranges.o obj-$(CONFIG_BOUNCE) += bounce.o obj-$(CONFIG_BLK_SCSI_REQUEST) += scsi_ioctl.o diff --git a/block/blk-cranges.c b/block/blk-cranges.c new file mode 100644 index 000000000000..e8ff0d40a5c9 --- /dev/null +++ b/block/blk-cranges.c @@ -0,0 +1,286 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Block device concurrent positioning ranges. + * + * Copyright (C) 2021 Western Digital Corporation or its Affiliates. + */ +#include <linux/kernel.h> +#include <linux/blkdev.h> +#include <linux/slab.h> +#include <linux/init.h> + +#include "blk.h" + +static ssize_t blk_crange_sector_show(struct blk_crange *cr, char *page) +{ + return sprintf(page, "%llu\n", cr->sector); +} + +static ssize_t blk_crange_nr_sectors_show(struct blk_crange *cr, char *page) +{ + return sprintf(page, "%llu\n", cr->nr_sectors); +} + +struct blk_crange_sysfs_entry { + struct attribute attr; + ssize_t (*show)(struct blk_crange *cr, char *page); +}; + +static struct blk_crange_sysfs_entry blk_crange_sector_entry = { + .attr = { .name = "sector", .mode = 0444 }, + .show = blk_crange_sector_show, +}; + +static struct blk_crange_sysfs_entry blk_crange_nr_sectors_entry = { + .attr = { .name = "nr_sectors", .mode = 0444 }, + .show = blk_crange_nr_sectors_show, +}; + +static struct attribute *blk_crange_attrs[] = { + &blk_crange_sector_entry.attr, + &blk_crange_nr_sectors_entry.attr, + NULL, +}; +ATTRIBUTE_GROUPS(blk_crange); + +static ssize_t blk_crange_sysfs_show(struct kobject *kobj, + struct attribute *attr, char *page) +{ + struct blk_crange_sysfs_entry *entry; + struct blk_crange *cr; + struct request_queue *q; + ssize_t ret; + + entry = container_of(attr, struct blk_crange_sysfs_entry, attr); + cr = container_of(kobj, struct blk_crange, kobj); + q = cr->queue; + + mutex_lock(&q->sysfs_lock); + ret = entry->show(cr, page); + mutex_unlock(&q->sysfs_lock); + + return ret; +} + +static const struct sysfs_ops blk_crange_sysfs_ops = { + .show = blk_crange_sysfs_show, +}; + +/* + * Dummy release function to make kobj happy. + */ +static void blk_cranges_sysfs_nop_release(struct kobject *kobj) +{ +} + +static struct kobj_type blk_crange_ktype = { + .sysfs_ops = &blk_crange_sysfs_ops, + .default_groups = blk_crange_groups, + .release = blk_cranges_sysfs_nop_release, +}; + +static struct kobj_type blk_cranges_ktype = { + .release = blk_cranges_sysfs_nop_release, +}; + +/** + * blk_register_cranges - register with sysfs a set of concurrent ranges + * @disk: Target disk + * @new_cranges: New set of concurrent ranges + * + * Register with sysfs a set of concurrent ranges for @disk. If @new_cranges + * is not NULL, this set of concurrent ranges is registered and the + * old set specified by q->cranges is unregistered. Otherwise, q->cranges + * is registered if it is not already. + */ +int blk_register_cranges(struct gendisk *disk, struct blk_cranges *new_cranges) +{ + struct request_queue *q = disk->queue; + struct blk_cranges *cranges; + int i, ret; + + lockdep_assert_held(&q->sysfs_dir_lock); + lockdep_assert_held(&q->sysfs_lock); + + /* If a new range set is specified, unregister the old one */ + if (new_cranges) { + if (q->cranges) + blk_unregister_cranges(disk); + q->cranges = new_cranges; + } + + cranges = q->cranges; + if (!cranges) + return 0; + + /* + * At this point, q->cranges is the new set of sector ranges that needs + * to be registered with sysfs. + */ + WARN_ON(cranges->sysfs_registered); + ret = kobject_init_and_add(&cranges->kobj, &blk_cranges_ktype, + &q->kobj, "%s", "cranges"); + if (ret) + goto free; + + for (i = 0; i < cranges->nr_ranges; i++) { + cranges->ranges[i].queue = q; + ret = kobject_init_and_add(&cranges->ranges[i].kobj, + &blk_crange_ktype, &cranges->kobj, + "%d", i); + if (ret) + goto delete_obj; + } + + cranges->sysfs_registered = true; + + return 0; + +delete_obj: + while (--i >= 0) + kobject_del(&cranges->ranges[i].kobj); + kobject_del(&cranges->kobj); +free: + kfree(cranges); + q->cranges = NULL; + return ret; +} + +void blk_unregister_cranges(struct gendisk *disk) +{ + struct request_queue *q = disk->queue; + struct blk_cranges *cranges = q->cranges; + int i; + + lockdep_assert_held(&q->sysfs_dir_lock); + lockdep_assert_held(&q->sysfs_lock); + + if (!cranges) + return; + + if (cranges->sysfs_registered) { + for (i = 0; i < cranges->nr_ranges; i++) + kobject_del(&cranges->ranges[i].kobj); + kobject_del(&cranges->kobj); + } + + kfree(cranges); + q->cranges = NULL; +} + +static bool blk_check_ranges(struct gendisk *disk, struct blk_cranges *cr) +{ + sector_t capacity = get_capacity(disk); + sector_t min_sector = (sector_t)-1; + sector_t max_sector = 0; + int i; + + /* + * Sector ranges may overlap but should overall contain all sectors + * within the disk capacity. + */ + for (i = 0; i < cr->nr_ranges; i++) { + min_sector = min(min_sector, cr->ranges[i].sector); + max_sector = max(max_sector, cr->ranges[i].sector + + cr->ranges[i].nr_sectors); + } + + if (min_sector != 0 || max_sector < capacity) { + pr_warn("Invalid concurrent ranges: missing sectors\n"); + return false; + } + + if (max_sector > capacity) { + pr_warn("Invalid concurrent ranges: beyond capacity\n"); + return false; + } + + return true; +} + +static bool blk_cranges_changed(struct gendisk *disk, struct blk_cranges *new) +{ + struct blk_cranges *old = disk->queue->cranges; + int i; + + if (!old || old->nr_ranges != new->nr_ranges) + return true; + + for (i = 0; i < new->nr_ranges; i++) { + if (old->ranges[i].sector != new->ranges[i].sector || + old->ranges[i].nr_sectors != new->ranges[i].nr_sectors) + return true; + } + + return false; +} + +/** + * blk_alloc_cranges - Allocate a concurrent positioning range structure + * @disk: target disk + * @nr_ranges: Number of concurrent ranges + * + * Allocate a struct blk_cranges structure with @nr_ranges range descriptors. + */ +struct blk_cranges *blk_alloc_cranges(struct gendisk *disk, int nr_ranges) +{ + struct blk_cranges *cr; + + cr = kzalloc_node(struct_size(cr, ranges, nr_ranges), GFP_KERNEL, + disk->queue->node); + if (cr) + cr->nr_ranges = nr_ranges; + return cr; +} +EXPORT_SYMBOL_GPL(blk_alloc_cranges); + +/** + * blk_queue_set_cranges - Set a disk concurrent positioning ranges + * @disk: target disk + * @cr: concurrent ranges structure + * + * Set the concurrant positioning ranges information of the request queue + * of @disk to @cr. If @cr is NULL and the concurrent ranges structure + * already set, if any, is cleared. If there are no differences between + * @cr and the concurrent ranges structure already set, @cr is freed. + */ +void blk_queue_set_cranges(struct gendisk *disk, struct blk_cranges *cr) +{ + struct request_queue *q = disk->queue; + + if (WARN_ON_ONCE(cr && !cr->nr_ranges)) { + kfree(cr); + cr = NULL; + } + + mutex_lock(&q->sysfs_dir_lock); + mutex_lock(&q->sysfs_lock); + + if (cr && !blk_check_ranges(disk, cr)) { + kfree(cr); + cr = NULL; + } + + if (!blk_cranges_changed(disk, cr)) { + kfree(cr); + goto unlock; + } + + /* + * This may be called for a registered queue. E.g. during a device + * revalidation. If that is the case, we need to register the new set + * of concurrent ranges. If the queue is not already registered, the + * device request queue registration will register the ranges. + */ + if (blk_queue_registered(q)) { + blk_register_cranges(disk, cr); + } else { + swap(q->cranges, cr); + kfree(cr); + } + +unlock: + mutex_unlock(&q->sysfs_lock); + mutex_unlock(&q->sysfs_dir_lock); +} +EXPORT_SYMBOL_GPL(blk_queue_set_cranges); diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c index 370d83c18057..aeac98ecc5a0 100644 --- a/block/blk-sysfs.c +++ b/block/blk-sysfs.c @@ -899,9 +899,21 @@ int blk_register_queue(struct gendisk *disk) } mutex_lock(&q->sysfs_lock); + + ret = blk_register_cranges(disk, NULL); + if (ret) { + mutex_unlock(&q->sysfs_lock); + mutex_unlock(&q->sysfs_dir_lock); + kobject_del(&q->kobj); + blk_trace_remove_sysfs(dev); + kobject_put(&dev->kobj); + return ret; + } + if (q->elevator) { ret = elv_register_queue(q, false); if (ret) { + blk_unregister_cranges(disk); mutex_unlock(&q->sysfs_lock); mutex_unlock(&q->sysfs_dir_lock); kobject_del(&q->kobj); @@ -985,6 +997,7 @@ void blk_unregister_queue(struct gendisk *disk) mutex_lock(&q->sysfs_lock); if (q->elevator) elv_unregister_queue(q); + blk_unregister_cranges(disk); mutex_unlock(&q->sysfs_lock); mutex_unlock(&q->sysfs_dir_lock); diff --git a/block/blk.h b/block/blk.h index 4b885c0f6708..650c0d87987c 100644 --- a/block/blk.h +++ b/block/blk.h @@ -368,4 +368,7 @@ extern struct device_attribute dev_attr_events; extern struct device_attribute dev_attr_events_async; extern struct device_attribute dev_attr_events_poll_msecs; +int blk_register_cranges(struct gendisk *disk, struct blk_cranges *new_cranges); +void blk_unregister_cranges(struct gendisk *disk); + #endif /* BLK_INTERNAL_H */ diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c index b9588c52815d..144e8cac44ba 100644 --- a/drivers/ata/libata-scsi.c +++ b/drivers/ata/libata-scsi.c @@ -1947,6 +1947,7 @@ static unsigned int ata_scsiop_inq_00(struct ata_scsi_args *args, u8 *rbuf) 0xb1, /* page 0xb1, block device characteristics page */ 0xb2, /* page 0xb2, thin provisioning page */ 0xb6, /* page 0xb6, zoned block device characteristics */ + 0xb9, /* page 0xb9, concurrent positioning ranges */ }; num_pages = sizeof(pages); diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h index 3177181c4326..cd58aa1090a3 100644 --- a/include/linux/blkdev.h +++ b/include/linux/blkdev.h @@ -378,6 +378,29 @@ static inline int blkdev_zone_mgmt_ioctl(struct block_device *bdev, #endif /* CONFIG_BLK_DEV_ZONED */ +/* + * Concurrent sector ranges: struct blk_crange describes range of + * contiguous sectors that can be served by independent resources on the + * device. The set of ranges defined in struct blk_cranges must overall + * include all sectors within the device capacity. + * For a device with multiple ranges, e.g. a single LUN multi-actuator HDD, + * requests targeting sectors in different ranges can be executed in parallel. + * A request can straddle a range boundary. + */ +struct blk_crange { + struct kobject kobj; + struct request_queue *queue; + sector_t sector; + sector_t nr_sectors; +}; + +struct blk_cranges { + struct kobject kobj; + bool sysfs_registered; + unsigned int nr_ranges; + struct blk_crange ranges[]; +}; + struct request_queue { struct request *last_merge; struct elevator_queue *elevator; @@ -570,6 +593,9 @@ struct request_queue { #define BLK_MAX_WRITE_HINTS 5 u64 write_hints[BLK_MAX_WRITE_HINTS]; + + /* Concurrent sector ranges */ + struct blk_cranges *cranges; }; /* Keep blk_queue_flag_name[] in sync with the definitions below */ @@ -1163,6 +1189,9 @@ extern void blk_queue_required_elevator_features(struct request_queue *q, extern bool blk_queue_can_use_dma_map_merging(struct request_queue *q, struct device *dev); +struct blk_cranges *blk_alloc_cranges(struct gendisk *disk, int nr_ranges); +void blk_queue_set_cranges(struct gendisk *disk, struct blk_cranges *cr); + /* * Number of physical segments as sent to the device. *
The Concurrent Positioning Ranges VPD page (for SCSI) and Log (for ATA) contain parameters describing the number of sets of contiguous LBAs that can be served independently by a single LUN multi-actuator disk. This patch provides the blk_queue_set_cranges() function allowing a device driver to signal to the block layer that a disk has multiple actuators, each one serving a contiguous range of sectors. To describe the set of sector ranges representing the different actuators of a device, the data type struct blk_cranges is introduced. For a device with multiple actuators, a struct blk_cranges is attached to the device request queue by the blk_queue_set_cranges() function. The function blk_alloc_cranges() is provided for drivers to allocate this structure. The blk_cranges structure contains kobjects (struct kobject) to register with sysfs the set of sector ranges defined by a device. On initial device scan, this registration is done from blk_register_queue() using the block layer internal function blk_register_cranges(). If a driver calls blk_queue_set_cranges() for a registered queue, e.g. when a device is revalidated, blk_queue_set_cranges() will execute blk_register_cranges() to update the queue sysfs attribute files. The sysfs file structure created starts from the cranges sub-directory and contains the start sector and number of sectors served by an actuator, with the information for each actuator grouped in one directory per actuator. E.g. for a dual actuator drive, we have: $ tree /sys/block/sdk/queue/cranges/ /sys/block/sdk/queue/cranges/ |-- 0 | |-- nr_sectors | `-- sector `-- 1 |-- nr_sectors `-- sector For a regular single actuator device, the cranges directory does not exist. Device revalidation may lead to changes to this structure and to the attribute values. When manipulated, the queue sysfs_lock and sysfs_dir_lock are held for atomicity, similarly to how the blk-mq and elevator sysfs queue sub-directories are protected. The code related to the management of cranges is added in the new file block/blk-cranges.c. Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> --- block/Makefile | 2 +- block/blk-cranges.c | 286 ++++++++++++++++++++++++++++++++++++++ block/blk-sysfs.c | 13 ++ block/blk.h | 3 + drivers/ata/libata-scsi.c | 1 + include/linux/blkdev.h | 29 ++++ 6 files changed, 333 insertions(+), 1 deletion(-) create mode 100644 block/blk-cranges.c