[4/4] UBI: Implement bitrot checking

Message ID	1427631197-23610-5-git-send-email-richard@nod.at
State	Superseded
Headers	show Return-Path: <linux-mtd-bounces+incoming=patchwork.ozlabs.org@lists.infradead.org> From: Richard Weinberger <richard@nod.at> To: linux-mtd@lists.infradead.org Subject: [PATCH 4/4] UBI: Implement bitrot checking Date: Sun, 29 Mar 2015 14:13:17 +0200 Message-Id: <1427631197-23610-5-git-send-email-richard@nod.at> In-Reply-To: <1427631197-23610-1-git-send-email-richard@nod.at> References: <1427631197-23610-1-git-send-email-richard@nod.at> summary: Content analysis details: (-0.0 points) pts rule name description ---- ---------------------- -------------------------------------------------- -0.0 SPF_PASS SPF: sender matches SPF record Cc: Richard Weinberger <richard@nod.at>, boris.brezillon@free-electrons.com, linux-kernel@vger.kernel.org, dedekind1@gmail.com Precedence: list MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Sender: "linux-mtd" <linux-mtd-bounces@lists.infradead.org> Errors-To: linux-mtd-bounces+incoming=patchwork.ozlabs.org@lists.infradead.org

Richard Weinberger March 29, 2015, 12:13 p.m. UTC

This patch implements bitrot checking for UBI.
ubi_wl_trigger_bitrot_check() triggers a re-read of every
PEB. If a bitflip is detected PEBs in use will get scrubbed
and free ones erased.

Signed-off-by: Richard Weinberger <richard@nod.at>
---
 drivers/mtd/ubi/build.c |  39 +++++++++++++
 drivers/mtd/ubi/ubi.h   |   4 ++
 drivers/mtd/ubi/wl.c    | 146 ++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 189 insertions(+)

rnd4@dave-tech.it April 2, 2015, 5:34 p.m. UTC | #1

Richard,

Il 29/03/2015 14:13, Richard Weinberger ha scritto:
> +	mutex_lock(&ubi->buf_mutex);
> +	err = ubi_io_read(ubi, ubi->peb_buf, e->pnum, 0, ubi->peb_size);
> +	mutex_unlock(&ubi->buf_mutex);
> +	if (err == UBI_IO_BITFLIPS) {
> +		dbg_wl("found bitflips in PEB %d", e->pnum);
> +		spin_lock(&ubi->wl_lock);
> +

IIUC you trigger the action as soon as you have a bitflip error, is this
correct?

Isn't this too much conservative? You usually have a RBER on MLC devices
that's between 1E-7 (for brand new devices) and 1E-4 (for devices with
1k-2k P/E cycle after 100k-300k read-without-P/E)

Having a few bitflips on a block read is more that usual and current ECC
can correct more that 16 bit error over 512/1KiB.

WDYT?

Kind Regards,

Richard Weinberger April 2, 2015, 5:54 p.m. UTC | #2

Hi!

Am 02.04.2015 um 19:34 schrieb Andrea Scian:
> 
> Richard,
> 
> Il 29/03/2015 14:13, Richard Weinberger ha scritto:
>> +	mutex_lock(&ubi->buf_mutex);
>> +	err = ubi_io_read(ubi, ubi->peb_buf, e->pnum, 0, ubi->peb_size);
>> +	mutex_unlock(&ubi->buf_mutex);
>> +	if (err == UBI_IO_BITFLIPS) {
>> +		dbg_wl("found bitflips in PEB %d", e->pnum);
>> +		spin_lock(&ubi->wl_lock);
>> +
> 
> IIUC you trigger the action as soon as you have a bitflip error, is this
> correct?

I trigger it as soon UBI sees the bitflip. This depends on the configured MTD
bitflip_threshold.

> Isn't this too much conservative? You usually have a RBER on MLC devices
> that's between 1E-7 (for brand new devices) and 1E-4 (for devices with
> 1k-2k P/E cycle after 100k-300k read-without-P/E)
> 
> Having a few bitflips on a block read is more that usual and current ECC
> can correct more that 16 bit error over 512/1KiB.

Please see above. :)

Thanks,
//richard

rnd4@dave-tech.it April 2, 2015, 7:19 p.m. UTC | #3

Il 02/04/2015 19:54, Richard Weinberger ha scritto:
> Hi!
>
> Am 02.04.2015 um 19:34 schrieb Andrea Scian:
>> Richard,
>>
>> Il 29/03/2015 14:13, Richard Weinberger ha scritto:
>>> +	mutex_lock(&ubi->buf_mutex);
>>> +	err = ubi_io_read(ubi, ubi->peb_buf, e->pnum, 0, ubi->peb_size);
>>> +	mutex_unlock(&ubi->buf_mutex);
>>> +	if (err == UBI_IO_BITFLIPS) {
>>> +		dbg_wl("found bitflips in PEB %d", e->pnum);
>>> +		spin_lock(&ubi->wl_lock);
>>> +
>> IIUC you trigger the action as soon as you have a bitflip error, is this
>> correct?
> I trigger it as soon UBI sees the bitflip. This depends on the configured MTD
> bitflip_threshold.

ops.. I confused UBI bitflip error with MTD bitflip error ;-)
thanks for point this out and help me understanding the code

now it's clear and it sounds good to me

Kind Regards,

Richard Weinberger April 8, 2015, 10:34 a.m. UTC | #4

Am 02.04.2015 um 21:19 schrieb Andrea Scian:
> Il 02/04/2015 19:54, Richard Weinberger ha scritto:
>> Hi!
>>
>> Am 02.04.2015 um 19:34 schrieb Andrea Scian:
>>> Richard,
>>>
>>> Il 29/03/2015 14:13, Richard Weinberger ha scritto:
>>>> +	mutex_lock(&ubi->buf_mutex);
>>>> +	err = ubi_io_read(ubi, ubi->peb_buf, e->pnum, 0, ubi->peb_size);
>>>> +	mutex_unlock(&ubi->buf_mutex);
>>>> +	if (err == UBI_IO_BITFLIPS) {
>>>> +		dbg_wl("found bitflips in PEB %d", e->pnum);
>>>> +		spin_lock(&ubi->wl_lock);
>>>> +
>>> IIUC you trigger the action as soon as you have a bitflip error, is this
>>> correct?
>> I trigger it as soon UBI sees the bitflip. This depends on the configured MTD
>> bitflip_threshold.
> 
> ops.. I confused UBI bitflip error with MTD bitflip error ;-)
> thanks for point this out and help me understanding the code
> 
> now it's clear and it sounds good to me

Is this an implicit Reviewed-by? :)

Thanks,
//richard

David Oberhollenzer April 8, 2015, 11:48 a.m. UTC | #5

Reviewed-by: David Oberhollenzer <david.oberhollenzer@sigma-star.at>

rnd4@dave-tech.it April 8, 2015, 9:02 p.m. UTC | #6

Il 08/04/2015 12:34, Richard Weinberger ha scritto:
> Am 02.04.2015 um 21:19 schrieb Andrea Scian:
>> Il 02/04/2015 19:54, Richard Weinberger ha scritto:
>>> Hi!
>>>
>>> Am 02.04.2015 um 19:34 schrieb Andrea Scian:
>>>> Richard,
>>>>
>>>> Il 29/03/2015 14:13, Richard Weinberger ha scritto:
>>>>> +	mutex_lock(&ubi->buf_mutex);
>>>>> +	err = ubi_io_read(ubi, ubi->peb_buf, e->pnum, 0, ubi->peb_size);
>>>>> +	mutex_unlock(&ubi->buf_mutex);
>>>>> +	if (err == UBI_IO_BITFLIPS) {
>>>>> +		dbg_wl("found bitflips in PEB %d", e->pnum);
>>>>> +		spin_lock(&ubi->wl_lock);
>>>>> +
>>>> IIUC you trigger the action as soon as you have a bitflip error, is this
>>>> correct?
>>> I trigger it as soon UBI sees the bitflip. This depends on the configured MTD
>>> bitflip_threshold.
>>
>> ops.. I confused UBI bitflip error with MTD bitflip error ;-)
>> thanks for point this out and help me understanding the code
>>
>> now it's clear and it sounds good to me
> 
> Is this an implicit Reviewed-by? :)

Yes

Reviewed-by: Andrea Scian <andrea.scian@dave.eu>

(I have lot to learn about submitting kernel patches ;-) )

Unfortunately I cannot review it from a deep technical point of view
because some part of bitrot_check_worker() are not so clear to me (I'm
not an UBI expert and I need to dig a little bit further to have better
understanding), but the idea IMHO is right and it's worth having into UBI

Kind Regards,

Boris Brezillon April 12, 2015, 2:12 p.m. UTC | #7

Hi Richard,

Sorry for the late reply.

On Sun, 29 Mar 2015 14:13:17 +0200
Richard Weinberger <richard@nod.at> wrote:

> This patch implements bitrot checking for UBI.
> ubi_wl_trigger_bitrot_check() triggers a re-read of every
> PEB. If a bitflip is detected PEBs in use will get scrubbed
> and free ones erased.

As you'll see, I didn't have much to say about the 'UBI bitrot
detection' mechanism, so this review is a collection of
nitpicks :-).

> 
> Signed-off-by: Richard Weinberger <richard@nod.at>
> ---
>  drivers/mtd/ubi/build.c |  39 +++++++++++++
>  drivers/mtd/ubi/ubi.h   |   4 ++
>  drivers/mtd/ubi/wl.c    | 146 ++++++++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 189 insertions(+)
> 
> diff --git a/drivers/mtd/ubi/build.c b/drivers/mtd/ubi/build.c
> index 9690cf9..f58330b 100644
> --- a/drivers/mtd/ubi/build.c
> +++ b/drivers/mtd/ubi/build.c
> @@ -118,6 +118,9 @@ static struct class_attribute ubi_version =
>  
>  static ssize_t dev_attribute_show(struct device *dev,
>  				  struct device_attribute *attr, char *buf);
> +static ssize_t trigger_bitrot_check(struct device *dev,
> +				    struct device_attribute *mattr,
> +				    const char *data, size_t count);
>  
>  /* UBI device attributes (correspond to files in '/<sysfs>/class/ubi/ubiX') */
>  static struct device_attribute dev_eraseblock_size =
> @@ -142,6 +145,8 @@ static struct device_attribute dev_bgt_enabled =
>  	__ATTR(bgt_enabled, S_IRUGO, dev_attribute_show, NULL);
>  static struct device_attribute dev_mtd_num =
>  	__ATTR(mtd_num, S_IRUGO, dev_attribute_show, NULL);
> +static struct device_attribute dev_trigger_bitrot_check =
> +	__ATTR(trigger_bitrot_check, S_IWUSR, NULL, trigger_bitrot_check);

How about making this attribute a RW one, so that users could check
if there's a bitrot check in progress.

>  
>  /**
>   * ubi_volume_notify - send a volume change notification.
> @@ -334,6 +339,36 @@ int ubi_major2num(int major)
>  	return ubi_num;
>  }
>  
> +/* "Store" method for file '/<sysfs>/class/ubi/ubiX/trigger_bitrot_check' */
> +static ssize_t trigger_bitrot_check(struct device *dev,
> +				    struct device_attribute *mattr,
> +				    const char *data, size_t count)
> +{
> +	struct ubi_device *ubi;
> +	int ret;
> +

Maybe that's on purpose, but you do not check the value passed in data
(in your documention you suggest to do an
echo 1 > /sys/class/ubi/ubiX/trigger_bitrot_check).

> +	ubi = container_of(dev, struct ubi_device, dev);
> +	ubi = ubi_get_device(ubi->ubi_num);
> +	if (!ubi) {
> +		ret = -ENODEV;
> +		goto out;
> +	}
> +
> +	if (atomic_inc_return(&ubi->bit_rot_work) != 1) {
> +		ret = -EBUSY;
> +		goto out_dec;
> +	}
> +
> +	ubi_wl_trigger_bitrot_check(ubi);
> +	ret = count;
> +
> +out_dec:
> +	atomic_dec(&ubi->bit_rot_work);
> +out:
> +	ubi_put_device(ubi);
> +	return ret;
> +}
> +
>  /* "Show" method for files in '/<sysfs>/class/ubi/ubiX/' */
>  static ssize_t dev_attribute_show(struct device *dev,
>  				  struct device_attribute *attr, char *buf)
> @@ -445,6 +480,9 @@ static int ubi_sysfs_init(struct ubi_device *ubi, int *ref)
>  	if (err)
>  		return err;
>  	err = device_create_file(&ubi->dev, &dev_mtd_num);
> +	if (err)
> +		return err;
> +	err = device_create_file(&ubi->dev, &dev_trigger_bitrot_check);
>  	return err;

You don't seem to control the return code, so, how about replacing
those 2 lines by:

	return device_create_file(&ubi->dev, &dev_trigger_bitrot_check);

>  }

[...]

> diff --git a/drivers/mtd/ubi/wl.c b/drivers/mtd/ubi/wl.c
> index 9b11db9..784bb52 100644
> --- a/drivers/mtd/ubi/wl.c
> +++ b/drivers/mtd/ubi/wl.c
> @@ -1447,6 +1447,150 @@ static void tree_destroy(struct ubi_device *ubi, struct rb_root *root)
>  }
>  
>  /**
> + * bitrot_check_worker - physical eraseblock bitrot check worker function.
> + * @ubi: UBI device description object
> + * @wl_wrk: the work object
> + * @shutdown: non-zero if the worker has to free memory and exit
> + *
> + * This function reads a physical eraseblock and schedules scrubbing if
> + * bit flips are detected.
> + */
> +static int bitrot_check_worker(struct ubi_device *ubi, struct ubi_work *wl_wrk,
> +			       int shutdown)
> +{
> +	struct ubi_wl_entry *e = wl_wrk->e;
> +	int err;
> +
> +	kfree(wl_wrk);
> +	if (shutdown) {
> +		dbg_wl("cancel bitrot check of PEB %d", e->pnum);
> +		wl_entry_destroy(ubi, e);
> +		return 0;
> +	}
> +
> +	mutex_lock(&ubi->buf_mutex);
> +	err = ubi_io_read(ubi, ubi->peb_buf, e->pnum, 0, ubi->peb_size);
> +	mutex_unlock(&ubi->buf_mutex);
> +	if (err == UBI_IO_BITFLIPS) {
> +		dbg_wl("found bitflips in PEB %d", e->pnum);
> +		spin_lock(&ubi->wl_lock);
> +
> +		if (in_pq(ubi, e)) {
> +			prot_queue_del(ubi, e->pnum);
> +			wl_tree_add(e, &ubi->scrub);
> +			spin_unlock(&ubi->wl_lock);
> +
> +			err = ensure_wear_leveling(ubi, 1);
> +		}
> +		else if (in_wl_tree(e, &ubi->used)) {
> +			rb_erase(&e->u.rb, &ubi->used);
> +			wl_tree_add(e, &ubi->scrub);
> +			spin_unlock(&ubi->wl_lock);
> +
> +			err = ensure_wear_leveling(ubi, 1);
> +		}
> +		else if (in_wl_tree(e, &ubi->free)) {
> +			rb_erase(&e->u.rb, &ubi->free);
> +			spin_unlock(&ubi->wl_lock);
> +
> +			wl_wrk = prepare_erase_work(e, -1, -1, 1);
> +			if (IS_ERR(wl_wrk)) {
> +				err = PTR_ERR(wl_wrk);
> +				goto out;
> +			}
> +
> +			__schedule_ubi_work(ubi, wl_wrk);
> +			err = 0;
> +		}
> +		/*
> +		 * e is target of a move operation, all we can do is kicking
> +		 * wear leveling such that we can catch it later or wear
> +		 * leveling itself scrubbs the PEB.
> +		 */
> +		else if (ubi->move_to == e || ubi->move_from == e) {
> +			spin_unlock(&ubi->wl_lock);
> +
> +			err = ensure_wear_leveling(ubi, 1);
> +		}
> +		/*
> +		 * e is member of a fastmap pool. We are not allowed to
> +		 * remove it from that pool as the on-flash fastmap data
> +		 * structure refers to it. Let's schedule a new fastmap write
> +		 * such that the said PEB can get released.
> +		 */
> +		else {
> +			ubi_schedule_fm_work(ubi);
> +			spin_unlock(&ubi->wl_lock);
> +
> +			err = 0;
> +		}

Nitpick, but checkpatch complains about 'else' or 'else if' statements
that are not on the '}' line.

> +	}
> +	else {
> +		/*
> +		 * Ignore read errors as we return only work related errors.
> +		 * Read errors will be logged by ubi_io_read().
> +		 */
> +		err = 0;
> +	}

Nitpicking again, but you can avoid another level of indentation by
doing the following:

	if (err != UBI_IO_BITFLIPS) {
		err = 0;
		goto out;
	}

	dbg_wl("found bitflips in PEB %d", e->pnum);
	spin_lock(&ubi->wl_lock);
	/* ... */

> +
> +out:
> +	atomic_dec(&ubi->bit_rot_work);
> +	ubi_assert(atomic_read(&ubi->bit_rot_work) >= 0);

How about replacing those two lines by:

	ubi_assert(atomic_dec_return(&ubi->bit_rot_work) >= 0);

> +	return err;
> +}

Best Regards,

Boris

Boris Brezillon April 12, 2015, 3:14 p.m. UTC | #8

Second pass on this patch :-).

On Sun, 29 Mar 2015 14:13:17 +0200
Richard Weinberger <richard@nod.at> wrote:

>  /**
> + * bitrot_check_worker - physical eraseblock bitrot check worker function.
> + * @ubi: UBI device description object
> + * @wl_wrk: the work object
> + * @shutdown: non-zero if the worker has to free memory and exit
> + *
> + * This function reads a physical eraseblock and schedules scrubbing if
> + * bit flips are detected.
> + */
> +static int bitrot_check_worker(struct ubi_device *ubi, struct ubi_work *wl_wrk,
> +			       int shutdown)
> +{
> +	struct ubi_wl_entry *e = wl_wrk->e;
> +	int err;
> +
> +	kfree(wl_wrk);
> +	if (shutdown) {
> +		dbg_wl("cancel bitrot check of PEB %d", e->pnum);
> +		wl_entry_destroy(ubi, e);
> +		return 0;
> +	}
> +
> +	mutex_lock(&ubi->buf_mutex);
> +	err = ubi_io_read(ubi, ubi->peb_buf, e->pnum, 0, ubi->peb_size);
> +	mutex_unlock(&ubi->buf_mutex);
> +	if (err == UBI_IO_BITFLIPS) {
> +		dbg_wl("found bitflips in PEB %d", e->pnum);
> +		spin_lock(&ubi->wl_lock);
> +
> +		if (in_pq(ubi, e)) {
> +			prot_queue_del(ubi, e->pnum);
> +			wl_tree_add(e, &ubi->scrub);
> +			spin_unlock(&ubi->wl_lock);
> +
> +			err = ensure_wear_leveling(ubi, 1);
> +		}
> +		else if (in_wl_tree(e, &ubi->used)) {
> +			rb_erase(&e->u.rb, &ubi->used);
> +			wl_tree_add(e, &ubi->scrub);
> +			spin_unlock(&ubi->wl_lock);
> +
> +			err = ensure_wear_leveling(ubi, 1);
> +		}
> +		else if (in_wl_tree(e, &ubi->free)) {
> +			rb_erase(&e->u.rb, &ubi->free);
> +			spin_unlock(&ubi->wl_lock);
> +

IMHO the following code chunk, starting here:

> +			wl_wrk = prepare_erase_work(e, -1, -1, 1);
> +			if (IS_ERR(wl_wrk)) {
> +				err = PTR_ERR(wl_wrk);
> +				goto out;
> +			}
> +
> +			__schedule_ubi_work(ubi, wl_wrk);

and ending here ^, could be placed in an helper function
(re_erase_peb ?)

> +			err = 0;
> +		}
> +		/*
> +		 * e is target of a move operation, all we can do is kicking
> +		 * wear leveling such that we can catch it later or wear
> +		 * leveling itself scrubbs the PEB.
> +		 */
> +		else if (ubi->move_to == e || ubi->move_from == e) {
> +			spin_unlock(&ubi->wl_lock);
> +
> +			err = ensure_wear_leveling(ubi, 1);
> +		}
> +		/*
> +		 * e is member of a fastmap pool. We are not allowed to
> +		 * remove it from that pool as the on-flash fastmap data
> +		 * structure refers to it. Let's schedule a new fastmap write
> +		 * such that the said PEB can get released.
> +		 */
> +		else {
> +			ubi_schedule_fm_work(ubi);
> +			spin_unlock(&ubi->wl_lock);
> +
> +			err = 0;
> +		}

I'm nitpicking again, but I like to have a single place where spinlocks
are locked and unlocked, so here is a rework suggestion for the code
inside the 'if (err == UBI_IO_BITFLIPS)' statement:

		bool ensure_wl = false, scrub = false, re_erase = false;

		spin_lock(&ubi->wl_lock);

		if (in_pq(ubi, e)) {
			prot_queue_del(ubi, e->pnum);
			scrub = true;
		} else if (in_wl_tree(e, &ubi->used)) {
			rb_erase(&e->u.rb, &ubi->used);
			scrub = true;
		} else if (in_wl_tree(e, &ubi->free)) {
			rb_erase(&e->u.rb, &ubi->free);
			re_erase = true;
		} else if (ubi->move_to == e || ubi->move_from == e) {
			ensure_wl = true;
		} else {
			ubi_schedule_fm_work(ubi);
		}

		if (scrub)
			wl_tree_add(e, &ubi->scrub);

		spin_unlock(&ubi->wl_lock);

		if (scrub || ensure_wl)
			err = ensure_wear_leveling(ubi, 1);
		else if (re_erase)
			err = re_erase_peb(ubi, e);
		else
			err = 0;

Best Regards,

Boris

Richard Weinberger April 12, 2015, 4:09 p.m. UTC | #9

Am 12.04.2015 um 16:12 schrieb Boris Brezillon:
> Hi Richard,
> 
> Sorry for the late reply.
> 
> On Sun, 29 Mar 2015 14:13:17 +0200
> Richard Weinberger <richard@nod.at> wrote:
> 
>> This patch implements bitrot checking for UBI.
>> ubi_wl_trigger_bitrot_check() triggers a re-read of every
>> PEB. If a bitflip is detected PEBs in use will get scrubbed
>> and free ones erased.
> 
> As you'll see, I didn't have much to say about the 'UBI bitrot
> detection' mechanism, so this review is a collection of
> nitpicks :-).
> 
>>
>> Signed-off-by: Richard Weinberger <richard@nod.at>
>> ---
>>  drivers/mtd/ubi/build.c |  39 +++++++++++++
>>  drivers/mtd/ubi/ubi.h   |   4 ++
>>  drivers/mtd/ubi/wl.c    | 146 ++++++++++++++++++++++++++++++++++++++++++++++++
>>  3 files changed, 189 insertions(+)
>>
>> diff --git a/drivers/mtd/ubi/build.c b/drivers/mtd/ubi/build.c
>> index 9690cf9..f58330b 100644
>> --- a/drivers/mtd/ubi/build.c
>> +++ b/drivers/mtd/ubi/build.c
>> @@ -118,6 +118,9 @@ static struct class_attribute ubi_version =
>>  
>>  static ssize_t dev_attribute_show(struct device *dev,
>>  				  struct device_attribute *attr, char *buf);
>> +static ssize_t trigger_bitrot_check(struct device *dev,
>> +				    struct device_attribute *mattr,
>> +				    const char *data, size_t count);
>>  
>>  /* UBI device attributes (correspond to files in '/<sysfs>/class/ubi/ubiX') */
>>  static struct device_attribute dev_eraseblock_size =
>> @@ -142,6 +145,8 @@ static struct device_attribute dev_bgt_enabled =
>>  	__ATTR(bgt_enabled, S_IRUGO, dev_attribute_show, NULL);
>>  static struct device_attribute dev_mtd_num =
>>  	__ATTR(mtd_num, S_IRUGO, dev_attribute_show, NULL);
>> +static struct device_attribute dev_trigger_bitrot_check =
>> +	__ATTR(trigger_bitrot_check, S_IWUSR, NULL, trigger_bitrot_check);
> 
> How about making this attribute a RW one, so that users could check
> if there's a bitrot check in progress.

As the check will be initiated only by userspace and writing to the trigger
while a check is running will return anyway a EBUSY I don't really see
a point why userspace would check for it.

>>  
>>  /**
>>   * ubi_volume_notify - send a volume change notification.
>> @@ -334,6 +339,36 @@ int ubi_major2num(int major)
>>  	return ubi_num;
>>  }
>>  
>> +/* "Store" method for file '/<sysfs>/class/ubi/ubiX/trigger_bitrot_check' */
>> +static ssize_t trigger_bitrot_check(struct device *dev,
>> +				    struct device_attribute *mattr,
>> +				    const char *data, size_t count)
>> +{
>> +	struct ubi_device *ubi;
>> +	int ret;
>> +
> 
> Maybe that's on purpose, but you do not check the value passed in data
> (in your documention you suggest to do an
> echo 1 > /sys/class/ubi/ubiX/trigger_bitrot_check).

Yeah, the example using "1", but why should I limit it to it?
The idea was that any write will trigger a check.

>> +	ubi = container_of(dev, struct ubi_device, dev);
>> +	ubi = ubi_get_device(ubi->ubi_num);
>> +	if (!ubi) {
>> +		ret = -ENODEV;
>> +		goto out;
>> +	}
>> +
>> +	if (atomic_inc_return(&ubi->bit_rot_work) != 1) {
>> +		ret = -EBUSY;
>> +		goto out_dec;
>> +	}
>> +
>> +	ubi_wl_trigger_bitrot_check(ubi);
>> +	ret = count;
>> +
>> +out_dec:
>> +	atomic_dec(&ubi->bit_rot_work);
>> +out:
>> +	ubi_put_device(ubi);
>> +	return ret;
>> +}
>> +
>>  /* "Show" method for files in '/<sysfs>/class/ubi/ubiX/' */
>>  static ssize_t dev_attribute_show(struct device *dev,
>>  				  struct device_attribute *attr, char *buf)
>> @@ -445,6 +480,9 @@ static int ubi_sysfs_init(struct ubi_device *ubi, int *ref)
>>  	if (err)
>>  		return err;
>>  	err = device_create_file(&ubi->dev, &dev_mtd_num);
>> +	if (err)
>> +		return err;
>> +	err = device_create_file(&ubi->dev, &dev_trigger_bitrot_check);
>>  	return err;
> 
> You don't seem to control the return code, so, how about replacing
> those 2 lines by:
> 
> 	return device_create_file(&ubi->dev, &dev_trigger_bitrot_check);

I did it exactly like the existing code does.
But I can replace the lines...

>>  }
> 
> [...]
> 
>> diff --git a/drivers/mtd/ubi/wl.c b/drivers/mtd/ubi/wl.c
>> index 9b11db9..784bb52 100644
>> --- a/drivers/mtd/ubi/wl.c
>> +++ b/drivers/mtd/ubi/wl.c
>> @@ -1447,6 +1447,150 @@ static void tree_destroy(struct ubi_device *ubi, struct rb_root *root)
>>  }
>>  
>>  /**
>> + * bitrot_check_worker - physical eraseblock bitrot check worker function.
>> + * @ubi: UBI device description object
>> + * @wl_wrk: the work object
>> + * @shutdown: non-zero if the worker has to free memory and exit
>> + *
>> + * This function reads a physical eraseblock and schedules scrubbing if
>> + * bit flips are detected.
>> + */
>> +static int bitrot_check_worker(struct ubi_device *ubi, struct ubi_work *wl_wrk,
>> +			       int shutdown)
>> +{
>> +	struct ubi_wl_entry *e = wl_wrk->e;
>> +	int err;
>> +
>> +	kfree(wl_wrk);
>> +	if (shutdown) {
>> +		dbg_wl("cancel bitrot check of PEB %d", e->pnum);
>> +		wl_entry_destroy(ubi, e);
>> +		return 0;
>> +	}
>> +
>> +	mutex_lock(&ubi->buf_mutex);
>> +	err = ubi_io_read(ubi, ubi->peb_buf, e->pnum, 0, ubi->peb_size);
>> +	mutex_unlock(&ubi->buf_mutex);
>> +	if (err == UBI_IO_BITFLIPS) {
>> +		dbg_wl("found bitflips in PEB %d", e->pnum);
>> +		spin_lock(&ubi->wl_lock);
>> +
>> +		if (in_pq(ubi, e)) {
>> +			prot_queue_del(ubi, e->pnum);
>> +			wl_tree_add(e, &ubi->scrub);
>> +			spin_unlock(&ubi->wl_lock);
>> +
>> +			err = ensure_wear_leveling(ubi, 1);
>> +		}
>> +		else if (in_wl_tree(e, &ubi->used)) {
>> +			rb_erase(&e->u.rb, &ubi->used);
>> +			wl_tree_add(e, &ubi->scrub);
>> +			spin_unlock(&ubi->wl_lock);
>> +
>> +			err = ensure_wear_leveling(ubi, 1);
>> +		}
>> +		else if (in_wl_tree(e, &ubi->free)) {
>> +			rb_erase(&e->u.rb, &ubi->free);
>> +			spin_unlock(&ubi->wl_lock);
>> +
>> +			wl_wrk = prepare_erase_work(e, -1, -1, 1);
>> +			if (IS_ERR(wl_wrk)) {
>> +				err = PTR_ERR(wl_wrk);
>> +				goto out;
>> +			}
>> +
>> +			__schedule_ubi_work(ubi, wl_wrk);
>> +			err = 0;
>> +		}
>> +		/*
>> +		 * e is target of a move operation, all we can do is kicking
>> +		 * wear leveling such that we can catch it later or wear
>> +		 * leveling itself scrubbs the PEB.
>> +		 */
>> +		else if (ubi->move_to == e || ubi->move_from == e) {
>> +			spin_unlock(&ubi->wl_lock);
>> +
>> +			err = ensure_wear_leveling(ubi, 1);
>> +		}
>> +		/*
>> +		 * e is member of a fastmap pool. We are not allowed to
>> +		 * remove it from that pool as the on-flash fastmap data
>> +		 * structure refers to it. Let's schedule a new fastmap write
>> +		 * such that the said PEB can get released.
>> +		 */
>> +		else {
>> +			ubi_schedule_fm_work(ubi);
>> +			spin_unlock(&ubi->wl_lock);
>> +
>> +			err = 0;
>> +		}
> 
> Nitpick, but checkpatch complains about 'else' or 'else if' statements
> that are not on the '}' line.

I like it as is because I can nicely place the comment above the else {.
And checkpatch is not our lawmaker.

>> +	}
>> +	else {
>> +		/*
>> +		 * Ignore read errors as we return only work related errors.
>> +		 * Read errors will be logged by ubi_io_read().
>> +		 */
>> +		err = 0;
>> +	}
> 
> Nitpicking again, but you can avoid another level of indentation by
> doing the following:
> 
> 	if (err != UBI_IO_BITFLIPS) {
> 		err = 0;
> 		goto out;
> 	}
> 
> 	dbg_wl("found bitflips in PEB %d", e->pnum);
> 	spin_lock(&ubi->wl_lock);
> 	/* ... */
> 
>> +
>> +out:
>> +	atomic_dec(&ubi->bit_rot_work);
>> +	ubi_assert(atomic_read(&ubi->bit_rot_work) >= 0);
> 
> How about replacing those two lines by:
> 
> 	ubi_assert(atomic_dec_return(&ubi->bit_rot_work) >= 0);

True, I always forget how many nice atomic_* helper we have. :)

Thanks,
//richard

Richard Weinberger April 12, 2015, 4:14 p.m. UTC | #10

Am 12.04.2015 um 17:14 schrieb Boris Brezillon:
> Second pass on this patch :-).
> 
> On Sun, 29 Mar 2015 14:13:17 +0200
> Richard Weinberger <richard@nod.at> wrote:
> 
>>  /**
>> + * bitrot_check_worker - physical eraseblock bitrot check worker function.
>> + * @ubi: UBI device description object
>> + * @wl_wrk: the work object
>> + * @shutdown: non-zero if the worker has to free memory and exit
>> + *
>> + * This function reads a physical eraseblock and schedules scrubbing if
>> + * bit flips are detected.
>> + */
>> +static int bitrot_check_worker(struct ubi_device *ubi, struct ubi_work *wl_wrk,
>> +			       int shutdown)
>> +{
>> +	struct ubi_wl_entry *e = wl_wrk->e;
>> +	int err;
>> +
>> +	kfree(wl_wrk);
>> +	if (shutdown) {
>> +		dbg_wl("cancel bitrot check of PEB %d", e->pnum);
>> +		wl_entry_destroy(ubi, e);
>> +		return 0;
>> +	}
>> +
>> +	mutex_lock(&ubi->buf_mutex);
>> +	err = ubi_io_read(ubi, ubi->peb_buf, e->pnum, 0, ubi->peb_size);
>> +	mutex_unlock(&ubi->buf_mutex);
>> +	if (err == UBI_IO_BITFLIPS) {
>> +		dbg_wl("found bitflips in PEB %d", e->pnum);
>> +		spin_lock(&ubi->wl_lock);
>> +
>> +		if (in_pq(ubi, e)) {
>> +			prot_queue_del(ubi, e->pnum);
>> +			wl_tree_add(e, &ubi->scrub);
>> +			spin_unlock(&ubi->wl_lock);
>> +
>> +			err = ensure_wear_leveling(ubi, 1);
>> +		}
>> +		else if (in_wl_tree(e, &ubi->used)) {
>> +			rb_erase(&e->u.rb, &ubi->used);
>> +			wl_tree_add(e, &ubi->scrub);
>> +			spin_unlock(&ubi->wl_lock);
>> +
>> +			err = ensure_wear_leveling(ubi, 1);
>> +		}
>> +		else if (in_wl_tree(e, &ubi->free)) {
>> +			rb_erase(&e->u.rb, &ubi->free);
>> +			spin_unlock(&ubi->wl_lock);
>> +
> 
> IMHO the following code chunk, starting here:
> 
>> +			wl_wrk = prepare_erase_work(e, -1, -1, 1);
>> +			if (IS_ERR(wl_wrk)) {
>> +				err = PTR_ERR(wl_wrk);
>> +				goto out;
>> +			}
>> +
>> +			__schedule_ubi_work(ubi, wl_wrk);
> 
> and ending here ^, could be placed in an helper function
> (re_erase_peb ?)

As long we have only one user of that pattern I'd keep it as is.
We have in UBI already a gazillion helper functions.

>> +			err = 0;
>> +		}
>> +		/*
>> +		 * e is target of a move operation, all we can do is kicking
>> +		 * wear leveling such that we can catch it later or wear
>> +		 * leveling itself scrubbs the PEB.
>> +		 */
>> +		else if (ubi->move_to == e || ubi->move_from == e) {
>> +			spin_unlock(&ubi->wl_lock);
>> +
>> +			err = ensure_wear_leveling(ubi, 1);
>> +		}
>> +		/*
>> +		 * e is member of a fastmap pool. We are not allowed to
>> +		 * remove it from that pool as the on-flash fastmap data
>> +		 * structure refers to it. Let's schedule a new fastmap write
>> +		 * such that the said PEB can get released.
>> +		 */
>> +		else {
>> +			ubi_schedule_fm_work(ubi);
>> +			spin_unlock(&ubi->wl_lock);
>> +
>> +			err = 0;
>> +		}
> 
> I'm nitpicking again, but I like to have a single place where spinlocks
> are locked and unlocked, so here is a rework suggestion for the code
> inside the 'if (err == UBI_IO_BITFLIPS)' statement:

A single lock/unlock place is nice but in this case the whole logic fits
into a single page on screen. "do_this" and "do_that" variables don't make
the code more readable IMHO.
But as with all nitpicks it is a matter of taste and we could waste multiple
days on such things.

Thanks,
//richard

Boris Brezillon April 12, 2015, 4:31 p.m. UTC | #11

On Sun, 12 Apr 2015 18:14:40 +0200
Richard Weinberger <richard@nod.at> wrote:

> > IMHO the following code chunk, starting here:
> > 
> >> +			wl_wrk = prepare_erase_work(e, -1, -1, 1);
> >> +			if (IS_ERR(wl_wrk)) {
> >> +				err = PTR_ERR(wl_wrk);
> >> +				goto out;
> >> +			}
> >> +
> >> +			__schedule_ubi_work(ubi, wl_wrk);
> > 
> > and ending here ^, could be placed in an helper function
> > (re_erase_peb ?)
> 
> As long we have only one user of that pattern I'd keep it as is.
> We have in UBI already a gazillion helper functions.

Okay, then maybe you should comment what you're doing here: erase an
already erased PEB where bitflips have occured.

> 
> >> +			err = 0;
> >> +		}
> >> +		/*
> >> +		 * e is target of a move operation, all we can do is kicking
> >> +		 * wear leveling such that we can catch it later or wear
> >> +		 * leveling itself scrubbs the PEB.
> >> +		 */
> >> +		else if (ubi->move_to == e || ubi->move_from == e) {
> >> +			spin_unlock(&ubi->wl_lock);
> >> +
> >> +			err = ensure_wear_leveling(ubi, 1);
> >> +		}
> >> +		/*
> >> +		 * e is member of a fastmap pool. We are not allowed to
> >> +		 * remove it from that pool as the on-flash fastmap data
> >> +		 * structure refers to it. Let's schedule a new fastmap write
> >> +		 * such that the said PEB can get released.
> >> +		 */
> >> +		else {
> >> +			ubi_schedule_fm_work(ubi);
> >> +			spin_unlock(&ubi->wl_lock);
> >> +
> >> +			err = 0;
> >> +		}
> > 
> > I'm nitpicking again, but I like to have a single place where spinlocks
> > are locked and unlocked, so here is a rework suggestion for the code
> > inside the 'if (err == UBI_IO_BITFLIPS)' statement:
> 
> A single lock/unlock place is nice but in this case the whole logic fits
> into a single page on screen. "do_this" and "do_that" variables don't make
> the code more readable IMHO.
> But as with all nitpicks it is a matter of taste and we could waste multiple
> days on such things.

Isn't that the whole point of code reviews :-P ?

Richard Weinberger April 12, 2015, 4:32 p.m. UTC | #12

Am 12.04.2015 um 18:31 schrieb Boris Brezillon:
> On Sun, 12 Apr 2015 18:14:40 +0200
> Richard Weinberger <richard@nod.at> wrote:
> 
>>> IMHO the following code chunk, starting here:
>>>
>>>> +			wl_wrk = prepare_erase_work(e, -1, -1, 1);
>>>> +			if (IS_ERR(wl_wrk)) {
>>>> +				err = PTR_ERR(wl_wrk);
>>>> +				goto out;
>>>> +			}
>>>> +
>>>> +			__schedule_ubi_work(ubi, wl_wrk);
>>>
>>> and ending here ^, could be placed in an helper function
>>> (re_erase_peb ?)
>>
>> As long we have only one user of that pattern I'd keep it as is.
>> We have in UBI already a gazillion helper functions.
> 
> Okay, then maybe you should comment what you're doing here: erase an
> already erased PEB where bitflips have occured.

Makes sense!

>>
>>>> +			err = 0;
>>>> +		}
>>>> +		/*
>>>> +		 * e is target of a move operation, all we can do is kicking
>>>> +		 * wear leveling such that we can catch it later or wear
>>>> +		 * leveling itself scrubbs the PEB.
>>>> +		 */
>>>> +		else if (ubi->move_to == e || ubi->move_from == e) {
>>>> +			spin_unlock(&ubi->wl_lock);
>>>> +
>>>> +			err = ensure_wear_leveling(ubi, 1);
>>>> +		}
>>>> +		/*
>>>> +		 * e is member of a fastmap pool. We are not allowed to
>>>> +		 * remove it from that pool as the on-flash fastmap data
>>>> +		 * structure refers to it. Let's schedule a new fastmap write
>>>> +		 * such that the said PEB can get released.
>>>> +		 */
>>>> +		else {
>>>> +			ubi_schedule_fm_work(ubi);
>>>> +			spin_unlock(&ubi->wl_lock);
>>>> +
>>>> +			err = 0;
>>>> +		}
>>>
>>> I'm nitpicking again, but I like to have a single place where spinlocks
>>> are locked and unlocked, so here is a rework suggestion for the code
>>> inside the 'if (err == UBI_IO_BITFLIPS)' statement:
>>
>> A single lock/unlock place is nice but in this case the whole logic fits
>> into a single page on screen. "do_this" and "do_that" variables don't make
>> the code more readable IMHO.
>> But as with all nitpicks it is a matter of taste and we could waste multiple
>> days on such things.
> 
> Isn't that the whole point of code reviews :-P ?

;-)

Thanks,
//richard

Boris Brezillon April 12, 2015, 4:43 p.m. UTC | #13

On Sun, 12 Apr 2015 18:09:23 +0200
Richard Weinberger <richard@nod.at> wrote:

> Am 12.04.2015 um 16:12 schrieb Boris Brezillon:
> > Hi Richard,
> > 
> > Sorry for the late reply.
> > 
> > On Sun, 29 Mar 2015 14:13:17 +0200
> > Richard Weinberger <richard@nod.at> wrote:
> > 
> >> This patch implements bitrot checking for UBI.
> >> ubi_wl_trigger_bitrot_check() triggers a re-read of every
> >> PEB. If a bitflip is detected PEBs in use will get scrubbed
> >> and free ones erased.
> > 
> > As you'll see, I didn't have much to say about the 'UBI bitrot
> > detection' mechanism, so this review is a collection of
> > nitpicks :-).
> > 
> >>
> >> Signed-off-by: Richard Weinberger <richard@nod.at>
> >> ---
> >>  drivers/mtd/ubi/build.c |  39 +++++++++++++
> >>  drivers/mtd/ubi/ubi.h   |   4 ++
> >>  drivers/mtd/ubi/wl.c    | 146 ++++++++++++++++++++++++++++++++++++++++++++++++
> >>  3 files changed, 189 insertions(+)
> >>
> >> diff --git a/drivers/mtd/ubi/build.c b/drivers/mtd/ubi/build.c
> >> index 9690cf9..f58330b 100644
> >> --- a/drivers/mtd/ubi/build.c
> >> +++ b/drivers/mtd/ubi/build.c
> >> @@ -118,6 +118,9 @@ static struct class_attribute ubi_version =
> >>  
> >>  static ssize_t dev_attribute_show(struct device *dev,
> >>  				  struct device_attribute *attr, char *buf);
> >> +static ssize_t trigger_bitrot_check(struct device *dev,
> >> +				    struct device_attribute *mattr,
> >> +				    const char *data, size_t count);
> >>  
> >>  /* UBI device attributes (correspond to files in '/<sysfs>/class/ubi/ubiX') */
> >>  static struct device_attribute dev_eraseblock_size =
> >> @@ -142,6 +145,8 @@ static struct device_attribute dev_bgt_enabled =
> >>  	__ATTR(bgt_enabled, S_IRUGO, dev_attribute_show, NULL);
> >>  static struct device_attribute dev_mtd_num =
> >>  	__ATTR(mtd_num, S_IRUGO, dev_attribute_show, NULL);
> >> +static struct device_attribute dev_trigger_bitrot_check =
> >> +	__ATTR(trigger_bitrot_check, S_IWUSR, NULL, trigger_bitrot_check);
> > 
> > How about making this attribute a RW one, so that users could check
> > if there's a bitrot check in progress.
> 
> As the check will be initiated only by userspace and writing to the trigger
> while a check is running will return anyway a EBUSY I don't really see
> a point why userspace would check for it.

Sometime you just want to know whether something is running or not (in
this case the bitrot check) without risking to trigger a new action...

> 
> >>  
> >>  /**
> >>   * ubi_volume_notify - send a volume change notification.
> >> @@ -334,6 +339,36 @@ int ubi_major2num(int major)
> >>  	return ubi_num;
> >>  }
> >>  
> >> +/* "Store" method for file '/<sysfs>/class/ubi/ubiX/trigger_bitrot_check' */
> >> +static ssize_t trigger_bitrot_check(struct device *dev,
> >> +				    struct device_attribute *mattr,
> >> +				    const char *data, size_t count)
> >> +{
> >> +	struct ubi_device *ubi;
> >> +	int ret;
> >> +
> > 
> > Maybe that's on purpose, but you do not check the value passed in data
> > (in your documention you suggest to do an
> > echo 1 > /sys/class/ubi/ubiX/trigger_bitrot_check).
> 
> Yeah, the example using "1", but why should I limit it to it?
> The idea was that any write will trigger a check.

Okay.


[...]

> >> +		/*
> >> +		 * e is member of a fastmap pool. We are not allowed to
> >> +		 * remove it from that pool as the on-flash fastmap data
> >> +		 * structure refers to it. Let's schedule a new fastmap write
> >> +		 * such that the said PEB can get released.
> >> +		 */
> >> +		else {
> >> +			ubi_schedule_fm_work(ubi);
> >> +			spin_unlock(&ubi->wl_lock);
> >> +
> >> +			err = 0;
> >> +		}
> > 
> > Nitpick, but checkpatch complains about 'else' or 'else if' statements
> > that are not on the '}' line.
> 
> I like it as is because I can nicely place the comment above the else {.
> And checkpatch is not our lawmaker.

You could put your comment after the braces.
Anyway, you might dislike the coding style imposed by kernel
developers/maintainers, but this is what keeps the kernel code
consistent across the different subsystems.
I agree that some checks done by checkpatch can be a bit restrictive in
some cases (like the 80 characters limit), but I really think the
braces and else[ if] placements should be enforced.
This being said, this is your call to make, so I won't complain about
it anymore ;-).

> 
> >> +	}
> >> +	else {
> >> +		/*
> >> +		 * Ignore read errors as we return only work related errors.
> >> +		 * Read errors will be logged by ubi_io_read().
> >> +		 */
> >> +		err = 0;
> >> +	}
> > 
> > Nitpicking again, but you can avoid another level of indentation by
> > doing the following:
> > 
> > 	if (err != UBI_IO_BITFLIPS) {
> > 		err = 0;
> > 		goto out;
> > 	}
> > 
> > 	dbg_wl("found bitflips in PEB %d", e->pnum);
> > 	spin_lock(&ubi->wl_lock);
> > 	/* ... */

You didn't answer to that one.

Richard Weinberger April 12, 2015, 4:55 p.m. UTC | #14

Am 12.04.2015 um 18:43 schrieb Boris Brezillon:
> On Sun, 12 Apr 2015 18:09:23 +0200
> Richard Weinberger <richard@nod.at> wrote:
> 
>> Am 12.04.2015 um 16:12 schrieb Boris Brezillon:
>>> Hi Richard,
>>>
>>> Sorry for the late reply.
>>>
>>> On Sun, 29 Mar 2015 14:13:17 +0200
>>> Richard Weinberger <richard@nod.at> wrote:
>>>
>>>> This patch implements bitrot checking for UBI.
>>>> ubi_wl_trigger_bitrot_check() triggers a re-read of every
>>>> PEB. If a bitflip is detected PEBs in use will get scrubbed
>>>> and free ones erased.
>>>
>>> As you'll see, I didn't have much to say about the 'UBI bitrot
>>> detection' mechanism, so this review is a collection of
>>> nitpicks :-).
>>>
>>>>
>>>> Signed-off-by: Richard Weinberger <richard@nod.at>
>>>> ---
>>>>  drivers/mtd/ubi/build.c |  39 +++++++++++++
>>>>  drivers/mtd/ubi/ubi.h   |   4 ++
>>>>  drivers/mtd/ubi/wl.c    | 146 ++++++++++++++++++++++++++++++++++++++++++++++++
>>>>  3 files changed, 189 insertions(+)
>>>>
>>>> diff --git a/drivers/mtd/ubi/build.c b/drivers/mtd/ubi/build.c
>>>> index 9690cf9..f58330b 100644
>>>> --- a/drivers/mtd/ubi/build.c
>>>> +++ b/drivers/mtd/ubi/build.c
>>>> @@ -118,6 +118,9 @@ static struct class_attribute ubi_version =
>>>>  
>>>>  static ssize_t dev_attribute_show(struct device *dev,
>>>>  				  struct device_attribute *attr, char *buf);
>>>> +static ssize_t trigger_bitrot_check(struct device *dev,
>>>> +				    struct device_attribute *mattr,
>>>> +				    const char *data, size_t count);
>>>>  
>>>>  /* UBI device attributes (correspond to files in '/<sysfs>/class/ubi/ubiX') */
>>>>  static struct device_attribute dev_eraseblock_size =
>>>> @@ -142,6 +145,8 @@ static struct device_attribute dev_bgt_enabled =
>>>>  	__ATTR(bgt_enabled, S_IRUGO, dev_attribute_show, NULL);
>>>>  static struct device_attribute dev_mtd_num =
>>>>  	__ATTR(mtd_num, S_IRUGO, dev_attribute_show, NULL);
>>>> +static struct device_attribute dev_trigger_bitrot_check =
>>>> +	__ATTR(trigger_bitrot_check, S_IWUSR, NULL, trigger_bitrot_check);
>>>
>>> How about making this attribute a RW one, so that users could check
>>> if there's a bitrot check in progress.
>>
>> As the check will be initiated only by userspace and writing to the trigger
>> while a check is running will return anyway a EBUSY I don't really see
>> a point why userspace would check for it.
> 
> Sometime you just want to know whether something is running or not (in
> this case the bitrot check) without risking to trigger a new action...

Why would they care?
But I can add this feature, no problem.

>>
>>>>  
>>>>  /**
>>>>   * ubi_volume_notify - send a volume change notification.
>>>> @@ -334,6 +339,36 @@ int ubi_major2num(int major)
>>>>  	return ubi_num;
>>>>  }
>>>>  
>>>> +/* "Store" method for file '/<sysfs>/class/ubi/ubiX/trigger_bitrot_check' */
>>>> +static ssize_t trigger_bitrot_check(struct device *dev,
>>>> +				    struct device_attribute *mattr,
>>>> +				    const char *data, size_t count)
>>>> +{
>>>> +	struct ubi_device *ubi;
>>>> +	int ret;
>>>> +
>>>
>>> Maybe that's on purpose, but you do not check the value passed in data
>>> (in your documention you suggest to do an
>>> echo 1 > /sys/class/ubi/ubiX/trigger_bitrot_check).
>>
>> Yeah, the example using "1", but why should I limit it to it?
>> The idea was that any write will trigger a check.
> 
> Okay.
> 
> 
> [...]
> 
>>>> +		/*
>>>> +		 * e is member of a fastmap pool. We are not allowed to
>>>> +		 * remove it from that pool as the on-flash fastmap data
>>>> +		 * structure refers to it. Let's schedule a new fastmap write
>>>> +		 * such that the said PEB can get released.
>>>> +		 */
>>>> +		else {
>>>> +			ubi_schedule_fm_work(ubi);
>>>> +			spin_unlock(&ubi->wl_lock);
>>>> +
>>>> +			err = 0;
>>>> +		}
>>>
>>> Nitpick, but checkpatch complains about 'else' or 'else if' statements
>>> that are not on the '}' line.
>>
>> I like it as is because I can nicely place the comment above the else {.
>> And checkpatch is not our lawmaker.
> 
> You could put your comment after the braces.
> Anyway, you might dislike the coding style imposed by kernel
> developers/maintainers, but this is what keeps the kernel code
> consistent across the different subsystems.
> I agree that some checks done by checkpatch can be a bit restrictive in
> some cases (like the 80 characters limit), but I really think the
> braces and else[ if] placements should be enforced.
> This being said, this is your call to make, so I won't complain about
> it anymore ;-).

It is corner case which is not handled by the kernel coding style IMHO.
The sad thing is that checkpatch is not developed by kernel developers.

>>
>>>> +	}
>>>> +	else {
>>>> +		/*
>>>> +		 * Ignore read errors as we return only work related errors.
>>>> +		 * Read errors will be logged by ubi_io_read().
>>>> +		 */
>>>> +		err = 0;
>>>> +	}
>>>
>>> Nitpicking again, but you can avoid another level of indentation by
>>> doing the following:
>>>
>>> 	if (err != UBI_IO_BITFLIPS) {
>>> 		err = 0;
>>> 		goto out;
>>> 	}
>>>
>>> 	dbg_wl("found bitflips in PEB %d", e->pnum);
>>> 	spin_lock(&ubi->wl_lock);
>>> 	/* ... */
> 
> You didn't answer to that one.

Whoops.
Yeah, that makes sense!

Thanks,
//richard

Boris Brezillon April 12, 2015, 5:01 p.m. UTC | #15

Hi Richard,

After the 'coding style related'/'useless' comments, now comes a real
question related to the approach you've taken :-).

On Sun, 29 Mar 2015 14:13:17 +0200
Richard Weinberger <richard@nod.at> wrote:

[...]
> +
> +/**
> + * ubi_wl_trigger_bitrot_check - triggers a re-read of all physical erase
> + * blocks.
> + * @ubi: UBI device description object
> + */
> +void ubi_wl_trigger_bitrot_check(struct ubi_device *ubi)
> +{
> +	int i;
> +	struct ubi_wl_entry *e;
> +
> +	ubi_msg(ubi, "Running a full read check");
> +
> +	for (i = 0; i < ubi->peb_count; i++) {
> +		spin_lock(&ubi->wl_lock);
> +		e = ubi->lookuptbl[i];
> +		spin_unlock(&ubi->wl_lock);
> +		if (e) {
> +			atomic_inc(&ubi->bit_rot_work);
> +			schedule_bitrot_check(ubi, e);
> +		}
> +	}

Do we really need to create a ubi_work per PEB ?
Couldn't we create a single work being rescheduled inside the worker
function (after updating the ubi_wl_entry of course).

I'm pretty sure I'm missing something obvious that you'll probably
point out ;-).

Richard Weinberger April 12, 2015, 5:09 p.m. UTC | #16

Am 12.04.2015 um 19:01 schrieb Boris Brezillon:
> Hi Richard,
> 
> After the 'coding style related'/'useless' comments, now comes a real
> question related to the approach you've taken :-).
> 
> On Sun, 29 Mar 2015 14:13:17 +0200
> Richard Weinberger <richard@nod.at> wrote:
> 
> [...]
>> +
>> +/**
>> + * ubi_wl_trigger_bitrot_check - triggers a re-read of all physical erase
>> + * blocks.
>> + * @ubi: UBI device description object
>> + */
>> +void ubi_wl_trigger_bitrot_check(struct ubi_device *ubi)
>> +{
>> +	int i;
>> +	struct ubi_wl_entry *e;
>> +
>> +	ubi_msg(ubi, "Running a full read check");
>> +
>> +	for (i = 0; i < ubi->peb_count; i++) {
>> +		spin_lock(&ubi->wl_lock);
>> +		e = ubi->lookuptbl[i];
>> +		spin_unlock(&ubi->wl_lock);
>> +		if (e) {
>> +			atomic_inc(&ubi->bit_rot_work);
>> +			schedule_bitrot_check(ubi, e);
>> +		}
>> +	}
> 
> Do we really need to create a ubi_work per PEB ?
> Couldn't we create a single work being rescheduled inside the worker
> function (after updating the ubi_wl_entry of course).

Currently the UBI worker thread handles one PEB per ubi_work. I didn't wanted
to break that pattern. The downside of that approach is that we need more memory.
A few KiB per run.

I'm not sure if I understood your idea. You mean that we schedule one check for
PEB N and this work will re-schedule again a work for PEB N+1?
Using that approach we can safe memory, yes. But is it worth the hassle?
I'd like to avoid works which schedule again other works.
In the current way it is clear where the work is scheduled and how much.

> I'm pretty sure I'm missing something obvious that you'll probably
> point out ;-).

No no, it is a very good question.

Thanks,
//richard

Richard Weinberger April 12, 2015, 5:36 p.m. UTC | #17

Am 12.04.2015 um 19:01 schrieb Boris Brezillon:
> Hi Richard,
> 
> After the 'coding style related'/'useless' comments, now comes a real
> question related to the approach you've taken :-).
> 
> On Sun, 29 Mar 2015 14:13:17 +0200
> Richard Weinberger <richard@nod.at> wrote:
> 
> [...]
>> +
>> +/**
>> + * ubi_wl_trigger_bitrot_check - triggers a re-read of all physical erase
>> + * blocks.
>> + * @ubi: UBI device description object
>> + */
>> +void ubi_wl_trigger_bitrot_check(struct ubi_device *ubi)
>> +{
>> +	int i;
>> +	struct ubi_wl_entry *e;
>> +
>> +	ubi_msg(ubi, "Running a full read check");
>> +
>> +	for (i = 0; i < ubi->peb_count; i++) {
>> +		spin_lock(&ubi->wl_lock);
>> +		e = ubi->lookuptbl[i];
>> +		spin_unlock(&ubi->wl_lock);
>> +		if (e) {
>> +			atomic_inc(&ubi->bit_rot_work);
>> +			schedule_bitrot_check(ubi, e);
>> +		}

After re-reading this loop I realized that I've missed a possible race against
other workers. It can happen that we schedule eX and while being scheduled it
is possible that eX turns bad and some worker frees eX under us.
Not very likely but definitely possible.
Let's see if I can find a solution for that without adding new locks or refcounter to
ubi_wl_entry.

I can think of adding a check to the schedule work code which ensures that work
targeting eX is scheduled only once.

Thanks,
//richard

Boris Brezillon April 12, 2015, 7:20 p.m. UTC | #18

On Sun, 12 Apr 2015 19:09:11 +0200
Richard Weinberger <richard@nod.at> wrote:

> Am 12.04.2015 um 19:01 schrieb Boris Brezillon:
> > Hi Richard,
> > 
> > After the 'coding style related'/'useless' comments, now comes a real
> > question related to the approach you've taken :-).
> > 
> > On Sun, 29 Mar 2015 14:13:17 +0200
> > Richard Weinberger <richard@nod.at> wrote:
> > 
> > [...]
> >> +
> >> +/**
> >> + * ubi_wl_trigger_bitrot_check - triggers a re-read of all physical erase
> >> + * blocks.
> >> + * @ubi: UBI device description object
> >> + */
> >> +void ubi_wl_trigger_bitrot_check(struct ubi_device *ubi)
> >> +{
> >> +	int i;
> >> +	struct ubi_wl_entry *e;
> >> +
> >> +	ubi_msg(ubi, "Running a full read check");
> >> +
> >> +	for (i = 0; i < ubi->peb_count; i++) {
> >> +		spin_lock(&ubi->wl_lock);
> >> +		e = ubi->lookuptbl[i];
> >> +		spin_unlock(&ubi->wl_lock);
> >> +		if (e) {
> >> +			atomic_inc(&ubi->bit_rot_work);
> >> +			schedule_bitrot_check(ubi, e);
> >> +		}
> >> +	}
> > 
> > Do we really need to create a ubi_work per PEB ?
> > Couldn't we create a single work being rescheduled inside the worker
> > function (after updating the ubi_wl_entry of course).
> 
> Currently the UBI worker thread handles one PEB per ubi_work. I didn't wanted
> to break that pattern. The downside of that approach is that we need more memory.
> A few KiB per run.
> 
> I'm not sure if I understood your idea. You mean that we schedule one check for
> PEB N and this work will re-schedule again a work for PEB N+1?

That's exactly what I meant.

> Using that approach we can safe memory, yes. But is it worth the hassle?

Unless I'm missing something, it should be pretty easy to implement:
adding the following lines at the end of bitrot_check_worker() should do
the trick

	if (e->pnum + 1 < ubi->peb_count) {
		wl_wrk->e = ubi->lookuptbl[e->pnum + 1];
		__schedule_ubi_work(ubi, wl_wrk);
	} else {
		atomic_dec(&ubi->bit_rot_work);
	}
	

> I'd like to avoid works which schedule again other works.
> In the current way it is clear where the work is scheduled and how much.

Yes, but the memory consumption induced by this approach can be pretty
big on modern NAND chips (on 32 bit platforms, ubi_work is 32 octets
large, and on modern NANDs you often have 4096 blocks, so a UBI device
of 4000 block is pretty common => 4000 * 32 = 125 KiB).

For standard wear leveling requests, using a ubi_work per request is
sensible since you can't know in advance which block will be queued for
wear-leveling operation next time.
In your case, you're scanning all blocks in ascending order, which
makes it a good candidate for this 'one work for all bitrot checks'
approach.

Richard Weinberger April 12, 2015, 7:53 p.m. UTC | #19

Am 12.04.2015 um 21:20 schrieb Boris Brezillon:
> Unless I'm missing something, it should be pretty easy to implement:
> adding the following lines at the end of bitrot_check_worker() should do
> the trick
> 
> 	if (e->pnum + 1 < ubi->peb_count) {
> 		wl_wrk->e = ubi->lookuptbl[e->pnum + 1];
> 		__schedule_ubi_work(ubi, wl_wrk);
> 	} else {
> 		atomic_dec(&ubi->bit_rot_work);
> 	}
> 	

It will suffer from the same race issue as my current approach.
While e is scheduled another worker could free it in case of an fatal
error.

>> I'd like to avoid works which schedule again other works.
>> In the current way it is clear where the work is scheduled and how much.
> 
> Yes, but the memory consumption induced by this approach can be pretty
> big on modern NAND chips (on 32 bit platforms, ubi_work is 32 octets
> large, and on modern NANDs you often have 4096 blocks, so a UBI device
> of 4000 block is pretty common => 4000 * 32 = 125 KiB).

While I agree that consuming memory is not very nice I don't think that 125KiB
is a big deal.

> For standard wear leveling requests, using a ubi_work per request is
> sensible since you can't know in advance which block will be queued for
> wear-leveling operation next time.
> In your case, you're scanning all blocks in ascending order, which
> makes it a good candidate for this 'one work for all bitrot checks'
> approach.

The good news is that I have an idea to solve both problems the race and
the memory issue. It should be pretty easy to implement.
Patches will materialize in a few days.

Thanks,
//richard

rnd4@dave-tech.it April 12, 2015, 8:42 p.m. UTC | #20

Il 12/04/2015 18:55, Richard Weinberger ha scritto:
> Am 12.04.2015 um 18:43 schrieb Boris Brezillon:
>> On Sun, 12 Apr 2015 18:09:23 +0200
>> Richard Weinberger <richard@nod.at> wrote:
>>
>>> Am 12.04.2015 um 16:12 schrieb Boris Brezillon:
>>>> Hi Richard,
>>>>
>>>> Sorry for the late reply.
>>>>
>>>> On Sun, 29 Mar 2015 14:13:17 +0200
>>>> Richard Weinberger <richard@nod.at> wrote:
>>>>
>>>>> This patch implements bitrot checking for UBI.
>>>>> ubi_wl_trigger_bitrot_check() triggers a re-read of every
>>>>> PEB. If a bitflip is detected PEBs in use will get scrubbed
>>>>> and free ones erased.
>>>>
>>>> As you'll see, I didn't have much to say about the 'UBI bitrot
>>>> detection' mechanism, so this review is a collection of
>>>> nitpicks :-).
>>>>
>>>>>
>>>>> Signed-off-by: Richard Weinberger <richard@nod.at>
>>>>> ---
>>>>>   drivers/mtd/ubi/build.c |  39 +++++++++++++
>>>>>   drivers/mtd/ubi/ubi.h   |   4 ++
>>>>>   drivers/mtd/ubi/wl.c    | 146 ++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>   3 files changed, 189 insertions(+)
>>>>>
>>>>> diff --git a/drivers/mtd/ubi/build.c b/drivers/mtd/ubi/build.c
>>>>> index 9690cf9..f58330b 100644
>>>>> --- a/drivers/mtd/ubi/build.c
>>>>> +++ b/drivers/mtd/ubi/build.c
>>>>> @@ -118,6 +118,9 @@ static struct class_attribute ubi_version =
>>>>>
>>>>>   static ssize_t dev_attribute_show(struct device *dev,
>>>>>   				  struct device_attribute *attr, char *buf);
>>>>> +static ssize_t trigger_bitrot_check(struct device *dev,
>>>>> +				    struct device_attribute *mattr,
>>>>> +				    const char *data, size_t count);
>>>>>
>>>>>   /* UBI device attributes (correspond to files in '/<sysfs>/class/ubi/ubiX') */
>>>>>   static struct device_attribute dev_eraseblock_size =
>>>>> @@ -142,6 +145,8 @@ static struct device_attribute dev_bgt_enabled =
>>>>>   	__ATTR(bgt_enabled, S_IRUGO, dev_attribute_show, NULL);
>>>>>   static struct device_attribute dev_mtd_num =
>>>>>   	__ATTR(mtd_num, S_IRUGO, dev_attribute_show, NULL);
>>>>> +static struct device_attribute dev_trigger_bitrot_check =
>>>>> +	__ATTR(trigger_bitrot_check, S_IWUSR, NULL, trigger_bitrot_check);
>>>>
>>>> How about making this attribute a RW one, so that users could check
>>>> if there's a bitrot check in progress.
>>>
>>> As the check will be initiated only by userspace and writing to the trigger
>>> while a check is running will return anyway a EBUSY I don't really see
>>> a point why userspace would check for it.
>>
>> Sometime you just want to know whether something is running or not (in
>> this case the bitrot check) without risking to trigger a new action...
>
> Why would they care?

I think is always useful to give some additional information in 
userspace, from both debugging and diagnostic point of view.

> But I can add this feature, no problem.

Thanks ;-)

May I ask if can be useful to abort the (IMHO quite long running) operation?
I think it can be useful to save power, e.g. when running on batteries: 
smart systems will trigger the operation when charging and aborting it 
if on batteries (or on low batteries).

What happens if the system need to reboot in the middle of scanning?
Probably nothing at all but I think it's worth asking ;-)
Anyway I think it's better if we can, on runlevel 6, shutdown the 
operation in a clean way

To ask a little bit more from the current implementation, can it be 
useful expand sysfs entry with the current status (stopped, running, 
completed)?
In this way the userspace knows whenever the operation it has triggered, 
it completed successfully or something interrupt it (e.g. an internal 
error). I will schedule a new operation sooner if I have no evidence 
that the last one completed successfully.. WDYT?
But maybe all of this stuff will be implemented inside a daemon with 
additional ioctl() (IIRC Richard already is working on this).

>
>>>
>>>>>
>>>>>   /**
>>>>>    * ubi_volume_notify - send a volume change notification.
>>>>> @@ -334,6 +339,36 @@ int ubi_major2num(int major)
>>>>>   	return ubi_num;
>>>>>   }
>>>>>
>>>>> +/* "Store" method for file '/<sysfs>/class/ubi/ubiX/trigger_bitrot_check' */
>>>>> +static ssize_t trigger_bitrot_check(struct device *dev,
>>>>> +				    struct device_attribute *mattr,
>>>>> +				    const char *data, size_t count)
>>>>> +{
>>>>> +	struct ubi_device *ubi;
>>>>> +	int ret;
>>>>> +
>>>>
>>>> Maybe that's on purpose, but you do not check the value passed in data
>>>> (in your documention you suggest to do an
>>>> echo 1 > /sys/class/ubi/ubiX/trigger_bitrot_check).
>>>
>>> Yeah, the example using "1", but why should I limit it to it?
>>> The idea was that any write will trigger a check.
>>
>> Okay.

See above my reason for having something more that 1 value (or having 
more than one sysfs entry ;-) )

Kind Regards,

Andrea SCIAN

Richard Weinberger April 12, 2015, 9:01 p.m. UTC | #21

Am 12.04.2015 um 22:42 schrieb Andrea Scian:
> 
> Il 12/04/2015 18:55, Richard Weinberger ha scritto:
>> Am 12.04.2015 um 18:43 schrieb Boris Brezillon:
>>> On Sun, 12 Apr 2015 18:09:23 +0200
>>> Richard Weinberger <richard@nod.at> wrote:
>>>
>>>> Am 12.04.2015 um 16:12 schrieb Boris Brezillon:
>>>>> Hi Richard,
>>>>>
>>>>> Sorry for the late reply.
>>>>>
>>>>> On Sun, 29 Mar 2015 14:13:17 +0200
>>>>> Richard Weinberger <richard@nod.at> wrote:
>>>>>
>>>>>> This patch implements bitrot checking for UBI.
>>>>>> ubi_wl_trigger_bitrot_check() triggers a re-read of every
>>>>>> PEB. If a bitflip is detected PEBs in use will get scrubbed
>>>>>> and free ones erased.
>>>>>
>>>>> As you'll see, I didn't have much to say about the 'UBI bitrot
>>>>> detection' mechanism, so this review is a collection of
>>>>> nitpicks :-).
>>>>>
>>>>>>
>>>>>> Signed-off-by: Richard Weinberger <richard@nod.at>
>>>>>> ---
>>>>>>   drivers/mtd/ubi/build.c |  39 +++++++++++++
>>>>>>   drivers/mtd/ubi/ubi.h   |   4 ++
>>>>>>   drivers/mtd/ubi/wl.c    | 146 ++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>>   3 files changed, 189 insertions(+)
>>>>>>
>>>>>> diff --git a/drivers/mtd/ubi/build.c b/drivers/mtd/ubi/build.c
>>>>>> index 9690cf9..f58330b 100644
>>>>>> --- a/drivers/mtd/ubi/build.c
>>>>>> +++ b/drivers/mtd/ubi/build.c
>>>>>> @@ -118,6 +118,9 @@ static struct class_attribute ubi_version =
>>>>>>
>>>>>>   static ssize_t dev_attribute_show(struct device *dev,
>>>>>>                     struct device_attribute *attr, char *buf);
>>>>>> +static ssize_t trigger_bitrot_check(struct device *dev,
>>>>>> +                    struct device_attribute *mattr,
>>>>>> +                    const char *data, size_t count);
>>>>>>
>>>>>>   /* UBI device attributes (correspond to files in '/<sysfs>/class/ubi/ubiX') */
>>>>>>   static struct device_attribute dev_eraseblock_size =
>>>>>> @@ -142,6 +145,8 @@ static struct device_attribute dev_bgt_enabled =
>>>>>>       __ATTR(bgt_enabled, S_IRUGO, dev_attribute_show, NULL);
>>>>>>   static struct device_attribute dev_mtd_num =
>>>>>>       __ATTR(mtd_num, S_IRUGO, dev_attribute_show, NULL);
>>>>>> +static struct device_attribute dev_trigger_bitrot_check =
>>>>>> +    __ATTR(trigger_bitrot_check, S_IWUSR, NULL, trigger_bitrot_check);
>>>>>
>>>>> How about making this attribute a RW one, so that users could check
>>>>> if there's a bitrot check in progress.
>>>>
>>>> As the check will be initiated only by userspace and writing to the trigger
>>>> while a check is running will return anyway a EBUSY I don't really see
>>>> a point why userspace would check for it.
>>>
>>> Sometime you just want to know whether something is running or not (in
>>> this case the bitrot check) without risking to trigger a new action...
>>
>> Why would they care?
> 
> I think is always useful to give some additional information in userspace, from both debugging and diagnostic point of view.

The question is, why does userspace care?
Other UBI operations are also not visible...

>> But I can add this feature, no problem.
> 
> Thanks ;-)
> 
> May I ask if can be useful to abort the (IMHO quite long running) operation?
> I think it can be useful to save power, e.g. when running on batteries: smart systems will trigger the operation when charging and aborting it if on batteries (or on low batteries).

If the system is running on low power mode just don't trigger the run...
Userspace controls it.

> What happens if the system need to reboot in the middle of scanning?

Just reboot, UBI can handle that. Work will be canceled.

> Probably nothing at all but I think it's worth asking ;-)
> Anyway I think it's better if we can, on runlevel 6, shutdown the operation in a clean way
> 
> To ask a little bit more from the current implementation, can it be useful expand sysfs entry with the current status (stopped, running, completed)?
> In this way the userspace knows whenever the operation it has triggered, it completed successfully or something interrupt it (e.g. an internal error). I will schedule a new
> operation sooner if I have no evidence that the last one completed successfully.. WDYT?
> But maybe all of this stuff will be implemented inside a daemon with additional ioctl() (IIRC Richard already is working on this).

That's the plan. The interface proposed in that patch series it designed to be a simple replacement for the dd if=/dev/ubiXY of=/dev/null hack.
The next step is adding a more advanced ioctl() interface to support a clever deamon.

Thanks,
//richard

Boris Brezillon April 12, 2015, 9:24 p.m. UTC | #22

On Sun, 12 Apr 2015 21:53:12 +0200
Richard Weinberger <richard@nod.at> wrote:

> Am 12.04.2015 um 21:20 schrieb Boris Brezillon:
> > Unless I'm missing something, it should be pretty easy to implement:
> > adding the following lines at the end of bitrot_check_worker() should do
> > the trick
> > 
> > 	if (e->pnum + 1 < ubi->peb_count) {
> > 		wl_wrk->e = ubi->lookuptbl[e->pnum + 1];
> > 		__schedule_ubi_work(ubi, wl_wrk);
> > 	} else {
> > 		atomic_dec(&ubi->bit_rot_work);
> > 	}
> > 	
> 
> It will suffer from the same race issue as my current approach.
> While e is scheduled another worker could free it in case of an fatal
> error.

Right, I guess grabing wl_lock before retrieving ->e (and iterating
over the lookuptbl until it's != NULL) would partly solve the problem.
But I'm not sure how you would handle this sequence (not sure it can
happen though):
1/ schedule bitrot check on PEB X
2/ execute some operation on PEB x that might free PEB X entry in the
   lookuptbl
3/ execute bitrot check on PEB X

In this case the ->e is invalid (pointing to a freed memory region) at
the time bitrot check is executed.

Of course, if you're guaranteed that ubi_work are executed in the
correct order (FIFO) this should never happen, because the scheduled
operation messing up with lookuptbl entry could have been detected
before bitrot work insertion.

> 
> >> I'd like to avoid works which schedule again other works.
> >> In the current way it is clear where the work is scheduled and how much.
> > 
> > Yes, but the memory consumption induced by this approach can be pretty
> > big on modern NAND chips (on 32 bit platforms, ubi_work is 32 octets
> > large, and on modern NANDs you often have 4096 blocks, so a UBI device
> > of 4000 block is pretty common => 4000 * 32 = 125 KiB).
> 
> While I agree that consuming memory is not very nice I don't think that 125KiB
> is a big deal.

Hm, a few weeks ago, when I suggested to store information about PEBs in
order to better choose the next block to be checked for bitrot, one of
your argument to reject that approach was the memory consumption of
such a design.
In my case the only thing I needed was the following structure (one
instance per PEB):

struct ubi_peb_statistics {
	struct list_head node;
	int pnum;
	int bitflips;
	int last_full_read; /* in seconds */
	int last_partial_write; /* in seconds */
};

which is 24 bytes large.

I definitely understand the memory consumption argument, but that's not
something you can change depending on who's proposing the solution :-).

> 
> > For standard wear leveling requests, using a ubi_work per request is
> > sensible since you can't know in advance which block will be queued for
> > wear-leveling operation next time.
> > In your case, you're scanning all blocks in ascending order, which
> > makes it a good candidate for this 'one work for all bitrot checks'
> > approach.
> 
> The good news is that I have an idea to solve both problems the race and
> the memory issue. It should be pretty easy to implement.
> Patches will materialize in a few days.

Great!

Best Regards,

Boris

Boris Brezillon April 12, 2015, 9:30 p.m. UTC | #23

On Sun, 12 Apr 2015 23:01:27 +0200
Richard Weinberger <richard@nod.at> wrote:


> >>>>>> +static struct device_attribute dev_trigger_bitrot_check =
> >>>>>> +    __ATTR(trigger_bitrot_check, S_IWUSR, NULL, trigger_bitrot_check);
> >>>>>
> >>>>> How about making this attribute a RW one, so that users could check
> >>>>> if there's a bitrot check in progress.
> >>>>
> >>>> As the check will be initiated only by userspace and writing to the trigger
> >>>> while a check is running will return anyway a EBUSY I don't really see
> >>>> a point why userspace would check for it.
> >>>
> >>> Sometime you just want to know whether something is running or not (in
> >>> this case the bitrot check) without risking to trigger a new action...
> >>
> >> Why would they care?
> > 
> > I think is always useful to give some additional information in userspace, from both debugging and diagnostic point of view.
> 
> The question is, why does userspace care?
> Other UBI operations are also not visible...

Yes, but AFAIK other wear-leveling operations are not directly triggered
by user-space.

rnd4@dave-tech.it April 12, 2015, 9:33 p.m. UTC | #24

Il 12/04/2015 23:01, Richard Weinberger ha scritto:
> Am 12.04.2015 um 22:42 schrieb Andrea Scian:
>> Il 12/04/2015 18:55, Richard Weinberger ha scritto:
>>> Am 12.04.2015 um 18:43 schrieb Boris Brezillon:
>>>> On Sun, 12 Apr 2015 18:09:23 +0200
>>>> Richard Weinberger <richard@nod.at> wrote:
>>>>
>>>>> Am 12.04.2015 um 16:12 schrieb Boris Brezillon:
>>>>>> Hi Richard,
>>>>>>
>>>>>> Sorry for the late reply.
>>>>>>
>>>>>> On Sun, 29 Mar 2015 14:13:17 +0200
>>>>>> Richard Weinberger <richard@nod.at> wrote:
>>>>>>
>>>>>>> This patch implements bitrot checking for UBI.
>>>>>>> ubi_wl_trigger_bitrot_check() triggers a re-read of every
>>>>>>> PEB. If a bitflip is detected PEBs in use will get scrubbed
>>>>>>> and free ones erased.
>>>>>> As you'll see, I didn't have much to say about the 'UBI bitrot
>>>>>> detection' mechanism, so this review is a collection of
>>>>>> nitpicks :-).
>>>>>>
>>>>>>> Signed-off-by: Richard Weinberger <richard@nod.at>
>>>>>>> ---
>>>>>>>    drivers/mtd/ubi/build.c |  39 +++++++++++++
>>>>>>>    drivers/mtd/ubi/ubi.h   |   4 ++
>>>>>>>    drivers/mtd/ubi/wl.c    | 146 ++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>>>    3 files changed, 189 insertions(+)
>>>>>>>
>>>>>>> diff --git a/drivers/mtd/ubi/build.c b/drivers/mtd/ubi/build.c
>>>>>>> index 9690cf9..f58330b 100644
>>>>>>> --- a/drivers/mtd/ubi/build.c
>>>>>>> +++ b/drivers/mtd/ubi/build.c
>>>>>>> @@ -118,6 +118,9 @@ static struct class_attribute ubi_version =
>>>>>>>
>>>>>>>    static ssize_t dev_attribute_show(struct device *dev,
>>>>>>>                      struct device_attribute *attr, char *buf);
>>>>>>> +static ssize_t trigger_bitrot_check(struct device *dev,
>>>>>>> +                    struct device_attribute *mattr,
>>>>>>> +                    const char *data, size_t count);
>>>>>>>
>>>>>>>    /* UBI device attributes (correspond to files in '/<sysfs>/class/ubi/ubiX') */
>>>>>>>    static struct device_attribute dev_eraseblock_size =
>>>>>>> @@ -142,6 +145,8 @@ static struct device_attribute dev_bgt_enabled =
>>>>>>>        __ATTR(bgt_enabled, S_IRUGO, dev_attribute_show, NULL);
>>>>>>>    static struct device_attribute dev_mtd_num =
>>>>>>>        __ATTR(mtd_num, S_IRUGO, dev_attribute_show, NULL);
>>>>>>> +static struct device_attribute dev_trigger_bitrot_check =
>>>>>>> +    __ATTR(trigger_bitrot_check, S_IWUSR, NULL, trigger_bitrot_check);
>>>>>> How about making this attribute a RW one, so that users could check
>>>>>> if there's a bitrot check in progress.
>>>>> As the check will be initiated only by userspace and writing to the trigger
>>>>> while a check is running will return anyway a EBUSY I don't really see
>>>>> a point why userspace would check for it.
>>>> Sometime you just want to know whether something is running or not (in
>>>> this case the bitrot check) without risking to trigger a new action...
>>> Why would they care?
>> I think is always useful to give some additional information in userspace, from both debugging and diagnostic point of view.
> The question is, why does userspace care?

Because the userspace trigger it

> Other UBI operations are also not visible...

I don't really know much about UBI internals, but I think not many 
operation are triggered from userspace (apart from ubiformat and mount ;-) )


>>> But I can add this feature, no problem.
>> Thanks ;-)
>>
>> May I ask if can be useful to abort the (IMHO quite long running) operation?
>> I think it can be useful to save power, e.g. when running on batteries: smart systems will trigger the operation when charging and aborting it if on batteries (or on low batteries).
> If the system is running on low power mode just don't trigger the run...
> Userspace controls it.

It heavily depends on how long the operation takes, we may have 4 to 32 
GiB devices so I think it may take more than just a few minutes to scan 
the whole device (and additional time depending on how many scrub 
operation are needed).
You may start the operation when power is not an issue (e.g. while 
charging) and when it is (e.g. when running on batteries and you need to 
save any mAh you have ;-) ) it may be still running for a while..
I know that this is some kind of advance feature, but I would like to 
point this out for comments.

>
>> What happens if the system need to reboot in the middle of scanning?
> Just reboot, UBI can handle that. Work will be canceled.

Thanks for the confirm

>
>> Probably nothing at all but I think it's worth asking ;-)
>> Anyway I think it's better if we can, on runlevel 6, shutdown the operation in a clean way
>>
>> To ask a little bit more from the current implementation, can it be useful expand sysfs entry with the current status (stopped, running, completed)?
>> In this way the userspace knows whenever the operation it has triggered, it completed successfully or something interrupt it (e.g. an internal error). I will schedule a new
>> operation sooner if I have no evidence that the last one completed successfully.. WDYT?
>> But maybe all of this stuff will be implemented inside a daemon with additional ioctl() (IIRC Richard already is working on this).
> That's the plan. The interface proposed in that patch series it designed to be a simple replacement for the dd if=/dev/ubiXY of=/dev/null hack.

dd can be stopped if needed and you may also have the progress status :-)

for sure you probably don't need all of the above additional stuff but 
it may be useful anyway
Maybe it's better to have an initial working implementation and add some 
more (backward compatible?) features later.

> The next step is adding a more advanced ioctl() interface to support a clever deamon.

Kind Regards and thanks for you comments,

Richard Weinberger April 12, 2015, 9:34 p.m. UTC | #25

Am 12.04.2015 um 23:24 schrieb Boris Brezillon:
> On Sun, 12 Apr 2015 21:53:12 +0200
> Richard Weinberger <richard@nod.at> wrote:
> 
>> Am 12.04.2015 um 21:20 schrieb Boris Brezillon:
>>> Unless I'm missing something, it should be pretty easy to implement:
>>> adding the following lines at the end of bitrot_check_worker() should do
>>> the trick
>>>
>>> 	if (e->pnum + 1 < ubi->peb_count) {
>>> 		wl_wrk->e = ubi->lookuptbl[e->pnum + 1];
>>> 		__schedule_ubi_work(ubi, wl_wrk);
>>> 	} else {
>>> 		atomic_dec(&ubi->bit_rot_work);
>>> 	}
>>> 	
>>
>> It will suffer from the same race issue as my current approach.
>> While e is scheduled another worker could free it in case of an fatal
>> error.
> 
> Right, I guess grabing wl_lock before retrieving ->e (and iterating
> over the lookuptbl until it's != NULL) would partly solve the problem.
> But I'm not sure how you would handle this sequence (not sure it can
> happen though):
> 1/ schedule bitrot check on PEB X
> 2/ execute some operation on PEB x that might free PEB X entry in the
>    lookuptbl
> 3/ execute bitrot check on PEB X
> 
> In this case the ->e is invalid (pointing to a freed memory region) at
> the time bitrot check is executed.

We have to make sure that we remove e from the data structure.
This is what UBI currently does for all works.
That way only one work can operate on an ubi_wl_entry.

> Of course, if you're guaranteed that ubi_work are executed in the
> correct order (FIFO) this should never happen, because the scheduled
> operation messing up with lookuptbl entry could have been detected
> before bitrot work insertion.
> 
>>
>>>> I'd like to avoid works which schedule again other works.
>>>> In the current way it is clear where the work is scheduled and how much.
>>>
>>> Yes, but the memory consumption induced by this approach can be pretty
>>> big on modern NAND chips (on 32 bit platforms, ubi_work is 32 octets
>>> large, and on modern NANDs you often have 4096 blocks, so a UBI device
>>> of 4000 block is pretty common => 4000 * 32 = 125 KiB).
>>
>> While I agree that consuming memory is not very nice I don't think that 125KiB
>> is a big deal.
> 
> Hm, a few weeks ago, when I suggested to store information about PEBs in
> order to better choose the next block to be checked for bitrot, one of
> your argument to reject that approach was the memory consumption of
> such a design.
> In my case the only thing I needed was the following structure (one
> instance per PEB):
> 
> struct ubi_peb_statistics {
> 	struct list_head node;
> 	int pnum;
> 	int bitflips;
> 	int last_full_read; /* in seconds */
> 	int last_partial_write; /* in seconds */
> };
> 
> which is 24 bytes large.
> 
> I definitely understand the memory consumption argument, but that's not
> something you can change depending on who's proposing the solution :-).

Yeah, but this structure remains in memory forever, right?
In the bitrot case we allocate the memory only temporary.

That said, my arguments are not perfect nor irreversible,
it can happen that I mess up or was simply wrong.
Just beat me down with my own arguments when I deserve it.

Thanks,
//richard

Richard Weinberger April 12, 2015, 9:37 p.m. UTC | #26

Am 12.04.2015 um 23:30 schrieb Boris Brezillon:
> On Sun, 12 Apr 2015 23:01:27 +0200
> Richard Weinberger <richard@nod.at> wrote:
> 
> 
>>>>>>>> +static struct device_attribute dev_trigger_bitrot_check =
>>>>>>>> +    __ATTR(trigger_bitrot_check, S_IWUSR, NULL, trigger_bitrot_check);
>>>>>>>
>>>>>>> How about making this attribute a RW one, so that users could check
>>>>>>> if there's a bitrot check in progress.
>>>>>>
>>>>>> As the check will be initiated only by userspace and writing to the trigger
>>>>>> while a check is running will return anyway a EBUSY I don't really see
>>>>>> a point why userspace would check for it.
>>>>>
>>>>> Sometime you just want to know whether something is running or not (in
>>>>> this case the bitrot check) without risking to trigger a new action...
>>>>
>>>> Why would they care?
>>>
>>> I think is always useful to give some additional information in userspace, from both debugging and diagnostic point of view.
>>
>> The question is, why does userspace care?
>> Other UBI operations are also not visible...
> 
> Yes, but AFAIK other wear-leveling operations are not directly triggered
> by user-space.

If userspace trigger is, it knows that it was triggered.
Unless you have multiple tools/scripts which want to trigger it,
which is IMHO a problem anyways.

But as stated before, I can add an indicator.

Thanks,
//richard

Richard Weinberger April 12, 2015, 9:42 p.m. UTC | #27

Am 12.04.2015 um 23:33 schrieb Andrea Scian:
>>> I think is always useful to give some additional information in userspace, from both debugging and diagnostic point of view.
>> The question is, why does userspace care?
> 
> Because the userspace trigger it

As written in the previous mail, then userspace knows anyway the state.

> 
>> Other UBI operations are also not visible...
> 
> I don't really know much about UBI internals, but I think not many operation are triggered from userspace (apart from ubiformat and mount ;-) )
> 
> 
>>>> But I can add this feature, no problem.
>>> Thanks ;-)
>>>
>>> May I ask if can be useful to abort the (IMHO quite long running) operation?
>>> I think it can be useful to save power, e.g. when running on batteries: smart systems will trigger the operation when charging and aborting it if on batteries (or on low
>>> batteries).
>> If the system is running on low power mode just don't trigger the run...
>> Userspace controls it.
> 
> It heavily depends on how long the operation takes, we may have 4 to 32 GiB devices so I think it may take more than just a few minutes to scan the whole device (and additional
> time depending on how many scrub operation are needed).
> You may start the operation when power is not an issue (e.g. while charging) and when it is (e.g. when running on batteries and you need to save any mAh you have ;-) ) it may be
> still running for a while..
> I know that this is some kind of advance feature, but I would like to point this out for comments.

In such a scenario one definitely wants the emerging ioctl() interface.

>>
>>> What happens if the system need to reboot in the middle of scanning?
>> Just reboot, UBI can handle that. Work will be canceled.
> 
> Thanks for the confirm
> 
>>
>>> Probably nothing at all but I think it's worth asking ;-)
>>> Anyway I think it's better if we can, on runlevel 6, shutdown the operation in a clean way
>>>
>>> To ask a little bit more from the current implementation, can it be useful expand sysfs entry with the current status (stopped, running, completed)?
>>> In this way the userspace knows whenever the operation it has triggered, it completed successfully or something interrupt it (e.g. an internal error). I will schedule a new
>>> operation sooner if I have no evidence that the last one completed successfully.. WDYT?
>>> But maybe all of this stuff will be implemented inside a daemon with additional ioctl() (IIRC Richard already is working on this).
>> That's the plan. The interface proposed in that patch series it designed to be a simple replacement for the dd if=/dev/ubiXY of=/dev/null hack.
> 
> dd can be stopped if needed and you may also have the progress status :-)
> 
> for sure you probably don't need all of the above additional stuff but it may be useful anyway
> Maybe it's better to have an initial working implementation and add some more (backward compatible?) features later.

I agree.

BTW: Thanks a lot for your comments, Boris and Andrea!

Thanks,
//richard

Nicholas Krause April 13, 2015, 3:36 a.m. UTC | #28

On 2015-04-12 05:34 PM, Richard Weinberger wrote:
>>> >> While I agree that consuming memory is not very nice I don't think that 125KiB
>>> >> is a big deal.
>> > 
>> > Hm, a few weeks ago, when I suggested to store information about PEBs in
>> > order to better choose the next block to be checked for bitrot, one of
>> > your argument to reject that approach was the memory consumption of
>> > such a design.
>> > In my case the only thing I needed was the following structure (one
>> > instance per PEB):
>> > 
>> > struct ubi_peb_statistics {
>> > 	struct list_head node;
>> > 	int pnum;
>> > 	int bitflips;
>> > 	int last_full_read; /* in seconds */
>> > 	int last_partial_write; /* in seconds */
>> > };
>> > 
>> > which is 24 bytes large.
>> > 
>> > I definitely understand the memory consumption argument, but that's not
>> > something you can change depending on who's proposing the solution :-).
> Yeah, but this structure remains in memory forever, right?
> In the bitrot case we allocate the memory only temporary.
> 
> That said, my arguments are not perfect nor irreversible,
> it can happen that I mess up or was simply wrong.
> Just beat me down with my own arguments when I deserve it.
> 
> Thanks,
> //richard
Richard and others,
This seems to be like the way we are handling page tables in the kernel. Further more due
to this if this is overall a good idea otherwise, I would recommend looking into the ratio
of storing the structure as a percent of overall memory on various systems to see if that
much memory is using storing the PEBs this way. Generally if it's over 2% of total memory
I would recommend finding a different solution, on the high end page structures take 1.5%
of overall memory on the high end for all systems I am currently aware of. Another area to
compare for doing something like is the driver core or slabs that need to be there for a 
useable system i.e. the one for task_structs and therefore are always in kernel memory.
Just a option for a MTD newcomer,
Nick

Brian Norris April 13, 2015, 5:17 p.m. UTC | #29

Hi everyone,

linux-mtd administrative note:

Please don't start conversations based on mailing list "digest" emails.
It's not helpful for others (e.g., it likely breaks threading). Also,
the mailing list filters these types of messages out, presumably based
on the email subject, so many subscribers probably did not see your
email.

So, a recent thread was under the subject:

 Re: [PATCH 4/4] UBI: Implement bitrot checking (linux-mtd Digest, Vol 145, Issue 24) 

But it should be using:

 Re: [PATCH 4/4] UBI: Implement bitrot checking

I've released a few messages for now, but I can't guarantee that such
messages will get through in the future (and certainly not promptly).

Thanks,
Brian

[4/4] UBI: Implement bitrot checking

Commit Message

Comments

Patch