diff mbox

[RFC/PATCH] mtd: mtd_read: Fix bitflips_threshold comparison to allow max bitflips

Message ID 1388407257-6557-1-git-send-email-sr@denx.de
State RFC
Headers show

Commit Message

Stefan Roese Dec. 30, 2013, 12:40 p.m. UTC
On a custom AM335x based platform with a Toshiba NAND device
(TC58NVG1S3H) we are currently seeing quite a few of these UBI messages:

[   18.044967] UBI: fixable bit-flip detected at PEB 50
[   18.050252] UBI: schedule PEB 50 for scrubbing
...

After a bit debugging I found that those messages are only printed when
the OMAP NAND driver has detected 8 (corrected) bitflips / 512 bytes on
a read. We're using HW BCH8 and the Toshiba chip supports 8 bit ECC for
each 512Byte. I was wondering why 8 bitflips resulted in these UBI
messages and e.g. 7 bitflips didn't. Hence I discovered the comparison
"ret_code >= mtd->bitflip_threshold" in mtd_read().

With this patch applied all tests (UBIFS) I've done so far didn't produce
any of these "UBI: fixable bit-flip" messages any more.

Note that I'm sending this patch as RFC for now. To get some feedback
from other MTD / NAND developers on this issue. The main question is:
Should mtd_read() return -EUCLEAN if the corrected bitflips are equal to
the bitflip-threshold value? Or should it return 0 since the bitflips
have been corrected?

Signed-off-by: Stefan Roese <sr@denx.de>
Cc: Brian Norris <computersforpeace@gmail.com>
Cc: Pekon Gupta <pekon@ti.com>
---
 drivers/mtd/mtdcore.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Stefan Roese Jan. 8, 2014, 7:29 a.m. UTC | #1
On 30.12.2013 13:40, Stefan Roese wrote:
> On a custom AM335x based platform with a Toshiba NAND device
> (TC58NVG1S3H) we are currently seeing quite a few of these UBI messages:
> 
> [   18.044967] UBI: fixable bit-flip detected at PEB 50
> [   18.050252] UBI: schedule PEB 50 for scrubbing
> ...
> 
> After a bit debugging I found that those messages are only printed when
> the OMAP NAND driver has detected 8 (corrected) bitflips / 512 bytes on
> a read. We're using HW BCH8 and the Toshiba chip supports 8 bit ECC for
> each 512Byte. I was wondering why 8 bitflips resulted in these UBI
> messages and e.g. 7 bitflips didn't. Hence I discovered the comparison
> "ret_code >= mtd->bitflip_threshold" in mtd_read().
> 
> With this patch applied all tests (UBIFS) I've done so far didn't produce
> any of these "UBI: fixable bit-flip" messages any more.
> 
> Note that I'm sending this patch as RFC for now. To get some feedback
> from other MTD / NAND developers on this issue. The main question is:
> Should mtd_read() return -EUCLEAN if the corrected bitflips are equal to
> the bitflip-threshold value? Or should it return 0 since the bitflips
> have been corrected?

Brian, do you have any comments? Is this patch good as is? Should I
resend it as non-RFC?

Thanks,
Stefan
Brian Norris Jan. 11, 2014, 7:06 p.m. UTC | #2
On Wed, Jan 08, 2014 at 08:29:32AM +0100, Stefan Roese wrote:
> On 30.12.2013 13:40, Stefan Roese wrote:
> > On a custom AM335x based platform with a Toshiba NAND device
> > (TC58NVG1S3H) we are currently seeing quite a few of these UBI messages:
> > 
> > [   18.044967] UBI: fixable bit-flip detected at PEB 50
> > [   18.050252] UBI: schedule PEB 50 for scrubbing
> > ...
> > 
> > After a bit debugging I found that those messages are only printed when
> > the OMAP NAND driver has detected 8 (corrected) bitflips / 512 bytes on
> > a read. We're using HW BCH8 and the Toshiba chip supports 8 bit ECC for
> > each 512Byte. I was wondering why 8 bitflips resulted in these UBI
> > messages and e.g. 7 bitflips didn't. Hence I discovered the comparison
> > "ret_code >= mtd->bitflip_threshold" in mtd_read().
> > 
> > With this patch applied all tests (UBIFS) I've done so far didn't produce
> > any of these "UBI: fixable bit-flip" messages any more.
> > 
> > Note that I'm sending this patch as RFC for now. To get some feedback
> > from other MTD / NAND developers on this issue. The main question is:
> > Should mtd_read() return -EUCLEAN if the corrected bitflips are equal to
> > the bitflip-threshold value? Or should it return 0 since the bitflips
> > have been corrected?

-EUCLEAN is a purposeful part of the MTD API. It means that mtd_read()
encountered some high level of (correctable) bitflips, and MTD is
intentionally informing the upper layer(s) that they may want to take
corrective action on scrubbing the affected area, so that it doesn't
accumulate more bitflips and corrupt your data. MTD is acting correctly.

At a higher layer, I think UBI should be scrubbing the block (erasing
and moving or rewriting data) at least once, to refresh the data. After
that, wear leveling *should* prevent the block from being reused in the
near future. If that isn't the case, then we may need to fix UBI.

> Brian, do you have any comments? Is this patch good as is? Should I
> resend it as non-RFC?

No, this patch is not acceptable. It effectively ignores all bitflips
until it's too late (i.e., there are more bitflips than we can correct).
Also, if you really did want this effect, you could simply increase your
threshold to 9 via sysfs. But I don't recommend that.

Brian
Ricard Wanderlof Jan. 13, 2014, 8:03 a.m. UTC | #3
On Sat, 11 Jan 2014, Brian Norris wrote:

>>> After a bit debugging I found that those messages are only printed when
>>> the OMAP NAND driver has detected 8 (corrected) bitflips / 512 bytes on
>>> a read. We're using HW BCH8 and the Toshiba chip supports 8 bit ECC for
>>> each 512Byte. I was wondering why 8 bitflips resulted in these UBI
>>> messages and e.g. 7 bitflips didn't. Hence I discovered the comparison
>>> "ret_code >= mtd->bitflip_threshold" in mtd_read().
>>>
>>> With this patch applied all tests (UBIFS) I've done so far didn't produce
>>> any of these "UBI: fixable bit-flip" messages any more.
>>>
>>> Note that I'm sending this patch as RFC for now. To get some feedback
>>> from other MTD / NAND developers on this issue. The main question is:
>>> Should mtd_read() return -EUCLEAN if the corrected bitflips are equal to
>>> the bitflip-threshold value? Or should it return 0 since the bitflips
>>> have been corrected?
>
> -EUCLEAN is a purposeful part of the MTD API. It means that mtd_read()
> encountered some high level of (correctable) bitflips, and MTD is
> intentionally informing the upper layer(s) that they may want to take
> corrective action on scrubbing the affected area, so that it doesn't
> accumulate more bitflips and corrupt your data. MTD is acting correctly.
>
> At a higher layer, I think UBI should be scrubbing the block (erasing
> and moving or rewriting data) at least once, to refresh the data. After
> that, wear leveling *should* prevent the block from being reused in the
> near future. If that isn't the case, then we may need to fix UBI.
>
>> Brian, do you have any comments? Is this patch good as is? Should I
>> resend it as non-RFC?
>
> No, this patch is not acceptable. It effectively ignores all bitflips
> until it's too late (i.e., there are more bitflips than we can correct).
> Also, if you really did want this effect, you could simply increase your
> threshold to 9 via sysfs. But I don't recommend that.

Besides, assuming you are using BCH-8, then you can never get more than 8 
fixable bitflips, after that you'll get an 'uncorrectable' error. So 
essentially either applying the patch or setting the bitflip_threshold to 
9 results in no potential corrective action being taken (e.g. UBI 
scrubbing the block when it sees -EUCLEAN) until it's too late.

If anything I'd suggest lowering the bitflip_threshold a few steps under 
what the ECC algorithm can correct, e.g. 6 in the case of BCH-8, the 
rationale being that one could imagine getting 6 bitflips one day and 8 
bitflips another, i.e. the number of bitflips may not necessarily increase 
in steps of one.

/Ricard
diff mbox

Patch

diff --git a/drivers/mtd/mtdcore.c b/drivers/mtd/mtdcore.c
index 92311a5..28500a1 100644
--- a/drivers/mtd/mtdcore.c
+++ b/drivers/mtd/mtdcore.c
@@ -824,7 +824,7 @@  int mtd_read(struct mtd_info *mtd, loff_t from, size_t len, size_t *retlen,
 		return ret_code;
 	if (mtd->ecc_strength == 0)
 		return 0;	/* device lacks ecc */
-	return ret_code >= mtd->bitflip_threshold ? -EUCLEAN : 0;
+	return ret_code > mtd->bitflip_threshold ? -EUCLEAN : 0;
 }
 EXPORT_SYMBOL_GPL(mtd_read);