mtd: cfi: Wait for Block Erase operation to finish

Message ID	1330443135.75072.YahooMailClassic@web29019.mail.ird.yahoo.com
State	New, archived
Headers	show Return-Path: <linux-mtd-bounces+incoming=patchwork.ozlabs.org@lists.infradead.org> DomainKey-Signature: a=rsa-sha1; q=dns; c=nofws; s=s1024; d=yahoo.com; h=X-YMail-OSG:Received:X-Mailer:Message-ID:Date:From:Subject:To:Cc:MIME-Version:Content-Type:Content-Transfer-Encoding; b=lCSLPh1G5M7bejHfYEHU3Dqq0zmJMLL+GUi8pBV5hHyrwAbNmc5i54KBl+vBzhmraAn42IwKn7/9S7ivj0dDjNzOjlU/FSkyoRxes8Tg7ZU33vQVp6/s8jNYwAhYgfMiPa7FRVmrXKIK1LXoR436eKNNTpW/OHQQ+17BuJyjNOc=; Message-ID: <1330443135.75072.YahooMailClassic@web29019.mail.ird.yahoo.com> Date: Tue, 28 Feb 2012 15:32:15 +0000 (GMT) From: Paul Parsons <lost.distance@yahoo.com> Subject: [PATCH] mtd: cfi: Wait for Block Erase operation to finish To: linux-mtd@lists.infradead.org MIME-Version: 1.0 summary: Content analysis details: (-2.0 points) pts rule name description ---- ---------------------- -------------------------------------------------- 0.0 FREEMAIL_FROM Sender email is commonly abused enduser mail provider (lost.distance[at]yahoo.com) -0.0 RCVD_IN_DNSWL_NONE RBL: Sender listed at http://www.dnswl.org/, no trust [77.238.189.202 listed in list.dnswl.org] -0.0 T_RP_MATCHES_RCVD Envelope sender domain matches handover relay domain -1.9 BAYES_00 BODY: Bayes spam probability is 0 to 1% [score: 0.0000] -0.1 DKIM_VALID_AU Message has a valid DKIM or DK signature from author's domain 0.1 DKIM_SIGNED Message has a DKIM or DK signature, not necessarily valid -0.1 DKIM_VALID Message has at least one valid DKIM or DK signature Cc: dwmw2@infradead.org, philipp.zabel@gmail.com Precedence: list Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Sender: linux-mtd-bounces@lists.infradead.org Errors-To: linux-mtd-bounces+incoming=patchwork.ozlabs.org@lists.infradead.org

Paul Parsons Feb. 28, 2012, 3:32 p.m. UTC

If an Erase Suspend Command (0xb0) is issued while a Block Erase operation
(0x20, 0xd0) is in progress, Status Register bit 6 (SR.6) will be set to 1.
After an Erase Resume Command (0xd0) is issued, SR.6 will be set back to 0.

Unfortunately when inval_cache_and_wait_for_operation() is called to wait for
a Block Erase operation to finish, it never checks that SR.6 = 0 before
returning. Consequently it can (and does) return while a Block Erase operation
is still suspended, resulting in random breakage.

This patch ensures that when inval_cache_and_wait_for_operation() is called to
wait for a Block Erase operation to finish (chip->state == FL_ERASING), it does
not return until both SR.7 = 1 (as before) and SR.6 = 0.

Signed-off-by: Paul Parsons <lost.distance@yahoo.com>
---

I found this after switching my HP iPAQ hx4700 from using jffs2 to ubifs. It
then consistently reported "block erase error: (bad VPP)" errors. Intriguingly,
the bootloader never reporting erase or programming failures; it turned out that
the bootloader never issued Erase Suspend Commands.

Should SR.6 = 0 also be checked after an Erase Suspend Command (0xb0) is issued?
This should preclude an unnecessary (and perhaps unpredictable) subsequent Erase
Resume Command (0xd0).

Joakim Tjernlund Feb. 28, 2012, 9:25 p.m. UTC | #1

>
> If an Erase Suspend Command (0xb0) is issued while a Block Erase operation
> (0x20, 0xd0) is in progress, Status Register bit 6 (SR.6) will be set to 1.
> After an Erase Resume Command (0xd0) is issued, SR.6 will be set back to 0.
>
> Unfortunately when inval_cache_and_wait_for_operation() is called to wait for
> a Block Erase operation to finish, it never checks that SR.6 = 0 before
> returning. Consequently it can (and does) return while a Block Erase operation
> is still suspended, resulting in random breakage.
>
> This patch ensures that when inval_cache_and_wait_for_operation() is called to
> wait for a Block Erase operation to finish (chip->state == FL_ERASING), it does
> not return until both SR.7 = 1 (as before) and SR.6 = 0.

Interesting but I do wonder, should not the
  if (chip->state != chip_state)
test just before the status test prevent this in the first place?

>
> Signed-off-by: Paul Parsons <lost.distance@yahoo.com>
> ---
>
> I found this after switching my HP iPAQ hx4700 from using jffs2 to ubifs. It
> then consistently reported "block erase error: (bad VPP)" errors. Intriguingly,
> the bootloader never reporting erase or programming failures; it turned out that
> the bootloader never issued Erase Suspend Commands.
>
> Should SR.6 = 0 also be checked after an Erase Suspend Command (0xb0) is issued?
> This should preclude an unnecessary (and perhaps unpredictable) subsequent Erase
> Resume Command (0xd0).
>
> --- clean-3.3-rc5/drivers/mtd/chips/cfi_cmdset_0001.c   2012-02-25 20:18:16.000000000 +0000
> +++ linux-3.3-rc5/drivers/mtd/chips/cfi_cmdset_0001.c   2012-02-28 03:13:30.521537614 +0000
> @@ -1211,6 +1211,7 @@ static int inval_cache_and_wait_for_oper
>  {
>     struct cfi_private *cfi = map->fldrv_priv;
>     map_word status, status_OK = CMD(0x80);
> +   map_word status_76 = (chip->state == FL_ERASING) ? CMD(0xc0) : CMD(0x80);

hmm, I wonder if you can get by without the chip->state test? Just always set it to 0xc0?

 Jocke

Paul Parsons Feb. 29, 2012, 12:23 a.m. UTC | #2

> > If an Erase Suspend Command (0xb0) is issued while a
> Block Erase operation
> > (0x20, 0xd0) is in progress, Status Register bit 6
> (SR.6) will be set to 1.
> > After an Erase Resume Command (0xd0) is issued, SR.6
> will be set back to 0.
> >
> > Unfortunately when inval_cache_and_wait_for_operation()
> is called to wait for
> > a Block Erase operation to finish, it never checks that
> SR.6 = 0 before
> > returning. Consequently it can (and does) return while
> a Block Erase operation
> > is still suspended, resulting in random breakage.
> >
> > This patch ensures that when
> inval_cache_and_wait_for_operation() is called to
> > wait for a Block Erase operation to finish
> (chip->state == FL_ERASING), it does
> > not return until both SR.7 = 1 (as before) and SR.6 =
> 0.
> 
> Interesting but I do wonder, should not the
>   if (chip->state != chip_state)
> test just before the status test prevent this in the first
> place?

I'm not sure I understand. The (chip->state != chip_state) test forces
inval_cache_and_wait_for_operation() to stop polling and sleep, pending
the operation which suspended it. After it wakes it resumes polling. The
test doesn't prevent inval_cache_and_wait_for_operation() returning
prematurely, i.e. before the Block Erase has finished.

An important point which I suspect has been overlooked is that both SR.7
and SR.6 are set when a Block Erase is suspended. Thus testing SR.7 only
is not sufficient to determine that a Block Erase has completed.

> > +   map_word status_76 = (chip->state
> == FL_ERASING) ? CMD(0xc0) : CMD(0x80);
> 
> hmm, I wonder if you can get by without the chip->state
> test? Just always set it to 0xc0?

No, unfortunately not. For example a write buffer sequence will finish
whether or not a Block Erase is currently suspended; thus it must test
SR.7 only, not SR.6.

Paul

Joakim Tjernlund Feb. 29, 2012, 8:58 a.m. UTC | #3

Paul Parsons <lost.distance@yahoo.com> wrote on 2012/02/29 01:23:21:
> From: Paul Parsons <lost.distance@yahoo.com>
>
> > > If an Erase Suspend Command (0xb0) is issued while a
> > Block Erase operation
> > > (0x20, 0xd0) is in progress, Status Register bit 6
> > (SR.6) will be set to 1.
> > > After an Erase Resume Command (0xd0) is issued, SR.6
> > will be set back to 0.
> > >
> > > Unfortunately when inval_cache_and_wait_for_operation()
> > is called to wait for
> > > a Block Erase operation to finish, it never checks that
> > SR.6 = 0 before
> > > returning. Consequently it can (and does) return while
> > a Block Erase operation
> > > is still suspended, resulting in random breakage.
> > >
> > > This patch ensures that when
> > inval_cache_and_wait_for_operation() is called to
> > > wait for a Block Erase operation to finish
> > (chip->state == FL_ERASING), it does
> > > not return until both SR.7 = 1 (as before) and SR.6 =
> > 0.
> >
> > Interesting but I do wonder, should not the
> >   if (chip->state != chip_state)
> > test just before the status test prevent this in the first
> > place?
>
> I'm not sure I understand. The (chip->state != chip_state) test forces
> inval_cache_and_wait_for_operation() to stop polling and sleep, pending
> the operation which suspended it. After it wakes it resumes polling. The
> test doesn't prevent inval_cache_and_wait_for_operation() returning
> prematurely, i.e. before the Block Erase has finished.

Been to long since I been in this code I guess, but I still don't get it.

If erase is suspended, chip->state is changed which will keep the function in
the for(;;) loop. Once erase has been resumed again chip->state will change back.
It seems to me that chip->state and SR.6 are mutally exclusive? I can not see how
you can get to
	status = map_read(map, cmd_adr);
	if (map_word_andequal(map, status, status_OK, status_OK))
if the the erase has been suspended.

>
> An important point which I suspect has been overlooked is that both SR.7
> and SR.6 are set when a Block Erase is suspended. Thus testing SR.7 only
> is not sufficient to determine that a Block Erase has completed.
>
> > > +   map_word status_76 = (chip->state
> > == FL_ERASING) ? CMD(0xc0) : CMD(0x80);
> >
> > hmm, I wonder if you can get by without the chip->state
> > test? Just always set it to 0xc0?
>
> No, unfortunately not. For example a write buffer sequence will finish
> whether or not a Block Erase is currently suspended; thus it must test
> SR.7 only, not SR.6.

I see, thanks.

 Jocke

Paul Parsons Feb. 29, 2012, 5:22 p.m. UTC | #4

--- On Wed, 29/2/12, Joakim Tjernlund <joakim.tjernlund@transmode.se> wrote:
> If erase is suspended, chip->state is changed which will
> keep the function in
> the for(;;) loop. Once erase has been resumed again
> chip->state will change back.
> It seems to me that chip->state and SR.6 are mutally
> exclusive? I can not see how
> you can get to
>     status = map_read(map, cmd_adr);
>     if (map_word_andequal(map, status,
> status_OK, status_OK))
> if the the erase has been suspended.

OK, I understand. Let me demonstrate.

I Added the following debug to the original linux-3.3-rc5
drivers/mtd/chips/cfi_cmdset_0001.c file (i.e. without my patch):

--------

797a798
> printk("%d: CRDY: about to Erase Suspend adr=0x%08x\n", current->pid, adr);
958a960
> printk("%d: PUTC: enter\n", current->pid);
974a977
> printk("%d: PUTC: about to wake_up (a)\n", current->pid);
988a992
> printk("%d: PUTC: about to wake_up (b)\n", current->pid);
1005a1010
> printk("%d: PUTC: about to Erase Resume adr=0x%08x\n", current->pid, adr);
1025a1031
> printk("%d: PUTC: about to wake_up (c)\n", current->pid);
1216a1223
> printk("%d: WAIT: call-state=%d, cmd_adr=0x%08x\n", current->pid, chip_state, cmd_adr);
1234a1242
> printk("%d: WAIT: about to suspend\n", current->pid);
1235a1244
> printk("%d: WAIT: resumed\n", current->pid);
1241a1251
> printk("%d: WAIT: status=0x%08x\n", current->pid, status.x[0]);
1257a1268
> printk("%d: WAIT: return -ETIME\n", current->pid);
1262a1274
> printk("%d: WAIT: about to sleep: sleep_time=%d\n", current->pid, sleep_time);
1281a1294
> printk("%d: WAIT: return 0\n", current->pid);
1891a1905
> printk("%d: ERAS: adr=0x%08x\n", current->pid, adr);

--------

I rebooted my hx4700, saved the dmesg output, and edited and annotated
what I feel to be the important messages:

--------

[    2.224667] 30: WAIT: status=0x00a000a0
[    2.224693] 30: WAIT: return 0

PID 30 has just failed FL_ERASING of block 0x00440000 for the 3rd time:
erase has failed (SR.5 = 1).

[    2.224719] block erase failed at 0x00440000: status 0xa000a0. Retrying...

Confirmation of erase fail from elsewhere.

[    2.224779] 30: ERAS: adr=0x00440000
[    2.224798] 30: WAIT: call-state=4, cmd_adr=0x00440000
[    2.224814] 30: WAIT: status=0x00000000
[    2.224830] 30: WAIT: about to sleep: sleep_time=512000

PID 30 is now attempting FL_ERASING of block 0x00440000 for the 4th time.

[    2.492129] 76: CRDY: about to Erase Suspend adr=0x0060f748
[    2.492216] 76: PUTC: enter
[    2.492233] 76: PUTC: about to Erase Resume adr=0x0060f748

PID 76 has issued Erase Suspend and Erase Resume (the address is irrelevant).
PID 30 is still sleeping at this point.

[    2.785235] 30: WAIT: status=0x00e800e8
[    2.785250] 30: WAIT: return 0

PID 30 has woken up to find:
bad F-VPP (SR.3 = 1),
erase has failed (SR.5 = 1),
Erase Suspend is in effect (SR.6 = 1).

[    2.785276] physmap-flash: block erase error: (bad VPP)

Confirmation of bad VPP from elsewhere.

[    2.863384] 30: ERAS: adr=0x00440000
[    2.863412] 30: WAIT: call-state=4, cmd_adr=0x00440000
[    2.863431] 30: WAIT: status=0x00c000c0
[    2.863444] 30: WAIT: return 0

PID 30 is now attempting FL_ERASING of block 0x00440000 for the 5th time:
Erase Suspend is in effect (SR.6 = 1),
but no errors, so the erase seemingly succeeded.

[    2.863543] 30: WAIT: call-state=8, cmd_adr=0x00440000
[    2.863562] 30: WAIT: status=0x00c000c0
[    2.863574] 30: WAIT: return 0

PID 30 is now attempting FL_WRITING_TO_BUFFER:
Erase Suspend is in effect (SR.6 = 1),
but no errors, so the FL_WRITING_TO_BUFFER seemingly succeeded.

[    2.863599] 30: WAIT: call-state=7, cmd_adr=0x00440000
[    2.863615] 30: WAIT: status=0x00d000d0
[    2.863628] 30: WAIT: return 0

PID 30 is now attempting FL_WRITING to block 0x00440000:
Erase Suspend is in effect (SR.6 = 1),
program has failed (SR.3 = 1).

[    2.863649] physmap-flash: buffer write error (status 0xd000d0)

Confirmation of write error from elsewhere.

[    2.863703] UBI error: ubi_io_write: error -22 while writing 64 bytes to PEB 262:0, written 0 bytes
[    2.863728] UBI error: erase_worker: failed to erase PEB 262, error -22
[    2.863750] UBI warning: ubi_ro_mode: switch to read-only mode
[    2.863770] UBI error: do_work: work failed with error code -22
[    2.863791] UBI error: ubi_thread: ubi_bgt0d: work failed with error code -22

Bye bye UBI.

--------

First we note that the WAIT (inval_cache_and_wait_for_operation) function
never suspends itself; i.e. it never prints "WAIT: about to suspend".

The message at:
[    2.863431] 30: WAIT: status=0x00c000c0
shows that WAIT found status_OK (SR.7) set and Erase Suspend in effect.

Now things get complicated.

I repeated with exercise with my patch applied. This time the WAIT
function looped forever trying to erase the same block (0x00440000)
because the status never changed from 0x00e800e8 (rather like 2.785235
above, but with my patch now applied the function cannot return).

Disheartened, I thought my patch had only succeeded in livelocking the
function because of a hardware error.

However, remembering our version of the Heisenberg uncertainty principle -
that adding debug sometimes changes the behaviour - I stripped out almost
all of the debug and repeated the exercise. This time block 0x00440000
erased successfully without errors.

Where do we go from here?

The above debug demonstrates that inval_cache_and_wait_for_operation()
can return while Erase Suspend is in effect. My patch prevents that, and
fixes UBI for me.

I am open to the suggestion that my hx4700 exhibits random hardware
failures. This would explain the results. But I have a nagging doubt about
that because my bootloader has never reported erase errors (apart from once
when I broke its erase block size table). As I mentioned previously, the
bootloader never suspends erase operations.

I am open to the suggestion that my patch merely fixes a symptom rather
than an underlying cause.

So perhaps the MTD driver contains a race condition. The stripping out of
debug to fix the erase error tends to support that explanation.

I don't know what else to say.

Paul

Artem Bityutskiy Feb. 29, 2012, 6:03 p.m. UTC | #5

Did you try to validate your setup with mtd tests?

Paul Parsons Feb. 29, 2012, 10:26 p.m. UTC | #6

--- On Wed, 29/2/12, Artem Bityutskiy <dedekind1@gmail.com> wrote:
> Did you try to validate your setup
> with mtd tests?

No I did not.

I've just quickly tried the readtest on all mtd partitions and they seemed
to pass. A torturetest on a small unused partition reported "retried 3
times, still errors, give-up", but a stresstest on the same partition passed.

Was there a specific test and set of options I should try?
The hx4700 has NOR flash (Intel StrataFlash) so it looks like at least 4
of the tests - pagetest, oobtest, subpagetest, nandecctest - are not
appropriate.

Joakim Tjernlund Feb. 29, 2012, 11:35 p.m. UTC | #7

Paul Parsons <lost.distance@yahoo.com> wrote on 2012/02/29 18:22:39:
>
> --- On Wed, 29/2/12, Joakim Tjernlund <joakim.tjernlund@transmode.se> wrote:
> > If erase is suspended, chip->state is changed which will
> > keep the function in
> > the for(;;) loop. Once erase has been resumed again
> > chip->state will change back.
> > It seems to me that chip->state and SR.6 are mutally
> > exclusive? I can not see how
> > you can get to
> >     status = map_read(map, cmd_adr);
> >     if (map_word_andequal(map, status,
> > status_OK, status_OK))
> > if the the erase has been suspended.
>
> OK, I understand. Let me demonstrate.
>
> I Added the following debug to the original linux-3.3-rc5
> drivers/mtd/chips/cfi_cmdset_0001.c file (i.e. without my patch):

This looks like a erase resume problem, cfi driver issues a resume command but the
chip fails or is slow to respond. I remember someone having a similar problem
some time ago. Don't think we got to bottom of it though, look for subject:
  Numonyx NOR and chip->mutex bug?

Try adding extra read status cmds around resume, check if SR.6 is cleared too.
Enable all driver debug code and look for the first sign of trouble.

That said, there seems to be something fishy going on with UBI as it is
erasing the same failing block over and over again.
Also, it would probably be good to add test of SR.6 in some places, just
to make sure the flash does what the driver wants it to do.

 Jocke

Artem Bityutskiy March 9, 2012, 10:39 a.m. UTC | #8

On Wed, 2012-02-29 at 22:26 +0000, Paul Parsons wrote:
> --- On Wed, 29/2/12, Artem Bityutskiy <dedekind1@gmail.com> wrote:
> > Did you try to validate your setup
> > with mtd tests?
> 
> No I did not.
> 
> I've just quickly tried the readtest on all mtd partitions and they seemed
> to pass. A torturetest on a small unused partition reported "retried 3
> times, still errors, give-up", but a stresstest on the same partition passed.
> 
> Was there a specific test and set of options I should try?
> The hx4700 has NOR flash (Intel StrataFlash) so it looks like at least 4
> of the tests - pagetest, oobtest, subpagetest, nandecctest - are not
> appropriate.

In the past all tests either worked on nor or refused to load. I thought
stresstest should work as well. I do not have any specific test in mind,
threw out standard suggestion I give to people :-)

mtd: cfi: Wait for Block Erase operation to finish

Commit Message

Comments

Patch