[v2,1/4] ipc_notify_receive: add timeout_ms argument

Message ID	20211209010659.724523-1-dominique.martinet@atmark-techno.com
State	Accepted
Headers	show Return-Path: <swupdate+bncBCWIVBV7SUGRBPFNYWGQMGQEUUUDKEA@googlegroups.com> Sender: swupdate@googlegroups.com Received-SPF: pass (google.com: domain of dominique.martinet@atmark-techno.com designates 35.74.137.57 as permitted sender) client-ip=35.74.137.57; From: Dominique Martinet <dominique.martinet@atmark-techno.com> To: swupdate@googlegroups.com Cc: Dominique Martinet <dominique.martinet@atmark-techno.com> Subject: [swupdate] [PATCH v2 1/4] ipc_notify_receive: add timeout_ms argument Date: Thu, 9 Dec 2021 10:06:56 +0900 Message-Id: <20211209010659.724523-1-dominique.martinet@atmark-techno.com> MIME-Version: 1.0 Content-Type: text/plain; charset="UTF-8" Precedence: list Mailing-list: list swupdate@googlegroups.com; contact swupdate+owners@googlegroups.com
Series	[v2,1/4] ipc_notify_receive: add timeout_ms argument \| expand [v2,1/4] ipc_notify_receive: add timeout_ms argument [v2,2/4] ipc_notify_connect: fix return 0 when message was not acked [v2,3/4] network_ipc: update swupdate_async_thread to use notify [v2,4/4] write_notify_msg: silence status client disappearing messages

Dominique Martinet Dec. 9, 2021, 1:06 a.m. UTC

allow checking if we have a notify message pending with a timeout.
Since we want timeout=0 to mean "check without waiting", also make
other timeout_ms in this file logic match for coherence:
 - timeout < 0 = wait forever
 - timeout >= 0 = pass timeout to select()

Since we now have 3 places in the code that basically do the same
optional select, move this to its own __ipc_select_timeout helper

Signed-off-by: Dominique Martinet <dominique.martinet@atmark-techno.com>
---
v2: trivial rebase on master

 include/network_ipc.h         |  4 +-
 ipc/network_ipc.c             | 86 ++++++++++++++++-------------------
 mongoose/mongoose_interface.c |  2 +-
 3 files changed, 43 insertions(+), 49 deletions(-)

Stefano Babic March 4, 2022, 8:23 a.m. UTC | #1

Hi Dominique,

On 09.12.21 02:06, Dominique Martinet wrote:
> allow checking if we have a notify message pending with a timeout.
> Since we want timeout=0 to mean "check without waiting", also make
> other timeout_ms in this file logic match for coherence:
>   - timeout < 0 = wait forever
>   - timeout >= 0 = pass timeout to select()
> 
> Since we now have 3 places in the code that basically do the same
> optional select, move this to its own __ipc_select_timeout helper
> 
> Signed-off-by: Dominique Martinet <dominique.martinet@atmark-techno.com>
> ---


I am facing regression issues on current TOT. The issue is a deadlock, 
and both client and SWUpdate are waiting for something. I saw the same 
issue by testing with "-i <swu file>", or sending a file wit 
swupdate-client, and I can reproduce this on my PC, too, not only on device.

After bisecting, this patch was responsible for the behavior - I haven't 
time to check in detail and find the reason, but after reverting 85b2842 
and da6eec the deadlock disappears. This is something that should be 
investigated, it is a blocking point for these patches.

Best regards,
Stefano

> v2: trivial rebase on master
> 
>   include/network_ipc.h         |  4 +-
>   ipc/network_ipc.c             | 86 ++++++++++++++++-------------------
>   mongoose/mongoose_interface.c |  2 +-
>   3 files changed, 43 insertions(+), 49 deletions(-)
> 
> diff --git a/include/network_ipc.h b/include/network_ipc.h
> index 6dc2521d827f..356931bc4e5c 100644
> --- a/include/network_ipc.h
> +++ b/include/network_ipc.h
> @@ -129,9 +129,9 @@ int ipc_inst_start_ext(void *priv, ssize_t size);
>   int ipc_send_data(int connfd, char *buf, int size);
>   void ipc_end(int connfd);
>   int ipc_get_status(ipc_message *msg);
> -int ipc_get_status_timeout(ipc_message *msg, unsigned int timeout_ms);
> +int ipc_get_status_timeout(ipc_message *msg, int timeout_ms);
>   int ipc_notify_connect(void);
> -int ipc_notify_receive(int *connfd, ipc_message *msg);
> +int ipc_notify_receive(int *connfd, ipc_message *msg, int timeout_ms);
>   int ipc_postupdate(ipc_message *msg);
>   int ipc_send_cmd(ipc_message *msg);
>   
> diff --git a/ipc/network_ipc.c b/ipc/network_ipc.c
> index 5ccc18b6c6d4..39ca695ec75f 100644
> --- a/ipc/network_ipc.c
> +++ b/ipc/network_ipc.c
> @@ -92,11 +92,32 @@ int ipc_postupdate(ipc_message *msg) {
>   	return -result;
>   }
>   
> -static int __ipc_get_status(int connfd, ipc_message *msg, unsigned int timeout_ms)
> -{
> +static int __ipc_select_timeout(int connfd, ipc_message *msg, int timeout_ms) {
>   	fd_set fds;
>   	struct timeval tv;
>   
> +	if (timeout_ms < 0)
> +		return 0;
> +	FD_ZERO(&fds);
> +	FD_SET(connfd, &fds);
> +
> +	/*
> +	 * Invalid the message
> +	 * Caller should check it
> +	 */
> +	msg->magic = 0;
> +
> +	tv.tv_sec = 0;
> +	tv.tv_usec = timeout_ms * 1000;
> +	if ((select(connfd + 1, &fds, NULL, NULL, &tv) <= 0) ||
> +			!FD_ISSET(connfd, &fds))
> +		return -ETIMEDOUT;
> +
> +	return 0;
> +}
> +
> +static int __ipc_get_status(int connfd, ipc_message *msg, int timeout_ms)
> +{
>   	memset(msg, 0, sizeof(*msg));
>   	msg->magic = IPC_MAGIC;
>   	msg->type = GET_STATUS;
> @@ -104,22 +125,8 @@ static int __ipc_get_status(int connfd, ipc_message *msg, unsigned int timeout_m
>   	if (write(connfd, msg, sizeof(*msg)) != sizeof(*msg))
>   		return -1;
>   
> -	if (timeout_ms) {
> -		FD_ZERO(&fds);
> -		FD_SET(connfd, &fds);
> -
> -		/*
> -		 * Invalid the message
> -		 * Caller should check it
> -		 */
> -		msg->magic = 0;
> -
> -		tv.tv_sec = 0;
> -		tv.tv_usec = timeout_ms * 1000;
> -		if ((select(connfd + 1, &fds, NULL, NULL, &tv) <= 0) ||
> -			!FD_ISSET(connfd, &fds))
> -			return -ETIMEDOUT;
> -	}
> +	if (__ipc_select_timeout(connfd, msg, timeout_ms))
> +		return -ETIMEDOUT;
>   
>   	return -(read(connfd, msg, sizeof(*msg)) != sizeof(*msg));
>   }
> @@ -133,7 +140,7 @@ int ipc_get_status(ipc_message *msg)
>   	if (connfd < 0)
>   		return -1;
>   
> -	ret = __ipc_get_status(connfd, msg, 0);
> +	ret = __ipc_get_status(connfd, msg, -1);
>   	close(connfd);
>   
>   	return ret;
> @@ -144,7 +151,7 @@ int ipc_get_status(ipc_message *msg)
>    *           -1 : error
>    *           else data read
>    */
> -int ipc_get_status_timeout(ipc_message *msg, unsigned int timeout_ms)
> +int ipc_get_status_timeout(ipc_message *msg, int timeout_ms)
>   {
>   	int ret;
>   	int connfd;
> @@ -163,11 +170,8 @@ int ipc_get_status_timeout(ipc_message *msg, unsigned int timeout_ms)
>   	return ret == 0 ? sizeof(*msg) : -1;
>   }
>   
> -static int __ipc_start_notify(int connfd, ipc_message *msg, unsigned int timeout_ms)
> +static int __ipc_start_notify(int connfd, ipc_message *msg, int timeout_ms)
>   {
> -	fd_set fds;
> -	struct timeval tv;
> -
>   	memset(msg, 0, sizeof(*msg));
>   	msg->magic = IPC_MAGIC;
>   	msg->type = NOTIFY_STREAM;
> @@ -175,22 +179,8 @@ static int __ipc_start_notify(int connfd, ipc_message *msg, unsigned int timeout
>   	if (write(connfd, msg, sizeof(*msg)) != sizeof(*msg))
>   		return -1;
>   
> -	if (timeout_ms) {
> -		FD_ZERO(&fds);
> -		FD_SET(connfd, &fds);
> -
> -		/*
> -		 * Invalid the message
> -		 * Caller should check it
> -		 */
> -		msg->magic = 0;
> -
> -		tv.tv_sec = 0;
> -		tv.tv_usec = timeout_ms * 1000;
> -		if ((select(connfd + 1, &fds, NULL, NULL, &tv) <= 0) ||
> -		!FD_ISSET(connfd, &fds))
> -			return -ETIMEDOUT;
> -	}
> +	if (__ipc_select_timeout(connfd, msg, timeout_ms))
> +		return -ETIMEDOUT;
>   
>   	return -(read(connfd, msg, sizeof(*msg)) != sizeof(*msg));
>   }
> @@ -208,7 +198,7 @@ int ipc_notify_connect(void)
>   	/*
>   	 * Initialize the notify stream
>   	 */
> -	ret = __ipc_start_notify(connfd, &msg, 0);
> +	ret = __ipc_start_notify(connfd, &msg, -1);
>   	if (ret || msg.type != ACK) {
>   		fprintf(stdout, "Notify connection handshake failed..\n");
>   		close(connfd);
> @@ -218,10 +208,14 @@ int ipc_notify_connect(void)
>   	return connfd;
>   }
>   
> -int ipc_notify_receive(int *connfd, ipc_message *msg)
> +int ipc_notify_receive(int *connfd, ipc_message *msg, int timeout_ms)
>   {
> -	int ret = read(*connfd, msg, sizeof(*msg));
> +	int ret;
>   
> +	if (__ipc_select_timeout(*connfd, msg, timeout_ms))
> +		return -ETIMEDOUT;
> +
> +	ret = read(*connfd, msg, sizeof(*msg));
>   	if (ret == -1 && (errno == EAGAIN || errno == EINTR))
>   		return 0;
>   
> @@ -229,14 +223,14 @@ int ipc_notify_receive(int *connfd, ipc_message *msg)
>   		fprintf(stdout, "Connection closing..\n");
>   		close(*connfd);
>   		*connfd = -1;
> -		return -1;
> +		return -EIO;
>   	}
>   
>   	if (msg->magic != IPC_MAGIC) {
>   		fprintf(stdout, "Connection closing, invalid magic...\n");
>   		close(*connfd);
>   		*connfd = -1;
> -		return -1;
> +		return -EIO;
>   	}
>   
>   	return ret;
> @@ -330,7 +324,7 @@ int ipc_wait_for_complete(getstatus callback)
>   		fd = prepare_ipc();
>   		if (fd < 0)
>   			break;
> -		ret = __ipc_get_status(fd, &message, 0);
> +		ret = __ipc_get_status(fd, &message, -1);
>   		close(fd);
>   
>   		if (ret < 0) {
> diff --git a/mongoose/mongoose_interface.c b/mongoose/mongoose_interface.c
> index 851a17bfbaf8..930cbf393a12 100644
> --- a/mongoose/mongoose_interface.c
> +++ b/mongoose/mongoose_interface.c
> @@ -173,7 +173,7 @@ static void *broadcast_message_thread(void *data)
>   			continue;
>   		}
>   
> -		ret = ipc_notify_receive(&fd, &msg);
> +		ret = ipc_notify_receive(&fd, &msg, -1);
>   		if (ret != sizeof(msg))
>   			return NULL;
>

Dominique Martinet March 4, 2022, 9 a.m. UTC | #2

Stefano Babic wrote on Fri, Mar 04, 2022 at 09:23:39AM +0100:
> I am facing regression issues on current TOT. The issue is a deadlock, and
> both client and SWUpdate are waiting for something. I saw the same issue by
> testing with "-i <swu file>", or sending a file wit swupdate-client, and I
> can reproduce this on my PC, too, not only on device.
> 
> After bisecting, this patch was responsible for the behavior - I haven't
> time to check in detail and find the reason, but after reverting 85b2842 and
> da6eec the deadlock disappears. This is something that should be
> investigated, it is a blocking point for these patches.

This is funny because I have an occasional deadlock without these
patches on my end! :)


But, fair enough, the patches have already been merged so I'd like to
help.
I've just updated my local version to master on my PC and ran a few
hundred of times with both a swu that exits immediately and a swu that
waits 2 seconds in a shell script and could not reproduce the hang.

(this is on a debian testing)


Would you be able to give me some more information on your setup so I
can try to reproduce?

I'm off for the weekend in about 10 minutes, but you can send me a bunch
of infos and I'll have a look first thing Monday morning (Sunday night
in europe)

Thanks,

Stefano Babic March 4, 2022, noon UTC | #3

Hi Dominique,

On 04.03.22 10:00, Dominique Martinet wrote:
> Stefano Babic wrote on Fri, Mar 04, 2022 at 09:23:39AM +0100:
>> I am facing regression issues on current TOT. The issue is a deadlock, and
>> both client and SWUpdate are waiting for something. I saw the same issue by
>> testing with "-i <swu file>", or sending a file wit swupdate-client, and I
>> can reproduce this on my PC, too, not only on device.
>>
>> After bisecting, this patch was responsible for the behavior - I haven't
>> time to check in detail and find the reason, but after reverting 85b2842 and
>> da6eec the deadlock disappears. This is something that should be
>> investigated, it is a blocking point for these patches.
> 
> This is funny because I have an occasional deadlock without these
> patches on my end! :)
> 
> 
> But, fair enough, the patches have already been merged so I'd like to
> help.
> I've just updated my local version to master on my PC and ran a few
> hundred of times with both a swu that exits immediately and a swu that
> waits 2 seconds in a shell script and could not reproduce the hang.
> 
> (this is on a debian testing)
> 

I run it on a Ubuntu 20.04

> 
> Would you be able to give me some more information on your setup so I
> can try to reproduce?

There is an embedded Lua script, it is not doing something weird, and 
the Lua script exited with success. To test, I run swupdate with 
--dry-run, and it looks like I can easy reproduce by increasing 
verbosity with "-l 5".

I just run the daemon without Webserver and suricatta (I try to exclude 
some causes). I start with:

./swupdate -Hpc:1.0 -l 5 -Hbeaglebone-yocto:1.0 -K 
own-ci-tests/etc/aeskey -e stable,production

(I replace board name of my device with beaglebone just to reproduce it)

And on another shell, I run

./tools/swupdate-client myimage.swu

I have two weird effects with applied patches. One is the deadlock, both 
SWUpdate and swupdate client hangs. I have to stop swupdate-client to 
let swupdate exiting with  an error. But if I run SWUpdate with less 
verbosity (-l 2), I do not see the deadlock.

The second weird effect is when I change the selection to another one, 
still valid. It is not recognized when patches are applied, it works 
flawlessly without. But as I said, I have just started to bisect to find 
which patches are responsible for the behavior, I have not checked 
deeper what is happening.

Best regards,
Stefano

> 
> I'm off for the weekend in about 10 minutes, but you can send me a bunch
> of infos and I'll have a look first thing Monday morning (Sunday night
> in europe)
> 
> Thanks,

Dominique Martinet March 7, 2022, 2:35 a.m. UTC | #4

Stefano Babic wrote on Fri, Mar 04, 2022 at 01:00:57PM +0100:
> > Would you be able to give me some more information on your setup so I
> > can try to reproduce?
> 
> There is an embedded Lua script, it is not doing something weird, and the
> Lua script exited with success. To test, I run swupdate with --dry-run, and
> it looks like I can easy reproduce by increasing verbosity with "-l 5".

I've tried adding a lua script as follow to no difference:
---
  embedded-script = "
    require(\"swupdate\")
    swupdate.trace(\"embedded-script loaded\")
  ";
---



> I just run the daemon without Webserver and suricatta (I try to exclude some
> causes). I start with:
> 
> ./swupdate -Hpc:1.0 -l 5 -Hbeaglebone-yocto:1.0 -K own-ci-tests/etc/aeskey
> -e stable,production
> 
> (I replace board name of my device with beaglebone just to reproduce it)

Ok, that looks like what I'm doing as well


> And on another shell, I run
> 
> ./tools/swupdate-client myimage.swu

For this I need to specify LD_LIBRARY_PATH=$PWD as swupdate-client
depends on libswupdate.so.0.1 : do you happen to have one installed, so
you'd be using an old swupdate-client lib with a newer swupdate server?

The IPC protocol did not change so I would expect this to work anyway,
but some new non-lib part of swupdate-client could be incompatible with
some of the old library, so I wouldn't be surprised if this caused a
deadlock.


I just tried and this doesn't dead-lock immediately for me, but I get
the old pre-patch behaviour, so I believe the deadlock I fixed should be
present.

That deadlock only happened with high verbosity and very short updates
(or dryrun), so this might be a match wrt your next paragraph:
> I have two weird effects with applied patches. One is the deadlock, both
> SWUpdate and swupdate client hangs. I have to stop swupdate-client to let
> swupdate exiting with  an error. But if I run SWUpdate with less verbosity
> (-l 2), I do not see the deadlock.

I wasn't able to reproduce this ultimately so it might just be hopeful
thinking though :)



> The second weird effect is when I change the selection to another one, still
> valid. It is not recognized when patches are applied, it works flawlessly
> without. But as I said, I have just started to bisect to find which patches
> are responsible for the behavior, I have not checked deeper what is
> happening.

Hmm, I don't see how these patches could impact the software selection
part of the code, but this might hint at some memory corruption?
I don't use this software selection part of the code but I'll run some
more tests with it and report back this afternoon.

Stefano Babic March 7, 2022, 8:31 a.m. UTC | #5

Hi Dominique,

On 07.03.22 03:35, Dominique Martinet wrote:
> Stefano Babic wrote on Fri, Mar 04, 2022 at 01:00:57PM +0100:
>>> Would you be able to give me some more information on your setup so I
>>> can try to reproduce?
>>
>> There is an embedded Lua script, it is not doing something weird, and the
>> Lua script exited with success. To test, I run swupdate with --dry-run, and
>> it looks like I can easy reproduce by increasing verbosity with "-l 5".
> 
> I've tried adding a lua script as follow to no difference:
> ---
>    embedded-script = "
>      require(\"swupdate\")
>      swupdate.trace(\"embedded-script loaded\")
>    ";
> ---
> 

I guess this is too small to get into trouble, I show later. One point 
is when one of the hooks fails, because the Lua's stack is dumped 
through the notify via TRACE.

> 
> 
>> I just run the daemon without Webserver and suricatta (I try to exclude some
>> causes). I start with:
>>
>> ./swupdate -Hpc:1.0 -l 5 -Hbeaglebone-yocto:1.0 -K own-ci-tests/etc/aeskey
>> -e stable,production
>>
>> (I replace board name of my device with beaglebone just to reproduce it)
> 
> Ok, that looks like what I'm doing as well
> 
> 
>> And on another shell, I run
>>
>> ./tools/swupdate-client myimage.swu
> 
> For this I need to specify LD_LIBRARY_PATH=$PWD as swupdate-client
> depends on libswupdate.so.0.1 : do you happen to have one installed, so
> you'd be using an old swupdate-client lib with a newer swupdate server?
> 

No, I set LD_LIBRARY_PATH as well.

> The IPC protocol did not change so I would expect this to work anyway,
> but some new non-lib part of swupdate-client could be incompatible with
> some of the old library, so I wouldn't be surprised if this caused a
> deadlock.
> 
> 
> I just tried and this doesn't dead-lock immediately for me, but I get
> the old pre-patch behaviour, so I believe the deadlock I fixed should be
> present.
> 
> That deadlock only happened with high verbosity and very short updates
> (or dryrun), so this might be a match wrt your next paragraph:
>> I have two weird effects with applied patches. One is the deadlock, both
>> SWUpdate and swupdate client hangs. I have to stop swupdate-client to let
>> swupdate exiting with  an error. But if I run SWUpdate with less verbosity
>> (-l 2), I do not see the deadlock.
> 
> I wasn't able to reproduce this ultimately so it might just be hopeful
> thinking though :)
> 

I can reproduce the issue yesterday and after attaching the debugger to 
both swupdate-client and swupdate, I think I have found the reason and 
definitely depends on theintroduced changes.

Both client and swupdate are hanging waiting for something - this is the 
backtrace on SWUpdate:

#0  __libc_send (flags=<optimized out>, len=3120, buf=0x7ffff3c6d7b8, 
fd=8) at ../sysdeps/unix/sysv/linux/send.c:28
#1  __libc_send (fd=fd@entry=8, buf=buf@entry=0x7ffff3c6d7b8, 
len=len@entry=3120, flags=flags@entry=16384)
     at ../sysdeps/unix/sysv/linux/send.c:23
#2  0x000055555557552f in write_notify_msg 
(msg=msg@entry=0x7ffff3c6d7b8, sockfd=8) at core/network_thread.c:137
#3  0x0000555555575bed in send_notify_msg (msg=0x7ffff3c6d7b8) at 
core/network_thread.c:164

	==> SWUpdate is running in send_notify_msg, calling write(). It cannot 
write because the socket is full, and is blocked here

#4  network_notifier (status=RUN, error=0, level=4, msg=<optimized out>) 
at core/network_thread.c:217
#5  0x000055555556f279 in notify (status=status@entry=RUN, 
error=error@entry=0, level=level@entry=4,
     msg=msg@entry=0x7ffff3c6ecb8 "[lua_dump_table] : (2) [table ] 
filesystem = ") at core/notifier.c:243
#6  0x0000555555583a2c in lua_dump_table (L=L@entry=0x7fffec004090, 
str=0x7fffec012c40 "(2) [table ]",

	==> lua_dumptable push the whole Lua stack, and it is quite verbose. A 
lot of traces are sent at this time. No surprise there is no space then 
to send data.

     img=img@entry=0x0, key=key@entry=0x0) at corelib/lua_interface.c:90
#7  0x0000555555586bae in LUAstackDump (L=L@entry=0x7fffec004090) at 
corelib/lua_interface.c:169
#8  0x0000555555587662 in lua_parser_fn (L=L@entry=0x7fffec004090, 
fcn=0x7fffec00e740 "set_partition",
     img=img@entry=0x7ffff3c705a0) at corelib/lua_interface.c:1327
#9  0x00005555555a10cf in run_embscript (p=p@entry=LIBCFG_PARSER, 
elem=0x0, elem@entry=0x7fffec00e400,
     img=img@entry=0x7ffff3c705a0, L=L@entry=0x7fffec004090, 
embscript=<optimized out>) at parser/parser.c:381
#10 0x00005555555a2cb8 in _parse_bootloader (p=p@entry=LIBCFG_PARSER, 
cfg=cfg@entry=0x7ffff3c723d0,
     setting=0x7fffec00e390, nodes=nodes@entry=0x7ffff3c719d8, 
swcfg=swcfg@entry=0x5555555c6f00 <swcfg>,
     L=L@entry=0x7fffec004090) at parser/parser.c:650
#11 0x00005555555a342b in parse_bootloader (L=0x7fffec004090, 
swcfg=0x5555555c6f00 <swcfg>, cfg=0x7ffff3c723d0,
     p=LIBCFG_PARSER) at parser/parser.c:713
#12 parser (p=p@entry=LIBCFG_PARSER, cfg=cfg@entry=0x7ffff3c723d0, 
swcfg=swcfg@entry=0x5555555c6f00 <swcfg>)
     at parser/parser.c:935
#13 0x00005555555a36eb in parse_cfg (swcfg=0x5555555c6f00 <swcfg>, 
filename=0x7ffff3c750b8 "/tmp/sw-description")
     at parser/parser.c:984
#14 0x00005555555724a4 in parse (sw=sw@entry=0x5555555c6f00 <swcfg>,
     descfile=descfile@entry=0x7ffff3c750b8 "/tmp/sw-description") at 
core/parser.c:157
#15 0x00005555555777fb in extract_files (software=0x5555555c6f00 
<swcfg>, fd=7) at core/stream_interface.c:164
#16 network_initializer (data=0x5555555c6f00 <swcfg>) at 
core/stream_interface.c:594
#17 0x00007ffff7f5e609 in start_thread (arg=<optimized out>) at 
pthread_create.c:477
#18 0x00007ffff7676163 in clone () at 
../sysdeps/unix/sysv/linux/x86_64/clone.S:95

	==> the rest is normal behavior, sw-description is parsed.

The original design for client and SWUpdate is that the flowcontrol is 
done by writing into the IPC socket. If SWUpdate is slow to install 
something, it does not read from the IPC (install), and the 
swupdate_async_thread() will block then in the swupdate_image_write(), 
so in the ipc_send_data(). This allows to control independently if we 
get fast the data or we are faster to install: in the second case, the 
provided callback (wr_func in swupdate_async_start) will block, and 
SWUpdate waits in the read call.

The patches introduced a dependency with the notify: in fact, it is also 
a requirement that the listeners consume the notification. A listener 
can crash, of course, but it cannot ignore the notifications. This is 
what it happens here.

We have two socket, and the client tries to push the data (SWU) calling 
ipc_send_data(), but it happens that the socket is full.

SWUpdate has read the data, but during the processing it writes a lot of 
notification - but it can't, because the client does not read them (it 
is blocked), and SWUpdate will block when the socket for notification is 
full ==> we have the deadlock.

I have found a couple of further bugs, that is the library call exit() 
(it should not, it is a library) and ipc_send_data() does not check if 
all data were written (bad, strange it was not noted before). I will 
send patches for these two issues, too.

Anyway,  the deadlock is really due to the changed behavior. I would 
like to restart from a known state, and I agree with you about the race 
if the SWU is too short, but I think we can readdress this after the 
deadlock is away.

So my next step will be to revert the two patches (I let my system in 
test for the whole night without deadlocks) with the fixes for 
ipc_send_data(), and afterward we check how to fix the other problem. 
Sure, the notification could be read in a separate thread (this was my 
first thought), but it looks to me overengineering and too much 
asynchronicity. There is maybe another solution.

> 
> 
>> The second weird effect is when I change the selection to another one, still
>> valid. It is not recognized when patches are applied, it works flawlessly
>> without. But as I said, I have just started to bisect to find which patches
>> are responsible for the behavior, I have not checked deeper what is
>> happening.
> 
> Hmm, I don't see how these patches could impact the software selection
> part of the code,

See above - from my analyses I am pretty sure I am right. Backtrace 
helped a lot to understand what is happening.

> but this might hint at some memory corruption?
> I don't use this software selection part of the code but I'll run some
> more tests with it and report back this afternoon.
> 

Best regards,
Stefano

Dominique Martinet March 7, 2022, 9:02 a.m. UTC | #6

Stefano Babic wrote on Mon, Mar 07, 2022 at 09:31:18AM +0100:
> I can reproduce the issue yesterday and after attaching the debugger to both
> swupdate-client and swupdate, I think I have found the reason and definitely
> depends on theintroduced changes.

Ok, with your traces and explanation I was able to reproduce.
It requires both lots of messages during parsing of swu and a big swu to
send, my swu files were too small to reproduce...

> The original design for client and SWUpdate is that the flowcontrol is done
> by writing into the IPC socket. If SWUpdate is slow to install something, it
> does not read from the IPC (install), and the swupdate_async_thread() will
> block then in the swupdate_image_write(), so in the ipc_send_data(). This
> allows to control independently if we get fast the data or we are faster to
> install: in the second case, the provided callback (wr_func in
> swupdate_async_start) will block, and SWUpdate waits in the read call.
> 
> The patches introduced a dependency with the notify: in fact, it is also a
> requirement that the listeners consume the notification. A listener can
> crash, of course, but it cannot ignore the notifications. This is what it
> happens here.
> 
> We have two socket, and the client tries to push the data (SWU) calling
> ipc_send_data(), but it happens that the socket is full.
> 
> SWUpdate has read the data, but during the processing it writes a lot of
> notification - but it can't, because the client does not read them (it is
> blocked), and SWUpdate will block when the socket for notification is full
> ==> we have the deadlock.
> 
> I have found a couple of further bugs, that is the library call exit() (it
> should not, it is a library) and ipc_send_data() does not check if all data
> were written (bad, strange it was not noted before). I will send patches for
> these two issues, too.
> 
> Anyway,  the deadlock is really due to the changed behavior. I would like to
> restart from a known state, and I agree with you about the race if the SWU
> is too short, but I think we can readdress this after the deadlock is away.
> 
> So my next step will be to revert the two patches (I let my system in test
> for the whole night without deadlocks) with the fixes for ipc_send_data(),
> and afterward we check how to fix the other problem. Sure, the notification
> could be read in a separate thread (this was my first thought), but it looks
> to me overengineering and too much asynchronicity. There is maybe another
> solution.


We don't need threads for this, the problem is that the client writes
too much e.g. it does:
----
loop {
  write
  read
}
----
from socket, but it starts writing again before swupdate is ready to
read more.

We just need to use poll (or similar) here on notify_fd (read) and
rq->connfd (write), and that will be it.


Please don't revert my patches, my workday is just over here but I can
send you a fix during your night with very little modification that fits
in 10 lines... Heck, let me do that now, it'll be a minute to test.

Stefano Babic March 7, 2022, 10:03 a.m. UTC | #7

On 07.03.22 10:02, Dominique Martinet wrote:
> Stefano Babic wrote on Mon, Mar 07, 2022 at 09:31:18AM +0100:
>> I can reproduce the issue yesterday and after attaching the debugger to both
>> swupdate-client and swupdate, I think I have found the reason and definitely
>> depends on theintroduced changes.
> 
> Ok, with your traces and explanation I was able to reproduce.
> It requires both lots of messages during parsing of swu and a big swu to
> send, my swu files were too small to reproduce...
> 
>> The original design for client and SWUpdate is that the flowcontrol is done
>> by writing into the IPC socket. If SWUpdate is slow to install something, it
>> does not read from the IPC (install), and the swupdate_async_thread() will
>> block then in the swupdate_image_write(), so in the ipc_send_data(). This
>> allows to control independently if we get fast the data or we are faster to
>> install: in the second case, the provided callback (wr_func in
>> swupdate_async_start) will block, and SWUpdate waits in the read call.
>>
>> The patches introduced a dependency with the notify: in fact, it is also a
>> requirement that the listeners consume the notification. A listener can
>> crash, of course, but it cannot ignore the notifications. This is what it
>> happens here.
>>
>> We have two socket, and the client tries to push the data (SWU) calling
>> ipc_send_data(), but it happens that the socket is full.
>>
>> SWUpdate has read the data, but during the processing it writes a lot of
>> notification - but it can't, because the client does not read them (it is
>> blocked), and SWUpdate will block when the socket for notification is full
>> ==> we have the deadlock.
>>
>> I have found a couple of further bugs, that is the library call exit() (it
>> should not, it is a library) and ipc_send_data() does not check if all data
>> were written (bad, strange it was not noted before). I will send patches for
>> these two issues, too.
>>
>> Anyway,  the deadlock is really due to the changed behavior. I would like to
>> restart from a known state, and I agree with you about the race if the SWU
>> is too short, but I think we can readdress this after the deadlock is away.
>>
>> So my next step will be to revert the two patches (I let my system in test
>> for the whole night without deadlocks) with the fixes for ipc_send_data(),
>> and afterward we check how to fix the other problem. Sure, the notification
>> could be read in a separate thread (this was my first thought), but it looks
>> to me overengineering and too much asynchronicity. There is maybe another
>> solution.
> 
> 
> We don't need threads for this, the problem is that the client writes
> too much e.g. it does:
> ----
> loop {
>    write
>    read
> }

Correct.

> ----
> from socket, but it starts writing again before swupdate is ready to
> read more.

Yes, but this was allowed - it is the way flow-control is implemented.

> 
> We just need to use poll (or similar) here on notify_fd (read) and
> rq->connfd (write), and that will be it.
> 
> 
> Please don't revert my patches, my workday is just over here but I can
> send you a fix during your night with very little modification that fits
> in 10 lines... Heck, let me do that now, it'll be a minute to test.
> 

I do not revert for the moment, I check your patch.

Regards,
Stefano

[v2,1/4] ipc_notify_receive: add timeout_ms argument

Commit Message

Comments

Patch