Message ID | 20240702140841.3337-1-fw@strlen.de
---|---
State | Accepted |
Series | [nf] netfilter: nf_tables: unconditionally flush pending work before notifier
On Tue, 2 Jul 2024 16:08:14 +0200 Florian Westphal <fw@strlen.de>
> syzbot reports:
>
> KASAN: slab-uaf in nft_ctx_update include/net/netfilter/nf_tables.h:1831
> KASAN: slab-uaf in nft_commit_release net/netfilter/nf_tables_api.c:9530
> KASAN: slab-uaf in nf_tables_trans_destroy_work+0x152b/0x1750 net/netfilter/nf_tables_api.c:9597
> Read of size 2 at addr ffff88802b0051c4 by task kworker/1:1/45
> [..]
> Workqueue: events nf_tables_trans_destroy_work
> Call Trace:
>  nft_ctx_update include/net/netfilter/nf_tables.h:1831 [inline]
>  nft_commit_release net/netfilter/nf_tables_api.c:9530 [inline]
>  nf_tables_trans_destroy_work+0x152b/0x1750 net/netfilter/nf_tables_api.c:9597
>
> Problem is that the notifier does a conditional flush, but it's possible
> that the table-to-be-removed is still referenced by transactions being
> processed by the worker, so we need to flush unconditionally.

Given trans->table goes through the lifespan of trans, your proposal is a
band-aid if trans outlives table.
Hillf Danton <hdanton@sina.com> wrote:
> Given trans->table goes through the lifespan of trans, your proposal is
> a band-aid if trans outlives table.

trans must never outlive table.
On Wed, 3 Jul 2024 12:52:15 +0200 Florian Westphal <fw@strlen.de>
> Hillf Danton <hdanton@sina.com> wrote:
> > Given trans->table goes through the lifespan of trans, your proposal is
> > a band-aid if trans outlives table.
>
> trans must never outlive table.
>
What is preventing trans from being freed after closing sock, given trans
is freed in the workqueue?

	close sock
	queue work
Hillf Danton <hdanton@sina.com> wrote:
> On Wed, 3 Jul 2024 12:52:15 +0200 Florian Westphal <fw@strlen.de>
> > Hillf Danton <hdanton@sina.com> wrote:
> > > Given trans->table goes through the lifespan of trans, your proposal
> > > is a band-aid if trans outlives table.
> >
> > trans must never outlive table.
> >
> What is preventing trans from being freed after closing sock, given
> trans is freed in the workqueue?
>
>	close sock
>	queue work

The notifier acquires the transaction mutex, locking out all other
transactions, so no further transaction requests referencing the table
can be queued.

The work queue is flushed before potentially ripping the table out.
After this, no transactions referencing the table can exist anymore; the
only transactions that can still be queued are those coming from a
different netns, and tables are scoped per netns.

Table is torn down. Transaction mutex is released.

The next transaction from userspace can't find the table anymore (it's
gone), so no more transactions can be queued for this table.

As I wrote in the commit message, the flush is dumb; this should first
walk to see if there is a matching table to be torn down, and then flush
the work queue once before tearing the table down.

But it's better to clearly split the bug fix from such a change.
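[For readers following along, the ordering described above can be put in
a minimal C sketch. commit_mutex, trans_destroy_work and flush_work()
are the real kernel objects named in this thread;
nft_net_teardown_owned_tables() is a hypothetical placeholder for the
table-teardown loop in nft_rcv_nl_event(), not a real helper.]

	/* Minimal sketch of the notifier-side ordering; illustrative only.
	 * nft_net_teardown_owned_tables() is a hypothetical stand-in.
	 */
	static void notifier_ordering_sketch(struct nftables_pernet *nft_net,
					     u32 portid)
	{
		mutex_lock(&nft_net->commit_mutex);	/* lock out new transactions */

		/* wait until the destroy worker has released every transaction
		 * that may still reference a table in this netns
		 */
		flush_work(&trans_destroy_work);

		/* safe now: no pending transaction can reference these tables */
		nft_net_teardown_owned_tables(nft_net, portid);

		mutex_unlock(&nft_net->commit_mutex);
	}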
On Wed, 3 Jul 2024 15:01:07 +0200 Florian Westphal <fw@strlen.de>
> Hillf Danton <hdanton@sina.com> wrote:
> > On Wed, 3 Jul 2024 12:52:15 +0200 Florian Westphal <fw@strlen.de>
> > > Hillf Danton <hdanton@sina.com> wrote:
> > > > Given trans->table goes through the lifespan of trans, your
> > > > proposal is a band-aid if trans outlives table.
> > >
> > > trans must never outlive table.
> > >
> > What is preventing trans from being freed after closing sock, given
> > trans is freed in the workqueue?
> >
> >	close sock
> >	queue work
>
> The notifier acquires the transaction mutex, locking out all other
> transactions, so no further transaction requests referencing the table
> can be queued.
>
As per the syzbot report, trans->table could be instantiated before the
notifier acquires the transaction mutex. And in fact the lock helps trans
outlive table even with your patch.

	cpu1				cpu2
	---				---
					transB->table = A
	lock trans mutex
	flush work
	free A
	unlock trans mutex

					queue work to free transB

> The work queue is flushed before potentially ripping the table out.
> After this, no transactions referencing the table can exist anymore;
> the only transactions that can still be queued are those coming from a
> different netns, and tables are scoped per netns.
>
> Table is torn down. Transaction mutex is released.
>
> The next transaction from userspace can't find the table anymore (it's
> gone), so no more transactions can be queued for this table.
>
> As I wrote in the commit message, the flush is dumb; this should first
> walk to see if there is a matching table to be torn down, and then
> flush the work queue once before tearing the table down.
>
> But it's better to clearly split the bug fix from such a change.
Hillf Danton <hdanton@sina.com> wrote:
> On Wed, 3 Jul 2024 15:01:07 +0200 Florian Westphal <fw@strlen.de>
> > Hillf Danton <hdanton@sina.com> wrote:
> > > On Wed, 3 Jul 2024 12:52:15 +0200 Florian Westphal <fw@strlen.de>
> > > > Hillf Danton <hdanton@sina.com> wrote:
> > > > > Given trans->table goes through the lifespan of trans, your
> > > > > proposal is a band-aid if trans outlives table.
> > > >
> > > > trans must never outlive table.
> > > >
> > > What is preventing trans from being freed after closing sock, given
> > > trans is freed in the workqueue?
> > >
> > >	close sock
> > >	queue work
> >
> > The notifier acquires the transaction mutex, locking out all other
> > transactions, so no further transaction requests referencing the
> > table can be queued.
> >
> As per the syzbot report, trans->table could be instantiated before the
> notifier acquires the transaction mutex. And in fact the lock helps
> trans outlive table even with your patch.
>
>	cpu1				cpu2
>	---				---
>					transB->table = A
>	lock trans mutex
>	flush work
>	free A
>	unlock trans mutex
>
>					queue work to free transB

Can you show a crash reproducer or explain how this assignment and
queueing happen unordered wrt. cpu2?

This should look like this:

	cpu1				cpu2
	---				---
					lock trans mutex
	lock trans mutex -> blocks
					transB->table = A
					queue work to free transB
					unlock trans mutex
	lock trans mutex returns
	flush work
	free A
	unlock trans mutex
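[The "queue work before unlock" step the diagram relies on can be seen
in condensed form below. This is a paraphrase of the commit path for
illustration, simplified from nf_tables_commit_release() in
net/netfilter/nf_tables_api.c, not the verbatim kernel code.]

	/* Condensed paraphrase of the commit path under discussion;
	 * simplified for illustration.
	 */
	static void commit_release_sketch(struct nftables_pernet *nft_net)
	{
		/* trans->table was assigned earlier in this batch, while
		 * commit_mutex was already held
		 */
		spin_lock(&nf_tables_destroy_list_lock);
		list_splice_tail_init(&nft_net->commit_list, &nf_tables_destroy_list);
		spin_unlock(&nf_tables_destroy_list_lock);

		schedule_work(&trans_destroy_work);	/* queued before ... */
		mutex_unlock(&nft_net->commit_mutex);	/* ... the mutex is dropped */
	}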
On Thu, 4 Jul 2024 12:54:18 +0200 Florian Westphal <fw@strlen.de>
> Hillf Danton <hdanton@sina.com> wrote:
> > On Wed, 3 Jul 2024 15:01:07 +0200 Florian Westphal <fw@strlen.de>
> > > Hillf Danton <hdanton@sina.com> wrote:
> > > > On Wed, 3 Jul 2024 12:52:15 +0200 Florian Westphal <fw@strlen.de>
> > > > > Hillf Danton <hdanton@sina.com> wrote:
> > > > > > Given trans->table goes through the lifespan of trans, your
> > > > > > proposal is a band-aid if trans outlives table.
> > > > >
> > > > > trans must never outlive table.
> > > > >
> > > > What is preventing trans from being freed after closing sock,
> > > > given trans is freed in the workqueue?
> > > >
> > > >	close sock
> > > >	queue work
> > >
> > > The notifier acquires the transaction mutex, locking out all other
> > > transactions, so no further transaction requests referencing the
> > > table can be queued.
> > >
> > As per the syzbot report, trans->table could be instantiated before
> > the notifier acquires the transaction mutex. And in fact the lock
> > helps trans outlive table even with your patch.
> >
> >	cpu1				cpu2
> >	---				---
> >					transB->table = A
> >	lock trans mutex
> >	flush work
> >	free A
> >	unlock trans mutex
> >
> >					queue work to free transB
>
> Can you show a crash reproducer or explain how this assignment and
> queueing happen unordered wrt. cpu2?
>
Not so difficult.

> This should look like this:
>
>	cpu1				cpu2
>	---				---
>					lock trans mutex
>	lock trans mutex -> blocks
>					transB->table = A
>					queue work to free transB
>					unlock trans mutex
>	lock trans mutex returns
>	flush work
>	free A
>	unlock trans mutex
>
If your patch is correct, it should survive a warning.

#syz test https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git 1c5fc27bc48a

--- x/net/netfilter/nf_tables_api.c
+++ y/net/netfilter/nf_tables_api.c
@@ -11552,9 +11552,10 @@ static int nft_rcv_nl_event(struct notif
 
 	gc_seq = nft_gc_seq_begin(nft_net);
 
-	if (!list_empty(&nf_tables_destroy_list))
-		nf_tables_trans_destroy_flush_work();
+	nf_tables_trans_destroy_flush_work();
 again:
+	WARN_ON(!list_empty(&nft_net->commit_list));
+
 	list_for_each_entry(table, &nft_net->tables, list) {
 		if (nft_table_has_owner(table) &&
 		    n->portid == table->nlpid) {
--
Hillf Danton <hdanton@sina.com> wrote:
> >	lock trans mutex returns
> >	flush work
> >	free A
> >	unlock trans mutex
> >
> If your patch is correct, it should survive a warning.
>
> #syz test https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git 1c5fc27bc48a
>
> --- x/net/netfilter/nf_tables_api.c
> +++ y/net/netfilter/nf_tables_api.c
> @@ -11552,9 +11552,10 @@ static int nft_rcv_nl_event(struct notif
>
>  	gc_seq = nft_gc_seq_begin(nft_net);
>
> -	if (!list_empty(&nf_tables_destroy_list))
> -		nf_tables_trans_destroy_flush_work();
> +	nf_tables_trans_destroy_flush_work();
>  again:
> +	WARN_ON(!list_empty(&nft_net->commit_list));
> +

You could officially submit this patch to nf-next; this is a slow path,
and the transaction list must be empty here.

I think this change might be useful, as it also documents this
requirement.
Hello,

syzbot has tested the proposed patch and the reproducer did not trigger
any issue:

Reported-and-tested-by: syzbot+4fd66a69358fc15ae2ad@syzkaller.appspotmail.com

Tested on:

commit:         1c5fc27b Merge tag 'nf-next-24-06-28' of git://git.ker..
git tree:       https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git
console output: https://syzkaller.appspot.com/x/log.txt?x=152db3d1980000
kernel config:  https://syzkaller.appspot.com/x/.config?x=5264b58fdff6e881
dashboard link: https://syzkaller.appspot.com/bug?extid=4fd66a69358fc15ae2ad
compiler:       Debian clang version 15.0.6, GNU ld (GNU Binutils for Debian) 2.40
patch:          https://syzkaller.appspot.com/x/patch.diff?x=1395cd71980000

Note: testing is done by a robot and is best-effort only.
On Fri, 5 Jul 2024 13:02:18 +0200 Florian Westphal <fw@strlen.de>
> Hillf Danton <hdanton@sina.com> wrote:
> > >	lock trans mutex returns
> > >	flush work
> > >	free A
> > >	unlock trans mutex
> > >
> > If your patch is correct, it should survive a warning.
> >
> > #syz test https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git 1c5fc27bc48a
> >
> > --- x/net/netfilter/nf_tables_api.c
> > +++ y/net/netfilter/nf_tables_api.c
> > @@ -11552,9 +11552,10 @@ static int nft_rcv_nl_event(struct notif
> >
> >  	gc_seq = nft_gc_seq_begin(nft_net);
> >
> > -	if (!list_empty(&nf_tables_destroy_list))
> > -		nf_tables_trans_destroy_flush_work();
> > +	nf_tables_trans_destroy_flush_work();
> >  again:
> > +	WARN_ON(!list_empty(&nft_net->commit_list));
> > +
>
> You could officially submit this patch to nf-next; this is a slow path,
> and the transaction list must be empty here.
>
> I think this change might be useful, as it also documents this
> requirement.

Yes it is, and the current reproducer triggered another warning [1][2].

[1] https://lore.kernel.org/lkml/20240706231332.3261-1-hdanton@sina.com/
[2] https://lore.kernel.org/lkml/000000000000a42289061c9d452e@google.com/

And feel free to take a look at fput() and sock_put() for instances of
freeing slab objects asynchronously.
Hillf Danton <hdanton@sina.com> wrote:
> > I think this change might be useful, as it also documents this
> > requirement.
>
> Yes it is, and the current reproducer triggered another warning [1][2].
>
> [1] https://lore.kernel.org/lkml/20240706231332.3261-1-hdanton@sina.com/

The WARN is incorrect. The destroy list can be non-empty; I already
tried to explain why.
On Sun, 7 Jul 2024 10:08:24 +0200 Florian Westphal <fw@strlen.de>
> Hillf Danton <hdanton@sina.com> wrote:
> > > I think this change might be useful, as it also documents this
> > > requirement.
> >
> > Yes it is, and the current reproducer triggered another warning [1][2].
> >
> > [1] https://lore.kernel.org/lkml/20240706231332.3261-1-hdanton@sina.com/
>
> The WARN is incorrect. The destroy list can be non-empty; I already
> tried to explain why.
>
That warning as-is could be a false positive, but it could be triggered
with a single netns.

 cpu1                cpu2                cpu3
 ---                 ---                 ---
                                         nf_tables_trans_destroy_work()
                                         spin_lock(&nf_tables_destroy_list_lock);

                                         // 1) clear the destroy list
                                         list_splice_init(&nf_tables_destroy_list, &head);
                                         spin_unlock(&nf_tables_destroy_list_lock);

                     nf_tables_commit_release()
                     spin_lock(&nf_tables_destroy_list_lock);

                     // 2) refill the destroy list
                     list_splice_tail_init(&nft_net->commit_list, &nf_tables_destroy_list);
                     spin_unlock(&nf_tables_destroy_list_lock);
                     schedule_work(&trans_destroy_work);
                     mutex_unlock(&nft_net->commit_mutex);

 nft_rcv_nl_event()
 mutex_lock(&nft_net->commit_mutex);
 flush_work(&trans_destroy_work);
   start_flush_work()
     insert_wq_barrier()
     /*
      * If @target is currently being executed, schedule the
      * barrier to the worker; otherwise, put it after @target.
      */
 // 3) flush work ends with the refilled destroy list left intact
 tear tables down
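[For reference, the worker at the center of this diagram follows the
splice-snapshot pattern sketched below. This is a condensed paraphrase
of nf_tables_trans_destroy_work() in net/netfilter/nf_tables_api.c,
simplified for illustration, not the verbatim kernel code.]

	/* Condensed paraphrase of the destroy worker; simplified. */
	static void trans_destroy_work_sketch(struct work_struct *w)
	{
		struct nft_trans *trans, *next;
		LIST_HEAD(head);

		spin_lock(&nf_tables_destroy_list_lock);
		list_splice_init(&nf_tables_destroy_list, &head);	/* snapshot */
		spin_unlock(&nf_tables_destroy_list_lock);

		if (list_empty(&head))
			return;

		synchronize_rcu();

		/* entries queued after the splice above belong to the next
		 * scheduled execution of this work item, not to this one
		 */
		list_for_each_entry_safe(trans, next, &head, list)
			nft_commit_release(trans);	/* frees trans */
	}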
Hillf Danton <hdanton@sina.com> wrote:
> On Sun, 7 Jul 2024 10:08:24 +0200 Florian Westphal <fw@strlen.de>
> > Hillf Danton <hdanton@sina.com> wrote:
> > > > I think this change might be useful, as it also documents this
> > > > requirement.
> > >
> > > Yes it is, and the current reproducer triggered another warning [1][2].
> > >
> > > [1] https://lore.kernel.org/lkml/20240706231332.3261-1-hdanton@sina.com/
> >
> > The WARN is incorrect. The destroy list can be non-empty; I already
> > tried to explain why.
> >
> That warning as-is could be a false positive, but it could be triggered
> with a single netns.

How?

>  cpu1                cpu2                cpu3
>  ---                 ---                 ---
>                                          nf_tables_trans_destroy_work()
>                                          spin_lock(&nf_tables_destroy_list_lock);
>
>                                          // 1) clear the destroy list
>                                          list_splice_init(&nf_tables_destroy_list, &head);
>                                          spin_unlock(&nf_tables_destroy_list_lock);
>
>                      nf_tables_commit_release()
>                      spin_lock(&nf_tables_destroy_list_lock);
>
>                      // 2) refill the destroy list
>                      list_splice_tail_init(&nft_net->commit_list, &nf_tables_destroy_list);
>                      spin_unlock(&nf_tables_destroy_list_lock);
>                      schedule_work(&trans_destroy_work);
>                      mutex_unlock(&nft_net->commit_mutex);

So you're saying work can be IDLE after schedule_work()?

I'm not following at all.
On Mon, 8 Jul 2024 13:58:31 +0200 Florian Westphal <fw@strlen.de>
> Hillf Danton <hdanton@sina.com> wrote:
> > On Sun, 7 Jul 2024 10:08:24 +0200 Florian Westphal <fw@strlen.de>
> > > Hillf Danton <hdanton@sina.com> wrote:
> > > > > I think this change might be useful, as it also documents this
> > > > > requirement.
> > > >
> > > > Yes it is, and the current reproducer triggered another warning [1][2].
> > > >
> > > > [1] https://lore.kernel.org/lkml/20240706231332.3261-1-hdanton@sina.com/
> > >
> > > The WARN is incorrect. The destroy list can be non-empty; I already
> > > tried to explain why.
> > >
> > That warning as-is could be a false positive, but it could be
> > triggered with a single netns.
>
> How?
>
You saw the below cpu diagram, no?

> >  cpu1                cpu2                cpu3
> >  ---                 ---                 ---
> >                                          nf_tables_trans_destroy_work()
> >                                          spin_lock(&nf_tables_destroy_list_lock);
> >
> >                                          // 1) clear the destroy list
> >                                          list_splice_init(&nf_tables_destroy_list, &head);
> >                                          spin_unlock(&nf_tables_destroy_list_lock);
> >
> >                      nf_tables_commit_release()
> >                      spin_lock(&nf_tables_destroy_list_lock);
> >
> >                      // 2) refill the destroy list
> >                      list_splice_tail_init(&nft_net->commit_list, &nf_tables_destroy_list);
> >                      spin_unlock(&nf_tables_destroy_list_lock);
> >                      schedule_work(&trans_destroy_work);
> >                      mutex_unlock(&nft_net->commit_mutex);
>
> So you're saying work can be IDLE after schedule_work()?
>
I got your point, but it is difficult to explain to you. In simple words:
like the runqueue, the workqueue has latency.

> I'm not following at all.

This does not matter, but it is why I added tj to the cc list.
Hillf Danton <hdanton@sina.com> wrote:
> On Mon, 8 Jul 2024 13:58:31 +0200 Florian Westphal <fw@strlen.de>
> > Hillf Danton <hdanton@sina.com> wrote:
> > > On Sun, 7 Jul 2024 10:08:24 +0200 Florian Westphal <fw@strlen.de>
> > > > Hillf Danton <hdanton@sina.com> wrote:
> > > > > > I think this change might be useful, as it also documents
> > > > > > this requirement.
> > > > >
> > > > > Yes it is, and the current reproducer triggered another
> > > > > warning [1][2].
> > > > >
> > > > > [1] https://lore.kernel.org/lkml/20240706231332.3261-1-hdanton@sina.com/
> > > >
> > > > The WARN is incorrect. The destroy list can be non-empty; I
> > > > already tried to explain why.
> > > >
> > > That warning as-is could be a false positive, but it could be
> > > triggered with a single netns.
> >
> > How?
> >
> You saw the below cpu diagram, no?

It did not explain the problem in a way I understand.

>  cpu1                cpu2                cpu3
>  ---                 ---                 ---
>                                          nf_tables_trans_destroy_work()
>                                          spin_lock(&nf_tables_destroy_list_lock);
>
>                                          // 1) clear the destroy list
>                                          list_splice_init(&nf_tables_destroy_list, &head);
>                                          spin_unlock(&nf_tables_destroy_list_lock);

This means @work is running on cpu3 and made a snapshot of the list.
I don't even understand how that's relevant, but OK.

>                      nf_tables_commit_release()
>                      spin_lock(&nf_tables_destroy_list_lock);
>
>                      // 2) refill the destroy list
>                      list_splice_tail_init(&nft_net->commit_list, &nf_tables_destroy_list);
>                      spin_unlock(&nf_tables_destroy_list_lock);
>                      schedule_work(&trans_destroy_work);
>                      mutex_unlock(&nft_net->commit_mutex);

Means cpu2 has added transaction structures that could reference @table
to the list. It also called schedule_work BEFORE releasing the mutex and
after placing entries on the destroy list.

>  nft_rcv_nl_event()
>  mutex_lock(&nft_net->commit_mutex);
>  flush_work(&trans_destroy_work);

Means cpu1 serializes vs. cpu2; @work was scheduled.

flush_work() must only return if @work is idle, without any other
pending execution. If it gets scheduled again right after flush_work
returns, that is NOT a problem, as I tried to explain several times.

We hold the transaction mutex; only a different netns can queue more
work, and such a foreign netns can only see struct nft_table structures
that are private to its own namespace.

>  // 3) flush work ends with the refilled destroy list left intact
>  tear tables down

Again, I do not understand how that is possible. The postcondition after
flush_work returns is:

1. nf_tables_destroy_list must be empty, UNLESS the entries are from
   unrelated net namespaces; they cannot see the tables we're tearing
   down in 3), so they cannot reference them.

2. nf_tables_trans_destroy_work() is NOT running, unless it is
   processing entries queued by other netns after flush_work returned.

cpu2 does:
 -> add trans->table to @nf_tables_destroy_list
 -> unlock list spinlock
 -> schedule_work
 -> unlock mutex

cpu1 does:
 -> lock mutex
 -> flush work

You say this is not enough and that trans->table queued by cpu2 can
still be on @nf_tables_destroy_list. I say flush_work after taking the
mutex guarantees trans->table has been processed by @work in all cases.
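[The guarantee argued for above can be restated as code. The sketch
below is purely illustrative; the function names cpu2_commit_path and
cpu1_notifier_path are hypothetical, and only the locking and workqueue
calls are real kernel API.]

	/* Illustrative restatement of the happens-before argument. */
	static void cpu2_commit_path(struct nftables_pernet *nft_net,
				     struct nft_trans *trans)
	{
		/* all of this runs with commit_mutex held */
		spin_lock(&nf_tables_destroy_list_lock);
		list_add_tail(&trans->list, &nf_tables_destroy_list);
		spin_unlock(&nf_tables_destroy_list_lock);

		schedule_work(&trans_destroy_work);	/* queued ... */
		mutex_unlock(&nft_net->commit_mutex);	/* ... before the unlock */
	}

	static void cpu1_notifier_path(struct nftables_pernet *nft_net)
	{
		mutex_lock(&nft_net->commit_mutex);	/* returns after cpu2 unlocks */
		flush_work(&trans_destroy_work);	/* the work was already queued,
							 * so this waits for it to run */
		/* postcondition: the trans queued by cpu2 has been processed */
	}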
diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
index 02d75aefaa8e..683f6a4518ee 100644
--- a/net/netfilter/nf_tables_api.c
+++ b/net/netfilter/nf_tables_api.c
@@ -11552,8 +11552,7 @@ static int nft_rcv_nl_event(struct notifier_block *this, unsigned long event,
 
 	gc_seq = nft_gc_seq_begin(nft_net);
 
-	if (!list_empty(&nf_tables_destroy_list))
-		nf_tables_trans_destroy_flush_work();
+	nf_tables_trans_destroy_flush_work();
 again:
 	list_for_each_entry(table, &nft_net->tables, list) {
 		if (nft_table_has_owner(table) &&
syzbot reports:

KASAN: slab-uaf in nft_ctx_update include/net/netfilter/nf_tables.h:1831
KASAN: slab-uaf in nft_commit_release net/netfilter/nf_tables_api.c:9530
KASAN: slab-uaf in nf_tables_trans_destroy_work+0x152b/0x1750 net/netfilter/nf_tables_api.c:9597
Read of size 2 at addr ffff88802b0051c4 by task kworker/1:1/45
[..]
Workqueue: events nf_tables_trans_destroy_work
Call Trace:
 nft_ctx_update include/net/netfilter/nf_tables.h:1831 [inline]
 nft_commit_release net/netfilter/nf_tables_api.c:9530 [inline]
 nf_tables_trans_destroy_work+0x152b/0x1750 net/netfilter/nf_tables_api.c:9597

Problem is that the notifier does a conditional flush, but it's possible
that the table-to-be-removed is still referenced by transactions being
processed by the worker, so we need to flush unconditionally.

We could make the flush_work depend on whether we found a table to
delete in nf-next, to avoid the flush for most cases.

AFAICS this problem is only exposed in nf-next, with commit e169285f8c56
("netfilter: nf_tables: do not store nft_ctx in transaction objects");
with this commit applied there is an unconditional fetch of
table->family, which is what triggers the above splat.

Fixes: 2c9f0293280e ("netfilter: nf_tables: flush pending destroy work before netlink notifier")
Reported-and-tested-by: syzbot+4fd66a69358fc15ae2ad@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=4fd66a69358fc15ae2ad
Signed-off-by: Florian Westphal <fw@strlen.de>
---
 net/netfilter/nf_tables_api.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)