From patchwork Fri Oct 11 12:19:47 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Damien Le Moal X-Patchwork-Id: 1996097 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@legolas.ozlabs.org Authentication-Results: legolas.ozlabs.org; dkim=pass (2048-bit key; unprotected) header.d=kernel.org header.i=@kernel.org header.a=rsa-sha256 header.s=k20201202 header.b=byNuwZgT; dkim-atps=neutral Authentication-Results: legolas.ozlabs.org; spf=pass (sender SPF authorized) smtp.mailfrom=vger.kernel.org (client-ip=147.75.80.249; helo=am.mirrors.kernel.org; envelope-from=linux-pci+bounces-14308-incoming=patchwork.ozlabs.org@vger.kernel.org; receiver=patchwork.ozlabs.org) Received: from am.mirrors.kernel.org (am.mirrors.kernel.org [147.75.80.249]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature ECDSA (secp384r1)) (No client certificate requested) by legolas.ozlabs.org (Postfix) with ESMTPS id 4XQ5Kw0ZGBz1xv0 for ; Fri, 11 Oct 2024 23:20:04 +1100 (AEDT) Received: from smtp.subspace.kernel.org (wormhole.subspace.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by am.mirrors.kernel.org (Postfix) with ESMTPS id 7E78E1F26272 for ; Fri, 11 Oct 2024 12:20:01 +0000 (UTC) Received: from localhost.localdomain (localhost.localdomain [127.0.0.1]) by smtp.subspace.kernel.org (Postfix) with ESMTP id 6F6CD213EEB; Fri, 11 Oct 2024 12:19:57 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="byNuwZgT" X-Original-To: linux-pci@vger.kernel.org Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 4BDF1802 for ; Fri, 11 Oct 2024 12:19:56 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1728649197; cv=none; b=oP8wCh/aTnc7LoKMkR+fJmEZgRVe/q/TY0iFXmaQUBcnYjV5xejNxucXczKNaCMkBg71RLDySXZNFH1UsN334aqytfDx4nrZdJtay3hmUx7RGB6d7FLT2bPnwGMZUy6Ol40FR3MQSI7nQ5X3NY16OzVRXGdaX+CaKiKyf+65ubY= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1728649197; c=relaxed/simple; bh=d7+U9CxxyPqSKY2n/y+GDBVMjr19oSCqwuEa33sYgf4=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=JWwwIfi/oMVfb+P9itg2Nwdw60+cav60EdutZECpFkRsjO/I2ynhAKBWnpwCqxUCgxBREzkVZV1qB5/aEKCNzfP2we1nPwxaqx+E1pnu9FJvqL39GJE3eRRHV9hosORYWl5QCuNf76ihLZv4d1SWsBHsp6uB3O5db4e1f+wU68w= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=byNuwZgT; arc=none smtp.client-ip=10.30.226.201 Received: by smtp.kernel.org (Postfix) with ESMTPSA id 18AEDC4CED1; Fri, 11 Oct 2024 12:19:54 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1728649196; bh=d7+U9CxxyPqSKY2n/y+GDBVMjr19oSCqwuEa33sYgf4=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=byNuwZgTyvV3fYCB0bsicI3k6znK8ufQlD5nM7G4u5gArpT7bG8WOdn+l3M6tKNnA 6zGCwGhzozfl2qvM+ESXJQxTOhpRCtXcrEiPagwvlBJwqM1521rLlltDkeAl1lkSSW nMzYUzuC/CctZIPLII1DpDqBlgL5fwwjozJpca0xCPnAbR1HyPiTX2SfJjug/XE49f 
Wj0AclgPK3KwTvnmopym2jD2eJHjNLQiSM2XGZ0cD2lYOOqT8HrjM7H3LWa1tMyqbr AWvzkvTBE42ZIawOgeuQ+tkiXrXUVbfx3tsh+pCLLE0/nuiOGw+CiLy0qCWC8QmSZV 0Y3fVRO0NXogA== From: Damien Le Moal To: linux-nvme@lists.infradead.org, Keith Busch , Christoph Hellwig , Sagi Grimberg , Manivannan Sadhasivam , =?utf-8?q?Krzyszt?= =?utf-8?q?of_Wilczy=C5=84ski?= , Kishon Vijay Abraham I , Bjorn Helgaas , Lorenzo Pieralisi , linux-pci@vger.kernel.org Cc: Rick Wertenbroek , Niklas Cassel Subject: [PATCH v2 1/5] nvmet: rename and move nvmet_get_log_page_len() Date: Fri, 11 Oct 2024 21:19:47 +0900 Message-ID: <20241011121951.90019-2-dlemoal@kernel.org> X-Mailer: git-send-email 2.46.2 In-Reply-To: <20241011121951.90019-1-dlemoal@kernel.org> References: <20241011121951.90019-1-dlemoal@kernel.org> Precedence: bulk X-Mailing-List: linux-pci@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 The code for nvmet_get_log_page_len() has no pedendency on nvme target code and only depends on struct nvme_command. Move this helper function out of drivers/nvme/target/admin-cmd.c and inline it as part of the generic definitions in include/linux/nvme.h. Apply the same modification to nvmet_get_log_page_offset(). Signed-off-by: Damien Le Moal Reviewed-by: Christoph Hellwig Reviewed-by: Chaitanya Kulkarni --- drivers/nvme/target/admin-cmd.c | 20 +------------------- drivers/nvme/target/discovery.c | 4 ++-- drivers/nvme/target/nvmet.h | 3 --- include/linux/nvme.h | 19 +++++++++++++++++++ 4 files changed, 22 insertions(+), 24 deletions(-) diff --git a/drivers/nvme/target/admin-cmd.c b/drivers/nvme/target/admin-cmd.c index 081f0473cd9e..64434654b713 100644 --- a/drivers/nvme/target/admin-cmd.c +++ b/drivers/nvme/target/admin-cmd.c @@ -12,19 +12,6 @@ #include #include "nvmet.h" -u32 nvmet_get_log_page_len(struct nvme_command *cmd) -{ - u32 len = le16_to_cpu(cmd->get_log_page.numdu); - - len <<= 16; - len += le16_to_cpu(cmd->get_log_page.numdl); - /* NUMD is a 0's based value */ - len += 1; - len *= sizeof(u32); - - return len; -} - static u32 nvmet_feat_data_len(struct nvmet_req *req, u32 cdw10) { switch (cdw10 & 0xff) { @@ -35,11 +22,6 @@ static u32 nvmet_feat_data_len(struct nvmet_req *req, u32 cdw10) } } -u64 nvmet_get_log_page_offset(struct nvme_command *cmd) -{ - return le64_to_cpu(cmd->get_log_page.lpo); -} - static void nvmet_execute_get_log_page_noop(struct nvmet_req *req) { nvmet_req_complete(req, nvmet_zero_sgl(req, 0, req->transfer_len)); @@ -319,7 +301,7 @@ static void nvmet_execute_get_log_page_ana(struct nvmet_req *req) static void nvmet_execute_get_log_page(struct nvmet_req *req) { - if (!nvmet_check_transfer_len(req, nvmet_get_log_page_len(req->cmd))) + if (!nvmet_check_transfer_len(req, nvme_get_log_page_len(req->cmd))) return; switch (req->cmd->get_log_page.lid) { diff --git a/drivers/nvme/target/discovery.c b/drivers/nvme/target/discovery.c index 28843df5fa7c..71c94a54bcd8 100644 --- a/drivers/nvme/target/discovery.c +++ b/drivers/nvme/target/discovery.c @@ -163,8 +163,8 @@ static void nvmet_execute_disc_get_log_page(struct nvmet_req *req) const int entry_size = sizeof(struct nvmf_disc_rsp_page_entry); struct nvmet_ctrl *ctrl = req->sq->ctrl; struct nvmf_disc_rsp_page_hdr *hdr; - u64 offset = nvmet_get_log_page_offset(req->cmd); - size_t data_len = nvmet_get_log_page_len(req->cmd); + u64 offset = nvme_get_log_page_offset(req->cmd); + size_t data_len = nvme_get_log_page_len(req->cmd); size_t alloc_len; struct nvmet_subsys_link *p; struct nvmet_port *r; diff --git a/drivers/nvme/target/nvmet.h 
b/drivers/nvme/target/nvmet.h index 190f55e6d753..6e9499268c28 100644 --- a/drivers/nvme/target/nvmet.h +++ b/drivers/nvme/target/nvmet.h @@ -541,9 +541,6 @@ u16 nvmet_copy_from_sgl(struct nvmet_req *req, off_t off, void *buf, size_t len); u16 nvmet_zero_sgl(struct nvmet_req *req, off_t off, size_t len); -u32 nvmet_get_log_page_len(struct nvme_command *cmd); -u64 nvmet_get_log_page_offset(struct nvme_command *cmd); - extern struct list_head *nvmet_ports; void nvmet_port_disc_changed(struct nvmet_port *port, struct nvmet_subsys *subsys); diff --git a/include/linux/nvme.h b/include/linux/nvme.h index b58d9405d65e..1f6d8cd0389a 100644 --- a/include/linux/nvme.h +++ b/include/linux/nvme.h @@ -10,6 +10,7 @@ #include #include #include +#include /* NQN names in commands fields specified one size */ #define NVMF_NQN_FIELD_LEN 256 @@ -1856,6 +1857,24 @@ static inline bool nvme_is_write(const struct nvme_command *cmd) return cmd->common.opcode & 1; } +static inline __u32 nvme_get_log_page_len(struct nvme_command *cmd) +{ + __u32 len = le16_to_cpu(cmd->get_log_page.numdu); + + len <<= 16; + len += le16_to_cpu(cmd->get_log_page.numdl); + /* NUMD is a 0's based value */ + len += 1; + len *= sizeof(__u32); + + return len; +} + +static inline __u64 nvme_get_log_page_offset(struct nvme_command *cmd) +{ + return le64_to_cpu(cmd->get_log_page.lpo); +} + enum { /* * Generic Command Status: From patchwork Fri Oct 11 12:19:48 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Damien Le Moal X-Patchwork-Id: 1996098 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@legolas.ozlabs.org Authentication-Results: legolas.ozlabs.org; dkim=pass (2048-bit key; unprotected) header.d=kernel.org header.i=@kernel.org header.a=rsa-sha256 header.s=k20201202 header.b=XcGXmSZ5; dkim-atps=neutral Authentication-Results: legolas.ozlabs.org; spf=pass (sender SPF authorized) smtp.mailfrom=vger.kernel.org (client-ip=147.75.80.249; helo=am.mirrors.kernel.org; envelope-from=linux-pci+bounces-14310-incoming=patchwork.ozlabs.org@vger.kernel.org; receiver=patchwork.ozlabs.org) Received: from am.mirrors.kernel.org (am.mirrors.kernel.org [147.75.80.249]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature ECDSA (secp384r1)) (No client certificate requested) by legolas.ozlabs.org (Postfix) with ESMTPS id 4XQ5Ky6mlFz1xsc for ; Fri, 11 Oct 2024 23:20:06 +1100 (AEDT) Received: from smtp.subspace.kernel.org (wormhole.subspace.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by am.mirrors.kernel.org (Postfix) with ESMTPS id ABA631F262CA for ; Fri, 11 Oct 2024 12:20:04 +0000 (UTC) Received: from localhost.localdomain (localhost.localdomain [127.0.0.1]) by smtp.subspace.kernel.org (Postfix) with ESMTP id 8014A217314; Fri, 11 Oct 2024 12:19:59 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="XcGXmSZ5" X-Original-To: linux-pci@vger.kernel.org Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 5C72D215026 for ; Fri, 11 Oct 2024 12:19:59 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none 
smtp.client-ip=10.30.226.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1728649199; cv=none; b=pkvEfirzZbVWfNWXXg8qTbEB930+vb8iYXVLJ9nXyTdwOzz0+5IfjdyfZPri7v5XVlpEqrUtaNHCNxueC8wlO0aNG+dNT7INVfowrbVYs2pc8cO9ZmQYIJrE/AhYj3FdkoZrO0xVz49qzO6d4ERM7uG1G4OJJeoQY3xCVq1fVDg= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1728649199; c=relaxed/simple; bh=USrx1eSbx00KxGW5SNsqKGkCfe9T8LfwHwehSbOrR1g=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=lb2SOUVdevCivZ1UnB1pfFd+Y1AjgQmR0eHpeuuCHMrP0aMrSpje956AOzQxc8hP72525Z5SAnWXL2VC6z705IRpVhprO/zxKXXpoVAnt+BripDJW1Jm4tLuiHIQJHCUWzKMzwm59aAtPRezIxvGzlSkufGuLDrhcxqMhx32PnE= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=XcGXmSZ5; arc=none smtp.client-ip=10.30.226.201 Received: by smtp.kernel.org (Postfix) with ESMTPSA id 2F006C4CECF; Fri, 11 Oct 2024 12:19:57 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1728649198; bh=USrx1eSbx00KxGW5SNsqKGkCfe9T8LfwHwehSbOrR1g=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=XcGXmSZ5EBdyza640mDUbyYFCHJvq6TMDw8IAWU758p7dl/BSOWqNRNZyd9UGI3lJ nC/Ukp2LR8pFOUI6a0J8ZF0o236l0qyZjtGMBguHNp7uhQ5YXSsDKotCD+yXK8WyMg LvXrSFHlR1IPfdb02QSUKinQVmICK2jKXjI4+nVqqjOz82xPOfcQAXMgTa0ZOkdGEW 0c4UpPQyH4cp6xjG68G4yIBiu5zDJMhfvUEmnENbKugPuxKzvKyaP+2AASFX4C4M5x TH+cv63Gq91FtIgeRIByVI8bxYNGQZzYHq7omHm9BTteTWlbZo1+xAEmW1PgWRmpjm dBq0nWJIyrGGg== From: Damien Le Moal To: linux-nvme@lists.infradead.org, Keith Busch , Christoph Hellwig , Sagi Grimberg , Manivannan Sadhasivam , =?utf-8?q?Krzyszt?= =?utf-8?q?of_Wilczy=C5=84ski?= , Kishon Vijay Abraham I , Bjorn Helgaas , Lorenzo Pieralisi , linux-pci@vger.kernel.org Cc: Rick Wertenbroek , Niklas Cassel Subject: [PATCH v2 2/5] nvmef: export nvmef_create_ctrl() Date: Fri, 11 Oct 2024 21:19:48 +0900 Message-ID: <20241011121951.90019-3-dlemoal@kernel.org> X-Mailer: git-send-email 2.46.2 In-Reply-To: <20241011121951.90019-1-dlemoal@kernel.org> References: <20241011121951.90019-1-dlemoal@kernel.org> Precedence: bulk X-Mailing-List: linux-pci@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Export nvmef_create_ctrl() to allow drivers to directly call this function instead of forcing the creation of a fabrics host controller with a write to /dev/nvmef. The export is restricted to the NVME_FABRICS namespace. 
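For illustration only (not part of this patch): a minimal sketch of how an in-kernel user could call the exported helper. The wrapper function name, transport and NQN below are made up; the options string format is the usual "key=value[,...]" form parsed by nvmf_parse_options(), and a modular caller would also need access to the private header drivers/nvme/host/fabrics.h plus MODULE_IMPORT_NS(NVME_FABRICS) since the export is namespaced.

	/*
	 * Hypothetical example: create a fabrics controller directly from
	 * another kernel driver instead of writing the connect string to
	 * the nvme-fabrics character device. Error handling is minimal.
	 */
	static struct nvme_ctrl *example_create_ctrl(struct device *dev)
	{
		/* Transport and NQN are illustrative only. */
		static const char opts[] =
			"transport=loop,nqn=nqn.2024-10.org.example:subsys0";
		struct nvme_ctrl *ctrl;

		ctrl = nvmf_create_ctrl(dev, opts);
		if (IS_ERR(ctrl))
			pr_err("nvmf_create_ctrl() failed: %ld\n",
			       PTR_ERR(ctrl));

		return ctrl;
	}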
Signed-off-by: Damien Le Moal Reviewed-by: Chaitanya Kulkarni --- drivers/nvme/host/fabrics.c | 4 ++-- drivers/nvme/host/fabrics.h | 1 + 2 files changed, 3 insertions(+), 2 deletions(-) diff --git a/drivers/nvme/host/fabrics.c b/drivers/nvme/host/fabrics.c index 432efcbf9e2f..e3c990d50704 100644 --- a/drivers/nvme/host/fabrics.c +++ b/drivers/nvme/host/fabrics.c @@ -1276,8 +1276,7 @@ EXPORT_SYMBOL_GPL(nvmf_free_options); NVMF_OPT_FAIL_FAST_TMO | NVMF_OPT_DHCHAP_SECRET |\ NVMF_OPT_DHCHAP_CTRL_SECRET) -static struct nvme_ctrl * -nvmf_create_ctrl(struct device *dev, const char *buf) +struct nvme_ctrl *nvmf_create_ctrl(struct device *dev, const char *buf) { struct nvmf_ctrl_options *opts; struct nvmf_transport_ops *ops; @@ -1346,6 +1345,7 @@ nvmf_create_ctrl(struct device *dev, const char *buf) nvmf_free_options(opts); return ERR_PTR(ret); } +EXPORT_SYMBOL_NS_GPL(nvmf_create_ctrl, NVME_FABRICS); static const struct class nvmf_class = { .name = "nvme-fabrics", diff --git a/drivers/nvme/host/fabrics.h b/drivers/nvme/host/fabrics.h index 21d75dc4a3a0..2dd3aeb8c53a 100644 --- a/drivers/nvme/host/fabrics.h +++ b/drivers/nvme/host/fabrics.h @@ -214,6 +214,7 @@ static inline unsigned int nvmf_nr_io_queues(struct nvmf_ctrl_options *opts) min(opts->nr_poll_queues, num_online_cpus()); } +struct nvme_ctrl *nvmf_create_ctrl(struct device *dev, const char *buf); int nvmf_reg_read32(struct nvme_ctrl *ctrl, u32 off, u32 *val); int nvmf_reg_read64(struct nvme_ctrl *ctrl, u32 off, u64 *val); int nvmf_reg_write32(struct nvme_ctrl *ctrl, u32 off, u32 val); From patchwork Fri Oct 11 12:19:49 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Damien Le Moal X-Patchwork-Id: 1996099 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@legolas.ozlabs.org Authentication-Results: legolas.ozlabs.org; dkim=pass (2048-bit key; unprotected) header.d=kernel.org header.i=@kernel.org header.a=rsa-sha256 header.s=k20201202 header.b=YtSPV42p; dkim-atps=neutral Authentication-Results: legolas.ozlabs.org; spf=pass (sender SPF authorized) smtp.mailfrom=vger.kernel.org (client-ip=2604:1380:4601:e00::3; helo=am.mirrors.kernel.org; envelope-from=linux-pci+bounces-14311-incoming=patchwork.ozlabs.org@vger.kernel.org; receiver=patchwork.ozlabs.org) Received: from am.mirrors.kernel.org (am.mirrors.kernel.org [IPv6:2604:1380:4601:e00::3]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature ECDSA (secp384r1)) (No client certificate requested) by legolas.ozlabs.org (Postfix) with ESMTPS id 4XQ5L02Wgjz1xsc for ; Fri, 11 Oct 2024 23:20:08 +1100 (AEDT) Received: from smtp.subspace.kernel.org (wormhole.subspace.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by am.mirrors.kernel.org (Postfix) with ESMTPS id 41C9F1F263D3 for ; Fri, 11 Oct 2024 12:20:06 +0000 (UTC) Received: from localhost.localdomain (localhost.localdomain [127.0.0.1]) by smtp.subspace.kernel.org (Postfix) with ESMTP id A7DDF1F9415; Fri, 11 Oct 2024 12:20:01 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="YtSPV42p" X-Original-To: linux-pci@vger.kernel.org Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by 
smtp.subspace.kernel.org (Postfix) with ESMTPS id 838DB1BDAA1 for ; Fri, 11 Oct 2024 12:20:01 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1728649201; cv=none; b=LEdGFqeZeCrGYLNUtTeOhgwBI8N+Y3sn9wKZN/RXCb1YSkmTpv1tlDMuBw4HXDt/w6aKGvIu4ef1rSOZQHXszYvNjkUnBeAjFzPVSinAjU+4jz4SCR9g0bmlRKS4ZhshWX10V55Gv4K11UexjizF9mWmyqD7EcHZQ2uqjyYOmRQ= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1728649201; c=relaxed/simple; bh=oPxanfHmqSGpINPKSIFc4wvoW0qyQNZj9s2w5ak3H0s=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=BVLYEO1TYpy0ivt9fzT/Ia42bvNMuPbd6qFLOpu59cQDAqjGvCaI/GRY/9To2XjQWMUZBEG9KCfUjeWGD+jqLvh3LlnxC4ambsBf3unJz+26jG6ckI/X78mFC4D4t1XzT8iDZ5xVxagy3quT1gsHZgvmVorX1YPql3OfxqohDSA= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=YtSPV42p; arc=none smtp.client-ip=10.30.226.201 Received: by smtp.kernel.org (Postfix) with ESMTPSA id 45BA7C4CED0; Fri, 11 Oct 2024 12:19:59 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1728649201; bh=oPxanfHmqSGpINPKSIFc4wvoW0qyQNZj9s2w5ak3H0s=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=YtSPV42p36ky9dlF8FFAm4SVoynaMBWrMpjJrv5nL75vkCcO3lf4ksfwd4gUwot7L abaXVfrihGNk7WAp9tb9tsjjsyg8fo8TQePMsgZUU6Ps26dTj7S2Qsb+VLwhbSmObJ z02RtWreHE4xkPB9AtkKFHUT7By+Z4etfsP+i+5tWQwlLdp6QewCXtLzL2svYsPqxm 87azbLcyvzpbmTXs0Bxp6a3Ew1XsQ4sjEBsbPmkJeL/jLtwqdTTYqbV851zfdYn9S0 3TQAePsuwp8RpjF0AKCETj3EvLoaULvzzBnzEEOb/CFmBIGe1Oys/v2Fs9B12cdsx5 bJnLAnJUzi4jg== From: Damien Le Moal To: linux-nvme@lists.infradead.org, Keith Busch , Christoph Hellwig , Sagi Grimberg , Manivannan Sadhasivam , =?utf-8?q?Krzyszt?= =?utf-8?q?of_Wilczy=C5=84ski?= , Kishon Vijay Abraham I , Bjorn Helgaas , Lorenzo Pieralisi , linux-pci@vger.kernel.org Cc: Rick Wertenbroek , Niklas Cassel Subject: [PATCH v2 3/5] nvmef: Introduce the NVME_OPT_HIDDEN_NS option Date: Fri, 11 Oct 2024 21:19:49 +0900 Message-ID: <20241011121951.90019-4-dlemoal@kernel.org> X-Mailer: git-send-email 2.46.2 In-Reply-To: <20241011121951.90019-1-dlemoal@kernel.org> References: <20241011121951.90019-1-dlemoal@kernel.org> Precedence: bulk X-Mailing-List: linux-pci@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Introduce the NVME fabrics option NVME_OPT_HIDDEN_NS to allow a host controller to be created without any user visible or internally usable namespace devices. That is, if set, this option will result in the controller having no character device and no block device for any of its namespaces. This option should be used only when the nvme controller will be managed using passthrough commands using the controller character device, either by the user or by another device driver. 
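For illustration only (not part of this patch): a sketch of a connect string using the new option together with the nvmf_create_ctrl() helper exported in the previous patch. The "hidden_ns" token matches the opt_tokens entry added below; the transport, NQN and function name are made up.

	/*
	 * Hypothetical example: create a fabrics controller whose
	 * namespaces get neither a block device nor a character device,
	 * so that they can only be accessed with passthrough commands
	 * through the controller character device (e.g. by another
	 * kernel driver).
	 */
	static struct nvme_ctrl *example_create_hidden_ctrl(struct device *dev)
	{
		static const char opts[] =
			"transport=loop,nqn=nqn.2024-10.org.example:subsys0,hidden_ns";

		return nvmf_create_ctrl(dev, opts);
	}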
Signed-off-by: Damien Le Moal --- drivers/nvme/host/core.c | 17 ++++++++++++++--- drivers/nvme/host/fabrics.c | 7 ++++++- drivers/nvme/host/fabrics.h | 4 ++++ 3 files changed, 24 insertions(+), 4 deletions(-) diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c index 43d73d31c66f..6559e1028685 100644 --- a/drivers/nvme/host/core.c +++ b/drivers/nvme/host/core.c @@ -1714,11 +1714,17 @@ static void nvme_enable_aen(struct nvme_ctrl *ctrl) queue_work(nvme_wq, &ctrl->async_event_work); } +static inline bool nvme_hidden_ns(struct nvme_ctrl *ctrl) +{ + return ctrl->opts && ctrl->opts->hidden_ns; +} + static int nvme_ns_open(struct nvme_ns *ns) { /* should never be called due to GENHD_FL_HIDDEN */ - if (WARN_ON_ONCE(nvme_ns_head_multipath(ns->head))) + if (WARN_ON_ONCE(nvme_ns_head_multipath(ns->head) || + nvme_hidden_ns(ns->ctrl))) goto fail; if (!nvme_get_ns(ns)) goto fail; @@ -3828,6 +3834,9 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, struct nvme_ns_info *info) disk->fops = &nvme_bdev_ops; disk->private_data = ns; + if (nvme_hidden_ns(ctrl)) + disk->flags |= GENHD_FL_HIDDEN; + ns->disk = disk; ns->queue = disk->queue; ns->ctrl = ctrl; @@ -3879,7 +3888,8 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, struct nvme_ns_info *info) if (device_add_disk(ctrl->device, ns->disk, nvme_ns_attr_groups)) goto out_cleanup_ns_from_list; - if (!nvme_ns_head_multipath(ns->head)) + if (!nvme_ns_head_multipath(ns->head) && + !nvme_hidden_ns(ctrl)) nvme_add_ns_cdev(ns); nvme_mpath_add_disk(ns, info->anagrpid); @@ -3945,7 +3955,8 @@ static void nvme_ns_remove(struct nvme_ns *ns) /* guarantee not available in head->list */ synchronize_srcu(&ns->head->srcu); - if (!nvme_ns_head_multipath(ns->head)) + if (!nvme_ns_head_multipath(ns->head) && + !nvme_hidden_ns(ns->ctrl)) nvme_cdev_del(&ns->cdev, &ns->cdev_device); del_gendisk(ns->disk); diff --git a/drivers/nvme/host/fabrics.c b/drivers/nvme/host/fabrics.c index e3c990d50704..64e95727ae2a 100644 --- a/drivers/nvme/host/fabrics.c +++ b/drivers/nvme/host/fabrics.c @@ -707,6 +707,7 @@ static const match_table_t opt_tokens = { #ifdef CONFIG_NVME_TCP_TLS { NVMF_OPT_TLS, "tls" }, #endif + { NVMF_OPT_HIDDEN_NS, "hidden_ns" }, { NVMF_OPT_ERR, NULL } }; @@ -1053,6 +1054,9 @@ static int nvmf_parse_options(struct nvmf_ctrl_options *opts, } opts->tls = true; break; + case NVMF_OPT_HIDDEN_NS: + opts->hidden_ns = true; + break; default: pr_warn("unknown parameter or missing value '%s' in ctrl creation request\n", p); @@ -1274,7 +1278,8 @@ EXPORT_SYMBOL_GPL(nvmf_free_options); NVMF_OPT_HOST_ID | NVMF_OPT_DUP_CONNECT |\ NVMF_OPT_DISABLE_SQFLOW | NVMF_OPT_DISCOVERY |\ NVMF_OPT_FAIL_FAST_TMO | NVMF_OPT_DHCHAP_SECRET |\ - NVMF_OPT_DHCHAP_CTRL_SECRET) + NVMF_OPT_DHCHAP_CTRL_SECRET | \ + NVMF_OPT_HIDDEN_NS) struct nvme_ctrl *nvmf_create_ctrl(struct device *dev, const char *buf) { diff --git a/drivers/nvme/host/fabrics.h b/drivers/nvme/host/fabrics.h index 2dd3aeb8c53a..5388610e475d 100644 --- a/drivers/nvme/host/fabrics.h +++ b/drivers/nvme/host/fabrics.h @@ -66,6 +66,7 @@ enum { NVMF_OPT_TLS = 1 << 25, NVMF_OPT_KEYRING = 1 << 26, NVMF_OPT_TLS_KEY = 1 << 27, + NVMF_OPT_HIDDEN_NS = 1 << 28, }; /** @@ -108,6 +109,8 @@ enum { * @nr_poll_queues: number of queues for polling I/O * @tos: type of service * @fast_io_fail_tmo: Fast I/O fail timeout in seconds + * @fast_io_fail_tmo: Fast I/O fail timeout in seconds + * @hide_dev: Hide block devices for the namesapces of the controller */ struct nvmf_ctrl_options { unsigned mask; @@ -133,6 +136,7 @@ struct nvmf_ctrl_options 
{ bool disable_sqflow; bool hdr_digest; bool data_digest; + bool hidden_ns; unsigned int nr_write_queues; unsigned int nr_poll_queues; int tos; From patchwork Fri Oct 11 12:19:50 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Damien Le Moal X-Patchwork-Id: 1996100 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@legolas.ozlabs.org Authentication-Results: legolas.ozlabs.org; dkim=pass (2048-bit key; unprotected) header.d=kernel.org header.i=@kernel.org header.a=rsa-sha256 header.s=k20201202 header.b=Wc15Q3FB; dkim-atps=neutral Authentication-Results: legolas.ozlabs.org; spf=pass (sender SPF authorized) smtp.mailfrom=vger.kernel.org (client-ip=2604:1380:45d1:ec00::1; helo=ny.mirrors.kernel.org; envelope-from=linux-pci+bounces-14312-incoming=patchwork.ozlabs.org@vger.kernel.org; receiver=patchwork.ozlabs.org) Received: from ny.mirrors.kernel.org (ny.mirrors.kernel.org [IPv6:2604:1380:45d1:ec00::1]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature ECDSA (secp384r1)) (No client certificate requested) by legolas.ozlabs.org (Postfix) with ESMTPS id 4XQ5L071GSz1xv0 for ; Fri, 11 Oct 2024 23:20:08 +1100 (AEDT) Received: from smtp.subspace.kernel.org (wormhole.subspace.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by ny.mirrors.kernel.org (Postfix) with ESMTPS id 19ACC1C23212 for ; Fri, 11 Oct 2024 12:20:07 +0000 (UTC) Received: from localhost.localdomain (localhost.localdomain [127.0.0.1]) by smtp.subspace.kernel.org (Postfix) with ESMTP id BCB6A215026; Fri, 11 Oct 2024 12:20:03 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="Wc15Q3FB" X-Original-To: linux-pci@vger.kernel.org Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 8D8941BDAA1 for ; Fri, 11 Oct 2024 12:20:03 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1728649203; cv=none; b=BVZeX/JGJAkgTdTWfAJXV8269tCgHudD1NGjxZsxhnOD/yYciHI2zLdWna4oTzMl1Qh+52lwtjTSKkN3eV27lPg/coywhEBcb7KlA8eU33I7G8/9elEyTvEelTU4Kci60/wA9/4w9xOaTMlKj8gu2OqXWNiDN5tNSOJtRRKQBdM= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1728649203; c=relaxed/simple; bh=07YP04UK3mdaJBrHVqzRvQq5OHw5XZqXCedoDihffj8=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=k9p68CUmZOCY8YjWpVXQjPVI72Lz0mMYh1TU/epXhCaxA6mqVjGsi+e+jGUzGTn0FmLNJE8u3ctcaFmOcfKHDqGV+CEmI9BJ7kjJGTfQJp3kbkCwLglF+Ew357bMOz0iyRdLi+o+jqUk8kkNEYyNx2a+8jSyIIY34/0jLrFfoUA= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=Wc15Q3FB; arc=none smtp.client-ip=10.30.226.201 Received: by smtp.kernel.org (Postfix) with ESMTPSA id 5F915C4CECE; Fri, 11 Oct 2024 12:20:01 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1728649203; bh=07YP04UK3mdaJBrHVqzRvQq5OHw5XZqXCedoDihffj8=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; 
b=Wc15Q3FBca16yKmKUM5y2u/w5CYf5dkUTNSoO34g2dv8Gt5IXRqAw3iRsDK4fznl1 b033n/naC8djZHOqUmaJTUvdIFMFaVg6phTQdAIV1Q07JZOcQKUC/mXQFvjXtbthsw CFDr40wX7hxRq2sppXLbs9QnI33UlBSyaALwNpIQPoyfTToDS1HxeSFQMej8ys/4M6 2O+vC7F5Or1E7t05Zc4t7nix1prMBtXOgWy95TbVFDxTA5bK4sWH/go79u+Kjus1/3 9/uBZQACcxoJJwakol9gDe3EL1ik4GrK4XS8iw6aoaNjRe2SFMYhyYD2PHikqvkP1U m8oasJ8GeCiBQ== From: Damien Le Moal To: linux-nvme@lists.infradead.org, Keith Busch , Christoph Hellwig , Sagi Grimberg , Manivannan Sadhasivam , =?utf-8?q?Krzyszt?= =?utf-8?q?of_Wilczy=C5=84ski?= , Kishon Vijay Abraham I , Bjorn Helgaas , Lorenzo Pieralisi , linux-pci@vger.kernel.org Cc: Rick Wertenbroek , Niklas Cassel Subject: [PATCH v2 4/5] PCI: endpoint: Add NVMe endpoint function driver Date: Fri, 11 Oct 2024 21:19:50 +0900 Message-ID: <20241011121951.90019-5-dlemoal@kernel.org> X-Mailer: git-send-email 2.46.2 In-Reply-To: <20241011121951.90019-1-dlemoal@kernel.org> References: <20241011121951.90019-1-dlemoal@kernel.org> Precedence: bulk X-Mailing-List: linux-pci@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 From: Damien Le Moal
Add a Linux PCI Endpoint function driver that implements a PCIe NVMe controller for a local NVMe fabrics host controller. The NVMe endpoint function driver relies as much as possible on the NVMe fabrics driver for executing NVMe commands received from the host, to minimize NVMe command parsing. However, some admin commands must be modified to satisfy PCI transport specification constraints (e.g. queue management command support and the optional SGL support).

The NVMe endpoint function driver is set up as follows:

1) Upon binding of the endpoint function driver to the endpoint controller (pci_epf_nvme_bind()), the function driver sets up BAR 0 for the NVMe PCI controller with enough doorbell space to accommodate up to PCI_EPF_NVME_MAX_NR_QUEUES (16) queue pairs. The DMA channels that will be used to exchange data with the host over the PCI link are also initialized.

2) The endpoint function driver then creates the NVMe host fabrics controller using nvmf_create_ctrl() (called from pci_epf_nvme_create_ctrl()), which connects the host controller to its target (e.g. a loop target with a file or block device, or a TCP remote target).

3) Once the PCI link is detected to be up, the endpoint controller initializes IRQ management and the BAR 0 content to advertise its capabilities. The capabilities of the fabrics controller are mostly used unmodified (pci_epf_nvme_init_ctrl_regs()). With that, the endpoint controller starts a delayed work to poll the BAR 0 registers to detect changes to the CC register.

4) When the PCI host enables the controller, pci_epf_nvme_enable_ctrl() is called to create the admin submission and completion queues and start the fabrics controller with nvme_start_ctrl(). The endpoint controller then starts a delayed work to poll the admin submission queue doorbell to detect commands from the PCI host.

5) Admin commands received from the PCI host are retrieved from the admin queue by mapping the queue memory to PCI memory space, copying each command locally into a struct pci_epf_nvme_cmd, and processing it using pci_epf_nvme_process_admin_cmd().

6) I/O commands are handled similarly: each I/O submission queue uses a delayed work to poll the queue doorbell and, upon detection of a command issued by the host, the I/O command is copied locally and processed using pci_epf_nvme_process_io_cmd(). A simplified sketch of this doorbell polling flow is shown after this list.
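The following is a simplified, illustrative sketch of the submission queue polling flow described in steps 4) to 6) above. It reuses the driver's structures and helper names, but it is not the actual pci_epf_nvme_sq_work() implementation: queue memory mapping/unmapping, doorbell head updates and error handling are omitted, and the queue is assumed to already be mapped.

	/*
	 * Simplified sketch of submission queue polling: read the SQ tail
	 * doorbell from the register BAR, copy any new submission queue
	 * entries from the mapped host queue memory, and hand each command
	 * off to the per-queue workqueue for execution.
	 */
	static void example_poll_sq(struct pci_epf_nvme_queue *sq)
	{
		struct pci_epf_nvme *epf_nvme = sq->epf_nvme;
		struct pci_epf_nvme_cmd *epcmd;
		u16 tail;

		/* The host advances the SQ tail doorbell when posting commands. */
		tail = pci_epf_nvme_reg_read32(&epf_nvme->ctrl, sq->db);

		while (sq->head != tail) {
			epcmd = pci_epf_nvme_alloc_cmd(epf_nvme);
			if (!epcmd)
				break;

			pci_epf_nvme_init_cmd(epf_nvme, epcmd, sq->qid, sq->cqid);

			/* Copy the SQE from the mapped host queue memory. */
			memcpy_fromio(&epcmd->cmd,
				      sq->pci_map.virt_addr + sq->head * sq->qes,
				      sizeof(struct nvme_command));

			sq->head = (sq->head + 1) % sq->depth;

			/* Execute in the per-queue high-priority workqueue. */
			queue_work(sq->cmd_wq, &epcmd->work);
		}
	}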
I/O and admin commands are processed as follows:

1) A minimal parsing of the command is done to determine the command buffer size and the data transfer direction. The command processing then continues with a command work item scheduled on a per queue-pair high-priority workqueue (pci_epf_nvme_exec_cmd_work()).

2) The command execution work calls pci_epf_nvme_exec_cmd(), which retrieves and parses the command PRPs to determine the PCI address location of the command buffer segments, and retrieves the command data if the command is a write command. The command is then executed using the host fabrics controller by calling __nvme_submit_sync_cmd(). Once done, pci_epf_nvme_complete_cmd() is called to complete the command, after having transferred the command data back to the PCI host in the case of a read command.

3) pci_epf_nvme_complete_cmd() queues the command in a completion list for the completion queue of the command and schedules the queue completion work, which batches CQ entry transfers to the PCI host with the completion queue memory mapped to the host PCI address of the completion queue (a simplified sketch of this completion entry posting is shown below).

With this processing, most of the command parsing and handling is left to the NVMe fabrics code. The only NVMe specific parsing implemented in the endpoint driver is the command PRP parsing. Of note is that the current code does not support SGL (this capability is thus not advertised).

For data transfers, the endpoint driver relies by default on the DMA RX and TX channels of the hardware endpoint PCI controller. If no DMA channels are available, the NVMe endpoint function driver falls back to using MMIO, which degrades performance significantly but keeps the function working.

The BAR register polling work also monitors for controller-disable events (e.g. the PCI host reboots or shuts down). Such events trigger calls to pci_epf_nvme_disable_ctrl(), which drains, cleans up and destroys the local queue pairs.

The configuration and enablement of this NVMe endpoint function driver can be fully controlled using configfs, once an NVMe fabrics target is also set up. The available configfs parameters are:
- ctrl_opts: Fabrics controller connection arguments, as formatted for the nvme CLI "connect" command.
- dma_enable: Enable (default) or disable DMA data transfers.
- mdts_kb: Change the maximum data transfer size (default: 128 KB).

Early versions of this driver code were based on an RFC submission by Alan Mikhak (https://lwn.net/Articles/804369/). The code however has since been completely rewritten.
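As a companion to point 3) above, here is a simplified, illustrative sketch of how a completion entry is posted to a host completion queue, including the phase bit handling. It mirrors the logic of pci_epf_nvme_queue_response() in the patch below, but omits the controller-ready and queue-full checks, locking, and interrupt raising.

	/*
	 * Simplified sketch of completion entry posting: fill the CQE,
	 * copy it to the mapped host CQ slot, then advance the tail and
	 * toggle the phase bit when the queue wraps.
	 */
	static void example_post_cqe(struct pci_epf_nvme_queue *cq,
				     struct pci_epf_nvme_cmd *epcmd)
	{
		struct nvme_completion *cqe = &epcmd->cqe;

		cqe->command_id = epcmd->cmd.common.command_id;
		cqe->sq_id = cpu_to_le16(epcmd->sqid);
		cqe->status = cpu_to_le16((epcmd->status << 1) | cq->phase);

		/* Copy the CQE into the host completion queue slot. */
		memcpy_toio(cq->pci_map.virt_addr + cq->tail * cq->qes,
			    cqe, sizeof(*cqe));

		/* Advance the tail, flipping the phase bit on wrap. */
		if (++cq->tail == cq->depth) {
			cq->tail = 0;
			cq->phase ^= 1;
		}
	}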
Co-developed-by: Rick Wertenbroek Signed-off-by: Rick Wertenbroek Signed-off-by: Damien Le Moal --- MAINTAINERS | 7 + drivers/pci/endpoint/functions/Kconfig | 9 + drivers/pci/endpoint/functions/Makefile | 1 + drivers/pci/endpoint/functions/pci-epf-nvme.c | 2591 +++++++++++++++++ 4 files changed, 2608 insertions(+) create mode 100644 drivers/pci/endpoint/functions/pci-epf-nvme.c diff --git a/MAINTAINERS b/MAINTAINERS index a097afd76ded..301e0a1b56e8 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -16558,6 +16558,13 @@ S: Supported F: drivers/platform/x86/nvidia-wmi-ec-backlight.c F: include/linux/platform_data/x86/nvidia-wmi-ec-backlight.h +NVME ENDPOINT DRIVER +M: Damien Le Moal +L: linux-pci@vger.kernel.org +L: linux-nvme@lists.infradead.org +S: Supported +F: drivers/pci/endpoint/functions/pci-epf-nvme.c + NVM EXPRESS DRIVER M: Keith Busch M: Jens Axboe diff --git a/drivers/pci/endpoint/functions/Kconfig b/drivers/pci/endpoint/functions/Kconfig index 0c9cea0698d7..ea641d558fb8 100644 --- a/drivers/pci/endpoint/functions/Kconfig +++ b/drivers/pci/endpoint/functions/Kconfig @@ -47,3 +47,12 @@ config PCI_EPF_MHI devices such as SDX55. If in doubt, say "N" to disable Endpoint driver for MHI bus. + +config PCI_EPF_NVME + tristate "PCI Endpoint NVMe function driver" + depends on PCI_ENDPOINT && NVME_TARGET + help + Enable this configuration option to enable the NVMe PCI endpoint + function driver. + + If in doubt, say "N". diff --git a/drivers/pci/endpoint/functions/Makefile b/drivers/pci/endpoint/functions/Makefile index 696473fce50e..fe2d6cf8c502 100644 --- a/drivers/pci/endpoint/functions/Makefile +++ b/drivers/pci/endpoint/functions/Makefile @@ -7,3 +7,4 @@ obj-$(CONFIG_PCI_EPF_TEST) += pci-epf-test.o obj-$(CONFIG_PCI_EPF_NTB) += pci-epf-ntb.o obj-$(CONFIG_PCI_EPF_VNTB) += pci-epf-vntb.o obj-$(CONFIG_PCI_EPF_MHI) += pci-epf-mhi.o +obj-$(CONFIG_PCI_EPF_NVME) += pci-epf-nvme.o diff --git a/drivers/pci/endpoint/functions/pci-epf-nvme.c b/drivers/pci/endpoint/functions/pci-epf-nvme.c new file mode 100644 index 000000000000..00314a7d1b20 --- /dev/null +++ b/drivers/pci/endpoint/functions/pci-epf-nvme.c @@ -0,0 +1,2591 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * NVMe function driver for PCI Endpoint Framework + * + * Copyright (c) 2024, Western Digital Corporation or its affiliates. + * Copyright (c) 2024, Rick Wertenbroek + * REDS Institute, HEIG-VD, HES-SO, Switzerland + */ + +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "../../../nvme/host/nvme.h" +#include "../../../nvme/host/fabrics.h" + +/* + * Maximum number of queue pairs: A higheer this number, the more mapping + * windows of the PCI endpoint controller will be used. To avoid exceeding the + * maximum number of mapping windows available (i.e. avoid PCI space mapping + * failures) errors, the maximum number of queue pairs should be limited to + * the number of mapping windows minus 2 (one window for IRQ issuing and one + * window for data transfers) and divided by 2 (one mapping windows for the SQ + * and one mapping window for the CQ). + */ +#define PCI_EPF_NVME_MAX_NR_QUEUES 16 + +/* + * Default maximum data transfer size: limit to 128 KB to avoid + * excessive local memory use for buffers. + */ +#define PCI_EPF_NVME_MDTS_KB 128 +#define PCI_EPF_NVME_MAX_MDTS_KB 1024 + +/* + * Queue flags. 
+ */ +#define PCI_EPF_NVME_QUEUE_IS_SQ (1U << 0) +#define PCI_EPF_NVME_QUEUE_LIVE (1U << 1) + +/* PRP manipulation macros */ +#define pci_epf_nvme_prp_addr(ctrl, prp) ((prp) & ~(ctrl)->mps_mask) +#define pci_epf_nvme_prp_ofst(ctrl, prp) ((prp) & (ctrl)->mps_mask) +#define pci_epf_nvme_prp_size(ctrl, prp) \ + ((size_t)((ctrl)->mps - pci_epf_nvme_prp_ofst(ctrl, prp))) + +static struct kmem_cache *epf_nvme_cmd_cache; + +struct pci_epf_nvme; + +/* + * Host PCI memory segment for admin and IO commands. + */ +struct pci_epf_nvme_segment { + phys_addr_t pci_addr; + size_t size; +}; + +/* + * Queue definition and mapping for the local PCI controller. + */ +struct pci_epf_nvme_queue { + struct pci_epf_nvme *epf_nvme; + + unsigned int qflags; + int ref; + + phys_addr_t pci_addr; + size_t pci_size; + struct pci_epc_map pci_map; + + u16 qid; + u16 cqid; + u16 size; + u16 depth; + u16 flags; + u16 vector; + u16 head; + u16 tail; + u16 phase; + u32 db; + + size_t qes; + + struct workqueue_struct *cmd_wq; + struct delayed_work work; + spinlock_t lock; + struct list_head list; + + struct pci_epf_nvme_queue *sq; +}; + +/* + * Local PCI controller exposed with the endpoint function. + */ +struct pci_epf_nvme_ctrl { + /* Fabrics host controller */ + struct nvme_ctrl *ctrl; + + /* Registers of the local PCI controller */ + void *reg; + u64 cap; + u32 vs; + u32 cc; + u32 csts; + u32 aqa; + u64 asq; + u64 acq; + + size_t adm_sqes; + size_t adm_cqes; + size_t io_sqes; + size_t io_cqes; + + size_t mps_shift; + size_t mps; + size_t mps_mask; + + size_t mdts; + + unsigned int nr_queues; + struct pci_epf_nvme_queue *sq; + struct pci_epf_nvme_queue *cq; + + struct workqueue_struct *wq; +}; + +/* + * Descriptor of commands sent by the host. + */ +struct pci_epf_nvme_cmd { + struct list_head link; + struct pci_epf_nvme *epf_nvme; + + int sqid; + int cqid; + unsigned int status; + struct nvme_ns *ns; + struct nvme_command cmd; + struct nvme_completion cqe; + + /* Internal buffer that we will transfer over PCI */ + size_t buffer_size; + void *buffer; + enum dma_data_direction dma_dir; + + /* + * Host PCI address segments: if nr_segs is 1, we use only "seg", + * otherwise, the segs array is allocated and used to store + * multiple segments. + */ + unsigned int nr_segs; + struct pci_epf_nvme_segment seg; + struct pci_epf_nvme_segment *segs; + + struct work_struct work; +}; + +/* + * EPF function private data representing our NVMe subsystem. + */ +struct pci_epf_nvme { + struct pci_epf *epf; + const struct pci_epc_features *epc_features; + + void *reg_bar; + size_t msix_table_offset; + + unsigned int irq_type; + unsigned int nr_vectors; + + unsigned int queue_count; + + struct pci_epf_nvme_ctrl ctrl; + bool ctrl_enabled; + + __le64 *prp_list_buf; + + struct dma_chan *dma_chan_tx; + struct dma_chan *dma_chan_rx; + struct mutex xfer_lock; + + struct mutex irq_lock; + + struct delayed_work reg_poll; + + /* Function configfs attributes */ + struct config_group group; + char *ctrl_opts_buf; + bool dma_enable; + size_t mdts_kb; +}; + +/* + * Read a 32-bits BAR register (equivalent to readl()). + */ +static inline u32 pci_epf_nvme_reg_read32(struct pci_epf_nvme_ctrl *ctrl, + u32 reg) +{ + __le32 *ctrl_reg = ctrl->reg + reg; + + return le32_to_cpu(READ_ONCE(*ctrl_reg)); +} + +/* + * Write a 32-bits BAR register (equivalent to writel()). 
+ */ +static inline void pci_epf_nvme_reg_write32(struct pci_epf_nvme_ctrl *ctrl, + u32 reg, u32 val) +{ + __le32 *ctrl_reg = ctrl->reg + reg; + + WRITE_ONCE(*ctrl_reg, cpu_to_le32(val)); +} + +/* + * Read a 64-bits BAR register (equivalent to lo_hi_readq()). + */ +static inline u64 pci_epf_nvme_reg_read64(struct pci_epf_nvme_ctrl *ctrl, + u32 reg) +{ + return (u64)pci_epf_nvme_reg_read32(ctrl, reg) | + ((u64)pci_epf_nvme_reg_read32(ctrl, reg + 4) << 32); +} + +/* + * Write a 64-bits BAR register (equivalent to lo_hi_writeq()). + */ +static inline void pci_epf_nvme_reg_write64(struct pci_epf_nvme_ctrl *ctrl, + u32 reg, u64 val) +{ + pci_epf_nvme_reg_write32(ctrl, reg, val & 0xFFFFFFFF); + pci_epf_nvme_reg_write32(ctrl, reg + 4, (val >> 32) & 0xFFFFFFFF); +} + +static inline bool pci_epf_nvme_ctrl_ready(struct pci_epf_nvme *epf_nvme) +{ + struct pci_epf_nvme_ctrl *ctrl = &epf_nvme->ctrl; + + if (!epf_nvme->ctrl_enabled) + return false; + return (ctrl->cc & NVME_CC_ENABLE) && (ctrl->csts & NVME_CSTS_RDY); +} + +struct pci_epf_nvme_dma_filter { + struct device *dev; + u32 dma_mask; +}; + +static bool pci_epf_nvme_dma_filter(struct dma_chan *chan, void *arg) +{ + struct pci_epf_nvme_dma_filter *filter = arg; + struct dma_slave_caps caps; + + memset(&caps, 0, sizeof(caps)); + dma_get_slave_caps(chan, &caps); + + return chan->device->dev == filter->dev && + (filter->dma_mask & caps.directions); +} + +static bool pci_epf_nvme_init_dma(struct pci_epf_nvme *epf_nvme) +{ + struct pci_epf *epf = epf_nvme->epf; + struct device *dev = &epf->dev; + struct pci_epf_nvme_dma_filter filter; + struct dma_chan *chan; + dma_cap_mask_t mask; + + mutex_init(&epf_nvme->xfer_lock); + mutex_init(&epf_nvme->irq_lock); + + dma_cap_zero(mask); + dma_cap_set(DMA_SLAVE, mask); + + filter.dev = epf->epc->dev.parent; + filter.dma_mask = BIT(DMA_DEV_TO_MEM); + + chan = dma_request_channel(mask, pci_epf_nvme_dma_filter, &filter); + if (!chan) + return false; + epf_nvme->dma_chan_rx = chan; + + filter.dma_mask = BIT(DMA_MEM_TO_DEV); + chan = dma_request_channel(mask, pci_epf_nvme_dma_filter, &filter); + if (!chan) { + dma_release_channel(epf_nvme->dma_chan_rx); + epf_nvme->dma_chan_rx = NULL; + return false; + } + epf_nvme->dma_chan_tx = chan; + + dev_info(dev, "DMA RX channel %s, maximum segment size %u B\n", + dma_chan_name(epf_nvme->dma_chan_rx), + dma_get_max_seg_size(epf_nvme->dma_chan_rx->device->dev)); + dev_info(dev, "DMA TX channel %s, maximum segment size %u B\n", + dma_chan_name(epf_nvme->dma_chan_tx), + dma_get_max_seg_size(epf_nvme->dma_chan_tx->device->dev)); + + return true; +} + +static void pci_epf_nvme_clean_dma(struct pci_epf *epf) +{ + struct pci_epf_nvme *epf_nvme = epf_get_drvdata(epf); + + if (epf_nvme->dma_chan_tx) { + dma_release_channel(epf_nvme->dma_chan_tx); + epf_nvme->dma_chan_tx = NULL; + } + + if (epf_nvme->dma_chan_rx) { + dma_release_channel(epf_nvme->dma_chan_rx); + epf_nvme->dma_chan_rx = NULL; + } +} + +static void pci_epf_nvme_dma_callback(void *param) +{ + complete(param); +} + +static ssize_t pci_epf_nvme_dma_transfer(struct pci_epf_nvme *epf_nvme, + struct pci_epf_nvme_segment *seg, + enum dma_data_direction dir, void *buf) +{ + struct pci_epf *epf = epf_nvme->epf; + struct device *dma_dev = epf->epc->dev.parent; + struct dma_async_tx_descriptor *desc; + DECLARE_COMPLETION_ONSTACK(complete); + struct dma_slave_config sconf = {}; + struct device *dev = &epf->dev; + phys_addr_t dma_addr; + struct dma_chan *chan; + dma_cookie_t cookie; + int ret; + + switch (dir) { + case 
DMA_FROM_DEVICE: + chan = epf_nvme->dma_chan_rx; + sconf.direction = DMA_DEV_TO_MEM; + sconf.src_addr = seg->pci_addr; + break; + case DMA_TO_DEVICE: + chan = epf_nvme->dma_chan_tx; + sconf.direction = DMA_MEM_TO_DEV; + sconf.dst_addr = seg->pci_addr; + break; + default: + return -EINVAL; + } + + ret = dmaengine_slave_config(chan, &sconf); + if (ret) { + dev_err(dev, "Failed to configure DMA channel\n"); + return ret; + } + + dma_addr = dma_map_single(dma_dev, buf, seg->size, dir); + ret = dma_mapping_error(dma_dev, dma_addr); + if (ret) { + dev_err(dev, "Failed to map remote memory\n"); + return ret; + } + + desc = dmaengine_prep_slave_single(chan, dma_addr, + seg->size, sconf.direction, + DMA_CTRL_ACK | DMA_PREP_INTERRUPT); + if (!desc) { + dev_err(dev, "Failed to prepare DMA\n"); + ret = -EIO; + goto unmap; + } + + desc->callback = pci_epf_nvme_dma_callback; + desc->callback_param = &complete; + + cookie = dmaengine_submit(desc); + ret = dma_submit_error(cookie); + if (ret) { + dev_err(dev, "DMA submit failed %d\n", ret); + goto unmap; + } + + dma_async_issue_pending(chan); + ret = wait_for_completion_timeout(&complete, msecs_to_jiffies(1000)); + if (!ret) { + dev_err(dev, "DMA transfer timeout\n"); + dmaengine_terminate_sync(chan); + ret = -ETIMEDOUT; + goto unmap; + } + + ret = seg->size; + +unmap: + dma_unmap_single(dma_dev, dma_addr, seg->size, dir); + + return ret; +} + +static ssize_t pci_epf_nvme_mmio_transfer(struct pci_epf_nvme *epf_nvme, + struct pci_epf_nvme_segment *seg, + enum dma_data_direction dir, + void *buf) +{ + struct pci_epf *epf = epf_nvme->epf; + struct pci_epc_map map; + int ret; + + /* Map segment */ + ret = pci_epc_mem_map(epf->epc, epf->func_no, epf->vfunc_no, + seg->pci_addr, seg->size, &map); + if (ret) + return ret; + + switch (dir) { + case DMA_FROM_DEVICE: + memcpy_fromio(buf, map.virt_addr, map.pci_size); + ret = map.pci_size; + break; + case DMA_TO_DEVICE: + memcpy_toio(map.virt_addr, buf, map.pci_size); + ret = map.pci_size; + break; + default: + ret = -EINVAL; + break; + } + + pci_epc_mem_unmap(epf->epc, epf->func_no, epf->vfunc_no, &map); + + return ret; +} + +static int pci_epf_nvme_transfer(struct pci_epf_nvme *epf_nvme, + struct pci_epf_nvme_segment *seg, + enum dma_data_direction dir, void *buf) +{ + size_t size = seg->size; + int ret; + + while (size) { + /* + * Note: mmio transfers do not need serialization, but + * this is an easy way to prevent using too many mapped + * memory areauiswhich would lead to errors. 
+ */ + mutex_lock(&epf_nvme->xfer_lock); + if (!epf_nvme->dma_enable) + ret = pci_epf_nvme_mmio_transfer(epf_nvme, seg, + dir, buf); + else + ret = pci_epf_nvme_dma_transfer(epf_nvme, seg, + dir, buf); + mutex_unlock(&epf_nvme->xfer_lock); + + if (ret < 0) + return ret; + + size -= ret; + buf += ret; + } + + return 0; +} + +static const char *pci_epf_nvme_cmd_name(struct pci_epf_nvme_cmd *epcmd) +{ + u8 opcode = epcmd->cmd.common.opcode; + + if (epcmd->sqid) + return nvme_get_opcode_str(opcode); + return nvme_get_admin_opcode_str(opcode); +} + +static inline struct pci_epf_nvme_cmd * +pci_epf_nvme_alloc_cmd(struct pci_epf_nvme *nvme) +{ + return kmem_cache_alloc(epf_nvme_cmd_cache, GFP_KERNEL); +} + +static void pci_epf_nvme_exec_cmd_work(struct work_struct *work); + +static void pci_epf_nvme_init_cmd(struct pci_epf_nvme *epf_nvme, + struct pci_epf_nvme_cmd *epcmd, + int sqid, int cqid) +{ + memset(epcmd, 0, sizeof(*epcmd)); + INIT_LIST_HEAD(&epcmd->link); + INIT_WORK(&epcmd->work, pci_epf_nvme_exec_cmd_work); + epcmd->epf_nvme = epf_nvme; + epcmd->sqid = sqid; + epcmd->cqid = cqid; + epcmd->status = NVME_SC_SUCCESS; + epcmd->dma_dir = DMA_NONE; +} + +static int pci_epf_nvme_alloc_cmd_buffer(struct pci_epf_nvme_cmd *epcmd) +{ + void *buffer; + + buffer = kmalloc(epcmd->buffer_size, GFP_KERNEL); + if (!buffer) { + epcmd->buffer_size = 0; + return -ENOMEM; + } + + if (!epcmd->sqid) + memset(buffer, 0, epcmd->buffer_size); + epcmd->buffer = buffer; + + return 0; +} + +static int pci_epf_nvme_alloc_cmd_segs(struct pci_epf_nvme_cmd *epcmd, + int nr_segs) +{ + struct pci_epf_nvme_segment *segs; + + /* Single segment case: use the command embedded structure */ + if (nr_segs == 1) { + epcmd->segs = &epcmd->seg; + epcmd->nr_segs = 1; + return 0; + } + + /* More than one segment needed: allocate an array */ + segs = kcalloc(nr_segs, sizeof(struct pci_epf_nvme_segment), GFP_KERNEL); + if (!segs) + return -ENOMEM; + + epcmd->nr_segs = nr_segs; + epcmd->segs = segs; + + return 0; +} + +static void pci_epf_nvme_free_cmd(struct pci_epf_nvme_cmd *epcmd) +{ + if (epcmd->ns) + nvme_put_ns(epcmd->ns); + + kfree(epcmd->buffer); + + if (epcmd->segs && epcmd->segs != &epcmd->seg) + kfree(epcmd->segs); + + kmem_cache_free(epf_nvme_cmd_cache, epcmd); +} + +static void pci_epf_nvme_complete_cmd(struct pci_epf_nvme_cmd *epcmd) +{ + struct pci_epf_nvme *epf_nvme = epcmd->epf_nvme; + struct pci_epf_nvme_queue *cq; + unsigned long flags; + + if (!pci_epf_nvme_ctrl_ready(epf_nvme)) { + pci_epf_nvme_free_cmd(epcmd); + return; + } + + /* + * Add the command to the list of completed commands for the + * target cq and schedule the list processing. 
+ */ + cq = &epf_nvme->ctrl.cq[epcmd->cqid]; + spin_lock_irqsave(&cq->lock, flags); + list_add_tail(&epcmd->link, &cq->list); + queue_delayed_work(epf_nvme->ctrl.wq, &cq->work, 0); + spin_unlock_irqrestore(&cq->lock, flags); +} + +static int pci_epf_nvme_transfer_cmd_data(struct pci_epf_nvme_cmd *epcmd) +{ + struct pci_epf_nvme *epf_nvme = epcmd->epf_nvme; + struct pci_epf_nvme_segment *seg; + void *buf = epcmd->buffer; + size_t size = 0; + int i, ret; + + /* Transfer each segment of the command */ + for (i = 0; i < epcmd->nr_segs; i++) { + seg = &epcmd->segs[i]; + + if (size >= epcmd->buffer_size) { + dev_err(&epf_nvme->epf->dev, "Invalid transfer size\n"); + goto xfer_err; + } + + ret = pci_epf_nvme_transfer(epf_nvme, seg, epcmd->dma_dir, buf); + if (ret) + goto xfer_err; + + buf += seg->size; + size += seg->size; + } + + return 0; + +xfer_err: + epcmd->status = NVME_SC_DATA_XFER_ERROR | NVME_STATUS_DNR; + return -EIO; +} + +static void pci_epf_nvme_raise_irq(struct pci_epf_nvme *epf_nvme, + struct pci_epf_nvme_queue *cq) +{ + struct pci_epf *epf = epf_nvme->epf; + int ret; + + if (!(cq->qflags & NVME_CQ_IRQ_ENABLED)) + return; + + mutex_lock(&epf_nvme->irq_lock); + + switch (epf_nvme->irq_type) { + case PCI_IRQ_MSIX: + case PCI_IRQ_MSI: + ret = pci_epc_raise_irq(epf->epc, epf->func_no, epf->vfunc_no, + epf_nvme->irq_type, cq->vector + 1); + if (!ret) + break; + /* + * If we got an error, it is likely because the host is using + * legacy IRQs (e.g. BIOS, grub). + */ + fallthrough; + case PCI_IRQ_INTX: + ret = pci_epc_raise_irq(epf->epc, epf->func_no, epf->vfunc_no, + PCI_IRQ_INTX, 0); + break; + default: + WARN_ON_ONCE(1); + ret = -EINVAL; + break; + } + + if (ret) + dev_err(&epf->dev, "Raise IRQ failed %d\n", ret); + + mutex_unlock(&epf_nvme->irq_lock); +} + +/* + * Transfer a prp list from the host and return the number of prps. + */ +static int pci_epf_nvme_get_prp_list(struct pci_epf_nvme *epf_nvme, u64 prp, + size_t xfer_len) +{ + struct pci_epf_nvme_ctrl *ctrl = &epf_nvme->ctrl; + size_t nr_prps = (xfer_len + ctrl->mps_mask) >> ctrl->mps_shift; + struct pci_epf_nvme_segment seg; + int ret; + + /* + * Compute the number of PRPs required for the number of bytes to + * transfer (xfer_len). If this number overflows the memory page size + * with the PRP list pointer specified, only return the space available + * in the memory page, the last PRP in there will be a PRP list pointer + * to the remaining PRPs. + */ + seg.pci_addr = prp; + seg.size = min(pci_epf_nvme_prp_size(ctrl, prp), nr_prps << 3); + ret = pci_epf_nvme_transfer(epf_nvme, &seg, DMA_FROM_DEVICE, + epf_nvme->prp_list_buf); + if (ret) + return ret; + + return seg.size >> 3; +} + +static int pci_epf_nvme_cmd_parse_prp_list(struct pci_epf_nvme *epf_nvme, + struct pci_epf_nvme_cmd *epcmd) +{ + struct pci_epf_nvme_ctrl *ctrl = &epf_nvme->ctrl; + struct nvme_command *cmd = &epcmd->cmd; + __le64 *prps = epf_nvme->prp_list_buf; + struct pci_epf_nvme_segment *seg; + size_t size = 0, ofst, prp_size, xfer_len; + size_t transfer_len = epcmd->buffer_size; + int nr_segs, nr_prps = 0; + phys_addr_t pci_addr; + int i = 0, ret; + u64 prp; + + /* + * Allocate segments for the command: this considers the worst case + * scenario where all prps are discontiguous, so get as many segments + * as we can have prps. In practice, most of the time, we will have + * far less segments than prps. 
+ */ + prp = le64_to_cpu(cmd->common.dptr.prp1); + if (!prp) + goto invalid_field; + + ofst = pci_epf_nvme_prp_ofst(ctrl, prp); + nr_segs = (transfer_len + ofst + NVME_CTRL_PAGE_SIZE - 1) + >> NVME_CTRL_PAGE_SHIFT; + + ret = pci_epf_nvme_alloc_cmd_segs(epcmd, nr_segs); + if (ret) + goto internal; + + /* Set the first segment using prp1 */ + seg = &epcmd->segs[0]; + seg->pci_addr = prp; + seg->size = pci_epf_nvme_prp_size(ctrl, prp); + + size = seg->size; + pci_addr = prp + size; + nr_segs = 1; + + /* + * Now build the PCI address segments using the prp lists, starting + * from prp2. + */ + prp = le64_to_cpu(cmd->common.dptr.prp2); + if (!prp) + goto invalid_field; + + while (size < transfer_len) { + xfer_len = transfer_len - size; + + if (!nr_prps) { + /* Get the prp list */ + nr_prps = pci_epf_nvme_get_prp_list(epf_nvme, prp, + xfer_len); + if (nr_prps < 0) + goto internal; + + i = 0; + ofst = 0; + } + + /* Current entry */ + prp = le64_to_cpu(prps[i]); + if (!prp) + goto invalid_field; + + /* Did we reach the last prp entry of the list ? */ + if (xfer_len > ctrl->mps && i == nr_prps - 1) { + /* We need more PRPs: prp is a list pointer */ + nr_prps = 0; + continue; + } + + /* Only the first prp is allowed to have an offset */ + if (pci_epf_nvme_prp_ofst(ctrl, prp)) + goto invalid_offset; + + if (prp != pci_addr) { + /* Discontiguous prp: new segment */ + nr_segs++; + if (WARN_ON_ONCE(nr_segs > epcmd->nr_segs)) + goto internal; + + seg++; + seg->pci_addr = prp; + seg->size = 0; + pci_addr = prp; + } + + prp_size = min_t(size_t, ctrl->mps, xfer_len); + seg->size += prp_size; + pci_addr += prp_size; + size += prp_size; + + i++; + } + + epcmd->nr_segs = nr_segs; + ret = 0; + + if (size != transfer_len) { + dev_err(&epf_nvme->epf->dev, + "PRPs transfer length mismatch %zu / %zu\n", + size, transfer_len); + goto internal; + } + + return 0; + +internal: + epcmd->status = NVME_SC_INTERNAL | NVME_STATUS_DNR; + return -EINVAL; + +invalid_offset: + epcmd->status = NVME_SC_PRP_INVALID_OFFSET | NVME_STATUS_DNR; + return -EINVAL; + +invalid_field: + epcmd->status = NVME_SC_INVALID_FIELD | NVME_STATUS_DNR; + return -EINVAL; +} + +static int pci_epf_nvme_cmd_parse_prp_simple(struct pci_epf_nvme *epf_nvme, + struct pci_epf_nvme_cmd *epcmd) +{ + struct pci_epf_nvme_ctrl *ctrl = &epf_nvme->ctrl; + struct nvme_command *cmd = &epcmd->cmd; + size_t transfer_len = epcmd->buffer_size; + int ret, nr_segs = 1; + u64 prp1, prp2 = 0; + size_t prp1_size; + + /* prp1 */ + prp1 = le64_to_cpu(cmd->common.dptr.prp1); + prp1_size = pci_epf_nvme_prp_size(ctrl, prp1); + + /* For commands crossing a page boundary, we should have a valid prp2 */ + if (transfer_len > prp1_size) { + prp2 = le64_to_cpu(cmd->common.dptr.prp2); + if (!prp2) + goto invalid_field; + if (pci_epf_nvme_prp_ofst(ctrl, prp2)) + goto invalid_offset; + if (prp2 != prp1 + prp1_size) + nr_segs = 2; + } + + /* Create segments using the prps */ + ret = pci_epf_nvme_alloc_cmd_segs(epcmd, nr_segs); + if (ret) + goto internal; + + epcmd->segs[0].pci_addr = prp1; + if (nr_segs == 1) { + epcmd->segs[0].size = transfer_len; + } else { + epcmd->segs[0].size = prp1_size; + epcmd->segs[1].pci_addr = prp2; + epcmd->segs[1].size = transfer_len - prp1_size; + } + + return 0; + +invalid_offset: + epcmd->status = NVME_SC_PRP_INVALID_OFFSET | NVME_STATUS_DNR; + return -EINVAL; + +invalid_field: + epcmd->status = NVME_SC_INVALID_FIELD | NVME_STATUS_DNR; + return -EINVAL; + +internal: + epcmd->status = NVME_SC_INTERNAL | NVME_STATUS_DNR; + return ret; +} + +static int 
pci_epf_nvme_cmd_parse_dptr(struct pci_epf_nvme_cmd *epcmd) +{ + struct pci_epf_nvme *epf_nvme = epcmd->epf_nvme; + struct pci_epf_nvme_ctrl *ctrl = &epf_nvme->ctrl; + struct nvme_command *cmd = &epcmd->cmd; + u64 prp1 = le64_to_cpu(cmd->common.dptr.prp1); + size_t ofst; + int ret; + + if (epcmd->buffer_size > ctrl->mdts) + goto invalid_field; + + /* We do not support SGL for now */ + if (epcmd->cmd.common.flags & NVME_CMD_SGL_ALL) + goto invalid_field; + + /* Get PCI address segments for the command using its prps */ + ofst = pci_epf_nvme_prp_ofst(ctrl, prp1); + if (ofst & 0x3) + goto invalid_offset; + + if (epcmd->buffer_size + ofst <= NVME_CTRL_PAGE_SIZE * 2) + ret = pci_epf_nvme_cmd_parse_prp_simple(epf_nvme, epcmd); + else + ret = pci_epf_nvme_cmd_parse_prp_list(epf_nvme, epcmd); + if (ret) + return ret; + + /* Get an internal buffer for the command */ + ret = pci_epf_nvme_alloc_cmd_buffer(epcmd); + if (ret) { + epcmd->status = NVME_SC_INTERNAL | NVME_STATUS_DNR; + return ret; + } + + return 0; + +invalid_field: + epcmd->status = NVME_SC_INVALID_FIELD | NVME_STATUS_DNR; + return -EINVAL; + +invalid_offset: + epcmd->status = NVME_SC_PRP_INVALID_OFFSET | NVME_STATUS_DNR; + return -EINVAL; +} + +static void pci_epf_nvme_exec_cmd(struct pci_epf_nvme_cmd *epcmd, + void (*post_exec_hook)(struct pci_epf_nvme_cmd *)) +{ + struct pci_epf_nvme *epf_nvme = epcmd->epf_nvme; + struct nvme_command *cmd = &epcmd->cmd; + struct request_queue *q; + int ret; + + if (epcmd->ns) + q = epcmd->ns->queue; + else + q = epf_nvme->ctrl.ctrl->admin_q; + + if (epcmd->buffer_size) { + /* Setup the command buffer */ + ret = pci_epf_nvme_cmd_parse_dptr(epcmd); + if (ret) + return; + + /* Get data from the host if needed */ + if (epcmd->dma_dir == DMA_FROM_DEVICE) { + ret = pci_epf_nvme_transfer_cmd_data(epcmd); + if (ret) + return; + } + } + + /* Synchronously execute the command */ + ret = __nvme_submit_sync_cmd(q, cmd, &epcmd->cqe.result, + epcmd->buffer, epcmd->buffer_size, + NVME_QID_ANY, 0); + if (ret < 0) + epcmd->status = NVME_SC_INTERNAL | NVME_STATUS_DNR; + else if (ret > 0) + epcmd->status = ret; + + if (epcmd->status != NVME_SC_SUCCESS) { + dev_err(&epf_nvme->epf->dev, + "QID %d: submit command %s (0x%x) failed, status 0x%0x\n", + epcmd->sqid, pci_epf_nvme_cmd_name(epcmd), + epcmd->cmd.common.opcode, epcmd->status); + return; + } + + if (post_exec_hook) + post_exec_hook(epcmd); + + if (epcmd->buffer_size && epcmd->dma_dir == DMA_TO_DEVICE) + pci_epf_nvme_transfer_cmd_data(epcmd); +} + +static void pci_epf_nvme_exec_cmd_work(struct work_struct *work) +{ + struct pci_epf_nvme_cmd *epcmd = + container_of(work, struct pci_epf_nvme_cmd, work); + + pci_epf_nvme_exec_cmd(epcmd, NULL); + + pci_epf_nvme_complete_cmd(epcmd); +} + +static bool pci_epf_nvme_queue_response(struct pci_epf_nvme_cmd *epcmd) +{ + struct pci_epf_nvme *epf_nvme = epcmd->epf_nvme; + struct pci_epf *epf = epf_nvme->epf; + struct pci_epf_nvme_ctrl *ctrl = &epf_nvme->ctrl; + struct pci_epf_nvme_queue *sq = &ctrl->sq[epcmd->sqid]; + struct pci_epf_nvme_queue *cq = &ctrl->cq[epcmd->cqid]; + struct nvme_completion *cqe = &epcmd->cqe; + + /* + * Do not try to complete commands if the controller is not ready + * anymore, e.g. after the host cleared CC.EN. 
+ */ + if (!pci_epf_nvme_ctrl_ready(epf_nvme) || + !(cq->qflags & PCI_EPF_NVME_QUEUE_LIVE)) + goto free_cmd; + + /* Check completion queue full state */ + cq->head = pci_epf_nvme_reg_read32(ctrl, cq->db); + if (cq->head == cq->tail + 1) + return false; + + /* Setup the completion entry */ + cqe->sq_id = cpu_to_le16(epcmd->sqid); + cqe->sq_head = cpu_to_le16(sq->head); + cqe->command_id = epcmd->cmd.common.command_id; + cqe->status = cpu_to_le16((epcmd->status << 1) | cq->phase); + + /* Post the completion entry */ + dev_dbg(&epf->dev, + "cq[%d]: %s status 0x%x, head %d, tail %d, phase %d\n", + epcmd->cqid, pci_epf_nvme_cmd_name(epcmd), + epcmd->status, cq->head, cq->tail, cq->phase); + + memcpy_toio(cq->pci_map.virt_addr + cq->tail * cq->qes, cqe, + sizeof(struct nvme_completion)); + + /* Advance the tail */ + cq->tail++; + if (cq->tail >= cq->depth) { + cq->tail = 0; + cq->phase ^= 1; + } + +free_cmd: + pci_epf_nvme_free_cmd(epcmd); + + return true; +} + +static int pci_epf_nvme_map_queue(struct pci_epf_nvme *epf_nvme, + struct pci_epf_nvme_queue *q) +{ + struct pci_epf *epf = epf_nvme->epf; + int ret; + + ret = pci_epc_mem_map(epf->epc, epf->func_no, epf->vfunc_no, + q->pci_addr, q->pci_size, &q->pci_map); + if (ret) { + dev_err(&epf->dev, "Map %cQ %d failed %d\n", + q->qflags & PCI_EPF_NVME_QUEUE_IS_SQ ? 'S' : 'C', + q->qid, ret); + return ret; + } + + if (q->pci_map.pci_size < q->pci_size) { + dev_err(&epf->dev, "Partial %cQ %d mapping\n", + q->qflags & PCI_EPF_NVME_QUEUE_IS_SQ ? 'S' : 'C', + q->qid); + pci_epc_mem_unmap(epf->epc, epf->func_no, epf->vfunc_no, + &q->pci_map); + return -ENOMEM; + } + + return 0; +} + +static inline void pci_epf_nvme_unmap_queue(struct pci_epf_nvme *epf_nvme, + struct pci_epf_nvme_queue *q) +{ + struct pci_epf *epf = epf_nvme->epf; + + pci_epc_mem_unmap(epf->epc, epf->func_no, epf->vfunc_no, + &q->pci_map); +} + +static void pci_epf_nvme_delete_queue(struct pci_epf_nvme *epf_nvme, + struct pci_epf_nvme_queue *q) +{ + struct pci_epf_nvme_cmd *epcmd; + + q->qflags &= ~PCI_EPF_NVME_QUEUE_LIVE; + + if (q->cmd_wq) { + flush_workqueue(q->cmd_wq); + destroy_workqueue(q->cmd_wq); + q->cmd_wq = NULL; + } + + flush_delayed_work(&q->work); + cancel_delayed_work_sync(&q->work); + + while (!list_empty(&q->list)) { + epcmd = list_first_entry(&q->list, + struct pci_epf_nvme_cmd, link); + list_del_init(&epcmd->link); + pci_epf_nvme_free_cmd(epcmd); + } +} + +static void pci_epf_nvme_cq_work(struct work_struct *work); + +static int pci_epf_nvme_create_cq(struct pci_epf_nvme *epf_nvme, int qid, + int flags, int size, int vector, + phys_addr_t pci_addr) +{ + struct pci_epf_nvme_ctrl *ctrl = &epf_nvme->ctrl; + struct pci_epf_nvme_queue *cq = &ctrl->cq[qid]; + struct pci_epf *epf = epf_nvme->epf; + + /* + * Increment the queue reference count: if the queue is already being + * used, we have nothing to do. 
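+ * Note that a reference is also taken on a completion queue each time a submission queue is created against it (see pci_epf_nvme_create_sq()).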
+ */ + cq->ref++; + if (cq->ref > 1) + return 0; + + /* Setup the completion queue */ + cq->pci_addr = pci_addr; + cq->qid = qid; + cq->cqid = qid; + cq->size = size; + cq->flags = flags; + cq->depth = size + 1; + cq->vector = vector; + cq->head = 0; + cq->tail = 0; + cq->phase = 1; + cq->db = NVME_REG_DBS + (((qid * 2) + 1) * sizeof(u32)); + pci_epf_nvme_reg_write32(ctrl, cq->db, 0); + INIT_DELAYED_WORK(&cq->work, pci_epf_nvme_cq_work); + if (!qid) + cq->qes = ctrl->adm_cqes; + else + cq->qes = ctrl->io_cqes; + cq->pci_size = cq->qes * cq->depth; + + dev_dbg(&epf->dev, + "CQ %d: %d entries of %zu B, vector IRQ %d\n", + qid, cq->size, cq->qes, (int)cq->vector + 1); + + cq->qflags = PCI_EPF_NVME_QUEUE_LIVE; + + return 0; +} + +static void pci_epf_nvme_delete_cq(struct pci_epf_nvme *epf_nvme, int qid) +{ + struct pci_epf_nvme_queue *cq = &epf_nvme->ctrl.cq[qid]; + + if (cq->ref < 1) + return; + + cq->ref--; + if (cq->ref) + return; + + pci_epf_nvme_delete_queue(epf_nvme, cq); +} + +static void pci_epf_nvme_sq_work(struct work_struct *work); + +static int pci_epf_nvme_create_sq(struct pci_epf_nvme *epf_nvme, int qid, + int cqid, int flags, int size, + phys_addr_t pci_addr) +{ + struct pci_epf_nvme_ctrl *ctrl = &epf_nvme->ctrl; + struct pci_epf_nvme_queue *sq = &ctrl->sq[qid]; + struct pci_epf_nvme_queue *cq = &ctrl->cq[cqid]; + struct pci_epf *epf = epf_nvme->epf; + + /* Setup the submission queue */ + sq->qflags = PCI_EPF_NVME_QUEUE_IS_SQ; + sq->pci_addr = pci_addr; + sq->ref = 1; + sq->qid = qid; + sq->cqid = cqid; + sq->size = size; + sq->flags = flags; + sq->depth = size + 1; + sq->head = 0; + sq->tail = 0; + sq->phase = 0; + sq->db = NVME_REG_DBS + (qid * 2 * sizeof(u32)); + pci_epf_nvme_reg_write32(ctrl, sq->db, 0); + INIT_DELAYED_WORK(&sq->work, pci_epf_nvme_sq_work); + if (!qid) + sq->qes = ctrl->adm_sqes; + else + sq->qes = ctrl->io_sqes; + sq->pci_size = sq->qes * sq->depth; + + sq->cmd_wq = alloc_workqueue("sq%d_wq", WQ_HIGHPRI | WQ_UNBOUND, + min_t(int, sq->depth, WQ_MAX_ACTIVE), qid); + if (!sq->cmd_wq) { + dev_err(&epf->dev, "Create SQ %d cmd wq failed\n", qid); + memset(sq, 0, sizeof(*sq)); + return -ENOMEM; + } + + /* Get a reference on the completion queue */ + cq->ref++; + cq->sq = sq; + + dev_dbg(&epf->dev, + "SQ %d: %d queue entries of %zu B, CQ %d\n", + qid, size, sq->qes, cqid); + + sq->qflags |= PCI_EPF_NVME_QUEUE_LIVE; + + return 0; +} + +static void pci_epf_nvme_delete_sq(struct pci_epf_nvme *epf_nvme, int qid) +{ + struct pci_epf_nvme_queue *sq = &epf_nvme->ctrl.sq[qid]; + + if (!sq->ref) + return; + + sq->ref--; + if (WARN_ON_ONCE(sq->ref != 0)) + return; + + pci_epf_nvme_delete_queue(epf_nvme, sq); + + if (epf_nvme->ctrl.cq[sq->cqid].ref) + epf_nvme->ctrl.cq[sq->cqid].ref--; +} + +static void pci_epf_nvme_disable_ctrl(struct pci_epf_nvme *epf_nvme) +{ + struct pci_epf_nvme_ctrl *ctrl = &epf_nvme->ctrl; + struct pci_epf *epf = epf_nvme->epf; + int qid; + + if (!epf_nvme->ctrl_enabled) + return; + + dev_info(&epf->dev, "Disabling controller\n"); + + /* + * Delete the submission queues first to release all references + * to the completion queues. This also stops polling for submissions + * and drains any pending command from the queue. 
+ */ + for (qid = 1; qid < ctrl->nr_queues; qid++) + pci_epf_nvme_delete_sq(epf_nvme, qid); + + for (qid = 1; qid < ctrl->nr_queues; qid++) + pci_epf_nvme_delete_cq(epf_nvme, qid); + + /* Unmap the admin queue last */ + pci_epf_nvme_delete_sq(epf_nvme, 0); + pci_epf_nvme_delete_cq(epf_nvme, 0); + + /* Tell the host we are done */ + ctrl->csts &= ~NVME_CSTS_RDY; + if (ctrl->cc & NVME_CC_SHN_NORMAL) { + ctrl->csts |= NVME_CSTS_SHST_CMPLT; + ctrl->cc &= ~NVME_CC_SHN_NORMAL; + } + ctrl->cc &= ~NVME_CC_ENABLE; + pci_epf_nvme_reg_write32(ctrl, NVME_REG_CSTS, ctrl->csts); + pci_epf_nvme_reg_write32(ctrl, NVME_REG_CC, ctrl->cc); + + epf_nvme->ctrl_enabled = false; +} + +static void pci_epf_nvme_delete_ctrl(struct pci_epf *epf) +{ + struct pci_epf_nvme *epf_nvme = epf_get_drvdata(epf); + struct pci_epf_nvme_ctrl *ctrl = &epf_nvme->ctrl; + + dev_info(&epf->dev, "Deleting controller\n"); + + if (ctrl->ctrl) { + nvme_put_ctrl(ctrl->ctrl); + ctrl->ctrl = NULL; + + ctrl->cc &= ~NVME_CC_SHN_NORMAL; + ctrl->csts |= NVME_CSTS_SHST_CMPLT; + } + + pci_epf_nvme_disable_ctrl(epf_nvme); + + if (ctrl->wq) { + flush_workqueue(ctrl->wq); + destroy_workqueue(ctrl->wq); + ctrl->wq = NULL; + } + + ctrl->nr_queues = 0; + kfree(ctrl->cq); + ctrl->cq = NULL; + kfree(ctrl->sq); + ctrl->sq = NULL; +} + +static struct pci_epf_nvme_queue * +pci_epf_nvme_alloc_queues(struct pci_epf_nvme *epf_nvme, int nr_queues) +{ + struct pci_epf_nvme_queue *q; + int i; + + q = kcalloc(nr_queues, sizeof(struct pci_epf_nvme_queue), GFP_KERNEL); + if (!q) + return NULL; + + for (i = 0; i < nr_queues; i++) { + q[i].epf_nvme = epf_nvme; + spin_lock_init(&q[i].lock); + INIT_LIST_HEAD(&q[i].list); + } + + return q; +} + +static int pci_epf_nvme_create_ctrl(struct pci_epf *epf) +{ + struct pci_epf_nvme *epf_nvme = epf_get_drvdata(epf); + const struct pci_epc_features *features = epf_nvme->epc_features; + struct pci_epf_nvme_ctrl *ctrl = &epf_nvme->ctrl; + struct nvme_ctrl *fctrl; + int ret; + + /* We must have nvme fabrics options. 
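+ * These are provided through the function's ctrl_opts configfs attribute.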
*/ + if (!epf_nvme->ctrl_opts_buf) { + dev_err(&epf->dev, "No nvme fabrics options specified\n"); + return -EINVAL; + } + + /* Create the fabrics controller */ + fctrl = nvmf_create_ctrl(&epf->dev, epf_nvme->ctrl_opts_buf); + if (IS_ERR(fctrl)) { + dev_err(&epf->dev, "Create nvme fabrics controller failed\n"); + return PTR_ERR(fctrl); + } + + /* We only support IO controllers */ + if (fctrl->cntrltype != NVME_CTRL_IO) { + dev_err(&epf->dev, "Unsupported controller type\n"); + ret = -EINVAL; + goto out_delete_ctrl; + } + + dev_info(&epf->dev, "NVMe fabrics controller created, %u I/O queues\n", + fctrl->queue_count - 1); + + epf_nvme->queue_count = + min(fctrl->queue_count, PCI_EPF_NVME_MAX_NR_QUEUES); + if (features->msix_capable && epf->msix_interrupts) { + dev_info(&epf->dev, + "NVMe PCI controller supports MSI-X, %u vectors\n", + epf->msix_interrupts); + epf_nvme->queue_count = + min(epf_nvme->queue_count, epf->msix_interrupts); + } else if (features->msi_capable && epf->msi_interrupts) { + dev_info(&epf->dev, + "NVMe PCI controller supports MSI, %u vectors\n", + epf->msi_interrupts); + epf_nvme->queue_count = + min(epf_nvme->queue_count, epf->msi_interrupts); + } + + if (epf_nvme->queue_count < 2) { + dev_info(&epf->dev, "Invalid number of queues %u\n", + epf_nvme->queue_count); + ret = -EINVAL; + goto out_delete_ctrl; + } + + if (epf_nvme->queue_count != fctrl->queue_count) + dev_info(&epf->dev, "Limiting number of queues to %u\n", + epf_nvme->queue_count); + + dev_info(&epf->dev, "NVMe PCI controller: %u I/O queues\n", + epf_nvme->queue_count - 1); + + ret = -ENOMEM; + + /* Create the workqueue for processing our SQs and CQs */ + ctrl->wq = alloc_workqueue("ctrl_wq", WQ_HIGHPRI | WQ_UNBOUND, + min_t(int, ctrl->nr_queues * 2, WQ_MAX_ACTIVE)); + if (!ctrl->wq) { + dev_err(&epf->dev, "Create controller wq failed\n"); + goto out_delete_ctrl; + } + + /* Allocate queues */ + ctrl->nr_queues = epf_nvme->queue_count; + ctrl->sq = pci_epf_nvme_alloc_queues(epf_nvme, ctrl->nr_queues); + if (!ctrl->sq) + goto out_delete_ctrl; + + ctrl->cq = pci_epf_nvme_alloc_queues(epf_nvme, ctrl->nr_queues); + if (!ctrl->cq) + goto out_delete_ctrl; + + epf_nvme->ctrl.ctrl = fctrl; + + return 0; + +out_delete_ctrl: + pci_epf_nvme_delete_ctrl(epf); + + return ret; +} + +static void pci_epf_nvme_init_ctrl_regs(struct pci_epf *epf) +{ + struct pci_epf_nvme *epf_nvme = epf_get_drvdata(epf); + struct pci_epf_nvme_ctrl *ctrl = &epf_nvme->ctrl; + + ctrl->reg = epf_nvme->reg_bar; + + /* Copy the fabrics controller capabilities as a base */ + ctrl->cap = ctrl->ctrl->cap; + + /* Contiguous Queues Required (CQR) */ + ctrl->cap |= 0x1ULL << 16; + + /* Set Doorbell stride to 4B (DSTRB) */ + ctrl->cap &= ~GENMASK(35, 32); + + /* Clear NVM Subsystem Reset Supported (NSSRS) */ + ctrl->cap &= ~(0x1ULL << 36); + + /* Clear Boot Partition Support (BPS) */ + ctrl->cap &= ~(0x1ULL << 45); + + /* Memory Page Size minimum (MPSMIN) = 4K */ + ctrl->cap |= (NVME_CTRL_PAGE_SHIFT - 12) << NVME_CC_MPS_SHIFT; + + /* Memory Page Size maximum (MPSMAX) = 4K */ + ctrl->cap |= (NVME_CTRL_PAGE_SHIFT - 12) << NVME_CC_MPS_SHIFT; + + /* Clear Persistent Memory Region Supported (PMRS) */ + ctrl->cap &= ~(0x1ULL << 56); + + /* Clear Controller Memory Buffer Supported (CMBS) */ + ctrl->cap &= ~(0x1ULL << 57); + + /* NVMe version supported */ + ctrl->vs = ctrl->ctrl->vs; + + /* Controller configuration */ + ctrl->cc = ctrl->ctrl->ctrl_config & (~NVME_CC_ENABLE); + + /* Controller Status (not ready) */ + ctrl->csts = 0; + + pci_epf_nvme_reg_write64(ctrl, 
NVME_REG_CAP, ctrl->cap); + pci_epf_nvme_reg_write32(ctrl, NVME_REG_VS, ctrl->vs); + pci_epf_nvme_reg_write32(ctrl, NVME_REG_CSTS, ctrl->csts); + pci_epf_nvme_reg_write32(ctrl, NVME_REG_CC, ctrl->cc); +} + +static void pci_epf_nvme_enable_ctrl(struct pci_epf_nvme *epf_nvme) +{ + struct pci_epf_nvme_ctrl *ctrl = &epf_nvme->ctrl; + struct pci_epf *epf = epf_nvme->epf; + int ret; + + dev_info(&epf->dev, "Enabling controller\n"); + + ctrl->mdts = epf_nvme->mdts_kb * SZ_1K; + + ctrl->mps_shift = ((ctrl->cc >> NVME_CC_MPS_SHIFT) & 0xf) + 12; + ctrl->mps = 1UL << ctrl->mps_shift; + ctrl->mps_mask = ctrl->mps - 1; + + ctrl->adm_sqes = 1UL << NVME_ADM_SQES; + ctrl->adm_cqes = sizeof(struct nvme_completion); + ctrl->io_sqes = 1UL << ((ctrl->cc >> NVME_CC_IOSQES_SHIFT) & 0xf); + ctrl->io_cqes = 1UL << ((ctrl->cc >> NVME_CC_IOCQES_SHIFT) & 0xf); + + if (ctrl->io_sqes < sizeof(struct nvme_command)) { + dev_err(&epf->dev, "Unsupported IO sqes %zu (need %zu)\n", + ctrl->io_sqes, sizeof(struct nvme_command)); + return; + } + + if (ctrl->io_cqes < sizeof(struct nvme_completion)) { + dev_err(&epf->dev, "Unsupported IO cqes %zu (need %zu)\n", + ctrl->io_sqes, sizeof(struct nvme_completion)); + return; + } + + ctrl->aqa = pci_epf_nvme_reg_read32(ctrl, NVME_REG_AQA); + ctrl->asq = pci_epf_nvme_reg_read64(ctrl, NVME_REG_ASQ); + ctrl->acq = pci_epf_nvme_reg_read64(ctrl, NVME_REG_ACQ); + + /* + * Create the PCI controller admin completion and submission queues. + */ + ret = pci_epf_nvme_create_cq(epf_nvme, 0, + NVME_QUEUE_PHYS_CONTIG | NVME_CQ_IRQ_ENABLED, + (ctrl->aqa & 0x0fff0000) >> 16, 0, + ctrl->acq & GENMASK(63, 12)); + if (ret) + return; + + ret = pci_epf_nvme_create_sq(epf_nvme, 0, 0, NVME_QUEUE_PHYS_CONTIG, + ctrl->aqa & 0x0fff, + ctrl->asq & GENMASK(63, 12)); + if (ret) { + pci_epf_nvme_delete_cq(epf_nvme, 0); + return; + } + + nvme_start_ctrl(ctrl->ctrl); + + /* Tell the host we are now ready */ + ctrl->csts |= NVME_CSTS_RDY; + pci_epf_nvme_reg_write32(ctrl, NVME_REG_CSTS, ctrl->csts); + + /* Start polling the admin submission queue */ + queue_delayed_work(ctrl->wq, &ctrl->sq[0].work, msecs_to_jiffies(5)); + + epf_nvme->ctrl_enabled = true; +} + +static void pci_epf_nvme_process_create_cq(struct pci_epf_nvme *epf_nvme, + struct pci_epf_nvme_cmd *epcmd) +{ + struct nvme_command *cmd = &epcmd->cmd; + int mqes = NVME_CAP_MQES(epf_nvme->ctrl.cap); + u16 cqid, cq_flags, qsize, vector; + int ret; + + cqid = le16_to_cpu(cmd->create_cq.cqid); + if (cqid >= epf_nvme->ctrl.nr_queues || epf_nvme->ctrl.cq[cqid].ref) { + epcmd->status = NVME_SC_QID_INVALID | NVME_STATUS_DNR; + return; + } + + cq_flags = le16_to_cpu(cmd->create_cq.cq_flags); + if (!(cq_flags & NVME_QUEUE_PHYS_CONTIG)) { + epcmd->status = NVME_SC_INVALID_QUEUE | NVME_STATUS_DNR; + return; + } + + qsize = le16_to_cpu(cmd->create_cq.qsize); + if (!qsize || qsize > NVME_CAP_MQES(epf_nvme->ctrl.cap)) { + if (qsize > mqes) + dev_warn(&epf_nvme->epf->dev, + "Create CQ %d, qsize %d > mqes %d: buggy driver?\n", + cqid, (int)qsize, mqes); + epcmd->status = NVME_SC_QUEUE_SIZE | NVME_STATUS_DNR; + return; + } + + vector = le16_to_cpu(cmd->create_cq.irq_vector); + if (vector >= epf_nvme->nr_vectors) { + epcmd->status = NVME_SC_INVALID_VECTOR | NVME_STATUS_DNR; + return; + } + + ret = pci_epf_nvme_create_cq(epf_nvme, cqid, cq_flags, qsize, vector, + le64_to_cpu(cmd->create_cq.prp1)); + if (ret) + epcmd->status = NVME_SC_INTERNAL | NVME_STATUS_DNR; +} + +static void pci_epf_nvme_process_delete_cq(struct pci_epf_nvme *epf_nvme, + struct pci_epf_nvme_cmd *epcmd) +{ 
+ struct nvme_command *cmd = &epcmd->cmd; + u16 cqid; + + cqid = le16_to_cpu(cmd->delete_queue.qid); + if (!cqid || + cqid >= epf_nvme->ctrl.nr_queues || + !epf_nvme->ctrl.cq[cqid].ref) { + epcmd->status = NVME_SC_QID_INVALID | NVME_STATUS_DNR; + return; + } + + pci_epf_nvme_delete_cq(epf_nvme, cqid); +} + +static void pci_epf_nvme_process_create_sq(struct pci_epf_nvme *epf_nvme, + struct pci_epf_nvme_cmd *epcmd) +{ + struct nvme_command *cmd = &epcmd->cmd; + int mqes = NVME_CAP_MQES(epf_nvme->ctrl.cap); + u16 sqid, cqid, sq_flags, qsize; + int ret; + + sqid = le16_to_cpu(cmd->create_sq.sqid); + if (!sqid || sqid > epf_nvme->ctrl.nr_queues || + epf_nvme->ctrl.sq[sqid].ref) { + epcmd->status = NVME_SC_QID_INVALID | NVME_STATUS_DNR; + return; + } + + cqid = le16_to_cpu(cmd->create_sq.cqid); + if (!cqid || !epf_nvme->ctrl.cq[cqid].ref) { + epcmd->status = NVME_SC_CQ_INVALID | NVME_STATUS_DNR; + return; + } + + sq_flags = le16_to_cpu(cmd->create_sq.sq_flags); + if (!(sq_flags & NVME_QUEUE_PHYS_CONTIG)) { + epcmd->status = NVME_SC_INVALID_QUEUE | NVME_STATUS_DNR; + return; + } + + qsize = le16_to_cpu(cmd->create_sq.qsize); + if (!qsize || qsize > mqes) { + if (qsize > mqes) + dev_warn(&epf_nvme->epf->dev, + "Create SQ %d, qsize %d > mqes %d: buggy driver?\n", + sqid, (int)qsize, mqes); + epcmd->status = NVME_SC_QUEUE_SIZE | NVME_STATUS_DNR; + return; + } + + ret = pci_epf_nvme_create_sq(epf_nvme, sqid, cqid, sq_flags, qsize, + le64_to_cpu(cmd->create_sq.prp1)); + if (ret) { + epcmd->status = NVME_SC_INTERNAL | NVME_STATUS_DNR; + return; + } + + /* Start polling the submission queue */ + queue_delayed_work(epf_nvme->ctrl.wq, &epf_nvme->ctrl.sq[sqid].work, 1); +} + +static void pci_epf_nvme_process_delete_sq(struct pci_epf_nvme *epf_nvme, + struct pci_epf_nvme_cmd *epcmd) +{ + struct nvme_command *cmd = &epcmd->cmd; + u16 sqid; + + sqid = le16_to_cpu(cmd->delete_queue.qid); + if (!sqid || + sqid >= epf_nvme->ctrl.nr_queues || + !epf_nvme->ctrl.sq[sqid].ref) { + epcmd->status = NVME_SC_QID_INVALID | NVME_STATUS_DNR; + return; + } + + pci_epf_nvme_delete_sq(epf_nvme, sqid); +} + +static void pci_epf_nvme_identify_hook(struct pci_epf_nvme_cmd *epcmd) +{ + struct pci_epf_nvme *epf_nvme = epcmd->epf_nvme; + struct nvme_command *cmd = &epcmd->cmd; + struct nvme_id_ctrl *id = epcmd->buffer; + unsigned int page_shift; + + if (cmd->identify.cns != NVME_ID_CNS_CTRL) + return; + + /* Set device vendor IDs */ + id->vid = cpu_to_le16(epf_nvme->epf->header->vendorid); + id->ssvid = id->vid; + + /* Set Maximum Data Transfer Size (MDTS) */ + page_shift = NVME_CAP_MPSMIN(epf_nvme->ctrl.ctrl->cap) + 12; + id->mdts = ilog2(epf_nvme->ctrl.mdts) - page_shift; + + /* Clear Controller Multi-Path I/O and Namespace Sharing Capabilities */ + id->cmic = 0; + + /* Do not report support for Autonomous Power State Transitions */ + id->apsta = 0; + + /* Indicate no support for SGLs */ + id->sgls = 0; +} + +static void pci_epf_nvme_get_log_hook(struct pci_epf_nvme_cmd *epcmd) +{ + struct nvme_command *cmd = &epcmd->cmd; + struct nvme_effects_log *log = epcmd->buffer; + + if (cmd->get_log_page.lid != NVME_LOG_CMD_EFFECTS) + return; + + /* + * ACS0 [Delete I/O Submission Queue ] 00000001 + * CSUPP+ LBCC- NCC- NIC- CCC- USS- No command restriction + */ + log->acs[0] |= cpu_to_le32(NVME_CMD_EFFECTS_CSUPP); + + /* + * ACS1 [Create I/O Submission Queue ] 00000001 + * CSUPP+ LBCC- NCC- NIC- CCC- USS- No command restriction + */ + log->acs[1] |= cpu_to_le32(NVME_CMD_EFFECTS_CSUPP); + + /* + * ACS4 [Delete I/O Completion Queue ] 
00000001 + * CSUPP+ LBCC- NCC- NIC- CCC- USS- No command restriction + */ + log->acs[4] |= cpu_to_le32(NVME_CMD_EFFECTS_CSUPP); + + /* + * ACS5 [Create I/O Completion Queue ] 00000001 + * CSUPP+ LBCC- NCC- NIC- CCC- USS- No command restriction + */ + log->acs[5] |= cpu_to_le32(NVME_CMD_EFFECTS_CSUPP); +} + +/* + * Returns true if the command has been handled + */ +static bool pci_epf_nvme_process_set_features(struct pci_epf_nvme_cmd *epcmd) +{ + struct pci_epf_nvme *epf_nvme = epcmd->epf_nvme; + struct pci_epf_nvme_ctrl *ctrl = &epf_nvme->ctrl; + u32 cdw10 = le32_to_cpu(epcmd->cmd.common.cdw10); + u32 cdw11 = le32_to_cpu(epcmd->cmd.common.cdw11); + u8 feat = cdw10 & 0xff; + u16 nr_ioq, nsqr, ncqr; + int qid; + + switch (feat) { + case NVME_FEAT_NUM_QUEUES: + ncqr = (cdw11 >> 16) & 0xffff; + nsqr = cdw11 & 0xffff; + if (ncqr == 0xffff || nsqr == 0xffff) { + epcmd->status = NVME_SC_INVALID_FIELD | NVME_STATUS_DNR; + return true; + } + + /* We cannot accept this command if we already have IO queues */ + for (qid = 1; qid < ctrl->nr_queues; qid++) { + if (epf_nvme->ctrl.sq[qid].ref || + epf_nvme->ctrl.cq[qid].ref) { + epcmd->status = + NVME_SC_CMD_SEQ_ERROR | NVME_STATUS_DNR; + return true; + } + } + + /* + * Number of I/O queues to report must not include the admin + * queue and is a 0-based value, so it is the total number of + * queues minus two. + */ + nr_ioq = ctrl->nr_queues - 2; + epcmd->cqe.result.u32 = cpu_to_le32(nr_ioq | (nr_ioq << 16)); + return true; + case NVME_FEAT_IRQ_COALESCE: + case NVME_FEAT_ARBITRATION: + /* We do not need to do anything special here. */ + epcmd->status = NVME_SC_SUCCESS; + return true; + default: + return false; + } +} + +/* + * Returns true if the command has been handled + */ +static bool pci_epf_nvme_process_get_features(struct pci_epf_nvme_cmd *epcmd) +{ + struct pci_epf_nvme *epf_nvme = epcmd->epf_nvme; + struct pci_epf_nvme_ctrl *ctrl = &epf_nvme->ctrl; + u32 cdw10 = le32_to_cpu(epcmd->cmd.common.cdw10); + u8 feat = cdw10 & 0xff; + u16 nr_ioq; + + switch (feat) { + case NVME_FEAT_NUM_QUEUES: + /* + * Number of I/O queues to report must not include the admin + * queue and is a 0-based value, so it is the total number of + * queues minus two. + */ + nr_ioq = ctrl->nr_queues - 2; + epcmd->cqe.result.u32 = cpu_to_le32(nr_ioq | (nr_ioq << 16)); + return true; + case NVME_FEAT_IRQ_COALESCE: + case NVME_FEAT_ARBITRATION: + /* We do not need to do anything special here. */ + epcmd->status = NVME_SC_SUCCESS; + return true; + default: + return false; + } +} + +static void pci_epf_nvme_process_admin_cmd(struct pci_epf_nvme_cmd *epcmd) +{ + struct pci_epf_nvme *epf_nvme = epcmd->epf_nvme; + void (*post_exec_hook)(struct pci_epf_nvme_cmd *) = NULL; + struct nvme_command *cmd = &epcmd->cmd; + + switch (cmd->common.opcode) { + case nvme_admin_identify: + post_exec_hook = pci_epf_nvme_identify_hook; + epcmd->buffer_size = NVME_IDENTIFY_DATA_SIZE; + epcmd->dma_dir = DMA_TO_DEVICE; + break; + + case nvme_admin_get_log_page: + post_exec_hook = pci_epf_nvme_get_log_hook; + epcmd->buffer_size = nvme_get_log_page_len(cmd); + epcmd->dma_dir = DMA_TO_DEVICE; + break; + + case nvme_admin_async_event: + /* + * Async events are a pain to deal with as they get canceled + * only once we delete the fabrics controller, which happens + * after the epf function is deleted, thus causing access to + * freed memory or leaking of epcmd. So ignore these commands + * for now, which is fine. The host will simply never see any + * event. 
+ */ + pci_epf_nvme_free_cmd(epcmd); + return; + + case nvme_admin_set_features: + /* + * Several NVMe features do not apply to the NVMe fabrics + * host controller, so handle them directly here. + */ + if (pci_epf_nvme_process_set_features(epcmd)) + goto complete; + break; + + case nvme_admin_get_features: + /* + * Several NVMe features do not apply to the NVMe fabrics + * host controller, so handle them directly here. + */ + if (pci_epf_nvme_process_get_features(epcmd)) + goto complete; + + case nvme_admin_abort_cmd: + break; + + case nvme_admin_create_cq: + pci_epf_nvme_process_create_cq(epf_nvme, epcmd); + goto complete; + + case nvme_admin_create_sq: + pci_epf_nvme_process_create_sq(epf_nvme, epcmd); + goto complete; + + case nvme_admin_delete_cq: + pci_epf_nvme_process_delete_cq(epf_nvme, epcmd); + goto complete; + + case nvme_admin_delete_sq: + pci_epf_nvme_process_delete_sq(epf_nvme, epcmd); + goto complete; + + default: + dev_err(&epf_nvme->epf->dev, + "Unhandled admin command %s (0x%02x)\n", + pci_epf_nvme_cmd_name(epcmd), cmd->common.opcode); + epcmd->status = NVME_SC_INVALID_OPCODE | NVME_STATUS_DNR; + goto complete; + } + + /* Synchronously execute the command */ + pci_epf_nvme_exec_cmd(epcmd, post_exec_hook); + +complete: + pci_epf_nvme_complete_cmd(epcmd); +} + +static inline size_t pci_epf_nvme_rw_data_len(struct pci_epf_nvme_cmd *epcmd) +{ + return ((u32)le16_to_cpu(epcmd->cmd.rw.length) + 1) << + epcmd->ns->head->lba_shift; +} + +static void pci_epf_nvme_process_io_cmd(struct pci_epf_nvme_cmd *epcmd, + struct pci_epf_nvme_queue *sq) +{ + struct pci_epf_nvme *epf_nvme = epcmd->epf_nvme; + + /* Get the command target namespace */ + epcmd->ns = nvme_find_get_ns(epf_nvme->ctrl.ctrl, + le32_to_cpu(epcmd->cmd.common.nsid)); + if (!epcmd->ns) { + epcmd->status = NVME_SC_INVALID_NS | NVME_STATUS_DNR; + goto complete; + } + + switch (epcmd->cmd.common.opcode) { + case nvme_cmd_read: + epcmd->buffer_size = pci_epf_nvme_rw_data_len(epcmd); + epcmd->dma_dir = DMA_TO_DEVICE; + break; + + case nvme_cmd_write: + epcmd->buffer_size = pci_epf_nvme_rw_data_len(epcmd); + epcmd->dma_dir = DMA_FROM_DEVICE; + break; + + case nvme_cmd_dsm: + epcmd->buffer_size = (le32_to_cpu(epcmd->cmd.dsm.nr) + 1) * + sizeof(struct nvme_dsm_range); + epcmd->dma_dir = DMA_FROM_DEVICE; + goto complete; + + case nvme_cmd_flush: + case nvme_cmd_write_zeroes: + break; + + default: + dev_err(&epf_nvme->epf->dev, + "Unhandled IO command %s (0x%02x)\n", + pci_epf_nvme_cmd_name(epcmd), + epcmd->cmd.common.opcode); + epcmd->status = NVME_SC_INVALID_OPCODE | NVME_STATUS_DNR; + goto complete; + } + + queue_work(sq->cmd_wq, &epcmd->work); + + return; + +complete: + pci_epf_nvme_complete_cmd(epcmd); +} + +static bool pci_epf_nvme_fetch_cmd(struct pci_epf_nvme *epf_nvme, + struct pci_epf_nvme_queue *sq) +{ + struct pci_epf_nvme_ctrl *ctrl = &epf_nvme->ctrl; + struct pci_epf_nvme_cmd *epcmd; + int ret; + + if (!(sq->qflags & PCI_EPF_NVME_QUEUE_LIVE)) + return false; + + sq->tail = pci_epf_nvme_reg_read32(ctrl, sq->db); + if (sq->head == sq->tail) + return false; + + ret = pci_epf_nvme_map_queue(epf_nvme, sq); + if (ret) + return false; + + while (sq->head != sq->tail) { + epcmd = pci_epf_nvme_alloc_cmd(epf_nvme); + if (!epcmd) + break; + + /* Get the NVMe command submitted by the host */ + pci_epf_nvme_init_cmd(epf_nvme, epcmd, sq->qid, sq->cqid); + memcpy_fromio(&epcmd->cmd, + sq->pci_map.virt_addr + sq->head * sq->qes, + sizeof(struct nvme_command)); + + dev_dbg(&epf_nvme->epf->dev, + "sq[%d]: head %d/%d, tail %d, command 
%s\n", + sq->qid, (int)sq->head, (int)sq->depth, + (int)sq->tail, pci_epf_nvme_cmd_name(epcmd)); + + sq->head++; + if (sq->head == sq->depth) + sq->head = 0; + + list_add_tail(&epcmd->link, &sq->list); + } + + pci_epf_nvme_unmap_queue(epf_nvme, sq); + + return !list_empty(&sq->list); +} + +static void pci_epf_nvme_sq_work(struct work_struct *work) +{ + struct pci_epf_nvme_queue *sq = + container_of(work, struct pci_epf_nvme_queue, work.work); + struct pci_epf_nvme *epf_nvme = sq->epf_nvme; + struct pci_epf_nvme_cmd *epcmd; + unsigned long poll_interval = 1; + unsigned long j = jiffies; + + while (pci_epf_nvme_ctrl_ready(epf_nvme) && + (sq->qflags & PCI_EPF_NVME_QUEUE_LIVE)) { + /* + * Try to get commands from the host. If We do not yet have any + * command, aggressively keep polling the SQ of IO queues for at + * most one tick and fall back to rescheduling the SQ work if we + * have not received any command after that. This hybrid + * spin-polling method significantly increases the IOPS for + * shallow queue depth operation (e.g. QD=1). + */ + if (!pci_epf_nvme_fetch_cmd(epf_nvme, sq)) { + if (!sq->qid || j != jiffies) + break; + usleep_range(1, 2); + continue; + } + + while (!list_empty(&sq->list)) { + epcmd = list_first_entry(&sq->list, + struct pci_epf_nvme_cmd, link); + list_del_init(&epcmd->link); + if (sq->qid) + pci_epf_nvme_process_io_cmd(epcmd, sq); + else + pci_epf_nvme_process_admin_cmd(epcmd); + } + } + + if (!pci_epf_nvme_ctrl_ready(epf_nvme)) + return; + + /* No need to aggressively poll the admin queue. */ + if (!sq->qid) + poll_interval = msecs_to_jiffies(5); + queue_delayed_work(epf_nvme->ctrl.wq, &sq->work, poll_interval); +} + +static void pci_epf_nvme_cq_work(struct work_struct *work) +{ + struct pci_epf_nvme_queue *cq = + container_of(work, struct pci_epf_nvme_queue, work.work); + struct pci_epf_nvme *epf_nvme = cq->epf_nvme; + struct pci_epf_nvme_cmd *epcmd; + unsigned long flags; + LIST_HEAD(list); + int ret; + + spin_lock_irqsave(&cq->lock, flags); + + while (!list_empty(&cq->list)) { + + list_splice_tail_init(&cq->list, &list); + spin_unlock_irqrestore(&cq->lock, flags); + + ret = pci_epf_nvme_map_queue(epf_nvme, cq); + if (ret) { + queue_delayed_work(epf_nvme->ctrl.wq, &cq->work, 1); + return; + } + + while (!list_empty(&list)) { + epcmd = list_first_entry(&list, + struct pci_epf_nvme_cmd, link); + list_del_init(&epcmd->link); + if (!pci_epf_nvme_queue_response(epcmd)) + break; + } + + pci_epf_nvme_unmap_queue(epf_nvme, cq); + + if (pci_epf_nvme_ctrl_ready(cq->epf_nvme)) + pci_epf_nvme_raise_irq(cq->epf_nvme, cq); + + spin_lock_irqsave(&cq->lock, flags); + } + + /* + * Completions on the host may trigger issuing of new commands. Try to + * get these early to improve IOPS and reduce latency. 
+ */ + if (cq->qid) + queue_delayed_work(epf_nvme->ctrl.wq, &cq->sq->work, 0); + + spin_unlock_irqrestore(&cq->lock, flags); +} + +static void pci_epf_nvme_reg_poll(struct work_struct *work) +{ + struct pci_epf_nvme *epf_nvme = + container_of(work, struct pci_epf_nvme, reg_poll.work); + struct pci_epf_nvme_ctrl *ctrl = &epf_nvme->ctrl; + u32 old_cc; + + /* Set the controller register bar */ + ctrl->reg = epf_nvme->reg_bar; + if (!ctrl->reg) { + dev_err(&epf_nvme->epf->dev, "No register BAR set\n"); + goto again; + } + + /* Check CC.EN to determine what we need to do */ + old_cc = ctrl->cc; + ctrl->cc = pci_epf_nvme_reg_read32(ctrl, NVME_REG_CC); + + /* If not enabled yet, wait */ + if (!(old_cc & NVME_CC_ENABLE) && !(ctrl->cc & NVME_CC_ENABLE)) + goto again; + + /* If CC.EN was set by the host, enable the controller */ + if (!(old_cc & NVME_CC_ENABLE) && (ctrl->cc & NVME_CC_ENABLE)) { + pci_epf_nvme_enable_ctrl(epf_nvme); + goto again; + } + + /* If CC.EN was cleared by the host, disable the controller */ + if (((old_cc & NVME_CC_ENABLE) && !(ctrl->cc & NVME_CC_ENABLE)) || + ctrl->cc & NVME_CC_SHN_NORMAL) + pci_epf_nvme_disable_ctrl(epf_nvme); + +again: + schedule_delayed_work(&epf_nvme->reg_poll, msecs_to_jiffies(5)); +} + +static int pci_epf_nvme_configure_bar(struct pci_epf *epf) +{ + struct pci_epf_nvme *epf_nvme = epf_get_drvdata(epf); + const struct pci_epc_features *features = epf_nvme->epc_features; + size_t reg_size, reg_bar_size; + size_t msix_table_size = 0; + + /* + * The first free BAR will be our register BAR and per NVMe + * specifications, it must be BAR 0. + */ + if (pci_epc_get_first_free_bar(features) != BAR_0) { + dev_err(&epf->dev, "BAR 0 is not free\n"); + return -EINVAL; + } + + /* Initialize BAR flags */ + if (features->bar[BAR_0].only_64bit) + epf->bar[BAR_0].flags |= PCI_BASE_ADDRESS_MEM_TYPE_64; + + /* + * Calculate the size of the register bar: NVMe registers first with + * enough space for the doorbells, followed by the MSI-X table + * if supported. 
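+ * The MSI-X PBA is placed right after the MSI-X table.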
+ */ + reg_size = NVME_REG_DBS + (PCI_EPF_NVME_MAX_NR_QUEUES * 2 * sizeof(u32)); + reg_size = ALIGN(reg_size, 8); + + if (features->msix_capable) { + size_t pba_size; + + msix_table_size = PCI_MSIX_ENTRY_SIZE * epf->msix_interrupts; + epf_nvme->msix_table_offset = reg_size; + pba_size = ALIGN(DIV_ROUND_UP(epf->msix_interrupts, 8), 8); + + reg_size += msix_table_size + pba_size; + } + + reg_bar_size = ALIGN(reg_size, 4096); + + if (features->bar[BAR_0].type == BAR_FIXED) { + if (reg_bar_size > features->bar[BAR_0].fixed_size) { + dev_err(&epf->dev, + "Reg BAR 0 size %llu B too small, need %zu B\n", + features->bar[BAR_0].fixed_size, + reg_bar_size); + return -ENOMEM; + } + reg_bar_size = features->bar[BAR_0].fixed_size; + } + + epf_nvme->reg_bar = pci_epf_alloc_space(epf, reg_bar_size, BAR_0, + features, PRIMARY_INTERFACE); + if (!epf_nvme->reg_bar) { + dev_err(&epf->dev, "Allocate register BAR failed\n"); + return -ENOMEM; + } + memset(epf_nvme->reg_bar, 0, reg_bar_size); + + return 0; +} + +static void pci_epf_nvme_clear_bar(struct pci_epf *epf) +{ + struct pci_epf_nvme *epf_nvme = epf_get_drvdata(epf); + + pci_epc_clear_bar(epf->epc, epf->func_no, epf->vfunc_no, + &epf->bar[BAR_0]); + pci_epf_free_space(epf, epf_nvme->reg_bar, BAR_0, PRIMARY_INTERFACE); + epf_nvme->reg_bar = NULL; +} + +static int pci_epf_nvme_init_irq(struct pci_epf *epf) +{ + struct pci_epf_nvme *epf_nvme = epf_get_drvdata(epf); + int ret; + + /* Enable MSI-X if supported, otherwise, use MSI */ + if (epf_nvme->epc_features->msix_capable && epf->msix_interrupts) { + ret = pci_epc_set_msix(epf->epc, epf->func_no, epf->vfunc_no, + epf->msix_interrupts, BAR_0, + epf_nvme->msix_table_offset); + if (ret) { + dev_err(&epf->dev, "MSI-X configuration failed\n"); + return ret; + } + + epf_nvme->nr_vectors = epf->msix_interrupts; + epf_nvme->irq_type = PCI_IRQ_MSIX; + + return 0; + } + + if (epf_nvme->epc_features->msi_capable && epf->msi_interrupts) { + ret = pci_epc_set_msi(epf->epc, epf->func_no, epf->vfunc_no, + epf->msi_interrupts); + if (ret) { + dev_err(&epf->dev, "MSI configuration failed\n"); + return ret; + } + + epf_nvme->nr_vectors = epf->msi_interrupts; + epf_nvme->irq_type = PCI_IRQ_MSI; + + return 0; + } + + /* MSI and MSI-X are not supported: fall back to INTX */ + epf_nvme->nr_vectors = 1; + epf_nvme->irq_type = PCI_IRQ_INTX; + + return 0; +} + +static int pci_epf_nvme_epc_init(struct pci_epf *epf) +{ + struct pci_epf_nvme *epf_nvme = epf_get_drvdata(epf); + int ret; + + if (epf->vfunc_no <= 1) { + /* Set device ID, class, etc */ + ret = pci_epc_write_header(epf->epc, epf->func_no, epf->vfunc_no, + epf->header); + if (ret) { + dev_err(&epf->dev, + "Write configuration header failed %d\n", ret); + return ret; + } + } + + /* Setup the PCIe BAR and enable interrupts */ + ret = pci_epc_set_bar(epf->epc, epf->func_no, epf->vfunc_no, + &epf->bar[BAR_0]); + if (ret) { + dev_err(&epf->dev, "Set BAR 0 failed\n"); + pci_epf_free_space(epf, epf_nvme->reg_bar, BAR_0, + PRIMARY_INTERFACE); + return ret; + } + + ret = pci_epf_nvme_init_irq(epf); + if (ret) + return ret; + + pci_epf_nvme_init_ctrl_regs(epf); + + if (!epf_nvme->epc_features->linkup_notifier) + schedule_delayed_work(&epf_nvme->reg_poll, msecs_to_jiffies(5)); + + return 0; +} + +static void pci_epf_nvme_epc_deinit(struct pci_epf *epf) +{ + struct pci_epf_nvme *epf_nvme = epf_get_drvdata(epf); + + /* Stop polling BAR registers and disable the controller */ + cancel_delayed_work_sync(&epf_nvme->reg_poll); + + pci_epf_nvme_delete_ctrl(epf); + 
pci_epf_nvme_clean_dma(epf); + pci_epf_nvme_clear_bar(epf); +} + +static int pci_epf_nvme_link_up(struct pci_epf *epf) +{ + struct pci_epf_nvme *epf_nvme = epf_get_drvdata(epf); + + dev_info(&epf->dev, "Link UP\n"); + + pci_epf_nvme_init_ctrl_regs(epf); + + /* Start polling the BAR registers to detect controller enable */ + schedule_delayed_work(&epf_nvme->reg_poll, 0); + + return 0; +} + +static int pci_epf_nvme_link_down(struct pci_epf *epf) +{ + struct pci_epf_nvme *epf_nvme = epf_get_drvdata(epf); + + dev_info(&epf->dev, "Link DOWN\n"); + + /* Stop polling BAR registers and disable the controller */ + cancel_delayed_work_sync(&epf_nvme->reg_poll); + pci_epf_nvme_disable_ctrl(epf_nvme); + + return 0; +} + +static const struct pci_epc_event_ops pci_epf_nvme_event_ops = { + .epc_init = pci_epf_nvme_epc_init, + .epc_deinit = pci_epf_nvme_epc_deinit, + .link_up = pci_epf_nvme_link_up, + .link_down = pci_epf_nvme_link_down, +}; + +static int pci_epf_nvme_bind(struct pci_epf *epf) +{ + struct pci_epf_nvme *epf_nvme = epf_get_drvdata(epf); + const struct pci_epc_features *epc_features; + struct pci_epc *epc = epf->epc; + bool dma_supported; + int ret; + + if (!epc) { + dev_err(&epf->dev, "No endpoint controller\n"); + return -EINVAL; + } + + epc_features = pci_epc_get_features(epc, epf->func_no, epf->vfunc_no); + if (!epc_features) { + dev_err(&epf->dev, "epc_features not implemented\n"); + return -EOPNOTSUPP; + } + epf_nvme->epc_features = epc_features; + + ret = pci_epf_nvme_configure_bar(epf); + if (ret) + return ret; + + if (epf_nvme->dma_enable) { + dma_supported = pci_epf_nvme_init_dma(epf_nvme); + if (dma_supported) { + dev_info(&epf->dev, "DMA supported\n"); + } else { + dev_info(&epf->dev, + "DMA not supported, falling back to mmio\n"); + epf_nvme->dma_enable = false; + } + } else { + dev_info(&epf->dev, "DMA disabled\n"); + } + + /* Create the fabrics host controller */ + ret = pci_epf_nvme_create_ctrl(epf); + if (ret) + goto clean_dma; + + return 0; + +clean_dma: + pci_epf_nvme_clean_dma(epf); + pci_epf_nvme_clear_bar(epf); + + return ret; +} + +static void pci_epf_nvme_unbind(struct pci_epf *epf) +{ + struct pci_epf_nvme *epf_nvme = epf_get_drvdata(epf); + struct pci_epc *epc = epf->epc; + + cancel_delayed_work_sync(&epf_nvme->reg_poll); + + pci_epf_nvme_delete_ctrl(epf); + + if (epc->init_complete) { + pci_epf_nvme_clean_dma(epf); + pci_epf_nvme_clear_bar(epf); + } +} + +static struct pci_epf_header epf_nvme_pci_header = { + .vendorid = PCI_ANY_ID, + .deviceid = PCI_ANY_ID, + .progif_code = 0x02, /* NVM Express */ + .baseclass_code = PCI_BASE_CLASS_STORAGE, + .subclass_code = 0x08, /* Non-Volatile Memory controller */ + .interrupt_pin = PCI_INTERRUPT_INTA, +}; + +static int pci_epf_nvme_probe(struct pci_epf *epf, + const struct pci_epf_device_id *id) +{ + struct pci_epf_nvme *epf_nvme; + + epf_nvme = devm_kzalloc(&epf->dev, sizeof(*epf_nvme), GFP_KERNEL); + if (!epf_nvme) + return -ENOMEM; + + epf_nvme->epf = epf; + INIT_DELAYED_WORK(&epf_nvme->reg_poll, pci_epf_nvme_reg_poll); + + epf_nvme->prp_list_buf = devm_kzalloc(&epf->dev, NVME_CTRL_PAGE_SIZE, + GFP_KERNEL); + if (!epf_nvme->prp_list_buf) + return -ENOMEM; + + /* Set default attribute values */ + epf_nvme->dma_enable = true; + epf_nvme->mdts_kb = PCI_EPF_NVME_MDTS_KB; + + epf->event_ops = &pci_epf_nvme_event_ops; + epf->header = &epf_nvme_pci_header; + epf_set_drvdata(epf, epf_nvme); + + return 0; +} + +#define to_epf_nvme(epf_group) \ + container_of((epf_group), struct pci_epf_nvme, group) + +static ssize_t 
pci_epf_nvme_ctrl_opts_show(struct config_item *item, + char *page) +{ + struct config_group *group = to_config_group(item); + struct pci_epf_nvme *epf_nvme = to_epf_nvme(group); + + if (!epf_nvme->ctrl_opts_buf) + return 0; + + return sysfs_emit(page, "%s\n", epf_nvme->ctrl_opts_buf); +} + +#define PCI_EPF_NVME_OPT_HIDDEN_NS "hidden_ns" + +static ssize_t pci_epf_nvme_ctrl_opts_store(struct config_item *item, + const char *page, size_t len) +{ + struct config_group *group = to_config_group(item); + struct pci_epf_nvme *epf_nvme = to_epf_nvme(group); + size_t opt_buf_size; + + /* Do not allow setting options when the function is already started */ + if (epf_nvme->ctrl.ctrl) + return -EBUSY; + + if (!len) + return -EINVAL; + + kfree(epf_nvme->ctrl_opts_buf); + + /* + * Make sure we have enough room to add the hidden_ns option + * if it is missing. + */ + opt_buf_size = len + strlen(PCI_EPF_NVME_OPT_HIDDEN_NS) + 2; + epf_nvme->ctrl_opts_buf = kzalloc(opt_buf_size, GFP_KERNEL); + if (!epf_nvme->ctrl_opts_buf) + return -ENOMEM; + + strscpy(epf_nvme->ctrl_opts_buf, page, opt_buf_size); + if (!strnstr(page, PCI_EPF_NVME_OPT_HIDDEN_NS, len)) + strncat(epf_nvme->ctrl_opts_buf, + "," PCI_EPF_NVME_OPT_HIDDEN_NS, opt_buf_size); + + dev_dbg(&epf_nvme->epf->dev, + "NVMe fabrics controller options: %s\n", + epf_nvme->ctrl_opts_buf); + + return len; +} + +CONFIGFS_ATTR(pci_epf_nvme_, ctrl_opts); + +static ssize_t pci_epf_nvme_dma_enable_show(struct config_item *item, + char *page) +{ + struct config_group *group = to_config_group(item); + struct pci_epf_nvme *epf_nvme = to_epf_nvme(group); + + return sysfs_emit(page, "%d\n", epf_nvme->dma_enable); +} + +static ssize_t pci_epf_nvme_dma_enable_store(struct config_item *item, + const char *page, size_t len) +{ + struct config_group *group = to_config_group(item); + struct pci_epf_nvme *epf_nvme = to_epf_nvme(group); + int ret; + + if (epf_nvme->ctrl_enabled) + return -EBUSY; + + ret = kstrtobool(page, &epf_nvme->dma_enable); + if (ret) + return ret; + + return len; +} + +CONFIGFS_ATTR(pci_epf_nvme_, dma_enable); + +static ssize_t pci_epf_nvme_mdts_kb_show(struct config_item *item, char *page) +{ + struct config_group *group = to_config_group(item); + struct pci_epf_nvme *epf_nvme = to_epf_nvme(group); + + return sysfs_emit(page, "%zu\n", epf_nvme->mdts_kb); +} + +static ssize_t pci_epf_nvme_mdts_kb_store(struct config_item *item, + const char *page, size_t len) +{ + struct config_group *group = to_config_group(item); + struct pci_epf_nvme *epf_nvme = to_epf_nvme(group); + unsigned long mdts_kb; + int ret; + + if (epf_nvme->ctrl_enabled) + return -EBUSY; + + ret = kstrtoul(page, 0, &mdts_kb); + if (ret) + return ret; + if (!mdts_kb) + mdts_kb = PCI_EPF_NVME_MDTS_KB; + else if (mdts_kb > PCI_EPF_NVME_MAX_MDTS_KB) + mdts_kb = PCI_EPF_NVME_MAX_MDTS_KB; + + if (!is_power_of_2(mdts_kb)) + return -EINVAL; + + epf_nvme->mdts_kb = mdts_kb; + + return len; +} + +CONFIGFS_ATTR(pci_epf_nvme_, mdts_kb); + +static struct configfs_attribute *pci_epf_nvme_attrs[] = { + &pci_epf_nvme_attr_ctrl_opts, + &pci_epf_nvme_attr_dma_enable, + &pci_epf_nvme_attr_mdts_kb, + NULL, +}; + +static const struct config_item_type pci_epf_nvme_group_type = { + .ct_attrs = pci_epf_nvme_attrs, + .ct_owner = THIS_MODULE, +}; + +static struct config_group *pci_epf_nvme_add_cfs(struct pci_epf *epf, + struct config_group *group) +{ + struct pci_epf_nvme *epf_nvme = epf_get_drvdata(epf); + + /* Add the NVMe target attributes */ + config_group_init_type_name(&epf_nvme->group, "nvme", + 
&pci_epf_nvme_group_type); + + return &epf_nvme->group; +} + +static const struct pci_epf_device_id pci_epf_nvme_ids[] = { + { .name = "pci_epf_nvme" }, + {}, +}; + +static struct pci_epf_ops pci_epf_nvme_ops = { + .bind = pci_epf_nvme_bind, + .unbind = pci_epf_nvme_unbind, + .add_cfs = pci_epf_nvme_add_cfs, +}; + +static struct pci_epf_driver epf_nvme_driver = { + .driver.name = "pci_epf_nvme", + .probe = pci_epf_nvme_probe, + .id_table = pci_epf_nvme_ids, + .ops = &pci_epf_nvme_ops, + .owner = THIS_MODULE, +}; + +static int __init pci_epf_nvme_init(void) +{ + int ret; + + epf_nvme_cmd_cache = kmem_cache_create("epf_nvme_cmd", + sizeof(struct pci_epf_nvme_cmd), + 0, SLAB_HWCACHE_ALIGN, NULL); + if (!epf_nvme_cmd_cache) + return -ENOMEM; + + ret = pci_epf_register_driver(&epf_nvme_driver); + if (ret) + goto out_cache; + + pr_info("Registered nvme EPF driver\n"); + + return 0; + +out_cache: + kmem_cache_destroy(epf_nvme_cmd_cache); + + pr_err("Register nvme EPF driver failed\n"); + + return ret; +} +module_init(pci_epf_nvme_init); + +static void __exit pci_epf_nvme_exit(void) +{ + pci_epf_unregister_driver(&epf_nvme_driver); + + kmem_cache_destroy(epf_nvme_cmd_cache); + + pr_info("Unregistered nvme EPF driver\n"); +} +module_exit(pci_epf_nvme_exit); + +MODULE_DESCRIPTION("PCI endpoint NVMe function driver"); +MODULE_AUTHOR("Damien Le Moal "); +MODULE_IMPORT_NS(NVME_TARGET_PASSTHRU); +MODULE_IMPORT_NS(NVME_FABRICS); +MODULE_LICENSE("GPL"); From patchwork Fri Oct 11 12:19:51 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Damien Le Moal X-Patchwork-Id: 1996101 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@legolas.ozlabs.org Authentication-Results: legolas.ozlabs.org; dkim=pass (2048-bit key; unprotected) header.d=kernel.org header.i=@kernel.org header.a=rsa-sha256 header.s=k20201202 header.b=rEQKvnaA; dkim-atps=neutral Authentication-Results: legolas.ozlabs.org; spf=pass (sender SPF authorized) smtp.mailfrom=vger.kernel.org (client-ip=139.178.88.99; helo=sv.mirrors.kernel.org; envelope-from=linux-pci+bounces-14313-incoming=patchwork.ozlabs.org@vger.kernel.org; receiver=patchwork.ozlabs.org) Received: from sv.mirrors.kernel.org (sv.mirrors.kernel.org [139.178.88.99]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature ECDSA (secp384r1)) (No client certificate requested) by legolas.ozlabs.org (Postfix) with ESMTPS id 4XQ5L23qtlz1xt1 for ; Fri, 11 Oct 2024 23:20:10 +1100 (AEDT) Received: from smtp.subspace.kernel.org (wormhole.subspace.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by sv.mirrors.kernel.org (Postfix) with ESMTPS id 2554D287F34 for ; Fri, 11 Oct 2024 12:20:09 +0000 (UTC) Received: from localhost.localdomain (localhost.localdomain [127.0.0.1]) by smtp.subspace.kernel.org (Postfix) with ESMTP id D9E52216A3B; Fri, 11 Oct 2024 12:20:05 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="rEQKvnaA" X-Original-To: linux-pci@vger.kernel.org Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id B68E4802 for ; Fri, 11 Oct 2024 12:20:05 +0000 (UTC) 
Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1728649205; cv=none; b=FuRTq/ri5469Z0zIaplrxzePes74RwgQnOavWnrOp9Wx5KCMeUrR1hpBjjki+0rBzi1RW6flO72IrlglkBtoCmOahtg7aQoEQQN+KGwnGCGCUpaQeHCbnO8kloQGEjFLVxDUUsIM5FYstT97HTwKjUdSKrE91xvBtJcFItnZBmw= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1728649205; c=relaxed/simple; bh=l1AhmbDvh9axEVcFE3ZHVaOPS0jik07VFsLdb/5x1KQ=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=ToIDu5hy7bpn9EvpMyO/rYKMMdYlxkRB1X9NoWjKWvfBx+AL0unlc0Ey5SnjLL/q42bOVd6refa6h00QK1kLz4fvfTNEZk9Q1QCdQvytluEWdiR9AZmIFjGiMKEBM0ECpKxUsGJ9cr16Z5ju/wANXF9JOtBHcniTZlm87lN/Tx8= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=rEQKvnaA; arc=none smtp.client-ip=10.30.226.201 Received: by smtp.kernel.org (Postfix) with ESMTPSA id 85F96C4CED1; Fri, 11 Oct 2024 12:20:03 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1728649205; bh=l1AhmbDvh9axEVcFE3ZHVaOPS0jik07VFsLdb/5x1KQ=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=rEQKvnaA3mbv1hbpDLzYyOeTa9uxmxisvg/HmXRPqv930I+UV5B1LTYGEXp1eR2kj PW/E1CML2+wttZbY+Z3ui70OuoRL7i0hl7hOGHnJ5pM7pS2E39cNyAa9SkPXdmewJG 1lSSRK/wBVaQKpNcYigRrEvHVILIU9xIkddDmaKJTJFU2CUhfbZNIFVRl/mvlpvmAf 7wPgVeCfQTWhIi3b7nWNw46HisStD3KkrEV0TCifJ21OWMh3AORpJYkOyolC2ijAAE p+oZwpW4oBZQ5OjhZoAI2sFY92e3la1gkURtwmY2wUBHLVuBw/MxthEGuIgx3qKy/Q HNJnfeoHwD2Tw== From: Damien Le Moal To: linux-nvme@lists.infradead.org, Keith Busch , Christoph Hellwig , Sagi Grimberg , Manivannan Sadhasivam , =?utf-8?q?Krzyszt?= =?utf-8?q?of_Wilczy=C5=84ski?= , Kishon Vijay Abraham I , Bjorn Helgaas , Lorenzo Pieralisi , linux-pci@vger.kernel.org Cc: Rick Wertenbroek , Niklas Cassel Subject: [PATCH v2 5/5] PCI: endpoint: Document the NVMe endpoint function driver Date: Fri, 11 Oct 2024 21:19:51 +0900 Message-ID: <20241011121951.90019-6-dlemoal@kernel.org> X-Mailer: git-send-email 2.46.2 In-Reply-To: <20241011121951.90019-1-dlemoal@kernel.org> References: <20241011121951.90019-1-dlemoal@kernel.org> Precedence: bulk X-Mailing-List: linux-pci@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Add the documentation files: - Documentation/PCI/endpoint/pci-nvme-function.rst - Documentation/PCI/endpoint/pci-nvme-howto.rst - Documentation/PCI/endpoint/function/binding/pci-nvme.rst To respectively document the NVMe PCI endpoint function driver internals, provide a user guide explaning how to setup an NVMe endpoint device and describe the NVMe endpoint function driver binding attributes. Signed-off-by: Damien Le Moal --- .../endpoint/function/binding/pci-nvme.rst | 34 ++++ Documentation/PCI/endpoint/index.rst | 3 + .../PCI/endpoint/pci-nvme-function.rst | 151 ++++++++++++++ Documentation/PCI/endpoint/pci-nvme-howto.rst | 189 ++++++++++++++++++ MAINTAINERS | 2 + 5 files changed, 379 insertions(+) create mode 100644 Documentation/PCI/endpoint/function/binding/pci-nvme.rst create mode 100644 Documentation/PCI/endpoint/pci-nvme-function.rst create mode 100644 Documentation/PCI/endpoint/pci-nvme-howto.rst diff --git a/Documentation/PCI/endpoint/function/binding/pci-nvme.rst b/Documentation/PCI/endpoint/function/binding/pci-nvme.rst new file mode 100644 index 000000000000..d80293c08bcd --- /dev/null +++ b/Documentation/PCI/endpoint/function/binding/pci-nvme.rst @@ -0,0 +1,34 @@ +.. 
SPDX-License-Identifier: GPL-2.0 + +========================== +PCI NVMe Endpoint Function +========================== + +1) Create the function subdirectory pci_epf_nvme.0 in the +pci_ep/functions/pci_epf_nvme directory of configfs. + +Standard EPF Configurable Fields: + +================ =========================================================== +vendorid Do not care (e.g. PCI_ANY_ID) +deviceid Do not care (e.g. PCI_ANY_ID) +revid Do not care +progif_code Must be 0x02 (NVM Express) +baseclass_code Must be 0x1 (PCI_BASE_CLASS_STORAGE) +subclass_code Must be 0x08 (Non-Volatile Memory controller) +cache_line_size Do not care +subsys_vendor_id Do not care (e.g. PCI_ANY_ID) +subsys_id Do not care (e.g. PCI_ANY_ID) +msi_interrupts At least equal to the number of queue pairs desired +msix_interrupts At least equal to the number of queue pairs desired +interrupt_pin Interrupt PIN to use if MSI and MSI-X are not supported +================ =========================================================== + +The NVMe EPF specific configurable fields are in the nvme subdirectory of the +directory created in step 1 above. + +================ =========================================================== +ctrl_opts NVMe target connection parameters +dma_enable Enable (1) or disable (0) DMA transfers; default = 1 +mdts_kb Maximum data transfer size in KiB; default = 128 +================ =========================================================== diff --git a/Documentation/PCI/endpoint/index.rst b/Documentation/PCI/endpoint/index.rst index 4d2333e7ae06..764f1e8f81f2 100644 --- a/Documentation/PCI/endpoint/index.rst +++ b/Documentation/PCI/endpoint/index.rst @@ -15,6 +15,9 @@ PCI Endpoint Framework pci-ntb-howto pci-vntb-function pci-vntb-howto + pci-nvme-function + pci-nvme-howto function/binding/pci-test function/binding/pci-ntb + function/binding/pci-nvme diff --git a/Documentation/PCI/endpoint/pci-nvme-function.rst b/Documentation/PCI/endpoint/pci-nvme-function.rst new file mode 100644 index 000000000000..ac8baa5556be --- /dev/null +++ b/Documentation/PCI/endpoint/pci-nvme-function.rst @@ -0,0 +1,151 @@ +.. SPDX-License-Identifier: GPL-2.0 + +================= +PCI NVMe Function +================= + +:Author: Damien Le Moal + +The PCI NVMe endpoint function driver implements a PCIe NVMe controller for a +local NVMe fabrics host controller. The fabrics controller target can use any +of the transports supported by the NVMe driver. In practice, on small SBC +boards equipped with a PCI endpoint controller, loop targets backed by files or +block devices, or TCP targets connected to remote NVMe devices, can easily be +used. + +Overview +======== + +The NVMe endpoint function driver relies as much as possible on the NVMe +fabrics driver for executing NVMe commands received from the PCI RC host to +minimize NVMe command parsing. However, some admin commands must be modified to +satisfy PCI transport specification constraints (e.g. queue management +commands support and the optional SGL support). + +Capabilities +------------ + +The NVMe capabilities exposed to the PCI RC host through the BAR 0 registers +are almost identical to the capabilities of the NVMe fabrics controller, with +some exceptions: + +1) NVMe-over-fabrics specifications mandate support for SGL. However, this + capability is not exposed as supported because the current NVMe endpoint + driver code does not support SGL. + +2) The NVMe endpoint function driver can expose a different MDTS (Maximum Data + Transfer Size) than the fabrics controller used (see the example below). 
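+
+The short example below (a standalone sketch, not part of the driver) shows how
+a maximum data transfer size configured in KiB maps to the MDTS field of the
+Identify Controller data, which is expressed as a power of two in units of the
+minimum memory page size. The helper and variable names are made up for the
+illustration and a 4 KiB minimum page size (CAP.MPSMIN = 0) is assumed.
+
+.. code-block:: c
+
+   #include <stdio.h>
+
+   /* Integer log2, sufficient for power-of-two sizes. */
+   static unsigned int ilog2_u32(unsigned int v)
+   {
+           unsigned int r = 0;
+
+           while (v >>= 1)
+                   r++;
+           return r;
+   }
+
+   int main(void)
+   {
+           unsigned int page_shift = 12;             /* 4 KiB pages (MPSMIN = 0) */
+           unsigned int mdts_kb = 128;               /* default mdts_kb value */
+           unsigned int mdts_bytes = mdts_kb * 1024;
+
+           /* 128 KiB with 4 KiB pages is 2^5 * 4 KiB, so MDTS = 5 */
+           printf("MDTS: %u\n", ilog2_u32(mdts_bytes) - page_shift);
+           return 0;
+   }
+
+With the default 128 KiB maximum transfer size and 4 KiB pages this gives an
+MDTS value of 5, which matches the mdts value shown in the example "nvme
+id-ctrl" output of Documentation/PCI/endpoint/pci-nvme-howto.rst.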
+ +Maximum Number of Queue Pairs +----------------------------- + +Upon binding of the NVMe endpoint function driver to the endpoint controller, +BAR 0 is allocated with enough space to accommodate up to +PCI_EPF_NVME_MAX_NR_QUEUES (16) queue pairs. This relatively low number is +necessary to avoid running out of memory windows for mapping PCI addresses to +local endpoint controller memory. + +The number of memory windows necessary for operation is roughly at most: +1) One memory window for raising MSI/MSI-X interrupts +2) One memory window for command PRP and data transfers +3) One memory window for each submission queue +4) One memory window for each completion queue + +Given the highly asynchronous nature of the NVMe endpoint function driver +operation, the memory windows needed as described above will generally not be +used simultaneously, but that may happen. So a safe maximum number of queue +pairs that can be supported is equal to the maximum number of memory windows of +the endpoint controller minus two and divided by two. E.g. for an endpoint PCI +controller with 32 outbound memory windows available, up to 15 queue pairs can +be safely operated without any risk of getting PCI space mapping errors due to +the lack of memory windows. + +The NVMe endpoint function driver allows configuring the maximum number of +queue pairs through configfs. + +Command Execution +================= + +The NVMe endpoint function driver relies on several work items to process NVMe +commands issued by the PCI RC host. + +Register Poll Work +------------------ + +The register poll work is a delayed work used to poll for changes to the +controller configuration (CC) register. This is used to detect operations +initiated by the PCI host such as enabling or disabling the NVMe controller. +The register poll work is scheduled every 5 ms. + +Submission Queue Work +--------------------- + +Upon creation of submission queues, starting with the submission queue for +admin commands, a delayed work is created and scheduled for execution every +jiffy to poll the submission queue doorbell and detect the submission of +commands by the PCI host. + +When a doorbell change is detected by a submission queue work, the work +allocates a command structure to copy the NVMe command issued by the PCI host +and schedules processing of the command using the command work. + +Command Processing Work +----------------------- + +This per-NVMe command work is scheduled for execution when an NVMe command is +received from the host. This work proceeds as follows: + +1) The NVMe command is minimally parsed to determine if the command has a + data buffer. If it does, the PRP list for the command is retrieved to + identify the PCI address ranges used for the command data buffer. This can + lead to the command buffer being represented using several discontiguous + memory fragments. A local memory buffer is also allocated for local + execution of the command using the fabrics controller. + +2) If the command is a write command (DMA direction from host to device), data + is transferred from the host to the local memory buffer of the command. This + is handled in a loop to process all fragments of the command buffer as well + as simultaneously handle PCI address mapping constraints of the PCI endpoint + controller. + +3) The command is then executed using the NVMe driver fabrics code. This blocks + the command work until the command execution completes. 
+ +4) When the command completes, the command work schedules handling of the + command response using the completion queue work. + +Completion Queue Work +--------------------- + +This per-completion queue work is used to aggregate handling of responses to +completed commands in batches to avoid having to issue an IRQ for every +completed command. This work is scheduled every time a command completes and +does: + +1) Post a command completion entry for all completed commands. + +2) Update the completion queue doorbell. + +3) Raise an IRQ to signal the host that commands have completed. + +Configuration +============= + +The NVMe endpoint function driver can be fully controlled using configfs, once +an NVMe fabrics target has also been set up. The available configfs parameters +are: + + ctrl_opts + + Fabrics controller connection arguments, as formatted for + the nvme cli "connect" command. + + dma_enable + + Enable (default) or disable DMA data transfers. + + mdts_kb + + Change the maximum data transfer size (default: 128 KiB). + +See Documentation/PCI/endpoint/pci-nvme-howto.rst for a more detailed +description of these parameters and how to use them to configure an NVMe +endpoint function driver. diff --git a/Documentation/PCI/endpoint/pci-nvme-howto.rst b/Documentation/PCI/endpoint/pci-nvme-howto.rst new file mode 100644 index 000000000000..e15e8453b6d5 --- /dev/null +++ b/Documentation/PCI/endpoint/pci-nvme-howto.rst @@ -0,0 +1,189 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=========================================== +PCI NVMe Endpoint Function (EPF) User Guide +=========================================== + +:Author: Damien Le Moal + +This document is a guide to help users use the pci-epf-nvme function driver to +create PCIe NVMe controllers. For a high-level description of the NVMe function +driver internals, see Documentation/PCI/endpoint/pci-nvme-function.rst. + +Hardware and Kernel Requirements +================================ + +To use the NVMe PCI endpoint driver, at least one endpoint controller device +is required. + +To find the list of endpoint controller devices in the system:: + + # ls /sys/class/pci_epc/ + a40000000.pcie-ep + +If PCI_ENDPOINT_CONFIGFS is enabled:: + + # ls /sys/kernel/config/pci_ep/controllers + a40000000.pcie-ep + +Compiling the NVMe endpoint function driver requires NVMe target support to be +enabled (CONFIG_NVME_TARGET). It is also recommended to enable +CONFIG_NVME_TARGET_LOOP to allow using loop targets (to use files or block +devices as storage for the NVMe target device). If the board used also supports +Ethernet, CONFIG_NVME_TCP can be set to enable the use of remote TCP NVMe +targets. + +To facilitate testing, enabling the null-blk driver (CONFIG_BLK_DEV_NULL_BLK) +is also recommended. With this, a simple setup using a null_blk block device +with an NVMe loop target can be used. + + +NVMe Endpoint Device +==================== + +Creating an NVMe endpoint device is a two-step process. First, an NVMe target +device must be defined. Second, the NVMe endpoint device must be set up using +the defined NVMe target device. + +Creating an NVMe Target Device +------------------------------ + +Details about how to configure an NVMe target are outside the scope of this +document. The following only provides a simple example of a loop target setup +using a null_blk device for storage. 
+
+First, make sure that configfs is enabled::
+
+  # mount -t configfs none /sys/kernel/config
+
+Next, create a null_blk device (default settings give a 250 GB device without
+memory backing). The block device created will be /dev/nullb0 by default::
+
+  # modprobe null_blk
+  # ls /dev/nullb0
+  /dev/nullb0
+
+The NVMe loop target driver must be loaded::
+
+  # modprobe nvme_loop
+  # lsmod | grep nvme
+  nvme_loop              16384  0
+  nvmet                 106496  1 nvme_loop
+  nvme_fabrics           28672  1 nvme_loop
+  nvme_core             131072  3 nvme_loop,nvmet,nvme_fabrics
+
+Now, create the NVMe loop target, starting with the NVMe subsystem, specifying
+a maximum of 4 queue pairs::
+
+  # cd /sys/kernel/config/nvmet/subsystems
+  # mkdir pci_epf_nvme.0.nqn
+  # echo -n "Linux-pci-epf" > pci_epf_nvme.0.nqn/attr_model
+  # echo 4 > pci_epf_nvme.0.nqn/attr_qid_max
+  # echo 1 > pci_epf_nvme.0.nqn/attr_allow_any_host
+
+Next, create the target namespace using the null_blk block device::
+
+  # mkdir pci_epf_nvme.0.nqn/namespaces/1
+  # echo -n "/dev/nullb0" > pci_epf_nvme.0.nqn/namespaces/1/device_path
+  # echo 1 > pci_epf_nvme.0.nqn/namespaces/1/enable
+
+Finally, create the target port and link it to the subsystem::
+
+  # cd /sys/kernel/config/nvmet/ports
+  # mkdir 1
+  # echo -n "loop" > 1/addr_trtype
+  # ln -s /sys/kernel/config/nvmet/subsystems/pci_epf_nvme.0.nqn \
+      1/subsystems/pci_epf_nvme.0.nqn
+
+
+Creating an NVMe Endpoint Device
+--------------------------------
+
+With the NVMe target ready for use, the NVMe PCI endpoint device can now be
+created and enabled. The first step is to load the NVMe function driver::
+
+  # modprobe pci_epf_nvme
+  # ls /sys/kernel/config/pci_ep/functions
+  pci_epf_nvme
+
+Next, create function 0::
+
+  # cd /sys/kernel/config/pci_ep/functions/pci_epf_nvme
+  # mkdir pci_epf_nvme.0
+  # ls pci_epf_nvme.0/
+  baseclass_code    msix_interrupts   secondary
+  cache_line_size   nvme              subclass_code
+  deviceid          primary           subsys_id
+  interrupt_pin     progif_code       subsys_vendor_id
+  msi_interrupts    revid             vendorid
+
+Configure the function using any vendor ID and device ID::
+
+  # cd /sys/kernel/config/pci_ep/functions/pci_epf_nvme/pci_epf_nvme.0
+  # echo 0x15b7 > vendorid
+  # echo 0x5fff > deviceid
+  # echo 32 > msix_interrupts
+  # echo -n "transport=loop,nqn=pci_epf_nvme.0.nqn,nr_io_queues=4" > \
+      ctrl_opts
+
+The ctrl_opts attribute must be set using the same arguments as used for a
+normal NVMe target connection with the "nvme connect" command. For the example
+above, the equivalent target connection command is::
+
+  # nvme connect --transport=loop --nqn=pci_epf_nvme.0.nqn --nr-io-queues=4
+
+The endpoint function can then be bound to the endpoint controller and the
+controller started::
+
+  # cd /sys/kernel/config/pci_ep
+  # ln -s functions/pci_epf_nvme/pci_epf_nvme.0 controllers/a40000000.pcie-ep/
+  # echo 1 > controllers/a40000000.pcie-ep/start
+
+Kernel messages will show information as the NVMe target device and endpoint
+device are created and connected.
+
+.. code-block:: text
+
+  pci_epf_nvme: Registered nvme EPF driver
+  nvmet: adding nsid 1 to subsystem pci_epf_nvme.0.nqn
+  pci_epf_nvme pci_epf_nvme.0: DMA RX channel dma3chan2, maximum segment size 4294967295 B
+  pci_epf_nvme pci_epf_nvme.0: DMA TX channel dma3chan0, maximum segment size 4294967295 B
+  pci_epf_nvme pci_epf_nvme.0: DMA supported
+  nvmet: creating nvm controller 1 for subsystem pci_epf_nvme.0.nqn for NQN nqn.2014-08.org.nvmexpress:uuid:0aa34ec6-11c0-4b02-ac9b-e07dff4b5c84.
+  nvme nvme0: creating 4 I/O queues.
+  nvme nvme0: new ctrl: "pci_epf_nvme.0.nqn"
+  pci_epf_nvme pci_epf_nvme.0: NVMe fabrics controller created, 4 I/O queues
+  pci_epf_nvme pci_epf_nvme.0: NVMe PCI controller supports MSI-X, 32 vectors
+  pci_epf_nvme pci_epf_nvme.0: NVMe PCI controller: 4 I/O queues
+
+
+PCI Root Complex Host
+=====================
+
+After booting the host, the NVMe endpoint device will be discoverable as a PCI
+device::
+
+  # lspci -n
+  0000:01:00.0 0108: 15b7:5fff
+
+And this device will be recognized as an NVMe device with a single namespace::
+
+  # lsblk
+  NAME        MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS
+  nvme0n1     259:0    0  250G  0 disk
+
+The NVMe endpoint block device can then be used as any other regular NVMe
+device. The nvme command line utility can be used to get more detailed
+information about the endpoint device::
+
+  # nvme id-ctrl /dev/nvme0
+  NVME Identify Controller:
+  vid       : 0x15b7
+  ssvid     : 0x15b7
+  sn        : 0ec249554579a1d08fb5
+  mn        : Linux-pci-epf
+  fr        : 6.12.0-r
+  rab       : 6
+  ieee      : 000000
+  cmic      : 0
+  mdts      : 5
+  ...
diff --git a/MAINTAINERS b/MAINTAINERS
index 301e0a1b56e8..48431d2aa751 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -16563,6 +16563,8 @@ M:	Damien Le Moal
 L:	linux-pci@vger.kernel.org
 L:	linux-nvme@lists.infradead.org
 S:	Supported
+F:	Documentation/PCI/endpoint/function/binding/pci-nvme.rst
+F:	Documentation/PCI/endpoint/pci-nvme-*.rst
 F:	drivers/pci/endpoint/functions/pci-epf-nvme.c

NVM EXPRESS DRIVER