From patchwork Fri Jun 21 08:58:14 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: =?utf-8?q?Marc_Poulhi=C3=A8s?= <poulhies@adacore.com>
X-Patchwork-Id: 1950647
Return-Path: <gcc-patches-bounces+incoming=patchwork.ozlabs.org@gcc.gnu.org>
X-Original-To: incoming@patchwork.ozlabs.org
Delivered-To: patchwork-incoming@legolas.ozlabs.org
Authentication-Results: legolas.ozlabs.org;
	dkim=pass (2048-bit key;
 secure) header.d=adacore.com header.i=@adacore.com header.a=rsa-sha256
 header.s=google header.b=UFT9w1h3;
	dkim-atps=neutral
Authentication-Results: legolas.ozlabs.org;
 spf=pass (sender SPF authorized) smtp.mailfrom=gcc.gnu.org
 (client-ip=2620:52:3:1:0:246e:9693:128c; helo=server2.sourceware.org;
 envelope-from=gcc-patches-bounces+incoming=patchwork.ozlabs.org@gcc.gnu.org;
 receiver=patchwork.ozlabs.org)
Received: from server2.sourceware.org (server2.sourceware.org
 [IPv6:2620:52:3:1:0:246e:9693:128c])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
	 key-exchange X25519 server-signature ECDSA (secp384r1) server-digest SHA384)
	(No client certificate requested)
	by legolas.ozlabs.org (Postfix) with ESMTPS id 4W5BFs0y2bz1ydW
	for <incoming@patchwork.ozlabs.org>; Fri, 21 Jun 2024 19:02:41 +1000 (AEST)
Received: from server2.sourceware.org (localhost [IPv6:::1])
	by sourceware.org (Postfix) with ESMTP id 66CE13899426
	for <incoming@patchwork.ozlabs.org>; Fri, 21 Jun 2024 09:02:39 +0000 (GMT)
X-Original-To: gcc-patches@gcc.gnu.org
Delivered-To: gcc-patches@gcc.gnu.org
Received: from mail-wm1-x335.google.com (mail-wm1-x335.google.com
 [IPv6:2a00:1450:4864:20::335])
 by sourceware.org (Postfix) with ESMTPS id 887D03896C15
 for <gcc-patches@gcc.gnu.org>; Fri, 21 Jun 2024 08:58:45 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org 887D03896C15
Authentication-Results: sourceware.org; dmarc=pass (p=quarantine dis=none)
 header.from=adacore.com
Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=adacore.com
ARC-Filter: OpenARC Filter v1.0.0 sourceware.org 887D03896C15
Authentication-Results: server2.sourceware.org;
 arc=none smtp.remote-ip=2a00:1450:4864:20::335
ARC-Seal: i=1; a=rsa-sha256; d=sourceware.org; s=key; t=1718960329; cv=none;
 b=tYdfHL9O6gBV6rjgpLHsXp/WHMV3Oyqx/6/o7KvOf0VN8Pe0tUgpvE9mNLPhoRtxGUZ4JiDj9hv8ODtZl4jT88vQw7LtHG2A6+gqgZIgJgS+diFdpcvBiP0qQqSQc+V/Nhl6MFdyEJTuUOh4qqyJ0ueuKM/AswLVXv9I50whh5I=
ARC-Message-Signature: i=1; a=rsa-sha256; d=sourceware.org; s=key;
 t=1718960329; c=relaxed/simple;
 bh=8NgXWi3SWLESARQ6BRBX9xcuCBs+ICWz5PKVzbr8FZQ=;
 h=DKIM-Signature:From:To:Subject:Date:Message-ID:MIME-Version;
 b=YliX4h/AinmzbNXvZiptaZI5KTcwcC0CDiPyXdSfTT02EIcW2JR1vOodzr05sve6CfjYhcpyqNK5VTQKkbF79Cow/LWKzZnqaI3E+BxMk2BpGFDLd4UZoEQqC4l3/6pxhHHhz0ATsvVLljx4NM4QUBe3UjVQhq6vu1ijEUZZ3Rc=
ARC-Authentication-Results: i=1; server2.sourceware.org
Received: by mail-wm1-x335.google.com with SMTP id
 5b1f17b1804b1-424720e73e1so15703885e9.1
 for <gcc-patches@gcc.gnu.org>; Fri, 21 Jun 2024 01:58:45 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=adacore.com; s=google; t=1718960324; x=1719565124; darn=gcc.gnu.org;
 h=content-transfer-encoding:mime-version:references:in-reply-to
 :message-id:date:subject:cc:to:from:from:to:cc:subject:date
 :message-id:reply-to;
 bh=XAUslvpKlPGaKM0BOZd05zlQSCW3D+h/zuR91ePyiTY=;
 b=UFT9w1h3vzY7OEJKmxP3SxpEucA1Nvn9d56AQ9eSCOSyu3xuNG8PSxKuXekHmmQwzM
 MaiLfcP0BO5+gZGR45ekJr7PHvtqcSonrl7AVRvP5QRGiSsv7w3HCPexwlhO3JFuxH8/
 erR3QR4So+wMxUxDtE5Y+FWnmoPd3o6UDW7BjgpPtsd8ztQyUD3Vb1brgQr6xmNPHxLi
 dhW5Z16osFVF5rah8BtDcQTUk63xChsHj9VXtB2eVQ4CgLFfdKRJ4XaxnKCyIuH6Kn7I
 OCVmtgVvV6t2PfhWbl7vWp2ZGw/QMvc86o2GjaNpahqCrjvHRZHYkoZgyWcHjyJ+rtHs
 +WsA==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20230601; t=1718960324; x=1719565124;
 h=content-transfer-encoding:mime-version:references:in-reply-to
 :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc
 :subject:date:message-id:reply-to;
 bh=XAUslvpKlPGaKM0BOZd05zlQSCW3D+h/zuR91ePyiTY=;
 b=AsCcIDTUymVZP5BXFpYXBhxlfUECrIXDzlTvZmIJowQzPWbSSUDhdY4/r8vJZwtMUl
 t/CKPhnq+XRj0YiaQZqF57wAXz+jMq4nwk/RIk/LPOnTUqgigW99fGQVMFYwC17O26Kb
 ZOS6/K4oZLKJivVOVEe6DDt2fBBLpgkWom+F+L5kjL1xklwE8SeAecaeI72i8I6Qfz2P
 HYeqkkjbp7Hm1WVZkl/BzSH2FSMRRaob7rt2uX11urGN0d1YGt7cRH0K/KQsIRRxtJVK
 l6KNKlFa8L334YZvHptXeD1hivZoRr8LBGpnAm3LtvKvp+yI9KVRlzezVSy+nE1Qc5l7
 XoHQ==
X-Gm-Message-State: AOJu0YzM7aV5C62ebcy3rXcQV36/sWvbPV9tK2nrHvzkA5iyt8RN5FjM
 2lHAdaOz6z1Jc/rcItV61b04TtnSVFUUD0M4xkmcarJ1oAxSmEhENdWsPkmiD7v8LiaQG+g/ofM
 =
X-Google-Smtp-Source: 
 AGHT+IHdGiko7EWG8k3U6gKQrEpBfKf9H/efRS19cHc8T/iLFxLqcZ6SDDK1bmEp2Vj9s/I61V5ZFQ==
X-Received: by 2002:a05:600c:310a:b0:41a:b30e:42a3 with SMTP id
 5b1f17b1804b1-4247529bd84mr54345515e9.37.1718960324309;
 Fri, 21 Jun 2024 01:58:44 -0700 (PDT)
Received: from poulhies-Precision-5550.lan
 ([2001:861:3382:1a90:a589:2704:bfe1:5d92])
 by smtp.gmail.com with ESMTPSA id
 5b1f17b1804b1-4247d0c5485sm55322375e9.21.2024.06.21.01.58.43
 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256);
 Fri, 21 Jun 2024 01:58:43 -0700 (PDT)
From: =?utf-8?q?Marc_Poulhi=C3=A8s?= <poulhies@adacore.com>
To: gcc-patches@gcc.gnu.org
Cc: Eric Botcazou <ebotcazou@adacore.com>
Subject: [COMMITTED 18/22] ada: Implement fast modulo reduction for nonbinary
 modular multiplication
Date: Fri, 21 Jun 2024 10:58:14 +0200
Message-ID: <20240621085819.2485987-18-poulhies@adacore.com>
X-Mailer: git-send-email 2.45.1
In-Reply-To: <20240621085819.2485987-1-poulhies@adacore.com>
References: <20240621085819.2485987-1-poulhies@adacore.com>
MIME-Version: 1.0
X-Spam-Status: No, score=-13.7 required=5.0 tests=BAYES_00, DKIM_SIGNED,
 DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, GIT_PATCH_0, RCVD_IN_DNSWL_NONE,
 SPF_HELO_NONE, SPF_PASS, TXREP autolearn=ham autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on
 server2.sourceware.org
X-BeenThere: gcc-patches@gcc.gnu.org
X-Mailman-Version: 2.1.30
Precedence: list
List-Id: Gcc-patches mailing list <gcc-patches.gcc.gnu.org>
List-Unsubscribe: <https://gcc.gnu.org/mailman/options/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=unsubscribe>
List-Archive: <https://gcc.gnu.org/pipermail/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-request@gcc.gnu.org?subject=help>
List-Subscribe: <https://gcc.gnu.org/mailman/listinfo/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=subscribe>
Errors-To: gcc-patches-bounces+incoming=patchwork.ozlabs.org@gcc.gnu.org

From: Eric Botcazou <ebotcazou@adacore.com>

This implements modulo reduction for nonbinary modular multiplication with
small moduli by means of the standard division-free algorithm also used in
the optimizer, but with fewer constraints and therefore better results.

For the sake of consistency, it is also used for the 'Mod attribute of the
same modular types and, more generally, for the Mod (and Rem) operators of
unsigned types if the second operand is static and not a power of two.

gcc/ada/

	* gcc-interface/gigi.h (fast_modulo_reduction): Declare.
	* gcc-interface/trans.cc (gnat_to_gnu) <N_Op_Mod>: In the unsigned
	case, call fast_modulo_reduction for {FLOOR,TRUNC}_MOD_EXPR if the
	RHS is a constant and not a power of two, and the precision is not
	larger than the word size.
	* gcc-interface/utils2.cc: Include expmed.h.
	(fast_modulo_reduction): New function.
	(nonbinary_modular_operation): Call fast_modulo_reduction for the
	multiplication if the precision is not larger than the word size.

Tested on x86_64-pc-linux-gnu, committed on master.
---
 gcc/ada/gcc-interface/gigi.h    |   5 ++
 gcc/ada/gcc-interface/trans.cc  |  17 ++++++
 gcc/ada/gcc-interface/utils2.cc | 102 +++++++++++++++++++++++++++++++-
 3 files changed, 121 insertions(+), 3 deletions(-)

diff --git a/gcc/ada/gcc-interface/gigi.h b/gcc/ada/gcc-interface/gigi.h
index 6ed74d6879e..40f3f0d3d13 100644
--- a/gcc/ada/gcc-interface/gigi.h
+++ b/gcc/ada/gcc-interface/gigi.h
@@ -1040,6 +1040,11 @@ extern bool simple_constant_p (Entity_Id gnat_entity);
 /* Return the size of TYPE, which must be a positive power of 2.  */
 extern unsigned int resolve_atomic_size (tree type);
 
+/* Try to compute the reduction of OP modulo MODULUS in PRECISION bits with a
+   division-free algorithm.  Return NULL_TREE if this is not easily doable.  */
+extern tree fast_modulo_reduction (tree op, tree modulus,
+				   unsigned int precision);
+
 #ifdef __cplusplus
 extern "C" {
 #endif
diff --git a/gcc/ada/gcc-interface/trans.cc b/gcc/ada/gcc-interface/trans.cc
index e68fb3fd776..7c5282602b2 100644
--- a/gcc/ada/gcc-interface/trans.cc
+++ b/gcc/ada/gcc-interface/trans.cc
@@ -7317,6 +7317,23 @@ gnat_to_gnu (Node_Id gnat_node)
 	  gnu_result
 	    = build_binary_op_trapv (code, gnu_type, gnu_lhs, gnu_rhs,
 				     gnat_node);
+
+	  /* For an unsigned modulo operation with nonbinary constant modulus,
+	     we first try to do a reduction by means of a (multiplier, shifter)
+	     pair in the needed precision up to the word size.  But not when
+	     optimizing for size, because it will be longer than a div+mul+sub
+	     sequence.  */
+        else if (!optimize_size
+		 && (code == FLOOR_MOD_EXPR || code == TRUNC_MOD_EXPR)
+		 && TYPE_UNSIGNED (gnu_type)
+		 && TYPE_PRECISION (gnu_type) <= BITS_PER_WORD
+		 && TREE_CODE (gnu_rhs) == INTEGER_CST
+		 && !integer_pow2p (gnu_rhs)
+		 && (gnu_expr
+		     = fast_modulo_reduction (gnu_lhs, gnu_rhs,
+					      TYPE_PRECISION (gnu_type))))
+	  gnu_result = gnu_expr;
+
 	else
 	  {
 	    /* Some operations, e.g. comparisons of arrays, generate complex
diff --git a/gcc/ada/gcc-interface/utils2.cc b/gcc/ada/gcc-interface/utils2.cc
index 70271cf2836..a37eccc4cfb 100644
--- a/gcc/ada/gcc-interface/utils2.cc
+++ b/gcc/ada/gcc-interface/utils2.cc
@@ -33,6 +33,7 @@
 #include "tree.h"
 #include "inchash.h"
 #include "builtins.h"
+#include "expmed.h"
 #include "fold-const.h"
 #include "stor-layout.h"
 #include "stringpool.h"
@@ -534,6 +535,91 @@ compare_fat_pointers (location_t loc, tree result_type, tree p1, tree p2)
 					   p1_array_is_null, same_bounds));
 }
 
+/* Try to compute the reduction of OP modulo MODULUS in PRECISION bits with a
+   division-free algorithm.  Return NULL_TREE if this is not easily doable.  */
+
+tree
+fast_modulo_reduction (tree op, tree modulus, unsigned int precision)
+{
+  const tree type = TREE_TYPE (op);
+  const unsigned int type_precision = TYPE_PRECISION (type);
+
+  /* The implementation is host-dependent for the time being.  */
+  if (type_precision <= HOST_BITS_PER_WIDE_INT)
+    {
+      const unsigned HOST_WIDE_INT d = tree_to_uhwi (modulus);
+      unsigned HOST_WIDE_INT ml, mh;
+      int pre_shift, post_shift;
+      tree t;
+
+      /* The trick is to replace the division by d with a multiply-and-shift
+	 sequence parameterized by a (multiplier, shifter) pair computed from
+	 d, the precision of the type and the needed precision:
+
+	   op / d = (op * multiplier) >> shifter
+
+         But choose_multiplier provides a slightly different interface:
+
+           op / d = (op h* multiplier) >> reduced_shifter
+
+         that makes things easier by using a high-part multiplication.  */
+      mh = choose_multiplier (d, type_precision, precision, &ml, &post_shift);
+
+      /* If the suggested multiplier is more than TYPE_PRECISION bits, we can
+	 do better for even divisors, using an initial right shift.  */
+      if (mh != 0 && (d & 1) == 0)
+	{
+	  pre_shift = ctz_or_zero (d);
+	  mh = choose_multiplier (d >> pre_shift, type_precision,
+				  precision - pre_shift, &ml, &post_shift);
+	}
+      else
+	pre_shift = 0;
+
+      /* If the suggested multiplier is still more than TYPE_PRECISION bits,
+	 try again with a larger type up to the word size.  */
+      if (mh != 0)
+	{
+	  if (type_precision < BITS_PER_WORD)
+	    {
+	      const scalar_int_mode m
+		= smallest_int_mode_for_size (type_precision + 1);
+	      tree new_type = gnat_type_for_mode (m, 1);
+	      op = fold_convert (new_type, op);
+	      modulus = fold_convert (new_type, modulus);
+	      t = fast_modulo_reduction (op, modulus, precision);
+	      if (t)
+		return fold_convert (type, t);
+	    }
+
+	  return NULL_TREE;
+	}
+
+      /* This computes op - (op / modulus) * modulus with PRECISION bits.  */
+      op = gnat_protect_expr (op);
+
+      /* t = op >> pre_shift
+	 t = t h* ml
+	 t = t >> post_shift
+	 t = t * modulus  */
+      if (pre_shift)
+	t = fold_build2 (RSHIFT_EXPR, type, op,
+			 build_int_cst (type, pre_shift));
+      else
+	t = op;
+      t = fold_build2 (MULT_HIGHPART_EXPR, type, t, build_int_cst (type, ml));
+      if (post_shift)
+	t = fold_build2 (RSHIFT_EXPR, type, t,
+			 build_int_cst (type, post_shift));
+      t = fold_build2 (MULT_EXPR, type, t, modulus);
+
+      return fold_build2 (MINUS_EXPR, type, op, t);
+    }
+
+  else
+    return NULL_TREE;
+}
+
 /* Compute the result of applying OP_CODE to LHS and RHS, where both are of
    TYPE.  We know that TYPE is a modular type with a nonbinary modulus.  */
 
@@ -543,7 +629,7 @@ nonbinary_modular_operation (enum tree_code op_code, tree type, tree lhs,
 {
   tree modulus = TYPE_MODULUS (type);
   unsigned precision = tree_floor_log2 (modulus) + 1;
-  tree op_type, result;
+  tree op_type, result, fmr;
 
   /* For the logical operations, we only need PRECISION bits.  For addition and
      subtraction, we need one more, and for multiplication twice as many.  */
@@ -576,9 +662,19 @@ nonbinary_modular_operation (enum tree_code op_code, tree type, tree lhs,
   if (op_code == MINUS_EXPR)
     result = fold_build2 (PLUS_EXPR, op_type, result, modulus);
 
-  /* For a multiplication, we have no choice but to use a modulo operation.  */
+  /* For a multiplication, we first try to do a modulo reduction by means of a
+     (multiplier, shifter) pair in the needed precision up to the word size, or
+     else we fall back to a standard modulo operation.  But not when optimizing
+     for size, because it will be longer than a div+mul+sub sequence.  */
   if (op_code == MULT_EXPR)
-    result = fold_build2 (TRUNC_MOD_EXPR, op_type, result, modulus);
+    {
+      if (!optimize_size
+	  && precision <= BITS_PER_WORD
+	  && (fmr = fast_modulo_reduction (result, modulus, precision)))
+	result = fmr;
+      else
+	result = fold_build2 (TRUNC_MOD_EXPR, op_type, result, modulus);
+    }
 
   /* For the other operations, subtract the modulus if we are >= it.  */
   else