[ARM] Further improve stack usage on sha512 (PR 77308)

Message ID	AM4PR0701MB2162C11BF479CD62542E2E8AE48A0@AM4PR0701MB2162.eurprd07.prod.outlook.com
State	New
Headers	show Return-Path: <gcc-patches-return-442810-incoming=patchwork.ozlabs.org@gcc.gnu.org> DomainKey-Signature: a=rsa-sha1; c=nofws; d=gcc.gnu.org; h=list-id :list-unsubscribe:list-archive:list-post:list-help:sender:from :to:cc:subject:date:message-id:references:in-reply-to :content-type:mime-version; q=dns; s=default; b=yl8m7MvRrzmGcUW4 qLVbWTwqSmPL4Pk6jNMQxhUVdg+9vlTpY+uP+TXkQXzAE3Nw7iv5J25wukBDK35D qGN/H7N2luIrySkrK/Pwsr6RaCggkv8+jsoWl++kdmuEXExxKUXvYy28ifAFwFAq I7oN+HlQc0LHLm+oyuyDxdzUSd0= Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm Precedence: bulk Sender: gcc-patches-owner@gcc.gnu.org From: Bernd Edlinger <bernd.edlinger@hotmail.de> To: Ramana Radhakrishnan <ramana.gcc@googlemail.com> CC: GCC Patches <gcc-patches@gcc.gnu.org>, Kyrill Tkachov <kyrylo.tkachov@foss.arm.com>, Richard Earnshaw <richard.earnshaw@arm.com>, Wilco Dijkstra <wilco.dijkstra@arm.com> Subject: Re: [PATCH, ARM] Further improve stack usage on sha512 (PR 77308) Date: Mon, 28 Nov 2016 19:42:17 +0000 Message-ID: <AM4PR0701MB2162C11BF479CD62542E2E8AE48A0@AM4PR0701MB2162.eurprd07.prod.outlook.com> References: <HE1PR0701MB2169CD4BF5110F84B68E4AF9E4A40@HE1PR0701MB2169.eurprd07.prod.outlook.com> <CAJA7tRZGYttnYYCsbqFuc88jt8DySFvLY9J+1+88sfofY8Gweg@mail.gmail.com> In-Reply-To: <CAJA7tRZGYttnYYCsbqFuc88jt8DySFvLY9J+1+88sfofY8Gweg@mail.gmail.com> spamdiagnosticoutput: 1:99 spamdiagnosticmetadata: NSPM Content-Type: multipart/mixed; boundary="_002_AM4PR0701MB2162C11BF479CD62542E2E8AE48A0AM4PR0701MB2162_" MIME-Version: 1.0

Message ID

AM4PR0701MB2162C11BF479CD62542E2E8AE48A0@AM4PR0701MB2162.eurprd07.prod.outlook.com

State

New

Headers

DomainKey-Signature: a=rsa-sha1; c=nofws; d=gcc.gnu.org; h=list-id
	:list-unsubscribe:list-archive:list-post:list-help:sender:from
	:to:cc:subject:date:message-id:references:in-reply-to
	:content-type:mime-version; q=dns; s=default; b=yl8m7MvRrzmGcUW4
	qLVbWTwqSmPL4Pk6jNMQxhUVdg+9vlTpY+uP+TXkQXzAE3Nw7iv5J25wukBDK35D
	qGN/H7N2luIrySkrK/Pwsr6RaCggkv8+jsoWl++kdmuEXExxKUXvYy28ifAFwFAq
	I7oN+HlQc0LHLm+oyuyDxdzUSd0=
Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
Sender: gcc-patches-owner@gcc.gnu.org
From: Bernd Edlinger <bernd.edlinger@hotmail.de>
To: Ramana Radhakrishnan <ramana.gcc@googlemail.com>
CC: GCC Patches <gcc-patches@gcc.gnu.org>,
	Kyrill Tkachov	<kyrylo.tkachov@foss.arm.com>,
	Richard Earnshaw <richard.earnshaw@arm.com>,
	Wilco Dijkstra <wilco.dijkstra@arm.com>
Subject: Re: [PATCH, ARM] Further improve stack usage on sha512 (PR 77308)
Date: Mon, 28 Nov 2016 19:42:17 +0000
Message-ID: <AM4PR0701MB2162C11BF479CD62542E2E8AE48A0@AM4PR0701MB2162.eurprd07.prod.outlook.com>
References: <HE1PR0701MB2169CD4BF5110F84B68E4AF9E4A40@HE1PR0701MB2169.eurprd07.prod.outlook.com>
	<CAJA7tRZGYttnYYCsbqFuc88jt8DySFvLY9J+1+88sfofY8Gweg@mail.gmail.com>
In-Reply-To: <CAJA7tRZGYttnYYCsbqFuc88jt8DySFvLY9J+1+88sfofY8Gweg@mail.gmail.com>
x-ms-exchange-messagesentrepresentingtype: 1
x-incomingheadercount: 37
x-eopattributedmessage: 0
x-microsoft-exchange-diagnostics: 1; VE1EUR01HT135;
	7:IRalJGThLY7oq43FKtnKIrHkgew9qUu4NAvcB9q/O4CpiWSAl6o+HE1TFjXNU+hJ6/qCbBIE0r14Rev9LMFBVWmw4bfw7V1Io8hXfQNpPuv4DVCNeLbWEIe6zL1UsX2+OSezucFUQO4/ajtic0gvBFWeuc0esGTdvPfj2umR52Eubow+pvUaNigIiyPiR9zk4mDKQESMkp+a4oOPaAGmrexDwWw3MZAdTUaq/xnihgb6N8PR/NgVzdVHKnMho0zg7MBEvUZUlPVaazAPC70VVSbKAQ8XxYoEpunlLhq4Im2ldzGpfqHWUeQsw5aY+AktWbGLQah1EHHUm26YveL/RpN6k8UIzRdkSh80luM4D8s=
x-forefront-antispam-report: EFV:NLI; SFV:NSPM; SFS:(10019020)(98900003);
	DIR:OUT; SFP:1102; SCL:1; SRVR:VE1EUR01HT135;
	H:AM4PR0701MB2162.eurprd07.prod.outlook.com; FPR:; SPF:None;
	LANG:en; 
x-ms-office365-filtering-correlation-id: 99fac248-c11c-4d7a-3b1e-08d417c6a8a0
x-microsoft-antispam: UriScan:; BCL:0; PCL:0;
	RULEID:(22001)(1601124038)(1603103113)(1601125047);
	SRVR:VE1EUR01HT135; 
x-exchange-antispam-report-cfa-test: BCL:0; PCL:0;
	RULEID:(432015012)(102415395)(82015046); SRVR:VE1EUR01HT135;
	BCL:0; PCL:0; RULEID:; SRVR:VE1EUR01HT135; 
x-forefront-prvs: 01401330D1
spamdiagnosticoutput: 1:99
spamdiagnosticmetadata: NSPM
Content-Type: multipart/mixed;
	boundary="_002_AM4PR0701MB2162C11BF479CD62542E2E8AE48A0AM4PR0701MB2162_"
MIME-Version: 1.0
X-OriginatorOrg: outlook.com
X-MS-Exchange-CrossTenant-originalarrivaltime: 28 Nov 2016 19:42:17.9150
	(UTC)
X-MS-Exchange-CrossTenant-fromentityheader: Internet
X-MS-Exchange-CrossTenant-id: 84df9e7f-e9f6-40af-b435-aaaaaaaaaaaa
X-MS-Exchange-Transport-CrossTenantHeadersStamped: VE1EUR01HT135

Commit Message

Bernd Edlinger Nov. 28, 2016, 7:42 p.m. UTC

On 11/25/16 12:30, Ramana Radhakrishnan wrote:
> On Sun, Nov 6, 2016 at 2:18 PM, Bernd Edlinger

> <bernd.edlinger@hotmail.de> wrote:

>> Hi!

>>

>> This improves the stack usage on the sha512 test case for the case

>> without hardware fpu and without iwmmxt by splitting all di-mode

>> patterns right while expanding which is similar to what the shift-pattern

>> does.  It does nothing in the case iwmmxt and fpu=neon or vfp as well as

>> thumb1.

>>

>

> I would go further and do this in the absence of Neon, the VFP unit

> being there doesn't help with DImode operations i.e. we do not have 64

> bit integer arithmetic instructions without Neon. The main reason why

> we have the DImode patterns split so late is to give a chance for

> folks who want to do 64 bit arithmetic in Neon a chance to make this

> work as well as support some of the 64 bit Neon intrinsics which IIRC

> map down to these instructions. Doing this just for soft-float doesn't

> improve the default case only. I don't usually test iwmmxt and I'm not

> sure who has the ability to do so, thus keeping this restriction for

> iwMMX is fine.

>

>


Yes I understand, thanks for pointing that out.

I was not aware what iwmmxt exists at all, but I noticed that most
64bit expansions work completely different, and would break if we split
the pattern early.

I can however only look at the assembler outout for iwmmxt, and make
sure that the stack usage does not get worse.

Thus the new version of the patch keeps only thumb1, neon and iwmmxt as
it is: around 1570 (thumb1), 2300 (neon) and 2200 (wimmxt) bytes stack
for the test cases, and vfp and soft-float at around 270 bytes stack
usage.

>> It reduces the stack usage from 2300 to near optimal 272 bytes (!).

>>

>> Note this also splits many ldrd/strd instructions and therefore I will

>> post a followup-patch that mitigates this effect by enabling the ldrd/strd

>> peephole optimization after the necessary reg-testing.

>>

>>

>> Bootstrapped and reg-tested on arm-linux-gnueabihf.

>

> What do you mean by arm-linux-gnueabihf - when folks say that I

> interpret it as --with-arch=armv7-a --with-float=hard

> --with-fpu=vfpv3-d16 or (--with-fpu=neon).

>

> If you've really bootstrapped and regtested it on armhf, doesn't this

> patch as it stand have no effect there i.e. no change ?

> arm-linux-gnueabihf usually means to me someone has configured with

> --with-float=hard, so there are no regressions in the hard float ABI

> case,

>


I know it proves little.  When I say arm-linux-gnueabihf
I do in fact mean --enable-languages=all,ada,go,obj-c++
--with-arch=armv7-a --with-tune=cortex-a9 --with-fpu=vfpv3-d16
--with-float=hard.

My main interest in the stack usage is of course not because of linux,
but because of eCos where we have very small task stacks and in fact
no fpu support by the O/S at all, so that patch is exactly what we need.


Bootstrapped and reg-tested on arm-linux-gnueabihf
Is it OK for trunk?


Thanks
Bernd.

Comments

Wilco Dijkstra Nov. 30, 2016, 12:01 p.m. UTC | #1

Bernd Edlinger wrote:
> On 11/29/16 16:06, Wilco Dijkstra wrote:
> > Bernd Edlinger wrote:
> >
> > -  "TARGET_32BIT && reload_completed
> > +  "TARGET_32BIT && ((!TARGET_NEON && !TARGET_IWMMXT) || reload_completed)
> >     && ! (TARGET_NEON && IS_VFP_REGNUM (REGNO (operands[0])))"
> >
> > This is equivalent to "&& (!TARGET_IWMMXT || reload_completed)" since we're
> > already excluding NEON.
>
> Aehm, no.  This would split the addi_neon insn before it is clear
> if the reload pass will assign a VFP register.

Hmm that's strange... This instruction shouldn't be used to also split some random
Neon pattern - for example arm_subdi3 doesn't do the same. To understand and
reason about any of these complex patterns they should all work in the same way...

> But when I make *arm_cmpdi_insn split early, it ICEs:

(insn 4870 4869 1636 87 (set (scratch:SI)
         (minus:SI (minus:SI (subreg:SI (reg:DI 2261) 4)
                 (subreg:SI (reg:DI 473 [ X$14 ]) 4))
             (ltu:SI (reg:CC_C 100 cc)
                 (const_int 0 [0])))) "pr77308-2.c":140 -1
      (nil))

That's easy, we don't have a sbcs <scratch>, r1, r2 pattern. A quick workaround is
to create a temporary for operand[2] (if before reload) so it will match the standard
sbcs pattern, and then the split works fine.

> So it is certainly possible, but not really simple to improve the
> stack size even further.  But I would prefer to do that in a
> separate patch.

Yes separate patches would be fine. However there is a lot of scope to improve this
further. For example after your patch shifts and logical operations are expanded in
expand, add/sub are in split1 after combine runs and everything else is split after
reload. It doesn't make sense to split different operations at different times - it means
you're still going to get the bad DImode subregs and miss lots of optimization
opportunities due to the mix of partly split and partly not-yet-split operations.

> BTW: there are also negd2_compare, *negdi_extendsidi,
> *negdi_zero_extendsidi, *thumb2_negdi2.

I have a patch to merge thumb2_negdi2 into arm_negdi2. For extends, if we split them
at expand time, then none of the combined alu+extend patterns will be needed, and
that will be a huge simplification.

> I think it would be a precondition to have test cases that exercise
> each of these patterns before we try to split these instructions.

Agreed.

Wilco

2016-11-25  Bernd Edlinger  <bernd.edlinger@hotmail.de>

	PR target/77308
	* config/arm/arm.md (*arm_adddi3, *arm_subdi3): Split early except for
	TARGET_NEON and TARGET_IWMMXT.
	(anddi3, iordi3, xordi3, one_cmpldi2): Split while expanding except for
	TARGET_NEON and TARGET_IWMMXT.
	(*one_cmpldi2_insn): Moved the body of one_cmpldi2 here.

testsuite:
2016-11-25  Bernd Edlinger  <bernd.edlinger@hotmail.de>

	PR target/77308
	* gcc.target/arm/pr77308-1.c: New test.

Index: gcc/config/arm/arm.md
===================================================================
--- gcc/config/arm/arm.md	(revision 242875)
+++ gcc/config/arm/arm.md	(working copy)
@@ -467,7 +467,7 @@ 
    (clobber (reg:CC CC_REGNUM))]
   "TARGET_32BIT && !TARGET_NEON"
   "#"
-  "TARGET_32BIT && reload_completed
+  "TARGET_32BIT && ((!TARGET_NEON && !TARGET_IWMMXT) || reload_completed)
    && ! (TARGET_NEON && IS_VFP_REGNUM (REGNO (operands[0])))"
   [(parallel [(set (reg:CC_C CC_REGNUM)
 		   (compare:CC_C (plus:SI (match_dup 1) (match_dup 2))
@@ -1272,7 +1272,7 @@ 
    (clobber (reg:CC CC_REGNUM))]
   "TARGET_32BIT && !TARGET_NEON"
   "#"  ; "subs\\t%Q0, %Q1, %Q2\;sbc\\t%R0, %R1, %R2"
-  "&& reload_completed"
+  "&& (!TARGET_IWMMXT || reload_completed)"
   [(parallel [(set (reg:CC CC_REGNUM)
 		   (compare:CC (match_dup 1) (match_dup 2)))
 	      (set (match_dup 0) (minus:SI (match_dup 1) (match_dup 2)))])
@@ -2258,7 +2258,24 @@ 
 	(and:DI (match_operand:DI 1 "s_register_operand" "")
 		(match_operand:DI 2 "neon_inv_logic_op2" "")))]
   "TARGET_32BIT"
-  ""
+  "
+  if (!TARGET_NEON && !TARGET_IWMMXT)
+    {
+      rtx low  = simplify_gen_binary (AND, SImode,
+				      gen_lowpart (SImode, operands[1]),
+				      gen_lowpart (SImode, operands[2]));
+      rtx high = simplify_gen_binary (AND, SImode,
+				      gen_highpart (SImode, operands[1]),
+				      gen_highpart_mode (SImode, DImode,
+							 operands[2]));
+
+      emit_insn (gen_rtx_SET (gen_lowpart (SImode, operands[0]), low));
+      emit_insn (gen_rtx_SET (gen_highpart (SImode, operands[0]), high));
+
+      DONE;
+    }
+  /* Otherwise expand pattern as above.  */
+  "
 )
 
 (define_insn_and_split "*anddi3_insn"
@@ -3131,7 +3148,24 @@ 
 	(ior:DI (match_operand:DI 1 "s_register_operand" "")
 		(match_operand:DI 2 "neon_logic_op2" "")))]
   "TARGET_32BIT"
-  ""
+  "
+  if (!TARGET_NEON && !TARGET_IWMMXT)
+    {
+      rtx low  = simplify_gen_binary (IOR, SImode,
+				      gen_lowpart (SImode, operands[1]),
+				      gen_lowpart (SImode, operands[2]));
+      rtx high = simplify_gen_binary (IOR, SImode,
+				      gen_highpart (SImode, operands[1]),
+				      gen_highpart_mode (SImode, DImode,
+							 operands[2]));
+
+      emit_insn (gen_rtx_SET (gen_lowpart (SImode, operands[0]), low));
+      emit_insn (gen_rtx_SET (gen_highpart (SImode, operands[0]), high));
+
+      DONE;
+    }
+  /* Otherwise expand pattern as above.  */
+  "
 )
 
 (define_insn_and_split "*iordi3_insn"
@@ -3312,7 +3346,24 @@ 
 	(xor:DI (match_operand:DI 1 "s_register_operand" "")
 		(match_operand:DI 2 "arm_xordi_operand" "")))]
   "TARGET_32BIT"
-  ""
+  "
+  if (!TARGET_NEON && !TARGET_IWMMXT)
+    {
+      rtx low  = simplify_gen_binary (XOR, SImode,
+				      gen_lowpart (SImode, operands[1]),
+				      gen_lowpart (SImode, operands[2]));
+      rtx high = simplify_gen_binary (XOR, SImode,
+				      gen_highpart (SImode, operands[1]),
+				      gen_highpart_mode (SImode, DImode,
+							 operands[2]));
+
+      emit_insn (gen_rtx_SET (gen_lowpart (SImode, operands[0]), low));
+      emit_insn (gen_rtx_SET (gen_highpart (SImode, operands[0]), high));
+
+      DONE;
+    }
+  /* Otherwise expand pattern as above.  */
+  "
 )
 
 (define_insn_and_split "*xordi3_insn"
@@ -5022,7 +5073,31 @@ 
   "TARGET_32BIT && TARGET_HARD_FLOAT && TARGET_VFP_DOUBLE"
   "")
 
-(define_insn_and_split "one_cmpldi2"
+(define_expand "one_cmpldi2"
+  [(set (match_operand:DI 0 "s_register_operand" "")
+	(not:DI (match_operand:DI 1 "s_register_operand" "")))]
+  "TARGET_32BIT"
+  "
+  if (!TARGET_NEON && !TARGET_IWMMXT)
+    {
+      rtx low  = simplify_gen_unary (NOT, SImode,
+				     gen_lowpart (SImode, operands[1]),
+				     SImode);
+      rtx high = simplify_gen_unary (NOT, SImode,
+				     gen_highpart_mode (SImode, DImode,
+							operands[1]),
+				     SImode);
+
+      emit_insn (gen_rtx_SET (gen_lowpart (SImode, operands[0]), low));
+      emit_insn (gen_rtx_SET (gen_highpart (SImode, operands[0]), high));
+
+      DONE;
+    }
+  /* Otherwise expand pattern as above.  */
+  "
+)
+
+(define_insn_and_split "*one_cmpldi2_insn"
   [(set (match_operand:DI 0 "s_register_operand"	 "=w,&r,&r,?w")
 	(not:DI (match_operand:DI 1 "s_register_operand" " w, 0, r, w")))]
   "TARGET_32BIT"
Index: gcc/testsuite/gcc.target/arm/pr77308-1.c
===================================================================
--- gcc/testsuite/gcc.target/arm/pr77308-1.c	(revision 0)
+++ gcc/testsuite/gcc.target/arm/pr77308-1.c	(working copy)
@@ -0,0 +1,169 @@ 
+/* { dg-do compile } */
+/* { dg-options "-Os -Wstack-usage=2500" } */
+
+/* This is a modified algorithm with bit-not "~" at the Sigma-blocks.
+   It improves the test coverage of one_cmpldi2 and subdi3 patterns.
+   Unlike the original test case these insns can reach the reload pass,
+   which may result in large stack usage.  */
+
+#define SHA_LONG64 unsigned long long
+#define U64(C)     C##ULL
+
+#define SHA_LBLOCK      16
+#define SHA512_CBLOCK   (SHA_LBLOCK*8)
+
+typedef struct SHA512state_st {
+    SHA_LONG64 h[8];
+    SHA_LONG64 Nl, Nh;
+    union {
+        SHA_LONG64 d[SHA_LBLOCK];
+        unsigned char p[SHA512_CBLOCK];
+    } u;
+    unsigned int num, md_len;
+} SHA512_CTX;
+
+static const SHA_LONG64 K512[80] = {
+    U64(0x428a2f98d728ae22), U64(0x7137449123ef65cd),
+    U64(0xb5c0fbcfec4d3b2f), U64(0xe9b5dba58189dbbc),
+    U64(0x3956c25bf348b538), U64(0x59f111f1b605d019),
+    U64(0x923f82a4af194f9b), U64(0xab1c5ed5da6d8118),
+    U64(0xd807aa98a3030242), U64(0x12835b0145706fbe),
+    U64(0x243185be4ee4b28c), U64(0x550c7dc3d5ffb4e2),
+    U64(0x72be5d74f27b896f), U64(0x80deb1fe3b1696b1),
+    U64(0x9bdc06a725c71235), U64(0xc19bf174cf692694),
+    U64(0xe49b69c19ef14ad2), U64(0xefbe4786384f25e3),
+    U64(0x0fc19dc68b8cd5b5), U64(0x240ca1cc77ac9c65),
+    U64(0x2de92c6f592b0275), U64(0x4a7484aa6ea6e483),
+    U64(0x5cb0a9dcbd41fbd4), U64(0x76f988da831153b5),
+    U64(0x983e5152ee66dfab), U64(0xa831c66d2db43210),
+    U64(0xb00327c898fb213f), U64(0xbf597fc7beef0ee4),
+    U64(0xc6e00bf33da88fc2), U64(0xd5a79147930aa725),
+    U64(0x06ca6351e003826f), U64(0x142929670a0e6e70),
+    U64(0x27b70a8546d22ffc), U64(0x2e1b21385c26c926),
+    U64(0x4d2c6dfc5ac42aed), U64(0x53380d139d95b3df),
+    U64(0x650a73548baf63de), U64(0x766a0abb3c77b2a8),
+    U64(0x81c2c92e47edaee6), U64(0x92722c851482353b),
+    U64(0xa2bfe8a14cf10364), U64(0xa81a664bbc423001),
+    U64(0xc24b8b70d0f89791), U64(0xc76c51a30654be30),
+    U64(0xd192e819d6ef5218), U64(0xd69906245565a910),
+    U64(0xf40e35855771202a), U64(0x106aa07032bbd1b8),
+    U64(0x19a4c116b8d2d0c8), U64(0x1e376c085141ab53),
+    U64(0x2748774cdf8eeb99), U64(0x34b0bcb5e19b48a8),
+    U64(0x391c0cb3c5c95a63), U64(0x4ed8aa4ae3418acb),
+    U64(0x5b9cca4f7763e373), U64(0x682e6ff3d6b2b8a3),
+    U64(0x748f82ee5defb2fc), U64(0x78a5636f43172f60),
+    U64(0x84c87814a1f0ab72), U64(0x8cc702081a6439ec),
+    U64(0x90befffa23631e28), U64(0xa4506cebde82bde9),
+    U64(0xbef9a3f7b2c67915), U64(0xc67178f2e372532b),
+    U64(0xca273eceea26619c), U64(0xd186b8c721c0c207),
+    U64(0xeada7dd6cde0eb1e), U64(0xf57d4f7fee6ed178),
+    U64(0x06f067aa72176fba), U64(0x0a637dc5a2c898a6),
+    U64(0x113f9804bef90dae), U64(0x1b710b35131c471b),
+    U64(0x28db77f523047d84), U64(0x32caab7b40c72493),
+    U64(0x3c9ebe0a15c9bebc), U64(0x431d67c49c100d4c),
+    U64(0x4cc5d4becb3e42b6), U64(0x597f299cfc657e2a),
+    U64(0x5fcb6fab3ad6faec), U64(0x6c44198c4a475817)
+};
+
+#define B(x,j)    (((SHA_LONG64)(*(((const unsigned char *)(&x))+j)))<<((7-j)*8))
+#define PULL64(x) (B(x,0)|B(x,1)|B(x,2)|B(x,3)|B(x,4)|B(x,5)|B(x,6)|B(x,7))
+#define ROTR(x,s)       (((x)>>s) | (x)<<(64-s))
+#define Sigma0(x)       ~(ROTR((x),28) ^ ROTR((x),34) ^ ROTR((x),39))
+#define Sigma1(x)       ~(ROTR((x),14) ^ ROTR((x),18) ^ ROTR((x),41))
+#define sigma0(x)       ~(ROTR((x),1)  ^ ROTR((x),8)  ^ ((x)>>7))
+#define sigma1(x)       ~(ROTR((x),19) ^ ROTR((x),61) ^ ((x)>>6))
+#define Ch(x,y,z)       (((x) & (y)) ^ ((~(x)) & (z)))
+#define Maj(x,y,z)      (((x) & (y)) ^ ((x) & (z)) ^ ((y) & (z)))
+
+#define ROUND_00_15(i,a,b,c,d,e,f,g,h)          do {    \
+        T1 += h + Sigma1(e) + Ch(e,f,g) + K512[i];      \
+        h = Sigma0(a) + Maj(a,b,c);                     \
+        d += T1;        h += T1;                } while (0)
+#define ROUND_16_80(i,j,a,b,c,d,e,f,g,h,X)      do {    \
+        s0 = X[(j+1)&0x0f];     s0 = sigma0(s0);        \
+        s1 = X[(j+14)&0x0f];    s1 = sigma1(s1);        \
+        T1 = X[(j)&0x0f] += s0 + s1 + X[(j+9)&0x0f];    \
+        ROUND_00_15(i+j,a,b,c,d,e,f,g,h);               } while (0)
+void sha512_block_data_order(SHA512_CTX *ctx, const void *in,
+                                    unsigned int num)
+{
+    const SHA_LONG64 *W = in;
+    SHA_LONG64 a, b, c, d, e, f, g, h, s0, s1, T1;
+    SHA_LONG64 X[16];
+    int i;
+
+    while (num--) {
+
+        a = ctx->h[0];
+        b = ctx->h[1];
+        c = ctx->h[2];
+        d = ctx->h[3];
+        e = ctx->h[4];
+        f = ctx->h[5];
+        g = ctx->h[6];
+        h = ctx->h[7];
+
+        T1 = X[0] = PULL64(W[0]);
+        ROUND_00_15(0, a, b, c, d, e, f, g, h);
+        T1 = X[1] = PULL64(W[1]);
+        ROUND_00_15(1, h, a, b, c, d, e, f, g);
+        T1 = X[2] = PULL64(W[2]);
+        ROUND_00_15(2, g, h, a, b, c, d, e, f);
+        T1 = X[3] = PULL64(W[3]);
+        ROUND_00_15(3, f, g, h, a, b, c, d, e);
+        T1 = X[4] = PULL64(W[4]);
+        ROUND_00_15(4, e, f, g, h, a, b, c, d);
+        T1 = X[5] = PULL64(W[5]);
+        ROUND_00_15(5, d, e, f, g, h, a, b, c);
+        T1 = X[6] = PULL64(W[6]);
+        ROUND_00_15(6, c, d, e, f, g, h, a, b);
+        T1 = X[7] = PULL64(W[7]);
+        ROUND_00_15(7, b, c, d, e, f, g, h, a);
+        T1 = X[8] = PULL64(W[8]);
+        ROUND_00_15(8, a, b, c, d, e, f, g, h);
+        T1 = X[9] = PULL64(W[9]);
+        ROUND_00_15(9, h, a, b, c, d, e, f, g);
+        T1 = X[10] = PULL64(W[10]);
+        ROUND_00_15(10, g, h, a, b, c, d, e, f);
+        T1 = X[11] = PULL64(W[11]);
+        ROUND_00_15(11, f, g, h, a, b, c, d, e);
+        T1 = X[12] = PULL64(W[12]);
+        ROUND_00_15(12, e, f, g, h, a, b, c, d);
+        T1 = X[13] = PULL64(W[13]);
+        ROUND_00_15(13, d, e, f, g, h, a, b, c);
+        T1 = X[14] = PULL64(W[14]);
+        ROUND_00_15(14, c, d, e, f, g, h, a, b);
+        T1 = X[15] = PULL64(W[15]);
+        ROUND_00_15(15, b, c, d, e, f, g, h, a);
+
+        for (i = 16; i < 80; i += 16) {
+            ROUND_16_80(i, 0, a, b, c, d, e, f, g, h, X);
+            ROUND_16_80(i, 1, h, a, b, c, d, e, f, g, X);
+            ROUND_16_80(i, 2, g, h, a, b, c, d, e, f, X);
+            ROUND_16_80(i, 3, f, g, h, a, b, c, d, e, X);
+            ROUND_16_80(i, 4, e, f, g, h, a, b, c, d, X);
+            ROUND_16_80(i, 5, d, e, f, g, h, a, b, c, X);
+            ROUND_16_80(i, 6, c, d, e, f, g, h, a, b, X);
+            ROUND_16_80(i, 7, b, c, d, e, f, g, h, a, X);
+            ROUND_16_80(i, 8, a, b, c, d, e, f, g, h, X);
+            ROUND_16_80(i, 9, h, a, b, c, d, e, f, g, X);
+            ROUND_16_80(i, 10, g, h, a, b, c, d, e, f, X);
+            ROUND_16_80(i, 11, f, g, h, a, b, c, d, e, X);
+            ROUND_16_80(i, 12, e, f, g, h, a, b, c, d, X);
+            ROUND_16_80(i, 13, d, e, f, g, h, a, b, c, X);
+            ROUND_16_80(i, 14, c, d, e, f, g, h, a, b, X);
+            ROUND_16_80(i, 15, b, c, d, e, f, g, h, a, X);
+        }
+
+        ctx->h[0] += a;
+        ctx->h[1] += b;
+        ctx->h[2] += c;
+        ctx->h[3] += d;
+        ctx->h[4] += e;
+        ctx->h[5] += f;
+        ctx->h[6] += g;
+        ctx->h[7] += h;
+
+        W += SHA_LBLOCK;
+    }
+}