From patchwork Wed Nov 30 17:01:24 2016
X-Patchwork-Submitter: Bernd Edlinger
X-Patchwork-Id: 701090
From: Bernd Edlinger
To: Wilco Dijkstra, Ramana Radhakrishnan
Cc: GCC Patches, Kyrill Tkachov, Richard Earnshaw, nd
Subject: Re: [PATCH, ARM] Further improve stack usage on sha512 (PR 77308)
Date: Wed, 30 Nov 2016 17:01:24 +0000
On 11/30/16 13:01, Wilco Dijkstra wrote:
> Bernd Edlinger wrote:
>> On 11/29/16 16:06, Wilco Dijkstra wrote:
>>> Bernd Edlinger wrote:
>>>
>>> -   "TARGET_32BIT && reload_completed
>>> +   "TARGET_32BIT && ((!TARGET_NEON && !TARGET_IWMMXT) || reload_completed)
>>>      && ! (TARGET_NEON && IS_VFP_REGNUM (REGNO (operands[0])))"
>>>
>>> This is equivalent to "&& (!TARGET_IWMMXT || reload_completed)" since we're
>>> already excluding NEON.
>>
>> Ahem, no.  That would split the adddi3_neon insn before it is clear
>> whether the reload pass will assign a VFP register.
>
> Hmm, that's strange... This instruction shouldn't be used to also split some
> random Neon pattern - for example arm_subdi3 doesn't do the same.  To
> understand and reason about any of these complex patterns they should all
> work in the same way...
>

I was a bit surprised as well when I saw that happen.

But subdi3 is different:

  "TARGET_32BIT && !TARGET_NEON"
  "#"  ; "subs\\t%Q0, %Q1, %Q2\;sbc\\t%R0, %R1, %R2"
  "&& reload_completed"

so it never splits anything if TARGET_NEON.

But adddi3 cannot be expanded if TARGET_NEON, yet its pattern looks
exactly like adddi3_neon:

(define_insn_and_split "*arm_adddi3"
  [(set (match_operand:DI          0 "s_register_operand" "=&r,&r,&r,&r,&r")
        (plus:DI (match_operand:DI 1 "s_register_operand" "%0, 0, r, 0, r")
                 (match_operand:DI 2 "arm_adddi_operand"  "r,  0, r, Dd, Dd")))
   (clobber (reg:CC CC_REGNUM))]
  "TARGET_32BIT && !TARGET_NEON"
  "#"
  "TARGET_32BIT && reload_completed
   && ! (TARGET_NEON && IS_VFP_REGNUM (REGNO (operands[0])))"

(define_insn "adddi3_neon"
  [(set (match_operand:DI 0 "s_register_operand" "=w,?&r,?&r,?w,?&r,?&r,?&r")
        (plus:DI (match_operand:DI 1 "s_register_operand" "%w,0,0,w,r,0,r")
                 (match_operand:DI 2 "arm_adddi_operand"  "w,r,0,w,r,Dd,Dd")))
   (clobber (reg:CC CC_REGNUM))]
  "TARGET_NEON"
{
  switch (which_alternative)
    {
    case 0: /* fall through */
    case 3: return "vadd.i64\t%P0, %P1, %P2";
    case 1: return "#";
    case 2: return "#";
    case 4: return "#";
    case 5: return "#";
    case 6: return "#";
    default: gcc_unreachable ();
    }

Even the return "#" explicitly invokes the former pattern, so I think
the author knew that and did it on purpose.

>> But when I make *arm_cmpdi_insn split early, it ICEs:
>
> (insn 4870 4869 1636 87 (set (scratch:SI)
>         (minus:SI (minus:SI (subreg:SI (reg:DI 2261) 4)
>                 (subreg:SI (reg:DI 473 [ X$14 ]) 4))
>             (ltu:SI (reg:CC_C 100 cc)
>                 (const_int 0 [0])))) "pr77308-2.c":140 -1
>      (nil))
>
> That's easy, we don't have a "sbcs <scratch>, r1, r2" pattern.  A quick
> workaround is to create a temporary for operands[2] (if before reload) so it
> will match the standard sbcs pattern, and then the split works fine.
>
>> So it is certainly possible, but not really simple to improve the
>> stack size even further.  But I would prefer to do that in a
>> separate patch.
>
> Yes, separate patches would be fine.  However there is a lot of scope to
> improve this further.  For example, after your patch shifts and logical
> operations are expanded at expand time, add/sub are split in split1 after
> combine runs, and everything else is split after reload.  It doesn't make
> sense to split different operations at different times - it means you're
> still going to get the bad DImode subregs and miss lots of optimization
> opportunities due to the mix of partly split and partly not-yet-split
> operations.
>

Yes.  I did the add/sub differently because it was easier that way, and
it was simply sufficient to make the existing test cases happy.

Also, the biggest benefit was IIRC from the very early splitting of the
anddi/iordi/xordi patterns, because they have completely separate data
flow in the low and high parts.  That is not the case for the arithmetic
patterns, but they can still be optimized, preferably once a new test
case is found that demonstrates an improvement.

I am not sure why the cmpdi pattern has an influence at all, because
from the data flow you need all 64 bits of both sides.  Nevertheless it
is a fact: with the modified test case I get a 264-byte frame size,
where it was 1920 bytes before.

I have attached the completely untested follow-up patch now, but I would
like to post it again for review after I have applied my current patch,
which is still waiting for final review (please feel pinged!).

This is really exciting...


Thanks
Bernd.
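
To illustrate the data-flow point above, here is a minimal C sketch (not
part of the patch; the di_pair type and the helper names are made up for
illustration).  It shows why a DImode logical operation decomposes into
two completely independent SImode halves, while an add or an ordered
compare ties the halves together through a carry/flags dependency:

#include <stdint.h>

typedef struct { uint32_t lo, hi; } di_pair;  /* a DImode value as two SImode halves */

/* anddi3: the halves never interact - two independent SImode ANDs.  */
di_pair and_di (di_pair a, di_pair b)
{
  di_pair r = { a.lo & b.lo, a.hi & b.hi };
  return r;
}

/* adddi3: the high half consumes the carry produced by the low half
   (adds/adc), so the halves cannot be treated independently.  */
di_pair add_di (di_pair a, di_pair b)
{
  di_pair r;
  r.lo = a.lo + b.lo;
  uint32_t carry = r.lo < a.lo;  /* carry out of the low word */
  r.hi = a.hi + b.hi + carry;
  return r;
}

/* cmpdi: an ordered compare needs all 64 bits of both operands
   (cmp/sbcs on ARM), so there are no independent halves to split.  */
int less_than_di (di_pair a, di_pair b)
{
  return a.hi != b.hi ? a.hi < b.hi : a.lo < b.lo;
}

Splitting the purely logical cases early lets the register allocator
treat each 32-bit half separately, which is presumably where most of the
stack-usage win comes from.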
--- gcc/config/arm/arm.md.orig	2016-11-27 09:22:41.794790123 +0100
+++ gcc/config/arm/arm.md	2016-11-30 16:40:30.140532737 +0100
@@ -4738,7 +4738,7 @@
    (clobber (reg:CC CC_REGNUM))]
   "TARGET_ARM"
   "#"   ; "rsbs\\t%Q0, %Q1, #0\;rsc\\t%R0, %R1, #0"
-  "&& reload_completed"
+  "&& ((!TARGET_NEON && !TARGET_IWMMXT) || reload_completed)"
   [(parallel [(set (reg:CC CC_REGNUM)
 		   (compare:CC (const_int 0) (match_dup 1)))
 	      (set (match_dup 0) (minus:SI (const_int 0) (match_dup 1)))])
@@ -7432,7 +7432,7 @@
    (clobber (match_scratch:SI 2 "=r"))]
   "TARGET_32BIT"
   "#"   ; "cmp\\t%Q0, %Q1\;sbcs\\t%2, %R0, %R1"
-  "&& reload_completed"
+  "&& ((!TARGET_NEON && !TARGET_IWMMXT) || reload_completed)"
   [(set (reg:CC CC_REGNUM)
 	(compare:CC (match_dup 0) (match_dup 1)))
    (parallel [(set (reg:CC CC_REGNUM)
@@ -7456,7 +7456,10 @@
 	operands[5] = gen_rtx_MINUS (SImode, operands[3], operands[4]);
       }
     operands[1] = gen_lowpart (SImode, operands[1]);
-    operands[2] = gen_lowpart (SImode, operands[2]);
+    if (can_create_pseudo_p ())
+      operands[2] = gen_reg_rtx (SImode);
+    else
+      operands[2] = gen_lowpart (SImode, operands[2]);
   }
   [(set_attr "conds" "set")
    (set_attr "length" "8")
@@ -7470,7 +7473,7 @@
   "TARGET_32BIT"
   "#"   ; "cmp\\t%R0, %R1\;it eq\;cmpeq\\t%Q0, %Q1"
-  "&& reload_completed"
+  "&& ((!TARGET_NEON && !TARGET_IWMMXT) || reload_completed)"
   [(set (reg:CC CC_REGNUM)
 	(compare:CC (match_dup 2) (match_dup 3)))
    (cond_exec (eq:SI (reg:CC CC_REGNUM) (const_int 0))
--- gcc/config/arm/thumb2.md.orig	2016-11-30 16:57:44.760589624 +0100
+++ gcc/config/arm/thumb2.md	2016-11-30 16:58:05.310590754 +0100
@@ -132,7 +132,7 @@
    (clobber (reg:CC CC_REGNUM))]
   "TARGET_THUMB2"
   "#"   ; negs\\t%Q0, %Q1\;sbc\\t%R0, %R1, %R1, lsl #1
-  "&& reload_completed"
+  "&& (!TARGET_NEON || reload_completed)"
   [(parallel [(set (reg:CC CC_REGNUM)
 		   (compare:CC (const_int 0) (match_dup 1)))
 	      (set (match_dup 0) (minus:SI (const_int 0) (match_dup 1)))])
--- /dev/null	2016-11-30 15:23:46.779473644 +0100
+++ gcc/testsuite/gcc.target/arm/pr77308-2.c	2016-11-30 17:05:21.021614711 +0100
@@ -0,0 +1,169 @@
+/* { dg-do compile } */
+/* { dg-options "-Os -Wstack-usage=2500" } */
+
+/* This is a modified algorithm with 64bit cmp and neg at the Sigma-blocks.
+   It improves the test coverage of cmpdi and negdi2 patterns.
+   Unlike the original test case these insns can reach the reload pass,
+   which may result in large stack usage.  */
+
+#define SHA_LONG64 unsigned long long
+#define U64(C) C##ULL
+
+#define SHA_LBLOCK 16
+#define SHA512_CBLOCK (SHA_LBLOCK*8)
+
+typedef struct SHA512state_st {
+    SHA_LONG64 h[8];
+    SHA_LONG64 Nl, Nh;
+    union {
+        SHA_LONG64 d[SHA_LBLOCK];
+        unsigned char p[SHA512_CBLOCK];
+    } u;
+    unsigned int num, md_len;
+} SHA512_CTX;
+
+static const SHA_LONG64 K512[80] = {
+    U64(0x428a2f98d728ae22), U64(0x7137449123ef65cd),
+    U64(0xb5c0fbcfec4d3b2f), U64(0xe9b5dba58189dbbc),
+    U64(0x3956c25bf348b538), U64(0x59f111f1b605d019),
+    U64(0x923f82a4af194f9b), U64(0xab1c5ed5da6d8118),
+    U64(0xd807aa98a3030242), U64(0x12835b0145706fbe),
+    U64(0x243185be4ee4b28c), U64(0x550c7dc3d5ffb4e2),
+    U64(0x72be5d74f27b896f), U64(0x80deb1fe3b1696b1),
+    U64(0x9bdc06a725c71235), U64(0xc19bf174cf692694),
+    U64(0xe49b69c19ef14ad2), U64(0xefbe4786384f25e3),
+    U64(0x0fc19dc68b8cd5b5), U64(0x240ca1cc77ac9c65),
+    U64(0x2de92c6f592b0275), U64(0x4a7484aa6ea6e483),
+    U64(0x5cb0a9dcbd41fbd4), U64(0x76f988da831153b5),
+    U64(0x983e5152ee66dfab), U64(0xa831c66d2db43210),
+    U64(0xb00327c898fb213f), U64(0xbf597fc7beef0ee4),
+    U64(0xc6e00bf33da88fc2), U64(0xd5a79147930aa725),
+    U64(0x06ca6351e003826f), U64(0x142929670a0e6e70),
+    U64(0x27b70a8546d22ffc), U64(0x2e1b21385c26c926),
+    U64(0x4d2c6dfc5ac42aed), U64(0x53380d139d95b3df),
+    U64(0x650a73548baf63de), U64(0x766a0abb3c77b2a8),
+    U64(0x81c2c92e47edaee6), U64(0x92722c851482353b),
+    U64(0xa2bfe8a14cf10364), U64(0xa81a664bbc423001),
+    U64(0xc24b8b70d0f89791), U64(0xc76c51a30654be30),
+    U64(0xd192e819d6ef5218), U64(0xd69906245565a910),
+    U64(0xf40e35855771202a), U64(0x106aa07032bbd1b8),
+    U64(0x19a4c116b8d2d0c8), U64(0x1e376c085141ab53),
+    U64(0x2748774cdf8eeb99), U64(0x34b0bcb5e19b48a8),
+    U64(0x391c0cb3c5c95a63), U64(0x4ed8aa4ae3418acb),
+    U64(0x5b9cca4f7763e373), U64(0x682e6ff3d6b2b8a3),
+    U64(0x748f82ee5defb2fc), U64(0x78a5636f43172f60),
+    U64(0x84c87814a1f0ab72), U64(0x8cc702081a6439ec),
+    U64(0x90befffa23631e28), U64(0xa4506cebde82bde9),
+    U64(0xbef9a3f7b2c67915), U64(0xc67178f2e372532b),
+    U64(0xca273eceea26619c), U64(0xd186b8c721c0c207),
+    U64(0xeada7dd6cde0eb1e), U64(0xf57d4f7fee6ed178),
+    U64(0x06f067aa72176fba), U64(0x0a637dc5a2c898a6),
+    U64(0x113f9804bef90dae), U64(0x1b710b35131c471b),
+    U64(0x28db77f523047d84), U64(0x32caab7b40c72493),
+    U64(0x3c9ebe0a15c9bebc), U64(0x431d67c49c100d4c),
+    U64(0x4cc5d4becb3e42b6), U64(0x597f299cfc657e2a),
+    U64(0x5fcb6fab3ad6faec), U64(0x6c44198c4a475817)
+};
+
+#define B(x,j) (((SHA_LONG64)(*(((const unsigned char *)(&x))+j)))<<((7-j)*8))
+#define PULL64(x) (B(x,0)|B(x,1)|B(x,2)|B(x,3)|B(x,4)|B(x,5)|B(x,6)|B(x,7))
+#define ROTR(x,s) (((x)>>s) | (x)<<(64-s))
+#define Sigma0(x) (ROTR((x),28) ^ ROTR((x),34) ^ (ROTR((x),39) == (x)) ? -(x) : (x))
+#define Sigma1(x) (ROTR((x),14) ^ ROTR(-(x),18) ^ ((long long)ROTR((x),41) < (long long)(x)) ? -(x) : (x))
+#define sigma0(x) (ROTR((x),1) ^ ROTR((x),8) ^ (((x)>>7) > (x)) ? -(x) : (x))
+#define sigma1(x) (ROTR((x),19) ^ ROTR((x),61) ^ ((long long)((x)>>6) < (long long)(x)) ? -(x) : (x))
+#define Ch(x,y,z) (((x) & (y)) ^ ((~(x)) & (z)))
+#define Maj(x,y,z) (((x) & (y)) ^ ((x) & (z)) ^ ((y) & (z)))
+
+#define ROUND_00_15(i,a,b,c,d,e,f,g,h) do {             \
+        T1 += h + Sigma1(e) + Ch(e,f,g) + K512[i];      \
+        h = Sigma0(a) + Maj(a,b,c);                     \
+        d += T1;        h += T1;        } while (0)
+#define ROUND_16_80(i,j,a,b,c,d,e,f,g,h,X) do {         \
+        s0 = X[(j+1)&0x0f];     s0 = sigma0(s0);        \
+        s1 = X[(j+14)&0x0f];    s1 = sigma1(s1);        \
+        T1 = X[(j)&0x0f] += s0 + s1 + X[(j+9)&0x0f];    \
+        ROUND_00_15(i+j,a,b,c,d,e,f,g,h);       } while (0)
+void sha512_block_data_order(SHA512_CTX *ctx, const void *in,
+                             unsigned int num)
+{
+    const SHA_LONG64 *W = in;
+    SHA_LONG64 a, b, c, d, e, f, g, h, s0, s1, T1;
+    SHA_LONG64 X[16];
+    int i;
+
+    while (num--) {
+
+        a = ctx->h[0];
+        b = ctx->h[1];
+        c = ctx->h[2];
+        d = ctx->h[3];
+        e = ctx->h[4];
+        f = ctx->h[5];
+        g = ctx->h[6];
+        h = ctx->h[7];
+
+        T1 = X[0] = PULL64(W[0]);
+        ROUND_00_15(0, a, b, c, d, e, f, g, h);
+        T1 = X[1] = PULL64(W[1]);
+        ROUND_00_15(1, h, a, b, c, d, e, f, g);
+        T1 = X[2] = PULL64(W[2]);
+        ROUND_00_15(2, g, h, a, b, c, d, e, f);
+        T1 = X[3] = PULL64(W[3]);
+        ROUND_00_15(3, f, g, h, a, b, c, d, e);
+        T1 = X[4] = PULL64(W[4]);
+        ROUND_00_15(4, e, f, g, h, a, b, c, d);
+        T1 = X[5] = PULL64(W[5]);
+        ROUND_00_15(5, d, e, f, g, h, a, b, c);
+        T1 = X[6] = PULL64(W[6]);
+        ROUND_00_15(6, c, d, e, f, g, h, a, b);
+        T1 = X[7] = PULL64(W[7]);
+        ROUND_00_15(7, b, c, d, e, f, g, h, a);
+        T1 = X[8] = PULL64(W[8]);
+        ROUND_00_15(8, a, b, c, d, e, f, g, h);
+        T1 = X[9] = PULL64(W[9]);
+        ROUND_00_15(9, h, a, b, c, d, e, f, g);
+        T1 = X[10] = PULL64(W[10]);
+        ROUND_00_15(10, g, h, a, b, c, d, e, f);
+        T1 = X[11] = PULL64(W[11]);
+        ROUND_00_15(11, f, g, h, a, b, c, d, e);
+        T1 = X[12] = PULL64(W[12]);
+        ROUND_00_15(12, e, f, g, h, a, b, c, d);
+        T1 = X[13] = PULL64(W[13]);
+        ROUND_00_15(13, d, e, f, g, h, a, b, c);
+        T1 = X[14] = PULL64(W[14]);
+        ROUND_00_15(14, c, d, e, f, g, h, a, b);
+        T1 = X[15] = PULL64(W[15]);
+        ROUND_00_15(15, b, c, d, e, f, g, h, a);
+
+        for (i = 16; i < 80; i += 16) {
+            ROUND_16_80(i, 0, a, b, c, d, e, f, g, h, X);
+            ROUND_16_80(i, 1, h, a, b, c, d, e, f, g, X);
+            ROUND_16_80(i, 2, g, h, a, b, c, d, e, f, X);
+            ROUND_16_80(i, 3, f, g, h, a, b, c, d, e, X);
+            ROUND_16_80(i, 4, e, f, g, h, a, b, c, d, X);
+            ROUND_16_80(i, 5, d, e, f, g, h, a, b, c, X);
+            ROUND_16_80(i, 6, c, d, e, f, g, h, a, b, X);
+            ROUND_16_80(i, 7, b, c, d, e, f, g, h, a, X);
+            ROUND_16_80(i, 8, a, b, c, d, e, f, g, h, X);
+            ROUND_16_80(i, 9, h, a, b, c, d, e, f, g, X);
+            ROUND_16_80(i, 10, g, h, a, b, c, d, e, f, X);
+            ROUND_16_80(i, 11, f, g, h, a, b, c, d, e, X);
+            ROUND_16_80(i, 12, e, f, g, h, a, b, c, d, X);
+            ROUND_16_80(i, 13, d, e, f, g, h, a, b, c, X);
+            ROUND_16_80(i, 14, c, d, e, f, g, h, a, b, X);
+            ROUND_16_80(i, 15, b, c, d, e, f, g, h, a, X);
+        }
+
+        ctx->h[0] += a;
+        ctx->h[1] += b;
+        ctx->h[2] += c;
+        ctx->h[3] += d;
+        ctx->h[4] += e;
+        ctx->h[5] += f;
+        ctx->h[6] += g;
+        ctx->h[7] += h;
+
+        W += SHA_LBLOCK;
+    }
+}
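
For reference, the 64-bit compare-and-negate shape that the modified
Sigma/sigma macros introduce (and which exercises the cmpdi and negdi2
patterns) boils down to something like the following reduction.  This is
illustrative only, not part of the patch, and the function name is made
up:

unsigned long long
sigma_shape (unsigned long long x, unsigned long long y)
{
  /* A full DImode comparison selects between x and its DImode negation,
     so both a cmpdi and a negdi2 can reach the reload pass, as described
     in the testcase comment above.  */
  return (y < x) ? -x : x;
}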