        Sum of overlapping arrays, auto-vectorization, and restrict

        Date: 2023-09-16

                  Question

                  Ars Technica recently had an article, Why are some programming languages faster than others. It compares Fortran and C and mentions summing arrays. In Fortran, arrays are assumed not to overlap, which allows further optimization. In C/C++, pointers to the same type may overlap, so this optimization can't be used in general. However, in C/C++ one can use the restrict or __restrict keyword to tell the compiler that it may assume the pointers do not overlap. So I started looking into this with regard to auto-vectorization.

                  The following code vectorizes in GCC and MSVC:

                  void dot_int(int *a, int *b, int *c, int n) {
                      for(int i=0; i<n; i++) {
                          c[i] = a[i] + b[i];
                      }
                  }
                  

                  I tested this with and without overlapping arrays and it gets the correct result in both cases. However, the way I would vectorize this loop manually with SSE does not handle overlapping arrays:

                  int i=0;    
                  for(; i<n-3; i+=4) {
                      __m128i a4 = _mm_loadu_si128((__m128i*)&a[i]);
                      __m128i b4 = _mm_loadu_si128((__m128i*)&b[i]);
                      __m128i c4 = _mm_add_epi32(a4,b4);
                      // As posted, this stores to c rather than &c[i], so every iteration overwrites c[0..3].
                      _mm_storeu_si128((__m128i*)c, c4);
                  }
                  for(; i<n; i++) {
                      c[i] = a[i] + b[i];
                  }
                  

                  Next I tried using __restrict. I expected that, since the compiler may then assume the arrays don't overlap, it would no longer handle overlapping arrays. But both GCC and MSVC still get the correct result for overlapping arrays even with __restrict.

                  void dot_int_restrict(int * __restrict a, int * __restrict b, int * __restrict c, int n) {
                      for(int i=0; i<n; i++) {
                          c[i] = a[i] + b[i];
                      }
                  }
                  

                  Why does the auto-vectorized code, with and without __restrict, get the correct result for overlapping arrays?

                  Here is the full code I used to test this:

                  #include <stdio.h>
                  #include <immintrin.h>
                  void dot_int(int *a, int *b, int *c, int n) {
                      for(int i=0; i<n; i++) {
                          c[i] = a[i] + b[i];
                      }
                      for(int i=0; i<8; i++) printf("%d ", c[i]); printf("\n");
                  }
                  
                  void dot_int_restrict(int * __restrict a, int * __restrict b, int * __restrict c, int n) {
                      for(int i=0; i<n; i++) {
                          c[i] = a[i] + b[i];
                      }
                      for(int i=0; i<8; i++) printf("%d ", c[i]); printf("\n");
                  }
                  
                  void dot_int_SSE(int *a, int *b, int *c, int n) {
                      int i=0;    
                      for(; i<n-3; i+=4) {
                          __m128i a4 = _mm_loadu_si128((__m128i*)&a[i]);
                          __m128i b4 = _mm_loadu_si128((__m128i*)&b[i]);
                          __m128i c4 = _mm_add_epi32(a4,b4);
                          // As posted, this stores to c rather than &c[i], so every iteration overwrites c[0..3].
                          _mm_storeu_si128((__m128i*)c, c4);
                      }
                      for(; i<n; i++) {
                          c[i] = a[i] + b[i];
                      }
                      for(int i=0; i<8; i++) printf("%d ", c[i]); printf("\n");
                  }
                  
                  int main() {
                      const int n = 100;
                      int a[] = {1,1,1,1,1,1,1,1};
                      int b1[] = {1,1,1,1,1,1,1,1,1};
                      int b2[] = {1,1,1,1,1,1,1,1,1};
                      int b3[] = {1,1,1,1,1,1,1,1,1};
                  
                      int c[8];
                      int *c1 = &b1[1];
                      int *c2 = &b2[1];
                      int *c3 = &b3[1];
                  
                      dot_int(a,b1,c, 8);
                      dot_int_SSE(a,b1,c,8);
                  
                      dot_int(a,b1,c1, 8);
                      dot_int_restrict(a,b2,c2,8);
                      dot_int_SSE(a,b3,c3,8);
                  
                  }
                  

                  The output (from MSVC):

                  2 2 2 2 2 2 2 2 //no overlap default
                  2 2 2 2 2 2 2 2 //no overlap with manual SSE vector code
                  2 3 4 5 6 7 8 9 //overlap default
                  2 3 4 5 6 7 8 9 //overlap with restrict
                  3 2 2 2 1 1 1 1 //manual SSE vector code
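
The rows above can be worked out by hand. The sketch below is a portable model, not the actual SSE code: it replaces the 128-bit load, add and store with a four-element temporary so it runs on any target. The names `add_scalar` and `add_blocked` and the `store_at_i` flag are assumptions made for illustration. With `store_at_i` set to 0 the store always goes to `c[0..3]`, as in the manual SSE loop as posted; with a non-zero value it goes to `c[i..i+3]`.

```c
#include <string.h>

/* Scalar reference: each store is visible to the loads of later iterations. */
void add_scalar(const int *a, const int *b, int *c, int n) {
    for (int i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}

/* Portable model of a 4-wide vector loop: load four elements of a and b,
 * add them, then store four results at once.
 * store_at_i == 0 models the store as posted (always to c[0..3]);
 * store_at_i != 0 models a store to c[i..i+3]. */
void add_blocked(const int *a, const int *b, int *c, int n, int store_at_i) {
    int i = 0;
    for (; i + 3 < n; i += 4) {
        int a4[4], b4[4], c4[4];
        memcpy(a4, &a[i], sizeof a4);            /* models _mm_loadu_si128 */
        memcpy(b4, &b[i], sizeof b4);
        for (int k = 0; k < 4; k++) {
            c4[k] = a4[k] + b4[k];               /* models _mm_add_epi32 */
        }
        memcpy(store_at_i ? &c[i] : c, c4, sizeof c4); /* models _mm_storeu_si128 */
    }
    for (; i < n; i++) {                         /* scalar tail */
        c[i] = a[i] + b[i];
    }
}
```

With `a` holding eight ones, `b` holding nine ones and `c = &b[1]`, `add_scalar` yields 2 3 4 5 6 7 8 9, `add_blocked` with `store_at_i = 0` yields 3 2 2 2 1 1 1 1 as in the last row, and `add_blocked` with `store_at_i = 1` yields 2 2 2 2 3 2 2 2. Neither blocked variant matches the scalar loop, because all four loads of a block happen before any of that block's stores.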
                  

                  Here is another interesting version, which produces much simpler code:

                  void dot_int(int * __restrict a, int * __restrict b, int * __restrict c, int n) {
                      a = (int*)__builtin_assume_aligned (a, 16);
                      b = (int*)__builtin_assume_aligned (b, 16);
                      c = (int*)__builtin_assume_aligned (c, 16);
                      for(int i=0; i<n; i++) {
                          c[i] = a[i] + b[i];
                      }
                  }
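
`__builtin_assume_aligned` is a promise to the compiler, not a request: if a pointer passed in is not actually 16-byte aligned, the behaviour is undefined. A minimal sketch of how a caller might keep that promise, assuming GCC or Clang and C11; the names `dot_int_aligned` and `dot_int_aligned_checked` are hypothetical and not part of the code above:

```c
#include <assert.h>
#include <stdint.h>

/* Same loop as above under a hypothetical name.
 * __builtin_assume_aligned is a GCC/Clang builtin. */
void dot_int_aligned(int * __restrict a, int * __restrict b, int * __restrict c, int n) {
    a = (int*)__builtin_assume_aligned(a, 16);
    b = (int*)__builtin_assume_aligned(b, 16);
    c = (int*)__builtin_assume_aligned(c, 16);
    for (int i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}

/* Hypothetical checked entry point: verifies the alignment promise
 * (in debug builds) before handing the pointers to the loop above. */
void dot_int_aligned_checked(int *a, int *b, int *c, int n) {
    assert((uintptr_t)a % 16 == 0);
    assert((uintptr_t)b % 16 == 0);
    assert((uintptr_t)c % 16 == 0);
    dot_int_aligned(a, b, c, n);
}
```

Arrays declared as `_Alignas(16) int a[8];` (C11) satisfy the check; plain `int` arrays on the stack may or may not.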
                  

                  Recommended answer

                  I don't get what the problem is. Testing on Linux/64 bit, GCC 4.6, -O3, -mtune=native, -msse4.1 (i.e. a very old compiler/system), this code

                  void dot_int(int *a, int *b, int *c, int n) {
                      for(int i=0; i<n; ++i) {
                          c[i] = a[i] + b[i];
                      }
                  }
                  

                  compiles to this inner loop:

                  .L4:
                      movdqu  (%rdi,%rax), %xmm1
                      addl    $1, %r8d
                      movdqu  (%rsi,%rax), %xmm0
                      paddd   %xmm1, %xmm0
                      movdqu  %xmm0, (%rdx,%rax)
                      addq    $16, %rax
                      cmpl    %r8d, %r10d
                      ja      .L4
                      cmpl    %r9d, %ecx
                      je      .L1
                  

                  while this code

                  void dot_int_restrict(int * __restrict a, int * __restrict b, int * __restrict c, int n) {
                      for(int i=0; i<n; ++i) {
                          c[i] = a[i] + b[i];
                      }
                  }
                  

                  compiles to:

                  .L15:
                      movdqu  (%rbx,%rax), %xmm0
                      addl    $1, %r8d
                      paddd   0(%rbp,%rax), %xmm0
                      movdqu  %xmm0, (%r11,%rax)
                      addq    $16, %rax
                      cmpl    %r10d, %r8d
                      jb      .L15
                      addl    %r12d, %r9d
                      cmpl    %r12d, %r13d
                      je      .L10
                  

                  As you can see, there is one fewer load instruction: the second operand is read directly from memory by paddd. I guess GCC correctly worked out that there's no need to explicitly load memory before performing the sum, as the result won't overwrite anything.

                  There's also room for many more optimizations. GCC knows nothing about the parameters, for instance whether they are 128-bit aligned or whether they overlap, hence it must generate a sizeable preamble of runtime checks (YMMV), and a postamble to deal with the leftover elements when the length is not a multiple of 128 bits. This actually happens with both versions above. This is the complete code generated for dot_int:

                  dot_int:
                  .LFB626:
                          .cfi_startproc
                          testl   %ecx, %ecx
                          pushq   %rbx
                          .cfi_def_cfa_offset 16
                          .cfi_offset 3, -16
                          jle     .L1
                          leaq    16(%rdx), %r11
                          movl    %ecx, %r10d
                          shrl    $2, %r10d
                          leal    0(,%r10,4), %r9d
                          testl   %r9d, %r9d
                          je      .L6
                          leaq    16(%rdi), %rax
                          cmpl    $6, %ecx
                          seta    %r8b
                          cmpq    %rax, %rdx
                          seta    %al
                          cmpq    %r11, %rdi
                          seta    %bl
                          orl     %ebx, %eax
                          andl    %eax, %r8d
                          leaq    16(%rsi), %rax
                          cmpq    %rax, %rdx
                          seta    %al
                          cmpq    %r11, %rsi
                          seta    %r11b
                          orl     %r11d, %eax
                          testb   %al, %r8b
                          je      .L6
                          xorl    %eax, %eax
                          xorl    %r8d, %r8d
                          .p2align 4,,10
                          .p2align 3
                  .L4:
                          movdqu  (%rdi,%rax), %xmm1
                          addl    $1, %r8d
                          movdqu  (%rsi,%rax), %xmm0
                          paddd   %xmm1, %xmm0
                          movdqu  %xmm0, (%rdx,%rax)
                          addq    $16, %rax
                          cmpl    %r8d, %r10d
                          ja      .L4
                          cmpl    %r9d, %ecx
                          je      .L1
                  .L3:
                          movslq  %r9d, %r8
                          xorl    %eax, %eax
                          salq    $2, %r8
                          addq    %r8, %rdx
                          addq    %r8, %rdi
                          addq    %r8, %rsi
                          .p2align 4,,10
                          .p2align 3
                  .L5:
                          movl    (%rdi,%rax,4), %r8d
                          addl    (%rsi,%rax,4), %r8d
                          movl    %r8d, (%rdx,%rax,4)
                          addq    $1, %rax
                          leal    (%r9,%rax), %r8d
                          cmpl    %r8d, %ecx
                          jg      .L5
                  .L1:
                          popq    %rbx
                          .cfi_remember_state
                          .cfi_def_cfa_offset 8
                          ret
                  .L6:
                          .cfi_restore_state
                          xorl    %r9d, %r9d
                          jmp     .L3
                          .cfi_endproc
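
Reading the preamble above, the comparisons of `%rdx` (`c`) against `16(%rdi)` and `16(%rsi)` look like a runtime overlap test: the vector loop at `.L4` appears to be entered only when `n > 6` and `c` starts more than 16 bytes away from both `a` and `b`; otherwise control goes to `.L6` and the scalar loop `.L5` handles every element. That would explain why the version without `__restrict` gives the scalar result on overlapping arrays. Below is a rough C rendering, hand-written from the listing rather than produced by GCC; the name `dot_int_versioned` and the four-element temporary standing in for the SSE instructions are assumptions.

```c
#include <stdint.h>
#include <string.h>

/* Hand-written model of the loop versioning seen in the dot_int listing. */
void dot_int_versioned(int *a, int *b, int *c, int n) {
    uintptr_t pa = (uintptr_t)a, pb = (uintptr_t)b, pc = (uintptr_t)c;
    int i = 0;

    /* c must start more than 16 bytes (one vector) away from a and from b. */
    int far_from_a = (pc > pa + 16) || (pa > pc + 16);
    int far_from_b = (pc > pb + 16) || (pb > pc + 16);

    if (n > 6 && far_from_a && far_from_b) {
        for (; i + 3 < n; i += 4) {          /* stands in for the .L4 loop */
            int t[4];
            for (int k = 0; k < 4; k++) {
                t[k] = a[i + k] + b[i + k];  /* two loads and paddd */
            }
            memcpy(&c[i], t, sizeof t);      /* one 16-byte store */
        }
    }
    for (; i < n; i++) {                     /* scalar loop, as in .L5 */
        c[i] = a[i] + b[i];
    }
}
```

In the question's test, `c1 = &b1[1]` lies only 4 bytes from `b1`, so under this reading the check fails and the scalar loop produces 2 3 4 5 6 7 8 9.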
                  

                  Now in your case the ints are effectively not aligned (as they're on the stack), but if you can make them aligned and tell GCC so, then you can improve code generation:

                  typedef int intvec __attribute__((vector_size(16)));
                  
                  void dot_int_restrict_alig(intvec * restrict a, 
                                             intvec * restrict b, 
                                             intvec * restrict c, 
                                             unsigned int n) {
                      for(unsigned int i=0; i<n; ++i) {
                          c[i] = a[i] + b[i];
                      }
                  }
                  

                  This generates the following code, with no preamble:

                  dot_int_restrict_alig:
                  .LFB628:
                          .cfi_startproc
                          testl   %ecx, %ecx
                          je      .L23
                          subl    $1, %ecx
                          xorl    %eax, %eax
                          addq    $1, %rcx
                          salq    $4, %rcx
                          .p2align 4,,10
                          .p2align 3
                  .L25:
                          movdqa  (%rdi,%rax), %xmm0
                          paddd   (%rsi,%rax), %xmm0
                          movdqa  %xmm0, (%rdx,%rax)
                          addq    $16, %rax
                          cmpq    %rcx, %rax
                          jne     .L25
                  .L23:
                          rep
                          ret
                          .cfi_endproc
                  

                  Note the use of the aligned 128-bit move instruction movdqa ("a" for aligned) instead of movdqu ("u" for unaligned).
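
One consequence of the `intvec` signature is that `n` now counts 4-int vectors rather than ints, and the caller has to supply 16-byte-aligned storage. A minimal calling sketch, assuming GCC or Clang (the `vector_size` attribute is a compiler extension); the helper `add_eight` is hypothetical and only there to show the call:

```c
#include <string.h>

typedef int intvec __attribute__((vector_size(16)));  /* GCC/Clang extension */

void dot_int_restrict_alig(intvec * restrict a,
                           intvec * restrict b,
                           intvec * restrict c,
                           unsigned int n) {
    for (unsigned int i = 0; i < n; ++i) {
        c[i] = a[i] + b[i];   /* element-wise add of two 4-int vectors */
    }
}

/* Hypothetical helper: adds two 8-int arrays via two 4-int vectors. */
void add_eight(const int x[8], const int y[8], int out[8]) {
    intvec a[2], b[2], c[2];             /* intvec objects are 16-byte aligned */
    memcpy(a, x, sizeof a);              /* copy into aligned storage */
    memcpy(b, y, sizeof b);
    dot_int_restrict_alig(a, b, c, 2);   /* n = 8 ints / 4 ints per vector */
    memcpy(out, c, sizeof c);
}
```

Copying through `memcpy` keeps the sketch free of pointer casts; code that owns its buffers would more likely allocate them as `intvec` arrays from the start.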
