I believe push/pop instructions would result in more compact code, and might even run slightly faster. This requires disabling stack frames as well, though.
To check this, I would need to either rewrite a large enough program in assembly by hand (to compare the two versions), or install and study a few other compilers (to see whether they have an option for this, and to compare the results).
Here is the forum topic about this and similar problems.
In short, I want to understand which code is better. Code like this:
sub esp, c
mov [esp+8],eax
mov [esp+4],ecx
mov [esp],edx
...
add esp, c
Or code like this:
push eax
push ecx
push edx
...
add esp, c
What compiler can produce the second kind of code? They usually produce some variation of the first one.
You're right, push is a minor missed optimization in all 4 major x86 compilers. There's some code-size, and thus indirectly performance, to be had. Or maybe more directly a small amount of performance in some cases, e.g. saving a sub rsp instruction.
But if you're not careful, you can make things slower with extra stack-sync uops by mixing push with [rsp+x] addressing modes. pop doesn't sound useful, just push. As the forum thread you linked suggests, you only use this for the initial store of locals; later reloads and stores should use normal addressing modes like [rsp+8]. We're not talking about trying to avoid mov loads/stores entirely, and we still want random access to the stack slots where we spilled local variables from registers!
Modern code generators avoid using PUSH. It is inefficient on today's processors because it modifies the stack pointer, that gums-up a super-scalar core. (Hans Passant)
This was true 15 years ago, but compilers are once again using push when optimizing for speed, not just code-size. Compilers already use push/pop for saving/restoring call-preserved registers they want to use, like rbx, and for pushing stack args (mostly in 32-bit mode; in 64-bit mode most args fit in registers). Both of these things could be done with mov, but compilers use push because it's more efficient than sub rsp,8 / mov [rsp], rbx. gcc has tuning options to avoid push/pop for these cases, enabled for -mtune=pentium3 and -mtune=pentium, and similar old CPUs, but not for modern CPUs.
Intel since Pentium-M and AMD since Bulldozer(?) have a "stack engine" that tracks the changes to RSP with zero latency and no ALU uops, for PUSH/POP/CALL/RET. Lots of real code was still using push/pop, so CPU designers added hardware to make it efficient. Now we can use them (carefully!) when tuning for performance. See Agner Fog's microarchitecture guide and instruction tables, and his asm optimization manual. They're excellent. (And other links in the x86 tag wiki.)
It's not perfect; reading RSP directly (when the offset from the value in the out-of-order core is nonzero) does cause a stack-sync uop to be inserted on Intel CPUs. e.g. push rax / mov [rsp-8], rdi is 3 total fused-domain uops: 2 stores and one stack-sync.
On function entry, the "stack engine" is already in a non-zero-offset state (from the call in the parent), so using some push instructions before the first direct reference to RSP costs no extra uops at all. (Unless we were tailcalled from another function with jmp, and that function didn't pop anything right before jmp.)
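The stack-sync rule described above can be sketched as a toy model in C. This is only an illustration under simplifying assumptions, not a simulation of any real CPU: the names op_kind and count_stack_sync_uops are made up for this example, and the model assumes that every explicit RSP reference made while the stack engine holds a nonzero pending offset costs exactly one sync uop.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model (an assumption for illustration, not a real CPU simulator):
   the stack engine keeps a pending offset that push/pop update for free.
   An explicit use of RSP while that offset is nonzero costs one
   stack-sync uop and resets the pending offset to zero. */
typedef enum { OP_PUSH, OP_POP, OP_RSP_REF } op_kind;

/* Returns how many stack-sync uops the model inserts for a sequence.
   delta_on_entry is nonzero right after a call in the parent. */
static int count_stack_sync_uops(const op_kind *ops, size_t n, int delta_on_entry)
{
    assert(ops != NULL || n == 0);
    int delta = delta_on_entry;
    int syncs = 0;
    for (size_t i = 0; i < n; i++) {
        switch (ops[i]) {
        case OP_PUSH:
            delta -= 8;          /* tracked by the stack engine, no ALU uop */
            break;
        case OP_POP:
            delta += 8;
            break;
        case OP_RSP_REF:         /* e.g. mov [rsp+8], eax or lea rdi, [rsp] */
            if (delta != 0) {
                syncs++;         /* sync uop makes the real RSP visible */
                delta = 0;
            }
            break;
        }
    }
    return syncs;
}
```

In this model, three pushes followed by any number of [rsp+x] accesses cost one sync uop, while alternating push and [rsp+x] costs one sync uop per pair, which is the pattern the answer warns against.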
It's kind of funny that compilers have been using dummy push/pop instructions just to adjust the stack by 8 bytes for a while now, because it's so cheap and compact (if you're doing it once, not 10 times to allocate 80 bytes), but aren't taking advantage of it to store useful data. The stack is almost always hot in cache, and modern CPUs have excellent store / load bandwidth to L1d.
int extfunc(int *,int *);
void foo() {
int a=1, b=2;
extfunc(&a, &b);
}
compiles with clang6.0 -O3 -march=haswell on the Godbolt compiler explorer to the code below. See that link for all the rest of the code, and many different missed optimizations and silly code-gen (see my comments in the C source pointing out some of them):
# compiled for the x86-64 System V calling convention:
# integer args in rdi, rsi (,rdx, rcx, r8, r9)
push rax # clang / ICC ALREADY use push instead of sub rsp,8
lea rdi, [rsp + 4]
mov dword ptr [rdi], 1 # 6 bytes: opcode + modrm + imm32
mov rsi, rsp # special case for lea rsi, [rsp + 0]
mov dword ptr [rsi], 2
call extfunc(int*, int*)
pop rax # and POP instead of add rsp,8
ret
gcc, ICC, and MSVC produce very similar code, sometimes with the instructions in a different order, or with gcc reserving an extra 16B of stack space for no reason. (MSVC reserves more space because it's targeting the Windows x64 calling convention, which reserves shadow space instead of having a red-zone.)
clang saves code-size by using the LEA results for store addresses instead of repeating RSP-relative addresses (SIB+disp8). ICC and clang put the variables at the bottom of the space they reserved, so one of the addressing modes avoids a disp8. (With 3 variables, reserving 24 bytes instead of 8 was necessary, and clang didn't take advantage then.) gcc and MSVC miss this optimization.
But anyway, more optimal would be:
push 2 # only 2 bytes
lea rdi, [rsp + 4]
mov dword ptr [rdi], 1
mov rsi, rsp # special case for lea rsi, [rsp + 0]
call extfunc(int*, int*)
# ... later accesses would use [rsp] and [rsp+] if needed, not pop
pop rax # alternative to add rsp,8
ret
The push is an 8-byte store, and we overlap half of it. This is not a problem; CPUs can store-forward the unmodified low half efficiently even after storing the high half. Overlapping stores in general are not a problem, and in fact glibc's well-commented memcpy implementation uses two (potentially) overlapping loads + stores for small copies (up to the size of 2x xmm registers at least), to load everything then store everything without caring about whether or not there's overlap.
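The overlapping load/store idea can be sketched in portable C. This is not glibc's code (which is hand-written asm using vector registers); it is a minimal illustration using 8-byte chunks, and the function name copy_8_to_16 is invented for this example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Copy n bytes, 8 <= n <= 16, using two 8-byte loads and two 8-byte stores
   that may overlap in the middle. Both loads are done before either store,
   so the result is still correct if src and dst themselves overlap. */
static void copy_8_to_16(void *dst, const void *src, size_t n)
{
    assert(n >= 8 && n <= 16);
    uint64_t head, tail;
    memcpy(&head, src, 8);                                  /* first 8 bytes */
    memcpy(&tail, (const unsigned char *)src + n - 8, 8);   /* last 8 bytes  */
    memcpy(dst, &head, 8);
    memcpy((unsigned char *)dst + n - 8, &tail, 8);         /* may overlap head */
}
```

For n = 11, bytes 3..7 are written twice with the same values. The fixed-size memcpy calls are just the portable way to express unaligned 8-byte loads and stores; compilers typically turn each into a single mov.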
Note that in 64-bit mode, 32-bit push is not available. So we still have to reference rsp directly for the upper half of the qword. But if our variables were uint64_t, or we didn't care about making them contiguous, we could just use push.
We have to reference RSP explicitly in this case to get pointers to the locals for passing to another function, so there's no getting around the extra stack-sync uop on Intel CPUs. In other cases maybe you just need to spill some function args for use after a call. (Although normally compilers will push rbx and mov rbx,rdi to save an arg in a call-preserved register, instead of spilling/reloading the arg itself, to shorten the critical path.)
I chose 2x 4-byte args so we could reach a 16-byte alignment boundary with 1 push, so we can optimize away the sub rsp, ## (or dummy push) entirely.
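The alignment arithmetic behind that choice can be written out as a small helper. This is a sketch assuming the x86-64 System V convention, where RSP is 8 bytes off a 16-byte boundary on function entry because the call just pushed the return address; frame_bytes is a hypothetical name used only here.

```c
#include <assert.h>

/* Smallest stack adjustment, in bytes, that holds locals_bytes of locals and
   leaves RSP 16-byte aligned before the next call. Assumes x86-64 System V:
   RSP % 16 == 8 on function entry, so the adjustment must be 8 + 16*k. */
static unsigned frame_bytes(unsigned locals_bytes)
{
    assert(locals_bytes <= 0x7fffffffu);   /* keep the loop far from overflow */
    unsigned n = 8;                        /* one push-sized slot realigns RSP */
    while (n < locals_bytes)
        n += 16;                           /* grow in steps that keep alignment */
    return n;
}
```

Two 4-byte ints need 8 bytes, so a single push covers both the storage and the alignment. Three ints need 12 bytes, and the next valid size is 8 + 16*1 = 24.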
I could have used mov rax, 0x0000000200000001 / push rax, but 10-byte mov r64, imm64 takes 2 entries in the uop cache, and a lot of code-size.
gcc7 does know how to merge two adjacent stores, but chooses not to do that for mov in this case. If both constants had needed 32-bit immediates, it would have made sense. But if the values weren't actually constant at all, and came from registers, this wouldn't work while push / mov [rsp+4] would. (It wouldn't be worth merging values in a register with SHL + SHLD or whatever other instructions to turn 2 stores into 1.)
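To see where the constant 0x0000000200000001 comes from, here is a small C sketch of merging two adjacent 32-bit stores into one 64-bit store. It assumes a little-endian target with 4-byte int (true for x86-64); merge_pair and store_pair_merged are names invented for this illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Pack two 32-bit values so that a single 8-byte little-endian store puts
   lo at the lower address and hi 4 bytes above it. */
static uint64_t merge_pair(uint32_t lo, uint32_t hi)
{
    return ((uint64_t)hi << 32) | lo;
}

/* Store both ints with one 8-byte copy instead of two 4-byte stores.
   Assumes 4-byte int and a little-endian target such as x86-64. */
static void store_pair_merged(int out[2], uint32_t lo, uint32_t hi)
{
    assert(sizeof(int) == 4);
    uint64_t merged = merge_pair(lo, hi);
    memcpy(out, &merged, sizeof merged);
}
```

With a = 1 at the lower address and b = 2 above it, merge_pair(1, 2) gives 0x0000000200000001. For values that arrive in registers, the shift and OR here are the extra work the answer says is not worth doing just to save a store.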
If you need to reserve space for more than one 8-byte chunk, and don't have anything useful to store there yet, definitely use sub instead of multiple dummy PUSHes after the last useful PUSH. But if you have useful stuff to store, push imm8 or push imm32, or push reg are good.
We can see more evidence of compilers using "canned" sequences with ICC output: it uses lea rdi, [rsp] in the arg setup for the call. It seems they didn't think to look for the special case of the address of a local being pointed to directly by a register, with no offset, allowing mov instead of lea. (mov is definitely not worse, and better on some CPUs.)
An interesting example of not making locals contiguous is a version of the above with 3 args, int a=1, b=2, c=3;. To maintain 16B alignment, we now need to offset 8 + 16*1 = 24 bytes, so we could do:
bar3:
push 3
push 2 # don't interleave mov in here; extra stack-sync uops
push 1
mov rdi, rsp
lea rsi, [rsp+8]
lea rdx, [rdi+16] # relative to RDI to save a byte with probably no extra latency even if MOV isn't zero latency, at least not on the critical path
call extfunc3(int*,int*,int*)
add rsp, 24
ret
This is significantly smaller code-size than compiler-generated code, because mov [rsp+16], 2 has to use the mov r/m32, imm32 encoding, using a 4-byte immediate because there's no sign_extended_imm8 form of mov.
push imm8 is extremely compact, 2 bytes. mov dword ptr [rsp+8], 1 is 8 bytes: opcode + modrm + SIB + disp8 + imm32. (RSP as a base register always needs a SIB byte; the ModRM encoding with base=RSP is the escape code for a SIB byte existing. Using RBP as a frame pointer allows more compact addressing of locals (by 1 byte per insn), but takes 3 extra instructions to set up / tear down, and ties up a register. But it avoids further access to RSP, avoiding stack-sync uops. It could actually be a win sometimes.)
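The byte counts above can be added up for a whole store sequence. The sketch below is a simplified comparison, not a measurement of any compiler's actual output: it assumes every local is a small constant in its own 8-byte slot, the frame is under 128 bytes so sub uses an imm8 and every offset fits a disp8, and it ignores LEA tricks like clang's. The enum and function names are hypothetical.

```c
#include <assert.h>

/* Instruction lengths in bytes for the encodings discussed above. */
enum {
    PUSH_IMM8_BYTES        = 2, /* 6A ib                : push imm8 */
    SUB_RSP_IMM8_BYTES     = 4, /* 48 83 EC ib          : sub rsp, imm8 */
    MOV_RSP_IMM32_BYTES    = 7, /* C7 04 24 id          : mov dword ptr [rsp], imm32 */
    MOV_RSP_D8_IMM32_BYTES = 8  /* C7 44 24 disp8 id    : mov dword ptr [rsp+disp8], imm32 */
};

/* Bytes to initialize n_locals constants with one push imm8 each. */
static unsigned bytes_with_push(unsigned n_locals)
{
    return n_locals * PUSH_IMM8_BYTES;
}

/* Bytes for sub rsp + one mov to [rsp] + (n_locals - 1) movs to [rsp+disp8]. */
static unsigned bytes_with_sub_mov(unsigned n_locals)
{
    assert(n_locals >= 1);
    return SUB_RSP_IMM8_BYTES + MOV_RSP_IMM32_BYTES
         + (n_locals - 1) * MOV_RSP_D8_IMM32_BYTES;
}
```

For the 3-local example this gives 6 bytes with push versus 4 + 7 + 2*8 = 27 bytes with sub/mov; the add rsp, 24 at the end is the same 4 bytes either way.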
One downside to leaving gaps between your locals is that it may defeat load or store merging opportunities later. If you (the compiler) need to copy 2 locals somewhere, you may be able to do it with a single qword load/store if they're adjacent. Compilers don't consider all the future tradeoffs for the function when deciding how to arrange locals on the stack, as far as I know. We want compilers to run quickly, and that means not always back-tracking to consider every possibility for rearranging locals, or various other things. If looking for an optimization would take quadratic time, or multiply the time taken for other steps by a significant constant, it had better be an important optimization. (IDK how hard it might be to implement a search for opportunities to use push, especially if you keep it simple and don't spend time optimizing the stack layout for it.)
However, assuming there are other locals which will be used later, we can allocate them in the gaps between any we spill early. So the space doesn't have to be wasted; we can simply come along later and use mov [rsp+12], eax to store between two 32-bit values we pushed.
A tiny array of long, with non-constant contents
int ext_longarr(long *);
void longarr_arg(long a, long b, long c) {
long arr[] = {a,b,c};
ext_longarr(arr);
}
gcc/clang/ICC/MSVC follow their normal pattern, and use mov stores:
longarr_arg(long, long, long): # @longarr_arg(long, long, long)
sub rsp, 24
mov rax, rsp # this is clang being silly
mov qword ptr [rax], rdi # it could have used [rsp] for the first store at least,
mov qword ptr [rax + 8], rsi # so it didn't need 2 reg,reg MOVs to avoid clobbering RDI before storing it.
mov qword ptr [rax + 16], rdx
mov rdi, rax
call ext_longarr(long*)
add rsp, 24
ret
But it could have stored an array of the args like this:
longarr_arg_handtuned:
push rdx
push rsi
push rdi # leave stack 16B-aligned
mov rdi, rsp
call ext_longarr(long*)
add rsp, 24
ret
With more args, we start to get more noticeable benefits, especially in code-size, when more of the total function is spent storing to the stack. This is a very synthetic example that does nearly nothing else. I could have used volatile int a = 1;, but some compilers treat that extra-specially.
(probably wrong) Stack unwinding for exceptions, and debug formats, I think don't support arbitrary playing around with the stack pointer. So at least before making any call instructions, a function is supposed to have offset RSP as much as it's going to for all future function calls in this function.
But that can't be right, because alloca and C99 variable-length arrays would violate that. There may be some kind of toolchain reason outside the compiler itself for not looking for this kind of optimization.
This gcc mailing list post about disabling -maccumulate-outgoing-args for tune=default (in 2014) was interesting. It pointed out that more push/pop led to larger unwind info (.eh_frame section), but that's metadata that's normally never read (if no exceptions), so larger total binary but smaller / faster code. Related: this shows what -maccumulate-outgoing-args does for gcc code-gen.
Obviously the examples I chose were trivial, where we're pushing the input parameters unmodified. More interesting would be when we calculate some things in registers from the args (and data they point to, and globals, etc.) before having a value we want to spill.
If you have to spill/reload anything between function entry and later pushes, you're creating extra stack-sync uops on Intel. On AMD, it could still be a win to do push rbx / blah blah / mov [rsp-32], eax (spill to the red zone) / blah blah / push rcx / imul ecx, [rsp-24], 12345 (reload the earlier spill from what's still the red-zone, with a different offset).
Mixing push and [rsp] addressing modes is less efficient (on Intel CPUs because of stack-sync uops), so compilers would have to carefully weigh the tradeoffs to make sure they're not making things slower. sub / mov is well-known to work well on all CPUs, even though it can be costly in code-size, especially for small constants.
"It's hard to keep track of the offsets" is a totally bogus argument. It's a computer; re-calculating offsets from a changing reference is something it has to do anyway when using push to put function args on the stack. I think compilers could run into problems (i.e. need more special-case checks and code, making them compile slower) if they had more than 128B of locals, so you couldn't always mov store below RSP (into what's still the red-zone) before moving RSP down with future push instructions.
Compilers already consider multiple tradeoffs, but currently growing the stack frame gradually isn't one of the things they consider. push wasn't as efficient before Pentium-M introduced the stack engine, so efficient push even being available is a somewhat recent change as far as redesigning how compilers think about stack layout choices.
Having a mostly-fixed recipe for prologues and for accessing locals is certainly simpler.