Delphi XE3 compiles 64-bit floating-point code to SSE2 (Streaming SIMD Extensions 2) instructions. The SSE data-move instructions for packed 4-byte singles, such as movaps, require the memory address to be 16-byte aligned; otherwise movups must be used. Similarly, for packed 8-byte doubles, movapd can be used if the address is known to be 16-byte aligned, and movupd if the alignment cannot be trusted. Why bother with this? Because the aligned move instructions can be faster than their unaligned versions, particularly on older processors. If a memory location is not 16-byte aligned, using an aligned move instruction on it (e.g. movapd, movaps) raises a runtime error, i.e. an access violation.
In Delphi XE3, you can compile 64-bit applications and libraries. On x64, the calling convention requires the stack to be 16-byte aligned at every call. The following program prints zero. Note that Addr(test) is the address of the procedure's entry point, so what it demonstrates is the alignment of the code, not of any individual stack variable.
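A quick way to see whether a given address meets the 16-byte requirement is to test its low four bits. The helper below is only a minimal sketch; the name IsAligned16 is made up for this post and is not part of the Delphi RTL.

```pascal
program AlignCheck;
{$APPTYPE CONSOLE}

// Hypothetical helper: True when p is a multiple of 16.
function IsAligned16(p: Pointer): Boolean;
begin
  Result := (NativeUInt(p) and 15) = 0;
end;

begin
  Writeln(IsAligned16(Pointer(32)));  // 32 = 16 * 2, aligned
  Writeln(IsAligned16(Pointer(36)));  // 36 = 16 * 2 + 4, not aligned
end.
```

Passing @v for any variable v tells you whether an aligned move on that variable would be safe.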
program Project7;
{$APPTYPE CONSOLE}
uses
  Windows;
{$R *.res}
var
  blba: byte;

procedure test;
var
  x: byte;
begin
  Writeln('ba bla bla');
end;

begin
  Writeln(NativeUInt(Addr(test)) mod 16);
  Readln;
end.
With SSE, we can process four single-precision floats at the same time. To illustrate this, we can define a Vector type like this.
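The program above takes the address of a procedure. To look at a stack address instead, one can print the address of a local variable. The following is a rough sketch (the names LocalMod16 and x are illustrative); the value it prints depends on how the compiler lays out the frame, so no particular result is promised.

```pascal
program StackProbe;
{$APPTYPE CONSOLE}

// Returns the address of a local variable modulo 16.
function LocalMod16: NativeUInt;
var
  x: Byte;
begin
  x := 0;
  // @x is a stack address, unlike Addr(SomeProcedure), which is a code address.
  Result := NativeUInt(@x) mod 16;
end;

begin
  Writeln('entry point mod 16: ', NativeUInt(Addr(LocalMod16)) mod 16);
  Writeln('local x mod 16:     ', LocalMod16);
end.
```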
type Vector = array [1..4] of Single;
The following function adds the parameters a and b element by element and returns the sum.
function add4(const a, b: Vector): Vector;
asm
  movaps xmm0, [a]
  movaps xmm1, [b]
  addps xmm0, xmm1
  movaps [@result], xmm0
end;
So, if the parameters a and b (and the result) can be trusted to be 16-byte aligned, we can safely use this function. For example,
procedure test();
var
  v1, v2: Vector;
begin
  v1[1] := 1;
  v2[1] := 1;
  v1 := add4(v1, v2); // this works
end;
The above works because, with this layout, v1 and v2 land on 16-byte boundaries in x64. However, the Delphi compiler does not align each stack variable individually, so the following does not work.
procedure test();
var
  dump: Integer; // 4 byte
  v1, v2: Vector;
begin
  v1[1] := 1;
  v2[1] := 1;
  v1 := add4(v1, v2); // access violation
end;
The dummy variable dump takes up 4 bytes and is followed immediately by v1, so v1 sits at an address of the form 16x + 4 and is not 16-byte aligned. Neither is v2, because SizeOf(Vector) is 16 bytes, which shifts it by the same offset.
By the same reasoning, the following works: the four Integer variables together occupy exactly 16 bytes, so the vectors declared after them are aligned again.
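When alignment cannot be guaranteed, as in the failing case above, the unaligned moves mentioned at the start are the fallback. The sketch below mirrors add4 with only the move instructions swapped for movups; the names add4u and AddMisaligned are hypothetical, and a Win64 target is assumed, as for the rest of this post.

```pascal
program AddUnaligned;
{$APPTYPE CONSOLE}

type
  Vector = array [1..4] of Single;

// Hypothetical variant of add4: movups accepts any address (Win64 assumed).
function add4u(const a, b: Vector): Vector;
asm
  movups xmm0, [a]
  movups xmm1, [b]
  addps  xmm0, xmm1
  movups [@result], xmm0
end;

// Same variable layout as the failing example: dump precedes the vectors.
function AddMisaligned: Vector;
var
  dump: Integer;
  v1, v2: Vector;
  i: Integer;
begin
  dump := 0;
  for i := 1 to 4 do
  begin
    v1[i] := i;       // initialise every lane, not just the first
    v2[i] := 10 * i;
  end;
  Result := add4u(v1, v2);
end;

var
  r: Vector;
begin
  r := AddMisaligned;
  Writeln(r[1]:0:1, ' ', r[2]:0:1, ' ', r[3]:0:1, ' ', r[4]:0:1);
end.
```

The trade-off is the one described at the top: the unaligned form is safe everywhere but may be slower than movaps on aligned data.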
procedure test();
var
  v1, v2: Vector; // 16-byte align
  dump: Integer; // 4 byte, 16x + 4
  dump2: Integer; // 4 byte, 16x + 8
  dump3: Integer; // 4 byte, 16x + 12
  dump4: Integer; // 4 byte, 16x + 16
  v3, v4: Vector; // 16-byte align again!
begin
  v3[1] := 1; // initialise the operands that are actually passed
  v4[1] := 1;
  v1 := add4(v3, v4); // this works
end;
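Rather than padding with dummy variables, another option is to reserve a slightly larger raw buffer and round its address up to the next multiple of 16. This is only a sketch of that idea; AlignUp16, PVector and raw are names invented here, not something Delphi provides.

```pascal
program ManualAlign;
{$APPTYPE CONSOLE}

type
  Vector = array [1..4] of Single;
  PVector = ^Vector;

// Hypothetical helper: round p up to the next multiple of 16.
function AlignUp16(p: Pointer): Pointer;
begin
  Result := Pointer((NativeUInt(p) + 15) and not NativeUInt(15));
end;

procedure test;
var
  dump: Integer;                                    // deliberately disturbs the layout
  raw: array [0..2 * SizeOf(Vector) + 14] of Byte;  // two vectors plus up to 15 bytes of padding
  v1, v2: PVector;
begin
  dump := 0;
  v1 := AlignUp16(@raw[0]);  // first aligned slot inside raw
  v2 := v1;
  Inc(v2);                   // next slot, 16 bytes further on
  Writeln(NativeUInt(v1) mod 16, ' ', NativeUInt(v2) mod 16);  // prints 0 0
end;

begin
  test;
end.
```

v1^ and v2^ can then be passed to add4 regardless of what other locals are declared around the buffer.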
In Delphi XE3, you can use {$CODEALIGN 16} to align code to 16 bytes, but this has nothing to do with stack frames. Code alignment only makes sure that procedure entry points are aligned on the specified boundary.
Dynamic arrays and objects live on the heap, where you can call SetMinimumBlockAlignment(mba16Byte) to align heap blocks. Under x64, however, 16-byte alignment is already the requirement, and as the RTL source below shows, the call does nothing when pointers are 8 bytes wide.
procedure SetMinimumBlockAlignment(AMinimumBlockAlignment: TMinimumBlockAlignment);
begin
  {16-byte alignment is required under 64-bit.}
  {$if SizeOf(Pointer) <> 8}
  if AMinimumBlockAlignment <> MinimumBlockAlignment then
  begin
    MinimumBlockAlignment := AMinimumBlockAlignment;
    {Rebuild the size to small block type lookup table}
    BuildBlockTypeLookupTable;
  end;
  {$endif}
end;
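To check what the memory manager actually hands out, one can allocate a block and inspect its address. The sketch below assumes the default Delphi memory manager; HeapBlockMod16 is a made-up name, and a replacement memory manager may behave differently.

```pascal
program HeapAlign;
{$APPTYPE CONSOLE}

type
  Vector = array [1..4] of Single;

// Allocates one Vector-sized block and returns its address modulo 16.
function HeapBlockMod16: NativeUInt;
var
  p: Pointer;
begin
  GetMem(p, SizeOf(Vector));
  try
    Result := NativeUInt(p) mod 16;
  finally
    FreeMem(p);
  end;
end;

begin
  // Takes effect on 32-bit targets; per the source above it is a no-op on 64-bit.
  SetMinimumBlockAlignment(mba16Byte);
  Writeln('heap block mod 16: ', HeapBlockMod16);
end.
```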
References:
1. http://en.wikipedia.org/wiki/Call_stack
2. http://msdn.microsoft.com/en-us/library/aa290049(v=vs.71).aspx
3. http://stackoverflow.com/questions/15801313/how-to-use-align-data-move-sse-in-delphi-xe3