How to use aligned SSE data moves in Delphi XE3?

ACMer

12 years ago

Delphi XE3 compiles 64-bit floating-point code using SSE2 (Streaming SIMD Extensions 2). The SSE data-move instructions for packed 4-byte Singles come in two forms: movaps requires the memory address to be 16-byte aligned, while movups accepts any address. Similarly, for packed 8-byte Doubles, movapd can be used when the address is known to be 16-byte aligned, and movupd when alignment cannot be guaranteed. Why bother? Because the aligned move instructions can be faster than their unaligned counterparts. However, if a memory location is not 16-byte aligned, using an aligned move instruction (e.g. movapd, movaps) on it raises a runtime error, i.e. an access violation.
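Before relying on an aligned move, it can help to check the address at run time. The following is a minimal sketch; the helper name IsAligned16 is made up for illustration and is not part of the RTL.

function IsAligned16(p: Pointer): Boolean;
begin
  // An address is 16-byte aligned when its lowest four bits are all zero.
  Result := (NativeUInt(p) and 15) = 0;
end;

For example, Assert(IsAligned16(@v)) placed before a call that uses movaps on v turns a hard-to-trace access violation into an explicit assertion failure.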

In Delphi XE3 you can compile 64-bit applications and libraries. On x64, the calling convention requires the stack to be 16-byte aligned. The following program always prints zero; note that Addr(test) returns the entry-point address of test, so the value printed is the alignment of the procedure's code rather than of its stack frame.

program Project7;

{$APPTYPE CONSOLE}

uses
  Windows;
{$R *.res}

var
  blba: byte;

procedure test;
var
  x: byte;
begin
  Writeln('ba bla bla');
end;

begin
  Writeln(NativeUInt(Addr(test)) mod 16);
  Readln;
end.

SSE can process four Singles at the same time. To illustrate this, we can define a Vector type like this.

type
  Vector = array [1..4] of Single;

The following function adds the parameters a and b and returns their sum.

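function add4(const a, b: Vector): Vector;
asm
  movaps xmm0, [a]        // load 4 Singles from a (must be 16-byte aligned)
  movaps xmm1, [b]        // load 4 Singles from b (must be 16-byte aligned)
  addps xmm0, xmm1        // xmm0 := a + b, element by element
  movaps [@result], xmm0  // store the 4 sums into the result
end;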

So, as long as the parameters a and b are guaranteed to be 16-byte aligned, this function is safe to use. For example:

procedure test();
var
  v1, v2: vector;
begin
  v1[1] := 1;
  v2[1] := 1;
  v1 := add4(v1, v2);  // this works
end;

The above works because v1 and v2 are 16-byte aligned in this x64 stack frame. However, the Delphi compiler does not align each stack variable individually, so the following does not work.

procedure test();
var
  dump: Integer; // 4 byte
  v1, v2: vector;
begin
  v1[1] := 1;
  v2[1] := 1;
  v1 := add4(v1, v2);  // access violation
end;

The dummy variable dump takes up 4 bytes and v1 follows it immediately, so v1 is not 16-byte aligned (its address is 16x + 4). Neither is v2, because SizeOf(Vector) is 16 bytes.
For the same reason, it is no surprise that the following works.

procedure test();
var
  v1, v2: vector; // 16-byte aligned
  dump: Integer;  // 4 bytes, ends at 16x + 4
  dump2: Integer; // 4 bytes, ends at 16x + 8
  dump3: Integer; // 4 bytes, ends at 16x + 12
  dump4: Integer; // 4 bytes, ends at 16x + 16
  v3, v4: vector; // 16-byte aligned again!
begin
  v3[1] := 1;
  v4[1] := 1;
  v1 := add4(v3, v4);  // this works
end;
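
When the alignment of the arguments cannot be guaranteed, one option is an unaligned variant of add4. The following is a minimal sketch, assuming the same operand syntax that add4 uses; the name add4u is made up for illustration.

function add4u(const a, b: Vector): Vector;
asm
  movups xmm0, [a]        // unaligned load: any address is accepted
  movups xmm1, [b]        // unaligned load: any address is accepted
  addps xmm0, xmm1        // register-to-register add has no alignment requirement
  movups [@result], xmm0  // unaligned store, in case the result is misaligned too
end;

Under that assumption, add4u should accept vectors that sit at 16x + 4, such as those in the failing example, at the cost of possibly slower loads and stores.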

In Delphi XE3, you can use {$CODEALIGN 16} to align code on 16-byte boundaries, but this has nothing to do with stack frames. Code alignment only ensures that procedure entry points are aligned on the specified boundary.

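Dynamic arrays and objects live on the heap, where you can call SetMinimumBlockAlignment(mba16Byte) to align heap blocks. Under x64, however, 16-byte alignment appears to be mandatory already: in the RTL source below, the body of the procedure is compiled only when pointers are not 8 bytes wide.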

procedure SetMinimumBlockAlignment(AMinimumBlockAlignment: TMinimumBlockAlignment);
begin
{16-byte alignment is required under 64-bit.}
{$if SizeOf(Pointer) <> 8}
  if AMinimumBlockAlignment <> MinimumBlockAlignment then
  begin
    MinimumBlockAlignment := AMinimumBlockAlignment;
    {Rebuild the size to small block type lookup table}
    BuildBlockTypeLookupTable;
  end;
{$endif}
end;
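
Rather than taking heap alignment on trust, it can be checked directly. The following is a minimal sketch that prints the address remainders for a dynamic array and for a raw GetMem block; the program name and the variable names are made up for illustration, and the values printed depend on the target platform and memory manager.

program HeapAlignCheck;

{$APPTYPE CONSOLE}

var
  data: array of Single;
  p: Pointer;

begin
  // Address of the first element of a heap-allocated dynamic array.
  SetLength(data, 4);
  Writeln('dynamic array data mod 16: ', NativeUInt(@data[0]) mod 16);

  // Address of a raw block returned by the memory manager.
  GetMem(p, 64);
  try
    Writeln('GetMem block mod 16: ', NativeUInt(p) mod 16);
  finally
    FreeMem(p);
  end;
  Readln;
end.

A remainder of zero means the address is safe for movaps and movapd; any other value means an unaligned move is needed for that block.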