Skip to content

[SIMD] Comparisons: Add Vector256 for >, <, ==, !=, >=, <= #578

@Nucs

Description

@Nucs

Overview

Add SIMD vectorization to comparison operations. Currently, comparison kernels use scalar IL opcodes despite having the IL kernel infrastructure in place.

Parent issue: #545

Current State

Aspect Status
IL kernel generation ✅ Exists (ILKernelGenerator.cs:3156-3790)
Execution path classification ✅ SimdFull, ScalarRight, General
Vector256 SIMD NOT USED
Actual comparison Scalar IL opcodes (Ceq, Clt, Cgt)

Current implementation (line 3698-3762):

case ComparisonOp.Greater:
    il.Emit(OpCodes.Cgt);  // Scalar comparison → 0 or 1

The Challenge

Vector256.GreaterThan() returns a mask, not bools:

Vector256<int> a = Vector256.Create(1, 2, 3, 4, 5, 6, 7, 8);
Vector256<int> b = Vector256.Create(5, 5, 5, 5, 5, 5, 5, 5);
Vector256<int> mask = Vector256.GreaterThan(a, b);
// mask = [0, 0, 0, 0, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF]
//        (false, false, false, false, true, true, true, true)

Need to convert mask lanes (0x00000000 / 0xFFFFFFFF) to bool array.

Task List

  • Implement SIMD comparison loop

    • Use Vector256.GreaterThan/LessThan/Equals etc.
    • Convert mask to bool array via MoveMask + bit extraction
  • Handle all comparison types

    • Equal: Vector256.Equals()
    • NotEqual: ~Vector256.Equals()
    • Less: Vector256.LessThan()
    • LessEqual: Vector256.LessThanOrEqual()
    • Greater: Vector256.GreaterThan()
    • GreaterEqual: Vector256.GreaterThanOrEqual()
  • Mask-to-bool conversion strategies

    • Option A: MoveMask + bit extraction (fastest)
    • Option B: ConditionalSelect to 0/1, then narrow
    • Option C: Per-lane extraction (fallback)

Implementation Approach

Option A: MoveMask Extraction (Recommended for AVX2)

// For int32 (8 elements per Vector256):
var mask = Vector256.GreaterThan(lhsVec, rhsVec);  // Vector256<int>

// MoveMask extracts MSB of each byte (32 bits total)
uint moveMask = (uint)Avx2.MoveMask(mask.AsByte());

// For int32, relevant bits are at positions 3, 7, 11, 15, 19, 23, 27, 31
// (MSB of each 4-byte lane)
result[i+0] = ((moveMask >> 3) & 1) != 0;
result[i+1] = ((moveMask >> 7) & 1) != 0;
result[i+2] = ((moveMask >> 11) & 1) != 0;
// ... etc for all 8 lanes

Option B: ConditionalSelect + Narrow

var mask = Vector256.GreaterThan(lhsVec, rhsVec);
var ones = Vector256.Create(1);
var zeros = Vector256<int>.Zero;
var boolAsInt = Vector256.ConditionalSelect(mask, ones, zeros);
// boolAsInt = [0, 0, 0, 0, 1, 1, 1, 1]

// Narrow to bytes and store
var narrow1 = Avx2.PackSignedSaturate(boolAsInt.GetLower(), boolAsInt.GetUpper());
// Continue narrowing to bytes...

Type-Specific Handling

Type Elements/Vector Extraction Method
byte 32 MoveMask direct (1 bit per element)
short 16 MoveMask + stride 2
int 8 MoveMask + stride 4
long 4 MoveMask + stride 8
float 8 Same as int
double 4 Same as long

Files to Modify

File Changes
ILKernelGenerator.cs:3222-3339 Replace scalar loops with SIMD
ILKernelGenerator.cs:3346-3409 New EmitComparisonSimdLoop()
ILKernelGenerator.cs:3698-3762 EmitVectorComparisonOperation()

Code Structure

private static void EmitComparisonSimdLoop(ILGenerator il, ComparisonKernelKey key)
{
    int vectorCount = GetVectorCount(key.LhsType);
    
    // Locals
    var locI = il.DeclareLocal(typeof(int));
    var locVectorEnd = il.DeclareLocal(typeof(int));
    
    // vectorEnd = totalSize - vectorCount
    // ...
    
    // SIMD loop
    var lblSimdLoop = il.DefineLabel();
    il.MarkLabel(lblSimdLoop);
    // if (i > vectorEnd) goto tail
    
    // Load vectors
    EmitVectorLoad(il, key.LhsType);  // lhs
    EmitVectorLoad(il, key.RhsType);  // rhs
    
    // Compare
    EmitVectorComparison(il, key.Op, key.ComparisonType);
    
    // Extract mask to bools
    EmitMaskToBoolExtraction(il, key.ComparisonType, vectorCount);
    
    // i += vectorCount
    // goto lblSimdLoop
    
    // Tail loop (scalar fallback for remainder)
    // ...
}

Benchmarks

[Benchmark] public NDArray Greater_10M() => _a > _b;
[Benchmark] public NDArray Equal_10M() => _a == _b;
[Benchmark] public NDArray LessEqual_10M() => _a <= _b;

Expected Results

Operation Current (scalar) With SIMD Speedup
a > b (10M int32) ~15 ms ~5-7 ms 2-3×
a == b (10M float64) ~20 ms ~7-10 ms 2-3×

Success Criteria

  1. SIMD comparisons ≥2× faster than current scalar
  2. All existing comparison tests pass
  3. Correct handling of NaN comparisons (float/double)
  4. Works with all execution paths (SimdFull, ScalarRight, General)

Metadata

Metadata

Assignees

No one assigned

    Labels

    coreInternal engine: Shape, Storage, TensorEngine, iteratorsenhancementNew feature or requestperformancePerformance improvements or optimizations

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions