-
Notifications
You must be signed in to change notification settings - Fork 205
Open
Labels
coreInternal engine: Shape, Storage, TensorEngine, iteratorsInternal engine: Shape, Storage, TensorEngine, iteratorsenhancementNew feature or requestNew feature or requestperformancePerformance improvements or optimizationsPerformance improvements or optimizations
Description
Overview
Add SIMD vectorization to comparison operations. Currently, comparison kernels use scalar IL opcodes despite having the IL kernel infrastructure in place.
Parent issue: #545
Current State
| Aspect | Status |
|---|---|
| IL kernel generation | ✅ Exists (ILKernelGenerator.cs:3156-3790) |
| Execution path classification | ✅ SimdFull, ScalarRight, General |
| Vector256 SIMD | ❌ NOT USED |
| Actual comparison | Scalar IL opcodes (Ceq, Clt, Cgt) |
Current implementation (line 3698-3762):
case ComparisonOp.Greater:
il.Emit(OpCodes.Cgt); // Scalar comparison → 0 or 1The Challenge
Vector256.GreaterThan() returns a mask, not bools:
Vector256<int> a = Vector256.Create(1, 2, 3, 4, 5, 6, 7, 8);
Vector256<int> b = Vector256.Create(5, 5, 5, 5, 5, 5, 5, 5);
Vector256<int> mask = Vector256.GreaterThan(a, b);
// mask = [0, 0, 0, 0, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF]
// (false, false, false, false, true, true, true, true)Need to convert mask lanes (0x00000000 / 0xFFFFFFFF) to bool array.
Task List
-
Implement SIMD comparison loop
- Use
Vector256.GreaterThan/LessThan/Equalsetc. - Convert mask to bool array via MoveMask + bit extraction
- Use
-
Handle all comparison types
- Equal:
Vector256.Equals() - NotEqual:
~Vector256.Equals() - Less:
Vector256.LessThan() - LessEqual:
Vector256.LessThanOrEqual() - Greater:
Vector256.GreaterThan() - GreaterEqual:
Vector256.GreaterThanOrEqual()
- Equal:
-
Mask-to-bool conversion strategies
- Option A: MoveMask + bit extraction (fastest)
- Option B: ConditionalSelect to 0/1, then narrow
- Option C: Per-lane extraction (fallback)
Implementation Approach
Option A: MoveMask Extraction (Recommended for AVX2)
// For int32 (8 elements per Vector256):
var mask = Vector256.GreaterThan(lhsVec, rhsVec); // Vector256<int>
// MoveMask extracts MSB of each byte (32 bits total)
uint moveMask = (uint)Avx2.MoveMask(mask.AsByte());
// For int32, relevant bits are at positions 3, 7, 11, 15, 19, 23, 27, 31
// (MSB of each 4-byte lane)
result[i+0] = ((moveMask >> 3) & 1) != 0;
result[i+1] = ((moveMask >> 7) & 1) != 0;
result[i+2] = ((moveMask >> 11) & 1) != 0;
// ... etc for all 8 lanesOption B: ConditionalSelect + Narrow
var mask = Vector256.GreaterThan(lhsVec, rhsVec);
var ones = Vector256.Create(1);
var zeros = Vector256<int>.Zero;
var boolAsInt = Vector256.ConditionalSelect(mask, ones, zeros);
// boolAsInt = [0, 0, 0, 0, 1, 1, 1, 1]
// Narrow to bytes and store
var narrow1 = Avx2.PackSignedSaturate(boolAsInt.GetLower(), boolAsInt.GetUpper());
// Continue narrowing to bytes...Type-Specific Handling
| Type | Elements/Vector | Extraction Method |
|---|---|---|
| byte | 32 | MoveMask direct (1 bit per element) |
| short | 16 | MoveMask + stride 2 |
| int | 8 | MoveMask + stride 4 |
| long | 4 | MoveMask + stride 8 |
| float | 8 | Same as int |
| double | 4 | Same as long |
Files to Modify
| File | Changes |
|---|---|
ILKernelGenerator.cs:3222-3339 |
Replace scalar loops with SIMD |
ILKernelGenerator.cs:3346-3409 |
New EmitComparisonSimdLoop() |
ILKernelGenerator.cs:3698-3762 |
EmitVectorComparisonOperation() |
Code Structure
private static void EmitComparisonSimdLoop(ILGenerator il, ComparisonKernelKey key)
{
int vectorCount = GetVectorCount(key.LhsType);
// Locals
var locI = il.DeclareLocal(typeof(int));
var locVectorEnd = il.DeclareLocal(typeof(int));
// vectorEnd = totalSize - vectorCount
// ...
// SIMD loop
var lblSimdLoop = il.DefineLabel();
il.MarkLabel(lblSimdLoop);
// if (i > vectorEnd) goto tail
// Load vectors
EmitVectorLoad(il, key.LhsType); // lhs
EmitVectorLoad(il, key.RhsType); // rhs
// Compare
EmitVectorComparison(il, key.Op, key.ComparisonType);
// Extract mask to bools
EmitMaskToBoolExtraction(il, key.ComparisonType, vectorCount);
// i += vectorCount
// goto lblSimdLoop
// Tail loop (scalar fallback for remainder)
// ...
}Benchmarks
[Benchmark] public NDArray Greater_10M() => _a > _b;
[Benchmark] public NDArray Equal_10M() => _a == _b;
[Benchmark] public NDArray LessEqual_10M() => _a <= _b;Expected Results
| Operation | Current (scalar) | With SIMD | Speedup |
|---|---|---|---|
| a > b (10M int32) | ~15 ms | ~5-7 ms | 2-3× |
| a == b (10M float64) | ~20 ms | ~7-10 ms | 2-3× |
Success Criteria
- SIMD comparisons ≥2× faster than current scalar
- All existing comparison tests pass
- Correct handling of NaN comparisons (float/double)
- Works with all execution paths (SimdFull, ScalarRight, General)
Reactions are currently unavailable
Metadata
Metadata
Assignees
Labels
coreInternal engine: Shape, Storage, TensorEngine, iteratorsInternal engine: Shape, Storage, TensorEngine, iteratorsenhancementNew feature or requestNew feature or requestperformancePerformance improvements or optimizationsPerformance improvements or optimizations