[SIMD] Comparisons: Add Vector256 for >, <, ==, !=, >=, <=

## Overview

Add SIMD vectorization to comparison operations. Currently, comparison kernels use scalar IL opcodes despite having the IL kernel infrastructure in place.

**Parent issue:** #545

## Current State

| Aspect | Status |
|--------|--------|
| IL kernel generation | ✅ Exists (`ILKernelGenerator.cs:3156-3790`) |
| Execution path classification | ✅ SimdFull, ScalarRight, General |
| **Vector256 SIMD** | ❌ **NOT USED** |
| Actual comparison | Scalar IL opcodes (Ceq, Clt, Cgt) |

**Current implementation (line 3698-3762):**
```csharp
case ComparisonOp.Greater:
    il.Emit(OpCodes.Cgt);  // Scalar comparison → 0 or 1
```

## The Challenge

`Vector256.GreaterThan()` returns a **mask**, not bools:
```csharp
Vector256<int> a = Vector256.Create(1, 2, 3, 4, 5, 6, 7, 8);
Vector256<int> b = Vector256.Create(5, 5, 5, 5, 5, 5, 5, 5);
Vector256<int> mask = Vector256.GreaterThan(a, b);
// mask = [0, 0, 0, 0, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF]
//        (false, false, false, false, true, true, true, true)
```

Need to convert mask lanes (0x00000000 / 0xFFFFFFFF) to bool array.

## Task List

- [ ] **Implement SIMD comparison loop**
  - Use `Vector256.GreaterThan/LessThan/Equals` etc.
  - Convert mask to bool array via MoveMask + bit extraction
  
- [ ] **Handle all comparison types**
  - Equal: `Vector256.Equals()`
  - NotEqual: `~Vector256.Equals()`
  - Less: `Vector256.LessThan()`
  - LessEqual: `Vector256.LessThanOrEqual()`
  - Greater: `Vector256.GreaterThan()`
  - GreaterEqual: `Vector256.GreaterThanOrEqual()`

- [ ] **Mask-to-bool conversion strategies**
  - Option A: MoveMask + bit extraction (fastest)
  - Option B: ConditionalSelect to 0/1, then narrow
  - Option C: Per-lane extraction (fallback)

## Implementation Approach

### Option A: MoveMask Extraction (Recommended for AVX2)

```csharp
// For int32 (8 elements per Vector256):
var mask = Vector256.GreaterThan(lhsVec, rhsVec);  // Vector256<int>

// MoveMask extracts MSB of each byte (32 bits total)
uint moveMask = (uint)Avx2.MoveMask(mask.AsByte());

// For int32, relevant bits are at positions 3, 7, 11, 15, 19, 23, 27, 31
// (MSB of each 4-byte lane)
result[i+0] = ((moveMask >> 3) & 1) != 0;
result[i+1] = ((moveMask >> 7) & 1) != 0;
result[i+2] = ((moveMask >> 11) & 1) != 0;
// ... etc for all 8 lanes
```

### Option B: ConditionalSelect + Narrow

```csharp
var mask = Vector256.GreaterThan(lhsVec, rhsVec);
var ones = Vector256.Create(1);
var zeros = Vector256<int>.Zero;
var boolAsInt = Vector256.ConditionalSelect(mask, ones, zeros);
// boolAsInt = [0, 0, 0, 0, 1, 1, 1, 1]

// Narrow to bytes and store
var narrow1 = Avx2.PackSignedSaturate(boolAsInt.GetLower(), boolAsInt.GetUpper());
// Continue narrowing to bytes...
```

### Type-Specific Handling

| Type | Elements/Vector | Extraction Method |
|------|-----------------|-------------------|
| byte | 32 | MoveMask direct (1 bit per element) |
| short | 16 | MoveMask + stride 2 |
| int | 8 | MoveMask + stride 4 |
| long | 4 | MoveMask + stride 8 |
| float | 8 | Same as int |
| double | 4 | Same as long |

## Files to Modify

| File | Changes |
|------|---------|
| `ILKernelGenerator.cs:3222-3339` | Replace scalar loops with SIMD |
| `ILKernelGenerator.cs:3346-3409` | New `EmitComparisonSimdLoop()` |
| `ILKernelGenerator.cs:3698-3762` | `EmitVectorComparisonOperation()` |

## Code Structure

```csharp
private static void EmitComparisonSimdLoop(ILGenerator il, ComparisonKernelKey key)
{
    int vectorCount = GetVectorCount(key.LhsType);
    
    // Locals
    var locI = il.DeclareLocal(typeof(int));
    var locVectorEnd = il.DeclareLocal(typeof(int));
    
    // vectorEnd = totalSize - vectorCount
    // ...
    
    // SIMD loop
    var lblSimdLoop = il.DefineLabel();
    il.MarkLabel(lblSimdLoop);
    // if (i > vectorEnd) goto tail
    
    // Load vectors
    EmitVectorLoad(il, key.LhsType);  // lhs
    EmitVectorLoad(il, key.RhsType);  // rhs
    
    // Compare
    EmitVectorComparison(il, key.Op, key.ComparisonType);
    
    // Extract mask to bools
    EmitMaskToBoolExtraction(il, key.ComparisonType, vectorCount);
    
    // i += vectorCount
    // goto lblSimdLoop
    
    // Tail loop (scalar fallback for remainder)
    // ...
}
```

## Benchmarks

```csharp
[Benchmark] public NDArray Greater_10M() => _a > _b;
[Benchmark] public NDArray Equal_10M() => _a == _b;
[Benchmark] public NDArray LessEqual_10M() => _a <= _b;
```

## Expected Results

| Operation | Current (scalar) | With SIMD | Speedup |
|-----------|------------------|-----------|---------|
| a > b (10M int32) | ~15 ms | ~5-7 ms | **2-3×** |
| a == b (10M float64) | ~20 ms | ~7-10 ms | **2-3×** |

## Success Criteria

1. SIMD comparisons ≥2× faster than current scalar
2. All existing comparison tests pass
3. Correct handling of NaN comparisons (float/double)
4. Works with all execution paths (SimdFull, ScalarRight, General)

File	Changes
`ILKernelGenerator.cs:3222-3339`	Replace scalar loops with SIMD
`ILKernelGenerator.cs:3346-3409`	New `EmitComparisonSimdLoop()`
`ILKernelGenerator.cs:3698-3762`	`EmitVectorComparisonOperation()`

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[SIMD] Comparisons: Add Vector256 for >, <, ==, !=, >=, <= #578

Overview

Current State

The Challenge

Task List

Implementation Approach

Option A: MoveMask Extraction (Recommended for AVX2)

Option B: ConditionalSelect + Narrow

Type-Specific Handling

Files to Modify

Code Structure

Benchmarks

Expected Results

Success Criteria

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Aspect	Status
IL kernel generation	✅ Exists (`ILKernelGenerator.cs:3156-3790`)
Execution path classification	✅ SimdFull, ScalarRight, General
Vector256 SIMD	❌ NOT USED
Actual comparison	Scalar IL opcodes (Ceq, Clt, Cgt)

Type	Elements/Vector	Extraction Method
byte	32	MoveMask direct (1 bit per element)
short	16	MoveMask + stride 2
int	8	MoveMask + stride 4
long	4	MoveMask + stride 8
float	8	Same as int
double	4	Same as long

Operation	Current (scalar)	With SIMD	Speedup
a > b (10M int32)	~15 ms	~5-7 ms	2-3×
a == b (10M float64)	~20 ms	~7-10 ms	2-3×

[SIMD] Comparisons: Add Vector256 for >, <, ==, !=, >=, <= #578

Description

Overview

Current State

The Challenge

Task List

Implementation Approach

Option A: MoveMask Extraction (Recommended for AVX2)

Option B: ConditionalSelect + Narrow

Type-Specific Handling

Files to Modify

Code Structure

Benchmarks

Expected Results

Success Criteria

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions