Add SobelFilter to SVE microbenchmark #5040

Open
ylpoonlg wants to merge 3 commits into dotnet:main from ylpoonlg:github-sobelfilter

@ylpoonlg
Contributor

Performance Results

Run on Neoverse-V2

| Method | Size | Mean | Error | StdDev | Median | Min | Max | Allocated |
|---|---:|---:|---:|---:|---:|---:|---:|---:|
| Scalar | 9 | 106.89 ns | 0.027 ns | 0.023 ns | 106.89 ns | 106.86 ns | 106.94 ns | - |
| Vector128SobelFilter | 9 | 47.21 ns | 0.007 ns | 0.006 ns | 47.21 ns | 47.20 ns | 47.22 ns | - |
| SveSobelFilter | 9 | 28.89 ns | 0.134 ns | 0.125 ns | 28.81 ns | 28.77 ns | 29.07 ns | - |
| Scalar | 64 | 7,125.25 ns | 2.823 ns | 2.503 ns | 7,124.19 ns | 7,122.83 ns | 7,130.43 ns | - |
| Vector128SobelFilter | 64 | 1,255.35 ns | 0.913 ns | 0.854 ns | 1,255.21 ns | 1,254.21 ns | 1,256.87 ns | - |
| SveSobelFilter | 64 | 1,339.70 ns | 0.280 ns | 0.219 ns | 1,339.65 ns | 1,339.41 ns | 1,340.14 ns | - |
| Scalar | 527 | 507,148.17 ns | 324.567 ns | 287.720 ns | 507,050.40 ns | 506,733.19 ns | 507,636.12 ns | - |
| Vector128SobelFilter | 527 | 73,998.18 ns | 166.540 ns | 147.634 ns | 73,971.34 ns | 73,795.42 ns | 74,256.96 ns | - |
| SveSobelFilter | 527 | 100,770.21 ns | 1,836.759 ns | 1,628.239 ns | 100,470.79 ns | 96,333.15 ns | 103,294.79 ns | - |
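
To make the table easier to compare, the sketch below derives speedups from the reported mean times. It only does arithmetic on the numbers above; the class and method names are hypothetical and not part of the PR. From these means, SVE takes about 0.61x the Vector128 time at Size 9, but about 1.07x at Size 64 and about 1.36x at Size 527.

```csharp
using System;

public static class SobelSpeedups
{
    // Mean times in nanoseconds, copied from the results table.
    public static readonly (int Size, double Scalar, double Vector128, double Sve)[] Means =
    {
        (9, 106.89, 47.21, 28.89),
        (64, 7125.25, 1255.35, 1339.70),
        (527, 507148.17, 73998.18, 100770.21),
    };

    // Speedup of a candidate over a baseline: baseline time divided by candidate time.
    public static double Speedup(double baselineNs, double candidateNs) => baselineNs / candidateNs;

    public static void Main()
    {
        foreach (var (size, scalar, v128, sve) in Means)
        {
            Console.WriteLine(
                $"Size {size}: Vector128 {Speedup(scalar, v128):F2}x and SVE {Speedup(scalar, sve):F2}x over Scalar; " +
                $"SVE/Vector128 time ratio {sve / v128:F2}");
        }
    }
}
```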

cc @dotnet/arm64-contrib @SwapnilGaikwad @LoopedBard3

Member

@LoopedBard3 left a comment

Looks good, thanks for the benchmarks!

Contributor

Copilot AI left a comment

Pull request overview

Adds a new SVE-focused microbenchmark for a 2D Sobel filter convolution, enabling comparison of scalar vs AdvSimd (Vector128) vs SVE implementations on Arm64/SVE-capable systems.

Changes:

  • Introduces SobelFilter microbenchmark with Scalar, Vector128SobelFilter, and SveSobelFilter benchmarks.
  • Adds setup/verification scaffolding and Sobel kernel initialization for repeatable runs.
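
The overview mentions verification scaffolding without showing it. As a rough illustration only, a check that a vectorized result matches the scalar result could look like the sketch below. `AreClose` and the tolerance value are hypothetical and are not taken from the PR; a tolerance is assumed because a vectorized implementation may round differently, for example if it uses fused multiply-add.

```csharp
using System;

public static class SobelVerification
{
    // Hypothetical helper: true when both buffers have the same length and every
    // pair of elements differs by at most `tolerance`. A NaN on either side fails.
    public static bool AreClose(ReadOnlySpan<float> expected, ReadOnlySpan<float> actual, float tolerance)
    {
        if (expected.Length != actual.Length) return false;
        for (int i = 0; i < expected.Length; i++)
        {
            if (!(MathF.Abs(expected[i] - actual[i]) <= tolerance)) return false;
        }
        return true;
    }

    public static void Main()
    {
        float[] scalar = { 1.0f, -2.0f, 0.5f };
        float[] vectorized = { 1.0f, -2.0000002f, 0.5f }; // differs only in the last bit
        Console.WriteLine(AreClose(scalar, vectorized, 1e-5f));
    }
}
```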

Comment on lines +196 to +200

    Vector<float> resVec;
    // Load coefficients of the filter into vectors.
    Vector<float> kxVec = Sve.LoadVector((Vector<float>)Sve.CreateWhileLessThanMask32Bit(0, 3), kx);
    Vector<float> kyVec = Sve.LoadVector((Vector<float>)Sve.CreateWhileLessThanMask32Bit(0, 3), ky);
    for (int j = 0; j < img_size; j++)

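
The two coefficient loads rely on a "while less than" predicate so that only three lanes are read. The sketch below is a scalar model of that behavior, not SVE code: lane k is active while `start + k < end`, and a predicated load leaves inactive lanes at zero without touching memory. The helper names, the lane count of 4, and the coefficient values are assumptions made for illustration.

```csharp
using System;

public static class PredicateSketch
{
    // Scalar model of a while-less-than predicate: lane k is active while start + k < end.
    public static bool[] WhileLessThan(int start, int end, int lanes)
    {
        var mask = new bool[lanes];
        for (int k = 0; k < lanes; k++) mask[k] = (long)start + k < end;
        return mask;
    }

    // Scalar model of a predicated load: active lanes read from the source,
    // inactive lanes stay zero and never index into the source.
    public static float[] MaskedLoad(bool[] mask, float[] source, int offset)
    {
        var result = new float[mask.Length];
        for (int k = 0; k < mask.Length; k++)
        {
            if (mask[k]) result[k] = source[offset + k];
        }
        return result;
    }

    public static void Main()
    {
        float[] kx = { -1f, 0f, 1f };                 // assumed coefficient values
        bool[] mask = WhileLessThan(0, 3, lanes: 4);  // assumed 4 lanes (128-bit vector of float)
        float[] kxVec = MaskedLoad(mask, kx, 0);      // three coefficients plus one zeroed lane
        Console.WriteLine(string.Join(", ", kxVec));
    }
}
```

Because the fourth lane is inactive, the three-element array is never read out of bounds even though the modeled vector holds four floats.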
Comment on lines +206 to +213

    for (int i = 0; i < out_size; i += cntw)
    {
        Vector<float> pRow = (Vector<float>)Sve.CreateWhileLessThanMask32Bit(i, out_size);

        // Load input elements from the next 3 columns.
        Vector<float> col0 = Sve.LoadVector(pRow, in_ptr + i);
        Vector<float> col1 = Sve.LoadVector(pRow, in_ptr + i + 1);
        Vector<float> col2 = Sve.LoadVector(pRow, in_ptr + i + 2);

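
This loop steps by `cntw` and rebuilds the predicate on every iteration, so the final partial vector is handled without a separate remainder loop. The sketch below models that in scalar code, assuming `cntw` is the number of 32-bit lanes per vector. It also checks the highest column index that the `+ 2` load can touch in one row, which works out to `out_size + 1`, that is `img_size - 1`. All names here are hypothetical.

```csharp
using System;
using System.Collections.Generic;

public static class TailPredicationSketch
{
    // For "for (i = 0; i < outSize; i += lanes)" with a predicate over [i, outSize),
    // return how many lanes are active in each iteration.
    public static List<int> ActiveLanesPerIteration(int outSize, int lanes)
    {
        var counts = new List<int>();
        for (int i = 0; i < outSize; i += lanes)
        {
            counts.Add(Math.Min(lanes, outSize - i));
        }
        return counts;
    }

    // Highest column index read within one row by the three shifted loads.
    public static int MaxColumnRead(int outSize, int lanes)
    {
        int max = -1;
        for (int i = 0; i < outSize; i += lanes)
        {
            int active = Math.Min(lanes, outSize - i);
            max = Math.Max(max, i + 2 + active - 1); // last active lane of the "+ 2" load
        }
        return max;
    }

    public static void Main()
    {
        // Size = 9 gives out_size = 7; 4 lanes is an assumed vector width.
        Console.WriteLine(string.Join(", ", ActiveLanesPerIteration(7, 4)));
        Console.WriteLine(MaxColumnRead(7, 4));
    }
}
```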
Comment on lines +229 to +236

    for (int i = 0; i < out_size; i += cntw)
    {
        Vector<float> pRow = (Vector<float>)Sve.CreateWhileLessThanMask32Bit(i, out_size);

        // Load input elements from the next 3 rows.
        Vector<float> row0 = Sve.LoadVector(pRow, in_ptr + i);
        Vector<float> row1 = Sve.LoadVector(pRow, in_ptr + i + out_size);
        Vector<float> row2 = Sve.LoadVector(pRow, in_ptr + i + 2 * out_size);

Comment on lines +69 to +100

    int img_size = Size;
    // The output image is 2 pixels smaller in each direction.
    int out_size = img_size - 2;
    fixed (float* input = _source, temp = _temp, output = _result)
    fixed (float* kx = _kx, ky = _ky)
    {
        // Convolve the horizontal component first.
        // The result is saved to the temp array.
        for (int j = 0; j < img_size; j++)
        {
            for (int i = 0; i < out_size; i++)
            {
                float res = 0.0F;
                for (int k = 0; k < 3; k++)
                {
                    res += kx[k] * input[j * img_size + i + k];
                }
                temp[j * out_size + i] = res;
            }
        }
        // Then convolve the vertical component,
        // using the temp array as input.
        for (int j = 0; j < out_size; j++)
        {
            for (int i = 0; i < out_size; i++)
            {
                float res = 0.0F;
                for (int k = 0; k < 3; k++)
                {
                    res += ky[k] * temp[(j + k) * out_size + i];
                }
                output[j * out_size + i] = res;
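
The scalar excerpt applies the filter as two 1D passes, a horizontal pass with `kx` into a temp buffer of `img_size` rows by `out_size` columns, then a vertical pass with `ky`. The sketch below is a safe-code restatement of those loops next to a direct 3x3 pass that uses the product `ky[r] * kx[c]` as its kernel, to show that the two agree while the two-pass form needs about 6 rather than 9 multiply-adds per output pixel. The coefficient values and the test image are assumptions; the PR excerpt does not show the contents of `_kx` and `_ky`.

```csharp
using System;

public static class SeparableSobelSketch
{
    // Two-pass version, mirroring the scalar loops in the excerpt.
    public static float[] TwoPass(float[] input, int imgSize, float[] kx, float[] ky)
    {
        int outSize = imgSize - 2;
        var temp = new float[imgSize * outSize];    // imgSize rows, outSize columns
        var output = new float[outSize * outSize];
        for (int j = 0; j < imgSize; j++)
            for (int i = 0; i < outSize; i++)
            {
                float res = 0f;
                for (int k = 0; k < 3; k++) res += kx[k] * input[j * imgSize + i + k];
                temp[j * outSize + i] = res;
            }
        for (int j = 0; j < outSize; j++)
            for (int i = 0; i < outSize; i++)
            {
                float res = 0f;
                for (int k = 0; k < 3; k++) res += ky[k] * temp[(j + k) * outSize + i];
                output[j * outSize + i] = res;
            }
        return output;
    }

    // Direct 3x3 version using ky[r] * kx[c] as the kernel entry at row r, column c.
    public static float[] Direct(float[] input, int imgSize, float[] kx, float[] ky)
    {
        int outSize = imgSize - 2;
        var output = new float[outSize * outSize];
        for (int j = 0; j < outSize; j++)
            for (int i = 0; i < outSize; i++)
            {
                float res = 0f;
                for (int r = 0; r < 3; r++)
                    for (int c = 0; c < 3; c++)
                        res += ky[r] * kx[c] * input[(j + r) * imgSize + i + c];
                output[j * outSize + i] = res;
            }
        return output;
    }

    // Hypothetical test image: small integers, so every float sum is exact.
    public static float[] MakeImage(int imgSize)
    {
        var image = new float[imgSize * imgSize];
        for (int n = 0; n < image.Length; n++) image[n] = (n * 7) % 11;
        return image;
    }

    public static void Main()
    {
        const int imgSize = 5;
        float[] kx = { -1f, 0f, 1f }; // assumed horizontal coefficients
        float[] ky = { 1f, 2f, 1f };  // assumed vertical coefficients
        float[] image = MakeImage(imgSize);
        float[] a = TwoPass(image, imgSize, kx, ky);
        float[] b = Direct(image, imgSize, kx, ky);
        bool same = a.Length == b.Length;
        for (int n = 0; same && n < a.Length; n++) same = a[n] == b[n];
        Console.WriteLine(same);
    }
}
```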