Eliminate redundant NCHW↔NHWC permute_copy and NHWC-safe view_copy transposes in ToTosaMemoryFormatPass (#18167)

Open
3l1 wants to merge 11 commits into pytorch:main from 3l1:export-D96432610

Conversation


3l1 commented Mar 13, 2026

Summary:

Two optimizations in ToTosaMemoryFormatPass to reduce TOSA TRANSPOSE nodes:

  1. NHWC-safe reshape detection: When a 4D→4D view_copy has monotonic
    shape_indices on the raw shapes and preserves the last dimension (NHWC
    channel), skip inserting input/output transposes. The view_copy can
    operate directly on NHWC data.

  2. Redundant permute_copy elimination: Model-level permute_copy ops whose
    permutation matches channels_last_order (NCHW→NHWC) or its inverse
    (NHWC→NCHW) are redundant with the tosa_dim_order annotation that
    already handles format conversion. Replace them with view_copy (identity
    reshape) to avoid generating TOSA TRANSPOSE nodes. Handles both 4D
    (rank>=4, sr>=2) and 3D (rank>=3, sr>=1) permutations.
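The two checks can be sketched roughly as follows. This is a hedged illustration, not the pass's actual code: `is_channels_last_perm` and `is_nhwc_safe_reshape` are hypothetical names, and the reshape check is simplified to the preserved-last-dimension condition (the actual pass additionally requires monotonic shape_indices on the raw shapes):

```python
def is_channels_last_perm(perm):
    # Hypothetical check: does perm match channels_last_order (NCHW->NHWC)
    # or its inverse (NHWC->NCHW)? Covers both the 4D and 3D cases.
    rank = len(perm)
    if rank < 3:
        return False
    nchw_to_nhwc = [0] + list(range(2, rank)) + [1]
    nhwc_to_nchw = [0, rank - 1] + list(range(1, rank - 1))
    return list(perm) in (nchw_to_nhwc, nhwc_to_nchw)


def is_nhwc_safe_reshape(in_shape, out_shape):
    # Simplified sketch: a 4D->4D reshape that preserves the last (channel)
    # dimension can operate directly on NHWC data, so no input/output
    # transposes need to be inserted around it.
    return (
        len(in_shape) == 4
        and len(out_shape) == 4
        and in_shape[-1] == out_shape[-1]
    )
```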

Reviewed By: digantdesai

Differential Revision: D96432610

3l1 added 10 commits March 4, 2026 10:27
Differential Revision: D92296625
…ranspose chains

Summary:
This pass fuses transpose -> reshape -> transpose patterns into a single transpose
followed by a reshape. This optimization is particularly useful when reordering
tensors between NCHW and NHWC memory formats.

The pass now handles both:
- torch.ops.aten.permute_copy.default (standard aten ops)
- exir_ops.backend.tosa.TRANSPOSE.default (TOSA backend ops)

This enables the pass to work with transposes inserted by ToTosaMemoryFormatPass.
The pass is placed AFTER ToTosaMemoryFormatPass in arm_pass_manager.py.

Example: Consider a reshape on an NCHW tensor that reshapes the batch and channel
dimensions into the channel dimension:
    (N, C, H, W) -> reshape -> (1, (N, C), H, W)

If both input and output tensors are reordered to NHWC:
    (N, H, W, C)
    -> transpose -> (N, C, H, W)
    -> reshape -> (1, (N, C), H, W)
    -> transpose -> (1, H, W, (N, C))

This is equivalent to:
    (N, H, W, C) -> transpose -> (H, W, N, C) -> reshape -> (1, H, W, (N, C))

Inspired by bolt/nn/espresso/transforms/fuse_ops.py:fuse_transpose_reshape_transpose

Differential Revision: D95292796
…ear layers

Summary:
This pass fuses transpose -> reshape -> linear patterns by eliminating the
transpose operation. Instead of transposing at runtime, the pass applies the
inverse transpose to the linear layer's weights at compile time.

The pass now handles both:
- torch.ops.aten.permute_copy.default (standard aten ops)
- exir_ops.backend.tosa.TRANSPOSE.default (TOSA backend ops)

This enables the pass to work with transposes inserted by ToTosaMemoryFormatPass.
The pass is placed AFTER ToTosaMemoryFormatPass in arm_pass_manager.py.

Common artifact from NCHW -> NHWC reordering:
- Transpose -> Reshape -> Linear
where the Reshape flattens all dimensions except the batch dimension.

The pass validates:
1. Transpose does not modify the batch dimension (dims[0] == 0)
2. Reshape flattens to 2D with batch dim preserved
3. Followed by a linear/mm operation

Inspired by bolt/nn/espresso/transforms/fuse_ops.py:fuse_transpose_reshape_fc
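The weight fold can be illustrated in numpy. The shapes and the index bookkeeping below are an assumption about the general mechanism (permuting the flattened feature order into the weight columns), not the pass's actual implementation:

```python
import numpy as np

N, A, B, OUT = 2, 3, 4, 5  # illustrative dims; batch dim N is untouched
rng = np.random.default_rng(0)
x = rng.standard_normal((N, A, B))
W = rng.standard_normal((OUT, A * B))  # linear weight: out_features x in_features

# Runtime chain: transpose (batch-preserving) -> flatten to 2D -> linear
y_chain = x.transpose(0, 2, 1).reshape(N, B * A) @ W.T

# Compile-time fold: the chain's flattened feature j reads original flat
# feature idx[j], so column j of W moves to column idx[j] of the folded weight.
idx = np.arange(A * B).reshape(A, B).transpose(1, 0).reshape(-1)
W_folded = np.empty_like(W)
W_folded[:, idx] = W
y_folded = x.reshape(N, A * B) @ W_folded.T  # no runtime transpose

assert np.allclose(y_chain, y_folded)
```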

Differential Revision: D95268043
Summary:
This pass identifies consecutive transpose/permute operations and either:
1. Removes both if they compose to identity (cancel out)
2. Fuses them into a single permute with combined dimensions

The pass now handles:
- exir_ops.edge.aten.permute.default (standard edge dialect ops)
- exir_ops.edge.aten.permute_copy.default (edge dialect copy ops)
- exir_ops.backend.tosa.TRANSPOSE.default (TOSA backend ops)

This enables the pass to work with transposes inserted by ToTosaMemoryFormatPass.
The pass is placed AFTER ToTosaMemoryFormatPass in arm_pass_manager.py.

The optimization reduces runtime overhead by eliminating redundant
memory movement operations.
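The composition and identity check can be sketched as follows (`compose` and `is_identity` are hypothetical helper names, using numpy's gather-style transpose semantics):

```python
import numpy as np

def compose(perm1, perm2):
    # Net permutation equivalent to x.transpose(perm1).transpose(perm2).
    return [perm1[p] for p in perm2]

def is_identity(perm):
    return list(perm) == list(range(len(perm)))

x = np.random.default_rng(0).standard_normal((2, 3, 4, 5))
p1, p2 = [0, 2, 3, 1], [0, 3, 1, 2]  # NCHW->NHWC and its inverse

assert compose(p1, p2) == [0, 1, 2, 3]  # they cancel out -> remove both
assert is_identity(compose(p1, p2))
assert np.array_equal(x.transpose(p1).transpose(p2), x.transpose(compose(p1, p2)))
```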

Inspired by bolt/nn/espresso/transforms/fuse_ops.py:fuse_transposes

Differential Revision: D95293181
…riant ops

Summary:
This pass identifies and removes redundant transpose pairs that sandwich layout-invariant operations (e.g., elementwise ops like relu, add, mul).

Pattern targeted:
  T(perm1) → LayoutInvariantOp → T(perm2)

When perm1 and perm2 compose to identity (i.e., they cancel out), both transposes can be safely removed since the middle operation doesn't depend on data layout.

This pattern is common in TOSA graphs where ToTosaMemoryFormatPass inserts NCHW↔NHWC transposes at operation boundaries, but some elementwise ops between them don't actually require the format conversion.

Example:
  Before: T([0,2,3,1]) → ReLU → T([0,3,1,2])
  After:  ReLU  (both transposes removed)

The pass handles:
- edge.aten.permute/permute_copy operations
- backend.tosa.TRANSPOSE operations
- All common elementwise unary ops (relu, sigmoid, tanh, clamp, etc.)
- Elementwise binary ops (add, mul, sub, div, etc.)
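A minimal numpy demonstration of the sandwich removal for ReLU, using the permutations from the example above:

```python
import numpy as np

x = np.random.default_rng(1).standard_normal((2, 3, 4, 5))
p1, p2 = (0, 2, 3, 1), (0, 3, 1, 2)  # compose to identity

before = np.maximum(x.transpose(p1), 0.0).transpose(p2)  # T -> ReLU -> T
after = np.maximum(x, 0.0)                               # both transposes removed

# ReLU is elementwise (layout-invariant), so the results match exactly.
assert np.array_equal(before, after)
```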
…TOSA Rescale operations

Summary:
This pass targets T1 → Rescale → T2 patterns where transposes surround a TOSA
Rescale operation. Since Rescale is elementwise (per-element scaling and
zero-point adjustment), it is layout-invariant and transposes can be
propagated through it.

Pattern targeted:
  Before: T(perm1) → Rescale → T(perm2)
  After:  Rescale → T_combined(compose(perm1, perm2))

If the composed permutation is identity, both transposes are eliminated:
  Before: T(perm1) → Rescale → T(inverse(perm1))
  After:  Rescale (both transposes removed)
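In numpy terms, with a toy per-tensor rescale standing in for the TOSA Rescale op (the scale/zero-point values are illustrative, and the composed permutation here is deliberately not identity):

```python
import numpy as np

def rescale(x, scale=0.05, zp=3):
    # Toy per-tensor rescale: elementwise, hence layout-invariant.
    return x * scale + zp

x = np.random.default_rng(3).standard_normal((2, 3, 4, 5))
p1, p2 = (0, 2, 3, 1), (0, 2, 3, 1)  # composed perm is NOT identity here

before = rescale(x.transpose(p1)).transpose(p2)   # T1 -> Rescale -> T2
combined = [p1[i] for i in p2]                    # compose(p1, p2), numpy semantics
after = rescale(x).transpose(combined)            # Rescale -> T_combined

assert np.array_equal(before, after)
```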

This optimization is particularly effective for TOSA graphs where:
- ToTosaMemoryFormatPass inserts NCHW↔NHWC transposes
- Rescale operations are inserted for quantization/dequantization
- The pattern Transpose → Rescale → Conv → Rescale → Transpose is common

The pass handles:
- edge.aten.permute/permute_copy operations
- backend.tosa.TRANSPOSE operations
…ugh Concat operations

Summary:
This pass targets the pattern [T(perm), T(perm), ...] → Concat(dim=d) → T(inv_perm)
where all inputs to Concat have the same transpose permutation and the output
is transposed with the inverse permutation. The transposes can be eliminated
by adjusting the concat dimension.

This optimization is particularly useful for TOSA graphs where
ToTosaMemoryFormatPass inserts NCHW↔NHWC transposes at graph boundaries.

Transformation:
- Before: [T(perm)(x1), T(perm)(x2)] → Concat(dim=d) → T(inv_perm) → y
- After: [x1, x2] → Concat(dim=inv_perm[d]) → y
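The transformation can be checked in numpy. One caveat on conventions: under numpy's gather-style transpose semantics the remapped axis is the *position* of d in inv_perm (which is what `inv_perm[d]` denotes under the opposite, scatter-style convention):

```python
import numpy as np

rng = np.random.default_rng(2)
x1 = rng.standard_normal((1, 2, 3, 4))  # NCHW
x2 = rng.standard_normal((1, 2, 3, 4))
perm, inv_perm = (0, 2, 3, 1), (0, 3, 1, 2)  # NCHW->NHWC and its inverse
d = 3  # concat along the channel axis of the transposed (NHWC) tensors

before = np.concatenate(
    [x1.transpose(perm), x2.transpose(perm)], axis=d
).transpose(inv_perm)

# After: drop all three transposes and remap the concat axis.
after = np.concatenate([x1, x2], axis=inv_perm.index(d))

assert np.array_equal(before, after)
```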
Summary:
This TODO documents the investigation needed to make PropagateTransposesThroughConcatPass
more effective for models like Control Ceres where:
- Input transposes have multiple users (not just Concat)
- Input transposes may have different permutations

Investigation scope:
- Option A: Allow input transposes with multiple users (duplicate for other users)
- Option B: Handle mixed permutations by propagating matching subsets
- Option C: Target a different pattern entirely
Summary:
The FuseTransposeReshapeTransposePass was only checking for torch.ops.aten.view_copy.default but not the Edge dialect exir_ops.edge.aten.view_copy.default. This prevented the pass from matching transpose-reshape-transpose patterns in edge graphs where view_copy uses the edge dialect.

Changes:
- Added exir_ops.edge.aten.view_copy.default to _RESHAPE_TARGETS
- Changed logging from debug to warning for better visibility of fusion failures
- Removed unused variable permute_output_shape
- Created ANALYSIS_expensive_transposes.md documenting findings

Note: The expensive transposes in Control Ceres model around reshape operations cannot be fused because the reshapes involve both dimension combining AND reordering (e.g., [1,2,14,72] → [1,1,72,28]). These transposes are mathematically necessary for the reshape to work correctly in NHWC layout.
…at compile time

Summary:
This pass identifies TOSA TRANSPOSE operations where the input is a static tensor (parameter, buffer, or lifted tensor constant) and folds the transpose at compile time by:

1. Actually permuting the tensor data
2. Creating a new constant placeholder with the permuted data
3. Removing the transpose node and rewiring users

This eliminates runtime transpose operations on static tensors like weights, which is especially important for Ethos-U55 where Vela implements transposes as expensive NPU_OP_POOL (1x1 AvgPool) sequences.

**Note:** Analysis on test_combined_control_ceres_u55 shows that in this particular model, all 40 transpose operations (19 TOSA TRANSPOSE + 21 aten.permute_copy) are on activation tensors, not constant tensors. The pass correctly identifies this and doesn't fold any transposes. The pass will be beneficial for future models that have transposes on constant tensors.
3l1 requested a review from digantdesai as a code owner March 13, 2026 19:59

pytorch-bot bot commented Mar 13, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/18167

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 9 Cancelled Jobs, 3 Unrelated Failures

As of commit c019a17 with merge base 4f900b2:

NEW FAILURE - The following job has failed:

CANCELLED JOBS - The following jobs were cancelled. Please retry:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

meta-cla bot added the "CLA Signed" label Mar 13, 2026 (this label is managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed).

meta-codesync bot commented Mar 13, 2026

@3l1 has exported this pull request. If you are a Meta employee, you can view the originating Diff in D96432610.

@github-actions

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.


digantdesai left a comment


Review automatically exported from Phabricator review in Meta.

3l1 added a commit to 3l1/executorch that referenced this pull request Mar 13, 2026
…ansposes in ToTosaMemoryFormatPass (pytorch#18167)

Summary:
Pull Request resolved: pytorch#18167

Two optimizations in ToTosaMemoryFormatPass to reduce TOSA TRANSPOSE nodes:

1. NHWC-safe reshape detection: When a 4D→4D view_copy has monotonic
   shape_indices on the raw shapes and preserves the last dimension (NHWC
   channel), skip inserting input/output transposes. The view_copy can
   operate directly on NHWC data.

2. Redundant permute_copy elimination: Model-level permute_copy ops whose
   permutation matches channels_last_order (NCHW→NHWC) or its inverse
   (NHWC→NCHW) are redundant with the tosa_dim_order annotation that
   already handles format conversion. Replace them with view_copy (identity
   reshape) to avoid generating TOSA TRANSPOSE nodes. Handles both 4D
   (rank>=4, sr>=2) and 3D (rank>=3, sr>=1) permutations.

This reduces Vela Transpose entries from 75→33 (-56%), Transpose op cycles from 33.4K→6.1K (-82%), and NPU operators from 367→329 (-38 ops).

Reviewed By: digantdesai

Differential Revision: D96432610
3l1 force-pushed the export-D96432610 branch from f9a57c8 to 6aac88a March 13, 2026 20:45
@meta-codesync meta-codesync bot changed the title Eliminate redundant NCHW↔NHWC permute_copy and NHWC-safe view_copy transposes in ToTosaMemoryFormatPass Eliminate redundant NCHW↔NHWC permute_copy and NHWC-safe view_copy transposes in ToTosaMemoryFormatPass (#18167) Mar 13, 2026
3l1 added a commit to 3l1/executorch that referenced this pull request Mar 13, 2026
…ansposes in ToTosaMemoryFormatPass (pytorch#18167)

3l1 force-pushed the export-D96432610 branch from 6aac88a to 7f039b2 March 13, 2026 20:47
3l1 force-pushed the export-D96432610 branch from 7f039b2 to c019a17 March 13, 2026 21:21

Labels: ciflow/trunk, CLA Signed, fb-exported, meta-exported
