PULPOpen Training Harness: tiling-ready backward and optimizer deployment #174
Draft
runwangdl wants to merge 167 commits into pulp-platform:devel from
Conversation
… have finite lifetime
…y are I/O buffers
# Conflicts:
#   .github/workflows/ci-platform-siracusa-tiled.yml
#   .github/workflows/ci-platform-siracusa.yml
#   Deeploy/Targets/Generic/Bindings.py
#   Deeploy/Targets/Generic/Layers.py
#   Deeploy/Targets/Generic/Parsers.py
#   Deeploy/Targets/Generic/Platform.py
#   Deeploy/Targets/Generic/TypeCheckers.py
#   Deeploy/Targets/PULPOpen/Bindings.py
#   Deeploy/Targets/PULPOpen/Parsers.py
#   Deeploy/Targets/PULPOpen/Platform.py
#   Deeploy/Targets/PULPOpen/Templates/FloatLayernormTemplate.py
#   Deeploy/Targets/PULPOpen/Templates/SGDTemplate.py
#   Deeploy/Targets/PULPOpen/TileConstraints/GEMMTileConstraint.py
#   Deeploy/Targets/PULPOpen/Tiler.py
#   TargetLibraries/Generic/inc/kernel/Layernorm.h
#   TargetLibraries/Generic/src/Layernorm_fp32.c
…ng merge
The merge commit resolved the conflict by using TrainingPlatform's version of Bindings.py, which was missing the GroupNorm checker and template imports needed by the GroupNorm binding definitions (PULPGroupNormGrad*Binding, PULPGroupNormalization*Binding) brought in from CNNTraining.
…lution
Adds missing ConvGrad, GroupNorm, AveragePool, and ReluGrad operators that were dropped when TrainingPlatform took precedence during the merge.
- Generic/Parsers.py: add Conv2DGradXParser, Conv2DGradWParser, AveragePool2DParser, and 6 GroupNorm parser classes
- Generic/Layers.py: add AveragePoolLayer/Grad, ConvGradX/WLayer, ReluGradLayer, and 6 GroupNorm layer classes
- PULPOpen/Parsers.py: add 6 PULPConvGrad parser classes
- PULPOpen/Bindings.py: add PULPAveragePool2DBindings and PULPAveragePoolGrad2DBindings; fix GroupNorm/template imports
- PULPOpen/Tiler.py: fix imports to include all CNNTraining bindings and the GroupNorm/ConvGrad/AveragePool TileConstraints
- PULPOpen/Platform.py: fix all imports and add GroupNorm entries to PULPMapping
…finitions
- Generic/TypeCheckers.py: add AveragePoolChecker class
- Generic/Bindings.py: add FloatAveragePoolTemplate import, AveragePoolChecker import, and BasicAveragePool2DBindings definition
- Generic/Platform.py: add AveragePoolLayer/Mapper; fix imports for BasicAveragePool2DBindings and AveragePoolLayer
- Generic/Parsers.py: remove duplicate LayerNormGradParser (4-input version from CNNTraining; keep the 5-input TrainingPlatform version)
- PULPOpen/Bindings.py: remove duplicate PULPLayernormGradBinding (4-input version from CNNTraining; keep the 5-input version); remove duplicate FunctionCallClosure/ForkClosure definitions
The merge added a second LayerNormGradParser class (4-input, 1-output version from CNNTraining) that overrode the correct 5-input, 3-output version from TrainingPlatform. Remove the duplicate to restore the correct implementation that matches the ONNX LayerNormalizationGrad node format.
…-t /app/Onnx4Deeploy/onnx/model/simplemlp_train --n-accum=8 --plotMemAlloc"

MemoryScheduler (_calculateLifetimes):
- Pin is_input buffers to (0, maxStepIdx) instead of (0, last_use_step),
preventing minimalloc from aliasing network weight inputs (e.g. fc1_weight)
with intermediate tensors computed later in the same kernel call.
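The lifetime-pinning change above can be sketched in a few lines of Python. This is an illustrative model only, not Deeploy's actual MemoryScheduler API: the function name, buffer-dict shape, and field names are hypothetical.

```python
# Illustrative sketch of the lifetime-pinning fix: I/O buffers get a lifetime
# spanning the whole schedule, so the allocator (e.g. minimalloc) can never
# alias a network weight input with a later intermediate tensor.

def calculate_lifetimes(buffers, max_step_idx):
    """buffers: list of dicts with 'name', 'is_input', 'first_use', 'last_use'."""
    lifetimes = {}
    for buf in buffers:
        if buf["is_input"]:
            # Pin input buffers to (0, max_step_idx) instead of
            # (0, last_use_step): they must stay live for the full schedule.
            lifetimes[buf["name"]] = (0, max_step_idx)
        else:
            lifetimes[buf["name"]] = (buf["first_use"], buf["last_use"])
    return lifetimes
```

Without the pin, a weight like fc1_weight whose last read happens early would be considered dead afterwards, letting the allocator reuse its address for an intermediate computed in the same kernel call.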
InPlaceAccumulatorV2:
- Add tiledReferenceTemplate (PULPOpen) that writes only to ${accum_buffer};
data_out is excluded because its L2 address may alias other live buffers.
- InPlaceAccumulatorV2TileConstraint: remove data_out from addrNames; egress
DMA targets accum_buffer's L2 address instead of data_out.
- PULPInPlaceAccumulatorV2TilingReadyBindings now uses TiledBindings.
- Generic template: write to both accum_buffer and data_out; use a
  > (int8_t)(-128) comparison for reset detection to handle the
  sign-propagation path.
Add int8_t binding variant for the sign-propagated lazy_reset_grad.
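The reset-detection convention above can be modeled in pure Python. This is a sketch of the idea only (the real template is C); the function name and flag semantics shown here are an assumption inferred from the bullet: any int8 value greater than -128 signals "reset", so a sign-propagated -128 means "keep accumulating".

```python
# Illustrative model of the lazy-reset in-place accumulator: on a reset step
# the gradient overwrites the accumulator; otherwise it is added in place.

def inplace_accumulate(accum, grad, reset_flag):
    """accum, grad: lists of floats; reset_flag: int8-like scalar."""
    if reset_flag > -128:          # reset detection, robust to sign propagation
        accum[:] = grad            # first accumulation step: overwrite
    else:
        for i, g in enumerate(grad):
            accum[i] += g          # later steps: accumulate in place
    return accum
```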
SoftmaxCrossEntropyLoss dual-output:
- Add SoftmaxCrossEntropyLossDualOutputTileConstraint with wrapTilingSolution
override to bypass the single-output assertion.
- Platform: DualOutputMapper now uses TilingReadyBindings.
LayerNorm / LayerNormGrad tiling:
- LayernormTileConstraint: add mean/inv_std_dev output constraints and
wrapTilingSolution to handle multi-output nodes.
- LayernormGradTileConstraint: replace bias with mean/inv_std_dev inputs;
add weight_grad/bias_grad output constraints; addPolicyConstraint pins
batch dims to full size; serializeTilingSolution updated accordingly.
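The switch from a bias input to mean/inv_std_dev inputs reflects the standard LayerNorm backward formula, which reuses the statistics saved during the forward pass instead of recomputing them. A per-row sketch in plain Python (function name and signature are illustrative, not the kernel's):

```python
# Textbook LayerNorm backward for one normalized row, consuming the forward
# pass's saved mean and inv_std_dev (the 5-input, 3-output form kept in this PR:
# outputs are input_grad, weight_grad, bias_grad).

def layernorm_grad_row(dy, x, weight, mean, inv_std):
    n = len(x)
    x_hat = [(xi - mean) * inv_std for xi in x]          # normalized input
    dxhat = [dyi * wi for dyi, wi in zip(dy, weight)]    # grad w.r.t. x_hat
    m1 = sum(dxhat) / n
    m2 = sum(d * h for d, h in zip(dxhat, x_hat)) / n
    dx = [inv_std * (d - m1 - h * m2) for d, h in zip(dxhat, x_hat)]
    dweight = [dyi * h for dyi, h in zip(dy, x_hat)]
    dbias = list(dy)
    return dx, dweight, dbias
```

The per-row reductions (m1, m2) are why the tile constraint pins the batch dims to full size: each row must be reduced in one piece.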
FloatGEMMTileConstraint:
- Skip bias DMA when bias tensor is not present in the tiling solution
(e.g. Constant bias kept entirely in L2), fixing KeyError on small biases.
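The bias-DMA fix amounts to guarding a dictionary lookup. A schematic version (the real change lives inside FloatGEMMTileConstraint's solution serialization; names below are illustrative):

```python
# Only emit a DMA transfer for tensors that actually appear in the tiling
# solution; a Constant bias kept entirely in L2 has no per-tile entry and
# previously raised a KeyError.

def collect_dma_transfers(tile_solution, tensor_names):
    transfers = []
    for name in tensor_names:
        if name not in tile_solution:
            continue  # e.g. small Constant bias resident in L2: skip its DMA
        transfers.append((name, tile_solution[name]))
    return transfers
```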
Tiling code generation infrastructure:
- SingleBufferingTilingCodeGeneration: add direction suffix (_ref_in/_ref_out)
to hoisted external buffer references to avoid collisions for in-place
buffers appearing in both ingress and egress schedules; use combined
tensorMemoryConstraints for egress lookup so input-side constraints
(e.g. accum_buffer) are visible during output DMA generation.
- TilingCodeGeneration: add direction suffix to per-tile opRepr variable
names to prevent ingress/egress collisions for the same tensor.
- TilingVariableReplacement: skip re-hoisting a reference already created
on the input pass (in-place buffers); fix _updateReferenceTemplate to use
pointer arithmetic (base + idx) instead of dereference.
- MemoryConstraintFlows: skip kill-set tracking for ConstantBuffers.
- TileConstraint: print diagnostic info before the single-output assertion.
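The direction-suffix scheme above can be sketched briefly. This is a schematic model, not Deeploy's actual hoisting code; names like hoist_reference are hypothetical:

```python
# An in-place buffer appears in both the ingress and the egress schedule, so
# its hoisted external reference needs a per-direction name (_ref_in/_ref_out)
# to avoid a collision; a reference already hoisted on the input pass is reused.

refs = {}

def hoisted_ref_name(tensor_name, direction):
    assert direction in ("in", "out")
    return f"{tensor_name}_ref_{direction}"

def hoist_reference(tensor_name, direction, base_addr):
    name = hoisted_ref_name(tensor_name, direction)
    if name in refs:
        return name        # skip re-hoisting (in-place buffer, second pass)
    refs[name] = base_addr
    return name
```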
Test infrastructure:
- execution.py: prepend /usr/bin to PATH so gapy resolves the correct Python;
add tiled-training branch (config.training and config.tiling) that invokes
testMVPTraining.py and generateOptimizerNetwork.py, then reads back
training_meta.json for n_train_steps / n_accum_steps.
- deeploytraintest.c: remove debug scaffolding left from liveness investigation.
…ansformer/tinytransformer_train --n-accum 4"
Implements the backward pass of MaxPool2D (MaxPoolGrad) following the same architecture as AveragePoolGrad. The gradient is scattered only to the argmax position in each pooling window (re-computed from the original forward input), unlike AveragePoolGrad, which distributes it uniformly.
New files:
- TargetLibraries/PULPOpen/inc/kernel/MaxPool.h: declare PULP_MaxPoolGrad2d_fp32_fp32_HWC
- TargetLibraries/PULPOpen/src/MaxPool.c: implement the MaxPoolGrad kernel (zero-init + argmax scatter, channel-parallel across cores)
- Deeploy/Targets/PULPOpen/TileConstraints/MaxPoolGradTileConstraint.py: MaxPoolGradCTileConstraint (channel tiling for 3 tensors: grad_output, original_input, grad_input)
- Deeploy/Targets/PULPOpen/Templates/FloatMaxPoolTemplate.py: add referenceGradTemplate calling the new kernel
Modified files:
- Generic/Parsers.py: MaxPoolGradParser (2 inputs, 1 output, same attrs as MaxPool)
- Generic/Layers.py: MaxPoolGradLayer
- Generic/TypeCheckers.py: MaxPoolGradChecker (2 float32 in, 1 float32 out)
- PULPOpen/Bindings.py: PULPMaxPoolGrad2DBindings
- PULPOpen/Tiler.py: PULPMaxPoolGrad2DTilingReadyBindings
- PULPOpen/Platform.py: MaxPoolGrad2DMapper + 'MaxPoolGrad' in PULPMapping
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
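The zero-init + argmax-scatter idea behind the kernel, reduced to a 1-D sketch in pure Python (the real kernel is the 2-D, HWC, channel-parallel C function named above; this function and its signature are illustrative only):

```python
# MaxPoolGrad in miniature: zero-initialize the input gradient, re-compute
# each window's argmax from the saved forward input, and scatter the output
# gradient only to that position (AveragePoolGrad would spread it uniformly).

def maxpool_grad_1d(grad_out, fwd_in, kernel, stride):
    grad_in = [0.0] * len(fwd_in)
    for o, g in enumerate(grad_out):
        start = o * stride
        window = fwd_in[start:start + kernel]
        argmax = start + max(range(len(window)), key=window.__getitem__)
        grad_in[argmax] += g   # scatter to the argmax position only
    return grad_in
```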
Bump pulp-trainlib to 37f70e5 (CNNTiling):
- DW ConvGradW/X: padded kernels for non-zero padding or stride != 1
- im2col: fix early-return bug blocking stride > 1 weight gradients
- Conv2D bw param grads: pass actual padding to im2col (was zero)
Adds end-to-end on-device training graph deployment support for PULPOpen/Siracusa targets. This includes a full code-generation pipeline for training networks (forward + backward + optimizer), tiling support for gradient operators, and the necessary runtime harness changes to run SGD-based on-device learning.
Added
Changed
inputs.npz / outputs.npz format — added meta_data_size (unique samples stored) and meta_n_batches (total training steps) keys; C harness cycles data via mb % TRAINING_DATA_SIZE instead of storing all batches
TilerExtension.py — _setupTensorDimensionProducts and _setupHeuristics now receive layerBinding as parameter; four hasattr(template, 'tileConstraint') guards added so non-tileable ops (e.g. ConvGradB) execute on their current memory level without blocking the tiler
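The data-cycling change above can be illustrated in a few lines. Constant names here mirror the C harness macros described in the bullet but are otherwise illustrative:

```python
# Only meta_data_size unique samples are stored in inputs.npz; the harness
# indexes them cyclically across meta_n_batches training steps instead of
# storing every batch.

TRAINING_DATA_SIZE = 4   # meta_data_size: unique samples stored
N_TRAIN_STEPS = 10       # meta_n_batches: total training steps

def sample_index(mb):
    return mb % TRAINING_DATA_SIZE

indices = [sample_index(mb) for mb in range(N_TRAIN_STEPS)]
```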
Fixed
PR Merge Checklist
- Based on a devel commit and pointing to devel.
- The CHANGELOG.md file has been updated.