
PULPOpen Training Harness: tiling-ready backward and optimizer deployment #174

Draft
runwangdl wants to merge 167 commits into pulp-platform:devel from runwangdl:TrainingPlatform

Conversation

@runwangdl
Contributor

Adds end-to-end on-device training graph deployment support for PULPOpen/Siracusa targets. This includes a full code-generation pipeline for training networks (forward + backward + optimizer), tiling support for gradient operators, and the necessary runtime harness changes to run SGD-based on-device learning.

Added

  • generateTrainingNetwork.py — CLI script to generate tiled training C code; supports --tiling, --l1, --l2, --doublebuffer, --defaultMemLevel
  • deeployTrainingRunner_siracusa.py — end-to-end training test driver for Siracusa
  • InPlaceAccumulatorV2 operator: parser, type checker, template, bindings, and SBTiler-based tile constraint (gradient accumulation buffer)
  • SoftmaxCrossEntropyLoss dual-output variant (loss scalar + log_prob): separate parser, checker, template, bindings, and MultiOutputMixin-based tile constraint
  • ConvGradX / ConvGradW / ConvGradB operators split from ConvGrad via SplitConvGradPass: individual parsers, templates, and bindings for each
  • MultiOutputTileConstraint framework (MultiOutputMixin, ScalarOutputAppender, FullTensorOutputAppender) — generic mechanism for wrapping multi-output tile constraints without per-operator boilerplate
  • deeploytraintest.c — C harness for running training steps on device, with mb % TRAINING_DATA_SIZE data cycling and post-init grad buffer memset
  • testinputs.h with TRAINING_DATA_SIZE, TRAINING_GRAD_BUF_START_IDX, TRAINING_NUM_GRAD_INPUTS macros
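The data-cycling scheme in the new C harness can be sketched as follows. This is a minimal illustration assuming the macro semantics described above (only `TRAINING_DATA_SIZE` unique samples stored, reused round-robin across training steps); the helper function name is hypothetical, not part of `deeploytraintest.c`.

```c
#include <stddef.h>

/* Assumed value for illustration; the real macro comes from testinputs.h. */
#define TRAINING_DATA_SIZE 4

/* Map training step mb onto one of the stored unique samples,
   cycling round-robin instead of storing all batches. */
static inline size_t training_sample_idx(size_t mb) {
    return mb % TRAINING_DATA_SIZE;
}
```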

Changed

  • inputs.npz / outputs.npz format — added meta_data_size (unique samples stored) and meta_n_batches (total training steps) keys; C harness cycles data via mb % TRAINING_DATA_SIZE instead of storing all batches

  • TilerExtension.py — _setupTensorDimensionProducts and _setupHeuristics now receive layerBinding as parameter; four hasattr(template, 'tileConstraint') guards added so non-tileable ops (e.g. ConvGradB) execute on their current memory level without blocking the tiler

Fixed

PR Merge Checklist

  1. The PR is rebased on the latest devel commit and targets devel.
  2. Your PR has been reviewed and approved.
  3. All checks are passing.
  4. The CHANGELOG.md file has been updated.
  5. If the Docker image was modified, change its link back after review.

runwangdl and others added 25 commits February 13, 2026 23:55
# Conflicts:
#	.github/workflows/ci-platform-siracusa-tiled.yml
#	.github/workflows/ci-platform-siracusa.yml
#	Deeploy/Targets/Generic/Bindings.py
#	Deeploy/Targets/Generic/Layers.py
#	Deeploy/Targets/Generic/Parsers.py
#	Deeploy/Targets/Generic/Platform.py
#	Deeploy/Targets/Generic/TypeCheckers.py
#	Deeploy/Targets/PULPOpen/Bindings.py
#	Deeploy/Targets/PULPOpen/Parsers.py
#	Deeploy/Targets/PULPOpen/Platform.py
#	Deeploy/Targets/PULPOpen/Templates/FloatLayernormTemplate.py
#	Deeploy/Targets/PULPOpen/Templates/SGDTemplate.py
#	Deeploy/Targets/PULPOpen/TileConstraints/GEMMTileConstraint.py
#	Deeploy/Targets/PULPOpen/Tiler.py
#	TargetLibraries/Generic/inc/kernel/Layernorm.h
#	TargetLibraries/Generic/src/Layernorm_fp32.c
…ng merge

The merge commit resolved the conflict by using TrainingPlatform's version
of Bindings.py, which was missing the GroupNorm checker and template imports
needed by the GroupNorm binding definitions (PULPGroupNormGrad*Binding,
PULPGroupNormalization*Binding) brought in from CNNTraining.
…lution

Adds missing ConvGrad, GroupNorm, AveragePool, and ReluGrad operators
that were dropped when TrainingPlatform took precedence during merge.

- Generic/Parsers.py: Add Conv2DGradXParser, Conv2DGradWParser,
  AveragePool2DParser, and 6 GroupNorm parser classes
- Generic/Layers.py: Add AveragePoolLayer/Grad, ConvGradX/WLayer,
  ReluGradLayer, and 6 GroupNorm layer classes
- PULPOpen/Parsers.py: Add 6 PULPConvGrad parser classes
- PULPOpen/Bindings.py: Add PULPAveragePool2DBindings,
  PULPAveragePoolGrad2DBindings, and fix GroupNorm/template imports
- PULPOpen/Tiler.py: Fix imports to include all CNNTraining bindings
  and GroupNorm/ConvGrad/AveragePool TileConstraints
- PULPOpen/Platform.py: Fix all imports and add GroupNorm entries
  to PULPMapping
…finitions

- Generic/TypeCheckers.py: Add AveragePoolChecker class
- Generic/Bindings.py: Add FloatAveragePoolTemplate import,
  AveragePoolChecker import, and BasicAveragePool2DBindings definition
- Generic/Platform.py: Add AveragePoolLayer/Mapper, fix imports for
  BasicAveragePool2DBindings and AveragePoolLayer
- Generic/Parsers.py: Remove duplicate LayerNormGradParser (4-input
  version from CNNTraining, keep 5-input TrainingPlatform version)
- PULPOpen/Bindings.py: Remove duplicate PULPLayernormGradBinding
  (4-input version from CNNTraining, keep 5-input version),
  remove duplicate FunctionCallClosure/ForkClosure definitions
The merge added a second LayerNormGradParser class (4-input, 1-output
version from CNNTraining) that overrode the correct 5-input, 3-output
version from TrainingPlatform. Remove the duplicate to restore the
correct implementation that matches the ONNX LayerNormalizationGrad
node format.
    -t /app/Onnx4Deeploy/onnx/model/simplemlp_train --n-accum=8 --plotMemAlloc"

MemoryScheduler (_calculateLifetimes):
- Pin is_input buffers to (0, maxStepIdx) instead of (0, last_use_step),
  preventing minimalloc from aliasing network weight inputs (e.g. fc1_weight)
  with intermediate tensors computed later in the same kernel call.

InPlaceAccumulatorV2:
- Add tiledReferenceTemplate (PULPOpen) that writes only to ${accum_buffer};
  data_out is excluded because its L2 address may alias other live buffers.
- InPlaceAccumulatorV2TileConstraint: remove data_out from addrNames; egress
  DMA targets accum_buffer's L2 address instead of data_out.
- PULPInPlaceAccumulatorV2TilingReadyBindings now uses TiledBindings.
- Generic template: write to both accum_buffer and data_out; use
  > (int8_t)(-128) for reset detection to handle sign-propagation path.
  Add int8_t binding variant for the sign-propagated lazy_reset_grad.
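The generic-template reset check described above can be illustrated with a small sketch. Everything here is an assumption for illustration: the function name, the buffer layout, and in particular which branch `lazy_reset_grad > (int8_t)(-128)` selects; the point is only that comparing against `(int8_t)(-128)` keeps the check valid when the flag has passed through the sign-propagation path.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical sketch of an in-place gradient accumulator with lazy reset.
   A lazy_reset_grad value above INT8_MIN is taken here to mean "reset"
   (overwrite), and INT8_MIN itself to mean "accumulate"; the real
   generated template also mirrors the result into data_out. */
static void in_place_accumulate_f32(float *accum, const float *grad,
                                    size_t n, int8_t lazy_reset_grad) {
    if (lazy_reset_grad > (int8_t)(-128)) {
        for (size_t i = 0; i < n; ++i)
            accum[i] = grad[i];   /* reset: overwrite the buffer */
    } else {
        for (size_t i = 0; i < n; ++i)
            accum[i] += grad[i];  /* accumulate into the buffer */
    }
}
```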

SoftmaxCrossEntropyLoss dual-output:
- Add SoftmaxCrossEntropyLossDualOutputTileConstraint with wrapTilingSolution
  override to bypass the single-output assertion.
- Platform: DualOutputMapper now uses TilingReadyBindings.

LayerNorm / LayerNormGrad tiling:
- LayernormTileConstraint: add mean/inv_std_dev output constraints and
  wrapTilingSolution to handle multi-output nodes.
- LayernormGradTileConstraint: replace bias with mean/inv_std_dev inputs;
  add weight_grad/bias_grad output constraints; addPolicyConstraint pins
  batch dims to full size; serializeTilingSolution updated accordingly.

FloatGEMMTileConstraint:
- Skip bias DMA when bias tensor is not present in the tiling solution
  (e.g. Constant bias kept entirely in L2), fixing KeyError on small biases.

Tiling code generation infrastructure:
- SingleBufferingTilingCodeGeneration: add direction suffix (_ref_in/_ref_out)
  to hoisted external buffer references to avoid collisions for in-place
  buffers appearing in both ingress and egress schedules; use combined
  tensorMemoryConstraints for egress lookup so input-side constraints
  (e.g. accum_buffer) are visible during output DMA generation.
- TilingCodeGeneration: add direction suffix to per-tile opRepr variable
  names to prevent ingress/egress collisions for the same tensor.
- TilingVariableReplacement: skip re-hoisting a reference already created
  on the input pass (in-place buffers); fix _updateReferenceTemplate to use
  pointer arithmetic (base + idx) instead of dereference.
- MemoryConstraintFlows: skip kill-set tracking for ConstantBuffers.
- TileConstraint: print diagnostic info before the single-output assertion.
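The `_updateReferenceTemplate` fix above (pointer arithmetic instead of dereference) can be shown with a hypothetical snippet; the names are invented, and the real code is a generated Deeploy template, not a hand-written function. The bug class is that a hoisted tile reference must be the address `base + idx`, not the value `*(base + idx)`.

```c
#include <stddef.h>

/* Correct form: a tile reference is an address offset from the base. */
static float *tile_ref(float *base, size_t idx) {
    return base + idx;
    /* A buggy dereference, *(base + idx), would yield the element's
       value where the DMA setup code expects an address. */
}
```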

Test infrastructure:
- execution.py: prepend /usr/bin to PATH so gapy resolves the correct Python;
  add tiled-training branch (config.training and config.tiling) that invokes
  testMVPTraining.py and generateOptimizerNetwork.py, then reads back
  training_meta.json for n_train_steps / n_accum_steps.
- deeploytraintest.c: remove debug scaffolding left from liveness investigation.
…ansformer/tinytransformer_train --n-accum 4"
Implements the backward pass of MaxPool2D (MaxPoolGrad) following the
same architecture as AveragePoolGrad. The gradient is scattered only
to the argmax position in each pooling window (re-computed from the
original forward input), unlike AveragePoolGrad which distributes
uniformly.
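The zero-init plus argmax-scatter idea can be sketched in one dimension for a single channel. This is a simplification with made-up names; the real kernel (`PULP_MaxPoolGrad2d_fp32_fp32_HWC`) operates on 2-D HWC tensors and is parallelized across cores. As described above, the argmax is re-computed from the original forward input and the incoming gradient is routed only to that position.

```c
#include <stddef.h>

/* 1-D MaxPool backward sketch: non-overlapping windows of size `window`
   over `n` inputs. grad_out has one entry per window. */
static void maxpool_grad_1d(const float *grad_out, const float *fwd_in,
                            float *grad_in, size_t n, size_t window) {
    for (size_t i = 0; i < n; ++i)
        grad_in[i] = 0.0f;                      /* zero-init */
    for (size_t o = 0; o * window < n; ++o) {
        size_t base = o * window, argmax = base;
        for (size_t k = 1; k < window && base + k < n; ++k)
            if (fwd_in[base + k] > fwd_in[argmax])
                argmax = base + k;              /* re-compute argmax */
        grad_in[argmax] += grad_out[o];         /* scatter to argmax only */
    }
}
```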

New files:
- TargetLibraries/PULPOpen/inc/kernel/MaxPool.h: declare PULP_MaxPoolGrad2d_fp32_fp32_HWC
- TargetLibraries/PULPOpen/src/MaxPool.c: implement MaxPoolGrad kernel
  (zero-init + argmax scatter, channel-parallel across cores)
- Deeploy/Targets/PULPOpen/TileConstraints/MaxPoolGradTileConstraint.py:
  MaxPoolGradCTileConstraint (channel-tiling for 3 tensors:
  grad_output, original_input, grad_input)
- Deeploy/Targets/PULPOpen/Templates/FloatMaxPoolTemplate.py: add
  referenceGradTemplate calling the new kernel

Modified files:
- Generic/Parsers.py: MaxPoolGradParser (2 inputs, 1 output, same attrs as MaxPool)
- Generic/Layers.py: MaxPoolGradLayer
- Generic/TypeCheckers.py: MaxPoolGradChecker (2 float32 in, 1 float32 out)
- PULPOpen/Bindings.py: PULPMaxPoolGrad2DBindings
- PULPOpen/Tiler.py: PULPMaxPoolGrad2DTilingReadyBindings
- PULPOpen/Platform.py: MaxPoolGrad2DMapper + 'MaxPoolGrad' in PULPMapping

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@runwangdl runwangdl marked this pull request as draft March 12, 2026 01:30
@runwangdl runwangdl self-assigned this Mar 12, 2026
Bump pulp-trainlib to 37f70e5 (CNNTiling):
- DW ConvGradW/X: padded kernels for non-zero padding or stride != 1
- im2col: fix early-return bug blocking stride > 1 weight gradients
- Conv2D bw param grads: pass actual padding to im2col (was zero)
