Release v0.2.0 · pulp-platform/Deeploy

Release v0.2.0 (2025-07-08) #103

This release contains major architectural changes, new platform support, enhanced simulation workflows, floating-point kernel support, training infrastructure for CCT models, memory allocation strategies, and documentation improvements.

List of Pull Requests

Prepare v0.2.0 release #102
Add Luka as Code Owner #101
Fix CI, Docker Files, and Documentation Workflow #100
Chimera Platform Integration #96
Add Tutorial and Refactor README #97
Reduce Mean Float Template #92
Reshape Memory Freeing and Generic Float GEMM Fixes #91
Prepare for Release and Separate Dependencies #90
Fix input offsets calculation #89
Move PULP SDK to main branch/fork #88
Finite Lifetime for IO Tensors #51
Improved Memory Visualization and Multi-Layer Tiling Profiling #56
Fix Linting in CI and Reformat C Files #86
Fix Broken CMake Flow For pulp-sdk #87
Refactor Changelog For Release #85
ARM Docker Container and Minor Bug Fix #84
Added Kernel for Generic Float DW Conv2D #63
Autoselect Self-Hosted Runners if the Action is on Upstream #81
TEST_RECENT linking on MacOS #78
Add RV32IMF Picolibc support for Siracusa platform #66
Improve Documentation and VSCode Support #76
Debug Print Topology Pass and Code Transformation #75
Find all subdirectories of Deeploy when installing with pip install #70
Add milestone issue template #71
Bunch of fixes and changes #58
Add SoftHier platform #65
rv32imf_xpulpv2 ISA support for Siracusa platform #64
One LLVM To Compile Them All #60
One GVSoC to Simulate Them All #59
Add Support for CCT Last Layer Training with Embedding Dim 8-128 #55
Add CCT Classifier Training Support #53
L3 Bugs: DMA Struct Datatype and Maxpool Margin Error #45
DeepQuant Quantized Linear Support #54
Implemented Dequant Layer for Generic and Siracusa #52
Infinite Lifetime Buffers Considered in Tiling & Memory Allocation (+ Visualization) #44
Implemented Quant Layer for Generic and Siracusa #49
Increase maximal Mchan DMA transfer sizes from 64KiB to 128KiB #47
Add MiniMalloc and Decouple Memory Allocation and Tiling #40
Float CCT Bugs on L3 #37
Memory Allocation Strategies and Visualization #36
Add CODEOWNERS #42
Add Tiling Support to All CCT Kernels and Fix CCT Operators on Siracusa Platform for L2 #35
Add Fp gemm and Softmax for Snitch platform #31
Add Float Kernels for CCT #29
documentation deployment #34
main.c Float Cast Bugs #28
Add Float GEMM on PULP with Tiling #26
Add Float Support & Float GEMM for Generic #25
GVSOC support for the Snitch Cluster platform #23
Snitch Cluster Tiling Support #22
Snitch support integration #14
Update bibtex citation #20
the PR template location, bump min python to 3.10, change install command #17
Add pre-commit for python formatting #15
FP integration (v2) #12
shell for sequential tests of Generic, Cortex, and Mempool platforms #11
Add issue templates #10
Minor CI and Readme Improvements #8
Fix GHCR Link for Docker Build #7
neureka's ccache id #6
GitHub-based CI/CD Flow #4
Generic Softmax Kernel #2
Port GitLab CI #1

Added

ChimeraDeployer, currently mainly a placeholder
Allocate templates for Chimera
ChimeraPlatform, using appropriate allocation templates and using the generic Parser + Binding for the Add node
Adder CI test for Chimera
Install flow for chimera-sdk via Makefile
DeeployChimeraMath library
Generic FP32 reduce mean bindings, parser, and template
New alias list parameter for buffer objects
New test, also included in the CI pipeline, for the reshape and skip connection situation
'shape' parameter handling similar to the 'indices' parameter in the generic reshape template
Test the correctness of the memory map generated by the tiler
Add attribute to VariableBuffer to distinguish I/Os
Add proper static memory allocation with finite lifetime for I/Os
The memory allocation visualization now displays the allocation for each level used
Tutorial section in the documentation
Guide on using the debug print topology pass and code transformation
VSCode configuration files for improved IDE support
Multi-branch GitHub Pages deployment support
Test for the DebugPrintTopologyPass.
Test for PrintInputGeneration, PrintOutputGeneration, MemoryAwarePrintInputGeneration, MemoryAwarePrintOutputGeneration
check for CMAKE variable and fallback to searching for cmake
tensor name mangling
identity operation removal
_unpack_const helper function to NodeParser to allow for node attributes that are direct Constant tensors or direct numpy values
load_file_to_local in dory_mem as a way to load values directly to a local memory (not RAM). Needed for copying values from flash to wmem, as required by Neureka v2
Add the documentation.yml workflow to deploy doc pages.
Improved README with more detailed Getting Started section, a section listing related publications, and a list of supported platforms.
Schedule a CI run every 6 days at 2AM CET to refresh the cache (it expires after 7 days if unused).
Add the FloatImmediate AbstractType
Define fp64, fp32, fp16, and bf16
Add float binding for the Adder in the Generic platform
Add a FloatAdder test to the CI for Siracusa and Generic platforms
Extend testType.py with float tests
LIMITATION: The current LLVM compiler does not support bf16 and fp16; these types are commented out in the library header
CMake flow for the Snitch Cluster
Added snitch_cluster to Makefile
New Snitch platform with testing application
Testrunner for tiled and untiled execution (testRunner_snitch.py, testRunner_tiled_snitch.py)
Minimal library with CycleCounter and utility function
Support for single-buffered tiling from L2.
Parsers, Templates, TypeCheckers, Layers, and TCF for the newly supported operators.
A code transformation pass to filter DMA cores or compute cores for an ExecutionBlock.
A code transformation pass to profile an ExecutionBlock.
Test for single kernels, both with and without tiling.
Adds the --debug flag to cargo install when installing Banshee to get the possibility of enabling the debug prints.
New tests for the snitch_cluster platform.
Add macros to main.c to disable printing and testing (convenient when running RTL simulations).
gvsoc in the Makefile and dockerfile
cmake flow for gvsoc
CI tests regarding Snitch run on GVSOC as well
Float Support for Constbuffer
Simple Float GEMM on Generic and PULP
FP GEMM to CI
FP GEMM Tiling on PULP
Add one new #define OUTPUTTYPE to testoutput.h
Float Template, binding and parser, test for Conv2D, LayerNorm, Div, Relu, Softmax, MaxPool, Matmul, Transpose, Gelu, Mul, Reshape, Gather, Squeeze, Padding
CCT model test to Generic Target
Math Lib link on Generic Target
New templates for GEMM and Softmax.
Added GEMM and Softmax to TargetLibraries, including case for GEMM with a transposed B matrix.
Added new CI tests for GEMM and Softmax.
Float Bindings, Tilers of CCT kernels for Pulp Target
Float Convolution, MaxPool Parser, Template, Kernel with HWC layout and padding integrated
Added tiling constraints for conv, gather, and layernorm, and existing constraints for other kernels
profileuntiled arg
CCT onnx tests with img size of 16 and 32
CODEOWNERS file to control who is responsible for reviewing future PRs.
A visualization of the memory allocation solution generated by Deeploy at each level of memory. I use Plotpy to generate a static html file and save it to the DeeployState directory.
An initialization strategy for the variable in the tiling to randomize the variables related to the permutation matrix.
New interface to testRunner_tiled_siracusa to control the generation of the memory allocation visualization, the memory allocation strategy, and the search strategy.
Export a new docker container with plotpy as dependency.
Added multiple CCT settings for testing.
Added CCT L3 test to CI to ensure correctness for img size of 16 and 32.
Added NaN check for deeploytest diff to improve result validation.
Installation and compilation flow for MiniMalloc through Makefile.
Adapt the docker to install MiniMalloc and declare necessary symbols.
Add the constraintTileBuffersWithOverlappingLifetime method to the memory scheduler to add the necessary memory constraint when we decouple memory allocation and tiling.
Add the minimalloc method to the Tiler class. MiniMalloc comes as a precompiled cpp library using CSV for I/O. Hence, this method converts Deeploy's memory map to MiniMalloc's CSV representation, calls a subprocess to run MiniMalloc, reads the output CSV, and translates it back to Deeploy's memory map.
Add MiniMalloc to the memory allocation strategies and add a new argument to the test runner to control the L2 size.
New Quant operation to handle quantization pattern in ONNX models
Implementation for both Generic and Siracusa targets in the Deeploy framework
Custom QuantPatternPass class to replace matched patterns with a single Quant operator
Parser implementation in Parsers.py to extract quantization parameters
C template implementation in QuantTemplate.py for efficient quantization
Type checker implementation in TypeCheckers.py to handle bit-width and signedness
New Dequant operation to handle dequantization pattern in ONNX models
Implementation for both Generic and Siracusa targets in the Deeploy framework
Custom DequantPatternPass class to replace matched patterns with a single Dequant operator
Parser implementation in Parsers.py to extract dequantization parameters
C template implementation in DequantTemplate.py for efficient dequantization
Type checker implementation in TypeCheckers.py to handle bit-width and signedness
New Test Cases: Added and passed tests for 16×16 64 and 16×16 128 configurations to validate correctness.
New _sanitizeGraphNames function to sanitize the names of the nodes and tensors of the graph
Implementation for both Generic and Siracusa targets in the Deeploy framework
Modified the binding of dequant in Bindings.py to handle int32 after GEMM operation
New test cases: testTrainCCT/CCT_GEMM_Weight_Bias_1_16_16_8, testFloatReduceSum, testFloatSoftmaxGrad, testFloatSoftmaxCrossEntropy, testFloatSoftmaxCrossEntropyGrad
New kernels: SoftmaxCrossEntropy, SoftmaxCrossEntropyGrad, SoftmaxGrad, ReduceSum
Refinements in operator parsers and computeShape logic for: Softmax, Mul, ReduceSum
Support for SoftmaxCrossEntropyLoss and SoftmaxCrossEntropyLossGrad with tiling.
Implementation of SGD updates for CCT training.
Test for one iteration of CCT last-layer training with dimensions from 8 to 128.
All Banshee dependencies now have a frozen version. This improves maintainability as some packages get yanked for the old versions of Rust.
Increase the L2 buffer size for loading files from Flash to RAM. This speeds up the simulation setup time.
Align the GVSoC simulation command and build command for the new version.
Bump new version of GVSoC and PULP-SDK
Build flow and its Docker integration for LLVM 15 tagged `15.0.0-snitch-0.1.0`
Picolibc build flow for rv32im, rv32ima, rv32imc and rv32imafd. Previously, it was only for rv32imc.
LLVM Compiler RT for rv32im, rv32ima, and rv32imafd.
Appropriate linking of picolibc and compiler RT.
Build and install flow for XTensor, XTL, and XSIMD. These libraries are used in some GVSoC models, and they used to live in the PULP SDK as header-only libraries. Keeping only the library headers in the PULP SDK makes it hard to bump to new versions.
Adds RV32IMF Picolibc to the toolchain
Generic float DW Conv2D kernel and bindings.
Bias handling and computation for regular and DW Conv2D.
Empty bias handling for generic regular and DW Conv2D.
Tests for Conv2D regular and DW, with and without bias (and included them in the CI pipeline).
BuildDockerToolchain.yml to build Toolchain Docker container
BuildDockerDeeploy.yml to build Deeploy Docker container
Add support for linux/arm64 containers
Added caching to speed up container builds
Makefile to simplify local container build
Add helper script to generate a baseline changelog.
SoftHier Deeploy Targets, including Deployer, Platform, and Templates
SoftHier cmake compilation flow
SoftHier CI task
Parallel implementations of the following operators on Siracusa: Matmul, Softmax, Gelu, Conv, Layernorm, Maxpool, Add, Mul, and Relu
Gelu with Sigmoid implementation
ComputeOp support for multiple float kernels: Maxpool, Relu, and Mul
dev-requirements.txt tracking the dependencies of the build system, linting, documentation, and QOL.
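
The MiniMalloc entry in this list describes a CSV round trip: Deeploy's memory map is written as CSV, MiniMalloc runs as a subprocess, and the output CSV is translated back. A minimal sketch of that round trip is shown here. It is an illustration, not Deeploy's actual `minimalloc` method: the `Block` type, the function names, and the binary path are hypothetical, and the CSV columns (`id,lower,upper,size` as input, with an extra `offset` column as output) and the `--capacity`, `--input`, and `--output` flags are assumptions based on the upstream MiniMalloc README.

```python
import csv
import io
import subprocess
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Block:
    """Hypothetical stand-in for one buffer in a memory map."""
    name: str
    start: int  # first step at which the buffer is live
    end: int    # step at which the buffer is no longer needed
    size: int   # size in bytes


def blocks_to_csv(blocks: List[Block]) -> str:
    """Serialize buffers into the assumed MiniMalloc input layout."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["id", "lower", "upper", "size"])
    for b in blocks:
        writer.writerow([b.name, b.start, b.end, b.size])
    return out.getvalue()


def csv_to_offsets(text: str) -> Dict[str, int]:
    """Read the assumed MiniMalloc output layout back into name -> offset."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["id"]: int(row["offset"]) for row in reader}


def run_minimalloc(binary: str, capacity: int, in_path: str, out_path: str) -> None:
    """Invoke the solver as a subprocess; raises if it exits with an error."""
    subprocess.run(
        [binary, f"--capacity={capacity}", f"--input={in_path}", f"--output={out_path}"],
        check=True,
    )
```

Keeping serialization and parsing separate from the subprocess call means the conversion logic can be tested without the MiniMalloc binary installed.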

Changed

Bump the CMake version to 3.24 as required for the chimera-sdk
Bump GVSoC's version and add chimera simulation target
Rename the generic source util to utils to avoid name collision with chimera-sdk
Moved PULP SDK from Victor-Jung/pulp-sdk branch deeploy to pulp-platform/pulp-sdk branch main.
Memory arena buffers are now declared at the beginning of the InitNetwork function
Tiling profiling is now an ON/OFF version where you get the I/O DMA time for each DMA call
The profiling strings are const static, such that they are stored in .rodata
Adapt the select docker image stage to also select a runner depending on github.repository
Adapt the jobs and reusable workflows to use the selected runner.
Updated README.md description to use a persistent development container
Symlinking of the latest build and source files into TEST_RECENT
Disabled CMAKE_VERBOSE_MAKEFILE by default for cleaner builds.
Refactored IntrospectiveCodeTransformationMixIn to allow extracting dynamic references to global variables
duplicateConstants now also duplicates constant nodes
check float output define in DeeployTest Generic platform
kernel_shape now inferred from weight shape if not present as per ONNX spec
USE_NEUREKA moved into TargetLibraries where it's closer to pulp-nnx
hex dumping logic for pulp platforms in prep for neureka v2 where I need to save weights to flash and move them during runtime to wmem
add_gvsoc_emulation macro now requires an additional target argument and abstracted adding flags to gvsoc through the GVSOC_EXTRA_FLAGS variable
Updated README.md with direct link to the documentation page.
Update the Banshee's commit to include a recent PR.
Add the possibility of changing the simulator when using the snitch-tiled test runner.
Add the RTL library to the snitch_cluster build process in the Makefile, required for GVSOC simulation
float infinity macro #define inf
Signprop depend on float check and platform
Adapted snitch Bindings and Platform files.
Removed unused TilerAwareDeployer class.
Regenerated CCT ONNX files without "output" & "input" in their names to avoid triggering the dumphex parser bug.
Regenerated CCT ONNX file with 3 branches for attention, transforming the attention computation graph into three branches.
Changed code generation for Hex output to properly handle float values.
Enhanced layernorm operator to support three outputs (layernormout, mean, std) for compatibility with training-related layernormgrad in the future.
Modified the outputs of LayerNorm and SoftmaxCrossEntropyLoss nodes to a single output for better tiling compatibility.
Added SGD parameter updates to the CCT training graph.
Officially deprecate Banshee as a simulator for Snitch Cluster in the CI. Maintaining this is a burden and unnecessary, as GVSoC is now the standard simulator. Additionally, newer versions of the Snitch runtime don't support Banshee anymore.
Bump XTensor's version to 0.25.0 to fix a bug with Intel's SSE.
Update snitch cluster patch to link to picolibc and add explicit target.
Update README to include Snitch in the Getting Started and the D&T Journal.
The ISA for the Siracusa platform has been updated from rv32imc_zfinx_xpulpv2 to rv32imf_xpulpv2.
All floating-point comparison tasks in deeploytest.c are now offloaded to Cluster 0 for execution.
Split the original build flow into two containers
Refactor changelog for better readability
Reformatted all C files
Prepare pyproject.toml for a proper pip package release.
Packages listed in dev-requirements.txt are installed in the final stage of the Deeploy container.
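
The `kernel_shape` entry in this list relies on the ONNX Conv rule that a missing `kernel_shape` attribute is inferred from the weight tensor, whose layout is (M, C/group, k1, k2, ...). A minimal sketch of that fallback is shown here. The function name and the plain-dict attribute representation are hypothetical and do not reflect Deeploy's parser code.

```python
from typing import Dict, List, Sequence


def infer_kernel_shape(attrs: Dict[str, List[int]], weight_shape: Sequence[int]) -> List[int]:
    """Return kernel_shape from the attributes, or infer it from the weights."""
    if "kernel_shape" in attrs:
        return list(attrs["kernel_shape"])
    # ONNX Conv weights are (M, C/group, k1, k2, ...): drop the two channel
    # dimensions and keep the spatial ones.
    return list(weight_shape[2:])
```

For a 2D convolution with weights of shape (16, 3, 3, 3), the inferred kernel shape is the trailing pair of dimensions.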

Fixed

DW Conv2D kernel header to avoid warnings
FP casting in GELU kernel to avoid warnings
Buffer deallocation to only happen when all its aliases are not live anymore (the data stored there is not needed anymore, not even by other nodes)
GEMM Generic float template to iterate through terms only when they actually contain multiple matrices
Fix the PULP Deployer where outputs were unnecessarily loaded in L3
Fix the lifetime computation of aliased buffers
Removed unsupported -MMD compiler flag in LLVM-based toolchains.
Fix DebugPrint topology pass
Fix PrintInput code transformations to work with global variables
RequantShift when log2d is 0
missing math.h headers
clang on mac doesn't support -Wl,--gc-sections flag, moved it into each target and for host it's checking now for host system
-ffast-math caused numerical errors on generic so moved into each target and removed from that one since I'm imagining it as the debug target
Gather kernel on generic target
Update the link of the Docker container used to run the CI with the Docker published by this repo instead of my fork.
Add a retry on timeout step for large network tests. This is a temporary fix to address the sporadic freeze happening at the compilation stage, see this issue.
Float bug on Testslice, CMSIS TestUtil, DivInteger
AbstractDataType Float Bugs
Change main.c to use OUTPUTTYPE instead of float
MaxPool Padding Extract Pass for float and integer
Testinput, testoutput, weight type casted from double to float warning
Relaxed the error threshold between expected and actual values in deeploytest.
CycleMeasure Pass for Siracusa Untiled Profiling
GEMM Tiling Constraints: transA and transB not supported
MatMul layer Multi-Dimensional Input Issue
Add Layer for Broadcasted Bias
Resolved an issue where concatenation of float32 with f caused inf errors during code generation
Fixed a bug in the MemoryScheduler where the CP problem was solved more times than needed.
Updated printinput nodetemplate for float handling.
Fix testMVP.py to get a proper should fail test.
Maxpool Tile Calculation Error: The last dimension padding was incorrectly calculated due to L3 wraptiling solution. This has been fixed by updating serializeTilingSolution of Maxpool to avoid incorrect padding of Maxpool and prevent potential DMA 3D transfer issues of Maxpool.
DMA 1D Copy Assertion Issue: Updated the DMA length datatype from uint16 to uint32 to avoid assertion failures when dealing with large block transfers.
Deeploy subdirectories installed when installing Deeploy with pip install
Fix linking TEST_RECENT on MacOS
Fixed broken VSCode launch configuration
Fixed broken pulp-sdk hash
Fix issue with building banshee on `linux/arm
Removed i3c related files from the pulp-sdk CMake flow
Fixed C-code linting stage in CI
Input offset height and width calculation for tiled PULPOpen convolution kernels
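
The DMA 1D copy fix in this list widens the length field from uint16 to uint32. The reason is plain arithmetic: a 16-bit field holds at most 65535, so a larger block transfer silently wraps around. A minimal sketch using Python's `ctypes` fixed-width integers is shown here. The helper name is hypothetical, and this does not model the actual DMA struct.

```python
import ctypes


def dma_length_fits(num_bytes: int, field_type=ctypes.c_uint16) -> bool:
    """Check whether a transfer length survives storage in a fixed-width field."""
    # ctypes integers do no overflow checking; out-of-range values wrap.
    return field_type(num_bytes).value == num_bytes


transfer = 128 * 1024  # a 128 KiB block transfer
print(dma_length_fits(transfer))                   # 16-bit length field
print(dma_length_fits(transfer, ctypes.c_uint32))  # 32-bit length field
```

With a 16-bit field the 128 KiB length wraps to 0, which is the kind of mismatch an assertion on the transfer size would catch.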

Removed

Removed commented code from generic allocation templates
Remove the link to the precompiled LLVM 12 in the testRunner for Snitch and in the CI.
Remove the sourcing of the cursed PULP SDK script.
Commented IPython breakpoints.
