Add a lightweight scalar implementation

The current implementations generate large binaries because each bit width has its own specialized implementation and the loops are unrolled.

Add a more compact scalar implementation that can be enabled with a flag. This would be useful for WebAssembly, for instance, where binary size matters.
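
To make the idea concrete, here is a rough sketch of what a compact variant could look like, assuming the implementations in question pack and unpack `u32` values at a given bit width. The function names `pack_generic` and `unpack_generic` and the feature name `compact-scalar` are hypothetical, not part of the existing code. Instead of one unrolled routine per bit width, a single loop takes the bit width as a runtime parameter:

```rust
// Hypothetical sketch: could be gated with something like
// `#[cfg(feature = "compact-scalar")]` (feature name is an assumption).

/// Packs each value into `num_bits` bits (little-endian bit order) and
/// appends the resulting bytes to `out`. Bits above `num_bits` are dropped.
pub fn pack_generic(values: &[u32], num_bits: u8, out: &mut Vec<u8>) {
    assert!(num_bits <= 32);
    let mask: u64 = (1u64 << num_bits) - 1;
    let mut acc: u64 = 0; // bit accumulator
    let mut filled: u32 = 0; // number of valid bits currently in `acc`
    for &v in values {
        acc |= ((v as u64) & mask) << filled;
        filled += num_bits as u32;
        // Flush every complete byte.
        while filled >= 8 {
            out.push(acc as u8);
            acc >>= 8;
            filled -= 8;
        }
    }
    // Flush the trailing partial byte, if any.
    if filled > 0 {
        out.push(acc as u8);
    }
}

/// Reads `count` values of `num_bits` bits each from `packed`.
/// Panics if `packed` is too short.
pub fn unpack_generic(packed: &[u8], num_bits: u8, count: usize) -> Vec<u32> {
    assert!(num_bits <= 32);
    let mask: u64 = (1u64 << num_bits) - 1;
    let mut out = Vec::with_capacity(count);
    let mut acc: u64 = 0;
    let mut filled: u32 = 0;
    let mut bytes = packed.iter();
    for _ in 0..count {
        // Refill the accumulator until it holds one full value.
        while filled < num_bits as u32 {
            let b = *bytes.next().expect("packed input too short");
            acc |= (b as u64) << filled;
            filled += 8;
        }
        out.push((acc & mask) as u32);
        acc >>= num_bits;
        filled -= num_bits as u32;
    }
    out
}
```

A version like this will most likely be slower than the specialized, unrolled code, but it compiles to two small loops regardless of how many bit widths are supported. Its byte layout is also an assumption and may not match the existing format.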