@RVC-Boss @yxlllc
Hey guys ~ many other devs and I have been trying our hardest for the last 2 years to improve RVC and much more.
That's essentially how current Applio and my v4 fork came to be.
For a long while now, re-creating better base models has been our primary target, involving many experiments:
- Different datasets ( including but not limited to: VCTK, M4, EARS, JL Corpus, LibriSpeech and much more.. )
- Various vocoder attempts: RefineGAN, Avocodo, HiFTNet, iSTFTNet, WaveHax, RingFormer and currently my custom experimental PCPH-GAN utilizing a PCPH signal and SiLU/Snake-Beta activations
- Various custom losses to aid the training: envelope loss, multi-scale mel loss, multi-res STFT loss and more ( see the sketch after this list )
- Different adversarial objectives ( hinge, TPRLS and, iirc, WGAN-GP )
- Other discriminators ( Avocodo discs, MS-SB-CQT, MR-STFT, wavelet MPD/MSD takes )
and many experiments involving different combinations.
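For context, here's a minimal sketch of the multi-res STFT loss mentioned above, assuming the common Parallel WaveGAN-style formulation ( spectral convergence + log-magnitude L1 ); the resolutions and the magnitude floor are illustrative choices, not anything pulled from RVC's code:

```python
import torch
import torch.nn.functional as F


def stft_magnitude(x, n_fft, hop, win):
    # Magnitude spectrogram with a small floor so the log below stays finite.
    window = torch.hann_window(win, device=x.device)
    spec = torch.stft(x, n_fft=n_fft, hop_length=hop, win_length=win,
                      window=window, return_complex=True)
    return spec.abs().clamp(min=1e-7)


def multi_res_stft_loss(pred, target,
                        resolutions=((512, 128, 512),
                                     (1024, 256, 1024),
                                     (2048, 512, 2048))):
    # pred/target: (batch, samples) waveforms.
    loss = 0.0
    for n_fft, hop, win in resolutions:
        mag_p = stft_magnitude(pred, n_fft, hop, win)
        mag_t = stft_magnitude(target, n_fft, hop, win)
        # Spectral convergence term.
        sc = torch.norm(mag_t - mag_p, p="fro") / torch.norm(mag_t, p="fro")
        # Log-magnitude L1 term.
        log_mag = F.l1_loss(torch.log(mag_p), torch.log(mag_t))
        loss = loss + sc + log_mag
    return loss / len(resolutions)
```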
Unfortunately, we still can't reliably match what's presented by the original RVC pretrained models,
and as far as I know, it is still unclear exactly what steps were taken during their training.
What's supposedly known for sure about the base models:
- Dataset was VCTK only.
- Batch size was 16 ( 4 per GPU * 4 GPUs ).
- Training was done on 4 GPUs.
- There are 3 extra periods in the MultiPeriodDiscriminator ( 17, 23, 37 ).
- The MultiScaleDiscriminator is set up to work as a single-scale-like disc.
- Learning rate was 1e-4.
- Exponential LR decay with a gamma of 0.999875 as the scheduler ( see the sketch below for how these fit together ).
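To make that list concrete, here's a minimal PyTorch sketch of the setup it implies; the AdamW betas/eps follow the usual VITS/HiFi-GAN defaults and the placeholder networks are hypothetical stand-ins, so only the LR, gamma, periods and batch math come from the points above:

```python
import torch
import torch.nn as nn

# Standard HiFi-GAN MPD periods plus the 3 extra ones listed above.
mpd_periods = [2, 3, 5, 7, 11, 17, 23, 37]

# Hypothetical stand-ins for the actual net_g / net_d modules.
net_g = nn.Linear(192, 192)
net_d = nn.Linear(192, 1)

# AdamW with betas=(0.8, 0.99) / eps=1e-9 is the usual VITS/HiFi-GAN
# default and is an assumption here; lr and gamma are from the list.
optim_g = torch.optim.AdamW(net_g.parameters(), lr=1e-4,
                            betas=(0.8, 0.99), eps=1e-9)
optim_d = torch.optim.AdamW(net_d.parameters(), lr=1e-4,
                            betas=(0.8, 0.99), eps=1e-9)

# Exponential decay with the stated gamma, typically stepped once per epoch.
sched_g = torch.optim.lr_scheduler.ExponentialLR(optim_g, gamma=0.999875)
sched_d = torch.optim.lr_scheduler.ExponentialLR(optim_d, gamma=0.999875)

# 4 samples per GPU across 4 GPUs -> effective batch size of 16.
per_gpu_batch, num_gpus = 4, 4
effective_batch = per_gpu_batch * num_gpus
```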
Now, the questions are:
- Were there any pitch-shift transformations or other alterations performed on VCTK? ( roughly the kind of augmentation sketched after this list )
- Was it truly just VCTK? Or did you guys also employ a bit of M4 or some other " singing " set? And if not, how did you achieve relatively decent f0 handling on such a pitch-poor ( fairly monotone ) dataset?
- Any actual field-tested reasons to utilize the 3 extra periods? Or is it simply that you referenced playvoice's nsf-hifigan take ( or specifically vtuber-plan's mirror(?) )?
- Was there any gradient clipping utilized during training, or was it left as is?
- Were the base models trained using mixed precision or all fp32?
- Any specific mid-training operations or tweaks?
- Was each pretrain set ( per sample rate ) trained independently, or was it iteratively re-trained from lower SR to higher?
And lastly, for how long did you train each base model? Anywhere close to 200k steps, or more towards 1M+?
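For clarity on the pitch-shift question, this is roughly the kind of augmentation I mean; a hypothetical torchaudio sketch with an assumed ±4 semitone range, not anything confirmed about your pipeline:

```python
import random

import torch
import torchaudio.functional as AF


def random_pitch_shift(wav: torch.Tensor, sr: int, max_steps: int = 4):
    # Shift by a random whole number of semitones in [-max_steps, max_steps];
    # the range is an arbitrary illustrative choice.
    steps = random.randint(-max_steps, max_steps)
    if steps == 0:
        return wav
    return AF.pitch_shift(wav, sr, n_steps=steps)
```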
As a side note, some of us have been meaning to contact you for quite a while, but sadly there's no reliable contact or email provided anywhere.
We would truly appreciate some insights!
And yes, I realize RVC is kind of a dead project now and, boss, you've taken the TTS direction,
but we don't want to give up too easily; we essentially want to continue the legacy, as there's so much room for improvement in many areas.. and foremost, there's still nothing like RVC ( or on par with its quality ) to this date.
Here's my email just in case if you feel more comfortable that way: [email protected]
In any case, appreciate the response!