We should try a 3D vision transformer architecture for the model. It might give better accuracy than our current Conv3D architecture.
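Before we commit, it may help to check how many tokens a 3D ViT would produce for our inputs, since full self-attention cost grows with the square of the token count. Below is a rough sizing sketch, assuming non-overlapping 3D patches. The function name `vit3d_token_stats`, the input shape, and the patch shape are all placeholders for illustration, not our real settings.

```python
from math import prod


def vit3d_token_stats(input_shape, patch_shape, channels=1):
    """Return (num_tokens, patch_dim) for non-overlapping 3D patches.

    input_shape and patch_shape are (depth, height, width) tuples.
    """
    if len(input_shape) != 3 or len(patch_shape) != 3:
        raise ValueError("expected (depth, height, width) for both shapes")
    for size, patch in zip(input_shape, patch_shape):
        if size % patch != 0:
            raise ValueError(f"dimension {size} is not divisible by patch size {patch}")
    # One token per non-overlapping patch along each axis.
    grid = tuple(size // patch for size, patch in zip(input_shape, patch_shape))
    num_tokens = prod(grid)
    # Each patch is flattened into a vector before the linear projection.
    patch_dim = channels * prod(patch_shape)
    return num_tokens, patch_dim


# Hypothetical input: 16 x 224 x 224 volume, 3 channels, 2 x 16 x 16 patches.
tokens, dim = vit3d_token_stats((16, 224, 224), (2, 16, 16), channels=3)
print(tokens, dim)      # 1568 1536
print(tokens * tokens)  # 2458624 entries in one full attention matrix
```

If the token count comes out too high for our memory budget, larger patches or a smaller input would bring it down, possibly at some cost in fine detail.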