In end-to-end Scan2Cap, the relation graph takes the original proposals without NMS as input. In fixed-detector Scan2Cap, however, it takes the original proposals with NMS applied. I don't think that is a fair comparison.
I ran experiments with pre-fetched VoteNet features without NMS, using train_pretrained.py to train the fixed-detector model. The results show that the fixed-detector variant actually outperforms the end-to-end one.
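To make the difference in graph inputs concrete, here is a minimal NumPy sketch of the two settings. It is not the repository's code: `iou_3d`, `nms_mask`, the toy boxes, and the 0.25 IoU threshold are all hypothetical and only illustrate how an NMS mask changes the set of proposals the relation graph sees.

```python
import numpy as np

def iou_3d(box, boxes):
    """IoU between one axis-aligned box and many.
    Boxes are (xmin, ymin, zmin, xmax, ymax, zmax)."""
    lo = np.maximum(box[:3], boxes[:, :3])
    hi = np.minimum(box[3:], boxes[:, 3:])
    inter = np.prod(np.clip(hi - lo, 0.0, None), axis=1)
    vol = np.prod(box[3:] - box[:3])
    vols = np.prod(boxes[:, 3:] - boxes[:, :3], axis=1)
    return inter / (vol + vols - inter)

def nms_mask(boxes, scores, iou_thresh=0.25):
    """Greedy NMS; returns a boolean keep-mask over the proposals.
    The threshold is an assumed value, not taken from the repo."""
    keep = np.zeros(len(boxes), dtype=bool)
    suppressed = np.zeros(len(boxes), dtype=bool)
    for i in np.argsort(-scores):  # highest score first
        if suppressed[i]:
            continue
        keep[i] = True
        suppressed |= iou_3d(boxes[i], boxes) > iou_thresh
    return keep

# Toy proposals: the first two are near-duplicates of the same object.
boxes = np.array([
    [0.00, 0.0, 0.0, 1.00, 1.0, 1.0],
    [0.05, 0.0, 0.0, 1.05, 1.0, 1.0],
    [2.00, 2.0, 2.0, 3.00, 3.0, 3.0],
    [5.00, 5.0, 5.0, 6.00, 6.0, 6.0],
])
scores = np.array([0.9, 0.8, 0.7, 0.6])
features = np.arange(4 * 8, dtype=float).reshape(4, 8)  # placeholder features

mask = nms_mask(boxes, scores)

graph_input_without_nms = features        # all proposals become graph nodes
graph_input_with_nms = features[mask]     # only NMS survivors become graph nodes

print(graph_input_without_nms.shape[0], graph_input_with_nms.shape[0])
```

In this toy case the graph sees four nodes without NMS and three with it, so the two setups are not building the relation graph over the same set of proposals.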