Owner
|
Nice work. The larger problem, it seems to me, is that VCF is madder than a box of frogs as a file format, e.g. it includes at least two incompatible delimited field specs, IIRC. How is tooling support for the binary call format (BCF) these days? Shouldn't that be the target format for performance? |
Author
|
Seems to me that it's still worth having the best performance in all use-cases. BCF has been discussed for years, but progress is slow. Shall we either merge or close this? There hasn't been a release for a while either; one would probably be useful for people. |
|
Merged on dridk@9e5de0f |
I haven't looked at this branch for a few weeks, and you should bear in mind that I've never used Cython before. I've rebased it and it passes all current tests.
My observation was that PyVCF is still rather slow in reading & writing large, real-world VCFs (about 6-8x slower than a simplistic split-index-join approach). The individual commits here should be reasonably clear, and I found:
I haven't had much luck with line-profiling to improve things further. One idea might be to lazy-parse the INFO fields – keep them as strings until accessed. They still seem to be a bottleneck even with Cython (large real-world VCFs may contain many annotations, for example).
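To make the lazy-parsing idea concrete, here is a minimal pure-Python sketch. It is not code from this branch: `LazyInfo` is a hypothetical name, and values are left as lists of strings rather than being converted according to the header's type declarations, which a real implementation would need to do.

```python
class LazyInfo:
    """Hypothetical wrapper: keeps the raw INFO column and parses it on first access."""

    def __init__(self, raw):
        self._raw = raw        # the INFO column exactly as read from the file
        self._parsed = None    # filled in lazily by __getitem__

    def _parse(self):
        parsed = {}
        if self._raw not in ("", "."):
            for entry in self._raw.split(";"):
                if not entry:
                    continue
                key, sep, value = entry.partition("=")
                # Flag entries have no "=" and are stored as True;
                # everything else becomes a list of (unconverted) strings.
                parsed[key] = value.split(",") if sep else True
        return parsed

    def __getitem__(self, key):
        if self._parsed is None:
            self._parsed = self._parse()
        return self._parsed[key]

    def __str__(self):
        # Writing a record back out never needs the parsed form.
        return self._raw


info = LazyInfo("DP=14;AF=0.5,0.25;DB")
print(info["AF"])
print(str(info))
```

Records whose INFO is never touched pay only for storing the string, and writing them back out reuses the original text unchanged.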
Downside here is further duplication between Python and Cython, but that seems unavoidable if supporting pure Python remains a priority.
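For reference, the kind of simplistic split-index-join baseline I was comparing against looks roughly like the sketch below. The function name `set_filter` is made up for illustration, and the only thing assumed about the format is that FILTER is the seventh tab-separated column of a VCF data line.

```python
def set_filter(line, new_filter):
    """Replace the FILTER column of one VCF data line without parsing anything else."""
    fields = line.rstrip("\n").split("\t")   # split
    fields[6] = new_filter                   # index: FILTER is column 7 (0-based 6)
    return "\t".join(fields) + "\n"          # join


line = "1\t100\t.\tA\tG\t50\t.\tDP=14\n"
print(set_filter(line, "PASS"), end="")
```

It does no validation or type conversion at all, so it is a lower bound on the work rather than a fair substitute for a real parser.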