Jerome Kelleher Profile
Jerome Kelleher

@jeromekelleher

Followers
2K
Following
4K
Media
101
Statuses
4K

Research group leader at the Oxford Big Data Institute. Lead developer on tskit/msprime/tsinfer. He/him. 🇮🇪 @[email protected]

Joined December 2013
@jeromekelleher
Jerome Kelleher
2 years
I haven't been on Twitter for a while, and have been enjoying the peace and quiet of Mastodon (@jeromekelleher@mstdn.science). I've returned for a brief bit of self promotion (it's been an eventful couple of weeks!), but am dropping off again 👋.
0
0
6
@jeromekelleher
Jerome Kelleher
1 month
Very happy to announce that the VCF Zarr paper is now published at GigaScience! See below for a tweetorial based on the preprint.
0
4
12
@jeromekelleher
Jerome Kelleher
1 year
If you are interested in this, or any other aspect of the work, please do get in contact! VCF Zarr has lots of potential, but this can only be realised if it is widely adopted. Efficient, FAIR access to VCF data *is* possible, but only with a concerted, community effort.
0
1
4
@jeromekelleher
Jerome Kelleher
1 year
One necessary piece of infrastructure that does not yet exist is a "vcztools" package that implements some of the read-only functionality of bcftools. This could provide compatibility with existing workflows, allowing a cloud-based Zarr store to have file-like semantics.
1
0
2
@jeromekelleher
Jerome Kelleher
1 year
We hope that these tools can provide the starting point for a new generation of tools that process genetic variation data. The VCF Zarr spec should provide a stable platform for methods developers, who can enjoy efficient, scalable access to data.
1
0
3
@jeromekelleher
Jerome Kelleher
1 year
We also provide the vcf2zarr converter, as part of the fledgling bio2zarr package. It supports both parallel and distributed conversion, and can handle very large datasets. It could also be improved in many ways - feedback and contributions welcome!
1
0
1
@jeromekelleher
Jerome Kelleher
1 year
We provide the draft VCF Zarr specification, which formalises the mapping from VCF to Zarr. While we're confident it captures the vast majority of use-cases, there are probably still lots of details that need working out. Feedback and contributions welcome!
1
0
2
@jeromekelleher
Jerome Kelleher
1 year
Zarr is currently used to store multiple petabyte-scale scientific datasets, and has multiple implementations. It is cloud-native, with first-class support for object stores like S3. It scales.
1
0
1
@jeromekelleher
Jerome Kelleher
1 year
But, does this work on real data? Yes! To demonstrate how well Zarr performs on real data with many FORMAT fields we converted chr2 for the Genomics England aggv2 dataset. Overall, we see a 5X reduction in storage compared to the original (12.81TiB over 106 vcf.gz files).
1
0
2
@jeromekelleher
Jerome Kelleher
1 year
Extracting individual (1-D) fields is even more extreme. Here we benchmark extracting the POS field and writing to a text file: 21,418 seconds with bcftools on a BCF file, vs 5 seconds using Zarr and Python.
1
0
1
@jeromekelleher
Jerome Kelleher
1 year
Where Zarr really starts to shine is when we are interested in *subsets* of the data. By storing fields separately, and by storing the data in each field as a regular grid of compressed chunks, subsetting is much more efficient. Here is the same benchmark on a small sub-matrix.
1
0
3
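A minimal sketch of why a regular chunk grid makes subsetting cheap: only the chunks that overlap the requested sub-matrix need to be fetched and decompressed. The matrix shape, chunk shape and function names here are hypothetical, chosen for illustration, and are not taken from the VCF Zarr specification or the Zarr library.

```python
def overlapping_chunks(start, stop, chunk_size):
    """Chunk indices along one axis touched by the half-open range [start, stop)."""
    if stop <= start:
        return range(0)
    return range(start // chunk_size, (stop - 1) // chunk_size + 1)


def chunks_for_submatrix(variants, samples, chunk_shape):
    """Grid coordinates of every chunk overlapping a (variants x samples) slice."""
    v_chunks = overlapping_chunks(variants[0], variants[1], chunk_shape[0])
    s_chunks = overlapping_chunks(samples[0], samples[1], chunk_shape[1])
    return [(i, j) for i in v_chunks for j in s_chunks]


def ceil_div(a, b):
    """Integer ceiling division."""
    return -(-a // b)


# Hypothetical genotype matrix: 100,000 variants x 5,000 samples,
# stored in chunks of 10,000 variants x 1,000 samples.
shape = (100_000, 5_000)
chunk_shape = (10_000, 1_000)

# A small 100 x 100 sub-matrix falls inside a single chunk.
needed = chunks_for_submatrix((25_000, 25_100), (1_500, 1_600), chunk_shape)
total = ceil_div(shape[0], chunk_shape[0]) * ceil_div(shape[1], chunk_shape[1])
print(f"chunks to read: {len(needed)} of {total}")
```

A slice that straddles a chunk boundary touches more chunks, but the count still depends on the size of the request rather than the size of the whole matrix.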
@jeromekelleher
Jerome Kelleher
1 year
Compression isn't everything though - we also want to *compute* with our data. Here is a benchmark in which we perform a simple calculation over the whole genotype matrix (see text for rationale), essentially comparing the computational accessibility of the formats.
1
0
3
@jeromekelleher
Jerome Kelleher
1 year
This yields excellent compression performance. Here is a benchmark based on (very realistic) simulations where we compare the Zarr-based approach with VCF, BCF and two state-of-the-art methods. Remarkably, Zarr's simple approach does almost as well as Savvy!
1
0
1
@jeromekelleher
Jerome Kelleher
1 year
We propose an alternative storage approach for variation data based on the widely used Zarr standard. Rather than grouping all data for a given variant together, we group all data for a given field, and store as chunked, compressed N-D arrays (tensors).
1
0
1
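To make the field-wise layout concrete, here is a toy sketch using only the Python standard library. It is simplified to 1-D fields, and it is not the Zarr API or the VCF Zarr format: the records, chunk size and function names are all hypothetical. Each field is pulled out of the per-variant records, split into fixed-size chunks, and each chunk is compressed independently.

```python
import zlib
from array import array


def store_field(values, chunk_size, typecode):
    """Split one field into fixed-size chunks and compress each one separately."""
    chunks = []
    for start in range(0, len(values), chunk_size):
        raw = array(typecode, values[start:start + chunk_size]).tobytes()
        chunks.append(zlib.compress(raw))
    return chunks


def load_chunk(chunks, index, typecode):
    """Decompress a single chunk without touching any of the others."""
    out = array(typecode)
    out.frombytes(zlib.decompress(chunks[index]))
    return list(out)


# Hypothetical per-variant records, as a row-wise format would hold them.
records = [{"POS": 100 + 10 * i, "QUAL": 30.0 + i % 5} for i in range(10)]

# Group by field instead of by variant: one chunked, compressed array per field.
store = {
    "POS": store_field([r["POS"] for r in records], chunk_size=4, typecode="i"),
    "QUAL": store_field([r["QUAL"] for r in records], chunk_size=4, typecode="d"),
}

# Reading part of POS never decompresses any QUAL data.
second_pos_chunk = load_chunk(store["POS"], 1, "i")
print(second_pos_chunk)
```

Because every chunk holds values of a single field and type, a general-purpose compressor sees homogeneous data, and a reader can fetch one field (or one chunk of it) in isolation.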
@jeromekelleher
Jerome Kelleher
1 year
It is the *row-wise* storage of data used by VCF (and most of its proposed alternatives, including BCF) that is most fundamentally limiting. It is not possible to efficiently extract a particular field or sample from row-wise variant stores.
1
0
1
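A small illustration of the row-wise limitation, using a made-up three-variant VCF fragment (the data and function name are hypothetical). To pull out just the POS field, every line has to be read, including all the per-sample genotype columns that are then thrown away.

```python
# Toy VCF text: two header lines followed by three variant rows.
vcf_text = "\n".join([
    "##fileformat=VCFv4.3",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\ts1\ts2",
    "chr2\t101\t.\tA\tG\t50\tPASS\t.\tGT\t0|1\t1|1",
    "chr2\t250\t.\tC\tT\t40\tPASS\t.\tGT\t0|0\t0|1",
    "chr2\t377\t.\tG\tA\t60\tPASS\t.\tGT\t1|1\t0|0",
])


def pos_from_rows(text):
    """Extract POS from row-wise text, counting how many characters were scanned."""
    positions = []
    scanned = 0
    for line in text.splitlines():
        scanned += len(line)  # the whole row is read, whatever field we want
        if line.startswith("#"):
            continue
        positions.append(int(line.split("\t", 2)[1]))
    return positions, scanned


positions, scanned = pos_from_rows(vcf_text)
useful = sum(len(str(p)) for p in positions)
print(positions, f"scanned {scanned} characters for {useful} characters of POS")
```

The scanned total grows with the number of samples per row, while the amount of POS data does not, which is why extracting one field from a wide row-wise file is so costly.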
@jeromekelleher
Jerome Kelleher
1 year
VCF is in many ways a tremendous success, providing a single channel through which all kinds of genomics data flows, with (mostly) good interoperability. It is a great archival format. However, it is not an efficient basis for computation, particularly at Biobank scale.
1
0
3
@jeromekelleher
Jerome Kelleher
1 year
Excited to share my latest preprint (with a stellar band of collaborators), where we map the VCF data model into an efficient, cloud-native storage format. Thread follows:
1
32
76
@jeromekelleher
Jerome Kelleher
1 year
UPDATE: closing date now **June 26th**, so plenty of time to apply!
0
0
1
@jeromekelleher
Jerome Kelleher
1 year
There's a 1.5 year postdoc position working on msprime/background selection models open in my group in Oxford. Closing date is **June 12th**, so please do get in contact if you're interested, or send on to anyone who might be interested!
1
35
36
@jeromekelleher
Jerome Kelleher
2 years
Great to see this preprint from @GeorgiaTsambos out!
0
15
28
@jeromekelleher
Jerome Kelleher
2 years
We'd love to hear people's thoughts on this! It's definitely been a learning experience for us, and hopefully others will find the ideas here useful. Huge thanks to my coauthors @DrYanWong, @ana_ignatieva, Jere Koskela, @GregorGorjanc and @WilderWohns!
0
2
3