Jerome Kelleher @jeromekelleher X Profile

Jerome Kelleher

@jeromekelleher

Followers

2K

Following

4K

Media

101

Statuses

4K

Research group leader at the Oxford Big Data Institute. Lead developer on tskit/msprime/tsinfer. He/him. 🇮🇪 @[email protected]

Joined December 2013


Jerome Kelleher

@jeromekelleher

2 years

I haven't been on Twitter for a while, and have been enjoying the peace and quiet of Mastodon (@jeromekelleher @mstdn.science). I've returned for a brief bit of self promotion (it's been an eventful couple of weeks!), but am dropping off again 👋.

0

6

Jerome Kelleher

@jeromekelleher

1 month

Very happy to announce that the VCF Zarr paper is now published at GigaScience! See below for a tweetorial based on the preprint.

Jerome Kelleher

@jeromekelleher

1 year

VCF is in many ways a tremendous success, providing a single channel through which all kinds of genomics data flows, with (mostly) good interoperability. It is a great archival format. However, it is not an efficient basis for computation, particularly at Biobank scale.

0

4

12

Jerome Kelleher

@jeromekelleher

1 year

If you are interested in this, or any other aspect of the work, please do get in contact! VCF Zarr has lots of potential, but this can only be realised if it is widely adopted. Efficient, FAIR access to VCF data *is* possible, but only with a concerted, community effort.

0

1

4

Jerome Kelleher

@jeromekelleher

1 year

One necessary piece of infrastructure that does not yet exist is a "vcztools" package that implements some of the read-only functionality of bcftools. This could provide compatibility with existing workflows, allowing a cloud-based Zarr store to have file-like semantics.

1

0

2

Jerome Kelleher

@jeromekelleher

1 year

We hope that these tools can provide the starting point for a new generation of tools that process genetic variation data. The VCF Zarr spec should provide a stable platform for methods developers, who can enjoy efficient, scalable access to data.

1

0

3

Jerome Kelleher

@jeromekelleher

1 year

We also provide the vcf2zarr converter, as part of the fledgling bio2zarr package. It supports both parallel and distributed conversion, and can handle very large datasets. It could also be improved in many ways - feedback and contributions welcome!

1

0

1

Jerome Kelleher

@jeromekelleher

1 year

We provide the draft VCF Zarr specification, which formalises the mapping from VCF to Zarr. While we're confident it captures the vast majority of use-cases, there are probably still lots of details that need working out. Feedback and contributions welcome!

1

0

2

Jerome Kelleher

@jeromekelleher

1 year

Zarr is currently used to store multiple petabyte-scale scientific datasets, and has multiple implementations. It is cloud-native, with first-class support for object stores like S3. It scales.

1

0

1

Jerome Kelleher

@jeromekelleher

1 year

But does this work on real data? Yes! To demonstrate how well Zarr performs on real data with many FORMAT fields, we converted chr2 for the Genomics England aggv2 dataset. Overall, we see a 5X reduction in storage compared to the original (12.81TiB over 106 vcf.gz files).

1

0

2

Jerome Kelleher

@jeromekelleher

1 year

Extracting individual (1-D) fields is even more extreme. Here we benchmark extracting the POS field and writing to a text file: 21,418 seconds with bcftools on a BCF file, vs 5 seconds using Zarr and Python.

1

0

1

Jerome Kelleher

@jeromekelleher

1 year

Where Zarr really starts to shine is when we are interested in *subsets* of the data. By storing fields separately, and by storing the data in each field as a regular grid of compressed chunks, subsetting is much more efficient. Here is the same benchmark on a small sub-matrix.

1

0

3
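The chunk-grid idea in the post above can be illustrated with a small sketch. This is not the Zarr library or the VCF Zarr layout: it is a toy model using only the Python standard library, and the names `build_chunk_grid` and `read_submatrix` are hypothetical. It assumes a small matrix of values in 0..255 and shows that reading a sub-matrix only requires decompressing the chunks that overlap it.

```python
import zlib


def build_chunk_grid(matrix, chunk_rows, chunk_cols):
    """Split a 2-D list of ints (0..255) into compressed chunks keyed by (i, j)."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    grid = {}
    for i, r0 in enumerate(range(0, n_rows, chunk_rows)):
        for j, c0 in enumerate(range(0, n_cols, chunk_cols)):
            block = [row[c0:c0 + chunk_cols] for row in matrix[r0:r0 + chunk_rows]]
            raw = bytes(v for row in block for v in row)
            # Keep the block shape next to the compressed bytes.
            grid[(i, j)] = (len(block), len(block[0]), zlib.compress(raw))
    return grid


def read_submatrix(grid, chunk_rows, chunk_cols, rows, cols):
    """Read the sub-matrix given by two range objects.

    Returns the values and the number of chunks that were decompressed.
    """
    touched = 0
    out = [[None] * (cols.stop - cols.start) for _ in range(rows.stop - rows.start)]
    first_i, last_i = rows.start // chunk_rows, (rows.stop - 1) // chunk_rows
    first_j, last_j = cols.start // chunk_cols, (cols.stop - 1) // chunk_cols
    for i in range(first_i, last_i + 1):
        for j in range(first_j, last_j + 1):
            n_r, n_c, blob = grid[(i, j)]
            raw = zlib.decompress(blob)
            touched += 1
            for r in range(n_r):
                gr = i * chunk_rows + r  # row index in the full matrix
                if not rows.start <= gr < rows.stop:
                    continue
                for c in range(n_c):
                    gc = j * chunk_cols + c  # column index in the full matrix
                    if cols.start <= gc < cols.stop:
                        out[gr - rows.start][gc - cols.start] = raw[r * n_c + c]
    return out, touched


# Toy "genotype matrix": 100 variants x 60 samples, values in {0, 1, 2}.
matrix = [[(r + c) % 3 for c in range(60)] for r in range(100)]
grid = build_chunk_grid(matrix, chunk_rows=25, chunk_cols=20)
sub, touched = read_submatrix(grid, 25, 20, rows=range(30, 40), cols=range(5, 15))
print(f"decompressed {touched} of {len(grid)} chunks")
```

In a row-wise format the same query would have to scan every record up to the last requested variant. Here the cost depends on how many chunks the query overlaps, not on the size of the whole matrix.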

Jerome Kelleher

@jeromekelleher

1 year

Compression isn't everything though - we also want to *compute* with our data. Here is a benchmark in which we perform a simple calculation over the whole genotype matrix (see text for rationale), essentially comparing the computational accessibility of the formats.

1

0

3

Jerome Kelleher

@jeromekelleher

1 year

This yields excellent compression performance. Here is a benchmark based on (very realistic) simulations where we compare the Zarr based approach with VCF, BCF and two state-of-the-art methods. Remarkably, Zarr's simple approach does almost as well as Savvy!

1

0

1

Jerome Kelleher

@jeromekelleher

1 year

We propose an alternative storage approach for variation data based on the widely used Zarr standard. Rather than grouping all data for a given variant together, we group all data for a given field, and store as chunked, compressed N-D arrays (tensors).

1

0

1
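As a rough illustration of the field-wise grouping described in the post above, here is a minimal sketch using only the Python standard library. It is not the VCF Zarr specification or the Zarr API: the record layout, the field names other than POS, and the helpers `to_field_store` and `read_field` are assumptions made for the example, and it only handles 1-D integer fields.

```python
import zlib
from array import array

# Toy row-wise records: everything about one variant is grouped together.
records = [
    {"POS": 100 + 10 * i, "QUAL": 30 + (i % 5)}
    for i in range(1000)
]


def to_field_store(records, fields, chunk_len=256):
    """Regroup row-wise records into one list of compressed chunks per field."""
    store = {}
    for name in fields:
        values = array("i", (rec[name] for rec in records))
        store[name] = [
            zlib.compress(values[start:start + chunk_len].tobytes())
            for start in range(0, len(values), chunk_len)
        ]
    return store


def read_field(store, name):
    """Decompress the chunks of a single field, without touching other fields."""
    out = array("i")
    for chunk in store[name]:
        out.frombytes(zlib.decompress(chunk))
    return list(out)


store = to_field_store(records, fields=("POS", "QUAL"))
positions = read_field(store, "POS")
print(len(store["POS"]), "chunks for POS;", len(positions), "values read")
```

Extracting POS here reads only the POS chunks, whereas the row-wise `records` list has to be walked record by record to pull out the same values.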

Jerome Kelleher

@jeromekelleher

1 year

It is the *row-wise* storage of data used by VCF (and most of its proposed alternatives, including BCF) that is most fundamentally limiting. It is not possible to efficiently extract a particular field or sample from row-wise variant stores.

1

0

1

Jerome Kelleher

@jeromekelleher

1 year

VCF is in many ways a tremendous success, providing a single channel through which all kinds of genomics data flows, with (mostly) good interoperability. It is a great archival format. However, it is not an efficient basis for computation, particularly at Biobank scale.

1

0

3

Jerome Kelleher

@jeromekelleher

1 year

Excited to share my latest preprint (with a stellar band of collaborators), where we map the VCF data model into an efficient, cloud-native storage format. Thread follows:

1

32

76

Jerome Kelleher

@jeromekelleher

1 year

UPDATE: closing date now **June 26th**, so plenty time to apply!

0

1

Jerome Kelleher

@jeromekelleher

1 year

There's a 1.5 year postdoc position working on msprime/background selection models open in my group in Oxford. Closing date is **June 12th**, so please do get in contact if you're interested, or send on to anyone who might be interested!

1

35

36

Jerome Kelleher

@jeromekelleher

2 years

Great to see this preprint from @GeorgiaTsambos out!

0

15

28

Jerome Kelleher

@jeromekelleher

2 years

We'd love to hear people's thoughts on this! It's definitely been a learning experience for us, and hopefully others will find the ideas here useful. Huge thanks to my coauthors @DrYanWong, @ana_ignatieva, Jere Koskela, @GregorGorjanc, and @WilderWohns!

0

2

3