Mikhail Karasikov
@m_karasikov
Followers
119
Following
60
Media
8
Statuses
36
ML Engineer developing vision foundation models. PhD in CS, genome graphs, compressed data structures
Zurich
Joined November 2020
After years of research and continuous refinement, we’re thrilled to share that our paper on the MetaGraph framework, which enables petabase-scale search across sequencing data, has been published today in Nature (https://t.co/WQgDjIYDZL).
nature.com
Nature - MetaGraph enables scalable indexing of large sets of DNA, RNA or protein sequences using annotated de Bruijn graphs.
Glad this resonated with the committee, and grateful to @sebastianffx for presenting it last week at MICCAI 2025. Hopefully, it sparked some insightful discussions and ideas!
Happy to share that this work was early accepted for #MICCAI2025 and will be presented this September in Daejeon, South Korea. It will also be outlined at #ECDP2025 this June in Barcelona.
Achieving top results with little data suggests that current algorithms don't fully exploit the information in truly large data sets. I tend to think that there still remains huge unrealized potential, and new algorithms are needed to bring this to the next level.
We also explored post-training techniques to further boost the FMs and pushed our model to top-1.
How much data is enough for training a SOTA-level pathology foundation model? In our new work https://t.co/rTUvygSBAx, we show that all recent models perform similarly, and that even 12k WSIs from TCGA are enough to outperform most of them. w/ @JoostvDoorn @Huugie76 @sebastianffx and the team
Really nice writeup on the hugely impressive MetaGraph work by @gxr, @akkah21, @m_karasikov, @HarunMustafa416 (& others whose handles I don't know) in @ScienceMagazine: https://t.co/3h3uCgWsaU. Some comments by @ZaminIqbal, Lesley Hoyles and myself! Congrats @m_karasikov & team!
science.org
Achievement demonstrates feasibility of making all of life’s code easily searchable, researchers say
“A major step to making DNA sequencing data accessible to wider audiences.” 🥁 That's what the committee said about this work, one of the #SIBRemarkableOutputs 2022 👏 👉 Find out more: https://t.co/Mh7BbERQCz
#genomics @m_karasikov
After half a year of "next week", "tomorrow", and "today", I'm glad that our (@peshotrie and me) preprint on exact global alignment is finally online! This thread visually summarizes our paper and next steps. 1/9 https://t.co/IH1KajQpsf
@curious_coding and I extended the seed heuristic to exact alignment of long (Mbps) erroneous (≤15%) sequences. The empirical near-linear runtime makes our aligner A*PA 250x faster than Edlib and WFA on synthetic data, and it looks promising on human data. https://t.co/YmqdAOml32
Awesome project! Grateful for the chance to make my humble contribution as well. The data is indexed with MetaGraph and ready for search and alignment: 6.4M genomes -> 15 GB index with k-mer coordinates (CountingDBG); all 318M assembled scaffolds -> 124 GB index, 121 billion k-mers.
Delighted to share our latest publication on the ‘biosynthetic potential of the global ocean microbiome’ in @Nature
https://t.co/b4kgOLIROD. If you want to know, have a look at this video:
I've created a "Crash Course on Data Compression" that I'm going to teach next week to PhD students in Pisa (20h, 5 modules). 1/ Link: https://t.co/8pNqxvZHcN
#compression #DataScience #coding
github.com
🗜 💻 A crash course on Data Compression. Contribute to jermp/data_compression_course development by creating an account on GitHub.
If you are interested in learning how compression can be used to speed up algorithms and design smaller data structures, join our online workshop "Compression + Computation" on 01/19 (Wednesday), 10am-6pm EST (registration is free but required to join)! https://t.co/MYDY1J4qqL
sites.google.com
Overview Many modern applications produce massive datasets containing a lot of redundancy, either in the form of highly skewed frequencies or repeating motifs/fragments of identical data. Prominent...
For encoding RNA expression levels, it can also be turned into a k-mer count dictionary - 8x smaller than the state of the art and yet much faster to query.
To demonstrate the new opportunities, we designed a sequence-to-graph alignment algorithm on top of Counting de Bruijn graphs, with a modified backtracking stage that ensures consistency with the sequences encoded in the graph (traces) - by @HarunMustafa416
The method encodes traces in the underlying DBG, playing a similar role as gPBWT in variation graphs. One of the crucial differences is in the coding technique: while gPBWT encodes each path by storing the "turns", Counting DBG applies a delta-like coding on global coordinates.
We call this data structure a Counting de Bruijn graph. On average, the compression is even higher than with gzip - only 0.54 bits/bp for long HiFi reads.
Additionally, we apply a delta-like coding extending the RowDiff scheme https://t.co/ADZ0KLCtPj (would be impossible without @danieldanciu) which computes a delta between the original annotation at each node and its predicted/expected value reconstructed from the successor nodes.
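The delta idea above can be sketched in a few lines, under loudly stated assumptions: every non-anchor node has exactly one successor, each node carries a set of integer k-mer coordinates, the "predicted" value of a node is its successor's set shifted down by one, and a few hypothetical anchor nodes store their sets verbatim so that decoding terminates. The names `encode`, `decode`, `succ` and `anchors` are illustrative only and are not the RowDiff or MetaGraph API.

```python
def encode(coords, succ, anchors):
    """Store, per node, only the difference from the value predicted by its successor.

    coords:  dict node -> set of integer coordinates (toy annotation)
    succ:    dict node -> successor node (assumed unique for non-anchors)
    anchors: set of nodes whose coordinates are stored verbatim
    """
    stored = {}
    for node, own in coords.items():
        if node in anchors:
            stored[node] = set(own)
        else:
            # Assumed predictor: successor's coordinates shifted down by one.
            predicted = {c - 1 for c in coords[succ[node]]}
            stored[node] = own ^ predicted  # symmetric difference = delta
    return stored


def decode(stored, succ, anchors, node):
    """Reconstruct a node's coordinates by walking successors up to an anchor."""
    if node in anchors:
        return set(stored[node])
    predicted = {c - 1 for c in decode(stored, succ, anchors, succ[node])}
    return stored[node] ^ predicted


# Toy path n0 -> n1 -> n2 -> n3; a second sequence joins the path at n1.
coords = {"n0": {0}, "n1": {1, 7}, "n2": {2, 8}, "n3": {3, 9}}
succ = {"n0": "n1", "n1": "n2", "n2": "n3"}
anchors = {"n3"}

stored = encode(coords, succ, anchors)
restored = {n: decode(stored, succ, anchors, n) for n in coords}
```

In this toy example the deltas at `n1` and `n2` are empty, because consecutive k-mers of the same sequence have consecutive coordinates. Only the anchor and the node where the paths diverge keep any data.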
The general idea is to decompose the annotation matrix into a sparse binary indicator matrix and dense vectors of attributes encoded separately. This decomposition allows directly applying existing schemes for a compressed representation of binary matrices and arrays.
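As a rough illustration of this decomposition (a toy sketch, not MetaGraph's implementation): a small integer matrix is split into a 0/1 indicator matrix plus, for each column, a dense vector holding only the nonzero attributes in row order. Treating 0 as "no attribute" and the helper names `decompose` and `lookup` are assumptions made for the example. A real index would use compressed bit vectors with constant-time rank instead of Python lists.

```python
def decompose(matrix):
    """Split a matrix into a binary indicator matrix and per-column dense value vectors."""
    indicator = [[1 if v != 0 else 0 for v in row] for row in matrix]
    n_cols = len(matrix[0]) if matrix else 0
    # For each column, keep only the nonzero attributes, in row order.
    values = [[row[j] for row in matrix if row[j] != 0] for j in range(n_cols)]
    return indicator, values


def lookup(indicator, values, i, j):
    """Recover matrix[i][j] from the decomposed form."""
    if not indicator[i][j]:
        return 0
    # rank = number of set bits above row i in column j (linear scan in this sketch)
    rank = sum(indicator[r][j] for r in range(i))
    return values[j][rank]


matrix = [
    [0, 3, 0],
    [5, 0, 0],
    [0, 7, 2],
]
indicator, values = decompose(matrix)
rebuilt = [[lookup(indicator, values, i, j) for j in range(3)] for i in range(3)]
```

Because the indicator part is a plain binary matrix, any existing compressed representation of binary matrices can be applied to it unchanged, while the value vectors can be compressed separately.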
The k-mer coordinates are then stored in a special succinct representation.