
Alexey Tumanov
@alsched
Followers: 548 · Following: 874 · Media: 5 · Statuses: 247
Assistant Professor of Computer Science @gatech_scs @gtcomputing | postdoc @Berkeley_EECS @ucbrise | ML Systems
Atlanta, GA
Joined December 2012
RT @agrawalamey12: After hitting evaluation puzzles like this in our own work, we analyzed patterns across LLM inference papers and identif…
RT @agrawalamey12: Interesting work on long context inference from @nvidia, where they scale KV parallelism on GB200-NVL72 systems! To lear…
arxiv.org
As large language models (LLMs) handle increasingly longer contexts, serving long inference requests of millions of tokens presents unique challenges. We show that existing work for long context...
RT @gatech_scs: Congratulations 👏 to our faculty who were recognized on the Spring 2025 CIOS Honor Roll for their outstanding teaching and…
RT @SachitKuhar: Full code 🔓 Collaboration with @jinga_lala1 and @alsched. (6/6) #EfficientAI #EdgeAI #Quantizati…
github.com
Codebase for "PLUM: Improving Inference Efficiency By Leveraging Repetition-Sparsity Trade-Off" - sachitkuhar/PLUM
RT @agrawalamey12: Super excited to share another incredible system that we have built over the past two years! Training giant foundation…
RT @agrawalamey12: Super long-context models with context windows spanning millions of tokens are becoming commonplace (@GoogleDeepMind Gemi…
RT @agrawalamey12: Maya offers a transparent, accurate, and efficient way to model and optimize large-scale DL training without needing exp…
arxiv.org
Training large foundation models costs hundreds of millions of dollars, making deployment optimization critical. Current approaches require machine learning engineers to manually craft training...
RT @agrawalamey12: Sequence pipeline parallelism is being rapidly adopted for extreme long-context inference in industry! Check out our pap…
Super-charged technical program this year at @ACMSoCC. Looking forward to it! Hope to see you there! #socc24
We are just under a month away from SoCC'24! This year's conference will be held Nov 20-22 at the Microsoft Campus in Redmond, WA. Early bird registration is open until Nov 6. Make sure to register!
RT @agrawalamey12: ⚡ Speed Meets Accuracy: Unlike approximation-based methods, Mnemosyne achieves exact inference, ensuring that the genera…
RT @agrawalamey12: @Google has silently but surely developed an edge over @OpenAI. Long context processing seems to be the key to Google's…
RT @agrawalamey12: 🔗 Curious to learn more? Dive into our paper to explore the technical details behind Mnemosyne: …
arxiv.org
As large language models (LLMs) evolve to handle increasingly longer contexts, serving inference requests for context lengths in the range of millions of tokens presents unique challenges. While...
I'm serving as the #SOSP24 AEC Chair, and we're still looking for artifact evaluation reviewers. AE is indispensable to systems research and a valuable experience. Grad students and early-career researchers welcome! Expected load: 2 artifacts. Self-nominate!
sysartifacts.github.io
We are looking for members of the Artifact Evaluation Committee (AEC), who will contribute to SOSP’24 Artifact Evaluation (AE) process by reviewing submitted artifacts. AEC membership is especially...
Let's set the standard for the interactive performance of LLMs by capturing the nuances of user experience. While the latency/throughput tension is well known to the systems community, latency jitter is far less explored. The fluidity index and fluid token generation rate capture LLM performance more aptly.
🚀 Introducing Metron: Redefining LLM Serving Benchmarks! 📊 Tired of misleading metrics for LLM performance? Our new paper introduces a holistic framework that captures what really matters: the user experience! 🧠💬 #LLM #AI #Benchmark
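A minimal sketch of the idea in Python, assuming per-token arrival timestamps and a simple deadline-based definition (illustrative only; fluid_token_rate and target_tbt are hypothetical names, not Metron's actual formulation, which is defined in the paper):

from typing import List

def fluid_token_rate(arrivals: List[float], target_tbt: float) -> float:
    """Fraction of inter-token gaps that meet a target time-between-tokens
    (TBT) deadline. Mean throughput hides a mid-stream stall; this does not."""
    gaps = [b - a for a, b in zip(arrivals, arrivals[1:])]
    if not gaps:
        return 1.0
    return sum(g <= target_tbt for g in gaps) / len(gaps)

# A steady stream vs. one with the same token count but a mid-stream stall.
smooth = [0.05 * i for i in range(1, 21)]  # 20 tokens, 50 ms apart
stall = smooth[:10] + [smooth[9] + 0.55 + 0.005 * i for i in range(10)]
print(fluid_token_rate(smooth, 0.1))  # 1.0: every gap meets the deadline
print(fluid_token_rate(stall, 0.1))   # < 1.0: the stalled gap misses it

Two streams can have similar end-to-end latency and mean throughput yet deliver very different interactive experiences; a deadline-style rate like this surfaces the difference.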
Really proud of my PhD student's work developing a new mechanism and policy that significantly improves tail-latency performance in Large Language Model (LLM) inference without sacrificing throughput. It has already received 10+ citations; the source is open and has been adopted in industry.
Did you ever feel that @chatgpt is done generating your response and then suddenly a burst of tokens shows up? This happens when the serving system is prioritizing someone else's request before generating your response. But why? Well, to reduce cost. 🧵
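A toy sketch of why such bursts appear, assuming a scheduler that pauses ongoing decodes to run another request's full prefill (an illustration only, not the paper's actual mechanism; token_completion_times is a hypothetical name):

def token_completion_times(n_tokens: int, stall_at: int, stall_len: int):
    """Each decode step takes one tick; a monopolizing prefill of length
    stall_len preempts this user's decoding at tick stall_at."""
    times, t = [], 0
    for _ in range(n_tokens):
        if t == stall_at:
            t += stall_len  # this user waits while another request's prefill runs
        t += 1
        times.append(t)
    return times

print(token_completion_times(10, stall_at=5, stall_len=0))  # [1, 2, ..., 10]
print(token_completion_times(10, stall_at=5, stall_len=8))  # [1, ..., 5, 14, 15, ...]

On the client side, the tokens generated after the preemption flush together, which is exactly the stall-then-burst pattern the thread describes; avoiding such stalls is what trades off against batching efficiency, i.e., cost.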
RT @gatech_scs: Three SCS faculty members were recognized by their students for outstanding teaching and educational impact. Congratulation…
blog.ctl.gatech.edu
The Center for Teaching and Learning (CTL) and the Office of Academic Effectiveness (OAE) are thrilled to announce the Spring 2024 Course Instructor Opinion Survey (CIOS) Honor Roll. Faculty member…