Holden Profile
Holden

@hodlenx

Followers
446
Following
67
Media
5
Statuses
45

Maintainer of PowerInfer | LLM Systems | DMs are welcome 🤗

Shinagawa-ku, Tokyo
Joined January 2018
@hodlenx
Holden
2 years
👐 PowerInfer-2 will be open-sourced based on the PowerInfer repo. We're refining it to untangle it from our testing platform and make it accessible on PCs for the community. Open-sourcing will happen in stages starting soon. Stay tuned for updates at
1
2
20
@hodlenx
Holden
2 years
🔓 The power of cloud-scale models and local privacy aren't mutually exclusive. We're pioneering ways to bring LLMs' incredible capabilities directly to your device without compromising privacy. Explore how we're making AI accessible to everyone, everywhere:
powerinfer.ai
High-speed Large Language Model Serving leveraging activation locality on PC and Mobile devices.
1
0
9
@hodlenx
Holden
2 years
🔑 Model sparsity is the key to PowerInfer-2, and TurboSparse makes it possible. We've pushed the FFN sparsity of Mistral and Mixtral to 90% and 97%, with even higher performance. Dive into the details at https://t.co/ChsDCZyxgI and get the models today: https://t.co/us1sgubiip.
2
0
13
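The tweet doesn't spell out the mechanism, so here is a minimal PyTorch sketch of the idea as we understand it (our own toy, not the TurboSparse code), assuming dReLU means applying ReLU to both the gate and up branches of a gated FFN, so an intermediate channel is nonzero only when both branches fire:

```python
import torch
import torch.nn as nn

class DReLUFFN(nn.Module):
    """Gated FFN with ReLU on both branches (a sketch of the dReLU recipe)."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.gate = nn.Linear(d_model, d_ff, bias=False)
        self.up = nn.Linear(d_model, d_ff, bias=False)
        self.down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # A channel survives only if gate AND up are positive,
        # which is what makes the intermediate activations sparse.
        h = torch.relu(self.gate(x)) * torch.relu(self.up(x))
        return self.down(h)

ffn = DReLUFFN(d_model=64, d_ff=256)
x = torch.randn(4, 64)
h = torch.relu(ffn.gate(x)) * torch.relu(ffn.up(x))
print(f"intermediate sparsity: {(h == 0).float().mean().item():.0%}")  # ~75% at random init
```

With random weights each ReLU already zeroes about half its inputs (hence roughly 75% for the product); the 90% and 97% figures above come from training the models toward sparsity, not from the activation function alone.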
@hodlenx
Holden
2 years
🔍 Discover the groundbreaking system innovations in PowerInfer-2! Using heterogeneous computing and I/O-Compute pipelining, it unleashes the full potential of mobile SoC and flash performance. See how we dominate the speed-cost tradeoff! Paper available: https://t.co/KNWUfwT22a
1
0
9
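The tweet names the two ingredients but not how they fit together. A toy sketch of the I/O-compute pipelining half (our own stand-ins load_weights/compute, not PowerInfer-2's code): prefetch the next layer's weights from flash on a background thread while the current layer runs.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def load_weights(layer: int) -> str:
    time.sleep(0.05)                 # pretend flash read latency
    return f"weights[{layer}]"

def compute(weights: str) -> None:
    time.sleep(0.05)                 # pretend NPU/GPU compute time

def run_pipelined(num_layers: int) -> None:
    """Overlap the flash read of layer i+1 with the compute of layer i."""
    with ThreadPoolExecutor(max_workers=1) as io:
        pending = io.submit(load_weights, 0)                  # prefetch layer 0
        for layer in range(num_layers):
            weights = pending.result()                        # wait for this layer's I/O
            if layer + 1 < num_layers:
                pending = io.submit(load_weights, layer + 1)  # prefetch the next layer
            compute(weights)                                  # runs while the prefetch is in flight

start = time.perf_counter()
run_pipelined(8)
print(f"pipelined: {time.perf_counter() - start:.2f}s")  # ~0.45s vs ~0.80s done serially
```

In the steady state the per-layer cost is max(I/O, compute) instead of their sum, which is where the win comes from when flash reads would otherwise stall the accelerator.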
@hodlenx
Holden
2 years
🚀 Excited to introduce PowerInfer-2: A game-changing LLM inference engine for mobile devices by the #PowerInfer team. It smoothly runs a 47B model with a staggering 29x speedup on smartphones! Watch our demo to see it in action! 🎥 Technical details at: https://t.co/7bx5EnzWCs
31
141
579
@arankomatsuzaki
Aran Komatsuzaki
2 years
Turbo Sparse: Achieving LLM SOTA Performance with Minimal Activated Parameters
- Proposes a novel dReLU function, which is designed to improve LLM activation sparsity
- 2-5× decoding speedup
model: https://t.co/UEoBgMxceD
abs: https://t.co/yc9ZAhPokt
3
19
110
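As a back-of-the-envelope check on how activation sparsity turns into decoding speedup (our own arithmetic, not the paper's model), an Amdahl-style estimate assuming the FFN accounts for roughly two thirds of per-token FLOPs lands inside the 2-5× range quoted above:

```python
def decode_speedup(ffn_fraction: float, ffn_sparsity: float) -> float:
    """Only the FFN share of per-token compute shrinks by (1 - sparsity);
    attention and everything else stay at full cost."""
    remaining = (1 - ffn_fraction) + ffn_fraction * (1 - ffn_sparsity)
    return 1 / remaining

# Assumed FFN share of per-token FLOPs (~2/3 for Mistral-like shapes).
print(f"{decode_speedup(2/3, 0.90):.2f}x")  # 2.50x
print(f"{decode_speedup(2/3, 0.97):.2f}x")  # 2.83x
```

Real decoding is memory-bandwidth bound, so skipped neurons also mean skipped weight reads; that is why measured speedups can exceed this naive FLOPs estimate.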
@hodlenx
Holden
2 years
Thrilled to unveil Bamboo-v0.1: a groundbreaking 7B LLM by the #PowerInfer team, matching Mistral's performance with 85% activation sparsity. Built on Mistral's weights, supercharged with dReLU for up to 4.38x hybrid-computing speedups. Discover it at https://t.co/jhFnVL50BH.
0
0
8
@tydsh
Yuandong Tian
2 years
🌟PowerInfer boosts LLM serving speeds by up to 11x on consumer-grade GPUs! Inspired by our Deja Vu paper (ICML'23 https://t.co/rSeAoz4bs4), it serves ReLUified LLMs, keeping heavy-hitter (hot) neurons on the GPU and offloading sparsely firing ones to CPUs. Proud of my
arxiv.org
Large language models (LLMs) with hundreds of billions of parameters have sparked a new wave of exciting AI applications. However, they are computationally expensive at inference time. Sparsity is...
3
22
119
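A minimal sketch of the placement policy the tweet describes (our own toy, not PowerInfer's code): profile offline how often each FFN neuron fires, pin the frequently firing "hot" ones in GPU memory up to a budget, and leave the long tail of "cold" ones on the CPU:

```python
import torch

def split_hot_cold(fire_counts: torch.Tensor, gpu_budget: int):
    """Keep the `gpu_budget` most frequently firing neurons on the GPU;
    the rest stay on the CPU."""
    order = torch.argsort(fire_counts, descending=True)
    return order[:gpu_budget], order[gpu_budget:]

# Toy offline profile: how often each of 256 neurons activated.
fire_counts = torch.randint(0, 1000, (256,))
hot, cold = split_hot_cold(fire_counts, gpu_budget=64)

# Hot rows of the up-projection would live on the GPU, cold rows on the
# CPU; at decode time each side multiplies only its own rows and the
# partial results are summed. (Device placement elided for portability.)
W_up = torch.randn(256, 64)
W_hot, W_cold = W_up[hot], W_up[cold]
print(W_hot.shape, W_cold.shape)
```

Because neuron activations in ReLU models are heavily skewed, a small hot set covers most tokens' work, which is what makes a modest GPU budget go a long way.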
@mayfer
murat 🍥
2 years
With all the sparsity-aware, context-based memory-loading papers coming out (PowerInfer getting 11x and Apple getting 25x speedup on GPU), ReLU's dead zone is turning out to be important. Llama-class models (SwiGLU) might not have much longevity after all once all the Metal work
9
19
237
@mtrainier2020
Rainier
2 years
I've given the paper a first read. Its basic idea is to exploit locality during inference. One bottleneck for inference performance today is GPU memory, so their approach is joint CPU-GPU inference: load the active neurons' information into the GPU as much as possible, making full use of locality. This greatly improves the efficiency of GPU inference.
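The dynamic half of that locality story is a small per-layer predictor that guesses which neurons will fire for the current token, so only those rows get computed. A toy sketch with our own stand-in names (PowerInfer's real predictors are small trained MLPs):

```python
import torch
import torch.nn as nn

d_model, d_ff = 64, 256
up = nn.Linear(d_model, d_ff, bias=False)    # full FFN up-projection
down = nn.Linear(d_ff, d_model, bias=False)
predictor = nn.Linear(d_model, d_ff)         # tiny stand-in activation predictor

x = torch.randn(1, d_model)

# 1. Predict which neurons will fire for this token.
active = predictor(x).sigmoid().squeeze(0) > 0.5          # bool mask over d_ff

# 2. Compute only the predicted-active rows/columns of the FFN.
h = torch.relu(x @ up.weight[active].T)                   # [1, n_active]
y = h @ down.weight[:, active].T                          # [1, d_model]

print(f"computed {int(active.sum())}/{d_ff} neurons")
```

A neuron wrongly predicted active only wastes a little compute, while a missed one changes the output, so such predictors tend to be biased toward recall.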
@engineer_bob_sz
电工Bob
2 years
@nash_su From the docs, it accelerates computation through a more efficient CPU/GPU split, which means pure-CPU speed should stay the same. As I understand it, this is best suited for machines with a fast GPU but limited VRAM, and it can fix the case where llama.cpp's split CPU+GPU mode performs worse than a single GPU.
10
32
112
@abhphy
睡眠雲
2 years
PowerInfer finally got Llama2-70B-4q running on my 4090. I had it write Momotaro; the story drifted partway through, but that's just Llama2 running as usual... For final speed, in a WSL2 environment with a 13900K, 64 GB RAM, and 16 threads allowed, it reached 3.89 tokens/s, reproducing the ~4 tokens/s result claimed in the preprint.
2
13
42
@m0d8ye
Max Lv
2 years
Very interesting research. Running a 175B model on a single 4090 is no longer a dream 🎉
github.com
High-speed Large Language Model Serving for Local Deployment - SJTU-IPADS/PowerInfer
3
40
160
@deliprao
Delip Rao e/σ
2 years
This is huge! Now watch LLM API costs drop even further. [.cn PDF link] https://t.co/Kvj3lRTONE
37
212
2K
@LiorOnAI
Lior Alexander
2 years
PowerInfer can massively speed up inference on consumer GPUs, almost reaching A100 levels. It outperforms llama.cpp by up to 11.69x while retaining model accuracy. PowerInfer reached an average token generation rate of 13.20 tokens/s, with a peak of 29.08 tokens/s, across
15
81
357
@hodlenx
Holden
2 years
Hope everyone enjoys it! We're just getting started on building PowerInfer as a brand new solution for local LLM hosting and we are open to any kind of feedback ❤️
@omarsar0
elvis
2 years
PowerInfer - a high-speed inference engine for deploying LLMs locally. Just came across this super interesting project on speeding up inference. It's not MoE but it's a simple approach that exploits the high locality in LLM inference to design a GPU-CPU hybrid inference engine.
3
3
22