Holden Profile
Holden

@hodlenx

Followers
446
Following
67
Media
5
Statuses
45

Maintainer of PowerInfer | LLM Systems | DMs are welcome 🤗

Shinagawa-ku, Tokyo
Joined January 2018
@hodlenx
Holden
2 years
👐 PowerInfer-2 will be open-sourced based on the PowerInfer repo. We're refining it to untangle it from our testing platform and make it accessible on PCs for the community. Open-sourcing will happen in stages starting soon. Stay tuned for updates at
1
2
20
@hodlenx
Holden
2 years
🔓 The power of cloud-scale models and local privacy aren't mutually exclusive. We're pioneering ways to bring LLMs' incredible capabilities directly to your device without compromising privacy. Explore how we're making AI accessible to everyone, everywhere:
powerinfer.ai
High-speed Large Language Model Serving leveraging activation locality on PC and Mobile devices.
1
0
9
@hodlenx
Holden
2 years
🔑 Model sparsity is the key to PowerInfer-2, and TurboSparse makes it possible. We've pushed the FFN sparsity of Mistral and Mixtral to 90% and 97%, with even higher performance. Dive into the details at https://t.co/ChsDCZyxgI and get the models today: https://t.co/us1sgubiip.
2
0
13
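The tweet doesn't spell out the mechanism, so here is a minimal PyTorch sketch of the idea as we understand it (our own toy, not the TurboSparse code), assuming dReLU means applying ReLU to both the gate and up branches of a gated FFN, so an intermediate channel is nonzero only when both branches fire:

```python
import torch
import torch.nn as nn

class DReLUFFN(nn.Module):
    """Gated FFN with ReLU on both branches (a sketch of the dReLU recipe)."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.gate = nn.Linear(d_model, d_ff, bias=False)
        self.up = nn.Linear(d_model, d_ff, bias=False)
        self.down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # A channel survives only if gate AND up are positive,
        # which is what makes the intermediate activations sparse.
        h = torch.relu(self.gate(x)) * torch.relu(self.up(x))
        return self.down(h)

ffn = DReLUFFN(d_model=64, d_ff=256)
x = torch.randn(4, 64)
h = torch.relu(ffn.gate(x)) * torch.relu(ffn.up(x))
print(f"intermediate sparsity: {(h == 0).float().mean().item():.0%}")  # ~75% at random init
```

With random weights each ReLU already zeroes about half its inputs (hence roughly 75% for the product); the 90% and 97% figures above come from training the models toward sparsity, not from the activation function alone.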
@hodlenx
Holden
2 years
🔍 Discover the groundbreaking system innovations in PowerInfer-2! Using heterogeneous computing and I/O-Compute pipelining, it unleashes the full potential of mobile SoC and flash performance. See how we dominate the speed-cost tradeoff! Paper available: https://t.co/KNWUfwT22a
1
0
9
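The tweet names the two ingredients but not how they fit together. A toy sketch of the I/O-compute pipelining half (our own stand-ins load_weights/compute, not PowerInfer-2's code): prefetch the next layer's weights from flash on a background thread while the current layer runs.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def load_weights(layer: int) -> str:
    time.sleep(0.05)                 # pretend flash read latency
    return f"weights[{layer}]"

def compute(weights: str) -> None:
    time.sleep(0.05)                 # pretend NPU/GPU compute time

def run_pipelined(num_layers: int) -> None:
    """Overlap the flash read of layer i+1 with the compute of layer i."""
    with ThreadPoolExecutor(max_workers=1) as io:
        pending = io.submit(load_weights, 0)                  # prefetch layer 0
        for layer in range(num_layers):
            weights = pending.result()                        # wait for this layer's I/O
            if layer + 1 < num_layers:
                pending = io.submit(load_weights, layer + 1)  # prefetch the next layer
            compute(weights)                                  # runs while the prefetch is in flight

start = time.perf_counter()
run_pipelined(8)
print(f"pipelined: {time.perf_counter() - start:.2f}s")  # ~0.45s vs ~0.80s done serially
```

In the steady state the per-layer cost is max(I/O, compute) instead of their sum, which is where the win comes from when flash reads would otherwise stall the accelerator.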
@hodlenx
Holden
2 years
🚀 Excited to introduce PowerInfer-2: A game-changing LLM inference engine for mobile devices by the #PowerInfer team. It smoothly runs a 47B model with a staggering 29x speedup on smartphones! Watch our demo to see it in action! 🎥 Technical details at: https://t.co/7bx5EnzWCs
31
141
579
@arankomatsuzaki
Aran Komatsuzaki
2 years
Turbo Sparse: Achieving LLM SOTA Performance with Minimal Activated Parameters
- Proposes a novel dReLU function, which is designed to improve LLM activation sparsity
- 2-5× decoding speedup
model: https://t.co/UEoBgMxceD
abs: https://t.co/yc9ZAhPokt
3
19
110
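As a back-of-the-envelope check on how activation sparsity turns into decoding speedup (our own arithmetic, not the paper's model), an Amdahl-style estimate assuming the FFN accounts for roughly two thirds of per-token FLOPs lands inside the 2-5× range quoted above:

```python
def decode_speedup(ffn_fraction: float, ffn_sparsity: float) -> float:
    """Only the FFN share of per-token compute shrinks by (1 - sparsity);
    attention and everything else stay at full cost."""
    remaining = (1 - ffn_fraction) + ffn_fraction * (1 - ffn_sparsity)
    return 1 / remaining

# Assumed FFN share of per-token FLOPs (~2/3 for Mistral-like shapes).
print(f"{decode_speedup(2/3, 0.90):.2f}x")  # 2.50x
print(f"{decode_speedup(2/3, 0.97):.2f}x")  # 2.83x
```

Real decoding is memory-bandwidth bound, so skipped neurons also mean skipped weight reads; that is why measured speedups can exceed this naive FLOPs estimate.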
@hodlenx
Holden
2 years
Thrilled to unveil Bamboo-v0.1: a groundbreaking 7B LLM by the #PowerInfer team, matching Mistral's performance with 85% activation sparsity. Built on Mistral's weights, supercharged with dReLU for up to 4.38x hybrid-computing speedups. Discover it at https://t.co/jhFnVL50BH.
0
0
8
@tydsh
Yuandong Tian
2 years
🌟PowerInfer boosts LLM serving speeds by up to 11x on consumer-grade GPUs! Inspired by our Deja Vu paper (ICML'23 https://t.co/rSeAoz4bs4), it serves ReLUified LLMs, keeping heavy-hitter (hot) neurons on the GPU and offloading sparsely firing ones to CPUs. Proud of my
arxiv.org
Large language models (LLMs) with hundreds of billions of parameters have sparked a new wave of exciting AI applications. However, they are computationally expensive at inference time. Sparsity is...
3
22
119
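A minimal sketch of the placement policy the tweet describes (our own toy, not PowerInfer's code): profile offline how often each FFN neuron fires, pin the frequently firing "hot" ones in GPU memory up to a budget, and leave the long tail of "cold" ones on the CPU:

```python
import torch

def split_hot_cold(fire_counts: torch.Tensor, gpu_budget: int):
    """Keep the `gpu_budget` most frequently firing neurons on the GPU;
    the rest stay on the CPU."""
    order = torch.argsort(fire_counts, descending=True)
    return order[:gpu_budget], order[gpu_budget:]

# Toy offline profile: how often each of 256 neurons activated.
fire_counts = torch.randint(0, 1000, (256,))
hot, cold = split_hot_cold(fire_counts, gpu_budget=64)

# Hot rows of the up-projection would live on the GPU, cold rows on the
# CPU; at decode time each side multiplies only its own rows and the
# partial results are summed. (Device placement elided for portability.)
W_up = torch.randn(256, 64)
W_hot, W_cold = W_up[hot], W_up[cold]
print(W_hot.shape, W_cold.shape)
```

Because neuron activations in ReLU models are heavily skewed, a small hot set covers most tokens' work, which is what makes a modest GPU budget go a long way.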
@mayfer
murat 🍥
2 years
With all the sparsity-aware, context-based memory-loading papers coming out (PowerInfer getting 11x and Apple getting 25x speedup on GPU), ReLU's dead zone is turning out to be important. Llama-class models (SwiGLU) might not have much longevity after all once all the Metal work
9
19
237
@mtrainier2020
Rainier
2 years
I've given the paper a first read. Its basic idea is to exploit locality during inference. One bottleneck for inference performance today is GPU memory, so their approach is joint CPU-GPU inference: load the active neurons' information into the GPU as much as possible, making full use of locality. This greatly improves the efficiency of GPU inference.
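The dynamic half of that locality story is a small per-layer predictor that guesses which neurons will fire for the current token, so only those rows get computed. A toy sketch with our own stand-in names (PowerInfer's real predictors are small trained MLPs):

```python
import torch
import torch.nn as nn

d_model, d_ff = 64, 256
up = nn.Linear(d_model, d_ff, bias=False)    # full FFN up-projection
down = nn.Linear(d_ff, d_model, bias=False)
predictor = nn.Linear(d_model, d_ff)         # tiny stand-in activation predictor

x = torch.randn(1, d_model)

# 1. Predict which neurons will fire for this token.
active = predictor(x).sigmoid().squeeze(0) > 0.5          # bool mask over d_ff

# 2. Compute only the predicted-active rows/columns of the FFN.
h = torch.relu(x @ up.weight[active].T)                   # [1, n_active]
y = h @ down.weight[:, active].T                          # [1, d_model]

print(f"computed {int(active.sum())}/{d_ff} neurons")
```

A neuron wrongly predicted active only wastes a little compute, while a missed one changes the output, so such predictors tend to be biased toward recall.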
@engineer_bob_sz
电工Bob
2 years
@nash_su From the docs, it accelerates computation through a more efficient CPU/GPU split, which means pure-CPU speed should stay the same. As I understand it, this is best suited for machines with a fast GPU but limited VRAM, and it can fix the case where llama.cpp's split CPU+GPU mode performs worse than a single GPU.
10
32
112
@abhphy
睡眠雲
2 years
PowerInfer finally got Llama2-70B-4q running on my 4090. I had it write Momotaro; the story drifted partway through, but that's just Llama2 running as usual... For final speed, in a WSL2 environment with a 13900K, 64 GB RAM, and 16 threads allowed, it reached 3.89 tokens/s, reproducing the ~4 tokens/s result claimed in the preprint.
2
13
42
@m0d8ye
Max Lv
2 years
Very interesting research. Running a 175B model on a single 4090 is no longer a dream 🎉
github.com
High-speed Large Language Model Serving for Local Deployment - SJTU-IPADS/PowerInfer
3
40
160
@deliprao
Delip Rao e/σ
2 years
This is huge! Now watch LLM API costs drop even further. [.cn PDF link] https://t.co/Kvj3lRTONE
37
212
2K
@LiorOnAI
Lior Alexander
2 years
PowerInfer can massively speed up inference on consumer GPUs, almost reaching A100 levels. It outperforms llama.cpp by up to 11.69x while retaining model accuracy. PowerInfer reached an average token generation rate of 13.20 tokens/s, with a peak of 29.08 tokens/s, across
15
81
357
@hodlenx
Holden
2 years
Hope everyone enjoys it! We're just getting started on building PowerInfer as a brand new solution for local LLM hosting and we are open to any kind of feedback ❤️
@omarsar0
elvis
2 years
PowerInfer - a high-speed inference engine for deploying LLMs locally. Just came across this super interesting project on speeding up inference. It's not MoE but it's a simple approach that exploits the high locality in LLM inference to design a GPU-CPU hybrid inference engine.
3
3
22