Holden (@hodlenx)
446 Followers · 67 Following · 5 Media · 45 Statuses
Maintainer of PowerInfer | LLM Systems | DMs are welcome 🤗
Shinagawa-ku, Tokyo · Joined January 2018
👐 PowerInfer-2 will be open-sourced based on the PowerInfer repo. We're refining it to untangle it from our testing platform and make it accessible on PCs for the community. Open-sourcing will happen in stages, starting soon. Stay tuned for updates at
1 reply · 2 reposts · 20 likes
🔓 The power of cloud-scale models and local privacy aren't mutually exclusive. We're pioneering ways to bring LLMs' incredible capabilities directly to your device without compromising privacy. Explore how we're making AI accessible to everyone, everywhere:
powerinfer.ai
High-speed Large Language Model Serving leveraging activation locality on PC and Mobile devices.
1 reply · 0 reposts · 9 likes
🔑 Model sparsity is the key to PowerInfer-2, and TurboSparse makes it possible. We've pushed the FFN sparsity of Mistral and Mixtral to 90% and 97%, with even higher performance. Dive into the details at https://t.co/ChsDCZyxgI and get the models today: https://t.co/us1sgubiip.
2 replies · 0 reposts · 13 likes
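For readers curious what a dReLU-style FFN looks like in code, here is a minimal sketch. It assumes the standard gated-FFN layout (gate/up/down projections) and simply applies ReLU to both branches in place of SwiGLU; treat it as an illustration of the idea, not the released TurboSparse implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DReLUFFN(nn.Module):
    """Gated FFN with ReLU on both branches (the dReLU idea).

    Compared with SwiGLU (silu(gate) * up), putting ReLU on both the gate
    and up projections zeroes out most intermediate activations, so the
    corresponding rows of w_down can be skipped at inference time.
    """

    def __init__(self, d_model: int, d_ffn: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_ffn, bias=False)
        self.w_up = nn.Linear(d_model, d_ffn, bias=False)
        self.w_down = nn.Linear(d_ffn, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # dReLU: ReLU on both branches instead of silu(gate) * up.
        h = F.relu(self.w_gate(x)) * F.relu(self.w_up(x))
        return self.w_down(h)

ffn = DReLUFFN(d_model=64, d_ffn=256)
x = torch.randn(1, 8, 64)
h = F.relu(ffn.w_gate(x)) * F.relu(ffn.w_up(x))
print(f"zero intermediate activations: {(h == 0).float().mean().item():.1%}")
```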
🔍 Discover the groundbreaking system innovations in PowerInfer-2! Using heterogeneous computing and I/O-Compute pipelining, it unleashes the full potential of mobile SoC and flash performance. See how we dominate the speed-cost tradeoff! Paper available: https://t.co/KNWUfwT22a
1 reply · 0 reposts · 9 likes
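The "I/O-Compute pipelining" mentioned above is, at its core, overlapping flash reads with computation. A minimal, generic sketch of that pattern follows; `load_chunk` and `compute_chunk` are hypothetical stand-ins, not PowerInfer-2 APIs, and a real pipeline would use asynchronous I/O on the device rather than Python threads.

```python
import queue
import threading
import time

def load_chunk(i: int) -> str:
    """Stand-in for reading a bundle of neuron weights from flash."""
    time.sleep(0.01)  # pretend I/O latency
    return f"weights[{i}]"

def compute_chunk(weights: str) -> int:
    """Stand-in for the matrix work done on the SoC."""
    time.sleep(0.01)  # pretend compute time
    return len(weights)

def pipelined(num_chunks: int) -> int:
    buf: queue.Queue = queue.Queue(maxsize=2)  # small prefetch window

    def producer() -> None:
        for i in range(num_chunks):
            buf.put(load_chunk(i))  # I/O overlaps the consumer's compute
        buf.put(None)               # end-of-stream marker

    threading.Thread(target=producer, daemon=True).start()
    total = 0
    while (w := buf.get()) is not None:
        total += compute_chunk(w)
    return total

start = time.time()
pipelined(20)
print(f"pipelined: {time.time() - start:.2f}s (vs ~0.40s fully serialized)")
```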
🚀 Excited to introduce PowerInfer-2: A game-changing LLM inference engine for mobile devices by the #PowerInfer team. It smoothly runs a 47B model with a staggering 29x speedup on smartphones! Watch our demo to see it in action! 🎥 Technical details at: https://t.co/7bx5EnzWCs
31 replies · 141 reposts · 579 likes
Turbo Sparse: Achieving LLM SOTA Performance with Minimal Activated Parameters
- Proposes a novel dReLU function, which is designed to improve LLM activation sparsity
- 2-5× decoding speedup
model: https://t.co/UEoBgMxceD
abs: https://t.co/yc9ZAhPokt
3 replies · 19 reposts · 110 likes
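To make figures like "90% FFN sparsity" concrete: activation sparsity here just means the fraction of intermediate FFN activations that are exactly zero for a given token. A rough sketch of the measurement on random weights and inputs follows (random weights land around 75%; the much higher numbers reported above come from the actual dReLU-trained checkpoints on real text).

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_model, d_ffn, n_tokens = 64, 256, 1024

# Random stand-ins for a gated FFN's projections; a real measurement would
# load the released TurboSparse checkpoints and run real prompts instead.
w_gate = torch.randn(d_ffn, d_model) / d_model ** 0.5
w_up = torch.randn(d_ffn, d_model) / d_model ** 0.5
x = torch.randn(n_tokens, d_model)

# dReLU-style intermediate activations: zero whenever either branch is negative.
h = F.relu(x @ w_gate.T) * F.relu(x @ w_up.T)

# Activation sparsity = fraction of intermediate neurons that are exactly zero.
sparsity = (h == 0).float().mean().item()
print(f"mean FFN activation sparsity: {sparsity:.1%}")
```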
Thrilled to unveil Bamboo-v0.1: A groundbreaking 7B LLM by the #PowerInfer team, matching Mistral's performance with 85% activation sparsity. Built on Mistral's weights, supercharged with dReLU for up to 4.38x hybrid computing speedups. Discover https://t.co/jhFnVL50BH.
0 replies · 0 reposts · 8 likes
🌟 PowerInfer boosts LLM serving speeds by up to 11x on consumer-grade GPUs! Inspired by our Deja Vu paper (ICML'23, https://t.co/rSeAoz4bs4), it serves ReLUified LLMs, keeping heavy-hitter (hot) neurons on the GPU and offloading sparsely fired ones to the CPU. Proud of my
arxiv.org
Large language models (LLMs) with hundreds of billions of parameters have sparked a new wave of exciting AI applications. However, they are computationally expensive at inference time. Sparsity is...
3 replies · 22 reposts · 119 likes
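A toy version of the hot/cold split described in the tweet above: profile how often each FFN neuron fires on calibration data, keep the frequently firing ("hot") rows resident on the GPU, and leave the rarely firing ("cold") rows to the CPU, summing the two partial outputs. Everything here (the sizes, the 20% hot budget, the plain ReLU FFN) is an illustrative assumption, not PowerInfer's actual kernels or data layout.

```python
import torch

torch.manual_seed(0)
d_model, d_ffn = 64, 512
w_up = torch.randn(d_ffn, d_model) / d_model ** 0.5   # rows = FFN neurons
w_down = torch.randn(d_model, d_ffn) / d_ffn ** 0.5

# 1) Offline profiling: estimate each neuron's firing frequency on calibration data.
calib = torch.randn(2048, d_model)
fire_rate = (torch.relu(calib @ w_up.T) > 0).float().mean(dim=0)

# 2) Partition: the hottest 20% of neurons would live on the GPU, the rest on the CPU.
hot = torch.topk(fire_rate, k=int(0.2 * d_ffn)).indices
cold = torch.tensor([i for i in range(d_ffn) if i not in set(hot.tolist())])

def hybrid_ffn(x: torch.Tensor) -> torch.Tensor:
    # "GPU" half: dense compute over the hot neurons (modelled here as a slice).
    h_hot = torch.relu(x @ w_up[hot].T)
    y = h_hot @ w_down[:, hot].T
    # "CPU" half: cold neurons, where most activations are zero and can be skipped.
    h_cold = torch.relu(x @ w_up[cold].T)
    return y + h_cold @ w_down[:, cold].T

x = torch.randn(1, d_model)
dense = torch.relu(x @ w_up.T) @ w_down.T
print(torch.allclose(hybrid_ffn(x), dense, atol=1e-4))  # same output, split execution
```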
With all the sparsity-aware, context-based memory-loading papers coming out (PowerInfer getting an 11x and Apple getting a 25x speedup on GPU), ReLU's dead zone is turning out to be important. Llama-class models (SwiGLU) might not have much longevity after all once all the Metal work
9 replies · 19 reposts · 237 likes
I've given the paper a first read. Its basic idea is to exploit locality during inference. Right now, one bottleneck for inference performance is GPU memory. Their approach is to have the CPU and GPU do joint inference: load the information for the active neurons onto the GPU as much as possible, making full use of locality. This greatly improves GPU inference efficiency.
@nash_su I read the description; it speeds up computation through a more efficient distribution of work between the CPU and GPU, which means pure-CPU speed should stay the same. My understanding is that it's better suited to setups with a fast GPU but limited VRAM, and it can address the case where llama.cpp's split CPU+GPU mode performs worse than a single GPU.
10 replies · 32 reposts · 112 likes
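Both (translated) posts above hinge on the same locality argument: for each token, only a small, input-dependent subset of neurons actually fires, so a cheap per-layer predictor can decide which neuron weights are worth having on the GPU at all. A hedged sketch of that predictor idea, in the spirit of Deja Vu's low-rank predictors, follows; the sizes, threshold, and training loop are made up for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
d_model, d_ffn, rank = 64, 512, 16
w_up = torch.randn(d_ffn, d_model)

# Small low-rank MLP that guesses which FFN neurons will fire for input x.
predictor = nn.Sequential(
    nn.Linear(d_model, rank), nn.ReLU(), nn.Linear(rank, d_ffn)
)

# Offline calibration: the training target is simply "did this neuron fire?".
opt = torch.optim.Adam(predictor.parameters(), lr=1e-2)
calib = torch.randn(4096, d_model)
target = (torch.relu(calib @ w_up.T) > 0).float()
for _ in range(200):
    loss = F.binary_cross_entropy_with_logits(predictor(calib), target)
    opt.zero_grad()
    loss.backward()
    opt.step()

# At inference time, only the predicted-active neurons need to be loaded/computed.
x = torch.randn(1, d_model)
active = torch.nonzero(torch.sigmoid(predictor(x))[0] > 0.5).squeeze(-1)
print(f"predictor selects {active.numel()}/{d_ffn} neurons for this token")
```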
PowerInfer finally got Llama2-70B-4q running on a 4090 for me. I had it write Momotaro, and the story drifted off partway through, which is about par for Llama2... As for final speed, in a WSL2 environment on a 13900K with 64 GB RAM and 16 threads allowed, it reached 3.89 tokens/s, reproducing the ~4 tokens/s result claimed in the preprint.
2 replies · 13 reposts · 42 likes
Very interesting research; running a 175B model on a single 4090 is no longer just a dream 🎉
github.com
High-speed Large Language Model Serving for Local Deployment - SJTU-IPADS/PowerInfer
3 replies · 40 reposts · 160 likes
This is huge! Now watch LLM API costs drop even further. [.cn PDF link] https://t.co/Kvj3lRTONE
37 replies · 212 reposts · 2K likes
PowerInfer can massively speed up inference on consumer GPUs, almost reaching A100 levels. It outperforms llama.cpp by up to 11.69x while retaining model accuracy. PowerInfer reached an average token generation rate of 13.20 tokens/s, with a peak of 29.08 tokens/s, across
15 replies · 81 reposts · 357 likes
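For scale, those throughput figures translate to per-token latency of roughly 1 / 13.20 s ≈ 75.8 ms per token on average, and 1 / 29.08 s ≈ 34.4 ms per token at the peak.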
Hope everyone enjoys it! We're just getting started on building PowerInfer as a brand-new solution for local LLM hosting, and we are open to any kind of feedback ❤️
PowerInfer - a high-speed inference engine for deploying LLMs locally. Just came across this super interesting project on speeding up inference. It's not MoE but it's a simple approach that exploits the high locality in LLM inference to design a GPU-CPU hybrid inference engine.
3 replies · 3 reposts · 22 likes
PowerInfer - a high-speed inference engine for deploying LLMs locally. Just came across this super interesting project on speeding up inference. It's not MoE but it's a simple approach that exploits the high locality in LLM inference to design a GPU-CPU hybrid inference engine.
24 replies · 289 reposts · 1K likes