Explore tweets tagged as #PowerInfer
With all the sparsity-aware, context-based memory-loading papers coming out (PowerInfer getting 11x and Apple getting 25x speedup on GPU), ReLU's dead zone is turning out to be important. Llama-class models (SwiGLU) might not have much longevity after all once all the Metal work
⚡PowerInfer is a CPU/GPU LLM inference engine leveraging activation locality on your device. The underlying design of PowerInfer exploits the high locality inherent in LLM inference, characterized by a power-law distribution in neuron activation. This distribution
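To make the locality idea concrete, here is a minimal conceptual sketch in Python (not PowerInfer's actual code; `split_hot_cold`, the synthetic Zipf profile, and the GPU budget are made up for illustration): rank neurons by how often they activate, pin the small hot head of that ranking on the GPU, and leave the long cold tail on the CPU.

```python
import numpy as np

def split_hot_cold(activation_counts: np.ndarray, gpu_budget: int):
    """Split one FFN layer's neurons into GPU-resident (hot) and CPU-resident (cold) sets.

    activation_counts: how often each neuron fired during an offline profiling run.
    gpu_budget: how many neurons fit in GPU memory for this layer.
    """
    order = np.argsort(activation_counts)[::-1]  # most frequently activated first
    hot = order[:gpu_budget]                     # small head of neurons covers most activations
    cold = order[gpu_budget:]                    # long tail is rarely touched, so CPU is fine
    return hot, cold

# Synthetic heavy-tailed (power-law-like) activation profile for one layer:
rng = np.random.default_rng(0)
counts = rng.zipf(a=2.0, size=11008).astype(np.float64)
hot, cold = split_hot_cold(counts, gpu_budget=1024)
print(f"hot neurons account for {counts[hot].sum() / counts.sum():.1%} of all activations")
```

With a heavy-tailed profile like this, a GPU budget of well under 10% of the neurons tends to capture most of the activation mass, which is the locality these tweets are describing.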
A OnePlus phone with 24GB of RAM running Mixtral 8x7B at 11 tokens/second with PowerInfer-2🤯 Much faster inference speed vs llama.cpp and MLC-LLM. It uses swap and caching to run the model even if it doesn't fit in the available RAM. 📌 Between Apple's LLM in a flash and PowerInfer-2,
PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU paper page: https://t.co/GfwfNHOidp This paper introduces PowerInfer, a high-speed Large Language Model (LLM) inference engine on a personal computer (PC) equipped with a single consumer-grade GPU. The key
This is why quality quantization or other methods like PowerInfer matter
PowerInfer - a high-speed inference engine for deploying LLMs locally. Just came across this super interesting project on speeding up inference. It's not MoE but it's a simple approach that exploits the high locality in LLM inference to design a GPU-CPU hybrid inference engine.
🚀🖥️ PowerInfer: Fast LLM Serving with a Consumer-grade GPU
• A new tool to make large LLMs run fast on regular computers with a consumer graphics card (GPU).
• It works by identifying which parts of the AI (neurons) are used a lot and which aren't.
• The parts used a lot are
This is the paper which made it possible to run Mixtral 8x7B at 11 tokens/s on a mobile phone.🤯 📌 PowerInfer-2 tackles the challenge of enabling high-speed inference of LLMs on smartphones, especially for models that exceed the device's memory capacity. 🔥 📌 The key innovation
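As a rough illustration only (not PowerInfer-2's actual pipeline; the file name, dimensions, and row indices below are hypothetical): memory-mapping a weight file lets the OS page tensors in from flash on demand, so a model larger than RAM can still be read piece by piece, and PowerInfer-2 layers neuron-level caching and I/O scheduling on top of this basic idea.

```python
import numpy as np

# Stand-in weight file on disk; in reality this would be one layer's FFN weights.
np.save("ffn_weights.npy", np.random.rand(11008, 4096).astype(np.float16))

# Memory-map it read-only: nothing is loaded into RAM until rows are actually touched.
weights = np.load("ffn_weights.npy", mmap_mode="r")

# Suppose a predictor says only these neurons will activate for the current token:
hot_rows = [3, 17, 42]

# Only the pages backing these rows are pulled in from storage (roughly, at page granularity).
tile = np.asarray(weights[hot_rows, :])
print(tile.shape)  # (3, 4096)
```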
Surprised by how many tech enthusiasts view the current AI revolution as merely another technological breakthrough. We're actually experiencing an unprecedented historical shift that none of us have ever seen. Here, a OnePlus 24GB mobile running Mixtral 8x7B at 11
PowerInfer finally ran Llama2-70B-4q on a 4090 for me. I had it write Momotaro, but the story drifted off partway through; well, that's Llama2's usual steady cruising... As for the final speed, in a WSL2 environment with a 13900K, 64GB RAM, and 16 threads allowed, it reached 3.89 tokens/s, reproducing the ~4 tokens/s result claimed in the preprint.
PowerInfer-2: a high-performance large-model inference framework designed for smartphones. It supports the Mixtral 47B MoE model, with inference speeds of up to 11.68 tokens per second, 22x faster than other frameworks. Even with a 7B model, placing only 50% of the FFN weights on the phone still maintains state-of-the-art speed. Features: 1. PowerInfer-2 is a high-speed inference engine designed for deploying activation-sparse LLMs on smartphones
PowerInfer is making the rounds in the AI inference / local-llama community. It's a fork of llama.cpp. I created a PR merging it and wrote a review on the diff. I'll update it as I go through it more. Feel free to contribute with comments. https://t.co/Zp2LvtLge0
Here it comes: the era of GPT-4 in your pocket. [Apple's LLM in a flash and PowerInfer] Running Mixtral 8x7B at 11 tokens/sec on a 24GB mobile device https://t.co/xvckxQOH5T Inference is dramatically faster compared to llama.cpp and MLC-LLM. Available RAM
Kind of.. I started writing this in Nov.. Just started prompting tho… This and Powerinfer
Has anyone written anything on how to prompt MoE models? How can I purposefully mess with which experts get picked? Is there a good way to tell when one expert is being triggered and not another? It would be really cool if we could see what tokens went to which expert
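None of these tweets answer the question, but here is one hedged starting point for the "which expert got picked" part: hook the per-layer router ("gate") modules in the Hugging Face Mixtral implementation and log the top-2 expert ids per token. The module path `model.model.layers[i].block_sparse_moe.gate` and the router-logit shape are assumptions about the current `transformers` layout, not something stated in this thread.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

picked = []  # (layer index, top-2 expert ids for every token)

def make_hook(layer_idx):
    def hook(module, inputs, output):
        # Assumed: the gate outputs router logits of shape (num_tokens, num_experts).
        top2 = output.topk(2, dim=-1).indices
        picked.append((layer_idx, top2.detach().cpu()))
    return hook

handles = []
for i, layer in enumerate(model.model.layers):
    gate = layer.block_sparse_moe.gate  # assumed attribute path in the HF Mixtral code
    handles.append(gate.register_forward_hook(make_hook(i)))

prompt = "Explain activation sparsity in one sentence."
inputs = tok(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    model(**inputs)

for h in handles:
    h.remove()

# Which two experts each prompt token was routed to in the first MoE layer:
print(picked[0])
```

From there, comparing `picked` across different phrasings of the same prompt would show whether wording actually shifts the routing.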
🚀 Excited to introduce PowerInfer-2: A game-changing LLM inference engine for mobile devices by the #PowerInfer team. It smoothly runs a 47B model with a staggering 29x speedup on smartphones! Watch our demo to see it in action! 🎥 Technical details at: https://t.co/7bx5EnzWCs
Now that I've finally gotten NVIDIA and CUDA working, I'm trying PowerInfer, the leading contender for fast LLM inference. Will it actually run properly on 6GB? To install, after setting up CUDA: git clone https://t.co/49Dn3GnEDB cd PowerInfer pip install -r requirements.txt (continued)
This is how GPT-4 scores its own responses to tech & startup prompts compared to:
* LocalMentor
* Mixtral-8x7b
* Falcon-40B
(128 token response) Sampled responses using the mixtral-offloading notebook: https://t.co/45zJekZ5lo And the PowerInfer demo: https://t.co/IXMqYiRjQd