Explore tweets tagged as #PowerInfer
With all the sparsity-aware, context-based memory-loading papers coming out (PowerInfer getting 11x and Apple getting 25x speedup on GPU), ReLU's dead zone is turning out to be important. Llama-class models (SwiGLU) might not have much longevity after all once all the Metal work
⚡PowerInfer is a CPU/GPU LLM inference engine leveraging activation locality on your device. The underlying design of PowerInfer exploits the high locality inherent in LLM inference, characterized by a power-law distribution in neuron activation. This distribution
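To make the locality idea concrete, here is a minimal conceptual sketch in Python (not PowerInfer's actual code; `split_hot_cold`, the synthetic Zipf profile, and the GPU budget are made up for illustration): rank neurons by how often they activate, pin the small hot head of that ranking on the GPU, and leave the long cold tail on the CPU.

```python
import numpy as np

def split_hot_cold(activation_counts: np.ndarray, gpu_budget: int):
    """Split one FFN layer's neurons into GPU-resident (hot) and CPU-resident (cold) sets.

    activation_counts: how often each neuron fired during an offline profiling run.
    gpu_budget: how many neurons fit in GPU memory for this layer.
    """
    order = np.argsort(activation_counts)[::-1]  # most frequently activated first
    hot = order[:gpu_budget]                     # small head of neurons covers most activations
    cold = order[gpu_budget:]                    # long tail is rarely touched, so CPU is fine
    return hot, cold

# Synthetic heavy-tailed (power-law-like) activation profile for one layer:
rng = np.random.default_rng(0)
counts = rng.zipf(a=2.0, size=11008).astype(np.float64)
hot, cold = split_hot_cold(counts, gpu_budget=1024)
print(f"hot neurons account for {counts[hot].sum() / counts.sum():.1%} of all activations")
```

With a heavy-tailed profile like this, a GPU budget of well under 10% of the neurons tends to capture most of the activation mass, which is the locality these tweets are describing.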
A OnePlus phone with 24GB of RAM running Mixtral 8x7B at 11 tokens/second with PowerInfer-2🤯 Much faster inference speed vs llama.cpp and MLC-LLM. It uses swap and caching to run the model even if it doesn't fit in the available RAM. 📌 Between Apple's LLM in a flash and PowerInfer-2,
PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU paper page: https://t.co/GfwfNHOidp This paper introduces PowerInfer, a high-speed Large Language Model (LLM) inference engine on a personal computer (PC) equipped with a single consumer-grade GPU. The key
This is why quality quantization or other methods like PowerInfer matter
PowerInfer - a high-speed inference engine for deploying LLMs locally. Just came across this super interesting project on speeding up inference. It's not MoE but it's a simple approach that exploits the high locality in LLM inference to design a GPU-CPU hybrid inference engine.
🚀🖥️ PowerInfer: Fast LLM Serving with a Consumer-grade GPU
• A new tool to make large LLMs run fast on regular computers with a consumer graphics card (GPU).
• It works by identifying which parts of the AI (neurons) are used a lot and which aren't.
• The parts used a lot are
This is the paper which made it possible to run Mixtral 8x7B at 11 tokens/s on a mobile phone.🤯 📌 PowerInfer-2 tackles the challenge of enabling high-speed inference of LLMs on smartphones, especially for models that exceed the device's memory capacity. 🔥 📌 The key innovation
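As a rough illustration only (not PowerInfer-2's actual pipeline; the file name, dimensions, and row indices below are hypothetical): memory-mapping a weight file lets the OS page tensors in from flash on demand, so a model larger than RAM can still be read piece by piece, and PowerInfer-2 layers neuron-level caching and I/O scheduling on top of this basic idea.

```python
import numpy as np

# Stand-in weight file on disk; in reality this would be one layer's FFN weights.
np.save("ffn_weights.npy", np.random.rand(11008, 4096).astype(np.float16))

# Memory-map it read-only: nothing is loaded into RAM until rows are actually touched.
weights = np.load("ffn_weights.npy", mmap_mode="r")

# Suppose a predictor says only these neurons will activate for the current token:
hot_rows = [3, 17, 42]

# Only the pages backing these rows are pulled in from storage (roughly, at page granularity).
tile = np.asarray(weights[hot_rows, :])
print(tile.shape)  # (3, 4096)
```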
Surprised by how many tech enthusiasts view the current AI revolution as merely another technological breakthrough. We're actually experiencing an unprecedented historical shift that none of us have ever seen. Here, a OnePlus 24GB mobile running Mixtral 8x7B at 11
PowerInfer finally ran Llama2-70B-4q on a 4090 for me. I had it write Momotaro, but the story drifted off partway through; well, that's Llama2's usual steady cruising... As for the final speed, in a WSL2 environment with a 13900K, 64GB RAM, and 16 threads allowed, it reached 3.89 tokens/s, reproducing the ~4 tokens/s result claimed in the preprint.
PowerInfer-2: a high-performance large-model inference framework designed for smartphones. It supports the Mixtral 47B MoE model, with inference speeds of up to 11.68 tokens per second, 22x faster than other frameworks. Even with a 7B model, placing only 50% of the FFN weights on the phone still maintains state-of-the-art speed. Features: 1. PowerInfer-2 is a high-speed inference engine designed for deploying activation-sparse LLMs on smartphones
PowerInfer is making the rounds in the AI inference / local-llama community. It's a fork of llama.cpp. I created a PR merging it and wrote a review on the diff. I'll update it as I go through it more. Feel free to contribute with comments. https://t.co/Zp2LvtLge0
Here it comes: the era of GPT-4 in your pocket. [Apple's LLM in a flash and PowerInfer] Running Mixtral 8x7B at 11 tokens/sec on a 24GB mobile device https://t.co/xvckxQOH5T Inference is dramatically faster compared to llama.cpp and MLC-LLM. Available RAM
Kind of.. I started writing this in Nov.. Just started prompting tho… This and Powerinfer
Has anyone written anything on how to prompt MoE models? How can I purposefully mess with which experts get picked? Is there a good way to tell when one expert is being triggered and not another? It would be really cool if we could see what tokens went to which expert
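None of these tweets answer the question, but here is one hedged starting point for the "which expert got picked" part: hook the per-layer router ("gate") modules in the Hugging Face Mixtral implementation and log the top-2 expert ids per token. The module path `model.model.layers[i].block_sparse_moe.gate` and the router-logit shape are assumptions about the current `transformers` layout, not something stated in this thread.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

picked = []  # (layer index, top-2 expert ids for every token)

def make_hook(layer_idx):
    def hook(module, inputs, output):
        # Assumed: the gate outputs router logits of shape (num_tokens, num_experts).
        top2 = output.topk(2, dim=-1).indices
        picked.append((layer_idx, top2.detach().cpu()))
    return hook

handles = []
for i, layer in enumerate(model.model.layers):
    gate = layer.block_sparse_moe.gate  # assumed attribute path in the HF Mixtral code
    handles.append(gate.register_forward_hook(make_hook(i)))

prompt = "Explain activation sparsity in one sentence."
inputs = tok(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    model(**inputs)

for h in handles:
    h.remove()

# Which two experts each prompt token was routed to in the first MoE layer:
print(picked[0])
```

From there, comparing `picked` across different phrasings of the same prompt would show whether wording actually shifts the routing.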
🚀 Excited to introduce PowerInfer-2: A game-changing LLM inference engine for mobile devices by the #PowerInfer team. It smoothly runs a 47B model with a staggering 29x speedup on smartphones! Watch our demo to see it in action! 🎥 Technical details at: https://t.co/7bx5EnzWCs
Now that I've finally gotten NVIDIA and CUDA working, I'm trying PowerInfer, the leading contender for fast LLM inference. Will it actually run properly on 6GB? To install, after setting up CUDA: git clone https://t.co/49Dn3GnEDB cd PowerInfer pip install -r requirements.txt (continued)
This is how GPT-4 scores its own responses to tech & startup prompts compared to:
* LocalMentor
* Mixtral-8x7b
* Falcon-40B
(128 token response) Sampled responses using the mixtral-offloading notebook: https://t.co/45zJekZ5lo And the PowerInfer demo: https://t.co/IXMqYiRjQd