
You Jiacheng
@YouJiacheng
Followers: 9K · Following: 17K · Media: 2K · Statuses: 12K
a big fan of TileLang. Follow TileLang, meow! Follow TileLang, thank you, meow! https://t.co/utshC0jrCO Ten-year longtime fan
Joined August 2015
I think an intrinsic property of "fast weight" is that different tokens with different contexts will see different weights. In this sense, MoE is a special case of "fast weight".
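A minimal NumPy sketch of the claim (my illustration, not from the tweet): in a top-1 MoE layer, the router picks a different expert weight matrix for each token, so the effective weights a token sees depend on its features — i.e., context-dependent "fast weights". All names here (`experts`, `router`, `moe_forward`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts = 4, 3
experts = rng.normal(size=(n_experts, d, d))   # per-expert weight matrices
router = rng.normal(size=(d, n_experts))       # routing projection

def moe_forward(x):
    """Top-1 MoE: each token is multiplied by a different expert's weights."""
    logits = x @ router
    expert_idx = np.argmax(logits, axis=-1)    # hard top-1 routing per token
    # Different tokens (different features) see different weight matrices.
    out = np.stack([experts[e] @ t for e, t in zip(expert_idx, x)])
    return out, expert_idx

tokens = rng.normal(size=(5, d))               # 5 tokens with different features
out, idx = moe_forward(tokens)
```

The "weights" applied to each token are selected at inference time by the token itself, which is the sense in which MoE behaves like a fast-weight mechanism.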
China × California √
🇨🇳 #China's intelligent port operates with high efficiency. At China's smart port, autonomous transport vehicles move in an orderly and tireless manner. https://t.co/xBDxgruRUS
a well-organized and information-rich thread!
three years ago, DiT replaced the legacy unet with a transformer-based denoising backbone. we knew the bulky VAEs would be the next to go -- we just waited until we could do it right. today, we introduce Representation Autoencoders (RAE). >> Retire VAEs. Use RAEs. 👇(1/n)
Introducing Geometry-aware Policy Imitation (GPI)! GPI constructs an energy landscape over the state space using demonstrations. A policy acts in the environment by following the gradient of the landscape. This enables multimodal policies with very fast inference (<1 ms)!
🎉 Excited to share Geometry-aware Policy Imitation (GPI): A simple, efficient, and interpretable approach for imitation learning. Delivers multimodal skills, stronger performance, 20–100× faster inference (<1 ms), and orders-of-magnitude less memory. https://t.co/YEUaiYwuQd
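A toy sketch of the idea as described above (my simplification, not the paper's method): build an energy over states from demonstration states, then act by descending the energy's gradient. Here the energy is the squared distance to the nearest demo state, so the gradient is available in closed form.

```python
import numpy as np

demos = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])  # demo states (assumed 2-D)

def energy(s):
    """Squared distance from state s to the nearest demonstration state."""
    return np.min(np.sum((demos - s) ** 2, axis=1))

def policy_step(s, step=0.1):
    """One policy step: follow the negative gradient of the energy landscape."""
    nearest = demos[np.argmin(np.sum((demos - s) ** 2, axis=1))]
    grad = 2.0 * (s - nearest)          # analytic gradient of energy at s
    return s - step * grad

s = np.array([0.4, 0.3])
for _ in range(50):
    s = policy_step(s)                  # state descends toward a demo state
```

Because each step is a nearest-neighbor lookup plus a vector subtraction, inference is trivially fast; the landscape has one basin per demonstration, which is one way a gradient-following policy can stay multimodal.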
it's a stack plus a read-only view. (a stack doesn't support random read)
@YouJiacheng @ShengjieWa34067 it's a stack
sorry, but it's not random-access memory. it's an append-only log.
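The distinction in the two replies above can be made concrete (my illustration): an append-only log permits random *reads* of any earlier entry but only appends as writes — unlike random-access memory (arbitrary writes) or a pure stack (reads only at the top). The class name is hypothetical.

```python
class AppendOnlyLog:
    """Append-only writes, random read-only access to past entries."""

    def __init__(self):
        self._items = []

    def append(self, x):
        self._items.append(x)        # the only way to write

    def read(self, i):
        return self._items[i]        # random reads of any position are fine
    # deliberately no pop() and no __setitem__: entries are immutable

log = AppendOnlyLog()
for tok in "abcd":
    log.append(tok)
```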
the total number of tokens is slightly reduced (2380 * 2/3 < 1630), but I'm not sure whether this comes from the separate batch_sizes or from other changes in this PR.
last year, I was motivated to introduce a similar change: do fewer embedding updates, because modded-nanogpt has many embedding params (due to value embeddings) and their gradient communication is costly. but I gave up on the idea because it looked ugly and ad hoc😂.
Down to 146.8s on modded-nanogpt! https://t.co/OV0TaesL4I Surprising result: Different parameter groups have different sensitivity to batch size. Instead of picking a single batch size, grad accumulation can be managed on a param level to simulate different batch sizes.
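A minimal sketch of per-parameter-group gradient accumulation as described in the quoted tweet (my illustration, not the actual modded-nanogpt code): group "A" steps on every micro-batch, while group "B" accumulates 4 micro-batches before stepping, giving it a 4× larger effective batch size. The stand-in gradient function is hypothetical.

```python
import numpy as np

accum_steps = {"A": 1, "B": 4}          # per-group accumulation -> per-group batch size
params = {"A": np.zeros(2), "B": np.zeros(2)}
buffers = {k: np.zeros(2) for k in params}
counts = {k: 0 for k in params}
lr = 0.1

def micro_batch_grad(name):
    return np.ones(2)                    # stand-in gradient for this illustration

for step in range(8):                    # 8 micro-batches
    for name in params:
        buffers[name] += micro_batch_grad(name)
        counts[name] += 1
        if counts[name] == accum_steps[name]:
            # step with the averaged gradient over the group's accumulation window
            params[name] -= lr * buffers[name] / counts[name]
            buffers[name][:] = 0.0
            counts[name] = 0
```

After 8 micro-batches, group "A" has taken 8 optimizer steps and group "B" only 2, each with its own effective batch size — no single global batch size is ever chosen.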
(disclaimer: I witnessed this work but did not contribute to it.)
one caveat is that memfd_create needs glibc ≥ 2.27. I tried it when I did my homework a few years ago and found the test environment was Ubuntu 16.04🥵
@jeremyphoward You can do something like
fd = os.memfd_create(name)
os.ftruncate(fd, size)
and then either share fd with your child process, e.g. via subprocess.Popen(pass_fds=), or you mmap it, which multiprocessing can deserialize to the same region. The kernel refcounts the fd like a file.
To be precise: cut off exports of U.S. quartz mineral that can be purified to 5N purity. The purity of the mineral itself is not what matters; the composition of its impurities does.
Immediate term: Throttle the PRC tech sector. To stop Beijing from achieving dominance in critical tech, we must: ⚙️ Cut off U.S. high-purity quartz exports 🚫 Expand SME export controls 🤝 Align allies like Japan & the Netherlands with U.S. policy