
Luca Benini
@LucaBeniniZhFe
Followers: 2K · Following: 32K · Media: 37 · Statuses: 2K
Ferrara, Palo Alto, Ferrara, Zurich... Boh
Joined March 2018
We are really hell-bent on achieving full FPU utilization! Maybe too much? 😏
And here is Luca @lucacolagrande3 at #ISLPED2025 in Reykjavik presenting his poster “Towards Zero-Stall Matrix Multiplication on Energy-Efficient RISC-V Clusters for Machine Learning Acceleration”. Check out the paper:
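Since the poster is about pushing FPU utilization toward 100%, here is a minimal, hedged sketch of how that metric is usually computed for a matmul kernel on a RISC-V core. The cycle-counter access and the one-FMA-per-cycle peak are assumptions for illustration; this is not the cluster code from the paper.

```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative only: measure FPU utilization of a naive matmul, assuming a
 * RISC-V core with a user-readable cycle CSR and a peak of 1 FMA (2 FLOPs)
 * per cycle. */
static inline uint64_t read_cycles(void) {
    uint64_t c;
    __asm__ volatile("rdcycle %0" : "=r"(c));  /* RV64 cycle counter (assumed accessible) */
    return c;
}

static void matmul(const double *A, const double *B, double *C, int n) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double acc = C[i * n + j];
            for (int k = 0; k < n; k++)
                acc += A[i * n + k] * B[k * n + j];  /* ideally one FMA issued per cycle */
            C[i * n + j] = acc;
        }
}

void measure(const double *A, const double *B, double *C, int n) {
    uint64_t t0 = read_cycles();
    matmul(A, B, C, n);
    uint64_t cycles = read_cycles() - t0;
    double flops = 2.0 * n * n * n;         /* one mul + one add per inner iteration */
    double util  = flops / (2.0 * cycles);  /* peak = 2 FLOPs/cycle */
    printf("FPU utilization: %.1f%%\n", 100.0 * util);
}
```

In this simple model, "zero-stall" means the measured cycle count approaches the n³ FMA issue slots, i.e. utilization approaches 100%.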
This is the last missing element needed to make the @pulp_platform MCU cluster robust against soft errors. Protecting memory and cores is not sufficient: you must also protect the low-latency interconnect between them! This paper shows how.
We propose "relOBI: A Reliable Low-latency Interconnect for Tightly-Coupled On-chip Communication", extending the Open Bus Interface (OBI) by combining TMR with error-correcting codes for critical signals @MikeRogenmoser @Ang__93
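As a hedged illustration of the TMR half of that combination (the actual relOBI design is open-source RTL, not this C): a bitwise 2-out-of-3 majority voter, which is how triplicated copies of a critical signal mask a single upset.

```c
#include <stdint.h>

/* Bitwise 2-out-of-3 majority vote over three copies of a 32-bit signal.
 * A single-bit upset in any one copy is outvoted by the other two. */
static inline uint32_t tmr_vote(uint32_t a, uint32_t b, uint32_t c) {
    return (a & b) | (b & c) | (a & c);
}

/* Optional diagnostic: set bits mark positions where the copies disagreed. */
static inline uint32_t tmr_mismatch(uint32_t a, uint32_t b, uint32_t c) {
    return (a ^ b) | (b ^ c);
}
```

In general, voting like this is cheap for narrow handshake and control signals, while wider payloads are more economically protected by ECC, which is presumably why the two are combined here.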
Nice explanation. By the way, if you want to understand it, you can also read the @pulp_platform Quadrilatero paper (with open-source hardware, of course)
arxiv.org
The rapid growth of AI-based Internet-of-Things applications increased the demand for high-performance edge processing engines on a low-power budget and tight area constraints. As a consequence,...
Blackwell Tensor Core 5th Generation MMA instruction (in PTX) fully moved away from using registers for holding matrices. Operands now reside in shared memory and Tensor Memory. Specifically, suppose the MMA computes D = A * B + C. Not using thread…
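For reference, the matrix-multiply-accumulate that the MMA instruction performs, written elementwise (the standard definition, independent of whether operands live in registers, shared memory, or Tensor Memory):

```latex
A \in \mathbb{R}^{M \times K},\quad B \in \mathbb{R}^{K \times N},\quad C, D \in \mathbb{R}^{M \times N}:
\qquad D_{ij} = \sum_{k=1}^{K} A_{ik} B_{kj} + C_{ij}
```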
Can you build a low-latency interconnect (<15 cycles contention-free worst case) with less than 15% area overhead and plenty of bandwidth to connect 1K cores to 4K memory banks (multi-MB L1 memory)? Yes, you can. See how in this paper! (Open hardware, to appear at ICCD 2025.)
Here is "TeraNoC: A Multi-Channel 32-bit Fine-Grained, Hybrid Mesh-Crossbar NoC for Efficient Scale-up of 1000+ Core Shared-L1-Memory Clusters" offering scaling & low latency with minimal routing overhead @yichao_zh @fischetim @MarcoBertuletti #ICCD2025
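A hedged sketch of the kind of fine-grained address interleaving such a shared-L1 cluster relies on to spread traffic across its banks; the 32-bit word interleaving over 4096 banks is an assumption for illustration, not the actual TeraNoC mapping.

```c
#include <stdint.h>

/* Hypothetical word-level interleaving of a shared L1 across 4K banks:
 * consecutive 32-bit words land in consecutive banks, so requests from
 * 1000+ cores spread out instead of piling onto a few banks. */
#define WORD_BYTES 4u
#define NUM_BANKS  4096u   /* "4K memory banks" from the tweet */

typedef struct { uint32_t bank; uint32_t row; } l1_loc_t;

static inline l1_loc_t l1_decode(uint32_t byte_addr) {
    uint32_t word = byte_addr / WORD_BYTES;
    l1_loc_t loc = {
        .bank = word % NUM_BANKS,   /* which bank services the request */
        .row  = word / NUM_BANKS    /* offset within that bank */
    };
    return loc;
}
```

The interconnect's job is then to route each core's request to the decoded bank and back within the low, predictable latency quoted above.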
RT @pulp_platform: Exciting news! Our paper "A Dynamic Allocation Scheme for Adaptive Shared-memory Mapping on Kilo-core RV Clusters for At…
Yes. It is happening. I have been saying this for a while. The USA and EU are lagging, though. 🤐
Intel’s not doing so hot lately. Meanwhile, vendors are killing it at the RISC-V Summit in China. One CPU got a SPECint2006/GHz rating of 10.4! To put it in perspective, an i7-4790K (Haswell) scores 8.1/GHz. RISC-V is hitting high-end desktop territory FAST:
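Putting the two per-clock figures from the quoted tweet side by side:

```latex
\frac{10.4\ \text{SPECint2006/GHz}}{8.1\ \text{SPECint2006/GHz}} \approx 1.28
```

That is roughly 28% more SPECint2006 per clock than the cited Haswell part; absolute single-thread performance of course still depends on the clock frequency each design actually reaches.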
This is solid, grounded reasoning: 100x is not possible unless you squeeze the model into on-chip memories. 10x could be possible w.r.t. current GPUs if you remove caches and instruction memories, manage dataflow explicitly, use advanced collectives, and build ultra-dense large tensor engines!
@sallywf As Clive [1] said, it is not possible for a chip released in the same timeframe to have 10x perf per watt over TPU/Trainium/GPGPU using the physics in our universe, despite claims from companies that are trying to bake transformers into silicon. If you looked at a profile before &…
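One way to make the "10x maybe, 100x no" intuition concrete, as a back-of-the-envelope bound (mine, not from the thread): if a fraction f of the baseline's energy per inference goes to things a specialized chip can actually eliminate (off-chip DRAM traffic, caches, instruction fetch), the best possible energy-efficiency gain from eliminating them is Amdahl-limited:

```latex
\text{gain}_{\max} = \frac{E_{\text{total}}}{E_{\text{total}} - E_{\text{overhead}}} = \frac{1}{1 - f},
\qquad f = \frac{E_{\text{overhead}}}{E_{\text{total}}}
```

So even f = 0.9 caps the gain at 10x, and a 100x claim implicitly asserts that 99% of the baseline's energy is removable overhead, which is hard to square with useful compute already being a non-negligible share of a well-designed accelerator's energy budget.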
Pushing seriously to get @pulp_platform to become the reference open RISC-V platform for automotive!
Nils @niwist, Cyril, and Frank have just visited Friedrichshafen to discuss possible future collaboration with ZF. Below, a couple of pics from the visit, and of course the iconic Zeppelin.
Overheard in Silicon Valley: Death, Taxes, EDA Licenses :)
This is one heck of a Docker image if you are into open-source hardware tools. And I strongly suspect that some of the tools in this container will have a huge impact in the coming months/years. Thank you @LucaBeniniZhFe.
Totally agree on this!
The race for the LLM "cognitive core" - a few-billion-param model that maximally sacrifices encyclopedic knowledge for capability. It lives always-on and by default on every computer as the kernel of LLM personal computing. Its features are slowly crystallizing: - Natively multimodal…
In real time 👍.
2 MOSFETs, N & P. CMOS cells with a clean DRC. Here is Matt @matthewvenn rapping at the Tiny Tapeout Workshop at #ETHZurich 😀
Honored by the 2024 award! Good luck to the next winner! 🏆
The TCMM Open Source Hardware Contribution Award 2024 winner was @LucaBeniniZhFe, for contributions to @pulp_platform.
RT @Saboo_Shubham_: This Chinese AI model just changed document OCR forever. It can parse complex documents with text, tables, formulas an…
I promised @pulp_platform to never do superscalar (too much hard engineering vs. research 🙄). However, it was worth it. We need high-quality open-source cores, with industry support, to climb the performance ladder at high energy efficiency. The team did it! 👍
=>."CVA6S+: A Superscalar #RISCV Core with High-Throughput Memory Architecture", Riccardo Tedeschi, . , @LucaBeniniZhFe, Davide Rossi, @pulp_platform, RISC-V Summit Europe, May 14, 2025 CVA6S+, arXiv, May 8
I have been waiting for quite some time to talk about FlatAttention. Now the genie is finally out of the bottle. 🚀 Architecture + dataflow which IMHO will dominate the next generation of scaled ML systems: powerful on-chip collectives and multicast, and fat compute tiles.
To minimize main memory accesses by leveraging collective primitives integrated into the on-chip network fabric, here is "FlatAttention: Dataflow and Fabric Collectives Co-Optimization for Efficient Multi-Head Attention on Tile-Based Many-PE Accelerators"
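For context, the multi-head attention kernel the paper targets is standard scaled dot-product attention, computed per head (the textbook definition, not the FlatAttention tiling itself):

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
```

The intermediate QKᵀ score matrix grows quadratically with sequence length, so dataflows that avoid spilling it to main memory are the usual lever; the tweet's point is that on-fabric collectives and multicast let the compute tiles cooperate on exactly that.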