Luca Benini

@LucaBeniniZhFe

Followers: 2K · Following: 32K · Media: 37 · Statuses: 2K

Ferrara, Palo Alto, Ferrara, Zurich... who knows

Joined March 2018
@LucaBeniniZhFe
Luca Benini
1 day
We are really hellbent on achieving full FPU utilization! Maybe too much? 😏
@pulp_platform
PULP Platform
3 days
And here is Luca @lucacolagrande3 at #ISLPED2025 in Reykjavik presenting his poster “Towards Zero-Stall Matrix Multiplication on Energy-Efficient RISC-V Clusters for Machine Learning Acceleration”. Check out the paper:
@LucaBeniniZhFe
Luca Benini
1 day
This is the last missing element to make the @pulp_platform MCU cluster robust against soft errors. Protecting memory and cores is not sufficient; you must also protect the low-latency interconnect between them! This paper shows how.
@pulp_platform
PULP Platform
3 days
We propose "relOBI: A Reliable Low-latency Interconnect for Tightly-Coupled On-chip Communication" extending Open Bus Interface by combining TMR with error correction codes for critical signals @MikeRogenmoser @Ang__93
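The tweet above combines TMR (triple modular redundancy) with error correction codes on critical signals. As a rough illustration of the TMR half only (not the paper's implementation), a bitwise majority voter over three replicated copies of a signal masks any single-copy fault:

```python
def tmr_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority over three replicated signal values.

    Each output bit is 1 iff at least two of the three copies
    agree on 1, so a fault confined to one copy is masked.
    """
    return (a & b) | (b & c) | (a & c)

# One replica (c) has two flipped bits; the vote still recovers the value.
golden = 0b1010
assert tmr_vote(golden, golden, 0b0110) == golden
```

In hardware this voter is a few gates per bit on the replicated request/response wires; the ECC part of the scheme covers the wide data signals where triplication would be too expensive.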
@LucaBeniniZhFe
Luca Benini
1 day
RT @pulp_platform: We propose "relOBI: A Reliable Low-latency Interconnect for Tightly-Coupled On-chip Communication" extending Open Bus In…
@LucaBeniniZhFe
Luca Benini
1 day
Nice explanation. By the way, if you want to understand, you can also read the @pulp_platform Quadrilatero paper (with open-source hardware, of course).
arxiv.org
The rapid growth of AI-based Internet-of-Things applications increased the demand for high-performance edge processing engines on a low-power budget and tight area constraints. As a consequence,...
@SemiAnalysis_
SemiAnalysis
2 days
Blackwell Tensor Core 5th Generation MMA instruction (in PTX) fully moved away from using registers for holding matrices. Operands now reside in shared memory and Tensor Memory. Specifically, suppose the MMA computes D = A * B + C. Not using thread
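The D = A * B + C semantics described above can be sketched numerically. Tile sizes here are hypothetical, purely for illustration; the real instruction operates on hardware tiles held in shared memory and Tensor Memory rather than NumPy arrays:

```python
import numpy as np

# One MMA tile: D = A @ B + C, with the accumulate fused
# into the matrix multiply. Shapes are illustrative only.
M, N, K = 16, 8, 16
rng = np.random.default_rng(0)
A = rng.standard_normal((M, K)).astype(np.float32)
B = rng.standard_normal((K, N)).astype(np.float32)
C = rng.standard_normal((M, N)).astype(np.float32)

D = A @ B + C  # the whole-tile operation the MMA instruction performs
```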
@LucaBeniniZhFe
Luca Benini
5 days
Can you build a low-latency (<15 cycles contention-free worst case) interconnect with less than 15% area overhead and plenty of bandwidth, to connect 1K cores with 4K memory banks (multi-MB L1 memory)? Yes you can. See how in this paper! (Open hardware, to appear at ICCD25.)
@pulp_platform
PULP Platform
6 days
Here is "TeraNoC: A Multi-Channel 32-bit Fine-Grained, Hybrid Mesh-Crossbar NoC for Efficient Scale-up of 1000+ Core Shared-L1-Memory Clusters" offering scaling & low latency with minimal routing overhead @yichao_zh @fischetim @MarcoBertuletti #ICCD2025
@LucaBeniniZhFe
Luca Benini
10 days
RT @pulp_platform: Exciting news! Our paper "A Dynamic Allocation Scheme for Adaptive Shared-memory Mapping on Kilo-core RV Clusters for At….
@LucaBeniniZhFe
Luca Benini
13 days
Yes. It is happening. I have been saying this for a while. USA and EU are lagging, though. 🤐
@lauriewired
LaurieWired
13 days
Intel's not doing so hot lately. Meanwhile vendors are killing it at the RISC-V Summit in China. One CPU got a SPECint2006 score of 10.4/GHz! To put it in perspective, an i7-4790K (Haswell) scores 8.1/GHz. RISC-V is hitting high-end desktop territory FAST:
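A quick sanity check on the per-clock comparison quoted above, using only the two figures from the tweet:

```python
# Per-GHz SPECint2006 figures as quoted in the thread.
riscv_per_ghz = 10.4
haswell_per_ghz = 8.1  # i7-4790K

# Per-cycle advantage of the quoted RISC-V core over Haswell.
ratio = riscv_per_ghz / haswell_per_ghz
print(f"{ratio:.2f}x per-clock")  # about 1.28x
```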
@LucaBeniniZhFe
Luca Benini
17 days
Minority report.
@matteopelleg
Matteo Pellegrini ⚡️
17 days
Forbes 30 under 30 has been doing this for years
@LucaBeniniZhFe
Luca Benini
17 days
RT @matteopelleg: Forbes 30 under 30 has been doing this for years
@LucaBeniniZhFe
Luca Benini
17 days
This is solid, grounded reasoning: 100x is not possible unless you squeeze the model into on-chip memories. 10x could be possible with respect to current GPUs if you remove caches and instruction memories, manage dataflow, use advanced collectives, and make ultra-dense large tensor engines!
@SemiAnalysis_
SemiAnalysis
17 days
@sallywf As Clive [1] said, it is not possible for a chip released in the same timeframe to have 10x perf per watt over TPU/Trainium/GPGPU using the physics in our universe, despite claims from companies that are trying to bake transformers into silicon. If you looked at a profile before &
@LucaBeniniZhFe
Luca Benini
1 month
Pushing seriously to get @pulp_platform to become the reference open RISC-V platform for automotive!
@pulp_platform
PULP Platform
1 month
Nils @niwist, Cyril and Frank have just visited Friedrichshafen to discuss possible future collaboration with ZF. Below are a couple of pics from the visit, and of course the iconic Zeppelin.
@LucaBeniniZhFe
Luca Benini
1 month
👍 ... and it's going to get better. Right, @open @pulp_platform @YosysHQ @OpenROAD_EDA?
@RajaXg
Raja Koduri
1 month
Overheard in Silicon Valley: Death, Taxes, EDA Licenses :). This is one heck of a docker image if you are into open-source hardware tools. And I strongly suspect that some of the tools in this container will have a huge impact in coming months/years. Thank you @LucaBeniniZhFe.
@LucaBeniniZhFe
Luca Benini
1 month
Totally agree on this!
@karpathy
Andrej Karpathy
1 month
The race for the LLM "cognitive core" - a few-billion-param model that maximally sacrifices encyclopedic knowledge for capability. It lives always-on and by default on every computer as the kernel of LLM personal computing. Its features are slowly crystallizing: - Natively multimodal…
@LucaBeniniZhFe
Luca Benini
1 month
In real time 👍.
@pulp_platform
PULP Platform
1 month
2 MOSFETs, N & P. CMOS cells with a clean DRC. Here is Matt @matthewvenn rapping at the Tiny Tapeout Workshop at #ETHZurich 😀
@LucaBeniniZhFe
Luca Benini
2 months
Honored by the 2024 award! Good luck to the next winner! 🏆
@hotchipsorg
Hot Chips
2 months
The TCMM Open Source Hardware Contribution Award 2024 winner was @LucaBeniniZhFe, for contributions to @pulp_platform.
@LucaBeniniZhFe
Luca Benini
2 months
RT @Saboo_Shubham_: This Chinese AI model just changed document OCR forever. It can parse complex documents with text, tables, formulas an…
@LucaBeniniZhFe
Luca Benini
2 months
Feel the momentum! 👍🚀
@jun1okamura
CEO@Trigence
2 months
It was a great gathering of the OSS community to reveal real activity to the Japanese semiconductor community.
@LucaBeniniZhFe
Luca Benini
2 months
I promised @pulp_platform to never do superscalar (too much hard engineering vs. research 🙄). However, it was worth it. We need high-quality open-source cores, with industry support, to climb the performance ladder at high energy efficiency. The team did it! 👍
@ogawa_tter
OGAWA, Tadashi
2 months
=>."CVA6S+: A Superscalar #RISCV Core with High-Throughput Memory Architecture", Riccardo Tedeschi, . , @LucaBeniniZhFe, Davide Rossi, @pulp_platform, RISC-V Summit Europe, May 14, 2025 CVA6S+, arXiv, May 8
@LucaBeniniZhFe
Luca Benini
3 months
I have been waiting for quite some time to talk about FlatAttention. Now finally the genie is out of the bottle. 🚀 Architecture + dataflow which IMHO will dominate the next generation of scaled ML systems: powerful on-chip collectives and multicast, and fat compute tiles.
@pulp_platform
PULP Platform
3 months
To minimize main memory accesses by leveraging collective primitives integrated into the on-chip network fabric, here is "FlatAttention: Dataflow and Fabric Collectives Co-Optimization for Efficient Multi-Head Attention on Tile-Based Many-PE Accelerators"
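The kernel being co-optimized here is standard scaled dot-product attention. As a baseline reference only (this is the textbook per-head computation, not the paper's dataflow or collective mapping), the operand traffic a tile-based accelerator tries to keep on-chip looks like:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention for a single head.

    Q, K, V: (seq_len, head_dim) arrays. Every matrix here is an
    operand that would otherwise bounce through main memory.
    """
    d = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d)) @ V
```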
@LucaBeniniZhFe
Luca Benini
3 months
Here it is, on arXiv!
@WWVY
Hardware Architecture Papers
3 months
How to keep pushing ML accelerator performance? Know your rooflines!
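The roofline model referenced above caps attainable throughput at the minimum of peak compute and memory bandwidth times arithmetic intensity. A minimal sketch, with hypothetical machine numbers chosen only for illustration:

```python
def roofline(peak_flops: float, mem_bw: float, intensity: float) -> float:
    """Attainable performance under the roofline model.

    peak_flops: peak compute in FLOP/s
    mem_bw:     memory bandwidth in B/s
    intensity:  arithmetic intensity of the kernel in FLOP/B
    """
    return min(peak_flops, mem_bw * intensity)

# Hypothetical accelerator: 100 TFLOP/s peak, 2 TB/s memory bandwidth.
peak, bw = 100e12, 2e12
low  = roofline(peak, bw, 10)   # bandwidth-bound region
high = roofline(peak, bw, 100)  # compute-bound region
```

Kernels left of the ridge point (intensity < peak/bw, here 50 FLOP/B) are bandwidth-bound; pushing accelerator performance means either raising the roofs or raising the kernel's intensity, e.g. with better dataflow.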