Neil Houlsby

1 year

PaLI-X continued the joint vision / language scaling approach from PaLI (using ViT-22B and UL2-32B), with an updated pre-training mix. Aside from good benchmark numbers, a few results I found most intriguing…

PaLI-X: On Scaling up a Multilingual Vision and Language Model

We present the training recipe and results of scaling up PaLI-X, a multilingual vision and language model, both in terms of size of the components and the breadth of its training task mixture. Our...

arxiv.org

Neil Houlsby

@neilhoulsby

1 year

In-context / few-shot captioning worked in diverse scenarios that were absent (or present only in minute quantities) in the training mix. In this example, PaLI-X must infer the location from the prompt, respond in the appropriate language, and combine this with "world knowledge".

Neil Houlsby

@neilhoulsby

1 year

For object detection, performance on common objects was reasonable, but performance on rare objects was, relatively, much better. Concepts not explicit in the object detection (OD) mix were also handled: left vs. right, OCR, singular vs. plural, and multilingual queries. An encouraging sign of positive transfer from pre-training.

Neil Houlsby

@neilhoulsby

1 year

Counting was not an explicit pre-training task, but performance really started to take off around 17B+ parameters, especially on the "complex" variety (counting some described subset of objects).

Neil Houlsby

@neilhoulsby

1 year

Finally, multitask finetuning, without fancy task-specific prompting, was almost on par with tuned task-specific finetuning. Quite encouraging for potential future work on massively-multitask finetuning of V&L models.

Tesla God

@8FNPath

1 year

@neilhoulsby how good is it at Pali?
