We are excited to share the code & model for Self-supervised Correspondence Flow (BMVC 2019 Oral) @bmvc2019: state-of-the-art performance on video segmentation and pose tracking.
@Oxford_VGG
Personal update:
After spending seven wonderful years at Oxford, I've decided to take on a new adventure: I'm joining Shanghai Jiao Tong University this year 🐯.
Tracking objects is among the first skills human infants learn, so surely it should be achievable without semantic understanding.
We present a SOTA self-supervised tracking approach: all you need is 10 minutes of raw video, with zero annotations required.
@Oxford_VGG
A tiny milestone in my academic journey.
I know these metrics do not carry much significance in today's academic landscape.
Nevertheless, they serve as a personal gauge, allowing me to assess the papers' impact and reflect on whether I've contributed something meaningful.
ICCV23 work on Open-vocabulary Object Segmentation with Diffusion Models
- we perform visual instruction tuning on a pre-trained diffusion model, to simultaneously generate images and open-vocabulary masks.
- it can create synthetic datasets for training discriminative models for free.
Can GPT-4V(vision) serve medical applications?
We present our recent efforts on assessing GPT-4V for multimodal medical diagnosis through case studies, covering 17 human body systems across 8 clinical imaging modalities, e.g., radiology and pathology.
🔥Report:
Just read Med-PaLM 2. The progress of LLMs in medical question answering is incredible! But I think multimodal medical question answering is quite far behind, so here I present:
PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering:
Happy to share our work, "Visual-Language Models for Efficient Video Understanding", at ECCV2022.
We benchmark on 10 different datasets across various tasks; it turns out that simply prompting CLIP already achieves comparable or SOTA results on many video tasks.
#ECCV2022
We are releasing the code and model for #VGGSound, a new large-scale audio-visual dataset collected via audio-visual correspondence.
Accessible via:
Code & model:
We investigate self-supervised learning of video correspondence flow. Done properly, self-supervised learning can be surprisingly powerful, closing the gap to supervised learning. We demonstrate state-of-the-art results on video segmentation.
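For intuition, here is a minimal sketch of the core mechanism behind this line of self-supervised correspondence work (names and shapes are illustrative assumptions, not the released code): labels are propagated from a reference frame to a target frame through a softmax affinity over learned frame features.

```python
import torch
import torch.nn.functional as F

def propagate_labels(feat_ref, feat_tgt, labels_ref, temperature=0.07):
    """Copy per-pixel labels from a reference frame to a target frame via a
    softmax affinity over feature similarities (illustrative sketch only).

    feat_ref, feat_tgt: (C, H, W) L2-normalised frame features
    labels_ref:         (K, H, W) K-channel labels on the reference frame
                        (quantised colours at train time, masks at test time)
    """
    C, H, W = feat_ref.shape
    f_ref = feat_ref.reshape(C, -1)                    # (C, HW)
    f_tgt = feat_tgt.reshape(C, -1)                    # (C, HW)
    affinity = F.softmax(f_tgt.t() @ f_ref / temperature, dim=1)  # (HW, HW)
    labels = labels_ref.reshape(labels_ref.shape[0], -1)          # (K, HW)
    return (labels @ affinity.t()).reshape(-1, H, W)   # (K, H, W) on target
```

The training signal is simply frame reconstruction: during training the propagated quantity is the target frame's (quantised) colour, so no annotation is needed; at inference the same affinity propagates segmentation masks or pose keypoints.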
We are presenting our new paper at the LUV2020 workshop today, 16:15-16:30.
MAST: A Memory-Augmented Self-Supervised Tracker,
by @LaiZihang, @erika_lu_, @Oxford_VGG.
A strong tracking model trained with no manual annotation.
Code:
#VGGatCVPR2020
It also won best paper at the CVPR RVSU Workshop.
TL;DR:
We propose a self-supervised learning approach for segmentation based on motion, i.e., the Gestalt principle of common fate.
It achieves performance comparable to strongly supervised methods on several popular benchmarks, e.g., DAVIS2016 and MoCA (camouflage detection).
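As a rough illustration of the common-fate idea (a minimal sketch under the assumption that each segment moves approximately as one unit; this is not the paper's exact objective), the predicted masks can be trained to explain the observed optical flow as piecewise-constant motion:

```python
import torch

def common_fate_loss(flow, masks, eps=1e-6):
    """Reconstruct the flow field as one constant motion per predicted mask
    and penalise the residual (illustrative common-fate objective).

    flow:  (B, 2, H, W) optical flow between consecutive frames
    masks: (B, K, H, W) soft segmentation masks (softmax over K channels)
    """
    B, K, H, W = masks.shape
    m = masks.reshape(B, K, -1)                            # (B, K, HW)
    f = flow.reshape(B, 2, -1)                             # (B, 2, HW)
    w = m / (m.sum(dim=2, keepdim=True) + eps)             # per-mask weights
    mean_flow = torch.einsum('bkn,bcn->bkc', w, f)         # (B, K, 2)
    recon = torch.einsum('bkn,bkc->bcn', m, mean_flow)     # (B, 2, HW)
    return ((recon - f) ** 2).mean()                       # residual penalty
```

Pixels that move together get grouped together, since assigning them to the same mask is the cheapest way to reconstruct the flow.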
Happy to share our paper "Self-supervised Tumor Segmentation with Sim2Real Adaptation", published in the IEEE Journal of Biomedical and Health Informatics.
The model enables zero-shot tumor segmentation via Sim2Real training, requiring zero/few annotations from physicians.
Sharing the work "PMC-CLIP: Contrastive Language-Image Pre-training using Biomedical Documents",
- A large-scale image-caption dataset collected from biomedical papers;
- A CLIP-style model that can be transferred to various downstream tasks with comparable or SOTA results.
I won't be able to attend CVPR due to visa reasons, though I applied over two months ago.
Found an interesting workshop, around 4 AM China time, and it's promised to NOT be recorded. Honestly, I don't understand the point. Have we decided to go CloseAI?
#CVPR2023
We present our new efforts on building generalist medical foundation models for radiology:
Arxiv:
Website:
Hope this can promote the development of medical foundation models! (1/5)
VGGFace2 is a large-scale face recognition dataset: over 9000 identities and 3M images, downloaded from Google Image Search, with large variations in pose, age, illumination, ethnicity and profession.
Dataset:
Github:
Our recent work initiating open-vocabulary video instance segmentation (ICCV23 Oral):
- we collect a large-vocabulary video instance segmentation dataset (LV-VIS), with over 1196 categories.
- we propose a Transformer-based architecture, OV2Seg, that proposes and segments objects through time.
We present our recent work on developing an open-source, multilingual language model for medicine, benefiting a wider, linguistically diverse audience across regions.
All code and models are available at
@huggingface
Models for perceptual understanding are developing rapidly, thanks to the SAM model. However, can such models infer visual attributes in an open-vocabulary setting?
Here, we develop a model for open-vocabulary object detection and attribute recognition.
#CVPR2023
Our recent work on AI4Medicine, visual-language representation learning in radiology, will be presented at ICCV2023.
MedKLIP: Medical Knowledge Enhanced Language-Image Pre-Training in Radiology:
Our contributions include (1/4):
Excited to share our new effort on large-scale, long-tailed disease diagnosis on radiology images. This is a more feasible playground for academic labs to explore sophisticated algorithms, since developing generalist foundation models is impractical for them due to computational costs.
The human visual system is amazing at many tasks; however, it is particularly weak at counting objects. In fact, one can only make a rapid, accurate and confident judgement if the number of items is below five.
We can augment this ability with CounTR:
I feel extremely disappointed when reading papers with an incomplete literature review.
To me, this should be the MOST important part, as it clearly shows you understand what has been done, what the remaining challenges are, and what your contribution is, instead of over-claiming.
Recent advances in AI, e.g., NLP and visual perception, have revealed the power of supervised training on massive data, e.g., ChatGPT and SAM.
From a product perspective, this is great. However, from a research view, the dream remains training models with zero/cheap annotations.
Check out our #BMVC2020 paper: "Inducing Predictive Uncertainty Estimation for Face Recognition"
@Oxford_VGG
A simple approach for estimating the predictive confidence for face recognition systems.
Q&A session Tuesday at 10:00-11:00 and 16:00-17:00 UK
InstaGen
Enhancing Object Detection by Training on Synthetic Dataset
paper page:
We introduce a novel paradigm to enhance the ability of object detectors, e.g., expanding categories or improving detection performance, by training on a synthetic dataset.
Yet another medical-related report:
- we fine-tune LLaMA on 4.8 million biomedical papers from PubMed; after several epochs, it already shows enhanced capabilities in the medical domain;
- the proposed model, PMC-LLaMA, achieves high performance on biomedical QA benchmarks.
I'd like to share the latest work from the group, accepted at #CVPR2024!
Grounded Question-Answering in Long Egocentric Videos, by Shangzhe Di.
It explores grounded question-answering in long, egocentric videos, enabling individuals to inquire about their past visual experiences.
I recently saw many Mamba models for medical segmentation, generally inserted at the bottleneck of a UNet, claiming to model long-term dependency at a resolution of 9x9 or 7x7...
There is such an odd atmosphere in AI: whenever something new comes out, people use it for EVERYTHING!!
🚀 Excited to introduce RadGenome-Chest CT, a comprehensive, large-scale, and fine-grained visual-language dataset for 3D CT scans.
It includes:
- Organ-level segmentation for 197 categories;
- 665K multi-granularity grounded reports;
- 1.3M grounded VQA pairs.
We have updated the PMC-LLaMA model:
Compared with the previous versions, we have:
(i) upscaled the model size to 13B;
(ii) added 30K medical books into the knowledge injection stage;
(iii) done instruction tuning on a large-scale dataset with 202M tokens.
New paper on open-vocabulary detection.
We aim to tackle two problems in existing work:
(1) lexical ambiguity, (2) visual granularity.
We demonstrate strong results, using text, visual, or both modalities to generate the classifier.
Please check the thread from @pranna for details.
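For a rough idea of what generating a classifier from text, visual, or both modalities can look like (a hypothetical sketch, not the paper's exact method; `build_classifier` and its inputs are illustrative):

```python
import torch
import torch.nn.functional as F

def build_classifier(text_embed=None, visual_embeds=None):
    """Form one open-vocabulary class vector from a text embedding, the mean
    of a few visual exemplar embeddings, or a fusion of both (sketch only).

    text_embed:    (D,) embedding of the class name, or None
    visual_embeds: (N, D) embeddings of N image exemplars, or None
    """
    parts = []
    if text_embed is not None:
        parts.append(F.normalize(text_embed, dim=0))
    if visual_embeds is not None:
        parts.append(F.normalize(visual_embeds.mean(dim=0), dim=0))
    assert parts, "need at least one modality"
    # average the per-modality vectors, then renormalise to unit length
    return F.normalize(torch.stack(parts).mean(dim=0), dim=0)
```

The detector then scores region features against these class vectors, so exemplar images can disambiguate what a class name alone cannot.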
[1/8] In Hawaii(!) for #ICML2023 to present "Multi-Modal Classifiers for Open-Vocabulary Object Detection"
Joint work with @WeidiXie and Andrew Zisserman
🕸️🏗️
📑
🖥️
Poster #413, Thursday 1:30-3pm, Exhibit Hall 1
Recently, I've been quite interested in deploying AI tools for medical applications. Here is a series of our recent works; the list is growing continuously, so please stay tuned.
This originally aimed to generate comic books for my daughter with diffusion models, i.e., continuous image sequences with consistent characters, storylines, etc. Well, I guess now we have #SORA 🤣.....
Still, it is good to get it accepted at #CVPR2024, congrats to the team!!
Thrilled to share that our exciting project, [StoryGen] (with @liu_chang666 and @WeidiXie), has been accepted by #CVPR2024!!!🥳
Check out our paper, code and dataset at:
Title: Intelligent Grimm - Open-ended Visual Storytelling via Latent Diffusion Models
Would like to share a fun piece of work, which has proven effective at entertaining my daughter!
It can generate coherent image sequences based on stories you provide, or ones written by GPTs!
Intelligent Grimm - Open-ended Visual Storytelling via Latent Diffusion Models.
Our work on self-supervised learning for the problem of geometric alignment from noisy annotations. Geometric consistency is applied, that is to say, every perturbed label must transform back to the unique ground-truth position.
@Oxford_VGG
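A minimal sketch of such a geometric-consistency (equivariance) constraint, assuming helper functions `warp` and `inverse_warp` that apply a known spatial transform to images and points (all names here are illustrative):

```python
import torch

def geometric_consistency_loss(model, image, warp, inverse_warp):
    """Predictions on a perturbed image, mapped back through the inverse
    warp, must agree with predictions on the original image (sketch only).
    """
    pts = model(image)                      # (N, 2) predicted landmarks
    pts_perturbed = model(warp(image))      # predictions on the warped image
    # every perturbed prediction must transform back to the same position
    return ((inverse_warp(pts_perturbed) - pts) ** 2).mean()
```

Because the constraint holds for arbitrary perturbations, it pulls noisy annotations toward the single position that is consistent under all of them.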
State-of-the-art Speaker Recognition with VLAD and GhostVLAD aggregation.
"Utterance-level Aggregation For Speaker Recognition In The Wild",
to appear at @icassp2019 as an Oral Presentation.
Project page (models & code):
We have updated the manuscript:
- more recent models as baselines,
- better model performance,
- more comprehensive evaluation, including both machine and human scoring.
Arxiv:
Website:
We have made the first release of SAT-Nano, with model and inference code.
SAT-Ultra will be released soon as well. Stay tuned!
Webpage:
Code & Model:
I'm presenting our paper with @gorkaydemir and @WeidiXie tomorrow at Poster Session 2 at #NeurIPS2023:
With SOLV, "Self-supervised Object-Centric Learning for Videos", we can discover multiple objects in real-world video sequences without using additional modalities like depth.
📢 Our #PMC-VQA dataset:
is now live on @huggingface datasets 🤗, and officially benchmarked on @paperswithcode
Looking forward to the progress in this domain!!!
I will present the recent work @Oxford_VGG with @NagraniArsha, Joon Son Chung, and Andrew Zisserman on Speaker Recognition @icassp2019 on Thursday (May 16), 09:20-09:40.
Project page (models & code):
Our paper on the medical foundation model, PMC-CLIP, has been accepted by MICCAI2023. Congratulations to all co-authors.😆
In PMC-CLIP, we collected 1.6M medical image-caption pairs. All meta-reviewers ranked it first; thanks for their recognition. 👏
📢 Thrilled to share that our paper on "Synchformer: Efficient Synchronization from Sparse Cues" got accepted at #ICASSP24!
🎉 Huge shoutout to the amazing team: @WeidiXie, Esa Rahtu, and Andrew Zisserman!
Code:
arXiv:
Using motion to train a segmentation model, again via the common-fate principle, similar to the motion grouping paper. Simple simulation of flow fields turns out to generalise extremely well.
Existing super-resolution (SR) models are often specialized for one scale, limiting their use in practice.
We develop a general plugin module that can be injected into any existing SR model to augment it with arbitrary-scale super-resolution.
Webpage:
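One plausible shape for such a plugin, sketched here as an implicit, coordinate-conditioned decoder attached to a fixed-scale SR backbone (an assumption for illustration, not the paper's exact module):

```python
import torch
import torch.nn as nn

class ArbitraryScaleHead(nn.Module):
    """Decode RGB at arbitrary continuous coordinates from the feature map
    of any SR backbone, making the output resolution a free choice."""

    def __init__(self, feat_dim=64, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + 2, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),
        )

    def forward(self, feats, coords):
        # feats:  (B, C, h, w) features from the SR backbone
        # coords: (B, N, 2) query coordinates in [-1, 1], any resolution
        sampled = nn.functional.grid_sample(
            feats, coords.unsqueeze(1), align_corners=False
        )                                              # (B, C, 1, N)
        sampled = sampled.squeeze(2).transpose(1, 2)   # (B, N, C)
        return self.mlp(torch.cat([sampled, coords], dim=-1))  # (B, N, 3)
```

Since the query grid is continuous, the same trained head can render x2, x3.7, or any other magnification.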
Recent work on multi-view cardiac MR detection, orientation, and segmentation has been published in Medical Image Analysis: "Ω-Net (Omega-Net): Fully automatic, multi-view cardiac MR detection, orientation, and segmentation with deep neural networks".
Multi-Modal Classifiers for Open-Vocabulary Object Detection
paper page:
The goal of this paper is open-vocabulary object detection (OVOD): building a model that can detect objects beyond the set of categories seen at training, thus enabling the
Following my previous tweet,
- Our team has decided to run a thorough comparison between nnUNet and Mamba-based models for medical segmentation.
- We will conduct experiments on over 60 public segmentation datasets and provide a complete comparison. (1/n)
I think the key problem is that this community is not tolerant of failure!!
So whenever something new comes out, people can always do hyper-parameter tuning or compare against weak baselines to show its effectiveness.
This misleads the entire community into a huge waste of resources.
New paper in Nature Communications, where we investigate knowledge-enhanced multimodal representation learning with chest X-rays and radiology reports.
Our past conferences wouldn't have been possible without our many reviewers. If you have at least 2 papers in top peer-reviewed conferences or journals, with at least one in an ML venue (e.g., NeurIPS, ICML, ICLR), we'd be very grateful if you reviewed for
#NeurIPS2023
Don't miss the "Object localization for free: Going beyond self-supervised learning" @CVPR tutorial (by @oriane_simeoni, @WeidiXie, @tkipf, P. Pérez) for in-depth coverage of different angles on object localization with no human supervision.
#cvpr2023
When trained at sufficient scale, self-supervised learning has exhibited a notable ability to solve a wide range of visual-language tasks. It turns out that we can adapt pre-trained foundation models to open-vocabulary semantic segmentation by training very few parameters.
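Schematically, such parameter-efficient adaptation can look like the following (a minimal sketch with illustrative names; the frozen backbone and the shape of its features are assumptions): freeze the visual-language backbone and train only a light mask decoder whose outputs are classified against text embeddings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SegAdapter(nn.Module):
    """Frozen visual-language backbone + small trainable decoder that maps
    patch features to per-pixel logits over an open vocabulary (sketch)."""

    def __init__(self, backbone, dim=512):
        super().__init__()
        self.backbone = backbone.eval()
        for p in self.backbone.parameters():
            p.requires_grad = False          # only the decoder is trained
        self.decoder = nn.Sequential(
            nn.Conv2d(dim, dim, 3, padding=1), nn.ReLU(),
            nn.Conv2d(dim, dim, 1),
        )

    def forward(self, image, text_embeds):
        # backbone(image): (B, D, h, w) patch features; text_embeds: (K, D)
        with torch.no_grad():
            feats = self.backbone(image)
        feats = F.normalize(self.decoder(feats), dim=1)
        text = F.normalize(text_embeds, dim=1)
        return torch.einsum('bdhw,kd->bkhw', feats, text)  # per-pixel logits
```

Because the class set only enters through the text embeddings, new categories can be added at inference without retraining anything.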
A Large-scale Dataset for Audio-Language Representation Learning
paper page:
The AI community has made significant strides in developing powerful foundation models, driven by large-scale multimodal datasets. However, in the audio representation learning
- a strong generative model for medVQA
- a scalable pipeline for collecting a large-scale dataset
- a significantly more challenging benchmark than all existing ones, on which strong visual-language models fail miserably, e.g., BLIP2 and Open-Flamingo.
Looking forward to seeing progress on medVQA!
Our final model outperforms ChatGPT and LLaMA-2 on multiple medical QA benchmarks!
Open-source materials:
Code:
Model:
DATA:
Hope this can promote the development of open-source LLMs for healthcare.
"Learning Open-vocabulary Semantic Segmentation Models From Natural Language Supervision."
#CVPR2023
We aim to train an open-vocabulary segmentation model from image captions, explicitly exploiting visual invariance across images.
Project Page:
Collaboration with Chen Ju, @TengdaHan, @KunhaoZ, Ya Zhang.
Project page:
GitHub:
We are gradually releasing the code for reproducing all the benchmark results.
I got asked for the code of an old paper; it may be too late 😂, but in case someone else is interested:
"Multicolumn Networks for Face Recognition", BMVC2018.
The idea is to compute a set representation by aggregating images according to their importance.
- We present a novel architecture that can process an arbitrary number of input scans from various imaging modalities, trained by leveraging rich domain knowledge.
This is a work whose value was questioned by the reviewers.......
To be honest, I'm not sure whether the idea will be adopted by the community, but I think it's COOL!
Webpage:
Arxiv:
Code & Model:
Our @CVPR tutorial about "object localization for free" is today in room East 11, starting at 8:30am PDT (with @WeidiXie, @tkipf and P. Pérez). Come and join us if you want to hear/discuss different successful approaches to object localization with no annotation!
Overwhelmed by the progress of human-object interaction (HOI) detection? Ever wondered why one HOI model performs better than another? Check out our recent work in diagnosing human-object interaction detectors.
Paper:
Code:
🛢️ 1/N
1) multimodal biomedical dataset:
PMC-OA, 1.6M image-caption pairs collected from PubMed Central, covering diverse modalities and diseases, with the majority of image-caption samples aligned at a finer-grained level, i.e., subfigure and subcaption.
@gdb
Right. I think this is exactly what we would expect an AI4Health model to have: some 'emergent abilities' to discover hidden factors behind the disease itself, being able to make a diagnosis by combining all information sources, with the ability of top-level clinicians.
Introducing Pika 1.0, the idea-to-video platform that brings your creativity to life.
Create and edit your videos with AI.
Rolling out to new users on web and discord, starting today. Sign up at
We present MEDITRON, a set of new open-access #LLMs (70B & 7B) adapted to the medical domain, achieving new SoTA open-source performance on common medical benchmarks, outperforming #GPT-3.5 and Med-PaLM, and coming within 5% of #GPT4.
Find out how we did this ⬇️
- We build a large-scale diagnostic dataset that encompasses 5568 disorders linked to 930 unique ICD-10-CM codes, containing 39,026 cases (192,675 scans).