The model is pretrained on a mixture of publicly available datasets and achieves superior zero-shot performance on a range of multimodal comprehension and generation benchmarks. It can be ...