GPT-4 with Vision, known as GPT-4V, allows users to instruct the model to analyze user-provided images. The integration of image analysis into large language models (LLMs) represents a significant advance that is now widely accessible. Many researchers consider the addition of modalities such as image input to be a key frontier in artificial intelligence research and development. Multimodal LLMs can extend the capabilities of language-only systems with new interfaces and functionalities, allowing them to tackle new tasks and offer novel experiences to their users.
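To make the idea of instructing the model with an image concrete, the minimal sketch below shows one way such a request could look using the OpenAI Python SDK's chat-completions interface. The model identifier, image URL, and prompt are illustrative assumptions, not details taken from the system card.

```python
# Minimal sketch (not from the system card): sending an image plus a text
# instruction to a vision-capable GPT-4 model via the OpenAI Python SDK.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4-vision-preview",  # assumed vision-capable model identifier
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe what is shown in this image."},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/sample.jpg"},  # placeholder image
                },
            ],
        }
    ],
    max_tokens=300,
)
print(response.choices[0].message.content)
```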
GPT-4V, like GPT-4, completed training in 2022, with early access becoming available in March 2023. The training process for GPT-4V mirrored that of GPT-4: the model was first trained to predict the next word in text using a large dataset of text and images drawn from the Internet and licensed sources. It was then fine-tuned with reinforcement learning from human feedback (RLHF) so that its outputs better match human preferences.
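To illustrate the next-word-prediction objective mentioned above, the toy sketch below computes the standard cross-entropy loss in which each position in a token sequence is trained to predict the token that follows it. It uses PyTorch with stand-in tensors and has nothing to do with GPT-4V's actual training code.

```python
# Toy illustration of the next-word (next-token) prediction objective.
import torch
import torch.nn.functional as F

vocab_size = 8
# Hypothetical model output: logits over the vocabulary for each position.
logits = torch.randn(1, 5, vocab_size)      # (batch, sequence, vocab)
tokens = torch.tensor([[2, 7, 1, 4, 3]])    # toy token ids

# Each position predicts the token that follows it, so predictions at
# positions 0..3 are scored against the tokens at positions 1..4.
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size),  # predictions
    tokens[:, 1:].reshape(-1),               # next-token targets
)
print(loss.item())
```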
Large multimodal models like GPT-4V combine text and vision capabilities, which introduces unique limitations and risks. GPT-4V inherits the strengths and weaknesses of each modality while also gaining new capabilities from the fusion of text and vision and from the intelligence that comes with its large scale. To gain a comprehensive understanding of the GPT-4V system, OpenAI combined qualitative and quantitative assessments: internal experimentation was used to rigorously probe the system's capabilities, and a red team of external experts provided insights from outside perspectives.
The system card provides an overview of how OpenAI prepared GPT-4V's vision capabilities for deployment. It covers the early-access period for small-scale users, the safety lessons learned during this phase, the evaluations used to gauge the model's readiness for deployment, feedback from expert red-team evaluators, and the precautions OpenAI took before the model's wider release.
The image above shows examples of GPT-4V's unreliable performance for medical purposes. GPT-4V's capabilities present both exciting prospects and new challenges. The approach taken to prepare for its deployment focused on assessing and managing the risks associated with images of people, including concerns such as person identification and the potential for biased outputs derived from such images, which can lead to harms of representation or attribution.
Additionally, the model’s significant advances in high-risk areas, such as medical and scientific reasoning, were carefully examined. Researchers are still working on multiple fronts. Moving forward, it will be essential to continue refining and expanding GPT-4V's capabilities, paving the way for further advances in AI-driven multimodal systems.
Check out the Paper. All credit for this research goes to the researchers of this project. Also, don't forget to join our ML SubReddit of more than 30,000 members, our Facebook community of more than 40,000 members, our Discord Channel, and our email newsletter, where we share the latest AI research news, interesting AI projects, and much more.
If you like our work, you will love our newsletter.
Janhavi Lande is a graduate in engineering physics from IIT Guwahati, batch of 2023. She is an incoming data scientist and has been working in ML/AI research for the past two years. Above all, she is fascinated by this constantly evolving world and by humanity's constant drive to keep pace with it. In her free time, she likes to travel, read, and write poems.