On Monday, OpenAI announced a significant update to ChatGPT that enables its GPT-3.5 and GPT-4 AI models to analyze images and react to them as part of a text conversation. Additionally, the ChatGPT mobile app will add speech synthesis options that, when paired with its existing speech recognition features, will enable fully verbal conversations with the AI assistant.
OpenAI plans to roll out these features in ChatGPT to Plus and Enterprise subscribers over the next two weeks. The company also notes that speech synthesis will be available on iOS and Android only, while image recognition will be available on both the web interface and the mobile apps.
The new image recognition feature in ChatGPT lets users upload one or more images for conversation, using either the GPT-3.5 or GPT-4 models. OpenAI claims the feature can be used for a variety of everyday applications, such as figuring out what’s for dinner by taking pictures of the fridge and pantry.
Users can also use the device’s touch screen to circle parts of the image that they would like ChatGPT to concentrate on. OpenAI provides a promotional video on its site that illustrates a hypothetical exchange with ChatGPT, in which a user asks how to raise a bicycle seat and provides photos of the bike, its instruction manual, and the user’s toolbox. ChatGPT then advises the user on how to complete the process.
It is important to note that we have not tested this feature ourselves, so its real-world effectiveness is unknown.
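While the ChatGPT feature itself is a consumer interface rather than code, OpenAI’s developer API accepts the same kind of mixed image-and-text input. The sketch below is a hedged illustration using the official openai Python package; the "gpt-4-vision-preview" model name and the image path are assumptions for illustration, not details from OpenAI’s announcement.

```python
# Hedged sketch: sending an image plus a text prompt through OpenAI's
# developer API. Model name and file path are illustrative assumptions.
import base64

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("fridge.jpg", "rb") as f:  # hypothetical photo of fridge contents
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4-vision-preview",  # assumed vision-capable model name
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What could I make for dinner with this?"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```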
OpenAI has not released technical details of how GPT-4 or its multimodal version, GPT-4V, operate under the hood. However, based on published AI research, including work from OpenAI’s partner Microsoft, multimodal AI models typically transform text and images into a shared embedding space, which enables them to process various types of data through the same neural network. OpenAI may use CLIP to bridge the gap between visual and text data by aligning image and text representations in the same latent space.
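As an illustration of that alignment technique, and not of OpenAI’s actual GPT-4V pipeline, here is a minimal sketch of CLIP-style image/text matching using the open source Hugging Face transformers library. The image file is a hypothetical example.

```python
# Minimal sketch of CLIP-style image/text alignment with Hugging Face
# transformers. Illustrative only; not OpenAI's GPT-4V implementation.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("fridge.jpg")  # hypothetical photo
texts = ["a photo of a full refrigerator", "a photo of a bicycle seat"]

# Both modalities are encoded into the same latent space...
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# ...so image/text similarity can be scored directly.
probs = outputs.logits_per_image.softmax(dim=1)
print(dict(zip(texts, probs[0].tolist())))
```

The key point the sketch demonstrates is that once images and text land in a common latent space, a single network can compare and reason over both without separate pipelines.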
On the audio front, ChatGPT’s new voice synthesis feature reportedly allows for back-and-forth spoken conversations with ChatGPT, driven by a new text-to-speech model. OpenAI says that users can engage this feature by opting into voice conversations in the app’s settings and then selecting from five different synthetic voices.
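OpenAI has not published details of the app’s synthesis pipeline, but its developer platform offers a comparable text-to-speech endpoint that can illustrate the idea. The sketch below uses the official openai Python package; the "tts-1" model and "alloy" voice are developer-API names assumed for illustration, distinct from the five voices offered inside the ChatGPT app.

```python
# Hedged sketch of text-to-speech via OpenAI's developer API. The model and
# voice names are developer-platform assumptions, not the ChatGPT app voices.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

speech = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input="You'll want a 4mm hex key to loosen the seat clamp.",
)

with open("reply.mp3", "wb") as f:
    f.write(speech.content)  # raw MP3 bytes returned by the endpoint
```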
OpenAI’s Whisper, an open-source speech recognition system, will continue to handle the transcription of user speech input. Whisper has been integrated with the ChatGPT iOS app since its launch in May. OpenAI released the similarly capable ChatGPT Android app in July.
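Because Whisper is open source, the transcription half of the pipeline can be reproduced locally. Here is a minimal sketch using the openai-whisper package (pip install openai-whisper, which also requires ffmpeg); the audio file path is a hypothetical example.

```python
# Minimal sketch: local speech-to-text with OpenAI's open source Whisper.
import whisper

model = whisper.load_model("base")         # smaller checkpoints trade accuracy for speed
result = model.transcribe("question.mp3")  # hypothetical recording of a spoken prompt
print(result["text"])                      # transcript that could be sent on to ChatGPT
```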