OpenAI, the AI research and deployment company, has been in the limelight for its chatbot ChatGPT and its advanced large language model GPT-4. While the data used to train such models has been the subject of ongoing debate, a new report has revealed details about the company’s use of YouTube data. According to The New York Times, OpenAI transcribed over one million hours of YouTube videos to train GPT-4.
According to the report, as OpenAI ran out of data to train the LLM, it developed Whisper, a speech recognition tool that could transcribe videos. This opened up a new supply of conversational text for training the AI system. The company later reportedly trained GPT-4 on the transcribed YouTube data. The team involved included OpenAI president Greg Brockman, who helped streamline the videos, the report added, citing sources.
Beyond YouTube, the company reportedly used data from GitHub, Quizlet and a database of chess moves. As its demand for data grew, the company planned to transcribe podcasts, videos and audiobooks, buy start-ups holding large amounts of digital assets, and create data from scratch.
Google prohibits the use of YouTube videos for “independent” applications and accessing them via “any automated means (such as robots, botnets or scrapers)”. Despite this, according to the sources cited in the report, OpenAI believed that training AI on the videos constituted fair use. As per the report, the ChatGPT maker said it used numerous data sources for training.
While some Google staff were aware of the activity, Google did not intervene because it had also used YouTube video transcripts to train its own AI models, sources said. As per the report, had Google called out OpenAI’s activities, it could have sparked a backlash over its own use of the videos, which may likewise have violated creators’ copyright.
Google spokesperson Matt Bryant said that Google was not aware of any such practice by OpenAI and that it prohibits “unauthorised scraping or downloading of YouTube content”. The company takes legal action whenever required in case of a violation, the spokesperson added.
Source: OpenAI Trained GPT-4 Using Transcripts Of One Million Hours Worth Of YouTube Videos