As AI models grow more capable, chat AIs that handle everyday conversation smoothly, such as ChatGPT, Copilot, and Gemini, keep appearing one after another, yet it is hard for ordinary users to judge which one is actually the strongest. The Wall Street Journal therefore ran a test evaluating how well five chat AIs respond to everyday questions and published the results.
The Great AI Chatbot Challenge: ChatGPT vs. Gemini vs. Copilot vs. Perplexity vs. Claude – WSJ
https://www.wsj.com/tech/personal-tech/ai-chatbots-chatgpt-gemini-copilot-perplexity-claude-f9e40d26
When AI companies and researchers promote the performance of their models, they usually cite scores from benchmark tools. A good benchmark score, however, does not guarantee that an AI can accurately answer the kinds of questions people ask in everyday conversation. The Wall Street Journal therefore evaluated the responses of five chat AIs, ‘ChatGPT,’ ‘Copilot,’ ‘Gemini,’ ‘Claude,’ and ‘Perplexity,’ to questions likely to come up in daily life.
The questions were written in collaboration with Wall Street Journal editors and columnists and covered categories such as ‘health,’ ‘finance,’ and ‘cooking.’ The cooking category, for example, included: ‘Can you bake a chocolate cake without flour, gluten, dairy, nuts, or eggs? If so, please give me the recipe.’ Each question was put to the five chat AIs, and the editors and columnists rated the responses for ‘accuracy,’ ‘usefulness,’ and ‘overall quality’ without knowing which AI had produced them. Paid versions of the chat AIs were used for the test: ChatGPT ran on ‘GPT-4o’ and Gemini on ‘Gemini 1.5 Pro.’
The test results are shown below. Performance varied by question category, but Perplexity took first place in the overall evaluation, despite having the slowest response time of the five chat AIs. On coding questions, there was no significant difference between the five.
| Category | 1st | 2nd | 3rd | 4th | 5th |
| --- | --- | --- | --- | --- | --- |
| Health | ChatGPT | Gemini | Perplexity | Claude | Copilot |
| Finance | Gemini | Claude | Perplexity | ChatGPT | Copilot |
| Cooking | ChatGPT | Gemini | Perplexity | Claude | Copilot |
| Work-related writing | Claude | Perplexity | Gemini | ChatGPT | Copilot |
| Creative writing | Copilot | Claude | Perplexity | Gemini | ChatGPT |
| Summarization | Perplexity | Copilot | ChatGPT | Claude | Gemini |
| Current affairs | Perplexity | ChatGPT | Copilot | Claude | Gemini |
| Coding | Perplexity | ChatGPT | Gemini | Claude | Copilot |
| Response time | ChatGPT | Gemini | Copilot | Claude | Perplexity |
| Overall rating | Perplexity | ChatGPT | Gemini | Claude | Copilot |
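The Wall Street Journal does not explain how the individual category results were combined into the overall rating. Purely as an illustration, the following Python sketch averages each chatbot's rank across the eight question categories in the table above (response time excluded); the mean-rank method is an assumption made here, not the Journal's actual scoring.

```python
# Illustrative sketch only: average each chatbot's per-category rank from the table above.
# The WSJ does not disclose its aggregation method; mean rank is just one simple way
# to combine the category results (response time is left out).

rankings = {
    "Health":               ["ChatGPT", "Gemini", "Perplexity", "Claude", "Copilot"],
    "Finance":              ["Gemini", "Claude", "Perplexity", "ChatGPT", "Copilot"],
    "Cooking":              ["ChatGPT", "Gemini", "Perplexity", "Claude", "Copilot"],
    "Work-related writing": ["Claude", "Perplexity", "Gemini", "ChatGPT", "Copilot"],
    "Creative writing":     ["Copilot", "Claude", "Perplexity", "Gemini", "ChatGPT"],
    "Summarization":        ["Perplexity", "Copilot", "ChatGPT", "Claude", "Gemini"],
    "Current affairs":      ["Perplexity", "ChatGPT", "Copilot", "Claude", "Gemini"],
    "Coding":               ["Perplexity", "ChatGPT", "Gemini", "Claude", "Copilot"],
}

# Convert each category's ordering into rank numbers (1 = best) and sum them per bot.
totals = {}
for order in rankings.values():
    for rank, bot in enumerate(order, start=1):
        totals[bot] = totals.get(bot, 0) + rank

# Average rank across categories; lower is better.
mean_rank = {bot: total / len(rankings) for bot, total in totals.items()}
for bot, score in sorted(mean_rank.items(), key=lambda kv: kv[1]):
    print(f"{bot}: average rank {score:.2f}")
```

Run this way, the simple mean-rank ordering happens to coincide with the published overall rating, with Gemini and Claude effectively tied for third.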
Microsoft told the Wall Street Journal that it plans to integrate GPT-4o into Copilot in the near future, so Copilot’s performance is expected to improve. Note also that the Wall Street Journal’s test was conducted in English only.
There are other comprehensive analyses of AI performance. Stanford University, for example, has published an annual report analyzing the performance and impact of AI since 2017. The contents of Stanford University’s AI Index Report 2024 are covered in the following article.
Source: The results of evaluating the performance of ‘ChatGPT’, ‘Copilot’, ‘Gemini’, ‘Claude’ and ‘Perplexity’