The AI that powers ChatGPT appears to be performing less well at mathematical problems than it was just a few months ago.
The AI powering ChatGPT may provide completely different answers to the same mathematical problems over time. Those findings from recent experiments have fuelled an ongoing debate about whether the AI chatbot’s performance is getting worse – and have spurred the firm behind it, OpenAI, to reassure customers that applications built on ChatGPT will not continually break.
Last week, a study emerged from a collaboration between Stanford University and UC Berkeley which was published in the ArXiv preprint archive and highlighted noticeable differences in the responses of GPT-4 and its predecessor, GPT-3.5, over a span of a few months since the former’s March 13 debut.
One of the most striking findings was GPT-4’s reduced accuracy in answering complex mathematical questions. For instance, while the model demonstrated a high success rate (97.6 percent) in answering queries about large-scale prime numbers in March, its accuracy in answering that same prompt correctly plummeted to a mere 2.4 percent in June.
The study also pointed out that, while older versions of the bot offered detailed explanations for their answers, the latest iterations seemed more reticent, often forgoing step-by-step solutions even when explicitly prompted. Interestingly, during the same period, GPT-3.5 showed improved capabilities in addressing basic math problems, though it still struggled with more intricate code generation tasks.
The research team also delved into GPT-4’s coding capabilities, which appeared to have regressed. When the model was tested using problems from the online learning platform LeetCode, only 10 percent of the generated code adhered to the platform’s guidelines. This marked a significant drop from a 50 percent success rate observed in March.
OpenAI’s approach to updating and fine-tuning its models has always been somewhat enigmatic, leaving users and researchers to speculate about the changes made behind the scenes. With global concerns and ongoing legislation in the works surrounding AI regulation and its ethical use, transparency is increasingly on the minds of government regulators and even everyday users of the AI-based tech products that are emerging ever-more frequently.
While the model’s responses seemed to lack the depth and rationale observed in earlier versions, the recent study did note some positive developments: GPT-4 demonstrated enhanced resistance to certain types of attacks and showed a reduced propensity to respond to harmful prompts.
The AI powering ChatGPT may provide completely different answers to the same mathematical problems over time. Those findings from recent experiments have fuelled an ongoing debate about whether the AI chatbot’s performance is getting worse – and have spurred the firm behind it, OpenAI, to reassure customers that applications built on ChatGPT will not continually break.
Last week, a study emerged from a collaboration between Stanford University and UC Berkeley which was published in the ArXiv preprint archive and highlighted noticeable differences in the responses of GPT-4 and its predecessor, GPT-3.5, over a span of a few months since the former’s March 13 debut.
One of the most striking findings was GPT-4’s reduced accuracy in answering complex mathematical questions. For instance, while the model demonstrated a high success rate (97.6 percent) in answering queries about large-scale prime numbers in March, its accuracy in answering that same prompt correctly plummeted to a mere 2.4 percent in June.
The study also pointed out that, while older versions of the bot offered detailed explanations for their answers, the latest iterations seemed more reticent, often forgoing step-by-step solutions even when explicitly prompted. Interestingly, during the same period, GPT-3.5 showed improved capabilities in addressing basic math problems, though it still struggled with more intricate code generation tasks.
The research team also delved into GPT-4’s coding capabilities, which appeared to have regressed. When the model was tested using problems from the online learning platform LeetCode, only 10 percent of the generated code adhered to the platform’s guidelines. This marked a significant drop from a 50 percent success rate observed in March.
OpenAI’s approach to updating and fine-tuning its models has always been somewhat enigmatic, leaving users and researchers to speculate about the changes made behind the scenes. With global concerns and ongoing legislation in the works surrounding AI regulation and its ethical use, transparency is increasingly on the minds of government regulators and even everyday users of the AI-based tech products that are emerging ever-more frequently.
While the model’s responses seemed to lack the depth and rationale observed in earlier versions, the recent study did note some positive developments: GPT-4 demonstrated enhanced resistance to certain types of attacks and showed a reduced propensity to respond to harmful prompts.