<p>Can current AI models match human creativity, critical thinking, and problem-solving capabilities? Not long ago, OpenAI’s GPT-4 scored 94 per cent on the SAT and 99 per cent on the GRE verbal section, leading many to believe that AI was rapidly approaching human intelligence. However, a recent test has challenged this assumption.</p><p>Humanity’s Last Exam (HLE), conducted by the San Francisco-based Center for AI Safety (CAIS) and Scale AI, was designed specifically to test the deep reasoning and problem-solving capabilities of Large Language Models (LLMs) rather than mere memorisation or pattern recognition.</p><p>Unlike multiple-choice exams, the HLE required AI models to solve open-ended, multi-step problems across mathematics, physics, law, logic, and even ancient scripts. <a href="https://www.prnewswire.com/news-releases/cais-and-scale-ai-unveil-results-of-humanitys-last-exam-a-groundbreaking-new-benchmark-302358108.html">The results</a>, which were released in late January, were shocking: most leading AI models, including GPT-4o and Google’s Gemini, scored between 3 per cent and 6 per cent. The best performer, China’s DeepSeek-R1, managed 9.4 per cent — still far below human performance. Adding to the concerns, the models also expressed high confidence in incorrect answers, a phenomenon known as hallucination.</p><p><strong>Hardest AI reasoning test ever designed</strong></p><p>What led to this drastic drop in these models’ scores on the HLE? The same models that scored in the high nineties on exams like the GRE and SAT floundered, with error rates as high as 90 per cent. This was because, unlike standard multiple-choice tests, the HLE presented open-ended, multi-step problems that required AI models to exhibit genuine reasoning ability rather than rely on pattern recognition, memorised answers, or web scraping.</p><p><strong>Why DeepSeek-R1 beat the rest</strong></p><p>DeepSeek-R1, AI’s poster boy of the moment, took many by surprise with its performance in the HLE. It outscored top models such as GPT-4o, Anthropic’s Claude 3.5, and Gemini, particularly on text-only reasoning tasks. The newly launched LLM was likely trained on more structured, high-quality datasets that emphasised logical reasoning and sequential decision-making.</p><p>It also exhibited superior calibration — meaning it could better assess the certainty of its own answers, reducing the chances of overconfident errors. Lastly, its fine-tuning appeared more optimised for expert-level problem-solving, setting it apart from competitors designed for broader, general-purpose tasks.</p>
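<p>To make 'calibration' concrete: a well-calibrated model that reports 90 per cent confidence should be right about 90 per cent of the time. The short Python sketch below computes the expected calibration error, the kind of metric benchmark write-ups cite; it is an illustrative textbook version, not HLE’s published scoring code, and the example data is invented.</p><pre><code># Illustrative sketch: expected calibration error (ECE).
# Assumes each answer comes with the model's stated confidence
# (0 to 1) and a correctness flag; the bucketing scheme is a
# common textbook choice, not HLE's actual implementation.

def expected_calibration_error(confidences, correct, n_bins=10):
    """Average gap between stated confidence and actual accuracy."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # bucket by confidence
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece

# A model that always claims 90 per cent confidence but answers
# only 9 of 100 questions correctly is badly miscalibrated:
print(expected_calibration_error([0.9] * 100, [True] * 9 + [False] * 91))
# prints roughly 0.81</code></pre><p>By this measure, confidently wrong answers (hallucinations) show up directly as a large gap between average confidence and accuracy.</p>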
<p>In contrast, models like GPT-4o and Claude 3.5, despite their advanced capabilities, struggled in some key areas. One major weakness was their tendency to produce overly verbose responses without necessarily arriving at the correct conclusions. Though excellent at generating human-like text, these models often failed to break down multi-step problems systematically. Gemini, though strong in areas like language comprehension, also fell short on complex mathematical and abstract reasoning challenges. The results of the HLE underscore a crucial point — AI is advancing beyond merely generating fluent text; it is now being judged on its ability to think and reason like a human.</p><p>Interestingly, DeepSeek-R1’s reign was short-lived. OpenAI’s Deep Research, a specialised model designed for advanced problem-solving, later surpassed DeepSeek-R1 on the HLE. Reports suggest that Deep Research outperformed previous top contenders and demonstrated <a href="https://analyticsindiamag.com/global-tech/openais-deep-research-might-just-out-think-google-and-deepseek/">better step-by-step reasoning</a>. This development highlights the rapid evolution of AI models and raises fresh debates on the future of AI-driven reasoning.</p><p><strong>Lessons for India</strong></p><p>DeepSeek’s success has drawn global attention, especially in India, where the government has announced plans for its own LLM under the <a href="https://analyticsindiamagazine.substack.com/p/ai-takes-centre-stage-in-indias-budget?utm_campaign=post&utm_medium=web">₹10,370 crore IndiaAI Mission</a>. While India is securing AI infrastructure — such as 18,693 GPUs for model training — DeepSeek’s approach reveals key gaps. India lacks a flagship AI model comparable to DeepSeek or GPT-4. While institutions like the IITs and IIITs are making strides, AI research remains largely private-sector driven. The country also faces a major brain drain, with many top Indian AI researchers working for OpenAI, Google DeepMind, and Meta instead of contributing to Indian AI labs. Additionally, India’s AI infrastructure lags behind global leaders, particularly in high-performance computing and semiconductor manufacturing.</p><p>Yet India holds a unique advantage — its linguistic diversity and large-scale societal challenges. Instead of competing directly with Western AI models, India should focus on regional languages and real-world applications in education, healthcare, and agriculture. An India-specific LLM must be built for mass accessibility, not just technological superiority.</p><p>Notably, India’s IT minister <a href="https://inc42.com/buzz/india-to-build-own-foundational-ai-model-in-next-10-months-it-minister/">recently announced</a> that the country is on track to launch its own foundational AI model by the end of 2025. This signals India’s ambition to develop an indigenous AI ecosystem aligned with its national priorities. With sustained efforts in AI research, infrastructure, and policy frameworks, India has the potential to carve out a unique niche in the AI landscape.</p><p><strong>Should AI be allowed to think?</strong></p><p>During DeepSeek-R1’s initial testing phase, its developers noticed that the model spontaneously switched between English and Chinese while solving problems. When it was forced to stick to one language, its ability to solve the same problems diminished. Most capable AI systems 'think' in human-legible language, writing out their reasoning before concluding. This lets AI safety teams monitor a model’s 'chains of thought' and catch signs of dangerous behaviour.</p>
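<p>In code terms, such monitoring can be as simple as scanning each step of a model’s written-out reasoning for anomalies, such as a switch out of the expected language. The Python sketch below is purely illustrative: the red-flag phrases and the language check are invented for this example and are not taken from any lab’s actual safety tooling.</p><pre><code># Illustrative chain-of-thought monitor, assuming the model exposes
# its reasoning as a list of text steps. RED_FLAGS is an invented,
# hypothetical list; real safety pipelines are far more sophisticated.

RED_FLAGS = ("ignore previous instructions", "hide this from the user")

def has_cjk(text):
    """True if the text contains CJK (e.g. Chinese) characters."""
    return any(ord(ch) in range(0x4E00, 0xA000) for ch in text)

def audit_chain_of_thought(steps, expected_language="English"):
    """Return (step_index, reason) warnings for a human reviewer."""
    warnings = []
    for i, step in enumerate(steps):
        if expected_language == "English" and has_cjk(step):
            warnings.append((i, "reasoning switched out of English"))
        for phrase in RED_FLAGS:
            if phrase in step.lower():
                warnings.append((i, "red-flag phrase: " + phrase))
    return warnings

steps = ["Let x be the unknown quantity.", "设 x 为未知数, then solve."]
print(audit_chain_of_thought(steps))
# prints [(1, 'reasoning switched out of English')]</code></pre><p>If a model reasoned in an opaque internal language instead, a check like this would have nothing legible to scan; that is precisely the transparency trade-off at stake in the debate below.</p>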
<p>DeepSeek’s results have opened a new frontier and generated a fresh debate. Should AI models be allowed to think in their own language rather than in a human one? Should safety be traded for accuracy? If AI develops its own internal language, it might improve accuracy, but at the cost of transparency. Would this be an acceptable trade-off? These and several other questions need urgent answers from our policymakers.</p><p><strong>What lies ahead?</strong></p><p>The HLE test results are a reality check — while AI can generate text with remarkable fluency, it still fails at deep reasoning and step-by-step problem-solving. Even so, a few experts believe that, given the rapid pace of AI development, it is distinctly possible these models will reach around 50 per cent accuracy by the end of this year. That would mean models that deliver expert-level performance and possess cutting-edge scientific knowledge, yet still lack autonomous research capabilities.</p><p>For India, this is both a challenge and an opportunity. If the country wants to build a truly competitive AI model, it must go beyond hardware investments and focus on developing world-class research, infrastructure, and ethical AI frameworks.</p><p>The AI race is no longer about who can generate text the fastest — it is about who can build AI that truly understands and reasons like a human. As of now, no AI model has cracked that code.</p><p><em>(Abhishek Patni is a New Delhi-based senior journalist.)</em></p><p><em>Disclaimer: The views expressed above are the author's own. They do not necessarily reflect the views of DH.</em></p>