Children of war

None of the artificial intelligence models was able to pass the external independent assessment that Ukrainian schoolchildren take every year.

During the war, children in Ukraine do not learn in ordinary classroom conditions but between air-raid alarms and evacuations, in shelters, remotely, or in the overcrowded classrooms of temporary schools. The stability of the educational process has been disrupted, and familiar reference points have been lost. The sense of the future is blurred, while the expectations placed on results remain as high as ever. In this context, the external independent assessment, which should be a routine test of knowledge, turns into a psychological barrier. And if teenagers can barely withstand it, how would artificial intelligence cope?

A research group of Ukrainian scientists decided to test the ability of modern language models to work with the content of Ukrainian secondary education, not in theory but in practice. Their findings were made public on the international scientific platform arXiv. For the test, a special instrument, ZNO-Vision, was created. It is not just a set of questions but a multimodal benchmark that requires models to recognize graphs, interpret diagrams, and analyze images, not only text. The tasks covered seven main subjects of the Ukrainian secondary school: mathematics, physics, chemistry, biology, the history of Ukraine, the Ukrainian language, and Ukrainian literature.
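To make the setup concrete, here is a minimal sketch, in Python, of how a single multiple-choice item from such a multimodal benchmark might be represented, prompted, and scored. The field names, prompt format, and helper functions are illustrative assumptions, not the schema or evaluation harness actually used in the study.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class BenchmarkItem:
    """One multiple-choice task; the fields are hypothetical, for illustration."""
    subject: str                # e.g. "history of Ukraine"
    question: str               # task wording, Ukrainian-language in the real benchmark
    options: List[str]          # answer options, labeled А, Б, В, Г, Д
    image_path: Optional[str]   # graph or diagram to interpret; None for text-only items
    correct: str                # letter of the correct option

def format_prompt(item: BenchmarkItem) -> str:
    """Render the item as a single instruction-style prompt for a model."""
    letters = "АБВГД"
    lines = [f"Предмет: {item.subject}", item.question]
    lines += [f"{letters[i]}. {opt}" for i, opt in enumerate(item.options)]
    lines.append("Відповідь (лише літера):")
    return "\n".join(lines)

def accuracy(predictions: List[str], items: List[BenchmarkItem]) -> float:
    """Fraction of items where the model's answer letter matches the key."""
    hits = sum(p.strip().upper() == i.correct for p, i in zip(predictions, items))
    return hits / len(items)
```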

The test base included more than four thousand questions. For many of them, the correct answer requires not only basic knowledge but also an intuitive grasp of the logic of school wording, the style of textbooks, and the peculiarities of Ukrainian humanities discourse. Questions on the history of Ukraine, for example, rest on national terminology and an understanding of the chronology of events in the context of European processes, while questions on language turn on shades of stylistic devices and cultural allusions.

Models such as GPT-4o (OpenAI), Gemini Pro (Google), Claude 3.5 (Anthropic), Qwen2-VL (Alibaba), LLaMA (Meta), PaliGemma (Google) and others took part in this test. Their results fell below the threshold required to pass the ZNO: 70% of correct answers. The best result was shown by Gemini Pro, with 67.5%. It almost reached the passing level but did not cross it. It was followed by Claude 3.5 with 64.3%, Qwen2-VL with 51.2%, and GPT-4o with 47%. For comparison, choosing randomly among the options yields the correct answer in roughly 22% of cases.
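The roughly 22% random baseline is consistent with a mix of question formats offering four and five answer options. As a back-of-the-envelope illustration (the 60/40 split below is an assumption made for the arithmetic, not a figure reported by the study):

```python
# Expected accuracy of guessing uniformly at random over a mix of formats.
# The 60/40 split between 5-option and 4-option questions is assumed
# purely for illustration.
share_5_option = 0.6
share_4_option = 0.4

expected = share_5_option * (1 / 5) + share_4_option * (1 / 4)
print(f"Random-guess accuracy: {expected:.1%}")  # -> 22.0%
```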

What exactly is the problem? Why were even the most modern systems, capable of generating code, writing symphonies, forecasting markets, and supporting complex technical decisions, unable to cope with a school test for Ukrainian graduates? The researchers point to several reasons. The first is language structure. Most modern models are trained mainly on English-language corpora. Even when Ukrainian is included in the training data, its share is insignificant. The second is instructions and wording.

Ukrainian test tasks have their own characteristics: they are often multi-layered, with non-obvious conditions, double negations, and embedded cultural codes. In many cases, the models simply did not understand the instructions or answered a question other than the one being asked.

Separately, the researchers recorded the models' typical errors in the context of Ukrainian culture. For example, in a question about a traditional dish, most models misidentified borscht, sometimes calling it “Russian”, or substituted regional ingredients that did not fit the conditions. In a question on Ukrainian literature, where a character had to be recognized by style, the models failed to distinguish real works from invented versions stitched together out of quotations.

Interpreting graphs, working with multi-level logic, and parsing the language of instructions proved among the most vulnerable areas. There were also technical problems: the models confused diagrams, misread images, or latched onto the template of a task rather than its content.

At the same time, in some cases adapting a model to Ukrainian-language content yielded a positive result. For example, after fine-tuning, PaliGemma began to recognize tasks related to Ukrainian realities more accurately. But even so, no system passed the baseline threshold.
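For readers curious what such adaptation involves, below is a minimal sketch of supervised fine-tuning for PaliGemma with the Hugging Face transformers library. The checkpoint name is a real base model, but the dataset (train_ds, with image, prompt, and answer fields), the output path, and all hyperparameters are illustrative assumptions; the study's actual fine-tuning recipe may differ.

```python
import torch
from transformers import (
    PaliGemmaForConditionalGeneration,
    PaliGemmaProcessor,
    Trainer,
    TrainingArguments,
)

model_id = "google/paligemma-3b-pt-224"  # base checkpoint (gated on Hugging Face)
processor = PaliGemmaProcessor.from_pretrained(model_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16
)

def collate(examples):
    # Each example is assumed to hold a PIL image, a Ukrainian-language
    # prompt, and a short answer; `suffix=` turns the answers into labels.
    batch = processor(
        text=[ex["prompt"] for ex in examples],
        images=[ex["image"].convert("RGB") for ex in examples],
        suffix=[ex["answer"] for ex in examples],
        return_tensors="pt",
        padding="longest",
    )
    batch["pixel_values"] = batch["pixel_values"].to(torch.bfloat16)
    return batch

args = TrainingArguments(
    output_dir="paligemma-zno",   # hypothetical output path
    per_device_train_batch_size=2,
    num_train_epochs=1,
    learning_rate=2e-5,
    bf16=True,
    remove_unused_columns=False,  # keep raw image/prompt fields for the collator
)

# `train_ds` is a placeholder: a dataset of dicts with "image",
# "prompt" and "answer" keys, assumed to be prepared elsewhere.
trainer = Trainer(model=model, args=args, train_dataset=train_ds, data_collator=collate)
trainer.train()
```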

The results of the study have several important implications. First, they confirm that language models are not universal: their capabilities are unevenly distributed across languages. Second, they demonstrate that the Ukrainian language, education, and culture remain a space that needs separate representation in global artificial intelligence, and for now they occupy its periphery. Third, they underscore that Ukrainian educational content has its own complexity, depth, and nuance, which cannot be reduced to a set of correct answers, even for a machine with terabytes of text and trillions of parameters.

The failure of state-of-the-art language models on Ukraine's external independent assessment points to a systemic inequality in how languages and cultures are represented in the global architecture of artificial intelligence. At the same time, the study reveals not only the limits of AI but also the peculiarities of the Ukrainian educational space itself. Tasks that seem standard for a graduate turn out to be excessively complex for high-precision algorithms.

In addition, the results demonstrate that the tasks of the external assessment are highly demanding even for the most modern language models. Yet in the Ukrainian context, it is teenagers, studying in an unstable and sometimes fragmented educational system, who face this test every year. The war has deprived many students of systematic access to the school curriculum, stable teaching, pedagogical support, and preparation according to uniform standards.

Despite this, the structure of the external assessment remains unchanged: it involves broad coverage of topics, work with abstract tasks, complex formulations, and interdisciplinary connections. These requirements are formally the same for everyone, but in practice they ignore unequal access to education, and therefore are not equal at all. If artificial intelligence with trillions of parameters cannot cope with tasks created for 16- and 17-year-olds, that is a reason to re-examine not the technical capabilities of the models but the assessment system itself.

The study records the gap between the formal criteria of educational success and the actual conditions in which Ukrainian schoolchildren find themselves. Tasks designed for the best-prepared applicant from a fully functioning school become insurmountable for those whose learning experience has been limited by circumstances beyond their control.

 
