AI-Driven Assessment: Automating Multiple-Choice Question Generation in Malaysian Curriculum

Artificial intelligence is revolutionizing educational assessment by automating the labor-intensive process of creating multiple-choice questions (MCQs), a development with significant implications for Malaysian education. As teachers and tutoring platforms struggle with the burden of manually generating examination questions, AI-powered systems offer transformative potential to reduce workload while maintaining assessment quality. However, implementing AI-driven MCQ generation within Malaysia’s unique linguistic and curriculum context presents distinctive technical, pedagogical, and ethical challenges that require carefully tailored solutions. Understanding this emerging technology’s capabilities, limitations, and implementation pathways is essential for Malaysian educators, EdTech developers, and policymakers seeking to leverage automation while safeguarding assessment integrity.

The Challenge of Manual Question Generation

Creating high-quality multiple-choice questions represents a significant burden on educators. The process requires subject matter expertise, pedagogical understanding, and substantial time investment—work that diverts educators from more impactful instructional activities. Teachers report that question generation ranks among the most time-intensive tasks in their profession, alongside classroom discussion preparation, formative assessment design, and lesson planning. For Malaysian tutoring platforms serving hundreds of thousands of students annually, manually creating questions for diverse subjects and curriculum levels across multiple examination boards (SPM, STPM, IGCSE, IB) becomes operationally prohibitive.

Beyond time constraints, manually created questions frequently suffer from quality inconsistencies. Questions may be ambiguous, poorly worded, contain factual errors, or inadequately assess intended learning outcomes. Evaluating question quality requires expertise that many educators lack, potentially resulting in flawed assessments that misrepresent student understanding. These limitations create a compelling rationale for automation—AI systems can generate questions at scale while potentially improving consistency and quality through systematic application of pedagogical principles.

How AI Generates Multiple-Choice Questions

AI-driven MCQ generation employs multiple complementary natural language processing (NLP) and machine learning techniques to understand educational content and construct assessment items.

Large Language Models and Prompt Engineering

The most accessible and rapidly advancing approach leverages Large Language Models (LLMs) such as GPT-4o, Mistral, and Llama 2, which generate questions through sophisticated prompt engineering. In a recent Stanford study comparing three widely known LLMs for MCQ generation, 21 educators evaluated the generated questions. The methodology involved injecting curriculum content directly into prompts to prevent hallucinations and give educators control over source material—a critical feature for curriculum alignment. Results showed that GPT-3.5 generated the most effective MCQs across several quality metrics, with educators expressing substantial appreciation for LLM capabilities but continued skepticism about AI in educational applications more broadly.

The prompting process involves providing models with educational source material, desired difficulty levels, target learning objectives, and structural specifications for the MCQ format (number of options, answer structure, etc.). The model then generates questions aligned with these specifications.​
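To make this workflow concrete, the sketch below shows one way such prompting could be implemented with the OpenAI Python client. The model name, prompt wording, and JSON output schema are illustrative assumptions, not the prompts used in the studies cited.

```python
# Minimal sketch: generating one MCQ from supplied source material via an LLM API.
# Assumes the OpenAI Python client (pip install openai) and an API key in OPENAI_API_KEY.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_mcq(source_text: str, topic: str, difficulty: str = "intermediate") -> dict:
    """Generate one curriculum-grounded MCQ as a Python dict."""
    prompt = (
        "You are an assessment writer for the Malaysian Form 1 Mathematics curriculum.\n"
        f"Using ONLY the source material below, write one multiple-choice question on '{topic}' "
        f"at {difficulty} difficulty, with four options labelled A-D and exactly one correct answer.\n"
        "Return JSON with keys: question, options, correct_option, rationale.\n\n"
        f"Source material:\n{source_text}"
    )
    response = client.chat.completions.create(
        model="gpt-4o",                                   # any capable chat model
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},          # keeps the reply parseable as JSON
        temperature=0.3,                                  # low temperature favours faithful questions
    )
    return json.loads(response.choices[0].message.content)
```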

Retrieval-Augmented Generation (RAG) for Curriculum Alignment

For specialized applications requiring strict adherence to Malaysian curriculum standards, Retrieval-Augmented Generation (RAG) represents a more sophisticated approach. A groundbreaking 2025 study specifically addressing Malaysian needs compared four incremental pipelines for generating Form 1 Mathematics MCQs in Bahasa Melayu using GPT-4o. The research introduced pipelines ranging from non-grounded prompting (structured and basic) to advanced RAG approaches using the LangChain framework and manual implementation.​

Critically, the study grounded its systems in official Malaysian curriculum documents, including teacher-prepared notes and the yearly teaching plan (RPT—Rancangan Pelajaran Tahunan). RAG-based pipelines significantly outperformed non-grounded prompting, producing questions with substantially higher curriculum alignment and factual validity. The study employed a dual-pronged evaluation: Semantic Textual Similarity (STS) against the RPT measured curriculum alignment, while a novel RAG-based Question-Answering (RAG-QA) method verified contextual validity through automated fact-checking.

Named Entity Recognition and Distractor Generation

Beyond basic question generation, sophisticated systems employ Named Entity Recognition (NER) to identify key entities in educational material, then generate plausible but incorrect answer options (distractors) using techniques like Sense2Vec and WordNet, validated through cosine similarity measures. Quality distractors are essential for effective MCQs—if incorrect options are obviously implausible, questions fail to discriminate between students with genuine understanding and those guessing.​
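A simplified illustration of this idea appears below, using NLTK's WordNet to harvest co-hyponym candidates and spaCy word vectors for the cosine-similarity filter; Sense2Vec exposes an analogous nearest-neighbour interface. The similarity band used here is an illustrative heuristic, not a published threshold.

```python
# Simplified distractor sketch: take co-hyponyms ("siblings") of the correct answer from WordNet,
# then keep candidates that are related but not too close to the answer, using vector cosine similarity.
# Assumes: pip install nltk spacy && python -m spacy download en_core_web_md
import nltk
import spacy
from nltk.corpus import wordnet as wn

nltk.download("wordnet", quiet=True)
nlp = spacy.load("en_core_web_md")                       # medium model ships with word vectors

def wordnet_distractors(answer: str, max_distractors: int = 3) -> list[str]:
    candidates = set()
    for synset in wn.synsets(answer, pos=wn.NOUN):
        for hypernym in synset.hypernyms():
            for sibling in hypernym.hyponyms():          # co-hyponyms share a parent concept
                for lemma in sibling.lemma_names():
                    name = lemma.replace("_", " ")
                    if name.lower() != answer.lower():
                        candidates.add(name)

    answer_doc = nlp(answer)
    scored = []
    for cand in candidates:
        sim = answer_doc.similarity(nlp(cand))           # cosine similarity of word vectors
        if 0.3 < sim < 0.8:                              # plausible but not near-synonymous (heuristic)
            scored.append((sim, cand))
    return [cand for _, cand in sorted(scored, reverse=True)[:max_distractors]]

print(wordnet_distractors("rectangle"))                  # e.g. other polygons as plausible wrong answers
```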

Technical Approaches to MCQ Generation

Three primary technical methodologies characterize contemporary AI-driven question generation systems:

Template-Based Approaches

Template-based methods employ predefined question structures filled with extracted key concepts from source material. For example, educators might define templates for factual recall (“What is _____?”), procedural understanding (“How would you _____?”), or conceptual application (“Which of the following demonstrates _____?”). The system then selects relevant content and populates templates, generating grammatically correct, structurally sound questions.​

This approach offers advantages including explainability (educators understand exactly how questions are generated), ease of customization (templates can be tailored to specific learning objectives), and reproducibility. However, template-based systems generate questions within rigid structural constraints, potentially limiting diversity and complexity of assessment items.​
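A minimal sketch of the template-filling step is shown below; the templates, concepts, and options are invented examples, and a production system would extract concepts automatically (for example with NER or keyphrase extraction) rather than relying on educator-supplied terms.

```python
# Minimal sketch of template-based MCQ generation: predefined question frames are filled
# with key concepts and paired with a keyed answer plus distractors.
import random

TEMPLATES = {
    "factual_recall": "What is {concept}?",
    "application": "Which of the following demonstrates {concept}?",
}

def fill_template(template_key: str, concept: str, correct: str, distractors: list[str]) -> dict:
    options = distractors + [correct]
    random.shuffle(options)                              # randomise option order
    return {
        "question": TEMPLATES[template_key].format(concept=concept),
        "options": options,
        "answer": correct,
    }

mcq = fill_template(
    "factual_recall",
    concept="nisbah (ratio)",
    correct="A comparison of two quantities measured in the same unit",
    distractors=[
        "The product of two quantities",
        "The sum of two fractions",
        "A unit of measurement",
    ],
)
print(mcq["question"], mcq["options"])
```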

Rule-Based Systems

Rule-based approaches apply linguistic transformation rules to source sentences, converting declarative statements into interrogative forms. A system might identify sentence structures, apply syntactic transformations, and rank the resulting questions with a scoring model such as logistic regression trained on human quality judgments. Rule-based systems have demonstrated the ability to improve question acceptability by 71 percent when TextRank-based keyword extraction is used to identify the most important concepts to assess.
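The toy sketch below illustrates the core syntactic-transformation idea for a single sentence pattern; real rule-based systems combine many such rules with dependency parsing, TextRank keyword selection, and a trained ranker, all omitted here.

```python
# Toy rule-based sketch: convert simple "X is/are Y" statements into "What is/are X?" stems,
# keeping Y as the keyed answer. Capitalisation and article cleanup are left out for brevity.
import re

PATTERN = re.compile(r"^(?P<subject>.+?)\s+(?P<verb>is|are)\s+(?P<complement>.+?)\.?$", re.IGNORECASE)

def statement_to_question(sentence: str) -> dict | None:
    match = PATTERN.match(sentence.strip())
    if not match:
        return None                                      # rule does not apply to this sentence shape
    return {
        "question": f"What {match['verb'].lower()} {match['subject']}?",
        "answer": match["complement"].rstrip("."),
    }

print(statement_to_question("An integer is a whole number that can be positive, negative or zero."))
```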

Neural Network and Sequence-to-Sequence Models

Advanced neural approaches employ sequence-to-sequence architectures that learn patterns from labeled question-answer datasets without requiring explicit templates or rules. Transformer-based models like T5, fine-tuned on educational datasets such as SQuAD and RACE, can generate diverse question types including multiple-choice, fill-in-the-blank, factoid, and matching-type questions. These models learn implicit patterns of effective questioning, potentially capturing nuances human-designed templates miss.​
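Inference with such a model typically looks like the sketch below. The checkpoint name is a placeholder for a seq2seq model fine-tuned on question-generation data, and the "answer: ... context: ..." input format must match whatever convention the chosen checkpoint was trained on.

```python
# Sketch of seq2seq question generation with Hugging Face transformers.
# "your-org/t5-qg-malay" is a placeholder, not a real model id; substitute any fine-tuned QG checkpoint.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

checkpoint = "your-org/t5-qg-malay"                      # placeholder checkpoint name
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

context = "Nisbah ialah perbandingan antara dua kuantiti yang mempunyai unit yang sama."
answer = "perbandingan antara dua kuantiti"
inputs = tokenizer(f"answer: {answer} context: {context}", return_tensors="pt", truncation=True)

output_ids = model.generate(**inputs, max_new_tokens=64, num_beams=4)   # beam search for fluency
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```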

Advantages of AI-Driven MCQ Generation

AI-powered question generation offers multiple compelling benefits when implemented thoughtfully within educational contexts.

Significant Time and Resource Savings

Automated MCQ generation dramatically reduces educator workload, liberating time for higher-impact instructional activities. Teachers and tutoring platforms can generate comprehensive question banks far more rapidly than manual approaches would permit. For online tutoring platforms serving diverse student populations across multiple subjects and curriculum levels, this efficiency gain translates into scalability that manual approaches cannot achieve.​

Consistency and Quality Standardization

AI systems apply consistent criteria when generating questions, potentially improving uniformity compared to manually created questions reflecting individual educators’ varied approaches. When grounded in pedagogical principles and curriculum standards, AI systems can systematically ensure questions assess intended learning outcomes, maintain appropriate difficulty levels, and employ effective question construction techniques.​

Personalization and Adaptive Assessment

AI systems can generate question variants tailored to individual student needs, difficulty preferences, and learning profiles. Rather than using identical questions across all students, adaptive systems can generate easier questions for struggling learners, more challenging items for advanced students, and content matching individual student areas of weakness.​

Scalability for Comprehensive Assessment Coverage

AI enables generating question banks so extensive that students can practice with fresh questions across multiple attempts without repetition. This capability supports effective learning since research demonstrates that practicing with varied question sets enhances retention and transfer better than repeated practice with identical questions.​

Critical Challenges and Limitations

Despite promising potential, AI-driven MCQ generation faces substantial technical, pedagogical, and practical challenges that currently limit widespread deployment.

Factual Accuracy and Hallucination Risks

Large language models are notorious for generating plausible-sounding but factually incorrect information—a phenomenon termed “hallucination.” In educational assessment contexts, factual errors are catastrophic because they teach incorrect information and produce invalid assessment results. A critical study on evaluation methods for automatic question generation noted that current systems fall short of consistently producing universally effective questions readily deployable in real educational settings.​

Curriculum Alignment and Validity

Questions must align with specific curriculum standards, learning objectives, and assessment frameworks governing Malaysian education. A non-grounded MCQ generation system might produce grammatically correct, educationally reasonable questions that nonetheless diverge from the official curriculum, leading to misalignment between instruction and assessment. The 2025 Malaysian study demonstrated that RAG-based approaches significantly outperformed non-grounded prompting in curriculum alignment, but even advanced systems require substantial human review to ensure validity.​

Low-Resource Language Challenges: Bahasa Melayu

Bahasa Melayu presents distinctive challenges for AI systems developed primarily for English. As a low-resource language, Malay comprises only 0.1 percent of global web content—roughly one-tenth the share of Indonesian—creating severe training data scarcity. The MalayMMLU benchmark study found that general LLMs, even sophisticated models like GPT-3.5, encounter difficulties with high school-level examinations in Malay linguistic and cultural contexts, emphasizing the substantial challenge of adapting AI to local nuances.

Beyond linguistic challenges, generating MCQs in Malay requires understanding Malaysian-specific cultural references, historical contexts, and contemporary issues integral to curriculum content. Questions must avoid cultural insensitivity while authentically reflecting Malaysian society—a nuance difficult for AI systems trained primarily on non-Malaysian content.​

Distractor Quality and Answer Option Generation

Creating plausible, educationally valuable incorrect answer options remains technically challenging. Poor distractors—either obviously implausible or too similar to correct answers—undermine assessment effectiveness. While Sense2Vec and WordNet-based approaches show promise, generating distractors requiring domain expertise or creative thinking remains difficult.​

Evaluation Methodology Bottleneck

A fundamental challenge limiting MCQ generation system development is the lack of standardized quality evaluation methodologies. Different research groups employ varied metrics assessing grammaticality, coherence, relevance, alignment with learning objectives, and critical thinking promotion—but no consensus exists regarding optimal evaluation approaches. This methodological fragmentation makes it difficult to systematically compare generation approaches or establish quality benchmarks for deployment.

Evaluation Frameworks for Generated Questions

Assessing AI-generated question quality requires multifaceted approaches examining both technical linguistic properties and pedagogical validity.

Automated Evaluation Metrics

Established NLP metrics include BLEU-4 scores (assessing n-gram overlap with reference questions), semantic similarity measures comparing generated questions to curriculum standards, and grammaticality scores evaluating syntactic correctness. However, these metrics capture surface-level properties while potentially missing pedagogical qualities distinguishing effective assessment items.​
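For instance, a BLEU-4 comparison between a generated question and a human-written reference can be computed with NLTK as sketched below; smoothing is needed because short questions often have no higher-order n-gram overlap. The Malay sentences are invented examples.

```python
# Sketch: BLEU-4 n-gram overlap between a generated question and a human-written reference.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "Apakah nilai x dalam persamaan 2x + 3 = 11 ?".split()
candidate = "Cari nilai x bagi persamaan 2x + 3 = 11 .".split()

score = sentence_bleu(
    [reference], candidate,
    weights=(0.25, 0.25, 0.25, 0.25),                    # equal weight on 1- to 4-grams (BLEU-4)
    smoothing_function=SmoothingFunction().method1,      # avoids zero scores on short texts
)
print(f"BLEU-4: {score:.3f}")
```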

Curriculum Alignment Assessment

The Malaysian 2025 study introduced Semantic Textual Similarity (STS) measurement against the RPT (yearly teaching plan) to evaluate curriculum alignment. Even so, questions that align with curriculum standards but poorly measure intended learning outcomes constitute assessment failures despite their technical correctness.
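A sketch of this kind of STS check is shown below, using a multilingual sentence-embedding model to compare a generated question against learning-standard statements; the model choice and the RPT-style statements are illustrative, not quotations from the official document.

```python
# Sketch: semantic textual similarity (STS) between a generated question and RPT-style
# learning standards, using a multilingual sentence encoder (pip install sentence-transformers).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")   # any encoder with Malay coverage

rpt_standards = [                                        # illustrative statements, not the official RPT
    "Murid membanding dan menyusun integer mengikut tertib.",
    "Murid menyelesaikan masalah yang melibatkan nisbah dan kadar.",
]
question = "Susun integer -3, 5, 0 dan -7 dalam tertib menaik."

q_emb = model.encode(question, convert_to_tensor=True)
rpt_emb = model.encode(rpt_standards, convert_to_tensor=True)
scores = util.cos_sim(q_emb, rpt_emb)[0]                 # cosine similarity against each standard

best = scores.argmax().item()
print(f"Best-matching standard: {rpt_standards[best]} (STS = {scores[best].item():.2f})")
```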

Validity and Contextual Assessment

The novel RAG-based Question-Answering (RAG-QA) method employed in the Malaysian research automatically verifies factual validity by having the AI system answer generated questions from curriculum source material, checking whether correct answers are actually derivable from official curriculum documents. This approach addresses hallucination risks by ensuring questions remain grounded in verified content.​
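The sketch below captures the core of this check under simplified assumptions: the model is asked to answer the generated MCQ using only a retrieved curriculum passage, and the question is treated as grounded only if the intended key is recovered. The client, prompt, and matching rule are illustrative, not the study's implementation.

```python
# Sketch of a RAG-QA style validity check: can the intended answer be derived from curriculum text alone?
from openai import OpenAI

client = OpenAI()

def mcq_is_grounded(question: str, options: dict[str, str], key: str, curriculum_passage: str) -> bool:
    prompt = (
        "Answer the multiple-choice question using ONLY the curriculum passage below. "
        "Reply with a single option letter, or 'NONE' if the passage does not contain the answer.\n\n"
        f"Passage:\n{curriculum_passage}\n\nQuestion: {question}\n"
        + "\n".join(f"{letter}. {text}" for letter, text in options.items())
    )
    reply = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,                                   # deterministic answering for checking
    ).choices[0].message.content.strip().upper()
    return reply.startswith(key.upper())                 # grounded only if the intended key is recovered
```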

Human Expert Evaluation

Ultimately, human subject matter experts and experienced educators must evaluate generated questions before deployment. Expert review assesses whether questions effectively measure intended learning, whether answer options are appropriately challenging, whether language is clear and unambiguous, and whether cultural appropriateness is maintained.​

Implementation in Malaysian Education: Practical Pathways

For Malaysian educators and EdTech platforms seeking to implement AI-driven MCQ generation, several practical approaches offer different tradeoffs between sophistication, cost, and implementation effort.

Immediate Implementation: Prompt Engineering with GPT-4o

The most accessible entry point leverages existing commercial LLM APIs through sophisticated prompting. Educators or platforms can use GPT-4o (or similar models) directly through web interfaces, providing curriculum content and detailed specifications for desired questions. This approach requires no technical expertise beyond effective prompt writing, enables rapid experimentation, and incurs minimal cost at small scale.

However, non-grounded prompting, even when the prompts are detailed, risks curriculum misalignment and factual inaccuracy without specialized optimization for Malaysian contexts.

Intermediate Implementation: RAG-Based Systems with LangChain

For organizations requiring better curriculum alignment, RAG-based approaches using LangChain framework enable integration of official curriculum documents as grounding material. Educators provide curriculum sources (RPT documents, textbooks, official notes) which the system incorporates when generating questions, substantially improving alignment and factual accuracy.​

This approach requires greater technical setup but provides superior results and substantial quality improvement over basic prompting.​
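Stripped of framework abstractions, the retrieve-then-generate step at the heart of such a pipeline can be sketched as below: curriculum chunks are embedded once, the chunks most similar to the requested topic are retrieved, and only those chunks are injected into the generation prompt. LangChain wraps the same steps behind retriever and chain components; the chunk texts, model names, and prompt here are illustrative.

```python
# Framework-agnostic sketch of a RAG step for curriculum-grounded MCQ generation.
from openai import OpenAI
from sentence_transformers import SentenceTransformer, util

client = OpenAI()
embedder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

curriculum_chunks = [                                    # in practice, chunked from the RPT and teacher notes
    "Bab 1: Nombor nisbah - murid mewakilkan integer pada garis nombor.",
    "Bab 4: Nisbah, kadar dan kadaran - murid menentukan nisbah tiga kuantiti.",
]
chunk_embs = embedder.encode(curriculum_chunks, convert_to_tensor=True)

def generate_grounded_mcq(topic: str, k: int = 2) -> str:
    topic_emb = embedder.encode(topic, convert_to_tensor=True)
    top_idx = util.cos_sim(topic_emb, chunk_embs)[0].topk(k).indices.tolist()   # retrieve top-k chunks
    context = "\n".join(curriculum_chunks[i] for i in top_idx)
    prompt = (
        f"Write one multiple-choice question in Bahasa Melayu on '{topic}' with four options "
        "and exactly one correct answer, based ONLY on these curriculum notes:\n" + context
    )
    reply = client.chat.completions.create(
        model="gpt-4o", messages=[{"role": "user", "content": prompt}], temperature=0.3
    )
    return reply.choices[0].message.content

print(generate_grounded_mcq("nisbah tiga kuantiti"))
```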

Advanced Implementation: Fine-Tuned Models for Malay

Long-term, optimal solutions likely involve fine-tuning open-source models on Malaysian curriculum datasets. The MalayMMLU research demonstrates that models fine-tuned with diverse Malaysian-language content perform substantially better than general models. Organizations with sufficient resources could develop proprietary fine-tuned models optimized for Malaysian contexts—a significant investment but enabling long-term competitive advantage and reduced dependence on commercial APIs.​

Teacher Adoption and Readiness

Successful implementation of AI-driven assessment requires not merely technical capability but genuine educator adoption and integration into classroom practices.

Current Teacher Perspectives

Recent research on Malaysian secondary school teachers’ perspectives on AI-based classroom assessment (280 teachers in Selangor) reveals complex attitudes. Encouragingly, 78 percent of teachers expressed positive perceptions of AI’s usefulness in reducing workload and providing real-time feedback. This enthusiasm suggests substantial receptiveness to efficiency benefits.

However, significant concerns persist. Sixty-five percent of teachers reported worries about over-dependence on AI and possible threats to professional judgment and data ethics. Only 30 percent of Malaysian educators are adequately trained to integrate AI tools, necessitating robust professional development programs.

Infrastructure and Support Requirements

Beyond training, successful implementation requires supporting infrastructure. Interviewed educators reported that technology access limitations, insufficient training, and data protection concerns restricted AI tool usage. Rural schools, facing particular infrastructure challenges, would require special support for equitable AI integration.​

Ethical and Data Privacy Considerations

AI-driven assessment deployment raises pressing ethical concerns requiring careful attention.

Algorithmic Bias and Fairness

Training datasets influence AI system behavior, and if datasets contain biases, generated questions may perpetuate disparities. For example, if training data overrepresents certain cultural contexts or demographic groups, generated questions may contain cultural bias disadvantaging specific student populations. Mitigation strategies include ensuring diverse training datasets, employing fairness-aware algorithms, and adopting explainable AI approaches enabling educators to identify potential biases.​

Data Privacy and Student Information Protection

AI systems require both training data and operational data. If student performance data, learning behaviors, or personal characteristics are used to train systems or enable personalization, privacy protections become critical. Malaysia’s Personal Data Protection Act (PDPA) and emerging AI governance frameworks require careful attention to consent, data minimization, and secure handling.

Transparency and Educator Autonomy

Educators must understand how questions are generated and maintain authority over assessment decisions. “Black box” AI systems that educators cannot interpret undermine professional judgment and accountability. Implementing interpretable systems (template-based or rule-based approaches) or employing explainable AI techniques ensures educators retain meaningful control.

Overreliance and Learning Outcome Concerns

Research raises important cautions about overreliance on AI-generated assessments potentially compromising learning quality.

Critical Thinking and Cognitive Skill Development

Over-reliance on AI-generated MCQs, particularly if questions focus narrowly on factual recall rather than higher-order thinking, could inadvertently narrow assessment scope away from creativity, critical analysis, and problem-solving—capabilities increasingly valued in the modern economy. Effective implementation requires that AI generation capabilities be coupled with intentional pedagogical design ensuring questions promote deep learning and critical thinking.​

The FACT Framework: Balancing AI and Traditional Assessment

Recent education research proposes the FACT framework to address the tension between AI assistance and traditional learning outcomes, comprising four components: Foundation (building core skills), Critical judgment (developing critical thinking), Application (applying knowledge with AI tools), and Transfer (using learning in new contexts). The framework suggests that while AI can support assessment at scale, careful pedagogical design is essential to prevent students from merely reproducing AI-generated knowledge without developing genuine understanding.

Future Directions and Recommendations

The field of AI-driven MCQ generation remains dynamic, with several promising development trajectories for Malaysian education.

Development of Malay-Optimized Models

The MalayMMLU research provides a foundation for developing language models specifically optimized for Malaysian curriculum contexts. Continued development of fine-tuned models trained on Malaysian curriculum content, with cultural sensitivity and linguistic appropriateness, would substantially improve quality and contextual relevance.

Standardized Quality Evaluation Frameworks

Establishing Malaysian-context-specific quality criteria for AI-generated questions—including curriculum alignment, cultural appropriateness, factual accuracy, and pedagogical validity—would enable systematic comparison of different approaches and provide benchmarks for operational deployment.​

Comprehensive Teacher Training Programs

Malaysia’s Digital Education Policy and AI roadmap should be complemented by substantial investments in teacher training addressing both technical AI literacy and pedagogical considerations for AI-driven assessment. Short applied courses, such as a 15-week “AI Essentials for Work” bootcamp, and sector-specific training on automated grading and MCQ generation represent practical starting points.

Policy Development and Ethical Guidelines

Clear policy guidelines addressing AI in educational assessment—covering data privacy, algorithmic fairness, transparency requirements, educator autonomy, and quality standards—would facilitate responsible innovation while protecting students and educators. Malaysia’s National Guidelines on AI Governance and Ethics provide a foundation but require education-sector-specific elaboration.

Public-Private Partnerships for Infrastructure

Given rural infrastructure challenges limiting AI adoption, strategic public-private partnerships could develop cost-effective solutions addressing rural schools’ needs while maintaining educational equity. The RM100 million allocation for rural internet access under Budget 2025 provides a foundation for such infrastructure development.

Strategic Opportunity and Careful Implementation

AI-driven MCQ generation represents a strategic opportunity for Malaysian education to substantially improve assessment scalability, consistency, and personalization while liberating educators from time-consuming manual question creation. The technology is demonstrably capable—particularly when grounded in curriculum standards through RAG-based approaches—of generating high-quality questions at unprecedented scale.

However, realizing this opportunity requires moving beyond technological capability to thoughtful, deliberate implementation addressing Malaysian education’s distinctive context. Success requires developing Malay-optimized AI models reflecting Malaysian curriculum, cultural context, and linguistic nuance; establishing rigorous quality evaluation frameworks ensuring pedagogical validity; investing substantially in teacher training and support infrastructure; addressing ethical concerns around bias, privacy, and educator autonomy; and maintaining pedagogical focus on developing genuine understanding rather than mere question-answering capability.

The 2025 Malaysian research demonstrating RAG-based approach effectiveness for curriculum-aligned MCQ generation provides proof-of-concept that technically sophisticated, educationally rigorous AI-driven assessment is achievable within Malaysian contexts. Building upon this foundation through coordinated investment in model development, educator training, policy frameworks, and infrastructure will enable Malaysian education to harness AI assessment’s transformative potential while safeguarding educational integrity and equity.