Infoxmed2.0-27B: Instruction Tuning, Preference Alignment, and GRPO-Based Reward Model Training for Medical LLMs

Abstract: Large language models (LLMs) [1], [2] have demonstrated remarkable capabilities across general domains, yet their application in specialized medical contexts demands rigorous domain adaptation [3], [4]. We present Infoxmed2.0-27B, a medical foundation model built upon Qwen3.5-27B [5] through a comprehensive multi-stage post-training pipeline: (1) proprietary medical data synthesis from a MySQL database with MedicalCategoryTree organization, medical PhD team validation, Chinese RoBERTa [6] semantic deduplication, and API-assisted language refinement; (2) supervised instruction fine-tuning of Qwen3.5-27B via LoRA [7] (r = 8, α = 32) using MS-Swift [8], producing iterations Infoxmed2.0.0 → 2.0.2 → 2.0.4; (3) Direct Preference Optimization (DPO) [9] on 6,283 curated medical preference pairs [10] using DPO-RPO loss (β = 0.3, RPO weight = 0.1) across eight progressive training iterations (v0-v7); and (4) parallel Group Relative Policy Optimization (GRPO) [11]-based medical reward model training on Qwen3.5 combining internal rule-based reward functions with external DeepSeek signals. Comprehensive evaluations under a uniform LLM-as-Judge [12] framework with GPT-5.4 show 77.0% accuracy (mean quality score +7.18) on MedMCQA [10] and +2.59 on HLE, with the pipeline progressing from +6.69 (base) to +7.06 (SFT) to +7.18 (final).
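
The DPO-RPO objective in stage (3) can be illustrated with a minimal per-example sketch. It assumes one common formulation, in which the standard DPO loss is summed with a weighted, length-normalized negative log-likelihood term on the chosen response; the abstract does not spell out the exact form used, so the function name `dpo_rpo_loss`, its argument names, and the way the two terms are combined are illustrative assumptions. Only the values 0.3 and 0.1 come from the text.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_rpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    chosen_num_tokens: int,
    beta: float = 0.3,       # DPO temperature, value taken from the abstract
    rpo_alpha: float = 0.1,  # weight of the NLL term (assumed meaning of the 0.1)
) -> float:
    """Per-example DPO loss plus a weighted NLL term on the chosen response.

    All log-probabilities are summed over the response tokens.
    This is a hypothetical sketch, not the authors' implementation.
    """
    # Log-ratio of policy to reference for each response.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # Standard DPO term: -log sigmoid(beta * (chosen - rejected)).
    margin = beta * (chosen_logratio - rejected_logratio)
    dpo_term = -log_sigmoid(margin)

    # Length-normalized negative log-likelihood of the chosen response.
    nll_term = -policy_chosen_logp / chosen_num_tokens

    return dpo_term + rpo_alpha * nll_term


# Example: the policy equals the reference, so the DPO margin is zero.
loss_at_init = dpo_rpo_loss(-20.0, -30.0, -20.0, -30.0, chosen_num_tokens=10)
# Example: the policy has raised the chosen log-probability by 2 nats.
loss_improved = dpo_rpo_loss(-18.0, -30.0, -20.0, -30.0, chosen_num_tokens=10)
```

In this sketch the NLL term keeps the likelihood of the preferred answer from drifting downward while the DPO term widens the margin between chosen and rejected responses; other implementations weight or normalize the two terms differently.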

Source

https://www.medrxiv.org/content/10.64898/2026.06.25.26356522v1?rss=1