Personalised health plan development using agentic AI in Singapore’s national preventive care programme: a pilot study

npj Digital Medicine volume 9, Article number: 332 (2026)
The workforce shortages caused by aging populations demand a transition from reactive to preventive healthcare strategies. Generative Artificial Intelligence offers a promising solution through the use of agents that can generate personalised guidance. We implement a digital assistant powered by a multi-agent framework that generates and refines personalised health plans based on user interactions. A pilot study with a cohort of 20 residents and 7 clinicians revealed positive user acceptance. Both groups rated four success metrics significantly above neutral satisfaction levels (p values: <0.05). The majority of residents valued the personalisation (p value: 0.003), appreciated the level of granularity (p value: 0.0003), and did not express major concerns about the recommended plans (p value: 0.941). More than 50% of the collected feedback reflected a positive sentiment on the personalised diet (p value: 0.110), personalised exercise (p value: 0.003), and general features (p value: 6e–06). This pilot study highlights the potential of AI-driven digital assistants in supporting preventive healthcare programmes.
Singapore (SG), like many parts of the world, is experiencing a significant demographic transition marked by an aging population (a population is considered “aging” when more than 10% of its members are aged 65 years or older). United Nations projections1 indicate that, worldwide, the percentage of individuals aged 65 and over will rise from 9.7% of 8 billion people today to 16.4% of 9.7 billion by 2050. This demographic shift poses a critical challenge, as chronic diseases such as cancer, diabetes, and cardiovascular illnesses increasingly become the leading causes of death and disability, impacting both quality of life and economic stability. Countries with a higher proportion of older adults face considerable healthcare burdens, necessitating a transformative shift in healthcare delivery.
SG is not immune to these challenges. With life expectancy reaching 80.7 years for men and 85.2 years for women, the demand for healthcare services is escalating, particularly in primary care, hospital services, and long-term care2. However, workforce shortages, particularly among nurses and other essential healthcare professionals, have created significant pressure on the healthcare system3. These manpower constraints are exacerbated by the increasing complexity of medical needs, overcrowded public healthcare facilities, and limitations in primary and intermediate care services4. As the population continues to age and the burden of chronic diseases rises, addressing these workforce shortages and optimizing healthcare delivery models are critical to ensure sustainable and effective care.
To address this looming crisis, healthcare systems must evolve from a reactive model, which primarily intervenes after the onset of disease, to a proactive, preventive approach. This transformation emphasises early detection, disease prevention, and active patient engagement, strategies essential for improving overall population health. In SG, the Healthier SG2,5 initiative represents a paradigm shift in healthcare delivery, moving away from traditional reactive care toward proactive health management. This initiative aims to cultivate a culture of health-seeking behaviours and lifestyle changes among residents.
Central to Healthier SG is the concept of establishing enduring relationships between residents and family doctors, thereby fostering community support for healthier lifestyles. The initiative seeks to prevent or delay health deterioration and enhance residents’ quality of life while alleviating the growing financial strain on healthcare systems. Key components of Healthier SG include mobilising family doctors to provide preventive care, developing personalised health plans that incorporate lifestyle modifications, activating community partners to support healthier living, and facilitating national enrolment for residents to commit to their health plans.
The development of personalised health plans aligns with major studies worldwide that demonstrate how lifestyle interventions, including dietary changes and physical exercise, can mitigate the progression of chronic diseases6,7,8. Additionally, there is evidence to suggest that personalised case managers or “lifestyle coaches” can potentially improve the success of preventive care programmes9. While the potential benefits of personalised health plans are highly anticipated, practical challenges remain. As the population continues to grow and age, and the burden of chronic diseases increases, the demand for and intensity of care are expected to rise accordingly3,4. Practical constraints may limit the ability to provide personalised case managers for all residents, or for family doctors to offer extensively customised health plans for each resident. However, the advent of Generative AI in healthcare10,11,12, particularly through the use of Large Language Models (LLMs), presents a promising opportunity.
Large language models (LLMs) are rapidly being integrated into healthcare systems to support a range of functions, from clinical decision support and documentation to patient communication, biomedical research, and lifestyle/wellness guidance13. LLMs have also shown promise in reducing documentation burden by summarizing longitudinal electronic health records and generating structured clinical notes13,14. Patient-facing applications, including conversational agents for symptom triage, patient education, and mental health support, achieve competitive performance15,16,17. In research contexts, LLMs can expedite literature synthesis and early stage drug discovery by analysing large corpora of biomedical data13. Notably, emerging evidence indicates that LLM-based systems can deliver personalised lifestyle and preventive health recommendations, including sleep, nutrition, and exercise guidance, at levels comparable to human experts, highlighting their potential role in scalable, patient-centred digital health systems18,19,20,21. While adoption remains cautious due to safety and accuracy concerns, initial trials, including those in Singapore, suggest LLM-based systems can augment medication safety, enhance documentation efficiency, and save costs when implemented with clinical oversight22,23.
Notably, while conventional LLM applications follow linear workflows, which limit adaptability in complex tasks like personalised health planning, agentic LLMs overcome these limitations through decision-driven processes, enhancing flexibility, performance and efficiency by incorporating tool usage24, task decomposition and planning25,26, multi-reasoning path exploration27,28, adaptive learning29,30, and memory management. By leveraging agentic LLMs, we can significantly enhance the practicality of implementing personalised lifestyle interventions on a national scale for preventive care.
This pilot study explores the feasibility, usability, acceptability, and functionality of utilising agentic LLMs to provide individualised health plans and tailored recommendations based on user preferences, user profile (gender and health segment) and trusted localised data sources, integrating seamlessly into the clinical workflow of the national healthcare programme, Healthier SG, as envisioned in Fig. S1. We formulate the aim of this study into three hypotheses:
Residents seek highly personalised diet and exercise recommendations.
Residents require detailed guidance and would like to have a say in the recommendations.
Residents require assurance that AI solutions are safe and trustworthy.
To investigate these objectives, we undertook a series of development and experimental phases. First, we conceptualised a workflow that integrates LLMs into Healthier SG, acknowledging the potential for LLMs to generate inaccurate information and the necessity of human oversight to ensure reliability without compromising productivity. We designed the use of LLMs to occur post-consultation, allowing family doctors to input health goals and additional clinical notes as inputs for the LLMs. Second, we developed a self-guided tool called HealthGuide@Home, an agentic LLM that creates personalised, actionable health plans tailored to individual preferences and health goals. By translating family doctors’ recommendations into practical steps, HealthGuide@Home enables residents to generate a Base Plan informed by their health objectives and clinicians’ insights, while also allowing customisation based on personal preferences. Through its interactive features, HealthGuide@Home not only facilitates inquiries related to diet and exercise but also personalises health plans to align with residents’ specific needs. Finally, we designed a comprehensive evaluation methodology for HealthGuide@Home, combining perspectives from both users (residents and clinicians) and technical assessments. Through this exploratory pilot study, we aim to share preliminary insights with the broader healthcare and research community regarding the potential feasibility of using LLMs and agentic frameworks in content generation, expanding beyond their current pervasive uses in summarisation and information retrieval.
HealthGuide@Home is a digital assistant that delivers personalised health recommendations by integrating state-of-the-art LLMs with an agentic framework (“Methods”). The selection of the LLM engine for HealthGuide@Home was driven by an assessment of model capabilities (on a scale of 1.0–5.0) and answer quality (on a scale of 0.00–1.00) between the two top-performing open-source and commercial LLMs. The selected model demonstrated superior cognitive performance (4.9 vs 4.7), function calling (5.0 vs 4.6), instruction adherence (4.9 vs 4.1), and flexibility (4.0 vs 3.5) (Table S1), while its answers exhibited comparable faithfulness (0.903 vs 0.887) and relevancy (0.848 vs 0.858) (Table S2) to the second candidate. Likewise, the agentic framework for HealthGuide@Home was selected based on a qualitative assessment of capabilities in search, answer generation, feedback refinement, and synthesis (Table S3).
HealthGuide@Home utilizes a semantic router to tailor meal and exercise plans based on users’ preferences, health goals, and contextual factors, such as location and availability of resources. The system generates an initial base plan using user-specific data, such as demographics, health conditions, and preferences. The plan can be iteratively refined by interacting with the assistant to better align the plan with individual health and lifestyle needs (“Methods”, Fig. 1).
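The base-plan-then-refine loop described above can be sketched in a few lines. This is only an illustration: the names `HealthPlan`, `build_base_plan`, and `refine_plan`, as well as the example goals and preferences, are hypothetical, and the LLM agents that perform these steps in the actual system are replaced here by plain dictionary updates.

```python
from dataclasses import dataclass, field, replace


@dataclass(frozen=True)
class HealthPlan:
    """Hypothetical container for a resident's plan (not the study's schema)."""
    goals: tuple[str, ...]
    preferences: dict[str, str] = field(default_factory=dict)
    version: int = 0  # 0 = base plan, >0 = refined plans


def build_base_plan(clinician_goals, profile):
    """Create the base plan from clinician goals plus the user profile."""
    return HealthPlan(goals=tuple(clinician_goals), preferences=dict(profile))


def refine_plan(plan, feedback):
    """Return a new plan with user feedback merged in; goals stay untouched."""
    merged = {**plan.preferences, **feedback}
    return replace(plan, preferences=merged, version=plan.version + 1)


# Illustrative inputs only (assumed, not taken from the study).
base = build_base_plan(
    ["reduce sugar intake", "exercise regularly"],
    {"health_segment": "orange", "cuisine": "any"},
)
refined = refine_plan(base, {"cuisine": "vegetarian"})
refined = refine_plan(refined, {"exercise_location": "near home"})
print(refined.version, refined.preferences["cuisine"])
```

Keeping the clinician's goals immutable while only the preferences change mirrors the idea that refinement personalises the plan without overriding the clinical baseline.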
Personalisation in the HealthGuide@Home system is performed by a routing strategy in which a semantic router guides users through four key journeys: (1) preference gathering, (2) personalisation, (3) option discovery, and (4) feedback collection. These journeys ensure that user preferences are incorporated iteratively, facilitating the generation and customization of exercise plans. The modular architecture allows for dynamic integration of new workflows, ensuring that each user’s health plan is personalised and responsive to their evolving needs.
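To make the routing idea concrete, the following toy router assigns an utterance to one of the four journeys by similarity to example phrases. It is a sketch under stated assumptions: the example phrases, the `route` function, and the 0.2 threshold are hypothetical, and a bag-of-words cosine similarity stands in for the dense embeddings a real semantic router would normally use.

```python
import math
import re
from collections import Counter

# Hypothetical example utterances per journey; a production router would
# compare dense sentence embeddings rather than word counts.
ROUTES = {
    "preference_gathering": ["i prefer vegetarian food", "i like swimming and cycling"],
    "personalisation": ["change my meal plan", "update my exercise plan to my preferences"],
    "option_discovery": ["where can i find a gym nearby", "show me healthy recipes"],
    "feedback_collection": ["the plan was too hard", "i enjoyed this week's meals"],
}


def embed(text):
    """Toy embedding: lower-cased bag-of-words counts."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def route(utterance, threshold=0.2):
    """Return the best-matching journey, or None if nothing is similar enough."""
    query = embed(utterance)
    best_name, best_score = None, 0.0
    for name, examples in ROUTES.items():
        score = max(cosine(query, embed(e)) for e in examples)
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= threshold else None


print(route("show me some healthy recipes"))  # option_discovery
print(route("please change my meal plan"))    # personalisation
```

Because routes are plain entries in a dictionary, adding a new workflow amounts to registering another key with its example phrases, which is one way to read the modularity claim above.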
HealthGuide@Home was assessed by users (residents and clinicians) in four success metrics (Appropriateness, Usefulness, Actionability, and Personalisation) overall and across three ethnic groups (Indian, Chinese, Malay) and two health segments (Green and Orange) (“Methods”). Users rated each category (1–5) and provided optional feedback. A median baseline threshold of 3.00 was used to determine the users’ acceptance of the system, as it is in the centre of the rating scale and represents a neutral satisfaction level.
Fig. 2 and Tables S4–S6 display a summary of the users’ ratings and feedback across categories. The residents’ median ratings of Appropriateness (4.00), Usefulness (3.75), Actionability (4.00), and Personalisation (4.00) were significantly higher than the neutral threshold of 3.00 (p values: 0.0001, 0.0336, 0.0007, 0.0032, respectively), suggesting that the recommended health plans met the intended user targets. Likewise, clinicians’ median scores on Appropriateness (4.00) and Usefulness (4.00) were significantly greater than the threshold (p values: 0.0445, 0.0488, respectively). They valued the “safety first” approach in the recommendations and the actionable resources offered as practical additions to clinical advice, endorsing the system’s suitability for both green (healthy) and orange (well-managed) segment individuals. Overall, these findings provide preliminary evidence for wide acceptance and high satisfaction levels among both groups of users.
User ratings were collected from 20 residents and 7 clinicians across 4 categories (Appropriateness, Usefulness, Actionability, Personalisation), 3 ethnicities (Chinese, Indian, Malay) and 2 health segments (Green, Orange). The lower and upper edges of the boxes represent the 1st and 3rd quartiles, respectively, and the central line represents the median. The whiskers represent the minimum and maximum values, excluding outliers, which are represented by individual circles. The dotted horizontal line represents the expected average rating from a neutral population. Colours represent four dimensions. Blue: Appropriateness; Orange: Usefulness; Grey: Actionability; Yellow: Personalisation. Statistically significant differences in the median rating between the sample groups and the population are indicated. ns not significant, *p value < 0.05; **p value < 0.01; ***p value < 0.001.
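The box-plot elements named in this caption (quartile edges, median line, whiskers that exclude outliers) can be computed as in the sketch below. Two details are assumptions, since the caption does not state them: outliers are flagged with the common 1.5 × IQR rule, and quartiles use the "inclusive" interpolation method. The ratings are made up for illustration and are not study data.

```python
import statistics


def box_stats(ratings, whisker=1.5):
    """Quartiles, whisker ends, and outliers for one box of a box plot."""
    q1, median, q3 = statistics.quantiles(ratings, n=4, method="inclusive")
    iqr = q3 - q1
    # Assumed outlier rule: points beyond 1.5 x IQR from the box edges.
    low_fence, high_fence = q1 - whisker * iqr, q3 + whisker * iqr
    inliers = [r for r in ratings if low_fence <= r <= high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": min(inliers),
        "whisker_high": max(inliers),
        "outliers": sorted(r for r in ratings if r < low_fence or r > high_fence),
    }


# Made-up 1-5 ratings, for illustration only (not study data).
example = [1.0, 3.5, 4.0, 4.0, 4.0, 4.5, 5.0]
stats = box_stats(example)
print(stats)
print("median above neutral 3.0:", stats["median"] > 3.0)
```

In this example the single rating of 1.0 would be drawn as an individual circle, while the whiskers span 3.5 to 5.0.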
Fig. 2 and Table S4 present the distribution of scores by ethnicity, and Table S6 shows examples of users’ feedback:
Appropriateness: Rated positively across all groups, with Malay (4.00; p value: 0.0445), Chinese (4.00; p value: 0.0026), and Indian (3.75; p value: 0.0907) scores exceeding the success threshold, suggesting alignment with users’ health profiles. Users of the three ethnicities appreciated tailored plans, particularly for chronic conditions.
Usefulness: Scored highest among Chinese (4.00; p value: 0.0042), moderate for Malays (3.75; p value: 0.427), and lowest for Indians (2.50; p value: 0.6054). The three groups valued HealthGuide@Home’s recommendations for the adoption of healthier habits, particularly meal planning and structured guidance.
Actionability: Chinese (4.00; p value: 0.0027), Indian (3.5; p value: 0.2113), and Malay (3.00; p value: 0.500) users found the plans practical, pointing to a generally positive perception of the feasibility of the recommended plans. Users liked clear, achievable steps and instructional videos but suggested more visual aids.
Personalisation: Chinese (4.00; p value: 0.0059), Malay (4.00; p value: 0.0488), and Indian (3.5; p value: 0.6073) users found recommendations aligned with their individual needs. Features like GPS-based recommendations and recipe customisation were well received, with suggestions for refining onboarding and preference settings to enhance personalisation.
Fig. 2 and Table S4 illustrate the distribution of scores by health segment, and Table S6 shows examples of users’ feedback:
Appropriateness: Both Orange (4.00; p value: 0.0026) and Green (4.00; p value: 0.0089) segments rated this dimension significantly above the reference, suggesting a broad alignment with user needs across both segments.
Usefulness: Orange segment users rated this category 4.00 (p value: 0.0529), and Green segment 3.50 (p value: 0.165), suggesting higher value for individuals with pre-existing well managed conditions.
Actionability: Orange segment rated this dimension 4.00 (p value: 0.0165), and Green segment 3.625 (p value: 0.0104), indicating overall satisfaction with practical guidance for feasible health actions. Users expressed a preference for additional actionable features, such as “next steps” in diet and exercise.
Personalisation: Orange users rated personalisation 4.00 (p value: 0.0592), and Green users 4.00 (p value: 0.0131), expressing satisfaction with the range of personalisation options with suggestions for improvement on customisation options.
Overall, HealthGuide@Home was well received, with some variations across demographics and segments, highlighting areas for refinement in usability and personalisation.
Residents’ feedback was collected to test the study’s initial hypotheses (“Introduction”). Tables 1 and 2 summarise the key insights collected from the assessment and the results of the hypotheses tests. The hypothesis that residents seek highly personalised diet and exercise recommendations was supported by a total of 29 out of 40 residents’ comments (p value: 0.003). The hypothesis that residents require detailed guidance and would like to have a say in the recommendations was supported by 31 out of 40 responses (p value: 0.0003). Lastly, the hypothesis that residents require assurance that AI solutions are safe and trustworthy was only supported by 5 out of 20 residents (p value: 0.994), thus providing evidence for its rejection. Overall, residents valued the personalisation of the health plans, appreciated the level of granularity provided, and did not express major concerns about the recommended plans.
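The three p values above are consistent with a one-sided exact binomial test against a null support rate of 0.5. This section does not name the test, so that reading is an assumption. Under it, for k supporting responses out of n, the p value is the upper tail of a Binomial(n, 0.5) distribution, worked through here for the third hypothesis (k = 5, n = 20):

```latex
p = \Pr(X \ge k) = \sum_{i=k}^{n} \binom{n}{i}\, 0.5^{\,n},
\qquad X \sim \mathrm{Binomial}(n,\, 0.5)

% Third hypothesis: k = 5, n = 20
p = 1 - \frac{\binom{20}{0}+\binom{20}{1}+\binom{20}{2}+\binom{20}{3}+\binom{20}{4}}{2^{20}}
  = 1 - \frac{1 + 20 + 190 + 1140 + 4845}{1\,048\,576}
  = 1 - \frac{6196}{1\,048\,576}
  \approx 0.994
```

The same formula gives approximately 0.003 for 29 out of 40 and approximately 0.0003 for 31 out of 40, matching the reported values to rounding.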
Residents’ feedback was subsequently analysed by classifying their sentiment as positive, neutral, or negative across 4 categories (Diet, Exercise, General, and Semantic Router), where General includes all the comments that do not fit into any of the other 3 categories (“Methods”). An equal proportion of positive and negative comments (0.5) was established as null hypothesis to assess the user’s sentiments, as it represents the opinion of a neutral population.
Fig. 3 and Table S7 show the aggregated sentiment analysis for residents across the 4 categories:
Diet: 32 positive, 37 neutral, and 22 negative responses were collected, indicating neutral sentiment in this area (p value: 0.110).
Exercise: Significant positive sentiment captured by 11 positive, 12 neutral, and 1 negative responses (p value: 0.003), with acknowledgement of alignment with medical conditions.
General: Predominantly positive sentiment with 31 positive, 12 neutral, and 5 negative responses (p value: 6e–06), indicating overall satisfaction with the health plans beyond diet and exercise.
Semantic Router: Trend towards positive sentiment collected in 3 positive, 4 neutral, and 0 negative comments (p value: 0.125), suggesting a favourable reception of this feature.
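The sentiment p values listed above can be reproduced, to rounding, with a one-sided exact binomial test on the positive versus negative counts. The test is not named in this section, so both the choice of test and the exclusion of neutral comments are assumptions; the function name `binom_p_greater` is likewise illustrative.

```python
from math import comb


def binom_p_greater(successes, trials, p0=0.5):
    """One-sided exact binomial p value: P(X >= successes) under rate p0."""
    return sum(
        comb(trials, i) * p0**i * (1 - p0) ** (trials - i)
        for i in range(successes, trials + 1)
    )


# (positive, neutral, negative) counts as reported for each category.
counts = {
    "Diet": (32, 37, 22),
    "Exercise": (11, 12, 1),
    "General": (31, 12, 5),
    "Semantic Router": (3, 4, 0),
}

for category, (pos, _neutral, neg) in counts.items():
    n = pos + neg  # neutral comments are left out (assumption)
    print(f"{category}: ratio={pos / n:.2f}, p={binom_p_greater(pos, n):.3g}")
```

With only three non-neutral comments, the Semantic Router category cannot reach a p value below 0.125 under this test, which is why even unanimous positive feedback is reported as a trend rather than a significant result.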
Residents’ feedback was collected and classified as positive, neutral, or negative overall and by health plan (base, refined). Bars represent the ratio of positive comments across 4 topics (Diet, Exercise, General, Semantic router), 3 ethnicities (Chinese, Indian, Malay), 2 genders (female, male), and 2 health segments (green, orange). The dotted line represents the expected positive ratio of a neutral population. Colours represent health plans. Blue: All plans; Orange: Base plan; Grey: Refined plan. Statistically significant differences in the positive ratio between the sample groups and the population are indicated. *p value < 0.05; **p value < 0.01; ***p value < 0.001.
Fig. 3 and Table S7 display the user sentiment on the base and refined health plans by category:
Diet: Primarily positive sentiment on the base plan (0.58 positive ratio; p value: 0.237) with a slight improvement on the refined plan (0.61 positive ratio; p value: 0.202), suggesting a modest change towards a more optimistic perception after refinement.
Exercise: Positive sentiment dominated the feedback on the base plan (1.0 positive ratio; p value: 0.016) and remained high in the refined plan (0.83 positive ratio; p value: 0.109), signalling good alignment of the initial recommendations with users’ preferences even before further customisation.
General: Significantly positive sentiment on the base plan (0.81 positive ratio; p value: 0.004), which was markedly enhanced on the refined plan (0.93 positive ratio; p value: 0.001), suggesting that users responded more positively to the overall structure and recommendations of the refined plan.
Semantic Router: Only positive sentiment registered on the base (1 response; p value: 0.500) and refined (2 responses; p value: 0.250) plans, requiring collection of further feedback to assess the effect of the plan refinement on users’ views.
Fig. 3 and Table S7 display the sentiment analysis on the base and refined health plans stratified by ethnic group, gender, and health segment:
Ethnic group: The feedback on the base plan was significantly positive within the Chinese group (0.90 positive ratio; p value: 1e–05) and predominantly neutral within the Indian (0.54 positive ratio; p value: 0.425) and Malay (0.50 positive ratio; p value: 0.500) groups. Notably, sentiment on the refined plan remained high in the Chinese group (0.74 positive ratio; p value: 0.005) and increased substantially, albeit not significantly, in the Indian (0.78 positive ratio; p value: 0.090) and Malay (0.83 positive ratio; p value: 0.109) groups, hinting that Chinese individuals valued the recommendations of the base and refined plans, while other ethnicities showed a trend of appreciating the improvements of the refined plan.
Gender: Females displayed more positive sentiment towards the base plan (0.75 positive ratio; p value: 0.004) than males (0.67 positive ratio; p value: 0.061). However, females’ positive sentiment on the refined plan dropped moderately (0.57 positive ratio; p value: 0.339), while males’ increased significantly (0.96 positive ratio; p value: 3e–06). These findings suggest a gender disparity on the perception of the initial base plan and the subsequent refinement by interaction with the digital assistant, and indicate that for males the interaction achieved the intended purpose of adapting the health plan to the specific requirements and conditions of the user.
Health Segment: While the green segment expressed moderately positive sentiment on the base plan (0.62 positive ratio; p value: 0.115), the orange segment feedback was significantly positive (0.84 positive ratio; p value: 5e–04). Notably, the sentiment of the green segment towards the refined plan increased significantly (0.73 positive ratio; p value: 0.007), while the opinion of the orange segment remained highly positive (0.85 positive ratio; p value: 0.011), suggesting an optimistic reception of the initial recommendations by individuals with pre-existing and well managed conditions and the likely positive impact of further personalisation for healthy individuals.
Overall, refinements in HealthGuide@Home’s recommendations appear to improve user sentiment, particularly among high-risk populations.
Over the past decade, Natural Language Processing (NLP) has played an increasingly crucial role in healthcare. From the early days of information retrieval from clinical free-text notes stored in EMR or EHR systems31,32,33,34 to converting unstructured data into coded information35, NLP applications have significantly reduced the administrative burden for clinicians. Recent advancements in LLMs have further expanded the potential for creating fully automated clinical digital assistants36, broadening the scope of what NLP can achieve in revolutionizing healthcare delivery and enhancing patient care. For example, LLM-powered conversational agents have been successfully deployed to provide automated health education and behavioural counselling37, build virtual coaches for adolescent preventive care38, or guide patients through complex medical choices39.
Building on the conventional use of NLP in clinical settings, our findings in this study demonstrate the viability of using LLMs to create a personalization workflow adept at providing users with customised health plans tailored to their unique needs and preferences, ultimately improving population health. The core strength of this system lies in its ability to seamlessly integrate user inputs, such as demographics, health conditions, and lifestyle preferences, with the clinician’s lifestyle prescriptions (which constitute the baseline recommendation for the user) into a coherent plan that prioritizes health actions, including essential strategies such as balanced diets and regular exercise.
Our work leverages a state-of-the-art agentic approach that revolutionizes conventional language model applications by introducing dynamic, decision-oriented workflows. Unlike traditional methods, which often operate sequentially and struggle with complex tasks, this innovative framework enhances adaptability, modularity, and efficiency. Key design principles include the integration of tools that augment LLM capabilities, improving accuracy in tasks such as factual verification and calculations. Additionally, adaptive learning mechanisms enhance these systems further, utilizing feedback to refine strategies based on past performance while maintaining contextual awareness through short-term and long-term memory management. This dual-memory framework ensures coherence and continuity in multi-step problem-solving.
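The dual-memory idea can be illustrated with a small sketch. The class `AgentMemory` and its methods are hypothetical and do not describe the system's actual implementation: a bounded queue plays the role of short-term memory (recent dialogue turns), and a dictionary plays the role of long-term memory (facts that persist across sessions).

```python
from collections import deque


class AgentMemory:
    """Hypothetical dual memory: a rolling dialogue window plus persistent facts."""

    def __init__(self, short_term_size=4):
        self.short_term = deque(maxlen=short_term_size)  # recent turns only
        self.long_term = {}  # durable user facts, kept across sessions

    def observe(self, role, text):
        """Record one dialogue turn; the oldest turn drops off when full."""
        self.short_term.append((role, text))

    def remember(self, key, value):
        """Persist a fact such as a dietary preference."""
        self.long_term[key] = value

    def new_session(self):
        """Clear the dialogue window but keep long-term facts."""
        self.short_term.clear()

    def context(self):
        """Assemble the text an agent would receive as context."""
        facts = "; ".join(f"{k}={v}" for k, v in sorted(self.long_term.items()))
        turns = " | ".join(f"{role}: {text}" for role, text in self.short_term)
        return f"facts[{facts}] recent[{turns}]"


memory = AgentMemory(short_term_size=2)
memory.observe("user", "I do not eat beef")
memory.remember("diet", "no beef")
memory.observe("assistant", "Noted, removing beef dishes")
memory.observe("user", "Suggest a dinner")  # evicts the first turn
memory.new_session()
print(memory.context())  # facts[diet=no beef] recent[]
```

After the session reset, the dietary fact is still available to the agent even though the dialogue turns are gone, which is the continuity property the paragraph above attributes to long-term memory.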
Recent work has likewise applied LLM-based agents to deliver personalised health guidance. For example, a prior study developed an agentic LLM workflow for nutrition coaching that first probes patient-specific dietary barriers and then delivers tailored guidance. This system consistently identified key obstacles and generated advice that participants rated as highly personalised in user tests40. Similarly, PlanFitting is an LLM-driven conversational agent designed for crafting personalised exercise routines. PlanFitting engages users in iterative dialogue to elicit goals, schedules and constraints, and then suggests evidence-based weekly plans, which participants in a study found useful and well-aligned with exercise guidelines41. Other recent work has shown LLMs can match expert performance in focused health domains: for example, a Personal Health LLM (PH-LLM) fine-tuned on wearable sleep and fitness data achieved expert-level insight generation and recommendation quality18. In addition, agentic frameworks like openCHA demonstrate that orchestrating multiple LLM agents can greatly outperform vanilla models15. While our study aligns with previous work in leveraging LLM agents for personalisation, it extends it in several important ways. First, rather than focusing on a single health domain (nutrition or exercise), HealthGuide@Home assistant uniquely integrates clinician-prescribed lifestyle plans with user preferences across both diet and exercise. Second, the multi-agent design explicitly supports iterative plan refinement through user feedback loops, enabling dynamic adaptation of recommendations over multiple interactions rather than one-off plan generation. Finally, the use of short-term and long-term memory mechanisms allows the assistant to preserve contextual continuity across sessions while progressively personalising outputs based on historical interactions. 
Collectively, these technical differentiators position our prototype as a more adaptive and scalable agentic system relative to prior LLM-based digital health assistants and extend these applications to a more comprehensive, population-health setting.
Our pilot study represents a preliminary effort to deploy and evaluate a functional prototype, as shown in Fig. S2, that seamlessly integrates with real-world preventive care workflows. By leveraging a combination of qualitative and quantitative observations, we successfully engaged both residents and clinicians to test our approach. This work presents a prototype and provides valuable preliminary insights and foundational evidence that extend previous studies42,43,44, examining the potential feasibility, usability, acceptability, and functionality of using digital health technologies to improve residents’ healthcare outcomes in a population setting.
Our pilot study reveals two preliminary insights regarding residents’ preferences and needs for personalised diet and exercise recommendations. First, the desire for highly personalised plans was strongly supported, as 16 out of 20 residents expressed interest in tailored base plans, and 13 residents appreciated the enhanced personalization after updating their preferences. These initial observations support the relevance of individualization in fostering user engagement with health-related recommendations. Second, preliminary evidence pointed to the necessity for detailed guidance. A significant majority (17/20) of residents found the level of granularity in the recommendations, such as step-by-step instructions, recipes, and video links, impressive. Furthermore, 14 residents explicitly requested additional resources like visual guides and meal options for dining out, pointing to a demand for user-friendly, comprehensive resources. Most residents (15/20) reported no major concerns with AI-generated plans, and their questions focused more on understanding the recommendation process rather than mistrusting it. These findings suggest that while demand for personalization and detailed guidance may be evident, concerns about AI trustworthiness may be less obvious, provided the system operates transparently and effectively.
Based on data collected, our findings reveal preliminary insights into the perceived appropriateness, usefulness, actionability, and personalization of GenAI to support healthy lifestyle choices through tailored recommendations. Both resident and clinician scores exceeded the desired threshold of 3.0 across most categories, with some notable differences in perspectives. Appropriateness was rated highly by residents (4.0), who expressed no major concerns and appreciated the relevance of recommendations for users in both green and orange segments. However, clinicians provided a slightly lower score (3.6), expressing greater caution when considering patients with chronic conditions. While they acknowledged the safety guardrails incorporated into GenAI, clinicians emphasised the importance of ensuring recommendations are well-suited for medically complex scenarios. In terms of actionability, residents appreciated the clear and detailed nature of recommendations, specifically citing the role of video guides in enhancing usability. Although clinicians did not provide a formal score for this metric, the emphasis on actionable content was indirectly supported by their recognition of the platform’s potential to supplement clinical consultations with granular, practical suggestions. The usefulness of HealthGuide@Home was well-rated by both groups (3.6 by residents and 3.7 by clinicians). Residents particularly valued the inclusion of guides for initiating healthier diet and exercise habits, as well as features such as nudges and calendar integration to improve adherence. Clinicians, meanwhile, appreciated the tool’s ability to deliver granular and actionable suggestions, such as recipes and exercise plans, which are often beyond the scope of traditional clinical consultations. Finally, personalization was a key functionality of HealthGuide@Home, with residents scoring it 3.8. 
They reported satisfaction with the ability to customize recommendations based on individual preferences, particularly in the Orange segment, though ongoing enhancements to ensure deeper personalization could further strengthen user engagement. Our assessment suggests that the Orange tier population requires clinical support beyond primary care, i.e., personalised guidance, self-directed physical activities, and/or diet recommendations, such as those provided by HealthGuide@Home. This population group would also be deemed more motivated or ready to take action, as they have inherent health conditions to manage.
AI health assistants like HealthGuide@Home have the potential to augment preventive care programmes such as Healthier SG by supporting both clinical integration and long-term behaviour change. Integrating these tools into national initiatives requires empathetic, patient-focused design to engage diverse stakeholders, including patients, general practitioners (GPs), policymakers, and healthcare providers. Additionally, AI offers scalable, cost-effective support for sustained behaviour change through personalised nudges and continuous engagement, complementing traditional care and allowing professionals to focus on more complex cases45. Furthermore, digital tools like HealthGuide@Home can mitigate the manpower shortages caused by population ageing by delivering real-time, personalised guidance without availability or fatigue constraints46, thus empowering patients, optimising healthcare resource utilisation, and ensuring accessibility across linguistic, cultural, and underserved populations.
While HealthGuide@Home conveyed positive sentiment and acceptable satisfaction among residents and clinicians, several enhancements hold potential to improve its effectiveness and clinical value. AI recommendations must be regularly updated to reflect current clinical guidelines, especially for high-risk areas. Rigorous validation frameworks are needed to ensure health plans meet clinical standards and are user-appropriate47. To ensure trustworthiness, evaluation frameworks should encompass trust-building, thoughtfulness, and emotional understanding, with metrics covering safety, privacy, bias, and interpretability. Defining accountability is vital, with clinicians being responsible when using AI as a support tool and providers being accountable when patients interact directly with AI. Driving long-term effectiveness requires AI systems to exhibit good intrinsic and extrinsic metrics, and to satisfy the four healthcare-specific criteria for broad and sustained adoption: accuracy, trustworthiness, empathy and performance47. Finally, successful integration into the healthcare ecosystem requires structured frameworks like TEHAI48, long-term effectiveness studies47, and close collaboration with policymakers, the public, and community partners. By addressing these considerations, HealthGuide@Home and similar AI-powered assistants may effectively support preventive healthcare, fostering better health outcomes and sustainable adoption within national initiatives like Healthier SG.
Several limitations of this pilot study should be noted. First, the sample sizes for both residents (n = 20) and clinicians (n = 7) were relatively small, which may limit the generalizability of the findings. A larger, more diverse pool of participants would provide a broader understanding of user perspectives and reduce potential biases introduced by a small cohort. In addition, the study did not include direct feedback from patients, particularly those with complex health conditions. Without their input, it is difficult to determine how effectively the GenAI application meets the needs of high-risk populations or how it aligns with their lived experiences and expectations. Second, although the study recruited participants of mid and high tech-savviness based on their usage of applications/services, it did not take into account users’ independence and confidence with new digital tools, which may vary based on factors such as age, socioeconomic status, or digital literacy. These factors could influence the usability and effectiveness of HealthGuide@Home in real-world settings, particularly among populations with limited access to technology. Lastly, the study focused on the technical feasibility of using GenAI to support personalised health plans and collected preliminary perceptions of user acceptability, rather than validating or establishing confirmatory evidence of its long-term effectiveness. It is unclear whether the platform’s recommendations lead to sustained behaviour change, improved health outcomes, or meaningful integration into clinical workflows over time. Longitudinal studies would be necessary to address this gap.
In addition to current limitations, four core trustworthiness metrics must be addressed to ensure reliable and conscientious responses from healthcare chatbots. Patient safety involves minimizing risks of harmful content by tailoring responses to user type. While clinicians can receive detailed guidance, patient-facing outputs should remain cautious47. Data privacy must be guaranteed by ensuring that sensitive data remains within a session and avoiding privacy-sensitive questions. Bias should be identified and mitigated across demographics, medical conditions, and representation in training data. Lastly, interpretability is crucial for clinical trust and adoption, requiring transparent reasoning and alignment with medical guidelines to support informed decision-making. Addressing these clinical risks through robust safety measures is critical for the responsible and effective deployment of LLM-powered tools in healthcare settings.
Based on the findings and feedback, several directions for future research and development of HealthGuide@Home have been identified. One direction should focus on enhancing user engagement through gamification and visual communication tools such as images, videos, and visual aids. These elements, along with motivational features, nudges, and tracking tools can make health recommendations more accessible and encourage sustained platform use, potentially through integration with wearables and personal devices like smartphones and calendars. Incorporating periodic check-ins and user diaries may further strengthen user adherence and connection to the platform.
Clinicians highlighted the need for safer, more personalised recommendations for patients with complex conditions, such as comorbidities or cardiovascular risks. To meet this requirement, it is essential that AI-generated recommendations are safeguarded with appropriate medical guardrails and tailored guidance that considers individual factors like health status, medication, and exercise tolerance. To manage these challenges, HealthGuide@Home should adopt a robust risk stratification framework to deliver highly individualised and medically sound recommendations aligned with clinical best practices. Importantly, AI recommendations should be regarded as adjuncts to existing clinical workflows, particularly in contexts with limited clinician availability. Clinicians remain integral to this workflow by managing more complex cases, while AI can offer immediate, personalised guidance for the majority of users.
Future studies are being designed to validate the platform’s long-term effectiveness in improving health outcomes and supporting behaviour change across diverse user groups. A broader evaluation encompassing larger and more diverse samples, including individuals with complex health conditions, will provide more conclusive insights into its potential scalability and usefulness. By addressing these areas, GenAI has the potential to evolve into a comprehensive, adaptable, and clinically relevant tool for promoting healthier lifestyles.
HealthGuide@Home was developed using official guides, including the Tier 1 health actions approved by the SG Ministry of Health (MOH) and the Health Promotion Board (HPB) and the HealthHub MoveIt series endorsed by the HPB49,50, as well as open-source knowledge containing muscle-specific metadata. HealthHub is a national digital app for the self-management of personal medical records and health services51,52.
Official guides provide structured, step-by-step exercise instructions, and serve as the foundation for constructing an exercise plan. For example, T1 health actions offer condition-specific recommendations for chronic conditions like diabetes, hypertension, and hyperlipidaemia. Open-source knowledge expands exercise variety and enhances personalisation with muscle-targeted routines.
This study recruited 20 residents from the general community, representing SG’s ethnic composition and a balanced distribution across gender and green/orange tier segments (Table 3). Participants had high (60%) or medium (40%) digital literacy. The study targeted residents aged 40–59, eligible for SG’s Healthier SG Programme, presumably tech-savvy, and likely to engage with AI-based solutions.
The health tier stratification in this study was based on the HealthierSG BMI Control Management Framework, in which classification is primarily determined by chronic disease status rather than BMI alone53: (1) Green tier: Comprises healthy individuals without chronic conditions and a BMI in the range 18.5–<37.5 kg/m², including the obese range (>30 kg/m²), requiring minimal clinical support; (2) Orange tier: Comprises individuals with well-controlled chronic diseases (diabetes, hypertension, lipid disorder) and a BMI in the range of 18.5 to <32.5 kg/m²; (3) Red tier: Includes individuals with poorly controlled chronic diseases and comorbidities and a BMI in the range of 32.5–37.5 kg/m², needing intensive medical intervention (excluded from this study).
Within the Singapore HealthierSG framework, individuals without chronic conditions (green tier) receive preventive care even if they have obesity. They can safely participate in more progressive exercise programmes because there are no chronic disease complications to manage. Individuals with chronic conditions (orange tier) require conservative exercise modifications and structured programmes with medical oversight, regardless of whether their BMI is in the “normal” or “overweight” range. Therefore, a person with obesity but no chronic conditions has different exercise safety considerations than someone with well-controlled diabetes at the same BMI.
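The tier definitions above can be read as a small decision rule. The following is a minimal sketch of a literal reading of those definitions, not the study's or the Ministry's implementation; the function name `classify_tier`, its parameters, and the "unclassified" fallback for combinations the text does not cover are assumptions made for illustration.

```python
from typing import Optional


def classify_tier(bmi: float, has_chronic: bool,
                  well_controlled: Optional[bool] = None) -> str:
    """Hypothetical helper: literal reading of the three tier definitions.

    Chronic disease status is checked first, BMI second, mirroring the
    statement that classification is primarily driven by disease status.
    """
    if not has_chronic:
        # Green: no chronic conditions, BMI 18.5 to <37.5 kg/m2
        return "green" if 18.5 <= bmi < 37.5 else "unclassified"
    if well_controlled:
        # Orange: well-controlled chronic disease, BMI 18.5 to <32.5 kg/m2
        return "orange" if 18.5 <= bmi < 32.5 else "unclassified"
    # Red: poorly controlled chronic disease, BMI 32.5 to 37.5 kg/m2
    return "red" if 32.5 <= bmi <= 37.5 else "unclassified"


def eligible_for_study(tier: str) -> bool:
    """Only Green and Orange tier residents were included in the pilot."""
    return tier in {"green", "orange"}
```

Under this reading, two people with the same BMI of 31 kg/m² fall into different tiers depending on chronic disease status: `classify_tier(31.0, False)` gives "green" and `classify_tier(31.0, True, True)` gives "orange".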
For this study, subjects were selected from the Green or Orange segments. By excluding the Red tier segment, the study focused on participants whose healthcare requirements aligned more closely with the scope of the investigation, specifically targeting those amenable to self-directed care approaches.
Additionally, 7 clinicians were enrolled for the user validation, including one behavioural scientist, one clinician AI scientist, two clinical advisors, and three primary care network clinicians.
The Singapore Ministry of Health (MOH) approved the study plan and data use agreement. This study adhered to Synapxe AI and data governance policies and processes. Principals from all participating institutions, including MOH, Singapore Health Promotion Board (HPB), and Synapxe, provided the guidelines for all study procedures. The study complied with applicable data protection regulations, including the Singapore Personal Data Protection Act 2012. All participants provided electronic informed consent. No ethical review was required for the user evaluation, as no personal data was processed and no intervention was conducted.
Clinical trial number: not applicable.
HealthGuide@Home’s generative AI model was evaluated by a multi-dimensional approach, integrating user-centric and technical assessments for a holistic understanding of performance and suitability.
To evaluate HealthGuide@Home’s recommendations, users interacted with the prototype and provided feedback on four satisfaction metrics (Table S8): (1) Appropriateness: measures the relevance of the recommendations to users’ health profiles54,55,56; (2) Usefulness: assesses whether the recommendations can contribute to users’ health improvement57,58; (3) Actionability: evaluates how feasible the recommendations are to act on59; (4) Personalisation: reflects the alignment of the recommendations with individual user preferences and habits57,60. After users gave informed consent, their feedback was gathered via a structured questionnaire and interviews, with answers rated on a 1–5 scale for each metric.
Sentiment analysis was performed by classifying user feedback as positive, neutral, or negative using Claude 3.5 Sonnet61,62. Suggestions put forth by residents were classified as neutral. Sentiment analysis was stratified across 4 categories (Diet, Exercise, General, and Semantic Router), 3 ethnic groups (Chinese, Malay, Indian), 2 genders (Female, Male), and 2 health segments (Green, Orange). The “General” category encompasses user comments that could not be distinctly classified under Diet, Exercise, or Semantic Router, representing a holistic reflection on the overall personalisation experience offered by the application. This structured approach quantified user perceptions of HealthGuide@Home’s personalisation features.
The technical evaluation of HealthGuide@Home comprised an assessment of both the model’s performance and its answer quality.
The model’s performance was assessed across seven capabilities63,64,65,66,67,68,69,70 (Table S1): (C1) Elocution (Intelligence): The model’s ability to demonstrate coherent, natural, and contextually appropriate language use65; (C2) Function Calling: The model’s capacity to execute specific functions or tasks as requested; (C3) Instruction Following: The model’s adherence to provided instructions and guidelines63,64; (C4) Ability to Fine-tune: The model’s adaptability and responsiveness to fine-tuning or customisation; (C5) Ease of Use: The intuitiveness and user-friendliness of interacting with the model; (C6) Processing Speed: The model’s efficiency in generating outputs in a timely manner; (C7) Flexibility: The model’s ability to handle a diverse range of inputs and adapt to different contexts.
For C1, C2, C4, and C6, the scores were derived from the LMSYS and Berkeley Gorilla leaderboards71,72 and were linearly adjusted to a scale of 1.0–5.0 for each capability. For C1, C2, and C4, 1.0 and 5.0 represent the theoretical minimum (0) and maximum (1300) possible scores on the scale, respectively. For C6, 1.0 and 5.0 represent the “time to first token” of the slowest and fastest models in the ranking, respectively. For C5 and C7, ratings were assigned by the study researchers on a scale of 1.0 to 5.0 and the scores represent the team’s averaged consensus. C3 represents the capability of the model as a binary quality (possible or not possible).
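The linear adjustment described above amounts to a min-max rescaling onto the interval 1.0–5.0. A minimal sketch follows; the function names are hypothetical, and the example timing values in the usage note are illustrative rather than taken from the leaderboards.

```python
def rescale(value: float, low: float, high: float,
            new_low: float = 1.0, new_high: float = 5.0) -> float:
    """Map `value` linearly so that `low` -> new_low and `high` -> new_high."""
    if high == low:
        raise ValueError("low and high must differ")
    return new_low + (value - low) * (new_high - new_low) / (high - low)


def capability_score(raw_score: float) -> float:
    """C1, C2, C4: 0 maps to 1.0 and 1300 maps to 5.0, as stated in the text."""
    return rescale(raw_score, low=0.0, high=1300.0)


def speed_score(ttft: float, slowest_ttft: float, fastest_ttft: float) -> float:
    """C6: the slowest time to first token maps to 1.0, the fastest to 5.0.

    Passing the slowest time as `low` inverts the scale, so shorter
    times receive higher scores.
    """
    return rescale(ttft, low=slowest_ttft, high=fastest_ttft)
```

For example, a raw leaderboard score of 650 would map to the midpoint 3.0, and a model whose time to first token equals the fastest in the ranking would receive 5.0 regardless of the absolute timings.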
To evaluate the effectiveness of the personalised health plans generated by HealthGuide@Home, we used DeepEval, a specialised framework for assessing the quality of outputs from LLMs73. Two key metrics were assessed: (1) Faithfulness ensures factual accuracy by verifying recommendations against trusted sources74,75; (2) Answer relevancy measures alignment with user needs, avoiding generic responses76. Models were assessed for the responses generated for diet, exercise, and semantic routing. Each aspect underwent three iterations, and the average across these iterations was computed for faithfulness and answer relevancy.
To ascertain whether the users’ ratings on the Appropriateness, Usefulness, Actionability, and Personalisation of the recommended health plans are normally distributed, a Shapiro-Wilk test was performed (Table S9). The null hypothesis (H0) stated that the data was normally distributed. The W statistic and the p value were reported. The level of significance (α) was set to 0.05. The test was run using the open access statistics kingdom calculator77.
To assess the significance of the users’ ratings on the Appropriateness, Usefulness, Actionability, and Personalisation of the recommended health plans, a non-parametric one-sample one-tailed Wilcoxon signed rank test was performed (Table S4). The Wilcoxon signed rank test is a non-parametric test suitable for non-normally distributed data and small sample sizes. A median score of 3.00 was used as reference, representing the score of a neutral population (M0). The null hypothesis stated that the median (M) of the users’ ratings is equal to or lower than the population median (H0: M ≤ M0), while the alternative hypothesis stated that the users’ median is higher (H1: M > M0). The W+ statistic and the p value were reported. The level of significance (α) was set to 0.05. The test was run using the open access statistical software stats.blue78.
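For readers who wish to see the mechanics of the test described above, the following is a minimal pure-Python sketch of a one-sample, one-tailed (H1: M > M0) Wilcoxon signed rank test. It is not the stats.blue implementation: dropping ratings equal to M0, using average ranks for ties, and computing an exact p value conditional on the observed ranks are assumptions of this sketch, and other software may handle zeros and ties differently.

```python
from typing import List, Tuple


def wilcoxon_signed_rank_greater(ratings: List[float],
                                 m0: float = 3.0) -> Tuple[float, float]:
    """Return (W+, one-tailed exact p value) for H1: median > m0."""
    # Differences from the reference median; zero differences are dropped.
    diffs = [r - m0 for r in ratings if r != m0]
    n = len(diffs)
    if n == 0:
        raise ValueError("all ratings equal the reference median")

    # Rank absolute differences, assigning average ranks to ties.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # mean of 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1

    # W+ is the sum of ranks belonging to positive differences.
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)

    # Exact null distribution: every sign pattern is equally likely.
    # Ranks are doubled so that tied (half-integer) ranks become integers.
    doubled = [int(round(2 * r)) for r in ranks]
    total = sum(doubled)
    counts = [0] * (total + 1)  # counts[s] = sign patterns with 2*W+ == s
    counts[0] = 1
    for d in doubled:
        for s in range(total, d - 1, -1):
            counts[s] += counts[s - d]

    target = int(round(2 * w_plus))
    p_value = sum(counts[target:]) / 2 ** n
    return w_plus, p_value
```

The function would be called once per metric with the list of 1–5 ratings, for example `wilcoxon_signed_rank_greater(ratings, m0=3.0)`, and the returned p value compared against α = 0.05.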
To test the study hypotheses (H1) that (1) Residents seek highly personalised diet and exercise recommendations, (2) Residents require detailed guidance and would like to have a say in the recommendations, and (3) Residents require assurance that AI solutions are safe and trustworthy, a one sample one-tailed (H1: p > p0) proportion test was performed (Table 2). The observed proportions of residents that align with the hypotheses (p) were compared against an expected proportion of 0.5, representing the opinion of a neutral population (p0). The corresponding null hypotheses (H0: p ≤ p0) stated that (1) 50% or less of residents seek highly personalised diet and exercise recommendations, (2) 50% or less of residents require detailed guidance and would like to have a say in the recommendations, and (3) 50% or less of residents require assurance that AI solutions are safe and trustworthy.
Likewise, to assess the significance of the users’ positive sentiment on the diet, exercise, general, and semantic router comments across ethnic groups, genders, and health segments on the base and the refined plans, a one sample one-tailed proportion test was performed (Table S7). An expected positive proportion of 0.5 was used as reference, representing the sentiment of a neutral population (p0). The null hypothesis stated that the proportion of positive sentiment among users (p) is equal to or lesser than the reference proportion (H0: p ≤ p0), while the alternative hypothesis stated that the proportion of positive sentiment is higher than the reference (H1: p > p0).
For the one sample proportion test, the X-statistic and the p value were reported. The level of significance (α) was set to 0.05. The test was run using the open access statistics kingdom calculator79.
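As an illustration of the one sample one-tailed proportion test described above, the sketch below uses the normal approximation without continuity correction. This choice is an assumption: the text does not state which approximation or correction the online calculator applies, so the output of this sketch need not coincide with the values reported in the tables. The function name is hypothetical.

```python
import math
from typing import Tuple


def one_sample_proportion_test_greater(successes: int, n: int,
                                       p0: float = 0.5) -> Tuple[float, float]:
    """Return (z, one-tailed p value) for H0: p <= p0 against H1: p > p0.

    Uses the normal approximation with the standard error computed
    under the null hypothesis (assumption of this sketch).
    """
    if n <= 0 or not 0.0 < p0 < 1.0:
        raise ValueError("n must be positive and p0 strictly between 0 and 1")
    p_hat = successes / n
    standard_error = math.sqrt(p0 * (1.0 - p0) / n)
    z = (p_hat - p0) / standard_error
    # Upper-tail probability of the standard normal: P(Z >= z)
    p_value = 0.5 * math.erfc(z / math.sqrt(2.0))
    return z, p_value
```

A call such as `one_sample_proportion_test_greater(16, 20)` would test whether a proportion of 16 out of 20 exceeds the neutral reference of 0.5; the null hypothesis is rejected when the returned p value is below α = 0.05.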
The LLM engine for HealthGuide@Home was selected by assessing the model capabilities and answer quality of the 2 top-performing open-source and commercial LLMs at the time of this study, Llama 3 70B Instruct and Claude 3.5 Sonnet80,81. Claude 3.5 Sonnet outperformed Llama 3 across most capabilities, excelling in cognitive performance, function calling, instruction adherence, and flexibility, making it ideal for complex tasks. However, Llama 3 offered superior fine-tuning for domain-specific applications and faster processing speed, making it more suitable for real-time use (Table S1). In terms of answer quality, Claude 3.5 achieved a higher faithfulness score (0.903 vs. 0.887), suggesting more reliable and contextually accurate responses, while Llama 3 performed slightly better in answer relevancy (0.858 vs. 0.848), indicating stronger alignment with specific user preferences (Table S2). Based on these assessments, Claude 3.5 Sonnet was selected as the core LLM engine, as its enhanced capabilities establish a more robust basis for fulfilling HealthGuide@Home’s long-term development requirements.
The agentic framework for HealthGuide@Home was selected by implementing four agentic workflows82,83,84,85 and qualitatively assessing their capabilities in search, answer generation, feedback refinement, and synthesis86,87,88,89 (Table S3, Figs. S36). While LlamaIndex excels at modular control, AutoGen at collaborative interactions, and CrewAI at structured delegation, each framework displays limitations with real-time multi-stage refinement and adaptation, crucial for personalised health recommendations. LangGraph overcomes these limitations by using a stateful graph for dynamic task decomposition and continuous adaptation. Due to its scalability, modularity, and memory management, LangGraph was chosen as the optimal platform for developing HealthGuide@Home.
The core engine of HealthGuide@Home is an LLM-powered semantic router that systematically refines diet and exercise recommendations based on users’ preferences, dietary restrictions, and fitness goals. The semantic router interprets user inputs and refines them into sub-queries to direct them to relevant resources, following a four-journey process (Fig. 1): (1) Preference gathering, where the router collects initial user inputs; (2) Personalization, where it refines inputs to align recommendations with health and lifestyle needs; (3) Option discovery, where it suggests tailored alternatives based on the refined input; and (4) Feedback collection, where the router solicits user opinions and iteratively refines the recommendations.
The system first generates a base plan, which users can refine through three main options: (1) View Base Plan: Access the initial set of recommendations; (2) Personalize Base Plan: Modify elements based on health objectives, diet, or exercise needs; (3) Enquire About Diet/Exercise: Ask specific questions for tailored advice. Upon selection, the system enters a request processing phase, which dynamically adjusts the plan. The user can select one of the following options: (1) Refine a Single Recommendation: Adjusts one element of the plan; (2) Refine General Diet/Exercise Plans: Updates all meal or workout recommendations based on user preferences, dietary restrictions, or fitness goals; (3) Enquire About Diet/Exercise: Allows targeted questions and provides the answer accordingly. The semantic router orchestrates this process, ensuring progressive personalization through the four predefined user journeys.
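The routing step of the request processing phase can be pictured as a dispatcher that maps each user request to one of the three refinement options. The sketch below is purely illustrative: in HealthGuide@Home the intent is interpreted by an LLM, whereas here a keyword rule stands in for that step, and all names (`PlanState`, `classify_request`, `route`, the handler functions) are hypothetical rather than taken from the system's code.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class PlanState:
    """Hypothetical container for the current plan and gathered preferences."""
    diet: List[str]
    exercise: List[str]
    preferences: List[str] = field(default_factory=list)


def classify_request(text: str) -> str:
    """Keyword stand-in for the LLM-based intent interpretation (assumption)."""
    lowered = text.lower().strip()
    if lowered.endswith("?"):
        return "enquire"
    if "replace" in lowered or "swap" in lowered:
        return "refine_single"
    return "refine_general"


def refine_single(state: PlanState, text: str) -> str:
    return f"Adjusting one element of the plan: {text}"


def refine_general(state: PlanState, text: str) -> str:
    # A stated preference is stored so later refinements can reuse it.
    state.preferences.append(text)
    return f"Updating all recommendations for preference: {text}"


def enquire(state: PlanState, text: str) -> str:
    return f"Answering question: {text}"


HANDLERS: Dict[str, Callable[[PlanState, str], str]] = {
    "refine_single": refine_single,
    "refine_general": refine_general,
    "enquire": enquire,
}


def route(state: PlanState, text: str) -> Tuple[str, str]:
    """Classify the request and dispatch it to the matching handler."""
    intent = classify_request(text)
    return intent, HANDLERS[intent](state, text)
```

Keeping the classification and the handlers separate means the keyword rule could be replaced by a model call without changing the dispatch logic.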
Fig. 4 illustrates exercise plan generation and refinement in HealthGuide@Home. The process begins by updating parameters based on user inputs (preferences, activity history, or exercise attributes). Depending on the exercise type and Tier 1 (T1) health actions, the system conducts either a base search or a hybrid meta-search and retrieves information from the primary knowledge base (official guides) and from the open-source knowledge base (muscle groups). Finally, the LLM synthesizes both sources to generate a personalised exercise plan.
The HealthGuide@Home LLM-Agent system comprises two agent workflows: (1) Agent Workflow A (Exercise Generator), responsible for generating exercise plans; (2) Agent Workflow B (Exercise Refiner), which refines existing plans based on further customization. The architecture leverages shared functions across both workflows, with modules for parameter updating, search decision-making, and plan generation. Core components include an extractor for condition-specific parameters, a retriever for searching data, and a generator for creating and refining exercise routines. The system is highly modular, allowing the reuse of shared functions and the integration of new nodes, ensuring flexibility for personalised health plan recommendations.
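The reuse of shared functions across the two workflows can be sketched as follows. This is a simplified illustration and not the proprietary implementation: the knowledge bases are placeholder dictionaries, the generator is a plain function where the real system calls an LLM, and the rule that a requested muscle group triggers the hybrid search is an assumption, since the text does not specify the decision criterion. All names are hypothetical.

```python
from typing import Any, Dict, List

# Placeholder knowledge bases; entries are dummies, not real guide content.
OFFICIAL_GUIDES: Dict[str, List[str]] = {
    "general": ["official-guide exercise G1"],
    "diabetes": ["official-guide exercise D1"],
}
OPEN_SOURCE_BY_MUSCLE: Dict[str, List[str]] = {
    "legs": ["open-source exercise L1"],
    "core": ["open-source exercise C1"],
}


def update_parameters(params: Dict[str, Any],
                      user_input: Dict[str, Any]) -> Dict[str, Any]:
    """Shared step: merge new user inputs into the existing parameters."""
    merged = dict(params)
    merged.update({k: v for k, v in user_input.items() if v is not None})
    return merged


def decide_search(params: Dict[str, Any]) -> str:
    """Shared step: choose the search mode (assumed rule, see lead-in)."""
    return "hybrid" if params.get("muscle_group") else "base"


def retrieve(params: Dict[str, Any]) -> List[str]:
    """Shared step: query the official guides, plus open-source data if hybrid."""
    condition = params.get("condition", "general")
    results = list(OFFICIAL_GUIDES.get(condition, OFFICIAL_GUIDES["general"]))
    if decide_search(params) == "hybrid":
        results += OPEN_SOURCE_BY_MUSCLE.get(params["muscle_group"], [])
    return results


def generate_plan(params: Dict[str, Any], exercises: List[str]) -> Dict[str, Any]:
    """Shared step: stands in for the LLM synthesis of the retrieved sources."""
    return {"parameters": params, "exercises": exercises}


def workflow_a_generate(user_input: Dict[str, Any]) -> Dict[str, Any]:
    """Agent Workflow A (Exercise Generator): build a plan from scratch."""
    params = update_parameters({}, user_input)
    return generate_plan(params, retrieve(params))


def workflow_b_refine(plan: Dict[str, Any],
                      user_input: Dict[str, Any]) -> Dict[str, Any]:
    """Agent Workflow B (Exercise Refiner): update an existing plan."""
    params = update_parameters(plan["parameters"], user_input)
    return generate_plan(params, retrieve(params))
```

Both workflows differ only in their starting parameters, which is what allows the extractor, retriever, and generator steps to be shared between them.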
The datasets generated and/or analysed during the current study are not publicly available due to privacy and confidentiality concerns, but are available from the corresponding author on reasonable request.
The underlying code for this study is not publicly available for proprietary reasons. However, the code is available to Editors and referees upon request.
United Nations. World Population Prospects. https://population.un.org/wpp/ (2024).
The Lancet Regional Health Western Pacific. Healthier SG: For a healthier Singapore and beyond. Lancet Reg. Health West. Pac. 37, 100893 (2023).
Chua, G. Challenges confronting the practice of nursing in Singapore. Asia Pac. J. Oncol. Nurs. 7, 259–265 (2020).
Tan, C. C., Lam, C. S. P., Matchar, D. B., Zee, Y. K. & Wong, J. E. L. Singapore’s health-care system: Key features, challenges, and shifts. Lancet 398, 1091–1104 (2021).
Ministry of Health Singapore. Healthier SG. https://www.healthiersg.gov.sg/ (2023).
Knowler, W. C. et al. Reduction in the incidence of type 2 diabetes with lifestyle intervention or metformin. N. Engl. J. Med. 346, 393–403 (2002).
Li, G. et al. The long-term effect of lifestyle interventions to prevent diabetes in the China Da Qing Diabetes Prevention Study: A 20-year follow-up study. Lancet 371, 1783–1789 (2008).
Lindström, J. et al. Improved lifestyle and decreased diabetes risk over 13 years: Long-term follow-up of the randomised Finnish Diabetes Prevention Study (DPS). Diabetologia 56, 284–293 (2013).
Zahedani, A. D. et al. Digital health application integrating wearable data and behavioral patterns improves metabolic health. npj Digit. Med. 6, 216 (2023).
Thirunavukarasu, A. J. et al. Large language models in medicine. Nat. Med. 29, 1930–1940 (2023).
Yip, W. Improving primary healthcare with generative AI. Nat. Med. 30, 2727–2728 (2024).
Papastratis, I., Konstantinidis, D., Daras, P. & Dimitropoulos, K. AI nutrition recommendation using a deep generative model and ChatGPT. Sci. Rep. 14, 14620 (2024).
Maity, S. & Saikia, M. J. Large language models in healthcare and medical applications: A review. Bioengineering 12, 631 (2025).
Croxford, E. et al. Evaluating clinical AI summaries with large language models as judges. npj Digit. Med. 8, 640 (2025).
Abbasian, M., Azimi, I., Rahmani, A. M. & Jain, R. Conversational health agents: a personalized large language model-powered agent framework. JAMIA Open 8, ooaf067 (2025).
Barreda, M. et al. Transforming healthcare with chatbots: Uses and applications—A scoping review. Digit Health 11, 20552076251319176 (2025).
Huo, B. et al. Large Language Models for Chatbot Health Advice Studies: A Systematic Review. JAMA Netw. Open 8, e2457879 (2025).
Khasentino, J. et al. A personal health large language model for sleep and fitness coaching. Nat. Med. 31, 3394–3403 (2025).
Lai, X. et al. Using large language models to enhance exercise recommendations and physical activity in clinical and healthy populations: Scoping review. JMIR Med Inf. 13, e59309 (2025).
Gavai, A. K. & van Hillegersberg, J. AI-driven personalized nutrition: RAG-based digital health solution for obesity and type 2 diabetes. PLOS Digital Health 4, e0000758 (2025).
Ong, Q. C. et al. Advancing health coaching: A comparative study of large language model and health coaches. Artif. Intell. Med 157, 103004 (2024).
Ke, Y. H. et al. Clinical and economic impact of a large language model in perioperative medicine: a randomized crossover trial. npj Digit. Med. 8, 462 (2025).
Ong, J. C. L. et al. Large language model as clinical decision support system augments medication safety in 16 clinical specialties. Cell Rep. Med. 6, 102323 (2025).
Yao, S. et al. ReAct: Synergizing Reasoning and Acting in Language Models. 11th International Conference on Learning Representations, ICLR 2023 (2022).
Huang, W., Abbeel, P., Pathak, D. & Mordatch, I. Language models as zero-shot planners: Extracting actionable knowledge for embodied agents. Proc. Mach. Learn. Res. 162, 9118–9147 (2022).
Wang, Z. et al. Describe, explain, plan and select: Interactive planning with large language models enables open-world multi-task agents. Adv. Neural Inf. Process. Syst. 36, 34153–34189 (2023).
Yao, S. et al. Tree of thoughts: deliberate problem solving with large language models. Adv. Neural Inf. Process. Syst. 36, 11809–11822 (2023).
Hao, S. et al. Reasoning with Language Model is Planning with World Model. 2023 Conference on Empirical Methods in Natural Language Processing 10, 8154–8173 (2023).
Shinn, N., Cassano, F., Gopinath, A., Narasimhan, K. & Yao, S. Reflexion: Language Agents with Verbal Reinforcement Learning. Adv. Neural Inf. Process. Syst. 36, (2023).
Madaan, A. et al. Self-Refine: Iterative Refinement with Self-feedback. Proceedings of the 37th International Conference on Neural Information Processing Systems (2023).
Hsu, W., Han, S. X., Arnold, C. W., Bui, A. A. T. & Enzmann, D. R. A data-driven approach for quality assessment of radiologic interpretations. J. Am. Med. Inform. Assoc. 23, e152–e156 (2016).
Popejoy, L. L. et al. Quantifying care coordination using natural language processing and domain-specific ontology. J. Am. Med. Inform. Assoc. 22, e93–e103 (2015).
Xu, H. et al. Facilitating pharmacogenetic studies using electronic health records and natural-language processing: A case study of warfarin. J. Am. Med. Inform. Assoc. 18, 387–391 (2011).
Yang, H., Spasic, I., Keane, J. A. & Nenadic, G. A Text Mining Approach to the Prediction of Disease Status from Clinical Discharge Summaries. J. Am. Med. Inform. Assoc. 16, 596–600 (2009).
Abdulnazar, A., Roller, R., Schulz, S. & Kreuzthaler, M. Large Language Models for Clinical Text Cleansing Enhance Medical Concept Normalization. IEEE Access 12, https://doi.org/10.1109/ACCESS.2024.3472500 (2024).
van Buchem, M. M. et al. The digital scribe in clinical practice: A scoping review and research agenda. npj Digit. Med. 4, 57 (2021).
Bickmore, T. W. et al. Usability of conversational agents by patients with inadequate health literacy: Evidence from two clinical trials. J. Health Commun. 15, 197–210 (2010).
Rowe, J. P. & Lester, J. C. Artificial intelligence for personalized preventive adolescent healthcare. J. Adolesc. Health 67, S52–S58 (2020).
Zhang, Z. & Bickmore, T. Medical shared decision making with a virtual agent. Proceedings of the 18th International Conference on Intelligent Virtual Agents, IVA 2018 113–118; https://doi.org/10.1145/3267851.3267883 (2018).
Yang, E. et al. A behavioral science-informed agentic workflow for personalized nutrition coaching: Development and validation study. JMIR Form. Res 9, e75421 (2025).
Shin, D., Hsieh, G. & Kim, Y.-H. PlanFitting: Personalized exercise planning with large language model-driven conversational agent. Proc. 7th ACM Conf. Conversational User Interfaces 66, 1–19 (2023).
Czech, M. D. et al. Improved measurement of disease progression in people living with early Parkinson’s disease using digital health technologies. Commun. Med. 4, 49 (2024).
Tan, J. Y. et al. A cross sectional study of role of technology in health for middle-aged and older adults in Singapore. Sci. Rep. 14, 18645 (2024).
Smak Gregoor, A. M. et al. An artificial intelligence based app for skin cancer detection evaluated in a population based setting. npj Digit. Med. 6, 90, https://doi.org/10.1038/s41746-023-00831-w (2023).
Prochaska, J. O. & DiClemente, C. C. Stages and processes of self-change of smoking: Toward an integrative model of change. J. Consult. Clin. Psychol. 51, 390–395 (1983).
Luxton, D. D. Ethical implications of conversational agents in global public health. Bull. World Health Organ. 98, 285–287 (2020).
Abbasian, M. et al. Foundation metrics for evaluating effectiveness of healthcare conversations powered by generative AI. npj Digit. Med. 7, 82, https://doi.org/10.1038/s41746-024-01074-z (2024).
Reddy, S. et al. Evaluation framework to guide implementation of AI systems into healthcare settings. BMJ Health Care Inf. 28, 100444, https://doi.org/10.1136/bmjhci-2021-100444 (2021).
Health Promotion Board. Great things start when you MOVE IT! https://www.healthhub.sg/programmes/moveit.
Health Promotion Board. MOVE IT. https://www.hpb.gov.sg/healthy-living/physical-activity/move-it (2025).
Smart Nation and Digital Government Office, Ministry of Digital Development and Information Singapore. HealthHub. https://www.smartnation.gov.sg/initiatives/healthhub/ (2025).
Ministry of Health Singapore. HealthHub. https://www.healthhub.sg/.
Ministry of Health Singapore. Body Mass Index (BMI) Control. https://www.primarycarepages.sg/healthier-sg/care-protocols/preventive-health-care-protocols/body-mass-index-control (2025).
Büker, M. & Mercan, G. Readability, accuracy and appropriateness and quality of AI chatbot responses as a patient information source on root canal retreatment: A comparative assessment. Int. J. Med. Inform. 201, 105948 (2025).
Singh, S. U. & Namin, A. S. A survey on chatbots and large language models: Testing and evaluation techniques. Nat. Lang. Process. J. 10, 100128 (2025).
Chen, L., Preece, D. A., Sikka, P., Gross, J. J. & Krause, B. A framework for evaluating appropriateness, trustworthiness, and safety in mental wellness AI Chatbots. https://arxiv.org/pdf/2407.11387 (2024).
An, G. K. & Ngo, T. T. A. AI-powered personalized advertising and purchase intention in Vietnam’s digital landscape: The role of trust, relevance, and usefulness. J. Open Innov.: Technol. Mark. Complex. 11, 100580 (2025).
Bacchin, D., Pernice, G. F. A., Sardena, M., Malvestio, M. & Gamberini, L. Caregivers’ perceived usefulness of an IoT-based smart bed. Lect. Notes Computer Sci. 13325, 247–265 (2022).
Khabaz, K. et al. Assessment of artificial intelligence chatbot responses to common patient questions on bone sarcoma. J. Surg. Oncol. 131, 719 (2024).
Bucher, A. The patient experience of the future is personalized: Using technology to scale an N of 1 approach. J. Patient Exp. 10, 23743735231167976 (2023).
Rawat, M., Hosseini, S. E. & Pervez, S. Sentiment Analysis for Assessing Customer Satisfaction in Chatbot Service Encounters. 2023 16th International Conference on Developments in eSystems Engineering (DeSE) 10469554 (2023).
El-Ansari, A. & Beni-Hssane, A. Sentiment analysis for personalized chatbots in E-commerce applications. Wirel. Personal. Commun. 129, 1623–1644 (2023).
Zhou, J. et al. Instruction-following evaluation for large language models. https://arxiv.org/pdf/2311.07911 (2023).
Jing, Y. et al. FollowEval: A multi-dimensional benchmark for assessing the instruction-following capability of large language models. https://arxiv.org/pdf/2311.09829 (2023).
Boubdir, M. et al. Elo Uncovered: Robustness and Best Practices in Language Model Evaluation. Proceedings of the Third Workshop on Natural Language Generation, Evaluation, and Metrics (GEM) 339–352 (2023).
Ding, N. et al. Parameter-efficient fine-tuning of large-scale pre-trained language models. Nat. Mach. Intell. 5, 220–235 (2023).
Yuan, Z. et al. EfficientLLM: Efficiency in large language models. https://arxiv.org/pdf/2505.13840 (2025).
Miller, J. K. & Tang, W. Evaluating LLM metrics through real-world capabilities. https://arxiv.org/pdf/2505.08253v1 (2025).
Kate, K. et al. LongFuncEval: Measuring the effectiveness of long context models for function calling. https://arxiv.org/pdf/2505.10570 (2025).
Wang, J. et al. HammerBench: Fine-Grained Function-Calling Evaluation in Real Mobile Device Scenarios. Findings of the Association for Computational Linguistics: ACL 2025 3350–3376 (2025).
Yan, F. et al. Berkeley function calling leaderboard. https://gorilla.cs.berkeley.edu/leaderboard.html (2024).
Chatbot Arena. Chatbot Arena (formerly LMSYS): Free AI chat to compare & test best AI chatbots. https://lmarena.ai/.
Confident AI. DeepEval – The open-source LLM evaluation framework. https://docs.confident-ai.com/
Adlakha, V., Behnamghader, P., Lu, X. H., Meade, N. & Reddy, S. Evaluating Correctness and Faithfulness of Instruction-Following Models for Question Answering. Trans. Assoc. Comput Linguist 12, 775–793 (2024).
Zhang, W. et al. Towards Fine-Grained Citation Evaluation in Generated Text: A Comparative Analysis of Faithfulness Metrics. Proceedings of the 17th International Natural Language Generation Conference, 427–439 (2024).
Knollmeyer, S. et al. Benchmarking of Retrieval Augmented Generation: A Comprehensive Systematic Literature Review on Evaluation Dimensions, Evaluation Metrics and Datasets. Proc. 16th Int. Jt. Conf. Knowl. Discov., Knowl. Eng. Knowl. Manag. (IC3K 2024) 3, 137–148 (2024).
Shapiro-Wilk test calculator: normality calculator, Q-Q plot. https://www.statskingdom.com/shapiro-wilk-test-calculator.html.
Wilcoxon Signed Rank Test. https://stats.blue/Stats_Suite/wilcoxon_signed_rank_test.html.
Statistics Kingdom. One sample proportion test. https://www.statskingdom.com/111proportion_normal1.html.
Anthropic. Introducing Claude 3.5 Sonnet. https://www.anthropic.com/news/claude-3-5-sonnet (2024).
Llama Team, AI @ Meta. The Llama 3 Herd of Models. arXiv 2407.21783; https://doi.org/10.48550/arXiv.2407.21783 (2024).
LlamaIndex. Build AI Knowledge Assistants over your enterprise data. https://www.llamaindex.ai/ (2025).
Zhu, C. et al. AutoGen: An Automated Dynamic Model Generation Framework for Recommender System. WSDM 2023 – Proceedings of the 16th ACM International Conference on Web Search and Data Mining, 598–606 (2023).
CrewAI. The Leading Multi-Agent Platform. https://www.crewai.com/ (2025).
LangGraph. Balance agent control with agency. https://www.langchain.com/langgraph.
Yehudai, A. et al. Survey on Evaluation of LLM-based Agents. https://arxiv.org/pdf/2503.16416 (2025).
Zhu, H. et al. AutoLibra: Agent metric induction from open-ended human feedback. https://arxiv.org/pdf/2505.02820 (2025).
Yuksel, K. A., Castro Ferreira, T., Al-Badrashiny, M. & Sawaf, H. A Multi-AI Agent System for Autonomous Optimization of Agentic AI Solutions via Iterative Refinement and LLM-Driven Feedback Loops. Proceedings of the 1st Workshop for Research on Agent Language Models (REALM 2025) 52–62 (2025).
Gabriel, A. G., Alameer, A., Shankar, A. & Jeyakumar, K. Advancing agentic systems: dynamic task decomposition, tool integration and evaluation using novel metrics and dataset. https://arxiv.org/pdf/2410.22457 (2024).
No funding was received for this research.
These authors contributed equally: Han Leong Goh, Vicente Sancenon.
Synapxe, 1 N Buona Vista Link, #05-01 Elementum, 139691, Singapore, Singapore
Han Leong Goh, Vicente Sancenon, Benjamin M. X. Chu, Corryne N. Thng, Chia-Zhi Tan, David W. L. Chua & Andy W. A. Ta
MOH Office for Healthcare Transformation, 1 N Buona Vista Link, #09-02 Elementum, 139691, Singapore, Singapore
Gerald C. H. Koh
Health Promotion Board (HPB), 3 Second Hospital Ave, 168937, Singapore, Singapore
Leroy Koh
Ministry of Health (MOH), College of Medicine Building, 16 College Road, 169854, Singapore, Singapore
Delia Teo & Maybelline S. L. Ooi
H.L.G. and V.S. contributed equally to this work and drafted the manuscript. G.C.K., D.T., L.K., M.S.L.O., H.L.G., and B.M.X.C. conceived the study idea and conceptualised the model and evaluation strategy. B.M.X.C. and C.N.T. co-developed the model and performed the evaluation. V.S. conducted the statistical analyses and supplementary analyses. G.C.K. and L.K. provided clinical context to the analyses and discussion and contributed to manuscript revision. G.C.K., D.T., M.S.L.O., L.K., and C.Z.T. contributed policy insights and reviewed the analyses. A.W.A.T. and D.W.L.C. provided resources and overall supervision throughout the study. All authors reviewed and approved the final manuscript.
Correspondence to Han Leong Goh.
The authors declare no competing interests.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
Goh, H.L., Sancenon, V., Chu, B.M.X. et al. Personalised health plan development using agentic AI in Singapore’s national preventive care programme: a pilot study. npj Digit. Med. 9, 332 (2026). https://doi.org/10.1038/s41746-026-02514-8
DOI: https://doi.org/10.1038/s41746-026-02514-8
npj Digital Medicine (npj Digit. Med.)
ISSN 2398-6352 (online)
© 2026 Springer Nature Limited