| Topic domain (papers, n) | References |
| Prompt design | |
| Biomedical (17) | [ - ] |
| Medical licensing examination (12) | [ - ] |
| Clinical (general) (15) | [ - ] |
| Psychiatry (10) | [ , - ] |
| Oncology (5) | [ - ] |
| Cardiology (4) | [ - ] |
| Ophthalmology (3) | [ - ] |
| Neurology (3) | [ , , ] |
| Orthopedics (2) | [ , ] |
| Clinical trials (2) | [ , ] |
| Intensive care (2) | [ , ] |
| Geriatrics (2) | [ , ] |
| Radiology (2) | [ , ] |
| Nuclear medicine (1) | [ ] |
| Hepatology (1) | [ ] |
| Endocrinology (1) | [ ] |
| Plastic surgery (1) | [ ] |
| Gastroenterology (1) | [ ] |
| Genetics (1) | [ ] |
| Nursing (1) | [ ] |
| Prompt learning | |
| Biomedical (13) | [ - ] |
| Clinical (general) (15) | [ , , - ] |
| Psychiatry (1) | [ ] |
| Prompt tuning | |
| Biomedical (9) | [ , , , , , , , , ] |
| Clinical (general) (6) | [ , , , - ] |
| Oncology (2) | [ , ] |
| Psychiatry (1) | [ ] |
| Medical insurance (1) | [ ] |
Terminology Use
In our review, we investigated the consistency of terminology use around prompt engineering, particularly concerning its 3 paradigms: PD, PL, and PT. Across the papers, we tracked instances where the terminology was applied inconsistently with the definitions used in the literature and described in the introduction. Notably, PL was used to refer to PD 4 times [ 12 , 13 , 67 , 86 ] and to PT once [ 119 ], while PT was used 5 times to describe PL [ 88 , 96 , 97 , 99 , 114 ] and twice for PD [ 23 , 43 ]. Terminology inconsistencies were identified in only 12 studies; consequently, while some degree of inconsistency remains, a significant majority of 102 papers adhered to the commonly used definitions.
Language of Study
Considering the latest developments in NLP research encompassing languages beyond English [ 124 ], reporting the language of study is crucial. Several papers do not explicitly state the language of study. In some cases, the language can be inferred from prompt illustrations or examples. In the least informative cases, only the data set of the study is disclosed, indirectly hinting at the language.
Table 2 illustrates the language distribution among the selected papers, noting whether languages are explicitly mentioned, implicitly inferred from prompt illustrations, or simply not stated but implied from the used data set. The language used in 2 papers [ 60 , 68 ] remains unknown.
Language and type of venue | Stated^a, n (%) | Inferred^b, n (%) | Not stated^c, n (%) | Total, n (%) |
| English |
| All | 37 (32.5) | 48 (42.1) | 11 (9.6) | 96 (84.2) |
| Medical informatics | 16 (14) | 9 (7.9) | 2 (1.8) | 27 (23.7) |
| Computer science | 8 (7) | 18 (15.8) | 1 (0.9) | 27 (23.7) |
| Preprint | 9 (7.9) | 12 (10.5) | 5 (4.4) | 26 (22.8) |
| Clinical | 1 (0.9) | 8 (7) | 3 (2.6) | 12 (10.5) |
| Other | 3 (2.6) | 1 (0.9) | 0 (0) | 4 (3.5) |
| Chinese |
| All | 18 (15.8) | 0 (0) | 0 (0) | 18 (15.8) |
|
| All | 3 (2.6) | 0 (0) | 0 (0) | 3 (2.6) |
|
| All | 3 (2.6) | 0 (0) | 0 (0) | 3 (2.6) |
|
| All | 2 (1.8) | 0 (0) | 0 (0) | 2 (1.8) |
|
| All | 2 (1.8) | 0 (0) | 0 (0) | 2 (1.8) |
|
| All | 2 (1.8) | 0 (0) | 0 (0) | 2 (1.8) |
|
| All | 2 (1.8) | 0 (0) | 0 (0) | 2 (1.8) |
| Korean |
| All | 0 (0) | 0 (0) | 1 (0.9) | 1 (0.9) |
|
| All | 1 (0.9) | 0 (0) | 0 (0) | 1 (0.9) |
|
| All | 1 (0.9) | 0 (0) | 0 (0) | 1 (0.9) |
|
| All | 1 (0.9) | 0 (0) | 0 (0) | 1 (0.9) |
|
| All | 1 (0.9) | 0 (0) | 0 (0) | 1 (0.9) |
|
| All | 1 (0.9) | 0 (0) | 0 (0) | 1 (0.9) |
| Unknown |
| All | 0 (0) | 0 (0) | 2 (1.8) | 2 (1.8) |
^a Stated in the paper.
^b Inferred from prompt figures and examples.
^c Inferred from the data set.
Notably, English dominates, accounting for 84.2% (n=96) of the selected papers, followed by Chinese at 15.8% (n=18). Other languages are relatively rare, often appearing in studies featuring multiple languages. It is worth mentioning that languages other than English are usually explicitly stated, with the exception of a paper studying Korean [ 63 ]. In total, the language had to be inferred from prompt figures and examples in 48 papers, all in English.
Choice of LLMs
Given the diverse array of LLMs available, spanning general or medical, open-source or proprietary, and monolingual or multilingual models, alongside various architectural configurations (encoder, decoder, or both), our study investigates LLM selection across prompt paradigms.
Figure 3 outlines prevalent LLMs categorized by prompt paradigms, though it is not exhaustive and only includes commonly encountered architectures. For example, while encoder-decoder models are absent in PT in Figure 3 , there are a few instances where they are used [ 95 , 110 ].
ChatGPT’s popularity in PD is unsurprising, given its accessibility. Google’s models, PaLM and Bard (subsequently rebranded as Gemini), both closed models, are also prominent. Open-source instruction-tuned LLMs are used less frequently, most notably LLaMA-2-based models, with 7 occurrences.
In PL, encoder models (those following the BERT architecture) dominate, covering both general and specialized variants. Decoder models such as GPT-2 are occasionally used in PL-based tasks [ 103 , 105 ]. PT involves all model types, with a preference toward encoders. Further details on the models used are available in Multimedia Appendix 3 .
Topic Domain and NLP Task Trends
Figure 4 [ 16 , 20 , 26 , 41 , 47 , 88 - 123 ] illustrates the target tasks used in the PL and PT papers. PL-focused papers predominantly address classification-based tasks such as text classification, named entity recognition, and relation extraction, with text classification being particularly prominent. This aligns with the nature of PL, which centers around an MLM objective. Among other tasks, a study based on text generation [ 111 ] makes use of PL to predict masked tokens from partial patient records, aiming to generate synthetic electronic health records. Conversely, PT papers tend to exhibit a slightly broader range of tasks.
Figure 5 [ 10 - 87 ] presents the same analysis for PD-based papers. Unlike PL and PT, a prominent trend is that several studies focus on real-world board examinations. Notably, these studies predominantly center on tasks involving answering multiple-choice questions (MCQs). It is worth noting that although MCQs might be cast as a classification task, in practice they are handled as a generation task using causal LLMs. Interestingly, none of the selected PD papers address entity linking, despite the clear opportunity to leverage LLMs’ in-context learning ability for medical entity linking.
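As a minimal sketch of what casting an MCQ as a generation task looks like in practice, the following illustrative snippet formats a question for a causal LLM and parses the generated free text back to an option letter (the function names and regex-based parsing are our own, not taken from any reviewed paper):

```python
import re

def build_mcq_prompt(question, options):
    """Format a multiple-choice question as a generation prompt for a causal LLM."""
    lines = [question] + [f"{letter}. {text}" for letter, text in sorted(options.items())]
    lines.append("Answer with the letter of the correct option.")
    return "\n".join(lines)

def parse_answer(generation, options):
    """Extract the first standalone option letter from the model's free-text output."""
    match = re.search(r"\b([A-D])\b", generation)
    return match.group(1) if match and match.group(1) in options else None

opts = {"A": "Aspirin", "B": "Warfarin", "C": "Heparin", "D": "Clopidogrel"}
prompt = build_mcq_prompt("Which drug is a vitamin K antagonist?", opts)
print(parse_answer("The correct answer is B. Warfarin.", opts))  # → B
```

The parsing step is what distinguishes the generation casting from a classification casting: the label space is recovered from free text rather than read off a logit vector.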
Prompt Engineering Techniques
We extensively investigated the prompt techniques used: among PD papers, 49 studies used zero-shot prompting, 23 used few-shot prompting, and 10 used one-shot prompting. Few-shot prompting tends to outperform zero-shot prompting on MCQs, but its advantage is inconsistent across other NLP tasks. We provide a comprehensive summary of the existing techniques in Table 3 .
As shown in Table 3 , chain-of-thought (CoT) prompting [ 2 ] stands as the most common technique, followed by the persona pattern. In medical MCQs, various attempts with CoT can lead to different reasoning pathways and answers. Hence, to improve accuracy, 2 studies [ 19 , 20 ] used self-consistency, a method involving using multiple CoT prompts and selecting the most frequently occurring answer through voting.
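The voting step of self-consistency can be sketched as follows; `sample_answer` is a hypothetical stand-in for one stochastic CoT completion, not an actual LLM call:

```python
from collections import Counter

def self_consistency(sample_answer, n_samples=5):
    """Sample several chain-of-thought completions and keep the majority answer.

    Returns the winning answer and the fraction of samples that agreed with it.
    """
    votes = [sample_answer() for _ in range(n_samples)]
    answer, count = Counter(votes).most_common(1)[0]
    return answer, count / n_samples

# Simulated reasoning paths: three of five runs converge on the same answer.
answers = iter(["B", "C", "B", "B", "A"])
best, agreement = self_consistency(lambda: next(answers))
print(best, agreement)  # → B 0.6
```

The agreement ratio is a useful by-product: low agreement flags questions where the reasoning paths diverge and the voted answer is less trustworthy.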
Flipped interaction was used for simulation tasks, such as doctor-patient engagement [ 60 ] or to provide clinical training to medical students [ 81 ]. Emotion enhancement was applied in mental health contexts [ 58 , 60 ], allowing the LLM to produce emotional statements.
More innovative prompt engineering techniques include k-nearest neighbor few-shot prompting [ 19 ] and pseudoclassification prompting [ 78 ]. The former uses the k-nearest neighbor algorithm to select the k closest examples from a large annotated data set based on the input before including them in the prompt; the latter presents the LLM with all possible labels, asking the model to respond with a binary output for each. Despite its potential, the tree-of-thoughts pattern saw limited use, with only 1 instance found among the papers [ 77 ].
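The k-nearest neighbor selection step described above can be sketched as follows, assuming precomputed example embeddings (the toy vectors, labels, and helper names are ours, for illustration only):

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def knn_demonstrations(input_vec, pool, k=2):
    """Pick the k annotated examples whose embeddings are closest to the input."""
    ranked = sorted(pool, key=lambda ex: cosine(input_vec, ex["embedding"]), reverse=True)
    return ranked[:k]

pool = [
    {"text": "Chest pain on exertion", "label": "cardiology", "embedding": [0.9, 0.1, 0.0]},
    {"text": "Persistent low mood", "label": "psychiatry", "embedding": [0.0, 0.2, 0.9]},
    {"text": "Palpitations at rest", "label": "cardiology", "embedding": [0.8, 0.2, 0.1]},
]
chosen = knn_demonstrations([1.0, 0.0, 0.0], pool, k=2)
# Lay the selected examples out as few-shot demonstrations before the query.
prompt = "\n".join(f"{ex['text']} -> {ex['label']}" for ex in chosen) + "\nShortness of breath -> "
print([ex["label"] for ex in chosen])  # → ['cardiology', 'cardiology']
```

In practice the embeddings would come from a sentence encoder; the point is that demonstration selection is input-dependent rather than fixed.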
Prompt technique | Description | Prompt template examples | Papers, n | References |
Chain-of-thought (CoT) | Asking the large language model (LLM) to provide its reasoning before answering. | | 17 | [ , , , , , , , , , , , , , , , , ] |
Persona (role-defining) | Assigning the LLM a particular role to accomplish a task related to that role. | | 10 | [ , , , , - , , , ] |
Ensemble prompting | Using multiple independent prompts to answer the same question; the final output is decided by majority vote. | | 4 | [ , , , ] |
Scene-defining | Simulating a scene related to the addressed task. | | 3 | [ , , ] |
Prompt-chaining | Separating a task into multiple subtasks, each resolved with its own prompt. | | 3 | [ , , ] |
Flipped interaction | Having the LLM take the lead (eg, by asking questions) while the user interacts with it passively. | | 2 | [ , ] |
Emotion enhancement | Making the LLM express human-like emotions to a greater or lesser degree. | | 2 | [ , ] |
Prompt refinement | Using the LLM itself to refine the prompt, for example by translating or rephrasing it. | | 2 | [ , ] |
Retrieval-augmented generation | Combining an information retrieval component with a generative LLM: snippets extracted from documents are fed into the system along with the input prompt to generate an enriched output. | | 2 | [ , ] |
Self-consistency (CoT ensembling) | Ensemble prompting in which each prompt uses CoT; ideal when a problem has many possible reasoning paths. | | 2 | [ , ] |
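As a minimal illustration of how two of the techniques above combine in practice, the following sketch builds a persona-plus-CoT prompt; the template wording is our own, not drawn from any reviewed paper:

```python
def persona_cot_prompt(role, question):
    """Combine the persona pattern with chain-of-thought: assign the LLM a role,
    then ask for step-by-step reasoning before a clearly delimited final answer."""
    return (
        f"You are {role}.\n"
        f"Question: {question}\n"
        "Think through the problem step by step, then state your final answer "
        "on a line starting with 'Answer:'."
    )

p = persona_cot_prompt("an experienced clinical pharmacologist",
                       "Which anticoagulant is monitored with INR?")
print(p)
```

Delimiting the final answer (here with an `Answer:` line) is what keeps the free-form reasoning parseable downstream.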
Emerging Trends
Figure 6 illustrates a chronological polar pie chart of the selected papers and their citation connections, identifying 5 highly cited papers: (1) Agrawal et al [ 40 ] demonstrate GPT-3’s clinical task performance, especially in named entity recognition and relation extraction, through thorough PD. (2) Kung et al [ 36 ] evaluate ChatGPT’s (GPT-3.5) ability on the United States Medical Licensing Examination, shortly after the public release of ChatGPT. (3) Singhal et al [ 20 ] introduce the MultiMedQA and HealthSearchQA benchmarks. The paper also presents instruction PT for domain alignment, a novel paradigm that entails learning a soft prompt prepended to the LLM’s general instruction, which is usually written as a hard prompt. Applying this approach to FlanPaLM led to the development of Med-PaLM, improving question answering over FlanPaLM. (4) Nori et al [ 27 ] evaluate GPT-4 on the United States Medical Licensing Examination and MultiMedQA, surpassing previous state-of-the-art results, including GPT-3.5 and Med-PaLM. (5) Luo et al [ 26 ] release BioGPT, a fine-tuned variant of GPT-2 for biomedical tasks, achieving state-of-the-art results on 6 biomedical NLP tasks with suffix-based PT.
Trends in PD
As shown in Figure 6 , the PD paradigm presents multiple trends: all papers disseminated in clinical venues, as well as 27 of 33 (82%) of the encountered preprints, adhere to this paradigm. Furthermore, we observed a significant focus on frozen LLMs within PD. This trend is likely due to the frequent use of ChatGPT (74 instances, as depicted in Figure 3 ), despite OpenAI offering fine-tuning capabilities for the model. It is worth mentioning that 46 of 78 (59%) PD papers do not include any baseline, including human comparison. This gap is further explored in a subsequent section.
Trends in PL and PT
Among PL and PT papers, computer science and medical informatics are the most prevalent venues. Although PL has drawn attention to the idea of adapting the MLM objective to downstream tasks without further updating the LLM weights, many studies still opt to fine-tune their LLMs, with a nonnegligible number of them evaluating in few-shot settings [ 89 , 92 , 93 , 112 ]. Unlike PD, PL and PT papers usually include a baseline, often a traditionally fine-tuned version of the evaluated model [ 92 , 93 , 95 ], against which the novel prompt-based paradigms are compared. These studies reach a common conclusion: PL is a promising alternative to traditional fine-tuning in few-shot scenarios.
There are 2 ways of conducting PL: one involves filling in blanks within a text, known as cloze prompts, while the other consists of predicting masked tokens at the end of the sequence, referred to as prefix prompts. A distinct advantage of the latter approach is its compatibility with autoregressive models, as they exclusively predict the appended masks. Among the 29 PL papers, 21 (72%) propose cloze prompts, while 15 (52%) use prefix prompting. The involved NLP tasks are well-distributed across these 2 prompt patterns. Another crucial component of PL is the verbalizer. As PL revolves around predicting masked tokens, classification-based tasks require mapping manually selected relevant tokens to each class (manual verbalizer). Alternatively, some studies propose a soft verbalizer, akin to soft prompts, which automatically determines the most relevant token embedding for each label through training. Of the 29 PL papers selected, 16 (55%) studies explicitly mention the use of a manual verbalizer, while 2 explored both verbalizers to assess performance [ 101 , 110 ]. Only 1 exclusively used a soft verbalizer [ 89 ]. Another study does not use any verbalizer, as it focuses on generating synthetic data by filling in blanks [ 111 ]. Notably, 8 (28%) studies did not report their verbalizer methodology.
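A manual verbalizer for a cloze-style PL setup can be sketched as follows; `mask_probs` is a toy stand-in for an MLM's token distribution at the mask position, and the label words are illustrative:

```python
# Manual verbalizer: hand-picked label words for each class. A class is scored
# by summing the mask-fill probabilities of its words.
VERBALIZER = {
    "positive": ["improved", "better"],
    "negative": ["worsened", "deteriorated"],
}

def classify(mask_probs, verbalizer):
    """Map mask-position token probabilities to a class via the verbalizer."""
    scores = {
        label: sum(mask_probs.get(word, 0.0) for word in words)
        for label, words in verbalizer.items()
    }
    return max(scores, key=scores.get)

# Cloze-style template: the mask sits inside the text, not at the end.
cloze = "After treatment, the patient's condition has [MASK]."
toy_probs = {"improved": 0.41, "better": 0.12, "worsened": 0.08, "deteriorated": 0.03}
print(classify(toy_probs, VERBALIZER))  # → positive
```

A soft verbalizer replaces the hand-picked word lists with a trained embedding per label; the scoring interface stays the same.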
Hard prompts, used in both PD and PL, are manually crafted. In PT, by contrast, optimal prompts are obtained through soft prompting (ie, prompts trained on a training data set); yet, determining the appropriate soft prompt length remains unclear. In total, 5 of 19 (26%) PT studies tried various soft prompt lengths and reported the corresponding performances [ 26 , 105 , 118 , 119 , 122 ]. While there is no definitive optimal prompt length, a trend emerges: the optimal soft prompt length typically exceeds 10 tokens. Surprisingly, 8 (42%) papers omit the soft prompt length. Regarding the placement of soft prompts relative to the input and the mask, consensus is lacking. A total of 5 (26%) papers prepend the soft prompt at the input’s outset, while 4 (21%) append it as a suffix. One paper uses both strategies in a single prompt template [ 95 ]. Some innovative methods involve inserting a single soft prompt for each entity to be identified in entity-linking tasks or using token-wise soft prompts, where each token in the textual input is accompanied by a distinct soft prompt. The position of soft prompts remains unreported in 5 (26%) studies. Finally, the 6 (32%) studies that used mixed prompts [ 90 , 91 , 95 , 101 , 105 , 110 ] (a combination of hard and soft prompts) consistently report that mixed prompts outperform hard prompts alone.
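The prefix and suffix placements of a soft prompt can be sketched at the embedding level as follows; pure-Python lists stand in for embedding tensors, and the helper names are ours:

```python
import random

def make_soft_prompt(length, dim):
    """A soft prompt is a trainable matrix of `length` embedding vectors.
    Here it is merely randomly initialized; in PT these values would be learned."""
    return [[random.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(length)]

def attach(soft, input_embeddings, position="prefix"):
    """Prepend or append the soft prompt to the (frozen) input token embeddings."""
    if position == "prefix":
        return soft + input_embeddings
    if position == "suffix":
        return input_embeddings + soft
    raise ValueError(f"unknown position: {position}")

tokens = [[0.0] * 8 for _ in range(5)]      # 5 input tokens, embedding dim 8
soft = make_soft_prompt(length=12, dim=8)   # lengths above 10 tokens were the common trend
seq = attach(soft, tokens, position="prefix")
print(len(seq))  # → 17
```

Only the soft prompt matrix carries gradients during PT, which is why the approach needs far fewer trainable parameters than full fine-tuning.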
Baseline Comparison
Only 62 of the screened papers reported comparisons to established baselines. These include traditional deep learning approaches (eg, fine-tuning), classical machine learning algorithms (eg, logistic regression), naive systems (eg, majority class), or human annotation. The remaining papers solely explored prompt-related solutions, without baseline comparisons. Tables 4 - 6 trace the presence of a nonprompt baseline across prompt categories ( Table 4 ), paper sources ( Table 5 ), and the NLP tasks addressed ( Table 6 ).
Prompt category | No baseline, n (%) | Higher^a, n (%) | Similar, n (%) | Lower^a, n (%) | Total, n (%) |
Prompt design | 48 (42.1) | 13 (11.4) | 4 (3.5) | 13 (11.4) | 78 (68.4) |
Prompt learning | 5 (4.4) | 19 (16.7) | 3 (2.6) | 2 (1.8) | 29 (25.4) |
Prompt tuning | 3 (2.6) | 11 (9.6) | 2 (1.8) | 3 (2.6) | 19 (16.7) |
^a Higher or lower indicates that the performance of the proposed prompt-based approach is higher or lower than the baseline.
Type of venue | No baseline, n (%) | Higher, n (%) | Similar, n (%) | Lower, n (%) | Total, n (%) |
Medical informatics | 13 (11.4) | 16 (14) | 2 (1.8) | 2 (1.8) | 33 (28.9) |
Computer science | 7 (6.1) | 12 (10.5) | 3 (2.6) | 9 (7.9) | 31 (27.2) |
Preprint | 21 (18.4) | 6 (5.3) | 1 (0.9) | 5 (4.4) | 33 (28.9) |
Clinical | 13 (11.4) | 0 (0) | 0 (0) | 0 (0) | 13 (11.4) |
Other | 1 (0.9) | 2 (1.8) | 0 (0) | 1 (0.9) | 4 (3.5) |
NLP^a task | No baseline, n (%) | Higher^b, n (%) | Similar, n (%) | Lower^b, n (%) | Total, n (%) |
Text classification | 13 (11.4) | 18 (15.8) | 4 (3.5) | 11 (9.6) | 46 (40.4) |
Question answering | 13 (11.4) | 3 (2.6) | 1 (0.9) | 2 (1.8) | 19 (16.7) |
Relation extraction | 3 (2.6) | 10 (8.8) | 0 (0) | 3 (2.6) | 16 (14) |
Information extraction | 10 (8.8) | 3 (2.6) | 0 (0) | 2 (1.8) | 15 (13.2) |
Multiple-choice question | 10 (8.8) | 3 (2.6) | 1 (0.9) | 1 (0.9) | 15 (13.2) |
Named entity recognition | 4 (3.5) | 5 (4.4) | 1 (0.9) | 5 (4.4) | 15 (13.2) |
Text summarization | 7 (6.1) | 3 (2.6) | 0 (0) | 1 (0.9) | 11 (9.6) |
Reasoning | 5 (4.4) | 3 (2.6) | 0 (0) | 1 (0.9) | 9 (7.9) |
Generation | 5 (4.4) | 2 (1.8) | 0 (0) | 1 (0.9) | 8 (7) |
Entity linking | 0 (0) | 3 (2.6) | 0 (0) | 0 (0) | 3 (2.6) |
Coreference resolution | 1 (0.9) | 1 (0.9) | 0 (0) | 1 (0.9) | 3 (2.6) |
Decision support | 2 (1.8) | 0 (0) | 0 (0) | 1 (0.9) | 3 (2.6) |
Conversational | 3 (2.6) | 0 (0) | 0 (0) | 0 (0) | 3 (2.6) |
Text simplification | 1 (0.9) | 0 (0) | 0 (0) | 1 (0.9) | 2 (1.8) |
^a NLP: natural language processing.
^b Higher or lower indicates that the performance of the proposed prompt-based approach is higher or lower than the baseline.
Nonprompt-related baselines are often featured in studies focused on PL and PT, but rarely in PD. Additionally, PL and PT tend to perform better than their reported baselines, while PD tends to report less conclusive results. More specifically, among the 22 papers using either PL or PT with an identical fine-tuned model as a baseline, 17 indicate superior performance with the prompt-based approach, 3 observe comparable performance, and 2 note inferior performance.
Notably, papers from computer science venues tend to include more state-of-the-art baselines than those from medical informatics and clinical venues. Specifically, none of the 13 papers reviewed from clinical venues used any nonprompt baseline. Furthermore, there appears to be no consistent link between the type of NLP task and the omission of baselines, indicating that the decision to include baselines is driven more by the evaluation methodology than by feasibility.
Prompt Optimization
Numerous studies in the literature highlight the few-shot learning capabilities of LLMs, often referred to as “few-shot prompting,” wherein the models demonstrate proficiency in executing tasks from minimal demonstrations, typically provided through text prompts. However, it is crucial to acknowledge that the annotation cost associated with such frameworks may extend beyond the few annotated demonstrations within the prompt. Many studies claiming to explore few-shot or zero-shot learning through prompt engineering rely on extensive annotated validation data sets to refine PD and prompt formulation. This is, for example, the case in the paper that popularized the term “few-shot learning” [ 1 ]. Among the 45 analyzed papers focusing on few-shot or zero-shot learning, 5 explicitly detail the optimization of prompt formulation using extensive validation data sets. Conversely, 18 of these papers either do not engage in prompt optimization or test various prompts and document all results. Notably, 22 papers present results using only 1 prompt choice, without clarifying whether this choice was informed by additional validation data.
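The kind of validation-set prompt selection described above can be sketched as follows; the stand-in model and all names are illustrative, with a toy lookup table replacing real LLM calls:

```python
def pick_prompt(prompts, validation, run_model):
    """Score each candidate prompt on an annotated validation set and keep the best.
    `run_model(prompt, x)` stands in for an LLM call returning a predicted label."""
    def accuracy(prompt):
        hits = sum(run_model(prompt, x) == y for x, y in validation)
        return hits / len(validation)
    return max(prompts, key=accuracy)

# Toy stand-in model: prompt "B" happens to agree with the gold labels more often.
gold = [("t1", "yes"), ("t2", "no"), ("t3", "yes")]
fake = {("A", "t1"): "yes", ("A", "t2"): "yes", ("A", "t3"): "no",
        ("B", "t1"): "yes", ("B", "t2"): "no", ("B", "t3"): "yes"}
best = pick_prompt(["A", "B"], gold, lambda p, x: fake[(p, x)])
print(best)  # → B
```

The point of the sketch is the hidden cost: every `(x, y)` pair in `validation` is an annotated example, so a prompt selected this way is no longer genuinely zero-shot.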
Summary of the Findings
This scoping review aimed to map the current landscape of medical prompt engineering, identifying key themes, gaps, and trends within the existing literature. The primary findings of this study reveal a greater prevalence of PD over PL and PT, with ChatGPT dominating the PD domain. Additionally, many studies omit nonprompt-based baselines, do not specify the language of study, or exhibit a lack of consensus in PL (prefix vs cloze prompt) and PT settings (soft prompt lengths and positions). English is notably dominant as the language of study. These findings suggest that while the field is emerging, there is a pressing need for improved research practices.
Costs, Infrastructure, and LLMs in Clinical Settings
Prompt engineering techniques enable competitive performance in scenarios with limited or no resources, as well as in environments with low-cost computing infrastructure. As hospital data and computing infrastructure often fall into this category, these approaches hold great promise in the clinical field. Figure 6 shows the absence of PL- and PT-related work in clinical journals. This trend may stem from the widespread accessibility of ChatGPT, favoring PD-focused investigations. Despite efforts like OpenPrompt [ 125 ] to facilitate PL and PT work, the programming barrier likely deters clinical practitioners. Surprisingly, 7 papers use ChatGPT with sensitive clinical data. Despite the recent availability of ChatGPT Enterprise with GPT-4 for secure data handling, most of these studies evidently did not use this feature, as they relied on GPT-3.5. The limited use of local LLMs, especially LLaMA-based models, suggests a need for their increased adoption in future clinical PD studies; this scarcity may be due to clinicians’ limited computational infrastructure.
Prompt Engineering Techniques Effectiveness in Medical Research
Among the documented prompt engineering techniques, the effectiveness of few-shot prompting compared to zero-shot prompting varies by task and scenario. However, CoT shows superior reasoning performance, compelling LLMs to present reasoning pathways and consistently outperforming zero-shot and few-shot methods across PD studies. Its ensemble-based variant, self-consistency, consistently outperforms plain CoT. Despite the persona pattern’s frequent use, there is a lack of ablation studies on its impact on medical task performance, with only 1 paper reporting negligible improvement [ 61 ]. Prompt engineering is an emerging field of study that still needs to prove its efficacy: almost half of the papers focused only on prompt engineering and failed to report any nonprompt-related baseline performance, despite the availability of such baselines for the NLP tasks addressed. On the whole, the results are far from systematically favoring LLM-based methods, greatly tempering the widely commented impression of a technological breakthrough. Selecting a baseline remains a necessary step toward understanding the actual impact of prompt engineering.
Bender Rule
Regarding languages, while Table 2 shows the dominance of English in the medical literature, many papers studying English fail to explicitly mention the language of study. This oversight is more prevalent in computer science and clinical venues, whereas medical informatics exhibits a more favorable trend, as confirmed by a chi-square test yielding a P value of .02 (Table S1 in Multimedia Appendix 2 ). Notably, languages such as Chinese are consistently mentioned across the 18 selected papers. In other words, the Bender rule, namely “always name the language(s) you are working on,” seems to be well respected only for languages other than English. This finding has already been documented for NLP research in general [ 126 ].
Fine-Tuning Versus Prompt-Based Approaches
While traditional LLM fine-tuning remains a viable method for various NLP tasks, PL and PT are competitive alternatives, particularly in resource-constrained and low computational scenarios. PL, leveraging predefined prompts to guide model behavior, offers an efficient approach in low-to-no-resource environments. Conversely, PT emerges as a viable solution in low computational scenarios, as it requires substantially fewer trainable parameters than traditional fine-tuning. Since neither prompt-based approach requires further training of the LLM itself, both are less prone to catastrophic forgetting [ 127 ].
Recommendations for Future Medical Prompt–Based Studies
For future research in prompt engineering, we propose several recommendations aimed at improving research quality, reporting, and reproducibility. From this review, we identified several trends, such as the computational advantages of prompt-based approaches, the frequent absence of baseline evaluations, and the scarcity of ablation studies assessing the performance of prompting strategies. Some studies do not clearly state the prompt engineering choices they made. For instance, in PL, choices range from cloze to prefix prompting and from manual to soft verbalizers. Similarly, PT is characterized by soft prompt configurations, such as length and position. To clarify these distinctions and enhance methodological transparency and reproducibility in future research, we have developed reporting guidelines, available in Textbox 1 . Adhering to these guidelines will contribute to advancing prompt engineering methodologies and their practical applications in the medical field.
General reporting recommendations
- For sensitive data, local large language models (LLMs) should be preferred to those accessed through an application programming interface or a web service.
- The language of the study used should be explicitly stated.
- The mention of whether the LLM undergoes fine-tuning should be made explicit.
- The prompt optimization process and results should be documented to ensure transparency, whether it is through different tested manual prompts or through a validation data set.
- The terms “few-shot,” “one-shot,” and “zero-shot” should not be used in settings where the prompts have been optimized on annotated examples.
- Experiments should include baseline comparisons or at least mention existing results, particularly when data sets originate from previous medical challenges or benchmarks.
Specific to prompt learning and prompt tuning
- Concepts (such as prompt learning and prompt tuning) should be defined and used consistently with the consensus.
- In prompt learning experiments, the verbalizer used (soft or hard) should be explicitly specified, or a clear justification should be provided if the verbalizer is omitted. Additionally, whether the prompt template follows the cloze or the prefix format should be mentioned.
- In prompt tuning experiments, authors should provide details on soft prompt positions, length, and any variations tested, such as incorporating hard or mixed prompts, as part of the ablation study.
Limitations
A limitation was the large number of papers retrieved during the initial search, which we addressed by limiting the search scope to titles, abstracts, and keywords. Furthermore, since some studies may apply prompt engineering techniques without mentioning any of the 4 prompt-related expressions used in our queries, they might have been missed by our searches.
Conclusions
Medical prompt engineering is an emerging field with significant potential for enhancing clinical applications, particularly in resource-constrained environments. Despite the promising capabilities demonstrated, there is a pressing need for standardized research practices and comprehensive reporting to ensure methodological transparency and reproducibility. Consistent evaluation against nonprompt-based baselines, prompt optimization documentation, and prompt settings reporting will be crucial for advancing the field. We hope that a better adherence to the recommended guidelines, in Textbox 1 , will improve our understanding of prompt engineering and enhance the capabilities of LLMs in health care.
Acknowledgments
JZ is financed by the NCCR Evolving Language, a National Centre of Competence in Research, funded by the Swiss National Science Foundation (grant # 51NF40_180888).
Authors' Contributions
JZ and MN performed the screening and data extraction of the papers and synthesized the findings. AN and XT supervised MN. MB and CL supervised JZ. JZ and MN wrote the manuscript with support from MB, AN, XT, and CL. All authors contributed to the analysis of the results. CL conceived the original idea.
Conflicts of Interest
CL is the editor-in-chief of JMIR Medical Informatics . All other authors have no conflict of interest to declare.
PRISMA-ScR (Preferred Reporting Items for Systematic Reviews and Meta-Analyses extension for Scoping Reviews) checklist.
Search strategy and statistical analysis.
Reading notes and details of the reviewed papers.
- Brown T, Mann B, Ryder N, Subbiah M, Kaplan JD, Dhariwal P, et al. Language models are few-shot learners. 2020. Presented at: Advances in Neural Information Processing Systems; December 6, 2020:1877-1901; Virtual. URL: https://papers.nips.cc/paper/2020/hash/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html
- Kojima T, Gu SS, Reid M, Matsuo Y, Iwasawa Y. Large language models are zero-shot reasoners. 2022. Presented at: Advances in Neural Information Processing Systems; November 28, 2022:22199-22213; New Orleans. URL: https://papers.nips.cc/paper_files/paper/2022/hash/8bb0d291acd4acf06ef112099c16f326-Abstract-Conference.html
- Liu P, Yuan W, Fu J, Jiang Z, Hayashi H, Neubig G. Pre-train, prompt, and predict: a systematic survey of prompting methods in natural language processing. ACM Comput Surv. Jan 16, 2023;55(9):1-35. [ CrossRef ]
- White J, Fu Q, Hays S, Sandborn M, Olea C, Gilbert H, et al. A prompt pattern catalog to enhance prompt engineering with ChatGPT. ArXiv. Preprint posted online on February 21, 2023. [ FREE Full text ] [ CrossRef ]
- Névéol A, Dalianis H, Velupillai S, Savova G, Zweigenbaum P. Clinical natural language processing in languages other than English: opportunities and challenges. J Biomed Semantics. 2018;9(1):12. [ FREE Full text ] [ CrossRef ] [ Medline ]
- Luccioni AS, Rogers A. Mind your language (model): fact-checking LLMs and their role in NLP research and practice. ArXiv. Preprint posted online on June 1, 2024. [ CrossRef ]
- Thirunavukarasu AJ, Ting DSJ, Elangovan K, Gutierrez L, Tan TF, Ting DSW. Large language models in medicine. Nat Med. 2023;29(8):1930-1940. [ CrossRef ] [ Medline ]
- Zhao WX, Zhou K, Li J, Tang T, Wang X, Hou Y, et al. A survey of large language models. ArXiv. Preprint posted online on November 24, 2023. [ FREE Full text ] [ CrossRef ]
- Lester B, Al-Rfou R, Constant N. The power of scale for parameter-efficient prompt tuning. 2021. Presented at: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing; January 10, 2021:3045-3059; Online and Punta Cana, Dominican Republic. [ CrossRef ]
- Fries J, Weber L, Seelam N, Altay G, Datta D, Garda S, et al. BigBIO: a framework for data-centric biomedical natural language processing. 2022. Presented at: Advances in Neural Information Processing Systems; November 28, 2022:25792-25806; New Orleans. URL: https://proceedings.neurips.cc/paper_files/paper/2022/hash/a583d2197eafc4afdd41f5b8765555c5-Abstract-Datasets_and_Benchmarks.html
- Weisenthal SJ. ChatGPT and post-test probability. ArXiv. Preprint posted online on July 20, 2024. [ FREE Full text ] [ CrossRef ]
- Li L, Ning W. ProBioRE: a framework for biomedical causal relation extraction based on dual-head prompt and prototypical network. 2023. Presented at: 2023 IEEE International Conference on Bioinformatics and Biomedicine (BIBM); December 5, 2023:2071-2074; Istanbul, Turkiye. URL: https://tinyurl.com/3n45uwdb
- Li Z, Belkadi S, Micheletti N, Han L, Shardlow M, Nenadic G. Large language models and control mechanisms improve text readability of biomedical abstracts. ArXiv. Preprint posted online on March 16, 2024. [ FREE Full text ] [ CrossRef ]
- Li Q, Yang X, Wang H, Liu L, Wang Q, Wang J, et al. From beginner to expert: modeling medical knowledge into general LLMs. ArXiv. Preprint posted online on January 7, 2024. [ FREE Full text ] [ CrossRef ]
- Ateia S, Kruschwitz U. Is ChatGPT a biomedical expert? 2023. Presented at: Working Notes of the Conference and Labs of the Evaluation Forum (CLEF 2023); September 18-21, 2023:73-90; Thessaloniki, Greece. URL: https://ceur-ws.org/Vol-3497/paper-006.pdf
- Belyaeva A, Cosentino J, Hormozdiari F, Eswaran K, Shetty S, Corrado G, et al. Multimodal LLMs for health grounded in individual-specific data. 2023. Presented at: Machine Learning for Multimodal Healthcare Data; July 29, 2023:86-102; Honolulu, Hawaii, United States. [ CrossRef ]
- Chen Q, Sun H, Liu H, Jiang Y, Ran T, Jin X, et al. An extensive benchmark study on biomedical text generation and mining with ChatGPT. Bioinformatics. 2023;39(9):btad557. [ FREE Full text ] [ CrossRef ] [ Medline ]
- Mollá D. Large language models and prompt engineering for biomedical query focused multi-document summarisation. ArXiv. Preprint posted online on November 9, 2023. [ FREE Full text ] [ CrossRef ]
- Nori H, Lee YT, Zhang S, Carignan D, Edgar R, Fusi N, et al. Can generalist foundation models outcompete special-purpose tuning? Case study in medicine. ArXiv. Preprint posted online on November 28, 2023. [ FREE Full text ] [ CrossRef ]
- Singhal K, Azizi S, Tu T, Mahdavi SS, Wei J, Chung HW, et al. Large language models encode clinical knowledge. Nature. 2023;620(7972):172-180. [ FREE Full text ] [ CrossRef ] [ Medline ]
- Tian S, Jin Q, Yeganova L, Lai PT, Zhu Q, Chen X, et al. Opportunities and challenges for ChatGPT and large language models in biomedicine and health. Brief Bioinform. 2023;25(1):bbad493. [ FREE Full text ] [ CrossRef ] [ Medline ]
- Lim S, Schmälzle R. Artificial intelligence for health message generation: an empirical study using a large language model (LLM) and prompt engineering. Front Commun. 2023;8:1129082. [ CrossRef ]
- Wu YH, Lin YJ, Kao HY. IKM_Lab at BioLaySumm Task 1: longformer-based prompt tuning for biomedical lay summary generation. 2023. Presented at: The 22nd Workshop on Biomedical Natural Language Processing and BioNLP Shared Tasks; July 13, 2023; Toronto, Canada. [ CrossRef ]
- Zhang W, Chen C, Wang J, Liu J, Ruan T. A co-adaptive duality-aware framework for biomedical relation extraction. Bioinformatics. 2023;39(5):btad301. [ FREE Full text ] [ CrossRef ] [ Medline ]
- Chen P, Wang J, Lin H, Zhao D, Yang Z. Few-shot biomedical named entity recognition via knowledge-guided instance generation and prompt contrastive learning. Bioinformatics. 2023;39(8):btad496. [ FREE Full text ] [ CrossRef ] [ Medline ]
- Luo R, Sun L, Xia Y, Qin T, Zhang S, Poon H, et al. BioGPT: generative pre-trained transformer for biomedical text generation and mining. Brief Bioinform. 2022;23(6):bbac409. [ CrossRef ] [ Medline ]
- Nori H, King N, McKinney SM, Carignan D, Horvitz E. Capabilities of GPT-4 on medical challenge problems. ArXiv. Preprint posted online on April 12, 2023. [ FREE Full text ] [ CrossRef ]
- Heinz MV, Bhattacharya S, Trudeau B, Quist R, Song SH, Lee CM, et al. Testing domain knowledge and risk of bias of a large-scale general artificial intelligence model in mental health. Digit Health. 2023;9:20552076231170499. [ FREE Full text ] [ CrossRef ] [ Medline ]
- Ting Y, Hsieh T, Wang Y, Kuo Y, Chen Y, Chan P, et al. Performance of ChatGPT incorporated chain-of-thought method in bilingual nuclear medicine physician board examinations. Digit Health. 2024;10:20552076231224074. [ FREE Full text ] [ CrossRef ] [ Medline ]
- Casola S, Labruna T, Lavelli A, Magnini B. Testing ChatGPT for stability and reasoning: a case study using Italian medical specialty tests. 2023. Presented at: Proceedings of the 9th Italian Conference on Computational Linguistics; November 30-December 2, 2023; Venice, Italy. URL: https://ceur-ws.org/Vol-3596/paper13.pdf
- Roemer G, Li A, Mahmood U, Dauer L, Bellamy M. Artificial intelligence model GPT4 narrowly fails simulated radiological protection exam. J Radiol Prot. 2024;44(1):013502. [ CrossRef ] [ Medline ]
- Ali S, Shahab O, Al Shabeeb R, Ladak F, Yang JO, Nadkarni G, et al. General purpose large language models match human performance on gastroenterology board exam self-assessments. MedRxiv. Preprint posted online on September 25, 2023. [ FREE Full text ] [ CrossRef ]
- Patel D, Raut G, Zimlichman E, Cheetirala S, Nadkarni G, Glicksberg BS, et al. The limits of prompt engineering in medical problem-solving: a comparative analysis with ChatGPT on calculation based USMLE medical questions. MedRxiv. Preprint posted online on August 9, 2023. [ FREE Full text ] [ CrossRef ]
- Sallam M, Al-Salahat K, Eid H, Egger J, Puladi B. Human versus artificial intelligence: ChatGPT-4 outperforming Bing, Bard, ChatGPT-3.5, and humans in clinical chemistry multiple-choice questions. MedRxiv. Preprint posted online on January 9, 2024. [ FREE Full text ] [ CrossRef ]
- Savage T, Nayak A, Gallo R, Rangan E, Chen JH. Diagnostic reasoning prompts reveal the potential for large language model interpretability in medicine. NPJ Digit Med. 2024;7(1):20. [ FREE Full text ] [ CrossRef ] [ Medline ]
- Kung TH, Cheatham M, Medenilla A, Sillos C, De Leon L, Elepaño C, et al. Performance of ChatGPT on USMLE: potential for AI-assisted medical education using large language models. PLOS Digit Health. 2023;2(2):e0000198. [ FREE Full text ] [ CrossRef ] [ Medline ]
- Tanaka Y, Nakata T, Aiga K, Etani T, Muramatsu R, Katagiri S, et al. Performance of generative pretrained transformer on the national medical licensing examination in Japan. PLOS Digit Health. 2024;3(1):e0000433. [ FREE Full text ] [ CrossRef ] [ Medline ]
- Rosoł M, Gąsior JS, Łaba J, Korzeniewski K, Młyńczak M. Evaluation of the performance of GPT-3.5 and GPT-4 on the Polish medical final examination. Sci Rep. 2023;13(1):20512. [ FREE Full text ] [ CrossRef ] [ Medline ]
- Sivarajkumar S, Kelley M, Samolyk-Mazzanti A, Visweswaran S, Wang Y. An empirical evaluation of prompting strategies for large language models in zero-shot clinical natural language processing. ArXiv. Preprint posted online on September 14, 2023. [ FREE Full text ] [ CrossRef ]
- Agrawal M, Hegselmann S, Lang H, Kim Y, Sontag D. Large language models are few-shot clinical information extractors. 2022. Presented at: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing; May 25, 2022:1998-2022; Abu Dhabi. [ CrossRef ]
- Dong B, Wang Z, Li Z, Duan Z, Xu J, Pan T, et al. Toward a stable and low-resource PLM-based medical diagnostic system via prompt tuning and MoE structure. Sci Rep. 2023;13(1):12595. [ FREE Full text ] [ CrossRef ] [ Medline ]
- Gutierrez KLT, Viacrusis PML. Bridging the gap or widening the divide: a call for capacity-building in artificial intelligence for healthcare in the Philippines. JMUST. 2023;7(2):1325-1334. [ CrossRef ]
- Islam KS, Nipu AS, Madiraju P, Deshpande P. Autocompletion of chief complaints in the electronic health records using large language models. 2023. Presented at: 2023 IEEE International Conference on Big Data (BigData); December 15-18, 2023:4912-4921; Sorrento, Italy. URL: https://tinyurl.com/4ajdyddt [ CrossRef ]
- Meoni S, Ryffel T, De La Clergerie É. Annotate French clinical data using large language model predictions. 2023. Presented at: 2023 IEEE 11th International Conference on Healthcare Informatics (ICHI); June 26-29, 2023:550-557; Houston, TX, United States. URL: https://tinyurl.com/yy2b9fe8 [ CrossRef ]
- Meoni S, De la Clergerie E, Ryffel T. Large language models as instructors: a study on multilingual clinical entity extraction. 2023. Presented at: The 22nd Workshop on Biomedical Natural Language Processing and BioNLP Shared Tasks; July 2023:178-190; Toronto, Canada. URL: https://aclanthology.org/2023.bionlp-1.15/ [ CrossRef ]
- Wang X, Yang Q. LingX at ROCLING 2023 MultiNER-health task: intelligent capture of Chinese medical named entities by LLMs. 2023. Presented at: Proceedings of the 35th Conference on Computational Linguistics and Speech Processing (ROCLING 2023); October 20-21, 2023; Taipei City, Taiwan. URL: https://aclanthology.org/2023.rocling-1.44.pdf
- Yang Y, Li X, Wang H, Guan Y, Jiang J. Modeling clinical thinking based on knowledge hypergraph attention network and prompt learning for disease prediction. SSRN. Preprint posted online on June 30, 2023. [ FREE Full text ] [ CrossRef ]
- Yao Z, Jaafar A, Wang B, Zhu Y, Yang Z, Yu H. Do physicians know how to prompt? The need for automatic prompt optimization help in clinical note generation. ArXiv. Preprint posted online on July 5, 2024. [ FREE Full text ] [ CrossRef ]
- van Zandvoort D, Wiersema L, Huibers T, van Dulmen S, Brinkkemper S. Enhancing summarization performance through transformer-based prompt engineering in automated medical reporting. ArXiv. Preprint posted online on January 19, 2024. [ FREE Full text ] [ CrossRef ]
- Zhang B, Mishra R, Teodoro D. DS4DH at MEDIQA-Chat 2023: leveraging SVM and GPT-3 prompt engineering for medical dialogue classification and summarization. 2023. Presented at: Proceedings of the 5th Clinical Natural Language Processing Workshop; June 12, 2023:536-545; Toronto, Canada. [ CrossRef ]
- Zhu W, Wang X, Chen M, Tang B. Overview of the PromptCBLUE Shared Task in CHIP2023. ArXiv. Preprint posted online on December 29, 2023. [ FREE Full text ] [ CrossRef ]
- Caruccio L, Cirillo S, Polese G, Solimando G, Sundaramurthy S, Tortora G. Can ChatGPT provide intelligent diagnoses? A comparative study between predictive models and ChatGPT to define a new medical diagnostic bot. Expert Syst Appl. 2024;235:121186. [ CrossRef ]
- Lee Y, Chen C, Chen C, Lee C, Chen P, Wu C, et al. Unlocking the secrets behind advanced artificial intelligence language models in deidentifying Chinese-English mixed clinical text: development and validation study. J Med Internet Res. 2024;26:e48443. [ FREE Full text ] [ CrossRef ] [ Medline ]
- Bhaumik R, Srivastava V, Jalali A, Ghosh S, Chandrasekaran R. Mindwatch: a smart cloud-based AI solution for suicide ideation detection leveraging large language models. MedRxiv. Preprint posted online on September 26, 2023. [ FREE Full text ] [ CrossRef ]
- Heston TF. Safety of large language models in addressing depression. Cureus. 2023;15. [ FREE Full text ] [ CrossRef ]
- Grabb D. The impact of prompt engineering in large language model performance: a psychiatric example. J Med Artif Intell. 2023;6:20. [ FREE Full text ] [ CrossRef ]
- Santos WR, Paraboni I. Prompt-based mental health screening from social media text. ArXiv. Preprint posted online on May 11, 2024. [ FREE Full text ] [ CrossRef ]
- Yang K, Ji S, Zhang T, Xie Q, Kuang Z, Ananiadou S. Towards interpretable mental health analysis with large language models. 2023. Presented at: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing; January 16, 2023:6056-6077; Singapore. [ CrossRef ]
- Xu X, Yao B, Dong Y, Gabriel S, Yu H, Hendler J, et al. Mental-LLM: leveraging large language models for mental health prediction via online text data. Proc ACM Interact Mob Wearable Ubiquitous Technol. 2024;8(1):1-32. [ CrossRef ]
- Chen S, Wu M, Zhu KQ, Lan K, Zhang Z, Cui L. LLM-empowered chatbots for psychiatrist and patient simulation: application and evaluation. ArXiv. Preprint posted online on May 23, 2023. [ FREE Full text ] [ CrossRef ]
- Qi H, Zhao Q, Li J, Song C, Zhai W, Dan L, et al. Supervised learning and large language model benchmarks on mental health datasets: cognitive distortions and suicidal risks in Chinese social media. ResearchSquare. Preprint posted online on November 2, 2023. [ FREE Full text ] [ CrossRef ]
- Sambath V. Advancements of artificial intelligence in mental health applications: a comparative analysis of ChatGPT 3.5 and ChatGPT 4. ResearchGate. Preprint posted online on December 2023. [ FREE Full text ] [ CrossRef ]
- Choi HS, Song JY, Shin KH, Chang JH, Jang B. Developing prompts from large language model for extracting clinical information from pathology and ultrasound reports in breast cancer. Radiat Oncol J. 2023;41(3):209-216. [ FREE Full text ] [ CrossRef ] [ Medline ]
- Lee DT, Vaid A, Menon KM, Freeman R, Matteson DS, Marin MP, et al. Development of a privacy preserving large language model for automated data extraction from thyroid cancer pathology reports. MedRxiv. Preprint posted online on November 8, 2023. [ FREE Full text ] [ CrossRef ]
- Dennstädt F, Hastings J, Putora PM, Vu E, Fischer GF, Süveg K, et al. Exploring capabilities of large language models such as ChatGPT in radiation oncology. Adv Radiat Oncol. 2024;9(3):101400. [ FREE Full text ] [ CrossRef ] [ Medline ]
- Zhu S, Gilbert M, Ghanem AI, Siddiqui F, Thind K. Feasibility of using zero-shot learning in transformer-based natural language processing algorithm for key information extraction from head and neck tumor board notes. Int J Radiat Oncol Biol Phys. 2023;117(2):e500. [ CrossRef ]
- Zhao X, Zhang M, Ma M, Su C, Wang M, Qiao X, et al. HW-TSC at SemEval-2023 task 7: exploring the natural language inference capabilities of ChatGPT and pre-trained language model for clinical trial. 2023. Presented at: Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023); July 10, 2023:1603-1608; Toronto, Canada. [ CrossRef ]
- Nazary F, Deldjoo Y, Di Noia T. ChatGPT-HealthPrompt. Harnessing the power of XAI in prompt-based healthcare decision support using ChatGPT. 2023. Presented at: Artificial Intelligence. ECAI 2023 International Workshops; September 30-October 4, 2023:382-397; Kraków, Poland. [ CrossRef ]
- Wang B, Lai J, Cao H, Jin F, Tang M, Yao C, et al. Enhancing real-world data extraction in clinical research: evaluating the impact of the implementation of large language models in hospital setting. ResearchSquare. Preprint posted online on November 29, 2023. [ FREE Full text ] [ CrossRef ]
- Mishra V, Sarraju A, Kalwani NM, Dexter JP. Evaluation of prompts to simplify cardiovascular disease information using a large language model. MedRxiv. Preprint posted online on November 9, 2023. [ FREE Full text ] [ CrossRef ]
- Feng R, Brennan KA, Azizi Z, Goyal J, Pedron M, Chang HJ, et al. Optimizing ChatGPT to detect VT recurrence from complex medical notes. Circulation. 2023;148(Suppl 1):A16401. [ CrossRef ]
- Chowdhury M, Lim E, Higham A, McKinnon R, Ventoura N, He Y, et al. Can large language models safely address patient questions following cataract surgery? 2023. Presented at: Proceedings of the 5th Clinical Natural Language Processing Workshop; June 10, 2023:131-137; Toronto, Canada. [ CrossRef ]
- Kleinig O, Gao C, Kovoor JG, Gupta AK, Bacchi S, Chan WO. How to use large language models in ophthalmology: from prompt engineering to protecting confidentiality. Eye (Lond). 2024;38(4):649-653. [ FREE Full text ] [ CrossRef ] [ Medline ]
- Arsenyan V, Bughdaryan S, Shaya F, Small K, Shahnazaryan D. Large language models for biomedical knowledge graph construction: information extraction from EMR notes. ArXiv. Preprint posted online on December 9, 2023. [ FREE Full text ] [ CrossRef ]
- Kwon T, Ong KT, Kang D, Moon S, Lee JR, Hwang D, et al. Large language models are clinical reasoners: reasoning-aware diagnosis framework with prompt-generated rationales. 2024. Presented at: Proceedings of the Association for the Advancement of Artificial Intelligence Conference on Artificial Intelligence; February 20, 2024:18417-18425; Vancouver. [ CrossRef ]
- Wang C, Liu S, Li A, Liu J. Text dialogue analysis for primary screening of mild cognitive impairment: development and validation study. J Med Internet Res. 2023;25:e51501. [ FREE Full text ] [ CrossRef ] [ Medline ]
- Li J, Wang L, Chen X, Deng X, Wen H, You M, et al. Are you asking GPT-4 medical questions properly?—Prompt engineering in consistency and reliability with evidence-based guidelines for ChatGPT-4: a pilot study. ResearchSquare. Preprint posted online on October 3, 2023:1-20. [ FREE Full text ] [ CrossRef ]
- Zaidat B, Lahoti YS, Yu A, Mohamed KS, Cho SK, Kim JS. Artificially intelligent billing in spine surgery: an analysis of a large language model. Global Spine J. 2023:21925682231224753. [ FREE Full text ] [ CrossRef ] [ Medline ]
- Datta S, Lee K, Paek H, Manion FJ, Ofoegbu N, Du J, et al. AutoCriteria: a generalizable clinical trial eligibility criteria extraction system powered by large language models. J Am Med Inform Assoc. 2024;31(2):375-385. [ FREE Full text ] [ CrossRef ] [ Medline ]
- White R, Peng T, Sripitak P, Rosenberg Johansen A, Snyder M. CliniDigest: a case study in large language model based large-scale summarization of clinical trial descriptions. 2023. Presented at: Proceedings of the 2023 ACM Conference on Information Technology for Social Good; September 6-8, 2023; Lisbon, Portugal. [ CrossRef ]
- Scherr R, Halaseh FF, Spina A, Andalib S, Rivera R. ChatGPT interactive medical simulations for early clinical education: case study. JMIR Med Educ. 2023;9:e49877. [ FREE Full text ] [ CrossRef ] [ Medline ]
- Akinci D'Antonoli T, Stanzione A, Bluethgen C, Vernuccio F, Ugga L, Klontzas ME, et al. Large language models in radiology: fundamentals, applications, ethical considerations, risks, and future directions. Diagn Interv Radiol. 2024;30(2):80-90. [ FREE Full text ] [ CrossRef ] [ Medline ]
- Wiest IC, Ferber D, Zhu J, van Treeck M, Meyer SK, Juglan SK, et al. From text to tables: a local privacy preserving large language model for structured information retrieval from medical documents. MedRxiv. Preprint posted online on December 8, 2023. [ FREE Full text ] [ CrossRef ]
- Hamed E, Eid A, Alberry M. Exploring ChatGPT's potential in facilitating adaptation of clinical guidelines: a case study of diabetic ketoacidosis guidelines. Cureus. 2023;15(5):e38784. [ FREE Full text ] [ CrossRef ] [ Medline ]
- Leypold T, Schäfer B, Boos A, Beier JP. Can AI think like a plastic surgeon? Evaluating GPT-4's clinical judgment in reconstructive procedures of the upper extremity. Plast Reconstr Surg Glob Open. 2023;11(12):e5471. [ FREE Full text ] [ CrossRef ] [ Medline ]
- Yang J, Liu C, Deng W, Wu D, Weng C, Zhou Y, et al. Enhancing phenotype recognition in clinical notes using large language models: PhenoBCBERT and PhenoGPT. Patterns (NY). 2024;5(1):100887. [ FREE Full text ] [ CrossRef ] [ Medline ]
- Xiong L, Zeng Q, Deng W, Luo W, Liu R. A novel approach to nursing clinical intelligent decision-making: integration of large language models and local knowledge bases. ResearchSquare. Preprint posted online on December 8, 2023. [ FREE Full text ] [ CrossRef ]
- Zeng Q, Liu Y, He P. A medical question classification approach based on prompt tuning and contrastive learning. 2023. Presented at: The Thirty Fifth International Conference on Software Engineering and Knowledge Engineering (SEKE 2023); July 1-10, 2023:632-635; San Francisco, CA, United States. [ CrossRef ]
- Zhao D, Yang Y, Chen P, Meng J, Sun S, Wang J, et al. Biomedical document relation extraction with prompt learning and KNN. J Biomed Inform. 2023;145:104459. [ CrossRef ] [ Medline ]
- Zhu T, Qin Y, Feng M, Chen Q, Hu B, Xiang Y. BioPRO: context-infused prompt learning for biomedical entity linking. IEEE/ACM Trans Audio Speech Lang Process. 2024;32:374-385. [ CrossRef ]
- Liu C, Zhang S, Li C, Zhao H. CPK-Adapter: infusing medical knowledge into K-adapter with continuous prompt. 2023. Presented at: 2023 8th International Conference on Intelligent Computing and Signal Processing (ICSP); April 21-23, 2023:1017-1023; Xi'an, China. [ CrossRef ]
- Yeh HS, Lavergne T, Zweigenbaum P. Decorate the examples: a simple method of prompt design for biomedical relation extraction. 2022. Presented at: Proceedings of the Thirteenth Language Resources and Evaluation Conference; June 20-25, 2022:3780-3787; Marseille, France. URL: https://aclanthology.org/2022.lrec-1.403
- Su Z, Yu X, Chen P. EPTQA: a Chinese medical prompt learning method based on entity pair type question answering. SSRN. 2023:24. [ CrossRef ]
- Xu H, Zhang J, Wang Z, Zhang S, Bhalerao M, Liu Y, et al. GraphPrompt: graph-based prompt templates for biomedical synonym prediction. 2023. Presented at: Proceedings of the Association for the Advancement of Artificial Intelligence Conference on Artificial Intelligence; February 7, 2023:10576-10584; Washington, DC, United States. [ CrossRef ]
- Chen T, Stefanidis A, Jiang Z, Su J. Improving biomedical claim detection using prompt learning approaches. 2023. Presented at: 2023 IEEE 4th International Conference on Pattern Recognition and Machine Learning (PRML); August 4-6, 2023:369-376; Urumqi, China. [ CrossRef ]
- Xu Z, Chen Y, Hu B. Improving biomedical entity linking with cross-entity interaction. 2023. Presented at: Proceedings of the Association for the Advancement of Artificial Intelligence Conference on Artificial Intelligence; February 7, 2023:13869-13877; Washington. [ CrossRef ]
- Wang Y, Wang Y, Peng Z, Zhang F, Zhou L, Yang F. Medical text classification based on the discriminative pre-training model and prompt-tuning. Digit Health. 2023;9:1-14. [ FREE Full text ] [ CrossRef ] [ Medline ]
- Tian X, Wang P, Mao S. Open-world biomedical knowledge probing and verification. 2023. Presented at: Proceedings of The 12th International Joint Conference on Knowledge Graphs (IJCKG-23); December 8-9, 2023; Tokyo, Japan. URL: https://ijckg2023.knowledge-graph.jp/pages/proc/paper_3.pdf
- Lu K, Potash P, Lin X, Sun Y, Qian Z, Yuan Z, et al. Prompt discriminative language models for domain adaptation. 2023. Presented at: Proceedings of the 5th Clinical Natural Language Processing Workshop; July 14, 2023:247-258; Toronto, Canada. [ CrossRef ]
- Hu Y, Chen Y, Xu H. Towards more generalizable and accurate sentence classification in medical abstracts with less data. J Healthc Inform Res. 2023;7(4):542-556. [ CrossRef ] [ Medline ]
- Taylor N, Zhang Y, Joyce DW, Gao Z, Kormilitzin A, Nevado-Holgado A. Clinical prompt learning with frozen language models. IEEE Trans Neural Netw Learn Syst. 2023:1-11. [ CrossRef ] [ Medline ]
- Landi I, Alleva E, Valentine AA, Lepow LA, Charney AW. Clinical text deduplication practices for efficient pretraining and improved clinical tasks. ArXiv. Preprint posted online on September 29, 2023. [ FREE Full text ] [ CrossRef ]
- Sivarajkumar S, Wang Y. HealthPrompt: a zero-shot learning paradigm for clinical natural language processing. AMIA Annu Symp Proc. 2022;2022:972-981. [ FREE Full text ] [ Medline ]
- Sivarajkumar S, Wang Y. Evaluation of healthprompt for zero-shot clinical text classification. 2023. Presented at: 2023 IEEE 11th International Conference on Healthcare Informatics (ICHI); June 26-29, 2023:492-494; Houston, TX, United States. [ CrossRef ]
- Zhang L, Liu J. Intent-aware prompt learning for medical question summarization. 2022. Presented at: 2022 IEEE International Conference on Bioinformatics and Biomedicine (BIBM); December 6-8, 2022:672-679; Las Vegas, NV, United States. [ CrossRef ]
- Alleva E, Landi I, Shaw LJ, Böttinger E, Fuchs TJ, Ensari I. Keyword-optimized template insertion for clinical information extraction via prompt-based learning. ArXiv. Preprint posted online on October 31, 2023. [ FREE Full text ] [ CrossRef ]
- Cui Z, Yu K, Yuan Z, Dong X, Luo W. Language inference-based learning for low-resource Chinese clinical named entity recognition using language model. J Biomed Inform. 2024;149:104559. [ CrossRef ] [ Medline ]
- Ahmed A, Zeng X, Xi R, Hou M, Shah SA. MED-Prompt: a novel prompt engineering framework for medicine prediction on free-text clinical notes. J King Saud Univ Comput Inf Sci. 2024;36(2):1-17. [ CrossRef ]
- Lu Y, Liu X, Du Z, Gao Y, Wang G. MedKPL: a heterogeneous knowledge enhanced prompt learning framework for transferable diagnosis. J Biomed Inform. 2023;143:104417. [ FREE Full text ] [ CrossRef ] [ Medline ]
- Cui Y, Han L, Nenadic G. MedTem2.0: prompt-based temporal classification of treatment events from discharge summaries. 2023. Presented at: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop); July 10-12, 2023:160-183; Toronto, Canada. [ CrossRef ]
- Wang Z, Sun J. PromptEHR: conditional electronic healthcare records generation with prompt learning. 2022. Presented at: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics; October 11, 2022:2873-2855; Abu Dhabi, United Arab Emirates. [ CrossRef ]
- Wang S, Tang L, Majety A, Rousseau JF, Shih G, Ding Y, et al. Trustworthy assertion classification through prompting. J Biomed Inform. 2022;132:104139. [ FREE Full text ] [ CrossRef ] [ Medline ]
- Yao Z, Tsai J, Liu W, Levy DA, Druhl E, Reisman JI, et al. Automated identification of eviction status from electronic health record notes. J Am Med Inform Assoc. 2023;30(8):1429-1437. [ FREE Full text ] [ CrossRef ] [ Medline ]
- Kwon S, Wang X, Liu W, Druhl E, Sung ML, Reisman JI, et al. ODD: a benchmark dataset for the natural language processing based opioid related aberrant behavior detection. 2024. Presented at: Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies; June 16, 2024:1-22; Mexico City. [ CrossRef ]
- Su J, Zhang J, Peng P, Wang H. EGDE: a framework for bridging the gap in medical zero-shot relation triplet extraction. 2023. Presented at: 2023 IEEE International Conference on Bioinformatics and Biomedicine (BIBM); December 5-8, 2023; Istanbul, Turkiye. [ CrossRef ]
- Li Q, Wang Y, You T, Lu Y. BioKnowPrompt: incorporating imprecise knowledge into prompt-tuning verbalizer with biomedical text for relation extraction. Inf Sci. 2022;617:346-358. [ CrossRef ]
- Peng C, Yang X, Chen A, Yu Z, Smith KE, Costa AB, et al. Generative large language models are all-purpose text analytics engines: text-to-text learning is all your need. J Am Med Inform Assoc. 2024;31(9):1892-1903. [ FREE Full text ] [ CrossRef ] [ Medline ]
- Peng C, Yang X, Smith KE, Yu Z, Chen A, Bian J, et al. Model tuning or prompt tuning? A study of large language models for clinical concept and relation extraction. J Biomed Inform. 2024;153:104630. [ FREE Full text ] [ CrossRef ] [ Medline ]
- Duan J, Lu F, Liu J. MVP: optimizing multi-view prompts for medical dialogue summarization. 2023. Presented at: 2023 IEEE International Conference on Bioinformatics and Biomedicine (BIBM); December 5-8, 2023; Istanbul, Turkiye. [ CrossRef ]
- Rohanian O, Jauncey H, Nouriborji M, Kumar V, Gonçalves BP, Kartsonaki C, et al. Using bottleneck adapters to identify cancer in clinical notes under low-resource constraints. 2023. Presented at: The 22nd Workshop on Biomedical Natural Language Processing and BioNLP Shared Tasks; July 23, 2023:6239-6278; Toronto, Canada. [ CrossRef ]
- Elfrink A, Vagliano I, Abu-Hanna A, Calixto I. Soft-prompt tuning to predict lung cancer using primary care free-text Dutch medical notes. In: Artificial Intelligence in Medicine. Cham: Springer Nature Switzerland; 2023:193-198.
- Singh Rawat BP, Yu H. Parameter efficient transfer learning for suicide attempt and ideation detection. 2022. Presented at: Proceedings of the 13th International Workshop on Health Text Mining and Information Analysis (LOUHI); September 7, 2022:108-115; Abu Dhabi, United Arab Emirates. [ CrossRef ]
- Xu S, Wan X, Hu S, Zhou M, Xu T, Wang H, et al. COSSUM: towards conversation-oriented structured summarization for automatic medical insurance assessment. 2022. Presented at: Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining; August 14-18, 2022:4248-4256; Washington, DC, United States. [ CrossRef ]
- Shaitarova A, Zaghir J, Lavelli A, Krauthammer M, Rinaldi F. Exploring the latest highlights in medical natural language processing across multiple languages: a survey. Yearb Med Inform. 2023;32(1):230-243. [ FREE Full text ] [ CrossRef ] [ Medline ]
- Ding N, Hu S, Zhao W, Chen Y, Liu Z, Zheng HT, et al. OpenPrompt: an open-source framework for prompt-learning. ArXiv. Preprint posted online on November 3, 2021. [ FREE Full text ] [ CrossRef ]
- Ducel F, Fort K, Lejeune G, Lepage Y. Do we name the languages we study? The #BenderRule in LREC and ACL articles. 2022. Presented at: Proceedings of the Thirteenth Language Resources and Evaluation Conference; June 20-25, 2022:564-573; Marseille, France. URL: https://aclanthology.org/2022.lrec-1.60
- Luo Y, Yang Z, Meng F, Li Y, Zhou J, Zhang Y. An empirical study of catastrophic forgetting in large language models during continual fine-tuning. ArXiv. Preprint posted online on April 2, 2024. [ FREE Full text ] [ CrossRef ]
Abbreviations
BERT: Bidirectional Encoder Representations From Transformers
CoT: chain-of-thought
LLM: large language model
MCQ: multiple-choice question
MLM: masked language modeling
NLP: natural language processing
PD: prompt design
PL: prompt learning
PRISMA-ScR: Preferred Reporting Items for Systematic Reviews and Meta-Analyses extension for Scoping Reviews
PT: prompt tuning
Edited by T de Azevedo Cardoso; submitted 14.05.24; peer-reviewed by B Bhasuran, D Hu, A Jain; comments to author 03.07.24; revised version received 09.07.24; accepted 22.07.24; published 10.09.24.
©Jamil Zaghir, Marco Naguib, Mina Bjelogrlic, Aurélie Névéol, Xavier Tannier, Christian Lovis. Originally published in the Journal of Medical Internet Research (https://www.jmir.org), 10.09.2024.
This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research (ISSN 1438-8871), is properly cited. The complete bibliographic information, a link to the original publication on https://www.jmir.org/, as well as this copyright and license information must be included.
This chapter begins with a commentary on the importance of precisely focusing one's research question, emphasising that while good studies require good methods, the quality of a study is not completely defined by its methodological rigour. It provides some guidance for those trying to better understand the variety of quantitative methods available.
Background Pharmaceutical care is the pharmacist's contribution to the care of individuals to optimize medicines use and improve health outcomes. The primary tool of pharmaceutical care is medication review. Defining and classifying Drug-Related Problems (DRPs) is an essential pillar of the medication review. Our objectives were to perform a pilot of medication review in Hungarian community ...
This chapter begins with a commentary on the importance of precisely focusing one's research question, emphasising that while good studies require good methods, the quality of a study is not completely defined by its methodological rigour. It provides some guidance for those trying to better understand the variety of quantitative methods available.
Research and Medical Students: Some Notable Contributions Made in History. Upon the commencement of the practice of modern medicine, the establishment of evidence based practice has played a crucial role in its advancement. Whether it be an expert medical practitioner or some beginner medical student who is in the early phase of pursuing their ...
Quantitative Research Methods in Medical Education. April 2019. Anesthesiology Publish Ahead of Print (&NA;):&NA; DOI: 10.1097/aln.0000000000002727. Authors: John T. Ratelle. Mayo Foundation for ...
Although COVID-19 has spread almost all over the world, social isolation is still a controversial public health policy and governments of many countries still doubt its level of effectiveness. This situation can create deadlocks in places where there is a discrepancy among municipal, state and federal policies. The exponential increase of the number of infectious people and deaths in the last ...
Perspectives and Debates Public Health Research Introduction Medicine has a rich and complex history, shaped by a variety of factors including cultural, social, and scientific ... This paper explores the underlying principles and key contributions of quantitative medicine, as well as the context for its rise, including the development of new ...
The medical and mathematical ways of thinking are contrasted, stressed is the leading role of the former. The rebound effect of the quantitative study on the qualitative aspects is treated, with special regard to medical terminology. The necessity to distinguish physiological values and population data is emphasized.
The Quantitative Paradigm. The Research Question. Research Designs. The Experimental Tradition. The Epidemiologic Tradition. The Psychometric Tradition. The Correlational Tradition. Cronbach's 'Two Disciplines' Reviews. Discussion. References
Description. Quantitative Research in Human Biology and Medicine reflects the author's past activities and experiences in the field of medical statistics. The book presents statistical material from a variety of medical fields. The text contains chapters that deal with different aspects of vital statistics. It provides statistical surveys of ...
INTRODUCTION. Scientific research is usually initiated by posing evidenced-based research questions which are then explicitly restated as hypotheses.1,2 The hypotheses provide directions to guide the study, solutions, explanations, and expected results.3,4 Both research questions and hypotheses are essentially formulated based on conventional theories and real-world processes, which allow the ...
Neoadjuvant chemotherapy (NAC) is an effective treatment for locally advanced breast cancer (BC). However, there are no effective biomarkers for evaluating its efficacy. CDR1-AS, well known for its important role in tumorigenesis, is a famous circular RNA involved in the chemosensitivity of cancers other than BC. However, the predictive role of CDR1-AS in the efficacy and prognosis of NAC for ...