PLUS-predict: Incorporation of prediction models in the Danish Lung Cancer Screening pilot (PLUS)

Postdoc
Margrethe Bang Henriksen
Department of Oncology, Vejle

Project management

Project status	Open



Data collection dates
Start	14.08.2025
End	31.12.2029

Project in numbers

OPEN survey/clinical data
Expected number of participants	57,109
Number of included participants

Included participants with samples
Samples


Short summary

PLUS-predict will enhance lung cancer screening in Denmark by combining registry, questionnaire, and health record data from the PLUS pilot. The project will develop models for non-responders, validate multiple risk prediction models in responders, and test them in a control group. Using machine learning and natural language processing (NLP), it aims to improve the identification of high-risk individuals, reduce social inequality, and support a fair, evidence-based national screening program.

Rationale

Lung cancer remains the leading cause of cancer-related death, largely due to late-stage diagnosis. Although screening with low-dose CT can reduce mortality, implementation in Denmark is challenged by the question of how best to identify high-risk individuals. Traditional criteria based on age and smoking history are inadequate compared with newer prediction models, yet the application of these models in a Danish setting remains unexplored. Furthermore, a large proportion of individuals invited to screening do not respond, leaving their risk status unknown and potentially contributing to social inequality in cancer outcomes.

Description of the cohort

Study A: Non-responders

Study A will focus on the 7,711 individuals who did not respond to the initial screening invitation.

Comparative characterization of non-responders and responders

Non-responders will first be characterized in a cross-sectional study using data from national registries and electronic health records, with an index date close to the time of invitation. Demographic and socioeconomic characteristics will be summarized using descriptive statistics (numbers and proportions), and differences between non-responders and responders will be tested using the t-test, Mann-Whitney U test, or Pearson's chi-squared test as appropriate. Odds ratios (ORs) with 95% confidence intervals (CIs) will be calculated using univariable and multivariable logistic regression to estimate the associations between demographic and socioeconomic factors and non-response.

Lung cancer prediction model pipeline

Prediction models will be evaluated using a structured pipeline tailored to the availability and quality of input data. This includes data preprocessing steps such as class balancing, missing data imputation, and feature scaling. Models will be trained and evaluated using cross-validation to ensure robustness and minimize overfitting. Performance will be assessed using standard validation metrics, including the area under the receiver operating characteristic curve (AUC-ROC), F1-score, precision, and recall at clinically relevant thresholds. Calibration will be assessed using observed-versus-predicted plots. To enhance interpretability, SHAP (SHapley Additive exPlanations) analyses will be used to assess feature importance both at the population level and for individual case-level explanations. The proportion of eligible individuals will be evaluated at clinical risk thresholds of 1-3% and compared with that of the PLCOm2012 model, which served as the inclusion tool for the screening cohort.
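
The discrimination and calibration checks named in the pipeline above can be illustrated with a minimal sketch. It is not the project's implementation: the function names auc_roc and calibration_table, the equal-size binning, and all data values are assumptions made purely for illustration. AUC is computed here as the share of case/non-case pairs in which the case has the higher predicted risk, and calibration as mean predicted risk versus observed event rate per risk bin.

```python
from statistics import mean


def auc_roc(y_true, y_score):
    """AUC as the probability that a random case outranks a random non-case.

    Ties between a case and a non-case count as half a win.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one non-case")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def calibration_table(y_true, y_score, n_bins=2):
    """Return (mean predicted risk, observed event rate) per risk-sorted bin."""
    if n_bins < 1 or n_bins > len(y_score):
        raise ValueError("n_bins must be between 1 and the number of records")
    pairs = sorted(zip(y_score, y_true))
    size = len(pairs) // n_bins
    rows = []
    for b in range(n_bins):
        start = b * size
        # The last bin absorbs any remainder.
        end = (b + 1) * size if b < n_bins - 1 else len(pairs)
        chunk = pairs[start:end]
        predicted = mean([s for s, _ in chunk])
        observed = mean([y for _, y in chunk])
        rows.append((predicted, observed))
    return rows


# Hypothetical predicted risks and lung cancer outcomes (1 = case).
risk = [0.005, 0.010, 0.015, 0.020, 0.030, 0.040, 0.060, 0.080]
outcome = [0, 0, 0, 1, 0, 0, 1, 1]

auc = auc_roc(outcome, risk)
table = calibration_table(outcome, risk, n_bins=2)
```

In a well-calibrated model the two numbers in each row of the table lie close together; plotting them against each other gives the observed-versus-predicted plot.
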
Additionally, we will benchmark our model against other established risk prediction models that use comparable variables, such as the Liverpool Lung Project (LLP) model, the HUNT model, and the Optimized Early Warning for Lung Cancer (OWL) model, adapted to align with the available dataset.

Study B: Responders

Study B will focus on the 9,589 individuals who responded to the first round of the PLUS project, including the 1,332 individuals who met the eligibility criteria for screening. Of the responders, 59% were female, 42% were non-smokers, the mean number of pack-years was 23.9 (SD: 20.7), and the mean PLCOm2012 risk was 1.9% (SD: 2.8). Among included responders, 45% were female, the mean number of pack-years was 43.3 (SD: 19.7), and the mean PLCOm2012 risk was 4.2% (SD: 3.4). Using questionnaire data linked with national registry data, we will validate established prediction models other than PLCOm2012 and compare them with the 2% threshold set for PLCOm2012. We will assess model performance annually until the PLUS study period ends at the end of 2026.

Study C: Control group

The control group consists of 39,809 individuals who were initially stratified alongside the screened cohort but were not drawn in the random selection for screening within strata of 1,000. The screening questionnaire has recently been updated to include four supplementary questions, enabling risk calculation based on multiple risk models, and will be distributed to the control cohort in the fall of 2025. Following one year of follow-up, responses will be evaluated and linked with lung cancer outcomes on an annual basis. Using the collected data, various prediction models (LLP and HUNT, among others) will be applied and their performance compared with the currently used screening criteria.
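
The comparison of prediction models at a fixed risk threshold, as described for Studies B and C, can be sketched as follows. This is an illustration under stated assumptions only: screening_yield, model_a, and model_b are hypothetical names, the risk values and outcomes are invented, and the 2% cut-off is used simply because it is the threshold mentioned above. No actual PLCOm2012, LLP, or HUNT formula is implemented here.

```python
def screening_yield(risks, outcomes, threshold):
    """Summarize a risk-based selection rule at one threshold.

    Returns the fraction of the cohort that would be eligible, the
    sensitivity (share of cases that are eligible), and the positive
    predictive value (share of eligible individuals who are cases).
    """
    eligible = [r >= threshold for r in risks]
    n_eligible = sum(eligible)
    n_cases = sum(outcomes)
    true_pos = sum(1 for e, y in zip(eligible, outcomes) if e and y == 1)
    return {
        "eligible_fraction": n_eligible / len(risks),
        "sensitivity": true_pos / n_cases if n_cases else None,
        "ppv": true_pos / n_eligible if n_eligible else None,
    }


# Hypothetical cohort of ten individuals (1 = lung cancer during follow-up).
outcomes = [0, 0, 0, 1, 0, 0, 1, 1, 0, 0]

# Hypothetical predicted risks from two unnamed models for the same people.
model_a = [0.004, 0.008, 0.012, 0.018, 0.021, 0.025, 0.035, 0.050, 0.009, 0.015]
model_b = [0.003, 0.006, 0.010, 0.026, 0.012, 0.019, 0.040, 0.060, 0.007, 0.022]

threshold = 0.02  # the 2% cut-off mentioned in the text
result_a = screening_yield(model_a, outcomes, threshold)
result_b = screening_yield(model_b, outcomes, threshold)
```

In this invented example both models select the same share of the cohort, but model_b captures more of the cases, which is the kind of difference the planned comparison is intended to reveal.
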

Data and biological material

Questionnaire data
Registry data
Free-text data from electronic health records

Collaborating researchers and departments

Department of Cardiothoracic and Vascular Surgery

Michael Stenger