-
📝 Auto-METRICS - a proof of concept
Automatic assessment of methodological quality in radiomics research
Recently, I received my first (obvious) LLM peer review. It was quite blatant. What’s worse: it wasn’t good - at all! Funnily enough, I had been working on something related: Auto-METRICS, a tool for automatic standardised assessment of scientific research quality in radiomics research using the METRICS framework.
To show its utility, we make use of two unique, recent datasets on reproducibility in radiomics studies - Akinci D’Antonoli et al. (2025) and Kocak (2025). Together, they provide a really strong set of METRICS ratings - from raters with different levels of expertise and training - for more than 50 publications. This allowed us to systematically compare human and LLM raters.
The main takeaways:
- Human raters agree with LLMs at the same rate that they agree with other human raters ✅
- Prompt iterations: clarifying radiomics guidelines can lead to better agreement with human raters. However, these improvements were quite limited! 📈
- Too nice: LLM ratings tended to be slightly higher than those offered by human raters 😇
I tested our tool - Auto-METRICS - here (all you need is a free Google Gemini API key) and found it really helpful for getting an initial METRICS assessment which I could then easily verify. The key? Enhance, don’t replace - having good initial ratings was super helpful in reaching a final, human-made classification.
Curious? Read more about Auto-METRICS at medRxiv.
-
📝 How to develop a radiomics signature
Presentation and workshop for medical doctors and biomedical researchers on applying machine learning to radiology data
In 2024, I participated as faculty in the “How to develop a radiomics signature” course organised for the European Society of Gastrointestinal and Abdominal Radiology. Here, I gave a short presentation on machine-learning model development and tuning, as well as two practical programming sessions: one on radiomic feature extraction and the other on model development and (fine-)tuning. Both are freely available.
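The model development and tuning step can be sketched roughly as below. This is an illustrative example on synthetic data, not the actual course material: it assumes radiomic features have already been extracted into a feature matrix, and the estimator, penalty, and parameter grid are arbitrary choices for demonstration:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in for an extracted radiomic feature matrix: 100 patients, 50 features.
X, y = make_classification(n_samples=100, n_features=50, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scale features, then fit an L1-penalised classifier (sparse feature selection).
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(penalty="l1", solver="liblinear")),
])

# Tune the regularisation strength with 5-fold cross-validation.
search = GridSearchCV(pipe, {"clf__C": [0.01, 0.1, 1.0, 10.0]}, cv=5)
search.fit(X_train, y_train)
test_acc = search.score(X_test, y_test)
```

Evaluating `search.score` on a held-out test set, rather than on the cross-validation folds used for tuning, is the part most often gotten wrong in radiomics signatures.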
-
📝 Some notes on programming and coding with AI assistants
Most IDEs come pre-packaged with their own AI assistant for coding. But who should use them?
Recently, I have been wondering about AI coding assistants (AICA). A lot of what I do day-to-day is coding, so a convenient assistant sounds appealing.
I have tried a few and my general impression is simple: they work well, but they frequently introduce bugs which I have to fix. Ultimately, I have a hard time telling whether my coding has actually improved - on one hand I can code faster; on the other I have to fix more bugs.
So I took a quick dive into the literature (some early works trying to predict productivity gains exist but, alas, behind these keystrokes lies an empiricist):
- GitClear conducted their own study into this, showing that post-AICA code has a higher churn rate (i.e. code that is revised or reverted shortly after being written, requiring maintenance further down the line) [1]
- Some works examine whether AICA solutions are correct without human verification - they are not [2, 3] - with a recent work showing that only 30% of code suggestions are accepted [4]
- In terms of productivity, the same work claims a 30% increase (!) [4]. However, productivity is measured as the acceptance rate of AICA suggestions, which, despite being a rather crude metric, is associated with perceived productivity [5]
From other works on the impact of coding assistants on productivity, we get a heterogeneous picture:
- a study asking participants to implement an HTTP server in JavaScript as quickly as possible showed that people using AICA were able to get the job done 50% faster [6]
- a study from Uplevel (covering 800 developers at its customers) saw a 40% increase (!) in bugs with very little impact on efficiency (PR cycles reduced by 1.7 minutes) and no impact on burnout rate [7]
- a study showed an increase in PRs but also a decrease in build success rates [8]
- a study surveying developers found that they are generally positive about AICA, but somewhat fearful that more junior developers won’t have the opportunity to build up their skills if they rely too heavily on these tools; additionally, they were keener to praise its efficiency gains on easy or repetitive tasks (i.e. the usual boilerplate code), which leaves more time for learning and creative thinking [9]
- a study on developers in the public sector highlights a similar trend, with developers feeling that they have more time to focus on more meaningful and rewarding tasks if they use AICA [10]
- a study showed that GitHub Copilot has a more positive impact when used by experts, and that it can be harmful for novices, as it suggests buggy or overly complicated code which novices cannot fully comprehend [11]
My take on this: it can be useful, but a lot of the benefits are overstated, and novice developers benefit greatly from steering clear of it while learning. The benefits appear to lie along a speed-quality tradeoff curve, and each developer will occupy a different position on that curve.
- Coding on Copilot: 2023 Data Suggests Downward Pressure on Code Quality
- Exploring the Verifiability of Code Generated by GitHub Copilot
- Sea Change in Software Development: Economic and Productivity Analysis of the AI-Powered Developer Lifecycle
- The Impact of AI on Developer Productivity: Evidence from GitHub Copilot
- The Effects of Generative AI on High-Skilled Work: Evidence from Three Field Experiments with Software Developers
- Understanding the impact of an AI coding assistant, GitHub’s Copilot, on developers and their work experiences
- Harnessing the Potential of Gen-AI Coding Assistants in Public Sector Software Development
-
🔗 Where did the cancer microbiome go?
A scientific paper published in 2020 promised to revolutionise cancer diagnosis. When other teams could not obtain the same results, doubts were cast on the research and on the company that was founded based on its conclusions.
Published in Shifter
-
🔗 The tools change but the struggles continue - Artificial Intelligence and what it means for your job
The new Artificial Intelligence models have captured everyone’s attention. With their unprecedented capacity to generate text, fears have emerged that they could destroy jobs en masse. Is that threat real? Or are the assumptions behind that fear more social and political than technological? Written with João Gabriel Ribeiro
Published in Shifter