First, the headline result of a 0.7 sigma improvement is the output of a statistical model based on the lessons/reviews students engaged with and their mid-term score, with that shift being for "full engagement". Based on their tables, something like ~16 students (11% of the group) actually reached that level of engagement.
Second, trying to incorporate past grades into their modelling is not a substitute for a randomized trial.
Third, the headline engagement number of 90% is for "engaging with the platform, via Module Review or Lesson Quizzes, at least once". I don't know why much of that couldn't just be attributed to novelty. Or even partly a professor with all sorts of enthusiasm for the platform.
Fourth, the "full dosage" effectiveness is measured based on the final exam scores. Were these exam questions produced independently of the "Phosphor" materials (e.g. by blinding)? Were they checked for direct overlap with those materials? The 0.7 sigma shift is 3 points on a 24 point exam; if even a few of the questions on that exam were very similar to those materials, it could account for almost all of it. This is not clear to me from the manuscript.
If this was the case, then the question is less "is AI effective" and more "did the students look at the materials". You could still argue that the AI platform got them to read, but that is a somewhat different statement than that the AI helped them learn.
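For anyone who wants to check the arithmetic in the points above, here is a quick sketch. It only uses the figures quoted in this comment (0.7 sigma, 3 points, 24 points, ~16 students, 11%); the variable names are mine, and the implied values are back-of-the-envelope, not numbers taken from the paper.

```python
# Figures quoted in the comment above; everything derived below is implied, not reported.
effect_sd = 0.7          # headline shift, in standard deviations
shift_points = 3.0       # the same shift expressed in exam points
exam_points = 24.0       # total points on the final exam
full_engagers = 16       # approximate count reaching "full engagement"
full_share = 0.11        # that count as a share of the group

implied_sd_points = shift_points / effect_sd   # exam SD implied by 3 points = 0.7 sigma
shift_fraction = shift_points / exam_points    # shift as a share of the whole exam
implied_group = full_engagers / full_share     # group size implied by 16 = 11%

print(f"implied exam SD: {implied_sd_points:.2f} points")
print(f"shift as share of exam: {shift_fraction:.1%}")
print(f"implied group size: about {round(implied_group)} students")
```

So the whole headline effect is about an eighth of the exam, which is why overlap on even a few questions matters.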
(i.e. changing the environment can lead to short-term productivity gains, because either participants are aware they are being watched or it breaks up the monotony and makes people work a bit harder.)
I'm on record saying that a system like this with some extra hardware (i.e. a way for the LLM to have live understanding of the student's paper notebook or handout which are being written in with a plain old pencil) combines the best of both worlds - individual tutoring with approximately zero screen time which scales linearly with the number of students. The role of the teacher or professor then becomes a manager of the student - agentic tutor pairs, a referee when the student and model disagree, etc. and most importantly still being the human teacher you can just talk to in the human education process.
I'm convinced this is the future of education - models are there, we need the classroom tech to catch up. The alternative is obvious and quantified in the paper - students just use models to do their work for them and learn nothing.
I work in consulting and one of my projects is piloting an AI use case for a department within one of my clients. On a discovery call someone casually brought up that they bought a reMarkable notebook themselves and were wondering if it could be integrated into the use case. It really got me thinking.
Maybe reMarkable or something like it could help bridge a student's writing with an LLM without having to fall back to a laptop or iPad.
A 'smart pen' that records the student's writing in some way, maybe? My first thought was a tablet that boots straight into writing software, but students should not be subjected to any amount of latency in their writing.
Practically, I think if you want the AI system to have a live view of what the student's doing you're going to have to replace one of either the tablet or the writing instrument. A wearable camera could work as well but there are issues with that.
There was a pen that used special paper to directly record your notes (15-20 years ago)... it should be possible nowadays to transfer this directly to a connected device and have it fed to an LLM.
I would add that somewhere in there should be a spaced repetition algorithm.
Spaced repetition is very effective, but it's really, really clunky to use. My unpopular opinion is that we all have Stockholm syndrome when it comes to creating "cards". People talk about how valuable creating cards is, but I think it sucks; it takes a lot of time.
If AI is already teaching me math (let's say), it would be nice to tell the AI/app "quiz me on this periodically", and then the AI makes up a fresh polynomial to factor (or whatever) and presents that to you according to a spaced repetition algorithm.
Behind the scenes, the AI should have access to what happened the last several times a specific topic was quizzed, so it can watch to see that certain mistakes are resolved. It might also know better how to correct the user if it has context about previous quizzes on that topic.
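A rough sketch of what I mean, in Python. Everything here is hypothetical: the names are made up, and the scheduler is a deliberately simple "double the interval on success, reset on a miss" rule, not SM-2 or any specific published algorithm. The point is just the three pieces: a fresh problem each time, a due date per topic, and a per-topic history the tutor model could be shown.

```python
import random
from dataclasses import dataclass, field
from datetime import date, timedelta


@dataclass
class TopicState:
    """Review state for one topic (hypothetical structure)."""
    interval_days: int = 1
    due: date = field(default_factory=date.today)
    history: list = field(default_factory=list)  # (date, correct, note) tuples


def record_attempt(state, today, correct, note=""):
    """Log the attempt, then double the interval on success or reset it to 1 day."""
    state.history.append((today, correct, note))
    state.interval_days = state.interval_days * 2 if correct else 1
    state.due = today + timedelta(days=state.interval_days)


def fresh_factoring_problem(rng):
    """Build x^2 + bx + c from two random integer roots, so it always factors."""
    r1, r2 = rng.randint(-9, 9), rng.randint(-9, 9)
    b, c = -(r1 + r2), r1 * r2
    return f"Factor x^2 + ({b})x + ({c})", (r1, r2)


def recent_context(state, n=3):
    """Last n attempts on this topic, e.g. to include in the tutor's prompt."""
    return state.history[-n:]


state = TopicState(due=date(2025, 1, 1))
question, roots = fresh_factoring_problem(random.Random())
record_attempt(state, date(2025, 1, 1), correct=False, note="sign error on c")
record_attempt(state, state.due, correct=True)
print(question, "| next review:", state.due, "| context:", recent_context(state))
```

The `note` field is where the "watch that certain mistakes are resolved" part would live; how the model actually uses it is left open.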
But the very act of making and organizing your card deck is part of the SRS! It “sucks” because you get no dopamine hit from a fresh deck, as the reward system is not yet in place.
This is exciting because the effect size is so large. But as the authors acknowledged, selection bias is nearly impossible to control for in this non-randomized study:
> and lacks randomized controls. Self-selection is the central threat: students who complete more quizzes may be more motivated or higher-performing generally
But this is still a strong result. I'm excited to see more in this space.
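To make the self-selection worry concrete, here is a toy simulation. The numbers are invented and have nothing to do with the paper's data: I assume a hidden "motivation" trait drives both tool usage and exam scores, and that the tool itself has zero effect. The 11% cutoff is borrowed from the "full engagement" share mentioned elsewhere in the thread.

```python
import random
import statistics

rng = random.Random(42)   # fixed seed so the run is repeatable
n_students = 2000

students = []
for _ in range(n_students):
    motivation = rng.gauss(0, 1)
    engagement = motivation + rng.gauss(0, 1)   # motivated students use the tool more
    exam_score = motivation + rng.gauss(0, 1)   # engagement does NOT enter the score
    students.append((engagement, exam_score))

# Compare the most engaged 11% against everyone else.
students.sort(key=lambda s: s[0], reverse=True)
cutoff = int(0.11 * n_students)
full = [score for _, score in students[:cutoff]]
rest = [score for _, score in students[cutoff:]]

score_sd = statistics.pstdev([score for _, score in students])
gap_in_sd = (statistics.mean(full) - statistics.mean(rest)) / score_sd
print(f"apparent 'full engagement' effect: {gap_in_sd:.2f} SD")
```

Under these assumptions the naive comparison shows a sizeable gap even though the tool does nothing. Adjusting for prior grades would shrink that, but only to the extent prior grades capture motivation, which is the authors' own caveat.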
Conflicted about this study. On one hand, LLMs have been incredible for my personal learnings of new concepts.
On the other, I'm sceptical that it'll have "strong benefits" at scale; I'd be more in favor if the wording was "some"/"moderate". I reckon self-selection plays a huge part, as mentioned in the "Limitations" section of the paper.
I'd also caution against attaching the tool to grading. That means students have to put more effort into the course, which increases the chances that they will use LLMs to save time rather than make the investment.
> LLMs have been incredible for my personal learnings of new concepts.
Mind if I ask what you learned and how you're using it?
The reason I'm asking is that I repeatedly felt excitement only to realize down the line that the explanations didn't actually translate into practical skills. I'm not sure it's even an AI problem, it's a "doing versus reading" problem. Same as with reading a pop-science article and thinking to myself that I learned something about physics or medicine or mathematics.
Do you have a larger study planned for the Fall? It definitely seems promising.
I'm curious how well you feel this worked because the subject was Statistics (objective grading) versus something more subjective like Civics or Literature.
Honestly whether or not this was effective seems less important to me than the adoption numbers.
Textbook reading in this course was 10-15% at baseline ... but this AI thing got 90% voluntary, ungraded usage.
Even if it's worse per-hour than a textbook, you're now teaching 6x as many students _something_ instead of teaching a small minority everything.
So really it just becomes an optimization problem at that point, because most students are at least in the funnel/in the running to learn something.
The paper kind of proves this itself ... they tweaked the quiz formats mid-semester and were able to iterate, which you can't do on a textbook that nobody opens in the first place.
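The "6x" is easy to sanity-check from the two percentages above. The break-even part is my own simplification: it treats total learning as reach times effectiveness per engaged student, which is an assumption and not something the paper measures.

```python
# Percentages quoted above, kept as integers to avoid floating-point noise.
baseline_low_pct, baseline_high_pct = 10, 15   # textbook reading at baseline
ai_usage_pct = 90                              # voluntary platform usage

reach_low = ai_usage_pct / baseline_high_pct   # conservative multiplier
reach_high = ai_usage_pct / baseline_low_pct   # optimistic multiplier

# Hypothetical model: total learning = reach x effectiveness per engaged student.
# The tool only loses to the textbook if its relative effectiveness is below this.
break_even_best = 1 / reach_high
break_even_worst = 1 / reach_low

print(f"reach multiplier: {reach_low:.0f}x to {reach_high:.0f}x")
print(f"break-even effectiveness: {break_even_best:.0%} to {break_even_worst:.0%} of the textbook's")
```

So under that simple model the tool would have to be several times worse per student before the textbook comes out ahead.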
I'd argue the results are even better: just reading a textbook doesn't really teach you much. You have to do exercises, but they're expensive to create and grade. LLMs with a proper harness (see paper) tackle both.
Curious how this holds up across different learning styles.
SD effect sizes look impressive, but I'd want to see retention data at 30/90 days before drawing conclusions.
I don't want to learn from hallucinations, where it will change its answers when I question its teachings. I use it for conversations in a language I'm learning, but I quickly learned that asking it grammar questions, for example, is not a wise decision.
Curious whether you were just bare asking it questions, or whether you provided it with lessons one by one, with the instruction that the lesson is the baseline truth, etc.?
Shocking that a well executed AI tutor improves outcomes.
Hasn't computer assisted interactive learning already been proven for years? Why does there seem to be so much skepticism about enhancing it with AI?
Is this just something like, astoundingly slow adoption or poor execution? Being held back by paper textbook makers? Teachers unions dragging their feet?
How can interactive AI driven individually paced learning _not_ be obviously dramatically more effective?
Lots of people in education will happily tell you how the past 15 years of tech integration has been a net negative.
There ARE technologies that have improved things, but so much high-cost useless tech has been shoved into every level of education that many educators are incredibly leery of new tech.
The issue is that while the underlying technology is useful, the way it gets integrated is frequently not. An administrator cuts a deal with an ed-tech giant, for a huge amount, for a product they never have to use themselves. Because the ink is dry and a huge sum of money has been spent, admins pressure educators to use the technology as much as possible regardless of outcome.
In that context it makes a lot more sense why there is pushback and FUD among educators.
Selection effects are extremely important in education. Dartmouth students have already had a large selection effect. If you try to apply this more broadly then it might not work.
Motivation is also a huge part of the problem. I'm wondering if the novelty of the AI tutoring gets more people to try it and whether it would wear off?
It's surprising to me that many students at Dartmouth don't read the textbook. You'd think college admissions would select for that?
It seems promising but, as they say, more research needed.
https://remarkable.com/
and after looking it up, it appears they are still available: https://www.livescribe.com/landingpage/ls3_onenote/
Bloom's Two Sigma Opportunity suggests that there's another SD improvement available: https://en.wikipedia.org/wiki/Bloom%27s_2_sigma_problem
PS - I'd say this qualifies for Show HN, too!
Are you planning on opening access to Phosphor?
Very few are actually motivated to learn; most are just there to get a job, or it's just the next thing that they have to do in life.