In much of Eastern Europe and the former Soviet Union, the start of the school year is not just another day on the calendar. September 1st is celebrated as the “Day of Knowledge”, a festive occasion marking the beginning of the academic year and a milestone in children’s lives.
The heart of the celebration is the First Bell ceremony. Families gather at schools, children arrive in their best clothes, and first-graders carry bouquets of flowers to present to their new teachers. The entire community comes together — parents, grandparents, teachers, and older pupils — to mark the moment when a new generation enters the school system.
A highlight of the day is the symbolic role of the oldest students in the school. They stand as mentors and guardians for the newcomers, welcoming the youngest pupils into their school family. In many schools, a beloved ritual unfolds: a senior student lifts a first-grader onto their shoulders while the child rings a small hand bell. This ringing of the “first bell” represents the start of the school journey and the promise of guidance, care, and learning ahead.
While September 1st is standard across Russia, Ukraine, Belarus, and several Central Asian republics, other countries adapt the tradition to their own calendars. In Bulgaria, for example, schools open with a similar bell-ringing ceremony on 15 September. In Serbia, the school year also begins around 1 September, with local customs to welcome first-graders, though it is not formally called the Day of Knowledge. Despite these differences, the theme is shared: education is not only about textbooks, but also about community, continuity, and shared responsibility.
The atmosphere is festive but also deeply symbolic. For families, especially those sending their first child to school, it is an emotional milestone. For teachers, it is a renewal of purpose. And for the first-graders themselves, it is often the first taste of belonging to a wider community beyond their family.
It is also worth mentioning that in many of these countries, the school year is framed not just by the First Bell but also by the Last Bell. Held in late May, the Last Bell marks the end of the school year and, for graduating students, the close of their school journey. Just as the First Bell welcomes children into the world of learning, the Last Bell sends them forth — celebrated with songs, speeches, and, once again, the symbolic ringing of the bell. Together, these two ceremonies highlight the cultural significance of schooling as both a beginning and an ending, woven into the rhythm of community life.
Twenty years ago, we still wrote letters, filled in forms by hand, and scribbled notes in margins. Today, most of us type, swipe, or tap. Keyboards and touchscreens have transformed how we communicate—but they risk erasing one of humanity’s most elegant skills: handwriting.
One of my two most treasured belongings is a sky-blue Caran d’Ache fountain pen, bought on a summer holiday. The other is a simple analogue wristwatch (don’t get me started on disposable digital plastic strapped to the wrist!). For me, nothing compares to the beauty of a handwritten letter—the sweep of ink, the individuality of style, the dignity of effort.
My most treasured teaching memory after 30+ years came from three weeks working with a 12-year-old boy, Nikita. Every day for an hour, he copied my best attempt at an Arial font alphabet—lower case, then upper, then cursive. At first, his pencil crawled like a caterpillar. But day by day, he transformed. At the end of those weeks, he was writing like a butterfly in full flight: graceful, balanced, elegant. When I asked him to write on the whiteboard, one friend laughed, “He’s got the worst handwriting ever.” But before I could intervene, a girl spoke up: “No, Nikita’s got the best handwriting in the class.” Soon, everyone agreed. It was a moment of quiet triumph.
Of course, cave drawings were surpassed by photography. Hand sewing by mass production. Slowly cooked Sunday roasts by microwaves. And handwritten letters by emails and emojis. But some things survive because they are more than practical—they carry meaning.
And here’s where IB Diploma teachers come in. Our exams are still handwritten. AI hasn’t taken our jobs yet, and the future hasn’t stolen our past. Whether Cyrillic, English, Thai, or Arabic, let’s celebrate elegant handwriting. Let’s encourage our students to see it not as an obsolete chore but as a skill of dignity, beauty, and identity.
In the “factory of the future,” handwriting may be rare. But in our classrooms, it can still be cherished. And who knows, perhaps a blue fountain pen and three weeks of care can still turn a caterpillar into a butterfly.
A week ago, the IB Diploma results landed — the long-anticipated climax to two gruelling, rewarding, unpredictable years. For us teachers, it’s the end of a journey marked not just by planning, teaching, and assessing, but also by sleepless nights, student meltdowns, awkward parent meetings, internal deadlines, and navigating the relentless tide of CAS logs, EE drafts, IA deadlines, and Theory of Knowledge epiphanies (or not). Somewhere in the mix: online PDs, five-year evaluations, and staffroom diplomacy.
So, firstly: well done. You made it. You stayed (mostly) sane. That’s no small thing.
Secondly: whatever the grade breakdowns or the points out of 45, your students succeeded. Because the IB Diploma Programme is not just a gateway to university — it’s a transformation. It’s about guiding students through complex ideas, encouraging them to ask better questions, helping them reflect, write, fail, revise, persevere. We gave them knowledge — yes — but also ways to think, to evaluate, to connect. We taught them to research, to balance creativity with critical thinking, to serve others. We nudged them to consider perspectives beyond their own, to appreciate other cultures, and to see language as a window rather than a wall.
And while today’s focus for many families might still be on numbers and thresholds, we know that the true value of the IB is long-term. Eventually, most students will forget their score — but they’ll remember how TOK changed their thinking, how the EE taught them to explore independently, how CAS challenged them to give, grow, and reflect. They’ll remember you.
So take a breath, reflect, and recover. Forget about moderation processes and the scaling machinery – we'll never be told the truth about these. You didn't just get your students through the programme; you helped shape better people.
Now go enjoy your coffee. Or something stronger. You earned it.
All IB assessments—whether exams, internal assessments, Theory of Knowledge essays, Extended Essays, or CAS—are assessed using clearly defined criteria. These describe what a student must demonstrate to achieve a particular level of performance. For example, a criterion might state: “The student evaluates the implications of cultural dimensions on behaviour.” The task of the examiner is to judge whether, and to what extent, the student has met this criterion. This is a standards-based approach: performance is measured against fixed descriptors, not against the performance of other students.
How does criterion-based assessment fit with the statistical scaling of results?
Raw marks therefore reflect how well a student met the specified criteria—independently of how other students performed.
The challenge arises because even though the criteria remain constant, the exams themselves vary slightly from session to session in terms of difficulty. For example, one year’s psychology paper may include case studies or questions that are more conceptually challenging than the previous year’s.
This is where statistical scaling comes in. Once all papers are marked according to the criteria, the IB uses a process called grade boundary setting, supported by expert judgement, to determine what raw mark range should correspond to each grade (1 to 7). For example, while 65/100 might earn a 6 one year, it might only earn a 5 the next year if the exam was slightly easier.
So, the criteria tell us how many marks a student earns, and statistical scaling determines how those marks map onto grades, ensuring fairness across different cohorts and exam sessions.
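To make the mapping concrete, here is a minimal Python sketch. The grade_from_raw helper and every boundary value in it are invented for illustration; real IB boundaries are set per subject, per level, and per session.

```python
# Minimal sketch of how raw marks map onto 1-7 grades via session-specific
# grade boundaries. Every number below is invented for illustration only.

def grade_from_raw(raw_mark, boundaries):
    """Return the highest grade whose lower boundary the raw mark reaches."""
    for grade in sorted(boundaries, reverse=True):  # check 7 first, then 6, ...
        if raw_mark >= boundaries[grade]:
            return grade
    return 1  # grade 1 acts as the floor

# Hypothetical lowest raw mark (out of 100) needed for each grade in two sessions.
harder_paper = {7: 72, 6: 60, 5: 48, 4: 37, 3: 27, 2: 17, 1: 0}
easier_paper = {7: 78, 6: 67, 5: 55, 4: 43, 3: 32, 2: 20, 1: 0}  # boundaries shift up

raw = 65
print(grade_from_raw(raw, harder_paper))  # 6 when the paper was harder
print(grade_from_raw(raw, easier_paper))  # 5 when the paper was easier
```

The same raw mark of 65 earns a 6 against one set of boundaries and a 5 against the other, which is exactly the kind of adjustment that boundary setting is there to make.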
Reconciling the two
In short, criterion-based marking ensures validity (students are assessed on what they know and can do), while statistical scaling ensures reliability and comparability (grades mean the same thing year to year).
They can absolutely be reconciled because they operate at different stages of the assessment process:
Criteria are used during the marking stage to ensure objective and consistent scoring.
Scaling is used at the grade-setting stage to account for differences in exam difficulty across sessions.
This dual approach helps maintain both academic integrity and global consistency in the awarding of IB grades.
Many of us in IB classrooms are noticing an increase in students using AI to 'assist' with their IAs, Extended Essays, and TOK essays. But the question isn't 'How do we stop students using AI?' The better question is: why are they turning to it in the first place?
In most cases, AI use is a symptom of deeper issues:
Incomplete research skills: Students struggle with framing research questions, analysing sources, and building arguments. These are central to the Diploma Programme, but many students need more scaffolding earlier in the process.
Poor time management: Extended tasks require sustained effort over months. Students often leave things too late, then panic. AI feels like a quick fix.
Language challenges: Academic writing in English is daunting for many. AI offers fluent, polished prose many DP students may not feel capable of producing on their own.
Crushing pressure to achieve: IB students feel immense pressure — from parents, universities, and themselves — to secure top grades. The fear of underperforming leads some to seek ‘perfect’ answers generated by AI.
Fear of failing IB standards: With rubrics, formal assessments, and high expectations, many students lose confidence and look for safety nets.
So, what can we do?
Build explicit research and inquiry instruction into the curriculum early and often.
Break large assignments into manageable milestones with frequent check-ins.
Offer targeted academic writing support, especially for non-native speakers.
Create a classroom culture that values process over perfection — encourage drafts, reflection, and feedback.
Provide emotional support to ease anxiety around performance and failure.
AI is not the problem — unmet needs are. If we strengthen our teaching around these core areas, we empower students to rely on their own thinking, which is exactly what the Diploma Programme is designed to develop.
It’s time we moved beyond the blanket blame game that often paints boys and men as inherently problematic. The phrase “toxic masculinity” has become a catch-all for behaviors and attitudes that are, in many cases, symptoms of deeper social failures — not innate traits of maleness. Rather than pathologizing boys for being boys, we need to understand and address the systems that shape them.
Boys are not evil
Take the UK documentary series Adolescence. A quick glance may lead viewers to assume it’s about a “typical” white teenage boy in trouble. But a deeper look reveals the story of a young man who is anything but typical — he is the product of a broken, racially biased immigration system that failed him long before society judged him. Framing him simply through the lens of “toxic masculinity” erases that context and oversimplifies a complex, human story.
If we’re serious about improving the lives of boys and girls alike, we must stop demonizing masculinity altogether and start promoting strong, compassionate male role models. As experts point out, fathers and mentors play a powerful role in shaping boys into healthy, empathetic men. “Boys and young men cannot be what they cannot see,” one researcher notes — and without nurturing, non-violent male figures in media, culture, and homes, boys are left adrift.
Blaming social media or banning teens from platforms like TikTok and YouTube won’t solve the problem either. What’s needed is education — particularly media literacy that helps boys critically evaluate the content they consume, including the disturbing sexism often found in online porn.
Let’s shift the conversation. Ditch the harmful labels. Understand the context. And start building up boys instead of breaking them down.
While much attention has been given to the mental health struggles of teenage girls in the UK — often rightly so — there is growing concern that boys’ mental health needs are being neglected in schools due to a focus on targeted support for girls.
Pseudo-reality television programmes like Adolescence do nothing for the mental health of teenage boys. In fact, they may actively make things worse. These shows often promote shallow, stereotypical versions of masculinity — valuing aggression, emotional suppression, and physical appearance over vulnerability, kindness, or emotional intelligence. For boys already struggling to find their identity in a world of social media pressure and unrealistic expectations, such programming reinforces damaging ideas about what it means to “be a man.” Instead of encouraging healthy emotional expression or real connection, shows like Adolescence create a false narrative where popularity, dominance, and appearance are the keys to success. This not only isolates boys who don’t fit these narrow roles but also discourages them from seeking help when they are struggling internally.
Recent NHS data and research show a dramatic rise in hospital admissions for girls suffering from self-harm and eating disorders. For example, eating disorders are four times more common in girls aged 11 to 16 than boys, and the number of girls hospitalised after self-harming has sharply increased.
However, experts warn this doesn't mean boys are doing well. Dr Elaine Lockhart of the Royal College of Psychiatrists explains that boys often express mental distress differently — through behavioural problems rather than emotional symptoms — but these are "two sides of the same coin."
This suggests that support systems in schools and healthcare may be unintentionally skewed. As resources are directed toward emotional disorders, more commonly presented by girls, boys whose distress shows up as anger, defiance, or withdrawal risk being labelled as troublemakers rather than being offered help.
Joeli Brearley, host of To Be A Boy, argues that outdated systems in education and society are failing both girls and boys: “Something is going badly wrong.” The pressures of social media, rising inequality, and the collapse of traditional support systems have left many young people — boys included — feeling lost and unsupported.
As a teacher, I’ve seen my share of tense parent-school exchanges. But the recent case in Hertfordshire, England, where two parents were arrested after expressing concerns about their daughter’s school in a private WhatsApp group, is beyond belief.
Six police officers sent to arrest two parents for complaining about their daughter's school in a private WhatsApp group chat.
Let’s pause for a moment: this didn’t happen in Myanmar or Russia. This happened in 2025 Britain. Six police officers arrived at the home of Maxie Allen and Rosalind Levine. The couple were detained for 11 hours on suspicion of harassment and malicious communications—all because they criticised their daughter’s school leadership and shared their disbelief about being banned from the premises.
Were they issuing threats? No. Inciting violence? Not even close. They were frustrated parents asking questions and sharing opinions in a private forum.
We are educators, not enforcers. Of course schools deserve respect—but so do parents. Especially when they are advocating for their disabled child. When routine communication is criminalised, and “disharmony” becomes a police matter, we must ask: what are we becoming?
This is not how trust is built. If schools want engaged, supportive communities, we need to stop treating dissent like a crime.
When we are interested in cause and effect relationships (which is much of the time!) we have two options: We can simply observe the world to identify associations between X and Y, or we can randomise people to different levels of X and then measure Y.
The former – observational methods – generally provides us with only a weak basis for inferring causality at best. This approach has given us the oft-repeated (but slightly fallacious) line that ‘correlation does not imply causation’ (I would say that it can imply it, just often not much more). Of course, sometimes this is the best that we can do – if we want to understand the effects of years spent in education on mental health outcomes (for example), it would be unethical and impractical to conduct an experiment where we randomise children to stay in school for 1 or 2 more years (which option is unethical may depend on whether you’re the child or the parent…).
But when we can randomise, that gives us remarkable inferential power. The lack of causal pathways between how we allocate participants to conditions (our randomisation procedure – hopefully, something more robust than tossing a coin!) and other factors is critical. If our randomisation mechanism influences our exposure (which by definition it should) and nothing else (ditto), and we see a difference in our outcome, then this difference must have been caused by the exposure we manipulated. But a lot remains poorly understood about exactly how and why randomisation has this magic property of allowing us to infer cause and effect. And this leads to misconceptions about what we should report in randomised studies.
I want to dispel a couple of common but persistent myths.
The first myth is that randomisation works because it balances confounders. Confounders exist in observational studies because the associations we observe between an exposure and an outcome are also influenced by myriad other variables – age, sex, social position and so on – via a complicated web of causal chains. In principle, if we measure all of these perfectly and statistically adjust for them then we are left with the causal effect of the exposure on the outcome. But in practice, we are never able to do this.
When we randomise people, these influences will still be operating on the outcome, which will vary across the people randomised to our conditions. Does randomisation mean that all these different effects are balanced somehow?
No – not least because confounders do not exist in experimental studies! This is for the simple reason that a confounder is something that affects both the exposure and the outcome, and in an experimental (i.e., randomised) study we test for a difference in our outcome between the two randomised groups. We know that randomisation influences the exposure, but we don’t directly compare levels of exposure and the outcome – we compare the randomised arms. And variables such as age, sex and social position can’t influence the randomisation mechanism (there is no causal pathway between, for example, participant age and our random number generator!).
So, to be accurate, we need to be talking about covariates in experimental studies – factors that influence or strongly predict the outcome – not confounders. Does randomisation balance these? Well, yes, but in a more technical and subtle sense than is generally appreciated. We know (mathematically) that the chance of a difference between our randomised groups in terms of covariates and the distribution of future outcomes becomes smaller as our sample size becomes larger (all other things being equal, larger experiments will provide narrower confidence intervals, and more precise estimates – as well as smaller p-values, if that's your thing!).
In other words, a smaller study has a higher chance of imbalance, and this will be reflected in a wider confidence interval (and correspondingly larger p-value).
This means that it doesn't matter whether our groups are in fact balanced, because we've been able to turn complexity into error. If our sample is small our standard error will be large, reflecting the greater likelihood of imbalance, and our statistical test will take that into account when generating a confidence interval and p-value. That is exactly why larger studies are more precise – they are more likely to be balanced. Darren Dahly, a statistician at University College Cork, gives a more complete treatment of the issue here. In his words: 'randomisation allows us to make probabilistic statements about the likely similarity of the two randomised groups with respect to the outcome'.
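As a rough illustration of that point (the simulation below is mine, not from Dahly's treatment, and every number in it is made up), this Python sketch repeatedly randomises people to two arms and records how far apart the arms end up on a single prognostic covariate such as age:

```python
import numpy as np

# Illustrative simulation (all numbers invented): how large is the typical
# imbalance in a baseline covariate such as age after simple 1:1
# randomisation, and how does it change with sample size?
rng = np.random.default_rng(42)

def mean_imbalance(n_per_arm, n_sims=5_000):
    """Average absolute difference in mean 'age' between the two randomised arms."""
    diffs = []
    for _ in range(n_sims):
        age = rng.normal(40, 10, size=2 * n_per_arm)              # a prognostic covariate
        arm = rng.permutation([0] * n_per_arm + [1] * n_per_arm)  # robust 1:1 allocation
        diffs.append(abs(age[arm == 1].mean() - age[arm == 0].mean()))
    return float(np.mean(diffs))

for n in (10, 50, 250, 1000):
    print(f"n per arm = {n:4d}: typical imbalance in age = {mean_imbalance(n):.2f} years")
# The imbalance never disappears in any single trial, but its typical size
# shrinks as the sample grows, and the standard error shrinks with it.
```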
This leads to the second myth, which is that we should test for baseline differences between randomised groups. We see this all the time – usually Table 1 in an experiment – a range of demographic variables (the covariates we've measured – the known knowns) for each of the two groups, and then a column of p-values. Now, this is a valid approach in an observational study, where we might want to test whether something is in fact a confounder by testing whether it is associated with the level of the exposure (e.g. whether or not someone drinks alcohol). But is it valid in an experimental study (e.g. if we're randomising people to consume a dose of alcohol or not)?
Once we start to think about what those p-values in Table 1 might be telling us, the conceptual confusion becomes clear. A randomisation procedure should be robust (i.e., immune to outside influence), and the methods section should give us the information to evaluate this. What would a statistical test add to this? As Doug Altman said in 1985: ‘performing a significance test to compare baseline variables is to assess the probability of something having occurred by chance when we know that it did occur by chance’. If our randomisation procedure is robust, by definition any difference between the groups must be due to chance. It’s not a null hypothesis we’re testing, it’s a non-hypothesis.
Aha! But what if our randomisation process is not robust for reasons we’re not aware of? Surely we can test for that this way? But how should we do that? In particular, what alpha level should we set for declaring statistical significance? The usual 5%? If we did that, we would find baseline differences in 1 in 20 studies (more, probably, since multiple baseline variables are usually included in Table 1) even if all of them had perfectly robust randomisation. Better to invest our energies in ensuring that our randomisation mechanism is indeed robust by design (e.g., computer-generated random numbers that are generated by someone not involved in data collection).
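To see the one-in-twenty problem for yourself, here is a small simulation of my own (invented distributions, scipy's standard t-test): the randomisation in every simulated trial is perfectly robust, yet the baseline test still flags a 'significant' difference about 5% of the time.

```python
import numpy as np
from scipy import stats

# Illustrative simulation (invented numbers): every trial below uses perfectly
# robust randomisation, yet a Table 1-style t-test on a baseline covariate
# still comes out 'significant' at alpha = 0.05 about one time in twenty.
rng = np.random.default_rng(7)
n_per_arm, n_trials, alpha = 100, 10_000, 0.05

false_alarms = 0
for _ in range(n_trials):
    baseline_a = rng.normal(40, 10, n_per_arm)  # e.g. age in arm A
    baseline_b = rng.normal(40, 10, n_per_arm)  # same population, arm B
    _, p_value = stats.ttest_ind(baseline_a, baseline_b)
    false_alarms += p_value < alpha

print(f"'Significant' baseline differences: {false_alarms / n_trials:.1%}")  # roughly 5%
# The test says nothing about whether the randomisation was robust; that has
# to be guaranteed by design and described in the methods section.
```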
OK, OK – but what about deciding which of our baseline characteristics to adjust for in our analysis? It’s true that adjusting for baseline covariates that are known to influence the outcome can increase the precision of our estimates (and shrink our p-values – hurrah!). But testing for baseline differences to decide what to adjust for is again conceptually flawed. A statistically significant difference is not necessarily a meaningful difference in terms of the impact on our outcome. It depends in large part on whether the covariate does in fact strongly influence the outcome, and we aren’t testing that! Much better to select covariates based on theory or prior evidence – identify the variables we think a priori are likely to be relevant and adjust for these.
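One last sketch, again with made-up numbers, shows why adjustment for a covariate chosen a priori pays off: across repeated simulated trials, the adjusted estimate of the treatment effect is far less variable than the simple difference in means, even though both are unbiased.

```python
import numpy as np

# Illustrative simulation (invented numbers): adjusting for a covariate chosen
# a priori because it strongly predicts the outcome tightens the estimate of
# the treatment effect, whatever the 'balance' happens to be in any one trial.
rng = np.random.default_rng(1)
n_per_arm, n_sims, true_effect = 50, 2_000, 1.0

unadjusted, adjusted = [], []
for _ in range(n_sims):
    arm = np.repeat([0, 1], n_per_arm)
    covariate = rng.normal(0, 1, arm.size)  # e.g. a baseline score
    outcome = true_effect * arm + 3.0 * covariate + rng.normal(0, 1, arm.size)

    # Unadjusted estimate: simple difference in means between the arms.
    unadjusted.append(outcome[arm == 1].mean() - outcome[arm == 0].mean())

    # Adjusted estimate: ordinary least squares on intercept, arm and covariate.
    X = np.column_stack([np.ones(arm.size), arm, covariate])
    beta, *_ = np.linalg.lstsq(X, outcome, rcond=None)
    adjusted.append(beta[1])  # coefficient on 'arm'

print(f"Spread (SD) of unadjusted estimates: {np.std(unadjusted):.3f}")
print(f"Spread (SD) of adjusted estimates:   {np.std(adjusted):.3f}")  # noticeably smaller
```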
Randomisation is extremely powerful but also surprisingly simple. Its power comes from the ability it gives us to control some of the key causal pathways operating, and to convert complexity into measurable, predictable error. So we can relax! We don’t need to worry about ‘balance’ – our sample size and the standard error will take care of that (which is why we need to power our studies properly!) – and we don’t need to have that column of p-values in Table 1 – they don’t tell us anything useful or give us any information we can usefully act on. We should all – including the editors and reviewers who ask for these things – take note!
How should we report randomisation?
If we accept that the key to successful randomisation is getting the process right (rather than testing whether or not it works post hoc, which is fraught with conceptual and practical issues), how do we report randomisation in a way that allows readers to evaluate its robustness?
In medical studies – particularly clinical trials – journals expect authors to follow reporting guidelines (these exist for a vast range of study designs, many of which are relevant to psychology). A full description might look something like this:
Randomisation was generated by an online automated algorithm (at a ratio of 1:1), which tracked counts to ensure each intervention was displayed equally. Allocation was online and participants and researchers were masked to study arm. If participants raised technical queries the researcher would be unblinded, participants seeking technical assistance received no information on the intervention in the other condition and so were not unblinded. The trial statistician had no contact with participants throughout the trial and remained blinded for the analysis. At the end of the baseline survey, participants were randomised to view one of two pages with the recommendation to either download Drink Less (intervention) or the recommendation to view the NHS alcohol advice webpage (comparator).
This example was taken from a recent article published by Claire Garnett and colleagues (disclosure: I’m a co-author!), which tested the efficacy of an app to reduce alcohol consumption. As it was a clinical trial and published in a medical journal it had to follow the relevant reporting guidelines and describe the randomisation process fully.
Of course, sometimes the randomisation process is robust and can be described very briefly – a computer task may have randomisation built in, so the experimenter doesn't need to be involved at all. But that should still be described clearly. And sometimes the randomisation process does involve humans (and therefore may be potentially biased!).
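For what a built-in, human-free mechanism might look like, here is a toy count-tracking 1:1 allocator in Python. It is emphatically not the algorithm used in the Drink Less trial; it is only a sketch of the general idea.

```python
import secrets

# Toy count-tracking 1:1 allocator, loosely in the spirit of the description
# quoted above. This is NOT the actual Drink Less trial algorithm; it simply
# illustrates the idea of a built-in mechanism with no human involvement.
class BalancedAllocator:
    def __init__(self, arms=("intervention", "comparator")):
        self.counts = {arm: 0 for arm in arms}

    def allocate(self):
        # Restrict the choice to whichever arm(s) currently have the fewest
        # participants, then pick at random with a strong generator.
        fewest = min(self.counts.values())
        candidates = [arm for arm, n in self.counts.items() if n == fewest]
        arm = secrets.choice(candidates)
        self.counts[arm] += 1
        return arm

allocator = BalancedAllocator()
print([allocator.allocate() for _ in range(10)])  # arm counts never differ by more than one
```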
Something I've learned throughout my career is that we can learn a lot from how things are done in other disciplines (and also showcase what we do well in psychology). This is perhaps one example of that – there's lots of good practice in psychology when it comes to reporting randomised studies, but we can still look to learn and improve.
Marcus Munafò is a Professor of Biological Psychology and MRC Investigator, and Associate Pro Vice-Chancellor – Research Culture, at the University of Bristol. marcus.munafo@bristol.ac.uk
As technology advances, schools and universities are increasingly challenged to keep pace, and many are struggling. Technology affects learning, teaching, assessment, and… academic honesty. While digital tools, artificial intelligence, and online platforms may offer some benefits, they also raise questions about how schools (and colleges/universities) manage human behavior—especially when it comes to dishonesty.
ChatGPT can help students with research, essay writing, and maths problems. While this can enhance learning, it also presents opportunities for academic dishonesty. Many institutions still lack policies that specifically address AI-generated work, leaving them to play catch-up as students find creative ways to bypass traditional assessment methods. The rapid adoption of online assessment by schools and colleges has also opened the door to new forms of cheating, from sharing answers via social media (as happened with the International Baccalaureate in the May 2024 session) to using unauthorized tech during exams. Students now have easy access to tools that can undermine the integrity of assessments. School administrators are often left scrambling for solutions such as lockdown browsers and AI-detection software, but these tools are not foolproof and can lead to intrusive surveillance and unnecessary tension.
Institutions that jumped onto the expensive high-tech bandwagon are now facing cheating problems for which they had not prepared, and some seem to be rethinking their decision to go down the tech route.
Read this article on the Radio New Zealand website about universities and their online assessment problems and solutions.