What Language Teachers Should Ask Any AI Vendor Before Signing: A Checklist From Someone on the Other Side of the Table
By Maria Le Mura, Founder, Beyond Words

DOI: https://www.doi.org/10.69732/WUQI7059
In a demo call last autumn, the Head of a World Language Department in the American Midwest asked me a question I had not been asked before. We had just finished the standard tour of our platform. Numbers on the slides, pretty student dashboards, the usual. Then, calmly, almost apologetically, she said: “Can you show me the same data, but for one student who did not improve?”
I had not prepared that slide. I asked if I could send it the next day. She agreed. The deal eventually went through, but I have thought about her question many times since.
It was the right question. It separates language technology coordinators who have been burned from those who have not been burned yet. Coordinators who have lived through a deployment that quietly underperformed know what an aggregate slide can hide. The ones who have not lived through that conversation still believe a positive average means a positive experience for the students who actually struggled. And it captures something every vendor demo is structurally bad at: telling you what does not work.
I founded Beyond Words five years ago. Before that I was a phonetics researcher working on Italian regional varieties, and I still teach for one semester a year at a French university. We are an AI-powered oral language learning platform based in Montpellier, France, with around 200,000 users across 12 countries. So I am writing this from a strange place. I am one of the people you are evaluating, and I am about to give you a list of questions designed to make my own demo calls harder.
I think the trade is worth making. Language technology coordinators choose between vendors who all show the same upward arrows, the same cherry-picked classrooms, the same vague claims about “AI-powered” something. The questions below will not save you from every bad decision. But they will save you from the worst ones.
1. Ask for per-student data, not cohort data.
This is the first test, and most vendors fail it without realizing they have failed.
When a vendor shows you a “30 percent improvement” or a “40 percent reduction in errors”, ask: improvement for whom, measured how, across what time. Then ask to see the data for one student. Then for a different student in the same cohort. Then for the lowest performer. What you are looking for is whether the dashboard shows the same level of detail for each of those students. The top performer’s data should not be a glowing summary while the lowest performer’s is a generic flag. If the resolution drops at the bottom of the cohort, that is where the tool is least useful, and that is exactly where you need it most.
Aggregated metrics are easy to massage. Per-student trajectories are not. If a vendor can show you a dashboard where a single teacher can see, for a single student, the proficiency level, the specific weak spots, and the trajectory over time, you are probably in front of something built by people who teach. If the demo defaults back to a cohort heatmap every time you ask about an individual, you are probably in front of a marketing layer wrapped around a model. Saito and Plonsky (2019), in their meta-analysis of L2 pronunciation teaching, make a related point at the research level: the most diagnostic measures are the ones that separate overall fluency from specific dimensions like accuracy on a particular sound, prosody, or hesitation. A score that collapses everything into one number tells you almost nothing about what to work on.
A practical phrasing that has worked for coordinators I have met: “Show me your tool from the perspective of one teacher, with one class, on a Monday morning.”
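The gap between a cohort average and individual trajectories can be made concrete with a few lines of Python. This is a minimal sketch with invented numbers: the student names, the 0-100 score scale, and the function names are all assumptions for illustration, not any vendor's data model.

```python
from statistics import mean

# Hypothetical pre/post oral scores (0-100) for an invented cohort of five.
scores = {
    "student_a": (52, 78),
    "student_b": (60, 81),
    "student_c": (47, 70),
    "student_d": (55, 54),
    "student_e": (41, 38),
}


def cohort_gain(scores):
    """Average change across the cohort: the number that ends up on the slide."""
    return mean(post - pre for pre, post in scores.values())


def students_without_gain(scores):
    """Students whose score did not go up, largest drop first."""
    changes = {name: post - pre for name, (pre, post) in scores.items()}
    flat_or_down = [name for name, change in changes.items() if change <= 0]
    return sorted(flat_or_down, key=changes.get)


print(f"cohort gain: {cohort_gain(scores):+.1f} points")
print("did not improve:", students_without_gain(scores))
```

In this made-up cohort the average gain is comfortably positive while two of the five students went backwards. A dashboard that can only produce the first line of output cannot answer the question about the student who did not improve.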
2. How do you measure oral proficiency, and against what?
The polite answer is “CEFR” or “ACTFL”. The honest answer is longer, and that is where you learn things.
Ask whether the vendor maps their internal score to CEFR, and if so, how that mapping was validated. Ask whether the mapping was done by linguists, by educators, or by a regression against another tool. Ask whether the validation has been peer-reviewed, even informally, by anyone outside the company. Ask if you can see the underlying rubric.
If the vendor says “we use AI to assess CEFR level”, they are saying nothing. CEFR is a descriptive framework, not a measurement instrument. The Companion Volume to the CEFR (Council of Europe, 2020) is explicit on this point: the descriptors define what a learner can do, not how to measure it. Any tool that gives a CEFR output has made an interpretive choice about how to bridge that gap. You want to know what that choice is.
A useful follow-up: ask what the tool would do with a student who is B2 in vocabulary, B1 in fluency, and A2 in pronunciation accuracy. If the answer is a single number, you have learned something about how much resolution the tool actually has. A low-resolution tool tells you the student is B1. A high-resolution tool tells you the student is B1 globally but A2 on the trilled /r/ and on vowel length distinctions, and B2 on sentence intonation. The second answer is what allows a teacher to do anything useful with the assessment.
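The difference between the two answers can be sketched as a data structure. The example below is illustrative only: the three dimensions come from the student described above, while the class name, the `collapsed` rule (taking the middle of the three levels), and the `weakest` helper are assumptions, not a description of how any real tool scores speech.

```python
from dataclasses import dataclass

# CEFR levels in ascending order. The list index is used only to compare
# levels in this sketch; it is not a validated measurement scale.
CEFR = ["A1", "A2", "B1", "B2", "C1", "C2"]


@dataclass
class OralProfile:
    vocabulary: str
    fluency: str
    pronunciation: str

    def collapsed(self) -> str:
        """Hypothetical low-resolution output: the middle of the three levels."""
        ranks = sorted(CEFR.index(level) for level in vars(self).values())
        return CEFR[ranks[1]]

    def weakest(self):
        """The dimension a teacher would want to work on first."""
        dims = vars(self)
        name = min(dims, key=lambda d: CEFR.index(dims[d]))
        return name, dims[name]


student = OralProfile(vocabulary="B2", fluency="B1", pronunciation="A2")
print("single score:", student.collapsed())
print("weakest dimension:", student.weakest())
```

The single score and the profile describe the same student, but only the profile tells the teacher where to spend Monday morning. A real tool would need finer categories than these three, such as the individual sounds and intonation patterns mentioned above.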
3. What happens to my students’ voice data?
This question is not glamorous, and that is exactly why it is one of the first questions you should ask in any procurement conversation.
Voice data is biometric data. Under Article 9 of the EU General Data Protection Regulation (GDPR, 2016), biometric data processed to uniquely identify a person is a special category, with a higher protection threshold than ordinary personal data. Under FERPA in the United States, the framework for voice recordings of minors is uneven, and under the Children’s Online Privacy Protection Rule, voice data from students under thirteen cannot be captured without verifiable parental consent (Federal Trade Commission, 2025). Outside the US and EU, regimes vary even more, and most are catching up to the technology rather than leading it. Many vendors, including ones I respect, are uncomfortably vague on this point because the engineering reality is messier than the marketing copy.
There are three layers of this question that are worth pressing on.
The first is access. Where is voice data stored, for how long, who has access, is it used to train models, can it be deleted on student request, and what happens to it when the contract ends. Ask in writing. If the vendor’s answer is “we are compliant with all applicable regulations”, that is not an answer. Compliance is the floor, not the ceiling, and the floor is currently lower than most educators assume. Ask the same questions again until you get specifics: server locations, retention windows, named access roles, deletion mechanisms.
The second is consent. The default should be opt-in for data collection, with the right to withdraw and have the data deleted at any time. If a tool is required for an educational activity, the data it collects to perform that activity should not be available for anything beyond that activity. Ask the vendor how students and parents are informed about what is being recorded, how it is used, and how consent can be revoked. If the answer is “this is covered in the terms of service”, you are looking at a tool that treats consent as a checkbox rather than as a conversation.
The third is security. Compliance is one thing, breach risk is another. Voice prints are biometric identifiers and they cannot be reset like passwords. If a voice dataset leaks, the people in it do not get a new voice. Ask the vendor what their security posture looks like, when they were last audited, and whether they have ever had a security incident. The answer to that last question should be honest, even uncomfortable. A vendor who claims they have never had any kind of security event is either very small, very lucky, or not telling you the whole story.
If the vendor has a Data Protection Officer, ask to speak with them directly during procurement, and bring your own institution’s DPO or, if you do not have one, your most senior administrator and an IT specialist with you. The answers you get from a DPO will often be different from the answers your sales contact gives, and the difference is informative. There is no excuse for adopting a tool of this kind without proper review at procurement and at every renewal. Educators are not expected to be lawyers, but they are expected to refuse what they cannot defend.
4. What is your stake in our success?
This is the question that pricing answers, whether the vendor intends it to or not.
Per-seat pricing means the vendor’s revenue scales with enrolled students. The vendor wins when you sign up many students, regardless of whether they use the tool. Per-institution flat pricing means the vendor wins once and has limited financial reason to ensure that students engage. Freemium models mean the vendor’s incentive is to convert individual students or teachers, not necessarily to serve your institution. In practice, these incentives play out in predictable ways. Per-seat: the vendor pushes for high enrollment numbers at contract signing, and becomes less responsive once you are locked in. Per-institution flat: the customer success team is calibrated for upsell rather than usage, and students who do not log in are not the vendor’s problem. Freemium: the product roadmap is driven by what individual users want, not what your curriculum needs.
None of these is inherently bad. But each tells you something about what behaviors the vendor is optimizing for. I have run a company through one near-death funding event that forced us to contract from fifteen people back to five, and I can tell you with some authority that the business model a vendor signs you into is also the discipline they are most likely to enforce when survival is on the line. The parts of the service that survived that contraction were the parts that touched our contractual obligations, and the parts that survived first were the parts that paid the bills. This is the discipline you are buying into when you sign with any vendor, whether they tell you or not.
If usage data is part of your evaluation, ask the vendor what their renewal rate looks like across institutions with low student engagement. If they give you a rehearsed answer, ask them to put you in touch with two clients who did not renew their contract. A vendor who refuses to put you in touch with a churned client is telling you something useful about their relationship with reality.
5. How do you handle the languages we actually teach?
Most platforms support a “top 10” set of languages. Most language departments teach at least one language not in any top 10.
If you teach Yoruba, Tagalog, Quechua, Haitian Creole, Hawaiian, or any heritage or indigenous language, ask what happens in those languages. The answer reveals architecture more than anything else the vendor will tell you. A vendor with a flexible engine that can be retrained on new linguistic data will say “we have not built that yet, but here is what would be involved”. A vendor whose tool is essentially a wrapper around a third-party model will say “we hope to add it soon”. That second vendor cannot answer with confidence because they do not control the underlying engine. Adding a language to a wrapped tool depends on whether the upstream provider supports it, on what timeline, and at what quality. The vendor’s roadmap is a function of someone else’s roadmap, and that someone else is not in the procurement meeting with you.
For dialects and regional variants, the question matters even when the language is “covered”. Spanish is a single language, but like most widely-spoken languages, it has regional variants with phonological and lexical differences large enough to matter in the classroom. If your students are heritage speakers of Mexican Spanish, an engine trained on Castilian will tell them, gently and incorrectly, that they are wrong. Italian, my own first language, behaves the same way. My grandmother in Sicily and a speaker from Milan share less phonological territory than most language models will admit, and that gap matters in the classroom long before it matters in the literature. Ask the vendor how the tool handles dialectal variation, and whether they distinguish between L1 transfer errors and dialect-level differences. If they cannot answer, the tool will spend the semester correcting your students for speaking the way their grandmother does.
6. How do you integrate with the LMS we already have?
This is a procurement question that often gets answered last, when it should be answered first.
Single sign-on, gradebook passback, SCORM or LTI compliance, and roster sync are not features. They are essential infrastructure, and whether the vendor supports them is not negotiable. Ask the vendor how their tool integrates with Canvas, Moodle, Blackboard, Brightspace, or whatever you run. Ask for a list of institutions where the integration has been live for at least one academic year, and ask to speak with a representative from one of them.
If the vendor says “we are working on it” for any of these, factor in the integration work as part of your true cost. The cost of the faculty time required to manage rosters by hand for a semester is almost always larger than the license fee.
7. Show me one client where it did not work.
This is the question I started with, and it remains the one I find most diagnostic.
Most other questions can be answered with prepared slides. This one cannot. A vendor’s response to it tells you whether they have ever sat through a deployment that failed, and what they did when they realized it had failed.
Every vendor has a deck of success stories. The interesting information is on the other side of the deck. Ask the vendor about a client where the tool was deployed and did not produce the expected results. Listen to whether the explanation is about the client (“they did not engage their teachers”) or about the vendor (“we did not understand their use case until it was too late”).
The honest vendors will tell you about a deployment that taught them something. They will name what they changed afterwards. They might even tell you, with some embarrassment, that they walked away from a contract rather than push through a renewal that would have failed. I have done that once in the past two years, and I learned more from that one case than from twenty successful deployments.
Alain Guillard, the IT coordinator at Isaac de l’Étoile, a secondary school group in Western France, once described our deployment with a sentence I have repeated since: teachers could finally hear all of the students, even those who never spoke in class, and sometimes those students did not want to take off the headset. The observation came from a deployment that worked, but the surprise in his voice told me what he had expected from previous tools. Ask any vendor about the deployments where expectations were not met. That is where the learning lives.
8. The question vendors don’t want you to ask.
I will close with the question that, when asked by a coordinator, makes me reach for my coffee.
“Walk me through what happens, technically, when a student speaks into your tool.”
Most vendors, including some I respect, will give an answer that sounds confident and tells you almost nothing. Phonetic features, encoder, comparison to reference, score. The phrasing varies, the resolution does not. The reason is that very few vendors in this space have built the engine themselves. Most are wrapping third-party speech recognition systems and adding a feedback layer.
This is not necessarily bad. But it tells you what the vendor controls and what the vendor does not. If the underlying speech model has a bias against accented English, the vendor cannot fix it. Lindemann and Subtirelu (2013) showed that human listeners bring expectations that are not phonetic but social, and models trained on human judgments can inherit those expectations. If the model performs poorly on the language you teach, the vendor cannot retrain it. If the vendor’s “AI” is, structurally, someone else’s AI plus a UI, you are not buying what you think you are buying.
There is a second layer to this question that matters even more. When a student speaks into the tool, who hears the recording? The vendor’s servers, certainly. The upstream speech model provider, if there is one. Anyone the upstream provider shares data with for model improvement. Anyone in the vendor’s engineering team who has access to debugging logs. Each of these is a link in a chain, and each link is a place where your students’ voice data exists in someone else’s possession. Ask the vendor to draw the chain on paper. If they cannot, you are not evaluating a single tool, you are inheriting every supplier contract upstream of it.
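One way to keep track of the chain while the vendor draws it is to record each link and note which questions remain unanswered. The sketch below is a hypothetical illustration: the four holders mirror the list above, but the `Link` class, its two fields, and every value filled in are invented for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Link:
    """One party that holds or can access student recordings."""
    holder: str
    retention_days: Optional[int] = None       # None means not yet answered
    used_for_training: Optional[bool] = None   # None means not yet answered

    def open_questions(self):
        missing = []
        if self.retention_days is None:
            missing.append("retention window")
        if self.used_for_training is None:
            missing.append("training use")
        return missing


# Invented answers from an imaginary vendor, for illustration only.
chain = [
    Link("vendor servers", retention_days=365, used_for_training=False),
    Link("upstream speech model provider", retention_days=30),
    Link("upstream provider's partners"),
    Link("vendor engineers (debug logs)", retention_days=90, used_for_training=False),
]


def unanswered(chain):
    """Map each holder to the questions the vendor has not answered yet."""
    return {
        link.holder: link.open_questions()
        for link in chain
        if link.open_questions()
    }


for holder, gaps in unanswered(chain).items():
    print(f"{holder}: still need {', '.join(gaps)}")
```

Two fields are far fewer than a real review would track; server location, named access roles, and deletion mechanism belong on the same record. The point of the structure is only that a blank for any link is visible, and stays visible until someone fills it in.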
The honest answer to this question is rarely glamorous. Mine, for what it is worth, involves a lot of phonetics, several years of dataset construction, and a particular obsession with syllable-level analysis that started during my linguistics degree and never quite let go. The honest answer should sound like that: specific, situated, and a little bit personal. If it sounds like marketing, ask again until it does not.
Closing
None of these questions guarantees a good vendor decision. Procurement is hard, vendor demos are designed to be persuasive, and language technology is a field where the gap between what is promised and what is delivered remains uncomfortably wide. But these questions raise the cost of a sloppy demo, and they reward vendors who have done the work.
I came to this from phonetics, by way of a doctoral program in business management that I am still finishing. The transition from researcher to founder did not give me new answers. It gave me an uncomfortably good map of the questions I had been pretending were already answered. I have tried to write the list I would have wanted to see four years ago, when I was walking into my first vendor evaluation conversations from the other side of the table. I hope it helps you make fewer mistakes than I did.
References
Council of Europe. (2020). Common European Framework of Reference for Languages: Learning, teaching, assessment – Companion volume. Strasbourg, France: Council of Europe Publishing. Retrieved from https://rm.coe.int/common-european-framework-of-reference-for-languages-learning-teaching/16809ea0d4
European Parliament & Council of the European Union. (2016). Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data (General Data Protection Regulation). Official Journal of the European Union, L 119, 1-88.
Federal Trade Commission. (2025). Children’s Online Privacy Protection Rule. Federal Register, 90(76). Retrieved from https://www.federalregister.gov/documents/2025/04/22/2025-05904/childrens-online-privacy-protection-rule
Lindemann, S., & Subtirelu, N. (2013). Reliably biased: The role of listener expectation in the perception of second language speech. Language Learning, 63(3), 567-594. https://doi.org/10.1111/lang.12014
Saito, K., & Plonsky, L. (2019). Effects of second language pronunciation teaching revisited: A proposed measurement framework and meta-analysis. Language Learning, 69(3), 652-708. https://doi.org/10.1111/lang.12345
AI disclosure: During the preparation of this article, the author used Claude (Anthropic) to generate initial drafts of several sections. The author wrote, reviewed and edited the content, made substantive changes to reflect her own experience and perspective, and takes full responsibility for the content of the article.
