Welcome to Sab-AI Lab

A boutique AI lab in Nagoya, Japan.

This is a brief of the Auto-Speech-Rater technical narrative, developed for installation on the TOEFL test platform of
TestDen (Canada).

The narrative lays out the technology's scope of work, accuracy, best use, and the way forward.


Datasets and models download:

By downloading this source code I acknowledge that I have read and understood, in their entirety, the system's scope and description below, as well as its behaviour/acceptance test criteria, and that I will take all requirements into account when I build upon or use the system, so that it keeps performing as described.



Auto-Speech-Rater: an algorithm for real-time spoken English assessment

This algorithm was built to process high-entropy speech (spontaneous free speech) using probabilistic machine learning and deep learning models to predict spoken English language proficiency. It measures a speaker's "pronunciation", "prosody", and "use of language" competencies, together with a latent semantic index, to rate spoken proficiency against the existing classification of TOEFL scores, and it also compares the result with the average ratings of non-native and native speakers.

This is the result of two years of study, whose overall achievement is an average assessment accuracy of 72% for non-native adult speakers. The correlation between the human scores and the machine scores for an overall measure of speaking was 0.86, supporting the reliability of the speaking measure in tests.
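As a reminder of what a correlation of 0.86 expresses, the Pearson correlation between human and machine scores can be computed as below. The score values here are invented for illustration; they are not the study data.

```python
import numpy as np

# Hypothetical human vs. machine scores on a 0-4 speaking rubric scale
# (illustrative values only, not the actual evaluation data).
human = np.array([2.0, 3.0, 3.5, 1.5, 4.0, 2.5, 3.0, 2.0])
machine = np.array([2.2, 2.8, 3.6, 1.4, 3.9, 2.7, 3.1, 1.8])

# Pearson correlation: covariance normalized by both standard deviations.
r = np.corrcoef(human, machine)[0, 1]
print(round(r, 2))
```

A value near 1 means the machine ranks and spaces speakers much as the human raters do, which is what the reported 0.86 conveys.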


  • It transforms sounds/language into vectors in an n-dimensional space, in which each feature is vectorized to represent pronunciation, prosody, and language for further evaluation.

  • The models range from parametric and non-parametric statistics to neural network architectures. The ETS scoring-rubric philosophy was generally adopted for judging the spoken language proficiency level. This framework can be changed and customized on demand.

  • The microphone input is crucial to the accuracy of the results. Pre-recorded sounds can certainly be analysed, but some acoustic features that carry key information are lost when the sound is compressed during digitization.
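As a minimal sketch of the vectorization idea in the first bullet, the following maps a raw waveform to a small feature vector. The function name and the specific features (frame energy, zero-crossing rate, pause ratio) are our own illustrative choices, far simpler than the production feature set:

```python
import numpy as np

def feature_vector(signal: np.ndarray, sr: int) -> np.ndarray:
    """Map a mono waveform to a small feature vector (illustrative only).

    Stands in for the richer pronunciation/prosody/language features
    described above; a real rater uses many more dimensions.
    """
    frame = int(0.025 * sr)                       # 25 ms frames
    n = len(signal) // frame
    frames = signal[: n * frame].reshape(n, frame)

    rms = np.sqrt((frames ** 2).mean(axis=1))     # per-frame energy
    zcr = (np.abs(np.diff(np.sign(frames), axis=1)) > 0).mean(axis=1)
    pause_ratio = float((rms < 0.1 * rms.max()).mean())  # crude silence share

    return np.array([rms.mean(), rms.std(), zcr.mean(), pause_ratio])

# Synthetic "speech": a 440 Hz tone with a silent gap in the middle.
sr = 16000
t = np.linspace(0, 1.0, sr, endpoint=False)
sig = np.sin(2 * np.pi * 440 * t)
sig[6000:10000] = 0.0

vec = feature_vector(sig, sr)
print(vec.shape, round(vec[3], 2))
```

Each utterance then becomes one point in the feature space, and the downstream models listed below operate on those points.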

Here are the models generated by the algorithms:

  • CART
  • ETC
  • NN
  • LDA
  • LR
  • MLTRNL
  • CNN
  • RNN
  • myspsolution
  • PCA
  • REF
  • SVN
  • dfdg
  • forquil

There are three model sets:

  • SET-1: developed from non-native English speakers who had prepared for the TOEFL test.
  • SET-2: developed from non-native and native English speakers in ordinary conversation situations.
  • SET-3: developed from non-native and native English speakers speaking about specific topics of which they had background knowledge.



Total speaking fluency refers to the ability of speakers to speak words about specific topics in English effortlessly and efficiently (automaticity), with meaningful expression that enhances the meaning of the topics (prosody). Fluency takes phonics, or word recognition, to the next level. While many speakers can decode words accurately, they may not be fluent or automatic in their word recognition during spontaneous speech. These speakers tend to expend too much of their limited mental energy on figuring out the pronunciation and meaning of words, energy that is taken away from the more important task of conveying intelligible ideas: getting to the topic's overall meaning. Thus, a lack of fluency often results in poor communication.

Fluent speakers, on the other hand, are able to speak words accurately and effortlessly. They produce words and phrases instantly, on the spot. A minimal amount of cognitive energy is expended in decoding the words, which means that the maximum amount of a speaker's cognitive energy can be directed to the all-important task of making sense of their ideas.

The second component of fluency is prosody, or speaking with expression. A key characteristic of fluent speakers (or fluent speech, for that matter) is the ability to embed appropriate expression into the speaking. Fluent speakers raise and lower the volume and pitch of their voices, speed up and slow down at appropriate places, group words into meaningful phrases, and pause at appropriate points through the course of speech. All of these are elements of expression, or what linguists have termed prosody. Prosody is essentially the melody of language as it is spoken. By embedding prosody in our oral language (read or spoken), we add meaning to the communication.

Latent Semantic Analysis (LSA) is employed to analyze speech and find the underlying meaning, or concepts, of the words used. If each word meant only one concept, and each concept were described by only one word, LSA would be easy, since there would be a simple mapping from words to concepts. But because words, groups of words, and even the same spoken words in different intonations carry different meanings, semantic analysis becomes a difficult task: the same word or group of words can convey multiple meanings, creating ambiguities in communication between people. At this stage, we apply LSA to datasets represented as follows: (1) "bags of words", where the order of the words in a document is not important, only how many times each word appears; (2) concepts, represented as patterns of words that usually appear together in documents; (3) vectorized words, transformed to hold only one meaning by considering their neighbouring words.
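The three steps above can be sketched in a few lines with a truncated SVD, which is the standard mechanism behind LSA. The term-document counts below are invented for illustration:

```python
import numpy as np

# Tiny bag-of-words matrix: rows = terms, columns = "documents"
# (transcribed responses). Counts are invented for illustration.
terms = ["travel", "trip", "flight", "grade", "exam", "study"]
counts = np.array([
    [2, 1, 0, 0],   # travel
    [1, 2, 0, 0],   # trip
    [1, 1, 0, 0],   # flight
    [0, 0, 3, 1],   # grade
    [0, 0, 1, 2],   # exam
    [0, 0, 1, 1],   # study
], dtype=float)

# Truncated SVD: keep k latent "concepts". Words that co-occur end up
# close together in the reduced space even when they never share a document.
U, s, Vt = np.linalg.svd(counts, full_matrices=False)
k = 2
term_vecs = U[:, :k] * s[:k]        # term coordinates in concept space

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# "travel" and "trip" share a concept; "travel" and "exam" do not.
print(cos(term_vecs[0], term_vecs[1]), cos(term_vecs[0], term_vecs[4]))
```

The cosine between "travel" and "trip" comes out near 1 while "travel" vs. "exam" comes out near 0, which is exactly the disambiguation-by-context behaviour step (3) describes.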


Speech Rater inside

The scientific way to measure one’s Reading/Speaking rate is in syllables per second.

Speech Rater’s estimate of the “Speaking Rate” is obtained by timing the user while they read a selection of text with a known syllable count.
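A minimal sketch of that measurement: count the syllables in the passage, then divide by the elapsed reading time. The vowel-group heuristic below is our own simplification; a production rater would use a pronunciation dictionary.

```python
import re

def count_syllables(word: str) -> int:
    """Crude vowel-group heuristic for English syllable counting
    (illustrative only; real systems use a pronunciation dictionary)."""
    w = word.lower()
    n = len(re.findall(r"[aeiouy]+", w))
    if w.endswith("e") and n > 1 and not w.endswith(("le", "ee")):
        n -= 1                      # drop a silent final "e"
    return max(n, 1)

def speaking_rate(text: str, seconds: float) -> float:
    """Speaking rate in syllables per second for a timed reading."""
    syllables = sum(count_syllables(w) for w in re.findall(r"[A-Za-z']+", text))
    return syllables / seconds

passage = "The quick brown fox jumps over the lazy dog"
# Suppose the user took 2.5 seconds to read the passage aloud.
print(speaking_rate(passage, 2.5))  # prints 4.4
```

The passage has 11 syllables under this heuristic, so a 2.5-second reading gives 4.4 syllables per second.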

Speech Rater evaluates the competency of the user by employing mathematical formulas and ETS’s independent speaking rubrics and philosophy (Educational Testing Service, USA).

Speech Rater went through a machine-learning training session with an audio dataset of non-native English speakers. The recordings, drawn from the speech audios of TOEFL iBT trainees, were as short as one minute each; the speakers' topics vary widely, totaling 13,762 minutes of audio.

The speech audios of the TOEFL iBT trainees had been rated by native English teachers and TOEFL examiners.


Scope and limitations

  • Note: this is a ”Speaking Rater” for evaluating spontaneous free speech. If a user reads out loud, the results should not be treated as equivalent to their spoken language proficiency.

  • The best way to determine the user’s speaking rate is to time the user while they deliver free speech.

  • All the annotations analyzed by the current algorithm are based on the above-mentioned rubrics and the non-native English speaker audios. We do not claim that these are 100% accurate, or that this is the only way speech can be analyzed. We will keep upgrading the algorithm; your comments and feedback are most welcome. Please feel free to contact us and let us know your thoughts about the corpus.

  • The evaluation mode can be set to either flexible or stringent. The stringent mode is sensitive to highly accurate language production, a standard reading rate, and the ability to read sentences effortlessly and automatically, with little conscious attention to the mechanics of reading, such as decoding. The flexible mode, by contrast, was originally designed for beginners, to let them build confidence as their skills grow.
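One way to picture the two modes is as two sets of pass thresholds over the same measurements. The threshold values and field names below are invented for illustration; the real rubric-derived cutoffs differ:

```python
# Hypothetical sketch of the two evaluation modes described above.
# Threshold values are invented; the real rubric-derived cutoffs differ.
THRESHOLDS = {
    "stringent": {"accuracy": 0.90, "rate_sps": 3.5, "automaticity": 0.85},
    "flexible":  {"accuracy": 0.70, "rate_sps": 2.0, "automaticity": 0.60},
}

def passes(mode: str, accuracy: float, rate_sps: float, automaticity: float) -> bool:
    """Return True if a response clears every threshold for the given mode."""
    t = THRESHOLDS[mode]
    return (accuracy >= t["accuracy"]
            and rate_sps >= t["rate_sps"]
            and automaticity >= t["automaticity"])

# A beginner-level sample might pass in flexible mode but not in stringent:
sample = {"accuracy": 0.75, "rate_sps": 2.4, "automaticity": 0.65}
print(passes("flexible", **sample), passes("stringent", **sample))  # True False
```

The design point is that both modes measure the same things; only the acceptance bar moves.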


High quality recording

  • Step 1. Find a quiet place for recording. Make sure to turn off all background machinery and electronic appliances, such as your TV set.
  • Step 2. Set up your recording equipment. Plug in and test your microphone. Do not put the microphone too close to your mouth (10-12 inches from the speaker is preferred) to avoid “p-pops".
  • Step 3. Adjust the recording settings. Before you start recording, make sure your machine's sound recorder is set to DVD-quality mono (44.1 kHz, 24-bit, mono).
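The Step 3 settings can be verified programmatically. A minimal sketch using Python's standard wave module (the function name and the "sample.wav" filename are illustrative):

```python
import wave

def check_recording(path: str) -> bool:
    """Return True if a WAV file matches the recommended settings:
    44.1 kHz sample rate, 24-bit depth, one (mono) channel."""
    with wave.open(path, "rb") as w:
        return (w.getframerate() == 44100
                and w.getsampwidth() == 3      # 3 bytes = 24-bit
                and w.getnchannels() == 1)

# Self-check: write 0.1 s of silence at the recommended settings, then verify.
with wave.open("sample.wav", "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(3)
    w.setframerate(44100)
    w.writeframes(b"\x00\x00\x00" * 4410)

print(check_recording("sample.wav"))  # True
```

Running such a check before analysis catches compressed or resampled uploads, which, as noted earlier, lose acoustic features that carry key information.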

For installing the algorithm on a new machine, we recommend re-training the machine/algorithm. Please contact us.

For each response, the system can offer the following suggestions, which can help eliminate or counterbalance the effects of different sentence types on the suprasegmental features produced by learners and reveal the segmental features in different sentence types:

Suggested pausing
Suggested intonation
Suggested linking
Suggested lexical stress for multisyllabic words


If you need the models to develop your own assessment system, please contact me and I will provide the model set that best caters to your case and needs.


A quick performance report on the ML

Dataset                        F1     Accuracy   Precision   Recall (Sensitivity)
Non-native speakers            74%    72%        78%         76%
Japanese-English speakers      78%    79%        81%         78%
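As a reminder of how these metrics relate to one another, here is how precision, recall, F1, and accuracy come out of a binary confusion matrix. The counts are illustrative, not the actual evaluation data:

```python
# Illustrative confusion-matrix counts (not the actual evaluation data):
# tp/fp/fn/tn = true positives, false positives, false negatives, true negatives.
tp, fp, fn, tn = 78, 22, 24, 76

precision = tp / (tp + fp)                        # share of flagged items that were right
recall = tp / (tp + fn)                           # share of true items that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
accuracy = (tp + tn) / (tp + fp + fn + tn)        # share of all items classified correctly

print(round(precision, 2), round(recall, 2), round(f1, 2), round(accuracy, 2))
# prints: 0.78 0.76 0.77 0.77
```

F1 sits between precision and recall, which is why it is reported alongside plain accuracy in the table above.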


The Auto-Speech-Rater source code is licensed under the GNU General Public License v3.0.


Contact us


Sab-AI Lab
〒466-0834 10-4 Hirojichō Umezono, Shōwa-ku, Nagoya, Aichi, Japan