u/CheckEmpty

▲ 11 r/ajatt+1 crossposts

Hey everyone!

I’m a final-year Data Science/Software Engineering student. For my final-year project, I wanted to tackle learning Japanese pitch accent.

So, I built a web app from scratch (UI and backend) powered by a custom deep learning model that grades your accent by comparing it to a native pronunciation.

https://preview.redd.it/g2sf5wvu3pxg1.png?width=1102&format=png&auto=webp&s=855f2519a67fdcdb28bfc8a14869f6b4af2a2f63

Here is a breakdown of how it works and what I learned building the AI behind it.

Link: https://pitchaccentapp.web.app/

https://preview.redd.it/8vjkfgj12pxg1.png?width=1138&format=png&auto=webp&s=f6ff662d9c866f6d5b15331e718b53cbe66e6db4

How to use it (The UI)

I designed the interface to feel like an Anki deck. As you can see in the screenshot, you get a standard flashcard layout with the word (like 有力 / yuuryoku), the meaning, and an example sentence.

  1. Listen: You click the audio button to hear the native pronunciation.
  2. Visualize: You can see the intended pitch accent mapped out (the red/black text shows the highs and lows: black is low, red is high).
  3. Test Yourself: You tap the Mic button and say the word into your browser.
  4. Get Graded: My AI compares your audio to the native speaker's audio and gives you a similarity score to let you know if you nailed the pitch. The AI score is weighted at 40% and a dynamic time warping (DTW) algorithm at 60% to produce the combined score.
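The grading step above can be sketched roughly like this. This is my reconstruction, not the author's actual code: the 40/60 weighting comes from the post, but the function names, the pitch-contour inputs, and the distance normalisation are all hypothetical.

```python
# Hypothetical sketch of the combined 40% AI / 60% DTW score described in the post.

def dtw_distance(a, b):
    """Classic dynamic time warping cost between two 1-D pitch contours."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # skip a native frame
                                 cost[i][j - 1],      # skip a user frame
                                 cost[i - 1][j - 1])  # match frames
    return cost[n][m]

def combined_score(ai_similarity, native_pitch, user_pitch, max_dist=50.0):
    """Blend the model's similarity (0..1) with a DTW-based similarity,
    weighted 40% AI / 60% DTW as stated in the post. max_dist is a
    made-up normalisation constant for this sketch."""
    dist = dtw_distance(native_pitch, user_pitch)
    dtw_similarity = max(0.0, 1.0 - dist / max_dist)
    return 0.4 * ai_similarity + 0.6 * dtw_similarity
```

A perfect match on both components gives 1.0; DTW is a good fit here because it tolerates the user speaking slightly faster or slower than the native recording.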

How I built it

  • Data Scraping: I couldn't find a clean dataset, so I wrote a custom scraper to pull thousands of native audio files directly from OJAD (Online Japanese Accent Dictionary). I then had to write scripts to clean, resample, and standardize the audio so the neural network could process it.
  • The AI: I built a Siamese Neural Network (trained with a technique called contrastive loss). Instead of categorizing words, it uses twin networks to compare the mathematical distance between embeddings of the native OJAD audio and your microphone input.
  • Odaka: I trained the model on 900 audio samples: 300 each of heiban, atamadaka, and nakadaka. Odaka (identical to heiban except when a particle is attached) would confuse the model, so I removed it. Since the deck consists of isolated words, particles never come up.
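For anyone unfamiliar with the Siamese/contrastive-loss setup mentioned above, here is a minimal sketch of the standard contrastive loss (the Hadsell–Chopra–LeCun formulation). The embedding vectors are hypothetical stand-ins for whatever the twin networks output; this is an illustration, not the author's training code.

```python
# Sketch of contrastive loss for a Siamese pair (standard formulation;
# embeddings here are placeholders for the twin networks' outputs).
import math

def euclidean(u, v):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

def contrastive_loss(emb_a, emb_b, same_accent, margin=1.0):
    """same_accent=1 pulls the pair together (loss = d^2);
    same_accent=0 pushes it apart until the distance reaches the margin."""
    d = euclidean(emb_a, emb_b)
    if same_accent:
        return d ** 2
    return max(0.0, margin - d) ** 2
```

Training on pairs like this is handy when you have few samples per class (900 clips total here), because every same/different pairing becomes a training example.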

Mel spectrograms the model is trained on
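Quick background on that input representation: a mel spectrogram is an ordinary spectrogram warped onto the mel scale, which spaces frequencies the way human hearing does. Below is a pure-Python sketch of the standard Hz↔mel conversion; real pipelines (e.g. `librosa.feature.melspectrogram`) build a bank of triangular filters spaced evenly on this scale and apply it to an STFT.

```python
# Standard Hz <-> mel conversion (O'Shaughnessy formula), shown for
# illustration; actual mel spectrogram extraction would use a library.
import math

def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
```

The warp compresses high frequencies and keeps resolution in the low range where pitch (F0) movement lives, which is exactly what a pitch-accent model needs to see.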

Disclaimer: The AI definitely isn't perfect yet; it's accurate about 80% of the time. It's still a work in progress, so I am really looking for your feedback on the UI, the grading accuracy, or any suggestions.

u/CheckEmpty — 17 days ago