dcdaML - devanagari character detection dataset training framework

github.com
GitHub - Kishlay-notabot/dcdaML

cross-posted from: https://lemm.ee/post/61282397
I'm open-sourcing this project, which I made in just a weekend. I plan to continue it in my free time, with synthetic data generation and a few more modifications. Anyone is welcome to chip in; I'm not an expert in ML. The inference is live here using TensorFlow.js. The model is just 1.92 MB!
Great effort! What do you propose to do about the joint letters that are a peculiarity of Devanagari?
Thanks a lot! I think it's not only the joint letters: the diacritics are also very diverse, and it's a shame that we don't have any dataset covering this language and its diacritic combinations. Honestly, the possibilities feel almost endless and I don't know how we could generalize a model for this. It is surely possible, but I'm not that experienced in ML, so I'd really like to get ideas on this. As for the dataset, I think I'm going to work on one that includes diacritics in the future. I have plans but not the time to execute them fully, and the response and impact so far have been very limited.
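To give a rough sense of how quickly the label space grows, here is a minimal Python sketch that counts combinations straight from the Unicode Devanagari block. It is not part of dcdaML. The ranges are assumptions chosen for illustration: all 37 consonant code points (U+0915 to U+0939) and all 15 dependent vowel signs (U+093E to U+094C), several of which are not used in Hindi, and many of the generated conjuncts never occur in real text. The helper names are hypothetical.

```python
# Illustrative sketch only: enumerates Devanagari combinations by code point.
# The ranges below are assumptions for illustration, not the dcdaML label set.

VIRAMA = "\u094D"  # halant: suppresses the inherent vowel and joins consonants

# Consonants KA (U+0915) through HA (U+0939): 37 code points.
CONSONANTS = [chr(cp) for cp in range(0x0915, 0x093A)]

# Dependent vowel signs AA (U+093E) through AU (U+094C): 15 code points.
VOWEL_SIGNS = [chr(cp) for cp in range(0x093E, 0x094D)]


def consonant_with_signs():
    """Every single consonant followed by one vowel sign."""
    return [c + v for c in CONSONANTS for v in VOWEL_SIGNS]


def two_consonant_conjuncts():
    """Every pair of consonants joined by a virama (many are not real words)."""
    return [c1 + VIRAMA + c2 for c1 in CONSONANTS for c2 in CONSONANTS]


if __name__ == "__main__":
    simple = consonant_with_signs()
    conjuncts = two_consonant_conjuncts()
    print(len(simple))                         # 37 * 15 = 555
    print(len(conjuncts))                      # 37 * 37 = 1369
    print(len(conjuncts) * len(VOWEL_SIGNS))   # conjunct + vowel sign = 20535
```

Even this upper bound ignores three-consonant conjuncts and marks like anusvara, which is part of why synthetic rendering from fonts looks like a practical way to cover the combinations.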
I can imagine the challenges that you describe. It is only through efforts like yours that people will feel encouraged to produce better training datasets. I came across this dataset, which has words with diacritics (though I'm not sure it's right to call them that, since they are not accent marks) and seems to be different from the dataset that you are using: https://cvit-iiit-ac-in.translate.goog/research/projects/cvit-projects/indic-hw-data?_x_tr_sl=en&_x_tr_tl=hi&_x_tr_hl=hi&_x_tr_pto=tc
I can read and write Hindi/Devanagari well and am willing to help in any way possible toward incremental progress in this domain.