Most text indexing/search algorithms can be tuned better if you know the language of the text. But for much of the data we deal with in libraries, we don’t know what the source language is.
Figuring out what alphabet(s) are used is fairly trivial, but that's not the same as figuring out languages. I suspect that if you spent some time on it, you could train a reasonably accurate supervised-learning classifier that would often get it right. Maybe there's even an open source tool that comes already 'trained' and can do it reasonably accurately?
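To illustrate the "alphabet detection is trivial" part: here's a minimal sketch (not any particular tool's method, just Python's standard library) that counts which Unicode scripts appear in a string, using the convention that the first word of a character's Unicode name is usually its script:

```python
import unicodedata
from collections import Counter

def detect_scripts(text):
    """Rough script tally: the first word of a character's Unicode
    name (e.g. 'LATIN', 'CYRILLIC', 'GREEK') usually names its script."""
    counts = Counter()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                counts[name.split()[0]] += 1
    return counts

print(detect_scripts("hello мир"))  # mostly LATIN, some CYRILLIC
```

Of course, knowing a text is in Latin script still leaves dozens of candidate languages, which is exactly why script detection isn't language detection.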
If you think language detection could improve your software, and you don't have the time or interest to figure that out yourself, but you do have $35/month and pipelines that can tolerate making some network calls, you may be interested in getlang.io, a new language-detection service. I haven't played with it myself yet; I just saw the announcement and thought some might be interested.
Some in the HackerNews thread suggest that it’s not that hard at all to do this yourself locally with open source software, so it might be worth investigating that first before jumping to the paid cloud service. Perhaps this is mainly a good reminder that reasonably accurate language detection is well within our grasp, one way or another, and not having properly labelled source languages shouldn’t be considered a major barrier.