cloud service for human language detection

Most text indexing/search algorithms can be tuned better if you know the language of the text. But for much of the data we deal with in libraries, we don’t know what the source language is.

Figuring out which alphabet(s) a text uses is fairly trivial, but that’s not the same as figuring out its language. I suspect that with some time you could assemble a labelled training set and build a “supervised learning” classifier that often gets it right. Maybe there’s even an open source tool that comes already ‘trained’ and does it reasonably accurately?
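
To make the alphabet point concrete, here’s a minimal sketch in Python using only the standard library. It assumes the first word of each character’s Unicode name (LATIN, CYRILLIC, ARABIC, and so on) is a good enough stand-in for its script, which is a rough approximation rather than a proper Unicode script lookup. The function name scripts_used is just something I made up for illustration.

```python
import unicodedata
from collections import Counter

def scripts_used(text):
    """Count letters per script, using the first word of each character's
    Unicode name as a rough proxy for its script (an assumption, not a
    real Unicode Script property lookup)."""
    counts = Counter()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                counts[name.split()[0]] += 1
    return counts

# English and French both come back as LATIN, which is exactly why
# knowing the alphabet does not tell you the language.
print(scripts_used("hello world"))
print(scripts_used("bonjour le monde"))
print(scripts_used("привет мир"))
```

The first two calls report the same script even though the languages differ, so something beyond alphabet detection is needed.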

If you think language detection could help improve your software, and you don’t have time or interest in trying to figure that out, but you do have $35/month and your software pipelines can tolerate making some network calls for language detection — you may be interested in this new language-detection service, getlang.io. I haven’t played with it yet myself, just saw the announcement and thought some might be interested.

Some in the HackerNews thread suggest that it’s not that hard at all to do this yourself locally with open source software, so it might be worth investigating that first before jumping to the paid cloud service.  Perhaps this is mainly a good reminder that reasonably accurate language detection is well within our grasp, one way or another, and not having properly labelled source languages shouldn’t be considered a major barrier.
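
As a rough idea of how little code a local first attempt can take, here’s a toy sketch that guesses a language by counting hits against small stopword lists. To be clear, this is a simple heuristic I’m using for illustration, not how getlang.io or any particular open source tool works; the word lists are hypothetical and far too small for real use, and guess_language is a made-up name.

```python
import re

# Hypothetical, deliberately tiny stopword lists. A real detector would be
# trained on much more labelled text and cover many more languages.
STOPWORDS = {
    "en": {"the", "and", "is", "of", "to", "in", "that", "it"},
    "fr": {"le", "la", "les", "et", "est", "de", "un", "une"},
    "de": {"der", "die", "das", "und", "ist", "nicht", "ein", "eine"},
    "es": {"el", "los", "las", "y", "es", "en", "una", "que"},
}

def guess_language(text):
    """Return the language code whose stopword list matches the most words
    in the text, or None if no word matches any list."""
    words = re.findall(r"\w+", text.lower())
    scores = {
        lang: sum(word in stopwords for word in words)
        for lang, stopwords in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

print(guess_language("the book is on the table"))
print(guess_language("le chat est sur la table"))
```

Something this crude will fall over on short strings like titles and on closely related languages, which is where a properly trained tool would earn its keep.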

