Repository
https://github.com/programarivm/babylon
What Will I Learn?
You'll learn how a program can be taught to perform a task by feeding it data rather than by writing explicit code.
Requirements
The following two posts will provide you with some context to follow today's tutorial.
- Babylon, a New Machine Learning Repo for Language Detection
- PHP Machine Learning Diary: Preparing Random Phrases with Linux Commands
Difficulty
- Intermediate
Tutorial Contents
Dear readers, how are you today? I wonder if you'd like to help me feed an intelligent language detector that can learn new languages very easily.
In case you missed my previous post on machine learning with PHP, let me tell you that Babylon 0.9.1 is already working okay with a bunch of ISO 8859 languages:
ISO 639-3 Code | Language |
---|---|
ces | Czech |
cym | Welsh |
dan | Danish |
deu | German |
eng | English |
fin | Finnish |
fra | French |
gla | Scottish Gaelic |
gle | Irish |
hun | Hungarian |
ita | Italian |
isl | Icelandic |
nld | Dutch; Flemish |
nob | Norwegian Bokmål |
pol | Polish |
por | Portuguese |
ron | Romanian |
spa | Spanish |
swe | Swedish |
tgl | Tagalog |
Note: Currently Babylon supports only ISO 8859 languages written in some form of the Latin alphabet. English, Italian, German, Tagalog and Turkish are a few examples among many others.
Can you imagine? There are dozens of different ISO 8859 languages out there! So your help is really appreciated.
How Can Babylon Be Taught New Languages?
Let me show you with an example how you could help me teach Babylon new languages. I'm going to replicate the steps that I followed to teach it Cebuano.
First of all, I did some digging on the Internet and found a public domain ebook written in Cebuano.
Project Gutenberg is a nice resource for this purpose since it contains a plethora of ebooks in different languages. However, it does not cover every language in the world. There are no ebooks in Turkish, for example, which is a bit of a shame given that this language has around 65 million speakers.
Be that as it may, we are teaching Babylon the Cebuano language. So let's just pick an ebook in Cebuano and copy and paste its content into the babylon/dataset/input/iso-8859/latin/austronesian/ceb.txt file.
Did you know that Cebuano is an Austronesian language?
As you can see, we're following the convention of naming each file after its ISO 639-3 code.
Also make sure to remove all English words from the beginning and the end of the file, as shown in the example above. This is because the core idea implemented in Babylon consists of computing the most frequent words in each language, and any leftover English text would bias our data sets.
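To make the trimming step concrete, here is a minimal sketch. It assumes the ebook carries the marker lines that Project Gutenberg files typically use, starting with *** START OF and *** END OF. The function name stripGutenbergBoilerplate is made up for this post and is not part of Babylon, so treat it as an illustration rather than the way the repo does it.

```php
<?php
// Hypothetical helper, not part of Babylon: keeps only the text between the
// "*** START OF ..." and "*** END OF ..." marker lines that Project Gutenberg
// ebooks typically carry. Assumes each marker sits at the start of its own line.
function stripGutenbergBoilerplate(string $raw): string
{
    $lines = preg_split('/\R/u', $raw);
    $start = 0;              // default: keep from the first line
    $end = count($lines);    // default: keep up to the last line
    foreach ($lines as $i => $line) {
        if (preg_match('/^\*{3}\s*START OF/i', $line)) {
            $start = $i + 1; // the body begins right after the start marker
        } elseif (preg_match('/^\*{3}\s*END OF/i', $line)) {
            $end = $i;       // the body stops right before the end marker
            break;
        }
    }
    return trim(implode("\n", array_slice($lines, $start, $end - $start)));
}
```

If a file has no markers at all, the sketch returns the text unchanged (apart from trimming), so it is still worth eyeballing the result before saving it as ceb.txt.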
Then I ran the command:
php cli/prepare.php
This creates CSV files with the most frequent words found in the files in the dataset/input folder.
The operation may take a few seconds to be completed.
Do you want to proceed? (Y/N): y
OK! The most frequent words in ceb.txt were transformed into CSV format...
OK! The most frequent words in tgl.txt were transformed into CSV format...
The austronesian language family has been updated.
OK! The most frequent words in cym.txt were transformed into CSV format...
OK! The most frequent words in gla.txt were transformed into CSV format...
OK! The most frequent words in gle.txt were transformed into CSV format...
The gaelic language family has been updated.
OK! The most frequent words in dan.txt were transformed into CSV format...
OK! The most frequent words in deu.txt were transformed into CSV format...
OK! The most frequent words in eng.txt were transformed into CSV format...
OK! The most frequent words in isl.txt were transformed into CSV format...
OK! The most frequent words in nld.txt were transformed into CSV format...
OK! The most frequent words in nob.txt were transformed into CSV format...
OK! The most frequent words in swe.txt were transformed into CSV format...
The germanic language family has been updated.
OK! The most frequent words in fra.txt were transformed into CSV format...
OK! The most frequent words in ita.txt were transformed into CSV format...
OK! The most frequent words in por.txt were transformed into CSV format...
OK! The most frequent words in ron.txt were transformed into CSV format...
OK! The most frequent words in spa.txt were transformed into CSV format...
The romance language family has been updated.
OK! The most frequent words in ces.txt were transformed into CSV format...
OK! The most frequent words in pol.txt were transformed into CSV format...
The slavic language family has been updated.
OK! The most frequent words in fin.txt were transformed into CSV format...
OK! The most frequent words in hun.txt were transformed into CSV format...
The uralic language family has been updated.
OK! The words in austronesian.csv were successfully read...
OK! The words in gaelic.csv were successfully read...
OK! The words in germanic.csv were successfully read...
OK! The words in romance.csv were successfully read...
OK! The words in slavic.csv were successfully read...
OK! The words in uralic.csv were successfully read...
OK! iso-8859-latin-family.csv was successfully written...
Operation completed.
That's it!
The prepare.php command calculates the most frequent words in every language and stores the results in the babylon/dataset/output/ folder. The babylon/dataset/output/iso-8859-latin-family.csv file, which holds roughly disjoint ("disjointish") sets of words, acts as a kind of digital fingerprint of the language families.
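The two ideas just described, counting frequent words and keeping only the words that set one family apart from the others, can be sketched in a few lines. This is an illustration under my own assumptions, not Babylon's actual code: the function names mostFrequentWords and disjointSets are hypothetical, the cut-off is arbitrary, and the sketch builds strictly disjoint sets whereas the real CSV is only roughly disjoint. It also assumes the mbstring extension is available.

```php
<?php
// Illustrative only: names and cut-off are assumptions, not Babylon's API.

/** Returns the $limit most frequent lowercase words in $text. */
function mostFrequentWords(string $text, int $limit): array
{
    // Split on anything that is not a letter, so "abó" stays one word.
    $words = preg_split('/[^\p{L}]+/u', mb_strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    $counts = array_count_values($words);
    arsort($counts); // highest count first
    return array_slice(array_keys($counts), 0, $limit);
}

/** Keeps, for each family, only the words that no other family also lists. */
function disjointSets(array $wordsByFamily): array
{
    // Count in how many families each word appears.
    $seenIn = [];
    foreach ($wordsByFamily as $words) {
        foreach (array_unique($words) as $word) {
            $seenIn[$word] = ($seenIn[$word] ?? 0) + 1;
        }
    }
    // Drop every word that shows up in more than one family.
    $result = [];
    foreach ($wordsByFamily as $family => $words) {
        $result[$family] = array_values(array_filter(
            array_unique($words),
            fn (string $word): bool => $seenIn[$word] === 1
        ));
    }
    return $result;
}
```

For example, if the Austronesian list holds ang, sa and de while the Romance list holds de and la, the shared word de is dropped from both and each family keeps only its own words.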
With these statistics, Babylon can now detect the Cebuano language, as shown next.
php cli/detect/language.php "Kong kini madangat, ang mga bana hayan mangaklas tungud kay dili makaantus sa paghikay sa abó ug pagpanakdak sa mga labhanan."
ceb
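How might a fingerprint turn into an answer like the one above? One simple possibility is to give each candidate a point for every word of the input that appears in its fingerprint and return the best scorer. The sketch below does exactly that. It is a guess at the general technique, not Babylon's real detector: detectByFingerprint is a hypothetical name, and the tiny word lists in the usage note are hand-picked for illustration rather than taken from the generated CSV files.

```php
<?php
// Hypothetical scoring, not Babylon's detector: one point per input word
// that is found in a candidate's fingerprint; the highest score wins.
function detectByFingerprint(string $text, array $fingerprints): ?string
{
    $words = preg_split('/[^\p{L}]+/u', mb_strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    $best = null;
    $bestScore = 0;
    foreach ($fingerprints as $label => $fingerprint) {
        // array_intersect keeps every input word present in the fingerprint,
        // so repeated words count once per occurrence.
        $score = count(array_intersect($words, $fingerprint));
        if ($score > $bestScore) {
            $best = $label;
            $bestScore = $score;
        }
    }
    return $best; // null when no fingerprint word appears in the text
}
```

Called with a toy table such as ['ceb' => ['ang', 'mga', 'sa', 'ug'], 'eng' => ['the', 'and', 'of']], the Cebuano sentence used above scores points only for ceb, while a text with no known words yields null.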
Finally, it is a good idea to write some tests to make sure Cebuano detection won't break in future releases for reasons we can't foresee right now.
// tests/unit/AustronesianTest.php
...
    /**
     * Cebuano text must be classified into the Austronesian family.
     *
     * @dataProvider cebData
     * @test
     */
    public function family_detect_ceb($text)
    {
        $this->assertEquals('austronesian', (new FamilyDetector($text))->detect());
    }
...
    // Supplies the Cebuano sample phrases used by family_detect_ceb().
    public function cebData()
    {
        return [
            [
                "Sa usá ka bangko, naglingkod ang duhá ka tawo. Babaye ang usá
                nga nagmaskará ug lalake ang ikaduhá nga waláy maskará. Ang lake
                nagkanayón:"
            ],
        ];
    }
Conclusion
Let's recap. Here is all we did to add a new ISO 8859 language:
- Copy and paste a public domain ebook into the babylon/dataset/input/iso-8859/latin/ folder
- Run php cli/prepare.php
- Write a couple of tests to make sure our code is okay
Instead of writing PHP code, we've just followed a straightforward three-step process.
For more detailed information on how to add a new language, please see "Cebuano language is taught to Babylon #14". See "GaelicTest passes #12" to learn how to add a new family of languages.
I hope you liked today's post! Thank you for reading and sharing your thoughts.
Thank you for your contribution.
The title you chose does not reflect the content of the tutorial clearly enough.
Your tutorial is quite short for a good tutorial. We recommend you aim for capturing at least 2-3 concepts.
It's important to explain in detail the code that is in the tutorial.
We suggest you always put comments in your code.
Include proof of work in the form of a gist or your own GitHub repository containing your code.
Hi @portugalcoin, thanks for the review! :)
I am not very clear about it though:
Here are my questions:
Could you please provide tutorials explaining how to teach languages to programs?
Also I believe I did something new by calculating the language families' digital fingerprint at babylon/dataset/output/iso-8859-latin-family.csv. This is the core idea of the library.
Could you elaborate a little bit on what is wrong with that?