Saturday, July 15, 2017

NLTK's Swadesh List in the Bulgarian-Macedonian part

I was browsing the Swadesh list from the NLTK corpus, especially the Bulgarian and Macedonian* columns, and wrote down a few notes.

The Swadesh list is a corpus from comparative and historical linguistics, which tries to map how languages evolved by tracking the most stable subset of words. It's also suggestive for language acquisition, developmental psychology and evolutionary psychology, because one can imagine which words would appear first in the vocabulary of a primitive human society, and may arrive at the same or similar basic concepts as the real ones.*

For the record, I had "discovered" a list of this sort before finding out about the Swadesh one. Inspired by developmental psychology, I tried to imagine and write down a short set of words/concepts which would appear first, and in what sequence. I wanted to generalize about them: the complexity and specifics of these basic concepts, and why and how they may appear. For example, the eyes of the mother are among the first "objects" that a child sees. Soon after, the child sees the moon and the sun (at sunset and at dawn). They are all circular, the eyes are spherical... That answers the question "who invented the wheel"... It was... in front of humans' eyes forever.

The "wheel" in the transportation domain was rather constructed once the tools got good enough and the groups of people grew large enough to allow roads to be built, either by clearing forests or by using areas which were dry and flat.

Bear in mind that Bulgarian linguistics in general denies that Macedonian is an independent language; it started to be forcefully and violently diverted during the Yugoslav period and afterwards. Of course differences have grown, but when people speak their local dialects within Bulgarian itself, there are also differences between Thrace (where I live) and all other regions such as "Shopskiya Kray", the Northwest, the Northeast, and between individual villages. There's one prominent dialect in the Rhodopes/Smolyan which is almost impossible to decode.

Bulgarians and Macedonians don't need translators in order to communicate. Macedonian sounds like a dialect, sometimes a funny one; for the other side I guess it's the same.**

The similarities are most apparent, and hardest to deny, if you know the grammar, which is harder for political authorities to change by force than the alphabet (replacing or adding a few letters) or the vocabulary (adding lots of foreignisms). For example, among the Slavic languages only Bulgarian and Macedonian (?) have a definite article (like "the", but attached to the end of the word) and are analytic: there are no cases, and prepositions are used instead, as in the Romance languages.


Bulgarian Swadesh list in NLTK

I don't know who's responsible for the translations and making them official.

However, IMO:

1. "Где" - as of now it's a kind of archaism for "къде".

  "Онуй" for "that" is a dialectal (?) and funny form of "онова".
It's used in the expressive phrase "туй-онуй" and is funny because "уй" sounds like "хуй". That word is the same as the one in the famous Russian "greeting" "Пошёл на хуй". :)
2. "Бастун" is given for "stick", however this is a foreignism and a specific instance/application of the general term "пръчка" ("прачка" in Maced.). "Палка" is also used for some specific "sticks" ("palka" is given for other Slavic languages).

3. For other words there are also several possible translations.

баща/татко (in Macedonian - татко; I don't know what the frequency is there or how they've decided).

4. "To swell" in Bulgarian is actively used in many variations which match the roots in other Slavic languages as given in the list:

"подувам се" is given as a translation, but it's also:

1. подпухвам (пухнуць etc.)
2. набъбвам (набабрува, Maced.)
3. отичам (otékat, oteći etc.)
4. бухвам (набухати, Belarusian?)

5. "Овошка" for "fruit" is another funny item, given instead of the more common and ordinary word "плод".
"Овошка" is appropriate for a "fruit tree", short for "овощно дърво", such as an apple tree etc.

Etc., that's what I spotted at first glance.

I think there should be many words for each entry and more structural data.
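As a rough sketch of what "many words for each entry and more structural data" could look like, here is one possible layout. The class names (Variant, Concept) and the note field are my own invention for illustration, not anything from NLTK; the example words are the ones discussed above for "stick".

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Variant:
    """One possible word for a concept, with a free-form usage note."""
    word: str
    note: str = ""


@dataclass
class Concept:
    """A Swadesh-style entry that can hold several words instead of one."""
    gloss: str                      # the English label of the concept
    variants: List[Variant] = field(default_factory=list)

    def words(self) -> List[str]:
        """Return just the words, in the order they were listed."""
        return [v.word for v in self.variants]


# Hypothetical entry built from the remarks above about "stick" in Bulgarian.
stick = Concept(
    gloss="stick",
    variants=[
        Variant("пръчка", "general term"),
        Variant("палка", "used for some specific sticks"),
        Variant("бастун", "foreignism, a specific kind of stick"),
    ],
)
```

With such a structure stick.words() gives all three candidates, and the notes keep the reasoning about which one is general and which is a special case.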

How to use it? Install NLTK:

python -m pip install nltk

>>> import nltk

Find the corpus and download it, e.g. with nltk.download('swadesh'). Then:

>>> from nltk.corpus import swadesh as sw


Here they show how to get translations between the language pairs:
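A minimal sketch of such a lookup for the Bulgarian-Macedonian pair, assuming the corpus uses the file ids 'bg' and 'mk' (check sw.fileids() on your installation). The helper names load_pairs, translation_table and identical_entries are mine, not part of NLTK; only nltk.data.find, nltk.download and swadesh.entries are library calls.

```python
def load_pairs(src="bg", dst="mk"):
    """Return a list of (src_word, dst_word) tuples from NLTK's Swadesh corpus."""
    # Imported here so the two helpers below can be used without NLTK installed.
    import nltk
    from nltk.corpus import swadesh

    try:
        nltk.data.find("corpora/swadesh")
    except LookupError:
        # Corpus not present locally yet: fetch it once.
        nltk.download("swadesh")
    return swadesh.entries([src, dst])


def translation_table(pairs):
    """Build a plain dict mapping each source entry to its target entry."""
    return dict(pairs)


def identical_entries(pairs):
    """Return the source entries that are spelled the same in both columns."""
    return [a for a, b in pairs if a == b]


def demo():
    """Print a few numbers for the Bulgarian and Macedonian columns."""
    pairs = load_pairs("bg", "mk")
    table = translation_table(pairs)
    same = identical_entries(pairs)
    print(len(pairs), "entries,", len(same), "spelled identically")
    print(list(table.items())[:5])
```

Calling demo() prints how many entries the two columns share letter for letter. This is only a crude measure, since it ignores spelling conventions and entries that list several words.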

* For instance, I think that "mama" is produced by the first random sound-producing attempts of the baby. "Ma" is maybe the simplest syllable, and "mama" is more reliable than "ma" because it's repeated (which suggests it's a pattern), yet two repetitions are enough. The baby called their mother "mama".


**E.g. the Macedonian word for "boyfriend" is "дечко", which in Bulgarian means a little boy or an infantile adolescent/teenage boy, somebody acting inappropriately like a younger child.

"Барам" in Macedonian means "to search", as in the search engines, while in Bulgarian it's an expressive word, a stem/root for "to touch", as in:

- Обарам/обарвам, набарам/набарвам, прибара/прибарвам, да се барна

However, it actually has an old/deducible meaning of "to search" by gradual exploratory touching.

"Обарвам" can be used for "to search (somebody)", because it involves touching, but it also means to touch somebody erotically.


Extending the line, there are a lot of funny false friends between Bulgarian and Serbian.

"Сине" in Serbian is a form of address to a "daughter", while in Bulgarian it is a form of address to a... "son". (However, there are Bulgarian dialects where "син" means "daughter", too.)
