Tuesday, January 16, 2024


BgGPT is the mightiest new Bulgarian LLM, but technically not the first

https://therecursive.com/bggpt-the-first-bulgarian-language-model-is-launched/

https://bggpt.ai


The announced specs: 7B parameters, trained on 3B sentences.

  • "The first Bulgarian large language model, BgGPT, was announced today by INSAIT. It was created specifically for the Bulgarian state, users, public and private organizations."

This is impressive and well done; however, it is not the first LLM for Bulgarian.

The first one known to me is an experimental GPT2-Medium model (331M), trained in the summer of 2021 by me/The Sacred Computer:

https://github.com/Twenkid/GPT2-Bulgarian-Training-Tips-and-Tools/

GPT2-Medium Training from Scratch on Colab for Any Language - Tips & Tricks by Twenkid

https://youtu.be/F-Xt-cK4L-g

It was just an experiment on a small dataset (about 140 MB of UTF-8 text). The trained model wasn't published because of part of the dataset, because it seemed to start memorizing too much (it needed more data), and because the training setup wasn't good: Colab with a Tesla T4, training each iteration on subsets of the dataset due to those constraints. I will probably publish it anyway.
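For reference, that kind of chunked training looks roughly like this (a minimal sketch with Hugging Face Transformers, not the actual Colab notebook; the file names, tokenizer choice and hyperparameters are placeholders):

```python
# Minimal sketch (not the original script): training GPT2-Medium from scratch
# on a Bulgarian corpus in chunks, to fit Colab/Tesla T4 constraints.
# File names, tokenizer and hyperparameters below are illustrative assumptions.
from transformers import (GPT2Config, GPT2LMHeadModel, GPT2TokenizerFast,
                          Trainer, TrainingArguments,
                          DataCollatorForLanguageModeling, TextDataset)

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")      # or a Bulgarian BPE tokenizer
config = GPT2Config(n_embd=1024, n_layer=24, n_head=16,    # GPT2-Medium dimensions
                    vocab_size=tokenizer.vocab_size)
model = GPT2LMHeadModel(config)                             # random init, i.e. from scratch

collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

# The corpus is split into pieces; each pass trains on one piece, continuing
# from the weights of the previous pass, so a single free GPU session suffices.
for i, part in enumerate(["bg_corpus_part0.txt", "bg_corpus_part1.txt"]):
    dataset = TextDataset(tokenizer=tokenizer, file_path=part, block_size=512)
    args = TrainingArguments(output_dir=f"ckpt_part{i}",
                             per_device_train_batch_size=4,
                             gradient_accumulation_steps=8,
                             num_train_epochs=1, save_steps=500, fp16=True)
    Trainer(model=model, args=args, data_collator=collator,
            train_dataset=dataset).train()
    model.save_pretrained(f"ckpt_part{i}")
```

The point of the loop is simply that the corpus pieces are fed one after another, resuming from the last checkpoint, which is about what a T4 Colab session could handle at the time.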

Someone asked for the weights, but there wasn't any further interest in it whatsoever. I see a few models from 2022-2023 on Hugging Face.

On the joke side, there's also RhodopeGPT (named after a mountain in Bulgaria, Greece and Turkey):

https://github.com/Twenkid/rhodope-gpt

(An experiment with the simplest possible transformer, based on Karpathy's example, with a GPT-2 tokenizer and checkpoint save/load.)
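For the curious, such a setup looks roughly like this (a minimal sketch, not the repo's actual code; the model sizes, file names and Bulgarian sample are illustrative):

```python
# Sketch of a RhodopeGPT-style setup: a tiny Karpathy-style transformer,
# GPT-2 BPE tokenization via tiktoken, and checkpoint save/load with torch.
import torch
import torch.nn as nn
import tiktoken

enc = tiktoken.get_encoding("gpt2")                 # GPT-2 tokenizer (BPE)

class TinyGPT(nn.Module):
    def __init__(self, vocab_size, d_model=128, n_layer=2, n_head=4, block_size=64):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(block_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_head, 4 * d_model, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layer)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, idx):
        pos = torch.arange(idx.size(1), device=idx.device)
        x = self.tok(idx) + self.pos(pos)
        # causal mask so each position attends only to earlier tokens
        mask = nn.Transformer.generate_square_subsequent_mask(idx.size(1)).to(idx.device)
        return self.head(self.blocks(x, mask=mask))

model = TinyGPT(vocab_size=enc.n_vocab)

# Save / load a checkpoint so training can be resumed between sessions.
torch.save(model.state_dict(), "rhodope_gpt.pt")
model.load_state_dict(torch.load("rhodope_gpt.pt"))

ids = torch.tensor([enc.encode("Родопите са планина")])  # tokenize Bulgarian text
logits = model(ids)                                      # (1, seq_len, vocab_size)
```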

This is a serious project, but it lacks partners:
https://github.com/Twenkid/Vsy-Jack-Of-All-Trades-AGI-Bulgarian-Internet-Archive-And-Search-Engine 

  • They mention that the research started back in 2020: "BgGPT’s initial project research began in 2020 under the leadership of Prof. Martin Vechev, Professor of Computer Science at ETH Zurich. He is also a founder and architect of INSAIT. The aim for 2024 is to continue the development of an AI computing center, attracting international partners"



Update: https://huggingface.co/usmiva/gpt-web-bg (2023) and several others (from 2022 or so) also exist on Hugging Face. From LinkedIn: about 50B tokens; they say they omitted that number because it was "too techy".
