About ilo Muni
This page is about what ilo Muni is, and how and why it exists, as well as thanks for everyone who helped make it possible. If you want to know how to use ilo Muni, check the help page!
Feel free to skip around. If you don’t care about a section, that’s totally okay!
Table of Contents
- My question isn’t here!
- Why did you make ilo Muni?
- Where is the data from?
- Will you add more communities?
- Will you update the data in the future?
- What updates have you made so far?
- What can I do with ilo Muni?
- Why “ilo Muni?”
- Thank you to…
My question isn’t here!
Feel free to email me! You can also reach out on other platforms, but email is easiest for me to see and respond to (because it is the quietest way people try to talk to me).
Why did you make ilo Muni?
Back in January 2024, several members of the toki pona community came together with the same idea all at once: We should write a proposal to get sitelen pona included in Unicode! From there, it turned out that jan Lepeka had experience writing proposals for Unicode and had already written a massive portion of a sitelen pona proposal. Many others signed on to help, so we all got to work right away.
This has made a lot of people very angry and been widely regarded as a bad move This turned out to be challenging for many reasons beyond the documentation of how toki pona and sitelen pona are used.
First, many of the members of the workgroup did not agree on what words should be encoded or why. We examined the Linku usage data to get a reasonable answer for what words to encode- that took a while and had its own complexities, but it did work out at the time.
Unfortunately, there was no comparable data for sitelen pona glyphs. While many were obvious because most words only have one glyph, over a dozen words have multiple glyphs with different levels of use- especially when the words themselves are low use. We tried running an ad-hoc survey to study glyph usage, but this simultaneously revealed how limited our perspective was and how difficult it would be to answer our original question.
While examining the survey data, we found we had failed to offer at least one glyph as an option, resulting in a large number of write-ins for a version of “olin” where the hearts overlap rather than stack. Then we observed that the results for some glyphs were far higher than expected or anecdotally observed, particularly a variant of “linluwi” where the three circles of “kulupu” are connected with lines.
This, coupled with a later survey about whether certain glyphs are “distinct”, led us down a brand new rabbit hole: A large number of survey respondents seemed to answer how they would like to use toki pona, or how they think they will use it in the future, rather than how they do so now. Several members of the workgroup, including myself, could not reconcile the reported usage from the surveys with our observations.
By March, we hadn’t come to any agreement about what glyphs to use for some words, and the prior agreement about what words to encode in the first place had fallen apart because of the observed issues with the two glyph surveys. At that point, we chose to pivot: What if we directly studied usage?
We submitted a preliminary proposal with the only set of words we could all agree must be in the proposal, that being the nimi pu plus tonsi. It has not been added to the Unicode document register as of writing, but Unicode has acknowledged it!
After this, the work left to be done was much more open-ended. I started improving upon my Discord bot, ilo pi toki pona taso. At the time, it had a simple way to detect whether or not you were speaking toki pona, so that it could remind you to speak toki pona if you didn’t. But when I say simple, I do mean simple- it could be evaded by capitalizing the first word of every sentence, or carefully dodged by using only the 14 letters in Toki Pona’s alphabet, or ignored entirely by quoting your text. This wouldn’t do, since my goal was to fetch all the data I could from as many communities as I could and examine only the messages that are “in toki pona.”
From there, I created a parsing library called sona toki, a script called sona mute for counting all the words in toki pona sentences and stuffing it into a database, and this tool, ilo Muni. The work you’re reading now has been five months in the making!
Where is the data from?
Anywhere toki pona is written in text, so long as there is a date associated with every message and the community is open to the public. Right now, I support six platforms:
Discord
Discord makes up the majority of all written toki pona. For that matter, ma pona pi toki pona is nearly the majority of all written toki pona on its own. As such, this data is the most important to have a look at- but Discord is among the least export-friendly platforms on the internet. Discord does not offer any native functionality to export messages. Fortunately, this excellent project enables you to fetch all messages you have access to in any server you’re in, including in threads, and with tons of metadata per message!
The main challenge was actually in finding all of the communities to fetch- I could have settled for just the large ones, since they’d represent most of the data and be easy to find, but I chose to hunt down as many as I could. I ended up finding over 120 servers, and I’m certain there are more!
Telegram
Telegram offers an “export chat” function directly in its desktop UI, and this dumps every message you can see from the start of the chat’s existence to the present. Perfect! I spent some time searching to find all the communities I could, then asking around in other communities. Once I was done, I listed them here too.
After that, jan Pensa (@spencjo) helped me out by exporting two particular chats which were special cases:
- mi tok e toki pona: a workgroup attemping to get toki pona an ISO-639 code. This chat is functionally archived since they succeeded- toki pona has had an ISO code for over two years!
- kulupu pi toki pona: This chat is public, but up until April 2017 it was configured to only show messages since you joined. Fortunately, jan Pensa had been in this group since March 2016!
The format for Telegram messages is a bit odd in their official exports, and I can’t tell what users are bots from the exports alone- but other than that, Telegram was refreshingly easy to add to this list.
Reddit moved its API behind a paywall long before I began this project, so it is no longer reasonable to “officially” scrape Reddit. They also limited scrollback on all endpoints in the user API to only 1000 items, meaning unofficial scraping would have to be done live in order to be sure you fetched everything. I obviously wasn’t doing my own live scraping of Reddit, but thanks to the incredible work of Pushshift, /u/raiderbdev, and /u/Watchful1, I was able to fetch /r/tokipona and a dozen smaller subreddits including /r/mi_lon, /r/tokiponataso, /r/tokiponaunpa, and /r/liputenpo.
Previously, I was limited to just /r/tokipona through the end of 2023, but /u/Watchful1 reached out and created a dump for all these communities through the end of July 2024. Enormous thanks!
YouTube
Thanks to yt-dlp, it is shockingly easy to make archives of your favorite YouTube videos. To my surprise, it also has the capability to download comments without downloading the associated videos other than their metadata, so I fetched everything I could find related to Toki Pona! However, there is no “toki pona community” on YouTube, because YouTube doesn’t have community structures. There are channels and videos- and for most purposes, that’s it.
Fortunately, there are a few playlists on YouTube such as this one and this one which collect huge lists of known toki pona videos. With this, plus a few obvious search terms like “toki pona” and “kijetesantakalu,” I collected an initial list of videos and their associated authors. Since the list of authors represented almost exclusively those who had, at some point, uploaded at least one Toki Pona video, I added all of them to a separate list- and then I downloaded every video from each of these channels. This functionally guarantees an extremely high degree of coverage.
forums.tokipona.org
To an outsider, it may seem odd to want to include a specific singular forum in this data- but this forum is important because it was active from 1 Oct 2009 to mid-2020, and was one of the only active spaces during most of that time. As such, this forum is highly important to the history of toki pona.
I managed to create a backup of this data using wget, specifically its recursive download feature. Normally I’d prefer to have a more stripped down and structured backup, such as having everything packaged into JSON. However, the forum is closed to new accounts, and has received only a dozen posts in the past two years; it will likely remain up as an archive rather than ever become an active space again. Because of that, having a large archive is okay; it won’t get any larger.
Yahoo group
The toki pona yahoo group was one of very few spaces where toki pona was spoken before 2010. It was active from March 2002 until October 1st 2009, and I’m only aware of three other communities that existed at all during this time. The IRC channel could have been more popular, but it wasn’t preserved beyond a handful of specific conversations that I’m aware of. Fortunately, the entire yahoo group is backed up on the forum above, so including the forum also means including the Yahoo group!
As a fair warning, the formatting of messages in the Yahoo group backup is rough. Many newlines are missing, presumably caused by whatever software was used to copy the data to the forum. For that matter, messages are formatted inconsistently due to the unstable nature of email from provider to provider.
However, my preprocessing in sona mute was more than capable of handling the inconsistent formatting. It would be nice to go back to this data to improve the formatting and squeeze a bit more accuracy out of it as a result, but it’s adequate as-is.
Will you add more communities?
Yes! The main barrier to adding more communities is being able to download data from the given platform. For all of the communities I support, there was a relatively easy way to obtain the messages from the platform or from a pre-existing archive of the platform. For everything I don’t yet support, there isn’t an easy way to download the data from the platform, or a pre-existing archive to download.
There are several toki pona communities on Facebook, here, here, here, and here. The majority of their activity is in a period similar to that of Discord- that is, from 2020 onward- but they have much more pre-2020 activity than most other communities that existed around that time. Unfortunately, scraping data from Facebook is extremely difficult- the handful of open source scrapers that exist are variously inconsistent, low quality, or unsupported.
Tumblr
There are a surprising number of toki pona blogs and toki pona posts on Tumblr, which would be excellent to include! During my searching, I identified around 700 blogs that had posted about toki pona at any point. Additionally, toki pona activity on Tumblr goes back as far as 2016, with a major uptick in 2020 and 2021.
However, downloading the data to include it seems extremely difficult. There is a tool called TumblThree which purports to let you download Tumblr and Twitter data, but my experience is that it gets rate limited aggressively by both platforms- even with the slowest settings.
LiveJournal
There are at least two LiveJournal blogs that focused on toki pona, here and here, which were active in a similar time period to the forum or yahoo group. They’re both small, but anything counts, especially for the history of toki pona before 2016. Unfortunately, all the LiveJournal archiving tools I can find are for improving the personal data export feature, not for scraping the site.
toki.social (Mastodon)
toki.social is a Mastodon instance for toki pona! It’s been around since early 2022. Not much else to say; it’s a lovely place, although it’s fairly quiet.
kulupu.pona.la
kulupu.pona.la was a forum hosted by mazziechai which closed abruptly in November 2023 due to trolling. The forum was archived fully by Mazzie before the shutdown, but the format is pure HTML, making it a difficult to get the necessary data out of it.
poki Lapo
poki Lapo is a project attempting to collect all toki pona media in one place, transcribing where necessary, with attributed authors and dates. This is an extremely valuable project for my purpose, because it is significantly easier to fetch and parse these stories if they’ve already been collected into a single format!
I’ll grant, there is an argument in favor of their omission: They are not discussion posts, but media, and thus don’t necessarily conform to trends in the language the way social media and discussion forums do. However, I don’t think this is a huge concern; the goal of ilo Muni is to faithfully represent trends in toki pona as a whole, and this would be reflected in media too.
lipu Wikipesija
lipu Wikipesija is an independent Wikipedia instance that is entirely in Toki Pona. Similar to poki Lapo, it would represent a specific, well-prepared form of toki pona, as opposed to casual and continuous discussion. The main challenges are in attribution and timing, since articles may be edited by more than one author at different points in time- I’m currently unsure how to resolve this, but as soon as I can do so, I would love to add lipu Wikipesija.
Will you update the data in the future?
Yes! I plan to update this data fully at least once per year, but doing so two or three times isn’t out of the question. Collecting, parsing, and counting up all of the data is not labor intensive- writing the code to do all that was, but most of that is done.
That said, there isn’t much value in updating more than once per year. Google Ngrams only updates about every three years. Trends in language don’t generally happen in weeks or months- even at the community’s current size, these trends take years.
But I will be making improvements to my scoring and parsing of sentences, and adding new platforms or communities, in which case I’ll update the data to reflect those improvements. There won’t be any new data for those updates though- just better accuracy for the existing data!
What updates have you made so far?
Since the most important changes follow from database revisions, I list the changes I’ve made below based on the corresponding database.
2024-11-29 Database
- Used by ilo Muni from 2024-12-17 to present
- Includes word frequency and filtered authorship data (>= 20 sentences required)
- Uses monthly unit times
- Represents Discord, Telegram, Reddit, YouTube, and the Toki Pona forum
- Earliest date is 2002-03-20
- Download here
2024-11-01 Database
- Improvements to scoring in sona-toki
- Even better rejection of false positives from non-tp languages (“saluton”)
- Never used in ilo Muni
- Includes frequency data and unfiltered authorship data
- Uses monthly unit times
- Represents Discord, Telegram, Reddit, YouTube, and the Toki Pona forum
- Earliest date is 2002-03-20
- NOTE: The schema has changed to improve naming.
- Download here
2024-09-07 Database
- Improvements to scoring in sona-toki
- Better rejection of false positives from non-tp languages (“manipulate”)
- Recognition of intra-word punctuation in word tokenizer (“isn’t”)
- Less strict recognition of proper names (“FaceBook”, “ChatGPT” now valid)
- Omit codeblocks (paired triple backticks), but not inline code (paired single backticks)
- Replace markdown URLs with just the linked text
- Omit emails
- Used by ilo Muni from 2024-09-12 to 2024-12-17
- Only includes word frequency data, not authorship
- Uses 4-weekly unit times starting from 2001-08-08
- Represents Discord, Telegram, Reddit, YouTube, and the Toki Pona forum
- Earliest date is 2002-03-20, latest date is 2024-07-10
- Download here
2024-08-08 Database
- Used by ilo Muni from 2024-08-10 to 2024-09-12
- Includes word frequency data, not authorship
- Uses monthly unit times
- Represents Discord, Telegram, and Reddit
- Earliest date is 2010-10-01, latest date is 2024-08-01 (not graphed)
- Download here
What can I do with ilo Muni?
Can I use the data in my project?
Sure! The database is distributed under the terms of the CC BY-SA 4.0 License. In short, you can use the data for anything you like, but you need to attribute me when you do so, and any derivative works made from the data must use the same license. See the linked deed and associated license terms for more details.
You can download the latest database here, or see all the old databases and their details here.
Can I cite ilo Muni in my study?
Yes! I recommend including the access date for the graphing tool, because I update the primary dataset periodically with improvements or additions. The dataset is already dated, and you can see what specific dates each dataset was active on above.
Here are some samples in various citation formats:
MLA 9
- Danielson, G., III. ilo Muni, 10 Aug. 2024. https://ilo.muni.la/. Accessed 17 Dec. 2024.
- Danielson, G., III. ilo Muni, 10 Aug. 2024. https://gregdan3.com/sqlite/2024-11-29-trimmed.sqlite.gz.
APA 7
- Danielson, G., III. (2024, August 10). Ilo Muni [Computer software]. Retrieved December 17th, 2024, from https://ilo.muni.la/
- Danielson, G., III. (2024, August 10). Ilo Muni [Dataset]. https://gregdan3.com/sqlite/2024-11-29-trimmed.sqlite.gz
How can I contribute?
You can contribute to ilo Muni here, its preprocessing library here, and its frequency counting library here!
However, code is not the only way to help- if you know of a Toki Pona community that I don’t, and especially if you already have an archive of that data, please email me or open an issue.
Why “ilo Muni?”
Several reasons! I was originally going to name this project “lipu mute”, which in this context would be translated like “multiplicity document.” But I realized near the end of July 2024 that if this tool had a name, I would be able to search its name in this tool! This was too cool to pass up.
From there, choosing a name was easy. I wanted to name it in toki pona, and have the sitelen pona of the name be a phrase which also describes itself. This tool shows how many words there are or how frequent words are compared to all others, which is aptly described by “mute nimi” and still closely described by “nimi mute”. Taking the first syllables of each word, you get “Muni” and “Nimu.”
I ran a poll with both, but the results were very close (with a slight preference for Muni!), so I decided to search both names in the version of ilo Muni that existed at that time- “Nimu” had just over 40 results, but “Muni” didn’t come up, so I went with it. Here we are!
Thank you to…
- ilo Nija for the logo! It’s a stylized version of the sitelen pona name for ilo Muni- the three bars are “mute” and the rectangle is “nimi”. It’s so sick.
- jan Lepeka, jan Juwan, akesi Jan, jan Tepo, ijo Alison, and many others of the Unicode proposal workgroup for convicing me to start on this project. It’s been incredibly rewarding, and I hope it helps
and doesn’t cause lots of future arguments. - jan Telesi and ilo Tani for a bit of technical direction along the way!
- waso Keli for asking many questions about how I collected and display the data fairly!
- kala Asi for listening and helping me iterate on the early versions of the graph and search tool.
- jan Asiku, ilo Tani, pan Pake, and waso Ete for helping me work through the set theory reasoning necessary to display the data usefully!
- /u/Watchful1 for providing the Reddit data they had already archived!
- Tyrrrz and his contributors for the Discord Chat Exporter, without which this project would have been functionally impossible.
- phiresky for the excellent
sql.js-httpvfs
library, without which this project would have taken much longer and been much more stressful. - jordemort for figuring out how to use that library in Astro!