This page is about what ilo Muni is, and how and why it exists, as well as thanks to everyone who helped make it possible. If you want to know how to use ilo Muni, check the help page!
Table of Contents
- My question isn’t here
- Why did you make ilo Muni?
- Where is the data from?
- Will you add more communities?
- Will you update the data in the future?
- What updates have you made so far?
- What can I do with ilo Muni?
- Why “ilo Muni?”
- Thank you to
My question isn’t here
Feel free to email me! You can also reach out on other platforms, but email is easiest for me to see and respond to, since it’s the quietest of the ways people reach me. That said, if you already talk to me on Discord or Telegram, those are perfectly fine too.
Why did you make ilo Muni?
Back in January 2024, several members of the toki pona community came together with the same idea all at once: We should write a proposal to get sitelen pona included in Unicode! From there, it turned out that jan Lepeka had experience writing proposals for Unicode and had already written a massive portion of a sitelen pona proposal. Many others signed on to help, so we all got to work right away.
This has made a lot of people very angry and been widely regarded as a bad move. More seriously, the work turned out to be challenging for many reasons beyond the documentation of how toki pona and sitelen pona are used.
First, many of the members of the workgroup did not agree on what words should be encoded or why. We started by using the Linku usage data to determine what to encode, based on what words survey respondents reported using. That took a while and had its own complexities, but it did work out at the time.
Unfortunately, the Linku usage data only covers toki pona words, not sitelen pona glyphs, and there was no comparable data for sitelen pona. Many encoding decisions were obvious because most words have only one glyph, but over a dozen words have multiple glyphs with different levels of use- and this is both more significant and more difficult to address when the words themselves are rarely used. We tried running an ad-hoc survey to study glyph usage, but this simultaneously revealed how limited our perspective was and how difficult it would be to answer our original question.
While examining the survey data, we found we had failed to offer at least one glyph as an option, resulting in a large number of write-ins for a version of “olin” where the hearts overlap rather than stack. Then we observed that the results for some glyphs were far higher than expected or anecdotally observed, particularly a variant of “linluwi” where the three circles of “kulupu” are connected with lines.
This, coupled with a later survey about whether certain glyphs are “distinct”, led us down a brand new rabbit hole: A large number of survey respondents seemed to answer how they would like to use toki pona, or how they think they will use it in the future, rather than how they do so now. Several members of the workgroup, including myself, could not reconcile the reported usage from the surveys with our observations.
By March, we hadn’t come to any agreement about what glyphs to use for some words, and the prior agreement about what words to encode in the first place had fallen apart because of the observed issues with the two glyph surveys. At that point, we chose to pivot: What if we directly studied usage?
To wrap up the work we had already done, we submitted a preliminary proposal with the only set of words we could all agree must be in the proposal: the nimi pu plus tonsi. There wasn’t anything else we could do to improve the proposal without data, and we’d be waiting a long time on a response from Unicode anyway.
After this, the work left to be done was much more open-ended. How do we actually study usage directly? First, I started improving upon my Discord bot, ilo pi toki pona taso. At the time, it had a simple way to detect whether or not you were speaking toki pona, so that it could remind you to do so if you weren’t. But when I say simple, I do mean simple- it could be evaded by capitalizing the first word of every sentence, carefully dodged by using only the 14 letters of toki pona’s alphabet, or ignored entirely by quoting your text. This wouldn’t do, since my goal was to fetch all the data I could from as many communities as I could and examine only the messages that are “in toki pona.”
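For illustration, here’s a minimal Python sketch of that kind of naive check and of why it was so easy to fool. This is not the bot’s actual code, just the shape of the problem:

```python
# A naive detector in the spirit described above (illustrative only).
TOKI_PONA_LETTERS = set("aeijklmnopstuw")  # toki pona's 14-letter alphabet

def looks_like_toki_pona(sentence: str) -> bool:
    """Check that every non-capitalized word uses only toki pona's letters."""
    words = sentence.split()
    checked = [w for w in words if w == w.lower()]  # capitalized words are skipped
    return all(set(w) <= TOKI_PONA_LETTERS for w in checked)

# Evasion 1: capitalize every word, so nothing gets checked at all.
print(looks_like_toki_pona("Nobody Checks This Sentence"))  # True

# Evasion 2: write English using only the 14 letters.
print(looks_like_toki_pona("some plain talk"))  # True
```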
To address these limitations, I created a parsing library called sona toki, a script called sona mute for counting all the words in toki pona sentences and stuffing it into a database, and this tool, ilo Muni. The work you’re reading now has been five months in the making, not to mention the time I’ve spent since to improve its functionality and add more dimensions of data!
Footnote on the Sitelen Pona proposal (click)
We ended up waiting from our submission in April 2024 until February of 2025 before getting a response back. The proposal was added to the 2024 Unicode document register some time in early 2025, and you can read it here! That proposal was also briefly acknowledged in the first UTC meeting of 2025:
D.1 6.6. Sitelen Pona
Long discussion of stability.
Following that, Jan Kučera of the Unicode Script Encoding Working Group reached out to us to relay the working group’s thoughts on our proposal:
Jan Kučera’s email to our workgroup (click)
Hello Gregory,
Apologies for the delay in responding. We thrive to provide comments on all proposals, even if they are not favorable. While I am happy to answer your questions, please keep in mind the big picture. Too often people spend a lot of time on fixing whatever they can in the proposals, even though it cannot progress without addressing the main concern.
As noted in the report, the biggest concern with Sitelen Pona is that there is no authority or organization providing a standardized set. The community needs to go through its own process of standardization before we can consider encoding this script in Unicode. When this was originally proposed by Spencer van der Meulen, it came with a note: “Sitelen Pona is an open class of characters. New characters are created as new Toki Pona words are coined and enter into usage. Creators of Sitelen Pona fonts have a leading role in the standardization of characters for new words.” This was taken as a strong sign of open-ended project, with community coming up with new drawings as new concepts are being added to the language. Members of the group also pointed out that there are several instances of people on r/tokipona proposing new things, such as
https://www.reddit.com/r/tokipona/comments/1gzi5o0/nasin_nanpa_ilo_my_proposed_number_system_for/, again not an indication of stable conventions. As an opened-ended system, and with examples as above, some members were not convinced the script indeed is or will remain logographic in the future, and there is a worry that we don’t know how many codepoints is coming. It was also noted that even if the language was closed set, it doesn’t provide any guarantees on the script repertoire. Some of the glyph variants in 4a would not be considered unifiable in other scripts, and even for a core set of logographs it wasn’t viewed as stable.
As for formal stability requirements, no changes to the repertoire in the last 5 years would be viewed as a good indication of stability. You might be able to argue a case without meeting this bar, but note that this is already a compromise and some members feel much longer periods should be observed.
Personally I would suggest to not worry about “fixing” the proposal too much now. Get the community to stabilize the repertoire, and/or establish some body governing the script, and let it live for some time. I cannot promise that will guarantee encoding, but it would help to overcome some of the strongest arguments against it.
Hope this helps,
Jan Kučera
Script Encoding Working Group
To be frank, this email was very confusing to read through. But I suppose it makes more sense from an outsider perspective? I’m unsure.
Where is the data from?
Anywhere toki pona is written in text, so long as there is a date associated with every message and the community is open to the public. Right now, I support six platforms:
Discord
Discord makes up the majority of all written toki pona. For that matter, ma pona pi toki pona is nearly the majority of all written toki pona on its own. As such, this data is the most important to have a look at- but Discord is among the least export-friendly platforms on the internet. Discord does not offer any native functionality to export messages. Fortunately, this excellent project enables you to fetch all messages you have access to in any server you’re in, including in threads, and with tons of metadata per message!
The main challenge was actually in finding all of the communities to fetch- I could have settled for just the large ones, since they’d represent most of the data and be easy to find, but I chose to hunt down as many as I could. I ended up finding over 120 servers, and I’m certain there are more!
Telegram
Telegram offers an “export chat” function directly in its desktop UI, and this dumps every message you can see from the start of the chat’s existence to the present. Perfect! I spent some time searching to find all the communities I could, then asking around in other communities. Once I was done, I listed them here too.
After that, jan Pensa (@spencjo) helped me out by exporting two particular chats which were special cases:
- mi tok e toki pona: a workgroup attempting to get toki pona an ISO 639 code. This chat is functionally archived since they succeeded- toki pona has had an ISO code for over two years!
- kulupu pi toki pona: This chat is public, but up until April 2017 it was configured to only show messages since you joined. Fortunately, jan Pensa had been in this group since March 2016!
The format for Telegram messages is a bit odd in their official exports, and I can’t tell which users are bots from the exports alone- but other than that, Telegram was refreshingly easy to add to this list.
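For the curious, the oddity is that in an exported chat’s result.json, a message’s text field is sometimes a plain string and sometimes a list mixing strings with entity objects. A small Python sketch of flattening it, assuming a standard single-chat export:

```python
import json

def flatten_text(text) -> str:
    # Telegram stores "text" as either a plain string or a list mixing
    # strings with entity objects like {"type": "link", "text": "..."}.
    if isinstance(text, str):
        return text
    return "".join(
        part if isinstance(part, str) else part.get("text", "")
        for part in text
    )

with open("result.json", encoding="utf-8") as f:
    export = json.load(f)

for message in export["messages"]:
    if message.get("type") == "message":  # skip service messages (joins, pins, ...)
        print(message["date"], flatten_text(message.get("text", "")))
```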
Reddit
Reddit moved its API behind a paywall long before I began this project, so it is no longer reasonable to “officially” scrape Reddit. They also limited scrollback on all endpoints in the user API to only 1000 items, meaning unofficial scraping would have to be done live in order to be sure you fetched everything. I obviously wasn’t doing my own live scraping of Reddit, but thanks to the incredible work of Pushshift, /u/raiderbdev, and /u/Watchful1, I was able to fetch /r/tokipona and a dozen smaller subreddits including /r/mi_lon, /r/tokiponataso, /r/tokiponaunpa, and /r/liputenpo.
Previously, I was limited to just /r/tokipona through the end of 2023, but /u/Watchful1 reached out and created a dump for all these communities through the end of July 2024. Enormous thanks!
YouTube
Thanks to yt-dlp, it is shockingly easy to make archives of your favorite YouTube videos. To my surprise, it also has the capability to download comments without downloading the associated videos other than their metadata, so I fetched everything I could find related to Toki Pona! However, there is no “toki pona community” on YouTube, because YouTube doesn’t have community structures. There are channels and videos- and for most purposes, that’s it.
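As a rough illustration, here’s the shape of doing that with yt-dlp’s Python API- the URL is a placeholder, and the option names should be double-checked against yt-dlp’s documentation:

```python
# Fetch metadata and comments without downloading any media (sketch).
from yt_dlp import YoutubeDL

opts = {
    "skip_download": True,  # metadata only, no video or audio files
    "getcomments": True,    # include comments in the extracted info dict
}
with YoutubeDL(opts) as ydl:
    info = ydl.extract_info(
        "https://www.youtube.com/watch?v=PLACEHOLDER", download=False
    )

for comment in info.get("comments") or []:
    print(comment.get("timestamp"), comment.get("author"), comment.get("text"))
```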
Fortunately, there are a few playlists on YouTube such as this one and this one which collect huge lists of known toki pona videos. With this, plus a few obvious search terms like “toki pona” and “kijetesantakalu,” I collected an initial list of videos and their associated authors. Since the list of authors represented almost exclusively those who had, at some point, uploaded at least one Toki Pona video, I added all of them to a separate list- and then I downloaded every video from each of these channels. This functionally guarantees an extremely high degree of coverage.
poki Lapo
poki Lapo is a project attempting to collect all toki pona media in one place, transcribing where necessary, with attributed authors and dates. This is an extremely valuable project for my purpose, because it is significantly easier to fetch and parse these stories if they’ve already been collected into a single, consistent format!
That said, this is a somewhat strange addition. poki Lapo’s content is not discussion posts, but media, and thus its entries don’t necessarily conform to trends in the language the way social media posts, messenger conversations, and forum discussions do. However, I don’t think this is a huge concern; the goal of ilo Muni is to faithfully represent trends in toki pona as a whole, and those trends are reflected in media too.
poki Lapo does present one notable complexity over the other datasets, however. New entries can be added retroactively, as older media is discovered. This isn’t a problem for ilo Muni’s dataset, but it does mean that I have to note: the poki Lapo data in ilo Muni is from the main branch as of 2025-10-XX!
forums.tokipona.org
To an outsider, it may seem odd to include one specific forum in this data- but this forum was active from 1 Oct 2009 to mid-2020, and was one of the only active spaces during most of that time. As such, it is highly important to the history of toki pona.
I managed to create a backup of this data using wget, specifically its recursive download feature. Normally I’d prefer to have a more stripped down and structured backup, such as having everything packaged into JSON. However, the forum is closed to new accounts, and has received only a dozen posts in the past two years; it will likely remain up as an archive rather than ever become an active space again. Because of that, having a large archive is okay; it won’t get any larger.
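The archive was made along these lines- a sketch using standard wget flags via Python’s subprocess, not the exact invocation I used:

```python
# Recursively mirror the forum with wget (sketch; flags are standard wget options).
import subprocess

subprocess.run(
    [
        "wget",
        "--recursive",         # follow links within the site
        "--no-parent",         # stay inside the starting path
        "--page-requisites",   # grab CSS and images so pages render offline
        "--convert-links",     # rewrite links to work locally
        "--adjust-extension",  # save pages with .html extensions
        "--wait=1",            # be polite to the server
        "https://forums.tokipona.org/",
    ],
    check=True,
)
```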
Yahoo group
The toki pona Yahoo group was one of very few spaces where toki pona was spoken before 2010. It was active from March 2002 until October 1st, 2009, and I’m only aware of three other communities that existed at all during this time. The IRC channel may have been more popular, but it wasn’t preserved beyond a handful of specific conversations that I’m aware of. Fortunately, the entire Yahoo group is backed up on the forum above, so including the forum also means including the Yahoo group!
As a fair warning, the formatting of messages in the Yahoo group backup is rough. Many newlines are missing, presumably caused by whatever software was used to copy the data to the forum. For that matter, messages are formatted inconsistently due to the unstable nature of email from provider to provider.
However, my preprocessing in sona mute was more than capable of handling the inconsistent formatting. It would be nice to go back to this data to improve the formatting and squeeze a bit more accuracy out of it as a result, but it’s adequate as-is.
Will you add more communities?
Yes! The main barrier to adding more communities is being able to download data from the given platform. For all of the communities I support, there was a relatively easy way to obtain the messages from the platform or from a pre-existing archive of the platform. For everything I don’t yet support, there isn’t an easy way to download the data from the platform, or a pre-existing archive to download.
Facebook
There are several toki pona communities on Facebook, here, here, here, and here. The majority of their activity is in a period similar to that of Discord- that is, from 2020 onward- but they have much more pre-2020 activity than most other communities that existed around that time. Unfortunately, scraping data from Facebook is extremely difficult- the handful of open source scrapers that exist are variously inconsistent, low quality, or unsupported.
Tumblr
There are a surprising number of toki pona blogs and toki pona posts on Tumblr, which would be excellent to include! During my searching, I identified around 700 blogs that had posted about toki pona at any point. Additionally, toki pona activity on Tumblr goes back as far as 2016, with a major uptick in 2020 and 2021.
However, downloading the data to include it seems extremely difficult. There is a tool called TumblThree which purports to let you download Tumblr and Twitter data, but my experience is that it gets rate limited aggressively by both platforms- even with the slowest settings.
It is also possible to download most of the data needed for ilo Muni via gallery-dl, but that still requires you to have a list of known blogs to fetch from first.
Twitter
I haven’t done a deep dive into toki pona activity on Twitter, but given its age and its prior position as the only notable social media platform of its kind, I can only assume there must be a lot of discussion. That said, actually fetching data from Twitter has been near impossible ever since the neutering and subsequent near-destruction of its API.
Bluesky
Toki Pona activity on Bluesky has skyrocketed since the untimely death of Twitter! One of the most important features making this possible is the ability for users to make programmatically curated timelines, like this one which is entirely in toki pona, or this one which collects both toki pona language use and toki pona discussion. That said, Bluesky is a fairly young platform, so there are not many tools for fetching posts from it.
LiveJournal
There are at least two LiveJournal blogs that focused on toki pona, here and here, which were active in a similar time period to the forum or yahoo group. They’re both small, but anything counts, especially for the history of toki pona before 2016. Unfortunately, all the LiveJournal archiving tools I can find are for improving the personal data export feature, not for scraping the site.
toki.social (Mastodon)
toki.social is a Mastodon instance for toki pona! It’s been around since early 2022. Not much else to say; it’s a lovely place, although it’s fairly quiet.
kulupu.pona.la
kulupu.pona.la was a forum hosted by mazziechai which closed abruptly in November 2023 due to trolling. The forum was fully archived by Mazzie before the shutdown, but the archive is raw HTML, making it difficult to get the necessary data out of it.
lipu Wikipesija
lipu Wikipesija is an independent Wikipedia instance that is entirely in Toki Pona. Similar to poki Lapo, it would represent a specific, well-prepared form of toki pona, as opposed to casual and continuous discussion. The main challenges are in attribution and timing, since articles may be edited by more than one author at different points in time- I’m currently unsure how to resolve this, but as soon as I can do so, I would love to add lipu Wikipesija.
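For anyone who wants to experiment, revision history is exposed through the standard MediaWiki API, so per-revision authors and timestamps can be fetched roughly like this. The endpoint URL and page title are assumptions on my part; the query parameters are standard MediaWiki:

```python
# Fetch per-revision timestamps and authors from a MediaWiki instance (sketch).
import requests

API = "https://wikipesija.org/api.php"  # assumed endpoint for lipu Wikipesija

params = {
    "action": "query",
    "prop": "revisions",
    "titles": "toki pona",           # hypothetical page title
    "rvprop": "timestamp|user|size",
    "rvlimit": "max",
    "format": "json",
}
pages = requests.get(API, params=params, timeout=30).json()["query"]["pages"]
for page in pages.values():
    for rev in page.get("revisions", []):
        print(rev["timestamp"], rev["user"], rev["size"])
```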
Will you update the data in the future?
Yes! I plan to update this data fully at least once per year, but doing so two or three times isn’t out of the question. Collecting, parsing, and counting up all of the data is not labor-intensive- writing the code to do all that was, but most of that work is done.
That said, there isn’t much value in updating more than once per year. Google Ngrams only updates about every three years. Trends in language don’t generally happen in weeks or months- even at the community’s current size, these trends take years.
But I will be making improvements to my scoring and parsing of sentences, and adding new platforms or communities, in which case I’ll update the data to reflect those improvements. There won’t be any new data for those updates though- just better accuracy for the existing data!
What updates have you made so far?
Since the most important changes follow from database revisions, I list the changes I’ve made below based on the corresponding database. All revisions other than the most recent are hidden by default.
2025-10-20 Database
- Used by ilo Muni from 2025-11-15 to present
- Improvements in sona-toki’s parser, better tokenization and rejection of non-tp sentences
- Replaces min_sent_len with an attribute system
- Available attributes are default (0), start of sentence (1), end of sentence (2), entire sentence (3), sentence having 4 or more words (4), sentence having fewer than 4 words (5), and inside the sentence (6); a sketch of querying by attribute follows this changelog entry
- Offers both monthly and yearly unit times
- Total table has been split into total_monthly and total_yearly
- Represents Discord, Telegram, Reddit, YouTube, the Toki Pona forum, and poki Lapo.
- Discord, Telegram, YouTube, and poki Lapo data is included through July 2025. Reddit is only included through July 2024. The forum is archived, so there will not be any further updates to its data.
- Earliest date is 2002-03-01, latest date is 2025-07-01
- Download here
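To make the attribute system concrete, here’s a hedged sketch of what a query filtered by attribute might look like. The table and column names here are hypothetical- inspect the real schema first (a schema-listing sketch appears under “Can I use the data in my project?” below):

```python
# Hypothetical query against the frequency database (sketch only; the table
# and column names below are NOT confirmed against the real schema).
import sqlite3

ENTIRE_SENTENCE = 3  # attribute 3: the word was the entire sentence

conn = sqlite3.connect("ilo-muni.sqlite")
rows = conn.execute(
    """
    SELECT day, occurrences        -- hypothetical column names
    FROM frequency_monthly         -- hypothetical table name
    WHERE word = ? AND attribute = ?
    ORDER BY day
    """,
    ("kijetesantakalu", ENTIRE_SENTENCE),
).fetchall()
for day, occurrences in rows:
    print(day, occurrences)
```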
Older database revisions
2024-11-29 Database
- Used by ilo Muni from 2024-12-17 to 2025-11-15
- Includes word frequency and filtered authorship data (>= 20 sentences required)
- Uses monthly unit times
- Represents Discord, Telegram, Reddit, YouTube, and the Toki Pona forum
- Earliest date is 2002-03-01, latest date is 2024-07-01
- Download here
2024-11-01 Database
- Never used in ilo Muni
- Improvements to scoring in sona-toki
- Even better rejection of false positives from non-tp languages (“saluton”)
- Schema updated for naming clarity
- Includes frequency data and unfiltered authorship data
- Uses monthly unit times
- Represents Discord, Telegram, Reddit, YouTube, and the Toki Pona forum
- Earliest date is 2002-03-01, latest date is 2024-07-01
- Download here
2024-09-07 Database
- Used by ilo Muni from 2024-09-12 to 2024-12-17
- Improvements to scoring in sona-toki
- Better rejection of false positives from non-tp languages (“manipulate”)
- Recognition of intra-word punctuation in word tokenizer (“isn’t”)
- Less strict recognition of proper names (“FaceBook”, “ChatGPT” now valid)
- Omit codeblocks (paired triple backticks), but not inline code (paired single backticks)
- Replace markdown URLs with just the linked text
- Omit emails
- Only includes word frequency data, not authorship
- Uses 4-weekly unit times starting from 2001-08-08
- Represents Discord, Telegram, Reddit, YouTube, and the Toki Pona forum
- Earliest date is 2002-03-20, latest date is 2024-07-10
- Download here
2024-08-08 Database
- Used by ilo Muni from 2024-08-10 to 2024-09-12
- Includes word frequency data, not authorship
- Uses monthly unit times
- Represents Discord, Telegram, and Reddit
- Earliest date is 2010-10-01, latest date is 2024-08-01 (not graphed)
- Download here
What can I do with ilo Muni?
Can I use the data in my project?
Sure! The database is distributed under the terms of the CC BY-SA 4.0 License. In short, you can use the data for anything you like, but you need to attribute me when you do so, and any derivative works made from the data must use the same license. See the linked deed and associated license terms for more details.
You can download the latest database here, or see all the old databases and their details here.
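If you just want to poke at the data, here’s a minimal Python sketch for opening the download. It assumes nothing beyond the file being gzipped SQLite, and it lists the tables so you can discover the schema yourself:

```python
# Decompress the gzipped database and list its tables.
import gzip
import shutil
import sqlite3

with gzip.open("2024-11-29-trimmed.sqlite.gz", "rb") as src:
    with open("ilo-muni.sqlite", "wb") as dst:
        shutil.copyfileobj(src, dst)

conn = sqlite3.connect("ilo-muni.sqlite")
tables = conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
).fetchall()
print([name for (name,) in tables])  # inspect before writing real queries
```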
Can I cite ilo Muni in my study?
Yes! I recommend including the access date for the graphing tool, because I update the primary dataset periodically with improvements or additions. Each dataset is already dated, and you can see above which dates each database revision was active in ilo Muni.
Here are some samples in various citation formats:
MLA 9
- Danielson, G., III. ilo Muni, 10 Aug. 2024. <https://ilo.muni.la/>. Accessed 17 Dec. 2024.
- Danielson, G., III. ilo Muni, 10 Aug. 2024. <https://gregdan3.com/sqlite/2024-11-29-trimmed.sqlite.gz>.
APA 7
- Danielson, G., III. (2024, August 10). Ilo Muni [Computer software]. Retrieved December 17, 2024, from <https://ilo.muni.la/>
- Danielson, G., III. (2024, August 10). Ilo Muni [Dataset]. <https://gregdan3.com/sqlite/2024-11-29-trimmed.sqlite.gz>
How can I contribute?
You can contribute to ilo Muni here, its preprocessing library here, and its frequency counting library here!
However, code is not the only way to help- if you know of a Toki Pona community that I don’t, and especially if you already have an archive of that data, please email me or open an issue.
Why “ilo Muni?”
Several reasons! I was originally going to name this project “lipu mute”, which in this context would translate to something like “multiplicity document.” But I realized near the end of July 2024 that if this tool had a name, I would be able to search its name in this tool! This was too cool to pass up.
From there, choosing a name was easy. I wanted to name it in toki pona, and have the sitelen pona of the name be a phrase which also describes itself. This tool shows how many words there are or how frequent words are compared to all others, which is aptly described by “mute nimi” and still closely described by “nimi mute”. Taking the first syllables of each word, you get “Muni” and “Nimu.”
I ran a poll with both, but the results were very close (with a slight preference for Muni!), so I decided to search both names in the version of ilo Muni that existed at that time- “Nimu” had just over 40 results, but “Muni” didn’t come up, so I went with it. Here we are!
Thank you to
- ilo Nija for the logo! It’s a stylized version of the sitelen pona name for ilo Muni- the three bars are “mute” and the rectangle is “nimi”. It’s so sick.
- jan Lepeka, jan Juwan, akesi Jan, jan Tepo, ijo Alison, and many others of the Unicode proposal workgroup for convincing me to start on this project. It’s been incredibly rewarding, and I hope it helps and doesn’t cause lots of future arguments.
- jan Telesi and ilo Tani for a bit of technical direction along the way!
- waso Keli for asking many questions about how I collected and display the data fairly!
- kala Asi for listening and helping me iterate on the early versions of the graph and search tool.
- jan Asiku, ilo Tani, pan Pake, and waso Ete for helping me work through the set theory reasoning necessary to display the data usefully!
- /u/Watchful1 for providing the Reddit data they had already archived!
- Tyrrrz and his contributors for the Discord Chat Exporter, without which this project would have been functionally impossible.
- phiresky for the excellent sql.js-httpvfs library, without which this project would have taken much longer and been much more stressful.
- jordemort for figuring out how to use that library in Astro!