Help! for ilo Muni
This page is about how to use and understand ilo Muni, which you can work through like a tutorial or a manual. If you want to know how or why ilo Muni exists, or want to talk to me, see the about page!
There’s also a quick reference under the help button on the main page, and a section for frequently asked questions.
Hi! This page is going through a rewrite due to the new features on the search page. If something is incomplete, out of date, or confusing, that would be why! This notice is dated 2024-12-17. If you’re reading this more than a week after that date, please complain loudly at me!
Search
Words
Search for a toki pona word such as pona. If it appeared at least 40 times across all the places and times I checked, you’ll get a graph showing how that word has been used over time! By default, you’ll see what percentage of all words were the searched word with monthly datapoints.
You can also search for proper names like Sonko, Inli, or Siko. There are lots of very rare words in the database too. You might be surprised by what you find, so try lots of things!
Examples
Terms
You can search for up to 6 words in one term, such as mi kama sona e toki pona. Note that the more words are in a term, the fewer times it is likely to appear- so don’t be surprised if you don’t get a result. Try searching for shorter terms, like kama pona or anu seme.
Examples
- Watch the rising popularity of terms like kin la
- Watch terms come and go over time as the things they reference do, like tenpo pana, tenpo monsuta, or suno pi toki pona
- Examine the use of grammatical features like kepeken e
Multiple Searches
You can graph multiple terms at once by separating them with commas (,). You can even mix terms of different lengths on one graph: toki, pona, toki pona. Often, different terms will have very different amounts of use, making all but the top one or two graphs hard to read. You can address that with the logarithmic or minmax scales, covered later in this tutorial.
Examples
- Compare former synonyms like lukin, oko
- See the close relationship between similar words like pan, kili or laso, loje, jelo, walo, pimeja
- Compare the modifiers that appear after specific words, like wawa a, wawa mute, wawa lili, wawa suli, wawa sewi
- Examine how people talk about their skill in toki pona with sona toki pona, sona e toki pona, sona pi toki pona
- Compare greetings over time, like sina seme, sina pali e seme
- See how popular Sonja’s books are with pu, ku, su
Wildcard
You can search for multiple similar terms by replacing one word in a term with a wildcard (*). This will search for the ten most popular terms that match your search and graph them. Note that you may only use one wildcard per search term, and you cannot start a term with a wildcard.
Examples
Adding terms
You can add terms together by putting a + between them. This is helpful for combining synonyms, like ale + ali. You can also use it to compare multiple related words to another one, like pan + kili, moku.
It often helps to graph some or all of the summed terms separately from the sum, so you can see how much each term contributes to it, such as in pan + kili, moku, pan, kili.
Examples
- See what portion of toki pona is pure particles: li, e, la, pi, o, en, anu
- Combine multiple ways to write the same word, like ala + x or anu + y, anu, y
- Do that with UCSUR text specifically: toki +
- Combine related words to compare to others: a + n, pona
Note: If you’re currently viewing the authors or hits per author field, adding terms is disabled.
Subtracting terms
You can subtract terms from other terms by putting a - between them. This is helpful when you’d like to omit a specific use of a word, such as in toki - toki pona. This gives you all the uses of the word “toki” which are not in the term “toki pona”.
Examples
- Determine what grammatical positions a term is most common in: tenpo ni - tenpo ni la - lon tenpo ni, tenpo ni
- Examine the use of a word without other conflicting terms, like in san - kekan san
Note: If you’re currently viewing the authors or hits per author field, subtracting terms is disabled.
Minimum Sentence Length
You can set a minimum sentence length for a term by adding an underscore (_) and a number from 1 to 6 to the end of that term. For example, toki_1, toki_6 will show you the percentage of times toki appeared in any sentence, versus the smaller percentage of times it appeared in sentences with at least 6 words. You can do this with any length of term: ona li, ona li_6.
You can use this with subtraction to isolate a term: searching toki - toki_2 will show you every time “toki” appeared, except for the times it appeared in sentences with 2 or more words. This means you have all the times “toki” was the only word in the sentence!
Read ahead to the options section on minimum sentence length for more details.
Examples
- Get even more accurate information about greetings: sina seme - sina seme_3, sina pali e seme - sina pali e seme_5
- See if there is a difference in relative use of words: toki, pona, toki_6, pona_6. (there is! “pona” is more common in short sentences; “toki” is more common in long sentences!)
Scales
On the search page, this dropdown offers many different scales you can graph the data on:
The categories in the dropdown (“simple”, “useful”, and “weird”) indicate the importance of the scale to you- if you’re not interested in the details, then stick to the simple category and you’ll be fine! Otherwise, read on.
Beyond the categories, you’ll note there are four types of scale listed: Linear, Logarithmic, Minmax, and Other. I’ll go over each in detail below.
Linear
The linear scales will be the most familiar and obvious. They include the absolute, relative, and cumulative scales, and in each, you have a series of exact numbers which are graphed over time without changing the left axis or the magnitude of the numbers- this means the visual distance from 10 to 20 will be the same as the distance from 20 to 30, and so on. In other words, this is how graphs normally work!
Logarithmic
The logarithmic scales squish all the data on the graph closer together by rendering every power of ten at the same size. This means that the visual distance from 10 to 100 will be the same as the visual distance from 100 to 1,000, and so on for 1,000 to 10,000 and beyond.
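If it helps to see the idea in numbers, here is a tiny Python sketch (an illustration, not anything from ilo Muni’s code) showing that equal ratios become equal distances under a base-10 logarithm:

```python
import math

# On a log10 axis, each power of ten spans the same visual distance:
# 10 -> 100 is as wide as 100 -> 1,000, and so on.
for lo, hi in [(10, 100), (100, 1_000), (1_000, 10_000)]:
    print(lo, "->", hi, "distance:", math.log10(hi) - math.log10(lo))
# Every pair prints a distance of 1.0.
```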
The benefit of the logarithmic scale is that you can visually compare two graphs that a linear scale would draw too far apart to read together. On a linear scale, a word with 10,000 monthly uses would make a word with only 100 monthly uses essentially invisible, which you can see here. On the logarithmic scale, though, the trend line of the less common word becomes visible, letting you see that it still trends the same way as the much more popular word!
Another benefit of the logarithmic scale is that the original data, and therefore the original ordering of the data, is preserved. For example, in the relative log scale, kijetesantakalu, soweli are much closer together while not errantly implying that kijetesantakalu exceeded soweli in popularity as the equivalent minmax scale would.
Minmax
The minmax scales normalize all the points in a graph to be between 0 and 1 while preserving the original curve of the graph. For example, the Relative Minmax scale does this for the relative graph, which is excellent for demonstrating that words like ona, li trend similarly in use while being in different magnitudes.
The Absolute Minmax scale does the same for the absolute graph. This makes it helpful for demonstrating that toki, meli trend similarly with community activity while being in different magnitudes. This even applies when comparing words and phrases.
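If you’re curious what this normalization looks like under the hood, here is a minimal sketch of the standard minmax formula, (x - min) / (max - min). This is an illustration, not ilo Muni’s actual rendering code:

```python
def minmax(points: list[float]) -> list[float]:
    """Rescale a series to the range [0, 1], preserving its shape."""
    lo, hi = min(points), max(points)
    if lo == hi:  # a flat series has no range to rescale
        return [0.0 for _ in points]
    return [(x - lo) / (hi - lo) for x in points]

# Two series with very different magnitudes but the same trend:
print(minmax([100, 200, 400, 300]))  # [0.0, 0.333..., 1.0, 0.666...]
print(minmax([1, 2, 4, 3]))          # identical output
```

Both series produce identical output, which is exactly how minmax reveals shared trends, and also how it can errantly imply that two words reached the same popularity, as noted above.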
Other
Fields
On the search page, this dropdown gives you access to different types of data to graph:
Hits
Hits are the default graphed data, and represent the number of times a word or phrase appeared per month. There are no filters on this, aside from what is considered a “toki pona sentence” by my library, sona toki. If you’d like to test that library out, there’s a slightly more strict version of it implemented in my bot ilo pi toki pona taso, which can be found in many toki pona Discord servers.
Absolute
Hits on the absolute scale show the exact number of times a given word or phrase was said in each time period. This is useful for observing trends in the activity of the community, but it can make it difficult to compare words of very different magnitudes, or to study the use of toki pona before March 2020.
If you’re interested in comparing two or more absolute graphs, but one word is vastly more popular than the others, check again using the logarithmic or minmax scales.
Relative
The relative scale is the default scale. When graphing hits on it, it shows you the percentage of all words which are the searched term in each time period on the graph. This also applies to phrases, which you should interpret as follows: “What portion of all words are any one word from this phrase?” That is to say, from a math standpoint, each appearance of a phrase can stand in for any one of the words in it.
If you’re interested in why I do the math this way, you can read about it in the discussion below. Spoiler: I dunk on Google, who admits they are doing their math incorrectly.
Discussion
The challenge of doing appropriate math with the different term lengths is that you can derive different reasonable percentages for the same term. My goal is to offer a percentage which matches users’ assumptions as often as possible, and particularly, makes sense when adding or subtracting terms. In other words, I intend to follow the principle of least surprise.
There are three questions we could ask of the data to derive a reasonable percentage, and I’ll explore each with the following simplified dataset:
- There are 10,000 words in total
- There are 9,000 terms of length 2
- The word “toki” appears 100 times
- The word “pona” appears 100 times
- The phrase “toki pona” appears 90 times
Here are the percentages that result from this data:
- toki is 1% of all words. (100 / 10000)
- pona is 1% of all words. (100 / 10000)
- toki pona is 1% of all terms of length 2. (90 / 9000)
- toki pona is 1.8% of all words when you count it for each of its words. ((90 * 2) / 10000)
- toki pona is 0.9% of all words when you count it as a word itself. (90 / 10000)
- What percentage of all same-length terms are this term?
This method seems to work at first, because we know that the total number of terms of a given length (9,000 terms of length 2) can be compared to the number of a specific term of that length (the 90 times that “toki pona” appears). The problem with this method is that the resultant percentage is not comparable to the percentage you would derive for terms of different lengths.
In the sample data, toki is 1% of all words, and the phrase toki pona is 1% of all terms of length 2. If we perform the search toki - toki pona, the result would be 0% ((100 / 10000) - (90 / 9000)). To a user, this would imply that all of the occurrences of toki were in the phrase toki pona, which they are not.
Under this method, there isn’t a way to compare “toki” and “toki pona” because their percentages are measured against unrelated totals: “percentage of all words” is not comparable to “percentage of all phrases of length 2”, and percentages of each cannot be meaningfully added or subtracted. Doing math as though they are comparable will create confusing and incorrect results as we observed.
Strangely, Google Ngrams does this intentionally according to their info page, in spite of the fact that they allow you to add and subtract n-grams of different length. Direct quote, with bold added by me:
What the y-axis shows is this: of all the bigrams [terms of length 2] contained in our sample of books written in English and published in the United States, what percentage of them are “nursery school” or “child care”?
All that said, this method is still useful. It produces a sensible result on its own, still lets you do valid math on two or more terms with the same length, and can help you infer useful things about the distribution of terms in the language you’re studying. But it does not work for my purposes, because it is necessary for a user to be able to compare terms with different lengths and get a valid result.
If you’re interested in this alternate math, I provide totals for every phrase length in ilo Muni’s database, so download the database to try it!
- What percentage of all words are in this term?
For graphing a single term with two or more words, you can get interesting and meaningful results from this question. To do this, we would multiply the number of times the phrase occurred by the length of the phrase, then divide by the total number of words. This means that, percentage-wise, each phrase counts for all of its words.
In the sample data, we would count the phrase toki pona as 1.8% of all words, because the phrase toki pona occurs 90 times and has 2 words in it ((90 * 2) / 10000). However, what happens when we attempt to subtract with this total?
Let’s look at toki - toki pona again, but without the percentages for a moment. toki is counted once for each time it occurs, times the number of words in it, so it has 100 occurrences. toki pona is counted once for each time it occurs, times the number of words in it, so it has 180 occurrences. But this also means that we’re attributing 90 of those occurrences to toki in the phrase, and the other 90 occurrences to pona in the same phrase.
This is where the math falls apart: We’re removing occurrences of “pona” from occurrences of “toki”, because the phrase “toki pona” counts for both of its words. But there were never any occurrences of “pona” in the occurrences of “toki”, so we would get a lower result than expected- in this case, the math works out to -0.8% ((100 - (90 * 2)) / 10,000), which is an even worse result than before and likely doesn’t mean anything to a user.
- What percentage of all words are any word in this phrase?
In this method, we take the number of times a given phrase occurred, regardless of its length, and divide it by the total number of words. This seems incorrect on the surface, for similar reasons to the first method: Aren’t we doing math with incomparable totals? But let’s continue with this method and see where it goes.
In this method, we use a single total number of words (10,000) and we do not need to alter the occurrences for a term based on its length- which means this method is actually the simplest to calculate. With this, the search toki - toki pona works out to 0.1% ((100 - 90) / 10000). That looks like a sensible result, but what does it mean?
In the prior scenario, we explored the phrase being counted for each of its words, which we accomplished by multiplying its number of occurrences by its number of words. We also noted that this is equivalent to counting its words independently, such that the phrase toki pona being counted 180 times means that the words toki and pona occur 90 times each, in the phrase toki pona.
In this scenario, the term toki pona is only counted for how many times it appeared. This means, in the subtraction example, we can imagine the 90 occurrences of toki pona as applying to either word in the phrase- contextually! This means you would interpret that 0.1% figure as meaning “toki which is not in the phrase toki pona is what percentage of all words?”
Relatedly, the fact that the search is possible to interpret contextually means we can perform another similar search and get a valid result: If we search for pona - toki pona, the result is 0.1% ((100 - 90) / 10000) and still makes sense: “pona which is not in the phrase toki pona is what percentage of all words?”
Even better, you can replicate the prior method by subtracting a phrase once for each word in it. If we did want to remove every word in toki pona from toki we can search toki - toki pona - toki pona to get that result.
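To recap the three methods, here is the arithmetic on the sample dataset written out as a small Python sketch (a plain illustration, not ilo Muni’s code):

```python
# Sample dataset from above.
TOTAL_WORDS = 10_000       # all words
TOTAL_BIGRAMS = 9_000      # all terms of length 2
TOKI, TOKI_PONA = 100, 90  # occurrences of "toki" and "toki pona"

# Method 1: percentage of all same-length terms (Google Ngrams).
# "toki - toki pona" misleadingly cancels to 0%.
print(TOKI / TOTAL_WORDS - TOKI_PONA / TOTAL_BIGRAMS)  # 0.0

# Method 2: count the phrase once for each of its words.
# Subtraction removes occurrences of "pona" from "toki" and goes negative.
print((TOKI - TOKI_PONA * 2) / TOTAL_WORDS)            # -0.008, i.e. -0.8%

# Method 3 (what ilo Muni does): count the phrase once, divide by all words.
print((TOKI - TOKI_PONA) / TOTAL_WORDS)                # 0.001, i.e. 0.1%
```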
Last note: If you’re a math academic reading this, I am terribly, terribly sorry for not using real notation to demonstrate the above. I assure you, these plain language explanations are much less embarrassing than any attempt I would make to use set notation.
The relative scale is the most generally useful scale, as it implicitly tells you the relationship of your search term to all other terms by showing a percentage instead of a raw number. For example, you can compare the grammatical particles or the colors, determining how much of the entire language is these terms.
It can be difficult to compare relative graphs for words which are in different magnitudes, such as kijetesantakalu, soweli. I recommend the Relative Minmax scale for cases like these, which can help to identify words which trend in the same way no matter their magnitude.
Cumulative
When graphing hits on the cumulative scale, you’ll see how many times a given word has been said up to each point in time, increasing to the total number of times the word has been said by the present. This is handy for observing the point where a word or phrase becomes more spoken than another, or for examining periodic phrases in a different way. It can also help to clarify that, while one word has become more popular than another recently, it may not have done so all-time.
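As a quick sketch with made-up numbers, the cumulative scale is just a running sum of the monthly hits:

```python
from itertools import accumulate

monthly_hits = [5, 8, 12, 7, 20]       # hypothetical hits per month
print(list(accumulate(monthly_hits)))  # [5, 13, 25, 32, 52]
```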
Authors
Authors is the number of unique users who said a given word or phrase during a month. This is further filtered: an author must have said at least 20 toki pona sentences in order to be counted, which ensures the counted authors are only those who have spent at least a bit of time actually practicing toki pona. Without this limit, a very large number of authors would be counted among the crowd despite using almost none of toki pona’s words, which would errantly reduce the portion of authors who have used a critical word like toki.
Also, because this is the number of unique authors, it is not possible to add or subtract authors or use the cumulative scale, and so adding and subtracting terms will not show any data. This is not possible because the database only has the strict number of authors per data point, not the authors themselves- and you need the authors themselves, because this is actually a set operation. Here’s an example:
Let’s say the word toki has been said by {A, B, C, D}, and the phrase toki pona has been said by {A, E}. In the current database, I store just the size of each set, so toki has been said by 4 authors, and toki pona has been said by 2 authors. If I wanted to know how many authors said toki outside of the phrase toki pona, the correct math would be {A, B, C, D} - {A, E} = {B, C, D}, or 3 authors. That is, we remove A from {A, B, C, D}, but we don’t do anything with E because it wasn’t in the prior set to be removed.
Now, the issue should be obvious: If we only had the number of authors without knowing who the original authors were, we would remove the 2 authors of toki pona from the 4 authors of toki and get 2 authors. But we already know this is incorrect.
I’m considering a new database backend which would let me do the set math when performing such a query, but for the time being that cannot be done.
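Here is the same example as a Python sketch, with hypothetical author IDs, showing why the stored counts alone can’t answer the question:

```python
toki_authors = {"A", "B", "C", "D"}  # authors who said "toki"
toki_pona_authors = {"A", "E"}       # authors who said "toki pona"

# Correct: a set difference removes only the overlapping author A.
print(len(toki_authors - toki_pona_authors))       # 3, i.e. {B, C, D}

# Incorrect: subtracting the stored sizes can't know about the overlap.
print(len(toki_authors) - len(toki_pona_authors))  # 2
```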
Hits per Author
Absolute
On the absolute graph, hits per author is the average number of hits per author, graphed over time. For example, if there were 100 authors and 1,000 hits for the word toki, then the hits per author would be 10 (1,000 / 100).
Relative
On the relative graph, hits per author is the percentage of the language that the average author’s uses of a given word make up.
Other Options
Smoothing
By default, 2 smoothing is set. The number is how many neighbors on both sides of a given data point will be smoothed. For example, if you set 5 smoothing with the Window Avg smoother, it means a given point will be set to the average of the 5 points before, 5 points after, and itself.
Smoothing is helpful for making noisy graphs more readable while preserving the trend line of the original graph. Compare the graphs of wawa, nasa, suwi, sewi, suli with 0 smoothing and 5 smoothing. Tip: Double check the axis on the left!
Note that smoothing can produce misleading graphs with respect to the time axis, such as smearing periodic phrases over too much time. Sometimes, 0 smoothing is better!
Relatedly, some scales have smoothing disabled, usually because it wouldn’t make sense to average their values. This applies to the absolute scale, for example, because it is meant to show you the exact number of times a given word or phrase appeared! This also applies to both offered derivatives, because they are completely impervious to localized averaging.
Historical note
In a prior version of ilo Muni, smoothing would be performed over all neighbors of a value, even if those neighbors were zero. This is intended behavior if those zeroes appear in the middle of an otherwise busy graph, but this causes some misleading graphs when a word had no data from the start of a graphed period. Smoothing such a graph could imply that misikeke existed before November 2019. This was fixed on August 13th; now, smoothing does not occur until the first non-zero element of a graph.
Dates
By default, the date range for the graph is set from 2016 to 2024. You can select any start or end you want, but there are some caveats to warn you about:
Each year represents August of that year until August of the next year. This is done intentionally, to align annual measurements of toki pona with the day toki pona was created, 2001-08-08. This is why, if you select 2016, the graph will start in August and mark 2017 shortly after.
Each datapoint represents a month of activity, regardless of the length of the represented month. This means that datapoints for different months are not necessarily comparable on the absolute scale. For example: If there were more activity in May compared to February, you wouldn’t be able to tell if there were a specific cause for that change- the change could be accounted for by the month of May being longer. This issue could be corrected for by using same-sized units of time for each datapoint, but this creates more problems than it solves. See the historical note in this section for more information.
The default start date is 2016 because the data prior to that is extremely sparse. I have left in the option to query for earlier data, but be aware that relative graphs will be noisy, the absolute graphs will be flat, minmax graphs will become nonsense- and for the other graphs, here be dragons.
The graph ends in July 2024. There is no data provided for August 2024 or beyond, because I only wanted to represent completed months in ilo Muni, and I collected this data during August 2024.
Historical note
The database dated 2024-09-07, which was available on ilo Muni from 2024-09-12 to 2024-12-17, does not use monthly buckets of data. Instead, it uses 4-weekly buckets of data, with the 4-weekly period starting from 2001-08-08. The purpose of this was to solve the following problem:
Some months are shorter than others, so the absolute scale may be misleading for those periods by implying they were less active than neighboring months. Relatedly, weeks are not evenly distributed over months, so some months will have more weekends, and therefore more active periods, than others. 4-weekly buckets solve both of these problems at once.
By switching to 4-weekly periods for ilo Muni’s data, the represented periods of time were all the same size. This made datapoints in the absolute graph directly comparable, but it also meant that the user interface was obnoxious: Each datapoint would be labeled with an exact day, implying it represented only that day. It wasn’t possible to reduce it to a month because the represented period of time could cross months, and even if I had changed it to a range, there would have been that much more interpretation complexity for a single, incredibly minor gain.
All of this said, the relative scale was never affected in the first place. It represents datapoints as a percentage of the words said, which corrects for the differing totals you may get on a month-to-month basis.
As of 2024-12-17, I have switched back to monthly buckets.
Minimum Sentence Length
Note: This is hidden by default! Click to show it.
This option is also called words per sentence in its dropdown. By default, All sentences is set, meaning you will see how words or phrases appear in any length of sentence. If you set this option to 3+ words per sentence, you’ll see how words or phrases appear in sentences which have at least 3 words. This can be helpful if you want to study more “substantial” uses of words, i.e. those that appear in longer sentences.
If one of your searches sets the minimum sentence length for a term, pay attention to the legend below the graph: If the legend shows the term without an underscore, it means the length you chose was already being searched. This can happen with a search like toki pona_2 normally, or wawa_3 while the minimum sentence length dropdown is set to 3.
This happens to the phrase toki pona with a minimum sentence length of 2 because phrases have a minimum sentence length equal to how many words are in them. That is, “toki pona” can only appear in sentences with at least two words- which I hope makes sense! Because of this, the minimum sentence length is always implicitly set to at least the length of the phrase. You can always set it to be higher, of course.
Note that when graphing on any relative scale, the percentage is derived by dividing the number of occurrences for the search term by the total number of words in the same time period.
If you’re curious why this is done instead of dividing by the number of words in sentences of the appropriate minimum length, here’s a simplified example:
Demonstration
Imagine the following scenario:
- There are 10,000 words in total
- 9,000 of those words are in sentences with at least 2 words
- 100 of the words are toki
- 90 of those toki are in sentences with at least 2 words
With this data, asking “What percentage of words are toki?” means we get 1% (100 / 10,000), which makes sense, and is the only reasonable way to measure toki on its own.
However, there are two ways to measure toki in sentences with at least 2 words (toki_2): You could measure it as a portion of all words, or as a portion of words from sentences with at least 2 words. These two different choices have different results: toki_2 is 0.9% (90 / 10,000) of all words, but 1% (90 / 9,000) of the words in sentences with at least 2 words.
We can use this information to determine which answer is best for this graphing tool by exploring what happens when you search toki - toki_2.
In the sample data, graphing toki - toki_2 with toki_2 measured as a portion of all words means we get 0.1%: ((100 - 90) / 10,000). That is, toki appears exactly 10 times on its own in this data. Graphing instead with toki_2 measured as a portion of words in sentences with 2 or more words means we get 0%: ((100 / 10,000) - (90 / 9,000)). In other words, we get a misleading outcome because we’re subtracting two incomparable percentages.
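The same comparison, written out as a quick Python sketch of the arithmetic above:

```python
TOTAL_WORDS = 10_000    # all words
WORDS_IN_2PLUS = 9_000  # words in sentences with at least 2 words
TOKI, TOKI_2 = 100, 90  # occurrences of "toki" and "toki_2"

# toki_2 measured against all words: the result isolates the 10
# times "toki" was a one-word sentence.
print(TOKI / TOTAL_WORDS - TOKI_2 / TOTAL_WORDS)     # 0.001, i.e. 0.1%

# toki_2 measured against words in 2+ word sentences: the
# incomparable percentages cancel to a misleading 0%.
print(TOKI / TOTAL_WORDS - TOKI_2 / WORDS_IN_2PLUS)  # 0.0
```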
There is value in knowing what portion of sentences with at least 2 words are some specific word. This graphing tool does not offer that information because doing so would produce misleading graphs for both side-by-side comparison and for adding or subtracting specific results. If you’re interested in that alternate data, download the database!
Smoother
Note: This is hidden by default! Click to show it.
By default, Gaussian is set, and I don’t recommend changing it from the default unless you’re aware of what change you’re making and why. The different smoothers have different properties, but almost all of them boil down to making the data more readable when the month-to-month values are messy.
Gaussian
The Gaussian smoother is the default smoother as of 2024-12-17. It makes highly smooth and presentable curves, and does a better job of preserving the locations of peaks and troughs than either Window Average or Exponential.
Window Avg
The Window Avg smoother was previously the default smoother, and is the only one offered by Google Ngrams. It averages the neighbors of a given point, which flattens the graph toward a smoother one with the same characteristic curve as the original noisy data, preserving a bit of the noise along the way. In data with high peaks, you’ll notice a characteristic plateau shape form around where the peak would be. With extremely high smoothing, all of the data will tend toward a flat line as all datapoints are set to the average of the entire graph, although data with regular peaks will exhibit phantom peaks with sufficiently high smoothing.
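For the curious, here is a minimal sketch of a centered moving average of this kind. It truncates at the edges and ignores the leading-zeroes rule from the historical note above, so it’s an illustration rather than ilo Muni’s exact implementation:

```python
def window_avg(points: list[float], radius: int) -> list[float]:
    """Set each point to the average of the `radius` points before it,
    the `radius` points after it, and itself."""
    smoothed = []
    for i in range(len(points)):
        window = points[max(0, i - radius): i + radius + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed

# A single spike flattens into the characteristic plateau shape:
print(window_avg([0, 0, 0, 9, 0, 0, 0], radius=1))
# -> [0.0, 0.0, 3.0, 3.0, 3.0, 0.0, 0.0]
```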
Exponential
The Exponential smoother preserves the curve of activity while moving peaks and troughs closer to the average of their neighbors. This is helpful for preserving the occurrence of large peaks, but it does smear them forward in time due to weighting later events more heavily than earlier ones.
Median
The Median smoother takes the median of the values in the smoothing window, which results in a very square-looking graph. It occasionally exhibits the same phantom peaks as Window Avg.
Triangular
Potential Bias
In previous sections, I discussed how smoothing can create misleading graphs, and how the math for phrases of 2 or more words, and for words in sentences with a minimum length, is designed to create a consistent interpretation across the dataset. However, these are not the only interpretation challenges. Most of the remaining bias is in where and how the data is collected, which I describe below.
Bots
Every platform has bots which send messages, but not every platform is gracious enough to inform you that a given message is from a bot. Fortunately, almost no bots send messages in toki pona, but there are a handful worth being aware of, because if counted they would artificially inflate the use of any word or phrase they say.
Discord
Discord does an excellent job of informing you that a given user is a bot, but things get mixed up when it comes to webhooks because of PluralKit. Normally, webhooks are a kind of bot message which is sent automatically but is not attached to any user. They tend to be notifications for things happening on other platforms, like commits on Github or posts on Reddit. However, PluralKit messages are webhook messages from users. Right now, I can’t distinguish between PluralKit messages and any other type of webhook, so I’m forced to count both or neither- and I chose to count both. This means there is some uncertain amount of bot data in the Discord data. Fortunately, only 4.6% of the data is from webhooks, so the impact of this cannot be too large.
In the future, I plan to grab all the webhook messages I have and ask the PluralKit API whether those messages are PluralKit messages. Then I would be able to map those messages back to the host account which originally sent the message, and I could then count only user messages.
Telegram
On Telegram, I have no way to know if a given user is a bot. Telegram’s JSON export format does not include that information. I do have one hard-coded exception, the IRC forwarding bot, because I needed to cut the names out of its messages to represent them as intended. Otherwise, all Telegram bots are invisible to me. That said, all of the ones I’ve seen speak English, so they shouldn’t have any counted sentences.
Reddit
Reddit does not tell you if a given user is a bot, as far as I’m aware. I can’t fix that, but like Telegram, none of the bots on Reddit seem to speak toki pona- so no harm done!
Herbevitistoj
Herbevitistoj is “Many people who professionally avoid grass.” It’s a fairly recent and silly way that Esperanto speakers refer to the “terminally online,” or termed more kindly, people who are on the internet a lot. Most toki pona is spoken on the internet in the first place, but there is still a subset of the community which makes up an outsized portion of messages written in toki pona because they are much more active in online spaces. There isn’t anything to be done about this- it’s just a fact to be aware of.
That said, I am personally curious to see what the data would look like if the most active 20% of users were removed from it- maybe something to do in the future.
Platform Notes
No notes for Telegram, Reddit, YouTube, or the Toki Pona forums; all of them are represented in their entirety, or as much entirety as can be reasonably obtained, through the final represented date of ilo Muni.
Discord
Right now, the data for ilo Muni is collected from Discord, Telegram, and Reddit. Of these, Discord is about 80% of all of the data, and ma pona pi toki pona in particular is about 80% of the data from Discord. You can see the impact of that one server from sections of the graph like this, where a small number of archived channels caused a nearly 50% decline in use of toki pona.
In a sense, Discord is “over-represented” in the data, because it is such a large portion of the data in the first place. For the time being, I have chosen to weight all messages equally, but I would like to produce alternate databases and analyses in the future.
Identifying toki pona
This data would not exist without first being able to detect whether a message is “in toki pona”. I wrote a library to do this, but it has its own collection of complexities which can impact how you interpret the data.
I chose to use the dictionary from lipu Linku, including its sandbox, in order to identify definite “toki pona words.” These weren’t the only words I counted, because that would miss anything that wasn’t already in my dictionary. But it still isn’t perfect.
Dictionary
If a word in my dictionary matches a word in another language, I would errantly count that word while scoring the sentence. My scoring algorithm has no concept of a penalty currently, so I couldn’t identify that the words around a given one were specifically of some language other than toki pona- I’d just know they didn’t match any of my filters.
This is especially troublesome if a message is a single word in my dictionary, or otherwise very short. “je” is borderline non-existent in toki pona, but it’s in the dictionary, and it’s a first person pronoun in French!
Short words
While writing my sentence scoring algorithm, I noticed that there were tons of 1-2 letter words in other languages which would errantly match my scoring filter, and thus errantly raise the score of messages that were not in toki pona.
To fix this, I changed my syllable checking and alphabetic match checking filters to only score if a given word has at least three letters.
For the most part, this is a good assumption: two letter words are rare enough that I almost certainly have them all in my dictionary already. Doubly so for one letter words, since there are only six possible (a, e, i, o, u, n).
But this means that, if a two letter word were coined, and it were not added to my dictionary, I would score it zero by default- it could still be discovered if it were next to many other toki pona words, but it would drag down the score of the sentence it’s in.
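To make the idea concrete, here is a simplified sketch of such a filter. The regex below only roughly approximates toki pona’s syllable structure (it accepts a few forbidden combinations like ji, ti, wo, and wu), and none of this is sona toki’s actual code:

```python
import re

# Rough approximation of a toki pona syllable: optional consonant,
# a vowel, and an optional final "n".
SYLLABLES = re.compile(r"^(?:[jklmnpstw]?[aeiou]n?)+$")

def scores_as_toki_pona(word: str) -> bool:
    """Only let phonotactically plausible words of 3+ letters score."""
    word = word.lower()
    return len(word) >= 3 and bool(SYLLABLES.match(word))

print(scores_as_toki_pona("je"))               # False: too short to score
print(scores_as_toki_pona("toki"))             # True
print(scores_as_toki_pona("kijetesantakalu"))  # True
```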
Frequently Asked Questions
Why does [query] take so long?
While searching for one or a few terms should only take about half a second, searching with a wildcard or for many terms will take much longer- you can estimate this by multiplying the half second by the number of queries you’re making.
That said, all queries will become faster as you make more queries in the same session. This is because you’re downloading and caching more of the database’s data!
But this is limited by the fact that each part of each query must be fetched consecutively, including for different queries. There is only one worker fetching data, because multiple workers would be unable to share their cache.
If I had a database hosting solution, nearly all of the queries would be as instant as the network itself. If you have any suggestions for one, let me know!
Why is my subtraction negative?
That’s allowed! If you do tawa pona - kama pona, you’ll get a graph which is mostly negative. This means the phrase “kama pona” is more common than the phrase “tawa pona”, probably because the community is very welcoming!
It’s probably possible to get floating point silliness when subtracting, but I haven’t seen that happen personally- please reach out if you spot it!
Why is the data so noisy before 2020?
In short, there is much less data to examine from before 2020, and even less before 2017. This is why I set the default start date to August 2016, rather than the actual start of my data in March 2002. For reference, this roughly aligns to the creation of the first toki pona groups on Telegram.
So the next question is, why 2020? Although I probably don’t need to answer that, I’ll go ahead and do so:
When everyone was trapped indoors for some two years during the COVID-19 pandemic, toki pona saw a huge spike in popularity. You can see the climb in activity in every word when the scale is set to “absolute”. This also affects the relative graph though- before 2020, each word written is a much larger portion of all the words for that time period! To help, you can add smoothing to relative mode, which will average out nearby data points and thus make the data easier to read while preserving its original shape.
Why is there a huge spike on [date] for [word]?
This data isn’t from professional sources, unlike Google Ngrams which is sourced entirely from published books. In professional sources, you wouldn’t expect an editor to let a paragraph like woo yeah! woo yeah! woo yeah! woo yeah! woo yeah! woo yeah! remain in the final product. But in Discord and any other social media platform, there is no editorial oversight- silly goofy abounds.
Sometimes this means you’ll see a word organically spike in usage, because it was just made and people are excited to use it! Other times, you’ll see a word spike in usage due to a word game being played, which I try to identify and omit- because the goal is to measure “real” toki pona, and word games fall just outside of this bound.
This used to affect mu, wan, tu, luka, mute, and ale, but I’ve since removed those- that was from a single person counting from 1 to over 2,000 all on their own.
Relatedly, there was a time during development where “mu” had a spike to over 40,000 uses because of a day in which a handful of messages were nothing but “mu” to the text limit of Discord. Because of this, I added a nonsense filter to skip sentences before they get counted, which works like so: If a sentence is more than 10x the average sentence length (4.13557) and more than 50% of the message is a single word, it gets thrown out. Similarly, if a sentence is more than 100x the average sentence length, it gets thrown out immediately. This filter isn’t perfect though, because somebody can say “mu. mu. mu.” to similar effect, and each of those will be counted as individual sentences. Working on it!
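Here is a sketch of that filter’s logic as described above (a plain illustration; the real implementation lives in my data pipeline and differs in detail):

```python
AVG_SENTENCE_LEN = 4.13557  # average sentence length, as cited above

def is_nonsense(words: list[str]) -> bool:
    """Drop sentences that are absurdly long, or long and dominated
    by a single repeated word."""
    n = len(words)
    if n > 100 * AVG_SENTENCE_LEN:  # ~414+ words: thrown out immediately
        return True
    if n > 10 * AVG_SENTENCE_LEN:   # ~42+ words: check for repetition
        most_common = max(words.count(w) for w in set(words))
        return most_common / n > 0.5
    return False

print(is_nonsense(["mu"] * 50))                               # True
print(is_nonsense("toki! mi kama sona e toki pona".split()))  # False
```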
Ultimately, I would like this data to reflect how toki pona is actually used- and, granted, somebody sending hundreds of “mu” is using toki pona. But I think most people would agree that those messages are not reflective of how toki pona is used in either media or conversation, and these sorts of sentences otherwise make it difficult to examine the rest of the data.
Why do hits and authors drop so much between Feb and Aug 2023?
Because at the time, the mod team of ma pona pi toki pona (including myself!) chose to archive half of the channels in the toki pona only category of the server! In retrospect, this idea was doomed from the start, but we didn’t know why at the time.
At the time, we were observing a very consistent pattern in the toki pona channels of the server, in the form of the following conversation:
(10:31) jan Wan: toki!
(10:39) jan Tu: toki, sina pilin seme?
(11:04) jan Wan: mi pilin pona. sina pilin seme?
In short, somebody would say “hello,” but not get a response very quickly. They would wander off, eventually see that somebody else had responded, but their own response would be too late to catch them. No conversation would actually occur.
We saw this pattern playing out so frequently that we decided to try and help it, and we diagnosed the issue as there being too many channels- people were not seeing one another because they had too many channels to dig through to find conversation. So we archived most of the channels which had a built-in topic.
We chose to restore the archived channels in August, but we wouldn’t actually see the results of this change until July 2024 when I began creating the first graphs with ilo Muni. When I realized the connection between The Dip and the archived channels, I realized what we were missing at the time: Conversations for their own sake aren’t enough. Conversations have a topic, and that topic is what drives longer discussion. Even if people see each other more often, they won’t talk if there is nothing to talk about- and we left behind only the general discussion channels, with no set topic.
For comparison, the topic channels were always much quieter than the general channels- but what they lacked in activity, they made up for in conversation depth, because those channels offered something obvious to talk about from the start.
Anyway, the effect of this period of time is a drop in the number of words used and a drop in the number of active authors for any given word. It doesn’t affect the percentage of words used in that period, which is expected- the proportions of used words should not change. But it does affect the percentage of authors significantly- this is because many authors stuck around, but still ended up speaking much less toki pona. Unlike relative hits, relative authors do not exist in competition- one author may use any number of words, including fewer, without affecting the total number of authors for a given period.