Language on Victor42

I Did a Deep Dive into English Word Stress...

hi@victor42.work (Victor42) — Fri, 05 Jul 2024 22:33:00 +0000

Target audience: English learners, data analysis enthusiasts, Python coders, and my friends.

This is my first data analysis project. I’ve been teaching myself data science for over a year, picking up skills along the way, but I hadn’t tackled a real-world project. During my studies, the words ‘analyze,’ ‘analysis,’ and ‘analytical’ kept appearing. The stress placement is unpredictable (‘analyze, a’nalysis, ana’lytical) – a real headache! It turned reading into a tongue-twisting exercise.

Some claim there are rules for stress, but they’re often lengthy and complex. Others say there are too many exceptions. However, even with those three words, a pattern does emerge. English seems to avoid three unstressed syllables in a row and tends to place stress near the beginning. For words with five or fewer syllables, the stress often lands on the antepenultimate (third-to-last) syllable.

It makes sense, doesn’t it? Three unstressed syllables in a row would be monotonous. Stress adds rhythm. It’s like driving on a straight road – you’ll likely doze off. Placing stress too late would also hinder comprehension. Imagine a long word with emphasis on the very last syllable – you’d likely miss the meaning!

To illustrate, consider Mandarin Chinese. It has a significant flaw: the word “不” (bù, “not”). Both the consonant and vowel are faint, especially in rapid speech. The vowel becomes even weaker. You often can’t discern if someone even uttered “不”! This creates a major communication problem, as it distinguishes between two opposite meanings. When my daughter cries, I struggle to understand if she’s saying “要” (yào, “want”) or “不要” (bù yào, “don’t want”).

Back to English stress. My theory seemed reasonable, but I needed evidence. As a data-science novice, I decided to get my hands dirty and see how many words actually followed this pattern.

Research Plan

Having learned data analysis, I formed the research plan quickly. It involved collecting, cleaning, analyzing, and visualizing data. Regression analysis or prediction wasn’t necessary.

Here’s the plan, which my existing skillset was sufficient for:

Find a comprehensive word list.
Find a free way to fetch phonetic transcriptions in batches from an online dictionary.
Determine the syllable count and stress position for each word (possibly with AI assistance).
Analyze the distribution of stress positions and visualize the findings.
Test my hypothesis.

Let’s dive in.

Data Source

I found a dataset on Kaggle, a popular data science community. It’s a simple .txt file containing over 300,000 English words, listed alphabetically, one per line:

https://www.kaggle.com/datasets/bwandowando/479k-english-words

The .txt file is 4MB, comparable to a million-word novel.

I created a Kaggle code project, imported the dataset, read all the words, and obtained a table with 369,652 rows and 1 column.

Getting the Pronunciation

The table only contained words. For rigorous research, I needed phonetic transcriptions.

Fortunately, I discovered a free online dictionary API: https://dictionaryapi.dev/.

Now, I had to look up each of those 300,000+ words. Naturally, I’d write code to automate this.

The API returned more than just phonetics; it included audio, etymology, parts of speech, meanings, and examples. The useful components were the phonetics, etymology, and part of speech. However, etymology was mostly missing, so I extracted only the phonetics and part of speech.

The sheer data volume posed a challenge. The API documentation didn’t specify request limits, but I found the limit in their GitHub code: 450 requests every 5 minutes. For 369,652 words, even running non-stop, it would take 369652 / 450 * 5 / 60 = 68.45 hours, almost 3 days!

Alright, three days it was. But I had to adjust my strategy. I added a function to chunk queries and save results periodically. Every 1,000 rows, I’d save to a sequentially numbered file. I’d then continue querying based on the sequence number. Finally, I’d merge all 300+ files.
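The chunk-and-resume strategy can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: `lookup` is a hypothetical callable standing in for the dictionary API request, and the `chunk_NNNN.json` file naming is an assumption.

```python
import json
import time
from pathlib import Path

# Rate limit quoted above: 450 requests per 5 minutes.
REQUESTS_PER_WINDOW = 450
WINDOW_SECONDS = 5 * 60


def estimated_hours(n_words):
    """Lower bound on the total lookup time at the rate limit, in hours."""
    return n_words / REQUESTS_PER_WINDOW * WINDOW_SECONDS / 3600


def fetch_in_chunks(words, lookup, out_dir, chunk_size=1000,
                    delay=WINDOW_SECONDS / REQUESTS_PER_WINDOW):
    """Look up words chunk by chunk, writing one numbered JSON file per chunk.

    `lookup` is a hypothetical callable (word -> dict or None) wrapping the
    API request. Chunks whose file already exists are skipped, so an
    interrupted run can simply be started again.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for start in range(0, len(words), chunk_size):
        path = out_dir / f"chunk_{start // chunk_size:04d}.json"
        if path.exists():  # fetched in an earlier run
            continue
        rows = []
        for word in words[start:start + chunk_size]:
            result = lookup(word)
            if result is not None:  # words missing from the dictionary are dropped
                rows.append({"word": word, **result})
            time.sleep(delay)  # stay under the rate limit
        path.write_text(json.dumps(rows, ensure_ascii=False), encoding="utf-8")


def merge_chunks(out_dir):
    """Concatenate all chunk files in sequence order."""
    merged = []
    for path in sorted(Path(out_dir).glob("chunk_*.json")):
        merged.extend(json.loads(path.read_text(encoding="utf-8")))
    return merged
```

Writing each chunk to its own file means a crash costs at most one chunk of work, and `estimated_hours(369652)` reproduces the 68.45-hour estimate.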

It turned out that most of the 300,000+ words were obscure and not found in the API. I only got results for roughly 100 out of every 1,000 words. One of the saved chunk files, for example, contains only 92 rows.

Linguistic research indicates that 3,000 English words cover 95% of everyday usage, and 1,000 cover 89%. Another study shows that the average adult has an active vocabulary of about 20,000 words and a passive one of 40,000. Thus, only about 1/10 of the dataset is relevant, which is reasonable.

Data Cleaning

After merging, I found the dictionary’s phonetic symbols were inconsistent, containing uncommon symbols like ɘ, ɝ, ɚ, ɨ, ʉ. These represent subtle pronunciation variations, roughly equivalent to standard sounds. I had to replace them; otherwise, they’d disrupt syllable counting and subsequent analysis.

Besides unusual symbols, there were many phonetically identical but differently written symbols, like əu/əʊ and ai/aɪ. These also required merging. Each line of my replacement table replaces the first symbol with the second, leaving bracketed symbols untouched.

Some words differ significantly between British and American English. I prioritized American English conventions.

Numerous unconventional spellings existed. Over- or under-replacement could easily cause phonetic errors. I wrote a temporary checker, manually consulted the Cambridge Dictionary, and refined my replacements. This took time.

After processing, the vowel symbols were cleaner. For “anthropomorphic”:

Before: [ˌæ̃n̪θɹ̠əpəˈmɔɹ̠fɪ̈k]
After: [ˌæn̪θɹ̠əpəˈmɔːfɪk]

I didn’t handle consonant symbols, as they were irrelevant to my goal, and that’s a more complex issue.
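A replacement pass of this kind might look like the sketch below. The table is illustrative only: the əu/əʊ and ai/aɪ pairs come from the text, while the targets chosen for the rare symbols and the stripping of combining diacritics are assumptions.

```python
# Illustrative replacement table (old -> new), applied in order.
# The targets for the rare symbols are assumptions, not the post's table.
REPLACEMENTS = [
    ("\u0303", ""),  # combining tilde (nasalization mark)
    ("\u0308", ""),  # combining diaeresis (centralization mark)
    ("ɘ", "ə"),
    ("ɚ", "ə"),
    ("ɝ", "ɜ"),
    ("ɨ", "ɪ"),
    ("əu", "əʊ"),    # same diphthong, two spellings
    ("ai", "aɪ"),
]


def normalize_vowels(transcription):
    """Apply every replacement in order and return the cleaned transcription."""
    for old, new in REPLACEMENTS:
        transcription = transcription.replace(old, new)
    return transcription
```

Plain ordered `str.replace` calls are enough here, but order matters once one rule's output can match a later rule's input. That is one way over-replacement creeps in.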

Later, I discovered some inaccuracies in the dictionary API. For instance, “abacus” was transcribed as /-saɪ/? Nonsense! The information was incomplete.

I calculated this occurred in 0.55% of all words – a small fraction. The incomplete transcriptions seemed random, lacking commonality, so I filtered them out. I’m now analyzing a sample, not the complete data. However, the sample is large enough to be representative, allowing the research to proceed.

Analyzing Phonetic Transcriptions (AI)

This step entails counting syllables from phonetic transcriptions and identifying the stressed syllable using the ˈ mark.

I aimed for a shortcut by deploying an AI model on Kaggle. AI should excel at language, right?

I tested several text-based models but encountered obstacles:

Large models wouldn’t run: Among Kaggle’s deployable open-source models, Llama3 70b could accurately determine syllable count and stress position. ChatGPT, Claude, and even GPT-3.5 could also do it. Language seems to be a strength of large language models. The issue? Kaggle’s free tier can’t run such large models.
Small models were inadequate: Kaggle’s two free T4 GPUs can handle smaller 7b models like Llama3 8b, Gemma 7b, and Qwen2 7b. However, these smaller models, on Kaggle or elsewhere, couldn’t reliably perform the task.

I refined prompts, guiding the AI step-by-step, and provided examples:

<task>
your task is to count how many syllables there are in an English word. list them all then count. finally answer which syllable the stress falls on(tell me the number). answer **EXACTLY** in the example format.
<example>
word: analysis
phonetic transcription: /əˈnælɪsɪs/
syllables:
1. ə
2. 'næ
3. lɪ
4. sɪs
syllables count: 4
stress position: 2
final conclusion: <<<2/4>>>
<word>
analytical /æn.əˈlɪt.ə.kəl/

But the smaller models kept failing. Perhaps they weren’t capable. Phonetic symbols are vastly different from standard English letters, almost a separate, niche language for AI.

This experience highlighted a key point: these open-source small models cluster around 7 billion parameters likely because that’s the upper limit for running on specific GPUs. In this era of constrained computing, GPUs dictate the scale.

Was AI a dead end? I then considered a workaround: Google Sheets with an AI plugin. I could input the phonetic data into Sheets, write a prompt in the adjacent cell (including the word and transcription), and use a formula from an AI plugin to generate the result. This plugin, powered by GPT-3.5, could handle the task. The classic Excel drag-down trick would then populate the entire column.

The plugin’s pricing was reasonable, around 90 RMB for my data volume. However, I was unsure if it could handle tens of thousands of AI generations simultaneously. Debugging and regenerating could double the cost, making it risky.

Analyzing Phonetic Transcriptions (Algorithm)

Okay, no more AI—I’d handle it myself. Counting syllables and locating stress? An algorithm could do that, and more reliably. Here’s the approach, using analytical /æn.əˈlɪt.ə.kəl/ as an example:

1. Create a set of all vowels: ɑaæɒʌəɛeɪiɔoʊuʉɜ
2. Remove slashes, parentheses, spaces, and dots: /æn.əˈlɪt.ə.kəl/ becomes ænəˈlɪtəkəl
3. Iterate through ænəˈlɪtəkəl, checking against the vowel set. Counting vowels: æ, ə, ɪ, ə, ə yields 5 syllables.
4. Split by the stress mark ˈ: ænəˈlɪtəkəl becomes ænə and lɪtəkəl. Use the first part, ænə.
5. Count vowels in ænə as in step 3: 2 vowels.
6. Add 1 to get the stress position: the 3rd syllable.
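The six steps above can be sketched as follows. The function names are mine, not the notebook's, and this first version deliberately counts every vowel character as its own syllable.

```python
# Vowel characters listed in the first step.
VOWELS = set("ɑaæɒʌəɛeɪiɔoʊuʉɜ")
STRESS_MARK = "ˈ"  # primary stress mark


def strip_markup(transcription):
    """Remove slashes, parentheses, spaces, and dots."""
    for ch in "/(). ":
        transcription = transcription.replace(ch, "")
    return transcription


def count_vowels(text):
    """Count characters that belong to the vowel set."""
    return sum(ch in VOWELS for ch in text)


def naive_analyze(transcription):
    """Return (syllable_count, stress_position).

    stress_position is None when the transcription has no stress mark.
    """
    text = strip_markup(transcription)
    before, mark, _ = text.partition(STRESS_MARK)
    stress_position = count_vowels(before) + 1 if mark else None
    return count_vowels(text), stress_position


print(naive_analyze("/æn.əˈlɪt.ə.kəl/"))  # (5, 3)
```

A diphthong such as eɪ comes out as two syllables under this version.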

The logic was clear, so I had AI write the code—a trivial task for it. A few tweaks, and it worked.

A challenge arose in step 3: diphthongs, triphthongs, and long vowels. For ei, the algorithm would count e and i (2 syllables), but ei as a diphthong is only one. Triphthongs would be counted as 3.

The algorithm needed adjustment. I created three vowel sets: monophthongs, diphthongs, and triphthongs. The vowel check now involved three passes:

First pass: Check each character against the monophthong set. This overcounts diphthongs and triphthongs.
Second pass: Check two characters at a time against the diphthong set. If found, subtract 1 from the syllable count. Importantly, skip the next character after a diphthong to avoid miscounting triphthongs like aɪə as aɪ and ɪə.
Third pass: Check three characters at a time against the triphthong set, subtracting 1 if found.

This refined algorithm accurately counted syllables. (Note: I treated the long vowel marker ː as a phonetic character; iː, ɑː are handled as diphthongs, iːə, uːə as triphthongs, which doesn’t affect the outcome.)
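A sketch of the three-pass count is below, under stated assumptions. The diphthong and triphthong sets are illustrative rather than the author's exact ones, and the length mark ː is put in the vowel-character set so that iː counts as two characters and is then reduced to one syllable in the second pass.

```python
# Illustrative sets; the real ones would be tuned against the data.
VOWEL_CHARS = set("ɑaæɒʌəɛeɪiɔoʊuʉɜː")  # includes the length mark
DIPHTHONGS = {"eɪ", "aɪ", "ɔɪ", "aʊ", "oʊ", "əʊ", "ɪə", "eə", "ʊə",
              "iː", "ɑː", "ɔː", "uː", "ɜː"}
TRIPHTHONGS = {"aɪə", "aʊə", "eɪə", "ɔɪə", "əʊə", "iːə", "uːə"}


def count_syllables(text):
    """Count syllables in a cleaned transcription using three passes."""
    # Pass 1: every vowel character counts once (overcounts multi-character vowels).
    count = sum(ch in VOWEL_CHARS for ch in text)

    # Pass 2: each diphthong was counted twice, so subtract 1 and jump past it.
    i = 0
    while i < len(text) - 1:
        if text[i:i + 2] in DIPHTHONGS:
            count -= 1
            i += 2  # skip the second character so aɪə is not read as aɪ + ɪə
        else:
            i += 1

    # Pass 3: a triphthong is still counted as 2 after pass 2, so subtract 1 more.
    for i in range(len(text) - 2):
        if text[i:i + 3] in TRIPHTHONGS:
            count -= 1
    return count


print(count_syllables("ænəˈlɪtəkəl"))  # 5
print(count_syllables("oʊˈeɪsɪs"))     # 3
```

Emptying or shrinking `TRIPHTHONGS` is all it takes to treat sequences like aɪə as two syllables instead.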

It turns out, for data analysis, technique takes a backseat to domain knowledge. Analyzing English requires understanding it. Digging deeper into phonetics, I hit another snag: triphthong identification is incredibly ambiguous. There’s no consensus on whether three vowel symbols together are a triphthong or a monophthong + diphthong. That familiar feeling… Classic English! No rigid rules.

Consider fire /ˈfaɪər/. Some claim aɪə is one syllable; others say it’s aɪ + ə (two syllables). Criteria vary wildly. Some use hyphenation (you can write “fi-” and “re,” but not “fire,” so it’s a triphthong). Others use singing: if sung as one note, it’s a triphthong. In Simple Plan’s “Fire in My Heart,” at 0:57, faɪ and ər are sung as separate notes, so should it count as a diphthong + monophthong?

Oh well, that’s English. Given words like oasis /oʊˈeɪsɪs/ (four vowels!), with oʊ and eɪ clearly separated by the stress mark (obviously two diphthongs), I disregarded triphthongs, treating them as two syllables. The only remaining “triphthongs” were diphthongs with a long vowel marker.

Besides syllable count and stress position, I wanted the stressed vowel itself, potentially for further analysis.

This was trickier. I discussed it with AI, revealing significant model differences. Gemini 1.5 Flash went in circles. GPT-4o provided the correct code in three conversational rounds (about 10 minutes). Claude 3.5 Sonnet got it right immediately. For coding, a good model is worth the cost, though basic code literacy is essential to understand the AI’s code, its functionality, and potential issues.

Here’s the logic, again with analytical /ænəˈlɪtəkəl/:

Locate the stress mark ˈ and consider the subsequent part: lɪtəkəl.
Iterate, removing non-vowels until the first vowel: ɪtəkəl.
The first character is now a vowel. Check the first 3 characters (ɪtə) against the triphthong set. Nope.
Check the first 2 (ɪt) against the diphthong set. Nope.
Check the first character (ɪ) against the monophthong set. Found it! That’s the stressed vowel.
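The lookup above can be sketched like this. It is my own minimal version, not the code the models produced, and the vowel sets are again illustrative assumptions.

```python
VOWEL_CHARS = set("ɑaæɒʌəɛeɪiɔoʊuʉɜ")
# Illustrative sets, longest candidates are tried first.
DIPHTHONGS = {"eɪ", "aɪ", "ɔɪ", "aʊ", "oʊ", "əʊ", "ɪə", "eə", "ʊə",
              "iː", "ɑː", "ɔː", "uː", "ɜː"}
TRIPHTHONGS = {"iːə", "uːə"}


def stressed_vowel(text):
    """Return the vowel that follows the primary stress mark, or None."""
    _, mark, after = text.partition("ˈ")
    if not mark:
        return None
    # Drop leading consonants until the first vowel character.
    start = next((i for i, ch in enumerate(after) if ch in VOWEL_CHARS), None)
    if start is None:
        return None
    rest = after[start:]
    # Try three characters, then two, then fall back to the single vowel.
    for width, candidates in ((3, TRIPHTHONGS), (2, DIPHTHONGS)):
        if rest[:width] in candidates:
            return rest[:width]
    return rest[0]


print(stressed_vowel("ænəˈlɪtəkəl"))  # ɪ
print(stressed_vowel("oʊˈeɪsɪs"))     # eɪ
```

Checking the longest candidate first matters: testing single characters first would return e for eɪ.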

The data table after phonetic analysis. All necessary data was now collected.

Visualization

Now for the highlight—not just for deriving useful conclusions, but also because AI shines here. AI is excellent at writing Python visualization code. These tasks are less about reasoning and more about knowing the visualization library’s syntax. Even Gemini 1.5 Flash, a non-flagship model I use daily, performs well. I haven’t formally learned Seaborn and Matplotlib, but with AI, generating plots is straightforward.

Of course, “straightforward” doesn’t mean “ask and receive.” Giving AI a vague request without context leads to failure. I crafted a Python visualization prompt detailing the task and the data table’s structure, so the AI could work reliably and at full strength.

<Task>
You are a Python data visualizer. You excels at coding with data visualization libraries like Seaborn and Matplotlib. I will tell you about the structure of a Pandas dataframe and the visualization I want. First, you dive deeply into the dataframe and understand what it is all about. Then write Python code to visualize it. Just code, no explanation. Next, you check if the code meets my need. Finally, correct the code if necessary.
<Dataframe>
The dataframe(variable name is df) is {a list of common English words with their phonetic information and part-of-speech}.
Now here are the columns of the dataframe, exactly in the following order:
**word**
- datatype: str
- example: complimentary
- description: the English words
**phonetic**
- datatype: str
- example: /ˌkɒmplɪ̈ˈment(ə)ɹɪ/
- description: the phonetic transcription of the words
**part_of_speech**
- datatype: str(list like)
- example: ['adjective']
- description: how are these words used in sentences
**syllable_len**
- datatype: int
- example: 5
- description: how many syllables are there in these words
**stress_pos**
- datatype: int
- example: 3
- description: on which syllable the stress falls on, if there are more than one stress, this is the position of the first stress
**stress_syllable**
- datatype: str
- example: e
- description: the vowel of the stressed syllable
<Request>
I want to know the distribution of stress position, grouped by syllable numbers.

To use the prompt, just tweak the <Request> section.

Some words in the data lack stress marks because they’re short, and their phonetic transcriptions don’t show stress. Let’s filter those out, along with one-syllable words – analyzing stress in those is pointless.

This leaves 24,433 words with complete data.

Syllable Count Analysis

Let’s break down the syllable counts of these 24,433 words.

Unsurprisingly, fewer syllables mean more words. Languages tend to use up short, easy words first.

Two-syllable words make up 48.7%, three-syllable words 31.3%.

Words with four or fewer syllables make up 94.73%; five or fewer, 99%.
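The filtering and the shares quoted above can be sketched as follows. The rows here are plain dicts using the `syllable_len` and `stress_pos` column names from the prompt; the helper names are hypothetical, and the notebook itself works on a Pandas dataframe.

```python
from collections import Counter


def has_usable_stress(row):
    """Keep words that have a stress mark and more than one syllable."""
    return row["stress_pos"] is not None and row["syllable_len"] > 1


def syllable_shares(syllable_counts):
    """Return {syllable_count: (share, cumulative_share)}, both in percent."""
    counts = Counter(syllable_counts)
    total = sum(counts.values())
    shares = {}
    running = 0
    for n in sorted(counts):
        running += counts[n]
        shares[n] = (100 * counts[n] / total, 100 * running / total)
    return shares


# Toy rows, not real data.
rows = [
    {"word": "cat", "syllable_len": 1, "stress_pos": None},
    {"word": "analyze", "syllable_len": 3, "stress_pos": 1},
    {"word": "about", "syllable_len": 2, "stress_pos": 2},
    {"word": "happy", "syllable_len": 2, "stress_pos": 1},
    {"word": "analysis", "syllable_len": 4, "stress_pos": 2},
]
kept = [r for r in rows if has_usable_stress(r)]
print(syllable_shares(r["syllable_len"] for r in kept))
```

The cumulative share is what figures like “four or fewer syllables” read off directly.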

The longest word has 11 syllables.

“Antidisestablishmentarianism”? Really? Opposition to opposition – double negative much? No wonder it’s so long. Could I just add “non-” to create “nonantidisestablishmentarianism”?

Syllable Count vs. Stress Position

Statistically, the correlation coefficient between syllable count and stress position is 0.67, a pretty decent correlation.

This coefficient ranges from -1 to 1. Near 0 means almost no relationship; near 1, positive correlation (one up, other up); near -1, negative correlation (one up, other down).

This is just a first step, showing they’re not unrelated. It doesn’t explain why.
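For reference, here is a small sketch of how such a coefficient is computed, assuming it is the Pearson coefficient (the default in Pandas). The sample numbers are made up.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance and variances, left unnormalized because the 1/n factors cancel.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up (syllable count, stress position) values, not the real dataset.
syllable_len = [2, 3, 4, 5]
stress_pos = [1, 2, 2, 3]
print(round(pearson(syllable_len, stress_pos), 2))  # 0.95
```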

A bubble chart helps. Syllable count is on the y-axis, stress position on the x-axis, and bubble size/color shows the word count. The dots roughly follow a diagonal – more syllables, later stress.

Bubble charts (or heatmaps) show three dimensions but compare absolute word counts. I care more about stress position distribution within each syllable count.

Here’s a stacked bar chart: syllable count on the y-axis, stress position on the x-axis. Now it’s clear: stress shifts right like a wave, clustering around the third-to-last syllable.
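The numbers behind a chart like this are a per-row normalized table: for each syllable count, the share of words at each stress position. A minimal sketch, with a hypothetical function name and toy input:

```python
from collections import Counter, defaultdict


def stress_distribution(pairs):
    """Map syllable_len -> {stress_pos: proportion within that syllable_len}.

    `pairs` is an iterable of (syllable_len, stress_pos) tuples.
    """
    by_len = defaultdict(Counter)
    for syllable_len, stress_pos in pairs:
        by_len[syllable_len][stress_pos] += 1
    return {
        n: {pos: c / sum(counter.values()) for pos, c in sorted(counter.items())}
        for n, counter in sorted(by_len.items())
    }


# Toy input: three 2-syllable words and one 3-syllable word.
print(stress_distribution([(2, 1), (2, 1), (2, 2), (3, 1)]))
```

Normalizing within each syllable count is what separates this view from the bubble chart, which compares absolute word counts.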

Stressed Syllable Analysis

These are all the vowels in stressed syllables. A couple shouldn’t be here, but it’s a dictionary error, and too few to matter.

By frequency, louder vowels like æ and e are more likely stressed; weaker ones like ə and ʊ are less common.

Part of Speech Analysis

Is there a link between part of speech and stress?

All parts of speech: ['adjective', 'adverb', 'conjunction', 'interjection', 'noun', 'numeral', 'preposition', 'pronoun', 'propernoun', 'verb']

Here’s a breakdown of all parts of speech. I’m not sure what “propernoun” is – it’s not in my dictionary either. It turns out there are only two, and they don’t seem to fit, so I suspect a data glitch with the dictionary API. I’ll skip it for now.

I ranked the parts of speech by frequency. The big ones are nouns, verbs, adjectives, and adverbs. Nouns account for roughly half the total.

This gets you thinking about how language evolved. First, you need to describe the world and create concepts – that’s where nouns come in. Then, to describe how people and things interact, you need verbs. After that, adjectives and adverbs develop to modify nouns and verbs. So, my guess is the number of words should follow that order.

But wait – shouldn’t the ratio of nouns to adjectives, and verbs to adverbs, be roughly the same? No need to calculate. The bar chart makes it obvious: nouns are more than double the adjectives, and verbs outnumber adverbs almost nine to one. They’re way out of proportion.

['abracadabra', 'absolutely', 'action', 'adieu', 'adios', 'affirmative', 'afternoon', 'ahem', 'alack', 'aloha', 'alright', 'amen', 'amidships', 'arrivederci', 'attaboy', 'attention', 'away', 'banzai', 'bastard', 'beauty', 'begone', 'begorra', 'behold', 'blazes', 'bollocks', 'bonjour', 'bother', 'botheration', 'brother', 'bully', 'bullseye', 'bullshit', 'caramba', 'checkmate', 'cheeses', 'condolences', 'congrats', 'congratulations', 'content', 'cooee', 'curses', 'dammit', 'ecce', 'egad', 'enchanted', 'encore', 'enough', 'eureka', 'exactly', 'farewell', 'fiddlesticks', 'flummery', 'gadzooks', 'gesundheit', 'goddamn', 'goodbye', 'gorblimey', 'gracias', 'gracious', 'greetings', 'hallelujah', 'hardly', 'havoc', 'heavens', 'heyday', 'hola', 'holla', 'honestly', 'hooray', 'hosanna', 'howdy', 'hullo', 'hurrah', 'huzzah', 'yeah', 'indeed', 'knickers', 'later', 'mercy', 'morepork', 'morning', 'namaste', 'negative', 'nonsense', 'oyez', 'okay', 'ole', 'pardon', 'peccavi', 'period', 'pity', 'pleasure', 'presto', 'prithee', 'prosit', 'quiet', 'rather', 'really', 'respect', 'result', 'roger', 'rumble', 'sayonara', 'scramble', 'selah', 'shabash', 'shazam', 'silence', 'sorry', 'standard', 'sugar', 'tally', 'tara', 'tarnation', 'tidy', 'timber', 'uncle', 'understood', 'viva', 'vivat', 'voetsek', 'warning', 'welcome', 'whammo', 'whatever', 'wilco', 'wirra', 'zowie']

I listed all the interjections out of curiosity. I don’t usually give this part of speech much thought, so I took a closer look. Surprisingly, “afternoon” is also classified as one! Which makes sense, since it’s a greeting.

['abaft', 'abeam', 'aboard', 'about', 'above', 'abreast', 'abroad', 'absent', 'across', 'afore', 'after', 'again', 'against', 'agin', 'along', 'alongside', 'aloof', 'alow', 'amid', 'amidst', 'among', 'amongst', 'anent', 'anti', 'around', 'asprawl', 'astraddle', 'astride', 'athwart', 'barring', 'bating', 'because', 'before', 'behind', 'beyond', 'below', 'beneath', 'beside', 'besides', 'between', 'betwixt', 'circa', 'concerning', 'considering', 'contra', 'despite', 'during', 'except', 'excepting', 'failing', 'following', 'forby', 'froward', 'given', 'including', 'inside', 'into', 'minus', 'modulo', 'nearer', 'nearest', 'onto', 'opposite', 'outwith', 'pending', 'regarding', 'regardless', 'respecting', 'rising', 'running', 'saving', 'thorough', 'throughout', 'touching', 'toward', 'towards', 'under', 'underneath', 'unlike', 'until', 'upon', 'upside', 'versus', 'wanting', 'within', 'without']

When listing out prepositions, I noticed some recurring prefixes:

a- indicating location or spatial relationship: aboard, across, amid, around
be- (basically be): before, behind, below, beside

Next, I created heatmaps for each part of speech. The y-axis shows syllable count, the x-axis shows stress position, and color intensity represents the proportion of words for each syllable count. I only included parts of speech with over 1% of the total words, as others had too few to be significant.

Stress tends to shift towards the end as syllables increase. The difference between parts of speech isn’t huge, but it’s there. For longer words (5+ syllables), adjectives often have stress on the antepenultimate (third-to-last) syllable, nouns tend to have stress further back, and verbs/adverbs have stress further forward.

Rules of Stress Position

It was time to test my hypothesis.

I analyzed 4- and 5-syllable words, adding a column showing the difference between the actual and hypothesized (third-to-last) stress positions. A ‘0’ means a match, ‘1’ means one syllable later, ‘-1’ one syllable earlier, etc.

The hypothesis held for 43.9% of the words.

This bar chart shows the stress deviation. The largest group of words follows the rule, with many shifted by one syllable. Very few are further off. It kind of looks like a normal distribution (but I’m no stats expert).

Then I wondered: could this be generalized? Does it apply to words with more than five syllables? I broadened the filter to include all words with over 3 syllables:

43.92% fit. Almost no change.

The deviation pattern remained. The largest share of words is stressed on the antepenultimate syllable, and many more on the penultimate. Combined, they account for 78.84%. It’s not a perfect fit, but the general trend is confirmed.
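The deviation column is simple arithmetic: the hypothesized position is the third-to-last syllable, `syllable_len - 2`, and the deviation is the actual position minus that. A sketch with made-up data and hypothetical function names:

```python
from collections import Counter


def stress_deviation(syllable_len, stress_pos):
    """Actual stress position minus the hypothesized antepenultimate position."""
    expected = syllable_len - 2  # third-to-last syllable, counting from 1
    return stress_pos - expected


def deviation_shares(pairs, min_syllables=4):
    """Percentage of words per deviation value, for words of min_syllables or more."""
    devs = [stress_deviation(n, p) for n, p in pairs if n >= min_syllables]
    counts = Counter(devs)
    return {d: 100 * c / len(devs) for d, c in sorted(counts.items())}


# Made-up (syllable_len, stress_pos) pairs; the 3-syllable word is filtered out.
shares = deviation_shares([(4, 2), (4, 3), (5, 3), (5, 2), (3, 1)])
print(shares)                 # {-1: 25.0, 0: 50.0, 1: 25.0}
print(shares[0] + shares[1])  # antepenultimate + penultimate: 75.0
```

A deviation of 0 is a match, and 0 and 1 together give the combined antepenultimate-plus-penultimate share.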

Conclusion

Here’s a recap of the findings regarding phonetics and stress:

Fewer syllables mean more words.
Words with 5+ syllables are rare in everyday use.
The longest word found has 11 syllables.
Stress generally shifts towards the end in longer words.
Louder vowels are more likely to be stressed.
Part of speech has a minor effect on stress.
Most long words are stressed on the antepenultimate or penultimate syllable (78.84%).

Afterword

Five minutes of analysis, two hours of data prep – seriously.

Visualization took only half a day. Data preparation, especially fetching phonetic transcriptions via the dictionary API, took the longest. The script ran on and off for over two weeks; I even finished writing this before the dictionary lookup was done, using placeholders for the data.

I’m happy the results confirmed my hypothesis. After this, I doubt I’ll ever forget English stress rules – it’s my own research, after all.

This project refreshed my Pandas skills, taught me batched requests and incremental saving, showed me how to integrate AI into analysis, helped me write effective Python data visualization prompts, and deepened my understanding of English phonetics. A huge win, and totally worth it!

Thanks to:

Word data source: This 300k+ word list was the base of my analysis.
Free Dictionary API: This provided an inexpensive way to get phonetic transcriptions.
Gemini 1.5 Flash: Helped with about half the data prep and all the visualizations.
GPT-4o: Helped accurately ID vowels in stressed syllables.

The full analysis and code are open-sourced on Kaggle. Check it out if you’re interested:

https://www.kaggle.com/code/victorcheng42/stress-distribution-of-english-words

The dataset with phonetic transcriptions, syllable counts, and stress positions is also public. It might be useful for other analyses:

https://www.kaggle.com/datasets/victorcheng42/english-words-with-stress-position-analyzed

We Only Learn the Intersection of Two Languages

hi@victor42.work (Victor42) — Tue, 17 Jan 2023 15:09:00 +0000

I was looking at the word “stem” the other day, and it got me thinking. Language learning can be a breeze, or it can be a real head-scratcher. We don’t learn the whole language; we learn the overlap between it and our native tongue.

Take “stem,” for example. In the Cambridge Dictionary, as a noun, it’s usually a plant’s stem or a wine glass stem. Basically, the central supporting structure.

As a verb, it means to stop something bad from spreading, or more literally, to stop a flow, like stemming bleeding.

There are other, rarer meanings, but let’s put those aside.

Native Chinese speakers might be thinking, “Ugh, another one of those words? Multiple, seemingly unrelated meanings?”

But, in my experience, when English words seem odd, it’s usually us missing something. There’s probably a historical link we don’t grasp because of our cultural background.

So, let’s think in English. If “stem” is the main support, could it apply to a wind turbine? It kinda looks like a wine glass, right?

The tower supports everything above, and there’s a base. Seems like a slam dunk. Can we call the tower a “stem”?

Nope. Searching “wind turbine” and “stem” mostly turns up STEM (Science, Technology, Engineering, and Math) education. So, no dice.

Okay, back to biology. Can a mushroom’s stalk be a “stem”?

Yep! It can also be a “stalk,” but the point is, native English speakers do see “stem” as a support, and the meaning stretches.

So, how are the noun and verb connected? I hit up an etymology site. I also found a less common meaning: a ship’s bow. This nautical term, though obscure, is the key.

Here’s the gist from the etymology site. The image below should be pretty self-explanatory.

The noun comes from Proto-Germanic, with relatives in Old Saxon, Old Norse, Danish, Swedish, etc. It goes back to the Proto-Indo-European root *sta-, meaning “to stand,” or “be firm.” “Stable” might be a cousin. It evolved to mean “support,” like a plant stem. The wine glass stem sense popped up around 1835.

The verb form has nautical roots. In the early 1300s, it meant “to withstand” in Nordic languages, like withstanding waves. For a ship, that’s like “staying stable.” By the late 1300s, it meant both the bow and to point the bow. Makes sense: a ship’s bow must be angled to handle waves and stay steady.

So, “stem” (main structure) and “stem” (to stop) connect through “staying stable.” The verb isn’t about totally wiping out something bad, but holding the line and preventing things from getting worse. Think: “stem the rise in violent crime,” “stem the tide of resignations,” “stem the bleeding” (you can’t entirely “stop” blood flow).

Two seemingly unrelated concepts in Chinese might be one idea for English speakers. Ask them why the word has two meanings, and they might look at you funny: “It’s just one meaning!” They’re not mashing together two Chinese concepts, but grasping a concept that’s absent in Chinese.

Here’s the thing: when we learn a foreign language, we map its concepts onto our own. The ones that match up fall into the overlap, and we think we’ve got it. The ones that don’t match, the ones outside our native language’s scope, stay out of reach. We only learn the overlap.

To really get to native-like fluency, we have to venture beyond that overlap, into the foreign language’s turf, and wrestle with concepts that don’t exist in our mother tongue. Many “issues” in the overlap might not even be issues in the foreign language’s world. Stepping into that world isn’t rocket science, but it takes serious effort, and there are no shortcuts.

Back to the mushroom: its stalk can be a “stem” or a “stalk.” What’s the deal?

Dictionaries show their Chinese translations are pretty much the same. In biology, there’s a slight difference:

But “stalk” is also a verb, with a totally different meaning. There’s probably another rabbit hole there, like with “stem.” I haven’t gone down it yet, so feel free to fill me in.

Anyway, that’s the lowdown on language learning. Trying to go deep in a foreign language is like Usain Bolt suddenly finding himself underwater – going 1 m/s might be a struggle.

Why isn't there a word for "ten thousand" in English?

hi@victor42.work (Victor42) — Sun, 19 Apr 2020 12:06:00 +0000

Consider this about number units: English separates large numbers with commas, advancing in thousands (million, billion, trillion). There’s no single word for “ten thousand.” Chinese, however, uses units of ten thousand (万, 亿, 兆…). Chinese speakers do say 百万 (“million,” literally “hundred ten-thousands”) more often now, but that’s recent, a result of handling larger figures. It is a combination, not a base unit like 万 (“ten thousand”).
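To make the two groupings concrete, here is a small illustrative sketch (the function name is mine). It splits the same number into groups of three digits, English style, and four digits, matching the 万/亿 units.

```python
def group_digits(n, size):
    """Split a non-negative integer into comma-separated groups of `size` digits."""
    digits = str(n)
    parts = []
    while digits:
        parts.append(digits[-size:])  # take the rightmost group
        digits = digits[:-size]
    return ",".join(reversed(parts))


print(group_digits(1234567890, 3))  # 1,234,567,890  (billion, million, thousand)
print(group_digits(1234567890, 4))  # 12,3456,7890   (12 亿, 3456 万, 7890)
```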

It’s curious. We have distinct words for smaller place values: ones, tens, hundreds, thousands, ten thousands (个, 十, 百, 千, 万). Why not for larger numbers? We simply didn’t need them! Daily life, particularly in ancient times, rarely required such large numbers.

Rulers, however, dealt with massive figures. Inventing a word for every place value would be impractical. The solution? Use the largest common unit as a base. This avoids new concepts and simplifies comparisons. Within the same order of magnitude, the specific unit is less important. Large differences are clear from the unit, and smaller ones remain manageable.

This hints at a difference in scale between the ancient Chinese and English-speaking worlds, reflected in geography, population, and agriculture. It’s well-known, but it might be the core reason for the East-West difference in number units today.

The Texting Experience

hi@victor42.work (Victor42) — Sun, 11 Dec 2016 01:20:29 +0000

Image from Dribbble.

The title might be misleading. I’m not discussing the UX of messaging apps, but the reading experience of chat content.

Number OCD

My supervisor recently asked for my phone and ID numbers for some paperwork. We were chatting on DingTalk. I replied “OK” and sent:

×× (My Name) Phone: 186×××××××× ID: 360103××××××××××××

I stared at the message and thought I could do better. So, I resent it:

×× (My Name) Phone: 186 ×××× ×××× ID: 360 103 ×××× ×××× ××××

I mentioned it was easier to read. My supervisor quipped, “OCD kicking in again, eh?” I replied, a bit pretentiously, “User experience is everywhere,” followed by a grinning emoji.

That was that. But since I’d mentioned UX, I figured I’d explore it further. It wasn’t just me being nitpicky. My initial message wasn’t exactly user-friendly.

Formatting a reply is a design task, tied to the user and goal. The user was clear: my supervisor on DingTalk mobile. But the goal? I hadn’t asked. She needed the numbers for documents, but she wouldn’t be preparing them herself. She’d pass the info along. How? Jot it down or forward it? That’s a big difference!

Writing it Down

If she was writing it down, it’d likely be the old-fashioned “read-memorize-write” method. I can’t control the writing, but the reading and memorizing depend on my formatting.

Research suggests people can only remember about 7 digits at a time. Anything longer needs chunking. We remember five- and seven-character poems. Nine-character poems exist, but they’re rare. Qu Yuan’s Li Sao is an exception, but even there, most meaningful content stays within 7 characters, thanks to the modal particle “兮”.

Seven is the limit, though, not ideal. Think about verification codes: usually 4 or 6 digits. We can recall 4 digits easily, but 6-digit codes get broken into 3+3. This suggests the sweet spot for easy recall is under 6 characters. China’s 11-digit phone numbers are commonly read as 3+4+4. We say, “Call my 186 number.” Online, the middle 4 digits are often masked. We also tend to remember the last 4 digits. This shows how ingrained this grouping is. ID numbers aren’t usually split up visually, but they have inherent sections: 6 (region) + 4 (year) + 4 (month/day) + 4 (last four). That’s likely how most people memorize them.

As an aside, is the magic number 4 or 5? I lean towards 4, though I lack hard proof. But the examples above hint at it. Bank card numbers, too: different lengths, but when deliberately grouped, they never exceed 4 digits per chunk.

However you format a long number string, that’s how the recipient will read and memorize it. We should all offer this courtesy to each other.
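The groupings above (3+4+4 for phone numbers, 6+4+4+4 for ID numbers) are easy to apply mechanically. Here is a minimal Python sketch. The function name `chunk` and both sample numbers are made up for illustration.

```python
from typing import Sequence


def chunk(digits: str, sizes: Sequence[int], sep: str = " ") -> str:
    """Split a digit string into groups of the given sizes, joined by sep."""
    if sum(sizes) != len(digits):
        raise ValueError("group sizes must add up to the string length")
    groups = []
    start = 0
    for size in sizes:
        groups.append(digits[start:start + size])
        start += size
    return sep.join(groups)


# Made-up numbers, for illustration only.
phone = "18600001111"             # 11 digits
id_number = "360103123456789012"  # 18 digits

print(chunk(phone, [3, 4, 4]))         # 186 0000 1111
print(chunk(id_number, [6, 4, 4, 4]))  # 360103 1234 5678 9012
```

Passing the group sizes explicitly keeps the choice of grouping with the sender, which is where this section argues it belongs.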

Forwarding on Mobile

Back to the point. If the numbers were to be forwarded and copied into a system, things change entirely.

I couldn’t know if the system handled spaces. Pasting the “easy-read” format into a field that accepts only 11 characters might leave just “186 ×××× ××”. Also, my supervisor, on Android, couldn’t use clipboard tools like Pin. Extracting the numbers would be a hassle.
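To make the risk concrete, here is a small Python sketch. It assumes a hypothetical form field that silently cuts input at 11 characters. I don’t know how the real system behaved, so both helper functions and the sample number are invented for illustration.

```python
def strip_separators(text: str) -> str:
    """Keep only letters and digits, dropping spaces and other separators."""
    return "".join(ch for ch in text if ch.isalnum())


def paste_into_field(text: str, max_length: int) -> str:
    """Mimic a field that silently truncates input at max_length characters."""
    return text[:max_length]


readable = "186 0000 1111"  # made-up number in the easy-read format

print(paste_into_field(readable, 11))                    # 186 0000 11
print(paste_into_field(strip_separators(readable), 11))  # 18600001111
```

The sender can strip the spaces in a second. The recipient on a phone often can’t, which is why the unspaced form is the safer one to send when the text is going to be pasted.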

Mobile IM often forces you to copy the entire message.

So, for copying from IM, the best format is:

Name: ×× Phone: 186×××××××× ID: 360103××××××××××××

This reminds me of my WeChat public account. I mostly just post articles, so I set up an auto-reply directing people to my Weibo.

For a while, the auto-reply just said: “I don’t check this account often. Contact me via private message on Sina Weibo: @我_ColaChan.”

Then I messaged myself. It was a pain to copy just the nickname. So I changed it: “I don’t check this account often. Contact me on Sina Weibo. Reply ‘Weibo’ for my username.” Replying “Weibo” triggered a message with just “@我_ColaChan”.

An extra step, but much easier to extract the information.

Eliminating Typos

Text chat isn’t just about numbers. Everyday conversation is key. What defines “good” or “bad” here?

In middle school, we didn’t have cell phones. We chatted on QQ via computer. A classmate once said chatting with me was reassuring. Why? Because I never made typos.

Thinking back, it’s true. Life’s faster now, and with auto-suggestions, typos happen. But attitude matters. I proofread my messages and always fix typos.

Many people don’t check their messages. They don’t check after typing, or even during. They just fire off a message. Even if they spot a mistake, they often can’t be bothered to fix it, assuming the other person will get it. This leads to gibberish like “Enai” (should be “En Ai,” meaning “love”) or “Bu hui ni o” (should be “Bu hui you ni o,” meaning “won’t have you”). Misspelled keywords require serious guesswork, even considering homophones and keyboard layouts. I’ve dealt with printers and developers whose messages are incredibly hard to decipher. Sure, being busy is understandable. But typo-free chat is a better experience for everyone.

Language is Serious

I’ve gotten messages like this before, a jumbled mess:

are you there Does UI need hand-drawing? Can’t do it without hand-drawing? No response to resume Is it not enough experience What are the ui specifications do I need to look at both ios andriod Are you there are you there、 What to do without a portfolio?

That’s not verbatim, but it captures the essence. Missing punctuation, misused punctuation, spaces instead of commas, extra spaces, mixed Chinese and English punctuation, misspelled words, misused words, no clear topics… It’s a catalog of common communication errors.

Language’s main purpose is communication. It’s the agreed-upon system for expressing concepts. Ignoring language norms is like disconnecting from that system. It’s a big deal. Even in casual texts, I think it’s important to use “的,” “地,” and “得” correctly. These details are often overlooked. It’s not about language purity; it’s about making things easier for the reader. Standard language helps.

The Mindset of Writing a Press Release

Think of your messages like press releases. Unless you’re just shooting the breeze with a close friend, there’s usually a point.

The jumbled message above, besides being imprecise, suffers from scattered topics. How do you even answer that? If you’re confused and need help, write a clear request for help. The example above isn’t even an outline.

Clear messages have structure. Start with a sentence stating the topic, then elaborate, point by point. If you’re informing someone, state the key facts. If you need something, explain why, and ideally, offer a solution. If you’re reporting a problem, give enough details for troubleshooting.

When friends ask for computer help, they often just say, “My computer’s broken, help!” And then they wait for me to ask questions. I wish just once someone would proactively tell me the error message, whether it’s happened before, when it started, what they did before and after, what they tried, and what the results were.

Imagine a robbery. The police arrive, and the victim just keeps saying, “I’ve been robbed! Catch the thief!” The case won’t get solved.

Focused conversations are efficient. A ten-minute explanation can drag on for an hour due to poor communication. Wasting someone’s time is a cardinal sin.

Modern Big-Character Posters

China has a thing for slogan banners and posters. For urban planning: “Gather all forces, plan water management, build a harmonious city, promote the water town image, and establish a legacy.” For construction safety: “Safety creates happiness, negligence brings pain. Safety is efficiency, safety is happiness.” For hospitals: “Create a safe hospital, build harmonious doctor-patient relations.”

Let’s not even get into the slogans themselves. The point is, the people behind these didn’t consider their audience or tone. A slogan near a military area was actually good: “Obey the Party’s command, be able to win battles, and have a good work style.” It’s hierarchical and logical. Most importantly, it’s clear and unambiguous.

News and official outlets use vague language to be inclusive and cover all bases. But this isn’t just a media thing. We’ve all encountered people who write in an overly formal or flowery style at work. Think of those landing pages: a confusing illustration with shopping carts and money flying everywhere, and text like, “Enjoy endless discounts.”

I remember one ad clearly. I forget the brand, but it showed traditional soy sauce making. The spokesperson, standing by a field of drying soybeans, said plainly, “Just dry it here, just rely on the sun.” A less direct approach might have been: “XX hectares of soybean processing, natural air-drying.” That’s uninspiring. No matter how accurate or fancy, it lacks imagery.

You can see the ad’s directness here: http://t.cn/RcxcZ3I.

It reminds me of a joke with my classmates:

“The rolling Yangtze River flows eastward…”

“Get to the point!”

“The river flows east!”

Topic Guardian

After I started working, someone commented on my chat style again, saying I was “chatting as if my life depended on it.” They explained that with others, it’s a back-and-forth. With me, they’d see me “typing” for ages, sometimes over ten minutes. They’d return from getting water to find a massive, multi-paragraph message from me, addressing every tangent from the earlier conversation.

So, I do have that habit! I don’t let topics die; I need closure. I can see how this would be tiring in casual chats. I don’t want to be like this. I’d prefer to stick to one thing. But once the conversation derails, even if it’s not my fault, I feel compelled to keep it going. If the other person is fine with this style, I’m the one who ends up exhausted.

Long messages have pros and cons. The downside is making people wait. But the upside is preventing further tangents. If, mid-reply, something reminds the other person of something else, they might interrupt, creating more branches. It’s very common.

This is a dilemma. Wasting time is bad, so shouldn’t I avoid long waits? But if I don’t control the topics, forgotten points might need revisiting later, which also wastes time. In text chat, prioritizing the other person’s experience means choosing the less time-consuming option. Letting topics explode seems like a lesser evil.

Oh, Hehe, [Grin]

These are the worst replies, the ultimate conversation killers. Why? They’re short and meaningless, yes. But the real problem is they don’t reflect the sender’s state. They replied, but didn’t actually respond. You don’t know if they understood; they could have been typing those replies mindlessly.

It’s like sending an email with no loading indicator or confirmation. The compose window stays open. You close it, and the email’s nowhere: not in Sent, Outbox, Drafts, Inbox, Junk, or Spam. WTF?

It sounds ridiculous, but these conversations happen all the time. An Android developer asked me for an asset. I asked how he planned to use it – fixed size or a 9-patch (.9.png)? He replied, “Okay.” I thought he’d hit send accidentally. But after 30 seconds, not even a “typing” indicator. The topic died, forcing me to start a new round of questions.

Tech people should understand feedback. The TCP three-way handshake is a prime example. The client tells the server: “I want to connect.” The server replies: “Is this what you sent? Is it you?” The client confirms: “Yes, it’s me, let’s connect.”
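For the curious, that exchange can be sketched as a toy model in Python. This is not a real network stack: the `Segment` class and the starting sequence numbers are invented for illustration, and real TCP picks its initial numbers itself and wraps them at 32 bits. What it shows is that each reply echoes the other side’s number plus one, which proves the message actually arrived.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Segment:
    flags: str                 # "SYN", "SYN-ACK" or "ACK"
    seq: int                   # sender's own sequence number
    ack: Optional[int] = None  # next number expected from the peer


def three_way_handshake(client_isn: int, server_isn: int) -> List[Segment]:
    """Return the three segments exchanged to open a connection (toy model)."""
    # 1. Client: "I want to connect."
    syn = Segment("SYN", seq=client_isn)
    # 2. Server: acknowledges the client's number + 1 and offers its own number.
    syn_ack = Segment("SYN-ACK", seq=server_isn, ack=syn.seq + 1)
    # 3. Client: acknowledges the server's number + 1.
    ack = Segment("ACK", seq=syn.seq + 1, ack=syn_ack.seq + 1)
    return [syn, syn_ack, ack]


for segment in three_way_handshake(client_isn=100, server_isn=300):
    print(segment)
```

Neither side treats silence as agreement. A bare “Okay” in chat is the equivalent of an acknowledgment that echoes nothing back.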

Humans are good at context, machines less so. But even with context, clear feedback is crucial. At the very least, reply with “OK” or “Received.” If there’s a choice, repeat the chosen option.

Conclusion

Looking back at my IM interactions, there’s a clear divide. Some people are a breeze to communicate with; others make you want to just call. The same task, expressed differently in text, leads to vastly different experiences.

Experience design is everywhere, and it’s practical. Strip away the methodologies, and you’re left with one core principle: Put yourself in the other person’s shoes.