<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Language on Victor42</title><link>https://victor42.eth.limo/tags/language/</link><description>Recent content in Language on Victor42</description><generator>Hugo -- gohugo.io</generator><language>en</language><managingEditor>hi@victor42.work (Victor42)</managingEditor><webMaster>hi@victor42.work (Victor42)</webMaster><lastBuildDate>Fri, 05 Jul 2024 22:33:00 +0000</lastBuildDate><atom:link href="https://victor42.eth.limo/tags/language/index.xml" rel="self" type="application/rss+xml"/><item><title>I Did a Deep Dive into English Word Stress...</title><link>https://victor42.eth.limo/post-en/3651/</link><pubDate>Fri, 05 Jul 2024 22:33:00 +0000</pubDate><author>hi@victor42.work (Victor42)</author><guid>https://victor42.eth.limo/post-en/3651/</guid><description>&lt;img src="https://cdn.victor42.work/posts/2024-07/ea6d9ff8fee7f0f2477d458be8c4a952.jpg" alt="Featured image of post I Did a Deep Dive into English Word Stress..." /&gt;&lt;p&gt;&lt;strong&gt;Target audience: English learners, data analysis enthusiasts, Python coders, and my friends.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;This is my first data analysis project. I&amp;rsquo;ve been teaching myself data science for over a year, picking up skills along the way, but I hadn&amp;rsquo;t tackled a real-world project. During my studies, the words &amp;lsquo;analyze,&amp;rsquo; &amp;lsquo;analysis,&amp;rsquo; and &amp;lsquo;analytical&amp;rsquo; kept appearing. The stress placement is unpredictable (&amp;#39;analyze, a&amp;#39;nalysis, ana&amp;#39;lytical) – a real headache! It turned reading into a tongue-twisting exercise.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.victor42.work/posts/2024-07/70c28efdcd37e6d4a143ff2df66084be.jpg"
loading="lazy"
alt="Four cognate English words Analyze, Analyst, Analysis, and Analytical with stress positions marked by apostrophes, showing stress shifting from the first syllable progressively backward"
&gt;&lt;/p&gt;
&lt;p&gt;Some claim there are rules for stress, but they&amp;rsquo;re often lengthy and complex. Others say there are too many exceptions. However, even with those three words, a pattern &lt;em&gt;does&lt;/em&gt; emerge. English seems to avoid three unstressed syllables in a row and tends to place stress near the beginning. For words with five or fewer syllables, the stress often lands on the antepenultimate (third-to-last) syllable.&lt;/p&gt;
&lt;p&gt;It makes sense, doesn&amp;rsquo;t it? Three unstressed syllables in a row would be monotonous. Stress adds rhythm. It&amp;rsquo;s like driving on a straight road – you&amp;rsquo;ll likely doze off. Placing stress too late would also hinder comprehension. Imagine a long word with emphasis on the very last syllable – you&amp;rsquo;d likely miss the meaning!&lt;/p&gt;
&lt;p&gt;To illustrate, consider Mandarin Chinese. It has a significant flaw: the word &amp;ldquo;不&amp;rdquo; (bù, &amp;ldquo;not&amp;rdquo;). Both the consonant and vowel are faint, especially in rapid speech. The vowel becomes even weaker. You often can&amp;rsquo;t discern if someone even &lt;em&gt;uttered&lt;/em&gt; &amp;ldquo;不&amp;rdquo;! This creates a major communication problem, as it distinguishes between two opposite meanings. When my daughter cries, I struggle to understand if she&amp;rsquo;s saying &amp;ldquo;要&amp;rdquo; (yào, &amp;ldquo;want&amp;rdquo;) or &amp;ldquo;不要&amp;rdquo; (bù yào, &amp;ldquo;don&amp;rsquo;t want&amp;rdquo;).&lt;/p&gt;
&lt;p&gt;Back to English stress. My theory seemed reasonable, but I needed evidence. As a data-science novice, I decided to get my hands dirty and see how many words actually followed this pattern.&lt;/p&gt;
&lt;h2 id="research-plan"&gt;Research Plan
&lt;/h2&gt;&lt;p&gt;Having learned data analysis, I formed the research plan quickly. It involved collecting, cleaning, analyzing, and visualizing data. Regression analysis and prediction weren&amp;rsquo;t necessary.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.victor42.work/posts/2024-07/7486fc8650cedd8b8b4f7816e9af7e0d.jpg"
loading="lazy"
alt="Kaggle Notebook preview showing the raw English word dataset, listed alphabetically from a, aa, aaa to aardvark, truncated due to large file size"
&gt;&lt;/p&gt;
&lt;p&gt;Here are the steps, all within the skill set I already had:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Find a comprehensive word list.&lt;/li&gt;
&lt;li&gt;Find a free, batch method for obtaining phonetic transcriptions from an online dictionary.&lt;/li&gt;
&lt;li&gt;Determine the syllable count and stress position for each word (possibly with AI assistance).&lt;/li&gt;
&lt;li&gt;Analyze the distribution of stress positions and visualize the findings.&lt;/li&gt;
&lt;li&gt;Test my hypothesis.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Let&amp;rsquo;s dive in.&lt;/p&gt;
&lt;h2 id="data-source"&gt;Data Source
&lt;/h2&gt;&lt;p&gt;I found a dataset on &lt;a class="link" href="https://www.kaggle.com/" target="_blank" rel="noopener"
&gt;Kaggle&lt;/a&gt;, a popular data science community. It&amp;rsquo;s a simple .txt file containing over 300,000 English words, listed alphabetically, one per line:&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="https://www.kaggle.com/datasets/bwandowando/479k-english-words" target="_blank" rel="noopener"
&gt;https://www.kaggle.com/datasets/bwandowando/479k-english-words&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.victor42.work/posts/2024-07/035173524c2057e2515c255add081cea.jpg"
loading="lazy"
alt="A preview of the raw English word list starting with the letter A in Kaggle"
&gt;&lt;/p&gt;
&lt;p&gt;The .txt file is 4MB, comparable to a million-word novel.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.victor42.work/posts/2024-07/6d8b49da96f58a5292d53296bf7966ba.jpg"
loading="lazy"
alt="Pandas dataframe info showing more than three hundred and sixty-nine thousand rows of words"
&gt;&lt;/p&gt;
&lt;p&gt;I created a Kaggle code project, imported the dataset, read all the words, and obtained a table with 369,652 rows and 1 column.&lt;/p&gt;
&lt;h2 id="getting-the-pronunciation"&gt;Getting the Pronunciation
&lt;/h2&gt;&lt;p&gt;The table only contained words. For rigorous research, I needed phonetic transcriptions.&lt;/p&gt;
&lt;p&gt;Fortunately, I discovered a free online dictionary API: &lt;a class="link" href="https://dictionaryapi.dev/" target="_blank" rel="noopener"
&gt;https://dictionaryapi.dev/&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Now, I had to look up each of those 300,000+ words. Naturally, I&amp;rsquo;d write code to automate this.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.victor42.work/posts/2024-07/5c311b367a15d50faa8f53f724821a54.jpg"
loading="lazy"
alt="JSON response from the free dictionary API for the word hello"
&gt;&lt;/p&gt;
&lt;p&gt;The API returned more than just phonetics; it included audio, etymology, parts of speech, meanings, and examples. The useful components were the phonetics, etymology, and part of speech. However, etymology was mostly missing, so I extracted only the phonetics and part of speech.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.victor42.work/posts/2024-07/12f254a9769f985b4cacc3b3992a7577.jpg"
loading="lazy"
alt="Code snippet of the free dictionary API rate limiter setting a limit of 450 requests per 5 minutes"
&gt;&lt;/p&gt;
&lt;p&gt;The sheer data volume posed a challenge. The API documentation didn&amp;rsquo;t specify request limits, but I found it in &lt;a class="link" href="https://github.com/meetDeveloper/freeDictionaryAPI/blob/master/app.js" target="_blank" rel="noopener"
&gt;their GitHub code&lt;/a&gt;: 450 requests every 5 minutes. For 369,652 words, even running non-stop, it would take 369652 / 450 * 5 / 60 = 68.45 hours – almost 3 days!&lt;/p&gt;
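&lt;p&gt;The estimate is easy to reproduce. This is a minimal sketch of the arithmetic only; the function name &lt;code&gt;hours_needed&lt;/code&gt; is made up for illustration, and it assumes every 5-minute window is used to its full 450-request quota.&lt;/p&gt;

```python
def hours_needed(total_words, per_window=450, window_minutes=5):
    """Hours to query every word once at the maximum allowed rate."""
    windows = total_words / per_window
    return windows * window_minutes / 60


hours = hours_needed(369652)
print(round(hours, 2), "hours, or", round(hours / 24, 2), "days")
```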
&lt;p&gt;&lt;img src="https://cdn.victor42.work/posts/2024-07/4a9c399f7966ab61cf767f7712e209d9.jpg"
loading="lazy"
alt="CSV chunk files saved in the Kaggle working directory during batch processing"
&gt;&lt;/p&gt;
&lt;p&gt;Alright, three days it was. But I had to adjust my strategy. I added a function to chunk queries and save results periodically. Every 1,000 rows, I&amp;rsquo;d save to a sequentially numbered file. I&amp;rsquo;d then continue querying based on the sequence number. Finally, I&amp;rsquo;d merge all 300+ files.&lt;/p&gt;
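&lt;p&gt;The chunk-and-resume idea can be sketched roughly as follows. This is not the original notebook code: the file naming, the helper names, and the &lt;code&gt;lookup&lt;/code&gt; callback are assumptions, and a small fake dictionary stands in for the real API so the sketch runs offline.&lt;/p&gt;

```python
import csv
import os
import tempfile


def chunk_path(out_dir, index):
    # Hypothetical naming scheme: chunk_0000.csv, chunk_0001.csv, ...
    return os.path.join(out_dir, "chunk_{:04d}.csv".format(index))


def run_in_chunks(words, lookup, out_dir, chunk_size=1000):
    """Query words chunk by chunk, skipping chunks already saved to disk."""
    for index, start in enumerate(range(0, len(words), chunk_size)):
        path = chunk_path(out_dir, index)
        if os.path.exists(path):
            continue  # this chunk was finished in an earlier session
        rows = []
        for word in words[start:start + chunk_size]:
            result = lookup(word)  # (phonetic, part_of_speech), or None if not found
            if result is not None:
                rows.append([word, result[0], result[1]])
        with open(path, "w", newline="", encoding="utf-8") as handle:
            csv.writer(handle).writerows(rows)


def merge_chunks(out_dir):
    """Concatenate every chunk file, in sequence order, into one list of rows."""
    merged = []
    for name in sorted(os.listdir(out_dir)):
        with open(os.path.join(out_dir, name), newline="", encoding="utf-8") as handle:
            merged.extend(csv.reader(handle))
    return merged


# Stand-in for the real API call, so the example needs no network access.
FAKE_DICTIONARY = {
    "hello": ("/həˈloʊ/", "noun"),
    "analysis": ("/əˈnælɪsɪs/", "noun"),
}
demo_dir = tempfile.mkdtemp()
run_in_chunks(["aa", "hello", "zzz", "analysis"], FAKE_DICTIONARY.get, demo_dir, chunk_size=2)
demo_rows = merge_chunks(demo_dir)
```

&lt;p&gt;Because finished chunks are detected by file name, rerunning the same call after an interruption picks up at the first missing file. A real &lt;code&gt;lookup&lt;/code&gt; would also need to pause between requests to stay under the rate limit.&lt;/p&gt;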
&lt;p&gt;&lt;img src="https://cdn.victor42.work/posts/2024-07/22b28704556d17baf1c0c141d5ae3e96.jpg"
loading="lazy"
alt="Spreadsheet showing merged English words with phonetic symbols and parts of speech"
&gt;&lt;/p&gt;
&lt;p&gt;It turned out that most of the 300,000+ words were obscure and not found in the API. I only got results for roughly 100 out of every 1,000 words. The file above contains only 92 rows.&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="https://wordsrated.com/how-many-words-are-in-the-english-language/" target="_blank" rel="noopener"
&gt;Linguistic research&lt;/a&gt; indicates that 3,000 English words cover 95% of everyday usage, and 1,000 cover 89%. &lt;a class="link" href="https://wordcounter.io/blog/how-many-words-does-the-average-person-know" target="_blank" rel="noopener"
&gt;Another study&lt;/a&gt; shows that the average adult has an active vocabulary of about 20,000 words and a passive one of 40,000. Thus, only about 1/10 of the dataset is relevant, which is reasonable.&lt;/p&gt;
&lt;h2 id="data-cleaning"&gt;Data Cleaning
&lt;/h2&gt;&lt;p&gt;&lt;img src="https://cdn.victor42.work/posts/2024-07/82acc141ccd3150e4bf0fd08ae292149.jpg"
loading="lazy"
alt="Python code showing the mapping dictionary for uncommon phonetic symbol replacements"
&gt;&lt;/p&gt;
&lt;p&gt;After merging, I found the dictionary&amp;rsquo;s phonetic symbols were inconsistent, containing uncommon symbols like &lt;code&gt;ɘ&lt;/code&gt;, &lt;code&gt;ɝ&lt;/code&gt;, &lt;code&gt;ɚ&lt;/code&gt;, &lt;code&gt;ɨ&lt;/code&gt;, &lt;code&gt;ʉ&lt;/code&gt;. These represent subtle pronunciation variations, roughly equivalent to standard sounds. I had to replace them; otherwise, they&amp;rsquo;d disrupt syllable counting and subsequent analysis.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.victor42.work/posts/2024-07/9d9304e6642b5df50354c06d739eea1d.jpg"
loading="lazy"
alt="Python code showing the mapping rules to merge phonetically identical but graphically different common vowels"
&gt;&lt;/p&gt;
&lt;p&gt;Besides unusual symbols, there were many phonetically identical but differently written symbols, like &lt;code&gt;əu/əʊ&lt;/code&gt; and &lt;code&gt;ai/aɪ&lt;/code&gt;. These also required merging. Each line in the image signifies replacing the first symbol with the second, leaving bracketed symbols untouched.&lt;/p&gt;
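&lt;p&gt;A replacement table of this kind might look like the sketch below. Only the two merges named in the text (&lt;code&gt;əu&lt;/code&gt; to &lt;code&gt;əʊ&lt;/code&gt;, &lt;code&gt;ai&lt;/code&gt; to &lt;code&gt;aɪ&lt;/code&gt;) come from the article; the targets chosen for the uncommon symbols are assumptions for illustration, since the actual mapping only appears in the screenshots.&lt;/p&gt;

```python
# Applied in order; each pair reads "replace the first with the second".
REPLACEMENTS = [
    ("ɚ", "ə"),    # assumed target
    ("ɝ", "ɜ"),    # assumed target
    ("ɘ", "ə"),    # assumed target
    ("ɨ", "ɪ"),    # assumed target
    ("ʉ", "u"),    # assumed target
    ("əu", "əʊ"),  # merge named in the article
    ("ai", "aɪ"),  # merge named in the article
]


def normalize(ipa):
    """Return the transcription with every listed symbol replaced."""
    for old, new in REPLACEMENTS:
        ipa = ipa.replace(old, new)
    return ipa


print(normalize("/taim/"))
```

&lt;p&gt;Plain string replacement is blunt, which is why over- and under-replacement need manual checking against a reference dictionary.&lt;/p&gt;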
&lt;p&gt;Some words differ significantly between British and American English. I prioritized American English conventions.&lt;/p&gt;
&lt;p&gt;Numerous unconventional spellings existed. Over- or under-replacement could easily cause phonetic errors. I wrote a temporary checker, manually consulted the &lt;a class="link" href="https://dictionary.cambridge.org/us/dictionary/english/" target="_blank" rel="noopener"
&gt;Cambridge Dictionary&lt;/a&gt;, and refined my replacements. This took time.&lt;/p&gt;
&lt;p&gt;After processing, the vowel symbols were cleaner. For &amp;ldquo;anthropomorphic&amp;rdquo;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Before: &lt;code&gt;[ˌæ̃n̪θɹ̠əpəˈmɔɹ̠fɪ̈k]&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;After: &lt;code&gt;[ˌæn̪θɹ̠əpəˈmɔːfɪk]&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I didn&amp;rsquo;t handle consonant symbols, as they were irrelevant to my goal, and that&amp;rsquo;s a more complex issue.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.victor42.work/posts/2024-07/627162599344331488dc70237ce660a6.jpg"
loading="lazy"
alt="API JSON response showing incorrect and incomplete phonetic transcription for the word abacus"
&gt;&lt;/p&gt;
&lt;p&gt;Later, I discovered some inaccuracies in the dictionary API. For instance, &amp;ldquo;abacus&amp;rdquo; was transcribed as /-saɪ/? Nonsense! The information was incomplete.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.victor42.work/posts/2024-07/f4f3ef7e088114e942d95246bf273902.jpg"
loading="lazy"
alt="Text output showing the count and percentage of words with incomplete phonetic data"
&gt;&lt;/p&gt;
&lt;p&gt;I calculated this occurred in 0.55% of all words – a small fraction. The incomplete transcriptions seemed random, lacking commonality, so I filtered them out. I&amp;rsquo;m now analyzing a sample, not the complete data. However, the sample is large enough to be representative, allowing the research to proceed.&lt;/p&gt;
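&lt;p&gt;A filter like this can be sketched in a few lines. The detection rule here is an assumption based on the &lt;code&gt;/-saɪ/&lt;/code&gt; example (a hyphen marks a partial transcription); the rows are toy data, not the real dataset.&lt;/p&gt;

```python
def is_incomplete(ipa):
    """Assumed rule: a hyphen marks a partial transcription such as /-saɪ/."""
    return "-" in ipa


# Toy rows, not the real dataset: (word, phonetic).
rows = [
    ("abacus", "/-saɪ/"),
    ("analysis", "/əˈnælɪsɪs/"),
    ("hello", "/həˈloʊ/"),
    ("oasis", "/oʊˈeɪsɪs/"),
]
kept = [row for row in rows if not is_incomplete(row[1])]
dropped_percent = 100 * (len(rows) - len(kept)) / len(rows)
print(len(kept), "rows kept,", dropped_percent, "percent dropped")
```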
&lt;h2 id="analyzing-phonetic-transcriptions-ai"&gt;Analyzing Phonetic Transcriptions (AI)
&lt;/h2&gt;&lt;p&gt;This step entails counting syllables from phonetic transcriptions and identifying the stressed syllable using the &lt;code&gt;ˈ&lt;/code&gt; mark.&lt;/p&gt;
&lt;p&gt;I aimed for a shortcut by deploying an AI model on Kaggle. AI should excel at language, right?&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.victor42.work/posts/2024-07/c77ef4414f82188785924057cfe3bc34.jpg"
loading="lazy"
alt="Kaggle models page showing search results for text-based large language models"
&gt;&lt;/p&gt;
&lt;p&gt;I tested several text-based models but encountered obstacles:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Large models wouldn&amp;rsquo;t run:&lt;/strong&gt; Among Kaggle&amp;rsquo;s deployable open-source models, Llama3 70b could accurately determine syllable count and stress position. ChatGPT, Claude, and even GPT-3.5 could also do it. Language seems to be a strength of large language models. The issue? Kaggle&amp;rsquo;s free tier can&amp;rsquo;t run such large models.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Small models were inadequate:&lt;/strong&gt; Kaggle&amp;rsquo;s two free T4 GPUs can handle smaller models in the 7b to 8b range, like Llama3 8b, Gemma 7b, and Qwen2 7b. However, these smaller models, on Kaggle or elsewhere, couldn&amp;rsquo;t reliably perform the task.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;I refined prompts, guiding the AI step-by-step, and provided examples:&lt;/p&gt;
&lt;pre tabindex="0"&gt;&lt;code&gt;&amp;lt;task&amp;gt;
your task is to count how many syllables there are in an English word. list them all then count. finally answer which syllable the stress falls on(tell me the number). answer **EXACTLY** in the example format.
&amp;lt;example&amp;gt;
word: analysis
phonetic transcription: /əˈnælɪsɪs/
syllables:
1. ə
2. &amp;#39;næ
3. lɪ
4. sɪs
syllables count: 4
stress position: 2
final conclusion: &amp;lt;&amp;lt;&amp;lt;2/4&amp;gt;&amp;gt;&amp;gt;
&amp;lt;word&amp;gt;
analytical /æn.əˈlɪt.ə.kəl/
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;But the smaller models kept failing. Perhaps they weren&amp;rsquo;t capable. Phonetic symbols are vastly different from standard English letters, almost a separate, niche language for AI.&lt;/p&gt;
&lt;p&gt;This experience highlighted a key point: these open-source small models cluster around 7 billion parameters likely because that&amp;rsquo;s the upper limit for running on specific GPUs. In this era of constrained computing, GPUs dictate the scale.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.victor42.work/posts/2024-07/3a5d9b8fcbd23a0d5487891310921f63.jpg"
loading="lazy"
alt="Google Sheets interface showing GPT formulas applied to analyze word stress"
&gt;&lt;/p&gt;
&lt;p&gt;Was AI a dead end? I then considered a workaround: Google Sheets with an AI plugin. I could input the phonetic data into Sheets, write a prompt in the adjacent cell (including the word and transcription), and use a formula from an &lt;a class="link" href="https://workspace.google.com/u/1/marketplace/app/gpt_for_sheets_and_docs/677318054654" target="_blank" rel="noopener"
&gt;AI plugin&lt;/a&gt; to generate the result. This plugin, powered by GPT-3.5, could handle the task. The classic Excel drag-down trick would then populate the entire column.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.victor42.work/posts/2024-07/81f435b62db92e70d47f0d77841e5703.jpg"
loading="lazy"
alt="Cost estimator page showing the estimated cost for calling the GPT plugin in Google Sheets"
&gt;&lt;/p&gt;
&lt;p&gt;The plugin&amp;rsquo;s pricing was reasonable, around 90 RMB for my data volume. However, I was unsure if it could handle tens of thousands of AI generations simultaneously. Debugging and regenerating could double the cost, making it risky.&lt;/p&gt;
&lt;h2 id="analyzing-phonetic-transcriptions-algorithm"&gt;Analyzing Phonetic Transcriptions (Algorithm)
&lt;/h2&gt;&lt;p&gt;Okay, no more AI—I&amp;rsquo;d handle it myself. Counting syllables and locating stress? An algorithm could do that, and more reliably. Here’s the approach, using &lt;code&gt;analytical /æn.əˈlɪt.ə.kəl/&lt;/code&gt; as an example:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Create a set of all vowels: &lt;code&gt;ɑaæɒʌəɛeɪiɔoʊuʉɜ&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Remove slashes, parentheses, spaces, and dots: &lt;code&gt;/æn.əˈlɪt.ə.kəl/&lt;/code&gt; becomes &lt;code&gt;ænəˈlɪtəkəl&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Iterate through &lt;code&gt;ænəˈlɪtəkəl&lt;/code&gt;, checking against the vowel set. Counting vowels: &lt;code&gt;æ&lt;/code&gt;, &lt;code&gt;ə&lt;/code&gt;, &lt;code&gt;ɪ&lt;/code&gt;, &lt;code&gt;ə&lt;/code&gt;, &lt;code&gt;ə&lt;/code&gt; yields 5 syllables.&lt;/li&gt;
&lt;li&gt;Split by the stress mark &lt;code&gt;ˈ&lt;/code&gt;: &lt;code&gt;ænəˈlɪtəkəl&lt;/code&gt; becomes &lt;code&gt;ænə&lt;/code&gt; and &lt;code&gt;lɪtəkəl&lt;/code&gt;. Use the first part, &lt;code&gt;ænə&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Count vowels in &lt;code&gt;ænə&lt;/code&gt; as in step 3: 2 vowels.&lt;/li&gt;
&lt;li&gt;Add 1 to get the stress position: the 3rd syllable.&lt;/li&gt;
&lt;/ol&gt;
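&lt;p&gt;The six steps above translate almost directly into code. This is a minimal sketch rather than the code actually used; the function names are made up, and it deliberately ignores diphthongs and long vowels.&lt;/p&gt;

```python
VOWELS = set("ɑaæɒʌəɛeɪiɔoʊuʉɜ")  # step 1: the vowel set from the article
STRIP = "/(). "                    # step 2: characters to remove


def clean(ipa):
    return "".join(ch for ch in ipa if ch not in STRIP)


def count_vowels(text):
    return sum(1 for ch in text if ch in VOWELS)


def analyze(ipa):
    """Return (syllable count, stress position); position is None without a stress mark."""
    cleaned = clean(ipa)
    syllables = count_vowels(cleaned)            # step 3
    before, mark, _ = cleaned.partition("ˈ")     # step 4
    stress = count_vowels(before) + 1 if mark else None  # steps 5 and 6
    return syllables, stress


print(analyze("/æn.əˈlɪt.ə.kəl/"))
```

&lt;p&gt;For &lt;code&gt;analytical&lt;/code&gt; this gives 5 syllables with stress on the 3rd, matching the walkthrough.&lt;/p&gt;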
&lt;p&gt;The logic was clear, so I had AI write the code—a trivial task for it. A few tweaks, and it worked.&lt;/p&gt;
&lt;p&gt;A challenge arose in step 3: diphthongs, triphthongs, and long vowels. For &lt;code&gt;ei&lt;/code&gt;, the algorithm would count &lt;code&gt;e&lt;/code&gt; and &lt;code&gt;i&lt;/code&gt; (2 syllables), but &lt;code&gt;ei&lt;/code&gt; as a diphthong is only one. Triphthongs would be counted as 3.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.victor42.work/posts/2024-07/93fc699338026ae0a224090ea716d17c.jpg"
loading="lazy"
alt="Python code snippet defining sets of monophthongs, diphthongs, and triphthongs"
&gt;&lt;/p&gt;
&lt;p&gt;The algorithm needed adjustment. I created three vowel sets: monophthongs, diphthongs, and triphthongs. The vowel check now involved three passes:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;First pass: Check each character against the monophthong set. This overcounts diphthongs and triphthongs.&lt;/li&gt;
&lt;li&gt;Second pass: Check two characters at a time against the diphthong set. If found, subtract 1 from the syllable count. Importantly, skip the next character after a diphthong to avoid miscounting triphthongs like &lt;code&gt;aɪə&lt;/code&gt; as &lt;code&gt;aɪ&lt;/code&gt; and &lt;code&gt;ɪə&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Third pass: Check three characters at a time against the triphthong set, subtracting 1 if found.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This refined algorithm accurately counted syllables. (Note: I treated the long vowel marker &lt;code&gt;ː&lt;/code&gt; as a phonetic character; &lt;code&gt;iː&lt;/code&gt;, &lt;code&gt;ɑː&lt;/code&gt; are handled as diphthongs, &lt;code&gt;iːə&lt;/code&gt;, &lt;code&gt;uːə&lt;/code&gt; as triphthongs, which doesn&amp;rsquo;t affect the outcome.)&lt;/p&gt;
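&lt;p&gt;The three passes could be sketched like this. The contents of the three sets are illustrative assumptions (the real sets only appear in the screenshot), and the long vowel marker &lt;code&gt;ː&lt;/code&gt; is included as a character, as described in the note.&lt;/p&gt;

```python
# Illustrative sets only; the real ones are shown in the screenshot.
MONOPHTHONGS = set("ɑaæɒʌəɛeɪiɔoʊuʉɜː")
DIPHTHONGS = {"aɪ", "eɪ", "ɔɪ", "aʊ", "oʊ", "əʊ", "ɪə", "eə", "ʊə",
              "iː", "ɑː", "ɔː", "uː", "ɜː"}
TRIPHTHONGS = {"aɪə", "aʊə", "iːə", "uːə"}


def count_syllables(ipa):
    # Pass 1: every vowel character counts once (overcounts multi-character vowels).
    count = sum(1 for ch in ipa if ch in MONOPHTHONGS)

    # Pass 2: each diphthong was counted twice, so subtract 1.
    # The window starting on the diphthong's second character is skipped.
    skip = -1
    for i in range(len(ipa) - 1):
        if i == skip:
            continue
        if ipa[i:i + 2] in DIPHTHONGS:
            count -= 1
            skip = i + 1

    # Pass 3: each triphthong is still counted as 2 here, so subtract 1 more.
    for i in range(len(ipa) - 2):
        if ipa[i:i + 3] in TRIPHTHONGS:
            count -= 1
    return count


for sample in ("ænəˈlɪtəkəl", "oʊˈeɪsɪs", "siː"):
    print(sample, count_syllables(sample))
```

&lt;p&gt;With these sets, a triphthong such as &lt;code&gt;aɪə&lt;/code&gt; collapses to a single syllable: three in the first pass, minus one in each of the next two.&lt;/p&gt;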
&lt;p&gt;It turns out, for data analysis, technique takes a backseat to domain knowledge. Analyzing English requires understanding it. Digging deeper into phonetics, I hit another snag: triphthong identification is incredibly ambiguous. There&amp;rsquo;s no consensus on whether three vowel symbols together are a triphthong or a monophthong + diphthong. That familiar feeling&amp;hellip; Classic English! No rigid rules.&lt;/p&gt;
&lt;p&gt;Consider &lt;code&gt;fire /ˈfaɪər/&lt;/code&gt;. Some claim &lt;code&gt;aɪə&lt;/code&gt; is one syllable; others say it&amp;rsquo;s &lt;code&gt;aɪ&lt;/code&gt; + &lt;code&gt;ə&lt;/code&gt; (two syllables). Criteria vary wildly. Some use hyphenation (&amp;ldquo;fire&amp;rdquo; can&amp;rsquo;t be split into &amp;ldquo;fi-&amp;rdquo; and &amp;ldquo;re&amp;rdquo; across a line break, so it counts as one syllable, a triphthong). Others use singing: if sung as one note, it&amp;rsquo;s a triphthong. In &lt;a class="link" href="https://www.youtube.com/watch?v=dC7Pog3biCk" target="_blank" rel="noopener"
&gt;Simple Plan - Fire In My Heart&lt;/a&gt;, at 0:57, &lt;code&gt;faɪ&lt;/code&gt; and &lt;code&gt;ər&lt;/code&gt; are sung as separate notes, so should it be a diphthong + monophthong?&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.victor42.work/posts/2024-07/d0227a8fc72ffd41ff020f6fceb73b12.jpg"
loading="lazy"
alt="A music video screenshot showing lyrics containing the triphthong word fire"
&gt;&lt;/p&gt;
&lt;p&gt;Oh well, that&amp;rsquo;s English. Given words like &lt;code&gt;oasis /oʊˈeɪsɪs/&lt;/code&gt; (four vowels!), with &lt;code&gt;oʊ&lt;/code&gt; and &lt;code&gt;eɪ&lt;/code&gt; clearly separated by the stress mark (obviously two diphthongs), I disregarded triphthongs, treating them as two syllables. The only remaining &amp;ldquo;triphthongs&amp;rdquo; were diphthongs with a long vowel marker.&lt;/p&gt;
&lt;p&gt;Besides syllable count and stress position, I wanted the stressed vowel itself, potentially for further analysis.&lt;/p&gt;
&lt;p&gt;This was trickier. I discussed it with AI, revealing significant model differences. Gemini 1.5 Flash went in circles. GPT-4o provided the correct code in three conversational rounds (about 10 minutes). Claude 3.5 Sonnet got it right immediately. For coding, a good model is worth the cost, though basic code literacy is essential to understand the AI&amp;rsquo;s code, its functionality, and potential issues.&lt;/p&gt;
&lt;p&gt;Here&amp;rsquo;s the logic, again with &lt;code&gt;analytical /ænəˈlɪtəkəl/&lt;/code&gt;:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Locate the stress mark &lt;code&gt;ˈ&lt;/code&gt; and consider the subsequent part: &lt;code&gt;lɪtəkəl&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Iterate, removing non-vowels until the first vowel: &lt;code&gt;ɪtəkəl&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;The first character is now a vowel. Check the first 3 characters (&lt;code&gt;ɪtə&lt;/code&gt;) against the triphthong set. Nope.&lt;/li&gt;
&lt;li&gt;Check the first 2 (&lt;code&gt;ɪt&lt;/code&gt;) against the diphthong set. Nope.&lt;/li&gt;
&lt;li&gt;Check the first character (&lt;code&gt;ɪ&lt;/code&gt;) against the monophthong set. Found it! That&amp;rsquo;s the stressed vowel.&lt;/li&gt;
&lt;/ol&gt;
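&lt;p&gt;The lookup above can be sketched as a longest-match search. Again, the sets are illustrative assumptions and the function name is made up; this is not the code the models produced.&lt;/p&gt;

```python
MONOPHTHONGS = set("ɑaæɒʌəɛeɪiɔoʊuʉɜ")
DIPHTHONGS = {"aɪ", "eɪ", "ɔɪ", "aʊ", "oʊ", "əʊ", "iː", "ɑː", "ɔː", "uː", "ɜː"}
TRIPHTHONGS = {"aɪə", "aʊə"}  # illustrative only


def stressed_vowel(ipa):
    """Return the vowel that follows the primary stress mark, or None."""
    _, mark, rest = ipa.partition("ˈ")        # step 1: keep the part after the mark
    if not mark:
        return None
    for i, ch in enumerate(rest):
        if ch not in MONOPHTHONGS:
            continue                          # step 2: skip leading consonants
        tail = rest[i:]
        # Steps 3 to 5: try the longest candidate first.
        for size, group in ((3, TRIPHTHONGS), (2, DIPHTHONGS)):
            if tail[:size] in group:
                return tail[:size]
        return ch
    return None


print(stressed_vowel("ænəˈlɪtəkəl"))
```

&lt;p&gt;Checking the longer candidates first matters: otherwise &lt;code&gt;eɪ&lt;/code&gt; in &lt;code&gt;oasis&lt;/code&gt; would be reported as plain &lt;code&gt;e&lt;/code&gt;.&lt;/p&gt;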
&lt;p&gt;&lt;img src="https://cdn.victor42.work/posts/2024-07/ba10765865fa9f86332e78b71807279f.jpg"
loading="lazy"
alt="Spreadsheet detailing English words along with their syllable count and stress positions"
&gt;&lt;/p&gt;
&lt;p&gt;The data table after phonetic analysis. All necessary data was now collected.&lt;/p&gt;
&lt;h2 id="visualization"&gt;Visualization
&lt;/h2&gt;&lt;p&gt;Now for the highlight—not just for deriving useful conclusions, but also because AI shines here. AI is excellent at writing Python visualization code. These tasks are less about reasoning and more about knowing the visualization library&amp;rsquo;s syntax. Even Gemini 1.5 Flash, a non-flagship model I use daily, performs well. I haven&amp;rsquo;t formally learned Seaborn and Matplotlib, but with AI, generating plots is straightforward.&lt;/p&gt;
&lt;p&gt;Of course, &amp;ldquo;straightforward&amp;rdquo; doesn&amp;rsquo;t mean &amp;ldquo;ask and receive.&amp;rdquo; Giving AI a vague request without context leads to failure. I crafted a Python visualization prompt that details the task and the data table&amp;rsquo;s structure, so the AI could work reliably at full capability.&lt;/p&gt;
&lt;pre tabindex="0"&gt;&lt;code&gt;&amp;lt;Task&amp;gt;
You are a Python data visualizer. You excels at coding with data visualization libraries like Seaborn and Matplotlib. I will tell you about the structure of a Pandas dataframe and the visualization I want. First, you dive deeply into the dataframe and understand what it is all about. Then write Python code to visualize it. Just code, no explanation. Next, you check if the code meets my need. Finally, correct the code if necessary.
&amp;lt;Dataframe&amp;gt;
The dataframe(variable name is df) is {a list of common English words with their phonetic information and part-of-speech}.
Now here are the columns of the dataframe, exactly in the following order:
**word**
- datatype: str
- example: complimentary
- description: the English words
**phonetic**
- datatype: str
- example: /ˌkɒmplɪ̈ˈment(ə)ɹɪ/
- description: the phonetic transcription of the words
**part_of_speech**
- datatype: str(list like)
- example: [&amp;#39;adjective&amp;#39;]
- description: how are these words used in sentences
**syllable_len**
- datatype: int
- example: 5
- description: how many syllables are there in these words
**stress_pos**
- datatype: int
- example: 3
- description: on which syllable the stress falls on, if there are more than one stress, this is the position of the first stress
**stress_syllable**
- datatype: str
- example: e
- description: the vowel of the stressed syllable
&amp;lt;Request&amp;gt;
I want to know the distribution of stress position, grouped by syllable numbers.
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;To use the prompt, just tweak the &lt;code&gt;&amp;lt;Request&amp;gt;&lt;/code&gt; section.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.victor42.work/posts/2024-07/6bf1e239c52df87ca7159c81c23911cd.jpg"
loading="lazy"
alt="Head of the loaded pandas dataframe displaying word phonetic and stress properties"
&gt;&lt;/p&gt;
&lt;p&gt;Some words in the data lack stress marks because they&amp;rsquo;re short, and their phonetic transcriptions don&amp;rsquo;t show stress. Let&amp;rsquo;s filter those out, along with one-syllable words – analyzing stress in those is pointless.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.victor42.work/posts/2024-07/99b768328e8403852edad5bbe1d47def.jpg"
loading="lazy"
alt="Cleaned pandas dataframe info showing twenty-four thousand four hundred and thirty-three entries"
&gt;&lt;/p&gt;
&lt;p&gt;This leaves 24,433 words with complete data.&lt;/p&gt;
&lt;h3 id="syllable-count-analysis"&gt;Syllable Count Analysis
&lt;/h3&gt;&lt;p&gt;Let&amp;rsquo;s break down the syllable counts of these 24,433 words.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.victor42.work/posts/2024-07/e6ded1b89391ef9844e28f8d4342c3da.jpg"
loading="lazy"
alt="Bar chart illustrating the frequency distribution of English word syllable lengths"
&gt;&lt;/p&gt;
&lt;p&gt;Unsurprisingly, fewer syllables mean more words. Languages tend to use up short, easy words first.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.victor42.work/posts/2024-07/9655926ed67e4cb11ee3f8a0ba62cbe0.jpg"
loading="lazy"
alt="Pie chart displaying the percentage distribution of different word syllable lengths"
&gt;&lt;/p&gt;
&lt;p&gt;Two-syllable words make up 48.7%, three-syllable words 31.3%.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.victor42.work/posts/2024-07/20a81644b6c29b8bab1ccc0b79f5e220.jpg"
loading="lazy"
alt="Text statistics showing the cumulative percentages of words with up to four and five syllables"
&gt;&lt;/p&gt;
&lt;p&gt;Words with four or fewer syllables make up 94.73%; five or fewer, 99%.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.victor42.work/posts/2024-07/963d18455de407866b97e9459de20bab.jpg"
loading="lazy"
alt="Syllable analysis showing eleven syllables in the long English word antidisestablishmentarianism"
&gt;&lt;/p&gt;
&lt;p&gt;The longest word has 11 syllables.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.victor42.work/posts/2024-07/79fac98a54c6d574e0c2e29ef224e1dd.jpg"
loading="lazy"
alt="Cambridge Dictionary entry defining the long political word antidisestablishmentarianism"
&gt;&lt;/p&gt;
&lt;p&gt;&amp;ldquo;Antidisestablishmentarianism&amp;rdquo;? Really? Opposition to opposition – double negative much? No wonder it&amp;rsquo;s so long. Could I just add &amp;ldquo;non-&amp;rdquo; to create &amp;ldquo;nonantidisestablishmentarianism&amp;rdquo;?&lt;/p&gt;
&lt;h3 id="syllable-count-vs-stress-position"&gt;Syllable Count vs. Stress Position
&lt;/h3&gt;&lt;p&gt;Statistically, the correlation coefficient between syllable count and stress position is 0.67 – a pretty decent correlation.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.victor42.work/posts/2024-07/de6dd89e6d5f9344dc7788051d2266b0.jpg"
loading="lazy"
alt="Statistical correlation coefficient between syllable length and stress position in English words"
&gt;&lt;/p&gt;
&lt;p&gt;This coefficient ranges from -1 to 1. Near 0 means almost no relationship; near 1, positive correlation (one up, other up); near -1, negative correlation (one up, other down).&lt;/p&gt;
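&lt;p&gt;For reference, here is how such a coefficient is computed by hand. This sketch assumes the figure is a Pearson coefficient (the article doesn&amp;rsquo;t say which kind), and the numbers are toy values, not the real data.&lt;/p&gt;

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy values, not the article's data.
syllable_len = [2, 2, 3, 3, 4, 5]
stress_pos = [1, 2, 1, 2, 2, 3]
r = pearson(syllable_len, stress_pos)
print(round(r, 2))
```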
&lt;p&gt;This is just a first step, showing they&amp;rsquo;re not unrelated. It doesn&amp;rsquo;t explain &lt;em&gt;why&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.victor42.work/posts/2024-07/424a2fdcade241c75ba5a53eabda74ee.jpg"
loading="lazy"
alt="Bubble chart representing the distribution of stress positions across different syllable lengths"
&gt;&lt;/p&gt;
&lt;p&gt;A bubble chart helps. Syllable count is on the y-axis, stress position on the x-axis, and bubble size/color shows the word count. The dots roughly follow a diagonal – more syllables, later stress.&lt;/p&gt;
&lt;p&gt;Bubble charts (or heatmaps) show three dimensions but compare absolute word counts. I care more about stress position distribution &lt;em&gt;within&lt;/em&gt; each syllable count.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.victor42.work/posts/2024-07/8a8e9b114c1ec9758b4c00e62f8be6f6.jpg"
loading="lazy"
alt="Grouped bar charts displaying stress position distributions for each specific syllable length"
&gt;&lt;/p&gt;
&lt;p&gt;Here&amp;rsquo;s a stacked bar chart: syllable count on the y-axis, stress position on the x-axis. Now it&amp;rsquo;s clear: stress shifts right like a wave, clustering around the third-to-last syllable.&lt;/p&gt;
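&lt;p&gt;The numbers behind such a chart are per-group percentages. Here is a minimal sketch of that calculation, using toy pairs rather than the real table; the function name is made up for illustration.&lt;/p&gt;

```python
from collections import Counter, defaultdict


def stress_distribution(pairs):
    """Map each syllable count to {stress position: percentage within that group}."""
    groups = defaultdict(Counter)
    for syllable_len, stress_pos in pairs:
        groups[syllable_len][stress_pos] += 1
    result = {}
    for syllable_len, counts in sorted(groups.items()):
        total = sum(counts.values())
        result[syllable_len] = {
            pos: round(100 * n / total, 1) for pos, n in sorted(counts.items())
        }
    return result


# Toy pairs, not the article's data: (syllable_len, stress_pos).
toy = [(2, 1), (2, 1), (2, 1), (2, 2), (3, 1), (3, 2), (4, 2), (4, 2)]
dist = stress_distribution(toy)
print(dist)
```

&lt;p&gt;Normalizing inside each group is what lets a rare 5-syllable row be compared with the far larger 2-syllable row.&lt;/p&gt;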
&lt;h3 id="stressed-syllable-analysis"&gt;Stressed Syllable Analysis
&lt;/h3&gt;&lt;p&gt;&lt;img src="https://cdn.victor42.work/posts/2024-07/a8cbd78d2abfeeb6f6a12e95dee24c99.jpg"
loading="lazy"
alt="Text list of all unique stressed vowel symbols extracted from the dataset"
&gt;&lt;/p&gt;
&lt;p&gt;These are all the vowels in stressed syllables. A couple shouldn&amp;rsquo;t be here, but it&amp;rsquo;s a dictionary error, and too few to matter.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.victor42.work/posts/2024-07/078bec4b5063d84f7f328e910dd61f9a.jpg"
loading="lazy"
alt="Horizontal bar chart showing the frequency ranking of different stressed vowels"
&gt;&lt;/p&gt;
&lt;p&gt;By frequency, louder vowels like &lt;code&gt;æ&lt;/code&gt; and &lt;code&gt;e&lt;/code&gt; are more likely stressed; weaker ones like &lt;code&gt;ə&lt;/code&gt; and &lt;code&gt;ʊ&lt;/code&gt; are less common.&lt;/p&gt;
&lt;h3 id="part-of-speech-analysis"&gt;Part of Speech Analysis
&lt;/h3&gt;&lt;p&gt;Is there a link between part of speech and stress?&lt;/p&gt;
&lt;pre tabindex="0"&gt;&lt;code&gt;All part of speech: [&amp;#39;adjective&amp;#39;, &amp;#39;adverb&amp;#39;, &amp;#39;conjunction&amp;#39;, &amp;#39;interjection&amp;#39;, &amp;#39;noun&amp;#39;, &amp;#39;numeral&amp;#39;, &amp;#39;preposition&amp;#39;, &amp;#39;pronoun&amp;#39;, &amp;#39;propernoun&amp;#39;, &amp;#39;verb&amp;#39;]
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Here&amp;rsquo;s a breakdown of all parts of speech. I&amp;rsquo;m not sure what &amp;ldquo;propernoun&amp;rdquo; is – it&amp;rsquo;s not in my dictionary either. It turns out only two words carry that tag, and they don&amp;rsquo;t seem to fit, so I suspect a data glitch with the dictionary API. I&amp;rsquo;ll skip it for now.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.victor42.work/posts/2024-07/627f810c2d8d6b27501d19d8ad6cff43.jpg"
loading="lazy"
alt="Horizontal bar chart showing the distribution of words across various parts of speech"
&gt;&lt;/p&gt;
&lt;p&gt;I ranked the parts of speech by frequency. The big ones are nouns, verbs, adjectives, and adverbs. Nouns account for roughly half the total.&lt;/p&gt;
&lt;p&gt;This gets you thinking about how language evolved. First, you need to describe the world and create concepts – that&amp;rsquo;s where nouns come in. Then, to describe how people and things interact, you need verbs. After that, adjectives and adverbs develop to modify nouns and verbs. So, my guess is the number of words should follow that order.&lt;/p&gt;
&lt;p&gt;But wait – shouldn&amp;rsquo;t the ratio of nouns to adjectives, and verbs to adverbs, be roughly the same? No need to calculate. The bar chart makes it obvious: nouns are more than double the adjectives, and verbs outnumber adverbs almost nine to one. They&amp;rsquo;re way out of proportion.&lt;/p&gt;
&lt;pre tabindex="0"&gt;&lt;code&gt;[&amp;#39;abracadabra&amp;#39;, &amp;#39;absolutely&amp;#39;, &amp;#39;action&amp;#39;, &amp;#39;adieu&amp;#39;, &amp;#39;adios&amp;#39;, &amp;#39;affirmative&amp;#39;, &amp;#39;afternoon&amp;#39;, &amp;#39;ahem&amp;#39;, &amp;#39;alack&amp;#39;, &amp;#39;aloha&amp;#39;, &amp;#39;alright&amp;#39;, &amp;#39;amen&amp;#39;, &amp;#39;amidships&amp;#39;, &amp;#39;arrivederci&amp;#39;, &amp;#39;attaboy&amp;#39;, &amp;#39;attention&amp;#39;, &amp;#39;away&amp;#39;, &amp;#39;banzai&amp;#39;, &amp;#39;bastard&amp;#39;, &amp;#39;beauty&amp;#39;, &amp;#39;begone&amp;#39;, &amp;#39;begorra&amp;#39;, &amp;#39;behold&amp;#39;, &amp;#39;blazes&amp;#39;, &amp;#39;bollocks&amp;#39;, &amp;#39;bonjour&amp;#39;, &amp;#39;bother&amp;#39;, &amp;#39;botheration&amp;#39;, &amp;#39;brother&amp;#39;, &amp;#39;bully&amp;#39;, &amp;#39;bullseye&amp;#39;, &amp;#39;bullshit&amp;#39;, &amp;#39;caramba&amp;#39;, &amp;#39;checkmate&amp;#39;, &amp;#39;cheeses&amp;#39;, &amp;#39;condolences&amp;#39;, &amp;#39;congrats&amp;#39;, &amp;#39;congratulations&amp;#39;, &amp;#39;content&amp;#39;, &amp;#39;cooee&amp;#39;, &amp;#39;curses&amp;#39;, &amp;#39;dammit&amp;#39;, &amp;#39;ecce&amp;#39;, &amp;#39;egad&amp;#39;, &amp;#39;enchanted&amp;#39;, &amp;#39;encore&amp;#39;, &amp;#39;enough&amp;#39;, &amp;#39;eureka&amp;#39;, &amp;#39;exactly&amp;#39;, &amp;#39;farewell&amp;#39;, &amp;#39;fiddlesticks&amp;#39;, &amp;#39;flummery&amp;#39;, &amp;#39;gadzooks&amp;#39;, &amp;#39;gesundheit&amp;#39;, &amp;#39;goddamn&amp;#39;, &amp;#39;goodbye&amp;#39;, &amp;#39;gorblimey&amp;#39;, &amp;#39;gracias&amp;#39;, &amp;#39;gracious&amp;#39;, &amp;#39;greetings&amp;#39;, &amp;#39;hallelujah&amp;#39;, &amp;#39;hardly&amp;#39;, &amp;#39;havoc&amp;#39;, &amp;#39;heavens&amp;#39;, &amp;#39;heyday&amp;#39;, &amp;#39;hola&amp;#39;, &amp;#39;holla&amp;#39;, &amp;#39;honestly&amp;#39;, &amp;#39;hooray&amp;#39;, &amp;#39;hosanna&amp;#39;, &amp;#39;howdy&amp;#39;, &amp;#39;hullo&amp;#39;, 
&amp;#39;hurrah&amp;#39;, &amp;#39;huzzah&amp;#39;, &amp;#39;yeah&amp;#39;, &amp;#39;indeed&amp;#39;, &amp;#39;knickers&amp;#39;, &amp;#39;later&amp;#39;, &amp;#39;mercy&amp;#39;, &amp;#39;morepork&amp;#39;, &amp;#39;morning&amp;#39;, &amp;#39;namaste&amp;#39;, &amp;#39;negative&amp;#39;, &amp;#39;nonsense&amp;#39;, &amp;#39;oyez&amp;#39;, &amp;#39;okay&amp;#39;, &amp;#39;ole&amp;#39;, &amp;#39;pardon&amp;#39;, &amp;#39;peccavi&amp;#39;, &amp;#39;period&amp;#39;, &amp;#39;pity&amp;#39;, &amp;#39;pleasure&amp;#39;, &amp;#39;presto&amp;#39;, &amp;#39;prithee&amp;#39;, &amp;#39;prosit&amp;#39;, &amp;#39;quiet&amp;#39;, &amp;#39;rather&amp;#39;, &amp;#39;really&amp;#39;, &amp;#39;respect&amp;#39;, &amp;#39;result&amp;#39;, &amp;#39;roger&amp;#39;, &amp;#39;rumble&amp;#39;, &amp;#39;sayonara&amp;#39;, &amp;#39;scramble&amp;#39;, &amp;#39;selah&amp;#39;, &amp;#39;shabash&amp;#39;, &amp;#39;shazam&amp;#39;, &amp;#39;silence&amp;#39;, &amp;#39;sorry&amp;#39;, &amp;#39;standard&amp;#39;, &amp;#39;sugar&amp;#39;, &amp;#39;tally&amp;#39;, &amp;#39;tara&amp;#39;, &amp;#39;tarnation&amp;#39;, &amp;#39;tidy&amp;#39;, &amp;#39;timber&amp;#39;, &amp;#39;uncle&amp;#39;, &amp;#39;understood&amp;#39;, &amp;#39;viva&amp;#39;, &amp;#39;vivat&amp;#39;, &amp;#39;voetsek&amp;#39;, &amp;#39;warning&amp;#39;, &amp;#39;welcome&amp;#39;, &amp;#39;whammo&amp;#39;, &amp;#39;whatever&amp;#39;, &amp;#39;wilco&amp;#39;, &amp;#39;wirra&amp;#39;, &amp;#39;zowie&amp;#39;]
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;I listed all the interjections out of curiosity. I don&amp;rsquo;t usually give this part of speech much thought, so I took a closer look. Surprisingly, &amp;ldquo;afternoon&amp;rdquo; is also classified as one! Which makes sense, since it&amp;rsquo;s a greeting.&lt;/p&gt;
&lt;pre tabindex="0"&gt;&lt;code&gt;[&amp;#39;abaft&amp;#39;, &amp;#39;abeam&amp;#39;, &amp;#39;aboard&amp;#39;, &amp;#39;about&amp;#39;, &amp;#39;above&amp;#39;, &amp;#39;abreast&amp;#39;, &amp;#39;abroad&amp;#39;, &amp;#39;absent&amp;#39;, &amp;#39;across&amp;#39;, &amp;#39;afore&amp;#39;, &amp;#39;after&amp;#39;, &amp;#39;again&amp;#39;, &amp;#39;against&amp;#39;, &amp;#39;agin&amp;#39;, &amp;#39;along&amp;#39;, &amp;#39;alongside&amp;#39;, &amp;#39;aloof&amp;#39;, &amp;#39;alow&amp;#39;, &amp;#39;amid&amp;#39;, &amp;#39;amidst&amp;#39;, &amp;#39;among&amp;#39;, &amp;#39;amongst&amp;#39;, &amp;#39;anent&amp;#39;, &amp;#39;anti&amp;#39;, &amp;#39;around&amp;#39;, &amp;#39;asprawl&amp;#39;, &amp;#39;astraddle&amp;#39;, &amp;#39;astride&amp;#39;, &amp;#39;athwart&amp;#39;, &amp;#39;barring&amp;#39;, &amp;#39;bating&amp;#39;, &amp;#39;because&amp;#39;, &amp;#39;before&amp;#39;, &amp;#39;behind&amp;#39;, &amp;#39;beyond&amp;#39;, &amp;#39;below&amp;#39;, &amp;#39;beneath&amp;#39;, &amp;#39;beside&amp;#39;, &amp;#39;besides&amp;#39;, &amp;#39;between&amp;#39;, &amp;#39;betwixt&amp;#39;, &amp;#39;circa&amp;#39;, &amp;#39;concerning&amp;#39;, &amp;#39;considering&amp;#39;, &amp;#39;contra&amp;#39;, &amp;#39;despite&amp;#39;, &amp;#39;during&amp;#39;, &amp;#39;except&amp;#39;, &amp;#39;excepting&amp;#39;, &amp;#39;failing&amp;#39;, &amp;#39;following&amp;#39;, &amp;#39;forby&amp;#39;, &amp;#39;froward&amp;#39;, &amp;#39;given&amp;#39;, &amp;#39;including&amp;#39;, &amp;#39;inside&amp;#39;, &amp;#39;into&amp;#39;, &amp;#39;minus&amp;#39;, &amp;#39;modulo&amp;#39;, &amp;#39;nearer&amp;#39;, &amp;#39;nearest&amp;#39;, &amp;#39;onto&amp;#39;, &amp;#39;opposite&amp;#39;, &amp;#39;outwith&amp;#39;, &amp;#39;pending&amp;#39;, &amp;#39;regarding&amp;#39;, &amp;#39;regardless&amp;#39;, &amp;#39;respecting&amp;#39;, &amp;#39;rising&amp;#39;, &amp;#39;running&amp;#39;, &amp;#39;saving&amp;#39;, &amp;#39;thorough&amp;#39;, &amp;#39;throughout&amp;#39;, &amp;#39;touching&amp;#39;, 
&amp;#39;toward&amp;#39;, &amp;#39;towards&amp;#39;, &amp;#39;under&amp;#39;, &amp;#39;underneath&amp;#39;, &amp;#39;unlike&amp;#39;, &amp;#39;until&amp;#39;, &amp;#39;upon&amp;#39;, &amp;#39;upside&amp;#39;, &amp;#39;versus&amp;#39;, &amp;#39;wanting&amp;#39;, &amp;#39;within&amp;#39;, &amp;#39;without&amp;#39;]
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;When listing out prepositions, I noticed some recurring prefixes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;a- indicating location or spatial relationship: aboard, across, amid, around&lt;/li&gt;
&lt;li&gt;be- (basically &lt;em&gt;be&lt;/em&gt;): before, behind, below, beside&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Next, I created heatmaps for each part of speech. The y-axis shows syllable count, the x-axis shows stress position, and color intensity represents the proportion of words for each syllable count. I only included parts of speech with over 1% of the total words, as others had too few to be significant.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.victor42.work/posts/2024-07/ea6d9ff8fee7f0f2477d458be8c4a952.jpg"
loading="lazy"
alt="Heatmaps representing stress positions by syllable length for nouns, verbs, adjectives, and adverbs"
&gt;&lt;/p&gt;
&lt;p&gt;Stress tends to shift towards the end as the syllable count increases. The difference between parts of speech isn&amp;rsquo;t huge, but it&amp;rsquo;s there. For longer words (5+ syllables), adjectives often have stress on the antepenultimate (third-to-last) syllable, nouns tend to have stress further back, and verbs/adverbs have stress further forward.&lt;/p&gt;
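&lt;p&gt;The numbers behind each heatmap panel are just row-normalized counts. Here&amp;rsquo;s a minimal sketch in plain Python, assuming made-up toy rows and hypothetical names (&lt;code&gt;rows&lt;/code&gt;, &lt;code&gt;counts&lt;/code&gt;, &lt;code&gt;proportions&lt;/code&gt;); the notebook on Kaggle uses Pandas and may do this differently.&lt;/p&gt;
&lt;pre tabindex="0"&gt;&lt;code&gt;
```python
from collections import Counter, defaultdict

# Hypothetical toy rows: (syllable count, 1-based stressed syllable).
rows = [(3, 1), (3, 2), (3, 2), (4, 2), (4, 3), (4, 2), (4, 2)]

# Count how often each stress position occurs for each syllable count.
counts = defaultdict(Counter)
for syllables, stress_position in rows:
    counts[syllables][stress_position] += 1

# Normalize each syllable-count row so its cells sum to 1.
proportions = {
    syllables: {pos: n / sum(row.values()) for pos, n in sorted(row.items())}
    for syllables, row in sorted(counts.items())
}
print(proportions)
```
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Each inner dictionary is one heatmap row: its values are what the color intensity would encode for that syllable count.&lt;/p&gt;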
&lt;h3 id="rules-of-stress-position"&gt;Rules of Stress Position
&lt;/h3&gt;&lt;p&gt;It was time to test my hypothesis.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.victor42.work/posts/2024-07/da8aadd06591c811ed2f67ee0b15503d.jpg"
loading="lazy"
alt="Table showing the dataframe with a new column added to test the stress position hypothesis"
&gt;&lt;/p&gt;
&lt;p&gt;I analyzed 4- and 5-syllable words, adding a column showing the difference between the actual and hypothesized (third-to-last) stress positions. A &amp;lsquo;0&amp;rsquo; means a match, &amp;lsquo;1&amp;rsquo; means one syllable later, &amp;lsquo;-1&amp;rsquo; one syllable earlier, etc.&lt;/p&gt;
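&lt;p&gt;The deviation column can be sketched in a few lines of plain Python. This is only an illustration under my own assumptions: the toy rows, the 1-based stress positions, and the names &lt;code&gt;deviation&lt;/code&gt; and &lt;code&gt;match_share&lt;/code&gt; are hypothetical, and the real notebook works on a Pandas dataframe instead.&lt;/p&gt;
&lt;pre tabindex="0"&gt;&lt;code&gt;
```python
from collections import Counter

# Hypothetical toy rows: (word, syllable count, 1-based stressed syllable).
words = [
    ("analysis", 4, 2),
    ("analytical", 5, 3),
    ("education", 4, 3),
    ("television", 4, 1),
    ("university", 5, 3),
]

def deviation(syllables, stress_position):
    """Actual stress position minus the hypothesized third-to-last one."""
    hypothesized = syllables - 2  # third-to-last syllable, counted from 1
    return stress_position - hypothesized

# Tally deviations: 0 is a match, 1 is one syllable later, -1 one earlier.
deviations = Counter(deviation(n, s) for _, n, s in words)
match_share = deviations[0] / len(words)
print(dict(deviations), f"{match_share:.1%}")
```
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With these five toy words, three match the hypothesis, one is stressed a syllable later, and one a syllable earlier.&lt;/p&gt;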
&lt;p&gt;&lt;img src="https://cdn.victor42.work/posts/2024-07/2695209758cd7525a2d0e71e4dbb4f85.jpg"
loading="lazy"
alt="Text snippet showing the percentage of words matching the author’s stress position hypothesis"
&gt;&lt;/p&gt;
&lt;p&gt;The hypothesis held for 43.9% of the words.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.victor42.work/posts/2024-07/5740e6b95198a01806d2831c73cbd1f3.jpg"
loading="lazy"
alt="Bar chart illustrating the deviation of actual stress positions from the predicted ones"
&gt;&lt;/p&gt;
&lt;p&gt;This bar chart shows the stress deviation. The largest group of words follows the rule, with some shifted by one syllable. Very few are further off. It kind of looks like a normal distribution (but I&amp;rsquo;m no stats expert).&lt;/p&gt;
&lt;p&gt;Then I wondered: could this be generalized? Does it also apply to words with more than 5 syllables? I broadened the filter to include all words with over 3 syllables:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.victor42.work/posts/2024-07/6048650203a8efe7f09b9d6b3cc270c6.jpg"
loading="lazy"
alt="Text output showing the adjusted percentage of words matching the hypothesis after expanding the sample"
&gt;&lt;/p&gt;
&lt;p&gt;43.92% fit. Almost no change.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.victor42.work/posts/2024-07/7baa190c8f4aeb3fd58ede643840201d.jpg"
loading="lazy"
alt="Bar chart illustrating the deviation of actual stress positions from the predicted ones after sample expansion"
&gt;&lt;/p&gt;
&lt;p&gt;The deviation pattern remained. The antepenultimate syllable is the most common stress position, and the penultimate is also frequent. Combined, they account for 78.84%. It&amp;rsquo;s not a perfect fit, but the general trend is confirmed.&lt;/p&gt;
&lt;h2 id="conclusion"&gt;Conclusion
&lt;/h2&gt;&lt;p&gt;Here&amp;rsquo;s a recap of the findings regarding phonetics and stress:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Fewer syllables mean more words.&lt;/li&gt;
&lt;li&gt;Words with 5+ syllables are rare in everyday use.&lt;/li&gt;
&lt;li&gt;The longest word found has 11 syllables.&lt;/li&gt;
&lt;li&gt;Stress generally shifts towards the end in longer words.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Louder vowels are more likely to be stressed.&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Part of speech has a minor effect on stress.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Most long words are stressed on the antepenultimate or penultimate syllable (78.84%).&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id="afterword"&gt;Afterword
&lt;/h2&gt;&lt;p&gt;Five minutes of analysis, two hours of data prep – seriously.&lt;/p&gt;
&lt;p&gt;Visualization took only half a day. Data preparation, especially fetching phonetic transcriptions via the dictionary API, took the longest. The script ran on and off for over two weeks; I even finished writing this before the dictionary lookup was done, using placeholders for the data.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;m happy the results confirmed my hypothesis. After this, I doubt I&amp;rsquo;ll ever forget English stress rules – it&amp;rsquo;s my own research, after all.&lt;/p&gt;
&lt;p&gt;This project refreshed my Pandas skills, taught me batched requests and incremental saving, showed me how to integrate AI into analysis, helped me write effective Python data visualization prompts, and deepened my understanding of English phonetics. A huge win, and totally worth it!&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;Thanks to:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a class="link" href="https://www.kaggle.com/datasets/bwandowando/479k-english-words/versions/5" target="_blank" rel="noopener"
&gt;Word data source&lt;/a&gt;: This 300k+ word list was the base of my analysis.&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://dictionaryapi.dev/" target="_blank" rel="noopener"
&gt;Free Dictionary API&lt;/a&gt;: This provided an inexpensive way to get phonetic transcriptions.&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://poe.com/Gemini-1.5-Flash" target="_blank" rel="noopener"
&gt;Gemini 1.5 Flash&lt;/a&gt;: Helped with about half the data prep and all the visualizations.&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://chatgpt.com/" target="_blank" rel="noopener"
&gt;GPT-4o&lt;/a&gt;: Helped accurately ID vowels in stressed syllables.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The full analysis and code are open-sourced on Kaggle. Check it out if you&amp;rsquo;re interested:&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="https://www.kaggle.com/code/victorcheng42/stress-distribution-of-english-words" target="_blank" rel="noopener"
&gt;https://www.kaggle.com/code/victorcheng42/stress-distribution-of-english-words&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The dataset with phonetic transcriptions, syllable counts, and stress positions is also public. It might be useful for other analyses:&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="https://www.kaggle.com/datasets/victorcheng42/english-words-with-stress-position-analyzed" target="_blank" rel="noopener"
&gt;https://www.kaggle.com/datasets/victorcheng42/english-words-with-stress-position-analyzed&lt;/a&gt;&lt;/p&gt;</description></item><item><title>We Only Learn the Intersection of Two Languages</title><link>https://victor42.eth.limo/post-en/3628/</link><pubDate>Tue, 17 Jan 2023 15:09:00 +0000</pubDate><author>hi@victor42.work (Victor42)</author><guid>https://victor42.eth.limo/post-en/3628/</guid><description>&lt;img src="https://cdn.victor42.work/posts/2023-01/ansiyt83yi4nf84.jpg" alt="Featured image of post We Only Learn the Intersection of Two Languages" /&gt;&lt;p&gt;I was looking at the word &amp;ldquo;stem&amp;rdquo; the other day, and it got me thinking. Language learning can be a breeze, or it can be a real head-scratcher. We don&amp;rsquo;t learn the &lt;em&gt;whole&lt;/em&gt; language; we learn the overlap between it and our native tongue.&lt;/p&gt;
&lt;p&gt;Take &amp;ldquo;stem,&amp;rdquo; for example. In the Cambridge Dictionary, as a noun, it&amp;rsquo;s usually a plant&amp;rsquo;s stem or a wine glass stem. Basically, the central supporting structure.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.victor42.work/posts/2023-01/Snipaste_2023-01-17_09-47-25.jpg"
loading="lazy"
alt="Cambridge Dictionary stem noun screenshot, title stem noun [C] (CENTRAL PART), definition a central part of something from which other parts can develop or grow, with images of rose stem and champagne glass foot"
&gt;&lt;/p&gt;
&lt;p&gt;As a verb, it means to stop something bad from spreading, or more literally, to stop a flow, like stemming bleeding.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.victor42.work/posts/2023-01/Snipaste_2023-01-17_09-53-41.jpg"
loading="lazy"
alt="Cambridge Dictionary stem verb screenshot, title stem verb [T], definition to stop something unwanted from spreading or increasing, example These measures are designed to stem the rise of violent crime, and to stop the flow of a liquid such as blood"
&gt;&lt;/p&gt;
&lt;p&gt;There are other, rarer meanings, but let&amp;rsquo;s put those aside.&lt;/p&gt;
&lt;p&gt;Native Chinese speakers might be thinking, &amp;ldquo;Ugh, another one of &lt;em&gt;those&lt;/em&gt; words? Multiple, seemingly unrelated meanings?&amp;rdquo;&lt;/p&gt;
&lt;p&gt;But, in my experience, when English words seem odd, it&amp;rsquo;s usually &lt;em&gt;us&lt;/em&gt; missing something. There&amp;rsquo;s probably a historical link we don&amp;rsquo;t grasp because of our cultural background.&lt;/p&gt;
&lt;p&gt;So, let&amp;rsquo;s think in English. If &amp;ldquo;stem&amp;rdquo; is the main support, could it apply to a wind turbine? It kinda looks like a wine glass, right?&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.victor42.work/posts/2023-01/5cd92e9a89c9b.jpg"
loading="lazy"
alt="Wind turbine structure diagram with labeled components including rotating blades/gearbox/brake valve/nacelle/generator/tower/base/power supply system, tower supporting all upper structure"
&gt;&lt;/p&gt;
&lt;p&gt;The tower supports everything above, and there&amp;rsquo;s a base. Seems like a slam dunk. Can we call the tower a &amp;ldquo;stem&amp;rdquo;?&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.victor42.work/posts/2023-01/Snipaste_2023-01-17_09-55-57.jpg"
loading="lazy"
alt="Google Search results for wind turbine stem screenshot, showing about 4.43 million results, STEM highlighted in red pointing to STEM education concept, including Wind power STEM challenge and Build a wind turbine - STEM Learning"
&gt;&lt;/p&gt;
&lt;p&gt;Nope. Searching &amp;ldquo;wind turbine&amp;rdquo; and &amp;ldquo;stem&amp;rdquo; mostly turns up STEM (Science, Technology, Engineering, and Math) education. So, no dice.&lt;/p&gt;
&lt;p&gt;Okay, back to biology. Can a mushroom&amp;rsquo;s stalk be a &amp;ldquo;stem&amp;rdquo;?&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.victor42.work/posts/2023-01/Snipaste_2023-01-17_09-57-09.jpg"
loading="lazy"
alt="Mushroom structure hand-drawn diagram labeling Cap/Gills/Ring Skirt/Stem Stalk/Sack Volva/Mycelium six parts, Stem/Stalk highlighted with red line pointing to stalk"
&gt;&lt;/p&gt;
&lt;p&gt;Yep! It can also be a &amp;ldquo;stalk,&amp;rdquo; but the point is, native English speakers &lt;em&gt;do&lt;/em&gt; see &amp;ldquo;stem&amp;rdquo; as a support, and the meaning stretches.&lt;/p&gt;
&lt;p&gt;So, how are the noun and verb connected? I hit up an etymology site. I also found a less common meaning: a ship&amp;rsquo;s bow. This nautical term, though obscure, is the key.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.victor42.work/posts/2023-01/Snipaste_2023-01-17_10-58-17.jpg"
loading="lazy"
alt="Wikipedia Stem (ship) entry screenshot, definition The stem is the most forward part of a boat or ship’s bow and is an extension of the keel itself, with image of ancient wooden ship bow"
&gt;&lt;/p&gt;
&lt;p&gt;Here&amp;rsquo;s the gist from the etymology site. The image below should be pretty self-explanatory.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.victor42.work/posts/2023-01/Snipaste_2023-01-17_11-02-55.jpg"
loading="lazy"
alt="Online Etymology Dictionary stem entry screenshot, noun stem (n.) traced to Old English stemn/stefn and Proto-Germanic *stamniz, verb stem (v.1) meaning to hold back from early 14th century Scandinavian, verb stem (v.2) meaning make headway by sailing from late 14th century"
&gt;&lt;/p&gt;
&lt;p&gt;The noun comes from Proto-Germanic, with relatives in Old Saxon, Old Norse, Danish, Swedish, etc. It goes back to the Proto-Indo-European root *sta-, meaning &amp;ldquo;to stand,&amp;rdquo; or &amp;ldquo;be firm.&amp;rdquo; &amp;ldquo;Stable&amp;rdquo; might be a cousin. It evolved to mean &amp;ldquo;support,&amp;rdquo; like a plant stem. The wine glass stem sense popped up around 1835.&lt;/p&gt;
&lt;p&gt;The verb form has nautical roots. In the early 1300s, it meant &amp;ldquo;to withstand&amp;rdquo; in Nordic languages, like withstanding waves. For a ship, that&amp;rsquo;s like &amp;ldquo;staying stable.&amp;rdquo; By the late 1300s, it meant both the bow and to point the bow. Makes sense: a ship&amp;rsquo;s bow must be angled to handle waves and stay steady.&lt;/p&gt;
&lt;p&gt;So, &amp;ldquo;stem&amp;rdquo; (main structure) and &amp;ldquo;stem&amp;rdquo; (to stop) connect through &amp;ldquo;staying stable.&amp;rdquo; The verb isn&amp;rsquo;t about totally wiping out something bad, but holding the line and preventing things from getting worse. Think: &amp;ldquo;stem the rise in violent crime,&amp;rdquo; &amp;ldquo;stem the tide of resignations,&amp;rdquo; &amp;ldquo;stem the bleeding&amp;rdquo; (you can&amp;rsquo;t entirely &amp;ldquo;stop&amp;rdquo; blood flow).&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.victor42.work/posts/2023-01/Snipaste_2023-01-17_09-53-41.jpg"
loading="lazy"
alt="Cambridge Dictionary stem verb screenshot highlighting example sentences about stemming crime, resignations, and blood flow"
&gt;&lt;/p&gt;
&lt;p&gt;Two seemingly unrelated concepts in Chinese might be one idea for English speakers. Ask them why the word has two meanings, and they might look at you funny: &amp;ldquo;It&amp;rsquo;s just &lt;em&gt;one&lt;/em&gt; meaning!&amp;rdquo; They&amp;rsquo;re not mashing together two Chinese concepts, but grasping a concept that&amp;rsquo;s absent in Chinese.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.victor42.work/posts/2023-01/ansiyt83yi4nf84.jpg"
loading="lazy"
alt="Venn diagram showing native language and foreign language concept overlap, left red circle labeled native language, right blue circle labeled foreign language, center purple overlap labeled what you learn of foreign language"
&gt;&lt;/p&gt;
&lt;p&gt;Here&amp;rsquo;s the thing: when we learn a foreign language, we map its concepts onto our own. The ones that match up fall into the overlap, and we think we&amp;rsquo;ve got it. The ones that &lt;em&gt;don&amp;rsquo;t&lt;/em&gt; match, the ones outside our native language&amp;rsquo;s scope, stay out of reach. We only learn the overlap.&lt;/p&gt;
&lt;p&gt;To really get to native-like fluency, we have to venture beyond that overlap, into the foreign language&amp;rsquo;s turf, and wrestle with concepts that don&amp;rsquo;t exist in our mother tongue. Many &amp;ldquo;issues&amp;rdquo; in the overlap might not even be issues in the foreign language&amp;rsquo;s world. Stepping into that world isn&amp;rsquo;t rocket science, but it takes serious effort, and there are no shortcuts.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;Back to the mushroom: its stalk can be a &amp;ldquo;stem&amp;rdquo; or a &amp;ldquo;stalk.&amp;rdquo; What&amp;rsquo;s the deal?&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.victor42.work/posts/2023-01/Snipaste_2023-01-17_09-57-09.jpg"
loading="lazy"
alt="Mushroom structure hand-drawn diagram labeling Cap/Gills/Ring Skirt/Stem Stalk/Sack Volva/Mycelium six parts, Stem/Stalk highlighted with red line pointing to stalk"
&gt;&lt;/p&gt;
&lt;p&gt;Dictionaries show their Chinese translations are pretty much the same. In biology, there&amp;rsquo;s a slight difference:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.victor42.work/posts/2023-01/Snipaste_2023-01-17_10-07-25.jpg"
loading="lazy"
alt="OED dictionary screenshot explaining botanical difference between stem and stalk, Botanists and arborists will usually use stem to refer to a slender portion of the plant, while stalk refers to something more substantial, often the main upright load-bearing portion"
&gt;&lt;/p&gt;
&lt;p&gt;But &amp;ldquo;stalk&amp;rdquo; is also a verb, with a totally different meaning. There&amp;rsquo;s probably another rabbit hole there, like with &amp;ldquo;stem.&amp;rdquo; I haven&amp;rsquo;t gone down it yet, so feel free to fill me in.&lt;/p&gt;
&lt;p&gt;Anyway, that&amp;rsquo;s the lowdown on language learning. Trying to go deep in a foreign language is like Usain Bolt suddenly finding himself underwater – going 1 m/s might be a struggle.&lt;/p&gt;</description></item><item><title>Why isn't there a word for "ten thousand" in English?</title><link>https://victor42.eth.limo/post-en/3601/</link><pubDate>Sun, 19 Apr 2020 12:06:00 +0000</pubDate><author>hi@victor42.work (Victor42)</author><guid>https://victor42.eth.limo/post-en/3601/</guid><description>&lt;p&gt;Consider this about number units: English separates large numbers with commas, advancing in thousands—million, billion, trillion. There&amp;rsquo;s no single word for &amp;ldquo;ten thousand.&amp;rdquo; Chinese, however, uses units of ten thousand (万, 亿, 兆&amp;hellip;). We use &amp;ldquo;million&amp;rdquo; (百万) more now, but that&amp;rsquo;s recent, due to handling larger figures. &amp;ldquo;Million&amp;rdquo; is a combination, not a base unit like &amp;ldquo;ten thousand.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;It&amp;rsquo;s curious. We have distinct words for smaller place values: ones, tens, hundreds, thousands, ten thousands. Why not for larger numbers? We simply didn&amp;rsquo;t need them! Daily life, particularly in ancient times, rarely required such large numbers.&lt;/p&gt;
&lt;p&gt;Rulers, however, dealt with massive figures. Inventing a word for &lt;em&gt;every&lt;/em&gt; place value would be impractical. The solution? Use the largest common unit as a base. This avoids new concepts and simplifies comparisons. Within the same order of magnitude, the specific unit is less important. Large differences are clear from the unit, and smaller ones remain manageable.&lt;/p&gt;
&lt;p&gt;This hints at a difference in scale between the ancient Chinese and English-speaking worlds, reflected in geography, population, and agriculture. It&amp;rsquo;s well-known, but it might be the core reason for the East-West difference in number units today.&lt;/p&gt;</description></item><item><title>The Texting Experience</title><link>https://victor42.eth.limo/post-en/3533/</link><pubDate>Sun, 11 Dec 2016 01:20:29 +0000</pubDate><author>hi@victor42.work (Victor42)</author><guid>https://victor42.eth.limo/post-en/3533/</guid><description>&lt;img src="https://cdn.victor42.work/posts/2016-12/12-09/dribbble.png" alt="Featured image of post The Texting Experience" /&gt;&lt;p&gt;&lt;em&gt;Image from Dribbble.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;The title might be misleading. I&amp;rsquo;m not discussing the UX of messaging apps, but the reading experience of chat content.&lt;/p&gt;
&lt;h2 id="number-ocd"&gt;Number OCD
&lt;/h2&gt;&lt;p&gt;My supervisor recently asked for my phone and ID numbers for some paperwork. We were chatting on DingTalk. I replied &amp;ldquo;OK&amp;rdquo; and sent:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;×× (My Name)
Phone: 186××××××××
ID: 360103××××××××××××&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I stared at the message and thought I could do better. So, I resent it:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;×× (My Name)
Phone: 186 ×××× ××××
ID: 360 103 ×××× ×××× ××××&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I mentioned it was easier to read. My supervisor quipped, &amp;ldquo;OCD kicking in again, eh?&amp;rdquo; I replied, a bit pretentiously, &amp;ldquo;User experience is everywhere,&amp;rdquo; followed by a grinning emoji.&lt;/p&gt;
&lt;p&gt;That was that. But since I&amp;rsquo;d mentioned UX, I figured I&amp;rsquo;d explore it further. It wasn&amp;rsquo;t just me being nitpicky. My initial message wasn&amp;rsquo;t exactly user-friendly.&lt;/p&gt;
&lt;p&gt;Formatting a reply is a design task, tied to the user and goal. The user was clear: my supervisor on DingTalk mobile. But the goal? I hadn&amp;rsquo;t asked. She needed the numbers for documents, but she wouldn&amp;rsquo;t be preparing them herself. She&amp;rsquo;d pass the info along. How? Jot it down or forward it? That&amp;rsquo;s a big difference!&lt;/p&gt;
&lt;h3 id="writing-it-down"&gt;Writing it Down
&lt;/h3&gt;&lt;p&gt;If she was writing it down, it&amp;rsquo;d likely be the old-fashioned &amp;ldquo;read-memorize-write&amp;rdquo; method. I can&amp;rsquo;t control the writing, but the reading and memorizing depend on my formatting.&lt;/p&gt;
&lt;p&gt;Research suggests people can only remember about 7 digits at a time. Anything longer needs chunking. We remember five- and seven-character poems. Nine-character poems exist, but they&amp;rsquo;re rare. Qu Yuan&amp;rsquo;s &lt;a class="link" href="http://baike.baidu.com/item/%E7%A6%BB%E9%AA%9A/1045" target="_blank" rel="noopener"
&gt;&lt;em&gt;Li Sao&lt;/em&gt;&lt;/a&gt; is an exception, but even there, most meaningful content stays within 7 characters, thanks to the modal particle &amp;ldquo;兮&amp;rdquo;.&lt;/p&gt;
&lt;p&gt;Seven is the &lt;em&gt;limit&lt;/em&gt;, though, not the ideal. Think about verification codes: usually 4 or 6 digits. We can recall 4 digits easily, but 6-digit codes get broken into 3+3. This suggests the sweet spot for easy recall is under 6 characters. China&amp;rsquo;s 11-digit phone numbers are commonly read as 3+4+4. We say, &amp;ldquo;Call my 186 number.&amp;rdquo; Online, the middle 4 digits are often masked. We also tend to remember the last 4 digits. This shows how ingrained this grouping is. ID numbers aren&amp;rsquo;t usually split up visually, but they have inherent sections: 6 (region) + 4 (year) + 4 (month/day) + 4 (last four). That&amp;rsquo;s likely how most people memorize them.&lt;/p&gt;
&lt;p&gt;As an aside, is the magic number 4 or 5? I lean towards 4, though I lack hard proof. But the examples above hint at it. Bank card numbers, too: different lengths, but when deliberately grouped, they never exceed 4 digits per chunk.&lt;/p&gt;
&lt;p&gt;However you format a long number string, that&amp;rsquo;s how the recipient will read and memorize it. We should all offer this courtesy to each other.&lt;/p&gt;
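&lt;p&gt;If you ever need to do this grouping automatically, it&amp;rsquo;s a few lines of code. A minimal Python sketch, assuming a hypothetical helper called &lt;code&gt;chunk&lt;/code&gt; and made-up numbers:&lt;/p&gt;
&lt;pre tabindex="0"&gt;&lt;code&gt;
```python
def chunk(digits, sizes):
    """Split a digit string into space-separated groups of the given sizes."""
    if sum(sizes) != len(digits):
        raise ValueError("group sizes must add up to the number of digits")
    groups, start = [], 0
    for size in sizes:
        groups.append(digits[start:start + size])
        start += size
    return " ".join(groups)

# Made-up numbers for illustration only.
print(chunk("18612345678", (3, 4, 4)))            # 11-digit mobile number
print(chunk("360103199001011234", (6, 4, 4, 4)))  # 18-digit ID number
```
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The group sizes are passed in rather than hard-coded, since the right split depends on the kind of number.&lt;/p&gt;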
&lt;h3 id="forwarding-on-mobile"&gt;Forwarding on Mobile
&lt;/h3&gt;&lt;p&gt;Back to the point. If the numbers were to be forwarded and copied into a system, things change entirely.&lt;/p&gt;
&lt;p&gt;I couldn&amp;rsquo;t know if the system handled spaces. Pasting the &amp;ldquo;easy-read&amp;rdquo; format into a length-limited field might get it cut off at &amp;ldquo;186 ×××× ××&amp;rdquo;. Also, my supervisor, on Android, couldn&amp;rsquo;t use clipboard tools like Pin. Extracting the numbers would be a hassle.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.victor42.work/posts/2016-12/12-09/1.jpg"
loading="lazy"
alt="DingTalk chat screen with copy menu popped up on a message containing personal info"
&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Mobile IM often forces you to copy the entire message.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;So, for copying from IM, the best format is:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Name:
××
Phone:
186××××××××
ID:
360103××××××××××××&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This reminds me of my WeChat public account. I mostly just post articles, so I set up an auto-reply directing people to my Weibo.&lt;/p&gt;
&lt;p&gt;For a while, the auto-reply just said: &amp;ldquo;I don&amp;rsquo;t check this account often. Contact me via private message on Sina Weibo: @我_ColaChan.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;Then I messaged myself. It was a pain to copy just the nickname. So I changed it: &amp;ldquo;I don&amp;rsquo;t check this account often. Contact me on Sina Weibo. Reply &amp;lsquo;Weibo&amp;rsquo; for my username.&amp;rdquo; Replying &amp;ldquo;Weibo&amp;rdquo; triggered a message with just &amp;ldquo;@我_ColaChan&amp;rdquo;.&lt;/p&gt;
&lt;p&gt;An extra step, but much easier to extract the information.&lt;/p&gt;
&lt;h2 id="eliminating-typos"&gt;Eliminating Typos
&lt;/h2&gt;&lt;p&gt;Text chat isn&amp;rsquo;t just about numbers. Everyday conversation is key. What defines &amp;ldquo;good&amp;rdquo; or &amp;ldquo;bad&amp;rdquo; here?&lt;/p&gt;
&lt;p&gt;In middle school, we didn&amp;rsquo;t have cell phones. We chatted on QQ via computer. A classmate once said chatting with me was reassuring. Why? Because I never made typos.&lt;/p&gt;
&lt;p&gt;Thinking back, it&amp;rsquo;s true. Life&amp;rsquo;s faster now, and with auto-suggestions, typos happen. But attitude matters. I proofread my messages and always fix typos.&lt;/p&gt;
&lt;p&gt;Many people don&amp;rsquo;t check their messages. They don&amp;rsquo;t check after typing, or even &lt;em&gt;during&lt;/em&gt;. They just fire off a message. Even if they spot a mistake, they often can&amp;rsquo;t be bothered to fix it, assuming the other person will get it. This leads to gibberish like &amp;ldquo;Enai&amp;rdquo; (should be &amp;ldquo;En Ai,&amp;rdquo; meaning &amp;ldquo;love&amp;rdquo;) or &amp;ldquo;Bu hui ni o&amp;rdquo; (should be &amp;ldquo;Bu hui you ni o,&amp;rdquo; meaning &amp;ldquo;won&amp;rsquo;t have you&amp;rdquo;). Misspelled keywords require serious guesswork, even considering homophones and keyboard layouts. I&amp;rsquo;ve dealt with printers and developers whose messages are incredibly hard to decipher. Sure, being busy is understandable. But typo-free chat is a better experience for everyone.&lt;/p&gt;
&lt;h2 id="language-is-serious"&gt;Language is Serious
&lt;/h2&gt;&lt;p&gt;I&amp;rsquo;ve gotten messages like this before, a jumbled mess:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;are you there
Does UI need hand-drawing?
Can&amp;rsquo;t do it without hand-drawing?
No response to resume Is it not enough experience
What are the ui specifications do I need to look at both ios andriod
Are you there are you there、
What to do without a portfolio?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;That&amp;rsquo;s not verbatim, but it captures the essence. Missing punctuation, misused punctuation, spaces instead of commas, extra spaces, mixed Chinese and English punctuation, misspelled words, misused words, no clear topics&amp;hellip; It&amp;rsquo;s a catalog of common communication errors.&lt;/p&gt;
&lt;p&gt;Language&amp;rsquo;s main purpose is communication. It&amp;rsquo;s the agreed-upon system for expressing concepts. Ignoring language norms is like disconnecting from that system. It&amp;rsquo;s a big deal. Even in casual texts, I think it&amp;rsquo;s important to use &amp;ldquo;的,&amp;rdquo; &amp;ldquo;地,&amp;rdquo; and &amp;ldquo;得&amp;rdquo; correctly. These details are often overlooked. It&amp;rsquo;s not about language purity; it&amp;rsquo;s about making things easier for the reader. Standard language helps.&lt;/p&gt;
&lt;h2 id="the-mindset-of-writing-a-press-release"&gt;The Mindset of Writing a Press Release
&lt;/h2&gt;&lt;p&gt;Think of your messages like press releases. Unless you&amp;rsquo;re just shooting the breeze with a close friend, there&amp;rsquo;s usually a point.&lt;/p&gt;
&lt;p&gt;The jumbled message above, besides being imprecise, suffers from scattered topics. How do you even answer that? If you&amp;rsquo;re confused and need help, write a clear request for help. The example above isn&amp;rsquo;t even an outline.&lt;/p&gt;
&lt;p&gt;Clear messages have structure. Start with a sentence stating the topic, then elaborate, point by point. If you&amp;rsquo;re informing someone, state the key facts. If you need something, explain why, and ideally, offer a solution. If you&amp;rsquo;re reporting a problem, give enough details for troubleshooting.&lt;/p&gt;
&lt;p&gt;When friends ask for computer help, they often just say, &amp;ldquo;My computer&amp;rsquo;s broken, help!&amp;rdquo; And then they wait for me to ask questions. I wish just &lt;em&gt;once&lt;/em&gt; someone would proactively tell me the error message, whether it&amp;rsquo;s happened before, when it started, what they did before and after, what they tried, and what the results were.&lt;/p&gt;
&lt;p&gt;Imagine a robbery. The police arrive, and the victim just keeps saying, &amp;ldquo;I&amp;rsquo;ve been robbed! Catch the thief!&amp;rdquo; The case won&amp;rsquo;t get solved.&lt;/p&gt;
&lt;p&gt;Focused conversations are efficient. A ten-minute explanation can drag on for an hour due to poor communication. Wasting someone&amp;rsquo;s time is a cardinal sin.&lt;/p&gt;
&lt;h2 id="modern-big-character-posters"&gt;Modern Big-Character Posters
&lt;/h2&gt;&lt;p&gt;China has a thing for slogan banners and posters. For urban planning: &amp;ldquo;Gather all forces, plan water management, build a harmonious city, promote the water town image, and establish a legacy.&amp;rdquo; For construction safety: &amp;ldquo;Safety creates happiness, negligence brings pain. Safety is efficiency, safety is happiness.&amp;rdquo; For hospitals: &amp;ldquo;Create a safe hospital, build harmonious doctor-patient relations.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;Let&amp;rsquo;s not even get into the slogans themselves. The point is, the people behind these didn&amp;rsquo;t consider their audience or tone. A slogan near a military area was actually good: &amp;ldquo;Obey the Party&amp;rsquo;s command, be able to win battles, and have a good work style.&amp;rdquo; It&amp;rsquo;s hierarchical and logical. Most importantly, it&amp;rsquo;s clear and unambiguous.&lt;/p&gt;
&lt;p&gt;News and official outlets use vague language to be inclusive and cover all bases. But this isn&amp;rsquo;t just a media thing. We&amp;rsquo;ve all encountered people who write in an overly formal or flowery style at work. Think of those landing pages: a confusing illustration with shopping carts and money flying everywhere, and text like, &amp;ldquo;Enjoy endless discounts.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;I remember one ad clearly. I forget the brand, but it showed traditional soy sauce making. The spokesperson, standing by a field of drying soybeans, said plainly, &amp;ldquo;Just dry it here, just rely on the sun.&amp;rdquo; A less direct approach might have been: &amp;ldquo;XX hectares of soybean processing, natural air-drying.&amp;rdquo; That&amp;rsquo;s uninspiring. No matter how accurate or fancy, it lacks imagery.&lt;/p&gt;
&lt;p&gt;You can see the ad&amp;rsquo;s directness here: &lt;a class="link" href="http://t.cn/RcxcZ3I" target="_blank" rel="noopener"
&gt;http://t.cn/RcxcZ3I&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;It reminds me of a joke with my classmates:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;ldquo;The rolling Yangtze River flows eastward&amp;hellip;&amp;rdquo;
&amp;ldquo;Get to the point!&amp;rdquo;
&amp;ldquo;The river flows east!&amp;rdquo;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2 id="topic-guardian"&gt;Topic Guardian
&lt;/h2&gt;&lt;p&gt;After I started working, someone commented on my chat style again, saying I was &amp;ldquo;chatting as if my life depended on it.&amp;rdquo; They explained that with others, it&amp;rsquo;s a back-and-forth. With me, they&amp;rsquo;d see me &amp;ldquo;typing&amp;rdquo; for ages, sometimes over ten minutes. They&amp;rsquo;d return from getting water to find a massive, multi-paragraph message from me, addressing every tangent from the earlier conversation.&lt;/p&gt;
&lt;p&gt;So, I do have that habit! I don&amp;rsquo;t let topics die; I need closure. I can see how this would be tiring in casual chats. I don&amp;rsquo;t &lt;em&gt;want&lt;/em&gt; to be like this. I&amp;rsquo;d prefer to stick to one thing. But once the conversation derails, even if it&amp;rsquo;s not my fault, I feel compelled to keep it going. If the other person is fine with this style, I&amp;rsquo;m the one who ends up exhausted.&lt;/p&gt;
&lt;p&gt;Long messages have pros and cons. The downside is making people wait. But the upside is preventing further tangents. If, mid-reply, something reminds the other person of something else, they might interrupt, creating more branches. It&amp;rsquo;s very common.&lt;/p&gt;
&lt;p&gt;This is a dilemma. Wasting time is bad, so shouldn&amp;rsquo;t I avoid long waits? But if I don&amp;rsquo;t control the topics, forgotten points might need revisiting later, which &lt;em&gt;also&lt;/em&gt; wastes time. In text chat, prioritizing the other person&amp;rsquo;s experience means choosing the less time-consuming option. Letting topics explode seems like a lesser evil.&lt;/p&gt;
&lt;h2 id="oh-hehe-grin"&gt;Oh, Hehe, [Grin]
&lt;/h2&gt;&lt;p&gt;These are the worst replies, the ultimate conversation killers. Why? They&amp;rsquo;re short and meaningless, yes. But the real problem is they don&amp;rsquo;t reflect the sender&amp;rsquo;s &lt;em&gt;state&lt;/em&gt;. They replied, but didn&amp;rsquo;t actually &lt;em&gt;respond&lt;/em&gt;. You don&amp;rsquo;t know if they understood; they could have been typing those replies mindlessly.&lt;/p&gt;
&lt;p&gt;It&amp;rsquo;s like sending an email with no loading indicator or confirmation. The compose window stays open. You close it, and the email&amp;rsquo;s nowhere: not in Sent, Outbox, Drafts, Inbox, Junk, &lt;em&gt;or&lt;/em&gt; Spam. WTF?&lt;/p&gt;
&lt;p&gt;It sounds ridiculous, but these conversations happen all the time. An Android developer asked me for an asset. I asked how he planned to use it: fixed size or a 9-patch (.9.png)? He replied, &amp;ldquo;Okay.&amp;rdquo; I thought he&amp;rsquo;d hit send accidentally. But 30 seconds later there was still nothing, not even a &amp;ldquo;typing&amp;rdquo; indicator. The topic died, forcing me to start a new round of questions.&lt;/p&gt;
&lt;p&gt;Tech people &lt;em&gt;should&lt;/em&gt; understand feedback. The TCP three-way handshake is a prime example. The client sends a SYN to the server: &amp;ldquo;I want to connect.&amp;rdquo; The server replies with a SYN-ACK: &amp;ldquo;Is this what you sent? Is it you?&amp;rdquo; The client confirms with an ACK: &amp;ldquo;Yes, it&amp;rsquo;s me, let&amp;rsquo;s connect.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;Humans are good at context, machines less so. But even with context, clear feedback is crucial. At the very least, reply with &amp;ldquo;OK&amp;rdquo; or &amp;ldquo;Received.&amp;rdquo; If there&amp;rsquo;s a choice, repeat the chosen option.&lt;/p&gt;
&lt;h2 id="conclusion"&gt;Conclusion
&lt;/h2&gt;&lt;p&gt;Looking back at my IM interactions, there&amp;rsquo;s a clear divide. Some people are a breeze to communicate with; others make you want to just call. The same task, expressed differently in text, leads to vastly different experiences.&lt;/p&gt;
&lt;p&gt;Experience design is everywhere, and it&amp;rsquo;s practical. Strip away the methodologies, and you&amp;rsquo;re left with one core principle: Put yourself in the other person&amp;rsquo;s shoes.&lt;/p&gt;</description></item></channel></rss>