<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Data Analysis on Victor42</title><link>https://victor42.eth.limo/tags/data-analysis/</link><description>Recent content in Data Analysis on Victor42</description><generator>Hugo -- gohugo.io</generator><language>en</language><managingEditor>hi@victor42.work (Victor42)</managingEditor><webMaster>hi@victor42.work (Victor42)</webMaster><lastBuildDate>Fri, 05 Jul 2024 22:33:00 +0000</lastBuildDate><atom:link href="https://victor42.eth.limo/tags/data-analysis/index.xml" rel="self" type="application/rss+xml"/><item><title>I Did a Deep Dive into English Word Stress...</title><link>https://victor42.eth.limo/post-en/3651/</link><pubDate>Fri, 05 Jul 2024 22:33:00 +0000</pubDate><author>hi@victor42.work (Victor42)</author><guid>https://victor42.eth.limo/post-en/3651/</guid><description>&lt;img src="https://cdn.victor42.work/posts/2024-07/ea6d9ff8fee7f0f2477d458be8c4a952.jpg" alt="Featured image of post I Did a Deep Dive into English Word Stress..." /&gt;&lt;p&gt;&lt;strong&gt;Target audience: English learners, data analysis enthusiasts, Python coders, and my friends.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;This is my first data analysis project. I&amp;rsquo;ve been teaching myself data science for over a year, picking up skills along the way, but I hadn&amp;rsquo;t tackled a real-world project. During my studies, the words &amp;lsquo;analyze,&amp;rsquo; &amp;lsquo;analysis,&amp;rsquo; and &amp;lsquo;analytical&amp;rsquo; kept appearing. The stress placement is unpredictable (&amp;lsquo;analyze, a&amp;rsquo;nalysis, ana&amp;rsquo;lytical) – a real headache! It turned reading into a tongue-twisting exercise.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.victor42.work/posts/2024-07/70c28efdcd37e6d4a143ff2df66084be.jpg"
loading="lazy"
alt="Four cognate English words Analyze, Analyst, Analysis, and Analytical with stress positions marked by apostrophes, showing stress shifting from the first syllable progressively backward"
&gt;&lt;/p&gt;
&lt;p&gt;Some claim there are rules for stress, but they&amp;rsquo;re often lengthy and complex. Others say there are too many exceptions. However, even with those three words, a pattern &lt;em&gt;does&lt;/em&gt; emerge. English seems to avoid three unstressed syllables in a row and tends to place stress near the beginning. For words with five or fewer syllables, the stress often lands on the antepenultimate (third-to-last) syllable.&lt;/p&gt;
&lt;p&gt;It makes sense, doesn&amp;rsquo;t it? Three unstressed syllables in a row would be monotonous. Stress adds rhythm. It&amp;rsquo;s like driving on a straight road – you&amp;rsquo;ll likely doze off. Placing stress too late would also hinder comprehension. Imagine a long word with emphasis on the very last syllable – you&amp;rsquo;d likely miss the meaning!&lt;/p&gt;
&lt;p&gt;To illustrate, consider Mandarin Chinese. It has a significant flaw: the word &amp;ldquo;不&amp;rdquo; (bù, &amp;ldquo;not&amp;rdquo;). Both the consonant and vowel are faint, especially in rapid speech. The vowel becomes even weaker. You often can&amp;rsquo;t discern if someone even &lt;em&gt;uttered&lt;/em&gt; &amp;ldquo;不&amp;rdquo;! This creates a major communication problem, as it distinguishes between two opposite meanings. When my daughter cries, I struggle to understand if she&amp;rsquo;s saying &amp;ldquo;要&amp;rdquo; (yào, &amp;ldquo;want&amp;rdquo;) or &amp;ldquo;不要&amp;rdquo; (bù yào, &amp;ldquo;don&amp;rsquo;t want&amp;rdquo;).&lt;/p&gt;
&lt;p&gt;Back to English stress. My theory seemed reasonable, but I needed evidence. As a data-science novice, I decided to get my hands dirty and see how many words actually followed this pattern.&lt;/p&gt;
&lt;h2 id="research-plan"&gt;Research Plan
&lt;/h2&gt;&lt;p&gt;Having learned data analysis, I formed the research plan quickly. It involved collecting, cleaning, analyzing, and visualizing data. Neither regression analysis nor prediction was necessary.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.victor42.work/posts/2024-07/7486fc8650cedd8b8b4f7816e9af7e0d.jpg"
loading="lazy"
alt="Kaggle Notebook preview showing the raw English word dataset, listed alphabetically from a, aa, aaa to aardvark, truncated due to large file size"
&gt;&lt;/p&gt;
&lt;p&gt;Here&amp;rsquo;s the plan, which my existing skillset was sufficient for:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Find a comprehensive word list.&lt;/li&gt;
&lt;li&gt;Find a free, batch method for obtaining phonetic transcriptions from an online dictionary.&lt;/li&gt;
&lt;li&gt;Determine the syllable count and stress position for each word (possibly with AI assistance).&lt;/li&gt;
&lt;li&gt;Analyze the distribution of stress positions and visualize the findings.&lt;/li&gt;
&lt;li&gt;Test my hypothesis.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Let&amp;rsquo;s dive in.&lt;/p&gt;
&lt;h2 id="data-source"&gt;Data Source
&lt;/h2&gt;&lt;p&gt;I found a dataset on &lt;a class="link" href="https://www.kaggle.com/" target="_blank" rel="noopener"
&gt;Kaggle&lt;/a&gt;, a popular data science community. It&amp;rsquo;s a simple .txt file containing over 300,000 English words, listed alphabetically, one per line:&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="https://www.kaggle.com/datasets/bwandowando/479k-english-words" target="_blank" rel="noopener"
&gt;https://www.kaggle.com/datasets/bwandowando/479k-english-words&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.victor42.work/posts/2024-07/035173524c2057e2515c255add081cea.jpg"
loading="lazy"
alt="A preview of the raw English word list starting with the letter A in Kaggle"
&gt;&lt;/p&gt;
&lt;p&gt;The .txt file is 4MB, comparable to a million-word novel.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.victor42.work/posts/2024-07/6d8b49da96f58a5292d53296bf7966ba.jpg"
loading="lazy"
alt="Pandas dataframe info showing more than three hundred and sixty-nine thousand rows of words"
&gt;&lt;/p&gt;
&lt;p&gt;I created a Kaggle code project, imported the dataset, read all the words, and obtained a table with 369,652 rows and 1 column.&lt;/p&gt;
&lt;h2 id="getting-the-pronunciation"&gt;Getting the Pronunciation
&lt;/h2&gt;&lt;p&gt;The table only contained words. For rigorous research, I needed phonetic transcriptions.&lt;/p&gt;
&lt;p&gt;Fortunately, I discovered a free online dictionary API: &lt;a class="link" href="https://dictionaryapi.dev/" target="_blank" rel="noopener"
&gt;https://dictionaryapi.dev/&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Now, I had to look up each of those 300,000+ words. Naturally, I&amp;rsquo;d write code to automate this.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.victor42.work/posts/2024-07/5c311b367a15d50faa8f53f724821a54.jpg"
loading="lazy"
alt="JSON response from the free dictionary API for the word hello"
&gt;&lt;/p&gt;
&lt;p&gt;The API returned more than just phonetics; it included audio, etymology, parts of speech, meanings, and examples. The useful components were the phonetics, etymology, and part of speech. However, etymology was mostly missing, so I extracted only the phonetics and part of speech.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.victor42.work/posts/2024-07/12f254a9769f985b4cacc3b3992a7577.jpg"
loading="lazy"
alt="Code snippet of the free dictionary API rate limiter setting a limit of 450 requests per 5 minutes"
&gt;&lt;/p&gt;
&lt;p&gt;The sheer data volume posed a challenge. The API documentation didn&amp;rsquo;t specify request limits, but I found it in &lt;a class="link" href="https://github.com/meetDeveloper/freeDictionaryAPI/blob/master/app.js" target="_blank" rel="noopener"
&gt;their GitHub code&lt;/a&gt;: 450 requests every 5 minutes. For 369,652 words, even running non-stop, it would take 369652 / 450 * 5 / 60 = 68.45 hours – almost 3 days!&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.victor42.work/posts/2024-07/4a9c399f7966ab61cf767f7712e209d9.jpg"
loading="lazy"
alt="CSV chunk files saved in the Kaggle working directory during batch processing"
&gt;&lt;/p&gt;
&lt;p&gt;Alright, three days it was. But I had to adjust my strategy. I added a function to chunk queries and save results periodically. Every 1,000 rows, I&amp;rsquo;d save to a sequentially numbered file. I&amp;rsquo;d then continue querying based on the sequence number. Finally, I&amp;rsquo;d merge all 300+ files.&lt;/p&gt;
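&lt;p&gt;The chunk-and-resume idea can be sketched roughly as follows. This is an illustration, not the notebook&amp;rsquo;s actual code: the &lt;code&gt;lookup&lt;/code&gt; callable stands in for the dictionary API request, and the &lt;code&gt;chunk_0000.csv&lt;/code&gt; naming scheme is an assumption. Rate limiting and the final merge are left out.&lt;/p&gt;

```python
import csv
from pathlib import Path

CHUNK_SIZE = 1000  # rows per output file, as described in the post


def chunk_path(out_dir, index):
    """Hypothetical naming scheme: chunk_0000.csv, chunk_0001.csv, ..."""
    return Path(out_dir) / f"chunk_{index:04d}.csv"


def process_in_chunks(words, lookup, out_dir, chunk_size=CHUNK_SIZE):
    """Query words chunk by chunk, skipping chunks that were already saved.

    `lookup` is any callable that returns a result for a word, or None
    when the word is not found. It stands in for the real API call.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for start in range(0, len(words), chunk_size):
        path = chunk_path(out_dir, start // chunk_size)
        if path.exists():
            # This chunk was saved by an earlier run, so resume past it.
            continue
        rows = []
        for word in words[start:start + chunk_size]:
            result = lookup(word)
            if result is not None:
                rows.append((word, result))
        with path.open("w", newline="", encoding="utf-8") as handle:
            csv.writer(handle).writerows(rows)
        written.append(path)
    return written
```

&lt;p&gt;Because each chunk maps to a fixed file name, rerunning the function after an interruption picks up at the first missing file instead of starting over.&lt;/p&gt;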
&lt;p&gt;&lt;img src="https://cdn.victor42.work/posts/2024-07/22b28704556d17baf1c0c141d5ae3e96.jpg"
loading="lazy"
alt="Spreadsheet showing merged English words with phonetic symbols and parts of speech"
&gt;&lt;/p&gt;
&lt;p&gt;It turned out that most of the 300,000+ words were obscure and not found in the API. I only got results for roughly 100 out of every 1,000 words. The file above contains only 92 rows.&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="https://wordsrated.com/how-many-words-are-in-the-english-language/" target="_blank" rel="noopener"
&gt;Linguistic research&lt;/a&gt; indicates that 3,000 English words cover 95% of everyday usage, and 1,000 cover 89%. &lt;a class="link" href="https://wordcounter.io/blog/how-many-words-does-the-average-person-know" target="_blank" rel="noopener"
&gt;Another study&lt;/a&gt; shows that the average adult has an active vocabulary of about 20,000 words and a passive one of 40,000. Thus, only about 1/10 of the dataset is relevant, which is reasonable.&lt;/p&gt;
&lt;h2 id="data-cleaning"&gt;Data Cleaning
&lt;/h2&gt;&lt;p&gt;&lt;img src="https://cdn.victor42.work/posts/2024-07/82acc141ccd3150e4bf0fd08ae292149.jpg"
loading="lazy"
alt="Python code showing the mapping dictionary for uncommon phonetic symbol replacements"
&gt;&lt;/p&gt;
&lt;p&gt;After merging, I found the dictionary&amp;rsquo;s phonetic symbols were inconsistent, containing uncommon symbols like &lt;code&gt;ɘ&lt;/code&gt;, &lt;code&gt;ɝ&lt;/code&gt;, &lt;code&gt;ɚ&lt;/code&gt;, &lt;code&gt;ɨ&lt;/code&gt;, &lt;code&gt;ʉ&lt;/code&gt;. These represent subtle pronunciation variations, roughly equivalent to standard sounds. I had to replace them; otherwise, they&amp;rsquo;d disrupt syllable counting and subsequent analysis.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.victor42.work/posts/2024-07/9d9304e6642b5df50354c06d739eea1d.jpg"
loading="lazy"
alt="Python code showing the mapping rules to merge phonetically identical but graphically different common vowels"
&gt;&lt;/p&gt;
&lt;p&gt;Besides unusual symbols, there were many phonetically identical but differently written symbols, like &lt;code&gt;əu/əʊ&lt;/code&gt; and &lt;code&gt;ai/aɪ&lt;/code&gt;. These also required merging. Each line in the image signifies replacing the first symbol with the second, leaving bracketed symbols untouched.&lt;/p&gt;
&lt;p&gt;Some words differ significantly between British and American English. I prioritized American English conventions.&lt;/p&gt;
&lt;p&gt;Numerous unconventional spellings existed. Over- or under-replacement could easily cause phonetic errors. I wrote a temporary checker, manually consulted the &lt;a class="link" href="https://dictionary.cambridge.org/us/dictionary/english/" target="_blank" rel="noopener"
&gt;Cambridge Dictionary&lt;/a&gt;, and refined my replacements. This took time.&lt;/p&gt;
&lt;p&gt;After processing, the vowel symbols were cleaner. For &amp;ldquo;anthropomorphic&amp;rdquo;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Before: &lt;code&gt;[ˌæ̃n̪θɹ̠əpəˈmɔɹ̠fɪ̈k]&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;After: &lt;code&gt;[ˌæn̪θɹ̠əpəˈmɔːfɪk]&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I didn&amp;rsquo;t handle consonant symbols, as they were irrelevant to my goal, and that&amp;rsquo;s a more complex issue.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.victor42.work/posts/2024-07/627162599344331488dc70237ce660a6.jpg"
loading="lazy"
alt="API JSON response showing incorrect and incomplete phonetic transcription for the word abacus"
&gt;&lt;/p&gt;
&lt;p&gt;Later, I discovered some inaccuracies in the dictionary API. For instance, &amp;ldquo;abacus&amp;rdquo; was transcribed as /-saɪ/? Nonsense! The information was incomplete.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.victor42.work/posts/2024-07/f4f3ef7e088114e942d95246bf273902.jpg"
loading="lazy"
alt="Text output showing the count and percentage of words with incomplete phonetic data"
&gt;&lt;/p&gt;
&lt;p&gt;I calculated this occurred in 0.55% of all words – a small fraction. The incomplete transcriptions seemed random, lacking commonality, so I filtered them out. I&amp;rsquo;m now analyzing a sample, not the complete data. However, the sample is large enough to be representative, allowing the research to proceed.&lt;/p&gt;
&lt;h2 id="analyzing-phonetic-transcriptions-ai"&gt;Analyzing Phonetic Transcriptions (AI)
&lt;/h2&gt;&lt;p&gt;This step entails counting syllables from phonetic transcriptions and identifying the stressed syllable using the &lt;code&gt;ˈ&lt;/code&gt; mark.&lt;/p&gt;
&lt;p&gt;I aimed for a shortcut by deploying an AI model on Kaggle. AI should excel at language, right?&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.victor42.work/posts/2024-07/c77ef4414f82188785924057cfe3bc34.jpg"
loading="lazy"
alt="Kaggle models page showing search results for text-based large language models"
&gt;&lt;/p&gt;
&lt;p&gt;I tested several text-based models but encountered obstacles:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Large models wouldn&amp;rsquo;t run:&lt;/strong&gt; Among Kaggle&amp;rsquo;s deployable open-source models, Llama3 70b could accurately determine syllable count and stress position. ChatGPT, Claude, and even GPT-3.5 could also do it. Language seems to be a strength of large language models. The issue? Kaggle&amp;rsquo;s free tier can&amp;rsquo;t run such large models.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Small models were inadequate:&lt;/strong&gt; Kaggle&amp;rsquo;s two free T4 GPUs can handle smaller models in the 7b to 8b range, like Llama3 8b, Gemma 7b, and Qwen2 7b. However, these smaller models, on Kaggle or elsewhere, couldn&amp;rsquo;t reliably perform the task.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;I refined prompts, guiding the AI step-by-step, and provided examples:&lt;/p&gt;
&lt;pre tabindex="0"&gt;&lt;code&gt;&amp;lt;task&amp;gt;
your task is to count how many syllables there are in an English word. list them all then count. finally answer which syllable the stress falls on(tell me the number). answer **EXACTLY** in the example format.
&amp;lt;example&amp;gt;
word: analysis
phonetic transcription: /əˈnælɪsɪs/
syllables:
1. ə
2. &amp;#39;næ
3. lɪ
4. sɪs
syllables count: 4
stress position: 2
final conclusion: &amp;lt;&amp;lt;&amp;lt;2/4&amp;gt;&amp;gt;&amp;gt;
&amp;lt;word&amp;gt;
analytical /æn.əˈlɪt.ə.kəl/
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;But the smaller models kept failing. Perhaps they weren&amp;rsquo;t capable. Phonetic symbols are vastly different from standard English letters, almost a separate, niche language for AI.&lt;/p&gt;
&lt;p&gt;This experience highlighted a key point: these open-source small models cluster around 7 billion parameters likely because that&amp;rsquo;s the upper limit for running on specific GPUs. In this era of constrained computing, GPUs dictate the scale.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.victor42.work/posts/2024-07/3a5d9b8fcbd23a0d5487891310921f63.jpg"
loading="lazy"
alt="Google Sheets interface showing GPT formulas applied to analyze word stress"
&gt;&lt;/p&gt;
&lt;p&gt;Was AI a dead end? I then considered a workaround: Google Sheets with an AI plugin. I could input the phonetic data into Sheets, write a prompt in the adjacent cell (including the word and transcription), and use a formula from an &lt;a class="link" href="https://workspace.google.com/u/1/marketplace/app/gpt_for_sheets_and_docs/677318054654" target="_blank" rel="noopener"
&gt;AI plugin&lt;/a&gt; to generate the result. This plugin, powered by GPT-3.5, could handle the task. The classic Excel drag-down trick would then populate the entire column.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.victor42.work/posts/2024-07/81f435b62db92e70d47f0d77841e5703.jpg"
loading="lazy"
alt="Cost estimator page showing the estimated cost for calling the GPT plugin in Google Sheets"
&gt;&lt;/p&gt;
&lt;p&gt;The plugin&amp;rsquo;s pricing was reasonable, around 90 RMB for my data volume. However, I was unsure if it could handle tens of thousands of AI generations simultaneously. Debugging and regenerating could double the cost, making it risky.&lt;/p&gt;
&lt;h2 id="analyzing-phonetic-transcriptions-algorithm"&gt;Analyzing Phonetic Transcriptions (Algorithm)
&lt;/h2&gt;&lt;p&gt;Okay, no more AI—I&amp;rsquo;d handle it myself. Counting syllables and locating stress? An algorithm could do that, and more reliably. Here’s the approach, using &lt;code&gt;analytical /æn.əˈlɪt.ə.kəl/&lt;/code&gt; as an example:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Create a set of all vowels: &lt;code&gt;ɑaæɒʌəɛeɪiɔoʊuʉɜ&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Remove slashes, parentheses, spaces, and dots: &lt;code&gt;/æn.əˈlɪt.ə.kəl/&lt;/code&gt; becomes &lt;code&gt;ænəˈlɪtəkəl&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Iterate through &lt;code&gt;ænəˈlɪtəkəl&lt;/code&gt;, checking against the vowel set. Counting vowels: &lt;code&gt;æ&lt;/code&gt;, &lt;code&gt;ə&lt;/code&gt;, &lt;code&gt;ɪ&lt;/code&gt;, &lt;code&gt;ə&lt;/code&gt;, &lt;code&gt;ə&lt;/code&gt; yields 5 syllables.&lt;/li&gt;
&lt;li&gt;Split by the stress mark &lt;code&gt;ˈ&lt;/code&gt;: &lt;code&gt;ænəˈlɪtəkəl&lt;/code&gt; becomes &lt;code&gt;ænə&lt;/code&gt; and &lt;code&gt;lɪtəkəl&lt;/code&gt;. Use the first part, &lt;code&gt;ænə&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Count vowels in &lt;code&gt;ænə&lt;/code&gt; as in step 3: 2 vowels.&lt;/li&gt;
&lt;li&gt;Add 1 to get the stress position: the 3rd syllable.&lt;/li&gt;
&lt;/ol&gt;
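&lt;p&gt;The six steps above can be sketched in a few lines of Python. This is a minimal reconstruction from the description, not the code used in the project; the function names are made up for illustration.&lt;/p&gt;

```python
VOWELS = set("ɑaæɒʌəɛeɪiɔoʊuʉɜ")  # step 1: the vowel set from the post
STRESS_MARK = "ˈ"


def clean(phonetic):
    """Step 2: drop slashes, parentheses, spaces, and dots."""
    for char in "/() .":
        phonetic = phonetic.replace(char, "")
    return phonetic


def count_vowels(text):
    """Step 3: every vowel character counts as one syllable."""
    return sum(1 for char in text if char in VOWELS)


def stress_position(phonetic):
    """Steps 4 to 6: vowels before the stress mark, plus one."""
    before, mark, _ = clean(phonetic).partition(STRESS_MARK)
    if not mark:
        return None  # no primary stress mark in this transcription
    return count_vowels(before) + 1


word = "/æn.əˈlɪt.ə.kəl/"
print(clean(word), count_vowels(clean(word)), stress_position(word))
```

&lt;p&gt;Note that this naive version counts every vowel character separately, so it does not yet handle diphthongs.&lt;/p&gt;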
&lt;p&gt;The logic was clear, so I had AI write the code—a trivial task for it. A few tweaks, and it worked.&lt;/p&gt;
&lt;p&gt;A challenge arose in step 3: diphthongs, triphthongs, and long vowels. For &lt;code&gt;ei&lt;/code&gt;, the algorithm would count &lt;code&gt;e&lt;/code&gt; and &lt;code&gt;i&lt;/code&gt; (2 syllables), but &lt;code&gt;ei&lt;/code&gt; as a diphthong is only one. Triphthongs would be counted as 3.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.victor42.work/posts/2024-07/93fc699338026ae0a224090ea716d17c.jpg"
loading="lazy"
alt="Python code snippet defining sets of monophthongs, diphthongs, and triphthongs"
&gt;&lt;/p&gt;
&lt;p&gt;The algorithm needed adjustment. I created three vowel sets: monophthongs, diphthongs, and triphthongs. The vowel check now involved three passes:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;First pass: Check each character against the monophthong set. This overcounts diphthongs and triphthongs.&lt;/li&gt;
&lt;li&gt;Second pass: Check two characters at a time against the diphthong set. If found, subtract 1 from the syllable count. Importantly, skip the next character after a diphthong to avoid miscounting triphthongs like &lt;code&gt;aɪə&lt;/code&gt; as &lt;code&gt;aɪ&lt;/code&gt; and &lt;code&gt;ɪə&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Third pass: Check three characters at a time against the triphthong set, subtracting 1 if found.&lt;/li&gt;
&lt;/ol&gt;
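&lt;p&gt;Here is a rough sketch of the three passes. The members of &lt;code&gt;DIPHTHONGS&lt;/code&gt; and &lt;code&gt;TRIPHTHONGS&lt;/code&gt; are assumptions (the real sets only appear in the screenshot), and the long vowel marker is left out to keep the example short.&lt;/p&gt;

```python
MONOPHTHONGS = set("ɑaæɒʌəɛeɪiɔoʊuʉɜ")
# Assumed members, for illustration only.
DIPHTHONGS = {"aɪ", "eɪ", "ɔɪ", "aʊ", "əʊ", "oʊ", "ɪə", "eə", "ʊə"}
TRIPHTHONGS = {"aɪə", "aʊə", "eɪə", "ɔɪə", "əʊə"}


def count_syllables(text):
    """Count syllables in a cleaned transcription using three passes."""
    # Pass 1: every vowel character counts once (overcounts on purpose).
    count = sum(1 for char in text if char in MONOPHTHONGS)

    # Pass 2: each diphthong was counted twice, so take one back.
    # After a match, skip the next character so that the tail of a
    # triphthong is not matched again as a second diphthong.
    skip = False
    for i in range(len(text) - 1):
        if skip:
            skip = False
            continue
        if text[i:i + 2] in DIPHTHONGS:
            count -= 1
            skip = True

    # Pass 3: a triphthong is still counted as two, so take one more back.
    for i in range(len(text) - 2):
        if text[i:i + 3] in TRIPHTHONGS:
            count -= 1
    return count


for sample in ("ænəˈlɪtəkəl", "oʊˈeɪsɪs", "ˈfaɪə"):
    print(sample, count_syllables(sample))
```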
&lt;p&gt;This refined algorithm accurately counted syllables. (Note: I treated the long vowel marker &lt;code&gt;ː&lt;/code&gt; as a phonetic character; &lt;code&gt;iː&lt;/code&gt;, &lt;code&gt;ɑː&lt;/code&gt; are handled as diphthongs, &lt;code&gt;iːə&lt;/code&gt;, &lt;code&gt;uːə&lt;/code&gt; as triphthongs, which doesn&amp;rsquo;t affect the outcome.)&lt;/p&gt;
&lt;p&gt;It turns out, for data analysis, technique takes a backseat to domain knowledge. Analyzing English requires understanding it. Digging deeper into phonetics, I hit another snag: triphthong identification is incredibly ambiguous. There&amp;rsquo;s no consensus on whether three vowel symbols together are a triphthong or a monophthong + diphthong. That familiar feeling&amp;hellip; Classic English! No rigid rules.&lt;/p&gt;
&lt;p&gt;Consider &lt;code&gt;fire /ˈfaɪər/&lt;/code&gt;. Some claim &lt;code&gt;aɪə&lt;/code&gt; is one syllable; others say it&amp;rsquo;s &lt;code&gt;aɪ&lt;/code&gt; + &lt;code&gt;ə&lt;/code&gt; (two syllables). Criteria vary wildly. Some use hyphenation (you can&amp;rsquo;t break &amp;ldquo;fire&amp;rdquo; into &amp;ldquo;fi-&amp;rdquo; and &amp;ldquo;re&amp;rdquo; at a line end, so it&amp;rsquo;s a triphthong). Others use singing: if sung as one note, it&amp;rsquo;s a triphthong. In &lt;a class="link" href="https://www.youtube.com/watch?v=dC7Pog3biCk" target="_blank" rel="noopener"
&gt;Simple Plan - Fire In My Heart&lt;/a&gt;, at 0:57, &lt;code&gt;faɪ&lt;/code&gt; and &lt;code&gt;ər&lt;/code&gt; are sung as separate notes, so should it be a diphthong + monophthong?&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.victor42.work/posts/2024-07/d0227a8fc72ffd41ff020f6fceb73b12.jpg"
loading="lazy"
alt="A music video screenshot showing lyrics containing the triphthong word fire"
&gt;&lt;/p&gt;
&lt;p&gt;Oh well, that&amp;rsquo;s English. Given words like &lt;code&gt;oasis /oʊˈeɪsɪs/&lt;/code&gt; (four vowels!), with &lt;code&gt;oʊ&lt;/code&gt; and &lt;code&gt;eɪ&lt;/code&gt; clearly separated by the stress mark (obviously two diphthongs), I disregarded triphthongs, treating them as two syllables. The only remaining &amp;ldquo;triphthongs&amp;rdquo; were diphthongs with a long vowel marker.&lt;/p&gt;
&lt;p&gt;Besides syllable count and stress position, I wanted the stressed vowel itself, potentially for further analysis.&lt;/p&gt;
&lt;p&gt;This was trickier. I discussed it with AI, revealing significant model differences. Gemini 1.5 Flash went in circles. GPT-4o provided the correct code in three conversational rounds (about 10 minutes). Claude 3.5 Sonnet got it right immediately. For coding, a good model is worth the cost, though basic code literacy is essential to understand the AI&amp;rsquo;s code, its functionality, and potential issues.&lt;/p&gt;
&lt;p&gt;Here&amp;rsquo;s the logic, again with &lt;code&gt;analytical /ænəˈlɪtəkəl/&lt;/code&gt;:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Locate the stress mark &lt;code&gt;ˈ&lt;/code&gt; and consider the subsequent part: &lt;code&gt;lɪtəkəl&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Iterate, removing non-vowels until the first vowel: &lt;code&gt;ɪtəkəl&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;The first character is now a vowel. Check the first 3 characters (&lt;code&gt;ɪtə&lt;/code&gt;) against the triphthong set. Nope.&lt;/li&gt;
&lt;li&gt;Check the first 2 (&lt;code&gt;ɪt&lt;/code&gt;) against the diphthong set. Nope.&lt;/li&gt;
&lt;li&gt;Check the first character (&lt;code&gt;ɪ&lt;/code&gt;) against the monophthong set. Found it! That&amp;rsquo;s the stressed vowel.&lt;/li&gt;
&lt;/ol&gt;
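&lt;p&gt;A minimal sketch of that lookup, assuming the transcription has already been cleaned. The diphthong and triphthong sets are again assumptions, and &lt;code&gt;stressed_vowel&lt;/code&gt; is a made-up name rather than the function the AI produced.&lt;/p&gt;

```python
MONOPHTHONGS = set("ɑaæɒʌəɛeɪiɔoʊuʉɜ")
# Assumed members, for illustration only.
DIPHTHONGS = {"aɪ", "eɪ", "ɔɪ", "aʊ", "əʊ", "oʊ", "ɪə", "eə", "ʊə"}
TRIPHTHONGS = {"aɪə", "aʊə", "eɪə", "ɔɪə", "əʊə"}


def stressed_vowel(cleaned):
    """Return the vowel right after the primary stress mark, or None."""
    _, mark, rest = cleaned.partition("ˈ")
    if not mark:
        return None
    for start, char in enumerate(rest):
        if char in MONOPHTHONGS:
            tail = rest[start:]
            # Try the longest candidate first, then fall back.
            if tail[:3] in TRIPHTHONGS:
                return tail[:3]
            if tail[:2] in DIPHTHONGS:
                return tail[:2]
            return tail[:1]
    return None  # no vowel follows the stress mark


print(stressed_vowel("ænəˈlɪtəkəl"))
```

&lt;p&gt;Checking the longest pattern first matters: testing the single character first would always match and would never report a diphthong.&lt;/p&gt;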
&lt;p&gt;&lt;img src="https://cdn.victor42.work/posts/2024-07/ba10765865fa9f86332e78b71807279f.jpg"
loading="lazy"
alt="Spreadsheet detailing English words along with their syllable count and stress positions"
&gt;&lt;/p&gt;
&lt;p&gt;The data table after phonetic analysis. All necessary data was now collected.&lt;/p&gt;
&lt;h2 id="visualization"&gt;Visualization
&lt;/h2&gt;&lt;p&gt;Now for the highlight—not just for deriving useful conclusions, but also because AI shines here. AI is excellent at writing Python visualization code. These tasks are less about reasoning and more about knowing the visualization library&amp;rsquo;s syntax. Even Gemini 1.5 Flash, a non-flagship model I use daily, performs well. I haven&amp;rsquo;t formally learned Seaborn and Matplotlib, but with AI, generating plots is straightforward.&lt;/p&gt;
&lt;p&gt;Of course, &amp;ldquo;straightforward&amp;rdquo; doesn&amp;rsquo;t mean &amp;ldquo;ask and receive.&amp;rdquo; Giving AI a vague request without context leads to failure. I crafted a Python visualization prompt, detailing the task and the data table&amp;rsquo;s structure, enabling the AI to perform with full power and stability.&lt;/p&gt;
&lt;pre tabindex="0"&gt;&lt;code&gt;&amp;lt;Task&amp;gt;
You are a Python data visualizer. You excel at coding with data visualization libraries like Seaborn and Matplotlib. I will tell you about the structure of a Pandas dataframe and the visualization I want. First, you dive deeply into the dataframe and understand what it is all about. Then write Python code to visualize it. Just code, no explanation. Next, you check if the code meets my need. Finally, correct the code if necessary.
&amp;lt;Dataframe&amp;gt;
The dataframe(variable name is df) is {a list of common English words with their phonetic information and part-of-speech}.
Now here are the columns of the dataframe, exactly in the following order:
**word**
- datatype: str
- example: complimentary
- description: the English words
**phonetic**
- datatype: str
- example: /ˌkɒmplɪ̈ˈment(ə)ɹɪ/
- description: the phonetic transcription of the words
**part_of_speech**
- datatype: str(list like)
- example: [&amp;#39;adjective&amp;#39;]
- description: how are these words used in sentences
**syllable_len**
- datatype: int
- example: 5
- description: how many syllables are there in these words
**stress_pos**
- datatype: int
- example: 3
- description: on which syllable the stress falls on, if there are more than one stress, this is the position of the first stress
**stress_syllable**
- datatype: str
- example: e
- description: the vowel of the stressed syllable
&amp;lt;Request&amp;gt;
I want to know the distribution of stress position, grouped by syllable numbers.
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;To use the prompt, just tweak the &lt;code&gt;&amp;lt;Request&amp;gt;&lt;/code&gt; section.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.victor42.work/posts/2024-07/6bf1e239c52df87ca7159c81c23911cd.jpg"
loading="lazy"
alt="Head of the loaded pandas dataframe displaying word phonetic and stress properties"
&gt;&lt;/p&gt;
&lt;p&gt;Some words in the data lack stress marks because they&amp;rsquo;re short, and their phonetic transcriptions don&amp;rsquo;t show stress. Let&amp;rsquo;s filter those out, along with one-syllable words – analyzing stress in those is pointless.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.victor42.work/posts/2024-07/99b768328e8403852edad5bbe1d47def.jpg"
loading="lazy"
alt="Cleaned pandas dataframe info showing twenty-four thousand four hundred and thirty-three entries"
&gt;&lt;/p&gt;
&lt;p&gt;This leaves 24,433 words with complete data.&lt;/p&gt;
&lt;h3 id="syllable-count-analysis"&gt;Syllable Count Analysis
&lt;/h3&gt;&lt;p&gt;Let&amp;rsquo;s break down the syllable counts of these 24,433 words.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.victor42.work/posts/2024-07/e6ded1b89391ef9844e28f8d4342c3da.jpg"
loading="lazy"
alt="Bar chart illustrating the frequency distribution of English word syllable lengths"
&gt;&lt;/p&gt;
&lt;p&gt;Unsurprisingly, fewer syllables mean more words. Languages tend to use up short, easy words first.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.victor42.work/posts/2024-07/9655926ed67e4cb11ee3f8a0ba62cbe0.jpg"
loading="lazy"
alt="Pie chart displaying the percentage distribution of different word syllable lengths"
&gt;&lt;/p&gt;
&lt;p&gt;Two-syllable words make up 48.7%, three-syllable words 31.3%.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.victor42.work/posts/2024-07/20a81644b6c29b8bab1ccc0b79f5e220.jpg"
loading="lazy"
alt="Text statistics showing the cumulative percentages of words with up to four and five syllables"
&gt;&lt;/p&gt;
&lt;p&gt;Words with four or fewer syllables make up 94.73%; five or fewer, 99%.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.victor42.work/posts/2024-07/963d18455de407866b97e9459de20bab.jpg"
loading="lazy"
alt="Syllable analysis showing eleven syllables in the long English word antidisestablishmentarianism"
&gt;&lt;/p&gt;
&lt;p&gt;The longest word has 11 syllables.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.victor42.work/posts/2024-07/79fac98a54c6d574e0c2e29ef224e1dd.jpg"
loading="lazy"
alt="Cambridge Dictionary entry defining the long political word antidisestablishmentarianism"
&gt;&lt;/p&gt;
&lt;p&gt;&amp;ldquo;Antidisestablishmentarianism&amp;rdquo;? Really? Opposition to opposition – double negative much? No wonder it&amp;rsquo;s so long. Could I just add &amp;ldquo;non-&amp;rdquo; to create &amp;ldquo;nonantidisestablishmentarianism&amp;rdquo;?&lt;/p&gt;
&lt;h3 id="syllable-count-vs-stress-position"&gt;Syllable Count vs. Stress Position
&lt;/h3&gt;&lt;p&gt;Statistically, the correlation coefficient between syllable count and stress position is 0.67 – a pretty decent correlation.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.victor42.work/posts/2024-07/de6dd89e6d5f9344dc7788051d2266b0.jpg"
loading="lazy"
alt="Statistical correlation coefficient between syllable length and stress position in English words"
&gt;&lt;/p&gt;
&lt;p&gt;This coefficient ranges from -1 to 1. Near 0 means almost no relationship; near 1, positive correlation (one up, other up); near -1, negative correlation (one up, other down).&lt;/p&gt;
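&lt;p&gt;For readers who want to see what sits behind that number, here is a plain-Python sketch of the Pearson coefficient. I am assuming Pearson is the variant used, since it is the pandas default. The toy lists are invented for illustration and are not the real data.&lt;/p&gt;

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences.

    Assumes both sequences vary; a constant sequence would divide by zero.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Invented toy values: syllable counts and stress positions.
toy_syllables = [2, 2, 3, 3, 4, 5]
toy_stress = [1, 1, 1, 2, 2, 3]
print(round(pearson(toy_syllables, toy_stress), 2))
```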
&lt;p&gt;This is just a first step, showing they&amp;rsquo;re not unrelated. It doesn&amp;rsquo;t explain &lt;em&gt;why&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.victor42.work/posts/2024-07/424a2fdcade241c75ba5a53eabda74ee.jpg"
loading="lazy"
alt="Bubble chart representing the distribution of stress positions across different syllable lengths"
&gt;&lt;/p&gt;
&lt;p&gt;A bubble chart helps. Syllable count is on the y-axis, stress position on the x-axis, and bubble size/color shows the word count. The dots roughly follow a diagonal – more syllables, later stress.&lt;/p&gt;
&lt;p&gt;Bubble charts (or heatmaps) show three dimensions but compare absolute word counts. I care more about stress position distribution &lt;em&gt;within&lt;/em&gt; each syllable count.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.victor42.work/posts/2024-07/8a8e9b114c1ec9758b4c00e62f8be6f6.jpg"
loading="lazy"
alt="Grouped bar charts displaying stress position distributions for each specific syllable length"
&gt;&lt;/p&gt;
&lt;p&gt;Here&amp;rsquo;s a stacked bar chart: syllable count on the y-axis, stress position on the x-axis. Now it&amp;rsquo;s clear: stress shifts right like a wave, clustering around the third-to-last syllable.&lt;/p&gt;
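&lt;p&gt;The numbers behind such a chart are the shares of each stress position within one syllable count. Here is a small sketch of that normalization using only the standard library; the input rows are invented, and &lt;code&gt;stress_shares&lt;/code&gt; is a hypothetical helper rather than the code behind the plot.&lt;/p&gt;

```python
from collections import Counter, defaultdict


def stress_shares(rows):
    """Map each syllable count to the share of words per stress position.

    `rows` is an iterable of (syllable_len, stress_pos) pairs.
    """
    counts = defaultdict(Counter)
    for syllable_len, stress_pos in rows:
        counts[syllable_len][stress_pos] += 1
    shares = {}
    for syllable_len, counter in counts.items():
        total = sum(counter.values())
        shares[syllable_len] = {
            pos: n / total for pos, n in sorted(counter.items())
        }
    return shares


# Invented sample rows, not the real dataset.
sample = [(2, 1), (2, 1), (2, 2), (3, 1), (3, 1), (3, 2), (3, 2)]
for syllable_len, dist in sorted(stress_shares(sample).items()):
    print(syllable_len, dist)
```

&lt;p&gt;Dividing by the total within each group is what makes rows with very different word counts comparable.&lt;/p&gt;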
&lt;h3 id="stressed-syllable-analysis"&gt;Stressed Syllable Analysis
&lt;/h3&gt;&lt;p&gt;&lt;img src="https://cdn.victor42.work/posts/2024-07/a8cbd78d2abfeeb6f6a12e95dee24c99.jpg"
loading="lazy"
alt="Text list of all unique stressed vowel symbols extracted from the dataset"
&gt;&lt;/p&gt;
&lt;p&gt;These are all the vowels in stressed syllables. A couple shouldn&amp;rsquo;t be here, but they&amp;rsquo;re dictionary errors, and too few to matter.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.victor42.work/posts/2024-07/078bec4b5063d84f7f328e910dd61f9a.jpg"
loading="lazy"
alt="Horizontal bar chart showing the frequency ranking of different stressed vowels"
&gt;&lt;/p&gt;
&lt;p&gt;By frequency, louder vowels like &lt;code&gt;æ&lt;/code&gt; and &lt;code&gt;e&lt;/code&gt; are more likely stressed; weaker ones like &lt;code&gt;ə&lt;/code&gt; and &lt;code&gt;ʊ&lt;/code&gt; are less common.&lt;/p&gt;
&lt;h3 id="part-of-speech-analysis"&gt;Part of Speech Analysis
&lt;/h3&gt;&lt;p&gt;Is there a link between part of speech and stress?&lt;/p&gt;
&lt;pre tabindex="0"&gt;&lt;code&gt;All part of speech: [&amp;#39;adjective&amp;#39;, &amp;#39;adverb&amp;#39;, &amp;#39;conjunction&amp;#39;, &amp;#39;interjection&amp;#39;, &amp;#39;noun&amp;#39;, &amp;#39;numeral&amp;#39;, &amp;#39;preposition&amp;#39;, &amp;#39;pronoun&amp;#39;, &amp;#39;propernoun&amp;#39;, &amp;#39;verb&amp;#39;]
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Here&amp;rsquo;s a breakdown of all parts of speech. I&amp;rsquo;m not sure what &amp;ldquo;propernoun&amp;rdquo; is – it&amp;rsquo;s not in my dictionary either. It turns out there are only two, and they don&amp;rsquo;t seem to fit, so I suspect a data glitch with the dictionary API. I&amp;rsquo;ll skip it for now.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.victor42.work/posts/2024-07/627f810c2d8d6b27501d19d8ad6cff43.jpg"
loading="lazy"
alt="Horizontal bar chart showing the distribution of words across various parts of speech"
&gt;&lt;/p&gt;
&lt;p&gt;I ranked the parts of speech by frequency. The big ones are nouns, verbs, adjectives, and adverbs. Nouns account for roughly half the total.&lt;/p&gt;
&lt;p&gt;This gets you thinking about how language evolved. First, you need to describe the world and create concepts – that&amp;rsquo;s where nouns come in. Then, to describe how people and things interact, you need verbs. After that, adjectives and adverbs develop to modify nouns and verbs. So, my guess is the number of words should follow that order.&lt;/p&gt;
&lt;p&gt;But wait – shouldn&amp;rsquo;t the ratio of nouns to adjectives, and verbs to adverbs, be roughly the same? No need to calculate. The bar chart makes it obvious: nouns are more than double the adjectives, and verbs outnumber adverbs almost nine to one. They&amp;rsquo;re way out of proportion.&lt;/p&gt;
&lt;pre tabindex="0"&gt;&lt;code&gt;[&amp;#39;abracadabra&amp;#39;, &amp;#39;absolutely&amp;#39;, &amp;#39;action&amp;#39;, &amp;#39;adieu&amp;#39;, &amp;#39;adios&amp;#39;, &amp;#39;affirmative&amp;#39;, &amp;#39;afternoon&amp;#39;, &amp;#39;ahem&amp;#39;, &amp;#39;alack&amp;#39;, &amp;#39;aloha&amp;#39;, &amp;#39;alright&amp;#39;, &amp;#39;amen&amp;#39;, &amp;#39;amidships&amp;#39;, &amp;#39;arrivederci&amp;#39;, &amp;#39;attaboy&amp;#39;, &amp;#39;attention&amp;#39;, &amp;#39;away&amp;#39;, &amp;#39;banzai&amp;#39;, &amp;#39;bastard&amp;#39;, &amp;#39;beauty&amp;#39;, &amp;#39;begone&amp;#39;, &amp;#39;begorra&amp;#39;, &amp;#39;behold&amp;#39;, &amp;#39;blazes&amp;#39;, &amp;#39;bollocks&amp;#39;, &amp;#39;bonjour&amp;#39;, &amp;#39;bother&amp;#39;, &amp;#39;botheration&amp;#39;, &amp;#39;brother&amp;#39;, &amp;#39;bully&amp;#39;, &amp;#39;bullseye&amp;#39;, &amp;#39;bullshit&amp;#39;, &amp;#39;caramba&amp;#39;, &amp;#39;checkmate&amp;#39;, &amp;#39;cheeses&amp;#39;, &amp;#39;condolences&amp;#39;, &amp;#39;congrats&amp;#39;, &amp;#39;congratulations&amp;#39;, &amp;#39;content&amp;#39;, &amp;#39;cooee&amp;#39;, &amp;#39;curses&amp;#39;, &amp;#39;dammit&amp;#39;, &amp;#39;ecce&amp;#39;, &amp;#39;egad&amp;#39;, &amp;#39;enchanted&amp;#39;, &amp;#39;encore&amp;#39;, &amp;#39;enough&amp;#39;, &amp;#39;eureka&amp;#39;, &amp;#39;exactly&amp;#39;, &amp;#39;farewell&amp;#39;, &amp;#39;fiddlesticks&amp;#39;, &amp;#39;flummery&amp;#39;, &amp;#39;gadzooks&amp;#39;, &amp;#39;gesundheit&amp;#39;, &amp;#39;goddamn&amp;#39;, &amp;#39;goodbye&amp;#39;, &amp;#39;gorblimey&amp;#39;, &amp;#39;gracias&amp;#39;, &amp;#39;gracious&amp;#39;, &amp;#39;greetings&amp;#39;, &amp;#39;hallelujah&amp;#39;, &amp;#39;hardly&amp;#39;, &amp;#39;havoc&amp;#39;, &amp;#39;heavens&amp;#39;, &amp;#39;heyday&amp;#39;, &amp;#39;hola&amp;#39;, &amp;#39;holla&amp;#39;, &amp;#39;honestly&amp;#39;, &amp;#39;hooray&amp;#39;, &amp;#39;hosanna&amp;#39;, &amp;#39;howdy&amp;#39;, &amp;#39;hullo&amp;#39;, 
&amp;#39;hurrah&amp;#39;, &amp;#39;huzzah&amp;#39;, &amp;#39;yeah&amp;#39;, &amp;#39;indeed&amp;#39;, &amp;#39;knickers&amp;#39;, &amp;#39;later&amp;#39;, &amp;#39;mercy&amp;#39;, &amp;#39;morepork&amp;#39;, &amp;#39;morning&amp;#39;, &amp;#39;namaste&amp;#39;, &amp;#39;negative&amp;#39;, &amp;#39;nonsense&amp;#39;, &amp;#39;oyez&amp;#39;, &amp;#39;okay&amp;#39;, &amp;#39;ole&amp;#39;, &amp;#39;pardon&amp;#39;, &amp;#39;peccavi&amp;#39;, &amp;#39;period&amp;#39;, &amp;#39;pity&amp;#39;, &amp;#39;pleasure&amp;#39;, &amp;#39;presto&amp;#39;, &amp;#39;prithee&amp;#39;, &amp;#39;prosit&amp;#39;, &amp;#39;quiet&amp;#39;, &amp;#39;rather&amp;#39;, &amp;#39;really&amp;#39;, &amp;#39;respect&amp;#39;, &amp;#39;result&amp;#39;, &amp;#39;roger&amp;#39;, &amp;#39;rumble&amp;#39;, &amp;#39;sayonara&amp;#39;, &amp;#39;scramble&amp;#39;, &amp;#39;selah&amp;#39;, &amp;#39;shabash&amp;#39;, &amp;#39;shazam&amp;#39;, &amp;#39;silence&amp;#39;, &amp;#39;sorry&amp;#39;, &amp;#39;standard&amp;#39;, &amp;#39;sugar&amp;#39;, &amp;#39;tally&amp;#39;, &amp;#39;tara&amp;#39;, &amp;#39;tarnation&amp;#39;, &amp;#39;tidy&amp;#39;, &amp;#39;timber&amp;#39;, &amp;#39;uncle&amp;#39;, &amp;#39;understood&amp;#39;, &amp;#39;viva&amp;#39;, &amp;#39;vivat&amp;#39;, &amp;#39;voetsek&amp;#39;, &amp;#39;warning&amp;#39;, &amp;#39;welcome&amp;#39;, &amp;#39;whammo&amp;#39;, &amp;#39;whatever&amp;#39;, &amp;#39;wilco&amp;#39;, &amp;#39;wirra&amp;#39;, &amp;#39;zowie&amp;#39;]
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;I listed all the interjections out of curiosity. I don&amp;rsquo;t usually give this part of speech much thought, so I took a closer look. Surprisingly, &amp;ldquo;afternoon&amp;rdquo; is classified as one too! That makes sense, though, since it doubles as a greeting.&lt;/p&gt;
&lt;pre tabindex="0"&gt;&lt;code&gt;[&amp;#39;abaft&amp;#39;, &amp;#39;abeam&amp;#39;, &amp;#39;aboard&amp;#39;, &amp;#39;about&amp;#39;, &amp;#39;above&amp;#39;, &amp;#39;abreast&amp;#39;, &amp;#39;abroad&amp;#39;, &amp;#39;absent&amp;#39;, &amp;#39;across&amp;#39;, &amp;#39;afore&amp;#39;, &amp;#39;after&amp;#39;, &amp;#39;again&amp;#39;, &amp;#39;against&amp;#39;, &amp;#39;agin&amp;#39;, &amp;#39;along&amp;#39;, &amp;#39;alongside&amp;#39;, &amp;#39;aloof&amp;#39;, &amp;#39;alow&amp;#39;, &amp;#39;amid&amp;#39;, &amp;#39;amidst&amp;#39;, &amp;#39;among&amp;#39;, &amp;#39;amongst&amp;#39;, &amp;#39;anent&amp;#39;, &amp;#39;anti&amp;#39;, &amp;#39;around&amp;#39;, &amp;#39;asprawl&amp;#39;, &amp;#39;astraddle&amp;#39;, &amp;#39;astride&amp;#39;, &amp;#39;athwart&amp;#39;, &amp;#39;barring&amp;#39;, &amp;#39;bating&amp;#39;, &amp;#39;because&amp;#39;, &amp;#39;before&amp;#39;, &amp;#39;behind&amp;#39;, &amp;#39;beyond&amp;#39;, &amp;#39;below&amp;#39;, &amp;#39;beneath&amp;#39;, &amp;#39;beside&amp;#39;, &amp;#39;besides&amp;#39;, &amp;#39;between&amp;#39;, &amp;#39;betwixt&amp;#39;, &amp;#39;circa&amp;#39;, &amp;#39;concerning&amp;#39;, &amp;#39;considering&amp;#39;, &amp;#39;contra&amp;#39;, &amp;#39;despite&amp;#39;, &amp;#39;during&amp;#39;, &amp;#39;except&amp;#39;, &amp;#39;excepting&amp;#39;, &amp;#39;failing&amp;#39;, &amp;#39;following&amp;#39;, &amp;#39;forby&amp;#39;, &amp;#39;froward&amp;#39;, &amp;#39;given&amp;#39;, &amp;#39;including&amp;#39;, &amp;#39;inside&amp;#39;, &amp;#39;into&amp;#39;, &amp;#39;minus&amp;#39;, &amp;#39;modulo&amp;#39;, &amp;#39;nearer&amp;#39;, &amp;#39;nearest&amp;#39;, &amp;#39;onto&amp;#39;, &amp;#39;opposite&amp;#39;, &amp;#39;outwith&amp;#39;, &amp;#39;pending&amp;#39;, &amp;#39;regarding&amp;#39;, &amp;#39;regardless&amp;#39;, &amp;#39;respecting&amp;#39;, &amp;#39;rising&amp;#39;, &amp;#39;running&amp;#39;, &amp;#39;saving&amp;#39;, &amp;#39;thorough&amp;#39;, &amp;#39;throughout&amp;#39;, &amp;#39;touching&amp;#39;, 
&amp;#39;toward&amp;#39;, &amp;#39;towards&amp;#39;, &amp;#39;under&amp;#39;, &amp;#39;underneath&amp;#39;, &amp;#39;unlike&amp;#39;, &amp;#39;until&amp;#39;, &amp;#39;upon&amp;#39;, &amp;#39;upside&amp;#39;, &amp;#39;versus&amp;#39;, &amp;#39;wanting&amp;#39;, &amp;#39;within&amp;#39;, &amp;#39;without&amp;#39;]
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;When listing out prepositions, I noticed some recurring prefixes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;a- indicating location or spatial relationship: aboard, across, amid, around&lt;/li&gt;
&lt;li&gt;be- (basically &lt;em&gt;be&lt;/em&gt;): before, behind, below, beside&lt;/li&gt;
&lt;/ul&gt;
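&lt;p&gt;The prefix grouping above can be roughly reproduced in a few lines. This is only a sketch: it checks spelling, not etymology, and the word list is a small sample copied from the output above, not the full list.&lt;/p&gt;

```python
from collections import defaultdict

# A handful of prepositions copied from the output above (not the full list).
prepositions = ["aboard", "across", "amid", "around", "before", "behind",
                "below", "beside", "during", "until"]

# Prefixes to look for; a plain startswith check is an assumption of this sketch.
prefixes = ("a", "be")

groups = defaultdict(list)
for word in prepositions:
    # The first matching prefix wins; words matching none go under "other".
    match = next((p for p in prefixes if word.startswith(p)), "other")
    groups[match].append(word)

for prefix, words in groups.items():
    print(prefix, len(words), words)
```

&lt;p&gt;On the full list, a spelling check like this would also catch words such as &amp;ldquo;after&amp;rdquo; and &amp;ldquo;anti&amp;rdquo;, where the leading a is not really a prefix, so the groups still need a manual look.&lt;/p&gt;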
&lt;p&gt;Next, I created heatmaps for each part of speech. The y-axis shows syllable count, the x-axis shows stress position, and color intensity represents the proportion of words within each syllable count. I only included parts of speech that account for more than 1% of all words; the others had too few words to be meaningful.&lt;/p&gt;
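&lt;p&gt;For readers who want to build a similar table, here is a minimal Pandas sketch of the numbers behind one heatmap. The column names (&lt;code&gt;part_of_speech&lt;/code&gt;, &lt;code&gt;syllables&lt;/code&gt;, &lt;code&gt;stress_pos&lt;/code&gt;) and the rows are made up for illustration, and normalizing each row is my reading of &amp;ldquo;proportion of words for each syllable count&amp;rdquo;.&lt;/p&gt;

```python
import pandas as pd

# Toy rows for illustration only; the real dataset has one row per word.
words = pd.DataFrame({
    "part_of_speech": ["noun", "noun", "noun", "noun", "verb", "verb"],
    "syllables":      [2, 2, 3, 3, 2, 2],
    "stress_pos":     [1, 1, 1, 2, 2, 2],
})

def stress_table(df, pos):
    """Rows: syllable count. Columns: stress position. Values: share within each row."""
    subset = df[df["part_of_speech"] == pos]
    # normalize="index" makes every row (syllable count) sum to 1.
    return pd.crosstab(subset["syllables"], subset["stress_pos"], normalize="index")

noun_table = stress_table(words, "noun")
print(noun_table)
```

&lt;p&gt;A table shaped like this can be handed to any heatmap plotting function, one table per part of speech.&lt;/p&gt;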
&lt;p&gt;&lt;img src="https://cdn.victor42.work/posts/2024-07/ea6d9ff8fee7f0f2477d458be8c4a952.jpg"
loading="lazy"
alt="Heatmaps representing stress positions by syllable length for nouns, verbs, adjectives, and adverbs"
&gt;&lt;/p&gt;
&lt;p&gt;Stress tends to shift towards the end as the syllable count increases. The difference between parts of speech isn&amp;rsquo;t huge, but it&amp;rsquo;s there. For longer words (5+ syllables), adjectives often have stress on the antepenultimate (third-to-last) syllable, nouns tend to have stress further back (later in the word), and verbs/adverbs have stress further forward (earlier in the word).&lt;/p&gt;
&lt;h3 id="rules-of-stress-position"&gt;Rules of Stress Position
&lt;/h3&gt;&lt;p&gt;It was time to test my hypothesis.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.victor42.work/posts/2024-07/da8aadd06591c811ed2f67ee0b15503d.jpg"
loading="lazy"
alt="Table showing the dataframe with a new column added to test the stress position hypothesis"
&gt;&lt;/p&gt;
&lt;p&gt;I analyzed 4- and 5-syllable words, adding a column showing the difference between the actual and hypothesized (third-to-last) stress positions. A &amp;lsquo;0&amp;rsquo; means a match, &amp;lsquo;1&amp;rsquo; means one syllable later, &amp;lsquo;-1&amp;rsquo; one syllable earlier, etc.&lt;/p&gt;
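&lt;p&gt;The deviation column is simple arithmetic, sketched below in Pandas. The column names are hypothetical and the four rows are toy examples; stress positions are counted from 1 at the start of the word.&lt;/p&gt;

```python
import pandas as pd

# Toy rows; column names are hypothetical. stress_pos counts from 1 at the word's start.
df = pd.DataFrame({
    "word":       ["analysis", "analytical", "information", "category"],
    "syllables":  [4, 5, 4, 4],
    "stress_pos": [2, 3, 3, 1],
})

# Keep only 4- and 5-syllable words, as in the first test of the hypothesis.
df = df[df["syllables"].isin([4, 5])].copy()

# Hypothesized position: the third-to-last syllable.
df["expected_pos"] = df["syllables"] - 2

# 0 = match, positive = stress falls later, negative = stress falls earlier.
df["deviation"] = df["stress_pos"] - df["expected_pos"]

match_rate = (df["deviation"] == 0).mean()
print(df)
print(f"match rate: {match_rate:.1%}")
```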
&lt;p&gt;&lt;img src="https://cdn.victor42.work/posts/2024-07/2695209758cd7525a2d0e71e4dbb4f85.jpg"
loading="lazy"
alt="Text snippet showing the percentage of words matching the author’s stress position hypothesis"
&gt;&lt;/p&gt;
&lt;p&gt;The hypothesis held for 43.9% of the words.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.victor42.work/posts/2024-07/5740e6b95198a01806d2831c73cbd1f3.jpg"
loading="lazy"
alt="Bar chart illustrating the deviation of actual stress positions from the predicted ones"
&gt;&lt;/p&gt;
&lt;p&gt;This bar chart shows the stress deviation. The largest group of words follows the rule, with some shifted by one syllable. Very few are further off. It kind of looks like a normal distribution (but I&amp;rsquo;m no stats expert).&lt;/p&gt;
&lt;p&gt;Then I wondered: could this be generalized? Does it apply to words with more than 5 syllables too? I broadened the filter to include all words with more than 3 syllables:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.victor42.work/posts/2024-07/6048650203a8efe7f09b9d6b3cc270c6.jpg"
loading="lazy"
alt="Text output showing the adjusted percentage of words matching the hypothesis after expanding the sample"
&gt;&lt;/p&gt;
&lt;p&gt;43.92% fit. Almost no change.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.victor42.work/posts/2024-07/7baa190c8f4aeb3fd58ede643840201d.jpg"
loading="lazy"
alt="Bar chart illustrating the deviation of actual stress positions from the predicted ones after sample expansion"
&gt;&lt;/p&gt;
&lt;p&gt;The deviation pattern remained. The largest share of words is stressed on the antepenultimate syllable, and many more on the penultimate. Combined, they account for 78.84%. It&amp;rsquo;s not a perfect fit, but the general trend is confirmed.&lt;/p&gt;
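&lt;p&gt;The combined figure is just the sum of two bars in the chart. A minimal sketch with toy deviations (not the real data), where 0 means the antepenultimate syllable and 1 means the penultimate:&lt;/p&gt;

```python
from collections import Counter

# Toy deviations (actual minus third-to-last position); not the real data.
deviations = [0, 0, 0, 1, 1, -1, 2, 0]

counts = Counter(deviations)
total = len(deviations)

# Share of words per deviation value, ordered from earliest to latest stress.
shares = {d: counts[d] / total for d in sorted(counts)}

# Deviation 0 is the antepenultimate syllable, 1 is the penultimate.
combined = shares.get(0, 0) + shares.get(1, 0)
print(shares)
print(f"antepenultimate + penultimate: {combined:.2%}")
```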
&lt;h2 id="conclusion"&gt;Conclusion
&lt;/h2&gt;&lt;p&gt;Here&amp;rsquo;s a recap of the findings regarding phonetics and stress:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Fewer syllables mean more words.&lt;/li&gt;
&lt;li&gt;Words with 5+ syllables are rare in everyday use.&lt;/li&gt;
&lt;li&gt;The longest word found has 11 syllables.&lt;/li&gt;
&lt;li&gt;Stress generally shifts towards the end in longer words.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Louder vowels are more likely to be stressed.&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Part of speech has a minor effect on stress.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Most long words are stressed on the antepenultimate or penultimate syllable (78.84%).&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id="afterword"&gt;Afterword
&lt;/h2&gt;&lt;p&gt;Five minutes of analysis, two hours of data prep – seriously.&lt;/p&gt;
&lt;p&gt;Visualization took only half a day. Data preparation, especially fetching phonetic transcriptions via the dictionary API, took the longest. The script ran on and off for over two weeks; I even finished writing this before the dictionary lookup was done, using placeholders for the data.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;m happy the results confirmed my hypothesis. After this, I doubt I&amp;rsquo;ll ever forget English stress rules – it&amp;rsquo;s my own research, after all.&lt;/p&gt;
&lt;p&gt;This project refreshed my Pandas skills, taught me batched requests and incremental saving, showed me how to integrate AI into analysis, helped me write effective Python data visualization prompts, and deepened my understanding of English phonetics. A huge win, and totally worth it!&lt;/p&gt;
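&lt;p&gt;For anyone curious what batched requests with incremental saving can look like, here is a minimal sketch. It is not the script I ran: &lt;code&gt;fetch_in_batches&lt;/code&gt; and &lt;code&gt;fake_lookup&lt;/code&gt; are hypothetical names, and the lookup function is a stand-in for a real dictionary request.&lt;/p&gt;

```python
import json
import tempfile
from pathlib import Path

def fetch_in_batches(words, lookup, save_path, batch_size=100):
    """Look up words in batches and save progress to disk after every batch."""
    path = Path(save_path)
    # Resume from an earlier run if a partial result file already exists.
    results = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    pending = [w for w in words if w not in results]

    for start in range(0, len(pending), batch_size):
        for word in pending[start:start + batch_size]:
            try:
                results[word] = lookup(word)
            except Exception:
                # Store None so a failing word is not retried on every run.
                results[word] = None
        # Incremental save: a crash loses at most one batch of work.
        path.write_text(json.dumps(results, ensure_ascii=False), encoding="utf-8")
    return results

def fake_lookup(word):
    # Stand-in for a real dictionary request; returns a made-up placeholder.
    return f"/{word}/"

with tempfile.TemporaryDirectory() as folder:
    saved = fetch_in_batches(["analyze", "analysis"], fake_lookup,
                             Path(folder) / "phonetics.json", batch_size=1)
    print(saved)
```

&lt;p&gt;Because finished words are skipped on the next run, a script like this can be stopped and restarted freely, which matters when the lookups take weeks.&lt;/p&gt;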
&lt;hr&gt;
&lt;p&gt;Thanks to:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a class="link" href="https://www.kaggle.com/datasets/bwandowando/479k-english-words/versions/5" target="_blank" rel="noopener"
&gt;Word data source&lt;/a&gt;: This 300k+ word list was the base of my analysis.&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://dictionaryapi.dev/" target="_blank" rel="noopener"
&gt;Free Dictionary API&lt;/a&gt;: This provided an inexpensive way to get phonetic transcriptions.&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://poe.com/Gemini-1.5-Flash" target="_blank" rel="noopener"
&gt;Gemini 1.5 Flash&lt;/a&gt;: Helped with about half the data prep and all the visualizations.&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://chatgpt.com/" target="_blank" rel="noopener"
&gt;GPT-4o&lt;/a&gt;: Helped accurately identify the vowels in stressed syllables.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The full analysis and code are open-sourced on Kaggle. Check it out if you&amp;rsquo;re interested:&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="https://www.kaggle.com/code/victorcheng42/stress-distribution-of-english-words" target="_blank" rel="noopener"
&gt;https://www.kaggle.com/code/victorcheng42/stress-distribution-of-english-words&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The dataset with phonetic transcriptions, syllable counts, and stress positions is also public. It might be useful for other analyses:&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="https://www.kaggle.com/datasets/victorcheng42/english-words-with-stress-position-analyzed" target="_blank" rel="noopener"
&gt;https://www.kaggle.com/datasets/victorcheng42/english-words-with-stress-position-analyzed&lt;/a&gt;&lt;/p&gt;</description></item><item><title>Quantifying Design Value</title><link>https://victor42.eth.limo/post-en/3644/</link><pubDate>Wed, 25 Oct 2023 10:51:00 +0000</pubDate><author>hi@victor42.work (Victor42)</author><guid>https://victor42.eth.limo/post-en/3644/</guid><description>&lt;img src="https://cdn.victor42.work/posts/2023-10/a9a5b3988a8c913ff30d990b21313263.png" alt="Featured image of post Quantifying Design Value" /&gt;&lt;h2 id="the-story"&gt;The Story
&lt;/h2&gt;&lt;p&gt;I recently had a major clash with colleagues in a group chat. Things got heated.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;m a designer, though you wouldn&amp;rsquo;t know it from my posts. I mostly do UI and interaction design, but I also handle data reports and PPTs. Sometimes, I even code and build websites. Our design department acts as a central hub, fielding requests from other departments. I&amp;rsquo;m juggling four projects, two of which are UI projects only I can handle. My schedule&amp;rsquo;s packed.&lt;/p&gt;
&lt;p&gt;Why the fight? My UI work was fully booked, but another department insisted I help optimize a data report (a consumer report on jewelry). It wasn&amp;rsquo;t even advanced data viz, just finding and swapping product images in a PPT, showing it to the client, and swapping them again if they weren&amp;rsquo;t happy.&lt;/p&gt;
&lt;p&gt;I refused. It&amp;rsquo;s intern-level work. I&amp;rsquo;d help if I had the time, but it wasn&amp;rsquo;t jumping the queue. I stood firm. They argued that since I&amp;rsquo;d done it before, I should continue, and the client was pushing. We went at it.&lt;/p&gt;
&lt;p&gt;They ended up finding another designer. Afterward, my manager asked me to share my scheduling method. It seemed like they&amp;rsquo;d complained, but I was booked solid. A company has limited liability; an employee shouldn&amp;rsquo;t have unlimited responsibility, right?&lt;/p&gt;
&lt;p&gt;For now, they can&amp;rsquo;t do much. But to cover my bases, in case they went to the boss, I had a backup plan. I used my work schedule data to calculate time spent on each task, assessed each task&amp;rsquo;s value, and created some charts. The monetary values indicate the salary range of a designer capable of that work.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.victor42.work/posts/2023-10/a9a5b3988a8c913ff30d990b21313263.png"
loading="lazy"
alt="Excel design work composition dashboard with treemap on left showing value tiers 20K&amp;#43;/8-20K/8K in green/orange/gray, business pie chart and value pie chart on right"
&gt;&lt;/p&gt;
&lt;p&gt;It&amp;rsquo;s clear: their department (in brown) has a huge chunk of low-value work – finding and replacing images, adjusting alignment and fonts – and it takes up a ton of my time. The boss cares about cost-effectiveness. Having someone with a 20K+ salary doing intern work? Who knows who&amp;rsquo;d get chewed out.&lt;/p&gt;
&lt;p&gt;The fight&amp;rsquo;s over, and I won&amp;rsquo;t dwell on it. But the data handling and analysis were interesting, so I&amp;rsquo;m documenting it.&lt;/p&gt;
&lt;h2 id="data-source"&gt;Data Source
&lt;/h2&gt;&lt;p&gt;This analysis was possible because I regularly collect data. I organize anything I consider data in a way that&amp;rsquo;s useful later.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.victor42.work/posts/2023-10/b7cb270372f19fd67879c57bd8a7b009.jpg"
loading="lazy"
alt="Mobile calendar view screenshot showing October 2023 design schedule with light green blocks marking project days, bottom showing Design Schedule/Calendar tab"
&gt;&lt;/p&gt;
&lt;p&gt;I created a design schedule with a multi-dimensional table tool. I set the default view to a calendar and put it in my DingTalk signature, so anyone requesting work could see my availability.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.victor42.work/posts/2023-10/65d4321bfa13f03090b90554cad84bd6.png"
loading="lazy"
alt="Excel design schedule data table with 6 columns: Project/Designer/Start Date/End Date/Requester/Duration, showing August-September 2023 project records"
&gt;&lt;/p&gt;
&lt;p&gt;Although I add work items in the calendar view, it&amp;rsquo;s a data table. For easy recording, I kept the fields simple: project name, designer (it&amp;rsquo;s for the whole team), start and end dates, requester, and duration (in days), which is calculated automatically.&lt;/p&gt;
&lt;p&gt;When a new project comes in, I update the schedule immediately. To avoid conflicts, I&amp;rsquo;m motivated to maintain this data table.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.victor42.work/posts/2023-10/5d8953d9788ad3b0997eea965fec52e6.png"
loading="lazy"
alt="Excel bottom worksheet tab bar showing Charts/Value Analysis/Time Analysis/Design Schedule Data four tabs from left to right"
&gt;&lt;/p&gt;
&lt;p&gt;The raw data was ready, containing 40 workdays (nearly 2 months of data). I exported it to Excel, changed the duration from text to numbers, and started a series of analyses (from right to left) to generate the charts.&lt;/p&gt;
&lt;h2 id="time-analysis"&gt;Time Analysis
&lt;/h2&gt;&lt;p&gt;&lt;img src="https://cdn.victor42.work/posts/2023-10/ab35313c1c52dc5c5328490034a68dbd.png"
loading="lazy"
alt="Excel time analysis table with pivot table on left summing duration by requester totaling 40 days, right side manually mapping requesters to business categories"
&gt;&lt;/p&gt;
&lt;p&gt;First, time analysis. This tab has two tables:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;The left table pivots the raw data, showing time spent on each requester.&lt;/li&gt;
&lt;li&gt;The right table maps each requester to major business lines, summarizing the time each line takes up.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;img src="https://cdn.victor42.work/posts/2023-10/b790a28d8fc8fc1ad15ecb4b726112eg.png"
loading="lazy"
alt="Excel PivotTable Fields panel with Designer/Requester/Duration checked, Filters has Designer, Rows has Requester, Values has Sum of Duration"
&gt;&lt;/p&gt;
&lt;p&gt;The left pivot table: filter for a specific designer (me), list each requester as a row, and sum the durations.&lt;/p&gt;
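&lt;p&gt;I did this in Excel, but the same pivot can be sketched in Pandas. The column names follow the schedule table above; the rows, the designer names, and the requester names are made up for illustration.&lt;/p&gt;

```python
import pandas as pd

# Toy schedule rows; names and numbers are made up for illustration.
schedule = pd.DataFrame({
    "Project":   ["App redesign", "Jewelry report", "Landing page", "Jewelry report"],
    "Designer":  ["Victor", "Victor", "Alex", "Victor"],
    "Requester": ["Product", "Research", "Marketing", "Research"],
    "Duration":  [5, 2, 3, 1],
})

# Filter: one designer. Rows: Requester. Values: sum of Duration.
mine = schedule[schedule["Designer"] == "Victor"]
by_requester = mine.pivot_table(index="Requester", values="Duration", aggfunc="sum")
print(by_requester)
```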
&lt;p&gt;&lt;img src="https://cdn.victor42.work/posts/2023-10/ab35313c1c52dc5c5328490034a68dbd.png"
loading="lazy"
alt="Excel time analysis table with pivot table on left and manual business category summation using addition formulas on right"
&gt;&lt;/p&gt;
&lt;p&gt;The right table lists the major business lines, selects corresponding requesters from the left table, and sums the totals.&lt;/p&gt;
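&lt;p&gt;In code, the right table is a lookup plus a sum. A rough Pandas equivalent, with a hypothetical requester-to-business mapping and toy numbers:&lt;/p&gt;

```python
import pandas as pd

# Days per requester, as the left pivot table would report them (toy numbers).
by_requester = pd.Series(
    {"Product": 5, "Research": 3, "Marketing": 2, "Sales": 1}, name="Duration"
)

# Hypothetical mapping from requester to major business line.
business_of = {
    "Product": "Core product",
    "Research": "Reports",
    "Marketing": "Growth",
    "Sales": "Growth",
}

# Passing a dict to groupby groups the rows by the mapped value of each index label.
by_business = by_requester.groupby(business_of).sum()
print(by_business)
```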
&lt;p&gt;&lt;img src="https://cdn.victor42.work/posts/2023-10/0d9aec6a5807c7ba9153da8f20b261a1.png"
loading="lazy"
alt="Excel formula bar showing GETPIVOTDATA function extracting duration data from pivot table for business category calculation"
&gt;&lt;/p&gt;
&lt;p&gt;Pulling values from a pivot table is easy. Click a pivot cell while building a formula and Excel writes the &lt;code&gt;GETPIVOTDATA&lt;/code&gt; function for you, so there&amp;rsquo;s no need for &lt;code&gt;SUMIFS&lt;/code&gt;.&lt;/p&gt;
&lt;h2 id="value-analysis"&gt;Value Analysis
&lt;/h2&gt;&lt;p&gt;Next, I analyzed how well my time was spent.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.victor42.work/posts/2023-10/e0a5d1274532853173f10402d53d9d06.png"
loading="lazy"
alt="Excel value analysis table with 5 tables: Table 1 business duration percentage/Table 2 business value tier percentage/Table 3 pivot/Table 4 multiplication result/Table 5 value summary"
&gt;&lt;/p&gt;
&lt;p&gt;The Value Analysis tab has five tables:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Table 1 is the reshaped right table from Time Analysis.&lt;/li&gt;
&lt;li&gt;Table 2 shows the percentage of each business line&amp;rsquo;s work in different value ranges (manually created).&lt;/li&gt;
&lt;li&gt;Table 3 pivots Table 2 for easier use in Table 4.&lt;/li&gt;
&lt;li&gt;Table 4 multiplies Table 1 and Table 3 to calculate the actual percentage of each work type in different value ranges.&lt;/li&gt;
&lt;li&gt;Table 5 pivots Table 4, summarizing the total percentage of work in different value ranges.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;img src="https://cdn.victor42.work/posts/2023-10/4b7fd4d8f38266dc59903bddfa4dc4d2.png"
loading="lazy"
alt="Excel PivotTable Fields panel with Business and Duration checked, Rows has Business, Values has Sum of Duration"
&gt;&lt;/p&gt;
&lt;p&gt;Table 1: each business line is a row, and durations are summed.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.victor42.work/posts/2023-10/187174f765fd78ba42d098c00b301d92.png"
loading="lazy"
alt="Excel Value Field Settings dialog with Show Values As tab selected, % of Column Total highlighted in blue dropdown"
&gt;&lt;/p&gt;
&lt;p&gt;The key is the format. In the &amp;ldquo;Sum of Duration&amp;rdquo; value field settings, I changed &amp;ldquo;Show Values As&amp;rdquo; to &amp;ldquo;% of Column Total&amp;rdquo; and set the number format to percentage, which gives each business line&amp;rsquo;s share of the total time.&lt;/p&gt;
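&lt;p&gt;Outside Excel, the same percentage is each value divided by the column total. A small Pandas sketch with toy numbers:&lt;/p&gt;

```python
import pandas as pd

# Days per business line (toy numbers, hypothetical line names).
by_business = pd.Series({"Core product": 20, "Reports": 12, "Growth": 8}, name="Duration")

# Each line's days divided by the total, i.e. its share of the column total.
share = by_business / by_business.sum()

# Format as percentages for display only; share itself stays numeric.
print(share.map("{:.1%}".format))
```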
&lt;p&gt;&lt;img src="https://cdn.victor42.work/posts/2023-10/e0a5d1274532853173f10402d53d9d06.png"
loading="lazy"
alt="Excel value analysis final result showing Table 4 business-value cross multiplication and Table 5 value summary with 20K&amp;#43; 44.5%/8-20K 35%/8K 20.5%"
&gt;&lt;/p&gt;
&lt;p&gt;Table 2 is the core, but it&amp;rsquo;s subjective. It&amp;rsquo;s not super rigorous, but good enough for arguments and review. I tried to be fair, assigning value percentages to each business line based on experience. I swear I didn&amp;rsquo;t intentionally undervalue the other department; their vendor-like nature means their low-value work proportion is high. The designer salary ranges for the value tiers are based on my 10+ years of experience.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.victor42.work/posts/2023-10/d68bb255437ef1e63a9386d499ce48e4.png"
loading="lazy"
alt="Excel PivotTable Fields panel with Business/Value/Percentage checked, Rows has Value then Business, Values has Sum of Percentage"
&gt;&lt;/p&gt;
&lt;p&gt;Table 3 pivots Table 2. Rows are grouped by value tier first, then by business line. This layout exists for Table 4: it makes the numbers easier to read and easier to pull into formulas.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.victor42.work/posts/2023-10/e0a5d1274532853173f10402d53d9d06.png"
loading="lazy"
alt="Excel value analysis complete view with 5 tables showing business duration percentage/value tier percentage/pivot/multiplication/value summary"
&gt;&lt;/p&gt;
&lt;p&gt;Table 4 multiplies each business line&amp;rsquo;s share of time (Table 1) by that line&amp;rsquo;s percentage in each value tier (Table 3).&lt;/p&gt;
&lt;p&gt;Table 5 pivots Table 4, summarizing by value.&lt;/p&gt;
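&lt;p&gt;The arithmetic behind Tables 4 and 5 is a row-wise multiplication followed by a column sum. Here is a sketch in Pandas; every number and business line name is made up, and only the tier labels come from my charts.&lt;/p&gt;

```python
import pandas as pd

# Table 1: share of total time per business line (toy numbers).
time_share = pd.Series({"Core product": 0.5, "Reports": 0.3, "Growth": 0.2})

# Tables 2 and 3: how each line's work splits across value tiers; each row sums to 1.
tier_split = pd.DataFrame(
    {"20K+": [0.8, 0.1, 0.5], "8-20K": [0.2, 0.3, 0.3], "8K": [0.0, 0.6, 0.2]},
    index=["Core product", "Reports", "Growth"],
)

# Table 4: scale each row by that line's share of total time.
table4 = tier_split.mul(time_share, axis=0)

# Table 5: total share of time per value tier (column sums of Table 4).
table5 = table4.sum()
print(table4)
print(table5)
```

&lt;p&gt;Since every row of the tier split sums to 1 and the time shares sum to 1, the three totals in Table 5 also sum to 1, which is a handy sanity check.&lt;/p&gt;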
&lt;h2 id="charts"&gt;Charts
&lt;/h2&gt;&lt;p&gt;With the analysis done, it&amp;rsquo;s time for visuals.&lt;/p&gt;
&lt;p&gt;Level 1: Show percentages of each business line and value range, using data from Tables 1 and 5. Create pie charts, add data labels, and adjust colors.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.victor42.work/posts/2023-10/a9a5b3988a8c913ff30d990b21313263.png"
loading="lazy"
alt="Excel design work composition dashboard with treemap showing value tiers in green/orange/gray, business pie chart and value pie chart on right"
&gt;&lt;/p&gt;
&lt;p&gt;Level 2: Show the breakdown of business lines within each value range. Treemaps are best for this two-level hierarchical proportion data. Create a Treemap from Table 4, and adjust background and label colors to match the two charts on the right.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://cdn.victor42.work/posts/2023-10/a85c0e8de3b950ff50c3771a36666c8e.png"
loading="lazy"
alt="Excel Format Data Labels dialog with Label Options showing Category Name and Value checked, Separator set to space"
&gt;&lt;/p&gt;
&lt;p&gt;Enable Treemap labels to show names and values, displaying each business line&amp;rsquo;s detailed percentage.&lt;/p&gt;
&lt;h2 id="afterword"&gt;Afterword
&lt;/h2&gt;&lt;p&gt;With this value analysis system in place, all I have to maintain is the schedule. I import the data, refresh a few pivot tables, and the charts update automatically.&lt;/p&gt;
&lt;p&gt;Even with limited raw data, there&amp;rsquo;s more to analyze: monthly workload saturation, average project cycle for each business line, and value composition fluctuation over a year.&lt;/p&gt;
&lt;p&gt;The fight&amp;rsquo;s over, and I won&amp;rsquo;t bring this up to the boss, but it&amp;rsquo;s interesting that design work can be analyzed with data.&lt;/p&gt;</description></item></channel></rss>