Detecting syllables in a word





I need to find a fairly efficient way to detect syllables in a word. E.g.,



Invisible -> in-vi-sib-le



There are some syllabification rules that could be used:



V
CV
VC
CVC
CCV
CCCV
CVCC



*where V is a vowel and C is a consonant.
E.g.,



Pronunciation (5 Pro-nun-ci-a-tion; CV-CVC-CV-V-CVC)
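To make the C/V notation above concrete, here is a minimal Python sketch that maps a word to its consonant/vowel pattern. The function name `cv_pattern` and the choice to treat only a, e, i, o, u as vowels are assumptions for illustration; a real language would need its own vowel inventory.

```python
VOWELS = set("aeiou")  # assumption: 'y' and digraphs are not treated specially here

def cv_pattern(word: str) -> str:
    """Map each letter to 'V' (vowel) or 'C' (consonant); non-letters are skipped."""
    return "".join(
        "V" if ch in VOWELS else "C"
        for ch in word.lower()
        if ch.isalpha()
    )

print(cv_pattern("Invisible"))      # VCCVCVCCV
print(cv_pattern("Pronunciation"))  # CCVCVCCVVCVVC
```

The hard part is not producing this string but deciding where to cut it into the syllable shapes listed above, which is where the ambiguity comes from.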



I've tried a few methods: regex (which helps only if you want to count syllables), hard-coded rule definitions (a brute-force approach that proved very inefficient), and finally a finite-state automaton (which did not produce anything useful).



The purpose of my application is to create a dictionary of all syllables in a given language. This dictionary will later be used for spell checking applications (using Bayesian classifiers) and text to speech synthesis.



I would appreciate tips on alternative ways to solve this problem, besides the approaches I have already tried.



I work in Java, but any tip in C/C++, C#, Python, Perl... would work for me.





Do you want the actual division points or just the number of syllables in a word? If the latter, consider looking up the words in a text-to-speech dictionary and counting the phonemes that encode vowel sounds.
– Adrian McCarthy
Aug 24 '12 at 22:08





The most efficient way (computation-wise, not storage-wise) would, I guess, be a Python dictionary with words as keys and syllable counts as values. However, you'd still need a fallback for words that aren't in the dictionary. Let me know if you ever find such a dictionary!
– Shule
Jul 29 '14 at 5:33




14 Answers



Read about the TeX approach to this problem for the purposes of hyphenation. In particular, see Frank Liang's thesis, Word Hy-phen-a-tion by Com-put-er. His algorithm is very accurate, and it includes a small exceptions dictionary for cases where the algorithm does not work.
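For readers who want a feel for the pattern-based approach before opening the thesis, here is a minimal Python sketch of the core idea: patterns carry digits between letters, every matching pattern votes on each gap, the highest digit wins, and an odd result marks an allowed break. The pattern list is only the small subset commonly quoted for the word "hyphenation", and the names `SAMPLE_PATTERNS`, `parse_pattern`, `hyphenate`, and the minimum-length defaults are assumptions for illustration, not the TeX implementation.

```python
import re

# Illustrative subset only: a real pattern file contains thousands of entries.
SAMPLE_PATTERNS = ["hy3ph", "he2n", "hena4", "hen5at", "1na", "n2at", "1tio", "2io"]

def parse_pattern(pattern):
    """Split e.g. 'hen5at' into ('henat', [0, 0, 0, 5, 0, 0])."""
    letters = re.sub(r"\d", "", pattern)
    points = [0] * (len(letters) + 1)
    pos = 0
    for ch in pattern:
        if ch.isdigit():
            points[pos] = int(ch)  # digit sits in the gap before letter `pos`
        else:
            pos += 1
    return letters, points

def hyphenate(word, patterns=SAMPLE_PATTERNS, left_min=2, right_min=3):
    """Return the pieces of `word` between allowed break points."""
    table = dict(parse_pattern(p) for p in patterns)
    work = "." + word.lower() + "."     # dots mark the word boundaries
    scores = [0] * (len(work) + 1)      # scores[k] is the gap before work[k]
    for start in range(len(work)):
        for end in range(start + 1, len(work) + 1):
            points = table.get(work[start:end])
            if points is not None:
                for offset, value in enumerate(points):
                    scores[start + offset] = max(scores[start + offset], value)
    pieces, last = [], 0
    # Only consider gaps that leave enough letters on each side.
    for i in range(left_min, len(word) - right_min + 1):
        if scores[i + 1] % 2 == 1:      # odd score: a break is allowed here
            pieces.append(word[last:i])
            last = i
    pieces.append(word[last:])
    return pieces

print("-".join(hyphenate("hyphenation")))  # hy-phen-ation
```

Note that the result is a set of hyphenation points, which is not necessarily a full syllable split; with real pattern files and an exceptions list the behaviour will differ from this toy subset.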





I like that you've cited a thesis on the subject; it's a little hint to the original poster that this might not be an easy question.
– Karl
Jan 1 '09 at 17:29





Yes, I am aware that this is not a simple question, although I haven't worked much on it. I did underestimate the problem though, I thought I would work on other parts of my app, and later return to this 'simple' problem. Silly me :)
– user50705
Jan 1 '09 at 17:33





I read the dissertation and found it very helpful. The problem with the approach was that I did not have any patterns for the Albanian language, although I found some tools that could generate those patterns. Anyway, for my purpose I wrote a rule-based app, which solved the problem...
– user50705
Jan 3 '09 at 1:20





... My approach is a bit slow (~20 sec on a 50K-word file), but I think the results are reasonably accurate (I don't have any useful stats yet).
– user50705
Jan 3 '09 at 1:24





Note that the TeX algorithm is for finding legitimate hyphenation points, which is not exactly the same as syllable divisions. It's true that hyphenation points fall on syllable divisions, but not all syllable divisions are valid hyphenation points. For example, hyphens aren't (usually) used within a letter or two of either end of a word. I also believe the TeX patterns were tuned to accept false negatives in order to avoid false positives (never put a hyphen where it doesn't belong, even if that means missing some legitimate hyphenation opportunities).
– Adrian McCarthy
Aug 24 '12 at 22:05



I stumbled across this page looking for the same thing, and found a few implementations of the Liang paper here:
https://github.com/mnater/hyphenator



That is, unless you're the type who enjoys reading a 60-page thesis instead of adapting freely available code for a non-unique problem. :)





Agreed - much more convenient to just use an existing implementation.
– hoju
Nov 5 '10 at 2:48





The repository has been moved to github.com/mnater/hyphenator
– cheffe
May 26 '15 at 8:18




Here is a solution using NLTK:


from nltk.corpus import cmudict

d = cmudict.dict()

def nsyl(word):
    # One count per listed pronunciation: vowel phonemes end in a stress digit (0, 1, or 2).
    return [len(list(y for y in x if y[-1].isdigit())) for x in d[word.lower()]]





Hey, thanks. There's a tiny error in the function; it should be: def nsyl(word): return [len(list(y for y in x if y[-1].isdigit())) for x in d[word.lower()]]
– Gourneau
Dec 21 '10 at 1:08






What would you suggest as a fallback for words that aren't in that corpus?
– Dan Gayle
Jun 18 '11 at 0:18





How does this work?
– Pureferret
Mar 16 '15 at 9:21





@Pureferret cmudict is a pronouncing dictionary for North American English words. It splits words into phonemes, which are shorter than syllables (e.g. the word 'cat' is split into three phonemes: K - AE - T). Vowels also carry a "stress marker": 0, 1, or 2, depending on the pronunciation of the word (so AE in 'cat' becomes AE1). The code in the answer counts the stress markers, and therefore the vowels, which effectively gives the number of syllables (notice how in the OP's examples each syllable has exactly one vowel).
– billy_chapters
Mar 9 '16 at 23:11
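The stress-marker counting described in the comment above can be tried without downloading the corpus. This is a minimal sketch: the two entries in `SAMPLE_PRONUNCIATIONS` are hand-written in the same phoneme style and are assumptions for illustration (only the 'cat' entry comes from the comment), and `count_by_stress` is a hypothetical helper name.

```python
# Hand-written entries in the dictionary's phoneme style; assumptions for
# illustration, not loaded from the real corpus.
SAMPLE_PRONUNCIATIONS = {
    "cat": [["K", "AE1", "T"]],
    "syllable": [["S", "IH1", "L", "AH0", "B", "AH0", "L"]],
}

def count_by_stress(word, pronunciations=SAMPLE_PRONUNCIATIONS):
    """Return one syllable count per listed pronunciation of `word`."""
    return [
        sum(1 for phoneme in pron if phoneme[-1].isdigit())  # vowels end in 0, 1, or 2
        for pron in pronunciations[word.lower()]
    ]

print(count_by_stress("cat"))       # [1]
print(count_by_stress("Syllable"))  # [3]
```

A word missing from the mapping raises `KeyError`, which is exactly the fallback problem raised in the comments.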






This returns the number of syllables, not the syllabification.
– Adam Michael Wood
May 14 '17 at 15:34



I'm trying to tackle this problem for a program that calculates the Flesch-Kincaid and Flesch reading scores of a block of text. My algorithm uses what I found on this website: http://www.howmanysyllables.com/howtocountsyllables.html and it gets reasonably close. It still has trouble with complicated words like "invisible" and "hyphenation", but I've found it gets in the ballpark for my purposes.



It has the upside of being easy to implement. I found that a trailing "es" can be either syllabic or not. It's a gamble, but I decided not to count it in my algorithm.


private int CountSyllables(string word)
{
    char[] vowels = { 'a', 'e', 'i', 'o', 'u', 'y' };
    string currentWord = word;
    int numVowels = 0;
    bool lastWasVowel = false;
    foreach (char wc in currentWord)
    {
        bool foundVowel = false;
        foreach (char v in vowels)
        {
            //don't count diphthongs
            if (v == wc && lastWasVowel)
            {
                foundVowel = true;
                lastWasVowel = true;
                break;
            }
            else if (v == wc && !lastWasVowel)
            {
                numVowels++;
                foundVowel = true;
                lastWasVowel = true;
                break;
            }
        }

        //if full cycle and no vowel found, set lastWasVowel to false;
        if (!foundVowel)
            lastWasVowel = false;
    }
    //remove es, it's _usually_ silent
    if (currentWord.Length > 2 &&
        currentWord.Substring(currentWord.Length - 2) == "es")
        numVowels--;
    // remove silent e
    else if (currentWord.Length > 1 &&
        currentWord.Substring(currentWord.Length - 1) == "e")
        numVowels--;

    return numVowels;
}
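For context on why an approximate count is tolerable here: the two readability scores mentioned in this answer only use syllables as an average per word. A minimal Python sketch, using the commonly published coefficients; the function names and the idea of passing pre-computed totals are assumptions for illustration.

```python
def flesch_reading_ease(words, sentences, syllables):
    """Flesch reading ease from total word, sentence, and syllable counts."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

def flesch_kincaid_grade(words, sentences, syllables):
    """Flesch-Kincaid grade level from the same three totals."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

# Example totals (made up): 100 words, 5 sentences, 150 syllables.
print(round(flesch_reading_ease(100, 5, 150), 3))   # 59.635
print(round(flesch_kincaid_grade(100, 5, 150), 2))  # 9.91
```

Because the syllable total is divided by the word count, miscounting a few words by one syllable shifts the score only slightly.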





For my simple scenario of finding syllables in proper names this seems to be initially working well enough. Thanks for putting it out here.
– Norman H
Mar 8 at 13:16



This is a particularly difficult problem which is not completely solved by the LaTeX hyphenation algorithm. A good summary of some available methods and the challenges involved can be found in the paper Evaluating Automatic Syllabification Algorithms for English (Marchand, Adsett, and Damper 2007).



Perl has Lingua::Phonology::Syllable module. You might try that, or try looking into its algorithm. I saw a few other older modules there, too.



I don't understand why a regular expression gives you only a count of syllables. You should be able to get the syllables themselves using capture parentheses. Assuming you can construct a regular expression that works, that is.
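As a rough illustration of getting the pieces rather than just a count, here is a minimal Python sketch. The pattern is a hypothetical heuristic (optional consonants, a vowel group, then either the trailing consonants of the word or one consonant when another consonant follows); it is not a linguistically correct syllabifier, and `split_syllables` is an assumed name. It uses `findall` on the whole match rather than explicit capture groups.

```python
import re

# Hypothetical heuristic pattern; 'y' is treated as a vowel.
SYLLABLE_RE = re.compile(
    r"[^aeiouy]*[aeiouy]+(?:[^aeiouy]*$|[^aeiouy](?=[^aeiouy]))?"
)

def split_syllables(word):
    """Return the heuristic chunks of `word`, lowercased."""
    return SYLLABLE_RE.findall(word.lower())

print(split_syllables("Invisible"))  # ['in', 'vi', 'sib', 'le']
```

It reproduces the split from the question for this word, but adjacent vowels are always kept together, so a word like "pronunciation" comes out with four chunks instead of five.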



Thanks Joe Basirico, for sharing your quick and dirty implementation in C#. I've used the big libraries, and they work, but they're usually a bit slow, and for quick projects, your method works fine.



Here is your code in Java, along with test cases:


public static int countSyllables(String word)
{
    char[] vowels = { 'a', 'e', 'i', 'o', 'u', 'y' };
    char[] currentWord = word.toCharArray();
    int numVowels = 0;
    boolean lastWasVowel = false;
    for (char wc : currentWord) {
        boolean foundVowel = false;
        for (char v : vowels)
        {
            //don't count diphthongs
            if ((v == wc) && lastWasVowel)
            {
                foundVowel = true;
                lastWasVowel = true;
                break;
            }
            else if (v == wc && !lastWasVowel)
            {
                numVowels++;
                foundVowel = true;
                lastWasVowel = true;
                break;
            }
        }
        // If full cycle and no vowel found, set lastWasVowel to false;
        if (!foundVowel)
            lastWasVowel = false;
    }
    // Remove es, it's _usually_ silent.
    // endsWith compares contents; == on Strings would compare references.
    if (word.length() > 2 &&
        word.endsWith("es"))
        numVowels--;
    // remove silent e
    else if (word.length() > 1 &&
        word.endsWith("e"))
        numVowels--;
    return numVowels;
}

public static void main(String[] args) {
    String txt = "what";
    System.out.println("txt="+txt+" countSyllables="+countSyllables(txt));
    txt = "super";
    System.out.println("txt="+txt+" countSyllables="+countSyllables(txt));
    txt = "Maryland";
    System.out.println("txt="+txt+" countSyllables="+countSyllables(txt));
    txt = "American";
    System.out.println("txt="+txt+" countSyllables="+countSyllables(txt));
    txt = "disenfranchized";
    System.out.println("txt="+txt+" countSyllables="+countSyllables(txt));
    txt = "Sophia";
    System.out.println("txt="+txt+" countSyllables="+countSyllables(txt));
}



The result was as expected (it works well enough for Flesch-Kincaid):


txt=what countSyllables=1
txt=super countSyllables=2
txt=Maryland countSyllables=3
txt=American countSyllables=3
txt=disenfranchized countSyllables=5
txt=Sophia countSyllables=2



Bumping @Tihamer and @joe-basirico. Very useful function, not perfect, but good for most small-to-medium projects. Joe, I have re-written an implementation of your code in Python:


def countSyllables(word):
    vowels = "aeiouy"
    numVowels = 0
    lastWasVowel = False
    for wc in word:
        foundVowel = False
        for v in vowels:
            if v == wc:
                if not lastWasVowel: numVowels+=1 #don't count diphthongs
                foundVowel = lastWasVowel = True
                break
        if not foundVowel: #If full cycle and no vowel found, set lastWasVowel to false
            lastWasVowel = False
    if len(word) > 2 and word[-2:] == "es": #Remove es - it's "usually" silent (?)
        numVowels-=1
    elif len(word) > 1 and word[-1:] == "e": #remove silent e
        numVowels-=1
    return numVowels



Hope someone finds this useful!



Today I found this Java implementation of Frank Liang's hyphenation algorithm with patterns for English and German, which works quite well and is available on Maven Central.



Caveat: It is important to remove the last lines of the .tex pattern files, because otherwise those files cannot be loaded with the current version on Maven Central.





To load and use the hyphenator, you can use the following Java code snippet. texTable is the name of the .tex file containing the needed patterns. Those files are available on the project's GitHub site.




private Hyphenator createHyphenator(String texTable) {
    Hyphenator hyphenator = new Hyphenator();
    hyphenator.setErrorHandler(new ErrorHandler() {
        public void debug(String guard, String s) {
            logger.debug("{},{}", guard, s);
        }

        public void info(String s) {
            logger.info(s);
        }

        public void warning(String s) {
            logger.warn("WARNING: " + s);
        }

        public void error(String s) {
            logger.error("ERROR: " + s);
        }

        public void exception(String s, Exception e) {
            logger.error("EXCEPTION: " + s, e);
        }

        public boolean isDebugged(String guard) {
            return false;
        }
    });

    BufferedReader table = null;

    try {
        table = new BufferedReader(new InputStreamReader(Thread.currentThread().getContextClassLoader()
                .getResourceAsStream((texTable)), Charset.forName("UTF-8")));
        hyphenator.loadTable(table);
    } catch (Utf8TexParser.TexParserException e) {
        logger.error("error loading hyphenation table: {}", e.getLocalizedMessage(), e);
        throw new RuntimeException("Failed to load hyphenation table", e);
    } finally {
        if (table != null) {
            try {
                table.close();
            } catch (IOException e) {
                logger.error("Closing hyphenation table failed", e);
            }
        }
    }

    return hyphenator;
}



Afterwards the Hyphenator is ready to use. To detect syllables, the basic idea is to split the term at the provided hyphens.




String hyphenedTerm = hyphenator.hyphenate(term);

// The hyphenator inserts soft hyphens (U+00AD) at the break points.
String[] hyphens = hyphenedTerm.split("\u00AD");

int syllables = hyphens.length;



You need to split on "\u00AD" (the soft hyphen), since the API does not return a normal "-".
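The soft-hyphen splitting step is independent of the library, so it can be illustrated on its own. A minimal Python sketch; the sample string stands in for hypothetical hyphenator output and `split_on_soft_hyphens` is an assumed helper name.

```python
SOFT_HYPHEN = "\u00ad"  # invisible in most renderings, unlike a normal "-"

def split_on_soft_hyphens(hyphenated):
    """Split a hyphenated term into its parts, dropping empty pieces."""
    return [part for part in hyphenated.split(SOFT_HYPHEN) if part]

sample = "in\u00advis\u00adi\u00adble"  # hypothetical hyphenator output
parts = split_on_soft_hyphens(sample)
print(parts, len(parts))  # ['in', 'vis', 'i', 'ble'] 4
```

Because the soft hyphen is usually not displayed, printing the hyphenated term can look identical to the input, which makes the wrong separator easy to miss.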



This approach outperforms Joe Basirico's answer, since it supports many different languages and detects German hyphenation more accurately.



Why calculate it? Every online dictionary has this info. http://dictionary.reference.com/browse/invisible
in·vis·i·ble





Maybe it has to work for words that don't appear in dictionaries, such as names?
– Wouter Lievens
Sep 13 '10 at 19:13





@WouterLievens: I don't think names are anywhere near well-behaved enough for automatic syllable parsing. A syllable parser for English names would fail miserably on names of Welsh or Scottish origin, let alone names of Indian and Nigerian origins, yet you might find all of these in a single room somewhere in e.g. London.
– Jean-François Corbett
Jan 14 '12 at 14:06





One must keep in mind that it is not reasonable to expect better performance than a human could provide considering this is a purely heuristic approach to a sketchy domain.
– Darren Ringer
Sep 4 '15 at 20:40



Thank you @joe-basirico and @tihamer. I have ported @tihamer's code to Lua 5.1, 5.2 and LuaJIT 2 (it will most likely run on other versions of Lua as well):



countsyllables.lua




function CountSyllables(word)
  local vowels = { 'a','e','i','o','u','y' }
  local numVowels = 0
  local lastWasVowel = false

  for i = 1, #word do
    local wc = string.sub(word,i,i)
    local foundVowel = false;
    for _,v in pairs(vowels) do
      if (v == string.lower(wc) and lastWasVowel) then
        foundVowel = true
        lastWasVowel = true
      elseif (v == string.lower(wc) and not lastWasVowel) then
        numVowels = numVowels + 1
        foundVowel = true
        lastWasVowel = true
      end
    end

    if not foundVowel then
      lastWasVowel = false
    end
  end

  if string.len(word) > 2 and
    string.sub(word,string.len(word) - 1) == "es" then
    numVowels = numVowels - 1
  elseif string.len(word) > 1 and
    string.sub(word,string.len(word)) == "e" then
    numVowels = numVowels - 1
  end

  return numVowels
end



And some fun tests to confirm it works (as much as it's supposed to):



countsyllables.tests.lua




require "countsyllables"

tests = {
  { word = "what", syll = 1 },
  { word = "super", syll = 2 },
  { word = "Maryland", syll = 3},
  { word = "American", syll = 4},
  { word = "disenfranchized", syll = 5},
  { word = "Sophia", syll = 2},
  { word = "End", syll = 1},
  { word = "I", syll = 1},
  { word = "release", syll = 2},
  { word = "same", syll = 1},
}

for _,test in pairs(tests) do
  local resultSyll = CountSyllables(test.word)
  assert(resultSyll == test.syll,
    "Word: "..test.word.."\n"..
    "Expected: "..test.syll.."\n"..
    "Result: "..resultSyll)
end

print("Tests passed.")





I added two more test cases "End" and "I". The fix was to compare strings case insensitively. Ping'ing @joe-basirico and tihamer in case they suffer from the same problem and would like to update their functions.
– josefnpat
Sep 9 '15 at 22:15





@tihamer American is 4 syllables!
– josefnpat
Sep 9 '15 at 22:17



I could not find an adequate way to count syllables, so I designed a method myself.



You can view my method here: https://stackoverflow.com/a/32784041/2734752



I use a combination of a dictionary lookup and an algorithm to count syllables.



You can view my library here: https://github.com/troywatson/Lawrence-Style-Checker



I just tested my algorithm and had a 99.4% strike rate!


Lawrence lawrence = new Lawrence();

System.out.println(lawrence.getSyllable("hyphenation"));
System.out.println(lawrence.getSyllable("computer"));



Output:


4
3





Generally, links to a tool or library should be accompanied by usage notes, a specific explanation of how the linked resource is applicable to the problem, or some sample code, or if possible all of the above.
– IKavanagh
Sep 25 '15 at 16:07





Syntax highlighting isn't working. :(
– troy
Sep 25 '15 at 18:04





See Syntax Highlighting. There is a help button (question mark) in the SO editor which will get you to the linked page.
– IKavanagh
Sep 25 '15 at 18:06



I ran into this exact same issue a little while ago.



I ended up using the CMU Pronunciation Dictionary for quick and accurate lookups of most words. For words not in the dictionary, I fell back to a machine learning model that's ~98% accurate at predicting syllable counts.



I wrapped the whole thing up in an easy-to-use Python module here: https://github.com/repp/big-phoney



Install:
pip install big-phoney





Count Syllables:


from big_phoney import BigPhoney
phoney = BigPhoney()
phoney.count_syllables('triceratops') # --> 4



If you're not using Python and you want to try the ML-model-based approach, I did a pretty detailed write up on how the syllable counting model works on Kaggle.
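The dictionary-first, fallback-second design described in this answer can be sketched independently of the library. This is a minimal illustration only: `KNOWN_COUNTS` is a hypothetical stand-in for a pronouncing-dictionary lookup, and the fallback here is a simple vowel-group heuristic swapped in for the machine learning model, so its accuracy will be much lower than the figure quoted above.

```python
import re

# Hypothetical stand-in for a pronouncing-dictionary lookup.
KNOWN_COUNTS = {"invisible": 4, "pronunciation": 5}

def heuristic_count(word):
    """Fallback: count groups of consecutive vowel letters (at least 1)."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def lookup_or_guess(word):
    """Use the dictionary when the word is known, otherwise guess."""
    key = word.lower()
    if key in KNOWN_COUNTS:
        return KNOWN_COUNTS[key]
    return heuristic_count(key)

print(lookup_or_guess("Invisible"))    # 4 (from the dictionary)
print(lookup_or_guess("triceratops"))  # 4 (from the fallback)
```

Keeping the fallback behind its own function makes it easy to replace the heuristic with a trained model later without touching the lookup path.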



I used jsoup to do this once. Here's a sample syllable parser:


import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public String[] syllables(String text){
    String url = "https://www.merriam-webster.com/dictionary/" + text;
    String relHref;
    try{
        Document doc = Jsoup.connect(url).get();
        Element link = doc.getElementsByClass("word-syllables").first();
        // No syllable markup found: return the word unsplit.
        if(link == null){return new String[]{text};}
        relHref = link.html();
    }catch(IOException e){
        relHref = text;
    }
    String[] syl = relHref.split("·");
    return syl;
}





How is that a generic syllable parser? It looks like this code is only looking up syllables in a dictionary
– Nico Haase
Jan 9 at 16:30





