Scaling the Fence: Achilles’ Thesaurus

Posted in Geek Stuff, Random thoughts by dave on August 31, 2009

Scaling the Fence is a series of posts on daverea.com exploring people’s aspirations, apprehensions and encounters with switching from proprietary to open-source software. This post is the first in the series.

Most geeks will tell you that they’re the go-to guy (or gal) when it comes to computer questions from friends and family – and if my experience is any indication, I’m no exception. On a recent car trip with my wife and one of our friends, the topic came ’round to computers, and how this particular friend was in the market for a new one. Of course, me being me, I had to get a plug in for Linux and open source software.

In this case, our brief discussion centered on office suites, and before I could even recommend it, our friend informed me that she’d tried OpenOffice.org and didn’t like it. As someone who writes for a living, she needs a robust thesaurus – and the one built into OpenOffice.org (circa 2006) didn’t meet her needs. Unsure of what version she had used, and clueless about where OpenOffice.org’s thesaurus stands today, I couldn’t offer much in the way of advocacy beyond the possibility that someone may have written a plug-in to improve the thesaurus.

After we returned home, I decided to put the thesaurus to the test. The candidates? Microsoft Office 2003 and 2007 (tested on PCs at work – during my lunch break, of course!), Google’s top result for “thesaurus”, Thesaurus.com, and of course my copy of OpenOffice.org 2.4.1 (as packaged with Kubuntu 8.10). For good measure, I also threw in results from Princeton’s WordNet project, on which OpenOffice.org’s thesaurus has reportedly been based since version 2. Sadly, I no longer own a paper thesaurus, so unless someone would like to add some datapoints in the comments, I can’t include synonym counts for the dead-tree option…

As a language enthusiast and aspiring (albeit admittedly and unapologetically amateur) writer myself, I tried to choose words that I felt would have enough synonyms for a valid comparison. Granted, this is subject to the limitations and biases of my vocabulary, but I think I came up with a reasonable list:

  • Noun: Boss
  • Verb: Work
  • Adjective: Simple
  • Adverb: Extremely
  • Preposition: Beneath

From there, it was just a matter of punching everything into each of our candidates’ respective thesauri and tallying up the results:

Thesaurus Comparison Results

As you can see (and quite understandably), the online thesaurus goes home with the trophy, trouncing its nearest competitor by almost 5x (and quite creatively, in many instances, however questionable the usefulness of the results may be). The two MS Office suites produced averages of 5.6 and 5.8 synonyms per word, respectively, and OpenOffice.org produced a healthy average of 6.6, beating both editions of Office and, interestingly, the WordNet database upon which its thesaurus is based! Of course, looking closer reveals that MS Office trumps OpenOffice.org on adverbs and prepositions, while OpenOffice.org noses ahead on verbs.
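For anyone who wants to repeat the tally with their own thesauri, the arithmetic behind the synonyms-per-word averages can be sketched in a few lines of Python. This is only an illustration: the thesaurus names and counts below are made-up placeholders (not the figures from the chart), and `average_synonyms` is a hypothetical helper name of my own choosing.

```python
def average_synonyms(counts):
    """Return the mean number of synonyms per test word for each thesaurus.

    `counts` maps a thesaurus name to a dict of {test word: synonym count}.
    Averages are rounded to one decimal place.
    """
    return {
        name: round(sum(per_word.values()) / len(per_word), 1)
        for name, per_word in counts.items()
    }


# Placeholder numbers for illustration only; these are NOT the results
# from the comparison chart in this post.
example_counts = {
    "thesaurus_a": {"boss": 4, "work": 10, "simple": 6, "extremely": 3, "beneath": 2},
    "thesaurus_b": {"boss": 5, "work": 12, "simple": 8, "extremely": 1, "beneath": 0},
}

averages = average_synonyms(example_counts)

# List the thesauri from highest to lowest average.
for name, avg in sorted(averages.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {avg} synonyms per word")
```

Keeping the per-word counts (rather than just the totals) makes it easy to see where one thesaurus wins on a particular part of speech even when its overall average is lower.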

What does all this tell us? For starters, we can probably conclude that while OpenOffice.org’s thesaurus keeps pace with that offered by Microsoft Office 2003/2007, the relative usefulness of each will ultimately hang on what words (and types of words) a given user chooses to look up. This, in turn, will be determined by that writer’s style, vocabulary and preferences. It’s also pretty clear that going online (when there’s an option to do so, which is not always the case) will net the widest selection of superior synonyms for the scrupulous scribe.

Language tools like the thesaurus present an opportunity for the open source community. Just as our friend was quickly dissuaded from using OpenOffice.org because she perceived the thesaurus to be inferior, she might have been quickly won over by a toolset that performed head-and-shoulders above those she was used to. Between WordNet, the OpenRogets project, the Big Huge Thesaurus, the New York Times’ thesaurus and the Moby Project (hey, it’s only the largest thesaurus in the English language!), we have the opportunity to package an offline thesaurus (or offer an optional download supplement, if binary size is a concern) for OpenOffice.org that could run circles around proprietary offerings.

Of course, the thesaurus is only one tiny facet of one program, which is itself only one facet of a larger suite of tools, which is itself only a minute fraction of the open-source world. It’s easy to discount as unworthy-of-effort in the face of the many other challenges that FOSS faces in achieving widespread adoption. If market share is any indication, OpenOffice.org’s thesaurus isn’t keeping it out of the hands of millions of users worldwide. That said, this strikes me as one small instance where we’ve found the enemy asleep at the gate – so why not take the opportunity to capitalize on it?


Comments
  • Gordon Haverland:

    Apparently Aiksaurus is used in Abiword, and can be used in other ways (I ran across it via CPAN, and a perl interface). UTexas has a PHP version on the web. http://www.cs.utexas.edu/users/jared/aiksaurus/index.cgi

    I don’t use thesauri, so I don’t really know what this PHP page is displaying. Only “beneath” got no results.

  • Hello,

    Do you know if the latest version of OpenOffice.org has any improvements in this area?

    • dave:

      Hi Rob – I provisioned a new machine with Ubuntu 8.10 this past weekend (after this post was written, but before it was posted) and installed OpenOffice.org 3.1.0 via the Launchpad repositories. I tried a couple of the words above, and it appeared the data source is the same.
