In the build-up to President Bush's speech on Iraq, I heard a lot of speculation about what the speech would emphasize, and about how it would rely less on rhetoric and catchphrases and more on persuasive argument. Out of curiosity, and inspired by the US Presidential Speeches Tag Cloud, I ran the transcript through an application I wrote with free tools and scripts. It extracts key phrases and words from a document and presents the results as a tag cloud (a weighted list). Here are the results of the tag cloud analysis of the transcript of the speech.
Background
One of my tasks as a programmer was to find a way to process a large body of text (committee suggestions), pick out the key phrases or words that occurred most often, and display them in an easily understood format. However, if I just did a word count, I'd get high counts for out-of-context nouns, adverbs and verbs, which wouldn't be useful.
Term Extraction
The Yahoo! Developer Network offers a very useful tool as part of its Content Analysis Web Services: Term Extraction. The service "provides a list of significant words or phrases extracted from a larger content." To use the service, you'll need a free Yahoo! Application ID. The only limit on the service is 5,000 queries per IP address per day, which is more than sufficient for most users. To get around some hosting limitations, I used the "HTTP POST from PHP without cURL" script from netevil.org.
Because the service returns each term only once per request, I couldn't just submit the whole body of text; that would have produced a short list of key words and phrases, each occurring exactly once, with nothing to count. Instead, I had to split the source text into individual paragraphs, submit each paragraph separately, and push each resulting term into an array.
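The paragraph-by-paragraph approach can be sketched roughly as follows. This is Python rather than the PHP I actually used, and it does not call the Yahoo! service: extract_terms is a hypothetical callable standing in for whatever term extractor you have, and fake_extractor is a toy substitute so the sketch runs on its own. Lowercasing the terms is also my assumption, not something the service does.

```python
import re
from typing import Callable, Iterable


def split_paragraphs(text: str) -> list[str]:
    """Split text on blank lines, dropping empty paragraphs."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def collect_terms(text: str,
                  extract_terms: Callable[[str], Iterable[str]]) -> list[str]:
    """Run the extractor once per paragraph and pool every returned term."""
    terms: list[str] = []
    for paragraph in split_paragraphs(text):
        # Each call yields unique terms for that paragraph only, so a term
        # found in several paragraphs appears several times in the pooled list.
        terms.extend(term.lower() for term in extract_terms(paragraph))
    return terms


def fake_extractor(paragraph: str) -> set[str]:
    """Hypothetical stand-in extractor: unique words longer than four letters."""
    return {w for w in re.findall(r"[A-Za-z]+", paragraph) if len(w) > 4}


if __name__ == "__main__":
    sample = "Budget review needed.\n\nThe budget is late.\n\nReview the schedule."
    print(sorted(collect_terms(sample, fake_extractor)))
```

The pooled list deliberately keeps duplicates; those repeats are what later give each term its weight.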
Tag Clouds
A fairly recent innovation in web development has been the use of tag clouds. A tag cloud is a list, usually alphabetized, that represents the frequency of each word or phrase visually, using font size or some similar emphasis. The more often a word or phrase occurs, the larger it is displayed relative to the other members of the list. For more information and the history of tag clouds, read the Wikipedia entry.
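To make the sizing idea concrete, here is a minimal Python sketch of one common way to do it: scale each count linearly between a smallest and largest font size. The function names, the 12 to 36 pixel range, and the sample counts are all my own illustrative assumptions; real tag cloud scripts may use logarithmic scaling or fixed size buckets instead.

```python
def font_size(count: int, min_count: int, max_count: int,
              min_px: int = 12, max_px: int = 36) -> int:
    """Linearly scale a term's count to a font size in pixels."""
    if max_count == min_count:
        return min_px  # every term is equally frequent
    scale = (count - min_count) / (max_count - min_count)
    return round(min_px + scale * (max_px - min_px))


def tag_cloud(counts: dict[str, int]) -> list[tuple[str, int]]:
    """Return (term, font size) pairs in alphabetical order."""
    if not counts:
        return []
    low, high = min(counts.values()), max(counts.values())
    return [(term, font_size(counts[term], low, high))
            for term in sorted(counts)]


if __name__ == "__main__":
    # Made-up counts, purely for illustration.
    for term, size in tag_cloud({"staff": 5, "budget": 9, "schedule": 1}):
        print(f"{term}: {size}px")
```

With these sample counts, the most frequent term gets the 36px ceiling, the least frequent gets the 12px floor, and the one halfway between lands on 24px.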
I combined the Term Extraction service with a modified version of the excellent free PHP tag cloud generator from 15tags. The script simply counts the occurrences of each string in an array and displays the most frequent items as a tag cloud. You can adjust the number of items in the tag cloud; while I've displayed over 100 items, it works best with around 25-30.
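The counting step is simple enough to sketch in a few lines of Python. This is not the 15tags script, just an illustration of the behaviour described above; top_terms and its default limit of 30 are names and values I've picked for the example.

```python
from collections import Counter


def top_terms(terms: list[str], limit: int = 30) -> list[tuple[str, int]]:
    """Count each term, keep the `limit` most frequent, sort alphabetically."""
    most_common = Counter(terms).most_common(limit)
    return sorted(most_common)


if __name__ == "__main__":
    pooled = ["budget", "staff", "budget", "schedule", "budget", "staff"]
    print(top_terms(pooled, limit=2))
```

Keeping the limit as a parameter makes it easy to try the 25-30 range against a larger cloud and compare how readable each one is.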
Results
I took the committee suggestions, pasted them into the tool, pasted the result into a word processor, added a little formatting, and gave the printed document to my supervisor. She found the results useful and visually interesting; it is a good tool for arbitrary analysis, and the resulting tag cloud could be used as the basis of a cover for the report. I think tag clouds are a useful way of visually representing the importance of a term, tag, or keyword.
For fun, I also ran the application on some erotic literature and IRC logs. I won't share the results in public, but believe me, it's a hoot.