Unix For Poets
My co-worker Colin mentioned the paper “Unix for Poets” (PDF, 60 KB) to me a while ago, and I thought it was a fun read. It covers a lot of basic text processing and command-line magic, and it is a really cool introduction to the power of the command-line interface and the Unix philosophy.
“I wouldn’t bring out the big guns (fancy machines, fancy algorithms, data collection committees, bigtime favors) unless you have a lot of text (e.g., hundreds of million words or more), or you are trying to count really long ngrams (e.g., 50-grams). This chapter will describe a set of simple Unix-based tools that should be more than adequate for counting trigrams on a corpus the size of the Brown Corpus. I’d recommend that you do it yourself for basically the same reason that home repair stores like DIY and Home Depot are as popular as they are. You can always hire a pro to fix your home for you, but a lot of people find that it is better not to, unless they are trying to do something moderately hard. Hamming used to say it is much better to solve the right problem naively than the wrong problem expertly.”
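To give a feel for what "simple Unix-based tools" can look like when counting trigrams, here is a rough sketch of my own using only standard utilities. It is not necessarily the exact pipeline from the paper. The function name count_trigrams is made up for illustration, and I am assuming plain ASCII English text and a system that provides mktemp.

```shell
# Hypothetical helper (not from the paper): print trigram counts for the
# text file given as $1, most frequent first.
count_trigrams() {
    dir=$(mktemp -d)    # assumption: mktemp is available on this system

    # One lowercase word per line: every run of non-letters becomes a newline.
    tr -sc 'A-Za-z' '\n' < "$1" | tr 'A-Z' 'a-z' | sed '/^$/d' > "$dir/w1"

    # The same word list shifted up by one line and by two lines.
    tail -n +2 "$dir/w1" > "$dir/w2"
    tail -n +3 "$dir/w1" > "$dir/w3"

    # Pasting the three lists side by side puts each word next to its two
    # successors. The last two rows are incomplete, so keep only rows with
    # exactly three words, then count duplicates and sort by count.
    paste -d ' ' "$dir/w1" "$dir/w2" "$dir/w3" |
        awk 'NF == 3' |
        sort | uniq -c | sort -k1,1nr -k2

    rm -rf "$dir"
}
```

Running count_trigrams on a file and piping the result through head shows the most common three-word sequences. Trigrams here run straight across sentence boundaries, which is the naive choice; whether that matters depends on what you are counting.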