Twitter stream algorithms

Cover image for post titled Twitter stream algorithms.

Project GitHub.

Data stream algorithms

The classic college introductory algorithms course nags bright-eyed students to focus on time optimization. Big-O-big-O-big-O is all we can think of and achieving that logarithmic runtime is the end-all-be-all. But what about space? This is an aspect of algorithms that I wish introductory courses touched more upon. There arises some fascinating problems (and approaches) when the major constraint is not time, but space. And in today's age of massive data lakes and unwledy pipeline streams, these space-oriented algorithms are very relevant to industry.

Data stream algorithms tackle problems with large quantities of streaming data with a focus on space optimization. It's relevant to situations where engineers could not feasibly rely on there being enough storage to run traditional Ω(n) algorithms (n or more). Most approaches are random algorithms that relinquish the search for deterministic solutions in lieu for feasible solutions.

// demo simple counting algorithm

Algorithms in action

I wanted to try out some of the algorithms I learned from COSC 35, taught by Professor Chakrabarti at Dartmouth College. The course was heavy on theory, focusing on proofs of correctness and bounds derivations. I wanted to see them in action against real world data.

Twitter

Twitter offers a near-realtime API for a slimmed-down portion of its main platform stream.

Shakespeare

I also decided to try out these algorithms on a stream using words from 100 of Shakespeare's works. The raw text was accessed from Project Gutenberg, cleaned, and extracted into 219 separate works, each containing around a few thousand words. The stream chooses a particular work and feeds the words the same way as the Twitter streaming process.