Note number 1: this last blog is backdated – I have been unable to publish from China due to firewall restrictions
Note number 2: I have now published a version of this work online – view it here
The last few days I have spent brushing up my MapReduce chops, and having a play with Apache Spark whilst I’m at it. Alongside this I’ve been having a lovely time exploring the beautiful surroundings of Dali, Yunnan, China – spectacular!
As a toy MapReduce project today, I decided to have a crack at making an automatic poem generator, based on content from Twitter. (Inspired by other projects – see list at bottom). I’ve been pleasantly surprised at the results I’m getting after just a few hours. The idea is that a user enters a chosen subject (perhaps with a time range and other preferences), a huge number of current/historic tweets are pulled from Twitter (the API rate limit is the bottleneck here) and then the tweets are rearranged into rhyming couplets to produce ‘poetry’ (or at least, a rhyming list of tweets…)
My first attempt works like so:
- Twitter stream pulled by a single machine at user’s request
MapReduce Stage 1 – filter tweets and delete duplicates
- Twitter stream sent to multiple Mappers, which filter each tweet for suitability (for example tweets can’t be retweets, must be within a certain length range, must be in English, can’t be @replies, etc.)
- (Key, Value) sent to multiple Reducers, where Key=tweet text, and Value=other information extracted, e.g. username, tweet date. The Reducers apply a minimum-date function to each Key, so unique tweets are extracted (duplicates deleted): the earliest tweet with that content is kept, and all later duplicates are discarded. This means the first publisher of each piece of Twitter content is the one credited.
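As a rough illustration of Stage 1, here is a minimal Python sketch of the filter-then-deduplicate idea. It is not the project's actual code: the tweet field names (`text`, `user`, `created_at`), the length limits and the function names are all assumptions made for the example, and the English-language check is left out.

```python
from datetime import datetime

# Assumed length range; the real thresholds are not stated in this post.
MIN_LEN, MAX_LEN = 10, 100

def is_suitable(text):
    """Mapper-side filter: reject retweets, @replies and out-of-range lengths."""
    if text.startswith("RT ") or text.startswith("@"):
        return False
    return MIN_LEN <= len(text) <= MAX_LEN

def map_tweet(tweet):
    """Emit (key, value) with key = lower-cased tweet text, or nothing if filtered out."""
    text = tweet["text"].strip()
    if is_suitable(text):
        yield text.lower(), (tweet["created_at"], tweet["user"])

def reduce_earliest(pairs):
    """For each key keep the value with the minimum date, i.e. the first publisher."""
    earliest = {}
    for key, value in pairs:
        if key not in earliest or value[0] < earliest[key][0]:
            earliest[key] = value
    return earliest
```

Feeding every tweet through `map_tweet` and passing the collected pairs to `reduce_earliest` leaves one entry per distinct tweet text, carrying the date and username of whoever posted it first.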
MapReduce Stage 2 – extract rhyme ID for end of each tweet and group by rhyme
- Mappers use regex to extract the final word in the tweet, and look up this word in a rhyme dictionary (built from http://www.chiark.greenend.org.uk/~tthurman/rhymes.txt). This provides a numerical code for each word in the dictionary, where rhyming words share the same numerical code. I have also written ‘near rhyme’ code (for my separate ‘RhymeTime’ project) though this is not included in this prototype yet.
- (Key, Value) is sent to Reducers, one pair per tweet. This time the Key=rhyme code, and the Value=tweet text, username, tweet date. The Reducers can then package all of the tweets by rhyme code, ready for the final stage (where we will choose the rhyming couplets from this list of rhyming tweets). Note that the Reducer can also count the distinct ‘last words’ in each rhyme group it collects – if every tweet in the group ends in the same word, we can discard the group at this point, as you wouldn’t want two consecutive lines in a poem using the same last word (technically these lines don’t rhyme, despite having the same rhyme code).
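Stage 2 could be sketched along these lines. The tiny `RHYME_CODES` table is a hypothetical stand-in for the real rhyme dictionary (I am not reproducing the format of rhymes.txt here), and the regex and function names are likewise illustrative only.

```python
import re
from collections import defaultdict

# Hypothetical stand-in for the rhyme dictionary: word -> numeric rhyme code.
RHYME_CODES = {"ad": 1, "bad": 1, "real": 2, "meal": 2, "can": 3, "man": 3}

# Last run of letters/apostrophes, allowing trailing punctuation or spaces.
LAST_WORD = re.compile(r"([a-z']+)[^a-z']*$")

def last_word(text):
    """Return the final word of a tweet in lower case, or None if there is none."""
    match = LAST_WORD.search(text.lower())
    return match.group(1) if match else None

def map_rhyme(text):
    """Emit (rhyme code, (last word, tweet text)) if the last word is in the dictionary."""
    word = last_word(text)
    code = RHYME_CODES.get(word)
    if code is not None:
        yield code, (word, text)

def reduce_groups(pairs):
    """Group tweets by rhyme code, dropping groups with fewer than two distinct last words."""
    groups = defaultdict(list)
    for code, value in pairs:
        groups[code].append(value)
    return {code: values for code, values in groups.items()
            if len({word for word, _ in values}) >= 2}
```

Keeping the last word alongside the tweet text in the value is a design choice for the sketch: it lets the reducer check for distinct endings without re-running the regex.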
MapReduce Stage 3 – pick rhyming couplets from each rhyme group and compile to poem
- Mappers need to select two lines from each rhyme group, each line in the couplet ending in a different word. Currently this is done randomly. However, it would be desirable within the mapper to try and link the lines chosen for the couplet by topic, or rhythm, or alliteration, etc. This is discussed below as something I plan to improve.
- (Key, Value) is sent to the Reducers. Currently the Key is just the rhyme code (ensuring the rhyming lines are consecutive on the final output).
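The random couplet selection in Stage 3 might look something like the sketch below. It assumes each rhyme group is a list of (last word, tweet text) pairs keyed by rhyme code; that structure and the names `pick_couplet` and `compile_poem` are inventions for the example rather than the project's own code.

```python
import random

def pick_couplet(lines, rng):
    """Pick two tweets from one rhyme group that end in different words.

    `lines` is a list of (last word, tweet text) pairs; returns None when
    the group does not contain two distinct last words.
    """
    by_word = {}
    for word, text in lines:
        by_word.setdefault(word, []).append(text)
    if len(by_word) < 2:
        return None
    # Choose two different last words, then one tweet for each of them.
    first, second = rng.sample(sorted(by_word), 2)
    return rng.choice(by_word[first]), rng.choice(by_word[second])

def compile_poem(groups, seed=None):
    """Emit couplets in rhyme-code order so that rhyming lines sit next to each other."""
    rng = random.Random(seed)
    poem = []
    for code in sorted(groups):
        couplet = pick_couplet(groups[code], rng)
        if couplet:
            poem.extend(couplet)
    return poem
```

Passing a `seed` makes a run repeatable, which is handy when comparing random selection against a smarter way of pairing lines.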
Being in China, I don’t actually have access to any live Twitter data due to firewall restrictions, but I can use previous data I have collected on other projects for now. The poem below was produced from around 30MB of Twitter firehose filtered by the word ‘Pepsi’…
i like the beyonce pepsi ad
(by AsToldByTella at 2013-08-20 03:16:49+00:00)
i want a pepsi so damn bad
(by KayyyP_ at 2013-08-19 19:48:10+00:00)
my addiction to pepsi is too real
(by RockoYusuf at 2013-08-20 02:13:29+00:00)
pepsi + anything = a complete meal.
(by SanaaNQ at 2013-08-19 23:16:27+00:00)
i want a pepsi… in a can.
(by rhealynee at 2013-08-19 21:18:46+00:00)
just want pepsi and pepsi man
(by SoberIfeelpain at 2013-08-19 19:09:22+00:00)
Considering the simplicity of what I’ve coded, it’s not bad, bar the inaneness of the tweets. More interesting keywords may improve things. I’ve also put through Alice in Wonderland, which works quite nicely!
Technology-wise, I’ve coded each Map and each Reduce in Python, and am currently just testing on my local machine, piping the inputs and outputs between six Python scripts (see my GitHub). Once I can access a larger volume of tweets, I will attempt to run on a Hadoop cluster with streaming and these same Python scripts. The code can probably be optimised a little – I may not need to send usernames, dates or even tweet text around between each and every Hadoop stage, instead converting to a unique ID once the required information has been extracted from the tweet, and just looking up usernames, dates and tweet text for the few chosen tweets at the end.
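For anyone unfamiliar with the streaming style, scripts of this kind read tab-separated `key<TAB>value` lines on standard input, and a reducer sees its lines already sorted by key. Below is a generic sketch of such a reducer, not one of the six scripts themselves; the function names and the ` | ` separator used to join values are arbitrary choices for the example.

```python
from itertools import groupby

def parse(lines):
    """Split each 'key<TAB>value' line into a (key, value) pair."""
    for line in lines:
        key, _, value = line.rstrip("\n").partition("\t")
        yield key, value

def reduce_stream(lines):
    """Yield one output line per key from input lines that are already sorted by key."""
    for key, group in groupby(parse(lines), key=lambda pair: pair[0]):
        values = [value for _, value in group]
        yield f"{key}\t{' | '.join(values)}\n"
```

In a standalone script, the output of `reduce_stream(sys.stdin)` would be written to `sys.stdout`. Locally, a `sort` between the mapper and reducer scripts plays the part of the shuffle step that Hadoop would otherwise perform.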
Extending the work further, I would like to:
- Measure rhythm and scan (which can be done using CMU pronunciation dictionary) to really get the metre correct, and create different forms of poem (sonnets, limericks, etc.)
- Improve the third mapper process in picking out the two most appropriate lines from the rhyme-group for each rhyming couplet. These could be picked on content (measuring semantic relatedness between the two lines) or perhaps by linking two different subjects (each alternate line about David Cameron, and Ed Miliband, for instance)
- Add a temporal aspect to things: we could attempt to ensure earlier tweets come earlier in the poem, and later tweets later in the poem. We could then produce a poem spanning Obama’s presidency for instance (filtering by tweets containing ‘Obama’) where the poem traces the presidency (and popular opinion) as it changes over time.
- Live poetry(?) – it may be possible to commentate on live events (for example football matches) by finding rhyming couplets in near real-time, and adding them to a page with AJAX…
- Create a user interface to allow anyone to choose a topic and create a poem. As mentioned though, the main issue will be how to scrape tweets from Twitter fast enough (considering API rate limits) to create decent output, before a user gets bored!
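On the rhythm idea from the list above: in the CMU pronunciation dictionary each vowel phoneme ends in a stress digit (0, 1 or 2), so counting those digits gives a syllable count and joining them gives a rough stress pattern. Here is a small sketch; the handful of entries are written inline in the dictionary's format for illustration only, and the function names are made up for the example.

```python
# A few illustrative entries in CMU dictionary format: WORD followed by phonemes.
SAMPLE_ENTRIES = """\
BAD  B AE1 D
MEAL  M IY1 L
ADDICTION  AH0 D IH1 K SH AH0 N
COMPLETE  K AH0 M P L IY1 T
"""

def load_pronunciations(text):
    """Parse dictionary-format lines into {word: [phonemes]}, skipping ';;;' comments."""
    pronunciations = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith(";;;"):
            continue
        word, *phones = line.split()
        pronunciations[word.lower()] = phones
    return pronunciations

def stress_pattern(words, pronunciations):
    """Return the stress digits for a line of words, or None if any word is unknown."""
    pattern = []
    for word in words:
        phones = pronunciations.get(word.lower())
        if phones is None:
            return None
        # Vowel phonemes are the ones that end in a stress digit.
        pattern.extend(p[-1] for p in phones if p[-1].isdigit())
    return "".join(pattern)
```

The length of the returned string is the syllable count, so two candidate lines could be compared on both length and stress pattern before being paired into a couplet.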
As ever, a lot of projects on the go right now, but definitely hoping to extend this fun one a little further.
Other automated poetry resources to check out: