A long time ago, a couple of Twitter followers and I were discussing a movement to prevent the eating of me (my nickname is Sushi, which also happens to be a food). We eventually came up with #wrimosagainsttheeatingofsushi. In the years since, both the #wrimosagainsttheeatingofsushi movement and the #wrimosfortheeatingofsushi movement have grown. But more and more semi-related hashtags have cropped up, almost all of them taking the form #___[for/against]the____ingof______.
I’ve tried to search Twitter for them, but online searches don’t usually handle fill-in-the-blank searches very well. Or at all, really. But that’s okay.
(Image credit: xkcd)
I know (some) regular expressions.
What’s a regular expression?
A regular expression (or regex) is a pattern that matches something. You can think of a word as a regular expression when you search for something–whether you’re searching online or for a file on your computer. Most of your results are going to match the word(s) you type in. But what if you want only the first and last words of your search phrase to match? Or in my case, only certain words within a hashtag to match?
That’s where regular expressions come in.
Here’s a tutorial if you’d like to dig deeper.
But knowing regular expressions doesn’t do me much good if I don’t have a way to search with them. This is why grep exists.
grep is just plain awesome
Grep is amazing. It searches your plaintext files and shows you the results. You can also search across multiple files at once, which makes grep very powerful. I can figure out which novel I wrote that one awesome line in or when I used some made up word in chat… or find Twitter hashtags since grep supports regular expressions.
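Here’s a rough sketch of what searching several files at once can look like. The directory, file names, and the made-up word are all hypothetical stand-ins, not my actual files.

```shell
# Build a throwaway directory with three hypothetical text files.
demo_dir=$(mktemp -d)
printf 'It was a dark and stormy night.\nThe snozzberry tasted odd.\n' > "$demo_dir/novel_one.txt"
printf 'Nothing interesting happens here.\n' > "$demo_dir/novel_two.txt"
printf 'someone said snozzberry in chat\n' > "$demo_dir/chat_log.txt"

# -r searches every file under the directory; each matching line is
# printed with the name of the file it came from.
matches=$(grep -r 'snozzberry' "$demo_dir")
printf '%s\n' "$matches"

# -l prints only the names of the files that contain a match.
files_with_match=$(grep -rl 'snozzberry' "$demo_dir" | sort)
printf '%s\n' "$files_with_match"
```

The second command is the "which novel did I write that line in" case: it lists novel_one.txt and chat_log.txt but not novel_two.txt.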
Sound cool? Good, because we’re going to have some greppy fun.
The last tool: the Twitter archive
This is the easiest step. Twitter lets you download a full archive of your tweets from your site settings, so I downloaded mine and unzipped the files to my Backup directory.
The files containing the tweets themselves are JavaScript files sorted by month. We can grep our way through these. Sweet.
Putting it all together
Let’s put this together and find all the hashtags. Here’s what I did.
[sushi@marigold ~]$ grep -r -E '[a-zA-Z]+the[a-zA-Z]+ingof[a-zA-Z]+' ~/Backup/tweets/data/js/tweets > ~/Documents/wrimosforagainst.csv
Let’s break this down.
[sushi@marigold ~]$: This is my shell prompt, which shows my username and hostname. The ~ says I’m sitting in my home directory, which affects things like where I save files. More on this in a minute.
grep -r -E: grep says to use the grep command. -r means to search recursively, that is, to search every file in that directory and in any directories inside it. This is important because we want to search multiple files in the same directory. -E is for extended regular expressions, which let me do some of the exciting things to follow.
[a-zA-Z]+: Check for any letter a through z, capitalized or not. The + means to check and see if a letter appears at least once.
the: Exactly what it says. Look for the letters “the” in order, exactly once.
[a-zA-Z]+: Check for any letter a through z, capitalized or not. The + means to check and see if a letter appears at least once.
ingof: Exactly what it says. Look for the letters “ingof” in order, exactly once.
[a-zA-Z]+: Check for any letter a through z, capitalized or not. The + means to check and see if a letter appears at least once.
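To see the pieces working together, here’s a small sketch that runs the same pattern against a few sample lines. The sample text is invented for illustration, and the -o option (print only the part that matched) isn’t part of my original command.

```shell
# Hypothetical sample lines, not taken from a real Twitter archive.
sample='#wrimosagainsttheeatingofsushi is a movement
#wrimosfortheeatingofsushi is the counter-movement
#amwriting has no blanks to fill
just some text about the weather'

# The same pattern as above, quoted so the shell leaves it alone.
pattern='[a-zA-Z]+the[a-zA-Z]+ingof[a-zA-Z]+'

# -o prints only the matched text instead of the whole line.
matched=$(printf '%s\n' "$sample" | grep -oE "$pattern")
printf '%s\n' "$matched"
```

Only the first two lines match. Notice that the # isn’t part of the match, since [a-zA-Z] only covers letters.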
This is a lot of information. If I had just hit enter here, my screen would have overflowed with hashtags. That’s where the > (greater than) sign comes in. The > means to send the output of this command to a file instead of the screen. The rest of that command says what to name the file and where to save it. In this case I chose to send this information to a comma separated value file, which I can then open with LibreOffice and do whatever I want with it.
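If you want to try redirection on something smaller first, here’s a minimal sketch. The file name results.txt and the sample lines are placeholders I made up for the example.

```shell
# Work in a throwaway directory so nothing real gets overwritten.
out_dir=$(mktemp -d)
out_file="$out_dir/results.txt"

# Without >, the matches would print to the screen.
# With >, they are written to the file instead (replacing its contents).
printf 'first line\nsecond line\nnothing here\n' | grep 'line' > "$out_file"

# >> appends to the file rather than overwriting it.
printf 'third line\n' | grep 'line' >> "$out_file"

# Count the lines that ended up in the file, then show them.
line_count=$(wc -l < "$out_file" | tr -d '[:space:]')
cat "$out_file"
```

One thing to watch: a single > replaces whatever was already in the file, so double-check the file name before hitting enter.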
A CSV file is basically a spreadsheet in text form, which is what I saved the file as the first time. When you open the file, you can then tell Excel or LibreOffice what to use to separate the values. LibreOffice chose comma, tab, and semicolon; I added colon since colons were useful in separating the hashtags from the rest of the data.
Once I opened the CSV file in LibreOffice I sorted the column with the hashtags. This gave me almost exactly what I wanted, as the majority of these hashtags are at the bottom in the w section.
There are a couple of catches to this. Each tweet in the JavaScript file includes the hashtags as a separate entity from the tweets, as well as what tweet you’re replying to. This means hashtags can show up multiple times. This also means that hashtags used in tweets I replied to can also show up, even if I never used them. This part is doubly useful, but it means that not every spinoff hashtag is here simply because I didn’t reply to all of those tweets.
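The repeats could also be trimmed on the command line instead of in a spreadsheet. This is a sketch of one way to do it, not what I actually did: the file names and contents below are invented and only loosely resemble the archive format, and -o, -h, and sort -u are additions to my original command.

```shell
# Fake archive: two hypothetical monthly files with repeated hashtags.
archive_dir=$(mktemp -d)
printf 'text: "#wrimosagainsttheeatingofsushi forever", hashtag: "wrimosagainsttheeatingofsushi"\n' > "$archive_dir/2012_01.js"
printf 'text: "#wrimosfortheeatingofsushi", hashtag: "wrimosfortheeatingofsushi"\ntext: "#wrimosagainsttheeatingofsushi again"\n' > "$archive_dir/2012_02.js"

# -o keeps only the matched text, -h drops the file name prefix,
# and sort -u sorts the results and removes duplicate lines.
unique_tags=$(grep -rhoE '[a-zA-Z]+the[a-zA-Z]+ingof[a-zA-Z]+' "$archive_dir" | sort -u)
printf '%s\n' "$unique_tags"
```

Five matches go in and two unique hashtags come out. This still can’t tell hashtags I used apart from ones that only appeared in tweets I replied to.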
Here are those hashtags. There are 118 of them in all, and I have only edited to remove duplicates. Happy reading, and happy hashtagging!