I passed 50,000 tweets recently. Because my life really is NaNoWriMo, I found myself wondering how many words those 50,000 tweets add up to. It turns out the Twitter archive gives you a CSV (comma-separated values) file, something I can work with easily.
So I got to work writing a script that would do just this. The result is available on GitHub, and any comments or feedback are welcome. The script is released under the MIT license, so use it for whatever you want. Just give credit, and don’t blame me or sue me if things go wrong.
The tweetwc script looks at each row in your CSV file and first checks to see if the tweet in that row is a retweet. This is done by looking at the CSV file’s columns that are only filled if the tweet in question is a retweet. If the tweet is a retweet, the script skips that row and keeps going. But if the tweet isn’t a retweet, then the contents of the tweet are added to a separate plaintext file, along with a linebreak for ease of reading.
Once the script has executed, you can then count the words of this plaintext file or do other exciting things.
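The logic described above can be sketched roughly as follows. This is a minimal illustration written for Python 3, not the actual tweetwc source: the column names (text, retweeted_status_id, retweeted_status_user_id) and the function names are assumptions, so check them against the header row of your own tweets.csv.

```python
import csv

# Assumed column names; compare with the header row of your own tweets.csv.
RETWEET_COLUMNS = ("retweeted_status_id", "retweeted_status_user_id")
TEXT_COLUMN = "text"


def is_retweet(row):
    """Return True if any retweet-only column in this row is filled in."""
    return any((row.get(col) or "").strip() for col in RETWEET_COLUMNS)


def extract_original_tweets(csv_path, out_path):
    """Copy the text of every non-retweet row to out_path, one tweet per line.

    Returns the number of tweets written.
    """
    written = 0
    with open(csv_path, newline="", encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        for row in csv.DictReader(src):
            if is_retweet(row):
                continue  # skip retweets and keep going
            dst.write(row[TEXT_COLUMN] + "\n")
            written += 1
    return written
```

Calling extract_original_tweets("tweets.csv", "yourtweetfile.txt") would write the plaintext file and return how many tweets it kept. Note that this sketch opens the output file in write mode, which creates it if it doesn't exist.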
Of course, there are a few things I’d like to change about this script. tweetwc doesn’t currently count the words for you; it relies on an external word counter of your choice, and I’d like to fix that. You also have to create the text file to write to before running the script, which should be an easy improvement. The biggest limitation is that the script doesn’t account for old-school RTs (tweets typed out by hand as "RT @username: ..."), so it counts the full text of those tweets as part of your word count.
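One possible way to handle old-school RTs is to keep only the words that come before the "RT @username" marker. The sketch below is a heuristic, not something tweetwc does today; the function name own_words and the pattern are assumptions, and the pattern will miss variants such as "MT" or a lowercase "rt".

```python
import re

# "RT", then whitespace, then an @username. The \b keeps words that merely
# end in "RT" (like "ART") from matching.
OLD_SCHOOL_RT = re.compile(r"\bRT\s+@\w+")


def own_words(text):
    """Return the part of a tweet that comes before an old-school 'RT @user'.

    Tweets without the marker are returned unchanged (minus outer whitespace).
    """
    return OLD_SCHOOL_RT.split(text, maxsplit=1)[0].strip()
```

A tweet that is nothing but a manual retweet comes back as an empty string, so it adds no words to the count, while a comment placed in front of the RT is kept.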
What you need
Getting started with tweetwc is easy. You just need a few things:
* Your Twitter archive. You can request this from your Twitter profile. Unzip it and save the files somewhere you can remember.
* Python 2.7 (this may work with Python 3, but I haven’t confirmed). Chances are good you already have Python on your computer if you use Linux or Mac OS X. But if you don’t, or if you run Windows, follow Zed Shaw’s instructions on installing Python.
* A terminal that accepts command line prompts. If you’re a Mac or Linux user, you probably already have one.
* A method of counting words. This may be the wc command or a word processor (like LibreOffice or MS Office) that can count words.
* The code itself. Get the source on GitHub.
Got those? Let’s get started.
1. Unzip your Twitter archive.
If you haven’t already, unzip the archive with WinRAR, unzip, or whatever tool you normally use. Then open the folder it was extracted to. The directory structure will look like this:
tweets
—a few files like index.html and the file we want, tweets.csv
—folders like css, data, img, js, and lib. Look around them if you like, but we don’t need them here.
If you like, you can open index.html in a browser for a pretty version of your archive. Today we’re using a different version of the archive. Find the file named tweets.csv. Make a note of what folder it’s in.
Example: my tweets.csv file is stored in /home/sushi/Backup/tweets/tweets.csv
2. Edit the path to your tweets.csv file.
Remember the path to your tweets.csv file from step 1? Now open the tweetwc.py file in a text editor of your choice and change MY_FILE so that it refers to that path.
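As a rough illustration, using the example path from step 1, the edited line might look something like this. The exact form of the line in tweetwc.py may differ; this assumes MY_FILE is a plain string holding the path.

```python
# Assumed form of the setting: a plain string with the full path to tweets.csv.
# Replace the example path with your own.
MY_FILE = "/home/sushi/Backup/tweets/tweets.csv"
```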
3. Open a terminal and run that code
Got a terminal open? Now go to the directory with your tweetwc.py code with cd /path/to/that/directory. Enter python tweetwc.py yourtweetfile.txt. The yourtweetfile.txt file is where this script will write your tweets (for now, you need to create this file before running the script). You can then use a word counter to count these words.
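If you have the wc command, wc -w yourtweetfile.txt prints the word count. If you don't, a few lines of Python can do roughly the same job. This is a minimal sketch for Python 3 and is not part of tweetwc; count_words is a made-up helper name, and it counts whitespace-separated words, so hashtags, @mentions, and links each count as one word.

```python
def count_words(path):
    """Count whitespace-separated words in a text file, much like wc -w."""
    total = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            total += len(line.split())
    return total
```

Calling count_words("yourtweetfile.txt") from the same directory returns the total as an integer.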
And you’re done! Questions? Let me know.