How to Scrape Monthly Tweet Data using Jupyter Notebook

Liang Han Sheng
3 min read · Nov 23, 2021


Hi there, this article might be helpful to you if you want to

  1. Scrape monthly tweet data automatically
  2. Convert the JSON file containing the tweets into a DataFrame

If you just want to do simple scraping, you can easily scrape thousands of tweets using the steps provided in my previous post using the CLI — Tweets Scraping using Python.

So, as a quick revision, by using the following command:

snscrape --jsonl --progress --max-results 2000 --since 2020-06-01 twitter-hashtag "relax until:2020-07-01" > text-query-tweets.json

you can easily download the latest 2000 tweets with the “relax” hashtag into a JSON file. But what if you want to download not just the latest tweets, but a fixed amount from each month over a certain duration?

Yes, that is easy: we can simply use a for loop. To save you the work, I will show you my code here, which you can put in your Jupyter notebook, or check out this GitHub repo. We will also convert the results to a DataFrame so you can start data manipulation straight away in future steps.

Without further ado, let’s start.

  1. First, we will need to install snscrape.

You can install it in a Jupyter notebook cell using

!pip3 install git+https://github.com/JustAnotherArchivist/snscrape.git

You can install snscrape from PyPI, but some features are not yet available in the PyPI release, so let's install it from GitHub to avoid any trouble.

2. Then, we need to import the following.

import os
import pandas as pd

3. List down the hashtags and month intervals you would like to scrape.

For example, the date_interval below will scrape two months: 2021-04-01 to 2021-05-01, and 2021-05-01 to 2021-06-01.

# hashtags to scrape
hashtag = ["hashtag1", "hashtag2", "hashtag3"]
# month boundaries, to ensure there are tweets from different months
date_interval = ["2021-04-01", "2021-05-01", "2021-06-01"]
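If you would rather not type the month boundaries by hand, a small sketch using pandas can generate the same list (`freq="MS"` means month-start frequency; the endpoints here just match the example above):

```python
import pandas as pd

# Generate month-start boundaries between two dates instead of listing them manually
date_interval = (
    pd.date_range("2021-04-01", "2021-06-01", freq="MS")
    .strftime("%Y-%m-%d")
    .tolist()
)
print(date_interval)  # ['2021-04-01', '2021-05-01', '2021-06-01']
```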

4. Bulk scraping

data = None
for i in range(len(date_interval) - 1):
    for hash in hashtag:
        os.system(
            f'snscrape --jsonl --progress --max-results 5000 --since {date_interval[i]} twitter-hashtag "{hash} until:{date_interval[i+1]}" > text-query-tweets.json')
        tweets_df = pd.read_json('text-query-tweets.json', lines=True)
        df = tweets_df[["id", "url", "date", "content",
                        "hashtags", "cashtags", "media", "lang"]]
        if data is None:
            data = df
        else:
            data = data.append(df)  # note: on pandas >= 2.0, use pd.concat([data, df]) instead

For a bit of explanation: each iteration of the for loop above scrapes the results for one month and saves them in text-query-tweets.json, which is then loaded into the DataFrame tweets_df (the next month's scraped results overwrite text-query-tweets.json).

The df is tweets_df restricted to some chosen columns. All the results are accumulated in data.
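A quick note on that accumulation step: newer pandas releases (2.0 and later) removed DataFrame.append, so if the loop above errors for you, the equivalent idiom is to collect each month's df in a list and concatenate once at the end. A minimal sketch with hypothetical stand-in frames:

```python
import pandas as pd

# Stand-ins for the per-month DataFrames produced inside the loop
# (hypothetical data, just to show the pattern)
monthly_frames = [pd.DataFrame({"id": [1]}), pd.DataFrame({"id": [2]})]

frames = []
for df in monthly_frames:
    frames.append(df)  # collect each month's frame instead of data.append(df)

# One concat at the end replaces the repeated DataFrame.append calls
data = pd.concat(frames, ignore_index=True)
```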

tweets_df has 28 columns, and you may not need them all. If you do want to keep every column, just replace the following line

df = tweets_df[["id", "url", "date", "content","hashtags", "cashtags", "media", "lang"]]

with

df = tweets_df

Congrats, you can already view the data in a DataFrame!

5. Save to CSV file

data.to_csv('tweet_data.csv')
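When you load the CSV back later, two read_csv arguments are worth knowing: to_csv writes the DataFrame index as an unnamed first column, so pass index_col=0 to drop it, and parse_dates restores the date column to datetimes. A small round-trip sketch with hypothetical data:

```python
import pandas as pd

# Hypothetical stand-in for the scraped data
data = pd.DataFrame({"date": pd.to_datetime(["2021-04-02"]),
                     "content": ["hello"]})
data.to_csv('tweet_data.csv')

# index_col=0 drops the unnamed index column; parse_dates restores datetimes
loaded = pd.read_csv('tweet_data.csv', index_col=0, parse_dates=['date'])
```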

If you want to keep a record, remember to save it. ✌️

And you are done with one article today! Thank you for reading.

We used “twitter-hashtag” as our scraper argument above; you can also scrape posts by specifying any of the following instead:

‘telegram-channel’, ‘vkontakte-user’, ‘weibo-user’, ‘facebook-group’, ‘instagram-user’, ‘instagram-hashtag’, ‘instagram-location’, ‘reddit-user’, ‘reddit-subreddit’, ‘reddit-search’, ‘twitter-search’, ‘twitter-tweet’, ‘facebook-user’, ‘facebook-community’, ‘twitter-user’, ‘twitter-hashtag’, ‘twitter-list-posts’, ‘twitter-profile’

Github

You can find all the code above at this site.

About Author:

This article is written by Han Sheng, Technical Lead at Arkmind, Malaysia. He has a passion for software design/architecture, computer vision, and edge devices. He has made several AI-based web/mobile applications to help clients solve real-world problems. Feel free to read about him via his GitHub profile.
