Read the API documentation for Reddit, follow the rules, and use Python + PRAW to make requests. Writing a scraper with BeautifulSoup or similar seems like a headache.
Right! I used the official Reddit API. I created an app, got the API credentials, and then used the Python library PRAW to consume the API. https://praw.readthedocs.io/en/latest/
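For anyone who wants to try the same route, here is a minimal sketch of what that setup might look like. It is not the actual script: the environment variable names, the user agent string, the chosen fields, and the helper names (`make_client`, `submission_to_row`, `fetch_new_posts`) are all assumptions made for illustration.

```python
import os


def make_client():
    """Build a PRAW client from app credentials.

    The environment variable names and user agent are hypothetical.
    """
    import praw  # third-party package: pip install praw

    return praw.Reddit(
        client_id=os.environ["REDDIT_CLIENT_ID"],
        client_secret=os.environ["REDDIT_CLIENT_SECRET"],
        user_agent="post-collector/0.1 (non-commercial script)",
    )


def submission_to_row(submission):
    """Keep a few fields from a submission (field choice is an assumption)."""
    return {
        "id": submission.id,
        "title": submission.title,
        "created_utc": submission.created_utc,
        "score": submission.score,
    }


def fetch_new_posts(reddit, subreddit_name, limit=100):
    """Return the newest posts of a subreddit as plain dictionaries."""
    listing = reddit.subreddit(subreddit_name).new(limit=limit)
    return [submission_to_row(s) for s in listing]
```

Usage would be something like `rows = fetch_new_posts(make_client(), "python")`. PRAW spaces out requests according to the rate-limit information the API returns, so a script like this normally does not need its own throttling logic.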
It took me 36 hours to collect the 4M posts. The Reddit API returns results in batches of 100, with a 2-second sleep after each batch.
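As a rough sanity check on those numbers, the arithmetic can be sketched as below. It assumes exactly 4,000,000 posts, full batches of 100, and one 2-second pause per batch; whatever time is left over is attributed to request latency and processing, which is a guess rather than a measurement.

```python
TOTAL_POSTS = 4_000_000   # "4M posts", taken as exact for the estimate
BATCH_SIZE = 100          # results per API response
SLEEP_SECONDS = 2         # pause after each batch
REPORTED_HOURS = 36       # total wall-clock time reported

batches = TOTAL_POSTS // BATCH_SIZE
sleep_seconds_total = batches * SLEEP_SECONDS
sleep_hours = sleep_seconds_total / 3600

# Time not spent sleeping, spread evenly over all batches.
other_seconds_per_batch = (REPORTED_HOURS * 3600 - sleep_seconds_total) / batches

print(batches)                             # 40000
print(round(sleep_hours, 1))               # 22.2
print(round(other_seconds_per_batch, 2))   # 1.24
```

So about 22 of the 36 hours would be sleep time alone, leaving a little over a second per batch for the request itself and any processing.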
Reddit's API is free for my use cases. From their documentation: "Our API allows free access to moderators and developers creating these tools for non-commercial use cases."