
How I use ML to look for SaaS ideas on Reddit automatically

A how-to tutorial on finding SaaS ideas on Reddit

We all know that Reddit is a goldmine when it comes to finding SaaS ideas. People rant about their lives, talk about their frustrations, seek advice, and ask questions. As a solopreneur, I see someone else’s problem as my opportunity.

But as a solopreneur who juggles multiple responsibilities, I really don’t have the time to scroll and look for posts manually every day. I could also miss important posts.

I wanted to build a tool that scrapes Reddit daily, identifies pain points, and does it all automatically. You can also use it to look for potential customers, but let’s save that for another time.

This is a technical how-to guide on how I trained a model to identify if a post is worth investigating further.

Step 1: Setting things up

pip install praw python-dotenv transformers datasets torch pandas nltk scikit-learn
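
The preprocessing step in the next section relies on NLTK’s stopword list and WordNet lemmatizer, so download those corpora once as well:

import nltk

# One-time download of the corpora used during preprocessing
nltk.download('stopwords')
nltk.download('wordnet')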

Step 2: Scrape Reddit & Process the texts

Pick a variety of subreddits and sort them by “top” or “hot”. For your reference, I scraped 15 posts from each of 84 subreddits, for a total of 1,260 posts.

To make the text suitable for BERT, we need to clean it. This involves removing punctuation and stopwords, and normalizing the text.

After you have processed the texts, we need to manually label each of the 1,260 posts (there’s a small helper for this right after the script below).

import praw
import os
from dotenv import load_dotenv
import string
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import pandas as pd

# Load environment variables
load_dotenv()

# Reddit API credentials
REDDIT_CLIENT_ID = os.getenv('REDDIT_CLIENT_ID')
REDDIT_CLIENT_SECRET = os.getenv('REDDIT_CLIENT_SECRET')
REDDIT_USER_AGENT = os.getenv('REDDIT_USER_AGENT')

# Initialize Reddit instance
reddit = praw.Reddit(
    client_id=REDDIT_CLIENT_ID,
    client_secret=REDDIT_CLIENT_SECRET,
    user_agent=REDDIT_USER_AGENT
)

# List of subreddits to scrape
subreddits = [
    "AgingParents", "antiwork", "AppBusiness", "AskUK", "AusProperty", "AusPropertyChat", "b2b_sales", "b2bmarketing", "Backend", "bash", "BeMyReference", "biotech", "buildapcsales", "business", "Business_Ideas", "businessanalysis", "ChatGPTPromptGenius", "coldemail", "consulting", "copywriting", "CRM", "DigitalMarketingHack", "dividends", "ecommerce", "entrepreneur", "EntrepreneurRideAlong", "ethereum", "farming", "FenceBuilding", "financialindependence", "findawebsite", "Fitness", "Fiverr", "Flipping", "freelance", "freelancer", "Freelancers", "garageporn", "gardening", "growmybusiness", "hacking", "Hacking_Tutorials", "homegym", "homelab", "homestead", "Homesteading", "Hosting", "HowEarnMoneyOnline", "indiebiz", "indiehackers", "Insurance", "investing", "JapanTravel", "juststart", "Landlord", "landscaping", "laptops", "Layoffs", "Leadership", "LeadGeneration", "legaltech", "linkedin", "linux", "linuxhardware", "linuxmasterrace", "LocalLLaMA", "MakeMoney", "mancave", "MarketingHelp", "MedicalWriters", "OpenAI", "privacy", "QuickBooks", "recruitinghell", "SaaS", "sales", "Salesforce", "SEO", "smallbusiness", "SmallBusinessCanada", "smallbusinessuk", "startups", "TechSEO", "webdev"
]

# Scrape posts from subreddits
posts = []
for sub in subreddits:
    subreddit = reddit.subreddit(sub)
    for post in subreddit.hot(limit=15):  # Fetch 15 "hot" posts per subreddit
        posts.append({
            "title": post.title,
            "text": post.selftext
        })

# Clean text for BERT: normalize apostrophes, lowercase, strip punctuation and stopwords, lemmatize
def preprocess_text(text):
    # Ensure text is properly decoded as UTF-8
    if isinstance(text, bytes):
        text = text.decode('utf-8', errors='ignore')

    # Replace curly apostrophes with straight apostrophes
    text = text.replace("\u2019", "'").replace("\u2018", "'")

    # Lowercase
    text = text.lower()
    
    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    
    # Tokenize
    tokens = text.split()
    
    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if word not in stop_words]
    
    # Lemmatize
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(word) for word in tokens]
    
    # Join back into a string
    return " ".join(tokens)

# Apply preprocessing to scraped posts
for post in posts:
    post["cleaned_title"] = preprocess_text(post["title"])
    post["cleaned_text"] = preprocess_text(post["text"])
    post["cleaned_full_text"] = preprocess_text(f"{post['title']} {post['text']}")  # Fixed typo here

# Save preprocessed data to a CSV file
df = pd.DataFrame(posts)
df.to_csv("preprocessed_posts.csv", index=False)

print("Done!")

Step 3: Train a BERT Model

We’ll fine-tune a pre-trained BERT model to classify posts as containing a pain point (1) or not (0). Here’s the training script:

from transformers import BertTokenizer, BertForSequenceClassification, Trainer, TrainingArguments
from datasets import Dataset
import pandas as pd

# Load the preprocessed data (assumes every row now has a manually added
# 'label' column: 1 = pain point, 0 = not)
df = pd.read_csv('./preprocessed_posts.csv')

# Use the combined, cleaned title + body prepared earlier
df['text'] = df['cleaned_full_text'].fillna('')
df['label'] = df['label'].astype(int)

# Convert to Hugging Face Dataset
dataset = Dataset.from_pandas(df[['text', 'label']])

# Load pre-trained BERT tokenizer and model with a binary classification head
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

# Tokenize the dataset
def tokenize_function(examples):
    return tokenizer(
        examples['text'],
        padding="max_length",  # Pad sequences to the maximum length
        truncation=True,       # Truncate sequences longer than the model's max length
        max_length=512,        # A reasonable maximum length for Reddit posts
    )

tokenized_dataset = dataset.map(tokenize_function, batched=True)

# The Trainer expects the target column to be called 'labels'
tokenized_dataset = tokenized_dataset.rename_column('label', 'labels')

# Define training arguments
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=2,
    save_steps=10_000,
    save_total_limit=2,
    logging_dir='./logs',
    logging_steps=500,
)

# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
)

# Fine-tune the model
trainer.train()

# Save the fine-tuned model
model.save_pretrained("./fine_tuned_model")
tokenizer.save_pretrained("./fine_tuned_model")
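
Before pointing the model at fresh posts, it’s worth a quick sanity check on a single example. A short sketch (the sample text is made up; in practice, run it through the same preprocess_text cleaning from Step 2 first):

import torch
from transformers import BertTokenizer, BertForSequenceClassification

# Load the fine-tuned classifier
tokenizer = BertTokenizer.from_pretrained("./fine_tuned_model")
model = BertForSequenceClassification.from_pretrained("./fine_tuned_model")
model.eval()

sample = "spending hour every week chasing client invoice wish tool automate"
inputs = tokenizer(sample, truncation=True, max_length=512, return_tensors="pt")

with torch.no_grad():
    probs = torch.softmax(model(**inputs).logits, dim=-1)[0]

print(f"pain point: {int(torch.argmax(probs))}, confidence: {probs.max().item():.2f}")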

Step 4: Test the trained model

I scraped a few fresh posts from r/Entrepreneur the next day and ran them through the model with the following script:

import os
import praw
import torch
import pandas as pd
from dotenv import load_dotenv
from transformers import BertTokenizer, BertForSequenceClassification

# Reuse the Reddit credentials from Step 2
load_dotenv()
reddit = praw.Reddit(
    client_id=os.getenv('REDDIT_CLIENT_ID'),
    client_secret=os.getenv('REDDIT_CLIENT_SECRET'),
    user_agent=os.getenv('REDDIT_USER_AGENT')
)

# Load the fine-tuned classifier
tokenizer = BertTokenizer.from_pretrained("./fine_tuned_model")
model = BertForSequenceClassification.from_pretrained("./fine_tuned_model")
model.eval()

# Grab a batch of fresh posts from r/Entrepreneur
results = []
for post in reddit.subreddit("Entrepreneur").new(limit=25):
    # Ideally, clean this with the same preprocess_text from Step 2 first
    text = f"{post.title} {post.selftext}"
    inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        probs = torch.softmax(model(**inputs).logits, dim=-1)[0]
    results.append({
        "title": post.title,
        "label": int(torch.argmax(probs)),   # 1 = pain point, 0 = not
        "confidence": float(probs.max())
    })

# Save the predictions for review
pd.DataFrame(results).to_csv("classified_posts.csv", index=False)
print("Done!")

Results were great!

I was very satisfied with the output. I filtered the results down to those with a confidence level of 0.95 or above.

My next step is to use this model to identify whether a post could give me a SaaS idea to build in the future. At this point, I am thinking of setting up a cronjob to scrape Reddit, run the posts through the model, and then email me the results every week.
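
If you want to wire this up the same way, a weekly cron entry along these lines would do it (the script name and path are placeholders):

# Run every Monday at 08:00: scrape, classify and email the results
0 8 * * 1 /usr/bin/python3 /path/to/scrape_classify_email.py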

Another potential use

Another potential use for this is lead generation for marketers. The way I am thinking it could work:

  1. Scrape subreddits in your niche

  2. Run the posts through the model to identify pain points (a rough sketch of this step is shown below)

  3. Reach out to the OP through a cold DM and offer a solution

I think a marketer could even skip the qualifying part because if an OP is already looking for a solution, it’s going to make the selling process much easier.
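
Here is a minimal sketch of steps 1 and 2, starting from the classifier output of the Step 4 script and assuming you also capture each post’s author and permalink when scraping (the scripts above only keep the title and body):

import pandas as pd

# Classifier output (see Step 4), plus hypothetical 'author' and 'permalink'
# columns captured during scraping
df = pd.read_csv("classified_posts.csv")

# Keep only confident pain points as potential leads for outreach
leads = df[(df["label"] == 1) & (df["confidence"] >= 0.95)]
leads[["author", "permalink", "title"]].to_csv("leads.csv", index=False)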