Behind The Scenes
Architecture & Implementation
The application is written in Python using the Flask framework. We use Pandas for data processing and SQLite for local storage. The codebase is open-source and available on GitHub. It runs on a standard VPS from Vultr. The architecture is intentionally simple and relies on standard web development practices.
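As an illustration, the local SQLite storage can be as simple as a single table mirroring the channel metadata described below. This is a hypothetical sketch, not the real schema:

```python
import sqlite3

# Hypothetical sketch of the local store; the real app would use a
# file path instead of an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE IF NOT EXISTS channels (
        ChannelID TEXT PRIMARY KEY,
        ChannelName TEXT,
        SubscriberCount INTEGER,
        VideoCount INTEGER,
        ViewCount INTEGER
    )
""")
conn.execute(
    "INSERT INTO channels VALUES (?, ?, ?, ?, ?)",
    ("UCsBjURrPoezykLs9EqgamOA", "Fireship", 2750000, 601, 364500037),
)
row = conn.execute("SELECT ChannelName FROM channels").fetchone()
```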
Data Collection
We get our data directly from the YouTube API using the googleapiclient library. Here is how we initialize the service:
import os

from googleapiclient.discovery import build

if __name__ == "__main__":
    # The API key is read from the environment rather than hard-coded
    API_KEY = os.environ.get("YT_API_KEY")
    service = build("youtube", "v3", developerKey=API_KEY)
Source Discovery
We use the YouTube API's `search().list()` method with keywords like "Tech", "Programming", and "Computer Science" to find channels. We extract the channel IDs and use the `channels().list()` method to get detailed metadata. Here is what an entry looks like in our CSV file:
ChannelID,ChannelName,ChannelIcon,ChannelUrl,ExistedSince,SubscriberCount,VideoCount,ViewCount,Country
UCsBjURrPoezykLs9EqgamOA,Fireship,https://yt3.ggpht.com/ytc/AIf8zZTUVa5AeFd3m5-4fdY2hEaKof3Byp8VruZ0f0FNEA,https://www.youtube.com/@fireship,2017-04-07,2750000,601,364500037,US
We append this data to our existing list, using pandas to remove duplicates and rows with missing data. We also manually add channels and periodically prune inactive ones. The process looks like this:
request = service.search().list(
    q="Tech | Programming | Computer Science",
    type="channel", part="id",
    maxResults=50, order="relevance",
    relevanceLanguage="en", regionCode="US",
)
response = request.execute()
for item in response.get("items", []):
    temp_id = item["id"]["channelId"]
    searched_channels.append(temp_id)
# Other code...
for channel in searched_channels:
    request = service.channels().list(part="snippet,statistics", id=channel)  # plus other parts
    response = request.execute()
    channel_info = {
        "ChannelID": response["items"][0]["id"],
        "ChannelUrl": response["items"][0]["snippet"]["customUrl"],
        "ChannelName": response["items"][0]["snippet"]["title"],
        "ChannelIcon": response["items"][0]["snippet"]["thumbnails"]["medium"]["url"],
        # Additional information about that channel
    }
    channels.append(channel_info)

df = pd.DataFrame(channels)
df = pd.concat([channel_df, df], ignore_index=True)  # channel_df holds the existing list
df.drop_duplicates(inplace=True)
df.dropna(inplace=True)
df.to_csv("channels.csv", index=False)
To find channels the API might miss, we also built a simple web scraper using Selenium WebDriver. It searches a random tech topic on YouTube, extracts channel URLs from the results, and follows recommended videos to find additional channels. Here is a snippet of the scraper:
driver = webdriver.Firefox(options=options)
chosen_topic = choice(search_terms)
search_terms.remove(chosen_topic)
driver.get(f"https://www.youtube.com/results?search_query={chosen_topic}")

all_recommended_channels = driver.find_elements(By.ID, "channel-thumbnail")
channels = [channel.get_attribute("href").split("@")[1] for channel in all_recommended_channels]

video_links = driver.find_elements(By.CSS_SELECTOR, "a#video-title")
choice(video_links[5:]).click()
for i in range(7):
    # Record the current video's channel before clicking away, so the
    # element reference does not go stale after navigation
    recommended_channel = driver.find_element(By.CSS_SELECTOR, "a.ytd-video-owner-renderer")
    channels.append(recommended_channel.get_attribute("href").split("@")[1])
    recommended_videos = driver.find_elements(By.TAG_NAME, "ytd-compact-video-renderer")
    choice(recommended_videos[:5]).click()
# Converting URLs to channelIDs
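Converting the scraped @handles back to channel IDs can be done with the same Data API: `channels().list()` accepts a `forHandle` filter. A sketch, with helper names that are ours rather than the codebase's:

```python
from urllib.parse import urlparse

def handle_from_url(url):
    """Extract the @handle from a URL like https://www.youtube.com/@fireship."""
    return urlparse(url).path.lstrip("/")

def channel_id_for_handle(service, handle):
    # channels().list() supports the forHandle filter, which resolves
    # an @handle to its canonical channel resource (requires a built
    # YouTube Data API service object).
    response = service.channels().list(part="id", forHandle=handle).execute()
    items = response.get("items", [])
    return items[0]["id"] if items else None
```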
You can download our full database of YouTube channels below. It contains all the channel metadata we use. If you have channels you'd like to see added or removed, please contact us.
Video Indexing
We iterate through our channel list and use the `activities().list()` function to fetch recent uploads. We then use the `videos().list()` method to retrieve detailed statistics like view counts, likes, and duration. Here is a snippet:
for channel in channel_df["ChannelID"]:
    request = service.activities().list(
        part="snippet,id,contentDetails",
        publishedAfter=yesterday.isoformat() + "T00:00:00Z",
        channelId=channel, maxResults=50, fields=FIELDS,
    )
    response = request.execute()
    for item in response["items"]:
        channel_name = item["snippet"]["channelTitle"]
        channel_id = item["snippet"]["channelId"]
        video_id = item["contentDetails"]["upload"]["videoId"]
        # Additional information...

        # Separate names for the per-video request, so it does not
        # shadow the activities response above
        video_request = service.videos().list(id=video_id, part="statistics,snippet,contentDetails")
        video_response = video_request.execute()
        view_count = int(video_response["items"][0]["statistics"]["viewCount"])
        like_count = int(video_response["items"][0]["statistics"]["likeCount"])
        content_rating = video_response["items"][0]["contentDetails"]["contentRating"]
        video_duration = isodate.parse_duration(video_response["items"][0]["contentDetails"]["duration"])
        # Again, remaining additional information...
We apply a few filters to the raw data. We ignore videos that are less than 30 seconds long, have fewer than 500 views, or are not in English. We also verify they are categorized under "Science & Technology" or "Education". The filtered list is stored in a JSON file. Here is an example payload:
{
    "ChannelName": "Fireship",
    "ChannelId": "UCsBjURrPoezykLs9EqgamOA",
    "ChannelIcon": "https://yt3.ggpht.com/ytc/AIf8zZTUVa5AeFd3m5-4fdY2hEaKof3Byp8VruZ0f0FNEA",
    "ChannelUrl": "https://www.youtube.com/@fireship",
    "VideoUrl": "https://www.youtube.com/watch?v=ky5ZB-mqZKM",
    "VideoTitle": "AI influencers are getting filthy rich... let's build one",
    "VideoId": "ky5ZB-mqZKM",
    "PublishedDate": "2023-11-29 21:06",
    "Thumbnail": "https://i.ytimg.com/vi/ky5ZB-mqZKM/mqdefault.jpg",
    "Duration": "0:04:25",
    "Definition": "HD",
    "Language": "EN",
    "Caption": false,
    "ContentRating": false,
    "ViewCount": 4091018,
    "LikeCount": 156078,
    "CommentCount": 5052,
    "CategoryId": 28
}
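The filters above can be expressed as a single predicate. A hedged sketch, using the field names from the payload above (category 27 is Education, 28 is Science & Technology):

```python
from datetime import timedelta

ALLOWED_CATEGORIES = {27, 28}  # 27 = Education, 28 = Science & Technology

def keep_video(video):
    """Apply the filters described above to one parsed video entry.

    Assumes Duration has already been parsed into a timedelta
    (the stored payload serializes it as a string like "0:04:25").
    """
    if video["Duration"] < timedelta(seconds=30):
        return False
    if video["ViewCount"] < 500:
        return False
    if video["Language"].upper() != "EN":
        return False
    return video["CategoryId"] in ALLOWED_CATEGORIES
```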
Scoring Algorithm
Videos are ranked using a custom scoring function. We compute a base multiplier from metadata (duration, definition, captions) and apply it to normalized engagement metrics (views, likes, comments). This helps surface highly engaged content regardless of a channel's total subscriber count.
QualityMultiplier = SubscriberBalance × DefinitionQuality × CaptionQuality × RatingQuality × DurationQuality
We normalize the engagement metrics using a logarithmic function to prevent massive view counts from skewing the results. We weight comments and likes more heavily than views, as they are stronger indicators of engagement. The final rating is calculated as follows:
Rating = (ViewRate + LikeRate + CommentRate) × QualityMultiplier
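A minimal sketch of that calculation, assuming `log10(x + 1)` normalization and illustrative weights of 1, 2, and 3 for views, likes, and comments (the production weights differ):

```python
from math import log10

def rating(video, quality_multiplier):
    # log10(x + 1) keeps huge view counts from dominating; the weights
    # (1 for views, 2 for likes, 3 for comments) are illustrative only,
    # reflecting that comments and likes signal stronger engagement.
    view_rate = log10(video["ViewCount"] + 1)
    like_rate = 2 * log10(video["LikeCount"] + 1)
    comment_rate = 3 * log10(video["CommentCount"] + 1)
    return (view_rate + like_rate + comment_rate) * quality_multiplier
```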
Once every video is scored, we sort the index. For weekly and monthly lists, we aggregate the daily videos, re-sort, and truncate to the top N results. The sorted lists are saved as JSON and served via the API and dashboard.
for lang, all_videos in videos.items():
    for time in ["daily", "weekly", "monthly", "yearly"]:
        with open(f"{time}.json", "r") as f:
            data = json.load(f)
        if time == "daily":
            top_day = OrderedDict(sorted(all_videos.items(), key=lambda item: item[0], reverse=True))
            data[lang] = OrderedDict(list(top_day.items()))
        elif time == "weekly":
            top_week = update_videos(data[lang], time)
            top_week.update(OrderedDict(list(top_day.items())[:50]))
            top_week = sort_videos(top_week)
            data[lang] = top_week
        # Same thing for monthly & yearly videos
        with open(f"{time}.json", "w") as f:
            json.dump(data, f, indent=4)
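The `sort_videos` helper referenced above is not shown. Assuming entries are keyed by their score, as the daily branch's key-based sort suggests, it might look like:

```python
from collections import OrderedDict

def sort_videos(videos, limit=50):
    # Mirrors the daily branch: entries are assumed to be keyed by
    # score, so sorting keys in descending order ranks the videos and
    # truncating keeps only the top `limit` entries.
    ranked = sorted(videos.items(), key=lambda item: item[0], reverse=True)
    return OrderedDict(ranked[:limit])
```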