Behind The Scenes
Architecture & Implementation
The application is written in Python using the Flask framework. We use Pandas for data processing and SQLite for local storage. The codebase is open-source and available on GitHub. It runs on a standard VPS from Vultr. The architecture is intentionally simple and relies on standard web development practices.
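As an illustration, the local SQLite storage can be as simple as a single table mirroring the channel metadata described below. This is a hypothetical sketch, not the real schema:

```python
import sqlite3

# Hypothetical sketch of the local store; the real app would use a
# file path instead of an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE IF NOT EXISTS channels (
        ChannelID TEXT PRIMARY KEY,
        ChannelName TEXT,
        SubscriberCount INTEGER,
        VideoCount INTEGER,
        ViewCount INTEGER
    )
""")
conn.execute(
    "INSERT INTO channels VALUES (?, ?, ?, ?, ?)",
    ("UCsBjURrPoezykLs9EqgamOA", "Fireship", 2750000, 601, 364500037),
)
row = conn.execute("SELECT ChannelName FROM channels").fetchone()
```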
Data Collection
We get our data directly from the YouTube API using the googleapiclient library. Here is how we initialize the service:
import os

from googleapiclient.discovery import build

if __name__ == "__main__":
    # The API key is read from the environment rather than hard-coded
    API_KEY = os.environ.get("YT_API_KEY")
    service = build("youtube", "v3", developerKey=API_KEY)
Source Discovery
We use the YouTube API's `search().list()` method with keywords like "Tech", "Programming", and "Computer Science" to find channels. We extract the channel IDs and use the `channels().list()` method to get detailed metadata. Here is what an entry looks like in our CSV file:
ChannelID,ChannelName,ChannelIcon,ChannelUrl,ExistedSince,SubscriberCount,VideoCount,ViewCount,Country
UCsBjURrPoezykLs9EqgamOA,Fireship,https://yt3.ggpht.com/ytc/AIf8zZTUVa5AeFd3m5-4fdY2hEaKof3Byp8VruZ0f0FNEA,https://www.youtube.com/@fireship,2017-04-07,2750000,601,364500037,US
We append this data to our existing list, using pandas to remove duplicates and rows with missing data. We also manually add channels and periodically prune inactive ones. The process looks like this:
request = service.search().list(
    q="Tech | Programming | Computer Science",
    type="channel", part="id",
    maxResults=50, order="relevance",
    relevanceLanguage="en", regionCode="US",
)
response = request.execute()
for item in response.get("items", []):
    temp_id = item["id"]["channelId"]
    searched_channels.append(temp_id)
# Other code...
for channel in searched_channels:
    request = service.channels().list(part="snippet,statistics", id=channel)  # plus other parts
    response = request.execute()
    channel_info = {
        "ChannelID": response["items"][0]["id"],
        "ChannelUrl": response["items"][0]["snippet"]["customUrl"],
        "ChannelName": response["items"][0]["snippet"]["title"],
        "ChannelIcon": response["items"][0]["snippet"]["thumbnails"]["medium"]["url"],
        # Additional information about that channel
    }
    channels.append(channel_info)

df = pd.DataFrame(channels)
df = pd.concat([channel_df, df], ignore_index=True)  # channel_df holds the existing list
df.drop_duplicates(inplace=True)
df.dropna(inplace=True)
df.to_csv("channels.csv", index=False)
To find channels the API might miss, we also built a simple web scraper using Selenium WebDriver. It searches a random tech topic on YouTube, extracts channel URLs from the results, and follows recommended videos to find additional channels. Here is a snippet of the scraper:
driver = webdriver.Firefox(options=options)
chosen_topic = choice(search_terms)
search_terms.remove(chosen_topic)
driver.get(f"https://www.youtube.com/results?search_query={chosen_topic}")

all_recommended_channels = driver.find_elements(By.ID, "channel-thumbnail")
channels = [channel.get_attribute("href").split("@")[1] for channel in all_recommended_channels]

video_links = driver.find_elements(By.CSS_SELECTOR, "a#video-title")
choice(video_links[5:]).click()
for i in range(7):
    # Record the current video's channel before clicking away, so the
    # element reference does not go stale after navigation
    recommended_channel = driver.find_element(By.CSS_SELECTOR, "a.ytd-video-owner-renderer")
    channels.append(recommended_channel.get_attribute("href").split("@")[1])
    recommended_videos = driver.find_elements(By.TAG_NAME, "ytd-compact-video-renderer")
    choice(recommended_videos[:5]).click()
# Converting URLs to channelIDs
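Converting the scraped @handles back to channel IDs can be done with the same Data API: `channels().list()` accepts a `forHandle` filter. A sketch, with helper names that are ours rather than the codebase's:

```python
from urllib.parse import urlparse

def handle_from_url(url):
    """Extract the @handle from a URL like https://www.youtube.com/@fireship."""
    return urlparse(url).path.lstrip("/")

def channel_id_for_handle(service, handle):
    # channels().list() supports the forHandle filter, which resolves
    # an @handle to its canonical channel resource (requires a built
    # YouTube Data API service object).
    response = service.channels().list(part="id", forHandle=handle).execute()
    items = response.get("items", [])
    return items[0]["id"] if items else None
```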
You can download our full database of YouTube channels below. It contains all the channel metadata we use. If you have channels you'd like to see added or removed, please contact us.
Video Indexing
We iterate through our channel list and use the `activities().list()` function to fetch recent uploads. We then use the `videos().list()` method to retrieve detailed statistics like view counts, likes, and duration. Here is a snippet:
for channel in channel_df["ChannelID"]:
    request = service.activities().list(
        part="snippet,id,contentDetails",
        publishedAfter=yesterday.isoformat() + "T00:00:00Z",
        channelId=channel, maxResults=50, fields=FIELDS,
    )
    response = request.execute()
    for item in response["items"]:
        channel_name = item["snippet"]["channelTitle"]
        channel_id = item["snippet"]["channelId"]
        video_id = item["contentDetails"]["upload"]["videoId"]
        # Additional information...

        # Separate names for the per-video request, so it does not
        # shadow the activities response above
        video_request = service.videos().list(id=video_id, part="statistics,snippet,contentDetails")
        video_response = video_request.execute()
        view_count = int(video_response["items"][0]["statistics"]["viewCount"])
        like_count = int(video_response["items"][0]["statistics"]["likeCount"])
        content_rating = video_response["items"][0]["contentDetails"]["contentRating"]
        video_duration = isodate.parse_duration(video_response["items"][0]["contentDetails"]["duration"])
        # Again, remaining additional information...
We apply a few filters to the raw data. We ignore videos that are less than 30 seconds long, have fewer than 500 views, or are not in English. We also verify they are categorized under "Science & Technology" or "Education". The filtered list is stored in a JSON file. Here is an example payload:
{
    "ChannelName": "Fireship",
    "ChannelId": "UCsBjURrPoezykLs9EqgamOA",
    "ChannelIcon": "https://yt3.ggpht.com/ytc/AIf8zZTUVa5AeFd3m5-4fdY2hEaKof3Byp8VruZ0f0FNEA",
    "ChannelUrl": "https://www.youtube.com/@fireship",
    "VideoUrl": "https://www.youtube.com/watch?v=ky5ZB-mqZKM",
    "VideoTitle": "AI influencers are getting filthy rich... let's build one",
    "VideoId": "ky5ZB-mqZKM",
    "PublishedDate": "2023-11-29 21:06",
    "Thumbnail": "https://i.ytimg.com/vi/ky5ZB-mqZKM/mqdefault.jpg",
    "Duration": "0:04:25",
    "Definition": "HD",
    "Language": "EN",
    "Caption": false,
    "ContentRating": false,
    "ViewCount": 4091018,
    "LikeCount": 156078,
    "CommentCount": 5052,
    "CategoryId": 28
}
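The filters above can be expressed as a single predicate. A hedged sketch, using the field names from the payload above (category 27 is Education, 28 is Science & Technology):

```python
from datetime import timedelta

ALLOWED_CATEGORIES = {27, 28}  # 27 = Education, 28 = Science & Technology

def keep_video(video):
    """Apply the filters described above to one parsed video entry.

    Assumes Duration has already been parsed into a timedelta
    (the stored payload serializes it as a string like "0:04:25").
    """
    if video["Duration"] < timedelta(seconds=30):
        return False
    if video["ViewCount"] < 500:
        return False
    if video["Language"].upper() != "EN":
        return False
    return video["CategoryId"] in ALLOWED_CATEGORIES
```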
Scoring Algorithm
Videos are ranked using a custom scoring function. We compute a base multiplier from metadata (duration, definition, captions) and apply it to normalized engagement metrics (views, likes, comments). This helps surface highly engaged content regardless of a channel's total subscriber count.
QualityMultiplier = SubscriberBalance × DefinitionQuality × CaptionQuality × RatingQuality × DurationQuality
We normalize the engagement metrics using a logarithmic function to prevent massive view counts from skewing the results. We weight comments and likes more heavily than views, as they are stronger indicators of engagement. The final rating is calculated as follows:
Rating = (ViewRate + LikeRate + CommentRate) × QualityMultiplier
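A minimal sketch of that calculation, assuming `log10(x + 1)` normalization and illustrative weights of 1, 2, and 3 for views, likes, and comments (the production weights differ):

```python
from math import log10

def rating(video, quality_multiplier):
    # log10(x + 1) keeps huge view counts from dominating; the weights
    # (1 for views, 2 for likes, 3 for comments) are illustrative only,
    # reflecting that comments and likes signal stronger engagement.
    view_rate = log10(video["ViewCount"] + 1)
    like_rate = 2 * log10(video["LikeCount"] + 1)
    comment_rate = 3 * log10(video["CommentCount"] + 1)
    return (view_rate + like_rate + comment_rate) * quality_multiplier
```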
Once every video is scored, we sort the index. For weekly and monthly lists, we aggregate the daily videos, re-sort, and truncate to the top N results. The sorted lists are saved as JSON and served via the API and dashboard.
for lang, all_videos in videos.items():
    for time in ["daily", "weekly", "monthly", "yearly"]:
        with open(f"{time}.json", "r") as f:
            data = json.load(f)
        if time == "daily":
            top_day = OrderedDict(sorted(all_videos.items(), key=lambda item: item[0], reverse=True))
            data[lang] = OrderedDict(list(top_day.items()))
        elif time == "weekly":
            top_week = update_videos(data[lang], time)
            top_week.update(OrderedDict(list(top_day.items())[:50]))
            top_week = sort_videos(top_week)
            data[lang] = top_week
        # Same thing for monthly & yearly videos
        with open(f"{time}.json", "w") as f:
            json.dump(data, f, indent=4)
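The `sort_videos` helper referenced above is not shown. Assuming entries are keyed by their score, as the daily branch's key-based sort suggests, it might look like:

```python
from collections import OrderedDict

def sort_videos(videos, limit=50):
    # Mirrors the daily branch: entries are assumed to be keyed by
    # score, so sorting keys in descending order ranks the videos and
    # truncating keeps only the top `limit` entries.
    ranked = sorted(videos.items(), key=lambda item: item[0], reverse=True)
    return OrderedDict(ranked[:limit])
```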