Behind The Scenes
How do we differ from the competition?
At BytePicks, we do things a bit differently than the usual platforms. Our approach to content selection and filtering is our own: a ranking algorithm designed to surface the best tech videos for our users, backed by modern tools and techniques.
But hey, we're not keeping these secrets to ourselves. We believe in transparency and continuous improvement, just like any good community does. So here's the lowdown on our almighty algorithm and the engineering that makes it happen. Check out our open-source code, contribute if you fancy, and remember: we're all ears for your ideas!
How we operate under the hood?
Beneath the surface, our web service's codebase is openly available on GitHub, a testament to our commitment to transparency and collaboration. To keep hosting simple, we use Render to power the website, manage the backend, handle the boring admin work, and automate recurring tasks. The essential pieces of our digital identity, the domain and the contact email, come from Namecheap. Those are our only expenses, impressively low at under $20 a month, and everything is taken care of. Power of the cloud, I guess...
At the heart of our operation is Python. Flask, a lightweight web framework, handles URLs, redirects, flashes, form data, and our RESTful API, with Jinja templating doing the rendering. A level deeper, we use Pandas to store, analyze, and manipulate channel data in a humble CSV file. The unsung hero, JSON, is the linchpin for simple yet effective storage and retrieval of the video data. Never thought in my life I would love anything remotely close to JavaScript.
Notably, SQLite steps in as the guardian of sensitive information, safeguarding user emails, channel information, and preferences. A sprinkle of additional libraries, such as jQuery on the front end and isodate, datetime, and googleapiclient in Python, joins the ensemble to enhance functionality and interactivity. None of this is remotely complicated or impressive by today's web development standards, which will surely make some people pee their pants.
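To give a sense of how little glue code this takes, here is a minimal Flask sketch; the route and template names are hypothetical, not our exact code:
from flask import Flask, render_template, request

app = Flask(__name__)

@app.route("/")
def home():
    # Hypothetical example: read the language and timeframe from query parameters
    lang = request.args.get("lang", "EN")
    timeframe = request.args.get("time", "daily")
    # Hand the values to a Jinja template (templates/home.html is assumed to exist)
    return render_template("home.html", lang=lang, timeframe=timeframe)

if __name__ == "__main__":
    app.run(debug=True)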
How we collect YouTube data?
Thanks to the robust infrastructure provided by Google (obviously, they are Google), we tap into the YouTube API with finesse, wielding nothing more than an API key and a few lines of code. Simplifying this even further is the Google API client library for Python, aptly named googleapiclient, which gives us simple functions to call and a few parameters to fill in. All we need is an API key obtained from the Google Cloud Console. It comes with 10,000 quota units a day, which is more than enough for us. Here is a code snippet example:
import os

from googleapiclient.discovery import build

if __name__ == "__main__":
    API_KEY = os.environ.get("YT_API_KEY")
    service = build("youtube", "v3", developerKey=API_KEY)
How we discover & find new channels?
We use the search().list() method provided by the YouTube API to search across the platform with a few keywords like "Tech", "Programming", and "Computer Science" combined with the OR operator. We sort by relevance and restrict the search to channels in a specified language and region, with each page returning up to 50 results. A for loop walks through the results and extracts the only piece of information we need: the channel ID. Then, to build a profile for each channel, we call the channels().list() method for EACH of them to pull the data we want. Here is an example with the Fireship channel in CSV format:
ChannelID,ChannelName,ChannelIcon,ChannelUrl,ExistedSince,SubscriberCount,VideoCount,ViewCount,Country
UCsBjURrPoezykLs9EqgamOA,Fireship,https://yt3.ggpht.com/ytc/AIf8zZTUVa5AeFd3m5-4fdY2hEaKof3Byp8VruZ0f0FNEA,
https://www.youtube.com/@fireship,2017-04-07,2750000,601,364500037,US
We build a list, concatenate it with the existing data, and store it in a CSV with pandas, making sure to drop duplicates and NaN values. Naturally, we also add some channels we like ourselves, such as Sumsub or Rossmann, to make sure we get the best content, and we remove any dead or low-quality channels. That sums up the process, and if you are curious, here is part of it:
searched_channels = []
request = service.search().list(
    q="Tech | Programming | Computer Science",
    type="channel", part="id",
    maxResults=50, order="relevance",
    relevanceLanguage="en", regionCode="US"
)
response = request.execute()

for item in response.get("items", []):
    temp_id = item["id"]["channelId"]
    searched_channels.append(temp_id)
    # Other code...

channels = []
for channel in searched_channels:
    request = service.channels().list(part=["snippet", "statistics", ... ], id=channel)
    response = request.execute()
    channel_info = {
        "ChannelID": response["items"][0]["id"],
        "ChannelURL": response["items"][0]["snippet"]["customUrl"],
        "ChannelName": response["items"][0]["snippet"]["title"],
        "ChannelIcon": response["items"][0]["snippet"]["thumbnails"]["medium"]["url"],
        # Additional information about that channel
    }
    channels.append(channel_info)

df = pd.DataFrame(channels)
df = pd.concat([channel_df, df], ignore_index=True)
df.drop_duplicates(inplace=True)
df.dropna(inplace=True)
df.to_csv("channels.csv", index=False)
In our latest update, we sought to refine and enhance our channel selection process and to discover some previously overlooked channels. To achieve this, we built a web scraper using Selenium WebDriver. Essentially, it searches YouTube for a random tech topic, collects the channel URLs from the top results, clicks on random videos, and gathers the creators' information, repeating the process at least seven times. We then use the YouTube Channel ID Finder to obtain the IDs associated with these channels. With the IDs in hand, we run a slightly modified version of our update_channels() function to process the data. The code is lengthy, but here's a snippet:
from random import choice

from selenium import webdriver
from selenium.webdriver.common.by import By

# options and search_terms are defined earlier in the script
driver = webdriver.Firefox(options=options)
chosen_topic = choice(search_terms)
search_terms.remove(chosen_topic)
driver.get(f"https://www.youtube.com/results?search_query={chosen_topic}")

all_recommended_channels = driver.find_elements(By.ID, "channel-thumbnail")
channels = [channel.get_attribute("href").split("@")[1] for channel in all_recommended_channels]
video_links = driver.find_elements(By.CSS_SELECTOR, "a#video-title")
choice(video_links[5:]).click()

for i in range(7):
    recommended_channel = driver.find_element(By.CSS_SELECTOR, "a.ytd-video-owner-renderer")
    recommended_video = driver.find_elements(By.TAG_NAME, "ytd-compact-video-renderer")
    # Grab the creator's handle before navigating away, otherwise the element goes stale
    channels.append(recommended_channel.get_attribute("href").split("@")[1])
    choice(recommended_video[:5]).click()

# Visiting YouTube Channel ID Finder afterward to convert our URLs to channelIDs
You can download our database of YouTube channels using the button at the bottom of this text. It provides all the channel information and many other details that make this service possible. Note that some cells may contain "Unknown" because YouTube does not provide that information. You can redownload the database at any time to get the latest list, and if you want to add or remove channels, feel free to contact us. You are welcome to use this database anywhere; we believe that transparency and openness will only bring more opportunities and trust :)
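If you want to explore it yourself, a few lines of pandas are enough; the file name below assumes you keep the default channels.csv:
import pandas as pd

# Load the downloaded channel database and peek at a few columns
channel_df = pd.read_csv("channels.csv")
print(channel_df[["ChannelName", "SubscriberCount", "Country"]].head())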
How we get the videos?
Remember our CSV with every tech channel in it? Using a simple for loop, we take each channel ID and feed it to the Google API function `activities().list()`. This function lists the channel's recent activities, including uploaded videos, which is exactly what we are looking for. For every video we extract information such as the thumbnail, ID, title, who posted it, and when it was posted. To dive deeper, we need another method, `videos().list()`. By specifying the video ID we obtained previously, we get the final missing piece of the puzzle: statistics and content rating. Here is a code snippet:
for channel in channel_df["ChannelID"]:
    request = service.activities().list(
        part=["snippet", "id", "contentDetails"],
        publishedAfter=yesterday.isoformat() + "T00:00:00Z",
        channelId=channel, maxResults=50, fields=FIELDS,
    )
    response = request.execute()
    for item in response["items"]:
        channel_name = item["snippet"]["channelTitle"]
        channel_id = item["snippet"]["channelId"]
        video_id = item["contentDetails"]["upload"]["videoId"]
        # Additional information...
        request = service.videos().list(id=video_id, part=["statistics", "snippet", "contentDetails"])
        response = request.execute()
        view_count = int(response["items"][0]["statistics"]["viewCount"])
        like_count = int(response["items"][0]["statistics"]["likeCount"])
        content_rating = response["items"][0]["contentDetails"]["contentRating"]
        video_duration = isodate.parse_duration(response["items"][0]["contentDetails"]["duration"])
        # Again, remaining additional information...
Now that we have everything we need, we can process the data and loop through the next videos. But first, we filter out unworthy content with an if statement: videos shorter than 30 seconds, videos with fewer than 500 views, videos not in our supported languages, and videos whose category ID is neither 27 nor 28, the categories for educational and technological videos. After all of this, we finally get the desired result in JSON format. Here is a Fireship video as an example (a sketch of the filter itself follows the example):
"ChannelName": "Fireship",
"ChannelId": "UCsBjURrPoezykLs9EqgamOA",
"ChannelIcon": "https://yt3.ggpht.com/ytc/AIf8zZTUVa5AeFd3m5-4fdY2hEaKof3Byp8VruZ0f0FNEA",
"ChannelUrl": "https://www.youtube.com/@fireship",
"VideoUrl": "https://www.youtube.com/watch?v=ky5ZB-mqZKM",
"VideoTitle": "AI influencers are getting filthy rich... let's build one",
"VideoId": "ky5ZB-mqZKM",
"PublishedDate": "2023-11-29 21:06",
"Thumbnail": "https://i.ytimg.com/vi/gGWQfV1FCis/mqdefault.jpg",
"Duration": "0:04:25",
"Definition": "HD",
"Language": "EN",
"Caption": false,
"ContentRating": false,
"ViewCount": 4091018,
"LikeCount": 156078,
"CommentCount": 5052,
"CategoryId": 28
How we sort, rank & store these videos?
Now comes the part you've been waiting for. Sorting, ranking, and storing the data is a meticulous process, and it is what separates good work from the not-so-good, especially when algorithms are involved. Speaking of which, we have a simple yet robust algorithm built on the data collected earlier. Like any algorithm, it deals with numbers, not strings and booleans, so our first step is to convert those strings into numbers, specifically into quality measures. For instance, videos with captions deserve a higher rating than those without, so we assign a "CaptionQuality" of 1 if captions are present and 0.975 if not. The same principle applies to "DefinitionQuality" and "ContentRating". We also introduce a bias favoring longer videos (30-60 minutes) while slightly disadvantaging those shorter than 10 minutes. Popularity-wise, larger channels with millions of subscribers are disadvantaged, whereas those with fewer than 100K subscribers receive a slight boost. With these adjustments we end up with a float between 0.7 and 1.3 representing the video's quality, which feeds into our algorithm. Here's the equation:
QualityMultiplier = SubscriberBalance × DefinitionQuality × CaptionQuality × RatingQuality × DurationQuality
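As a rough sketch of how those factors could combine (the caption, definition, and rating values come from the paragraph above; the duration and subscriber thresholds are illustrative assumptions, not our exact curve):
from datetime import timedelta

def quality_multiplier(video, subscriber_count, duration):
    caption_quality = 1 if video["Caption"] else 0.975
    definition_quality = 1 if video["Definition"] == "HD" else 0.975
    rating_quality = 1 if not video["ContentRating"] else 0.975  # assumed direction: restricted content is penalized
    # Assumed duration bias: favor 30-60 minute videos, nudge down short ones
    if timedelta(minutes=30) <= duration <= timedelta(minutes=60):
        duration_quality = 1.05
    elif duration < timedelta(minutes=10):
        duration_quality = 0.95
    else:
        duration_quality = 1.0
    # Assumed subscriber balance: boost small channels, temper huge ones
    if subscriber_count < 100_000:
        subscriber_balance = 1.05
    elif subscriber_count >= 1_000_000:
        subscriber_balance = 0.95
    else:
        subscriber_balance = 1.0
    return subscriber_balance * definition_quality * caption_quality * rating_quality * duration_quality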
The first part is done. Moving on to the second, we use the video statistics, the part that truly reflects a video's quality. We consider three values: like rate, comment rate, and view rate. To keep things balanced, we normalize them with a logarithmic function. After normalization, we weight each factor, multiplying the view rate by 0.675, the like rate by 1.125, and the comment rate by 1.375. These adjusted values are then summed, multiplied by the QualityMultiplier, and scaled by 100 for readability. The final rating formula looks like this:
Rating = (ViewRate + LikeRate + CommentRate) × QualityMultiplier
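Here is a hedged sketch of the full rating; the exact definitions of the three rates are our reading rather than a verbatim copy of the production code, while the weights and the ×100 scaling are the ones quoted above:
from math import log10

def rating(view_count, like_count, comment_count, subscriber_count, quality_multiplier):
    # Log-normalize each rate so a handful of viral videos can't run away with the ranking
    view_rate = log10(1 + view_count / max(subscriber_count, 1)) * 0.675
    like_rate = log10(1 + like_count / max(view_count, 1)) * 1.125
    comment_rate = log10(1 + comment_count / max(view_count, 1)) * 1.375
    return (view_rate + like_rate + comment_rate) * quality_multiplier * 100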
With the rating computed, the final step is to sort and store the videos. We do this in Python with the JSON format, using dictionary methods and slicing along with `open()` context managers and `json.dump()`. After that, it's just a matter of selecting the right number of videos for the different timeframes. Below is a code snippet illustrating this:
for lang, all_videos in videos.items():
    for time in ["daily", "weekly", "monthly", "yearly"]:
        with open(f"{time}.json", "r") as f:
            data = json.load(f)
        if time == "daily":
            top_day = OrderedDict(sorted(all_videos.items(), key=lambda item: item[0], reverse=True))
            data[lang] = OrderedDict(list(top_day.items()))
        elif time == "weekly":
            top_week = update_videos(data[lang], time)
            top_week.update(OrderedDict(list(top_day.items())[:50]))
            top_week = sort_videos(top_week)
            data[lang] = top_week
        # Same thing for monthly & yearly videos
        with open(f"{time}.json", "w") as f:
            json.dump(data, f, indent=4)
How we automate both our newsletter and this?
Using cron jobs, simple, right? If you don't know what cron is, well, you should. The cron command-line utility is a job scheduler on Unix operating systems, often used to set up and maintain software environments because of how simple it makes scheduling jobs like running commands or sending emails. And that is exactly what we did. Every day at 11:50 PM, it runs a bash script that fetches the active newsletter database, runs two Python files, one fetching and processing the YouTube data, the other sending the newsletter, and, if no errors were met, copies the new JSON data to the production server. Lovely Unix never ceases to impress.
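For reference, the crontab entry for a setup like this is a single line; the paths below are placeholders, not our actual ones:
# m h dom mon dow  command -- run the pipeline every day at 23:50
50 23 * * * /home/user/Code/BytePicks/bytepicks.sh >> /home/user/Code/BytePicks/cron.log 2>&1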
#!/bin/bash

file_list="credentials.json token.json channels.csv newsletter.db"
remote_dir="/opt/render/project/src"
cd ~/Code/BytePicks

scp $SERVER_NAME@ssh.oregon.render.com:$remote_dir/newsletter.db . >> /dev/null
if [[ $? -eq 0 ]]; then
    python youtube.py && tail -n 1 youtube.log
    if [[ $? -eq 0 ]]; then
        python newsletter.py >> /dev/null
        if [[ $? -eq 0 ]]; then
            scp $file_list $SERVER_NAME@ssh.oregon.render.com:$remote_dir
            scp data/* $SERVER_NAME@ssh.oregon.render.com:$remote_dir/data
        fi
    fi
fi