Introduction
Extracting video URLs, image URLs, and text content from a webpage can be done easily with Selenium and Beautiful Soup in Python. If a video's src attribute holds an ordinary URL, we can access the video directly.
However, many websites use blob-format URLs, where the src begins with blob:. We can extract them using Selenium + bs4, but we cannot access them directly because they are generated internally by the browser.
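To make the distinction concrete, here is a minimal sketch that collects src values from a page's HTML and separates blob URLs from ordinary ones. It uses the standard library's html.parser rather than bs4 to stay dependency-free; the class name, function name, and sample markup are all hypothetical and only serve as an illustration.

```python
from html.parser import HTMLParser


class VideoSrcCollector(HTMLParser):
    """Collect src attributes from <video> and <source> tags."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag in ("video", "source"):
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def split_video_sources(html):
    """Return (direct_urls, blob_urls) found in the given HTML string."""
    collector = VideoSrcCollector()
    collector.feed(html)
    blob_urls = [s for s in collector.sources if s.startswith("blob:")]
    direct_urls = [s for s in collector.sources if not s.startswith("blob:")]
    return direct_urls, blob_urls


# Hypothetical markup for illustration only.
sample_html = """
<video src="https://example.com/media/clip.mp4"></video>
<video src="blob:https://example.com/0f9a1c2e"></video>
"""
direct, blobs = split_video_sources(sample_html)
print(direct)
print(blobs)
```

Only the URLs in the first list can be fetched from outside the browser; the blob entries need one of the approaches described in the rest of this article.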
What are BLOB URLs?
Blob URLs can only be generated internally by the browser. URL.createObjectURL() creates a special reference to a Blob or File object, which can later be released using URL.revokeObjectURL(). These URLs can only be used locally, in a single instance of the browser and within the same session.
Blob URLs are often used to display or play multimedia content, such as videos, directly in a web browser or media player, without downloading the content to the user's local machine. They are often used in conjunction with HTML5 video elements, which allow web developers to embed video content directly into a web page using a simple <video> tag.
To overcome the above problem, we have found two methods that help extract the video URL directly:
1. YT-dlp
2. Selenium + Network logs
YT-dlp
yt-dlp is a very handy module for downloading YouTube videos. It also extracts other attributes of YouTube videos, such as titles, descriptions, and tags. We have found a way to extract videos from normal (non-YouTube) web pages by passing some additional options to it. Below are the steps and sample code for using it.
Install the yt-dlp module on Ubuntu:
sudo snap install yt-dlp
Below is simple code for video URL extraction using yt-dlp with the Python subprocess module. We are using additional options such as -f, -g, and -q. The description of these options can be found on the yt-dlp GitHub page.
import subprocess

def get_video_urls(url):
    """Return the media URLs that yt-dlp reports for the given page."""
    videos_url = []
    # -f all: consider every format, -g: print URLs only, -q: quiet output
    youtube_subprocess = subprocess.Popen(
        ["yt-dlp", "-f", "all", "-g", "-q", "--ignore-errors", "--no-warnings", url],
        stdout=subprocess.PIPE,
    )
    try:
        output = youtube_subprocess.communicate(timeout=15)[0]
        video_url_list = output.decode("utf-8").split("\n")
        # Prefer direct media files
        for video in video_url_list:
            if video.endswith((".mp4", ".mp3", ".mov", ".webm")):
                videos_url.append(video)
        # Fall back to streaming playlists when no direct file was found
        if len(videos_url) == 0:
            for video in video_url_list:
                if video.endswith(".m3u8"):
                    videos_url.append(video)
    except subprocess.TimeoutExpired:
        # yt-dlp took too long; stop it and return whatever was collected
        youtube_subprocess.kill()
    return videos_url

print(get_video_urls(url=""))  # set url to the page you want to inspect
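One caveat with the extension checks above is that media URLs often carry a query string (for example a signed token), in which case a plain endswith test on the whole URL misses them. The following is a minimal sketch of one way to handle that by checking only the path part of each URL. The function name pick_media_urls and the constant DIRECT_EXTENSIONS are hypothetical and not part of yt-dlp.

```python
from urllib.parse import urlparse

# Extensions treated as directly downloadable media (same set as above).
DIRECT_EXTENSIONS = (".mp4", ".mp3", ".mov", ".webm")


def pick_media_urls(candidates):
    """Prefer direct media files; fall back to .m3u8 playlists."""
    direct, playlists = [], []
    for candidate in candidates:
        candidate = candidate.strip()
        if not candidate:
            continue
        # Compare against the path only, so "?token=..." is ignored.
        path = urlparse(candidate).path.lower()
        if path.endswith(DIRECT_EXTENSIONS):
            direct.append(candidate)
        elif path.endswith(".m3u8"):
            playlists.append(candidate)
    return direct or playlists
```

It could be applied to the list of lines read from the yt-dlp output in place of the two loops, under the assumption that the same fallback order is wanted.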
Selenium + Network logs
Whenever blob-format URLs are used on a website and the video is playing, we can find the streaming URL (.m3u8) for that video in the browser's network tab. We can use the network and performance logs to find the streaming URLs.
What is M3U8?
M3U8 is a text file that uses UTF-8-encoded characters to specify the locations of one or more media files. It is commonly used to specify a playlist of audio or video files for streaming over the internet, using a media player that supports the M3U8 format, such as VLC, Apple's iTunes, and QuickTime. The file typically has the ".m3u8" file extension and contains a list of one or more media files along with a series of attribute information lines. Each entry in an M3U8 file typically specifies a single media file, together with its title and length, or a reference to another M3U8 file for streaming a playlist of media files.
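As an illustration of that structure, here is a minimal sketch that reads M3U8 text and returns the media (or nested playlist) entries it lists. It assumes only that lines starting with # carry attribute information and every other non-empty line is an entry; the function name parse_m3u8 and the sample playlist are hypothetical.

```python
from urllib.parse import urljoin


def parse_m3u8(text, base_url=""):
    """Return the media or playlist URIs listed in M3U8 playlist text."""
    uris = []
    for line in text.splitlines():
        line = line.strip()
        # Lines starting with '#' are attribute lines; the rest are URIs.
        if not line or line.startswith("#"):
            continue
        # Resolve relative entries against the playlist URL when given.
        uris.append(urljoin(base_url, line) if base_url else line)
    return uris


# Hypothetical playlist content for illustration only.
sample_playlist = """#EXTM3U
#EXT-X-VERSION:3
#EXT-X-TARGETDURATION:10
#EXTINF:10.0,
segment0.ts
#EXTINF:8.5,
segment1.ts
#EXT-X-ENDLIST
"""
print(parse_m3u8(sample_playlist, base_url="https://example.com/video/index.m3u8"))
```

Passing the playlist's own URL as base_url turns relative segment names into absolute URLs, which is usually what a downloader needs.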
We can extract the network and performance logs using Selenium with some advanced options. Perform the following steps to install all the required packages:
pip install selenium
pip install webdriver_manager
Below is the sample code for getting the streaming URL (.m3u8) using Selenium and network logs:
import json
import time

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

options = webdriver.ChromeOptions()
# Ask Chrome to record performance logs, which include network events
options.set_capability("goog:loggingPrefs", {"performance": "ALL"})
options.add_argument("--no-sandbox")
options.add_argument("--headless")
options.add_argument("--disable-dev-shm-usage")
options.add_argument("start-maximized")
options.add_argument("--autoplay-policy=no-user-gesture-required")
options.add_argument("disable-infobars")
options.add_argument("--disable-extensions")
options.add_argument("--ignore-certificate-errors")
options.add_argument("--mute-audio")
options.add_argument("--disable-notifications")
options.add_argument("--disable-popup-blocking")

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()),
                          options=options)

def get_m3u8_urls(url):
    driver.get(url)
    driver.execute_script("window.scrollTo(0, 10000)")
    time.sleep(20)  # give the player time to start requesting the stream
    logs = driver.get_log("performance")
    url_list = []
    for log in logs:
        network_log = json.loads(log["message"])["message"]
        method = network_log["method"]
        if ("Network.response" in method
                or "Network.request" in method
                or "Network.webSocket" in method):
            # Only request events carry params.request.url
            request_url = network_log["params"].get("request", {}).get("url", "")
            if ".m3u8" in request_url and "blob" not in request_url:
                url_list.append(request_url)
    driver.close()
    return url_list

if __name__ == "__main__":
    url = ""  # set this to the page you want to inspect
    url_list = get_m3u8_urls(url)
    print(url_list)
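The log filtering step can also be tried without launching a browser. The sketch below applies the same idea to a list of entries shaped like those returned by driver.get_log, and additionally removes duplicates. The function names and the sample entries are hypothetical, and it checks for a blob: prefix rather than the substring test used above, which is an assumption on our part.

```python
import json


def m3u8_urls_from_logs(logs):
    """Return unique .m3u8 request URLs found in performance log entries."""
    urls = []
    for entry in logs:
        message = json.loads(entry["message"])["message"]
        # Only network events are of interest here.
        if not message.get("method", "").startswith("Network."):
            continue
        request = message.get("params", {}).get("request", {})
        request_url = request.get("url", "")
        if ".m3u8" in request_url and not request_url.startswith("blob:"):
            if request_url not in urls:
                urls.append(request_url)
    return urls


def make_entry(method, url):
    """Build a hypothetical log entry shaped like the ones read above."""
    return {"message": json.dumps(
        {"message": {"method": method, "params": {"request": {"url": url}}}}
    )}


# Hypothetical log data for illustration only.
sample_logs = [
    make_entry("Network.requestWillBeSent", "https://example.com/v/master.m3u8"),
    make_entry("Network.requestWillBeSent", "https://example.com/v/master.m3u8"),
    make_entry("Network.requestWillBeSent", "https://example.com/style.css"),
    make_entry("Page.frameNavigated", "https://example.com/other.m3u8"),
]
print(m3u8_urls_from_logs(sample_logs))
```

Keeping the filter in its own function makes it easy to test against saved logs before running it on a live page.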
Once you have the streaming URL, it can be played in the VLC media player using its stream option.
The m3u8 URL can also be downloaded as an .mp4 file using FFmpeg. It can be installed on Ubuntu using:
sudo apt install ffmpeg
After installing FFmpeg, we can easily download the video using the command below, placing the m3u8 URL directly after -i:
ffmpeg -i -c copy -bsf:a aac_adtstoasc output.mp4
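If the download needs to run from Python, the same command can be assembled as an argument list and passed to subprocess. This is a minimal sketch that assumes ffmpeg is available on the PATH; the function names build_ffmpeg_command and download_stream are hypothetical.

```python
import subprocess


def build_ffmpeg_command(m3u8_url, output_path="output.mp4"):
    """Build the ffmpeg argument list for copying a stream into an MP4 file."""
    return [
        "ffmpeg",
        "-i", m3u8_url,            # input: the streaming URL
        "-c", "copy",              # copy streams without re-encoding
        "-bsf:a", "aac_adtstoasc",  # audio bitstream filter from the command above
        output_path,
    ]


def download_stream(m3u8_url, output_path="output.mp4"):
    """Run ffmpeg; raises CalledProcessError if ffmpeg exits with an error."""
    subprocess.run(build_ffmpeg_command(m3u8_url, output_path), check=True)
```

Passing the arguments as a list avoids shell quoting problems when the streaming URL contains characters such as & or ?.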
We hope you like these two approaches to advanced video scraping. Do let us know if you have any queries.