Google is reading text in YouTube videos for search crawling without user consent
Google is using optical character recognition (OCR) techniques to crawl URLs found in YouTube videos, including private videos, according to programmer Austin Burk, first reported by Naked Security. Burk found an XSS vulnerability in a different website, which he was reproducing using screen capture software as part of a responsible disclosure package. After uploading the video to YouTube, he found evidence of crawling activity with the user agent "Google-Youtube-Links" in server logs on a system he controls.
According to Burk, the URLs were visible in the address bar during the video, which was uploaded to YouTube but kept unlisted. Burk then made a private video to test the behavior, and the crawling occurred exactly as it had with the unlisted video created for responsible disclosure.
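For readers who want to check their own servers for similar activity, the following is a minimal sketch of scanning access logs for that user agent. It is not Burk's method, which the report does not describe in detail. It assumes logs in the widely used "combined" format, and the function name `find_crawler_hits` and the sample log lines are invented for illustration.

```python
import re

# User agent token as quoted in the report.
USER_AGENT_TOKEN = "Google-Youtube-Links"

# Assumption: logs follow the "combined" access log format, in which the
# user agent is the last double-quoted field on each line.
LINE_PATTERN = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)


def find_crawler_hits(lines, token=USER_AGENT_TOKEN):
    """Return (timestamp, request) pairs whose user agent contains token."""
    hits = []
    for line in lines:
        match = LINE_PATTERN.match(line.strip())
        # Compare case-insensitively, since capitalization may vary.
        if match and token.lower() in match.group("agent").lower():
            hits.append((match.group("time"), match.group("request")))
    return hits


# Made-up sample lines (documentation-range IPs, placeholder paths and dates).
sample_log = [
    '203.0.113.7 - - [01/Jan/2000:00:00:00 +0000] '
    '"GET /unlisted-test HTTP/1.1" 200 512 "-" "Google-Youtube-Links"',
    '198.51.100.2 - - [01/Jan/2000:00:00:05 +0000] '
    '"GET /index.html HTTP/1.1" 200 1024 "-" "Mozilla/5.0 (X11; Linux x86_64)"',
]

for timestamp, request in find_crawler_hits(sample_log):
    print(timestamp, request)
```

In practice the same function could be fed the lines of a real log file, for example by passing an open file object in place of `sample_log`. Servers that log in a different format would need a different pattern.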
Considering Google's core product is search, it makes sense that the company is always scanning the web. Google's use of users' personal activity, including browsing history and location, to target advertising and search results is well known. But YouTube's help article for video privacy settings makes no mention of this behavior, and Google's help article listing user agent tokens for its search crawlers does not mention this crawler either.
Even if Google's intentions are innocuous, this is potentially very damaging. Burk proposes a scenario similar to the XSS issue he was disclosing:
For this reason, using YouTube to host even private videos for security disclosures is not advisable, as the integrity of the disclosure cannot be assured with Google's search crawler probing inspected websites. It is difficult to completely fault Google for this activity, as malicious actors could use YouTube to instruct unwitting victims to manually type links into their address bar, leading them to viruses or illicit content.
That said, the complete lack of public documentation or acknowledgement from Google about this behavior should make users uneasy about how Google is using data uploaded to its services.
TechRepublic contacted Google, but did not receive a response by press time. We will update this story if Google provides a statement.