A group of hacktivists has scraped and archived millions of music files, album art, and metadata from Spotify, a move that could let anyone clone one of the world’s largest music streaming services.
The 86 million audio files and over 256 million rows of track metadata, amounting to roughly 300 terabytes (TB) of storage, have been backed up on Anna’s Archive, an open-source search engine for shadow libraries like Sci-Hub and LibGen.
Music metadata is the collection of information that pertains to an audio file, such as the artist’s name, producer, writer, song title, release date, genre, and track duration. So far, the pirate activist group has released the track metadata scraped from Spotify. It also plans to release the 86 million audio files (representing 99.6 per cent of listens) as torrent files, in order of their popularity.
With the metadata of 256 million tracks and 186 million track ISRCs (International Standard Recording Codes, unique identifiers assigned to individual sound recordings), Anna’s Archive currently has the largest publicly available music metadata database. It is the world’s first “preservation archive” for music that is fully open and can be mirrored by anyone with enough disk space, the platform claimed in a blog post published on December 20.
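For readers who want to see what an ISRC looks like in practice, the short Python sketch below checks whether a string follows the standard 12-character ISRC layout: a two-letter country code, a three-character registrant code, a two-digit year, and a five-digit designation code. The function name `parse_isrc` and the sample code `US-S1Z-99-00001` are illustrative assumptions, not anything taken from the scraped dataset, and the check covers the format only, not whether a code has actually been issued.

```python
import re

# Assumed layout, following the ISRC standard (ISO 3901):
# 2 letters (country) + 3 alphanumerics (registrant)
# + 2 digits (year of reference) + 5 digits (designation code).
ISRC_PATTERN = re.compile(r"([A-Z]{2})([A-Z0-9]{3})([0-9]{2})([0-9]{5})")


def parse_isrc(raw):
    """Split an ISRC into its four parts, or return None if the format is wrong.

    Hypothetical helper for illustration; hyphens and letter case are ignored.
    """
    compact = raw.strip().replace("-", "").upper()
    match = ISRC_PATTERN.fullmatch(compact)
    if match is None:
        return None
    country, registrant, year, designation = match.groups()
    return {
        "country": country,
        "registrant": registrant,
        "year": year,
        "designation": designation,
    }


if __name__ == "__main__":
    # Illustrative sample code, not a record from the dataset.
    print(parse_isrc("US-S1Z-99-00001"))
```

A check like this only confirms that a code is well formed; matching it to a specific recording would still require a lookup in a metadata database.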
The scraping of a majority of Spotify’s music library has taken on added significance in the artificial intelligence (AI) era, where vast troves of data are routinely harvested by AI companies to train and build large language models (LLMs) like GPT-5 and Gemini 3. It comes amid growing tension between AI companies and rights-holders even as key questions around copyright, consent, and compensation for creators remain unresolved for the most part.
Spotify responded by saying it had identified and disabled accounts involved in unlawful scraping and had introduced additional safeguards. “We’ve implemented new safeguards for these types of anti-copyright attacks and are actively monitoring for suspicious behaviour. Since day one, we have stood with the artist community against piracy, and we are actively working with our industry partners to protect creators and defend their rights,” a company spokesperson told The Indian Express.
In a separate statement obtained by Billboard, Spotify said a preliminary investigation found that “a third-party scraped public metadata and used illicit tactics to circumvent DRM (digital rights management) protections to access some audio files.”
What is Anna’s Archive?
Anna’s Archive is an open-source search engine for shadow libraries, which typically contain paid or paywalled content that has been pirated or uploaded for free. The platform functions like a regular search engine and helps users find material hosted elsewhere on the internet. It reportedly does not host pirated material itself.
So far, most of the content searchable via Anna’s Archive has been books, research papers, and other literary material because “text has the highest information density,” according to the platform. This is the first time music metadata has been made accessible through the platform.
Those behind Anna’s Archive appear to be ideologically motivated in making information freely accessible, with the platform’s stated mission being “preserving humanity’s knowledge and culture”. The various domains linked to Anna’s Archive are among the most targeted URLs in Google takedown requests filed by copyright holders, according to the TorrentFreak blog.
What data was scraped from Spotify? Why?
Anna’s Archive said its efforts to scrape audio files and metadata from Spotify were aimed at building a music archive for preservation purposes. While the platform acknowledged that music is already fairly well preserved, it pointed out that existing music libraries mostly contain tracks from the most popular artists and tend to focus too much on archiving files of the highest possible quality.
“This inflates the file size and makes it hard to keep a full archive of all music that humanity has ever produced,” the pirate activist group said. In order to “create an authoritative list of torrents” that represents all music ever produced, Anna’s Archive said it scraped Spotify’s music library at scale using the streaming app’s “popularity” metric to prioritise tracks.
It said that the following data has been scraped and will be released in a phased manner on its Torrents page:
– Metadata (already released)
– Music files (to be released in the order of popularity)
– Additional file metadata (torrent paths and checksums)
– Album art
– .zstdpatch files (to reconstruct original files before embedded metadata was added)
The platform also clarified that only Spotify music files available before July 2025 were scraped, meaning any content uploaded after that date may not be present in the scraped dataset. “For now, this is a torrents-only archive aimed at preservation, but if there is enough interest, we could add downloading of individual files to Anna’s Archive,” it said.
What does it mean for the AI race?
Reactions to the blog post by Anna’s Archive have been split, with some claiming that the soon-to-be publicly available music dataset could aid AI researchers in their work and others arguing that it could fall into the wrong hands. “This, indeed, has mostly implications for ML, training, etc. As otherwise, the whole catalog is available to partners, but costs a lot. So Anna did indeed liberate the content, but I’m definitely not switching off my Spotify subscription, even though, in my personal taste, neither quality, nor UI does match Apple Music. It is still useful to have s.o. serve the content for you,” a user posted on Hacker News, an online forum.
Another user argued the first users of this dataset would be big tech companies and AI giants such as Meta, Google, OpenAI, Microsoft, and Apple. “For them, 300TB is just cheap,” the user said.
“Anyone can now, in theory, create their own personal free version of Spotify (all music up to 2025) with enough storage and a personal media streaming server like Plex. The only real barriers are copyright law and fear of enforcement,” Yoav Zimmerman, CEO of AI startup Third Chair, wrote in a post on LinkedIn.
However, training AI models on pirated content could spell legal trouble for tech companies. While US courts have held that purchasing books, scanning them into digital copies, and using them for AI training purposes are covered under the fair use exception, they have also ruled that using datasets comprising pirated content may not be transformative enough to fall under this exception.
Meta and Anthropic have faced separate allegations that they used pirated digital copies of millions of books to train their AI models. “This leak will also be really useful to bad actors who will resell the music from this list without paying royalties to the artists,” another user posted on Hacker News.