More than 170 images and personal details of children in Brazil have been scraped into an open source dataset without their knowledge or consent and used to train AI, according to a new report by Human Rights Watch published on Monday.
The images were taken from content published as recently as 2023 and as far back as the mid-1990s, according to the report, in many cases long before any internet user could have anticipated that their content might be used to train AI. Human Rights Watch claims that these children’s personal details, along with links to their photographs, were included in LAION-5B, a dataset that has been a popular source of training data for AI startups.
“Your privacy is violated in the first instance when your photograph is extracted and included in these data sets. And then these AI tools are trained with this data and can therefore create realistic images of children,” says Hye Jung Han, a technology and children’s rights researcher at Human Rights Watch and the researcher who found these images. “The technology is developed in such a way that any child who has a photo or video of themselves online is now at risk because any malicious actor could take that photo and then use these tools to manipulate them however they want.”
LAION-5B is based on Common Crawl, a data repository created by web scraping and made available to researchers, and has been used to train several AI models, including Stability AI’s Stable Diffusion image generation tool. Created by the German nonprofit LAION, the dataset is open access and now includes more than 5.85 billion pairs of images and captions, according to its website.
The images of children that the researchers found came from mom blogs and other personal or parenting blogs, as well as from stills of YouTube videos with low view counts, apparently uploaded to be shared with family and friends.
“Just looking at the context where they were published, they enjoyed an expectation and a certain privacy,” Hye says. “Most of these images were not possible to find online using a reverse image search.”
YouTube’s terms of service do not allow scraping except under certain circumstances, and these cases appear to violate those policies. “We’ve made clear that unauthorized scraping of content from YouTube is a violation of our Terms of Service,” says YouTube spokesperson Jack Maon, “and we continue to take action against this type of abuse.”
In December, Stanford University researchers discovered that the AI training data collected by LAION-5B contained child sexual abuse material (CSAM). Explicit deepfakes are a growing problem even in American schools, where students use them to bully their classmates, especially girls. Hye is concerned that, beyond the use of children’s photos to generate CSAM, the database could reveal potentially sensitive information, such as locations or medical data. In 2022, a US-based artist found their own image in the LAION dataset and realized it came from their private medical records.