Spidering the "Dark Web"
Friday, February 15, 2008 at 16:18
For some, the term "dark web" simply means all the online data that search engine spiders can't reach, crawl, or index. For the University of Arizona's AI Lab, however, the "Dark Web" is a research project that studies the social phenomenon of terrorism using techniques including social network analysis, content analysis, link analysis, web metrics, video analysis, data and text mining, sentiment and affect analysis, and authorship analysis. Using sophisticated mathematical tools, the project aims to collect all web content generated by international terrorist groups, including content found on web sites, forums, chat rooms, blogs, social networking sites, videos, virtual worlds, and more.
The Dark Web Project
The Dark Web project is federally funded through the National Science Foundation, and its spiders have been crawling the web for the past five years. As of 2007, the lab estimated that there were about 50,000 sites with extremist/terrorist content once it looked beyond traditional web pages. That is a sharp increase over the 2006 estimate by Dr. Gabriel Weimann of the University of Haifa, who put the number at only 5,000 terrorist web sites. From 2006 to 2007, the lab found that the greatest increase in terrorist activity was on new "web 2.0" sites (a term the lab uses to describe any new-generation web site, including video sites, blogs, and virtual worlds).
Currently, the Dark Web collection consists of the complete contents of only 1,000 web sites in Arabic, Spanish, and English and the partial contents of 10,000 other sites. At 2 TB, it is the largest open-source extremist/terrorist collection in the academic world. Researchers who would like to use this data in their own studies can contact the research center for access.
Where the Bad Guys Are
So far, the Dark Web has determined the following:
- Forums: 300 terrorist forums found, some with more than 30,000 members; nearly 1,000,000 messages posted.
- Blogs, social networking sites, and virtual worlds: Many transient sites have been identified before they disappeared; more than 30 (self-proclaimed) terrorist or extremist groups have been found in virtual world sites, though the researchers have not yet been able to determine who is just "playing terrorist" and who is for real.
- Videos and multimedia content: 1,000,000 images and 15,000 videos from web sites and specialty multimedia file-hosting third-party servers; more than 50% of the videos are related to Improvised Explosive Devices.

Second Life Griefers - A "Terrorist Attack?"
How They Find the Data
The Dark Web project uses various tools for collection, analysis, and visualization:
- Web site spidering: The project's focused spiders can access password-protected sites and perform randomized (human-like) fetching. The spiders are trained to fetch all HTML, PDF, and Word files; links; PHP, CGI, and ASP files; images; audio; and video in a web site. Selected web sites are spidered every 2 to 3 months.
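To make the spidering description a little more concrete, here is a rough Python sketch of what a focused, randomized crawl of a single site might look like. This is not the lab's code. The extension table, the delay range, and the names `classify`, `human_like_delay`, `crawl_site`, and `fetch_links` are all assumptions made up for illustration. Logging in to password-protected sites and the 2 to 3 month re-spidering schedule are left out.

```python
import random
from collections import deque
from urllib.parse import urlparse

# File types named in the article; this extension mapping is an assumption.
TARGET_EXTENSIONS = {
    ".html": "page", ".htm": "page", ".php": "page", ".cgi": "page", ".asp": "page",
    ".pdf": "document", ".doc": "document",
    ".jpg": "image", ".png": "image", ".gif": "image",
    ".mp3": "audio", ".wav": "audio",
    ".avi": "video", ".wmv": "video", ".mpg": "video",
}


def classify(url):
    """Return a content category for a URL, or None if it is not a target type."""
    path = urlparse(url).path.lower()
    if path == "" or path.endswith("/"):
        return "page"  # directory-style URLs are treated as pages
    dot, slash = path.rfind("."), path.rfind("/")
    if dot <= slash:
        return "page"  # no extension on the last path segment: assume a page
    return TARGET_EXTENSIONS.get(path[dot:])


def human_like_delay(rng, low=2.0, high=15.0):
    """Pick a random pause (seconds) so requests are not evenly spaced."""
    return rng.uniform(low, high)


def crawl_site(start_url, fetch_links, sleep, rng, max_items=100):
    """Breadth-first crawl that stays on the starting host.

    fetch_links(url) is a hypothetical callable returning the absolute URLs
    found at url; it is injected so the sketch runs without network access.
    """
    host = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    collected = []
    while queue and len(collected) < max_items:
        url = queue.popleft()
        category = classify(url)
        if category is None:
            continue  # not one of the targeted file types
        sleep(human_like_delay(rng))  # randomized pause before each fetch
        collected.append((url, category))
        if category != "page":
            continue  # only pages are parsed for further links
        links = list(fetch_links(url))
        rng.shuffle(links)  # vary the visiting order between runs
        for link in links:
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return collected
```

In a real run, `sleep` would be `time.sleep` and `fetch_links` would download and parse each page. Passing both in as arguments keeps the sketch testable offline.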