Source: www.iitr.ac.in
Web Crawler
Data Collection Module

News Crawler
• News crawlers focus on retrieving newly published news data.
• A news crawler monitors a predefined set of news sources and captures each news item as soon as it is published.

[Figure: Architecture of News Crawler at IITR. Labels: Predefined Set of News Sources; News URL Downloader (crawl every 30 min); New URLs; News Article Downloader; News Articles; News Database]

16BIT IITR

Web Crawler
A Simple Java Program for Downloading a Web Page

Parsing a Web Page
• Given a web page, we can retrieve its different components by parsing it.
• Many HTML parsers are available, such as Jsoup, Xerces, and NekoHTML.
• The following Java program uses the Jsoup parser to extract hyperlinks from a web page.

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;
    import java.io.IOException;
    import java.io.File;

    public class ExtractLinks {
        public static void main(String[] args) throws IOException {
            // Parse a locally saved HTML file as UTF-8.
            // The third argument is the base URI. With an empty base URI,
            // "abs:href" yields an empty string for relative links; pass the
            // page's original URL here to resolve them.
            File input = new File("data.html");
            Document doc = Jsoup.parse(input, "UTF-8", "");

            // Select every anchor element that has an href attribute.
            Elements links = doc.select("a[href]");
            System.out.println("Total Number of Links:" + links.size());

            for (Element link : links) {
                System.out.println(link.attr("abs:href"));
            }
        }
    }

Retrieving Article Text
• Many APIs are available for extracting the main content from web pages, such as the Boilerpipe API.
• The following Java program demonstrates the use of the Boilerpipe API to extract the article text from a news article.

    import java.io.PrintWriter;
    import java.net.URL;
    import de.l3s.boilerpipe.BoilerpipeExtractor;
    import de.l3s.boilerpipe.extractors.CommonExtractors;
    import de.l3s.boilerpipe.sax.HTMLHighlighter;

    public class BoilerplateDemo {
        public static void main(String[] args) throws Exception {
            URL url = new URL("http://www.thehindu.com/news/national/land-acquisition-ordinance-bill-gets-a-burial/article7597517.ece");

            // Extractor tuned for news-article pages
            final BoilerpipeExtractor extractor = CommonExtractors.ARTICLE_EXTRACTOR;

            // choose the operation mode (i.e., highlighting or extraction)
            //final HTMLHighlighter hh = HTMLHighlighter.newHighlightingInstance();
            final HTMLHighlighter hh = HTMLHighlighter.newExtractingInstance();

            // Write the processed HTML to a file
            PrintWriter out = new PrintWriter("highlighted.html", "UTF-8");
            out.println(hh.process(url, extractor));
            out.close();

            System.out.println("Now open file highlighted.html in your web browser");
        }
    }

Article Extractor
Data Collection Module

Article Extraction
• Objective: To extract the article content from a given news URL
• News URL: http://www.hindustantimes.com/world-t20/amitabh-bachchan-to-sing-national-anthem-before-india-pakistan-match/story-QXxnQAvmJsisvIYtSFv33L.html

Bollywood superstar Amitabh Bachchan will sing the National Anthem before the start of the marquee India-Pakistan World Twenty20 cricket match at the Eden Gardens on March 19. Bachchan has confirmed the development by retweeting a post in his official Twitter handle while sources in the Cricket Association of Bengal today said this was an effort by its president Sourav Ganguly. “The president was involved and the plan was on for a long time,” CAB sources said. While the ‘Big B’ will sing the National Anthem in his signature baritone, Pakistan will also make their presence felt with classical singer Shafaqat Amanat Ali who is slated to sing the Pakistani National Anthem.
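
The news crawler described in these slides (a predefined set of news sources, a crawl every 30 minutes, new URLs handed to an article downloader, and articles stored in a news database) could be sketched in Java roughly as follows. This is a minimal illustration under stated assumptions, not the IITR implementation: the class name `NewsCrawlerSketch`, the two pluggable downloader functions, the `storedArticle` helper, and the in-memory map standing in for the news database are all hypothetical, and only the Java standard library is used.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;

// Hypothetical sketch of the crawl cycle; all names here are assumptions.
class NewsCrawlerSketch {
    private final List<String> sources;
    // Assumed stand-in for the "News URL Downloader":
    // maps a source to the article URLs it currently lists.
    private final Function<String, List<String>> urlDownloader;
    // Assumed stand-in for the "News Article Downloader":
    // maps an article URL to its content.
    private final Function<String, String> articleDownloader;
    // URLs already captured, so each article is downloaded only once.
    private final Set<String> seenUrls = new HashSet<>();
    // In-memory stand-in for the news database (URL -> article content).
    private final Map<String, String> database = new LinkedHashMap<>();

    public NewsCrawlerSketch(List<String> sources,
                             Function<String, List<String>> urlDownloader,
                             Function<String, String> articleDownloader) {
        this.sources = sources;
        this.urlDownloader = urlDownloader;
        this.articleDownloader = articleDownloader;
    }

    // One crawl pass: visit every source, keep only URLs not seen before,
    // download those articles, store them, and return the new URLs.
    public synchronized List<String> crawlOnce() {
        List<String> newUrls = new ArrayList<>();
        for (String source : sources) {
            for (String url : urlDownloader.apply(source)) {
                if (seenUrls.add(url)) { // add() returns false for known URLs
                    database.put(url, articleDownloader.apply(url));
                    newUrls.add(url);
                }
            }
        }
        return newUrls;
    }

    public synchronized String storedArticle(String url) {
        return database.get(url);
    }

    // Runs a crawl pass immediately and then every 30 minutes.
    public ScheduledExecutorService start() {
        ScheduledExecutorService scheduler =
                Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(() -> {
            try {
                crawlOnce();
            } catch (RuntimeException e) {
                // An uncaught exception would cancel all later runs.
                System.err.println("Crawl pass failed: " + e.getMessage());
            }
        }, 0, 30, TimeUnit.MINUTES);
        return scheduler;
    }
}
```

In a real setup, the URL downloader might be built on the Jsoup link extraction shown earlier and the article downloader on the Boilerpipe extractor; keeping both as plain functions lets the loop be exercised without network access. The caller is expected to shut down the scheduler returned by `start()` when crawling should stop.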