Combined

  Web Crawler                                                                                                                      Data Collection Module
           News Crawler
        •   News Crawlers focus on retrieving newly published news data.
        •   A News Crawler monitors a defined set of news sources and captures each news item as soon as it is published.
        [Figure: Predefined Set of News Sources -> (crawl every 30 min) -> News URL Downloader -> (new URLs) -> News Article Downloader -> (news articles) -> News Database]
                       Architecture of News Crawler at IITR
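
The step that separates new URLs from already-known ones can be sketched in a few lines of Java. This is only an illustration under stated assumptions, not the IITR implementation: the class name NewsCrawlerSketch, the methods filterNewUrls and startEvery30Min, and the example.com URLs are all hypothetical, and an in-memory set stands in for a lookup in the News Database.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class NewsCrawlerSketch {
    // URLs already passed on in earlier cycles (assumption: stands in for the News Database).
    private final Set<String> seenUrls = new LinkedHashSet<>();

    /** Returns only the URLs not seen in any earlier call, and remembers them. */
    public List<String> filterNewUrls(List<String> fetchedUrls) {
        List<String> fresh = new ArrayList<>();
        for (String url : fetchedUrls) {
            if (seenUrls.add(url)) { // add() returns false when the URL is already known
                fresh.add(url);
            }
        }
        return fresh;
    }

    /** Runs the given crawl cycle once immediately and then every 30 minutes. */
    public static ScheduledExecutorService startEvery30Min(Runnable crawlCycle) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(crawlCycle, 0, 30, TimeUnit.MINUTES);
        return scheduler;
    }

    public static void main(String[] args) {
        NewsCrawlerSketch crawler = new NewsCrawlerSketch();
        // Two simulated cycles with hard-coded URL lists instead of real downloads.
        System.out.println(crawler.filterNewUrls(
                List.of("http://example.com/a", "http://example.com/b")));
        System.out.println(crawler.filterNewUrls(
                List.of("http://example.com/b", "http://example.com/c")));
    }
}
```

In the second simulated cycle only http://example.com/c is reported, because http://example.com/b was already seen. A real crawler would pass startEvery30Min a task that downloads the source pages, filters the URLs, and fetches the new articles.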
  16BIT                                                                                                                                                    IITR
     Web Crawler
        A Simple Java Program for Downloading a Web Page
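
The program shown on this slide is not part of the captured text. As a stand-in, here is a minimal sketch of what such a program might look like using only the Java standard library. The class name DownloadPage and the methods readAll and download are hypothetical, and the sketch assumes the page is encoded in UTF-8.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URI;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DownloadPage {
    /** Reads the whole stream and decodes it with the given charset. */
    public static String readAll(InputStream in, Charset charset) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            byte[] chunk = new byte[8192];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
            return new String(buffer.toByteArray(), charset);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Opens the address and returns the response body as text (assumes UTF-8). */
    public static String download(String address) {
        try (InputStream in = URI.create(address).toURL().openStream()) {
            return readAll(in, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        if (args.length == 0) {
            System.out.println("Usage: java DownloadPage <url>");
            return;
        }
        String html = download(args[0]);
        System.out.println("Downloaded " + html.length() + " characters from " + args[0]);
    }
}
```

A production crawler would also set timeouts and a user agent and would honor the charset declared by the server. Those details are left out to keep the sketch short.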
           Parsing a Web Page
        •    Given a web page, we can retrieve its different components by parsing it.
        •    Many HTML parsers are available, such as Jsoup, Xerces, and NekoHTML.
        •    The following Java program uses the Jsoup parser to extract hyperlinks from a web page.
            import org.jsoup.Jsoup;
            import org.jsoup.nodes.Document;
            import org.jsoup.nodes.Element;
            import org.jsoup.select.Elements;
            import java.io.IOException;
            import java.io.File;

            public class ExtractLinks {
                public static void main(String[] args) throws IOException {
                    File input = new File("data.html");
                    // The third argument is the base URI used to resolve relative links.
                    // With an empty base URI, abs:href yields an empty string for relative links.
                    Document doc = Jsoup.parse(input, "UTF-8", "");
                    // Select every <a> element that has an href attribute
                    Elements links = doc.select("a[href]");
                    System.out.println("Total Number of Links:" + links.size());
                    for (Element link : links) {
                        // abs:href returns the link target as an absolute URL
                        System.out.println(link.attr("abs:href"));
                    }
                }
            }
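
The abs:href key in the program above asks Jsoup to resolve each link against the document's base URI. The resolution step itself can be illustrated with the standard library alone. This is a sketch of the idea, not Jsoup's implementation; the class name ResolveLinks, the method toAbsolute, and the example.com addresses are hypothetical.

```java
import java.net.URI;

public class ResolveLinks {
    /** Resolves an href against the page's base URI; absolute hrefs come back unchanged. */
    public static String toAbsolute(String baseUri, String href) {
        return URI.create(baseUri).resolve(href).toString();
    }

    public static void main(String[] args) {
        String base = "http://example.com/news/index.html"; // hypothetical page address
        System.out.println(toAbsolute(base, "story.html"));          // relative to the page's directory
        System.out.println(toAbsolute(base, "/sports/match.html"));  // relative to the site root
        System.out.println(toAbsolute(base, "http://other.example.org/a.html")); // already absolute
    }
}
```

URI.create throws IllegalArgumentException for hrefs that are not valid URIs (for example, ones containing spaces), so a crawler would typically catch that and skip the link.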
           Retrieving Article Text
        •  Many APIs are available for extracting the main content from web pages, such as the Boilerpipe API.
        •  The following Java program demonstrates the use of the Boilerpipe API to extract the article text from a news article.
             import java.io.PrintWriter;
             import java.net.URL;
             import de.l3s.boilerpipe.BoilerpipeExtractor;
             import de.l3s.boilerpipe.extractors.CommonExtractors;
             import de.l3s.boilerpipe.sax.HTMLHighlighter;

             public class BoilerplateDemo {
                 public static void main(String[] args) throws Exception {
                     URL url = new URL("http://www.thehindu.com/news/national/land-acquisition-ordinance-bill-gets-a-burial/article7597517.ece");
                     final BoilerpipeExtractor extractor = CommonExtractors.ARTICLE_EXTRACTOR;
                     // Choose the operation mode (i.e., highlighting or extraction)
                     // final HTMLHighlighter hh = HTMLHighlighter.newHighlightingInstance();
                     final HTMLHighlighter hh = HTMLHighlighter.newExtractingInstance();
                     // Write the result to a file; try-with-resources closes it even on failure
                     try (PrintWriter out = new PrintWriter("highlighted.html", "UTF-8")) {
                         out.println(hh.process(url, extractor));
                     }
                     System.out.println("Now open file highlighted.html in your web browser");
                 }
             }
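
To give a feel for how main-content extraction can work, the toy sketch below keeps text blocks that are reasonably long and not dominated by link text. It is an illustration of that general kind of heuristic only, not Boilerpipe's actual algorithm. The class LinkDensityFilter, the Block type, the method keepContent, and the thresholds are all hypothetical, and the blocks are given by hand rather than parsed from HTML.

```java
import java.util.ArrayList;
import java.util.List;

public class LinkDensityFilter {
    /** A block of page text plus the number of its words that sit inside links. */
    public static final class Block {
        final String text;
        final int linkedWords;

        public Block(String text, int linkedWords) {
            this.text = text;
            this.linkedWords = linkedWords;
        }
    }

    /** Counts whitespace-separated words; an empty or blank string has zero words. */
    static int wordCount(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    /** Keeps blocks with at least minWords words and a link density of at most maxLinkDensity. */
    public static List<String> keepContent(List<Block> blocks, int minWords, double maxLinkDensity) {
        List<String> kept = new ArrayList<>();
        for (Block b : blocks) {
            int words = wordCount(b.text);
            double linkDensity = words == 0 ? 1.0 : (double) b.linkedWords / words;
            if (words >= minWords && linkDensity <= maxLinkDensity) {
                kept.add(b.text);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Block> blocks = List.of(
                new Block("Home World Sports Login", 4), // navigation bar: every word is a link
                new Block("Bachchan will sing the National Anthem before the match at Eden Gardens", 0),
                new Block("Share", 0));                  // too short to be article text
        // Thresholds are arbitrary values chosen for this example.
        System.out.println(keepContent(blocks, 5, 0.3));
    }
}
```

With these hand-made blocks only the article sentence survives: the navigation bar fails the link-density check and the single word fails the length check.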
  Article Extractor
           Article Extraction
        •   Objective: To extract the article content from a given news URL
        •   News URL:
            http://www.hindustantimes.com/world-t20/amitabh-bachchan-to-sing-national-anthem-before-india-pakistan-match/story-QXxnQAvmJsisvIYtSFv33L.html
        •   Extracted article text:
            Bollywood superstar Amitabh Bachchan will sing the National Anthem before the start of the marquee India-Pakistan World Twenty20 cricket match at the Eden Gardens on March 19.
            Bachchan has confirmed the development by retweeting a post in his official Twitter handle while sources in the Cricket Association of Bengal today said this was an effort by its president Sourav Ganguly.
            “The president was involved and the plan was on for a long time,” CAB sources said.
            While the ‘Big B’ will sing the National Anthem in his signature baritone, Pakistan will also make their presence felt with classical singer Shafaqat Amanat Ali who is slated to sing the Pakistani National Anthem.