Class URLClassifyProcessor

  • All Implemented Interfaces:
    Closeable, AutoCloseable

    public class URLClassifyProcessor
    extends UpdateRequestProcessor
    Update processor that examines a URL and writes characteristics of that URL to other fields. These include the length, the number of path levels, whether it is a top-level URL (levels == 0), whether it looks like a landing/index page, a canonical representation of the URL (for example, with index.html stripped), and the domain and path parts of the URL.

    This processor is intended for use when processing web resources, and helps produce values that may later be used for boosting or filtering.

    • Method Detail

      • getCanonicalUrl

        public URL getCanonicalUrl(URL url)
        Gets a canonical form of the URL for use as the main URL
        Parameters:
        url - The input URL
        Returns:
        The URL object representing the canonical URL
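The kind of canonicalization described above (stripping a trailing index.html) can be illustrated with a minimal, standalone sketch. This is not the processor's own code: the class name `CanonicalUrlSketch`, the helpers `parse` and `canonicalize`, and the list of index file names are all assumptions made for illustration, and the real processor may recognize a different set of names or apply further normalization.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;
import java.util.List;

class CanonicalUrlSketch {
    // Hypothetical index-page names; the processor's actual list is not given here.
    static final List<String> INDEX_NAMES =
            List.of("index.html", "index.htm", "index.php", "default.html");

    // Small helper that turns the checked exception into an unchecked one.
    static URL parse(String s) {
        try {
            return URI.create(s).toURL();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Strips a trailing index file name, leaving the directory URL.
    // URLs with a query string or fragment are returned unchanged (an assumption).
    static URL canonicalize(URL url) {
        String s = url.toString();
        if (url.getQuery() == null && url.getRef() == null) {
            for (String name : INDEX_NAMES) {
                if (s.endsWith("/" + name)) {
                    return parse(s.substring(0, s.length() - name.length()));
                }
            }
        }
        return url;
    }

    public static void main(String[] args) {
        System.out.println(canonicalize(parse("http://example.com/docs/index.html")));
        System.out.println(canonicalize(parse("http://example.com/docs/page.html")));
    }
}
```

In this sketch a URL that does not end in one of the assumed index names is returned as-is, so the result can always be used in place of the input.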
      • length

        public int length(URL url)
        Calculates the length of the URL in characters
        Parameters:
        url - The input URL
        Returns:
        the length of the URL
      • levels

        public int levels(URL url)
        Calculates the number of path levels in the given URL
        Parameters:
        url - The input URL
        Returns:
        the number of levels, where a top-level URL is 0
      • isTopLevelPage

        public boolean isTopLevelPage(URL url)
        Determines whether a URL is a top-level page
        Parameters:
        url - The input URL
        Returns:
        true if the page is a top-level page
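To make the relationship between path levels and top-level pages concrete (a top-level URL has levels == 0), here is a minimal standalone sketch. The class `PathLevelSketch` and its methods are hypothetical and are not the processor's implementation. In particular, this sketch counts a trailing file name as a level, ignores the query string, and does not strip index file names first; the real processor may differ on each of these points.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;

class PathLevelSketch {
    // Small helper that turns the checked exception into an unchecked one.
    static URL parse(String s) {
        try {
            return URI.create(s).toURL();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Counts the '/' separators in the path after removing trailing slashes,
    // so "/" and "" both give 0 and "/a/b/" gives 2.
    static int levels(URL url) {
        String path = url.getPath().replaceAll("/+$", "");
        int count = 0;
        for (int i = 0; i < path.length(); i++) {
            if (path.charAt(i) == '/') {
                count++;
            }
        }
        return count;
    }

    // A top-level URL is one with zero path levels.
    static boolean isTopLevel(URL url) {
        return levels(url) == 0;
    }

    public static void main(String[] args) {
        System.out.println(levels(parse("http://example.com/")));
        System.out.println(levels(parse("http://example.com/a/b/")));
        System.out.println(isTopLevel(parse("http://example.com")));
    }
}
```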
      • isLandingPage

        public boolean isLandingPage(URL url)
        Determines whether the URL is a landing page
        Parameters:
        url - The input URL
        Returns:
        true if URL represents a landing page (index page)
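One plausible reading of "landing page (index page)" is a URL whose last path segment is a well-known index file name and that carries no query string. The sketch below illustrates that reading only; the class `LandingPageSketch`, the method `looksLikeLandingPage`, the set of index names, and the rule about query strings are assumptions, not the processor's documented behavior.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;
import java.util.Locale;
import java.util.Set;

class LandingPageSketch {
    // Hypothetical index-page names; the processor's actual list is not given here.
    static final Set<String> INDEX_NAMES =
            Set.of("index.html", "index.htm", "index.php", "default.html");

    // Small helper that turns the checked exception into an unchecked one.
    static URL parse(String s) {
        try {
            return URI.create(s).toURL();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Returns true when the URL has no query string and its last path
    // segment matches one of the assumed index file names.
    static boolean looksLikeLandingPage(URL url) {
        if (url.getQuery() != null) {
            return false;
        }
        String path = url.getPath();
        String lastSegment = path.substring(path.lastIndexOf('/') + 1);
        return INDEX_NAMES.contains(lastSegment.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        System.out.println(looksLikeLandingPage(parse("http://example.com/docs/index.html")));
        System.out.println(looksLikeLandingPage(parse("http://example.com/docs/index.html?page=2")));
        System.out.println(looksLikeLandingPage(parse("http://example.com/docs/guide.html")));
    }
}
```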
      • isEnabled

        public boolean isEnabled()
      • setEnabled

        public void setEnabled(boolean enabled)