Development

Giacomo Stelluti Scala edited this page Feb 1, 2020 · 42 revisions

Manifesto

As the project motto says, PickAll is both modular and extensible: this means you can develop services in the form of searchers or post processors.

Prerequisites

To develop a PickAll service you can potentially use any main .NET language (C#, F# and VB.NET; nothing else has been tested).

Searchers are meant to scrape the web, which is why you have access to IBrowsingContext from the AngleSharp library via the Browsing property (a property of the SearchContext bound to your service, in turn available via the Context property).

A minimal knowledge of AngleSharp is required (including basics of HTML DOM and possibly CSS Selectors).

This blog post of mine on AngleSharp may be of some help.

When there's no need to access the DOM or use CSS Selectors, you can use the FetchingContext (accessible via the Fetching property). For an example, see the Textify post processor source code.

Searcher

To be a PickAll searcher a type must inherit from the Searcher base class. The library comes with built-in searchers which you can take as examples.

A searcher must implement the Task<IEnumerable<ResultInfo>> SearchAsync(string) method and supply a constructor that accepts a System.Object type. The latter allows the searcher to consume an instance of its configuration settings.

An example follows:

public class MySearcher : Searcher
{
    public MySearcher(object settings) : base(settings)  
    {
    }

    public override async Task<IEnumerable<ResultInfo>> SearchAsync(string query)
    {
        using (var document = await Context.Browsing.OpenAsync("https://www.somewhere.com/")) {
            var form = document.QuerySelector<IHtmlFormElement>("#some_form");
            ((IHtmlInputElement)form["query_field"]).Value = query;
            using (var result = await form.SubmitAsync(form)) {
                var links = from link in result.QuerySelectorAll<IHtmlAnchorElement>("li a")
                            where link.Attributes["href"].Value.StartsWith(
                                "http",
                                StringComparison.OrdinalIgnoreCase)
                            select link;
                return links.Select((link, index) =>
                    CreateResult((ushort)index, link.Attributes["href"].Value, link.Text));
            }
        }
    }
}

In this example we scrape the fake Somewhere search engine, injecting a query into some_form via query_field. Then we keep only the links whose URL starts with the http scheme and discard all the others. This is essentially what the Bing searcher does to avoid returning UI links.

Behaviour

  • Should (or better must) number results by increasing the ResultInfo.Index property.
  • Should create results using the Searcher.CreateResult protected method.
  • Should not produce more results than the ones assigned by the PickAll kernel.
    • This can be done by inspecting the MaximumResults property accessible via Searcher.Runtime (of type RuntimeInfo).
    • Results should be limited as soon as possible. Avoid generating them all and then returning a subset.
  • Should not produce too many results by default. PickAll is designed to be used with multiple searchers and to join their results. Producing an overly long list of results violates its founding philosophy.
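The numbering and limiting rules above can be sketched in isolation. This is a minimal, self-contained illustration and not PickAll code: Result, LimitedResults.Create and the maximumResults parameter are hypothetical stand-ins for ResultInfo, a searcher's projection step and Runtime.MaximumResults. Treating a null limit as "no limit" is also an assumption of this sketch.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for PickAll's ResultInfo, used only in this sketch.
public sealed class Result
{
    public Result(ushort index, string url)
    {
        Index = index;
        Url = url;
    }

    public ushort Index { get; }
    public string Url { get; }
}

public static class LimitedResults
{
    // maximumResults plays the role of Runtime.MaximumResults;
    // null is assumed here to mean "no limit".
    public static IEnumerable<Result> Create(IEnumerable<string> urls, uint? maximumResults)
    {
        // Cut the source first, so results beyond the limit are never built.
        var limited = maximumResults.HasValue
            ? urls.Take((int)maximumResults.Value)
            : urls;
        // Number results with an increasing index, starting from zero.
        return limited.Select((url, index) => new Result((ushort)index, url));
    }
}
```

Because Take is applied before Select, the projection never runs for items past the limit, which is the "limit as soon as possible" behaviour in miniature.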

Post Processor

To be a PickAll post processor a type must inherit from the PostProcessor base class. The library comes with built-in post processors which you can take as examples.

Post processors must implement the IEnumerable<ResultInfo> Process(IEnumerable<ResultInfo> results) method and supply a constructor that accepts a System.Object type. The latter allows the post processor to consume an instance of its configuration settings.

An example follows:

public class Uniqueness : PostProcessor
{
    public Uniqueness(object settings) : base(settings)
    {
    }

    public override IEnumerable<ResultInfo> Process(IEnumerable<ResultInfo> results)
    {
        return results.DistinctBy(result => result.Url);
    }
}

This is the actual code of the Uniqueness post processor. It simply removes duplicates, processing all results with a single method call. However, it's not uncommon for a post processor to perform a more elaborate set of operations and produce results using yield return (as in Textify).
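The yield return style mentioned above can be sketched without any PickAll type. The example below is only an illustration under stated assumptions: LazyProcessing.Process is a hypothetical method that works on plain URL strings instead of ResultInfo, and the normalization it applies is arbitrary.

```csharp
using System;
using System.Collections.Generic;

public static class LazyProcessing
{
    // Hypothetical Process-like method working on plain URLs instead of ResultInfo.
    public static IEnumerable<string> Process(IEnumerable<string> urls)
    {
        foreach (var url in urls) {
            // Skip entries this sketch cannot handle.
            if (string.IsNullOrWhiteSpace(url)) {
                continue;
            }
            // Each item is produced only when the consumer asks for it.
            yield return url.Trim().ToLowerInvariant();
        }
    }
}
```

Since the method is an iterator, no work is done until the caller enumerates the sequence, so a later stage that stops early also stops the processing here.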

Behaviour

  • Should (or better must) not overwhelm the post processing chain with overly heavy computations.
  • Should provide a means to limit the data to process (see Textify and Wordify source code).

Settings and Data Types

Both searchers and post processors may use a Settings Type to receive a configuration, and may produce a Data Type to store extra data in results.

Settings Type

Although the PickAll kernel can pass you anything, it's recommended to follow the same design as the built-in types.

  • Source
    • As for the Data Type (which we'll talk about later), the class should be defined in the same source file as the service.
  • Name
    • The name of the service with the Settings suffix. Following the previous example, it will be MySearcherSettings.
  • Value Type
    • Should be defined as a struct or at least as a sealed class.
  • Content
    • May contain all the automatic properties you need. Here is an example of a declaration and of how it should be accessed within your searcher service:
public struct MySearcherSettings
{
    public bool DeepSearch { get; set; } // whatever it means
}

public class MySearcher : Searcher
{
    private readonly MySearcherSettings _settings;

    public MySearcher(object settings) : base(settings)  
    {
        if (!(settings is MySearcherSettings)) {
            throw new NotSupportedException(
                $"{nameof(settings)} must be of {nameof(MySearcherSettings)} type");
        }
        _settings = (MySearcherSettings)Settings;
    }

    public override async Task<IEnumerable<ResultInfo>> SearchAsync(string query)
    {
        if (_settings.DeepSearch) {
            // Perform deep search
            // ..
        }
        else {
            // Perform normal search
            // ...
        }
    }
}

Please follow this design, but if you prefer to do it differently, choose something meaningful. E.g. don't require a bare bool to be passed (in the case of the deep search setting): a dictionary, an anonymous type (or even a dynamic object) are all better alternatives.

Data Type

A PickAll Data Type is used to add additional data to each result. It's normally produced by a post processor, but nothing forbids producing additional data directly in a searcher service. Its main purpose is to encapsulate a result, avoiding the bad practice of populating ResultInfo.Data directly.

  • Source
    • As for the Settings Type, the class should be defined in the same source file as the service.
  • Name
    • The name of the service with the Data suffix. Following the previous example, it will be MySearcherData.
  • Value Type
    • Should be defined as a struct or at least as a sealed class.
  • Content
    • May contain all the automatic properties you need, but normally one will suffice. Here is an example for a hypothetical MyPostProcessor:
public struct MyPostProcessorData
{
    public string Abstract { get; set; }
}

public class MyPostProcessor : PostProcessor
{
    public MyPostProcessor(object settings) : base(settings)
    {
    }

    public override IEnumerable<ResultInfo> Process(IEnumerable<ResultInfo> results)
    {
        foreach (var result in results) {
            // Attach the extra data to a copy of each result
            yield return result.Clone(
                new MyPostProcessorData { Abstract = MyScraper.ExtractAbstract(result.Url) });
        }
    }
}

Events

PickAll handles all events via SearchContext. The context itself handles just one very basic event, raised when the search begins.

var context = SearchContext.Default; // or whatever configuration you prefer

context.SearchBegin += (sender, e) => Console.WriteLine($"Searching '{e.Query}' ...");

SearchBeginEventArgs is a very simple immutable type that allows the event handler to know the search query.

Attaching events to services is also a responsibility of the context.

context.ResultCreated += (sender, e) => Console.WriteLine($"{e.Result.Url}");

context.ResultProcessed += (sender, e) => Console.WriteLine($"{sender.GetType().Name}");

ResultCreated is fired when a searcher service gets a result, while ResultProcessed is fired when a post processor has handled one of them. In both cases the result comes in the form of a ResultInfo instance.

This is because both events are fired with the ResultHandledEventArgs class. The Result property holds the ResultInfo of interest, while the Type property discriminates searchers from post processors. To know which service fired the event, just query the sender type using reflection (as in the example).

In the case of a searcher you can also cast sender to the Searcher type and read the Name property (which gives exactly the same thing).
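A handler shared by both events cannot assume that sender is a searcher, so a type test is safer than a direct cast. The sketch below shows the idea in isolation. The Searcher and PostProcessor classes declared here are hypothetical stand-ins for the PickAll base classes (with Name assumed to return the type name, as described above), and SenderInfo.Describe is an illustrative helper, not part of the library.

```csharp
using System;

// Hypothetical stand-ins for PickAll's base classes, used only in this sketch.
public abstract class Searcher
{
    // Assumed to mirror the type name, as the text above describes.
    public string Name => GetType().Name;
}

public abstract class PostProcessor
{
}

public static class SenderInfo
{
    public static string Describe(object sender)
    {
        // Pattern matching avoids an invalid cast when the sender is a post processor.
        return sender is Searcher searcher
            ? $"searcher {searcher.Name}"
            : $"service {sender.GetType().Name}";
    }
}
```

Inside a real handler the same test would be applied to the sender argument received by ResultCreated or ResultProcessed.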