Development
As the project motto says, PickAll is both modular and extensible: you can develop services in the form of searchers or post processors.
To develop a PickAll service you can use any of the main .NET languages (C#, F# and VB.NET; nothing else has been tested).
Searchers are meant to scrape the web, which is why you have access to IBrowsingContext from the AngleSharp library via the Browsing property (a property of the SearchContext bound to your service, which is in turn available via the Context property).
A minimal knowledge of AngleSharp is required (including basics of HTML DOM and possibly CSS Selectors).
My blog post on AngleSharp may be of some help.
If there's no need to access the DOM or use CSS selectors, you can use the FetchingContext (accessible via the Fetching property). For an example, see the Textify post processor source code.
To be a PickAll searcher, a type must inherit from the Searcher base class. The library comes with built-in searchers which you can take as examples.
A searcher must implement the Task<IEnumerable<ResultInfo>> SearchAsync(string) method and supply a constructor that accepts a System.Object type. The latter allows the searcher to consume an instance of its configuration settings.
An example follows:
public class MySearcher : Searcher
{
    public MySearcher(object settings) : base(settings)
    {
    }

    public override async Task<IEnumerable<ResultInfo>> SearchAsync(string query)
    {
        using (var document = await Context.Browsing.OpenAsync("https://www.somewhere.com/")) {
            var form = document.QuerySelector<IHtmlFormElement>("#some_form");
            ((IHtmlInputElement)form["query_field"]).Value = query;
            using (var result = await form.SubmitAsync(form)) {
                var links = from link in result.QuerySelectorAll<IHtmlAnchorElement>("li a")
                            where link.Attributes["href"].Value.StartsWith(
                                "http",
                                StringComparison.OrdinalIgnoreCase)
                            select link;
                return links.Select((link, index) =>
                    CreateResult((ushort)index, link.Attributes["href"].Value, link.Text));
            }
        }
    }
}
In this example we scrape the fake Somewhere search engine, injecting a query into some_form via query_field. Then we keep only the URLs that start with the HTTP scheme, discarding the rest. This is essentially what the Bing searcher does to avoid returning UI links.
A searcher:
- Should (or better, must) number results by increasing the ResultInfo.Index property.
- Should create results using the Searcher.CreateResult protected method.
- Should not produce more results than the ones assigned by the PickAll kernel.
  - This can be done by inspecting the MaximumResults property, accessible via Searcher.Runtime (of type RuntimeInfo).
  - Results should be limited as soon as possible. Avoid generating them all and then returning a subset.
- Should not produce too many results by default. PickAll is designed to be used with multiple searchers and to join their results. Producing an overly long list of results violates its founding philosophy.
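The advice to limit results as early as possible can be sketched with plain LINQ, independently of PickAll. In this minimal illustration the Result record is a hypothetical stand-in for ResultInfo, and the maximumResults parameter is an assumption standing in for the value a searcher would read from Runtime.MaximumResults. Because Take is applied before the projection, no result beyond the limit is ever built.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for ResultInfo, used only to keep this sketch self-contained.
public sealed record Result(ushort Index, string Url);

public static class EarlyLimit
{
    // maximumResults is an assumption standing in for the limit assigned by the kernel;
    // null means "no limit".
    public static IEnumerable<Result> Build(IEnumerable<string> links, int? maximumResults)
    {
        // Keep only absolute links, as in the searcher example.
        var filtered = links.Where(
            url => url.StartsWith("http", StringComparison.OrdinalIgnoreCase));

        // Cut the sequence before projecting, so extra results are never created.
        var limited = maximumResults.HasValue
            ? filtered.Take(maximumResults.Value)
            : filtered;

        // Index increases with the position of each result.
        return limited.Select((url, index) => new Result((ushort)index, url));
    }
}
```

In a real searcher the projection would presumably go through CreateResult rather than a record constructor; the point of the sketch is only the ordering of Where, Take and Select.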
To be a PickAll post processor, a type must inherit from the PostProcessor base class. The library comes with built-in post processors which you can take as examples.
Post processors must implement the IEnumerable<ResultInfo> Process(IEnumerable<ResultInfo> results) method and supply a constructor that accepts a System.Object type. The latter allows the post processor to consume an instance of its configuration settings.
An example follows:
public class Uniqueness : PostProcessor
{
    public Uniqueness(object settings) : base(settings)
    {
    }

    public override IEnumerable<ResultInfo> Process(IEnumerable<ResultInfo> results)
    {
        return results.DistinctBy(result => result.Url);
    }
}
This is the actual code of the Uniqueness post processor. It simply removes duplicates, processing all results with a single method call. However, it's not uncommon for a post processor to perform a more elaborate set of operations and produce results using yield return (as in Textify).
A post processor:
- Should (or better, must) not overwhelm the post processing chain with excessively heavy computations.
- Should provide a means to limit the data to process (see the Textify and Wordify source code).
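A lazy, bounded processing step can be sketched as follows. This is not the Textify or Wordify code: the Item record is a hypothetical stand-in for ResultInfo, maximumLength is a hypothetical setting, and the upper-casing merely represents whatever heavier work a real post processor would do. The sketch shows the two ideas together: cap the input before working on it, and hand each result downstream with yield return as soon as it is ready.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for ResultInfo, used only to keep this sketch self-contained.
public sealed record Item(string Url, string Description);

public static class CappedProcessing
{
    // maximumLength is a hypothetical setting that bounds the work done per result.
    public static IEnumerable<Item> Process(IEnumerable<Item> results, int maximumLength)
    {
        foreach (var result in results) {
            var text = result.Description ?? string.Empty;

            // Truncate first, so the costly step never sees more data than allowed.
            if (text.Length > maximumLength) {
                text = text.Substring(0, maximumLength);
            }

            // Placeholder for the real computation; each result is emitted lazily.
            yield return result with { Description = text.ToUpperInvariant() };
        }
    }
}
```

Because the method is an iterator, nothing runs until the caller enumerates it, and a consumer that stops early never pays for the remaining results.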
Both searchers and post processors may use a Settings Type to receive a configuration, or may produce a Data Type to store extra data in results.
Although the PickAll kernel can pass you anything, it's recommended to follow the same design as the built-in types.
- Source
  - As for the Data Type (which we'll talk about later), the class should be defined in the same source file as the service.
- Name
  - The name of the service with the Settings suffix. Following the previous example, it will be MySearcherSettings.
- Value Type
  - Should be defined as a struct or at least as a sealed class.
- Content
  - May contain all the automatic properties you need. An example declaration follows, showing how it should be accessed within your searcher service:
public struct MySearcherSettings
{
    public bool DeepSearch { get; set; } // whatever it means
}

public class MySearcher : Searcher
{
    private readonly MySearcherSettings _settings;

    public MySearcher(object settings) : base(settings)
    {
        // Reject any settings object that is not of the expected type
        if (!(settings is MySearcherSettings)) {
            throw new NotSupportedException(
                $"{nameof(settings)} must be of {nameof(MySearcherSettings)} type");
        }
        _settings = (MySearcherSettings)Settings;
    }

    public override async Task<IEnumerable<ResultInfo>> SearchAsync(string query)
    {
        if (_settings.DeepSearch) {
            // Perform deep search
            // ...
        }
        else {
            // Perform normal search
            // ...
        }
    }
}
Please follow this design, but if you prefer doing it differently, choose something meaningful. For example, don't require passing a bare bool (in the case of the deep search setting): a dictionary, an anonymous type, or even a dynamic object are all better alternatives.
A PickAll Data Type is used to add additional data to each result. It's normally produced by a post processor, but nothing forbids producing additional data directly in a searcher service. Its main purpose is to encapsulate a result, avoiding the bad practice of populating ResultInfo.Data directly.
- Source
  - As for the Settings Type, the class should be defined in the same source file as the service.
- Name
  - The name of the service with the Data suffix. Following the previous example, it will be MySearcherData.
- Value Type
  - Should be defined as a struct or at least as a sealed class.
- Content
  - May contain all the automatic properties you need, but normally one will suffice. An example follows for a hypothetical MyPostProcessor:
public struct MyPostProcessorData
{
    public string Abstract { get; set; }
}

public class MyPostProcessor : PostProcessor
{
    public MyPostProcessor(object settings) : base(settings)
    {
    }

    public override IEnumerable<ResultInfo> Process(IEnumerable<ResultInfo> results)
    {
        foreach (var result in results) {
            // Attach the extra data to a copy of each result
            yield return result.Clone(
                new MyPostProcessorData { Abstract = MyScraper.ExtractAbstract(result.Url) });
        }
    }
}
PickAll handles all events via SearchContext. The context itself raises just one very basic event, fired when the search begins.
var context = SearchContext.Default; // or whatever configuration you prefer
context.SearchBegin += (sender, e) => Console.WriteLine($"Searching '{e.Query}' ...");
SearchBeginEventArgs is a very simple immutable type that allows the event handler to know the search query.
Attaching events to services is also a responsibility of the context.
context.ResultCreated += (sender, e) => Console.WriteLine($"{e.Result.Url}");
context.ResultProcessed += (sender, e) => Console.WriteLine($"{sender.GetType().Name}");
ResultCreated is fired when a searcher service gets a result, while ResultProcessed is fired when a post processor has handled one of them. In both cases the result comes in the form of a ResultInfo instance.
This is because both events are fired with the ResultHandledEventArgs class. The Result property holds the ResultInfo of interest, while the Type property discriminates searchers from post processors. To know which service fired the event, just query the sender type using reflection (as in the example).
In the case of a searcher you can also cast sender to the Searcher type and read the Name property (which is exactly the same thing).
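The sender-based discrimination described above follows the standard .NET event pattern, and can be sketched without PickAll. Everything in this illustration is a hypothetical stand-in: Service, FakeSearcher, FakePostProcessor and the ResultHandled event only mimic the shape of the real types, and the event payload is reduced to a URL string. It shows a handler that casts sender when it is a searcher and falls back to reflection otherwise.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-ins; the real service types live in PickAll.
public abstract class Service
{
    public event EventHandler<string> ResultHandled;

    // Raises the event passing the service itself as sender.
    protected void Raise(string url) => ResultHandled?.Invoke(this, url);
}

public sealed class FakeSearcher : Service
{
    // Mirrors the idea that Name matches the type name.
    public string Name => GetType().Name;
    public void Emit(string url) => Raise(url);
}

public sealed class FakePostProcessor : Service
{
    public void Emit(string url) => Raise(url);
}

public static class SenderDemo
{
    // Identifies who fired the event by inspecting the sender.
    public static string Describe(object sender, string url) =>
        sender is FakeSearcher searcher
            ? $"{searcher.Name} created {url}"
            : $"{sender.GetType().Name} processed {url}";

    public static List<string> Run()
    {
        var log = new List<string>();
        var searcher = new FakeSearcher();
        var processor = new FakePostProcessor();
        searcher.ResultHandled += (sender, url) => log.Add(Describe(sender, url));
        processor.ResultHandled += (sender, url) => log.Add(Describe(sender, url));
        searcher.Emit("https://example.com/");
        processor.Emit("https://example.com/");
        return log;
    }
}
```

With the real library the same handler body would be attached to context.ResultCreated and context.ResultProcessed, reading the URL from e.Result.Url.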