Implementing scrapers

Hamuko edited this page Jun 2, 2017 · 2 revisions

This page is a work in progress.

Table of Contents

  1. Overview
  2. Base classes
    i. BaseSeries
    ii. BaseChapter
  3. Series
    i. Required methods and properties
  4. Chapter

1. Overview

All cum scrapers are located in the cum/scrapers/ directory, which is also where you should add your new scraper file. Each site has its own scraper. In addition, the directory contains some special files, including the all-important base scraper.

cum uses two kinds of objects when dealing with manga sites: series objects and chapter objects. A series object represents a single series on the site, which can have one or more chapters attached to it. When a user runs cum follow, they supply the application with a URL for a series. A chapter object represents a single chapter on a site, which often belongs to a series (depending on the site setup). Chapters are often handled in the software without their parent series, so they also carry some basic information from it.

2. Base classes

cum has two base classes for scrapers: BaseChapter and BaseSeries. Every scraper must inherit from these two classes when defining its own classes, so (almost) all scrapers include this import:

from cum.scrapers.base import BaseChapter, BaseSeries

The base classes add important functions and properties for the scraper objects and ensure that the classes conform to the standards.

i. BaseSeries

The base series is initialised with a URL pointing to the series on the site, plus any number of keyword arguments. The base class is in charge of saving the URL on the series object as self.url and of handling any custom directories, so the scraper classes do not have to worry about these properties.

__init__(self, url, **kwargs)
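
As a rough illustration of the division of labour described above, the following is a simplified stand-in and not the actual cum base class. The class name StandInBaseSeries and the directory keyword are assumptions made for this sketch; the real keyword argument may be named differently.

```python
class StandInBaseSeries:
    """Simplified stand-in for cum's BaseSeries (not the real class)."""

    def __init__(self, url, **kwargs):
        # The base class stores the series URL so scrapers do not have to.
        self.url = url
        # Assumption: a custom download directory arrives as a keyword
        # argument called 'directory'; None means no custom directory.
        self.directory = kwargs.get('directory')


series = StandInBaseSeries('https://example.com/series/1',
                           directory='/tmp/manga')
print(series.url, series.directory)
```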

ii. BaseChapter

3. Series

Each scraper must implement its own series object that inherits from BaseSeries. The series object represents one individual series on a site and is in charge of returning chapter information for itself.

i. Required methods and properties

def __init__(self, url, **kwargs):

Series objects use the initialisation to request the chapters from the remote site and to save them as self.chapters. Because the base scraper needs access to some of the data passed to the series initialisation, you must call super().__init__(url, **kwargs) at the beginning of the method.
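
A minimal sketch of such an initialiser might look like the following. To keep it runnable on its own, BaseSeries here is a small stand-in rather than the class from cum.scrapers.base, and ExampleSeries is a hypothetical scraper for an imaginary site that skips the actual network request.

```python
class BaseSeries:
    """Stand-in for cum.scrapers.base.BaseSeries so the sketch runs alone."""

    def __init__(self, url, **kwargs):
        self.url = url


class ExampleSeries(BaseSeries):
    """Hypothetical series class for an imaginary site."""

    def __init__(self, url, **kwargs):
        # Call the base initialiser first so that self.url is available.
        super().__init__(url, **kwargs)
        # A real scraper would request the series page here before
        # building the chapter list.
        self.chapters = self.get_chapters()

    def get_chapters(self):
        # Placeholder: a real scraper would return chapter objects.
        return []


series = ExampleSeries('https://example.com/series/1')
```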

def get_chapters(self):

Series objects build the list of chapter objects using this method. Even though the rest of the application accesses the chapters through the chapters property, which is set in __init__, the chapter list is generated in this method for clarity and readability. The signature of this method is not fixed, so you are free to add other parameters to it.
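
The sketch below shows one way to use that freedom: get_chapters takes an extra rows parameter holding data that __init__ has already collected. Everything in it is hypothetical. BaseSeries is a stand-in for the real base class, ExampleChapter is a namedtuple standing in for a proper BaseChapter subclass, and the rows are hard-coded instead of being parsed from a remote page.

```python
from collections import namedtuple

# Hypothetical stand-in for a chapter class; a real scraper would build
# objects of its own BaseChapter subclass instead.
ExampleChapter = namedtuple('ExampleChapter', ['chapter', 'url'])


class BaseSeries:
    """Stand-in for cum.scrapers.base.BaseSeries so the sketch runs alone."""

    def __init__(self, url, **kwargs):
        self.url = url


class ExampleSeries(BaseSeries):
    """Hypothetical series class for an imaginary site."""

    def __init__(self, url, **kwargs):
        super().__init__(url, **kwargs)
        # Stand-in for rows parsed from the series page on the remote site.
        rows = [('1', 'https://example.com/chapter/1'),
                ('2', 'https://example.com/chapter/2')]
        self.chapters = self.get_chapters(rows)

    def get_chapters(self, rows):
        # The extra 'rows' parameter is fine because the signature is free.
        return [ExampleChapter(chapter=number, url=link)
                for number, link in rows]


series = ExampleSeries('https://example.com/series/1')
```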

@property
def name(self):

Returns the series name as a string. The string may contain special characters if they are present in the series name. The base class uses this string to form the alias of the series, which is used in the cum UI, and to form the path for the downloads.
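
A minimal sketch of the property, assuming the title was stored during initialisation: BaseSeries is again a stand-in for the real base class, and both ExampleSeries and its _title attribute are hypothetical names used only for illustration.

```python
class BaseSeries:
    """Stand-in for cum.scrapers.base.BaseSeries so the sketch runs alone."""

    def __init__(self, url, **kwargs):
        self.url = url


class ExampleSeries(BaseSeries):
    """Hypothetical series class for an imaginary site."""

    def __init__(self, url, **kwargs):
        super().__init__(url, **kwargs)
        # Stand-in for a title scraped from the series page.
        self._title = 'Example: The Series!'

    @property
    def name(self):
        # Return the name unchanged, special characters included.
        return self._title


series = ExampleSeries('https://example.com/series/1')
```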

4. Chapter
