Merge pull request #86 from creekorful/develop
Release 0.7.0
creekorful authored Dec 23, 2020
2 parents 0c4013f + bf884d1 commit 7cbfbb7
Showing 42 changed files with 1,806 additions and 794 deletions.
56 changes: 41 additions & 15 deletions README.md
@@ -7,11 +7,11 @@ Git repository to ease maintenance.

## Why a rewrite?

The first version of Trandoshan [(available here)](https://github.com/trandoshan-io) works well, but it isn't very
professional: the code has become a mess, and it is hard to manage since it is split across multiple repositories.

I have therefore decided to create and maintain the project in this single repository, where the code for all
components is available (as a Go module).

# How to start the crawler

@@ -35,23 +35,49 @@ Since the API is exposed on localhost:15005, one can use it to start crawling:

using the trandoshanctl executable:

```sh
-$ trandoshanctl schedule https://www.facebookcorewwwi.onion
+$ trandoshanctl --api-token <token> schedule https://www.facebookcorewwwi.onion
```

or using the docker image:

```sh
-$ docker run creekorful/trandoshanctl --api-uri <uri> schedule https://www.facebookcorewwwi.onion
+$ docker run creekorful/trandoshanctl --api-token <token> --api-uri <uri> schedule https://www.facebookcorewwwi.onion
```

(you'll need to specify the API URI if you use the Docker container)

This will schedule the given URL for crawling.
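Under the hood, scheduling amounts to an authenticated POST against the API. Here is a minimal Go sketch of the request trandoshanctl presumably builds, assuming the `/v1/urls` endpoint and Bearer scheme shown in the token example below; the raw-URL body format is an illustration rather than the confirmed wire format, and the project's own client uses resty rather than `net/http`:

```go
package main

import (
    "bytes"
    "fmt"
    "net/http"
)

// buildScheduleRequest sketches the request trandoshanctl presumably sends.
// The endpoint path (/v1/urls) matches the rights listed in the example
// token; the plain-text URL body is an assumption for illustration.
func buildScheduleRequest(apiURI, token, target string) (*http.Request, error) {
    req, err := http.NewRequest(http.MethodPost, apiURI+"/v1/urls", bytes.NewBufferString(target))
    if err != nil {
        return nil, err
    }
    req.Header.Set("Authorization", "Bearer "+token)
    return req, nil
}

func main() {
    req, err := buildScheduleRequest("http://localhost:15005", "<token>", "https://www.facebookcorewwwi.onion")
    if err != nil {
        panic(err)
    }
    fmt.Println(req.Method, req.URL.String())
}
```

Sending it is then a matter of calling `http.DefaultClient.Do(req)` against a running API instance.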

## Example token

Here's a working API token that you can use with trandoshanctl if you haven't changed the API signing key:

```
eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJ1c2VybmFtZSI6InRyYW5kb3NoYW5jdGwiLCJyaWdodHMiOnsiUE9TVCI6WyIvdjEvdXJscyJdLCJHRVQiOlsiL3YxL3Jlc291cmNlcyJdfX0.jGA8WODYKtKy7ZijngoV8C3iWi1eTvMitA8Z1Is2GUg
```

This token is the representation of the following payload:

```
{
  "username": "trandoshanctl",
  "rights": {
    "POST": [
      "/v1/urls"
    ],
    "GET": [
      "/v1/resources"
    ]
  }
}
```

You may create your own tokens with whatever rights are needed. In the future, a CLI tool will make token generation easy.
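Until that CLI exists, a token with custom rights can be produced by signing the payload above with the API signing key. The project pulls in `github.com/dgrijalva/jwt-go` for its JWT handling (see go.mod below), but the HS256 mechanics fit in a stdlib-only sketch; the claim layout here mirrors the example payload, while the exact claims the API validates are an assumption:

```go
package main

import (
    "crypto/hmac"
    "crypto/sha256"
    "encoding/base64"
    "encoding/json"
    "fmt"
)

// signHS256 builds a JWT the way the API presumably expects it:
// base64url(header) + "." + base64url(payload), HMAC-SHA256 signed
// with the API signing key, all segments unpadded per RFC 7515.
func signHS256(payload map[string]interface{}, key string) (string, error) {
    enc := base64.RawURLEncoding
    header := enc.EncodeToString([]byte(`{"alg":"HS256","typ":"JWT"}`))
    body, err := json.Marshal(payload)
    if err != nil {
        return "", err
    }
    signingInput := header + "." + enc.EncodeToString(body)
    mac := hmac.New(sha256.New, []byte(key))
    mac.Write([]byte(signingInput))
    return signingInput + "." + enc.EncodeToString(mac.Sum(nil)), nil
}

func main() {
    // Claims mirror the example payload above; the key is a placeholder.
    token, err := signHS256(map[string]interface{}{
        "username": "trandoshanctl",
        "rights": map[string][]string{
            "POST": {"/v1/urls"},
            "GET":  {"/v1/resources"},
        },
    }, "<signing-key>")
    if err != nil {
        panic(err)
    }
    fmt.Println(token)
}
```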

## How to speed up crawling

If you want to speed up crawling, you can scale the crawler component to increase performance. This may be done by
issuing the following command after the crawler is started:

```sh
$ ./scripts/scale.sh crawler=5
@@ -69,20 +95,20 @@ $ trandoshanctl search <term>

## Using kibana

You can use the Kibana dashboard available at http://localhost:15004. You will need to create an index pattern named
'resources', and when it asks for the time field, choose 'time'.

# How to hack the crawler

If you've made a change to one of the crawler components and wish to use the updated version when running start.sh,
just issue the following command:

```sh
$ ./script/build.sh
```

This will rebuild all crawler images using your local changes. After that, just run start.sh again to get the updated
version running.

# Architecture

52 changes: 32 additions & 20 deletions api/api.go
@@ -8,7 +8,7 @@ import (
	"time"
)

-//go:generate mockgen -destination=../api_mock/api_mock.go -package=api_mock . Client
+//go:generate mockgen -destination=../api_mock/api_mock.go -package=api_mock . API

const (
	// PaginationPageHeader is the header to determinate current page in paginated endpoint
@@ -31,6 +31,7 @@ type ResourceDto struct {
	Title       string            `json:"title"`
	Meta        map[string]string `json:"meta"`
	Description string            `json:"description"`
+	Headers     map[string]string `json:"headers"`
}

// CredentialsDto represent the credential when logging in the API
@@ -39,10 +40,22 @@ type CredentialsDto struct {
	Password string `json:"password"`
}

-// Client is the interface to interact with the API component
-type Client interface {
-	SearchResources(url, keyword string, startDate, endDate time.Time,
-		paginationPage, paginationSize int) ([]ResourceDto, int64, error)
+// ResSearchParams is the search params used
+type ResSearchParams struct {
+	URL        string
+	Keyword    string
+	StartDate  time.Time
+	EndDate    time.Time
+	WithBody   bool
+	PageSize   int
+	PageNumber int
+	// TODO allow searching by meta
+	// TODO allow searching by headers
+}
+
+// API is the interface to interact with the API component
+type API interface {
+	SearchResources(params *ResSearchParams) ([]ResourceDto, int64, error)
	AddResource(res ResourceDto) (ResourceDto, error)
	ScheduleURL(url string) error
}
@@ -52,34 +65,33 @@ type client struct {
	baseURL string
}

-func (c *client) SearchResources(url, keyword string,
-	startDate, endDate time.Time, paginationPage, paginationSize int) ([]ResourceDto, int64, error) {
+func (c *client) SearchResources(params *ResSearchParams) ([]ResourceDto, int64, error) {
	targetEndpoint := fmt.Sprintf("%s/v1/resources?", c.baseURL)

	req := c.httpClient.R()

-	if url != "" {
-		b64URL := base64.URLEncoding.EncodeToString([]byte(url))
+	if params.URL != "" {
+		b64URL := base64.URLEncoding.EncodeToString([]byte(params.URL))
		req.SetQueryParam("url", b64URL)
	}

-	if keyword != "" {
-		req.SetQueryParam("keyword", keyword)
+	if params.Keyword != "" {
+		req.SetQueryParam("keyword", params.Keyword)
	}

-	if !startDate.IsZero() {
-		req.SetQueryParam("start-date", startDate.Format(time.RFC3339))
+	if !params.StartDate.IsZero() {
+		req.SetQueryParam("start-date", params.StartDate.Format(time.RFC3339))
	}

-	if !endDate.IsZero() {
-		req.SetQueryParam("end-date", endDate.Format(time.RFC3339))
+	if !params.EndDate.IsZero() {
+		req.SetQueryParam("end-date", params.EndDate.Format(time.RFC3339))
	}

-	if paginationPage != 0 {
-		req.Header.Set(PaginationPageHeader, strconv.Itoa(paginationPage))
+	if params.PageNumber != 0 {
+		req.Header.Set(PaginationPageHeader, strconv.Itoa(params.PageNumber))
	}
-	if paginationSize != 0 {
-		req.Header.Set(PaginationSizeHeader, strconv.Itoa(paginationSize))
+	if params.PageSize != 0 {
+		req.Header.Set(PaginationSizeHeader, strconv.Itoa(params.PageSize))
	}

	var resources []ResourceDto
@@ -123,7 +135,7 @@ func (c *client) ScheduleURL(url string) error {
}

// NewClient create a new API client using given details
-func NewClient(baseURL, token string) Client {
+func NewClient(baseURL, token string) API {
	httpClient := resty.New()
	httpClient.SetAuthScheme("Bearer")
	httpClient.SetAuthToken(token)
24 changes: 24 additions & 0 deletions build/docker/Dockerfile.tdsh-archiver
@@ -0,0 +1,24 @@
# build image
FROM golang:1.15.0-alpine as builder

RUN apk update && apk upgrade && \
    apk add --no-cache bash git openssh

WORKDIR /app

# Copy and download dependencies to cache them and speed up build time
COPY go.mod go.sum ./
RUN go mod download

COPY . .

# Test then build app
RUN go build -v github.com/creekorful/trandoshan/cmd/tdsh-archiver

# runtime image
FROM alpine:latest
COPY --from=builder /app/tdsh-archiver /app/

WORKDIR /app/

ENTRYPOINT ["./tdsh-archiver"]
13 changes: 13 additions & 0 deletions cmd/tdsh-archiver/tdsh-archiver.go
@@ -0,0 +1,13 @@
package main

import (
	"github.com/creekorful/trandoshan/internal/archiver"
	"os"
)

func main() {
	app := archiver.GetApp()
	if err := app.Run(os.Args); err != nil {
		os.Exit(1)
	}
}
42 changes: 32 additions & 10 deletions deployments/docker/docker-compose.yml
@@ -1,15 +1,19 @@
version: '3'

services:
-  nats:
-    image: nats:2.1.9-alpine3.12
+  rabbitmq:
+    image: rabbitmq:3.8.9-management-alpine
+    ports:
+      - 15003:15672
+    volumes:
+      - rabbitdata:/var/lib/rabbitmq
  torproxy:
    image: dperson/torproxy:latest
  elasticsearch:
    image: elasticsearch:7.10.1
    environment:
      - discovery.type=single-node
-      - ES_JAVA_OPTS=-Xms2g -Xmx2g
+      - ES_JAVA_OPTS=-Xms2g -Xmx4g
    volumes:
      - esdata:/usr/share/elasticsearch/data
  kibana:
@@ -22,44 +26,58 @@ services:
    image: creekorful/tdsh-crawler:latest
    command: >
      --log-level debug
-      --nats-uri nats
+      --hub-uri amqp://guest:guest@rabbitmq:5672
      --tor-uri torproxy:9050
    restart: always
    depends_on:
-      - nats
+      - rabbitmq
      - torproxy
  scheduler:
    image: creekorful/tdsh-scheduler:latest
    command: >
      --log-level debug
-      --nats-uri nats
+      --hub-uri amqp://guest:guest@rabbitmq:5672
      --api-uri http://api:8080
+      --api-token eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJ1c2VybmFtZSI6InNjaGVkdWxlciIsInJpZ2h0cyI6eyJHRVQiOlsiL3YxL3Jlc291cmNlcyJdfX0.dBR6KLQp2h2srY-By3zikEznhQplLCtDrvOkcXP6USY
      --forbidden-extensions png
      --forbidden-extensions gif
      --forbidden-extensions jpg
      --forbidden-extensions jpeg
      --forbidden-extensions bmp
      --forbidden-extensions css
      --forbidden-extensions js
      --forbidden-hostnames facebookcorewwwi.onion
    restart: always
    depends_on:
-      - nats
+      - rabbitmq
      - api
  extractor:
    image: creekorful/tdsh-extractor:latest
    command: >
      --log-level debug
-      --nats-uri nats
+      --hub-uri amqp://guest:guest@rabbitmq:5672
      --api-uri http://api:8080
+      --api-token eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJ1c2VybmFtZSI6ImV4dHJhY3RvciIsInJpZ2h0cyI6eyJQT1NUIjpbIi92MS9yZXNvdXJjZXMiXX19.mytGd_9zyK8y_T3fsWAmH8FnaBNr6qWefwCPDOx4in0
    restart: always
    depends_on:
-      - nats
+      - rabbitmq
      - api
+  archiver:
+    image: creekorful/tdsh-archiver:latest
+    command: >
+      --log-level debug
+      --hub-uri amqp://guest:guest@rabbitmq:5672
+      --storage-dir /archive
+    restart: always
+    volumes:
+      - archiverdata:/archive
+    depends_on:
+      - rabbitmq
  api:
    image: creekorful/tdsh-api:latest
    command: >
      --log-level debug
-      --nats-uri nats
+      --hub-uri amqp://guest:guest@rabbitmq:5672
      --elasticsearch-uri http://elasticsearch:9200
      --signing-key K==M5RsU_DQa4_XSbkX?L27s^xWmde25
    restart: always
@@ -71,3 +89,7 @@ services:
volumes:
  esdata:
    driver: local
+  rabbitdata:
+    driver: local
+  archiverdata:
+    driver: local
6 changes: 2 additions & 4 deletions go.mod
@@ -9,16 +9,14 @@ require (
	github.com/dgrijalva/jwt-go v3.2.0+incompatible
	github.com/go-resty/resty/v2 v2.3.0
	github.com/golang/mock v1.4.4
-	github.com/golang/protobuf v1.4.2 // indirect
	github.com/labstack/echo/v4 v4.1.16
-	github.com/nats-io/nats-server/v2 v2.1.8 // indirect
-	github.com/nats-io/nats.go v1.10.0
	github.com/olekukonko/tablewriter v0.0.4
	github.com/olivere/elastic/v7 v7.0.20
	github.com/rs/zerolog v1.20.0
+	github.com/streadway/amqp v1.0.0
	github.com/urfave/cli/v2 v2.2.0
-	github.com/valyala/fasthttp v1.9.0
	github.com/xhit/go-str2duration/v2 v2.0.0
-	golang.org/x/crypto v0.0.0-20200323165209-0ec3e9974c59
+	golang.org/x/crypto v0.0.0-20200323165209-0ec3e9974c59 // indirect
	mvdan.cc/xurls/v2 v2.1.0
)
