Example: Setting up on AWS
This is an example of setting up the platform on AWS, using managed AWS services in place of in-cluster Kubernetes equivalents.
- Open IAM
- Create user
- Create programmatic access keys
- Store them under the default profile in `~/.aws/credentials`

```
aws configure
```
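After `aws configure`, the credentials file should end up looking roughly like the sketch below; the profile layout is standard, and the key values here are placeholders.

```
# ~/.aws/credentials (placeholder values)
[default]
aws_access_key_id = AKIAXXXXXXXXXXXXXXXX
aws_secret_access_key = xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
```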
- Install eksctl.
- Create the cluster (this step could take a while, ~10+ minutes)

```
eksctl create cluster --name <INSERT-NAME-HERE> \
  --version 1.13 \
  --nodegroup-name standard-workers \
  --node-type t2.large \
  --nodes 5 \
  --nodes-min 1 \
  --nodes-max 8 \
  --node-ami auto
```
If you get a `SignatureDoesNotMatch` error, verify that your AWS credentials are correct. You can delete and regenerate them and run `aws configure` again.

- Set up for Helm usage
```
# tiller.yaml
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: tiller
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  name: tiller
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin
subjects:
  - kind: ServiceAccount
    name: tiller
    namespace: kube-system
```
Run the commands below:

```
kubectl apply -f tiller.yaml
helm init --service-account tiller
helm repo add scdp https://smartcitiesdata.github.io/charts
```
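Before installing charts, it can help to confirm Tiller actually came up; a minimal check, assuming the default `tiller-deploy` deployment name that `helm init` creates:

```
kubectl -n kube-system rollout status deployment/tiller-deploy
helm version # should report both Client and Server versions
```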
The setup above will allow you to access the EKS cluster, but some steps need to be taken to enable others to access it too.
- Create a role (IAM > Roles > Create role)
  - Trusted entity = Another AWS account
  - Account ID: $YOUR_ACCOUNT_ID
  - Name and description
  - Capture the created Role ARN for later ($NEW_ROLE_ARN)
- Create an `iamidentitymapping` in your EKS cluster.

```
eksctl create iamidentitymapping \
  --name $CLUSTER_NAME \
  --role $NEW_ROLE_ARN \
  --group system:masters \
  --username admin
```
- Anyone who wants to use the role to access your cluster must edit their `~/.aws/credentials`.

```
[aws-user-profile]
aws_access_key_id = $USER_KEY
aws_secret_access_key = $USER_SECRET

[eks-admin]
role_arn = $NEW_ROLE_ARN
source_profile = aws-user-profile
```
- Users will need to have the correct `KUBECONFIG` setup.

```
eksctl utils write-kubeconfig --name $CLUSTER_NAME
```
- Users must use the new AWS profile to access the system.

```
export AWS_PROFILE=eks-admin
kubectl get pod # works
```
Redis is used to store dataset metadata and microservice view states.
- Go to ElastiCache
- Select Redis, then Create
- version 5.0.4
- default parameter group
- port 6379
- 0 replicas
- Create new subnet group
- Select EKS VPC
- Select EKS private subnets
- Edit the security group:
  - Select the EKS standardworker and sharednetwork security groups
  - Unselect the default one
- Set backup policy per your wishes
- Wait until the cluster is available (this could take ~10 minutes). Take note of the Redis primary endpoint and add it to the redis.yaml in the next step. Note: do not add the port.
- Create `redis.yaml`

```
# redis.yaml
kind: "Service"
apiVersion: "v1"
metadata:
  name: "redis"
spec:
  type: ExternalName
  externalName: $REDIS_URI_NO_PORT
```
- Run the command to apply it to your EKS cluster

```
kubectl apply -f redis.yaml
```
- Test

```
kubectl run --image=redis:5.0.4 redis-client
pod_name=$(kubectl get pod -l run=redis-client -o name | cut -d'/' -f2)
kubectl exec $pod_name -- redis-cli -h redis set foo bar # OK
kubectl exec $pod_name -- redis-cli -h redis get foo # bar
kubectl exec $pod_name -- redis-cli -h redis del foo # 1
kubectl delete deployment redis-client
```
LDAP is currently used to authenticate/authorize users and orchestrate organizations. It will be replaced with OAuth in the near future. There are also plans to support SAML.
- Install the LDAP chart

```
helm install --name ldap stable/openldap \
  --set adminPassword=ThisIs4n4dminP4ssword \
  --set env.LDAP_BASE_DN="dc=example,dc=org"
```
- Port forward to LDAP (this is a continuously running process)

```
kubectl port-forward svc/ldap-openldap 38900:389
```
- Using ApacheDirectoryStudio
  - Create a new connection
    - localhost:38900
    - Use the following DN: cn=admin,dc=example,dc=org
    - With password: ThisIs4n4dminP4ssword
  - Add a new entry under dc=example,dc=org
    - RDN: ou, Value: orgs
- Add LDAP secrets

```
# ldap.yaml
apiVersion: v1
kind: Secret
metadata:
  name: ldap
type: Opaque
data:
  host: bGRhcC1vcGVubGRhcA== # ldap-openldap
  base_dn: ZGM9ZXhhbXBsZSxkYz1vcmc= # dc=example,dc=org
  user: Y249YWRtaW4= # cn=admin
  password: VGhpc0lzNG40ZG1pblA0c3N3b3Jk # ThisIs4n4dminP4ssword
  environment_ou: b3Jncw== # orgs
```
- Run

```
kubectl apply -f ldap.yaml
```
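The values under `data:` in the Secret are base64-encoded copies of the plaintext shown in the comments. A quick sketch for producing or double-checking them (note the `-n` to avoid encoding a trailing newline):

```
echo -n 'ldap-openldap' | base64                      # bGRhcC1vcGVubGRhcA==
echo 'VGhpc0lzNG40ZG1pblA0c3N3b3Jk' | base64 --decode # ThisIs4n4dminP4ssword
```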
Kafka is a distributed log used to pass messages and events between decoupled microservices. Strimzi is an operator that makes deploying/running Kafka in Kubernetes easier.
- Deploy strimzi

```
helm repo add strimzi https://strimzi.io/charts
helm upgrade --install strimzi-kafka-operator strimzi/strimzi-kafka-operator --version 0.08.0
```
- Create the Kafka cluster

```
# kafka.yaml
apiVersion: kafka.strimzi.io/v1alpha1
kind: Kafka
metadata:
  name: streaming-service
spec:
  kafka:
    replicas: 1
    listeners:
      plain: {}
      tls: {}
    config:
      offsets.topic.replication.factor: 1
      transaction.state.log.replication.factor: 3
      transaction.state.log.min.isr: 2
    storage:
      type: ephemeral
  zookeeper:
    replicas: 3
    storage:
      type: ephemeral
  entityOperator:
    topicOperator: {}
    userOperator: {}
```
- Apply

```
kubectl apply -f kafka.yaml
```
- Create topics
- Example 1

```
# topics.yaml
---
apiVersion: kafka.strimzi.io/v1alpha1
kind: KafkaTopic
metadata:
  name: streaming-dead-letters
  labels:
    strimzi.io/cluster: streaming-service
spec:
  partitions: 1
  replicas: 1
  config: {}
```

- Example 2

```
---
apiVersion: kafka.strimzi.io/v1alpha1
kind: KafkaTopic
metadata:
  name: event-stream
  labels:
    strimzi.io/cluster: streaming-service
spec:
  partitions: 1
  replicas: 1
  config: {}
```

- Apply

```
kubectl apply -f topics.yaml
```
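The Strimzi topic operator should pick these up shortly after they are applied; one way to confirm, assuming the KafkaTopic CRD was installed along with the operator chart:

```
kubectl get kafkatopics # should list streaming-dead-letters and event-stream
```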
Andi is the data curation/definition API. Organizations and datasets are created through it.
- Deploy andi

```
helm upgrade --install andi scdp/andi \
  --set image.tag=$ANDI_VERSION \
  --set redis.host=redis \
  --set strimzi.kafka.brokers=streaming-service-kafka-bootstrap:9092 \
  --set service.type=ClusterIP
```
- Test

Verify that the terminal running the port forward logs a handled connection.

```
kubectl port-forward svc/andi 8080:80
curl -sfL http://localhost:8080/api/v1/datasets # []
```
The discovery-api is the programmatic entry point to data access. It is a REST API for searching datasets, viewing their details, and querying their historical (batch) data.
- Deploy the guardian key

```
# guardian.yaml
apiVersion: v1
kind: Secret
metadata:
  name: guardian
type: Opaque
data:
  secret_key: ce1RV/hBmrFVh11geYV4wPWvZ9vOF1en/tZTD/Qwlr27UmJka+Qi+jbFAZLu7FAu
```

```
kubectl apply -f guardian.yaml
```
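The example `secret_key` is fine for a demo, but you may prefer to generate your own. A sketch, assuming Guardian only needs a sufficiently long random string (the second encode is there because Secret `data:` values must themselves be base64):

```
SECRET=$(openssl rand -base64 48) # random key material (assumption: any long random string works)
echo -n "$SECRET" | base64        # value to paste into guardian.yaml under data.secret_key
```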
- Deploy discovery-api

```
helm upgrade --install discovery-api scdp/discovery-api \
  --set image.tag=$DISCOVERY_API_VERSION \
  --set redis.host=redis \
  --set presto.url=http://kdp-kubernetes-data-platform-presto:8080 \
  --set ldap.host=ldap-openldap,ldap.base="dc=example,dc=org" \
  --set s3.hostedFileBucket=hosted-bucket,s3.hostedFileRegion=us-east-2 \
  --set vault.endpoint="" \
  --set environment=demo \
  --set service.type=LoadBalancer
```
- Test

Verify that the terminal running the port forward logs a handled connection.

```
kubectl port-forward svc/discovery-api 8080:80
curl -sfL localhost:8080/api/v1/dataset/search?limit=10 # Returns JSON with 200
```
The discovery-ui is the frontend for dataset searches and details.
- Deploy

```
helm upgrade --install discovery-ui scdp/discovery-ui \
  --set image.environment=demo,image.tag=$DISCOVERY_UI_VERSION \
  --set env.api_host="$DISCOVERY_API_URL" \
  --set env.base_url="$DISCOVERY_UI_URL" \
  --set service.type=LoadBalancer
```
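Both discovery-api and discovery-ui are exposed as LoadBalancer services, so the `$DISCOVERY_API_URL` and `$DISCOVERY_UI_URL` values can be looked up once AWS provisions the load balancers. A sketch, assuming the charts name their services after the releases and AWS reports hostnames rather than IPs:

```
kubectl get svc discovery-api -o jsonpath='{.status.loadBalancer.ingress[0].hostname}'
kubectl get svc discovery-ui -o jsonpath='{.status.loadBalancer.ingress[0].hostname}'
```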
KDP is the platform's PrestoDB setup.
- Create an S3 bucket
  - $EKS_WORKER_ROLE_ARN -> https://console.aws.amazon.com/iam/home?#/roles -> nodegroup-standar-NodeInstanceRole
  - $BUCKET_ARN -> https://s3.console.aws.amazon.com/s3/home?region=us-east-2# -> check the bucket's box -> copy the bucket ARN
  - Select S3 Bucket Policy
  - Encrypt with AES-256
  - Make it private
  - Access policy statement 1
    - Principal: $EKS_WORKER_ROLE_ARN
    - Actions: DeleteObject, DeleteObjectVersion, GetObject, PutObject
    - Resource: $BUCKET_ARN/*
  - Access policy statement 2
    - Principal: $EKS_WORKER_ROLE_ARN
    - Actions: ListBucket
    - Resource: $BUCKET_ARN
  - Generate the policies and add them to the S3 bucket's permissions (see the policy sketch after this list)
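A sketch of what the generated bucket policy should roughly look like once both statements are added (substitute the real ARNs; the policy generator's output may differ in ordering and `Sid` fields):

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "AWS": "$EKS_WORKER_ROLE_ARN" },
      "Action": ["s3:DeleteObject", "s3:DeleteObjectVersion", "s3:GetObject", "s3:PutObject"],
      "Resource": "$BUCKET_ARN/*"
    },
    {
      "Effect": "Allow",
      "Principal": { "AWS": "$EKS_WORKER_ROLE_ARN" },
      "Action": "s3:ListBucket",
      "Resource": "$BUCKET_ARN"
    }
  ]
}
```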
- Create an RDS Postgres instance for the Presto/Hive metastore (a connectivity check is sketched after this list)
  - Postgres 10.6-R1 (dev/test template)
  - Database identifier (your choice)
  - Username: postgres
  - Password: (your choice)
  - Burstable db.t3.medium (100GB General Purpose SSD)
  - No autoscaling
  - EKS VPC w/ a new private subnet group
  - EKS shared node security group
  - Initial database name "metastore"
  - Backups/encryption/maintenance as preferred
  - Default for everything else
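Once the instance is available, a quick connectivity check from inside the cluster can save debugging later. A sketch, assuming `$METASTORE_ADDRESS` is the RDS endpoint (without port) and using the password you chose:

```
kubectl run --rm -it --restart=Never --image=postgres:10.6 psql-test -- \
  psql -h $METASTORE_ADDRESS -U postgres -d metastore # prompts for the password
```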
- Configure

```
# kdp.yaml
global:
  environment: demo
  objectStore:
    bucketName: $CREATED_BUCKET_NAME # << Update this
    accessKey: null
    accessSecret: null
presto:
  workers: 2
  ingress:
    enabled: false
  jvm:
    maxHeapSize: 1536M
  deploy:
    container:
      resources:
        limits:
          memory: 2Gi
          cpu: 2
        requests:
          memory: 2Gi
          cpu: 1
hive:
  enabled: false
minio:
  enabled: false
postgres:
  enabled: false
  service:
    externalAddress: $METASTORE_ADDRESS
  db:
    name: metastore
    user: postgres
    password: ""
metastore:
  allowDropTable: true
  timeout: 360m
```
- Deploy

```
helm upgrade --install kdp scdp/kubernetes-data-platform \
  --values kdp.yaml \
  --set postgres.db.password=$DB_PASSWORD
```
- Test

```
pod_name=$(kubectl get po -l role=coordinator -o name | cut -d/ -f2)
kubectl exec -it $pod_name -- presto --catalog hive --schema default
```
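Instead of an interactive session, a single statement can be run directly to confirm Presto can reach the metastore (the schema should simply be empty at this point):

```
kubectl exec -it $pod_name -- presto --catalog hive --schema default --execute 'show tables'
```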
Forklift is a service for loading data via PrestoDB into object storage. It also runs table compaction and will replace carpenter
as the table creation/alteration/deletion service.
- Deploy

```
helm upgrade --install forklift scdp/forklift \
  --set image.tag=$FORKLIFT_VERSION \
  --set kafka.brokers=streaming-service-kafka-bootstrap:9092 \
  --set kdp.url="http://kdp-kubernetes-data-platform-presto:8080" \
  --set redis.host=redis \
  --set prometheus.scrape=false
```
- Test

```
pod_name=$(kubectl get po -l app.kubernetes.io/name=forklift -o name | cut -d/ -f2)
kubectl exec -it $pod_name -- bin/forklift rpc 'Prestige.execute("show tables") |> Prestige.prefetch' # []
```
Reaper gathers (extracts) data to be stored in the platform.
- Deploy pods

```
helm upgrade --install reaper scdp/reaper \
  --set vault.endpoint="" \
  --set strimzi.kafka.brokers=streaming-service-kafka-bootstrap:9092 \
  --set redis.host=redis \
  --set image.tag=$REAPER_VERSION
```
- Deploy service

```
# reaper.yaml
apiVersion: v1
kind: Service
metadata:
  name: reaper
  labels:
    app: reaper
spec:
  selector:
    app.kubernetes.io/name: reaper
  ports:
    - protocol: TCP
      port: 80
      targetPort: 4001
      name: tcp-80
  type: LoadBalancer
```

```
kubectl apply -f reaper.yaml
```
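The reaper service is a LoadBalancer, so it should receive an external address once AWS provisions the load balancer; a quick check:

```
kubectl get svc reaper # EXTERNAL-IP should show the ELB hostname once provisioned
```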
Valkyrie is a microservice to standardize/normalize data within a dataset.
- Deploy

```
helm upgrade --install valkyrie scdp/valkyrie \
  --set kafka.brokers=streaming-service-kafka-bootstrap:9092 \
  --set redis.host=redis \
  --set image.tag=$VALKYRIE_VERSION
```
The discovery-streams
API is a websocket API for streaming public data to end user consumers.
- Deploy

```
helm upgrade --install discovery-streams scdp/discovery-streams \
  --set image.tag=$DISCOVERY_STREAMS_VERSION \
  --set kafka.brokers=streaming-service-kafka-bootstrap:9092
```
Every dataset has an owner, or organization. The organization must exist before a dataset can be ingested. POST organizations to andi's `/api/v1/organization` endpoint.

An example (test) organization is shown below. Remember to change all the values.
```
# org.json
{
  "description": "This is a test organization that should not be seen",
  "homepage": "https://github.com/smartcitiesdata",
  "id": "1b1cdc66-ad5e-45f4-baed-b874227838a6",
  "logoUrl": "https://placekitten.com/97/97",
  "orgName": "test_org",
  "orgTitle": "Test Organization"
}
```
```
kubectl port-forward svc/andi 8080:8080
curl -X POST \
  -H "Content-Type: application/json" \
  -d @org.json \
  http://localhost:8080/api/v1/organization
```
Datasets are created with a PUT to andi's `/api/v1/dataset` endpoint.

An example (test) dataset is shown below. Remember to change the values to be relevant to your dataset. Specifically, make sure the `technical.orgName` and `technical.orgId` match your organization definition; andi does not make this check yet.
This is a remote dataset, which means it will not ingest any data into the system. But it will be discoverable via the API/UI.
```
# remote.json
{
  "business": {
    "categories": null,
    "conformsToUri": null,
    "contactEmail": "[email protected]",
    "contactName": "me",
    "dataTitle": "Some test data",
    "describedByMimeType": null,
    "describedByUrl": null,
    "description": "A test dataset",
    "homepage": "https://github.com/smartcitiesdata",
    "issuedDate": "2016-08-10T20:20:58.000Z",
    "keywords": ["foo", "bar"],
    "language": null,
    "license": null,
    "modifiedDate": "2017-11-28T17:40:24.000Z",
    "orgTitle": "Test Organization",
    "parentDataset": null,
    "publishFrequency": null,
    "referenceUrls": null,
    "rights": null,
    "spatial": null,
    "temporal": null
  },
  "id": "c2ea7056-198e-4f78-b953-8176747f8463",
  "technical": {
    "cadence": "never",
    "dataName": "test_remote",
    "headers": {},
    "orgId": "1b1cdc66-ad5e-45f4-baed-b874227838a6",
    "orgName": "test_org",
    "partitioner": {
      "query": null,
      "type": null
    },
    "private": false,
    "sourceQueryParams": {},
    "schema": [],
    "sourceFormat": "csv",
    "sourceType": "remote",
    "sourceUrl": "https://smartcitiesdata.github.io/charts/index.yaml",
    "systemName": "test_org__test_remote",
    "transformations": [],
    "validations": []
  }
}
```
```
kubectl port-forward svc/andi 8080:8080
curl -X PUT \
  -H "Content-Type: application/json" \
  -d @remote.json \
  http://localhost:8080/api/v1/dataset
```
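With the andi port forward still running, the registration can be confirmed against the dataset list endpoint used earlier:

```
curl -sfL http://localhost:8080/api/v1/datasets # response should now include the test_remote dataset
```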