add hive merge pipeline #33

Merged: 10 commits merged into Cargill:master on Jun 6, 2018

Conversation

dionsongus

I'm still working on some improvements but want to get some initial feedback.
I tested the pipeline by running run-tests.sh. It took about 8 minutes on my laptop, and it works as is.
The test involves:

  • clean,
  • a first run that performs the initial sqoop as the base table,
  • an update that conducts the incremental sqoop, merges the table, and produces a new table called report.

For now, the incremental sqoop contains zero rows.

@afoerster left a comment:

This looks good. The biggest things are:

  • adding back in the commented-out tests,
  • removing the use of sudo, which shouldn't be necessary,
  • using a primary key field in the merge script instead of

I think this will be an extremely useful set of templates!

$SCRIPT_DIR/sqoop-parquet-hdfs-kudu-impala/run.sh
#$SCRIPT_DIR/sqoop-parquet-hdfs-impala/run.sh
#$SCRIPT_DIR/sqoop-parquet-full-load/run.sh
#$SCRIPT_DIR/sqoop-parquet-hdfs-kudu-impala/run.sh

Make sure to uncomment these. If they are failing, it could be because of a previous issue with the container running as the root user. None of the docker containers should have the --user hdfs flag anymore.

Will do.

SCRIPT_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"
cd $SCRIPT_DIR
# verify we can generate scripts without error
sudo -u hdfs hdfs dfs -rm -r /user/hive/warehouse/* || true

Is this necessary? What does this command have to do with the comment above it?

I took that line from an existing example, such as sqoop-parquet-hdfs-impala. I take it to mean cleaning the target directory before ingestion, which sounds right to me. If you are questioning the need for sudo: I tested it, and it doesn't work without sudo, and it doesn't work with sudo as root. It only works with sudo -u hdfs.

OK, this one is fine.


clean-all:
{%- for table in tables %}
$(MAKE) clean -C {{ table.id }}

This is red here usually because it has spaces; it should be changed to a tab (you can only use tabs here in a Makefile).

Seems to be tab already:
clean-all:^M$
{%- for table in tables %}^M$
^I$(MAKE) clean -C {{ table.id }}^M$
{%- endfor %}^M$


integration-test-all:
{%- for table in tables %}
$(MAKE) integration-test -C {{ table.id }}

switch to TAB

Seems to be tab already:
integration-test-all:^M$
{%- for table in tables %}^M$
^I$(MAKE) integration-test -C {{ table.id }}^M$
{%- endfor %}^M$

USE {{ conf.staging_database.name }};

--Create merged view --
DROP VIEW IF EXISTS {{ table.destination.name }}_view;

Instead of calling this just view, maybe call it reconciliation_view like the blog post? It would help readers associate it with the logic in the post.

@Nicholaev commented Apr 23, 2018:

Agree. Changed to _merge_view.


insert overwrite {{ table.destination.name }}_report select * from {{ table.destination.name }}_view;

COMPUTE STATS {{ table.destination.name }}_report

Not sure about the significance of the name _report. It's not bad, but since you are using so many views, maybe document what they are in the README for the pipeline?

The report name is taken from the Hive Merge article; they called it reporting, to be exact. Note this is a table, not a view. There's only one view in this Hive Merge method.


# Remove parquet data from incr
set -eu
sudo -u hdfs hdfs dfs -rm -r -f {{ conf.staging_database.path }}/{{ table.destination.name }}/incr/*

Do you need to run this command as the hdfs user? Most people won't have the ability to run sudo commands in their pipelines. It's dangerous if you aren't in a docker container.

@Nicholaev commented Apr 23, 2018:

I agree, but I had to. Without sudo, it can get into a situation where one cannot write, due to:
WARNINGS: Impala does not have READ_WRITE access to path 'hdfs://0.0.0.0:8020/user/hive/warehouse/titanic'
Even after adding write access via chmod in the container shell:
merge/titanic# hadoop fs -ls /user/hive/warehouse/titanic
Found 2 items
drwxrwxrwx - root supergroup 0 2018-04-10 19:20 /user/hive/warehouse/titanic/base
drwxrwxrwx - root supergroup 0 2018-04-10 19:20 /user/hive/warehouse/titanic/incr

I feel I'm missing something basic here with respect to access control in Docker.


We need to get to the root of this problem. The other calls to sudo are part of the integration test, and the user has sudo during the integration tests in Docker.

This command, though, will be run by engineers who don't have sudo.

I think this might be related to #36

Try running your test before $SCRIPT_DIR/sqoop-parquet-hdfs-kudu-impala/run.sh

Can you try again without this sudo?


# Remove parquet data from hdfs
set -eu
sudo -u hdfs hdfs dfs -rm -r -f {{ conf.staging_database.path }}/{{ table.destination.name }}_report/

Remove sudo

Cannot; see above.

Can you try removing this again? It shouldn't be needed, and like I said above, users who run this pipeline won't have sudo access.

@@ -0,0 +1,5 @@
../shared/.gitignore

Good reuse!


# Move parquet data from /incr to /base
set -eu
#hdfs dfs -mkdir {{ conf.staging_database.path }}/{{ table.destination.name }}/base || true

Remove comment

Done

@afoerster left a comment:

I think you need to change the order of the tests to get rid of the sudo when clearing the incr dir.

@afoerster

@dionsongus there are a couple of different errors. The error in Python 2.7 is caused by a dependency and is happening elsewhere; not sure what's going on there yet.
The error in Python 3.6 is a SQL syntax error, see here: https://travis-ci.org/Cargill/pipewrench/jobs/382345665#L3188
Good news is it looks like you are past the permission issue.

@afoerster

@dionsongus rebase on master; the Python 2.7 error should be fixed there.

- Id
num_partitions: 2 # Number of Kudu partitions to create
check_column: Id # Incrementing timestamp or numeric column used when incrementally pulling data (sqoop --check-column)
primary_keys: Id # List of primary keys

I don't think this is the right way to fix this error. The problem was that you weren't handling the list. The field is called 'primary_keys', and all the other tables.yml files have a list here, so I don't think we should break that assumption.
Instead, you can use the Jinja join filter, like this: {{ primary_keys|join(', ') }}
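
For illustration, a minimal sketch of how the join filter renders a list of keys inside a SQL template (the key names Id and LastName and the check column last_modified are illustrative, not taken from this PR):

-- Template fragment (sketch):
SELECT {{ table.primary_keys|join(', ') }}, max({{ table.check_column }}) max_modified
-- Rendered output for primary_keys: [Id, LastName] and check_column: last_modified:
SELECT Id, LastName, max(last_modified) max_modified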


# Remove parquet data from hdfs
set -eu
sudo -u hdfs hdfs dfs -rm -r -f {{ conf.staging_database.path }}/{{ table.destination.name }}/

Can you try again without this sudo?


# Remove parquet data from incr
set -eu
sudo -u hdfs hdfs dfs -rm -r -f {{ conf.staging_database.path }}/{{ table.destination.name }}/incr/*

Can you try again without this sudo?

FROM (SELECT * FROM {{ table.destination.name }}_base
UNION ALL
SELECT * FROM {{ table.destination.name }}_incr) t1
JOIN (SELECT {{ table.primary_keys }}, max({{ table.check_column }}) max_modified

Let me know if you need help modifying this script to work with multiple primary keys.
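
For context, a sketch of the reconciliation pattern this template generates, following the Hive/Impala incremental-merge approach; the table name titanic and the columns Id and last_modified are illustrative, not the actual template values:

-- Reconciliation view (sketch): pick the freshest row per key from base + incr
CREATE VIEW titanic_merge_view AS
SELECT t1.*
FROM (SELECT * FROM titanic_base
      UNION ALL
      SELECT * FROM titanic_incr) t1
JOIN (SELECT Id, max(last_modified) max_modified
      FROM (SELECT * FROM titanic_base
            UNION ALL
            SELECT * FROM titanic_incr) t2
      GROUP BY Id) s
ON t1.Id = s.Id AND t1.last_modified = s.max_modified;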

@dionsongus (author):

All your comments make sense. I'm having some hardware issues these days. Let me see if I can work on this in a few days.

@afoerster left a comment:

Great contribution, thanks @dionsongus!!

SELECT * FROM {{ table.destination.name }}_incr) t2
GROUP BY {{ table.primary_keys|join(', ') }}) s
ON
{% for pk in table.primary_keys %}

Nice, this looks good.
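
As an aside, a sketch of how a loop like this can render the compound join condition when there are multiple keys (the loop body below is hypothetical, not the exact template in this PR):

-- Template (sketch):
ON
{% for pk in table.primary_keys %}
  t1.{{ pk }} = s.{{ pk }}{% if not loop.last %} AND{% endif %}
{% endfor %}
-- Roughly rendered for primary_keys: [Id, LastName]:
ON
  t1.Id = s.Id AND
  t1.LastName = s.LastName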

set sync_ddl=1;
USE {{ conf.staging_database.name }};

-- cannot get create table as select to work. --

Since you have a good workaround, maybe there's no need to keep the comment, but let's leave it for historical purposes for now.

@afoerster merged commit 7c704f0 into Cargill:master on Jun 6, 2018
@dionsongus

Thank you for the guidance along the way. Next, I plan to make the integration test more meaningful by adding increments to the source table. Right now it's a zero-increment merge.

afoerster commented Jun 11, 2018 via email
