add hive merge pipeline #33

Merged: 10 commits merged into Cargill:master on Jun 6, 2018

Conversation

dionsongus

I'm still working on some improvements but want to get some initial feedback.
I tested the pipeline by running run-tests.sh. It took about 8 minutes on my laptop, and it works as is.
The test involves:

  • clean,
  • a first run that performs the initial sqoop as the base table,
  • an update that conducts the incremental sqoop, merges the table, and produces a new table called report.

For now, the incremental sqoop contains zero rows.

@afoerster left a comment:

This looks good. The biggest things are:

  • adding back in the commented-out tests,
  • removing the use of sudo, which shouldn't be necessary,
  • using a primary key field in the merge script instead of

I think this will be an extremely useful set of templates!

$SCRIPT_DIR/sqoop-parquet-hdfs-kudu-impala/run.sh
#$SCRIPT_DIR/sqoop-parquet-hdfs-impala/run.sh
#$SCRIPT_DIR/sqoop-parquet-full-load/run.sh
#$SCRIPT_DIR/sqoop-parquet-hdfs-kudu-impala/run.sh

Make sure to uncomment these. If they are failing, it could be because of a previous issue with the container running as the root user. None of the docker containers should have the --user hdfs flag anymore.

Will do.

SCRIPT_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"
cd $SCRIPT_DIR
# verify we can generate scripts without error
sudo -u hdfs hdfs dfs -rm -r /user/hive/warehouse/* || true

Is this necessary? What does this command have to do with the comment above it?

I took that line from an existing example, such as sqoop-parquet-hdfs-impala. I take it to mean cleaning the target directory before ingestion, which sounds right to me. If you are questioning the need for sudo: I tested it, and it doesn't work without sudo, and it doesn't work with sudo as root. It only works with sudo -u hdfs.

OK, this one is fine.


clean-all:
{%- for table in tables %}
$(MAKE) clean -C {{ table.id }}

This is red here usually because it has spaces; it should be changed to a tab (you can only use tabs here in a Makefile).

Seems to be tab already:
clean-all:^M$
{%- for table in tables %}^M$
^I$(MAKE) clean -C {{ table.id }}^M$
{%- endfor %}^M$


integration-test-all:
{%- for table in tables %}
$(MAKE) integration-test -C {{ table.id }}

switch to TAB

Seems to be tab already:
integration-test-all:^M$
{%- for table in tables %}^M$
^I$(MAKE) integration-test -C {{ table.id }}^M$
{%- endfor %}^M$

USE {{ conf.staging_database.name }};

--Create merged view --
DROP VIEW IF EXISTS {{ table.destination.name }}_view;

Instead of calling this just view, maybe call it reconciliation_view like the blog post? It would help readers associate it with the logic in the post.

@Nicholaev commented Apr 23, 2018:

Agree. Changed to _merge_view.


insert overwrite {{ table.destination.name }}_report select * from {{ table.destination.name }}_view;

COMPUTE STATS {{ table.destination.name }}_report

Not sure about the significance of the name _report. It's not bad, but since you are using so many views, maybe document what they are in the README for the pipeline?

The report name is taken from the Hive Merge article; they called it reporting, to be exact. Note this is a table, not a view. There's only one view in this Hive Merge method.


# Remove parquet data from incr
set -eu
sudo -u hdfs hdfs dfs -rm -r -f {{ conf.staging_database.path }}/{{ table.destination.name }}/incr/*

Do you need to run this command as the hdfs user? Most people won't have the ability to run sudo commands in their pipelines. It's dangerous if you aren't in a docker container.

@Nicholaev commented Apr 23, 2018:

I agree, but I had to. Without sudo, it can get into a situation where one cannot write, due to:
WARNINGS: Impala does not have READ_WRITE access to path 'hdfs://0.0.0.0:8020/user/hive/warehouse/titanic'
Even after adding write access via chmod in the container shell:
merge/titanic# hadoop fs -ls /user/hive/warehouse/titanic
Found 2 items
drwxrwxrwx - root supergroup 0 2018-04-10 19:20 /user/hive/warehouse/titanic/base
drwxrwxrwx - root supergroup 0 2018-04-10 19:20 /user/hive/warehouse/titanic/incr

I feel I'm missing something basic here with respect to access control in Docker.


We need to get to the root of this problem. The other calls to sudo are part of the integration test, and the user has sudo during the integration tests in Docker.

This command, though, will be run by engineers who don't have sudo.

I think this might be related to #36

Try running your test before $SCRIPT_DIR/sqoop-parquet-hdfs-kudu-impala/run.sh

Can you try again without this sudo?


# Remove parquet data from hdfs
set -eu
sudo -u hdfs hdfs dfs -rm -r -f {{ conf.staging_database.path }}/{{ table.destination.name }}_report/

Remove sudo

Cannot; see above.

Can you try removing this again? It shouldn't be needed, and like I said above, users who run this pipeline won't have sudo access.

@@ -0,0 +1,5 @@
../shared/.gitignore

Good reuse!


# Move parquet data from /incr to /base
set -eu
#hdfs dfs -mkdir {{ conf.staging_database.path }}/{{ table.destination.name }}/base || true

Remove comment

Done

@afoerster left a comment:

I think you need to change the order of the tests to get rid of the sudo when clearing the incr dir.

@afoerster

@dionsongus there are a couple of different errors. The error in Python 2.7 is caused by a dependency and is happening elsewhere; not sure what's going on there yet.
The error in Python 3.6 is a SQL syntax error, see here: https://travis-ci.org/Cargill/pipewrench/jobs/382345665#L3188
Good news is it looks like you are past the permission issue.

@afoerster

@dionsongus rebase on master; the Python 2.7 error should be fixed there.

- Id
num_partitions: 2 # Number of Kudu partitions to create
check_column: Id # Incrementing timestamp or numeric column used when incrementally pulling data (sqoop --check-column)
primary_keys: Id # List of primary keys

I don't think this is the right way to fix this error. The problem was that you weren't handling the list. The field is called 'primary_keys', and all the other tables.yml files have a list here, so I don't think we should break that assumption.
Instead, you can use the Jinja join filter, like this: {{ primary_keys|join(', ') }}
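
For illustration, a minimal sketch of how the join filter renders a list of keys inside a SQL template (the key names Id and LastName and the check column last_modified are illustrative, not taken from this PR):

-- Template fragment (sketch):
SELECT {{ table.primary_keys|join(', ') }}, max({{ table.check_column }}) max_modified
-- Rendered output for primary_keys: [Id, LastName] and check_column: last_modified:
SELECT Id, LastName, max(last_modified) max_modified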


# Remove parquet data from hdfs
set -eu
sudo -u hdfs hdfs dfs -rm -r -f {{ conf.staging_database.path }}/{{ table.destination.name }}/

Can you try again without this sudo?


# Remove parquet data from incr
set -eu
sudo -u hdfs hdfs dfs -rm -r -f {{ conf.staging_database.path }}/{{ table.destination.name }}/incr/*

Can you try again without this sudo?

FROM (SELECT * FROM {{ table.destination.name }}_base
UNION ALL
SELECT * FROM {{ table.destination.name }}_incr) t1
JOIN (SELECT {{ table.primary_keys }}, max({{ table.check_column }}) max_modified

Let me know if you need help modifying this script to work with multiple primary keys.
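
For context, a sketch of the reconciliation pattern this template generates, following the Hive/Impala incremental-merge approach; the table name titanic and the columns Id and last_modified are illustrative, not the actual template values:

-- Reconciliation view (sketch): pick the freshest row per key from base + incr
CREATE VIEW titanic_merge_view AS
SELECT t1.*
FROM (SELECT * FROM titanic_base
      UNION ALL
      SELECT * FROM titanic_incr) t1
JOIN (SELECT Id, max(last_modified) max_modified
      FROM (SELECT * FROM titanic_base
            UNION ALL
            SELECT * FROM titanic_incr) t2
      GROUP BY Id) s
ON t1.Id = s.Id AND t1.last_modified = s.max_modified;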

@dionsongus (author):

All your comments make sense. I'm having some hardware issues these days. Let me see if I can work on this in a few days.

@afoerster left a comment:

Great contribution, thanks @dionsongus!!

SELECT * FROM {{ table.destination.name }}_incr) t2
GROUP BY {{ table.primary_keys|join(', ') }}) s
ON
{% for pk in table.primary_keys %}

Nice, this looks good.
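
As an aside, a sketch of how a loop like this can render the compound join condition when there are multiple keys (the loop body below is hypothetical, not the exact template in this PR):

-- Template (sketch):
ON
{% for pk in table.primary_keys %}
  t1.{{ pk }} = s.{{ pk }}{% if not loop.last %} AND{% endif %}
{% endfor %}
-- Roughly rendered for primary_keys: [Id, LastName]:
ON
  t1.Id = s.Id AND
  t1.LastName = s.LastName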

set sync_ddl=1;
USE {{ conf.staging_database.name }};

-- cannot get create table as select to work. --

Since you have a good workaround, maybe there's no need to keep the comment, but let's leave it for historical purposes for now.

@afoerster merged commit 7c704f0 into Cargill:master on Jun 6, 2018
@dionsongus

Thank you for the guidance along the way. Next, I plan to make the integration test more meaningful by adding increments to the source table. Right now it's a zero-increment merge.

afoerster commented Jun 11, 2018 via email
