add hive merge pipeline #33
Conversation
This looks good, the biggest things are:
- adding back in commented-out tests
- removing use of sudo, it shouldn't be necessary
- using a primary key field in the merge script instead of

I think this will be an extremely useful set of templates!
integration-tests/run-tests.sh
Outdated
$SCRIPT_DIR/sqoop-parquet-hdfs-kudu-impala/run.sh
#$SCRIPT_DIR/sqoop-parquet-hdfs-impala/run.sh
#$SCRIPT_DIR/sqoop-parquet-full-load/run.sh
#$SCRIPT_DIR/sqoop-parquet-hdfs-kudu-impala/run.sh
Make sure to uncomment these. If they are failing, it could be because of a previous issue with the container running as the root user. None of the Docker containers should have the flag --user hdfs anymore.
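For reference, a sketch of where the flag being removed would appear; the image and container names below are purely illustrative, not taken from this repo:

# before: forces every process in the container to run as the hdfs user
docker run --user hdfs --name cluster-gateway some/hadoop-image
# after: drop --user and let the container's default user apply
docker run --name cluster-gateway some/hadoop-image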
Will do.
SCRIPT_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"
cd $SCRIPT_DIR
# verify we can generate scripts without error
sudo -u hdfs hdfs dfs -rm -r /user/hive/warehouse/* || true
Is this necessary? What does this command have to do with the comment above it?
I took that line from an existing example, such as sqoop-parquet-hdfs-impala. I take it to mean cleaning the target directory before ingestion, which sounds right to me. If you are questioning the need for sudo: I tested it, and it doesn't work without sudo and doesn't work with sudo as root. It only works with sudo -u hdfs.
Ok this one is fine
clean-all:
{%- for table in tables %}
$(MAKE) clean -C {{ table.id }}
This is red here usually because it has spaces; it should be changed to a tab (you can only use tabs here in a Makefile).
Seems to be a tab already:
clean-all:^M$
{%- for table in tables %}^M$
^I$(MAKE) clean -C {{ table.id }}^M$
{%- endfor %}^M$
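For anyone reproducing this check: output in that form comes from something like GNU cat -A, which prints tabs as ^I, carriage returns as ^M, and line ends as $.

cat -A Makefile
# ^I before $(MAKE) confirms the recipe line starts with a tab;
# the ^M$ endings show the file has CRLF (Windows) line endings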
integration-test-all:
{%- for table in tables %}
$(MAKE) integration-test -C {{ table.id }}
switch to TAB
Seems to be a tab already:
integration-test-all:^M$
{%- for table in tables %}^M$
^I$(MAKE) integration-test -C {{ table.id }}^M$
{%- endfor %}^M$
USE {{ conf.staging_database.name }};

--Create merged view --
DROP VIEW IF EXISTS {{ table.destination.name }}_view;
Instead of calling this just view, maybe call it reconciliation_view like the blog post? That would help readers associate it with the logic in the post.
Agree. Changed to _merge_view.
insert overwrite {{ table.destination.name }}_report select * from {{ table.destination.name }}_view;

COMPUTE STATS {{ table.destination.name }}_report
Not sure of the significance of the name _report. It's not bad, but since you are using so many views, maybe document what they are in the README for the pipeline?
The report name is taken from the Hive Merge article; they called it reporting, to be exact. Note this is a table, not a view. There's only one view in this Hive Merge method.
# Remove parquet data from incr
set -eu
sudo -u hdfs hdfs dfs -rm -r -f {{ conf.staging_database.path }}/{{ table.destination.name }}/incr/*
Do you need to run this command as the hdfs user? Most people won't have the ability to run sudo commands in their pipelines. It's dangerous if you aren't in a docker container.
I agree, but I had to. Without sudo, it can get into a situation where one cannot write, due to:
WARNINGS: Impala does not have READ_WRITE access to path 'hdfs://0.0.0.0:8020/user/hive/warehouse/titanic'
Even after adding write access via chmod in the virtual prompt:
merge/titanic# hadoop fs -ls /user/hive/warehouse/titanic
Found 2 items
drwxrwxrwx - root supergroup 0 2018-04-10 19:20 /user/hive/warehouse/titanic/base
drwxrwxrwx - root supergroup 0 2018-04-10 19:20 /user/hive/warehouse/titanic/incr
I feel I'm missing something basic here with respect to access control in Docker.
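A couple of read-only diagnostics that might narrow this down; just a sketch, and it assumes the Impala daemons run as the impala user:

hdfs dfs -ls -d /user/hive/warehouse/titanic   # owner/group of the directory itself, not only its children
hdfs groups impala                             # groups the impala user resolves to on the NameNode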
We need to get to the root of this problem. The other calls to sudo are part of the integration test, and the user has sudo during the integration tests in Docker.
This command, though, will be run by engineers who don't have sudo.
I think this might be related to #36
Try running your test before $SCRIPT_DIR/sqoop-parquet-hdfs-kudu-impala/run.sh
Can you try again without this sudo?
# Remove parquet data from hdfs
set -eu
sudo -u hdfs hdfs dfs -rm -r -f {{ conf.staging_database.path }}/{{ table.destination.name }}_report/
Remove sudo
Cannot; see above.
Can you try removing this again? It shouldn't be needed, and like I said above, users who run this pipeline won't have sudo access
@@ -0,0 +1,5 @@
../shared/.gitignore
good reuse!
# Move parquet data from /incr to /base
set -eu
#hdfs dfs -mkdir {{ conf.staging_database.path }}/{{ table.destination.name }}/base || true
Remove comment
Done
…rge_view. Uncommented other examples in run_tests.sh
I think you need to change the order of the tests to get rid of the sudo when clearing the incr dir.
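A sketch of the ordering being suggested in integration-tests/run-tests.sh; the hive-merge script path below is hypothetical since the actual directory name isn't shown here:

# run the new hive-merge test before the pipeline suspected of changing HDFS ownership (see #36)
$SCRIPT_DIR/sqoop-parquet-hdfs-hive-merge/run.sh
$SCRIPT_DIR/sqoop-parquet-hdfs-kudu-impala/run.sh
$SCRIPT_DIR/sqoop-parquet-hdfs-impala/run.sh
$SCRIPT_DIR/sqoop-parquet-full-load/run.sh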
@dionsongus there are a couple of different errors. The error in Python 2.7 is caused by a dependency and is happening elsewhere; not sure what's going on there yet.
@dionsongus rebase on master, the Python 2.7 error should be fixed there.
- Id
num_partitions: 2 # Number of Kudu partitions to create
check_column: Id # Incrementing timestamp of numeric column used when incrementally pulling data (sqoop --check-column)
primary_keys: Id # List of primary keys
I don't think this is the right way to fix this error. The problem was that you weren't handling the list. The field is called 'primary_keys', and all the other tables.yml files have a list here, so I don't think we should break that assumption.
Instead, you can use the Jinja join filter, like this: {{ primary_keys|join(', ') }}
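A minimal sketch of how the join filter renders, assuming primary_keys is the list [Id, Name]:

GROUP BY {{ table.primary_keys|join(', ') }}
-- renders as:
GROUP BY Id, Name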
# Remove parquet data from hdfs
set -eu
sudo -u hdfs hdfs dfs -rm -r -f {{ conf.staging_database.path }}/{{ table.destination.name }}/
Can you try again without this sudo?
# Remove parquet data from incr
set -eu
sudo -u hdfs hdfs dfs -rm -r -f {{ conf.staging_database.path }}/{{ table.destination.name }}/incr/*
Can you try again without this sudo?
FROM (SELECT * FROM {{ table.destination.name }}_base
UNION ALL
SELECT * FROM {{ table.destination.name }}_incr) t1
JOIN (SELECT {{ table.primary_keys }}, max({{ table.check_column }}) max_modified
let me know if you need help modifying this script to work with the multiple primary keys.
All your comments make sense. I'm having some hardware issues these days. Let me see if I can work on this in a few days.
Great contribution, thanks @dionsongus !!
SELECT * FROM {{ table.destination.name }}_incr) t2
GROUP BY {{ table.primary_keys|join(', ') }}) s
ON
{% for pk in table.primary_keys %}
Nice, this looks good.
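For readers following along, the loop presumably emits one equality per key; the loop body isn't shown in the hunk above, but a sketch of one way to write it (the t1/s aliases match the quoted query) is:

ON
{% for pk in table.primary_keys %}
t1.{{ pk }} = s.{{ pk }} {% if not loop.last %}AND{% endif %}
{% endfor %}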
set sync_ddl=1;
USE {{ conf.staging_database.name }};

-- cannot get create table as select to work. --
Since you have a good workaround, there's maybe no need to keep the comment, but let's leave it for historical purposes for now.
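A sketch of the workaround pattern that comment refers to: create the table up front and then populate it, since CREATE TABLE ... AS SELECT wasn't working here. The LIKE source table is an assumption:

CREATE TABLE IF NOT EXISTS {{ table.destination.name }}_report LIKE {{ table.destination.name }}_base STORED AS PARQUET;
INSERT OVERWRITE {{ table.destination.name }}_report SELECT * FROM {{ table.destination.name }}_view;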
Thank you for the guidance along the way. Next I plan to make the integration test more meaningful by adding increments to the source table. Right now it's a zero-increment merge.
Yes, it would be great to have better tests around this.
I'm still working on some improvements but want to get some initial feedback.
I tested the pipeline by running run-tests.sh. It took about 8 minutes on my laptop. It works as is.
The test involves:
For now, the incremental sqoop contains zero rows.
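A sketch of what that incremental pull might look like with sqoop's incremental append mode; the connection string and target path are placeholders, and Id matches the check_column in tables.yml:

sqoop import \
  --connect jdbc:mysql://source-db/example \
  --table titanic \
  --incremental append \
  --check-column Id \
  --last-value 0 \
  --as-parquetfile \
  --target-dir /user/hive/warehouse/titanic/incr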