
hdfs fail to launch after mantl-api DELETE & POST #44

Closed
iharush opened this issue Jun 2, 2016 · 5 comments

Comments

@iharush

iharush commented Jun 2, 2016

Hi,
I am running a Mantl cluster with 5 workers on AWS.
I managed to launch HDFS via mantl-api with the mesos-site.xml below, and I can see that the Marathon health checks pass.

When I call mantl-api DELETE to remove the HDFS cluster, it seems that everything is removed and deleted.

The issue is that when I call mantl-api POST again to relaunch HDFS, the Marathon task gets stuck in the "deploying" state and HDFS does not work properly.

Thanks,
Itay

Here is my mesos-site.xml:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at
    http://www.apache.org/licenses/LICENSE-2.0
  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>mesos.hdfs.data.dir</name>
    <description>The primary data directory in HDFS</description>
    <value>/var/lib/hdfs/data</value>
  </property>

  <!-- Uncomment this to enable secondary data dir
  <property>
    <name>mesos.hdfs.secondary.data.dir</name>
    <description>The secondary data directory in HDFS</description>
    <value>/var/lib/hdfs/data2</value>
  </property>
  -->

  <property>
    <name>mesos.hdfs.domain.socket.dir</name>
    <description>The location used for a local socket used by the data nodes</description>
    <value>/var/run/hadoop-hdfs</value>
  </property>

  <!-- Uncomment this to enable backup
  <property>
    <name>mesos.hdfs.backup.dir</name>
    <description>Backup dir for HDFS</description>
    <value>/tmp/nfs</value>
  </property>
  -->

  <property>
    <name>mesos.hdfs.native-hadoop-binaries</name>
    <description>Mark true if you have hadoop pre-installed on your host machines (otherwise it will be distributed by the scheduler)</description>
    <value>false</value>
  </property>

  <property>
    <name>mesos.hdfs.framework.mnt.path</name>
    <description>Mount location (if mesos.hdfs.native-hadoop-binaries is marked false)</description>
    <value>/opt/mesosphere</value>
  </property>

  <property>
    <name>mesos.hdfs.state.zk</name>
    <description>Comma-separated hostname-port pairs of zookeeper node locations for HDFS framework state information</description>
    <value>zookeeper.service.consul:2181</value>
  </property>

  <property>
    <name>mesos.master.uri</name>
    <description>Zookeeper entry for mesos master location</description>
    <value>zk://zookeeper.service.consul:2181/mesos</value>
  </property>

  <property>
    <name>mesos.hdfs.zkfc.ha.zookeeper.quorum</name>
    <description>Comma-separated list of zookeeper hostname-port pairs for HDFS HA features</description>
    <value>zookeeper.service.consul:2181</value>
  </property>

  <property>
    <name>mesos.hdfs.framework.name</name>
    <description>Your Mesos framework name and cluster name when accessing files (hdfs://YOUR_NAME)</description>
    <value>hdfs</value>
  </property>

  <property>
    <name>mesos.hdfs.mesosdns</name>
    <description>Whether to use Mesos DNS for service discovery within HDFS</description>
    <value>false</value>
  </property>

  <property>
    <name>mesos.hdfs.mesosdns.domain</name>
    <description>Root domain name of Mesos DNS (usually 'mesos')</description>
    <value>mesos</value>
  </property>

  <property>
    <name>mesos.native.library</name>
    <description>Location of libmesos.so</description>
    <value>/usr/local/lib/libmesos.so</value>
  </property>

  <property>
    <name>mesos.hdfs.journalnode.count</name>
    <description>Number of journal nodes (must be odd)</description>
    <value>3</value>
  </property>

  <!-- Additional settings for fine-tuning -->

  <property>
    <name>mesos.hdfs.jvm.overhead</name>
    <description>Multiplier on resources reserved in order to account for JVM allocation</description>
    <value>1.35</value>
  </property>
  <property>
    <name>mesos.hdfs.hadoop.heap.size</name>
    <value>512</value>
  </property>
  <property>
    <name>mesos.hdfs.namenode.heap.size</name>
    <value>512</value>
  </property>
  <property>
    <name>mesos.hdfs.datanode.heap.size</name>
    <value>512</value>
  </property>
  <property>
    <name>mesos.hdfs.executor.heap.size</name>
    <value>256</value>
  </property>
  <property>
    <name>mesos.hdfs.executor.cpus</name>
    <value>0.1</value>
  </property>
  <property>
    <name>mesos.hdfs.namenode.cpus</name>
    <value>0.25</value>
  </property>
  <property>
    <name>mesos.hdfs.journalnode.cpus</name>
    <value>0.25</value>
  </property>
  <property>
    <name>mesos.hdfs.datanode.cpus</name>
    <value>0.25</value>
  </property>
  <property>
    <name>mesos.hdfs.user</name>
    <value>root</value>
  </property>
  <property>
    <name>mesos.hdfs.role</name>
    <value>*</value>
  </property>
  <property>
    <name>mesos.hdfs.ld-library-path</name>
    <value>/usr/local/lib</value>
  </property>
  <property>
    <name>mesos.hdfs.datanode.exclusive</name>
    <!-- WARNING: it is not advisable to run the datanode on the same slave because of performance issues -->
    <description>Whether to run the datanode on a slave different from the namenode and journal nodes</description>
    <value>true</value>
  </property>
</configuration>
@langston-barrett
Contributor

@iharush Are you seeing resource offers in Mesos? Are you able to launch other tasks via Marathon?

@iharush
Author

iharush commented Jun 2, 2016

Marathon is still functional.

The issue is that not all of the hdfs-mesos tasks launched; the namenode & datanodes did not come up.
For some reason I got many ZKFC tasks (and not 2); see attached:

mesos-hdfs.pdf

Yes, the Mesos offers keep coming, but none of them is accepted.
In the logs I can see that the namenode2 task declines many offers, for example:

08:42:08.514 [Thread-219] INFO  o.a.mesos.hdfs.scheduler.HdfsNode - Node: namenode, evaluating offer: value: "d62f3dd5-128c-4483-91dd-ea7f2108b89b-O4467"

08:42:08.514 [Thread-219] INFO  o.a.mesos.hdfs.scheduler.NameNode - Offer does not have enough resources
08:42:08.514 [Thread-219] INFO  o.a.mesos.hdfs.scheduler.HdfsNode - Node: namenode, declining offer: value: "d62f3dd5-128c-4483-91dd-ea7f2108b89b-O4467"
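For context on "Offer does not have enough resources": a single offer must carry the whole node's footprint. A rough sketch of the arithmetic (an assumed approximation of the scheduler's sizing, not its exact code), using the values from the mesos-site.xml above:

```python
# Assumed approximation (not the mesos-hdfs scheduler's exact formula):
# a node needs its own cpus plus the executor's, and heap memory scaled
# by the JVM overhead multiplier from mesos.hdfs.jvm.overhead.
JVM_OVERHEAD = 1.35

def node_requirements(node_cpus, node_heap_mb,
                      executor_cpus=0.1, executor_heap_mb=256):
    """Estimate cpus and memory one offer must provide for one HDFS node."""
    cpus = node_cpus + executor_cpus
    mem_mb = (node_heap_mb + executor_heap_mb) * JVM_OVERHEAD
    return cpus, mem_mb

# Namenode with the settings above: 0.25 cpus, 512 MB heap.
cpus, mem = node_requirements(0.25, 512)
print(f"namenode needs roughly {cpus:.2f} cpus and {mem:.0f} MB in a single offer")
```

If no worker has that much unreserved capacity in one offer (for example because a previous deployment's reservations or tasks were not fully cleaned up), every offer gets declined and the deployment hangs.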

@ryane
Contributor

ryane commented Jun 2, 2016

@iharush can you try removing the hdfs data directories from each worker after you uninstall (back them up if you need to)? something like:

ansible all -m shell -a 'rm -rf /var/lib/hdfs/data/*' -s -l 'role=worker'
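If you want to keep the data, a minimal sketch of backing up the directory before clearing it (a hypothetical helper, not part of mantl-api; paths match mesos.hdfs.data.dir in the mesos-site.xml above):

```shell
# Hypothetical helper (not part of mantl-api): back up an HDFS data
# directory, then remove it so a relaunch starts from a clean state.
wipe_hdfs_data() {
  data_dir="$1"
  backup_dir="$2"
  mkdir -p "$backup_dir"
  # Preserve a copy (including dotfiles) before deleting anything.
  cp -a "$data_dir/." "$backup_dir/"
  # Remove the stale framework state and recreate an empty data dir.
  rm -rf "$data_dir"
  mkdir -p "$data_dir"
}

# Example: wipe_hdfs_data /var/lib/hdfs/data /var/lib/hdfs/data.bak
```

It would need to run on each worker (e.g. pushed out with ansible as above), not on the control node.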

@iharush
Author

iharush commented Jun 2, 2016

It works! Thanks.
Removing the hdfs data directories solved the issue.

I think you should add it to the mantl-api uninstall flow.

@ryane
Contributor

ryane commented Jun 2, 2016

Yeah, it probably should be optional, though. We don't want to delete data without warning. I created an issue for it: #45

@ryane ryane closed this as completed Jun 2, 2016