I really like your library, as it makes it possible to integrate SQL syntax directly into cells; that's a nice piece of work!
However, I would like to hear from you what the best way is to read local .csv files into Spark DataFrames by means of the SparkSQL syntax, without creating a local Hive database.
I've noticed that the two folders metastore_db and spark-warehouse are always created in the same folder as the notebook whenever I create a table. Do they act as a local database?
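For what it's worth, I believe the locations of both folders are configurable rather than fixed to the notebook directory. A sketch using standard Spark properties (the directory paths here are just examples, not values your library requires):

```
# spark-defaults.conf — relocate the embedded Derby metastore and the warehouse
# (paths below are placeholders)
spark.sql.warehouse.dir                       /tmp/spark-warehouse
spark.hadoop.javax.jdo.option.ConnectionURL   jdbc:derby:;databaseName=/tmp/metastore_db;create=true
```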
When I run this cell:
%%sparksql
CREATE TABLE IF NOT EXISTS account
USING csv
OPTIONS (
path 'data.csv',
header true,
inferSchema true,
sep ','
);
I get these warnings in the Jupyter notebook:
21/12/20 09:03:12 WARN HiveConf: HiveConf of name hive.stats.jdbc.timeout does not exist
21/12/20 09:03:12 WARN HiveConf: HiveConf of name hive.stats.retries.wait does not exist
21/12/20 09:03:16 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 2.3.0
21/12/20 09:03:16 WARN ObjectStore: setMetaStoreSchemaVersion called but recording version is disabled: version = 2.3.0, comment = Set by MetaStore myuser@localhost
21/12/20 09:03:16 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
21/12/20 09:03:25 WARN HiveExternalCatalog: Couldn't find corresponding Hive SerDe for data source provider csv. Persisting data source table `default`.`account` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive.
21/12/20 09:03:26 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.
21/12/20 09:03:27 WARN HiveConf: HiveConf of name hive.internal.ss.authz.settings.applied.marker does not exist
21/12/20 09:03:27 WARN HiveConf: HiveConf of name hive.stats.jdbc.timeout does not exist
21/12/20 09:03:27 WARN HiveConf: HiveConf of name hive.stats.retries.wait does not exist
-chgrp: 'Users\Domain' does not match expected pattern for group
Usage: hadoop fs [generic options]
[-appendToFile <localsrc> ... <dst>]
[-cat [-ignoreCrc] <src> ...]
[-checksum <src> ...]
[-chgrp [-R] GROUP PATH...]
[-chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH...]
[-chown [-R] [OWNER][:[GROUP]] PATH...]
[-copyFromLocal [-f] [-p] [-l] [-d] [-t <thread count>] <localsrc> ... <dst>]
[-copyToLocal [-f] [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
[-count [-q] [-h] [-v] [-t [<storage type>]] [-u] [-x] [-e] <path> ...]
[-cp [-f] [-p | -p[topax]] [-d] <src> ... <dst>]
[-createSnapshot <snapshotDir> [<snapshotName>]]
[-deleteSnapshot <snapshotDir> <snapshotName>]
[-df [-h] [<path> ...]]
[-du [-s] [-h] [-v] [-x] <path> ...]
[-expunge]
[-find <path> ... <expression> ...]
[-get [-f] [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
[-getfacl [-R] <path>]
[-getfattr [-R] {-n name | -d} [-e en] <path>]
[-getmerge [-nl] [-skip-empty-file] <src> <localdst>]
[-head <file>]
[-help [cmd ...]]
[-ls [-C] [-d] [-h] [-q] [-R] [-t] [-S] [-r] [-u] [-e] [<path> ...]]
[-mkdir [-p] <path> ...]
[-moveFromLocal <localsrc> ... <dst>]
[-moveToLocal <src> <localdst>]
[-mv <src> ... <dst>]
[-put [-f] [-p] [-l] [-d] <localsrc> ... <dst>]
[-renameSnapshot <snapshotDir> <oldName> <newName>]
[-rm [-f] [-r|-R] [-skipTrash] [-safely] <src> ...]
[-rmdir [--ignore-fail-on-non-empty] <dir> ...]
[-setfacl [-R] [{-b|-k} {-m|-x <acl_spec>} <path>]|[--set <acl_spec> <path>]]
[-setfattr {-n name [-v value] | -x name} <path>]
[-setrep [-R] [-w] <rep> <path> ...]
[-stat [format] <path> ...]
[-tail [-f] <file>]
[-test -[defsz] <path>]
[-text [-ignoreCrc] <src> ...]
[-touch [-a] [-m] [-t TIMESTAMP ] [-c] <path> ...]
[-touchz <path> ...]
[-truncate [-w] <length> <path> ...]
[-usage [cmd ...]]
Generic options supported are:
-conf <configuration file> specify an application configuration file
-D <property=value> define a value for a given property
-fs <file:///|hdfs://namenode:port> specify default filesystem URL to use, overrides 'fs.defaultFS' property from configurations.
-jt <local|resourcemanager:port> specify a ResourceManager
-files <file1,...> specify a comma-separated list of files to be copied to the map reduce cluster
-libjars <jar1,...> specify a comma-separated list of jar files to be included in the classpath
-archives <archive1,...> specify a comma-separated list of archives to be unarchived on the compute machines
The general command line syntax is:
command [genericOptions] [commandOptions]
Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
I do understand that I don't have Apache Hive installed, but isn't it possible to simply read the CSV file as a Spark DataFrame without all these warnings?
Doing this with the PySpark API is much easier: a single spark.read.csv('myfile.csv') suffices and no local database is created.
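For comparison, the closest SparkSQL equivalent I have found that avoids persisting the table is a temporary view; as far as I can tell, temporary views are session-scoped and are not written to the Hive metastore. A sketch, assuming the same data.csv as above:

```sql
%%sparksql
CREATE OR REPLACE TEMPORARY VIEW account
USING csv
OPTIONS (
  path 'data.csv',
  header true,
  inferSchema true,
  sep ','
);
```

The view disappears when the Spark session ends, so nothing needs to be recorded in metastore_db.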