SparkSQL simple way of reading CSV as local Spark Dataframes #4

Open

CarloNicolini opened this issue Dec 20, 2021 · 1 comment

Dear @cryeo,

I really like your library: it makes it possible to use SQL syntax directly in notebook cells, and that's a nice piece of work!

However, I would like to hear from you what the best way is to read local .csv files into Spark DataFrames through the SparkSQL syntax, without creating a local Hive database.
I've noticed that the two folders metastore_db and spark-warehouse are always created in the same folder as the notebook when I create a table; do they act as a local database?
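
(As far as I can tell, metastore_db is the embedded Derby metastore that Spark creates when Hive support is enabled, and spark-warehouse is the default value of spark.sql.warehouse.dir. A minimal sketch of the kind of session I am running, where the master and the warehouse path are only illustrative:)

from pyspark.sql import SparkSession

# Sketch of a local, Hive-enabled session: metastore_db (embedded Derby)
# and spark-warehouse are created in the working directory on first use.
spark = (
    SparkSession.builder
    .master("local[*]")                                     # illustrative
    .config("spark.sql.warehouse.dir", "spark-warehouse")   # the default
    .enableHiveSupport()
    .getOrCreate()
)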

When I run this cell:

%%sparksql
CREATE TABLE IF NOT EXISTS account
USING csv
OPTIONS (
    path 'data.csv',
    header true,
    inferSchema true,
    sep ','
);

I get these warnings in the Jupyter notebook:

21/12/20 09:03:12 WARN HiveConf: HiveConf of name hive.stats.jdbc.timeout does not exist
21/12/20 09:03:12 WARN HiveConf: HiveConf of name hive.stats.retries.wait does not exist
21/12/20 09:03:16 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 2.3.0
21/12/20 09:03:16 WARN ObjectStore: setMetaStoreSchemaVersion called but recording version is disabled: version = 2.3.0, comment = Set by MetaStore myuser@localhost
21/12/20 09:03:16 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
21/12/20 09:03:25 WARN HiveExternalCatalog: Couldn't find corresponding Hive SerDe for data source provider csv. Persisting data source table `default`.`account` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive.
21/12/20 09:03:26 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.
21/12/20 09:03:27 WARN HiveConf: HiveConf of name hive.internal.ss.authz.settings.applied.marker does not exist
21/12/20 09:03:27 WARN HiveConf: HiveConf of name hive.stats.jdbc.timeout does not exist
21/12/20 09:03:27 WARN HiveConf: HiveConf of name hive.stats.retries.wait does not exist
-chgrp: 'Users\Domain' does not match expected pattern for group
Usage: hadoop fs [generic options]
    [-appendToFile <localsrc> ... <dst>]
    [-cat [-ignoreCrc] <src> ...]
    [-checksum <src> ...]
    [-chgrp [-R] GROUP PATH...]
    [-chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH...]
    [-chown [-R] [OWNER][:[GROUP]] PATH...]
    [-copyFromLocal [-f] [-p] [-l] [-d] [-t <thread count>] <localsrc> ... <dst>]
    [-copyToLocal [-f] [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
    [-count [-q] [-h] [-v] [-t [<storage type>]] [-u] [-x] [-e] <path> ...]
    [-cp [-f] [-p | -p[topax]] [-d] <src> ... <dst>]
    [-createSnapshot <snapshotDir> [<snapshotName>]]
    [-deleteSnapshot <snapshotDir> <snapshotName>]
    [-df [-h] [<path> ...]]
    [-du [-s] [-h] [-v] [-x] <path> ...]
    [-expunge]
    [-find <path> ... <expression> ...]
    [-get [-f] [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
    [-getfacl [-R] <path>]
    [-getfattr [-R] {-n name | -d} [-e en] <path>]
    [-getmerge [-nl] [-skip-empty-file] <src> <localdst>]
    [-head <file>]
    [-help [cmd ...]]
    [-ls [-C] [-d] [-h] [-q] [-R] [-t] [-S] [-r] [-u] [-e] [<path> ...]]
    [-mkdir [-p] <path> ...]
    [-moveFromLocal <localsrc> ... <dst>]
    [-moveToLocal <src> <localdst>]
    [-mv <src> ... <dst>]
    [-put [-f] [-p] [-l] [-d] <localsrc> ... <dst>]
    [-renameSnapshot <snapshotDir> <oldName> <newName>]
    [-rm [-f] [-r|-R] [-skipTrash] [-safely] <src> ...]
    [-rmdir [--ignore-fail-on-non-empty] <dir> ...]
    [-setfacl [-R] [{-b|-k} {-m|-x <acl_spec>} <path>]|[--set <acl_spec> <path>]]
    [-setfattr {-n name [-v value] | -x name} <path>]
    [-setrep [-R] [-w] <rep> <path> ...]
    [-stat [format] <path> ...]
    [-tail [-f] <file>]
    [-test -[defsz] <path>]
    [-text [-ignoreCrc] <src> ...]
    [-touch [-a] [-m] [-t TIMESTAMP ] [-c] <path> ...]
    [-touchz <path> ...]
    [-truncate [-w] <length> <path> ...]
    [-usage [cmd ...]]

Generic options supported are:
-conf <configuration file>        specify an application configuration file
-D <property=value>               define a value for a given property
-fs <file:///|hdfs://namenode:port> specify default filesystem URL to use, overrides 'fs.defaultFS' property from configurations.
-jt <local|resourcemanager:port>  specify a ResourceManager
-files <file1,...>                specify a comma-separated list of files to be copied to the map reduce cluster
-libjars <jar1,...>               specify a comma-separated list of jar files to be included in the classpath
-archives <archive1,...>          specify a comma-separated list of archives to be unarchived on the compute machines

The general command line syntax is:
command [genericOptions] [commandOptions]

Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...

I do understand that I don't have Apache Hive installed, but isn't it possible to simply read the CSV file as a Spark DataFrame without all these warnings?
Doing this with the PySpark API is much easier: spark.read.csv('myfile.csv') suffices and no local databases are created.
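
For what it's worth, a temporary view seems like it would avoid the metastore entirely, since it lives only in the session catalog. A sketch, assuming the magic passes the statement straight to spark.sql:

%%sparksql
CREATE OR REPLACE TEMPORARY VIEW account
USING csv
OPTIONS (
    path 'data.csv',
    header true,
    inferSchema true,
    sep ','
);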

@Wh1isper

Hi @CarloNicolini, I have developed a similar module with support for the latest Spark server-client mode; maybe you can try it out:

Wh1isper/sparglim#sql-magic

Install with:

pip install sparglim["magic"]

It supports the SQL statements you need 🎉

%%sql
CREATE TABLE tb_people
USING json
OPTIONS (path "/path/to/file.json");
SHOW TABLES;
