Title: | R Interface to Apache Spark |
---|---|
Description: | R interface to Apache Spark, a fast and general engine for big data processing, see <https://spark.apache.org/>. This package supports connecting to local and remote Apache Spark clusters, provides a 'dplyr' compatible back-end, and provides an interface to Spark's built-in machine learning algorithms. |
Authors: | Javier Luraschi [aut], Kevin Kuo [aut], Kevin Ushey [aut], JJ Allaire [aut], Samuel Macedo [ctb], Hossein Falaki [aut], Lu Wang [aut], Andy Zhang [aut], Yitao Li [aut], Jozef Hajnala [ctb], Maciej Szymkiewicz [ctb], Wil Davis [ctb], Edgar Ruiz [aut, cre], RStudio [cph], The Apache Software Foundation [aut, cph] |
Maintainer: | Edgar Ruiz <[email protected]> |
License: | Apache License 2.0 | file LICENSE |
Version: | 1.8.6.9001 |
Built: | 2024-11-09 05:40:01 UTC |
Source: | https://github.com/sparklyr/sparklyr |
Subsetting operator for a Spark dataframe, allowing a subset of column(s) to be selected using syntax similar to that supported by R dataframes
## S3 method for class 'tbl_spark' x[i]
x |
The Spark dataframe |
i |
Expression specifying subset of column(s) to include or exclude from the result (e.g., '["col1"]', '[c("col1", "col2")]', '[1:10]', '[-1]', '[NULL]', or '[]') |
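A minimal sketch of the supported forms, assuming `iris_tbl` is an existing tbl_spark (e.g., created with sdf_copy_to(sc, iris)):

iris_tbl["Species"]                      # keep a single column by name
iris_tbl[c("Sepal_Length", "Species")]   # keep several columns by name
iris_tbl[1:2]                            # keep the first two columns
iris_tbl[-1]                             # drop the first column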
Infix operator that allows a lambda expression to be composed in R and translated to its Spark SQL equivalent using dbplyr::translate_sql functionality
params %->% ...
params |
Parameter(s) of the lambda expression; can be either a single parameter or a comma-separated list of parameters in the form of
|
... |
Body of the lambda expression, *must be within parentheses* |
Notice when composing a lambda expression in R, the body of the lambda expression *must always be surrounded with parentheses*, otherwise a parsing error will occur.
## Not run:
a %->% (mean(a) + 1)
# translates to <SQL> `a` -> (AVG(`a`) OVER () + 1.0)
.(a, b) %->% (a < 1 && b > 1)
# translates to <SQL> `a`,`b` -> (`a` < 1.0 AND `b` > 1.0)
## End(Not run)
Set/Get Spark checkpoint directory
spark_set_checkpoint_dir(sc, dir) spark_get_checkpoint_dir(sc)
sc |
A |
dir |
Checkpoint directory; must be an HDFS path if running on a cluster. |
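A minimal sketch of setting and reading back the checkpoint directory (assumes a local connection; on a cluster this would be an HDFS path):

library(sparklyr)
sc <- spark_connect(master = "local")
spark_set_checkpoint_dir(sc, "/tmp/spark-checkpoints")
spark_get_checkpoint_dir(sc)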
Deserialize Spark data that was serialized using 'spark_write_rds()' into an R dataframe.
collect_from_rds(path)
path |
Path to a local RDS file produced by 'spark_write_rds()'. RDS files stored in HDFS will need to be downloaded to the local filesystem first (e.g., by running 'hadoop fs -copyToLocal ...' or similar). |
Other Spark serialization routines: spark_insert_table(), spark_load_table(), spark_read(), spark_read_avro(), spark_read_binary(), spark_read_csv(), spark_read_delta(), spark_read_image(), spark_read_jdbc(), spark_read_json(), spark_read_libsvm(), spark_read_orc(), spark_read_parquet(), spark_read_source(), spark_read_table(), spark_read_text(), spark_save_table(), spark_write_avro(), spark_write_csv(), spark_write_delta(), spark_write_jdbc(), spark_write_json(), spark_write_orc(), spark_write_parquet(), spark_write_source(), spark_write_table(), spark_write_text()
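A minimal sketch of the round trip, assuming `sdf` is a small Spark DataFrame and that the path is writable from the Spark driver (argument names follow the usage shown for spark_write_rds() in this package):

spark_write_rds(sdf, dest_uri = "file:///tmp/sdf_out.rds")  # serialize from Spark
df <- collect_from_rds("/tmp/sdf_out.rds")                  # deserialize into an R dataframe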
Compile the scala source files contained within an R package into a Java Archive (jar) file that can be loaded and used within a Spark environment.
compile_package_jars(..., spec = NULL)
... |
Optional compilation specifications, as generated by
|
spec |
An optional list of compilation specifications. When
set, this option takes precedence over arguments passed to
|
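A minimal sketch, run from within the source directory of an R package containing Scala code; it assumes the required scalac compilers are already installed (see download_scalac()):

library(sparklyr)
compile_package_jars(spec = spark_default_compilation_spec())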
Read configuration values for a connection
connection_config(sc, prefix, not_prefix = list())
sc |
|
prefix |
Prefix to read parameters for
(e.g. |
not_prefix |
Prefix to not include. |
Named list of config parameters (note that if a prefix was specified then the names will not include the prefix)
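A minimal sketch of reading configuration values back from an open connection `sc`:

# all parameters that start with "spark.sql." (names returned without the prefix)
sql_params <- connection_config(sc, prefix = "spark.sql.")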
Copy an R data.frame to Spark, and return a reference to the generated Spark DataFrame as a tbl_spark. The returned object will act as a dplyr-compatible interface to the underlying Spark table.
## S3 method for class 'spark_connection' copy_to( dest, df, name = spark_table_name(substitute(df)), overwrite = FALSE, memory = TRUE, repartition = 0L, ... )
dest |
A |
df |
An R |
name |
The name to assign to the copied table in Spark. |
overwrite |
Boolean; overwrite a pre-existing table with the name |
memory |
Boolean; should the table be cached into memory? |
repartition |
The number of partitions to use when distributing the table across the Spark cluster. The default (0) can be used to avoid partitioning. |
... |
Optional arguments; currently unused. |
A tbl_spark, representing a dplyr-compatible interface to a Spark DataFrame.
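A minimal sketch, assuming a local connection; the table name is illustrative:

library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local")
mtcars_tbl <- copy_to(sc, mtcars, name = "mtcars_spark", overwrite = TRUE)
mtcars_tbl %>% filter(cyl == 6)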
compile_package_jars requires several versions of the scala compiler in order to match the Scala versions used by different Spark releases. To help set up your environment, this function will download the required compilers under the default search path.
download_scalac(dest_path = NULL)
dest_path |
The destination path where scalac will be downloaded to. |
See find_scalac for a list of paths searched and used by this function to install the required compilers.
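A minimal sketch; this downloads the compilers into the default search path:

library(sparklyr)
download_scalac()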
These methods implement dplyr grammars for Apache Spark higher order functions
These routines are useful when preparing to pass objects to a Spark routine, as it is often necessary to ensure certain parameters are scalar integers, or scalar doubles, and so on.
object |
An R object. |
allow.na |
Are |
allow.null |
Are |
default |
If |
Find the scalac compiler for a particular version of scala, by scanning some common directories containing scala installations.
find_scalac(version, locations = NULL)
version |
The |
locations |
Additional locations to scan. By default, the
directories |
Apply thresholding to a column, such that values less than or equal to the threshold are assigned the value 0.0, and values greater than the threshold are assigned the value 1.0. Column output is numeric for compatibility with other modeling functions.
ft_binarizer( x, input_col, output_col, threshold = 0, uid = random_string("binarizer_"), ... )
x |
A |
input_col |
The name of the input column. |
output_col |
The name of the output column. |
threshold |
Threshold used to binarize continuous features. |
uid |
A character string used to uniquely identify the feature transformer. |
... |
Optional arguments; currently unused. |
The object returned depends on the class of x. If it is a spark_connection, the function returns a ml_transformer or a ml_estimator object. If it is a ml_pipeline, it will return a pipeline with the transformer or estimator appended to it. If it is a tbl_spark, it will return a tbl_spark with the transformation applied to it.
Other feature transformers: ft_bucketizer(), ft_chisq_selector(), ft_count_vectorizer(), ft_dct(), ft_elementwise_product(), ft_feature_hasher(), ft_hashing_tf(), ft_idf(), ft_imputer(), ft_index_to_string(), ft_interaction(), ft_lsh, ft_max_abs_scaler(), ft_min_max_scaler(), ft_ngram(), ft_normalizer(), ft_one_hot_encoder(), ft_one_hot_encoder_estimator(), ft_pca(), ft_polynomial_expansion(), ft_quantile_discretizer(), ft_r_formula(), ft_regex_tokenizer(), ft_robust_scaler(), ft_sql_transformer(), ft_standard_scaler(), ft_stop_words_remover(), ft_string_indexer(), ft_tokenizer(), ft_vector_assembler(), ft_vector_indexer(), ft_vector_slicer(), ft_word2vec()
## Not run:
library(dplyr)
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
iris_tbl %>%
  ft_binarizer(
    input_col = "Sepal_Length",
    output_col = "Sepal_Length_bin",
    threshold = 5
  ) %>%
  select(Sepal_Length, Sepal_Length_bin, Species)
## End(Not run)
Similar to R's cut function, this transforms a numeric column into a discretized column, with breaks specified through the splits parameter.
ft_bucketizer( x, input_col = NULL, output_col = NULL, splits = NULL, input_cols = NULL, output_cols = NULL, splits_array = NULL, handle_invalid = "error", uid = random_string("bucketizer_"), ... )
x |
A |
input_col |
The name of the input column. |
output_col |
The name of the output column. |
splits |
A numeric vector of cutpoints, indicating the bucket boundaries. |
input_cols |
Names of input columns. |
output_cols |
Names of output columns. |
splits_array |
Parameter for specifying multiple splits parameters. Each element in this array can be used to map continuous features into buckets. |
handle_invalid |
(Spark 2.1.0+) Param for how to handle invalid entries. Options are 'skip' (filter out rows with invalid values), 'error' (throw an error), or 'keep' (keep invalid values in a special additional bucket). Default: "error" |
uid |
A character string used to uniquely identify the feature transformer. |
... |
Optional arguments; currently unused. |
The object returned depends on the class of x. If it is a spark_connection, the function returns a ml_transformer or a ml_estimator object. If it is a ml_pipeline, it will return a pipeline with the transformer or estimator appended to it. If it is a tbl_spark, it will return a tbl_spark with the transformation applied to it.
Other feature transformers: ft_binarizer(), ft_chisq_selector(), ft_count_vectorizer(), ft_dct(), ft_elementwise_product(), ft_feature_hasher(), ft_hashing_tf(), ft_idf(), ft_imputer(), ft_index_to_string(), ft_interaction(), ft_lsh, ft_max_abs_scaler(), ft_min_max_scaler(), ft_ngram(), ft_normalizer(), ft_one_hot_encoder(), ft_one_hot_encoder_estimator(), ft_pca(), ft_polynomial_expansion(), ft_quantile_discretizer(), ft_r_formula(), ft_regex_tokenizer(), ft_robust_scaler(), ft_sql_transformer(), ft_standard_scaler(), ft_stop_words_remover(), ft_string_indexer(), ft_tokenizer(), ft_vector_assembler(), ft_vector_indexer(), ft_vector_slicer(), ft_word2vec()
## Not run:
library(dplyr)
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
iris_tbl %>%
  ft_bucketizer(
    input_col = "Sepal_Length",
    output_col = "Sepal_Length_bucket",
    splits = c(0, 4.5, 5, 8)
  ) %>%
  select(Sepal_Length, Sepal_Length_bucket, Species)
## End(Not run)
Chi-Squared feature selection, which selects categorical features to use for predicting a categorical label
ft_chisq_selector( x, features_col = "features", output_col = NULL, label_col = "label", selector_type = "numTopFeatures", fdr = 0.05, fpr = 0.05, fwe = 0.05, num_top_features = 50, percentile = 0.1, uid = random_string("chisq_selector_"), ... )
x |
A |
features_col |
Features column name, as a length-one character vector. The column should be single vector column of numeric values. Usually this column is output by |
output_col |
The name of the output column. |
label_col |
Label column name. The column should be a numeric column. Usually this column is output by |
selector_type |
(Spark 2.1.0+) The selector type of the ChisqSelector. Supported options: "numTopFeatures" (default), "percentile", "fpr", "fdr", "fwe". |
fdr |
(Spark 2.2.0+) The upper bound of the expected false discovery rate. Only applicable when selector_type = "fdr". Default value is 0.05. |
fpr |
(Spark 2.1.0+) The highest p-value for features to be kept. Only applicable when selector_type= "fpr". Default value is 0.05. |
fwe |
(Spark 2.2.0+) The upper bound of the expected family-wise error rate. Only applicable when selector_type = "fwe". Default value is 0.05. |
num_top_features |
Number of features that selector will select, ordered by ascending p-value. If the number of features is less than |
percentile |
(Spark 2.1.0+) Percentile of features that selector will select, ordered by statistics value descending. Only applicable when selector_type = "percentile". Default value is 0.1. |
uid |
A character string used to uniquely identify the feature transformer. |
... |
Optional arguments; currently unused. |
In the case where x is a tbl_spark, the estimator fits against x to obtain a transformer, returning a tbl_spark.
The object returned depends on the class of x. If it is a spark_connection, the function returns a ml_transformer or a ml_estimator object. If it is a ml_pipeline, it will return a pipeline with the transformer or estimator appended to it. If it is a tbl_spark, it will return a tbl_spark with the transformation applied to it.
Other feature transformers: ft_binarizer(), ft_bucketizer(), ft_count_vectorizer(), ft_dct(), ft_elementwise_product(), ft_feature_hasher(), ft_hashing_tf(), ft_idf(), ft_imputer(), ft_index_to_string(), ft_interaction(), ft_lsh, ft_max_abs_scaler(), ft_min_max_scaler(), ft_ngram(), ft_normalizer(), ft_one_hot_encoder(), ft_one_hot_encoder_estimator(), ft_pca(), ft_polynomial_expansion(), ft_quantile_discretizer(), ft_r_formula(), ft_regex_tokenizer(), ft_robust_scaler(), ft_sql_transformer(), ft_standard_scaler(), ft_stop_words_remover(), ft_string_indexer(), ft_tokenizer(), ft_vector_assembler(), ft_vector_indexer(), ft_vector_slicer(), ft_word2vec()
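A minimal sketch, assuming an open connection `sc` and dplyr attached; ft_r_formula() is used here to build the `features` and `label` columns the selector expects, and the column names are illustrative:

iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
iris_tbl %>%
  ft_r_formula(Species ~ Petal_Length + Petal_Width) %>%
  ft_chisq_selector(output_col = "selected", num_top_features = 1)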
Extracts a vocabulary from document collections.
ft_count_vectorizer( x, input_col = NULL, output_col = NULL, binary = FALSE, min_df = 1, min_tf = 1, vocab_size = 2^18, uid = random_string("count_vectorizer_"), ... ) ml_vocabulary(model)
x |
A |
input_col |
The name of the input column. |
output_col |
The name of the output column. |
binary |
Binary toggle to control the output vector values.
If |
min_df |
Specifies the minimum number of different documents a term must appear in to be included in the vocabulary. If this is an integer greater than or equal to 1, this specifies the number of documents the term must appear in; if this is a double in [0,1), then this specifies the fraction of documents. Default: 1. |
min_tf |
Filter to ignore rare words in a document. For each document, terms with frequency/count less than the given threshold are ignored. If this is an integer greater than or equal to 1, then this specifies a count (of times the term must appear in the document); if this is a double in [0,1), then this specifies a fraction (out of the document's token count). Default: 1. |
vocab_size |
Build a vocabulary that only considers the top
|
uid |
A character string used to uniquely identify the feature transformer. |
... |
Optional arguments; currently unused. |
model |
A |
In the case where x is a tbl_spark, the estimator fits against x to obtain a transformer, returning a tbl_spark.
The object returned depends on the class of x. If it is a spark_connection, the function returns a ml_transformer or a ml_estimator object. If it is a ml_pipeline, it will return a pipeline with the transformer or estimator appended to it. If it is a tbl_spark, it will return a tbl_spark with the transformation applied to it.
ml_vocabulary() returns a character vector of the vocabulary built.
Other feature transformers: ft_binarizer(), ft_bucketizer(), ft_chisq_selector(), ft_dct(), ft_elementwise_product(), ft_feature_hasher(), ft_hashing_tf(), ft_idf(), ft_imputer(), ft_index_to_string(), ft_interaction(), ft_lsh, ft_max_abs_scaler(), ft_min_max_scaler(), ft_ngram(), ft_normalizer(), ft_one_hot_encoder(), ft_one_hot_encoder_estimator(), ft_pca(), ft_polynomial_expansion(), ft_quantile_discretizer(), ft_r_formula(), ft_regex_tokenizer(), ft_robust_scaler(), ft_sql_transformer(), ft_standard_scaler(), ft_stop_words_remover(), ft_string_indexer(), ft_tokenizer(), ft_vector_assembler(), ft_vector_indexer(), ft_vector_slicer(), ft_word2vec()
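A minimal sketch, assuming an open connection `sc` and dplyr attached; the data and column names are illustrative:

docs <- data.frame(text = c("spark is fast", "sparklyr connects R to spark"))
docs_tbl <- sdf_copy_to(sc, docs, overwrite = TRUE)
docs_tbl %>%
  ft_tokenizer(input_col = "text", output_col = "tokens") %>%
  ft_count_vectorizer(input_col = "tokens", output_col = "term_counts")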
A feature transformer that takes the 1D discrete cosine transform of a real vector. No zero padding is performed on the input vector. It returns a real vector of the same length representing the DCT. The return vector is scaled such that the transform matrix is unitary (aka scaled DCT-II).
ft_dct( x, input_col = NULL, output_col = NULL, inverse = FALSE, uid = random_string("dct_"), ... ) ft_discrete_cosine_transform( x, input_col, output_col, inverse = FALSE, uid = random_string("dct_"), ... )
x |
A |
input_col |
The name of the input column. |
output_col |
The name of the output column. |
inverse |
Indicates whether to perform the inverse DCT (TRUE) or forward DCT (FALSE). |
uid |
A character string used to uniquely identify the feature transformer. |
... |
Optional arguments; currently unused. |
ft_discrete_cosine_transform() is an alias for ft_dct() for backwards compatibility.
The object returned depends on the class of x. If it is a spark_connection, the function returns a ml_transformer or a ml_estimator object. If it is a ml_pipeline, it will return a pipeline with the transformer or estimator appended to it. If it is a tbl_spark, it will return a tbl_spark with the transformation applied to it.
Other feature transformers: ft_binarizer(), ft_bucketizer(), ft_chisq_selector(), ft_count_vectorizer(), ft_elementwise_product(), ft_feature_hasher(), ft_hashing_tf(), ft_idf(), ft_imputer(), ft_index_to_string(), ft_interaction(), ft_lsh, ft_max_abs_scaler(), ft_min_max_scaler(), ft_ngram(), ft_normalizer(), ft_one_hot_encoder(), ft_one_hot_encoder_estimator(), ft_pca(), ft_polynomial_expansion(), ft_quantile_discretizer(), ft_r_formula(), ft_regex_tokenizer(), ft_robust_scaler(), ft_sql_transformer(), ft_standard_scaler(), ft_stop_words_remover(), ft_string_indexer(), ft_tokenizer(), ft_vector_assembler(), ft_vector_indexer(), ft_vector_slicer(), ft_word2vec()
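A minimal sketch, assuming an open connection `sc` and dplyr attached; the values are illustrative:

df <- data.frame(a1 = c(1, 2), a2 = c(4, 3), a3 = c(9, 0))
df_tbl <- sdf_copy_to(sc, df, overwrite = TRUE)
df_tbl %>%
  ft_vector_assembler(input_cols = c("a1", "a2", "a3"), output_col = "features") %>%
  ft_dct(input_col = "features", output_col = "features_dct")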
Outputs the Hadamard product (i.e., the element-wise product) of each input vector with a provided "weight" vector. In other words, it scales each column of the dataset by a scalar multiplier.
ft_elementwise_product( x, input_col = NULL, output_col = NULL, scaling_vec = NULL, uid = random_string("elementwise_product_"), ... )
x |
A |
input_col |
The name of the input column. |
output_col |
The name of the output column. |
scaling_vec |
The vector to multiply with the input vectors. |
uid |
A character string used to uniquely identify the feature transformer. |
... |
Optional arguments; currently unused. |
The object returned depends on the class of x. If it is a spark_connection, the function returns a ml_transformer or a ml_estimator object. If it is a ml_pipeline, it will return a pipeline with the transformer or estimator appended to it. If it is a tbl_spark, it will return a tbl_spark with the transformation applied to it.
Other feature transformers: ft_binarizer(), ft_bucketizer(), ft_chisq_selector(), ft_count_vectorizer(), ft_dct(), ft_feature_hasher(), ft_hashing_tf(), ft_idf(), ft_imputer(), ft_index_to_string(), ft_interaction(), ft_lsh, ft_max_abs_scaler(), ft_min_max_scaler(), ft_ngram(), ft_normalizer(), ft_one_hot_encoder(), ft_one_hot_encoder_estimator(), ft_pca(), ft_polynomial_expansion(), ft_quantile_discretizer(), ft_r_formula(), ft_regex_tokenizer(), ft_robust_scaler(), ft_sql_transformer(), ft_standard_scaler(), ft_stop_words_remover(), ft_string_indexer(), ft_tokenizer(), ft_vector_assembler(), ft_vector_indexer(), ft_vector_slicer(), ft_word2vec()
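A minimal sketch that scales each element of a 3-dimensional feature vector; assumes an open connection `sc` and dplyr attached, with illustrative values:

df <- data.frame(a = 1, b = 3, c = 5)
df_tbl <- sdf_copy_to(sc, df, overwrite = TRUE)
df_tbl %>%
  ft_vector_assembler(input_cols = c("a", "b", "c"), output_col = "features") %>%
  ft_elementwise_product(
    input_col = "features", output_col = "scaled",
    scaling_vec = c(2, 0, 1)
  )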
Feature Transformation – FeatureHasher (Transformer)
ft_feature_hasher( x, input_cols = NULL, output_col = NULL, num_features = 2^18, categorical_cols = NULL, uid = random_string("feature_hasher_"), ... )
x |
A |
input_cols |
Names of input columns. |
output_col |
Name of output column. |
num_features |
Number of features. Defaults to |
categorical_cols |
Numeric columns to treat as categorical features. By default only string and boolean columns are treated as categorical, so this param can be used to explicitly specify the numerical columns to treat as categorical. |
uid |
A character string used to uniquely identify the feature transformer. |
... |
Optional arguments; currently unused. |
Feature hashing projects a set of categorical or numerical features into a feature vector of specified dimension (typically substantially smaller than that of the original feature space). This is done using the hashing trick (https://en.wikipedia.org/wiki/Feature_hashing) to map features to indices in the feature vector.
The FeatureHasher transformer operates on multiple columns. Each column may contain either numeric or categorical features. Behavior and handling of column data types is as follows:
- Numeric columns: For numeric features, the hash value of the column name is used to map the feature value to its index in the feature vector. By default, numeric features are not treated as categorical (even when they are integers). To treat them as categorical, specify the relevant columns in categoricalCols.
- String columns: For categorical features, the hash value of the string "column_name=value" is used to map to the vector index, with an indicator value of 1.0. Thus, categorical features are "one-hot" encoded (similarly to using OneHotEncoder with drop_last=FALSE).
- Boolean columns: Boolean values are treated in the same way as string columns. That is, boolean features are represented as "column_name=true" or "column_name=false", with an indicator value of 1.0.
Null (missing) values are ignored (implicitly zero in the resulting feature vector).
The hash function used here is also the MurmurHash 3 used in HashingTF. Since a simple modulo on the hashed value is used to determine the vector index, it is advisable to use a power of two as the num_features parameter; otherwise the features will not be mapped evenly to the vector indices.
The object returned depends on the class of x. If it is a spark_connection, the function returns a ml_transformer or a ml_estimator object. If it is a ml_pipeline, it will return a pipeline with the transformer or estimator appended to it. If it is a tbl_spark, it will return a tbl_spark with the transformation applied to it.
Other feature transformers: ft_binarizer(), ft_bucketizer(), ft_chisq_selector(), ft_count_vectorizer(), ft_dct(), ft_elementwise_product(), ft_hashing_tf(), ft_idf(), ft_imputer(), ft_index_to_string(), ft_interaction(), ft_lsh, ft_max_abs_scaler(), ft_min_max_scaler(), ft_ngram(), ft_normalizer(), ft_one_hot_encoder(), ft_one_hot_encoder_estimator(), ft_pca(), ft_polynomial_expansion(), ft_quantile_discretizer(), ft_r_formula(), ft_regex_tokenizer(), ft_robust_scaler(), ft_sql_transformer(), ft_standard_scaler(), ft_stop_words_remover(), ft_string_indexer(), ft_tokenizer(), ft_vector_assembler(), ft_vector_indexer(), ft_vector_slicer(), ft_word2vec()
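A minimal sketch that hashes a mix of numeric, boolean, and string columns; assumes an open connection `sc` and dplyr attached, with illustrative data:

df <- data.frame(real = 2.2, bool = TRUE, string_num = "1", string = "foo")
df_tbl <- sdf_copy_to(sc, df, overwrite = TRUE)
df_tbl %>%
  ft_feature_hasher(
    input_cols = c("real", "bool", "string_num", "string"),
    output_col = "features",
    num_features = 2^8  # a power of two keeps hashed indices evenly spread
  )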
Maps a sequence of terms to their term frequencies using the hashing trick.
ft_hashing_tf( x, input_col = NULL, output_col = NULL, binary = FALSE, num_features = 2^18, uid = random_string("hashing_tf_"), ... )
x |
A |
input_col |
The name of the input column. |
output_col |
The name of the output column. |
binary |
Binary toggle to control term frequency counts.
If true, all non-zero counts are set to 1. This is useful for discrete
probabilistic models that model binary events rather than integer
counts. (default = |
num_features |
Number of features. Should be greater than 0. (default = |
uid |
A character string used to uniquely identify the feature transformer. |
... |
Optional arguments; currently unused. |
The object returned depends on the class of x. If it is a spark_connection, the function returns a ml_transformer or a ml_estimator object. If it is a ml_pipeline, it will return a pipeline with the transformer or estimator appended to it. If it is a tbl_spark, it will return a tbl_spark with the transformation applied to it.
Other feature transformers: ft_binarizer(), ft_bucketizer(), ft_chisq_selector(), ft_count_vectorizer(), ft_dct(), ft_elementwise_product(), ft_feature_hasher(), ft_idf(), ft_imputer(), ft_index_to_string(), ft_interaction(), ft_lsh, ft_max_abs_scaler(), ft_min_max_scaler(), ft_ngram(), ft_normalizer(), ft_one_hot_encoder(), ft_one_hot_encoder_estimator(), ft_pca(), ft_polynomial_expansion(), ft_quantile_discretizer(), ft_r_formula(), ft_regex_tokenizer(), ft_robust_scaler(), ft_sql_transformer(), ft_standard_scaler(), ft_stop_words_remover(), ft_string_indexer(), ft_tokenizer(), ft_vector_assembler(), ft_vector_indexer(), ft_vector_slicer(), ft_word2vec()
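A minimal sketch that tokenizes sentences and hashes the tokens into a fixed-size term-frequency vector; assumes an open connection `sc` and dplyr attached, with illustrative data:

sentences <- data.frame(text = c("the quick brown fox", "the lazy dog"))
sentences_tbl <- sdf_copy_to(sc, sentences, overwrite = TRUE)
sentences_tbl %>%
  ft_tokenizer(input_col = "text", output_col = "tokens") %>%
  ft_hashing_tf(input_col = "tokens", output_col = "tf", num_features = 2^10)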
Compute the Inverse Document Frequency (IDF) given a collection of documents.
ft_idf( x, input_col = NULL, output_col = NULL, min_doc_freq = 0, uid = random_string("idf_"), ... )
x |
A |
input_col |
The name of the input column. |
output_col |
The name of the output column. |
min_doc_freq |
The minimum number of documents in which a term should appear. Default: 0 |
uid |
A character string used to uniquely identify the feature transformer. |
... |
Optional arguments; currently unused. |
In the case where x is a tbl_spark, the estimator fits against x to obtain a transformer, returning a tbl_spark.
The object returned depends on the class of x. If it is a spark_connection, the function returns a ml_transformer or a ml_estimator object. If it is a ml_pipeline, it will return a pipeline with the transformer or estimator appended to it. If it is a tbl_spark, it will return a tbl_spark with the transformation applied to it.
Other feature transformers: ft_binarizer(), ft_bucketizer(), ft_chisq_selector(), ft_count_vectorizer(), ft_dct(), ft_elementwise_product(), ft_feature_hasher(), ft_hashing_tf(), ft_imputer(), ft_index_to_string(), ft_interaction(), ft_lsh, ft_max_abs_scaler(), ft_min_max_scaler(), ft_ngram(), ft_normalizer(), ft_one_hot_encoder(), ft_one_hot_encoder_estimator(), ft_pca(), ft_polynomial_expansion(), ft_quantile_discretizer(), ft_r_formula(), ft_regex_tokenizer(), ft_robust_scaler(), ft_sql_transformer(), ft_standard_scaler(), ft_stop_words_remover(), ft_string_indexer(), ft_tokenizer(), ft_vector_assembler(), ft_vector_indexer(), ft_vector_slicer(), ft_word2vec()
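A minimal sketch of a TF-IDF pipeline; assumes an open connection `sc` and dplyr attached, with illustrative data:

sentences <- data.frame(text = c("the quick brown fox", "the quick dog"))
sentences_tbl <- sdf_copy_to(sc, sentences, overwrite = TRUE)
sentences_tbl %>%
  ft_tokenizer(input_col = "text", output_col = "tokens") %>%
  ft_hashing_tf(input_col = "tokens", output_col = "tf") %>%
  ft_idf(input_col = "tf", output_col = "tfidf")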
Imputation estimator for completing missing values, either using the mean or the median of the columns in which the missing values are located. The input columns should be of numeric type. This function requires Spark 2.2.0+.
ft_imputer( x, input_cols = NULL, output_cols = NULL, missing_value = NULL, strategy = "mean", uid = random_string("imputer_"), ... )
x |
A |
input_cols |
The names of the input columns |
output_cols |
The names of the output columns. |
missing_value |
The placeholder for the missing values. All occurrences of
|
strategy |
The imputation strategy. Currently only "mean" and "median" are supported. If "mean", then replace missing values using the mean value of the feature. If "median", then replace missing values using the approximate median value of the feature. Default: mean |
uid |
A character string used to uniquely identify the feature transformer. |
... |
Optional arguments; currently unused. |
In the case where x is a tbl_spark, the estimator fits against x to obtain a transformer, returning a tbl_spark.
The object returned depends on the class of x. If it is a spark_connection, the function returns a ml_transformer or a ml_estimator object. If it is a ml_pipeline, it will return a pipeline with the transformer or estimator appended to it. If it is a tbl_spark, it will return a tbl_spark with the transformation applied to it.
Other feature transformers: ft_binarizer(), ft_bucketizer(), ft_chisq_selector(), ft_count_vectorizer(), ft_dct(), ft_elementwise_product(), ft_feature_hasher(), ft_hashing_tf(), ft_idf(), ft_index_to_string(), ft_interaction(), ft_lsh, ft_max_abs_scaler(), ft_min_max_scaler(), ft_ngram(), ft_normalizer(), ft_one_hot_encoder(), ft_one_hot_encoder_estimator(), ft_pca(), ft_polynomial_expansion(), ft_quantile_discretizer(), ft_r_formula(), ft_regex_tokenizer(), ft_robust_scaler(), ft_sql_transformer(), ft_standard_scaler(), ft_stop_words_remover(), ft_string_indexer(), ft_tokenizer(), ft_vector_assembler(), ft_vector_indexer(), ft_vector_slicer(), ft_word2vec()
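A minimal sketch that fills missing values with column means; assumes an open connection `sc` running Spark 2.2.0+, with illustrative data (NA values become nulls in Spark and are imputed):

df <- data.frame(a = c(1, 2, NA, 4), b = c(NA, 10, 30, 20))
df_tbl <- sdf_copy_to(sc, df, overwrite = TRUE)
ft_imputer(df_tbl,
  input_cols = c("a", "b"),
  output_cols = c("a_filled", "b_filled"),
  strategy = "mean"
)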
A Transformer that maps a column of indices back to a new column of corresponding string values. The index-string mapping is either from the ML attributes of the input column, or from user-supplied labels (which take precedence over ML attributes). This function is the inverse of ft_string_indexer.
ft_index_to_string( x, input_col = NULL, output_col = NULL, labels = NULL, uid = random_string("index_to_string_"), ... )
x |
A |
input_col |
The name of the input column. |
output_col |
The name of the output column. |
labels |
Optional param for array of labels specifying index-string mapping. |
uid |
A character string used to uniquely identify the feature transformer. |
... |
Optional arguments; currently unused. |
The object returned depends on the class of x. If it is a spark_connection, the function returns a ml_transformer or a ml_estimator object. If it is a ml_pipeline, it will return a pipeline with the transformer or estimator appended to it. If it is a tbl_spark, it will return a tbl_spark with the transformation applied to it.
Other feature transformers: ft_binarizer(), ft_bucketizer(), ft_chisq_selector(), ft_count_vectorizer(), ft_dct(), ft_elementwise_product(), ft_feature_hasher(), ft_hashing_tf(), ft_idf(), ft_imputer(), ft_interaction(), ft_lsh, ft_max_abs_scaler(), ft_min_max_scaler(), ft_ngram(), ft_normalizer(), ft_one_hot_encoder(), ft_one_hot_encoder_estimator(), ft_pca(), ft_polynomial_expansion(), ft_quantile_discretizer(), ft_r_formula(), ft_regex_tokenizer(), ft_robust_scaler(), ft_sql_transformer(), ft_standard_scaler(), ft_stop_words_remover(), ft_string_indexer(), ft_tokenizer(), ft_vector_assembler(), ft_vector_indexer(), ft_vector_slicer(), ft_word2vec()
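A minimal sketch that indexes a string column and then maps the indices back to strings; assumes an open connection `sc`, dplyr attached, and an `iris_tbl` tbl_spark as used in the other examples:

iris_tbl %>%
  ft_string_indexer(input_col = "Species", output_col = "species_idx") %>%
  ft_index_to_string(input_col = "species_idx", output_col = "species_str")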
Implements the feature interaction transform. This transformer takes in Double and Vector type columns and outputs a flattened vector of their feature interactions. To handle interaction, we first one-hot encode any nominal features. Then, a vector of the feature cross-products is produced.
ft_interaction( x, input_cols = NULL, output_col = NULL, uid = random_string("interaction_"), ... )
x |
A |
input_cols |
The names of the input columns |
output_col |
The name of the output column. |
uid |
A character string used to uniquely identify the feature transformer. |
... |
Optional arguments; currently unused. |
The object returned depends on the class of x. If it is a spark_connection, the function returns a ml_transformer or a ml_estimator object. If it is a ml_pipeline, it will return a pipeline with the transformer or estimator appended to it. If it is a tbl_spark, it will return a tbl_spark with the transformation applied to it.
Other feature transformers: ft_binarizer(), ft_bucketizer(), ft_chisq_selector(), ft_count_vectorizer(), ft_dct(), ft_elementwise_product(), ft_feature_hasher(), ft_hashing_tf(), ft_idf(), ft_imputer(), ft_index_to_string(), ft_lsh, ft_max_abs_scaler(), ft_min_max_scaler(), ft_ngram(), ft_normalizer(), ft_one_hot_encoder(), ft_one_hot_encoder_estimator(), ft_pca(), ft_polynomial_expansion(), ft_quantile_discretizer(), ft_r_formula(), ft_regex_tokenizer(), ft_robust_scaler(), ft_sql_transformer(), ft_standard_scaler(), ft_stop_words_remover(), ft_string_indexer(), ft_tokenizer(), ft_vector_assembler(), ft_vector_indexer(), ft_vector_slicer(), ft_word2vec()
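A minimal sketch that crosses a numeric column with a vector column; assumes an open connection `sc` and dplyr attached, with illustrative values:

df <- data.frame(a = 2, b = 3, c = 5)
df_tbl <- sdf_copy_to(sc, df, overwrite = TRUE)
df_tbl %>%
  ft_vector_assembler(input_cols = c("b", "c"), output_col = "bc") %>%
  ft_interaction(input_cols = c("a", "bc"), output_col = "interactions")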
Locality Sensitive Hashing functions for Euclidean distance (Bucketed Random Projection) and Jaccard distance (MinHash).
ft_bucketed_random_projection_lsh( x, input_col = NULL, output_col = NULL, bucket_length = NULL, num_hash_tables = 1, seed = NULL, uid = random_string("bucketed_random_projection_lsh_"), ... ) ft_minhash_lsh( x, input_col = NULL, output_col = NULL, num_hash_tables = 1L, seed = NULL, uid = random_string("minhash_lsh_"), ... )
x |
A |
input_col |
The name of the input column. |
output_col |
The name of the output column. |
bucket_length |
The length of each hash bucket, a larger bucket lowers the false negative rate. The number of buckets will be (max L2 norm of input vectors) / bucketLength. |
num_hash_tables |
Number of hash tables used in LSH OR-amplification. LSH OR-amplification can be used to reduce the false negative rate. Higher values for this param lead to a reduced false negative rate, at the expense of added computational complexity. |
seed |
A random seed. Set this value if you need your results to be reproducible across repeated calls. |
uid |
A character string used to uniquely identify the feature transformer. |
... |
Optional arguments; currently unused. |
In the case where x is a tbl_spark, the estimator fits against x to obtain a transformer, returning a tbl_spark.
The object returned depends on the class of x. If it is a spark_connection, the function returns a ml_transformer or a ml_estimator object. If it is a ml_pipeline, it will return a pipeline with the transformer or estimator appended to it. If it is a tbl_spark, it will return a tbl_spark with the transformation applied to it.
See also: ft_lsh_utils.
Other feature transformers: ft_binarizer(), ft_bucketizer(), ft_chisq_selector(), ft_count_vectorizer(), ft_dct(), ft_elementwise_product(), ft_feature_hasher(), ft_hashing_tf(), ft_idf(), ft_imputer(), ft_index_to_string(), ft_interaction(), ft_max_abs_scaler(), ft_min_max_scaler(), ft_ngram(), ft_normalizer(), ft_one_hot_encoder(), ft_one_hot_encoder_estimator(), ft_pca(), ft_polynomial_expansion(), ft_quantile_discretizer(), ft_r_formula(), ft_regex_tokenizer(), ft_robust_scaler(), ft_sql_transformer(), ft_standard_scaler(), ft_stop_words_remover(), ft_string_indexer(), ft_tokenizer(), ft_vector_assembler(), ft_vector_indexer(), ft_vector_slicer(), ft_word2vec()
Utility functions for LSH models
ml_approx_nearest_neighbors( model, dataset, key, num_nearest_neighbors, dist_col = "distCol" ) ml_approx_similarity_join( model, dataset_a, dataset_b, threshold, dist_col = "distCol" )
model |
A fitted LSH model, returned by either |
dataset |
The dataset to search for nearest neighbors of the key. |
key |
Feature vector representing the item to search for. |
num_nearest_neighbors |
The maximum number of nearest neighbors. |
dist_col |
Output column for storing the distance between each result row and the key. |
dataset_a |
One of the datasets to join. |
dataset_b |
Another dataset to join. |
threshold |
The threshold for the distance of row pairs. |
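A minimal sketch, assuming an open connection `sc` and a tbl_spark `features_tbl` with a vector column named "features"; the threshold value is illustrative:

lsh <- ft_bucketed_random_projection_lsh(
  sc,
  input_col = "features", output_col = "hashes",
  bucket_length = 2, num_hash_tables = 3
)
lsh_model <- ml_fit(lsh, features_tbl)

# approximate self-join, keeping pairs with Euclidean distance below 1.5
ml_approx_similarity_join(lsh_model, features_tbl, features_tbl, threshold = 1.5)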
Rescale each feature individually to range [-1, 1] by dividing through the largest maximum absolute value in each feature. It does not shift/center the data, and thus does not destroy any sparsity.
ft_max_abs_scaler( x, input_col = NULL, output_col = NULL, uid = random_string("max_abs_scaler_"), ... )
x |
A |
input_col |
The name of the input column. |
output_col |
The name of the output column. |
uid |
A character string used to uniquely identify the feature transformer. |
... |
Optional arguments; currently unused. |
In the case where x is a tbl_spark, the estimator fits against x to obtain a transformer, returning a tbl_spark.
The object returned depends on the class of x. If it is a spark_connection, the function returns a ml_transformer or a ml_estimator object. If it is a ml_pipeline, it will return a pipeline with the transformer or estimator appended to it. If it is a tbl_spark, it will return a tbl_spark with the transformation applied to it.
Other feature transformers: ft_binarizer(), ft_bucketizer(), ft_chisq_selector(), ft_count_vectorizer(), ft_dct(), ft_elementwise_product(), ft_feature_hasher(), ft_hashing_tf(), ft_idf(), ft_imputer(), ft_index_to_string(), ft_interaction(), ft_lsh, ft_min_max_scaler(), ft_ngram(), ft_normalizer(), ft_one_hot_encoder(), ft_one_hot_encoder_estimator(), ft_pca(), ft_polynomial_expansion(), ft_quantile_discretizer(), ft_r_formula(), ft_regex_tokenizer(), ft_robust_scaler(), ft_sql_transformer(), ft_standard_scaler(), ft_stop_words_remover(), ft_string_indexer(), ft_tokenizer(), ft_vector_assembler(), ft_vector_indexer(), ft_vector_slicer(), ft_word2vec()
## Not run:
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
features <- c("Sepal_Length", "Sepal_Width", "Petal_Length", "Petal_Width")
iris_tbl %>%
  ft_vector_assembler(
    input_col = features,
    output_col = "features_temp"
  ) %>%
  ft_max_abs_scaler(
    input_col = "features_temp",
    output_col = "features"
  )
## End(Not run)
Rescale each feature individually to a common range [min, max] linearly using column summary statistics, which is also known as min-max normalization or Rescaling
ft_min_max_scaler( x, input_col = NULL, output_col = NULL, min = 0, max = 1, uid = random_string("min_max_scaler_"), ... )
x |
A |
input_col |
The name of the input column. |
output_col |
The name of the output column. |
min |
Lower bound after transformation, shared by all features Default: 0.0 |
max |
Upper bound after transformation, shared by all features Default: 1.0 |
uid |
A character string used to uniquely identify the feature transformer. |
... |
Optional arguments; currently unused. |
In the case where x is a tbl_spark, the estimator fits against x to obtain a transformer, returning a tbl_spark.
The object returned depends on the class of x. If it is a spark_connection, the function returns a ml_transformer or a ml_estimator object. If it is a ml_pipeline, it will return a pipeline with the transformer or estimator appended to it. If it is a tbl_spark, it will return a tbl_spark with the transformation applied to it.
Other feature transformers: ft_binarizer(), ft_bucketizer(), ft_chisq_selector(), ft_count_vectorizer(), ft_dct(), ft_elementwise_product(), ft_feature_hasher(), ft_hashing_tf(), ft_idf(), ft_imputer(), ft_index_to_string(), ft_interaction(), ft_lsh, ft_max_abs_scaler(), ft_ngram(), ft_normalizer(), ft_one_hot_encoder(), ft_one_hot_encoder_estimator(), ft_pca(), ft_polynomial_expansion(), ft_quantile_discretizer(), ft_r_formula(), ft_regex_tokenizer(), ft_robust_scaler(), ft_sql_transformer(), ft_standard_scaler(), ft_stop_words_remover(), ft_string_indexer(), ft_tokenizer(), ft_vector_assembler(), ft_vector_indexer(), ft_vector_slicer(), ft_word2vec()
## Not run:
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
features <- c("Sepal_Length", "Sepal_Width", "Petal_Length", "Petal_Width")
iris_tbl %>%
  ft_vector_assembler(
    input_col = features,
    output_col = "features_temp"
  ) %>%
  ft_min_max_scaler(
    input_col = "features_temp",
    output_col = "features"
  )
## End(Not run)
A feature transformer that converts the input array of strings into an array of n-grams. Null values in the input array are ignored. It returns an array of n-grams where each n-gram is represented by a space-separated string of words.
ft_ngram( x, input_col = NULL, output_col = NULL, n = 2, uid = random_string("ngram_"), ... )
x |
A |
input_col |
The name of the input column. |
output_col |
The name of the output column. |
n |
Minimum n-gram length, greater than or equal to 1. Default: 2, bigram features |
uid |
A character string used to uniquely identify the feature transformer. |
... |
Optional arguments; currently unused. |
When the input is empty, an empty array is returned. When the input array length is less than n (number of elements per n-gram), no n-grams are returned.
The object returned depends on the class of x. If it is a spark_connection, the function returns a ml_transformer or a ml_estimator object. If it is a ml_pipeline, it will return a pipeline with the transformer or estimator appended to it. If it is a tbl_spark, it will return a tbl_spark with the transformation applied to it.
Other feature transformers: ft_binarizer(), ft_bucketizer(), ft_chisq_selector(), ft_count_vectorizer(), ft_dct(), ft_elementwise_product(), ft_feature_hasher(), ft_hashing_tf(), ft_idf(), ft_imputer(), ft_index_to_string(), ft_interaction(), ft_lsh, ft_max_abs_scaler(), ft_min_max_scaler(), ft_normalizer(), ft_one_hot_encoder(), ft_one_hot_encoder_estimator(), ft_pca(), ft_polynomial_expansion(), ft_quantile_discretizer(), ft_r_formula(), ft_regex_tokenizer(), ft_robust_scaler(), ft_sql_transformer(), ft_standard_scaler(), ft_stop_words_remover(), ft_string_indexer(), ft_tokenizer(), ft_vector_assembler(), ft_vector_indexer(), ft_vector_slicer(), ft_word2vec()
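A minimal sketch that builds bigrams from tokenized text; assumes an open connection `sc` and dplyr attached, with illustrative data:

sentences <- data.frame(text = "the quick brown fox jumps")
sdf_copy_to(sc, sentences, overwrite = TRUE) %>%
  ft_tokenizer(input_col = "text", output_col = "tokens") %>%
  ft_ngram(input_col = "tokens", output_col = "bigrams", n = 2)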
Normalize a vector to have unit norm using the given p-norm.
ft_normalizer( x, input_col = NULL, output_col = NULL, p = 2, uid = random_string("normalizer_"), ... )
x |
A |
input_col |
The name of the input column. |
output_col |
The name of the output column. |
p |
Normalization in L^p space. Must be >= 1. Defaults to 2. |
uid |
A character string used to uniquely identify the feature transformer. |
... |
Optional arguments; currently unused. |
The object returned depends on the class of x. If it is a spark_connection, the function returns a ml_transformer or a ml_estimator object. If it is a ml_pipeline, it will return a pipeline with the transformer or estimator appended to it. If it is a tbl_spark, it will return a tbl_spark with the transformation applied to it.
Other feature transformers: ft_binarizer(), ft_bucketizer(), ft_chisq_selector(), ft_count_vectorizer(), ft_dct(), ft_elementwise_product(), ft_feature_hasher(), ft_hashing_tf(), ft_idf(), ft_imputer(), ft_index_to_string(), ft_interaction(), ft_lsh, ft_max_abs_scaler(), ft_min_max_scaler(), ft_ngram(), ft_one_hot_encoder(), ft_one_hot_encoder_estimator(), ft_pca(), ft_polynomial_expansion(), ft_quantile_discretizer(), ft_r_formula(), ft_regex_tokenizer(), ft_robust_scaler(), ft_sql_transformer(), ft_standard_scaler(), ft_stop_words_remover(), ft_string_indexer(), ft_tokenizer(), ft_vector_assembler(), ft_vector_indexer(), ft_vector_slicer(), ft_word2vec()
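A minimal sketch that L1-normalizes assembled features; assumes an open connection `sc`, dplyr attached, and an `iris_tbl` tbl_spark as used in the other examples:

iris_tbl %>%
  ft_vector_assembler(
    input_cols = c("Sepal_Length", "Sepal_Width"),
    output_col = "features"
  ) %>%
  ft_normalizer(input_col = "features", output_col = "features_norm", p = 1)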
One-hot encoding maps a column of label indices to a column of binary vectors, with at most a single one-value. This encoding allows algorithms which expect continuous features, such as Logistic Regression, to use categorical features. Typically used with ft_string_indexer() to index a column first.
ft_one_hot_encoder( x, input_cols = NULL, output_cols = NULL, handle_invalid = NULL, drop_last = TRUE, uid = random_string("one_hot_encoder_"), ... )
x |
A |
input_cols |
The name of the input columns. |
output_cols |
The name of the output columns. |
handle_invalid |
(Spark 2.1.0+) Param for how to handle invalid entries. Options are 'skip' (filter out rows with invalid values), 'error' (throw an error), or 'keep' (keep invalid values in a special additional bucket). Default: "error" |
drop_last |
Whether to drop the last category. Defaults to |
uid |
A character string used to uniquely identify the feature transformer. |
... |
Optional arguments; currently unused. |
The object returned depends on the class of x. If it is a spark_connection, the function returns a ml_transformer or a ml_estimator object. If it is a ml_pipeline, it will return a pipeline with the transformer or estimator appended to it. If it is a tbl_spark, it will return a tbl_spark with the transformation applied to it.
Other feature transformers: ft_binarizer(), ft_bucketizer(), ft_chisq_selector(), ft_count_vectorizer(), ft_dct(), ft_elementwise_product(), ft_feature_hasher(), ft_hashing_tf(), ft_idf(), ft_imputer(), ft_index_to_string(), ft_interaction(), ft_lsh, ft_max_abs_scaler(), ft_min_max_scaler(), ft_ngram(), ft_normalizer(), ft_one_hot_encoder_estimator(), ft_pca(), ft_polynomial_expansion(), ft_quantile_discretizer(), ft_r_formula(), ft_regex_tokenizer(), ft_robust_scaler(), ft_sql_transformer(), ft_standard_scaler(), ft_stop_words_remover(), ft_string_indexer(), ft_tokenizer(), ft_vector_assembler(), ft_vector_indexer(), ft_vector_slicer(), ft_word2vec()
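A minimal sketch that indexes a string column and one-hot encodes the resulting indices; assumes an open connection `sc`, dplyr attached, and an `iris_tbl` tbl_spark as used in the other examples:

iris_tbl %>%
  ft_string_indexer(input_col = "Species", output_col = "species_idx") %>%
  ft_one_hot_encoder(input_cols = "species_idx", output_cols = "species_onehot")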
A one-hot encoder that maps a column of category indices to a column of binary vectors, with at most a single one-value per row that indicates the input category index. For example with 5 categories, an input value of 2.0 would map to an output vector of [0.0, 0.0, 1.0, 0.0]. The last category is not included by default (configurable via dropLast), because it makes the vector entries sum up to one, and hence linearly dependent. So an input value of 4.0 maps to [0.0, 0.0, 0.0, 0.0].
ft_one_hot_encoder_estimator( x, input_cols = NULL, output_cols = NULL, handle_invalid = "error", drop_last = TRUE, uid = random_string("one_hot_encoder_estimator_"), ... )
x |
A |
input_cols |
Names of input columns. |
output_cols |
Names of output columns. |
handle_invalid |
(Spark 2.1.0+) Param for how to handle invalid entries. Options are 'skip' (filter out rows with invalid values), 'error' (throw an error), or 'keep' (keep invalid values in a special additional bucket). Default: "error" |
drop_last |
Whether to drop the last category. Defaults to |
uid |
A character string used to uniquely identify the feature transformer. |
... |
Optional arguments; currently unused. |
In the case where x is a tbl_spark, the estimator fits against x to obtain a transformer, returning a tbl_spark.
The object returned depends on the class of x. If it is a spark_connection, the function returns a ml_transformer or a ml_estimator object. If it is a ml_pipeline, it will return a pipeline with the transformer or estimator appended to it. If it is a tbl_spark, it will return a tbl_spark with the transformation applied to it.
Other feature transformers: ft_binarizer(), ft_bucketizer(), ft_chisq_selector(), ft_count_vectorizer(), ft_dct(), ft_elementwise_product(), ft_feature_hasher(), ft_hashing_tf(), ft_idf(), ft_imputer(), ft_index_to_string(), ft_interaction(), ft_lsh, ft_max_abs_scaler(), ft_min_max_scaler(), ft_ngram(), ft_normalizer(), ft_one_hot_encoder(), ft_pca(), ft_polynomial_expansion(), ft_quantile_discretizer(), ft_r_formula(), ft_regex_tokenizer(), ft_robust_scaler(), ft_sql_transformer(), ft_standard_scaler(), ft_stop_words_remover(), ft_string_indexer(), ft_tokenizer(), ft_vector_assembler(), ft_vector_indexer(), ft_vector_slicer(), ft_word2vec()
PCA trains a model to project vectors to a lower dimensional space of the top k principal components.
ft_pca( x, input_col = NULL, output_col = NULL, k = NULL, uid = random_string("pca_"), ... ) ml_pca(x, features = tbl_vars(x), k = length(features), pc_prefix = "PC", ...)
x |
A |
input_col |
The name of the input column. |
output_col |
The name of the output column. |
k |
The number of principal components |
uid |
A character string used to uniquely identify the feature transformer. |
... |
Optional arguments; currently unused. |
features |
The columns to use in the principal components
analysis. Defaults to all columns in |
pc_prefix |
Length-one character vector used to prepend names of components. |
In the case where x is a tbl_spark, the estimator fits against x to obtain a transformer, returning a tbl_spark.
ml_pca() is a wrapper around ft_pca() that returns a ml_model.
The object returned depends on the class of x. If it is a spark_connection, the function returns a ml_transformer or a ml_estimator object. If it is a ml_pipeline, it will return a pipeline with the transformer or estimator appended to it. If it is a tbl_spark, it will return a tbl_spark with the transformation applied to it.
Other feature transformers: ft_binarizer(), ft_bucketizer(), ft_chisq_selector(), ft_count_vectorizer(), ft_dct(), ft_elementwise_product(), ft_feature_hasher(), ft_hashing_tf(), ft_idf(), ft_imputer(), ft_index_to_string(), ft_interaction(), ft_lsh, ft_max_abs_scaler(), ft_min_max_scaler(), ft_ngram(), ft_normalizer(), ft_one_hot_encoder(), ft_one_hot_encoder_estimator(), ft_polynomial_expansion(), ft_quantile_discretizer(), ft_r_formula(), ft_regex_tokenizer(), ft_robust_scaler(), ft_sql_transformer(), ft_standard_scaler(), ft_stop_words_remover(), ft_string_indexer(), ft_tokenizer(), ft_vector_assembler(), ft_vector_indexer(), ft_vector_slicer(), ft_word2vec()
## Not run:
library(dplyr)
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
iris_tbl %>%
  select(-Species) %>%
  ml_pca(k = 2)
## End(Not run)
Perform feature expansion in a polynomial space. For example, take a 2-variable feature vector (x, y): expanding it with degree 2 yields (x, x * x, y, x * y, y * y).
ft_polynomial_expansion( x, input_col = NULL, output_col = NULL, degree = 2, uid = random_string("polynomial_expansion_"), ... )
x |
A |
input_col |
The name of the input column. |
output_col |
The name of the output column. |
degree |
The polynomial degree to expand, which should be greater than or equal to 1. A value of 1 means no expansion. Default: 2 |
uid |
A character string used to uniquely identify the feature transformer. |
... |
Optional arguments; currently unused. |
The object returned depends on the class of x. If it is a spark_connection, the function returns a ml_transformer or a ml_estimator object. If it is a ml_pipeline, it will return a pipeline with the transformer or estimator appended to it. If it is a tbl_spark, it will return a tbl_spark with the transformation applied to it.
Other feature transformers: ft_binarizer(), ft_bucketizer(), ft_chisq_selector(), ft_count_vectorizer(), ft_dct(), ft_elementwise_product(), ft_feature_hasher(), ft_hashing_tf(), ft_idf(), ft_imputer(), ft_index_to_string(), ft_interaction(), ft_lsh, ft_max_abs_scaler(), ft_min_max_scaler(), ft_ngram(), ft_normalizer(), ft_one_hot_encoder(), ft_one_hot_encoder_estimator(), ft_pca(), ft_quantile_discretizer(), ft_r_formula(), ft_regex_tokenizer(), ft_robust_scaler(), ft_sql_transformer(), ft_standard_scaler(), ft_stop_words_remover(), ft_string_indexer(), ft_tokenizer(), ft_vector_assembler(), ft_vector_indexer(), ft_vector_slicer(), ft_word2vec()
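A minimal sketch that expands (x, y) into degree-2 polynomial features; assumes an open connection `sc` and dplyr attached, with illustrative values:

df <- data.frame(x = 2, y = 3)
sdf_copy_to(sc, df, overwrite = TRUE) %>%
  ft_vector_assembler(input_cols = c("x", "y"), output_col = "features") %>%
  ft_polynomial_expansion(input_col = "features", output_col = "poly", degree = 2)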
ft_quantile_discretizer takes a column with continuous features and outputs a column with binned categorical features. The number of bins can be set using the num_buckets parameter. It is possible that the number of buckets used will be smaller than this value, for example, if there are too few distinct values of the input to create enough distinct quantiles.
ft_quantile_discretizer( x, input_col = NULL, output_col = NULL, num_buckets = 2, input_cols = NULL, output_cols = NULL, num_buckets_array = NULL, handle_invalid = "error", relative_error = 0.001, uid = random_string("quantile_discretizer_"), weight_column = NULL, ... )
x |
A |
input_col |
The name of the input column. |
output_col |
The name of the output column. |
num_buckets |
Number of buckets (quantiles, or categories) into which data points are grouped. Must be greater than or equal to 2. |
input_cols |
Names of input columns. |
output_cols |
Names of output columns. |
num_buckets_array |
Array of number of buckets (quantiles, or categories) into which data points are grouped. Each value must be greater than or equal to 2. |
handle_invalid |
(Spark 2.1.0+) Param for how to handle invalid entries. Options are 'skip' (filter out rows with invalid values), 'error' (throw an error), or 'keep' (keep invalid values in a special additional bucket). Default: "error" |
relative_error |
(Spark 2.0.0+) Relative error (see documentation for org.apache.spark.sql.DataFrameStatFunctions.approxQuantile here for description). Must be in the range [0, 1]. default: 0.001 |
uid |
A character string used to uniquely identify the feature transformer. |
weight_column |
If not NULL, then a generalized version of the Greenwald-Khanna algorithm will be run to compute weighted percentiles, with each input having a relative weight specified by the corresponding value in 'weight_column'. The weights can be considered as relative frequencies of sample inputs. |
... |
Optional arguments; currently unused. |
NaN handling: null and NaN values are ignored in the column during QuantileDiscretizer fitting. Fitting produces a Bucketizer model for making predictions. During the transformation, Bucketizer will raise an error when it finds NaN values in the dataset, but the user can choose to either keep or remove NaN values within the dataset by setting handle_invalid. If the user chooses to keep NaN values, they will be handled specially and placed into their own bucket; for example, if 4 buckets are used, then non-NaN data will be put into buckets[0-3], but NaNs will be counted in a special bucket[4].
Algorithm: The bin ranges are chosen using an approximate algorithm (see
the documentation for org.apache.spark.sql.DataFrameStatFunctions.approxQuantile
here for a detailed description). The precision of the approximation can be
controlled with the relative_error
parameter. The lower and upper bin
bounds will be -Infinity and +Infinity, covering all real values.
Note that the result may be different every time you run it, since the sample strategy behind it is non-deterministic.
In the case where x is a tbl_spark, the estimator fits against x to obtain a transformer, returning a tbl_spark.

The object returned depends on the class of x. If it is a spark_connection, the function returns a ml_transformer or a ml_estimator object. If it is a ml_pipeline, it will return a pipeline with the transformer or estimator appended to it. If a tbl_spark, it will return a tbl_spark with the transformation applied to it.
Other feature transformers:
ft_binarizer()
,
ft_bucketizer()
,
ft_chisq_selector()
,
ft_count_vectorizer()
,
ft_dct()
,
ft_elementwise_product()
,
ft_feature_hasher()
,
ft_hashing_tf()
,
ft_idf()
,
ft_imputer()
,
ft_index_to_string()
,
ft_interaction()
,
ft_lsh
,
ft_max_abs_scaler()
,
ft_min_max_scaler()
,
ft_ngram()
,
ft_normalizer()
,
ft_one_hot_encoder()
,
ft_one_hot_encoder_estimator()
,
ft_pca()
,
ft_polynomial_expansion()
,
ft_r_formula()
,
ft_regex_tokenizer()
,
ft_robust_scaler()
,
ft_sql_transformer()
,
ft_standard_scaler()
,
ft_stop_words_remover()
,
ft_string_indexer()
,
ft_tokenizer()
,
ft_vector_assembler()
,
ft_vector_indexer()
,
ft_vector_slicer()
,
ft_word2vec()
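A brief illustrative example (not from the upstream documentation), assuming a local connection; the output column name "Sepal_Length_bucket" is arbitrary.

## Not run: 
library(sparklyr)
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)

# bin Sepal_Length into three approximately equal-frequency buckets
iris_tbl %>%
  ft_quantile_discretizer(
    input_col = "Sepal_Length",
    output_col = "Sepal_Length_bucket",
    num_buckets = 3
  )
## End(Not run)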
Implements the transforms required for fitting a dataset against an R model formula. Currently we support a limited subset of the R operators, including ~, ., :, +, and -.
ft_r_formula( x, formula = NULL, features_col = "features", label_col = "label", force_index_label = FALSE, uid = random_string("r_formula_"), ... )
x |
A |
formula |
R formula as a character string or a formula. Formula objects are converted to character strings directly and the environment is not captured. |
features_col |
Features column name, as a length-one character vector. The column should be single vector column of numeric values. Usually this column is output by |
label_col |
Label column name. The column should be a numeric column. Usually this column is output by |
force_index_label |
(Spark 2.1.0+) Whether to force indexing of the label column even when it is numeric. Usually the label is indexed only when it is of string type; if the formula is used by classification algorithms, you can force a numeric label to be indexed by setting this parameter to TRUE.
Default: FALSE |
uid |
A character string used to uniquely identify the feature transformer. |
... |
Optional arguments; currently unused. |
The basic operators in the formula are:
~ separate target and terms
+ concat terms, "+ 0" means removing intercept
- remove a term, "- 1" means removing intercept
: interaction (multiplication for numeric values, or binarized categorical values)
. all columns except target
Suppose a and b are double columns; the following simple examples illustrate the effect of RFormula:
y ~ a + b means model y ~ w0 + w1 * a + w2 * b, where w0 is the intercept and w1, w2 are coefficients.
y ~ a + b + a:b - 1 means model y ~ w1 * a + w2 * b + w3 * a * b, where w1, w2, w3 are coefficients.
RFormula produces a vector column of features and a double or string column of label. Like when formulas are used in R for linear regression, string input columns will be one-hot encoded, and numeric columns will be cast to doubles. If the label column is of type string, it will be first transformed to double with StringIndexer. If the label column does not exist in the DataFrame, the output label column will be created from the specified response variable in the formula.
In the case where x is a tbl_spark, the estimator fits against x to obtain a transformer, returning a tbl_spark.

The object returned depends on the class of x. If it is a spark_connection, the function returns a ml_transformer or a ml_estimator object. If it is a ml_pipeline, it will return a pipeline with the transformer or estimator appended to it. If a tbl_spark, it will return a tbl_spark with the transformation applied to it.
Other feature transformers:
ft_binarizer()
,
ft_bucketizer()
,
ft_chisq_selector()
,
ft_count_vectorizer()
,
ft_dct()
,
ft_elementwise_product()
,
ft_feature_hasher()
,
ft_hashing_tf()
,
ft_idf()
,
ft_imputer()
,
ft_index_to_string()
,
ft_interaction()
,
ft_lsh
,
ft_max_abs_scaler()
,
ft_min_max_scaler()
,
ft_ngram()
,
ft_normalizer()
,
ft_one_hot_encoder()
,
ft_one_hot_encoder_estimator()
,
ft_pca()
,
ft_polynomial_expansion()
,
ft_quantile_discretizer()
,
ft_regex_tokenizer()
,
ft_robust_scaler()
,
ft_sql_transformer()
,
ft_standard_scaler()
,
ft_stop_words_remover()
,
ft_string_indexer()
,
ft_tokenizer()
,
ft_vector_assembler()
,
ft_vector_indexer()
,
ft_vector_slicer()
,
ft_word2vec()
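A short illustrative example (not from the upstream documentation), assuming a local connection; it produces the standard "features" and "label" columns from an R formula.

## Not run: 
library(sparklyr)
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)

# build "features" and "label" columns; the string label (Species)
# is indexed automatically
iris_tbl %>%
  ft_r_formula(Species ~ Petal_Length + Petal_Width)
## End(Not run)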
A regex-based tokenizer that extracts tokens either by using the provided regex pattern to split the text (the default) or by repeatedly matching the regex (if gaps is FALSE). Optional parameters also allow filtering tokens by a minimum length. It returns an array of strings that can be empty.
ft_regex_tokenizer( x, input_col = NULL, output_col = NULL, gaps = TRUE, min_token_length = 1, pattern = "\\s+", to_lower_case = TRUE, uid = random_string("regex_tokenizer_"), ... )
x |
A |
input_col |
The name of the input column. |
output_col |
The name of the output column. |
gaps |
Indicates whether regex splits on gaps (TRUE) or matches tokens (FALSE). |
min_token_length |
Minimum token length, greater than or equal to 0. |
pattern |
The regular expression pattern to be used. |
to_lower_case |
Indicates whether to convert all characters to lowercase before tokenizing. |
uid |
A character string used to uniquely identify the feature transformer. |
... |
Optional arguments; currently unused. |
The object returned depends on the class of x. If it is a spark_connection, the function returns a ml_transformer or a ml_estimator object. If it is a ml_pipeline, it will return a pipeline with the transformer or estimator appended to it. If a tbl_spark, it will return a tbl_spark with the transformation applied to it.
Other feature transformers:
ft_binarizer()
,
ft_bucketizer()
,
ft_chisq_selector()
,
ft_count_vectorizer()
,
ft_dct()
,
ft_elementwise_product()
,
ft_feature_hasher()
,
ft_hashing_tf()
,
ft_idf()
,
ft_imputer()
,
ft_index_to_string()
,
ft_interaction()
,
ft_lsh
,
ft_max_abs_scaler()
,
ft_min_max_scaler()
,
ft_ngram()
,
ft_normalizer()
,
ft_one_hot_encoder()
,
ft_one_hot_encoder_estimator()
,
ft_pca()
,
ft_polynomial_expansion()
,
ft_quantile_discretizer()
,
ft_r_formula()
,
ft_robust_scaler()
,
ft_sql_transformer()
,
ft_standard_scaler()
,
ft_stop_words_remover()
,
ft_string_indexer()
,
ft_tokenizer()
,
ft_vector_assembler()
,
ft_vector_indexer()
,
ft_vector_slicer()
,
ft_word2vec()
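An illustrative sketch (not from the upstream documentation); the sample lines, table name, and pattern are made up for the example.

## Not run: 
library(sparklyr)
sc <- spark_connect(master = "local")
text_tbl <- sdf_copy_to(
  sc,
  dplyr::tibble(line = c("spark,sparklyr,dplyr", "r;python;scala")),
  name = "text_tbl",
  overwrite = TRUE
)

# split each line on commas or semicolons (gaps = TRUE is the default)
text_tbl %>%
  ft_regex_tokenizer(
    input_col = "line",
    output_col = "tokens",
    pattern = "[,;]"
  )
## End(Not run)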
RobustScaler removes the median and scales the data according to the quantile range. The quantile range is by default IQR (Interquartile Range, quantile range between the 1st quartile = 25th quantile and the 3rd quartile = 75th quantile) but can be configured. Centering and scaling happen independently on each feature by computing the relevant statistics on the samples in the training set. Median and quantile range are then stored to be used on later data using the transform method. Note that missing values are ignored in the computation of medians and ranges.
ft_robust_scaler( x, input_col = NULL, output_col = NULL, lower = 0.25, upper = 0.75, with_centering = TRUE, with_scaling = TRUE, relative_error = 0.001, uid = random_string("ft_robust_scaler_"), ... )
x |
A |
input_col |
The name of the input column. |
output_col |
The name of the output column. |
lower |
Lower quantile to calculate quantile range. |
upper |
Upper quantile to calculate quantile range. |
with_centering |
Whether to center data with median. |
with_scaling |
Whether to scale the data to quantile range. |
relative_error |
The target relative error for quantile computation. |
uid |
A character string used to uniquely identify the feature transformer. |
... |
Optional arguments; currently unused. |
In the case where x is a tbl_spark, the estimator fits against x to obtain a transformer, returning a tbl_spark.

The object returned depends on the class of x. If it is a spark_connection, the function returns a ml_transformer or a ml_estimator object. If it is a ml_pipeline, it will return a pipeline with the transformer or estimator appended to it. If a tbl_spark, it will return a tbl_spark with the transformation applied to it.
Other feature transformers:
ft_binarizer()
,
ft_bucketizer()
,
ft_chisq_selector()
,
ft_count_vectorizer()
,
ft_dct()
,
ft_elementwise_product()
,
ft_feature_hasher()
,
ft_hashing_tf()
,
ft_idf()
,
ft_imputer()
,
ft_index_to_string()
,
ft_interaction()
,
ft_lsh
,
ft_max_abs_scaler()
,
ft_min_max_scaler()
,
ft_ngram()
,
ft_normalizer()
,
ft_one_hot_encoder()
,
ft_one_hot_encoder_estimator()
,
ft_pca()
,
ft_polynomial_expansion()
,
ft_quantile_discretizer()
,
ft_r_formula()
,
ft_regex_tokenizer()
,
ft_sql_transformer()
,
ft_standard_scaler()
,
ft_stop_words_remover()
,
ft_string_indexer()
,
ft_tokenizer()
,
ft_vector_assembler()
,
ft_vector_indexer()
,
ft_vector_slicer()
,
ft_word2vec()
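A minimal sketch (not from the upstream documentation), assuming Spark 3.0 or newer since RobustScaler was added in that release; the intermediate column name "features_temp" is arbitrary.

## Not run: 
library(sparklyr)
sc <- spark_connect(master = "local", version = "3.0.0")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)

iris_tbl %>%
  ft_vector_assembler(
    input_cols = c("Sepal_Length", "Sepal_Width"),
    output_col = "features_temp"
  ) %>%
  ft_robust_scaler(
    input_col = "features_temp",
    output_col = "features"
  )
## End(Not run)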
Implements the transformations that are defined by a SQL statement. Currently we only support SQL syntax like 'SELECT ... FROM __THIS__ ...', where '__THIS__' represents the underlying table of the input dataset. The select clause specifies the fields, constants, and expressions to display in the output; it can be any select clause that Spark SQL supports. Users can also use Spark SQL built-in functions and UDFs to operate on these selected columns.
ft_sql_transformer( x, statement = NULL, uid = random_string("sql_transformer_"), ... ) ft_dplyr_transformer(x, tbl, uid = random_string("dplyr_transformer_"), ...)
x |
A |
statement |
A SQL statement. |
uid |
A character string used to uniquely identify the feature transformer. |
... |
Optional arguments; currently unused. |
tbl |
A |
ft_dplyr_transformer() is mostly a wrapper around ft_sql_transformer() that takes a tbl_spark instead of a SQL statement. Internally, ft_dplyr_transformer() extracts the dplyr transformations used to generate tbl as a SQL statement or a sampling operation. Note that only single-table dplyr verbs are supported and that the sdf_ family of functions is not.
The object returned depends on the class of x. If it is a spark_connection, the function returns a ml_transformer or a ml_estimator object. If it is a ml_pipeline, it will return a pipeline with the transformer or estimator appended to it. If a tbl_spark, it will return a tbl_spark with the transformation applied to it.
Other feature transformers:
ft_binarizer()
,
ft_bucketizer()
,
ft_chisq_selector()
,
ft_count_vectorizer()
,
ft_dct()
,
ft_elementwise_product()
,
ft_feature_hasher()
,
ft_hashing_tf()
,
ft_idf()
,
ft_imputer()
,
ft_index_to_string()
,
ft_interaction()
,
ft_lsh
,
ft_max_abs_scaler()
,
ft_min_max_scaler()
,
ft_ngram()
,
ft_normalizer()
,
ft_one_hot_encoder()
,
ft_one_hot_encoder_estimator()
,
ft_pca()
,
ft_polynomial_expansion()
,
ft_quantile_discretizer()
,
ft_r_formula()
,
ft_regex_tokenizer()
,
ft_robust_scaler()
,
ft_standard_scaler()
,
ft_stop_words_remover()
,
ft_string_indexer()
,
ft_tokenizer()
,
ft_vector_assembler()
,
ft_vector_indexer()
,
ft_vector_slicer()
,
ft_word2vec()
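An illustrative sketch (not from the upstream documentation) showing both variants inside a pipeline; the derived column Petal_Area is made up for the example.

## Not run: 
library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)

# SQL variant: __THIS__ stands for the input dataset
sql_pipeline <- ml_pipeline(sc) %>%
  ft_sql_transformer(
    "SELECT *, Petal_Length * Petal_Width AS Petal_Area FROM __THIS__"
  )

# dplyr variant: derive the equivalent transformer from a dplyr pipeline
mutated_tbl <- iris_tbl %>%
  mutate(Petal_Area = Petal_Length * Petal_Width)
dplyr_pipeline <- ml_pipeline(sc) %>%
  ft_dplyr_transformer(mutated_tbl)
## End(Not run)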
Standardizes features by removing the mean and scaling to unit variance using column summary statistics on the samples in the training set. The "unit std" is computed using the corrected sample standard deviation, which is computed as the square root of the unbiased sample variance.
ft_standard_scaler( x, input_col = NULL, output_col = NULL, with_mean = FALSE, with_std = TRUE, uid = random_string("standard_scaler_"), ... )
x |
A |
input_col |
The name of the input column. |
output_col |
The name of the output column. |
with_mean |
Whether to center the data with mean before scaling. It will build a dense output, so take care when applying to sparse input. Default: FALSE |
with_std |
Whether to scale the data to unit standard deviation. Default: TRUE |
uid |
A character string used to uniquely identify the feature transformer. |
... |
Optional arguments; currently unused. |
In the case where x is a tbl_spark, the estimator fits against x to obtain a transformer, returning a tbl_spark.

The object returned depends on the class of x. If it is a spark_connection, the function returns a ml_transformer or a ml_estimator object. If it is a ml_pipeline, it will return a pipeline with the transformer or estimator appended to it. If a tbl_spark, it will return a tbl_spark with the transformation applied to it.
Other feature transformers:
ft_binarizer()
,
ft_bucketizer()
,
ft_chisq_selector()
,
ft_count_vectorizer()
,
ft_dct()
,
ft_elementwise_product()
,
ft_feature_hasher()
,
ft_hashing_tf()
,
ft_idf()
,
ft_imputer()
,
ft_index_to_string()
,
ft_interaction()
,
ft_lsh
,
ft_max_abs_scaler()
,
ft_min_max_scaler()
,
ft_ngram()
,
ft_normalizer()
,
ft_one_hot_encoder()
,
ft_one_hot_encoder_estimator()
,
ft_pca()
,
ft_polynomial_expansion()
,
ft_quantile_discretizer()
,
ft_r_formula()
,
ft_regex_tokenizer()
,
ft_robust_scaler()
,
ft_sql_transformer()
,
ft_stop_words_remover()
,
ft_string_indexer()
,
ft_tokenizer()
,
ft_vector_assembler()
,
ft_vector_indexer()
,
ft_vector_slicer()
,
ft_word2vec()
## Not run: sc <- spark_connect(master = "local") iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE) features <- c("Sepal_Length", "Sepal_Width", "Petal_Length", "Petal_Width") iris_tbl %>% ft_vector_assembler( input_col = features, output_col = "features_temp" ) %>% ft_standard_scaler( input_col = "features_temp", output_col = "features", with_mean = TRUE ) ## End(Not run)
A feature transformer that filters out stop words from input.
ft_stop_words_remover( x, input_col = NULL, output_col = NULL, case_sensitive = FALSE, stop_words = ml_default_stop_words(spark_connection(x), "english"), uid = random_string("stop_words_remover_"), ... )
x |
A |
input_col |
The name of the input column. |
output_col |
The name of the output column. |
case_sensitive |
Whether to do a case sensitive comparison over the stop words. |
stop_words |
The words to be filtered out. |
uid |
A character string used to uniquely identify the feature transformer. |
... |
Optional arguments; currently unused. |
The object returned depends on the class of x. If it is a spark_connection, the function returns a ml_transformer or a ml_estimator object. If it is a ml_pipeline, it will return a pipeline with the transformer or estimator appended to it. If a tbl_spark, it will return a tbl_spark with the transformation applied to it.
Other feature transformers:
ft_binarizer()
,
ft_bucketizer()
,
ft_chisq_selector()
,
ft_count_vectorizer()
,
ft_dct()
,
ft_elementwise_product()
,
ft_feature_hasher()
,
ft_hashing_tf()
,
ft_idf()
,
ft_imputer()
,
ft_index_to_string()
,
ft_interaction()
,
ft_lsh
,
ft_max_abs_scaler()
,
ft_min_max_scaler()
,
ft_ngram()
,
ft_normalizer()
,
ft_one_hot_encoder()
,
ft_one_hot_encoder_estimator()
,
ft_pca()
,
ft_polynomial_expansion()
,
ft_quantile_discretizer()
,
ft_r_formula()
,
ft_regex_tokenizer()
,
ft_robust_scaler()
,
ft_sql_transformer()
,
ft_standard_scaler()
,
ft_string_indexer()
,
ft_tokenizer()
,
ft_vector_assembler()
,
ft_vector_indexer()
,
ft_vector_slicer()
,
ft_word2vec()
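A small illustrative example (not from the upstream documentation); the sentences and column names are made up.

## Not run: 
library(sparklyr)
sc <- spark_connect(master = "local")
text_tbl <- sdf_copy_to(
  sc,
  dplyr::tibble(line = c("the quick brown fox", "jumps over the lazy dog")),
  name = "text_tbl",
  overwrite = TRUE
)

text_tbl %>%
  ft_tokenizer(input_col = "line", output_col = "words") %>%
  ft_stop_words_remover(input_col = "words", output_col = "words_filtered")
## End(Not run)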
A label indexer that maps a string column of labels to an ML column of label indices. If the input column is numeric, we cast it to string and index the string values. The indices are in [0, numLabels), ordered by label frequencies, so the most frequent label gets index 0. This function is the inverse of ft_index_to_string.
ft_string_indexer( x, input_col = NULL, output_col = NULL, handle_invalid = "error", string_order_type = "frequencyDesc", uid = random_string("string_indexer_"), ... ) ml_labels(model) ft_string_indexer_model( x, input_col = NULL, output_col = NULL, labels, handle_invalid = "error", uid = random_string("string_indexer_model_"), ... )
x |
A |
input_col |
The name of the input column. |
output_col |
The name of the output column. |
handle_invalid |
(Spark 2.1.0+) Param for how to handle invalid entries. Options are 'skip' (filter out rows with invalid values), 'error' (throw an error), or 'keep' (keep invalid values in a special additional bucket). Default: "error" |
string_order_type |
(Spark 2.3+) How to order labels of the string column.
The first label after ordering is assigned an index of 0. Options are
|
uid |
A character string used to uniquely identify the feature transformer. |
... |
Optional arguments; currently unused. |
model |
A fitted StringIndexer model returned by |
labels |
Vector of labels, corresponding to indices to be assigned. |
In the case where x is a tbl_spark, the estimator fits against x to obtain a transformer, returning a tbl_spark.

The object returned depends on the class of x. If it is a spark_connection, the function returns a ml_transformer or a ml_estimator object. If it is a ml_pipeline, it will return a pipeline with the transformer or estimator appended to it. If a tbl_spark, it will return a tbl_spark with the transformation applied to it.

ml_labels() returns a vector of labels, corresponding to indices to be assigned.
Other feature transformers:
ft_binarizer()
,
ft_bucketizer()
,
ft_chisq_selector()
,
ft_count_vectorizer()
,
ft_dct()
,
ft_elementwise_product()
,
ft_feature_hasher()
,
ft_hashing_tf()
,
ft_idf()
,
ft_imputer()
,
ft_index_to_string()
,
ft_interaction()
,
ft_lsh
,
ft_max_abs_scaler()
,
ft_min_max_scaler()
,
ft_ngram()
,
ft_normalizer()
,
ft_one_hot_encoder()
,
ft_one_hot_encoder_estimator()
,
ft_pca()
,
ft_polynomial_expansion()
,
ft_quantile_discretizer()
,
ft_r_formula()
,
ft_regex_tokenizer()
,
ft_robust_scaler()
,
ft_sql_transformer()
,
ft_standard_scaler()
,
ft_stop_words_remover()
,
ft_tokenizer()
,
ft_vector_assembler()
,
ft_vector_indexer()
,
ft_vector_slicer()
,
ft_word2vec()
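An illustrative sketch (not from the upstream documentation); the output column name "species_idx" is arbitrary, and ml_fit() is used only to obtain a fitted model whose labels can be inspected with ml_labels().

## Not run: 
library(sparklyr)
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)

# apply the indexer directly to a tbl_spark
iris_tbl %>%
  ft_string_indexer(input_col = "Species", output_col = "species_idx")

# or fit the estimator explicitly and inspect the labels it learned
indexer_model <- ft_string_indexer(
  sc,
  input_col = "Species",
  output_col = "species_idx"
) %>%
  ml_fit(iris_tbl)
ml_labels(indexer_model)
## End(Not run)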
A tokenizer that converts the input string to lowercase and then splits it by white spaces.
ft_tokenizer( x, input_col = NULL, output_col = NULL, uid = random_string("tokenizer_"), ... )
x |
A |
input_col |
The name of the input column. |
output_col |
The name of the output column. |
uid |
A character string used to uniquely identify the feature transformer. |
... |
Optional arguments; currently unused. |
The object returned depends on the class of x. If it is a spark_connection, the function returns a ml_transformer or a ml_estimator object. If it is a ml_pipeline, it will return a pipeline with the transformer or estimator appended to it. If a tbl_spark, it will return a tbl_spark with the transformation applied to it.
Other feature transformers:
ft_binarizer()
,
ft_bucketizer()
,
ft_chisq_selector()
,
ft_count_vectorizer()
,
ft_dct()
,
ft_elementwise_product()
,
ft_feature_hasher()
,
ft_hashing_tf()
,
ft_idf()
,
ft_imputer()
,
ft_index_to_string()
,
ft_interaction()
,
ft_lsh
,
ft_max_abs_scaler()
,
ft_min_max_scaler()
,
ft_ngram()
,
ft_normalizer()
,
ft_one_hot_encoder()
,
ft_one_hot_encoder_estimator()
,
ft_pca()
,
ft_polynomial_expansion()
,
ft_quantile_discretizer()
,
ft_r_formula()
,
ft_regex_tokenizer()
,
ft_robust_scaler()
,
ft_sql_transformer()
,
ft_standard_scaler()
,
ft_stop_words_remover()
,
ft_string_indexer()
,
ft_vector_assembler()
,
ft_vector_indexer()
,
ft_vector_slicer()
,
ft_word2vec()
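A minimal illustrative example (not from the upstream documentation); the sentences and column names are made up.

## Not run: 
library(sparklyr)
sc <- spark_connect(master = "local")
sdf_copy_to(
  sc,
  dplyr::tibble(line = c("Hello World", "Logistic regression models are neat")),
  name = "text_tbl",
  overwrite = TRUE
) %>%
  ft_tokenizer(input_col = "line", output_col = "words")
## End(Not run)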
Combine multiple vectors into a single row-vector; that is, each row element of the newly generated column is a vector formed by concatenating the corresponding row elements of the specified input columns.
ft_vector_assembler( x, input_cols = NULL, output_col = NULL, uid = random_string("vector_assembler_"), ... )
x |
A |
input_cols |
The names of the input columns |
output_col |
The name of the output column. |
uid |
A character string used to uniquely identify the feature transformer. |
... |
Optional arguments; currently unused. |
The object returned depends on the class of x. If it is a spark_connection, the function returns a ml_transformer or a ml_estimator object. If it is a ml_pipeline, it will return a pipeline with the transformer or estimator appended to it. If a tbl_spark, it will return a tbl_spark with the transformation applied to it.
Other feature transformers:
ft_binarizer()
,
ft_bucketizer()
,
ft_chisq_selector()
,
ft_count_vectorizer()
,
ft_dct()
,
ft_elementwise_product()
,
ft_feature_hasher()
,
ft_hashing_tf()
,
ft_idf()
,
ft_imputer()
,
ft_index_to_string()
,
ft_interaction()
,
ft_lsh
,
ft_max_abs_scaler()
,
ft_min_max_scaler()
,
ft_ngram()
,
ft_normalizer()
,
ft_one_hot_encoder()
,
ft_one_hot_encoder_estimator()
,
ft_pca()
,
ft_polynomial_expansion()
,
ft_quantile_discretizer()
,
ft_r_formula()
,
ft_regex_tokenizer()
,
ft_robust_scaler()
,
ft_sql_transformer()
,
ft_standard_scaler()
,
ft_stop_words_remover()
,
ft_string_indexer()
,
ft_tokenizer()
,
ft_vector_indexer()
,
ft_vector_slicer()
,
ft_word2vec()
Indexes categorical feature columns in a dataset of Vector.
ft_vector_indexer( x, input_col = NULL, output_col = NULL, handle_invalid = "error", max_categories = 20, uid = random_string("vector_indexer_"), ... )
x |
A |
input_col |
The name of the input column. |
output_col |
The name of the output column. |
handle_invalid |
(Spark 2.1.0+) Param for how to handle invalid entries. Options are 'skip' (filter out rows with invalid values), 'error' (throw an error), or 'keep' (keep invalid values in a special additional bucket). Default: "error" |
max_categories |
Threshold for the number of values a categorical feature can take. If a feature is found to have > |
uid |
A character string used to uniquely identify the feature transformer. |
... |
Optional arguments; currently unused. |
In the case where x is a tbl_spark, the estimator fits against x to obtain a transformer, returning a tbl_spark.

The object returned depends on the class of x. If it is a spark_connection, the function returns a ml_transformer or a ml_estimator object. If it is a ml_pipeline, it will return a pipeline with the transformer or estimator appended to it. If a tbl_spark, it will return a tbl_spark with the transformation applied to it.
Other feature transformers:
ft_binarizer()
,
ft_bucketizer()
,
ft_chisq_selector()
,
ft_count_vectorizer()
,
ft_dct()
,
ft_elementwise_product()
,
ft_feature_hasher()
,
ft_hashing_tf()
,
ft_idf()
,
ft_imputer()
,
ft_index_to_string()
,
ft_interaction()
,
ft_lsh
,
ft_max_abs_scaler()
,
ft_min_max_scaler()
,
ft_ngram()
,
ft_normalizer()
,
ft_one_hot_encoder()
,
ft_one_hot_encoder_estimator()
,
ft_pca()
,
ft_polynomial_expansion()
,
ft_quantile_discretizer()
,
ft_r_formula()
,
ft_regex_tokenizer()
,
ft_robust_scaler()
,
ft_sql_transformer()
,
ft_standard_scaler()
,
ft_stop_words_remover()
,
ft_string_indexer()
,
ft_tokenizer()
,
ft_vector_assembler()
,
ft_vector_slicer()
,
ft_word2vec()
Takes a feature vector and outputs a new feature vector with a subarray of the original features.
ft_vector_slicer( x, input_col = NULL, output_col = NULL, indices = NULL, uid = random_string("vector_slicer_"), ... )
x |
A |
input_col |
The name of the input column. |
output_col |
The name of the output column. |
indices |
A vector of indices to select features from a vector column. Note that the indices are 0-based. |
uid |
A character string used to uniquely identify the feature transformer. |
... |
Optional arguments; currently unused. |
The object returned depends on the class of x. If it is a spark_connection, the function returns a ml_transformer or a ml_estimator object. If it is a ml_pipeline, it will return a pipeline with the transformer or estimator appended to it. If a tbl_spark, it will return a tbl_spark with the transformation applied to it.
Other feature transformers:
ft_binarizer()
,
ft_bucketizer()
,
ft_chisq_selector()
,
ft_count_vectorizer()
,
ft_dct()
,
ft_elementwise_product()
,
ft_feature_hasher()
,
ft_hashing_tf()
,
ft_idf()
,
ft_imputer()
,
ft_index_to_string()
,
ft_interaction()
,
ft_lsh
,
ft_max_abs_scaler()
,
ft_min_max_scaler()
,
ft_ngram()
,
ft_normalizer()
,
ft_one_hot_encoder()
,
ft_one_hot_encoder_estimator()
,
ft_pca()
,
ft_polynomial_expansion()
,
ft_quantile_discretizer()
,
ft_r_formula()
,
ft_regex_tokenizer()
,
ft_robust_scaler()
,
ft_sql_transformer()
,
ft_standard_scaler()
,
ft_stop_words_remover()
,
ft_string_indexer()
,
ft_tokenizer()
,
ft_vector_assembler()
,
ft_vector_indexer()
,
ft_word2vec()
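A brief illustrative example (not from the upstream documentation); indices 2 and 3 pick the petal measurements out of the assembled vector, and the column names are arbitrary.

## Not run: 
library(sparklyr)
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)

iris_tbl %>%
  ft_vector_assembler(
    input_cols = c("Sepal_Length", "Sepal_Width", "Petal_Length", "Petal_Width"),
    output_col = "features"
  ) %>%
  # 0-based indices: keep Petal_Length and Petal_Width
  ft_vector_slicer(
    input_col = "features",
    output_col = "petal_features",
    indices = c(2L, 3L)
  )
## End(Not run)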
Word2Vec transforms a word into a code for use in further natural language processing or machine learning tasks.
ft_word2vec( x, input_col = NULL, output_col = NULL, vector_size = 100, min_count = 5, max_sentence_length = 1000, num_partitions = 1, step_size = 0.025, max_iter = 1, seed = NULL, uid = random_string("word2vec_"), ... ) ml_find_synonyms(model, word, num)
x |
A |
input_col |
The name of the input column. |
output_col |
The name of the output column. |
vector_size |
The dimension of the code that you want to transform from words. Default: 100 |
min_count |
The minimum number of times a token must appear to be included in the word2vec model's vocabulary. Default: 5 |
max_sentence_length |
(Spark 2.0.0+) Sets the maximum length (in words) of each sentence
in the input data. Any sentence longer than this threshold will be divided into
chunks of up to |
num_partitions |
Number of partitions for sentences of words. Default: 1 |
step_size |
Param for Step size to be used for each iteration of optimization (> 0). |
max_iter |
The maximum number of iterations to use. |
seed |
A random seed. Set this value if you need your results to be reproducible across repeated calls. |
uid |
A character string used to uniquely identify the feature transformer. |
... |
Optional arguments; currently unused. |
model |
A fitted |
word |
A word, as a length-one character vector. |
num |
Number of words closest in similarity to the given word to find. |
In the case where x is a tbl_spark, the estimator fits against x to obtain a transformer, returning a tbl_spark.

The object returned depends on the class of x. If it is a spark_connection, the function returns a ml_transformer or a ml_estimator object. If it is a ml_pipeline, it will return a pipeline with the transformer or estimator appended to it. If a tbl_spark, it will return a tbl_spark with the transformation applied to it.

ml_find_synonyms() returns a DataFrame of synonyms and cosine similarities.
Other feature transformers:
ft_binarizer()
,
ft_bucketizer()
,
ft_chisq_selector()
,
ft_count_vectorizer()
,
ft_dct()
,
ft_elementwise_product()
,
ft_feature_hasher()
,
ft_hashing_tf()
,
ft_idf()
,
ft_imputer()
,
ft_index_to_string()
,
ft_interaction()
,
ft_lsh
,
ft_max_abs_scaler()
,
ft_min_max_scaler()
,
ft_ngram()
,
ft_normalizer()
,
ft_one_hot_encoder()
,
ft_one_hot_encoder_estimator()
,
ft_pca()
,
ft_polynomial_expansion()
,
ft_quantile_discretizer()
,
ft_r_formula()
,
ft_regex_tokenizer()
,
ft_robust_scaler()
,
ft_sql_transformer()
,
ft_standard_scaler()
,
ft_stop_words_remover()
,
ft_string_indexer()
,
ft_tokenizer()
,
ft_vector_assembler()
,
ft_vector_indexer()
,
ft_vector_slicer()
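An illustrative sketch (not from the upstream documentation); the tiny corpus, small vector_size, and min_count = 1 are chosen only so the example runs on toy data.

## Not run: 
library(sparklyr)
sc <- spark_connect(master = "local")
docs_tbl <- sdf_copy_to(
  sc,
  dplyr::tibble(
    text = c(
      "Hi I heard about Spark",
      "I wish Java could use case classes",
      "Logistic regression models are neat"
    )
  ),
  name = "docs_tbl",
  overwrite = TRUE
)

tokenized_tbl <- docs_tbl %>%
  ft_tokenizer(input_col = "text", output_col = "tokens")

w2v_model <- ft_word2vec(
  sc,
  input_col = "tokens",
  output_col = "word_vectors",
  vector_size = 15,
  min_count = 1
) %>%
  ml_fit(tokenized_tbl)

# words most similar to "spark" (tokens are lowercased by ft_tokenizer)
ml_find_synonyms(w2v_model, "spark", num = 2)
## End(Not run)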
Generic Call Interface
sc |
|
static |
Is this a static method call (including a constructor). If so
then the |
object |
Object instance or name of class (for |
method |
Name of method |
... |
Call parameters |
Retrieve the Spark connection's SQL catalog implementation property
get_spark_sql_catalog_implementation(sc)
sc |
|
spark.sql.catalogImplementation property from the connection's runtime configuration
Retrieves the runtime configuration interface for Hive.
hive_context_config(sc)
sc |
A |
Apply an element-wise aggregation function to an array column (this is essentially a dplyr wrapper for the aggregate(array<T>, A, function<A, T, A>[, function<A, R>]): R built-in Spark SQL function).
hof_aggregate( x, start, merge, finish = NULL, expr = NULL, dest_col = NULL, ... )
x |
The Spark data frame to run aggregation on |
start |
The starting value of the aggregation |
merge |
The aggregation function |
finish |
Optional param specifying a transformation to apply on the final value of the aggregation |
expr |
The array being aggregated, could be any SQL expression evaluating to an array (default: the last column of the Spark data frame) |
dest_col |
Column to store the aggregated result (default: expr) |
... |
Additional params to dplyr::mutate |
## Not run: library(sparklyr) sc <- spark_connect(master = "local") # concatenates all numbers of each array in `array_column` and add parentheses # around the resulting string copy_to(sc, dplyr::tibble(array_column = list(1:5, 21:25))) %>% hof_aggregate( start = "", merge = ~ CONCAT(.y, .x), finish = ~ CONCAT("(", .x, ")") ) ## End(Not run)
Applies a custom comparator function to sort an array (this is essentially a dplyr wrapper to the 'array_sort(expr, func)' higher- order function, which is supported since Spark 3.0)
hof_array_sort(x, func, expr = NULL, dest_col = NULL, ...)
x |
The Spark data frame to be processed |
func |
The comparator function to apply (it should take 2 array elements as arguments and return an integer, with a return value of -1 indicating the first element is less than the second, 0 indicating equality, or 1 indicating the first element is greater than the second) |
expr |
The array being sorted, could be any SQL expression evaluating to an array (default: the last column of the Spark data frame) |
dest_col |
Column to store the sorted result (default: expr) |
... |
Additional params to dplyr::mutate |
## Not run: library(sparklyr) sc <- spark_connect(master = "local", version = "3.0.0") copy_to( sc, dplyr::tibble( # x contains 2 arrays each having elements in ascending order x = list(1:5, 6:10) ) ) %>% # now each array from x gets sorted in descending order hof_array_sort(~ as.integer(sign(.y - .x))) ## End(Not run)
Determines whether an element satisfying the given predicate exists in each array from an array column (this is essentially a dplyr wrapper for the exists(array<T>, function<T, Boolean>): Boolean built-in Spark SQL function).
hof_exists(x, pred, expr = NULL, dest_col = NULL, ...)
x |
The Spark data frame to search |
pred |
A boolean predicate |
expr |
The array being searched (could be any SQL expression evaluating to an array) |
dest_col |
Column to store the search result |
... |
Additional params to dplyr::mutate |
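A brief illustrative example (not from the upstream documentation), modeled on the other hof_* examples; the column names are made up.

## Not run: 
library(sparklyr)
sc <- spark_connect(master = "local")

# flag rows whose array contains at least one even number
copy_to(sc, dplyr::tibble(array_column = list(1:5, c(21L, 23L, 25L)))) %>%
  hof_exists(
    pred = ~ .x %% 2 == 0,
    expr = array_column,
    dest_col = has_even
  )
## End(Not run)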
Apply an element-wise filtering function to an array column (this is essentially a dplyr wrapper for the filter(array<T>, function<T, Boolean>): array<T> built-in Spark SQL function).
hof_filter(x, func, expr = NULL, dest_col = NULL, ...)
x |
The Spark data frame to filter |
func |
The filtering function |
expr |
The array being filtered, could be any SQL expression evaluating to an array (default: the last column of the Spark data frame) |
dest_col |
Column to store the filtered result (default: expr) |
... |
Additional params to dplyr::mutate |
## Not run: library(sparklyr) sc <- spark_connect(master = "local") # only keep odd elements in each array in `array_column` copy_to(sc, dplyr::tibble(array_column = list(1:5, 21:25))) %>% hof_filter(~ .x %% 2 == 1) ## End(Not run)
Checks whether the predicate specified holds for all elements in an array (this is essentially a dplyr wrapper to the 'forall(expr, pred)' higher- order function, which is supported since Spark 3.0)
hof_forall(x, pred, expr = NULL, dest_col = NULL, ...)
x |
The Spark data frame to be processed |
pred |
The predicate to test (it should take an array element as argument and return a boolean value) |
expr |
The array being tested, could be any SQL expression evaluating to an array (default: the last column of the Spark data frame) |
dest_col |
Column to store the boolean result (default: expr) |
... |
Additional params to dplyr::mutate |
## Not run: sc <- spark_connect(master = "local", version = "3.0.0") df <- dplyr::tibble( x = list(c(1, 2, 3, 4, 5), c(6, 7, 8, 9, 10)), y = list(c(1, 4, 2, 8, 5), c(7, 1, 4, 2, 8)), ) sdf <- sdf_copy_to(sc, df, overwrite = TRUE) all_positive_tbl <- sdf %>% hof_forall(pred = ~ .x > 0, expr = y, dest_col = all_positive) %>% dplyr::select(all_positive) ## End(Not run)
Filters entries in a map using the function specified (this is essentially a dplyr wrapper to the 'map_filter(expr, func)' higher- order function, which is supported since Spark 3.0)
hof_map_filter(x, func, expr = NULL, dest_col = NULL, ...)
x |
The Spark data frame to be processed |
func |
The filter function to apply (it should take (key, value) as arguments and return a boolean value, with FALSE indicating the key-value pair should be discarded and TRUE otherwise) |
expr |
The map being filtered, could be any SQL expression evaluating to a map (default: the last column of the Spark data frame) |
dest_col |
Column to store the filtered result (default: expr) |
... |
Additional params to dplyr::mutate |
## Not run: library(sparklyr) sc <- spark_connect(master = "local", version = "3.0.0") sdf <- sdf_len(sc, 1) %>% dplyr::mutate(m = map(1, 0, 2, 2, 3, -1)) filtered_sdf <- sdf %>% hof_map_filter(~ .x > .y) ## End(Not run)
Merges two maps into a single map by applying the function specified to pairs of values with the same key (this is essentially a dplyr wrapper to the 'map_zip_with(map1, map2, func)' higher- order function, which is supported since Spark 3.0)
hof_map_zip_with(x, func, dest_col = NULL, map1 = NULL, map2 = NULL, ...)
x |
The Spark data frame to be processed |
func |
The function to apply (it should take (key, value1, value2) as arguments, where (key, value1) is a key-value pair present in map1, (key, value2) is a key-value pair present in map2, and return a transformed value associated with key in the resulting map |
dest_col |
Column to store the query result (default: the last column of the Spark data frame) |
map1 |
The first map being merged, could be any SQL expression evaluating to a map (default: the first column of the Spark data frame) |
map2 |
The second map being merged, could be any SQL expression evaluating to a map (default: the second column of the Spark data frame) |
... |
Additional params to dplyr::mutate |
## Not run: library(sparklyr) sc <- spark_connect(master = "local", version = "3.0.0") # create a Spark dataframe with 2 columns of type MAP<STRING, INT> two_maps_tbl <- sdf_copy_to( sc, dplyr::tibble( m1 = c("{\"1\":2,\"3\":4,\"5\":6}", "{\"2\":1,\"4\":3,\"6\":5}"), m2 = c("{\"1\":1,\"3\":3,\"5\":5}", "{\"2\":2,\"4\":4,\"6\":6}") ), overwrite = TRUE ) %>% dplyr::mutate(m1 = from_json(m1, "MAP<STRING, INT>"), m2 = from_json(m2, "MAP<STRING, INT>")) # create a 3rd column containing MAP<STRING, INT> values derived from the # first 2 columns transformed_two_maps_tbl <- two_maps_tbl %>% hof_map_zip_with( func = .(k, v1, v2) %->% (CONCAT(k, "_", v1, "_", v2)), dest_col = m3 ) ## End(Not run)
Apply an element-wise transformation function to an array column (this is essentially a dplyr wrapper for the transform(array<T>, function<T, U>): array<U> and transform(array<T>, function<T, Int, U>): array<U> built-in Spark SQL functions).
hof_transform(x, func, expr = NULL, dest_col = NULL, ...)
x |
The Spark data frame to transform |
func |
The transformation to apply |
expr |
The array being transformed, could be any SQL expression evaluating to an array (default: the last column of the Spark data frame) |
dest_col |
Column to store the transformed result (default: expr) |
... |
Additional params to dplyr::mutate |
## Not run: library(sparklyr) sc <- spark_connect(master = "local") # applies the (x -> x * x) transformation to elements of all arrays copy_to(sc, dplyr::tibble(arr = list(1:5, 21:25))) %>% hof_transform(~ .x * .x) ## End(Not run)
Applies the transformation function specified to all keys of a map (this is essentially a dplyr wrapper to the 'transform_keys(expr, func)' higher- order function, which is supported since Spark 3.0)
hof_transform_keys(x, func, expr = NULL, dest_col = NULL, ...)
x |
The Spark data frame to be processed |
func |
The transformation function to apply (it should take (key, value) as arguments and return a transformed key) |
expr |
The map being transformed, could be any SQL expression evaluating to a map (default: the last column of the Spark data frame) |
dest_col |
Column to store the transformed result (default: expr) |
... |
Additional params to dplyr::mutate |
## Not run: library(sparklyr) sc <- spark_connect(master = "local", version = "3.0.0") sdf <- sdf_len(sc, 1) %>% dplyr::mutate(m = map("a", 0L, "b", 2L, "c", -1L)) transformed_sdf <- sdf %>% hof_transform_keys(~ CONCAT(.x, " == ", .y)) ## End(Not run)
Applies the transformation function specified to all values of a map (this is essentially a dplyr wrapper to the 'transform_values(expr, func)' higher- order function, which is supported since Spark 3.0)
hof_transform_values(x, func, expr = NULL, dest_col = NULL, ...)
x |
The Spark data frame to be processed |
func |
The transformation function to apply (it should take (key, value) as arguments and return a transformed value) |
expr |
The map being transformed, could be any SQL expression evaluating to a map (default: the last column of the Spark data frame) |
dest_col |
Column to store the transformed result (default: expr) |
... |
Additional params to dplyr::mutate |
## Not run: library(sparklyr) sc <- spark_connect(master = "local", version = "3.0.0") sdf <- sdf_len(sc, 1) %>% dplyr::mutate(m = map("a", 0L, "b", 2L, "c", -1L)) transformed_sdf <- sdf %>% hof_transform_values(~ CONCAT(.x, " == ", .y)) ## End(Not run)
Applies an element-wise function to combine elements from 2 array columns (this is essentially a dplyr wrapper for the zip_with(array<T>, array<U>, function<T, U, R>): array<R> built-in function in Spark SQL).
hof_zip_with(x, func, dest_col = NULL, left = NULL, right = NULL, ...)
x |
The Spark data frame to process |
func |
Element-wise combining function to be applied |
dest_col |
Column to store the query result (default: the last column of the Spark data frame) |
left |
Any expression evaluating to an array (default: the first column of the Spark data frame) |
right |
Any expression evaluating to an array (default: the second column of the Spark data frame) |
... |
Additional params to dplyr::mutate |
## Not run: 
library(sparklyr)
sc <- spark_connect(master = "local")
# compute element-wise products of 2 arrays from each row of `left` and `right`
# and store the resulting array in `res`
copy_to(
  sc,
  dplyr::tibble(
    left = list(1:5, 21:25),
    right = list(6:10, 16:20),
    res = c(0, 0)
  )
) %>%
  hof_zip_with(~ .x * .y)
## End(Not run)
Invoke methods on Java object references. These functions provide a mechanism for invoking various Java object methods directly from R.
invoke(jobj, method, ...) invoke_static(sc, class, method, ...) invoke_new(sc, class, ...)
jobj |
An R object acting as a Java object reference (typically, a |
method |
The name of the method to be invoked. |
... |
Optional arguments, currently unused. |
sc |
A |
class |
The name of the Java class whose methods should be invoked. |
Use each of these functions in the following scenarios:
invoke |
Execute a method on a Java object reference (typically, a spark_jobj ). |
invoke_static |
Execute a static method associated with a Java class. |
invoke_new |
Invoke a constructor associated with a Java class. |
sc <- spark_connect(master = "spark://HOST:PORT") spark_context(sc) %>% invoke("textFile", "file.csv", 1L) %>% invoke("count")
Invoke a Java function and force return value of the call to be retrieved as a Java object reference.
j_invoke(jobj, method, ...) j_invoke_static(sc, class, method, ...) j_invoke_new(sc, class, ...)
jobj |
An R object acting as a Java object reference (typically, a |
method |
The name of the method to be invoked. |
... |
Optional arguments, currently unused. |
sc |
A |
class |
The name of the Java class whose methods should be invoked. |
Given a list of Java object references, instantiate an Array[T]
containing the same list of references, where T
is a non-primitive
type that is more specific than java.lang.Object
.
jarray(sc, x, element_type)
sc |
A |
x |
A list of Java object references. |
element_type |
A valid Java class name representing the generic type
parameter of the Java array to be instantiated. Each element of |
sc <- spark_connect(master = "spark://HOST:PORT") string_arr <- jarray(sc, letters, element_type = "java.lang.String") # string_arr is now a reference to an array of type String[]
Instantiate a java.lang.Float
object with the value specified.
NOTE: this method is useful when one has to invoke a Java/Scala method
requiring a float (instead of double) type for at least one of its
parameters.
jfloat(sc, x)
sc |
A |
x |
A numeric value in R. |
sc <- spark_connect(master = "spark://HOST:PORT") jflt <- jfloat(sc, 1.23e-8) # jflt is now a reference to a java.lang.Float object
Instantiate an Array[Float]
object with the value specified.
NOTE: this method is useful when one has to invoke a Java/Scala method
requiring an Array[Float]
as one of its parameters.
jfloat_array(sc, x)
sc |
A |
x |
A numeric vector in R. |
sc <- spark_connect(master = "spark://HOST:PORT") jflt_arr <- jfloat_array(sc, c(-1.23e-8, 0, -1.23e-8)) # jflt_arr is now a reference an array of java.lang.Float
These functions are wrappers around their 'dplyr' equivalents that set Spark SQL-compliant values for the 'suffix' argument by replacing dots ('.') with underscores ('_'). See [join] for a description of the general purpose of the functions.
## S3 method for class 'tbl_spark' inner_join( x, y, by = NULL, copy = FALSE, suffix = c("_x", "_y"), auto_index = FALSE, ..., sql_on = NULL ) ## S3 method for class 'tbl_spark' left_join( x, y, by = NULL, copy = FALSE, suffix = c("_x", "_y"), auto_index = FALSE, ..., sql_on = NULL ) ## S3 method for class 'tbl_spark' right_join( x, y, by = NULL, copy = FALSE, suffix = c("_x", "_y"), auto_index = FALSE, ..., sql_on = NULL ) ## S3 method for class 'tbl_spark' full_join( x, y, by = NULL, copy = FALSE, suffix = c("_x", "_y"), auto_index = FALSE, ..., sql_on = NULL )
x , y
|
A pair of lazy data frames backed by database queries. |
by |
A join specification created with If To join on different variables between To join by multiple variables, use a
For simple equality joins, you can alternatively specify a character vector
of variable names to join by. For example, To perform a cross-join, generating all combinations of |
copy |
If This allows you to join tables across srcs, but it's potentially expensive operation so you must opt into it. |
suffix |
If there are non-joined duplicate variables in |
auto_index |
if |
... |
Other parameters passed onto methods. |
sql_on |
A custom join predicate as an SQL expression.
Usually joins use column equality, but you can perform more complex
queries by supply |
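An illustrative sketch (not from the upstream documentation); the two toy tables are made up, and the duplicated value column shows the Spark SQL-compliant "_x"/"_y" suffixes.

## Not run: 
library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local")

left_tbl <- copy_to(sc, dplyr::tibble(id = 1:3, value = c("a", "b", "c")), name = "left_tbl")
right_tbl <- copy_to(sc, dplyr::tibble(id = 2:4, value = c("x", "y", "z")), name = "right_tbl")

# the non-joined duplicate column `value` becomes `value_x` / `value_y`
left_tbl %>%
  inner_join(right_tbl, by = "id")
## End(Not run)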
List all sparklyr-*.jar files that have been built.
list_sparklyr_jars()
Create a Spark Configuration for Livy
livy_config( config = spark_config(), username = NULL, password = NULL, negotiate = FALSE, custom_headers = list(`X-Requested-By` = "sparklyr"), proxy = NULL, curl_opts = NULL, ... )
config |
Optional base configuration |
username |
The username to use in the Authorization header |
password |
The password to use in the Authorization header |
negotiate |
Whether to use the gssnegotiate method or not |
custom_headers |
List of custom headers to append to http requests. Defaults to |
proxy |
Either NULL or a proxy specified by httr::use_proxy(). Defaults to NULL. |
curl_opts |
List of CURL options (e.g., verbose, connecttimeout, dns_cache_timeout, etc, see httr::httr_options() for a list of valid options) – NOTE: these configurations are for libcurl only and separate from HTTP headers or Livy session parameters. |
... |
additional Livy session parameters |
Extends a Spark spark_config()
configuration with settings
for Livy. For instance, username
and password
define the basic authentication settings for a Livy session.
The default value of "custom_headers"
is set to list("X-Requested-By" = "sparklyr")
in order to facilitate connection to Livy servers with CSRF protection enabled.
Additional parameters for Livy sessions are:
proxy_user
User to impersonate when starting the session
jars
jars to be used in this session
py_files
Python files to be used in this session
files
files to be used in this session
driver_memory
Amount of memory to use for the driver process
driver_cores
Number of cores to use for the driver process
executor_memory
Amount of memory to use per executor process
executor_cores
Number of cores to use for each executor
num_executors
Number of executors to launch for this session
archives
Archives to be used in this session
queue
The name of the YARN queue to which the session is submitted
name
The name of this session
heartbeat_timeout
Timeout in seconds after which the session is orphaned
conf
Spark configuration properties (Map of key=value)
Note that queue
is supported only by version 0.4.0 of Livy or newer.
If you are using an older version, specify the queue via config
(e.g.
config = spark_config(spark.yarn.queue = "my_queue")
).
Named list with configuration data
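A short illustrative example (not from the upstream documentation); the Livy endpoint and credentials are placeholders to be replaced with real values.

## Not run: 
library(sparklyr)

config <- livy_config(
  config = spark_config(),
  username = "<username>",
  password = "<password>"
)

sc <- spark_connect(
  master = "http://<livy-server>:8998",
  method = "livy",
  config = config
)
## End(Not run)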
Starts the livy service.
Stops the running instances of the livy service.
livy_service_start( version = NULL, spark_version = NULL, stdout = "", stderr = "", ... ) livy_service_stop()
version |
The version of ‘livy’ to use. |
spark_version |
The version of ‘spark’ to connect to. |
stdout , stderr
|
where output to 'stdout' or 'stderr' should
be sent. Same options as |
... |
Optional arguments; currently unused. |
Fit a parametric survival regression model named accelerated failure time (AFT) model (see Accelerated failure time model (Wikipedia)) based on the Weibull distribution of the survival time.
ml_aft_survival_regression( x, formula = NULL, censor_col = "censor", quantile_probabilities = c(0.01, 0.05, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95, 0.99), fit_intercept = TRUE, max_iter = 100L, tol = 1e-06, aggregation_depth = 2, quantiles_col = NULL, features_col = "features", label_col = "label", prediction_col = "prediction", uid = random_string("aft_survival_regression_"), ... ) ml_survival_regression( x, formula = NULL, censor_col = "censor", quantile_probabilities = c(0.01, 0.05, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95, 0.99), fit_intercept = TRUE, max_iter = 100L, tol = 1e-06, aggregation_depth = 2, quantiles_col = NULL, features_col = "features", label_col = "label", prediction_col = "prediction", uid = random_string("aft_survival_regression_"), response = NULL, features = NULL, ... )
x |
A |
formula |
Used when |
censor_col |
Censor column name. The value of this column could be 0 or 1. If the value is 1, it means the event has occurred i.e. uncensored; otherwise censored. |
quantile_probabilities |
Quantile probabilities array. Values of the quantile probabilities array should be in the range (0, 1) and the array should be non-empty. |
fit_intercept |
Boolean; should the model be fit with an intercept term? |
max_iter |
The maximum number of iterations to use. |
tol |
Param for the convergence tolerance for iterative algorithms. |
aggregation_depth |
(Spark 2.1.0+) Suggested depth for treeAggregate (>= 2). |
quantiles_col |
Quantiles column name. This column will output quantiles of corresponding quantileProbabilities if it is set. |
features_col |
Features column name, as a length-one character vector. The column should be single vector column of numeric values. Usually this column is output by |
label_col |
Label column name. The column should be a numeric column. Usually this column is output by |
prediction_col |
Prediction column name. |
uid |
A character string used to uniquely identify the ML estimator. |
... |
Optional arguments; see Details. |
response |
(Deprecated) The name of the response column (as a length-one character vector.) |
features |
(Deprecated) The name of features (terms) to use for the model fit. |
ml_survival_regression()
is an alias for ml_aft_survival_regression()
for backwards compatibility.
The object returned depends on the class of x
. If it is a
spark_connection
, the function returns a ml_estimator
object. If
it is a ml_pipeline
, it will return a pipeline with the predictor
appended to it. If a tbl_spark
, it will return a tbl_spark
with
the predictions added to it.
Other ml algorithms:
ml_decision_tree_classifier()
,
ml_gbt_classifier()
,
ml_generalized_linear_regression()
,
ml_isotonic_regression()
,
ml_linear_regression()
,
ml_linear_svc()
,
ml_logistic_regression()
,
ml_multilayer_perceptron_classifier()
,
ml_naive_bayes()
,
ml_one_vs_rest()
,
ml_random_forest_classifier()
## Not run: library(survival) library(sparklyr) sc <- spark_connect(master = "local") ovarian_tbl <- sdf_copy_to(sc, ovarian, name = "ovarian_tbl", overwrite = TRUE) partitions <- ovarian_tbl %>% sdf_random_split(training = 0.7, test = 0.3, seed = 1111) ovarian_training <- partitions$training ovarian_test <- partitions$test sur_reg <- ovarian_training %>% ml_aft_survival_regression(futime ~ ecog_ps + rx + age + resid_ds, censor_col = "fustat") pred <- ml_predict(sur_reg, ovarian_test) pred ## End(Not run)
Perform recommendation using Alternating Least Squares (ALS) matrix factorization.
ml_als( x, formula = NULL, rating_col = "rating", user_col = "user", item_col = "item", rank = 10, reg_param = 0.1, implicit_prefs = FALSE, alpha = 1, nonnegative = FALSE, max_iter = 10, num_user_blocks = 10, num_item_blocks = 10, checkpoint_interval = 10, cold_start_strategy = "nan", intermediate_storage_level = "MEMORY_AND_DISK", final_storage_level = "MEMORY_AND_DISK", uid = random_string("als_"), ... ) ml_recommend(model, type = c("items", "users"), n = 1)
x |
A |
formula |
Used when |
rating_col |
Column name for ratings. Default: "rating" |
user_col |
Column name for user ids. Ids must be integers. Other numeric types are supported for this column, but will be cast to integers as long as they fall within the integer value range. Default: "user" |
item_col |
Column name for item ids. Ids must be integers. Other numeric types are supported for this column, but will be cast to integers as long as they fall within the integer value range. Default: "item" |
rank |
Rank of the matrix factorization (positive). Default: 10 |
reg_param |
Regularization parameter. |
implicit_prefs |
Whether to use implicit preference. Default: FALSE. |
alpha |
Alpha parameter in the implicit preference formulation (nonnegative). |
nonnegative |
Whether to apply nonnegativity constraints. Default: FALSE. |
max_iter |
Maximum number of iterations. |
num_user_blocks |
Number of user blocks (positive). Default: 10 |
num_item_blocks |
Number of item blocks (positive). Default: 10 |
checkpoint_interval |
Set checkpoint interval (>= 1) or disable checkpoint (-1). E.g. 10 means that the cache will get checkpointed every 10 iterations, defaults to 10. |
cold_start_strategy |
(Spark 2.2.0+) Strategy for dealing with unknown or new users/items at prediction time. This may be useful in cross-validation or production scenarios, for handling user/item ids the model has not seen in the training data. Supported values: - "nan": predicted value for unknown ids will be NaN. - "drop": rows in the input DataFrame containing unknown ids will be dropped from the output DataFrame containing predictions. Default: "nan". |
intermediate_storage_level |
(Spark 2.0.0+) StorageLevel for intermediate datasets. Pass in a string representation of |
final_storage_level |
(Spark 2.0.0+) StorageLevel for ALS model factors. Pass in a string representation of |
uid |
A character string used to uniquely identify the ML estimator. |
... |
Optional arguments; currently unused. |
model |
An ALS model object |
type |
What to recommend, one of |
n |
Maximum number of recommendations to return |
ml_recommend()
returns the top n
users/items recommended for each item/user, for all items/users. The output has been transformed (exploded and separated) from the default Spark outputs to be more user friendly.
ALS attempts to estimate the ratings matrix R as the product of two lower-rank matrices, X and Y, i.e. X * Yt = R. Typically these approximations are called 'factor' matrices. The general approach is iterative. During each iteration, one of the factor matrices is held constant, while the other is solved for using least squares. The newly-solved factor matrix is then held constant while solving for the other factor matrix.
This is a blocked implementation of the ALS factorization algorithm that groups the two sets of factors (referred to as "users" and "products") into blocks and reduces communication by only sending one copy of each user vector to each product block on each iteration, and only for the product blocks that need that user's feature vector. This is achieved by pre-computing some information about the ratings matrix to determine the "out-links" of each user (which blocks of products it will contribute to) and "in-link" information for each product (which of the feature vectors it receives from each user block it will depend on). This allows us to send only an array of feature vectors between each user block and product block, and have the product block find the users' ratings and update the products based on these messages.
For implicit preference data, the algorithm used is based on "Collaborative Filtering for Implicit Feedback Datasets", available at doi:10.1109/ICDM.2008.22, adapted for the blocked approach used here.
Essentially instead of finding the low-rank approximations to the rating matrix R, this finds the approximations for a preference matrix P where the elements of P are 1 if r is greater than 0 and 0 if r is less than or equal to 0. The ratings then act as 'confidence' values related to strength of indicated user preferences rather than explicit ratings given to items.
The object returned depends on the class of x
.
spark_connection
: When x
is a spark_connection
, the function returns an instance of a ml_als
recommender object, which is an Estimator.
ml_pipeline
: When x
is a ml_pipeline
, the function returns a ml_pipeline
with
the recommender appended to the pipeline.
tbl_spark
: When x
is a tbl_spark
, a recommender
estimator is constructed then immediately fit with the input
tbl_spark
, returning a recommendation model, i.e. ml_als_model
.
## Not run: library(sparklyr) sc <- spark_connect(master = "local") movies <- data.frame( user = c(1, 2, 0, 1, 2, 0), item = c(1, 1, 1, 2, 2, 0), rating = c(3, 1, 2, 4, 5, 4) ) movies_tbl <- sdf_copy_to(sc, movies) model <- ml_als(movies_tbl, rating ~ user + item) ml_predict(model, movies_tbl) ml_recommend(model, type = "item", 1) ## End(Not run)
These methods summarize the results of Spark ML models into tidy forms.
## S3 method for class 'ml_model_als' tidy(x, ...) ## S3 method for class 'ml_model_als' augment(x, newdata = NULL, ...) ## S3 method for class 'ml_model_als' glance(x, ...)
x |
a Spark ML model. |
... |
extra arguments (not used.) |
newdata |
a tbl_spark of new data to use for prediction. |
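A hedged sketch of how these methods might be used on a small ALS fit (the toy ratings below are made up for illustration):

## Not run: 
library(sparklyr)

sc <- spark_connect(master = "local")
movies <- data.frame(
  user = c(1, 2, 0, 1, 2, 0),
  item = c(1, 1, 1, 2, 2, 0),
  rating = c(3, 1, 2, 4, 5, 4)
)
movies_tbl <- sdf_copy_to(sc, movies, overwrite = TRUE)
model <- ml_als(movies_tbl, rating ~ user + item)

tidy(model)                 # factor-level output in tidy form
glance(model)               # one-row model summary
augment(model, movies_tbl)  # row-level predictions attached to the data
## End(Not run)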
A bisecting k-means algorithm based on the paper "A comparison of document clustering techniques" by Steinbach, Karypis, and Kumar, with modification to fit Spark. The algorithm starts from a single cluster that contains all points. Iteratively it finds divisible clusters on the bottom level and bisects each of them using k-means, until there are k leaf clusters in total or no leaf clusters are divisible. The bisecting steps of clusters on the same level are grouped together to increase parallelism. If bisecting all divisible clusters on the bottom level would result in more than k leaf clusters, larger clusters get higher priority.
ml_bisecting_kmeans( x, formula = NULL, k = 4, max_iter = 20, seed = NULL, min_divisible_cluster_size = 1, features_col = "features", prediction_col = "prediction", uid = random_string("bisecting_bisecting_kmeans_"), ... )
x |
A |
formula |
Used when |
k |
The number of clusters to create |
max_iter |
The maximum number of iterations to use. |
seed |
A random seed. Set this value if you need your results to be reproducible across repeated calls. |
min_divisible_cluster_size |
The minimum number of points (if greater than or equal to 1.0) or the minimum proportion of points (if less than 1.0) of a divisible cluster (default: 1.0). |
features_col |
Features column name, as a length-one character vector. The column should be single vector column of numeric values. Usually this column is output by |
prediction_col |
Prediction column name. |
uid |
A character string used to uniquely identify the ML estimator. |
... |
Optional arguments, see Details. |
The object returned depends on the class of x.
## Not run: library(dplyr) sc <- spark_connect(master = "local") iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE) iris_tbl %>% select(-Species) %>% ml_bisecting_kmeans(k = 4, Species ~ .) ## End(Not run)
Conduct Pearson's independence test for every feature against the label. For each feature, the (feature, label) pairs are converted into a contingency matrix for which the Chi-squared statistic is computed. All label and feature values must be categorical.
ml_chisquare_test(x, features, label)
x |
A |
features |
The name(s) of the feature columns. This can also be the name
of a single vector column created using |
label |
The name of the label column. |
A data frame with one row for each (feature, label) pair with p-values, degrees of freedom, and test statistics.
## Not run: sc <- spark_connect(master = "local") iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE) features <- c("Petal_Width", "Petal_Length", "Sepal_Length", "Sepal_Width") ml_chisquare_test(iris_tbl, features = features, label = "Species") ## End(Not run)
Evaluator for clustering results. The metric computes the Silhouette measure using the squared Euclidean distance. The Silhouette is a measure for the validation of the consistency within clusters. It ranges between 1 and -1, where a value close to 1 means that the points in a cluster are close to the other points in the same cluster and far from the points of the other clusters.
ml_clustering_evaluator( x, features_col = "features", prediction_col = "prediction", metric_name = "silhouette", uid = random_string("clustering_evaluator_"), ... )
x |
A |
features_col |
Name of features column. |
prediction_col |
Name of the prediction column. |
metric_name |
The performance metric. Currently supports "silhouette". |
uid |
A character string used to uniquely identify the ML estimator. |
... |
Optional arguments; currently unused. |
The calculated performance metric
## Not run: sc <- spark_connect(master = "local") iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE) partitions <- iris_tbl %>% sdf_random_split(training = 0.7, test = 0.3, seed = 1111) iris_training <- partitions$training iris_test <- partitions$test formula <- Species ~ . # Train the models kmeans_model <- ml_kmeans(iris_training, formula = formula) b_kmeans_model <- ml_bisecting_kmeans(iris_training, formula = formula) gmm_model <- ml_gaussian_mixture(iris_training, formula = formula) # Predict pred_kmeans <- ml_predict(kmeans_model, iris_test) pred_b_kmeans <- ml_predict(b_kmeans_model, iris_test) pred_gmm <- ml_predict(gmm_model, iris_test) # Evaluate ml_clustering_evaluator(pred_kmeans) ml_clustering_evaluator(pred_b_kmeans) ml_clustering_evaluator(pred_gmm) ## End(Not run)
Compute correlation matrix
ml_corr(x, columns = NULL, method = c("pearson", "spearman"))
x |
A |
columns |
The names of the columns to calculate correlations of. If only one
column is specified, it must be a vector column (for example, assembled using
|
method |
The method to use, either |
A correlation matrix organized as a data frame.
## Not run: sc <- spark_connect(master = "local") iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE) features <- c("Petal_Width", "Petal_Length", "Sepal_Length", "Sepal_Width") ml_corr(iris_tbl, columns = features, method = "pearson") ## End(Not run)
Perform classification and regression using decision trees.
ml_decision_tree_classifier( x, formula = NULL, max_depth = 5, max_bins = 32, min_instances_per_node = 1, min_info_gain = 0, impurity = "gini", seed = NULL, thresholds = NULL, cache_node_ids = FALSE, checkpoint_interval = 10, max_memory_in_mb = 256, features_col = "features", label_col = "label", prediction_col = "prediction", probability_col = "probability", raw_prediction_col = "rawPrediction", uid = random_string("decision_tree_classifier_"), ... ) ml_decision_tree( x, formula = NULL, type = c("auto", "regression", "classification"), features_col = "features", label_col = "label", prediction_col = "prediction", variance_col = NULL, probability_col = "probability", raw_prediction_col = "rawPrediction", checkpoint_interval = 10L, impurity = "auto", max_bins = 32L, max_depth = 5L, min_info_gain = 0, min_instances_per_node = 1L, seed = NULL, thresholds = NULL, cache_node_ids = FALSE, max_memory_in_mb = 256L, uid = random_string("decision_tree_"), response = NULL, features = NULL, ... ) ml_decision_tree_regressor( x, formula = NULL, max_depth = 5, max_bins = 32, min_instances_per_node = 1, min_info_gain = 0, impurity = "variance", seed = NULL, cache_node_ids = FALSE, checkpoint_interval = 10, max_memory_in_mb = 256, variance_col = NULL, features_col = "features", label_col = "label", prediction_col = "prediction", uid = random_string("decision_tree_regressor_"), ... )
x |
A |
formula |
Used when |
max_depth |
Maximum depth of the tree (>= 0); that is, the maximum number of nodes separating any leaves from the root of the tree. |
max_bins |
The maximum number of bins used for discretizing continuous features and for choosing how to split on features at each node. More bins give higher granularity. |
min_instances_per_node |
Minimum number of instances each child must have after split. |
min_info_gain |
Minimum information gain for a split to be considered at a tree node. Should be >= 0, defaults to 0. |
impurity |
Criterion used for information gain calculation. Supported: "entropy"
and "gini" (default) for classification and "variance" (default) for regression. For
|
seed |
Seed for random numbers. |
thresholds |
Thresholds in multi-class classification to adjust the probability of predicting each class. Array must have length equal to the number of classes, with values > 0 excepting that at most one value may be 0. The class with largest value |
cache_node_ids |
If |
checkpoint_interval |
Set checkpoint interval (>= 1) or disable checkpoint (-1). E.g. 10 means that the cache will get checkpointed every 10 iterations, defaults to 10. |
max_memory_in_mb |
Maximum memory in MB allocated to histogram aggregation. If too small, then 1 node will be split per iteration, and its aggregates may exceed this size. Defaults to 256. |
features_col |
Features column name, as a length-one character vector. The column should be single vector column of numeric values. Usually this column is output by |
label_col |
Label column name. The column should be a numeric column. Usually this column is output by |
prediction_col |
Prediction column name. |
probability_col |
Column name for predicted class conditional probabilities. |
raw_prediction_col |
Raw prediction (a.k.a. confidence) column name. |
uid |
A character string used to uniquely identify the ML estimator. |
... |
Optional arguments; see Details. |
type |
The type of model to fit. |
variance_col |
(Optional) Column name for the biased sample variance of prediction. |
response |
(Deprecated) The name of the response column (as a length-one character vector.) |
features |
(Deprecated) The name of features (terms) to use for the model fit. |
ml_decision_tree
is a wrapper around ml_decision_tree_regressor.tbl_spark
and ml_decision_tree_classifier.tbl_spark
and calls the appropriate method based on model type.
The object returned depends on the class of x
. If it is a
spark_connection
, the function returns a ml_estimator
object. If
it is a ml_pipeline
, it will return a pipeline with the predictor
appended to it. If a tbl_spark
, it will return a tbl_spark
with
the predictions added to it.
Other ml algorithms:
ml_aft_survival_regression()
,
ml_gbt_classifier()
,
ml_generalized_linear_regression()
,
ml_isotonic_regression()
,
ml_linear_regression()
,
ml_linear_svc()
,
ml_logistic_regression()
,
ml_multilayer_perceptron_classifier()
,
ml_naive_bayes()
,
ml_one_vs_rest()
,
ml_random_forest_classifier()
## Not run: sc <- spark_connect(master = "local") iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE) partitions <- iris_tbl %>% sdf_random_split(training = 0.7, test = 0.3, seed = 1111) iris_training <- partitions$training iris_test <- partitions$test dt_model <- iris_training %>% ml_decision_tree(Species ~ .) pred <- ml_predict(dt_model, iris_test) ml_multiclass_classification_evaluator(pred) ## End(Not run)
Loads the default stop words for the given language.
ml_default_stop_words( sc, language = c("english", "danish", "dutch", "finnish", "french", "german", "hungarian", "italian", "norwegian", "portuguese", "russian", "spanish", "swedish", "turkish"), ... )
sc |
A |
language |
A character string. |
... |
Optional arguments; currently unused. |
Supported languages: danish, dutch, english, finnish, french, german, hungarian, italian, norwegian, portuguese, russian, spanish, swedish, turkish. Defaults to English. See https://anoncvs.postgresql.org/cvsweb.cgi/pgsql/src/backend/snowball/stopwords/ for more details
A list of stop words.
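Illustrative usage (the connection and language value are only examples):

## Not run: 
library(sparklyr)

sc <- spark_connect(master = "local")

# Retrieve the built-in English stop word list.
stop_words <- ml_default_stop_words(sc, language = "english")
head(stop_words)
## End(Not run)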
Compute performance metrics.
ml_evaluate(x, dataset) ## S3 method for class 'ml_model_logistic_regression' ml_evaluate(x, dataset) ## S3 method for class 'ml_logistic_regression_model' ml_evaluate(x, dataset) ## S3 method for class 'ml_model_linear_regression' ml_evaluate(x, dataset) ## S3 method for class 'ml_linear_regression_model' ml_evaluate(x, dataset) ## S3 method for class 'ml_model_generalized_linear_regression' ml_evaluate(x, dataset) ## S3 method for class 'ml_generalized_linear_regression_model' ml_evaluate(x, dataset) ## S3 method for class 'ml_model_clustering' ml_evaluate(x, dataset) ## S3 method for class 'ml_model_classification' ml_evaluate(x, dataset) ## S3 method for class 'ml_evaluator' ml_evaluate(x, dataset)
x |
An ML model object or an evaluator object. |
dataset |
The dataset to validate the model on. |
## Not run: sc <- spark_connect(master = "local") iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE) ml_gaussian_mixture(iris_tbl, Species ~ .) %>% ml_evaluate(iris_tbl) ml_kmeans(iris_tbl, Species ~ .) %>% ml_evaluate(iris_tbl) ml_bisecting_kmeans(iris_tbl, Species ~ .) %>% ml_evaluate(iris_tbl) ## End(Not run)
A set of functions to calculate performance metrics for prediction models. Also see the Spark ML Documentation https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.evaluation.package
ml_binary_classification_evaluator( x, label_col = "label", raw_prediction_col = "rawPrediction", metric_name = "areaUnderROC", uid = random_string("binary_classification_evaluator_"), ... ) ml_binary_classification_eval( x, label_col = "label", prediction_col = "prediction", metric_name = "areaUnderROC" ) ml_multiclass_classification_evaluator( x, label_col = "label", prediction_col = "prediction", metric_name = "f1", uid = random_string("multiclass_classification_evaluator_"), ... ) ml_classification_eval( x, label_col = "label", prediction_col = "prediction", metric_name = "f1" ) ml_regression_evaluator( x, label_col = "label", prediction_col = "prediction", metric_name = "rmse", uid = random_string("regression_evaluator_"), ... )
x |
A |
label_col |
Name of column string specifying which column contains the true labels or values. |
raw_prediction_col |
Raw prediction (a.k.a. confidence) column name. |
metric_name |
The performance metric. See details. |
uid |
A character string used to uniquely identify the ML estimator. |
... |
Optional arguments; currently unused. |
prediction_col |
Name of the column that contains the predicted
label or value NOT the scored probability. Column should be of type
|
The following metrics are supported
Binary Classification: areaUnderROC
(default) or areaUnderPR
(not available in Spark 2.X.)
Multiclass Classification: f1
(default), precision
, recall
, weightedPrecision
, weightedRecall
or accuracy
; for Spark 2.X: f1
(default), weightedPrecision
, weightedRecall
or accuracy
.
Regression: rmse
(root mean squared error, default),
mse
(mean squared error), r2
, or mae
(mean absolute error.)
ml_binary_classification_eval()
is an alias for ml_binary_classification_evaluator()
for backwards compatibility.
ml_classification_eval()
is an alias for ml_multiclass_classification_evaluator()
for backwards compatibility.
The calculated performance metric
## Not run: sc <- spark_connect(master = "local") mtcars_tbl <- sdf_copy_to(sc, mtcars, name = "mtcars_tbl", overwrite = TRUE) partitions <- mtcars_tbl %>% sdf_random_split(training = 0.7, test = 0.3, seed = 1111) mtcars_training <- partitions$training mtcars_test <- partitions$test # for multiclass classification rf_model <- mtcars_training %>% ml_random_forest(cyl ~ ., type = "classification") pred <- ml_predict(rf_model, mtcars_test) ml_multiclass_classification_evaluator(pred) # for regression rf_model <- mtcars_training %>% ml_random_forest(cyl ~ ., type = "regression") pred <- ml_predict(rf_model, mtcars_test) ml_regression_evaluator(pred, label_col = "cyl") # for binary classification rf_model <- mtcars_training %>% ml_random_forest(am ~ gear + carb, type = "classification") pred <- ml_predict(rf_model, mtcars_test) ml_binary_classification_evaluator(pred) ## End(Not run)
Spark ML - Feature Importance for Tree Models
ml_feature_importances(model, ...) ml_tree_feature_importance(model, ...)
model |
A decision tree-based model. |
... |
Optional arguments; currently unused. |
For ml_model
, a sorted data frame with feature labels and their relative importance.
For ml_prediction_model
, a vector of relative importances.
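As a sketch (assuming a tree-based model such as a random forest, covered elsewhere in this reference), feature importances might be extracted like this:

## Not run: 
library(sparklyr)

sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)

# Fit a tree-based model, then pull the relative feature importances.
rf_model <- ml_random_forest(iris_tbl, Species ~ ., type = "classification")
ml_feature_importances(rf_model)
## End(Not run)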
A parallel FP-growth algorithm to mine frequent itemsets.
ml_fpgrowth( x, items_col = "items", min_confidence = 0.8, min_support = 0.3, prediction_col = "prediction", uid = random_string("fpgrowth_"), ... ) ml_association_rules(model) ml_freq_itemsets(model)
x |
A |
items_col |
Items column name. Default: "items" |
min_confidence |
Minimal confidence for generating Association Rule.
|
min_support |
Minimal support level of the frequent pattern. [0.0, 1.0]. Any pattern that appears more than (min_support * size-of-the-dataset) times will be output in the frequent itemsets. Default: 0.3 |
prediction_col |
Prediction column name. |
uid |
A character string used to uniquely identify the ML estimator. |
... |
Optional arguments; currently unused. |
model |
A fitted FPGrowth model returned by |
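A minimal sketch, assuming transactions arrive as comma-separated strings that are split into an array column before fitting (the toy data below is illustrative):

## Not run: 
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

baskets <- data.frame(raw_items = c("a,b,c", "a,b", "b,c", "a,c"))
baskets_tbl <- sdf_copy_to(sc, baskets, overwrite = TRUE) %>%
  mutate(items = split(raw_items, ","))  # Spark SQL split() yields an array column

fp_model <- ml_fpgrowth(baskets_tbl, items_col = "items",
                        min_support = 0.5, min_confidence = 0.6)

ml_freq_itemsets(fp_model)
ml_association_rules(fp_model)
## End(Not run)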
This class performs expectation maximization for multivariate Gaussian Mixture Models (GMMs). A GMM represents a composite distribution of independent Gaussian distributions with associated "mixing" weights specifying each distribution's contribution to the composite. Given a set of sample points, this class will maximize the log-likelihood for a mixture of k Gaussians, iterating until the log-likelihood changes by less than tol
, or until it has reached the max number of iterations. While this process is generally guaranteed to converge, it is not guaranteed to find a global optimum.
ml_gaussian_mixture( x, formula = NULL, k = 2, max_iter = 100, tol = 0.01, seed = NULL, features_col = "features", prediction_col = "prediction", probability_col = "probability", uid = random_string("gaussian_mixture_"), ... )
x |
A |
formula |
Used when |
k |
The number of clusters to create |
max_iter |
The maximum number of iterations to use. |
tol |
Param for the convergence tolerance for iterative algorithms. |
seed |
A random seed. Set this value if you need your results to be reproducible across repeated calls. |
features_col |
Features column name, as a length-one character vector. The column should be single vector column of numeric values. Usually this column is output by |
prediction_col |
Prediction column name. |
probability_col |
Column name for predicted class conditional probabilities. Note: Not all models output well-calibrated probability estimates! These probabilities should be treated as confidences, not precise probabilities. |
uid |
A character string used to uniquely identify the ML estimator. |
... |
Optional arguments, see Details. |
The object returned depends on the class of x.
## Not run: sc <- spark_connect(master = "local") iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE) gmm_model <- ml_gaussian_mixture(iris_tbl, Species ~ .) pred <- sdf_predict(iris_tbl, gmm_model) ml_clustering_evaluator(pred) ## End(Not run)
Perform binary classification and regression using gradient boosted trees. Multiclass classification is not supported yet.
ml_gbt_classifier( x, formula = NULL, max_iter = 20, max_depth = 5, step_size = 0.1, subsampling_rate = 1, feature_subset_strategy = "auto", min_instances_per_node = 1L, max_bins = 32, min_info_gain = 0, loss_type = "logistic", seed = NULL, thresholds = NULL, checkpoint_interval = 10, cache_node_ids = FALSE, max_memory_in_mb = 256, features_col = "features", label_col = "label", prediction_col = "prediction", probability_col = "probability", raw_prediction_col = "rawPrediction", uid = random_string("gbt_classifier_"), ... ) ml_gradient_boosted_trees( x, formula = NULL, type = c("auto", "regression", "classification"), features_col = "features", label_col = "label", prediction_col = "prediction", probability_col = "probability", raw_prediction_col = "rawPrediction", checkpoint_interval = 10, loss_type = c("auto", "logistic", "squared", "absolute"), max_bins = 32, max_depth = 5, max_iter = 20L, min_info_gain = 0, min_instances_per_node = 1, step_size = 0.1, subsampling_rate = 1, feature_subset_strategy = "auto", seed = NULL, thresholds = NULL, cache_node_ids = FALSE, max_memory_in_mb = 256, uid = random_string("gradient_boosted_trees_"), response = NULL, features = NULL, ... ) ml_gbt_regressor( x, formula = NULL, max_iter = 20, max_depth = 5, step_size = 0.1, subsampling_rate = 1, feature_subset_strategy = "auto", min_instances_per_node = 1, max_bins = 32, min_info_gain = 0, loss_type = "squared", seed = NULL, checkpoint_interval = 10, cache_node_ids = FALSE, max_memory_in_mb = 256, features_col = "features", label_col = "label", prediction_col = "prediction", uid = random_string("gbt_regressor_"), ... )
x |
A |
formula |
Used when |
max_iter |
Maximum number of iterations. |
max_depth |
Maximum depth of the tree (>= 0); that is, the maximum number of nodes separating any leaves from the root of the tree. |
step_size |
Step size (a.k.a. learning rate) in interval (0, 1] for shrinking the contribution of each estimator. (default = 0.1) |
subsampling_rate |
Fraction of the training data used for learning each decision tree, in range (0, 1]. (default = 1.0) |
feature_subset_strategy |
The number of features to consider for splits at each tree node. See details for options. |
min_instances_per_node |
Minimum number of instances each child must have after split. |
max_bins |
The maximum number of bins used for discretizing continuous features and for choosing how to split on features at each node. More bins give higher granularity. |
min_info_gain |
Minimum information gain for a split to be considered at a tree node. Should be >= 0, defaults to 0. |
loss_type |
Loss function which GBT tries to minimize. Supported: |
seed |
Seed for random numbers. |
thresholds |
Thresholds in multi-class classification to adjust the probability of predicting each class. Array must have length equal to the number of classes, with values > 0 excepting that at most one value may be 0. The class with largest value |
checkpoint_interval |
Set checkpoint interval (>= 1) or disable checkpoint (-1). E.g. 10 means that the cache will get checkpointed every 10 iterations, defaults to 10. |
cache_node_ids |
If |
max_memory_in_mb |
Maximum memory in MB allocated to histogram aggregation. If too small, then 1 node will be split per iteration, and its aggregates may exceed this size. Defaults to 256. |
features_col |
Features column name, as a length-one character vector. The column should be single vector column of numeric values. Usually this column is output by |
label_col |
Label column name. The column should be a numeric column. Usually this column is output by |
prediction_col |
Prediction column name. |
probability_col |
Column name for predicted class conditional probabilities. |
raw_prediction_col |
Raw prediction (a.k.a. confidence) column name. |
uid |
A character string used to uniquely identify the ML estimator. |
... |
Optional arguments; see Details. |
type |
The type of model to fit. |
response |
(Deprecated) The name of the response column (as a length-one character vector.) |
features |
(Deprecated) The name of features (terms) to use for the model fit. |
The supported options for feature_subset_strategy
are
"auto"
: Choose automatically for task: If num_trees == 1
, set to "all"
. If num_trees > 1
(forest), set to "sqrt"
for classification and to "onethird"
for regression.
"all"
: use all features
"onethird"
: use 1/3 of the features
"sqrt"
: use sqrt(number of features)
"log2"
: use log2(number of features)
"n"
: when n
is in the range (0, 1.0], use n * number of features. When n
is in the range (1, number of features), use n
features. (default = "auto"
)
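For illustration, a hedged sketch of requesting one of these strategies explicitly (the data split and columns are only an example):

## Not run: 
library(sparklyr)

sc <- spark_connect(master = "local")
mtcars_tbl <- sdf_copy_to(sc, mtcars, name = "mtcars_tbl", overwrite = TRUE)

# Binary classification with a square-root feature subset at each split.
gbt_model <- mtcars_tbl %>%
  ml_gbt_classifier(am ~ gear + carb + hp + wt,
                    feature_subset_strategy = "sqrt")
ml_predict(gbt_model, mtcars_tbl)
## End(Not run)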
ml_gradient_boosted_trees
is a wrapper around ml_gbt_regressor.tbl_spark
and ml_gbt_classifier.tbl_spark
and calls the appropriate method based on model type.
The object returned depends on the class of x
. If it is a
spark_connection
, the function returns a ml_estimator
object. If
it is a ml_pipeline
, it will return a pipeline with the predictor
appended to it. If a tbl_spark
, it will return a tbl_spark
with
the predictions added to it.
Other ml algorithms:
ml_aft_survival_regression()
,
ml_decision_tree_classifier()
,
ml_generalized_linear_regression()
,
ml_isotonic_regression()
,
ml_linear_regression()
,
ml_linear_svc()
,
ml_logistic_regression()
,
ml_multilayer_perceptron_classifier()
,
ml_naive_bayes()
,
ml_one_vs_rest()
,
ml_random_forest_classifier()
## Not run: sc <- spark_connect(master = "local") iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE) partitions <- iris_tbl %>% sdf_random_split(training = 0.7, test = 0.3, seed = 1111) iris_training <- partitions$training iris_test <- partitions$test gbt_model <- iris_training %>% ml_gradient_boosted_trees(Sepal_Length ~ Petal_Length + Petal_Width) pred <- ml_predict(gbt_model, iris_test) ml_regression_evaluator(pred, label_col = "Sepal_Length") ## End(Not run)
Perform regression using Generalized Linear Model (GLM).
ml_generalized_linear_regression( x, formula = NULL, family = "gaussian", link = NULL, fit_intercept = TRUE, offset_col = NULL, link_power = NULL, link_prediction_col = NULL, reg_param = 0, max_iter = 25, weight_col = NULL, solver = "irls", tol = 1e-06, variance_power = 0, features_col = "features", label_col = "label", prediction_col = "prediction", uid = random_string("generalized_linear_regression_"), ... )
x |
A |
formula |
Used when |
family |
Name of family which is a description of the error distribution to be used in the model. Supported options: "gaussian", "binomial", "poisson", "gamma" and "tweedie". Default is "gaussian". |
link |
Name of link function which provides the relationship between the linear predictor and the mean of the distribution function. See Details for supported link functions. |
fit_intercept |
Boolean; should the model be fit with an intercept term? |
offset_col |
Offset column name. If this is not set, we treat all instance offsets as 0.0. The feature specified as offset has a constant coefficient of 1.0. |
link_power |
Index in the power link function. Only applicable to the Tweedie family. Note that link power 0, 1, -1 or 0.5 corresponds to the Log, Identity, Inverse or Sqrt link, respectively. When not set, this value defaults to 1 - variancePower, which matches the R "statmod" package. |
link_prediction_col |
Link prediction (linear predictor) column name. Default is not set, which means we do not output link prediction. |
reg_param |
Regularization parameter (aka lambda) |
max_iter |
The maximum number of iterations to use. |
weight_col |
The name of the column to use as weights for the model fit. |
solver |
Solver algorithm for optimization. |
tol |
Param for the convergence tolerance for iterative algorithms. |
variance_power |
Power in the variance function of the Tweedie distribution which provides the relationship between the variance and mean of the distribution. Only applicable to the Tweedie family. (see Tweedie Distribution (Wikipedia)) Supported values: 0 and [1, Inf). Note that variance power 0, 1, or 2 corresponds to the Gaussian, Poisson or Gamma family, respectively. |
features_col |
Features column name, as a length-one character vector. The column should be single vector column of numeric values. Usually this column is output by |
label_col |
Label column name. The column should be a numeric column. Usually this column is output by |
prediction_col |
Prediction column name. |
uid |
A character string used to uniquely identify the ML estimator. |
... |
Optional arguments; see Details. |
Valid link functions for each family are listed below. The first link function of each family is the default one.
gaussian: "identity", "log", "inverse"
binomial: "logit", "probit", "cloglog"
poisson: "log", "identity", "sqrt"
gamma: "inverse", "identity", "log"
tweedie: power link function specified through link_power
. The default link power in the tweedie family is 1 - variance_power
.
The object returned depends on the class of x
. If it is a
spark_connection
, the function returns a ml_estimator
object. If
it is a ml_pipeline
, it will return a pipeline with the predictor
appended to it. If a tbl_spark
, it will return a tbl_spark
with
the predictions added to it.
Other ml algorithms:
ml_aft_survival_regression()
,
ml_decision_tree_classifier()
,
ml_gbt_classifier()
,
ml_isotonic_regression()
,
ml_linear_regression()
,
ml_linear_svc()
,
ml_logistic_regression()
,
ml_multilayer_perceptron_classifier()
,
ml_naive_bayes()
,
ml_one_vs_rest()
,
ml_random_forest_classifier()
## Not run: library(sparklyr) sc <- spark_connect(master = "local") mtcars_tbl <- sdf_copy_to(sc, mtcars, name = "mtcars_tbl", overwrite = TRUE) partitions <- mtcars_tbl %>% sdf_random_split(training = 0.7, test = 0.3, seed = 1111) mtcars_training <- partitions$training mtcars_test <- partitions$test # Specify the grid family <- c("gaussian", "gamma", "poisson") link <- c("identity", "log") family_link <- expand.grid(family = family, link = link, stringsAsFactors = FALSE) family_link <- data.frame(family_link, rmse = 0) # Train the models for (i in seq_len(nrow(family_link))) { glm_model <- mtcars_training %>% ml_generalized_linear_regression(mpg ~ ., family = family_link[i, 1], link = family_link[i, 2] ) pred <- ml_predict(glm_model, mtcars_test) family_link[i, 3] <- ml_regression_evaluator(pred, label_col = "mpg") } family_link ## End(Not run)
These methods summarize the results of Spark ML models into tidy forms.
## S3 method for class 'ml_model_generalized_linear_regression' tidy(x, exponentiate = FALSE, ...) ## S3 method for class 'ml_model_linear_regression' tidy(x, ...) ## S3 method for class 'ml_model_generalized_linear_regression' augment( x, newdata = NULL, type.residuals = c("working", "deviance", "pearson", "response"), ... ) ## S3 method for class ''_ml_model_linear_regression'' augment( x, new_data = NULL, type.residuals = c("working", "deviance", "pearson", "response"), ... ) ## S3 method for class 'ml_model_linear_regression' augment( x, newdata = NULL, type.residuals = c("working", "deviance", "pearson", "response"), ... ) ## S3 method for class 'ml_model_generalized_linear_regression' glance(x, ...) ## S3 method for class 'ml_model_linear_regression' glance(x, ...)
x |
a Spark ML model. |
exponentiate |
For GLM, whether to exponentiate the coefficient estimates (typical for logistic regression.) |
... |
extra arguments (not used.) |
newdata |
a tbl_spark of new data to use for prediction. |
type.residuals |
type of residuals, defaults to |
new_data |
a tbl_spark of new data to use for prediction. |
The residuals attached by augment
are of type "working" by default,
which is different from the default of "deviance" for residuals()
or sdf_residuals()
.
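A hedged sketch of these tidiers applied to a generalized linear regression fit (dataset and formula are illustrative):

## Not run: 
library(sparklyr)

sc <- spark_connect(master = "local")
mtcars_tbl <- sdf_copy_to(sc, mtcars, name = "mtcars_tbl", overwrite = TRUE)

glm_model <- ml_generalized_linear_regression(mtcars_tbl, mpg ~ wt + cyl)

tidy(glm_model)    # coefficient-level output
glance(glm_model)  # model-level summary
augment(glm_model, newdata = mtcars_tbl, type.residuals = "deviance")
## End(Not run)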
Currently implemented using the parallelized pool adjacent violators algorithm. Only the univariate (single feature) algorithm is supported.
ml_isotonic_regression( x, formula = NULL, feature_index = 0, isotonic = TRUE, weight_col = NULL, features_col = "features", label_col = "label", prediction_col = "prediction", uid = random_string("isotonic_regression_"), ... )
x |
A |
formula |
Used when |
feature_index |
Index of the feature if |
isotonic |
Whether the output sequence should be isotonic/increasing (true) or antitonic/decreasing (false). Default: true |
weight_col |
The name of the column to use as weights for the model fit. |
features_col |
Features column name, as a length-one character vector. The column should be single vector column of numeric values. Usually this column is output by |
label_col |
Label column name. The column should be a numeric column. Usually this column is output by |
prediction_col |
Prediction column name. |
uid |
A character string used to uniquely identify the ML estimator. |
... |
Optional arguments; see Details. |
The object returned depends on the class of x
. If it is a
spark_connection
, the function returns a ml_estimator
object. If
it is a ml_pipeline
, it will return a pipeline with the predictor
appended to it. If a tbl_spark
, it will return a tbl_spark
with
the predictions added to it.
Other ml algorithms:
ml_aft_survival_regression()
,
ml_decision_tree_classifier()
,
ml_gbt_classifier()
,
ml_generalized_linear_regression()
,
ml_linear_regression()
,
ml_linear_svc()
,
ml_logistic_regression()
,
ml_multilayer_perceptron_classifier()
,
ml_naive_bayes()
,
ml_one_vs_rest()
,
ml_random_forest_classifier()
## Not run: sc <- spark_connect(master = "local") iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE) partitions <- iris_tbl %>% sdf_random_split(training = 0.7, test = 0.3, seed = 1111) iris_training <- partitions$training iris_test <- partitions$test iso_res <- iris_tbl %>% ml_isotonic_regression(Petal_Length ~ Petal_Width) pred <- ml_predict(iso_res, iris_test) pred ## End(Not run)
These methods summarize the results of Spark ML models into tidy forms.
## S3 method for class 'ml_model_isotonic_regression' tidy(x, ...) ## S3 method for class 'ml_model_isotonic_regression' augment(x, newdata = NULL, ...) ## S3 method for class 'ml_model_isotonic_regression' glance(x, ...)
x |
a Spark ML model. |
... |
extra arguments (not used.) |
newdata |
a tbl_spark of new data to use for prediction. |
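Illustrative usage on an isotonic regression fit (the columns are chosen only for the example):

## Not run: 
library(sparklyr)

sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)

iso_model <- ml_isotonic_regression(iris_tbl, Petal_Length ~ Petal_Width)

tidy(iso_model)    # fitted model output in tidy form
glance(iso_model)  # model-level summary
augment(iso_model, newdata = iris_tbl)
## End(Not run)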
K-means clustering with support for k-means|| initialization proposed by Bahmani et al. Using 'ml_kmeans()' with the formula interface requires Spark 2.0+.
ml_kmeans( x, formula = NULL, k = 2, max_iter = 20, tol = 1e-04, init_steps = 2, init_mode = "k-means||", seed = NULL, features_col = "features", prediction_col = "prediction", uid = random_string("kmeans_"), ... ) ml_compute_cost(model, dataset) ml_compute_silhouette_measure( model, dataset, distance_measure = c("squaredEuclidean", "cosine") )
x |
A |
formula |
Used when |
k |
The number of clusters to create |
max_iter |
The maximum number of iterations to use. |
tol |
Param for the convergence tolerance for iterative algorithms. |
init_steps |
Number of steps for the k-means|| initialization mode. This is an advanced setting – the default of 2 is almost always enough. Must be > 0. Default: 2. |
init_mode |
Initialization algorithm. This can be either "random" to choose random points as initial cluster centers, or "k-means||" to use a parallel variant of k-means++ (Bahmani et al., Scalable K-Means++, VLDB 2012). Default: k-means||. |
seed |
A random seed. Set this value if you need your results to be reproducible across repeated calls. |
features_col |
Features column name, as a length-one character vector. The column should be single vector column of numeric values. Usually this column is output by |
prediction_col |
Prediction column name. |
uid |
A character string used to uniquely identify the ML estimator. |
... |
Optional arguments, see Details. |
The object returned depends on the class of x.
model |
A fitted K-means model returned by |
dataset |
Dataset on which to calculate K-means cost |
distance_measure |
Distance measure to apply when computing the Silhouette measure. |
ml_compute_cost()
returns the K-means cost (sum of
squared distances of points to their nearest center) for the model
on the given data.
ml_compute_silhouette_measure()
returns the Silhouette measure
of the clustering on the given data.
## Not run: sc <- spark_connect(master = "local") iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE) ml_kmeans(iris_tbl, Species ~ .) ## End(Not run)
Evaluate a K-means clustering
model |
A fitted K-means model returned by |
dataset |
Dataset on which to calculate K-means cost |
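A minimal, hedged sketch of using these helpers on a fitted model; it assumes the sc connection and iris_tbl copy from the ml_kmeans() example:

## Not run: 
kmeans_model <- ml_kmeans(iris_tbl, ~ Petal_Width + Petal_Length, k = 3)
# within-cluster sum of squared distances on the training data
ml_compute_cost(kmeans_model, iris_tbl)
# silhouette measure of the clustering (squared Euclidean distance)
ml_compute_silhouette_measure(kmeans_model, iris_tbl,
  distance_measure = "squaredEuclidean"
)
## End(Not run)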
Latent Dirichlet Allocation (LDA), a topic model designed for text documents.
ml_lda( x, formula = NULL, k = 10, max_iter = 20, doc_concentration = NULL, topic_concentration = NULL, subsampling_rate = 0.05, optimizer = "online", checkpoint_interval = 10, keep_last_checkpoint = TRUE, learning_decay = 0.51, learning_offset = 1024, optimize_doc_concentration = TRUE, seed = NULL, features_col = "features", topic_distribution_col = "topicDistribution", uid = random_string("lda_"), ... ) ml_describe_topics(model, max_terms_per_topic = 10) ml_log_likelihood(model, dataset) ml_log_perplexity(model, dataset) ml_topics_matrix(model)
x |
A |
formula |
Used when |
k |
The number of clusters to create |
max_iter |
The maximum number of iterations to use. |
doc_concentration |
Concentration parameter (commonly named "alpha") for the prior placed on documents' distributions over topics ("theta"). See details. |
topic_concentration |
Concentration parameter (commonly named "beta" or "eta") for the prior placed on topics' distributions over terms. |
subsampling_rate |
(For Online optimizer only) Fraction of the corpus
to be sampled and used in each iteration of mini-batch gradient descent, in
range (0, 1]. Note that this should be adjusted in synch with |
optimizer |
Optimizer or inference algorithm used to estimate the LDA model. Supported: "online" for Online Variational Bayes (default) and "em" for Expectation-Maximization. |
checkpoint_interval |
Set checkpoint interval (>= 1) or disable checkpoint (-1). E.g. 10 means that the cache will get checkpointed every 10 iterations, defaults to 10. |
keep_last_checkpoint |
(Spark 2.0.0+) (For EM optimizer only) If using
checkpointing, this indicates whether to keep the last checkpoint.
If |
learning_decay |
(For Online optimizer only) Learning rate, set as an exponential decay rate. This should be between (0.5, 1.0] to guarantee asymptotic convergence. This is called "kappa" in the Online LDA paper (Hoffman et al., 2010). Default: 0.51, based on Hoffman et al. |
learning_offset |
(For Online optimizer only) A (positive) learning parameter that downweights early iterations. Larger values make early iterations count less. This is called "tau0" in the Online LDA paper (Hoffman et al., 2010) Default: 1024, following Hoffman et al. |
optimize_doc_concentration |
(For Online optimizer only) Indicates
whether the |
seed |
A random seed. Set this value if you need your results to be reproducible across repeated calls. |
features_col |
Features column name, as a length-one character vector. The column should be a single vector column of numeric values. Usually this column is output by |
topic_distribution_col |
Output column with estimates of the topic mixture distribution for each document (often called "theta" in the literature). Returns a vector of zeros for an empty document. |
uid |
A character string used to uniquely identify the ML estimator. |
... |
Optional arguments, see Details. The object returned depends on the class of x. |
model |
A fitted LDA model returned by |
max_terms_per_topic |
Maximum number of terms to collect for each topic. Default value of 10. |
dataset |
test corpus to use for calculating log likelihood or log perplexity |
For 'ml_lda.tbl_spark' with the formula interface, you can specify named arguments in '...' that will be passed to 'ft_regex_tokenizer()', 'ft_stop_words_remover()', and 'ft_count_vectorizer()'. For example, to increase the default 'min_token_length', you can use 'ml_lda(dataset, ~ text, min_token_length = 4)'.
Terminology for LDA:
"term" = "word": an element of the vocabulary
"token": instance of a term appearing in a document
"topic": multinomial distribution over terms representing some concept
"document": one piece of text, corresponding to one row in the input data
Original LDA paper (journal version): Blei, Ng, and Jordan. "Latent Dirichlet Allocation." JMLR, 2003.
Input data (features_col
): LDA is given a collection of documents as
input data, via the features_col
parameter. Each document is specified
as a Vector of length vocab_size
, where each entry is the count for
the corresponding term (word) in the document. Feature transformers such as
ft_tokenizer
and ft_count_vectorizer
can be
useful for converting text to word count vectors.
ml_describe_topics
returns a DataFrame with topics and their top-weighted terms.
ml_log_likelihood
calculates a lower bound on the log likelihood of
the entire corpus.
doc_concentration
This is the parameter to a Dirichlet distribution, where larger values mean
more smoothing (more regularization). If not set by the user, then
doc_concentration
is set automatically. If set to a singleton vector
[alpha], then alpha is replicated to a vector of length k in fitting.
Otherwise, the doc_concentration
vector must be length k.
(default = automatic)
Optimizer-specific parameter settings:
EM
Currently only supports symmetric distributions, so all values in the vector should be the same.
Values should be greater than 1.0
default = uniformly (50 / k) + 1, where 50/k is common in LDA libraries and +1 follows from Asuncion et al. (2009), who recommend a +1 adjustment for EM.
Online
Values should be greater than or equal to 0
default = uniformly (1.0 / k), following the Online LDA reference implementation
topic_concentration
This is the parameter to a symmetric Dirichlet distribution.
Note: The topics' distributions over terms are called "beta" in the original LDA paper by Blei et al., but are called "phi" in many later papers such as Asuncion et al., 2009.
If not set by the user, then topic_concentration
is set automatically.
(default = automatic)
Optimizer-specific parameter settings:
EM
Value should be greater than 1.0
default = 0.1 + 1, where 0.1 gives a small amount of smoothing and +1 follows Asuncion et al. (2009), who recommend a +1 adjustment for EM.
Online
Value should be greater than or equal to 0
default = (1.0 / k), following the Online LDA reference implementation.
topic_distribution_col
This uses a variational approximation following Hoffman et al. (2010), where the approximate distribution is called "gamma." Technically, this method returns this approximation "gamma" for each document.
## Not run: library(janeaustenr) library(dplyr) sc <- spark_connect(master = "local") lines_tbl <- sdf_copy_to(sc, austen_books()[c(1:30), ], name = "lines_tbl", overwrite = TRUE ) # transform the data in a tidy form lines_tbl_tidy <- lines_tbl %>% ft_tokenizer( input_col = "text", output_col = "word_list" ) %>% ft_stop_words_remover( input_col = "word_list", output_col = "wo_stop_words" ) %>% mutate(text = explode(wo_stop_words)) %>% filter(text != "") %>% select(text, book) lda_model <- lines_tbl_tidy %>% ml_lda(~text, k = 4) # vocabulary and topics tidy(lda_model) ## End(Not run)
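A hedged sketch of the topic inspection helpers, reusing the lda_model fitted in the example above:

## Not run: 
# top-weighted terms for each of the k topics
ml_describe_topics(lda_model, max_terms_per_topic = 5)
# matrix of topic distributions over the vocabulary
ml_topics_matrix(lda_model)
## End(Not run)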
These methods summarize the results of Spark ML models into tidy forms.
## S3 method for class 'ml_model_lda' tidy(x, ...) ## S3 method for class 'ml_model_lda' augment(x, newdata = NULL, ...) ## S3 method for class 'ml_model_lda' glance(x, ...)
x |
a Spark ML model. |
... |
extra arguments (not used.) |
newdata |
a tbl_spark of new data to use for prediction. |
Perform regression using linear regression.
ml_linear_regression( x, formula = NULL, fit_intercept = TRUE, elastic_net_param = 0, reg_param = 0, max_iter = 100, weight_col = NULL, loss = "squaredError", solver = "auto", standardization = TRUE, tol = 1e-06, features_col = "features", label_col = "label", prediction_col = "prediction", uid = random_string("linear_regression_"), ... )
x |
A |
formula |
Used when |
fit_intercept |
Boolean; should the model be fit with an intercept term? |
elastic_net_param |
ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty. |
reg_param |
Regularization parameter (aka lambda) |
max_iter |
The maximum number of iterations to use. |
weight_col |
The name of the column to use as weights for the model fit. |
loss |
The loss function to be optimized. Supported options: "squaredError" and "huber". Default: "squaredError" |
solver |
Solver algorithm for optimization. |
standardization |
Whether to standardize the training features before fitting the model. |
tol |
Param for the convergence tolerance for iterative algorithms. |
features_col |
Features column name, as a length-one character vector. The column should be a single vector column of numeric values. Usually this column is output by |
label_col |
Label column name. The column should be a numeric column. Usually this column is output by |
prediction_col |
Prediction column name. |
uid |
A character string used to uniquely identify the ML estimator. |
... |
Optional arguments; see Details. |
The object returned depends on the class of x
. If it is a
spark_connection
, the function returns a ml_estimator
object. If
it is a ml_pipeline
, it will return a pipeline with the predictor
appended to it. If a tbl_spark
, it will return a tbl_spark
with
the predictions added to it.
Other ml algorithms:
ml_aft_survival_regression()
,
ml_decision_tree_classifier()
,
ml_gbt_classifier()
,
ml_generalized_linear_regression()
,
ml_isotonic_regression()
,
ml_linear_svc()
,
ml_logistic_regression()
,
ml_multilayer_perceptron_classifier()
,
ml_naive_bayes()
,
ml_one_vs_rest()
,
ml_random_forest_classifier()
## Not run: sc <- spark_connect(master = "local") mtcars_tbl <- sdf_copy_to(sc, mtcars, name = "mtcars_tbl", overwrite = TRUE) partitions <- mtcars_tbl %>% sdf_random_split(training = 0.7, test = 0.3, seed = 1111) mtcars_training <- partitions$training mtcars_test <- partitions$test lm_model <- mtcars_training %>% ml_linear_regression(mpg ~ .) pred <- ml_predict(lm_model, mtcars_test) ml_regression_evaluator(pred, label_col = "mpg") ## End(Not run)
Perform classification using linear support vector machines (SVM). This binary classifier optimizes the Hinge Loss using the OWLQN optimizer. Only supports L2 regularization currently.
ml_linear_svc( x, formula = NULL, fit_intercept = TRUE, reg_param = 0, max_iter = 100, standardization = TRUE, weight_col = NULL, tol = 1e-06, threshold = 0, aggregation_depth = 2, features_col = "features", label_col = "label", prediction_col = "prediction", raw_prediction_col = "rawPrediction", uid = random_string("linear_svc_"), ... )
x |
A |
formula |
Used when |
fit_intercept |
Boolean; should the model be fit with an intercept term? |
reg_param |
Regularization parameter (aka lambda) |
max_iter |
The maximum number of iterations to use. |
standardization |
Whether to standardize the training features before fitting the model. |
weight_col |
The name of the column to use as weights for the model fit. |
tol |
Param for the convergence tolerance for iterative algorithms. |
threshold |
in binary classification prediction, in range [0, 1]. |
aggregation_depth |
(Spark 2.1.0+) Suggested depth for treeAggregate (>= 2). |
features_col |
Features column name, as a length-one character vector. The column should be a single vector column of numeric values. Usually this column is output by |
label_col |
Label column name. The column should be a numeric column. Usually this column is output by |
prediction_col |
Prediction column name. |
raw_prediction_col |
Raw prediction (a.k.a. confidence) column name. |
uid |
A character string used to uniquely identify the ML estimator. |
... |
Optional arguments; see Details. |
The object returned depends on the class of x
. If it is a
spark_connection
, the function returns a ml_estimator
object. If
it is a ml_pipeline
, it will return a pipeline with the predictor
appended to it. If a tbl_spark
, it will return a tbl_spark
with
the predictions added to it.
Other ml algorithms:
ml_aft_survival_regression()
,
ml_decision_tree_classifier()
,
ml_gbt_classifier()
,
ml_generalized_linear_regression()
,
ml_isotonic_regression()
,
ml_linear_regression()
,
ml_logistic_regression()
,
ml_multilayer_perceptron_classifier()
,
ml_naive_bayes()
,
ml_one_vs_rest()
,
ml_random_forest_classifier()
## Not run: library(dplyr) sc <- spark_connect(master = "local") iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE) partitions <- iris_tbl %>% filter(Species != "setosa") %>% sdf_random_split(training = 0.7, test = 0.3, seed = 1111) iris_training <- partitions$training iris_test <- partitions$test svc_model <- iris_training %>% ml_linear_svc(Species ~ .) pred <- ml_predict(svc_model, iris_test) ml_binary_classification_evaluator(pred) ## End(Not run)
These methods summarize the results of Spark ML models into tidy forms.
## S3 method for class 'ml_model_linear_svc' tidy(x, ...) ## S3 method for class 'ml_model_linear_svc' augment(x, newdata = NULL, ...) ## S3 method for class 'ml_model_linear_svc' glance(x, ...)
x |
a Spark ML model. |
... |
extra arguments (not used.) |
newdata |
a tbl_spark of new data to use for prediction. |
Perform classification using logistic regression.
ml_logistic_regression( x, formula = NULL, fit_intercept = TRUE, elastic_net_param = 0, reg_param = 0, max_iter = 100, threshold = 0.5, thresholds = NULL, tol = 1e-06, weight_col = NULL, aggregation_depth = 2, lower_bounds_on_coefficients = NULL, lower_bounds_on_intercepts = NULL, upper_bounds_on_coefficients = NULL, upper_bounds_on_intercepts = NULL, features_col = "features", label_col = "label", family = "auto", prediction_col = "prediction", probability_col = "probability", raw_prediction_col = "rawPrediction", uid = random_string("logistic_regression_"), ... )
x |
A |
formula |
Used when |
fit_intercept |
Boolean; should the model be fit with an intercept term? |
elastic_net_param |
ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty. |
reg_param |
Regularization parameter (aka lambda) |
max_iter |
The maximum number of iterations to use. |
threshold |
in binary classification prediction, in range [0, 1]. |
thresholds |
Thresholds in multi-class classification to adjust the probability of predicting each class. Array must have length equal to the number of classes, with values > 0 excepting that at most one value may be 0. The class with largest value |
tol |
Param for the convergence tolerance for iterative algorithms. |
weight_col |
The name of the column to use as weights for the model fit. |
aggregation_depth |
(Spark 2.1.0+) Suggested depth for treeAggregate (>= 2). |
lower_bounds_on_coefficients |
(Spark 2.2.0+) Lower bounds on coefficients if fitting under bound constrained optimization. The bound matrix must be compatible with the shape (1, number of features) for binomial regression, or (number of classes, number of features) for multinomial regression. |
lower_bounds_on_intercepts |
(Spark 2.2.0+) Lower bounds on intercepts if fitting under bound constrained optimization. The bounds vector size must be equal to 1 for binomial regression, or to the number of classes for multinomial regression. |
upper_bounds_on_coefficients |
(Spark 2.2.0+) Upper bounds on coefficients if fitting under bound constrained optimization. The bound matrix must be compatible with the shape (1, number of features) for binomial regression, or (number of classes, number of features) for multinomial regression. |
upper_bounds_on_intercepts |
(Spark 2.2.0+) Upper bounds on intercepts if fitting under bound constrained optimization. The bounds vector size must be equal to 1 for binomial regression, or to the number of classes for multinomial regression. |
features_col |
Features column name, as a length-one character vector. The column should be a single vector column of numeric values. Usually this column is output by |
label_col |
Label column name. The column should be a numeric column. Usually this column is output by |
family |
(Spark 2.1.0+) Param for the name of family which is a description of the label distribution to be used in the model. Supported options: "auto", "binomial", and "multinomial." |
prediction_col |
Prediction column name. |
probability_col |
Column name for predicted class conditional probabilities. |
raw_prediction_col |
Raw prediction (a.k.a. confidence) column name. |
uid |
A character string used to uniquely identify the ML estimator. |
... |
Optional arguments; see Details. |
The object returned depends on the class of x
. If it is a
spark_connection
, the function returns a ml_estimator
object. If
it is a ml_pipeline
, it will return a pipeline with the predictor
appended to it. If a tbl_spark
, it will return a tbl_spark
with
the predictions added to it.
Other ml algorithms:
ml_aft_survival_regression()
,
ml_decision_tree_classifier()
,
ml_gbt_classifier()
,
ml_generalized_linear_regression()
,
ml_isotonic_regression()
,
ml_linear_regression()
,
ml_linear_svc()
,
ml_multilayer_perceptron_classifier()
,
ml_naive_bayes()
,
ml_one_vs_rest()
,
ml_random_forest_classifier()
## Not run: sc <- spark_connect(master = "local") mtcars_tbl <- sdf_copy_to(sc, mtcars, name = "mtcars_tbl", overwrite = TRUE) partitions <- mtcars_tbl %>% sdf_random_split(training = 0.7, test = 0.3, seed = 1111) mtcars_training <- partitions$training mtcars_test <- partitions$test lr_model <- mtcars_training %>% ml_logistic_regression(am ~ gear + carb) pred <- ml_predict(lr_model, mtcars_test) ml_binary_classification_evaluator(pred) ## End(Not run)
These methods summarize the results of Spark ML models into tidy forms.
## S3 method for class 'ml_model_logistic_regression' tidy(x, ...) ## S3 method for class 'ml_model_logistic_regression' augment(x, newdata = NULL, ...) ## S3 method for class ''_ml_model_logistic_regression'' augment(x, new_data = NULL, ...) ## S3 method for class 'ml_model_logistic_regression' glance(x, ...)
x |
a Spark ML model. |
... |
extra arguments (not used.) |
newdata |
a tbl_spark of new data to use for prediction. |
new_data |
a tbl_spark of new data to use for prediction. |
The function works best when passed a 'tbl_spark' created by 'ml_predict()'. The output 'tbl_spark' will contain the correct variable types and format that the given Spark model "evaluator" expects.
ml_metrics_binary( x, truth = label, estimate = rawPrediction, metrics = c("roc_auc", "pr_auc"), ... )
x |
A 'tbl_spark' containing the estimate (prediction) and the truth (value of what actually happened) |
truth |
The name of the column from 'x' with an integer field containing the binary response (0 or 1). The 'ml_predict()' function will create a new field named 'label' which contains the expected type and values. 'truth' defaults to 'label'. |
estimate |
The name of the column from 'x' that contains the prediction. Defaults to 'rawPrediction', since its type and expected values will match 'truth'. |
metrics |
A character vector with the metrics to calculate. For binary models the possible values are: 'roc_auc' (area under the Receiver Operating Characteristic curve), 'pr_auc' (area under the Precision-Recall curve). Defaults to: 'roc_auc', 'pr_auc' |
... |
Optional arguments; currently unused. |
The 'ml_metrics' family of functions implements Spark's 'evaluate' functionality in a way that is closer to how the 'yardstick' package works. The functions expect a table containing the truth and estimate, and return a 'tibble' with the results. The 'tibble' has the same format and variable names as the output of the 'yardstick' functions.
## Not run: sc <- spark_connect("local") tbl_iris <- copy_to(sc, iris) prep_iris <- tbl_iris %>% mutate(is_setosa = ifelse(Species == "setosa", 1, 0)) iris_split <- sdf_random_split(prep_iris, training = 0.5, test = 0.5) model <- ml_logistic_regression(iris_split$training, "is_setosa ~ Sepal_Length") tbl_predictions <- ml_predict(model, iris_split$test) ml_metrics_binary(tbl_predictions) ## End(Not run)
## Not run: sc <- spark_connect("local") tbl_iris <- copy_to(sc, iris) prep_iris <- tbl_iris %>% mutate(is_setosa = ifelse(Species == "setosa", 1, 0)) iris_split <- sdf_random_split(prep_iris, training = 0.5, test = 0.5) model <- ml_logistic_regression(iris_split$training, "is_setosa ~ Sepal_Length") tbl_predictions <- ml_predict(model, iris_split$test) ml_metrics_binary(tbl_predictions) ## End(Not run)
The function works best when passed a 'tbl_spark' created by 'ml_predict()'. The output 'tbl_spark' will contain the correct variable types and format that the given Spark model "evaluator" expects.
ml_metrics_multiclass( x, truth = label, estimate = prediction, metrics = c("accuracy"), beta = NULL, ... )
x |
A 'tbl_spark' containing the estimate (prediction) and the truth (value of what actually happened) |
truth |
The name of the column from 'x' with an integer field containing the indexed value for each outcome. The 'ml_predict()' function will create a new field named 'label' which contains the expected type and values. 'truth' defaults to 'label'. |
estimate |
The name of the column from 'x' that contains the prediction. Defaults to 'prediction', since its type and indexed values will match 'truth'. |
metrics |
A character vector with the metrics to calculate. For multiclass models the possible values are: 'accuracy', 'f_meas' (F-score), 'recall' and 'precision'. This function translates the argument into an acceptable Spark parameter. If no translation is found, then the raw value of the argument is passed to Spark. This makes it possible to request a metric that is not listed here but, depending on the Spark version, is available. Other metrics for multi-class models are: 'weightedTruePositiveRate', 'weightedFalsePositiveRate', 'weightedFMeasure', 'truePositiveRateByLabel', 'falsePositiveRateByLabel', 'precisionByLabel', 'recallByLabel', 'fMeasureByLabel', 'logLoss', 'hammingLoss' |
beta |
Numerical value used for precision and recall. Defaults to NULL, but if the Spark session's version is 3.0 or above, then NULL is changed to 1, unless a different value is supplied in this argument. |
... |
Optional arguments; currently unused. |
The 'ml_metrics' family of functions implements Spark's 'evaluate' functionality in a way that is closer to how the 'yardstick' package works. The functions expect a table containing the truth and estimate, and return a 'tibble' with the results. The 'tibble' has the same format and variable names as the output of the 'yardstick' functions.
## Not run: sc <- spark_connect("local") tbl_iris <- copy_to(sc, iris) iris_split <- sdf_random_split(tbl_iris, training = 0.5, test = 0.5) model <- ml_random_forest(iris_split$training, "Species ~ .") tbl_predictions <- ml_predict(model, iris_split$test) ml_metrics_multiclass(tbl_predictions) # Request different metrics ml_metrics_multiclass(tbl_predictions, metrics = c("recall", "precision")) # Request metrics not translated by the function, but valid in Spark ml_metrics_multiclass(tbl_predictions, metrics = c("logLoss", "hammingLoss")) ## End(Not run)
## Not run: sc <- spark_connect("local") tbl_iris <- copy_to(sc, iris) iris_split <- sdf_random_split(tbl_iris, training = 0.5, test = 0.5) model <- ml_random_forest(iris_split$training, "Species ~ .") tbl_predictions <- ml_predict(model, iris_split$test) ml_metrics_multiclass(tbl_predictions) # Request different metrics ml_metrics_multiclass(tbl_predictions, metrics = c("recall", "precision")) # Request metrics not translated by the function, but valid in Spark ml_metrics_multiclass(tbl_predictions, metrics = c("logLoss", "hammingLoss")) ## End(Not run)
The function works best when passed a 'tbl_spark' created by 'ml_predict()'. The output 'tbl_spark' will contain the correct variable types and format that the given Spark model "evaluator" expects.
ml_metrics_regression( x, truth, estimate = prediction, metrics = c("rmse", "rsq", "mae"), ... )
x |
A 'tbl_spark' containing the estimate (prediction) and the truth (value of what actually happened) |
truth |
The name of the column from 'x' that contains the value of what actually happened |
estimate |
The name of the column from 'x' that contains the prediction. Defaults to 'prediction', since it is the default that 'ml_predict()' uses. |
metrics |
A character vector with the metrics to calculate. For regression models the possible values are: 'rmse' (Root mean squared error), 'mse' (Mean squared error),'rsq' (R squared), 'mae' (Mean absolute error), and 'var' (Explained variance). Defaults to: 'rmse', 'rsq', 'mae' |
... |
Optional arguments; currently unused. |
The 'ml_metrics' family of functions implements Spark's 'evaluate' functionality in a way that is closer to how the 'yardstick' package works. The functions expect a table containing the truth and estimate, and return a 'tibble' with the results. The 'tibble' has the same format and variable names as the output of the 'yardstick' functions.
## Not run: sc <- spark_connect("local") tbl_iris <- copy_to(sc, iris) iris_split <- sdf_random_split(tbl_iris, training = 0.5, test = 0.5) training <- iris_split$training reg_formula <- "Sepal_Length ~ Sepal_Width + Petal_Length + Petal_Width" model <- ml_generalized_linear_regression(training, reg_formula) tbl_predictions <- ml_predict(model, iris_split$test) tbl_predictions %>% ml_metrics_regression(Sepal_Length) ## End(Not run)
## Not run: sc <- spark_connect("local") tbl_iris <- copy_to(sc, iris) iris_split <- sdf_random_split(tbl_iris, training = 0.5, test = 0.5) training <- iris_split$training reg_formula <- "Sepal_Length ~ Sepal_Width + Petal_Length + Petal_Width" model <- ml_generalized_linear_regression(training, reg_formula) tbl_predictions <- ml_predict(model, iris_split$test) tbl_predictions %>% ml_metrics_regression(Sepal_Length) ## End(Not run)
Extracts data associated with a Spark ML model
ml_model_data(object)
object |
a Spark ML model |
A tbl_spark
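A minimal, hedged sketch; the connection and model below are illustrative assumptions, not part of the original documentation:

## Not run: 
sc <- spark_connect(master = "local")
mtcars_tbl <- sdf_copy_to(sc, mtcars, name = "mtcars_tbl", overwrite = TRUE)
lm_model <- ml_linear_regression(mtcars_tbl, mpg ~ wt + cyl)
# Spark DataFrame holding the data the model was fit on
ml_model_data(lm_model)
## End(Not run)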
Classification model based on the Multilayer Perceptron. Each layer has a sigmoid activation function; the output layer has softmax.
ml_multilayer_perceptron_classifier( x, formula = NULL, layers = NULL, max_iter = 100, step_size = 0.03, tol = 1e-06, block_size = 128, solver = "l-bfgs", seed = NULL, initial_weights = NULL, thresholds = NULL, features_col = "features", label_col = "label", prediction_col = "prediction", probability_col = "probability", raw_prediction_col = "rawPrediction", uid = random_string("multilayer_perceptron_classifier_"), ... ) ml_multilayer_perceptron( x, formula = NULL, layers, max_iter = 100, step_size = 0.03, tol = 1e-06, block_size = 128, solver = "l-bfgs", seed = NULL, initial_weights = NULL, features_col = "features", label_col = "label", thresholds = NULL, prediction_col = "prediction", probability_col = "probability", raw_prediction_col = "rawPrediction", uid = random_string("multilayer_perceptron_classifier_"), response = NULL, features = NULL, ... )
x |
A |
formula |
Used when |
layers |
A numeric vector describing the layers – each element in the vector gives the size of a layer. For example, |
max_iter |
The maximum number of iterations to use. |
step_size |
Step size to be used for each iteration of optimization (> 0). |
tol |
Param for the convergence tolerance for iterative algorithms. |
block_size |
Block size for stacking input data in matrices to speed up the computation. Data is stacked within partitions. If block size is more than remaining data in a partition then it is adjusted to the size of this data. Recommended size is between 10 and 1000. Default: 128 |
solver |
The solver algorithm for optimization. Supported options: "gd" (minibatch gradient descent) or "l-bfgs". Default: "l-bfgs" |
seed |
A random seed. Set this value if you need your results to be reproducible across repeated calls. |
initial_weights |
The initial weights of the model. |
thresholds |
Thresholds in multi-class classification to adjust the probability of predicting each class. Array must have length equal to the number of classes, with values > 0 excepting that at most one value may be 0. The class with largest value |
features_col |
Features column name, as a length-one character vector. The column should be a single vector column of numeric values. Usually this column is output by |
label_col |
Label column name. The column should be a numeric column. Usually this column is output by |
prediction_col |
Prediction column name. |
probability_col |
Column name for predicted class conditional probabilities. |
raw_prediction_col |
Raw prediction (a.k.a. confidence) column name. |
uid |
A character string used to uniquely identify the ML estimator. |
... |
Optional arguments; see Details. |
response |
(Deprecated) The name of the response column (as a length-one character vector.) |
features |
(Deprecated) The name of features (terms) to use for the model fit. |
ml_multilayer_perceptron()
is an alias for ml_multilayer_perceptron_classifier()
for backwards compatibility.
The object returned depends on the class of x
. If it is a
spark_connection
, the function returns a ml_estimator
object. If
it is a ml_pipeline
, it will return a pipeline with the predictor
appended to it. If a tbl_spark
, it will return a tbl_spark
with
the predictions added to it.
Other ml algorithms:
ml_aft_survival_regression()
,
ml_decision_tree_classifier()
,
ml_gbt_classifier()
,
ml_generalized_linear_regression()
,
ml_isotonic_regression()
,
ml_linear_regression()
,
ml_linear_svc()
,
ml_logistic_regression()
,
ml_naive_bayes()
,
ml_one_vs_rest()
,
ml_random_forest_classifier()
## Not run: sc <- spark_connect(master = "local") iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE) partitions <- iris_tbl %>% sdf_random_split(training = 0.7, test = 0.3, seed = 1111) iris_training <- partitions$training iris_test <- partitions$test mlp_model <- iris_training %>% ml_multilayer_perceptron_classifier(Species ~ ., layers = c(4, 3, 3)) pred <- ml_predict(mlp_model, iris_test) ml_multiclass_classification_evaluator(pred) ## End(Not run)
These methods summarize the results of Spark ML models into tidy forms.
## S3 method for class 'ml_model_multilayer_perceptron_classification' tidy(x, ...) ## S3 method for class 'ml_model_multilayer_perceptron_classification' augment(x, newdata = NULL, ...) ## S3 method for class 'ml_model_multilayer_perceptron_classification' glance(x, ...)
x |
a Spark ML model. |
... |
extra arguments (not used.) |
newdata |
a tbl_spark of new data to use for prediction. |
Naive Bayes classifiers. It supports Multinomial NB, which can handle finitely supported discrete data. For example, by converting documents into TF-IDF vectors, it can be used for document classification. By making every vector binary (0/1) data, it can also be used as Bernoulli NB. The input feature values must be nonnegative.
ml_naive_bayes( x, formula = NULL, model_type = "multinomial", smoothing = 1, thresholds = NULL, weight_col = NULL, features_col = "features", label_col = "label", prediction_col = "prediction", probability_col = "probability", raw_prediction_col = "rawPrediction", uid = random_string("naive_bayes_"), ... )
x |
A |
formula |
Used when |
model_type |
The model type. Supported options: |
smoothing |
The (Laplace) smoothing parameter. Defaults to 1. |
thresholds |
Thresholds in multi-class classification to adjust the probability of predicting each class. Array must have length equal to the number of classes, with values > 0 excepting that at most one value may be 0. The class with largest value |
weight_col |
(Spark 2.1.0+) Weight column name. If this is not set or empty, we treat all instance weights as 1.0. |
features_col |
Features column name, as a length-one character vector. The column should be a single vector column of numeric values. Usually this column is output by |
label_col |
Label column name. The column should be a numeric column. Usually this column is output by |
prediction_col |
Prediction column name. |
probability_col |
Column name for predicted class conditional probabilities. |
raw_prediction_col |
Raw prediction (a.k.a. confidence) column name. |
uid |
A character string used to uniquely identify the ML estimator. |
... |
Optional arguments; see Details. |
The object returned depends on the class of x
. If it is a
spark_connection
, the function returns a ml_estimator
object. If
it is a ml_pipeline
, it will return a pipeline with the predictor
appended to it. If a tbl_spark
, it will return a tbl_spark
with
the predictions added to it.
Other ml algorithms:
ml_aft_survival_regression()
,
ml_decision_tree_classifier()
,
ml_gbt_classifier()
,
ml_generalized_linear_regression()
,
ml_isotonic_regression()
,
ml_linear_regression()
,
ml_linear_svc()
,
ml_logistic_regression()
,
ml_multilayer_perceptron_classifier()
,
ml_one_vs_rest()
,
ml_random_forest_classifier()
## Not run: sc <- spark_connect(master = "local") iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE) partitions <- iris_tbl %>% sdf_random_split(training = 0.7, test = 0.3, seed = 1111) iris_training <- partitions$training iris_test <- partitions$test nb_model <- iris_training %>% ml_naive_bayes(Species ~ .) pred <- ml_predict(nb_model, iris_test) ml_multiclass_classification_evaluator(pred) ## End(Not run)
These methods summarize the results of Spark ML models into tidy forms.
## S3 method for class 'ml_model_naive_bayes' tidy(x, ...) ## S3 method for class 'ml_model_naive_bayes' augment(x, newdata = NULL, ...) ## S3 method for class 'ml_model_naive_bayes' glance(x, ...)
x |
a Spark ML model. |
... |
extra arguments (not used.) |
newdata |
a tbl_spark of new data to use for prediction. |
Reduction of Multiclass Classification to Binary Classification. Performs the reduction using the one-against-all strategy. For a multiclass classification with k classes, it trains k models (one per class). Each example is scored against all k models, and the model with the highest score is picked to label the example.
ml_one_vs_rest( x, formula = NULL, classifier = NULL, features_col = "features", label_col = "label", prediction_col = "prediction", uid = random_string("one_vs_rest_"), ... )
x |
A |
formula |
Used when |
classifier |
Object of class |
features_col |
Features column name, as a length-one character vector. The column should be a single vector column of numeric values. Usually this column is output by |
label_col |
Label column name. The column should be a numeric column. Usually this column is output by |
prediction_col |
Prediction column name. |
uid |
A character string used to uniquely identify the ML estimator. |
... |
Optional arguments; see Details. |
The object returned depends on the class of x
. If it is a
spark_connection
, the function returns a ml_estimator
object. If
it is a ml_pipeline
, it will return a pipeline with the predictor
appended to it. If a tbl_spark
, it will return a tbl_spark
with
the predictions added to it.
Other ml algorithms:
ml_aft_survival_regression()
,
ml_decision_tree_classifier()
,
ml_gbt_classifier()
,
ml_generalized_linear_regression()
,
ml_isotonic_regression()
,
ml_linear_regression()
,
ml_linear_svc()
,
ml_logistic_regression()
,
ml_multilayer_perceptron_classifier()
,
ml_naive_bayes()
,
ml_random_forest_classifier()
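A hedged sketch of one possible call, assuming a logistic regression estimator as the base classifier (the identifiers below are illustrative, not from the original documentation):

## Not run: 
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
ovr_model <- iris_tbl %>%
  ml_one_vs_rest(Species ~ ., classifier = ml_logistic_regression(sc))
pred <- ml_predict(ovr_model, iris_tbl)
ml_multiclass_classification_evaluator(pred)
## End(Not run)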
These methods summarize the results of Spark ML models into tidy forms.
## S3 method for class 'ml_model_pca' tidy(x, ...) ## S3 method for class 'ml_model_pca' augment(x, newdata = NULL, ...) ## S3 method for class 'ml_model_pca' glance(x, ...)
x |
a Spark ML model. |
... |
extra arguments (not used.) |
newdata |
a tbl_spark of new data to use for prediction. |
Create Spark ML Pipelines
ml_pipeline(x, ..., uid = random_string("pipeline_"))
x |
Either a |
... |
|
uid |
A character string used to uniquely identify the ML estimator. |
When x
is a spark_connection
, ml_pipeline()
returns an empty pipeline object. When x
is a ml_pipeline_stage
, ml_pipeline()
returns an ml_pipeline
with the stages set to x
and any transformers or estimators given in ...
.
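A short, hedged sketch of both forms described above:

## Not run: 
sc <- spark_connect(master = "local")
# an empty pipeline
pipeline <- ml_pipeline(sc)
# a pipeline with stages appended through the pipe
pipeline <- ml_pipeline(sc) %>%
  ft_r_formula(Species ~ .) %>%
  ml_logistic_regression()
## End(Not run)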
Power iteration clustering (PIC) is a scalable and efficient algorithm for clustering vertices of a graph given pairwise similarities as edge properties, described in the paper "Power Iteration Clustering" by Frank Lin and William W. Cohen. It computes a pseudo-eigenvector of the normalized affinity matrix of the graph via power iteration and uses it to cluster vertices. spark.mllib includes an implementation of PIC using GraphX as its backend. It takes an RDD of (srcId, dstId, similarity) tuples and outputs a model with the clustering assignments. The similarities must be nonnegative. PIC assumes that the similarity measure is symmetric. A pair (srcId, dstId), regardless of ordering, should appear at most once in the input data. If a pair is missing from the input, its similarity is treated as zero.
ml_power_iteration( x, k = 4, max_iter = 20, init_mode = "random", src_col = "src", dst_col = "dst", weight_col = "weight", ... )
x |
A 'spark_connection' or a 'tbl_spark'. |
k |
The number of clusters to create. |
max_iter |
The maximum number of iterations to run. |
init_mode |
This can be either "random", which is the default, to use a random vector as vertex properties, or "degree" to use normalized sum similarities. |
src_col |
Column in the input Spark dataframe containing 0-based indexes of all source vertices in the affinity matrix described in the PIC paper. |
dst_col |
Column in the input Spark dataframe containing 0-based indexes of all destination vertices in the affinity matrix described in the PIC paper. |
weight_col |
Column in the input Spark dataframe containing non-negative edge weights in the affinity matrix described in the PIC paper. |
... |
Optional arguments. Currently unused. |
A 2-column R dataframe with columns named "id" and "cluster" describing the resulting cluster assignments
## Not run: library(sparklyr) sc <- spark_connect(master = "local") r1 <- 1 n1 <- 80L r2 <- 4 n2 <- 80L gen_circle <- function(radius, num_pts) { # generate evenly distributed points on a circle centered at the origin seq(0, num_pts - 1) %>% lapply( function(pt) { theta <- 2 * pi * pt / num_pts radius * c(cos(theta), sin(theta)) } ) } guassian_similarity <- function(pt1, pt2) { dist2 <- sum((pt2 - pt1)^2) exp(-dist2 / 2) } gen_pic_data <- function() { # generate points on 2 concentric circle centered at the origin and then # compute pairwise Gaussian similarity values of all unordered pair of # points n <- n1 + n2 pts <- append(gen_circle(r1, n1), gen_circle(r2, n2)) num_unordered_pairs <- n * (n - 1) / 2 src <- rep(0L, num_unordered_pairs) dst <- rep(0L, num_unordered_pairs) sim <- rep(0, num_unordered_pairs) idx <- 1 for (i in seq(2, n)) { for (j in seq(i - 1)) { src[[idx]] <- i - 1L dst[[idx]] <- j - 1L sim[[idx]] <- guassian_similarity(pts[[i]], pts[[j]]) idx <- idx + 1 } } dplyr::tibble(src = src, dst = dst, sim = sim) } pic_data <- copy_to(sc, gen_pic_data()) clusters <- ml_power_iteration( pic_data, src_col = "src", dst_col = "dst", weight_col = "sim", k = 2, max_iter = 40 ) print(clusters) ## End(Not run)
PrefixSpan algorithm for mining frequent sequential patterns.
ml_prefixspan( x, seq_col = "sequence", min_support = 0.1, max_pattern_length = 10, max_local_proj_db_size = 3.2e+07, uid = random_string("prefixspan_"), ... ) ml_freq_seq_patterns(model)
x |
A |
seq_col |
The name of the sequence column in dataset (defaults to "sequence"). Rows with nulls in this column are ignored. |
min_support |
The minimum support required to be considered a frequent sequential pattern. |
max_pattern_length |
The maximum length of a frequent sequential pattern. Any frequent pattern exceeding this length will not be included in the results. |
max_local_proj_db_size |
The maximum number of items allowed in a prefix-projected database before local iterative processing of the projected database begins. This parameter should be tuned with respect to the size of your executors. |
uid |
A character string used to uniquely identify the ML estimator. |
... |
Optional arguments; currently unused. |
model |
A Prefix Span model. |
## Not run: library(sparklyr) sc <- spark_connect(master = "local", version = "2.4.0") items_df <- dplyr::tibble( seq = list( list(list(1, 2), list(3)), list(list(1), list(3, 2), list(1, 2)), list(list(1, 2), list(5)), list(list(6)) ) ) items_sdf <- copy_to(sc, items_df, overwrite = TRUE) prefix_span_model <- ml_prefixspan( sc, seq_col = "seq", min_support = 0.5, max_pattern_length = 5, max_local_proj_db_size = 32000000 ) frequent_items <- prefix_span_model$frequent_sequential_patterns(items_sdf) %>% collect() ## End(Not run)
Perform classification and regression using random forests.
ml_random_forest_classifier( x, formula = NULL, num_trees = 20, subsampling_rate = 1, max_depth = 5, min_instances_per_node = 1, feature_subset_strategy = "auto", impurity = "gini", min_info_gain = 0, max_bins = 32, seed = NULL, thresholds = NULL, checkpoint_interval = 10, cache_node_ids = FALSE, max_memory_in_mb = 256, features_col = "features", label_col = "label", prediction_col = "prediction", probability_col = "probability", raw_prediction_col = "rawPrediction", uid = random_string("random_forest_classifier_"), ... ) ml_random_forest( x, formula = NULL, type = c("auto", "regression", "classification"), features_col = "features", label_col = "label", prediction_col = "prediction", probability_col = "probability", raw_prediction_col = "rawPrediction", feature_subset_strategy = "auto", impurity = "auto", checkpoint_interval = 10, max_bins = 32, max_depth = 5, num_trees = 20, min_info_gain = 0, min_instances_per_node = 1, subsampling_rate = 1, seed = NULL, thresholds = NULL, cache_node_ids = FALSE, max_memory_in_mb = 256, uid = random_string("random_forest_"), response = NULL, features = NULL, ... ) ml_random_forest_regressor( x, formula = NULL, num_trees = 20, subsampling_rate = 1, max_depth = 5, min_instances_per_node = 1, feature_subset_strategy = "auto", impurity = "variance", min_info_gain = 0, max_bins = 32, seed = NULL, checkpoint_interval = 10, cache_node_ids = FALSE, max_memory_in_mb = 256, features_col = "features", label_col = "label", prediction_col = "prediction", uid = random_string("random_forest_regressor_"), ... )
x |
A |
formula |
Used when |
num_trees |
Number of trees to train (>= 1). If 1, then no bootstrapping is used. If > 1, then bootstrapping is done. |
subsampling_rate |
Fraction of the training data used for learning each decision tree, in range (0, 1]. (default = 1.0) |
max_depth |
Maximum depth of the tree (>= 0); that is, the maximum number of nodes separating any leaves from the root of the tree. |
min_instances_per_node |
Minimum number of instances each child must have after split. |
feature_subset_strategy |
The number of features to consider for splits at each tree node. See details for options. |
impurity |
Criterion used for information gain calculation. Supported: "entropy"
and "gini" (default) for classification and "variance" (default) for regression. For
|
min_info_gain |
Minimum information gain for a split to be considered at a tree node. Should be >= 0, defaults to 0. |
max_bins |
The maximum number of bins used for discretizing continuous features and for choosing how to split on features at each node. More bins give higher granularity. |
seed |
Seed for random numbers. |
thresholds |
Thresholds in multi-class classification to adjust the probability of predicting each class. Array must have length equal to the number of classes, with values > 0 excepting that at most one value may be 0. The class with largest value |
checkpoint_interval |
Set checkpoint interval (>= 1) or disable checkpoint (-1). E.g. 10 means that the cache will get checkpointed every 10 iterations, defaults to 10. |
cache_node_ids |
If |
max_memory_in_mb |
Maximum memory in MB allocated to histogram aggregation. If too small, then 1 node will be split per iteration, and its aggregates may exceed this size. Defaults to 256. |
features_col |
Features column name, as a length-one character vector. The column should be a single vector column of numeric values. Usually this column is output by |
label_col |
Label column name. The column should be a numeric column. Usually this column is output by |
prediction_col |
Prediction column name. |
probability_col |
Column name for predicted class conditional probabilities. |
raw_prediction_col |
Raw prediction (a.k.a. confidence) column name. |
uid |
A character string used to uniquely identify the ML estimator. |
... |
Optional arguments; see Details. |
type |
The type of model to fit. |
response |
(Deprecated) The name of the response column (as a length-one character vector.) |
features |
(Deprecated) The name of features (terms) to use for the model fit. |
The supported options for feature_subset_strategy
are
"auto"
: Choose automatically for task: If num_trees == 1
, set to "all"
. If num_trees > 1
(forest), set to "sqrt"
for classification and to "onethird"
for regression.
"all"
: use all features
"onethird"
: use 1/3 of the features
"sqrt"
: use sqrt(number of features)
"log2"
: use log2(number of features)
"n"
: when n
is in the range (0, 1.0], use n * number of features. When n
is in the range (1, number of features), use n
features. (default = "auto"
)
ml_random_forest
is a wrapper around ml_random_forest_regressor.tbl_spark
and ml_random_forest_classifier.tbl_spark
and calls the appropriate method based on model type.
The object returned depends on the class of x
. If it is a
spark_connection
, the function returns a ml_estimator
object. If
it is a ml_pipeline
, it will return a pipeline with the predictor
appended to it. If a tbl_spark
, it will return a tbl_spark
with
the predictions added to it.
Other ml algorithms:
ml_aft_survival_regression()
,
ml_decision_tree_classifier()
,
ml_gbt_classifier()
,
ml_generalized_linear_regression()
,
ml_isotonic_regression()
,
ml_linear_regression()
,
ml_linear_svc()
,
ml_logistic_regression()
,
ml_multilayer_perceptron_classifier()
,
ml_naive_bayes()
,
ml_one_vs_rest()
## Not run: sc <- spark_connect(master = "local") iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE) partitions <- iris_tbl %>% sdf_random_split(training = 0.7, test = 0.3, seed = 1111) iris_training <- partitions$training iris_test <- partitions$test rf_model <- iris_training %>% ml_random_forest(Species ~ ., type = "classification") pred <- ml_predict(rf_model, iris_test) ml_multiclass_classification_evaluator(pred) ## End(Not run)
## Not run: sc <- spark_connect(master = "local") iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE) partitions <- iris_tbl %>% sdf_random_split(training = 0.7, test = 0.3, seed = 1111) iris_training <- partitions$training iris_test <- partitions$test rf_model <- iris_training %>% ml_random_forest(Species ~ ., type = "classification") pred <- ml_predict(rf_model, iris_test) ml_multiclass_classification_evaluator(pred) ## End(Not run)
Extraction of stages from a Pipeline or PipelineModel object.
ml_stage(x, stage) ml_stages(x, stages = NULL)
ml_stage(x, stage) ml_stages(x, stages = NULL)
x |
A |
stage |
The UID of a stage in the pipeline. |
stages |
The UIDs of stages in the pipeline as a character vector. |
For ml_stage()
: The stage specified.
For ml_stages()
: A list of stages. If stages
is not set, the function returns all stages of the pipeline in a list.
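A minimal sketch of stage extraction, assuming a local connection and a two-stage pipeline; the UID passed to ml_stage() is taken from the stage object itself via ml_uid():
## Not run: 
sc <- spark_connect(master = "local")
pipeline <- ml_pipeline(sc) %>%
  ft_r_formula(Species ~ .) %>%
  ml_logistic_regression()
# All stages of the pipeline, as a list
stages <- ml_stages(pipeline)
# A single stage, looked up by its UID
lr_stage <- ml_stage(pipeline, ml_uid(stages[[2]]))
## End(Not run)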
Extracts a metric from the summary object of a Spark ML model.
ml_summary(x, metric = NULL, allow_null = FALSE)
ml_summary(x, metric = NULL, allow_null = FALSE)
x |
A Spark ML model that has a summary. |
metric |
The name of the metric to extract. If not set, returns the summary object. |
allow_null |
Whether null results are allowed when the metric is not found in the summary. |
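A minimal sketch, assuming a local connection and that the fitted model exposes a summary; the metric name "r2" is an assumption about what the linear-regression summary provides, so inspect ml_summary(fit) first to see the available fields:
## Not run: 
sc <- spark_connect(master = "local")
mtcars_tbl <- sdf_copy_to(sc, mtcars, overwrite = TRUE)
fit <- ml_linear_regression(mtcars_tbl, mpg ~ wt + cyl)
# With no metric, the full summary object is returned
ml_summary(fit)
# Extract a single metric (name assumed; allow_null avoids an error if absent)
ml_summary(fit, "r2", allow_null = TRUE)
## End(Not run)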
These methods summarize the results of Spark ML models into tidy forms.
## S3 method for class 'ml_model_aft_survival_regression' tidy(x, ...) ## S3 method for class 'ml_model_aft_survival_regression' augment(x, newdata = NULL, ...) ## S3 method for class 'ml_model_aft_survival_regression' glance(x, ...)
## S3 method for class 'ml_model_aft_survival_regression' tidy(x, ...) ## S3 method for class 'ml_model_aft_survival_regression' augment(x, newdata = NULL, ...) ## S3 method for class 'ml_model_aft_survival_regression' glance(x, ...)
x |
a Spark ML model. |
... |
extra arguments (not used.) |
newdata |
a tbl_spark of new data to use for prediction. |
These methods summarize the results of Spark ML models into tidy forms.
## S3 method for class 'ml_model_decision_tree_classification' tidy(x, ...) ## S3 method for class 'ml_model_decision_tree_regression' tidy(x, ...) ## S3 method for class 'ml_model_decision_tree_classification' augment(x, newdata = NULL, ...) ## S3 method for class ''_ml_model_decision_tree_classification'' augment(x, new_data = NULL, ...) ## S3 method for class 'ml_model_decision_tree_regression' augment(x, newdata = NULL, ...) ## S3 method for class ''_ml_model_decision_tree_regression'' augment(x, new_data = NULL, ...) ## S3 method for class 'ml_model_decision_tree_classification' glance(x, ...) ## S3 method for class 'ml_model_decision_tree_regression' glance(x, ...) ## S3 method for class 'ml_model_random_forest_classification' tidy(x, ...) ## S3 method for class 'ml_model_random_forest_regression' tidy(x, ...) ## S3 method for class 'ml_model_random_forest_classification' augment(x, newdata = NULL, ...) ## S3 method for class ''_ml_model_random_forest_classification'' augment(x, new_data = NULL, ...) ## S3 method for class 'ml_model_random_forest_regression' augment(x, newdata = NULL, ...) ## S3 method for class ''_ml_model_random_forest_regression'' augment(x, new_data = NULL, ...) ## S3 method for class 'ml_model_random_forest_classification' glance(x, ...) ## S3 method for class 'ml_model_random_forest_regression' glance(x, ...) ## S3 method for class 'ml_model_gbt_classification' tidy(x, ...) ## S3 method for class 'ml_model_gbt_regression' tidy(x, ...) ## S3 method for class 'ml_model_gbt_classification' augment(x, newdata = NULL, ...) ## S3 method for class ''_ml_model_gbt_classification'' augment(x, new_data = NULL, ...) ## S3 method for class 'ml_model_gbt_regression' augment(x, newdata = NULL, ...) ## S3 method for class ''_ml_model_gbt_regression'' augment(x, new_data = NULL, ...) ## S3 method for class 'ml_model_gbt_classification' glance(x, ...) ## S3 method for class 'ml_model_gbt_regression' glance(x, ...)
## S3 method for class 'ml_model_decision_tree_classification' tidy(x, ...) ## S3 method for class 'ml_model_decision_tree_regression' tidy(x, ...) ## S3 method for class 'ml_model_decision_tree_classification' augment(x, newdata = NULL, ...) ## S3 method for class ''_ml_model_decision_tree_classification'' augment(x, new_data = NULL, ...) ## S3 method for class 'ml_model_decision_tree_regression' augment(x, newdata = NULL, ...) ## S3 method for class ''_ml_model_decision_tree_regression'' augment(x, new_data = NULL, ...) ## S3 method for class 'ml_model_decision_tree_classification' glance(x, ...) ## S3 method for class 'ml_model_decision_tree_regression' glance(x, ...) ## S3 method for class 'ml_model_random_forest_classification' tidy(x, ...) ## S3 method for class 'ml_model_random_forest_regression' tidy(x, ...) ## S3 method for class 'ml_model_random_forest_classification' augment(x, newdata = NULL, ...) ## S3 method for class ''_ml_model_random_forest_classification'' augment(x, new_data = NULL, ...) ## S3 method for class 'ml_model_random_forest_regression' augment(x, newdata = NULL, ...) ## S3 method for class ''_ml_model_random_forest_regression'' augment(x, new_data = NULL, ...) ## S3 method for class 'ml_model_random_forest_classification' glance(x, ...) ## S3 method for class 'ml_model_random_forest_regression' glance(x, ...) ## S3 method for class 'ml_model_gbt_classification' tidy(x, ...) ## S3 method for class 'ml_model_gbt_regression' tidy(x, ...) ## S3 method for class 'ml_model_gbt_classification' augment(x, newdata = NULL, ...) ## S3 method for class ''_ml_model_gbt_classification'' augment(x, new_data = NULL, ...) ## S3 method for class 'ml_model_gbt_regression' augment(x, newdata = NULL, ...) ## S3 method for class ''_ml_model_gbt_regression'' augment(x, new_data = NULL, ...) ## S3 method for class 'ml_model_gbt_classification' glance(x, ...) ## S3 method for class 'ml_model_gbt_regression' glance(x, ...)
x |
a Spark ML model. |
... |
extra arguments (not used.) |
newdata |
a tbl_spark of new data to use for prediction. |
new_data |
a tbl_spark of new data to use for prediction. |
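A minimal sketch for a random forest classifier, assuming a local connection; the tidy(), augment(), and glance() generics (provided by broom) dispatch to the methods listed above:
## Not run: 
library(broom)  # provides the tidy(), augment(), glance() generics
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
rf_model <- ml_random_forest(iris_tbl, Species ~ ., type = "classification")
tidy(rf_model)                         # per-feature summary (e.g. importances)
glance(rf_model)                       # one-row, model-level summary
augment(rf_model, newdata = iris_tbl)  # input data with predictions appended
## End(Not run)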
Extracts the UID of an ML object.
ml_uid(x)
ml_uid(x)
x |
A Spark ML object |
These methods summarize the results of Spark ML models into tidy forms.
## S3 method for class 'ml_model_kmeans' tidy(x, ...) ## S3 method for class 'ml_model_kmeans' augment(x, newdata = NULL, ...) ## S3 method for class 'ml_model_kmeans' glance(x, ...) ## S3 method for class 'ml_model_bisecting_kmeans' tidy(x, ...) ## S3 method for class 'ml_model_bisecting_kmeans' augment(x, newdata = NULL, ...) ## S3 method for class 'ml_model_bisecting_kmeans' glance(x, ...) ## S3 method for class 'ml_model_gaussian_mixture' tidy(x, ...) ## S3 method for class 'ml_model_gaussian_mixture' augment(x, newdata = NULL, ...) ## S3 method for class 'ml_model_gaussian_mixture' glance(x, ...)
## S3 method for class 'ml_model_kmeans' tidy(x, ...) ## S3 method for class 'ml_model_kmeans' augment(x, newdata = NULL, ...) ## S3 method for class 'ml_model_kmeans' glance(x, ...) ## S3 method for class 'ml_model_bisecting_kmeans' tidy(x, ...) ## S3 method for class 'ml_model_bisecting_kmeans' augment(x, newdata = NULL, ...) ## S3 method for class 'ml_model_bisecting_kmeans' glance(x, ...) ## S3 method for class 'ml_model_gaussian_mixture' tidy(x, ...) ## S3 method for class 'ml_model_gaussian_mixture' augment(x, newdata = NULL, ...) ## S3 method for class 'ml_model_gaussian_mixture' glance(x, ...)
x |
a Spark ML model. |
... |
extra arguments (not used.) |
newdata |
a tbl_spark of new data to use for prediction. |
Helper methods for working with parameters for ML objects.
ml_is_set(x, param, ...) ml_param_map(x, ...) ml_param(x, param, allow_null = FALSE, ...) ml_params(x, params = NULL, allow_null = FALSE, ...)
ml_is_set(x, param, ...) ml_param_map(x, ...) ml_param(x, param, allow_null = FALSE, ...) ml_params(x, params = NULL, allow_null = FALSE, ...)
x |
A Spark ML object, either a pipeline stage or an evaluator. |
param |
The parameter to extract or set. |
... |
Optional arguments; currently unused. |
allow_null |
Whether to allow |
params |
A vector of parameters to extract. |
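A minimal sketch, assuming a local connection; "max_iter" and "reg_param" are standard logistic-regression parameter names:
## Not run: 
sc <- spark_connect(master = "local")
lr <- ml_logistic_regression(sc, max_iter = 50)
ml_is_set(lr, "max_iter")                 # has the parameter been set?
ml_param(lr, "max_iter")                  # a single parameter value
ml_params(lr, c("max_iter", "reg_param")) # several parameters at once
ml_param_map(lr)                          # the full parameter map
## End(Not run)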
Save/load Spark ML objects
ml_save(x, path, overwrite = FALSE, ...) ## S3 method for class 'ml_model' ml_save( x, path, overwrite = FALSE, type = c("pipeline_model", "pipeline"), ... ) ml_load(sc, path)
ml_save(x, path, overwrite = FALSE, ...) ## S3 method for class 'ml_model' ml_save( x, path, overwrite = FALSE, type = c("pipeline_model", "pipeline"), ... ) ml_load(sc, path)
x |
A ML object, which could be a |
path |
The path where the object is to be serialized/deserialized. |
overwrite |
Whether to overwrite the existing path, defaults to |
... |
Optional arguments; currently unused. |
type |
Whether to save the pipeline model or the pipeline. |
sc |
A Spark connection. |
ml_save()
serializes a Spark object into a format that can be read back into sparklyr
or by the Scala or PySpark APIs. When called on ml_model
objects, i.e. those that were created via the tbl_spark - formula
signature, the associated pipeline model is serialized. In other words, the saved model contains both the data processing (RFormulaModel
) stage and the machine learning stage.
ml_load()
reads a saved Spark object into sparklyr
. It calls the correct Scala load
method based on parsing the saved metadata. Note that a PipelineModel
object saved from a sparklyr ml_model
via ml_save()
will be read back in as an ml_pipeline_model
, rather than the ml_model
object.
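A minimal round trip, assuming a local connection; the path is illustrative. As noted above, the reloaded object is an ml_pipeline_model, so it is applied with ml_transform() rather than treated as an ml_model:
## Not run: 
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
model <- ml_random_forest(iris_tbl, Species ~ ., type = "classification")
ml_save(model, path = "/tmp/iris_rf", overwrite = TRUE)
reloaded <- ml_load(sc, path = "/tmp/iris_rf")
ml_transform(reloaded, iris_tbl)
## End(Not run)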
Methods for transformation, fit, and prediction. These are mirrors of the corresponding sdf-transform-methods.
is_ml_transformer(x) is_ml_estimator(x) ml_fit(x, dataset, ...) ## Default S3 method: ml_fit(x, dataset, ...) ml_transform(x, dataset, ...) ml_fit_and_transform(x, dataset, ...) ml_predict(x, dataset, ...) ## S3 method for class 'ml_model_classification' ml_predict(x, dataset, probability_prefix = "probability_", ...)
is_ml_transformer(x) is_ml_estimator(x) ml_fit(x, dataset, ...) ## Default S3 method: ml_fit(x, dataset, ...) ml_transform(x, dataset, ...) ml_fit_and_transform(x, dataset, ...) ml_predict(x, dataset, ...) ## S3 method for class 'ml_model_classification' ml_predict(x, dataset, probability_prefix = "probability_", ...)
x |
A |
dataset |
A |
... |
Optional arguments; currently unused. |
probability_prefix |
String used to prepend the class probability output columns. |
When x is an estimator, ml_fit()
returns a transformer whereas ml_fit_and_transform()
returns a transformed dataset. When x
is a transformer, ml_transform()
and ml_predict()
return a transformed dataset. When ml_predict()
is called on a ml_model
object, additional columns (e.g. probabilities in case of classification models) are appended to the transformed output for the user's convenience.
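A minimal sketch with a StandardScaler estimator, assuming a local connection; the feature column is assembled first so the estimator has a vector column to fit on:
## Not run: 
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, overwrite = TRUE)
assembled <- iris_tbl %>%
  ft_vector_assembler(
    input_cols = c("Sepal_Length", "Sepal_Width", "Petal_Length", "Petal_Width"),
    output_col = "features"
  )
scaler <- ft_standard_scaler(sc, input_col = "features", output_col = "features_scaled")
is_ml_estimator(scaler)          # TRUE: not yet fitted
fitted_scaler <- ml_fit(scaler, assembled)
is_ml_transformer(fitted_scaler) # TRUE: can now transform data
ml_transform(fitted_scaler, assembled)
## End(Not run)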
Perform hyper-parameter tuning using either K-fold cross validation or train-validation split.
ml_sub_models(model) ml_validation_metrics(model) ml_cross_validator( x, estimator = NULL, estimator_param_maps = NULL, evaluator = NULL, num_folds = 3, collect_sub_models = FALSE, parallelism = 1, seed = NULL, uid = random_string("cross_validator_"), ... ) ml_train_validation_split( x, estimator = NULL, estimator_param_maps = NULL, evaluator = NULL, train_ratio = 0.75, collect_sub_models = FALSE, parallelism = 1, seed = NULL, uid = random_string("train_validation_split_"), ... )
ml_sub_models(model) ml_validation_metrics(model) ml_cross_validator( x, estimator = NULL, estimator_param_maps = NULL, evaluator = NULL, num_folds = 3, collect_sub_models = FALSE, parallelism = 1, seed = NULL, uid = random_string("cross_validator_"), ... ) ml_train_validation_split( x, estimator = NULL, estimator_param_maps = NULL, evaluator = NULL, train_ratio = 0.75, collect_sub_models = FALSE, parallelism = 1, seed = NULL, uid = random_string("train_validation_split_"), ... )
model |
A cross validation or train-validation-split model. |
x |
A |
estimator |
A |
estimator_param_maps |
A named list of stages and hyper-parameter sets to tune. See details. |
evaluator |
A |
num_folds |
Number of folds for cross validation. Must be >= 2. Default: 3 |
collect_sub_models |
Whether to collect a list of sub-models trained during tuning.
If set to |
parallelism |
The number of threads to use when running parallel algorithms. Default is 1 for serial execution. |
seed |
A random seed. Set this value if you need your results to be reproducible across repeated calls. |
uid |
A character string used to uniquely identify the ML estimator. |
... |
Optional arguments; currently unused. |
train_ratio |
Ratio between train and validation data. Must be between 0 and 1. Default: 0.75 |
ml_cross_validator()
performs k-fold cross validation while ml_train_validation_split()
performs tuning on one pair of train and validation datasets.
The object returned depends on the class of x
.
spark_connection
: When x
is a spark_connection
, the function returns an instance of a ml_cross_validator
or ml_train_validation_split
object.
ml_pipeline
: When x
is a ml_pipeline
, the function returns a ml_pipeline
with
the tuning estimator appended to the pipeline.
tbl_spark
: When x
is a tbl_spark
, a tuning estimator is constructed then
immediately fit with the input tbl_spark
, returning a ml_cross_validation_model
or a
ml_train_validation_split_model
object.
For cross validation, ml_sub_models()
returns a nested
list of models, where the first layer represents fold indices and the
second layer represents param maps. For train-validation split,
ml_sub_models()
returns a list of models, corresponding to the
order of the estimator param maps.
ml_validation_metrics()
returns a data frame of performance
metrics and hyperparameter combinations.
## Not run: sc <- spark_connect(master = "local") iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE) # Create a pipeline pipeline <- ml_pipeline(sc) %>% ft_r_formula(Species ~ .) %>% ml_random_forest_classifier() # Specify hyperparameter grid grid <- list( random_forest = list( num_trees = c(5, 10), max_depth = c(5, 10), impurity = c("entropy", "gini") ) ) # Create the cross validator object cv <- ml_cross_validator( sc, estimator = pipeline, estimator_param_maps = grid, evaluator = ml_multiclass_classification_evaluator(sc), num_folds = 3, parallelism = 4 ) # Train the models cv_model <- ml_fit(cv, iris_tbl) # Print the metrics ml_validation_metrics(cv_model) ## End(Not run)
## Not run: sc <- spark_connect(master = "local") iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE) # Create a pipeline pipeline <- ml_pipeline(sc) %>% ft_r_formula(Species ~ .) %>% ml_random_forest_classifier() # Specify hyperparameter grid grid <- list( random_forest = list( num_trees = c(5, 10), max_depth = c(5, 10), impurity = c("entropy", "gini") ) ) # Create the cross validator object cv <- ml_cross_validator( sc, estimator = pipeline, estimator_param_maps = grid, evaluator = ml_multiclass_classification_evaluator(sc), num_folds = 3, parallelism = 4 ) # Train the models cv_model <- ml_fit(cv, iris_tbl) # Print the metrics ml_validation_metrics(cv_model) ## End(Not run)
This S3 generic provides an interface for replacing
NA
values within an object.
na.replace(object, ...)
na.replace(object, ...)
object |
An R object. |
... |
Arguments passed along to implementing methods. |
Generate a random string with a given prefix.
random_string(prefix = "table")
random_string(prefix = "table")
prefix |
A length-one character vector. |
Given a Spark object, returns a reactive data source for the contents of that object. This function is most useful for reading Spark streams.
reactiveSpark(x, intervalMillis = 1000, session = NULL)
reactiveSpark(x, intervalMillis = 1000, session = NULL)
x |
An object coercable to a Spark DataFrame. |
intervalMillis |
Approximate number of milliseconds to wait to retrieve updated data frame. This can be a numeric value, or a function that returns a numeric value. |
session |
The user session to associate this file reader with, or NULL if none. If non-null, the reader will automatically stop when the session ends. |
Registering an extension package will result in the package being automatically scanned for Spark dependencies when a connection to Spark is created.
register_extension(package) registered_extensions()
register_extension(package) registered_extensions()
package |
The package(s) to register. |
Packages should typically register their extensions in their
.onLoad
hook – this ensures that their extensions are registered
when their namespaces are loaded.
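A minimal sketch of the .onLoad pattern described above, as it would appear in a hypothetical extension package's R source:
# zzz.R of a hypothetical extension package
.onLoad <- function(libname, pkgname) {
  sparklyr::register_extension(pkgname)
}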
Registers a parallel backend using the foreach
package.
registerDoSpark(spark_conn, parallelism = NULL, ...)
registerDoSpark(spark_conn, parallelism = NULL, ...)
spark_conn |
Spark connection to use |
parallelism |
Level of parallelism to use for task execution (if unspecified, then it will take the value of 'SparkContext.defaultParallelism()' which by default is the number of cores available to the 'sparklyr' application) |
... |
additional options for the sparklyr parallel backend (currently the only valid option is 'nocompile') |
None
## Not run: sc <- spark_connect(master = "local") registerDoSpark(sc, nocompile = FALSE) ## End(Not run)
## Not run: sc <- spark_connect(master = "local") registerDoSpark(sc, nocompile = FALSE) ## End(Not run)
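A minimal sketch of using the registered backend with foreach, assuming a local connection:
## Not run: 
library(foreach)
sc <- spark_connect(master = "local")
registerDoSpark(sc)
# %dopar% iterations now run as Spark jobs
squares <- foreach(i = 1:10, .combine = c) %dopar% i^2
## End(Not run)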
Creates a DataFrame whose number of rows matches the length of the given object, analogous to seq_along().
sdf_along(sc, along, repartition = NULL, type = c("integer", "integer64"))
sdf_along(sc, along, repartition = NULL, type = c("integer", "integer64"))
sc |
The associated Spark connection. |
along |
Takes the length from the length of this argument. |
repartition |
The number of partitions to use when distributing the data across the Spark cluster. |
type |
The data type to use for the index, either |
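A minimal sketch, assuming a local connection; the result has one row per element of letters:
## Not run: 
sc <- spark_connect(master = "local")
sdf_along(sc, letters)  # 26-row DataFrame with an integer id column
## End(Not run)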
sdf_bind_rows()
and sdf_bind_cols()
are implementations of the common pattern of
do.call(rbind, sdfs)
or do.call(cbind, sdfs)
for binding many
Spark DataFrames into one.
sdf_bind_rows(..., id = NULL) sdf_bind_cols(...)
sdf_bind_rows(..., id = NULL) sdf_bind_cols(...)
... |
Spark tbls to combine. Each argument can be either a Spark DataFrame or a list of Spark DataFrames. When row-binding, columns are matched by name, and any missing columns will be filled with NA. When column-binding, rows are matched by position, so all data frames must have the same number of rows. |
id |
Data frame identifier. When |
The output of sdf_bind_rows()
will contain a column if that column
appears in any of the inputs.
sdf_bind_rows()
and sdf_bind_cols()
return a tbl_spark.
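A minimal sketch of row-binding, assuming a local connection; the 'source' column records which input each row came from:
## Not run: 
sc <- spark_connect(master = "local")
df_a <- data.frame(x = 1:3, y = letters[1:3])
df_b <- data.frame(x = 4:5)
sdf_a <- sdf_copy_to(sc, df_a, overwrite = TRUE)
sdf_b <- sdf_copy_to(sc, df_b, overwrite = TRUE)
# Columns are matched by name; the missing 'y' in sdf_b is filled with NA
sdf_bind_rows(a = sdf_a, b = sdf_b, id = "source")
## End(Not run)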
Used to force broadcast hash joins.
sdf_broadcast(x)
sdf_broadcast(x)
x |
A |
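A minimal sketch, assuming a local connection and dplyr; wrapping the small lookup table in sdf_broadcast() hints to Spark that a broadcast hash join should be planned:
## Not run: 
library(dplyr)
sc <- spark_connect(master = "local")
mtcars_tbl <- sdf_copy_to(sc, mtcars, overwrite = TRUE)
cyl_lookup <- sdf_copy_to(
  sc,
  data.frame(cyl = c(4, 6, 8), label = c("four", "six", "eight")),
  name = "cyl_lookup", overwrite = TRUE
)
joined <- left_join(mtcars_tbl, sdf_broadcast(cyl_lookup), by = "cyl")
## End(Not run)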
Checkpoint a Spark DataFrame
sdf_checkpoint(x, eager = TRUE)
sdf_checkpoint(x, eager = TRUE)
x |
an object coercible to a Spark DataFrame |
eager |
whether to truncate the lineage of the DataFrame |
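A minimal sketch, assuming a local connection; the checkpoint directory is illustrative and must be set (see spark_set_checkpoint_dir()) before checkpointing:
## Not run: 
library(dplyr)
sc <- spark_connect(master = "local")
spark_set_checkpoint_dir(sc, "/tmp/spark-checkpoints")
sdf <- sdf_len(sc, 1000) %>% mutate(y = id * 2)
# Materialize the DataFrame and cut its lineage
checkpointed <- sdf_checkpoint(sdf, eager = TRUE)
## End(Not run)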
Coalesces a Spark DataFrame
sdf_coalesce(x, partitions)
sdf_coalesce(x, partitions)
x |
A |
partitions |
number of partitions |
Collects a Spark dataframe into R.
sdf_collect(object, impl = c("row-wise", "row-wise-iter", "column-wise"), ...)
sdf_collect(object, impl = c("row-wise", "row-wise-iter", "column-wise"), ...)
object |
Spark dataframe to collect |
impl |
Which implementation to use while collecting Spark dataframe - row-wise: fetch the entire dataframe into memory and then process it row-by-row - row-wise-iter: iterate through the dataframe using RDD local iterator, processing one row at a time (hence reducing memory footprint) - column-wise: fetch the entire dataframe into memory and then process it column-by-column NOTE: (1) this will not apply to streaming or arrow use cases (2) this parameter will only affect implementation detail, and will not affect result of 'sdf_collect', and should only be set if performance profiling indicates any particular choice will be significantly better than the default choice ("row-wise") |
... |
Additional options. |
Copy an object into Spark, and return an R object wrapping the copied object (typically, a Spark DataFrame).
sdf_copy_to(sc, x, name, memory, repartition, overwrite, struct_columns, ...) sdf_import(x, sc, name, memory, repartition, overwrite, struct_columns, ...)
sdf_copy_to(sc, x, name, memory, repartition, overwrite, struct_columns, ...) sdf_import(x, sc, name, memory, repartition, overwrite, struct_columns, ...)
sc |
The associated Spark connection. |
x |
An R object from which a Spark DataFrame can be generated. |
name |
The name to assign to the copied table in Spark. |
memory |
Boolean; should the table be cached into memory? |
repartition |
The number of partitions to use when distributing the table across the Spark cluster. The default (0) can be used to avoid partitioning. |
overwrite |
Boolean; overwrite a pre-existing table with the name |
struct_columns |
(only supported with Spark 2.4.0 or higher) A list of columns from the source data frame that should be converted to Spark SQL StructType columns. The source columns can contain either json strings or nested lists. All rows within each source column should have identical schemas (because otherwise the conversion result will contain unexpected null values or missing values as Spark currently does not support schema discovery on individual rows within a struct column). |
... |
Optional arguments, passed to implementing methods. |
sdf_copy_to
is an S3 generic that, by default, dispatches to
sdf_import
. Package authors that would like to implement
sdf_copy_to
for a custom object type can accomplish this by
implementing the associated method on sdf_import
.
Other Spark data frames:
sdf_distinct()
,
sdf_random_split()
,
sdf_register()
,
sdf_sample()
,
sdf_sort()
,
sdf_weighted_sample()
## Not run: sc <- spark_connect(master = "spark://HOST:PORT") sdf_copy_to(sc, iris) ## End(Not run)
## Not run: sc <- spark_connect(master = "spark://HOST:PORT") sdf_copy_to(sc, iris) ## End(Not run)
Builds a contingency table at each combination of factor levels.
sdf_crosstab(x, col1, col2)
sdf_crosstab(x, col1, col2)
x |
A Spark DataFrame |
col1 |
The name of the first column. Distinct items will make the first item of each row. |
col2 |
The name of the second column. Distinct items will make the column names of the DataFrame. |
A DataFrame containing the contingency table.
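A minimal sketch, assuming a local connection; distinct values of cyl become rows and distinct values of gear become columns:
## Not run: 
sc <- spark_connect(master = "local")
mtcars_tbl <- sdf_copy_to(sc, mtcars, overwrite = TRUE)
sdf_crosstab(mtcars_tbl, "cyl", "gear")
## End(Not run)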
Prints the plan of execution used to generate x
. This plan will, among other things, show the
number of partitions in parentheses at the far left and indicate stages using indentation.
sdf_debug_string(x, print = TRUE)
sdf_debug_string(x, print = TRUE)
x |
An R object wrapping, or containing, a Spark DataFrame. |
print |
Print debug information? |
Compute summary statistics for columns of a data frame
sdf_describe(x, cols = colnames(x))
sdf_describe(x, cols = colnames(x))
x |
An object coercible to a Spark DataFrame |
cols |
Columns to compute statistics for, given as a character vector |
sdf_dim()
, sdf_nrow()
and sdf_ncol()
provide similar
functionality to dim()
, nrow()
and ncol()
.
sdf_dim(x) sdf_nrow(x) sdf_ncol(x)
sdf_dim(x) sdf_nrow(x) sdf_ncol(x)
x |
An object (usually a |
Invoke distinct on a Spark DataFrame
sdf_distinct(x, ..., name)
sdf_distinct(x, ..., name)
x |
A Spark DataFrame. |
... |
Optional variables to use when determining uniqueness. If there are multiple rows for a given combination of inputs, only the first row will be preserved. If omitted, will use all variables. |
name |
A name to assign this table. Passed to [sdf_register()]. |
Other Spark data frames:
sdf_copy_to()
,
sdf_random_split()
,
sdf_register()
,
sdf_sample()
,
sdf_sort()
,
sdf_weighted_sample()
Remove duplicates from a Spark DataFrame
sdf_drop_duplicates(x, cols = NULL)
sdf_drop_duplicates(x, cols = NULL)
x |
An object coercible to a Spark DataFrame |
cols |
Subset of Columns to consider, given as a character vector |
Given one or more R vectors/factors or single-column Spark dataframes, perform an expand.grid operation on all of them and store the result in a Spark dataframe
sdf_expand_grid( sc, ..., broadcast_vars = NULL, memory = TRUE, repartition = NULL, partition_by = NULL )
sdf_expand_grid( sc, ..., broadcast_vars = NULL, memory = TRUE, repartition = NULL, partition_by = NULL )
sc |
The associated Spark connection. |
... |
Each input variable can be either a R vector/factor or a Spark dataframe. Unnamed inputs will assume the default names of 'Var1', 'Var2', etc in the result, similar to what 'expand.grid' does for unnamed inputs. |
broadcast_vars |
Indicates which input(s) should be broadcasted to all nodes of the Spark cluster during the join process (default: none). |
memory |
Boolean; whether the resulting Spark dataframe should be cached into memory (default: TRUE) |
repartition |
Number of partitions the resulting Spark dataframe should have |
partition_by |
Vector of column names used for partitioning the resulting Spark dataframe, only supported for Spark 2.0+ |
## Not run: sc <- spark_connect(master = "local") grid_sdf <- sdf_expand_grid(sc, seq(5), rnorm(10), letters) ## End(Not run)
## Not run: sc <- spark_connect(master = "local") grid_sdf <- sdf_expand_grid(sc, seq(5), rnorm(10), letters) ## End(Not run)
Convert column(s) from avro format
sdf_from_avro(x, cols)
sdf_from_avro(x, cols)
x |
An object coercible to a Spark DataFrame |
cols |
Named list of columns to transform from Avro format plus a valid Avro
schema string for each column, where column names are keys and column schema strings
are values (e.g.,
|
Is the given Spark DataFrame a streaming data?
sdf_is_streaming(x)
sdf_is_streaming(x)
x |
A |
Returns the last index of a Spark DataFrame. The Spark
mapPartitionsWithIndex
function is used to iterate
through the last nonempty partition of the RDD to find the last record.
sdf_last_index(x, id = "id")
sdf_last_index(x, id = "id")
x |
A |
id |
The name of the index column. |
Creates a DataFrame for the given length.
sdf_len(sc, length, repartition = NULL, type = c("integer", "integer64"))
sdf_len(sc, length, repartition = NULL, type = c("integer", "integer64"))
sc |
The associated Spark connection. |
length |
The desired length of the sequence. |
repartition |
The number of partitions to use when distributing the data across the Spark cluster. |
type |
The data type to use for the index, either |
Gets number of partitions of a Spark DataFrame
sdf_num_partitions(x)
sdf_num_partitions(x)
x |
A |
Compute the number of records within each partition of a Spark DataFrame
sdf_partition_sizes(x)
sdf_partition_sizes(x)
x |
A |
## Not run: library(sparklyr) sc <- spark_connect(master = "spark://HOST:PORT") example_sdf <- sdf_len(sc, 100L, repartition = 10L) example_sdf %>% sdf_partition_sizes() %>% print() ## End(Not run)
## Not run: library(sparklyr) sc <- spark_connect(master = "spark://HOST:PORT") example_sdf <- sdf_len(sc, 100L, repartition = 10L) example_sdf %>% sdf_partition_sizes() %>% print() ## End(Not run)
Persist a Spark DataFrame, forcing any pending computations and (optionally) serializing the results to disk.
sdf_persist(x, storage.level = "MEMORY_AND_DISK", name = NULL)
sdf_persist(x, storage.level = "MEMORY_AND_DISK", name = NULL)
x |
A |
storage.level |
The storage level to be used. Please view the Spark Documentation for information on what storage levels are accepted. |
name |
A name to assign this table. Passed to [sdf_register()]. |
Spark DataFrames invoke their operations lazily – pending operations are deferred until their results are actually needed. Persisting a Spark DataFrame effectively 'forces' any pending computations, and then persists the generated Spark DataFrame as requested (to memory, to disk, or otherwise).
Users of Spark should be careful to persist the results of any computations which are non-deterministic – otherwise, one might see that the values within a column seem to 'change' as new operations are performed on that data set.
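A minimal sketch, assuming a local connection; "MEMORY_ONLY" is one of the storage levels accepted by Spark:
## Not run: 
library(dplyr)
sc <- spark_connect(master = "local")
mtcars_tbl <- sdf_copy_to(sc, mtcars, overwrite = TRUE)
cached <- mtcars_tbl %>%
  filter(cyl > 4) %>%
  sdf_persist(storage.level = "MEMORY_ONLY", name = "big_engines")
## End(Not run)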
Construct a pivot table over a Spark Dataframe, using a syntax similar to
that from reshape2::dcast
.
sdf_pivot(x, formula, fun.aggregate = "count")
sdf_pivot(x, formula, fun.aggregate = "count")
x |
A |
formula |
A two-sided R formula of the form |
fun.aggregate |
How should the grouped dataset be aggregated? Can be a length-one character vector, giving the name of a Spark aggregation function to be called; a named R list mapping column names to an aggregation method, or an R function that is invoked on the grouped dataset. |
## Not run: library(sparklyr) library(dplyr) sc <- spark_connect(master = "local") iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE) # aggregating by mean iris_tbl %>% mutate(Petal_Width = ifelse(Petal_Width > 1.5, "High", "Low")) %>% sdf_pivot(Petal_Width ~ Species, fun.aggregate = list(Petal_Length = "mean") ) # aggregating all observations in a list iris_tbl %>% mutate(Petal_Width = ifelse(Petal_Width > 1.5, "High", "Low")) %>% sdf_pivot(Petal_Width ~ Species, fun.aggregate = list(Petal_Length = "collect_list") ) ## End(Not run)
## Not run: library(sparklyr) library(dplyr) sc <- spark_connect(master = "local") iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE) # aggregating by mean iris_tbl %>% mutate(Petal_Width = ifelse(Petal_Width > 1.5, "High", "Low")) %>% sdf_pivot(Petal_Width ~ Species, fun.aggregate = list(Petal_Length = "mean") ) # aggregating all observations in a list iris_tbl %>% mutate(Petal_Width = ifelse(Petal_Width > 1.5, "High", "Low")) %>% sdf_pivot(Petal_Width ~ Species, fun.aggregate = list(Petal_Length = "collect_list") ) ## End(Not run)
Project features onto principal components
sdf_project( object, newdata, features = dimnames(object$pc)[[1]], feature_prefix = NULL, ... )
sdf_project( object, newdata, features = dimnames(object$pc)[[1]], feature_prefix = NULL, ... )
object |
A Spark PCA model object |
newdata |
An object coercible to a Spark DataFrame |
features |
A vector of names of columns to be projected |
feature_prefix |
The prefix used in naming the output features |
... |
Optional arguments; currently unused. |
Given a numeric column within a Spark DataFrame, compute approximate quantiles.
sdf_quantile( x, column, probabilities = c(0, 0.25, 0.5, 0.75, 1), relative.error = 1e-05, weight.column = NULL )
sdf_quantile( x, column, probabilities = c(0, 0.25, 0.5, 0.75, 1), relative.error = 1e-05, weight.column = NULL )
x |
A |
column |
The column(s) for which quantiles should be computed. Multiple columns are only supported in Spark 2.0+. |
probabilities |
A numeric vector of probabilities, for which quantiles should be computed. |
relative.error |
The maximal possible difference between the actual percentile of a result and its expected percentile (e.g., if 'relative.error' is 0.01 and 'probabilities' is 0.95, then any value between the 94th and 96th percentile will be considered an acceptable approximation). |
weight.column |
If not NULL, then a generalized version of the Greenwald-Khanna algorithm will be run to compute weighted percentiles, with each sample from 'column' having a relative weight specified by the corresponding value in 'weight.column'. The weights can be considered as relative frequencies of sample data points. |
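A minimal sketch, assuming a local connection; approximate quartiles of mpg:
## Not run: 
sc <- spark_connect(master = "local")
mtcars_tbl <- sdf_copy_to(sc, mtcars, overwrite = TRUE)
sdf_quantile(mtcars_tbl, column = "mpg", probabilities = c(0.25, 0.5, 0.75))
## End(Not run)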
Partition a Spark DataFrame into multiple groups. This routine is useful for splitting a DataFrame into, for example, training and test datasets.
sdf_random_split( x, ..., weights = NULL, seed = sample(.Machine$integer.max, 1) ) sdf_partition(x, ..., weights = NULL, seed = sample(.Machine$integer.max, 1))
sdf_random_split( x, ..., weights = NULL, seed = sample(.Machine$integer.max, 1) ) sdf_partition(x, ..., weights = NULL, seed = sample(.Machine$integer.max, 1))
x |
An object coercable to a Spark DataFrame. |
... |
Named parameters, mapping table names to weights. The weights will be normalized such that they sum to 1. |
weights |
An alternate mechanism for supplying weights – when
specified, this takes precedence over the |
seed |
Random seed to use for randomly partitioning the dataset. Set this if you want your partitioning to be reproducible on repeated runs. |
The sampling weights define the probability that a particular observation will be assigned to a particular partition, not the resulting size of the partition. This implies that partitioning a DataFrame with, for example,
sdf_random_split(x, training = 0.5, test = 0.5)
is not guaranteed to produce training
and test
partitions
of equal size.
An R list
of tbl_spark
s.
Other Spark data frames:
sdf_copy_to()
,
sdf_distinct()
,
sdf_register()
,
sdf_sample()
,
sdf_sort()
,
sdf_weighted_sample()
## Not run: # randomly partition data into a 'training' and 'test' # dataset, with 60% of the observations assigned to the # 'training' dataset, and 40% assigned to the 'test' dataset data(diamonds, package = "ggplot2") diamonds_tbl <- copy_to(sc, diamonds, "diamonds") partitions <- diamonds_tbl %>% sdf_random_split(training = 0.6, test = 0.4) print(partitions) # alternate way of specifying weights weights <- c(training = 0.6, test = 0.4) diamonds_tbl %>% sdf_random_split(weights = weights) ## End(Not run)
## Not run: # randomly partition data into a 'training' and 'test' # dataset, with 60% of the observations assigned to the # 'training' dataset, and 40% assigned to the 'test' dataset data(diamonds, package = "ggplot2") diamonds_tbl <- copy_to(sc, diamonds, "diamonds") partitions <- diamonds_tbl %>% sdf_random_split(training = 0.6, test = 0.4) print(partitions) # alternate way of specifying weights weights <- c(training = 0.6, test = 0.4) diamonds_tbl %>% sdf_random_split(weights = weights) ## End(Not run)
Generator method for creating a single-column Spark dataframe comprised of i.i.d. samples from a Beta distribution.
sdf_rbeta( sc, n, shape1, shape2, num_partitions = NULL, seed = NULL, output_col = "x" )
sdf_rbeta( sc, n, shape1, shape2, num_partitions = NULL, seed = NULL, output_col = "x" )
sc |
A Spark connection. |
n |
Sample Size (default: 1000). |
shape1 |
Non-negative parameter (alpha) of the Beta distribution. |
shape2 |
Non-negative parameter (beta) of the Beta distribution. |
num_partitions |
Number of partitions in the resulting Spark dataframe (default: default parallelism of the Spark cluster). |
seed |
Random seed (default: a random long integer). |
output_col |
Name of the output column containing sample values (default: "x"). |
Other Spark statistical routines:
sdf_rbinom()
,
sdf_rcauchy()
,
sdf_rchisq()
,
sdf_rexp()
,
sdf_rgamma()
,
sdf_rgeom()
,
sdf_rhyper()
,
sdf_rlnorm()
,
sdf_rnorm()
,
sdf_rpois()
,
sdf_rt()
,
sdf_runif()
,
sdf_rweibull()
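A minimal sketch, assuming a local connection:
## Not run: 
sc <- spark_connect(master = "local")
# 1000 i.i.d. draws from Beta(2, 5), in a one-column Spark dataframe
beta_sdf <- sdf_rbeta(sc, n = 1000, shape1 = 2, shape2 = 5, seed = 1234)
## End(Not run)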
Generator method for creating a single-column Spark dataframe comprised of i.i.d. samples from a binomial distribution.
sdf_rbinom( sc, n, size, prob, num_partitions = NULL, seed = NULL, output_col = "x" )
sdf_rbinom( sc, n, size, prob, num_partitions = NULL, seed = NULL, output_col = "x" )
sc |
A Spark connection. |
n |
Sample Size (default: 1000). |
size |
Number of trials (zero or more). |
prob |
Probability of success on each trial. |
num_partitions |
Number of partitions in the resulting Spark dataframe (default: default parallelism of the Spark cluster). |
seed |
Random seed (default: a random long integer). |
output_col |
Name of the output column containing sample values (default: "x"). |
Other Spark statistical routines:
sdf_rbeta()
,
sdf_rcauchy()
,
sdf_rchisq()
,
sdf_rexp()
,
sdf_rgamma()
,
sdf_rgeom()
,
sdf_rhyper()
,
sdf_rlnorm()
,
sdf_rnorm()
,
sdf_rpois()
,
sdf_rt()
,
sdf_runif()
,
sdf_rweibull()
Generator method for creating a single-column Spark dataframe comprised of i.i.d. samples from a Cauchy distribution.
sdf_rcauchy( sc, n, location = 0, scale = 1, num_partitions = NULL, seed = NULL, output_col = "x" )
sdf_rcauchy( sc, n, location = 0, scale = 1, num_partitions = NULL, seed = NULL, output_col = "x" )
sc |
A Spark connection. |
n |
Sample Size (default: 1000). |
location |
Location parameter of the distribution. |
scale |
Scale parameter of the distribution. |
num_partitions |
Number of partitions in the resulting Spark dataframe (default: default parallelism of the Spark cluster). |
seed |
Random seed (default: a random long integer). |
output_col |
Name of the output column containing sample values (default: "x"). |
Other Spark statistical routines:
sdf_rbeta()
,
sdf_rbinom()
,
sdf_rchisq()
,
sdf_rexp()
,
sdf_rgamma()
,
sdf_rgeom()
,
sdf_rhyper()
,
sdf_rlnorm()
,
sdf_rnorm()
,
sdf_rpois()
,
sdf_rt()
,
sdf_runif()
,
sdf_rweibull()
Generator method for creating a single-column Spark dataframe comprised of i.i.d. samples from a chi-squared distribution.
sdf_rchisq(sc, n, df, num_partitions = NULL, seed = NULL, output_col = "x")
sdf_rchisq(sc, n, df, num_partitions = NULL, seed = NULL, output_col = "x")
sc |
A Spark connection. |
n |
Sample Size (default: 1000). |
df |
Degrees of freedom (non-negative, but can be non-integer). |
num_partitions |
Number of partitions in the resulting Spark dataframe (default: default parallelism of the Spark cluster). |
seed |
Random seed (default: a random long integer). |
output_col |
Name of the output column containing sample values (default: "x"). |
Other Spark statistical routines:
sdf_rbeta()
,
sdf_rbinom()
,
sdf_rcauchy()
,
sdf_rexp()
,
sdf_rgamma()
,
sdf_rgeom()
,
sdf_rhyper()
,
sdf_rlnorm()
,
sdf_rnorm()
,
sdf_rpois()
,
sdf_rt()
,
sdf_runif()
,
sdf_rweibull()
Read a single column from a Spark DataFrame, and return the contents of that column back to R.
sdf_read_column(x, column)
sdf_read_column(x, column)
x |
A |
column |
The name of a column within |
This operation is expected to preserve row order.
Registers a Spark DataFrame (giving it a table name for the
Spark SQL context), and returns a tbl_spark
.
sdf_register(x, name = NULL)
sdf_register(x, name = NULL)
x |
A Spark DataFrame. |
name |
A name to assign this table. |
Other Spark data frames:
sdf_copy_to()
,
sdf_distinct()
,
sdf_random_split()
,
sdf_sample()
,
sdf_sort()
,
sdf_weighted_sample()
Repartition a Spark DataFrame
sdf_repartition(x, partitions = NULL, partition_by = NULL)
sdf_repartition(x, partitions = NULL, partition_by = NULL)
x |
A |
partitions |
number of partitions |
partition_by |
vector of column names used for partitioning, only supported for Spark 2.0+ |
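A minimal sketch, assuming a local connection:
## Not run: 
sc <- spark_connect(master = "local")
mtcars_tbl <- sdf_copy_to(sc, mtcars, overwrite = TRUE)
repartitioned <- sdf_repartition(mtcars_tbl, partitions = 4, partition_by = "cyl")
sdf_num_partitions(repartitioned)
## End(Not run)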
This generic method returns a Spark DataFrame with model residuals added as a column to the model training data.
## S3 method for class 'ml_model_generalized_linear_regression' sdf_residuals( object, type = c("deviance", "pearson", "working", "response"), ... ) ## S3 method for class 'ml_model_linear_regression' sdf_residuals(object, ...) sdf_residuals(object, ...)
## S3 method for class 'ml_model_generalized_linear_regression' sdf_residuals( object, type = c("deviance", "pearson", "working", "response"), ... ) ## S3 method for class 'ml_model_linear_regression' sdf_residuals(object, ...) sdf_residuals(object, ...)
object |
Spark ML model object. |
type |
type of residuals which should be returned. |
... |
additional arguments |
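A minimal sketch for a linear regression, assuming a local connection; the result is the training data with a residuals column appended:
## Not run: 
sc <- spark_connect(master = "local")
mtcars_tbl <- sdf_copy_to(sc, mtcars, overwrite = TRUE)
fit <- ml_linear_regression(mtcars_tbl, mpg ~ wt + cyl)
sdf_residuals(fit)
## End(Not run)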
Generator method for creating a single-column Spark dataframe comprised of i.i.d. samples from an exponential distribution.
sdf_rexp(sc, n, rate = 1, num_partitions = NULL, seed = NULL, output_col = "x")
sdf_rexp(sc, n, rate = 1, num_partitions = NULL, seed = NULL, output_col = "x")
sc |
A Spark connection. |
n |
Sample Size (default: 1000). |
rate |
Rate of the exponential distribution (default: 1). The exponential distribution with rate lambda has mean 1 / lambda and density f(x) = lambda * exp(-lambda * x). |
num_partitions |
Number of partitions in the resulting Spark dataframe (default: default parallelism of the Spark cluster). |
seed |
Random seed (default: a random long integer). |
output_col |
Name of the output column containing sample values (default: "x"). |
Other Spark statistical routines:
sdf_rbeta()
,
sdf_rbinom()
,
sdf_rcauchy()
,
sdf_rchisq()
,
sdf_rgamma()
,
sdf_rgeom()
,
sdf_rhyper()
,
sdf_rlnorm()
,
sdf_rnorm()
,
sdf_rpois()
,
sdf_rt()
,
sdf_runif()
,
sdf_rweibull()
Generator method for creating a single-column Spark dataframe comprised of i.i.d. samples from a Gamma distribution.
sdf_rgamma( sc, n, shape, rate = 1, num_partitions = NULL, seed = NULL, output_col = "x" )
sdf_rgamma( sc, n, shape, rate = 1, num_partitions = NULL, seed = NULL, output_col = "x" )
sc |
A Spark connection. |
n |
Sample Size (default: 1000). |
shape |
Shape parameter (greater than 0) for the Gamma distribution. |
rate |
Rate parameter (greater than 0) for the Gamma distribution (scale is 1/rate). |
num_partitions |
Number of partitions in the resulting Spark dataframe (default: default parallelism of the Spark cluster). |
seed |
Random seed (default: a random long integer). |
output_col |
Name of the output column containing sample values (default: "x"). |
Other Spark statistical routines:
sdf_rbeta()
,
sdf_rbinom()
,
sdf_rcauchy()
,
sdf_rchisq()
,
sdf_rexp()
,
sdf_rgeom()
,
sdf_rhyper()
,
sdf_rlnorm()
,
sdf_rnorm()
,
sdf_rpois()
,
sdf_rt()
,
sdf_runif()
,
sdf_rweibull()
Generator method for creating a single-column Spark dataframe comprised of i.i.d. samples from a geometric distribution.
sdf_rgeom(sc, n, prob, num_partitions = NULL, seed = NULL, output_col = "x")
sdf_rgeom(sc, n, prob, num_partitions = NULL, seed = NULL, output_col = "x")
sc |
A Spark connection. |
n |
Sample Size (default: 1000). |
prob |
Probability of success in each trial. |
num_partitions |
Number of partitions in the resulting Spark dataframe (default: default parallelism of the Spark cluster). |
seed |
Random seed (default: a random long integer). |
output_col |
Name of the output column containing sample values (default: "x"). |
Other Spark statistical routines:
sdf_rbeta()
,
sdf_rbinom()
,
sdf_rcauchy()
,
sdf_rchisq()
,
sdf_rexp()
,
sdf_rgamma()
,
sdf_rhyper()
,
sdf_rlnorm()
,
sdf_rnorm()
,
sdf_rpois()
,
sdf_rt()
,
sdf_runif()
,
sdf_rweibull()
Generator method for creating a single-column Spark dataframe comprised of i.i.d. samples from a hypergeometric distribution.
sdf_rhyper( sc, nn, m, n, k, num_partitions = NULL, seed = NULL, output_col = "x" )
sdf_rhyper( sc, nn, m, n, k, num_partitions = NULL, seed = NULL, output_col = "x" )
sc |
A Spark connection. |
nn |
Sample Size. |
m |
The number of successes among the population. |
n |
The number of failures among the population. |
k |
The number of draws. |
num_partitions |
Number of partitions in the resulting Spark dataframe (default: default parallelism of the Spark cluster). |
seed |
Random seed (default: a random long integer). |
output_col |
Name of the output column containing sample values (default: "x"). |
Other Spark statistical routines:
sdf_rbeta()
,
sdf_rbinom()
,
sdf_rcauchy()
,
sdf_rchisq()
,
sdf_rexp()
,
sdf_rgamma()
,
sdf_rgeom()
,
sdf_rlnorm()
,
sdf_rnorm()
,
sdf_rpois()
,
sdf_rt()
,
sdf_runif()
,
sdf_rweibull()
Generator method for creating a single-column Spark dataframe comprised of i.i.d. samples from a log normal distribution.
sdf_rlnorm( sc, n, meanlog = 0, sdlog = 1, num_partitions = NULL, seed = NULL, output_col = "x" )
sdf_rlnorm( sc, n, meanlog = 0, sdlog = 1, num_partitions = NULL, seed = NULL, output_col = "x" )
sc |
A Spark connection. |
n |
Sample Size (default: 1000). |
meanlog |
The mean of the normally distributed natural logarithm of this distribution. |
sdlog |
The Standard deviation of the normally distributed natural logarithm of this distribution. |
num_partitions |
Number of partitions in the resulting Spark dataframe (default: default parallelism of the Spark cluster). |
seed |
Random seed (default: a random long integer). |
output_col |
Name of the output column containing sample values (default: "x"). |
Other Spark statistical routines:
sdf_rbeta()
,
sdf_rbinom()
,
sdf_rcauchy()
,
sdf_rchisq()
,
sdf_rexp()
,
sdf_rgamma()
,
sdf_rgeom()
,
sdf_rhyper()
,
sdf_rnorm()
,
sdf_rpois()
,
sdf_rt()
,
sdf_runif()
,
sdf_rweibull()
Generator method for creating a single-column Spark dataframe comprised of i.i.d. samples from a normal distribution (default: the standard normal distribution).
sdf_rnorm( sc, n, mean = 0, sd = 1, num_partitions = NULL, seed = NULL, output_col = "x" )
sdf_rnorm( sc, n, mean = 0, sd = 1, num_partitions = NULL, seed = NULL, output_col = "x" )
sc |
A Spark connection. |
n |
Sample Size (default: 1000). |
mean |
The mean value of the normal distribution. |
sd |
The standard deviation of the normal distribution. |
num_partitions |
Number of partitions in the resulting Spark dataframe (default: default parallelism of the Spark cluster). |
seed |
Random seed (default: a random long integer). |
output_col |
Name of the output column containing sample values (default: "x"). |
Other Spark statistical routines:
sdf_rbeta()
,
sdf_rbinom()
,
sdf_rcauchy()
,
sdf_rchisq()
,
sdf_rexp()
,
sdf_rgamma()
,
sdf_rgeom()
,
sdf_rhyper()
,
sdf_rlnorm()
,
sdf_rpois()
,
sdf_rt()
,
sdf_runif()
,
sdf_rweibull()
Generator method for creating a single-column Spark dataframe comprised of i.i.d. samples from a Poisson distribution.
sdf_rpois(sc, n, lambda, num_partitions = NULL, seed = NULL, output_col = "x")
sdf_rpois(sc, n, lambda, num_partitions = NULL, seed = NULL, output_col = "x")
sc |
A Spark connection. |
n |
Sample Size (default: 1000). |
lambda |
Mean, or lambda, of the Poisson distribution. |
num_partitions |
Number of partitions in the resulting Spark dataframe (default: default parallelism of the Spark cluster). |
seed |
Random seed (default: a random long integer). |
output_col |
Name of the output column containing sample values (default: "x"). |
Other Spark statistical routines:
sdf_rbeta()
,
sdf_rbinom()
,
sdf_rcauchy()
,
sdf_rchisq()
,
sdf_rexp()
,
sdf_rgamma()
,
sdf_rgeom()
,
sdf_rhyper()
,
sdf_rlnorm()
,
sdf_rnorm()
,
sdf_rt()
,
sdf_runif()
,
sdf_rweibull()
Generator method for creating a single-column Spark dataframe comprised of i.i.d. samples from a t-distribution.
sdf_rt(sc, n, df, num_partitions = NULL, seed = NULL, output_col = "x")
sdf_rt(sc, n, df, num_partitions = NULL, seed = NULL, output_col = "x")
sc |
A Spark connection. |
n |
Sample Size (default: 1000). |
df |
Degrees of freedom (> 0, maybe non-integer). |
num_partitions |
Number of partitions in the resulting Spark dataframe (default: default parallelism of the Spark cluster). |
seed |
Random seed (default: a random long integer). |
output_col |
Name of the output column containing sample values (default: "x"). |
Other Spark statistical routines:
sdf_rbeta()
,
sdf_rbinom()
,
sdf_rcauchy()
,
sdf_rchisq()
,
sdf_rexp()
,
sdf_rgamma()
,
sdf_rgeom()
,
sdf_rhyper()
,
sdf_rlnorm()
,
sdf_rnorm()
,
sdf_rpois()
,
sdf_runif()
,
sdf_rweibull()
Generator method for creating a single-column Spark dataframe comprised of i.i.d. samples from a uniform distribution (default: U(0, 1)).
sdf_runif( sc, n, min = 0, max = 1, num_partitions = NULL, seed = NULL, output_col = "x" )
sdf_runif( sc, n, min = 0, max = 1, num_partitions = NULL, seed = NULL, output_col = "x" )
sc |
A Spark connection. |
n |
Sample Size (default: 1000). |
min |
The lower limit of the distribution. |
max |
The upper limit of the distribution. |
num_partitions |
Number of partitions in the resulting Spark dataframe (default: default parallelism of the Spark cluster). |
seed |
Random seed (default: a random long integer). |
output_col |
Name of the output column containing sample values (default: "x"). |
Other Spark statistical routines:
sdf_rbeta()
,
sdf_rbinom()
,
sdf_rcauchy()
,
sdf_rchisq()
,
sdf_rexp()
,
sdf_rgamma()
,
sdf_rgeom()
,
sdf_rhyper()
,
sdf_rlnorm()
,
sdf_rnorm()
,
sdf_rpois()
,
sdf_rt()
,
sdf_rweibull()
Generator method for creating a single-column Spark dataframe comprised of i.i.d. samples from a Weibull distribution.
sdf_rweibull( sc, n, shape, scale = 1, num_partitions = NULL, seed = NULL, output_col = "x" )
sdf_rweibull( sc, n, shape, scale = 1, num_partitions = NULL, seed = NULL, output_col = "x" )
sc |
A Spark connection. |
n |
Sample Size (default: 1000). |
shape |
The shape of the Weibull distribution. |
scale |
The scale of the Weibull distribution (default: 1). |
num_partitions |
Number of partitions in the resulting Spark dataframe (default: default parallelism of the Spark cluster). |
seed |
Random seed (default: a random long integer). |
output_col |
Name of the output column containing sample values (default: "x"). |
Other Spark statistical routines:
sdf_rbeta()
,
sdf_rbinom()
,
sdf_rcauchy()
,
sdf_rchisq()
,
sdf_rexp()
,
sdf_rgamma()
,
sdf_rgeom()
,
sdf_rhyper()
,
sdf_rlnorm()
,
sdf_rnorm()
,
sdf_rpois()
,
sdf_rt()
,
sdf_runif()
Draw a random sample of rows (with or without replacement) from a Spark DataFrame.
sdf_sample(x, fraction = 1, replacement = TRUE, seed = NULL)
sdf_sample(x, fraction = 1, replacement = TRUE, seed = NULL)
x |
An object coercable to a Spark DataFrame. |
fraction |
The fraction to sample. |
replacement |
Boolean; sample with replacement? |
seed |
An (optional) integer seed. |
Other Spark data frames:
sdf_copy_to()
,
sdf_distinct()
,
sdf_random_split()
,
sdf_register()
,
sdf_sort()
,
sdf_weighted_sample()
Read the schema of a Spark DataFrame.
sdf_schema(x, expand_nested_cols = FALSE, expand_struct_cols = FALSE)
sdf_schema(x, expand_nested_cols = FALSE, expand_struct_cols = FALSE)
x |
A |
expand_nested_cols |
Whether to expand columns containing nested array of structs (which are usually created by tidyr::nest on a Spark data frame) |
expand_struct_cols |
Whether to expand columns containing structs |
The type
column returned gives the string representation of the
underlying Spark type for that column; for example, a vector of numeric
values would be returned with the type "DoubleType"
. Please see the
Spark Scala API Documentation
for information on what types are available and exposed by Spark.
An R list
, with each list
element describing the
name
and type
of a column.
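A minimal sketch, assuming a local connection:
## Not run: 
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, overwrite = TRUE)
sdf_schema(iris_tbl)
# each list element gives a column's name and Spark type,
# e.g. name "Sepal_Length" with type "DoubleType"
## End(Not run)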
Given a vector column in a Spark DataFrame, split that
into n
separate columns, each column made up of
the different elements in the column column
.
sdf_separate_column(x, column, into = NULL)
sdf_separate_column(x, column, into = NULL)
x |
A |
column |
The name of a (vector-typed) column. |
into |
A specification of the columns that should be
generated from |
Creates a DataFrame for the given range
sdf_seq( sc, from = 1L, to = 1L, by = 1L, repartition = NULL, type = c("integer", "integer64") )
sdf_seq( sc, from = 1L, to = 1L, by = 1L, repartition = NULL, type = c("integer", "integer64") )
sc |
The associated Spark connection. |
from , to
|
The start and end to use as a range |
by |
The increment of the sequence. |
repartition |
The number of partitions to use when distributing the data across the Spark cluster. Defaults to the minimum number of partitions. |
type |
The data type to use for the index, either |
Sort a Spark DataFrame by one or more columns, with each column sorted in ascending order.
sdf_sort(x, columns)
sdf_sort(x, columns)
x |
An object coercable to a Spark DataFrame. |
columns |
The column(s) to sort by. |
Other Spark data frames:
sdf_copy_to()
,
sdf_distinct()
,
sdf_random_split()
,
sdf_register()
,
sdf_sample()
,
sdf_weighted_sample()
Defines a Spark DataFrame from a SQL query, useful to create Spark DataFrames without collecting the results immediately.
sdf_sql(sc, sql)
sdf_sql(sc, sql)
sc |
A |
sql |
a 'SQL' query used to generate a Spark DataFrame. |
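A minimal sketch, assuming a local connection and a table registered as "iris_tbl":
## Not run: 
sc <- spark_connect(master = "local")
sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
setosa_tbl <- sdf_sql(sc, "SELECT * FROM iris_tbl WHERE Species = 'setosa'")
## End(Not run)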
Convert column(s) to avro format
sdf_to_avro(x, cols = colnames(x))
sdf_to_avro(x, cols = colnames(x))
x |
An object coercible to a Spark DataFrame |
cols |
Subset of Columns to convert into avro format |
Expand a struct column or an array column within a Spark dataframe into one or more rows, similar to what tidyr::unnest_longer does to an R dataframe. An index column, if included, will be 1-based if 'col' is an array column.
sdf_unnest_longer( data, col, values_to = NULL, indices_to = NULL, include_indices = NULL, names_repair = "check_unique", ptype = list(), transform = list() )
sdf_unnest_longer( data, col, values_to = NULL, indices_to = NULL, include_indices = NULL, names_repair = "check_unique", ptype = list(), transform = list() )
data |
The Spark dataframe to be unnested |
col |
The struct column to extract components from |
values_to |
Name of column to store vector values. Defaults to 'col'. |
indices_to |
A string giving the name of column which will contain the inner names or position (if not named) of the values. Defaults to 'col' with '_id' suffix |
include_indices |
Whether to include an index column. An index column will be included by default if 'col' is a struct column. It will also be included if 'indices_to' is not 'NULL'. |
names_repair |
Strategy for fixing duplicate column names (the semantic
will be exactly identical to that of '.name_repair' option in
|
ptype |
Optionally, supply an R data frame prototype for the output. Each column of the unnested result will be casted based on the Spark equivalent of the type of the column with the same name within 'ptype', e.g., if 'ptype' has a column 'x' of type 'character', then column 'x' of the unnested result will be casted from its original SQL type to StringType. |
transform |
Optionally, a named list of transformation functions applied to each component (e.g., list('x = as.character') to cast column 'x' to String). |
## Not run: library(sparklyr) sc <- spark_connect(master = "local", version = "2.4.0") # unnesting a struct column sdf <- copy_to( sc, dplyr::tibble( x = 1:3, y = list(list(a = 1, b = 2), list(a = 3, b = 4), list(a = 5, b = 6)) ) ) unnested <- sdf %>% sdf_unnest_longer(y, indices_to = "attr") # unnesting an array column sdf <- copy_to( sc, dplyr::tibble( x = 1:3, y = list(1:10, 1:5, 1:2) ) ) unnested <- sdf %>% sdf_unnest_longer(y, indices_to = "array_idx") ## End(Not run)
## Not run: library(sparklyr) sc <- spark_connect(master = "local", version = "2.4.0") # unnesting a struct column sdf <- copy_to( sc, dplyr::tibble( x = 1:3, y = list(list(a = 1, b = 2), list(a = 3, b = 4), list(a = 5, b = 6)) ) ) unnested <- sdf %>% sdf_unnest_longer(y, indices_to = "attr") # unnesting an array column sdf <- copy_to( sc, dplyr::tibble( x = 1:3, y = list(1:10, 1:5, 1:2) ) ) unnested <- sdf %>% sdf_unnest_longer(y, indices_to = "array_idx") ## End(Not run)
Flatten a struct column within a Spark dataframe into one or more columns, similar to what tidyr::unnest_wider does to an R dataframe.
sdf_unnest_wider( data, col, names_sep = NULL, names_repair = "check_unique", ptype = list(), transform = list() )
sdf_unnest_wider( data, col, names_sep = NULL, names_repair = "check_unique", ptype = list(), transform = list() )
data |
The Spark dataframe to be unnested |
col |
The struct column to extract components from |
names_sep |
If 'NULL', the default, the names will be left as is. If a string, the inner and outer names will be pasted together using 'names_sep' as the delimiter. |
names_repair |
Strategy for fixing duplicate column names (the semantic
will be exactly identical to that of '.name_repair' option in
|
ptype |
Optionally, supply an R data frame prototype for the output. Each column of the unnested result will be casted based on the Spark equivalent of the type of the column with the same name within 'ptype', e.g., if 'ptype' has a column 'x' of type 'character', then column 'x' of the unnested result will be casted from its original SQL type to StringType. |
transform |
Optionally, a named list of transformation functions applied to each component (e.g., list('x = as.character') to cast column 'x' to String). |
## Not run: library(sparklyr) sc <- spark_connect(master = "local", version = "2.4.0") sdf <- copy_to( sc, dplyr::tibble( x = 1:3, y = list(list(a = 1, b = 2), list(a = 3, b = 4), list(a = 5, b = 6)) ) ) # flatten struct column 'y' into two separate columns 'y_a' and 'y_b' unnested <- sdf %>% sdf_unnest_wider(y, names_sep = "_") ## End(Not run)
## Not run: library(sparklyr) sc <- spark_connect(master = "local", version = "2.4.0") sdf <- copy_to( sc, dplyr::tibble( x = 1:3, y = list(list(a = 1, b = 2), list(a = 3, b = 4), list(a = 5, b = 6)) ) ) # flatten struct column 'y' into two separate columns 'y_a' and 'y_b' unnested <- sdf %>% sdf_unnest_wider(y, names_sep = "_") ## End(Not run)
Draw a random sample of rows (with or without replacement) from a Spark DataFrame. If the sampling is done without replacement, it is conceptually equivalent to an iterative process in which, at each step, the probability of adding a row to the sample set equals its weight divided by the sum of the weights of all rows not yet in the sample set.
sdf_weighted_sample(x, weight_col, k, replacement = TRUE, seed = NULL)
sdf_weighted_sample(x, weight_col, k, replacement = TRUE, seed = NULL)
x |
An object coercible to a Spark DataFrame. |
weight_col |
Name of the weight column |
k |
Sample set size |
replacement |
Whether to sample with replacement |
seed |
An (optional) integer seed |
Other Spark data frames:
sdf_copy_to()
,
sdf_distinct()
,
sdf_random_split()
,
sdf_register()
,
sdf_sample()
,
sdf_sort()
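For illustration, a minimal sdf_weighted_sample() sketch, assuming a local connection; the weight column 'w' and the sample size are arbitrary choices and not part of the original documentation.
## Not run: 
library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local")
# hypothetical data: 100 rows with an illustrative weight column 'w'
sdf <- sdf_len(sc, 100) %>% mutate(w = id / 100)
# draw 10 rows without replacement; heavier rows are more likely to be picked
sampled <- sdf %>%
  sdf_weighted_sample(weight_col = "w", k = 10, replacement = FALSE, seed = 42)
sampled %>% collect()
## End(Not run)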
Add a sequential ID column to a Spark DataFrame. The Spark
zipWithIndex
function is used to produce these. This differs from
sdf_with_unique_id
in that the IDs generated are independent of
partitioning.
sdf_with_sequential_id(x, id = "id", from = 1L)
sdf_with_sequential_id(x, id = "id", from = 1L)
x |
A |
id |
The name of the column to host the generated IDs. |
from |
The starting value of the id column |
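A short hedged example of sdf_with_sequential_id(), assuming a local connection; the column name "row_num" is an arbitrary choice.
## Not run: 
library(sparklyr)
sc <- spark_connect(master = "local")
sdf <- sdf_copy_to(sc, iris, overwrite = TRUE)
# add a 1-based sequential id column named "row_num"
sdf_with_sequential_id(sdf, id = "row_num", from = 1L)
## End(Not run)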
Add a unique ID column to a Spark DataFrame. The Spark
monotonicallyIncreasingId
function is used to produce these and is
guaranteed to produce unique, monotonically increasing ids; however, there
is no guarantee that these IDs will be sequential. The table is persisted
immediately after the column is generated, to ensure that the column is
stable – otherwise, it can differ across new computations.
sdf_with_unique_id(x, id = "id")
sdf_with_unique_id(x, id = "id")
x |
A |
id |
The name of the column to host the generated IDs. |
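A short hedged example of sdf_with_unique_id(), assuming a local connection; the column name "uid" is an arbitrary choice.
## Not run: 
library(sparklyr)
sc <- spark_connect(master = "local")
sdf <- sdf_copy_to(sc, mtcars, overwrite = TRUE)
# ids are unique and increasing, but not necessarily consecutive
sdf_with_unique_id(sdf, id = "uid")
## End(Not run)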
Routines for saving and loading Spark DataFrames.
sdf_save_table(x, name, overwrite = FALSE, append = FALSE) sdf_load_table(sc, name) sdf_save_parquet(x, path, overwrite = FALSE, append = FALSE) sdf_load_parquet(sc, path)
sdf_save_table(x, name, overwrite = FALSE, append = FALSE) sdf_load_table(sc, name) sdf_save_parquet(x, path, overwrite = FALSE, append = FALSE) sdf_load_parquet(sc, path)
x |
A |
name |
The table name to assign to the saved Spark DataFrame. |
overwrite |
Boolean; overwrite a pre-existing table of the same name? |
append |
Boolean; append to a pre-existing table of the same name? |
sc |
A |
path |
The path where the Spark DataFrame should be saved. |
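A minimal round-trip sketch using sdf_save_parquet() and sdf_load_parquet(), assuming a local connection and a writable path under /tmp; the path is illustrative only.
## Not run: 
library(sparklyr)
sc <- spark_connect(master = "local")
sdf <- sdf_copy_to(sc, iris, overwrite = TRUE)
# save to Parquet, then load it back under a new reference
sdf_save_parquet(sdf, path = "/tmp/iris_parquet", overwrite = TRUE)
iris_back <- sdf_load_parquet(sc, path = "/tmp/iris_parquet")
## End(Not run)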
Deprecated methods for transformation, fit, and prediction. These are mirrors of the corresponding ml-transform-methods.
sdf_predict(x, model, ...) sdf_transform(x, transformer, ...) sdf_fit(x, estimator, ...) sdf_fit_and_transform(x, estimator, ...)
sdf_predict(x, model, ...) sdf_transform(x, transformer, ...) sdf_fit(x, estimator, ...) sdf_fit_and_transform(x, estimator, ...)
x |
A |
model |
A |
... |
Optional arguments passed to the corresponding |
transformer |
A |
estimator |
A |
sdf_predict()
, sdf_transform()
, and sdf_fit_and_transform()
return a transformed dataframe, whereas sdf_fit()
returns an ml_transformer
.
Retrieves or sets whether Spark adaptive query execution is enabled
spark_adaptive_query_execution(sc, enable = NULL)
spark_adaptive_query_execution(sc, enable = NULL)
sc |
A |
enable |
Whether to enable Spark adaptive query execution. Defaults to
|
Other Spark runtime configuration:
spark_advisory_shuffle_partition_size()
,
spark_auto_broadcast_join_threshold()
,
spark_coalesce_initial_num_partitions()
,
spark_coalesce_min_num_partitions()
,
spark_coalesce_shuffle_partitions()
,
spark_session_config()
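For example (a hedged sketch of spark_adaptive_query_execution(), assuming a local connection), calling the function with only 'sc' retrieves the current value, while passing 'enable' sets it.
## Not run: 
library(sparklyr)
sc <- spark_connect(master = "local")
spark_adaptive_query_execution(sc)        # retrieve the current setting
spark_adaptive_query_execution(sc, TRUE)  # enable adaptive query execution
## End(Not run)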
Retrieves or sets advisory size in bytes of the shuffle partition during adaptive optimization
spark_advisory_shuffle_partition_size(sc, size = NULL)
spark_advisory_shuffle_partition_size(sc, size = NULL)
sc |
A |
size |
Advisory size in bytes of the shuffle partition.
Defaults to |
Other Spark runtime configuration:
spark_adaptive_query_execution()
,
spark_auto_broadcast_join_threshold()
,
spark_coalesce_initial_num_partitions()
,
spark_coalesce_min_num_partitions()
,
spark_coalesce_shuffle_partitions()
,
spark_session_config()
Applies an R function to a Spark object (typically, a Spark DataFrame).
spark_apply( x, f, columns = NULL, memory = TRUE, group_by = NULL, packages = NULL, context = NULL, name = NULL, barrier = NULL, fetch_result_as_sdf = TRUE, partition_index_param = "", arrow_max_records_per_batch = NULL, auto_deps = FALSE, ... )
spark_apply( x, f, columns = NULL, memory = TRUE, group_by = NULL, packages = NULL, context = NULL, name = NULL, barrier = NULL, fetch_result_as_sdf = TRUE, partition_index_param = "", arrow_max_records_per_batch = NULL, auto_deps = FALSE, ... )
x |
An object (usually a |
f |
A function that transforms a data frame partition into a data frame.
The function can also be an |
columns |
A vector of column names or a named vector of column types for
the transformed object. When not specified, a sample of 10 rows is taken to
infer the output columns automatically; to avoid this performance penalty,
specify the column types. The sample size is configurable using the
|
memory |
Boolean; should the table be cached into memory? |
group_by |
Column name used to group by data frame partitions. |
packages |
Boolean to distribute packages to each worker node, a list of packages to distribute, or a package bundle created with spark_apply_bundle(); defaults to TRUE. For clusters using Yarn cluster mode, a bundle created with spark_apply_bundle() can be distributed as a Spark file. For offline clusters, and for clusters where the R packages are already installed in every worker node, the corresponding sparklyr configuration entries can be set in spark_config() instead.
 |
context |
Optional object to be serialized and passed back to |
name |
Optional table name while registering the resulting data frame. |
barrier |
Optional to support Barrier Execution Mode in the scheduler. |
fetch_result_as_sdf |
Whether to return the transformed results in a Spark
Dataframe (defaults to TRUE). NOTE: this must be set to FALSE when the transformation produces results that cannot be stored in a Spark Dataframe. |
partition_index_param |
Optional; if non-empty, the index of the partition being processed is passed to the transformation function as a named argument with this name. |
arrow_max_records_per_batch |
Maximum size of each Arrow record batch, ignored if Arrow serialization is not enabled. |
auto_deps |
[Experimental] Whether to infer all required R packages by
examining the closure |
... |
Optional arguments; currently unused. |
spark_config() settings can be specified to change the workers' environment.
For instance, to set additional environment variables on each worker node, use the sparklyr.apply.env.* config; to launch workers without --vanilla, set sparklyr.apply.options.vanilla to FALSE; to run a custom script before launching Rscript, use sparklyr.apply.options.rscript.before
.
## Not run: library(sparklyr) sc <- spark_connect(master = "local[3]") # creates a Spark data frame with 10 elements, then multiplies it by 10 in R sdf_len(sc, 10) %>% spark_apply(function(df) df * 10) # using barrier mode sdf_len(sc, 3, repartition = 3) %>% spark_apply(nrow, barrier = TRUE, columns = c(id = "integer")) %>% collect() ## End(Not run)
## Not run: library(sparklyr) sc <- spark_connect(master = "local[3]") # creates a Spark data frame with 10 elements, then multiplies it by 10 in R sdf_len(sc, 10) %>% spark_apply(function(df) df * 10) # using barrier mode sdf_len(sc, 3, repartition = 3) %>% spark_apply(nrow, barrier = TRUE, columns = c(id = "integer")) %>% collect() ## End(Not run)
Creates a bundle of packages for spark_apply()
.
spark_apply_bundle(packages = TRUE, base_path = getwd(), session_id = NULL)
spark_apply_bundle(packages = TRUE, base_path = getwd(), session_id = NULL)
packages |
List of packages to pack or |
base_path |
Base path used to store the resulting bundle. |
session_id |
An optional ID string to include in the bundle file name to allow the bundle to be session-specific |
Writes data to log under spark_apply()
.
spark_apply_log(..., level = "INFO")
spark_apply_log(..., level = "INFO")
... |
Arguments to write to log. |
level |
Severity level for this entry; recommended values: |
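As a hedged sketch (not from the original documentation), spark_apply_log() is intended to be called from inside the function passed to spark_apply(); the resulting entries can later be inspected with spark_log().
## Not run: 
library(sparklyr)
sc <- spark_connect(master = "local")
sdf_len(sc, 4) %>%
  spark_apply(function(df) {
    # write a log entry from the worker while processing a partition
    sparklyr::spark_apply_log("processing", nrow(df), "rows", level = "INFO")
    df
  })
## End(Not run)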
Configures the maximum size in bytes for a table that will be broadcast to all worker nodes when performing a join. By setting this value to -1 broadcasting can be disabled. Note that currently statistics are only supported for Hive Metastore tables where the command 'ANALYZE TABLE <tableName> COMPUTE STATISTICS noscan' has been run, and file-based data source tables where the statistics are computed directly on the files of data.
spark_auto_broadcast_join_threshold(sc, threshold = NULL)
spark_auto_broadcast_join_threshold(sc, threshold = NULL)
sc |
A |
threshold |
Maximum size in bytes for a table that will be broadcast to all worker nodes
when performing a join. Defaults to |
Other Spark runtime configuration:
spark_adaptive_query_execution()
,
spark_advisory_shuffle_partition_size()
,
spark_coalesce_initial_num_partitions()
,
spark_coalesce_min_num_partitions()
,
spark_coalesce_shuffle_partitions()
,
spark_session_config()
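A minimal spark_auto_broadcast_join_threshold() sketch, assuming a local connection; the 50 MB value is illustrative only.
## Not run: 
library(sparklyr)
sc <- spark_connect(master = "local")
spark_auto_broadcast_join_threshold(sc)               # retrieve current threshold
spark_auto_broadcast_join_threshold(sc, 50 * 1024^2)  # raise it to 50 MB
spark_auto_broadcast_join_threshold(sc, -1)           # disable broadcasting
## End(Not run)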
Retrieves or sets initial number of shuffle partitions before coalescing
spark_coalesce_initial_num_partitions(sc, num_partitions = NULL)
spark_coalesce_initial_num_partitions(sc, num_partitions = NULL)
sc |
A |
num_partitions |
Initial number of shuffle partitions before coalescing.
Defaults to |
Other Spark runtime configuration:
spark_adaptive_query_execution()
,
spark_advisory_shuffle_partition_size()
,
spark_auto_broadcast_join_threshold()
,
spark_coalesce_min_num_partitions()
,
spark_coalesce_shuffle_partitions()
,
spark_session_config()
Retrieves or sets the minimum number of shuffle partitions after coalescing
spark_coalesce_min_num_partitions(sc, num_partitions = NULL)
spark_coalesce_min_num_partitions(sc, num_partitions = NULL)
sc |
A |
num_partitions |
Minimum number of shuffle partitions after coalescing.
Defaults to |
Other Spark runtime configuration:
spark_adaptive_query_execution()
,
spark_advisory_shuffle_partition_size()
,
spark_auto_broadcast_join_threshold()
,
spark_coalesce_initial_num_partitions()
,
spark_coalesce_shuffle_partitions()
,
spark_session_config()
Retrieves or sets whether coalescing contiguous shuffle partitions is enabled
spark_coalesce_shuffle_partitions(sc, enable = NULL)
spark_coalesce_shuffle_partitions(sc, enable = NULL)
sc |
A |
enable |
Whether to enable coalescing of contiguous shuffle partitions.
Defaults to |
Other Spark runtime configuration:
spark_adaptive_query_execution()
,
spark_advisory_shuffle_partition_size()
,
spark_auto_broadcast_join_threshold()
,
spark_coalesce_initial_num_partitions()
,
spark_coalesce_min_num_partitions()
,
spark_session_config()
For use with compile_package_jars
. The Spark compilation
specification is used when compiling Spark extension Java Archives, and
defines which versions of Spark, as well as which versions of Scala, should
be used for compilation.
spark_compilation_spec( spark_version = NULL, spark_home = NULL, scalac_path = NULL, scala_filter = NULL, jar_name = NULL, jar_path = NULL, jar_dep = NULL, embedded_srcs = "embedded_sources.R" )
spark_compilation_spec( spark_version = NULL, spark_home = NULL, scalac_path = NULL, scala_filter = NULL, jar_name = NULL, jar_path = NULL, jar_dep = NULL, embedded_srcs = "embedded_sources.R" )
spark_version |
The Spark version to build against. This can be left unset if the path to a suitable Spark home is supplied. |
spark_home |
The path to a Spark home installation. This can
be left unset if |
scalac_path |
The path to the |
scala_filter |
An optional R function that can be used to filter
which |
jar_name |
The name to be assigned to the generated |
jar_path |
The path to the |
jar_dep |
An optional list of additional |
embedded_srcs |
Embedded source file(s) under |
Most Spark extensions won't need to define their own compilation specification,
and can instead rely on the default behavior of compile_package_jars
.
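For packages that do need a custom specification, a hedged sketch of passing one to compile_package_jars(); the Spark version, scalac path, and jar name below are illustrative assumptions, not defaults.
## Not run: 
library(sparklyr)
# compile the extension package in the current directory against a single
# Spark/Scala combination instead of the defaults
spec <- spark_compilation_spec(
  spark_version = "3.4.0",
  scalac_path = "/usr/local/bin/scalac",  # assumed location of scalac
  jar_name = "myextension-3.4-2.12.jar"   # illustrative jar name
)
compile_package_jars(spec = spec)
## End(Not run)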
Read Spark Configuration
spark_config(file = "config.yml", use_default = TRUE)
spark_config(file = "config.yml", use_default = TRUE)
file |
Name of the configuration file |
use_default |
TRUE to use the built-in defaults provided in this package |
Read Spark configuration using the config package.
Named list with configuration data
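For instance (a minimal sketch; the memory values are arbitrary), settings can be added to the returned list and passed to spark_connect().
## Not run: 
library(sparklyr)
conf <- spark_config()  # reads config.yml if present
conf$`sparklyr.shell.driver-memory` <- "4G"
conf$spark.executor.memory <- "4G"
sc <- spark_connect(master = "local", config = conf)
## End(Not run)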
Convenience function to initialize a Kubernetes configuration instead
of spark_config()
; it exposes common properties to set in Kubernetes
clusters.
spark_config_kubernetes( master, version = "3.2.3", image = "spark:sparklyr", driver = random_string("sparklyr-"), account = "spark", jars = "local:///opt/sparklyr", forward = TRUE, executors = NULL, conf = NULL, timeout = 120, ports = c(8880, 8881, 4040), fix_config = identical(.Platform$OS.type, "windows"), ... )
spark_config_kubernetes( master, version = "3.2.3", image = "spark:sparklyr", driver = random_string("sparklyr-"), account = "spark", jars = "local:///opt/sparklyr", forward = TRUE, executors = NULL, conf = NULL, timeout = 120, ports = c(8880, 8881, 4040), fix_config = identical(.Platform$OS.type, "windows"), ... )
master |
Kubernetes url to connect to, found by running |
version |
The version of Spark being used. |
image |
Container image to use to launch Spark and sparklyr. Also known
as |
driver |
Name of the driver pod. If not set, the driver pod name is set
to "sparklyr" suffixed by id to avoid name conflicts. Also known as
|
account |
Service account that is used when running the driver pod. The driver
pod uses this service account when requesting executor pods from the API
server. Also known as |
jars |
Path to the sparklyr jars; either a local path inside the container
image with the sparklyr jars copied when the image was created, or a path
accessible by the container where the sparklyr jars were copied. You can find
the path to the sparklyr jars by running |
forward |
Should ports used in sparklyr be forwarded automatically through Kubernetes?
Defaults to |
executors |
Number of executors to request while connecting. |
conf |
A named list of additional entries to add to |
timeout |
Total seconds to wait before giving up on connection. |
ports |
Ports to forward using kubectl. |
fix_config |
Should the spark-defaults.conf get fixed? |
... |
Additional parameters, currently not in use. |
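A hedged connection sketch for spark_config_kubernetes(); the master URL, image name, and executor count are placeholders for cluster-specific values.
## Not run: 
library(sparklyr)
conf <- spark_config_kubernetes(
  master = "k8s://https://192.168.49.2:8443",  # placeholder, from `kubectl cluster-info`
  image = "spark:sparklyr",
  executors = 2
)
sc <- spark_connect(config = conf)
## End(Not run)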
Retrieves available sparklyr settings that can be used in configuration files or spark_config()
.
spark_config_settings()
spark_config_settings()
Function that negotiates the connection with the Spark back-end
spark_connect_method( x, method, master, spark_home, config, app_name, version, hadoop_version, extensions, scala_version, ... )
spark_connect_method( x, method, master, spark_home, config, app_name, version, hadoop_version, extensions, scala_version, ... )
x |
A dummy method object to determine which code to use to connect |
method |
The method used to connect to Spark. Default connection method
is |
master |
Spark cluster url to connect to. Use |
spark_home |
The path to a Spark installation. Defaults to the path
provided by the |
config |
Custom configuration for the generated Spark connection. See
|
app_name |
The application name to be used while running in the Spark cluster. |
version |
The version of Spark to use. Required for |
hadoop_version |
Version of Hadoop to use |
extensions |
Extension R packages to enable for this connection. By
default, all packages enabled through the use of
|
scala_version |
Load the sparklyr jar file that is built with the version of Scala specified (this currently only makes sense for Spark 2.4, where sparklyr will by default assume Spark 2.4 on the current host is built with Scala 2.11, and therefore scala_version = '2.12' is needed if sparklyr is connecting to Spark 2.4 built with Scala 2.12) |
... |
Additional params to be passed to each 'spark_disconnect()' call (e.g., 'terminate = TRUE') |
Retrieve the spark_connection
associated with an R object.
spark_connection(x, ...)
spark_connection(x, ...)
x |
An R object from which a |
... |
Optional arguments; currently unused. |
Finds an active spark connection in the environment given the connection parameters.
spark_connection_find(master = NULL, app_name = NULL, method = NULL)
spark_connection_find(master = NULL, app_name = NULL, method = NULL)
master |
The Spark master parameter. |
app_name |
The Spark application name. |
method |
The method used to connect to Spark. |
Retrieves the runtime configuration interface for the Spark Context.
spark_context_config(sc)
spark_context_config(sc)
sc |
A |
This S3 generic is used to access a Spark DataFrame object (as a Java object reference) from an R object.
spark_dataframe(x, ...)
spark_dataframe(x, ...)
x |
An R object wrapping, or containing, a Spark DataFrame. |
... |
Optional arguments; currently unused. |
A spark_jobj
representing a Java object reference
to a Spark DataFrame.
This is the default compilation specification used for
Spark extensions, when used with compile_package_jars
.
spark_default_compilation_spec( pkg = infer_active_package_name(), locations = NULL )
spark_default_compilation_spec( pkg = infer_active_package_name(), locations = NULL )
pkg |
The package containing Spark extensions to be compiled. |
locations |
Additional locations to scan. By default, the
directories |
Define a Spark dependency consisting of a set of custom JARs, Spark packages, and customized dbplyr SQL translation env.
spark_dependency( jars = NULL, packages = NULL, initializer = NULL, catalog = NULL, repositories = NULL, dbplyr_sql_variant = NULL, ... )
spark_dependency( jars = NULL, packages = NULL, initializer = NULL, catalog = NULL, repositories = NULL, dbplyr_sql_variant = NULL, ... )
jars |
Character vector of full paths to JAR files. |
packages |
Character vector of Spark packages names. |
initializer |
Optional callback function called when initializing a connection. |
catalog |
Optional location where extension JAR files can be downloaded for Livy. |
repositories |
Character vector of Spark package repositories. |
dbplyr_sql_variant |
Customization of dbplyr SQL translation env. Must be a
named list of the following form:
|
... |
Additional optional arguments. |
An object of type 'spark_dependency'
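As a hedged illustration, an extension package would typically return this object from its spark_dependencies() callback; the jar path and package coordinates below are hypothetical.
## Not run: 
# inside an extension package: declare jars and Spark packages to load
# with every connection (path and coordinates are hypothetical)
spark_dependencies <- function(spark_version, scala_version, ...) {
  sparklyr::spark_dependency(
    jars = system.file("java", "myextension.jar", package = "myextension"),
    packages = "com.example:my-spark-pkg_2.12:1.0.0"
  )
}
## End(Not run)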
Helper function to assist falling back to previous Spark versions.
spark_dependency_fallback(spark_version, supported_versions)
spark_dependency_fallback(spark_version, supported_versions)
spark_version |
The Spark version being requested in |
supported_versions |
The Spark versions that are supported by this extension. |
A Spark version to use.
Creates an R package ready to be used as an Spark extension.
spark_extension(path)
spark_extension(path)
path |
Location where the extension will be created. |
Set the SPARK_HOME
environment variable. This slightly speeds up some
operations, including the connection time.
spark_home_set(path = NULL, ...)
spark_home_set(path = NULL, ...)
path |
A string containing the path to the installation location of
Spark. If |
... |
Additional parameters not currently used. |
The function is mostly invoked for the side-effect of setting the
SPARK_HOME
environment variable. It also returns TRUE
if the
environment was successfully set, and FALSE
otherwise.
## Not run: # Not run due to side-effects spark_home_set() ## End(Not run)
## Not run: # Not run due to side-effects spark_home_set() ## End(Not run)
Set of functions to provide integration with the RStudio IDE
spark_ide_connection_open(con, env, connect_call) spark_ide_connection_closed(con) spark_ide_connection_updated(con, hint) spark_ide_connection_actions(con) spark_ide_objects(con, catalog, schema, name, type) spark_ide_columns( con, table = NULL, view = NULL, catalog = NULL, schema = NULL ) spark_ide_preview( con, rowLimit, table = NULL, view = NULL, catalog = NULL, schema = NULL )
spark_ide_connection_open(con, env, connect_call) spark_ide_connection_closed(con) spark_ide_connection_updated(con, hint) spark_ide_connection_actions(con) spark_ide_objects(con, catalog, schema, name, type) spark_ide_columns( con, table = NULL, view = NULL, catalog = NULL, schema = NULL ) spark_ide_preview( con, rowLimit, table = NULL, view = NULL, catalog = NULL, schema = NULL )
con |
Valid Spark connection |
env |
R environment of the interactive R session |
connect_call |
R code that can be used to re-connect to the Spark connection |
hint |
Name of the Spark connection that the RStudio IDE can use as reference. |
catalog |
Name of the top level of the requested table or view |
schema |
Name of the second-highest level of the requested table or view |
name |
The name of the view or table being requested |
type |
Type of the object being requested, 'view' or 'table' |
table |
Name of the requested table |
view |
Name of the requested view |
rowLimit |
The number of rows to show in the 'Preview' pane of the RStudio IDE |
These functions are meant for downstream packages that provide additional backends to 'sparklyr', allowing them to override the opening, closing, update, and preview functionality. The arguments are driven by what the RStudio IDE API expects, which is why some functions use 'type' to designate views or tables, while others have one argument for 'table' and another for 'view'.
Inserts a Spark DataFrame into a Spark table
spark_insert_table( x, name, mode = NULL, overwrite = FALSE, options = list(), ... )
spark_insert_table( x, name, mode = NULL, overwrite = FALSE, options = list(), ... )
x |
A Spark DataFrame or dplyr operation |
name |
The name to assign to the newly generated table. |
mode |
A character element specifying the behavior when data or a table already exists. For more details see also https://spark.apache.org/docs/latest/sql-programming-guide.html#save-modes for your version of Spark. |
overwrite |
Boolean; overwrite the table with the given name if it already exists? |
options |
A list of strings with additional options. |
... |
Optional arguments; currently unused. |
Other Spark serialization routines:
collect_from_rds()
,
spark_load_table()
,
spark_read()
,
spark_read_avro()
,
spark_read_binary()
,
spark_read_csv()
,
spark_read_delta()
,
spark_read_image()
,
spark_read_jdbc()
,
spark_read_json()
,
spark_read_libsvm()
,
spark_read_orc()
,
spark_read_parquet()
,
spark_read_source()
,
spark_read_table()
,
spark_read_text()
,
spark_save_table()
,
spark_write_avro()
,
spark_write_csv()
,
spark_write_delta()
,
spark_write_jdbc()
,
spark_write_json()
,
spark_write_orc()
,
spark_write_parquet()
,
spark_write_source()
,
spark_write_table()
,
spark_write_text()
Install versions of Spark for use with local Spark connections
(i.e. spark_connect(master = "local"))
spark_install( version = NULL, hadoop_version = NULL, reset = TRUE, logging = "INFO", verbose = interactive() ) spark_uninstall(version, hadoop_version) spark_install_dir() spark_install_tar(tarfile) spark_installed_versions() spark_available_versions( show_hadoop = FALSE, show_minor = FALSE, show_future = FALSE )
spark_install( version = NULL, hadoop_version = NULL, reset = TRUE, logging = "INFO", verbose = interactive() ) spark_uninstall(version, hadoop_version) spark_install_dir() spark_install_tar(tarfile) spark_installed_versions() spark_available_versions( show_hadoop = FALSE, show_minor = FALSE, show_future = FALSE )
version |
Version of Spark to install. See |
hadoop_version |
Version of Hadoop to install. See |
reset |
Attempts to reset settings to defaults. |
logging |
Logging level to configure install. Supported options: "WARN", "INFO" |
verbose |
Report information as Spark is downloaded / installed |
tarfile |
Path to TAR file conforming to the pattern spark-###-bin-(hadoop)?### where ### reference spark and hadoop versions respectively. |
show_hadoop |
Show Hadoop distributions? |
show_minor |
Show minor Spark versions? |
show_future |
Should future versions which have not been released be shown? |
List with information about the installed version.
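A short sketch of the typical spark_install() workflow; the version shown is only an example.
## Not run: 
library(sparklyr)
spark_available_versions()      # list versions that can be installed
spark_install(version = "3.4")  # download and install locally
spark_installed_versions()      # confirm what is installed
sc <- spark_connect(master = "local", version = "3.4")
## End(Not run)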
It lets the package know if it should test a particular functionality or not
spark_integ_test_skip(sc, test_name)
spark_integ_test_skip(sc, test_name)
sc |
Spark connection |
test_name |
The name of the test |
It expects a boolean to be returned. If TRUE, the corresponding test will be skipped; if FALSE, the test will be conducted.
This S3 generic is used for accessing the underlying Java Virtual Machine
(JVM) Spark objects associated with R objects. These objects act as
references to Spark objects living in the JVM. Methods on these objects
can be called with the invoke
family of functions.
spark_jobj(x, ...)
spark_jobj(x, ...)
x |
An R object containing, or wrapping, a |
... |
Optional arguments; currently unused. |
invoke
, for calling methods on Java object references.
Surfaces the last error from Spark captured by internal 'spark_error' function
spark_last_error()
spark_last_error()
Reads from a Spark Table into a Spark DataFrame.
spark_load_table( sc, name, path, options = list(), repartition = 0, memory = TRUE, overwrite = TRUE )
spark_load_table( sc, name, path, options = list(), repartition = 0, memory = TRUE, overwrite = TRUE )
sc |
A |
name |
The name to assign to the newly generated table. |
path |
The path to the file. Needs to be accessible from the cluster. Supports the ‘"hdfs://"’, ‘"s3a://"’ and ‘"file://"’ protocols. |
options |
A list of strings with additional options. See https://spark.apache.org/docs/latest/sql-programming-guide.html#configuration. |
repartition |
The number of partitions used to distribute the generated table. Use 0 (the default) to avoid partitioning. |
memory |
Boolean; should the data be loaded eagerly into memory? (That is, should the table be cached?) |
overwrite |
Boolean; overwrite the table with the given name if it already exists? |
Other Spark serialization routines:
collect_from_rds()
,
spark_insert_table()
,
spark_read()
,
spark_read_avro()
,
spark_read_binary()
,
spark_read_csv()
,
spark_read_delta()
,
spark_read_image()
,
spark_read_jdbc()
,
spark_read_json()
,
spark_read_libsvm()
,
spark_read_orc()
,
spark_read_parquet()
,
spark_read_source()
,
spark_read_table()
,
spark_read_text()
,
spark_save_table()
,
spark_write_avro()
,
spark_write_csv()
,
spark_write_delta()
,
spark_write_jdbc()
,
spark_write_json()
,
spark_write_orc()
,
spark_write_parquet()
,
spark_write_source()
,
spark_write_table()
,
spark_write_text()
View the most recent entries in the Spark log. This can be useful when inspecting output / errors produced by Spark during the invocation of various commands.
spark_log(sc, n = 100, filter = NULL, ...)
spark_log(sc, n = 100, filter = NULL, ...)
sc |
A |
n |
The max number of log entries to retrieve. Use |
filter |
Character string to filter log entries. |
... |
Optional arguments; currently unused. |
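A minimal spark_log() sketch, assuming a local connection; the filter string is illustrative.
## Not run: 
library(sparklyr)
sc <- spark_connect(master = "local")
spark_log(sc, n = 20)            # last 20 log entries
spark_log(sc, filter = "ERROR")  # only entries containing "ERROR"
## End(Not run)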
Run a custom R function on Spark workers to ingest data from one or more files into a Spark DataFrame, assuming all files follow the same schema.
spark_read(sc, paths, reader, columns, packages = TRUE, ...)
spark_read(sc, paths, reader, columns, packages = TRUE, ...)
sc |
A |
paths |
A character vector of one or more file URIs (e.g., c("hdfs://localhost:9000/file.txt", "hdfs://localhost:9000/file2.txt")) |
reader |
A self-contained R function that takes a single file URI as argument and returns the data read from that file as a data frame. |
columns |
a named list of column names and column types of the resulting data frame (e.g., list(column_1 = "integer", column_2 = "character")), or a list of column names only if column types should be inferred from the data (e.g., list("column_1", "column_2")), or NULL if column types should be inferred and the resulting data frame can have arbitrary column names |
packages |
A list of R packages to distribute to Spark workers |
... |
Optional arguments; currently unused. |
Other Spark serialization routines:
collect_from_rds()
,
spark_insert_table()
,
spark_load_table()
,
spark_read_avro()
,
spark_read_binary()
,
spark_read_csv()
,
spark_read_delta()
,
spark_read_image()
,
spark_read_jdbc()
,
spark_read_json()
,
spark_read_libsvm()
,
spark_read_orc()
,
spark_read_parquet()
,
spark_read_source()
,
spark_read_table()
,
spark_read_text()
,
spark_save_table()
,
spark_write_avro()
,
spark_write_csv()
,
spark_write_delta()
,
spark_write_jdbc()
,
spark_write_json()
,
spark_write_orc()
,
spark_write_parquet()
,
spark_write_source()
,
spark_write_table()
,
spark_write_text()
## Not run: library(sparklyr) sc <- spark_connect( master = "yarn", spark_home = "~/spark/spark-2.4.5-bin-hadoop2.7" ) # This is a contrived example to show reader tasks will be distributed across # all Spark worker nodes spark_read( sc, rep("/dev/null", 10), reader = function(path) system("hostname", intern = TRUE), columns = c(hostname = "string") ) %>% sdf_collect() ## End(Not run)
## Not run: library(sparklyr) sc <- spark_connect( master = "yarn", spark_home = "~/spark/spark-2.4.5-bin-hadoop2.7" ) # This is a contrived example to show reader tasks will be distributed across # all Spark worker nodes spark_read( sc, rep("/dev/null", 10), reader = function(path) system("hostname", intern = TRUE), columns = c(hostname = "string") ) %>% sdf_collect() ## End(Not run)
Notice this functionality requires the Spark connection sc
to be instantiated with either
an explicitly specified Spark version (i.e.,
spark_connect(..., version = <version>, packages = c("avro", <other package(s)>), ...)
)
or a specific version of the Spark avro package to use (e.g.,
spark_connect(..., packages = c("org.apache.spark:spark-avro_2.12:3.0.0", <other package(s)>), ...)
).
spark_read_avro( sc, name = NULL, path = name, avro_schema = NULL, ignore_extension = TRUE, repartition = 0, memory = TRUE, overwrite = TRUE )
spark_read_avro( sc, name = NULL, path = name, avro_schema = NULL, ignore_extension = TRUE, repartition = 0, memory = TRUE, overwrite = TRUE )
sc |
A |
name |
The name to assign to the newly generated table. |
path |
The path to the file. Needs to be accessible from the cluster. Supports the ‘"hdfs://"’, ‘"s3a://"’ and ‘"file://"’ protocols. |
avro_schema |
Optional Avro schema in JSON format |
ignore_extension |
If enabled, all files with and without .avro extension
are loaded (default: |
repartition |
The number of partitions used to distribute the generated table. Use 0 (the default) to avoid partitioning. |
memory |
Boolean; should the data be loaded eagerly into memory? (That is, should the table be cached?) |
overwrite |
Boolean; overwrite the table with the given name if it already exists? |
Other Spark serialization routines:
collect_from_rds()
,
spark_insert_table()
,
spark_load_table()
,
spark_read()
,
spark_read_binary()
,
spark_read_csv()
,
spark_read_delta()
,
spark_read_image()
,
spark_read_jdbc()
,
spark_read_json()
,
spark_read_libsvm()
,
spark_read_orc()
,
spark_read_parquet()
,
spark_read_source()
,
spark_read_table()
,
spark_read_text()
,
spark_save_table()
,
spark_write_avro()
,
spark_write_csv()
,
spark_write_delta()
,
spark_write_jdbc()
,
spark_write_json()
,
spark_write_orc()
,
spark_write_parquet()
,
spark_write_source()
,
spark_write_table()
,
spark_write_text()
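To tie the connection requirement for spark_read_avro() to a concrete call, a hedged sketch; the Spark version and file path are placeholders.
## Not run: 
library(sparklyr)
# the "avro" package must be requested when connecting (see the note above)
sc <- spark_connect(master = "local", version = "3.4.0", packages = "avro")
flights <- spark_read_avro(
  sc,
  name = "flights",
  path = "file:///tmp/flights.avro"  # placeholder path
)
## End(Not run)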
Read binary files within a directory and convert each file into a record within the resulting Spark dataframe. The output will be a Spark dataframe with the following columns and possibly partition columns:
path: StringType
modificationTime: TimestampType
length: LongType
content: BinaryType
spark_read_binary( sc, name = NULL, dir = name, path_glob_filter = "*", recursive_file_lookup = FALSE, repartition = 0, memory = TRUE, overwrite = TRUE )
spark_read_binary( sc, name = NULL, dir = name, path_glob_filter = "*", recursive_file_lookup = FALSE, repartition = 0, memory = TRUE, overwrite = TRUE )
sc |
A |
name |
The name to assign to the newly generated table. |
dir |
Directory to read binary files from. |
path_glob_filter |
Glob pattern of binary files to be loaded (e.g., "*.jpg"). |
recursive_file_lookup |
If FALSE (default), then partition discovery will be enabled (i.e., if a partition naming scheme is present, then partitions specified by subdirectory names such as "date=2019-07-01" will be created and files outside subdirectories following a partition naming scheme will be ignored). If TRUE, then all nested directories will be searched even if their names do not follow a partition naming scheme. |
repartition |
The number of partitions used to distribute the generated table. Use 0 (the default) to avoid partitioning. |
memory |
Boolean; should the data be loaded eagerly into memory? (That is, should the table be cached?) |
overwrite |
Boolean; overwrite the table with the given name if it already exists? |
Other Spark serialization routines:
collect_from_rds()
,
spark_insert_table()
,
spark_load_table()
,
spark_read()
,
spark_read_avro()
,
spark_read_csv()
,
spark_read_delta()
,
spark_read_image()
,
spark_read_jdbc()
,
spark_read_json()
,
spark_read_libsvm()
,
spark_read_orc()
,
spark_read_parquet()
,
spark_read_source()
,
spark_read_table()
,
spark_read_text()
,
spark_save_table()
,
spark_write_avro()
,
spark_write_csv()
,
spark_write_delta()
,
spark_write_jdbc()
,
spark_write_json()
,
spark_write_orc()
,
spark_write_parquet()
,
spark_write_source()
,
spark_write_table()
,
spark_write_text()
Read a tabular data file into a Spark DataFrame.
spark_read_csv( sc, name = NULL, path = name, header = TRUE, columns = NULL, infer_schema = is.null(columns), delimiter = ",", quote = "\"", escape = "\\", charset = "UTF-8", null_value = NULL, options = list(), repartition = 0, memory = TRUE, overwrite = TRUE, ... )
spark_read_csv( sc, name = NULL, path = name, header = TRUE, columns = NULL, infer_schema = is.null(columns), delimiter = ",", quote = "\"", escape = "\\", charset = "UTF-8", null_value = NULL, options = list(), repartition = 0, memory = TRUE, overwrite = TRUE, ... )
sc |
A |
name |
The name to assign to the newly generated table. |
path |
The path to the file. Needs to be accessible from the cluster. Supports the ‘"hdfs://"’, ‘"s3a://"’ and ‘"file://"’ protocols. |
header |
Boolean; should the first row of data be used as a header?
Defaults to |
columns |
A vector of column names or a named vector of column types.
If specified, the elements can be |
infer_schema |
Boolean; should column types be automatically inferred?
Requires one extra pass over the data. Defaults to |
delimiter |
The character used to delimit each column. Defaults to ‘','’. |
quote |
The character used as a quote. Defaults to ‘'"'’. |
escape |
The character used to escape other characters. Defaults to ‘'\'’. |
charset |
The character set. Defaults to ‘"UTF-8"’. |
null_value |
The character to use for null, or missing, values. Defaults to |
options |
A list of strings with additional options. |
repartition |
The number of partitions used to distribute the generated table. Use 0 (the default) to avoid partitioning. |
memory |
Boolean; should the data be loaded eagerly into memory? (That is, should the table be cached?) |
overwrite |
Boolean; overwrite the table with the given name if it already exists? |
... |
Optional arguments; currently unused. |
You can read data from HDFS (hdfs://
), S3 (s3a://
),
as well as the local file system (file://
).
When header
is FALSE
, the column names are generated with a
V
prefix; e.g. V1, V2, ...
.
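For example (a minimal spark_read_csv() sketch; the path and column types are illustrative), supplying 'columns' avoids the extra schema-inference pass.
## Not run: 
library(sparklyr)
sc <- spark_connect(master = "local")
people <- spark_read_csv(
  sc,
  name = "people",
  path = "file:///tmp/people.csv",                    # placeholder path
  columns = c(name = "character", age = "integer"),   # explicit types
  infer_schema = FALSE                                # skip schema inference
)
## End(Not run)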
Other Spark serialization routines:
collect_from_rds()
,
spark_insert_table()
,
spark_load_table()
,
spark_read()
,
spark_read_avro()
,
spark_read_binary()
,
spark_read_delta()
,
spark_read_image()
,
spark_read_jdbc()
,
spark_read_json()
,
spark_read_libsvm()
,
spark_read_orc()
,
spark_read_parquet()
,
spark_read_source()
,
spark_read_table()
,
spark_read_text()
,
spark_save_table()
,
spark_write_avro()
,
spark_write_csv()
,
spark_write_delta()
,
spark_write_jdbc()
,
spark_write_json()
,
spark_write_orc()
,
spark_write_parquet()
,
spark_write_source()
,
spark_write_table()
,
spark_write_text()
Read from Delta Lake into a Spark DataFrame.
spark_read_delta( sc, path, name = NULL, version = NULL, timestamp = NULL, options = list(), repartition = 0, memory = TRUE, overwrite = TRUE, ... )
spark_read_delta( sc, path, name = NULL, version = NULL, timestamp = NULL, options = list(), repartition = 0, memory = TRUE, overwrite = TRUE, ... )
sc |
A |
path |
The path to the file. Needs to be accessible from the cluster. Supports the ‘"hdfs://"’, ‘"s3a://"’ and ‘"file://"’ protocols. |
name |
The name to assign to the newly generated table. |
version |
The version of the delta table to read. |
timestamp |
The timestamp of the delta table to read. For example,
|
options |
A list of strings with additional options. |
repartition |
The number of partitions used to distribute the generated table. Use 0 (the default) to avoid partitioning. |
memory |
Boolean; should the data be loaded eagerly into memory? (That is, should the table be cached?) |
overwrite |
Boolean; overwrite the table with the given name if it already exists? |
... |
Optional arguments; currently unused. |
Other Spark serialization routines:
collect_from_rds()
,
spark_insert_table()
,
spark_load_table()
,
spark_read()
,
spark_read_avro()
,
spark_read_binary()
,
spark_read_csv()
,
spark_read_image()
,
spark_read_jdbc()
,
spark_read_json()
,
spark_read_libsvm()
,
spark_read_orc()
,
spark_read_parquet()
,
spark_read_source()
,
spark_read_table()
,
spark_read_text()
,
spark_save_table()
,
spark_write_avro()
,
spark_write_csv()
,
spark_write_delta()
,
spark_write_jdbc()
,
spark_write_json()
,
spark_write_orc()
,
spark_write_parquet()
,
spark_write_source()
,
spark_write_table()
,
spark_write_text()
Read image files within a directory and convert each file into a record within the resulting Spark dataframe. The output will be a Spark dataframe consisting of struct types containing the following attributes:
origin: StringType
height: IntegerType
width: IntegerType
nChannels: IntegerType
mode: IntegerType
data: BinaryType
spark_read_image( sc, name = NULL, dir = name, drop_invalid = TRUE, repartition = 0, memory = TRUE, overwrite = TRUE )
spark_read_image( sc, name = NULL, dir = name, drop_invalid = TRUE, repartition = 0, memory = TRUE, overwrite = TRUE )
sc |
A |
name |
The name to assign to the newly generated table. |
dir |
Directory to read binary files from. |
drop_invalid |
Whether to drop files that are not valid images from the result (default: TRUE). |
repartition |
The number of partitions used to distribute the generated table. Use 0 (the default) to avoid partitioning. |
memory |
Boolean; should the data be loaded eagerly into memory? (That is, should the table be cached?) |
overwrite |
Boolean; overwrite the table with the given name if it already exists? |
Other Spark serialization routines:
collect_from_rds()
,
spark_insert_table()
,
spark_load_table()
,
spark_read()
,
spark_read_avro()
,
spark_read_binary()
,
spark_read_csv()
,
spark_read_delta()
,
spark_read_jdbc()
,
spark_read_json()
,
spark_read_libsvm()
,
spark_read_orc()
,
spark_read_parquet()
,
spark_read_source()
,
spark_read_table()
,
spark_read_text()
,
spark_save_table()
,
spark_write_avro()
,
spark_write_csv()
,
spark_write_delta()
,
spark_write_jdbc()
,
spark_write_json()
,
spark_write_orc()
,
spark_write_parquet()
,
spark_write_source()
,
spark_write_table()
,
spark_write_text()
Read from JDBC connection into a Spark DataFrame.
spark_read_jdbc( sc, name, options = list(), repartition = 0, memory = TRUE, overwrite = TRUE, columns = NULL, ... )
spark_read_jdbc( sc, name, options = list(), repartition = 0, memory = TRUE, overwrite = TRUE, columns = NULL, ... )
sc |
A |
name |
The name to assign to the newly generated table. |
options |
A list of strings with additional options. See https://spark.apache.org/docs/latest/sql-programming-guide.html#configuration. |
repartition |
The number of partitions used to distribute the generated table. Use 0 (the default) to avoid partitioning. |
memory |
Boolean; should the data be loaded eagerly into memory? (That is, should the table be cached?) |
overwrite |
Boolean; overwrite the table with the given name if it already exists? |
columns |
A vector of column names or a named vector of column types.
If specified, the elements can be |
... |
Optional arguments; currently unused. |
Other Spark serialization routines:
collect_from_rds()
,
spark_insert_table()
,
spark_load_table()
,
spark_read()
,
spark_read_avro()
,
spark_read_binary()
,
spark_read_csv()
,
spark_read_delta()
,
spark_read_image()
,
spark_read_json()
,
spark_read_libsvm()
,
spark_read_orc()
,
spark_read_parquet()
,
spark_read_source()
,
spark_read_table()
,
spark_read_text()
,
spark_save_table()
,
spark_write_avro()
,
spark_write_csv()
,
spark_write_delta()
,
spark_write_jdbc()
,
spark_write_json()
,
spark_write_orc()
,
spark_write_parquet()
,
spark_write_source()
,
spark_write_table()
,
spark_write_text()
## Not run: sc <- spark_connect( master = "local", config = list( `sparklyr.shell.driver-class-path` = "/usr/share/java/mysql-connector-java-8.0.25.jar" ) ) spark_read_jdbc( sc, name = "my_sql_table", options = list( url = "jdbc:mysql://localhost:3306/my_sql_schema", driver = "com.mysql.jdbc.Driver", user = "me", password = "******", dbtable = "my_sql_table" ) ) ## End(Not run)
## Not run: sc <- spark_connect( master = "local", config = list( `sparklyr.shell.driver-class-path` = "/usr/share/java/mysql-connector-java-8.0.25.jar" ) ) spark_read_jdbc( sc, name = "my_sql_table", options = list( url = "jdbc:mysql://localhost:3306/my_sql_schema", driver = "com.mysql.jdbc.Driver", user = "me", password = "******", dbtable = "my_sql_table" ) ) ## End(Not run)
Read a table serialized in the JavaScript Object Notation format into a Spark DataFrame.
spark_read_json( sc, name = NULL, path = name, options = list(), repartition = 0, memory = TRUE, overwrite = TRUE, columns = NULL, ... )
spark_read_json( sc, name = NULL, path = name, options = list(), repartition = 0, memory = TRUE, overwrite = TRUE, columns = NULL, ... )
sc |
A |
name |
The name to assign to the newly generated table. |
path |
The path to the file. Needs to be accessible from the cluster. Supports the ‘"hdfs://"’, ‘"s3a://"’ and ‘"file://"’ protocols. |
options |
A list of strings with additional options. |
repartition |
The number of partitions used to distribute the generated table. Use 0 (the default) to avoid partitioning. |
memory |
Boolean; should the data be loaded eagerly into memory? (That is, should the table be cached?) |
overwrite |
Boolean; overwrite the table with the given name if it already exists? |
columns |
A vector of column names or a named vector of column types.
If specified, the elements can be |
... |
Optional arguments; currently unused. |
You can read data from HDFS (hdfs://
), S3 (s3a://
), as well as
the local file system (file://
).
Other Spark serialization routines:
collect_from_rds()
,
spark_insert_table()
,
spark_load_table()
,
spark_read()
,
spark_read_avro()
,
spark_read_binary()
,
spark_read_csv()
,
spark_read_delta()
,
spark_read_image()
,
spark_read_jdbc()
,
spark_read_libsvm()
,
spark_read_orc()
,
spark_read_parquet()
,
spark_read_source()
,
spark_read_table()
,
spark_read_text()
,
spark_save_table()
,
spark_write_avro()
,
spark_write_csv()
,
spark_write_delta()
,
spark_write_jdbc()
,
spark_write_json()
,
spark_write_orc()
,
spark_write_parquet()
,
spark_write_source()
,
spark_write_table()
,
spark_write_text()
Read libsvm file into a Spark DataFrame.
spark_read_libsvm( sc, name = NULL, path = name, repartition = 0, memory = TRUE, overwrite = TRUE, options = list(), ... )
spark_read_libsvm( sc, name = NULL, path = name, repartition = 0, memory = TRUE, overwrite = TRUE, options = list(), ... )
sc |
A |
name |
The name to assign to the newly generated table. |
path |
The path to the file. Needs to be accessible from the cluster. Supports the ‘"hdfs://"’, ‘"s3a://"’ and ‘"file://"’ protocols. |
repartition |
The number of partitions used to distribute the generated table. Use 0 (the default) to avoid partitioning. |
memory |
Boolean; should the data be loaded eagerly into memory? (That is, should the table be cached?) |
overwrite |
Boolean; overwrite the table with the given name if it already exists? |
options |
A list of strings with additional options. |
... |
Optional arguments; currently unused. |
Other Spark serialization routines:
collect_from_rds()
,
spark_insert_table()
,
spark_load_table()
,
spark_read()
,
spark_read_avro()
,
spark_read_binary()
,
spark_read_csv()
,
spark_read_delta()
,
spark_read_image()
,
spark_read_jdbc()
,
spark_read_json()
,
spark_read_orc()
,
spark_read_parquet()
,
spark_read_source()
,
spark_read_table()
,
spark_read_text()
,
spark_save_table()
,
spark_write_avro()
,
spark_write_csv()
,
spark_write_delta()
,
spark_write_jdbc()
,
spark_write_json()
,
spark_write_orc()
,
spark_write_parquet()
,
spark_write_source()
,
spark_write_table()
,
spark_write_text()
Read an ORC file into a Spark DataFrame.
spark_read_orc( sc, name = NULL, path = name, options = list(), repartition = 0, memory = TRUE, overwrite = TRUE, columns = NULL, schema = NULL, ... )
spark_read_orc( sc, name = NULL, path = name, options = list(), repartition = 0, memory = TRUE, overwrite = TRUE, columns = NULL, schema = NULL, ... )
sc |
A |
name |
The name to assign to the newly generated table. |
path |
The path to the file. Needs to be accessible from the cluster. Supports the ‘"hdfs://"’, ‘"s3a://"’ and ‘"file://"’ protocols. |
options |
A list of strings with additional options. See https://spark.apache.org/docs/latest/sql-programming-guide.html#configuration. |
repartition |
The number of partitions used to distribute the generated table. Use 0 (the default) to avoid partitioning. |
memory |
Boolean; should the data be loaded eagerly into memory? (That is, should the table be cached?) |
overwrite |
Boolean; overwrite the table with the given name if it already exists? |
columns |
A vector of column names or a named vector of column types.
If specified, the elements can be |
schema |
A (java) read schema. Useful for optimizing read operation on nested data. |
... |
Optional arguments; currently unused. |
You can read data from HDFS (hdfs://
), S3 (s3a://
), as well as
the local file system (file://
).
Other Spark serialization routines:
collect_from_rds()
,
spark_insert_table()
,
spark_load_table()
,
spark_read()
,
spark_read_avro()
,
spark_read_binary()
,
spark_read_csv()
,
spark_read_delta()
,
spark_read_image()
,
spark_read_jdbc()
,
spark_read_json()
,
spark_read_libsvm()
,
spark_read_parquet()
,
spark_read_source()
,
spark_read_table()
,
spark_read_text()
,
spark_save_table()
,
spark_write_avro()
,
spark_write_csv()
,
spark_write_delta()
,
spark_write_jdbc()
,
spark_write_json()
,
spark_write_orc()
,
spark_write_parquet()
,
spark_write_source()
,
spark_write_table()
,
spark_write_text()
Read a Parquet file into a Spark DataFrame.
spark_read_parquet( sc, name = NULL, path = name, options = list(), repartition = 0, memory = TRUE, overwrite = TRUE, columns = NULL, schema = NULL, ... )
spark_read_parquet( sc, name = NULL, path = name, options = list(), repartition = 0, memory = TRUE, overwrite = TRUE, columns = NULL, schema = NULL, ... )
sc |
A |
name |
The name to assign to the newly generated table. |
path |
The path to the file. Needs to be accessible from the cluster. Supports the ‘"hdfs://"’, ‘"s3a://"’ and ‘"file://"’ protocols. |
options |
A list of strings with additional options. See https://spark.apache.org/docs/latest/sql-programming-guide.html#configuration. |
repartition |
The number of partitions used to distribute the generated table. Use 0 (the default) to avoid partitioning. |
memory |
Boolean; should the data be loaded eagerly into memory? (That is, should the table be cached?) |
overwrite |
Boolean; overwrite the table with the given name if it already exists? |
columns |
A vector of column names or a named vector of column types.
If specified, the elements can be |
schema |
A (java) read schema. Useful for optimizing read operation on nested data. |
... |
Optional arguments; currently unused. |
You can read data from HDFS (hdfs://
), S3 (s3a://
), as well as
the local file system (file://
).
Other Spark serialization routines:
collect_from_rds()
,
spark_insert_table()
,
spark_load_table()
,
spark_read()
,
spark_read_avro()
,
spark_read_binary()
,
spark_read_csv()
,
spark_read_delta()
,
spark_read_image()
,
spark_read_jdbc()
,
spark_read_json()
,
spark_read_libsvm()
,
spark_read_orc()
,
spark_read_source()
,
spark_read_table()
,
spark_read_text()
,
spark_save_table()
,
spark_write_avro()
,
spark_write_csv()
,
spark_write_delta()
,
spark_write_jdbc()
,
spark_write_json()
,
spark_write_orc()
,
spark_write_parquet()
,
spark_write_source()
,
spark_write_table()
,
spark_write_text()
Read from a generic source into a Spark DataFrame.
spark_read_source( sc, name = NULL, path = name, source, options = list(), repartition = 0, memory = TRUE, overwrite = TRUE, columns = NULL, ... )
spark_read_source( sc, name = NULL, path = name, source, options = list(), repartition = 0, memory = TRUE, overwrite = TRUE, columns = NULL, ... )
sc |
A |
name |
The name to assign to the newly generated table. |
path |
The path to the file. Needs to be accessible from the cluster. Supports the ‘"hdfs://"’, ‘"s3a://"’ and ‘"file://"’ protocols. |
source |
A data source capable of reading data. |
options |
A list of strings with additional options. See https://spark.apache.org/docs/latest/sql-programming-guide.html#configuration. |
repartition |
The number of partitions used to distribute the generated table. Use 0 (the default) to avoid partitioning. |
memory |
Boolean; should the data be loaded eagerly into memory? (That is, should the table be cached?) |
overwrite |
Boolean; overwrite the table with the given name if it already exists? |
columns |
A vector of column names or a named vector of column types.
If specified, the elements can be |
... |
Optional arguments; currently unused. |
Other Spark serialization routines:
collect_from_rds()
,
spark_insert_table()
,
spark_load_table()
,
spark_read()
,
spark_read_avro()
,
spark_read_binary()
,
spark_read_csv()
,
spark_read_delta()
,
spark_read_image()
,
spark_read_jdbc()
,
spark_read_json()
,
spark_read_libsvm()
,
spark_read_orc()
,
spark_read_parquet()
,
spark_read_table()
,
spark_read_text()
,
spark_save_table()
,
spark_write_avro()
,
spark_write_csv()
,
spark_write_delta()
,
spark_write_jdbc()
,
spark_write_json()
,
spark_write_orc()
,
spark_write_parquet()
,
spark_write_source()
,
spark_write_table()
,
spark_write_text()
Reads from a Spark Table into a Spark DataFrame.
spark_read_table( sc, name, options = list(), repartition = 0, memory = TRUE, columns = NULL, ... )
spark_read_table( sc, name, options = list(), repartition = 0, memory = TRUE, columns = NULL, ... )
sc |
A |
name |
The name to assign to the newly generated table. |
options |
A list of strings with additional options. See https://spark.apache.org/docs/latest/sql-programming-guide.html#configuration. |
repartition |
The number of partitions used to distribute the generated table. Use 0 (the default) to avoid partitioning. |
memory |
Boolean; should the data be loaded eagerly into memory? (That is, should the table be cached?) |
columns |
A vector of column names or a named vector of column types.
If specified, the elements can be |
... |
Optional arguments; currently unused. |
Other Spark serialization routines:
collect_from_rds()
,
spark_insert_table()
,
spark_load_table()
,
spark_read()
,
spark_read_avro()
,
spark_read_binary()
,
spark_read_csv()
,
spark_read_delta()
,
spark_read_image()
,
spark_read_jdbc()
,
spark_read_json()
,
spark_read_libsvm()
,
spark_read_orc()
,
spark_read_parquet()
,
spark_read_source()
,
spark_read_text()
,
spark_save_table()
,
spark_write_avro()
,
spark_write_csv()
,
spark_write_delta()
,
spark_write_jdbc()
,
spark_write_json()
,
spark_write_orc()
,
spark_write_parquet()
,
spark_write_source()
,
spark_write_table()
,
spark_write_text()
Read a Text file into a Spark DataFrame
spark_read_text( sc, name = NULL, path = name, repartition = 0, memory = TRUE, overwrite = TRUE, options = list(), whole = FALSE, ... )
spark_read_text( sc, name = NULL, path = name, repartition = 0, memory = TRUE, overwrite = TRUE, options = list(), whole = FALSE, ... )
sc |
A |
name |
The name to assign to the newly generated table. |
path |
The path to the file. Needs to be accessible from the cluster. Supports the ‘"hdfs://"’, ‘"s3a://"’ and ‘"file://"’ protocols. |
repartition |
The number of partitions used to distribute the generated table. Use 0 (the default) to avoid partitioning. |
memory |
Boolean; should the data be loaded eagerly into memory? (That is, should the table be cached?) |
overwrite |
Boolean; overwrite the table with the given name if it already exists? |
options |
A list of strings with additional options. |
whole |
Read the entire text file as a single entry? Defaults to |
... |
Optional arguments; currently unused. |
You can read data from HDFS (hdfs://
), S3 (s3a://
), as well as
the local file system (file://
).
Other Spark serialization routines:
collect_from_rds()
,
spark_insert_table()
,
spark_load_table()
,
spark_read()
,
spark_read_avro()
,
spark_read_binary()
,
spark_read_csv()
,
spark_read_delta()
,
spark_read_image()
,
spark_read_jdbc()
,
spark_read_json()
,
spark_read_libsvm()
,
spark_read_orc()
,
spark_read_parquet()
,
spark_read_source()
,
spark_read_table()
,
spark_save_table()
,
spark_write_avro()
,
spark_write_csv()
,
spark_write_delta()
,
spark_write_jdbc()
,
spark_write_json()
,
spark_write_orc()
,
spark_write_parquet()
,
spark_write_source()
,
spark_write_table()
,
spark_write_text()
Saves a Spark DataFrame as a Spark table.
spark_save_table(x, path, mode = NULL, options = list())
spark_save_table(x, path, mode = NULL, options = list())
x |
A Spark DataFrame or dplyr operation |
path |
The path to the file. Needs to be accessible from the cluster. Supports the ‘"hdfs://"’, ‘"s3a://"’ and ‘"file://"’ protocols. |
mode |
A character element specifying the behavior when data or a table already exists. For more details see also https://spark.apache.org/docs/latest/sql-programming-guide.html#save-modes for your version of Spark. |
options |
A list of strings with additional options. |
Other Spark serialization routines:
collect_from_rds()
,
spark_insert_table()
,
spark_load_table()
,
spark_read()
,
spark_read_avro()
,
spark_read_binary()
,
spark_read_csv()
,
spark_read_delta()
,
spark_read_image()
,
spark_read_jdbc()
,
spark_read_json()
,
spark_read_libsvm()
,
spark_read_orc()
,
spark_read_parquet()
,
spark_read_source()
,
spark_read_table()
,
spark_read_text()
,
spark_write_avro()
,
spark_write_csv()
,
spark_write_delta()
,
spark_write_jdbc()
,
spark_write_json()
,
spark_write_orc()
,
spark_write_parquet()
,
spark_write_source()
,
spark_write_table()
,
spark_write_text()
Retrieves or sets runtime configuration entries for the Spark Session
spark_session_config(sc, config = TRUE, value = NULL)
spark_session_config(sc, config = TRUE, value = NULL)
sc |
A |
config |
The configuration entry name(s) (e.g., |
value |
The configuration value to be set. Defaults to |
Other Spark runtime configuration:
spark_adaptive_query_execution()
,
spark_advisory_shuffle_partition_size()
,
spark_auto_broadcast_join_threshold()
,
spark_coalesce_initial_num_partitions()
,
spark_coalesce_min_num_partitions()
,
spark_coalesce_shuffle_partitions()
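A hedged spark_session_config() sketch, assuming a local connection and the name/value calling form shown; the configuration entry and value are illustrative.
## Not run: 
library(sparklyr)
sc <- spark_connect(master = "local")
spark_session_config(sc)                                      # list all entries
spark_session_config(sc, "spark.sql.shuffle.partitions")      # read one entry
spark_session_config(sc, "spark.sql.shuffle.partitions", 8L)  # set it
## End(Not run)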
Generator methods for creating single-column Spark dataframes comprised of i.i.d. samples from some distribution.
sc |
A Spark connection. |
n |
Sample Size (default: 1000). |
num_partitions |
Number of partitions in the resulting Spark dataframe (default: default parallelism of the Spark cluster). |
seed |
Random seed (default: a random long integer). |
output_col |
Name of the output column containing sample values (default: "x"). |
Attempts to generate a table name from an expression; otherwise, assigns an auto-generated generic name with "sparklyr_" prefix.
spark_table_name(expr)
spark_table_name(expr)
expr |
The expression to attempt to use as name |
Retrieve the version of Spark associated with a Spark connection.
spark_version(sc)
spark_version(sc)
sc |
A |
Suffixes for e.g. preview versions, or snapshotted versions,
are trimmed – if you require the full Spark version, you can
retrieve it with invoke(spark_context(sc), "version")
.
The Spark version as a numeric_version
.
Retrieve the version of Spark associated with a Spark installation.
spark_version_from_home(spark_home, default = NULL)
spark_version_from_home(spark_home, default = NULL)
spark_home |
The path to a Spark installation. |
default |
The default version to be inferred, in case
version lookup failed, e.g. no Spark installation was found
at |
Open the Spark web interface
spark_web(sc, ...)
spark_web(sc, ...)
sc |
A |
... |
Optional arguments; currently unused. |
Run a custom R function on Spark workers to write a Spark DataFrame into file(s). If Spark's speculative execution feature is enabled (i.e., 'spark.speculation' is true), then each write task may be executed more than once, and the user-defined writer function will need to ensure no concurrent writes happen to the same file path (e.g., by appending a UUID to each file name).
spark_write(x, writer, paths, packages = NULL)
spark_write(x, writer, paths, packages = NULL)
x |
A Spark Dataframe to be saved into file(s) |
writer |
A writer function with the signature function(partition, path)
where |
paths |
A single destination path or a list of destination paths, each one
specifying a location for a partition from |
packages |
Boolean to distribute |
## Not run:
library(sparklyr)
sc <- spark_connect(master = "local[3]")

# copy some test data into a Spark Dataframe
sdf <- sdf_copy_to(sc, iris, overwrite = TRUE)

# create a writer function
writer <- function(df, path) {
  write.csv(df, path)
}

spark_write(
  sdf,
  writer,
  # re-partition sdf into 3 partitions and write them to 3 separate files
  paths = list("file:///tmp/file1", "file:///tmp/file2", "file:///tmp/file3"),
)

spark_write(
  sdf,
  writer,
  # save all rows into a single file
  paths = list("file:///tmp/all_rows")
)
## End(Not run)
Note that this functionality requires the Spark connection sc
to be instantiated with either
an explicitly specified Spark version (i.e.,
spark_connect(..., version = <version>, packages = c("avro", <other package(s)>), ...)
)
or with a specific version of the Spark avro package (e.g.,
spark_connect(..., packages =
c("org.apache.spark:spark-avro_2.12:3.0.0", <other package(s)>), ...)
).
spark_write_avro( x, path, avro_schema = NULL, record_name = "topLevelRecord", record_namespace = "", compression = "snappy", partition_by = NULL )
spark_write_avro( x, path, avro_schema = NULL, record_name = "topLevelRecord", record_namespace = "", compression = "snappy", partition_by = NULL )
x |
A Spark DataFrame or dplyr operation |
path |
The path to the file. Needs to be accessible from the cluster. Supports the ‘"hdfs://"’, ‘"s3a://"’ and ‘"file://"’ protocols. |
avro_schema |
Optional Avro schema in JSON format |
record_name |
Optional top level record name in write result (default: "topLevelRecord") |
record_namespace |
Record namespace in write result (default: "") |
compression |
Compression codec to use (default: "snappy") |
partition_by |
A |
Other Spark serialization routines:
collect_from_rds()
,
spark_insert_table()
,
spark_load_table()
,
spark_read()
,
spark_read_avro()
,
spark_read_binary()
,
spark_read_csv()
,
spark_read_delta()
,
spark_read_image()
,
spark_read_jdbc()
,
spark_read_json()
,
spark_read_libsvm()
,
spark_read_orc()
,
spark_read_parquet()
,
spark_read_source()
,
spark_read_table()
,
spark_read_text()
,
spark_save_table()
,
spark_write_csv()
,
spark_write_delta()
,
spark_write_jdbc()
,
spark_write_json()
,
spark_write_orc()
,
spark_write_parquet()
,
spark_write_source()
,
spark_write_table()
,
spark_write_text()
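A minimal sketch, assuming a local connection created with the avro package as described above; the Spark version and output path are illustrative.

## Not run:
sc <- spark_connect(master = "local", version = "3.0.0", packages = "avro")
iris_tbl <- sdf_copy_to(sc, iris, overwrite = TRUE)
# write the dataframe as Avro files under an illustrative path
spark_write_avro(iris_tbl, path = "file:///tmp/iris_avro")
## End(Not run)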
Write a Spark DataFrame to a tabular (typically, comma-separated) file.
spark_write_csv( x, path, header = TRUE, delimiter = ",", quote = "\"", escape = "\\", charset = "UTF-8", null_value = NULL, options = list(), mode = NULL, partition_by = NULL, ... )
spark_write_csv( x, path, header = TRUE, delimiter = ",", quote = "\"", escape = "\\", charset = "UTF-8", null_value = NULL, options = list(), mode = NULL, partition_by = NULL, ... )
x |
A Spark DataFrame or dplyr operation |
path |
The path to the file. Needs to be accessible from the cluster. Supports the ‘"hdfs://"’, ‘"s3a://"’ and ‘"file://"’ protocols. |
header |
Should the first row of data be used as a header? Defaults to |
delimiter |
The character used to delimit each column, defaults to |
quote |
The character used as a quote. Defaults to ‘'"'’. |
escape |
The character used to escape other characters, defaults to |
charset |
The character set, defaults to |
null_value |
The character to use for default values, defaults to |
options |
A list of strings with additional options. |
mode |
A For more details see also https://spark.apache.org/docs/latest/sql-programming-guide.html#save-modes for your version of Spark. |
partition_by |
A |
... |
Optional arguments; currently unused. |
Other Spark serialization routines:
collect_from_rds()
,
spark_insert_table()
,
spark_load_table()
,
spark_read()
,
spark_read_avro()
,
spark_read_binary()
,
spark_read_csv()
,
spark_read_delta()
,
spark_read_image()
,
spark_read_jdbc()
,
spark_read_json()
,
spark_read_libsvm()
,
spark_read_orc()
,
spark_read_parquet()
,
spark_read_source()
,
spark_read_table()
,
spark_read_text()
,
spark_save_table()
,
spark_write_avro()
,
spark_write_delta()
,
spark_write_jdbc()
,
spark_write_json()
,
spark_write_orc()
,
spark_write_parquet()
,
spark_write_source()
,
spark_write_table()
,
spark_write_text()
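A minimal sketch; the output directory is illustrative, and Spark writes one CSV part file per partition of the dataframe.

## Not run:
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, overwrite = TRUE)
# write the dataframe to a directory of CSV part files
spark_write_csv(iris_tbl, path = "file:///tmp/iris_csv", mode = "overwrite")
## End(Not run)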
Writes a Spark DataFrame into Delta Lake.
spark_write_delta( x, path, mode = NULL, options = list(), partition_by = NULL, ... )
spark_write_delta( x, path, mode = NULL, options = list(), partition_by = NULL, ... )
x |
A Spark DataFrame or dplyr operation |
path |
The path to the file. Needs to be accessible from the cluster. Supports the ‘"hdfs://"’, ‘"s3a://"’ and ‘"file://"’ protocols. |
mode |
A For more details see also https://spark.apache.org/docs/latest/sql-programming-guide.html#save-modes for your version of Spark. |
options |
A list of strings with additional options. |
partition_by |
A |
... |
Optional arguments; currently unused. |
Other Spark serialization routines:
collect_from_rds()
,
spark_insert_table()
,
spark_load_table()
,
spark_read()
,
spark_read_avro()
,
spark_read_binary()
,
spark_read_csv()
,
spark_read_delta()
,
spark_read_image()
,
spark_read_jdbc()
,
spark_read_json()
,
spark_read_libsvm()
,
spark_read_orc()
,
spark_read_parquet()
,
spark_read_source()
,
spark_read_table()
,
spark_read_text()
,
spark_save_table()
,
spark_write_avro()
,
spark_write_csv()
,
spark_write_jdbc()
,
spark_write_json()
,
spark_write_orc()
,
spark_write_parquet()
,
spark_write_source()
,
spark_write_table()
,
spark_write_text()
Writes a Spark DataFrame into a JDBC table
spark_write_jdbc( x, name, mode = NULL, options = list(), partition_by = NULL, ... )
spark_write_jdbc( x, name, mode = NULL, options = list(), partition_by = NULL, ... )
x |
A Spark DataFrame or dplyr operation |
name |
The name to assign to the newly generated table. |
mode |
A For more details see also https://spark.apache.org/docs/latest/sql-programming-guide.html#save-modes for your version of Spark. |
options |
A list of strings with additional options. |
partition_by |
A |
... |
Optional arguments; currently unused. |
Other Spark serialization routines:
collect_from_rds()
,
spark_insert_table()
,
spark_load_table()
,
spark_read()
,
spark_read_avro()
,
spark_read_binary()
,
spark_read_csv()
,
spark_read_delta()
,
spark_read_image()
,
spark_read_jdbc()
,
spark_read_json()
,
spark_read_libsvm()
,
spark_read_orc()
,
spark_read_parquet()
,
spark_read_source()
,
spark_read_table()
,
spark_read_text()
,
spark_save_table()
,
spark_write_avro()
,
spark_write_csv()
,
spark_write_delta()
,
spark_write_json()
,
spark_write_orc()
,
spark_write_parquet()
,
spark_write_source()
,
spark_write_table()
,
spark_write_text()
## Not run:
sc <- spark_connect(
  master = "local",
  config = list(
    `sparklyr.shell.driver-class-path` = "/usr/share/java/mysql-connector-java-8.0.25.jar"
  )
)

spark_write_jdbc(
  sdf_len(sc, 10),
  name = "my_sql_table",
  options = list(
    url = "jdbc:mysql://localhost:3306/my_sql_schema",
    driver = "com.mysql.jdbc.Driver",
    user = "me",
    password = "******",
    dbtable = "my_sql_table"
  )
)
## End(Not run)
Serialize a Spark DataFrame to the JavaScript Object Notation format.
spark_write_json( x, path, mode = NULL, options = list(), partition_by = NULL, ... )
spark_write_json( x, path, mode = NULL, options = list(), partition_by = NULL, ... )
x |
A Spark DataFrame or dplyr operation |
path |
The path to the file. Needs to be accessible from the cluster. Supports the ‘"hdfs://"’, ‘"s3a://"’ and ‘"file://"’ protocols. |
mode |
A For more details see also https://spark.apache.org/docs/latest/sql-programming-guide.html#save-modes for your version of Spark. |
options |
A list of strings with additional options. |
partition_by |
A |
... |
Optional arguments; currently unused. |
Other Spark serialization routines:
collect_from_rds()
,
spark_insert_table()
,
spark_load_table()
,
spark_read()
,
spark_read_avro()
,
spark_read_binary()
,
spark_read_csv()
,
spark_read_delta()
,
spark_read_image()
,
spark_read_jdbc()
,
spark_read_json()
,
spark_read_libsvm()
,
spark_read_orc()
,
spark_read_parquet()
,
spark_read_source()
,
spark_read_table()
,
spark_read_text()
,
spark_save_table()
,
spark_write_avro()
,
spark_write_csv()
,
spark_write_delta()
,
spark_write_jdbc()
,
spark_write_orc()
,
spark_write_parquet()
,
spark_write_source()
,
spark_write_table()
,
spark_write_text()
Serialize a Spark DataFrame to the ORC format.
spark_write_orc( x, path, mode = NULL, options = list(), partition_by = NULL, ... )
spark_write_orc( x, path, mode = NULL, options = list(), partition_by = NULL, ... )
x |
A Spark DataFrame or dplyr operation |
path |
The path to the file. Needs to be accessible from the cluster. Supports the ‘"hdfs://"’, ‘"s3a://"’ and ‘"file://"’ protocols. |
mode |
A For more details see also https://spark.apache.org/docs/latest/sql-programming-guide.html#save-modes for your version of Spark. |
options |
A list of strings with additional options. See https://spark.apache.org/docs/latest/sql-programming-guide.html#configuration. |
partition_by |
A |
... |
Optional arguments; currently unused. |
Other Spark serialization routines:
collect_from_rds()
,
spark_insert_table()
,
spark_load_table()
,
spark_read()
,
spark_read_avro()
,
spark_read_binary()
,
spark_read_csv()
,
spark_read_delta()
,
spark_read_image()
,
spark_read_jdbc()
,
spark_read_json()
,
spark_read_libsvm()
,
spark_read_orc()
,
spark_read_parquet()
,
spark_read_source()
,
spark_read_table()
,
spark_read_text()
,
spark_save_table()
,
spark_write_avro()
,
spark_write_csv()
,
spark_write_delta()
,
spark_write_jdbc()
,
spark_write_json()
,
spark_write_parquet()
,
spark_write_source()
,
spark_write_table()
,
spark_write_text()
Serialize a Spark DataFrame to the Parquet format.
spark_write_parquet( x, path, mode = NULL, options = list(), partition_by = NULL, ... )
spark_write_parquet( x, path, mode = NULL, options = list(), partition_by = NULL, ... )
x |
A Spark DataFrame or dplyr operation |
path |
The path to the file. Needs to be accessible from the cluster. Supports the ‘"hdfs://"’, ‘"s3a://"’ and ‘"file://"’ protocols. |
mode |
A For more details see also https://spark.apache.org/docs/latest/sql-programming-guide.html#save-modes for your version of Spark. |
options |
A list of strings with additional options. See https://spark.apache.org/docs/latest/sql-programming-guide.html#configuration. |
partition_by |
A |
... |
Optional arguments; currently unused. |
Other Spark serialization routines:
collect_from_rds()
,
spark_insert_table()
,
spark_load_table()
,
spark_read()
,
spark_read_avro()
,
spark_read_binary()
,
spark_read_csv()
,
spark_read_delta()
,
spark_read_image()
,
spark_read_jdbc()
,
spark_read_json()
,
spark_read_libsvm()
,
spark_read_orc()
,
spark_read_parquet()
,
spark_read_source()
,
spark_read_table()
,
spark_read_text()
,
spark_save_table()
,
spark_write_avro()
,
spark_write_csv()
,
spark_write_delta()
,
spark_write_jdbc()
,
spark_write_json()
,
spark_write_orc()
,
spark_write_source()
,
spark_write_table()
,
spark_write_text()
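A minimal sketch mirroring the other writers; the output path and partitioning column are illustrative.

## Not run:
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, overwrite = TRUE)
# write the dataframe as Parquet, partitioned by species
spark_write_parquet(
  iris_tbl,
  path = "file:///tmp/iris_parquet",
  mode = "overwrite",
  partition_by = "Species"
)
## End(Not run)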
Writes a Spark DataFrame to RDS files. Each partition of the dataframe is exported to a separate RDS file so that all partitions can be processed in parallel.
spark_write_rds(x, dest_uri)
spark_write_rds(x, dest_uri)
x |
A Spark DataFrame to be exported |
dest_uri |
Can be a URI template containing 'partitionId' (e.g.,
|
A tibble containing partition ID and RDS file location for each partition of the input Spark dataframe.
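A sketch, assuming 'partitionId' in the template is substituted with each partition's ID; the destination paths are illustrative, and one exported file is then read back with collect_from_rds().

## Not run:
sc <- spark_connect(master = "local")
sdf <- sdf_len(sc, 1000, repartition = 4)
# export each of the 4 partitions to its own RDS file
spark_write_rds(sdf, dest_uri = "file:///tmp/sdf_part_{partitionId}.rds")
# read one exported partition back into an R dataframe
collect_from_rds("/tmp/sdf_part_0.rds")
## End(Not run)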
Writes a Spark DataFrame into a generic source.
spark_write_source( x, source, mode = NULL, options = list(), partition_by = NULL, ... )
spark_write_source( x, source, mode = NULL, options = list(), partition_by = NULL, ... )
x |
A Spark DataFrame or dplyr operation |
source |
A data source capable of writing data. |
mode |
A For more details see also https://spark.apache.org/docs/latest/sql-programming-guide.html#save-modes for your version of Spark. |
options |
A list of strings with additional options. |
partition_by |
A |
... |
Optional arguments; currently unused. |
Other Spark serialization routines:
collect_from_rds()
,
spark_insert_table()
,
spark_load_table()
,
spark_read()
,
spark_read_avro()
,
spark_read_binary()
,
spark_read_csv()
,
spark_read_delta()
,
spark_read_image()
,
spark_read_jdbc()
,
spark_read_json()
,
spark_read_libsvm()
,
spark_read_orc()
,
spark_read_parquet()
,
spark_read_source()
,
spark_read_table()
,
spark_read_text()
,
spark_save_table()
,
spark_write_avro()
,
spark_write_csv()
,
spark_write_delta()
,
spark_write_jdbc()
,
spark_write_json()
,
spark_write_orc()
,
spark_write_parquet()
,
spark_write_table()
,
spark_write_text()
Writes a Spark DataFrame into a Spark table
spark_write_table( x, name, mode = NULL, options = list(), partition_by = NULL, ... )
spark_write_table( x, name, mode = NULL, options = list(), partition_by = NULL, ... )
x |
A Spark DataFrame or dplyr operation |
name |
The name to assign to the newly generated table. |
mode |
A For more details see also https://spark.apache.org/docs/latest/sql-programming-guide.html#save-modes for your version of Spark. |
options |
A list of strings with additional options. |
partition_by |
A |
... |
Optional arguments; currently unused. |
Other Spark serialization routines:
collect_from_rds()
,
spark_insert_table()
,
spark_load_table()
,
spark_read()
,
spark_read_avro()
,
spark_read_binary()
,
spark_read_csv()
,
spark_read_delta()
,
spark_read_image()
,
spark_read_jdbc()
,
spark_read_json()
,
spark_read_libsvm()
,
spark_read_orc()
,
spark_read_parquet()
,
spark_read_source()
,
spark_read_table()
,
spark_read_text()
,
spark_save_table()
,
spark_write_avro()
,
spark_write_csv()
,
spark_write_delta()
,
spark_write_jdbc()
,
spark_write_json()
,
spark_write_orc()
,
spark_write_parquet()
,
spark_write_source()
,
spark_write_text()
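A minimal sketch; the table name and save mode below are illustrative.

## Not run:
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, overwrite = TRUE)
# persist the dataframe as a named Spark table
spark_write_table(iris_tbl, name = "iris_table", mode = "overwrite")
## End(Not run)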
Serialize a Spark DataFrame to the plain text format.
spark_write_text( x, path, mode = NULL, options = list(), partition_by = NULL, ... )
spark_write_text( x, path, mode = NULL, options = list(), partition_by = NULL, ... )
x |
A Spark DataFrame or dplyr operation |
path |
The path to the file. Needs to be accessible from the cluster. Supports the ‘"hdfs://"’, ‘"s3a://"’ and ‘"file://"’ protocols. |
mode |
A For more details see also https://spark.apache.org/docs/latest/sql-programming-guide.html#save-modes for your version of Spark. |
options |
A list of strings with additional options. |
partition_by |
A |
... |
Optional arguments; currently unused. |
Other Spark serialization routines:
collect_from_rds()
,
spark_insert_table()
,
spark_load_table()
,
spark_read()
,
spark_read_avro()
,
spark_read_binary()
,
spark_read_csv()
,
spark_read_delta()
,
spark_read_image()
,
spark_read_jdbc()
,
spark_read_json()
,
spark_read_libsvm()
,
spark_read_orc()
,
spark_read_parquet()
,
spark_read_source()
,
spark_read_table()
,
spark_read_text()
,
spark_save_table()
,
spark_write_avro()
,
spark_write_csv()
,
spark_write_delta()
,
spark_write_jdbc()
,
spark_write_json()
,
spark_write_orc()
,
spark_write_parquet()
,
spark_write_source()
,
spark_write_table()
Access the commonly-used Spark objects associated with a Spark instance. These objects provide access to different facets of the Spark API.
spark_context(sc) java_context(sc) hive_context(sc) spark_session(sc)
spark_context(sc) java_context(sc) hive_context(sc) spark_session(sc)
sc |
A |
The Scala API documentation
is useful for discovering what methods are available for each of these
objects. Use invoke
to call methods on these objects.
The main entry point for Spark functionality. The Spark Context
represents the connection to a Spark cluster, and can be used to create
RDD
s, accumulators and broadcast variables on that cluster.
A Java-friendly version of the aforementioned Spark Context.
An instance of the Spark SQL execution engine that integrates with data
stored in Hive. Configuration for Hive is read from hive-site.xml
on
the classpath.
Starting with Spark 2.0.0, the Hive Context class has been
deprecated; it is superseded by the Spark Session class, and
hive_context
will return a Spark Session object instead.
Note that both classes share a SQL interface, and therefore one can invoke
SQL through these objects.
Available since Spark 2.0.0, the Spark Session unifies the Spark Context and Hive Context classes into a single interface. Its use is recommended over the older APIs for code targeting Spark 2.0.0 and above.
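As a brief illustration of calling methods on these objects with invoke; the methods shown ("version", "appName") are standard SparkContext methods.

## Not run:
sc <- spark_connect(master = "local")
# call methods on the underlying JVM objects
invoke(spark_context(sc), "version")
invoke(spark_context(sc), "appName")
## End(Not run)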
These routines allow you to manage your connections to Spark.
Call 'spark_disconnect()' on each open Spark connection.
spark_connect(
  master,
  spark_home = Sys.getenv("SPARK_HOME"),
  method = c("shell", "livy", "databricks", "test", "qubole", "synapse"),
  app_name = "sparklyr",
  version = NULL,
  config = spark_config(),
  extensions = sparklyr::registered_extensions(),
  packages = NULL,
  scala_version = NULL,
  ...
)

spark_connection_is_open(sc)

spark_disconnect(sc, ...)

spark_disconnect_all(...)

spark_submit(
  master,
  file,
  spark_home = Sys.getenv("SPARK_HOME"),
  app_name = "sparklyr",
  version = NULL,
  config = spark_config(),
  extensions = sparklyr::registered_extensions(),
  scala_version = NULL,
  ...
)
master |
Spark cluster url to connect to. Use |
spark_home |
The path to a Spark installation. Defaults to the path
provided by the |
method |
The method used to connect to Spark. Default connection method
is |
app_name |
The application name to be used while running in the Spark cluster. |
version |
The version of Spark to use. Required for |
config |
Custom configuration for the generated Spark connection. See
|
extensions |
Extension R packages to enable for this connection. By
default, all packages enabled through the use of
|
packages |
A list of Spark packages to load. For example, |
scala_version |
Load the sparklyr jar file that is built with the version of Scala specified (this currently only makes sense for Spark 2.4, where sparklyr will by default assume Spark 2.4 on current host is built with Scala 2.11, and therefore ‘scala_version = ’2.12'' is needed if sparklyr is connecting to Spark 2.4 built with Scala 2.12) |
... |
Additional params to be passed to each 'spark_disconnect()' call (e.g., 'terminate = TRUE') |
sc |
A |
file |
Path to R source file to submit for batch execution. |
By default, when using method = "livy"
, jars are downloaded from GitHub. An
alternative path to the sparklyr JAR (local to the Livy server, or on HDFS or
HTTP(S)) can also be specified through the sparklyr.livy.jar
setting.
conf <- spark_config()
conf$`sparklyr.shell.conf` <- c(
  "spark.executor.extraJavaOptions=-Duser.timezone='UTC'",
  "spark.driver.extraJavaOptions=-Duser.timezone='UTC'",
  "spark.sql.session.timeZone='UTC'"
)

sc <- spark_connect(
  master = "spark://HOST:PORT",
  config = conf
)

connection_is_open(sc)

spark_disconnect(sc)
Retrieve the port number of the 'sparklyr' backend associated with a Spark connection.
sparklyr_get_backend_port(sc)
sparklyr_get_backend_port(sc)
sc |
A |
The port number of the 'sparklyr' backend associated with sc
.
Show the list of databases
src_databases(sc, col = "databaseName", ...)
src_databases(sc, col = "databaseName", ...)
sc |
A |
col |
The column name of the table that lists all databases
may be referred to as |
... |
Optional arguments; currently unused. |
Finds and returns a stream based on the stream's identifier.
stream_find(sc, id)
stream_find(sc, id)
sc |
The associated Spark connection. |
id |
The stream identifier to find. |
## Not run:
sc <- spark_connect(master = "local")

sdf_len(sc, 10) %>% spark_write_parquet(path = "parquet-in")

stream <- stream_read_parquet(sc, "parquet-in") %>%
  stream_write_parquet("parquet-out")

stream_id <- stream_id(stream)
stream_find(sc, stream_id)
## End(Not run)
Generates a local test stream, useful when testing streams locally.
stream_generate_test( df = rep(1:1000), path = "source", distribution = floor(10 + 1e+05 * stats::dbinom(1:20, 20, 0.5)), iterations = 50, interval = 1 )
stream_generate_test( df = rep(1:1000), path = "source", distribution = floor(10 + 1e+05 * stats::dbinom(1:20, 20, 0.5)), iterations = 50, interval = 1 )
df |
The data frame used as a source of rows for the stream; it will be cast to a data frame if needed. Defaults to a sequence of one thousand entries. |
path |
Path to save stream of files to, defaults to |
distribution |
The distribution of rows to use over each iteration, defaults to a binomial distribution. The stream will cycle through the distribution if needed. |
iterations |
Number of iterations to execute before stopping, defaults to fifty. |
interval |
The interval in seconds used to write the stream, defaults to one second. |
This function requires the callr
package to be installed.
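A small sketch of pairing the generated test stream with a file-based stream reader; the paths, data frame, and iteration count are illustrative.

## Not run:
library(sparklyr)
sc <- spark_connect(master = "local")

# generate a short test stream of files under "test-in"
stream_generate_test(df = data.frame(x = 1:100), path = "test-in", iterations = 5)

stream_read_csv(sc, "test-in") %>%
  stream_write_csv("test-out") %>%
  stream_stop()
## End(Not run)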
Retrieves the identifier of the Spark stream.
stream_id(stream)
stream_id(stream)
stream |
The spark stream object. |
Given a streaming Spark dataframe as input, this function returns another streaming dataframe that contains all columns in the input plus column(s) that are shifted behind by the offset(s) specified in 'cols' (see example)
stream_lag(x, cols, thresholds = NULL)
stream_lag(x, cols, thresholds = NULL)
x |
An object coercible to a Spark Streaming DataFrame. |
cols |
A list of expressions for a single or multiple variables to create that will contain the value of a previous entry. |
thresholds |
Optional named list of timestamp column(s) and corresponding time duration(s) for determining whether a previous record is sufficiently recent relative to the current record. If any of the time difference(s) between the current and a previous record is greater than the maximal duration allowed, then the previous record is discarded and will not be part of the query result. The durations can be specified with numeric types (which will be interpreted as the maximum difference allowed, in milliseconds, between 2 UNIX timestamps) or with time duration strings such as "5s", "5sec", "5min", "5hour", etc. Any timestamp column in 'x' that is not of timestamp or date Spark SQL types will be interpreted as number of milliseconds since the UNIX epoch. |
## Not run:
library(sparklyr)

sc <- spark_connect(master = "local", version = "2.2.0")

streaming_path <- tempfile("days_df_")
days_df <- dplyr::tibble(
  today = weekdays(as.Date(seq(7), origin = "1970-01-01"))
)
num_iters <- 7
stream_generate_test(
  df = days_df,
  path = streaming_path,
  distribution = rep(nrow(days_df), num_iters),
  iterations = num_iters
)

stream_read_csv(sc, streaming_path) %>%
  stream_lag(cols = c(yesterday = today ~ 1, two_days_ago = today ~ 2)) %>%
  collect() %>%
  print(n = 10L)
## End(Not run)
Retrieves the name of the Spark stream if available.
stream_name(stream)
stream_name(stream)
stream |
The spark stream object. |
Read files created by the stream
stream_read_csv(
  sc, path, name = NULL, header = TRUE, columns = NULL,
  delimiter = ",", quote = "\"", escape = "\\", charset = "UTF-8",
  null_value = NULL, options = list(), ...
)

stream_read_text(sc, path, name = NULL, options = list(), ...)

stream_read_json(sc, path, name = NULL, columns = NULL, options = list(), ...)

stream_read_parquet(sc, path, name = NULL, columns = NULL, options = list(), ...)

stream_read_orc(sc, path, name = NULL, columns = NULL, options = list(), ...)

stream_read_kafka(sc, name = NULL, options = list(), ...)

stream_read_socket(sc, name = NULL, columns = NULL, options = list(), ...)

stream_read_delta(sc, path, name = NULL, options = list(), ...)

stream_read_cloudfiles(sc, path, name = NULL, options = list(), ...)

stream_read_table(sc, path, name = NULL, options = list(), ...)
sc |
A |
path |
The path to the file. Needs to be accessible from the cluster. Supports the ‘"hdfs://"’, ‘"s3a://"’ and ‘"file://"’ protocols. |
name |
The name to assign to the newly generated stream. |
header |
Boolean; should the first row of data be used as a header?
Defaults to |
columns |
A vector of column names or a named vector of column types.
If specified, the elements can be |
delimiter |
The character used to delimit each column. Defaults to ‘','’. |
quote |
The character used as a quote. Defaults to ‘'"'’. |
escape |
The character used to escape other characters. Defaults to ‘'\'’. |
charset |
The character set. Defaults to ‘"UTF-8"’. |
null_value |
The character to use for null, or missing, values. Defaults to |
options |
A list of strings with additional options. |
... |
Optional arguments; currently unused. |
## Not run:
sc <- spark_connect(master = "local")

dir.create("csv-in")
write.csv(iris, "csv-in/data.csv", row.names = FALSE)

csv_path <- file.path("file://", getwd(), "csv-in")

stream <- stream_read_csv(sc, csv_path) %>% stream_write_csv("csv-out")

stream_stop(stream)
## End(Not run)
Collects streaming statistics to render the stream as an 'htmlwidget'.
stream_render(stream = NULL, collect = 10, stats = NULL, ...)
stream_render(stream = NULL, collect = 10, stats = NULL, ...)
stream |
The stream to render |
collect |
The interval in seconds to collect data before rendering the 'htmlwidget'. |
stats |
Optional stream statistics collected using |
... |
Additional optional arguments. |
## Not run:
library(sparklyr)
sc <- spark_connect(master = "local")

dir.create("iris-in")
write.csv(iris, "iris-in/iris.csv", row.names = FALSE)

stream <- stream_read_csv(sc, "iris-in/") %>%
  stream_write_csv("iris-out/")

stream_render(stream)
stream_stop(stream)
## End(Not run)
Collects streaming statistics, usually, to be used with stream_render()
to render streaming statistics.
stream_stats(stream, stats = list())
stream_stats(stream, stats = list())
stream |
The stream to collect statistics from. |
stats |
An optional stats object generated using |
A stats object containing streaming statistics that can be passed
back to the stats
parameter to continue aggregating streaming stats.
## Not run:
sc <- spark_connect(master = "local")

sdf_len(sc, 10) %>% spark_write_parquet(path = "parquet-in")

stream <- stream_read_parquet(sc, "parquet-in") %>%
  stream_write_parquet("parquet-out")

stream_stats(stream)
## End(Not run)
Stops processing data from a Spark stream.
stream_stop(stream)
stream_stop(stream)
stream |
The spark stream object to be stopped. |
Creates a Spark structured streaming trigger to execute continuously. This mode is the most performant but not all operations are supported.
stream_trigger_continuous(checkpoint = 5000)
stream_trigger_continuous(checkpoint = 5000)
checkpoint |
The checkpoint interval specified in milliseconds. |
Creates a Spark structured streaming trigger to execute over the specified interval.
stream_trigger_interval(interval = 1000)
stream_trigger_interval(interval = 1000)
interval |
The execution interval specified in milliseconds. |
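A hedged sketch of passing a custom interval trigger to a stream writer; the 10-second interval and paths are illustrative.

## Not run:
sc <- spark_connect(master = "local")
sdf_len(sc, 10) %>% spark_write_parquet(path = "parquet-in")

# run the streaming query in micro-batches every 10 seconds
stream <- stream_read_parquet(sc, "parquet-in") %>%
  stream_write_parquet(
    "parquet-out",
    trigger = stream_trigger_interval(interval = 10000)
  )

stream_stop(stream)
## End(Not run)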
Opens a Shiny gadget to visualize the given stream.
stream_view(stream, ...)
stream_view(stream, ...)
stream |
The stream to visualize. |
... |
Additional optional arguments. |
## Not run:
library(sparklyr)
sc <- spark_connect(master = "local")

dir.create("iris-in")
write.csv(iris, "iris-in/iris.csv", row.names = FALSE)

stream_read_csv(sc, "iris-in/") %>%
  stream_write_csv("iris-out/") %>%
  stream_view() %>%
  stream_stop()
## End(Not run)
Ensures a stream has a watermark defined, which is required for some operations over streams.
stream_watermark(x, column = "timestamp", threshold = "10 minutes")
stream_watermark(x, column = "timestamp", threshold = "10 minutes")
x |
An object coercible to a Spark Streaming DataFrame. |
column |
The name of the column that contains the event time of the row; if the column is missing, a column with the current time will be added. |
threshold |
The minimum delay to wait for late-arriving data, defaults to ten minutes. |
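A minimal sketch using the default watermark column and threshold; the paths and memory sink name are illustrative.

## Not run:
sc <- spark_connect(master = "local")
sdf_len(sc, 10) %>% spark_write_parquet(path = "parquet-in")

# add a watermark (a "timestamp" column is created if missing) before writing
stream <- stream_read_parquet(sc, "parquet-in") %>%
  stream_watermark() %>%
  stream_write_memory("watermarked")

stream_stop(stream)
## End(Not run)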
Write files to the stream
stream_write_csv(
  x, path, mode = c("append", "complete", "update"),
  trigger = stream_trigger_interval(),
  checkpoint = file.path(path, "checkpoint"),
  header = TRUE, delimiter = ",", quote = "\"", escape = "\\",
  charset = "UTF-8", null_value = NULL,
  options = list(), partition_by = NULL, ...
)

stream_write_text(
  x, path, mode = c("append", "complete", "update"),
  trigger = stream_trigger_interval(),
  checkpoint = file.path(path, "checkpoints", random_string("")),
  options = list(), partition_by = NULL, ...
)

stream_write_json(
  x, path, mode = c("append", "complete", "update"),
  trigger = stream_trigger_interval(),
  checkpoint = file.path(path, "checkpoints", random_string("")),
  options = list(), partition_by = NULL, ...
)

stream_write_parquet(
  x, path, mode = c("append", "complete", "update"),
  trigger = stream_trigger_interval(),
  checkpoint = file.path(path, "checkpoints", random_string("")),
  options = list(), partition_by = NULL, ...
)

stream_write_orc(
  x, path, mode = c("append", "complete", "update"),
  trigger = stream_trigger_interval(),
  checkpoint = file.path(path, "checkpoints", random_string("")),
  options = list(), partition_by = NULL, ...
)

stream_write_kafka(
  x, mode = c("append", "complete", "update"),
  trigger = stream_trigger_interval(),
  checkpoint = file.path("checkpoints", random_string("")),
  options = list(), partition_by = NULL, ...
)

stream_write_console(
  x, mode = c("append", "complete", "update"),
  options = list(), trigger = stream_trigger_interval(),
  partition_by = NULL, ...
)

stream_write_delta(
  x, path, mode = c("append", "complete", "update"),
  checkpoint = file.path("checkpoints", random_string("")),
  options = list(), partition_by = NULL, ...
)
x |
A Spark DataFrame or dplyr operation |
path |
The path to the file. Needs to be accessible from the cluster. Supports the ‘"hdfs://"’, ‘"s3a://"’ and ‘"file://"’ protocols. |
mode |
Specifies how data is written to a streaming sink. Valid values are
|
trigger |
The trigger for the stream query, defaults to micro-batches
running every 5 seconds. See |
checkpoint |
The location where the system will write all the checkpoint information to guarantee end-to-end fault-tolerance. |
header |
Should the first row of data be used as a header? Defaults to |
delimiter |
The character used to delimit each column, defaults to |
quote |
The character used as a quote. Defaults to ‘'"'’. |
escape |
The character used to escape other characters, defaults to |
charset |
The character set, defaults to |
null_value |
The character to use for default values, defaults to |
options |
A list of strings with additional options. |
partition_by |
Partitions the output by the given list of columns. |
... |
Optional arguments; currently unused. |
Other Spark stream serialization:
stream_write_memory()
,
stream_write_table()
## Not run:
sc <- spark_connect(master = "local")

dir.create("csv-in")
write.csv(iris, "csv-in/data.csv", row.names = FALSE)

csv_path <- file.path("file://", getwd(), "csv-in")

stream <- stream_read_csv(sc, csv_path) %>% stream_write_csv("csv-out")

stream_stop(stream)
## End(Not run)
Writes a Spark dataframe stream into a memory stream.
stream_write_memory( x, name = random_string("sparklyr_tmp_"), mode = c("append", "complete", "update"), trigger = stream_trigger_interval(), checkpoint = file.path("checkpoints", name, random_string("")), options = list(), partition_by = NULL, ... )
stream_write_memory( x, name = random_string("sparklyr_tmp_"), mode = c("append", "complete", "update"), trigger = stream_trigger_interval(), checkpoint = file.path("checkpoints", name, random_string("")), options = list(), partition_by = NULL, ... )
x |
A Spark DataFrame or dplyr operation |
name |
The name to assign to the newly generated stream. |
mode |
Specifies how data is written to a streaming sink. Valid values are
|
trigger |
The trigger for the stream query, defaults to micro-batches
running every 5 seconds. See |
checkpoint |
The location where the system will write all the checkpoint information to guarantee end-to-end fault-tolerance. |
options |
A list of strings with additional options. |
partition_by |
Partitions the output by the given list of columns. |
... |
Optional arguments; currently unused. |
Other Spark stream serialization:
stream_write_csv()
,
stream_write_table()
Writes a Spark dataframe stream into a table.
stream_write_table( x, path, format = NULL, mode = c("append", "complete", "update"), checkpoint = file.path("checkpoints", random_string("")), options = list(), partition_by = NULL, ... )
stream_write_table( x, path, format = NULL, mode = c("append", "complete", "update"), checkpoint = file.path("checkpoints", random_string("")), options = list(), partition_by = NULL, ... )
x |
A Spark DataFrame or dplyr operation |
path |
The path to the file. Needs to be accessible from the cluster. Supports the ‘"hdfs://"’, ‘"s3a://"’ and ‘"file://"’ protocols. |
format |
Specifies the format of the data written to the table, e.g.
|
mode |
Specifies how data is written to a streaming sink. Valid values are
|
checkpoint |
The location where the system will write all the checkpoint information to guarantee end-to-end fault-tolerance. |
options |
A list of strings with additional options. |
partition_by |
Partitions the output by the given list of columns. |
... |
Optional arguments; currently unused. |
Other Spark stream serialization:
stream_write_csv()
,
stream_write_memory()
Force a Spark table with name name
to be loaded into memory.
Operations on cached tables should normally (although not always)
be more performant than the same operation performed on an uncached
table.
tbl_cache(sc, name, force = TRUE)
tbl_cache(sc, name, force = TRUE)
sc |
A |
name |
The table name. |
force |
Force the data to be loaded into memory? This is accomplished
by calling the |
Use a specific database
tbl_change_db(sc, name)
tbl_change_db(sc, name)
sc |
A |
name |
The database name. |
Force a Spark table with name name
to be unloaded from memory.
tbl_uncache(sc, name)
tbl_uncache(sc, name)
sc |
A |
name |
The table name. |
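A minimal sketch caching and then uncaching a registered table; the table name is illustrative.

## Not run:
sc <- spark_connect(master = "local")
sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
# pin the table in memory, then release it when no longer needed
tbl_cache(sc, "iris_tbl")
tbl_uncache(sc, "iris_tbl")
## End(Not run)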
Transform a subset of column(s) in a Spark DataFrame
transform_sdf(x, cols, fn)
transform_sdf(x, cols, fn)
x |
An object coercible to a Spark DataFrame |
cols |
Subset of columns to apply transformation to |
fn |
Transformation function taking column name as the 1st parameter, the
corresponding |