data
Package
data.featureset
Module
Classes related to storing/merging feature sets.
- author: Dan Blanchard (dblanchard@ets.org)
- author: Nitin Madnani (nmadnani@ets.org)
- author: Jeremy Biggs (jbiggs@ets.org)
- organization: ETS
- class skll.data.featureset.FeatureSet(name, ids, labels=None, features=None, vectorizer=None)[source]
Bases: object
Encapsulate features, labels, and metadata for a given dataset.
- Parameters:
name (str) – The name of this feature set.
ids (Union[List[str], numpy.ndarray]) – Example IDs for this set.
labels (Optional[Union[List[str], numpy.ndarray]], default=None) – Labels for this set.
features (Optional[Union[skll.types.FeatureDictList, numpy.ndarray]], default=None) – The features for each instance, represented as either a list of dictionaries or a numpy array (if vectorizer is also specified).
vectorizer (Optional[Union[sklearn.feature_extraction.DictVectorizer, sklearn.feature_extraction.FeatureHasher]], default=None) – Vectorizer which will be used to generate the feature matrix.
Warning
FeatureSets can only be equal if the order of the instances is identical because these are stored as lists/arrays. Since scikit-learn’s DictVectorizer automatically sorts the underlying feature matrix if it is sparse, we do not do any sorting before checking for equality. This is not a problem because we always use sparse matrices with DictVectorizer when creating FeatureSets.
Notes
If ids, labels, and/or features are not None, the number of rows in each array must be equal.
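The row-count constraint in the Notes can be illustrated with a minimal standalone check. This is a sketch of the documented requirement, not SKLL's actual implementation:

```python
def check_lengths(ids, labels=None, features=None):
    """Raise ValueError unless all provided arrays have the same number of rows."""
    lengths = {len(arr) for arr in (ids, labels, features) if arr is not None}
    if len(lengths) > 1:
        raise ValueError(f"Mismatched number of rows: {sorted(lengths)}")

# Two IDs and two labels: consistent, so no exception is raised.
check_lengths(["ex1", "ex2"], labels=["a", "b"])
```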
- filter(ids=None, labels=None, features=None, inverse=False)[source]
Remove or keep features and/or examples from the given feature set.
Filtering is done in-place.
- Parameters:
ids (Optional[List[skll.types.IdType]], default=None) – Examples to keep in the FeatureSet. If None, no ID filtering takes place.
labels (Optional[List[skll.types.LabelType]], default=None) – Labels that we want to retain examples for. If None, no label filtering takes place.
features (Optional[List[str]], default=None) – Features to keep in the FeatureSet. To help with filtering string-valued features that were converted to sequences of boolean features when read in, any features in the FeatureSet that contain a = will be split on the first occurrence and the prefix will be checked to see if it is in features. If None, no feature filtering takes place. Cannot be used if the FeatureSet uses a FeatureHasher for vectorization.
inverse (bool, default=False) – Instead of keeping the specified features and/or examples, remove them.
- Raises:
ValueError – If attempting to use features to filter a FeatureSet that uses a FeatureHasher vectorizer.
- Return type:
None
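The prefix-before-= matching rule for string-valued features described above can be sketched in isolation. This is illustrative only; the function name and data are not SKLL's:

```python
def keep_feature(feature_name, features_to_keep):
    """A feature matches if its full name is in features_to_keep, or if the
    prefix before the first '=' is (the string-valued feature case)."""
    if feature_name in features_to_keep:
        return True
    prefix, sep, _ = feature_name.partition("=")
    return bool(sep) and prefix in features_to_keep

# 'color=red' is a boolean feature derived from string-valued feature 'color',
# so filtering on {'color'} keeps it without enumerating every value.
assert keep_feature("color=red", {"color"})
assert not keep_feature("size", {"color"})
```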
- filtered_iter(ids=None, labels=None, features=None, inverse=False)[source]
Retain only the specified features and/or examples from the output.
- Parameters:
ids (Optional[List[skll.types.IdType]], default=None) – Examples to keep in the FeatureSet. If None, no ID filtering takes place.
labels (Optional[List[skll.types.LabelType]], default=None) – Labels that we want to retain examples for. If None, no label filtering takes place.
features (Optional[Collection[str]], default=None) – Features to keep in the FeatureSet. To help with filtering string-valued features that were converted to sequences of boolean features when read in, any features in the FeatureSet that contain a = will be split on the first occurrence and the prefix will be checked to see if it is in features. If None, no feature filtering takes place. Cannot be used if the FeatureSet uses a FeatureHasher for vectorization.
inverse (bool, default=False) – Instead of keeping the specified features and/or examples, remove them.
- Returns:
A generator that yields 3-tuples containing:
- skll.types.IdType – The ID of the example.
- skll.types.LabelType – The label of the example.
- skll.types.FeatureDict – The feature dictionary, with feature name as the key and example value as the value.
- Return type:
Iterator[Tuple[skll.types.IdType, skll.types.LabelType, skll.types.FeatureDict]]
- Raises:
ValueError – If the vectorizer is not a DictVectorizer.
ValueError – If any of the “labels”, “features”, or “vectorizer” attributes is None.
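The shape of the generator returned by filtered_iter can be sketched with plain Python. This is a simplified stand-in (ID filtering only, no vectorizer), not SKLL's implementation:

```python
def filtered_iter_sketch(ids, labels, feature_dicts, keep_ids=None, inverse=False):
    """Yield (id, label, feature_dict) 3-tuples, keeping the examples whose ID
    is in keep_ids, or dropping them instead when inverse=True."""
    for ex_id, label, feats in zip(ids, labels, feature_dicts):
        keep = keep_ids is None or ex_id in keep_ids
        if keep != inverse:
            yield ex_id, label, feats

rows = list(filtered_iter_sketch(["a", "b"], [0, 1],
                                 [{"f1": 1.0}, {"f1": 2.0}], keep_ids={"b"}))
```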
- static from_data_frame(df, name, labels_column=None, vectorizer=None)[source]
Create a FeatureSet instance from a pandas data frame.
Will raise an Exception if pandas is not installed in your environment. The ids in the FeatureSet will be the index from the given frame.
- Parameters:
df (pandas.DataFrame) – The pandas.DataFrame object to use as a FeatureSet.
name (str) – The name of the output FeatureSet instance.
labels_column (Optional[str], default=None) – The name of the column containing the labels (data to predict).
vectorizer (Optional[Union[sklearn.feature_extraction.DictVectorizer, sklearn.feature_extraction.FeatureHasher]], default=None) – Vectorizer which will be used to generate the feature matrix.
- Returns:
A FeatureSet instance generated from the given data frame.
- Return type:
skll.data.featureset.FeatureSet
- property has_labels
Check if the FeatureSet has finite labels.
- Returns:
has_labels – Whether or not this FeatureSet has any finite labels.
- Return type:
bool
- static split(fs, ids_for_split1, ids_for_split2=None)[source]
Split a FeatureSet into two new FeatureSet instances.
The splitting is done based on the given indices for the two splits.
- Parameters:
fs (skll.data.featureset.FeatureSet) – The FeatureSet instance to split.
ids_for_split1 (List[int]) – A list of example indices which will be split out into the first FeatureSet instance. Note that the FeatureSet instance will respect the order of the specified indices.
ids_for_split2 (Optional[List[int]], default=None) – An optional list of example indices which will be split out into the second FeatureSet instance. Note that the FeatureSet instance will respect the order of the specified indices. If this is not specified, then the second FeatureSet instance will contain the complement of the first set of indices, sorted in ascending order.
- Returns:
A tuple containing the two FeatureSet instances.
- Return type:
Tuple[skll.data.featureset.FeatureSet, skll.data.featureset.FeatureSet]
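The index bookkeeping described for split can be sketched on its own: the first split keeps the caller's order, and an omitted second split becomes the sorted complement. A minimal illustration, not SKLL's code:

```python
def split_indices(n_examples, ids_for_split1, ids_for_split2=None):
    """Return the two index lists used for a split. split1 preserves the given
    order; if split2 is omitted, it is the complement of split1, sorted ascending."""
    if ids_for_split2 is None:
        ids_for_split2 = sorted(set(range(n_examples)) - set(ids_for_split1))
    return list(ids_for_split1), list(ids_for_split2)

# With 5 examples and split1 = [3, 0], split2 defaults to [1, 2, 4].
first, second = split_indices(5, [3, 0])
```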
data.readers
Module
- class skll.data.readers.Reader(path_or_list, quiet=True, ids_to_floats=False, label_col='y', id_col='id', class_map=None, sparse=True, feature_hasher=False, num_features=None, logger=None)[source]
Bases: object
Load FeatureSets from files on disk.
This is the base class used to create featureset readers for different file types.
- Parameters:
path_or_list (Union[skll.types.PathOrStr, List[Dict[str, Any]]]) – Path or a list of example dictionaries.
quiet (bool, default=True) – Do not print “Loading…” status message to stderr.
ids_to_floats (bool, default=False) – Convert IDs to float to save memory. Will raise an error if we encounter a non-numeric ID.
label_col (Optional[str], default='y') – Name of the column which contains the class labels for ARFF/CSV/TSV files. If no column with that name exists, or None is specified, the data is considered to be unlabelled.
id_col (str, default='id') – Name of the column which contains the instance IDs. If no column with that name exists, or None is specified, example IDs will be automatically generated.
class_map (Optional[skll.types.ClassMap], default=None) – Mapping from original class labels to new ones. This is mainly used for collapsing multiple labels into a single class. Anything not in the mapping will be kept the same. The keys are the new labels and the list of values for each key is the labels to be collapsed to said new label.
sparse (bool, default=True) – Whether or not to store the features in a numpy CSR matrix when using a DictVectorizer to vectorize the features.
feature_hasher (bool, default=False) – Whether or not a FeatureHasher should be used to vectorize the features.
num_features (Optional[int], default=None) – If using a FeatureHasher, how many features should the resulting matrix have? You should set this to a power of 2 greater than the actual number of features to avoid collisions.
logger (Optional[logging.Logger], default=None) – A logger instance to use to log messages instead of creating a new one by default.
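The class_map convention above (keys are the new labels, values are the old labels to collapse) can be sketched in plain Python. Illustrative only; SKLL applies this internally while reading:

```python
def apply_class_map(labels, class_map):
    """class_map maps each *new* label to the list of old labels it collapses.
    Labels not mentioned anywhere in the mapping are kept as-is."""
    old_to_new = {old: new for new, olds in class_map.items() for old in olds}
    return [old_to_new.get(label, label) for label in labels]

# Collapse 'cat' and 'dog' into a single 'pet' class; 'fish' is untouched.
merged = apply_class_map(["cat", "dog", "fish"], {"pet": ["cat", "dog"]})
```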
- classmethod for_path(path_or_list, **kwargs)[source]
Instantiate Reader sub-class based on the file extension.
If the input is a list of dictionaries instead of a path, use a dictionary reader instead.
- Parameters:
path_or_list (Union[skll.types.PathOrStr, skll.types.FeatureDictList]) – A path or list of example dictionaries.
kwargs (Optional[Dict[str, Any]]) – The arguments to the Reader object being instantiated.
- Returns:
reader – A new instance of the Reader sub-class that is appropriate for the given path.
- Return type:
skll.data.readers.Reader
- Raises:
ValueError – If the file does not have a valid extension.
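The dispatch behavior of for_path can be sketched as a suffix lookup. The mapping below is an assumption for illustration (the real dispatch lives inside Reader.for_path):

```python
from pathlib import Path

# Hypothetical extension-to-reader table for illustration only.
EXT_TO_READER = {".arff": "ARFFReader", ".csv": "CSVReader",
                 ".jsonlines": "NDJReader", ".ndj": "NDJReader",
                 ".libsvm": "LibSVMReader", ".tsv": "TSVReader"}

def reader_name_for_path(path_or_list):
    """Return a reader name for a path, or a dictionary reader for a list."""
    if isinstance(path_or_list, list):
        return "DictListReader"
    suffix = Path(path_or_list).suffix.lower()
    if suffix not in EXT_TO_READER:
        raise ValueError(f"File does not have a valid extension: {suffix!r}")
    return EXT_TO_READER[suffix]
```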
- read()[source]
Load examples from various file formats.
The following formats are supported: .arff, .csv, .jsonlines, .libsvm, .ndj, and .tsv.
- Returns:
A FeatureSet instance representing the input file.
- Return type:
skll.data.featureset.FeatureSet
- Raises:
ValueError – If ids_to_floats is True, but IDs cannot be converted.
ValueError – If no features are found.
ValueError – If the example IDs are not unique.
- class skll.data.readers.CSVReader(path_or_list, replace_blanks_with=None, drop_blanks=False, pandas_kwargs=None, **kwargs)[source]
Bases: Reader
Create a FeatureSet instance from a CSV file.
If example/instance IDs are included in the files, they must be specified in the id column. Also, there must be a column with the name specified by label_col if the data is labeled.
- Parameters:
path_or_list (Union[skll.types.PathOrStr, List[Dict[str, Any]]]) – The path to a comma-delimited file.
replace_blanks_with (Optional[Union[Number, Dict[str, Number]]], default=None) – Specifies a new value with which to replace blank values. Options are:
- Number: A (numeric) value with which to replace blank values.
- dict: A dictionary specifying the replacement value for each column.
- None: Blank values will be left as blanks, and not replaced.
The replacement occurs after the data set is read into a pd.DataFrame.
drop_blanks (bool, default=False) – If True, remove lines/rows that have any blank values. These lines/rows are removed after the data set is read into a pd.DataFrame.
pandas_kwargs (Optional[Dict[str, Any]], default=None) – Arguments that will be passed directly to the pandas I/O reader.
kwargs (Optional[Dict[str, Any]]) – Other arguments to the Reader object.
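The three replace_blanks_with options (a single number, a per-column dict, or None) can be sketched over plain rows. A simplified stand-in that treats the empty string as a blank, not SKLL's pandas-based implementation:

```python
def replace_blanks(rows, replace_blanks_with):
    """Apply the documented options to rows (a list of column->value dicts)."""
    if replace_blanks_with is None:
        return rows  # leave blanks as blanks
    out = []
    for row in rows:
        new_row = {}
        for col, val in row.items():
            if val == "":
                if isinstance(replace_blanks_with, dict):
                    # per-column replacement; unlisted columns stay blank
                    val = replace_blanks_with.get(col, val)
                else:
                    val = replace_blanks_with  # single numeric replacement
            new_row[col] = val
        out.append(new_row)
    return out

filled = replace_blanks([{"f1": "", "f2": 3}], {"f1": 0.0})
```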
- class skll.data.readers.TSVReader(path_or_list, replace_blanks_with=None, drop_blanks=False, pandas_kwargs=None, **kwargs)[source]
Bases: CSVReader
Create a FeatureSet instance from a TSV file.
If example/instance IDs are included in the files, they must be specified in the id column. Also, there must be a column with the name specified by label_col if the data is labeled.
- Parameters:
path_or_list (str) – The path to a tab-delimited file.
replace_blanks_with (Optional[Union[Number, Dict[str, Number]]], default=None) – Specifies a new value with which to replace blank values. Options are:
- Number: A (numeric) value with which to replace blank values.
- dict: A dictionary specifying the replacement value for each column.
- None: Blank values will be left as blanks, and not replaced.
The replacement occurs after the data set is read into a pd.DataFrame.
drop_blanks (bool, default=False) – If True, remove lines/rows that have any blank values. These lines/rows are removed after the data set is read into a pd.DataFrame.
pandas_kwargs (Optional[Dict[str, Any]], default=None) – Arguments that will be passed directly to the pandas I/O reader.
kwargs (Optional[Dict[str, Any]]) – Other arguments to the Reader object.
- class skll.data.readers.NDJReader(path_or_list, quiet=True, ids_to_floats=False, label_col='y', id_col='id', class_map=None, sparse=True, feature_hasher=False, num_features=None, logger=None)[source]
Bases: Reader
Create a FeatureSet instance from a JSONlines/NDJ file.
If example/instance IDs are included in the files, they must be specified as the “id” key in each JSON dictionary.
- Parameters:
path_or_list (Union[skll.types.PathOrStr, List[Dict[str, Any]]]) – Path or a list of example dictionaries.
quiet (bool, default=True) – Do not print “Loading…” status message to stderr.
ids_to_floats (bool, default=False) – Convert IDs to float to save memory. Will raise an error if we encounter a non-numeric ID.
label_col (Optional[str], default='y') – Name of the column which contains the class labels for ARFF/CSV/TSV files. If no column with that name exists, or None is specified, the data is considered to be unlabelled.
id_col (str, default='id') – Name of the column which contains the instance IDs. If no column with that name exists, or None is specified, example IDs will be automatically generated.
class_map (Optional[skll.types.ClassMap], default=None) – Mapping from original class labels to new ones. This is mainly used for collapsing multiple labels into a single class. Anything not in the mapping will be kept the same. The keys are the new labels and the list of values for each key is the labels to be collapsed to said new label.
sparse (bool, default=True) – Whether or not to store the features in a numpy CSR matrix when using a DictVectorizer to vectorize the features.
feature_hasher (bool, default=False) – Whether or not a FeatureHasher should be used to vectorize the features.
num_features (Optional[int], default=None) – If using a FeatureHasher, how many features should the resulting matrix have? You should set this to a power of 2 greater than the actual number of features to avoid collisions.
logger (Optional[logging.Logger], default=None) – A logger instance to use to log messages instead of creating a new one by default.
- class skll.data.readers.DictListReader(path_or_list, quiet=True, ids_to_floats=False, label_col='y', id_col='id', class_map=None, sparse=True, feature_hasher=False, num_features=None, logger=None)[source]
Bases: Reader
Facilitate programmatic use of methods that take FeatureSet objects as input.
Support Learner.predict() and other methods that take FeatureSet objects as input. It iterates over examples in the same way as other Reader classes, but uses a list of example dictionaries instead of a path to a file.
- Parameters:
path_or_list (Union[skll.types.PathOrStr, List[Dict[str, Any]]]) – Path or a list of example dictionaries.
quiet (bool, default=True) – Do not print “Loading…” status message to stderr.
ids_to_floats (bool, default=False) – Convert IDs to float to save memory. Will raise an error if we encounter a non-numeric ID.
label_col (Optional[str], default='y') – Name of the column which contains the class labels for ARFF/CSV/TSV files. If no column with that name exists, or None is specified, the data is considered to be unlabelled.
id_col (str, default='id') – Name of the column which contains the instance IDs. If no column with that name exists, or None is specified, example IDs will be automatically generated.
class_map (Optional[skll.types.ClassMap], default=None) – Mapping from original class labels to new ones. This is mainly used for collapsing multiple labels into a single class. Anything not in the mapping will be kept the same. The keys are the new labels and the list of values for each key is the labels to be collapsed to said new label.
sparse (bool, default=True) – Whether or not to store the features in a numpy CSR matrix when using a DictVectorizer to vectorize the features.
feature_hasher (bool, default=False) – Whether or not a FeatureHasher should be used to vectorize the features.
num_features (Optional[int], default=None) – If using a FeatureHasher, how many features should the resulting matrix have? You should set this to a power of 2 greater than the actual number of features to avoid collisions.
logger (Optional[logging.Logger], default=None) – A logger instance to use to log messages instead of creating a new one by default.
- class skll.data.readers.ARFFReader(path_or_list, **kwargs)[source]
Bases: Reader
Create a FeatureSet instance from an ARFF file.
If example/instance IDs are included in the files, they must be specified in the id column. Also, there must be a column with the name specified by label_col if the data is labeled, and this column must be the final one (as it is in Weka).
- Parameters:
path_or_list (Union[skll.types.PathOrStr, List[Dict[str, Any]]]) – The path to the ARFF file.
kwargs (Optional[Dict[str, Any]]) – Other arguments to the Reader object.
- class skll.data.readers.LibSVMReader(path_or_list, quiet=True, ids_to_floats=False, label_col='y', id_col='id', class_map=None, sparse=True, feature_hasher=False, num_features=None, logger=None)[source]
Bases: Reader
Create a FeatureSet instance from a LibSVM/LibLinear/SVMLight file.
We use a specially formatted comment for storing example IDs, class names, and feature names, which are normally not supported by the format. The comment is not mandatory, but without it, your labels and features will not have names. The comment is structured as follows:
ExampleID | 1=FirstClass | 1=FirstFeature 2=SecondFeature
- Parameters:
path_or_list (Union[skll.types.PathOrStr, List[Dict[str, Any]]]) – Path or a list of example dictionaries.
quiet (bool, default=True) – Do not print “Loading…” status message to stderr.
ids_to_floats (bool, default=False) – Convert IDs to float to save memory. Will raise an error if we encounter a non-numeric ID.
label_col (Optional[str], default='y') – Name of the column which contains the class labels for ARFF/CSV/TSV files. If no column with that name exists, or None is specified, the data is considered to be unlabelled.
id_col (str, default='id') – Name of the column which contains the instance IDs. If no column with that name exists, or None is specified, example IDs will be automatically generated.
class_map (Optional[skll.types.ClassMap], default=None) – Mapping from original class labels to new ones. This is mainly used for collapsing multiple labels into a single class. Anything not in the mapping will be kept the same. The keys are the new labels and the list of values for each key is the labels to be collapsed to said new label.
sparse (bool, default=True) – Whether or not to store the features in a numpy CSR matrix when using a DictVectorizer to vectorize the features.
feature_hasher (bool, default=False) – Whether or not a FeatureHasher should be used to vectorize the features.
num_features (Optional[int], default=None) – If using a FeatureHasher, how many features should the resulting matrix have? You should set this to a power of 2 greater than the actual number of features to avoid collisions.
logger (Optional[logging.Logger], default=None) – A logger instance to use to log messages instead of creating a new one by default.
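The pipe-separated comment format shown for LibSVM files above can be parsed with a few lines of plain Python. A sketch of the documented structure, not SKLL's parser:

```python
def parse_libsvm_comment(comment):
    """Split the comment into its three '|'-separated fields: the example ID,
    'index=label' pairs, and 'index=feature-name' pairs."""
    example_id, label_field, feature_field = (p.strip() for p in comment.split("|"))

    def pairs(field):
        # each whitespace-separated item looks like '1=SomeName'
        return {int(k): v for k, v in (item.split("=", 1) for item in field.split())}

    return example_id, pairs(label_field), pairs(feature_field)

ex_id, label_map, feat_map = parse_libsvm_comment(
    "ExampleID | 1=FirstClass | 1=FirstFeature 2=SecondFeature")
```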
data.writers
Module
- class skll.data.writers.Writer(path, feature_set, quiet=True, subsets=None, logger=None)[source]
Bases: object
Write out FeatureSets to files on disk.
This is the base class used to create featureset writers for different file types.
- Parameters:
path (skll.types.PathOrStr) – A path to the feature file we would like to create. The suffix to this filename must be .arff, .csv, .jsonlines, .libsvm, .ndj, or .tsv. If subsets is not None, when calling the write() method, path is assumed to be a string containing the path to the directory to write the feature files with an additional file extension specifying the file type. For example /foo/.csv.
feature_set (skll.data.featureset.FeatureSet) – The FeatureSet instance to dump to the file.
quiet (bool, default=True) – Do not print “Writing…” status message to stderr.
subsets (Optional[Dict[str, List[str]]], default=None) – A mapping from subset names to lists of feature names that are included in those sets. If given, a feature file will be written for every subset (with the name containing the subset name as a suffix to path). Note, since string-valued features are automatically converted into boolean features with names of the form FEATURE_NAME=STRING_VALUE, when doing the filtering, the portion before the = is all that’s used for matching. Therefore, you do not need to enumerate all of these boolean feature names in your mapping.
logger (Optional[logging.Logger], default=None) – A logger instance to use to log messages instead of creating a new one by default.
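The directory-plus-extension path convention for subsets (e.g. /foo/.csv) can be sketched as filename construction. The exact naming scheme here is an assumption for illustration; the real logic lives in Writer.write():

```python
from pathlib import Path

def subset_paths(path, subsets):
    """Build one output path per subset name from a path like '/foo/.csv',
    whose final component is just the extension (assumed naming scheme)."""
    base = Path(path)
    ext = base.name  # e.g. '.csv' -- note pathlib treats '.csv' as a bare name
    return {name: str(base.parent / f"{name}{ext}") for name in subsets}

paths = subset_paths("/foo/.csv", {"lexical": ["w1"], "syntactic": ["dep"]})
```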
- classmethod for_path(path, feature_set, **kwargs)[source]
Retrieve an object of the Writer sub-class appropriate for the given path.
- Parameters:
path (skll.types.PathOrStr) – A path to the feature file we would like to create. The suffix to this filename must be .arff, .csv, .jsonlines, .libsvm, .ndj, or .tsv. If subsets is not None, when calling the write() method, path is assumed to be a string containing the path to the directory to write the feature files with an additional file extension specifying the file type. For example /foo/.csv.
feature_set (skll.data.featureset.FeatureSet) – The FeatureSet instance to dump to the output file.
kwargs (Optional[Dict[str, Any]]) – The keyword arguments for for_path are the same as the initializer for the desired Writer subclass.
- Returns:
writer – A new instance of the Writer sub-class that is appropriate for the given path.
- Return type:
skll.data.Writer
- class skll.data.writers.CSVWriter(path, feature_set, quiet=True, subsets=None, logger=None, label_col='y', id_col='id', pandas_kwargs=None)[source]
Bases: Writer
Writer for writing out FeatureSet instances as CSV files.
- Parameters:
path (skll.types.PathOrStr) – A path to the feature file we would like to create. If subsets is not None, this is assumed to be a string containing the path to the directory to write the feature files with an additional file extension specifying the file type. For example /foo/.csv.
feature_set (skll.data.featureset.FeatureSet) – The FeatureSet instance to dump to the output file.
quiet (bool, default=True) – Do not print “Writing…” status message to stderr.
subsets (Optional[Dict[str, List[str]]], default=None) – A mapping from subset names to lists of feature names that are included in those sets. If given, a feature file will be written for every subset (with the name containing the subset name as a suffix to path). Note, since string-valued features are automatically converted into boolean features with names of the form FEATURE_NAME=STRING_VALUE, when doing the filtering, the portion before the = is all that’s used for matching. Therefore, you do not need to enumerate all of these boolean feature names in your mapping.
logger (Optional[logging.Logger], default=None) – A logger instance to use to log messages instead of creating a new one by default.
label_col (str, default="y") – The column name containing the label.
id_col (str, default="id") – The column name containing the ID.
pandas_kwargs (Optional[Dict[str, Any]], default=None) – Arguments that will be passed directly to the pandas I/O writer.
- class skll.data.writers.TSVWriter(path, feature_set, quiet=True, subsets=None, logger=None, label_col='y', id_col='id', pandas_kwargs=None)[source]
Bases: CSVWriter
Writer for writing out FeatureSets as TSV files.
- Parameters:
path (skll.types.PathOrStr) – A path to the feature file we would like to create. If subsets is not None, this is assumed to be a string containing the path to the directory to write the feature files with an additional file extension specifying the file type. For example /foo/.tsv.
feature_set (skll.data.featureset.FeatureSet) – The FeatureSet instance to dump to the output file.
quiet (bool, default=True) – Do not print “Writing…” status message to stderr.
subsets (Optional[Dict[str, List[str]]], default=None) – A mapping from subset names to lists of feature names that are included in those sets. If given, a feature file will be written for every subset (with the name containing the subset name as a suffix to path). Note, since string-valued features are automatically converted into boolean features with names of the form FEATURE_NAME=STRING_VALUE, when doing the filtering, the portion before the = is all that’s used for matching. Therefore, you do not need to enumerate all of these boolean feature names in your mapping.
logger (Optional[logging.Logger], default=None) – A logger instance to use to log messages instead of creating a new one by default.
label_col (str, default="y") – The column name containing the label.
id_col (str, default="id") – The column name containing the ID.
pandas_kwargs (Optional[Dict[str, Any]], default=None) – Arguments that will be passed directly to the pandas I/O writer.
- class skll.data.writers.NDJWriter(path, feature_set, quiet=True, subsets=None, logger=None)[source]
Bases: Writer
Writer for writing out FeatureSets as .jsonlines/.ndj files.
- Parameters:
path (skll.types.PathOrStr) – A path to the feature file we would like to create. If subsets is not None, this is assumed to be a string containing the path to the directory to write the feature files with an additional file extension specifying the file type. For example /foo/.ndj.
feature_set (skll.data.featureset.FeatureSet) – The FeatureSet instance to dump to the output file.
quiet (bool, default=True) – Do not print “Writing…” status message to stderr.
subsets (Optional[Dict[str, List[str]]], default=None) – A mapping from subset names to lists of feature names that are included in those sets. If given, a feature file will be written for every subset (with the name containing the subset name as a suffix to path). Note, since string-valued features are automatically converted into boolean features with names of the form FEATURE_NAME=STRING_VALUE, when doing the filtering, the portion before the = is all that’s used for matching. Therefore, you do not need to enumerate all of these boolean feature names in your mapping.
logger (Optional[logging.Logger], default=None) – A logger instance to use to log messages instead of creating a new one by default.
- class skll.data.writers.ARFFWriter(path, feature_set, quiet=True, subsets=None, logger=None, relation='skll_relation', regression=False, dialect='excel-tab', label_col='y', id_col='id')[source]
Bases: Writer
Writer for writing out FeatureSets as ARFF files.
- Parameters:
path (skll.types.PathOrStr) – A path to the feature file we would like to create. If subsets is not None, this is assumed to be a string containing the path to the directory to write the feature files with an additional file extension specifying the file type. For example /foo/.arff.
feature_set (skll.data.featureset.FeatureSet) – The FeatureSet instance to dump to the output file.
quiet (bool, default=True) – Do not print “Writing…” status message to stderr.
subsets (Optional[Dict[str, List[str]]], default=None) – A mapping from subset names to lists of feature names that are included in those sets. If given, a feature file will be written for every subset (with the name containing the subset name as a suffix to path). Note, since string-valued features are automatically converted into boolean features with names of the form FEATURE_NAME=STRING_VALUE, when doing the filtering, the portion before the = is all that’s used for matching. Therefore, you do not need to enumerate all of these boolean feature names in your mapping.
logger (Optional[logging.Logger], default=None) – A logger instance to use to log messages instead of creating a new one by default.
relation (str, default='skll_relation') – The name of the relation in the ARFF file.
regression (bool, default=False) – Is this an ARFF file to be used for regression?
kwargs (Optional[Dict[str, Any]]) – The arguments to the Writer object being instantiated.
- class skll.data.writers.LibSVMWriter(path, feature_set, quiet=True, subsets=None, logger=None, label_map=None)[source]
Bases: Writer
Writer for writing out FeatureSets as LibSVM/SVMLight files.
- Parameters:
path (skll.types.PathOrStr) – A path to the feature file we would like to create. If subsets is not None, this is assumed to be a string containing the path to the directory to write the feature files with an additional file extension specifying the file type. For example /foo/.libsvm.
feature_set (skll.data.featureset.FeatureSet) – The FeatureSet instance to dump to the output file.
quiet (bool, default=True) – Do not print “Writing…” status message to stderr.
subsets (Optional[Dict[str, List[str]]], default=None) – A mapping from subset names to lists of feature names that are included in those sets. If given, a feature file will be written for every subset (with the name containing the subset name as a suffix to path). Note, since string-valued features are automatically converted into boolean features with names of the form FEATURE_NAME=STRING_VALUE, when doing the filtering, the portion before the = is all that’s used for matching. Therefore, you do not need to enumerate all of these boolean feature names in your mapping.
logger (Optional[logging.Logger], default=None) – A logger instance to use to log messages instead of creating a new one by default.
label_map (Optional[Dict[str, int]], default=None) – A mapping from label strings to integers.
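When no label_map is supplied, a writer needs some deterministic way to assign integers to label strings. One plausible scheme, shown purely as an assumption for illustration (the documentation above does not specify SKLL's default), is to enumerate the sorted unique labels:

```python
def default_label_map(labels):
    """Hypothetical default: map each unique label string to an integer by
    enumerating the sorted unique labels. Illustration only."""
    return {label: i for i, label in enumerate(sorted(set(labels)))}

mapping = default_label_map(["dog", "cat", "dog"])
```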