Module keras.api.keras.preprocessing.sequence
Public API for tf.keras.preprocessing.sequence namespace.
Expand source code
# This file is MACHINE GENERATED! Do not edit.
# Generated by: tensorflow/python/tools/api/generator/create_python_api.py script.
"""Public API for tf.keras.preprocessing.sequence namespace.
"""
from __future__ import print_function as _print_function
import sys as _sys
from keras.preprocessing.sequence import TimeseriesGenerator
from keras.preprocessing.sequence import pad_sequences
from keras_preprocessing.sequence import make_sampling_table
from keras_preprocessing.sequence import skipgrams
del _print_function
from tensorflow.python.util import module_wrapper as _module_wrapper
if not isinstance(_sys.modules[__name__], _module_wrapper.TFModuleWrapper):
_sys.modules[__name__] = _module_wrapper.TFModuleWrapper(
_sys.modules[__name__], "keras.preprocessing.sequence", public_apis=None, deprecation=True,
has_lite=False)
Functions
def make_sampling_table(size, sampling_factor=1e-05)
-
Generates a word rank-based probabilistic sampling table.
Used for generating the
sampling_table
argument forskipgrams()
.sampling_table[i]
is the probability of sampling the word i-th most common word in a dataset (more common words should be sampled less frequently, for balance).The sampling probabilities are generated according to the sampling distribution used in word2vec:
p(word) = (min(1, sqrt(word_frequency / sampling_factor) / (word_frequency / sampling_factor)))
We assume that the word frequencies follow Zipf's law (s=1) to derive a numerical approximation of frequency(rank):
frequency(rank) ~ 1/(rank * (log(rank) + gamma) + 1/2 - 1/(12*rank))
wheregamma
is the Euler-Mascheroni constant.Arguments
size: Int, number of possible words to sample. sampling_factor: The sampling factor in the word2vec formula.
Returns
A 1D Numpy array of length <code>size</code> where the ith entry is the probability that a word of rank i should be sampled.
Expand source code
def make_sampling_table(size, sampling_factor=1e-5): """Generates a word rank-based probabilistic sampling table. Used for generating the `sampling_table` argument for `skipgrams`. `sampling_table[i]` is the probability of sampling the word i-th most common word in a dataset (more common words should be sampled less frequently, for balance). The sampling probabilities are generated according to the sampling distribution used in word2vec: ``` p(word) = (min(1, sqrt(word_frequency / sampling_factor) / (word_frequency / sampling_factor))) ``` We assume that the word frequencies follow Zipf's law (s=1) to derive a numerical approximation of frequency(rank): `frequency(rank) ~ 1/(rank * (log(rank) + gamma) + 1/2 - 1/(12*rank))` where `gamma` is the Euler-Mascheroni constant. # Arguments size: Int, number of possible words to sample. sampling_factor: The sampling factor in the word2vec formula. # Returns A 1D Numpy array of length `size` where the ith entry is the probability that a word of rank i should be sampled. """ gamma = 0.577 rank = np.arange(size) rank[0] = 1 inv_fq = rank * (np.log(rank) + gamma) + 0.5 - 1. / (12. * rank) f = sampling_factor * inv_fq return np.minimum(1., f / np.sqrt(f))
def pad_sequences(sequences, maxlen=None, dtype='int32', padding='pre', truncating='pre', value=0.0)
-
Pads sequences to the same length.
This function transforms a list (of length
num_samples
) of sequences (lists of integers) into a 2D Numpy array of shape(num_samples, num_timesteps)
.num_timesteps
is either themaxlen
argument if provided, or the length of the longest sequence in the list.Sequences that are shorter than
num_timesteps
are padded withvalue
until they arenum_timesteps
long.Sequences longer than
num_timesteps
are truncated so that they fit the desired length.The position where padding or truncation happens is determined by the arguments
padding
andtruncating
, respectively. Pre-padding or removing values from the beginning of the sequence is the default.>>> sequence = [[1], [2, 3], [4, 5, 6]] >>> tf.keras.preprocessing.sequence.pad_sequences(sequence) array([[0, 0, 1], [0, 2, 3], [4, 5, 6]], dtype=int32)
>>> tf.keras.preprocessing.sequence.pad_sequences(sequence, value=-1) array([[-1, -1, 1], [-1, 2, 3], [ 4, 5, 6]], dtype=int32)
>>> tf.keras.preprocessing.sequence.pad_sequences(sequence, padding='post') array([[1, 0, 0], [2, 3, 0], [4, 5, 6]], dtype=int32)
>>> tf.keras.preprocessing.sequence.pad_sequences(sequence, maxlen=2) array([[0, 1], [2, 3], [5, 6]], dtype=int32)
Args
sequences
- List of sequences (each sequence is a list of integers).
maxlen
- Optional Int, maximum length of all sequences. If not provided, sequences will be padded to the length of the longest individual sequence.
dtype
- (Optional, defaults to int32). Type of the output sequences.
To pad sequences with variable length strings, you can use
object
. padding
- String, 'pre' or 'post' (optional, defaults to 'pre'): pad either before or after each sequence.
truncating
- String, 'pre' or 'post' (optional, defaults to 'pre'):
remove values from sequences larger than
maxlen
, either at the beginning or at the end of the sequences. value
- Float or String, padding value. (Optional, defaults to 0.)
Returns
Numpy array with shape
(len(sequences), maxlen)
Raises
ValueError
- In case of invalid values for
truncating
orpadding
, or in case of invalid shape for asequences
entry.
Expand source code
@keras_export('keras.preprocessing.sequence.pad_sequences') def pad_sequences(sequences, maxlen=None, dtype='int32', padding='pre', truncating='pre', value=0.): """Pads sequences to the same length. This function transforms a list (of length `num_samples`) of sequences (lists of integers) into a 2D Numpy array of shape `(num_samples, num_timesteps)`. `num_timesteps` is either the `maxlen` argument if provided, or the length of the longest sequence in the list. Sequences that are shorter than `num_timesteps` are padded with `value` until they are `num_timesteps` long. Sequences longer than `num_timesteps` are truncated so that they fit the desired length. The position where padding or truncation happens is determined by the arguments `padding` and `truncating`, respectively. Pre-padding or removing values from the beginning of the sequence is the default. >>> sequence = [[1], [2, 3], [4, 5, 6]] >>> tf.keras.preprocessing.sequence.pad_sequences(sequence) array([[0, 0, 1], [0, 2, 3], [4, 5, 6]], dtype=int32) >>> tf.keras.preprocessing.sequence.pad_sequences(sequence, value=-1) array([[-1, -1, 1], [-1, 2, 3], [ 4, 5, 6]], dtype=int32) >>> tf.keras.preprocessing.sequence.pad_sequences(sequence, padding='post') array([[1, 0, 0], [2, 3, 0], [4, 5, 6]], dtype=int32) >>> tf.keras.preprocessing.sequence.pad_sequences(sequence, maxlen=2) array([[0, 1], [2, 3], [5, 6]], dtype=int32) Args: sequences: List of sequences (each sequence is a list of integers). maxlen: Optional Int, maximum length of all sequences. If not provided, sequences will be padded to the length of the longest individual sequence. dtype: (Optional, defaults to int32). Type of the output sequences. To pad sequences with variable length strings, you can use `object`. padding: String, 'pre' or 'post' (optional, defaults to 'pre'): pad either before or after each sequence. truncating: String, 'pre' or 'post' (optional, defaults to 'pre'): remove values from sequences larger than `maxlen`, either at the beginning or at the end of the sequences. value: Float or String, padding value. (Optional, defaults to 0.) Returns: Numpy array with shape `(len(sequences), maxlen)` Raises: ValueError: In case of invalid values for `truncating` or `padding`, or in case of invalid shape for a `sequences` entry. """ return sequence.pad_sequences( sequences, maxlen=maxlen, dtype=dtype, padding=padding, truncating=truncating, value=value)
def skipgrams(sequence, vocabulary_size, window_size=4, negative_samples=1.0, shuffle=True, categorical=False, sampling_table=None, seed=None)
-
Generates skipgram word pairs.
This function transforms a sequence of word indexes (list of integers) into tuples of words of the form:
- (word, word in the same window), with label 1 (positive samples).
- (word, random word from the vocabulary), with label 0 (negative samples).
Read more about Skipgram in this gnomic paper by Mikolov et al.: Efficient Estimation of Word Representations in Vector Space
Arguments
sequence: A word sequence (sentence), encoded as a list of word indices (integers). If using a <code>sampling\_table</code>, word indices are expected to match the rank of the words in a reference dataset (e.g. 10 would encode the 10-th most frequently occurring token). Note that index 0 is expected to be a non-word and will be skipped. vocabulary_size: Int, maximum possible word index + 1 window_size: Int, size of sampling windows (technically half-window). The window of a word <code>w\_i</code> will be `[i - window_size, i + window_size+1]`. negative_samples: Float >= 0. 0 for no negative (i.e. random) samples. 1 for same number as positive samples. shuffle: Whether to shuffle the word couples before returning them. categorical: bool. if False, labels will be integers (eg. <code>\[0, 1, 1 .. ]</code>), if <code>True</code>, labels will be categorical, e.g. <code>\[\[1,0],\[0,1],\[0,1] .. ]</code>. sampling_table: 1D array of size <code>vocabulary\_size</code> where the entry i encodes the probability to sample a word of rank i. seed: Random seed.
Returns
couples, labels: where <code>couples</code> are int pairs and <code>labels</code> are either 0 or 1.
Note
By convention, index 0 in the vocabulary is a non-word and will be skipped.
Expand source code
def skipgrams(sequence, vocabulary_size, window_size=4, negative_samples=1., shuffle=True, categorical=False, sampling_table=None, seed=None): """Generates skipgram word pairs. This function transforms a sequence of word indexes (list of integers) into tuples of words of the form: - (word, word in the same window), with label 1 (positive samples). - (word, random word from the vocabulary), with label 0 (negative samples). Read more about Skipgram in this gnomic paper by Mikolov et al.: [Efficient Estimation of Word Representations in Vector Space](http://arxiv.org/pdf/1301.3781v3.pdf) # Arguments sequence: A word sequence (sentence), encoded as a list of word indices (integers). If using a `sampling_table`, word indices are expected to match the rank of the words in a reference dataset (e.g. 10 would encode the 10-th most frequently occurring token). Note that index 0 is expected to be a non-word and will be skipped. vocabulary_size: Int, maximum possible word index + 1 window_size: Int, size of sampling windows (technically half-window). The window of a word `w_i` will be `[i - window_size, i + window_size+1]`. negative_samples: Float >= 0. 0 for no negative (i.e. random) samples. 1 for same number as positive samples. shuffle: Whether to shuffle the word couples before returning them. categorical: bool. if False, labels will be integers (eg. `[0, 1, 1 .. ]`), if `True`, labels will be categorical, e.g. `[[1,0],[0,1],[0,1] .. ]`. sampling_table: 1D array of size `vocabulary_size` where the entry i encodes the probability to sample a word of rank i. seed: Random seed. # Returns couples, labels: where `couples` are int pairs and `labels` are either 0 or 1. # Note By convention, index 0 in the vocabulary is a non-word and will be skipped. """ couples = [] labels = [] for i, wi in enumerate(sequence): if not wi: continue if sampling_table is not None: if sampling_table[wi] < random.random(): continue window_start = max(0, i - window_size) window_end = min(len(sequence), i + window_size + 1) for j in range(window_start, window_end): if j != i: wj = sequence[j] if not wj: continue couples.append([wi, wj]) if categorical: labels.append([0, 1]) else: labels.append(1) if negative_samples > 0: num_negative_samples = int(len(labels) * negative_samples) words = [c[0] for c in couples] random.shuffle(words) couples += [[words[i % len(words)], random.randint(1, vocabulary_size - 1)] for i in range(num_negative_samples)] if categorical: labels += [[1, 0]] * num_negative_samples else: labels += [0] * num_negative_samples if shuffle: if seed is None: seed = random.randint(0, 10e6) random.seed(seed) random.shuffle(couples) random.seed(seed) random.shuffle(labels) return couples, labels
Classes
class TimeseriesGenerator (data, targets, length, sampling_rate=1, stride=1, start_index=0, end_index=None, shuffle=False, reverse=False, batch_size=128)
-
Utility class for generating batches of temporal data.
This class takes in a sequence of data-points gathered at equal intervals, along with time series parameters such as stride, length of history, etc., to produce batches for training/validation.
Arguments
data: Indexable generator (such as list or Numpy array) containing consecutive data points (timesteps). The data should be at 2D, and axis 0 is expected to be the time dimension. targets: Targets corresponding to timesteps in <code>data</code>. It should have same length as <code>data</code>. length: Length of the output sequences (in number of timesteps). sampling_rate: Period between successive individual timesteps within sequences. For rate <code>r</code>, timesteps <code>data\[i]</code>, `data[i-r]`, ... `data[i - length]` are used for create a sample sequence. stride: Period between successive output sequences. For stride <code>s</code>, consecutive output samples would be centered around <code>data\[i]</code>, `data[i+s]`, `data[i+2*s]`, etc. start_index: Data points earlier than <code>start\_index</code> will not be used in the output sequences. This is useful to reserve part of the data for test or validation. end_index: Data points later than <code>end\_index</code> will not be used in the output sequences. This is useful to reserve part of the data for test or validation. shuffle: Whether to shuffle output samples, or instead draw them in chronological order. reverse: Boolean: if <code>true</code>, timesteps in each output sample will be in reverse chronological order. batch_size: Number of timeseries samples in each batch (except maybe the last one).
Returns
A [Sequence](https://www.tensorflow.org/api_docs/python/tf/keras/utils/Sequence) instance.
Examples
from keras.preprocessing.sequence import TimeseriesGenerator import numpy as np data = np.array([[i] for i in range(50)]) targets = np.array([[i] for i in range(50)]) data_gen = TimeseriesGenerator(data, targets, length=10, sampling_rate=2, batch_size=2) assert len(data_gen) == 20 batch_0 = data_gen[0] x, y = batch_0 assert np.array_equal(x, np.array([[[0], [2], [4], [6], [8]], [[1], [3], [5], [7], [9]]])) assert np.array_equal(y, np.array([[10], [11]]))
Expand source code
class TimeseriesGenerator(sequence.TimeseriesGenerator, data_utils.Sequence): """Utility class for generating batches of temporal data. This class takes in a sequence of data-points gathered at equal intervals, along with time series parameters such as stride, length of history, etc., to produce batches for training/validation. # Arguments data: Indexable generator (such as list or Numpy array) containing consecutive data points (timesteps). The data should be at 2D, and axis 0 is expected to be the time dimension. targets: Targets corresponding to timesteps in `data`. It should have same length as `data`. length: Length of the output sequences (in number of timesteps). sampling_rate: Period between successive individual timesteps within sequences. For rate `r`, timesteps `data[i]`, `data[i-r]`, ... `data[i - length]` are used for create a sample sequence. stride: Period between successive output sequences. For stride `s`, consecutive output samples would be centered around `data[i]`, `data[i+s]`, `data[i+2*s]`, etc. start_index: Data points earlier than `start_index` will not be used in the output sequences. This is useful to reserve part of the data for test or validation. end_index: Data points later than `end_index` will not be used in the output sequences. This is useful to reserve part of the data for test or validation. shuffle: Whether to shuffle output samples, or instead draw them in chronological order. reverse: Boolean: if `true`, timesteps in each output sample will be in reverse chronological order. batch_size: Number of timeseries samples in each batch (except maybe the last one). # Returns A [Sequence](https://www.tensorflow.org/api_docs/python/tf/keras/utils/Sequence) instance. # Examples ```python from keras.preprocessing.sequence import TimeseriesGenerator import numpy as np data = np.array([[i] for i in range(50)]) targets = np.array([[i] for i in range(50)]) data_gen = TimeseriesGenerator(data, targets, length=10, sampling_rate=2, batch_size=2) assert len(data_gen) == 20 batch_0 = data_gen[0] x, y = batch_0 assert np.array_equal(x, np.array([[[0], [2], [4], [6], [8]], [[1], [3], [5], [7], [9]]])) assert np.array_equal(y, np.array([[10], [11]])) ``` """ pass
Ancestors
- keras_preprocessing.sequence.TimeseriesGenerator
- Sequence
Inherited members