
huggingface.co/transformers/model_doc/bert.html

 

BERT – transformers 4.3.0 documentation

past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) – Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head). Contains pre-computed key and value hidden states of the attention blocks that can be used to speed up sequential decoding.
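A minimal sketch (not part of the original page) of what that return value looks like, assuming a recent transformers release and torch are installed. The model is randomly initialized from the default configuration, so no checkpoint download is needed, and it is configured as a decoder because past_key_values is only returned when config.is_decoder=True.

```python
import torch
from transformers import BertConfig, BertModel

# Decoder-style configuration: the cache is only returned when
# config.is_decoder=True and caching is enabled.
config = BertConfig(is_decoder=True, use_cache=True)
model = BertModel(config)  # randomly initialized, defaults as in the list below
model.eval()

input_ids = torch.randint(0, config.vocab_size, (1, 8))  # batch_size=1, sequence_length=8
with torch.no_grad():
    outputs = model(input_ids)

# One entry per layer; each entry holds a key and a value tensor of shape
# (batch_size, num_attention_heads, sequence_length, hidden_size // num_attention_heads).
print(len(outputs.past_key_values))   # 12, i.e. config.num_hidden_layers
key, value = outputs.past_key_values[0]
print(key.shape, value.shape)         # torch.Size([1, 12, 8, 64]) for both
```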


  • vocab_size (int, optional, defaults to 30522) – Vocabulary size of the BERT model. Defines the number of different tokens that can be represented by the input_ids passed when calling BertModel or TFBertModel.

  • hidden_size (int, optional, defaults to 768) – Dimensionality of the encoder layers and the pooler layer.

  • num_hidden_layers (int, optional, defaults to 12) – Number of hidden layers in the Transformer encoder.

  • num_attention_heads (int, optional, defaults to 12) – Number of attention heads for each attention layer in the Transformer encoder.

  • intermediate_size (int, optional, defaults to 3072) – Dimensionality of the "intermediate" (often named feed-forward) layer in the Transformer encoder.

  • hidden_act (str or Callable, optional, defaults to "gelu") – The non-linear activation function (function or string) in the encoder and pooler. If string, "gelu", "relu", "silu" and "gelu_new" are supported.

  • hidden_dropout_prob (float, optional, defaults to 0.1) – The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.

  • attention_probs_dropout_prob (float, optional, defaults to 0.1) – The dropout ratio for the attention probabilities.

  • max_position_embeddings (int, optional, defaults to 512) – The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512, 1024, or 2048).

  • type_vocab_size (int, optional, defaults to 2) – The vocabulary size of the token_type_ids passed when calling BertModel or TFBertModel (see the tokenizer sketch after this list).

  • initializer_range (float, optional, defaults to 0.02) – The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

  • layer_norm_eps (float, optional, defaults to 1e-12) – The epsilon used by the layer normalization layers.

  • gradient_checkpointing (bool, optional, defaults to False) – If True, use gradient checkpointing to save memory at the expense of a slower backward pass.

  • position_embedding_type (str, optional, defaults to "absolute") – Type of position embedding. Choose one of "absolute", "relative_key", or "relative_key_query". For positional embeddings use "absolute". For more information on "relative_key", please refer to Self-Attention with Relative Position Representations (Shaw et al.). For more information on "relative_key_query", please refer to Method 4 in Improve Transformer Models with Better Relative Position Embeddings (Huang et al.).

  • use_cache (bool, optional, defaults to True) – Whether or not the model should return the last key/value attention states (not used by all models). Only relevant if config.is_decoder=True; see the configuration sketch after this list.
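For reference, here is a minimal sketch (not part of the original documentation) that spells out most of the documented defaults explicitly when building a BertConfig; calling BertConfig() with no arguments produces an equivalent configuration, and the resulting model is randomly initialized rather than pretrained.

```python
from transformers import BertConfig, BertModel

# Every value below is the documented default, so BertConfig() with no
# arguments yields an equivalent configuration.
config = BertConfig(
    vocab_size=30522,
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    intermediate_size=3072,
    hidden_act="gelu",
    hidden_dropout_prob=0.1,
    attention_probs_dropout_prob=0.1,
    max_position_embeddings=512,
    type_vocab_size=2,
    initializer_range=0.02,
    layer_norm_eps=1e-12,
    position_embedding_type="absolute",  # or "relative_key" / "relative_key_query"
    use_cache=True,
)

# Randomly initialized model built from this configuration (no pretrained weights).
model = BertModel(config)
print(config.hidden_size)                           # 768
print(sum(p.numel() for p in model.parameters()))   # roughly 110M parameters
```

In practice these values are usually filled in automatically by loading a pretrained checkpoint, e.g. BertModel.from_pretrained("bert-base-uncased").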

๋ฐ˜์‘ํ˜•
Done.