---
name: bigquery-bigframes
metadata:
  category: BigDataAndAnalytics
description: >-
  Generates Python code using BigQuery DataFrames (BigFrames), the pandas/scikit-learn-style API over BigQuery. Use when writing BigFrames code or doing pandas-style dataframe/ML work against BigQuery (e.g. in a notebook). Don't use for SQL-first workflows or the google-cloud-bigquery client library — use bigquery-basics.
---

# BigFrames Development Standards

*   **Avoid `.to_pandas()`**: You MUST NOT use `.to_pandas()` to download the
    entire dataset into memory as this downloads all data to the client's
    memory, bypassing BigQuery's distributed computation and risking Out of
    Memory (OOM) errors. There are some exceptions:
    *  An error message explicitly requests you to use `to_pandas()`
    *  You are going to visualize the data, **and** the visualization library does not accept BigFrames Dataframe/Series instances. In this case, reduce the amount of data you are going to download before calling `.to_pandas()`
*   **Avoid `read_gbq()` for SQL**: Do not write SQL queries and execute them
    with `read_gbq()` to maintain the Pandas-like DataFrame abstraction and
    allow lazy executions. Use BigFrames Dataframe/Series methods instead.
*   **Use BigFrames ML package for Machine Learning Tasks**: Do not use
    Scikit-learn or other ML libraries with BigFrames dataframes because
    standard Scikit-learn models require bringing data into local client memory,
    whereas bigframes.ml delegates training directly to BigQuery's scalable ML
    engine. Import your tools/classes from `bigframes.ml`.
*   **Stay in the Cloud**: Perform data cleaning, transformation, and analysis via BigFrames methods to leverage BigQuery's scale.
*   **Accessors over UDFs/Lambdas**:
    *   Prefer built-in accessors (e.g., `df.col.str.*`, `df.col.dt.*`) over remote UDFs.
    *   **Do not use lambdas** with `Series.map()` or `DataFrame.apply()`.
*   **Schema Verification**: Do not assume schema of intermediate outputs. Check `.dtypes` after loading, and use `display()` with `.head()` or `.peek()`.
*   **Visualization**: BigFrames Dataframe mostly works directly with
    Matplotlib, Seaborn, and other plotting libraries. If your attempt didn't
    work, try using the `plot` accessor. If that didn't work either, you MUST
    sample or aggregate your data to make it small enough before calling
    `to_pandas()`.

# Model Development

* **Unlike Scikit-learn**: BigFrames' `predict()` method always returns a **DataFrame** containing both predictions and features (not just a series of predictions).
* **No `random_state`**: Do not pass a `random_state` argument when instantiating BigFrames ML models, because this parameter is not supported in the BigFrames ML package.
* **Automatic Scaling**: Do not use `OneHotEncoder` or `StandardScaler` unless explicitly requested (handled automatically).
* **Hyperparameter Tuning**: You must write custom loops (BigFrames lacks `GridSearchCV` or `RandomizedSearchCV`).
* **ARIMA Plus** (Forecasting):
    * Import from `bigframes.ml.forecasting`.
    * Sort data chronologically and split around a timepoint before training.
    * Prediction horizon must be less than or equal to training horizon.
* **PCA**: BigFrames' PCA class lacks simple `transform()` method. Use `predict()` instead.
* **Model Persistence**: To persist a model, use `model.to_gbq()`. To load a persisted model, use `bpd.read_gbq_model()`.
