- API
- Quick Start
- Overview
- Authentication
- Organizations
- Requests
- Responses
- Status Codes
- Category Codes
- RESOURCES
- Projects
- Sources
- Datasets
- Samples
- Correlations
- Statistical Tests
- Configurations
- Composites
- Supervised
- Models
- Ensembles
- Logistic Regressions
- Deepnets
- Time Series
- Fusions
- Evaluations
- OptiMLs
- Unsupervised
- Clusters
- Anomaly Detectors
- Associations
- Topic Models
- PCA
- Predictions
- Predictions
- Batch Predictions
- Forecasts
- Centroids
- Batch Centroids
- Anomaly Scores
- Batch Scores
- Association Sets
- Topic Distributions
- Batch Distributions
- Projections
- Batch Projections
- WhizzML
- Libraries
- Scripts
- Executions
BigML.io—The BigML API
Documentation
Quick Start
Last Updated: Tuesday, 2019-01-29 16:28
This page helps you quickly create your first source, dataset, model, and prediction.
To get started with BigML.io you need:
- Your username and your API key.
- A terminal with curl or any other command-line tool that implements standard HTTPS methods.
- Some sample data. You can use:
  - A CSV file with some data. You can download the "Iris dataset" or "Diabetes dataset" from our servers.
  - Even easier, you can just use a URL that points to your data. For example, you can use https://static.bigml.com/csv/iris.csv or https://static.bigml.com/csv/diabetes.csv.
  - Easier still, you can just send some inline test data.
Jump to:
- Getting a Toy Data File
- Authentication
- Creating a Source
- Creating a Remote Source
- Creating an Inline Source
- Creating a Dataset
- Creating a Model
- Creating a Prediction
Getting a Toy Data File
If you do not have any dataset handy, you can download Fisher’s Iris dataset using the curl command below or by just clicking on the link.
curl -o iris.csv https://static.bigml.com/csv/iris.csv
$ Getting iris.csv
Authentication
The following snippet will help you set up an environment variable (i.e., BIGML_AUTH) to store your username and API key and avoid typing them again in the rest of examples. See this section for more details.
Note: Use your own username and API Key.
export BIGML_USERNAME=alfred
export BIGML_API_KEY=79138a622755a2383660347f895444b1eb927730
export BIGML_AUTH="username=$BIGML_USERNAME;api_key=$BIGML_API_KEY"
$ Setting Alfred's Authentication Parameters
Creating a Source
To create a new source, POST the file containing your data to the source base URL.
curl "https://au.bigml.io/source?$BIGML_AUTH" -F file=@iris.csv
> Creating a source
To create more sources simply repeat the curl command above using another file. Make sure to use the full path if the file is not in your current directory.
Creating a Remote Source
You can also create a source using a valid URL that points to your data or some public data. For example:
curl "https://au.bigml.io/source?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"remote": "https://static.bigml.com/csv/iris.csv"}'
> Creating a remote source
Creating an Inline Source
You can also create a source using some inline data. For example:
curl "https://au.bigml.io/source?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"data": "a,b,c,d\n1,2,3,4\n5,6,7,8"}'
> Creating an inline source
{
"code": 201,
"content_type": "application/octet-stream",
"created": "2012-03-01T05:29:07.217968",
"credits": 0.0087890625,
"file_name": "iris.csv",
"md5": "d1175c032e1042bec7f974c91e4a65ae",
"name": "iris.csv",
"number_of_datasets": 0,
"number_of_models": 0,
"number_of_predictions": 0,
"private": true,
"resource": "source/4f52824203ce893c0a000053",
"size": 4608,
"source_parser": {
"header": true,
"locale": "en-US",
"missing_tokens": [
"N/A",
"n/a",
"NA",
"na",
"-",
"?"
],
"quote": "\"",
"separator": ",",
"trim": true
},
"status": {
"code": 2,
"elapsed": 0,
"message": "The source creation has been started"
},
"type": 0,
"updated": "2012-03-01T05:29:07.217990"
}
< Example source JSON response
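The inline data argument is just a newline-delimited CSV string. A minimal Python sketch for building it from rows (the helper name is hypothetical, and it skips quoting/escaping, so anything beyond toy data should use the csv module instead):

```python
def to_inline_data(header, rows):
    """Join a header row and data rows into the newline-delimited
    CSV string expected by the source "data" argument."""
    lines = [",".join(map(str, header))]
    lines.extend(",".join(map(str, row)) for row in rows)
    return "\n".join(lines)

# Reproduces the inline data used in the curl example above.
payload = {"data": to_inline_data(["a", "b", "c", "d"],
                                  [[1, 2, 3, 4], [5, 6, 7, 8]])}
```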
Creating a Dataset
To create a dataset, POST the source/id from the previous step to the dataset base URL as follows.
curl "https://au.bigml.io/dataset?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"source": "source/4f52824203ce893c0a000053"}'
> Creating a dataset
{
"code": 201,
"columns": 5,
"created": "2012-03-04T02:58:11.910363",
"credits": 0.0087890625,
"fields": {
"000000": {
"column_number": 0,
"name": "sepal length",
"optype": "numeric"
},
"000001": {
"column_number": 1,
"name": "sepal width",
"optype": "numeric"
},
"000002": {
"column_number": 2,
"name": "petal length",
"optype": "numeric"
},
"000003": {
"column_number": 3,
"name": "petal width",
"optype": "numeric"
},
"000004": {
"column_number": 4,
"name": "species",
"optype": "categorical"
}
},
"name": "iris' dataset",
"number_of_models": 0,
"number_of_predictions": 0,
"private": true,
"resource": "dataset/4f52da4303ce896fe3000000",
"rows": 0,
"size": 4608,
"source": "source/4f52824203ce893c0a000053",
"source_parser": {
"header": true,
"locale": "en-US",
"missing_tokens": [
"N/A",
"n/a",
"NA",
"na",
"-",
"?"
],
"quote": "\"",
"separator": ",",
"trim": true
},
"source_status": true,
"status": {
"code": 1,
"message": "The dataset is being processed and will be created soon"
},
"updated": "2012-03-04T02:58:11.910387"
}
< Dataset
Creating a Model
To create a model, POST the dataset/id from the previous step to the model base URL. By default BigML.io will include all fields as predictors and will treat the last non-text field as the objective. In the Models Section you will learn how to customize the input fields or the objective field.
curl "https://au.bigml.io/model?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"dataset": "dataset/4f52da4303ce896fe3000000"}'
> Creating a model
{
"code": 201,
"columns": 5,
"created": "2012-03-04T03:46:53.033372",
"credits": 0.03515625,
"dataset": "dataset/4f52da4303ce896fe3000000",
"dataset_status": true,
"holdout": 0.0,
"input_fields": [],
"max_columns": 5,
"max_rows": 150,
"name": "iris' dataset model",
"number_of_predictions": 0,
"objective_fields": [],
"private": true,
"range": [
1,
150
],
"resource": "model/4f52e5ad03ce898798000000",
"rows": 150,
"size": 4608,
"source": "source/4f52824203ce893c0a000053",
"source_status": true,
"status": {
"code": 1,
"message": "The model is being processed and will be created soon"
},
"updated": "2012-03-04T03:46:53.033396"
}
< Model
Creating a Prediction
To create a prediction, POST the model/id and some input data to the prediction base URL.
curl "https://au.bigml.io/prediction?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"model": "model/4f52e5ad03ce898798000000", "input_data": {"000000": 5, "000001": 3}}'
> Creating a prediction
{
"code": 201,
"created": "2012-03-04T04:11:10.433996",
"credits": 0.01,
"dataset": "dataset/4f52da4303ce896fe3000000",
"dataset_status": true,
"fields": {
"000000": {
"column_number": 0,
"datatype": "double",
"name": "sepal length",
"optype": "numeric"
},
"000001": {
"column_number": 1,
"datatype": "double",
"name": "sepal width",
"optype": "numeric"
},
"000002": {
"column_number": 2,
"datatype": "double",
"name": "petal length",
"optype": "numeric"
},
"000004": {
"column_number": 4,
"datatype": "string",
"name": "species",
"optype": "categorical"
}
},
"input_data": {
"000000": 5,
"000001": 3
},
"model": "model/4f52e5ad03ce898798000000",
"model_status": true,
"name": "Prediction for species",
"objective_fields": [
"000004"
],
"prediction": {
"000004": "Iris-virginica"
},
"prediction_path": {
"bad_fields": [],
"next_predicates": [
{
"count": 100,
"field": "000002",
"operator": ">",
"value": 2.45
},
{
"count": 50,
"field": "000002",
"operator": "<=",
"value": 2.45
}
],
"path": [],
"unknown_fields": []
},
"private": true,
"resource": "prediction/4f52eb5e03ce898798000009",
"source": "source/4f52824203ce893c0a000053",
"source_status": true,
"status": {
"code": 5,
"message": "The prediction has been created"
},
"updated": "2012-03-04T04:11:10.434030"
}
< Prediction
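The input_data argument maps field ids (the keys of the dataset's fields object) to values; fields you do not supply are simply omitted from the request, as in the curl example above. A sketch of building the same request body in Python:

```python
import json

# Field ids come from the dataset's "fields" object:
# "000000" is sepal length, "000001" is sepal width.
input_data = {"000000": 5, "000001": 3}

body = json.dumps({"model": "model/4f52e5ad03ce898798000000",
                   "input_data": input_data})
```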
Overview
This page provides an introduction to BigML.io—The BigML API. A quick start guide for the impatient is here.
BigML.io is a Machine Learning REST API to easily build, run, and bring predictive models to your project. You can use BigML.io for basic supervised and unsupervised machine learning tasks and also to create sophisticated machine learning pipelines.
BigML.io is a REST-style API for creating and managing BigML resources programmatically. That is to say, using BigML.io you can create, retrieve, update and delete BigML resources using standard HTTP methods.
BigML.io gives you:
- Secure programmatic access to all your BigML resources.
- Fully white-box access to your datasets, models, clusters and anomaly detectors.
- Asynchronous creation of resources.
- Near real-time predictions.
Jump to:
- BigML Resources
- REST API
- HTTPS
- Base URL
- Version
- Summary of Resource URL Patterns
- Summary of HTTP Methods
- Resource ID
- Libraries
- Limits
BigML Resources
BigML.io gives you access to the following resources: project, source, dataset, sample, correlation, statisticaltest, configuration, and composite.
The four original BigML resources are: source, dataset, model, and prediction.
As shown in the picture below, the most basic flow consists of using some local (or remote) training data to create a source, then using the source to create a dataset, later using the dataset to create a model, and, finally, using the model and new input data to create a prediction.
The training data is usually in tabular format. Each row in the data represents an instance (or example) and each column a field (or attribute). These fields are also known as predictors or covariates.
When the machine learning task is supervised, one of the columns (usually the last one) represents a special attribute known as the objective field (or target) that assigns a label (or class) to each instance. Training data in this format is called labeled, and the task of learning from it is called supervised learning.
Once a source is created, it can be used to create multiple datasets. Likewise, a dataset can be used to create multiple models and a model can be used to create multiple predictions.
A model can be either a classification or a regression model depending on whether the objective field is respectively categorical or numeric.
Often an ensemble (or collection of models) can perform better than just a single model. Thus, a dataset can also be used to create an ensemble instead of a single model.
A dataset can also be used to create a cluster or an anomaly detector. Clusters and Anomaly Detectors are both built using unsupervised learning and therefore an objective field is not needed. In these cases, the training data is named unlabeled.
A centroid is to a cluster what a prediction is to a model. Likewise, an anomaly score is to an anomaly detector what a prediction is to a model.
There are scenarios where generating predictions for a relatively big collection of input data is very convenient. For these scenarios, BigML.io offers batch resources such as: batchprediction, batchcentroid, and batchanomalyscore. These resources take a dataset and respectively a model (or ensemble), a cluster, or an anomaly detector to create a new dataset that contains a new column with the corresponding prediction, centroid or anomaly score computed for each instance in the dataset.
When dealing with multiple projects, it's better to keep the resources that belong to each project separate. Thus, BigML also has a resource named project that helps you group together all the other resources. As you will see, you just need to assign a source to a pre-existing project and all the subsequent resources will be created in that project.
Note: In the snippets below you should substitute Alfred's username and API key for your own username and API Key.
REST API
BigML.io conforms to the design principles of Representational State Transfer (REST). BigML.io is entirely HTTPS-based.
You can create, read, update, and delete resources using the respective standard HTTP methods: POST, GET, PUT and DELETE.
All communication with BigML.io is JSON formatted except for source creation. Source creation is handled with an HTTP POST using the "multipart/form-data" content type.
HTTPS
All access to BigML.io must be performed over HTTPS. In this way communication between your application and BigML.io is encrypted and the integrity of traffic between both is verified.
Base URL
All BigML.io HTTP commands use the following base URL:
https://au.bigml.io
Base URL
Version
The BigML.io API is versioned using code names instead of version numbers. The current version name is "andromeda" so URLs for this version can be written to require this version as follows: https://au.bigml.io/andromeda/
Version
Specifying the version name is optional. If you omit the version name in your API requests, you will always get access to the latest API version. While we will do our best to make future API versions backward compatible it is possible that a future API release could cause your application to fail.
Specifying the API version in your HTTP calls will ensure that your application continues to function for the life cycle of the API release.
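Pinning the version is just a matter of inserting the code name into the base URL. A trivial sketch (the helper name is made up for illustration):

```python
def base_url(version=None):
    """Return the BigML.io base URL, optionally pinned to a version
    code name such as "andromeda"."""
    return "https://au.bigml.io/" + (version + "/" if version else "")
```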
Summary of Resource URL Patterns
BigML.io gives you access to the following resources: project, source, dataset, sample, correlation, statistical test, configuration, and composite.
https://au.bigml.io/project
https://au.bigml.io/source
https://au.bigml.io/dataset
https://au.bigml.io/sample
https://au.bigml.io/correlation
https://au.bigml.io/statisticaltest
https://au.bigml.io/configuration
https://au.bigml.io/composite
Resource URL Patterns
Summary of HTTP Methods
BigML.io uses the standard POST, GET, PUT, and DELETE HTTP methods to create, retrieve, update and delete resources, respectively.
| Operation | HTTP method | Semantics |
|---|---|---|
| CREATE | POST | Creates a new resource. Only certain fields are "postable". This method is not idempotent. Each valid POST request results in a new directly accessible resource. |
| RETRIEVE | GET | Retrieves either a specific resource or a list of resources. This method is idempotent. The content type of the resources is always "application/json; charset=utf-8". |
| UPDATE | PUT | Updates partial content of a resource. Only certain fields are "putable". This method is idempotent. |
| DELETE | DELETE | Deletes a resource. This method is idempotent. |
Resource ID
All BigML resources are identified by a name composed of two parts separated by a slash "/". The first part is the type of the resource and the second part is a 24-char unique identifier. See the examples below:
source/4f510d2003ce895676000069
dataset/4f510cfc03ce895676000040
model/4f51473203ce89b7ef000005
ensemble/523e9017035d0772e600b285
prediction/4f51473b03ce89b7ef000008
evaluation/50a30a453c19200bd1000839
Example resource ids
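Since the format is fixed, a small sketch can validate and split ids before using them in requests (the helper name is hypothetical, not part of the official bindings):

```python
import re

# "<type>/<24 alpha-numeric characters>", e.g. "model/4f51473203ce89b7ef000005".
RESOURCE_ID = re.compile(r"^([a-z]+)/([a-zA-Z0-9]{24})$")

def parse_resource_id(resource):
    """Split a resource id into its (type, identifier) parts,
    raising ValueError if the format does not match."""
    match = RESOURCE_ID.match(resource)
    if match is None:
        raise ValueError("not a valid resource id: %r" % resource)
    return match.group(1), match.group(2)
```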
Libraries
We have developed light-weight API bindings for Python, Node.js, and Java.
A number of libraries for many other languages have been developed by the growing BigML community: C#, Ruby, PHP, and iOS. If you are interested in library support for a particular language let us know. Or if you are motivated to develop a library, we will give you all the support that we can.
Limits
BigML.io is currently limited to 1,000,000 (one million) requests per API key per hour. Please email us if you have a specific use case that requires a higher rate limit.
Authentication
All requests to BigML.io must be authenticated with your username and API key, passed in the query string of each request. For example:
https://au.bigml.io/source?username=alfred;api_key=79138a622755a2383660347f895444b1eb927730
Example URL to list your sources
Your BigML API Key is a unique identifier that is assigned exclusively to your account. You can manage your BigML API Key in your account settings. Remember to keep your API key secret.
To use BigML.io from the command line, we recommend setting your username and API key as environment variables. Using environment variables is also an easy way to keep your credentials out of your source code.
Note: Use your own username and API Key.
export BIGML_USERNAME=alfred
export BIGML_API_KEY=79138a622755a2383660347f895444b1eb927730
export BIGML_AUTH="username=$BIGML_USERNAME;api_key=$BIGML_API_KEY"
$ Setting Alfred's Authentication Parameters
set BIGML_USERNAME=alfred
set BIGML_API_KEY=79138a622755a2383660347f895444b1eb927730
set BIGML_AUTH=username^=%BIGML_USERNAME%;api_key^=%BIGML_API_KEY%
$ Setting Alfred's Authentication Parameters in Windows
Here is an example of an authenticated API request to list the sources in your account from a command line.
curl "https://au.bigml.io/source?$BIGML_AUTH"
$ Example request to list your sources
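The same auth fragment can be assembled programmatically. A minimal Python sketch, assuming the BIGML_USERNAME and BIGML_API_KEY environment variables are set as in the shell example above (the function name is made up for illustration):

```python
import os

def bigml_auth(env=None):
    """Build the username/api_key query-string fragment from the
    BIGML_USERNAME and BIGML_API_KEY environment variables."""
    env = os.environ if env is None else env
    return "username=%s;api_key=%s" % (env["BIGML_USERNAME"], env["BIGML_API_KEY"])

# Using an explicit mapping here instead of os.environ for demonstration.
url = "https://au.bigml.io/source?" + bigml_auth(
    {"BIGML_USERNAME": "alfred",
     "BIGML_API_KEY": "79138a622755a2383660347f895444b1eb927730"})
```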
Alternative Keys
Alternative Keys allow you to give fine-grained access to your BigML resources. To create an alternative key you need to use BigML's web interface. There you can define what resources an alternative key can access and what operations (i.e., create, list, retrieve, update or delete) are allowed with it. This is useful in scenarios where you want to grant different roles and privileges to different applications. For example, an application for the IT folks that collects data and creates sources in BigML, another that is accessed by data scientists to create and evaluate models, and a third that is used by the marketing folks to create predictions.
You can read more about alternative keys here.
Organizations
An organization is a permission-based grouping of resources that helps you centralize your organization's resources. The permissions can be managed in a company-specific dashboard, and a user can be a member of multiple organizations at the same time. All resources are created under a specific project in the organization. A project can be configured as private or public, and you can control who has access to your projects and the resources under them.
Organization Member Types
There are 4 types of membership for an organization.
- A restricted member can create, retrieve, update, and delete resources in the organization project, and view public or private projects that the user has access to.
- A member has the restricted member privileges and can also create public or private projects in the organization. A public project can be accessed by any user of the organization, and a private project can be accessed only by those who have permission to the project.
- An administrator has full access to all projects and resources in the organization, and can manage the users and their membership of the organization.
- The owner has all the privileges that an administrator has plus billing, and is the only one who can update and delete the organization.
When a project is created or updated, certain organization users can be assigned the manage, write, or read permission. A user with the admin permission or an organization administrator can update and delete the project. A user with the write permission can create, retrieve, update, and delete resources in the project, and move their personal resources into it; however, once a personal resource is moved under an organization project, it cannot be moved back to the personal account. A user with the read permission can view all resources in the project but cannot update or delete them, or create new ones. The user who creates the project will automatically have the admin permission until the user is specifically removed from the project or the organization.
For example, let's say John, a user with the member role, is in the sales department. John has created a private project Sales Reports and added users Amy and Mike to the write permission list. Now John has been transferred to the marketing department and shouldn't have access to the Sales Reports project anymore. John can grant the admin permission to Amy or another organization user, allowing that user to update or delete the project in the future, and then remove himself from the list. If John is already removed or unavailable, this can also be done by any administrator.
Each user can have only one role. If a user is assigned multiple roles, only the role with the highest privilege will be considered. For example, if a user is assigned both the member and restricted member roles, the user's final role in the organization will be member.
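The highest-privilege rule can be sketched as a small lookup, using the privilege order described above (the names and ranks here are illustrative, not an official API):

```python
# Privilege order from lowest to highest, per the membership types above.
ROLE_RANK = {"restricted member": 0, "member": 1, "administrator": 2, "owner": 3}

def effective_role(assigned_roles):
    """When a user holds several roles, only the highest-privilege one counts."""
    return max(assigned_roles, key=ROLE_RANK.__getitem__)
```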
All resources created under the organization have the username and user_id properties filled with the owner's username and id, plus a separate creator property containing the username of the user who actually created the resource.
Authentication
In addition to your username and api_key, all access to BigML organization resources requires an additional parameter in the query string to authenticate.
As explained above, an organization resource must be created under a project, so in order to create, retrieve, update, or delete an organization resource you must pass project in the query string. Even if project is defined in the HTTP POST body, it will be ignored in favor of the project property in the query string. For HTTP GET requests that retrieve a list of resources, the project property is used as a filter, so the response contains only the resources under the specified project. For HTTP GET requests that retrieve an individual resource, and for HTTP PUT or HTTP DELETE requests, the project property is only used to authenticate to the organization; the resource does not have to exist in the authenticated project. If you have the read permission for a specific resource, you can retrieve it even if it is not in the project defined in the query string. Likewise, if you have the read-write permission for the resource, you can update or delete it. There is one exception, though.
For scripts or libraries, if the resource is shared across all projects in the organization (i.e., public_in_organization: true), then only the creator of the resource or administrators can update or delete it. Note that such resources will also be included in responses of the HTTP GET requests for retrieving a list of resources, regardless of which project the resources really belong to.
Finally, in order to retrieve an organization's projects themselves, you need to pass the organization parameter instead of the project parameter. See the examples below.
https://au.bigml.io/source?username=alfred;api_key=79138a622755a2383660347f895444b1eb927730;project=project/5948be694e17273079000000
Example URL to list your sources in an organization project
https://au.bigml.io/project?username=alfred;api_key=79138a622755a2383660347f895444b1eb927730;organization=organization/5728cce44e1727587a000000
Example URL to list your projects in an organization
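A sketch of assembling these query strings, following the two URL patterns above (the function name is hypothetical):

```python
def org_query_string(username, api_key, project=None, organization=None):
    """Build the query string for organization requests: pass project to
    work with resources under an organization project, or organization
    to list the organization's own projects."""
    parts = ["username=%s" % username, "api_key=%s" % api_key]
    if project:
        parts.append("project=%s" % project)
    if organization:
        parts.append("organization=%s" % organization)
    return ";".join(parts)
```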
Requests
BigML.io uses the standard POST, GET, PUT, and DELETE HTTP methods to create, retrieve, update, and delete individual resources, respectively. You can also list all your resources for each resource type.
Jump to:
- Creating a Resource
- Retrieving a Resource
- Updating a Resource
- Deleting a Resource
- Listing Resources
- Paginating Resources
- Filtering Resources
- Ordering Resources
- Webhooks
Creating a Resource
To create a new resource, you need to POST an object to the resource's base URL. The content-type must always be "application/json". The only exception is source creation, which requires the "multipart/form-data" content type.
For example, to create a model with a dataset, you can use curl like this:
curl "https://au.bigml.io/model/?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"dataset": "dataset/4f66a80803ce8940c5000006"}'
> Creating a model
The following is an example of what a request header would look like for the request:
POST /model?username=alfred;api_key=79138a622755a2383660347f895444b1eb927730; HTTP/1.1
Host: bigml.io
Content-Type: application/json
> Example model create request
BigML.io will return a newly created resource document if the request succeeds.
A number of required and optional arguments exist for each type of resource. You can see a detailed arguments list for each resource in their respective sections: project, source, dataset, sample, correlation, statistical test, configuration, and composite.
Retrieving a Resource
To retrieve a resource, you need to issue a HTTP GET request to the resource/id to be retrieved. Each resource has a unique identifier in the form resource/id, where resource is the type of the resource, such as dataset or model, and id is a string of 24 alpha-numeric characters that you can use to retrieve the resource or as a parameter to create other resources from it.
For example, using curl you can do something like this to retrieve a dataset:
curl "https://au.bigml.io/dataset/54d86680f0a5ea5fc0000011?$BIGML_AUTH"
$ Retrieving a dataset from the command line
The following is an example of what a request header would look like for a dataset GET request:
GET /dataset/54d86680f0a5ea5fc0000011?username=alfred;api_key=79138a622755a2383660347f895444b1eb927730; HTTP/1.1
Host: bigml.io
> Example dataset retrieve request
Once a resource has been successfully created, it will have properties. A number of properties exist for each type of resource. You can see a detailed property list for each resource in their respective sections: projects, sources, datasets, samples, correlations, statistical tests, configurations, and composites.
Updating a Resource
To update a resource, you need to PUT an object containing the fields that you want to update to the resource's base URL. The content-type must always be: "application/json".
If the request succeeds, BigML.io will respond with an HTTP 202 (Accepted) code and the updated resource in the body of the message.
For example, to update a project with a new name, a new category, a new description, and new tags you can use curl like this:
curl "https://au.bigml.io/project/54d9553bf0a5ea5fc0000016?$BIGML_AUTH" \
-X PUT \
-H 'content-type: application/json' \
-d '{"name": "My new Project",
"category": 3,
"description": "My first BigML Project",
"tags": ["fraud", "detection"]}'
$ Updating a project
The following is an example of what a request header would look like for the request:
PUT /project/54d9553bf0a5ea5fc0000016?username=alfred;api_key=79138a622755a2383660347f895444b1eb927730; HTTP/1.1
Host: bigml.io
Content-Type: application/json
> Example project update request
Deleting a Resource
To delete a resource, you need to issue a HTTP DELETE request to the resource/id to be deleted.
For example, using curl you can do something like this to delete a dataset:
curl -X DELETE "https://au.bigml.io/dataset/54d86680f0a5ea5fc0000011?$BIGML_AUTH"
$ Deleting a dataset from the command line
If the request succeeds you will not see anything on the command line unless you executed the command in verbose mode. Successful DELETEs will return HTTP 204 responses with no body.
HTTP/1.1 204 NO CONTENT
Content-Length: 0
< Successful response
Once you delete a resource, it is permanently deleted. That is, a delete request cannot be undone.
For example, if you try to delete a dataset a second time, or a dataset that does not exist, you will receive an error like this:
{
"code": 404,
"status": {
"code": -1201,
"extra": [
"A dataset matching the provided arguments could not be found"
],
"message": "Id does not exist"
}
}
Error trying to delete a dataset that does not exist
The following is an example of what a request header would look like for a dataset DELETE request:
DELETE /dataset/54d86680f0a5ea5fc0000011?username=alfred;api_key=79138a622755a2383660347f895444b1eb927730; HTTP/1.1
Host: bigml.io
> Example dataset delete request
Listing Resources
To list all the resources of a given type, you can use that type's base URL. By default, only the 20 most recent resources will be returned. You can see below how to change this number using the limit parameter.
You can get the list of each resource type directly in your browser using your own username and API key with the following links.
https://au.bigml.io/project?$BIGML_AUTH
https://au.bigml.io/source?$BIGML_AUTH
https://au.bigml.io/dataset?$BIGML_AUTH
https://au.bigml.io/sample?$BIGML_AUTH
https://au.bigml.io/correlation?$BIGML_AUTH
https://au.bigml.io/statisticaltest?$BIGML_AUTH
https://au.bigml.io/configuration?$BIGML_AUTH
https://au.bigml.io/composite?$BIGML_AUTH
> Listing resources from a browser
You can also easily list them from the command line using curl as follows:
curl "https://au.bigml.io/project?$BIGML_AUTH"
curl "https://au.bigml.io/source?$BIGML_AUTH"
curl "https://au.bigml.io/dataset?$BIGML_AUTH"
curl "https://au.bigml.io/sample?$BIGML_AUTH"
curl "https://au.bigml.io/correlation?$BIGML_AUTH"
curl "https://au.bigml.io/statisticaltest?$BIGML_AUTH"
curl "https://au.bigml.io/configuration?$BIGML_AUTH"
curl "https://au.bigml.io/composite?$BIGML_AUTH"
$ Listing resources from the command line
The following is an example of what a request header would look like when you request a list of models:
GET /model?username=alfred;api_key=79138a622755a2383660347f895444b1eb927730
Host: bigml.io
> Example model list request
A successful list request returns a JSON object with the following top-level properties:
| Property | Type | Description |
|---|---|---|
| meta | Object | Specifies in which page of the listing you are, how to get to the previous page and next page, and the total number of resources. |
| objects | Array of resources | A list of resources filtered and ordered according to the criteria that you supply in your request. See the filtering and ordering options for more details. |
Meta objects contain the limit, next, offset, previous, and total_count properties. For example, when you list your projects, they will be displayed as below:
{
"meta": {
"limit": 20,
"next": "/?username=alfred;api_key=79138a622755a2383660347f895444b1eb927730&offset=20",
"offset": 0,
"previous": null,
"total_count": 54
},
"objects": [
{
"category": 0,
"code": 200,
"created": "2015-01-27T22:51:57.488000",
"description": "",
"name": "Project 1",
"private": true,
"resource": "project/54c8168df0a5eae58c000019",
...
},
{
"category": 0,
"code": 200,
"created": "2015-01-29T04:08:12.696000",
"description": "",
"name": "Project 2",
"private": true,
"resource": "project/54c9b22cf0a5ea7765000000",
...
},
...
]
}
< Listing of projects template
Paginating Resources
Two parameters, limit and offset, can help you retrieve just a portion of your resources and paginate them.
If a limit is given, no more than that many resources will be returned, but possibly fewer if the request itself yields fewer resources.
For example, if you want to retrieve only the third and fourth latest projects:
curl "https://au.bigml.io/project?$BIGML_AUTH;limit=2;offset=2"
$ Paginating projects from the command line
To paginate results, you need to start off with an offset of zero, then increment it by whatever value you use for the limit each time. So if you wanted to return resources 1-10, then 11-20, then 21-30, etc., you would use "limit=10;offset=0", "limit=10;offset=10", and "limit=10;offset=20", respectively.
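The offset arithmetic above can be sketched as a small generator (the helper name is made up; in practice you can also just follow the meta.next URL from each response):

```python
def page_params(limit, total_count):
    """Yield (limit, offset) pairs covering total_count resources,
    starting at offset 0 and stepping by limit each page."""
    for offset in range(0, total_count, limit):
        yield limit, offset
```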
Filtering Resources
The listings of resources can be filtered by any of the fields that we labeled as filterable in the table describing the properties of a resource type. For example, to retrieve all the projects tagged with "fraud":
https://au.bigml.io/project?$BIGML_AUTH;tags__in=fraud
> Filtering projects by tag from a browser
curl "https://au.bigml.io/project?$BIGML_AUTH;tags__in=fraud"
$ Filtering projects by tag from the command line
In addition to exact match, there are more filters that you can use. To add one of these filters to your request you just need to append one of the suffixes in the following table to the name of the property that you want to use as a filter.
| Filter | Description |
|---|---|
| ! optional | Not. Example: !size=1048576 (<>1MB) |
| __gt optional | Greater than. Example: size__gt=1048576 (>1MB) |
| __gte optional | Greater than or equal to. Example: size__gte=1048576 (>=1MB) |
| __contains optional | Case-sensitive word match. Example: name__contains=test |
| __icontains optional | Case-insensitive word match. Example: name__icontains=test |
| __in optional | Case-sensitive list word match. Example: tags__in=fraud,test |
| __lt optional | Less than. Example: created__lt=2016-08-20T00:00:00.000000 (before 2016-08-20) |
| __lte optional | Less than or equal to. Example: created__lte=2016-08-20T00:00:00.000000 (before or on 2016-08-20) |
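Assembling a filtered listing URL is plain string work: append the suffix to the field name and join the pairs with semicolons, as the curl examples do. The helpers below are illustrative, not part of any BigML client library, and `$BIGML_AUTH` stands in for your username/api_key fragment:

```python
def filter_pair(field, value, suffix=""):
    """Append a filter suffix from the table above to a field name."""
    return "%s%s=%s" % (field, suffix, value)

def listing_url(base, auth, *pairs):
    """Join the auth fragment and filter pairs with ';'."""
    return base + "?" + ";".join((auth,) + pairs)

url = listing_url("https://au.bigml.io/project", "$BIGML_AUTH",
                  filter_pair("tags", "fraud,test", "__in"),
                  filter_pair("size", 1048576, "__gt"))
print(url)
# https://au.bigml.io/project?$BIGML_AUTH;tags__in=fraud,test;size__gt=1048576
```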
Ordering Resources
A list of resources can also be ordered by any of the fields that we labeled as sortable in the table describing the properties of a resource type.
For example, you can list your projects ordered by descending name directly in your browser, using your own username and API key, with the following link.
https://au.bigml.io/project?$BIGML_AUTH;order_by=-name
> Listing projects ordered by name from a browser
You can do the same thing from the command line using curl as follows:
curl "https://au.bigml.io/project?$BIGML_AUTH;order_by=-name"
$ Listing projects ordered by name from the command line
Webhooks
Webhooks allow you to build or set up apps which subscribe to the events triggered when the resource creation is complete or halted with an error. When the finished or error event is triggered, BigML.io can send an HTTP POST payload to the webhook's configured URL.
When you create a resource, you can specify the webhook parameter in the POST payload. For example, to create a model with a dataset, you can use curl like this:
curl "https://au.bigml.io/model/?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"dataset": "dataset/4f66a80803ce8940c5000006",
"webhook":{"url": "http://myhost/path/to/webhook"}}'
> Creating a model with a webhook
When the resource creation is complete, BigML.io calls the provided URL with an HTTP POST payload and expects to receive an HTTP 201 status code in response. Optionally, you can provide the secret parameter to secure your webhook. If provided, the value of secret is used as the key to generate an HMAC hex digest of the request body, which is sent in the X-BigML-Signature header. It uses the sha1 hash function, and the value always has the prefix sha1=. The headers also contain X-BigML-Delivery, a GUID that identifies the delivery, and User-Agent, which is `BigML.io`. The payload of the POST request is in JSON format, so be sure to accept Content-Type: application/json.
"webhook":{
"url": "http://myhost/path/to/webhook",
"secret": "mysecret"
}
> Example webhook parameter with secret
The following is an example of a POST request to the webhook server. Note that the headers contain X-BigML-Signature only when secret is provided.
POST /path/to/webhook HTTP/1.1
Host: localhost:800
X-BigML-Delivery: dd04ace6-c2c7-4c62-afff-d6514c016ad7
X-BigML-Signature: sha1=b7f0e0b9401f85ab00c8c8c575a5d71006788eec
User-Agent: BigML.io
Content-Type: application/json;charset=utf-8
Content-Length: 162
{
"event": "finished",
"message": "The model has been created",
"resource": "model/5ba2ccc54e172745a0000000",
"timestamp": "2018-09-19 22:25:11 GMT"
}
> Example POST request to the webhook server
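A webhook receiver can validate the X-BigML-Signature header by recomputing the HMAC over the raw request body. This is a sketch of such a check, assuming the body bytes arrive unmodified; `verify_signature` is an illustrative helper, not part of any BigML library:

```python
import hashlib
import hmac

def verify_signature(secret, body, header_value):
    """Recompute the sha1 HMAC of the raw request body and compare it,
    in constant time, with the X-BigML-Signature header value."""
    digest = hmac.new(secret.encode("utf-8"), body, hashlib.sha1).hexdigest()
    return hmac.compare_digest("sha1=" + digest, header_value)

# Simulate a delivery signed with the secret "mysecret".
body = b'{"event": "finished"}'
good = "sha1=" + hmac.new(b"mysecret", body, hashlib.sha1).hexdigest()
print(verify_signature("mysecret", body, good))     # True
print(verify_signature("notmysecret", body, good))  # False
```

Using `hmac.compare_digest` rather than `==` avoids leaking the correct signature through timing differences.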
The following is an example of the webhook property in the response body of a model resource.
"webhook": {
"delivery": {
"confirmation_id": "dd04ace6-c2c7-4c62-afff-d6514c016ad7",
"method": "queue",
"status": "delivered"
},
"event": "finished",
"secret": "mysecret",
"signature": "sha1=b7f0e0b9401f85ab00c8c8c575a5d71006788eec",
"timestamp": "2018-09-19T22:25:11.536000",
"url": "http://myhost/path/to/webhook"
}
> Example webhook property in a response
Responses
Last Updated: Tuesday, 2019-01-29 16:28
HTTP/1.1 201 CREATED
Server: nginx/1.0.5
Date: Sat, 03 Mar 2012 23:28:59 GMT
Content-Type: application/json; charset=utf-8
Transfer-Encoding: chunked
Connection: keep-alive
Location: https://au.bigml.io/dataset/4f5a59b203ce8945c200000a
< Example HTTP response
{
"code": 201,
"columns": 5,
"created": "2012-03-03T23:28:59.404542",
"credits": 0.0087890625,
"fields": {
"000000": {
"column_number": 0,
"name": "sepal length",
"optype": "numeric"
},
"000001": {
"column_number": 1,
"name": "sepal width",
"optype": "numeric"
},
"000002": {
"column_number": 2,
"name": "petal length",
"optype": "numeric"
},
"000003": {
"column_number": 3,
"name": "petal width",
"optype": "numeric"
},
"000004": {
"column_number": 4,
"name": "species",
"optype": "categorical"
}
},
"name": "iris' dataset",
"number_of_models": 0,
"number_of_predictions": 0,
"private": true,
"resource": "dataset/4f5a59b203ce8945c200000a",
"rows": 0,
"size": 4608,
"source": "source/4f52824203ce893c0a000053",
"source_parser": {
"header": true,
"locale": "en-US",
"missing_tokens": [
"N/A",
"n/a",
"NA",
"na",
"-",
"?"
],
"quote": "\"",
"separator": ",",
"trim": true
},
"source_status": true,
"status": {
"code": 1,
"message": "The dataset is being processed and will be created soon"
},
"updated": "2012-03-03T23:28:59.404561"
}
< Example JSON response
Error Codes
Errors also use conventional HTTP response headers. For example, here is the header for a 404 response:
HTTP/1.1 404 NOT FOUND
Content-Type: application/json; charset=utf-8
Date: Fri, 03 Mar 2012 23:29:18 GMT
Server: nginx/1.1.11
Content-Length: 169
Connection: keep-alive
< Example HTTP error response
{
"code": 404,
"status": {
"code": -1201,
"extra": [
"4f5157f1035d07306600005b"
],
"message": "Id does not exist"
}
}
< Example JSON error response
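Error bodies carry both the HTTP-level code and a finer-grained status code (the -1xxx and resource-specific codes listed in the next section), so a client can branch on either. A minimal parsing sketch over the error body above:

```python
import json

error_body = '''{
    "code": 404,
    "status": {
        "code": -1201,
        "extra": ["4f5157f1035d07306600005b"],
        "message": "Id does not exist"
    }
}'''
err = json.loads(error_body)
print(err["code"])               # 404
print(err["status"]["code"])     # -1201
print(err["status"]["message"])  # Id does not exist
```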
Status Codes
Last Updated: Tuesday, 2019-01-29 16:28
This section lists the different status codes BigML.io sends in responses. First, we list the HTTP status codes, then the codes that define a resource creation status, and finally detailed error codes for every resource.
Jump to:
- HTTP Status Code Summary
- Resource Status Code Summary
- Error Code Summary
- Source Error Code Summary
- Dataset Error Code Summary
- Download Dataset Unsuccessful Requests
- Sample Error Code Summary
- Correlation Error Code Summary
- Statistical Test Error Code Summary
- Model Error Code Summary
- Ensemble Error Code Summary
- Logistic Regression Error Code Summary
- Cluster Error Code Summary
- Anomaly Error Code Summary
- Association Error Code Summary
- Topic Model Error Code Summary
- PCA Error Code Summary
- Time Series Error Code Summary
- Deepnet Error Code Summary
- Composite Error Code Summary
- Fusion Error Code Summary
- OptiML Error Code Summary
- Prediction Error Code Summary
- Centroid Error Code Summary
- Anomaly Score Error Code Summary
- Association Set Error Code Summary
- Topic Distribution Error Code Summary
- Projection Error Code Summary
- Forecast Error Code Summary
- Batch Prediction Error Code Summary
- Batch Centroid Error Code Summary
- Batch Anomaly Score Error Code Summary
- Batch Topic Distribution Error Code Summary
- Batch Projection Error Code Summary
- Evaluation Error Code Summary
- WhizzML Library Error Code Summary
- WhizzML Script Error Code Summary
- WhizzML Execution Error Code Summary
HTTP Status Code Summary
BigML.io returns meaningful HTTP status codes for every request. The same status code is returned in both the HTTP header of the response and in the JSON body.
| Code | Status | Semantics |
|---|---|---|
| 200 | OK | Your request was successful and the JSON response should include the resource that you requested. |
| 201 | Created | A new resource was created. You can get the new resource's complete location through the HTTP headers, or its resource/id through the resource key of the JSON response. |
| 202 | Accepted | Received after sending a request to update a resource that was processed successfully. |
| 204 | No Content | Received after sending a request to delete a resource that was processed successfully. |
| 400 | Bad Request | Your request is malformed, missed a required parameter, or used an invalid value as a parameter. |
| 401 | Unauthorized | Your request used the wrong username or API key. |
| 402 | Payment Required | Your subscription plan does not allow you to perform this action because you have exceeded your subscription limit. Please wait until your running tasks complete or upgrade your plan. |
| 403 | Forbidden | Your request is trying to access a resource that you do not own. |
| 404 | Not Found | The resource that you requested or used as a parameter in a request does not exist anymore. |
| 405 | Not Allowed | Your request is trying to use an HTTP method that is not supported or to change fields of a resource that cannot be modified. |
| 411 | Length Required | Your request is trying to PUT or POST without sending any content or specifying its length. |
| 413 | Request Entity Too Large | The size of the content in your request is greater than what BigML.io supports for PUT or POST requests. |
| 415 | Unsupported Media Type | Your request claims to POST 'multipart/form-data' content but is actually sending the wrong content type. |
| 429 | Too Many Requests | You have sent too many requests in a given amount of time. |
| 500 | Internal Server Error | Your request could not be processed because something went wrong on BigML's end. |
| 503 | Service Unavailable | BigML.io is undergoing maintenance. |
Resource Status Code Summary
The creation of resources involves a computational task that can last a few seconds or a few days depending on the size of the data. Consequently, some HTTP POST requests to create a resource may launch an asynchronous task and return immediately. In order to know the completion status of this task, each resource has a status field that reports the current state of the request. This status is useful for monitoring progress during creation. The possible states for a task are:
| Code | Status | Semantics |
|---|---|---|
| 0 | Waiting | The resource is waiting for another resource to be finished before BigML.io can start processing it. |
| 1 | Queued | The task that is going to create the resource has been accepted but has been queued because there are other tasks using the system. |
| 2 | Started | The task to create the resource has started and you should expect partial results soon. |
| 3 | In Progress | The task has computed the first partial resource but still needs to do more computations. |
| 4 | Summarized | This status is specific to datasets. It happens when the dataset has been computed but its data has not been serialized yet. The dataset is final, but you cannot yet use it to create a model; if you do, the model will wait until the dataset is finished. |
| 5 | Finished | The task is completed and the resource is final. |
| -1 | Faulty | The task has failed. We either could not process the task as you requested it or have an internal issue. |
| -2 | Unknown | The task has reached a state that we cannot verify at this time. This is a status you should never see unless BigML.io suffers a major outage. |
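Since creation is asynchronous, a client typically polls the resource's status until it reaches a terminal state. The predicate below sketches the stopping condition implied by the table (the constants mirror the codes above; the helper itself is illustrative, not part of any BigML client):

```python
# Resource status codes from the table above.
WAITING, QUEUED, STARTED, IN_PROGRESS, SUMMARIZED, FINISHED = 0, 1, 2, 3, 4, 5
FAULTY, UNKNOWN = -1, -2

def is_terminal(code):
    """A polling loop can stop once the resource is Finished, Faulty, or
    Unknown; any other status means the task is still moving forward."""
    return code == FINISHED or code < 0

print([c for c in (0, 1, 2, 3, 4, 5, -1, -2) if is_terminal(c)])  # [5, -1, -2]
```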
Error Code Summary
This is the list of possible general error codes you can receive from BigML.io when managing any type of resource.
| Error Code | Semantics |
|---|---|
| -1100 | Unauthorized use |
| -1101 | Not enough credits |
| -1102 | Wrong resource |
| -1104 | Cloned resources cannot be public |
| -1105 | Price cannot be changed |
| -1107 | Too many projects |
| -1108 | Too many tasks |
| -1109 | Subscription required |
| -1200 | Missing parameter |
| -1201 | Invalid Id |
| -1203 | Field Error |
| -1204 | Bad Request |
| -1205 | Value Error |
| -1206 | Validation Error |
| -1207 | Unsupported Format |
| -1208 | Invalid Sort Error |
Source Error Code Summary
This is the list of possible specific error codes you can receive from BigML.io managing sources.
| Error Code | Semantics |
|---|---|
| -2000 | This source cannot be read properly |
| -2001 | Bad request to create a source |
| -2002 | The source could not be created |
| -2003 | The source cannot be retrieved |
| -2004 | The source cannot be deleted now |
| -2005 | Faulty source |
Dataset Error Code Summary
This is the list of possible specific error codes you can receive from BigML.io managing datasets.
| Error Code | Semantics |
|---|---|
| -3000 | The source is not ready yet |
| -3001 | Bad request to create a dataset |
| -3002 | The dataset cannot be created |
| -30021 | The dataset cannot be created now |
| -3003 | The dataset cannot be retrieved |
| -3004 | The dataset cannot be deleted now |
| -3005 | Faulty dataset |
| -3006 | The dataset could not be created properly. This happens when a 1-click model has been requested and the corresponding dataset could not be created |
| -3008 | The dataset could not be cloned properly. This happens when there is an internal error when you try to buy or clone another user's dataset |
| -3010 | The clone of the origin dataset is not finished yet |
| -3020 | The source does not contain readable data |
| -3030 | The source cannot be parsed |
| -3040 | The filter expression is not valid |
Download Dataset Unsuccessful Requests
This is the list of possible specific error codes you can receive from BigML.io managing downloads.
| Error Code | Semantics |
|---|---|
| -9000 | The dataset export is not ready yet |
| -9001 | Bad request to perform a dataset export |
| -9002 | The dataset export cannot be performed |
| -90021 | The dataset export cannot be performed now |
| -9003 | The dataset export cannot be retrieved now |
| -9004 | The dataset export cannot be deleted now |
| -9005 | The dataset export could not be performed |
| -9006 | Dataset exports aren't available for cloned datasets |
Sample Error Code Summary
This is the list of possible specific error codes you can receive from BigML.io managing samples.
| Error Code | Semantics |
|---|---|
| -16000 | The sample is not ready yet |
| -16001 | Bad request to create a sample |
| -16002 | Your sample cannot be created |
| -16021 | Your sample cannot be created now |
| -16003 | The sample cannot be retrieved now |
| -16004 | Cannot delete sample now |
| -16005 | The sample could not be created |
Correlation Error Code Summary
This is the list of possible specific error codes you can receive from BigML.io managing correlations.
| Error Code | Semantics |
|---|---|
| -18000 | The correlation is not ready yet |
| -18001 | Bad request to create a correlation |
| -18002 | Your correlation cannot be created |
| -18021 | Your correlation cannot be created now |
| -18003 | The correlation cannot be retrieved now |
| -18004 | Cannot delete correlation now |
| -18005 | The correlation could not be created |
Statistical Test Error Code Summary
This is the list of possible specific error codes you can receive from BigML.io managing statistical tests.
| Error Code | Semantics |
|---|---|
| -17000 | The statistical test is not ready yet |
| -17001 | Bad request to create a statistical test |
| -17002 | Your statistical test cannot be created |
| -17021 | Your statistical test cannot be created now |
| -17003 | The statistical test cannot be retrieved now |
| -17004 | Cannot delete statistical test now |
| -17005 | The statistical test could not be created |
Model Error Code Summary
This is the list of possible specific error codes you can receive from BigML.io managing models.
| Error Code | Semantics |
|---|---|
| -4000 | The dataset is not ready. A one-click model has been requested but the corresponding dataset is not ready yet |
| -4001 | Bad request to create a model |
| -4002 | The model cannot be created |
| -40021 | The model cannot be created now |
| -4003 | The model cannot be retrieved |
| -4004 | The model cannot be deleted now |
| -4005 | Faulty model |
| -4006 | The dataset is empty |
| -4007 | The input fields are empty |
| -4008 | The model could not be cloned properly. This happens when there is an internal error when you try to buy or clone another user's model |
| -4008 | Wrong objective field |
| -6060 | The (sampled) input dataset is empty |
Ensemble Error Code Summary
This is the list of possible specific error codes you can receive from BigML.io managing ensembles.
| Error Code | Semantics |
|---|---|
| -8001 | Bad request to create an ensemble |
| -8002 | The ensemble cannot be created |
| -80021 | The ensemble cannot be created now |
| -8003 | The ensemble cannot be retrieved now |
| -8004 | The ensemble cannot be deleted now |
| -8005 | The ensemble could not be created |
| -8008 | The ensemble could not be cloned properly |
Logistic Regression Error Code Summary
This is the list of possible specific error codes you can receive from BigML.io managing logistic regressions.
| Error Code | Semantics |
|---|---|
| -22000 | The logistic regression is not ready yet |
| -22001 | Bad request to create a logistic regression |
| -22002 | Your logistic regression cannot be created |
| -22021 | Your logistic regression cannot be created now |
| -22003 | The logistic regression cannot be retrieved now |
| -22004 | Cannot delete logistic regression now |
| -22005 | The logistic regression could not be created |
Cluster Error Code Summary
This is the list of possible specific error codes you can receive from BigML.io managing clusters.
| Error Code | Semantics |
|---|---|
| -10000 | The cluster is not ready yet |
| -10001 | Bad request to create a cluster |
| -10002 | The cluster cannot be created |
| -10003 | The cluster cannot be created now |
| -10004 | The cluster cannot be retrieved now |
| -10005 | The cluster cannot be deleted now |
| -10008 | The cluster could not be cloned properly |
Anomaly Error Code Summary
This is the list of possible specific error codes you can receive from BigML.io managing anomaly detectors.
| Error Code | Semantics |
|---|---|
| -13000 | The anomaly detector is not ready yet |
| -13001 | Bad request to create an anomaly detector |
| -13002 | The anomaly detector cannot be created |
| -13021 | The anomaly detector cannot be created now |
| -13003 | The anomaly detector cannot be retrieved now |
| -13004 | The anomaly detector cannot be deleted now |
| -13005 | The anomaly detector could not be created |
| -13008 | The anomaly detector could not be cloned properly |
Association Error Code Summary
This is the list of possible specific error codes you can receive from BigML.io managing associations.
| Error Code | Semantics |
|---|---|
| -23000 | The association is not ready yet |
| -23001 | Bad request to create an association |
| -23002 | Your association cannot be created |
| -23021 | Your association cannot be created now |
| -23003 | The association cannot be retrieved now |
| -23004 | Cannot delete association now |
| -23005 | The association could not be created |
Topic Model Error Code Summary
This is the list of possible specific error codes you can receive from BigML.io managing topic models.
| Error Code | Semantics |
|---|---|
| -26000 | The topic model is not ready yet |
| -26001 | Bad request to create a topic model |
| -26002 | Your topic model cannot be created |
| -26021 | Your topic model cannot be created now |
| -26003 | The topic model cannot be retrieved now |
| -26004 | Cannot delete topic model now |
| -26005 | The topic model could not be created |
PCA Error Code Summary
This is the list of possible specific error codes you can receive from BigML.io managing PCAs.
| Error Code | Semantics |
|---|---|
| -37000 | The PCA is not ready yet |
| -37001 | Bad request to create a PCA |
| -37002 | Your PCA cannot be created |
| -37021 | Your PCA cannot be created now |
| -37003 | The PCA cannot be retrieved now |
| -37004 | Cannot delete PCA now |
| -37005 | The PCA could not be created |
Time Series Error Code Summary
This is the list of possible specific error codes you can receive from BigML.io managing time series.
| Error Code | Semantics |
|---|---|
| -30000 | The time series is not ready yet |
| -30001 | Bad request to create a time series |
| -30002 | Your time series cannot be created |
| -30021 | Your time series cannot be created now |
| -30003 | The time series cannot be retrieved now |
| -30004 | Cannot delete time series now |
| -30005 | The time series could not be created |
Deepnet Error Code Summary
This is the list of possible specific error codes you can receive from BigML.io managing deepnets.
| Error Code | Semantics |
|---|---|
| -33001 | Bad request to create a deepnet |
| -33002 | The deepnet cannot be created |
| -330021 | The deepnet cannot be created now |
| -33003 | The deepnet cannot be retrieved now |
| -33004 | The deepnet cannot be deleted now |
| -33005 | The deepnet could not be created |
Composite Error Code Summary
This is the list of possible specific error codes you can receive from BigML.io managing composites.
| Error Code | Semantics |
|---|---|
| -34001 | Bad request to create a composite |
| -34002 | The composite cannot be created |
| -340021 | The composite cannot be created now |
| -34003 | The composite cannot be retrieved now |
| -34004 | The composite cannot be deleted now |
| -34005 | The composite could not be created |
Fusion Error Code Summary
This is the list of possible specific error codes you can receive from BigML.io managing fusions.
| Error Code | Semantics |
|---|---|
| -35001 | Bad request to create a fusion |
| -35002 | The fusion cannot be created |
| -350021 | The fusion cannot be created now |
| -35003 | The fusion cannot be retrieved now |
| -35004 | The fusion cannot be deleted now |
| -35005 | The fusion could not be created |
OptiML Error Code Summary
This is the list of possible specific error codes you can receive from BigML.io managing OptiMLs.
| Error Code | Semantics |
|---|---|
| -36001 | Bad request to create an OptiML |
| -36002 | The OptiML cannot be created |
| -360021 | The OptiML cannot be created now |
| -36003 | The OptiML cannot be retrieved now |
| -36004 | The OptiML cannot be deleted now |
| -36005 | The OptiML could not be created |
Prediction Error Code Summary
This is the list of possible specific error codes you can receive from BigML.io managing predictions.
| Error Code | Semantics |
|---|---|
| -5000 | This model is not ready yet |
| -5001 | Bad request to create a prediction |
| -5002 | The prediction cannot be created |
| -5003 | The prediction cannot be retrieved |
| -5004 | The prediction cannot be deleted now |
| -5005 | The prediction could not be created |
Centroid Error Code Summary
This is the list of possible specific error codes you can receive from BigML.io managing centroids.
| Error Code | Semantics |
|---|---|
| -11001 | Bad request to create a centroid |
| -11002 | Your centroid cannot be created now |
| -11003 | The centroid cannot be retrieved now |
| -11004 | Cannot delete centroid now |
| -11005 | The centroid could not be created |
Anomaly Score Error Code Summary
This is the list of possible specific error codes you can receive from BigML.io managing anomaly scores.
| Error Code | Semantics |
|---|---|
| -14001 | Bad request to create an anomaly score |
| -14002 | Your anomaly score cannot be created now |
| -14003 | The anomaly score cannot be retrieved now |
| -14004 | Cannot delete anomaly score now |
| -14005 | The anomaly score could not be created |
Association Set Error Code Summary
This is the list of possible specific error codes you can receive from BigML.io managing association sets.
| Error Code | Semantics |
|---|---|
| -24001 | Bad request to create an association set |
| -24002 | Your association set cannot be created now |
| -24003 | The association set cannot be retrieved now |
| -24004 | Cannot delete association set now |
| -24005 | The association set could not be created |
Topic Distribution Error Code Summary
This is the list of possible specific error codes you can receive from BigML.io managing topic distributions.
| Error Code | Semantics |
|---|---|
| -27001 | Bad request to create a topic distribution |
| -27002 | Your topic distribution cannot be created now |
| -27003 | The topic distribution cannot be retrieved now |
| -27004 | Cannot delete topic distribution now |
| -27005 | The topic distribution could not be created |
Projection Error Code Summary
This is the list of possible specific error codes you can receive from BigML.io managing projections.
| Error Code | Semantics |
|---|---|
| -38001 | Bad request to create a projection |
| -38002 | Your projection cannot be created now |
| -38003 | The projection cannot be retrieved now |
| -38004 | Cannot delete projection now |
| -38005 | The projection could not be created |
Forecast Error Code Summary
This is the list of possible specific error codes you can receive from BigML.io managing forecasts.
| Error Code | Semantics |
|---|---|
| -31001 | Bad request to create a forecast |
| -31002 | Your forecast cannot be created now |
| -31003 | The forecast cannot be retrieved now |
| -31004 | Cannot delete forecast now |
| -31005 | The forecast could not be created |
Batch Prediction Error Code Summary
This is the list of possible specific error codes you can receive from BigML.io managing batch predictions.
| Error Code | Semantics |
|---|---|
| -6001 | Bad request to perform a batch prediction |
| -6002 | The batch prediction cannot be performed |
| -60021 | The batch prediction cannot be performed now |
| -6003 | The batch prediction cannot be retrieved now |
| -6004 | The batch prediction cannot be deleted now |
| -6005 | The batch prediction could not be performed |
Batch Centroid Error Code Summary
This is the list of possible specific error codes you can receive from BigML.io managing batch centroids.
| Error Code | Semantics |
|---|---|
| -12001 | Bad request to perform a batch centroid |
| -12002 | The batch centroid cannot be performed |
| -12021 | The batch centroid cannot be performed now |
| -12003 | The batch centroid cannot be retrieved now |
| -12004 | The batch centroid cannot be deleted now |
| -12005 | The batch centroid could not be performed |
Batch Anomaly Score Error Code Summary
This is the list of possible specific error codes you can receive from BigML.io managing batch anomaly scores.
| Error Code | Semantics |
|---|---|
| -15001 | Bad request to perform a batch anomaly score |
| -15002 | The batch anomaly score cannot be performed |
| -15021 | The batch anomaly score cannot be performed now |
| -15003 | The batch anomaly score cannot be retrieved now |
| -15004 | The batch anomaly score cannot be deleted now |
| -15005 | The batch anomaly score could not be performed |
Batch Topic Distribution Error Code Summary
This is the list of possible specific error codes you can receive from BigML.io managing batch topic distributions.
| Error Code | Semantics |
|---|---|
| -28001 | Bad request to perform a batch topic distribution |
| -28002 | The batch topic distribution cannot be performed |
| -28021 | The batch topic distribution cannot be performed now |
| -28003 | The batch topic distribution cannot be retrieved now |
| -28004 | The batch topic distribution cannot be deleted now |
| -28005 | The batch topic distribution could not be performed |
Batch Projection Error Code Summary
This is the list of possible specific error codes you can receive from BigML.io managing batch projections.
| Error Code | Semantics |
|---|---|
| -39001 | Bad request to perform a batch projection |
| -39002 | The batch projection cannot be performed |
| -39021 | The batch projection cannot be performed now |
| -39003 | The batch projection cannot be retrieved now |
| -39004 | The batch projection cannot be deleted now |
| -39005 | The batch projection could not be performed |
Evaluation Error Code Summary
This is the list of possible specific error codes you can receive from BigML.io managing evaluations.
| Error Code | Semantics |
|---|---|
| -7001 | Bad request to perform an evaluation |
| -7002 | The evaluation cannot be performed |
| -70021 | The evaluation cannot be performed now |
| -7003 | The evaluation cannot be retrieved now |
| -7004 | The evaluation cannot be deleted now |
| -7005 | The evaluation could not be performed |
WhizzML Library Error Code Summary
This is the list of possible specific error codes you can receive from BigML.io managing libraries.
| Error Code | Semantics |
|---|---|
| -19000 | The library is not ready yet |
| -19001 | Bad request to create a library |
| -19002 | Your library cannot be created |
| -19021 | Your library cannot be created now |
| -19003 | The library cannot be retrieved now |
| -19004 | Cannot delete library now |
| -19005 | The library could not be created |
WhizzML Script Error Code Summary
This is the list of possible specific error codes you can receive from BigML.io managing scripts.
| Error Code | Semantics |
|---|---|
| -20000 | The script is not ready yet |
| -20001 | Bad request to create a script |
| -20002 | Your script cannot be created |
| -20021 | Your script cannot be created now |
| -20003 | The script cannot be retrieved now |
| -20004 | Cannot delete script now |
| -20005 | The script could not be created |
WhizzML Execution Error Code Summary
This is the list of possible specific error codes you can receive from BigML.io managing executions.
| Error Code | Semantics |
|---|---|
| -21000 | The execution is not ready yet |
| -21001 | Bad request to create an execution |
| -21002 | Your execution cannot be created |
| -21021 | Your execution cannot be created now |
| -21003 | The execution cannot be retrieved now |
| -21004 | Cannot delete execution now |
| -21005 | The execution could not be created |
Category Codes
Last Updated: Tuesday, 2019-01-29 16:28
| Category | Description |
|---|---|
| -1 | Uncategorized |
| 0 | Miscellaneous |
| 1 | Automotive, Engineering & Manufacturing |
| 2 | Energy, Oil & Gas |
| 3 | Banking & Finance |
| 4 | Fraud & Crime |
| 5 | Healthcare |
| 6 | Physical, Earth & Life Sciences |
| 7 | Consumer & Retail |
| 8 | Sports & Games |
| 9 | Demographics & Surveys |
| 10 | Aerospace & Defense |
| 11 | Chemical & Pharmaceutical |
| 12 | Higher Education & Scientific Research |
| 13 | Human Resources & Psychology |
| 14 | Insurance |
| 15 | Law & Order |
| 16 | Media, Marketing & Advertising |
| 17 | Public Sector & Nonprofit |
| 18 | Professional Services |
| 19 | Technology & Communications |
| 20 | Transportation & Logistics |
| 21 | Travel & Leisure |
| 22 | Utilities |
| Category | Description |
|---|---|
| -1 | Uncategorized |
| 0 | Miscellaneous |
| 1 | Advanced Workflow |
| 2 | Anomaly Detection |
| 3 | Association Discovery |
| 4 | Basic Workflow |
| 5 | Boosting |
| 6 | Classification |
| 7 | Classification/Regression |
| 8 | Correlations |
| 9 | Cluster Analysis |
| 10 | Data Transformation |
| 11 | Evaluation |
| 12 | Feature Engineering |
| 13 | Feature Extraction |
| 14 | Feature Selection |
| 15 | Hyperparameter Optimization |
| 16 | Model Selection |
| 17 | Prediction and Scoring |
| 18 | Regression |
| 19 | Stacking |
| 20 | Statistical Test |
Projects
Last Updated: Tuesday, 2019-01-29 16:28
A project is an abstract resource that helps you group related BigML resources together.
A project must have a name and optionally a category, description, and multiple tags to help you organize and retrieve your projects.
When you create a new source you can assign it to a pre-existing project. All the subsequent resources created using that source will belong to the same project.
All the resources created within a project will inherit the name, description, and tags of the project unless you change them when you create the resources or update them later.
When you select a project on your BigML dashboard, you will only see the BigML resources related to that project. Using your BigML dashboard you can also create, update, and delete projects (and all their associated resources).
BigML.io allows you to create, retrieve, update, and delete your projects. You can also list all of your projects.
Jump to:
- Project Base URL
- Creating a Project
- Project Arguments
- Retrieving a Project
- Project Properties
- Updating a Project
- Deleting a Project
- Listing Projects
Project Base URL
You can use the following base URL to create, retrieve, update, and delete projects. https://au.bigml.io/project
Project base URL
All requests to manage your projects must use HTTPS and be authenticated using your username and API key to verify your identity. See this section for more details.
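The examples below append $BIGML_AUTH to every request. A minimal sketch of setting it up (the username and API key shown are placeholders; the username=...;api_key=... query-string format matches the curl examples in this document):

```shell
# Placeholder credentials; substitute your own from your BigML account page.
export BIGML_USERNAME=alfred
export BIGML_API_KEY=79138a622755a2383660347f895444b1eb927730
export BIGML_AUTH="username=$BIGML_USERNAME;api_key=$BIGML_API_KEY"
echo "$BIGML_AUTH"
```

With this in place, `curl "https://au.bigml.io/project?$BIGML_AUTH"` authenticates without retyping credentials.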
Creating a Project
To create a new project, you just need to POST the name you want to give to the new project to the project base URL.
You can easily do this using curl.
curl "https://au.bigml.io/project?$BIGML_AUTH" \
-H 'content-type: application/json' \
-d '{"name": "My First Project"}'
> Creating a project
BigML.io will return a newly created project document, if the request succeeded.
{
"category":0,
"created":"2015-02-02T07:49:20.226764",
"description":"",
"name":"My First Project",
"private":true,
"resource":"project/54d9553bf0a5ea5fc0000016",
"stats":{
"anomalies":{
"count":0
},
"anomalyscores":{
"count":0
},
"batchanomalyscores":{
"count":0
},
"batchcentroids":{
"count":0
},
"batchpredictions":{
"count":0
},
"batchtopicdistributions":{
"count":0
},
"centroids":{
"count":0
},
"clusters":{
"count":0
},
"configurations":{
"count":0
},
"correlations":{
"count":0
},
"datasets":{
"count":0
},
"ensembles":{
"count":0
},
"evaluations":{
"count":0
},
"models":{
"count":0
},
"predictions":{
"count":0
},
"sources":{
"count":0
},
"statisticaltests":{
"count":0
},
"topicmodels":{
"count":0
},
"topicdistributions":{
"count":0
}
},
"status":{
"code":5,
"message":"The project has been created"
},
"tags":[],
"updated":"2015-02-02T07:49:20.226781"
}
< Example project JSON response
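Since the status attribute tells you whether creation finished (code 5 in the response above means the project has been created), here is a sketch of checking it locally from a saved response; python3 availability is an assumption:

```shell
# Save a (truncated) creation response and read status.code before using
# the new project; 5 means the resource is finished.
cat > project.json <<'EOF'
{"resource": "project/54d9553bf0a5ea5fc0000016",
 "status": {"code": 5, "message": "The project has been created"}}
EOF
STATUS=$(python3 -c "import json; print(json.load(open('project.json'))['status']['code'])")
echo "status: $STATUS"
```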
In addition to the name, you can also use the following arguments.
Project Arguments
| Argument | Type | Description |
|---|---|---|
|
category
optional |
Integer, default is 0 |
The category that best describes the project. See the category codes for the complete list of categories.
Example: 1 |
|
description
optional |
String |
A description of the project up to 8192 characters long.
Example: "This is a description of my new project" |
|
name
optional |
String, default is Project Number |
The name you want to give to the new project.
Example: "my new project" |
|
tags
optional |
Array of Strings |
A list of strings that help classify and index your project.
Example: ["best customers", "2018"] |
|
webhook
optional |
Object |
A webhook url and an optional secret phrase. See the Section on Webhooks for more details.
Example:
|
You can also use curl to customize your new project with a category, description, or tags. For example, you can create a new project with all those arguments as follows:
curl "https://au.bigml.io/project?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{
"name": "Fraud Detection",
"category": 4,
"description": "Detecting fraud in bank transactions",
"tags": ["fraud", "detection"]
}'
> Creating a project with arguments
Retrieving a Project
Each project has a unique identifier in the form "project/id" where id is a string of 24 alpha-numeric characters that you can use to retrieve the project.
To retrieve a project with curl:
curl "https://au.bigml.io/project/54d9553bf0a5ea5fc0000016?$BIGML_AUTH"
$ Retrieving a project from the command line
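When you only have the full resource string, the 24-character id part of "project/id" can be split off in the shell; a small sketch:

```shell
# Strip the "project/" prefix to recover the bare id.
RESOURCE="project/54d9553bf0a5ea5fc0000016"
PROJECT_ID="${RESOURCE#project/}"
echo "$PROJECT_ID"
```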
Project Properties
Once a project has been successfully created it will have the following properties.
| Property | Type | Description |
|---|---|---|
|
category
filterable, sortable, updatable |
Integer | One of the categories in the table of categories that helps classify this resource according to its domain of application. |
| code | Integer | HTTP status code. This will be 201 upon successful creation of the project and 200 afterwards. Make sure that you check the code that comes with the status attribute to make sure that the project creation has been completed without errors. |
|
created
filterable, sortable |
ISO-8601 Datetime. | This is the date and time in which the project was created with microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are provided in Coordinated Universal Time (UTC). |
|
description
updatable |
String | A text describing the project. It can contain restricted markdown to decorate the text. |
|
name
filterable, sortable, updatable |
String | The name of the project as provided. |
|
private
filterable, sortable |
Boolean | Whether the project is private or not. |
| resource | String | The project/id. |
| stats | Object | An object keyed by resource type that reports the number of resources of each kind created within the project. |
| status | Object | A description of the status of the project. It includes a code, a message, and some extra information. See the table below. |
|
tags
filterable, updatable |
Array of Strings | A list of user tags that can help classify and index this resource. |
|
updated
filterable, sortable |
ISO-8601 Datetime. | This is the date and time in which the project was updated with microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are provided in Coordinated Universal Time (UTC). |
| webhook | Object | A webhook url and an optional secret phrase. See the Section on Webhooks for more details. |
Updating a Project
To update a project, you need to PUT an object containing the fields that you want to update to the project's base URL. The content-type must always be "application/json". If the request succeeds, BigML.io will return an HTTP 202 response with the updated project.
For example, to update a project with a new name, a new category, a new description, and new tags you can use curl like this:
curl "https://au.bigml.io/project/54d9553bf0a5ea5fc0000016?$BIGML_AUTH" \
-X PUT \
-H 'content-type: application/json' \
-d '{"name": "My New Project",
"category": 3,
"description": "My first BigML project",
"tags": ["fraud", "detection"]}'
$ Updating a project
Deleting a Project
To delete a project, you need to issue an HTTP DELETE request to the project/id to be deleted.
Using curl you can do something like this to delete a project:
curl -X DELETE "https://au.bigml.io/project/54d9553bf0a5ea5fc0000016?$BIGML_AUTH"
$ Deleting a project from the command line
If the request succeeds you will not see anything on the command line unless you executed the command in verbose mode. Successful DELETEs will return "204 no content" responses with no body.
Once you delete a project, it is permanently deleted. That is, a delete request cannot be undone. If you try to delete a project a second time, or a project that does not exist, you will receive a "404 not found" response.
However, if you try to delete a project that is being used at the moment, then BigML.io will not accept the request and will respond with a "400 bad request" response.
See this section for more details.
Listing Projects
To list all the projects, you can use the project base URL. By default, only the 20 most recent projects will be returned. You can see below how to change this number using the limit parameter.
You can get your list of projects directly in your browser using your own username and API key with the following links.
https://au.bigml.io/project?$BIGML_AUTH
> Listing projects from a browser
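Building on the limit parameter mentioned above, a sketch of composing a paginated listing URL; the offset parameter and the semicolon-separated query style are assumptions following the conventions used elsewhere in these examples:

```shell
# Compose a listing URL that returns 5 projects starting at the 21st.
LIMIT=5
OFFSET=20
URL="https://au.bigml.io/project?$BIGML_AUTH;limit=$LIMIT;offset=$OFFSET"
echo "$URL"
```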
Sources
Last Updated: Tuesday, 2019-01-29 16:28
A source is the raw data that you want to use to create a predictive model. A source is usually a (big) file in a comma separated values (CSV) format. See the example below. Each row represents an instance (or example). Each column in the file represents a feature or field. The last column usually represents the class or objective field. The file might have a first row named header with a name for each field.
Plan,Talk,Text,Purchases,Data,Age,Churn?
family,148,72,0,33.6,50,TRUE
business,85,66,0,26.6,31,FALSE
business,83,64,0,23.3,32,TRUE
individual,9,66,94,28.1,21,FALSE
family,15,0,0,35.3,29,FALSE
individual,66,72,175,25.8,51,TRUE
business,0,0,0,30,32,TRUE
family,18,84,230,45.8,31,TRUE
individual,71,110,240,45.4,54,TRUE
family,59,64,0,27.4,40,FALSE
Example CSV file
A source:
- Should be a comma-separated values (CSV) file. Spaces, tabs, and semicolons are also valid separators.
- Weka's ARFF files are also supported.
- JSON in a few formats is also supported. See below for more details.
- Microsoft Excel files or macOS Numbers files should also work most of the time, but it is better to export them to CSV (comma-separated values).
- Cannot be bigger than 64GB.
- Can be gzipped (.gz) or compressed (.bz2). It can be zipped (.zip), but only if the archive contains one single file.
You can also create sources from remote locations using a variety of protocols like https, hdfs, s3, asv, odata/odatas, dropbox, gcs, gdrive, etc. See below for more details.
BigML.io allows you to create, retrieve, update, and delete your sources. You can also list all of your sources.
Jump to:
- JSON Sources
- Source Base URL
- Creating a Source
- Creating a Source Using a Local File
- Creating a Source Using a URL
- Creating a Source Using Inline Data
- Creating a Source with Automatically Generated Synthetic Data
- Text Processing
- Items Detection
- Datetime Detection
- Source Arguments
- Retrieving a Source
- Source Properties
- Filtering and Paginating Fields from a Source
- Updating a Source
- Deleting a Source
- Listing Sources
JSON Sources
BigML.io can parse JSON data in one of the following formats:
-
A top-level list of lists of atomic values, each one defining a row.
Valid JSON Source format (a list of lists)[ ["length","width","height","weight","type"], [5.1,3.5,1.4,0.2,"A"], [4.9,3.0,1.4,0.2,"B"], ... ] -
A top-level list of dictionaries,
where each dictionary's values represent the row values and the corresponding keys the column names.
The first dictionary defines the keys that will be selected.
Valid JSON Source format (a list of dictionaries)[ {"length":5.1,"width":3.5,"height":1.4,"weight":0.2,"type":"A"}, {"length":4.9,"width":3.0,"height":1.4,"weight":0.2,"type":"B"}, ... ] -
A top-level list of dictionaries with the request parameter
json_key defined under source_parser
with the value of one of its keys having any of the two formats above.
For the following example, you can set "source_parser": {"json_key": "data"}.
Alternatively, if the source is from a remote location and is going to be downloaded from, say, http://yourcompany.com?a=foo&b=bar, and the data rows will come under the key "data" as in the example above, you could request an external source with bigml_json_key in the URL: "http://yourcompany.com?a=foo&b=bar&bigml_json_key=data". Note that the parameter name in the query string is bigml_json_key, not json_key.
Valid JSON Source format (a list of dictionaries with json_key){ "name": "Shipping Class", "data": [ {"length":5.1,"width":3.5,"height":1.4,"weight":0.2,"type":"A"}, {"length":4.9,"width":3.0,"height":1.4,"weight":0.2,"type":"B"}, ... ], "size": 5148 } -
A nested dictionary key, with the final value having any of the formats already described.
For the following example, you can set "source_parser": {"json_key": "results.data"}.
Like the example with the top-level list of dictionaries, you could request an external source with bigml_json_key in the URL for a remote source. For the example above, you could request "http://yourcompany.com?a=foo&b=bar&bigml_json_key=results.data".
Valid JSON Source format (a nested dictionary key){ "name": "Shipping Class", "results": { "meta": "Shipping class info", "data": [ {"length":5.1,"width":3.5,"height":1.4,"weight":0.2,"type":"A"}, {"length":4.9,"width":3.0,"height":1.4,"weight":0.2,"type":"B"}, ... ] }, "size": 5148 } -
A top-level dictionary of dictionaries whose values represent rows.
Valid JSON Source format (a dictionary of dictionaries){ "GnCC": {"length":5.1,"width":3.5,"height":1.4,"weight":0.2,"type":"A"}, "4R3R": {"length":4.9,"width":3.0,"height":1.4,"weight":0.2,"type":"B"}, ... } -
Rows of JSON dictionaries where the full file is not a valid JSON document, but each individual line is.
Each line in the file (or input stream) must be a separate JSON list or dictionary, and is thus parsed individually as a top-level JSON document and interpreted as a row of data.
To be a valid JSON row, each line must fall into one of two categories:
* A JSON dictionary, with at least some of its values being atomic. In that case, the keys in the dictionary are taken as the field names and the corresponding atomic values as the actual columns of the row. Keys with composite values are simply ignored. The first dictionary in the file determines which fields will be extracted in subsequent rows, unless they are specified via the request parameter json_fields defined under source_parser or the query string parameter bigml_json_fields, just as with the json_key explained above.
* A JSON list, again with at least some of its values being atomic. Here, the field names can be inferred from the values of the first list in the file, using the same heuristics that are used to auto-detect headers in CSVs. Alternatively, you can set the header flag as well as use json_fields under source_parser (or bigml_json_fields in the query string) to give explicit names in the creation request. Non-atomic values appearing in the lists are translated to missing values.
Here's a snippet of JSON rows using maps:
JSON Rows using Maps{"length":5.1,"width":3.5,"height":1.4,"weight":0.2,"type":"A"} {"length":4.9,"width":3.0,"height":1.4,"weight":0.2,"type":"B"} {"length":4.7,"width":3.2,"height":1.3,"weight":0.2,"type":"B"} {"length":4.6,"width":3.1,"height":1.5,"weight":0.2,"type":"C"} {"length":5.0,"width":3.6,"height":1.4,"weight":0.2,"type":"A"} {"length":5.4,"width":3.9,"height":1.7,"weight":0.4,"type":"C"} ...and here's JSON rows with lists:
JSON Rows using Lists["length","width","height","weight","type"] [5.1,3.5,1.4,0.2,"A"] [4.9,3.0,1.4,0.2,"B"] [4.7,3.2,1.3,0.2,"B"] [4.6,3.1,1.5,0.2,"C"] [5.0,3.6,1.4,0.2,"A"] [5.4,3.9,1.7,0.4,"C"] ...
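Because each line must parse as its own JSON document, a quick local validity check before uploading can save a failed source creation; this is a sketch that assumes python3 is available:

```shell
# Write two JSON rows and verify that each non-empty line parses on its own.
cat > rows.json <<'EOF'
{"length":5.1,"width":3.5,"height":1.4,"weight":0.2,"type":"A"}
{"length":4.9,"width":3.0,"height":1.4,"weight":0.2,"type":"B"}
EOF
RESULT=$(python3 -c "
import json
rows = [json.loads(l) for l in open('rows.json') if l.strip()]
print('valid:', len(rows), 'rows')
")
echo "$RESULT"
```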
Source Base URL
You can use the following base URL to create, retrieve, update, and delete sources. https://au.bigml.io/source
Source base URL
All requests to manage your sources must use HTTPS and be authenticated using your username and API key to verify your identity. See this section for more details.
Creating a Source
You can create a new source in any of the following four ways:
- Local Sources: Using a local file. You need to post the file content in "multipart/form-data". The maximum size allowed is 64 GB per file.
- Remote Sources: Using a URL that points to your data. The maximum size allowed is 64 GB or 5 TB if you use a file stored in Amazon S3.
- Inline Sources: Using some inline data. The content type must be "application/json". The maximum size in this case is limited to 10 MB per post.
- Synthetic Sources: Automatically generate synthetic data sources, presumably for activities such as testing, prototyping, and benchmarking.
Creating a Source Using a Local File
To create a new source, you need to POST the file containing your data to the source base URL. The file must be attached in the post as a file upload. The Content-Type in your HTTP request must be "multipart/form-data" according to RFC 2388. This allows you to upload binary files in compressed format (.Z, .gz, etc.) that will upload faster.
You can easily do this using curl. The option -F (--form) lets curl emulate a filled-in form in which a user has pressed the submit button. You need to prefix the file path name with "@".
curl "https://au.bigml.io/source?$BIGML_AUTH" -F file=@iris.csv
> Creating a source
Creating a Source Using a URL
To create a new remote source you need a URL that points to the data file that you want BigML to download for you.
You can easily do this using curl. The option -H lets curl set the content type header while the option -X sets the http method. You can send the URL within a JSON object as follows:
curl "https://au.bigml.io/source?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"remote": "https://static.bigml.com/csv/iris.csv"}'
> Creating a remote source
You can use the following types of URLs to create remote sources:
- HTTP or HTTPS. They can also include basic realm authorization.
Example URLshttps://test:test@static.bigml.com/csv/iris.csv http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data - Public or private files in Amazon S3.
Example Amazon S3 URLss3://bigml-public/csv/iris.csv s3://bigml-test/csv/iris.csv?access-key=AKIAIF6IUYDYUQ7BALJQ&secret-key=XgrQV/hHBVymD75AhFOzveX4qz7DYrO6q8WsM6ny
Creating a remote source from Google Drive and Google Storage
You have two options to create a remote source from Google Drive or Google Storage via the API:
- Using BigML:
Allow BigML to access your Google Drive or Google Storage from the Cloud Storages section of your Account or from your Dashboard's sources list. You will get the access token and the refresh token.
Google Drive example:- Select the option to create source from Google Drive:
- Allow BigML access to your Google Drive:
- Get the access token and refresh token:
You can easily create the remote source using curl as in the examples below. You should complete the steps above in the same way to import a source from Google Cloud Storage.
> Creating a remote source from Google Drivecurl "https://au.bigml.io/source?$BIGML_AUTH" \ -X POST \ -H 'content-type: application/json' \ -d '{"remote":"gdrive://noserver/0BxGbAMhJezOScTFBUVFPMy1xT1E?token=ya29.AQGn2MO578Fso0fVF0hGb0Q60ILCagAwvFEZoiovK5kvPWSt5z2QMbXDyAvqUtQeYdJbA39YkyRq3A&refresh-token=1/00qQ1040yDSyh_0DRLYZRnkC_62gE8M5Tpb68sfvmj8"}'
> Creating a remote source from Google Cloud Storagecurl "https://au.bigml.io/source?$BIGML_AUTH" \ -X POST \ -H 'content-type: application/json' \ -d '{"remote":"gcs://company_bucket/Iris.csv?token=ya29.AQGn2MO578Fso0fVF0hGb0Q60ILCagAwvFEZoiovK5kvPWSt5z2QMbXDyAvqUtQeYdJbA39YkyRq3A&refresh-token=1/00qQ1040yDSyh_0DRLYZRnkC_62gE8M5Tpb68sfvmj8"}'
- Using your own app:
You can also create a remote source from your own App. You first need to authorize BigML access from your own Google Apps application. BigML only needs authorization for read-only authentication scope (
https://www.googleapis.com/auth/devstorage.read_only,https://www.googleapis.com/auth/drive.readonly), but you can have any of the other available scopes (find authentication scopes available for Google Drive and Google Storage). After the authorization process you will get your access token and refresh token from the Google Authorization Server.
Then the process is the same as creating a remote source using BigML application described above. You need to POST to the source endpoint an object containing at least the file ID (for Google Drive) or the bucket and the file name (for Google Storage) and the access token, but in this case you will also need to include the app secret and app client from your App. Again, including the refresh token is optional.
Your values for app client and app secret appear as Client ID and Client secret in the Google developers console, respectively. See the image below.
> Creating a remote source from Google Drive using your appcurl "https://au.bigml.io/source?$BIGML_AUTH" \ -X POST \ -H 'content-type: application/json' \ -d '{"remote":"gdrive://noserver/0BxGbAMhJezOSXy1oRU5MSU90SUU?token=ya29.AQGn2MO578Fso0fVF0hGb0Q60ILCagAwvFEZoiovK5kvPWSt5z2QMbXDyAvqUtQeYdJbA39YkyRq3A&refresh-token=1/00qQ1040yDSyh_0DRLYZRnkC_62gE8M5Tpb68sfvmj8&app-secret=AvFake1Secretjt27HQWTm4h&app-client=667300000007-07gjg5o912o1v422hfake2cli3nt3no6.apps.googleusercontent.com"}'
> Creating a remote source from Google Cloud Storage using your appcurl "https://au.bigml.io/source?$BIGML_AUTH" \ -X POST \ -H 'content-type: application/json' \ -d '{"remote":"gcs://company_bucket/Iris.csv?token=ya29.AQGn2MO578Fso0fVF0hGb0Q60ILCagAwvFEZoiovK5kvPWSt5z2QMbXDyAvqUtQeYdJbA39YkyRq3A&refresh-token=1/00qQ1040yDSyh_0DRLYZRnkC_62gE8M5Tpb68sfvmj8&app-secret=AvFake1Secretjt27HQWTm4h&app-client=667300000007-07gjg5o912o1v422hfake2cli3nt3no6.apps.googleusercontent.com"}'
Creating a Source Using Inline Data
You can also create sources by sending some inline data within the body of a POST HTTP request. This is especially useful if you want to model small amounts of data generated by an application.
To create an inline source using curl you can use the following example:
curl "https://au.bigml.io/source?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"data": "a,b,c,d\n1,2,3,4\n5,6,7,8"}'
> Creating an inline source
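To build the inline "data" string from a small local file, the newlines must be escaped as \n inside the JSON body. A sketch using python3 (an assumed helper, not part of BigML) to do the escaping:

```shell
# Serialize a tiny CSV file into the {"data": "..."} payload shape shown
# above; json.dumps escapes the newlines for us.
printf 'a,b,c,d\n1,2,3,4\n5,6,7,8\n' > tiny.csv
PAYLOAD=$(python3 -c "import json; print(json.dumps({'data': open('tiny.csv').read()}))")
echo "$PAYLOAD"
```

The resulting string can then be passed directly to curl with -d.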
Regardless of how you create a new source (local, remote, or inline), BigML.io will return a newly created source document if the request succeeded.
{
"category": 0,
"code": 201,
"content_type": "application/octet-stream",
"created": "2012-11-15T02:24:59.686739",
"credits": 0.0,
"description": "",
"disable_datetime": false,
"fields_meta": {
"count": 0,
"limit": 200,
"offset": 0,
"total": 0
},
"file_name": "iris.csv",
"md5": "d1175c032e1042bec7f974c91e4a65ae",
"name": "iris.csv",
"number_of_datasets": 0,
"number_of_models": 0,
"number_of_predictions": 0,
"private": true,
"project": null,
"resource": "source/4f603fe203ce89bb2d000000",
"size": 4608,
"source_parser": {},
"status": {
"code": 1,
"message": "The request has been queued and will be processed soon"
},
"tags": [],
"type": 0,
"updated": "2012-11-15T02:24:59.686758"
}
< Example source JSON response
Creating a Source with Automatically Generated Synthetic Data
You can also synthetically create sources using automatically generated data for activities such as testing, prototyping, and benchmarking.
To create a synthetic source using curl you can use the following example:
curl "https://au.bigml.io/source?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"synthetic": {"fields": 10, "rows": 10}}'
> Creating a synthetic source
In addition to the file, you can also use the following arguments.
Source Arguments
| Argument | Type | Description |
|---|---|---|
|
category
optional |
Integer, default is 0 |
The category that best describes the data. See the category codes for the complete list of categories.
Example: 1 |
|
data
optional |
String |
Data for inline source creation.
Example: "a,b,c,d\n1,2,3,4\n5,6,7,8" |
|
description
optional |
String |
A description of the source up to 8192 characters long.
Example: "This is a description of my new source" |
|
disable_datetime
optional |
Boolean, default is false |
Whether or not BigML should generate new fields from existing date-time fields.
Example: true |
|
file
optional |
multipart/form-data; charset=utf-8 | File containing your data in csv format. It can be compressed, gzipped, or zipped if the archive contains only one file |
|
item_analysis
optional |
Object, default is shown in the table below |
Set of parameters to activate item analysis for the source.
Example:
|
|
name
optional |
String, default is Unnamed source |
The name you want to give to the new source.
Example: "my new source" |
|
project
optional |
String |
The project/id you want the source to belong to.
Example: "project/54d98718f0a5ea0b16000000" |
|
remote
optional |
String |
A URL pointing to a file containing your data in csv format. It can be compressed, gzipped, or zipped.
Example: https://static.bigml.com/csv/iris.csv |
|
source_parser
optional |
Object, default is shown in the table below |
Set of parameters to parse the source.
Example:
|
|
synthetic
optional |
Object, default is shown in the table below |
Set of parameters to generate a synthetic source presumably for activities such as testing, prototyping and benchmarking.
Example:
|
|
tags
optional |
Array of Strings |
A list of strings that help classify and index your source.
Example: ["best customers", "2018"] |
|
term_analysis
optional |
Object, default is shown in the table below |
Set of parameters to activate text analysis for the source.
Example:
|
|
webhook
optional |
Object |
A webhook url and an optional secret phrase. See the Section on Webhooks for more details.
Example:
|
A source parser object is composed of any combination of the following properties.
| Property | Type | Description |
|---|---|---|
|
header
optional |
Boolean, default is true |
Whether the source contains a header or not.
Example: true |
| json_fields | Array of Strings |
The columns to be used when the source is in the JSON format and the rows are a list of dictionaries. See the JSON Sources for more information.
Example: ["age", "height", "weight"] |
| json_key | String |
A top-level dictionary key containing the rows when the source is the JSON format. See the JSON Sources for more information.
Example: "data" |
|
locale
optional |
String, default is "en-US" |
The locale of the source.
Example: "es-ES" |
|
missing_tokens
optional |
Array of Strings, default is ["", "N/A", "n/a", "NULL", "null", "-", "#DIV/0", "#REF!", "#NAME?", "NIL", "nil", "NA", "na", "#VALUE!", "#NULL!", "NaN", "#N/A", "#NUM!", "?"] |
Tokens that represent a missing value
Example: ["?"] |
|
quote
optional |
Char, default is """ |
The source quote character.
Example: "'" |
|
separator
optional |
Char, default is "," |
The source separator character. Empty string if the source has only a single column.
Example: ";" |
|
trim
optional |
Boolean, default is true |
Whether to trim field strings or not.
Example: true |
You can also use curl to customize your new source with a name and a different parser. For example, to create a new source named "my source", without a header and with "x" as the only missing token:
curl "https://au.bigml.io/source?$BIGML_AUTH" \
-F file=@iris.csv \
-F 'name=my source' \
-F 'source_parser={"header": false, "missing_tokens":["x"]}'
> Creating a source with arguments
If you do not specify a name, BigML.io will assign to the source the same name as the file that you uploaded. If you do not specify a source_parser, BigML.io will do its best to automatically select the parsing parameters for you. However, if you do specify it, BigML.io will not try to second-guess you.
An item_analysis object is composed of any combination of the following properties.
A term_analysis object is composed of any combination of the following properties.
| Property | Type | Description |
|---|---|---|
|
bigrams
optional |
Boolean, default is false |
Whether to include contiguous sequences of two items from a given sequence of text. See n-gram for more information. This argument is deprecated in favor of ngrams and is equivalent to ngrams=2.
Example: true DEPRECATED |
|
case_sensitive
optional |
Boolean, default is false |
Whether text analysis should be case sensitive or not.
Example: true |
|
enabled
optional |
Boolean, default is true |
Whether text processing should be enabled or not.
Example: true |
|
excluded_terms
optional |
Array of Strings, default is [], an empty list. |
Specifies a list of terms to ignore when performing term analysis.
Example:
|
|
language
optional |
String, default is "en" |
The default language of text fields in a two-letter language code, which will change the resulting stemming and tokenization. Available options are: "ar", "ca", "cs", "da", "de", "en", "es", "fa", "fi", "fr", "hu", "it", "ja", "ko", "nl", "pl", "pt", "ro", "ru", "sv", "tr", "zh", "none", or null for auto-detect.
Example: "es" |
|
ngrams
optional |
Integer, default is 1 |
A positive integer n specifying that all sequences of n consecutive tokens should be considered as terms, in addition to their constituent tokens (when separated by a single space and no stopwords). See n-gram for more information. The minimum value is 1 and the maximum value is 5.
Example: 5 |
|
stem_words
optional |
Boolean, default is true |
Whether lemmatization (stemming) of terms should be done, according to linguistic rules in the provided language. Note that if the language is, for example, zh, even English words will not be lemmatized as with English rules.
Example: true |
|
stopword_diligence
optional |
String, default is "light" |
The aggressiveness of stopword removal, where the levels are light, normal or aggressive, in order, where each level is a superset of words in the previous ones. The most common languages will add stopwords at each level, but less common languages may not.
Example: "normal" |
|
stopword_removal
optional |
String, default is "selected_language" |
A string specifying the type of stopword removal to perform: none (remove no stopwords), selected_language (remove stopwords from the provided language), or all_languages (remove stopwords from all languages). Note that this parameter supersedes use_stopwords if provided. Also note that the null language does have a non-empty stopword list, such as single numeric digits.
Example: "all_languages" |
|
term_filters
optional |
Array of Strings |
Filters that should be applied to the chosen terms. Available options are:
Example: "html_keywords" |
|
term_regexps
optional |
Array | A list of strings specifying regular expressions to be matched against input documents. If present, these regular expressions will automatically be chosen for the final term list, and their per-document occurrence counts will be the number of matches of the expression in that document. |
|
token_mode
optional |
String, default is "all" |
The tokenization strategy: tokens_only, full_terms_only, or all.
Example: "tokens_only" |
|
use_stopwords
optional |
Boolean, default is true |
Whether to use stopwords or not. This field is deprecated in favor of stopword_removal.
Example: true DEPRECATED |
A synthetic object is composed of the following properties.
Text Processing
While the handling of numeric, categorical, or items fields within a decision tree framework is fairly straightforward, the handling of text fields can be done in a number of different ways. BigML.io takes a basic and reasonably robust approach, leveraging some basic NLP techniques along with a simple bag-of-words style method of feature generation.
At the source level, BigML.io attempts to do basic language detection. Initially the language can be English ("en"), Spanish ("es"), Catalan/Valencian ("ca"), Dutch ("nl"), French ("fr"), German ("de"), Portuguese ("pt"), or "none" if no language is detected. In the near future, BigML.io will support many more languages.
For text fields, BigML.io adds potentially five keys to the detected fields, all of which are placed in a map under term_analysis.
The first is language, which is mapped to the detected language.
There are also three boolean keys: case_sensitive, use_stopwords, and stem_words. The case_sensitive key is false by default. use_stopwords should be true if stopwords should be included in the vocabulary for the detected field during text summarization. stem_words should be true if BigML.io should perform word stemming on this field, which maps forms of the same term to the same key when summarizing or generating models. By default, use_stopwords is false and stem_words is true for languages other than "none"; they are not present otherwise.
Finally, token_mode determines the tokenization strategy. It may be set to tokens_only, full_terms_only, or all. When set to tokens_only, individual words are used as terms. For example, "ML for all" becomes ["ML", "for", "all"]. However, when full_terms_only is selected, the entire field is treated as a single term as long as it is shorter than 256 characters. In this case "ML for all" stays ["ML for all"]. If all is selected, then both full terms and tokenized terms are used, so "ML for all" becomes ["ML", "for", "all", "ML for all"]. The default for token_mode is all.
There are a few details to note:
- If full_terms_only is selected, then no stemming will occur even if stem_words is true.
- Also, when either all or tokens_only are selected, a term must appear at least twice to be selected for the tag cloud. However full_terms_only lowers this limit to a single occurrence.
- Finally, if the language is "none", or if a language does not have an algorithm available for stopword removal or stemming, the use_stopwords and stem_words keys will have no effect.
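As an illustration only (this is not BigML's internal code), the effect of the all token mode on the "ML for all" example above can be sketched locally, assuming python3 is available:

```shell
# In "all" mode the terms are the individual tokens plus the full field
# (since it is shorter than 256 characters).
MODES=$(python3 - <<'EOF'
text = "ML for all"
tokens_only = text.split()
full_terms_only = [text] if len(text) < 256 else []
print(tokens_only + full_terms_only)
EOF
)
echo "$MODES"
```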
Items Detection
BigML automatically detects as items those fields that contain many different categorical values per instance, separated by non-alphanumeric characters, and therefore cannot be considered either categorical or text fields.
These kinds of fields can be found in transactional datasets where each instance is associated with a different set of products contained within one field: for example, datasets containing all the products bought by each user, or prescription datasets where each patient is associated with different treatments. These datasets are commonly used for Association Discovery to find relationships between different items.
The two CSV examples below contain fields that would be considered items fields:
User, Prescription
John Doe, medicine 1; medicine 2
Jane Roe, medicine 1; medicine 3; medicine 4; medicine 6
Transaction, Product
12345, product 1; product 2; product 5; product 6; product 7
67890, product 1; product 3; product 4
In the examples above, the fields Prescription and Product will be considered items fields, and each different value will be a unique item.
Once a field has been detected as items, BigML tries to automatically detect which is the best separator for your items. For example, for the following itemset {hot dog; milk, skimmed; chocolate}, the best separator is the semicolon which yields three different items: 'hot dog', 'milk, skimmed' and 'chocolate'.
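Once a separator is known, splitting an items value is straightforward; the sketch below illustrates the result for the example above. BigML's automatic detection of the best separator is more involved than this.

```python
# Sketch: splitting an items field value once a separator has been chosen.
def parse_items(value, separator=";"):
    return [item.strip() for item in value.split(separator)]

parse_items("hot dog; milk, skimmed; chocolate")
# ['hot dog', 'milk, skimmed', 'chocolate']
```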
For items fields, there are five different parameters you can configure under the property group item_analysis, including separator, which allows you to specify the separator for your items.
Note that items fields are not eligible as target fields for models, logistic regressions, and ensembles, but they can be used as predictors. For anomaly detection, they can't be included as input fields to calculate the anomaly score, although they can be selected as summary fields.
Datetime Detection
During the source pre-scan, BigML tries to determine the data type of each field in your file. This process automatically detects datetime fields and, unless disable_datetime is explicitly set to "true", BigML will generate additional fields with their components.
For instance, if a field named "date" has been identified as a datetime with format "YYYY-MM-dd", four new fields will be automatically added to the source, namely "date.year", "date.month", "date.day-of-month" and "date.day-of-week". For each row, these new fields will be filled in automatically by parsing the value of their parent field, "date". For example, if the latter contains the value "1969-07-14", the autogenerated columns in that row will have the values 1969, 7, 14 and 1 (because that day was a Monday). As noted before, autogeneration can be disabled by setting the disable_datetime option to "true", either in the create source request or later in an update source operation.
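The expansion just described can be sketched as follows. The generated field names follow the documentation's example; the parsing itself is plain Python covering only the "YYYY-MM-dd" format, not BigML's internals.

```python
from datetime import datetime

# Sketch of the datetime component expansion described above.
def expand_date(value, fmt="%Y-%m-%d"):
    d = datetime.strptime(value, fmt)
    return {
        "date.year": d.year,
        "date.month": d.month,
        "date.day-of-month": d.day,
        "date.day-of-week": d.isoweekday(),  # 1 = Monday, as in the example
    }

expand_date("1969-07-14")
# {'date.year': 1969, 'date.month': 7, 'date.day-of-month': 14, 'date.day-of-week': 1}
```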
When a field is detected as datetime, BigML tries to determine its format for parsing the values and generate the fields with their components. By default, BigML accepts ISO 8601 time formats (YYYY-MM-DD) as well as a number of other common European and US formats, as seen in the table below:
| time_format Name | Example |
|---|---|
| basic-date-time | 19690714T173639.592Z |
| basic-date-time-no-ms | 19690714T173639Z |
| basic-ordinal-date-time | 1969195T173639.592Z |
| basic-ordinal-date-time-no-ms | 1969195T173639Z |
| basic-t-time | T173639.592Z |
| basic-t-time-no-ms | T173639Z |
| basic-time | 173639.592Z |
| basic-time-no-ms | 173639Z |
| basic-week-date | 1969W297 |
| basic-week-date-time | 1969W297T173639.592Z |
| basic-week-date-time-no-ms | 1969W297T173639Z |
| clock-minute | 5:36 PM |
| clock-minute-nospace | 5:36PM |
| clock-second | 5:36:39 PM |
| clock-second-nospace | 5:36:39PM |
| date | 1969-07-14 |
| date-hour | 1969-07-14T17 |
| date-hour-minute | 1969-07-14T17:36 |
| date-hour-minute-second | 1969-07-14T17:36:39 |
| date-hour-minute-second-fraction | 1969-07-14T17:36:39.592 |
| date-hour-minute-second-ms | 1969-07-14T17:36:39.592 |
| date-time | 1969-07-14T17:36:39.592Z |
| date-time-no-ms | 1969-07-14T17:36:39Z |
| eu-date | 14/7/1969 |
| eu-date-clock-minute | 14/7/1969 5:36 PM |
| eu-date-clock-minute-nospace | 14/7/1969 5:36PM |
| eu-date-clock-second | 14/7/1969 5:36:39 PM |
| eu-date-clock-second-nospace | 14/7/1969 5:36:39PM |
| eu-date-millisecond | 14/7/1969 17:36:39.592 |
| eu-date-minute | 14/7/1969 17:36 |
| eu-date-second | 14/7/1969 17:36:39 |
| eu-ddate | 14.7.1969 |
| eu-ddate-clock-minute | 14.7.1969 5:36 PM |
| eu-ddate-clock-minute-nospace | 14.7.1969 5:36PM |
| eu-ddate-clock-second | 14.7.1969 5:36:39 PM |
| eu-ddate-clock-second-nospace | 14.7.1969 5:36:39PM |
| eu-ddate-millisecond | 14.7.1969 17:36:39.592 |
| eu-ddate-minute | 14.7.1969 17:36 |
| eu-ddate-second | 14.7.1969 17:36:39 |
| eu-sdate | 14-7-1969 |
| eu-sdate-clock-minute | 14-7-1969 5:36 PM |
| eu-sdate-clock-minute-nospace | 14-7-1969 5:36PM |
| eu-sdate-clock-second | 14-7-1969 5:36:39 PM |
| eu-sdate-clock-second-nospace | 14-7-1969 5:36:39PM |
| eu-sdate-millisecond | 14-7-1969 17:36:39.592 |
| eu-sdate-minute | 14-7-1969 17:36 |
| eu-sdate-second | 14-7-1969 17:36:39 |
| hour-minute | 17:36 |
| hour-minute-second | 17:36:39 |
| hour-minute-second-fraction | 17:36:39.592 |
| hour-minute-second-ms | 17:36:39.592 |
| mysql | 1969-07-14 17:36:39 |
| no-t-date-hour-minute | 1969-7-14 17:36 |
| odata-format | /Datetime(-14752170831)/ |
| ordinal-date-time | 1969-195T17:36:39.592Z |
| ordinal-date-time-no-ms | 1969-195T17:36:39Z |
| rfc822 | Mon, 14 Jul 1969 17:36:39 +0000 |
| t-time | T17:36:39.592Z |
| t-time-no-ms | T17:36:39Z |
| time | 17:36:39.592Z |
| time-no-ms | 17:36:39Z |
| timestamp | -14718201 |
| timestamp-msecs | -14718201000 |
| twitter-time | Mon Jul 14 17:36:39 +0000 1969 |
| twitter-time-alt | 1969-7-14 17:36:39 +0000 |
| twitter-time-alt-2 | 1969-7-14 17:36 +0000 |
| twitter-time-alt-3 | Mon Jul 14 17:36 +0000 1969 |
| us-date | 7/14/1969 |
| us-date-clock-minute | 7/14/1969 5:36 PM |
| us-date-clock-minute-nospace | 7/14/1969 5:36PM |
| us-date-clock-second | 7/14/1969 5:36:39 PM |
| us-date-clock-second-nospace | 7/14/1969 5:36:39PM |
| us-date-millisecond | 7/14/1969 17:36:39.592 |
| us-date-minute | 7/14/1969 17:36 |
| us-date-second | 7/14/1969 17:36:39 |
| us-sdate | 7-14-1969 |
| us-sdate-clock-minute | 7-14-1969 5:36 PM |
| us-sdate-clock-minute-nospace | 7-14-1969 5:36PM |
| us-sdate-clock-second | 7-14-1969 5:36:39 PM |
| us-sdate-clock-second-nospace | 7-14-1969 5:36:39PM |
| us-sdate-millisecond | 7-14-1969 17:36:39.592 |
| us-sdate-minute | 7-14-1969 17:36 |
| us-sdate-second | 7-14-1969 17:36:39 |
| week-date | 1969-W29-7 |
| week-date-time | 1969-W29-7T17:36:39.592Z |
| week-date-time-no-ms | 1969-W29-7T17:36:39Z |
| weekyear-week | 1969-W29 |
| weekyear-week-day | 1969-W29-7 |
| year-month | 1969-07 |
| year-month-day | 1969-07-14 |
It might happen that BigML is not able to determine the right format of your datetime field. In that case, it will be considered either a text or a categorical field. You can override that assignment by setting the optype of the field to datetime and passing the appropriate format in time_formats. For instance:
curl "https://au.bigml.io/source/4f603fe203ce89bb2d000000?$BIGML_AUTH" \
-X PUT \
-d '{"fields": {"000004": {"optype": "datetime", "time_formats": ["date"]}}}' \
-H 'content-type: application/json'
> Updating a source field with optype "datetime"
curl "https://au.bigml.io/source/4f603fe203ce89bb2d000000?$BIGML_AUTH" \
-X PUT \
-d '{"fields": {"000004": {"optype": "datetime", "time_formats": ["YYYY-MM-dd"]}}}' \
-H 'content-type: application/json'
> Updating a source field with custom "time_formats"
Retrieving a Source
Each source has a unique identifier in the form "source/id" where id is a string of 24 alpha-numeric characters that you can use to retrieve the source.
To retrieve a source with curl:
curl "https://au.bigml.io/source/4f603fe203ce89bb2d000000?$BIGML_AUTH"
$ Retrieving a source from the command line
You can also use your browser to visualize the source using the full BigML.io URL or pasting the source/id into the BigML.com.au dashboard.
Source Properties
Once a source has been successfully created it will have the following properties.
| Property | Type | Description |
|---|---|---|
| category (filterable, sortable, updatable) | Integer | One of the categories in the table of categories that help classify this resource according to the domain of application. |
| code | Integer | HTTP status code. This will be 201 upon successful creation of the source and 200 afterwards. Check the code that comes with the status attribute to make sure that the source creation has been completed without errors. |
| content_type (filterable, sortable) | String | The MIME content-type as provided by your HTTP client. The content-type can help BigML.io to better parse your file. For example, if you use curl, you can alter it using the type option: "-F file=@iris.csv;type=text/csv". |
| created (filterable, sortable) | ISO-8601 Datetime | The date and time in which the source was created, with microsecond precision. It follows the pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are provided in Coordinated Universal Time (UTC). |
| credits (filterable, sortable) | Float | The number of credits it cost you to create this source. |
| description (updatable) | String | A text describing the source. It can contain restricted markdown to decorate the text. |
| disable_datetime (updatable) | Boolean | True when BigML did not generate new fields from existing date-time fields. |
| fields (updatable) | Object | A dictionary with an entry per field (column) in your data. Each entry includes the column number, the name of the field, the type of the field, a specific locale if it differs from the source's, and specific missing tokens if they differ from the source's. This property is very handy to update sources according to your own parsing preferences. |
| fields_meta | Object | A dictionary with meta information about the fields dictionary. It specifies the total number of fields, the current offset and limit, and the number of fields (count) returned. |
| file_name (filterable, sortable) | String | The name of the file as you submitted it. |
| md5 | String | The file's MD5 message digest, as specified by RFC 1321. |
| name (filterable, sortable, updatable) | String | The name of the source as you provided it, or the name of the file by default. |
| number_of_anomalies (filterable, sortable) | Integer | The current number of anomalies that use this source. |
| number_of_anomalyscores (filterable, sortable) | Integer | The current number of anomaly scores that use this source. |
| number_of_associations (filterable, sortable) | Integer | The current number of associations that use this source. |
| number_of_associationsets (filterable, sortable) | Integer | The current number of association sets that use this source. |
| number_of_centroids (filterable, sortable) | Integer | The current number of centroids that use this source. |
| number_of_clusters (filterable, sortable) | Integer | The current number of clusters that use this source. |
| number_of_correlations (filterable, sortable) | Integer | The current number of correlations that use this source. |
| number_of_datasets (filterable, sortable) | Integer | The current number of datasets that use this source. |
| number_of_ensembles (filterable, sortable) | Integer | The current number of ensembles that use this source. |
| number_of_forecasts (filterable, sortable) | Integer | The current number of forecasts that use this source. |
| number_of_logisticregressions (filterable, sortable) | Integer | The current number of logistic regressions that use this source. |
| number_of_models (filterable, sortable) | Integer | The current number of models that use this source. |
| number_of_optimls (filterable, sortable) | Integer | The current number of OptiMLs that use this source. |
| number_of_predictions (filterable, sortable) | Integer | The current number of predictions that use this source. |
| number_of_statisticaltests (filterable, sortable) | Integer | The current number of statistical tests that use this source. |
| number_of_timeseries (filterable, sortable) | Integer | The current number of time series that use this source. |
| number_of_topicdistributions (filterable, sortable) | Integer | The current number of topic distributions that use this source. |
| number_of_topicmodels (filterable, sortable) | Integer | The current number of topic models that use this source. |
| private (filterable, sortable) | Boolean | Whether the source is public or not. |
| project (filterable, sortable, updatable) | String | The project/id the resource belongs to. |
| remote | String | URL of the remote data source. |
| resource | String | The source/id. |
| shared (filterable, sortable) | Boolean | Whether the source is shared using a private link or not. |
| shared_hash | String | The hash that gives access to this source if it has been shared using a private link. |
| sharing_key | String | The alternative key that gives read access to this source. |
| size (filterable, sortable) | Integer | The number of bytes of the source. |
| source_parser (updatable) | Object | Set of parameters to parse the source. |
| status | Object | A description of the status of the source. It includes a code, a message, and some extra information. See the table below. |
| subscription (filterable, sortable) | Boolean | Whether the source was created using a subscription plan or not. |
| synthetic | Object | Set of parameters to generate a synthetic source, presumably for activities such as testing, prototyping, and benchmarking. |
| tags (filterable, updatable) | Array of Strings | A list of user tags that can help classify and index this resource. |
| term_analysis (updatable) | Object | Set of parameters that define how text analysis should work for text fields. |
| type (filterable, sortable) | Integer | The type of source. |
| updated (filterable, sortable) | ISO-8601 Datetime | The date and time in which the source was updated, with microsecond precision. It follows the pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are provided in Coordinated Universal Time (UTC). |
| webhook | Object | A webhook url and an optional secret phrase. See the Section on Webhooks for more details. |
Source Fields
The property fields is a dictionary keyed by an auto-generated id per each field in the source. Each field has as a value an object with the following properties:
For fields classified with optype "text", the default values specified in the term_analysis at the top-level of the source are used.
Flags not provided in term_analysis take their default values, i.e., false for booleans and none for language.
Besides these global default values, which apply to all text fields (and potential text fields, such as categorical ones that might overflow to text during dataset creation), it's possible to specify term_analysis flags on a per-field basis.
For fields classified with optype "items", the default values specified in the item_analysis at the top-level of the source are used.
As with term_analysis, flags not provided in item_analysis take their default values, and it's possible to specify item_analysis flags on a per-field basis as well as at the global level.
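The layering of defaults described above can be sketched as a simple merge: per-field flags override source-level flags, which in turn override the documented defaults. The merge code itself is an illustration, not BigML's implementation.

```python
# Sketch of the term_analysis default-merging described above.
DOCUMENTED_DEFAULTS = {"case_sensitive": False, "use_stopwords": False,
                       "stem_words": True, "token_mode": "all"}

def resolve_term_analysis(source_level, field_level):
    merged = dict(DOCUMENTED_DEFAULTS)
    merged.update(source_level)   # source-level flags override defaults
    merged.update(field_level)    # per-field flags override both
    return merged

resolve_term_analysis({"use_stopwords": True}, {"case_sensitive": True})
# {'case_sensitive': True, 'use_stopwords': True, 'stem_words': True, 'token_mode': 'all'}
```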
Source Status
Before a source is successfully created, BigML.io makes sure that it has been uploaded in an understandable format, that the data that it contains is parseable, and that the types for each column in the data can be inferred successfully. The source goes through a number of states until all these analyses are completed. Through the status field in the source you can determine when the source has been fully processed and is ready to be used to create a dataset. These are the fields that a source's status has:
| Property | Type | Description |
|---|---|---|
| code | Integer | A status code that reflects the status of the source creation. It can be any of those that are explained here. |
| elapsed | Integer | Number of milliseconds that BigML.io took to process the source. |
| message | String | A human readable message explaining the status. |
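Since source creation is asynchronous, clients typically poll the resource until the status reaches the finished code (5, as shown in the sample response below). This tiny helper is a sketch of that check:

```python
# Sketch: checking whether an asynchronously created resource has finished
# processing. Code 5 is taken from the example status in this documentation.
def is_ready(resource):
    """True once asynchronous processing has finished (status code 5)."""
    return resource.get("status", {}).get("code") == 5

is_ready({"status": {"code": 5, "message": "The source has been created"}})  # True
```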
Once a source has been successfully created, it will look like:
{
"category": 0,
"code": 200,
"content_type": "application/octet-stream",
"created": "2012-11-15T02:24:59.686000",
"credits": 0.0,
"description": "",
"fields": {
"000000": {
"column_number": 0,
"name": "sepal length",
"optype": "numeric",
"order": 0
},
"000001": {
"column_number": 1,
"name": "sepal width",
"optype": "numeric",
"order": 1
},
"000002": {
"column_number": 2,
"name": "petal length",
"optype": "numeric",
"order": 2
},
"000003": {
"column_number": 3,
"name": "petal width",
"optype": "numeric",
"order": 3
},
"000004": {
"column_number": 4,
"name": "species",
"optype": "categorical",
"order": 4
}
},
"fields_meta": {
"count": 5,
"limit": 200,
"offset": 0,
"total": 5
},
"file_name": "iris.csv",
"md5": "d1175c032e1042bec7f974c91e4a65ae",
"name": "iris.csv",
"number_of_datasets": 0,
"number_of_models": 0,
"number_of_predictions": 0,
"private": true,
"project": null,
"resource": "source/4f603fe203ce89bb2d000000",
"size": 4608,
"source_parser": {
"header": true,
"locale": "en_US",
"missing_tokens": [
"",
"N/A",
"n/a",
"NULL",
"null",
"-",
"#DIV/0",
"#REF!",
"#NAME?",
"NIL",
"nil",
"NA",
"na",
"#VALUE!",
"#NULL!",
"NaN",
"#N/A",
"#NUM!",
"?"
],
"quote": "\"",
"separator": ","
},
"status": {
"code": 5,
"elapsed": 244,
"message": "The source has been created"
},
"tags": [],
"type": 0,
"updated": "2012-11-15T02:25:00.001000"
}
< Example source JSON response
Filtering and Paginating Fields from a Source
A source might be composed of hundreds or even thousands of fields. Thus when retrieving a source, it's possible to specify that only a subset of fields be retrieved, by using any combination of the following parameters in the query string (unrecognized parameters are ignored):
Since fields is a map and therefore not ordered, the returned fields contain an additional key, order, whose integer (increasing) value gives you their ordering. In all other respects, the source is the same as the one you would get without any filtering parameter above.
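A client can walk through all the fields in fixed-size windows; the sketch below assumes offset and limit query parameters, a naming inferred from the pagination counters (total, offset, limit, count) the API returns.

```python
# Sketch: generating (offset, window_size) windows to page through fields.
def paginate(total, limit):
    """Yield (offset, window_size) pairs covering `total` fields."""
    offset = 0
    while offset < total:
        yield offset, min(limit, total - offset)
        offset += limit

list(paginate(5, 2))  # [(0, 2), (2, 2), (4, 1)]
```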
The fields_meta field can help you paginate fields. Its structure is as follows:
Updating a Source
To update a source, you need to PUT an object containing the fields that you want to update to the source's base URL. The content-type must always be "application/json". If the request succeeds, BigML.io will respond with an HTTP 202 response containing the updated source.
For example, to update a source with a new name and a new locale you can use curl like this:
curl "https://au.bigml.io/source/4f603fe203ce89bb2d000000?$BIGML_AUTH" \
-X PUT \
-d '{"name": "a new name", "source_parser": {"locale": "es-ES"}}' \
-H 'content-type: application/json'
$ Updating a source's name and locale
Deleting a Source
To delete a source, you need to issue a HTTP DELETE request to the source/id to be deleted.
Using curl you can do something like this to delete a source:
curl -X DELETE "https://au.bigml.io/source/4f603fe203ce89bb2d000000?$BIGML_AUTH"
$ Deleting a source from the command line
If the request succeeds you will not see anything on the command line unless you executed the command in verbose mode. Successful DELETEs will return "204 no content" responses with no body.
Once you delete a source, it is permanently deleted. That is, a delete request cannot be undone. If you try to delete a source a second time, or a source that does not exist, you will receive a "404 not found" response.
However, if you try to delete a source that is being used at the moment, then BigML.io will not accept the request and will respond with a "400 bad request" response.
See this section for more details.
Listing Sources
To list all the sources, you can use the source base URL. By default, only the 20 most recent sources will be returned. You can see below how to change this number using the limit parameter.
You can get your list of sources directly in your browser using your own username and API key with the following links.
https://au.bigml.io/source?$BIGML_AUTH
> Listing sources from a browser
Datasets
Last Updated: Tuesday, 2019-01-29 16:28
A dataset is a structured version of a source where each field has been processed and serialized according to its type. The possible field types are numeric, categorical, text, date-time, or items. For each field, you can also get the number of errors that were encountered processing it. Errors are mostly missing values or values that do not match with the type assigned to the column.
When you create a new dataset, histograms of the field values are created for the categorical and numeric fields. In addition, for the numeric fields, a collection of statistics about the field distribution such as minimum, maximum, sum, and sum of squares are also computed.
For date-time fields, BigML attempts to parse the format and automatically generate the related subfields (year, month, day, and so on) present in the format.
For items fields, which have many different categorical values per instance separated by non-alphanumeric characters, BigML tries to automatically detect the best separator for your items.
Finally, for text fields, BigML handles plain text fields with some light-weight natural language processing; BigML separates the field into words using punctuation and whitespace, attempts to detect the language, groups word forms together using word stemming, and eliminates words that are too common or too rare to be useful. We are then left with somewhere between a few dozen and a few hundred interesting words per text field, the occurrences of which can be features in a model.
BigML.io allows you to create, retrieve, update, and delete your datasets. You can also list all of your datasets.
Jump to:
- Dataset Base URL
- Creating a Dataset
- Dataset Arguments
- Filtering Rows
- Retrieving a Dataset
- Dataset Properties
- Filtering and Paginating Fields from a Dataset
- Updating a Dataset
- Deleting a Dataset
- Listing Datasets
- Multi-Datasets
- Resources Accepting Multi-Datasets Input
- Creating a Dataset using SQL
- Transformations
- Cloning a Dataset
- Sampling a Dataset
- Filtering a Dataset
- Extending a Dataset
- Filtering the New Fields Output
- Discretization of a Continuous Field
- Outlier Elimination
- Lisp and JSON Syntaxes
- Final Remarks
Dataset Base URL
You can use the following base URL to create, retrieve, update, and delete datasets. https://au.bigml.io/dataset
Dataset base URL
All requests to manage your datasets must use HTTPS and be authenticated using your username and API key to verify your identity. See this section for more details.
Creating a Dataset
To create a new dataset, you need to POST to the dataset base URL an object containing at least the source/id that you want to use to create the dataset. The content-type must always be "application/json".
You can easily create a new dataset using curl as follows. All you need is a valid source/id and your authentication variable set up as shown above.
curl "https://au.bigml.io/dataset?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"source": "source/50a4527b3c1920186d000041"}'
> Creating a dataset
BigML.io will return the newly created dataset if the request succeeded.
{
"category": 0,
"code": 201,
"columns": 0,
"created": "2012-11-15T02:29:09.293711",
"credits": 0.00439453125,
"description": "",
"excluded_fields": [],
"fields": {},
"fields_meta": {
"count": 0,
"limit": 200,
"offset": 0,
"total": 0
},
"input_fields": [],
"locale": "en-US",
"name": "iris' dataset",
"number_of_evaluations": 0,
"number_of_models": 0,
"number_of_predictions": 0,
"price": 0.0,
"private": true,
"project": null,
"resource": "dataset/52b9359a3c19205ff100002a",
"rows": 0,
"size": 4608,
"source": "source/50a4527b3c1920186d000041",
"source_status": true,
"status": {
"code": 1,
"message": "The dataset is being processed and will be created soon"
},
"tags": [],
"updated": "2012-11-15T02:29:09.293733",
"views": 0
}
< Example dataset JSON response
Dataset Arguments
By default, the dataset will include all fields in the corresponding source, but this behaviour can be fine-tuned via the input_fields and excluded_fields lists of identifiers. The former specifies the list of fields to be included in the dataset, and defaults to all fields in the source when empty. To specify excluded fields, use excluded_fields: identifiers in that list are removed from the list constructed using input_fields.
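The combination rule just described can be sketched as follows (illustrative Python mirroring the documented behavior, not BigML's code):

```python
# Sketch of how input_fields and excluded_fields combine: start from
# input_fields (or all source fields when empty/omitted), then remove
# anything listed in excluded_fields.
def effective_fields(source_fields, input_fields=None, excluded_fields=()):
    selected = input_fields if input_fields else list(source_fields)
    excluded = set(excluded_fields)
    return [f for f in selected if f not in excluded]

effective_fields(["000000", "000001", "000002"], excluded_fields=["000001"])
# ['000000', '000002']
```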
See below the full list of arguments that you can POST to create a dataset.
| Argument | Type | Description |
|---|---|---|
| category (optional) | Integer, default is the category of the source | The category that best describes the dataset. See the category codes for the complete list of categories. Example: 1 |
| description (optional) | String | A description of the dataset up to 8192 characters long. Example: "This is a description of my new dataset" |
| excluded_fields (optional) | Array, default is [] (no field in the source is excluded) | Specifies the fields that won't be included in the dataset. |
| fields (optional) | Object, default is {} (no names, labels, or descriptions are changed) | Updates the names, labels, and descriptions of the fields in the dataset with respect to the original names in the source. Add an entry keyed by the field id generated in the source for each field whose name you want updated. |
| input_fields (optional) | Array, default is [] (all the fields in the source) | Specifies the fields to be included in the dataset. |
| json_filter (optional) | Array | A JSON list representing a filter over the rows in the datasource. The first element is an operator and the rest of the elements are its arguments. See the section below for more details. Example: [">", 3.14, ["field", "000002"]] |
| lisp_filter (optional) | String | A string representing a Lisp s-expression to filter rows from the datasource. Example: "(> 3.14 (field 2))" |
| name (optional) | String, default is the source's name | The name you want to give to the new dataset. Example: "my new dataset" |
| objective_field (optional) | Object, default is the last non-auto-generated field in the dataset | Specifies the default objective field. |
| origin | String | The dataset/id of the gallery dataset to be cloned. The price of the dataset must be 0 to be cloned via API. Example: "dataset/5b9ab8474e172785e3000003" |
| project (optional) | String | The project/id you want the dataset to belong to. Example: "project/54d98718f0a5ea0b16000000" |
| refresh_field_types (optional) | Boolean, default is false | Specifies whether field types need to be recomputed or not. Example: true |
| refresh_objective (optional) | Boolean, default is false | Specifies whether the default objective field of the dataset needs to be recomputed or not. Example: true |
| refresh_preferred (optional) | Boolean, default is false | Specifies whether preferred field flags need to be recomputed or not. Example: true |
| shared_hash | String | The shared hash of the shared dataset to be cloned. The price of the dataset must be 0 to be cloned via API. Example: "kpY46mNuNVReITw0Z1mAqoQ9ySW" |
| size (optional) | Integer, default is the source's size | The number of bytes from the source that you want to use. Example: 1073741824 |
| source | String | A valid source/id. Example: source/4f665b8103ce8920bb000006 |
| tags (optional) | Array of Strings | A list of strings that help classify and index your dataset. Example: ["best customers", "2018"] |
| term_limit (optional) | Integer | The maximum total number of terms to be used in text analysis. Example: 500 |
| webhook (optional) | Object | A webhook url and an optional secret phrase. See the Section on Webhooks for more details. |
You can also use curl to customize a new dataset with a different name and size, and with new names for some of the fields of the original source. For example, to create a new dataset named "my dataset", using only 500 bytes of the source, and renaming two of its fields:
curl "https://au.bigml.io/dataset?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"source": "source/4f665b8103ce8920bb000006", "name": "my dataset", "size": 500, "fields": {"000001": {"name": "width_1"}, "000003": {"name": "width_2"}}}'
> Creating a customized dataset
If you do not specify a name, BigML.io will assign the source's name to the new dataset. If you do not specify a size, BigML.io will use the source's full size. If you do not specify any fields, BigML.io will include all the fields in the source with their corresponding names.
Filtering Rows
The dataset creation request can include an argument, json_filter, specifying a predicate that the input rows from the source have to satisfy in order to be included in the dataset. This predicate is specified as a (possibly nested) JSON list whose first element is an operator and the rest of the elements its arguments. Here's an example of a filter specification to choose only those rows whose field "000002" is less than 3.14:
[">", 3.14, ["field", "000002"]]
Filter Example
As you see, the list starts with the operator we want to use, ">", followed by its operands: the number 3.14, and the value of the field with identifier "000002", which is denoted by the operator "field". As another example, this filter:
["=", ["field", "000002"], ["field", "000003"], ["field", "000004"]]
Filter Example
selects rows for which the three fields with identifiers "000002", "000003" and "000004" have identical values. Note how you're not limited to two arguments. It's also worth noting that for a filter like that one to be accepted, all three fields must have the same optype (e.g. numeric), otherwise they cannot be compared.
The field operator also accepts as arguments the field's name (as a string) or the row column (as an integer). For instance, if field "000002" had column number 12, and field "000003" was named "Stock prize", our previous query could have been written:
["=", ["field", 12], ["field", "Stock prize"], ["field", "000004"]]
Filter Example
If the name is not unique, the first matching field found is picked, consistently over the whole filter formula. If you have duplicated field names, the best thing to do is to use either column numbers or field identifiers in your filters, to avoid ambiguities.
Besides a field's value, one can also ask whether it's missing or not. For instance, to include only those rows for which field "000002" contains a missing token, you would use:
["missing", "000002"]
Filter Example
Conversely, to keep only those rows where neither field 12 nor "Stock prize" has a missing value, you can combine missing with not and and:
["and", ["not", ["missing", 12]]
, ["not", ["missing", "Stock prize"]]]
Filter Example
Comparisons, logical operators, and date literals can be combined in arbitrarily nested filters, as in:
["or", ["=", 3, ["field", "000001"]]
, [">", "1969-07-14T06:10", ["field", "000111"]]
, ["and", ["missing", 23]
, ["=", "Cat", ["field", "000002"]]
, ["<", 2, ["field", "000003"], 4]]]
Filter Example
In the examples above, you can also see how dates are allowed and can be compared as numerical values (provided the implied fields are of the correct optype).
Finally, it's also possible to use the arithmetic operators +, -, * and / with numeric fields and constants, as in the following example:
[">", ["/", ["+", ["-", ["field", "000000"]
, 4.4]
, ["field", "000003"]
, ["*", 2
, ["field", "Class"]
, ["field", "000004"]]]
, 3]
, 5.5]
Filter Example
These are all the accepted operators:
These are all the accepted operators: =, !=, >, >=, <, <=, and, or, not, field, missing, +, -, *, /. To be accepted by the API, the filter must evaluate to a boolean value and contain at least one operator. So, for instance, a constant or a formula evaluating to a number will be rejected.
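For intuition, here is a toy evaluator for this filter language. It is an illustrative sketch only, not BigML's server-side implementation; field references are resolved against a row dictionary keyed by field id or name, and unary minus is not handled.

```python
import operator

# Toy evaluator for the JSON row-filter language described above.
OPS = {"=": operator.eq, "!=": operator.ne, ">": operator.gt,
       ">=": operator.ge, "<": operator.lt, "<=": operator.le,
       "+": operator.add, "-": operator.sub,
       "*": operator.mul, "/": operator.truediv}

def evaluate(expr, row):
    if not isinstance(expr, list):
        return expr                              # a constant
    op, args = expr[0], expr[1:]
    if op == "field":
        return row.get(args[0])
    if op == "missing":
        return row.get(args[0]) is None
    if op == "not":
        return not evaluate(args[0], row)
    if op == "and":
        return all(evaluate(a, row) for a in args)
    if op == "or":
        return any(evaluate(a, row) for a in args)
    vals = [evaluate(a, row) for a in args]
    if op in ("=", "!=", ">", ">=", "<", "<="):
        # comparisons are variadic: every adjacent pair must satisfy the
        # operator, so ["<", 2, x, 4] behaves like 2 < x < 4
        return all(OPS[op](a, b) for a, b in zip(vals, vals[1:]))
    result = vals[0]                             # arithmetic: fold left
    for v in vals[1:]:
        result = OPS[op](result, v)
    return result

evaluate([">", 3.14, ["field", "000002"]], {"000002": 2.5})  # True
```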
Since writing and reading the above formula in pure JSON might be a bit involved, you can also send your query to the server as a string representing a Lisp-style Flatline formula, using the argument lisp_filter, e.g.
(> (/ (+ (- (field "000000") 4.4)
(field 23)
(* 2 (field "Class") (field "000004")))
3)
5.5)
Filter Example
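To make the filter semantics above concrete, here is a minimal, purely illustrative evaluator for the JSON filter language, written in Python. It runs locally against a row modeled as a dict keyed by field id; it is a sketch of the rules described in this section, not BigML's implementation (in particular, it only resolves a name or column number if it happens to be a key of the row dict).

```python
import operator

# Illustrative evaluator for the JSON filter language described above.
# A row is modeled as a dict keyed by field id; None means a missing value.
# This is a local sketch of the documented semantics, not BigML's code.

COMPARATORS = {"=": operator.eq, "!=": operator.ne, ">": operator.gt,
               ">=": operator.ge, "<": operator.lt, "<=": operator.le}

def evaluate(expr, row):
    if not isinstance(expr, list):          # constants evaluate to themselves
        return expr
    op, args = expr[0], expr[1:]
    if op == "field":                       # field lookup (id, name or column)
        return row.get(args[0])
    if op == "missing":
        return row.get(args[0]) is None
    if op == "not":
        return not evaluate(args[0], row)
    if op == "and":
        return all(evaluate(a, row) for a in args)
    if op == "or":
        return any(evaluate(a, row) for a in args)
    vals = [evaluate(a, row) for a in args]
    if op == "+":
        return sum(vals)
    if op == "-":
        return -vals[0] if len(vals) == 1 else vals[0] - sum(vals[1:])
    if op == "*":
        result = 1
        for v in vals:
            result *= v
        return result
    if op == "/":
        result = vals[0]
        for v in vals[1:]:
            result /= v
        return result
    # comparison operators accept more than two arguments and chain
    # pairwise, so ["<", 2, x, 4] behaves like 2 < x < 4
    cmp = COMPARATORS[op]
    return all(cmp(a, b) for a, b in zip(vals, vals[1:]))

row = {"000001": 3, "000002": None, "000003": 3.5}
print(evaluate(["and", ["missing", "000002"],
                ["<", 2, ["field", "000003"], 4]], row))   # True
```

Note how the variadic comparison mirrors the `["<", 2, ["field", "000003"], 4]` example used earlier in this section.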
Retrieving a Dataset
Each dataset has a unique identifier in the form "dataset/id", where id is a string of 24 alphanumeric characters, that you can use to retrieve the dataset. Notice that to download the dataset file in CSV format you will need to append "/download" to the resource id, and for the Tableau tde format, "/download?format=tde".
To retrieve a dataset with curl:
curl "https://au.bigml.io/dataset/52b9359a3c19205ff100002a?$BIGML_AUTH"
$ Retrieving a dataset from the command line
To download the dataset file in the CSV format with curl:
curl "https://au.bigml.io/dataset/52b9359a3c19205ff100002a/download?$BIGML_AUTH"
$ Downloading a dataset csv file from the command line
To download the dataset file in the Tableau tde format with curl:
curl "https://au.bigml.io/dataset/52b9359a3c19205ff100002a/download?format=tde;$BIGML_AUTH"
$ Downloading a dataset tde file from the command line
You can also use your browser to visualize the dataset using the full BigML.io URL or by pasting the dataset/id into the BigML.com.au dashboard.
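Following the URL rules above, a client might assemble the retrieval and download endpoints like this (the host and dataset id are the examples used throughout this section):

```python
# Assembling the dataset URLs described above from a dataset/id.
# au.bigml.io and the id below are the examples used in this section.
BIGML_URL = "https://au.bigml.io"
dataset_id = "dataset/52b9359a3c19205ff100002a"

retrieve_url = "%s/%s" % (BIGML_URL, dataset_id)
csv_url = "%s/%s/download" % (BIGML_URL, dataset_id)
tde_url = "%s/%s/download?format=tde" % (BIGML_URL, dataset_id)

print(csv_url)
```

Authentication (the `$BIGML_AUTH` query string in the curl examples) would still need to be appended to each request.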
Dataset Properties
Once a dataset has been successfully created it will have the following properties.
| Property | Type | Description |
|---|---|---|
| category *(filterable, sortable, updatable)* | Integer | One of the categories in the table of categories that help classify this resource according to the domain of application. |
| code | Integer | HTTP status code. This will be 201 upon successful creation of the dataset and 200 afterwards. Check the code that comes with the status attribute to verify that the dataset creation has been completed without errors. |
| columns *(filterable, sortable)* | Integer | The number of fields in the dataset. |
| correlations | Object | A dictionary where each entry represents a field (column) in your data with the last calculated correlation/id for it. |
| created *(filterable, sortable)* | ISO-8601 Datetime | The date and time in which the dataset was created, with microsecond precision. It follows the pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are provided in Coordinated Universal Time (UTC). |
| credits *(filterable, sortable)* | Float | The number of credits it cost you to create this dataset. |
| description *(updatable)* | String | A text describing the dataset. It can contain restricted markdown to decorate the text. |
| excluded_fields | Array | The list of field ids that were excluded when building the dataset. |
| field_types | Object | A dictionary that informs about the number of fields of each type. It has an entry per field type (categorical, datetime, numeric, and text), an entry for preferred fields, and an entry for the total number of fields. In new datasets, it uses the key effective_fields to inform of the effective number of fields, that is, the total number of fields including those created under the hood to support text fields. |
| fields *(updatable)* | Object | A dictionary with an entry per field (column) in your data. Each entry includes the column number, the name of the field, the type of the field, and the summary. |
| fields_meta | Object | A dictionary with meta information about the fields dictionary. It specifies the total number of fields, the current offset and limit, and the number of fields (count) returned. |
| input_fields | Array | The list of input fields' ids used to create the dataset. |
| json_query | Object | A dictionary specifying each of the parts of the executed SQL query that was used to create this dataset. |
| json_query_parsed | Object | The canonical representation of the SQL query as a JSON map. |
| juxtapose *(filterable, sortable)* | Boolean | Whether juxtaposition has been performed during creation. |
| juxtapose_input_fields | Object | A dictionary keyed by dataset/id and an array of field names and/or ids that specifies the input fields to use for each dataset during merge. |
| locale | String | The source's locale. |
| name *(filterable, sortable, updatable)* | String | The name of the dataset as you provided it or, by default, one based on the name of the source. |
| number_of_anomalies *(filterable, sortable)* | Integer | The current number of anomalies that use this dataset. |
| number_of_anomalyscores *(filterable, sortable)* | Integer | The current number of anomaly scores that use this dataset. |
| number_of_associations *(filterable, sortable)* | Integer | The current number of associations that use this dataset. |
| number_of_associationsets *(filterable, sortable)* | Integer | The current number of association sets that use this dataset. |
| number_of_batchanomalyscores *(filterable, sortable)* | Integer | The current number of batch anomaly scores that use this dataset. |
| number_of_batchcentroids *(filterable, sortable)* | Integer | The current number of batch centroids that use this dataset. |
| number_of_batchpredictions *(filterable, sortable)* | Integer | The current number of batch predictions that use this dataset. |
| number_of_batchtopicdistributions *(filterable, sortable)* | Integer | The current number of batch topic distributions that use this dataset. |
| number_of_centroids *(filterable, sortable)* | Integer | The current number of centroids that use this dataset. |
| number_of_clusters *(filterable, sortable)* | Integer | The current number of clusters that use this dataset. |
| number_of_correlations *(filterable, sortable)* | Integer | The current number of correlations that use this dataset. |
| number_of_ensembles *(filterable, sortable)* | Integer | The current number of ensembles that use this dataset. |
| number_of_evaluations *(filterable, sortable)* | Integer | The current number of evaluations that use this dataset. |
| number_of_forecasts *(filterable, sortable)* | Integer | The current number of forecasts that use this dataset. |
| number_of_logisticregressions *(filterable, sortable)* | Integer | The current number of logistic regressions that use this dataset. |
| number_of_models *(filterable, sortable)* | Integer | The current number of models that use this dataset. |
| number_of_optimls *(filterable, sortable)* | Integer | The current number of OptiMLs that use this dataset. |
| number_of_predictions *(filterable, sortable)* | Integer | The current number of predictions that use this dataset. |
| number_of_statisticaltests *(filterable, sortable)* | Integer | The current number of statistical tests that use this dataset. |
| number_of_timeseries *(filterable, sortable)* | Integer | The current number of time series that use this dataset. |
| number_of_topicdistributions *(filterable, sortable)* | Integer | The current number of topic distributions that use this dataset. |
| number_of_topicmodels *(filterable, sortable)* | Integer | The current number of topic models that use this dataset. |
| objective_field *(updatable)* | Object | The default objective field. |
| optiml *(filterable, sortable)* | String | The optiml/id that created this dataset. |
| optiml_status *(filterable, sortable)* | Boolean | Whether the OptiML is still available or has been deleted. |
| origin *(filterable, sortable)* | String | The dataset/id of the original gallery dataset. |
| out_of_bag *(filterable, sortable)* | Boolean | Whether the out-of-bag instances were used to clone the dataset instead of the sampled instances. |
| price *(filterable, sortable, updatable)* | Float | The price other users must pay to clone your dataset. |
| private *(filterable, sortable, updatable)* | Boolean | Whether the dataset is public or not. |
| project *(filterable, sortable, updatable)* | String | The project/id the resource belongs to. |
| range | Array | The range of instances used to clone the dataset. |
| refresh_field_types *(filterable, sortable)* | Boolean | Whether the field types of the dataset have been recomputed or not. |
| refresh_objective *(filterable, sortable)* | Boolean | Whether the default objective field of the dataset has been recomputed or not. |
| refresh_preferred *(filterable, sortable)* | Boolean | Whether the preferred flags of the dataset fields have been recomputed or not. |
| replacement *(filterable, sortable)* | Boolean | Whether the instances sampled to clone the dataset were selected using replacement or not. |
| resource | String | The dataset/id. |
| rows *(filterable, sortable)* | Integer | The total number of rows in the dataset. |
| sample_rate *(filterable, sortable)* | Float | The sample rate used to select instances from the dataset. |
| seed *(filterable, sortable)* | String | The string that was used to generate the sample. |
| shared *(filterable, sortable, updatable)* | Boolean | Whether the dataset is shared using a private link or not. |
| shared_clonable *(filterable, sortable, updatable)* | Boolean | Whether the shared dataset can be cloned or not. |
| shared_hash | String | The hash that gives access to this dataset if it has been shared using a private link. |
| sharing_key | String | The alternative key that gives read access to this dataset. |
| size *(filterable, sortable)* | Integer | The number of bytes of the source that were used to create this dataset. |
| source *(filterable, sortable)* | String | The source/id that was used to build the dataset. |
| source_status *(filterable, sortable)* | Boolean | Whether the source is still available or has been deleted. |
| sql_output_fields | Array of Objects | A list of dictionaries containing some of the properties of the fields generated by the given sql_query or json_query. |
| sql_query | String | The SQL query that was executed to create this dataset. |
| sql_query_parsed | String | The canonical form of the query as a SQL prepared statement. |
| statisticaltest *(filterable, sortable)* | String | The last statisticaltest/id that was generated for this dataset. |
| status | Object | A description of the status of the dataset. It includes a code, a message, and some extra information. See the table below. |
| subscription *(filterable, sortable)* | Boolean | Whether the dataset was created using a subscription plan or not. |
| tags *(filterable, updatable)* | Array of Strings | A list of user tags that can help classify and index this resource. |
| term_limit *(filterable, sortable)* | Integer | The maximum total number of terms used by all the text fields. |
| updated *(filterable, sortable)* | ISO-8601 Datetime | The date and time in which the dataset was updated, with microsecond precision. It follows the pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are provided in Coordinated Universal Time (UTC). |
| webhook | Object | A webhook url and an optional secret phrase. See the Section on Webhooks for more details. |
Dataset Fields
The property fields is a dictionary keyed by each field's id in the source. The value of each entry is an object with the following properties:
Numeric Summary
Numeric summaries come with all the fields described below. If the number of unique values in the data is greater than 32, then bins will be used for the summary; otherwise, counts will be available.
| Property | Type | Description |
|---|---|---|
| bins | Array | An array that represents an approximate histogram of the distribution. It consists of value pairs, where the first value is the mean of a histogram bin and the second value is the bin population. bins is only available when the number of distinct values is greater than 32. For more information, see our blog post or read this paper. |
| counts | Array | An array of pairs where the first element of each pair is one of the unique values found in the field and the second element is the count. Only available when the number of distinct values is less than or equal to 32. |
| kurtosis | Number | The sample kurtosis. A measure of 'peakiness' or heavy tails in the field's distribution. |
| maximum | Number | The maximum value found in this field. |
| mean | Number | The arithmetic mean of non-missing field values. |
| median | Number | The approximate median of the non-missing values in this field. |
| minimum | Number | The minimum value found in this field. |
| missing_count | Integer | Number of instances missing this field. |
| population | Integer | The number of instances containing data for this field. |
| skewness | Number | The sample skewness. A measure of asymmetry in the field's distribution. |
| standard_deviation | Number | The unbiased sample standard deviation. |
| sum | String | Sum of all values for this field (for mean calculation). |
| sum_squares | String | Sum of squared values (for variance calculation). |
| variance | Number | The unbiased sample variance. |
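The aggregate fields relate to one another in the usual way: the mean is sum over population, and the unbiased sample variance can be recovered from sum_squares. A quick sanity check using the sepal length summary from the example response later in this section (the stored sums are rounded in the response, so the match is approximate):

```python
# Recomputing mean and unbiased sample variance from the aggregate fields
# of a numeric summary. The numbers are the sepal length values from the
# example dataset response in this section; they are rounded, so the
# variance comparison is approximate.
summary = {"population": 150, "sum": 876.5, "sum_squares": 5223.85,
           "mean": 5.84333, "variance": 0.68569}

n = summary["population"]
mean = summary["sum"] / n
# unbiased sample variance: (sum of squares - n * mean^2) / (n - 1)
variance = (summary["sum_squares"] - n * mean ** 2) / (n - 1)

print(round(mean, 5))   # matches the reported mean
print(round(variance, 2))
```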
Categorical Summary
Categorical summaries give you a count per category, plus a missing count in case any of the instances contain missing values.
Text Summary
Text summaries give statistics about the vocabulary of a text field, and the number of instances containing missing values.
Dataset Status
Before a dataset is successfully created, BigML.io makes sure that it has been uploaded in an understandable format, that the data that it contains is parseable, and that the types for each column in the data can be inferred successfully. The dataset goes through a number of states until all these analyses are completed. Through the status field in the dataset you can determine when the dataset has been fully processed and is ready to be used to create a model. These are the fields that a dataset's status has:
| Property | Type | Description |
|---|---|---|
| bytes | Integer | Number of bytes processed so far. |
| code | Integer | A status code that reflects the status of the dataset creation. It can be any of those that are explained here. |
| elapsed | Integer | Number of milliseconds that BigML.io took to process the dataset. |
| field_errors | Object | Information about ill-formatted fields that includes the total format errors for the field and a sample of the ill-formatted tokens. |
| message | String | A human readable message explaining the status. |
| row_format_errors | Array | Information about ill-formatted rows. It includes the total row-format errors and a sampling of the ill-formatted rows. |
| serialized_rows | Integer | The number of rows serialized so far. |
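A client typically polls the status until processing finishes. The sketch below waits for the finished code (5, as in the example response below); get_dataset stands in for an HTTP GET of the dataset resource, here simulated with canned states, and treating negative codes as errors is an assumption of this sketch.

```python
import time

# Sketch of a polling loop that waits until the dataset status reaches the
# finished code (5, as in the example response below). get_dataset stands in
# for an HTTP GET of the dataset resource, simulated here with canned
# states; treating negative codes as errors is an assumption of this sketch.
STATES = iter([{"code": 1, "message": "queued"},
               {"code": 2, "message": "started"},
               {"code": 5, "message": "The dataset has been created"}])

def get_dataset():
    return {"status": next(STATES)}

def wait_until_ready(poll_seconds=0.0):
    while True:
        status = get_dataset()["status"]
        if status["code"] == 5:          # finished
            return status
        if status["code"] < 0:           # assumed error convention
            raise RuntimeError(status["message"])
        time.sleep(poll_seconds)

final = wait_until_ready()
print(final["message"])
```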
Once a dataset has been successfully created, it will look like:
{
"category": 0,
"code": 200,
"columns": 5,
"created": "2012-11-15T02:29:09.293000",
"credits": 0.00439453125,
"description": "",
"excluded_fields": [],
"fields": {
"000000": {
"column_number": 0,
"datatype": "double",
"name": "sepal length",
"optype": "numeric",
"order": 0,
"preferred": true,
"summary": {
"bins": [
[
4.3,
1
],
[
4.425,
4
],
[
4.6,
4
],
[
4.7,
2
],
[
4.8,
5
],
[
4.9,
6
],
[
5,
10
],
[
5.1,
9
],
[
5.2,
4
],
[
5.3,
1
],
[
5.4,
6
],
[
5.5,
7
],
[
5.6,
6
],
[
5.7,
8
],
[
5.8,
7
],
[
5.9,
3
],
[
6,
6
],
[
6.1,
6
],
[
6.2,
4
],
[
6.3,
9
],
[
6.44167,
12
],
[
6.6,
2
],
[
6.7,
8
],
[
6.8,
3
],
[
6.92,
5
],
[
7.1,
1
],
[
7.2,
3
],
[
7.3,
1
],
[
7.4,
1
],
[
7.6,
1
],
[
7.7,
4
],
[
7.9,
1
]
],
"maximum": 7.9,
"mean": 5.84333,
"median": 5.77889,
"minimum": 4.3,
"missing_count": 0,
"population": 150,
"splits": [
4.51526,
4.67252,
4.81113,
4.89582,
4.96139,
5.01131,
5.05992,
5.11148,
5.18177,
5.35681,
5.44129,
5.5108,
5.58255,
5.65532,
5.71658,
5.77889,
5.85381,
5.97078,
6.05104,
6.13074,
6.23023,
6.29578,
6.35078,
6.41459,
6.49383,
6.63013,
6.70719,
6.79218,
6.92597,
7.20423,
7.64746
],
"standard_deviation": 0.82807,
"sum": 876.5,
"sum_squares": 5223.85,
"variance": 0.68569
}
},
"000001": {
"column_number": 1,
"datatype": "double",
"name": "sepal width",
"optype": "numeric",
"order": 1,
"preferred": true,
"summary": {
"counts": [
[
2,
1
],
[
2.2,
3
],
[
2.3,
4
],
[
2.4,
3
],
[
2.5,
8
],
[
2.6,
5
],
[
2.7,
9
],
[
2.8,
14
],
[
2.9,
10
],
[
3,
26
],
[
3.1,
11
],
[
3.2,
13
],
[
3.3,
6
],
[
3.4,
12
],
[
3.5,
6
],
[
3.6,
4
],
[
3.7,
3
],
[
3.8,
6
],
[
3.9,
2
],
[
4,
1
],
[
4.1,
1
],
[
4.2,
1
],
[
4.4,
1
]
],
"maximum": 4.4,
"mean": 3.05733,
"median": 3.02044,
"minimum": 2,
"missing_count": 0,
"population": 150,
"standard_deviation": 0.43587,
"sum": 458.6,
"sum_squares": 1430.4,
"variance": 0.18998
}
},
"000002": {
"column_number": 2,
"datatype": "double",
"name": "petal length",
"optype": "numeric",
"order": 2,
"preferred": true,
"summary": {
"bins": [
[
1,
1
],
[
1.1,
1
],
[
1.2,
2
],
[
1.3,
7
],
[
1.4,
13
],
[
1.5,
13
],
[
1.63636,
11
],
[
1.9,
2
],
[
3,
1
],
[
3.3,
2
],
[
3.5,
2
],
[
3.6,
1
],
[
3.75,
2
],
[
3.9,
3
],
[
4.0375,
8
],
[
4.23333,
6
],
[
4.46667,
12
],
[
4.6,
3
],
[
4.74444,
9
],
[
4.94444,
9
],
[
5.1,
8
],
[
5.25,
4
],
[
5.46,
5
],
[
5.6,
6
],
[
5.75,
6
],
[
5.95,
4
],
[
6.1,
3
],
[
6.3,
1
],
[
6.4,
1
],
[
6.6,
1
],
[
6.7,
2
],
[
6.9,
1
]
],
"maximum": 6.9,
"mean": 3.758,
"median": 4.34142,
"minimum": 1,
"missing_count": 0,
"population": 150,
"splits": [
1.25138,
1.32426,
1.37171,
1.40962,
1.44567,
1.48173,
1.51859,
1.56301,
1.6255,
1.74645,
3.23033,
3.675,
3.94203,
4.0469,
4.18243,
4.34142,
4.45309,
4.51823,
4.61771,
4.72566,
4.83445,
4.93363,
5.03807,
5.1064,
5.20938,
5.43979,
5.5744,
5.6646,
5.81496,
6.02913,
6.38125
],
"standard_deviation": 1.7653,
"sum": 563.7,
"sum_squares": 2582.71,
"variance": 3.11628
}
},
"000003": {
"column_number": 3,
"datatype": "double",
"name": "petal width",
"optype": "numeric",
"order": 3,
"preferred": true,
"summary": {
"counts": [
[
0.1,
5
],
[
0.2,
29
],
[
0.3,
7
],
[
0.4,
7
],
[
0.5,
1
],
[
0.6,
1
],
[
1,
7
],
[
1.1,
3
],
[
1.2,
5
],
[
1.3,
13
],
[
1.4,
8
],
[
1.5,
12
],
[
1.6,
4
],
[
1.7,
2
],
[
1.8,
12
],
[
1.9,
5
],
[
2,
6
],
[
2.1,
6
],
[
2.2,
3
],
[
2.3,
8
],
[
2.4,
3
],
[
2.5,
3
]
],
"maximum": 2.5,
"mean": 1.19933,
"median": 1.32848,
"minimum": 0.1,
"missing_count": 0,
"population": 150,
"standard_deviation": 0.76224,
"sum": 179.9,
"sum_squares": 302.33,
"variance": 0.58101
}
},
"000004": {
"column_number": 4,
"datatype": "string",
"name": "species",
"optype": "categorical",
"order": 4,
"preferred": true,
"summary": {
"categories": [
[
"Iris-versicolor",
50
],
[
"Iris-setosa",
50
],
[
"Iris-virginica",
50
]
],
"missing_count": 0
}
}
},
"fields_meta": {
"count": 5,
"limit": 200,
"offset": 0,
"total": 5
},
"input_fields": [
"000000",
"000001",
"000002",
"000003",
"000004"
],
"locale": "en_US",
"name": "iris' dataset",
"number_of_evaluations": 0,
"number_of_models": 0,
"number_of_predictions": 0,
"price": 0.0,
"private": true,
"project": null,
"resource": "dataset/52b9359a3c19205ff100002a",
"rows": 150,
"size": 4608,
"source": "source/50a4527b3c1920186d000041",
"source_status": true,
"status": {
"bytes": 4608,
"code": 5,
"elapsed": 163,
"field_errors": [],
"message": "The dataset has been created",
"row_format_errors": [],
"serialized_rows": 150
},
"tags": [],
"updated": "2012-11-15T02:29:10.537000",
"views": 0
}
< Example dataset JSON response
Filtering and Paginating Fields from a Dataset
A dataset might be composed of hundreds or even thousands of fields. Thus when retrieving a dataset, it's possible to specify that only a subset of fields be retrieved, by using any combination of the following parameters in the query string (unrecognized parameters are ignored):
Since fields is a map and therefore not ordered, the returned fields contain an additional key, order, whose (increasing) integer value gives you their ordering. In all other respects, the dataset is the same as the one you would get without any of the filtering parameters above.
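A client could page through a large fields map by advancing an offset until the count, offset, limit, and total entries of fields_meta indicate everything has been fetched. In this sketch, fetch_page stands in for an HTTP GET of the dataset with pagination parameters, simulated locally:

```python
# Hypothetical pagination loop driven by fields_meta. fetch_page stands in
# for an HTTP GET of the dataset with pagination query parameters; here it
# serves pages from a simulated map of 450 fields.
FIELDS = {"%06d" % i: {"name": "field %d" % i} for i in range(450)}

def fetch_page(offset, limit=200):
    ids = sorted(FIELDS)[offset:offset + limit]
    return {"fields": {i: FIELDS[i] for i in ids},
            "fields_meta": {"count": len(ids), "offset": offset,
                            "limit": limit, "total": len(FIELDS)}}

def all_fields():
    offset, fields = 0, {}
    while True:
        page = fetch_page(offset)
        fields.update(page["fields"])
        meta = page["fields_meta"]
        offset += meta["count"]
        if offset >= meta["total"]:   # every field has been fetched
            return fields

print(len(all_fields()))   # 450
```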
The fields_meta field can help you paginate fields; its structure is described in the Dataset Properties section above.
Updating a Dataset
To update a dataset, you need to PUT an object containing the fields that you want to update to the dataset's base URL. The content-type must always be "application/json". If the request succeeds, BigML.io will return an HTTP 202 response with the updated dataset.
For example, to update a dataset with a new name you can use curl like this:
curl "https://au.bigml.io/dataset/52b9359a3c19205ff100002a?$BIGML_AUTH" \
-X PUT \
-d '{"name": "a new name"}' \
-H 'content-type: application/json'
$ Updating a dataset's name
Deleting a Dataset
To delete a dataset, you need to issue a HTTP DELETE request to the dataset/id to be deleted.
Using curl you can do something like this to delete a dataset:
curl -X DELETE "https://au.bigml.io/dataset/52b9359a3c19205ff100002a?$BIGML_AUTH"
$ Deleting a dataset from the command line
If the request succeeds you will not see anything on the command line unless you executed the command in verbose mode. Successful DELETEs will return "204 no content" responses with no body.
Once you delete a dataset, it is permanently deleted. That is, a delete request cannot be undone. If you try to delete a dataset a second time, or a dataset that does not exist, you will receive a "404 not found" response.
However, if you try to delete a dataset that is being used at the moment, then BigML.io will not accept the request and will respond with a "400 bad request" response.
See this section for more details.
Listing Datasets
To list all the datasets, you can use the dataset base URL. By default, only the 20 most recent datasets will be returned. You can see below how to change this number using the limit parameter.
You can get your list of datasets directly in your browser using your own username and API key with the following links.
https://au.bigml.io/dataset?$BIGML_AUTH
> Listing datasets from a browser
Multi-Datasets
BigML.io allows you to create a new dataset by merging multiple datasets. This functionality can be very useful when you use multiple sources of data, and in online scenarios as well. Imagine, for example, that you collect data on an hourly basis and want to create a dataset aggregating the data collected over the whole day. You only need to send the newly generated data each hour to BigML, create a source and a dataset for each batch, and then merge all the individual datasets into one at the end of the day.
We usually call datasets created in this way multi-datasets. BigML.io allows you to aggregate up to 32 datasets in the same API request. You can merge multi-datasets too, so you can effectively grow a dataset as much as you want.
To create a multi-dataset, you can specify a list of dataset identifiers as input using the argument origin_datasets. The example below will construct a new dataset that is the concatenation of three other datasets.
curl "https://au.bigml.io/dataset?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"origin_datasets": [
"dataset/52bc7fc83c1920e4a3000012",
"dataset/52bc7fd03c1920e4a3000016",
"dataset/52bc80233c1920e4a300001a"]}'
> Creating a multi dataset
By convention, the first dataset defines the final dataset fields. However, there can be cases where each dataset might come from a different source and therefore have different field ids. In these cases, you might need to use a fields_maps argument to match each field in a dataset to the fields of the first dataset.
curl "https://au.bigml.io/dataset?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"origin_datasets": [
"dataset/52bc7fc83c1920e4a3000012",
"dataset/52bc7fd03c1920e4a3000016",
"dataset/52bc80233c1920e4a300001a",
"dataset/52bc851b3c1920e4a3000022"],
"fields_maps": {
"dataset/52bc7fd03c1920e4a3000016": {
"000000":"000023",
"000001":"000024",
"000002":"00003a"},
"dataset/52bc80233c1920e4a300001a": {
"000000":"000023",
"000001":"000004",
"000002":"00000f"}}}'
> Creating a multi dataset mapping fields
In the request above, we use four datasets as input, and the first one defines the final dataset fields. Let's say that "dataset/52bc7fc83c1920e4a3000012" in this example has three fields with identifiers "000000", "000001" and "000002". Those will be the default resulting fields, together with their datatypes and so on. Then we need to specify, for each of the remaining datasets in the list, a mapping from those "standard" fields to the ones in the corresponding dataset. In our example, we're saying that the fields of the second dataset to be used during the concatenation are "000023", "000024" and "00003a", which correspond to the final fields having them as keys. In the case of the third dataset, the fields used will be "000023", "000004" and "00000f". For the last one, since there's no entry in fields_maps, we'll try to use the same identifiers as those of the first dataset.
The optypes of the paired fields should match, and for the case of categorical fields, be a proper subset. If a final field has optype text, however, all values are converted to strings.
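What a fields_maps entry expresses can be pictured locally: the columns of a secondary dataset are re-keyed to the first dataset's field ids before concatenation. A toy sketch, using the field ids from the request above (BigML performs this matching server-side; the code is illustrative only):

```python
# Toy picture of a fields_maps entry: re-key one row of a secondary dataset
# so its columns line up with the first dataset's field ids. Illustrative
# only; BigML performs this matching server-side during the merge.
fields_map = {"000000": "000023", "000001": "000024", "000002": "00003a"}

def remap(row, fields_map):
    # keys are field ids in the first dataset; values are ids in this dataset
    return {first_id: row.get(this_id)
            for first_id, this_id in fields_map.items()}

remapped = remap({"000023": 5.1, "000024": 3.5, "00003a": "Iris-setosa"},
                 fields_map)
print(remapped)
```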
BigML.io also allows you to sample each dataset individually before merging it. You can specify the sample options for each dataset using the arguments sample_rates, replacements, seeds, and out_of_bags. All are dictionaries that must be keyed using the dataset/id of the dataset you want to specify parameters for. The next request will create a multi-dataset sampling the two input datasets differently.
curl "https://au.bigml.io/dataset?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"origin_datasets": [
"dataset/52bc7fc83c1920e4a3000012",
"dataset/52bc851b3c1920e4a3000022"],
"sample_rates": {
"dataset/52bc7fc83c1920e4a3000012": 0.5,
"dataset/52bc851b3c1920e4a3000022": 0.8},
"replacements": {
"dataset/52bc7fc83c1920e4a3000012": false,
"dataset/52bc851b3c1920e4a3000022": true}
}'
> Creating a multi dataset
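The per-dataset sampling expressed by sample_rates, replacements, and seeds can be sketched locally like this (a seeded random generator is just an illustration of deterministic sampling; BigML's actual sampling procedure is not specified here):

```python
import random

# Local sketch of per-dataset sampling: each dataset gets its own rate,
# replacement flag and seed, mirroring the sample_rates / replacements /
# seeds arguments. Illustrative only; not BigML's actual sampler.
def sample_rows(rows, rate, replacement, seed):
    rng = random.Random(seed)        # a seed makes the sample deterministic
    size = round(len(rows) * rate)
    if replacement:                  # instances may repeat
        return [rng.choice(rows) for _ in range(size)]
    return rng.sample(rows, size)    # instances are unique

rows = list(range(100))
half = sample_rows(rows, 0.5, False, "deterministic-seed")
print(len(half))   # 50
```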
BigML.io also allows you to create a new dataset merging multiple datasets using juxtaposition instead of concatenating the datasets passed in the argument origin_datasets. In its simplest form, a juxtaposition request would look like this:
curl "https://au.bigml.io/dataset?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"origin_datasets": [
"dataset/52bc7fc83c1920e4a3000012",
"dataset/52bc7fd03c1920e4a3000016",
"dataset/52bc80233c1920e4a300001a"],
"juxtapose": true}'
> Creating a multi dataset using juxtaposition
In the example above we are asking for the generation of a new dataset, each of whose rows is constructed by putting side by side one row from each of the three origin datasets. The new dataset will thus contain as many rows as the shortest input dataset and as many fields as the sum of the numbers of fields of the input datasets.
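Row-wise, this behaves like zipping the datasets together and truncating to the shortest one. A toy sketch with made-up rows:

```python
# Juxtaposition sketch: rows are paired side by side and the result is
# truncated to the shortest input, as described above. Toy data only.
ds_a = [[5.1, "Iris-setosa"], [7.0, "Iris-versicolor"], [6.3, "Iris-virginica"]]
ds_b = [[0.2], [1.4]]                # one row fewer than ds_a

juxtaposed = [row_a + row_b for row_a, row_b in zip(ds_a, ds_b)]
print(juxtaposed)   # two rows, three columns each
```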
Unless otherwise specified, all fields of each of the datasets in origin_datasets are used in the juxtaposition. If you want to use a subset of any of them, specify it using juxtapose_input_fields. This creation request field must be an object where each entry specifies the input fields to use for the corresponding dataset. The values must be lists of field names or identifiers. For instance:
curl "https://au.bigml.io/dataset?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"origin_datasets": [
"dataset/52bc7fc83c1920e4a3000012",
"dataset/52bc7fd03c1920e4a3000016",
"dataset/52bc80233c1920e4a300001a"],
"juxtapose": true,
"juxtapose_input_fields": {
"dataset/52bc7fc83c1920e4a3000012": ["000001", "species"],
"dataset/52bc80233c1920e4a300001a": ["age", "000002", "000003"]}}'
> Creating a multi dataset using juxtaposition
It will juxtapose two fields of the first dataset, all the fields of the second dataset, and three fields of the last dataset. We also show in the example how fields can be identified by either id or name.
This is the list of all the arguments that you can use to create a multi-dataset.
| Argument | Type | Description |
|---|---|---|
| fields_maps *(optional)* | Object | A dictionary keyed by dataset/id with object values. Each entry maps fields in the first dataset to fields in the dataset referenced by the key. |
| json_query *(optional)* | Object | A dictionary specifying each of the parts of the executed SQL query separately. See the Section on Creating a Dataset using SQL for more details. |
| juxtapose *(optional)* | Boolean, default is false | Whether juxtaposition should be performed on multi-dataset merging. |
| juxtapose_input_fields *(optional)* | Object | A dictionary keyed by dataset/id and an array of field names and/or ids that specifies the input fields to use for each dataset during merge. |
| origin_dataset_names *(optional)* | Object | A dictionary keyed by dataset/id whose string values are the names to be used as table names in the SQL query to be performed for the datasets. See the Section on Creating a Dataset using SQL for more details. |
| out_of_bags *(optional)* | Object | A dictionary keyed by dataset/id with boolean values. Setting this parameter to true for a dataset will use the out-of-bag instances instead of the sampled instances. See the Section on Sampling for more details. |
| replacements *(optional)* | Object | A dictionary keyed by dataset/id with boolean values indicating whether sampling should be performed with or without replacement. See the Section on Sampling for more details. |
| sample_rates *(optional)* | Object | A dictionary keyed by dataset/id with float values. Each value is a number between 0 and 1 specifying the sample rate for the dataset. See the Section on Sampling for more details. |
| seeds *(optional)* | Object | A dictionary keyed by dataset/id with string values indicating the seed to be used for each dataset to generate deterministic samples. See the Section on Sampling for more details. |
| sql_output_fields *(optional)* | Array of Objects | A list of dictionaries containing some of the properties of the fields generated by the given sql_query or json_query. See the Section on Creating a Dataset using SQL for more details. |
| sql_query *(optional)* | String | The SQL query to be executed. See the Section on Creating a Dataset using SQL for more details. |
When a multi-dataset is created, BigML.io proceeds as follows:
- Sample each individual dataset according to the specifications provided in the arguments sample_rates, replacements, seeds, and out_of_bags.
- Merge all the datasets together, using the fields_maps argument to match fields in case they come from different sources (i.e., have different field ids).
- When juxtapose is true, all arguments in the table above except juxtapose_input_fields are ignored; however, the rules below still apply.
- Sample the merged dataset as in regular dataset sampling, using the arguments sample_rate, replacement, seed, and out_of_bag.
- Filter the sampled dataset using input_fields, excluded_fields, and either a json_filter or lisp_filter.
- Extend the dataset with new fields according to the specifications provided in the new_fields argument.
- Filter the output of the new fields using either an output_json_filter or output_lisp_filter.
Resources Accepting Multi-Datasets Input
You can also create a model using multiple datasets as input at once, that is, without merging all the datasets together into a new dataset first. The same applies to correlations, statistical tests, ensembles, clusters, anomaly detectors, and evaluations. All the multi-dataset arguments above can be used. You just need to use the datasets argument instead of the regular dataset. See the examples below to create a multi-dataset model, a multi-dataset ensemble, and a multi-dataset evaluation.
curl "https://au.bigml.io/model?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"datasets": [
"dataset/52bc7fc83c1920e4a3000012",
"dataset/52bc851b3c1920e4a3000022"],
"sample_rates": {
"dataset/52bc7fc83c1920e4a3000012": 0.5,
"dataset/52bc851b3c1920e4a3000022": 0.8},
"replacements": {
"dataset/52bc7fc83c1920e4a3000012": false,
"dataset/52bc851b3c1920e4a3000022": true}
}'
> Creating a multi-dataset model
curl "https://au.bigml.io/ensemble?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"datasets": [
"dataset/52bc7fc83c1920e4a3000012",
"dataset/52bc851b3c1920e4a3000022"],
"sample_rates": {
"dataset/52bc7fc83c1920e4a3000012": 0.5,
"dataset/52bc851b3c1920e4a3000022": 0.8},
"replacements": {
"dataset/52bc7fc83c1920e4a3000012": false,
"dataset/52bc851b3c1920e4a3000022": true}
}'
> Creating a multi-dataset ensemble
curl "https://au.bigml.io/evaluation?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"model": "model/52bcb43e3c1920e4a3000026",
"datasets": [
"dataset/52bc7fc83c1920e4a3000012",
"dataset/52bc851b3c1920e4a3000022"],
"out_of_bags": {
"dataset/52bc7fc83c1920e4a3000012": true,
"dataset/52bc851b3c1920e4a3000022": true},
"sample_rates": {
"dataset/52bc7fc83c1920e4a3000012": 0.5,
"dataset/52bc851b3c1920e4a3000022": 0.8},
"replacements": {
"dataset/52bc7fc83c1920e4a3000012": false,
"dataset/52bc851b3c1920e4a3000022": true}
}'
> Creating a multi-dataset evaluation
Creating a Dataset using SQL
BigML.io now allows you to create a new dataset by performing an SQL-style query over a list of input datasets, which are treated as SQL tables. To that end, the POST request JSON should contain the following fields:
- origin_datasets: a list of identifiers for the input datasets that will be used as the input tables of the query. This is identical to the field used to specify multiple datasets when merging datasets.
- origin_dataset_names: a dictionary keyed by dataset/id whose values are the names to be used as table names for those datasets in the SQL query. These will typically be short names so that the SQL query stays readable. (I.e., you want to write "SELECT d0.field1" rather than "SELECT a_long_dataset_name.field1".)
- sql_query: a string with the SQL query to be executed, or json_query: a map specifying each of the parts of the SQL query separately.
- sql_output_fields: a list of dictionaries containing some of the properties of the fields generated by the given sql_query or json_query.
The platform will parse the query, converting its field names to identifiers if needed, and return (for informational purposes) two additional fields, namely sql_query_parsed and json_query_parsed. The first is the canonical form of the query as a SQL prepared statement (i.e., a list with a string that can contain wildcards and, if needed, some arguments, as in ["SELECT * FROM A WHERE A.000000 > ?", 2]), and the second is the canonical representation of the query as a JSON map.
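To make the sql_query_parsed shape concrete, here is a hypothetical helper that substitutes the argument list back into the wildcard string; `fill_prepared` is purely illustrative and not part of the API:

```python
def fill_prepared(parsed):
    """Expand a prepared-statement pair of the form returned in
    sql_query_parsed, e.g. ["SELECT ... WHERE A.000000 > ?", 2],
    by substituting each argument for a "?" wildcard in order."""
    template, *args = parsed
    for arg in args:
        template = template.replace("?", repr(arg), 1)
    return template
```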
curl "https://au.bigml.io/dataset?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"origin_datasets": [
"dataset/5bab8d6b1f386f7c20000000",
"dataset/5bab8d6e1f386f7c20000003"
],
"origin_dataset_names": {
"dataset/5bab8d6b1f386f7c20000000": "A",
"dataset/5bab8d6e1f386f7c20000003": "B"
},
"sql_query": "select A.`000000` as x, A.`00000a` as z, A.`00000c` from A, B where A.id = B.id",
"sql_output_fields": [
{
"column": 0,
"name": "name, a text",
"optype": "text",
"term_analysis": {
"enabled": true,
"case_sensitive": true
}
},
{
"column": 1,
"optype": "items",
"item_analysis": {
"separator": ";"
}
}
]
}'
> Creating a dataset using SQL
A SELECT specification can be provided either as a SQL string or as a map possibly containing the following keys:
-
SELECT: A list of strings,
each one specifying one of the new fields in the generated dataset.
This corresponds to the "selected columns" part of a "SELECT FROM ..." SQL statement,
and will use the names in origin_dataset_names
to refer to input datasets as SQL tables.
Each table has one column per dataset field, and each column's canonical name is the field identifier;
but for convenience you can refer to input columns using field names
and BigML will translate them automatically to identifiers.
For instance, say we have origin_dataset_names:
{"dataset/5bab8d6b1f386f7c20000000": "d0"}, i.e. one input dataset with,
say, fields 000000, 000001 and 000002 named field1, field2 and field3.
One could select only the first field of the first dataset either via
"SELECT d0.`000000`" or via "SELECT d0.field1", or using the maps:
{"select": ["d0.`000000`"], ...}
{"select": ["d0.field1"], ...}
or the second and first columns of the dataset with, for instance, "SELECT d0.`000001`, d0.field1", with the map form: {"select": ["d0.`000001`", "d0.field1"], ...}
In an SQL query specified as a string, you can name the output columns of your query using "AS", for instance: "SELECT d0.field2 AS age" will pick the third column of the dataset, and the corresponding field of the generated dataset will be named age. When the query is specified using a JSON dictionary, the corresponding element in the select list will be a pair, with the first element the left-hand operand of "AS" and the second element the right-hand one. So the previous query would be translated as: {"select": [["d0.field2", "age"]], ...}
It is also possible to specify SQL function calls in a select list element, using prefix notation for the operators.
- DISTINCT: A boolean flag that corresponds to the distinct SQL keyword when set to true. Defaults to false.
- LIMIT: An integer with the maximum number of rows to select, as in the SQL string "SELECT * LIMIT 3".
- OFFSET: The offset of the selected rows, as an integer. Corresponds to the offset SQL keyword.
-
WHERE: A JSON rendition, as a list, of a SQL where clause.
The first element in the list is the SQL operand to apply
(one of and, or, count,
avg, sum, min,
max, =, <>,
!=, >, >=,
<, <=, and between),
and the rest are its operands (possibly including nested operators).
So for instance, the SQL clause "d.f0 < 3 and e.f1 = e.f2" is written as
["and", ["<", "d.f0", 3], ["=", "e.f1", "e.f2"]]
- HAVING: A translation of a SQL having clause using prefix notation, as in where.
- GROUP_BY: A list of field identifiers (as in select) to perform a SQL group by operation.
-
ORDER_BY: A list of field identifiers to perform a SQL order by operation.
Each element in the list can be either a field identifier string,
or a pair of a field identifier and either ASC or
DESC to denote ascending or descending ordering.
["t.field1", ["t.field2", "DESC"], ["e.field0", "ASC"]]
-
JOIN, LEFT_JOIN, RIGHT_JOIN, FULL_JOIN:
An SQL JOIN clause is used to combine rows from two or more datasets (tables).
The specification consists of a list starting with the name of the dataset to join on,
followed by the operation that one writes in the SQL ON specification.
Thus, for instance, the SQL string "JOIN foo ON foo.id = bar.id" would translate to the JSON specification
{"join": ["foo", ["=", "foo.id", "bar.id"]]}
and likewise for the other joins.
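To make the prefix notation above concrete, here is a minimal Python sketch (illustrative names, not part of the API) that renders a where/having clause back into an infix SQL string for a small subset of the operators:

```python
# Subset of the infix operators accepted in where/having clauses.
INFIX = {"and": "AND", "or": "OR", "=": "=", "!=": "!=", "<>": "<>",
         ">": ">", ">=": ">=", "<": "<", "<=": "<="}

def clause_to_sql(node):
    """Render a prefix-notation clause, e.g.
    ["and", ["<", "d.f0", 3], ...], as an infix SQL string."""
    if not isinstance(node, list):
        return repr(node) if isinstance(node, (int, float)) else str(node)
    op, *args = node
    if op == "between":
        field, lo, hi = args
        return "%s BETWEEN %s AND %s" % (clause_to_sql(field), lo, hi)
    rendered = [clause_to_sql(a) for a in args]
    return "(" + (" %s " % INFIX[op]).join(rendered) + ")"
```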
Here's an example of a complicated query combining most of the elements above:
{
"select": ["f.*", "b.baz", "c.quux", ["b.bla", "bla-bla"], ["now"]],
"distinct": true,
"having": ["<", 0, "f.e"],
"where": ["or", ["and", ["=", "f.a", "bort"], ["!=", "b.baz", "param1"]],
["<", 1, 2, 3],
["in", "f.e", [1, 19, 3]],
["between", "f.e", 10, 20]],
"limit": 50,
"group_by": ["f.a"],
"offset": 10,
"join": ["draq", ["=", "f.b", "draq.x"]],
"left_join": ["clod", ["=", "f.a", "clod.d"]],
"order_by": [["b.baz", "desc"], "c.quux", "f.a"]
}
which would correspond to the SQL query string:
SELECT DISTINCT f.*, b.baz, c.quux, b.bla AS bla_bla, now()
INNER JOIN draq ON f.b = draq.x
LEFT JOIN clod c ON f.a = c.d
WHERE ((f.a = "bort" AND b.baz <> "param1")
OR (1 < 2 AND 2 < 3)
OR (f.e IN (1, 19, 3))
OR f.e BETWEEN 10 AND 20)
GROUP BY f.a
HAVING 0 < f.e
ORDER BY b.baz DESC, c.quux, f.a
LIMIT 50
OFFSET 10
Although the above examples contain field names, you can also use field IDs in a SQL query as long as you wrap them in backquotes, e.g., "SELECT `100006`, sum(`000001`) AS sum_field1 FROM DS GROUP BY `100006`"
You can submit either form as your query. If you use the string form, BigML will parse it into the standard map format, discarding any unsupported SQL constructs that appear in the query string.
Note that, as is conventional in SQL, the examples above freely mix uppercase and lowercase keywords in the sql_query value. BigML accepts both cases, although the recommended style is not to mix them in a single request. The keywords in json_query, however, must be lowercase.
Next, we'll list all the arguments that can be used to fine-tune the properties of the SQL-generated fields.
| Argument | Type | Description |
|---|---|---|
| column | Integer |
Column number denoting the output field. Zero-based numbering.
Example: 1 |
|
item_analysis
optional |
Object |
Set of parameters to activate item analysis for the dataset. See the Section on Item Analysis for more details.
Example:
|
|
name
optional |
String |
Name of the new field.
Example: "Price" |
|
optype
optional |
String |
Optype of the new field. Available optypes are "numeric", "categorical", "text", "datetime", and "items".
Example: "text" |
|
refresh_field_type
optional |
Boolean, default is false |
Whether the optype of the field needs to be recomputed or not.
Example: true |
|
refresh_preferred
optional |
Boolean, default is false |
Whether the preferred flag of the field needs to be recomputed or not.
Example: true |
|
term_analysis
optional |
Object |
Set of parameters to activate text analysis for the dataset. See the Section on Term Analysis for more details.
Example:
|
Column Concatenation Example
Say you have 2 datasets and want to join them using the field named "id". A query request would look like this:
curl "https://au.bigml.io/dataset?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"origin_datasets": [
"dataset/5bab8d6b1f386f7c20000000",
"dataset/5bab8d6e1f386f7c20000003"],
"origin_dataset_names": {
"dataset/5bab8d6b1f386f7c20000000": "A",
"dataset/5bab8d6e1f386f7c20000003": "B"},
"sql_query": "select * from A join B on A.id=B.id"
}'
> Creating a dataset using sql_query
or, using the JSON dictionary form of the query:
curl "https://au.bigml.io/dataset?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"origin_datasets": [
"dataset/5bab8d6b1f386f7c20000000",
"dataset/5bab8d6e1f386f7c20000003"],
"origin_dataset_names": {
"dataset/5bab8d6b1f386f7c20000000": "A",
"dataset/5bab8d6e1f386f7c20000003": "B"},
"json_query": {
"select": ["*"],
"from": ["A"],
"join": ["B", ["=", "A.id", "B.id"]]}
}'
> Creating a dataset using json_query
Transformations
Once you have created a dataset, BigML.io allows you to derive new datasets from it by sampling, filtering, adding new fields, or concatenating it to other datasets. We use the term dataset transformations (or just transformations for short) for this set of operations that create new, modified versions of your original dataset.
We use the term:- Cloning for the general operation of generating a new dataset.
- Sampling when the original dataset is sampled.
- Filtering when the original dataset is filtered.
- Extending when new fields are generated.
- Merging when a multi-dataset is created.
Keep in mind that you can sample, filter and extend a dataset all at once in only one API request.
So let's start with the most basic transformation: cloning a dataset.
Cloning a Dataset
To clone a dataset you just need to use the origin_dataset argument to send the dataset/id of the dataset that you want to clone. For example:
curl "https://au.bigml.io/dataset?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"origin_dataset": "dataset/52b8fdff3c19205ff100001e"}'
> Cloning a dataset
| Argument | Type | Description |
|---|---|---|
|
category
optional |
Integer |
The category that best describes the dataset. See the category codes for the complete list of categories.
Example: 1 |
|
fields
optional |
Object |
Updates the names, labels, and descriptions of the fields in the new dataset. An entry keyed with the field id of the original dataset for each field that will be updated.
Example:
|
|
name
optional |
String |
The name you want to give to the new dataset.
Example: "my new dataset" |
| origin_dataset | String |
The dataset/id of the dataset to be cloned.
Example: "dataset/52694b59035d0737c201ac68" |
Sampling a Dataset
It is also possible to provide a sampling specification to be used when cloning the dataset. The sample will be applied to the origin_dataset rows. For example:
curl "https://au.bigml.io/dataset?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"origin_dataset": "dataset/52b8fdff3c19205ff100001e",
"sample_rate": 0.8,
"replacement": true,
"seed": "myseed"}'
> Sampling a dataset
| Argument | Type | Description |
|---|---|---|
|
out_of_bag
optional |
Boolean, default is false |
Setting this parameter to true will return a dataset containing a sequence of the out-of-bag instances instead of the sampled instances. See the Section on Sampling for more details.
Example: true |
|
replacement
optional |
Boolean, default is false |
Whether sampling should be performed with or without replacement. See the Section on Sampling for more details.
Example: true |
|
sample_rate
optional |
Float, default is 1.0 |
A real number between 0 and 1 specifying the sample rate. See the Section on Sampling for more details.
Example: 0.5 |
|
seed
optional |
String |
A string to be hashed to generate deterministic samples. See the Section on Sampling for more details.
Example: "MySample" |
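BigML's actual hashing scheme is internal, but the idea behind seed-based deterministic sampling can be sketched as follows; `deterministic_sample` is purely illustrative and only shows why the same seed always yields the same sample:

```python
import hashlib

def deterministic_sample(rows, rate, seed):
    """Hash the seed together with each row index and keep the row
    when the hash, mapped to [0, 1), falls under the sample rate.
    The result depends only on (rows, rate, seed), so repeating the
    call reproduces the exact same sample."""
    kept = []
    for i, row in enumerate(rows):
        h = hashlib.md5(("%s-%d" % (seed, i)).encode()).hexdigest()
        if int(h, 16) / 16.0 ** 32 < rate:
            kept.append(row)
    return kept
```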
Filtering a Dataset
A dataset can be filtered in different ways:- Excluding a few fields using the excluded_fields argument.
- Selecting only a few fields using the input_fields argument.
- Filtering rows using a json_filter or lisp_filter similarly to the way you can filter a source.
- Specifying a range of rows.
As illustrated in the following example, it's possible to provide a list of input fields, selecting the fields from the filtered input dataset that will be created. Filtering happens before field picking and, therefore, the row filter can use fields that won't end up in the cloned dataset.
curl "https://au.bigml.io/dataset?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"origin_dataset": "dataset/52b8fdff3c19205ff100001e",
"input_fields": ["000000", "000001", "000003"],
"json_filter": [">", 3.14, ["field", "000002"]],
"range": [50, 100]}'
> Filtering a dataset
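To illustrate the semantics of such a row filter, here is a minimal Python evaluator for a small subset of json_filter operators. The `eval_filter` helper and the dict-keyed row representation are assumptions for illustration, not part of the API:

```python
# A few of the comparison operators a json_filter can use.
OPS = {">": lambda a, b: a > b, "<": lambda a, b: a < b,
       "=": lambda a, b: a == b, "!=": lambda a, b: a != b}

def eval_filter(expr, row):
    """Evaluate a json_filter-style predicate against one row,
    where `row` maps field ids to values. The first list element
    is the operator; the rest are its (possibly nested) arguments."""
    if not isinstance(expr, list):
        return expr
    op, *args = expr
    if op == "field":
        return row[args[0]]
    vals = [eval_filter(a, row) for a in args]
    return OPS[op](*vals)
```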
| Argument | Type | Description |
|---|---|---|
|
excluded_fields
optional |
Array |
Specifies the fields that won't be included in the new dataset.
Example:
|
|
input_fields
optional |
Array |
Specifies the fields to be included in the dataset.
Example:
|
|
json_filter
optional |
Array |
A JSON list representing a filter over the rows in the origin dataset. The first element is an operator and the rest of the elements its arguments. See the Section on filtering sources for more details.
Example: [">", 3.14, ["field", "000002"]] |
|
lisp_filter
optional |
String |
A string representing a Lisp s-expression to filter rows from the origin dataset.
Example: "(> 3.14 (field 2))" |
|
range
optional |
Array |
The range of successive instances used to create the new dataset.
Example: [100, 200] |
Extending a Dataset
You can clone a dataset and extend it with brand new fields using the new_fields argument. Each new field is created using a Flatline formula and optionally a name, label, and description.
A Flatline formula is a lisp-like expression that lets you reference and process the columns and rows of the origin dataset. See the full Flatline reference here. Let's see a first example that clones a dataset and adds a new field named "Celsius", using a formula that converts the values of the "Fahrenheit" field to Celsius.
curl "https://au.bigml.io/dataset?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"origin_dataset": "dataset/52bb64713c192015e3000010",
"new_fields": [{
"field": "(/ (* 5 (- (f Fahrenheit) 32)) 9)",
"name": "Celsius"}]}'
> Extending a dataset
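For clarity, the Flatline formula above computes, for each row, the same value as this plain Python function:

```python
def fahrenheit_to_celsius(f):
    """Row-level equivalent of the Flatline formula
    (/ (* 5 (- (f Fahrenheit) 32)) 9)."""
    return 5 * (f - 32) / 9
```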
curl "https://au.bigml.io/dataset?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"origin_dataset": "dataset/52bb64713c192015e3000010",
"all_fields": false,
"new_fields": [{
"fields": "(fields 0 1)",
"names": ["Day", "Temperature"]}]}'
> Extending a dataset
curl "https://au.bigml.io/dataset?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"origin_dataset": "dataset/52bb64713c192015e3000010",
"new_fields": [
{"field": "(avg (window Fahrenheit -6 0))",
"name": "Weekly AVG",
"label":"Weekly Average",
"description": "Temperature average over the last seven days"},
{"fields": "(list (f 0 -1) (f 0 1))",
"names": ["Yesterday", "Tomorrow"],
"labels": ["Yesterday prediction", "Tomorrow prediction"],
"descriptions": ["Prediction for the previous day", "Prediction for the next day"]}]}'
> Extending a dataset
Filtering the New Fields Output
The generation of new fields works by traversing the input dataset row by row and applying the Flatline formula of each new field to each row in turn. The list of values generated from each input row that way constitutes an output row of the generated dataset.
It is possible to limit the number of input rows that the generator sees by means of filters and/or sample specifications, for example:
curl "https://au.bigml.io/dataset?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"origin_dataset": "dataset/52bb2c263c192015e3000004",
"lisp_filter": "(not (= 0 (f 000001)))",
"new_fields": [
{"field": "(/ 1 (f 000001))",
"name": "Inverse value"}]}'
> Extending a dataset
And, as an additional convenience, it is also possible to specify either an output_lisp_filter or an output_json_filter, that is, a Flatline row filter that acts on the generated rows instead of on the input data:
curl "https://au.bigml.io/dataset?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"origin_dataset": "dataset/52bb2c263c192015e3000004",
"lisp_filter": "(not (= 0 (f 000001)))",
"new_fields": [
{"field": "(/ 1 (f 000001))",
"name": "Inverse value"}],
"output_lisp_filter": "(< 0.25 (f \"Inverse value\"))"}'
> Extending a dataset
You can also skip any number of rows in the input, starting the generation at an offset given by row_offset, and traverse the input rows by any step specified by row_step. For instance, the following request will generate a dataset whose rows are created by putting together every three consecutive values of the input field "Price":
curl "https://au.bigml.io/dataset?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"origin_dataset": "dataset/52b7f0ba3c19208c13000131",
"row_offset": 2,
"row_step": 3,
"new_fields": [
{"fields": "(window \"Price\" -2 0)",
"names": ["Price-2", "Price-1", "Price"]}]}'
> Extending a dataset
With the specification above, the new field will start with the third row in the input dataset, generate an output row (which uses values from the current input row as well as from the two previous ones), skip to the 6th input row, generate a new output, and so on and so forth.
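The offset/step traversal can be sketched in Python. `windowed_rows` is an illustrative name (the real windowing is done by the Flatline (window ...) generator); it shows which input rows produce output rows for the request above:

```python
def windowed_rows(values, offset, step, width):
    """Start at row `offset`, advance by `step`, and for each visited
    row emit the current value plus the `width - 1` previous ones,
    mimicking row_offset/row_step with (window "Price" -2 0)."""
    out = []
    for i in range(offset, len(values), step):
        out.append(values[i - width + 1:i + 1])
    return out
```

With offset 2 and step 3, the first output row uses input rows 0-2 and the second uses rows 3-5, matching the description above.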
Next, we'll list all the arguments that can be used to extend a dataset.
| Argument | Type | Description |
|---|---|---|
|
all_but
optional |
Array |
Specifies that all the fields of the origin dataset should be included except the ones in this list.
Example: ["000001", "000003"] |
|
all_fields
optional |
Boolean |
Whether all fields should be included in the new dataset or not.
Example: false |
|
new_fields
optional |
Array |
Specifies the new fields to be included in the dataset. See the table below for more details.
Example: [{"field": "(log10 (field "000001"))", "name": "log"}] |
|
output_json_filter
optional |
Array |
A JSON list representing a filter over the rows of the dataset once the new fields have been generated. The first element is an operator and the rest of the elements its arguments. See the Section on filtering rows for more details.
Example: [">", 3.14, ["field", "000002"]] |
|
output_lisp_filter
optional |
String |
A string representing a Lisp s-expression to filter rows after the new fields have been generated.
Example: "(> 3.14 (field 2))" |
|
row_offset
optional |
Integer |
The initial number of rows to skip from the input dataset before processing starts.
Example: 100 |
|
row_step
optional |
Integer |
The number of rows to advance in each step when traversing the input.
Example: 5 |
| Argument | Type | Description |
|---|---|---|
|
description
optional |
String |
A description for the new field.
Example: "This field is a transformation" |
|
descriptions
optional |
Array |
A description for each of the new fields generated.
Example: ["Price 3 days ago", "Price 2 days ago", "Price 1 day ago"] |
| field | Flatline expression |
Either a json-like or lisp-like Flatline expression to generate a new field.
Example: "(* (field 5) 100)" |
| fields | Flatline expression |
Either a json-like or lisp-like Flatline expression to generate a number of fields.
Example: "(window Price -2 0)" |
|
item_analyses
optional |
Array | List of item_analyses for each of the new fields generated. |
|
item_analysis
optional |
Object |
Set of parameters to activate item analysis for the dataset. See the Section on Item Analysis for more details.
Example:
|
|
label
optional |
String |
Label of the new field.
Example: "New price" |
| labels | Array |
Labels for each of the new fields generated.
Example: ["Price-3", "Price-2", "Price-1"] |
|
name
optional |
String |
Name of the new field.
Example: "Price" |
|
names
optional |
Array |
Names for each of the new fields generated.
Example: ["P3", "P2", "P1"] |
|
optype
optional |
String |
Optype of the new field. Available optypes are "numeric", "categorical", "text", "datetime", and "items".
Example: "text" |
|
optypes
optional |
Array |
Optypes for each of the new fields generated.
Example: ["numeric", "categorical", "text"] |
|
refresh_field_type
optional |
Boolean, default is false |
Specifies whether the new field type needs to be recomputed or not.
Example: true |
|
term_analyses
optional |
Array | List of term_analyses for each of the new fields generated. |
|
term_analysis
optional |
Object |
Set of parameters to activate text analysis for the dataset. See the Section on Term Analysis for more details.
Example:
|
Discretization of a Continuous Field
Here's an example discretizing the "temp" field into three homogeneous levels:
curl "https://au.bigml.io/dataset?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"origin_dataset": "dataset/52bb64713c192015e3000010",
"new_fields": [{
"field": "(cond (< (f \"temp\") 0) \"SOLID\"
(< (f \"temp\") 100) \"LIQUID\"
\"GAS\")",
"name":"Discrete Temp"}]}'
Discretizing a field
curl "https://au.bigml.io/dataset?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"origin_dataset": "dataset/52b9359a3c19205ff100002a",
"new_fields": [{
"field": "(cond (> (percentile \"age\" 0.1) (f \"age\")) \"baby\"
(> (percentile \"age\" 0.2) (f \"age\")) \"child\"
(> (percentile \"age\" 0.6) (f \"age\")) \"adult\"
(> (percentile \"age\" 0.9) (f \"age\")) \"old\"
\"elder\")",
"name":"Discrete Age"}]}'
Discretizing a field
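The cascading cond above amounts to a cutoff lookup. Here is a local Python sketch of the same logic, with the percentile cutoffs precomputed (in the real request they come from the dataset's field summaries); `discretize` is an illustrative name:

```python
def discretize(value, cutoffs, labels):
    """Return the first label whose percentile cutoff exceeds `value`,
    falling through to the last label, just like the cascading cond.
    `cutoffs` holds the values at the 0.1/0.2/0.6/0.9 percentiles."""
    for cutoff, label in zip(cutoffs, labels):
        if value < cutoff:
            return label
    return labels[-1]
```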
Outlier Elimination
You can use, for instance, the following predicate in a filter to remove outliers:
curl "https://au.bigml.io/dataset?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"origin_dataset": "dataset/52b9359a3c19205ff100002a",
"lisp_filter": "(< (percentile \"age\" 0.5) (f \"age\") (percentile \"age\" 0.95))"}'
Eliminating outliers
curl "https://au.bigml.io/dataset?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"origin_dataset": "dataset/52b9359a3c19205ff100002a",
"lisp_filter": "(within-percentiles? \"age\" 0.5 0.95)"}'
Eliminating outliers
curl "https://au.bigml.io/dataset?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"origin_dataset": "dataset/52bb64713c192015e3000010",
"new_fields": [{
"field": "(if (missing? \"temp\") (mean \"temp\") (field \"temp\"))",
"name": "no missing temp"}]}'
Changing missing values
Lisp and JSON Syntaxes
Flatline also has a json-like flavor with exactly the same semantics as the lisp-like version. A Flatline formula can easily be translated to its json-like variant and vice versa by changing parentheses to brackets, quoting symbols as strings, and adding commas to separate sub-formulas. For example, the following two formulas are equivalent for BigML.io.
"(/ (* 5 (- (f Fahrenheit) 32)) 9)"
Lisp-like formula
["/", ["*", 5, ["-", ["f", "Fahrenheit"], 32]], 9]
Json-like formula
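The translation is mechanical. As an illustration, here is a minimal Python parser that turns a lisp-like formula into its json-like form; it does not handle string literals, so it is only a sketch, not a full Flatline reader:

```python
import re

def lisp_to_json(formula):
    """Parse a lisp-like Flatline formula into nested Python lists,
    which serialize directly to the json-like variant: parentheses
    become brackets, numbers stay numbers, symbols become strings."""
    tokens = re.findall(r'\(|\)|[^\s()]+', formula)

    def parse(i):
        if tokens[i] == "(":
            node, i = [], i + 1
            while tokens[i] != ")":
                child, i = parse(i)
                node.append(child)
            return node, i + 1
        tok = tokens[i]
        try:
            return int(tok), i + 1
        except ValueError:
            try:
                return float(tok), i + 1
            except ValueError:
                return tok, i + 1

    tree, _ = parse(0)
    return tree
```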
Final Remarks
A few important details that you should keep in mind:- Cloning a dataset also implies creating a copy of its serialized form, so you get an asynchronous resource with a status that evolves from the Summarized (4) to the Finished (5) state.
- If you specify both sampling and filtering arguments, the former are applied first.
- As with filters applied to datasources, dataset filters can use the full Flatline language to specify the boolean formula to use when sifting the input.
- Flatline performs type inference, and will in general figure out the proper optype for the generated fields, which are subsequently summarized by the dataset creation process, reaching then their final datatype (just as with a regular dataset created from a datasource). In case you need to fine-tune Flatline's inferences, you can provide an optype (or optypes) key and value in the corresponding output field entry (together with generator and names), but in general this shouldn't be needed.
- Please check the Flatline reference manual for a full description of the language for field generation and the many pre-built functions it provides.
Samples
Last Updated: Tuesday, 2019-01-29 16:28
A sample provides fast-access to the raw data of a dataset on an on-demand basis.
When a new sample is requested, a copy of the dataset is stored in a special format in an in-memory cache. Multiple, different samples of the data can then be extracted using HTTPS parameterized requests that specify sample sizes and simple query string filters.
Samples are ephemeral. That is to say, a sample will be available as long as GETs are requested within periods smaller than a pre-established TTL (Time to Live). The expiration timer of a sample is reset every time a new GET is received.
If requested, a sample can also perform linear regression and compute Pearson's and Spearman's correlations for either one numeric field against all other numeric fields or between two specific numeric fields.
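For reference, Pearson's r between two numeric columns, one of the statistics a sample can compute on demand, is defined as follows; this is a plain-Python sketch of the formula, not BigML's implementation:

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson's correlation coefficient: the covariance of the two
    columns divided by the product of their standard deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```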
BigML.io allows you to create, retrieve, update, and delete your samples. You can also list all of your samples.
Jump to:
- Sample Base URL
- Creating a Sample
- Sample Arguments
- Retrieving a Sample
- Sample Properties
- Filtering and Paginating Fields from a Sample
- Filtering Rows from a Sample
- Updating a Sample
- Deleting a Sample
- Listing Samples
Sample Base URL
You can use the following base URL to create, retrieve, update, and delete samples. https://au.bigml.io/sample
Sample base URL
All requests to manage your samples must use HTTPS and be authenticated using your username and API key to verify your identity. See this section for more details.
Creating a Sample
To create a new sample, you need to POST to the sample base URL an object containing at least the dataset/id that you want to use to create the sample. The content-type must always be "application/json".
You can easily create a new sample using curl as follows. All you need is a valid dataset/id and your authentication variable set up as shown above.
curl "https://au.bigml.io/sample?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"dataset": "dataset/5484b109f0a5ea59a6000018"}'
> Creating a sample
BigML.io will return the newly created sample if the request succeeded.
{
"category":0,
"code":201,
"created":"2015-02-03T08:53:08.782775",
"credits":0,
"dataset":"dataset/5484b109f0a5ea59a6000018",
"description":"",
"fields_meta":{
"count":0,
"limit":1000,
"offset":0,
"total":0
},
"input_fields":[
"000000",
"000001",
"000002",
"000003",
"000004",
"000005",
"000006",
"000007",
"000008",
"000009",
"00000a",
"00000b",
"00000c",
"00000d"
],
"max_columns":14,
"max_rows":32561,
"name":"census' dataset sample",
"private":true,
"project":null,
"resource":"sample/54d9c6f4f0a5ea0b1600003a",
"seed":"c30d76cd14e24ef7ab7d28f98b3c8488",
"size":3292068,
"status":{
"code":1,
"message":"The sample is being processed and will be created soon"
},
"subscription":false,
"tags":[],
"updated":"2015-02-03T08:53:08.782792"
}
< Example sample JSON response
Sample Arguments
See below the full list of arguments that you can POST to create a sample.
| Argument | Type | Description |
|---|---|---|
|
category
optional |
Integer, default is the category of the dataset |
The category that best describes the sample. See the category codes for the complete list of categories.
Example: 1 |
| dataset | String |
A valid dataset/id.
Example: dataset/4f665b8103ce8920bb000006 |
|
description
optional |
String |
A description of the sample up to 8192 characters long.
Example: "This is a description of my new sample" |
|
name
optional |
String, default is dataset's name sample |
The name you want to give to the new sample.
Example: "my new sample" |
|
project
optional |
String |
The project/id you want the sample to belong to.
Example: "project/54d98718f0a5ea0b16000000" |
|
tags
optional |
Array of Strings |
A list of strings that help classify and index your sample.
Example: ["best customers", "2018"] |
|
webhook
optional |
Object |
A webhook url and an optional secret phrase. See the Section on Webhooks for more details.
Example:
|
You can also use curl to customize a new sample with a name. For example, to create a new sample named "my sample" with some tags:
curl "https://au.bigml.io/sample?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"dataset": "dataset/5484b109f0a5ea59a6000018",
"name": "my sample",
"tags": ["potential customers", "2015"]}'
> Creating a customized sample
If you do not specify a name, BigML.io will assign the dataset's name to the new sample.
Retrieving a Sample
Each sample has a unique identifier in the form "sample/id" where id is a string of 24 alpha-numeric characters that you can use to retrieve the sample.
To retrieve a sample with curl:
curl "https://au.bigml.io/sample/54d9c6f4f0a5ea0b1600003a?$BIGML_AUTH"
$ Retrieving a sample from the command line
Sample Properties
Once a sample has been successfully created it will have the following properties.
| Property | Type | Description |
|---|---|---|
|
category
filterable, sortable, updatable |
Integer | One of the categories in the table of categories that help classify this resource according to the domain of application. |
| code | Integer | HTTP status code. This will be 201 upon successful creation of the sample and 200 afterwards. Check the code that comes with the status attribute to verify that the sample creation has completed without errors. |
|
columns
filterable, sortable |
Integer | The number of fields returned in the sample's fields. |
|
created
filterable, sortable |
ISO-8601 Datetime. | This is the date and time in which the sample was created with microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are provided in Coordinated Universal Time (UTC). |
|
credits
filterable, sortable |
Float | The number of credits it cost you to create this sample. |
|
dataset
filterable, sortable |
String | The dataset/id that was used to create the sample. |
|
description
updatable |
String | A text describing the sample. It can contain restricted markdown to decorate the text. |
| fields_meta | Object | A dictionary with meta information about the fields dictionary. It specifies the total number of fields, the current offset, and limit, and the number of fields (count) returned. |
| input_fields | Array | The list of input fields' ids available to filter the sample. |
| locale | String | The source's locale. |
|
max_columns
filterable, sortable |
Integer | The total number of fields in the dataset used to build the sample. |
|
max_rows
filterable, sortable |
Integer | The max number of rows in the sample. |
|
name
filterable, sortable, updatable |
String | The name of the sample as provided or based on the name of the dataset by default. |
|
private
filterable, sortable |
Boolean | Whether the sample is public or not. |
|
project
filterable, sortable, updatable |
String | The project/id the resource belongs to. |
| resource | String | The sample/id. |
|
rows
filterable, sortable |
Integer | The total number of rows in the sample. |
| sample | Object | All the information that you need to analyze the sample on your own. It includes the fields' dictionary describing the fields and their summaries and the rows. |
|
seed
filterable, sortable |
String | The string that was used to generate the sample. |
|
size
filterable, sortable |
Integer | The number of bytes of the dataset that were used to create this sample. |
| status | Object | A description of the status of the sample. It includes a code, a message, and some extra information. See the table below. |
|
subscription
filterable, sortable |
Boolean | Whether the sample was created using a subscription plan or not. |
|
tags
filterable, updatable |
Array of Strings | A list of user tags that can help classify and index this resource. |
|
updated
filterable, sortable |
ISO-8601 Datetime. | This is the date and time in which the sample was updated with microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are provided in Coordinated Universal Time (UTC). |
| webhook | Object | A webhook url and an optional secret phrase. See the Section on Webhooks for more details. |
A Sample Object has the following properties:
| Property | Type | Description |
|---|---|---|
| rows | Array of Arrays | A list of lists representing the rows of the sample. Values in each list are ordered according to the fields list. |
Sample Status
Through the status field in the sample you can determine when the sample has been fully processed and is ready to be used. These are the fields that a sample's status has:
| Property | Type | Description |
|---|---|---|
| code | Integer | A status code that reflects the status of the sample creation. It can be any of those that are explained here. |
| elapsed | Integer | Number of milliseconds that BigML.io took to process the sample. |
| message | String | A human readable message explaining the status. |
Once a sample has been successfully created, it will look like:
{
"category":0,
"code":200,
"columns":2,
"created":"2015-02-03T18:21:07.001000",
"credits":0,
"dataset":"dataset/5484b109f0a5ea59a6000018",
"description":"",
"fields_meta":{
"count":2,
"limit":2,
"offset":0,
"query_total":14,
"total":14
},
"input_fields":[
"000000",
"000001",
"000002",
"000003",
"000004",
"000005",
"000006",
"000007",
"000008",
"000009",
"00000a",
"00000b",
"00000c",
"00000d"
],
"locale":"en-US",
"max_columns":14,
"max_rows":32561,
"name":"my dataset",
"private":true,
"project":null,
"resource":"sample/54d9c6f4f0a5ea0b1600003a",
"rows":2,
"sample":{
"fields":[
{
"column_number":0,
"datatype":"int8",
"id":"000000",
"input_column":0,
"name":"age",
"optype":"numeric",
"order":0,
"preferred":true,
"summary":{
"bins":[
[
18.75643,
2410
],
[
21.51515,
1485
],
[
23.47642,
1675
],
[
25.48278,
1626
],
[
27.5094,
1702
],
[
29.51434,
1674
],
[
31.48252,
1716
],
[
33.50312,
1761
],
[
35.5062,
1774
],
[
37.4908,
1685
],
[
39.49317,
1610
],
[
41.49118,
1588
],
[
43.48461,
1494
],
[
46.38942,
2722
],
[
50.4325,
2252
],
[
53.47213,
879
],
[
55.46624,
785
],
[
57.50552,
724
],
[
59.46777,
667
],
[
61.46237,
558
],
[
63.47489,
438
],
[
65.45732,
328
],
[
67.4428,
271
],
[
69.45178,
197
],
[
71.48201,
139
],
[
73.44348,
115
],
[
75.50549,
91
],
[
77.44231,
52
],
[
80.28947,
76
],
[
83.95,
20
],
[
87.75,
4
],
[
90,
43
]
],
"maximum":90,
"mean":38.58165,
"median":37.03324,
"minimum":17,
"missing_count":0,
"population":32561,
"splits":[
18.58199,
20.00208,
21.38779,
22.6937,
23.89609,
25.137,
26.40151,
27.62339,
28.8206,
30.03925,
31.20051,
32.40167,
33.57212,
34.72468,
35.87617,
37.03324,
38.24651,
39.49294,
40.76573,
42.0444,
43.3639,
44.75256,
46.13703,
47.60107,
49.39145,
51.09725,
53.14627,
55.56526,
58.35547,
61.50785,
66.43583
],
"standard_deviation":13.64043,
"sum":1256257,
"sum_squares":54526623,
"variance":186.0614
}
},
{
"column_number":1,
"datatype":"string",
"id":"000001",
"input_column":1,
"name":"workclass",
"optype":"categorical",
"order":1,
"preferred":true,
"summary":{
"categories":[
[
"Private",
22696
],
[
"Self-emp-not-inc",
2541
],
[
"Local-gov",
2093
],
[
"State-gov",
1298
],
[
"Self-emp-inc",
1116
],
[
"Federal-gov",
960
],
[
"Without-pay",
14
],
[
"Never-worked",
7
]
],
"missing_count":1836
},
"term_analysis":{
"enabled":true
}
}
],
"rows":[
[
48,
"Private",
"HS-grad",
9,
"Divorced",
"Transport-moving",
"Not-in-family",
"White",
"Male",
0,
0,
65,
"United-States",
"<=50K"
],
[
71,
"Private",
"9th",
5,
"Married-civ-spouse",
"Other-service",
"Husband",
"White",
"Male",
0,
0,
40,
"United-States",
"<=50K"
]
]
},
"seed":"0493a6f8ca7aeb2aaccca22560e4b8cb",
"size":3292068,
"status":{
"code":5,
"elapsed":1,
"message":"The sample has been created",
"progress":1
},
"subscription":false,
"tags":[
"potential customers",
"2015"
],
"updated":"2015-02-03T18:21:14.537000"
}
< Example sample JSON response
Filtering and Paginating Fields from a Sample
A sample might be composed of hundreds or even thousands of fields. Thus when retrieving a sample, it's possible to specify that only a subset of fields be retrieved, by using any combination of the following parameters in the query string (unrecognized parameters are ignored):
Since fields is a map and therefore not ordered, the returned fields contain an additional key, order, whose integer (increasing) value gives you their ordering. In all other respects, the sample is the same as the one you would get without any of the filtering parameters above.
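For instance, paginating a sample's fields with offset and limit query-string parameters could look like the sketch below. The sample id is hypothetical, and offset/limit are assumed to behave as the field-pagination parameters do elsewhere in the API.

```shell
# Sketch: retrieve a sample's fields 10 at a time.
# The sample id is hypothetical; offset/limit are assumed to work
# as field-pagination parameters, as elsewhere in the API.
SAMPLE_ID="sample/54d9c6f4f0a5ea0b1600003a"
OFFSET=0
LIMIT=10
URL="https://au.bigml.io/${SAMPLE_ID}?${BIGML_AUTH}&offset=${OFFSET}&limit=${LIMIT}"
echo curl "$URL"   # drop the echo to actually issue the request
```

To walk through all pages, increase OFFSET by LIMIT until count in fields_meta comes back smaller than the limit.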
The fields_meta field can help you paginate fields: it specifies the total number of fields, the current offset and limit, and the number of fields (count) returned.
Filtering Rows from a Sample
A sample might be composed of thousands or even millions of rows. Thus, when retrieving a sample, it's possible to specify that only a subset of rows be retrieved, by using any combination of the following parameters in the query string (unrecognized parameters are ignored). BigML will never return more than 1000 rows in the same response. However, you can send additional requests to get different random samples.
| Parameter | Type | Description |
|---|---|---|
|
!field=
optional |
Blank |
With field the identifier of a field, select only those rows where field is not missing (i.e., it has a definite value).
Example:
|
|
!field=from,to
optional |
List |
With field the identifier of a numeric field, returns the values not in the specified interval. As with inclusion, it's possible to include or exclude the boundaries of the specified interval using square or round brackets.
Example:
|
|
!field=value
optional |
List |
With field the identifier of a numeric field, returns rows for which the field doesn't equal that value.
Example:
|
|
!field=value1&!field=value2&...
optional |
String |
With field the identifier of a categorical field, selects only those rows whose value for that field is not one of the provided categories (the parameter may be repeated).
Example:
|
|
field=
optional |
Blank |
With field the identifier of a field, select only those rows where field is missing.
Example:
|
|
field=from,to
optional |
List |
With field the identifier of a numeric field and from, to optional numbers, specifies a filter for the numeric values of that field in the range [from, to]. One of the limits can be omitted.
Example:
|
|
field=value
optional |
List |
With field the identifier of a numeric field, returns rows for which the field equals that value.
Example:
|
|
field=value1&field=value2&...
optional |
String |
With field the identifier of a categorical field, selects only those rows whose value for that field is one of the provided categories (the parameter may be repeated).
Example:
|
|
index
optional |
Boolean |
When set to true, every returned row will have a first extra value which is the absolute row number, i.e., a unique row identifier. This can be useful, for instance, when you're performing various GET requests and want to compute the union of the returned rows.
Example: index=true |
|
mode
optional |
String |
One amongst deterministic, random, or linear. The way we sample the resulting rows, if needed: random means a random sample, deterministic is also random but uses a fixed seed so that it's repeatable, and linear means that BigML just returns the first rows (up to the requested number) after filtering. Defaults to "deterministic".
Example: mode=random |
|
occurrence
optional |
Boolean |
When set to true, each returned row is prepended with a value that denotes the number of times the row appears in the sample. You'll want this only when unique is set to true; otherwise, all those extra values will be equal to 1. When index is also set to true (see above), the multiplicity column is added after the row index.
Example: occurrence=true |
|
precision
optional |
Integer |
The number of decimal places to keep in the returned values for fields of type float or double. For instance, if you set precision=0, all returned numeric values will be truncated to their integral part.
Example: precision=2 |
|
row_fields
optional |
List |
You can provide a list of field identifiers to be present in the sample's rows, specifying which ones you actually want to see and in which order.
Example: row_fields=000000,000002 |
|
row_offset
optional |
Integer |
Skip the given number of rows. Useful when paginating over the sample in linear mode.
Example: row_offset=300 |
|
row_order_by
optional |
String |
A field that causes the returned rows to be sorted by the value of the given field, in ascending order or, when the - prefix is used, in descending order.
Example: row_order_by=-000000 |
|
rows
optional |
Integer |
The total number of rows to be returned; if this is less than the number of rows resulting from the rest of the filter parameters, the result will be sampled according to mode.
Example: rows=300 |
|
seed
optional |
String |
When mode is random, you can specify your own seed in this parameter; otherwise, we choose it at random, and return the value we've used in the body of the response: that way you can make a random sampling deterministic if you happen to like a particular result.
Example: seed=mysample |
|
stat_field
optional |
String |
A field_id that corresponds to the identifier of a numeric field will cause the answer to include the Pearson's and Spearman's correlations, and linear regression terms of this field with all other numeric fields in the sample. Those values will be returned in maps, keyed by the other field's id, named spearman_correlations, pearson_correlations, slopes, and intercepts.
Example: stat_field=000000 |
|
stat_fields
optional |
String |
Two field_ids that correspond to the identifiers of numeric fields will cause the answer to include the Pearson's and Spearman's correlations, and linear regression terms between the two fields. Those values will be returned in fields named spearman_correlation, pearson_correlation, slope, and intercept.
Example: stat_fields=000000,000003 |
|
unique
optional |
Boolean |
When set to true, repeated rows will be removed from the sample.
Example: unique=true |
Updating a Sample
To update a sample, you need to PUT an object containing the fields that you want to update to the sample's base URL. The content-type must always be: "application/json". If the request succeeds, BigML.io will return with an HTTP 202 response with the updated sample.
For example, to update a sample with a new name you can use curl like this:
curl "https://au.bigml.io/sample/54d9c6f4f0a5ea0b1600003a?$BIGML_AUTH" \
-X PUT \
-H 'content-type: application/json' \
-d '{"name": "a new name"}'
$ Updating a sample's name
Deleting a Sample
To delete a sample, you need to issue a HTTP DELETE request to the sample/id to be deleted.
Using curl you can do something like this to delete a sample:
curl -X DELETE "https://au.bigml.io/sample/54d9c6f4f0a5ea0b1600003a?$BIGML_AUTH"
$ Deleting a sample from the command line
If the request succeeds you will not see anything on the command line unless you executed the command in verbose mode. Successful DELETEs will return "204 no content" responses with no body.
Once you delete a sample, it is permanently deleted. That is, a delete request cannot be undone. If you try to delete a sample a second time, or a sample that does not exist, you will receive a "404 not found" response.
However, if you try to delete a sample that is being used at the moment, then BigML.io will not accept the request and will respond with a "400 bad request" response.
See this section for more details.
Listing Samples
To list all the samples, you can use the sample base URL. By default, only the 20 most recent samples will be returned. You can see below how to change this number using the limit parameter.
You can get your list of samples directly in your browser using your own username and API key with the following links.
https://au.bigml.io/sample?$BIGML_AUTH
> Listing samples from a browser
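The same listing works from the command line with curl; adding the limit parameter changes how many samples are returned (here, 5 instead of the default 20):

```shell
# List the 5 most recent samples instead of the default 20.
URL="https://au.bigml.io/sample?${BIGML_AUTH}&limit=5"
echo curl "$URL"   # drop the echo to actually issue the request
```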
Correlations
Last Updated: Tuesday, 2019-01-29 16:28
A correlation resource allows you to compute advanced statistics for the fields in your dataset by applying various exploratory data analysis techniques to compare the distributions of the fields in your dataset against an objective_field.
BigML.io allows you to create, retrieve, update, and delete your correlations. You can also list all of your correlations.
Jump to:
- Correlation Base URL
- Creating a Correlation
- Correlation Arguments
- Retrieving a Correlation
- Correlation Properties
- Filtering and Paginating Fields from a Correlation
- Updating a Correlation
- Deleting a Correlation
- Listing Correlations
Correlation Base URL
You can use the following base URL to create, retrieve, update, and delete correlations. https://au.bigml.io/correlation
Correlation base URL
All requests to manage your correlations must use HTTPS and be authenticated using your username and API key to verify your identity. See this section for more details.
Creating a Correlation
To create a new correlation, you need to POST to the correlation base URL an object containing at least the dataset/id that you want to use to create the correlation. The content-type must always be "application/json".
You can easily create a new correlation using curl as follows. All you need is a valid dataset/id and your authentication variable set up as shown above.
curl "https://au.bigml.io/correlation?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"dataset": "dataset/54222a14f0a5eaaab000000c"}'
> Creating a correlation
BigML.io will return the newly created correlation if the request succeeded.
{
"category": 0,
"clones": 0,
"code": 201,
"columns": 0,
"correlations": null,
"created": "2015-06-23T21:45:24.002925",
"credits": 15.161365509033203,
"dataset": "dataset/55806fc2545e5f09b400002b",
"dataset_field_types": {
"categorical": 9,
"datetime": 0,
"numeric": 6,
"preferred": 14,
"text": 0,
"total": 15
},
"dataset_status": true,
"dataset_type": 0,
"description": "",
"excluded_fields": [ ],
"fields_meta": {
"count": 0,
"limit": 1000,
"offset": 0,
"total": 0
},
"input_fields": [ ],
"locale": "en-US",
"max_columns": 15,
"max_rows": 32561,
"name": "adult's dataset correlation",
"objective_field": "000000",
"out_of_bag": false,
"price": 0,
"private": true,
"project": "project/5578cfd8545e5f6a17000000",
"range": [
1,
32561
],
"replacement": false,
"resource": "correlation/5589d374545e5f37fa000000",
"rows": 32561,
"sample_rate": 1,
"shared": false,
"size": 3974461,
"source": "source/5578d034545e5f6a17000006",
"source_status": true,
"status": {
"code": 1,
"message": "The correlation is being processed and will be created soon"
},
"subscription": false,
"tags": [ ],
"updated": "2015-06-23T21:45:24.003040",
"white_box": false
}
< Example correlation JSON response
Correlation Arguments
In addition to the dataset, you can also POST the following arguments.
| Argument | Type | Description |
|---|---|---|
|
categories
optional |
Object, default is {}, an empty dictionary. That is no categories are specified. |
A dictionary between input field ids and arrays of categories to limit the analysis to. Each array must contain 2 or more unique and valid categories in string format. If omitted, each categorical field is limited to its 100 most frequent categorical values. This argument has no effect on non-categorical input fields.
Example:
|
|
category
optional |
Integer, default is the category of the dataset |
The category that best describes the correlation. See the category codes for the complete list of categories.
Example: 1 |
| dataset | String |
A valid dataset/id.
Example: dataset/4f66a80803ce8940c5000006 |
|
default_numeric_value
optional |
String |
It accepts any of the following strings to substitute missing numeric values across all the numeric fields in the dataset: "mean", "median", "minimum", "maximum", "zero"
Example: "median" |
|
description
optional |
String |
A description of the correlation up to 8192 characters long.
Example: "This is a description of my new correlation" |
| discretization | Object | Global numeric field transformation parameters. See the discretization table below. |
|
excluded_fields
optional |
Array, default is [], an empty list. None of the fields in the dataset is excluded. |
Specifies the fields that won't be included in the correlation.
Example:
|
| field_discretizations | Object | Per-field numeric field transformation parameters, taking precedence over discretization. See the field_discretizations table below. |
|
fields
optional |
Object, default is {}, an empty dictionary. That is, no names or preferred statuses are changed. |
This can be used to change the names of the fields in the correlation with respect to the original names in the dataset, or to tell BigML that certain fields should be preferred. Add an entry keyed with the field id generated in the source for each field whose name you want updated.
Example:
|
|
input_fields
optional |
Array, default is []. All the fields in the dataset |
Specifies the fields to be considered to create the correlation.
Example:
|
|
name
optional |
String, default is dataset's name |
The name you want to give to the new correlation.
Example: "my new correlation" |
|
objective_field
optional |
String, default is dataset's pre-defined objective field |
The id of the field to be used as the objective for correlation tests.
Example: "000001" |
|
out_of_bag
optional |
Boolean, default is false |
Setting this parameter to true will return a sequence of the out-of-bag instances instead of the sampled instances. See the Section on Sampling for more details.
Example: true |
|
project
optional |
String |
The project/id you want the correlation to belong to.
Example: "project/54d98718f0a5ea0b16000000" |
|
range
optional |
Array, default is [1, max rows in the dataset] |
The range of successive instances to build the correlation.
Example: [1, 150] |
|
replacement
optional |
Boolean, default is false |
Whether sampling should be performed with or without replacement. See the Section on Sampling for more details.
Example: true |
|
sample_rate
optional |
Float, default is 1.0 |
A real number between 0 and 1 specifying the sample rate. See the Section on Sampling for more details.
Example: 0.5 |
|
seed
optional |
String |
A string to be hashed to generate deterministic samples. See the Section on Sampling for more details.
Example: "MySample" |
|
significance_levels
optional |
Array, default is [0.01, 0.05, 0.1] |
An array of significance levels between 0 and 1 to test against p_values.
Example: [0.01, 0.025, 0.05, 0.075, 0.1] |
|
tags
optional |
Array of Strings |
A list of strings that help classify and index your correlation.
Example: ["best customers", "2018"] |
|
webhook
optional |
Object |
A webhook url and an optional secret phrase. See the Section on Webhooks for more details.
Example:
|
Discretization is used to transform numeric input fields to categoricals before further processing. It is applied globally to all input fields. A Discretization object is composed of any combination of the following properties.
For example, let's say type is set to "width", size is 7, trim is 0.05, and pretty is false. This requests that numeric input fields be discretized into 7 bins of equal width, trimming the outer 5% of counts, and not rounding bin boundaries.
Field Discretizations is also used to transform numeric input fields to categoricals before further processing. However, it allows the user to specify parameters on a per field basis, taking precedence over the global discretization. It is a map whose keys are field ids and whose values are maps with the same format as discretization. It also accepts edges, which is a numeric array manually specifying edge boundary locations. If this parameter is present, the corresponding field will be discretized according to those defined bins, and the remaining discretization parameters will be ignored. The maximum value of the field's distribution is automatically set as the last value in the edges array. A value object of a Field Discretizations object is composed of any combination of the following properties.
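Following the width example above, the request below sketches how those discretization settings could be POSTed when creating a correlation (the dataset id is hypothetical):

```shell
# Sketch: create a correlation with a global discretization of
# 7 equal-width bins, trimming the outer 5% of counts, without
# rounding bin boundaries. The dataset id is hypothetical.
BODY='{"dataset": "dataset/54222a14f0a5eaaab000000c",
       "discretization": {"type": "width", "size": 7,
                          "trim": 0.05, "pretty": false}}'
echo curl "https://au.bigml.io/correlation?${BIGML_AUTH}" \
     -X POST -H "content-type: application/json" -d "$BODY"
# drop the echo to actually issue the request
```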
You can also use curl to customize a new correlation. For example, to create a new correlation named "my correlation", with only certain rows, and with only three fields:
curl "https://au.bigml.io/correlation?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"dataset": "dataset/54222a14f0a5eaaab000000c",
"objective_field": "000001",
"input_fields": ["000001", "000002", "000003"],
"name": "my correlation",
"range": [25, 125]}'
> Creating customized correlation
If you do not specify a name, BigML.io will assign to the new correlation the dataset's name. If you do not specify a range of instances, BigML.io will use all the instances in the dataset. If you do not specify any input fields, BigML.io will include all the input fields in the dataset.
Read the Section on Sampling Your Dataset to learn how to sample your dataset. Here's an example of a correlation request with range and sampling specifications:
curl "https://au.bigml.io/correlation?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"dataset": "dataset/505f43223c1920eccc000297",
"range": [1, 5000],
"sample_rate": 0.5}'
> Creating a correlation using sampling
Retrieving a Correlation
Each correlation has a unique identifier in the form "correlation/id" where id is a string of 24 alpha-numeric characters that you can use to retrieve the correlation.
To retrieve a correlation with curl:
curl "https://au.bigml.io/correlation/5589d374545e5f37fa000000?$BIGML_AUTH"
$ Retrieving a correlation from the command line
Correlation Properties
Once a correlation has been successfully created it will have the following properties.
| Property | Type | Description |
|---|---|---|
|
category
filterable, sortable, updatable |
Integer | One of the categories in the table of categories that help classify this resource according to the domain of application. |
| code | Integer | HTTP status code. This will be 201 upon successful creation of the correlation and 200 afterwards. Make sure that you check the code that comes with the status attribute to make sure that the correlation creation has been completed without errors. |
|
columns
filterable, sortable |
Integer | The number of fields in the correlation. |
| correlations | Object | All the information that you need to recreate the correlation. It includes the field's dictionary describing the fields and their summaries, and the correlations. See the Correlations Object definition below. |
|
created
filterable, sortable |
ISO-8601 Datetime. | This is the date and time in which the correlation was created with microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are provided in Coordinated Universal Time (UTC). |
|
credits
filterable, sortable |
Float | The number of credits it cost you to create this correlation. |
|
dataset
filterable, sortable |
String | The dataset/id that was used to build the correlation. |
|
dataset_status
filterable, sortable |
Boolean | Whether the dataset is still available or has been deleted. |
|
description
updatable |
String | A text describing the correlation. It can contain restricted markdown to decorate the text. |
| excluded_fields | Array | The list of field ids that were excluded to build the correlation. |
| fields_meta | Object | A dictionary with meta information about the fields dictionary. It specifies the total number of fields, the current offset, and limit, and the number of fields (count) returned. |
| input_fields | Array | The list of input fields' ids used to build the models of the correlation. |
| locale | String | The dataset's locale. |
|
max_columns
filterable, sortable |
Integer | The total number of fields in the dataset used to build the correlation. |
|
max_rows
filterable, sortable |
Integer | The maximum number of instances in the dataset that can be used to build the correlation. |
|
name
filterable, sortable, updatable |
String | The name of the correlation as you provided or based on the name of the dataset by default. |
| objective_field |
String, default is dataset's pre-defined objective field |
The id of the field used as the objective for correlation tests.
Example: "000001" |
| objective_field_details | Object | The details of the objective fields. See the Objective Field Details. |
|
out_of_bag
filterable, sortable |
Boolean | Whether the out-of-bag instances were used to create the correlation instead of the sampled instances. |
|
price
filterable, sortable, updatable |
Float | The price other users must pay to clone your correlation. |
|
private
filterable, sortable, updatable |
Boolean | Whether the correlation is public or not. |
|
project
filterable, sortable, updatable |
String | The project/id the resource belongs to. |
| range | Array | The range of instances used to build the correlation. |
|
replacement
filterable, sortable |
Boolean | Whether the instances sampled to build the correlation were selected using replacement or not. |
| resource | String | The correlation/id. |
|
rows
filterable, sortable |
Integer | The total number of instances used to build the correlation. |
|
sample_rate
filterable, sortable |
Float | The sample rate used to select instances from the dataset to build the correlation. |
|
seed
filterable, sortable |
String | The string that was used to generate the sample. |
|
shared
filterable, sortable, updatable |
Boolean | Whether the correlation is shared using a private link or not. |
| shared_hash | String | The hash that gives access to this correlation if it has been shared using a private link. |
| sharing_key | String | The alternative key that gives read access to this correlation. |
|
size
filterable, sortable |
Integer | The number of bytes of the dataset that were used to create this correlation. |
|
source
filterable, sortable |
String | The source/id that was used to build the dataset. |
|
source_status
filterable, sortable |
Boolean | Whether the source is still available or has been deleted. |
| status | Object | A description of the status of the correlation. It includes a code, a message, and some extra information. See the table below. |
|
subscription
filterable, sortable |
Boolean | Whether the correlation was created using a subscription plan or not. |
|
tags
filterable, updatable |
Array of Strings | A list of user tags that can help classify and index this resource. |
|
updated
filterable, sortable |
ISO-8601 Datetime. | This is the date and time in which the correlation was updated with microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are provided in Coordinated Universal Time (UTC). |
| webhook | Object | A webhook url and an optional secret phrase. See the Section on Webhooks for more details. |
|
white_box
filterable, sortable |
Boolean | Whether the correlation is publicly shared as a white-box. |
The Correlations Object has the following properties. Some correlation results will contain a p_value and a significant boolean array, indicating whether the p_value is less than each of the provided significance_levels (by default, [0.01, 0.05, 0.1] is used if none are provided). If the p_value is greater than the accepted significance level, then the test fails to reject the null hypothesis, meaning there is no statistically significant difference between the treatment groups. For example, if the significance levels are [0.01, 0.025, 0.05, 0.075, 0.1] and the p_value is 0.05, then significant is [false, false, false, true, true].
| Property | Type | Description |
|---|---|---|
| categories | Object | A dictionary between input field id and arrays of category names selected for correlations. |
| correlations | Array | Correlation results. See Correlation Results Object. |
| fields | Object | A dictionary with an entry per field in the dataset used to build the test. Fields are paginated according to the field_meta attribute. Each entry includes the column number in original source, the name of the field, the type of the field, and the summary. See this Section for more details. |
| significance_levels | Array | An array of user provided significance levels to test against p_values. |
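The significant array described above is just an element-wise comparison of the p-value against each significance level; here is a minimal sketch in shell, using awk for the floating-point comparison:

```shell
# Derive the significant booleans for p-value 0.05 against the
# significance levels used in the example above.
P_VALUE=0.05
LEVELS="0.01 0.025 0.05 0.075 0.1"
SIGNIFICANT=$(for level in $LEVELS; do
  awk -v p="$P_VALUE" -v l="$level" \
    'BEGIN { if (p + 0 < l + 0) print "true"; else print "false" }'
done | xargs)
echo "$SIGNIFICANT"   # prints: false false false true true
```

Note that a p-value exactly equal to a level (0.05 here) is not counted as significant, matching the strict less-than comparison in the text.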
The Correlation Results Object has the following properties.
| Property | Type | Description |
|---|---|---|
| name | String | Name of the correlation. Available values are coefficients, contingency_tables, and one_way_anova. |
| result | Object | A correlation result which is a dictionary between field ids and the result. The type of result object varies based on the name of the correlation. When name is coefficients, it returns Coefficients Result Object, when contingency_tables, Contingency Tables Result Object, and when one_way_anova, One-way ANOVA Result Object. |
The Coefficients Result Object contains the correlation measures between objective_field and each of the input_fields when the two fields are numeric-numeric pairs. It has the following properties:
| Property | Type | Description |
|---|---|---|
| pearson | Float | A measure of the linear correlation between two variables, giving a value between +1 and -1, where 1 is total positive correlation, 0 is no correlation, and -1 is total negative correlation. See Pearson's correlation coefficients for more information. |
| pearson_p_value | Float |
A function used in the context of null hypothesis testing for pearson correlations in order to quantify the idea of statistical significance of evidence.
Example: 0.015 |
| spearman | Float | A nonparametric measure of statistical dependence between two variables (nonparametric meaning the parameters are determined by the training data rather than the model, so their number grows with the amount of training data). It assesses how well the relationship between two variables can be described using a monotonic function. If there are no repeated data values, a perfect correlation of +1 or -1 occurs when each of the variables is a perfect monotone function of the other. See Spearman's correlation coefficients for more information. |
| spearman_p_value | Float |
A function used in the context of null hypothesis testing for spearman correlations in order to quantify the idea of statistical significance of evidence.
Example: 0.015 |
The Contingency Tables Result Object contains the correlation measures between objective_field and each of the input_fields when the two fields are both categorical. It has the following properties:
| Property | Type | Description |
|---|---|---|
| chi_square | Object | See Chi-Square Object. |
| cramer | Float | A measure of association between two nominal variables. Its value ranges between 0 (no association between the variables) and 1 (complete association), and can reach 1 only when the two variables are equal to each other. It is based on Pearson's chi-squared statistic. See Cramer's V for more information. |
| tschuprow | Float | A measure of association between two nominal variables. Its value ranges between 0 (no association between the variables) and 1 (complete association). It is closely related to Cramer's V, coinciding with it for square contingency tables. See Tschuprow's T for more information. |
| two_way_table | Array |
Contingency Table as a nested row-major array with the frequency distribution of the variables. In other words, the table summarizes the distribution of values in the sample.
Example: [[2514, 362, 78, 38, 23], [889, 53, 39, 2, 1]] |
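To make the cramer and chi_square entries concrete, here is a minimal sketch (an illustration, not BigML's code) that derives Pearson's chi-square statistic and Cramer's V from a two_way_table-style nested array of frequencies:

```python
import math

# Illustrative sketch: chi-square statistic and Cramer's V from a
# row-major contingency table of frequencies.
def cramers_v(table):
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi_square = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of the two variables.
            expected = row_totals[i] * col_totals[j] / n
            chi_square += (observed - expected) ** 2 / expected
    k = min(len(row_totals), len(col_totals))
    return math.sqrt(chi_square / (n * (k - 1)))

# Complete association gives 1, independence gives 0:
print(cramers_v([[10, 0], [0, 10]]))  # 1.0
print(cramers_v([[5, 5], [5, 5]]))    # 0.0
```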
The Chi-Square Object contains the chi-square statistic used to investigate whether distributions of categorical variables differ from one another. This test is used to compare a collection of categorical data with some theoretical expected distribution. The object has the following properties.
The One-way ANOVA Result Object contains correlation measures between objective_field and each of the input_fields when the two fields are categorical-numerical pairs. ANOVA is used to compare the means of numerical data samples. The ANOVA tests the null hypothesis that samples in two or more groups are drawn from populations with the same mean values. See One-way Analysis of Variance for more information. The object has the following properties:
| Property | Type | Description |
|---|---|---|
| eta_square | Float | A measure of effect size, a measure of the strength of the relationship between two variables, for use in ANOVA. Its value ranges between 0 and 1. A rule of thumb is: 0.02 ~ small, 0.13 ~ medium, and 0.26 ~ large. See eta-squared for more information. |
| f_ratio | Float | The value of the F statistic, which is used to assess whether the expected values of a quantitative variable within several pre-defined groups differ from each other. It is the ratio of the variance calculated among the means to the variance within the samples. |
| p_value | Float |
The p-value used in the context of null hypothesis testing to quantify the statistical significance of the evidence.
Example: 0.015 |
| significant | Array |
A boolean array indicating whether the test produced a significant result at each of the significance_levels: if p_value is less than a significance level, the result is significant at that level. The default significance_levels are [0.01, 0.05, 0.1].
Example: [false, true, true] |
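For reference, eta_square and f_ratio can be derived from the between-group and within-group sums of squares. A minimal Python sketch (an illustration, not BigML's code), taking one numeric sample per category:

```python
# Illustrative sketch: one-way ANOVA F-ratio and eta-squared.
def one_way_anova(groups):
    values = [v for g in groups for v in g]
    grand_mean = sum(values) / len(values)
    means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    # Variation of the samples within each group.
    ss_within = sum((v - m) ** 2
                    for g, m in zip(groups, means) for v in g)
    df_between = len(groups) - 1
    df_within = len(values) - len(groups)
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    eta_square = ss_between / (ss_between + ss_within)
    return f_ratio, eta_square

print(one_way_anova([[1, 2, 3], [4, 5, 6]]))  # (13.5, ~0.771)
```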
An Objective Field Details Object has the following properties.
Correlation Status
Creating a correlation is a process that can take just a few seconds or a few days depending on the size of the dataset used as input and on the workload of BigML's systems. The correlation goes through a number of states until it is fully completed. Through the status field in the correlation you can determine when the correlation has been fully processed and is ready to be used. These are the properties of a correlation's status:
| Property | Type | Description |
|---|---|---|
| code | Integer | A status code that reflects the status of the correlation creation. It can be any of those that are explained here. |
| elapsed | Integer | Number of milliseconds that BigML.io took to process the correlation. |
| message | String | A human readable message explaining the status. |
| progress | Float, between 0 and 1 | How far BigML.io has progressed building the correlation. |
Once a correlation has been successfully created, it will look like:
{
"category": 0,
"clones": 0,
"code": 200,
"columns": 14,
"correlations": {
"categories": {
"000003": [
"Bachelors",
"Some-college",
"HS-grad"
],
"000005": [
"Divorced",
"Separated",
"Widowed"
]
},
"correlations": [
{
"name": "coefficients",
"result": {
"000002": {
"pearson": -0.07665,
"pearson_p_value": 0,
"spearman": -0.07814,
"spearman_p_value": 0
},
"000004": { … },
"00000a": { … },
"00000b": { … },
"00000c": { … }
}
},
{
"name": "one_way_anova",
"result": {
"000001": {
"eta_square": 0.05254,
"f_ratio": 243.34988,
"p_value": 0,
"significant": [
true,
true
]
},
"000003": { … },
"000005": { … },
"000006": { … },
"000007": { … },
"000008": { … },
"000009": { … },
"00000e": { … }
}
}
],
"fields": { … },
"significance_levels": [
0.025,
0.01
]
},
"created": "2015-06-23T21:45:24.002000",
"credits": 15.161365509033203,
"dataset": "dataset/55806fc2545e5f09b400002b",
"dataset_status": true,
"dataset_type": 0,
"description": "",
"excluded_fields": [
],
"fields_meta": {
"count": 14,
"limit": 1000,
"offset": 0,
"query_total": 14,
"total": 14
},
"input_fields": [
"000001",
"000002",
"000003",
"000004",
"000005",
"000006",
"000007",
"000008",
"000009",
"00000a",
"00000b",
"00000c",
"00000e"
],
"locale": "en-US",
"max_columns": 15,
"max_rows": 32561,
"name": "Sample correlation",
"objective_field": "000000",
"objective_field_details": {
"column_number": 0,
"datatype": "int8",
"name": "age",
"optype": "numeric",
"order": 0
},
"out_of_bag": false,
"price": 0,
"private": true,
"project": "project/5578cfd8545e5f6a17000000",
"range": [
1,
32561
],
"replacement": false,
"resource": "correlation/5589d374545e5f37fa000000",
"rows": 32561,
"sample_rate": 1,
"shared": false,
"size": 3974461,
"source": "source/5578d034545e5f6a17000006",
"source_status": true,
"status": {
"code": 5,
"elapsed": 11504,
"message": "The correlation has been created",
"progress": 1
},
"subscription": false,
"tags": [
],
"updated": "2015-06-23T21:45:56.066000",
"white_box": false
}
< Example correlation JSON response
Filtering and Paginating Fields from a Correlation
A correlation might be composed of hundreds or even thousands of fields. Thus when retrieving a correlation, it's possible to specify that only a subset of fields be retrieved, by using any combination of the following parameters in the query string (unrecognized parameters are ignored):
Since fields is a map and therefore not ordered, the returned fields contain an additional key, order, whose integer (increasing) value gives you their ordering. In all other respects, the correlation is the same as the one you would get without any filtering parameter above.
The fields_meta field can help you paginate fields. Its structure is as follows:
Updating a Correlation
To update a correlation, you need to PUT an object containing the fields that you want to update to the correlation's base URL. The content-type must always be: "application/json". If the request succeeds, BigML.io will return with an HTTP 202 response with the updated correlation.
For example, to update correlation with a new name you can use curl like this:
curl "https://au.bigml.io/correlation/5589d374545e5f37fa000000?$BIGML_AUTH" \
-X PUT \
-H 'content-type: application/json' \
-d '{"name": "a new name"}'
$ Updating a correlation's name
If you want to update correlation with a new label and description for a specific field you can use curl like this:
curl "https://au.bigml.io/correlation/5589d374545e5f37fa000000?$BIGML_AUTH" \
-X PUT \
-H 'content-type: application/json' \
-d '{"fields": {"000000": {
"label": "a longer name",
"description": "an even longer description"}}}'
$ Updating a correlation's field label and description
Deleting a Correlation
To delete a correlation, you need to issue a HTTP DELETE request to the correlation/id to be deleted.
Using curl you can do something like this to delete a correlation:
curl -X DELETE "https://au.bigml.io/correlation/5589d374545e5f37fa000000?$BIGML_AUTH"
$ Deleting a correlation from the command line
If the request succeeds you will not see anything on the command line unless you executed the command in verbose mode. Successful DELETEs will return "204 no content" responses with no body.
Once you delete a correlation, it is permanently deleted. That is, a delete request cannot be undone. If you try to delete a correlation a second time, or a correlation that does not exist, you will receive a "404 not found" response.
However, if you try to delete a correlation that is being used at the moment, then BigML.io will not accept the request and will respond with a "400 bad request" response.
See this section for more details.
Listing Correlations
To list all the correlations, you can use the correlation base URL. By default, only the 20 most recent correlations will be returned. You can see below how to change this number using the limit parameter.
You can get your list of correlations directly in your browser using your own username and API key with the following link.
https://au.bigml.io/correlation?$BIGML_AUTH
> Listing correlations from a browser
Statistical Tests
Last Updated: Tuesday, 2019-01-29 16:28
A statistical test resource automatically runs some advanced statistical tests on the numeric fields of a dataset. The goal of these tests is to check whether the values of individual fields conform to or differ from some distribution patterns. Statistical tests are useful in tasks such as fraud, normality, or outlier detection.
The tests are grouped in the following three categories:
-
Fraud Detection Tests:
- Benford: This statistical test performs a comparison of the distribution of first significant digits (FSDs) of each value of the field to the Benford's law distribution. Benford's law applies to numerical distributions spanning several orders of magnitude, such as the values found on financial balance sheets. It states that the frequency distribution of leading, or first significant, digits in such distributions is not uniform: lower digits like 1 and 2 occur disproportionately often as leading significant digits. The test compares the distribution in the field to Benford's distribution using a chi-square goodness-of-fit test and the Cho-Gaines d test. If a field has a dissimilar distribution, it may contain anomalous or fraudulent values.
-
Normality tests: These tests can be used to confirm the assumption that the data in each field of a dataset is distributed
according to a normal distribution. The results are relevant because many statistical and machine learning techniques rely on this assumption.
- Anderson-Darling: The Anderson-Darling test computes a test statistic based on the difference between the observed cumulative distribution function (CDF) and that of a normal distribution. A significant result indicates that the assumption of normality is rejected.
- Jarque-Bera: The Jarque-Bera test computes a test statistic based on the third and fourth central moments (skewness and kurtosis) of the data. Again, a significant result indicates that the normality assumption is rejected.
- Z-score: For a given sample size, the maximum deviation from the mean that would be expected in a sampling of a normal distribution can be computed based on the 68-95-99.7 rule. This test simply reports this expected deviation and the actual deviation observed in the data, as a sort of sanity check.
-
Outlier tests:
- Grubbs: When the values of a field are normally distributed, a few values may still deviate markedly from the mean. The outlier test reports whether at least one value in each numeric field differs significantly from the mean, using Grubbs' test for outliers. If an outlier is found, its value will be returned.
Note that both the number of tests within each category and the categories may increase in the near future.
BigML.io allows you to create, retrieve, update, and delete your statistical tests. You can also list all of your statistical tests.
Jump to:
- Statistical Test Base URL
- Creating a Statistical Test
- Statistical Test Arguments
- Retrieving a Statistical Test
- Statistical Test Properties
- Filtering and Paginating Fields from a Statistical Test
- Updating a Statistical Test
- Deleting a Statistical Test
- Listing Statistical Tests
Statistical Test Base URL
You can use the following base URL to create, retrieve, update, and delete statistical tests. https://au.bigml.io/statisticaltest
Statistical Test base URL
All requests to manage your statistical tests must use HTTPS and be authenticated using your username and API key to verify your identity. See this section for more details.
Creating a Statistical Test
To create a new statistical test, you need to POST to the statistical test base URL an object containing at least the dataset/id that you want to use to create the statistical test. The content-type must always be "application/json".
You can easily create a new statistical test using curl as follows. All you need is a valid dataset/id and your authentication variable set up as shown above.
curl "https://au.bigml.io/statisticaltest?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"dataset": "dataset/54222a14f0a5eaaab000000c"}'
> Creating a statistical test
BigML.io will return the newly created statistical test if the request succeeded.
{
"category": 0,
"clones": 0,
"code": 201,
"columns": 0,
"created": "2015-06-23T06:14:49.583473",
"credits": 0.09991455078125,
"dataset": "dataset/5579abc3545e5f4f8a000000",
"dataset_field_types": {
"categorical": 1,
"datetime": 0,
"numeric": 8,
"preferred": 9,
"text": 0,
"total": 9
},
"dataset_status": true,
"dataset_type": 0,
"description": "",
"excluded_fields": [ ],
"fields_meta": {
"count": 0,
"limit": 1000,
"offset": 0,
"total": 0
},
"input_fields": [ ],
"locale": "en-US",
"max_columns": 9,
"max_rows": 768,
"name": "Diabetes (all numeric) test",
"out_of_bag": false,
"price": 0,
"private": true,
"project": "project/5578cfd8545e5f6a17000000",
"range": [
1,
768
],
"replacement": false,
"resource": "statisticaltest/5588f959545e5fdc1e000007",
"rows": 768,
"sample_rate": 1,
"shared": false,
"size": 26192,
"source": "source/5578d077545e5f6a17000011",
"source_status": true,
"statistical_tests": null,
"status": {
"code": 1,
"message": "The statistical test is being processed and will be created soon"
},
"subscription": false,
"tags": [ ],
"updated": "2015-06-23T06:14:49.583623",
"white_box": false
}
< Example statistical test JSON response
Statistical Test Arguments
In addition to the dataset, you can also POST the following arguments.
| Argument | Type | Description |
|---|---|---|
|
ad_sample_size
optional |
Integer, default is 1024 |
The Anderson-Darling normality test is computed from a sample of the values of each field. This parameter specifies the size of the sample to be used during the normality test. If not given, it defaults to 1024.
Example: 128 |
|
ad_seed
optional |
String |
A string to be hashed to generate deterministic samples for the Anderson-Darling normality test.
Example: "MyADSeed" |
|
category
optional |
Integer, default is the category of the dataset |
The category that best describes the statistical test. See the category codes for the complete list of categories.
Example: 1 |
| dataset | String |
A valid dataset/id.
Example: dataset/4f66a80803ce8940c5000006 |
|
default_numeric_value
optional |
String |
It accepts any of the following strings to substitute missing numeric values across all the numeric fields in the dataset: "mean", "median", "minimum", "maximum", "zero"
Example: "median" |
|
description
optional |
String |
A description of the statistical test up to 8192 characters long.
Example: "This is a description of my new statistical test" |
|
excluded_fields
optional |
Array, default is [], an empty list. None of the fields in the dataset is excluded. |
Specifies the fields that won't be included in the statistical test.
Example:
|
|
fields
optional |
Object, default is {}, an empty dictionary. That is, no names or preferred statuses are changed. |
This can be used to change the names of the fields in the statistical test with respect to the original names in the dataset, or to tell BigML that certain fields should be preferred. Add an entry keyed by the field id generated in the source for each field whose name you want updated.
Example:
|
|
input_fields
optional |
Array, default is []. All the fields in the dataset |
Specifies the fields to be considered to create the statistical test.
Example:
|
|
name
optional |
String, default is dataset's name |
The name you want to give to the new statistical test.
Example: "my new statistical test" |
|
out_of_bag
optional |
Boolean, default is false |
Setting this parameter to true will return a sequence of the out-of-bag instances instead of the sampled instances. See the Section on Sampling for more details.
Example: true |
|
project
optional |
String |
The project/id you want the statistical test to belong to.
Example: "project/54d98718f0a5ea0b16000000" |
|
range
optional |
Array, default is [1, max rows in the dataset] |
The range of successive instances to build the statistical test.
Example: [1, 150] |
|
replacement
optional |
Boolean, default is false |
Whether sampling should be performed with or without replacement. See the Section on Sampling for more details.
Example: true |
|
sample_rate
optional |
Float, default is 1.0 |
A real number between 0 and 1 specifying the sample rate. See the Section on Sampling for more details.
Example: 0.5 |
|
seed
optional |
String |
A string to be hashed to generate deterministic samples. See the Section on Sampling for more details.
Example: "MySample" |
|
significance_levels
optional |
Array, default is [0.01, 0.05, 0.1] |
An array of significance levels between 0 and 1 to test against p_values.
Example: [0.01, 0.025, 0.05, 0.075, 0.1] |
|
tags
optional |
Array of Strings |
A list of strings that help classify and index your statistical test.
Example: ["best customers", "2018"] |
|
webhook
optional |
Object |
A webhook url and an optional secret phrase. See the Section on Webhooks for more details.
Example:
|
You can also use curl to customize a new statistical test. For example, to create a new statistical test named "my statistical test", with only certain rows, and with only three fields:
curl "https://au.bigml.io/statisticaltest?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"dataset": "dataset/54222a14f0a5eaaab000000c",
"input_fields": ["000001", "000002", "000003"],
"name": "my statistical test",
"range": [25, 125]}'
> Creating a customized statistical test
If you do not specify a name, BigML.io will assign to the new statistical test the dataset's name. If you do not specify a range of instances, BigML.io will use all the instances in the dataset. If you do not specify any input fields, BigML.io will include all the input fields in the dataset.
Read the Section on Sampling to learn how to sample your dataset. Here's an example of a statistical test request with range and sampling specifications:
curl "https://au.bigml.io/statisticaltest?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"dataset": "dataset/505f43223c1920eccc000297",
"range": [1, 5000],
"sample_rate": 0.5}'
> Creating a statistical test using sampling
Retrieving a Statistical Test
Each statistical test has a unique identifier in the form "statisticaltest/id" where id is a string of 24 alpha-numeric characters that you can use to retrieve the statistical test.
To retrieve a statistical test with curl:
curl "https://au.bigml.io/statisticaltest/5588f959545e5fdc1e000007?$BIGML_AUTH"
$ Retrieving a statistical test from the command line
Statistical Test Properties
Once a statistical test has been successfully created it will have the following properties.
| Property | Type | Description |
|---|---|---|
|
category
filterable, sortable, updatable |
Integer | One of the categories in the table of categories that help classify this resource according to the domain of application. |
| code | Integer | HTTP status code. This will be 201 upon successful creation of the statistical test and 200 afterwards. Check the code that comes with the status attribute to verify that the statistical test creation has been completed without errors. |
|
columns
filterable, sortable |
Integer | The number of fields in the statistical test. |
|
created
filterable, sortable |
ISO-8601 Datetime. | This is the date and time in which the statistical test was created with microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are provided in Coordinated Universal Time (UTC). |
|
credits
filterable, sortable |
Float | The number of credits it cost you to create this statistical test. |
|
dataset
filterable, sortable |
String | The dataset/id that was used to build the statistical test. |
|
dataset_status
filterable, sortable |
Boolean | Whether the dataset is still available or has been deleted. |
|
description
updatable |
String | A text describing the statistical test. It can contain restricted markdown to decorate the text. |
| excluded_fields | Array | The list of fields' ids that were excluded when building the statistical test. |
| fields_meta | Object | A dictionary with meta information about the fields dictionary. It specifies the total number of fields, the current offset and limit, and the number of fields returned (count). |
| input_fields | Array | The list of input fields' ids used to build the models of the statistical test. |
| locale | String | The dataset's locale. |
|
max_columns
filterable, sortable |
Integer | The total number of fields in the dataset used to build the statistical test. |
|
max_rows
filterable, sortable |
Integer | The maximum number of instances in the dataset that can be used to build the statistical test. |
|
name
filterable, sortable, updatable |
String | The name of the statistical test as you provided it, or based on the name of the dataset by default. |
|
out_of_bag
filterable, sortable |
Boolean | Whether the out-of-bag instances were used to create the statistical test instead of the sampled instances. |
|
price
filterable, sortable, updatable |
Float | The price other users must pay to clone your statistical test. |
|
private
filterable, sortable, updatable |
Boolean | Whether the statistical test is public or not. |
|
project
filterable, sortable, updatable |
String | The project/id the resource belongs to. |
| range | Array | The range of instances used to build the statistical test. |
|
replacement
filterable, sortable |
Boolean | Whether the instances sampled to build the statistical test were selected using replacement or not. |
| resource | String | The statisticaltest/id. |
|
rows
filterable, sortable |
Integer | The total number of instances used to build the statistical test. |
|
sample_rate
filterable, sortable |
Float | The sample rate used to select instances from the dataset to build the statistical test. |
|
seed
filterable, sortable |
String | The string that was used to generate the sample. |
|
shared
filterable, sortable, updatable |
Boolean | Whether the statistical test is shared using a private link or not. |
| shared_hash | String | The hash that gives access to this statistical test if it has been shared using a private link. |
| sharing_key | String | The alternative key that gives read access to this statistical test. |
|
size
filterable, sortable |
Integer | The number of bytes of the dataset that were used to create this statistical test. |
|
source
filterable, sortable |
String | The source/id that was used to build the dataset. |
|
source_status
filterable, sortable |
Boolean | Whether the source is still available or has been deleted. |
| statistical_tests | Object | All the information that you need to recreate the statistical test. It includes the field's dictionary describing the fields and their summaries, and the statistical tests. See the Statistical Tests Object definition below. |
| status | Object | A description of the status of the statistical test. It includes a code, a message, and some extra information. See the table below. |
|
subscription
filterable, sortable |
Boolean | Whether the statistical test was created using a subscription plan or not. |
|
tags
filterable, updatable |
Array of Strings | A list of user tags that can help classify and index this resource. |
|
updated
filterable, sortable |
ISO-8601 Datetime. | This is the date and time in which the statistical test was updated with microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are provided in Coordinated Universal Time (UTC). |
| webhook | Object | A webhook url and an optional secret phrase. See the Section on Webhooks for more details. |
|
white_box
filterable, sortable |
Boolean | Whether the statistical test is publicly shared as a white-box. |
The Statistical Tests Object of a statistical test has the following properties. Many statistical tests will contain a p_value and a significant boolean array, indicating whether the p_value is less than each of the provided significance_levels (by default, [0.01, 0.05, 0.1] is used if none are provided). If the p_value is greater than or equal to a given significance level, the test fails to reject the null hypothesis at that level, meaning there is no statistically significant difference between the treatment groups. For example, if significance_levels is [0.01, 0.025, 0.05, 0.075, 0.1] and the p_value is 0.05, then significant is [false, false, false, true, true].
| Property | Type | Description |
|---|---|---|
| ad_sample_size | Integer | The sample size used for the Anderson-Darling normality test. |
| ad_seed | String | A seed used to generate deterministic samples for the Anderson-Darling normality test. |
| fields | Object | A dictionary with an entry per field in the dataset used to build the test. Fields are paginated according to the field_meta attribute. Each entry includes the column number in original source, the name of the field, the type of the field, and the summary. See this Section for more details. |
| fraud | Array | An array of anomalous fields detection test results for each numeric field. See Fraud Object. |
| normality | Array | An array of data normality test results for each numeric field. See Normality Object. |
| outliers | Array | An array of outlier detection test results for each numeric field. See Outliers Object. |
| significance_levels | Array | An array of user provided significance levels to test against p_values. |
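The relationship between p_value, significance_levels, and the significant array described above reduces to an element-wise comparison; a one-line sketch:

```python
# Illustrative sketch: the significant array is just p_value compared
# against each significance level in turn.
def significant(p_value, significance_levels=(0.01, 0.05, 0.1)):
    return [p_value < level for level in significance_levels]

print(significant(0.05, [0.01, 0.025, 0.05, 0.075, 0.1]))
# [False, False, False, True, True]
```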
The Fraud Object has the following properties.
| Property | Type | Description |
|---|---|---|
| name | String | Name of the fraud test. Currently only value available is benford. |
| result | Object | A test result, given as a dictionary mapping field ids to test results. The type of the result object varies based on the name of the test. When name is benford, it is a Benford Result Object. |
The Benford Result Object has the following properties. Benford's Law is a simple yet powerful tool allowing quick screening of data for anomalies.
| Property | Type | Description |
|---|---|---|
| chi_square | Object | See Chi-Square Object. |
| cho_gaines | Object | See Cho-Gaines Object. |
| distribution | Array |
The observed distribution of first significant digits (FSDs), for comparison to the Benford's law distribution. For example, the FSD of 2015 is 2, and of 0.00609 it is 6. The array represents the number of occurrences of each digit from 1 to 9.
Example: [0, 0, 0, 22, 61, 54, 0, 0, 0] |
| negatives | Integer | The number of negative values. |
| zeros | Integer | The number of values exactly equal to 0. |
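The distribution, zeros, and negatives properties can be illustrated with a short sketch (not BigML's code; whether negatives also contribute an FSD is an assumption here, so this version only counts them):

```python
import math

# Illustrative sketch: first-significant-digit counts for the
# distribution array, plus the zeros and negatives counters.
def first_significant_digit(x):
    x = abs(x)
    return int(x // 10 ** math.floor(math.log10(x)))

def fsd_distribution(values):
    counts = [0] * 9          # occurrences of FSDs 1..9
    zeros = negatives = 0
    for v in values:
        if v == 0:
            zeros += 1        # zeros have no FSD
        elif v < 0:
            negatives += 1    # counted separately in this sketch
        else:
            counts[first_significant_digit(v) - 1] += 1
    return counts, zeros, negatives

print(fsd_distribution([2015, 0.00609, 7, 0, -3]))
# ([0, 1, 0, 0, 0, 1, 1, 0, 0], 1, 1)
```

Benford's expected frequency for digit d is log10(1 + 1/d), which is what the observed counts are tested against.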
The Chi-Square Object contains the chi-square statistic used to investigate whether distributions of categorical variables differ from one another. This test is used to compare a collection of categorical data with some theoretical expected distribution. The object has the following properties.
The Cho-Gaines Object has the following properties.
| Property | Type | Description |
|---|---|---|
| d_statistic | Float | The Cho-Gaines d statistic: a value based on the Euclidean distance between the observed first-digit frequency vector and Benford's distribution in the 9-dimensional space occupied by any first-digit vector. |
| significant | Array |
A boolean array indicating whether the test produced a significant result at each significance level. This test does not respect the values passed in significance_levels; it always uses [0.01, 0.05, 0.1].
Example: [false, true, true] |
The Normality Object has the following properties.
| Property | Type | Description |
|---|---|---|
| name | String | Name of the normality test. Available values are anderson_darling, jarque_bera, and z_score. |
| result | Object | A test result, given as a dictionary mapping field ids to test results. The type of the result object varies based on the name of the test. When name is anderson_darling, it is an Anderson-Darling Result Object; when jarque_bera, a Jarque-Bera Result Object; and when z_score, a Z-Score Result Object. |
The Anderson-Darling Result Object has the following properties. See Anderson-Darling Test for more information.
The Jarque-Bera Result Object has the following properties. See Jarque-Bera Test for more information.
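Although BigML computes it for you, the Jarque-Bera statistic itself is simple: JB = n/6 * (S^2 + (K - 3)^2 / 4), where S is the sample skewness and K the sample kurtosis. A minimal sketch (an illustration, not BigML's code):

```python
# Illustrative sketch: the Jarque-Bera statistic from the third and
# fourth central moments of the sample.
def jarque_bera(values):
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    return n / 6 * (skewness ** 2 + (kurtosis - 3) ** 2 / 4)
```

Under normality, JB is asymptotically chi-square distributed with 2 degrees of freedom, which is how a p-value can be obtained from it.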
The Z-Score Object has the following properties. A positive standard score indicates a datum above the mean, while a negative standard score indicates a datum below the mean. See z-score for more information.
| Property | Type | Description |
|---|---|---|
| expected-max-z | Float | The expected maximum z-score for the sample size. |
| max-z | Float | The maximum z-score. |
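A sketch of how max-z could be computed (an illustration, not BigML's code; it assumes the sample standard deviation with n-1 in the denominator, and expected-max-z, which depends on the sample size, is left to the service):

```python
# Illustrative sketch: the largest absolute z-score in a sample.
def max_z_score(values):
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / (n - 1)) ** 0.5
    return max(abs(v - mean) / std for v in values)
```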
The Outliers Object has the following properties.
| Property | Type | Description |
|---|---|---|
| name | String | Name of the outlier detection test. Currently only value available is grubbs. |
| result | Object | A test result, given as a dictionary mapping field ids to test results. The type of the result object varies based on the name of the test. When name is grubbs, it is a Grubbs Result Object. |
The Grubbs Result Object has the following properties. The test computes a t-test based on the maximum deviation from the mean. A significant result indicates that at least one outlier is present in the data. If an outlier is found, the test also returns its value. Note that this test assumes that the data are normally distributed. See Grubbs' test for outliers for more information.
Statistical Test Status
Creating a statistical test is a process that can take just a few seconds or a few days depending on the size of the dataset used as input and on the workload of BigML's systems. The statistical test goes through a number of states until it is fully completed. Through the status field in the statistical test you can determine when the test has been fully processed and is ready to be used. These are the properties of a statistical test's status:
| Property | Type | Description |
|---|---|---|
| code | Integer | A status code that reflects the status of the statistical test creation. It can be any of those that are explained here. |
| elapsed | Integer | Number of milliseconds that BigML.io took to process the statistical test. |
| message | String | A human readable message explaining the status. |
| progress | Float, between 0 and 1 | How far BigML.io has progressed building the statistical test. |
Once a statistical test has been successfully created, it will look like:
{
"category": 0,
"clones": 0,
"code": 200,
"columns": 9,
"created": "2015-06-23T06:14:49.583000",
"credits": 0.09991455078125,
"dataset": "dataset/5579abc3545e5f4f8a000000",
"dataset_status": true,
"dataset_type": 0,
"description": "",
"excluded_fields": [ ],
"fields_meta": {
"count": 9,
"limit": 1000,
"offset": 0,
"query_total": 9,
"total": 9
},
"input_fields": [
"000000",
"000001",
"000002",
"000003",
"000004",
"000005",
"000006",
"000007"
],
"locale": "en-US",
"max_columns": 9,
"max_rows": 768,
"name": "Diabetes test",
"out_of_bag": false,
"price": 0,
"private": true,
"project": "project/5578cfd8545e5f6a17000000",
"range": [
1,
768
],
"replacement": false,
"resource": "statisticaltest/5588f959545e5fdc1e000007",
"rows": 768,
"sample_rate": 1,
"shared": false,
"size": 26192,
"source": "source/5578d077545e5f6a17000011",
"source_status": true,
"statistical_tests": {
"ad_sample_size": 2048,
"ad_seed": "MyADSeed",
"fields": { … },
"fraud": [
{
"name": "benford",
"result": {
"000000": {
"chi_square": {
"chi_square_value": 5.67791,
"p_value": 0.68326,
"significant": [
false,
false
]
},
"cho_gaines": {
"d_statistic": 0.7654738225941359,
"significant": [
false,
false,
false
]
},
"distribution": [
193,
103,
75,
68,
57,
50,
45,
38,
28
],
"negatives": 0,
"zeros": 111
},
"000001": { … },
"000002": { … },
"000003": { … },
"000004": { … },
"000005": { … },
"000006": { … },
"000007": { … }
}
}
],
"normality": [
{
"name": "anderson_darling",
"result": {
"000000": {
"p_value": 0,
"significant": [
true,
true
]
},
"000001": { … },
"000002": { … },
"000003": { … },
"000004": { … },
"000005": { … },
"000006": { … },
"000007": { … }
}
},
{
"name": "jarque_bera",
"result": {
"000000": {
"p_value": 0,
"significant": [
true,
true
]
},
"000001": { … },
"000002": { … },
"000003": { … },
"000004": { … },
"000005": { … },
"000006": { … },
"000007": { … }
}
},
{
"name": "z_score",
"result": {
"000000": {
"expected_max_z": 3.21552,
"max_z": 3.90403
},
"000001": { … },
"000002": { … },
"000003": { … },
"000004": { … },
"000005": { … },
"000006": { … },
"000007": { … }
}
}
],
"outliers": [
{
"name": "grubbs",
"result": {
"000000": {
"p_value": 0.06734,
"significant": [
false,
false
]
},
"000001": { … },
"000002": { … },
"000003": { … },
"000004": { … },
"000005": { … },
"000006": { … },
"000007": { … }
}
}
],
"significance_levels": [
0.025,
0.01
]
},
"status": {
"code": 5,
"elapsed": 2244,
"message": "The statistical test has been created",
"progress": 1
},
"subscription": false,
"tags": [ ],
"updated": "2015-06-23T06:15:18.908000",
"white_box": false
}
< Example statistical test JSON response
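The benford fraud test above compares each field's first-significant-digit distribution against Benford's law using a chi-square statistic. The following sketch reproduces that comparison for field 000000 from the response above; the helper names are illustrative, not part of the API:

```python
import math

def benford_expected(digit):
    """Probability of `digit` (1-9) as the first significant digit under Benford's law."""
    return math.log10(1 + 1 / digit)

def chi_square_statistic(observed):
    """Chi-square statistic for observed first-digit counts [n1..n9] vs Benford."""
    total = sum(observed)
    stat = 0.0
    for digit, count in enumerate(observed, start=1):
        expected = total * benford_expected(digit)
        stat += (count - expected) ** 2 / expected
    return stat

# First-digit counts for field 000000 in the example response
distribution = [193, 103, 75, 68, 57, 50, 45, 38, 28]
statistic = chi_square_statistic(distribution)  # close to the chi_square_value above
```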
Filtering and Paginating Fields from a Statistical Test
A statistical test might be composed of hundreds or even thousands of fields. Thus when retrieving a statisticaltest, it's possible to specify that only a subset of fields be retrieved, by using any combination of the following parameters in the query string (unrecognized parameters are ignored):
Since fields is a map and therefore not ordered, the returned fields contain an additional key, order, whose increasing integer value gives you their ordering. In all other respects, the resource is the same as the one you would get without any of the filtering parameters above.
The fields_meta field can help you paginate fields. Its structure is as follows:
Updating a Statistical Test
To update a statistical test, you need to PUT an object containing the fields that you want to update to the statistical test's base URL. The content-type must always be: "application/json". If the request succeeds, BigML.io will return an HTTP 202 response with the updated statistical test.
For example, to update a statistical test with a new name you can use curl like this:
curl "https://au.bigml.io/statisticaltest/5588f959545e5fdc1e000007?$BIGML_AUTH" \
-X PUT \
-H 'content-type: application/json' \
-d '{"name": "a new name"}'
$ Updating a statistical test's name
If you want to update a statistical test with a new label and description for a specific field you can use curl like this:
curl "https://au.bigml.io/statisticaltest/5588f959545e5fdc1e000007?$BIGML_AUTH" \
-X PUT \
-H 'content-type: application/json' \
-d '{"fields": {"000000": {
"label": "a longer name",
"description": "an even longer description"}}}'
$ Updating a statistical test's field label and description
Deleting a Statistical Test
To delete a statistical test, you need to issue an HTTP DELETE request to the statisticaltest/id to be deleted.
Using curl you can do something like this to delete a statistical test:
curl -X DELETE "https://au.bigml.io/statisticaltest/5588f959545e5fdc1e000007?$BIGML_AUTH"
$ Deleting a statistical test from the command line
If the request succeeds you will not see anything on the command line unless you executed the command in verbose mode. Successful DELETEs will return "204 no content" responses with no body.
Once you delete a statistical test, it is permanently deleted. That is, a delete request cannot be undone. If you try to delete a statistical test a second time, or a statistical test that does not exist, you will receive a "404 not found" response.
However, if you try to delete a statistical test that is being used at the moment, then BigML.io will not accept the request and will respond with a "400 bad request" response.
See this section for more details.
Listing Statistical Tests
To list all the statistical tests, you can use the statisticaltest base URL. By default, only the 20 most recent statistical tests will be returned. You can see below how to change this number using the limit parameter.
You can get your list of statistical tests directly in your browser using your own username and API key with the following links.
https://au.bigml.io/statisticaltest?$BIGML_AUTH
> Listing statistical tests from a browser
Configurations
Last Updated: Tuesday, 2019-01-29 16:28
A configuration is a helper resource that provides an easy way to reuse the same arguments across resource creation requests.
A configuration must have a name and optionally a category, description, and multiple tags to help you organize and retrieve your configurations.
BigML.io allows you to create, retrieve, update, and delete your configurations. You can also list all of your configurations.
Jump to:
- Configuration Base URL
- Creating a Configuration
- Configuration Arguments
- Retrieving a Configuration
- Configuration Properties
- Updating a Configuration
- Deleting a Configuration
- Listing Configurations
Configuration Base URL
You can use the following base URL to create, retrieve, update, and delete configurations. https://au.bigml.io/configuration
Configuration base URL
All requests to manage your configurations must use HTTPS and be authenticated using your username and API key to verify your identity. See this section for more details.
Creating a Configuration
To create a new configuration, you just need to POST to the configuration base URL the name you want to give to the new configuration, together with a configurations object that contains settings for individual resource types or for any resource.
You can easily do this using curl.
curl "https://au.bigml.io/configuration?$BIGML_AUTH" \
-H 'content-type: application/json' \
-d '{
"name": "My First Configuration",
"configurations": {
"dataset": {
"name": "Customer FAQ dataset"
},
"ensemble": {
"description": "Customer FAQ ensemble with 10 models",
"number_of_models": 10
},
"any": {
"project": "project/55eeed1f1f386fc29520000a"
}
}
}'
> Creating a configuration
BigML.io will return a newly created configuration document, if the request succeeded.
{
"category":0,
"code":201,
"configurations": {
"any": {
"project": "project/55eeed1f1f386fc29500000a"
},
"dataset": {
"name": "Customer FAQ dataset"
},
"ensemble": {
"description": "Customer FAQ ensemble with 10 models"
"number_of_models": 10
}
},
"created":"2016-10-07T19:35:22.533289",
"credits":0,
"description":"",
"name":"Configuration 1",
"private":true,
"project":null,
"resource":"configuration/57db8107b8aa0940d5b61138",
"shared":false,
"stats":null,
"status":{
"code":5,
"message":"The configuration has been created"
},
"tags":[],
"updated":"2016-10-07T19:35:22.533391"
}
< Example configuration JSON response
The following arguments are available for you to use.
Configuration Arguments
| Argument | Type | Description |
|---|---|---|
|
category
optional |
Integer, default is 0 |
The category that best describes the configuration. See the category codes for the complete list of categories.
Example: 1 |
| configurations | Object |
Default arguments for individual resource types, or under any to apply the arguments to all resources. For more information, see the Configurations below.
Example:
|
|
description
optional |
String |
A description of the configuration up to 8192 characters long.
Example: "This is a description of my new configuration" |
|
name
optional |
String |
The name you want to give to the new configuration.
Example: "my new configuration" |
|
tags
optional |
Array of Strings |
A list of strings that help classify and index your configuration.
Example: ["best customers", "2018"] |
|
webhook
optional |
Object |
A webhook url and an optional secret phrase. See the Section on Webhooks for more details.
Example:
|
Configurations
Under configurations, you can use any or any specific resource name BigML supports, excluding the configuration resource itself: for example dataset, anomaly, or model.
Once a configuration is successfully created, you can pass a configuration argument to any resource as part of POST requests. For example, "configuration": "configuration/5776b2a64e1727b72c000007".
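For instance, a model creation request that reuses a stored configuration only needs the extra configuration key in its JSON body. A minimal sketch of building that body (the ids are taken from examples elsewhere in this document):

```python
import json

# POST body for creating a model that reuses a stored configuration
payload = {
    "dataset": "dataset/4f66a80803ce8940c5000006",
    "configuration": "configuration/5776b2a64e1727b72c000007",
}
body = json.dumps(payload)  # send with content-type "application/json"
```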
The order of precedence of applying the default values when using a configuration is
- User input
- Specific resource type from the configuration
- any from the configuration
For example, if you use the following configuration for creating a model,
"configurations": {
"model": {
"name": "model name",
"description": "model description"
},
"any" : {
"name": "any name",
"description": "any description",
"tags": ["any tags"]
}
}
Configurations example
and pass "name": "my custom name" as a POST argument, your new model will have
"name": "my custom name",
"description": "model description",
"tags": ["any tags"]
New model properties
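The precedence above amounts to a shallow merge in which user input wins over the specific resource-type section, which in turn wins over any. A sketch of that merge (a hypothetical helper, not part of the API):

```python
def effective_arguments(user_input, configurations, resource_type):
    """Merge creation arguments: user input > resource-type section > `any` section."""
    merged = dict(configurations.get("any", {}))      # lowest precedence
    merged.update(configurations.get(resource_type, {}))
    merged.update(user_input)                          # highest precedence
    return merged

configurations = {
    "model": {"name": "model name", "description": "model description"},
    "any": {"name": "any name", "description": "any description", "tags": ["any tags"]},
}

effective_arguments({"name": "my custom name"}, configurations, "model")
# {'name': 'my custom name', 'description': 'model description', 'tags': ['any tags']}
```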
Any element under a resource name will be validated against that resource's validator when the configuration is created. Elements under any, however, are validated only at runtime. For example, the following input
"configurations": {
"anomaly": {
"forest_size" : 32,
"dataset": "dataset/5776b19e4e1727b72c000002",
"anomaly_seed": true,
"top_n": "10"
},
"model": {
"objective_field": "000004",
"out_of_bag": true,
"model_seed": "my model seed"
},
"any" : {
"tag" : ["sample"],
}
}
Invalid configuration example
will return the following errors:
{
"code": 400,
"status": {
"code": -1204,
"extra": {
"configurations": {
"anomaly": {
"anomaly_seed": [
"This field must be a string no longer than 256 chars"
],
"top_n": [
"This field must be a number between 1 and 1024"
]
},
"model": {
"model_seed": [
"This field is not postable"
]
}
}
},
"message": "Bad request"
}
}
Errors for an invalid configuration example
Note that any contains a field called tag instead of tags, which no resource supports. This won't raise an error until you use the configuration to create other resources.
Retrieving a Configuration
Each configuration has a unique identifier in the form "configuration/id" where id is a string of 24 alpha-numeric characters that you can use to retrieve the configuration.
To retrieve a configuration with curl:
curl "https://au.bigml.io/configuration/57db8107b8aa0940d5b61138?$BIGML_AUTH"
$ Retrieving a configuration from the command line
Configuration Properties
Once a configuration has been successfully created it will have the following properties.
| Property | Type | Description |
|---|---|---|
|
category
filterable, sortable, updatable |
Integer | One of the categories in the table of categories that help classify this resource according to the domain of application. |
| code | Integer | HTTP status code. This will be 201 upon successful creation of the configuration and 200 afterwards. Make sure that you check the code that comes with the status attribute to make sure that the configuration creation has been completed without errors. |
| configurations | Object | Configuration object. For more information, see the Configurations above. |
|
created
filterable, sortable |
ISO-8601 Datetime. | This is the date and time in which the configuration was created with microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are provided in Coordinated Universal Time (UTC). |
|
description
updatable |
String | A text describing the configuration. It can contain restricted markdown to decorate the text. |
|
name
filterable, sortable, updatable |
String | The name of the configuration as provided. |
|
private
filterable, sortable |
Boolean | Whether the configuration is public or not. |
| resource | String | The configuration/id. |
| status | Object | A description of the status of the configuration. It includes a code, a message, and some extra information. See the table below. |
|
tags
filterable, updatable |
Array of Strings | A list of user tags that can help classify and index this resource. |
|
updated
filterable, sortable |
ISO-8601 Datetime. | This is the date and time in which the configuration was updated with microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are provided in Coordinated Universal Time (UTC). |
| webhook | Object | A webhook url and an optional secret phrase. See the Section on Webhooks for more details. |
Updating a Configuration
To update a configuration, you need to PUT an object containing the fields that you want to update to the configuration's base URL. The content-type must always be: "application/json". If the request succeeds, BigML.io will return an HTTP 202 response with the updated configuration.
For example, to update a configuration with new configurations and a new category, you can use curl like this:
curl "https://au.bigml.io/configuration/57db8107b8aa0940d5b61138?$BIGML_AUTH" \
-X PUT \
-H 'content-type: application/json' \
-d '{
"category": 3,
"configurations": {
"dataset": {
"name": "Customer FAQ dataset"
},
"ensemble": {
"description": "Customer FAQ ensemble with 10 models"
"number_of_models": 10
},
"any": {
"project": "project/55eeed1f1f386fc29520000a"
"tags": ["FAQ", "Sample"]
}
}
}'
$ Updating a configuration
Deleting a Configuration
To delete a configuration, you need to issue an HTTP DELETE request to the configuration/id to be deleted.
Using curl you can do something like this to delete a configuration:
curl -X DELETE "https://au.bigml.io/configuration/57db8107b8aa0940d5b61138?$BIGML_AUTH"
$ Deleting a configuration from the command line
If the request succeeds you will not see anything on the command line unless you executed the command in verbose mode. Successful DELETEs will return "204 no content" responses with no body.
Once you delete a configuration, it is permanently deleted. That is, a delete request cannot be undone. If you try to delete a configuration a second time, or a configuration that does not exist, you will receive a "404 not found" response.
However, if you try to delete a configuration that is being used at the moment, then BigML.io will not accept the request and will respond with a "400 bad request" response.
See this section for more details.
Listing Configurations
To list all the configurations, you can use the configuration base URL. By default, only the 20 most recent configurations will be returned. You can see below how to change this number using the limit parameter.
You can get your list of configurations directly in your browser using your own username and API key with the following links.
https://au.bigml.io/configuration?$BIGML_AUTH
> Listing configurations from a browser
Composites
Last Updated: Tuesday, 2019-01-29 16:28
Composite models are an aggregate of individual model resources grouped by the user in an ad-hoc fashion. Each submodel in a composite has been previously built independently by the service, and they're just put together in a composite resource. However, a composite cannot be used for predictions or evaluations.
Any model type (anomaly, association, cluster, composite, deepnet, ensemble, fusion, model, logistic regression, optiml, time series, and topic model) can be a submodel of a composite.
BigML.io allows you to create, retrieve, update, and delete your composites. You can also list all of your composites.
Jump to:
- Composite Base URL
- Creating a Composite
- Composite Arguments
- Retrieving a Composite
- Composite Properties
- Filtering and Paginating Models from a Composite
- Updating a Composite
- Deleting a Composite
- Listing Composites
Composite Base URL
You can use the following base URL to create, retrieve, update, and delete composites. https://au.bigml.io/composite
Composite base URL
All requests to manage your composites must use HTTPS and be authenticated using your username and API key to verify your identity. See this section for more details.
Creating a Composite
To create a new composite, you need to POST to the composite base URL an object containing at least a list of model ids that you want to use to create the composite. The content-type must always be "application/json".
POST /composite?username=alfred;api_key=79138a622755a2383660347f895444b1eb927730; HTTP/1.1
Host: bigml.io
Content-Type: application/json
Creating composite definition
curl "https://au.bigml.io/composite?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"models": ["model/4f66a80803ce8940c5000006", "logisticregression/5a95d5664e17271473000000", "cluster/5aec0b9e4e17275dab000401"]}'
> Creating a composite
BigML.io will return the newly created composite if the request succeeded.
{
"category": 0,
"code": 201,
"composite": {},
"configuration": null,
"configuration_status": false,
"created": "2018-05-09T05:52:16.673433",
"description": "",
"model_count": {
"cluster": 1,
"logisticregression": 1,
"model": 1,
"total": 3
},
"models": [
"model/5948beb44e17273079000003",
"logisticregression/5a95d5664e17271473000000",
"cluster/5aec0b9e4e17275dab000401"
],
"name": "Iris models composite",
"name_options": "3 total models (cluster: 1, logisticregression: 1, model: 1)",
"private": true,
"project": null,
"resource": "composite/59af8107b8aa0965d5b61138",
"shared": false,
"status": {
"code": 1,
"message": "The composite creation request has been queued and will be processed soon"
},
"subscription": false,
"tags": [],
"type": 0,
"updated": "2018-05-09T05:52:16.677393"
}
< Example composite JSON response
Composite Arguments
In addition to the models, you can also POST the following arguments.
| Argument | Type | Description |
|---|---|---|
|
category
optional |
Integer, default is 0 |
The category that best describes the composite. See the category codes for the complete list of categories.
Example: 1 |
|
description
optional |
String |
A description of the composite up to 8192 characters long.
Example: "This is a description of my new composite" |
| models | Array |
A list with composite submodel resource/ids, or a list of maps using the key id for each submodel resource/id plus any other key/values for additional meta-information on the model. Available submodel types are anomaly, association, cluster, composite, deepnet, ensemble, fusion, model, logistic regression, optiml, time series, and topic model. The maximum number of submodels is 1000.
Example: or
|
|
name
optional |
String, default is composite's name |
The name you want to give to the new composite.
Example: "my new composite" |
|
project
optional |
String |
The project/id you want the composite to belong to.
Example: "project/54d98718f0a5ea0b16000000" |
|
tags
optional |
Array of Strings |
A list of strings that help classify and index your composite.
Example: ["best customers", "2018"] |
|
webhook
optional |
Object |
A webhook url and an optional secret phrase. See the Section on Webhooks for more details.
Example:
|
If you do not specify a name, BigML.io will assign one to the new composite.
Retrieving a Composite
Each composite has a unique identifier in the form "composite/id" where id is a string of 24 alpha-numeric characters that you can use to retrieve the composite.
To retrieve a composite with curl:
curl "https://au.bigml.io/composite/59af8107b8aa0965d5b61138?$BIGML_AUTH"
$ Retrieving a composite from the command line
Composite Properties
Once a composite has been successfully created it will have the following properties.
| Property | Type | Description |
|---|---|---|
|
category
filterable, sortable, updatable |
Integer | One of the categories in the table of categories that help classify this resource according to the domain of application. |
| code | Integer | HTTP status code. This will be 201 upon successful creation of the composite and 200 afterwards. Make sure that you check the code that comes with the status attribute to make sure that the composite creation has been completed without errors. |
|
composite
filterable, sortable |
Object | Composite object. For more information, see the Composite below. |
|
composites
filterable, sortable |
Array of Strings | The list of composite ids that reference this model. |
|
created
filterable, sortable |
ISO-8601 Datetime. | This is the date and time in which the composite was created with microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are provided in Coordinated Universal Time (UTC). |
|
description
updatable |
String | A text describing the composite. It can contain restricted markdown to decorate the text. |
|
model_count
filterable, sortable |
Object |
A dictionary that informs about the number of submodels of each type in the composite.
Example:
|
| models | Array |
A list of all submodels ids regardless of how models are filtered and paged.
Example:
|
| models_meta | Object | A dictionary with meta information about the models filtered. It specifies the total number of models, the current offset, and limit. |
|
name
filterable, sortable, updatable |
String | The name of the composite as you provided. |
|
private
filterable, sortable |
Boolean | Whether the composite is public or not. |
|
project
filterable, sortable, updatable |
String | The project/id the resource belongs to. |
| resource | String | The composite/id. |
|
shared
filterable |
Boolean | Whether the composite is shared using a private link or not. |
|
shared_clonable
filterable |
Boolean | Whether the shared composite can be cloned or not. |
|
shared_hash
filterable |
String | The hash that gives access to this composite if it has been shared using a private link. |
| sharing_key | String | The alternative key that gives read access to this composite. |
| status | Object | A description of the status of the composite. It includes a code, a message, and some extra information. See the table below. |
|
subscription
filterable, sortable |
Boolean | Whether the composite was created using a subscription plan or not. |
|
tags
filterable, updatable |
Array of Strings | A list of user tags that can help classify and index this resource. |
|
type
filterable, sortable |
Integer | Reserved for future use. |
|
updated
filterable, sortable |
ISO-8601 Datetime. | This is the date and time in which the composite was updated with microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are provided in Coordinated Universal Time (UTC). |
| webhook | Object | A webhook url and an optional secret phrase. See the Section on Webhooks for more details. |
The Composite object has the following properties.
Composite Status
Creating a composite is a process that can take just a few seconds or a few hours depending on the size of the models used as input and on the workload of BigML's systems. The composite goes through a number of states until it's fully completed. Through the status field in the composite you can determine when it has been fully processed. These are the properties that a composite's status has:
| Property | Type | Description |
|---|---|---|
| code | Integer | A status code that reflects the status of the composite creation. It can be any of those that are explained here. |
| elapsed | Integer | Number of milliseconds that BigML.io took to process the composite. |
| message | String | A human readable message explaining the status. |
| progress | Float, between 0 and 1 | How far BigML.io has progressed building the composite. |
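Since creation is asynchronous, a client typically re-fetches the composite until status.code reaches 5 (finished) or a negative error code. A minimal polling loop, sketched with a caller-supplied fetch function rather than a real HTTP call:

```python
import time

FINISHED = 5  # terminal success code in the status table above

def wait_until_ready(fetch, poll_seconds=2, max_polls=100):
    """Poll `fetch()` (which returns the resource JSON) until finished or errored."""
    for _ in range(max_polls):
        resource = fetch()
        code = resource["status"]["code"]
        if code == FINISHED:
            return resource
        if code < 0:  # negative codes signal errors
            raise RuntimeError(resource["status"].get("message", "creation failed"))
        time.sleep(poll_seconds)
    raise TimeoutError("resource did not finish in time")
```

With the BigML API, fetch would be a GET on the composite/id URL with your credentials.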
Once a composite has been successfully created, it will look like:
{
"category": 0,
"code": 200,
"composite": {
"models": [
{
"id": "model/5948beb44e17273079000003",
"kind": "model",
"name": "Iris tree",
"name_options": "512-node, pruned, deterministic order"
},
{
"id": "logisticregression/5a95d5664e17271473000000",
"kind": "logisticregression",
"name": "Iris LR",
"name_options": "L2 regularized (c=1), bias, auto-scaled, missing values"
},
{
"id": "cluster/5aec0b9e4e17275dab000401",
"kind": "cluster",
"name": "Flower colors cluster",
"name_options": "K-means, k=10"
}
]
},
"configuration": null,
"configuration_status": false,
"created": "2018-05-09T05:52:16.673000",
"description": "",
"model_count": {
"cluster": 1,
"logisticregression": 1,
"model": 1,
"total": 3
},
"models": [
"model/5948beb44e17273079000003",
"logisticregression/5a95d5664e17271473000000",
"cluster/5aec0b9e4e17275dab000401"
],
"models_meta": {
"count": 3,
"offset": 0,
"limit": 1000,
"total": 3
},
"name": "Iris models composite",
"name_options": "3 total models (cluster: 1, logisticregression: 1, model: 1)",
"private": true,
"project": null,
"resource": "composite/59af8107b8aa0965d5b61138",
"shared": false,
"status": {
"code": 5,
"elapsed": 4658,
"message": "The composite has been created",
"progress": 1
},
"subscription": false,
"tags": [],
"type": 0,
"updated": "2018-05-09T05:52:21.359000"
}
< Example composite JSON response
Filtering and Paginating Models from a Composite
Since model lists can grow large, we offer pagination of the models list in the response when GETting it via HTTP. Pagination is specified using the following query string parameters:
- models_limit: A non-negative integer indicating how many elements in models to return. If not provided, we return at most 1000. If passed a negative value (say, -1), we return all of them.
- models_offset: The offset in the list of models (i.e., how many models are discarded before we take limit of them).
- models_sort_by: Sorting criteria, specified by any of the keys the user provided during creation in the models maps. Sorting is ascending unless you prefix the key name with a minus sign. For instance, if your models have a property, rank, you can use a query string of the form models_sort_by=rank to sort them by rank in ascending order, and one of the form models_sort_by=-rank to sort them in descending order. It is possible to provide more than one ordering criterion, separated by commas, in which case the second and subsequent ones are used to break ties in the ordering generated by the previous ones.
Sorting happens before limit and offset are applied. When pagination is active, the models_meta property is included at the top level of the returned resource. This property contains offset, limit, count, and total.
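The server-side order of operations (sort first, then offset and limit) can be mimicked locally; the sketch below does so over a list of model maps, purely for illustration:

```python
def paginate_models(models, sort_by=None, offset=0, limit=1000):
    """Sort by comma-separated keys (minus prefix = descending), then slice."""
    ordered = list(models)
    if sort_by:
        # apply criteria right-to-left so the first key dominates (sorts are stable)
        for key in reversed(sort_by.split(",")):
            descending = key.startswith("-")
            field = key.lstrip("-")
            ordered.sort(key=lambda m: m[field], reverse=descending)
    return ordered if limit < 0 else ordered[offset:offset + limit]

models = [
    {"id": "model/a", "rank": 2},
    {"id": "model/b", "rank": 3},
    {"id": "model/c", "rank": 1},
]
paginate_models(models, sort_by="-rank", limit=2)
# [{'id': 'model/b', 'rank': 3}, {'id': 'model/a', 'rank': 2}]
```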
Updating a Composite
To update a composite, you need to PUT an object containing the fields that you want to update to the composite's base URL. The content-type must always be: "application/json". If the request succeeds, BigML.io will return an HTTP 202 response with the updated composite.
For example, to update a composite with a new name you can use curl like this:
curl "https://au.bigml.io/composite/59af8107b8aa0965d5b61138?$BIGML_AUTH" \
-X PUT \
-d '{"name": "a new name"}' \
-H 'content-type: application/json'
$ Updating a composite's name
Deleting a Composite
To delete a composite, you need to issue an HTTP DELETE request to the composite/id to be deleted.
Using curl you can do something like this to delete a composite:
curl -X DELETE "https://au.bigml.io/composite/59af8107b8aa0965d5b61138?$BIGML_AUTH"
$ Deleting a composite from the command line
If the request succeeds you will not see anything on the command line unless you executed the command in verbose mode. Successful DELETEs will return "204 no content" responses with no body.
Once you delete a composite, it is permanently deleted. That is, a delete request cannot be undone. If you try to delete a composite a second time, or a composite that does not exist, you will receive a "404 not found" response.
However, if you try to delete a composite that is being used at the moment, then BigML.io will not accept the request and will respond with a "400 bad request" response.
See this section for more details.
Listing Composites
To list all the composites, you can use the composite base URL. By default, only the 20 most recent composites will be returned. You can see below how to change this number using the limit parameter.
You can get your list of composites directly in your browser using your own username and API key with the following links.
https://au.bigml.io/composite?$BIGML_AUTH
> Listing composites from a browser
Models
Last Updated: Tuesday, 2019-01-29 16:28
A model is a tree-like representation of your dataset with predictive power. You can create a model selecting which fields from your dataset you want to use as input fields (or predictors) and which field you want to predict, the objective field.
Each node in the model corresponds to one of the input fields. Every node has an incoming branch except the top node, also known as the root, which has none. Every node has a number of outgoing branches except those at the bottom (the "leaves"), which have none.
Each branch represents a possible value for the input field where it originates. A leaf represents the value of the objective field given all the values for each input field in the chain of branches that goes from the root to that leaf.
When you create a new model, BigML.io will automatically compute a classification model or regression model depending on whether the objective field that you want to predict is categorical or numeric, respectively.
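The traversal described above can be pictured as a walk over nested nodes, each child branch carrying a predicate on one input field; the prediction is the value at the leaf you reach from the root. A toy traversal, purely illustrative (this is not BigML's actual model format):

```python
def predict(node, inputs):
    """Walk from the root, descending into the first child whose predicate holds."""
    while node.get("children"):
        for child in node["children"]:
            field, threshold = child["field"], child["threshold"]
            if child["op"] == ">":
                matches = inputs[field] > threshold
            else:  # "<="
                matches = inputs[field] <= threshold
            if matches:
                node = child
                break
        else:
            break  # no branch matches; stop at the current node
    return node["output"]

# Toy regression tree splitting on a single numeric field
tree = {
    "output": 10,
    "children": [
        {"field": "age", "op": ">", "threshold": 30, "output": 25, "children": []},
        {"field": "age", "op": "<=", "threshold": 30, "output": 5, "children": []},
    ],
}
predict(tree, {"age": 42})  # 25
```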
BigML.io allows you to create, retrieve, update, and delete your models. You can also list all of your models.
Jump to:
- Model Base URL
- Creating a Model
- Model Arguments
- Shuffling the Rows of Your Dataset
- Sampling Your Dataset
- Random Decision Forests
- Retrieving a Model
- Model Properties
- Filtering a Model
- PMML
- Filtering and Paginating Fields from a Model
- Updating a Model
- Deleting a Model
- Listing Models
- Weights
- Weight Field
- Objective Weights
- Automatic Balancing
Model Base URL
You can use the following base URL to create, retrieve, update, and delete models. https://au.bigml.io/model
Model base URL
All requests to manage your models must use HTTPS and be authenticated using your username and API key to verify your identity. See this section for more details.
Creating a Model
To create a new model, you need to POST to the model base URL an object containing at least the dataset/id that you want to use to create the model. The content-type must always be "application/json".
POST /model?username=alfred;api_key=79138a622755a2383660347f895444b1eb927730; HTTP/1.1
Host: bigml.io
Content-Type: application/json
Creating model definition
curl "https://au.bigml.io/model?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"dataset": "dataset/4f66a80803ce8940c5000006"}'
> Creating a model
BigML.io will return the newly created model if the request succeeded.
{
"category": 0,
"code": 201,
"columns": 1,
"created": "2012-11-15T02:32:48.763534",
"credits": 0.017578125,
"credits_per_prediction": 0.0,
"dataset": "dataset/50a453753c1920186d000045",
"dataset_status": true,
"description": "",
"excluded_fields": [],
"fields_meta": {
"count": 0,
"limit": 200,
"offset": 0,
"total": 0
},
"input_fields": [],
"locale": "en-US",
"max_columns": 5,
"max_rows": 150,
"missing_splits": false,
"name": "iris' dataset model",
"number_of_evaluations": 0,
"number_of_predictions": 0,
"number_of_public_predictions": 0,
"objective_field": null,
"objective_fields": [],
"ordering": 0,
"out_of_bag": false,
"price": 0.0,
"private": true,
"project": null,
"randomize": false,
"range": [
1,
150
],
"replacement": false,
"resource": "model/50a454503c1920186d000049",
"rows": 150,
"sample_rate": 1.0,
"size": 4608,
"source": "source/50a4527b3c1920186d000041",
"source_status": true,
"status": {
"code": 1,
"message": "The model is being processed and will be created soon"
},
"tags": [
"species"
],
"updated": "2012-11-15T02:32:48.763566",
"views": 0,
"white_box": false
}
< Example model JSON response
Model Arguments
In addition to the dataset, you can also POST the following arguments.
| Argument | Type | Description |
|---|---|---|
|
category
optional |
Integer, default is the category of the dataset |
The category that best describes the model. See the category codes for the complete list of categories.
Example: 1 |
| dataset | String |
A valid dataset/id.
Example: dataset/4f66a80803ce8940c5000006 |
|
deep
optional |
Boolean, default is false |
Clone the dataset used to build the model when cloning the model, if the dataset is available. Must be used along with the origin or shared_hash option.
Example: true |
|
depth_threshold
optional |
Integer, default is 512 |
When the depth in the tree exceeds this value, the tree stops growing. It has no effect if it's bigger than the node_threshold.
Example: 128 |
|
description
optional |
String |
A description of the model up to 8192 characters long.
Example: "This is a description of my new model" |
|
excluded_fields
optional |
Array, default is [], an empty list. None of the fields in the dataset is excluded. |
Specifies the fields that won't be included in the model.
Example:
|
|
fields
optional |
Object, default is {}, an empty dictionary. That is, no names or preferred statuses are changed. |
This can be used to change the names of the fields in the model with respect to the original names in the dataset, or to tell BigML that certain fields should be preferred. Include an entry keyed by the field id (as generated in the source) for each field whose name you want to update.
Example:
|
|
input_fields
optional |
Array, default is []. All the fields in the dataset |
Specifies the fields to be considered to create the model.
Example:
|
|
max_training_time
optional |
Integer, default is 1800 |
The maximum training time allowed for the optimization, in seconds, as a strictly positive integer. Applicable only when optimize is set to true.
Example: 3600 |
|
missing_splits
optional |
Boolean, default is false |
Defines whether to explicitly include missing field values when choosing a split. When this option is enabled, the model generates predicates whose operators include an asterisk, such as >*, <=*, =*, or !=*. The asterisk means "or missing", so a split with the operator >* and the value 8 reads as "x > 8 or x is missing". When using missing_splits there may also be predicates with operators = or != but with a null value. These mean "x is missing" and "x is not missing" respectively.
Example: true |
|
name
optional |
String, default is dataset's name |
The name you want to give to the new model.
Example: "my new model" |
|
node_threshold
optional |
Integer, default is 512 |
When the number of nodes in the tree exceeds this value, the tree stops growing.
Example: 1000 |
|
number_of_model_candidates
optional |
Integer, default is 128 |
The number of model candidates evaluated over the course of the optimization. Applicable only when optimize is set to true. Maximum 200 candidates.
Example: 100 |
|
objective_field
optional |
String, default is dataset's pre-defined objective field |
Specifies the id of the field that you want to predict.
Example: "000003" |
|
objective_fields
optional |
Array, default is an array with the id of the last field in the dataset |
Specifies the id of the field that you want to predict. Even though this is an array, BigML.io only accepts one objective field in the current version. If both objective_field and objective_fields are specified, objective_field takes precedence.
Example: ["000003"] |
|
optimize
optional |
Boolean, default is false |
Whether the model should be built with the automatic optimization. When it is set to true, only the following modeling properties are applied: input_fields, excluded_fields, default_numeric_value, missing_numerics, sample_rate, max_training_time, objective_weights, weight_field, and number_of_model_candidates
Example: true |
|
ordering
optional |
Integer, default is 0 (deterministic). |
Specifies the type of ordering followed to build the model. There are three different types that you can specify:
Example: 1 |
| origin | String |
The model/id of the gallery model to be cloned. The price of the model must be 0 to be cloned via API. Set deep to true to clone the dataset used to build the model too.
Example: "model/5b9ab8474e172785e3000003" |
|
out_of_bag
optional |
Boolean, default is false |
Setting this parameter to true will return a sequence of the out-of-bag instances instead of the sampled instances. See the Section on Sampling for more details.
Example: true |
|
project
optional |
String |
The project/id you want the model to belong to.
Example: "project/54d98718f0a5ea0b16000000" |
|
random_candidate_ratio
optional |
Float |
A real number between 0 and 1. When randomize is true and random_candidate_ratio is given, BigML randomizes the tree and uses random_candidate_ratio * total fields (counting the number of terms in text fields as fields). To get the final number of candidate fields we round down to the nearest integer, but if the result is 0 we'll use 1 instead. If both random_candidates and random_candidate_ratio are given, BigML ignores random_candidate_ratio.
Example: 0.2 |
|
random_candidates
optional |
Integer, default is the square root of the total number of input fields. |
Sets the number of random fields considered when randomize is true.
Example: 10 |
|
randomize
optional |
Boolean, default is false |
Setting this parameter to true will consider only a subset of the possible fields when choosing a split. See the Section on Random Decision Forests below.
Example: true |
|
range
optional |
Array, default is [1, max rows in the dataset] |
The range of successive instances to build the model.
Example: [1, 150] |
|
replacement
optional |
Boolean, default is false |
Whether sampling should be performed with or without replacement. See the Section on Sampling for more details.
Example: true |
|
sample_rate
optional |
Float, default is 1.0 |
A real number between 0 and 1 specifying the sample rate. See the Section on Sampling for more details.
Example: 0.5 |
|
seed
optional |
String |
A string to be hashed to generate deterministic samples. See the Section on Sampling for more details.
Example: "MySample" |
| shared_hash | String |
The shared hash of the shared model to be cloned. The price of the model must be 0 to be cloned via API.
Example: "kpY46mNuNVReITw0Z1mAqoQ9ySW" |
|
split_candidates
optional |
Integer, default is 32 |
The number of split points that are considered whenever the tree evaluates a numeric field. Minimum 1 and maximum 1024.
Example: 128 |
|
stat_pruning
optional |
Boolean |
Activates statistical pruning on your decision tree model.
Example: true |
|
support_threshold
optional |
Float, default is 0 |
Controls the minimum amount of support each child node must contain to be valid as a possible split. For example, if it is 3, both children of a new split must have at least 3 instances supporting them. Since instances may have non-integer weights, non-integer values are valid.
Example: 16 |
|
tags
optional |
Array of Strings |
A list of strings that help classify and index your model.
Example: ["best customers", "2018"] |
|
webhook
optional |
Object |
A webhook url and an optional secret phrase. See the Section on Webhooks for more details.
Example:
|
You can also use curl to customize a new model. For example, to create a new model named "my model", with only certain rows, and with only three fields:
curl "https://au.bigml.io/model?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"dataset": "dataset/4f66a80803ce8940c5000006",
"input_fields": ["000001", "000003"],
"name": "my model",
"range": [25, 125]}'
> Creating a customized model
If you do not specify a name, BigML.io will give the new model the dataset's name. If you do not specify a range of instances, BigML.io will use all the instances in the dataset. If you do not specify any input fields, BigML.io will include all the input fields in the dataset, and if you do not specify an objective field, BigML.io will use the last field in your dataset.
Shuffling the Rows of Your Dataset
By default, rows from the input dataset are deterministically shuffled before being processed, to avoid inaccurate models caused by ordered fields in the input rows. Since the shuffling is deterministic, i.e., always the same for a given dataset, retraining a model for the same dataset will always yield the same result.
However, you can modify this default behaviour by including the ordering argument in the model creation request, where "ordering" here is a shortcut for "ordering for the traversal of input rows". When this property is absent or set to 0, deterministic shuffling takes place; otherwise, you can set it to:
- Linear: If you know that your input is already in random order. Setting "ordering" to 1 in your model request tells BigML to traverse the dataset in a linear fashion, without performing any shuffling (and therefore operating faster).
- Random: If you'd like to perform a really random shuffling, most probably different from any other one attempted before. Setting "ordering" to 2 will shuffle the input rows non-deterministically.
Sampling Your Dataset
You can limit the dataset rows used to create a model in two ways, which can be combined: by specifying a row range and by asking for a sample of the (already clipped) input rows.
The row range is specified with the range argument defined in the Section on Arguments above.
To specify a sample, which is taken over the row range or over the whole dataset if a range is not provided, you can add the following arguments to the creation request:
- sample_rate : A positive number that specifies the sampling rate, i.e., how often we pick a row from the range. In other words, the final number of rows will be the size of the range multiplied by the sample_rate, unless "out_of_bag" is true (see below).
- replacement : A boolean indicating whether sampling should be performed with or without replacement, i.e., the same instance may be selected multiple times for inclusion in the result set. Defaults to false.
- out_of_bag : If an instance isn't selected as part of a sampling, it's called out of bag. Setting this parameter to true will return a sequence of the out-of-bag instances instead of the sampled instances. This can be useful when paired with "seed". When replacement is false, the final number of rows returned is the size of the range multiplied by one minus the sample_rate. Out-of-bag sampling with replacement gives rise to variable-size samples. Defaults to false.
- seed : Rows are sampled probabilistically using a random string, which means that, in general, two identical samples of the same row range of the same dataset will be different. If you provide a seed (as an arbitrary string), its hash value will be used as the seed, and it'll be possible for you to generate deterministic samples.
Finally, note that the "ordering" of the dataset described in the previous subsection is used on the result of the sampling.
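The row-count arithmetic described above can be sketched as follows. This is a hypothetical helper for reasoning about the arguments, not part of the BigML API:

```python
def expected_rows(range_size, sample_rate, out_of_bag=False, replacement=False):
    """Approximate number of rows a sampling specification will yield.

    Illustrates the sample_rate / out_of_bag arithmetic described above;
    this helper is not part of the BigML API.
    """
    if out_of_bag and replacement:
        # With replacement, the out-of-bag sample has a variable size.
        raise ValueError("out-of-bag size is not fixed when sampling with replacement")
    if out_of_bag:
        # Out-of-bag without replacement returns the rows NOT sampled:
        # range size multiplied by (1 - sample_rate).
        return round(range_size * (1 - sample_rate))
    # Plain sampling: range size multiplied by the sample rate.
    return round(range_size * sample_rate)

print(expected_rows(5000, 0.5))                    # ~2500 sampled rows
print(expected_rows(5000, 0.5, out_of_bag=True))   # ~2500 out-of-bag rows
```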
Here's an example of a model request with range and sampling specifications:
curl https://au.bigml.io/model?$BIGML_AUTH \
-X POST \
-H 'content-type: application/json' \
-d '{"dataset": "dataset/505f43223c1920eccc000297", "range": [1, 5000], "sample_rate": 0.5, "replacement": true}'
Creating a model using sampling
Random Decision Forests
A model can be randomized by setting the randomize parameter to true. The default is false.
When randomized, the model considers only a subset of the possible fields when choosing a split. The size of the subset will be the square root of the total number of input fields. So if there are 100 input fields, each split will only consider 10 fields randomly chosen from the 100. Every split will choose a new subset of fields.
Although randomize could be used for other purposes, it's intended for growing random decision forests. To grow tree models for a random forest, set randomize to true and select a sample from the dataset. Traditionally this is a 1.0 sample rate with replacement, but we suggest a 0.63 sample rate without replacement.
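The number of fields considered at each split follows the rules given for the random_candidates and random_candidate_ratio arguments above. A small sketch of that logic (a hypothetical helper, not part of the API):

```python
import math

def candidate_fields(total_fields, random_candidates=None,
                     random_candidate_ratio=None):
    """Fields considered per split when randomize is true.

    Mirrors the rules described for the random_candidates and
    random_candidate_ratio model arguments; not part of the BigML API.
    """
    if random_candidates is not None:
        # An explicit random_candidates overrides the ratio.
        return random_candidates
    if random_candidate_ratio is not None:
        # Round down to the nearest integer, but never below 1.
        return max(1, math.floor(total_fields * random_candidate_ratio))
    # Default: the square root of the total number of input fields.
    return round(math.sqrt(total_fields))

print(candidate_fields(100))  # 10 fields per split for 100 input fields
```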
Retrieving a Model
Each model has a unique identifier in the form "model/id" where id is a string of 24 alpha-numeric characters that you can use to retrieve the model.
To retrieve a model with curl:
curl "https://au.bigml.io/model/50a454503c1920186d000049?$BIGML_AUTH"
$ Retrieving a model from the command line
You can also use your browser to visualize the model using the full BigML.io URL or pasting the model/id into the BigML.com.au dashboard.
Model Properties
Once a model has been successfully created it will have the following properties.
| Property | Type | Description |
|---|---|---|
|
boosted_ensemble
filterable, sortable |
Boolean | Whether the model was built as part of an ensemble with boosted trees. |
| boosting | Object |
Boosting attribute for the boosted tree. See the Gradient Boosting section for more information.
Example:
|
|
category
filterable, sortable, updatable |
Integer | One of the categories in the table of categories that help classify this resource according to the domain of application. |
| code | Integer | HTTP status code. This will be 201 upon successful creation of the model and 200 afterwards. Make sure that you check the code that comes with the status attribute to make sure that the model creation has been completed without errors. |
|
columns
filterable, sortable |
Integer | The number of fields in the model. |
|
composites
filterable, sortable |
Array of Strings | The list of composite ids that reference this model. |
|
created
filterable, sortable |
ISO-8601 Datetime. | This is the date and time in which the model was created with microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are provided in Coordinated Universal Time (UTC). |
|
credits
filterable, sortable |
Float | The number of credits it cost you to create this model. |
|
credits_per_prediction
filterable, sortable, updatable |
Float | This is the number of credits that other users will consume to make a prediction with your model if you made it public. |
|
dataset
filterable, sortable |
String | The dataset/id that was used to build the model. |
| dataset_field_types | Object | A dictionary that reports the number of fields of each type in the dataset used to create the model. It has an entry per field type (categorical, datetime, numeric, and text), an entry for preferred fields, and an entry for the total number of fields. |
|
dataset_status
filterable, sortable |
Boolean | Whether the dataset is still available or has been deleted. |
| deep | Boolean | Whether the dataset used to build the original model is also requested to be cloned or not. |
|
description
updatable |
String | A text describing the model. It can contain restricted markdown to decorate the text. |
|
ensemble
filterable, sortable |
Boolean | Whether the model was built as part of an ensemble or not. |
|
ensemble_id
filterable, sortable |
String | The ensemble id. |
|
ensemble_index
filterable, sortable |
Integer | The order of the model within the ensemble. |
| excluded_fields | Array | The list of field ids that were excluded when building the model. |
| fields_meta | Object | A dictionary with meta information about the fields dictionary. It specifies the total number of fields, the current offset, and limit, and the number of fields (count) returned. |
|
fusions
filterable, sortable |
Array of Strings | The list of fusion ids that reference this model. |
| input_fields | Array | The list of input field ids used to build the model. |
| locale | String | The dataset's locale. |
|
max_columns
filterable, sortable |
Integer | The total number of fields in the dataset used to build the model. |
|
max_rows
filterable, sortable |
Integer | The maximum number of instances in the dataset that can be used to build the model. |
| max_training_time | Integer | The maximum training time allowed for the optimization, in seconds. |
|
missing_splits
filterable, sortable |
Boolean | Whether to explicitly include missing field values when choosing a split while growing a model. |
| model | Object | All the information that you need to recreate or use the model on your own. It includes a very intuitive description of the tree-like structure that makes up the model and the fields dictionary describing the fields and their summaries. |
|
name
filterable, sortable, updatable |
String | The name of the model, as provided by you or taken from the dataset's name by default. |
|
node_threshold
filterable, sortable |
Integer | The maximum number of nodes that the model will grow. |
|
number_of_batchpredictions
filterable, sortable |
Integer | The current number of batch predictions that use this model. |
|
number_of_evaluations
filterable, sortable |
Integer | The current number of evaluations that use this model. |
| number_of_model_candidates | Integer | The number of model candidates evaluated over the course of the optimization. |
|
number_of_predictions
filterable, sortable |
Integer | The current number of predictions that use this model. |
|
number_of_public_predictions
filterable, sortable |
Integer | The current number of public predictions that use this model. |
| objective_field | String | The id of the field that the model predicts. |
| objective_fields | Array | Specifies the list of ids of the field that the model predicts. Even if this is an array BigML.io only accepts one objective field in the current version. |
| optimize | Boolean | Whether the model was built with the automatic optimization. |
|
optiml
filterable, sortable |
String | The optiml/id that created this model. |
|
optiml_status
filterable, sortable |
Boolean | Whether the OptiML is still available or has been deleted. |
|
ordering
filterable, sortable |
Integer |
The ordering used to choose instances from the dataset to build the model. There are three different types:
|
|
origin
filterable, sortable |
String | The model/id of the original gallery model. |
|
out_of_bag
filterable, sortable |
Boolean | Whether the out-of-bag instances were used to create the model instead of the sampled instances. |
|
price
filterable, sortable, updatable |
Float | The price other users must pay to clone your model. |
|
private
filterable, sortable, updatable |
Boolean | Whether the model is public or not. |
|
project
filterable, sortable, updatable |
String | The project/id the resource belongs to. |
|
random_candidate_ratio
filterable, sortable |
Float | The random candidate ratio considered when randomize is true. |
|
random_candidates
filterable, sortable |
Integer | The number of random fields considered when randomize is true. |
|
randomize
filterable, sortable |
Boolean | Whether the model splits considered only a random subset of the fields or all the fields available. |
| range | Array | The range of instances used to build the model. |
|
replacement
filterable, sortable |
Boolean | Whether the instances sampled to build the model were selected using replacement or not. |
| resource | String | The model/id. |
|
rows
filterable, sortable |
Integer | The total number of instances used to build the model. |
|
sample_rate
filterable, sortable |
Float | The sample rate used to select instances from the dataset to build the model. |
|
seed
filterable, sortable |
String | The string that was used to generate the sample. |
|
selective_pruning
filterable, sortable |
Boolean | If true, selective pruning throttled the strength of the statistical pruning depending on the size of the dataset. |
|
shared
filterable, sortable, updatable |
Boolean | Whether the model is shared using a private link or not. |
|
shared_clonable
filterable, sortable, updatable |
Boolean | Whether the shared model can be cloned or not. |
| shared_hash | String | The hash that gives access to this model if it has been shared using a private link. |
| sharing_key | String | The alternative key that gives read access to this model. |
|
size
filterable, sortable |
Integer | The number of bytes of the dataset that were used to create this model. |
|
source
filterable, sortable |
String | The source/id that was used to build the dataset. |
|
source_status
filterable, sortable |
Boolean | Whether the source is still available or has been deleted. |
|
split_candidates
filterable, sortable |
Integer | The number of split points that are considered whenever the tree evaluates a numeric field. Minimum 1 and maximum 1024. |
|
stat_pruning
filterable, sortable |
Boolean | Whether statistical pruning was used when building the model. |
| status | Object | A description of the status of the model. It includes a code, a message, and some extra information. See the table below. |
|
subscription
filterable, sortable |
Boolean | Whether the model was created using a subscription plan or not. |
|
support_threshold
filterable, sortable |
Float | The minimum amount of support each child node must contain to be valid as a possible split. |
|
tags
filterable, updatable |
Array of Strings | A list of user tags that can help classify and index this resource. |
|
updated
filterable, sortable |
ISO-8601 Datetime. | This is the date and time in which the model was updated with microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are provided in Coordinated Universal Time (UTC). |
| webhook | Object | A webhook url and an optional secret phrase. See the Section on Webhooks for more details. |
|
white_box
filterable, sortable |
Boolean | Whether the model is publicly shared as a white-box. |
A Model Object has the following properties:
| Property | Type | Description |
|---|---|---|
| depth_threshold | Integer | The depth, or generation, limit for a tree. |
| distribution | Object | This dictionary gives information about how the training data is distributed across the tree leaves. More concretely, it contains the training data distribution with key training, and the distribution for the actual prediction values of the tree with key predictions. The former is just the objective_summary of the tree root (see below), copied for easier individual retrieval, and both have the format of the objective summary in the tree nodes. |
| fields | Object | A dictionary with an entry per field in the dataset used to build the model. Fields are paginated according to the fields_meta attribute. Each entry includes the column number in the original source, the name of the field, the type of the field, and the summary. See this Section for more details. |
| importance | Array of Arrays | A list of pairs [field_id, importance]. Importance is the amount by which each field in the model reduces prediction error, normalized to be between zero and one. Note that fields with an importance of zero may still be correlated with the objective; they were just not used in the model. |
| kind | String | The type of model. Currently, only stree is supported. |
| missing_strategy | String | Default strategy followed by the model when it finds a missing value. Currently, last_prediction. At prediction time you can opt for using proportional. See this Section for more details. |
| model_fields | Object | A dictionary with an entry per field used by the model (not all the fields that were available in the dataset). They follow the same structure as the fields attribute above except that the summary is not present. |
| root | Object | A Node Object, a tree-like recursive structure representing the model. |
| split_criterion | Integer | Method of choosing best attribute and split point for a given node. DEPRECATED |
| support_threshold | Float | A number between 0 and 1. For a split to be valid, each child's support (instances / total instances) must be greater than this threshold. |
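The importance attribute can be combined with model_fields to rank fields by name on the client side. A minimal sketch, with ids, names, and values abridged from the example model JSON later in this section:

```python
# Rank the model's fields by importance. The data below is abridged
# from the example model response shown later in this section.
model = {
    "importance": [
        ["000002", 0.53159],
        ["000003", 0.4633],
        ["000000", 0.00511],
        ["000001", 0],
    ],
    "model_fields": {
        "000000": {"name": "sepal length"},
        "000001": {"name": "sepal width"},
        "000002": {"name": "petal length"},
        "000003": {"name": "petal width"},
    },
}

# Map field ids to their human-readable names, then sort by importance.
names = {fid: field["name"] for fid, field in model["model_fields"].items()}
ranked = sorted(((names[fid], imp) for fid, imp in model["importance"]),
                key=lambda pair: pair[1], reverse=True)
print(ranked[0])  # ('petal length', 0.53159)
```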
Node Objects have the following properties:
| Property | Type | Description |
|---|---|---|
| children | Array | Array of Node Objects. |
| confidence | Float | For classification models, a number between 0 and 1 that expresses how certain the model is of the prediction. For regression models, a number mapped to the top end of a 95% confidence interval around the expected error at that node (measured using the variance of the output at the node). See the Section on Confidence for more details. Note that for models created using the first versions of BigML this value might be null. |
| count | Integer | Number of instances classified by this node. |
| objective_summary | Object | An Objective Summary Object summarizes the objective field's distribution at this node. |
| output | Number or String | Prediction at this node. |
| predicate | Boolean or Object | Predicate structure to make a decision at this node. |
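Descending this recursive structure is how a prediction is made locally: starting at root, follow the child whose predicate matches the input row, and return the current node's output when none does. A minimal sketch handling only the numeric <= and > operators (real models also use =, !=, in, and the "or missing" * variants), using two branches abridged from the example tree in this section:

```python
def predict(node, row):
    """Walk the tree, following the child whose predicate matches the row.

    Implements the last_prediction missing strategy in its simplest form:
    if no child predicate matches (e.g. the field is missing), return the
    current node's output. Only <= and > predicates are handled here.
    """
    ops = {"<=": lambda a, b: a <= b, ">": lambda a, b: a > b}
    for child in node.get("children", []):
        pred = child["predicate"]
        value = row.get(pred["field"])
        if value is not None and ops[pred["operator"]](value, pred["value"]):
            return predict(child, row)
    return node["output"]

# Abridged from the example model: petal length (000002) <= 2.45 is setosa.
root = {
    "output": "Iris-virginica",
    "children": [
        {"output": "Iris-setosa", "children": [],
         "predicate": {"field": "000002", "operator": "<=", "value": 2.45}},
    ],
}
print(predict(root, {"000002": 1.4}))  # Iris-setosa
print(predict(root, {"000002": 5.0}))  # Iris-virginica
```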
Objective Summary Objects have the following properties:
| Property | Type | Description |
|---|---|---|
| bins | Array | If the objective field is numeric and the number of distinct values is greater than 32. An array that represents an approximate histogram of the distribution. It consists of value pairs, where the first value is the mean of a histogram bin and the second value is the bin population. For more information, see our blog post or read this paper. |
| categories | Array | If the objective field is categorical, an array of pairs where the first element of each pair is one of the unique categories and the second element is the count for that category. |
| counts | Array | If the objective field is numeric and the number of distinct values is less than or equal to 32, an array of pairs where the first element of each pair is one of the unique values found in the field and the second element is the count. |
| maximum | Number | The maximum of the objective field's values. Available when 'bins' is present. |
| minimum | Number | The minimum of the objective field's values. Available when 'bins' is present. |
Predicate Objects have the following properties:
Model Status
Creating a model is a process that can take just a few seconds or a few days depending on the size of the dataset used as input and on the workload of BigML's systems. The model goes through a number of states until it is fully completed. Through the status field in the model you can determine when the model has been fully processed and is ready to be used to create predictions. These are the properties of a model's status:
| Property | Type | Description |
|---|---|---|
| code | Integer | A status code that reflects the status of the model creation. It can be any of those that are explained here. |
| elapsed | Integer | Number of milliseconds that BigML.io took to process the model. |
| message | String | A human readable message explaining the status. |
| progress | Float, between 0 and 1 | How far BigML.io has progressed building the model. |
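A client typically polls the model resource until the status code reaches a terminal value. A sketch of that loop, assuming the FINISHED (5) and FAULTY (-1) codes listed in the Status Codes section; the fetch parameter is injectable so the loop can be exercised without network access:

```python
import json
import time
from urllib.request import urlopen

FINISHED, FAULTY = 5, -1  # terminal codes, per the Status Codes section


def wait_for_model(url, fetch=lambda u: json.load(urlopen(u)), interval=2):
    """Poll a model URL (including the BIGML_AUTH query string) until its
    status code is FINISHED, raising if the model turns out FAULTY.

    A minimal sketch; a production client would also bound the number of
    retries and back off between requests.
    """
    while True:
        model = fetch(url)
        code = model["status"]["code"]
        if code == FINISHED:
            return model
        if code == FAULTY:
            raise RuntimeError(model["status"]["message"])
        time.sleep(interval)
```

For example, `wait_for_model("https://au.bigml.io/model/50a454503c1920186d000049?" + auth)` would block until the model above is ready to make predictions.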
Once a model has been successfully created, it will look like:
{
"category": 0,
"code": 200,
"columns": 5,
"created": "2012-11-15T02:32:48.763000",
"credits": 0.017578125,
"credits_per_prediction": 0.0,
"dataset": "dataset/50a453753c1920186d000045",
"dataset_status": true,
"description": "",
"excluded_fields": [],
"fields_meta": {
"count": 5,
"limit": 200,
"offset": 0,
"total": 5
},
"input_fields": [
"000000",
"000001",
"000002",
"000003"
],
"locale": "en_US",
"max_columns": 5,
"max_rows": 150,
"missing_splits": false,
"model": {
"depth_threshold": 20,
"fields": {
"000000": {
"column_number": 0,
"datatype": "double",
"name": "sepal length",
"optype": "numeric",
"order": 0,
"preferred": true,
"summary": {
"bins": [
[
4.3,
1
],
[
4.425,
4
],
[
4.6,
4
],
[
4.7,
2
],
[
4.8,
5
],
[
4.9,
6
],
[
5,
10
],
[
5.1,
9
],
[
5.2,
4
],
[
5.3,
1
],
[
5.4,
6
],
[
5.5,
7
],
[
5.6,
6
],
[
5.7,
8
],
[
5.8,
7
],
[
5.9,
3
],
[
6,
6
],
[
6.1,
6
],
[
6.2,
4
],
[
6.3,
9
],
[
6.44167,
12
],
[
6.6,
2
],
[
6.7,
8
],
[
6.8,
3
],
[
6.92,
5
],
[
7.1,
1
],
[
7.2,
3
],
[
7.3,
1
],
[
7.4,
1
],
[
7.6,
1
],
[
7.7,
4
],
[
7.9,
1
]
],
"maximum": 7.9,
"mean": 5.84333,
"median": 5.77889,
"minimum": 4.3,
"missing_count": 0,
"population": 150,
"splits": [
4.51526,
4.67252,
4.81113,
4.89582,
4.96139,
5.01131,
5.05992,
5.11148,
5.18177,
5.35681,
5.44129,
5.5108,
5.58255,
5.65532,
5.71658,
5.77889,
5.85381,
5.97078,
6.05104,
6.13074,
6.23023,
6.29578,
6.35078,
6.41459,
6.49383,
6.63013,
6.70719,
6.79218,
6.92597,
7.20423,
7.64746
],
"standard_deviation": 0.82807,
"sum": 876.5,
"sum_squares": 5223.85,
"variance": 0.68569
}
},
"000001": {
"column_number": 1,
"datatype": "double",
"name": "sepal width",
"optype": "numeric",
"order": 1,
"preferred": true,
"summary": {
"counts": [
[
2,
1
],
[
2.2,
3
],
[
2.3,
4
],
[
2.4,
3
],
[
2.5,
8
],
[
2.6,
5
],
[
2.7,
9
],
[
2.8,
14
],
[
2.9,
10
],
[
3,
26
],
[
3.1,
11
],
[
3.2,
13
],
[
3.3,
6
],
[
3.4,
12
],
[
3.5,
6
],
[
3.6,
4
],
[
3.7,
3
],
[
3.8,
6
],
[
3.9,
2
],
[
4,
1
],
[
4.1,
1
],
[
4.2,
1
],
[
4.4,
1
]
],
"maximum": 4.4,
"mean": 3.05733,
"median": 3.02044,
"minimum": 2,
"missing_count": 0,
"population": 150,
"standard_deviation": 0.43587,
"sum": 458.6,
"sum_squares": 1430.4,
"variance": 0.18998
}
},
"000002": {
"column_number": 2,
"datatype": "double",
"name": "petal length",
"optype": "numeric",
"order": 2,
"preferred": true,
"summary": {
"bins": [
[
1,
1
],
[
1.1,
1
],
[
1.2,
2
],
[
1.3,
7
],
[
1.4,
13
],
[
1.5,
13
],
[
1.63636,
11
],
[
1.9,
2
],
[
3,
1
],
[
3.3,
2
],
[
3.5,
2
],
[
3.6,
1
],
[
3.75,
2
],
[
3.9,
3
],
[
4.0375,
8
],
[
4.23333,
6
],
[
4.46667,
12
],
[
4.6,
3
],
[
4.74444,
9
],
[
4.94444,
9
],
[
5.1,
8
],
[
5.25,
4
],
[
5.46,
5
],
[
5.6,
6
],
[
5.75,
6
],
[
5.95,
4
],
[
6.1,
3
],
[
6.3,
1
],
[
6.4,
1
],
[
6.6,
1
],
[
6.7,
2
],
[
6.9,
1
]
],
"maximum": 6.9,
"mean": 3.758,
"median": 4.34142,
"minimum": 1,
"missing_count": 0,
"population": 150,
"splits": [
1.25138,
1.32426,
1.37171,
1.40962,
1.44567,
1.48173,
1.51859,
1.56301,
1.6255,
1.74645,
3.23033,
3.675,
3.94203,
4.0469,
4.18243,
4.34142,
4.45309,
4.51823,
4.61771,
4.72566,
4.83445,
4.93363,
5.03807,
5.1064,
5.20938,
5.43979,
5.5744,
5.6646,
5.81496,
6.02913,
6.38125
],
"standard_deviation": 1.7653,
"sum": 563.7,
"sum_squares": 2582.71,
"variance": 3.11628
}
},
"000003": {
"column_number": 3,
"datatype": "double",
"name": "petal width",
"optype": "numeric",
"order": 3,
"preferred": true,
"summary": {
"counts": [
[
0.1,
5
],
[
0.2,
29
],
[
0.3,
7
],
[
0.4,
7
],
[
0.5,
1
],
[
0.6,
1
],
[
1,
7
],
[
1.1,
3
],
[
1.2,
5
],
[
1.3,
13
],
[
1.4,
8
],
[
1.5,
12
],
[
1.6,
4
],
[
1.7,
2
],
[
1.8,
12
],
[
1.9,
5
],
[
2,
6
],
[
2.1,
6
],
[
2.2,
3
],
[
2.3,
8
],
[
2.4,
3
],
[
2.5,
3
]
],
"maximum": 2.5,
"mean": 1.19933,
"median": 1.32848,
"minimum": 0.1,
"missing_count": 0,
"population": 150,
"standard_deviation": 0.76224,
"sum": 179.9,
"sum_squares": 302.33,
"variance": 0.58101
}
},
"000004": {
"column_number": 4,
"datatype": "string",
"name": "species",
"optype": "categorical",
"order": 4,
"preferred": true,
"summary": {
"categories": [
[
"Iris-versicolor",
50
],
[
"Iris-setosa",
50
],
[
"Iris-virginica",
50
]
],
"missing_count": 0
}
}
},
"importance": [
[
"000002",
0.53159
],
[
"000003",
0.4633
],
[
"000000",
0.00511
],
[
"000001",
0
]
],
"kind": "stree",
"missing_strategy": "last_prediction",
"model_fields": {
"000000": {
"column_number": 0,
"datatype": "double",
"name": "sepal length",
"optype": "numeric",
"order": 0,
"preferred": true
},
"000001": {
"column_number": 1,
"datatype": "double",
"name": "sepal width",
"optype": "numeric",
"order": 1,
"preferred": true
},
"000002": {
"column_number": 2,
"datatype": "double",
"name": "petal length",
"optype": "numeric",
"order": 2,
"preferred": true
},
"000003": {
"column_number": 3,
"datatype": "double",
"name": "petal width",
"optype": "numeric",
"order": 3,
"preferred": true
},
"000004": {
"column_number": 4,
"datatype": "string",
"name": "species",
"optype": "categorical",
"order": 4,
"preferred": true
}
},
"root": {
"children": [
{
"confidence": 0.92865,
"count": 50,
"objective_summary": {
"categories": [
[
"Iris-setosa",
50
]
]
},
"output": "Iris-setosa",
"predicate": {
"field": "000002",
"operator": "<=",
"value": 2.45
}
},
{
"children": [
{
"children": [
{
"children": [
{
"children": [
{
"children": [
{
"confidence": 0.34237,
"count": 2,
"objective_summary": {
"categories": [
[
"Iris-virginica",
2
]
]
},
"output": "Iris-virginica",
"predicate": {
"field": "000000",
"operator": ">",
"value": 5.95
}
},
{
"confidence": 0.20654,
"count": 1,
"objective_summary": {
"categories": [
[
"Iris-versicolor",
1
]
]
},
"output": "Iris-versicolor",
"predicate": {
"field": "000000",
"operator": "<=",
"value": 5.95
}
}
],
"confidence": 0.20765,
"count": 3,
"objective_summary": {
"categories": [
[
"Iris-versicolor",
1
],
[
"Iris-virginica",
2
]
]
},
"output": "Iris-virginica",
"predicate": {
"field": "000000",
"operator": "<=",
"value": 6.4
}
},
{
"confidence": 0.20654,
"count": 1,
"objective_summary": {
"categories": [
[
"Iris-versicolor",
1
]
]
},
"output": "Iris-versicolor",
"predicate": {
"field": "000000",
"operator": ">",
"value": 6.4
}
}
],
"confidence": 0.15004,
"count": 4,
"objective_summary": {
"categories": [
[
"Iris-versicolor",
2
],
[
"Iris-virginica",
2
]
]
},
"output": "Iris-virginica",
"predicate": {
"field": "000001",
"operator": ">",
"value": 2.9
}
},
{
"confidence": 0.60966,
"count": 6,
"objective_summary": {
"categories": [
[
"Iris-virginica",
6
]
]
},
"output": "Iris-virginica",
"predicate": {
"field": "000001",
"operator": "<=",
"value": 2.9
}
}
],
"confidence": 0.49016,
"count": 10,
"objective_summary": {
"categories": [
[
"Iris-versicolor",
2
],
[
"Iris-virginica",
8
]
]
},
"output": "Iris-virginica",
"predicate": {
"field": "000002",
"operator": "<=",
"value": 5.05
}
},
{
"confidence": 0.90819,
"count": 38,
"objective_summary": {
"categories": [
[
"Iris-virginica",
38
]
]
},
"output": "Iris-virginica",
"predicate": {
"field": "000002",
"operator": ">",
"value": 5.05
}
}
],
"confidence": 0.86024,
"count": 48,
"objective_summary": {
"categories": [
[
"Iris-versicolor",
2
],
[
"Iris-virginica",
46
]
]
},
"output": "Iris-virginica",
"predicate": {
"field": "000003",
"operator": ">",
"value": 1.65
}
},
{
"children": [
{
"confidence": 0.92444,
"count": 47,
"objective_summary": {
"categories": [
[
"Iris-versicolor",
47
]
]
},
"output": "Iris-versicolor",
"predicate": {
"field": "000002",
"operator": "<=",
"value": 4.95
}
},
{
"children": [
{
"confidence": 0.43849,
"count": 3,
"objective_summary": {
"categories": [
[
"Iris-virginica",
3
]
]
},
"output": "Iris-virginica",
"predicate": {
"field": "000000",
"operator": ">",
"value": 6.05
}
},
{
"children": [
{
"confidence": 0.20654,
"count": 1,
"objective_summary": {
"categories": [
[
"Iris-virginica",
1
]
]
},
"output": "Iris-virginica",
"predicate": {
"field": "000003",
"operator": "<=",
"value": 1.55
}
},
{
"confidence": 0.20654,
"count": 1,
"objective_summary": {
"categories": [
[
"Iris-versicolor",
1
]
]
},
"output": "Iris-versicolor",
"predicate": {
"field": "000003",
"operator": ">",
"value": 1.55
}
}
],
"confidence": 0.09453,
"count": 2,
"objective_summary": {
"categories": [
[
"Iris-versicolor",
1
],
[
"Iris-virginica",
1
]
]
},
"output": "Iris-virginica",
"predicate": {
"field": "000000",
"operator": "<=",
"value": 6.05
}
}
],
"confidence": 0.37553,
"count": 5,
"objective_summary": {
"categories": [
[
"Iris-versicolor",
1
],
[
"Iris-virginica",
4
]
]
},
"output": "Iris-virginica",
"predicate": {
"field": "000002",
"operator": ">",
"value": 4.95
}
}
],
"confidence": 0.81826,
"count": 52,
"objective_summary": {
"categories": [
[
"Iris-virginica",
4
],
[
"Iris-versicolor",
48
]
]
},
"output": "Iris-versicolor",
"predicate": {
"field": "000003",
"operator": "<=",
"value": 1.65
}
}
],
"confidence": 0.40383,
"count": 100,
"objective_summary": {
"categories": [
[
"Iris-versicolor",
50
],
[
"Iris-virginica",
50
]
]
},
"output": "Iris-virginica",
"predicate": {
"field": "000002",
"operator": ">",
"value": 2.45
}
}
],
"confidence": 0.26289,
"count": 150,
"objective_summary": {
"categories": [
[
"Iris-versicolor",
50
],
[
"Iris-setosa",
50
],
[
"Iris-virginica",
50
]
]
},
"output": "Iris-virginica",
"predicate": true
},
"split_criterion": "information_gain_mix",
"support_threshold": 0
},
"name": "iris' dataset model",
"number_of_evaluations": 0,
"number_of_predictions": 0,
"number_of_public_predictions": 0,
"objective_field": "000004",
"objective_fields": [
"000004"
],
"ordering": 0,
"out_of_bag": false,
"price": 0.0,
"private": true,
"project": null,
"randomize": false,
"range": [
1,
150
],
"replacement": false,
"resource": "model/50a454503c1920186d000049",
"rows": 150,
"sample_rate": 1.0,
"selective_pruning": true,
"size": 4608,
"source": "source/50a4527b3c1920186d000041",
"source_status": true,
"stat_pruning": true,
"status": {
"code": 5,
"elapsed": 413,
"message": "The model has been created",
"progress": 1.0
},
"tags": [
"species"
],
"updated": "2012-11-15T02:32:50.149000",
"views": 0,
"white_box": false
}
< Example model JSON response
Filtering a Model
It is possible to filter the tree returned by a GET to the model location by means of two optional query string parameters, namely support and value.
Filter by Support
Support is a number from 0 to 1 that specifies the minimum fraction of the total number of instances that a given branch must cover to be retained in the resulting tree. Thus, asking for a (minimum) support of 0 is just asking for the whole tree, while something like:
curl "https://au.bigml.io/model/50a454503c1920186d000049?$BIGML_AUTH;support=1.0"
Filter Example
will return just the root node, that being the only one that covers all instances. If you repeat the support parameter in the query string, the last one is used. Non-parseable support values are ignored.
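The pruning that the support parameter performs can be pictured with a small local sketch in Python. This is an illustrative approximation, not BigML's server-side code; it assumes only the count and children keys shown in the example model response above.

```python
def filter_by_support(node, total, support):
    """Keep only the branches whose count covers at least `support`
    (a fraction from 0 to 1) of the total number of instances."""
    if node["count"] / total < support:
        return None  # this branch doesn't cover enough instances
    pruned = dict(node)
    kept = [filter_by_support(child, total, support)
            for child in node.get("children", [])]
    pruned["children"] = [child for child in kept if child is not None]
    return pruned

# Tiny stand-in for the iris model above: 150 instances at the root.
root = {"count": 150, "children": [
    {"count": 50, "children": []},
    {"count": 100, "children": [{"count": 4, "children": []}]},
]}

# support=1.0 keeps only the root, the one node covering all instances
assert filter_by_support(root, 150, 1.0)["children"] == []
```

With support=0.5, only the branch covering 100 of the 150 instances survives.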
Filter by Values and Value Intervals
Value is a concrete value or interval of values (for regression trees) that a leaf must predict to be kept in the returning tree. For instance:
curl "https://au.bigml.io/model/50a454503c1920186d000049?$BIGML_AUTH;value=Iris-setosa"
Filter Example
will return only those branches in the tree whose leaves predict "Iris-setosa" as the value of the (categorical) objective field, while something like:
curl "https://au.bigml.io/model/50a454503c1920186d000049?$BIGML_AUTH;value=[10,20]"
Filter Example
for a regression model will include only those leaves predicting an objective value between 10 and 20. You can also specify sharp values for regression models:
curl "https://au.bigml.io/model/50a454503c1920186d000049?$BIGML_AUTH;value=23.2"
Filter Example
will retrieve only those branches whose predictions are exactly 23.2. It is possible to specify multiple values, as in:
curl "https://au.bigml.io/model/50a454503c1920186d000049?$BIGML_AUTH;value=Iris-setosa&value=Iris-versicolor"
curl "https://au.bigml.io/model/50a454503c1920186d000049?$BIGML_AUTH;value=(10,20]&value=[-1.234,3.3)"
curl "https://au.bigml.io/model/50a454503c1920186d000049?$BIGML_AUTH;value=(10.2,20)&value=28.1&value=0.1"
Filter Example
in which case the union of the different predicates is used (i.e., the first query will return a tree with all leaves predicting "Iris-setosa" and all leaves predicting "Iris-versicolor").
Intervals can be closed or open at either end. For example, "(-2,10]", "[1,2)" or "(-1.234,0)". The values of the left or right limits can be omitted, in which case they're taken as negative and positive infinity, respectively; thus "(,3]" denotes all values less than or equal to three, as does "[,3]" (infinity not being a valid value for a numeric prediction), while "(0,)" accepts any positive value.
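The interval syntax above can be checked with a small local membership test. This is only an illustrative sketch of the documented conventions, not code from the API:

```python
def in_interval(spec, x):
    """Membership test for interval specs like "(10,20]", "[,3]" or "(0,)".
    "(" / ")" are open ends, "[" / "]" closed; omitted limits mean infinity."""
    lo_text, hi_text = spec[1:-1].split(",")
    lo = float(lo_text) if lo_text else float("-inf")
    hi = float(hi_text) if hi_text else float("inf")
    above = x > lo if spec[0] == "(" else x >= lo
    below = x < hi if spec[-1] == ")" else x <= hi
    return above and below

assert in_interval("(,3]", 3)          # all values less than or equal to 3
assert not in_interval("(10,20]", 10)  # open left end excludes 10
assert in_interval("(10,20]", 20)      # closed right end includes 20
assert in_interval("(0,)", 0.1)        # any positive value
```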
Filter by Confidence / Probability / Expected Error
Confidence is a concrete value or interval of values that a leaf must have to be kept in the returned tree. The specification of intervals follows the same conventions as those of value. Since confidence is a continuous value, the most common case will be asking for a range, but the service will also accept individual values. It's also possible to specify both a value and a confidence. For instance:
curl "https://au.bigml.io/model/50a454503c1920186d000049?$BIGML_AUTH;value=Iris-setosa&confidence=[0.3,]"
Filter Example
asks for a tree with only those leaves that predict "Iris-setosa" with a confidence greater than or equal to 0.3, while
curl "https://au.bigml.io/model/50a454503c1920186d000049?$BIGML_AUTH;confidence=[,0.25)"
Filter Example
returns a model containing only those leaves with confidence strictly less than 0.25. Confidence filters work for both classification and regression problems, since the regression expected error is stored under the name confidence in our JSON. If desired (and only for regression), one can specify a filter using expected_error instead:
curl "https://au.bigml.io/model/50a454503c1920186d000049?$BIGML_AUTH;expected_error=[,0.25)"
Filter Example
If you specify both confidence and expected_error, only one of them will be used: confidence for classifications, expected_error for regressions. If only confidence is specified, it will always be used (confidence is an alias for the expected error in regressions). If only expected_error is specified, it will only be used if the model is a regression.
Filters by probability work exactly like filters by confidence, replacing confidence with probability in the query string. As a consequence, they only have an effect on classification problems.
Finally, note that it is also possible to specify support, value, confidence, probability, and expected_error parameters in the same query.
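When combining several filters, the parameters are simply joined with semicolons in the query string, as the examples above show. A tiny helper, included here only as an illustration, makes the format explicit:

```python
def filter_query(params):
    """Join filter parameters with ";" as the query string expects,
    e.g. support=0.5;value=Iris-setosa;confidence=[0.3,]"""
    return ";".join("%s=%s" % (key, value) for key, value in params)

query = filter_query([("support", "0.5"),
                      ("value", "Iris-setosa"),
                      ("confidence", "[0.3,]")])
assert query == "support=0.5;value=Iris-setosa;confidence=[0.3,]"
```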
PMML
The default model output format is JSON. However, the pmml parameter allows you to include a PMML version of the model. The response will then include an XML document that conforms to PMML v4.1. For example:
curl "https://au.bigml.io/model/50a454503c1920186d000049?$BIGML_AUTH;pmml=yes"
PMML Example
curl "https://au.bigml.io/model/50a454503c1920186d000049?$BIGML_AUTH;pmml=only"
PMML Example
Filtering and Paginating Fields from a Model
A model might be composed of hundreds or even thousands of fields. Thus, when retrieving a model, it's possible to specify that only a subset of fields be retrieved, by using any combination of the following parameters in the query string (unrecognized parameters are ignored):
Since fields is a map and therefore not ordered, the returned fields contain an additional key, order, whose integer (increasing) value gives you their ordering. In all other respects, the model is the same as the one you would get without any of the filtering parameters above.
The fields_meta field can help you paginate fields. Its structure is as follows:
Updating a Model
To update a model, you need to PUT an object containing the fields that you want to update to the model's base URL. The content-type must always be "application/json". If the request succeeds, BigML.io will return an HTTP 202 response with the updated model.
For example, to update a model with a new name you can use curl like this:
curl "https://au.bigml.io/model/50a454503c1920186d000049?$BIGML_AUTH" \
-X PUT \
-d '{"name": "a new name"}' \
-H 'content-type: application/json'
$ Updating a model's name
If you want to update a model with a new label and description for a specific field you can use curl like this:
curl "https://au.bigml.io/model/50a454503c1920186d000049?$BIGML_AUTH" \
-X PUT \
-d '{"fields": {"000000": {"label": "a longer name", "description": "an even longer description"}}}' \
-H 'content-type: application/json'
$ Updating a model's field, label, and description
Deleting a Model
To delete a model, you need to issue an HTTP DELETE request to the model/id to be deleted.
Using curl you can do something like this to delete a model:
curl -X DELETE "https://au.bigml.io/model/50a454503c1920186d000049?$BIGML_AUTH"
$ Deleting a model from the command line
If the request succeeds you will not see anything on the command line unless you executed the command in verbose mode. Successful DELETEs will return "204 no content" responses with no body.
Once you delete a model, it is permanently deleted. That is, a delete request cannot be undone. If you try to delete a model a second time, or a model that does not exist, you will receive a "404 not found" response.
However, if you try to delete a model that is being used at the moment, then BigML.io will not accept the request and will respond with a "400 bad request" response.
See this section for more details.
Listing Models
To list all the models, you can use the model base URL. By default, only the 20 most recent models will be returned. You can see below how to change this number using the limit parameter.
You can get your list of models directly in your browser using your own username and API key with the following links.
https://au.bigml.io/model?$BIGML_AUTH
> Listing models from a browser
Weights
BigML.io has added three new ways in which you can use weights to deal with imbalanced datasets:
- Weight Field: considering the values of one of the fields in the dataset as weights for the instances. This is valid for both regression and classification models.
- Objective Weights: submitting a specific weight for each class in classification models.
- Automatic Balancing: setting the balance argument to true to let BigML automatically balance all the classes evenly.
Let's see each method in more detail.
Weight Field
A weight_field may be declared for either regression or classification models. Any numeric field with no negative or missing values is valid as a weight field. Each instance will be weighted individually according to the weight field's value. See the toy dataset for credit card transactions below.
online, transaction, pending transactions, days since last transaction, distance, transactions today, balance, mtd, fraud, weight
yes, 10, 3, 31, low, 3, -3250, -1500, no, 1
no, 20, 30, 1, high, 0, 0, -300, no, 1
no, 40, 13, 210, low, 1, -19890, -30, no, 1
yes, 500, 0, 1, high, 0, 0, 0, yes, 10
no, 10, 1, 32, low, 0, -2500, -7891, no, 1
yes, 100, 0, 3, low, 0, -5194, -120, no, 1
yes, 100, 1, 4, low, 0, 0, 1500, no, 1
yes, 1000, 0, 1, high, 0, 0, 0, yes, 10
no, 150, 3, 1, low, 5, -3250, 1500, no, 1
no, 75, 5, 1, high, 1, -3250, 1500, no, 1
yes, 10, 23, 0, low, 1, -3250, 1500, no, 1
yes, 10, 3, 31, low, 3, -3250, -1500, no, 1
Example CSV file
curl "https://au.bigml.io/model?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"dataset": "dataset/52bcdc513c1920e4a300006e",
"objective_field": "000008",
"weight_field": "000009"
}'
> Using a weight field to create a new model
With Flatline, you can define arbitrarily complex functions to produce weight fields, making this the most flexible and powerful way to produce weighted models.
For instance, the request below would create a new dataset from the example above, adding a new weight field that doubles the previous weight when the transaction is fraudulent and its amount is higher than 500.
curl "https://au.bigml.io/dataset?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"dataset": "dataset/52bcdc513c1920e4a300006e",
"new_fields": [{
"field": "(if (and (= (f fraud) \"yes\") (> (f transaction) 500)) (* (f weight) 2) (f weight))",
"name": "new weight"}]
}'
> Creating a new weight field
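If you want to sanity-check the weighting logic before writing the Flatline expression, the same rule is easy to mirror locally. The field names below match the example CSV; the function itself is just an illustrative sketch, not part of the API:

```python
def new_weight(row):
    """Mirror of the Flatline expression above: double the weight when
    the transaction is fraudulent and its amount exceeds 500."""
    if row["fraud"] == "yes" and row["transaction"] > 500:
        return row["weight"] * 2
    return row["weight"]

# Rows from the example CSV: only the 1000-unit fraudulent one is doubled.
assert new_weight({"fraud": "yes", "transaction": 1000, "weight": 10}) == 20
assert new_weight({"fraud": "yes", "transaction": 500, "weight": 10}) == 10
assert new_weight({"fraud": "no", "transaction": 1000, "weight": 1}) == 1
```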
Objective Weights
The second method for adding weights only applies to classification models. A set of objective_weights may be defined, one per objective class. Each instance will be weighted according to its class weight.
curl "https://au.bigml.io/model?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"dataset": "dataset/52bcdc513c1920e4a300006e",
"objective_field": "000008",
"excluded_fields": ["000009"],
"objective_weights": [["yes", 10], ["no", 1]]
}'
> Using objective weights to create a new model
If a class is not listed in the objective_weights, it is assumed to have a weight of 1. This means the example below is equivalent to the example above.
curl "https://au.bigml.io/model?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"dataset": "dataset/52bcdc513c1920e4a300006e",
"objective_field": "000008",
"excluded_fields": ["000009"],
"objective_weights": [["yes", 10]]
}'
> Using objective weights to create a new model
Weights of zero are valid as long as there are some positive-valued weights. If every weight ends up being zero (which is possible with sampled datasets), then the resulting model will have a single node with a nil output.
Automatic Balancing
Finally, we provide a convenience shortcut for specifying weights for a classification objective that are inversely proportional to their category counts, by means of the balance_objective flag.
For instance, if the category counts of the objective field are, say:
[["Iris-versicolor", 20], ["Iris-virginica", 10], ["Iris-setosa", 5]]
Category counts
the request:
curl "https://au.bigml.io/model?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"dataset": "dataset/52bb141a3c19200bdf00000c",
"balance_objective": true
}'
> Using balance_objective to create a new model
would be equivalent to:
curl "https://au.bigml.io/model?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"dataset": "dataset/52bb141a3c19200bdf00000c",
"objective_weights": [
["Iris-versicolor", 1],
["Iris-virginica", 2],
["Iris-setosa", 4]]}'
> Using objective_weights to create a new model
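The equivalence above can be reproduced locally: balanced weights are inversely proportional to the category counts, with the largest class getting weight 1. This is only an illustrative sketch; BigML computes these weights server-side when balance_objective is set:

```python
def balanced_weights(category_counts):
    """Weights inversely proportional to category counts; the largest
    class gets weight 1, matching the equivalence shown above."""
    biggest = max(count for _, count in category_counts)
    return [[category, biggest / count] for category, count in category_counts]

counts = [["Iris-versicolor", 20], ["Iris-virginica", 10], ["Iris-setosa", 5]]
assert balanced_weights(counts) == [["Iris-versicolor", 1],
                                    ["Iris-virginica", 2],
                                    ["Iris-setosa", 4]]
```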
The next table summarizes all the available arguments to use weights.
The nodes for a weighted tree will include a weight and weighted_objective_distribution, which are the weighted analogs of count and objective_distribution. Confidence, importance, and pruning calculations also take weights into account.
{
"id":0,
"children":[
{
"id":1,
"children":[
{
"output":"Iris-virginica",
"count":10,
"objective_summary":{
"categories":[
[
"Iris-virginica",
10
]
]
},
"predicate":{
"value":1.7,
"operator":">",
"field":"000003"
},
"weighted_objective_summary":{
"categories":[
[
"Iris-virginica",
10
]
]
},
"weight":10,
"confidence":0.72246,
"id":2
},
{
"output":"Iris-versicolor",
"count":20,
"objective_summary":{
"categories":[
[
"Iris-versicolor",
20
]
]
},
"predicate":{
"value":1.7,
"operator":"<=",
"field":"000003"
},
"weighted_objective_summary":{
"categories":[
[
"Iris-versicolor",
20
]
]
},
"weight":20,
"confidence":0.83887,
"id":3
}
],
"weighted_objective_summary":{
"categories":[
[
"Iris-versicolor",
20
],
[
"Iris-virginica",
10
]
]
},
"weight":30,
"predicate":{
"value":0.6,
"operator":">",
"field":"000003"
},
"confidence":0.4878,
"count":30,
"output":"Iris-versicolor",
"objective_summary":{
"categories":[
[
"Iris-versicolor",
20
],
[
"Iris-virginica",
10
]
]
}
},
{
"output":"Iris-setosa",
"count":5,
"objective_summary":{
"categories":[
[
"Iris-setosa",
5
]
]
},
"predicate":{
"value":0.6,
"operator":"<=",
"field":"000003"
},
"weighted_objective_summary":{
"categories":[
[
"Iris-setosa",
100
]
]
},
"weight":100,
"confidence":0.56551,
"id":4
}
],
"weighted_objective_summary":{
"categories":[
[
"Iris-setosa",
100
],
[
"Iris-versicolor",
20
],
[
"Iris-virginica",
10
]
]
},
"weight":130,
"predicate":true,
"confidence":0.60745,
"count":35,
"output":"Iris-setosa",
"objective_summary":{
"categories":[
[
"Iris-versicolor",
20
],
[
"Iris-virginica",
10
],
[
"Iris-setosa",
5
]
]
}
}
< Example weighted model JSON response
Ensembles
Last Updated: Tuesday, 2019-01-29 16:28
Depending on the nature of your data and the specific parameters of the ensemble, you can significantly boost predictive performance over single models, using exactly the same data.
You can create an ensemble just as you would create a model, using any of the following three basic machine learning techniques: bagging, random decision forests, and gradient tree boosting.
Bagging, also known as bootstrap aggregating, is one of the simplest ensemble-based strategies, but it often outperforms more complex strategies. The basic idea is to use a different random subset of the original dataset for each model in the ensemble. Specifically, BigML by default uses a sampling rate of 100% with replacement for each model. You can read more about bagging here.
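The bootstrap sampling at the heart of bagging can be sketched in a few lines of Python. This is only an illustration of the idea, not BigML's implementation:

```python
import random

def bootstrap_sample(rows, rate=1.0, seed=None):
    """Draw rate * len(rows) instances *with replacement*, so each model
    in the ensemble sees its own random view of the dataset."""
    rng = random.Random(seed)
    size = int(len(rows) * rate)
    return [rng.choice(rows) for _ in range(size)]

data = list(range(150))
sample = bootstrap_sample(data, rate=1.0, seed=42)
assert len(sample) == len(data)       # default 100% sampling rate
assert len(set(sample)) < len(data)   # duplicates: sampling is with replacement
```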
Random decision forests is the second ensemble-based strategy that BigML provides. It consists, essentially, of selecting a new random subset of the input fields at each split while an individual model is being built, instead of considering all the input fields. To create a random decision forest you just need to set the randomize argument to true. You can read more about random decision forests here.
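The per-split field randomization can also be sketched locally. The default subset size (the square root of the number of input fields) matches the random_candidates argument described below; the helper itself is hypothetical, for illustration only:

```python
import math
import random

def split_candidates(input_fields, random_candidates=None, seed=None):
    """Pick the random subset of fields considered at one split. The
    default subset size is the square root of the field count."""
    size = random_candidates or max(1, round(math.sqrt(len(input_fields))))
    return random.Random(seed).sample(input_fields, size)

fields = ["000000", "000001", "000002", "000003"]
candidates = split_candidates(fields, seed=7)
assert len(candidates) == 2            # sqrt(4) fields per split
assert set(candidates) <= set(fields)
```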
Gradient tree boosting is the third strategy. Its predictions are additive: each tree corrects the predictions of the previously grown trees. You must specify the boosting argument in order to apply this technique.
BigML.io allows you to create, retrieve, update, and delete your ensembles. You can also list all of your ensembles.
Jump to:
- Ensemble Base URL
- Creating an Ensemble
- Ensemble Arguments
- Gradient Boosting
- PMML
- Retrieving an Ensemble
- Ensemble Properties
- Filtering and Paginating Fields from an Ensemble
- Updating an Ensemble
- Deleting an Ensemble
- Listing Ensembles
Ensemble Base URL
You can use the following base URL to create, retrieve, update, and delete ensembles. https://au.bigml.io/ensemble
Ensemble base URL
All requests to manage your ensembles must use HTTPS and be authenticated using your username and API key to verify your identity. See this section for more details.
Creating an Ensemble
To create a new ensemble, you need to POST to the ensemble base URL an object containing at least the dataset/id that you want to use to create the ensemble. The content-type must always be "application/json".
POST /ensemble?username=alfred;api_key=79138a622755a2383660347f895444b1eb927730; HTTP/1.1
Host: bigml.io
Content-Type: application/json
Creating ensemble definition
curl "https://au.bigml.io/ensemble?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"dataset": "dataset/50e8d4f03c19202d91000004"}'
> Creating an ensemble
BigML.io will return the newly created ensemble if the request succeeded.
{
"category": 0,
"code": 201,
"columns": 9,
"created": "2013-01-11T00:04:20.202976",
"credits": 0.7992858886718751,
"credits_per_prediction": 0.0,
"dataset": "dataset/50e8d4f03c19202d91000004",
"dataset_status": true,
"description": "",
"ensemble_sample": {
"rate": 0.8,
"replacement": true,
"seed": "my ensemble sample seed"
},
"error_models": 0,
"finished_models": 0,
"locale": "en-US",
"max_columns": 9,
"max_rows": 768,
"missing_splits": true,
"models": [
"model/50ef57043c19208c50000026",
"model/50ef57043c19208c50000029",
"model/50ef57043c19208c5000002c",
"model/50ef57053c19208c5000002f",
"model/50ef57053c19208c50000032",
"model/50ef57053c19208c50000035",
"model/50ef57063c19208c50000038",
"model/50ef57063c19208c5000003b",
"model/50ef57063c19208c5000003e",
"model/50ef57073c19208c50000041"
],
"name": "diabetes' dataset ensemble",
"number_of_evaluations": 0,
"number_of_models": 10,
"number_of_predictions": 0,
"number_of_public_predictions": 0,
"ordering": 0,
"out_of_bag": false,
"price": 0.0,
"private": true,
"project": null,
"randomize": false,
"range": [
1,
768
],
"replacement": false,
"resource": "ensemble/50ef57043c19208c50000022",
"rows": 768,
"sample_rate": 0.8,
"seed": "a0f717f2b3954111b27fcc23f5a85787",
"size": 209528,
"source": "source/50e8d4ea3c19202d91000000",
"source_status": true,
"status": {
"code": 3,
"message": "The ensemble creation has been started"
},
"tags": [
"diabetes"
],
"updated": "2013-01-11T00:04:20.203007",
"views": 0,
"white_box": false
}
< Example ensemble JSON response
Ensemble Arguments
In addition to the dataset, you can also POST the following arguments, and like models, you can use weights to deal with imbalanced datasets. Click here to find more information about weights.
| Argument | Type | Description |
|---|---|---|
|
balance_objective
optional |
Boolean, default is false |
Whether or not to weight the objective classes inversely proportionally to their category counts, balancing them evenly. For more information, see the Automatic Balancing section.
Example: true |
|
boosting
optional |
Object |
Gradient boosting options for the ensemble. Required to create an ensemble with boosted trees. See the Gradient Boosting section for more information.
Example:
|
|
category
optional |
Integer, default is the category of the dataset |
The category that best describes the ensemble. See the category codes for the complete list of categories.
Example: 1 |
| dataset | String |
A valid dataset/id.
Example: dataset/4f66a80803ce8940c5000006 |
|
depth_threshold
optional |
Integer, default is 512 |
When the depth in the tree exceeds this value, the tree stops growing. It has no effect if it's bigger than the node_threshold.
Example: 128 |
|
description
optional |
String |
A description of the ensemble up to 8192 characters long.
Example: "This is a description of my new ensemble" |
|
ensemble_sample
optional |
Object |
The sampling to be used for each tree in the ensemble. It can contain rate (default 1), replacement (default true), and seed parameters. Note that this is different from the sample_rate, replacement, and seed used in other models, predictions, or datasets, where sampling is applied once to the input dataset; here, sampling is applied multiple times to the input, in order to create a separate sampling for each tree composing the final ensemble. So there is no out_of_bag parameter here, and the seed is used to create a different seed for each of the generated trees.
Example:
|
|
excluded_fields
optional |
Array, default is [], an empty list. None of the fields in the dataset is excluded. |
Specifies the fields that won't be included in the models of the ensemble
Example:
|
|
fields
optional |
Object, default is {}, an empty dictionary. That is, no names or preferred statuses are changed. |
This can be used to change the names of the fields in the ensemble with respect to the original names in the dataset, or to tell BigML that certain fields should be preferred. Add an entry keyed by the field id generated in the source for each field whose name you want updated.
Example:
|
|
input_fields
optional |
Array, default is []. All the fields in the dataset |
Specifies the fields to be considered to create the ensemble.
Example:
|
|
max_training_time
optional |
Integer, default is 1800 |
The maximum training time allowed for the optimization, in seconds, as a strictly positive integer. Applicable only when optimize is set to true.
Example: 3600 |
|
missing_splits
optional |
Boolean, default is true |
Defines whether to explicitly include missing field values when choosing a split while growing the models of an ensemble. When this option is enabled, each model generates predicates whose operators include an asterisk, such as >*, <=*, =*, or !=*. The presence of an asterisk means "or missing". So a split with the operator >* and the value 8 can be read as "x > 8 or x is missing". When using missing_splits there may also be predicates with the operators = or !=, but with a null value. These mean "x is missing" and "x is not missing", respectively.
Example: false |
|
name
optional |
String, default is dataset's name |
The name you want to give to the new ensemble.
Example: "my new ensemble" |
|
node_threshold
optional |
Integer, default is 512 |
When the number of nodes in the tree exceeds this value, the tree stops growing.
Example: 1000 |
|
number_of_model_candidates
optional |
Integer, default is 128 |
The number of model candidates evaluated over the course of the optimization. Applicable only when optimize is set to true. Maximum 200 candidates.
Example: 100 |
|
number_of_models
optional |
Integer, default is 10 |
The number of models to build the ensemble. This parameter is ignored for boosted trees. See the Gradient Boosting section for more information.
Example: 100 |
|
objective_field
optional |
String, default is dataset's pre-defined objective field |
Specifies the id of the field that the ensemble will predict.
Example: "000003" |
|
objective_weights
optional |
Array |
A list of category and weight pairs. One per objective class. For more information, see the Objective Weights section.
Example:
|
|
optimize
optional |
Boolean, default is false |
Whether the ensemble should be built with the automatic optimization. When it is set to true, only the following modeling properties are applied: input_fields, excluded_fields, default_numeric_value, missing_numerics, sample_rate, max_training_time, objective_weights, weight_field, and number_of_model_candidates
Example: true |
|
ordering
optional |
Integer, default is 0 (deterministic). |
Specifies the type of ordering followed to build the models of the ensemble. There are three different types that you can specify:
Example: 1 |
|
out_of_bag
optional |
Boolean, default is false |
Setting this parameter to true will return a sequence of the out-of-bag instances instead of the sampled instances. See the Section on Sampling for more details.
Example: true |
|
project
optional |
String |
The project/id you want the ensemble to belong to.
Example: "project/54d98718f0a5ea0b16000000" |
|
random_candidate_ratio
optional |
Float |
A real number between 0 and 1. When randomize is true and random_candidate_ratio is given, BigML randomizes the trees and uses random_candidate_ratio * total fields (counting the number of terms in text fields as fields). To get the final number of candidate fields we round down to the nearest integer, but if the result is 0 we'll use 1 instead. If both random_candidates and random_candidate_ratio are given, BigML ignores random_candidate_ratio.
Example: 0.2 |
|
random_candidates
optional |
Integer, default is the square root of the total number of input fields. |
Sets the number of random fields considered when randomize is true.
Example: 10 |
|
randomize
optional |
Boolean, default is false |
Setting this parameter to true will consider only a subset of the possible fields when choosing a split. See the Section on Random Decision Forests for further details.
Example: true |
|
range
optional |
Array, default is [1, max rows in the dataset] |
The range of successive instances to build the ensemble.
Example: [1, 150] |
|
replacement
optional |
Boolean, default is false |
Whether sampling should be performed with or without replacement. See the Section on Sampling for more details.
Example: true |
|
sample_rate
optional |
Float, default is 1.0 |
A real number between 0 and 1 specifying the sample rate. See the Section on Sampling for more details.
Example: 0.5 |
|
seed
optional |
String |
A string to be hashed to generate deterministic samples. See the Section on Sampling for more details.
Example: "MySample" |
|
split_candidates
optional |
Integer, default is 32 |
The number of split points that are considered whenever the tree evaluates a numeric field. Minimum 1 and maximum 1024
Example: 128 |
|
stat_pruning
optional |
Boolean |
Activates statistical pruning on each decision tree model. It doesn't apply to boosted trees.
Example: true |
|
support_threshold
optional |
Float, default is 0 |
This parameter controls the minimum amount of support each child node must contain to be valid as a possible split. So, if it is 3, then both children of a new split must have at least 3 instances supporting them. Since instances may have non-integer weights, non-integer values are valid.
Example: 16 |
|
tags
optional |
Array of Strings |
A list of strings that help classify and index your ensemble.
Example: ["best customers", "2018"] |
|
webhook
optional |
Object |
A webhook url and an optional secret phrase. See the Section on Webhooks for more details.
Example:
|
|
weight_field
optional |
String |
Any numeric field with no negative or missing values is valid as a weight field. Each instance will be weighted individually according to the weight field's value. For more information, see the Weight Field section.
Example: "000005" |
You can use curl to customize a new ensemble from the command line. For example, to create a new ensemble named "my ensemble", restricted to certain rows and a subset of the input fields:
curl "https://au.bigml.io/ensemble?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"dataset": "dataset/4f66a80803ce8940c5000006", "input_fields": ["000001", "000003"], "name": "my ensemble", "range": [25, 125]}'
> Creating a customized ensemble
If you do not specify a name, the dataset's name will be assigned to the new ensemble. If you do not specify a range of instances, the complete set of instances in the dataset will be used. If you do not specify any input fields, all the preferred input fields in the dataset will be included, and if you do not specify an objective field, the last field in your dataset will be considered the objective field.
Gradient Boosting
When doing boosting, the number_of_models parameter described above is no longer valid as an input. The number_of_models will now indicate the maximum number of boosting iterations explained below. Note that when the gradient boosting option is applied to classification models, the actual number of models created will be the product of the number of classes (categories) and the number of iterations. For example, if you set boosting iterations to 12 and the number of classes is 3, then the number of models created will be 36, or fewer if an early stopping strategy is used.
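The model-count arithmetic above can be written down explicitly. This gives an upper bound only, since early stopping may build fewer trees:

```python
def boosted_model_count(iterations, n_classes=1):
    """Maximum number of trees gradient boosting creates: one per class
    per iteration for classification, one per iteration for regression.
    Early stopping may produce fewer."""
    return iterations * n_classes

assert boosted_model_count(12, n_classes=3) == 36  # the example above
assert boosted_model_count(12) == 12               # regression case
```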
In addition, our implementation of boosted trees supports the following parameters, which are all part of the boosting object:
If the boosted trees are using one of the early stopping tests (early_out_of_bag or early_holdout), then it will also have a list of scores indicating the quality of the boosted trees after each iteration.
Individual trees in boosted ensembles differ from trees in bagged or random forest ensembles. The primary difference is that boosted trees do not try to predict the objective field directly. Instead, they try to fit a gradient (correcting for mistakes made in previous iterations), and this gradient is stored under a new field, named gradient.
This means the predictions from boosted trees cannot be combined using the regular ensemble combiners. Instead, boosted trees use their own combiner that relies on a few new parameters included with individual boosted trees. These new parameters are contained in the boosting attribute of each boosted tree, which may include the following properties.
- objective_class will indicate the class that each tree helps predict if boosting is used for a classification problem (there will be one tree for each class for every boosting iteration).
- objective_field: contains the field id of the original objective field, as boosted trees will always be regression trees whose new objective is a new generated field (the previously mentioned gradient).
- weight: captures the influence each tree has when computing predictions.
- lambda: helps regulate the strength of a tree's output. It's included for generating predictions when encountering missing data and using the proportional strategy.
Nodes in boosted trees will also contain two new boosting related parameters, g_sum and h_sum. These are sums of the first and second order gradients, and are needed for generating predictions when encountering missing data and using the proportional strategy.
For regression problems, a prediction is generated by finding the prediction from each individual tree and computing a weighted sum using each tree's weight. Predictions for classification problems are similar, but separate weighted sums are found for each objective_class. That vector of weighted sums is then transformed into class probabilities using the softmax function.
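The combination rule just described can be sketched in Python. This is a minimal illustration, not BigML's actual combiner code: each tree is represented by its objective_class, weight, and raw prediction (the fitted gradient), and the per-class weighted sums are pushed through the softmax function.

```python
import math
from collections import defaultdict

def combine_boosted(trees):
    """Combine boosted-tree outputs into class probabilities.

    Each tree is a dict with 'objective_class', 'weight' and the tree's
    raw 'prediction' (a gradient step, not a class label).  A weighted
    sum is accumulated per class, then the vector of sums is mapped to
    probabilities with the softmax function.
    """
    sums = defaultdict(float)
    for tree in trees:
        sums[tree["objective_class"]] += tree["weight"] * tree["prediction"]
    exps = {cls: math.exp(s) for cls, s in sums.items()}
    total = sum(exps.values())
    return {cls: e / total for cls, e in exps.items()}

# Two iterations over a two-class problem (illustrative numbers):
probs = combine_boosted([
    {"objective_class": "true", "weight": 0.1, "prediction": 2.0},
    {"objective_class": "false", "weight": 0.1, "prediction": -1.0},
    {"objective_class": "true", "weight": 0.1, "prediction": 1.5},
    {"objective_class": "false", "weight": 0.1, "prediction": -0.5},
])
```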
Retrieving an Ensemble
Each ensemble has a unique identifier in the form "ensemble/id" where id is a string of 24 alpha-numeric characters that you can use to retrieve the ensemble.
To retrieve an ensemble with curl:
curl "https://au.bigml.io/ensemble/50ef57043c19208c50000022?$BIGML_AUTH"
$ Retrieving an ensemble from the command line
You can also use your browser to visualize the ensemble using the full BigML.io URL or by pasting the ensemble/id into the BigML.com.au dashboard.
Ensemble Properties
Once an ensemble has been successfully created it will have the following properties.
| Property | Type | Description |
|---|---|---|
| boosting | Object | Gradient boosting options for the ensemble, including scores, which indicates the quality of the boosted trees after each iteration, and final_iterations. See the Gradient Boosting section for more information. |
|
category
filterable, sortable, updatable |
Integer | One of the categories in the table of categories that help classify this resource according to the domain of application. |
| code | Integer | HTTP status code. This will be 201 upon successful creation of the ensemble and 200 afterwards. Make sure that you check the code that comes with the status attribute to make sure that the ensemble creation has been completed without errors. |
|
columns
filterable, sortable |
Integer | The number of fields in the ensemble. |
|
composites
filterable, sortable |
Array of Strings | The list of composite ids that reference this model. |
|
created
filterable, sortable |
ISO-8601 Datetime. | This is the date and time in which the ensemble was created with microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are provided in Coordinated Universal Time (UTC). |
|
credits
filterable, sortable |
Float | The number of credits it cost you to create this ensemble. |
|
credits_per_prediction
filterable, sortable, updatable |
Float | This is the number of credits that other users will consume to make a prediction with your ensemble in case you decide to make it public. |
|
dataset
filterable, sortable |
String | The dataset/id that was used to build the ensemble. |
|
dataset_status
filterable, sortable |
Boolean | Whether the dataset is still available or has been deleted. |
|
description
updatable |
String | A text describing the ensemble. It can contain restricted markdown to decorate the text. |
| distributions | Array | Unordered list of distributions for each model in the ensemble. Each distribution is an Object with an entry for the distribution of instances in the training set and the distribution of predictions in the model. See a model distribution field for more details. Note that distributions must be accessed by the model_order below. |
| ensemble_sample | Object | The sampling to be used for each tree in the ensemble. |
|
error_models
filterable, sortable |
Integer | The number of models in the ensemble that have failed. |
| excluded_fields | Array | The list of fields' ids that were excluded when building the ensemble. |
| fields_meta | Object | A dictionary with meta information about the fields dictionary. It specifies the total number of fields, the current offset, and limit, and the number of fields (count) returned. |
|
finished_models
filterable, sortable |
Integer | The number of models in the ensemble that have finished correctly. |
|
fusions
filterable, sortable |
Array of Strings | The list of fusion ids that reference this model. |
| importance | Object | Average importance per field |
| input_fields | Array | The list of input fields' ids used to build the models of the ensemble. |
| locale | String | The dataset's locale. |
|
max_columns
filterable, sortable |
Integer | The total number of fields in the dataset used to build the ensemble. |
|
max_rows
filterable, sortable |
Integer | The maximum number of instances in the dataset that can be used to build the ensemble. |
| max_training_time | Integer | The maximum training time allowed for the optimization, in seconds. |
|
missing_splits
filterable, sortable |
Boolean | Whether to explicitly include missing field values when choosing a split while growing the models of an ensemble. |
| model_order | Array | Order in which each model in the list of models was finished. The distributions above must be accessed following this index. |
| models | Array | Unordered list of model/ids that compose the ensemble. Models are ordered by the model_order above. |
|
name
filterable, sortable, updatable |
String | The name of the ensemble as you provided it, or based on the name of the dataset by default. |
|
node_threshold
filterable, sortable |
Integer | The maximum number of nodes that each model in the ensemble will grow. |
|
number_of_batchpredictions
filterable, sortable |
Integer | The current number of batch predictions that use this ensemble. |
|
number_of_evaluations
filterable, sortable |
Integer | The current number of evaluations that use this ensemble. |
| number_of_model_candidates | Integer | The number of model candidates evaluated over the course of the optimization. |
|
number_of_models
filterable, sortable |
Integer | The number of models in the ensemble. |
|
number_of_predictions
filterable, sortable |
Integer | The current number of predictions that use this ensemble. |
|
number_of_public_predictions
filterable, sortable |
Integer | The current number of public predictions that use this ensemble. |
| objective_field | String |
Specifies the id of the field that the ensemble predicts.
Example: "000003" |
| optimize | Boolean | Whether the ensemble was built with the automatic optimization. |
|
optiml
filterable, sortable |
String | The optiml/id that created this ensemble. |
|
optiml_status
filterable, sortable |
Boolean | Whether the OptiML is still available or has been deleted. |
|
ordering
filterable, sortable |
Integer |
The order used to choose instances from the dataset to build the models of the ensemble. There are three different types:
|
|
out_of_bag
filterable, sortable |
Boolean | Whether the out-of-bag instances were used to create the ensemble instead of the sampled instances. |
|
price
filterable, sortable, updatable |
Float | The price other users must pay to clone your ensemble. |
|
private
filterable, sortable, updatable |
Boolean | Whether the ensemble is public or not. |
|
project
filterable, sortable, updatable |
String | The project/id the resource belongs to. |
|
random_candidate_ratio
filterable, sortable |
Float | The random candidate ratio considered when randomize is true. |
|
random_candidates
filterable, sortable |
Integer | The number of random fields considered when randomize is true. |
|
randomize
filterable, sortable |
Boolean | Whether the splits of each model in the ensemble considered only a random subset of the fields or all the fields available. |
| range | Array | The range of instances used to build the models of the ensemble. |
|
replacement
filterable, sortable |
Boolean | Whether the instances sampled to build the ensemble were selected using replacement or not. |
| resource | String | The ensemble/id. |
|
rows
filterable, sortable |
Integer | The total number of instances used to build the models of the ensemble |
|
sample_rate
filterable, sortable |
Float | The sample rate used to select instances from the dataset to build the models of the ensemble. |
|
seed
filterable, sortable |
String | The string that was used to generate the sample. |
|
shared
filterable, sortable, updatable |
Boolean | Whether the ensemble is shared using a private link or not. |
| shared_hash | String | The hash that gives access to this ensemble if it has been shared using a private link. |
| sharing_key | String | The alternative key that gives read access to this ensemble. |
|
size
filterable, sortable |
Integer | The number of bytes of the dataset that were used to create this ensemble. |
|
source
filterable, sortable |
String | The source/id that was used to build the dataset. |
|
source_status
filterable, sortable |
Boolean | Whether the source is still available or has been deleted. |
|
split_candidates
filterable, sortable |
Integer | The number of split points that are considered whenever the tree evaluates a numeric field. Minimum 1 and maximum 1024 |
| status | Object | A description of the status of the ensemble. It includes a code, a message, and some extra information. See the table below. |
|
subscription
filterable, sortable |
Boolean | Whether the ensemble was created using a subscription plan or not. |
|
support_threshold
filterable, sortable |
Float | Controls the minimum amount of support that each child node must contain to be valid as a possible split. |
|
tags
filterable, updatable |
Array of Strings | A list of user tags that can help classify and index this resource. |
|
updated
filterable, sortable |
ISO-8601 Datetime. | This is the date and time in which the ensemble was updated with microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are provided in Coordinated Universal Time (UTC). |
| webhook | Object | A webhook url and an optional secret phrase. See the Section on Webhooks for more details. |
|
white_box
filterable, sortable |
Boolean | Whether the ensemble is publicly shared as a white-box. |
Ensemble Status
Creating an ensemble is a process that can take just a few seconds or a few days depending on the size of the dataset used as input, the number of models, and the workload of BigML's systems. The ensemble goes through a number of states until it is fully completed. Through the status field in the ensemble you can determine when the ensemble has been fully processed and is ready to be used to create predictions. These are the properties that an ensemble's status has:
| Property | Type | Description |
|---|---|---|
| code | Integer | A status code that reflects the status of the ensemble creation. It can be any of those that are explained here. |
| elapsed | Integer | Number of milliseconds that BigML.io took to process the ensemble. |
| message | String | A human readable message explaining the status. |
| progress | Float, between 0 and 1 | How far BigML.io has progressed building the ensemble. |
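A minimal polling sketch based on those status properties (the helper below is illustrative, not part of any BigML binding; code 5 is the finished status, as in the example response that follows):

```python
import time

FINISHED, FAULTY = 5, -1  # BigML resource status codes

def wait_until_ready(get_resource, sleep_seconds=2, max_tries=30):
    """Poll a resource until its status code reaches FINISHED.

    `get_resource` is any callable returning the resource JSON as a
    dict, e.g. a function wrapping a GET on the ensemble URL.
    """
    for _ in range(max_tries):
        resource = get_resource()
        code = resource["status"]["code"]
        if code == FINISHED:
            return resource
        if code == FAULTY:
            raise RuntimeError(resource["status"].get("message", "failed"))
        time.sleep(sleep_seconds)
    raise TimeoutError("resource not ready after polling")
```

In practice `get_resource` would issue the authenticated GET shown earlier and parse its JSON body.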
Once an ensemble has been successfully created, it will look like:
{
"balance_objective":false,
"category":0,
"code":200,
"columns":9,
"created":"2016-07-08T18:26:36.351000",
"credits":0.09991455078125,
"credits_per_prediction":0,
"dataset":"dataset/5747ae334e172785fd000000",
"dataset_field_types":{
"categorical":1,
"datetime":0,
"effective_fields":9,
"items":0,
"numeric":8,
"preferred":9,
"text":0,
"total":9
},
"dataset_status":true,
"datasets":[],
"description":"",
"ensemble_sample": {
"rate": 0.8,
"replacement": true,
"seed": "my ensemble sample seed"
},
"distributions":[
{
"importance":[
[
"000001",
0.35199
],
[
"000005",
0.21031
],
[
"000006",
0.13889
],
[
"000007",
0.11932
],
[
"000002",
0.08733
],
[
"000003",
0.03628
],
[
"000000",
0.03297
],
[
"000004",
0.02291
]
],
"predictions":{
"categories":[
[
"false",
512
],
[
"true",
256
]
]
},
"training":{
"categories":[
[
"false",
512
],
[
"true",
256
]
]
}
},
{
"importance":[
[
"000001",
0.33276
],
[
"000005",
0.24432
],
[
"000006",
0.15996
],
[
"000007",
0.15378
],
[
"000003",
0.05712
],
[
"000002",
0.02431
],
[
"000000",
0.02199
],
[
"000004",
0.00575
]
],
"predictions":{
"categories":[
[
"false",
515
],
[
"true",
253
]
]
},
"training":{
"categories":[
[
"false",
515
],
[
"true",
253
]
]
}
},
{
"importance":[
[
"000001",
0.34203
],
[
"000005",
0.21501
],
[
"000006",
0.15173
],
[
"000007",
0.11991
],
[
"000003",
0.05734
],
[
"000002",
0.04411
],
[
"000000",
0.03615
],
[
"000004",
0.03372
]
],
"predictions":{
"categories":[
[
"false",
501
],
[
"true",
267
]
]
},
"training":{
"categories":[
[
"false",
501
],
[
"true",
267
]
]
}
},
{
"importance":[
[
"000001",
0.38199
],
[
"000005",
0.23932
],
[
"000007",
0.17913
],
[
"000006",
0.06526
],
[
"000002",
0.06305
],
[
"000000",
0.05733
],
[
"000003",
0.00766
],
[
"000004",
0.00625
]
],
"predictions":{
"categories":[
[
"false",
461
],
[
"true",
307
]
]
},
"training":{
"categories":[
[
"false",
461
],
[
"true",
307
]
]
}
},
{
"importance":[
[
"000001",
0.39081
],
[
"000005",
0.16745
],
[
"000007",
0.14195
],
[
"000006",
0.09129
],
[
"000002",
0.088
],
[
"000004",
0.07009
],
[
"000003",
0.03207
],
[
"000000",
0.01834
]
],
"predictions":{
"categories":[
[
"false",
495
],
[
"true",
273
]
]
},
"training":{
"categories":[
[
"false",
495
],
[
"true",
273
]
]
}
},
{
"importance":[
[
"000001",
0.31956
],
[
"000005",
0.23029
],
[
"000006",
0.12127
],
[
"000007",
0.11578
],
[
"000002",
0.06947
],
[
"000003",
0.05644
],
[
"000000",
0.04405
],
[
"000004",
0.04314
]
],
"predictions":{
"categories":[
[
"false",
511
],
[
"true",
257
]
]
},
"training":{
"categories":[
[
"false",
511
],
[
"true",
257
]
]
}
},
{
"importance":[
[
"000001",
0.33974
],
[
"000007",
0.17589
],
[
"000005",
0.15404
],
[
"000002",
0.14244
],
[
"000006",
0.099
],
[
"000000",
0.04024
],
[
"000004",
0.03316
],
[
"000003",
0.0155
]
],
"predictions":{
"categories":[
[
"false",
493
],
[
"true",
275
]
]
},
"training":{
"categories":[
[
"false",
493
],
[
"true",
275
]
]
}
},
{
"importance":[
[
"000001",
0.32296
],
[
"000005",
0.18728
],
[
"000007",
0.18258
],
[
"000006",
0.15218
],
[
"000002",
0.07172
],
[
"000003",
0.04563
],
[
"000000",
0.03449
],
[
"000004",
0.00316
]
],
"predictions":{
"categories":[
[
"false",
501
],
[
"true",
267
]
]
},
"training":{
"categories":[
[
"false",
501
],
[
"true",
267
]
]
}
},
{
"importance":[
[
"000001",
0.32899
],
[
"000005",
0.21858
],
[
"000000",
0.10723
],
[
"000006",
0.10542
],
[
"000007",
0.09207
],
[
"000004",
0.06614
],
[
"000003",
0.0455
],
[
"000002",
0.03606
]
],
"predictions":{
"categories":[
[
"false",
478
],
[
"true",
290
]
]
},
"training":{
"categories":[
[
"false",
478
],
[
"true",
290
]
]
}
},
{
"importance":[
[
"000001",
0.36743
],
[
"000005",
0.20641
],
[
"000007",
0.13267
],
[
"000006",
0.09049
],
[
"000002",
0.0669
],
[
"000004",
0.05171
],
[
"000003",
0.0514
],
[
"000000",
0.03299
]
],
"predictions":{
"categories":[
[
"false",
517
],
[
"true",
251
]
]
},
"training":{
"categories":[
[
"false",
517
],
[
"true",
251
]
]
}
}
],
"error_models":0,
"fast":true,
"fields_maps":null,
"finished_models":10,
"importance":{
"000000":0.04258,
"000001":0.34783,
"000002":0.06934,
"000003":0.04049,
"000004":0.0336,
"000005":0.2073,
"000006":0.11755,
"000007":0.14131
},
"input_fields":[
"000000",
"000001",
"000002",
"000003",
"000004",
"000005",
"000006",
"000007"
],
"locale":"en-us",
"max_columns":9,
"max_rows":768,
"missing_splits":false,
"models":[
"model/577ff05f4e17277ffd000003",
"model/577ff0604e17277ffd000005",
"model/577ff0604e17277ffd000007",
"model/577ff0604e17277ffd000009",
"model/577ff0604e17277ffd00000b",
"model/577ff0604e17277ffd00000d",
"model/577ff0604e17277ffd00000f",
"model/577ff0614e17277ffd000011",
"model/577ff0614e17277ffd000013",
"model/577ff0614e17277ffd000015"
],
"name":"diabetes dataset's ensemble",
"node_threshold":512,
"number_of_batchpredictions":0,
"number_of_evaluations":0,
"number_of_models":10,
"number_of_predictions":0,
"number_of_public_predictions":0,
"objective_field":"000008",
"objective_field_name":"diabetes",
"objective_field_type":"categorical",
"ordering":0,
"out_of_bag": false,
"price":0,
"private":true,
"project": null,
"randomize":false,
"range":[
1,
768
],
"replacement": false,
"resource":"ensemble/50ef57043c19208c50000022",
"rows":768,
"sample_rate":0.8,
"seed": "a0f717f2b3954111b27fcc23f5a85787",
"shared":false,
"size":209528,
"source":"source/5747ae194e172785fc000000",
"source_status":true,
"stat_pruning":false,
"status":{
"code":5,
"elapsed":2704,
"message":"The ensemble has been created",
"progress":1
},
"subscription":true,
"tags":[
"diabetes"
],
"updated":"2016-07-08T18:26:45.524000"
}
< Example ensemble JSON response
PMML
The default ensemble output format is JSON. However, the pmml parameter allows you to include a PMML version of the ensemble. The ensemble will then include an XML document that conforms to PMML v4.1. For example:
curl "https://au.bigml.io/ensemble/50ef57043c19208c50000022?$BIGML_AUTH;pmml=yes"
PMML Example
curl "https://au.bigml.io/ensemble/50ef57043c19208c50000022?$BIGML_AUTH;pmml=only"
PMML Example
Filtering and Paginating Fields from an Ensemble
An ensemble might be composed of hundreds or even thousands of fields. Thus when retrieving an ensemble, it's possible to specify that only a subset of fields be retrieved, by using any combination of the following parameters in the query string (unrecognized parameters are ignored):
Since fields is a map and therefore not ordered, the returned fields contain an additional key, order, whose (increasing) integer value gives you their ordering. In all other respects, the resource is the same as the one you would get without any of the filtering parameters above.
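As a sketch of paginating fields with those query-string parameters, assuming the standard offset and limit parameters and a total read from fields_meta["total"]:

```python
def field_pages(total_fields, limit=1000):
    """Yield (offset, limit) pairs covering all fields of a resource.

    `total_fields` comes from fields_meta["total"]; each pair can be
    appended to the ensemble URL as ";offset=O;limit=L" to fetch one
    page of the fields map.
    """
    for offset in range(0, total_fields, limit):
        yield offset, min(limit, total_fields - offset)

# 2500 fields in pages of 1000: (0, 1000), (1000, 1000), (2000, 500)
pages = list(field_pages(2500, limit=1000))
```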
The fields_meta field can help you paginate fields. Its structure is as follows:
Updating an Ensemble
To update an ensemble, you need to PUT an object containing the fields that you want to update to the ensemble's base URL. The content-type must always be "application/json". If the request succeeds, BigML.io will return an HTTP 202 response with the updated ensemble.
For example, to update an ensemble with a new name you can use curl like this:
curl "https://au.bigml.io/ensemble/50ef57043c19208c50000022?$BIGML_AUTH" \
-X PUT \
-d '{"name": "a new name"}' \
-H 'content-type: application/json'
$ Updating an ensemble's name
Deleting an Ensemble
To delete an ensemble, you need to issue an HTTP DELETE request to the ensemble/id to be deleted.
Using curl you can do something like this to delete an ensemble:
curl -X DELETE "https://au.bigml.io/ensemble/50ef57043c19208c50000022?$BIGML_AUTH"
$ Deleting an ensemble from the command line
If the request succeeds you will not see anything on the command line unless you executed the command in verbose mode. Successful DELETEs will return "204 no content" responses with no body.
Once you delete an ensemble, it is permanently deleted. That is, a delete request cannot be undone. If you try to delete an ensemble a second time, or an ensemble that does not exist, you will receive a "404 not found" response.
However, if you try to delete an ensemble that is being used at the moment, then BigML.io will not accept the request and will respond with a "400 bad request" response.
See this section for more details.
Listing Ensembles
To list all the ensembles, you can use the ensemble base URL. By default, only the 20 most recent ensembles will be returned. You can see below how to change this number using the limit parameter.
You can get your list of ensembles directly in your browser using your own username and API key with the following links.
https://au.bigml.io/ensemble?$BIGML_AUTH
> Listing ensembles from a browser
Logistic Regressions
Last Updated: Tuesday, 2019-01-29 16:28
A logistic regression is a supervised machine learning method for solving classification problems. The probability of the objective being a particular class is modeled as the value of a logistic function, whose argument is a linear combination of feature values. You can create a logistic regression by selecting which fields from your dataset you want to use as input fields (or predictors) and which categorical field you want to predict, the objective field.
Logistic regression seeks to learn the coefficient values b0, b1, b2, ..., bk from the training data, using maximum likelihood estimation techniques:
p(class) = 1 / (1 + e^-(b0 + b1*X1 + b2*X2 + ... + bk*Xk))
where X1, X2, ..., Xk are the feature values for a given instance.
For this formulation to be valid, the features X1, X2, ..., Xk must be numeric values. To adapt this model to all the datatypes that BigML supports, we apply the following transformations to the inputs:
- Categorical fields are 'one-hot' encoded by default. That is, a separate 0-1 numeric field is created for each category, and exactly one of those fields has a value of 1, corresponding to the categorical value for the individual instance. To specify different coding behavior, see the Coding Categorical Fields section for more details.
- Each term present in a text field is mapped to a corresponding numeric field, whose value is the number of occurrences of that term in the instance. Text fields without term analysis enabled are excluded from the model.
- Each item present in an items field is mapped to a corresponding numeric field, whose value is the number of occurrences of that item in the instance.
- Missing values in numeric fields can be explicitly included as another valid value by using the argument missing_numerics or they can be replaced specifying a default_numeric_value. If none of those arguments are enabled, instances containing missing numeric values will be ignored for training the model.
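The transformations above, together with the logistic function, can be sketched end to end. This is a minimal illustration: the coefficient values, field values, and helper names are invented, not taken from a real model.

```python
import math

def one_hot(value, categories):
    """One-hot encode a categorical value; an extra slot marks missing."""
    vec = [0] * (len(categories) + 1)
    if value in categories:
        vec[categories.index(value)] = 1
    else:
        vec[-1] = 1  # missing or unseen value
    return vec

def logistic(features, coefficients, bias=0.0):
    """p = 1 / (1 + e^-(b0 + b1*x1 + ... + bk*xk))."""
    z = bias + sum(b * x for b, x in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-z))

# Two numeric inputs plus a one-hot encoded categorical field:
features = [5.1, 3.5] + one_hot("setosa", ["setosa", "versicolor", "virginica"])
prob = logistic(features, [0.2, -0.1, 1.5, -0.5, -0.5, 0.0], bias=-1.0)
# probability of the positive class, here about 0.76
```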
BigML.io allows you to create, retrieve, update, and delete your logistic regressions. You can also list all of your logistic regressions.
Jump to:
- Logistic Regression Base URL
- Creating a Logistic Regression
- Logistic Regression Arguments
- Retrieving a Logistic Regression
- Logistic Regression Properties
- PMML
- Filtering and Paginating Fields from a Logistic Regression
- Updating a Logistic Regression
- Deleting a Logistic Regression
- Listing Logistic Regressions
Logistic Regression Base URL
You can use the following base URL to create, retrieve, update, and delete logistic regressions.
https://au.bigml.io/logisticregression
Logistic Regression base URL
All requests to manage your logistic regressions must use HTTPS and be authenticated using your username and API key to verify your identity. See this section for more details.
Creating a Logistic Regression
To create a new logistic regression, you need to POST to the logistic regression base URL an object containing at least the dataset/id that you want to use to create the logistic regression. The content-type must always be "application/json".
POST /logisticregression?username=alfred;api_key=79138a622755a2383660347f895444b1eb927730; HTTP/1.1
Host: bigml.io
Content-Type: application/json
Creating logistic regression definition
curl "https://au.bigml.io/logisticregression?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"dataset": "dataset/4f66a80803ce8940c5000006"}'
> Creating a logistic regression
BigML.io will return the newly created logistic regression if the request succeeded.
{
"category":0,
"code":201,
"columns":5,
"created":"2015-09-29T18:28:38.755738",
"credits":0.01815032958984375,
"credits_per_prediction":0,
"dataset":"dataset/554e8fcf545e5f1474000010",
"dataset_field_types":{
"categorical":1,
"datetime":0,
"numeric":4,
"preferred":5,
"text":0,
"total":5
},
"dataset_status":true,
"description":"",
"excluded_fields":[],
"fields_meta":{
"count":0,
"limit":1000,
"offset":0,
"total":0
},
"input_fields":[],
"locale":"en-US",
"logistic_regression":null,
"max_columns":5,
"max_rows":150,
"name":"iris' dataset's logistic regression",
"number_of_batchpredictions":0,
"number_of_evaluations":0,
"number_of_predictions":0,
"objective_field":"000004",
"objective_field_name":null,
"objective_field_type":null,
"objective_fields":[
"000004"
],
"out_of_bag":false,
"private":true,
"project":"project/54dc6d05545e5f822c00043f",
"range":[
1,
150
],
"replacement":false,
"resource":"logisticregression/55efc3564e1727d635000004",
"rows":150,
"sample_rate":1,
"shared":false,
"size":4758,
"source":"source/554e8fac545e5f1474000004",
"source_status":true,
"status":{
"code":1,
"message":"The logistic regression is being processed and will be created soon"
},
"subscription":false,
"tags":[
"species"
],
"updated":"2015-09-29T18:28:38.755806",
"white_box":false
}
< Example logistic regression JSON response
Logistic Regression Arguments
In addition to the dataset, you can also POST the following arguments, and like models, you can use weights to deal with imbalanced datasets. Click here to find more information about weights.
| Argument | Type | Description |
|---|---|---|
|
balance_fields
optional |
Boolean, default is false |
Whether to scale each numeric field such that its values are zero mean with a standard deviation of 1, based on the field summary statistics at training time.
Example: true |
|
balance_objective
optional |
Boolean, default is false |
Whether to balance classes proportionally to their category counts or not. For more information, see the Automatic Balancing section.
Example: true |
|
bias
optional |
Boolean, default is true |
Whether to include the bias term in the solution.
Example: false |
|
c
optional |
Float, default is 1 |
The inverse of the regularization strength. Must be greater than 0.
Example: 2 |
|
category
optional |
Integer, default is the category of the dataset |
The category that best describes the logistic regression. See the category codes for the complete list of categories.
Example: 1 |
|
compute_stats
optional |
Boolean, default is false |
Whether to compute statistics and significance tests.
Example: true |
| dataset | String |
A valid dataset/id.
Example: dataset/4f66a80803ce8940c5000006 |
|
default_numeric_value
optional |
String |
It accepts any of the following strings to substitute missing numeric values across all the numeric fields in the dataset: "mean", "median", "minimum", "maximum", "zero"
Example: "median" |
|
description
optional |
String |
A description of the logistic regression up to 8192 characters long.
Example: "This is a description of my new logistic regression" |
|
eps
optional |
Float, default is 0.0001 |
Stopping criterion for the solver. If the difference between the results from the current and last iterations is less than eps, then the solver is finished.
Example: 0.1 |
|
excluded_fields
optional |
Array, default is [], an empty list. None of the fields in the dataset is excluded. |
Specifies the fields that won't be included in the logistic regression.
Example:
|
|
field_codings
optional |
List |
Coding schemes for categorical fields: dummy, contrast, or other. The value is a map between field identifiers and a coding scheme for that field. See the Coding Categorical Fields section for more details. If not specified, one numeric variable is created per categorical value, plus one for missing values.
Example:
|
|
fields
optional |
Object, default is {}, an empty dictionary. That is, no names or preferred statuses are changed. |
This can be used to change the names of the fields in the logistic regression with respect to the original names in the dataset, or to tell BigML that certain fields should be preferred. Include an entry keyed by the field id for each field whose name you want updated.
Example:
|
|
input_fields
optional |
Array, default is []. All the fields in the dataset |
Specifies the fields to be considered to create the logistic regression.
Example:
|
|
max_training_time
optional |
Integer, default is 1800 |
The maximum training time allowed for the optimization, in seconds, as a strictly positive integer. Applicable only when optimize is set to true.
Example: 3600 |
|
missing_numerics
optional |
Boolean, default is true |
Whether to create an additional binary predictor for each numeric field, denoting a missing value. If false, these predictors are not created, and rows containing missing numeric values are dropped.
Example: false |
|
name
optional |
String, default is dataset's name |
The name you want to give to the new logistic regression.
Example: "my new logistic regression" |
|
normalize
optional |
Boolean, default is false |
Whether to normalize feature vectors in training and predicting.
Example: true |
|
number_of_model_candidates
optional |
Integer, default is 128 |
The number of model candidates evaluated over the course of the optimization. Applicable only when optimize is set to true. Maximum 200 candidates.
Example: 100 |
|
objective_field
optional |
String, default is dataset's pre-defined objective field |
Specifies the id of the field that you want to predict. The type of the field must be categorical.
Example: "000003" |
|
objective_fields
optional |
Array, default is an array with the id of the last field in the dataset |
Specifies the id of the field that you want to predict. Even though this is an array, BigML.io only accepts one objective field in the current version. If both objective_field and objective_fields are specified, objective_field takes preference. The type of the field must be categorical.
Example: ["000003"] |
|
objective_weights
optional |
Array |
A list of category and weight pairs. One per objective class. For more information, see the Objective Weights section.
Example:
|
|
optimize
optional |
Boolean, default is false |
Whether the logistic regression should be built with the automatic optimization. When it is set to true, only the following modeling properties are applied: input_fields, excluded_fields, missing_splits, sample_rate, max_training_time, objective_weights, weight_field, and number_of_model_candidates
Example: true |
|
out_of_bag
optional |
Boolean, default is false |
Setting this parameter to true will return a sequence of the out-of-bag instances instead of the sampled instances. See the Section on Sampling for more details.
Example: true |
|
project
optional |
String |
The project/id you want the logistic regression to belong to.
Example: "project/54d98718f0a5ea0b16000000" |
|
range
optional |
Array, default is [1, max rows in the dataset] |
The range of successive instances to build the logistic regression.
Example: [1, 150] |
|
regularization
optional |
String, default is "l2" |
Either l1 or l2, which selects the norm to minimize when regularizing the solution. Regularizing with respect to the l1 norm causes more coefficients to be zero, while using the l2 norm forces the magnitudes of all coefficients towards zero.
Example: "l1" |
|
replacement
optional |
Boolean, default is false |
Whether sampling should be performed with or without replacement. See the Section on Sampling for more details.
Example: true |
|
sample_rate
optional |
Float, default is 1.0 |
A real number between 0 and 1 specifying the sample rate. See the Section on Sampling for more details.
Example: 0.5 |
|
seed
optional |
String |
A string to be hashed to generate deterministic samples. See the Section on Sampling for more details.
Example: "MySample" |
|
stats_sample_seed
optional |
String |
Random seed value used for stats sampling
Example: "My stats seed" |
|
stats_sample_size
optional |
Integer, default is -1 |
The number of rows to sample for calculating statistics. If -1 is given, then the number of rows will be calculated such that (rows x coefficients) <= 1E+8. The minimum of that number and the total number of input rows will be used.
Example: 1000 |
|
tags
optional |
Array of Strings |
A list of strings that help classify and index your logistic regression.
Example: ["best customers", "2018"] |
|
webhook
optional |
Object |
A webhook url and an optional secret phrase. See the Section on Webhooks for more details.
Example:
|
|
weight_field
optional |
String |
Any numeric field with no negative or missing values is valid as a weight field. Each instance will be weighted individually according to the weight field's value. For more information, see the Weight Field section.
Example: "000005" |
Coding Categorical Fields
Categorical fields must be converted to numerical values in order to be used in training a logistic regression model. By default, they are "one-hot" coded. That is, one numeric variable is created per categorical value, plus one for missing values. For a given instance, the variable corresponding to the instance's categorical value has its value set to 1, while the other variables are set to 0.
Using the iris dataset as an example, we can express this coding scheme as the following table:
| Value | C0 | C1 | C2 | C3 |
|---|---|---|---|---|
| setosa | 1 | 0 | 0 | 0 |
| versicolor | 0 | 1 | 0 | 0 |
| virginica | 0 | 0 | 1 | 0 |
| [MISSING] | 0 | 0 | 0 | 1 |
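As a sketch of this default scheme (illustrative only, not BigML's implementation), the table above can be reproduced with a few lines of Python:

```python
def one_hot(value, categories):
    """Default one-hot coding: one column per category plus a
    trailing column that flags a missing value."""
    row = [0] * (len(categories) + 1)
    if value is None:
        row[-1] = 1              # [MISSING] column
    else:
        row[categories.index(value)] = 1
    return row

species = ["setosa", "versicolor", "virginica"]
```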
To specify different coding behavior, use the field_codings parameter.
The parameter value is an array where each element is a map describing the coding scheme to apply to a particular field, and containing the following keys:
- field: The name or identifier of the field to code.
- coding: The type of coding to use, either dummy, contrast, or other.
- dummy_class: The class value to treat as the control value in dummy coding.
- coefficients: A nested array of floating-point values to be used with contrast or other coding.
The value for coding determines which of the following methods is used to code the field:
- dummy: Use dummy coding. The value is a string specifying the value to use as the control.
For example, the value {"field": "species", "coding": "dummy", "dummy_class": "virginica"} defines the following coding:
| Value | C0 | C1 | C2 |
|---|---|---|---|
| setosa | 1 | 0 | 0 |
| versicolor | 0 | 1 | 0 |
| virginica | 0 | 0 | 0 |
| [MISSING] | 0 | 0 | 1 |
- contrast: Use contrast coding.
The value is an array of vectors, each specifying the coding of an individual variable.
The vectors are checked for length: if a vector is one element shorter than expected, a 0 is implicitly appended to the end so that missing values are ignored by the model.
In addition, each vector is checked that its elements sum to 0, and the entire collection of vectors is checked for orthogonality.
For example, the value {"field": "species", "coding": "contrast", "coefficients": [[0.5,-0.25,-0.25,0],[-1,2,0,-1]]}
defines the following coding:
| Value | C0 | C1 |
|---|---|---|
| setosa | 0.50 | -1 |
| versicolor | -0.25 | 2 |
| virginica | -0.25 | 0 |
| [MISSING] | 0.00 | -1 |
- other: A user-specified coding scheme. Uses an array of vectors as in contrast, but only the length is checked. The coefficients should be listed in the same order in which the corresponding values appear in the field summary, for example [[1, 2, 3, 4, 5, 6, 7, 8], [-2, 0, -2, 0, 2, 0, 2, 0]].
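The length, zero-sum, and orthogonality checks described above for contrast coding can be sketched as follows (a hypothetical validator, not BigML's actual code):

```python
def check_contrast(vectors, expected_len):
    """Validate a contrast coding per the rules above: pad with a
    trailing 0 when a vector is one element short (so missing values
    are ignored), require each vector to sum to 0, and require the
    vectors to be mutually orthogonal."""
    padded = []
    for v in vectors:
        if len(v) == expected_len - 1:
            v = v + [0.0]        # implicit 0 for the missing class
        assert len(v) == expected_len, "wrong vector length"
        assert abs(sum(v)) < 1e-9, "vector must sum to 0"
        padded.append(v)
    # pairwise orthogonality check
    for i in range(len(padded)):
        for j in range(i + 1, len(padded)):
            dot = sum(a * b for a, b in zip(padded[i], padded[j]))
            assert abs(dot) < 1e-9, "vectors must be orthogonal"
    return padded
```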
If multiple coding schemes are listed for a single field, then the coding closest to the end of the list is used. Codings given for non-categorical variables are ignored.
If compute_stats is set to true, then all categorical fields without specified codings will be assigned dummy coding. The dummy class will be the first by alphabetical order. This is because the default one-hot encoding produces collinearity effects which result in an ill-formed covariance matrix.
You can also use curl to customize a new logistic regression. For example, to create a new logistic regression named "my logistic regression", with only certain rows, and with only three fields:
curl "https://au.bigml.io/logisticregression?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"dataset": "dataset/4f66a80803ce8940c5000006",
"input_fields": ["000001", "000002", "000003"],
"name": "my logistic regression",
"range": [25, 125]}'
> Creating a customized logistic regression
If you do not specify a name, BigML.io will assign to the new logistic regression the dataset's name. If you do not specify a range of instances, BigML.io will use all the instances in the dataset. If you do not specify any input fields, BigML.io will include all the input fields in the dataset, and if you do not specify an objective field, BigML.io will use the last field in your dataset.
Retrieving a Logistic Regression
Each logistic regression has a unique identifier in the form "logisticregression/id" where id is a string of 24 alpha-numeric characters that you can use to retrieve the logistic regression.
To retrieve a logistic regression with curl:
curl "https://au.bigml.io/logisticregression/55efc3564e1727d635000004?$BIGML_AUTH"
$ Retrieving a logistic regression from the command line
You can also use your browser to visualize the logistic regression using the full BigML.io URL or pasting the logisticregression/id into the BigML.com.au dashboard.
Logistic Regression Properties
Once a logistic regression has been successfully created it will have the following properties.
| Property | Type | Description |
|---|---|---|
|
category
filterable, sortable, updatable |
Integer | One of the categories in the table of categories that help classify this resource according to the domain of application. |
| code | Integer | HTTP status code. This will be 201 upon successful creation of the logistic regression and 200 afterwards. Make sure that you check the code that comes with the status attribute to make sure that the logistic regression creation has been completed without errors. |
|
columns
filterable, sortable |
Integer | The number of fields in the logistic regression. |
|
composites
filterable, sortable |
Array of Strings | The list of composite ids that reference this model. |
|
created
filterable, sortable |
ISO-8601 Datetime. | This is the date and time in which the logistic regression was created with microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are provided in Coordinated Universal Time (UTC). |
|
credits
filterable, sortable |
Float | The number of credits it cost you to create this logistic regression. |
|
credits_per_prediction
filterable, sortable, updatable |
Float | This is the number of credits that other users will consume to make a prediction with your logistic regression if you made it public. |
|
dataset
filterable, sortable |
String | The dataset/id that was used to build the logistic regression. |
| dataset_field_types | Object | A dictionary that informs about the number of fields of each type in the dataset used to create the logistic regression. It has an entry per each field type (categorical, datetime, numeric, and text), an entry for preferred fields, and an entry for the total number of fields. |
|
dataset_status
filterable, sortable |
Boolean | Whether the dataset is still available or has been deleted. |
|
description
updatable |
String | A text describing the logistic regression. It can contain restricted markdown to decorate the text. |
| excluded_fields | Array | The list of fields' ids that were excluded when building the logistic regression. |
| fields_meta | Object | A dictionary with meta information about the fields dictionary. It specifies the total number of fields, the current offset, and limit, and the number of fields (count) returned. |
|
fusions
filterable, sortable |
Array of Strings | The list of fusion ids that reference this model. |
| input_fields | Array | The list of input fields' ids used to build the models of the logistic regression. |
| locale | String | The dataset's locale. |
| logistic_regression | Object | All the information that you need to recreate or use the logistic regression on your own. It includes a list of coefficients and the field's dictionary describing the fields and their summaries. See here for more details. |
|
max_columns
filterable, sortable |
Integer | The total number of fields in the dataset used to build the logistic regression. |
|
max_rows
filterable, sortable |
Integer | The maximum number of instances in the dataset that can be used to build the logistic regression. |
| max_training_time | Integer | The maximum training time allowed for the optimization, in seconds. |
|
name
filterable, sortable, updatable |
String | The name of the logistic regression as provided by you, or based on the dataset's name by default. |
|
number_of_batchpredictions
filterable, sortable |
Integer | The current number of batch predictions that use this logistic regression. |
|
number_of_evaluations
filterable, sortable |
Integer | The current number of evaluations that use this logistic regression. |
| number_of_model_candidates | Integer | The number of model candidates evaluated over the course of the optimization. |
|
number_of_predictions
filterable, sortable |
Integer | The current number of predictions that use this logistic regression. |
| objective_field | String | The id of the field that the logistic regression predicts. |
| objective_fields | Array | Specifies the list of ids of the fields that the logistic regression predicts. Even if this is an array, BigML.io only accepts one objective field in the current version. |
| optimize | Boolean | Whether the logistic regression was built with the automatic optimization. |
|
optiml
filterable, sortable |
String | The optiml/id that created this logistic regression. |
|
optiml_status
filterable, sortable |
Boolean | Whether the OptiML is still available or has been deleted. |
|
out_of_bag
filterable, sortable |
Boolean | Whether the out-of-bag instances were used to create the logistic regression instead of the sampled instances. |
|
price
filterable, sortable, updatable |
Float | The price other users must pay to clone your logistic regression. |
|
private
filterable, sortable, updatable |
Boolean | Whether the logistic regression is public or not. |
|
project
filterable, sortable, updatable |
String | The project/id the resource belongs to. |
| range | Array | The range of instances used to build the logistic regression. |
|
replacement
filterable, sortable |
Boolean | Whether the instances sampled to build the logistic regression were selected using replacement or not. |
| resource | String | The logisticregression/id. |
|
rows
filterable, sortable |
Integer | The total number of instances used to build the logistic regression. |
|
sample_rate
filterable, sortable |
Float | The sample rate used to select instances from the dataset to build the logistic regression. |
|
seed
filterable, sortable |
String | The string that was used to generate the sample. |
|
shared
filterable, sortable, updatable |
Boolean | Whether the logistic regression is shared using a private link or not. |
| shared_hash | String | The hash that gives access to this logistic regression if it has been shared using a private link. |
| sharing_key | String | The alternative key that gives read access to this logistic regression. |
|
size
filterable, sortable |
Integer | The number of bytes of the dataset that were used to create this logistic regression. |
|
source
filterable, sortable |
String | The source/id that was used to build the dataset. |
|
source_status
filterable, sortable |
Boolean | Whether the source is still available or has been deleted. |
| status | Object | A description of the status of the logistic regression. It includes a code, a message, and some extra information. See the table below. |
|
subscription
filterable, sortable |
Boolean | Whether the logistic regression was created using a subscription plan or not. |
|
tags
filterable, updatable |
Array of Strings | A list of user tags that can help classify and index this resource. |
|
updated
filterable, sortable |
ISO-8601 Datetime. | This is the date and time in which the logistic regression was updated with microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are provided in Coordinated Universal Time (UTC). |
| webhook | Object | A webhook url and an optional secret phrase. See the Section on Webhooks for more details. |
|
white_box
filterable, sortable |
Boolean | Whether the logistic regression is publicly shared as a white-box. |
A Logistic Regression Object has the following properties:
| Property | Type | Description |
|---|---|---|
| balance_fields | Boolean | Whether to scale each numeric field such that its values are zero mean with a standard deviation of 1, based on the field summary statistics at training time. |
| bias | Boolean | Whether to include the bias (intercept) term in the solution. |
| c | Float | The inverse of the regularization strength. |
| coefficients | Array of Arrays | Coefficients of the logistic regression for each category in the objective field. |
| compute_stats | Boolean | Whether to compute statistics and significance tests. |
| eps | Float | Stopping criteria for solver. If the difference between the results from the current and last iterations is less than eps, then the solver is finished. |
| field_codings | List | Coding schemes for categorical fields. See the Coding Categorical Fields for more details. |
| fields | Object | A dictionary with an entry per field in the dataset used to build the logistic regression. Fields are paginated according to the field_meta attribute. Each entry includes the column number in original source, the name of the field, the type of the field, and the summary. See this Section for more details. |
| missing_class_in_coefficients | Boolean | Whether there is a missing class in the coefficients of the logistic regression. |
| missing_numerics | Boolean | Whether to create an additional binary predictor for each numeric field which denotes a missing value. If false, these predictors are not created, and rows containing missing numeric values are dropped. |
| normalize | Boolean | Whether to normalize feature vectors in training and predicting. |
| regularization | String | Either l1 or l2 at the moment. It selects the norm to minimize when regularizing the solution. Regularizing with respect to the l1 norm causes more coefficients to be zero, and using the l2 norm forces the magnitudes of all coefficients towards zero. |
| stats | Object | Statistical tests to assess the quality of the model's fit to the data. See this Section for more details. |
| stats_sample_seed | String | Random seed value used for stats sampling. |
| stats_sample_size | Integer | The number of rows sampled for calculating statistical tests. |
Coefficients Structure
The coefficients output field is an array of pairs, one pair per class. The first element in the pair is a class value, and the second element is a nested array of coefficients for the logistic model that gives the probability of that class. Each inner array within the nested array contains the group of coefficients that pertain to a single input field. The groups are listed in the same order as in input_fields, with a final singleton array corresponding to the bias term. The class-coefficient pairs are listed in the same order as the class values in the objective field summary. If the model was trained with missing values in the objective field, then a vector of coefficients will also be created for the missing class value, labeled with "", and listed last.
- Numeric fields correspond to two coefficients. The first predictor is the numeric value, and the second predictor is a binary value corresponding to missing values. For example, a numeric field value of 5 maps to a value of 5 in the first predictor, and 0 in the second, while a missing value maps to 0 in the first predictor, and 1 in the second. If the missing_numerics parameter is false, then only a single predictor will be generated for numeric fields.
- Categorical fields correspond to n+1 coefficients, where the first n coefficients correspond to class values, and the final coefficient corresponds to a binary missing value predictor.
- Text and items fields correspond to m+1 coefficients, where the first m coefficients correspond to each term in the field's tag cloud, listed in the same order as in the field summary. The final term corresponds to an empty string or itemset, or in the case of text fields, a string which does not contain any terms in the text analysis vocabulary.
- The final coefficient in the list corresponds to the bias term.
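As an illustration of how this structure maps inputs to class probabilities, the sketch below flattens each class's coefficient groups, applies the logistic function to the resulting score, and normalizes across classes. The function name and the final normalization step are assumptions for illustration, not BigML's exact scoring code:

```python
import math

def class_probabilities(coefficients, predictors):
    """coefficients: list of [class_value, nested_groups] pairs as
    described above; predictors: flat list of predictor values, in
    input_fields order, without the implicit bias input."""
    probs = {}
    for class_value, groups in coefficients:
        flat = [c for group in groups for c in group]
        # the last coefficient in the list is the bias term
        score = sum(c * x for c, x in zip(flat, predictors)) + flat[-1]
        probs[class_value] = 1.0 / (1.0 + math.exp(-score))
    total = sum(probs.values())          # normalize (assumption)
    return {k: v / total for k, v in probs.items()}
```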
Significance Tests
If the compute_stats parameter is true, then the logistic regression output contains a number of statistical tests to assess the quality of the model's fit to the data. These are found under a field named stats. For each set of coefficients, the following statistics are computed:
- likelihood_ratio: the difference in log likelihood between the fitted model and an intercept-only model. Given as a pair [p-value, ratio]. This statistic tests whether the coefficients as a whole have any predictive power over an intercept-only model.
- standard_errors: the standard error of each coefficient estimate.
- z_scores: the coefficient estimates expressed as a number of standard errors from zero.
- p_values: p-values from a 1-DOF chi-squared test of z^2 (the Wald test).
- confidence_intervals: the size of the 95% confidence interval for each coefficient estimate. That is, for a coefficient estimate x, and an interval value n, the value of the coefficient is x ± n with a confidence of 95%.
standard_errors, z_scores, p_values, confidence_intervals: These statistics test the significance of individual coefficient estimates, and are grouped in the same nested array fashion as the coefficients themselves.
To avoid lengthy computation times, statistics from large input datasets will be computed from a sub-sample of the dataset such that the number of coefficients * rows is less than or equal to 1E+8.
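Under that rule, the number of rows used for statistics can be sketched as (hypothetical helper name):

```python
def stats_sample_rows(total_rows, n_coefficients):
    """Rows used for stats: the largest row count such that
    rows * coefficients <= 1E+8, capped at the input size."""
    return min(total_rows, int(1e8) // n_coefficients)
```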
It is possible for null to appear among the values contained in stats. Wald test statistics cannot be computed for zero-value coefficients, and so their corresponding entries are null. Moreover, if the coefficients' information matrix is ill-conditioned, e.g. if there are fewer instances of the positive class than the number of coefficients, then it is impossible to perform the Wald test on the entire set of coefficients. In this case standard_errors, z_scores, p_values, and confidence_intervals will have a value of null.
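Assuming the usual normal approximation, the relationships among these per-coefficient statistics can be sketched as follows (1.96 is the 95% normal quantile; this is an illustration, not BigML's implementation):

```python
import math

def wald_stats(coefficient, standard_error):
    """z-score, two-sided p-value from a 1-DOF chi-squared test of
    z^2 (equivalently a two-sided normal test), and the half-width
    of the 95% confidence interval."""
    z = coefficient / standard_error
    p = math.erfc(abs(z) / math.sqrt(2))    # two-sided p-value
    ci = 1.959964 * standard_error          # 95% interval half-width
    return z, p, ci
```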
Logistic Regression Status
Creating a logistic regression is a process that can take just a few seconds or a few days depending on the size of the dataset used as input and on the workload of BigML's systems. The logistic regression goes through a number of states until it is fully completed. Through the status field in the logistic regression you can determine when the logistic regression has been fully processed and is ready to be used to create predictions. These are the properties that a logistic regression's status has:
| Property | Type | Description |
|---|---|---|
| code | Integer | A status code that reflects the status of the logistic regression creation. It can be any of those that are explained here. |
| elapsed | Integer | Number of milliseconds that BigML.io took to process the logistic regression. |
| message | String | A human readable message explaining the status. |
| progress | Float, between 0 and 1 | How far BigML.io has progressed building the logistic regression. |
Once a logistic regression has been successfully created, it will look like:
{
"category":0,
"code":200,
"columns":5,
"created":"2015-09-28T06:03:17.128000",
"credits":0.01815032958984375,
"credits_per_prediction":0,
"dataset":"dataset/554e8fcf545e5f1474000010",
"dataset_field_types":{
"categorical":1,
"datetime":0,
"numeric":4,
"preferred":5,
"text":0,
"total":5
},
"dataset_status":true,
"description":"",
"excluded_fields":[],
"fields_meta":{
"count":5,
"limit":1000,
"offset":0,
"query_total":5,
"total":5
},
"input_fields":[
"000000",
"000001",
"000002",
"000003"
],
"locale":"en-US",
"logistic_regression":{
"balance_fields":false,
"bias":true,
"c":1,
"coefficients":[
[
"Iris-virginica",
[
-1.7725500512039691,
-2.0714411671485604,
1.9765289540667237,
1.2116274344668618,
-0.0006009280238702336
]
],
[
"Iris-setosa",
[
0.4234123201880313,
2.446210746782367,
-4.558271526802624,
-2.0557583244325253,
0.0004628272837137537
]
],
[
"Iris-versicolor",
[
-1.1362239209763645,
-1.658799944046014,
0.9215245039579112,
0.30082088849717076,
-0.00028554600426412813
]
]
],
"eps":0.345,
"missing_numerics":true,
"normalize":true,
"regularization":"l2"
},
"max_columns":5,
"max_rows":150,
"name":"iris' dataset's logistic regression",
"number_of_batchpredictions":0,
"number_of_evaluations":0,
"number_of_predictions":0,
"objective_field":"000004",
"objective_field_name":"species",
"objective_field_type":"categorical",
"objective_fields":[
"000004"
],
"out_of_bag":false,
"private":true,
"project":"project/54dc6d05545e5f822c00043f",
"range":[
1,
150
],
"replacement":false,
"resource":"logisticregression/55efc3564e1727d635000004",
"rows":150,
"sample_rate":1,
"shared":false,
"size":4758,
"source":"source/554e8fac545e5f1474000004",
"source_status":true,
"status":{
"code":5,
"elapsed":21,
"message":"The logistic regression has been created",
"progress":1
},
"subscription":false,
"tags":[
"species"
],
"updated":"2015-09-28T06:03:20.546000",
"white_box":false
}
< Example logistic regression JSON response
PMML
The default logistic regression output format is JSON. However, the pmml parameter allows you to include a PMML version of the logistic regression. The logistic regression will include an XML document that fulfills PMML v4.1. For example:
curl "https://au.bigml.io/logisticregression/55efc3564e1727d635000004?$BIGML_AUTH;pmml=yes"
PMML Example
curl "https://au.bigml.io/logisticregression/55efc3564e1727d635000004?$BIGML_AUTH;pmml=only"
PMML Example
Filtering and Paginating Fields from a Logistic Regression
A logistic regression might be composed of hundreds or even thousands of fields. Thus when retrieving a logisticregression, it's possible to specify that only a subset of fields be retrieved, by using any combination of the following parameters in the query string (unrecognized parameters are ignored):
Since fields is a map and therefore not ordered, the returned fields contain an additional key, order, whose integer (increasing) value gives you their ordering. In all other respects, the resource is the same as the one you would get without any of the filtering parameters above.
The fields_meta field can help you paginate fields. Its structure is as follows:
Updating a Logistic Regression
To update a logistic regression, you need to PUT an object containing the fields that you want to update to the logistic regression's base URL. The content-type must always be "application/json". If the request succeeds, BigML.io will return an HTTP 202 response with the updated logistic regression.
For example, to update a logistic regression with a new name you can use curl like this:
curl "https://au.bigml.io/logisticregression/55efc3564e1727d635000004?$BIGML_AUTH" \
-X PUT \
-d '{"name": "a new name"}' \
-H 'content-type: application/json'
$ Updating a logistic regression's name
If you want to update a logistic regression with a new label and description for a specific field you can use curl like this:
curl "https://au.bigml.io/logisticregression/55efc3564e1727d635000004?$BIGML_AUTH" \
-X PUT \
-d '{"fields": {"000000": {"label": "a longer name", "description": "an even longer description"}}}' \
-H 'content-type: application/json'
$ Updating a logistic regression's field, label, and description
Deleting a Logistic Regression
To delete a logistic regression, you need to issue a HTTP DELETE request to the logisticregression/id to be deleted.
Using curl you can do something like this to delete a logistic regression:
curl -X DELETE "https://au.bigml.io/logisticregression/55efc3564e1727d635000004?$BIGML_AUTH"
$ Deleting a logistic regression from the command line
If the request succeeds you will not see anything on the command line unless you executed the command in verbose mode. Successful DELETEs will return "204 no content" responses with no body.
Once you delete a logistic regression, it is permanently deleted. That is, a delete request cannot be undone. If you try to delete a logistic regression a second time, or a logistic regression that does not exist, you will receive a "404 not found" response.
However, if you try to delete a logistic regression that is being used at the moment, then BigML.io will not accept the request and will respond with a "400 bad request" response.
See this section for more details.
Listing Logistic Regressions
To list all the logistic regressions, you can use the logisticregression base URL. By default, only the 20 most recent logistic regressions will be returned. You can see below how to change this number using the limit parameter.
You can get your list of logistic regressions directly in your browser using your own username and API key with the following links.
https://au.bigml.io/logisticregression?$BIGML_AUTH
> Listing logistic regressions from a browser
Deepnets
Last Updated: Tuesday, 2019-01-29 16:28
A deepnet in BigML is a supervised learning method to solve classification and regression problems. Deepnets are an optimized version of Deep Neural Networks, a class of machine-learned models inspired by the neural circuitry of the human brain. In these classifiers, the input features are fed to a group of nodes called a layer. Each node is essentially a function on the input that transforms the input features into another value or collection of values. Then the entire layer transforms an input vector into a new intermediate feature vector. This new vector is fed as input to another layer of nodes. This process continues layer by layer, until we reach the final output layer of nodes, where the output is the network's prediction: an array of per-class probabilities for classification problems or a single, real value for regression problems.
The deep in Deep Neural Networks refers to the presence of more than one hidden layer; that is, more than one layer of nodes between the input and the output layers. The network architectures supported by BigML can be deep or shallow. The advantage of training deep architectures is that hidden layers have the opportunity to learn higher-level representations of the data that can be used to make correct predictions in cases where a direct mapping between input and output is difficult. For example, when classifying images of numeric digits, the input layer is raw pixels, the output layer is the probability for each digit, and the intermediate layers may learn features that represent the presence of, say, a loop or a vertical stroke.
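The layer-by-layer transformation described above can be sketched as a toy forward pass (hypothetical weights and helper names, not BigML's implementation):

```python
def forward(layers, x):
    """Feed input vector x through a list of layers; each layer is
    (weights, biases, activation), producing a new feature vector
    that becomes the next layer's input."""
    for weights, biases, activation in layers:
        x = [activation(sum(w * xi for w, xi in zip(row, x)) + b)
             for row, b in zip(weights, biases)]
    return x

relu = lambda v: max(0.0, v)
identity = lambda v: v
```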
Network Search
Deep Neural Networks are notoriously sensitive to the chosen topology and the algorithm used to optimize the parameters thereof. This sensitivity means that hand-tuning the topology and optimization algorithm can be difficult and time-consuming, as the number of choices that lead to poor networks typically vastly outnumbers the choices that lead to good ones.
To combat this problem, BigML offers first-class support for automatic network topology search and parameter optimization. The algorithm BigML uses is a variant of the hyperband algorithm. Instead of selecting candidates for evaluation at random, however, we use an acquisition strategy based on Bayesian parameter optimization.
BigML.io allows you to create, retrieve, update, and delete your deepnets. You can also list all of your deepnets.
Jump to:
- Deepnet Base URL
- Creating a Deepnet
- Deepnet Arguments
- Retrieving a Deepnet
- Deepnet Properties
- Filtering and Paginating Fields from a Deepnet
- Updating a Deepnet
- Deleting a Deepnet
- Listing Deepnets
Deepnet Base URL
You can use the following base URL to create, retrieve, update, and delete deepnets. https://au.bigml.io/deepnet
Deepnet base URL
All requests to manage your deepnets must use HTTPS and be authenticated using your username and API key to verify your identity. See this section for more details.
Creating a Deepnet
To create a new deepnet, you need to POST to the deepnet base URL an object containing at least the dataset/id that you want to use to create the deepnet. The content-type must always be "application/json".
POST https://au.bigml.io/deepnet?username=alfred;api_key=79138a622755a2383660347f895444b1eb927730; HTTP/1.1
Host: bigml.io
Content-Type: application/json
Creating deepnet definition
curl "https://au.bigml.io/deepnet?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"dataset": "dataset/50e8d4f03c19202d91000004"}'
> Creating a deepnet
BigML.io will return the newly created deepnet if the request succeeded.
{
"balance_objective": false,
"category": 0,
"clones": 0,
"code": 201,
"columns": 0,
"configuration": null,
"configuration_status": false,
"created": "2017-09-12T13:17:16.903890",
"credits": 0.017578125,
"credits_per_prediction": 0,
"dataset": "dataset/59b7ddac1f386f1e11000001",
"dataset_field_types": {
"categorical": 1,
"datetime": 0,
"effective_fields": 5,
"items": 0,
"numeric": 4,
"preferred": 5,
"text": 0,
"total": 5
},
"dataset_status": true,
"dataset_type": 0,
"deepnet": {
"deepnet_seed": "2c249dda00fbf54ab4cdd850532a584f286af5b6",
"missing_numerics": true,
"search": true
},
"description": "",
"excluded_fields": [],
"fields_meta": {
"count": 0,
"limit": 1000,
"offset": 0,
"total": 0
},
"input_fields": [],
"locale": "en-US",
"max_columns": 5,
"max_rows": 150,
"name": "iris",
"name_options": "",
"number_of_batchpredictions": 0,
"number_of_evaluations": 0,
"number_of_predictions": 0,
"objective_field": "000004",
"objective_field_name": "species",
"objective_field_type": "categorical",
"objective_fields": [
"000004"
],
"ordering": 0,
"out_of_bag": false,
"price": 0,
"private": true,
"project": null,
"range": [
1,
150
],
"replacement": false,
"resource": "deepnet/50ef57043c19208c50000021",
"rows": 150,
"sample_rate": 1,
"shared": false,
"short_url": "",
"size": 4608,
"source": "source/59b7dd9f1f386f1e13000000",
"source_status": true,
"status": {
"code": 1,
"message": "The deepnet creation request has been queued and will be processed soon"
},
"subscription": true,
"tags": [
"species"
],
"type": 0,
"updated": "2017-09-12T13:17:16.903978",
"white_box": false
}
< Example deepnet JSON response
Deepnet Arguments
In addition to the dataset, you can also POST the following arguments, and like models, you can use weights to deal with imbalanced datasets. Click here to find more information about weights.
| Argument | Type | Description |
|---|---|---|
|
balance_objective
optional |
Boolean, default is false |
Whether to balance classes proportionally to their category counts or not. For more information, see the Automatic Balancing section.
Example: true |
|
batch_normalization
optional |
Boolean |
Specifies whether to normalize the outputs of a network before being passed to the activation function or not.
Example: true |
|
category
optional |
Integer, default is the category of the dataset |
The category that best describes the deepnet. See the category codes for the complete list of categories.
Example: 1 |
| dataset | String |
A valid dataset/id.
Example: dataset/4f66a80803ce8940c5000006 |
| deepnet_seed | String | A string to generate a deterministic ordering of training data, the initial network weights, and the behavior of dropout during training. |
|
default_numeric_value
optional |
String |
It accepts any of the following strings to substitute missing numeric values across all the numeric fields in the dataset: "mean", "median", "minimum", "maximum", "zero"
Example: "median" |
|
description
optional |
String |
A description of the deepnet up to 8192 characters long.
Example: "This is a description of my new deepnet" |
|
dropout_rate
optional |
Float, default is 0 |
A number between 0 and 1 specifying the rate at which to drop weights during training to control overfitting.
Example: 0.2 |
|
excluded_fields
optional |
Array, default is [], an empty list. None of the fields in the dataset is excluded. |
Specifies the fields that won't be included in the models of the deepnet
Example:
|
|
fields
optional |
Object, default is {}, an empty dictionary. That is, no names or preferred statuses are changed. |
This can be used to change the names of the fields in the deepnet with respect to the original names in the dataset, or to tell BigML that certain fields should be preferred. Include an entry keyed by the field id generated in the source for each field whose name you want updated.
Example:
|
|
hidden_layers
optional |
Array |
A list of maps describing the number and type of layers in the network (other than the output layer, which is determined by the type of learning problem). The available keys are:
|
|
input_fields
optional |
Array, default is []. All the fields in the dataset |
Specifies the fields to be considered to create the deepnet.
Example:
|
|
learn_residuals
optional |
Boolean |
Specifies whether alternate layers should learn a representation of the residuals for a given layer rather than the layer itself or not.
Example: true |
|
learning_rate
optional |
Float, default is 0.01 |
A number between 0 and 1 specifying the learning rate.
Example: 0.3 |
|
max_iterations
optional |
Integer |
A number between 100 and 100000 for the maximum number of gradient steps to take during the optimization.
Example: 1000 |
|
max_training_time
optional |
Integer, default is 1800 |
The maximum training time, in seconds, for which to train the network. Training may stop early if performance on a holdout set does not show improvement. Use -1 to deactivate the max training time.
Example: 3600 |
|
missing_numerics
optional |
Boolean, default is true |
Whether to create an additional binary predictor for each numeric field, denoting a missing value. If false, these predictors are not created, and rows containing missing numeric values are dropped.
Example: false |
|
name
optional |
String, default is dataset's name |
The name you want to give to the new deepnet.
Example: "my new deepnet" |
|
number_of_hidden_layers
optional |
Integer, default is 10 |
The number of hidden layers to use in the network. If the number is greater than the length of the list of hidden_layers, the list is cycled until the desired number is reached. If the number is smaller than the length of the list of hidden_layers, the list is shortened.
Example: 3 |
|
number_of_model_candidates
optional |
Integer, default is 128 |
An integer specifying the number of models to try during the model search, up to a maximum of 200 candidates.
Example: 100 |
|
objective_field
optional |
String, default is the id of the last field in the dataset |
Specifies the id of the field that the deepnet will predict.
Example: "000003" |
|
objective_fields
optional |
Array, default is an array with the id of the last field in the dataset |
Specifies the id of the field that you want to predict. Even though this is an array, BigML.io only accepts one objective field in the current version. If both objective_field and objective_fields are specified, objective_field takes precedence.
Example: ["000003"] |
|
objective_weights
optional |
Array |
A list of category and weight pairs. One per objective class. For more information, see the Objective Weights section.
Example:
|
|
optimizer
optional |
Object |
Algorithm configuration used to train the deepnet. A map keyed by algorithm name, where the available algorithms are adam, adagrad, ftrl, momentum and rms_prop. Each entry is a map containing the specific parameters for the algorithm. See the Optimizer Object for more details.
Example:
|
|
ordering
optional |
Integer, default is 0 (deterministic). |
Specifies the type of ordering followed to build the models of the deepnet. There are three different types that you can specify:
Example: 1 |
|
out_of_bag
optional |
Boolean, default is false |
Setting this parameter to true will return a sequence of the out-of-bag instances instead of the sampled instances. See the Section on Sampling for more details.
Example: true |
|
project
optional |
String |
The project/id you want the deepnet to belong to.
Example: "project/54d98718f0a5ea0b16000000" |
|
range
optional |
Array, default is [1, max rows in the dataset] |
The range of successive instances to build the deepnet.
Example: [1, 150] |
|
regression_weight_ratio
optional |
Float |
A strictly positive value describing the learning penalty for above-objective errors in relation to below-objective errors. That is, if this value is 10, prediction errors above the objective will be penalized 10 times more during learning.
Example: 10 |
|
replacement
optional |
Boolean, default is false |
Whether sampling should be performed with or without replacement. See the Section on Sampling for more details.
Example: true |
|
sample_rate
optional |
Float, default is 1.0 |
A real number between 0 and 1 specifying the sample rate. See the Section on Sampling for more details.
Example: 0.5 |
|
search
optional |
Boolean |
During the deepnet creation, BigML trains and evaluates over all possible network configurations, returning the best networks found for the problem. The final deepnet returned by the search is a compromise between the top n networks found in the search. Since this option builds several networks, it may be significantly slower than the suggest_structure technique. When it is set to true, only the following modeling properties are applied: default_numeric_value, input_fields, excluded_fields, max_iterations, max_training_time, missing_numerics, sample_rate, objective_weights, and weight_field.
Example: true |
|
seed
optional |
String |
A string to be hashed to generate deterministic samples. See the Section on Sampling for more details.
Example: "MySample" |
|
suggest_structure
optional |
Boolean |
An alternative to the search technique that is usually a more efficient way to quickly train and iterate on deepnets, and can reach similar results. BigML has learned some general rules about what makes one network structure better than another for a given dataset. Given your dataset, BigML will automatically suggest a structure and a set of parameter values that are likely to perform well. This option builds only one network. When it is set to true, only the following modeling properties are applied: default_numeric_value, input_fields, excluded_fields, max_iterations, max_training_time, missing_numerics, sample_rate, objective_weights, and weight_field.
Example: true |
|
tags
optional |
Array of Strings |
A list of strings that help classify and index your deepnet.
Example: ["best customers", "2018"] |
|
tree_embedding
optional |
Boolean |
Specifies whether to learn a tree-based representation of the data as engineered features along with the raw features, essentially by learning trees over slices of the input space and a small amount of the training data. The theory is that these engineered features will linearize obvious non-linear dependencies before training begins, and so make learning proceed more quickly.
Example: true |
|
webhook
optional |
Object |
A webhook url and an optional secret phrase. See the Section on Webhooks for more details.
Example:
|
|
weight_field
optional |
String |
Any numeric field with no negative or missing values is valid as a weight field. Each instance will be weighted individually according to the weight field's value. For more information, see the Weight Field section.
Example: "000005" |
Depending on the descent algorithm chosen and the topology of the network, certain other parameters may apply.
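The cycling rule described for number_of_hidden_layers can be sketched locally; this is an illustration of the documented behavior, not BigML code, and the function name is made up:

```python
from itertools import cycle, islice

def resolve_hidden_layers(hidden_layers, number_of_hidden_layers):
    """Cycle or shorten the hidden_layers list to the requested length,
    mirroring the rule described for number_of_hidden_layers above."""
    return list(islice(cycle(hidden_layers), number_of_hidden_layers))

layers = [{"activation_function": "tanh"}, {"activation_function": "relu"}]
# With 3 requested layers the 2-item list is cycled: tanh, relu, tanh.
# With 1 requested layer the list is shortened to just tanh.
three = resolve_hidden_layers(layers, 3)
one = resolve_hidden_layers(layers, 1)
```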
You can also use curl to customize a new deepnet from the command line. For example, to create a new deepnet named "my deepnet" that uses the "adam" descent algorithm:
curl "https://au.bigml.io/deepnet?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"dataset": "dataset/4f66a80803ce8940c5000006",
"optimizer": {
"adam": {
"beta1": 0.9,
"beta2": 0.999,
"epsilon": 0.11
}
},
"name": "my deepnet"
}'
> Creating a customized deepnet
If you do not specify a name, the dataset's name will be assigned to the new deepnet. If you do not specify a range of instances, the complete set of instances in the dataset will be used. If you do not specify any input fields, all the preferred input fields in the dataset will be included, and if you do not specify an objective field, the last field in your dataset will be considered the objective field.
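The same request can be assembled in any language that speaks HTTPS; a minimal Python sketch that builds the URL, headers, and body of the curl example above (it stops short of actually sending the request):

```python
import json
import os

# BIGML_AUTH holds "username=...;api_key=..." as set up in the
# Authentication section; the dataset id is the example placeholder.
url = "https://au.bigml.io/deepnet?" + os.environ.get("BIGML_AUTH", "")
headers = {"content-type": "application/json"}
body = json.dumps({
    "dataset": "dataset/4f66a80803ce8940c5000006",
    "optimizer": {"adam": {"beta1": 0.9, "beta2": 0.999, "epsilon": 0.11}},
    "name": "my deepnet",
})
# e.g. requests.post(url, headers=headers, data=body)
```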
Retrieving a Deepnet
Each deepnet has a unique identifier in the form "deepnet/id" where id is a string of 24 alpha-numeric characters that you can use to retrieve the deepnet.
To retrieve a deepnet with curl:
curl "https://au.bigml.io/deepnet/50ef57043c19208c50000021?$BIGML_AUTH"
$ Retrieving a deepnet from the command line
You can also use your browser to visualize the deepnet using the full BigML.io URL or pasting the deepnet/id into the BigML.com.au dashboard.
Deepnet Properties
Once a deepnet has been successfully created it will have the following properties.
| Property | Type | Description |
|---|---|---|
|
category
filterable, sortable, updatable |
Integer | One of the categories in the table of categories that help classify this resource according to the domain of application. |
| code | Integer | HTTP status code. This will be 201 upon successful creation of the deepnet and 200 afterwards. Check the code that comes with the status attribute to verify that the deepnet creation completed without errors. |
|
columns
filterable, sortable |
Integer | The number of fields in the deepnet. |
|
composites
filterable, sortable |
Array of Strings | The list of composite ids that reference this model. |
|
created
filterable, sortable |
ISO-8601 Datetime. | This is the date and time in which the deepnet was created with microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are provided in Coordinated Universal Time (UTC). |
|
credits
filterable, sortable |
Float | The number of credits it cost you to create this deepnet. |
|
dataset
filterable, sortable |
String | The dataset/id that was used to build the deepnet. |
| dataset_field_types | Object | A dictionary that informs about the number of fields of each type in the dataset used to create the deepnet. It has an entry per each field type (categorical, datetime, numeric, and text), an entry for preferred fields, and an entry for the total number of fields. |
|
dataset_status
filterable, sortable |
Boolean | Whether the dataset is still available or has been deleted. |
| deepnet | Object | All the information that you need to recreate or use the deepnet on your own. See here for more details. |
|
description
updatable |
String | A text describing the deepnet. It can contain restricted markdown to decorate the text. |
| excluded_fields | Array | The list of fields' ids that were excluded to build the deepnet. |
| fields_meta | Object | A dictionary with meta information about the fields dictionary. It specifies the total number of fields, the current offset and limit, and the number of fields (count) returned. |
|
fusions
filterable, sortable |
Array of Strings | The list of fusion ids that reference this model. |
| importance | Object | Provides a measure of how important an input field is relative to the others to predict the objective field. Each field is normalized to take values between zero and one. |
| input_fields | Array | The list of input fields' ids used to build the models of the deepnet. |
| locale | String | The dataset's locale. |
|
max_columns
filterable, sortable |
Integer | The total number of fields in the dataset used to build the deepnet. |
|
max_rows
filterable, sortable |
Integer | The maximum number of instances in the dataset that can be used to build the deepnet. |
|
name
filterable, sortable, updatable |
String | The name of the deepnet, as you provided it or based on the name of the dataset by default. |
|
number_of_batchpredictions
filterable, sortable |
Integer | The current number of batch predictions that use this deepnet. |
|
number_of_evaluations
filterable, sortable |
Integer | The current number of evaluations that use this deepnet. |
|
number_of_predictions
filterable, sortable |
Integer | The current number of predictions that use this deepnet. |
| objective_field | String |
Specifies the id of the field that the deepnet predicts.
Example: "000003" |
| objective_field_name | String | The name of the objective field in the deepnet. |
| objective_field_type | String | The type of the objective field in the deepnet. |
| objective_fields | Array | Specifies the list of ids of the fields that the deepnet predicts. Even though this is an array, BigML.io only accepts one objective field in the current version. |
|
optiml
filterable, sortable |
String | The optiml/id that created this deepnet. |
|
optiml_status
filterable, sortable |
Boolean | Whether the OptiML is still available or has been deleted. |
|
ordering
filterable, sortable |
Integer |
The order used to choose instances from the dataset to build the models of the deepnet. There are three different types:
|
|
out_of_bag
filterable, sortable |
Boolean | Whether the out-of-bag instances were used to create the deepnet instead of the sampled instances. |
|
price
filterable, sortable, updatable |
Float | The price other users must pay to clone your deepnet. |
|
private
filterable, sortable, updatable |
Boolean | Whether the deepnet is public or not. |
|
project
filterable, sortable, updatable |
String | The project/id the resource belongs to. |
| range | Array | The range of instances used to build the models of the deepnet. |
| regression_weight_ratio | Float | The learning penalty for above-objective errors in relation to below-objective errors. |
|
replacement
filterable, sortable |
Boolean | Whether the instances sampled to build the deepnet were selected using replacement or not. |
| resource | String | The deepnet/id. |
|
rows
filterable, sortable |
Integer | The total number of instances used to build the models of the deepnet. |
|
sample_rate
filterable, sortable |
Float | The sample rate used to select instances from the dataset to build the models of the deepnet. |
|
seed
filterable, sortable |
String | The string that was used to generate the sample. |
|
shared
filterable, sortable, updatable |
Boolean | Whether the deepnet is shared using a private link or not. |
|
size
filterable, sortable |
Integer | The number of bytes of the dataset that were used to create this deepnet. |
|
source
filterable, sortable |
String | The source/id that was used to build the dataset. |
|
source_status
filterable, sortable |
Boolean | Whether the source is still available or has been deleted. |
| status | Object | A description of the status of the deepnet. It includes a code, a message, and some extra information. See the table below. |
|
subscription
filterable, sortable |
Boolean | Whether the deepnet was created using a subscription plan or not. |
|
tags
filterable, updatable |
Array of Strings | A list of user tags that can help classify and index this resource. |
|
type
filterable, sortable |
Integer | Reserved for future use. |
|
updated
filterable, sortable |
ISO-8601 Datetime. | This is the date and time in which the deepnet was updated with microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are provided in Coordinated Universal Time (UTC). |
| webhook | Object | A webhook url and an optional secret phrase. See the Section on Webhooks for more details. |
|
white_box
filterable, sortable |
Boolean | Whether the deepnet is publicly shared as a white-box. |
A Deepnet Object has the following properties:
| Property | Type | Description |
|---|---|---|
| batch_normalization | Boolean | Whether to normalize the outputs of a network before being passed to the activation function or not |
| deepnet_seed | String | A string to generate a deterministic ordering of training data, the initial network weights, and the behavior of dropout during training. |
| dropout_rate | Float | A number between 0 and 1 specifying the rate at which to drop weights during training to control overfitting. |
| fields | Object | A dictionary with an entry per field in the dataset used to build the deepnet. Fields are paginated according to the fields_meta attribute. Each entry includes the column number in the original source, the name of the field, the type of the field, and the summary. See this Section for more details. |
| hidden_layers | Array | A list of maps describing the number and type of layers in the network |
| learn_residuals | Boolean | Whether alternate layers should learn a representation of the residuals for a given layer rather than the layer itself or not. |
| learning_rate | Float | A number between 0 and 1 specifying the learning rate. |
| max_training_time | Integer | The maximum wall-clock training time, in seconds, for which to train the network. |
| missing_numerics | Boolean | Whether to create an additional binary predictor for each numeric field, denoting a missing value. If false, these predictors are not created, and rows containing missing numeric values are dropped. |
| network | Object | Complete information of the network. See this Section for more details. |
| number_of_hidden_layers | Integer | The number of hidden layers to use in the network. |
| number_of_iterations | Integer | The number of iterations used in the network. |
| optimizer | Object | Algorithm configuration used to calculate deepnet. The key is the name of the algorithm used. See this Section for more details. |
| tree_embedding | Boolean | Whether to learn a tree-based representation of the data as engineered features along with the raw features, essentially by learning trees over slices of the input space and a small amount of the training data. The theory is that these engineered features will linearize obvious non-linear dependencies before training begins, and so make learning proceed more quickly. |
A Network Object has the following properties:
Deepnet Status
Creating a deepnet is a process that can take just a few seconds or a few days, depending on the size of the dataset used as input and on the workload of BigML's systems. The deepnet goes through a number of states until it is fully completed. Through the status field in the deepnet you can determine when the deepnet has been fully processed and is ready to be used to create predictions. These are the properties that a deepnet's status has:
| Property | Type | Description |
|---|---|---|
| code | Integer | A status code that reflects the status of the deepnet creation. It can be any of those that are explained here. |
| elapsed | Integer | Number of milliseconds that BigML.io took to process the deepnet. |
| message | String | A human readable message explaining the status. |
| progress | Float, between 0 and 1 | How far BigML.io has progressed building the deepnet. |
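A typical client polls this status object until the code reaches the finished state (5, as in the completed example that follows) or an error code. A sketch, assuming a get_deepnet callable that returns the decoded JSON of the resource, and assuming negative codes signal errors:

```python
import time

FINISHED = 5  # status code of a fully created deepnet

def wait_until_ready(get_deepnet, deepnet_id, interval=2.0, max_tries=100):
    """Poll a deepnet's status until it finishes or reports an error.

    get_deepnet is any callable returning the decoded JSON of the resource;
    with a real client it would issue a GET on the deepnet/id.
    """
    for _ in range(max_tries):
        status = get_deepnet(deepnet_id)["status"]
        if status["code"] == FINISHED:
            return status
        if status["code"] < 0:  # assumption: negative codes are errors
            raise RuntimeError(status.get("message", "deepnet creation failed"))
        time.sleep(interval)
    raise TimeoutError("deepnet not ready after polling")
```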
Once a deepnet has been successfully created, it will look like:
{
"balance_objective": false,
"category": 0,
"clones": 0,
"code": 200,
"columns": 5,
"configuration": null,
"configuration_status": false,
"created": "2017-09-11T13:05:52.264000",
"credits": 0.0,
"credits_per_prediction": 0.0,
"dataset": "dataset/59afcacf1f386f40ac00030a",
"dataset_field_types": {
"categorical": 1,
"datetime": 0,
"effective_fields": 5,
"items": 0,
"numeric": 4,
"preferred": 5,
"text": 0,
"total": 5
},
"dataset_status": false,
"dataset_type": 0,
"deepnet": {
"batch_normalization": false,
"deepnet_seed": "2c249dda00fbf54ab4cdd850532a584f286af5b6",
"dropout_rate": 0.0,
"fields": {
"000000": {
"column_number": 0,
"datatype": "double",
"name": "sepal length",
"optype": "numeric",
"order": 0,
"preferred": true,
"summary": {
"bins": [
[4.3, 1],
[4.425, 4],
[4.6, 4],
[4.77143, 7],
[4.9625, 16],
[5.1, 9],
[5.2, 4],
[5.3, 1],
[5.4, 6],
[5.5, 7],
[5.6, 6],
[5.7, 8],
[5.8, 7],
[5.9, 3],
[6, 6],
[6.1, 6],
[6.2, 4],
[6.3, 9],
[6.4, 7],
[6.5, 5],
[6.6, 2],
[6.7, 8],
[6.8, 3],
[6.9, 4],
[7, 1],
[7.1, 1],
[7.2, 3],
[7.3, 1],
[7.4, 1],
[7.6, 1],
[7.7, 4],
[7.9, 1]
],
"exact_histogram": {
"populations": [1, 4, 6, 11, 19, 5, 13, 14, 10, 12, 13, 12, 10, 7, 2, 4, 1, 5, 1],
"start": 4.2,
"width": 0.2
},
"kurtosis": -0.57357,
"maximum": 7.9,
"mean": 5.84333,
"median": 5.8,
"minimum": 4.3,
"missing_count": 0,
"population": 150,
"skewness": 0.31175,
"standard_deviation": 0.82807,
"sum": 876.5,
"sum_squares": 5223.85,
"variance": 0.68569
}
},
"000001": {
"column_number": 1,
"datatype": "double",
"name": "sepal width",
"optype": "numeric",
"order": 1,
"preferred": true,
"summary": {
"counts": [
[2, 1],
[2.2, 3],
[2.3, 4],
[2.4, 3],
[2.5, 8],
[2.6, 5],
[2.7, 9],
[2.8, 14],
[2.9, 10],
[3, 26],
[3.1, 11],
[3.2, 13],
[3.3, 6],
[3.4, 12],
[3.5, 6],
[3.6, 4],
[3.7, 3],
[3.8, 6],
[3.9, 2],
[4, 1],
[4.1, 1],
[4.2, 1],
[4.4, 1]
],
"exact_histogram": {
"populations": [1, 7, 11, 14, 24, 37, 19, 18, 7, 8, 2, 1, 1],
"start": 2,
"width": 0.2
},
"kurtosis": 0.18098,
"maximum": 4.4,
"mean": 3.05733,
"median": 3,
"minimum": 2,
"missing_count": 0,
"population": 150,
"skewness": 0.31577,
"standard_deviation": 0.43587,
"sum": 458.6,
"sum_squares": 1430.4,
"variance": 0.18998
}
},
"000002": {
"column_number": 2,
"datatype": "double",
"name": "petal length",
"optype": "numeric",
"order": 2,
"preferred": true,
"summary": {
"bins": [
[1, 1],
[1.16667, 3],
[1.3, 7],
[1.4, 13],
[1.5, 13],
[1.6, 7],
[1.7, 4],
[1.9, 2],
[3, 1],
[3.3, 2],
[3.5, 2],
[3.6, 1],
[3.75, 2],
[3.9, 3],
[4.0375, 8],
[4.23333, 6],
[4.46667, 12],
[4.6, 3],
[4.74444, 9],
[4.94444, 9],
[5.1, 8],
[5.25, 4],
[5.4, 2],
[5.56667, 9],
[5.75, 6],
[5.95, 4],
[6.1, 3],
[6.3, 1],
[6.4, 1],
[6.6, 1],
[6.7, 2],
[6.9, 1]
],
"exact_histogram": {
"populations": [2, 9, 26, 11, 2, 0, 0, 0, 0, 0, 1, 2, 2, 2, 4, 8, 6, 12, 8, 9, 12, 4, 5, 9, 5, 5, 1, 1, 3, 1],
"start": 1,
"width": 0.2
},
"kurtosis": -1.39554,
"maximum": 6.9,
"mean": 3.758,
"median": 4.35,
"minimum": 1,
"missing_count": 0,
"population": 150,
"skewness": -0.27213,
"standard_deviation": 1.7653,
"sum": 563.7,
"sum_squares": 2582.71,
"variance": 3.11628
}
},
"000003": {
"column_number": 3,
"datatype": "double",
"name": "petal width",
"optype": "numeric",
"order": 3,
"preferred": true,
"summary": {
"counts": [
[0.1, 5],
[0.2, 29],
[0.3, 7],
[0.4, 7],
[0.5, 1],
[0.6, 1],
[1, 7],
[1.1, 3],
[1.2, 5],
[1.3, 13],
[1.4, 8],
[1.5, 12],
[1.6, 4],
[1.7, 2],
[1.8, 12],
[1.9, 5],
[2, 6],
[2.1, 6],
[2.2, 3],
[2.3, 8],
[2.4, 3],
[2.5, 3]
],
"exact_histogram": {
"populations": [5, 36, 8, 1, 0, 10, 18, 20, 6, 17, 12, 11, 6],
"start": 0,
"width": 0.2
},
"kurtosis": -1.33607,
"maximum": 2.5,
"mean": 1.19933,
"median": 1.3,
"minimum": 0.1,
"missing_count": 0,
"population": 150,
"skewness": -0.10193,
"standard_deviation": 0.76224,
"sum": 179.9,
"sum_squares": 302.33,
"variance": 0.58101
}
},
"000004": {
"column_number": 4,
"datatype": "string",
"name": "species",
"optype": "categorical",
"order": 4,
"preferred": true,
"summary": {
"categories": [
["Iris-setosa", 50],
["Iris-versicolor", 50],
["Iris-virginica", 50]
],
"missing_count": 0
},
"term_analysis": {
"enabled": true
}
}
},
"hidden_layers": [{
"activation_function": "tanh",
"number_of_nodes": 64,
"type": "fully_connected"
}],
"learn_residuals": false,
"learning_rate": 0.01,
"max_training_time": 1800,
"missing_numerics": true,
"network": {
"layers": [{
"activation_function": "tanh",
"mean": null,
"number_of_nodes": 64,
"offset": [0.38671, -0.24978, -0.42769, -0.28795, 0.29766, 0.29695, 0.54235, 0.30059, 0.30866, -0.39943, 0.26355, 0.17211, 0.31182, -0.527, 0.30357, 0.05006, 0.32461, -0.43207, 0.23705, 0.41211, -0.4759, -0.30301, 0.15335, 0.31888, -0.43167, -0.55408, 0.13209, -0.33713, -0.31771, -0.5171, -0.30194, 0.31916, 0.54687, 0.11455, -0.52625, 0.07541, -0.15756, -0.02931, 0.21028, 0.50604, 0.30064, -0.19009, 0.53916, 0.20068, 0.22989, -0.4011, 0.17585, -0.3765, 0.02567, 0.05753, -0.24421, -0.32434, -0.21159, -0.49259, 0.29906, 0.12144, 0.23939, -0.29688, -0.28445, 0.00983, 0.55722, 0.18689, -0.27728, -0.30094],
"residuals": false,
"scale": null,
"stdev": null,
"weights": [
[0.19423, 0.1048, -0.00283, -0.33144],
[-0.24798, 0.01355, 0.44258, 0.17983],
[0.21272, 0.0052, 0.37649, 0.07925],
[0.17944, -0.2989, 0.36183, 0.07605],
[0.16984, -0.31996, -0.43751, -0.13637],
[0.17172, -0.04742, -0.00061, -0.03181],
[-0.01595, 0.36889, -0.15353, -0.318],
[0.25629, -0.23585, -0.12444, -0.26599],
[0.08813, -0.14506, 0.12219, 0.31101],
[0.03559, -0.04844, -0.12022, 0.40345],
[-0.1702, 0.03361, -0.26967, -0.27892],
[-0.31641, 0.09561, -0.33676, -0.01303],
[0.32036, -0.19056, 0.45731, 0.4989],
[0.05473, -0.06542, 0.30336, 0.25507],
[0.17512, 0.15156, -0.26685, -0.06631],
[0.23635, -0.6111, 0.46263, 0.12851],
[-0.18067, -0.12754, -0.4317, -0.04669],
[0.27632, -0.13466, 0.255, 0.17259],
[0.01425, -0.46286, -0.0375, 0.56685],
[0.27768, -0.23661, 0.20264, 0.3679],
[0.01691, -0.14277, 0.49131, 0.14108],
[0.16813, -0.16617, 0.14779, -0.09744],
[0.02181, -0.01546, 0.13821, -0.19713],
[-0.1535, 0.31325, -0.48685, -0.0536],
[-0.18646, -0.03273, 0.33357, 0.17382],
[-0.24405, 0.01205, 0.50213, 0.22879],
[0.15792, -0.5294, 0.0364, 0.48721],
[0.26878, -0.29296, 0.30611, 0.09294],
[0.14125, 0.05232, 0.69005, -0.11659],
[0.07611, -0.34235, 0.3798, 0.23363],
[0.11782, -0.52565, 0.26928, 0.19098],
[0.08944, 0.08031, -0.58732, -0.0472],
[-0.04493, 0.56618, -0.47086, -0.20158],
[0.22011, -0.38234, 0.10033, 0.39476],
[-0.16547, -0.21455, 0.63947, 0.15051],
[-0.27117, -0.17877, 0.26167, -0.08859],
[-0.21074, 0.24451, 0.14624, -0.07856],
[0.30052, -0.43648, -0.06441, 0.28716],
[-0.03409, 0.49041, -0.46682, -0.02884],
[0.32002, 0.0657, -0.66487, -0.17642],
[0.21624, 0.161, -0.15143, 0.0089],
[-0.25523, -0.04593, 0.20621, 0.01248],
[-0.10583, 0.04929, -0.45652, -0.23213],
[0.56136, -0.30005, 0.16223, 0.5407],
[-0.06054, 0.0979, -0.12845, -0.17565],
[-0.02054, -0.05112, 0.56909, 0.05941],
[0.39788, -0.37221, 0.08516, -0.19105],
[-0.22187, -0.03848, 0.41733, 0.18253],
[-0.29211, -0.03745, -0.16442, -0.15011],
[-0.2841, -0.0211, 0.11573, 0.07856],
[-0.06219, 0.16026, 0.49513, -0.0062],
[-0.30435, 0.1989, 0.21829, 0.25321],
[-0.28073, 0.26935, 0.0871, 0.18554],
[0.0592, -0.08107, 0.55327, 0.18361],
[0.24187, -0.34371, -0.28716, -0.22057],
[0.31778, -0.0277, 0.49531, 0.62134],
[0.1091, 0.23782, 0.02907, -0.2462],
[0.23516, -0.68506, 0.58823, 9e-05],
[0.08714, 0.07052, 0.56947, -0.05573],
[-0.09051, 0.36996, -0.35874, -0.04268],
[0.14219, 0.07678, -0.41958, -0.22962],
[0.15848, -0.20724, 0.47177, 0.66902],
[-0.30074, 0.41649, -0.42818, -0.24543],
[-0.0815, 0.12994, 0.28759, 0.14733]
]
}, {
"activation_function": "softmax",
"mean": null,
"number_of_nodes": 3,
"offset": [0.08825, 0.1932, -0.24769],
"residuals": false,
"scale": null,
"stdev": null,
"weights": [
[0.10757, -0.28484, -0.30589, -0.19559, -0.12426, 0.00922, 0.27402, -0.01708, -0.2547, -0.23854, 0.50764, 0.45659, -0.15996, -0.35908, 0.10263, -0.0048, 0.00928, -0.43381, -0.37109, -0.1001, -0.49249, -0.10222, -0.13058, 0.14757, 0.01753, -0.48728, -0.48701, -0.38919, 0.01905, -0.45798, -0.33257, 0.17572, 0.27421, -0.45031, -0.25285, -0.11106, 0.10642, -0.56695, 0.25578, 0.05934, -0.02947, 0.04946, 0.26994, -0.34284, -0.03751, -0.60813, -0.45237, 0.02863, 0.54928, 0.12266, 0.08395, -0.01092, 0.23124, 0.02051, 0.0356, -0.42423, 0.21361, -0.14365, -0.06057, 0.46418, 0.52643, -0.2722, 0.49756, 0.07601],
[0.16784, -0.14349, -0.11428, -0.19041, 0.37042, 0.37421, 0.01132, 0.47725, 0.25896, -0.07512, -0.20457, -0.13485, 0.45898, -0.2776, 0.55579, 0.33676, 0.07631, -0.21472, 0.36777, 0.29666, -0.34176, -0.22319, 0.17605, 0.16386, -0.43964, -0.40037, 0.15492, -0.02832, -0.0892, -0.12608, 0.1502, 0.55253, 0.16121, 0.3951, -0.29855, 0.04078, -0.44434, 0.20334, -0.17905, 0.25482, 0.38478, -0.36173, 0.01093, 0.2811, -0.08769, -0.69592, 0.44414, -0.09014, -0.13102, -0.23742, -0.47325, -0.54563, -0.4642, 0.22939, 0.2955, 0.32233, 0.0523, -0.03943, -0.2411, -0.17106, 0.20159, 0.46011, -0.30942, -0.02814],
[-0.3096, 0.24373, 0.43549, 0.57319, -0.5435, -0.05673, -0.51083, -0.27403, -0.14522, 0.16572, -0.43827, -0.38511, 0.03649, 0.18996, -0.24442, 0.38867, -0.52136, 0.388, -0.12481, -0.14484, 0.63568, 0.12593, -0.18479, -0.41692, 0.43662, 0.72046, -0.0351, 0.61597, 0.47957, 0.34359, 0.43891, -0.95959, -0.1857, 0.19136, 0.30838, 0.18301, -0.02607, 0.12677, -0.69295, -0.56568, -0.19295, 0.09386, -0.28657, 0.00035, -0.32087, 0.79078, -0.06563, 0.44282, -0.1427, -0.0566, 0.56838, 0.31872, 0.12867, 0.41612, -0.22403, 0.05401, -0.20675, 0.55815, 0.44815, -0.42214, -0.66994, 0.08911, 0.01378, 0.31673]
]
}],
"output_exposition": {
"type": "categorical",
"values": ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]
},
"preprocess": [{
"index": 0,
"mean": 5.84333,
"stdev": 0.8253,
"type": "numeric"
}, {
"index": 1,
"mean": 3,
"stdev": 0.67823,
"type": "numeric"
}, {
"index": 2,
"mean": 3.758,
"stdev": 1.7594,
"type": "numeric"
}, {
"index": 3,
"mean": 0.2,
"stdev": 0.74247,
"type": "numeric"
}],
"trees": null
},
"number_of_hidden_layers": 1,
"number_of_iterations": 113,
"optimizer": {
"adam": {
"beta1": 0.9,
"beta2": 0.999,
"epsilon": 1e-08
}
},
"search": false,
"tree_embedding": false
},
"description": "",
"excluded_fields": [],
"fields_meta": {
"count": 5,
"limit": 1000,
"offset": 0,
"query_total": 5,
"total": 5
},
"input_fields": [
"000000",
"000001",
"000002",
"000003"
],
"locale": "en_US",
"max_columns": 5,
"max_rows": 150,
"name": "iris",
"name_options": "beta1=0.9, beta2=0.999, epsilon=1e-08, learning rate=0.01, max training time=1800, missing values",
"number_of_batchpredictions": 0,
"number_of_evaluations": 0,
"number_of_predictions": 0,
"objective_field": "000004",
"objective_field_name": "species",
"objective_field_type": "categorical",
"objective_fields": ["000004"],
"ordering": 0,
"out_of_bag": false,
"price": 0.0,
"private": true,
"project": null,
"range": [1, 150],
"replacement": false,
"resource": "deepnet/50ef57043c19208c50000021",
"rows": 150,
"sample_rate": 1.0,
"shared": false,
"short_url": "",
"size": 4608,
"source": "source/59afcac51f386f40ac000303",
"source_status": false,
"status": {
"code": 5,
"elapsed": 13192,
"message": "The deepnet has been created",
"progress": 1.0
},
"subscription": true,
"tags": ["species"],
"type": 0,
"updated": "2017-09-11T13:06:07.363000",
"white_box": false
}
< Example deepnet JSON response
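The network object above contains what is needed to score an input locally: each layer stores a weights matrix (one row per node), an offset vector, and an activation function, and preprocess gives the mean/stdev used to standardize numeric inputs. A minimal sketch of that forward pass, with layer semantics assumed from the shapes in the example rather than taken from a BigML reference scorer:

```python
import math

def standardize(inputs, preprocess):
    """Standardize each numeric input as (x - mean) / stdev."""
    return [(x - p["mean"]) / p["stdev"] for x, p in zip(inputs, preprocess)]

def activate(name, values):
    if name == "tanh":
        return [math.tanh(v) for v in values]
    if name == "softmax":
        exps = [math.exp(v - max(values)) for v in values]
        total = sum(exps)
        return [e / total for e in exps]
    raise ValueError("unsupported activation: " + name)

def forward(layers, inputs):
    """Run inputs through dense layers: weights @ x + offset, then activation."""
    values = inputs
    for layer in layers:
        pre = [sum(w * v for w, v in zip(row, values)) + b
               for row, b in zip(layer["weights"], layer["offset"])]
        values = activate(layer["activation_function"], pre)
    return values
```

With the layers of the example above, forward would map the four standardized iris measurements to three softmax probabilities, one per category in output_exposition.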
Filtering and Paginating Fields from a Deepnet
A deepnet might be composed of hundreds or even thousands of fields. Thus when retrieving a deepnet, it's possible to specify that only a subset of fields be retrieved, by using any combination of the following parameters in the query string (unrecognized parameters are ignored):
Since fields is a map and therefore not ordered, the returned fields contain an additional key, order, whose integer (increasing) value gives you their ordering. In all other respects, the deepnet is the same as the one you would get without any of the filtering parameters above.
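The offset/limit pagination reported in fields_meta can be driven as in this sketch, where get_fields stands in for a GET on the deepnet with offset and limit query-string parameters:

```python
def fetch_all_fields(get_fields, page_size=500):
    """Collect every field of a deepnet page by page.

    get_fields(offset, limit) stands in for a GET with ?offset=...&limit=...
    and must return a dict with "fields" and "fields_meta" entries, as in
    the responses shown above.
    """
    fields, offset = {}, 0
    while True:
        page = get_fields(offset, page_size)
        fields.update(page["fields"])
        meta = page["fields_meta"]
        offset += meta["count"]
        if meta["count"] == 0 or offset >= meta["total"]:
            return fields
```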
The fields_meta field can help you paginate fields. Its structure is as follows:
Updating a Deepnet
To update a deepnet, you need to PUT an object containing the fields that you want to update to the deepnet's base URL. The content-type must always be: "application/json". If the request succeeds, BigML.io will return an HTTP 202 response with the updated deepnet.
For example, to update a deepnet with a new name you can use curl like this:
curl "https://au.bigml.io/deepnet/50ef57043c19208c50000021?$BIGML_AUTH" \
-X PUT \
-H 'content-type: application/json' \
-d '{"name": "a new name"}'
$ Updating a deepnet's name
Deleting a Deepnet
To delete a deepnet, you need to issue a HTTP DELETE request to the deepnet/id to be deleted.
Using curl you can do something like this to delete a deepnet:
curl -X DELETE "https://au.bigml.io/deepnet/50ef57043c19208c50000021?$BIGML_AUTH"
$ Deleting a deepnet from the command line
If the request succeeds you will not see anything on the command line unless you executed the command in verbose mode. Successful DELETEs will return "204 no content" responses with no body.
Once you delete a deepnet, it is permanently deleted. That is, a delete request cannot be undone. If you try to delete a deepnet a second time, or a deepnet that does not exist, you will receive a "404 not found" response.
However, if you try to delete a deepnet that is being used at the moment, then BigML.io will not accept the request and will respond with a "400 bad request" response.
See this section for more details.
Listing Deepnets
To list all the deepnets, you can use the deepnet base URL. By default, only the 20 most recent deepnets will be returned. You can see below how to change this number using the limit parameter.
You can get your list of deepnets directly in your browser using your own username and API key with the following links.
https://au.bigml.io/deepnet?$BIGML_AUTH
> Listing deepnets from a browser
Time Series
Last Updated: Tuesday, 2019-01-29 16:28
A time series model is a supervised learning method used to forecast the future values of a field based on its previously observed values. It is used to analyze time-based data when historical patterns can explain future behavior, such as stock prices, sales forecasting, website traffic, production and inventory analysis, weather forecasting, etc. A time series model needs to be trained with time series data, i.e., a field containing a sequence of equally spaced data points in time.
BigML implements exponential smoothing to train time series models. Time series data is modeled as a level component, and it can optionally include a trend component (damped or not) and a seasonality component, as explained below:
- The simplest exponential smoothing model includes only the level component, which can be interpreted as a weighted average of the objective field's past values, where alpha is the smoothing coefficient and can take values between 0 and 1.
- To allow the modeling of time series data with a trend, a trend component (b_t) needs to be introduced along with its smoothing coefficient beta, where 0 < beta < 1.
- Trend models assume that the trend will continue indefinitely into the future, which rarely happens in the real world. Therefore, when you need to forecast a long time horizon ahead, a damping parameter (in the form of a phi coefficient) can be included to dampen the trend to a flat line at some point in the future.
- Finally, to model seasonal data, a seasonality component can be included along with its smoothing parameter gamma, by first defining a period length m.
Forecast equation
y_{t+h|t} = l_t
Level equation
l_t = alpha * y_t + (1 - alpha) * l_{t-1}
Forecast equation
y_{t+h|t} = l_t + h * b_t
Level equation
l_t = alpha * y_t + (1 - alpha) * (l_{t-1} + b_{t-1})
Trend equation
b_t = beta * (l_t - l_{t-1}) + (1 - beta) * b_{t-1}
Forecast equation
y_{t+h|t} = l_t + (phi + phi^2 + ... + phi^h) * b_t
Level equation
l_t = alpha * y_t + (1 - alpha) * (l_{t-1} + phi * b_{t-1})
Damped trend equation
b_t = beta * (l_t - l_{t-1}) + (1 - beta) * phi * b_{t-1}
Forecast equation
y_{t+h|t} = l_t + h * b_t + s_{t+h-m(k+1)}
Level equation
l_t = alpha * (y_t - s_{t-m}) + (1 - alpha) * (l_{t-1} + b_{t-1})
Trend equation
b_t = beta * (l_t - l_{t-1}) + (1 - beta) * b_{t-1}
Seasonality equation
s_t = gamma * (y_t - l_{t-1} - b_{t-1}) + (1 - gamma) * s_{t-m}
Here y_{t+h|t} denotes the forecast of the objective field h periods ahead, l_t is the level, and k is the integer part of (h - 1) / m.
The different components can have variations, e.g., the trend and seasonal components can be additive or multiplicative (see the full documentation for a detailed explanation). As a result of combining the different variations for each component, several models can be trained for a given objective field. Unless you configure a specific combination to train your model, BigML will try all possible combinations and provide several ets_models for a given objective field, each of them defining a unique recursive relationship. Note that BigML excludes certain combinations for numerical stability reasons, such as additive errors with multiplicative trends, or multiplicative error and trend with additive seasonality. BigML computes four different performance measures to select the best model for a given objective field.
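Sketched in Python, the level and trend recursions above look like the following. This is an illustration only, not BigML's implementation; the level and trend initializations here are deliberately simplistic:

```python
def simple_exponential_smoothing(series, alpha):
    """Level-only model: l_t = alpha * y_t + (1 - alpha) * l_{t-1}.
    The final level is the flat forecast for every horizon h."""
    level = series[0]  # naive initialization with the first observation
    for y in series[1:]:
        level = alpha * y + (1 - alpha) * level
    return level


def holt_linear(series, alpha, beta):
    """Additive-trend model; the h-step-ahead forecast is level + h * trend."""
    level, trend = series[0], series[1] - series[0]  # simplistic initialization
    for y in series[1:]:
        previous_level = level
        level = alpha * y + (1 - alpha) * (level + trend)
        trend = beta * (level - previous_level) + (1 - beta) * trend
    return level, trend
```

On a perfectly linear series the trend recursion recovers the slope exactly: holt_linear([1, 2, 3], 0.5, 0.5) returns (3.0, 1.0), so the two-step-ahead forecast is 3 + 2 * 1 = 5.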
You can create a time series model by selecting one or several fields from your dataset as objective fields to forecast their future values.
BigML.io allows you to create, retrieve, update, and delete your time series. You can also list all of your time series.
Jump to:
- Time Series Base URL
- Creating a Time Series
- Time Series Arguments
- Retrieving a Time Series
- Time Series Properties
- Filtering and Paginating Fields from a Time Series
- Updating a Time Series
- Deleting a Time Series
- Listing Time Series
- PMML
Time Series Base URL
You can use the following base URL to create, retrieve, update, and delete time series. https://au.bigml.io/timeseries
Time Series base URL
All requests to manage your time series must use HTTPS and be authenticated using your username and API key to verify your identity. See this section for more details.
Creating a Time Series
To create a new time series, you need to POST to the time series base URL an object containing at least the dataset/id that you want to use to create the time series. The content-type must always be "application/json".
POST /timeseries?username=alfred;api_key=79138a622755a2383660347f895444b1eb927730; HTTP/1.1
Host: bigml.io
Content-Type: application/json
Creating time series definition
curl "https://au.bigml.io/timeseries?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"dataset": "dataset/4f66a80803ce8940c5000006"}'
> Creating a time series
BigML.io will return the newly created time series if the request succeeded.
{
"category": 0,
"clones": 0,
"code": 201,
"columns": 0,
"configuration": null,
"configuration_status": false,
"created": "2017-06-27T21:23:48.717910",
"credits": 0.6619148254394531,
"dataset": "dataset/5948cc214e172744d700000f",
"dataset_field_types": {
"categorical": 0,
"datetime": 0,
"effective_fields": 47,
"items": 0,
"numeric": 5,
"preferred": 6,
"text": 1,
"total": 6
},
"dataset_status": true,
"dataset_type": 0,
"description": "",
"fields_meta": {
"count": 0,
"limit": 1000,
"offset": 0,
"total": 0
},
"locale": "en-US",
"max_columns": 6,
"max_rows": 2519,
"name": "product sales",
"name_options": "",
"number_of_evaluations": 0,
"number_of_forecasts": 0,
"number_of_public_forecasts": 0,
"objective_field_name": null,
"objective_field_type": null,
"price": 0,
"private": true,
"project": null,
"range": [
1,
2519
],
"resource": "timeseries/55efc3564e17270d5b611004",
"rows": 2519,
"shared": false,
"short_url": "",
"size": 173517,
"source": "source/5948cc164e172744d700000c",
"source_status": true,
"status": {
"code": 1,
"message": "The time series creation request has been queued and will be processed soon"
},
"subscription": true,
"tags": [],
"time_series": {
"all_numeric_objectives": true,
"datasets": {}
},
"type": 0,
"updated": "2017-06-27T21:23:48.717989",
"white_box": false
}
< Example time series JSON response
Time Series Arguments
In addition to the dataset, you can also POST the following arguments.
| Argument | Type | Description |
|---|---|---|
|
all_numeric_objectives
optional |
Boolean, default is false |
Whether to include all numeric fields in the input dataset in the objective fields list
Example: true |
|
category
optional |
Integer, default is the category of the dataset |
The category that best describes the timeseries. See the category codes for the complete list of categories.
Example: 1 |
|
damped_trend
optional |
Boolean |
Whether to use a damped trend when trend is either 1 (additive) or 2 (multiplicative) to find the best models. If trend is 0 (none), then only damped_trend=false models will be considered. If omitted, both damped and non-damped trends will be tried and the best models returned.
Example: true |
| dataset | String |
A valid dataset/id.
Example: dataset/4f66a80803ce8940c5000006 |
|
default_numeric_value
optional |
String |
It accepts any of the following strings to substitute missing numeric values across all the numeric fields in the dataset: "mean", "median", "minimum", "maximum", "zero"
Example: "median" |
|
description
optional |
String |
A description of the timeseries up to 8192 characters long.
Example: "This is a description of my new timeseries" |
|
error
optional |
Integer |
Any of the following values to specify types of ETS models: 1 (additive), 2 (multiplicative). Multiplicative error models are only available when the objective field has strictly positive values (greater than 0). Ets_models with additive errors and multiplicative trends or seasonalities are not explored due to numeric stability issues.
Example: 2 |
|
field_parameters
optional |
Object |
Field-specific ETS parameters used to optionally override the top-level parameters for a specific objective field. A map keyed by field id, where each entry is a map containing the parameters damped_trend, error, period, seasonality, time_range, and trend for the corresponding field.
Example:
|
|
fields
optional |
Object, default is {}, an empty dictionary. That is, no names or preferred statuses are changed. |
This can be used to change the names of the fields in the timeseries with respect to the original names in the dataset or to tell BigML that certain fields should be preferred. An entry keyed with the field id generated in the source for each field that you want the name updated.
Example:
|
|
horizon
optional |
Integer, default is 50 |
The timeseries model includes a forecast with intervals for all computed ETS models with this horizon. Max value: 400.
Example: 100 |
|
name
optional |
String, default is dataset's name |
The name you want to give to the new timeseries.
Example: "my new timeseries" |
|
objective_field
optional |
String, default is the id of the last numeric field in the dataset |
The time series field you want to forecast. The type of the field must be numeric. Note that BigML assumes your instances are correctly ordered chronologically in your dataset and equally spaced in time.
Example: "000003" |
|
objective_fields
optional |
Array, default is an array with the id of the last numeric field in the dataset |
One or more identifiers of numeric fields for which to fit ETS time series models. Non-numeric fields will be ignored, and if not present, the right-most valid field in the dataset will be used.
Example: ["000001", "000003"] |
|
period
optional |
Integer, default is 0 |
The number of data points per period. The period needs to be set taking into account the time interval of your instances and the seasonal frequency. For example, for monthly data and annual seasonality, the period should be 12, for daily data and weekly seasonality, the period should be 7. It can take values from 0 to 60. If the period is set to 1, there is no seasonality. If the period is 0, or not given, BigML will automatically learn the period in your data.
Example: 12 |
|
project
optional |
String |
The project/id you want the timeseries to belong to.
Example: "project/54d98718f0a5ea0b16000000" |
|
range
optional |
Array, default is [1, max rows in the dataset] |
The range of successive instances to build the timeseries.
Example: [1, 150] |
|
seasonality
optional |
Integer |
Any of the following values to specify types of ETS models: 0 (none), 1 (additive), 2 (multiplicative). Multiplicative seasonality models are only available when the objective field has strictly positive values (greater than 0). Ets_models with multiplicative trends and additive seasonality are not explored due to numeric stability issues.
Example: 2 |
|
tags
optional |
Array of Strings |
A list of strings that help classify and index your timeseries.
Example: ["best customers", "2018"] |
|
time_range
optional |
Object |
Default timing information of all objective fields to be used to generate the timestamp field in the output datasets. Its value is a map with the following properties:
After the initial pass through the input data, the value of end will be adjusted to coincide with the last non-missing objective value. If the objective field has missing values at its tail, then this adjusted value will differ from the one specified or computed from start and interval. For datasets without missing values and without a timestamp_field, the input data points are assumed to be spaced uniformly in time. If a timestamp_field is given, then it is used to compute the actual time between data points. The smallest such time is considered to be one interval, and the remaining intervals between the other data points are computed as round(N/smallest_time). When the dataset contains missing values, only non-missing values are used for fitting the time series model, and if timestamp_field is not given for missing data, then each missing value between two non-missing values increases the time interval between them by one. Example:
|
|
trend
optional |
Integer |
Any of the following values to specify types of ETS models: 0 (none), 1 (additive), 2 (multiplicative)
Example: 2 |
|
webhook
optional |
Object |
A webhook url and an optional secret phrase. See the Section on Webhooks for more details.
Example:
|
You can also use curl to customize a new time series. For example, to create a new time series named "my time series", with only certain rows, and with only two fields:
curl "https://au.bigml.io/timeseries?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"dataset": "dataset/4f66a80803ce8940c5000006",
"name": "my time series",
"range": [25, 125]}'
> Creating a customized time series
If you do not specify a name, BigML.io will assign to the new time series the dataset's name. If you do not specify a range of instances, BigML.io will use all the instances in the dataset. If you do not specify an objective field, BigML.io will use the last numeric field in your dataset.
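In other words, the optional arguments simply stay out of the request body when you want the defaults. A sketch of building such a payload (make_timeseries_payload is a hypothetical helper, not part of any BigML client library):

```python
def make_timeseries_payload(dataset_id, name=None, objective_field=None, row_range=None):
    """Build the JSON body for POST /timeseries, omitting every
    argument left as None so BigML.io applies its defaults."""
    payload = {"dataset": dataset_id}
    if name is not None:
        payload["name"] = name                        # default: the dataset's name
    if objective_field is not None:
        payload["objective_field"] = objective_field  # default: last numeric field
    if row_range is not None:
        payload["range"] = row_range                  # default: all instances
    return payload
```

For example, make_timeseries_payload("dataset/4f66a80803ce8940c5000006", name="my time series", row_range=[25, 125]) reproduces the body of the curl request above.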
Retrieving a Time Series
Each time series has a unique identifier in the form "timeseries/id" where id is a string of 24 alpha-numeric characters that you can use to retrieve the time series.
To retrieve a time series with curl:
curl "https://au.bigml.io/timeseries/55efc3564e17270d5b611004?$BIGML_AUTH"
$ Retrieving a time series from the command line
You can also use your browser to visualize the time series using the full BigML.io URL or by pasting the timeseries/id into the BigML.com.au dashboard.
Time Series Properties
Once a time series has been successfully created it will have the following properties.
| Property | Type | Description |
|---|---|---|
|
category
filterable, sortable, updatable |
Integer | One of the categories in the table of categories that help classify this resource according to the domain of application. |
| code | Integer | HTTP status code. This will be 201 upon successful creation of the timeseries and 200 afterwards. Check the code that comes with the status attribute to verify that the timeseries creation has been completed without errors. |
|
columns
filterable, sortable |
Integer | The number of fields in the timeseries. |
|
composites
filterable, sortable |
Array of Strings | The list of composite ids that reference this model. |
|
created
filterable, sortable |
ISO-8601 Datetime. | This is the date and time in which the timeseries was created with microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are provided in Coordinated Universal Time (UTC). |
|
credits
filterable, sortable |
Float | The number of credits it cost you to create this timeseries. |
|
credits_per_prediction
filterable, sortable, updatable |
Float | This is the number of credits that other users will consume to make a forecast with your time series if you made it public. |
|
dataset
filterable, sortable |
String | The dataset/id that was used to build the timeseries. |
| dataset_field_types | Object | A dictionary that informs about the number of fields of each type in the dataset used to create the timeseries. It has an entry per each field type (categorical, datetime, numeric, and text), an entry for preferred fields, and an entry for the total number of fields. |
|
dataset_status
filterable, sortable |
Boolean | Whether the dataset is still available or has been deleted. |
|
description
updatable |
String | A text describing the timeseries. It can contain restricted markdown to decorate the text. |
| fields_meta | Object | A dictionary with meta information about the fields dictionary. It specifies the total number of fields, the current offset, and limit, and the number of fields (count) returned. |
| forecast | Object | It contains max_periods, which is the maximum periods of all forecasts, and the forecast result, which is a map keyed by objective id, where the entries are lists of maps describing individual submodel forecasts. See the Forecast Result Object definition below. |
| locale | String | The dataset's locale. |
|
max_columns
filterable, sortable |
Integer | The total number of fields in the dataset used to build the timeseries. |
|
max_rows
filterable, sortable |
Integer | The maximum number of instances in the dataset that can be used to build the timeseries. |
|
name
filterable, sortable, updatable |
String | The name of the timeseries as you provided it or, by default, based on the name of the dataset. |
|
number_of_evaluations
filterable, sortable |
Integer | The current number of evaluations that use this time series. |
|
number_of_forecasts
filterable, sortable |
Integer | The current number of forecasts that use this time series. |
|
number_of_public_forecasts
filterable, sortable |
Integer | The current number of public forecasts that use this time series. |
| objective_field | String | The id of the field that the time series forecasts. |
| objective_fields | Array | One or more objective fields. |
|
price
filterable, sortable, updatable |
Float | The price other users must pay to clone your timeseries. |
|
private
filterable, sortable, updatable |
Boolean | Whether the timeseries is public or not. |
|
project
filterable, sortable, updatable |
String | The project/id the resource belongs to. |
| range | Array | The range of instances used to build the timeseries. |
| resource | String | The timeseries/id. |
|
rows
filterable, sortable |
Integer | The total number of instances used to build the timeseries. |
|
shared
filterable, sortable, updatable |
Boolean | Whether the timeseries is shared using a private link or not. |
|
size
filterable, sortable |
Integer | The number of bytes of the dataset that were used to create this timeseries. |
|
source
filterable, sortable |
String | The source/id that was used to build the dataset. |
|
source_status
filterable, sortable |
Boolean | Whether the source is still available or has been deleted. |
| status | Object | A description of the status of the timeseries. It includes a code, a message, and some extra information. See the table below. |
|
subscription
filterable, sortable |
Boolean | Whether the timeseries was created using a subscription plan or not. |
|
tags
filterable, updatable |
Array of Strings | A list of user tags that can help classify and index this resource. |
| time_series | Object | All the information that you need to recreate or use the time series on your own. See here for more details. |
|
updated
filterable, sortable |
ISO-8601 Datetime. | This is the date and time in which the timeseries was updated with microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are provided in Coordinated Universal Time (UTC). |
| webhook | Object | A webhook url and an optional secret phrase. See the Section on Webhooks for more details. |
|
white_box
filterable, sortable |
Boolean | Whether the timeseries is publicly shared as a white-box. |
A Time Series Object has the following properties:
| Property | Type | Description |
|---|---|---|
| damped_trend | Boolean | Trend damping parameter |
| datasets | Object | A map whose keys are objective field identifiers and whose values are dataset identifiers, i.e., there is one dataset per objective field. Those datasets contain a timestamp column, a copy of the original data of the objective field, and one column per model with the values that that particular submodel computes for the objective time series. |
| error | Integer | ETS error type parameter: 1 (additive), 2 (multiplicative) |
| ets_models | Object | The results of the ETS fits. A dictionary with an entry per field in your data. Each entry is a list of maps. See here for more details. |
| field_parameters | Array | Field-specific ETS parameters used to optionally override the top-level parameters for a specific objective field. |
| fields | Object | A dictionary with an entry per field (column) in your data. Each entry includes the column number, the name of the field, the type of the field, and the summary. See this Section for more details. |
| forecast_ranges | Object |
Each key is an objective field ID and its corresponding value is a map describing the time indexing of the forecast data points for that objective field, containing the following keys:
|
| period | Integer | Seasonal period length. |
| seasonality | Integer | ETS seasonal type parameter: 0 (none), 1 (additive), 2 (multiplicative) |
| time_range | Object |
Timing information of all objective fields to be used to generate the timestamp field in the output datasets, containing the following keys:
|
| trend | Integer | ETS trend type parameter: 0 (none), 1 (additive), 2 (multiplicative) |
The property ets_models is a dictionary keyed by each field's id in the source. Each field's id has a list of objects with the following properties:
| Property | Type | Description |
|---|---|---|
| aic | Float | The Akaike Information Criterion score. |
| aicc | Float | The Small-sample corrected AIC score. |
| alpha | Float | Level smoothing coefficient. |
| beta | Float | Trend smoothing coefficient. Only included for ets models where trend is not none. |
| bic | Float | The Bayesian Information Criterion score. |
| final_state | Object |
The final fitted state for the ETS model with the following entries:
|
| gamma | Float | Seasonality smoothing coefficient. Only included for ets models where seasonality is not none. |
| initial_state | Object |
The initial fitted state for the ETS model with the following entries:
|
| name | String | An abbreviated name which uniquely specifies the ETS model type, using the classification system from Hyndman. It is a comma-separated triplet where the first value is the error and may take on the values {"A","M"} ("additive" or "multiplicative"), the second value is the trend and may be one of {"N","A","Ad", "M", "Md"} ("none", "additive", "multiplicative", with and without damping), and the final value is seasonality, one of {"N","A","M"}. For example, the value "M,Ad,N" specifies the ETS model type with multiplicative error, additive damped trend, and no seasonality. |
| period | Integer | Seasonal period length. |
| phi | Float | Damped trend coefficient. Only included for ets models where damped_trend is true. |
| r_squared | Float | Also called the coefficient of determination. A measure of how much better the model is than predicting the mean of the test set. |
| sigma | Float | Standard deviation of the fitted model error. |
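The name triplet described above can be unpacked mechanically, and the per-model scores (aic, aicc, bic) can be used to rank candidate fits. A sketch (an illustration, not part of any BigML client; using AICc here is a choice, the documentation only states that several measures are computed):

```python
ERROR_TYPES = {"A": "additive", "M": "multiplicative"}
TREND_TYPES = {"N": "none", "A": "additive", "Ad": "additive damped",
               "M": "multiplicative", "Md": "multiplicative damped"}
SEASONALITY_TYPES = {"N": "none", "A": "additive", "M": "multiplicative"}


def describe_ets_name(name):
    """Expand an abbreviated ETS name such as "M,Ad,N" into its
    (error, trend, seasonality) description."""
    error, trend, seasonality = name.split(",")
    return ERROR_TYPES[error], TREND_TYPES[trend], SEASONALITY_TYPES[seasonality]


def best_ets_model(models):
    """Rank the fitted candidates for one objective field by AICc (lower is
    better), skipping the naive/mean/drift reference entries, which carry
    no information criteria."""
    return min((m for m in models if "aicc" in m), key=lambda m: m["aicc"])
```

Among the ets_models shown in the sample response below, for example, the "A,N,N" fit has the lowest aicc.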
The property forecast is a dictionary keyed by each field's id in the source. Each field's id has a list of objects with the following properties:
In addition to the ETS models, BigML also provides simple forecast models for each field, to be used as references for the performance of the ETS models. Due to their trivial nature, these are always computed regardless of what ETS parameters are selected in the input. Currently, we offer three simple model types: naive, mean, and drift.
Naive: this model always forecasts the last value of the observed time series. For seasonal models, it repeats the last m values of the training series, where "m" is the given period length for the field. The parameters for this field are as follows:
- name: "naive"
- value: The last value of the objective field or the m continuously cycled values of the training series in the case of seasonal models. Array of floats.
Mean: this model always forecasts the mean of the objective field. For seasonal models, it is similar to the naive model since the model cycles the same sequence of values for forecasts, but instead of using the last set of m values, BigML computes the mean sequence of the naive values. The parameters for this field are as follows:
- name: "mean"
- value: the mean of the objective field or an m-length array representing the mean seasonal period across the training series for seasonal models. Array of floats.
Drift: Draws a straight line between the first and last values of the training series. Forecasts are performed by extending that line. The parameters for this field are as follows:
- name: "drift"
- value: The final value of the training series, from which to extend the drift line when performing forecasts. Float.
- slope: The slope of the drift line. Float.
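The three reference models are simple enough to reproduce in a few lines. A sketch (not BigML's code), assuming a non-seasonal series:

```python
def naive_forecast(series, horizon):
    # repeat the last observed value for every step of the horizon
    return [series[-1]] * horizon


def mean_forecast(series, horizon):
    # repeat the mean of the training series
    return [sum(series) / len(series)] * horizon


def drift_forecast(series, horizon):
    # extend the straight line through the first and last training values
    slope = (series[-1] - series[0]) / (len(series) - 1)
    return [series[-1] + slope * h for h in range(1, horizon + 1)]
```

For the series [10, 12, 14], the two-step drift forecast is [16.0, 18.0], since the line through the first and last values has slope 2.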
Time Series Status
Creating a time series is a process that can take just a few seconds or a few days depending on the size of the dataset used as input and on the workload of BigML's systems. The time series goes through a number of states until it is fully completed. Through the status field in the time series you can determine when the time series has been fully processed and is ready to be used to create forecasts. These are the properties of a time series' status:
| Property | Type | Description |
|---|---|---|
| code | Integer | A status code that reflects the status of the timeseries creation. It can be any of those that are explained here. |
| elapsed | Integer | Number of milliseconds that BigML.io took to process the timeseries. |
| message | String | A human readable message explaining the status. |
| progress | Float, between 0 and 1 | How far BigML.io has progressed building the timeseries. |
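Since status code 5 marks a finished resource (as in the example response below), a client typically polls until it appears. A generic sketch, where get_resource is any hypothetical callable that performs the authenticated GET of the timeseries/id and returns the parsed JSON:

```python
import time


def wait_until_ready(get_resource, poll_seconds=2, max_tries=100):
    """Poll a resource until its status code reaches 5 (finished).

    Negative status codes are assumed here to indicate a failed build;
    see the status codes section of this documentation for the full list."""
    for _ in range(max_tries):
        resource = get_resource()
        status = resource["status"]
        if status["code"] == 5:
            return resource
        if status["code"] < 0:
            raise RuntimeError(status.get("message", "resource is faulty"))
        time.sleep(poll_seconds)
    raise TimeoutError("resource was not ready in time")
```
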
Once a time series has been successfully created, it will look like:
{
"category": 0,
"clones": 0,
"code": 200,
"columns": 5,
"configuration": null,
"configuration_status": false,
"created": "2017-06-27T21:23:48.717000",
"credits": 0,
"dataset": "dataset/5948cc214e172744d700000f",
"dataset_field_types": {
"categorical": 0,
"datetime": 0,
"effective_fields": 47,
"items": 0,
"numeric": 5,
"preferred": 6,
"text": 1,
"total": 6
},
"dataset_status": true,
"dataset_type": 0,
"description": "",
"fields_meta": {
"count": 5,
"limit": 1000,
"offset": 0,
"query_total": 5,
"total": 5
},
"locale": "en-us",
"max_columns": 6,
"max_rows": 2519,
"name": "product sales",
"name_options": "use all numeric objectives, period: 1",
"number_of_evaluations": 0,
"number_of_forecasts": 0,
"number_of_public_forecasts": 0,
"objective_field": "000001",
"objective_field_name": "close",
"objective_field_type": "numeric",
"objective_fields": [
"000001",
"000002",
"000003",
"000004",
"000005"
],
"price": 0,
"private": true,
"project": null,
"range": [
1,
2519
],
"resource": "timeseries/55efc3564e17270d5b611004",
"rows": 2519,
"shared": false,
"short_url": "",
"size": 173517,
"source": "source/5948cc164e172744d700000c",
"source_status": true,
"status": {
"code": 5,
"elapsed": 35665,
"message": "The time series has been created",
"progress": 1
},
"subscription": true,
"tags": [],
"time_series": {
"all_numeric_objectives": true,
"datasets": {
"000001": "dataset/5952cd0c4e172751b100000c",
"000002": "dataset/5952cd0c4e172751b1000006",
"000003": "dataset/5952cd0c4e172751b1000009",
"000004": "dataset/5952cd0c4e172751b1000000",
"000005": "dataset/5952cd0c4e172751b1000003"
},
"fields": {
"000001": {
"column_number": 1,
"datatype": "double",
"name": "close",
"optype": "numeric",
"order": 0,
"preferred": true,
"summary": {
"bins": [
[
4.30638,
8
],
[
4.9707,
46
],
[
5.7588,
74
],
[
6.95625,
71
],
[
8.24905,
31
],
[
8.9962,
142
],
[
9.79558,
157
],
[
10.80239,
76
],
[
11.59019,
110
],
[
12.41969,
201
],
[
13.27833,
238
],
[
13.80131,
53
],
[
14.71157,
65
],
[
16.12907,
79
],
[
17.1726,
8
],
[
18.11183,
24
],
[
19.06156,
18
],
[
19.77059,
61
],
[
20.44251,
48
],
[
21.29147,
279
],
[
22.32972,
168
],
[
23.20998,
101
],
[
24.11741,
153
],
[
25.34358,
49
],
[
26.29189,
27
],
[
26.99875,
8
],
[
28.2042,
75
],
[
29.10743,
37
],
[
30.29892,
37
],
[
31.322,
25
],
[
32.15135,
37
],
[
33.72385,
13
]
],
"exact_histogram": {
"populations": [
37,
75,
57,
38,
89,
196,
94,
127,
264,
219,
63,
47,
47,
17,
19,
57,
133,
227,
173,
138,
102,
43,
29,
23,
68,
36,
26,
40,
22,
10,
3
],
"start": 4,
"width": 1
},
"kurtosis": -0.93472,
"maximum": 34.28,
"mean": 16.94683,
"median": 15.0047,
"minimum": 4.1659,
"missing_count": 0,
"population": 2519,
"skewness": 0.261,
"standard_deviation": 7.09487,
"sum": 42689.0737,
"sum_squares": 850193.8112,
"variance": 50.33725
}
},
"000002": {
...
},
"000003": {
...
},
"000004": {
...
},
"000005": {
...
}
},
"period": 1,
"ets_models": {
"000001": [
{
"aic": 14182.46174,
"aicc": 14182.47128,
"alpha": 0.96337,
"beta": 0,
"bic": 14199.95659,
"final_state": {
"b": 0,
"l": 12.9813,
"s": [
0
]
},
"gamma": 0,
"initial_state": {
"b": 0,
"l": 34.25241,
"s": []
},
"name": "A,N,N",
"period": 1,
"phi": 1,
"r_squared": 0.99781,
"sigma": 0.33224
},
{
"aic": 14188.36673,
"aicc": 14188.40017,
"alpha": 0.9999,
"beta": 0.0001,
"bic": 14223.35643,
"final_state": {
"b": -0.00003,
"l": 12.9805,
"s": [
0
]
},
"gamma": 0,
"initial_state": {
"b": -0.07686,
"l": 34.32388,
"s": []
},
"name": "A,Ad,N",
"period": 1,
"phi": 0.97903,
"r_squared": 0.99781,
"sigma": 0.33223
},
{
"aic": 14186.28647,
"aicc": 14186.31034,
"alpha": 0.96263,
"beta": 0.00025,
"bic": 14215.44456,
"final_state": {
"b": -0.00342,
"l": 12.98118,
"s": [
0
]
},
"gamma": 0,
"initial_state": {
"b": -0.00037,
"l": 34.26542,
"s": []
},
"name": "A,A,N",
"period": 1,
"phi": 1,
"r_squared": 0.99781,
"sigma": 0.33223
},
{
"aic": 14215.34381,
"aicc": 14215.35336,
"alpha": 0.91925,
"beta": 0,
"bic": 14232.83866,
"final_state": {
"b": 0,
"l": 12.98188,
"s": [
0
]
},
"gamma": 0,
"initial_state": {
"b": 0,
"l": 34.20342,
"s": []
},
"name": "M,N,N",
"period": 1,
"phi": 1,
"r_squared": 0.99999,
"sigma": 0.02177
},
{
"aic": 14220.82046,
"aicc": 14220.8539,
"alpha": 0.91943,
"beta": 0.00217,
"bic": 14255.81016,
"final_state": {
"b": -0.00072,
"l": 12.98182,
"s": [
0
]
},
"gamma": 0,
"initial_state": {
"b": -0.03424,
"l": 34.31773,
"s": []
},
"name": "M,Ad,N",
"period": 1,
"phi": 0.97813,
"r_squared": 0.99999,
"sigma": 0.02177
},
{
"aic": 14221.09125,
"aicc": 14221.11513,
"alpha": 0.91856,
"beta": 0.0001,
"bic": 14250.24934,
"final_state": {
"b": -0.00185,
"l": 12.98172,
"s": [
0
]
},
"gamma": 0,
"initial_state": {
"b": 0,
"l": 34.1753,
"s": []
},
"name": "M,A,N",
"period": 1,
"phi": 1,
"r_squared": 0.99999,
"sigma": 0.02178
},
{
"aic": 14219.11963,
"aicc": 14219.15307,
"alpha": 0.91374,
"beta": 0.00405,
"bic": 14254.10933,
"final_state": {
"b": 0.99992,
"l": 12.98182,
"s": [
0
]
},
"gamma": 0,
"initial_state": {
"b": 0.99549,
"l": 34.56209,
"s": []
},
"name": "M,Md,N",
"period": 1,
"phi": 0.9795,
"r_squared": 0.99999,
"sigma": 0.02177
},
{
"aic": 14219.59974,
"aicc": 14219.62361,
"alpha": 0.91761,
"beta": 0.00013,
"bic": 14248.75782,
"final_state": {
"b": 0.99982,
"l": 12.98169,
"s": [
0
]
},
"gamma": 0,
"initial_state": {
"b": 0.9998,
"l": 34.33755,
"s": []
},
"name": "M,M,N",
"period": 1,
"phi": 1,
"r_squared": 0.99999,
"sigma": 0.02178
},
{
"name": "naive",
"value": [
12.9805
]
},
{
"name": "mean",
"value": [
16.94683
]
},
{
"name": "drift",
"slope": -0.00846,
"value": 12.9805
}
],
"000002": [
...
],
"000003": [
...
],
"000004": [
...
],
"000005": [
...
]
},
"time_range": {
"end": 2518,
"interval": 1,
"start": 0
}
},
"type": 0,
"updated": "2017-06-27T21:24:28.165000",
"white_box": false
}
< Example time series JSON response
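Each entry in the ets_models list above carries information criteria (aic, aicc, bic) alongside the fitted parameters, so a client can rank the candidate models itself; lower values indicate a better fit/complexity trade-off. A minimal sketch using the aic values copied from the response above (the benchmark entries naive, mean, and drift carry no aic key and are omitted):

```python
# Candidate exponential-smoothing models and their AIC scores, copied
# from the ets_models list in the example response above.
ets_models = [
    {"name": "A,A,N",  "aic": 14186.28647},
    {"name": "M,N,N",  "aic": 14215.34381},
    {"name": "M,Ad,N", "aic": 14220.82046},
    {"name": "M,A,N",  "aic": 14221.09125},
    {"name": "M,Md,N", "aic": 14219.11963},
    {"name": "M,M,N",  "aic": 14219.59974},
]

# Lower AIC is better: pick the candidate with the smallest score.
best = min(ets_models, key=lambda m: m["aic"])
```

The same pattern applies to aicc or bic; which criterion to prefer is a modeling choice, not something the API imposes.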
PMML
The default time series output format is JSON. However, the pmml parameter allows you to include a PMML version of the time series in the response. The time series will then include an XML document that conforms to PMML v4.1. For example:
curl "https://au.bigml.io/timeseries/55efc3564e17270d5b611004?$BIGML_AUTH;pmml=yes"
PMML Example
curl "https://au.bigml.io/timeseries/55efc3564e17270d5b611004?$BIGML_AUTH;pmml=only"
PMML Example
Filtering and Paginating Fields from a Time Series
A time series might be composed of hundreds or even thousands of fields. Thus, when retrieving a time series, it's possible to specify that only a subset of fields be retrieved, by using any combination of the following parameters in the query string (unrecognized parameters are ignored):
Since fields is a map and therefore not ordered, the returned fields contain an additional key, order, whose integer (increasing) value gives you their ordering. In all other respects, the source is the same as the one you would get without any filtering parameter above.
The fields_meta field can help you paginate fields. Its structure is as follows:
Updating a Time Series
To update a time series, you need to PUT an object containing the fields that you want to update to the time series' base URL. The content-type must always be: "application/json". If the request succeeds, BigML.io will return with an HTTP 202 response with the updated time series.
For example, to update a time series with a new name you can use curl like this:
curl "https://au.bigml.io/timeseries/55efc3564e17270d5b611004?$BIGML_AUTH" \
-X PUT \
-H 'content-type: application/json' \
-d '{"name": "a new name"}'
$ Updating a time series's name
If you want to update with a new label and description for a specific field you can use curl like this:
curl "https://au.bigml.io/timeseries/55efc3564e17270d5b611004?$BIGML_AUTH" \
-X PUT \
-d '{"fields": {"000000": {"label": "a longer name", "description": "an even longer description"}}}' \
-H 'content-type: application/json'
$ Updating a field's label and description
Deleting a Time Series
To delete a time series, you need to issue an HTTP DELETE request to the timeseries/id to be deleted.
Using curl you can do something like this to delete a time series:
curl -X DELETE "https://au.bigml.io/timeseries/55efc3564e17270d5b611004?$BIGML_AUTH"
$ Deleting a time series from the command line
If the request succeeds you will not see anything on the command line unless you executed the command in verbose mode. Successful DELETEs will return "204 no content" responses with no body.
Once you delete a time series, it is permanently deleted. That is, a delete request cannot be undone. If you try to delete a time series a second time, or a time series that does not exist, you will receive a "404 not found" response.
However, if you try to delete a time series that is being used at the moment, then BigML.io will not accept the request and will respond with a "400 bad request" response.
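The possible outcomes of a DELETE (204 on success, 404 for a missing or already-deleted resource, 400 when the resource is in use) reduce to a small dispatch. A sketch with a hypothetical describe_delete helper:

```python
def describe_delete(status_code):
    """Interpret the HTTP status of a DELETE on a BigML resource,
    following the rules described above (describe_delete is a
    hypothetical helper, not part of the API)."""
    if status_code == 204:
        return "deleted"  # success: no response body is returned
    if status_code == 404:
        return "not found (already deleted or never existed)"
    if status_code == 400:
        return "refused: resource is currently in use"
    return "unexpected status: " + str(status_code)
```

The same dispatch applies to every resource type in this API, since DELETE semantics are uniform.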
See this section for more details.
Listing Time Series
To list all the time series, you can use the timeseries base URL. By default, only the 20 most recent time series will be returned. You can see below how to change this number using the limit parameter.
You can get your list of time series directly in your browser using your own username and API key with the following links.
https://au.bigml.io/timeseries?$BIGML_AUTH
> Listing time series from a browser
Fusions
Last Updated: Tuesday, 2019-01-29 16:28
Fusions are a special type of composite whose submodels all satisfy the following constraints: they are all either classifications or regressions over the same kind of data (or compatible fields), and they share the same objective field. Given those properties, a fusion can itself be considered a supervised model, so you can make predictions with fusions and evaluate them. An ensemble can be viewed as a kind of fusion subject to the additional constraint that all of its submodels are tree models which, moreover, have been built from the same base input data, sampled in particular ways.
The following supervised model types can be a submodel of a fusion: deepnet, ensemble, fusion, model, and logistic regression.
BigML.io allows you to create, retrieve, update, and delete your fusions. You can also list all of your fusions.
Jump to:
- Fusion Base URL
- Creating a Fusion
- Fusion Arguments
- Retrieving a Fusion
- Fusion Properties
- Filtering and Paginating Fields from a Fusion
- Filtering and Paginating Models from a Fusion
- Updating a Fusion
- Deleting a Fusion
- Listing Fusions
- Model Weights
Fusion Base URL
You can use the following base URL to create, retrieve, update, and delete fusions.
https://au.bigml.io/fusion
Fusion base URL
All requests to manage your fusions must use HTTPS and be authenticated using your username and API key to verify your identity. See this section for more details.
Creating a Fusion
To create a new fusion, you need to POST to the fusion base URL an object containing at least a list of model ids that you want to use to create the fusion. The content-type must always be "application/json".
POST /fusion?username=alfred;api_key=79138a622755a2383660347f895444b1eb927730; HTTP/1.1
Host: bigml.io
Content-Type: application/json
Creating fusion definition
curl "https://au.bigml.io/fusion?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"models": [
"model/5ada0d741f386f2459000019",
"logisticregression/5ada0d781f386f245b000028",
"deepnet/5af32fa04e1727b61f000003"
]}'
> Creating a fusion
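The same request body can be assembled programmatically before handing it to curl or an HTTP client. A sketch, assuming the hypothetical helper fusion_payload and the resource ids from the curl example above (only the body is built; the POST itself is not performed here):

```python
import json

# Submodel types a fusion accepts, per the constraints described above.
ALLOWED = {"model", "ensemble", "logisticregression", "deepnet", "fusion"}

def fusion_payload(model_ids):
    """Build the JSON body for POST /fusion, checking submodel types.

    fusion_payload is a hypothetical helper, not part of the API.
    """
    for mid in model_ids:
        kind = mid.split("/", 1)[0]
        if kind not in ALLOWED:
            raise ValueError("unsupported submodel type: " + kind)
    if len(model_ids) > 1000:
        raise ValueError("a fusion accepts at most 1000 submodels")
    return json.dumps({"models": list(model_ids)})

body = fusion_payload([
    "model/5ada0d741f386f2459000019",
    "logisticregression/5ada0d781f386f245b000028",
    "deepnet/5af32fa04e1727b61f000003",
])
```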
BigML.io will return the newly created fusion if the request succeeded.
{
"category": 0,
"code": 201,
"configuration": null,
"configuration_status": false,
"created": "2018-05-09T20:11:05.821427",
"credits_per_prediction": 0,
"description": "",
"fusion": {},
"importance": {},
"model_count": {
"ensemble": 1,
"logisticregression": 1,
"model": 1,
"total": 3
},
"models": [
"ensemble/5af272eb4e1727d378000050",
"model/5af272fe4e1727d3780000d6",
"logisticregression/5af272ff4e1727d3780000d9"
],
"name": "iris",
"name_options": "3 total models (ensemble: 1, logisticregression: 1, model: 1)",
"number_of_batchpredictions": 0,
"number_of_evaluations": 0,
"number_of_predictions": 0,
"number_of_public_predictions": 0,
"objective_field": null,
"objective_field_details": null,
"objective_field_name": null,
"objective_field_type": null,
"private": true,
"project": null,
"resource":"fusion/59af8107b8aa0965d5b61138",
"shared": false,
"status": {
"code": 1,
"message": "The fusion creation request has been queued and will be processed soon"
},
"subscription": false,
"tags": [],
"type": 0,
"updated": "2018-05-09T20:11:05.821635"
}
< Example fusion JSON response
Fusion Arguments
In addition to the models, you can also POST the following arguments.
| Argument | Type | Description |
|---|---|---|
|
category
optional |
Integer, default is 0 |
The category that best describes the fusion. See the category codes for the complete list of categories.
Example: 1 |
|
description
optional |
String |
A description of the fusion up to 8192 characters long.
Example: "This is a description of my new fusion" |
|
fields_maps
optional |
Object |
A dictionary keyed by a submodel id and object values. Each entry maps fields in the submodel to fields in the fusion
Example:
|
| models | Array |
A list with fusion submodel resource/ids, or a list of maps using the key id for each submodel resource/id plus any other key/values for additional meta-information on the submodel. Available submodel types are deepnet, ensemble, fusion, model, and logistic regression. The maximum number of submodels is 1000.
Example: or
|
|
name
optional |
String, default is fusion's name |
The name you want to give to the new fusion.
Example: "my new fusion" |
|
project
optional |
String |
The project/id you want the fusion to belong to.
Example: "project/54d98718f0a5ea0b16000000" |
|
tags
optional |
Array of Strings |
A list of strings that help classify and index your fusion.
Example: ["best customers", "2018"] |
|
webhook
optional |
Object |
A webhook url and an optional secret phrase. See the Section on Webhooks for more details.
Example:
|
If you do not specify a name, BigML.io will assign one to the new fusion.
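The shape of fields_maps can be illustrated as follows: a dictionary keyed by submodel id, whose values map that submodel's field ids to the fusion's field ids. The submodel ids below come from the creation example above; the field-id pairs are hypothetical, chosen only to show the nesting:

```python
# Hypothetical fields_maps value: for each submodel, map that submodel's
# field ids (keys) to the corresponding fusion field ids (values).
fields_maps = {
    "model/5ada0d741f386f2459000019": {
        "000000": "000000",
        "000001": "000002",
    },
    "deepnet/5af32fa04e1727b61f000003": {
        "000000": "000001",
    },
}

# Every value must itself be a dict of field-id pairs.
ok = all(isinstance(v, dict) for v in fields_maps.values())
```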
Retrieving a Fusion
Each fusion has a unique identifier in the form "fusion/id" where id is a string of 24 alpha-numeric characters that you can use to retrieve the fusion.
To retrieve a fusion with curl:
curl "https://au.bigml.io/fusion/59af8107b8aa0965d5b61138?$BIGML_AUTH"
$ Retrieving a fusion from the command line
Fusion Properties
Once a fusion has been successfully created it will have the following properties.
| Property | Type | Description |
|---|---|---|
| any_model_missing_numerics | Boolean | Whether any model in the fusion is not a logistic regression or any logistic regression has missing_numerics=true. |
|
category
filterable, sortable, updatable |
Integer | One of the categories in the table of categories that help classify this resource according to the domain of application. |
| code | Integer | HTTP status code. This will be 201 upon successful creation of the fusion and 200 afterwards. Check the code that comes with the status attribute to verify that the fusion creation has been completed without errors. |
|
composites
filterable, sortable |
Array of Strings | The list of composite ids that reference this model. |
|
created
filterable, sortable |
ISO-8601 Datetime. | This is the date and time in which the fusion was created with microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are provided in Coordinated Universal Time (UTC). |
|
credits_per_prediction
filterable, sortable |
Float | This is the number of credits that other users will consume to make a prediction with your fusion if you made it public. |
|
description
updatable |
String | A text describing the fusion. It can contain restricted markdown to decorate the text. |
| fields_maps | Array | A dictionary keyed by a submodel id and object values. Each entry maps fields in the submodel to fields in the fusion |
| fields_meta | Object | A dictionary with meta information about the fields dictionary. It specifies the total number of fields, the current offset, and limit, and the number of fields (count) returned. |
|
fusion
filterable, sortable |
Object | Fusion object. For more information, see the Fusion object described below. |
|
fusions
filterable, sortable |
Array of Strings | The list of fusion ids that reference this model. |
| importance | Array |
Average field importances over all submodels.
Example:
|
| locale | String | The fusion's locale. |
|
model_count
filterable, sortable |
Object |
A dictionary that informs about the number of submodels of each type in the fusion.
Example:
|
| models | Array |
A list of all submodels ids regardless of how models are filtered and paged.
Example:
|
| models_meta | Object | A dictionary with meta information about the models filtered. It specifies the total number of models, the current offset, and limit. |
|
name
filterable, sortable, updatable |
String | The name of the fusion as you provided it. |
|
number_of_batchpredictions
filterable, sortable |
Integer | The current number of batch predictions that use this fusion. |
|
number_of_evaluations
filterable, sortable |
Integer | The current number of evaluations that use this fusion. |
|
number_of_predictions
filterable, sortable |
Integer | The current number of predictions that use this fusion. |
|
number_of_public_predictions
filterable, sortable |
Integer | The current number of public predictions that use this fusion. |
| objective_field | String | The id of the field that the fusion predicts. |
| objective_field_details | Object | The details of the objective fields. See the Objective Field Details. |
|
objective_field_name
filterable, sortable |
String | The name of the objective field in the fusion. |
|
objective_field_type
filterable, sortable |
String | The type of the objective field in the fusion. |
| objective_fields | Array | Specifies the list of ids of the fields that the fusion predicts. Even though this is an array, BigML.io only accepts one objective field in the current version. |
|
private
filterable, sortable |
Boolean | Whether the fusion is public or not. |
|
project
filterable, sortable, updatable |
String | The project/id the resource belongs to. |
| resource | String | The fusion/id. |
|
shared
filterable |
Boolean | Whether the fusion is shared using a private link or not. |
|
shared_clonable
filterable |
Boolean | Whether the shared fusion can be cloned or not. |
|
shared_hash
filterable |
String | The hash that gives access to this fusion if it has been shared using a private link. |
| sharing_key | String | The alternative key that gives read access to this fusion. |
| status | Object | A description of the status of the fusion. It includes a code, a message, and some extra information. See the table below. |
|
subscription
filterable, sortable |
Boolean | Whether the fusion was created using a subscription plan or not. |
|
tags
filterable, updatable |
Array of Strings | A list of user tags that can help classify and index this resource. |
|
type
filterable, sortable |
Integer | Reserved for future use. |
|
updated
filterable, sortable |
ISO-8601 Datetime. | This is the date and time in which the fusion was updated with microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are provided in Coordinated Universal Time (UTC). |
| webhook | Object | A webhook url and an optional secret phrase. See the Section on Webhooks for more details. |
The Fusion object has the following properties.
| Property | Type | Description |
|---|---|---|
| fields | Object | All fields used by the submodels, without attempting to merge fields from different models in any way; for each field, the last summary found is used. See this Section for more details. |
| models | Array |
A filtered and paged list of the fusion submodels with id, kind, name, name_options as well as other metadata the user provided while creating the fusion.
Example:
|
The Objective Field Details object has the following properties.
Fusion Status
Creating a fusion is a process that can take just a few seconds or a few hours depending on the size of the models used as input and on the workload of BigML's systems. The fusion goes through a number of states until it is fully completed. Through the status field in the fusion you can determine when the fusion has been fully processed and is ready to be used to create predictions. These are the properties that a fusion's status has:
| Property | Type | Description |
|---|---|---|
| code | Integer | A status code that reflects the status of the fusion creation. It can be any of those that are explained here. |
| elapsed | Integer | Number of milliseconds that BigML.io took to process the fusion. |
| message | String | A human readable message explaining the status. |
| progress | Float, between 0 and 1 | How far BigML.io has progressed building the fusion. |
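Since creation is asynchronous, a client typically polls the resource until status.code reaches a terminal value (5 for finished, -1 for faulty, per the status codes section). A sketch with a hypothetical fetch_fusion callable standing in for a GET on the fusion URL:

```python
import time

FINISHED, FAULTY = 5, -1  # terminal status codes

def wait_until_done(fetch_fusion, delay=0.0):
    """Poll a fusion until its status code reaches a terminal value.

    fetch_fusion is a hypothetical callable standing in for an HTTP GET
    on the fusion's URL; it must return the decoded JSON document.
    """
    while True:
        doc = fetch_fusion()
        if doc["status"]["code"] in (FINISHED, FAULTY):
            return doc
        time.sleep(delay)

# Simulated responses: queued (code 1), then finished (code 5).
responses = iter([
    {"status": {"code": 1, "message": "queued"}},
    {"status": {"code": 5, "message": "The fusion has been created",
                "progress": 1}},
])
done = wait_until_done(lambda: next(responses))
```

In practice you would also back off between polls and stop on FAULTY with an error.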
Once a fusion has been successfully created, it will look like:
{
"category": 0,
"code": 200,
"configuration": null,
"configuration_status": false,
"created": "2018-05-09T20:11:05.821000",
"credits_per_prediction": 0,
"description": "",
"fields_meta": {
"count": 5,
"limit": 1000,
"offset": 0,
"query_total": 5,
"total": 5
},
"fusion": {
"models": [
{
"id": "ensemble/5af272eb4e1727d378000050",
"kind": "ensemble",
"name": "Iris ensemble",
"name_options": "boosted trees, 1999-node, 16-iteration, deterministic order, balanced"
},
{
"id": "model/5af272fe4e1727d3780000d6",
"kind": "model",
"name": "Iris model",
"name_options": "1999-node, pruned, deterministic order, balanced"
},
{
"id": "logisticregression/5af272ff4e1727d3780000d9",
"kind": "logisticregression",
"name": "Iris LR",
"name_options": "L2 regularized (c=1), bias, auto-scaled, missing values, eps=0.001"
}
]
},
"importance": {
"000000": 0.05847,
"000001": 0.03028,
"000002": 0.13582,
"000003": 0.4421
},
"model_count": {
"ensemble": 1,
"logisticregression": 1,
"model": 1,
"total": 3
},
"models": [
"ensemble/5af272eb4e1727d378000050",
"model/5af272fe4e1727d3780000d6",
"logisticregression/5af272ff4e1727d3780000d9"
],
"models_meta": {
"count": 3,
"limit": 1000,
"offset": 0,
"total": 3
},
"name": "iris",
"name_options": "3 total models (ensemble: 1, logisticregression: 1, model: 1)",
"number_of_batchpredictions": 0,
"number_of_evaluations": 0,
"number_of_predictions": 0,
"number_of_public_predictions": 0,
"objective_field": "000004",
"objective_field_details": {
"column_number": 4,
"datatype": "string",
"name": "species",
"optype": "categorical",
"order": 4
},
"objective_field_name": "species",
"objective_field_type": "categorical",
"objective_fields": [
"000004"
],
"private": true,
"project": null,
"resource":"fusion/59af8107b8aa0965d5b61138",
"shared": false,
"status": {
"code": 5,
"elapsed": 8420,
"message": "The fusion has been created",
"progress": 1
},
"subscription": false,
"tags": [],
"type": 0,
"updated": "2018-05-09T20:11:14.258000"
}
< Example fusion JSON response
Filtering and Paginating Fields from a Fusion
A fusion might be composed of hundreds or even thousands of fields. Thus when retrieving a fusion, it's possible to specify that only a subset of fields be retrieved, by using any combination of the following parameters in the query string (unrecognized parameters are ignored):
Since fields is a map and therefore not ordered, the returned fields contain an additional key, order, whose integer (increasing) value gives you their ordering. In all other respects, the source is the same as the one you would get without any filtering parameter above.
The fields_meta field can help you paginate fields. Its structure is as follows:
Filtering and Paginating Models from a Fusion
Since model lists can grow large, we offer paginations of the models list in the response when GETting it via HTTP. Pagination is specified using the following query string parameters:
- models_limit: A non-negative integer indicating how many elements in models to return. If not provided, we return at most 1000. If passed a negative value (say, -1), we return all of them.
- models_offset: The offset in the list of models (i.e., how many models are discarded before we take limit of them).
- models_sort_by: Sorting criteria, specified by any one of the keys the user provided during creation in the models maps. Sorting is ascending, unless you prefix the key name with a minus sign. For instance, say your models have a property, rank: a query string of the form models_sort_by=rank sorts them by rank in ascending order, and one of the form models_sort_by=-rank sorts them in descending order. It is possible to provide more than one ordering criterion, separated by commas, in which case the second and subsequent ones are used to break ties in the ordering generated by the previous ones.
Sorting happens before limit and offset are applied. When pagination is active, the models_meta property is included at the top level of the returned resource. This property will contain offset, limit, count, and total.
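These parameters are ordinary query-string arguments, joined with semicolons like the other examples in this document. A sketch that builds one page's URL (models_page_url is a hypothetical helper; the auth string is a placeholder and the sort key rank follows the example above):

```python
def models_page_url(fusion_id, auth, offset=0, limit=1000, sort_by=None):
    """Build a GET URL for one page of a fusion's submodels.

    sort_by follows the convention above: a key name, optionally
    prefixed with '-' for descending order; several keys may be
    joined with commas.
    """
    url = ("https://au.bigml.io/" + fusion_id + "?" + auth +
           ";models_offset=" + str(offset) +
           ";models_limit=" + str(limit))
    if sort_by:
        url += ";models_sort_by=" + sort_by
    return url

url = models_page_url("fusion/59af8107b8aa0965d5b61138",
                      "username=alfred;api_key=KEY",
                      offset=20, limit=10, sort_by="-rank")
```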
Updating a Fusion
To update a fusion, you need to PUT an object containing the fields that you want to update to the fusion's base URL. The content-type must always be: "application/json". If the request succeeds, BigML.io will return with an HTTP 202 response with the updated fusion.
For example, to update a fusion with a new name you can use curl like this:
curl "https://au.bigml.io/fusion/59af8107b8aa0965d5b61138?$BIGML_AUTH" \
-X PUT \
-d '{"name": "a new name"}' \
-H 'content-type: application/json'
$ Updating a fusion's name
Deleting a Fusion
To delete a fusion, you need to issue an HTTP DELETE request to the fusion/id to be deleted.
Using curl you can do something like this to delete a fusion:
curl -X DELETE "https://au.bigml.io/fusion/59af8107b8aa0965d5b61138?$BIGML_AUTH"
$ Deleting a fusion from the command line
If the request succeeds you will not see anything on the command line unless you executed the command in verbose mode. Successful DELETEs will return "204 no content" responses with no body.
Once you delete a fusion, it is permanently deleted. That is, a delete request cannot be undone. If you try to delete a fusion a second time, or a fusion that does not exist, you will receive a "404 not found" response.
However, if you try to delete a fusion that is being used at the moment, then BigML.io will not accept the request and will respond with a "400 bad request" response.
See this section for more details.
Listing Fusions
To list all the fusions, you can use the fusion base URL. By default, only the 20 most recent fusions will be returned. You can see below how to change this number using the limit parameter.
You can get your list of fusions directly in your browser using your own username and API key with the following links.
https://au.bigml.io/fusion?$BIGML_AUTH
> Listing fusions from a browser
Model Weights
It is possible to assign a weight to any of the models used to create a fusion. To that end, just use a map in the corresponding entry of models in your request, with keys id and weight. Weights can be any non-negative number; they are normalized and used to weight the probabilities reported by each model in a classification when making predictions (thus affecting the overall fusion prediction), or to weight the value predicted by each component when the problem is a regression. The weights are used every time the fusion has to make a prediction, which includes batch predictions and evaluations as well as individual predictions. When not specified (as for the second model below), the weight takes the default value 1. At least one of the model weights must be positive.
curl "https://au.bigml.io/fusion?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"models": [
{"id": "model/5b3dcdd592fb560bf00001a6", "weight": 0.33},
{"id": "model/5b3c92e79252732186002bbe"},
{"id": "model/5b34a09d28e1f46e2a000920", "weight": 21.3}]}'
> Using weight in model to create a fusion
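The combination rule described above can be reproduced client-side: missing weights default to 1, weights are normalized by their sum, and per-class probabilities are averaged with the normalized weights. A sketch using the weights from the request above (0.33, the default 1, and 21.3); the per-class probabilities are made up for illustration:

```python
def fuse_probabilities(submodels):
    """Combine per-class probabilities with normalized model weights.

    Each submodel is a dict with an optional 'weight' (default 1) and a
    'probabilities' map from class name to probability. This is a sketch
    of the weighting scheme described above, not BigML's implementation.
    """
    total = sum(m.get("weight", 1) for m in submodels)
    fused = {}
    for m in submodels:
        w = m.get("weight", 1) / total
        for cls, p in m["probabilities"].items():
            fused[cls] = fused.get(cls, 0.0) + w * p
    return fused

fused = fuse_probabilities([
    {"weight": 0.33, "probabilities": {"yes": 0.9, "no": 0.1}},
    {"probabilities": {"yes": 0.5, "no": 0.5}},   # weight defaults to 1
    {"weight": 21.3, "probabilities": {"yes": 0.2, "no": 0.8}},
])
best = max(fused, key=fused.get)
```

Note how the heavily weighted third model dominates the fused distribution, as the normalization implies.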
Evaluations
Last Updated: Tuesday, 2019-01-29 16:28
An evaluation provides an easy way to measure the performance of a predictive model. To create a new evaluation, you need a supervised model (model/id, ensemble/id, logisticregression/id, deepnet/id, fusion/id, or timeseries/id), and a dataset/id.
The type of an evaluation can vary. It will be of type timeseries if it is created using a time series. Otherwise, it will be either classification or regression, depending on whether the objective field of the model is categorical or numeric, respectively. The performance measures provided by BigML.io will vary depending on the type of the evaluation.
BigML.io allows you to create, retrieve, update, and delete your evaluations. You can also list all of your evaluations.
Jump to:
- Evaluation Base URL
- Creating an Evaluation
- Evaluation Arguments
- Retrieving an Evaluation
- Evaluation Properties
- Updating an Evaluation
- Deleting an Evaluation
- Listing Evaluations
Evaluation Base URL
You can use the following base URL to create, retrieve, update, and delete evaluations.
https://au.bigml.io/evaluation
Evaluation base URL
All requests to manage your evaluations must use HTTPS and be authenticated using your username and API key to verify your identity. See this section for more details.
Creating an Evaluation
To create a new evaluation, you need to POST to the evaluation base URL an object containing the id of the supervised model that you want to evaluate and the dataset/id of the dataset that contains the data that will be used to compute the performance of the model. The content-type must always be "application/json".
POST /evaluation?username=alfred;api_key=79138a622755a2383660347f895444b1eb927730; HTTP/1.1
Host: bigml.io
Content-Type: application/json
Creating evaluation definition
curl "https://au.bigml.io/evaluation?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"model": "model/50650bea3c19201b64000024", "dataset": "dataset/50650bdf3c19201b64000020"}'
> Creating an evaluation
BigML.io will return the newly created evaluation if the request succeeded.
{
"category": 0,
"code": 201,
"created": "2012-09-28T02:37:10.074357",
"credits": 7.68,
"dataset": "dataset/50650bdf3c19201b64000020",
"dataset_status": true,
"description": "",
"fields_map": {},
"locale": "en-US",
"max_rows": 768,
"model": "model/50650bea3c19201b64000024",
"model_status": true,
"name": "Evaluation of diabetes' dataset model with diabetes' dataset",
"ordering": 0,
"out_of_bag": false,
"private": true,
"project": null,
"range": [
1,
768
],
"replacement": false,
"resource": "evaluation/50650d563c19202679000000",
"result": {},
"rows": 768,
"sample_rate": 1.0,
"size": 26191,
"source_status": true,
"status": {
"code": 1,
"message": "The evaluation is being processed and will be performed soon"
},
"tags": [],
"type": 0,
"updated": "2012-09-28T02:37:10.074381"
}
< Example evaluation JSON response
Evaluation Arguments
In addition to the model and the dataset, you can also POST the following arguments.
| Argument | Type | Description |
|---|---|---|
|
category
optional |
Integer, default is the category of the model |
The category that best describes the evaluation. See the category codes for the complete list of categories.
Example: 1 |
|
combiner
optional |
Integer |
Specifies the method that should be used to combine predictions when a non-boosted ensemble is used to create the evaluation. Note that if operating_kind or operating_point is present, combiner will be ignored. For classification ensembles, the combination is made by majority vote. The options are:
Example: 1 DEPRECATED |
|
confidence_threshold
optional |
Float |
This parameter has been deprecated because each evaluation of classifiers now contains all the matrices (see per_threshold_matrices) and the performance metrics for each probability threshold of the testing dataset. A number between 0 and 1 that can be used with logistic regressions, classification models, or ensembles so that they only return the positive_class when the confidence of the prediction is above the established threshold. When a positive_class is not provided, it will default to the majority class. When the confidence is below the threshold, the prediction returned will be the negative_class. If a negative_class is not provided, then the minority class will be returned. When the prediction is overridden, the new confidence returned will be 1 unless specified otherwise using negative_class_confidence.
Example: 0.7 DEPRECATED |
| dataset | String |
A valid dataset/id.
Example: dataset/4f66a80803ce8940c5000006 |
|
deepnet
optional |
String |
A valid deepnet/id.
Example: deepnet/55efc3564e1727d635000102 |
|
description
optional |
String |
A description of the evaluation up to 8192 characters long.
Example: "This is a description of my new evaluation" |
|
ensemble
optional |
String |
A valid ensemble/id.
Example: ensemble/517020d53c1920a514000056 |
|
excluded_fields
optional |
Array, default is [], an empty list. None of the fields in the dataset |
Specifies the fields in the dataset to be excluded to create the evaluation.
Example:
|
|
fields_map
optional |
Object |
A dictionary of identifiers of the fields to use from the model under test mapped to their corresponding identifiers in the input dataset.
Example: {"000000":"00000a", "000001":"000002", "000002":"000001", "000003":"000020", "000004":"000004"} |
|
fusion
optional |
String |
A valid fusion/id.
Example: fusion/5948be694e17273079000000 |
|
input_fields
optional |
Array, default is []. All the fields in the dataset |
Specifies the fields in the dataset to be considered to create the evaluation.
Example:
|
|
logisticregression
optional |
String |
A valid logisticregression/id.
Example: logisticregression/55efc3564e1727d635000004 |
|
missing_strategy
optional |
Integer, default is 0 |
Specifies the method that should be used for the model, ensemble or fusion when a missing split is found. That is, when a missing value is found in the input data for a decision node. The options are:
Example: 1 |
|
model
optional |
String |
A valid model id of the supported supervised model. Alternatively, you can use ensemble, logisticregression, deepnet, fusion, or timeseries arguments.
Example: model/4f67c0ee03ce89c74a000006 |
|
name
optional |
String, default is dataset's name |
The name you want to give to the new evaluation.
Example: "my new evaluation" |
|
negative_class
optional |
String |
The class that will be returned when a confidence threshold is used and the threshold is not reached for the model or ensemble.
Example: false DEPRECATED |
|
operating_kind
optional |
String, default is "probability" |
The operating threshold kind corresponding to the confusion matrix to perform the evaluation. It also replaces combiner and its value can be confidence, probability, and for non-boosted ensembles, votes.
Example: "confidence" |
|
operating_point
optional |
Object |
The specification of an operating point for classification problems to perform the evaluation. It consists of a positive_class (one of the categories of the model's objective field), a threshold (a number between 0 and 1), and an optional field kind (confidence, probability, or, for non-boosted ensembles, votes). When it is present, BigML will predict the positive_class if its probability, confidence, or votes (depending on the kind) is greater than the threshold set. Otherwise, BigML will predict the class with the highest probability, confidence, or votes. confidence and probability will yield the same results for boosted ensembles, deepnets, and logistic regressions. For the votes kind, the threshold specifies the ratio of models in the ensemble predicting the positive_class. Note that operating_point takes precedence over combiner and threshold, so they will be ignored if provided. However, unlike predictions and batch predictions, it does not take precedence over operating_kind, which is used as the threshold kind for ROC curves. Example:
|
|
ordering
optional |
Integer, default is 0 (deterministic). |
Specifies the type of ordering followed to pick the instances of the dataset to evaluate the model or ensemble. There are three different types that you can specify:
Example: 1 DEPRECATED |
|
out_of_bag
optional |
Boolean, default is false |
Setting this parameter to true will return a sequence of the out-of-bag instances instead of the sampled instances. See the Section on Sampling for more details. Dataset sampling doesn't apply to evaluations for time series.
Example: true |
|
positive_class
optional |
String |
The class that will be considered when a confidence threshold is used for the model or ensemble.
Example: false DEPRECATED |
|
private
optional |
Boolean, default is true |
Whether you want your evaluation to be private or not.
Example: false |
|
probability_threshold
optional |
Float |
A number between 0 and 1 that can be used for any classification model, ensemble, or logistic regression. The positive_class will only be predicted when the probability of the prediction for that class is above the established threshold. If a positive_class is not provided, it will default to the majority class. When the probability is below the threshold for the positive_class, the class with the next highest probability will be predicted instead.
Example: 0.7 DEPRECATED |
|
project
optional |
String |
The project/id you want the evaluation to belong to.
Example: "project/54d98718f0a5ea0b16000000" |
|
range
optional |
Array, default is [1, max rows in the dataset]. |
The range of successive instances to evaluate the model.
Example: [1, 150] |
|
replacement
optional |
Boolean, default is false |
Whether sampling should be performed with or without replacement. See the Section on Sampling for more details. Dataset sampling doesn't apply to evaluations for time series.
Example: true |
|
sample_rate
optional |
Float, default is 1.0 |
A real number between 0 and 1 specifying the sample rate. See the Section on Sampling for more details. Dataset sampling doesn't apply to evaluations for time series.
Example: 0.5 |
|
seed
optional |
String |
A string to be hashed to generate deterministic samples. See the Section on Sampling for more details. Dataset sampling doesn't apply to evaluations for time series.
Example: "MySample" |
|
tags
optional |
Array of Strings |
A list of strings that help classify and index your evaluation.
Example: ["best customers", "2018"] |
|
threshold
optional |
Object |
A dictionary with two optional keys for a non-boosted ensemble:
Note that their use is deprecated, and maintained only for backwards compatibility. Instead use an operating_point of kind votes. Example: {"k": 2, "class": "attack"} DEPRECATED |
|
timeseries
optional |
String |
A valid timeseries/id.
Example: timeseries/5948db0a4e17276eab000009 |
|
webhook
optional |
Object |
A webhook url and an optional secret phrase. See the Section on Webhooks for more details.
Example:
|
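The operating_point decision rule described above can be sketched in a few lines of Python. This is an illustration only; the class names, probabilities, and function name below are hypothetical and not part of the BigML API:

```python
def apply_operating_point(probabilities, positive_class, threshold):
    """Pick the predicted class the way an operating_point of kind
    "probability" does: return the positive_class when its probability
    is greater than the threshold; otherwise fall back to the class
    with the highest probability.  `probabilities` maps each class
    name of the objective field to its predicted probability."""
    if probabilities[positive_class] > threshold:
        return positive_class
    return max(probabilities, key=probabilities.get)

# Hypothetical per-class probabilities for a binary objective field
probs = {"true": 0.35, "false": 0.65}
low = apply_operating_point(probs, "true", 0.3)   # 0.35 > 0.3, so "true"
high = apply_operating_point(probs, "true", 0.5)  # falls back to the argmax
```

The same rule applies with confidence or votes in place of probability, depending on the kind.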
You can also use curl to customize a new evaluation. For example, to create a new evaluation named "my evaluation" using the first 50 instances in the dataset:
curl "https://au.bigml.io/evaluation?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"dataset": "dataset/50614ed53c192043ea00000c", "model": "model/50614edb3c192043ea000010", "range": [1, 50], "name": "my evaluation"}'
> Creating a customized evaluation
If you do not specify a name, BigML.io will assign to the new evaluation a combination of the dataset's name and the model's name. If you do not specify a range of instances, BigML.io will use all the instances in the dataset. If you do not specify any fields_map, BigML.io will include all the input fields in the dataset.
Retrieving an Evaluation
Each evaluation has a unique identifier in the form "evaluation/id" where id is a string of 24 alpha-numeric characters that you can use to retrieve the evaluation.
To retrieve an evaluation with curl:
curl "https://au.bigml.io/evaluation/50650d563c19202679000000?$BIGML_AUTH"
$ Retrieving an evaluation from the command line
You can also use your browser to visualize the evaluation using the full BigML.io URL or pasting the evaluation/id into the BigML.com.au dashboard.
Evaluation Properties
Once an evaluation has been successfully created it will have the following properties.
| Property | Type | Description |
|---|---|---|
|
boosted_ensemble
filterable, sortable |
Boolean | Whether the evaluation was built with an ensemble with boosted trees. |
|
category
filterable, sortable, updatable |
Integer | One of the categories in the table of categories that help classify this resource according to the domain of application. |
| code | Integer | HTTP status code. This will be 201 upon successful creation of the evaluation and 200 afterwards. Make sure that you check the code that comes with the status attribute to make sure that the evaluation creation has been completed without errors. |
| combiner | Integer | The method used to combine predictions from the non-boosted ensemble. See the available combiners above. DEPRECATED |
|
confidence_threshold
filterable, sortable |
Float | The minimum level of confidence on the positive class that a classification model needs to reach to return the positive_class. Otherwise, it will return the negative class. DEPRECATED |
|
created
filterable, sortable |
ISO-8601 Datetime. | This is the date and time in which the evaluation was created with microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are provided in Coordinated Universal Time (UTC). |
|
credits
filterable, sortable |
Float | The number of credits it cost you to create this evaluation. |
|
dataset
filterable, sortable |
String | The dataset/id that was used to evaluate the model. |
|
dataset_status
filterable, sortable |
Boolean | Whether the dataset is still available or has been deleted. |
|
deepnet
filterable, sortable |
String | The deepnet/id that was used to create the evaluation. |
|
description
updatable |
String | A text describing the evaluation. It can contain restricted markdown to decorate the text. |
|
ensemble
filterable, sortable |
String | The ensemble/id of the ensemble under evaluation. |
| excluded_fields | Array | The list of fields' ids that were excluded to build the evaluation. |
| fields_map | Array | The map of dataset fields to model fields used. |
|
fusion
filterable, sortable |
String | The fusion/id that was used to create the evaluation. |
| input_fields | Array | The list of input fields' ids used to create the evaluation. |
| locale | String | The evaluation's locale. |
|
logisticregression
filterable, sortable |
String | The logisticregression/id that was used to create the evaluation. |
|
max_rows
filterable, sortable |
Integer | The maximum number of instances in the dataset that can be used to evaluate the model. |
| missing_strategy |
Integer, default is 0 |
Specifies the strategy that a model follows when a missing value needed to continue inference is found. The possible values are:
|
|
model
filterable, sortable |
String | The model/id of the model under evaluation. |
|
model_status
filterable, sortable |
Boolean | Whether the model is still available or has been deleted. |
| model_type | Integer |
|
|
name
filterable, sortable, updatable |
String | The name of the evaluation. By default it is based on the name of the model and the dataset used. |
|
negative_class
filterable, sortable |
String | The negative class that will be returned when the model does not reach the confidence_threshold or probability_threshold level on the positive_class. DEPRECATED |
|
number_of_models
filterable, sortable |
Integer | The number of models being evaluated. |
| operating_kind | String | The operating threshold kind corresponding to the confusion matrix to perform the evaluation. See operating_kind above for more information. |
| operating_point | Object | The specification of an operating point for classification problems to perform the evaluation. See operating_point above for more information. |
|
optiml
filterable, sortable |
String | The optiml/id that created this evaluation. |
|
optiml_status
filterable, sortable |
Boolean | Whether the OptiML is still available or has been deleted. |
|
ordering
filterable, sortable |
Integer |
The order used to choose instances from the dataset to evaluate the model. There are three different types:
|
|
out_of_bag
filterable, sortable |
Boolean | Whether the out-of-bag instances were used to evaluate the model instead of the sampled instances. |
|
positive_class
filterable, sortable |
String | The positive class that will be considered when a confidence_threshold or probability_threshold is used. DEPRECATED |
|
private
filterable, sortable, updatable |
Boolean | Whether the evaluation is private or not. |
|
probability_threshold
filterable, sortable |
Float | The minimum level of probability on the positive class that boosted trees need to reach to return the positive_class. Otherwise, it will return the negative class. DEPRECATED |
|
project
filterable, sortable, updatable |
String | The project/id the resource belongs to. |
| range | Array | The range of instances used to evaluate the model. |
|
replacement
filterable, sortable |
Boolean | Whether the instances used to evaluate the model were sampled using replacement or not. |
| resource | String | The evaluation/id. |
| result | Object |
The result of the evaluation. Depending on the type of task performed by the model (i.e., classification, regression, or time series) the performance measures returned change.
|
|
rows
filterable, sortable |
Integer | The total number of instances used to evaluate the model. |
|
sample_rate
filterable, sortable |
Float | The sample rate used to select instances from the dataset to evaluate the model. |
|
shared
filterable, sortable |
Boolean | Whether the evaluation is shared using a private link or not. |
| shared_hash | String | The hash that gives access to this evaluation if it has been shared using a private link. |
| sharing_key | String | The alternative key that gives read access to this evaluation. |
|
size
filterable, sortable |
Integer | The number of bytes of the dataset that was used to evaluate the model. |
| status | Object | A description of the status of the evaluation. It includes a code, a message, and some extra information. See the table below. |
|
subscription
filterable, sortable |
Boolean | Whether the evaluation was created using a subscription plan or not. |
|
tags
filterable, updatable |
Array of Strings | A list of user tags that can help classify and index this resource. |
| threshold | Object | The parameters (k and class) given when a threshold-based combiner is used. DEPRECATED |
|
timeseries
filterable, sortable |
String | The timeseries/id that was used to create the evaluation. |
|
type
filterable, sortable |
Integer |
The type of task that the model performs:
|
|
updated
filterable, sortable |
ISO-8601 Datetime. | This is the date and time in which the evaluation was updated with microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are provided in Coordinated Universal Time (UTC). |
| webhook | Object | A webhook url and an optional secret phrase. See the Section on Webhooks for more details. |
A result object for classification models has the following properties:
| Property | Type | Description |
|---|---|---|
| accuracy | Float | The number of correctly classified instances in the dataset over the total of instances evaluated. |
| average_f_measure | Float | The average of the f_measure of all classes. |
| average_phi | Float | The average of the phi_coefficient of all classes. |
| average_precision | Float | The average of the precision of all classes. |
| average_recall | Float | The average of the recall of all classes. |
| confusion_matrix | An Array of Arrays | A list of rows of the confusion matrix (actual class x predicted class) that represents the performance of the model. See http://en.wikipedia.org/wiki/Confusion_matrix |
| per_class_statistics | Array | A list ordered by actual class with the performance measures described below. |
In a classification model, we compute the following measures for each class (in "per_class_statistics"), based on the true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN) for each class:
- Accuracy: The number of correctly classified instances in the dataset over the total of instances evaluated. accuracy = (TP + TN) / (TP + TN + FP + FN)
- Precision: The number of true positives over the total number of positive predictions. precision = TP / (TP + FP)
- Recall: The number of true positives over the number of positive instances. recall = TP / (TP + FN)
- F_measure: The balanced harmonic mean of precision and recall. See http://en.wikipedia.org/wiki/F1_score. f_measure = 2 * (precision * recall) / (precision + recall)
- Phi Coefficient: Also called the Matthews Correlation Coefficient; computed according to the formula at http://en.wikipedia.org/wiki/Matthews_correlation_coefficient. phi_coefficient = (TP * TN - FP * FN) / √((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
The "average" version of these measures at the top level of the evaluation are the "macro-averages" of the measures. That is, each measure is computed with respect to each class, then the computed values are averaged to get the average measure. You can read more on macro vs. micro averaging at http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-text-classification-1.html
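As a sketch of how the per-class measures and their macro-averages fit together, the following Python computes them from the four counts. The counts and function names are made up for illustration; they are not part of the API:

```python
import math

def class_statistics(tp, fp, tn, fn):
    """Compute the per-class measures defined above from the
    true/false positive/negative counts for one class."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    phi = (tp * tn - fp * fn) / denom if denom else 0.0
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f_measure": f_measure,
            "phi_coefficient": phi}

def macro_average(per_class, measure):
    """Macro-average: compute the measure per class, then average
    the per-class values (as opposed to pooling the counts)."""
    return sum(stats[measure] for stats in per_class) / len(per_class)

# Hypothetical binary counts for classes "true" and "false"
stats_true = class_statistics(tp=43, fp=5, tn=99, fn=6)
stats_false = class_statistics(tp=99, fp=6, tn=43, fn=5)
avg_recall = macro_average([stats_true, stats_false], "recall")
```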
The rest of the measures are derived from a list of confusion matrices at the various operating thresholds in the test data, as determined by the probabilities returned for each class per training example. An approximation to this list of confusion matrices (at a sampling of the available thresholds) is stored under the key per_threshold_confusion_matrices. The full set of matrices is used to construct the rest of the measures. These matrices are pairs of the form [[TP FP TN FN] THRESHOLD], where the former is the confusion matrix and the threshold is the operating threshold corresponding to that matrix. The first threshold is always nil, indicating the case where everything is classified positively.
- Ranking Measures measure the quality of the ranking provided by the classifier, as estimated from the performance at different operating thresholds. Their keys are:
  - area_under_roc_curve
  - kendalls_tau_b
  - spearmans_rho
- Maximized Measures are the maximum value of a given measure over all possible thresholds. As such, the key is mapped to a pair [value, threshold], giving the maximal value of the measure and the operating threshold at which it is maximized. Their keys are:
  - ks_statistic
  - max_phi
- Threshold Curves: In this category, there are curves where a given measure is computed at every operating threshold, so a curve can be drawn showing two values (one each on the x and y axes) for each threshold. The canonical curve of this sort is the ROC curve, which shows the false positive rate and the recall at each threshold. The given curves are lists of triples, where each triple is of the form [x y threshold]. Note that the last threshold is nil, indicating the case where everything is classified positively (the curve thresholds are sorted in the opposite order from the list of confusion matrices to maintain a non-decreasing ordering for the x-axis values of the curves). Their keys are:
  - gain_curve
  - lift_curve
  - negative_cdf
  - pr_curve
  - roc_curve
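A minimal sketch of how one such threshold curve arises: sweeping an operating threshold over the positive-class scores and emitting the ROC curve's [x y threshold] triples (false positive rate and recall). The scores, labels, and function name are hypothetical, for illustration only:

```python
def roc_points(scores, labels, thresholds):
    """For each operating threshold, classify an example as positive
    when its positive-class score exceeds the threshold (a threshold
    of None means everything is classified positively), then emit the
    [false_positive_rate, recall, threshold] triple."""
    points = []
    for t in thresholds:
        tp = fp = tn = fn = 0
        for score, actual in zip(scores, labels):
            predicted = t is None or score > t
            if predicted and actual:
                tp += 1
            elif predicted:
                fp += 1
            elif actual:
                fn += 1
            else:
                tn += 1
        fpr = fp / (fp + tn) if fp + tn else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        points.append([fpr, recall, t])
    return points

# Hypothetical positive-class probabilities and true labels
scores = [0.9, 0.8, 0.6, 0.4, 0.3]
labels = [True, True, False, True, False]
curve = roc_points(scores, labels, [0.7, 0.5, None])
```

The final None threshold yields the [1.0, 1.0] corner of the ROC curve, matching the nil threshold described above.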
A result object for regression models has the following properties:
| Property | Type | Description |
|---|---|---|
| mean_absolute_error | Float | The average of the absolute values of the differences between the target predicted by the model and the true target. |
| mean_squared_error | Float | The average of the squares of the differences between the target predicted by the model and the true target. |
| r_squared | Float | Also called the coefficient of determination. A measure of how much better the model is than predicting the mean of the test set. See Coefficient of Determination. |
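These three regression measures can be sketched directly from paired true values and predictions. The numbers and function name below are made up for illustration:

```python
def regression_metrics(y_true, y_pred):
    """Compute mean_absolute_error, mean_squared_error, and r_squared.
    r_squared is 1 minus the ratio of the model's squared error to the
    squared error of always predicting the mean of the test set."""
    n = len(y_true)
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n
    mean = sum(y_true) / n
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    r2 = 1 - (mse * n) / ss_tot if ss_tot else 0.0
    return mae, mse, r2

# Hypothetical true targets and model predictions
mae, mse, r2 = regression_metrics([3.0, 5.0, 7.0], [2.5, 5.0, 8.0])
```

A model no better than predicting the mean gets r_squared near 0, and a model worse than the mean gets a negative value, as in the random baseline of the example response below.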
Time series evaluations compare time series predictions (forecasts) against a test dataset containing true future time series values. For each field in the test dataset corresponding to the objective fields in the time series model, BigML computes the point predictions using each of the field's ets models (including the trivial ets models), with a forecast horizon equal to the number of rows in the test dataset.
A result object for time series models has the following properties:
| Property | Type | Description |
|---|---|---|
| mean_absolute_error | Float | The average of the absolute values of the differences between the target predicted by the model and the true target. |
| mean_absolute_scaled_error | Float | The average of the absolute scaled errors, a measure of the accuracy of forecasts. When comparing forecasting methods, the method with the lowest value is the preferred method. See Mean Absolute Scaled Error. |
| mean_directional_accuracy | Float | The fraction of forecast steps whose direction of change matches the direction of change of the true series. See Mean Directional Accuracy. |
| mean_squared_error | Float | The average of the squares of the differences between the target predicted by the model and the true target. |
| model | String | An abbreviated name which uniquely specifies the ETS model type. |
| symmetric_mean_absolute_percentage_error | Float | An accuracy measure based on percentage errors. See Symmetric Mean Absolute Percentage Error. |
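Two of these forecast measures can be sketched with made-up series values; the function names and data below are illustrative assumptions, not part of the API:

```python
def smape(actual, forecast):
    """Symmetric mean absolute percentage error (0-200 scale):
    the mean of |F - A| / ((|A| + |F|) / 2), times 100."""
    terms = [abs(f - a) / ((abs(a) + abs(f)) / 2)
             for a, f in zip(actual, forecast)]
    return 100 * sum(terms) / len(terms)

def mean_directional_accuracy(actual, forecast):
    """Fraction of steps where the forecast moves in the same
    direction (up or down) as the true series."""
    hits = sum((a1 - a0 > 0) == (f1 - f0 > 0)
               for a0, a1, f0, f1 in zip(actual, actual[1:],
                                         forecast, forecast[1:]))
    return hits / (len(actual) - 1)

# Hypothetical true future values and point forecasts
actual = [100.0, 110.0, 105.0, 120.0]
forecast = [102.0, 108.0, 107.0, 118.0]
mda = mean_directional_accuracy(actual, forecast)
```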
Evaluation Status
Creating an evaluation is a process that can take just a few seconds or a few days depending on the size of the dataset used as input and on the workload of BigML's systems. The evaluation goes through a number of states until it is fully completed. Through the status field in the evaluation you can determine when the evaluation has been fully processed. These are the properties that an evaluation's status has:
| Property | Type | Description |
|---|---|---|
| code | Integer | A status code that reflects the status of the evaluation creation. It can be any of those that are explained here. |
| elapsed | Integer | Number of milliseconds that BigML.io took to process the evaluation. |
| message | String | A human readable message explaining the status. |
| progress | Float, between 0 and 1 | How far BigML.io has progressed building the evaluation. |
Once an evaluation has been successfully finished, for a classification model, the evaluation will look like:
{
"category": 0,
"code": 200,
"created": "2012-11-10T04:10:05.970000",
"credits": 1.5360000000000003,
"dataset": "dataset/509dd2cd3c1920fbde000014",
"dataset_status": true,
"description": "",
"fields_map": {
"000000": "000000",
"000001": "000001",
"000002": "000002",
"000003": "000003",
"000004": "000004",
"000005": "000005",
"000006": "000006",
"000007": "000007",
"000008": "000008"
},
"locale": "en-US",
"max_rows": 768,
"model": "model/509dd3313c1920fbde000018",
"model_status": true,
"name": "Evaluation of Diabetes Train with Diabetest Test",
"objective_field": {
"column_number": 8,
"datatype": "string",
"name": "diabetes",
"optype": "categorical",
"order": 8,
"preferred": true,
"summary": {
"categories": [
[
"false",
500
],
[
"true",
268
]
],
"missing_count": 0
}
},
"ordering": 0,
"out_of_bag": false,
"private": true,
"project": null,
"range": [
1,
768
],
"replacement": false,
"resource": "evaluation/50650d563c19202679000000",
"result": {
"class_names": [
"false",
"true"
],
"mode": {
"accuracy": 0.67974,
"average_f_measure": 0.40467,
"average_phi": 0,
"average_precision": 0.5,
"average_recall": 0.33987,
"confusion_matrix": [
[
104,
0
],
[
49,
0
]
],
"per_class_statistics": [
{
"accuracy": 0.6797385620915033,
"class_name": "true",
"f_measure": 0,
"phi_coefficient": 0,
"precision": 0.0,
"recall": 0
},
{
"accuracy": 0.6797385620915033,
"class_name": "false",
"f_measure": 0.8093385214007782,
"phi_coefficient": 0,
"precision": 1.0,
"recall": 0.6797385620915033
}
]
},
"model": {
"accuracy": 0.9281,
"average_f_measure": 0.91698,
"average_phi": 0.83407,
"average_precision": 0.91474,
"average_recall": 0.91935,
"confusion_matrix": [
[
99,
5
],
[
6,
43
]
],
"per_class_statistics": [
{
"accuracy": 0.9281045751633987,
"class_name": "true",
"f_measure": 0.8865979381443299,
"phi_coefficient": 0.834069556858661,
"precision": 0.8775510204081632,
"recall": 0.8958333333333334
},
{
"accuracy": 0.9281045751633987,
"class_name": "false",
"f_measure": 0.9473684210526315,
"phi_coefficient": 0.834069556858661,
"precision": 0.9519230769230769,
"recall": 0.9428571428571428
}
]
},
"random": {
"accuracy": 0.57516,
"average_f_measure": 0.55573,
"average_phi": 0.13868,
"average_precision": 0.57418,
"average_recall": 0.56481,
"confusion_matrix": [
[
60,
44
],
[
21,
28
]
],
"per_class_statistics": [
{
"accuracy": 0.5751633986928105,
"class_name": "true",
"f_measure": 0.4628099173553719,
"phi_coefficient": 0.1386750490563073,
"precision": 0.5714285714285714,
"recall": 0.3888888888888889
},
{
"accuracy": 0.5751633986928105,
"class_name": "false",
"f_measure": 0.6486486486486486,
"phi_coefficient": 0.1386750490563073,
"precision": 0.5769230769230769,
"recall": 0.7407407407407407
}
]
}
},
"rows": 153,
"sample_rate": 0.2,
"seed": "",
"size": 5238,
"status": {
"code": 5,
"elapsed": 1569,
"message": "The evaluation has been performed",
"progress": 1.0
},
"tags": [],
"type": 0,
"updated": "2012-11-10T04:10:10.885000"
}
< Example evaluation JSON response - Classification
Or like this, if the evaluation is for a regression model:
{
"category": 0,
"code": 200,
"created": "2012-11-10T04:16:15.050000",
"credits": 1.5360000000000003,
"dataset": "dataset/509dd2cd3c1920fbde000014",
"dataset_status": true,
"description": "",
"fields_map": {
"000000": "000000",
"000001": "000001",
"000002": "000002",
"000003": "000003",
"000004": "000004",
"000005": "000005",
"000006": "000006",
"000007": "000007",
"000008": "000008"
},
"locale": "en-US",
"max_rows": 768,
"model": "model/509dd4713c1920fbde000020",
"model_status": true,
"name": "Evaluation of Plasma Glucose with diabetes test",
"objective_field": {
"column_number": 1,
"datatype": "int16",
"name": "plasma_glucose",
"optype": "numeric",
"order": 1,
"preferred": true,
"summary": {
"bins": [
[
0,
5
],
[
44,
1
],
[
56.66667,
3
],
[
61.5,
2
],
[
67.2,
5
],
[
73.3125,
16
],
[
79.79167,
24
],
[
84.26923,
26
],
[
89.64706,
51
],
[
95.29787,
47
],
[
100.97183,
71
],
[
107.21739,
69
],
[
111.48148,
27
],
[
114.57576,
33
],
[
118.5641,
39
],
[
123.68852,
61
],
[
128.62162,
37
],
[
132.57143,
21
],
[
137.52632,
38
],
[
142.65217,
23
],
[
146.4,
25
],
[
150.92857,
14
],
[
154.5625,
16
],
[
158.15385,
13
],
[
162.4,
15
],
[
166.5,
14
],
[
172,
15
],
[
176.16667,
6
],
[
180.125,
16
],
[
183.5,
6
],
[
188.23077,
13
],
[
195.6875,
16
]
],
"maximum": 199,
"mean": 120.89453,
"median": 116.86008,
"minimum": 0,
"missing_count": 0,
"population": 768,
"splits": [
73.4641,
80.66667,
84.54447,
88.29199,
91.05556,
94.06937,
96.48528,
99.20588,
100.75,
102.94511,
105.49074,
107.39491,
109.36701,
111.65834,
114.22967,
116.86008,
119.5,
122.08422,
124.38851,
126.57295,
129.07275,
132.7,
136.75,
140.5,
144.71612,
148.66667,
154.90098,
161.82574,
168.66667,
179.1,
187.58579
],
"standard_deviation": 31.97262,
"sum": 92847,
"sum_squares": 12008759,
"variance": 1022.24831
}
},
"ordering": 0,
"out_of_bag": false,
"private": true,
"project": null,
"range": [
1,
768
],
"replacement": false,
"resource": "evaluation/50650d563c19202679000000",
"result": {
"mean": {
"mean_absolute_error": 23.51463,
"mean_squared_error": 895.98218,
"r_squared": 0
},
"model": {
"mean_absolute_error": 12.42495,
"mean_squared_error": 370.07298,
"r_squared": 0.58696
},
"random": {
"mean_absolute_error": 61.30541,
"mean_squared_error": 5202.04721,
"r_squared": -4.80597
}
},
"rows": 153,
"sample_rate": 0.2,
"seed": "",
"size": 5238,
"status": {
"code": 5,
"elapsed": 970,
"message": "The evaluation has been performed",
"progress": 1.0
},
"tags": [],
"type": 1,
"updated": "2012-11-10T04:16:16.360000"
}
< Example evaluation JSON response - Regression
Updating an Evaluation
To update an evaluation, you need to PUT an object containing the fields that you want to update to the evaluation's base URL. The content-type must always be: "application/json". If the request succeeds, BigML.io will return with an HTTP 202 response with the updated evaluation.
For example, to update an evaluation with a new name you can use curl like this:
curl "https://au.bigml.io/evaluation/50650d563c19202679000000?$BIGML_AUTH" \
-X PUT \
-d '{"name": "a new name"}' \
-H 'content-type: application/json'
$ Updating an evaluation's name
Deleting an Evaluation
To delete an evaluation, you need to issue a HTTP DELETE request to the evaluation/id to be deleted.
Using curl you can do something like this to delete an evaluation:
curl -X DELETE "https://au.bigml.io/evaluation/50650d563c19202679000000?$BIGML_AUTH"
$ Deleting an evaluation from the command line
If the request succeeds you will not see anything on the command line unless you executed the command in verbose mode. Successful DELETEs will return "204 no content" responses with no body.
Once you delete an evaluation, it is permanently deleted. That is, a delete request cannot be undone. If you try to delete an evaluation a second time, or an evaluation that does not exist, you will receive a "404 not found" response.
However, if you try to delete an evaluation that is being used at the moment, then BigML.io will not accept the request and will respond with a "400 bad request" response.
See this section for more details.
Listing Evaluations
To list all the evaluations, you can use the evaluation base URL. By default, only the 20 most recent evaluations will be returned. You can see below how to change this number using the limit parameter.
You can get your list of evaluations directly in your browser using your own username and API key with the following links.
https://au.bigml.io/evaluation?$BIGML_AUTH
> Listing evaluations from a browser
OptiMLs
Last Updated: Tuesday, 2019-01-29 16:28
OptiML is an automated optimization process for model selection and parameterization (or hyper-parameterization) to solve classification and regression problems.
Selecting the right algorithm and its optimum parameter values is a manual and time-consuming task for any Machine Learning practitioner. This iterative process is currently based on trial and error (creating and evaluating different models to find the best one) and it requires a high level of expertise and intuition. OptiML accelerates the process of model search and parameter tuning, allowing non-experts to build top-performing models.
OptiML uses Bayesian parameter optimization for model selection and parameter tuning. It is based on the Sequential Model-based Algorithm Configuration (SMAC) optimization technique. It sequentially tries groups of parameters, training and evaluating models (using Monte Carlo cross-validation), and, based on the results, tries a new group of parameters. The model search is guided by an optimization metric that can be configured by the user. When the process finishes, a list of the top-performing models is returned so you can compare them and select the one that best fits your needs.
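The sequential search idea can be illustrated with a toy loop: evaluate a candidate configuration, then propose the next candidate near the best one found so far. This is a deliberately simplified sketch, not BigML's SMAC implementation; all names and the example problem are hypothetical:

```python
import random

def sequential_search(evaluate, sample_near, initial, iterations=20, seed=0):
    """Toy sequential search: score a candidate with `evaluate`
    (higher is better), keep the best seen, and let `sample_near`
    propose the next candidate from the current best."""
    rng = random.Random(seed)
    best, best_score = initial, evaluate(initial)
    for _ in range(iterations):
        candidate = sample_near(best, rng)
        score = evaluate(candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score

# Hypothetical 1-parameter problem whose metric peaks at 10;
# the proposal is deterministic so the demo always converges.
best, score = sequential_search(
    evaluate=lambda p: -abs(p - 10),
    sample_near=lambda p, rng: p + 1,
    initial=0.0,
    iterations=20)
# best == 10.0, score == 0.0
```

Real SMAC-style optimizers replace the naive proposal with a surrogate model of the metric surface, but the evaluate-then-refine loop is the same shape.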
BigML.io allows you to create, retrieve, update, and delete your optimls. You can also list all of your optimls.
Jump to:
- OptiML Base URL
- Creating an OptiML
- OptiML Arguments
- Retrieving an OptiML
- OptiML Properties
- Filtering and Paginating Fields from an OptiML
- Filtering and Paginating Models from an OptiML
- Updating an OptiML
- Deleting an OptiML
- Listing OptiMLs
OptiML Base URL
You can use the following base URL to create, retrieve, update, and delete optimls.
https://au.bigml.io/optiml
OptiML base URL
All requests to manage your optimls must use HTTPS and be authenticated using your username and API key to verify your identity. See this section for more details.
Creating an OptiML
To create a new OptiML, you need to POST to the OptiML base URL an object containing at least the dataset/id that you want to use to create the OptiML. The content-type must always be "application/json".
POST /optiml?username=alfred;api_key=79138a622755a2383660347f895444b1eb927730; HTTP/1.1
Host: bigml.io
Content-Type: application/json
Creating OptiML definition
curl "https://au.bigml.io/optiml?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"dataset": "dataset/5ae56a751f386f543100001d"}'
> Creating an OptiML
BigML.io will return the newly created OptiML if the request succeeded.
{
"category": 0,
"code": 201,
"configuration": null,
"configuration_status": false,
"created": "2018-05-09T03:25:50.687872",
"dataset": "dataset/5aec97fc4e1727358a000000",
"dataset_status": true,
"datasets": [],
"description": "",
"evaluations": [],
"excluded_fields": [],
"input_fields": [],
"model_count": {},
"models": [],
"name": "iris",
"name_options": "model candidates=4",
"objective_field": null,
"objective_field_details": null,
"objective_field_name": null,
"objective_field_type": null,
"objective_fields": [],
"optiml": {
"model_types": [
"model",
"ensemble",
"logisticregression"
],
"number_of_model_candidates": 4
},
"private": true,
"project": null,
"resource": "optiml/5ae6baa81f386f4a33000005",
"shared": false,
"size": 4608,
"source": "source/5aec0b924e17275daa0003fb",
"source_status": true,
"status": {
"code": 1,
"message": "The optiml creation request has been queued and will be processed soon"
},
"subscription": false,
"tags": [],
"test_dataset": null,
"type": 0,
"updated": "2018-05-09T03:25:50.687983"
}
< Example OptiML JSON response
OptiML Arguments
In addition to the dataset, you can also POST the following arguments.
| Argument | Type | Description |
|---|---|---|
|
category
optional |
Integer, default is 0 |
The category that best describes the OptiML. See the category codes for the complete list of categories.
Example: 1 |
|
creation_defaults
optional |
Object |
A map of default parameters to be used for the created resources. The map contains the keys all and all_models. Properties under all will be applied to all resources that will be created, and properties under all_models to model type resources. Available properties for
Example:
|
| dataset | String |
A valid dataset/id to train the models.
Example: dataset/4f66a80803ce8940c5000006 |
|
description
optional |
String |
A description of the OptiML up to 8192 characters long.
Example: "This is a description of my new OptiML" |
|
excluded_fields
optional |
Array, default is [], an empty list. None of the fields in the dataset is excluded. |
Specifies the fields that won't be included in the models of the OptiML
Example:
|
|
input_fields
optional |
Array, default is []. All the fields in the dataset |
Specifies the fields in the dataset to be considered to create the OptiML.
Example:
|
|
max_training_time
optional |
Integer, default is 1800 |
The maximum training time allowed for the optimization, in seconds, as a strictly positive integer.
Example: 3600 |
|
metric
optional |
String, default is "max_phi" for classification, "r_squared" for regression |
The metric used to evaluate the models. For classification problems, you can use one of these: accuracy, area_under_pr_curve, area_under_roc_curve, f_measure, kendalls_tau_b, ks_statistic, max_phi, phi_coefficient, precision, recall, and spearmans_rho. For regression problems, r_squared is the only supported metric. Deepnet optimization is an independent process that uses a combination of several metrics, hence it is not affected by the configuration of the metric.
Example: "f_measure" |
|
metric_class
optional |
String |
A particular class for which the metric should be taken for classification problems. Deepnet optimization is an independent process, hence it is not affected by the configuration of the metric class.
Example: "Yes" |
|
model_types
optional |
Array of Strings, default is ["model", "ensemble", "logisticregression", "deepnet"] |
A list of the types of model to create during the search. Available model types are model, ensemble, logisticregression, deepnet. All model types are selected if not specified.
Example: ["model", "ensemble"] |
|
name
optional |
String, default is OptiML's name |
The name you want to give to the new OptiML.
Example: "my new OptiML" |
|
number_of_model_candidates
optional |
Integer, default is 128 |
The number of model candidates evaluated over the course of the optimization. Maximum 200 candidates.
Example: 100 |
|
objective_field
optional |
String, default is dataset's pre-defined objective field |
Specifies the id of the field that the OptiML will predict.
Example: "000003" |
|
objective_weights
optional |
Array |
A list of category and weight pairs. One per objective class. For more information, see the Objective Weights section.
Example:
|
|
project
optional |
String |
The project/id you want the OptiML to belong to.
Example: "project/54d98718f0a5ea0b16000000" |
|
tags
optional |
Array of Strings |
A list of strings that help classify and index your OptiML.
Example: ["best customers", "2018"] |
|
test_dataset
optional |
String |
A valid dataset/id to test the models. If not provided, test datasets are created from the training dataset. Deepnet optimization is an independent process that always uses cross-validation, hence it is not affected by the configuration of a test dataset.
Example: dataset/5af266884e1727be82000000 |
|
webhook
optional |
Object |
A webhook url and an optional secret phrase. See the Section on Webhooks for more details.
Example:
|
| weight_field | String | Any numeric field with no negative or missing values is valid as a weight field. Each instance will be weighted individually according to the weight field's value. For more information, see the Weight Field section. |
If you do not specify a name, BigML.io will assign one to the new OptiML.
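Assembled as JSON, a creation request combining several of the optional arguments above might look like the following sketch (the dataset id is hypothetical):

```python
import json

# Sketch: an OptiML creation payload combining several of the optional
# arguments documented above. The dataset id is hypothetical.
payload = {
    "dataset": "dataset/5af26a6c4e1727be83000041",
    "name": "my new OptiML",
    "metric": "f_measure",
    "model_types": ["model", "ensemble"],
    "number_of_model_candidates": 100,
    "tags": ["best customers", "2018"],
}
body = json.dumps(payload)
# POST `body` to https://au.bigml.io/optiml?$BIGML_AUTH with
# content-type: application/json (for example via curl -d).
```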
Retrieving an OptiML
Each optiml has a unique identifier in the form "optiml/id" where id is a string of 24 alpha-numeric characters that you can use to retrieve the optiml.
To retrieve an optiml with curl:
curl "https://au.bigml.io/optiml/5ae6baa81f386f4a33000005?$BIGML_AUTH"
$ Retrieving an optiml from the command line
You can also use your browser to visualize the optiml using the full BigML.io URL or by pasting the optiml/id into the BigML.com.au dashboard.
OptiML Properties
Once an optiml has been successfully created it will have the following properties.
| Property | Type | Description |
|---|---|---|
|
category
filterable, sortable, updatable |
Integer | One of the categories in the table of categories that help classify this resource according to the domain of application. |
| code | Integer | HTTP status code. This will be 201 upon successful creation of the OptiML and 200 afterwards. Check the code that comes with the status attribute to verify that the OptiML creation has been completed without errors. |
|
composites
filterable, sortable |
Array of Strings | The list of composite ids that reference this model. |
|
created
filterable, sortable |
ISO-8601 Datetime. | This is the date and time in which the OptiML was created with microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are provided in Coordinated Universal Time (UTC). |
|
dataset
filterable, sortable |
String | The dataset/id that was used to train the models. |
|
dataset_status
filterable, sortable |
Boolean | Whether the dataset is still available or has been deleted. |
| datasets | Array |
The list of datasets created for this OptiML.
Example:
|
|
description
updatable |
String | A text describing the OptiML. It can contain restricted markdown to decorate the text. |
| evaluations | Array |
The list of evaluations created for this OptiML.
Example:
|
| excluded_fields | Array | The list of fields' ids that were excluded when building the OptiML. |
| fields_meta | Object | A dictionary with meta information about the fields dictionary. It specifies the total number of fields, the current offset and limit, and the number of fields (count) returned. |
| input_fields | Array | The list of input fields' ids used to build the models of the OptiML. |
|
model_count
filterable, sortable |
Object |
A dictionary that informs about the number of submodels of each type in the OptiML.
Example:
|
| models | Array |
A list of all submodels ids regardless of how models are filtered and paged.
Example:
|
| models_meta | Object | A dictionary with meta information about the filtered models. It specifies the total number of models, and the current offset and limit. |
|
name
filterable, sortable, updatable |
String | The name of the OptiML as you provided it. |
| objective_field | String |
Specifies the id of the field that the OptiML predicts.
Example: "000003" |
| objective_field_details | Object | The details of the objective fields. See the Objective Field Details. |
|
objective_field_name
filterable, sortable |
String | The name of the objective field in the OptiML. |
|
objective_field_type
filterable, sortable |
String | The type of the objective field in the OptiML. |
| objective_fields | Array | Specifies the list of ids of the field that the OptiML predicts. Even though this is an array, BigML.io only accepts one objective field in the current version. |
|
optiml
filterable, sortable |
Object | All the information that you need to recreate or use the OptiML on your own. See here for more details. |
|
private
filterable, sortable |
Boolean | Whether the OptiML is public or not. |
|
project
filterable, sortable, updatable |
String | The project/id the resource belongs to. |
| resource | String | The OptiML/id. |
|
shared
filterable |
Boolean | Whether the OptiML is shared using a private link or not. |
|
shared_clonable
filterable |
Boolean | Whether the shared OptiML can be cloned or not. |
|
shared_hash
filterable |
String | The hash that gives access to this OptiML if it has been shared using a private link. |
| sharing_key | String | The alternative key that gives read access to this OptiML. |
|
size
filterable, sortable |
Integer | The number of bytes of the dataset that were used to create this OptiML. |
|
source
filterable, sortable |
String | The source/id that was used to build the train dataset. |
|
source_status
filterable, sortable |
Boolean | Whether the source is still available or has been deleted. |
| status | Object | A description of the status of the OptiML. It includes a code, a message, and some extra information. See the table below. |
|
subscription
filterable, sortable |
Boolean | Whether the OptiML was created using a subscription plan or not. |
|
tags
filterable, updatable |
Array of Strings | A list of user tags that can help classify and index this resource. |
|
test_dataset
filterable, sortable |
String | The dataset/id that was used to test the OptiML. |
|
type
filterable, sortable |
Integer | Reserved for future use. |
|
updated
filterable, sortable |
ISO-8601 Datetime. | This is the date and time in which the OptiML was updated with microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are provided in Coordinated Universal Time (UTC). |
| webhook | Object | A webhook url and an optional secret phrase. See the Section on Webhooks for more details. |
The OptiML object has the following properties.
| Property | Type | Description |
|---|---|---|
| created_resources | Object |
This is a map of resources created by the optimization process that is updated incrementally during resource creation. The map may have a key for each model type, a key for each model type's evaluations (e.g., model_evaluation), and the key dataset. Example:
|
| datasets | Array | A list of objects with id, name, name_options of each dataset created during optimization. |
| evaluations | Array | A list of objects with id, name, name_options of each evaluation created during optimization. |
| fields | Object | All fields used by the models, without attempting to merge fields from different models in any way. For each field, the last summary found is used. See this Section for more details. |
| max_training_time | Integer | The maximum training time allowed for the OptiML, in seconds. |
| metric | String | The metric used to evaluate models for the OptiML. |
| metric_class | String | The metric class used to evaluate models for the OptiML. |
| model_types | String | The list of model types used in the OptiML. |
| models | Array | A filtered and paged list of the OptiML submodels with id, kind, name, name_options, evaluation, evaluation_count as well as other metadata the user provided while creating the OptiML. evaluation is a map which contains details about the model evaluation. |
|
number_of_model_candidates |
Integer | The number of model candidates evaluated over the course of the optimization. |
| recent_evaluations | Array | This is a list of the metric values for the most recent evaluations as given by the selected metric. The list is limited in length, and so only contains the n most recent values at any given time. Note that these values represent an evaluation on a single fold of cross-validation, and so may not correspond to any of the values in the final list of models. |
| search_complete | Boolean | Whether the search for the best models is complete. |
| summary | Object |
A dictionary, where each kind of model in the provided list is mapped to a map containing the keys count and best. The former is the number of models of that kind in the provided list, and best is the id of the model in the list with the highest metric_value.
Example:
|
The Objective Field Details object has the following properties.
OptiML Status
Creating an OptiML is a process that can take just a few seconds or a few days depending on the size of the dataset used as input, the configuration of the OptiML, and the workload of BigML's systems. The OptiML goes through a number of states until it is fully completed. Through the status field in the OptiML you can determine when the OptiML has been fully processed and is ready to be used to create predictions. These are the properties that an OptiML's status has:
| Property | Type | Description |
|---|---|---|
| code | Integer | A status code that reflects the status of the OptiML creation. It can be any of those that are explained here. |
| elapsed | Integer | Number of milliseconds that BigML.io took to process the OptiML. |
| message | String | A human readable message explaining the status. |
| progress | Float, between 0 and 1 | How far BigML.io has progressed building the OptiML. |
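In practice you poll the status until the code reaches 5 (finished) or -1 (faulty). A minimal sketch, where get_optiml stands in for a hypothetical helper that performs the authenticated GET shown earlier and returns the decoded JSON:

```python
import time

# Sketch: poll an OptiML's status until creation finishes or fails.
# `get_optiml` is a hypothetical helper that GETs
# https://au.bigml.io/optiml/<id>?$BIGML_AUTH and decodes the JSON.
FINISHED, FAULTY = 5, -1  # status codes, per the status codes section

def wait_until_ready(get_optiml, optiml_id, interval=10):
    while True:
        status = get_optiml(optiml_id)["status"]
        if status["code"] == FINISHED:
            return True
        if status["code"] == FAULTY:
            return False
        # `progress` moves from 0 to 1 while BigML.io builds the OptiML
        time.sleep(interval)
```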
Once an OptiML has been successfully created, it will look like:
{
"category":0,
"code":200,
"configuration":null,
"configuration_status":false,
"created":"2018-05-09T03:25:50.687000",
"dataset":"dataset/5aec97fc4e1727358a000000",
"dataset_status":true,
"datasets":[
"dataset/5af26a6c4e1727be83000041",
"dataset/5af26a6d4e1727be83000044",
"dataset/5af26a6d4e1727be8200007b",
"dataset/5af26a6e4e1727be83000047",
"dataset/5af26a6e4e1727be8300004a",
"dataset/5af26a6f4e1727be8300004d",
"dataset/5af26a6f4e1727be8200007e",
"dataset/5af26a704e1727be82000081",
"dataset/5af26a704e1727be83000050",
"dataset/5af26a714e1727be83000053"
],
"description":"",
"evaluations":[
"evaluation/5af26a864e1727be8300007c",
"evaluation/5af26a874e1727be820000ea",
"evaluation/5af26a884e1727be8300007f",
"evaluation/5af26a884e1727be820000ed"
],
"excluded_fields":[],
"fields_meta":{
"count":5,
"limit":1000,
"offset":0,
"query_total":5,
"total":5
},
"input_fields":[
"000000",
"000001",
"000002",
"000003"
],
"model_count":{
"ensemble":2,
"logisticregression":1,
"model":1,
"total":4
},
"models":[
"ensemble/5af26a714e1727be82000084",
"ensemble/5af26a804e1727be83000056",
"model/5af26a854e1727be83000079",
"logisticregression/5af26a854e1727be820000e7"
],
"models_meta": {
"count": 4,
"offset": 0,
"limit": 1000,
"total": 4
},
"name":"iris",
"name_options":"4 total models (ensemble: 2, logisticregression: 1, model: 1), metric=max_phi, model candidates=4, max. training time=1800",
"objective_field":"000004",
"objective_field_details":{
"column_number":4,
"datatype":"string",
"name":"species",
"optype":"categorical",
"order":4
},
"objective_field_name":"species",
"objective_field_type":"categorical",
"objective_fields":[
"000004"
],
"optiml":{
"created_resources":{
"dataset":10,
"ensemble":10,
"ensemble_evaluation":10,
"logisticregression":5,
"logisticregression_evaluation":5,
"model":5,
"model_evaluation":5
},
"datasets":[
{
"id":"dataset/5af26a6c4e1727be83000041",
"name":"iris",
"name_options":"120 instances, 5 fields (1 categorical, 4 numeric), sample rate=0.8"
},
{
"id":"dataset/5af26a6d4e1727be83000044",
"name":"iris",
"name_options":"30 instances, 5 fields (1 categorical, 4 numeric), sample rate=0.2, out of bag"
},
{
"id":"dataset/5af26a6d4e1727be8200007b",
"name":"iris",
"name_options":"120 instances, 5 fields (1 categorical, 4 numeric), sample rate=0.8"
},
{
"id":"dataset/5af26a6e4e1727be83000047",
"name":"iris",
"name_options":"30 instances, 5 fields (1 categorical, 4 numeric), sample rate=0.2, out of bag"
},
{
"id":"dataset/5af26a6e4e1727be8300004a",
"name":"iris",
"name_options":"30 instances, 5 fields (1 categorical, 4 numeric), sample rate=0.2, out of bag"
},
{
"id":"dataset/5af26a6f4e1727be8300004d",
"name":"iris",
"name_options":"120 instances, 5 fields (1 categorical, 4 numeric), sample rate=0.8"
},
{
"id":"dataset/5af26a6f4e1727be8200007e",
"name":"iris",
"name_options":"30 instances, 5 fields (1 categorical, 4 numeric), sample rate=0.2, out of bag"
},
{
"id":"dataset/5af26a704e1727be82000081",
"name":"iris",
"name_options":"120 instances, 5 fields (1 categorical, 4 numeric), sample rate=0.8"
},
{
"id":"dataset/5af26a704e1727be83000050",
"name":"iris",
"name_options":"30 instances, 5 fields (1 categorical, 4 numeric), sample rate=0.2, out of bag"
},
{
"id":"dataset/5af26a714e1727be83000053",
"name":"iris",
"name_options":"120 instances, 5 fields (1 categorical, 4 numeric), sample rate=0.8"
}
],
"max_training_time":1800,
"metric":"max_phi",
"model_types":[
"model",
"ensemble",
"logisticregression"
],
"models":[
{
"evaluation":{
"id":"evaluation/5af26a864e1727be8300007c",
"info":{
"accuracy":0.96667,
"average_area_under_pr_curve":0.99492,
"average_area_under_roc_curve":0.99683,
"average_balanced_accuracy":0.97353,
"average_f_measure":0.97011,
"average_kendalls_tau_b":0.66275,
"average_ks_statistic":0.96373,
"average_max_phi":0.95472,
"average_phi":0.95356,
"average_precision":0.97619,
"average_recall":0.96667,
"average_spearmans_rho":0.79768,
"per_class_statistics":[
{
"accuracy":1,
"area_under_pr_curve":1,
"area_under_roc_curve":1,
"balanced_accuracy":1,
"class_name":"Iris-setosa",
"f_measure":1,
"kendalls_tau_b":0.60907,
"ks_statistic":[
1,
0.12619
],
"max_phi":[
1,
0.12619
],
"phi_coefficient":1,
"precision":1,
"present_in_test_data":true,
"recall":1,
"spearmans_rho":0.73306
},
{
"accuracy":0.96667,
"area_under_pr_curve":0.99429,
"area_under_roc_curve":0.99548,
"balanced_accuracy":0.97059,
"class_name":"Iris-versicolor",
"f_measure":0.96296,
"kendalls_tau_b":0.70714,
"ks_statistic":[
0.94118,
0.22735
],
"max_phi":[
0.93485,
0.22735
],
"phi_coefficient":0.93485,
"precision":0.92857,
"present_in_test_data":true,
"recall":1,
"spearmans_rho":0.85109
},
{
"accuracy":0.96667,
"area_under_pr_curve":0.99045,
"area_under_roc_curve":0.995,
"balanced_accuracy":0.95,
"class_name":"Iris-virginica",
"f_measure":0.94737,
"kendalls_tau_b":0.67206,
"ks_statistic":[
0.95,
0.18325
],
"max_phi":[
0.92932,
0.18325
],
"phi_coefficient":0.92582,
"precision":1,
"present_in_test_data":true,
"recall":0.9,
"spearmans_rho":0.80887
}
]
},
"metric_value":0.95472,
"metric_variance":0.00068,
"name":"iris vs. iris",
"name_options":"boosted trees, 1999-node, 16-iteration, deterministic order, balanced, operating kind=probability"
},
"evaluation_count": 5,
"id":"ensemble/5af26a714e1727be82000084",
"importance":[
[
"000003",
0.41002
],
[
"000002",
0.32374
],
[
"000000",
0.1754
],
[
"000001",
0.09084
]
],
"kind":"ensemble",
"name":"iris",
"name_options":"boosted trees, 1999-node, 16-iteration, deterministic order, balanced"
},
{
"evaluation":{
"id":"evaluation/5af26a874e1727be820000ea",
"info":{
"accuracy":0.96667,
"average_area_under_pr_curve":0.98809,
"average_area_under_roc_curve":0.99206,
"average_balanced_accuracy":0.97353,
"average_f_measure":0.97011,
"average_kendalls_tau_b":0.82025,
"average_ks_statistic":0.94706,
"average_max_phi":0.95356,
"average_phi":0.95356,
"average_precision":0.97619,
"average_recall":0.96667,
"average_spearmans_rho":0.89212,
"per_class_statistics":[
{
"accuracy":1,
"area_under_pr_curve":1,
"area_under_roc_curve":1,
"balanced_accuracy":1,
"class_name":"Iris-setosa",
"f_measure":1,
"kendalls_tau_b":0.98187,
"ks_statistic":[
1,
0
],
"max_phi":[
1,
0
],
"phi_coefficient":1,
"precision":1,
"present_in_test_data":true,
"recall":1,
"spearmans_rho":0.99568
},
{
"accuracy":0.96667,
"area_under_pr_curve":0.98489,
"area_under_roc_curve":0.98869,
"balanced_accuracy":0.97059,
"class_name":"Iris-versicolor",
"f_measure":0.96296,
"kendalls_tau_b":0.76685,
"ks_statistic":[
0.94118,
0.35524
],
"max_phi":[
0.93485,
0.35524
],
"phi_coefficient":0.93485,
"precision":0.92857,
"present_in_test_data":true,
"recall":1,
"spearmans_rho":0.86883
},
{
"accuracy":0.96667,
"area_under_pr_curve":0.97937,
"area_under_roc_curve":0.9875,
"balanced_accuracy":0.95,
"class_name":"Iris-virginica",
"f_measure":0.94737,
"kendalls_tau_b":0.71204,
"ks_statistic":[
0.9,
0.375
],
"max_phi":[
0.92582,
0.375
],
"phi_coefficient":0.92582,
"precision":1,
"present_in_test_data":true,
"recall":0.9,
"spearmans_rho":0.81184
}
]
},
"metric_value":0.95356,
"metric_variance":0.00046,
"name":"iris vs. iris",
"name_options":"random decision forest, 1999-node, random candidate ratio: 0.25, 16-model, deterministic order, balanced, operating kind=probability"
},
"evaluation_count": 5,
"id":"ensemble/5af26a804e1727be83000056",
"importance":[
[
"000001",
0.33304
],
[
"000002",
0.24475
],
[
"000000",
0.23243
],
[
"000003",
0.18978
]
],
"kind":"ensemble",
"name":"iris",
"name_options":"random decision forest, 1999-node, random candidate ratio: 0.25, 16-model, deterministic order, balanced"
},
{
"evaluation":{
"id":"evaluation/5af26a884e1727be8300007f",
"info":{
"accuracy":0.96667,
"average_area_under_pr_curve":0.97315,
"average_area_under_roc_curve":0.97271,
"average_balanced_accuracy":0.97271,
"average_f_measure":0.9659,
"average_kendalls_tau_b":0.95101,
"average_ks_statistic":0.94542,
"average_max_phi":0.95101,
"average_phi":0.95101,
"average_precision":0.97222,
"average_recall":0.96296,
"average_spearmans_rho":0.95101,
"per_class_statistics":[
{
"accuracy":1,
"area_under_pr_curve":1,
"area_under_roc_curve":1,
"balanced_accuracy":1,
"class_name":"Iris-setosa",
"f_measure":1,
"kendalls_tau_b":1,
"ks_statistic":[
1,
0
],
"max_phi":[
1,
0
],
"phi_coefficient":1,
"precision":1,
"present_in_test_data":true,
"recall":1,
"spearmans_rho":1
},
{
"accuracy":0.96667,
"area_under_pr_curve":0.96111,
"area_under_roc_curve":0.94444,
"balanced_accuracy":0.94444,
"class_name":"Iris-versicolor",
"f_measure":0.94118,
"kendalls_tau_b":0.92113,
"ks_statistic":[
0.88889,
0
],
"max_phi":[
0.92113,
0
],
"phi_coefficient":0.92113,
"precision":1,
"present_in_test_data":true,
"recall":0.88889,
"spearmans_rho":0.92113
},
{
"accuracy":0.96667,
"area_under_pr_curve":0.95833,
"area_under_roc_curve":0.97368,
"balanced_accuracy":0.97368,
"class_name":"Iris-virginica",
"f_measure":0.95652,
"kendalls_tau_b":0.93189,
"ks_statistic":[
0.94737,
0
],
"max_phi":[
0.93189,
0
],
"phi_coefficient":0.93189,
"precision":0.91667,
"present_in_test_data":true,
"recall":1,
"spearmans_rho":0.93189
}
]
},
"metric_value":0.95101,
"metric_variance":0.00039,
"name":"iris vs. iris",
"name_options":"1999-node, pruned, deterministic order, balanced, operating kind=probability"
},
"evaluation_count": 5,
"id":"model/5af26a854e1727be83000079",
"importance":[
[
"000003",
0.91627
],
[
"000002",
0.08373
]
],
"kind":"model",
"name":"iris",
"name_options":"1999-node, pruned, deterministic order, balanced"
},
{
"evaluation":{
"id":"evaluation/5af26a884e1727be820000ed",
"info":{
"accuracy":0.86667,
"average_area_under_pr_curve":0.92857,
"average_area_under_roc_curve":0.9584,
"average_balanced_accuracy":0.89689,
"average_f_measure":0.86111,
"average_kendalls_tau_b":0.61769,
"average_ks_statistic":0.89099,
"average_max_phi":0.90068,
"average_phi":0.79377,
"average_precision":0.86111,
"average_recall":0.86111,
"average_spearmans_rho":0.74421,
"per_class_statistics":[
{
"accuracy":1,
"area_under_pr_curve":1,
"area_under_roc_curve":1,
"balanced_accuracy":1,
"class_name":"Iris-setosa",
"f_measure":1,
"kendalls_tau_b":0.67806,
"ks_statistic":[
1,
0.16263
],
"max_phi":[
1,
0.16263
],
"phi_coefficient":1,
"precision":1,
"present_in_test_data":true,
"recall":1,
"spearmans_rho":0.81695
},
{
"accuracy":0.86667,
"area_under_pr_curve":0.89611,
"area_under_roc_curve":0.92614,
"balanced_accuracy":0.82955,
"class_name":"Iris-versicolor",
"f_measure":0.75,
"kendalls_tau_b":0.54211,
"ks_statistic":[
0.78409,
0.44661
],
"max_phi":[
0.82916,
0.61132
],
"phi_coefficient":0.65909,
"precision":0.75,
"present_in_test_data":true,
"recall":0.75,
"spearmans_rho":0.65315
},
{
"accuracy":0.86667,
"area_under_pr_curve":0.88961,
"area_under_roc_curve":0.94907,
"balanced_accuracy":0.86111,
"class_name":"Iris-virginica",
"f_measure":0.83333,
"kendalls_tau_b":0.63289,
"ks_statistic":[
0.88889,
0.27004
],
"max_phi":[
0.87287,
0.27004
],
"phi_coefficient":0.72222,
"precision":0.83333,
"present_in_test_data":true,
"recall":0.83333,
"spearmans_rho":0.76253
}
]
},
"metric_value":0.90068,
"metric_variance":0.00287,
"name":"iris vs. iris",
"name_options":"L2 regularized (c=1), bias, auto-scaled, missing values, eps=0.001, operating kind=probability"
},
"evaluation_count": 5,
"id":"logisticregression/5af26a854e1727be820000e7",
"kind":"logisticregression",
"name":"iris",
"name_options":"L2 regularized (c=1), bias, auto-scaled, missing values, eps=0.001"
}
],
"number_of_model_candidates":4,
"recent_evaluations":[
0.94952,
1,
0.90068,
1,
0.94952,
1,
0.95472,
0.94952,
1,
0.95356,
0.94952,
0.90764,
0.95356,
0.90068,
1,
0.85966,
0.94952,
1,
0.95472,
0.95288,
0.94952,
1,
0.95356,
0.95374,
0.94952,
0.90764,
0.95356,
0.95288,
0.90068,
1,
0.85966,
0.88287,
0.94952,
1,
0.95356,
0.95374,
0.95257,
0.94952,
1,
0.95472,
0.95288,
1,
0.94952,
0.90764,
0.95356,
0.95288,
0.95101,
0.90068,
1,
0.85966,
0.88287,
0.90427
],
"search_complete": true,
"summary":{
"ensemble":{
"best":"ensemble/5af26a714e1727be82000084",
"count":2
},
"logisticregression":{
"best":"logisticregression/5af26a854e1727be820000e7",
"count":1
},
"model":{
"best":"model/5af26a854e1727be83000079",
"count":1
}
}
},
"private":true,
"project":null,
"resource": "optiml/5ae6baa81f386f4a33000005",
"shared":false,
"size":3686,
"source":"source/5aec0b924e17275daa0003fb",
"source_status":true,
"status":{
"code":5,
"elapsed":75656,
"message":"The optiml has been created",
"progress":1
},
"subscription":false,
"tags":[],
"test_dataset":null,
"type":0,
"updated":"2018-05-09T03:27:07.729000"
}
< Example OptiML JSON response
Filtering and Paginating Fields from an OptiML
An optiml might be composed of hundreds or even thousands of fields. Thus when retrieving an optiml, it's possible to specify that only a subset of fields be retrieved, by using any combination of the following parameters in the query string (unrecognized parameters are ignored):
Since fields is a map and therefore not ordered, the returned fields contain an additional key, order, whose integer (increasing) value gives you their ordering. In all other respects, the optiml is the same as the one you would get without any of the filtering parameters above.
The fields_meta field can help you paginate fields. Its structure is as follows:
Filtering and Paginating Models from an OptiML
Since model lists can grow large, we paginate the models list in the response when you GET it via HTTP. Pagination is specified using the following query string parameters:
- models_limit: A non-negative integer indicating how many elements in models to return. If not provided, we return at most 1000. If passed a negative value (say, -1), we return all of them.
- models_offset: The offset in the list of models (i.e., how many models are discarded before we take limit of them).
- models_sort_by: Sorting criteria, specified by any of the keys the user provided in the models maps during creation. Sorting is ascending, unless you prefix the key name with a minus sign. For instance, let's say your models have a property, rank. You can use a query string of the form models_sort_by=rank to sort them by rank in ascending order, and one of the form models_sort_by=-rank to sort them in descending order. It is possible to provide more than one ordering criterion, separated by commas, in which case the second and subsequent ones are used to break ties in the ordering generated by the previous ones.
Sorting happens before limit and offset are applied. When pagination is active, the models_meta property appears at the top level of the returned resource. This property will contain offset, limit, count, and total.
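The parameters above can be combined into a query string; a minimal sketch (rank is a hypothetical user-provided key):

```python
from urllib.parse import urlencode

# Sketch: build a query string that pages and sorts the `models` list
# of an OptiML, using the parameters described above.
def models_query(limit=1000, offset=0, sort_by=None):
    params = {"models_limit": limit, "models_offset": offset}
    if sort_by:
        params["models_sort_by"] = sort_by  # prefix a key with "-" for descending
    return urlencode(params)

# Second page of 10 models, sorted by a hypothetical `rank` key, descending:
qs = models_query(limit=10, offset=10, sort_by="-rank")
# Append to the retrieval URL, e.g.
# https://au.bigml.io/optiml/<id>?$BIGML_AUTH&<qs>
```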
Updating an OptiML
To update an optiml, you need to PUT an object containing the fields that you want to update to the optiml's base URL. The content-type must always be: "application/json". If the request succeeds, BigML.io will return an HTTP 202 response with the updated optiml.
For example, to update an OptiML with a new name you can use curl like this:
curl "https://au.bigml.io/optiml/5ae6baa81f386f4a33000005?$BIGML_AUTH" \
-X PUT \
-d '{"name": "a new name"}' \
-H 'content-type: application/json'
$ Updating an OptiML's name
Deleting an OptiML
To delete an optiml, you need to issue an HTTP DELETE request to the optiml/id to be deleted.
Using curl you can do something like this to delete an optiml:
curl -X DELETE "https://au.bigml.io/optiml/5ae6baa81f386f4a33000005?$BIGML_AUTH"
$ Deleting an optiml from the command line
If the request succeeds you will not see anything on the command line unless you executed the command in verbose mode. Successful DELETEs will return "204 no content" responses with no body.
Once you delete an optiml, it is permanently deleted. That is, a delete request cannot be undone. If you try to delete an optiml a second time, or an optiml that does not exist, you will receive a "404 not found" response.
However, if you try to delete an optiml that is being used at the moment, then BigML.io will not accept the request and will respond with a "400 bad request" response.
Note that you can also delete all resources that have been created by the OptiML. Simply append delete_all=true in the query string.
curl -X DELETE "https://au.bigml.io/optiml/5ae6baa81f386f4a33000005?$BIGML_AUTH;delete_all=true"
$ Deleting an optiml with all associated resources from the command line
Listing OptiMLs
To list all the optimls, you can use the optiml base URL. By default, only the 20 most recent optimls will be returned. You can see below how to change this number using the limit parameter.
You can get your list of optimls directly in your browser using your own username and API key with the following links.
https://au.bigml.io/optiml?$BIGML_AUTH
> Listing optimls from a browser
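The default of 20 can be changed with the limit parameter; a sketch of building the listing URL (offset is an assumption following the same pagination conventions used elsewhere in the API):

```python
# Sketch: build a listing URL with the `limit` parameter mentioned
# above (`offset` is an assumption based on the API's usual
# pagination conventions).
def listing_url(auth, limit=20, offset=0):
    return f"https://au.bigml.io/optiml?{auth}&limit={limit}&offset={offset}"

url = listing_url("$BIGML_AUTH", limit=5)
# https://au.bigml.io/optiml?$BIGML_AUTH&limit=5&offset=0
```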
Clusters
Last Updated: Tuesday, 2019-01-29 16:28
A cluster is a set of groups (i.e., clusters) of instances of a dataset that have been automatically classified together according to a distance measure computed using the fields of the dataset. Clusters can handle numeric, categorical, text and items fields as inputs:
- Numeric fields: the Euclidean distance is computed between the instances' numeric values.
- Categorical fields: a common way to handle categorical data is to take each category as a new field and assign 0 or 1 depending on the category. So a field with 20 categories will become 20 separate binary fields. BigML uses a technique called k-prototypes which modifies the distance function to operate as though the categories were transformed to binary values.
- Text and item fields: each instance is assigned a vector of terms and then cosine similarity is computed to determine closeness between instances.
To create a cluster, you can select an arbitrary number of clusters (i.e., k) and also select an arbitrary subset of fields from your dataset as input_fields. You can use scales to select how each field influences the distance measure used to group instances together.
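The distance components above can be sketched as follows (an illustration of the general techniques, not BigML's exact implementation):

```python
import math

# Sketch of the distance components described above: Euclidean distance
# for numeric fields, cosine similarity for term vectors from text/items
# fields, and a k-prototypes-style mismatch count for categorical fields.
def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def categorical_mismatches(a, b):
    # Each disagreement contributes as if the categories were one-hot encoded.
    return sum(1 for x, y in zip(a, b) if x != y)
```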
BigML.io allows you to create, retrieve, update, and delete your clusters. You can also list all of your clusters.
Jump to:
- Cluster Base URL
- Creating a Cluster
- Cluster Arguments
- Retrieving a Cluster
- Cluster Properties
- Create a Dataset Using a Cluster and a Centroid
- Create a Model Using a Cluster and a Centroid
- PMML
- Filtering and Paginating Fields from a Cluster
- Updating a Cluster
- Deleting a Cluster
- Listing Clusters
- Sampling Your Dataset
Cluster Base URL
You can use the following base URL to create, retrieve, update, and delete clusters. https://au.bigml.io/cluster
Cluster base URL
All requests to manage your clusters must use HTTPS and be authenticated using your username and API key to verify your identity. See this section for more details.
Creating a Cluster
To create a new cluster, you need to POST to the cluster base URL an object containing at least the dataset/id that you want to use to create the cluster. The content-type must always be "application/json".
POST /cluster?username=alfred;api_key=79138a622755a2383660347f895444b1eb927730; HTTP/1.1
Host: bigml.io
Content-Type: application/json
Creating cluster definition
curl "https://au.bigml.io/cluster?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"dataset": "dataset/537639383c19207026000004"}'
> Creating a cluster
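The request body can also carry any of the cluster arguments documented below; a sketch of a richer payload (the dataset id and field ids are hypothetical):

```python
import json

# Sketch: a cluster creation payload with a fixed k, a deterministic
# seed, and per-field scaling. The dataset id and field ids are
# hypothetical.
payload = {
    "dataset": "dataset/537639383c19207026000004",
    "name": "my new cluster",
    "k": 3,
    "cluster_seed": "My Seed",
    "field_scales": {"000000": 2},  # make this field twice as important
}
body = json.dumps(payload)
# POST `body` to https://au.bigml.io/cluster?$BIGML_AUTH with
# content-type: application/json.
```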
BigML.io will return the newly created cluster if the request succeeded.
{
"balance_fields": true,
"category": 0,
"cluster_datasets": {},
"cluster_models": {},
"cluster_seed": null,
"code": 201,
"columns": 0,
"created": "2014-05-17T15:54:02.419411",
"credits": 0.017578125,
"credits_per_prediction": 0.0,
"dataset": "dataset/537639383c19207026000004",
"dataset_field_types": {
"categorical": 1,
"datetime": 0,
"numeric": 4,
"preferred": 5,
"text": 0,
"total": 5
},
"dataset_status": true,
"dataset_type": 0,
"description": "",
"excluded_fields": [],
"field_scales": null,
"fields_meta": {
"count": 0,
"limit": 1000,
"offset": 0,
"total": 0
},
"input_fields": [],
"k": 8,
"locale": "en-US",
"max_columns": 5,
"max_rows": 150,
"model_clusters": false,
"name": "Iris' dataset cluster",
"number_of_batchcentroids": 0,
"number_of_centroids": 0,
"number_of_public_centroids": 0,
"out_of_bag": false,
"price": 0.0,
"private": true,
"project": null,
"range": [
1,
150
],
"replacement": false,
"resource": "cluster/5377861a3c1920126000001c",
"rows": 150,
"sample_rate": 1.0,
"scales": {},
"shared": false,
"size": 4608,
"source": "source/5341a53c3c19206725000000",
"source_status": true,
"status": {
"code": 1,
"message": "The cluster is being processed and will be created soon"
},
"subscription": false,
"summary_fields": [ ],
"tags": [],
"updated": "2014-05-17T15:54:02.419546",
"white_box": false
}
< Example cluster JSON response
Cluster Arguments
In addition to the dataset, you can also POST the following arguments.
| Argument | Type | Description |
|---|---|---|
|
balance_fields
optional |
Boolean, default is true. |
When this parameter is enabled, all the numeric fields will be scaled so that their standard deviations are 1. This makes each field have roughly equivalent influence.
Example: true |
|
category
optional |
Integer, default is the category of the dataset |
The category that best describes the cluster. See the category codes for the complete list of categories.
Example: 1 |
|
cluster_seed
optional |
String |
A string to generate deterministic clusters.
Example: "My Seed" |
|
critical_value
optional |
Integer, default is 5 |
The clustering algorithm G-means is parameter-free except for one parameter, critical_value. G-means iteratively takes existing clusters and tests whether each cluster's neighborhood appears Gaussian. If it doesn't, the cluster is split into two. The critical_value sets how strict the test is when deciding whether data looks Gaussian. The default is 5, which seems to work well in most cases. A range of 1 - 10 is acceptable. A critical_value of 1 means data must look very Gaussian to pass the test, which can lead to more clusters being detected. Higher critical_value settings tend to find fewer clusters.
Example: 3 |
| dataset | String |
A valid dataset/id.
Example: dataset/4f66a80803ce8940c5000006 |
|
default_numeric_value
optional |
String |
It accepts any of the following strings to substitute missing numeric values across all the numeric fields in the dataset: "mean", "median", "minimum", "maximum", "zero"
Example: "median" |
|
description
optional |
String |
A description of the cluster up to 8192 characters long.
Example: "This is a description of my new cluster" |
|
excluded_fields
optional |
Array, default is [], an empty list. None of the fields in the dataset is excluded. |
Specifies the fields that won't be included in the cluster.
Example:
|
|
field_scales
optional |
Object, default is {}, an empty dictionary. That is, no special scaling is used. |
With this argument you can pick your own scaling for each field. If a field isn't included in field_scales, BigML will treat the scale as 1 (no scale change). If both balance_fields and field_scales are present, then balance_fields will be applied first. This makes it easy for you to do things like balancing age and salary, but then requesting that age be twice as important.
Example:
|
|
fields
optional |
Object, default is {}, an empty dictionary. That is, no names or preferred statuses are changed. |
This can be used to change the names of the fields in the cluster with respect to the original names in the dataset, or to tell BigML that certain fields should be preferred. Add an entry keyed with the field id generated in the source for each field whose name you want updated.
Example:
|
|
input_fields
optional |
Array, default is [], meaning all the fields in the dataset are used |
Specifies the fields to be considered to create the cluster.
Example:
|
|
k
optional |
Integer, default is null, which selects G-means clustering |
The number of clusters. Must be null or a number greater than or equal to 1 and less than or equal to 300.
Example: 3 |
|
model_clusters
optional |
Boolean, default is false |
Whether a model for every cluster will be generated or not. Each model predicts whether or not an instance is part of its respective cluster.
Example: true |
|
name
optional |
String, default is dataset's name |
The name you want to give to the new cluster.
Example: "my new cluster" |
|
out_of_bag
optional |
Boolean, default is false |
Setting this parameter to true will return a sequence of the out-of-bag instances instead of the sampled instances. See the Section on Sampling for more details.
Example: true |
|
project
optional |
String |
The project/id you want the cluster to belong to.
Example: "project/54d98718f0a5ea0b16000000" |
|
range
optional |
Array, default is [1, max rows in the dataset] |
The range of successive instances to build the cluster.
Example: [1, 150] |
|
regularization
optional |
String, default is l2 |
Either l1 or l2 at the moment. It selects the norm to minimize when regularizing the solution. Regularizing with respect to the l1 norm causes more coefficients to be zero, and using the l2 norm forces the magnitudes of all coefficients towards zero.
Example: l1 |
|
replacement
optional |
Boolean, default is false |
Whether sampling should be performed with or without replacement. See the Section on Sampling for more details.
Example: true |
|
sample_rate
optional |
Float, default is 1.0 |
A real number between 0 and 1 specifying the sample rate. See the Section on Sampling for more details.
Example: 0.5 |
|
seed
optional |
String |
A string to be hashed to generate deterministic samples. See the Section on Sampling for more details.
Example: "MySample" |
|
summary_fields
optional |
Array, default is [] |
Specifies the ids of fields that will be included when generating the per-cluster summaries/datasets, but will not be used for clustering. The summary_fields must be a strict subset of the input_fields; the latter is adjusted before being passed to the model creation algorithm by setting it to all non-preferred fields if not provided explicitly, then adding explicit summary_fields and subtracting explicit excluded_fields. You can use either field identifiers or field names.
Example: ["000004"] |
|
tags
optional |
Array of Strings |
A list of strings that help classify and index your cluster.
Example: ["best customers", "2018"] |
|
webhook
optional |
Object |
A webhook url and an optional secret phrase. See the Section on Webhooks for more details.
Example:
|
|
weight_field
optional |
String |
Any numeric field with no negative or missing values is valid as a weight field. Each instance will be weighted individually according to the weight field's value.
Example: "000004" |
You can also use curl to customize a new cluster. For example, to create a new cluster named "my cluster", with only certain rows, and with only two fields:
curl "https://au.bigml.io/cluster?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"dataset": "dataset/4f66a80803ce8940c5000006",
"input_fields": ["000001", "000003"],
"name": "my cluster",
"range": [25, 125]}'
> Creating a customized cluster
If you do not specify a name, BigML.io will assign to the new cluster the dataset's name. If you do not specify a range of instances, BigML.io will use all the instances in the dataset. If you do not specify any input fields, BigML.io will include all the input fields in the dataset.
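As a sanity check before POSTing, you can assemble and validate the request body locally. The sketch below reuses the dataset and field ids from the example above as placeholders; it only verifies that the body is well-formed JSON and does not contact the API:

```shell
# Hypothetical cluster-creation body; ids are placeholders from the example above.
payload='{"dataset": "dataset/4f66a80803ce8940c5000006",
          "input_fields": ["000001", "000003"],
          "k": 3,
          "name": "my cluster"}'
# Validate locally before handing it to curl -d "$payload":
echo "$payload" | python3 -m json.tool > /dev/null && echo "payload OK"
```

If the body is malformed, json.tool prints a parse error instead of "payload OK", so the broken request never reaches the API.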
Retrieving a Cluster
Each cluster has a unique identifier in the form "cluster/id" where id is a string of 24 alpha-numeric characters that you can use to retrieve the cluster.
To retrieve a cluster with curl:
curl "https://au.bigml.io/cluster/5377861a3c1920126000001c?$BIGML_AUTH"
$ Retrieving a cluster from the command line
You can also use your browser to visualize the cluster using the full BigML.io URL or pasting the cluster/id into the BigML.com.au dashboard.
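The retrieved JSON can also be inspected from the command line without extra tooling. This sketch parses a trimmed stand-in response with Python's json module to pull out the status code (in practice you would pipe curl's output instead of the hard-coded string):

```shell
# Stand-in for a real GET response; pipe the output of the curl above in practice.
response='{"resource": "cluster/5377861a3c1920126000001c", "status": {"code": 5}}'
code=$(echo "$response" | python3 -c 'import sys, json; print(json.load(sys.stdin)["status"]["code"])')
echo "status code: $code"
```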
Cluster Properties
Once a cluster has been successfully created it will have the following properties.
| Property | Type | Description |
|---|---|---|
|
balance_fields
filterable, sortable |
Boolean | Whether all the numeric fields have been scaled so that their standard deviations are 1. |
|
category
filterable, sortable, updatable |
Integer | One of the categories in the table of categories that help classify this resource according to the domain of application. |
| cluster_datasets | Object | A dictionary that maps cluster ids to dataset resources offering per field distribution summaries for each cluster. Each dataset resource can be serialized on-demand using the neighborhood of the cluster. |
| cluster_seed | String | With no seed, the cluster locations can vary from run to run. With a seed, the clusters are deterministic. |
| clusters | Object |
All the information that you need to recreate or use the cluster on your own. It includes:
|
| code | Integer | HTTP status code. This will be 201 upon successful creation of the cluster and 200 afterwards. Make sure that you check the code that comes with the status attribute to make sure that the cluster creation has been completed without errors. |
|
columns
filterable, sortable |
Integer | The number of fields in the cluster. |
|
composites
filterable, sortable |
Array of Strings | The list of composite ids that reference this model. |
|
created
filterable, sortable |
ISO-8601 Datetime. | This is the date and time in which the cluster was created with microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are provided in Coordinated Universal Time (UTC). |
|
credits
filterable, sortable |
Float | The number of credits it cost you to create this cluster. |
|
credits_per_prediction
filterable, sortable, updatable |
Float | This is the number of credits that other users will consume to make a prediction with your cluster if you made it public. |
|
dataset
filterable, sortable |
String | The dataset/id that was used to build the cluster. |
| dataset_field_types | Object | A dictionary that informs about the number of fields of each type in the dataset used to create the cluster. It has an entry per each field type (categorical, datetime, numeric, and text), an entry for preferred fields, and an entry for the total number of fields. |
|
dataset_status
filterable, sortable |
Boolean | Whether the dataset is still available or has been deleted. |
|
description
updatable |
String | A text describing the cluster. It can contain restricted markdown to decorate the text. |
| excluded_fields | Array | The list of field ids that were excluded when building the cluster. |
| field_scales | Object | The specific scales used for each field, if any. |
| fields_meta | Object | A dictionary with meta information about the fields dictionary. It specifies the total number of fields, the current offset and limit, and the number of fields (count) returned. |
| input_fields | Array | The list of input fields' ids used to build the models of the cluster. |
|
k
filterable, sortable |
Integer | The number of clusters. |
| locale | String | The dataset's locale. |
|
max_columns
filterable, sortable |
Integer | The total number of fields in the dataset used to build the cluster. |
|
max_rows
filterable, sortable |
Integer | The maximum number of instances in the dataset that can be used to build the cluster. |
|
model_clusters
filterable, sortable |
Boolean | Whether a model for each cluster was created or not. |
|
name
filterable, sortable, updatable |
String | The name of the cluster as provided by you, or based on the name of the dataset by default. |
|
number_of_batchcentroids
filterable, sortable |
Integer | The current number of batch centroids that use this cluster. |
|
number_of_centroids
filterable, sortable |
Integer | The current number of centroids that use this cluster. |
|
number_of_public_centroids
filterable, sortable |
Integer | The current number of public centroids that use this cluster. |
|
out_of_bag
filterable, sortable |
Boolean | Whether the out-of-bag instances were used to create the cluster instead of the sampled instances. |
|
price
filterable, sortable, updatable |
Float | The price other users must pay to clone your cluster. |
|
private
filterable, sortable, updatable |
Boolean | Whether the cluster is public or not. |
|
project
filterable, sortable, updatable |
String | The project/id the resource belongs to. |
| range | Array | The range of instances used to build the cluster. |
|
replacement
filterable, sortable |
Boolean | Whether the instances sampled to build the cluster were selected using replacement or not. |
| resource | String | The cluster/id. |
|
rows
filterable, sortable |
Integer | The total number of instances used to build the cluster. |
|
sample_rate
filterable, sortable |
Float | The sample rate used to select instances from the dataset to build the cluster. |
| scales | Object | A dictionary that represents the combination of user requested field_scales and balance_fields. |
|
seed
filterable, sortable |
String | The string that was used to generate the sample. |
|
shared
filterable, sortable, updatable |
Boolean | Whether the cluster is shared using a private link or not. |
| shared_hash | String | The hash that gives access to this cluster if it has been shared using a private link. |
| sharing_key | String | The alternative key that gives read access to this cluster. |
|
size
filterable, sortable |
Integer | The number of bytes of the dataset that were used to create this cluster. |
|
source
filterable, sortable |
String | The source/id that was used to build the dataset. |
|
source_status
filterable, sortable |
Boolean | Whether the source is still available or has been deleted. |
| status | Object | A description of the status of the cluster. It includes a code, a message, and some extra information. See the table below. |
|
subscription
filterable, sortable |
Boolean | Whether the cluster was created using a subscription plan or not. |
| summary_fields | Array | The list of field ids that are included when generating the cluster's summaries but were not used for clustering. |
|
tags
filterable, updatable |
Array of Strings | A list of user tags that can help classify and index this resource. |
|
updated
filterable, sortable |
ISO-8601 Datetime. | This is the date and time in which the cluster was updated with microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are provided in Coordinated Universal Time (UTC). |
| webhook | Object | A webhook url and an optional secret phrase. See the Section on Webhooks for more details. |
|
weight_field
filterable, sortable |
String | The field used to weight each instance of the dataset differently. |
|
white_box
filterable, sortable |
Boolean | Whether the cluster is publicly shared as a white-box. |
Cluster Status
Creating a cluster is a process that can take just a few seconds or a few days depending on the size of the dataset used as input and on the workload of BigML's systems. The cluster goes through a number of states until it is fully completed. Through the status field in the cluster you can determine when the cluster has been fully processed and is ready to be used to create predictions. These are the properties of a cluster's status:
| Property | Type | Description |
|---|---|---|
| code | Integer | A status code that reflects the status of the cluster creation. It can be any of those that are explained here. |
| elapsed | Integer | Number of milliseconds that BigML.io took to process the cluster. |
| message | String | A human readable message explaining the status. |
| progress | Float, between 0 and 1 | How far BigML.io has progressed building the cluster. |
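Because creation is asynchronous, a common pattern is to poll the status until code reaches 5 (finished). A minimal sketch with the network call stubbed out so it runs as-is; replace check_status with the authenticated curl GET from the retrieval example:

```shell
# check_status is a stub; a real version would curl the cluster resource and
# extract status.code as shown in the retrieval example above.
check_status() { echo 5; }
until [ "$(check_status)" -eq 5 ]; do
  sleep 2   # back off between polls
done
echo "cluster ready"
```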
Once a cluster has been successfully created, it will look like:
{
"balance_fields": true,
"category": 0,
"cluster_datasets": {},
"cluster_models": {},
"cluster_seed": "2c249dda00fbf54ab4cdd850532a584f286af5b6",
"clusters": {
"clusters": [
{
"center": {
"000000": 6.81684,
"000001": 3.12727,
"000002": 5.55084,
"000003": 1.99933
},
"count": 44,
"distance": {
"bins": [
[
0.03659,
1
],
[
0.05992,
1
],
[
0.08335,
2
],
[
0.08776,
1
],
[
0.09051,
2
],
[
0.09455,
1
],
[
0.09678,
1
],
[
0.10192,
3
],
[
0.10575,
2
],
[
0.11101,
1
],
[
0.11383,
1
],
[
0.12565,
1
],
[
0.14233,
2
],
[
0.14582,
1
],
[
0.15386,
1
],
[
0.15887,
1
],
[
0.16988,
1
],
[
0.17244,
1
],
[
0.17507,
1
],
[
0.17898,
1
],
[
0.1911,
5
],
[
0.19576,
1
],
[
0.20235,
2
],
[
0.2165,
2
],
[
0.22574,
1
],
[
0.22786,
1
],
[
0.25745,
1
],
[
0.26241,
1
],
[
0.27094,
1
],
[
0.33971,
1
],
[
0.37232,
1
],
[
0.38651,
1
]
],
"maximum": 0.38651,
"mean": 0.16733,
"median": 0.16437,
"minimum": 0.03659,
"population": 44,
"standard_deviation": 0.07883,
"sum": 7.36256,
"sum_squares": 1.49918,
"variance": 0.00621
},
"id": "000000",
"name": "Cluster 0"
},
{
"center": {
"000000": 5.006,
"000001": 3.428,
"000002": 1.462,
"000003": 0.246
},
"count": 50,
"distance": {
"bins": [
[
0.01692,
1
],
[
0.02701,
1
],
[
0.03976,
4
],
[
0.04698,
1
],
[
0.05116,
1
],
[
0.05536,
2
],
[
0.06733,
1
],
[
0.0745,
1
],
[
0.08518,
1
],
[
0.08882,
1
],
[
0.09307,
3
],
[
0.0969,
1
],
[
0.10169,
1
],
[
0.11685,
1
],
[
0.12167,
3
],
[
0.12718,
1
],
[
0.13393,
2
],
[
0.14237,
1
],
[
0.14704,
3
],
[
0.16085,
2
],
[
0.16884,
3
],
[
0.18397,
2
],
[
0.19028,
2
],
[
0.22407,
3
],
[
0.22853,
1
],
[
0.24727,
1
],
[
0.26337,
1
],
[
0.29205,
1
],
[
0.30355,
1
],
[
0.34761,
1
],
[
0.44439,
1
],
[
0.4947,
1
]
],
"maximum": 0.4947,
"mean": 0.15072,
"median": 0.13393,
"minimum": 0.01692,
"population": 50,
"standard_deviation": 0.10099,
"sum": 7.53624,
"sum_squares": 1.63568,
"variance": 0.0102
},
"id": "000001",
"name": "Cluster 1"
},
{
"center": {
"000000": 5.8531,
"000001": 2.68387,
"000002": 4.43077,
"000003": 1.43772
},
"count": 56,
"distance": {
"bins": [
[
0.07002,
2
],
[
0.07729,
1
],
[
0.08219,
2
],
[
0.08755,
1
],
[
0.09332,
1
],
[
0.09887,
2
],
[
0.10564,
2
],
[
0.11235,
2
],
[
0.12547,
2
],
[
0.13347,
4
],
[
0.13861,
4
],
[
0.14292,
2
],
[
0.14719,
2
],
[
0.15377,
4
],
[
0.15752,
2
],
[
0.16407,
1
],
[
0.16937,
2
],
[
0.17239,
1
],
[
0.17621,
1
],
[
0.17899,
3
],
[
0.19034,
2
],
[
0.19518,
3
],
[
0.21791,
1
],
[
0.22118,
1
],
[
0.23676,
1
],
[
0.23897,
1
],
[
0.25023,
1
],
[
0.25337,
1
],
[
0.26621,
1
],
[
0.29323,
1
],
[
0.29837,
1
],
[
0.37776,
1
]
],
"maximum": 0.37776,
"mean": 0.16169,
"median": 0.152,
"minimum": 0.06938,
"population": 56,
"standard_deviation": 0.06174,
"sum": 9.0545,
"sum_squares": 1.67364,
"variance": 0.00381
},
"id": "000002",
"name": "Cluster 2"
}
],
"fields": {
"000000": {
"column_number": 0,
"datatype": "double",
"name": "sepal length",
"optype": "numeric",
"order": 0,
"preferred": true,
"summary": {
"bins": [
[
4.3,
1
],
[
4.425,
4
],
[
4.6,
4
],
[
4.77143,
7
],
[
4.9625,
16
],
[
5.1,
9
],
[
5.2,
4
],
[
5.3,
1
],
[
5.4,
6
],
[
5.5,
7
],
[
5.6,
6
],
[
5.7,
8
],
[
5.8,
7
],
[
5.9,
3
],
[
6,
6
],
[
6.1,
6
],
[
6.2,
4
],
[
6.3,
9
],
[
6.4,
7
],
[
6.5,
5
],
[
6.6,
2
],
[
6.7,
8
],
[
6.8,
3
],
[
6.9,
4
],
[
7,
1
],
[
7.1,
1
],
[
7.2,
3
],
[
7.3,
1
],
[
7.4,
1
],
[
7.6,
1
],
[
7.7,
4
],
[
7.9,
1
]
],
"maximum": 7.9,
"mean": 5.84333,
"median": 5.77889,
"minimum": 4.3,
"missing_count": 0,
"population": 150,
"splits": [
4.51526,
4.67252,
4.81113,
4.89582,
4.96139,
5.01131,
5.05992,
5.11148,
5.18177,
5.35681,
5.44129,
5.5108,
5.58255,
5.65532,
5.71658,
5.77889,
5.85381,
5.97078,
6.05104,
6.13074,
6.23023,
6.29578,
6.35078,
6.41459,
6.49383,
6.63013,
6.70719,
6.79218,
6.92597,
7.20423,
7.64746
],
"standard_deviation": 0.82807,
"sum": 876.5,
"sum_squares": 5223.85,
"variance": 0.68569
}
},
"000001": {
"column_number": 1,
"datatype": "double",
"name": "sepal width",
"optype": "numeric",
"order": 1,
"preferred": true,
"summary": {
"counts": [
[
2,
1
],
[
2.2,
3
],
[
2.3,
4
],
[
2.4,
3
],
[
2.5,
8
],
[
2.6,
5
],
[
2.7,
9
],
[
2.8,
14
],
[
2.9,
10
],
[
3,
26
],
[
3.1,
11
],
[
3.2,
13
],
[
3.3,
6
],
[
3.4,
12
],
[
3.5,
6
],
[
3.6,
4
],
[
3.7,
3
],
[
3.8,
6
],
[
3.9,
2
],
[
4,
1
],
[
4.1,
1
],
[
4.2,
1
],
[
4.4,
1
]
],
"maximum": 4.4,
"mean": 3.05733,
"median": 3.02044,
"minimum": 2,
"missing_count": 0,
"population": 150,
"standard_deviation": 0.43587,
"sum": 458.6,
"sum_squares": 1430.4,
"variance": 0.18998
}
},
"000002": {
"column_number": 2,
"datatype": "double",
"name": "petal length",
"optype": "numeric",
"order": 2,
"preferred": true,
"summary": {
"bins": [
[
1,
1
],
[
1.16667,
3
],
[
1.3,
7
],
[
1.4,
13
],
[
1.5,
13
],
[
1.6,
7
],
[
1.7,
4
],
[
1.9,
2
],
[
3,
1
],
[
3.3,
2
],
[
3.5,
2
],
[
3.6,
1
],
[
3.75,
2
],
[
3.9,
3
],
[
4.0375,
8
],
[
4.23333,
6
],
[
4.46667,
12
],
[
4.6,
3
],
[
4.74444,
9
],
[
4.94444,
9
],
[
5.1,
8
],
[
5.25,
4
],
[
5.46,
5
],
[
5.6,
6
],
[
5.75,
6
],
[
5.95,
4
],
[
6.1,
3
],
[
6.3,
1
],
[
6.4,
1
],
[
6.6,
1
],
[
6.7,
2
],
[
6.9,
1
]
],
"maximum": 6.9,
"mean": 3.758,
"median": 4.34142,
"minimum": 1,
"missing_count": 0,
"population": 150,
"splits": [
1.25138,
1.32426,
1.37171,
1.40962,
1.44567,
1.48173,
1.51859,
1.56301,
1.6255,
1.74645,
3.23033,
3.675,
3.94203,
4.0469,
4.18243,
4.34142,
4.45309,
4.51823,
4.61771,
4.72566,
4.83445,
4.93363,
5.03807,
5.1064,
5.20938,
5.43979,
5.5744,
5.6646,
5.81496,
6.02913,
6.38125
],
"standard_deviation": 1.7653,
"sum": 563.7,
"sum_squares": 2582.71,
"variance": 3.11628
}
},
"000003": {
"column_number": 3,
"datatype": "double",
"name": "petal width",
"optype": "numeric",
"order": 3,
"preferred": true,
"summary": {
"counts": [
[
0.1,
5
],
[
0.2,
29
],
[
0.3,
7
],
[
0.4,
7
],
[
0.5,
1
],
[
0.6,
1
],
[
1,
7
],
[
1.1,
3
],
[
1.2,
5
],
[
1.3,
13
],
[
1.4,
8
],
[
1.5,
12
],
[
1.6,
4
],
[
1.7,
2
],
[
1.8,
12
],
[
1.9,
5
],
[
2,
6
],
[
2.1,
6
],
[
2.2,
3
],
[
2.3,
8
],
[
2.4,
3
],
[
2.5,
3
]
],
"maximum": 2.5,
"mean": 1.19933,
"median": 1.32848,
"minimum": 0.1,
"missing_count": 0,
"population": 150,
"standard_deviation": 0.76224,
"sum": 179.9,
"sum_squares": 302.33,
"variance": 0.58101
}
}
}
},
"code": 200,
"columns": 4,
"created": "2014-05-17T15:54:02.419411",
"credits": 0.017578125,
"credits_per_prediction": 0.0,
"dataset": "dataset/5378e0773c1920e7d8000000",
"dataset_field_types": {
"categorical": 1,
"datetime": 0,
"numeric": 4,
"preferred": 5,
"text": 0,
"total": 5
},
"dataset_status": true,
"dataset_type": 0,
"description": "",
"excluded_fields": [
"000004"
],
"field_scales": {},
"fields_meta": {
"count": 4,
"limit": 1000,
"offset": 0,
"query_total": 4,
"total": 4
},
"input_fields": [
"000000",
"000001",
"000002",
"000003"
],
"k": 3,
"locale": "en-US",
"max_columns": 5,
"max_rows": 150,
"model_clusters": false,
"name": "Iris' dataset cluster",
"number_of_batchcentroids": 0,
"number_of_centroids": 0,
"number_of_public_centroids": 0,
"out_of_bag": false,
"price": 0.0,
"private": true,
"project": null,
"range": [
1,
150
],
"replacement": false,
"resource": "cluster/5377861a3c1920126000001c",
"rows": 150,
"sample_rate": 1.0,
"scales": {
"000000": 0.22445403384096205,
"000001": 0.4264199229189562,
"000002": 0.10528728930079047,
"000003": 0.24383875393929133
},
"shared": false,
"size": 4608,
"source": "source/5341a53c3c19206725000000",
"source_status": true,
"status": {
"code": 5,
"elapsed": 1164,
"message": "The cluster has been created",
"progress": 1.0
},
"subscription": false,
"summary_fields": [ ],
"tags": [],
"updated": "2014-05-18T16:32:20.794000",
"white_box": false
}
< Example cluster JSON response
PMML
The default cluster output format is JSON. However, the pmml parameter allows you to include a PMML version of the cluster. The cluster will include an XML document that conforms to PMML v4.1. For example:
curl "https://au.bigml.io/cluster/5377861a3c1920126000001c?$BIGML_AUTH;pmml=yes"
PMML Example
curl "https://au.bigml.io/cluster/5377861a3c1920126000001c?$BIGML_AUTH;pmml=only"
PMML Example
Filtering and Paginating Fields from a Cluster
A cluster might be composed of hundreds or even thousands of fields. Thus, when retrieving a cluster, it's possible to specify that only a subset of fields be retrieved, using any combination of the following parameters in the query string (unrecognized parameters are ignored):
Since fields is a map and therefore not ordered, the returned fields contain an additional key, order, whose integer (increasing) value gives you their ordering. In all other respects, the cluster is the same as the one you would get without any of the filtering parameters above.
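Paging through fields amounts to stepping the offset by limit until the total reported in fields_meta is reached. This sketch only prints the requests it would make; the totals are made up for illustration:

```shell
total=25; limit=10; offset=0   # hypothetical field totals
while [ "$offset" -lt "$total" ]; do
  echo "GET .../cluster/<id>?offset=$offset&limit=$limit"
  offset=$((offset + limit))
done
```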
The fields_meta field can help you paginate fields: it reports the total number of fields, the current offset and limit, and the number of fields (count) returned.
Create a Dataset Using a Cluster and a Centroid
Each centroid has an associated pre-computed dataset created using all the instances in its neighborhood. You can create a new dataset using the corresponding cluster/id and centroid id as follows:
curl "https://au.bigml.io/dataset?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"cluster": "cluster/5377861a3c1920126000001c",
"centroid": "000000"}'
Creating a dataset using a cluster's centroid
Create a Model Using a Cluster and a Centroid
If you created a cluster with the model_clusters option set to true, then each centroid has an associated pre-computed model built using all the instances of the dataset. Each model separates the instances that belong to the centroid's neighborhood from those that belong to other neighborhoods. You can create a new model using the corresponding cluster/id and centroid id as follows:
curl "https://au.bigml.io/model?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"cluster": "cluster/5377861a3c1920126000001c",
"centroid": "000000"}'
Creating a model using a cluster's centroid
Updating a Cluster
To update a cluster, you need to PUT an object containing the fields that you want to update to the cluster's base URL. The content-type must always be: "application/json". If the request succeeds, BigML.io will return an HTTP 202 response with the updated cluster.
For example, to update a cluster with a new name you can use curl like this:
curl "https://au.bigml.io/cluster/5377861a3c1920126000001c?$BIGML_AUTH" \
-X PUT \
-H 'content-type: application/json' \
-d '{"name": "a new name"}'
$ Updating a cluster's name
If you want to update a cluster with a new label and description for a specific field you can use curl like this:
curl "https://au.bigml.io/cluster/5377861a3c1920126000001c?$BIGML_AUTH" \
-X PUT \
-H 'content-type: application/json' \
-d '{"fields": {"000000": {
"label": "a longer name",
"description": "an even longer description"}}}'
$ Updating a cluster's field, label, and description
To update a cluster with a new name for a specific cluster you can use curl like this:
curl "https://au.bigml.io/cluster/5377861a3c1920126000001c?$BIGML_AUTH" \
-X PUT \
-H 'content-type: application/json' \
-d '{"clusters": {"000000": {
"name": "New name for a cluster"}}}'
$ Updating the name of one of the clusters
Deleting a Cluster
To delete a cluster, you need to issue an HTTP DELETE request to the cluster/id to be deleted.
Using curl you can do something like this to delete a cluster:
curl -X DELETE "https://au.bigml.io/cluster/5377861a3c1920126000001c?$BIGML_AUTH"
$ Deleting a cluster from the command line
If the request succeeds you will not see anything on the command line unless you executed the command in verbose mode. Successful DELETEs will return "204 no content" responses with no body.
Once you delete a cluster, it is permanently deleted. That is, a delete request cannot be undone. If you try to delete a cluster a second time, or a cluster that does not exist, you will receive a "404 not found" response.
However, if you try to delete a cluster that is being used at the moment, then BigML.io will not accept the request and will respond with a "400 bad request" response.
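The DELETE semantics above can be folded into a small wrapper. Here http_code is hard-wired so the sketch runs offline; in practice it would be captured with curl -s -o /dev/null -w '%{http_code}' -X DELETE:

```shell
http_code=204   # stand-in; normally captured from curl's -w '%{http_code}'
case "$http_code" in
  204) msg="deleted" ;;
  404) msg="not found (already deleted, or never existed)" ;;
  400) msg="cluster is currently in use; not deleted" ;;
  *)   msg="unexpected status: $http_code" ;;
esac
echo "$msg"
```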
See this section for more details.
Listing Clusters
To list all the clusters, you can use the cluster base URL. By default, only the 20 most recent clusters will be returned. You can see below how to change this number using the limit parameter.
You can get your list of clusters directly in your browser using your own username and API key with the following links.
https://au.bigml.io/cluster?$BIGML_AUTH
> Listing clusters from a browser
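To change the default page size of 20, append the limit parameter the same way the other query-string options are appended. This sketch just builds the URL string, leaving the $BIGML_AUTH placeholder unexpanded so it runs without credentials:

```shell
limit=5
# \$ keeps the auth placeholder literal; drop the backslash to expand it.
url="https://au.bigml.io/cluster?\$BIGML_AUTH;limit=$limit"
echo "GET $url"
```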
Anomaly Detectors
Last Updated: Tuesday, 2019-01-29 16:28
Anomaly detectors can be applied to a variety of domains like fraud detection, security, quality control, medicine, etc.
BigML anomaly detectors are built using an unsupervised anomaly detection technique. Therefore, you do not need to explicitly label each instance in your dataset as "normal" or "abnormal".
When you create a new anomaly detector, it automatically returns an anomaly score for the top n most anomalous instances. The newly created anomaly detector can also be used to later create anomaly scores for new data points or batch anomaly scores for all the instances of a dataset.
BigML.io allows you to create, retrieve, update, and delete your anomaly detectors. You can also list all of your anomaly detectors.
Jump to:
- Anomaly Detector Base URL
- Creating an Anomaly Detector
- Anomaly Detector Arguments
- Retrieving an Anomaly Detector
- Anomaly Detector Properties
- Filtering and Paginating Fields from an Anomaly Detector
- Updating an Anomaly Detector
- Deleting an Anomaly Detector
- Listing Anomaly Detectors
- PMML
Anomaly Detector Base URL
You can use the following base URL to create, retrieve, update, and delete anomaly detectors.
https://au.bigml.io/anomaly
Anomaly Detector base URL
All requests to manage your anomaly detectors must use HTTPS and be authenticated using your username and API key to verify your identity. See this section for more details.
Creating an Anomaly Detector
To create a new anomaly detector, you need to POST to the anomaly detector base URL an object containing at least the dataset/id that you want to use to create the anomaly detector. The content-type must always be "application/json".
You can easily create a new anomaly detector using curl as follows. All you need is a valid dataset/id and your authentication variable set up as shown above.
curl "https://au.bigml.io/anomaly?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"dataset": "dataset/54222a14f0a5eaaab000000c"}'
> Creating an anomaly detector
BigML.io will return the newly created anomaly detector if the request succeeded.
{
"associations":null,
"category":0,
"clones":0,
"code":201,
"columns":0,
"created":"2015-10-17T02:48:41.128375",
"credits":0.27747344970703125,
"dataset":"dataset/561d685e10cb863a8200013d",
"dataset_field_types":{
"categorical":0,
"datetime":0,
"effective_fields":504,
"numeric":0,
"preferred":1,
"text":1,
"total":1
},
"dataset_status":true,
"dataset_type":0,
"description":"",
"excluded_fields":[],
"fields_meta":{
"count":0,
"limit":1000,
"offset":0,
"total":0
},
"input_fields":[],
"locale":"en-US",
"max_columns":1,
"max_rows":499,
"name":"Supermarket-dataset-processed's dataset's association",
"out_of_bag":false,
"price":0,
"private":true,
"project":null,
"range":[
1,
499
],
"replacement":false,
"resource":"anomaly/54223546f0a5eaaab0000018",
"rows":499,
"sample_rate":1,
"shared":false,
"size":72738,
"source":"source/561d684f10cb863a82000139",
"source_status":true,
"status":{
"code":1,
"message":"The association is being processed and will be created soon"
},
"subscription":true,
"tags":[],
"updated":"2015-10-17T02:48:41.128511",
"white_box":false
}
< Example anomaly detector JSON response
Anomaly Detector Arguments
In addition to the dataset, you can also POST the following arguments.
| Argument | Type | Description |
|---|---|---|
|
anomaly_seed
optional |
String |
A string to generate deterministic anomaly detectors.
Example: "My Seed" |
|
category
optional |
Integer, default is the category of the dataset |
The category that best describes the anomaly detector. See the category codes for the complete list of categories.
Example: 1 |
|
constraints
optional |
Boolean, default is false |
An experimental option which adds more predicates to each node in the tree. These predicates help capture expectations about the data, making the tree more sensitive to anomalies. This option tends to inflate the anomaly scores and requires more CPU time to build and evaluate. However, it also seems to make the trees more effective at flagging anomalous data that was not in the training set, and to improve the forest's effectiveness on categorical data.
Example: false |
| dataset | String |
A valid dataset/id.
Example: dataset/4f66a80803ce8940c5000006 |
|
default_numeric_value
optional |
String |
It accepts any of the following strings to substitute missing numeric values across all the numeric fields in the dataset: "mean", "median", "minimum", "maximum", "zero"
Example: "median" |
|
description
optional |
String |
A description of the anomaly detector up to 8192 characters long.
Example: "This is a description of my new anomaly detector" |
|
excluded_fields
optional |
Array, default is [], an empty list. None of the fields in the dataset is excluded. |
Specifies the fields that won't be included in the anomaly detector.
Example:
|
|
fields
optional |
Object, default is {}, an empty dictionary. That is, no names or preferred statuses are changed. |
This can be used to change the names of the fields in the anomaly detector with respect to the original names in the dataset, or to tell BigML that certain fields should be preferred. Add an entry keyed with the field id generated in the source for each field whose name you want updated.
Example:
|
|
forest_size
optional |
Integer, default is 128 |
The number of trees used by the anomaly detector. Must be a number greater than or equal to 2 and less than or equal to 1000.
Example: 256 |
|
id_fields
optional |
Array, default is [] |
Specifies the ids of fields that will be included when computing the top anomalies, but will not be considered when creating the anomaly detector. The id_fields must be a strict subset of the input_fields; the latter is adjusted before being passed to the model creation algorithm by setting it to all non-preferred fields if not provided explicitly, then adding explicit id_fields and subtracting explicit excluded_fields. You can use either field identifiers or field names.
Example:
|
|
input_fields
optional |
Array, default is [], meaning all the fields in the dataset are used |
Specifies the fields to be considered to create the anomaly detector.
Example:
|
|
name
optional |
String, default is dataset's name |
The name you want to give to the new anomaly detector.
Example: "my new anomaly detector" |
|
normalize_repeats
optional |
Boolean, default is false |
Controls whether the frequency of repeated (or very similar) data points lowers the anomaly score of the repeats.
Example: false |
|
out_of_bag
optional |
Boolean, default is false |
Setting this parameter to true will return a sequence of the out-of-bag instances instead of the sampled instances. See the Section on Sampling for more details.
Example: true |
|
project
optional |
String |
The project/id you want the anomaly detector to belong to.
Example: "project/54d98718f0a5ea0b16000000" |
|
range
optional |
Array, default is [1, max rows in the dataset] |
The range of successive instances to build the anomaly detector.
Example: [1, 150] |
|
replacement
optional |
Boolean, default is false |
Whether sampling should be performed with or without replacement. See the Section on Sampling for more details.
Example: true |
|
sample_rate
optional |
Float, default is 1.0 |
A real number between 0 and 1 specifying the sample rate. See the Section on Sampling for more details.
Example: 0.5 |
|
seed
optional |
String |
A string to be hashed to generate deterministic samples. See the Section on Sampling for more details.
Example: "MySample" |
|
tags
optional |
Array of Strings |
A list of strings that help classify and index your anomaly detector.
Example: ["best customers", "2018"] |
|
top_n
optional |
Integer, default is 10 |
The number of instances scored as most anomalous that will be returned together with the anomaly detector. The minimum number is 1 and the maximum is 1024.
Example: 256 |
|
webhook
optional |
Object |
A webhook url and an optional secret phrase. See the Section on Webhooks for more details.
Example:
|
You can also use curl to customize a new anomaly detector. For example, to create a new anomaly detector named "my anomaly detector", with only certain rows, and with only three fields:
curl "https://au.bigml.io/anomaly?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"dataset": "dataset/54222a14f0a5eaaab000000c",
"input_fields": ["000001", "000002", "000003"],
"name": "my anomaly detector",
"range": [25, 125]}'
> Creating a customized anomaly detector
If you do not specify a name, BigML.io will assign to the new anomaly detector the dataset's name. If you do not specify a range of instances, BigML.io will use all the instances in the dataset. If you do not specify any input fields, BigML.io will include all the input fields in the dataset.
Read the Section on Sampling Your Dataset to learn how to sample your dataset. Here's an example of an anomaly detector request with range and sampling specifications:
curl "https://au.bigml.io/anomaly?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"dataset": "dataset/505f43223c1920eccc000297",
"range": [1, 5000],
"sample_rate": 0.5}'
Creating an anomaly detector using sampling
Retrieving an Anomaly Detector
Each anomaly detector has a unique identifier in the form "anomaly/id" where id is a string of 24 alpha-numeric characters that you can use to retrieve the anomaly detector.
To retrieve an anomaly detector with curl:
curl "https://au.bigml.io/anomaly/54223546f0a5eaaab0000018?$BIGML_AUTH"
$ Retrieving an anomaly detector from the command line
You can also use your browser to visualize the anomaly detector using the full BigML.io URL or pasting the anomaly/id into the BigML.com.au dashboard.
Anomaly Detector Properties
Once an anomaly detector has been successfully created it will have the following properties.
| Property | Type | Description |
|---|---|---|
| anomaly_seed | String | With no seed, the anomaly detector's results can vary from run to run. With a seed, the anomaly detector is deterministic. |
|
category
filterable, sortable, updatable |
Integer | One of the categories in the table of categories that help classify this resource according to the domain of application. |
| code | Integer | HTTP status code. This will be 201 upon successful creation of the anomaly detector and 200 afterwards. Make sure that you check the code that comes with the status attribute to make sure that the anomaly detector creation has been completed without errors. |
|
columns
filterable, sortable |
Integer | The number of fields in the anomaly detector. |
|
composites
filterable, sortable |
Array of Strings | The list of composite ids that reference this model. |
|
created
filterable, sortable |
ISO-8601 Datetime. | This is the date and time in which the anomaly detector was created with microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are provided in Coordinated Universal Time (UTC). |
|
credits
filterable, sortable |
Float | The number of credits it cost you to create this anomaly detector. |
|
credits_per_prediction
filterable, sortable, updatable |
Float | This is the number of credits that other users will consume to make a prediction with your anomaly detector if you made it public. |
|
dataset
filterable, sortable |
String | The dataset/id that was used to build the anomaly detector. |
| dataset_field_types | Object | A dictionary with the number of fields of each type in the dataset used to create the anomaly detector. It has an entry for each field type (categorical, datetime, numeric, and text), an entry for preferred fields, and an entry for the total number of fields. |
|
dataset_status
filterable, sortable |
Boolean | Whether the dataset is still available or has been deleted. |
|
description
updatable |
String | A text describing the anomaly detector. It can contain restricted markdown to decorate the text. |
| excluded_fields | Array | The list of ids of the fields that were excluded when building the anomaly detector. |
| fields_meta | Object | A dictionary with meta information about the fields dictionary. It specifies the total number of fields, the current offset, and limit, and the number of fields (count) returned. |
| forest_size | Integer | The number of individual trees the anomaly detector will contain. |
| id_fields | Array | The list of ids of the id fields used to build the anomaly detector. |
| input_fields | Array | The list of input fields' ids used to build the models of the anomaly detector. |
| locale | String | The dataset's locale. |
|
max_columns
filterable, sortable |
Integer | The total number of fields in the dataset used to build the anomaly detector. |
|
max_rows
filterable, sortable |
Integer | The maximum number of instances in the dataset that can be used to build the anomaly detector. |
| model | Object | All the information that you need to recreate or use the anomaly detector on your own. It includes the fields dictionary describing the fields and their summaries, the tree structures that make up the model, and the top anomalies found in the dataset. See the Model Object definition below. |
|
name
filterable, sortable, updatable |
String | The name of the anomaly detector as you provided it, or based on the name of the dataset by default. |
| normalize_repeats | Boolean | Controls whether the frequency of repeated (or very similar) data points lowers the anomaly score of the repeats. |
|
number_of_anomalyscores
filterable, sortable |
Integer | The current number of anomaly scores that use this anomaly detector. |
|
number_of_batchanomalyscores
filterable, sortable |
Integer | The current number of batch anomaly scores that use this anomaly detector. |
|
number_of_public_anomalyscores
filterable, sortable |
Integer | The current number of public anomaly scores that use this anomaly detector. |
|
out_of_bag
filterable, sortable |
Boolean | Whether the out-of-bag instances were used to create the anomaly detector instead of the sampled instances. |
|
price
filterable, sortable, updatable |
Float | The price other users must pay to clone your anomaly detector. |
|
private
filterable, sortable, updatable |
Boolean | Whether the anomaly detector is public or not. |
|
project
filterable, sortable, updatable |
String | The project/id the resource belongs to. |
| range | Array | The range of instances used to build the anomaly detector. |
|
replacement
filterable, sortable |
Boolean | Whether the instances sampled to build the anomaly detector were selected using replacement or not. |
| resource | String | The anomaly/id. |
|
rows
filterable, sortable |
Integer | The total number of instances used to build the anomaly detector. |
|
sample_rate
filterable, sortable |
Float | The sample rate used to select instances from the dataset to build the anomaly detector. |
|
seed
filterable, sortable |
String | The string that was used to generate the sample. |
|
shared
filterable, sortable, updatable |
Boolean | Whether the anomaly detector is shared using a private link or not. |
| shared_hash | String | The hash that gives access to this anomaly detector if it has been shared using a private link. |
| sharing_key | String | The alternative key that gives read access to this anomaly detector. |
|
size
filterable, sortable |
Integer | The number of bytes of the dataset that were used to create this anomaly detector. |
|
source
filterable, sortable |
String | The source/id that was used to build the dataset. |
|
source_status
filterable, sortable |
Boolean | Whether the source is still available or has been deleted. |
| status | Object | A description of the status of the anomaly detector. It includes a code, a message, and some extra information. See the table below. |
|
subscription
filterable, sortable |
Boolean | Whether the anomaly detector was created using a subscription plan or not. |
|
tags
filterable, updatable |
Array of Strings | A list of user tags that can help classify and index this resource. |
|
top_n
filterable, sortable |
Integer | The number of top anomalies returned after scoring each row in the training dataset. |
|
updated
filterable, sortable |
ISO-8601 Datetime. | This is the date and time in which the anomaly detector was updated with microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are provided in Coordinated Universal Time (UTC). |
| webhook | Object | A webhook url and an optional secret phrase. See the Section on Webhooks for more details. |
|
white_box
filterable, sortable |
Boolean | Whether the anomaly detector is publicly shared as a white-box. |
The Model Object of an anomaly detector has the following properties:
| Property | Type | Description |
|---|---|---|
| fields | Object | A dictionary with an entry per field in the dataset used to build the anomaly detector. Fields are paginated according to the field_meta attribute. Each entry includes the column number in the original source, the name of the field, the type of the field, and the summary. See this Section for more details. |
| kind | String | The type of anomaly detector. Currently, only "iforest". |
| top_anomalies | Array | A list of top anomalies objects. See below. |
| trees | Array of Objects | A list with the trees representing the anomaly detector. Each tree conforms to the Root Object definition. |
A Top Anomalies Object has the following properties:
Anomaly Detector Status
Creating an anomaly detector is a process that can take just a few seconds or a few days depending on the size of the dataset used as input and on the workload of BigML's systems. The anomaly detector goes through a number of states until it is fully completed. Through the status field in the anomaly detector you can determine when the anomaly detector has been fully processed and is ready to be used to create predictions. These are the properties of an anomaly detector's status:
| Property | Type | Description |
|---|---|---|
| code | Integer | A status code that reflects the status of the anomaly detector creation. It can be any of those that are explained here. |
| elapsed | Integer | Number of milliseconds that BigML.io took to process the anomaly detector. |
| message | String | A human readable message explaining the status. |
| progress | Float, between 0 and 1 | How far BigML.io has progressed building the anomaly detector. |
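Until a terminal status is reached, the resource has to be polled. A minimal polling sketch, assuming a zero-argument `fetch` callable that GETs the anomaly detector and returns it as a dict (the helper is hypothetical; the terminal codes are the ones listed in the Status Codes section):

```python
import time

FINISHED, FAULTY = 5, -1  # terminal status codes (see the Status Codes section)

def wait_until_ready(fetch, interval=2.0, timeout=300.0):
    """Poll `fetch` until the resource's status code is terminal.

    `fetch` is any zero-argument callable returning the anomaly
    detector as a dict, e.g. one GET per call against
    https://au.bigml.io/anomaly/<id>?$BIGML_AUTH.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        resource = fetch()
        code = resource["status"]["code"]
        if code == FINISHED:
            return resource
        if code == FAULTY:
            raise RuntimeError(resource["status"].get("message", "creation failed"))
        time.sleep(interval)
    raise TimeoutError("anomaly detector was not ready in time")
```

The same loop works for any BigML resource that exposes the common status object.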
Once an anomaly detector has been successfully created, it will look like:
{
"category":0,
"code":200,
"columns":5,
"constraints":false,
"created":"2014-09-24T03:06:46.107000",
"credits":0.12705230712890625,
"credits_per_prediction":0,
"dataset":"dataset/54222a14f0a5eaaab000000c",
"dataset_field_types":{
"categorical":1,
"datetime":0,
"numeric":4,
"preferred":5,
"text":0,
"total":5
},
"dataset_status":true,
"dataset_type":0,
"description":"",
"excluded_fields":[],
"fields_meta":{
"count":5,
"limit":1000,
"offset":0,
"query_total":5,
"total":5
},
"forest_size":128,
"id_fields":[],
"input_fields":[
"000000",
"000001",
"000002",
"000003",
"000004"
],
"locale":"en-US",
"max_columns":5,
"max_rows":150,
"model":{
"fields":{
"000000":{
"column_number":0,
"datatype":"double",
"description":"an even longer description",
"label":"a longer name",
"name":"sepal length",
"optype":"numeric",
"order":0,
"preferred":true,
"summary":{
"bins":[
[
4.3,
1
],
[
4.425,
4
],
[
4.6,
4
],
[
4.77143,
7
],
[
4.9625,
16
],
[
5.1,
9
],
[
5.2,
4
],
[
5.3,
1
],
[
5.4,
6
],
[
5.5,
7
],
[
5.6,
6
],
[
5.7,
8
],
[
5.8,
7
],
[
5.9,
3
],
[
6,
6
],
[
6.1,
6
],
[
6.2,
4
],
[
6.3,
9
],
[
6.4,
7
],
[
6.5,
5
],
[
6.6,
2
],
[
6.7,
8
],
[
6.8,
3
],
[
6.9,
4
],
[
7,
1
],
[
7.1,
1
],
[
7.2,
3
],
[
7.3,
1
],
[
7.4,
1
],
[
7.6,
1
],
[
7.7,
4
],
[
7.9,
1
]
],
"maximum":7.9,
"mean":5.84333,
"median":5.77889,
"minimum":4.3,
"missing_count":0,
"population":150,
"splits":[
4.51526,
4.67252,
4.81113,
4.89582,
4.96139,
5.01131,
5.05992,
5.11148,
5.18177,
5.35681,
5.44129,
5.5108,
5.58255,
5.65532,
5.71658,
5.77889,
5.85381,
5.97078,
6.05104,
6.13074,
6.23023,
6.29578,
6.35078,
6.41459,
6.49383,
6.63013,
6.70719,
6.79218,
6.92597,
7.20423,
7.64746
],
"standard_deviation":0.82807,
"sum":876.5,
"sum_squares":5223.85,
"variance":0.68569
}
},
...
},
"kind":"iforest",
"mean_depth":9.557347074468085,
"top_anomalies":[
{
"importance":[
0.22808,
0.23051,
0.21026,
0.1756,
0.15555
],
"row":[
7.9,
3.8,
6.4,
2,
"Iris-virginica"
],
"score":0.58766
},
...
],
"trees": [
{
"root": {
"children": [
{
"children": [
{
"children": [
{
"children": [
{
"children": [
{
"population": 1,
"predicates": [
{
"field": "000000",
"op": ">",
"value": 6.91218
}
]
},
...
}
]
},
"name":"a new name",
"number_of_anomalyscores":0,
"number_of_batchanomalyscores":0,
"number_of_public_anomalyscores":0,
"out_of_bag":false,
"price":0,
"private":true,
"project":null,
"range":[
1,
150
],
"replacement":false,
"resource":"anomaly/54223546f0a5eaaab0000018",
"rows":150,
"sample_rate":1,
"sample_size":94,
"shared":false,
"size":4758,
"source":"source/54222a08f0a5eaaab0000008",
"source_status":true,
"status":{
"code":5,
"elapsed":977,
"message":"The anomaly detector has been created",
"progress":1
},
"subscription":false,
"tags":[],
"top_n":10,
"updated":"2014-09-24T03:34:01.051000",
"white_box":false
}
< Example anomaly detector JSON response
PMML
The default anomaly detector output format is JSON. However, the pmml parameter allows you to include a PMML version of the anomaly detector. The anomaly detector will include an XML document that fulfills PMML v4.1. For example:
curl "https://au.bigml.io/anomaly/54223546f0a5eaaab0000018?$BIGML_AUTH;pmml=yes"
PMML Example
curl "https://au.bigml.io/anomaly/54223546f0a5eaaab0000018?$BIGML_AUTH;pmml=only"
PMML Example
Filtering and Paginating Fields from an Anomaly Detector
An anomaly detector might be composed of hundreds or even thousands of fields. Thus, when retrieving an anomaly detector, it's possible to specify that only a subset of fields be retrieved, by using any combination of the following parameters in the query string (unrecognized parameters are ignored):
Since fields is a map and therefore not ordered, the returned fields contain an additional key, order, whose integer (increasing) value gives you their ordering. In all other respects, the source is the same as the one you would get without any filtering parameter above.
The fields_meta field can help you paginate fields. Its structure is as follows:
Updating an Anomaly Detector
To update an anomaly detector, you need to PUT an object containing the fields that you want to update to the anomaly detector's base URL. The content-type must always be: "application/json". If the request succeeds, BigML.io will return an HTTP 202 response with the updated anomaly detector.
For example, to update an anomaly detector with a new name you can use curl like this:
curl "https://au.bigml.io/anomaly/54223546f0a5eaaab0000018?$BIGML_AUTH" \
-X PUT \
-H 'content-type: application/json' \
-d '{"name": "a new name"}'
$ Updating an anomaly detector's name
If you want to update an anomaly detector with a new label and description for a specific field you can use curl like this:
curl "https://au.bigml.io/anomaly/54223546f0a5eaaab0000018?$BIGML_AUTH" \
-X PUT \
-H 'content-type: application/json' \
-d '{"fields": {"000000": {
"label": "a longer name",
"description": "an even longer description"}}}'
$ Updating an anomaly detector's field, label, and description
Deleting an Anomaly Detector
To delete an anomaly detector, you need to issue a HTTP DELETE request to the anomaly/id to be deleted.
Using curl you can do something like this to delete an anomaly detector:
curl -X DELETE "https://au.bigml.io/anomaly/54223546f0a5eaaab0000018?$BIGML_AUTH"
$ Deleting an anomaly detector from the command line
If the request succeeds you will not see anything on the command line unless you executed the command in verbose mode. Successful DELETEs will return "204 no content" responses with no body.
Once you delete an anomaly detector, it is permanently deleted. That is, a delete request cannot be undone. If you try to delete an anomaly detector a second time, or an anomaly detector that does not exist, you will receive a "404 not found" response.
However, if you try to delete an anomaly detector that is being used at the moment, then BigML.io will not accept the request and will respond with a "400 bad request" response.
See this section for more details.
Listing Anomaly Detectors
To list all the anomaly detectors, you can use the anomaly base URL. By default, only the 20 most recent anomaly detectors will be returned. You can see below how to change this number using the limit parameter.
You can get your list of anomaly detectors directly in your browser using your own username and API key with the following links.
https://au.bigml.io/anomaly?$BIGML_AUTH
> Listing anomaly detectors from a browser
Associations
Last Updated: Tuesday, 2019-01-29 16:28
Association Discovery is a method to find relations among values in high-dimensional datasets. It is commonly used for market basket analysis. For example, finding customer shopping patterns across large transactional datasets, such as customers who buy hamburgers and ketchup also buying bread, can help businesses make better decisions on promotions and product placement. Association Discovery can also be used for other purposes such as early detection of failures or incidents, intrusion detection, web mining, or biotechnology.
BigML recently acquired the best-of-class association discovery technology Magnum Opus from Dr. Geoff Webb.
How association discovery differs from traditional correlation techniques:
- It can process thousands of variables.
- It is able to find correlations between values, not variables.
- It is able to measure the meaningfulness of associations, so it focuses on finding valuable associations instead of minimizing the risk of making false discoveries.
Useful concepts
- N: the number of instances (cardinality) of a dataset D.
- N = | D |
- Item: a value associated with an instance in the dataset
- Itemset: a set of items associated with an instance in the dataset
- Antecedent or (LHS): left-hand-side itemset of an association rule
- Consequent or (RHS): right-hand-side itemset of an association rule
Note that, traditionally, association discovery looks for co-occurrence and does not consider the order in which items appear within an itemset.
Association Measures
- Support: the proportion of instances which contain an itemset: support(X) = |{instances containing X}| / N.
- Coverage: the support of the antecedent of an association rule. It measures how often a rule can be applied: coverage(LHS → RHS) = support(LHS).
- Confidence or (strength): the probability of seeing the rule's consequent under the condition that the instances also contain the rule's antecedent. Confidence is computed using the support of the association rule over the coverage, that is, the proportion of instances which contain the consequent and antecedent together over the number of instances which only contain the antecedent: confidence(LHS → RHS) = support(LHS ∪ RHS) / support(LHS).
- Leverage: the difference between the support of the association rule (i.e., the antecedent and consequent appearing together) and what would be expected if antecedent and consequent were statistically independent. This is a value between -1 and 1. A positive value suggests a positive relationship, a negative value suggests a negative relationship, and 0 indicates independence: leverage(LHS → RHS) = support(LHS ∪ RHS) - support(LHS) × support(RHS).
- Lift: how many times more often antecedent and consequent occur together than expected if they were statistically independent. A value of 1 suggests that there is no relationship between the antecedent and the consequent. Higher values suggest stronger positive relationships, and lower values suggest stronger negative relationships (the presence of the antecedent reduces the likelihood of the consequent): lift(LHS → RHS) = support(LHS ∪ RHS) / (support(LHS) × support(RHS)).
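The measures above can be sketched numerically. A minimal illustration, assuming raw instance counts are available (the helper function and the counts are hypothetical, not part of the API):

```python
def association_measures(n, lhs_count, rhs_count, rule_count):
    """Compute the association measures from instance counts.

    n          -- N, the total number of instances in the dataset
    lhs_count  -- instances containing the antecedent (LHS)
    rhs_count  -- instances containing the consequent (RHS)
    rule_count -- instances containing both LHS and RHS
    """
    sup_lhs = lhs_count / n           # coverage of the rule
    sup_rhs = rhs_count / n
    sup_rule = rule_count / n         # support of the rule
    confidence = sup_rule / sup_lhs
    leverage = sup_rule - sup_lhs * sup_rhs
    lift = confidence / sup_rhs       # == sup_rule / (sup_lhs * sup_rhs)
    return {"coverage": sup_lhs, "support": sup_rule,
            "confidence": confidence, "leverage": leverage, "lift": lift}
```

For instance, with N = 100, 40 instances containing the antecedent, 50 containing the consequent, and 30 containing both, the rule has coverage 0.4, support 0.3, confidence 0.75, leverage 0.1, and lift 1.5, indicating a positive relationship.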
Associations can handle categorical, text and numeric fields as input fields:
- Categorical: each different value (class) will be considered a separate item.
- Text: each unique term will be considered a separate item.
- Numeric: values will be discretized into intervals of equal size. For example, a numeric field with values ranging from 0 to 600 split into 3 bins: bin 1 → [0, 200), bin 2 → [200, 400), bin 3 → [400, 600]. You can refine the behavior of the transformation using discretization and field_discretizations.
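The equal-size split described above can be sketched as follows (illustrative only; BigML performs this discretization server-side):

```python
def equal_width_bin(value, lo, hi, size):
    """1-based bin index for `value` in [lo, hi] split into `size`
    equal-width bins; the last bin is closed on the right."""
    if not lo <= value <= hi:
        raise ValueError("value outside field range")
    width = (hi - lo) / size
    # a value sitting exactly on `hi` falls into the last bin
    return min(int((value - lo) // width) + 1, size)

# With the example above (range 0-600, 3 bins):
# 150 -> bin 1, 250 -> bin 2, 600 -> bin 3
```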
You can create an association selecting which fields from your dataset you want to use.
BigML.io allows you to create, retrieve, update, and delete your associations. You can also list all of your associations.
Jump to:
- Association Base URL
- Creating an Association
- Association Arguments
- Retrieving an Association
- Association Properties
- PMML
- Filtering and Paginating Fields from an Association
- Updating an Association
- Deleting an Association
- Listing Associations
Association Base URL
You can use the following base URL to create, retrieve, update, and delete associations. https://au.bigml.io/association
Association base URL
All requests to manage your associations must use HTTPS and be authenticated using your username and API key to verify your identity. See this section for more details.
Creating an Association
To create a new association, you need to POST to the association base URL an object containing at least the dataset/id that you want to use to create the association. The content-type must always be "application/json".
POST /association?username=alfred;api_key=79138a622755a2383660347f895444b1eb927730; HTTP/1.1
Host: bigml.io
Content-Type: application/json
Creating association definition
curl "https://au.bigml.io/association?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"dataset": "dataset/4f66a80803ce8940c5000006"}'
> Creating an association
BigML.io will return the newly created association if the request succeeded.
{
"associations":null,
"category":0,
"clones":0,
"code":201,
"columns":0,
"created":"2015-11-05T08:06:08.184169",
"credits":0.017581939697265625,
"dataset":"dataset/562fae3f4e1727141d00004e",
"dataset_field_types":{
"categorical":1,
"datetime":0,
"effective_fields":5,
"numeric":4,
"preferred":5,
"text":0,
"total":5
},
"dataset_status":true,
"dataset_type":0,
"description":"",
"excluded_fields":[ ],
"fields_meta":{
"count":0,
"limit":1000,
"offset":0,
"total":0
},
"input_fields":[ ],
"locale":"en-US",
"max_columns":5,
"max_rows":150,
"name":"iris' dataset's association",
"out_of_bag":false,
"price":0,
"private":true,
"project":null,
"range":[
1,
150
],
"replacement":false,
"resource":"association/5621b70910cb86ae4c000000",
"rows":150,
"sample_rate":1,
"shared":false,
"size":4609,
"source":"source/562fae3a4e1727141d000048",
"source_status":true,
"status":{
"code":1,
"message":"The association is being processed and will be created soon"
},
"subscription":false,
"tags":[ ],
"updated":"2015-11-05T08:06:08.184281",
"white_box":false
}
< Example association JSON response
Association Arguments
In addition to the dataset, you can also POST the following arguments.
| Argument | Type | Description |
|---|---|---|
|
category
optional |
Integer, default is the category of the dataset |
The category that best describes the association. See the category codes for the complete list of categories.
Example: 1 |
|
complement
optional |
Boolean, default is false |
If complement is true, complementary items are also taken into account. For example, for the item (coffee), the complement would be (NOT coffee); so, apart from the association (milk, coffee) --> (sugar), complementary rules such as (milk, NOT coffee) --> (chocolate) may also be detected.
Example: true |
| dataset | String |
A valid dataset/id.
Example: dataset/4f66a80803ce8940c5000006 |
|
default_numeric_value
optional |
String |
It accepts any of the following strings to substitute missing numeric values across all the numeric fields in the dataset: "mean", "median", "minimum", "maximum", "zero"
Example: "median" |
|
description
optional |
String |
A description of the association up to 8192 characters long.
Example: "This is a description of my new association" |
| discretization | Object | Global numeric field transformation parameters. See the discretization table below. |
|
excluded_fields
optional |
Array, default is [], an empty list. None of the fields in the dataset is excluded. |
Specifies the fields that won't be included in the association.
Example:
|
| field_discretizations | Object | Per-field numeric field transformation parameters, taking precedence over discretization. See the field_discretizations table below. |
|
fields
optional |
Object, default is {}, an empty dictionary. That is, no names or preferred statuses are changed. |
This can be used to change the names of the fields in the association with respect to the original names in the dataset, or to tell BigML that certain fields should be preferred. Add an entry keyed by the field id generated in the source for each field whose name you want updated.
Example:
|
|
input_fields
optional |
Array, default is []. All the fields in the dataset |
Specifies the fields to be considered to create the association.
Example:
|
|
max_k
optional |
Integer, default is 100 |
The maximum number of associations to be discovered.
Example: 10 |
|
max_lhs
optional |
Integer, default is 4 |
The maximum number of items to be considered within the left-hand-side itemset.
Example: 2 |
|
min_confidence
optional |
Float, default is 0 |
A real number between 0 and 1 specifying the minimum confidence for the rules discovered.
Example: 0.5 |
|
min_leverage
optional |
Float, default is 0 |
A real number between -1 and 1 specifying the minimum leverage for the rules discovered.
Example: -0.5 |
|
min_lift
optional |
Float, default is 1 |
A non-negative real number specifying the minimum lift for the rules discovered.
Example: 2 |
|
min_support
optional |
Float, default is 1 |
A non-negative real number specifying the minimum support for the rules discovered, i.e., the number of instances matching both the left-hand and right-hand sides. A value less than 1 is interpreted as a proportion: it will be multiplied by the total number of instances and rounded up.
Example: 0.2 |
|
missing_items
optional |
Boolean, default is false |
Whether to create items corresponding to missing field values.
Example: true |
|
name
optional |
String, default is dataset's name |
The name you want to give to the new association.
Example: "my new association" |
|
out_of_bag
optional |
Boolean, default is false |
Setting this parameter to true will return a sequence of the out-of-bag instances instead of the sampled instances. See the Section on Sampling for more details.
Example: true |
|
project
optional |
String |
The project/id you want the association to belong to.
Example: "project/54d98718f0a5ea0b16000000" |
|
range
optional |
Array, default is [1, max rows in the dataset] |
The range of successive instances to build the association.
Example: [1, 150] |
|
replacement
optional |
Boolean, default is false |
Whether sampling should be performed with or without replacement. See the Section on Sampling for more details.
Example: true |
|
rhs_predicate
optional |
Array of Objects, default is [] |
Restricts the search to rules with certain right-hand-side values. Each predicate must contain at least the field; operator and value are optional. See the description below the table for more details.
Example:
|
|
sample_rate
optional |
Float, default is 1.0 |
A real number between 0 and 1 specifying the sample rate. See the Section on Sampling for more details.
Example: 0.5 |
|
search_strategy
optional |
String, default is "leverage" |
The strategy for prioritizing the rules in the search. Available options are "confidence", "lhs_cover", "leverage", "lift", and "support".
Example: "lift" |
|
seed
optional |
String |
A string to be hashed to generate deterministic samples. See the Section on Sampling for more details.
Example: "MySample" |
|
significance_level
optional |
Float, default is 0.05 |
The significance level between 0 and 1 used for statistical significance of the results.
Example: 0.1 |
|
tags
optional |
Array of Strings |
A list of strings that help classify and index your association.
Example: ["best customers", "2018"] |
|
webhook
optional |
Object |
A webhook url and an optional secret phrase. See the Section on Webhooks for more details.
Example:
|
The rhs_predicate can restrict the search to rules with certain right-hand-side values. The individual predicates within the array are OR'd together to produce the final predicate. The example above in the arguments table specifies that the right-hand side of all discovered rules must be either the item corresponding to species = Iris-setosa or petal width within the interval (1.0, 2.0].
For items and categorical fields, valid operators are '=' and '!=' to specify a single value of interest, and 'in' and '!in' to specify multiple values.
For numeric fields, valid operators are '<=' and '>' for single-ended intervals, and 'in' and '!in' for double-ended intervals. When a predicate for a numeric field is given, the field will be discretized along bin edges specified by the predicate. With the above example, the field petal width will be discretized into three bins, corresponding to the values <=1.0, (1.0, 2.0], and >2.0.
If a predicate is given without an operator or value, then any item pertaining to this field is accepted into the RHS.
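As a sketch of these OR semantics, the following evaluates a candidate right-hand-side item against a list of predicates, covering the single-valued operators only (the helper and the item representation are hypothetical; the interval forms of 'in'/'!in' are omitted for brevity):

```python
def matches_rhs(item, predicates):
    """Return True if `item` satisfies any predicate (predicates are OR'd).

    `item` is a dict like {"field": "000004", "value": "Iris-setosa"};
    each predicate has a "field" and optional "operator"/"value".
    """
    ops = {
        "=": lambda a, b: a == b,
        "!=": lambda a, b: a != b,
        "<=": lambda a, b: a <= b,
        ">": lambda a, b: a > b,
    }
    for pred in predicates:
        if pred["field"] != item["field"]:
            continue
        if "operator" not in pred:   # field-only predicate: any item is accepted
            return True
        if ops[pred["operator"]](item["value"], pred["value"]):
            return True
    return False
```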
Discretization is used to transform numeric input fields to categoricals before further processing. It is applied globally to all input fields. A Discretization object is composed of any combination of the following properties.
For example, let's say type is set to 'width', size is 7, trim is 0.05, and pretty is false. This requests that numeric input fields be discretized into 7 bins of equal width, trimming the outer 5% of counts, and not rounding bin boundaries.
Field Discretizations is also used to transform numeric input fields to categoricals before further processing. However, it allows the user to specify parameters on a per field basis, taking precedence over the global discretization. It is a map whose keys are field ids and whose values are maps with the same format as discretization. It also accepts edges, which is a numeric array manually specifying edge boundary locations. If this parameter is present, the corresponding field will be discretized according to those defined bins, and the remaining discretization parameters will be ignored. The maximum value of the field's distribution is automatically set as the last value in the edges array. A value object of a Field Discretizations object is composed of any combination of the following properties.
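How manually specified edges translate into bins can be sketched like this (illustrative only; BigML applies the discretization server-side). With edges [1.0, 2.0], values fall into (-∞, 1.0], (1.0, 2.0], and (2.0, field maximum]:

```python
import bisect

def edge_bin(value, edges, field_max):
    """0-based bin index for `value` given ascending `edges`.

    Edges [e1, ..., en] produce the bins (-inf, e1], (e1, e2], ...,
    (en, field_max]; the field's maximum closes the final bin.
    """
    if value > field_max:
        raise ValueError("value above the field's maximum")
    # bisect_left yields left-open, right-closed bins
    return bisect.bisect_left(edges, value)
```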
You can also use curl to customize a new association. For example, to create a new association named "my association", with only certain rows, and with only three fields:
curl "https://au.bigml.io/association?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"dataset": "dataset/4f66a80803ce8940c5000006",
"input_fields": ["000001", "000002", "000003"],
"name": "my association",
"range": [25, 125]}'
> Creating a customized association
If you do not specify a name, BigML.io will assign the dataset's name to the new association. If you do not specify a range of instances, BigML.io will use all the instances in the dataset. If you do not specify any input fields, BigML.io will include all the input fields in the dataset.
Retrieving an Association
Each association has a unique identifier in the form "association/id" where id is a string of 24 alpha-numeric characters that you can use to retrieve the association.
To retrieve an association with curl:
curl "https://au.bigml.io/association/5621b70910cb86ae4c000000?$BIGML_AUTH"
$ Retrieving an association from the command line
You can also use your browser to visualize the association using the full BigML.io URL or by pasting the association/id into the BigML.com.au dashboard.
Association Properties
Once an association has been successfully created it will have the following properties.
| Property | Type | Description |
|---|---|---|
| associations | Object | All the information that you need to recreate the association. It includes the fields dictionary describing the fields, and the association's items and rules. See the Associations Object definition below. |
| category (filterable, sortable, updatable) | Integer | One of the categories in the table of categories that help classify this resource according to the domain of application. |
| code | Integer | HTTP status code. This will be 201 upon successful creation of the association and 200 afterwards. Check the code that comes with the status attribute to confirm that the association creation has completed without errors. |
| columns (filterable, sortable) | Integer | The number of fields in the association. |
| composites (filterable, sortable) | Array of Strings | The list of composite ids that reference this association. |
| created (filterable, sortable) | ISO-8601 Datetime | The date and time the association was created, with microsecond precision. It follows the pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are provided in Coordinated Universal Time (UTC). |
| credits (filterable, sortable) | Float | The number of credits it cost you to create this association. |
| dataset (filterable, sortable) | String | The dataset/id that was used to build the association. |
| dataset_status (filterable, sortable) | Boolean | Whether the dataset is still available or has been deleted. |
| description (updatable) | String | A text describing the association. It can contain restricted markdown to decorate the text. |
| excluded_fields | Array | The list of fields' ids that were excluded when building the association. |
| fields_meta | Object | A dictionary with meta information about the fields dictionary. It specifies the total number of fields, the current offset and limit, and the number of fields (count) returned. |
| input_fields | Array | The list of input fields' ids used to build the association. |
| locale | String | The dataset's locale. |
| max_columns (filterable, sortable) | Integer | The total number of fields in the dataset used to build the association. |
| max_rows (filterable, sortable) | Integer | The maximum number of instances in the dataset that can be used to build the association. |
| name (filterable, sortable, updatable) | String | The name of the association as you provided it, or the dataset's name by default. |
| number_of_associationsets (filterable, sortable) | Integer | The current number of association sets that use this association. |
| number_of_public_associationsets (filterable, sortable) | Integer | The current number of public association sets that use this association. |
| out_of_bag (filterable, sortable) | Boolean | Whether the out-of-bag instances were used to create the association instead of the sampled instances. |
| price (filterable, sortable, updatable) | Float | The price other users must pay to clone your association. |
| private (filterable, sortable, updatable) | Boolean | Whether the association is public or not. |
| project (filterable, sortable, updatable) | String | The project/id the resource belongs to. |
| range | Array | The range of instances used to build the association. |
| replacement (filterable, sortable) | Boolean | Whether the instances sampled to build the association were selected using replacement or not. |
| resource | String | The association/id. |
| rows (filterable, sortable) | Integer | The total number of instances used to build the association. |
| sample_rate (filterable, sortable) | Float | The sample rate used to select instances from the dataset to build the association. |
| seed (filterable, sortable) | String | The string that was used to generate the sample. |
| shared (filterable, sortable, updatable) | Boolean | Whether the association is shared using a private link or not. |
| shared_hash | String | The hash that gives access to this association if it has been shared using a private link. |
| sharing_key | String | The alternative key that gives read access to this association. |
| size (filterable, sortable) | Integer | The number of bytes of the dataset that were used to create this association. |
| source (filterable, sortable) | String | The source/id that was used to build the dataset. |
| source_status (filterable, sortable) | Boolean | Whether the source is still available or has been deleted. |
| status | Object | A description of the status of the association. It includes a code, a message, and some extra information. See the table below. |
| subscription (filterable, sortable) | Boolean | Whether the association was created using a subscription plan or not. |
| tags (filterable, updatable) | Array of Strings | A list of user tags that can help classify and index this resource. |
| updated (filterable, sortable) | ISO-8601 Datetime | The date and time the association was last updated, with microsecond precision. It follows the pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are provided in Coordinated Universal Time (UTC). |
| webhook | Object | A webhook url and an optional secret phrase. See the Section on Webhooks for more details. |
| white_box (filterable, sortable) | Boolean | Whether the association is publicly shared as a white-box. |
An Associations Object has the following properties:
| Property | Type | Description |
|---|---|---|
| complement | Boolean | If complement is true, complementary items are also taken into account. For example, for the item (coffee), the complement would be (NOT coffee), so besides the rule (milk, coffee) --> (sugar), complementary rules such as (milk, NOT coffee) --> (chocolate) may also be detected. |
| discretization | Object | Global numeric field transformation parameters. See the discretization table. |
| field_discretizations | Object | Per-field numeric field transformation parameters, taking precedence over discretization. See the field_discretizations table. |
| fields | Object | A dictionary with an entry per field in the dataset used to build the test. Fields are paginated according to the field_meta attribute. Each entry includes the column number in original source, the name of the field, the type of the field, and the summary. See this Section for more details. |
| items | Array | An array of unique items detected in the dataset. See Item Object below. |
| max_k | Integer | The maximum number of associations to be discovered. |
| max_lhs | Integer | The maximum number of items to be considered within the left-hand-side itemset. |
| min_confidence | Float | A real number between 0 and 1 specifying the minimum confidence for the rules discovered. |
| min_leverage | Float | A real number between -1 and 1 specifying the minimum leverage for the rules discovered. |
| min_lift | Float | A non-negative real number specifying the minimum lift for the rules discovered. |
| min_support | Integer | A non-negative integer specifying the minimum support for the rules discovered. |
| missing_items | Boolean | If it is set to true, items with missing field values are also created. |
| rhs_predicate | Array of Objects | Restriction for the search to rules with certain right-hand side values. |
| rules | Array | An array of association rules discovered in the dataset. The total number of rules is less than or equal to k. See Rule Object below. |
| rules_summary | Object | Summary statistics about the discovered rules. It contains the following keys: k: The number of rules found, and the rule metric keys: confidence, leverage, lhs_cover, lift, p_value, rhs_cover, and support. Each key corresponds to an object with fields counts or bins, mean, sum_squares, maximum, variance, median, population, minimum, standard_deviation, and sum. These fields are documented in Numeric Field Summary. As in the numeric field summary, the presence of counts or bins depends on the number of distinct values. |
| search_strategy | String | The strategy for prioritizing the rules in the search. Available options are confidence, lhs_cover, leverage, lift, and support. |
| significance_level | Float | The significance level between 0 and 1 used for statistical significance of the results. |
Each element of the items array has the properties shown in the example JSON response below: complement, count, field_id, name, and, for items from numeric fields, bin_start and bin_end.
Each element of the rules array has the following properties.
| Property | Type | Description |
|---|---|---|
| confidence | Float | A real number between 0 and 1 representing the confidence for the rule. |
| leverage | Float | A real number between -1 and 1 representing the leverage for the rule. |
| lhs | Array of Integer | An array of zero-based item ids for the LHS of the rule. These ids represent the index of the item in the items list. |
| lhs_cover | Array of Integer | A pair with coverage for the items in the LHS of the rule expressed as a proportion of total instances, and absolute number of counts. |
| lift | Float | A positive real number representing the lift for the rule. |
| p_value | Float | A real number between 0 and 1 representing the p-value for the rule. |
| rhs | Array of Integer | An array of zero-based item ids for the RHS of the rule. These ids represent the index of the item in the items list. |
| rhs_cover | Array of Integer | A pair with coverage for the items in the RHS of the rule expressed as a proportion of total instances, and absolute number of counts. |
| support | Array of Integer | A pair with support for the rule expressed as a proportion of total instances, and absolute number of counts. |
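The rule metrics are related by simple formulas; the following sketch (plain Python, names illustrative) recomputes confidence, lift, and leverage from the proportional support and coverage values of rule "000000" in the example response shown later in this section:

```python
def rule_metrics(support, lhs_cover, rhs_cover):
    """Derive confidence, lift, and leverage from the proportional
    support and coverage values reported in each rule object."""
    confidence = support / lhs_cover            # P(RHS | LHS)
    lift = confidence / rhs_cover               # confidence relative to the RHS base rate
    leverage = support - lhs_cover * rhs_cover  # observed minus expected co-occurrence
    return confidence, lift, leverage

# Proportions from rule "000000" in the example response:
conf, lift, lev = rule_metrics(support=0.33333, lhs_cover=0.33333, rhs_cover=0.33333)
print(round(conf, 5), round(lev, 5))  # matches the reported confidence 1 and leverage 0.22222
```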
Association Status
Creating an association is a process that can take just a few seconds or a few days depending on the size of the dataset used as input and on the workload of BigML's systems. The association goes through a number of states until it is fully completed. Through the status field in the association you can determine when the association has been fully processed and is ready to be used. These are the properties of an association's status:
| Property | Type | Description |
|---|---|---|
| code | Integer | A status code that reflects the status of the association creation. It can be any of those that are explained here. |
| elapsed | Integer | Number of milliseconds that BigML.io took to process the association. |
| message | String | A human readable message explaining the status. |
| progress | Float, between 0 and 1 | How far BigML.io has progressed building the association. |
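Because creation is asynchronous, clients typically poll the status object until the code reaches 5 (finished, per the Status Codes section) or goes negative on an error. A minimal polling sketch (plain Python; the fetch callable is a stand-in for an authenticated HTTP GET of the association):

```python
import time

FINISHED = 5  # status code for a fully created resource

def wait_until_ready(fetch_status, interval=2.0, max_tries=100):
    """Poll a callable returning the status object (a dict with at
    least a "code" key) until the resource is finished or errors out."""
    for _ in range(max_tries):
        status = fetch_status()
        if status["code"] == FINISHED:
            return status
        if status["code"] < 0:  # negative codes signal failure
            raise RuntimeError(status.get("message", "creation failed"))
        time.sleep(interval)
    raise TimeoutError("resource not ready after polling")

# Example with a canned status sequence instead of real HTTP calls:
statuses = iter([{"code": 1}, {"code": 2},
                 {"code": 5, "message": "The association has been created"}])
final = wait_until_ready(lambda: next(statuses), interval=0)
print(final["message"])  # The association has been created
```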
Once an association has been successfully created, it will look like:
{
"associations":{
"complement":false,
"discretization":{
"pretty":true,
"size":5,
"trim":0,
"type":"population"
},
"items":[
{
"complement":false,
"count":32,
"field_id":"000000",
"name":"Segment 1",
"bin_end":5,
"bin_start":null
},
{
"complement":false,
"count":49,
"field_id":"000000",
"name":"Segment 3",
"bin_end":7,
"bin_start":6
},
{
"complement":false,
"count":12,
"field_id":"000000",
"name":"Segment 4",
"bin_end":null,
"bin_start":7
},
{
"complement":false,
"count":19,
"field_id":"000001",
"name":"Segment 1",
"bin_end":2.5,
"bin_start":null
},
{
"complement":false,
"count":64,
"field_id":"000001",
"name":"Segment 2",
"bin_end":3,
"bin_start":2.5
},
{
"complement":false,
"count":16,
"field_id":"000001",
"name":"Segment 4",
"bin_end":4,
"bin_start":3.5
},
{
"complement":false,
"count":50,
"field_id":"000002",
"name":"Segment 1",
"bin_end":2,
"bin_start":null
},
{
"complement":false,
"count":16,
"field_id":"000002",
"name":"Segment 2",
"bin_end":4,
"bin_start":2
},
{
"complement":false,
"count":75,
"field_id":"000002",
"name":"Segment 3",
"bin_end":6,
"bin_start":4
},
{
"complement":false,
"count":9,
"field_id":"000002",
"name":"Segment 4",
"bin_end":null,
"bin_start":6
},
{
"complement":false,
"count":57,
"field_id":"000003",
"name":"Segment 1",
"bin_end":1,
"bin_start":null
},
{
"complement":false,
"count":70,
"field_id":"000003",
"name":"Segment 2",
"bin_end":2,
"bin_start":1
},
{
"complement":false,
"count":23,
"field_id":"000003",
"name":"Segment 3",
"bin_end":null,
"bin_start":2
},
{
"complement":false,
"count":50,
"field_id":"000004",
"name":"Iris-setosa"
},
{
"complement":false,
"count":50,
"field_id":"000004",
"name":"Iris-versicolor"
},
{
"complement":false,
"count":50,
"field_id":"000004",
"name":"Iris-virginica"
}
],
"max_k": 100,
"min_confidence":0,
"min_leverage":0,
"min_lift":1,
"min_support":0,
"missing_items":false,
"rules":[
{
"confidence":1,
"id":"000000",
"leverage":0.22222,
"lhs":[
13
],
"lhs_cover":[
0.33333,
50
],
"lift":3,
"p_value":0.000000000,
"rhs":[
6
],
"rhs_cover":[
0.33333,
50
],
"support":[
0.33333,
50
]
},
{
"confidence":1,
"id":"000001",
"leverage":0.22222,
"lhs":[
6
],
"lhs_cover":[
0.33333,
50
],
"lift":3,
"p_value":0.000000000,
"rhs":[
13
],
"rhs_cover":[
0.33333,
50
],
"support":[
0.33333,
50
]
},
{
"confidence":1,
"id":"000002",
"leverage":0.20667,
"lhs":[
6
],
"lhs_cover":[
0.33333,
50
],
"lift":2.63158,
"p_value":0.000000000,
"rhs":[
10
],
"rhs_cover":[
0.38,
57
],
"support":[
0.33333,
50
]
},
{
"confidence":1,
"id":"000003",
"leverage":0.20667,
"lhs":[
13
],
"lhs_cover":[
0.33333,
50
],
"lift":2.63158,
"p_value":0.000000000,
"rhs":[
10
],
"rhs_cover":[
0.38,
57
],
"support":[
0.33333,
50
]
},
{
"confidence":0.87719,
"id":"000004",
"leverage":0.20667,
"lhs":[
10
],
"lhs_cover":[
0.38,
57
],
"lift":2.63158,
"p_value":0.000000000,
"rhs":[
13
],
"rhs_cover":[
0.33333,
50
],
"support":[
0.33333,
50
]
},
{
"confidence":0.87719,
"id":"000005",
"leverage":0.20667,
"lhs":[
10
],
"lhs_cover":[
0.38,
57
],
"lift":2.63158,
"p_value":0.000000000,
"rhs":[
6
],
"rhs_cover":[
0.33333,
50
],
"support":[
0.33333,
50
]
},
{
"confidence":0.97959,
"id":"000006",
"leverage":0.15667,
"lhs":[
1
],
"lhs_cover":[
0.32667,
49
],
"lift":1.95918,
"p_value":0.000000000,
"rhs":[
8
],
"rhs_cover":[
0.5,
75
],
"support":[
0.32,
48
]
},
{
"confidence":0.64,
"id":"000007",
"leverage":0.15667,
"lhs":[
8
],
"lhs_cover":[
0.5,
75
],
"lift":1.95918,
"p_value":0.000000000,
"rhs":[
1
],
"rhs_cover":[
0.32667,
49
],
"support":[
0.32,
48
]
},
{
"confidence":0.8,
"id":"000008",
"leverage":0.14,
"lhs":[
11
],
"lhs_cover":[
0.46667,
70
],
"lift":1.6,
"p_value":0.000000000,
"rhs":[
8
],
"rhs_cover":[
0.5,
75
],
"support":[
0.37333,
56
]
},
{
"confidence":0.74667,
"id":"000009",
"leverage":0.14,
"lhs":[
8
],
"lhs_cover":[
0.5,
75
],
"lift":1.6,
"p_value":0.000000000,
"rhs":[
11
],
"rhs_cover":[
0.46667,
70
],
"support":[
0.37333,
56
]
},
{
"confidence":0.86,
"id":"00000a",
"leverage":0.13111,
"lhs":[
14
],
"lhs_cover":[
0.33333,
50
],
"lift":1.84286,
"p_value":0.000000000,
"rhs":[
11
],
"rhs_cover":[
0.46667,
70
],
"support":[
0.28667,
43
]
},
{
"confidence":0.61429,
"id":"00000b",
"leverage":0.13111,
"lhs":[
11
],
"lhs_cover":[
0.46667,
70
],
"lift":1.84286,
"p_value":0.000000000,
"rhs":[
14
],
"rhs_cover":[
0.33333,
50
],
"support":[
0.28667,
43
]
},
{
"confidence":0.96875,
"id":"00000c",
"leverage":0.1256,
"lhs":[
0
],
"lhs_cover":[
0.21333,
32
],
"lift":2.54934,
"p_value":0.000000000,
"rhs":[
10
],
"rhs_cover":[
0.38,
57
],
"support":[
0.20667,
31
]
},
{
"confidence":0.54386,
"id":"00000d",
"leverage":0.1256,
"lhs":[
10
],
"lhs_cover":[
0.38,
57
],
"lift":2.54934,
"p_value":0.000000000,
"rhs":[
0
],
"rhs_cover":[
0.21333,
32
],
"support":[
0.20667,
31
]
},
{
"confidence":0.875,
"id":"00000e",
"leverage":0.11556,
"lhs":[
0
],
"lhs_cover":[
0.21333,
32
],
"lift":2.625,
"p_value":0.000000000,
"rhs":[
6
],
"rhs_cover":[
0.33333,
50
],
"support":[
0.18667,
28
]
},
{
"confidence":0.56,
"id":"00000f",
"leverage":0.11556,
"lhs":[
6
],
"lhs_cover":[
0.33333,
50
],
"lift":2.625,
"p_value":0.000000000,
"rhs":[
0
],
"rhs_cover":[
0.21333,
32
],
"support":[
0.18667,
28
]
},
{
"confidence":0.875,
"id":"000010",
"leverage":0.11556,
"lhs":[
0
],
"lhs_cover":[
0.21333,
32
],
"lift":2.625,
"p_value":0.000000000,
"rhs":[
13
],
"rhs_cover":[
0.33333,
50
],
"support":[
0.18667,
28
]
},
{
"confidence":0.56,
"id":"000011",
"leverage":0.11556,
"lhs":[
13
],
"lhs_cover":[
0.33333,
50
],
"lift":2.625,
"p_value":0.000000000,
"rhs":[
0
],
"rhs_cover":[
0.21333,
32
],
"support":[
0.18667,
28
]
},
{
"confidence":0.54667,
"id":"000012",
"leverage":0.10667,
"lhs":[
8
],
"lhs_cover":[
0.5,
75
],
"lift":1.64,
"p_value":0.00000002,
"rhs":[
15
],
"rhs_cover":[
0.33333,
50
],
"support":[
0.27333,
41
]
},
{
"confidence":0.82,
"id":"000013",
"leverage":0.10667,
"lhs":[
15
],
"lhs_cover":[
0.33333,
50
],
"lift":1.64,
"p_value":0.00000002,
"rhs":[
8
],
"rhs_cover":[
0.5,
75
],
"support":[
0.27333,
41
]
},
{
"confidence":1,
"id":"000014",
"leverage":0.10222,
"lhs":[
12
],
"lhs_cover":[
0.15333,
23
],
"lift":3,
"p_value":0.000000000,
"rhs":[
15
],
"rhs_cover":[
0.33333,
50
],
"support":[
0.15333,
23
]
},
{
"confidence":0.46,
"id":"000015",
"leverage":0.10222,
"lhs":[
15
],
"lhs_cover":[
0.33333,
50
],
"lift":3,
"p_value":0.000000000,
"rhs":[
12
],
"rhs_cover":[
0.15333,
23
],
"support":[
0.15333,
23
]
},
{
"confidence":0.64286,
"id":"000016",
"leverage":0.10089,
"lhs":[
11
],
"lhs_cover":[
0.46667,
70
],
"lift":1.5067,
"p_value":0.00000048,
"rhs":[
4
],
"rhs_cover":[
0.42667,
64
],
"support":[
0.3,
45
]
},
{
"confidence":0.70313,
"id":"000017",
"leverage":0.10089,
"lhs":[
4
],
"lhs_cover":[
0.42667,
64
],
"lift":1.5067,
"p_value":0.00000048,
"rhs":[
11
],
"rhs_cover":[
0.46667,
70
],
"support":[
0.3,
45
]
},
{
"confidence":0.70313,
"id":"000018",
"leverage":0.08667,
"lhs":[
4
],
"lhs_cover":[
0.42667,
64
],
"lift":1.40625,
"p_value":0.0000152917,
"rhs":[
8
],
"rhs_cover":[
0.5,
75
],
"support":[
0.3,
45
]
},
{
"confidence":0.6,
"id":"000019",
"leverage":0.08667,
"lhs":[
8
],
"lhs_cover":[
0.5,
75
],
"lift":1.40625,
"p_value":0.0000152917,
"rhs":[
4
],
"rhs_cover":[
0.42667,
64
],
"support":[
0.3,
45
]
},
{
"confidence":0.58,
"id":"00001a",
"leverage":0.08444,
"lhs":[
15
],
"lhs_cover":[
0.33333,
50
],
"lift":1.77551,
"p_value":0.00000435444,
"rhs":[
1
],
"rhs_cover":[
0.32667,
49
],
"support":[
0.19333,
29
]
},
{
"confidence":0.59184,
"id":"00001b",
"leverage":0.08444,
"lhs":[
1
],
"lhs_cover":[
0.32667,
49
],
"lift":1.77551,
"p_value":0.00000435444,
"rhs":[
15
],
"rhs_cover":[
0.33333,
50
],
"support":[
0.19333,
29
]
},
{
"confidence":1,
"id":"00001c",
"leverage":0.07111,
"lhs":[
7
],
"lhs_cover":[
0.10667,
16
],
"lift":3,
"p_value":0.00000000,
"rhs":[
14
],
"rhs_cover":[
0.33333,
50
],
"support":[
0.10667,
16
]
},
{
"confidence":0.32,
"id":"00001d",
"leverage":0.07111,
"lhs":[
14
],
"lhs_cover":[
0.33333,
50
],
"lift":3,
"p_value":0.00000000,
"rhs":[
7
],
"rhs_cover":[
0.10667,
16
],
"support":[
0.10667,
16
]
},
{
"confidence":0.69565,
"id":"00001e",
"leverage":0.05658,
"lhs":[
12
],
"lhs_cover":[
0.15333,
23
],
"lift":2.12955,
"p_value":0.0000910873,
"rhs":[
1
],
"rhs_cover":[
0.32667,
49
],
"support":[
0.10667,
16
]
},
{
"confidence":0.32653,
"id":"00001f",
"leverage":0.05658,
"lhs":[
1
],
"lhs_cover":[
0.32667,
49
],
"lift":2.12955,
"p_value":0.0000910873,
"rhs":[
12
],
"rhs_cover":[
0.15333,
23
],
"support":[
0.10667,
16
]
},
{
"confidence":0.75,
"id":"000020",
"leverage":0.0552,
"lhs":[
2
],
"lhs_cover":[
0.08,
12
],
"lift":12.5,
"p_value":0.000000000,
"rhs":[
9
],
"rhs_cover":[
0.06,
9
],
"support":[
0.06,
9
]
},
{
"confidence":1,
"id":"000021",
"leverage":0.0552,
"lhs":[
9
],
"lhs_cover":[
0.06,
9
],
"lift":12.5,
"p_value":0.000000000,
"rhs":[
2
],
"rhs_cover":[
0.08,
12
],
"support":[
0.06,
9
]
},
{
"confidence":1,
"id":"000022",
"leverage":0.05333,
"lhs":[
2
],
"lhs_cover":[
0.08,
12
],
"lift":3,
"p_value":0.0000007,
"rhs":[
15
],
"rhs_cover":[
0.33333,
50
],
"support":[
0.08,
12
]
},
{
"confidence":0.24,
"id":"000023",
"leverage":0.05333,
"lhs":[
15
],
"lhs_cover":[
0.33333,
50
],
"lift":3,
"p_value":0.0000007,
"rhs":[
2
],
"rhs_cover":[
0.08,
12
],
"support":[
0.08,
12
]
},
{
"confidence":0.52632,
"id":"000024",
"leverage":0.05316,
"lhs":[
3
],
"lhs_cover":[
0.12667,
19
],
"lift":4.93421,
"p_value":0.00000044,
"rhs":[
7
],
"rhs_cover":[
0.10667,
16
],
"support":[
0.06667,
10
]
},
{
"confidence":0.625,
"id":"000025",
"leverage":0.05316,
"lhs":[
7
],
"lhs_cover":[
0.10667,
16
],
"lift":4.93421,
"p_value":0.00000044,
"rhs":[
3
],
"rhs_cover":[
0.12667,
19
],
"support":[
0.06667,
10
]
},
{
"confidence":0.8125,
"id":"000026",
"leverage":0.05111,
"lhs":[
5
],
"lhs_cover":[
0.10667,
16
],
"lift":2.4375,
"p_value":0.0000454342,
"rhs":[
13
],
"rhs_cover":[
0.33333,
50
],
"support":[
0.08667,
13
]
},
{
"confidence":0.8125,
"id":"000027",
"leverage":0.05111,
"lhs":[
5
],
"lhs_cover":[
0.10667,
16
],
"lift":2.4375,
"p_value":0.0000454342,
"rhs":[
6
],
"rhs_cover":[
0.33333,
50
],
"support":[
0.08667,
13
]
},
{
"confidence":0.26,
"id":"000028",
"leverage":0.05111,
"lhs":[
6
],
"lhs_cover":[
0.33333,
50
],
"lift":2.4375,
"p_value":0.0000454342,
"rhs":[
5
],
"rhs_cover":[
0.10667,
16
],
"support":[
0.08667,
13
]
},
{
"confidence":0.26,
"id":"000029",
"leverage":0.05111,
"lhs":[
13
],
"lhs_cover":[
0.33333,
50
],
"lift":2.4375,
"p_value":0.0000454342,
"rhs":[
5
],
"rhs_cover":[
0.10667,
16
],
"support":[
0.08667,
13
]
},
{
"confidence":0.18,
"id":"00002a",
"leverage":0.04,
"lhs":[
15
],
"lhs_cover":[
0.33333,
50
],
"lift":3,
"p_value":0.0000302052,
"rhs":[
9
],
"rhs_cover":[
0.06,
9
],
"support":[
0.06,
9
]
},
{
"confidence":1,
"id":"00002b",
"leverage":0.04,
"lhs":[
9
],
"lhs_cover":[
0.06,
9
],
"lift":3,
"p_value":0.0000302052,
"rhs":[
15
],
"rhs_cover":[
0.33333,
50
],
"support":[
0.06,
9
]
}
],
"rules_summary":{
"confidence":{
"counts":[
[
0.18,
1
],
[
0.24,
1
],
[
0.26,
2
],
[
0.32,
1
],
[
0.32653,
1
],
[
0.46,
1
],
[
0.52632,
1
],
[
0.54386,
1
],
[
0.54667,
1
],
[
0.56,
2
],
[
0.58,
1
],
[
0.59184,
1
],
[
0.6,
1
],
[
0.61429,
1
],
[
0.625,
1
],
[
0.64,
1
],
[
0.64286,
1
],
[
0.69565,
1
],
[
0.70313,
2
],
[
0.74667,
1
],
[
0.75,
1
],
[
0.8,
1
],
[
0.8125,
2
],
[
0.82,
1
],
[
0.86,
1
],
[
0.875,
2
],
[
0.87719,
2
],
[
0.96875,
1
],
[
0.97959,
1
],
[
1,
9
]
],
"maximum":1,
"mean":0.70986,
"median":0.72864,
"minimum":0.18,
"population":44,
"standard_deviation":0.24324,
"sum":31.23367,
"sum_squares":24.71548,
"variance":0.05916
},
"k":44,
"leverage":{
"counts":[
[
0.04,
2
],
[
0.05111,
4
],
[
0.05316,
2
],
[
0.05333,
2
],
[
0.0552,
2
],
[
0.05658,
2
],
[
0.07111,
2
],
[
0.08444,
2
],
[
0.08667,
2
],
[
0.10089,
2
],
[
0.10222,
2
],
[
0.10667,
2
],
[
0.11556,
4
],
[
0.1256,
2
],
[
0.13111,
2
],
[
0.14,
2
],
[
0.15667,
2
],
[
0.20667,
4
],
[
0.22222,
2
]
],
"maximum":0.22222,
"mean":0.10603,
"median":0.10156,
"minimum":0.04,
"population":44,
"standard_deviation":0.0536,
"sum":4.6651,
"sum_squares":0.61815,
"variance":0.00287
},
"lhs_cover":{
"counts":[
[
0.06,
2
],
[
0.08,
2
],
[
0.10667,
4
],
[
0.12667,
1
],
[
0.15333,
2
],
[
0.21333,
3
],
[
0.32667,
3
],
[
0.33333,
15
],
[
0.38,
3
],
[
0.42667,
2
],
[
0.46667,
3
],
[
0.5,
4
]
],
"maximum":0.5,
"mean":0.29894,
"median":0.33213,
"minimum":0.06,
"population":44,
"standard_deviation":0.13386,
"sum":13.15331,
"sum_squares":4.70252,
"variance":0.01792
},
"lift":{
"counts":[
[
1.40625,
2
],
[
1.5067,
2
],
[
1.6,
2
],
[
1.64,
2
],
[
1.77551,
2
],
[
1.84286,
2
],
[
1.95918,
2
],
[
2.12955,
2
],
[
2.4375,
4
],
[
2.54934,
2
],
[
2.625,
4
],
[
2.63158,
4
],
[
3,
10
],
[
4.93421,
2
],
[
12.5,
2
]
],
"maximum":12.5,
"mean":2.91963,
"median":2.58068,
"minimum":1.40625,
"population":44,
"standard_deviation":2.24641,
"sum":128.46352,
"sum_squares":592.05855,
"variance":5.04635
},
"p_value":{
"counts":[
[
0.000000000,
2
],
[
0.000000000,
4
],
[
0.000000000,
2
],
[
0.000000000,
2
],
[
0.000000000,
2
],
[
0.000000000,
4
],
[
0.000000000,
2
],
[
0.000000000,
2
],
[
0.000000000,
2
],
[
0.00000000,
2
],
[
0.00000002,
2
],
[
0.00000044,
2
],
[
0.00000048,
2
],
[
0.0000007,
2
],
[
0.00000435444,
2
],
[
0.0000152917,
2
],
[
0.0000302052,
2
],
[
0.0000454342,
4
],
[
0.0000910873,
2
]
],
"maximum":0.0000910873,
"mean":0.0000106114,
"median":0.00000000,
"minimum":0.000000000,
"population":44,
"standard_deviation":0.0000227364,
"sum":0.000466903,
"sum_squares":0.0000000,
"variance":0.000000001
},
"rhs_cover":{
"counts":[
[
0.06,
2
],
[
0.08,
2
],
[
0.10667,
4
],
[
0.12667,
1
],
[
0.15333,
2
],
[
0.21333,
3
],
[
0.32667,
3
],
[
0.33333,
15
],
[
0.38,
3
],
[
0.42667,
2
],
[
0.46667,
3
],
[
0.5,
4
]
],
"maximum":0.5,
"mean":0.29894,
"median":0.33213,
"minimum":0.06,
"population":44,
"standard_deviation":0.13386,
"sum":13.15331,
"sum_squares":4.70252,
"variance":0.01792
},
"support":{
"counts":[
[
0.06,
4
],
[
0.06667,
2
],
[
0.08,
2
],
[
0.08667,
4
],
[
0.10667,
4
],
[
0.15333,
2
],
[
0.18667,
4
],
[
0.19333,
2
],
[
0.20667,
2
],
[
0.27333,
2
],
[
0.28667,
2
],
[
0.3,
4
],
[
0.32,
2
],
[
0.33333,
6
],
[
0.37333,
2
]
],
"maximum":0.37333,
"mean":0.20152,
"median":0.19057,
"minimum":0.06,
"population":44,
"standard_deviation":0.10734,
"sum":8.86668,
"sum_squares":2.28221,
"variance":0.01152
}
},
"search_strategy":"leverage",
"significance_level":0.05
},
"category":0,
"clones":0,
"code":200,
"columns":5,
"created":"2015-11-05T08:06:08.184000",
"credits":0.017581939697265625,
"dataset":"dataset/562fae3f4e1727141d00004e",
"dataset_status":true,
"dataset_type":0,
"description":"",
"excluded_fields":[ ],
"fields_meta":{
"count":5,
"limit":1000,
"offset":0,
"query_total":5,
"total":5
},
"input_fields":[
"000000",
"000001",
"000002",
"000003",
"000004"
],
"locale":"en_US",
"max_columns":5,
"max_rows":150,
"name":"iris' dataset's association",
"out_of_bag":false,
"price":0,
"private":true,
"project":null,
"range":[
1,
150
],
"replacement":false,
"resource":"association/5621b70910cb86ae4c000000",
"rows":150,
"sample_rate":1,
"shared":false,
"size":4609,
"source":"source/562fae3a4e1727141d000048",
"source_status":true,
"status":{
"code":5,
"elapsed":1072,
"message":"The association has been created",
"progress":1
},
"subscription":false,
"tags":[ ],
"updated":"2015-11-05T08:06:20.403000",
"white_box":false
}
< Example association JSON response
PMML
The default association output format is JSON. However, the pmml parameter allows you to include a PMML version of the association. The response will include an XML document that conforms to PMML v4.1. Use pmml=yes to include the PMML alongside the JSON response, or pmml=only to retrieve just the PMML document. For example:
curl "https://au.bigml.io/association/5621b70910cb86ae4c000000?$BIGML_AUTH;pmml=yes"
PMML Example
curl "https://au.bigml.io/association/5621b70910cb86ae4c000000?$BIGML_AUTH;pmml=only"
PMML Example
Filtering and Paginating Fields from an Association
An association might be composed of hundreds or even thousands of fields. Thus when retrieving an association, it's possible to specify that only a subset of fields be retrieved, by using any combination of the following parameters in the query string (unrecognized parameters are ignored):
Since fields is a map and therefore not ordered, the returned fields contain an additional key, order, whose integer (increasing) value gives you their ordering. In all other respects, the resource is the same as the one you would get without any of the filtering parameters above.
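Client-side, the offset/limit slicing over the order key can be sketched like this (plain Python; field contents abbreviated, names illustrative):

```python
def paginate_fields(fields, offset=0, limit=2):
    """Return the fields sorted by their "order" key, sliced the
    way offset/limit query-string parameters would slice them."""
    ordered = sorted(fields.items(), key=lambda kv: kv[1]["order"])
    return dict(ordered[offset:offset + limit])

fields = {
    "000000": {"name": "sepal length", "order": 0},
    "000001": {"name": "sepal width", "order": 1},
    "000002": {"name": "petal length", "order": 2},
}
page = paginate_fields(fields, offset=1, limit=2)
print(list(page))  # ['000001', '000002']
```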
The fields_meta field can help you paginate fields. Its structure is as follows:
Updating an Association
To update an association, you need to PUT an object containing the fields that you want to update to the association's base URL. The content-type must always be "application/json". If the request succeeds, BigML.io will return an HTTP 202 response with the updated association.
For example, to update an association with a new name you can use curl like this:
curl "https://au.bigml.io/association/5621b70910cb86ae4c000000?$BIGML_AUTH" \
-X PUT \
-d '{"name": "a new name"}' \
-H 'content-type: application/json'
$ Updating an association's name
If you want to update an association with a new label and description for a specific field you can use curl like this:
curl "https://au.bigml.io/association/5621b70910cb86ae4c000000?$BIGML_AUTH" \
-X PUT \
-d '{"fields": {"000000": {"label": "a longer name", "description": "an even longer description"}}}' \
-H 'content-type: application/json'
$ Updating an association's field, label, and description
Deleting an Association
To delete an association, you need to issue an HTTP DELETE request to the association/id to be deleted.
Using curl you can do something like this to delete an association:
curl -X DELETE "https://au.bigml.io/association/5621b70910cb86ae4c000000?$BIGML_AUTH"
$ Deleting an association from the command line
If the request succeeds you will not see anything on the command line unless you executed the command in verbose mode. Successful DELETEs will return "204 no content" responses with no body.
Once you delete an association, it is permanently deleted. That is, a delete request cannot be undone. If you try to delete an association a second time, or an association that does not exist, you will receive a "404 not found" response.
However, if you try to delete an association that is being used at the moment, then BigML.io will not accept the request and will respond with a "400 bad request" response.
See this section for more details.
Listing Associations
To list all the associations, you can use the association base URL. By default, only the 20 most recent associations will be returned. You can see below how to change this number using the limit parameter.
You can get your list of associations directly in your browser using your own username and API key with the following links.
https://au.bigml.io/association?$BIGML_AUTH
> Listing associations from a browser
Topic Models
Last Updated: Tuesday, 2019-01-29 16:28
A topic model is an unsupervised machine learning method for unveiling all the different topics underlying a collection of documents. BigML uses Latent Dirichlet allocation (LDA), one of the most popular probabilistic methods for topic modeling. In BigML, each instance (i.e. each row in your dataset) will be considered a document and the input field (which must be a text field) will be the content of the document. If multiple text fields are given as inputs, they will be automatically concatenated, so the content for each document can be considered as a bag of words. Topic model is an unsupervised method so your data doesn't need to be labeled.
Topic model is based on the assumption that any document exhibits a mixture of topics. Each topic is composed of a set of words which are thematically related. The words from a given topic have different probabilities for that topic. At the same time, each word can be attributable to one or several topics. So for example the word "sea" may be found in a topic related with sea transport but also in a topic related to holidays. Topic model automatically discards stopwords and high frequency words that occur in almost all of the documents as they don't help to determine the boundaries between topics.
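A sketch of the preprocessing described above (plain Python; the stopword set and max_df threshold are illustrative, not BigML's actual lists): multiple text fields are concatenated into one bag of words, and stopwords plus terms present in nearly all documents are discarded:

```python
from collections import Counter

STOPWORDS = {"the", "a", "and", "of", "to"}  # illustrative only

def bag_of_words(row, text_fields, doc_freq, n_docs, max_df=0.9):
    """Concatenate the row's text fields and count terms,
    discarding stopwords and terms that occur in almost all
    documents (document frequency above max_df)."""
    text = " ".join(row[f] for f in text_fields).lower()
    tokens = [t for t in text.split()
              if t not in STOPWORDS
              and doc_freq.get(t, 0) / n_docs <= max_df]
    return Counter(tokens)

row = {"title": "Vote in the Senate",
       "body": "The committee approved the measure"}
bow = bag_of_words(row, ["title", "body"], doc_freq={"senate": 10}, n_docs=100)
print(bow["the"], bow["senate"])  # 0 1
```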
Topic model's main applications include browsing, organizing, and understanding large archives of documents. It can be applied to information retrieval, collaborative filtering, and assessing document similarity, among others. The topics found in the dataset can also be very useful as new features before applying other models like classification, clustering, or anomaly detection.
Topic model returns a list of top terms for each topic found in the data. Note that topics are not labeled, so you have to infer their meaning according to the words they are composed of. By looking at each group of terms below we can interpret the first topic as regulatory related, the second as healthcare related and so on. You can obtain up to 128 different topics.
- topic 1: senate, committee, federal, vote, congress, law, measure, approved
- topic 2: hospital, health, medical, live, life, care, doctors, heart
- topic 3: workers, union, job, strike, contract, labor, employee, wage
Once you build the topic model, you can calculate the probability of each topic for a given document by using Topic Distribution. This information can be useful to find document similarities based on their themes.
BigML.io allows you to create, retrieve, update, and delete your topic models. You can also list all of your topic models.
Jump to:
- Topic Model Base URL
- Creating a Topic Model
- Topic Model Arguments
- Retrieving a Topic Model
- Topic Model Properties
- Filtering and Paginating Fields from a Topic Model
- Updating a Topic Model
- Deleting a Topic Model
- Listing Topic Models
Topic Model Base URL
You can use the following base URL to create, retrieve, update, and delete topic models. https://bigml.io/topicmodel
Topic Model base URL
All requests to manage your topic models must use HTTPS and be authenticated using your username and API key to verify your identity. See this section for more details.
Creating a Topic Model
To create a new topic model, you need to POST to the topic model base URL an object containing at least the dataset/id that you want to use to create the topic model. The content-type must always be "application/json".
POST /topicmodel?username=alfred;api_key=79138a622755a2383660347f895444b1eb927730; HTTP/1.1
Host: bigml.io
Content-Type: application/json
Creating topic model definition
curl "https://bigml.io/topicmodel?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"dataset": "dataset/4f66a80803ce8940c5000006"}'
> Creating a topic model
BigML.io will return the newly created topic model if the request succeeded.
{
"category":0,
"code":201,
"columns":0,
"created":"2016-06-17T23:10:22.660449",
"credits":0.000762939453125,
"credits_per_prediction":0,
"dataset":"dataset/576486a94e172732fe000000",
"dataset_field_types":{
"categorical":2,
"datetime":0,
"effective_fields":2,
"items":0,
"numeric":0,
"preferred":2,
"text":0,
"total":2
},
"dataset_status":true,
"dataset_type":0,
"description":"",
"excluded_fields":[],
"fields_meta":{
"count":0,
"limit":1000,
"offset":0,
"total":0
},
"input_fields":[],
"topic_model":{
},
"locale":"en-US",
"max_columns":2,
"max_rows":4,
"name":"SMS dataset's Topic Model",
"number_of_batchtopicdistributions":0,
"number_of_public_topicdistributions":0,
"number_of_topicdistributions":0,
"ordering":0,
"out_of_bag":false,
"price":0,
"private":true,
"project":null,
"range":[
1,
4
],
"replacement":false,
"resource":"topicmodel/56f5ecfa4e17275f4400015b",
"rows":4,
"sample_rate":1,
"shared":false,
"size":200,
"source":"source/5764832b4e17271227000020",
"source_status":true,
"status":{
"code":1,
"message":"The topic model is being processed and will be created soon"
},
"subscription":true,
"tags":[],
"updated":"2016-06-17T23:10:22.660694",
"white_box":false
}
< Example topic model JSON response
Topic Model Arguments
In addition to the dataset, you can also POST the following arguments.
| Argument | Type | Description |
|---|---|---|
|
bigrams
optional |
Boolean, default is false |
Whether to include a contiguous sequence of two items from a given sequence of text. See n-gram for more information. This argument is deprecated in favor of ngrams and is equivalent to ngrams=2.
Example: true DEPRECATED |
|
case_sensitive
optional |
Boolean, default is false |
Whether the analysis is case-sensitive or not.
Example: true |
|
category
optional |
Integer, default is the category of the dataset |
The category that best describes the topic model. See the category codes for the complete list of categories.
Example: 1 |
| dataset | String |
A valid dataset/id.
Example: dataset/4f66a80803ce8940c5000006 |
|
description
optional |
String |
A description of the topic model up to 8192 characters long.
Example: "This is a description of my new topic model" |
|
excluded_fields
optional |
Array, default is [], an empty list. None of the fields in the dataset is excluded. |
Specifies the fields that won't be included in the topic model.
Example:
|
|
excluded_terms
optional |
Array of Strings, default is [], an empty list. |
Specifies a list of terms to ignore when performing term analysis.
Example:
|
|
fields
optional |
Object, default is {}, an empty dictionary. That is, no names or preferred statuses are changed. |
This can be used to change the names of the fields in the topic model with respect to the original names in the dataset, or to tell BigML that certain fields should be preferred. Add an entry keyed with the field id generated in the source for each field whose name you want updated.
Example:
|
|
input_fields
optional |
Array, default is []. All text fields in the dataset |
Specifies the fields to be considered to create the topic model. Each of the input_fields must be text fields. If multiple fields are given, the text field values for each row will be concatenated so that each row is still considered to be one document.
Example:
|
|
language
optional |
String, default is "en" |
The default language of text fields in a two-letter language code, which will change the resulting stemming and tokenization. Available options are: "ar", "ca", "cs", "da", "de", "en", "es", "fa", "fi", "fr", "hu", "it", "ja", "ko", "nl", "pl", "pt", "ro", "ru", "sv", "tr", "zh", "none", or null for auto-detect.
Example: "es" |
|
minimum_name_terms
optional |
Integer, default is 1 |
The minimum number of terms to be used to build topic names. If 0, the topic names will be generated as Topic 1, Topic 2, Topic 3, and so on.
Example: 2 |
|
name
optional |
String, default is dataset's name |
The name you want to give to the new topic model.
Example: "my new topic model" |
|
ngrams
optional |
Integer, default is 1 |
A positive integer n specifying that all sequences of consecutive tokens of length up to n (joined by a single space, with no intervening stopwords) should be considered as terms, in addition to their constituent tokens. See n-gram for more information. The minimum value is 1 and the maximum value is 5.
Example: 5 |
|
number_of_topics
optional |
Integer |
The number of topics that the topic model will generate. If it is unset, it will be chosen automatically based on the number of documents (i.e., the row count). The minimum value is 2 and the maximum value is 64.
Example: 32 |
|
out_of_bag
optional |
Boolean, default is false |
Setting this parameter to true will return a sequence of the out-of-bag instances instead of the sampled instances. See the Section on Sampling for more details.
Example: true |
|
project
optional |
String |
The project/id you want the topic model to belong to.
Example: "project/54d98718f0a5ea0b16000000" |
|
range
optional |
Array, default is [1, max rows in the dataset] |
The range of successive instances to build the topic model.
Example: [1, 150] |
|
replacement
optional |
Boolean, default is false |
Whether sampling should be performed with or without replacement. See the Section on Sampling for more details.
Example: true |
|
sample_rate
optional |
Float, default is 1.0 |
A real number between 0 and 1 specifying the sample rate. See the Section on Sampling for more details.
Example: 0.5 |
|
seed
optional |
String |
A string to be hashed to generate deterministic samples. See the Section on Sampling for more details.
Example: "MySample" |
|
stem_words
optional |
Boolean, default is true |
Whether lemmatization (stemming) of terms should be done, according to linguistic rules in the provided language. Note that if the language is, for example, zh, even English words will not be lemmatized with English rules.
Example: true |
|
stopword_diligence
optional |
String, default is "normal" |
The aggressiveness of stopword removal. The levels are light, normal, and aggressive, in order, where each level is a superset of the words in the previous ones. The most common languages will add stopwords at each level, but less common languages may not.
Example: "light" |
|
stopword_removal
optional |
String, default is "selected_language" |
A string or keyword specifying the type of stopword removal to perform. Available options are none (remove no stopwords), selected_language (remove stopwords from the provided language), and all_languages (remove stopwords from all languages). Note that this parameter supersedes use_stopwords if provided. Also note that the null language has a non-empty stopword list, such as single numeric digits.
Example: "all_languages" |
|
tags
optional |
Array of Strings |
A list of strings that help classify and index your topic model.
Example: ["best customers", "2018"] |
|
term_filters
optional |
Array of Strings |
Filters that should be applied to the chosen terms. Available options are:
Example: "html_keywords" |
|
term_limit
optional |
Integer, default is 4096 |
The maximum number of terms used for the topic model vocabulary. Computation is linear with respect to this parameter. The minimum value is 128 and maximum value is 16384.
Example: 1024 |
|
term_regexps
optional |
Array of Strings |
A list of strings specifying regular expressions to be matched against input documents. If present, these regular expressions will automatically be chosen for the final term list, and their per-document occurrence counts will be the number of matches of the expression in that document. |
|
token_mode
optional |
String, default is "all" |
The tokenization strategy: tokens_only (use individual tokens), full_terms_only (use the full field content as a single term), or all (use both).
Example: "tokens_only" |
|
top_n_terms
optional |
Integer, default is 10 |
The number of the most influential terms recorded for each topic. The minimum value is 1 and the maximum value is 128.
Example: 32 |
|
topicmodel_seed
optional |
String |
With a seed, the topic model is deterministic.
Example: "My Seed" |
|
use_stopwords
optional |
Boolean, default is true |
Whether to use stop words or not. This argument is deprecated in favor of stopword_removal.
Example: true DEPRECATED |
|
webhook
optional |
Object |
A webhook url and an optional secret phrase. See the Section on Webhooks for more details.
Example:
|
You can also use curl to customize a new topic model. For example, to create a new topic model named "my topic model", with only certain rows, and with only three fields:
curl "https://bigml.io/topicmodel?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"dataset": "dataset/4f66a80803ce8940c5000006",
"input_fields": ["000001", "000002", "000003"],
"name": "my topic model",
"range": [25, 125]}'
> Creating a customized topic model
If you do not specify a name, BigML.io will assign the dataset's name to the new topic model. If you do not specify a range of instances, BigML.io will use all the instances in the dataset. If you do not specify any input fields, BigML.io will include all text fields in the dataset.
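To see what the ngrams argument does to the vocabulary, here is a simplified, hypothetical tokenizer: with ngrams=2 it adds bigrams (consecutive token pairs joined by a single space) alongside the single tokens. BigML's actual term analysis additionally applies stemming, stopword removal, and case handling.

```python
def terms_with_ngrams(text, ngrams=1):
    """Return single tokens plus all n-grams up to length `ngrams`,
    joined by a single space (a simplified sketch of term analysis)."""
    tokens = text.lower().split()
    terms = list(tokens)
    for n in range(2, ngrams + 1):
        terms.extend(" ".join(tokens[i:i + n])
                     for i in range(len(tokens) - n + 1))
    return terms

print(terms_with_ngrams("machine learning rocks", ngrams=2))
# ['machine', 'learning', 'rocks', 'machine learning', 'learning rocks']
```

Setting ngrams=1 (the default) yields only the single tokens, which is why bigrams=true is documented as equivalent to ngrams=2.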
Retrieving a Topic Model
Each topic model has a unique identifier in the form "topicmodel/id" where id is a string of 24 alpha-numeric characters that you can use to retrieve the topic model.
To retrieve a topic model with curl:
curl "https://bigml.io/topicmodel/56f5ecfa4e17275f4400015b?$BIGML_AUTH"
$ Retrieving a topic model from the command line
You can also use your browser to visualize the topic model using the full BigML.io URL or by pasting the topicmodel/id into the BigML.com dashboard.
Topic Model Properties
Once a topic model has been successfully created it will have the following properties.
| Property | Type | Description |
|---|---|---|
|
category
filterable, sortable, updatable |
Integer | One of the categories in the table of categories that help classify this resource according to the domain of application. |
| code | Integer | HTTP status code. This will be 201 upon successful creation of the topic model and 200 afterwards. Make sure that you check the code that comes with the status attribute to make sure that the topic model creation has been completed without errors. |
|
columns
filterable, sortable |
Integer | The number of fields in the topic model. |
|
composites
filterable, sortable |
Array of Strings | The list of composite ids that reference this model. |
|
created
filterable, sortable |
ISO-8601 Datetime. | This is the date and time in which the topic model was created with microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are provided in Coordinated Universal Time (UTC). |
|
credits
filterable, sortable |
Float | The number of credits it cost you to create this topic model. |
|
credits_per_prediction
filterable, sortable, updatable |
Float | This is the number of credits that other users will consume to make a prediction with your topic model if you made it public. |
|
dataset
filterable, sortable |
String | The dataset/id that was used to build the topic model. |
| dataset_field_types | Object | A dictionary that informs about the number of fields of each type in the dataset used to create the topic model. It has an entry per each field type (categorical, datetime, numeric, and text), an entry for preferred fields, and an entry for the total number of fields. |
|
dataset_status
filterable, sortable |
Boolean | Whether the dataset is still available or has been deleted. |
|
description
updatable |
String | A text describing the topic model. It can contain restricted markdown to decorate the text. |
| excluded_fields | Array | The list of fields's ids that were excluded to build the topic model. |
| fields_meta | Object | A dictionary with meta information about the fields dictionary. It specifies the total number of fields, the current offset, and limit, and the number of fields (count) returned. |
| input_fields | Array | The list of input fields' ids used to build the models of the topic model. |
| locale | String | The dataset's locale. |
|
max_columns
filterable, sortable |
Integer | The total number of fields in the dataset used to build the topic model. |
|
max_rows
filterable, sortable |
Integer | The maximum number of instances in the dataset that can be used to build the topic model. |
|
name
filterable, sortable, updatable |
String | The name of the topic model as you provided it, or based on the name of the dataset by default. |
|
number_of_batchtopicdistributions
filterable, sortable |
Integer | The current number of batch topic distributions that use this topic model. |
|
number_of_topicdistributions
filterable, sortable |
Integer | The current number of topic distributions that use this topic model. |
|
out_of_bag
filterable, sortable |
Boolean | Whether the out-of-bag instances were used to create the topic model instead of the sampled instances. |
|
price
filterable, sortable, updatable |
Float | The price other users must pay to clone your topic model. |
|
private
filterable, sortable, updatable |
Boolean | Whether the topic model is public or not. |
|
project
filterable, sortable, updatable |
String | The project/id the resource belongs to. |
| range | Array | The range of instances used to build the topic model. |
|
replacement
filterable, sortable |
Boolean | Whether the instances sampled to build the topic model were selected using replacement or not. |
| resource | String | The topicmodel/id. |
|
rows
filterable, sortable |
Integer | The total number of instances used to build the topic model. |
|
sample_rate
filterable, sortable |
Float | The sample rate used to select instances from the dataset to build the topic model. |
|
seed
filterable, sortable |
String | The string that was used to generate the sample. |
|
shared
filterable, sortable, updatable |
Boolean | Whether the topic model is shared using a private link or not. |
| shared_hash | String | The hash that gives access to this topic model if it has been shared using a private link. |
| sharing_key | String | The alternative key that gives read access to this topic model. |
|
size
filterable, sortable |
Integer | The number of bytes of the dataset that were used to create this topic model. |
|
source
filterable, sortable |
String | The source/id that was used to build the dataset. |
|
source_status
filterable, sortable |
Boolean | Whether the source is still available or has been deleted. |
| status | Object | A description of the status of the topic model. It includes a code, a message, and some extra information. See the table below. |
|
subscription
filterable, sortable |
Boolean | Whether the topic model was created using a subscription plan or not. |
|
tags
filterable, updatable |
Array of Strings | A list of user tags that can help classify and index this resource. |
| topic_model | Object | All the information that you need to recreate or use the topic model on your own. See here for more details. |
|
updated
filterable, sortable |
ISO-8601 Datetime. | This is the date and time in which the topic model was updated with microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are provided in Coordinated Universal Time (UTC). |
| webhook | Object | A webhook url and an optional secret phrase. See the Section on Webhooks for more details. |
|
white_box
filterable, sortable |
Boolean | Whether the topic model is publicly shared as a white-box. |
A topic model Object has the following properties:
| Property | Type | Description |
|---|---|---|
| bigrams | Boolean | Whether to include a contiguous sequence of two items from a given sequence of text. See n-gram for more information. This argument is deprecated in favor of ngrams. DEPRECATED |
| case_sensitive | Boolean | Whether the analysis is case-sensitive or not. |
| excluded_terms | Array of Strings | A list of terms to ignore when performing term analysis. |
| fields | Object | A dictionary with an entry per field in the dataset used to build the topic model. Fields are paginated according to the field_meta attribute. Each entry includes the column number in original source, the name of the field, the type of the field, and the summary. See this Section for more details. |
| language | String | The language of text fields. |
| minimum_name_terms | Integer | The minimum number of terms used to build topic names. |
| ngrams | Integer | A contiguous sequence of n items from a given sequence of text. See n-gram for more information. |
| number_of_topics | Integer | The number of topics that topic model will generate. |
| stem_words | Boolean | Whether lemmatization (stemming) of terms has been done, according to linguistic rules in the provided language. |
| stopword_diligence | String | The aggressiveness of stopword removal. |
| stopword_removal | String | The keyword specifying the type of stopword removal to perform. |
| term_filters | Array of Strings | Filters that have been applied to the chosen terms. |
| term_limit | Integer | The maximum number of terms used for the topic model vocabulary. |
| term_regexps | Array of Strings | A list of strings specifying regular expressions to be matched against input documents. If present, these regular expressions will automatically be chosen for the final term list, and their per-document occurrence counts will be the number of matches of the expression in that document. |
| term_topic_assignments | Array of Arrays of Integers | A matrix of the term assignments per topic. |
| termset | Array of Strings | The terms selected to be part of the topic creation |
| token_mode | String | The tokenizer mode applied. |
| top_n_terms | Integer | The size of the most influential terms recorded. |
| topicmodel_seed | String | With a seed, the topic model is deterministic. |
|
topics
updatable |
Object |
Information about each topic: id, name, top_terms, the most probable terms for each topic as a list of term/probability pairs, and probability, the average probability of the topic over the training instances. Only the names of the topics can be updated.
Example:
|
| use_stopwords | Boolean | Whether to use stop words. This argument is deprecated in favor of stopword_removal. DEPRECATED |
Topic Model Status
Creating a topic model is a process that can take just a few seconds or a few days depending on the size of the dataset used as input and on the workload of BigML's systems. The topic model goes through a number of states until it is fully completed. Through the status field in the topic model you can determine when the topic model has been fully processed and is ready to be used to create predictions. These are the properties that a topic model's status has:
| Property | Type | Description |
|---|---|---|
| code | Integer | A status code that reflects the status of the topic model creation. It can be any of those that are explained here. |
| elapsed | Integer | Number of milliseconds that BigML.io took to process the topic model. |
| message | String | A human readable message explaining the status. |
| progress | Float, between 0 and 1 | How far BigML.io has progressed building the topic model. |
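Since a topic model only becomes usable once its status code reaches 5, clients typically poll the resource until then. The sketch below is a minimal, hypothetical polling loop: fetch_topic_model stands in for an authenticated GET on the topicmodel/id URL, and negative codes are assumed to indicate errors.

```python
import time

def wait_until_ready(fetch_topic_model, delay=2.0):
    """Poll a topic model until its status code reaches 5 (finished).

    `fetch_topic_model` is a hypothetical stand-in for an authenticated
    GET on the topicmodel/id URL; it should return the decoded JSON.
    """
    while True:
        resource = fetch_topic_model()
        code = resource["status"]["code"]
        if code == 5:                 # finished
            return resource
        if code < 0:                  # error states are negative
            raise RuntimeError(resource["status"].get("message", "creation failed"))
        time.sleep(delay)

# Example with a stubbed fetch that finishes on the second poll:
responses = iter([{"status": {"code": 1}},
                  {"status": {"code": 5, "progress": 1}}])
ready = wait_until_ready(lambda: next(responses), delay=0)
print(ready["status"]["code"])  # 5
```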
Once a topic model has been successfully created, it will look like:
{
"category":0,
"code":200,
"columns":1,
"configuration":null,
"configuration_status":false,
"created":"2016-10-07T07:50:58.376000",
"credits":0,
"credits_per_prediction":0,
"dataset":"dataset/57e427864e1727a95d000003",
"dataset_field_types":{
"categorical":8,
"datetime":1,
"effective_fields":76,
"items":0,
"numeric":10,
"preferred":11,
"text":3,
"total":22
},
"dataset_status":true,
"dataset_type":0,
"description":"",
"excluded_fields":[],
"fields_meta":{
"count":1,
"limit":1000,
"offset":0,
"query_total":1,
"total":1
},
"input_fields":[
"00000a"
],
"locale":"en-us",
"max_columns":22,
"max_rows":57,
"name":"SMS dataset's Topic Model",
"number_of_batchtopicdistributions":0,
"number_of_public_topicdistributions":0,
"number_of_topicdistributions":0,
"ordering":0,
"out_of_bag":false,
"price":0,
"private":true,
"project":null,
"range":[
1,
57
],
"replacement":false,
"resource":"topicmodel/56f5ecfa4e17275f4400015b",
"rows":57,
"sample_rate":1,
"shared":false,
"size":13330,
"source":"source/57e427644e1727a95d000000",
"source_status":true,
"status":{
"code":5,
"elapsed":4344,
"message":"The topic model has been created",
"progress":1
},
"subscription":true,
"tags":[],
"topic_model":{
"alpha":16.666666666666668,
"beta":0.1,
"ngrams":1,
"case_sensitive":false,
"fields":{
"00000a":{
"column_number":10,
"datatype":"string",
"name":"text",
"optype":"text",
"order":0,
"preferred":true,
"summary":{
"average_length":96.17544,
"missing_count":0,
"tag_cloud":[
[
"virginamerica",
57
],
[
"flight",
11
],
[
"co",
8
],
...
],
"term_forms":{
"amazing":[
"amazingly"
],
"book":[
"booking"
],
"call":[
"called"
],
"entertaining":[
"entertainment"
],
"flight":[
"flights"
],
"fly":[
"flying"
],
"graphic":[
"graphics"
],
"hour":[
"hours"
],
"miss":[
"missed"
],
"moodlight":[
"moodlighting"
],
"option":[
"options"
],
"seat":[
"seats",
"seating"
],
"time":[
"times"
],
"week":[
"weeks"
]
}
},
"term_analysis":{
"case_sensitive":false,
"enabled":true,
"language":"en",
"stem_words":true,
"token_mode":"all",
"use_stopwords":false
}
}
},
"hashed_seed":62146850,
"language":"en",
"number_of_topics":3,
"term_limit":4096,
"term_topic_assignments":[
[
0,
1,
0
],
[
0,
1,
0
],
[
0,
0,
1
],
[
1,
0,
0
],
[
0,
0,
0
],
...
],
"termset":[
"10",
"1st",
"account",
"add",
"added",
"aggressive",
"cool",
"help",
"hey",
"prefer",
"pretty",
"prime",
"virginamerica",
"virginmedia",
"vx",
"vx358",
...
],
"top_n_terms":10,
"topicmodel_seed":"26c386d781963ca1ea5c90dab8a6b023b5e1d180",
"topics":[
{
"id":"000000",
"name":"Topic 0",
"probability":0.33847,
"top_terms":[
[
"virginamerica",
0.14638
],
[
"http",
0.04074
],
[
"amp",
0.02062
],
[
"help",
0.02062
],
[
"week",
0.02062
],
[
"seat",
0.01559
],
[
"sfo",
0.01559
],
[
"24",
0.01056
],
[
"america",
0.01056
],
[
"book",
0.01056
]
]
},
{
"id":"000001",
"name":"Topic 1",
"probability":0.27369,
"top_terms":[
[
"co",
0.05456
],
[
"seat",
0.03058
],
[
"time",
0.03058
],
[
"ð",
0.03058
],
[
"â",
0.02458
],
[
"amazing",
0.01859
],
[
"gt",
0.01859
],
[
"soon",
0.01859
],
[
"andrews",
0.01259
],
[
"bos",
0.01259
]
]
},
{
"id":"000002",
"name":"Topic 2",
"probability":0.38784,
"top_terms":[
[
"virginamerica",
0.13994
],
[
"flight",
0.06026
],
[
"fly",
0.04034
],
[
"carrieunderwood",
0.03038
],
[
"ladygaga",
0.03038
],
[
"lax",
0.02042
],
[
"trip",
0.02042
],
[
"30",
0.01046
],
[
"added",
0.01046
],
[
"call",
0.01046
]
]
}
]
},
"updated":"2016-10-07T07:51:05.131000",
"white_box":false
}
< Example topic model JSON response
Filtering and Paginating Fields from a Topic Model
A topic model might be composed of hundreds or even thousands of fields. Thus when retrieving a topicmodel, it's possible to specify that only a subset of fields be retrieved, by using any combination of the following parameters in the query string (unrecognized parameters are ignored):
Since fields is a map and therefore not ordered, the returned fields contain an additional key, order, whose integer (increasing) value gives you their ordering. In all other respects, the source is the same as the one you would get without any filtering parameter above.
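A minimal sketch of recovering that ordering from a filtered response; the two-field fields map below is hypothetical:

```python
def ordered_field_ids(fields):
    """Return field ids sorted by the integer `order` value that the
    API adds to each entry of a filtered `fields` map."""
    return sorted(fields, key=lambda fid: fields[fid]["order"])

# Hypothetical excerpt of a filtered fields map:
fields = {
    "000002": {"name": "petal length", "optype": "numeric", "order": 1},
    "000000": {"name": "sepal length", "optype": "numeric", "order": 0},
}
print(ordered_field_ids(fields))  # ['000000', '000002']
```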
The fields_meta field can help you paginate fields.
Updating a Topic Model
To update a topic model, you need to PUT an object containing the fields that you want to update to the topic model's base URL. The content-type must always be "application/json". If the request succeeds, BigML.io will return an HTTP 202 response with the updated topic model.
For example, to update a topic model with a new name you can use curl like this:
curl "https://bigml.io/topicmodel/56f5ecfa4e17275f4400015b?$BIGML_AUTH" \
-X PUT \
-d '{"name": "a new name"}' \
-H 'content-type: application/json'
$ Updating a topic model's name
If you want to update a topic model with a new label and description for a specific field, you can use curl like this:
curl "https://bigml.io/topicmodel/56f5ecfa4e17275f4400015b?$BIGML_AUTH" \
-X PUT \
-d '{"fields": {"000000": {"label": "a longer name", "description": "an even longer description"}}}' \
-H 'content-type: application/json'
$ Updating a topic model's field, label, and description
Deleting a Topic Model
To delete a topic model, you need to issue a HTTP DELETE request to the topicmodel/id to be deleted.
Using curl you can do something like this to delete a topic model:
curl -X DELETE "https://bigml.io/topicmodel/56f5ecfa4e17275f4400015b?$BIGML_AUTH"
$ Deleting a topic model from the command line
If the request succeeds you will not see anything on the command line unless you executed the command in verbose mode. Successful DELETEs will return "204 no content" responses with no body.
Once you delete a topic model, it is permanently deleted. That is, a delete request cannot be undone. If you try to delete a topic model a second time, or a topic model that does not exist, you will receive a "404 not found" response.
However, if you try to delete a topic model that is being used at the moment, then BigML.io will not accept the request and will respond with a "400 bad request" response.
See this section for more details.
Listing Topic Models
To list all the topic models, you can use the topicmodel base URL. By default, only the 20 most recent topic models will be returned. You can see below how to change this number using the limit parameter.
You can get your list of topic models directly in your browser using your own username and API key with the following links.
https://bigml.io/topicmodel?$BIGML_AUTH
> Listing topic models from a browser
PCA
Last Updated: Tuesday, 2019-01-29 16:28
A Principal Component Analysis (PCA) is an unsupervised machine learning method predominantly used for dimensionality reduction and data visualization of large or complex datasets. The primary objective of PCA is to find the directions (principal components) that maximize the variance in the dataset. By selecting a subset of these components, one can describe the data in a reduced number of dimensions.
This methodology works by transforming a matrix with dimensions M (instances) x N (features) into a new matrix with dimensions M (instances) x K (principal components), where K ≤ N and each principal component is a linear combination of the original features. Furthermore, principal components are ordered such that the first principal component accounts for as much variability in the original dataset as possible, and each succeeding orthogonal principal component accounts for as much of the remaining variability as possible. As a result, despite the overall reduction in total columns, total information loss is minimized.
The method applied by the model depends on the optypes contained in the dataset. When the dataset contains only numeric fields, the model will perform Principal Component Analysis (PCA). If the dataset contains only categorical data, then the model will perform Multiple Correspondence Analysis (MCA). If the dataset contains both numeric and categorical fields, then the model performs Factorial Analysis of Mixed Data (FAMD). Text and items fields will be processed in a bag-of-words fashion. That is, one numeric field will be created per term in the tag-cloud/items dictionary. Thus, for any input dataset containing only text fields, the PCA method will be applied.
The selection of method is performed automatically without input from the user. The effective difference between the three methods is how the data rows are transformed in order to populate the data matrix. After the matrix is populated, the same decomposition is performed in order to produce the final results. The subsequent sections will detail the input transformations, because the same transformations must be implemented for predicting with new data points. The following notation is observed:
- Xij - The transformed value for row i, field j.
- Yij - The original input value for row i, field j.
- Mj - The arithmetic mean value for field j.
- Sj - The standard deviation for field j.
- J - The number of categorical fields.
In PCA, the values for each numeric field are shifted to zero mean, and if standardized is true, divided by the field's standard deviation.
- Xij = (Yij - Mj) / Sj
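A sketch of that transform in plain Python; using the population form of the standard deviation is an assumption here, since the text does not specify the estimator:

```python
import math

def standardize(column, standardized=True):
    """Shift a numeric column to zero mean; if `standardized`, also
    divide by the column's standard deviation (population form assumed)."""
    m = sum(column) / len(column)            # Mj
    centered = [y - m for y in column]        # Yij - Mj
    if not standardized:
        return centered
    s = math.sqrt(sum(c * c for c in centered) / len(column))  # Sj
    return [c / s for c in centered]

print(standardize([2.0, 4.0, 6.0]))
# approximately [-1.2247, 0.0, 1.2247]
```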
In MCA, categorical variables are first transformed to one-hot encoded binary vectors. The order of the categorical values matches those found in the field summary. For example, a value of Iris-virginica for the species field in the iris dataset becomes [0, 0, 1], because Iris-virginica is listed third in categories. If the categorical field's descriptor contains missing_count > 0, then an additional element will be appended to the end of this vector to represent missing values. For each categorical value, define a value Pj which represents the proportion of the data which contains that value. Continuing with the iris example, we would have P1 = P2 = P3 = 0.3333... because the species values are equally represented in the data. Note that these values are equal to the arithmetic mean of the one-hot encoded columns.
- Xij = (Yij - Pj) / (J * (J * Pj))^(1/2)
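The one-hot step and the Pj proportions can be sketched as follows, using the iris species example from the text (the extra element for missing values is omitted):

```python
categories = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]

def one_hot(value, categories):
    """Encode a categorical value as a binary vector in summary order."""
    return [1 if value == c else 0 for c in categories]

print(one_hot("Iris-virginica", categories))  # [0, 0, 1]

# Pj is the proportion of rows containing each value, i.e. the column
# means of the one-hot encoded matrix.
rows = ["Iris-setosa"] * 50 + ["Iris-versicolor"] * 50 + ["Iris-virginica"] * 50
encoded = [one_hot(v, categories) for v in rows]
p = [sum(col) / len(rows) for col in zip(*encoded)]
print(p)  # roughly [0.3333, 0.3333, 0.3333] for the balanced iris data
```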
In FAMD, numeric fields are transformed identically to PCA. The transformation for categorical fields is nearly identical to MCA, with only a change in the denominator.
- Xij = (Yij - Pj) / Pj^(1/2)
PCA has significant drawbacks, largely related to interpretability, and is therefore most appropriate when two major conditions hold:
- The presence of a large number of features/variables/fields.
- It is acceptable to transform the variables into a form that is less interpretable.
BigML.io allows you to create, retrieve, update, and delete your PCAs. You can also list all of your PCAs.
Jump to:
- PCA Base URL
- Creating a PCA
- PCA Arguments
- Retrieving a PCA
- PCA Properties
- Filtering and Paginating Fields from a PCA
- Updating a PCA
- Deleting a PCA
- Listing PCAs
PCA Base URL
You can use the following base URL to create, retrieve, update, and delete PCAs. https://bigml.io/pca
PCA base URL
All requests to manage your PCAs must use HTTPS and be authenticated using your username and API key to verify your identity. See this section for more details.
Creating a PCA
To create a new PCA, you need to POST to the PCA base URL an object containing at least the dataset/id that you want to use to create the PCA. The content-type must always be "application/json".
POST /pca?username=alfred;api_key=79138a622755a2383660347f895444b1eb927730; HTTP/1.1
Host: bigml.io
Content-Type: application/json
Creating PCA definition
curl "https://au.bigml.io/pca?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"dataset": "dataset/4f66a80803ce8940c5000006"}'
> Creating a PCA
BigML.io will return the newly created PCA if the request succeeded.
{
"category": 0,
"code": 201,
"columns": 5,
"configuration": null,
"configuration_status": false,
"created": "2018-12-10T18:06:58.245974",
"creator": "leon1",
"credits": 0.016017913818359375,
"credits_per_prediction": 0,
"dataset": "dataset/5948be794e1727307a000000",
"dataset_field_types": {
"categorical": 2,
"datetime": 0,
"effective_fields": 5,
"items": 0,
"numeric": 3,
"preferred": 5,
"text": 0,
"total": 5
},
"dataset_status": true,
"description": "",
"excluded_fields": [],
"fields_meta": {
"count": 0,
"limit": 1000,
"offset": 0,
"total": 0
},
"input_fields": [],
"locale": "en-US",
"max_columns": 5,
"max_rows": 150,
"name": "iris",
"name_options": "",
"number_of_batchprojections": 0,
"number_of_projections": 0,
"number_of_public_projections": 0,
"ordering": 0,
"out_of_bag": false,
"pca": {
"pca_seed": "2c249dda00fbf54ab4cdd850532a584f286af5b6"
},
"price": 0,
"private": true,
"project": null,
"range": null,
"replacement": false,
"resource": "pca/5bae775a4e1727e9f3000003",
"rows": 150,
"sample_rate": 1,
"shared": false,
"size": 4199,
"source": "source/5948be694e17273079000000",
"source_status": true,
"status": {
"code": 1,
"message": "The pca creation request has been queued and will be processed soon",
"progress": 0
},
"subscription": true,
"tags": [],
"type": 0,
"updated": "2018-12-10T18:06:58.246116",
"white_box": false
}
< Example PCA JSON response
PCA Arguments
In addition to the dataset, you can also POST the following arguments.
| Argument | Type | Description |
|---|---|---|
|
category
optional |
Integer, default is the category of the dataset |
The category that best describes the PCA. See the category codes for the complete list of categories.
Example: 1 |
| dataset | String |
A valid dataset/id.
Example: dataset/4f66a80803ce8940c5000006 |
|
default_numeric_value
optional |
String |
It accepts any of the following strings to substitute missing numeric values across all the numeric fields in the dataset: "mean", "median", "minimum", "maximum", "zero"
Example: "median" |
|
description
optional |
String |
A description of the PCA up to 8192 characters long.
Example: "This is a description of my new PCA" |
|
excluded_fields
optional |
Array, default is [], an empty list. None of the fields in the dataset is excluded. |
Specifies the fields that won't be included in the PCA.
Example:
|
|
fields
optional |
Object, default is {}, an empty dictionary. That is, no names or preferred statuses are changed. |
This can be used to change the names of the fields in the PCA with respect to the original names in the dataset or to tell BigML that certain fields should be preferred. Add an entry keyed with the field id generated in the source for each field whose name you want to update.
Example:
|
|
input_fields
optional |
Array, default is []. All the fields in the dataset |
Specifies the fields to be considered to create the PCA.
Example:
|
|
name
optional |
String, default is dataset's name |
The name you want to give to the new PCA.
Example: "my new PCA" |
|
out_of_bag
optional |
Boolean, default is false |
Setting this parameter to true will return a sequence of the out-of-bag instances instead of the sampled instances. See the Section on Sampling for more details.
Example: true |
|
pca_seed
optional |
String |
With a seed, the PCA is deterministic.
Example: "My Seed" |
|
project
optional |
String |
The project/id you want the PCA to belong to.
Example: "project/54d98718f0a5ea0b16000000" |
|
range
optional |
Array, default is [1, max rows in the dataset] |
The range of successive instances to build the PCA.
Example: [1, 150] |
|
replacement
optional |
Boolean, default is false |
Whether sampling should be performed with or without replacement. See the Section on Sampling for more details.
Example: true |
|
sample_rate
optional |
Float, default is 1.0 |
A real number between 0 and 1 specifying the sample rate. See the Section on Sampling for more details.
Example: 0.5 |
|
seed
optional |
String |
A string to be hashed to generate deterministic samples. See the Section on Sampling for more details.
Example: "MySample" |
|
standardized
optional |
Boolean, default is true |
Whether numeric inputs should be scaled to unit variance. Standardizing implies assigning equal importance to all variables. Otherwise, the inputs with higher variance will dominate the PCA result.
Example: false |
|
tags
optional |
Array of Strings |
A list of strings that help classify and index your PCA.
Example: ["best customers", "2018"] |
|
webhook
optional |
Object |
A webhook url and an optional secret phrase. See the Section on Webhooks for more details.
Example:
|
You can also use curl to customize a new PCA. For example, to create a new PCA named "my PCA", with only certain rows, and with only three fields:
curl "https://au.bigml.io/pca?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"dataset": "dataset/4f66a80803ce8940c5000006",
"input_fields": ["000001", "000002", "000003"],
"name": "my PCA",
"range": [25, 125]}'
> Creating a customized PCA
If you do not specify a name, BigML.io will assign to the new PCA the dataset's name. If you do not specify a range of instances, BigML.io will use all the instances in the dataset. If you do not specify any input fields, BigML.io will include all the preferred fields in the dataset.
Retrieving a PCA
Each PCA has a unique identifier in the form "pca/id" where id is a string of 24 alpha-numeric characters that you can use to retrieve the PCA.
To retrieve a pca with curl:
curl "https://au.bigml.io/pca/5bae775a4e1727e9f3000003?$BIGML_AUTH"
$ Retrieving a pca from the command line
You can also use your browser to visualize the pca using the full BigML.io URL or pasting the pca/id into the BigML.com.au dashboard.
PCA Properties
Once a pca has been successfully created it will have the following properties.
| Property | Type | Description |
|---|---|---|
|
category
filterable, sortable, updatable |
Integer | One of the categories in the table of categories that help classify this resource according to the domain of application. |
| code | Integer | HTTP status code. This will be 201 upon successful creation of the PCA and 200 afterwards. Check the code that comes with the status attribute to verify that the PCA creation has been completed without errors. |
|
columns
filterable, sortable |
Integer | The number of fields in the PCA. |
|
composites
filterable, sortable |
Array of Strings | The list of composite ids that reference this model. |
|
created
filterable, sortable |
ISO-8601 Datetime. | This is the date and time in which the PCA was created with microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are provided in Coordinated Universal Time (UTC). |
|
credits
filterable, sortable |
Float | The number of credits it cost you to create this PCA. |
|
credits_per_prediction
filterable, sortable, updatable |
Float | This is the number of credits that other users will consume to make a projection with your PCA if you made it public. |
|
dataset
filterable, sortable |
String | The dataset/id that was used to build the PCA. |
| dataset_field_types | Object | A dictionary that informs about the number of fields of each type in the dataset used to create the PCA. It has an entry per each field type (categorical, datetime, numeric, and text), an entry for preferred fields, and an entry for the total number of fields. |
|
dataset_status
filterable, sortable |
Boolean | Whether the dataset is still available or has been deleted. |
|
description
updatable |
String | A text describing the PCA. It can contain restricted markdown to decorate the text. |
| excluded_fields | Array | The list of fields's ids that were excluded to build the PCA. |
| fields_meta | Object | A dictionary with meta information about the fields dictionary. It specifies the total number of fields, the current offset, and limit, and the number of fields (count) returned. |
| input_fields | Array | The list of input fields' ids used to build the models of the PCA. |
| locale | String | The dataset's locale. |
|
max_columns
filterable, sortable |
Integer | The total number of fields in the dataset used to build the PCA. |
|
max_rows
filterable, sortable |
Integer | The maximum number of instances in the dataset that can be used to build the PCA. |
|
name
filterable, sortable, updatable |
String | The name of the PCA as you provided it or based on the name of the dataset by default. |
|
number_of_batchprojections
filterable, sortable |
Integer | The current number of batch projections that use this PCA. |
|
number_of_projections
filterable, sortable |
Integer | The current number of projections that use this PCA. |
|
out_of_bag
filterable, sortable |
Boolean | Whether the out-of-bag instances were used to create the PCA instead of the sampled instances. |
| pca | Object | All the information that you need to recreate or use the pca on your own. See here for more details. |
|
price
filterable, sortable, updatable |
Float | The price other users must pay to clone your PCA. |
|
private
filterable, sortable, updatable |
Boolean | Whether the PCA is public or not. |
|
project
filterable, sortable, updatable |
String | The project/id the resource belongs to. |
| range | Array | The range of instances used to build the PCA. |
|
replacement
filterable, sortable |
Boolean | Whether the instances sampled to build the PCA were selected using replacement or not. |
| resource | String | The PCA/id. |
|
rows
filterable, sortable |
Integer | The total number of instances used to build the PCA |
|
sample_rate
filterable, sortable |
Float | The sample rate used to select instances from the dataset to build the PCA. |
|
seed
filterable, sortable |
String | The string that was used to generate the sample. |
|
shared
filterable, sortable, updatable |
Boolean | Whether the PCA is shared using a private link or not. |
| shared_hash | String | The hash that gives access to this PCA if it has been shared using a private link. |
| sharing_key | String | The alternative key that gives read access to this PCA. |
|
size
filterable, sortable |
Integer | The number of bytes of the dataset that were used to create this PCA. |
|
source
filterable, sortable |
String | The source/id that was used to build the dataset. |
|
source_status
filterable, sortable |
Boolean | Whether the source is still available or has been deleted. |
| status | Object | A description of the status of the PCA. It includes a code, a message, and some extra information. See the table below. |
|
subscription
filterable, sortable |
Boolean | Whether the PCA was created using a subscription plan or not. |
|
tags
filterable, updatable |
Array of Strings | A list of user tags that can help classify and index this resource. |
|
updated
filterable, sortable |
ISO-8601 Datetime. | This is the date and time in which the PCA was updated with microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are provided in Coordinated Universal Time (UTC). |
| webhook | Object | A webhook url and an optional secret phrase. See the Section on Webhooks for more details. |
|
white_box
filterable, sortable |
Boolean | Whether the PCA is publicly shared as a white-box. |
PCA Status
Creating a PCA is a process that can take just a few seconds or a few days depending on the size of the dataset used as input and on the workload of BigML's systems. The PCA goes through a number of states until it is fully completed. Through the status field in the PCA you can determine when the PCA has been fully processed and is ready to be used to create projections. These are the properties that a PCA's status has:
| Property | Type | Description |
|---|---|---|
| code | Integer | A status code that reflects the status of the PCA creation. It can be any of those that are explained here. |
| elapsed | Integer | Number of milliseconds that BigML.io took to process the PCA. |
| message | String | A human readable message explaining the status. |
| progress | Float, between 0 and 1 | How far BigML.io has progressed building the PCA. |
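Status code 5 marks a finished resource. The sketch below shows a small helper for interpreting the status object; the two sample payloads are taken from the example responses in this section:

```python
# Interpret the status object that comes with a PCA resource.
# Code 5 means the resource has finished building; lower non-negative
# codes are in-progress states.
def is_finished(status):
    return status.get("code") == 5

def progress_percent(status):
    """Progress as a whole percentage, from the 0..1 progress field."""
    return round(100 * status.get("progress", 0))

queued = {"code": 1, "progress": 0,
          "message": "The pca creation request has been queued "
                     "and will be processed soon"}
done = {"code": 5, "elapsed": 2513, "progress": 1,
        "message": "The pca has been created"}

print(is_finished(queued), is_finished(done))  # False True
```

A client would typically retrieve the PCA repeatedly, with a short pause between requests, until `is_finished` returns true.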
Once a PCA has been successfully created, it will look like:
{
"category": 0,
"code": 200,
"columns": 5,
"configuration": null,
"configuration_status": false,
"created": "2018-12-10T18:06:58.245000",
"creator": "leon1",
"credits": 0,
"credits_per_prediction": 0,
"dataset": "dataset/5948be794e1727307a000000",
"dataset_field_types": {
"categorical": 2,
"datetime": 0,
"effective_fields": 5,
"items": 0,
"numeric": 3,
"preferred": 5,
"text": 0,
"total": 5
},
"dataset_status": true,
"description": "",
"excluded_fields": [],
"fields_meta": {
"count": 5,
"limit": 1000,
"offset": 0,
"query_total": 5,
"total": 5
},
"input_fields": [
"000000",
"000001",
"000002",
"000003",
"000004"
],
"locale": "en-us",
"max_columns": 5,
"max_rows": 150,
"name": "iris",
"name_options": "5 components, standardized",
"number_of_batchprojections": 0,
"number_of_projections": 0,
"number_of_public_projections": 0,
"ordering": 0,
"out_of_bag": false,
"pca": {
"components": [
[
0,
-0.59217,
0.97296,
0.95644,
-0.78601,
0.21895,
0.56706
],
[
0,
0.60076,
0.12575,
0.21775,
0.16589,
-0.71699,
0.55109
],
[
0,
0.53665,
0.13502,
0.13303,
-0.12852,
0.32951,
-0.201
],
[
0,
-0.0082,
-0.12084,
0.13414,
0.01049,
0.00398,
-0.01447
],
[
0,
0.01859,
-0.06848,
-0.04592,
-0.09581,
0.02323,
0.07258
]
],
"cumulative_variance": [
0.63817,
0.89138,
0.989,
0.99559,
1
],
"eigenvectors": [
[
0,
-0.33106,
0.54395,
0.53471,
-0.43943,
0.12241,
0.31703
],
[
0,
0.53321,
0.11161,
0.19327,
0.14724,
-0.63636,
0.48913
],
[
0,
0.7671,
0.193,
0.19015,
-0.18371,
0.47101,
-0.28731
],
[
0,
-0.04514,
-0.66522,
0.73843,
0.05774,
0.02191,
-0.07965
],
[
0,
0.12493,
-0.46031,
-0.30869,
-0.64405,
0.15617,
0.48788
]
],
"number_of_components": 5,
"pca_seed": "2c249dda00fbf54ab4cdd850532a584f286af5b6",
"standardized": true,
"text_stats": {},
"variance": [
0.63817,
0.25321,
0.09762,
0.00658,
0.00441
]
},
"price": 0,
"private": true,
"project": null,
"range": null,
"replacement": false,
"resource": "pca/5bae775a4e1727e9f3000003",
"rows": 150,
"sample_rate": 1,
"shared": false,
"size": 4199,
"source": "source/5948be694e17273079000000",
"source_status": true,
"status": {
"code": 5,
"elapsed": 2513,
"message": "The pca has been created",
"progress": 1
},
"subscription": true,
"tags": [],
"type": 0,
"updated": "2018-12-10T18:07:01.288000",
"white_box": false
}
< Example PCA JSON response
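As the example response above shows, cumulative_variance is the running sum of variance; the last entries differ only by rounding. A quick check in Python:

```python
# Verify that cumulative_variance is the running sum of variance,
# using the arrays from the example PCA response above.
from itertools import accumulate

variance = [0.63817, 0.25321, 0.09762, 0.00658, 0.00441]
cumulative_variance = [0.63817, 0.89138, 0.989, 0.99559, 1]

running = list(accumulate(variance))
for got, reported in zip(running, cumulative_variance):
    assert abs(got - reported) < 1e-4  # equal up to rounding
```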
The number of components discovered is 1 per numeric input field and n-1 per categorical input field with n categorical values. Therefore, when the input dataset contains a large number of fields, the resource can become very large. In order to make things more manageable, the following two HTTP GET parameters may be used to return a subset of the components.
- components_limit: The maximum number of components to return. Default is 1000.
- components_offset: The index of the first component to return.
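For instance, to request only two components starting at the second one (offsets are assumed here to be zero-based, as in fields_meta), the two parameters can be appended to the query string. The pca/id below is the example id used in this section, and BIGML_AUTH is assumed to be set as in the Authentication section:

```python
# Build a GET URL that returns at most 2 components starting at
# offset 1. The pca/id is the example id used in this section;
# BIGML_AUTH is assumed to hold "username=...;api_key=...".
import os

pca_id = "pca/5bae775a4e1727e9f3000003"
auth = os.environ.get("BIGML_AUTH", "username=...;api_key=...")
url = (f"https://au.bigml.io/{pca_id}?{auth}"
       ";components_limit=2;components_offset=1")
print(url)
```

The resulting URL can then be fetched with curl exactly as in the retrieval example below.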
Filtering and Paginating Fields from a PCA
A pca might be composed of hundreds or even thousands of fields. Thus when retrieving a pca, it's possible to specify that only a subset of fields be retrieved, by using any combination of the following parameters in the query string (unrecognized parameters are ignored):
Since fields is a map and therefore not ordered, the returned fields contain an additional key, order, whose integer (increasing) value gives you their ordering. In all other respects, the pca is the same as the one you would get without any of the filtering parameters above.
The fields_meta field can help you paginate fields; its structure can be seen in the example responses above.
Updating a PCA
To update a PCA, you need to PUT an object containing the fields that you want to update to the PCA's base URL. The content-type must always be "application/json". If the request succeeds, BigML.io will return an HTTP 202 response with the updated PCA.
For example, to update a PCA with a new name you can use curl like this:
curl "https://au.bigml.io/pca/5bae775a4e1727e9f3000003?$BIGML_AUTH" \
-X PUT \
-d '{"name": "a new name"}' \
-H 'content-type: application/json'
$ Updating a PCA's name
If you want to update a PCA with a new label and description for a specific field you can use curl like this:
curl "https://au.bigml.io/pca/5bae775a4e1727e9f3000003?$BIGML_AUTH" \
-X PUT \
-d '{"fields": {"000000": {"label": "a longer name", "description": "an even longer description"}}}' \
-H 'content-type: application/json'
$ Updating a PCA's field, label, and description
Deleting a PCA
To delete a PCA, you need to issue an HTTP DELETE request to the pca/id to be deleted.
Using curl you can do something like this to delete a pca:
curl -X DELETE "https://au.bigml.io/pca/5bae775a4e1727e9f3000003?$BIGML_AUTH"
$ Deleting a pca from the command line
If the request succeeds you will not see anything on the command line unless you executed the command in verbose mode. Successful DELETEs will return "204 no content" responses with no body.
Once you delete a pca, it is permanently deleted. That is, a delete request cannot be undone. If you try to delete a pca a second time, or a pca that does not exist, you will receive a "404 not found" response.
However, if you try to delete a pca that is being used at the moment, then BigML.io will not accept the request and will respond with a "400 bad request" response.
See this section for more details.
Listing PCAs
To list all your PCAs, you can use the pca base URL. By default, only the 20 most recent PCAs will be returned. You can see below how to change this number using the limit parameter.
You can get your list of PCAs directly in your browser using your own username and API key with the following link.
https://au.bigml.io/pca?$BIGML_AUTH
> Listing PCAs from a browser
Predictions
Last Updated: Tuesday, 2019-01-29 16:28
A prediction is created using a supervised model (model/id, ensemble/id, logisticregression/id, deepnet/id, or fusion/id) and the new instance (input_data) for which you wish to create a prediction.
When you create a new prediction using a model, BigML.io will automatically navigate the corresponding model to find the leaf node that best classifies the new instance. If you create a new prediction using an ensemble built with the bagging or random decision forests technique, the same process is repeated for each model in the ensemble, and all the predictions from the individual models are combined to return a final prediction using one of the strategies described below. If the ensemble uses the gradient tree boosting technique, the prediction result will be additive, meaning that each tree modifies the predictions of the previously grown trees. If you create a new prediction using a logistic regression, its coefficients will be used.
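For non-boosted classification ensembles, the simplest of those combination strategies is a plain majority vote. A rough sketch, with per-model outputs invented for illustration:

```python
# Combine per-model class predictions by majority vote, the default
# combination for non-boosted classification ensembles. The individual
# model outputs below are invented for illustration.
from collections import Counter

def majority_vote(predictions):
    """Return the class predicted by the most models."""
    return Counter(predictions).most_common(1)[0][0]

model_outputs = ["Iris-virginica", "Iris-virginica", "Iris-versicolor"]
print(majority_vote(model_outputs))  # Iris-virginica
```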
BigML.io allows you to create, retrieve, update, and delete your predictions. You can also list all of your predictions.
Jump to:
- Prediction Base URL
- Creating a Prediction
- Prediction Arguments
- Creating a Prediction Using a Filtered Model
- Retrieving a Prediction
- Prediction Properties
- Confidence
- Updating a Prediction
- Deleting a Prediction
- Listing Predictions
Prediction Base URL
You can use the following base URL to create, retrieve, update, and delete predictions. https://au.bigml.io/prediction
Prediction base URL
All requests to manage your predictions must use HTTPS and be authenticated using your username and API key to verify your identity. See this section for more details.
Creating a Prediction
To create a new prediction, you need to POST to the prediction base URL an object containing a supervised model id that you want to use to create the prediction. The content-type must always be "application/json".
POST /prediction?username=alfred;api_key=79138a622755a2383660347f895444b1eb927730; HTTP/1.1
Host: bigml.io
Content-Type: application/json
Creating prediction definition
curl "https://au.bigml.io/prediction?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"model": "model/50a2eac63c19200bd1000008", "input_data": {"000001": 3}}'
> Creating a prediction
BigML.io will return the newly created prediction if the request succeeded.
{
"category": 0,
"code": 201,
"created": "2012-11-15T02:44:54.492482",
"credits": 0.01,
"dataset": "dataset/50a453753c1920186d000045",
"dataset_status": true,
"description": "",
"fields": {
"000001": {
"column_number": 1,
"datatype": "double",
"name": "sepal width",
"optype": "numeric",
"order": 1,
"preferred": true
},
"000002": {
"column_number": 2,
"datatype": "double",
"name": "petal length",
"optype": "numeric",
"order": 2,
"preferred": true
},
"000004": {
"column_number": 4,
"datatype": "string",
"name": "species",
"optype": "categorical",
"order": 4,
"preferred": true
}
},
"input_data": {
"000001": 3
},
"locale": "en_US",
"model": "model/50a454503c1920186d000049",
"model_status": true,
"name": "Prediction for species",
"objective_fields": [
"000004"
],
"prediction": {
"000004": "Iris-virginica"
},
"prediction_path": {
"bad_fields": [],
"confidence": 0.26289,
"next_predicates": [
{
"count": 50,
"field": "000002",
"operator": "<=",
"value": 2.45
},
{
"count": 100,
"field": "000002",
"operator": ">",
"value": 2.45
}
],
"node_id": 16,
"objective_summary": {
"categories": [
[
"Iris-versicolor",
50
],
[
"Iris-setosa",
50
],
[
"Iris-virginica",
50
]
]
},
"path": [],
"unknown_fields": []
},
"private": true,
"project": null,
"resource": "prediction/50a457263c1920186d00004d",
"source": "source/50a4527b3c1920186d000041",
"source_status": true,
"status": {
"bad_fields": [],
"code": 5,
"message": "The prediction has been created",
"unknown_fields": []
},
"tags": [],
"task": "classification",
"updated": "2012-11-15T02:44:54.492500"
}
< Example prediction JSON response
Prediction Arguments
In addition to the model id and the input_data, you can also POST the following arguments.
| Argument | Type | Description |
|---|---|---|
|
category
optional |
Integer, default is the category of the model |
The category that best describes the prediction. See the category codes for the complete list of categories.
Example: 1 |
|
combiner
optional |
Integer |
Specifies the method that should be used to combine predictions when a non-boosted ensemble is used. Note that if operating_kind or operating_point is present, combiner will be ignored. For classification ensembles, the combination is made by majority vote. The options are:
Example: 1 DEPRECATED |
|
deepnet
optional |
String |
A valid deepnet/id.
Example: deepnet/55efc3564e1727d635000102 |
|
description
optional |
String |
A description of the prediction up to 8192 characters long.
Example: "This is a description of my new prediction" |
|
ensemble
optional |
String |
A valid ensemble/id.
Example: ensemble/517020d53c1920a514000056 |
|
explain
optional |
Boolean, default is false |
Whether you want to request an explanation for the prediction.
Example: true |
|
fusion
optional |
String |
A valid fusion/id.
Example: fusion/5948be694e17273079000000 |
| input_data | Object |
An object with field's id/value or name/value pairs representing the instance you want to create a prediction for. For a logistic regression, input data for all numerical fields except the objective field must be provided. Missing input_data for categorical or text fields, or invalid values for categorical fields, or an empty value for text fields will be handled as missing values.
Example: {"000000": 5, "000001": 3} |
|
logisticregression
optional |
String |
A valid logisticregression/id.
Example: logisticregression/55efc3564e1727d635000004 |
|
missing_strategy
optional |
Integer, default is 0 |
Specifies the method that should be used when a missing split is found for a model, ensemble or fusion. That is, when a missing value is found in the input data for a decision node. The options are:
Example: 1 |
|
model
optional |
String |
A valid model id of the supported supervised model. Alternatively, you can use ensemble, logisticregression, deepnet or fusion arguments.
Example: model/4f67c0ee03ce89c74a000006 |
|
name
optional |
String, default is Prediction for model's name |
The name you want to give to the new prediction.
Example: "my new prediction" |
|
operating_kind
optional |
String, default is "probability" |
The operating kind to perform the prediction. It also replaces combiner and its value can be confidence, probability, and for non-boosted ensembles, votes. Note that operating_point can override its value. See operating_point for more information.
Example: "confidence" |
|
operating_point
optional |
Object |
The specification of an operating point for classification problems, which consists of a positive_class (one of the categories of the model's objective field), a threshold (a number between 0 and 1), and an optional field kind (confidence, probability, and, for non-boosted ensembles, votes). When it is present, BigML will predict the positive_class if its probability, confidence, or votes (depending on the kind) is greater than the threshold set. Otherwise, BigML will predict the class with the highest probability, confidence, or votes. confidence and probability will yield the same results for boosted ensembles, deepnets, fusions, and logistic regressions. For the votes kind, the threshold specifies the ratio of models in the ensemble predicting the positive_class. Note that operating_point takes precedence over combiner and threshold, so they will be ignored if provided. It also takes precedence over operating_kind if kind is provided. When neither is present, the default value probability will be used for kind. Example:
|
|
private
optional |
Boolean, default is true |
Whether you want your prediction to be private or not.
Example: false |
|
project
optional |
String |
The project/id you want the prediction to belong to.
Example: "project/54d98718f0a5ea0b16000000" |
|
tags
optional |
Array of Strings |
A list of strings that help classify and index your prediction.
Example: ["best customers", "2018"] |
|
threshold
optional |
Object |
A dictionary with two optional keys for a model or non-boosted ensemble:
Note that their use is deprecated, and maintained only for backwards compatibility. Instead use an operating_point of kind votes. Example: {"k": 2, "class": "attack"} DEPRECATED |
|
vote_count
optional |
Float |
A number between 0 and 1, the fraction of trees that voted for the predicted class. Applicable only when operating_kind is votes or kind of operating_point is votes.
Example: 0.5 |
|
vote_counts
optional |
Array of Arrays |
An array of string-float pairs with each class and the corresponding fraction, between 0 and 1, of trees voting for it as elements. Applicable only when operating_kind is votes or kind of operating_point is votes.
Example: [["Female", 0.3], ["Other", 0.2]] |
|
webhook
optional |
Object |
A webhook url and an optional secret phrase. See the Section on Webhooks for more details.
Example:
|
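The operating_point behavior described in the table can be sketched as follows; the probabilities are invented and BigML's exact tie-breaking may differ:

```python
# Sketch of operating-point classification: predict the positive class
# when its probability exceeds the threshold, otherwise fall back to
# the most probable class. The probabilities below are invented.
def predict_with_operating_point(probabilities, positive_class, threshold):
    if probabilities.get(positive_class, 0) > threshold:
        return positive_class
    return max(probabilities, key=probabilities.get)

probs = {"Iris-setosa": 0.15, "Iris-versicolor": 0.55, "Iris-virginica": 0.30}
# With a low threshold, the positive class wins even though it is not
# the most probable class:
print(predict_with_operating_point(probs, "Iris-virginica", 0.25))
# With a higher threshold, the prediction falls back to the most
# probable class:
print(predict_with_operating_point(probs, "Iris-virginica", 0.50))
```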
You can use curl to customize new predictions. For example, to create a new prediction named "my prediction":
curl "https://au.bigml.io/prediction?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"input_data": {"000001": 3}, "model": "model/4f67c0ee03ce89c74a000006", "name": "my prediction", "tags":["my", "tags"]}'
> Creating a customized prediction
If you do not specify a name, BigML.io will assign to the new prediction a name based on the model's name or objective field.
Creating a Prediction Using a Filtered Model
It is possible to create a prediction using a filtered decision tree model by specifying filter parameters in the query string of the request. Two useful parameters are support and value, as described in the Filtering a Model section. For instance, the following will perform a prediction using a model filtered to include only nodes with the given support, with all its leaf values within the [2,23] interval.
curl "https://au.bigml.io/prediction?$BIGML_AUTH;support=0.5;value=[2,23]" \
-X POST \
-H 'content-type: application/json' \
-d '{"input_data": {"000001": 3}, "model": "model/4f67c0ee03ce89c74a000006", "name": "my prediction"}'
Filter Example
Retrieving a Prediction
Each prediction has a unique identifier in the form "prediction/id" where id is a string of 24 alpha-numeric characters that you can use to retrieve the prediction.
To retrieve a prediction with curl:
curl "https://au.bigml.io/prediction/50a457263c1920186d00004d?$BIGML_AUTH"
$ Retrieving a prediction from the command line
You can also use your browser to visualize the prediction using the full BigML.io URL or pasting the prediction/id into the BigML.com.au dashboard.
Prediction Properties
Once a prediction has been successfully created it will have the following properties.
| Property | Type | Description |
|---|---|---|
|
boosted_ensemble
filterable, sortable |
Boolean | Whether the prediction was built with an ensemble with boosted trees. |
|
category
filterable, sortable, updatable |
Integer | One of the categories in the table of categories that help classify this resource according to the domain of application. |
| code | Integer | HTTP status code. This will be 201 upon successful creation of the prediction and 200 afterwards. Check the code that comes with the status attribute to verify that the prediction creation has been completed without errors. |
| combiner | Integer | The method used to combine predictions from the non-boosted ensemble. See the available combiners above. DEPRECATED |
| confidence | Float | For classification models, a number between 0 and 1 that expresses how certain the model is of the prediction. For regression models, a number mapped to the top end of a 95% confidence interval around the expected error at that node (measured using the variance of the output at the node). However, for logistic regressions, it really means probability, and thus confidence will be deprecated soon. Note that this property is not available for ensembles with boosted trees. |
| confidences | Array of Arrays |
An array of confidence pairs for each category in the objective field. Available for classification tasks only. Note that for logistic regressions and ensembles with boosted trees, it takes the same numeric values as probabilities.
Example:
|
|
created
filterable, sortable |
ISO-8601 Datetime. | This is the date and time in which the prediction was created with microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are provided in Coordinated Universal Time (UTC). |
|
credits
filterable, sortable |
Float | The number of credits it cost you to create this prediction. |
|
dataset
filterable, sortable |
String | The dataset/id that was used to build the prediction. |
|
dataset_status
filterable, sortable |
Boolean | Whether the dataset is still available or has been deleted. |
|
deepnet
filterable, sortable |
String | The deepnet/id that was used to create the prediction. |
|
description
updatable |
String | A text describing the prediction. It can contain restricted markdown to decorate the text. |
|
ensemble
filterable, sortable |
String | The ensemble/id that was used to create the prediction. |
| error_predictions | Integer | The number of predictions in the ensemble that failed. |
| explanation | Array | An array with a series of predicates, in the same format as those under the path or next_predicates key, with the exception of the added importance key that specifies the relative importance of each predicate in the list. The list of predicates has loosely the same semantics as the path: creating an instance that adheres to that list of rules leads to predictions by the model that are similar to the prediction on the given input data, and conversely, instances that do not conform to the given rules are likely to lead to a different prediction by the model. |
| fields | Object | A dictionary with an entry per field in the input_data or prediction_path. Each entry includes the column number in original source, the name of the field, the type of the field, and the specific datatype. Not available for ensembles with boosted trees. |
| finished_predictions | Integer | The number of predictions in the ensemble that succeeded. |
|
fusion
filterable, sortable |
String | The fusion/id that was used to create the prediction. |
| input_data | Object | The dictionary of input fields' ids and values used as input for the prediction. |
| locale | String | The dataset's locale. |
|
logisticregression
filterable, sortable |
String | The logisticregression/id that was used to create the prediction. |
| missing_strategy |
Integer, default is 0 |
Specifies the strategy that a model (or the models in an ensemble or fusion) will follow when a value needed to continue inference is missing. The possible values are:
|
|
model
filterable, sortable |
String | The model/id that was used to create the prediction. |
|
model_status
filterable, sortable |
Boolean | Whether the model is still available or has been deleted. |
| model_type | Integer |
|
| models | Array | A list of the model/id that compose the ensemble. |
|
name
filterable, sortable, updatable |
String | The name of the prediction as provided by you, or based on the objective field's name by default. |
| number_of_models | Integer | The number of models in the ensemble. |
| objective_field | String | The id of the field that the model predicts. |
| objective_field_name | String | The name of the objective field in the model. |
|
objective_fields
filterable, sortable |
Array |
Specifies the ids of the fields that the model predicts. Even though this is an array, the current version of BigML.io only accepts one objective field.
Example: ["000004"] |
| operating_kind | String | The operating kind to perform the prediction. See operating_kind above for more information. |
| operating_point | Object | The specification of an operating point for classification problems to perform the prediction. See operating_point above for more information. |
| output | Number or String | The actual prediction: a string if the task is classification, a number if the task is regression. |
|
prediction
filterable, sortable |
Object |
A dictionary keyed with the objective field to get the prediction output for the model.
Example: {"000004": "Iris-virginica"} |
| prediction_path | Object | A Prediction Path Object specifying the decision path of the model followed to make the prediction, the next predicates, and lists of unknown fields and bad fields submitted. |
| predictions | Array |
An array with a prediction object for each model in the non-boosted ensemble or fusion. The prediction object includes:
|
|
private
filterable, sortable, updatable |
Boolean | Whether the prediction is public or not. |
| probabilities | Array of Arrays |
An array of probability pairs for each category in the objective field. Available for classification tasks only.
Example:
|
| probability | Float | The probability of the winning class for classification tasks. For logistic regressions, note that this value has also been called confidence; however, confidence will be deprecated soon and only this property will be supported in the future. |
|
project
filterable, sortable, updatable |
String | The project/id the resource belongs to. |
| query_string | String | The query string that was used to filter the model. |
| resource | String | The prediction/id. |
|
source
filterable, sortable |
String | The source/id that was used to build the dataset. |
|
source_status
filterable, sortable |
Boolean | Whether the source is still available or has been deleted. |
| status | Object | A description of the status of the prediction. It includes a code, a message, and some extra information. See the table below. |
|
subscription
filterable, sortable |
Boolean | Whether the prediction was created using a subscription plan or not. |
|
tags
filterable, updatable |
Array of Strings | A list of user tags that can help classify and index this resource. |
| task | String | Either classification or regression depending on whether the objective field is categorical or numeric. |
| threshold | Object | The parameters (k and class) given when a threshold-based combiner is used for the non-boosted ensemble. DEPRECATED |
| tlp | Integer | The tlp used to compute the predictions of the ensemble. DEPRECATED |
| total_count | Integer | The total number of instances used to build the model. |
|
updated
filterable, sortable |
ISO-8601 Datetime. | This is the date and time in which the prediction was updated with microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are provided in Coordinated Universal Time (UTC). |
| vote_count | Float | The fraction of trees that voted for the predicted class. See vote_count above for more information. |
| vote_counts | Array of Arrays | An array of pairs with each class and the corresponding fraction of trees voting for it as elements. See vote_counts above for more information. |
| webhook | Object | A webhook url and an optional secret phrase. See the Section on Webhooks for more details. |
A Prediction Path Object has the following properties:
Confidence
The confidence field provides a measure of how certain the model is of the prediction.
For classification models, confidence is the lower end of a binomial-style confidence interval, where 1 indicates absolute certainty and 0 indicates no better than a random guess. Technically, it is the lower bound of the Wilson score confidence interval for a Bernoulli parameter. Read how it works in layman's terms here.
For regression models, confidence is the upper end of a confidence interval around the expected error for that prediction.
For more detailed information about the distribution of the target for the given instance, the objective_summary field provides a histogram of possible target values for the instance. The default prediction is the mean or mode of this distribution (for regression and classification, respectively), but one could use this distribution to make more sophisticated choices such as classification according to a specific loss function.
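As a sketch of the Wilson lower bound described above, the computation can be worked out with awk. The node counts here (9 of 10 training instances in the majority class) are made up purely for illustration, with z = 1.96 for a 95% interval:

```shell
# Wilson score lower bound for a Bernoulli parameter (z = 1.96, i.e. ~95%).
# p and n below are hypothetical node statistics, not from any real model.
LOWER=$(awk 'BEGIN {
  p = 0.9; n = 10; z = 1.96
  lower = (p + z*z/(2*n) - z*sqrt(p*(1-p)/n + z*z/(4*n*n))) / (1 + z*z/n)
  printf "%.3f", lower
}')
echo "$LOWER"
```

Note how the bound (about 0.596) is well below the raw proportion of 0.9: with only 10 instances at the node, the model hedges its certainty.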
Predicate Objects have the following properties:
Prediction Status
Creating a prediction is a near real-time process that takes just a few seconds, depending on whether the corresponding model has been used recently and the workload of BigML's systems. The prediction goes through a number of states until it is fully completed. Through the status field in the prediction you can determine when the prediction has been fully processed and is ready to be used. Most of the time, predictions are fully processed and the output returned in the first call. These are the properties that a prediction's status has:
| Property | Type | Description |
|---|---|---|
| bad_fields | Array | An array of fields' ids with wrong values submitted to build the model or logistic regression. Bad fields are ignored. That is, if you submit a wrong value, a prediction is created anyway, ignoring the input field with the wrong value. |
| code | Integer | A status code that reflects the status of the prediction creation. It can be any of those that are explained here. |
| message | String | A human readable message explaining the status. |
| unknown_fields | Array | An array of fields' ids that were submitted in the input_data to build the model or logistic regression and were not recognized. Unknown fields are ignored. That is, if you submit an unrecognized field, a prediction is created anyway, ignoring that input field. |
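As a sketch, the status code can be pulled out of a retrieved prediction with standard shell tools. The JSON below is a trimmed, hypothetical response used only to illustrate the extraction:

```shell
# Extract the status code from a (trimmed, hypothetical) prediction response.
RESPONSE='{"resource": "prediction/50a457263c1920186d00004d", "status": {"code": 5, "message": "The prediction has been created"}}'
CODE=$(printf '%s' "$RESPONSE" | sed -n 's/.*"status": {"code": \([0-9-]*\).*/\1/p')
echo "$CODE"
```

In a polling loop you would re-fetch the prediction until the code reaches the finished state described in the status codes section.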
Updating a Prediction
To update a prediction, you need to PUT an object containing the fields that you want to update to the prediction's base URL. The content-type must always be: "application/json". If the request succeeds, BigML.io will return an HTTP 202 response with the updated prediction.
For example, to update a prediction with a new name you can use curl like this:
curl "https://au.bigml.io/prediction/50a457263c1920186d00004d?$BIGML_AUTH" \
-X PUT \
-d '{"name": "a new name"}' \
-H 'content-type: application/json'
$ Updating a prediction's name
Deleting a Prediction
To delete a prediction, you need to issue an HTTP DELETE request to the prediction/id to be deleted.
Using curl you can do something like this to delete a prediction:
curl -X DELETE "https://au.bigml.io/prediction/50a457263c1920186d00004d?$BIGML_AUTH"
$ Deleting a prediction from the command line
If the request succeeds you will not see anything on the command line unless you executed the command in verbose mode. Successful DELETEs will return "204 no content" responses with no body.
Once you delete a prediction, it is permanently deleted. That is, a delete request cannot be undone. If you try to delete a prediction a second time, or a prediction that does not exist, you will receive a "404 not found" response.
However, if you try to delete a prediction that is being used at the moment, then BigML.io will not accept the request and will respond with a "400 bad request" response.
See this section for more details.
Listing Predictions
To list all the predictions, you can use the prediction base URL. By default, only the 20 most recent predictions will be returned. You can see below how to change this number using the limit parameter.
You can get your list of predictions directly in your browser using your own username and API key with the following links.
https://au.bigml.io/prediction?$BIGML_AUTH
> Listing predictions from a browser
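The limit parameter mentioned above is appended to the query string when you want more than the default 20 results. A sketch, reusing the example credentials that appear elsewhere in this documentation:

```shell
# List up to 50 predictions instead of the default 20 via the limit parameter.
BIGML_AUTH="username=alfred;api_key=79138a622755a2383660347f895444b1eb927730"  # example credentials
URL="https://au.bigml.io/prediction?${BIGML_AUTH};limit=50"
echo "$URL"
# fetch the listing with: curl "$URL"
```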
Batch Predictions
Last Updated: Tuesday, 2019-01-29 16:28
A batch prediction provides an easy way to compute a prediction for each instance in a dataset in only one request. To create a new batch prediction you need a supervised model (model/id, ensemble/id, logisticregression/id, deepnet/id, or fusion/id) and a dataset/id.
Batch predictions are created asynchronously. You can retrieve the associated resource to check the progress and status in a similar fashion to the rest of BigML.io resources. Additionally, once a batch prediction is finished, you can also download a csv file that contains all the predictions by just appending "/download" to the batch prediction URL. You can also set output_dataset to true to automatically generate a new dataset with the results.
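As a sketch, the download URL is built by appending "/download" to the batch prediction URL; the id below is hypothetical:

```shell
# Build the download URL by appending "/download" to the batch prediction URL.
BATCH_ID="batchprediction/525770a63c1920e3f3000000"  # hypothetical id
DOWNLOAD_URL="https://au.bigml.io/${BATCH_ID}/download"
echo "$DOWNLOAD_URL"
# once the batch prediction is finished, fetch the csv with:
# curl "${DOWNLOAD_URL}?$BIGML_AUTH" -o predictions.csv
```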
BigML.io gives you a number of options to tailor the format of the csv file containing the predictions. For example, you can set up the "separator" (e.g., ";"), whether the file should have a "header" or not, or whether the "confidence" for each prediction should also appear together with each prediction. You can read about all the available options below.
BigML.io allows you to create, retrieve, update, and delete your batch predictions. You can also list all of your batch predictions.
Jump to:
- Batch Prediction Base URL
- Creating a Batch Prediction
- Batch Prediction Arguments
- Retrieving a Batch Prediction
- Batch Prediction Properties
- Updating a Batch Prediction
- Deleting a Batch Prediction
- Listing Batch Predictions
Batch Prediction Base URL
You can use the following base URL to create, retrieve, update, and delete batch predictions. https://au.bigml.io/batchprediction
Batch Prediction base URL
All requests to manage your batch predictions must use HTTPS and be authenticated using your username and API key to verify your identity. See this section for more details.
Creating a Batch Prediction
To create a new batch prediction, you need to POST to the batch prediction base URL an object containing a supervised model id that you want to use to make predictions and the dataset/id of the dataset that contains the input data that will be used to create predictions. BigML.io will create a prediction for each instance in that dataset. The content-type must always be "application/json".
POST /batchprediction?username=alfred;api_key=79138a622755a2383660347f895444b1eb927730; HTTP/1.1
Host: bigml.io
Content-Type: application/json
Creating batch prediction definition
curl "https://au.bigml.io/batchprediction?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"ensemble": "ensemble/52465ac93c19205051000000", "dataset": "dataset/5250ae133c192075b800000c"}'
> Creating a batch prediction
BigML.io will return the newly created batch prediction if the request succeeded.
{
"all_fields": false,
"category": 0,
"code": 201,
"combiner": 0,
"confidence": false,
"created": "2013-10-11T03:29:42.317248",
"credits": 2000.0,
"dataset": "dataset/5250ae133c192075b800000c",
"dataset_status": true,
"description": "",
"ensemble": "ensemble/52465ac93c19205051000000",
"fields_map": {},
"header": true,
"importance": false,
"locale": "en-US",
"missing_strategy": 0,
"model_status": true,
"model_type": 1,
"name": "Batch Prediction of Churn model with New Customers dataset",
"newline": "LF",
"number_of_models": 10,
"output_fields": [],
"private": true,
"project": null,
"resource": "batchprediction/525770a63c1920e3f3000000",
"result": {},
"rows": 20000,
"separator": ",",
"shared": false,
"size": 368120,
"source_status": true,
"status": {
"code": 1,
"message": "The batch prediction is being processed and will be performed soon"
},
"subscription": false,
"tags": [],
"type": 0,
"updated": "2013-10-11T03:29:42.317286",
"votes": false
}
< Example batch prediction JSON response
Batch Prediction Arguments
In addition to the model id and the dataset, you can also POST the following arguments.
| Argument | Type | Description |
|---|---|---|
|
all_fields
optional |
Boolean, default is false |
Whether all the fields from the dataset should be part of the generated csv file together with the predictions.
Example: true |
|
category
optional |
Integer, default is the category of the model |
The category that best describes the batch prediction. See the category codes for the complete list of categories.
Example: 1 |
|
combiner
optional |
Integer |
Specifies the method that should be used to combine predictions when a non-boosted ensemble is used. Note that if operating_kind or operating_point is present, combiner will be ignored. For classification ensembles, the combination is made by majority vote. The options are:
Example: 1 DEPRECATED |
|
confidence
optional |
Boolean, default is false |
Whether the confidence for each prediction for the model or non-boosted ensemble should be added to the csv file. For logistic regressions, it is accepted but deprecated in favor of probability.
Example: true |
|
confidence_name
optional |
String |
The name of the column in the header of the generated file containing the confidence when a model or non-boosted ensemble is used to create the batch prediction. For logistic regressions, it is accepted but deprecated in favor of probability_name. Note that it will only have effect if header is true.
Example: "Confidence" |
|
confidence_threshold
optional |
Float |
A number between 0 and 1 that can be used with classification models or non-boosted ensembles so that they only return the positive_class when the confidence on the prediction is above the established threshold. When a positive_class is not provided, it will default to the majority class. When the confidence is below the threshold, the prediction returned will be the negative_class. If a negative class is not provided, then the minority class will be returned. When the prediction is overridden, the new confidence returned will be 1 unless specified differently using negative_class_confidence.
Example: 0.7 DEPRECATED |
|
confidences
optional |
Boolean, default is false |
Whether to include a column per class with its corresponding confidence for the batch prediction for the classification task. Note that for logistic regressions and ensembles with boosted trees, it takes the same numeric values as probabilities.
Example: true |
| dataset | String |
A valid dataset/id.
Example: dataset/4f66a80803ce8940c5000006 |
|
deepnet
optional |
String |
A valid deepnet/id.
Example: deepnet/55efc3564e1727d635000102 |
|
default_numeric_value
optional |
String |
It accepts any of the following strings to substitute missing numeric values across all the numeric fields in the dataset: "mean", "median", "minimum", "maximum", "zero"
Example: "median" |
|
description
optional |
String |
A description of the batch prediction up to 8192 characters long.
Example: "This is a description of my new batch prediction" |
|
ensemble
optional |
String |
A valid ensemble/id.
Example: ensemble/517020d53c1920a514000056 |
|
excluded_fields
optional |
Array, default is [], an empty list. None of the fields in the dataset |
Specifies the fields in the dataset to be excluded to create the batch prediction.
Example:
|
|
fields_map
optional |
Object |
A dictionary of identifiers of the fields to use from the predictions under test mapped to their corresponding identifiers in the input dataset.
Example: {"000000":"00000a", "000001":"000002", "000002":"000001", "000003":"000020", "000004":"000004"} |
|
fusion
optional |
String |
A valid fusion/id.
Example: fusion/5948be694e17273079000000 |
|
header
optional |
Boolean, default is true |
Whether the csv file should have a header with the name of each field.
Example: true |
|
importance
optional |
Boolean, default is false |
Whether to include a column for each of the field importances for model, ensemble, and fusion predictions. That will add a column per field, named "<field_name> importance".
Example: true |
|
input_fields
optional |
Array, default is []. All the fields in the dataset |
Specifies the fields in the dataset to be considered to create the batch prediction.
Example:
|
|
logisticregression
optional |
String |
A valid logisticregression/id.
Example: logisticregression/55efc3564e1727d635000004 |
|
missing_strategy
optional |
Integer, default is 0 |
Specifies the method that should be used for the model or ensemble when a missing split is found. That is, when a missing value is found in the input data for a decision node. The options are:
Example: 1 |
|
model
optional |
String |
A valid model/id of a supported supervised model. Alternatively, you can use the ensemble, logisticregression, deepnet, or fusion arguments.
Example: model/4f67c0ee03ce89c74a000006 |
|
name
optional |
String, default is dataset's name |
The name you want to give to the new batch prediction.
Example: "my new batch prediction" |
|
negative_class
optional |
String |
The class that will be returned when a confidence threshold is used and the threshold is not reached for the model or ensemble.
Example: false DEPRECATED |
|
negative_class_confidence
optional |
Float |
A number between 0 and 1 that will be returned as the confidence for predictions that are overridden when a confidence_threshold is used for the model or non-boosted ensemble.
Example: 0.7 DEPRECATED |
|
negative_class_probability
optional |
Float |
A number between 0 and 1 that will be returned as the probability for predictions that are overridden when a probability_threshold is used for the boosted trees.
Example: 0.7 DEPRECATED |
|
newline
optional |
String, default is "LF" |
The new line character that you want to get as line break in the generated csv file: "LF", "CRLF".
Example: "CRLF" |
|
node_id
optional |
Boolean, default is false |
Whether a column containing a unique id for the final tree node when making a prediction should be added. This is currently only available for single tree models, and the ID may be null when missing_strategy is 1 (proportional). If enabled, the column is added last.
Example: true |
|
operating_kind
optional |
String, default is "probability" |
The operating kind to perform the prediction. It also replaces combiner and its value can be confidence, probability, and for non-boosted ensembles, votes. Note that operating_point can override its value. See operating_point for more information.
Example: "confidence" |
|
operating_point
optional |
Object |
The specification of an operating point for classification problems to perform the prediction. It consists of a positive_class (one of the categories of the model's objective field), a threshold (a number between 0 and 1), and an optional field kind (confidence, probability, and, for non-boosted ensembles, votes). When it is present, BigML will predict the positive_class if its probability, confidence, or votes (depending on the kind) is greater than the threshold set. Otherwise, BigML will predict the class with the highest probability, confidence, or votes. confidence and probability will yield the same results for boosted ensembles, deepnets, and logistic regressions. For the votes kind, the threshold specifies the ratio of models in the ensemble predicting the positive_class. Note that operating_point takes precedence over combiner and threshold, so they will be ignored if provided. It also takes precedence over operating_kind if kind is provided. When neither is present, the default value probability will be used for kind. Example:
|
|
output_dataset
optional |
Boolean, default is false |
Whether a dataset with the results should be automatically created or not.
Example: true |
|
output_fields
optional |
Array, default is []. None of the fields in the dataset |
Specifies the fields to be included in the csv file. It can be a list of field ids or names.
Example:
|
|
positive_class
optional |
String |
The class that will be considered when a confidence threshold is used for the model or ensemble.
Example: false DEPRECATED |
|
prediction_name
optional |
String |
The name of the column in the header of the generated file for the prediction. It will only have effect if header is true.
Example: "Prediction" |
|
probabilities
optional |
Boolean, default is false |
Whether to include the predicted class and all other possible class values for the batch prediction for the classification task.
Example: true |
|
probability
optional |
Boolean, default is false |
Whether the probability for each prediction for the classification task should be added.
Example: true |
|
probability_name
optional |
String |
The name of the column in the header of the generated file containing the probability for the classification task. If probability is set to true, then probability_name will set the column header. Note that it will only have effect if header is true
Example: "Probability" |
|
probability_threshold
optional |
Float |
A number between 0 and 1 that can be used for any classification model, ensemble, and logistic regression. The positive_class will only be predicted when the probability of the prediction for that class is above the established threshold. If a positive_class is not provided, it will default to the majority class. When the probability is below the threshold for the positive_class, the class with the next highest probability will be predicted instead.
Example: 0.7 DEPRECATED |
|
project
optional |
String |
The project/id you want the batch prediction to belong to.
Example: "project/54d98718f0a5ea0b16000000" |
|
separator
optional |
Char, default is "," |
The separator that you want to get between fields in the generated csv file.
Example: ";" |
|
tags
optional |
Array of Strings |
A list of strings that help classify and index your batch prediction.
Example: ["best customers", "2018"] |
|
threshold
optional |
Object |
A dictionary with two optional keys for a model or non-boosted ensemble:
Example: {"k": 2, "class": "attack"} DEPRECATED |
|
vote_count
optional |
Boolean, default is false |
Whether the vote_count, the fraction of models in the ensemble voting for the given prediction, should be added for each prediction in the classification ensemble task.
Example: true |
|
vote_count_name
optional |
String, default is votes |
The name of the column in the header of the generated file containing the vote_count for the ensemble classification task. If vote_count is set to true, then vote_count_name is used as the prefix for naming those columns. Note that it will only have effect if header is true.
Example: "Votes" |
|
vote_counts
optional |
Boolean, default is false |
Whether the vote_counts, a list of fractions of votes (one per category), should be added to the batch prediction for the ensemble classification task.
Example: true |
|
votes
optional |
Boolean, default is false |
Whether to include a column for each of the individual model predictions for the non-boosted ensemble. That will add a column per model, named <prediction_name>_n where n is the position of the model in the model list in the ensemble, starting at 1.
Example: true |
|
webhook
optional |
Object |
A webhook url and an optional secret phrase. See the Section on Webhooks for more details.
Example:
|
You can also use curl to customize a new batch prediction. For example, to create a new batch prediction named "my batch prediction" that will not include a header and will only output the field "000001" together with the confidence for each prediction:
curl "https://au.bigml.io/batchprediction?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"dataset": "dataset/50614ed53c192043ea00000c", "model": "model/50614edb3c192043ea000010", "name": "my batch prediction", "header": false, "output_fields": ["000001"], "confidence": true}'
> Creating a customized batch prediction
If you do not specify a name, BigML.io will assign to the new batch prediction a combination of the dataset's name and the model's name. If you do not specify any fields_map, BigML.io will use a direct map of all the fields in the dataset.
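Putting some of the arguments above together, here is a sketch of a request body that adds an operating_point. The ids reuse those from the curl example above; the positive class and threshold are purely illustrative:

```shell
# Request body combining a model and dataset with an operating_point.
# The positive class "Iris-virginica" and the 0.8 threshold are illustrative.
PAYLOAD='{"model": "model/50614edb3c192043ea000010",
  "dataset": "dataset/50614ed53c192043ea00000c",
  "operating_point": {"kind": "probability", "positive_class": "Iris-virginica", "threshold": 0.8}}'
echo "$PAYLOAD"
# POST it with:
# curl "https://au.bigml.io/batchprediction?$BIGML_AUTH" \
#   -X POST -H 'content-type: application/json' -d "$PAYLOAD"
```

With this body, rows are predicted as "Iris-virginica" only when that class's probability exceeds 0.8; otherwise the class with the highest probability is predicted.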
Retrieving a Batch Prediction
Each batch prediction has a unique identifier in the form "batchprediction/id" where id is a string of 24 alpha-numeric characters that you can use to retrieve the batch prediction.
To retrieve a batch prediction with curl:
curl "https://au.bigml.io/batchprediction/525770a63c1920e3f3000000?$BIGML_AUTH"
$ Retrieving a batch prediction from the command line
You can also use your browser to visualize the batch prediction using the full BigML.io URL or pasting the batchprediction/id into the BigML.com.au dashboard.
Batch Prediction Properties
Once a batch prediction has been successfully created it will have the following properties.
| Property | Type | Description |
|---|---|---|
|
all_fields
filterable, sortable |
Boolean | Whether the batch prediction contains all the fields in the dataset used as an input. |
|
boosted_ensemble
filterable, sortable |
Boolean | Whether the prediction was built with an ensemble with boosted trees. |
|
category
filterable, sortable, updatable |
Integer | One of the categories in the table of categories that help classify this resource according to the domain of application. |
| code | Integer | HTTP status code. This will be 201 upon successful creation of the batch prediction and 200 afterwards. Make sure that you check the code that comes with the status attribute to make sure that the batch prediction creation has been completed without errors. |
| combiner | Integer | The method used to combine predictions from the non-boosted ensemble. See the available combiners above. DEPRECATED |
|
confidence
filterable, sortable |
Boolean | Whether the confidence for each prediction was added to the output file. |
| confidence_name | String | The name of the column containing the confidence for each prediction when it has been passed as an argument. |
|
confidence_threshold
filterable, sortable |
Float | The minimum level of confidence on the positive class that a classification model needs to reach to return the positive_class. Otherwise, it will return the negative class. DEPRECATED |
| confidences | Boolean | Whether to include a column per class with its corresponding confidence for the batch prediction for the classification task. |
|
created
filterable, sortable |
ISO-8601 Datetime. | This is the date and time in which the batch prediction was created with microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are provided in Coordinated Universal Time (UTC). |
|
credits
filterable, sortable |
Float | The number of credits it cost you to create this batch prediction. |
|
dataset
filterable, sortable |
String | The dataset/id that was used to build the batch prediction. |
|
dataset_status
filterable, sortable |
Boolean | Whether the dataset is still available or has been deleted. |
|
deepnet
filterable, sortable |
String | The deepnet/id that was used to create the batch prediction. |
|
description
updatable |
String | A text describing the batch prediction. It can contain restricted markdown to decorate the text. |
|
ensemble
filterable, sortable |
String | The ensemble/id that was used to create the batch prediction. |
| excluded_fields | Array | The list of fields' ids that were excluded to build the batch prediction. |
| fields_map | Array | The map of dataset fields to model fields used. |
|
fusion
filterable, sortable |
String | The fusion/id that was used to create the batch prediction. |
|
header
filterable, sortable |
Boolean | Whether the batch prediction file contains a header with the name of each field or not. |
|
importance
filterable |
Boolean | Whether the batch prediction includes a column for each of the field importances for model, ensemble, and fusion predictions. There is a column per field, named "<field_name> importance". |
| input_fields | Array | The list of input fields' ids used to create the batch prediction. |
| locale | String | The dataset's locale. |
|
logisticregression
filterable, sortable |
String | The logisticregression/id that was used to create the batch prediction. |
| missing_strategy |
Integer, default is 0 |
Specifies the strategy that a model (or the models in an ensemble) will follow when a value needed to continue inference is missing. The possible values are:
|
|
model
filterable, sortable |
String | The model/id that was used to create the batch prediction. |
|
model_status
filterable, sortable |
Boolean | Whether the model is still available or has been deleted. |
| model_type | Integer |
|
|
name
filterable, sortable, updatable |
String | The name of the batch prediction. By default it is based on the name of the model and the dataset used. |
|
negative_class
filterable, sortable |
String | The negative class that will be returned when the model does not reach the confidence_threshold or probability_threshold level on the positive_class. DEPRECATED |
|
negative_class_confidence
filterable, sortable |
Float | The confidence returned for predictions that have been overridden by the negative_class when a confidence_threshold has been used. DEPRECATED |
|
negative_class_probability
filterable, sortable |
Float | The probability returned for predictions that have been overridden by the negative_class when a probability_threshold has been used. DEPRECATED |
| newline | String | The new line character used as line break in the file that contains the predictions. |
|
node_id
filterable, sortable |
Boolean | Whether a unique id for the final tree node for each tree prediction was added to the output file. |
| number_of_models | Integer | The number of models in the ensemble. |
| objective_field | Object | The objective field of the model. It includes all the properties of the corresponding field (i.e., column_number, datatype, id, name, optype, etc.). |
| operating_kind | String | The operating kind to perform the prediction. See operating_kind above for more information. |
| operating_point | Object | The specification of an operating point for classification problems to perform the prediction. See operating_point above for more information. |
| output_dataset |
Boolean, default is false |
Whether a dataset with the results should be automatically created or not.
Example: true |
|
output_dataset_resource
filterable, sortable |
String | The dataset/id of the newly created dataset when output_dataset has been set to true. |
|
output_dataset_status
filterable, sortable |
Boolean | Whether the dataset generated as an output is still available or has been deleted. |
| output_fields | Array | The list of output fields' ids used to format the output csv file. |
|
positive_class
filterable, sortable |
String | The positive class that will be considered when a confidence_threshold or probability_threshold is used. DEPRECATED |
| prediction_name | String | The name of the column containing the predictions when it has been passed as an argument. |
|
private
filterable, sortable |
Boolean | Whether the batch prediction is public or not. |
| probabilities | Boolean | Whether to include the predicted class and all other possible class values for the batch prediction for the classification task. |
|
probability
filterable, sortable |
Boolean | Whether the probability for each prediction was added to the output file. |
| probability_name | String | The name of the column containing the probability for each prediction when it has been passed as an argument. |
|
probability_threshold
filterable, sortable |
Float | The minimum level of probability on the positive class that boosted trees need to reach to return the positive_class. Otherwise, it will return the negative class. DEPRECATED |
|
project
filterable, sortable, updatable |
String | The project/id the resource belongs to. |
| resource | String | The batchprediction/id. |
|
rows
filterable, sortable |
Integer | The total number of instances in the dataset used as an input. |
| separator | Char | The separator used in the csv file that contains the predictions. |
|
shared
filterable, sortable |
Boolean | Whether the batch prediction has been shared via a private link. |
|
size
filterable, sortable |
Integer | The number of bytes of the dataset that was used to create the batch prediction. |
| status | Object | A description of the status of the batch prediction. It includes a code, a message, and some extra information. See the table below. |
|
subscription
filterable, sortable |
Boolean | Whether the batch prediction was created using a subscription plan or not. |
|
tags
filterable, updatable |
Array of Strings | A list of user tags that can help classify and index this resource. |
| threshold | Object | The parameters (k and class) given when a threshold-based combiner is used. DEPRECATED |
|
type
filterable, sortable |
Integer | Reserved for future use. |
|
updated
filterable, sortable |
ISO-8601 Datetime. | This is the date and time in which the batch prediction was updated with microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are provided in Coordinated Universal Time (UTC). |
|
vote_count
filterable, sortable |
Boolean | Whether the vote_count, the fraction of ensembles voting for the given prediction, for each prediction for the classification ensemble task was added. |
| vote_count_name | String | The name of the column in the header of the generated file containing the vote_count for the ensemble classification task. |
|
vote_counts
filterable, sortable |
Boolean | Whether the vote_counts, a list of fractions of votes, one per category for the batch prediction for the ensemble classification task was added. |
|
votes
filterable, sortable |
Boolean | Whether to include a column for each of the individual model predictions for non-boosted ensembles. That will add a column per model, named <prediction_name>n where n is the position of the model in the model list in the ensemble, starting at 1. |
| webhook | Object | A webhook url and an optional secret phrase. See the Section on Webhooks for more details. |
Batch Prediction Status
Creating a batch prediction is a process that can take just a few seconds or a few hours depending on the size of the dataset used as input and on the workload of BigML's systems. The batch prediction goes through a number of states until it is finished. Through the status field in the batch prediction you can determine when it has been fully processed. These are the properties that a batch prediction's status has:
| Property | Type | Description |
|---|---|---|
| code | Integer | A status code that reflects the status of the batch prediction creation. It can be any of those that are explained here. |
| elapsed | Integer | Number of milliseconds that BigML.io took to process the batch prediction. |
| message | String | A human readable message explaining the status. |
| progress | Float, between 0 and 1 | How far BigML.io has progressed building the batch prediction. |
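The status check above can be sketched in Python. This is a minimal sketch, not an official client: the `is_finished` helper is illustrative, and the sample response fragment mirrors the example JSON in this section.

```python
# BigML resources use a terminal status code of 5 (finished); the
# status-code section of these docs lists the other values.
FINISHED = 5

def is_finished(batch_prediction):
    """Return True once the batch prediction has been fully processed."""
    status = batch_prediction.get("status", {})
    return status.get("code") == FINISHED

# Illustrative response fragment, mirroring the example below.
response = {
    "status": {
        "code": 5,
        "elapsed": 26780,
        "message": "The batch prediction has been performed",
        "progress": 1.0,
    }
}

print(is_finished(response))
print(response["status"]["progress"])
```

In practice you would GET the batch prediction periodically and stop polling once this check returns true (or the status code turns negative, indicating an error).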
Once a batch prediction has successfully finished, it will look like:
{
"all_fields": false,
"category": 0,
"code": 201,
"combiner": 0,
"confidence": false,
"created": "2013-10-11T03:29:42.317248",
"credits": 2000.0,
"dataset": "dataset/5250ae133c192075b800000c",
"dataset_status": true,
"description": "",
"ensemble": "ensemble/52465ac93c19205051000000",
"fields_map": {},
"header": true,
"importance": false,
"locale": "en-US",
"missing_strategy": 0,
"model_status": true,
"model_type": 1,
"name": "Batch Prediction of Churn model with New Customers dataset",
"newline": "LF",
"number_of_models": 10,
"output_fields": [],
"private": true,
"project": null,
"resource": "batchprediction/525770a63c1920e3f3000000",
"result": {},
"rows": 20000,
"separator": ",",
"shared": false,
"size": 368120,
"source_status": true,
"status": {
"code": 5,
"elapsed": 26780,
"message": "The batch prediction has been performed",
"progress": 1.0
},
"subscription": false,
"tags": [],
"type": 0,
"updated": "2013-10-11T03:29:42.317286",
"votes": false
}
< Example batch prediction JSON response
Updating a Batch Prediction
To update a batch prediction, you need to PUT an object containing the fields that you want to update to the batch prediction's base URL. The content-type must always be: "application/json". If the request succeeds, BigML.io will return with an HTTP 202 response with the updated batch prediction.
For example, to update a batch prediction with a new name you can use curl like this:
curl "https://au.bigml.io/batchprediction/525770a63c1920e3f3000000?$BIGML_AUTH" \
-X PUT \
-d '{"name": "A new name"}' \
-H 'content-type: application/json'
$ Updating a batch prediction's name
Deleting a Batch Prediction
To delete a batch prediction, you need to issue an HTTP DELETE request to the batchprediction/id to be deleted.
Using curl you can do something like this to delete a batch prediction:
curl -X DELETE "https://au.bigml.io/batchprediction/525770a63c1920e3f3000000?$BIGML_AUTH"
$ Deleting a batch prediction from the command line
If the request succeeds you will not see anything on the command line unless you executed the command in verbose mode. Successful DELETEs will return "204 no content" responses with no body.
Once you delete a batch prediction, it is permanently deleted. That is, a delete request cannot be undone. If you try to delete a batch prediction a second time, or a batch prediction that does not exist, you will receive a "404 not found" response.
However, if you try to delete a batch prediction that is being used at the moment, then BigML.io will not accept the request and will respond with a "400 bad request" response.
See this section for more details.
Listing Batch Predictions
To list all the batch predictions, you can use the batchprediction base URL. By default, only the 20 most recent batch predictions will be returned. You can see below how to change this number using the limit parameter.
You can get your list of batch predictions directly in your browser using your own username and API key with the following links.
https://au.bigml.io/batchprediction?$BIGML_AUTH
> Listing batch predictions from a browser
Forecasts
Last Updated: Tuesday, 2019-01-29 16:28
A forecast is created using a timeseries/id and the new instance (input_data) for which you wish to create a forecast.
A forecast for time series models consists of extrapolation of the objective field values for time instances beyond the end of the training data. Rather than taking row values as the input, it expects a map keyed by objective ids, and values being maps containing the forecast horizon (number of future steps to predict), and a selector for the ets models to use to compute the forecast.
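The input_data map described above can be built programmatically before POSTing it. A minimal Python sketch (the timeseries/id, the objective field id "000000", and the selector values are illustrative):

```python
import json

# input_data is keyed by objective field id; each value gives the
# forecast horizon and an optional selector for the ETS models.
input_data = {
    "000000": {
        "horizon": 30,          # number of future steps to predict
        "ets_models": {         # selector; per the docs it defaults to
            "criterion": "bic", # {"criterion": "aic", "limit": 1} if absent
            "limit": 2,
        },
    }
}

payload = {
    "timeseries": "timeseries/5621b70910cb86ae4c000000",
    "input_data": input_data,
}
print(json.dumps(payload, indent=2))
```

The serialized payload is what the curl examples in this section pass via `-d`.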
BigML.io allows you to create, retrieve, update, and delete your forecasts. You can also list all of your forecasts.
Jump to:
- Forecast Base URL
- Creating a Forecast
- Forecast Arguments
- Retrieving a Forecast
- Forecast Properties
- Updating a Forecast
- Deleting a Forecast
- Listing Forecasts
Forecast Base URL
You can use the following base URL to create, retrieve, update, and delete forecasts. https://au.bigml.io/forecast
Forecast base URL
All requests to manage your forecasts must use HTTPS and be authenticated using your username and API key to verify your identity. See this section for more details.
Creating a Forecast
To create a new forecast, you need to POST to the forecast base URL an object containing at least the timeseries/id that you want to use to create the forecast and an instance. The content-type must always be "application/json".
For example, you can easily create a new forecast using curl as follows:
curl "https://au.bigml.io/forecast?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"timeseries": "timeseries/5621b70910cb86ae4c000000",
"input_data": {
"000000":{
"horizon":30,
"ets_models":{
"indices":[0,1,2],
"names": ["A,A,N"],
"criterion": "bic",
"limit":2
}
}
}}'
> Creating a forecast
BigML.io will return the newly created forecast if the request succeeded.
{
"category": 0,
"clones": 0,
"code": 201,
"configuration": null,
"configuration_status": false,
"created": "2017-06-28T03:20:10.153929",
"credits": 0.01,
"dataset": "dataset/5948cc214e172744d700000f",
"dataset_status": true,
"description": "",
"forecast": {
"max_periods": 5,
"result": {
"000005": [
{
"lower_bound": [
12.35218,
10.88435,
10.08346,
9.4603,
8.93195
],
"point_forecast": [
12.98044,
12.98044,
12.98044,
12.98044,
12.98044
],
"submodel": "A,N,N",
"time_range": {
"end": 2523,
"interval": 1,
"interval_unit": "milliseconds",
"start": 2519
},
"upper_bound": [
13.6087,
15.07653,
15.87742,
16.50058,
17.02893
]
}
]
}
},
"input_data": {
"000005": {
"horizon": 5,
"ets_models": {
"criterion": "bic",
"limit": 10,
"names": [
"A,N,N"
]
}
}
},
"intervals": true,
"locale": "en-us",
"name": "ebay 35",
"name_options": "",
"private": true,
"project": null,
"query_string": "",
"resource": "forecast/5667cbdd4e1724456000297",
"shared": false,
"short_url": "",
"source": "source/5948cc164e172744d700000c",
"source_status": true,
"status": {
"bad_fields": [],
"code": 5,
"elapsed": 0.21002,
"message": "The forecast has been created",
"progress": 1,
"unknown_fields": []
},
"subscription": true,
"tags": [],
"timeseries": "timeseries/5952cce44e172751b0000000",
"timeseries_status": true,
"timeseries_type": 0,
"type": 0,
"updated": "2017-06-28T03:20:10.154139"
}
< Example forecast JSON response
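The result object in the response above can be unpacked per objective field. A minimal Python sketch using an abridged fragment of the example response:

```python
# "result" maps each objective field id to a list of submodel forecasts.
result = {
    "000005": [
        {
            "submodel": "A,N,N",
            "point_forecast": [12.98044, 12.98044, 12.98044],
            "lower_bound": [12.35218, 10.88435, 10.08346],
            "upper_bound": [13.6087, 15.07653, 15.87742],
        }
    ]
}

for field_id, submodels in result.items():
    for forecast in submodels:
        # Pair each point forecast with its confidence interval.
        rows = list(zip(forecast["lower_bound"],
                        forecast["point_forecast"],
                        forecast["upper_bound"]))
        print(field_id, forecast["submodel"], rows[0])
```

Each tuple is (lower bound, point forecast, upper bound) for one future step; the bounds are only present when intervals is true.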
Forecast Arguments
In addition to the timeseries and the input_data, you can also POST the following arguments.
| Argument | Type | Description |
|---|---|---|
|
category
optional |
Integer, default is the category of the model |
The category that best describes the forecast. See the category codes for the complete list of categories.
Example: 1 |
|
description
optional |
String |
A description of the forecast up to 8192 characters long.
Example: "This is a description of my new forecast" |
| input_data | Object |
A map keyed by objective ids, and values being maps containing the forecast horizon (number of future steps to predict), and a selector for the ETS models to use to compute the forecast. If the selector field is entirely absent, it will default to {"criterion": "aic", "limit": 1}. See the section below for more details.
Example:
|
|
intervals
optional |
Boolean, default is true |
Whether to include the calculated lower and upper confidence bounds for the forecast.
Example: false |
|
name
optional |
String, default is Forecast for time series's name |
The name you want to give to the new forecast.
Example: "my new forecast" |
|
project
optional |
String |
The project/id you want the forecast to belong to.
Example: "project/54d98718f0a5ea0b16000000" |
|
tags
optional |
Array of Strings |
A list of strings that help classify and index your forecast.
Example: ["best customers", "2018"] |
| timeseries | String |
A valid timeseries/id.
Example: timeseries/55efc3564e17270d5b611004 |
|
webhook
optional |
Object |
A webhook url and an optional secret phrase. See the Section on Webhooks for more details.
Example:
|
You can also use curl to customize a new forecast. For example, to create a new forecast named "my forecast":
curl "https://au.bigml.io/forecast?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"timeseries": "timeseries/5621b70910cb86ae4c000000",
"input_data": {
"000000":{
"horizon":30,
"ets_models":{
"indices":[0,1,2],
"names": ["A,A,N"],
"criterion": "bic",
"limit":2
}
}
},
"name": "my forecast",
"tags":["my", "tags"]}'
> Creating a customized forecast
If you do not specify a name, BigML.io will assign to the new forecast a default name.
Retrieving a Forecast
Each forecast has a unique identifier in the form "forecast/id" where id is a string of 24 alpha-numeric characters that you can use to retrieve the forecast.
To retrieve a forecast with curl:
curl "https://au.bigml.io/forecast/5667cbdd4e1724456000297?$BIGML_AUTH"
$ Retrieving a forecast from the command line
Forecast Properties
Once a forecast has been successfully created it will have the following properties.
| Property | Type | Description |
|---|---|---|
|
category
filterable, sortable, updatable |
Integer | One of the categories in the table of categories that help classify this resource according to the domain of application. |
| code | Integer | HTTP status code. This will be 201 upon successful creation of the forecast and 200 afterwards. Make sure that you check the code that comes with the status attribute to make sure that the forecast creation has been completed without errors. |
|
created
filterable, sortable |
ISO-8601 Datetime. | This is the date and time in which the forecast was created with microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are provided in Coordinated Universal Time (UTC). |
|
credits
filterable, sortable |
Float | The number of credits it cost you to create this forecast. |
|
dataset
filterable, sortable |
String | The dataset/id that was used to build the forecast. |
|
dataset_status
filterable, sortable |
Boolean | Whether the dataset is still available or has been deleted. |
|
description
updatable |
String | A text describing the forecast. It can contain restricted markdown to decorate the text. |
| forecast | Object | It contains max_periods, which is the maximum periods of all forecasts, and the forecast result, which is a map keyed by objective id, where the entries are lists of maps describing individual submodel forecasts. See the Forecast Result Object definition below. |
| input_data | Object | The dictionary of input fields' ids and values used as input for the forecast. |
| intervals | Boolean | Whether the lower and upper confidence bounds for the forecast are included in the calculation. |
| locale | String | The dataset's locale. |
|
name
filterable, sortable, updatable |
String | The name of the forecast as you provided or time series by default. |
|
private
filterable, sortable, updatable |
Boolean | Whether the forecast is public or not. |
|
project
filterable, sortable, updatable |
String | The project/id the resource belongs to. |
| query_string | String | The query string that was used to filter the timeseries. |
| resource | String | The forecast/id. |
|
source
filterable, sortable |
String | The source/id that was used to build the dataset. |
|
source_status
filterable, sortable |
Boolean | Whether the source is still available or has been deleted. |
| status | Object | A description of the status of the forecast. It includes a code, a message, and some extra information. See the table below. |
|
subscription
filterable, sortable |
Boolean | Whether the forecast was created using a subscription plan or not. |
|
tags
filterable, updatable |
Array of Strings | A list of user tags that can help classify and index this resource. |
|
timeseries
filterable, sortable |
String | The timeseries/id of the timeseries that was used to create the forecast. |
|
timeseries_status
filterable, sortable |
Boolean | Whether the timeseries is still available or has been deleted. |
| timeseries_type | Integer | Reserved for future use. |
|
updated
filterable, sortable |
ISO-8601 Datetime. | This is the date and time in which the forecast was updated with microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are provided in Coordinated Universal Time (UTC). |
| webhook | Object | A webhook url and an optional secret phrase. See the Section on Webhooks for more details. |
A Forecast Result Object has the following properties:
Forecast Status
Creating a forecast is a near real-time process that takes just a few seconds, depending on whether the corresponding time series has been used recently and on the workload of BigML's systems. The forecast goes through a number of states until it is fully completed. Through the status field in the forecast you can determine when the forecast has been fully processed and is ready to be used. Most of the time, forecasts are fully processed and the output is returned in the first call. These are the properties that a forecast's status has:
| Property | Type | Description |
|---|---|---|
| code | Integer | A status code that reflects the status of the forecast creation. It can be any of those that are explained here. |
| elapsed | Integer | Number of milliseconds that BigML.io took to process the forecast. |
| message | String | A human readable message explaining the status. |
| progress | Float, between 0 and 1 | How far BigML.io has progressed building the forecast. |
| unknown_fields | Array | An array of field's ids that were submitted in the input_data and were not recognized. Unknown fields are ignored. That is, if you submit a field that is wrong, a prediction is created anyway ignoring the wrong input field. |
Updating a Forecast
To update a forecast, you need to PUT an object containing the fields that you want to update to the forecast's base URL. The content-type must always be: "application/json". If the request succeeds, BigML.io will return with an HTTP 202 response with the updated forecast.
For example, to update a forecast with a new name you can use curl like this:
curl "https://au.bigml.io/forecast/5667cbdd4e1724456000297?$BIGML_AUTH" \
-X PUT \
-H 'content-type: application/json' \
-d '{"name": "a new name"}'
$ Updating a forecast's name
Deleting a Forecast
To delete a forecast, you need to issue an HTTP DELETE request to the forecast/id to be deleted.
Using curl you can do something like this to delete a forecast:
curl -X DELETE "https://au.bigml.io/forecast/5667cbdd4e1724456000297?$BIGML_AUTH"
$ Deleting a forecast from the command line
If the request succeeds you will not see anything on the command line unless you executed the command in verbose mode. Successful DELETEs will return "204 no content" responses with no body.
Once you delete a forecast, it is permanently deleted. That is, a delete request cannot be undone. If you try to delete a forecast a second time, or a forecast that does not exist, you will receive a "404 not found" response.
However, if you try to delete a forecast that is being used at the moment, then BigML.io will not accept the request and will respond with a "400 bad request" response.
See this section for more details.
Listing Forecasts
To list all the forecasts, you can use the forecast base URL. By default, only the 20 most recent forecasts will be returned. You can see below how to change this number using the limit parameter.
You can get your list of forecasts directly in your browser using your own username and API key with the following links.
https://au.bigml.io/forecast?$BIGML_AUTH
> Listing forecasts from a browser
Centroids
Last Updated: Tuesday, 2019-01-29 16:28
A centroid is created using a cluster/id and the new instance (input_data) for which you wish to create a centroid.
When you create a new centroid, BigML.io will automatically compute the distance between the new instance and each centroid in the cluster and will return the one closest to the new instance.
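Conceptually, this works like a nearest-centroid search. A minimal Python sketch for purely numeric fields (an approximation: BigML's actual metric also handles categorical and text fields, and the field ids and centers below are illustrative):

```python
import math

def nearest_centroid(instance, centroids):
    """Return (centroid_id, distance) for the closest centroid.

    instance:  {field_id: numeric value}
    centroids: {centroid_id: {field_id: numeric value}}
    """
    best_id, best_distance = None, float("inf")
    for cid, center in centroids.items():
        # Euclidean distance between the instance and this center.
        distance = math.sqrt(sum((instance[f] - center[f]) ** 2
                                 for f in center))
        if distance < best_distance:
            best_id, best_distance = cid, distance
    return best_id, best_distance

centroids = {
    "000000": {"000000": 5.0, "000001": 3.4},
    "000002": {"000000": 5.8, "000001": 2.7},
}
print(nearest_centroid({"000000": 5.9, "000001": 2.6}, centroids))
```

The API does this computation server-side; the response's centroid object (see below) carries the winning centroid's id, name, center, and distance.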
BigML.io allows you to create, retrieve, update, and delete your centroids. You can also list all of your centroids.
Jump to:
- Centroid Base URL
- Creating a Centroid
- Centroid Arguments
- Retrieving a Centroid
- Centroid Properties
- Updating a Centroid
- Deleting a Centroid
- Listing Centroids
Centroid Base URL
You can use the following base URL to create, retrieve, update, and delete centroids. https://au.bigml.io/centroid
Centroid base URL
All requests to manage your centroids must use HTTPS and be authenticated using your username and API key to verify your identity. See this section for more details.
Creating a Centroid
To create a new centroid, you need to POST to the centroid base URL an object containing at least the cluster/id that you want to use to find the centroid and an instance without any missing values. For example, you can easily create a new centroid using curl as follows:
curl "https://au.bigml.io/centroid?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"cluster": "cluster/53798de33c1920ee0800001c",
"input_data": {
"sepal length": 3,
"sepal width": 2.5,
"petal length": 4,
"petal width": 3.5}}'
> Creating a centroid
BigML.io will return the newly created centroid if the request succeeded.
{
"category": 0,
"centroid": {
"center": {
"000000": 5.8531,
"000001": 2.68387,
"000002": 4.43077,
"000003": 1.43772
},
"count": 56,
"distance": 0.8192530758972083,
"id": "000002",
"name": "Cluster 2"
},
"centroid_id": "000002",
"centroid_name": "Cluster 2",
"cluster": "cluster/53798de33c1920ee0800001c",
"cluster_status": true,
"cluster_type": 0,
"code": 201,
"created": "2014-05-19T06:13:18.553600",
"credits": 0.01,
"dataset": "dataset/537639383c19207026000004",
"dataset_status": true,
"description": "",
"distance": 0.8192530758972083,
"input_data": {
"petal length": 4,
"petal width": 3.5,
"sepal length": 3,
"sepal width": 2.5
},
"locale": "en-US",
"name": "Centroid",
"private": true,
"project": null,
"query_string": "",
"resource": "centroid/5379a0fe3c192082a1000000",
"shared": false,
"source": "source/5341a53c3c19206725000000",
"source_status": true,
"status": {
"bad_fields": [],
"code": 5,
"elapsed": 0.041,
"message": "The centroid has been created",
"progress": 1,
"unknown_fields": []
},
"subscription": false,
"tags": [],
"updated": "2014-05-19T06:13:18.553641"
}
< Example centroid JSON response
Centroid Arguments
In addition to the cluster and the input_data, you can also POST the following arguments.
| Argument | Type | Description |
|---|---|---|
|
category
optional |
Integer, default is the category of the model |
The category that best describes the centroid. See the category codes for the complete list of categories.
Example: 1 |
| cluster | String |
A valid cluster/id.
Example: cluster/4f67c0ee03ce89c74a000006 |
|
description
optional |
String |
A description of the centroid up to 8192 characters long.
Example: "This is a description of my new centroid" |
| input_data | Object |
An object with field's id/value or name/value pairs representing the instance you want to find the closest centroid for. You can use either field ids or field names as keys in your input_data.
Example: {"000000": 5, "000001": 3} |
|
name
optional |
String, default is Centroid for cluster's name |
The name you want to give to the new centroid.
Example: "my new centroid" |
|
private
optional |
Boolean, default is true |
Whether you want your centroid to be private or not.
Example: false |
|
project
optional |
String |
The project/id you want the centroid to belong to.
Example: "project/54d98718f0a5ea0b16000000" |
|
tags
optional |
Array of Strings |
A list of strings that help classify and index your centroid.
Example: ["best customers", "2018"] |
|
webhook
optional |
Object |
A webhook url and an optional secret phrase. See the Section on Webhooks for more details.
Example:
|
You can use curl to customize new centroids. For example, to create a new centroid named "my centroid":
curl "https://au.bigml.io/centroid?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"cluster": "cluster/53798de33c1920ee0800001c",
"input_data": {
"sepal length": 3,
"sepal width": 2.5,
"petal length": 4,
"petal width": 3.5},
"name": "my centroid",
"tags": ["my", "tags"]}'
> Creating a customized centroid
If you do not specify a name, BigML.io will assign to the new centroid a default name.
Retrieving a Centroid
Each centroid has a unique identifier in the form "centroid/id" where id is a string of 24 alpha-numeric characters that you can use to retrieve the centroid.
To retrieve a centroid with curl:
curl "https://au.bigml.io/centroid/5379a0fe3c192082a1000000?$BIGML_AUTH"
$ Retrieving a centroid from the command line
You can also use your browser to visualize the centroid using the full BigML.io URL or pasting the centroid/id into the BigML.com.au dashboard.
Centroid Properties
Once a centroid has been successfully created it will have the following properties.
| Property | Type | Description |
|---|---|---|
|
category
filterable, sortable, updatable |
Integer | One of the categories in the table of categories that help classify this resource according to the domain of application. |
| centroid | Object | A dictionary describing the centroid. See the Centroid Object definition below. |
| centroid_id | String | Id assigned to identify the centroid in the cluster. |
| centroid_name | String | Name associated to the centroid in the cluster. |
|
cluster
filterable, sortable |
String | The cluster/id that was used to create the centroid. |
|
cluster_status
filterable, sortable |
Boolean | Whether the cluster is still available or has been deleted. |
| cluster_type | Integer | Reserved for future use. |
| code | Integer | HTTP status code. This will be 201 upon successful creation of the centroid and 200 afterwards. Make sure that you check the code that comes with the status attribute to make sure that the centroid creation has been completed without errors. |
|
created
filterable, sortable |
ISO-8601 Datetime. | This is the date and time in which the centroid was created with microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are provided in Coordinated Universal Time (UTC). |
|
credits
filterable, sortable |
Float | The number of credits it cost you to create this centroid. |
|
dataset
filterable, sortable |
String | The dataset/id that was used to build the centroid. |
|
dataset_status
filterable, sortable |
Boolean | Whether the dataset is still available or has been deleted. |
|
description
updatable |
String | A text describing the centroid. It can contain restricted markdown to decorate the text. |
| distance | Float | The distance between the input_data and the centroid. Distance will be set to -1 if BigML can't compute a centroid for a point due to a missing numeric value. To avoid this, you can choose a default_numeric_value such as mean, median, minimum, maximum, or zero when you build a cluster. |
| input_data | Object | The dictionary of input fields' ids and values used as input for the centroid. |
| locale | String | The dataset's locale. |
|
name
filterable, sortable, updatable |
String | The name of the centroid as you provided or Centroid by default. |
|
private
filterable, sortable, updatable |
Boolean | Whether the centroid is public or not. |
|
project
filterable, sortable, updatable |
String | The project/id the resource belongs to. |
| query_string | String | The query string that was used to filter the cluster. |
| resource | String | The centroid/id. |
|
source
filterable, sortable |
String | The source/id that was used to build the dataset. |
|
source_status
filterable, sortable |
Boolean | Whether the source is still available or has been deleted. |
| status | Object | A description of the status of the centroid. It includes a code, a message, and some extra information. See the table below. |
|
subscription
filterable, sortable |
Boolean | Whether the centroid was created using a subscription plan or not. |
|
tags
filterable, updatable |
Array of Strings | A list of user tags that can help classify and index this resource. |
|
updated
filterable, sortable |
ISO-8601 Datetime. | This is the date and time in which the centroid was updated with microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are provided in Coordinated Universal Time (UTC). |
| webhook | Object | A webhook url and an optional secret phrase. See the Section on Webhooks for more details. |
A Centroid Object has the following properties:
Centroid Status
Creating a centroid is a near real-time process that takes just a few seconds, depending on whether the corresponding cluster has been used recently and the workload of BigML's systems. The centroid goes through a number of states until it is fully completed. Through the status field in the centroid you can determine when the centroid has been fully processed and is ready to be used. Most of the time, centroids are fully processed and the output is returned in the first call. These are the properties that a centroid's status has:
| Property | Type | Description |
|---|---|---|
| bad_fields | Array | An array of field's ids with wrong values submitted. Bad fields are ignored. That is, if you submit a value that is wrong, a centroid is created anyway ignoring the input field with the wrong value. |
| code | Integer | A status code that reflects the status of the centroid creation. It can be any of those that are explained here. |
| elapsed | Integer | Number of milliseconds that BigML.io took to process the centroid. |
| message | String | A human readable message explaining the status. |
| progress | Float, between 0 and 1 | How far BigML.io has progressed building the centroid. |
| unknown_fields | Array | An array of field's ids that were submitted in the input_data and were not recognized. Unknown fields are ignored. That is, if you submit a field that is wrong, a prediction is created anyway ignoring the wrong input field. |
Updating a Centroid
To update a centroid, you need to PUT an object containing the fields that you want to update to the centroid's base URL. The content-type must always be: "application/json". If the request succeeds, BigML.io will return with an HTTP 202 response with the updated centroid.
For example, to update a centroid with a new name you can use curl like this:
curl "https://au.bigml.io/centroid/5379a0fe3c192082a1000000?$BIGML_AUTH" \
-X PUT \
-H 'content-type: application/json' \
-d '{"name": "a new name"}'
$ Updating a centroid's name
Deleting a Centroid
To delete a centroid, you need to issue an HTTP DELETE request to the centroid/id to be deleted.
Using curl you can do something like this to delete a centroid:
curl -X DELETE "https://au.bigml.io/centroid/5379a0fe3c192082a1000000?$BIGML_AUTH"
$ Deleting a centroid from the command line
If the request succeeds you will not see anything on the command line unless you executed the command in verbose mode. Successful DELETEs will return "204 no content" responses with no body.
Once you delete a centroid, it is permanently deleted. That is, a delete request cannot be undone. If you try to delete a centroid a second time, or a centroid that does not exist, you will receive a "404 not found" response.
However, if you try to delete a centroid that is being used at the moment, then BigML.io will not accept the request and will respond with a "400 bad request" response.
See this section for more details.
Listing Centroids
To list all the centroids, you can use the centroid base URL. By default, only the 20 most recent centroids will be returned. You can see below how to change this number using the limit parameter.
You can get your list of centroids directly in your browser using your own username and API key with the following links.
https://au.bigml.io/centroid?$BIGML_AUTH
> Listing centroids from a browser
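The default page size above can be changed with the limit parameter. As an illustration, here is a small sketch for composing such a listing URL; the `listing_url` helper, the placeholder credentials, and the `;limit=` query form are assumptions based on how $BIGML_AUTH is appended in the examples, not part of any official client.

```python
# Hypothetical sketch: compose a listing URL with the limit parameter.
# BASE and AUTH are placeholders for your own endpoint and credentials.
BASE = "https://au.bigml.io"
AUTH = "username=...;api_key=..."  # placeholder credentials

def listing_url(resource_type, limit=20):
    """Default is the 20 most recent resources; raise or lower via limit."""
    return "%s/%s?%s;limit=%d" % (BASE, resource_type, AUTH, limit)

# listing_url("centroid", 5) lists only the 5 most recent centroids
```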
Batch Centroids
Last Updated: Tuesday, 2019-01-29 16:28
A batch centroid provides an easy way to compute a centroid for each instance in a dataset in only one request. To create a new batch centroid you need a cluster/id and a dataset/id.
Batch centroids are created asynchronously. You can retrieve the associated resource to check the progress and status in a similar fashion to the rest of BigML.io resources. Additionally, once a batch centroid is finished you can also download a csv file that contains all the centroids just appending "/download" to the batch centroid URL. You can also set output_dataset to true to automatically generate a new dataset with the results.
BigML.io gives you a number of options to tailor the format of the csv file containing the centroids. For example, you can set up the "separator" (e.g., ";"), whether the file should have a "header" or not, or whether the "distance" for each centroid should also appear together with each centroid. You can read about all the available options below.
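Because batch centroids are asynchronous, a typical workflow retrieves the resource until its status code reaches 5 (finished) and then appends "/download" to fetch the csv. The sketch below illustrates this; the `poll` and `download_url` helpers and the placeholder AUTH string are assumptions, while status code 5 and the "/download" suffix come from the documentation.

```python
# Hypothetical sketch: poll a batch centroid until finished, then build
# its "/download" URL. Helper names and AUTH are illustrative placeholders.
import json
import time
from urllib.request import urlopen

BASE = "https://au.bigml.io"
AUTH = "username=...;api_key=..."  # placeholder credentials
FINISHED = 5  # status code of a fully processed resource

def download_url(resource_id):
    """Append "/download" to a finished batch resource to fetch its csv."""
    return "%s/%s/download?%s" % (BASE, resource_id, AUTH)

def poll(resource_id, interval=2):
    """Retrieve the resource repeatedly until its status code is FINISHED."""
    url = "%s/%s?%s" % (BASE, resource_id, AUTH)
    while True:
        doc = json.load(urlopen(url))
        if doc["status"]["code"] == FINISHED:
            return doc
        time.sleep(interval)
```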
BigML.io allows you to create, retrieve, update, delete your batch centroid. You can also list all of your batch centroids.
Jump to:
- Batch Centroid Base URL
- Creating a Batch Centroid
- Batch Centroid Arguments
- Retrieving a Batch Centroid
- Batch Centroid Properties
- Updating a Batch Centroid
- Deleting a Batch Centroid
- Listing Batch Centroids
Batch Centroid Base URL
You can use the following base URL to create, retrieve, update, and delete batch centroids. https://au.bigml.io/batchcentroid
Batch Centroid base URL
All requests to manage your batch centroids must use HTTPS and be authenticated using your username and API key to verify your identity. See this section for more details.
Creating a Batch Centroid
To create a new batch centroid, you need to POST to the batch centroid base URL an object containing at least the cluster/id that you want to use to compute centroids and the dataset/id of the dataset that contains the input data that will be used to compute centroids. BigML.io will compute a centroid for each instance in that dataset.
You can easily create a new batch centroid using curl as follows. Your authentication variable should be set up first as shown above.
curl "https://au.bigml.io/batchcentroid?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"cluster": "cluster/5378e08f3c1920e7d8000004",
"dataset": "dataset/5378e0773c1920e7d8000000"}'
> Creating a batch centroid
BigML.io will return the newly created batch centroid if the request succeeded.
{
"all_fields": false,
"category": 0,
"cluster": "cluster/5378e08f3c1920e7d8000004",
"cluster_status": true,
"cluster_type": 0,
"code": 201,
"created": "2014-05-19T01:54:25.624547",
"credits": 1.5,
"dataset": "dataset/5378e0773c1920e7d8000000",
"dataset_status": true,
"description": "",
"distance": false,
"fields_map": {},
"header": true,
"locale": "en-US",
"name": "Batch Centroid of Iris' dataset cluster with Iris' dataset",
"newline": "LF",
"output_fields": [],
"private": true,
"project": null,
"resource": "batchcentroid/537964513c1920ee08000011",
"rows": 150,
"separator": ",",
"shared": false,
"size": 4608,
"status": {
"code": 1,
"message": "The batch centroid is being processed and will be performed soon"
},
"subscription": false,
"tags": [],
"type": 0,
"updated": "2014-05-19T01:54:25.624613"
}
< Example batch centroid JSON response
Batch Centroid Arguments
In addition to the cluster and the dataset, you can also POST the following arguments.
| Argument | Type | Description |
|---|---|---|
|
all_fields
optional |
Boolean, default is false |
Whether all the fields from the dataset should be part of the generated csv file together with the centroids.
Example: true |
|
category
optional |
Integer, default is the category of the model |
The category that best describes the batch centroid. See the category codes for the complete list of categories.
Example: 1 |
|
centroid_name
optional |
String |
The name of the column in the header of the generated file for the centroid. It only takes effect if header is true.
Example: "Centroid" |
| cluster | String |
A valid cluster/id.
Example: cluster/4f67c0ee03ce89c74a000006 |
| dataset | String |
A valid dataset/id.
Example: dataset/4f66a80803ce8940c5000006 |
|
default_numeric_value
optional |
String |
It accepts any of the following strings to substitute missing numeric values across all the numeric fields in the dataset: "mean", "median", "minimum", "maximum", "zero"
Example: "median" |
|
description
optional |
String |
A description of the batch centroid up to 8192 characters long.
Example: "This is a description of my new batch centroid" |
|
distance
optional |
Boolean, default is false |
Whether the distance for each centroid should be added to the csv file.
Example: true |
|
distance_name
optional |
String |
The name of the column in the header of the generated file containing the distance. It only takes effect if header is true.
Example: "Distance" |
|
excluded_fields
optional |
Array, default is [], an empty list. None of the fields in the dataset |
Specifies the fields in the dataset to be excluded to create the batch centroid.
Example:
|
|
fields_map
optional |
Object |
A dictionary mapping the identifiers of the fields in the cluster to their corresponding identifiers in the input dataset.
Example: {"000000":"00000a", "000001":"000002", "000002":"000001", "000003":"000020", "000004":"000004"} |
|
header
optional |
Boolean, default is true |
Whether the csv file should have a header with the name of each field.
Example: true |
|
input_fields
optional |
Array, default is []. All the fields in the dataset |
Specifies the fields in the dataset to be considered to create the batch centroid.
Example:
|
|
name
optional |
String, default is dataset's name |
The name you want to give to the new batch centroid.
Example: "my new batch centroid" |
|
newline
optional |
String, default is "LF" |
The new line character that you want to get as line break in the generated csv file: "LF", "CRLF".
Example: "CRLF" |
|
output_dataset
optional |
Boolean, default is false |
Whether a dataset with the results should be automatically created or not.
Example: true |
|
output_fields
optional |
Array, default is []. None of the fields in the dataset |
Specifies the fields to be included in the csv file. It can be a list of field ids or names.
Example:
|
|
project
optional |
String |
The project/id you want the batch centroid to belong to.
Example: "project/54d98718f0a5ea0b16000000" |
|
separator
optional |
Char, default is "," |
The separator that you want to get between fields in the generated csv file.
Example: ";" |
|
tags
optional |
Array of Strings |
A list of strings that help classify and index your batch centroid.
Example: ["best customers", "2018"] |
|
webhook
optional |
Object |
A webhook url and an optional secret phrase. See the Section on Webhooks for more details.
Example:
|
You can also use curl to customize a new batch centroid. For example, you can create a new batch centroid named "my batch centroid" that will not include a header and will only output the field "000001" together with the distance for each centroid.
curl "https://au.bigml.io/batchcentroid?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"cluster": "cluster/5378e08f3c1920e7d8000004",
"dataset": "dataset/5378e0773c1920e7d8000000",
"name": "my batch centroid",
"header": false,
"output_fields": ["000001"],
"distance": true}'
> Creating a customized batch centroid
If you do not specify a name, BigML.io will assign to the new batch centroid a combination of the dataset's name and the cluster's name. If you do not specify any fields_map, BigML.io will use a direct map of all the fields in the dataset.
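The customized request above can also be assembled programmatically. The helper below is an illustration, not part of any official client: it merges the optional arguments from the table over the two required ids.

```python
# Illustrative helper: build the JSON body for a customized batch centroid.
# Only "cluster" and "dataset" are required; any other argument from the
# table above can be passed as a keyword option.

def batch_centroid_payload(cluster, dataset, **options):
    """Return the request body with optional arguments merged in."""
    body = {"cluster": cluster, "dataset": dataset}
    body.update(options)
    return body

body = batch_centroid_payload(
    "cluster/5378e08f3c1920e7d8000004",
    "dataset/5378e0773c1920e7d8000000",
    name="my batch centroid",
    header=False,
    output_fields=["000001"],
    distance=True,
)
```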
Retrieving a Batch Centroid
Each batch centroid has a unique identifier in the form "batchcentroid/id" where id is a string of 24 alpha-numeric characters that you can use to retrieve the batch centroid.
To retrieve a batch centroid with curl:
curl "https://au.bigml.io/batchcentroid/537964513c1920ee08000011?$BIGML_AUTH"
$ Retrieving a batch centroid from the command line
You can also use your browser to visualize the batch centroid using the full BigML.io URL or pasting the batchcentroid/id into the BigML.com.au dashboard.
Batch Centroid Properties
Once a batch centroid has been successfully created it will have the following properties.
| Property | Type | Description |
|---|---|---|
|
all_fields
filterable, sortable |
Boolean | Whether the batch centroid contains all the fields in the dataset used as an input. |
|
category
filterable, sortable, updatable |
Integer | One of the categories in the table of categories that help classify this resource according to the domain of application. |
| centroid_name | String | The name of the column containing the centroids when it has been passed as an argument. |
|
cluster
filterable, sortable |
String | The cluster/id of the cluster used to create the batch centroid. |
|
cluster_status
filterable, sortable |
Boolean | Whether the cluster is still available or has been deleted. |
| code | Integer | HTTP status code. This will be 201 upon successful creation of the batch centroid and 200 afterwards. Make sure that you check the code that comes with the status attribute to make sure that the batch centroid creation has been completed without errors. |
|
created
filterable, sortable |
ISO-8601 Datetime. | This is the date and time in which the batch centroid was created with microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are provided in Coordinated Universal Time (UTC). |
|
credits
filterable, sortable |
Float | The number of credits it cost you to create this batch centroid. |
|
dataset
filterable, sortable |
String | The dataset/id that was used to build the batch centroid. |
|
dataset_status
filterable, sortable |
Boolean | Whether the dataset is still available or has been deleted. |
|
description
updatable |
String | A text describing the batch centroid. It can contain restricted markdown to decorate the text. |
|
distance
filterable, sortable |
Boolean | Whether the distance for each centroid was added to the output file. |
| distance_name | String | The name of the column containing the distance for each centroid when it has been passed as an argument. |
| excluded_fields | Array | The list of field ids that were excluded to build the batch centroid. |
| fields_map | Array | The map of dataset fields to cluster fields used. |
|
header
filterable, sortable |
Boolean | Whether the batch centroid file contains a header with the name of each field or not. |
| input_fields | Array | The list of input fields' ids used to create the batch centroid. |
| locale | String | The dataset's locale. |
|
name
filterable, sortable, updatable |
String | The name of the batch centroid. By default, it's based on the names of the cluster and the dataset used. |
| newline | String | The new line character used as line break in the file that contains the centroids. |
| output_dataset | Boolean | Whether a dataset with the results should be automatically created or not. |
|
output_dataset_resource
filterable, sortable |
String | The dataset/id of the newly created dataset when output_dataset has been set to true. |
|
output_dataset_status
filterable, sortable |
Boolean | Whether the dataset generated as an output is still available or has been deleted. |
| output_fields | Array | The list of output fields' ids used to format the output csv file. |
|
private
filterable, sortable |
Boolean | Whether the batch centroid is public or not. |
|
project
filterable, sortable, updatable |
String | The project/id the resource belongs to. |
| resource | String | The batchcentroid/id. |
|
rows
filterable, sortable |
Integer | The total number of instances in the dataset used as an input. |
| separator | Char | The separator used in the csv file that contains the centroids. |
|
shared
filterable, sortable |
Boolean | Whether the batch centroid has been shared via a private link. |
|
size
filterable, sortable |
Integer | The number of bytes of the dataset that was used to create the batch centroid. |
| status | Object | A description of the status of the batch centroid. It includes a code, a message, and some extra information. See the table below. |
|
subscription
filterable, sortable |
Boolean | Whether the batch centroid was created using a subscription plan or not. |
|
tags
filterable, updatable |
Array of Strings | A list of user tags that can help classify and index this resource. |
|
type
filterable, sortable |
Integer | Reserved for future use. |
|
updated
filterable, sortable |
ISO-8601 Datetime. | This is the date and time in which the batch centroid was updated with microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are provided in Coordinated Universal Time (UTC). |
| webhook | Object | A webhook url and an optional secret phrase. See the Section on Webhooks for more details. |
Batch Centroid Status
Creating a batch centroid is a process that can take just a few seconds or a few hours depending on the size of the dataset used as input and on the workload of BigML's systems. The batch centroid goes through a number of states until it's finished. Through the status field in the batch centroid you can determine when it has been fully processed. These are the properties that a batch centroid's status has:
| Property | Type | Description |
|---|---|---|
| code | Integer | A status code that reflects the status of the batch centroid creation. It can be any of those that are explained here. |
| elapsed | Integer | Number of milliseconds that BigML.io took to process the batch centroid. |
| message | String | A human readable message explaining the status. |
| progress | Float, between 0 and 1 | How far BigML.io has progressed building the batch centroid. |
Once the batch centroid has finished successfully, it will look like this:
{
"all_fields": false,
"category": 0,
"cluster": "cluster/5378e08f3c1920e7d8000004",
"cluster_status": true,
"cluster_type": 0,
"code": 200,
"created": "2014-05-19T02:26:48.386000",
"credits": 1.5,
"dataset": "dataset/5378e0773c1920e7d8000000",
"dataset_status": true,
"description": "",
"distance": false,
"fields_map": {
"000000": "000000",
"000001": "000001",
"000002": "000002",
"000003": "000003",
"000004": "000004"
},
"header": true,
"locale": "en-US",
"name": "Batch Centroid of Iris' dataset cluster with Iris' dataset",
"newline": "LF",
"output_fields": [],
"private": true,
"project": null,
"resource": "batchcentroid/537964513c1920ee08000011",
"rows": 150,
"separator": ",",
"shared": false,
"size": 4608,
"status": {
"code": 5,
"elapsed": 1059,
"message": "The batch centroid has been performed",
"progress": 1.0
},
"subscription": false,
"tags": [],
"type": 0,
"updated": "2014-05-19T02:26:52.332000"
}
< Example batch centroid JSON response
Updating a Batch Centroid
To update a batch centroid, you need to PUT an object containing the fields that you want to update to the batch centroid's base URL. The content-type must always be: "application/json". If the request succeeds, BigML.io will return an HTTP 202 response with the updated batch centroid.
For example, to update a batch centroid with a new name you can use curl like this:
curl "https://au.bigml.io/batchcentroid/537964513c1920ee08000011?$BIGML_AUTH" \
-X PUT \
-H 'content-type: application/json' \
-d '{"name": "A new name"}'
$ Updating a batch centroid's name
Deleting a Batch Centroid
To delete a batch centroid, you need to issue an HTTP DELETE request to the batchcentroid/id to be deleted.
Using curl you can do something like this to delete a batch centroid:
curl -X DELETE "https://au.bigml.io/batchcentroid/537964513c1920ee08000011?$BIGML_AUTH"
$ Deleting a batch centroid from the command line
If the request succeeds you will not see anything on the command line unless you executed the command in verbose mode. Successful DELETEs will return "204 no content" responses with no body.
Once you delete a batch centroid, it is permanently deleted. That is, a delete request cannot be undone. If you try to delete a batch centroid a second time, or a batch centroid that does not exist, you will receive a "404 not found" response.
However, if you try to delete a batch centroid that is being used at the moment, then BigML.io will not accept the request and will respond with a "400 bad request" response.
See this section for more details.
Listing Batch Centroids
To list all the batch centroids, you can use the batchcentroid base URL. By default, only the 20 most recent batch centroids will be returned. You can see below how to change this number using the limit parameter.
You can get your list of batch centroids directly in your browser using your own username and API key with the following links.
https://au.bigml.io/batchcentroid?$BIGML_AUTH
> Listing batch centroids from a browser
Anomaly Scores
Last Updated: Tuesday, 2019-01-29 16:28
An anomaly score is created using an anomaly/id and the new instance (input_data) for which you wish to create an anomaly score.
When you create a new anomaly score, BigML.io will automatically compute a score between 0 and 1. The closer the score is to 1, the more anomalous the instance being scored is. BigML.io will also compute the relative importance for each field. That is, how much each value in the input data contributed to the score.
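For example, given the score and the per-field importances in a response, you could flag anomalous instances and rank the contributing fields as sketched below. The 0.6 threshold and the helper names are illustrative choices, not API constants.

```python
# Illustrative sketch: interpret an anomaly score response by flagging
# scores above a chosen threshold and ranking field importances.

def is_anomalous(score, threshold=0.6):
    """Scores closer to 1 are more anomalous; the threshold is a choice."""
    return score >= threshold

def top_contributors(importance, n=2):
    """Return the n field ids that contributed most to the score."""
    return sorted(importance, key=importance.get, reverse=True)[:n]

response = {
    "score": 0.72566,
    "importance": {"000000": 0.17899, "000001": 0.07067,
                   "000002": 0.12573, "000003": 0.16941, "000004": 0.4552},
}
```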
BigML.io allows you to create, retrieve, update, delete your anomaly score. You can also list all of your anomaly scores.
Jump to:
- Anomaly Score Base URL
- Creating an Anomaly Score
- Anomaly Score Arguments
- Retrieving an Anomaly Score
- Anomaly Score Properties
- Updating an Anomaly Score
- Deleting an Anomaly Score
- Listing Anomaly Scores
Anomaly Score Base URL
You can use the following base URL to create, retrieve, update, and delete anomaly scores. https://au.bigml.io/anomalyscore
Anomaly Score base URL
All requests to manage your anomaly scores must use HTTPS and be authenticated using your username and API key to verify your identity. See this section for more details.
Creating an Anomaly Score
To create a new anomaly score, you need to POST to the anomaly score base URL an object containing at least the anomaly/id that you want to use and the input_data for the instance you want to score. For example, you can easily create a new anomaly score using curl as follows:
curl "https://au.bigml.io/anomalyscore?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"anomaly": "anomaly/5423625af0a5ea3eea000028",
"input_data": {
"petal length": 4,
"petal width": 1,
"sepal length": 7,
"sepal width": 3}}'
> Creating an anomaly score
BigML.io will return the newly created anomaly score if the request succeeded.
{
"anomaly": "anomaly/5423625af0a5ea3eea000028",
"anomaly_status": true,
"anomaly_type": 0,
"category": 0,
"code": 201,
"created": "2014-09-25T07:11:18.047689",
"credits": 0.01,
"dataset": "dataset/54222a14f0a5eaaab000000c",
"dataset_status": true,
"description": "",
"importance": {
"000000": 0.17899,
"000001": 0.07067,
"000002": 0.12573,
"000003": 0.16941,
"000004": 0.4552
},
"input_data": {
"petal length": 4,
"petal width": 1,
"sepal length": 7,
"sepal width": 3
},
"locale": "en-US",
"name": "Score",
"private": true,
"project": null,
"query_string": "",
"resource": "anomalyscore/5423c016f0a5ea2aeb000006",
"score": 0.72566,
"shared": false,
"source": "source/54222a08f0a5eaaab0000008",
"source_status": true,
"status": {
"bad_fields": [],
"code": 5,
"elapsed": 0.014,
"message": "The anomaly score has been created",
"progress": 1,
"unknown_fields": []
},
"subscription": false,
"tags": [],
"updated": "2014-09-25T07:11:18.047703"
}
< Example anomaly score JSON response
Anomaly Score Arguments
In addition to the anomaly and the input_data, you can also POST the following arguments.
| Argument | Type | Description |
|---|---|---|
| anomaly | String |
A valid anomaly/id.
Example: anomaly/5423625af0a5ea3eea000028 |
|
category
optional |
Integer, default is the category of the model |
The category that best describes the anomaly score. See the category codes for the complete list of categories.
Example: 1 |
|
description
optional |
String |
A description of the anomaly score up to 8192 characters long.
Example: "This is a description of my new anomaly score" |
| input_data | Object |
An object with field id/value or field name/value pairs representing the instance you want to score. You can use either field ids or field names as keys in your input_data.
Example: {"000000": 5, "000001": 3} |
|
name
optional |
String, default is "Score" |
The name you want to give to the new anomaly score.
Example: "my new anomaly score" |
|
project
optional |
String |
The project/id you want the anomaly score to belong to.
Example: "project/54d98718f0a5ea0b16000000" |
|
tags
optional |
Array of Strings |
A list of strings that help classify and index your anomaly score.
Example: ["best customers", "2018"] |
|
webhook
optional |
Object |
A webhook url and an optional secret phrase. See the Section on Webhooks for more details.
Example:
|
You can use curl to customize new anomaly scores. For example, to create a new anomaly score named "my anomaly score":
curl "https://au.bigml.io/anomalyscore?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"anomaly": "anomaly/5423625af0a5ea3eea000028",
"input_data": {
"petal length": 4,
"petal width": 1,
"sepal length": 7,
"sepal width": 3},
"name": "my anomaly score",
"tags":["my", "tags"]}'
> Creating a customized anomaly score
If you do not specify a name, BigML.io will assign to the new anomaly score a default name.
Retrieving an Anomaly Score
Each anomaly score has a unique identifier in the form "anomalyscore/id" where id is a string of 24 alpha-numeric characters that you can use to retrieve the anomaly score.
To retrieve an anomaly score with curl:
curl "https://au.bigml.io/anomalyscore/5423c016f0a5ea2aeb000006?$BIGML_AUTH"
$ Retrieving an anomaly score from the command line
You can also use your browser to visualize the anomaly score using the full BigML.io URL or pasting the anomalyscore/id into the BigML.com.au dashboard.
Anomaly Score Properties
Once an anomaly score has been successfully created it will have the following properties.
| Property | Type | Description |
|---|---|---|
|
anomaly
filterable, sortable |
String | The anomaly/id of the anomaly detector that was used to create the anomaly score. |
|
anomaly_status
filterable, sortable |
Boolean | Whether the anomaly detector is still available or has been deleted. |
| anomaly_type | Integer | Reserved for future use. |
|
category
filterable, sortable, updatable |
Integer | One of the categories in the table of categories that help classify this resource according to the domain of application. |
| code | Integer | HTTP status code. This will be 201 upon successful creation of the anomaly score and 200 afterwards. Make sure that you check the code that comes with the status attribute to make sure that the anomaly score creation has been completed without errors. |
|
created
filterable, sortable |
ISO-8601 Datetime. | This is the date and time in which the anomaly score was created with microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are provided in Coordinated Universal Time (UTC). |
|
credits
filterable, sortable |
Float | The number of credits it cost you to create this anomaly score. |
|
dataset
filterable, sortable |
String | The dataset/id that was used to build the anomaly score. |
|
dataset_status
filterable, sortable |
Boolean | Whether the dataset is still available or has been deleted. |
|
description
updatable |
String | A text describing the anomaly score. It can contain restricted markdown to decorate the text. |
| importance | Object | A dictionary keyed by field id that reports the relative contribution of each field to the anomaly score. |
| input_data | Object | The dictionary of input fields' ids and values used as input for the anomaly score. |
| locale | String | The dataset's locale. |
|
name
filterable, sortable, updatable |
String | The name of the anomaly score as you provided or Anomaly Score by default. |
|
private
filterable, sortable, updatable |
Boolean | Whether the anomaly score is public or not. |
|
project
filterable, sortable, updatable |
String | The project/id the resource belongs to. |
| query_string | String | The query string that was used to filter the anomaly. |
| resource | String | The anomalyscore/id. |
| score | Float | The anomaly score. The closer to 1, the more anomalous the input data is. |
|
source
filterable, sortable |
String | The source/id that was used to build the dataset. |
|
source_status
filterable, sortable |
Boolean | Whether the source is still available or has been deleted. |
| status | Object | A description of the status of the anomaly score. It includes a code, a message, and some extra information. See the table below. |
|
subscription
filterable, sortable |
Boolean | Whether the anomaly score was created using a subscription plan or not. |
|
tags
filterable, updatable |
Array of Strings | A list of user tags that can help classify and index this resource. |
|
updated
filterable, sortable |
ISO-8601 Datetime. | This is the date and time in which the anomaly score was updated with microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are provided in Coordinated Universal Time (UTC). |
| webhook | Object | A webhook url and an optional secret phrase. See the Section on Webhooks for more details. |
Anomaly Score Status
Creating an anomaly score is a near real-time process that takes just a few seconds, depending on whether the corresponding anomaly detector has been used recently and on the workload of BigML's systems. The anomaly score goes through a number of states until it's fully completed. Through the status field in the anomaly score you can determine when it has been fully processed and is ready to be used. Most of the time, anomaly scores are fully processed and the output is returned in the first call. These are the properties that an anomaly score's status has:
| Property | Type | Description |
|---|---|---|
| bad_fields | Array | An array of field ids with wrong values submitted. Bad fields are ignored. That is, if you submit a wrong value, the anomaly score is created anyway, ignoring the input field with the wrong value. |
| code | Integer | A status code that reflects the status of the anomaly score creation. It can be any of those that are explained here. |
| elapsed | Integer | Number of milliseconds that BigML.io took to process the anomaly score. |
| message | String | A human readable message explaining the status. |
| progress | Float, between 0 and 1 | How far BigML.io has progressed building the anomaly score. |
| unknown_fields | Array | An array of field ids that were submitted in the input_data and were not recognized. Unknown fields are ignored. That is, if you submit a field that does not exist, the anomaly score is created anyway, ignoring the unknown input field. |
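Since bad and unknown fields are silently ignored, it is worth checking the status object after creation to see whether any inputs were dropped. A minimal sketch follows; the helper name is hypothetical.

```python
# Illustrative helper: collect the field ids that were silently ignored
# (bad values or unrecognized fields) from an anomaly score's status.

def ignored_fields(status):
    """Return the union of bad and unknown field ids in the status."""
    return sorted(set(status.get("bad_fields", [])) |
                  set(status.get("unknown_fields", [])))

status = {"code": 5, "bad_fields": ["000001"], "unknown_fields": ["0000ff"]}
```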
Updating an Anomaly Score
To update an anomaly score, you need to PUT an object containing the fields that you want to update to the anomaly score's base URL. The content-type must always be: "application/json". If the request succeeds, BigML.io will return an HTTP 202 response with the updated anomaly score.
For example, to update an anomaly score with a new name you can use curl like this:
curl "https://au.bigml.io/anomalyscore/5423c016f0a5ea2aeb000006?$BIGML_AUTH" \
-X PUT \
-H 'content-type: application/json' \
-d '{"name": "a new name"}'
$ Updating an anomaly score's name
Deleting an Anomaly Score
To delete an anomaly score, you need to issue an HTTP DELETE request to the anomalyscore/id to be deleted.
Using curl you can do something like this to delete an anomaly score:
curl -X DELETE "https://au.bigml.io/anomalyscore/5423c016f0a5ea2aeb000006?$BIGML_AUTH"
$ Deleting an anomaly score from the command line
If the request succeeds you will not see anything on the command line unless you executed the command in verbose mode. Successful DELETEs will return "204 no content" responses with no body.
Once you delete an anomaly score, it is permanently deleted. That is, a delete request cannot be undone. If you try to delete an anomaly score a second time, or an anomaly score that does not exist, you will receive a "404 not found" response.
However, if you try to delete an anomaly score that is being used at the moment, then BigML.io will not accept the request and will respond with a "400 bad request" response.
See this section for more details.
Listing Anomaly Scores
To list all the anomaly scores, you can use the anomalyscore base URL. By default, only the 20 most recent anomaly scores will be returned. You can see below how to change this number using the limit parameter.
You can get your list of anomaly scores directly in your browser using your own username and API key with the following links.
https://au.bigml.io/anomalyscore?$BIGML_AUTH
> Listing anomaly scores from a browser
Batch Anomaly Scores
Last Updated: Tuesday, 2019-01-29 16:28
A batch anomaly score provides an easy way to compute an anomaly score for each instance in a dataset in only one request. To create a new batch anomaly score you need an anomaly/id and a dataset/id.
Batch anomaly scores are created asynchronously. You can retrieve the associated resource to check the progress and status in a similar fashion to the rest of BigML.io resources. Additionally, once a batch anomaly score is finished you can also download a csv file that contains all the anomaly scores just appending "/download" to the batch anomaly score URL. You can also set output_dataset to true to automatically generate a new dataset with the results.
BigML.io gives you a number of options to tailor the format of the csv file containing the anomaly scores. For example, you can set up the "separator" (e.g., ";"), whether the file should have a "header" or not, or whether the per-field "importance" should also appear together with each anomaly score. You can read about all the available options below.
BigML.io allows you to create, retrieve, update, delete your batch anomaly score. You can also list all of your batch anomaly scores.
Jump to:
- Batch Anomaly Score Base URL
- Creating a Batch Anomaly Score
- Batch Anomaly Score Arguments
- Retrieving a Batch Anomaly Score
- Batch Anomaly Score Properties
- Updating a Batch Anomaly Score
- Deleting a Batch Anomaly Score
- Listing Batch Anomaly Scores
Batch Anomaly Score Base URL
You can use the following base URL to create, retrieve, update, and delete batch anomaly scores. https://au.bigml.io/batchanomalyscore
Batch Anomaly Score base URL
All requests to manage your batch anomaly scores must use HTTPS and be authenticated using your username and API key to verify your identity. See this section for more details.
Creating a Batch Anomaly Score
To create a new batch anomaly score, you need to POST to the batch anomaly score base URL an object containing at least the anomaly/id that you want to use to compute anomaly scores and the dataset/id of the dataset that contains the input data that will be used to compute anomaly scores. BigML.io will compute an anomaly score for each instance in that dataset.
You can easily create a new batch anomaly score using curl as follows. Your authentication variable should be set up first as shown above.
curl "https://au.bigml.io/batchanomalyscore?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"anomaly": "anomaly/5423625af0a5ea3eea000028",
"dataset": "dataset/54222a14f0a5eaaab000000c"}'
> Creating a batch anomaly score
BigML.io will return the newly created batch anomaly score if the request succeeded.
{
"all_fields":false,
"anomaly":"anomaly/5423625af0a5ea3eea000028",
"anomaly_status":true,
"anomaly_type":0,
"category":0,
"code":201,
"created":"2014-09-25T08:44:05.563912",
"credits":1.5,
"dataset":"dataset/54222a14f0a5eaaab000000c",
"dataset_status":true,
"description":"",
"fields_map":{},
"header":true,
"locale":"en-US",
"importance":false,
"name":"Batch Anomaly Score of Anomaly Detector with Iris",
"newline": "LF",
"output_dataset":false,
"output_dataset_resource":null,
"output_dataset_status":false,
"output_fields":[],
"private":true,
"project":null,
"resource":"batchanomalyscore/5423d5d5f0a5ea4359000007",
"rows":150,
"score_name":"score",
"separator":",",
"shared":false,
"size":4758,
"status":{
"code":1,
"message":"The batch anomaly score is being processed and will be performed soon"
},
"subscription":false,
"tags":[],
"type":0,
"updated":"2014-09-25T08:44:05.563940"
}
< Example batch anomaly score JSON response
Batch Anomaly Score Arguments
In addition to the anomaly and the dataset, you can also POST the following arguments.
| Argument | Type | Description |
|---|---|---|
|
all_fields
optional |
Boolean, default is false |
Whether all the fields from the dataset should be part of the generated csv file together with the anomaly score.
Example: true |
| anomaly | String |
A valid anomaly/id.
Example: anomaly/4f67c0ee03ce89c74a000006 |
|
category
optional |
Integer, default is the category of the model |
The category that best describes the batch anomaly score. See the category codes for the complete list of categories.
Example: 1 |
| dataset | String |
A valid dataset/id.
Example: dataset/4f66a80803ce8940c5000006 |
|
default_numeric_value
optional |
String |
It accepts any of the following strings to substitute missing numeric values across all the numeric fields in the dataset: "mean", "median", "minimum", "maximum", "zero"
Example: "median" |
|
description
optional |
String |
A description of the batch anomaly score up to 8192 characters long.
Example: "This is a description of my new batch anomaly score" |
|
excluded_fields
optional |
Array, default is [], an empty list. None of the fields in the dataset |
Specifies the fields in the dataset to be excluded to create the batch anomaly score.
Example:
|
|
fields_map
optional |
Object |
A dictionary of identifiers of the fields used by the anomaly detector mapped to their corresponding identifiers in the input dataset.
Example: {"000000":"00000a", "000001":"000002", "000002":"000001", "000003":"000020", "000004":"000004"} |
|
header
optional |
Boolean, default is true |
Whether the csv file should have a header with the name of each field.
Example: true |
|
importance
optional |
Boolean, default is false |
Whether field importance scores are added as additional columns for each input field.
Example: true |
|
input_fields
optional |
Array, default is []. All the fields in the dataset |
Specifies the fields in the dataset to be considered to create the batch anomaly score.
Example:
|
|
name
optional |
String, default is dataset's name |
The name you want to give to the new batch anomaly score.
Example: "my new batch anomaly score" |
|
newline
optional |
String, default is "LF" |
The new line character that you want to get as line break in the generated csv file: "LF", "CRLF".
Example: "CRLF" |
|
output_dataset
optional |
Boolean, default is false |
Whether a dataset with the results should be automatically created or not.
Example: true |
|
output_fields
optional |
Array, default is []. None of the fields in the dataset |
Specifies the fields to be included in the csv file. It can be a list of field ids or names.
Example:
|
|
project
optional |
String |
The project/id you want the batch anomaly score to belong to.
Example: "project/54d98718f0a5ea0b16000000" |
|
score_name
optional |
String |
The name of the column for the anomaly score in the header of the generated file. It only takes effect if header is true.
Example: "Anomaly Score" |
|
separator
optional |
Char, default is "," |
The separator that you want to get between fields in the generated csv file.
Example: ";" |
|
tags
optional |
Array of Strings |
A list of strings that help classify and index your batch anomaly score.
Example: ["best customers", "2018"] |
|
webhook
optional |
Object |
A webhook url and an optional secret phrase. See the Section on Webhooks for more details.
Example:
|
You can also use curl to customize a new batch anomaly score. For example, to create a new batch anomaly score named "my batch anomaly score" that includes a header, outputs all the fields in the dataset, and names the score column "Anomaly Score":
curl "https://au.bigml.io/batchanomalyscore?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"anomaly": "anomaly/5423625af0a5ea3eea000028",
"dataset": "dataset/54222a14f0a5eaaab000000c",
"name": "my batch anomaly score",
"header": true,
"all_fields": true,
"score_name": "Anomaly Score"
}'
> Creating a customized batch anomaly score
If you do not specify a name, BigML.io will assign to the new batch anomaly score a combination of the dataset's name and the anomaly's name. If you do not specify any fields_map, BigML.io will use a direct map of all the fields in the dataset.
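That default direct mapping can be pictured as an identity dictionary over the dataset's field ids (a sketch; the field ids below are illustrative):

```python
# Illustrative field ids from the input dataset.
dataset_field_ids = ["000000", "000001", "000002", "000003", "000004"]

# With no fields_map argument, BigML.io behaves as if every anomaly
# detector field were mapped to the dataset field with the same id.
fields_map = {field_id: field_id for field_id in dataset_field_ids}
print(fields_map["000003"])  # -> 000003
```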
Retrieving a Batch Anomaly Score
Each batch anomaly score has a unique identifier in the form "batchanomalyscore/id" where id is a string of 24 alpha-numeric characters that you can use to retrieve the batch anomaly score.
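A quick client-side sanity check of that identifier shape might look like this (a sketch; the regular expression only encodes the "24 alpha-numeric characters" rule stated above):

```python
import re

# "batchanomalyscore/" followed by exactly 24 alpha-numeric characters.
BATCH_SCORE_ID = re.compile(r"^batchanomalyscore/[a-zA-Z0-9]{24}$")

def looks_like_batch_score_id(resource_id):
    """Return True when the string matches the batchanomalyscore/id shape."""
    return bool(BATCH_SCORE_ID.match(resource_id))

print(looks_like_batch_score_id("batchanomalyscore/5423d5d5f0a5ea4359000007"))  # True
print(looks_like_batch_score_id("dataset/54222a14f0a5eaaab000000c"))            # False
```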
To retrieve a batch anomaly score with curl:
curl "https://au.bigml.io/batchanomalyscore/5423d5d5f0a5ea4359000007?$BIGML_AUTH"
$ Retrieving a batch anomaly score from the command line
You can also use your browser to visualize the batch anomaly score using the full BigML.io URL or pasting the batchanomalyscore/id into the BigML.com.au dashboard.
Batch Anomaly Score Properties
Once a batch anomaly score has been successfully created it will have the following properties.
| Property | Type | Description |
|---|---|---|
|
all_fields
filterable, sortable |
Boolean | Whether the batch anomaly score contains all the fields in the dataset used as an input. |
|
anomaly
filterable, sortable |
String | The anomaly/id of the anomaly used to create the batch anomaly score. |
|
anomaly_status
filterable, sortable |
Boolean | Whether the anomaly is still available or has been deleted. |
| anomaly_type | Integer | Reserved for future use. |
|
category
filterable, sortable, updatable |
Integer | One of the categories in the table of categories that help classify this resource according to the domain of application. |
| code | Integer | HTTP status code. This will be 201 upon successful creation of the batch anomaly score and 200 afterwards. Make sure that you check the code that comes with the status attribute to make sure that the batch anomaly score creation has been completed without errors. |
|
created
filterable, sortable |
ISO-8601 Datetime. | This is the date and time in which the batch anomaly score was created with microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are provided in Coordinated Universal Time (UTC). |
|
credits
filterable, sortable |
Float | The number of credits it cost you to create this batch anomaly score. |
|
dataset
filterable, sortable |
String | The dataset/id that was used to build the batch anomaly score. |
|
dataset_status
filterable, sortable |
Boolean | Whether the dataset is still available or has been deleted. |
|
description
updatable |
String | A text describing the batch anomaly score. It can contain restricted markdown to decorate the text. |
| excluded_fields | Array | The list of field ids that were excluded to build the batch anomaly score. |
| fields_map | Array | The map of dataset fields to anomaly fields used. |
|
header
filterable, sortable |
Boolean | Whether the batch anomaly score file contains a header with the name of each field or not. |
| importance | Boolean | Whether field importance scores are added as additional columns for each input field or not. |
| input_fields | Array | The list of input fields' ids used to create the batch anomaly score. |
| locale | String | The dataset's locale. |
|
name
filterable, sortable, updatable |
String | The name of the batch anomaly score. By default, it's based on the names of the anomaly detector and the dataset used. |
| newline | String | The new line character used as line break in the file that contains the anomaly scores. |
| output_dataset | Boolean | Whether a dataset with the results was automatically created or not. |
|
output_dataset_resource
filterable, sortable |
String | The dataset/id of the newly created dataset when output_dataset has been set to true. |
|
output_dataset_status
filterable, sortable |
Boolean | Whether the dataset generated as an output is still available or has been deleted. |
| output_fields | Array | The list of output field ids used to format the output csv file. |
|
private
filterable, sortable |
Boolean | Whether the batch anomaly score is public or not. |
|
project
filterable, sortable, updatable |
String | The project/id the resource belongs to. |
| resource | String | The batchanomalyscore/id. |
|
rows
filterable, sortable |
Integer | The total number of instances in the dataset used as an input. |
| score_name | String | The name of the column containing the anomaly scores when it has been passed as an argument. |
| separator | Char | The separator used in the csv file that contains the anomaly scores. |
|
shared
filterable, sortable |
Boolean | Whether the batch anomaly score has been shared via a private link. |
|
size
filterable, sortable |
Integer | The number of bytes of the dataset that was used to create the batch anomaly score. |
| status | Object | A description of the status of the batch anomaly score. It includes a code, a message, and some extra information. See the table below. |
|
subscription
filterable, sortable |
Boolean | Whether the batch anomaly score was created using a subscription plan or not. |
|
tags
filterable, updatable |
Array of Strings | A list of user tags that can help classify and index this resource. |
|
type
filterable, sortable |
Integer | Reserved for future use. |
|
updated
filterable, sortable |
ISO-8601 Datetime. | This is the date and time in which the batch anomaly score was updated with microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are provided in Coordinated Universal Time (UTC). |
| webhook | Object | A webhook url and an optional secret phrase. See the Section on Webhooks for more details. |
Batch Anomaly Score Status
Creating a batch anomaly score is a process that can take just a few seconds or a few hours depending on the size of the dataset used as input and on the workload of BigML's systems. The batch anomaly score goes through a number of states until it is finished. Through the status field in the batch anomaly score you can determine when it has been fully processed. These are the properties of a batch anomaly score's status:
| Property | Type | Description |
|---|---|---|
| code | Integer | A status code that reflects the status of the batch anomaly score creation. It can be any of those that are explained here. |
| elapsed | Integer | Number of milliseconds that BigML.io took to process the batch anomaly score. |
| message | String | A human readable message explaining the status. |
| progress | Float, between 0 and 1 | How far BigML.io has progressed building the batch anomaly score. |
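Putting those status properties together, a client could decide whether to keep polling with something like this (a sketch; status code 5 means the resource is finished, per the status codes section):

```python
FINISHED = 5  # status code for a completed resource, per the status codes section

def poll_decision(status):
    """Map a batch anomaly score status object to a short progress report."""
    if status["code"] == FINISHED:
        return "done"
    return "in progress: {:.0%}".format(status.get("progress", 0.0))

print(poll_decision({"code": 1, "message": "being processed", "progress": 0.25}))  # in progress: 25%
print(poll_decision({"code": 5, "elapsed": 2471, "progress": 1}))                  # done
```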
Once a batch anomaly score has finished successfully, it will look like this:
{
"all_fields":false,
"anomaly":"anomaly/5423625af0a5ea3eea000028",
"anomaly_status":true,
"anomaly_type":0,
"category":0,
"code":200,
"created":"2014-09-25T08:44:05.563000",
"credits":1.5,
"dataset":"dataset/54222a14f0a5eaaab000000c",
"dataset_status":true,
"description":"",
"fields_map":{
"000000":"000000",
"000001":"000001",
"000002":"000002",
"000003":"000003",
"000004":"000004"
},
"header":true,
"locale":"en-US",
"importance":false,
"name":"Batch Anomaly Score of Anomaly Detector with Iris",
"newline": "LF",
"output_dataset":false,
"output_dataset_resource":null,
"output_dataset_status":false,
"output_fields":[],
"private":true,
"project":null,
"resource":"batchanomalyscore/5423d5d5f0a5ea4359000007",
"rows":150,
"score_name":"score",
"separator":",",
"shared":false,
"size":4758,
"status":{
"code":5,
"elapsed":2471,
"message":"The batch anomaly score has been performed",
"progress":1
},
"subscription":false,
"tags":[],
"type":0,
"updated":"2014-09-25T08:44:08.310000"
}
< Example batch anomaly score JSON response
Updating a Batch Anomaly Score
To update a batch anomaly score, you need to PUT an object containing the fields that you want to update to the batch anomaly score's URL. The content-type must always be "application/json". If the request succeeds, BigML.io will return an HTTP 202 response with the updated batch anomaly score.
For example, to update a batch anomaly score with a new name you can use curl like this:
curl "https://au.bigml.io/batchanomalyscore/5423d5d5f0a5ea4359000007?$BIGML_AUTH" \
-X PUT \
-H 'content-type: application/json' \
-d '{"name": "A new name"}'
$ Updating a batch anomaly score's name
Deleting a Batch Anomaly Score
To delete a batch anomaly score, you need to issue a HTTP DELETE request to the batchanomalyscore/id to be deleted.
Using curl you can do something like this to delete a batch anomaly score:
curl -X DELETE "https://au.bigml.io/batchanomalyscore/5423d5d5f0a5ea4359000007?$BIGML_AUTH"
$ Deleting a batch anomaly score from the command line
If the request succeeds you will not see anything on the command line unless you executed the command in verbose mode. Successful DELETEs will return "204 no content" responses with no body.
Once you delete a batch anomaly score, it is permanently deleted. That is, a delete request cannot be undone. If you try to delete a batch anomaly score a second time, or a batch anomaly score that does not exist, you will receive a "404 not found" response.
However, if you try to delete a batch anomaly score that is being used at the moment, then BigML.io will not accept the request and will respond with a "400 bad request" response.
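The three DELETE outcomes described above can be summarized in a small helper (a sketch; only the HTTP codes named in this section are handled):

```python
def describe_delete_response(http_code):
    """Translate the DELETE status codes documented above into messages."""
    outcomes = {
        204: "deleted (no content in the response body)",
        404: "not found: already deleted or never existed",
        400: "bad request: the resource is currently in use",
    }
    return outcomes.get(http_code, "unexpected response")

print(describe_delete_response(204))  # deleted (no content in the response body)
```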
See this section for more details.
Listing Batch Anomaly Scores
To list all the batch anomaly scores, you can use the batchanomalyscore base URL. By default, only the 20 most recent batch anomaly scores will be returned. You can see below how to change this number using the limit parameter.
You can get your list of batch anomaly scores directly in your browser using your own username and API key with the following links.
https://au.bigml.io/batchanomalyscore?$BIGML_AUTH
> Listing batch anomaly scores from a browser
Association Sets
Last Updated: Tuesday, 2019-01-29 16:28
An association set is created using an association/id and the new instance (input_data) for which you wish to create an association set.
Association Sets are useful to know which items have stronger associations with a given set of values for your fields. For those values (input_data), BigML computes a similarity score against the antecedent itemsets of the association rules and returns a ranking of the items found in their consequents. The similarity score is then multiplied by the selected association measure (confidence, leverage, support, lift, or coverage) to create a similarity-weighted score, which produces the final ranking of the predicted items.
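The ranking described above can be sketched as follows (the similarity scores and measure values are illustrative, not BigML's actual similarity computation):

```python
# Hypothetical consequent items, each with a similarity score against the
# input_data and the value of the selected association measure.
candidates = [
    {"item": "AGUA",    "similarity": 0.9, "measure": 0.22},  # e.g. support
    {"item": "NATURAL", "similarity": 0.6, "measure": 0.13},
    {"item": "MINERAL", "similarity": 1.0, "measure": 0.30},  # already in input
]
input_items = {"MINERAL", "S/GAS"}

# Similarity-weighted score, excluding items already present in the input.
ranked = sorted(
    (
        {"item": c["item"], "score": c["similarity"] * c["measure"]}
        for c in candidates
        if c["item"] not in input_items
    ),
    key=lambda entry: entry["score"],
    reverse=True,
)
print([entry["item"] for entry in ranked])  # -> ['AGUA', 'NATURAL']
```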
BigML.io allows you to create, retrieve, update, and delete your association sets. You can also list all of your association sets.
Jump to:
- Association Set Base URL
- Creating an Association Set
- Association Set Arguments
- Retrieving an Association Set
- Association Set Properties
- Updating an Association Set
- Deleting an Association Set
- Listing Association Sets
Association Set Base URL
You can use the following base URL to create, retrieve, update, and delete association sets. https://au.bigml.io/associationset
Association Set base URL
All requests to manage your association sets must use HTTPS and be authenticated using your username and API key to verify your identity. See this section for more details.
Creating an Association Set
To create a new association set, you need to POST to the association set base URL an object containing at least the association/id that you want to use to find the association set and an instance. For example, you can easily create a new association set using curl as follows:
curl "https://au.bigml.io/associationset?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"association": "association/5621b70910cb86ae4c000000",
"input_data": {
"000000":["MINERAL","S/GAS"]
}}'
> Creating an association set
BigML.io will return the newly created association set if the request succeeded.
{
"association":"association/5667df634e17273e71000000",
"association_status":true,
"association_type":0,
"association_set":{
"fields":{
"000000":{
"column_number":0,
"datatype":"string",
"item_analysis":{
"separator":" "
},
"name":"field1",
"optype":"items",
"order":0,
"preferred":true
}
},
"k":3,
"max_k":100,
"result":[
{
"item":{
"complement":false,
"count":4715,
"field_id":"000000",
"name":"1104010101-AGUA"
},
"score":0.19931
},
{
"item":{
"complement":false,
"count":3849,
"field_id":"000000",
"name":"NATURAL,"
},
"score":0.07776
},
{
"item":{
"complement":false,
"count":3575,
"field_id":"000000",
"name":"NATURAL"
},
"score":0.07366
}
],
"score_by":"support"
},
"category":0,
"code":201,
"created":"2015-12-09T08:15:55.038970",
"credits":0.01,
"dataset":"dataset/565fbf254e17274741000008",
"dataset_status":true,
"description":"",
"input_data":{
"000000":[
"MINERAL",
"S/GAS"
]
},
"locale":"en-us",
"name":"Association Set",
"private":true,
"project":null,
"query_string":"",
"resource":"associationset/5667cbdd4e17276305000007",
"shared":false,
"source":"source/565f6f494e17274741000003",
"source_status":true,
"status":{
"bad_fields":[],
"code":5,
"elapsed":0.047,
"message":"The association set has been created",
"progress":1,
"unknown_fields":[]
},
"subscription":false,
"tags":[],
"updated":"2015-12-09T08:15:55.039017"
}
< Example association set JSON response
Association Set Arguments
In addition to the association and the input_data, you can also POST the following arguments.
| Argument | Type | Description |
|---|---|---|
| association | String |
A valid association/id.
Example: association/5423625af0a5ea3eea000028 |
|
category
optional |
Integer, default is the category of the model |
The category that best describes the association set. See the category codes for the complete list of categories.
Example: 1 |
|
description
optional |
String |
A description of the association set up to 8192 characters long.
Example: "This is a description of my new association set" |
| input_data | Object |
An object with field's id/value pairs representing the instance you want to find the closest association set for. You can use either field ids or field names as keys in your input_data.
Example: {"000000": ["MINERAL", "S/GAS"]} |
|
max_k
optional |
Integer, default is 100 |
The maximum number of predicted items to return. Each Consequent with a similarity-weighted score greater than 0 may be included in the prediction as long as it is not already contained within the input data.
Example: 50 |
|
name
optional |
String, default is Association Set for association's name |
The name you want to give to the new association set.
Example: "my new association set" |
|
project
optional |
String |
The project/id you want the association set to belong to.
Example: "project/54d98718f0a5ea0b16000000" |
|
score_by
optional |
String, default is the search_strategy used to create the association |
The association measure used to rank the predicted items returned. Possible values: "lhs_cover", "confidence", "leverage", "lift", and "support".
Example: "coverage" |
|
tags
optional |
Array of Strings |
A list of strings that help classify and index your association set.
Example: ["best customers", "2018"] |
|
webhook
optional |
Object |
A webhook url and an optional secret phrase. See the Section on Webhooks for more details.
Example:
|
You can use curl to customize new association sets. For example, to create a new association set named "my association set":
curl "https://au.bigml.io/associationset?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"association": "association/5621b70910cb86ae4c000000",
"input_data": {
"000000":["MINERAL","S/GAS"]
},
"name": "my association set",
"tags":["my", "tags"]}'
> Creating a customized association set
If you do not specify a name, BigML.io will assign to the new association set a default name.
Retrieving an Association Set
Each association set has a unique identifier in the form "associationset/id" where id is a string of 24 alpha-numeric characters that you can use to retrieve the association set.
To retrieve an association set with curl:
curl "https://au.bigml.io/associationset/5667cbdd4e17276305000007?$BIGML_AUTH"
$ Retrieving an association set from the command line
You can also use your browser to visualize the association set using the full BigML.io URL or pasting the associationset/id into the BigML.com.au dashboard.
Association Set Properties
Once an association set has been successfully created it will have the following properties.
| Property | Type | Description |
|---|---|---|
|
association
filterable, sortable |
String | The association/id of the association that was used to create the association set. |
| association_set | Object | All the information that you need to recreate the association set. See the Association Set Object definition below. |
|
association_status
filterable, sortable |
Boolean | Whether the association is still available or has been deleted. |
| association_type | Integer | Reserved for future use. |
|
category
filterable, sortable, updatable |
Integer | One of the categories in the table of categories that help classify this resource according to the domain of application. |
| code | Integer | HTTP status code. This will be 201 upon successful creation of the association set and 200 afterwards. Make sure that you check the code that comes with the status attribute to make sure that the association set creation has been completed without errors. |
|
created
filterable, sortable |
ISO-8601 Datetime. | This is the date and time in which the association set was created with microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are provided in Coordinated Universal Time (UTC). |
|
credits
filterable, sortable |
Float | The number of credits it cost you to create this association set. |
|
dataset
filterable, sortable |
String | The dataset/id that was used to build the association set. |
|
dataset_status
filterable, sortable |
Boolean | Whether the dataset is still available or has been deleted. |
|
description
updatable |
String | A text describing the association set. It can contain restricted markdown to decorate the text. |
| input_data | Object | The dictionary of input fields' ids and values used as input for the association set. |
| locale | String | The dataset's locale. |
|
name
filterable, sortable, updatable |
String | The name of the association set as you provided or Association Set by default. |
|
private
filterable, sortable, updatable |
Boolean | Whether the association set is public or not. |
|
project
filterable, sortable, updatable |
String | The project/id the resource belongs to. |
| query_string | String | The query string that was used to filter the association. |
| resource | String | The associationset/id. |
|
source
filterable, sortable |
String | The source/id that was used to build the dataset. |
|
source_status
filterable, sortable |
Boolean | Whether the source is still available or has been deleted. |
| status | Object | A description of the status of the association set. It includes a code, a message, and some extra information. See the table below. |
|
subscription
filterable, sortable |
Boolean | Whether the association set was created using a subscription plan or not. |
|
tags
filterable, updatable |
Array of Strings | A list of user tags that can help classify and index this resource. |
|
updated
filterable, sortable |
ISO-8601 Datetime. | This is the date and time in which the association set was updated with microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are provided in Coordinated Universal Time (UTC). |
| webhook | Object | A webhook url and an optional secret phrase. See the Section on Webhooks for more details. |
Association Set Object has the following properties.
| Property | Type | Description |
|---|---|---|
| fields | Object | A dictionary with an entry per field in the dataset used to build the association. Fields are paginated according to the field_meta attribute. Each entry includes the column number in original source, the name of the field, the type of the field, and the summary. See this Section for more details. |
| k | Integer | The actual number of predicted items returned. |
| max_k | Integer | The maximum number of predicted items to return. |
| result | Array | An array of objects with a pair of item and a non-zero score. See Item Object for more information. |
| score_by | String | The association measure used to rank the predicted items returned. |
Association Set Status
Creating an association set is a near real-time process that takes just a few seconds, depending on whether the corresponding association has been used recently and on the workload of BigML's systems. The association set goes through a number of states until it is fully completed. Through the status field in the association set you can determine when it has been fully processed and is ready to be used. Most of the time, association sets are fully processed and the output is returned in the first call. These are the properties of an association set's status:
| Property | Type | Description |
|---|---|---|
| bad_fields | Array | An array of ids of fields submitted with wrong values. Bad fields are ignored. That is, if you submit a wrong value, the association set is created anyway, ignoring the input field with the wrong value. |
| code | Integer | A status code that reflects the status of the association set creation. It can be any of those that are explained here. |
| elapsed | Integer | Number of milliseconds that BigML.io took to process the association set. |
| message | String | A human readable message explaining the status. |
| progress | Float, between 0 and 1 | How far BigML.io has progressed building the association set. |
| unknown_fields | Array | An array of ids of fields that were submitted in the input_data and were not recognized. Unknown fields are ignored. That is, if you submit an unknown field, the association set is created anyway, ignoring that input field. |
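The bad_fields and unknown_fields behavior above can be sketched as a triage of the submitted input (the model fields and the `validate` predicate below are illustrative, not BigML's actual validation):

```python
def triage_input_fields(input_data, model_fields, validate):
    """Split submitted field ids the way the status table above describes:
    unknown fields are not in the model; bad fields carry invalid values.
    Both kinds are ignored, and the remaining fields are actually used."""
    unknown = [f for f in input_data if f not in model_fields]
    bad = [f for f in input_data
           if f in model_fields and not validate(f, input_data[f])]
    used = [f for f in input_data if f not in unknown and f not in bad]
    return used, bad, unknown

# Illustrative model with an items field "000000" and a numeric field "000001".
model_fields = {"000000", "000001"}
is_valid = lambda f, v: not (f == "000001" and isinstance(v, str))
used, bad, unknown = triage_input_fields(
    {"000000": ["MINERAL"], "000001": "oops", "0000ff": 3}, model_fields, is_valid
)
print(used, bad, unknown)  # -> ['000000'] ['000001'] ['0000ff']
```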
Updating an Association Set
To update an association set, you need to PUT an object containing the fields that you want to update to the association set's URL. The content-type must always be "application/json". If the request succeeds, BigML.io will return an HTTP 202 response with the updated association set.
For example, to update an association set with a new name you can use curl like this:
curl "https://au.bigml.io/associationset/5667cbdd4e17276305000007?$BIGML_AUTH" \
-X PUT \
-H 'content-type: application/json' \
-d '{"name": "a new name"}'
$ Updating an association set's name
Deleting an Association Set
To delete an association set, you need to issue a HTTP DELETE request to the associationset/id to be deleted.
Using curl you can do something like this to delete an association set:
curl -X DELETE "https://au.bigml.io/associationset/5667cbdd4e17276305000007?$BIGML_AUTH"
$ Deleting an association set from the command line
If the request succeeds you will not see anything on the command line unless you executed the command in verbose mode. Successful DELETEs will return "204 no content" responses with no body.
Once you delete an association set, it is permanently deleted. That is, a delete request cannot be undone. If you try to delete an association set a second time, or an association set that does not exist, you will receive a "404 not found" response.
However, if you try to delete an association set that is being used at the moment, then BigML.io will not accept the request and will respond with a "400 bad request" response.
See this section for more details.
Listing Association Sets
To list all the association sets, you can use the associationset base URL. By default, only the 20 most recent association sets will be returned. You can see below how to change this number using the limit parameter.
You can get your list of association sets directly in your browser using your own username and API key with the following links.
https://au.bigml.io/associationset?$BIGML_AUTH
> Listing association sets from a browser
Topic Distributions
Last Updated: Tuesday, 2019-01-29 16:28
A topic distribution is created using a topicmodel/id and the new instance (input_data) for which you wish to obtain the probability distributions across topics.
When you create a topic distribution, BigML.io automatically computes the probability that the input_data belongs to each topic. Therefore, for the input_data you will get a set of probabilities between 0% and 100%, one for each topic. The sum of all probabilities across topics must be 100%.
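That constraint can be checked on the client side like this (a sketch with made-up probabilities for a hypothetical three-topic model):

```python
import math

# Hypothetical probabilities returned for a three-topic model.
distribution = [0.62, 0.27, 0.11]

# Per the description above, the probabilities cover all topics and
# sum to 100% (1.0), allowing for floating-point rounding.
assert math.isclose(sum(distribution), 1.0, abs_tol=1e-9)

# The most likely topic is simply the index of the largest probability.
top_topic = max(range(len(distribution)), key=distribution.__getitem__)
print(top_topic)  # -> 0
```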
BigML.io allows you to create, retrieve, update, and delete your topic distributions. You can also list all of your topic distributions.
Jump to:
- Topic Distribution Base URL
- Creating a Topic Distribution
- Topic Distribution Arguments
- Retrieving a Topic Distribution
- Topic Distribution Properties
- Updating a Topic Distribution
- Deleting a Topic Distribution
- Listing Topic Distributions
Topic Distribution Base URL
You can use the following base URL to create, retrieve, update, and delete topic distributions. https://au.bigml.io/topicdistribution
Topic Distribution base URL
All requests to manage your topic distributions must use HTTPS and be authenticated using your username and API key to verify your identity. See this section for more details.
Creating a Topic Distribution
To create a new topic distribution, you need to POST to the topic distribution base URL an object containing at least the topicmodel/id that you want to use to find the topic distribution and an instance. For example, you can easily create a new topic distribution using curl as follows:
curl "https://au.bigml.io/topicdistribution?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"topicmodel": "topicmodel/56f5ecfa4e17275f4400015b",
       "input_data": {
           "00000a": "A great white shark, one of the few known man-eaters, swam into a commercial fishing net in Southern California’s coastal waters, a television station reported Saturday. The 13-foot, 1,500-pound shark was displayed when the vessel docked Saturday at Terminal Island in the Los Angeles harbor. Anthony Tibich, skipper of the Aggressor, was surprised to find tangled in the fishing net Friday morning the huge type of shark featured in the Jaws movies attacking surfers and swimmers, the report said. Sightings of the great white shark are unusual in Southern California’s warmer coastal waters. San Clemente Island is about 75 miles southwest of Los Angeles. Reports of a great white shark sighting on Thursday led Orange County authorities to evacuate about 2,000 swimmers along a 5-mile shore at Newport Beach. The beaches were reopened Friday."
       }}'
> Creating a topic distribution
BigML.io will return the newly created topic distribution if the request succeeded.
{
"category":0,
"code":201,
"configuration":null,
"configuration_status":false,
"created":"2016-09-05T19:46:43.515853",
"credits":0.01,
"dataset":"dataset/57c3c7234e1727730e00000b",
"dataset_status":true,
"description":"",
"input_data":{
"00000a":"A great white shark, one of the few known man-eaters, swam into a commercial fishing net in Southern California’s coastal waters, a television station reported Saturday. The 13-foot, 1,500-pound shark was displayed when the vessel docked Saturday at Terminal Island in the Los Angeles harbor. Anthony Tibich, skipper of the Aggressor, was surprised to find tangled in the fishing net Friday morning the huge type of shark featured in the Jaws movies attacking surfers and swimmers, the report said. Sightings of the great white shark are unusual in Southern California’s warmer coastal waters. San Clemente Island is about 75 miles southwest of Los Angeles. Reports of a great white shark sighting on Thursday led Orange County authorities to evacuate about 2,000 swimmers along a 5-mile shore at Newport Beach. The beaches were reopened Friday."
},
"locale":"en-us",
"name":"Topic Distribution",
"private":true,
"project":"project/57c0a9804e17274a85000000",
"query_string":"",
"resource": "topicdistribution/57d9c1b84e17272411000009",
"shared":false,
"source":"source/57c1368a4e17275b67000009",
"source_status":true,
"status":{
"bad_fields":[],
"code":5,
"elapsed":0.09899,
"message":"The topic distribution has been created",
"progress":1,
"unknown_fields":[]
},
"subscription":true,
"tags":[],
"topic_distribution":{
"result":[
0.02404,
0.04327,
0.02404,
0.02404,
0.02404,
0.04327,
0.02404,
0.02404,
0.02404,
0.02404
]
},
"topicmodel":"topicmodel/57c3c7404e1727730e00000e",
"topicmodel_status":true,
"topicmodel_type":0,
"updated":"2016-09-05T19:46:43.515952"
}
< Example topic distribution JSON response
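The result list alone does not name the topics; pairing it with the topics list of the topic model (which uses the same ordering) shows which topics dominate. A minimal sketch in Python, with hypothetical topic names standing in for the real ones:

```python
# Pair each probability in the topic distribution's "result" list with its
# topic. Topic names here are hypothetical; in practice they come from the
# topic model's "topics" list, which is in the same order as "result".
result = [0.02404, 0.04327, 0.02404, 0.02404, 0.02404,
          0.04327, 0.02404, 0.02404, 0.02404, 0.02404]
topic_names = ["Topic %02d" % i for i in range(len(result))]

# Rank topics by probability, highest first.
ranked = sorted(zip(topic_names, result), key=lambda t: t[1], reverse=True)
for name, prob in ranked[:3]:
    print("%s: %.5f" % (name, prob))
```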
Topic Distribution Arguments
In addition to the topicmodel and the input_data, you can also POST the following arguments.
| Argument | Type | Description |
|---|---|---|
|
category
optional |
Integer, default is the category of the model |
The category that best describes the topic distribution. See the category codes for the complete list of categories.
Example: 1 |
|
description
optional |
String |
A description of the topic distribution up to 8192 characters long.
Example: "This is a description of my new topic distribution" |
| input_data | Object |
An object with field id/value pairs representing the instance for which you want to compute the topic distribution. You can use either field ids or field names as keys in your input_data.
Example: {"00000a": "otisday", "00000d": "washington"} |
|
name
optional |
String, default is Topic Distribution |
The name you want to give to the new topic distribution.
Example: "my new topic distribution" |
|
project
optional |
String |
The project/id you want the topic distribution to belong to.
Example: "project/54d98718f0a5ea0b16000000" |
|
tags
optional |
Array of Strings |
A list of strings that help classify and index your topic distribution.
Example: ["best customers", "2018"] |
| topicmodel | String |
A valid topicmodel/id.
Example: topicmodel/57c3c7404e1727730e00000e |
|
webhook
optional |
Object |
A webhook url and an optional secret phrase. See the Section on Webhooks for more details.
Example:
|
You can use curl to customize new topic distributions. For example, to create a new topic distribution named "my topic distribution":
curl "https://au.bigml.io/topicdistribution?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"topicmodel": "topicmodel/56f5ecfa4e17275f4400015b",
"input_data": {"00000a": "A great white shark, one of the few known man-eaters, swam into a commercial fishing net in Southern California’s coastal waters, a television station reported Saturday. The 13-foot, 1,500-pound shark was displayed when the vessel docked Saturday at Terminal Island in the Los Angeles harbor. Anthony Tibich, skipper of the Aggressor, was surprised to find tangled in the fishing net Friday morning the huge type of shark featured in the Jaws movies attacking surfers and swimmers, the report said. Sightings of the great white shark are unusual in Southern California’s warmer coastal waters. San Clemente Island is about 75 miles southwest of Los Angeles. Reports of a great white shark sighting on Thursday led Orange County authorities to evacuate about 2,000 swimmers along a 5-mile shore at Newport Beach. The beaches were reopened Friday."},
"name": "my topic distribution",
"tags":["my", "tags"]}'
> Creating a customized topic distribution
If you do not specify a name, BigML.io will assign a default name to the new topic distribution.
Retrieving a Topic Distribution
Each topic distribution has a unique identifier in the form "topicdistribution/id" where id is a string of 24 alpha-numeric characters that you can use to retrieve the topic distribution.
To retrieve a topic distribution with curl:
curl "https://au.bigml.io/topicdistribution/57d9c1b84e17272411000009?$BIGML_AUTH"
$ Retrieving a topic distribution from the command line
You can also use your browser to visualize the topic distribution using the full BigML.io URL or pasting the topicdistribution/id into the BigML.com.au dashboard.
Topic Distribution Properties
Once a topic distribution has been successfully created it will have the following properties.
| Property | Type | Description |
|---|---|---|
|
category
filterable, sortable, updatable |
Integer | One of the categories in the table of categories that help classify this resource according to the domain of application. |
| code | Integer | HTTP status code. This will be 201 upon successful creation of the topic distribution and 200 afterwards. Make sure that you check the code that comes with the status attribute to make sure that the topic distribution creation has been completed without errors. |
|
created
filterable, sortable |
ISO-8601 Datetime. | This is the date and time in which the topic distribution was created with microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are provided in Coordinated Universal Time (UTC). |
|
credits
filterable, sortable |
Float | The number of credits it cost you to create this topic distribution. |
|
dataset
filterable, sortable |
String | The dataset/id that was used to build the topic distribution. |
|
dataset_status
filterable, sortable |
Boolean | Whether the dataset is still available or has been deleted. |
|
description
updatable |
String | A text describing the topic distribution. It can contain restricted markdown to decorate the text. |
| importance | Object | A dictionary keyed by field id that reports the relative contribution of each field to the topic distribution. |
| input_data | Object | The dictionary of input fields' ids and values used as input for the topic distribution. |
| locale | String | The dataset's locale. |
|
name
filterable, sortable, updatable |
String | The name of the topic distribution as you provided or Topic Distribution by default. |
|
private
filterable, sortable, updatable |
Boolean | Whether the topic distribution is public or not. |
|
project
filterable, sortable, updatable |
String | The project/id the resource belongs to. |
| query_string | String | The query string that was used to filter the topic model. |
| resource | String | The topicdistribution/id. |
|
source
filterable, sortable |
String | The source/id that was used to build the dataset. |
|
source_status
filterable, sortable |
Boolean | Whether the source is still available or has been deleted. |
| status | Object | A description of the status of the topic distribution. It includes a code, a message, and some extra information. See the table below. |
|
subscription
filterable, sortable |
Boolean | Whether the topic distribution was created using a subscription plan or not. |
|
tags
filterable, updatable |
Array of Strings | A list of user tags that can help classify and index this resource. |
| topic_distribution | Object | An object whose result key contains a list of numbers giving the probability that the document belongs to each topic. The topics are listed in the same order as the topics in the topic model. |
|
topicmodel
filterable, sortable |
String | The topicmodel/id of the topic model that was used to create the topic distribution. |
|
topicmodel_status
filterable, sortable |
Boolean | Whether the topic model is still available or has been deleted. |
| topicmodel_type | Integer | Reserved for future use. |
|
updated
filterable, sortable |
ISO-8601 Datetime. | This is the date and time in which the topic distribution was updated with microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are provided in Coordinated Universal Time (UTC). |
| webhook | Object | A webhook url and an optional secret phrase. See the Section on Webhooks for more details. |
Topic Distribution Status
Creating a topic distribution is a near real-time process that takes just a few seconds, depending on whether the corresponding topic model has been used recently and on the workload of BigML's systems. The topic distribution goes through a number of states until it is fully completed. Through the status field in the topic distribution you can determine when it has been fully processed and is ready to be used. Most of the time, topic distributions are fully processed and the output is returned in the first call. These are the properties that a topic distribution's status has:
| Property | Type | Description |
|---|---|---|
| bad_fields | Array | An array of field ids whose submitted values were invalid. Bad fields are ignored. That is, if you submit an invalid value, a topic distribution is created anyway, ignoring the input field with the invalid value. |
| code | Integer | A status code that reflects the status of the topic distribution creation. It can be any of those that are explained here. |
| elapsed | Integer | Number of milliseconds that BigML.io took to process the topic distribution. |
| message | String | A human readable message explaining the status. |
| progress | Float, between 0 and 1 | How far BigML.io has progressed building the topic distribution. |
| unknown_fields | Array | An array of field ids that were submitted in the input_data but were not recognized. Unknown fields are ignored. That is, if you submit an unrecognized field, a topic distribution is created anyway, ignoring that input field. |
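Since bad and unknown fields are silently ignored, it is worth checking the status object before trusting a topic distribution. A minimal sketch, where doc stands in for the JSON response shown earlier and the unknown field id is hypothetical:

```python
def ignored_fields(doc):
    """Return the ids of input fields that BigML.io ignored, whether
    because their values were invalid (bad_fields) or because the field
    was not recognized (unknown_fields)."""
    status = doc["status"]
    return status.get("bad_fields", []) + status.get("unknown_fields", [])

# A stand-in for a topic distribution response; "00000z" is hypothetical.
doc = {"status": {"code": 5, "bad_fields": [], "unknown_fields": ["00000z"]}}
print(ignored_fields(doc))
```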
Updating a Topic Distribution
To update a topic distribution, you need to PUT an object containing the fields that you want to update to the topic distribution's URL. The content-type must always be "application/json". If the request succeeds, BigML.io will return an HTTP 202 response with the updated topic distribution.
For example, to update a topic distribution with a new name you can use curl like this:
curl "https://au.bigml.io/topicdistribution/57d9c1b84e17272411000009?$BIGML_AUTH" \
-X PUT \
-H 'content-type: application/json' \
-d '{"name": "a new name"}'
$ Updating a topic distribution's name
Deleting a Topic Distribution
To delete a topic distribution, you need to issue an HTTP DELETE request to the topicdistribution/id to be deleted.
Using curl you can do something like this to delete a topic distribution:
curl -X DELETE "https://au.bigml.io/topicdistribution/57d9c1b84e17272411000009?$BIGML_AUTH"
$ Deleting a topic distribution from the command line
If the request succeeds you will not see anything on the command line unless you executed the command in verbose mode. Successful DELETEs will return "204 no content" responses with no body.
Once you delete a topic distribution, it is permanently deleted. That is, a delete request cannot be undone. If you try to delete a topic distribution a second time, or a topic distribution that does not exist, you will receive a "404 not found" response.
However, if you try to delete a topic distribution that is being used at the moment, then BigML.io will not accept the request and will respond with a "400 bad request" response.
See this section for more details.
Listing Topic Distributions
To list all the topic distributions, you can use the topicdistribution base URL. By default, only the 20 most recent topic distributions will be returned. You can change this number using the limit parameter.
You can get your list of topic distributions directly in your browser using your own username and API key with the following links.
https://au.bigml.io/topicdistribution?$BIGML_AUTH
> Listing topic distributions from a browser
Batch Topic Distributions
Last Updated: Tuesday, 2019-01-29 16:28
A batch topic distribution provides an easy way to compute a topic distribution for each instance in a dataset in only one request. To create a new batch topic distribution you need a topicmodel/id and a dataset/id.
Batch topic distributions are created asynchronously. You can retrieve the associated resource to check the progress and status, in a similar fashion to the rest of BigML.io resources. Additionally, once a batch topic distribution is finished, you can download a csv file that contains all the topic distributions by appending "/download" to the batch topic distribution URL. You can also set output_dataset to true to automatically generate a new dataset with the results.
BigML.io gives you a number of options to tailor the format of the csv file containing the topic distributions. For example, you can set the "separator" (e.g., ";") and choose whether the file should have a "header" or not. You can read about all the available options below.
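As noted above, the download URL is just the batch topic distribution URL with "/download" appended. A minimal sketch building that URL for the example resource id used later in this section:

```python
# Build the download URL for a finished batch topic distribution.
# The resource id is the example one used in this section; your own
# resource id will differ. Authentication ($BIGML_AUTH) would still be
# appended as a query string when fetching the file.
base = "https://au.bigml.io"
resource = "batchtopicdistribution/57db8107b8aa0940d5b61138"
download_url = "%s/%s/download" % (base, resource)
print(download_url)
```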
BigML.io allows you to create, retrieve, update, and delete batch topic distributions. You can also list all of your batch topic distributions.
Jump to:
- Batch Topic Distribution Base URL
- Creating a Batch Topic Distribution
- Batch Topic Distribution Arguments
- Retrieving a Batch Topic Distribution
- Batch Topic Distribution Properties
- Updating a Batch Topic Distribution
- Deleting a Batch Topic Distribution
- Listing Batch Topic Distributions
Batch Topic Distribution Base URL
You can use the following base URL to create, retrieve, update, and delete batch topic distributions. https://au.bigml.io/batchtopicdistribution
Batch Topic Distribution base URL
All requests to manage your batch topic distributions must use HTTPS and be authenticated using your username and API key to verify your identity. See this section for more details.
Creating a Batch Topic Distribution
To create a new batch topic distribution, you need to POST to the batch topic distribution base URL an object containing at least the topicmodel/id that you want to use to compute topic distributions and the dataset/id of the dataset that contains the input data that will be used to compute topic distributions. BigML.io will compute a topic distribution for each instance in that dataset.
You can easily create a new batch topic distribution using curl as follows. Your authentication variable should be set up first as shown above.
curl "https://au.bigml.io/batchtopicdistribution?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"topicmodel": "topicmodel/5423625af0a5ea3eea000028",
"dataset": "dataset/54222a14f0a5eaaab000000c"}'
> Creating a batch topic distribution
BigML.io will return the newly created batch topic distribution if the request succeeded.
{
"all_fields":false,
"category":0,
"code":201,
"configuration":null,
"configuration_status":false,
"created":"2016-10-07T20:28:45.271549",
"credits":146.4,
"dataset":"dataset/57c3c7234e1727730e00000b",
"dataset_status":true,
"description":"",
"fields_map":{},
"header":true,
"locale":"en-US",
"name":"Batch Topic Distribution of SMS Topic Model with Airlines dataset",
"newline":"LF",
"output_dataset":true,
"output_dataset_resource":null,
"output_dataset_status":false,
"output_fields":[],
"private":true,
"project":null,
"resource": "batchtopicdistribution/57db8107b8aa0940d5b61138",
"rows":14640,
"separator":",",
"shared":false,
"size":3428874,
"status":{
"code":1,
"message":"The batch topic distribution is being processed and will be performed soon"
},
"subscription":true,
"tags":[],
"topicmodel":"topicmodel/57f805624e17276795000000",
"topicmodel_status":true,
"topicmodel_type":0,
"type":0,
"updated":"2016-10-07T20:28:45.271682"
}
< Example batch topic distribution JSON response
Batch Topic Distribution Arguments
In addition to the topicmodel and the dataset, you can also POST the following arguments.
| Argument | Type | Description |
|---|---|---|
|
all_fields
optional |
Boolean, default is false |
Whether all the fields from the dataset should be part of the generated csv file together with the topic distributions.
Example: true |
|
category
optional |
Integer, default is the category of the model |
The category that best describes the batch topic distribution. See the category codes for the complete list of categories.
Example: 1 |
| dataset | String |
A valid dataset/id.
Example: dataset/4f66a80803ce8940c5000006 |
|
default_numeric_value
optional |
String |
It accepts any of the following strings to substitute missing numeric values across all the numeric fields in the dataset: "mean", "median", "minimum", "maximum", "zero"
Example: "median" |
|
description
optional |
String |
A description of the batch topic distribution up to 8192 characters long.
Example: "This is a description of my new batch topic distribution" |
|
excluded_fields
optional |
Array, default is [], an empty list. None of the fields in the dataset |
Specifies the fields in the dataset to be excluded when creating the batch topic distribution.
Example:
|
|
fields_map
optional |
Object |
A dictionary mapping the identifiers of the topic model's fields to their corresponding identifiers in the input dataset.
Example: {"000000":"00000a", "000001":"000002", "000002":"000001", "000003":"000020", "000004":"000004"} |
|
header
optional |
Boolean, default is true |
Whether the csv file should have a header with the name of each field.
Example: true |
|
input_fields
optional |
Array, default is []. All the fields in the dataset |
Specifies the fields in the dataset to be considered to create the batch topic distribution.
Example:
|
|
name
optional |
String, default is dataset's name |
The name you want to give to the new batch topic distribution.
Example: "my new batch topic distribution" |
|
newline
optional |
String, default is "LF" |
The new line character that you want to get as line break in the generated csv file: "LF", "CRLF".
Example: "CRLF" |
|
output_dataset
optional |
Boolean, default is false |
Whether a dataset with the results should be automatically created or not.
Example: true |
|
output_fields
optional |
Array, default is []. None of the fields in the dataset |
Specifies the fields to be included in the csv file. It can be a list of field ids or names.
Example:
|
|
project
optional |
String |
The project/id you want the batch topic distribution to belong to.
Example: "project/54d98718f0a5ea0b16000000" |
|
separator
optional |
Char, default is "," |
The separator that you want to get between fields in the generated csv file.
Example: ";" |
|
tags
optional |
Array of Strings |
A list of strings that help classify and index your batch topic distribution.
Example: ["best customers", "2018"] |
| topicmodel | String |
A valid topicmodel/id.
Example: topicmodel/57c3c7404e1727730e00000e |
|
webhook
optional |
Object |
A webhook url and an optional secret phrase. See the Section on Webhooks for more details.
Example:
|
You can also use curl to customize a new batch topic distribution. For example, to create a new batch topic distribution named "my batch topic distribution" that includes a header and outputs all fields together with the probability for each topic:
curl "https://au.bigml.io/batchtopicdistribution?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"topicmodel": "topicmodel/5423625af0a5ea3eea000028",
"dataset": "dataset/54222a14f0a5eaaab000000c",
"name": "my batch topic distribution",
"header": true,
"all_fields": true,
"distribution_name": "Topic Distribution"
}'
> Creating a customized batch topic distribution
If you do not specify a name, BigML.io will assign to the new batch topic distribution a combination of the dataset's name and the topic model's name. If you do not specify any fields_map, BigML.io will use a direct map of all the fields in the dataset.
Retrieving a Batch Topic Distribution
Each batch topic distribution has a unique identifier in the form "batchtopicdistribution/id" where id is a string of 24 alpha-numeric characters that you can use to retrieve the batch topic distribution.
To retrieve a batch topic distribution with curl:
curl "https://au.bigml.io/batchtopicdistribution/57db8107b8aa0940d5b61138?$BIGML_AUTH"
$ Retrieving a batch topic distribution from the command line
You can also use your browser to visualize the batch topic distribution using the full BigML.io URL or pasting the batchtopicdistribution/id into the BigML.com.au dashboard.
Batch Topic Distribution Properties
Once a batch topic distribution has been successfully created it will have the following properties.
| Property | Type | Description |
|---|---|---|
|
all_fields
filterable, sortable |
Boolean | Whether the batch topic distribution contains all the fields in the dataset used as an input. |
|
category
filterable, sortable, updatable |
Integer | One of the categories in the table of categories that help classify this resource according to the domain of application. |
| code | Integer | HTTP status code. This will be 201 upon successful creation of the batch topic distribution and 200 afterwards. Make sure that you check the code that comes with the status attribute to make sure that the batch topic distribution creation has been completed without errors. |
|
created
filterable, sortable |
ISO-8601 Datetime. | This is the date and time in which the batch topic distribution was created with microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are provided in Coordinated Universal Time (UTC). |
|
credits
filterable, sortable |
Float | The number of credits it cost you to create this batch topic distribution. |
|
dataset
filterable, sortable |
String | The dataset/id that was used to build the batch topic distribution. |
|
dataset_status
filterable, sortable |
Boolean | Whether the dataset is still available or has been deleted. |
|
description
updatable |
String | A text describing the batch topic distribution. It can contain restricted markdown to decorate the text. |
| excluded_fields | Array | The list of field ids that were excluded when building the batch topic distribution. |
| fields_map | Array | The map of dataset fields to topic model fields used. |
|
header
filterable, sortable |
Boolean | Whether the batch topic distribution file contains a header with the name of each field or not. |
| input_fields | Array | The list of input fields' ids used to create the batch topic distribution. |
| locale | String | The dataset's locale. |
|
name
filterable, sortable, updatable |
String | The name of the batch topic distribution. By default, it's based on the names of the model and the dataset used. |
| newline | String | The new line character used as line break in the file that contains the topic distributions. |
| output_dataset |
Boolean, default is false |
Whether a dataset with the results should be automatically created or not.
Example: true |
|
output_dataset_resource
filterable, sortable |
String | The dataset/id of the newly created dataset when output_dataset has been set to true. |
|
output_dataset_status
filterable, sortable |
Boolean | Whether the dataset generated as an output is still available or has been deleted. |
| output_fields | Array | The list of output field ids used to format the output csv file. |
|
private
filterable, sortable |
Boolean | Whether the batch topic distribution is public or not. |
|
project
filterable, sortable, updatable |
String | The project/id the resource belongs to. |
| resource | String | The batchtopicdistribution/id. |
|
rows
filterable, sortable |
Integer | The total number of instances in the dataset used as an input. |
| separator | Char | The separator used in the csv file that contains the topic distributions. |
|
shared
filterable, sortable |
Boolean | Whether the batch topic distribution has been shared via a private link. |
|
size
filterable, sortable |
Integer | The number of bytes of the dataset that was used to create the batch topic distribution. |
| status | Object | A description of the status of the batch topic distribution. It includes a code, a message, and some extra information. See the table below. |
|
subscription
filterable, sortable |
Boolean | Whether the batch topic distribution was created using a subscription plan or not. |
|
tags
filterable, updatable |
Array of Strings | A list of user tags that can help classify and index this resource. |
|
topicmodel
filterable, sortable |
String | The topicmodel/id of the topic model used to create the batch topic distribution. |
|
topicmodel_status
filterable, sortable |
Boolean | Whether the topic model is still available or has been deleted. |
| topicmodel_type | Integer | Reserved for future use. |
|
type
filterable, sortable |
Integer | Reserved for future use. |
|
updated
filterable, sortable |
ISO-8601 Datetime. | This is the date and time in which the batch topic distribution was updated with microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are provided in Coordinated Universal Time (UTC). |
| webhook | Object | A webhook url and an optional secret phrase. See the Section on Webhooks for more details. |
Batch Topic Distribution Status
Creating a batch topic distribution is a process that can take from a few seconds to a few hours, depending on the size of the input dataset and on the workload of BigML's systems. The batch topic distribution goes through a number of states until it is finished. Through the status field in the batch topic distribution you can determine when it has been fully processed. These are the properties that a batch topic distribution's status has:
| Property | Type | Description |
|---|---|---|
| code | Integer | A status code that reflects the status of the batch topic distribution creation. It can be any of those that are explained here. |
| elapsed | Integer | Number of milliseconds that BigML.io took to process the batch topic distribution. |
| message | String | A human readable message explaining the status. |
| progress | Float, between 0 and 1 | How far BigML.io has progressed building the batch topic distribution. |
Once a batch topic distribution has successfully finished, it will look like this:
{
"all_fields": false,
"category": 0,
"code": 200,
"configuration": null,
"configuration_status": false,
"created": "2016-10-07T20:28:45.271000",
"credits": 146.4,
"dataset": "dataset/57c3c7234e1727730e00000b",
"dataset_status": true,
"description": "",
"fields_map": {
"000000": "000000",
"000001": "000001",
"000002": "000002",
"000003": "000003",
"000004": "000004"
},
"header": true,
"locale": "en-US",
"name": "Batch Topic Distribution of SMS Topic Model with Airlines dataset",
"newline": "LF",
"output_dataset": true,
"output_dataset_resource": "dataset/57f8058d4e17276791000001",
"output_dataset_status": true,
"output_fields": [],
"private": true,
"project": null,
"resource": "batchtopicdistribution/57db8107b8aa0940d5b61138",
"rows": 14640,
"separator": ",",
"shared": false,
"size": 3428874,
"status": {
"code": 5,
"elapsed": 11823,
"message": "The batch topic distribution has been performed",
"progress": 1
},
"subscription": true,
"tags": [],
"topicmodel": "topicmodel/57f805624e17276795000000",
"topicmodel_status": true,
"topicmodel_type": 0,
"type": 0,
"updated": "2016-10-07T20:29:01.462000"
}
< Example batch topic distribution JSON response
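Because batch topic distributions are asynchronous, a client typically polls the resource until the status code reaches 5 (finished). A minimal polling sketch, assuming BigML's status convention (5 means finished, negative codes mean a failure) and a caller-supplied get_resource function standing in for the authenticated HTTPS GET shown above:

```python
import time

def wait_until_finished(get_resource, resource_id, interval=2, timeout=300):
    """Poll a batch topic distribution until its status code reaches 5
    (finished). `get_resource` is a caller-supplied function that returns
    the resource JSON; it stands in for an authenticated HTTPS GET."""
    waited = 0
    while waited <= timeout:
        doc = get_resource(resource_id)
        code = doc["status"]["code"]
        if code == 5:          # finished
            return doc
        if code < 0:           # faulty
            raise RuntimeError(doc["status"].get("message", "creation failed"))
        time.sleep(interval)
        waited += interval
    raise RuntimeError("timed out waiting for the batch topic distribution")
```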
Updating a Batch Topic Distribution
To update a batch topic distribution, you need to PUT an object containing the fields that you want to update to the batch topic distribution's URL. The content-type must always be "application/json". If the request succeeds, BigML.io will return an HTTP 202 response with the updated batch topic distribution.
For example, to update a batch topic distribution with a new name you can use curl like this:
curl "https://au.bigml.io/batchtopicdistribution/57db8107b8aa0940d5b61138?$BIGML_AUTH" \
-X PUT \
-H 'content-type: application/json' \
-d '{"name": "A new name"}'
$ Updating a batch topic distribution's name
Deleting a Batch Topic Distribution
To delete a batch topic distribution, you need to issue an HTTP DELETE request to the batchtopicdistribution/id to be deleted.
Using curl you can do something like this to delete a batch topic distribution:
curl -X DELETE "https://au.bigml.io/batchtopicdistribution/57db8107b8aa0940d5b61138?$BIGML_AUTH"
$ Deleting a batch topic distribution from the command line
If the request succeeds you will not see anything on the command line unless you executed the command in verbose mode. Successful DELETEs will return "204 no content" responses with no body.
Once you delete a batch topic distribution, it is permanently deleted. That is, a delete request cannot be undone. If you try to delete a batch topic distribution a second time, or a batch topic distribution that does not exist, you will receive a "404 not found" response.
However, if you try to delete a batch topic distribution that is being used at the moment, then BigML.io will not accept the request and will respond with a "400 bad request" response.
See this section for more details.
Listing Batch Topic Distributions
To list all the batch topic distributions, you can use the batchtopicdistribution base URL. By default, only the 20 most recent batch topic distributions will be returned. You can change this number using the limit parameter.
You can get your list of batch topic distributions directly in your browser using your own username and API key with the following links.
https://au.bigml.io/batchtopicdistribution?$BIGML_AUTH
> Listing batch topic distributions from a browser
Projections
Last Updated: Tuesday, 2019-01-29 16:28
A projection is created using a pca/id and the new instance (input_data) that you wish to project onto the component axes. This is done by first centering and scaling the input data using the same input transformations applied when the model was built, and then taking the inner products between the transformed input and the component vectors. Note that for text inputs, the centering and scaling is done using the mean and standard deviation values in the text_stats map in the PCA.
Consider a new data point for a PCA built on the iris dataset, whose component matrix (one row per component) is:
[
[0.89017, -0.46014, 0.99155, 0.96498],
[0.36083, 0.88271, 0.02341, 0.064],
[0.27566, -0.09362, -0.05445, -0.24298],
[-0.03761, 0.01778, 0.11535, -0.07536]
]
The projection onto each component is the inner product of the transformed input with the corresponding row. For the first component:
- PC1 = 0.89017 * sepal_length - 0.46014 * sepal_width + 0.99155 * petal_length + 0.96498 * petal_width
Computing this for every component yields a projection such as:
{
"PC1" : 3.40215,
"PC2" : 0.9997,
"PC3" : -0.24146,
"PC4" : -0.03071
}
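The same arithmetic can be reproduced client-side as inner products between the transformed input and each component row. A minimal sketch, where the scaled input vector is hypothetical (in the API, centering and scaling happen server-side):

```python
# Compute the projection of a (already centered and scaled) iris input as
# the inner product with each component row of the matrix shown above.
components = [
    [0.89017, -0.46014, 0.99155, 0.96498],
    [0.36083, 0.88271, 0.02341, 0.064],
    [0.27566, -0.09362, -0.05445, -0.24298],
    [-0.03761, 0.01778, 0.11535, -0.07536],
]
# Hypothetical scaled input: [sepal_length, sepal_width, petal_length, petal_width]
x = [1.0, 0.5, 2.0, 1.5]

projection = {"PC%d" % (i + 1): sum(c * v for c, v in zip(row, x))
              for i, row in enumerate(components)}
print(projection)
```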
BigML.io allows you to create, retrieve, update, and delete projections. You can also list all of your projections.
Jump to:
- Projection Base URL
- Creating a Projection
- Projection Arguments
- Retrieving a Projection
- Projection Properties
- Updating a Projection
- Deleting a Projection
- Listing Projections
Projection Base URL
You can use the following base URL to create, retrieve, update, and delete projections. https://au.bigml.io/projection
Projection base URL
All requests to manage your projections must use HTTPS and be authenticated using your username and API key to verify your identity. See this section for more details.
Creating a Projection
To create a new projection, you need to POST to the projection base URL an object containing at least the pca/id that you want to use to find the projection and an instance. For example, you can easily create a new projection using curl as follows:
curl "https://au.bigml.io/projection?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"pca": "pca/56f5ecfa4e17275f4400015b", "input_data": {"000001": 3}}'
> Creating a projection
BigML.io will return the newly created projection if the request succeeded.
{
"category": 0,
"code": 201,
"configuration": null,
"configuration_status": false,
"created": "2018-12-10T18:10:04.106243",
"creator": "leon1",
"credits": 0.01,
"dataset": "dataset/5948be794e1727307a000000",
"dataset_status": true,
"description": "",
"input_data": {
"000000": 3
},
"locale": "en-us",
"name": "iris",
"name_options": "",
"pca": "pca/5c0eab424e1727c143000006",
"pca_status": true,
"pca_type": 0,
"private": true,
"project": null,
"projection": {
"result": {
"PC1": 0,
"PC2": 0.00001,
"PC3": 0.00002,
"PC4": 0,
"PC5": 0
}
},
"query_string": "",
"resource": "projection/50d3a7f63c192025e1000001",
"shared": false,
"source": "source/5948be694e17273079000000",
"source_status": true,
"status": {
"code": 5,
"elapsed": 2441,
"message": "The projection has been created",
"progress": 1
},
"subscription": true,
"tags": [],
"type": 0,
"updated": "2018-12-10T18:10:04.106433"
}
< Example projection JSON response
Projection Arguments
In addition to the pca and the input_data, you can also POST the following arguments.
| Argument | Type | Description |
|---|---|---|
|
category
optional |
Integer, default is the category of the model |
The category that best describes the projection. See the category codes for the complete list of categories.
Example: 1 |
|
description
optional |
String |
A description of the projection up to 8192 characters long.
Example: "This is a description of my new projection" |
| input_data | Object |
An object with field id/value pairs representing the instance you want to project. You can use either field ids or field names as keys in your input_data.
Example: {"00000a": "otisday", "00000d": "washington"} |
|
max_components
optional |
Integer |
The maximum number of components to load for prediction.
Example: 10 |
|
name
optional |
String, default is Projection |
The name you want to give to the new projection.
Example: "my new projection" |
| pca | String |
A valid pca/id.
Example: pca/57c3c7404e1727730e00000e |
|
project
optional |
String |
The project/id you want the projection to belong to.
Example: "project/54d98718f0a5ea0b16000000" |
|
tags
optional |
Array of Strings |
A list of strings that help classify and index your projection.
Example: ["best customers", "2018"] |
|
variance_threshold
optional |
Float |
The prediction uses the minimum number of components such that the cumulative explained variance is greater than the given threshold. If both max_components and variance_threshold are given, the value for max_components will be used.
Example: 0.95 |
|
webhook
optional |
Object |
A webhook url and an optional secret phrase. See the Section on Webhooks for more details.
Example:
|
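How max_components and variance_threshold interact can be sketched in Python. The function below is an illustration of the selection rule described in the table, and the explained-variance ratios fed to it are made up; in practice they come from the PCA resource itself:

```python
def components_to_use(explained_variance, max_components=None, variance_threshold=None):
    # explained_variance: per-component explained-variance ratios, ordered
    # from largest to smallest (they sum to 1.0). Returns how many leading
    # components the projection should use.
    n = len(explained_variance)
    if max_components is not None:
        # When both arguments are given, max_components wins.
        return min(max_components, n)
    if variance_threshold is not None:
        cumulative = 0.0
        for i, ratio in enumerate(explained_variance, start=1):
            cumulative += ratio
            if cumulative > variance_threshold:
                # Minimum count whose cumulative variance exceeds the threshold.
                return i
    # Assumed default when neither option is set: use every component.
    return n
```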
You can use curl to customize new projections. For example, to create a new projection named "my projection":
curl "https://au.bigml.io/projection?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"pca": "pca/56f5ecfa4e17275f4400015b",
"input_data": {"000001": 3},
"name": "my projection",
"tags":["my", "tags"]}'
> Creating a customized projection
If you do not specify a name, BigML.io will assign a default name to the new projection.
Retrieving a Projection
Each projection has a unique identifier in the form "projection/id" where id is a string of 24 alpha-numeric characters that you can use to retrieve the projection.
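A client-side sanity check mirroring the identifier format described above (the resource-type prefix followed by 24 alpha-numeric characters) could look like this sketch; the same pattern applies to the other resource types in this document:

```python
import re

# The documentation describes ids as "projection/" followed by
# 24 alpha-numeric characters; this check mirrors that description.
PROJECTION_ID = re.compile(r"^projection/[a-zA-Z0-9]{24}$")

def is_projection_id(value):
    return bool(PROJECTION_ID.match(value))
```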
To retrieve a projection with curl:
curl "https://au.bigml.io/projection/50d3a7f63c192025e1000001?$BIGML_AUTH"
$ Retrieving a projection from the command line
You can also use your browser to visualize the projection using the full BigML.io URL or pasting the projection/id into the BigML.com.au dashboard.
Projection Properties
Once a projection has been successfully created it will have the following properties.
| Property | Type | Description |
|---|---|---|
|
category
filterable, sortable, updatable |
Integer | One of the categories in the table of categories that help classify this resource according to the domain of application. |
| code | Integer | HTTP status code. This will be 201 upon successful creation of the projection and 200 afterwards. Make sure that you check the code that comes with the status attribute to make sure that the projection creation has been completed without errors. |
|
created
filterable, sortable |
ISO-8601 Datetime. | This is the date and time in which the projection was created with microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are provided in Coordinated Universal Time (UTC). |
|
credits
filterable, sortable |
Float | The number of credits it cost you to create this projection. |
|
dataset
filterable, sortable |
String | The dataset/id that was used to build the projection. |
|
dataset_status
filterable, sortable |
Boolean | Whether the dataset is still available or has been deleted. |
|
description
updatable |
String | A text describing the projection. It can contain restricted markdown to decorate the text. |
| input_data | Object | The dictionary of input fields' ids and values used as input for the projection. |
| locale | String | The dataset's locale. |
|
name
filterable, sortable, updatable |
String | The name of the projection as you provided or Projection by default. |
|
pca
filterable, sortable |
String | The pca/id of the PCA that was used to create the projection. |
|
pca_status
filterable, sortable |
Boolean | Whether the PCA is still available or has been deleted. |
| pca_type | Integer | Reserved for future use. |
|
private
filterable, sortable, updatable |
Boolean | Whether the projection is public or not. |
|
project
filterable, sortable, updatable |
String | The project/id the resource belongs to. |
| projection | Object | It includes the result object as well as max_components and variance_threshold. |
| query_string | String | The query string that was used to filter the PCA. |
| resource | String | The projection/id. |
|
source
filterable, sortable |
String | The source/id that was used to build the dataset. |
|
source_status
filterable, sortable |
Boolean | Whether the source is still available or has been deleted. |
| status | Object | A description of the status of the projection. It includes a code, a message, and some extra information. See the table below. |
|
subscription
filterable, sortable |
Boolean | Whether the projection was created using a subscription plan or not. |
|
tags
filterable, updatable |
Array of Strings | A list of user tags that can help classify and index this resource. |
|
updated
filterable, sortable |
ISO-8601 Datetime. | This is the date and time in which the projection was updated with microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are provided in Coordinated Universal Time (UTC). |
| webhook | Object | A webhook url and an optional secret phrase. See the Section on Webhooks for more details. |
Projection Status
Creating a projection is a near real-time process that takes just a few seconds, depending on whether the corresponding PCA has been used recently and on the workload of BigML's systems. The projection goes through a number of states until it is fully completed. Through the status field in the projection you can determine when it has been fully processed and is ready to be used. Most of the time, projections are fully processed and the output is returned in the first call. These are the properties of a projection's status:
| Property | Type | Description |
|---|---|---|
| bad_fields | Array | An array of ids of fields with wrong values submitted. Bad fields are ignored. That is, if you submit a wrong value, a projection is created anyway, ignoring the input field with the wrong value. |
| code | Integer | A status code that reflects the status of the projection creation. It can be any of those that are explained here. |
| elapsed | Integer | Number of milliseconds that BigML.io took to process the projection. |
| message | String | A human readable message explaining the status. |
| progress | Float, between 0 and 1 | How far BigML.io has progressed building the projection. |
| unknown_fields | Array | An array of ids of fields that were submitted in the input_data and were not recognized. Unknown fields are ignored. That is, if you submit an unrecognized field, a projection is created anyway, ignoring the wrong input field. |
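Since a projection reports its state through the status field, a client can poll until processing completes. A minimal sketch, assuming fetch is any callable that returns the resource JSON (for example, a wrapper around the GET request shown in the retrieval section), that code 5 means finished, and that negative codes mean errors:

```python
import time

FINISHED = 5  # assumed completion code; negative codes indicate an error

def wait_until_done(fetch, sleep_seconds=1, max_polls=60):
    # fetch() returns the current resource JSON, e.g. via a GET on projection/id.
    for _ in range(max_polls):
        resource = fetch()
        code = resource["status"]["code"]
        if code == FINISHED:
            return resource
        if code < 0:
            raise RuntimeError(resource["status"].get("message", "creation failed"))
        time.sleep(sleep_seconds)
    raise TimeoutError("resource was not ready after %d polls" % max_polls)
```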
Updating a Projection
To update a projection, you need to PUT an object containing the fields that you want to update to the projection's URL. The content-type must always be: "application/json". If the request succeeds, BigML.io will return an HTTP 202 response with the updated projection.
For example, to update a projection with a new name you can use curl like this:
curl "https://au.bigml.io/projection/50d3a7f63c192025e1000001?$BIGML_AUTH" \
-X PUT \
-H 'content-type: application/json' \
-d '{"name": "a new name"}'
$ Updating a projection's name
Deleting a Projection
To delete a projection, you need to issue an HTTP DELETE request to the projection/id to be deleted.
Using curl you can do something like this to delete a projection:
curl -X DELETE "https://au.bigml.io/projection/50d3a7f63c192025e1000001?$BIGML_AUTH"
$ Deleting a projection from the command line
If the request succeeds you will not see anything on the command line unless you executed the command in verbose mode. Successful DELETEs will return "204 no content" responses with no body.
Once you delete a projection, it is permanently deleted. That is, a delete request cannot be undone. If you try to delete a projection a second time, or a projection that does not exist, you will receive a "404 not found" response.
However, if you try to delete a projection that is being used at the moment, then BigML.io will not accept the request and will respond with a "400 bad request" response.
See this section for more details.
Listing Projections
To list all the projections, you can use the projection base URL. By default, only the 20 most recent projections will be returned. You can see below how to change this number using the limit parameter.
You can get your list of projections directly in your browser using your own username and API key with the following link.
https://au.bigml.io/projection?$BIGML_AUTH
> Listing projections from a browser
Batch Projections
Last Updated: Tuesday, 2019-01-29 16:28
A batch projection provides an easy way to compute a projection for each instance in a dataset in only one request. To create a new batch projection you need a pca/id and a dataset/id.
Batch projections are created asynchronously. You can retrieve the associated resource to check the progress and status in a similar fashion to the rest of BigML.io resources. Additionally, once a batch projection is finished you can also download a csv file that contains all the projections just appending "/download" to the batch projection URL. You can also set output_dataset to true to automatically generate a new dataset with the results.
BigML.io gives you a number of options to tailor the format of the csv file containing the projections. For example, you can set the "separator" (e.g., ";") and choose whether the file should have a "header" or not. You can read about all the available options below.
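The effect of the separator, header, and newline options can be sketched locally; format_projections below is an illustrative helper, not part of the API, and the field names and rows are made up:

```python
def format_projections(fieldnames, rows, separator=",", header=True, newline="LF"):
    # Mirror the csv-formatting options of a batch projection: the
    # separator between fields, an optional header row, and the
    # line-break style ("LF" or "CRLF").
    eol = "\r\n" if newline == "CRLF" else "\n"
    lines = []
    if header:
        lines.append(separator.join(fieldnames))
    for row in rows:
        lines.append(separator.join(str(v) for v in row))
    return eol.join(lines) + eol
```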
BigML.io allows you to create, retrieve, update, and delete your batch projections. You can also list all of your batch projections.
Jump to:
- Batch Projection Base URL
- Creating a Batch Projection
- Batch Projection Arguments
- Retrieving a Batch Projection
- Batch Projection Properties
- Updating a Batch Projection
- Deleting a Batch Projection
- Listing Batch Projections
Batch Projection Base URL
You can use the following base URL to create, retrieve, update, and delete batch projections. https://au.bigml.io/batchprojection
Batch Projection base URL
All requests to manage your batch projections must use HTTPS and be authenticated using your username and API key to verify your identity. See this section for more details.
Creating a Batch Projection
To create a new batch projection, you need to POST to the batch projection base URL an object containing at least the pca/id that you want to use to compute projections and the dataset/id of the dataset that contains the input data that will be used to compute projections. BigML.io will compute a projection for each instance in that dataset.
You can easily create a new batch projection using curl as follows. Your authentication variable should be set up first as shown above.
curl "https://au.bigml.io/batchprojection?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"pca": "pca/5423625af0a5ea3eea000028",
"dataset": "dataset/54222a14f0a5eaaab000000c"}'
> Creating a batch projection
BigML.io will return the newly created batch projection if the request succeeded.
{
"all_fields": false,
"category": 0,
"code": 201,
"configuration": null,
"configuration_status": false,
"created": "2018-12-10T18:12:51.741903",
"creator": "leon1",
"credits": 0.00057220458984375,
"dataset": "dataset/5b99f5cb4e1727a593000000",
"dataset_status": true,
"description": "",
"excluded_fields": [],
"fields_map": {},
"header": true,
"input_fields": [],
"locale": "en-US",
"max_components": null,
"name": "Batch Projection of New Iris Dataset",
"name_options": "use all fields",
"newline": "LF",
"output_dataset": true,
"output_dataset_resource": null,
"output_dataset_status": false,
"output_fields": [],
"pca": "pca/5c0eab424e1727c143000006",
"pca_status": true,
"pca_type": 0,
"private": true,
"project": null,
"resource": "batchprojection/5728be2d4e172748db000000",
"rows": 150,
"separator": ",",
"shared": false,
"size": 4608,
"status": {
"code": 1,
"message": "The batch projection creation request has been queued and will be processed soon",
"progress": 0
},
"subscription": true,
"tags": [],
"type": 0,
"updated": "2018-12-10T18:12:51.742051",
"variance_threshold": null
}
< Example batch projection JSON response
Batch Projection Arguments
In addition to the pca and the dataset, you can also POST the following arguments.
| Argument | Type | Description |
|---|---|---|
|
all_fields
optional |
Boolean, default is false |
Whether all the fields from the dataset should be part of the generated csv file together with the projections.
Example: true |
|
category
optional |
Integer, default is the category of the model |
The category that best describes the batch projection. See the category codes for the complete list of categories.
Example: 1 |
| dataset | String |
A valid dataset/id.
Example: dataset/4f66a80803ce8940c5000006 |
|
default_numeric_value
optional |
String |
It accepts any of the following strings to substitute missing numeric values across all the numeric fields in the dataset: "mean", "median", "minimum", "maximum", "zero"
Example: "median" |
|
description
optional |
String |
A description of the batch projection up to 8192 characters long.
Example: "This is a description of my new batch projection" |
|
excluded_fields
optional |
Array, default is [], an empty list. None of the fields in the dataset |
Specifies the fields in the dataset to be excluded to create the batch projection.
Example:
|
|
fields_map
optional |
Object |
A dictionary of identifiers of the fields in the PCA mapped to their corresponding identifiers in the input dataset.
Example: {"000000":"00000a", "000001":"000002", "000002":"000001", "000003":"000020", "000004":"000004"} |
|
header
optional |
Boolean, default is true |
Whether the csv file should have a header with the name of each field.
Example: true |
|
input_fields
optional |
Array, default is []. All the fields in the dataset |
Specifies the fields in the dataset to be considered to create the batch projection.
Example:
|
|
max_components
optional |
Integer |
The maximum number of components to load for prediction.
Example: 10 |
|
name
optional |
String, default is dataset's name |
The name you want to give to the new batch projection.
Example: "my new batch projection" |
|
newline
optional |
String, default is "LF" |
The new line character that you want to get as line break in the generated csv file: "LF", "CRLF".
Example: "CRLF" |
|
output_dataset
optional |
Boolean, default is false |
Whether a dataset with the results should be automatically created or not.
Example: true |
|
output_fields
optional |
Array, default is []. None of the fields in the dataset |
Specifies the fields to be included in the csv file. It can be a list of field ids or names.
Example:
|
| pca | String |
A valid pca/id.
Example: pca/57c3c7404e1727730e00000e |
|
project
optional |
String |
The project/id you want the batch projection to belong to.
Example: "project/54d98718f0a5ea0b16000000" |
|
separator
optional |
Char, default is "," |
The separator that you want to get between fields in the generated csv file.
Example: ";" |
|
tags
optional |
Array of Strings |
A list of strings that help classify and index your batch projection.
Example: ["best customers", "2018"] |
|
variance_threshold
optional |
Float |
The prediction uses the minimum number of components such that the cumulative explained variance is greater than the given threshold. If both max_components and variance_threshold are given, the value for max_components will be used.
Example: 0.95 |
|
webhook
optional |
Object |
A webhook url and an optional secret phrase. See the Section on Webhooks for more details.
Example:
|
You can also use curl to customize a new batch projection. For example, to create a new batch projection named "my batch projection" that includes a header and outputs all the fields in the dataset together with the projections:
curl "https://au.bigml.io/batchprojection?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"pca": "pca/5423625af0a5ea3eea000028",
"dataset": "dataset/54222a14f0a5eaaab000000c",
"name": "my batch projection",
"header": true,
"all_fields": true
}'
> Creating a customized batch projection
If you do not specify a name, BigML.io will assign the new batch projection a name combining the dataset's and the PCA's names. If you do not specify any fields_map, BigML.io will use a direct map of all the fields in the dataset.
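The fields_map mechanics can be sketched in Python: PCA field ids are mapped to dataset field ids, and with no map each field is assumed to map to itself. apply_fields_map and the ids below are illustrative only:

```python
def apply_fields_map(instance, fields_map=None):
    # instance: {dataset_field_id: value}. fields_map translates PCA
    # field ids to dataset field ids; with no map, ids are assumed to
    # coincide (the "direct map" default described above).
    if not fields_map:
        return dict(instance)
    return {
        pca_id: instance[dataset_id]
        for pca_id, dataset_id in fields_map.items()
        if dataset_id in instance
    }
```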
Retrieving a Batch Projection
Each batch projection has a unique identifier in the form "batchprojection/id" where id is a string of 24 alpha-numeric characters that you can use to retrieve the batch projection.
To retrieve a batch projection with curl:
curl "https://au.bigml.io/batchprojection/5728be2d4e172748db000000?$BIGML_AUTH"
$ Retrieving a batch projection from the command line
You can also use your browser to visualize the batch projection using the full BigML.io URL or pasting the batchprojection/id into the BigML.com.au dashboard.
Batch Projection Properties
Once a batch projection has been successfully created it will have the following properties.
| Property | Type | Description |
|---|---|---|
|
all_fields
filterable, sortable |
Boolean | Whether the batch projection contains all the fields in the dataset used as an input. |
|
category
filterable, sortable, updatable |
Integer | One of the categories in the table of categories that help classify this resource according to the domain of application. |
| code | Integer | HTTP status code. This will be 201 upon successful creation of the batch projection and 200 afterwards. Make sure that you check the code that comes with the status attribute to make sure that the batch projection creation has been completed without errors. |
|
created
filterable, sortable |
ISO-8601 Datetime. | This is the date and time in which the batch projection was created with microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are provided in Coordinated Universal Time (UTC). |
|
credits
filterable, sortable |
Float | The number of credits it cost you to create this batch projection. |
|
dataset
filterable, sortable |
String | The dataset/id that was used to build the batch projection. |
|
dataset_status
filterable, sortable |
Boolean | Whether the dataset is still available or has been deleted. |
|
description
updatable |
String | A text describing the batch projection. It can contain restricted markdown to decorate the text. |
| excluded_fields | Array | The list of field ids that were excluded when building the batch projection. |
| fields_map | Array | The map of dataset fields to PCA fields used. |
|
header
filterable, sortable |
Boolean | Whether the batch projection file contains a header with the name of each field or not. |
| input_fields | Array | The list of input fields' ids used to create the batch projection. |
| locale | String | The dataset's locale. |
| max_components | Integer | The maximum number of components to load for prediction. |
|
name
filterable, sortable, updatable |
String | The name of the batch projection. By default, it is based on the names of the PCA and the dataset used. |
| newline | String | The new line character used as line break in the file that contains the projections. |
| output_dataset |
Boolean, default is false |
Whether a dataset with the results should be automatically created or not.
Example: true |
|
output_dataset_resource
filterable, sortable |
String | The dataset/id of the newly created dataset when output_dataset has been set to true. |
|
output_dataset_status
filterable, sortable |
Boolean | Whether the dataset generated as an output is still available or has been deleted. |
| output_fields | Array | The list of output field ids used to format the output csv file. |
|
pca
filterable, sortable |
String | The pca/id of the PCA used to create the batch projection. |
|
pca_status
filterable, sortable |
Boolean | Whether the PCA is still available or has been deleted. |
| pca_type | Integer | Reserved for future use. |
|
private
filterable, sortable |
Boolean | Whether the batch projection is public or not. |
|
project
filterable, sortable, updatable |
String | The project/id the resource belongs to. |
| resource | String | The batchprojection/id. |
|
rows
filterable, sortable |
Integer | The total number of instances in the dataset used as an input. |
| separator | Char | The separator used in the csv file that contains the projections. |
|
shared
filterable, sortable |
Boolean | Whether the batch projection has been shared via a private link. |
|
size
filterable, sortable |
Integer | The number of bytes of the dataset that was used to create the batch projection. |
| status | Object | A description of the status of the batch projection. It includes a code, a message, and some extra information. See the table below. |
|
subscription
filterable, sortable |
Boolean | Whether the batch projection was created using a subscription plan or not. |
|
tags
filterable, updatable |
Array of Strings | A list of user tags that can help classify and index this resource. |
|
type
filterable, sortable |
Integer | Reserved for future use. |
|
updated
filterable, sortable |
ISO-8601 Datetime. | This is the date and time in which the batch projection was updated with microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are provided in Coordinated Universal Time (UTC). |
| variance_threshold | Float | The prediction uses the minimum number of components such that the cumulative explained variance is greater than the given threshold. |
| webhook | Object | A webhook url and an optional secret phrase. See the Section on Webhooks for more details. |
Batch Projection Status
Creating a batch projection is a process that can take just a few seconds or a few hours, depending on the size of the dataset used as input and on the workload of BigML's systems. The batch projection goes through a number of states until it is finished. Through the status field in the batch projection you can determine when it has been fully processed. These are the properties of a batch projection's status:
| Property | Type | Description |
|---|---|---|
| code | Integer | A status code that reflects the status of the batch projection creation. It can be any of those that are explained here. |
| elapsed | Integer | Number of milliseconds that BigML.io took to process the batch projection. |
| message | String | A human readable message explaining the status. |
| progress | Float, between 0 and 1 | How far BigML.io has progressed building the batch projection. |
Once the batch projection has finished successfully, it will look like this:
{
"all_fields": false,
"category": 0,
"code": 200,
"configuration": null,
"configuration_status": false,
"created": "2018-12-10T18:12:51.741000",
"creator": "leon1",
"credits": 0.00057220458984375,
"dataset": "dataset/5b99f5cb4e1727a593000000",
"dataset_status": true,
"description": "",
"excluded_fields": [],
"fields_map": {
"000000": "000000",
"000001": "000001",
"000002": "000002",
"000003": "000003",
"000004": "000004"
},
"header": true,
"input_fields": [
"000000",
"000001",
"000002",
"000003",
"000004"
],
"locale": "en-US",
"max_components": null,
"name": "Batch Projection of New Iris Dataset",
"name_options": "use all fields",
"newline": "LF",
"output_dataset": true,
"output_dataset_resource": "dataset/5c0eacaa4e1727c3e0000004",
"output_dataset_status": true,
"output_fields": [],
"pca": "pca/5c0eab424e1727c143000006",
"pca_status": true,
"pca_type": 0,
"private": true,
"project": null,
"resource": "batchprojection/5728be2d4e172748db000000",
"rows": 150,
"separator": ",",
"shared": false,
"size": 4608,
"status": {
"code": 5,
"elapsed": 5937,
"message": "The batch projection has been created",
"progress": 1
},
"subscription": true,
"tags": [],
"type": 0,
"updated": "2018-12-10T18:12:59.342000",
"variance_threshold": null
}
< Example batch projection JSON response
Updating a Batch Projection
To update a batch projection, you need to PUT an object containing the fields that you want to update to the batch projection's URL. The content-type must always be: "application/json". If the request succeeds, BigML.io will return an HTTP 202 response with the updated batch projection.
For example, to update a batch projection with a new name you can use curl like this:
curl "https://au.bigml.io/batchprojection/5728be2d4e172748db000000?$BIGML_AUTH" \
-X PUT \
-H 'content-type: application/json' \
-d '{"name": "A new name"}'
$ Updating a batch projection's name
Deleting a Batch Projection
To delete a batch projection, you need to issue an HTTP DELETE request to the batchprojection/id to be deleted.
Using curl you can do something like this to delete a batch projection:
curl -X DELETE "https://au.bigml.io/batchprojection/5728be2d4e172748db000000?$BIGML_AUTH"
$ Deleting a batch projection from the command line
If the request succeeds you will not see anything on the command line unless you executed the command in verbose mode. Successful DELETEs will return "204 no content" responses with no body.
Once you delete a batch projection, it is permanently deleted. That is, a delete request cannot be undone. If you try to delete a batch projection a second time, or a batch projection that does not exist, you will receive a "404 not found" response.
However, if you try to delete a batch projection that is being used at the moment, then BigML.io will not accept the request and will respond with a "400 bad request" response.
See this section for more details.
Listing Batch Projections
To list all the batch projections, you can use the batchprojection base URL. By default, only the 20 most recent batch projections will be returned. You can see below how to change this number using the limit parameter.
You can get your list of batch projections directly in your browser using your own username and API key with the following link.
https://au.bigml.io/batchprojection?$BIGML_AUTH
> Listing batch projections from a browser
Libraries
Last Updated: Tuesday, 2019-01-29 16:28
A library is a special kind of compiled WhizzML source code that only defines functions and constants. It is intended as an import for executable scripts. Thus, a compiled library cannot be executed; it can only be imported by other libraries and scripts, which then have access to all the functions and constants it defines. You can read the WhizzML Reference Manual for more information.
BigML.io allows you to create, retrieve, update, and delete your libraries. You can also list all of your libraries.
Jump to:
- Library Base URL
- Creating a Library
- Library Arguments
- Retrieving a Library
- Library Properties
- Updating a Library
- Deleting a Library
- Listing Libraries
Library Base URL
You can use the following base URL to create, retrieve, update, and delete libraries. https://au.bigml.io/library
Library base URL
All requests to manage your libraries must use HTTPS and be authenticated using your username and API key to verify your identity. See this section for more details.
Creating a Library
To create a new library, you need to POST to the library base URL an object containing at least the source_code that defines the functions and constants the library exports. The content-type must always be "application/json".
POST /library?username=alfred;api_key=79138a622755a2383660347f895444b1eb927730; HTTP/1.1
Host: bigml.io
Content-Type: application/json
> Creating a library definition
curl "https://au.bigml.io/library?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"source_code": "(define (mu x) (+ x 1)) (define (g z y) (mu (+ y z)))"}'
> Creating a library
BigML.io will return the newly created library if the request succeeded.
{
"approval_status":0,
"category":0,
"created":"2016-05-13T18:25:10.408403",
"description":"",
"imports":[],
"line_count":1,
"name":"Average",
"price":0,
"private":true,
"project":null,
"provider":"whizzml-editor",
"resource":"library/55eeed1f1f386fc29500000a",
"shared":false,
"size":36,
"source_code":"(define (avg a b c) (/ (+ a b c) 3))",
"status":{
"code":1,
"message":"The library is being processed and will be created soon"
},
"subscription":true,
"tags":[],
"updated":"2016-05-13T18:25:10.408577",
"white_box":false
}
< Example library JSON response
Library Arguments
In addition to the source code, you can also POST the following arguments.
| Argument | Type | Description |
|---|---|---|
|
category
optional |
Integer, default is 0 |
The WhizzML category that best describes the library. See the WhizzML category codes for the complete list of categories.
Example: 1 |
|
description
optional |
String |
A description of the library up to 8192 characters long.
Example: "This is a description of my new library" |
|
imports
optional |
Array of Strings |
A list of valid library identifiers.
Example:
|
|
name
optional |
String |
The name you want to give to the new library.
Example: "my new library" |
|
public_in_organization
optional |
Boolean, default is false |
Whether the library is public within the organization.
Example: true |
| source_code | String |
Code for the WhizzML library. See WhizzML Reference Manual for more information.
Example: "(define (avg a b c) (/ (+ a b c) 3))" |
|
tags
optional |
Array of Strings |
A list of strings that help classify and index your library.
Example: ["best customers", "2018"] |
|
webhook
optional |
Object |
A webhook url and an optional secret phrase. See the Section on Webhooks for more details.
Example:
|
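Assembling the creation request body can be sketched in Python; library_payload is an illustrative helper that mirrors the required and optional arguments above:

```python
import json

def library_payload(source_code, name=None, imports=None, tags=None):
    # Assemble the JSON body for POST /library; only source_code is required.
    body = {"source_code": source_code}
    if name is not None:
        body["name"] = name
    if imports:
        body["imports"] = list(imports)
    if tags:
        body["tags"] = list(tags)
    return json.dumps(body)
```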
Retrieving a Library
Each library has a unique identifier in the form "library/id" where id is a string of 24 alpha-numeric characters that you can use to retrieve the library.
To retrieve a library with curl:
curl "https://au.bigml.io/library/55eeed1f1f386fc29500000a?$BIGML_AUTH"
$ Retrieving a library from the command line
You can also use your browser to visualize the library using the full BigML.io URL or pasting the library/id into the BigML.com.au dashboard.
Library Properties
Once a library has been successfully created it will have the following properties.
| Property | Type | Description |
|---|---|---|
|
approval_status
filterable, sortable, updatable |
Integer | When a library is requested to be publicly shared, it must be reviewed and approved by the BigML administrators. A user can change its value to 1 to request the approval or 0 to withdraw the previous request. The library can be accepted (5) or rejected (-1) by the administrators. Once the library is accepted, it will be publicly available and no further changes to the library are allowed while the library is publicly shared. |
|
category
filterable, sortable, updatable |
Integer | One of the WhizzML categories in the table of WhizzML categories that help classify this resource according to the domain of application. |
| code | Integer | HTTP status code. This will be 201 upon successful creation of the library and 200 afterwards. Check the code that comes with the status attribute to confirm that the library creation has been completed without errors. |
|
created
filterable, sortable |
ISO-8601 Datetime. | This is the date and time in which the library was created with microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are provided in Coordinated Universal Time (UTC). |
|
description
updatable |
String | A text describing the library. It can contain restricted markdown to decorate the text. |
| exports | Array |
A list with the signature of functions and constants the library makes public.
Example:
|
|
imports
filterable, sortable |
Array of Strings | A list of valid ids of the libraries used in the library. |
| line_count | Integer | The number of lines in the source code. |
|
name
filterable, sortable, updatable |
String | The name of the library. |
|
price
filterable, sortable, updatable |
Float | The price other users must pay to clone the library. |
|
private
filterable, sortable |
Boolean | Whether the library is public or not. |
|
project
filterable, sortable, updatable |
String | The project/id the resource belongs to. |
| public_in_organization | Boolean | Whether the library is public within the organization. |
| resource | String | The library/id. |
|
shared
filterable, sortable, updatable |
Boolean | Whether the library is shared using a private link or not. |
| shared_hash | String | The hash that gives access to this library if it has been shared using a private link. |
| sharing_key | String | The alternative key that gives read access to this library. |
| source_code | String | The source code of the library. |
| status | Object | A description of the status of the library. It includes a code, a message, and some extra information. See the table below. |
|
subscription
filterable, sortable |
Boolean | Whether the library was created using a subscription plan or not. |
|
tags
filterable, updatable |
Array of Strings | A list of user tags that can help classify and index this resource. |
|
updated
filterable, sortable |
ISO-8601 Datetime. | This is the date and time in which the library was updated with microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are provided in Coordinated Universal Time (UTC). |
| webhook | Object | A webhook url and an optional secret phrase. See the Section on Webhooks for more details. |
|
white_box
filterable, sortable |
Boolean | Whether the library is publicly shared as a white-box. |
Library Status
Creating a library is a process that can take just a few seconds or a few minutes depending on the workload of BigML's systems. The library goes through a number of states until it is fully completed. Through the status field in the library you can determine when the library has been fully processed and is ready to be used. These are the properties of a library's status:
| Property | Type | Description |
|---|---|---|
| code | Integer | A status code that reflects the status of the library creation. It can be any of those that are explained here. |
| elapsed | Integer | Number of milliseconds that BigML.io took to process the library. |
| message | String | A human readable message explaining the status. |
| progress | Float, between 0 and 1 | How far BigML.io has progressed building the library. |
Once a library has been successfully created, it will look like:
{
"approval_status":0,
"category":0,
"created":"2016-05-13T18:25:10.408000",
"description":"",
"exports":[
{
"name":"avg",
"signature":[
"a",
"b",
"c"
]
}
],
"imports":[],
"line_count":1,
"name":"Average",
"price":0.0,
"private":true,
"project":null,
"provider":"whizzml-editor",
"resource":"library/55eeed1f1f386fc29500000a",
"shared":false,
"size":36,
"source_code":"(define (avg a b c) (/ (+ a b c) 3))",
"status":{
"code":5,
"elapsed":4,
"message":"The library has been created",
"progress":1.0
},
"subscription":true,
"tags":[],
"updated":"2016-05-13T18:26:32.045000",
"white_box":false
}
< Example library JSON response
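Since creation is asynchronous, a client typically re-fetches the resource until the status code reaches 5 (finished). Below is a minimal polling sketch, not part of the BigML API itself; `get_library` is a hypothetical callable that performs the authenticated GET and returns the decoded JSON document.

```python
import time

# Status code 5 means the resource has been fully processed; negative
# codes signal a failed creation (see the status codes section).
FINISHED = 5

def wait_until_finished(get_library, poll_seconds=2, max_polls=30):
    """Poll the library's status until it has been fully processed."""
    for _ in range(max_polls):
        resource = get_library()
        status = resource["status"]
        if status["code"] == FINISHED:
            return resource
        if status["code"] < 0:
            raise RuntimeError(status.get("message", "library creation failed"))
        time.sleep(poll_seconds)
    raise TimeoutError("library was not ready after polling")
```

The same pattern applies to scripts and executions, since all three resources report their progress through the same status object.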
Updating a Library
To update a library, you need to PUT an object containing the fields that you want to update to the library's base URL. The content-type must always be: "application/json". If the request succeeds, BigML.io will return an HTTP 202 response with the updated library.
For example, to update a library with a new name you can use curl like this:
curl "https://au.bigml.io/library/55eeed1f1f386fc29500000a?$BIGML_AUTH" \
-X PUT \
-H 'content-type: application/json' \
-d '{"name": "a new name"}'
$ Updating a library's name
Deleting a Library
To delete a library, you need to issue an HTTP DELETE request to the library/id to be deleted.
Using curl you can do something like this to delete a library:
curl -X DELETE "https://au.bigml.io/library/55eeed1f1f386fc29500000a?$BIGML_AUTH"
$ Deleting a library from the command line
If the request succeeds you will not see anything on the command line unless you executed the command in verbose mode. Successful DELETEs will return "204 no content" responses with no body.
Once you delete a library, it is permanently deleted. That is, a delete request cannot be undone. If you try to delete a library a second time, or a library that does not exist, you will receive a "404 not found" response.
However, if you try to delete a library that is being used at the moment, then BigML.io will not accept the request and will respond with a "400 bad request" response.
See this section for more details.
Listing Libraries
To list all the libraries, you can use the library base URL. By default, only the 20 most recent libraries will be returned. You can see below how to change this number using the limit parameter.
You can get your list of libraries directly in your browser using your own username and API key with the following links.
https://au.bigml.io/library?$BIGML_AUTH
> Listing libraries from a browser
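A sketch of building such a listing URL with the limit parameter mentioned above; the credentials are placeholders, and the `;`-separated query format follows the authenticated requests shown throughout this document.

```python
# Base URL for listing libraries; append limit to restrict how many
# of the most recent libraries are returned (default is 20).
BASE_URL = "https://au.bigml.io/library"

def listing_url(username, api_key, limit=20):
    """Return an authenticated listing URL for the `limit` most recent libraries."""
    return f"{BASE_URL}?username={username};api_key={api_key};limit={limit}"
```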
Scripts
Last Updated: Tuesday, 2019-01-29 16:28
BigML.io allows you to create, retrieve, update, and delete your scripts. You can also list all of your scripts.
Jump to:
- Script Base URL
- Creating a Script
- Script Arguments
- Retrieving a Script
- Script Properties
- Updating a Script
- Deleting a Script
- Listing Scripts
Script Base URL
You can use the following base URL to create, retrieve, update, and delete scripts. https://au.bigml.io/script
Script base URL
All requests to manage your scripts must use HTTPS and be authenticated using your username and API key to verify your identity. See this section for more details.
Creating a Script
To create a new script, you need to POST a string containing at least the source code to the script base URL. The content-type must always be "application/json".
POST /script?username=alfred;api_key=79138a622755a2383660347f895444b1eb927730; HTTP/1.1
Host: bigml.io
Content-Type: application/json
Creating script definition
curl "https://au.bigml.io/script?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"source_code": "(create-source {\"remote\" \"http://foo.com/b.csv\"})"}'
> Creating a script
BigML.io will return the newly created script if the request succeeded.
{
"approval_status":0,
"category":0,
"code":201,
"created":"2016-05-13T18:27:23.943742",
"credits_per_execution":0,
"description":"",
"imports":[
"library/55eeed1f1f386fc29500000a"
],
"inputs":[
{
"default":0,
"description":"First number",
"name":"a",
"type":"number"
},
{
"default":0,
"description":"Second number",
"name":"b",
"type":"number"
},
{
"default":0,
"description":"Third number",
"name":"c",
"type":"number"
}
],
"line_count":1,
"locale":"en-US",
"name":"Average calculation script",
"number_of_executions":0,
"outputs":[
{
"description":"Average of all three numbers",
"name":"result",
"type":"number"
}
],
"price":0,
"private":true,
"project":null,
"provider":"bigml-editor",
"resource":"script/55f007d21f386f5199000000",
"shared":false,
"size":27,
"source_code":"(define result (avg a b c))",
"status":{
"code":1,
"message":"The script is being processed and will be created soon"
},
"subscription":true,
"tags":[],
"updated":"2016-05-13T18:27:23.943883",
"white_box":false
}
< Example script JSON response
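WhizzML source code usually contains double quotes, which are easy to mis-escape when the JSON body is written by hand (compare the escaped curl example above). A sketch of letting a JSON encoder build the payload instead; the name below is just an illustration.

```python
import json

# json.dumps takes care of escaping the quotes inside the WhizzML
# source code when serializing the POST body.
source = '(create-source {"remote" "http://foo.com/b.csv"})'
payload = json.dumps({"source_code": source, "name": "my new script"})
```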
Script Arguments
In addition to the source code, you can also POST the following arguments.
| Argument | Type | Description |
|---|---|---|
|
category
optional |
Integer, default is 0 |
The WhizzML category that best describes the script. See the WhizzML category codes for the complete list of categories.
Example: 1 |
|
description
optional |
String |
A description of the script up to 8192 characters long.
Example: "This is a description of my new script" |
|
imports
optional |
Array of Strings |
A list of valid library identifiers.
Example:
|
|
inputs
optional |
Array |
A list of inputs of the script with their name, type and optional default value and description. See the list of available input types.
Example:
|
|
name
optional |
String |
The name you want to give to the new script.
Example: "my new script" |
| origin | String |
The script/id of the gallery script to be cloned. The price of the script must be 0 to be cloned via API.
Example: "script/5b9ab8474e172785e3000003" |
|
outputs
optional |
Array |
A list of variables with their name, type, and optional description, defined in the source code of the script, that will constitute the outputs of the execution. See the list of available output types.
Example:
|
|
public_in_organization
optional |
Boolean, default is false |
Whether the script is public within the organization.
Example: true |
| shared_hash | String |
The shared hash of the shared script to be cloned. The price of the script must be 0 to be cloned via API.
Example: "kpY46mNuNVReITw0Z1mAqoQ9ySW" |
| source_code | String |
Code for the WhizzML script. See WhizzML Reference Manual for more information.
Example: "(define id (create-source {"remote" remote_uri})) (wait id timeout)" |
|
tags
optional |
Array of Strings |
A list of strings that help classify and index your script.
Example: ["best customers", "2018"] |
|
webhook
optional |
Object |
A webhook url and an optional secret phrase. See the Section on Webhooks for more details.
Example:
|
Available Variable Types for Inputs and Outputs
- string (or categorical, text, items)
- number (or numeric)
- integer
- boolean
- list
- map
- list-of-string
- list-of-integer
- list-of-number
- list-of-map
- list-of-boolean
- resource-id
- supervised-model-id
- project-id
- source-id
- dataset-id
- sample-id
- model-id
- ensemble-id
- logisticregression-id
- deepnet-id
- timeseries-id
- prediction-id
- batchprediction-id
- evaluation-id
- anomaly-id
- anomalyscore-id
- batchanomalyscore-id
- cluster-id
- centroid-id
- batchcentroid-id
- association-id
- associationset-id
- topicmodel-id
- topicdistribution-id
- batchtopicdistribution-id
- correlation-id
- statisticaltest-id
- library-id
- script-id
- execution-id
- configuration-id
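Putting the pieces together, here are the input and output declarations for the "Average calculation script" shown earlier in this section: each entry carries a name, one of the types listed above, and optionally a default and a description.

```python
import json

# Typed input declarations for the averaging script; defaults are used
# when an execution does not provide a value explicitly.
inputs = [
    {"name": "a", "type": "number", "default": 0, "description": "First number"},
    {"name": "b", "type": "number", "default": 0, "description": "Second number"},
    {"name": "c", "type": "number", "default": 0, "description": "Third number"},
]
outputs = [{"name": "result", "type": "number",
            "description": "Average of all three numbers"}]
payload = json.dumps({
    "source_code": "(define result (avg a b c))",
    "imports": ["library/55eeed1f1f386fc29500000a"],
    "inputs": inputs,
    "outputs": outputs,
})
```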
Retrieving a Script
Each script has a unique identifier in the form "script/id" where id is a string of 24 alpha-numeric characters that you can use to retrieve the script.
To retrieve a script with curl:
curl "https://au.bigml.io/script/55f007d21f386f5199000000?$BIGML_AUTH"
$ Retrieving a script from the command line
You can also use your browser to visualize the script using the full BigML.io URL or pasting the script/id into the BigML.com.au dashboard.
Script Properties
Once a script has been successfully created it will have the following properties.
| Property | Type | Description |
|---|---|---|
|
approval_status
filterable, sortable, updatable |
Integer | When a script is requested to be publicly shared, it must be reviewed and approved by the BigML administrators. A user can change its value to 1 to request the approval or 0 to withdraw the previous request. The script can be accepted (5) or rejected (-1) by the administrators. Once the script is accepted, it will be publicly available and no further changes to the script are allowed while the script is publicly shared. |
|
category
filterable, sortable, updatable |
Integer | One of the WhizzML categories in the table of WhizzML categories that help classify this resource according to the domain of application. |
| code | Integer | HTTP status code. This will be 201 upon successful creation of the script and 200 afterwards. Check the code that comes with the status attribute to confirm that the script creation has been completed without errors. |
|
created
filterable, sortable |
ISO-8601 Datetime. | This is the date and time in which the script was created with microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are provided in Coordinated Universal Time (UTC). |
|
description
updatable |
String | A text describing the script. It can contain restricted markdown to decorate the text. |
|
imports
filterable, sortable |
Array of Strings | A list of valid ids of the libraries used in the script. |
| inputs | Array | The script's previously defined input variables. |
| line_count | Integer | The number of lines in the source code. |
| locale | String | The script's locale. |
|
name
filterable, sortable, updatable |
String | The name of the script. |
|
number_of_executions
filterable, sortable |
Integer | The number of times that the script has been executed. |
|
origin
filterable, sortable |
String | The script/id of the original gallery script. |
| outputs | Array | The script's previously defined output variables. |
|
price
filterable, sortable, updatable |
Float | The price other users must pay to clone the script. |
|
private
filterable, sortable |
Boolean | Whether the script is public or not. |
|
project
filterable, sortable, updatable |
String | The project/id the resource belongs to. |
| public_in_organization | Boolean | Whether the script is public within the organization. |
| resource | String | The script/id. |
|
shared
filterable, sortable, updatable |
Boolean | Whether the script is shared using a private link or not. |
|
shared_clonable
filterable, sortable, updatable |
Boolean | Whether the shared script can be cloned or not. |
| shared_hash | String | The hash that gives access to this script if it has been shared using a private link. |
| sharing_key | String | The alternative key that gives read access to this script. |
|
size
filterable, sortable |
Integer | The number of bytes of the source code. |
| source_code | String | The source code of the script. |
| status | Object | A description of the status of the script. It includes a code, a message, and some extra information. See the table below. |
|
subscription
filterable, sortable |
Boolean | Whether the script was created using a subscription plan or not. |
|
tags
filterable, updatable |
Array of Strings | A list of user tags that can help classify and index this resource. |
|
updated
filterable, sortable |
ISO-8601 Datetime. | This is the date and time in which the script was updated with microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are provided in Coordinated Universal Time (UTC). |
| webhook | Object | A webhook url and an optional secret phrase. See the Section on Webhooks for more details. |
|
white_box
filterable, sortable |
Boolean | Whether the script is publicly shared as a white-box. |
Script Status
Creating a script is a process that can take just a few seconds or a few minutes depending on the workload of BigML's systems. The script goes through a number of states until it is fully completed. Through the status field in the script you can determine when the script has been fully processed and is ready to be used. These are the properties of a script's status:
| Property | Type | Description |
|---|---|---|
| code | Integer | A status code that reflects the status of the script creation. It can be any of those that are explained here. |
| elapsed | Integer | Number of milliseconds that BigML.io took to process the script. |
| message | String | A human readable message explaining the status. |
| progress | Float, between 0 and 1 | How far BigML.io has progressed building the script. |
Once a script has been successfully created, it will look like:
{
"approval_status":0,
"category":0,
"code":200,
"created":"2016-05-13T18:27:23.943000",
"description":"",
"imports":[
"library/55eeed1f1f386fc29500000a"
],
"inputs":[
{
"default":0,
"description":"First number",
"name":"a",
"type":"number"
},
{
"default":0,
"description":"Second number",
"name":"b",
"type":"number"
},
{
"default":0,
"description":"Third number",
"name":"c",
"type":"number"
}
],
"line_count":1,
"locale":"en-US",
"menu_icon_class":"",
"name":"Average calculation script",
"number_of_executions":0,
"outputs":[
{
"description":"Average of all three numbers",
"name":"result",
"type":"number"
}
],
"price":0.0,
"private":true,
"project":null,
"provider":"bigml-editor",
"resource":"script/55f007d21f386f5199000000",
"shared":false,
"size":27,
"source_code":"(define result (avg a b c))",
"status":{
"code":5,
"elapsed":6,
"message":"The script has been created",
"progress":1.0
},
"subscription":true,
"tags":[],
"updated":"2016-05-13T18:27:24.466000",
"white_box":false
}
< Example script JSON response
Updating a Script
To update a script, you need to PUT an object containing the fields that you want to update to the script's base URL. The content-type must always be: "application/json". If the request succeeds, BigML.io will return an HTTP 202 response with the updated script.
For example, to update a script with a new name you can use curl like this:
curl "https://au.bigml.io/script/55f007d21f386f5199000000?$BIGML_AUTH" \
-X PUT \
-H 'content-type: application/json' \
-d '{"name": "a new name"}'
$ Updating a script's name
Deleting a Script
To delete a script, you need to issue an HTTP DELETE request to the script/id to be deleted.
Using curl you can do something like this to delete a script:
curl -X DELETE "https://au.bigml.io/script/55f007d21f386f5199000000?$BIGML_AUTH"
$ Deleting a script from the command line
If the request succeeds you will not see anything on the command line unless you executed the command in verbose mode. Successful DELETEs will return "204 no content" responses with no body.
Once you delete a script, it is permanently deleted. That is, a delete request cannot be undone. If you try to delete a script a second time, or a script that does not exist, you will receive a "404 not found" response.
However, if you try to delete a script that is being used at the moment, then BigML.io will not accept the request and will respond with a "400 bad request" response.
See this section for more details.
Listing Scripts
To list all the scripts, you can use the script base URL. By default, only the 20 most recent scripts will be returned. You can see below how to change this number using the limit parameter.
You can get your list of scripts directly in your browser using your own username and API key with the following links.
https://au.bigml.io/script?$BIGML_AUTH
> Listing scripts from a browser
Executions
Last Updated: Tuesday, 2019-01-29 16:28
BigML.io allows you to create, retrieve, update, and delete your executions. You can also list all of your executions.
Jump to:
- Execution Base URL
- Creating an Execution
- Execution Arguments
- Retrieving an Execution
- Execution Properties
- Execution Errors
- Updating an Execution
- Deleting an Execution
- Listing Executions
Execution Base URL
You can use the following base URL to create, retrieve, update, and delete executions. https://au.bigml.io/execution
Execution base URL
All requests to manage your executions must use HTTPS and be authenticated using your username and API key to verify your identity. See this section for more details.
Creating an Execution
To create a new execution, you need to POST to the execution base URL a string containing the id of the script that will be executed, as well as the required input parameters defined in the source code. The content-type must always be "application/json".
POST /execution?username=alfred;api_key=79138a622755a2383660347f895444b1eb927730; HTTP/1.1
Host: bigml.io
Content-Type: application/json
Creating execution definition
curl "https://au.bigml.io/execution?$BIGML_AUTH" \
-X POST \
-H 'content-type: application/json' \
-d '{"script": "script/55f007d21f386f5199000000"}'
> Creating an execution
BigML.io will return the newly created execution if the request succeeded.
{
"category":0,
"code":201,
"created":"2016-05-13T18:42:45.570230",
"creation_defaults":{},
"description":"",
"execution":{
"sources":[
[
"library/55eeed1f1f386fc29500000a",
"Calculate an average"
],
[
"script/55f007d21f386f5199000000",
"Average result"
]
]
},
"inputs":[
[
"a",
2
],
[
"b",
4
],
[
"c",
6
]
],
"locale":"en-US",
"name":"Average result",
"project":null,
"resource":"execution/56b8b93510cb863adb00000c",
"script":"script/55f007d21f386f5199000000",
"script_status":true,
"shared":false,
"size":0,
"status":{
"code":1,
"message":"The execution is being processed and will be created soon"
},
"subscription":true,
"tags":[],
"updated":"2016-05-13T18:42:45.570380"
}
< Example execution JSON response
Execution Arguments
In addition to the script, you can also POST the following arguments.
| Argument | Type | Description |
|---|---|---|
|
category
optional |
Integer, default is 0 |
The WhizzML category that best describes the execution. See the WhizzML category codes for the complete list of categories.
Example: 1 |
|
creation_defaults
optional |
Object |
A dictionary whose keys are resource type names (dataset, model, prediction, etc.) with a map of values for the corresponding defaults which will be used if the input values are not explicitly provided.
Example:
|
|
default_numeric_value
optional |
String |
It accepts any of the following strings to substitute missing numeric values across all the numeric fields in the dataset: "mean", "median", "minimum", "maximum", "zero"
Example: "median" |
|
description
optional |
String |
A description of the execution up to 8192 characters long.
Example: "This is a description of my new execution" |
|
input_maps
optional |
Object |
A dictionary whose keys are script/ids and whose values are inputs as described below. This should be used instead of inputs when multiple scripts are provided.
Example:
|
|
inputs
optional |
Array |
A list of pairs of input parameters and their values associated to the execution.
Example:
|
|
name
optional |
String |
The name you want to give to the new execution.
Example: "my new execution" |
|
output_maps
optional |
Array |
A dictionary whose keys are script/ids and whose values are lists of outputs as described below. This can be used along with outputs.
Example:
|
|
outputs
optional |
Array |
A list of variables, defined in the source code of the script, that will constitute the outputs of the execution.
Example: ["a", "b"] |
| script | String |
A valid script/id.
Example: script/4f66a80803ce8940c5000006 |
|
scripts
optional |
Array of Strings |
A list of valid script identifiers to be executed sequentially.
Example: ["script/57361a3a10cb86ed59000152", "script/552328ae10cb8619aa00000f"] |
|
tags
optional |
Array of Strings |
A list of strings that help classify and index your execution.
Example: ["best customers", "2018"] |
|
webhook
optional |
Object |
A webhook url and an optional secret phrase. See the Section on Webhooks for more details.
Example:
|
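For example, the arguments above can be combined into a POST body that executes the averaging script from the previous section; `inputs` is a list of [name, value] pairs, matching the example execution response shown earlier.

```python
import json

# POST body for an execution of the averaging script: the script id
# plus the three input values it declares.
payload = json.dumps({
    "script": "script/55f007d21f386f5199000000",
    "inputs": [["a", 2], ["b", 4], ["c", 6]],
})
```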
Retrieving an Execution
Each execution has a unique identifier in the form "execution/id" where id is a string of 24 alpha-numeric characters that you can use to retrieve the execution.
To retrieve an execution with curl:
curl "https://au.bigml.io/execution/56b8b93510cb863adb00000c?$BIGML_AUTH"
$ Retrieving an execution from the command line
You can also use your browser to visualize the execution using the full BigML.io URL or pasting the execution/id into the BigML.com.au dashboard.
Execution Properties
Once an execution has been successfully created it will have the following properties.
| Property | Type | Description |
|---|---|---|
|
category
filterable, sortable, updatable |
Integer | One of the WhizzML categories in the table of WhizzML categories that help classify this resource according to the domain of application. |
| code | Integer | HTTP status code. This will be 201 upon successful creation of the execution and 200 afterwards. Check the code that comes with the status attribute to confirm that the execution creation has been completed without errors. |
|
created
filterable, sortable |
ISO-8601 Datetime. | This is the date and time in which the execution was created with microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are provided in Coordinated Universal Time (UTC). |
| creation_defaults | Object | A dictionary whose keys are resource type names with a map of values for the corresponding defaults which will be used if the input values are not explicitly provided. |
|
description
updatable |
String | A text describing the execution. It can contain restricted markdown to decorate the text. |
| execution | Object | Information about the processing of the execution. See the execution table below. |
| input_maps | Object | A dictionary whose key is script/id and value is inputs. |
| inputs | Array | A list of pairs of input parameters and their values associated to the execution. |
| locale | String | The execution's locale. |
|
name
filterable, sortable, updatable |
String | The name of the execution. By default it is None. |
| output_maps | Array | A dictionary whose keys are script/ids and whose values are lists of outputs defined during creation. |
| outputs | Array | A list of output variables, type, and value. |
|
project
filterable, sortable, updatable |
String | The project/id the resource belongs to. |
| resource | String | The execution/id. |
|
script
filterable, sortable |
String | The script/id used in the execution. |
|
script_status
filterable, sortable |
Boolean | The status of the script used in the execution. |
| scripts | Array | A list of valid ids of the scripts used in the execution. |
|
shared
filterable, sortable, updatable |
Boolean | Whether the execution is shared using a private link or not. |
| shared_hash | String | The hash that gives access to this execution if it has been shared using a private link. |
| sharing_key | String | The alternative key that gives read access to this execution. |
| status | Object | A description of the status of the execution. It includes a code, a message, and some extra information. See the table below. |
|
subscription
filterable, sortable |
Boolean | Whether the execution was created using a subscription plan or not. |
|
tags
filterable, updatable |
Array of Strings | A list of user tags that can help classify and index this resource. |
|
updated
filterable, sortable |
ISO-8601 Datetime. | This is the date and time in which the execution was updated with microsecond precision. It follows this pattern yyyy-MM-ddThh:mm:ss.SSSSSS. All times are provided in Coordinated Universal Time (UTC). |
| webhook | Object | A webhook url and an optional secret phrase. See the Section on Webhooks for more details. |
The Execution Object has the following properties.
Execution Status
Creating an execution is a process that can take just a few seconds, a few minutes, or even hours depending on the workload of BigML's systems. The execution goes through a number of states before it is fully completed. Using the status field in the execution you can determine whether the execution has been fully processed and is ready to be used. These are the properties of an execution's status:
| Property | Type | Description |
|---|---|---|
| code | Integer | A status code that reflects the status of the execution creation. It can be any of those that are explained here. |
| elapsed | Integer | Number of milliseconds that BigML.io took to process the execution. |
| elapsed_times | Object |
Information about the time in milliseconds consumed in each step of the execution.
Example:
|
| message | String | A human readable message explaining the status. |
| progress | Float, between 0 and 1 | How far BigML.io has progressed building the execution. |
Once an execution has been successfully created, it will look like:
{
"category":0,
"code":200,
"created":"2016-05-13T18:42:45.570000",
"creation_defaults":{},
"description":"",
"execution":{
"output_resources":[],
"outputs":[
[
"result",
4.0,
"Number"
]
],
"result":4.0,
"results":[
4.0
],
"sources":[
[
"library/55eeed1f1f386fc29500000a",
"Calculate an average"
],
[
"script/55f007d21f386f5199000000",
"Average result"
]
],
"steps":22
},
"inputs":[
[
"a",
2
],
[
"b",
4
],
[
"c",
6
]
],
"locale":"en-US",
"name":"Average result",
"project":null,
"resource":"execution/56b8b93510cb863adb00000c",
"script":"script/57361c8b10cb86ed59000161",
"script_status":true,
"shared":false,
"status":{
"code":5,
"elapsed":62,
"elapsed_times":{
"in-progress":59,
"queued":90,
"started":3
},
"message":"The execution has been created",
"progress":1.0
},
"subscription":true,
"tags":[],
"updated":"2016-05-13T18:42:45.906000"
}
< Example execution JSON response
Execution Errors
Unlike other resources, errors for executions are reported with some additional information specific to the execution at hand, like the following one.
{
"code": 402,
"status": {
"call_stack": [
[0, [1, 1], [0, 60]],
[0, [1, 1], [16, 59]],
[1, [1, 1], [26, 58]],
[2, [1, 1], [16, 59]],
[2, [1, 1], [0, 60]]
],
"cause": {
"code": -1101,
"http_status": 402,
"extra": [
{"credit_type": 1200, "credit_size": 409}
]
},
"code": -8200,
"instruction": {
"source": {
"lines": [1, 1],
"columns": [0, 60],
"origin": 0,
"instruction": "apply"
}
},
"message": "Problem while executing script. Error handling resource: You don't have enough credits available"
}
}
< Example execution error response
- source in instruction contains the origin index, line, and column of the script that was being executed at the moment the error happened.
- call_stack contains the stack trace of the WhizzML execution, as a list of origin, lines, columns entries.
- cause contains the code, http_status, message, and extra of the root cause.
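The bullets above describe the error layout; a sketch of pulling the root cause and the failing position out of an error response shaped like the example above:

```python
# Summarize an execution error response: the human-readable message,
# the root cause's code and HTTP status, and the script position
# (line and columns) where the error happened.
def summarize_error(response):
    status = response["status"]
    cause = status["cause"]
    source = status["instruction"]["source"]
    return {
        "message": status["message"],
        "cause_code": cause["code"],
        "http_status": cause["http_status"],
        "line": source["lines"][0],
        "columns": source["columns"],
    }
```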
Updating an Execution
To update an execution, you need to PUT an object containing the fields that you want to update to the execution's base URL. The content-type must always be: "application/json". If the request succeeds, BigML.io will return an HTTP 202 response with the updated execution.
For example, to update an execution with a new name you can use curl like this:
curl "https://au.bigml.io/execution/56b8b93510cb863adb00000c?$BIGML_AUTH" \
-X PUT \
-H 'content-type: application/json' \
-d '{"name": "a new name"}'
$ Updating an execution's name
Deleting an Execution
To delete an execution, you need to issue an HTTP DELETE request to the execution/id to be deleted.
Using curl you can do something like this to delete an execution:
curl -X DELETE "https://au.bigml.io/execution/56b8b93510cb863adb00000c?$BIGML_AUTH"
$ Deleting an execution from the command line
If the request succeeds you will not see anything on the command line unless you executed the command in verbose mode. Successful DELETEs will return "204 no content" responses with no body.
Once you delete an execution, it is permanently deleted. That is, a delete request cannot be undone. If you try to delete an execution a second time, or an execution that does not exist, you will receive a "404 not found" response.
However, if you try to delete an execution that is being used at the moment, then BigML.io will not accept the request and will respond with a "400 bad request" response.
Note that you can also delete all resources that have been created by the execution. Simply append delete_all=true in the query string.
curl -X DELETE "https://au.bigml.io/execution/56b8b93510cb863adb00000c?$BIGML_AUTH;delete_all=true"
$ Deleting an execution with all associated resources from the command line
Listing Executions
To list all the executions, you can use the execution base URL. By default, only the 20 most recent executions will be returned. You can see below how to change this number using the limit parameter.
You can get your list of executions directly in your browser using your own username and API key with the following links.
https://au.bigml.io/execution?$BIGML_AUTH
> Listing executions from a browser