What to Audit?

  • For all user actions, the following is part of the audit record:
    • user who attempted the action
    • time at which action was attempted
    • result of the action (success or failure)
  • Auditable actions are:
    • every attempted metadata operation, such as CRUD operations on metadata objects

The logical structure (fields) of an audit record is:

userOrRole (String, required)
userPrivilege (String, required)
actionTime (Timestamp, required)
action (String, required)
objectAccessed (String, required)
success (String, required)
transactionId (String, optional)
notes (any additional information, optional)
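
As an illustration only, the field list above can be modeled as a small record type. This is a minimal Python sketch, not the product's implementation: the class name, the sample values, and the choice to serialize actionTime as yyyyMMddHHmmss are assumptions made for the example.

```python
from dataclasses import dataclass, asdict
from datetime import datetime
from typing import Optional
import json


@dataclass
class AuditRecord:
    # Required fields, in the order listed above.
    userOrRole: str
    userPrivilege: str
    actionTime: datetime
    action: str
    objectAccessed: str
    success: str
    # Optional fields default to None when not supplied.
    transactionId: Optional[str] = None
    notes: Optional[str] = None

    def to_json(self) -> str:
        """Serialize the record as a JSON string."""
        fields = asdict(self)
        # Assumption: timestamps are rendered as yyyyMMddHHmmss.
        fields["actionTime"] = self.actionTime.strftime("%Y%m%d%H%M%S")
        return json.dumps(fields)


# Hypothetical sample values, for illustration only.
record = AuditRecord(
    userOrRole="lonestarr",
    userPrivilege="write",
    actionTime=datetime(2015, 1, 2, 3, 4, 5),
    action="AddModel",
    objectAccessed="<PMML ...",
    success="true",
)
print(record.to_json())
```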

Physical Structure (Tables/Column Families) that Store the Audit Records

For Cassandra and HBase, store audit records in a structure that resembles a relational table rather than a key-value pair. The reasons are:

  • This table won’t grow very large.
  • The API for audit records is easier to implement.

Purging Audit Tables

Audit tables need to be purged periodically (a manual operation right now).

Methods That Access Audit Information

Use REST APIs to access audit details. To access them, the user must have a special privilege (the audit privilege).

All output, if any, is presented as a JSON string.

GetAuditLog(startTime, endTime, userOrRole, action, objectAccessed)

All parameters in GetAuditLog are optional.

If startTime and endTime are not defined, they default to values such that the audit records produced in the last X minutes (X is configurable) are returned. All other parameters, if given, are validated for correctness and are used to further filter the results.

Configurable Parameters

AUDIT_INTERVAL (in minutes). If startTime and endTime are not defined, endTime = sys_time and startTime = endTime - AUDIT_INTERVAL.
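
The default time window can be sketched as follows. This is a hedged Python illustration, not the service's code: the function name resolve_window and the injectable now argument are hypothetical. The source does not say what happens when only one of the two values is supplied, so the sketch simply passes the given values through in that case.

```python
from datetime import datetime, timedelta

AUDIT_INTERVAL = 10              # minutes; the documented default
TIME_FORMAT = "%Y%m%d%H%M%S"     # yyyyMMddHHmmss


def resolve_window(start_time=None, end_time=None, now=None,
                   interval_minutes=AUDIT_INTERVAL):
    """Return (startTime, endTime), applying the default window
    only when neither value is supplied."""
    if start_time is None and end_time is None:
        end = now or datetime.now()                       # endTime = sys_time
        start = end - timedelta(minutes=interval_minutes) # endTime - AUDIT_INTERVAL
        return start.strftime(TIME_FORMAT), end.strftime(TIME_FORMAT)
    # Behavior for a single missing value is not specified in the source;
    # the values are returned unchanged here.
    return start_time, end_time


print(resolve_window(now=datetime(2015, 1, 2, 3, 4, 5)))
```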

Misc Points That Need Documentation

Audit logs are created with or without authentication.

Sample cURL Command to Access audit_log Using REST API

1. Get the audit log (in JSON format) that was generated in the last X minutes, where the value of X is defined by the MetadataAPIConfig parameter AUDIT_INTERVAL. The default value of AUDIT_INTERVAL is 10.

curl -X GET
-H "Content-Type: application/json"
-H "userid: lonestarr"
-H "password: password"
-H "role: goodguy"

2. Get audit log with explicit values of startTime and endTime. The values of startTime and endTime take the syntax of yyyyMMddHHmmss.

curl -X GET
-H "Content-Type: application/json"
-H "userid: lonestarr"
-H "password: password"
-H "role: goodguy"

3. More filters can be applied, but the parameter values must be supplied in strict order. The order of the parameters is startTime/endTime/userOrRole/action/objectAccessed.

curl -X GET
-H "Content-Type: application/json"
-H "userid: lonestarr"
-H "password: password"
-H "role: goodguy"
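
The strict ordering of the filter parameters can be illustrated with a small helper that builds only the filter portion of a request. This is a sketch under assumptions: it treats the filters as "/"-separated positional segments, so a later filter cannot be given unless all earlier ones are present. The function name audit_filter_path is hypothetical, and the service's base URL is deliberately not shown.

```python
from urllib.parse import quote

# Order required by the service, as documented above.
FILTER_ORDER = ["startTime", "endTime", "userOrRole", "action", "objectAccessed"]


def audit_filter_path(**filters):
    """Join the supplied filters in the required order.

    Raises ValueError for an unknown filter name, or when a filter is
    given while an earlier one in the order is missing.
    """
    unknown = set(filters) - set(FILTER_ORDER)
    if unknown:
        raise ValueError(f"unknown filter(s): {sorted(unknown)}")

    segments = []
    gap_seen = False
    for name in FILTER_ORDER:
        value = filters.get(name)
        if value is None:
            gap_seen = True
        elif gap_seen:
            raise ValueError(f"{name} given but an earlier parameter is missing")
        else:
            # Percent-encode each value so it is safe as one path segment.
            segments.append(quote(str(value), safe=""))
    return "/".join(segments)


print(audit_filter_path(startTime="20150101000000",
                        endTime="20150102000000",
                        userOrRole="lonestarr"))
```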

The output of these commands resembles the following. The value of ObjectAccessed depends on the API operation. In the case of ADD operations, it is the first 50 characters of the PMML file or JSON file.

Currently, filename is not passed from the API client (curl) to the service.

apiResult is a string that is returned by the MetadataAPIImpl class.

Comprehensive Auditing

Many businesses that may want to use this software are heavily regulated, legally constrained, or both. Decisions made in a model may need to be revisited by the business’s internal auditors or by legal opponents seeking claims against the business. For this reason, it is prudent to provide as much contextual information as is practical, so that it is easy to see why a model prediction was made.

This is easy to accomplish. The key predicate results and any tabulated information that led to the model score can be added to the model’s emitted output with supplementary mining field entries in the model’s MiningSchema section.

To see how this works, consider a medical model that reviews the prescription drug records of a given insurance member for possibly deleterious drug combinations. Using a filter table of drug pairs that would produce harmful, if not toxic, effects in the member, the member’s records are reviewed for each doctor, lab, or institutional visit over some lookback period. From the traceability perspective, the model should emit not only the prediction (there is a problem) but also the drug combination(s) that led to this finding. MiningSchema might look like this:

where ContraIndicative is an array of strings containing the drug combinations that are problematic. It could be derived with a function like this:

The conflictMedCdsSet is the filter table and the MedicationIdPairStrings is an array that enumerates the unique pairs of drugs found in the member’s records.
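
The derivation can be illustrated outside PMML. The following Python sketch is not the model's actual function. It assumes drug codes are plain strings, that each pair is written as "a,b" with the codes in sorted order, and that the filter table uses the same form. The codes D1 to D4 are hypothetical.

```python
from itertools import combinations


def medication_id_pair_strings(drug_codes):
    """Enumerate the unique drug pairs found in a member's records.

    Codes are de-duplicated and sorted so that each pair has a single
    canonical spelling, "a,b" with a < b.
    """
    unique_codes = sorted(set(drug_codes))
    return [f"{a},{b}" for a, b in combinations(unique_codes, 2)]


def contra_indicative(drug_codes, conflict_med_cds_set):
    """Return the member's drug pairs that appear in the filter table."""
    return [pair for pair in medication_id_pair_strings(drug_codes)
            if pair in conflict_med_cds_set]


# Hypothetical filter table and member record.
conflict_med_cds_set = {"D1,D3", "D2,D4"}
member_drugs = ["D3", "D1", "D2", "D1"]

print(contra_indicative(member_drugs, conflict_med_cds_set))
```

Emitting the matching pairs alongside the score lets an auditor see which combination triggered the finding without rerunning the model.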

One final note: especially during model development, it is useful to emit the key decisions and data used to reach a given score result. This helps create a more meaningful set of test data to maintain with the model, and it can also serve as a teaching aid when a new or changed model is introduced to the client.