Walkthrough: writing a complex query

In this section, we will work through a more complex query that uses many of the primitives provided by the API. By the end of this section, you should understand the different parts of a query and how they fit together, so you can get started writing your own.

For our example query, we will find the 2 cities with the most users who are "highly active". In this example, we'll define a "highly active" user as someone who sent an event on at least three different days during the query interval.

At a high level, this query will:

Here's the whole query. Next, we'll walk through each of the parts.

function main() {
  // Get all events in the given date range
  return Events({
    from_date: params.from_date,
    to_date: params.to_date,
  })
  // Group them by user distinct_id, and call the provided function.
  .groupByUser(function(state, events) {
    // create a default user state if none exists
    state = state || {
      last_day_active: new Date(0),
      num_days_active: 0,
      city: events[0].properties.$city,
    };
    // loop through the user's events, counting the distinct days seen
    _.each(events, function(event) {
      var time = new Date(event.time); // convert seconds to ms
      var day = new Date(time.getFullYear(), time.getMonth(), time.getDate());
      if (day > state.last_day_active) {
        state.last_day_active = day;
        state.num_days_active++;
      }
    });
    return state;
  })
  // Only keep entries where the num_days_active property is at least 3
  .filter(function(item) { return item.value.num_days_active >= 3; })
  // Group entries by 'city' and count them up.
  .groupBy(["value.city"], mixpanel.reducer.count())
  // Limit the result set to the top 2 cities
  .reduce(mixpanel.reducer.top(2));
}

This example makes use of a number of important concepts that we will explain individually. As we work through the example, we will assume that we are running this query against a Mixpanel project where the following events have been tracked:

[
  { distinct_id: 'alice@example.com', name: 'signup', time: '2015-10-01 12:00:00', properties: { city: 'San Francisco' }},
  { distinct_id: 'alice@example.com', name: 'logout', time: '2015-10-04 12:00:00', properties: { city: 'San Francisco' }},
  { distinct_id: 'alice@example.com', name: 'invite', time: '2015-10-04 16:00:00', properties: { city: 'San Francisco' }},
  { distinct_id: 'alice@example.com', name: 'attack', time: '2015-10-05 12:00:00', properties: { city: 'San Francisco' }},

  { distinct_id: 'bob@example.com', name: 'logout', time: '2015-10-10 12:00:00', properties: { city: 'Barcelona' }},

  { distinct_id: 'carol@example.com', name: 'login', time: '2015-10-14 12:00:00', properties: { city: 'Paris' }},
  { distinct_id: 'carol@example.com', name: 'attack', time: '2015-10-15 02:00:00', properties: { city: 'Paris' }},
  { distinct_id: 'carol@example.com', name: 'purchase', time: '2015-10-16 16:00:00', properties: { city: 'Paris' }},
  { distinct_id: 'carol@example.com', name: 'logout', time: '2015-10-17 23:59:00', properties: { city: 'Paris' }},

  { distinct_id: 'daniel@example.com', name: 'logout', time: '2015-10-18 12:00:00', properties: { city: 'San Francisco' }},
  { distinct_id: 'daniel@example.com', name: 'login', time: '2015-10-19 02:00:00', properties: { city: 'San Francisco' }},
  { distinct_id: 'daniel@example.com', name: 'attack', time: '2015-10-20 16:00:00', properties: { city: 'San Francisco' }},

  { distinct_id: 'erin@example.com', name: 'logout', time: '2015-10-21 12:00:00', properties: { city: 'Paris' }},
  { distinct_id: 'erin@example.com', name: 'login', time: '2015-10-22 02:00:00', properties: { city: 'Paris' }},
  { distinct_id: 'erin@example.com', name: 'attack', time: '2015-10-23 16:00:00', properties: { city: 'Paris' }},

  { distinct_id: 'frank@example.com', name: 'login', time: '2015-10-24 02:00:00', properties: { city: 'New York' }},
  { distinct_id: 'frank@example.com', name: 'attack', time: '2015-10-25 16:00:00', properties: { city: 'New York' }},
  { distinct_id: 'frank@example.com', name: 'attack', time: '2015-10-26 23:59:00', properties: { city: 'New York' }}
]

If you look closely at this data, you'll see that there are two users from San Francisco (Alice and Daniel), two from Paris (Carol and Erin) and one from New York (Frank), all of whom satisfy our "sent events on at least 3 different days" criteria. There is one user (Bob from Barcelona) who only sent one event, and thus does not qualify as an active user.

So we expect our query to return the top two cities, San Francisco and Paris. Let's work through the query step by step to see how it works.


Defining your main() function

Every JQL query must define a main() function. This function is executed by our query engine, and its return value is returned to you via HTTP.

function main() {
  // ...
}


Selecting events for the time range we care about

Every JQL query must return a call to the Events() function:

function main() {
  return Events({
    start_date: params.start_date,
    end_date: params.end_date,
  });
}

This function fetches all of the events that happened in the given date range. Here we are making the date range configurable by using params, which is a global variable containing the set of parameters passed to the JQL API endpoint. params makes it possible for you to reuse the same JavaScript query across different date ranges or parameterize it in other ways.

Events() selects a set of events, but to do any useful analysis you must process those events. To do this, we use the concept of transformations. A transformation is applied to a dataset (such as a list of events), where it modifies the dataset in some way and returns the transformed data. Transformations can be chained, with each additional transformation acting on the output of the previous one.

In this query, we use a number of different transformations: .groupByUser(), .filter(), .groupBy(), and .reduce(). Each of these transformations accepts one or more user-defined functions, which they use to modify the dataset.


Determine how many days each user was active

To determine how many days each user was active, we will use the groupByUser() transformation. This will let us process all of the events for a user at once, which will make it much easier to figure out how many days they were active.

This transformation instructs Mixpanel to group events together based on the distinct_id property of each event. Then, it calls the provided reduce function on the list of events for each user. The output of groupByUser() is a collection of key-value pairs, represented as
{ key: [user distinct_id], value: return value of the reduce function } objects.

function main() {
  // Get all events in the given date range
  return Events({
    from_date: params.from_date,
    to_date: params.to_date,
  })
  // Group them by user distinct_id, and call the provided function.
  .groupByUser(function(state, events) {
    // create a default user state if none exists
    state = state || {
      last_day_active: new Date(0),
      num_days_active: 0,
      city: events[0].properties.$city,
    };
    // loop through the user's events, counting the distinct days seen
    _.each(events, function(event) {
      var time = new Date(event.time); // convert seconds to ms
      var day = new Date(time.getFullYear(), time.getMonth(), time.getDate());
      if (day > state.last_day_active) {
        state.last_day_active = day;
        state.num_days_active++;
      }
    });
    return state;
  })
}

As we mentioned earlier, all transformations accept a function argument, which they use to modify the dataset they are applied to. For groupByUser(), the reduce function is typically used to aggregate some information about each user. In this example, we are using our reduce function to figure out what city a user is from and how many days they have been active.

Writing a reduce function for groupByUser

The list of events for a user is provided to the reduce function as its second argument. When this list is long, we may break up the list into smaller chunks and process it across multiple calls to the reduce function. When that happens, the state argument is used for sharing state across calls.

Each time the reduce function is called, the return value of the previous call is provided as a state argument. Initially, state is Undefined. The return value of the last call becomes the result associated with the given user, so it is important to merge new results with old before returning from your function.

In the example above, the first call to the reduce function initializes state to be an object with three fields: the last day user was active, total number of days the user was active, and the city the user is located in.

Events for a given user can originate from multiple cities; in this example, we associate the city of the first event with the user, for simplicity. Two other fields - last_day_active and num_days_active - are updated as more events are processed by the reduce function.

When we execute the query we have built up so far, we will see data for all 6 users:

[
  {
    "key": ["alice@example.com"],
    "value": {
      "city": "San Francisco",
      "num_days_active": 3,
      "last_day_active": "2015-10-05"
    }
  },
  {
    "key": ["bob@example.com"],
    "value": {
      "city": "Barcelona",
      "num_days_active": 1,
      "last_day_active": "2015-10-10"
    }
  },
  {
    "key": ["carol@example.com"],
    "value": {
      "city": "Paris",
      "num_days_active": 4,
      "last_day_active": "2015-10-17"
    }
  },
  {
    "key": ["daniel@example.com"],
    "value": {
      "city": "San Francisco",
      "num_days_active": 3,
      "last_day_active": "2015-10-20"
    }
  },
  {
    "key": ["erin@example.com"],
    "value": {
      "city": "Paris",
      "num_days_active": 3,
      "last_day_active": "2015-10-23"
    }
  },
  {
    "key": ["frank@example.com"],
    "value": {
      "city": "New York",
      "num_days_active": 3,
      "last_day_active": "2015-10-26"
    }
  }
]

Filtering out users who were not active for at least 3 days

If you recall, in this query we only care about active users (defined as users who sent events in at least 3 different days). To achieve this, we will use the filter() transformation to filter down the collection returned from groupByUser().

.filter(function(item) { return item.value.num_days_active >= 3; })

This filter() transformation instructs Mixpanel to only keep elements of the collection that make the filter function evaluate to true. The filter function is a JavaScript function that takes an element of a collection as an argument and returns a boolean.

The previous transformation returned a collection of key-value pairs (shown above). Therefore, the filter function must deference value before reaching to the num_days_active property: value.num_days_active.

When applied to the output shown above, this expression filters out the entry for bob@example.com, leaving us with 5 users: 2 from San Francisco, 2 from Paris, and 1 from New York:

[
  {
    "key": ["alice@example.com"],
    "value": {
      "city": "San Francisco",
      "num_days_active": 3,
      "last_day_active": "2015-10-05"
    }
  },
  {
    "key": ["carol@example.com"],
    "value": {
      "city": "Paris",
      "num_days_active": 4,
      "last_day_active": "2015-10-17"
    }
  },
  {
    "key": ["daniel@example.com"],
    "value": {
      "city": "San Francisco",
      "num_days_active": 3,
      "last_day_active": "2015-10-20"
    }
  },
  {
    "key": ["erin@example.com"],
    "value": {
      "city": "Paris",
      "num_days_active": 3,
      "last_day_active": "2015-10-23"
    }
  },
  {
    "key": ["frank@example.com"],
    "value": {
      "city": "New York",
      "num_days_active": 3,
      "last_day_active": "2015-10-26"
    }
  }
]


Count the number of users in each city

Now that we've filtered our collection down to only active users, we need to figure out how many users are in each city. To do this, we will use the groupBy() transformation to group the collection by city and count the number of users in each group.

.groupBy(["value.city"], mixpanel.reducer.count());

groupBy() accepts a list of group keys and a reducer function. Here we are only using one key: the city attribute we recorded for each user. This will split the collection into three groups: one for San Francisco, one for Paris, and one for New York.

After grouping the collection, the reducer function is called for each group. To keep things simple, here we are using a built-in reducer, mixpanel.reducer.count, to count the number of records in each group. You could easily write your own reducer, as you will see in a moment.

When we apply this transformation to the filtered collection, we output the following collection:

[
  { "key": ["San Francisco"], "value": 2 },
  { "key": ["Paris"], "value": 2 },
  { "key": ["New York"], "value": 1 }
]


Limit the results to the top 2 cities

In this simple example dataset, we only have three cities with active users - but if we were running this query against a real dataset, we might have thousands. So it's important that we have the ability to limit the response to the top cities, as they are most interesting to us.

To find the top cities, we will use the reduce() transformation. We've already used this concept a couple of times in groupByUser() and groupBy() -- there, we were reducing a group at a time. Here we will run our reduce function against the entire collection, not just a subset of it.

.reduce(mixpanel.reducer.top(2));

We're using one of the built-in reducers, mixpanel.reducer.top(N), which operates on our collection to filter down to the N elements with the highest value attribute. When we run this function against the collection of cities and counts, we will get our final result for this query: the top 2 cities, San Francisco and Paris:

[
  [
    { "key": ["San Francisco"], "value": 2 },
    { "key": ["Paris"], "value": 2 }
  ]
]