If you've done any database work, you know that many of them have built-in features that let you reach the data and sometimes even analyze it. This article will show you how to perform two tasks using Ruby's Active Record. It is excerpted from chapter two of the book Practical Reporting with Ruby and Rails, written by David Berube (Apress; ISBN: 1590599330).
Calculating Statistics with Active Record (Page 1 of 5 )
The previous chapter discussed the fundamentals of accessing and manipulating data with Active Record. The statistical analyses--the highest salary, average salary, and so forth--were done manually using Ruby code. While that's a plausible approach, it's easier and often quicker to let the database do the work for you.
Databases typically have numerous built-in features for speeding up data access. Indexes, for example, are subsets of your table data, which are automatically maintained by your database and can make searching much faster. You can think of indexes like the table of contents in a book. It's much faster to find something by using the table of contents than it is to read every page of the book looking for the desired information. Additionally, the database's query planner uses speed-enhancing techniques automatically. This query planner has access to statistical information on the various tables and columns that your query uses, and it will formulate a query plan based on that information. In other words, it estimates how long each method of retrieving the data you requested will take, and it uses the quickest method. Because of the capabilities of the database, it's typically best to use the techniques described in this chapter, as they are considerably faster than doing your statistics in your Ruby code.
In this chapter, you'll learn how to use the database to perform two common tasks: grouping and aggregation. Let's look at how these tasks are useful, and then work through an example that uses them for reporting.
Grouping and Aggregation
Grouping refers to a way to reduce a table into a subset, where each row in the subset represents the set of records having a particular grouped value or values. For example, if you were tracking automobile accidents, and you had a table of persons, with their age and number of accidents, you could group by age and retrieve every distinct age in the database. In other words, you would get a list of the age of every person, with the duplicates removed.
If you were using an Active Record model named Person with an age column, you could find all of the distinct ages of the people involved, as follows:
ages = Person.find(:all, :group=>'age')
However, to perform useful work on grouped queries, you'll typically use aggregate functions. For example, you'll need to use aggregate functions to retrieve the average accidents per age group or the count of the people in each age group.
You've probably encountered a number of aggregate functions already. Some common ones are MAX and MIN, which give you the maximum and minimum value; AVG, which gives you the average value; SUM, which returns the sum of the values; and COUNT, which returns the total number of values. Each database engine may define different statistical functions, but nearly all provide those just mentioned.
Continuing with the Active Record model named Person with an age column, you could find the highest age from your table as follows:
oldest_age = Person.calculate(:max, :age)
Note that calculate takes the max function's name, as a symbol, as its first argument, but Active Record also has a number of convenience functions named after their respective purposes: count, sum, minimum, maximum, and average. For example, the following two lines are identical:
Both print out the average number of accidents for all rows .
Note The calculate form takes the abbreviated version of the function name, such as avg for average. However, the shortcut form takes a longer version. For example, you could use either Person.calculate(:avg, :age) or Person.average(:age). This is confusing, but the idea is that the calculate form passes your function directly to your database , so you can use any statistical function defined in your database, whereas the convenience functions are fixed, so they can have easier to understand names.
You can also combine grouping and aggregate functions. For example, if you wanted to print the average accident count for each age, you could do so as follows:
Person.calculate(:avg, :accident_count, :group=>'age').each do |player| age, accident_count_average = *player puts "Average Accident Count #{'%0.3f' % accident_count_average} for age #{age}" end
Note that the object passed to the block is an array. In the array, the grouped field comes first, followed by the calculated field. (If you wanted to group by more than one field, it would be in the order specified by the group parameter.)
Depending on your data, your results would look something like this:
-------------------------------------------- Average Accident Count 3.000 for age 18 Average Accident Count 2.020 for age 19 Average Accident Count 1.010 for age 20 . . . --------------------------------------------
However, for more complex queries, you may need to craft a SQL statement manually. Specifically, you can use find_by_sql to search for any arbitrary SQL query, and this allows you to ask for virtually any type of information. For example, if a vehicle_model column existed in your Accident model, you could group by vehicle_model and run two aggregate functions: average age and average accident_count. This would give you an average owner age and an average number of accidents per vehicle, so you could get some idea whether, say, a Ford Explorer was a safer ride than a Honda CRX. (Of course, driver control would play a significant part, and hence the average age is a helpful statistic.) You can't use multiple aggregate functions with calculate, but you can with find_by_sql, like this:
sql = "SELECT vehicle_model, AVG(age) as average_age, AVG(accident_count) AS average_accident_count FROM persons GROUP BY vehicle_model " Person.find_by_sql(sql).each do |row| puts "#{row.vehicle_model}, " << "avg. age: #{row.average_age}, " << "avg. accidents: #{row.average_accident_count}" end