Only SQL?
Introduction
Storing data is something that is important in software architecture. There are a lot of data storage engines and especially NoSQL (Not Only SQL): Cassandra, MongoDB, etc.. And nowadays, we want everything: scalability, high availability and consistency.
Because of these new engines, we are leaving aside older data storage servers like MySQL, PostgresSQL, SQL Server …
However, with NoSQL, we lose sometimes isolation or consistency. Indeed, NoSQL is not ACID (Atomicity, Consistency, Isolation, Durability). So, how do we store and query billions of data rows while respecting ACID properties (i.e no data loss)?
Partitioning
First, we will focus on MySQL server. MySQL can partition a table (see MySQL documentation). Many people have knowledge in MySQL usage / management, and don’t feel the need to learn a new technology. However, there is a partition limitation: the number of partitions cannot be greater than 1024 (in version 5.5 of MySQL, 8192 partitions allowed with versions >= 5.6.7).
Therefore, a solution is to manage partitions through your application, creating one table per day or using a sequence. We can have a lot of tables (but be aware of the operating system opened files limitation). By using a naming convention, managing such a table will be easier (example: the table’s name could contain today’s date, if you want to make one table per day).
However, we need both inserting and reading performances. The tables will be created on the fly and they will have a primary key but no foreign key nor index. By avoiding the use of a foreign key, the server avoids checking other tables on an insert. By minimizing or avoiding having any index (other than the primary key) on a table, we improve writing efficiency but decrease reading performances. The goal is to try to get rows by their primary key. We also try to avoid having more than one (or two) billion rows. Thus, queries can be relatively fast.
In production, we face to another problem: despite MySQL being able to handle a lot of tables, it doesn’t like having a lot of empty tables. The server uses a lot of memory (and then swaps) when listing tables with information_schema. This is a bug which is fixed in version >= 5.5.47 or 5.6.25 (see MySQL changelog)
In practice
In our case, we reached 40,000 tables in the production environment without any trouble. We chose to use a sequence to generate the names of the tables, so we created many tables per day. Also, we used a table to associate the sequence and the created table (like a table directory)
The example below is written in Java using Spring / Spring-data-Jpa . We create a repository that does some CRUD operations (i.e. create, read, update, delete). We have to create the SQL queries needed for these operations.
Below, is an extract of the code to show you how you can achieve this.
First, we need to define the CRUD operations in an entity repository. We chose to have a sequenceId which is a number and also a number as the primary key for each entity.
The create operation will be something like:
/**
* Create an entity
*
* @param t
* the object to be created
* @return The created object.
*/
@Modifying
@Transactional(readOnly = false)
public T create(T t) {
if (t.getSequenceId() == null) {
return null;
}
final String tableName = getTableName(t.getSequenceId());
final SqlParameterSource params = retrieveSqlParameterSource(t);
final GeneratedKeyHolder keyHolder = new GeneratedKeyHolder();
String template = retrieveInsertTemplate();
String sql = prepareSql(template, tableName);
int nbUpdate = jdbcTemplate().update(sql, params, keyHolder);
if (nbUpdate > 0) {
t.setId(keyHolder.getKey().longValue());
}
return t;
}
Now, let’s do the same for update and delete operations:
/**
* Update an object
*
* @param t
* @return number of objects updated
*/
@Modifying
@Transactional(readOnly = false)
public int update(T t) {
if (t.getSequenceId() == null) {
return 0;
}
final String tableName = getTableName(t.getSequenceId());
final SqlParameterSource params = retrieveSqlParameterSource(t);
String template = retrieveUpdateTemplate();
String sql = prepareSql(template, tableName);
return jdbcTemplate().update(sql, params);
}
/**
* Delete an object
*
* @param t
* @return number of objects deleted
*/
@Modifying
@Transactional(readOnly = false)
public int delete(T t) {
if (t.getSequenceId() == null) {
return 0;
}
final String tableName = getTableName(t.getSequenceId());
final SqlParameterSource params = retrieveSqlParameterSource(t);
String template = retrieveDeleteTemplate();
String sql = prepareSql(template, tableName);
return jdbcTemplate().update(sql, params);
}
For the read operation, we need to create a row mapper in order to constructs our entity.
/**
* Mapper for SELECT queries
*
* @param sequenceId
*
* @return the mapper for select queries
*/
protected org.springframework.jdbc.core.RowMapper<T> retrieveRowMapper(long sequenceId) {
return new BeanPropertyRowMapper<T>(retrieveBeanClass()) {
@Override
protected Object getColumnValue(ResultSet rs, int index, PropertyDescriptor pd) throws SQLException {
if (isDateTimeParameter(pd.getName())) {
return new DateTime(rs.getTimestamp(index).getTime());
} else {
return super.getColumnValue(rs, index, pd);
}
}
};
}
Then, we can make the read operation:
/**
* Retrieve by id and sequence
*
* @param id
* the primary key
* @param sequenceId
* the table to look.
* @return the entity found or null.
*/
@Query
public T findByIdAndSequenceId(long id, long sequenceId) {
final String tableName = getTableName(sequenceId);
final SqlParameterSource params = new MapSqlParameterSource("id", id);
final RowMapper<T> rowMapper = retrieveRowMapper(sequenceId);
String sql = prepareSql(SELECT, tableName);
try {
List<T> result = jdbcTemplate().query(sql, params, rowMapper);
if (result != null && result.size() > 0) {
if (result.size() == 1) {
T t = result.get(0);
t.setSequenceId(sequenceId);
return t;
} else {
throw new IncorrectResultSizeDataAccessException(1, result.size());
}
} else {
return null;
}
} catch (BadSqlGrammarException e) {
// In this case BadSqlGrammar means table doesn't exists.
return null;
}
}
Finally, by implementing the repository and the entity, we can use it “normally”, which can be done as follows:
Message msg = null;
long seqId = sequenceId.incrementAndGet();
// create the table
messageRepository.createTable(seqId);
// insert message
msg = new Message();
msg.setSequenceId(seqId);
msg.setContent("TEST");
msg.setCreationDate(new DateTime());
assert (msg.getId() == null);
msg = messageRepository.create(msg);
assert (msg.getId() != null);
// update message
msg.setContent("TEST2");
messageRepository.update(msg);
// query the message
Message readMsg = messageRepository.findByIdAndSequenceId(msg.getId(), seqId);
assert (readMsg != null);
assert (readMsg.getContent().equals("TEST2"));
// delete the message
int nb = messageRepository.delete(msg);
assert (nb > 0);
// drop the table
messageRepository.dropTable(seqId);
You can now extend the repository in order to have a lot of read functions, such as findAll, etc…
However, there are some limitations. We wanted ACID properties. We have them because our tables haven’t foreign key. The Consistency is impacted but we have no loss of data (our main goal). Another limitation is to avoid having more than ten millions rows per table. If needed, create another table on the fly. To do this, you can use the SQL query CREATE TABLE xxx IF NOT EXISTS. This will save you from having to test whether the table exists or not.
Finally, if you have to upgrade the schema, you must have the same wariness as usually with SQL database.
Conclusion
We have succeeded in having billions of rows in different tables without decreasing performances. We also have ACID properties (our main goal), so no loss of data, with transactions enabled. Older SQL engines can do the job. Therefore, if we don’t need (very) high scalability, why not choose SQL only ?