Earlier today, Richard Monson-Haefel posted here on insideria.com about the "One Million Record Challenge". I thought this was a pretty interesting test for any RIA platform, although it will yield wildly different results on different machines, and can't necessarily be used as a true benchmark.
___________________________________
Andrew Trice
Principal Architect
Cynergy Systems
http://www.cynergysystems.com
I decided to take on this challenge with Flex. I used Richard's Java program to generate three files: one with 10,000 records, one with 100,000 records, and one with 1,000,000 records. In my opinion, Flex/Flash passed the test. Here's what I found on my machine (Windows XP, Intel Core 2 Duo @ 2 GHz, 2 GB of RAM) when loading the data from the local file system.
| Records | Load Time | Parse Time | Total Time | Column Sort |
|---|---|---|---|---|
| 10,000 | 114ms | 101ms | 215ms | < 1 sec |
| 100,000 | 278ms | 447ms | 725ms | ~ 1 sec |
| 1,000,000 | 1749ms | 7550ms | 9299ms | ~ 5-6 secs |
The data files themselves were actually pretty big, so the example below only includes the 10,000 and 100,000 record files. Here are the generated file sizes:
| Records | File Size |
|---|---|
| 10,000 | 909K |
| 100,000 | 9,083K |
| 1,000,000 | 90,821K |
Here's the example online. Just select a source file and click on the "load data" button. Give it a few minutes to load the data (it's pretty big).
And now, the source code...
There are a few tricks to making this work. The code uses an HTTPService to retrieve the CSV data file. Since it is CSV, you have to parse it yourself. Be sure to set the resultFormat of the HTTPService to "text" so that Flex doesn't automatically try to parse the data for you. The 1 million record file crashed my browser until I set the resultFormat option.
Next, since this is only for reading large quantities of data, I decided not to use any bindings. I used plain-old arrays instead of ArrayCollections to keep it simple and fast.
Next, just parse the data and shove it into the default Flex mx:DataGrid object. The DataGrid component only renders what is actually visible on the screen, so it is very fast. All of the sorting uses the default sort logic.
You can check out the full source here:
<?xml version="1.0" encoding="utf-8"?>
<mx:Application xmlns:mx="http://www.adobe.com/2006/mxml" layout="absolute" viewSourceURL="srcview/index.html">

    <mx:Script>
        <![CDATA[
            import flash.utils.getTimer;
            import mx.rpc.events.FaultEvent;
            import mx.rpc.events.ResultEvent;

            // Timestamps in milliseconds, taken with getTimer().
            private var initialTime : int;
            private var loadComplete : int;

            // Start the timer and request the selected file.
            private function loadData() : void
            {
                initialTime = getTimer();
                httpService.url = source.text;
                httpService.send();
                loadButton.enabled = false;
            }

            // The raw text has arrived: split it into rows, then into fields.
            private function onResult( event : ResultEvent ) : void
            {
                loadComplete = getTimer();
                var result : Array = [];
                var arr : Array = event.result.toString().split( "\n" );
                for each ( var str : String in arr )
                {
                    result.push( str.split( ", " ) );
                }
                // Plain Array as the data provider; no bindings involved.
                dg.dataProvider = result;
                loadButton.enabled = true;

                // Read the timer once so both figures use the same end point.
                var parseComplete : int = getTimer();
                resultText.text = "data loaded in " + (loadComplete - initialTime) + "ms, data parsed in " + (parseComplete - loadComplete) + "ms, total " + (parseComplete - initialTime);
            }

            // Re-enable the button and report the failure.
            private function onFault( event : FaultEvent ) : void
            {
                loadButton.enabled = true;
                resultText.text = "load failed: " + event.fault.faultString;
            }
        ]]>
    </mx:Script>

    <!-- resultFormat="text" keeps Flex from trying to parse the response itself. -->
    <mx:HTTPService
        id="httpService"
        result="onResult(event)"
        fault="onFault(event)"
        resultFormat="text"/>

    <mx:ApplicationControlBar dock="true">
        <mx:ComboBox id="source">
            <mx:dataProvider>
                <mx:Array>
                    <mx:String>assets/10_thousand.csv</mx:String>
                    <mx:String>assets/100_thousand.csv</mx:String>
                    <!--<mx:String>assets/1_million.csv</mx:String>-->
                </mx:Array>
            </mx:dataProvider>
        </mx:ComboBox>
        <mx:Button label="Load Data" click="loadData()" id="loadButton" />
        <mx:Text text="" id="resultText" />
    </mx:ApplicationControlBar>

    <!-- Each row is an Array, so the columns read fields by index. -->
    <mx:DataGrid id="dg" bottom="10" top="10" left="10" right="10">
        <mx:columns>
            <mx:DataGridColumn dataField="0" />
            <mx:DataGridColumn dataField="1" />
            <mx:DataGridColumn dataField="2" />
            <mx:DataGridColumn dataField="3" />
        </mx:columns>
    </mx:DataGrid>

</mx:Application>
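One caveat with the parsing step in the listing above: splitting on "\n" produces an extra, empty last row if the file ends with a newline, and leaves a stray "\r" at the end of each line if the file uses Windows line endings. Here is a minimal sketch of a more defensive parse step. The function name parseCsv and its separator parameter are made up for illustration and are not part of the original source, and I have not checked which line endings Richard's generator actually produces.

```actionscript
package
{
    /**
     * Splits CSV text into an Array of rows, each row an Array of Strings.
     * Hypothetical helper for illustration; not part of the original listing.
     * It is a naive split and does not handle quoted fields.
     */
    public function parseCsv( text : String, separator : String = ", " ) : Array
    {
        var rows : Array = [];

        // A RegExp delimiter accepts both "\n" and "\r\n" line endings.
        var lines : Array = text.split( /\r?\n/ );

        for each ( var line : String in lines )
        {
            // Skip blank lines, such as the one after a trailing newline.
            if ( line.length == 0 )
            {
                continue;
            }
            rows.push( line.split( separator ) );
        }
        return rows;
    }
}
```

In onResult() this would stand in for the loop, as in `dg.dataProvider = parseCsv( event.result.toString() );`. Splitting on a regular expression may cost more than splitting on a plain string, so it would be worth timing against the 1 million record file before adopting it.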




Andrew, this is excellent. I appreciate that you took the time to do this. It shows that Flex is a robust and very viable platform for doing this type of data manipulation!
We are currently working on a project using CSVLib in flex and have found out just how powerful Flex is for parsing large recordsets on the fly.
By the way, here's what your example took on my machine:
data loaded in 3314ms, data parsed in 36ms, total 3350
That's a dual-core 2.4 GHz machine with 4 GB RAM on 64-bit Vista.
For 100k records:
data loaded in 39260ms, data parsed in 614ms, total 39874
data loaded in 45396ms, data parsed in 216ms, total 45612
Mac, Safari, Flash Player 10
That was a MacBook Pro, 2.2 GHz, 4 GB RAM, Mac OS X Leopard, with the 100k dataset.
Thanks for the feedback, Raul and TJ. As you can see, the majority of the time is spent actually transferring data from the server to the client, not parsing the data. In the timings you posted, loading accounts for roughly 98 to 99 percent of the total. The Flash player is extremely powerful and fast when parsing this data. No matter what technology is used, you would have some serious latency loading data this size in a real application.
After I saw that post the other day, I wondered how long it would take someone to write a Flex version. :-) Thanks for the info Andrew.
I would be curious to see the difference in performance when the data is loaded using remoting. I would think it would be considerably faster.
I'm on an iMac 2.4, 2 GB RAM:
data loaded in 14614ms, data parsed in 376ms, total 14990
Tim
Hi Tim, I'd be curious to see the difference too. I just decided to stick with what Richard had already done, to have a consistent comparison.
I know from the Java world that the hard part of handling a million rows or more in a data grid is multiple selection. Have you tried this with a large set of randomly selected items, say 10,000 out of your million, and then sorting while maintaining the selection? I did some work on this problem on the Java side and came up with a new selection model that helped in some of the cases with huge data sets, but it was not perfect. I wonder how well Flex handles this.
http://www.jasperpotts.com/blog/2007/11/faster-swing-lists-and-tables-upto-88000x/
Jasper
Good question... I just threw this together to see if Flex could handle that many records in a table. I'll have to test it out.
Another thing worth mentioning: when I performed the test I had a significant number of apps open and services running on my machine, including a SIP client, ColdFusion, a web server, Pandora, MS Outlook, Eclipse, SQL Studio, Photoshop and a few RDP sessions. Given that, I think the performance would be much better for the average Joe with the same specs.
First of all, thanks for this extremely relevant and helpful article. You are writing about exactly the things we are trying to do and it is a huge help!
I have a few comments / questions:
(1)
When you change the "dataProvider" property of DataGrid, is the rendering and updating synchronous or asynchronous? That is, does the call to "set dataProvider" block until the DataGrid rendering is complete? I'm a novice to Flex's thread model (I know there really isn't much of one) so I don't know. If the call is not synchronous, i.e. "set dataProvider" returns right away, before the rendering is necessarily complete, then it might make sense to move the math from the final line of "onResult()" to a separate method called by the "updateComplete" event of the DataGrid. Something like this (leaving out the enclosing "script" and "CDATA" pieces):
private function handleFinish(event:FlexEvent):void {
    loadButton.enabled = true;
    resultText.text = "data loaded in " + (loadComplete - initialTime) + "ms, data parsed in " + (getTimer() - loadComplete) + "ms, total " + (getTimer() - initialTime);
}

<mx:DataGrid id="dg" updateComplete="handleFinish(event)" . . . />
I tried this myself and couldn't really distinguish much of a time difference - so maybe "set dataProvider" really is synchronous. I dug into the Adobe source and found a call to "dispatchEvent()" but had trouble following from there. What do you think?
(2)
Just to clarify, are you saying that this:
private function onResult( event : ResultEvent ) : void
{
    . . .
    dg.dataProvider = result;
    . . .
}

<mx:DataGrid id="dg" . . . />

is faster than this:

private function onResult( event : ResultEvent ) : void
{
    . . .
    dg.myProvider.source = result;
    . . .
}

<mx:DataGrid id="dg" dataProvider="myProvider" . . . />
<mx:ArrayCollection id="myProvider" />
I'm having trouble proving it one way or the other. My intuition would be to agree with you (one less layer of wrapping and change notification) but, believe it or not, I sometimes get results where the second approach is faster. Again, what do you think?
(3)
I think I got a minor improvement by substituting a String const for the ", " in the call to "split()".
Thanks again for this excellent, thought-provoking piece. This literally is what most Flex programmers writing business apps are thinking about every day, so I am very grateful you are addressing it!
Best,
-Noah
UPDATE: I did some more tests and for item (2), the anomaly went away: the first method is faster, just as you said in the article. This would seem to say we should avoid using Collection classes as dataProviders when we can - for example when we are refreshing the entire record set on a given update, thus obviating the need for the binding/updating behavior the Collections provide. That is, use Array over ArrayCollection and XMLList over XMLListCollection. Would you agree? If so I think it's important people know this, as the Adobe docs and tutorials seem to encourage people to use the Collection classes.
Also, I assume performance is a lot worse when we use XML, right? That's more to think about.
Thanks again for the article, it really got me thinking.
Hi Noah,
1) When you set the dataProvider, it is not synchronous. It flags the component as "dirty", which will invoke "validateProperties" later in the runtime lifecycle, which will in turn invoke "invalidateDisplayList". So, technically speaking, it is not complete until the display list has been validated. That said, the time taken to render the graphics is trivial compared with the time needed to parse the data.
2) You have to set the dataProvider to something first. You can't just set dg.dataProvider.source because, by default, the dataProvider is null. You have to specify it as an instance of something. In this case I used "Array".
Collections are the preferred objects to use instead of "plain-old" XMLList and Array because collections support bindings and include helper functions to easily update specific items in the collection. Basic Array or XMLList objects do not support binding or dispatch any data-binding events.
In this case, I did not want any binding events to fire because of the huge size of the dataset. Normally, you do want bindings to be applied, so that the grid contents are updated any time the contents of the collection are updated. The Collection classes are extremely helpful for keeping your view and data model in sync.
3) Thanks, did it speed it up much?
4) I imagine XML will be much slower because there is more complex parsing involved, and the language itself is more verbose and would have a larger file size. I have not tested this.
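To illustrate point 1, here is a minimal sketch of the deferred validation pattern that Flex components follow. The class DeferredList and its items property are made up for this example; this is not the actual DataGrid source, only the general shape of a component that defers its work.

```actionscript
package
{
    import mx.core.UIComponent;

    // Hypothetical component, for illustration only.
    public class DeferredList extends UIComponent
    {
        private var _items : Array;
        private var itemsChanged : Boolean = false;

        public function get items() : Array
        {
            return _items;
        }

        // The setter only records the change and marks the component dirty.
        // It returns right away; nothing is drawn here.
        public function set items( value : Array ) : void
        {
            _items = value;
            itemsChanged = true;
            invalidateProperties();
        }

        // Called by the framework on a later validation pass.
        override protected function commitProperties() : void
        {
            super.commitProperties();
            if ( itemsChanged )
            {
                itemsChanged = false;
                // Request a redraw now that the new data is in place.
                invalidateDisplayList();
            }
        }

        // Called by the framework after commitProperties(); drawing happens here.
        override protected function updateDisplayList( unscaledWidth : Number, unscaledHeight : Number ) : void
        {
            super.updateDisplayList( unscaledWidth, unscaledHeight );
            // A real list would render only the visible rows of _items here.
        }
    }
}
```

Because the setter returns before any of the validation methods run, a timestamp taken right after the assignment does not include rendering. The "updateComplete" event is dispatched once the validation pass has finished, which is why it is the right place to hook in if you want the render time included.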
Awesome display of power! Thanks for taking up the 1 million record challenge. I see that I have my work cut out for me!
Thanks Richard, I enjoyed the challenge, I can't say that this is anything I've ever tried before.
"hard thing of handling a million rows or more in a data grid is multiple selection"
If your real-world RIA is showing an end user a million rows at once, and expecting them to find the one(s) they want to select by looking through it by eye, you have a serious UX problem...
Tom, Very valid point. I can't imagine trying to fight a UI that makes you sort through a million records on your own.
Hi,
I have a CSV file with a date column and columns of other data types.
When I read it, some of the date fields are picked up and others are left out.
My dataset returns all of the records, but some of the date fields display as empty.
How can I handle this?
Thanks
Hi Richard,
Sorry I didn't reply sooner! My "item 3" sped things up only a tiny bit, really.
Thanks again for the article,
-Noah