Stress tests 1: Simple calculations with long columns #7378

rdstern · 2022-04-15T11:10:26Z

rdstern
Apr 15, 2022
Maintainer

This discussion follows work with an 8-year-old who is interested in numbers. I describe the different components and where they lead me. I suggest one result might be some fun use of R-Instat in maths camps. This is less about standard statistics than about numbers. Let's see where they lead us. One key aspect is to explore the limits of R and R-Instat in coping with long columns.

He wanted to develop an algorithm that would take a list of successive numbers and multiply them by a number of his choosing. I claimed R-Instat is perfect for this.
a) Use or make a new dataframe
b) Use the Prepare > Column: Numeric > Regular Sequence dialogue to get a simple sequence 1, 2, 3, ...
c) Use the Prepare > Calculations dialogue to multiply by a number of his choosing.
d) This gives the answer.
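
Steps a) to d) can be sketched in plain R as follows. This is a minimal illustration only: the names n, multiplier, df, x and y are made up for the example, and the code that the R-Instat dialogues actually generate may differ.

```r
# a) Make a new data frame, b) with a simple sequence 1, 2, 3, ...
n <- 100            # number of rows (assumed for illustration)
multiplier <- 7     # the number of his choosing (assumed for illustration)
df <- data.frame(x = seq_len(n))

# c) Multiply the sequence by the chosen number.
df$y <- df$x * multiplier

# d) This gives the answer; show the first few rows.
print(head(df))
```

Changing n to 50000 or 2e6 is enough to repeat the experiment with longer columns.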

I did this with a new data frame of length 100. He was impressed.
He asked if I could do it with 50,000 rows.
No problem, I said, and did it. But the Regular Sequence dialogue takes a bit of time.
So he then asked for 200 million. No, I said - but I'm not sure - would 200 million work?

He reluctantly agreed to try at 2 million. Now the regular sequence dialogue takes forever. It isn't practical with that length.
I am interested in exploring the limits, both caused by R and (more often) caused by R-Instat. But I want to go further in making the use of R-Instat easy and enjoyable and logical within those limits.
So, I have found problem number 1, namely I suggest the regular sequence dialogue is poorly coded and that's the problem. I note it always gives a preview of the sequence, and wonder if that is causing the problem. If so, then perhaps we don't give the preview for long columns, or we have a Preview button if we want to see it, or we give the first part of the preview only.
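
One rough way to check whether R itself is the bottleneck is to time the sequence in base R and preview only its first few values. This is a sketch under assumptions: it says nothing about how the dialogue is actually coded, and the preview length of 10 is an arbitrary choice.

```r
# Build a 2-million-element sequence directly in base R and time it.
n <- 2e6
timing <- system.time(x <- seq_len(n))
print(timing)

# Preview only the first few values rather than the whole column.
preview <- head(x, 10)
print(preview)
```

If this runs quickly on the same machine, that would be consistent with the delay coming from the dialogue rather than from R, though it does not prove the preview is the cause.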

I am very keen on us being able to cope routinely with variables of length more than 1 million. The data view now copes well with long columns, and this is something potentially simple in R-Instat and fun, and not possible in a spreadsheet.

With the long columns it is also a bit confusing that the row variable is limited to showing 5 places. That's issue #7221 to investigate.
I looked for another way to do this. I therefore used Prepare > Data Frame > Row Numbers/Names instead, and copied the numbers into the first column. This is reasonably fast.
But it gives the row numbers as a character variable, and that gives an error in a calculation. So I have to make it numeric. Can the Row Numbers/Names dialogue detect whether they are numbers, and make the variable numeric if so?
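
The detection could look something like the sketch below. The function name row_names_as_column and the column name row are hypothetical, chosen only for this example; this is not the dialogue's existing code.

```r
# Copy row names into a column, converting to numeric only when every
# row name parses as a number; otherwise keep them as character.
row_names_as_column <- function(df, col_name = "row") {
  rn <- rownames(df)                           # always character
  as_num <- suppressWarnings(as.numeric(rn))   # NA where not a number
  df[[col_name]] <- if (!anyNA(as_num)) as_num else rn
  df
}

numbered <- row_names_as_column(data.frame(value = c(10, 20, 30)))
str(numbered$row)   # numeric, so it can be used in a calculation

named <- row_names_as_column(data.frame(value = 1:2, row.names = c("a", "b")))
str(named$row)      # stays character
```
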
And why can't we make that column, optionally at the start, i.e. in the new data frame? That would solve a problem that I sometimes want just a single variable, without additional empty variables. That would be useful! So I suggest a further "Type" of Variable at the bottom of the list. It is called Sequence. The default is to give the seq command from 1 to the specified length. Optionally you can give a number in the NA. If you give the number x, then it issues the seq command seq(x, length+x-1). If illegal, e.g. character, then it just ignores it and goes from 1 upwards, as with the other defaults.
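
The proposed Sequence default could be sketched as below. The function name make_sequence and its arguments are hypothetical, and the sketch assumes the requested length is at least 1.

```r
# Sequence of the given length, starting at `start` when it is a valid
# number and at 1 otherwise (e.g. when it is NA or a character value).
make_sequence <- function(n, start = NA) {
  stopifnot(is.numeric(n), length(n) == 1, n >= 1)
  valid <- is.numeric(start) && length(start) == 1 && !is.na(start)
  if (!valid) start <- 1
  seq(start, n + start - 1)   # the seq(x, length+x-1) form from the text
}

print(make_sequence(5))        # default: starts at 1
print(make_sequence(5, 3))     # starts at the supplied number
print(make_sequence(5, "a"))   # illegal start is ignored
```
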

But I digress. Back to long variables. As we examine the code of different dialogues, can we please try to see how each copes with long columns - certainly rising to more than 1 million rows.

For simple tasks, such as the above, can we explore in more detail and also discuss the implications?
For example, we tried other functions. It is quite a powerful calculator. So we tried raising to the power 8, with columns of length 2 million. Try it. Remember 2 ^ 8 is 256, so 2 million to the power 8 is 256 followed by 48 zeros. But it isn't in R-Instat. After a few zeros there are digits that are rounding errors. For those examples that are integers, is there a package that does proper integer arithmetic? Of course there is! There is gmp, which is already in R-Instat, and Rmpfr, which isn't there yet. I suggest an Exact keyboard in the calculator might be useful. That's a bigger task.
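
The difference between ordinary and exact arithmetic can be seen with a single value. This sketch assumes the gmp package is installed and skips the exact part if it is not; the variable names are made up for the example.

```r
# Ordinary R numbers are double precision, so only about 15 to 17
# significant digits of a very large result are reliable.
x <- 2e6
as_double <- x^8
print(sprintf("%.0f", as_double))   # digits beyond that precision are not reliable

# The exact answer is 256 followed by 48 zeros.
expected <- paste0("256", strrep("0", 48))

# gmp does arbitrary-precision integer arithmetic.
if (requireNamespace("gmp", quietly = TRUE)) {
  exact <- gmp::as.bigz("2000000")^8
  print(as.character(exact))
  print(identical(as.character(exact), expected))
}
```

The value is passed to as.bigz as a string so that it never goes through a double on the way in.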

We then went on to consider other sequences. The first 1000 primes and the first 1000 Fibonacci numbers are in the rcorpora package. From File > New Data Frame, this is the command for the Fibonacci numbers: data.frame(data=rcorpora::corpora("mathematics/fibonnaciSequence"))
Use primes instead of fibonnaciSequence for the prime numbers.
The Fibonacci numbers show an interesting problem with the grid. There are over 1400 rows, not 1000. That's presumably because some are very large numbers indeed. Could this be fixed in the grid? It is perhaps the same as issue #7165, but maybe not. They are both related to a lot of information in a single cell.

While on this topic I wonder about making it easier to include these Lists of Data in a new data frame. Perhaps that could be an extra radio button in the dialogue?

Patowhiz · 2022-04-23T10:02:54Z

Patowhiz
Apr 23, 2022
Maintainer

@rdstern, regarding the Regular Sequence dialogue, one of the reasons the dialogue is slightly slow to load is described in issue #7408.
