# Can we do data better?
## 2024
Chairs: Stephanie Lussier and Doug Kelkhoff
### Databases
A lot of thought is currently going into databases, but few companies use them in primary data flows (although they are used for curated trial data for secondary use, e.g. Novartis' Data42 and Roche's EDIS).
### Blockers
- Dependence on CROs that deliver SAS datasets generated by SAS code.
- IT groups are often wary of the cloud, which is sometimes confusing given that platforms like Medidata are already cloud-based and other companies already keep SDTM/ADaM in AWS S3 or similar cloud storage.
- Unclear justification for change, particularly what databases add over current SDTM/ADaM primary use; existing systems are mostly functional.
- Challenges with concurrent data access by multiple teams in some file-based approaches, leading to errors.
### An Approach Using TortoiseSVN
- One company has been using TortoiseSVN for a while and is considering moving to Snowflake.
- Pros: Integration with version control and modern cloud storage solutions.
- Cons:
- Higher entry threshold for users.
- Lack of a user-friendly GUI.
- Storing data in 'normal' version control rather than tools designed for data versioning rapidly leads to bloated repositories.
### Version Control and Data Storage
- Alignment on code versioning in Git and data versioning in tools built for it, such as S3 object versioning.
- S3 can be accessed either as a mounted drive (e.g. via a Lustre mount) or through the S3 API; a minimal read sketch follows this list.
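A minimal sketch of the API route, assuming an arrow build with S3 support; the bucket and object paths below are hypothetical.

```r
library(arrow)

# Hypothetical bucket/key; the mounted-drive route would read the same
# Parquet file from a local path such as /mnt/s3/adam/adsl.parquet.
bucket <- s3_bucket("example-trial-data")        # assumed bucket name
adsl   <- read_parquet(bucket$path("adam/adsl.parquet"))
```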
### Denodo as Data Fabric Mesh
One company uses Denodo as a data fabric/mesh; users interact with data through Denodo, which serves as an API layer, and never touch the source data directly.
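As a rough illustration only: Denodo exposes its virtual views over standard interfaces such as ODBC/JDBC, so from R this could look like an ordinary database connection. The DSN and view name below are assumptions, not the company's actual setup.

```r
library(DBI)
library(odbc)

# Assumed, pre-configured ODBC DSN pointing at the Denodo virtual database
con  <- dbConnect(odbc(), dsn = "denodo_prod")
adsl <- dbGetQuery(con, "SELECT USUBJID, TRT01A FROM adam_adsl")  # illustrative view
dbDisconnect(con)
```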
### Nontabular Data
- Not common for statistical programmers working on clinical trial data.
### CDISC Dataset JSON vs. Manifest JSON
Writing CDISC Dataset-JSON is very slow and potentially not sufficient for regular working data.
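A rough, generic comparison (not the actual CDISC Dataset-JSON schema, nor a benchmark of any specific tool): writing the same toy data frame as row-oriented JSON text versus columnar Parquet illustrates the kind of speed and size gap behind this concern.

```r
library(jsonlite)
library(arrow)

# Toy working dataset, not a real CDISC domain
dm <- data.frame(
  USUBJID = sprintf("SUBJ-%05d", 1:100000),
  AGE     = sample(18:85, 100000, replace = TRUE),
  ARM     = sample(c("PLACEBO", "ACTIVE"), 100000, replace = TRUE)
)

system.time(write_json(dm, "dm.json", dataframe = "rows"))  # text, row-oriented
system.time(write_parquet(dm, "dm.parquet"))                # binary, columnar
file.size(c("dm.json", "dm.parquet"))
```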
### Popularity and Concerns with Parquet Datasets
- The admiral tool generates Parquet directly; others convert from SAS to Parquet (a conversion sketch follows this list).
- Questions about the longevity and maintenance requirements of Parquet, since it is a binary format rather than a 'human-readable' one like CSV/JSON.
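A minimal SAS-to-Parquet conversion sketch using haven and arrow; the file names are placeholders.

```r
library(haven)  # reads sas7bdat files
library(arrow)

adsl <- read_sas("adsl.sas7bdat")      # placeholder path
write_parquet(adsl, "adsl.parquet")

# Quick round-trip check that the Parquet copy reads back intact
stopifnot(nrow(read_parquet("adsl.parquet")) == nrow(adsl))
```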
### Handling Legacy Data
- For secondary use, the suggestion is to stack legacy data into a database (a minimal sketch follows).
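A minimal sketch of that stacking idea, assuming the legacy datasets share a consistent set of columns; the DuckDB file, folder layout, and table name are illustrative, not anything discussed in the session.

```r
library(DBI)
library(duckdb)
library(haven)

con <- dbConnect(duckdb(), dbdir = "legacy_pool.duckdb")  # illustrative database file

# Stack per-study legacy ADSL files into one table for secondary use
paths <- list.files("legacy_studies", pattern = "adsl.*\\.sas7bdat$",
                    recursive = TRUE, full.names = TRUE)
for (path in paths) {
  ds <- read_sas(path)
  ds$SOURCE_FILE <- basename(path)                 # keep per-study provenance
  if (!dbExistsTable(con, "adsl_stacked")) {
    dbWriteTable(con, "adsl_stacked", ds)
  } else {
    dbAppendTable(con, "adsl_stacked", ds)         # assumes consistent columns
  }
}

dbDisconnect(con, shutdown = TRUE)
```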
### Change Management
- For statistical programmers, direct instruction on the new systems is necessary.
- Emphasize direct support over broad training.
- Simplify systems for users to reduce friction.
- Consider a GUI similar to Azure.
- Focus on reducing the user burden.
### Different Data Use Cases
Data is used differently in different contexts (e.g., a Shiny app vs. regulatory documents).
Dashboards can access EDC data directly without needing snapshots.
### **Summary**
Uncertain value in moving CDISC-standard data from file-based storage into databases.
Limited interest and action in this area across the organization.
Not a high priority given other ongoing organizational changes.
Ongoing shift away from SAS-based datasets and file storage to cloud-based systems, with increasing use of Parquet.
### **Action Items**
- SCE whitepaper: Mark Bynum from J&J
- Is there actual value / gain in databases?
- Not the best investment relative to other non-data changes going on across the organization (e.g. R, containers, etc.)