Monday, June 10, 2013

#SLA2013 - Big Data. Big Challenges.

This is an SLA Spotlight session.

Amy Affelt - Compass Lexecon
Know 'em when you see 'em: Big opportunities in big data 

Cool big data applications
* Healthcare
*** Microsoft readmission manager - surfaced some red flags that cause readmission
*** Stanford drug pairings
***** analyze Internet searches for indication of drug interactions
*** Gojo Industries - sensors in hospitals 

* Transportation
*** Street Bump - sends information on possible potholes so that they can be fixed.
*** ODOT - analyze info on cars that are going below the speed limit.  Do those areas need some improvement to make traffic flow better?
*** Xerox ExpressLanes - congestion pricing 

* The Magical World of Disney
*** entertainment - creating a magical experience for guests. The wristbands become the key to everything.  They are selling convenience.  You don't need your purse or credit cards.  All Disney characters can receive info on kids so they can be addressed by name.  Allows Disney to analyze how people move through the park.

* BigML
* Google Fusion Tables

What's in it for me?
* Search to find how your industry is using big data.
* What vexing issues is your organization interested in?  Can you help them address those issues?  Can big data help?
* Embed into IT and big data teams to provide point-of-need research.
* Understand patterns vs. predictions / coincidence vs. causation

Britt Mueller - Qualcomm
How has big data been applied at Qualcomm?
* Too large to parse using traditional tools
* Opportunity to analyze, visualize, cluster and mine for increased understanding and object discovery

Two major opportunities:
* Research 
*** Applicable to new research spaces
*** Create large data sets of data from multiple sources
*** Use analysis tools to create views into the large data sets
*** Produce new "starting places" for traditional expert research
*** Find what we don't know

* Analyzing usage metrics and user behaviors
*** How our population uses the information and tools we provide
*** Combine demographic information, search behaviors/activity, metrics
*** Serve this information back to the user population

* Excel
* Databases - MS Access, Informatica, Oracle
* Intelligence software - Qlikview, Tableau
* Custom search discovery - open source tools, Solr, Lucene

* Joining disparate data 
*** Normalizing and mapping data to maximize analysis is hard
*** Information professionals need to think creatively about pulling together disparate data to enhance discovery
*** Joining different types of content and data increases analysis opportunity and effectiveness of discovery
*** Large, mapped data sets become a deliverable in and of themselves
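
To make the normalizing and mapping point concrete, here is a minimal sketch in Python. It was not part of the session; the two sources, the field names, and the `normalize` and `join_sources` helpers are all hypothetical. It assumes the only shared key is a free-text name that each source formats differently.

```python
# Hypothetical records from two unrelated sources; all names and values are made up.
usage_log = [
    {"user": "A. Smith ", "searches": 12},
    {"user": "b. jones", "searches": 5},
]
directory = [
    {"name": "a. smith", "department": "Engineering"},
    {"name": "B. Jones", "department": "Legal"},
]


def normalize(name):
    """Map a free-text name to a comparable key (trim, collapse spaces, lowercase)."""
    return " ".join(name.split()).lower()


def join_sources(usage, people):
    """Join usage rows to directory rows on the normalized name."""
    departments = {normalize(p["name"]): p["department"] for p in people}
    joined = []
    for row in usage:
        key = normalize(row["user"])
        joined.append({
            "user": key,
            "searches": row["searches"],
            # None marks rows that could not be mapped and need manual review.
            "department": departments.get(key),
        })
    return joined


combined = join_sources(usage_log, directory)
```

Rows that fail to map keep a department of None, which gives a quick measure of how much mapping work remains before the joined set is usable.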

* Content provider outputs 
*** This is new for information professionals, but also for information vendors.
***** Challenge in getting vendors to allow data to be pulled out of their system for analysis.
*** Vendors lack consistency, tech support, or licensing models that support creating outputs for further analysis.
***** Content vendors only provide ~60% of search fields as output

If a field is important enough to be searched, it should be important enough to be provided as an output.

Wilfred Li - UC San Diego
Research Cyberinfrastructure (RCI) Program
Elements of UCSD's integrated research cyberinfrastructure program
* Data center colocation
* Networking
* Data curation
* Centralized storage
* Research computing
* Technical expertise

Where is the data coming from? Many different places, including audio/video equipment and sensors.
How do people store/backup their data?  Every type of device including Google Drive, Dropbox, USB drives, etc.
* People are using hardware that isn't secure or is difficult to recover from.  Generally the data has no metadata.

How long do you need to store the data? Most say 5+ years, permanently, or for the duration of the project (the majority).

Do you need metadata annotation capabilities? 23% said yes.

Risks and challenges:
* Campus may cease funding
* Constantly increasing storage demands
* Bait and switch with increased cost later
* Poor backup plan
* No dedicated support staff 

Top 10 requirements for campus Cyberinfrastructure 
* Better CI with minimal direct cost
* Network attached storage
* Data replication backup
* Dropbox- or Google Drive-like service
* 10G network connection
* Minimal cost beyond hardware cost
* Shared technical expertise
* Distributed multi site replication
* Desktop backup
* Compliant and secure storage for sensitive data
* Tiered storage plans

RCI NAS Data Service

David Minor - UC San Diego
Preservation and curation of University research data: the complexity of big data
* Data curation
* Appraisal
* Accession
* Arrangement
* Description
* Storage
* Preservation
* Access

Two-year pilot process with selected researchers since September 2011
Targeted domains represented across campus
Required explicit researcher participation

Pilot goals include:
Learn how researchers, information technologists, and librarians work together with data

* The Brain Observatory
* NSF Open Topography Facility
* Levantine Archaeology Laboratory 
* Scripps Institution of Oceanography Geological Collections
* The Laboratory of Computational Astrophysics

Complexity at scale
* Issue: moving from files to "objects"
*** Semantic significance
*** Meaning within context
*** Meaning outside of context
* Issue: representing complex data 
*** Rethink data representation processes
*** Broadening metadata processes to accommodate new data types

Interesting DAMS infrastructure 
Complex research collections will be mixed in with regular digital collections
Cross collection discoverability is key.

Content resides at UCSD.  Metadata is searchable through the Online Archive of California.

Researchers want their content findable, but don't always recognize that they need metadata.
Curation after the fact is expensive.  It needs to be done upfront.
There is no standard definition of a dataset.
Researchers want tools and best practices to help them manage their data.
Need to create scalable systems.
