Chapter 8 Data Science

8.1 Practical text analytics

  • Big data does not have a consistent and concise definition. We might say somewhat lightheartedly that it appears to involve more data than you can handle comfortably with whatever software and hardware you now own.

8.2 Data science team

The source is an article about data science in startups, but we can still learn some general points from it.

  • What can data science do?
    1. Improving the products: rely on a virtuous cycle where products collect usage data that becomes the fodder for algorithms, which in turn offer users a better experience.
      • The first version of your product has to address what data science calls the “cold start” problem: it has to provide a “good enough” experience to initiate the virtuous cycle of data collection and data-driven improvement. Product managers and engineers are needed to implement that first solution. (At least some people have to start using the product so that it enters the cycle of collecting data and improving the product with that data.)
      • Data scientists collaborate with engineers
      • Decide whether data scientists implement product enhancements themselves or partner with engineers who implement them. Either approach can work, but it’s important to formalize it and establish shared expectations across the organization. Otherwise, you’ll struggle to get improvements into production, and you’ll lose talented data scientists who feel unproductive and undervalued.
      • Skills: machine learning knowledge and production-level engineering skills
    2. Improving the decisions:
      • Decision science uses data analysis and visualization to inform business and product decisions. The decision-maker may be anywhere in the organization — from a product manager determining how to set priorities on a road map to the executive team making bet-the-company strategic decisions.
      • Characteristics of decision science:
        • Subjective, requiring data scientists to deal with unknown variables and missing context
        • Complex, with many moving parts that lack clear causal relationships
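        • Measurable and impactful: the result of making the decision is concrete and significant for the business [Hui: The article’s author also lists one more characteristic: these are new problems that the company has not needed to solve before. I disagree. The author probably says so because the article is aimed at startups. Large traditional companies mostly face very old decision problems that used to be settled by guessing... or, to put it more kindly, by the guesses of industry experts.]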
    • This sounds like data analytics, but decision science should do more than produce reports and dashboards. Data scientists shouldn’t be doing work that can be delivered using off-the-shelf business intelligence tools.
    • Decision science is not always needed. In two cases, businesses have to rely on intuition and experimentation instead:
      • Some decisions are too small to justify the investment
      • Some decisions lack the data needed to analyze them meaningfully
      • Good decision scientists know their own limitations
    • Skills: business and product sense, systematic thinking, and strong communication skills
  • Should you be investing in data science?
    • Invest: it’ll be critical to your success
    • Not invest: it’ll just be an expensive distraction
    1. Are you committed to using data science to either inform strategic decisions or build data products?
      • If you’re not committed to using data science toward one of these goals, then don’t hire data scientists.
      • Culture of data-driven decision making is necessary
    2. You may not need them on day one, but it takes time for you to hire the right people — and time for them to get to know your data and your business. You’ll need all that to happen before they can apply data science to drive decision making.
    3. Data products can create value and delight users through improved optimization, relevance, etc. If these are on your product roadmap, you should bring data scientists in early to make the design decisions that will set you up for long-term success. Data scientists can make key decisions about product design, data collection, and systems architecture that are critical foundations for building magical-seeming products.
    4. Will you be able to collect the data you need and act on it?
      • Data science requires data: quantity and quality
      • Data science only matters if data drives action. Data should inform product changes and drive the organization’s key performance indicators (KPIs). Instrumentation requires a commitment across the organization to identify what data each product needs to collect and establish the infrastructure and processes for collecting and maintaining that data. To be successful, instrumentation requires collaboration among data scientists, engineers, and product managers — which in turn requires executive commitment. Data-driven decision making requires a top-down commitment. From the CEO down, the organization has to commit to making decisions using data, rather than based on the highest paid person’s opinion.
  • Do you need data science to be a core competency, or can you outsource it?
    • If data science is solving problems that are critical to your success, then you can’t afford to outsource it. Also, off-the-shelf solutions tend to be rigid. If your business is taking a unique approach to a problem (e.g. collecting new kinds of data or using the results in novel ways), it’s unlikely that an off-the-shelf solution will be flexible enough to adapt to it.
  • Where does data science belong in your organization?
  1. A standalone team: your data science team acts as an autonomous unit parallel to engineering. The head of data science is a key leader and typically reports to the head of product or engineering, or even directly to the CEO.
    • Pros:
      • Autonomy
      • Well positioned to tackle whatever problems it deems most valuable
      • Demonstrates that the company sees data as a first-class asset, which will help them attract world-class talent.
      • Works particularly well for decision science teams. Even though decision scientists collaborate closely with product teams, their independence helps them to make hard calls, like telling PMs that their product’s metrics aren’t good enough to justify a launch. Decision scientists also benefit a lot from cross-pollination, both to understand how different product metrics depend on one another and to share more general learnings about experimentation and data analysis.
    • Cons:
      • Risk of marginalization. As companies grow and organize into product teams, they often prefer to be self-sufficient. Even when they could benefit from collaboration with data scientists, product teams simply don’t want to depend on resources they don’t control. Instead, they rely on themselves — even hiring their own data scientists under other names like “research engineers” — to get things done. If product teams refuse to work with the standalone data science team, then that team becomes marginalized and ineffective. Again, that’s when you start losing good talent.
  2. An embedded model
    • The data science team brings in talented people and farms them out to the rest of the company. There’s still a head of data science, but he or she is mostly a hiring manager and coach.
    • The embedded model is the polar opposite of the standalone model: It gives up autonomy to ensure utility.
    • Pros: data scientists join the product teams that most need their services, and get to work on a wide variety of problems throughout the organization.
    • Cons:
      • Not all data scientists are happy giving up autonomy (in fact, many are not good at it at all). Data scientist job descriptions emphasize creativity and initiative, but embedded roles often require them to defer to the leadership of the teams in which they are embedded.
      • Data scientists will feel like second-class citizens as embedded team members — their product leads don’t feel responsible for their growth and happiness, while their managers won’t feel directly vested in their work.
    • We’ve seen some companies embed data science managers, but this approach only works once you have a fairly large data science team.
  3. An integrated team
    • No separate data science team. Instead, product teams hire and manage their own data scientists.
    • Pros: This optimizes for organizational alignment. By making data scientists first-class members of their product teams, it addresses the downsides of the standalone and embedded models. To the extent that data scientists, software engineers, designers, and product managers work on shared product goals, the integrated model instills collective team ownership of those goals. This is how you avoid the breakdowns that can occur when narrowly focused functional teams diverge in their goals and end up mired in dependencies that are too often ignored or delayed.
    • Cons:
      • It dilutes the identity of data science. Individual data scientists identify with their associated product teams, rather than a centralized data science team.
      • You also sacrifice the flexibility of the embedded model, since it’s harder to move people around based on their skills and interests.
      • The integrated model can create challenges for scientists’ career growth, since the manager of an integrated team may not be in the best position to value or reward their accomplishments.

Over time, the impact that a data science team has will be far higher if you build a diverse team with extremely different backgrounds, skill-sets, and world views.

This will ensure they think as holistically as possible about their domain, and will encourage creativity and innovation over time.

8.3 Data Quality

  • Kaushik, Data Quality Sucks, Let’s Just Get Over It (2006)
  1. Assume those managing the data have a level of comfort with the data;
  2. Start making decisions that you are comfortable with;
  3. Over time, drill deeper into micro-specific areas and learn more;
  4. Get more comfortable with data and its limitations over time;
  5. Consistency in calculations = good.

8.4 Data Visualization

Three rules to visualize insights with impact:

  1. Highlight your message and eliminate distractions;
  2. Use visual cues to help lead your audience through your insight;
  3. Use contrast (size, color) to capture the reader’s attention.

8.5 Big Data (The Impact of Big Data)

8.5.1 What is big data

  • Trend: analyzing unstructured data

  • Generalized definition:
    1. Structured data
    2. Unstructured data (text, voice, video, GPS, sensor)
    3. Data storage, processing, and analysis techniques: Hadoop, NoSQL, machine learning/statistics
    4. Data scientists and data-oriented organizations

  • Bottleneck of machine learning: storing and processing big data (I doubt it)

  • From point information to thread information
    • Point information: purchased a product, received a service, ...
    • We need thread information to answer WHY!
    • Thread information: customer interaction data (unstructured or historical behavior)

8.5.2 Walk Between Privacy and Innovation

  • Collaborative Filtering (Amazon): a method of making automatic predictions about the interests of a user by collecting preferences or taste information from many users (collaborating). The underlying assumption of the collaborative filtering approach is that if person A has the same opinion as person B on one issue, A is more likely to share B’s opinion on a different issue X than to share the opinion on X of a randomly chosen person.

  • Rapleaf: uses an email address or name to build a personal profile (linked to information on Facebook)
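
The collaborative filtering assumption above can be illustrated with a minimal user-based sketch. It is not Amazon's actual method. The `ratings` data, the user and item names, and the helper functions `cosine_similarity` and `predict` are all hypothetical, made up for illustration. The idea is to weight other users' opinions on an unseen item by how closely they agree with the target user on shared items.

```python
from math import sqrt

# Hypothetical ratings (user -> {item: rating}); invented for illustration only.
ratings = {
    "A": {"book": 5, "film": 4, "album": 1},
    "B": {"book": 5, "film": 5, "album": 1, "game": 5},
    "C": {"book": 1, "film": 2, "album": 5, "game": 1},
}


def cosine_similarity(u, v):
    """Cosine similarity between two users over the items both have rated."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = sqrt(sum(u[i] ** 2 for i in shared))
    norm_v = sqrt(sum(v[i] ** 2 for i in shared))
    return dot / (norm_u * norm_v)


def predict(ratings, user, item):
    """Similarity-weighted average of other users' ratings for `item`.

    Returns None when no other user has rated the item.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for other, their_ratings in ratings.items():
        if other == user or item not in their_ratings:
            continue
        sim = cosine_similarity(ratings[user], their_ratings)
        weighted_sum += sim * their_ratings[item]
        weight_total += sim
    return weighted_sum / weight_total if weight_total else None


if __name__ == "__main__":
    # A has not rated "game"; B agrees with A on shared items, C does not.
    print(predict(ratings, "A", "game"))
```

Because A agrees with B far more than with C on the shared items, the predicted rating for "game" lands closer to B's rating than to C's, which is the assumption stated above. Production recommenders usually also center ratings per user and handle sparse data, which this sketch leaves out.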

8.5.3 Data Aggregator

  • Pay as You Drive: car insurance that offers discounts based on driving habits
  • For companies with in-house technical skills: collection, cleaning, analysis, delivery of results, implementation
  • Data aggregators: provide hard-to-collect offline data (what are those?)
  • Who will become data aggregators: those who develop devices for data entry and collection
  • The more personal the data, the more useful it is and the harder it is to get
  • Payment service companies (VISA, American Express, etc.): on the way to becoming data aggregators [VISA collaborates with GAP]
  • Aggregate data
    • Internal Data + External Data -> Premium Data (multiplication effect)
    • Example: Catalina Marketing and Nielsen evaluate TV ads by studying the association between audience ratings and actual purchases
    • What data to get?
    • Example: the iPhone app Nike+ GPS collects running routes (anonymized), and the information is used to gain insight for store locations

8.5.4 Data Scientist

  • Sexy
  • Characteristics:
    • Communication skills
    • Entrepreneurship: explore how to make data useful, leading to new data-based services
    • Curiosity: not limited to a single field such as art, technology, medicine, or natural science, but keenly curious about many different areas. By analyzing data from different fields, a curious data scientist can strike a bonanza

8.5.5 What can machines not do?

  • We have no chance of competing against machines on frequent, high-volume tasks.
  • Machines cannot compete with us when it comes to tackling novel situations, and this puts a fundamental limit on the human tasks that machines will automate. Machines need to learn from large volumes of past data, whereas humans can connect seemingly disparate threads to solve problems we’ve never seen before.
  • So what does this mean for the future of work? The future state of any single job lies in the answer to a single question: To what extent is that job reducible to frequent, high-volume tasks, and to what extent does it involve tackling novel situations? On frequent, high-volume tasks, machines are getting smarter and smarter.
  • The copy behind a marketing campaign needs to grab consumers’ attention. It has to stand out from the crowd. Business strategy means finding gaps in the market, things that nobody else is doing. It will be humans that are creating the copy behind our marketing campaigns, and it will be humans that are developing our business strategy.

8.6 Data science position

Predictions for 2017:

  1. Continuous learning will be front and center. Learning is a way of life.
    • Requires a time commitment from both the individual and the company.
    • Always absorbing what is going on out there.
    • They are naturally curious people.
  2. Companies will “train up” existing talent
  3. Predictions 1&2 lead to longer tenure
    • Current situation is the opposite
    • Stagnating, lack of job satisfaction
  4. HR analytics will be standard at larger firms
    • Add headhunter support for analytics hiring
  5. Education options will thin out (a bit)
    • Previous: MOOC/Bootcamp
    • Separation between money-making deals ($15,000 for 12 weeks) and high-quality education
  6. Analytics ROI will be scrutinized
    • What is the associated payback?
    • How long does it take to see results? How much has been invested?
    • It is important to tie analytics to real business problems
  7. Analytics findings will be scrutinized
    • Confidence in the magic of analytics will fade
  8. The internet of things draws more interest
  9. Mission-driven careers are on the rise

Advice: Be open to change

8.7 Everybody Lies

“Makingsense” says that there are things data cannot tell us. “Everybody Lies” says that there are things we can see only by using data, and that the key is knowing where to find the right data.

  • Condom estimates (self-reported vs. actual)
    • Women: sex 55 times/year, a condom used 16% of the time, implying 1.1 billion condoms/year
    • Men: 1.6 billion condoms/year
    • Reality: 600 million condoms/year
  • Search volume: sexless marriage > unhappy marriage > loveless marriage
  • Racial discrimination: people on the East Coast (deeply calculating; they hide what they think)
  • A large effect shows up in small data; a small effect needs big data
    • Example: detecting pancreatic cancer symptoms from search histories
  • Weather vs. depression (Hawaii vs. Chicago: searches for “depression” differ by 40%)
  • Gender discrimination:
    • Boys: intelligent, happy
    • Girls: overweight, pretty
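
The condom estimates above can be checked with a few lines of arithmetic. This is only a sketch using the figures from these notes. The variable names are made up, and `implied_women` is back-calculated from the notes' totals, not a number stated in the source.

```python
# Figures taken from the notes above (self-reported numbers vs. reality).
sex_per_year = 55              # times per year, as reported by women
condom_rate = 0.16             # share of encounters where a condom is used
women_reported_total = 1.1e9   # condoms/year implied by women's answers
men_reported_total = 1.6e9     # condoms/year implied by men's answers
actual_total = 600e6           # condoms/year in reality

# Condoms per woman per year implied by the survey answers.
condoms_per_woman = sex_per_year * condom_rate

# Back-calculated (assumption): the population size that makes the
# per-person figure consistent with the 1.1 billion total.
implied_women = women_reported_total / condoms_per_woman

# How many times larger each self-reported total is than reality.
women_overstatement = women_reported_total / actual_total
men_overstatement = men_reported_total / actual_total

if __name__ == "__main__":
    print(f"condoms per woman per year: {condoms_per_woman:.1f}")
    print(f"implied number of women: {implied_women:,.0f}")
    print(f"women's total vs. reality: {women_overstatement:.2f}x")
    print(f"men's total vs. reality: {men_overstatement:.2f}x")
```

Both self-reported totals are well above the real figure, and they also disagree with each other, which is the point of the example: survey answers overstate what the actual data shows.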