10个基本的数据建模面试问题 *

最好的数据建模专家可以回答的全面来源的基本问题. 在我们社区的推动下，我们鼓励专家提交问题并提供反馈.

是顶级自由软件开发人员的专属网络吗, designers, finance experts, product managers, 和世界上的项目经理. 顶级公司雇佣Toptal自由职业者来完成他们最重要的项目.

Interview Questions

请讨论实体关系(ER)模型中的三种类型的数据模型.

View answer

The Conceptual Model

This model is used to describe the conceptual structure of the business data strategy in the initial stage of design. 它定义了实体名称和实体关系，但不关注技术问题, computer systems, 或者数据库管理系统.

The Logical Model

This model is used for the implementation of the database management system and it focus on the logical structure of the data. 它是一种分期模型，既可以面向用户，也可以面向系统. 除了实体名称和实体关系之外, 模型定义属性, primary keys, 和每个实体中的外键.

物理数据模型

This is the final stage of a data model which not only relates to a specific database management system, 同时也说明了操作系统, storage strategy, data security, and hardware. 该模型描述了模式细节, columns, data types, constraints, triggers, indexes, replicas, and backup strategy. This model carries the actual design blueprint for the database implementation by DBA and database developers.

What is normalization? 什么是反规范化?

View answer

Normalization 是否在保持数据完整性的同时减少数据冗余. 这是通过主键和外键在表之间创建关系来实现的. 归一化过程包括1NF(第一范式)、2NF、3NF等.

1NF:消除实体中重复的属性组，并为每个实体实现一个主键. (By definition, 每个单元格由其唯一主键和列标识符的组合来标识.)

For example, 在这种大学课堂模式中, 每个记录条目由主键唯一标识 [Professor_Id] + [Course_Id].

2NF:这包括1NF，但所有的非键属性(如.e.(即主键列以外的列)必须依赖于主键. 如果没有，将这些属性分离到另一个实体中. You are solving the question of whether relationships should be one-to-many or many-to-many at this stage. (See example below.)

为了将1NF的例子扩展到2NF，我们包括了“动词关系”，但不包括“形容词关系”.“Course”栏部分取决于主键 [Professor_Id] + [Course_Id] +[词]. Therefore, 我们创建实体“Class”和“Professor-Class”来建模多对多关系.

3NF这包括2nf +, all fields (columns) cannot 有传递函数依赖 (an indirect relationship between columns) from primary key in the table. 在上面的1NF示例中，如果我们知道 [Course+Term] we learn the [Professor_Name]. When we learn the [Professor_Name] we have the [Dept_Name]. Therefore [Course+Term] 传递函数依赖于 Department.

At this stage, 我们正在使用外键创建引用表——类似于我们在2NF中所做的, 但对于“形容词关系”也一样.

继续我们的例子， [Type], [Dept_Id], [Dept_Name] and [Campus] 有传递函数依赖. 它们不依赖于主键 [Professor_Id]，所以我们创建了两个参考表:

在大多数情况下，3NF阶段足以实现规范化.

Denormalization, on the other hand, is an optimized procedure used on a normalized database to increase query performance for data analytics. 这是一个从3NF(或更高形式)恢复到1NF的过程. 非规范化模型有时也称为维度模型. 它通过消除复杂连接减少了所选查询的运行时间, 它将详细数据预聚合为摘要级数据, 它把数据存储在a中 separate 使用内存和列存储技术的非规范化1NF (c).f. 本页的相关问题.)也就是说，原始的3NF模型仍然是原始的数据源.

您可以在任何数据仓库维度模型设计策略中找到该策略的变体.

Please discuss in-memory analytics and column store databases 在数据仓库架构的上下文中.

View answer

In-memory analytics 是在RAM中缓存整个数据库的方法. 在过去，RAM非常昂贵，因此这种方法的成本往往令人望而却步. 当时的解决方案是使用数据索引并存储预聚合的数据.

这个过程中最慢的部分是从磁盘中检索数据, 因为有限的系统内存无法容纳整个数据库. Today, RAM的价格更实惠, 因此，当涉及到内存分析时，业界现在的采用率要高得多.

A column store 在数据库中存储和检索数据的现代方式是什么. 过去，数据是面向行并水平存储的. 为了检索数据，它被水平读取，就像从左到右阅读一本书一样. 因此，在使用硬盘之前，必须将整个数据集从硬盘驱动器读取到RAM中.

现在，列式存储(或 columnar store)数据库使用面向列的存储，垂直存储数据. Thus there are minimal hard drive reads when retrieving data—the database only reads the columns that a query actually contains.

For example, say you have a “Customer Sales” dataset that has 20 columns of all the customers’ personal and sales information, with a million rows. You only need the [Customer Name], [Address], and [Purchase Date] columns. Using a column store, 数据将只从硬盘读取量最少的三列中读取, 大大减少读取时间.

Also, 列存储提高了压缩的可能性, because the data values within a single field are highly similar to one another—something that compression algorithms depend on.

申请加入Toptal的发展网络

并享受可靠、稳定、远程自由数据建模专家的工作

Apply as a Freelancer

请讨论这个尺寸模型的问题/缺点:

View answer

而不是水平扩展收费行项目, 模型应该分成一个FACT_Header表和一个Fact_Detail表. The charge line item column names should be pivoted into one “Charge_Code” column as a dimension for in-memory/column storage best practices.

请讨论不同类型的NoSQL数据库.

View answer

Document Databases

A document database is mostly used for semi-structured data storage in a self-describing document format like JSON. 数据文档结构(由键-值层次结构定义)可以相同，也可以不同. Documents with the same attributes will be grouped into “collections” which is similar to tables in a relational databases.

{
    "id": "IC001",
    “类别”:“冰淇淋”;
    “产品名称”:“庆典饼干蛋糕”，
    "Size": 2.5,
    "Details": {
        "shape": [
            {"id": "001"， "type": "Round"}，
            {"id": "002"， "type": "Sheet"}，
            {"id": "003"， "type": "Roll"}，
            {"id": "004"， "type": "Dome"}
        ],	
        "cream flavor": [
            {"id": "1001"， "type": "Mint"}，
            {"id": "1002"， "type": "Oreo"}，
            {"id": "1045"， "type": "Chocolate"}，
            {"id": "1021"， "type": "芝士蛋糕"}，
            {"id": "1022"， "type": "Vanilla"}，
            {"id": "1033"， "type": "Cookie"}，
            {"id": "1014"， "type": "Butter Pecan"}
        ]
    },
    “图像”:“X562FCDQX73DS; L9EBWIDKFDKNLUWNJSNA: IU * DWJHIJDSJDKALijnkw8!WEJD#KLJSATEGD..."
}

Column Databases

在列数据库中，数据是面向列而不是面向行存储的. 这种类型的数据库在分析数据库查询期间针对I/O进行了优化. 一些NoSQL列数据库的例子包括谷歌的Big Table、Apache Cassandra和HBase.

Key-value Databases

和文档数据库一样, 键值数据结构类似于字典或映射, 但是在这种情况下，它不能嵌套. It uses a unique key—which can be synthetic or hash-algorithm-generated—to point to its own value list. 该值可以是不同的类型:string、JSON、basic large object (BLOB)等.

{
    name: "Jimmy Johnson",
    email: "jjohnson@test.com",
    街道:“13278 Southside Blvd”;
    city: "Orlando",
    state:  "FL",
    country: "USA"
}

Cache Systems

类似于键值对, 频繁访问的数据存储在内存中，以便快速检索和访问. Redis和集群Memcached就是这样的例子. (而Memcached可以被rdbms使用, 它只能在一台服务器上提供内存, 而NoSQL提供集群中所有服务器的内存场.)

Graph Databases

In a graph database, 对象表示为节点，对象之间的关系表示为边. 边和节点都有定义的属性/属性，这些属性/属性可以存储为键值对. 例如Neo4J、InfiniteGraph、OrientDB、Virtuoso和Allegro.

注:更多信息请参见 NoSQL数据库权威指南在Toptal Engineering博客上.

NoSQL数据库相对于关系数据库有什么优势?

View answer

Flexibility

关系数据库只允许结构化数据, 而NoSQL数据库提供了存储结构化数据的灵活性, semi-structured, or unstructured data.

Dynamic Schema

在关系数据模型中，需要预先定义模式. 在现实世界中，应用程序不断发展，需要进行更改. 这个过程非常昂贵和耗时. In recent decades, 与简单的文本和数字相比，数据对象更加复杂. For example, geospatial data (polygons made out of GPS coordinates) can be difficult to query efficiently with some RDBMSes.

A dynamic schema offers a solution to add complex data objects easily because you can modify or add new elements as your application grows without a predefined schema.

Large Blobs/Rich Media

现在在很多情况下, 我们需要一种更灵活的方式来存储像语音文件这样的数据类型, video files, images, etc. Where once it was considered a best practice to simply store links to a filesystem for data like this, 使用NoSQL可以直接存储它们.

The advantage—aside from avoiding brittle links—is that the database breaks the objects into pieces and can distribute them over a server pool for the ultimate performance gain.

Sharding

Sharding is the process of dividing up and distributing data into smaller databases for faster access to the data. 数据在应用程序不知情的情况下分布在许多服务器上. 数据和查询负载均匀地分布在服务器上. 这样，如果其中一个出现故障，可以毫不费力地替换它，而不会中断应用程序.

与单服务器设置相比, this process reduces big data tasks into small pieces so that it they can be processed at the same time in a distributed server farm.

Replication

NoSQL数据库在自我解析和自我平衡方面是相当复杂的. 它们提供故障转移和恢复选项. 他们通过将数据库分布在许多地理/区域数据中心来实现这一点. 如果一个区域出现问题，它会自动依赖于其他区域. 这种体系结构允许不间断(“高”)可用性, 因为数据是跨不同数据中心复制的.

Scaling

Starting with a distributed set of server nodes (all server nodes are peers with no shared resources and can work independently), 随着数据库大小的增加, additional server nodes can be added on the fly across the data center without interruption or limitation.

在当前的RDBMS体系结构中, you can generally only scale up by adding memory and hard drives within the applicable hardware limitations, although 原生分片支持正在进行中 in some cases.

如何处理NoSQL数据建模中的一对多和多对多关系?

View answer

One-to-many

One-to-many relationships in NoSQL are modeled in a document database and the 1:M relationship is represented as a parent-child relationship-embedded entity object within another document data type. For example:

{
    "order_id": "12345",
    :“order_date 1/21/2019”,
    “customer_id”:“123456”,
    "order_details": [
        {
            "item_id": "2345",
            "qty": 2,
            "unit_price": 157.75,
            "sales_price": 315.5
        },
        {
            "item_id": "2110",
            "qty": 1,
            "unit_price": 75.25,
            "sales_price": 75.25
        },
        {
            "item_id": "1760",
            "qty": 3,
            "unit_price": 55,
            "sales_price": 165
        }
    ]
}

这里，下面的每个物体 order_details 子文档是否具有不同的文档数据类型.

Many-to-many

Many-to-many relationships are modeled using two document collections with embedded entity objects referencing related documents from a third collection with identifiers/keys. 在检索数据时，在应用程序级别实现/维护M:N关系. 下面的示例使用该标识符 class_id 解决教授和学生收藏之间的M:N关系.

教授的收藏是这样的:

{
    “Professor_id”:“1021”,
    "Name": "John P Vess",
    "Type": "Adjunct",
    "class_details": [
        {
            "class_id": [
                "B10101",
                "B10203",
                "B10112"
            ]
        }
    ]
},
...

学生的收藏是这样的:

{
    “Student_id”:“201727542”,
    "Name": "Mary Carson",
    “类型”:“本科生”，
    "class_details": [
        {
            "class_id": [
                "B10101",
                "B10107",
                "B10119"
            ]
        }
    ]
},
{
    “Student_id”:“201821230”,
    "Name": "Jerry Smith",
    “类型”:“本科生”，
    "class_details": [
        {
            "class_id": [
                "B10101",
                "B10203",
                "M10332",
                "C12001"
            ]
        }
    ]
},
...

最后，由其他两个集合引用的类集合看起来是这样的:

{
    "Class_id": "B10101",
    “课程”:“商业原则”
},
{
    "Class_id": "B10203",
    “课程”:“市场营销概论”
},
...

实体-关系(ER)模型中关系的可能基数是什么, and what do they mean?

View answer

关系有三种程度，称为:

One-to-one (1:1)

在这里，一个实体中的一个事件可以关联到另一个实体中的一个事件或零事件.

For example: A “tag” table with a “tag number” attribute might have a 1:1 relationship to a “vehicle” table with a unique vehicle identification number (VIN) as its primary key. 也就是说，给定的车辆可能有也可能没有标签，但不能有多个标签.

继续这个例子, 库存中可能有机动车标签, 但他们还没有分配车辆.

One-to-many (1:M)

In this case, 一个实体中的一个事件与另一个实体中的许多事件相关(或可以相关).

教授可以教很多课. 因此，在“教授”表中，每个教授记录可以与许多类记录相关联.

相反的关系是多对一(M:1)——这只是一个语义问题, 这取决于你首先提到的是哪张表.g.，“class”表与“professor”表的关系为M:1.

Many-to-many (M:N)

在这种关系中, 一个实体中的许多事件可能与另一个实体中的许多事件相关.

E.g.在美国，一个学生可以上很多课，一个课可以包含很多学生. 要实现这种关系，需要一个中间表将学生与课程链接起来. 我们可以称这个表为“学生-课程”表.

因此，总共需要三个表:“student”、“class”和“student-class”.从“学生”到“学生-班级”的距离是1:M. 从“student-class”到“class”是M:1. 最后，使用中间表，“学生”到“班级”是M:N.

什么是星型模式? When is it used?

View answer

The star schema is where there are one or more fact tables referencing any number of dimension tables in a star schema. Usually, the fact tables in a star schema are created from the third normal form (3NF) with foreign keys and aggregates (sometimes called “measures.”)

在开发数据仓库和维度数据集市时使用它.

下面的星型模式显示了两个事实表, “fact_教授”和“fact_教授class_detail”,和维度表包括“Dim_Type”,” “Dim_Department,” “Dim_Date,” and “Dim_Class”. Foreign keys are [Type_Id], [Dept_Id], and [Date_Key]. Measures (not pictured; these are calculated sums, minimums, maximums, averages, counts, 和不同计数)将包括的总和 [Credits]的不同计数 [Professor]的不同计数 [Class_Id], and [Hired_Days] 哪个可以计算出两者之间的差异 [Hired_Date] and today().

10.

请讨论在NoSQL数据库中对一组实体建模的最佳策略.

View answer

从小型NoSQL数据库开始，可以使用动态模式构建该模型. That is, 它将允许实时更改和调整模式, 而不是在应用程序开发期间使用预定义的模式. However, 保证数据的一致性，解决数据实体之间的关系, 最初使用实体关系(ER)模型来定义实体, relations, attributes, 主键和外键. ER模型——因为它是通过定义规范化的——将确保数据的完整性. 然后可以将模型转换/非规范化为嵌入父子关系的实体模型. For example:

Start with a 3NF model with “order_id” as the primary key in the entity “Order” and with foreign keys in the entity “Detail”:

反规格化后，上述3NF模型如下:

Finally, 你可以将非规范化的表单转换为嵌入父子关系的NoSQL实体模型:

{
    "order_id": "12345",
    :“order_date 1/21/2019”,
    “customer_id”:“123456”,
    "order_details": [
        {
            "item_id": "2345",
            "qty": 2,
            "unit_price": 157.75,
            "sales_price": 315.5
        },
        {
            "item_id": "2110",
            "qty": 1,
            "unit_price": 75.25,
            "sales_price": 75.25
        },
        {
            "item_id": "1760",
            "qty": 3,
            "unit_price": 55,
            "sales_price": 165
        }
    ]
}

面试不仅仅是棘手的技术问题, 所以这些只是作为一个指南. 并不是每一个值得雇佣的“A”候选人都能回答所有的问题, 回答所有问题也不能保证成为A级考生. At the end of the day, 招聘仍然是一门艺术，一门科学，需要大量的工作.

Why Toptal

提出面试问题

提交的问题和答案将被审查和编辑, 并可能会或可能不会选择张贴, 由Toptal全权决定, LLC.

寻找数据建模专家?

Looking for Data Modeling Experts? 查看Toptal的数据建模专家.

View full profile

View Christopher

Christopher Karvetski

自由数据建模专家

United StatesToptal Member Since August 24, 2016

Dr. Karvetski作为一名数据和决策科学家有十年的经验. 他曾在学术界和工业界的各种团队和客户环境中工作, 并被公认为优秀的沟通者. 他喜欢与团队合作，构思和部署新颖的数据科学解决方案. 他精通R、SQL、MATLAB、SAS和其他数据科学平台.

Data Modeling DevOps SAS SQL R iOS Oracle Data Analysis Data Engineering TensorFlow Machine Learning + more

View full profile

View Oliver

Oliver Holloway

自由数据建模专家

United KingdomToptal Member Since May 10, 2016

Oliver is a versatile data scientist and software engineer combining over a decade of experience and a postgraduate mathematics degree from Oxford. Career assignments have ranged from building machine learning solutions for startups to leading project teams and handling vast amounts of data at Goldman Sachs. With this background, he is adept at picking up new skills quickly to deliver robust solutions to the most demanding of businesses.

Data Modeling Google Cloud Deep Learning 人工智能(AI)自然语言处理(NLP)MongoDB Python Machine Learning Pandas HTML5 Data Analysis Data Engineering Full-stack + more

View full profile

View Eva

Eva Bojorges Rodriguez

自由数据建模专家

MexicoToptal Member Since November 11, 2014

Eva is a skilled back-end developer and machine learning engineer with experience in scalability issues, system administration, and more. 她有把事情安排得井井有条的天赋, readable, 可维护的应用程序和优秀的Python知识, Ruby, and Go. 她学习能力很强，曾在各种规模的团队中工作过.

Data Modeling DevOps Google Cloud Machine Learning Google App Engine REST APIs Flask MacOS Python Scrum Back-end Data Engineering Data Analysis + more

Toptal Connects the Top 3% 世界各地的自由职业人才.

加入Toptal社区.

Learn more