What is INVALIDATE METADATA in Impala

INVALIDATE METADATA Statement

See Overview of Impala Metadata and the Metastore for the information about the way Impala uses metadata and how it shares the same metastore database as Hive.

Once issued, the INVALIDATE METADATA statement cannot be cancelled.

If there is no table specified, the cached metadata for all tables is flushed and synced with Hive Metastore (HMS). If tables were dropped from the HMS, they will be removed from the catalog, and if new tables were added, they will show up in the catalog.

If you specify a table name, only the metadata for that one table is flushed and synced with the HMS.

Use REFRESH after invalidating a specific table to separate the metadata load from the first query that’s run against that table.

This example illustrates creating a new database and new table in Hive, then doing an INVALIDATE METADATA statement in Impala using the fully qualified table name, after which both the new table and the new database are visible to Impala.

Before the INVALIDATE METADATA statement was issued, Impala would give a "not found" error if you tried to refer to those database or table names.
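The Hive-then-Impala sequence described above can be sketched as two lists of statements, one per shell. This is only an illustration: the database and table names are hypothetical, and the script prints the statements instead of connecting to a cluster.

```python
# Statements to run first in the Hive shell (names are hypothetical).
hive_statements = [
    "CREATE DATABASE new_db_from_hive",
    "CREATE TABLE new_db_from_hive.new_table_from_hive (x INT)",
]

# Statements to run afterwards in impala-shell.
impala_statements = [
    # Makes the Hive-created table, and with it the new database, visible.
    "INVALIDATE METADATA new_db_from_hive.new_table_from_hive",
    "SHOW DATABASES LIKE 'new*'",
    "SHOW TABLES IN new_db_from_hive",
]

for shell, statements in (("hive", hive_statements),
                          ("impala-shell", impala_statements)):
    for statement in statements:
        print(f"{shell}> {statement};")
```

Using the fully qualified name in the INVALIDATE METADATA statement avoids reloading the metadata for every table in the catalog.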

Use the REFRESH statement for incremental metadata update.

For more examples of using INVALIDATE METADATA with a combination of Impala and Hive operations, see Switching Back and Forth Between Impala and Hive.

If you change HDFS permissions to make data readable or writeable by the Impala user, issue another INVALIDATE METADATA to make Impala aware of the change.

By default, much of the metadata for Kudu tables is handled by the underlying storage layer. Kudu tables have less reliance on the Metastore database, and require less metadata caching on the Impala side. For example, information about partitions in Kudu tables is managed by Kudu, and Impala does not cache any block locality metadata for Kudu tables. If the Kudu service is not integrated with the Hive Metastore, Impala will manage Kudu table metadata in the Hive Metastore.

The REFRESH and INVALIDATE METADATA statements are needed less frequently for Kudu tables than for HDFS-backed tables. Neither statement is needed when data is added to, removed, or updated in a Kudu table, even if the changes are made directly to Kudu through a client program using the Kudu API. Run REFRESH table_name or INVALIDATE METADATA table_name for a Kudu table only after making a change to the Kudu table schema, such as adding or dropping a column.


INVALIDATE METADATA Statement

To accurately respond to queries, Impala must have current metadata about those databases and tables that clients query directly. Therefore, if some other entity modifies information used by Impala in the metastore that Impala and Hive share, the information cached by Impala must be updated. However, this does not mean that all metadata updates require an Impala update.

In Impala 1.2.4 and higher, you can specify a table name with INVALIDATE METADATA after the table is created in Hive, allowing you to make individual tables visible to Impala without doing a full reload of the catalog metadata. Impala 1.2.4 also includes other changes to make the metadata broadcast mechanism faster and more responsive, especially during Impala startup. See New Features in Impala 1.2.4 for details.

In Impala 1.2 and higher, a dedicated daemon ( catalogd ) broadcasts DDL changes made through Impala to all Impala nodes. Formerly, after you created a database or table while connected to one Impala node, you needed to issue an INVALIDATE METADATA statement on another Impala node before accessing the new database or table from the other node. Now, newly created or altered objects are picked up automatically by all Impala nodes. You must still use the INVALIDATE METADATA technique after creating or altering objects through Hive. See The Impala Catalog Service for more information on the catalog service.

The INVALIDATE METADATA statement is new in Impala 1.1 and higher, and takes over some of the use cases of the Impala 1.0 REFRESH statement. Because REFRESH now requires a table name parameter, to flush the metadata for all tables at once, use the INVALIDATE METADATA statement.
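The difference in arguments can be sketched with two small Python helpers that only build the statement text. The helper names and the table name sales_db.events are hypothetical, chosen for illustration; nothing here talks to Impala.

```python
from typing import Optional


def refresh(table: str) -> str:
    """REFRESH always names one table."""
    if not table:
        raise ValueError("REFRESH requires a table name")
    return f"REFRESH {table}"


def invalidate_metadata(table: Optional[str] = None) -> str:
    """With no table name, the metadata for all tables is flushed."""
    return f"INVALIDATE METADATA {table}" if table else "INVALIDATE METADATA"


print(invalidate_metadata())                   # INVALIDATE METADATA
print(invalidate_metadata("sales_db.events"))  # INVALIDATE METADATA sales_db.events
print(refresh("sales_db.events"))              # REFRESH sales_db.events
```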


INVALIDATE METADATA and REFRESH are counterparts: INVALIDATE METADATA waits to reload the metadata when needed for a subsequent query, but reloads all the metadata for the table, which can be an expensive operation, especially for large tables with many partitions. REFRESH reloads the metadata immediately, but only loads the block location data for newly added data files, making it a less expensive operation overall. If data was altered in some more extensive way, such as being reorganized by the HDFS balancer, use INVALIDATE METADATA to avoid a performance penalty from reduced local reads. If you used Impala version 1.0, the INVALIDATE METADATA statement works just like the Impala 1.0 REFRESH statement did, while the Impala 1.1 REFRESH is optimized for the common use case of adding new data files to an existing table, thus the table name argument is now required.

A metadata update for an impalad instance is required if:

Database and table metadata is typically modified by:

INVALIDATE METADATA causes the metadata for that table to be marked as stale, and reloaded the next time the table is referenced. For a huge table, that process could take a noticeable amount of time; thus you might prefer to use REFRESH where practical, to avoid an unpredictable delay later, for example if the next reference to the table is during a benchmark test.
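The timing difference can be illustrated with a toy model of a metadata cache. This is a sketch under strong simplifying assumptions, not Impala's implementation: the class and method names are hypothetical, and it models only when a reload happens, not how much metadata is reloaded.

```python
from typing import Dict, List


class ToyCatalog:
    """Toy model of a metadata cache; not Impala's real implementation."""

    def __init__(self) -> None:
        self.stale: Dict[str, bool] = {}
        self.log: List[str] = []

    def _reload(self, table: str, trigger: str) -> None:
        self.stale[table] = False
        self.log.append(f"reload {table} during {trigger}")

    def invalidate_metadata(self, table: str) -> None:
        # Only marks the entry stale; the reload is deferred.
        self.stale[table] = True

    def refresh(self, table: str) -> None:
        # Reloads right away, so later queries do not pay for it.
        self._reload(table, "REFRESH")

    def query(self, table: str) -> None:
        # A stale (or never loaded) table is reloaded before the query runs.
        if self.stale.get(table, True):
            self._reload(table, "query")


catalog = ToyCatalog()
catalog.invalidate_metadata("t1")
catalog.query("t1")    # the first query pays for the reload
catalog.invalidate_metadata("t2")
catalog.refresh("t2")  # the reload happens here instead
catalog.query("t2")    # no reload needed
print(catalog.log)
```

In the model, the reload for t1 lands inside the query, which is the unpredictable delay mentioned above, while the reload for t2 is paid up front by the REFRESH call.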

The following example shows how you might use the INVALIDATE METADATA statement after creating new tables (such as SequenceFile or HBase tables) through the Hive shell. Before the INVALIDATE METADATA statement was issued, Impala would give a "table not found" error if you tried to refer to those table names. The DESCRIBE statements cause the latest metadata to be immediately loaded for the tables, avoiding a delay the next time those tables are queried.

For more examples of using REFRESH and INVALIDATE METADATA with a combination of Impala and Hive operations, see Switching Back and Forth Between Impala and Hive.

The user ID that the impalad daemon runs under, typically the impala user, must have execute permissions for all the relevant directories holding table data. (A table could have data spread across multiple directories, or in unexpected paths, if it uses partitioning or specifies a LOCATION attribute for individual partitions or the entire table.) Issues with permissions might not cause an immediate error for this statement, but subsequent statements such as SELECT or SHOW TABLE STATS could fail.

This example illustrates creating a new database and new table in Hive, then doing an INVALIDATE METADATA statement in Impala using the fully qualified table name, after which both the new table and the new database are visible to Impala. The ability to specify INVALIDATE METADATA table_name for a table created in Hive is a new capability in Impala 1.2.4. In earlier releases, that statement would have returned an error indicating an unknown table, requiring you to do INVALIDATE METADATA with no table name, a more expensive operation that reloaded metadata for all tables and databases.

Amazon S3 considerations:

The REFRESH and INVALIDATE METADATA statements also cache metadata for tables where the data resides in the Amazon Simple Storage Service (S3). In particular, issue a REFRESH for a table after adding or removing files in the associated S3 data directory. See Using Impala with the Amazon S3 Filesystem for details about working with S3 tables.

Cancellation: Cannot be cancelled.

Much of the metadata for Kudu tables is handled by the underlying storage layer. Kudu tables have less reliance on the metastore database, and require less metadata caching on the Impala side. For example, information about partitions in Kudu tables is managed by Kudu, and Impala does not cache any block locality metadata for Kudu tables.

The REFRESH and INVALIDATE METADATA statements are needed less frequently for Kudu tables than for HDFS-backed tables. Neither statement is needed when data is added to, removed, or updated in a Kudu table, even if the changes are made directly to Kudu through a client program using the Kudu API. Run REFRESH table_name or INVALIDATE METADATA table_name for a Kudu table only after making a change to the Kudu table schema, such as adding or dropping a column, by a mechanism other than Impala.



Using Impala's INVALIDATE METADATA and REFRESH statements correctly

In Impala, both the INVALIDATE METADATA and REFRESH statements can be used to update a table's metadata, but they are fundamentally different. This article briefly analyzes the two and explains which one to use in which situation.

Introduction to Impala on Hive

Typically, a traditional database such as MySQL or PostgreSQL serves as the Hive metastore component. With MySQL under CDH, the show tables in hive statement lists the tables that make up the Hive metastore.

Its organization resembles information_schema in MySQL. For example, one table stores the metadata of all tables, a columns table stores the metadata of all columns, a partitions table stores partition information, the SDS table stores the HDFS storage directories of tables and partitions, and so on.

Impala is an MPP query engine, and in our business it is often used together with Hive. The figure below shows the overall architecture of Impala and the surrounding components.

The core Impala component is impalad, which provides all query services. In addition, the catalog service is responsible for fetching and caching metadata, and the statestore is responsible for reporting metadata updates to every impalad.

This design neatly avoids having to fetch metadata for every query, which would cause a large delay when a table structure is very complex or the table holds a lot of data. Once the metadata is cached, it can be reused, which saves time.

invalidate metadata

Invalidate means "to make invalid, to void", so INVALIDATE METADATA means "void the (cached) metadata". Its syntax is:

If INVALIDATE METADATA table is executed on an impalad (call it I), the following steps take place:

INVALIDATE METADATA is asynchronous and reloads the metadata in full. As the steps above show, while it is running, every impalad other than I still holds the old metadata cache, and even the new metadata held by I is incomplete. Only after the catalog service has asynchronously loaded all metadata for the table is a new version number generated and the complete metadata broadcast to all impalads through the statestore. At that point the whole Impala cluster has a consistent view of the metadata.

refresh

The meaning of refresh is simple: "to refresh". Its syntax is:

Executing the REFRESH table statement on I causes the following:

Of course, the statestore is still responsible for broadcasting the new metadata to the other nodes. Until the broadcast happens, the impalads other than I also keep the old cache.

Compared with INVALIDATE METADATA, REFRESH is synchronous and incremental. In addition, it operates on a single table or a single partition of a table, so it is more lightweight and better suited to updates after changes to partition metadata or data files.

How to use them correctly

From the simple analysis above, it is easy to draw the following summary:



Difference between invalidate metadata and refresh commands in Impala?

I saw the following at this link, which concerns Impala version 1.1:

Since Impala 1.1, REFRESH statement only works for existing tables. For new tables you need to issue "INVALIDATE METADATA" statement.

Does this still hold true for later versions of Impala?


According to Cloudera's Impala guide (Cloudera Enterprise 5.8; the text stayed the same for 5.9):

INVALIDATE METADATA and REFRESH are counterparts: INVALIDATE METADATA waits to reload the metadata when needed for a subsequent query, but reloads all the metadata for the table, which can be an expensive operation, especially for large tables with many partitions. REFRESH reloads the metadata immediately, but only loads the block location data for newly added data files, making it a less expensive operation overall. If data was altered in some more extensive way, such as being reorganized by the HDFS balancer, use INVALIDATE METADATA to avoid a performance penalty from reduced local reads. If you used Impala version 1.0, the INVALIDATE METADATA statement works just like the Impala 1.0 REFRESH statement did, while the Impala 1.1 REFRESH is optimized for the common use case of adding new data files to an existing table, thus the table name argument is now required.

and related to working on existing tables:

The table name is a required parameter [for REFRESH]. To flush the metadata for all tables, use the INVALIDATE METADATA command. Because REFRESH table_name only works for tables that the current Impala node is already aware of, when you create a new table in the Hive shell, enter INVALIDATE METADATA new_table before you can see the new table in impala-shell. Once the table is known by Impala, you can issue REFRESH table_name after you add data files for that table.

So it seems like it indeed stayed the same. I believe CDH 5.9 comes with Impala 2.7.
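The rule quoted above can be sketched as a small Python helper. The function name and its known_to_impala flag are hypothetical; the helper only picks the statement text and does not query Impala to find out whether the table is actually known.

```python
def statement_after_hive_change(table: str, known_to_impala: bool) -> str:
    """Pick the statement to run in impala-shell after a change made through Hive."""
    if not known_to_impala:
        # REFRESH only works for tables the Impala node is already aware of.
        return f"INVALIDATE METADATA {table}"
    # The table is already known: REFRESH picks up newly added data files.
    return f"REFRESH {table}"


print(statement_after_hive_change("new_table", known_to_impala=False))
print(statement_after_hive_change("new_table", known_to_impala=True))
```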

As per the Impala documentation for INVALIDATE METADATA and REFRESH:

INVALIDATE METADATA Statement

The INVALIDATE METADATA statement marks the metadata for one or all tables as stale. The next time the Impala service performs a query against a table whose metadata is invalidated, Impala reloads the associated metadata before the query proceeds. As this is a very expensive operation compared to the incremental metadata update done by the REFRESH statement, when possible, prefer REFRESH rather than INVALIDATE METADATA.

INVALIDATE METADATA is required when the following changes are made outside of Impala, in Hive or another Hive client such as SparkSQL:

No INVALIDATE METADATA is needed when the changes are made by impalad.

REFRESH Statement

The REFRESH statement reloads the metadata for the table from the metastore database and does an incremental reload of the file and block metadata from the HDFS NameNode. REFRESH is used to avoid inconsistencies between Impala and external metadata sources, namely Hive Metastore (HMS) and NameNodes.

The table name is a required parameter, and the table must already exist and be known to Impala.

Only the metadata for the specified table is reloaded.

Use the REFRESH statement to load the latest metastore metadata for a particular table after one of the following scenarios happens outside of Impala:
