new file mode 100644
@@ -0,0 +1,1405 @@
+Excerpt from UltraSPARC Virtual Machine Specification
+Extracted via "pdftotext -f 546 -l 571 -layout sun4v-3.0.20.pdf"
+Compiled from version 3.0.20
+Publication date 2017-04-05 18:15
+Copyright © 2008, 2015 Oracle and/or its affiliates. All rights reserved.
+
+
+Chapter 36. Coprocessor services
+ The following APIs provide access via the Hypervisor to hardware assisted data processing functionality.
+ These APIs may only be provided by certain platforms, and may not be available to all virtual machines
+ even on supported platforms. Restrictions on the use of these APIs may be imposed in order to support
+ live-migration and other system management activities.
+
+36.1. Data Analytics Accelerator
+ The Data Analytics Accelerator (DAX) functionality is a collection of hardware coprocessors that provide
+ high speed processing of database-centric operations. The coprocessors may support one or more of
+ the following data query operations: search, extraction, compression, decompression, and translation. The
+ functionality offered may vary by virtual machine implementation.
+
+ The DAX is a virtual device to sun4v guests, with supported data operations indicated by the virtual
+ device compatibility property. Functionality is accessed through the submission of Command Control
+ Blocks (CCBs) via the ccb_submit API function. The operations are processed asynchronously, with the
+ status of the submitted operations reported through a Completion Area linked to each CCB. Each CCB has
+ a separate Completion Area and, unless execution order is specifically restricted through the use of
+ serial-conditional flags, the execution order of submitted CCBs is arbitrary. Likewise, the time to
+ completion for a given CCB is never guaranteed.
+
+ Guest software may implement a software timeout on CCB operations, and if the timeout is exceeded, the
+ operation may be cancelled or killed via the ccb_kill API function. It is recommended for guest software
+ to implement a software timeout to account for certain RAS errors which may result in lost CCBs. It is
+ recommended such implementation use the ccb_info API function to check the status of a CCB prior to
+ killing it in order to determine if the CCB is still in queue, or may have been lost due to a RAS error.
+
+ There is no fixed limit on the number of outstanding CCBs guest software may have queued in the virtual
+ machine. However, internal resource limitations within the virtual machine can cause CCB submissions
+ to be temporarily rejected with EWOULDBLOCK. In such cases, guests should continue to attempt
+ submissions until they succeed; waiting for an outstanding CCB to complete is not necessary, and would
+ not be a guarantee that a future submission would succeed.
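+
+ The retry behaviour described above can be sketched as a bounded loop. This is only an illustration:
+ the status constants, the ccb_submit_fn callback type, and the fake_submit stub are hypothetical
+ stand-ins for the real ccb_submit hypercall and its error codes, which are not defined in this excerpt.
+
```c
#include <stddef.h>

/* Hypothetical status codes; the real values come from the sun4v error tables. */
enum { SKETCH_EOK = 0, SKETCH_EWOULDBLOCK = 1 };

/* Hypothetical callback standing in for a wrapper around the ccb_submit hypercall. */
typedef int (*ccb_submit_fn)(void *ctx);

/*
 * Keep resubmitting while the submission is rejected with EWOULDBLOCK,
 * giving up after max_attempts tries. Returns the last status seen and
 * stores the number of attempts made in *attempts_used (if non-NULL).
 */
static int submit_with_retry(ccb_submit_fn submit, void *ctx,
                             unsigned max_attempts, unsigned *attempts_used)
{
    int status = SKETCH_EWOULDBLOCK;
    unsigned n = 0;

    while (n < max_attempts) {
        status = submit(ctx);
        n++;
        if (status != SKETCH_EWOULDBLOCK)
            break;
        /* A real guest would likely yield or back off here. */
    }
    if (attempts_used != NULL)
        *attempts_used = n;
    return status;
}

/* Stub for experimentation: rejects the first *ctx calls, then accepts. */
static int fake_submit(void *ctx)
{
    unsigned *remaining = ctx;

    if (*remaining > 0) {
        (*remaining)--;
        return SKETCH_EWOULDBLOCK;
    }
    return SKETCH_EOK;
}
```
+
+ The attempt bound is a design choice of the sketch, not a requirement of the API; it simply keeps a
+ guest from spinning forever if submissions never succeed.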
+
+ The availability of DAX coprocessor command service is indicated by the presence of the DAX virtual
+ device node in the guest MD (Section 8.24.17, “Database Analytics Accelerators (DAX) virtual-device
+ node”).
+
+36.1.1. DAX Compatibility Property
+ The query functionality may vary based on the compatibility property of the virtual device:
+
+36.1.1.1. "ORCL,sun4v-dax" Device Compatibility
+ Available CCB commands:
+
+ • No-op/Sync
+
+ • Extract
+
+ • Scan Value
+
+ • Inverted Scan Value
+
+ • Scan Range
+
+ • Inverted Scan Range
+
+ • Translate
+
+ • Inverted Translate
+
+ • Select
+ See Section 36.2.1, “Query CCB Command Formats” for the corresponding CCB input and output formats.
+
+ Only version 0 CCBs are available.
+
+36.1.1.2. "ORCL,sun4v-dax-fc" Device Compatibility
+ "ORCL,sun4v-dax-fc" is compatible with the "ORCL,sun4v-dax" interface, and includes additional CCB
+ bit fields and controls.
+
+36.1.1.3. "ORCL,sun4v-dax2" Device Compatibility
+ Available CCB commands:
+ • No-op/Sync
+
+ • Extract
+
+ • Scan Value
+
+ • Inverted Scan Value
+
+ • Scan Range
+
+ • Inverted Scan Range
+
+ • Translate
+
+ • Inverted Translate
+
+ • Select
+
+ See Section 36.2.1, “Query CCB Command Formats” for the corresponding CCB input and output formats.
+
+ Version 0 and 1 CCBs are available. Only version 0 CCBs may use Huffman encoded data, whereas only
+ version 1 CCBs may use OZIP.
+
+36.1.2. DAX Virtual Device Interrupts
+ The DAX virtual device has multiple interrupts associated with it which may be used by the guest if
+ desired. The number of device interrupts available to the guest is indicated in the virtual device node of the
+ guest MD (Section 8.24.17, “Database Analytics Accelerators (DAX) virtual-device node”). If the device
+ node indicates N interrupts available, the guest may use any value from 0 to N - 1 (inclusive) in a CCB
+ interrupt number field. Using values outside this range will result in the CCB being rejected for an invalid
+ field value.
+
+ The interrupts may be bound and managed using the standard sun4v device interrupts API (Chapter 16,
+ Device interrupt services). Sysino interrupts are not available for DAX devices.
+
+36.2. Coprocessor Control Block (CCB)
+ CCBs are either 64 or 128 bytes long, depending on the operation type. The exact contents of the CCB
+ are command specific, but all CCBs contain at least one memory buffer address. All memory locations
+ referenced by a CCB must be pinned in memory until the CCB either completes execution or is killed via
+ the ccb_kill API call. Changes in virtual address mappings occurring after CCB submission are not
+ guaranteed to be visible, and as such all virtual address updates need to be synchronized with CCB
+ execution.
+
+All CCBs begin with a common 32-bit header.
+
+Table 36.1. CCB Header Format
+
+Bits      Field Description
+[31:28]   CCB version. For API version 2.0: set to 1 if the CCB uses OZIP encoding; set to 0 if the
+          CCB uses Huffman encoding; otherwise either 0 or 1. For API version 1.0: always set to 0.
+[27]      When API version 2.0 is negotiated, this is the Pipeline flag. It is reserved in API
+          version 1.0.
+[26]      Long CCB flag
+[25]      Conditional synchronization flag
+[24]      Serial synchronization flag
+[23:16]   CCB operation code:
+            0x00     No Operation (No-op) or Sync
+            0x01     Extract
+            0x02     Scan Value
+            0x12     Inverted Scan Value
+            0x03     Scan Range
+            0x13     Inverted Scan Range
+            0x04     Translate
+            0x14     Inverted Translate
+            0x05     Select
+[15:13]   Reserved
+[12:11]   Table address type
+            0b'00    No address
+            0b'01    Alternate context virtual address
+            0b'10    Real address
+            0b'11    Primary context virtual address
+[10:8]    Output/Destination address type
+            0b'000   No address
+            0b'001   Alternate context virtual address
+            0b'010   Real address
+            0b'011   Primary context virtual address
+            0b'100   Reserved
+            0b'101   Reserved
+            0b'110   Reserved
+            0b'111   Reserved
+[7:5]     Secondary source address type
+            0b'000   No address
+            0b'001   Alternate context virtual address
+            0b'010   Real address
+            0b'011   Primary context virtual address
+            0b'100   Reserved
+            0b'101   Reserved
+            0b'110   Reserved
+            0b'111   Reserved
+[4:2]     Primary source address type
+            0b'000   No address
+            0b'001   Alternate context virtual address
+            0b'010   Real address
+            0b'011   Primary context virtual address
+            0b'100   Reserved
+            0b'101   Reserved
+            0b'110   Reserved
+            0b'111   Reserved
+[1:0]     Completion area address type
+            0b'00    No address
+            0b'01    Alternate context virtual address
+            0b'10    Real address
+            0b'11    Primary context virtual address
+
+The Long CCB flag indicates whether the submitted CCB is 64 or 128 bytes long; value is 0 for 64 bytes
+and 1 for 128 bytes.
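+
+As an illustration of the header layout in Table 36.1, the following sketch packs the individual fields
+into the 32-bit header word. The structure, enum, and function names are invented for this example and
+are not part of the API; only the bit positions are taken from the table.
+
```c
#include <stdint.h>

/* Address type codes shared by the address type fields (names are illustrative). */
enum ccb_addr_type {
    CCB_AT_NONE   = 0,  /* No address */
    CCB_AT_ALT_VA = 1,  /* Alternate context virtual address */
    CCB_AT_REAL   = 2,  /* Real address */
    CCB_AT_PRI_VA = 3   /* Primary context virtual address */
};

/* Unpacked view of the header fields; bit ranges follow Table 36.1. */
struct ccb_header_fields {
    unsigned version;      /* [31:28] */
    unsigned pipeline;     /* [27]    */
    unsigned long_ccb;     /* [26]    */
    unsigned conditional;  /* [25]    */
    unsigned serial;       /* [24]    */
    unsigned opcode;       /* [23:16] */
    unsigned table_at;     /* [12:11] */
    unsigned out_at;       /* [10:8]  */
    unsigned sec_at;       /* [7:5]   */
    unsigned pri_at;       /* [4:2]   */
    unsigned cmpl_at;      /* [1:0]   */
};

/* Pack the fields into a header word; reserved bits [15:13] stay zero. */
static uint32_t ccb_header_pack(const struct ccb_header_fields *f)
{
    return ((uint32_t)(f->version & 0xF) << 28) |
           ((uint32_t)(f->pipeline & 0x1) << 27) |
           ((uint32_t)(f->long_ccb & 0x1) << 26) |
           ((uint32_t)(f->conditional & 0x1) << 25) |
           ((uint32_t)(f->serial & 0x1) << 24) |
           ((uint32_t)(f->opcode & 0xFF) << 16) |
           ((uint32_t)(f->table_at & 0x3) << 11) |
           ((uint32_t)(f->out_at & 0x7) << 8) |
           ((uint32_t)(f->sec_at & 0x7) << 5) |
           ((uint32_t)(f->pri_at & 0x7) << 2) |
           (uint32_t)(f->cmpl_at & 0x3);
}
```
+
+For example, a short Extract CCB (operation code 0x01) whose output, primary input, and completion area
+all use real addresses packs to 0x0001020A under this sketch.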
+
+The Serial and Conditional flags allow simple relative ordering between CCBs. Any CCB with the Serial
+flag set will execute sequentially relative to any previous CCB that is also marked as Serial in the same
+CCB submission. CCBs without the Serial flag set execute independently, even if they are between CCBs
+with the Serial flag set. CCBs marked solely with the Serial flag will execute upon the completion of the
+previous Serial CCB, regardless of the completion status of that CCB. The Conditional flag allows CCBs
+to conditionally execute based on the successful execution of the closest CCB marked with the Serial flag.
+A CCB may only be conditional on exactly one CCB. However, a CCB may be marked both Conditional
+and Serial to allow execution chaining. The flags do NOT allow fan-out chaining, where multiple CCBs
+execute in parallel based on the completion of another CCB.
+
+The Pipeline flag is an optimization that directs the output of one CCB (the "source" CCB) directly to
+the input of the next CCB (the "target" CCB). The target CCB thus does not need to read the input from
+memory. The Pipeline flag is advisory and may be dropped.
+
+Both the Pipeline and Serial bits must be set in the source CCB. The Conditional bit must be set in the
+target CCB. Exactly one CCB must be made conditional on the source CCB; either 0 or 2 target CCBs
+is invalid. However, Pipelines can be extended beyond two CCBs: the sequence would start with a CCB
+with both the Pipeline and Serial bits set, proceed through CCBs with the Pipeline, Serial, and Conditional
+bits set, and terminate at a CCB that has the Conditional bit set, but not the Pipeline bit.
+
+The input of the target CCB must start within 64 bytes of the output of the source CCB or the pipeline flag
+will be ignored. All CCBs in a pipeline must be submitted in the same call to ccb_submit.
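+
+The flag pattern for a pipeline can be checked before submission. The sketch below is one possible
+reading of the rules above, applied to the header words of the pipeline members in submission order;
+the macro and function names are hypothetical, and the check covers only the required bits (it does
+not look at buffer placement or at CCBs outside the pipeline).
+
```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Header flag bits from Table 36.1 (macro names are illustrative). */
#define CCB_HDR_PIPELINE (UINT32_C(1) << 27)
#define CCB_HDR_COND     (UINT32_C(1) << 25)
#define CCB_HDR_SERIAL   (UINT32_C(1) << 24)

/*
 * Check the flag pattern of a pipeline of n CCB headers:
 *   first CCB:  Pipeline and Serial set
 *   middle CCB: Pipeline, Serial, and Conditional set
 *   last CCB:   Conditional set, Pipeline clear
 */
static bool pipeline_flags_ok(const uint32_t *hdr, size_t n)
{
    if (n < 2)
        return false;

    for (size_t i = 0; i < n; i++) {
        bool p = (hdr[i] & CCB_HDR_PIPELINE) != 0;
        bool c = (hdr[i] & CCB_HDR_COND) != 0;
        bool s = (hdr[i] & CCB_HDR_SERIAL) != 0;

        if (i == 0) {
            if (!(p && s))
                return false;
        } else if (i == n - 1) {
            if (!c || p)
                return false;
        } else {
            if (!(p && s && c))
                return false;
        }
    }
    return true;
}
```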
+
+ The various address type fields indicate how the various address values used in the CCB should be
+ interpreted by the virtual machine. Not all of the types specified are used by every CCB format. Types
+ which are not applicable to the given CCB command should be indicated as type 0 (No address). Virtual
+ addresses used in the CCB must have translation entries present in either the TLB or a configured TSB for
+ the submitting virtual processor. Virtual addresses which cannot be translated by the virtual machine will
+ result in the CCB submission being rejected, with the causal virtual address indicated. The CCB may be
+ resubmitted after inserting the translation, or the address may be translated by guest software and
+ resubmitted using the real address translation.
+
+36.2.1. Query CCB Command Formats
+36.2.1.1. Supported Data Formats, Elements Sizes and Offsets
+
+ Data for query commands may be encoded in multiple possible formats. The data query commands use a
+ common set of values to indicate the encoding formats of the data being processed. Some encoding formats
+ require multiple data streams for processing, requiring the specification of both primary data formats (the
+ encoded data) and secondary data streams (meta-data for the encoded data).
+
+36.2.1.1.1. Primary Input Format
+ The primary input format code is a 4-bit field when it is used. There are 10 primary input formats available.
+ The packed formats are not endian neutral. Code values not listed below are reserved.
+
+ Code  Format                              Description
+ 0x0   Fixed width byte packed             Up to 16 bytes
+ 0x1   Fixed width bit packed              Up to 15 bits (CCB version 0) or 23 bits (CCB version 1);
+                                           bits are read most significant bit to least significant
+                                           bit within a byte
+ 0x2   Variable width byte packed          Data stream of lengths must be provided as a secondary
+                                           input
+ 0x4   Fixed width byte packed with run    Up to 16 bytes; data stream of run lengths must be
+       length encoding                     provided as a secondary input
+ 0x5   Fixed width bit packed with run     Up to 15 bits (CCB version 0) or 23 bits (CCB version 1);
+       length encoding                     bits are read most significant bit to least significant
+                                           bit within a byte; data stream of run lengths must be
+                                           provided as a secondary input
+ 0x8   Fixed width byte packed with        Up to 16 bytes before the encoding; compressed stream
+       Huffman (CCB version 0) or          bits are read most significant bit to least significant
+       OZIP (CCB version 1) encoding       bit within a byte; pointer to the encoding table must be
+                                           provided
+ 0x9   Fixed width bit packed with         Up to 15 bits (CCB version 0) or 23 bits (CCB version 1);
+       Huffman (CCB version 0) or          compressed stream bits are read most significant bit to
+       OZIP (CCB version 1) encoding       least significant bit within a byte; pointer to the
+                                           encoding table must be provided
+ 0xA   Variable width byte packed with     Up to 16 bytes before the encoding; compressed stream
+       Huffman (CCB version 0) or          bits are read most significant bit to least significant
+       OZIP (CCB version 1) encoding       bit within a byte; data stream of lengths must be
+                                           provided as a secondary input; pointer to the encoding
+                                           table must be provided
+ 0xC   Fixed width byte packed with        Up to 16 bytes before the encoding; compressed stream
+       run length encoding, followed by    bits are read most significant bit to least significant
+       Huffman (CCB version 0) or          bit within a byte; data stream of run lengths must be
+       OZIP (CCB version 1) encoding       provided as a secondary input; pointer to the encoding
+                                           table must be provided
+ 0xD   Fixed width bit packed with         Up to 15 bits (CCB version 0) or 23 bits (CCB version 1)
+       run length encoding, followed by    before the encoding; compressed stream bits are read most
+       Huffman (CCB version 0) or          significant bit to least significant bit within a byte;
+       OZIP (CCB version 1) encoding       data stream of run lengths must be provided as a
+                                           secondary input; pointer to the encoding table must be
+                                           provided
+
+ If OZIP encoding is used, there must be no reserved bytes in the table.
+
+36.2.1.1.2. Primary Input Element Size
+ For primary input data streams with fixed size elements, the element size must be indicated in the CCB
+ command. The size is encoded as the number of bits or bytes, minus one. The valid value range for this
+ field depends on the input format selected, as listed in the table above.
+
+36.2.1.1.3. Secondary Input Format
+ For primary input data streams which require a secondary input stream, the secondary input stream is
+ always encoded in a fixed width, bit-packed format. The bits are read from most significant bit to least
+ significant bit within a byte. There are two encoding options for the secondary input stream data elements,
+ depending on whether the value of 0 is needed:
+
+ Secondary Input Format Code   Description
+ 0                             Element is stored as value minus 1 (0 evaluates to 1, 1 evaluates
+                               to 2, etc)
+ 1                             Element is stored as value
+
+36.2.1.1.4. Secondary Input Element Size
+ Secondary input element size is encoded as a two bit field:
+
+ Secondary Input Size Code   Description
+ 0x0 1 bit
+ 0x1 2 bits
+ 0x2 4 bits
+ 0x3 8 bits
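+
+ The two secondary input encodings above can be modelled in a few lines. This is a sketch only; the
+ function names are invented for illustration.
+
```c
/* Width in bits of a secondary input element for size codes 0x0..0x3 (1, 2, 4, 8). */
static unsigned secondary_elem_bits(unsigned size_code)
{
    return 1u << (size_code & 0x3);
}

/*
 * Logical value of a stored secondary element.
 * Format code 0 stores "value minus 1", so 0 evaluates to 1;
 * format code 1 stores the value itself.
 */
static unsigned secondary_elem_value(unsigned format_code, unsigned stored)
{
    return (format_code == 0) ? stored + 1 : stored;
}
```
+
+ With format code 0 and an 8-bit element, the representable values are therefore 1 through 256, while
+ format code 1 gives 0 through 255.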
+
+36.2.1.1.5. Input Element Offsets
+ Bit-wise input data streams may have any alignment within the base addressed byte. The offset, specified
+ from most significant bit to least significant bit, is provided as a fixed 3 bit field for each input type. A
+ value of 0 indicates that the first input element begins at the most significant bit in the first byte, and a
+ value of 7 indicates it begins with the least significant bit.
+
+ This field should be zero for any byte-wise primary input data streams.
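+
+ A software model of reading a fixed width, bit-packed stream with a starting offset might look like
+ the following. It assumes that elements are packed back to back across byte boundaries and that the
+ first bit read is the most significant bit of the element; both points are assumptions of this sketch
+ rather than statements from the specification, and the function name is hypothetical.
+
```c
#include <stddef.h>
#include <stdint.h>

/*
 * Return element number `index` from a bit-packed stream.
 *   base         - first byte of the stream
 *   start_offset - 0..7, counted from the most significant bit of base[0]
 *   elem_bits    - element width in bits (at most 32 in this sketch)
 */
static uint32_t read_bit_packed(const uint8_t *base, unsigned start_offset,
                                unsigned elem_bits, size_t index)
{
    size_t bitpos = start_offset + index * (size_t)elem_bits;
    uint32_t value = 0;

    for (unsigned i = 0; i < elem_bits; i++, bitpos++) {
        /* Bit 0 of the stream is the most significant bit of byte 0. */
        unsigned bit = (base[bitpos / 8] >> (7 - (bitpos % 8))) & 1u;
        value = (value << 1) | bit;
    }
    return value;
}
```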
+
+36.2.1.1.6. Output Format
+ Query commands support multiple sizes and encodings for output data streams. There are four possible
+ output encodings, and up to four supported element sizes per encoding. Not all output encodings are
+ supported for every command. The format is indicated by a 4-bit field in the CCB:
+
+ Output Format Code Description
+ 0x0 Byte aligned, 1 byte elements
+ 0x1 Byte aligned, 2 byte elements
+ 0x2 Byte aligned, 4 byte elements
+ 0x3 Byte aligned, 8 byte elements
+ 0x4 16 byte aligned, 16 byte elements
+ 0x5 Reserved
+ 0x6 Reserved
+ 0x7 Reserved
+ 0x8 Packed vector of single bit elements
+ 0x9 Reserved
+ 0xA Reserved
+ 0xB Reserved
+ 0xC Reserved
+ 0xD 2 byte elements where each element is the index value of a bit,
+ from a bit vector, which was 1.
+ 0xE 4 byte elements where each element is the index value of a bit,
+ from a bit vector, which was 1.
+ 0xF Reserved
+
+36.2.1.1.7. Application Data Integrity (ADI)
+ On platforms which support ADI, the ADI version number may be specified for each separate memory
+ access type used in the CCB command. ADI checking only occurs when reading data. When writing data,
+ the specified ADI version number overwrites any existing ADI value in memory.
+
+ An ADI version value of 0 or 0xF indicates the ADI checking is disabled for that data access, even if it is
+ enabled in memory. By setting the appropriate flag in CCB_SUBMIT (Section 36.3.1, “ccb_submit”) it is
+ also an option to disable ADI checking for all inputs accessed via virtual address for all CCBs submitted
+ during that hypercall invocation.
+
+ The ADI value is only guaranteed to be checked on the first 64 bytes of each data access. Mismatches on
+ subsequent data chunks may not be detected, so guest software should be careful to use page size checking
+ to protect against buffer overruns.
+
+36.2.1.1.8. Page size checking
+ All data accesses used in CCB commands must be bounded within a single memory page. When addresses
+ are provided using a virtual address, the page size for checking is extracted from the TTE for that virtual
+ address. When using real addresses, the guest must supply the page size in the same field as the address
+ value. The page size must be one of the sizes supported by the underlying virtual machine. Using a value
+ that is not supported may result in the CCB submission being rejected or the generation of a CCB parsing
+ error in the completion area.
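+
+ Guest software can test the single-page rule before building a CCB. The check below is a minimal
+ sketch that takes the page size in bytes; the mapping between page size codes and byte sizes is
+ platform specific and is not modelled here, and the function name is hypothetical.
+
```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Return true if the buffer [addr, addr + len) lies within a single page.
 * page_size must be a non-zero power of two; len must be non-zero.
 */
static bool within_one_page(uint64_t addr, uint64_t len, uint64_t page_size)
{
    uint64_t last;

    if (len == 0 || page_size == 0 || (page_size & (page_size - 1)) != 0)
        return false;

    last = addr + (len - 1);
    if (last < addr)            /* address arithmetic wrapped around */
        return false;

    /* First and last byte must share the same page base address. */
    return (addr & ~(page_size - 1)) == (last & ~(page_size - 1));
}
```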
+
+36.2.1.2. Extract command
+
+ Converts an input vector in one format to an output vector in another format. All input format types are
+ supported.
+
+ The only supported output format is a padded, byte-aligned output stream, using output codes 0x0 - 0x4.
+ When the specified output element size is larger than the extracted input element size, the extracted
+ input element is padded with zeros. First, if the decompressed input size is not a whole number of
+ bytes, 0 bits are padded on the most significant bit side until the next byte boundary. Next, if the
+ output element size is larger than the byte-padded input element, bytes of value 0 are added based on
+ the Padding Direction bit in the CCB. If the output element size is smaller than the byte-padded input
+ element size, the input element is truncated by dropping bytes from the least significant byte side
+ until the selected output size is reached.
+
+ The return value of the CCB completion area is invalid. The “number of elements processed” field in the
+ CCB completion area will be valid.
+
+ The extract CCB is a 64-byte “short format” CCB.
+
+ The extract CCB command format can be specified by the following packed C structure for a big-endian
+ machine:
+
+
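+ struct extract_ccb {
+         uint32_t header;               /* offset 0:  CCB header */
+         uint32_t control;              /* offset 4:  command control */
+         uint64_t completion;           /* offset 8:  completion area */
+         uint64_t primary_input;        /* offset 16 */
+         uint64_t data_access_control;  /* offset 24 */
+         uint64_t secondary_input;      /* offset 32 */
+         uint64_t reserved;             /* offset 40 */
+         uint64_t output;               /* offset 48 */
+         uint64_t table;                /* offset 56: symbol table */
+ };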
+
+
+ The exact field offsets, sizes, and composition are as follows:
+
+Offset  Size  Field Description
+0       4     CCB header (Table 36.1, “CCB Header Format”)
+4       4     Command control
+              Bits     Field Description
+              [31:28]  Primary Input Format (see Section 36.2.1.1.1, “Primary Input Format”)
+              [27:23]  Primary Input Element Size (see Section 36.2.1.1.2, “Primary Input
+                       Element Size”)
+              [22:20]  Primary Input Starting Offset (see Section 36.2.1.1.5, “Input Element
+                       Offsets”)
+              [19]     Secondary Input Format (see Section 36.2.1.1.3, “Secondary Input Format”)
+              [18:16]  Secondary Input Starting Offset (see Section 36.2.1.1.5, “Input Element
+                       Offsets”)
+              [15:14]  Secondary Input Element Size (see Section 36.2.1.1.4, “Secondary Input
+                       Element Size”)
+              [13:10]  Output Format (see Section 36.2.1.1.6, “Output Format”)
+              [9]      Padding Direction selector: A value of 1 causes padding bytes to be added
+                       to the left side of output elements. A value of 0 causes padding bytes to
+                       be added to the right side of output elements.
+              [8:0]    Reserved
+8       8     Completion
+              Bits     Field Description
+              [63:60]  ADI version (see Section 36.2.1.1.7, “Application Data Integrity (ADI)”)
+              [59]     If set to 1, a virtual device interrupt will be generated using the device
+                       interrupt number specified in the lower bits of this completion word. If 0,
+                       the lower bits of this completion word are ignored.
+              [58:6]   Completion area address bits [58:6]. Address type is determined by CCB
+                       header.
+              [5:0]    Virtual device interrupt number for completion interrupt, if enabled.
+16      8     Primary Input
+              Bits     Field Description
+              [63:60]  ADI version (see Section 36.2.1.1.7, “Application Data Integrity (ADI)”)
+              [59:56]  If using real address, these bits should be filled in with the page size
+                       code for the page boundary checking the guest wants the virtual machine to
+                       use when accessing this data stream (checking is only guaranteed to be
+                       performed when using API version 1.1 and later). If using a virtual address,
+                       this field will be used as primary input address bits [59:56].
+              [55:0]   Primary input address bits [55:0]. Address type is determined by CCB
+                       header.
+24      8     Data Access Control
+              Bits     Field Description
+              [63:62]  Flow Control
+                       Value  Description
+                       0b'00  Disable flow control
+                       0b'01  Enable flow control (only valid with "ORCL,sun4v-dax-fc" compatible
+                              virtual device variants)
+                       0b'10  Reserved
+                       0b'11  Reserved
+              [61:60]  Reserved (API 1.0)
+                       Pipeline target (API 2.0)
+                       Value  Description
+                       0b'00  Connect to primary input
+                       0b'01  Connect to secondary input
+                       0b'10  Reserved
+                       0b'11  Reserved
+              [59:40]  Output buffer size given in units of 64 bytes, minus 1. Value of 0 means
+                       64 bytes, value of 1 means 128 bytes, etc. Buffer size is only enforced if
+                       flow control is enabled in Flow Control field.
+              [39:32]  Reserved
+              [31:30]  Output Data Cache Allocation
+                       Value  Description
+                       0b'00  Do not allocate cache lines for output data stream.
+                       0b'01  Force cache lines for output data stream to be allocated in the
+                              cache that is local to the submitting virtual cpu.
+                       0b'10  Allocate cache lines for output data stream, but allow existing
+                              cache lines associated with the data to remain in their current
+                              cache instance. Any memory not already in cache will be allocated
+                              in the cache local to the submitting virtual cpu.
+                       0b'11  Reserved
+              [29:26]  Reserved
+              [25:24]  Primary Input Length Format
+                       Value  Description
+                       0b'00  Number of primary symbols
+                       0b'01  Number of primary bytes
+                       0b'10  Number of primary bits
+                       0b'11  Reserved
+              [23:0]   Primary Input Length
+                       Format                Field Value
+                       # of primary symbols  Number of input elements to process, minus 1.
+                                             Command execution stops once count is reached.
+                       # of primary bytes    Number of input bytes to process, minus 1. Command
+                                             execution stops once count is reached. The count is
+                                             done before any decompression or decoding.
+                       # of primary bits     Number of input bits to process, minus 1. Command
+                                             execution stops once count is reached. The count is
+                                             done before any decompression or decoding, and does
+                                             not include any bits skipped by the Primary Input
+                                             Offset field value of the command control word.
+32      8     Secondary Input, if used by Primary Input Format. Same fields as Primary Input.
+40      8     Reserved
+48      8     Output (same fields as Primary Input)
+56      8     Symbol Table (if used by Primary Input)
+              Bits     Field Description
+              [63:60]  ADI version (see Section 36.2.1.1.7, “Application Data Integrity (ADI)”)
+              [59:56]  If using real address, these bits should be filled in with the page size
+                       code for the page boundary checking the guest wants the virtual machine to
+                       use when accessing this data stream (checking is only guaranteed to be
+                       performed when using API version 1.1 and later). If using a virtual address,
+                       this field will be used as symbol table address bits [59:56].
+              [55:4]   Symbol table address bits [55:4]. Address type is determined by CCB header.
+              [3:0]    Symbol table version
+                       Value  Description
+                       0      Huffman encoding. Must use 64 byte aligned table address. (Only
+                              available when using version 0 CCBs)
+                       1      OZIP encoding. Must use 16 byte aligned table address. (Only
+                              available when using version 1 CCBs)
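+
+ As a worked example of the bit layouts above, the sketch below assembles the command control word and
+ the completion word of an extract CCB. The function names and parameter names are hypothetical; only
+ the field positions come from the table. The element size argument is given in bits or bytes
+ (depending on the input format) and is encoded minus one, as described in Section 36.2.1.1.2.
+
```c
#include <stdint.h>

/* Build the 32-bit command control word of an extract CCB (offset 4). */
static uint32_t extract_control_word(unsigned pri_format, unsigned pri_elem_size,
                                     unsigned pri_offset, unsigned sec_format,
                                     unsigned sec_offset, unsigned sec_size_code,
                                     unsigned out_format, unsigned pad_left)
{
    return ((uint32_t)(pri_format & 0xF) << 28) |              /* [31:28] */
           ((uint32_t)((pri_elem_size - 1) & 0x1F) << 23) |    /* [27:23], size minus 1 */
           ((uint32_t)(pri_offset & 0x7) << 20) |              /* [22:20] */
           ((uint32_t)(sec_format & 0x1) << 19) |              /* [19]    */
           ((uint32_t)(sec_offset & 0x7) << 16) |              /* [18:16] */
           ((uint32_t)(sec_size_code & 0x3) << 14) |           /* [15:14] */
           ((uint32_t)(out_format & 0xF) << 10) |              /* [13:10] */
           ((uint32_t)(pad_left & 0x1) << 9);                  /* [9]     */
}

/*
 * Build the 64-bit completion word (offset 8). The completion area address
 * supplies bits [58:6], so its low six bits are masked off here. When the
 * interrupt is not requested, the interrupt number bits are left as zero.
 */
static uint64_t ccb_completion_word(unsigned adi_version, uint64_t addr,
                                    int intr_enable, unsigned intr_num)
{
    uint64_t w = ((uint64_t)(adi_version & 0xF) << 60) |
                 (addr & UINT64_C(0x07FFFFFFFFFFFFC0));

    if (intr_enable)
        w |= (UINT64_C(1) << 59) | (uint64_t)(intr_num & 0x3F);
    return w;
}
```
+
+ For instance, a 12-bit fixed width bit-packed input (format 0x1) extracted to 2-byte elements (output
+ format 0x1) with padding on the left gives a control word of 0x15800600 in this sketch.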
+
+
+36.2.1.3. Scan commands
+
+ The scan commands search a stream of input data elements for values which match the selection criteria.
+ All the input format types are supported. There are multiple formats for the scan commands, allowing the
+ scan to search for exact matches to one value, exact matches to either of two values, or any value within
+ a specified range. The specific type of scan is indicated by the command code in the CCB header. For the
+ scan range commands, the boundary conditions can be specified as greater-than-or-equal-to a value, less-
+ than-or-equal-to a value, or both by using two boundary values.
+
+ There are two supported formats for the output stream: the bit vector and index array formats (codes
+ 0x8, 0xD, and 0xE). For the standard scan command using the bit vector output, for each input element
+ there exists one bit in the vector that is set if the input element matched the scan criteria, or clear
+ if not. The inverted scan command inverts the polarity of the bits in the output. The most significant
+ bit of the first byte of the output stream corresponds to the first element in the input stream. The
+ standard index array output format contains one array entry for each input element that matched the
+ scan criteria. Each array entry is the index of an input element that matched the scan criteria. An
+ inverted scan command produces a similar array, but of all the input elements which did NOT match the
+ scan criteria.
+
+The return value of the CCB completion area contains the number of input elements found which match
+the scan criteria (or number that did not match for the inverted scans). The “number of elements processed”
+field in the CCB completion area will be valid, indicating the number of input elements processed.
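+
+The relationship between the bit vector and index array outputs can be shown with a small software
+model. The helper names below are invented for this sketch; the only property taken from the text is
+that the most significant bit of the first output byte corresponds to the first input element.
+
```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True if input element `elem_index` is marked as matching in the bit vector. */
static bool scan_bit_is_set(const uint8_t *vec, size_t elem_index)
{
    return ((vec[elem_index / 8] >> (7 - (elem_index % 8))) & 1u) != 0;
}

/*
 * Convert a bit vector covering n_elems input elements into an array of
 * matching indices. Writes at most out_cap entries and returns the total
 * number of set bits, which mirrors the match count a scan reports.
 */
static size_t bit_vector_to_indices(const uint8_t *vec, size_t n_elems,
                                    uint32_t *out, size_t out_cap)
{
    size_t count = 0;

    for (size_t i = 0; i < n_elems; i++) {
        if (scan_bit_is_set(vec, i)) {
            if (count < out_cap)
                out[count] = (uint32_t)i;
            count++;
        }
    }
    return count;
}
```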
+
+These commands are 128-byte “long format” CCBs.
+
+The scan CCB command format can be specified by the following packed C structure for a big-endian
+machine:
+
+
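+ struct scan_ccb {
+         uint32_t header;               /* offset 0:  CCB header */
+         uint32_t control;              /* offset 4:  command control */
+         uint64_t completion;           /* offset 8 */
+         uint64_t primary_input;        /* offset 16 */
+         uint64_t data_access_control;  /* offset 24 */
+         uint64_t secondary_input;      /* offset 32 */
+         uint64_t match_criteria0;      /* offsets 40 and 44: operand bytes 0-3 */
+         uint64_t output;               /* offset 48 */
+         uint64_t table;                /* offset 56: symbol table */
+         uint64_t match_criteria1;      /* offsets 64 and 68: operand bytes 4-7 */
+         uint64_t match_criteria2;      /* offsets 72 and 76: operand bytes 8-11 */
+         uint64_t match_criteria3;      /* offsets 80 and 84: operand bytes 12-15 */
+         uint64_t reserved[5];          /* offsets 88-127 */
+ };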
+
+
+The exact field offsets, sizes, and composition are as follows:
+
+Offset  Size  Field Description
+0       4     CCB header (Table 36.1, “CCB Header Format”)
+4       4     Command control
+              Bits     Field Description
+              [31:28]  Primary Input Format (see Section 36.2.1.1.1, “Primary Input Format”)
+              [27:23]  Primary Input Element Size (see Section 36.2.1.1.2, “Primary Input
+                       Element Size”)
+              [22:20]  Primary Input Starting Offset (see Section 36.2.1.1.5, “Input Element
+                       Offsets”)
+              [19]     Secondary Input Format (see Section 36.2.1.1.3, “Secondary Input Format”)
+              [18:16]  Secondary Input Starting Offset (see Section 36.2.1.1.5, “Input Element
+                       Offsets”)
+              [15:14]  Secondary Input Element Size (see Section 36.2.1.1.4, “Secondary Input
+                       Element Size”)
+              [13:10]  Output Format (see Section 36.2.1.1.6, “Output Format”)
+              [9:5]    Operand size for first scan criteria value. In a scan value operation,
+                       this is one of two potential exact match values. In a scan range operation,
+                       this is the size of the upper range boundary. The value of this field is
+                       the number of bytes in the operand, minus 1. Values 0xF-0x1E are reserved.
+                       A value of 0x1F indicates this operand is not in use for this scan
+                       operation.
+              [4:0]    Operand size for second scan criteria value. In a scan value operation,
+                       this is one of two potential exact match values. In a scan range operation,
+                       this is the size of the lower range boundary. The value of this field is
+                       the number of bytes in the operand, minus 1. Values 0xF-0x1E are reserved.
+                       A value of 0x1F indicates this operand is not in use for this scan
+                       operation.
+8       8     Completion (same fields as Section 36.2.1.2, “Extract command”)
+16      8     Primary Input (same fields as Section 36.2.1.2, “Extract command”)
+24      8     Data Access Control (same fields as Section 36.2.1.2, “Extract command”)
+32      8     Secondary Input, if used by Primary Input Format. Same fields as Primary Input.
+40      4     Most significant 4 bytes of first scan criteria operand. If first operand is less
+              than 4 bytes, the value is left-aligned to the lowest address bytes.
+44      4     Most significant 4 bytes of second scan criteria operand. If second operand is
+              less than 4 bytes, the value is left-aligned to the lowest address bytes.
+48      8     Output (same fields as Primary Input)
+56      8     Symbol Table (if used by Primary Input). Same fields as Section 36.2.1.2,
+              “Extract command”
+64      4     Next 4 most significant bytes of first scan criteria operand occurring after the
+              bytes specified at offset 40, if needed by the operand size. If first operand is
+              less than 8 bytes, the valid bytes are left-aligned to the lowest address.
+68      4     Next 4 most significant bytes of second scan criteria operand occurring after the
+              bytes specified at offset 44, if needed by the operand size. If second operand is
+              less than 8 bytes, the valid bytes are left-aligned to the lowest address.
+72      4     Next 4 most significant bytes of first scan criteria operand occurring after the
+              bytes specified at offset 64, if needed by the operand size. If first operand is
+              less than 12 bytes, the valid bytes are left-aligned to the lowest address.
+76      4     Next 4 most significant bytes of second scan criteria operand occurring after the
+              bytes specified at offset 68, if needed by the operand size. If second operand is
+              less than 12 bytes, the valid bytes are left-aligned to the lowest address.
+80      4     Next 4 most significant bytes of first scan criteria operand occurring after the
+              bytes specified at offset 72, if needed by the operand size. If first operand is
+              less than 16 bytes, the valid bytes are left-aligned to the lowest address.
+84      4     Next 4 most significant bytes of second scan criteria operand occurring after the
+              bytes specified at offset 76, if needed by the operand size. If second operand is
+              less than 16 bytes, the valid bytes are left-aligned to the lowest address.
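+
+The interleaved operand layout above can be illustrated by a helper that copies an operand, most
+significant byte first, into a 128-byte CCB image. This is a sketch under assumptions: the function
+names are hypothetical, unused operand bytes are zero-filled by choice, and the size field helper
+follows the text in treating values 0xF-0x1E as reserved, so it declines operands longer than 15
+bytes even though the layout has room for 16.
+
```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Copy a scan criteria operand into a 128-byte CCB image.
 *   which   - 0 for the first operand (offsets 40, 64, 72, 80),
 *             1 for the second operand (offsets 44, 68, 76, 84)
 *   operand - operand bytes, most significant byte first
 *   len     - operand length in bytes, 1..16
 * Returns 0 on success, -1 on invalid arguments.
 */
static int scan_put_operand(uint8_t *ccb, int which,
                            const uint8_t *operand, size_t len)
{
    static const size_t off[2][4] = {
        { 40, 64, 72, 80 },
        { 44, 68, 76, 84 },
    };

    if ((which != 0 && which != 1) || len == 0 || len > 16)
        return -1;

    /* Clear all four 4-byte slots so unused bytes are zero. */
    for (size_t w = 0; w < 4; w++)
        memset(ccb + off[which][w], 0, 4);

    /* Valid bytes are left-aligned to the lowest address of each slot. */
    for (size_t i = 0; i < len; i++)
        ccb[off[which][i / 4] + (i % 4)] = operand[i];

    return 0;
}

/*
 * Operand size field value: number of bytes minus 1, or 0x1F when the
 * operand is not in use (len == 0). Returns -1 for lengths whose encoding
 * would fall in the range the text lists as reserved.
 */
static int scan_operand_size_field(size_t len)
{
    if (len == 0)
        return 0x1F;
    if (len > 15)
        return -1;
    return (int)(len - 1);
}
```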
+
+36.2.1.4. Translate commands
+
+ The translate commands take an input array of indices, and a table of single bit values indexed by
+ those indices, and output a bit vector or index array created by reading the table's bit value at each
+ index in the input array. The output should therefore contain exactly one bit per index in the input
+ data stream, when outputting as a bit vector. When outputting as an index array, the number of elements
+ depends on the values read in the bit table, but will always be less than, or equal to, the number of
+ input elements. Only a restricted subset of the possible input format types are supported. No variable
+ width or Huffman/OZIP encoded input streams are allowed. The primary input data element size must be
+ 3 bytes or less.
+
+ The maximum table index size allowed is 15 bits. However, larger input elements may be used to provide
+ additional processing of the output values. If 2 or 3 byte values are used, the least significant 15 bits are
+ used as an index into the bit table. The most significant 9 bits (when using 3-byte input elements) or single
+ bit (when using 2-byte input elements) are compared against a fixed 9-bit test value provided in the CCB.
+ If the values match, the value from the bit table is used as the output element value. If the values do not
+ match, the output data element value is forced to 0.
+
+ In the inverted translate operation, the bit value read from the bit table is inverted prior to its use. The
+ additional processing based on any additional non-index bits remains unchanged, and still forces the output
+ element value to 0 on a mismatch. The specific type of translate command is indicated by the command
+ code in the CCB header.
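+
+ The per-element behavior described above can be sketched in C. This is only an illustration under stated
+ assumptions, not the coprocessor's implementation: the function names are hypothetical, the bit table is
+ assumed to number bits starting from the most significant bit of byte 0, and for 2-byte elements the single
+ extra bit is assumed to be compared against the least significant bit of the test value. This section does
+ not specify either detail.
+

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Read one bit from the bit table.
 * ASSUMPTION: index 0 is the most significant bit of byte 0. */
static unsigned table_bit(const uint8_t *table, uint32_t index)
{
    return (table[index >> 3] >> (7 - (index & 7))) & 1u;
}

/* Compute the output bit for one input element of 1, 2 or 3 bytes. */
static unsigned translate_element(uint32_t elem, unsigned elem_bytes,
                                  const uint8_t *table, uint16_t test_value,
                                  bool inverted)
{
    assert(elem_bytes >= 1 && elem_bytes <= 3);

    /* At most the low 15 bits of the element index the bit table. */
    unsigned bit = table_bit(table, elem & 0x7FFFu);

    /* The inverted variant flips the table bit before any further use. */
    if (inverted)
        bit ^= 1u;

    if (elem_bytes >= 2) {
        /* 1 extra bit for 2-byte elements, 9 extra bits for 3-byte ones. */
        unsigned extra_bits = elem_bytes * 8 - 15;
        uint32_t mask = (1u << extra_bits) - 1u;
        uint32_t upper = (elem >> 15) & mask;

        /* ASSUMPTION: only the low extra_bits of the test value take part. */
        if (upper != (test_value & mask))
            bit = 0;    /* a mismatch forces the output element to 0 */
    }
    return bit;
}
```
+
+ With a bit vector output, each call corresponds to one output bit. With an index array output, the element
+ position would be emitted only when the result is 1.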
+
+ There are two supported formats for the output stream: the bit vector and index array formats (codes 0x8,
+ 0xD, and 0xE). The index array format is an array of indices of bits which would have been set if the
+ output format were a bit vector.
+
+ The return value of the CCB completion area contains the number of bits set in the output bit vector,
+ or number of elements in the output index array. The “number of elements processed” field in the CCB
+ completion area will be valid, indicating the number of input elements processed.
+
+ These commands are 64-byte “short format” CCBs.
+
+ The translate CCB command format can be specified by the following packed C structure for a big-endian
+ machine:
+
+
+ struct translate_ccb {
+     uint32_t header;
+     uint32_t control;
+     uint64_t completion;
+     uint64_t primary_input;
+     uint64_t data_access_control;
+     uint64_t secondary_input;
+     uint64_t reserved;
+     uint64_t output;
+     uint64_t table;
+ };
+
+
+ The exact field offsets, sizes, and composition are as follows:
+
+
+ Offset  Size  Field Description
+ 0       4     CCB header (Table 36.1, “CCB Header Format”)
+ 4       4     Command control
+               Bits      Field Description
+               [31:28]   Primary Input Format (see Section 36.2.1.1.1, “Primary Input
+                         Format”)
+               [27:23]   Primary Input Element Size (see Section 36.2.1.1.2, “Primary
+                         Input Element Size”)
+               [22:20]   Primary Input Starting Offset (see Section 36.2.1.1.5, “Input
+                         Element Offsets”)
+               [19]      Secondary Input Format (see Section 36.2.1.1.3, “Secondary
+                         Input Format”)
+               [18:16]   Secondary Input Starting Offset (see Section 36.2.1.1.5, “Input
+                         Element Offsets”)
+               [15:14]   Secondary Input Element Size (see Section 36.2.1.1.4,
+                         “Secondary Input Element Size”)
+               [13:10]   Output Format (see Section 36.2.1.1.6, “Output Format”)
+               [9]       Reserved
+               [8:0]     Test value used for comparison against the most significant bits
+                         in the input values, when using 2- or 3-byte input elements.
+ 8       8     Completion (same fields as Section 36.2.1.2, “Extract command”)
+ 16      8     Primary Input (same fields as Section 36.2.1.2, “Extract command”)
+ 24      8     Data Access Control (same fields as Section 36.2.1.2, “Extract command”,
+               except Primary Input Length Format may not use the 0x0 value)
+ 32      8     Secondary Input, if used by Primary Input Format. Same fields as Primary
+               Input.
+ 40      8     Reserved
+ 48      8     Output (same fields as Primary Input)
+ 56      8     Bit Table
+               Bits      Field Description
+               [63:60]   ADI version (see Section 36.2.1.1.7, “Application Data
+                         Integrity (ADI)”)
+               [59:56]   If using a real address, these bits should be filled in with the page
+                         size code for the page boundary checking the guest wants the
+                         virtual machine to use when accessing this data stream (checking
+                         is only guaranteed to be performed when using API version
+                         1.1 and later). If using a virtual address, this field will be used
+                         as bit table address bits [59:56].
+               [55:4]    Bit table address bits [55:4]. Address type is determined by
+                         CCB header. Address must be 64-byte aligned (CCB version
+                         0) or 16-byte aligned (CCB version 1).
+               [3:0]     Bit table version
+                         Value  Description
+                         0      4KB table size
+                         1      8KB table size
+
+36.2.1.5. Select command
+ The select command filters the primary input data stream by using a secondary input bit vector to determine
+ which input elements to include in the output. For each bit set at a given index N within the bit vector,
+ the Nth input element is included in the output. If the bit is not set, the element is not included. Only a
+ restricted subset of the possible input format types is supported. No variable width or run length encoded
+ input streams are allowed, since the secondary input stream is used for the filtering bit vector.
+
+ The only supported output format is a padded, byte-aligned output stream. The stream follows the same
+ rules and restrictions as the padded output stream described in Section 36.2.1.2, “Extract command”.
+
+ The return value of the CCB completion area contains the number of bits set in the input bit vector. The
+ "number of elements processed" field in the CCB completion area will be valid, indicating the number
+ of input elements processed.
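+
+ The filtering rule can be illustrated with a small C sketch. It assumes fixed-width elements, no output
+ padding, and a bit vector numbered from the most significant bit of byte 0. The bit order is an assumption,
+ as this section does not state it, and select_elements is a hypothetical name. The function returns the
+ number of bits found set, mirroring the completion area return value.
+

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy every input element whose bit is set in bitvec to output.
 * Returns the number of set bits, which equals the number of elements copied. */
static size_t select_elements(const uint8_t *input, size_t elem_size,
                              size_t n_elems, const uint8_t *bitvec,
                              uint8_t *output)
{
    assert(elem_size > 0);

    size_t selected = 0;
    for (size_t n = 0; n < n_elems; n++) {
        /* ASSUMPTION: bit N lives in byte N/8, most significant bit first. */
        unsigned bit = (bitvec[n >> 3] >> (7 - (n & 7))) & 1u;
        if (bit) {
            memcpy(output + selected * elem_size,
                   input + n * elem_size, elem_size);
            selected++;
        }
    }
    return selected;
}
```
+
+ The caller must supply an output buffer large enough for n_elems elements in the worst case.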
+
+ The select CCB is a 64-byte “short format” CCB.
+
+ The select CCB command format can be specified by the following packed C structure for a big-endian
+ machine:
+
+
+ struct select_ccb {
+     uint32_t header;
+     uint32_t control;
+     uint64_t completion;
+     uint64_t primary_input;
+     uint64_t data_access_control;
+     uint64_t secondary_input;
+     uint64_t reserved;
+     uint64_t output;
+     uint64_t table;
+ };
+
+
+ The exact field offsets, sizes, and composition are as follows:
+
+ Offset  Size  Field Description
+ 0       4     CCB header (Table 36.1, “CCB Header Format”)
+ 4       4     Command control
+               Bits      Field Description
+               [31:28]   Primary Input Format (see Section 36.2.1.1.1, “Primary Input
+                         Format”)
+               [27:23]   Primary Input Element Size (see Section 36.2.1.1.2, “Primary
+                         Input Element Size”)
+               [22:20]   Primary Input Starting Offset (see Section 36.2.1.1.5, “Input
+                         Element Offsets”)
+               [19]      Secondary Input Format (see Section 36.2.1.1.3, “Secondary
+                         Input Format”)
+               [18:16]   Secondary Input Starting Offset (see Section 36.2.1.1.5, “Input
+                         Element Offsets”)
+               [15:14]   Secondary Input Element Size (see Section 36.2.1.1.4,
+                         “Secondary Input Element Size”)
+               [13:10]   Output Format (see Section 36.2.1.1.6, “Output Format”)
+               [9]       Padding Direction selector: A value of 1 causes padding bytes
+                         to be added to the left side of output elements. A value of 0
+                         causes padding bytes to be added to the right side of output
+                         elements.
+               [8:0]     Reserved
+ 8       8     Completion (same fields as Section 36.2.1.2, “Extract command”)
+ 16      8     Primary Input (same fields as Section 36.2.1.2, “Extract command”)
+ 24      8     Data Access Control (same fields as Section 36.2.1.2, “Extract command”)
+ 32      8     Secondary Bit Vector Input. Same fields as Primary Input.
+ 40      8     Reserved
+ 48      8     Output (same fields as Primary Input)
+ 56      8     Symbol Table (if used by Primary Input). Same fields as Section 36.2.1.2,
+               “Extract command”
+
+36.2.1.6. No-op and Sync commands
+
+ The no-op (no operation) command is a CCB which has no processing effect. The CCB, when processed
+ by the virtual machine, simply updates the completion area with its execution status. The CCB may have
+ the serial-conditional flags set in order to restrict when it executes.
+
+ The sync command is a variant of the no-op command with restricted execution timing. A sync
+ command CCB will only execute when all previous commands submitted in the same request have
+ completed. This is stronger than the conditional flag sequencing, which is only dependent on a single
+ previous serial CCB. While the relative ordering is guaranteed, virtual machine implementations with
+ shared hardware resources may cause the sync command to wait for longer than the minimum required time.
+
+ The return value of the CCB completion area is invalid for these CCBs. The “number of elements
+ processed” field is also invalid for these CCBs.
+
+ These commands are 64-byte “short format” CCBs.
+
+ The no-op CCB command format can be specified by the following packed C structure for a big-endian
+ machine:
+
+
+ struct nop_ccb {
+     uint32_t header;
+     uint32_t control;
+     uint64_t completion;
+     uint64_t reserved[6];
+ };
+
+
+ The exact field offsets, sizes, and composition are as follows:
+
+ Offset  Size  Field Description
+ 0       4     CCB header (Table 36.1, “CCB Header Format”)
+ 4       4     Command control
+               Bits      Field Description
+               [31]      If set, this CCB functions as a Sync command. If clear, this
+                         CCB functions as a No-op command.
+               [30:0]    Reserved
+ 8       8     Completion (same fields as Section 36.2.1.2, “Extract command”)
+ 16      48    Reserved
+
+36.2.2. CCB Completion Area
+ All CCB commands use a common 128-byte Completion Area format, which can be specified by the
+ following packed C structure for a big-endian machine:
+
+
+ struct completion_area {
+     uint8_t status_flag;
+     uint8_t error_note;
+     uint8_t rsvd0[2];
+     uint32_t error_values;
+     uint32_t output_size;
+     uint32_t rsvd1;
+     uint64_t run_time;
+     uint64_t run_stats;
+     uint32_t elements;
+     uint8_t rsvd2[20];
+     uint64_t return_value;
+     uint64_t extra_return_value[8];
+ };
+
+
+ The Completion Area must be a 128-byte aligned memory location. The exact layout can be described
+ using byte offsets and sizes relative to the memory base:
+
+ Offset  Size  Field Description
+ 0       1     CCB execution status
+               0x0        Command not yet completed
+               0x1        Command ran and succeeded
+               0x2        Command ran and failed (partial results may have been
+                          produced)
+               0x3        Command ran and was killed (partial execution may
+                          have occurred)
+               0x4        Command was not run
+               0x5-0xF    Reserved
+ 1       1     Error reason code
+               0x0        Reserved
+               0x1        Buffer overflow
+               0x2        CCB decoding error
+               0x3        Page overflow
+               0x4-0x6    Reserved
+               0x7        Command was killed
+               0x8        Command execution timeout
+               0x9        ADI miscompare error
+               0xA        Data format error
+               0xB-0xD    Reserved
+               0xE        Unexpected hardware error (Do not retry)
+               0xF        Unexpected hardware error (Retry is ok)
+               0x10-0x7F  Reserved
+               0x80       Partial Symbol Warning
+               0x81-0xFF  Reserved
+ 2       2     Reserved
+ 4       4     If a partial symbol warning was generated, this field contains the number
+               of remaining bits which were not decoded.
+ 8       4     Number of bytes of output produced
+ 12      4     Reserved
+ 16      8     Runtime of command (unspecified time units)
+ 24      8     Reserved
+ 32      4     Number of elements processed
+ 36      20    Reserved
+ 56      8     Return value
+ 64      64    Extended return value
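+
+The byte offsets listed above can be cross-checked against the C structure at compile time. The sketch
+below repeats the structure and uses C11 _Static_assert with offsetof. It assumes a compiler that applies
+natural alignment, under which this field order needs no packing attribute. The helper ccb_is_done is a
+hypothetical name for the "status byte becomes non-zero" test.
+

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct completion_area {
    uint8_t  status_flag;            /* offset 0:  CCB execution status */
    uint8_t  error_note;             /* offset 1:  error reason code */
    uint8_t  rsvd0[2];
    uint32_t error_values;           /* offset 4:  partial symbol bit count */
    uint32_t output_size;            /* offset 8:  bytes of output produced */
    uint32_t rsvd1;
    uint64_t run_time;               /* offset 16: unspecified time units */
    uint64_t run_stats;              /* offset 24: listed as reserved */
    uint32_t elements;               /* offset 32: elements processed */
    uint8_t  rsvd2[20];
    uint64_t return_value;           /* offset 56 */
    uint64_t extra_return_value[8];  /* offset 64 */
};

_Static_assert(offsetof(struct completion_area, error_values) == 4, "offset 4");
_Static_assert(offsetof(struct completion_area, output_size) == 8, "offset 8");
_Static_assert(offsetof(struct completion_area, run_time) == 16, "offset 16");
_Static_assert(offsetof(struct completion_area, elements) == 32, "offset 32");
_Static_assert(offsetof(struct completion_area, return_value) == 56, "offset 56");
_Static_assert(offsetof(struct completion_area, extra_return_value) == 64, "offset 64");
_Static_assert(sizeof(struct completion_area) == 128, "128-byte area");

/* A CCB is finished once its status byte is no longer 0x0. */
static int ccb_is_done(const volatile struct completion_area *ca)
{
    assert(ca != NULL);
    return ca->status_flag != 0;
}
```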
+
+The CCB completion area should be treated as read-only by guest software. The CCB execution status
+byte will be cleared by the Hypervisor to reflect the pending execution status when the CCB is submitted
+successfully. All other fields are considered invalid upon CCB submission until the CCB execution status
+byte becomes non-zero.
+
+CCBs which complete with status 0x2 or 0x3 may produce partial results and/or side effects due to partial
+execution of the CCB command. Some valid data may be accessible depending on the fault type; however,
+it is recommended that guest software treat the destination buffer as being in an unknown state. If a CCB
+completes with a status byte of 0x2, the error reason code byte can be read to determine what corrective
+action should be taken.
+
+A buffer overflow indicates that the results of the operation exceeded the size of the output buffer indicated
+in the CCB. The operation can be retried by resubmitting the CCB with a larger output buffer.
+
+A CCB decoding error indicates that the CCB contained some invalid field values. It may also be
+triggered if the CCB output is directed at a non-existent secondary input and the pipelining hint is followed.
+
+A page overflow error indicates that the operation required accessing a memory location beyond the page
+size associated with a given address. No data will have been read or written past the page boundary, but
+partial results may have been written to the destination buffer. The CCB can be resubmitted with a larger
+page size memory allocation to complete the operation.
+
+ In the case of pipelined CCBs, a page overflow error will be triggered if the output from the pipeline source
+ CCB ends before the input of the pipeline target CCB. Page boundaries are ignored when the pipeline
+ hint is followed.
+
+ Command kill indicates that the CCB execution was halted or prevented by use of the ccb_kill API call.
+
+ Command timeout indicates that the CCB execution began, but did not complete within a pre-determined
+ limit set by the virtual machine. The command may have produced some or no output. The CCB may be
+ resubmitted with no alterations.
+
+ ADI miscompare indicates that the memory buffer version specified in the CCB did not match the value
+ in memory when accessed by the virtual machine. Guest software should not attempt to resubmit the CCB
+ without determining the cause of the version mismatch.
+
+ A data format error indicates that the input data stream did not follow the specified data input formatting
+ selected in the CCB.
+
+ Some CCBs which encounter hardware errors may be resubmitted without change. Persistent hardware
+ errors may result in multiple failures until RAS software can identify and isolate the faulty component.
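+
+ The corrective actions described above could be collected into a single lookup, sketched here in C. The
+ enum and function names are hypothetical. Treating reserved or unlisted codes as "do not retry" is a
+ conservative assumption of this sketch, not something this section requires.
+

```c
#include <assert.h>
#include <stdint.h>

enum ccb_retry_action {
    CCB_RETRY_UNCHANGED,      /* resubmit the CCB as-is */
    CCB_RETRY_LARGER_OUTPUT,  /* resubmit with a larger output buffer */
    CCB_RETRY_LARGER_PAGE,    /* resubmit using a larger page size */
    CCB_FIX_BEFORE_RETRY,     /* CCB, input data or ADI version needs attention */
    CCB_DO_NOT_RETRY
};

/* Map the error reason code of a failed CCB (status 0x2) to an action. */
static enum ccb_retry_action ccb_classify_error(uint8_t status, uint8_t error)
{
    assert(status == 0x2);    /* the error code is read after a failure */

    switch (error) {
    case 0x1: return CCB_RETRY_LARGER_OUTPUT;  /* buffer overflow */
    case 0x2: return CCB_FIX_BEFORE_RETRY;     /* CCB decoding error */
    case 0x3: return CCB_RETRY_LARGER_PAGE;    /* page overflow */
    case 0x8: return CCB_RETRY_UNCHANGED;      /* execution timeout */
    case 0x9: return CCB_FIX_BEFORE_RETRY;     /* ADI miscompare */
    case 0xA: return CCB_FIX_BEFORE_RETRY;     /* data format error */
    case 0xE: return CCB_DO_NOT_RETRY;         /* hardware error, no retry */
    case 0xF: return CCB_RETRY_UNCHANGED;      /* hardware error, retry ok */
    default:  return CCB_DO_NOT_RETRY;         /* ASSUMPTION: be conservative */
    }
}
```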
+
+ The output size field indicates the number of bytes of valid output in the destination buffer. This field is
+ not valid for all possible CCB commands.
+
+ The runtime field indicates the execution time of the CCB command once it leaves the internal virtual
+ machine queue. The time units are fixed, but unspecified, allowing only relative timing comparisons by
+ guest software. The time units may also vary by hardware platform, and should not be construed to rep-
+ resent any absolute time value.
+
+ Some data query commands process data in units of elements. If applicable to the command, the number of
+ elements processed is indicated in the listed field. This field is not valid for all possible CCB commands.
+
+ The return value and extended return value fields are output locations for commands which do not use
+ a destination output buffer, or have secondary return results. These fields are not valid for all possible
+ CCB commands.
+
+36.3. Hypervisor API Functions
+36.3.1. ccb_submit
+ trap#        FAST_TRAP
+ function#    CCB_SUBMIT
+ arg0         address
+ arg1         length
+ arg2         flags
+ arg3         reserved
+ ret0         status
+ ret1         length
+ ret2         status data
+ ret3         reserved
+
+ Submit one or more coprocessor control blocks (CCBs) for evaluation and processing by the virtual ma-
+ chine. The CCBs are passed in a linear array indicated by address. length indicates the size of the
+ array in bytes.
+
+The address should be aligned to the size indicated by length, rounded up to the nearest power of
+two. Virtual machine implementations may reject submissions which do not adhere to that alignment.
+length must be a multiple of 64 bytes. If length is zero, the maximum supported array length will be
+returned as length in ret1. In all other cases, the length value in ret1 will reflect the number of bytes
+successfully consumed from the input CCB array.
+
+ Implementation note
+ Virtual machines should never reject submissions based on the alignment of address if the
+ entire array is contained within a single memory page of the smallest page size supported by the
+ virtual machine.
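+
+As a worked example of the alignment rule, a 192-byte array (three CCBs) rounds up to 256, so its
+address should be 256-byte aligned. The C sketch below performs that check before a submission. Both
+function names are hypothetical, and the check is deliberately strict: it does not model the single-page
+exemption from the implementation note.
+

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Round n up to the next power of two (n itself if already one). */
static uint64_t round_up_pow2(uint64_t n)
{
    assert(n >= 1 && n <= (UINT64_C(1) << 63));

    uint64_t p = 1;
    while (p < n)
        p <<= 1;
    return p;
}

/* Check the address and length arguments against the stated rules. */
static bool ccb_array_args_ok(uint64_t address, uint64_t length)
{
    if (length == 0)
        return true;     /* zero length only queries the maximum size */
    if (length % 64 != 0)
        return false;    /* length must be a multiple of 64 bytes */

    uint64_t align = round_up_pow2(length);
    return (address & (align - 1)) == 0;
}
```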
+
+A guest may choose to submit addresses used in this API function, including the CCB array address,
+as either real or virtual addresses, with the type of each address indicated in flags. Virtual addresses
+must be present in either the TLB or an active TSB to be processed. The translation context for virtual
+addresses is determined by a combination of CCB contents and the flags argument.
+
+The flags argument is divided into multiple fields defined as follows:
+
+
+Bits      Field Description
+[63:16]   Reserved
+[15]      Disable ADI for VA reads (in API 2.0)
+          Reserved (in API 1.0)
+[14]      Virtual addresses within CCBs are translated in privileged context
+[13:12]   Alternate translation context for virtual addresses within CCBs:
+          0b'00   CCBs requesting alternate context are rejected
+          0b'01   Reserved
+          0b'10   CCBs requesting alternate context use secondary context
+          0b'11   CCBs requesting alternate context use nucleus context
+[11:9]    Reserved
+[8]       Queue info flag
+[7]       All-or-nothing flag
+[6]       If address is a virtual address, treat its translation context as privileged
+[5:4]     Address type of address:
+          0b'00   Real address
+          0b'01   Virtual address in primary context
+          0b'10   Virtual address in secondary context
+          0b'11   Virtual address in nucleus context
+[3:2]     Reserved
+[1:0]     CCB command type:
+          0b'00   Reserved
+          0b'01   Reserved
+          0b'10   Query command
+          0b'11   Reserved
+
+ The CCB submission type and address type for the CCB array must be provided in the flags argument.
+ All other fields are optional values which change the default behavior of the CCB processing.
+
+ When set to one, the "Disable ADI for VA reads" bit will turn off ADI checking when using a virtual
+ address to load data. ADI checking will still be done when loading real-addressed memory. This bit is only
+ available when using major version 2 of the coprocessor API group; at major version 1 it is reserved. For
+ more information about using ADI and DAX, see Section 36.2.1.1.7, “Application Data Integrity (ADI)”.
+
+ By default, all virtual addresses are treated as user addresses. If the virtual address translations are privi-
+ leged, they must be marked as such in the appropriate flags field. The virtual addresses used within the
+ submitted CCBs must all be translated with the same privilege level.
+
+ By default, all virtual addresses used within the submitted CCBs are translated using the primary context
+ active at the time of the submission. The address type field within a CCB allows each address to request
+ translation in an alternate address context. The address context used when the alternate address context is
+ requested is selected in the flags argument.
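+
+ One way to assemble the flags argument from the bit fields listed earlier is sketched below in C. All
+ macro and function names are hypothetical; only the bit positions come from the flags table. The
+ validity check treats bit 15 as usable, which per the text holds only for major version 2 of the API group.
+

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names for the ccb_submit flags fields. */
#define CCB_FLAG_CMD_QUERY         (UINT64_C(0x2) << 0)   /* [1:0]   = 0b'10 */
#define CCB_FLAG_ADDR_PRIV         (UINT64_C(1) << 6)     /* [6]     */
#define CCB_FLAG_ALL_OR_NOTHING    (UINT64_C(1) << 7)     /* [7]     */
#define CCB_FLAG_QUEUE_INFO        (UINT64_C(1) << 8)     /* [8]     */
#define CCB_FLAG_ALT_CTX_SECONDARY (UINT64_C(0x2) << 12)  /* [13:12] = 0b'10 */
#define CCB_FLAG_ALT_CTX_NUCLEUS   (UINT64_C(0x3) << 12)  /* [13:12] = 0b'11 */
#define CCB_FLAG_CCB_VA_PRIV       (UINT64_C(1) << 14)    /* [14]    */
#define CCB_FLAG_DISABLE_ADI       (UINT64_C(1) << 15)    /* [15], API 2.0 */

/* Reserved bits: [63:16], [11:9] and [3:2]. */
#define CCB_FLAG_RESERVED_MASK \
    (~UINT64_C(0xFFFF) | UINT64_C(0x0E00) | UINT64_C(0x000C))

/* addr_type: 0 real, 1 VA primary, 2 VA secondary, 3 VA nucleus. */
static uint64_t ccb_make_flags(unsigned addr_type, bool privileged,
                               bool all_or_nothing)
{
    assert(addr_type <= 3);

    uint64_t flags = CCB_FLAG_CMD_QUERY | ((uint64_t)addr_type << 4);
    if (privileged && addr_type != 0)
        flags |= CCB_FLAG_ADDR_PRIV;   /* only meaningful for virtual addresses */
    if (all_or_nothing)
        flags |= CCB_FLAG_ALL_OR_NOTHING;
    return flags;
}

/* Reject reserved bits and reserved field encodings. */
static bool ccb_flags_valid(uint64_t flags)
{
    if (flags & CCB_FLAG_RESERVED_MASK)
        return false;
    if ((flags & 0x3) != CCB_FLAG_CMD_QUERY)
        return false;                  /* only 0b'10 is defined */
    if (((flags >> 12) & 0x3) == 0x1)
        return false;                  /* alternate context 0b'01 is reserved */
    return true;
}
```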
+
+ The all-or-nothing flag specifies whether the virtual machine should allow partial submissions of the input
+ CCB array. When using CCBs with serial-conditional flags, it is strongly recommended to use the all-
+ or-nothing flag to avoid broken conditional chains. Using long CCB chains on a machine under high co-
+ processor load may make this impractical, however, and require submitting without the flag. When sub-
+ mitting serial-conditional CCBs without the all-or-nothing flag, guest software must manually implement
+ the serial-conditional behavior at any point where the chain was not submitted in a single API call, and re-
+ submission of the remaining CCBs should clear any conditional flag that might be set in the first remaining
+ CCB. Failure to do so will produce indeterminate CCB execution status and ordering.
+
+ When the all-or-nothing flag is not specified, callers should check the value of length in ret1 to determine
+ how many CCBs from the array were successfully submitted. Any remaining CCBs can be resubmitted
+ without modifications.
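+
+ A resubmission loop for the partial-submission case might look like the following C sketch. The callback
+ type ccb_submit_fn stands in for whatever wrapper a guest uses to issue the hypercall, and H_EOK and
+ its numeric value are placeholders; neither is defined by this document. The sketch assumes the queue
+ info flag is not set, so ret1 is a plain byte count, and that the CCBs carry no serial-conditional flags.
+ fake_submit is a test double that accepts at most two CCBs per call.
+

```c
#include <assert.h>
#include <stdint.h>

#define CCB_SIZE 64u
#define H_EOK    0u   /* placeholder value for the EOK status */

/* Hypothetical hypercall wrapper: returns ret0 and stores ret1 in *consumed. */
typedef uint64_t (*ccb_submit_fn)(uint64_t address, uint64_t length,
                                  uint64_t flags, uint64_t *consumed);

/* Submit until everything is consumed, an error occurs, or max_calls
 * attempts have been made. Returns the total number of bytes consumed. */
static uint64_t submit_all(ccb_submit_fn submit, uint64_t address,
                           uint64_t length, uint64_t flags, unsigned max_calls)
{
    assert(length % CCB_SIZE == 0);

    uint64_t done = 0;
    while (done < length && max_calls-- > 0) {
        uint64_t consumed = 0;
        uint64_t status = submit(address + done, length - done, flags,
                                 &consumed);
        done += consumed;   /* ret1 is valid even when an error is returned */
        if (status != H_EOK)
            break;          /* the CCB at offset done was not submitted */
    }
    return done;
}

/* Test double: a hypervisor that enqueues at most two CCBs per call. */
static uint64_t fake_submit(uint64_t address, uint64_t length,
                            uint64_t flags, uint64_t *consumed)
{
    (void)address;
    (void)flags;
    *consumed = length < 2 * CCB_SIZE ? length : 2 * CCB_SIZE;
    return H_EOK;
}
```
+
+ A real caller would also need to keep the advanced address acceptable under the alignment rule for
+ address, which this sketch does not attempt.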
+
+ The value of length in ret1 is also valid when the API call returns an error, and callers should always
+ check its value to determine which CCBs in the array were already processed. This will additionally iden-
+ tify which CCB encountered the processing error, and was not submitted successfully.
+
+ If the queue info flag is used during submission, and at least one CCB was successfully submitted, the
+ length value in ret1 will be a multi-field value defined as follows:
+ Bits      Field Description
+ [63:48]   DAX unit instance identifier
+ [47:32]   DAX queue instance identifier
+ [31:16]   Reserved
+ [15:0]    Number of CCB bytes successfully submitted
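+
+ Unpacking this multi-field value is a matter of shifts and truncation, as in the C sketch below. The
+ structure and function names are hypothetical.
+

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct ccb_queue_info {
    uint16_t dax_unit;         /* ret1 bits [63:48] */
    uint16_t dax_queue;        /* ret1 bits [47:32] */
    uint16_t bytes_submitted;  /* ret1 bits [15:0]  */
};

/* Split ret1 into its fields; reserved bits [31:16] are ignored. */
static void ccb_decode_queue_info(uint64_t ret1, struct ccb_queue_info *out)
{
    assert(out != NULL);

    out->dax_unit = (uint16_t)(ret1 >> 48);
    out->dax_queue = (uint16_t)(ret1 >> 32);
    out->bytes_submitted = (uint16_t)ret1;
}
```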
+
+ The value of status data depends on the status value. See error status code descriptions for details.
+ The value is undefined for status values that do not specifically list a value for the status data.
+
+ The API has a reserved input and output register which will be used in subsequent minor versions of this
+ API function. Guest software implementations should treat that register as volatile across the function call
+ in order to maintain forward compatibility.
+
+36.3.1.1. Errors
+ EOK            One or more CCBs have been accepted and enqueued in the virtual machine
+                and no errors were encountered during submission. Some submitted
+                CCBs may not have been enqueued due to internal virtual machine limitations,
+                and may be resubmitted without changes.
+ EWOULDBLOCK    An internal resource conflict within the virtual machine has prevented it from
+                being able to complete the CCB submissions sufficiently quickly, requiring
+                it to abandon processing before it was complete. Some CCBs may have been
+                successfully enqueued prior to the block, and all remaining CCBs may be
+                resubmitted without changes.
+ EBADALIGN      CCB array is not on a 64-byte boundary, or the array length is not a multiple
+                of 64 bytes.
+ ENORADDR       A real address used either for the CCB array, or within one of the submitted
+                CCBs, is not valid for the guest. Some CCBs may have been enqueued prior
+                to the error being detected.
+ ENOMAP         A virtual address used either for the CCB array, or within one of the submitted
+                CCBs, could not be translated by the virtual machine using either the TLB or
+                TSB contents. The submission may be retried after adding the required mapping,
+                or by converting the virtual address into a real address. Due to the shared
+                nature of address translation resources, there is no theoretical limit on the
+                number of times the translation may fail, and it is recommended all guests
+                implement some real address based backup. The virtual address which failed
+                translation is returned as status data in ret2. Some CCBs may have been
+                enqueued prior to the error being detected.
+ EINVAL         The virtual machine detected an invalid CCB during submission, or invalid
+                input arguments, such as bad flag values. Note that not all invalid CCB values
+                will be detected during submission, and some may be reported as errors in the
+                completion area instead. Some CCBs may have been enqueued prior to the
+                error being detected. This error may be returned if the CCB version is invalid.
+ ETOOMANY       The request was submitted with the all-or-nothing flag set, and the array size is
+                greater than the virtual machine can support in a single request. The maximum
+                supported size for the current virtual machine can be queried by submitting a
+                request with a zero length array, as described above.
+ ENOACCESS      The guest does not have permission to submit CCBs, or an address used in a
+                CCB lacks sufficient permissions to perform the required operation (no write
+                permission on the destination buffer address, for example). A virtual address
+                which fails permission checking is returned as status data in ret2. Some
+                CCBs may have been enqueued prior to the error being detected.
+ EUNAVAILABLE   The requested CCB operation could not be performed at this time. The restricted
+                operation availability may apply only to the first unsuccessfully submitted
+                CCB, or may apply to a larger scope. The status should not be interpreted as
+                permanent, and the guest should attempt to submit CCBs in the future which
+                had previously been unable to be performed. The status data provides
+                additional information about the scope of the restricted availability as follows:
+                Value  Description
+                0      Processing for the exact CCB instance submitted was unavailable,
+                       and it is recommended the guest emulate the operation. The guest
+                       should continue to submit all other CCBs, and assume no
+                       restrictions beyond this exact CCB instance.
+                1      Processing is unavailable for all CCBs using the requested opcode,
+                       and it is recommended the guest emulate the operation. The guest
+                       should continue to submit all other CCBs that use different
+                       opcodes, but can expect continued rejections of CCBs using the same
+                       opcode in the near future.
+                2      Processing is unavailable for all CCBs using the requested CCB
+                       version, and it is recommended the guest emulate the operation.
+                       The guest should continue to submit all other CCBs that use
+                       different CCB versions, but can expect continued rejections of CCBs
+                       using the same CCB version in the near future.
+                3      Processing is unavailable for all CCBs on the submitting vcpu,
+                       and it is recommended the guest emulate the operation or resubmit
+                       the CCB on a different vcpu. The guest should continue to submit
+                       CCBs on all other vcpus but can expect continued rejections of all
+                       CCBs on this vcpu in the near future.
+                4      Processing is unavailable for all CCBs, and it is recommended the
+                       guest emulate the operation. The guest should expect all CCB
+                       submissions to be similarly rejected in the near future.
+
+
+36.3.2. ccb_info
+
+ trap#        FAST_TRAP
+ function#    CCB_INFO
+ arg0         address
+ ret0         status
+ ret1         CCB state
+ ret2         position
+ ret3         dax
+ ret4         queue
+
+ Requests status information on a previously submitted CCB. The previously submitted CCB is identified
+ by the 64-byte aligned real address of the CCB's completion area.
+
+ A CCB can be in one of 4 states:
+
+
+ State        Value  Description
+ COMPLETED    0      The CCB has been fetched and executed, and is no longer active in
+                     the virtual machine.
+ ENQUEUED     1      The requested CCB is currently in a queue awaiting execution.
+ INPROGRESS   2      The CCB has been fetched and is currently being executed. It may still
+                     be possible to stop the execution using the ccb_kill hypercall.
+ NOTFOUND     3      The CCB could not be located in the virtual machine, and does not
+                     appear to have been executed. This may occur if the CCB was lost
+                     due to a hardware error, or the CCB may not have been successfully
+                     submitted to the virtual machine in the first place.
+
+ Implementation note
+ Some platforms may not be able to report CCBs that are currently being processed, and therefore
+ guest software should invoke the ccb_kill hypercall prior to assuming the requested CCB will never
+ be executed because it was in the NOTFOUND state.
+
+ The position return value is only valid when the state is ENQUEUED. The value returned is the number
+ of other CCBs ahead of the requested CCB, to provide a relative estimate of when the CCB may execute.
+
+ The dax return value is only valid when the state is ENQUEUED. The value returned is the DAX unit
+ instance identifier for the DAX unit processing the queue where the requested CCB is located. The value
+ matches the value that would have been, or was, returned by ccb_submit using the queue info flag.
+
+ The queue return value is only valid when the state is ENQUEUED. The value returned is the DAX
+ queue instance identifier for the DAX unit processing the queue where the requested CCB is located. The
+ value matches the value that would have been, or was, returned by ccb_submit using the queue info flag.
+
+36.3.2.1. Errors
+
+ EOK            The request was processed and the CCB state is valid.
+ EBADALIGN      address is not 64-byte aligned.
+ ENORADDR       The real address provided for address is not valid.
+ EINVAL         The CCB completion area contents are not valid.
+ EWOULDBLOCK    Internal resource constraints prevented the CCB state from being queried at this
+                time. The guest should retry the request.
+ ENOACCESS      The guest does not have permission to access the coprocessor virtual device
+                functionality.
+
+36.3.3. ccb_kill
+
+ trap#        FAST_TRAP
+ function#    CCB_KILL
+ arg0         address
+ ret0         status
+ ret1         result
+
+ Request to stop execution of a previously submitted CCB. The previously submitted CCB is identified by
+ the 64-byte aligned real address of the CCB's completion area.
+
+ The kill attempt can produce one of several values in the result return value, reflecting the CCB state
+ and actions taken by the Hypervisor:
+
+ Result       Value  Description
+ COMPLETED    0      The CCB has been fetched and executed, and is no longer active in
+                     the virtual machine. It could not be killed and no action was taken.
+ DEQUEUED     1      The requested CCB was still enqueued when the kill request was
+                     submitted, and has been removed from the queue. Since the CCB never
+                     began execution, no memory modifications were produced by it, and
+                     the completion area will never be updated. The same CCB may be
+                     submitted again, if desired, with no modifications required.
+ KILLED       2      The CCB had been fetched and was being executed when the kill
+                     request was submitted. The CCB execution was stopped, and the CCB
+                     is no longer active in the virtual machine. The CCB completion area
+                     will reflect the killed status, with the subsequent implications that
+                     partial results may have been produced. Partial results may include full
+                     command execution if the command was stopped just prior to writing
+                     to the completion area.
+ NOTFOUND     3      The CCB could not be located in the virtual machine, and does not
+                     appear to have been executed. This may occur if the CCB was lost
+                     due to a hardware error, or the CCB may not have been successfully
+                     submitted to the virtual machine in the first place. CCBs in this state
+                     are guaranteed to never execute in the future unless resubmitted.
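+
+ Guest code reacting to a kill result might reduce the table above to two questions: can the completion
+ area be expected to carry a final status, and may the same CCB be resubmitted. The C sketch below
+ encodes one reading of the table. The enum and function names are hypothetical, and treating KILLED
+ as not resubmittable is a conservative assumption, since partial results may already exist.
+

```c
#include <assert.h>
#include <stdbool.h>

enum ccb_kill_result {
    CCB_KILL_COMPLETED = 0,
    CCB_KILL_DEQUEUED  = 1,
    CCB_KILL_KILLED    = 2,
    CCB_KILL_NOTFOUND  = 3
};

/* True when the completion area reflects (or will reflect) a final status. */
static bool ccb_completion_area_updated(enum ccb_kill_result r)
{
    assert((int)r >= 0 && (int)r <= 3);
    /* DEQUEUED CCBs never update it; NOTFOUND CCBs did not appear to run. */
    return r == CCB_KILL_COMPLETED || r == CCB_KILL_KILLED;
}

/* True when the CCB did not run and may be submitted again. */
static bool ccb_kill_allows_resubmit(enum ccb_kill_result r)
{
    assert((int)r >= 0 && (int)r <= 3);
    /* ASSUMPTION: KILLED is excluded because partial output may exist. */
    return r == CCB_KILL_DEQUEUED || r == CCB_KILL_NOTFOUND;
}
```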
+
+36.3.3.1. Interactions with Pipelined CCBs
+
+ If the pipeline target CCB is killed but the pipeline source CCB was skipped, the completion area of the
+ target CCB may contain status (4,0) "Command was skipped" instead of (3,7) "Command was killed".
+
+ If the pipeline source CCB is killed, the pipeline target CCB's completion status may read (1,0) "Success".
+ This does not mean the target CCB was processed; since the source CCB was killed, there was no mean-
+ ingful output on which the target CCB could operate.
+
+36.3.3.2. Errors
+
+ EOK            The request was processed and the result is valid.
+ EBADALIGN      address is not 64-byte aligned.
+ ENORADDR       The real address provided for address is not valid.
+ EINVAL         The CCB completion area contents are not valid.
+ EWOULDBLOCK    Internal resource constraints prevented the CCB from being killed at this time.
+                The guest should retry the request.
+ ENOACCESS      The guest does not have permission to access the coprocessor virtual device
+                functionality.
new file mode 100644
@@ -0,0 +1,591 @@
+/*
+** Libdax
+**
+** Copyright © 2016, 2017 Oracle corp. All rights reserved.
+** The Universal Permissive License (UPL), Version 1.0
+**
+** Subject to the condition set forth below, permission is hereby granted to any person obtaining a copy of this
+** software, associated documentation and/or data (collectively the "Software"), free of charge and under any and
+** all copyright rights in the Software, and any and all patent rights owned or freely licensable by each licensor
+** hereunder covering either (i) the unmodified Software as contributed to or provided by such licensor, or
+** (ii) the Larger Works (as defined below), to deal in both
+**
+** (a) the Software, and
+** (b) any piece of software and/or hardware listed in the lrgrwrks.txt file if one is included with the Software
+** (each a “Larger Work” to which the Software is contributed by such licensors),
+**
+** without restriction, including without limitation the rights to copy, create derivative works of, display,
+** perform, and distribute the Software and make, use, sell, offer for sale, import, export, have made, and have
+** sold the Software and the Larger Work(s), and to sublicense the foregoing rights on either these or other terms.
+**
+** This license is subject to the following condition:
+** The above copyright notice and either this complete permission notice or at a minimum a reference to the UPL must
+** be included in all copies or substantial portions of the Software.
+**
+** THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO
+** THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+** AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF
+** CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
+** IN THE SOFTWARE.
+*/
+
+/*
+ * The CCB interface is *not* a supported interface for using DAX. To use DAX,
+ * an application should call libdax. This will protect the application from
+ * possible changes to the CCB format in different hardware versions.
+ */
+
+#ifndef _DAX1_CCB_H
+#define _DAX1_CCB_H
+
+#ifdef __KERNEL__
+#include <linux/types.h>
+#else
+#include <sys/types.h>
+#include <sys/sysmacros.h>
+#include <inttypes.h>
+#endif
+
+/* General definitions */
+
+/* For converting less1 encoded fields */
+#define DAX_LESS1(n) ((n) - 1)
+#define DAX_ADD1(n) ((n) + 1)
+
+/* Map 1,2,4,8 to 0,1,2,3. Does not check for bad input; caller beware. */
+
+static inline uint64_t /* LINTED E_STATIC_UNUSED */
+dax_log2(uint64_t val)
+{
+ val /= 2;
+ if (val == 4)
+ val = 3;
+ return (val);
+}
+
+/* A number must be 1, 2, 4, or 8 to be valid as input to dax_log2() */
+#define DAX_LOG2_MASK ((1 << 1) | (1 << 2) | (1 << 4) | (1 << 8))
+#define DAX_LOG2_VALID(n) ((1<<(n)) & DAX_LOG2_MASK)
+
+/*
+ * Changes bits into bytes needed to hold those bits. For example,
+ * if bits = 3, bytes = 1.
+ */
+#define BITS_TO_BYTES(bits) \
+ (P2ROUNDUP((bits), 8) >> 3)
+
+#define DAX_MAX_ELEM_WIDTH 16 /* in bytes */
+
+/* Values for dax_header_t members. */
+
+/* dax_header_t ccb_version */
+#define DAX1_CCB_VERSION 0
+#define DAX2_CCB_VERSION 1
+
+/* dax_header_t opcode */
+#define DAX_OP_SYNC_NOP 0x0
+#define DAX_OP_EXTRACT 0x1
+#define DAX_OP_SCAN_VALUE 0x2
+#define DAX_OP_SCAN_RANGE 0x3
+#define DAX_OP_TRANSLATE 0x4
+#define DAX_OP_SELECT 0x5
+#define DAX_OP_INVERT 0x10 /* OR with translate, scan opcodes */
+
+/*
+ * On M7, copy and fill are both implemented with the extract
+ * command. The codes below distinguish the two operations
+ * during postprocessing.
+ */
+#define DAX_COPY 0x01
+#define DAX_FILL 0x02
+
+/*
+ * dax_header_t table_addr_type, out_addr_type, sec_addr_type, pri_addr_type,
+ * cca_addr_type
+ */
+#define DAX_ADDR_TYPE_NONE 0
+#define DAX_ADDR_TYPE_VA 3 /* virtual address */
+
+/* Values for dax_control_t members. */
+
+/* dax_control_t pri_fmt */
+#define DAX_PRI_FMT_BITS (1 << 0) /* 1 for bits, 0 for bytes */
+#define DAX_PRI_FMT_VAR (1 << 1) /* 1 for var, 0 for fixed */
+#define DAX_PRI_FMT_RLE (1 << 2) /* 1 for rle */
+#define DAX_PRI_FMT_HUFF (1 << 3) /* 1 for huffman (aka zip) */
+
+/* dax_control_t pri_elem_size */
+#define DAX_PRI_ELEM_SIZE(n) DAX_LESS1(n)
+
+/* dax_control_t pri_offset */
+#define DAX_PRI_OFFSET(n) (n)
+
+/* dax_control_t sec_encoding */
+#define DAX_SEC_ENCODING_ACTUAL 1
+#define DAX_SEC_ENCODING_LESS1 0
+
+/* dax_control_t sec_offset */
+#define DAX_SEC_OFFSET(n) (n)
+
+/* dax_control_t sec_elem_size */
+#define DAX_SEC_ELEM_SIZE(n) dax_log2(n)
+
+/* dax_control_t out_fmt */
+#define DAX_OUT_FMT_BYTES 0 /* 1 to 8 bytes */
+#define DAX_OUT_FMT_16B 1 /* 16 bytes. size 0. */
+#define DAX_OUT_FMT_BIT 2 /* bit vector. size 0. */
+#define DAX_OUT_FMT_INDEX 3 /* ones index. size 2B or 4B */
+
+/*
+ * dax_control_t out_elem_size
+ * For DAX_OUT_FMT_BIT and DAX_OUT_FMT_16B, set out_elem_size = 0.
+ * For DAX_OUT_FMT_BYTES and DAX_OUT_FMT_INDEX, use this macro.
+ */
+#define DAX_OUT_ELEM_SIZE(n) dax_log2(n)
+
+/* dax_extract_control_t pad_dir */
+#define DAX_PAD_DIR_RIGHT 0
+#define DAX_PAD_DIR_LEFT 1
+
+/* dax_scan_control_t u_size, l_size */
+#define DAX_LU_DISABLE 31
+#define DAX_LU_SIZE(n) DAX_LESS1(n)
+
+/* dax_nop_control_t ext_opcode */
+#define DAX_EXT_OPCODE_NOP 0
+#define DAX_EXT_OPCODE_SYNC 1
+
+/* Values for dax_data_access_t members. */
+
+/* dax_data_access_t flow_ctrl */
+#define DAX_FLOW_CTRL_DISABLE 0
+#define DAX_FLOW_CTRL_LIMIT 2
+
+/* dax_data_access_t pipe_target */
+#define DAX_PIPE_TARGET_PRI 0
+#define DAX_PIPE_TARGET_SEC 1
+
+/* dax_data_access_t out_buf_size */
+#define DAX_OUT_BUF_SIZE(nbytes) \
+ (((((nbytes) + 63) >> 6) - 1) & DAX_OUT_BUF_SIZE_MASK)
+#ifdef TRUNCATE
+/* Reduce limits for testing */
+#define DAX_OUT_BUF_SIZE_MAX (256 * 1024) /* in bytes */
+#define DAX_OUT_BUF_SIZE_MASK 0xfff
+#else
+#define DAX_OUT_BUF_SIZE_MAX (64 * 1024 * 1024) /* in bytes */
+#define DAX_OUT_BUF_SIZE_MASK 0xfffff
+#endif
+
+/* dax_data_access_t out_alloc */
+#define DAX_OUT_ALLOC_NONE 0
+#define DAX_OUT_ALLOC_HARD (1 << 3)
+#define DAX_OUT_ALLOC_SOFT (2 << 3)
+
+/* dax_data_access_t pri_len_fmt */
+#define DAX_PRI_LEN_FMT_SYMS 0
+#define DAX_PRI_LEN_FMT_BYTES 1
+#define DAX_PRI_LEN_FMT_BITS 2
+
+/* dax_data_access_t pri_len */
+#define DAX_PRI_LEN(n) (DAX_LESS1(n) & DAX_PRI_LEN_MASK)
+
+/*
+ * DAX_PRI_LEN_MAX is the max allowed pri_len under optimal conditions.
+ * DAX_PRI_LEN_LIMIT is a lower limit that applies under certain conditions.
+ * See its use in the code for details. Define TRUNCATE to reduce the limits
+ * during testing, so more conditions can be tested using shorter vectors.
+ */
+#ifdef TRUNCATE
+#define DAX_PRI_LEN_MAX (64*1024) /* max before less 1 */
+#define DAX_PRI_LEN_MASK 0xffff
+#else
+#define DAX_PRI_LEN_MAX (16*1024*1024) /* max before less 1 */
+#define DAX_PRI_LEN_MASK 0xffffff
+#endif
+#define DAX_PRI_LEN_LIMIT (DAX_PRI_LEN_MAX - 64) /* max before less 1 */
+
+/* dax_extract_ccb_t huff. OR with ozip table address on M8 */
+#define DAX_ZIP_TABLE_VERSION_M8 1
+
+#define DAX_LONGCCB_SHIFT 26 /* shift longccb bit to lsb */
+#define DAX_PIPECCB_SHIFT 27 /* shift pipeccb bit to lsb */
+
+typedef struct {
+ uint32_t ccb_version:4; /* 31:28 CCB Version */
+ /* 27:24 Sync Flags */
+ uint32_t pipe:1; /* Pipeline */
+ uint32_t longccb:1; /* Longccb. Set for scan with lu2, lu3, lu4. */
+ uint32_t cond:1; /* Conditional */
+ uint32_t serial:1; /* Serial */
+ uint32_t opcode:8; /* 23:16 Opcode */
+ /* 15:0 Address Type. */
+ uint32_t reserved:3; /* 15:13 reserved */
+ uint32_t table_addr_type:2; /* 12:11 Huffman Table Address Type */
+ uint32_t out_addr_type:3; /* 10:8 Destination Address Type */
+ uint32_t sec_addr_type:3; /* 7:5 Secondary Source Address Type */
+ uint32_t pri_addr_type:3; /* 4:2 Primary Source Address Type */
+ uint32_t cca_addr_type:2; /* 1:0 Completion Address Type */
+} dax_header_t;
+
+/* Generic Control Word, followed by opcode-specific Control Words */
+
+#define DAX_CONTROL_COMMON \
+ uint32_t pri_fmt:4; /* 31:28 Primary Input Format */ \
+ uint32_t pri_elem_size:5; /* 27:23 Primary Input Element Size(less1) */\
+ uint32_t pri_offset:3; /* 22:20 Primary Input Starting Offset */ \
+ uint32_t sec_encoding:1; /* 19 Secondary Input Encoding */ \
+ /* (must be 0 for Select) */ \
+ uint32_t sec_offset:3; /* 18:16 Secondary Input Starting Offset */ \
+ uint32_t sec_elem_size:2; /* 15:14 Secondary Input Element Size */ \
+ /* (must be 0 for Select) */ \
+ uint32_t out_fmt:2; /* 13:12 Output Format */ \
+ uint32_t out_elem_size:2; /* 11:10 Output Element Size */
+
+typedef struct {
+ DAX_CONTROL_COMMON /* 31:10 */
+ uint32_t misc:10;
+} dax_control_t;
+
+typedef struct {
+ DAX_CONTROL_COMMON /* 31:10 */
+ uint32_t u_size:5; /* 9:5 U operand size, bytes less 1 (or disable) */
+ uint32_t l_size:5; /* 4:0 L operand size, bytes less 1 (or disable) */
+} dax_scan_control_t;
+
+typedef struct {
+ DAX_CONTROL_COMMON /* 31:10 */
+ uint32_t unused:1; /* 9 Reserved */
+ uint32_t test_value:9; /* 8:0 for v1; 7:0 for v2 with 8 unused */
+} dax_translate_control_t;
+
+typedef struct {
+ DAX_CONTROL_COMMON /* 31:10 */
+ uint32_t pad_dir:1; /* 9 Padding Direction */
+ uint32_t unused:9; /* 8:0 Reserved, set to 0 */
+} dax_extract_control_t, dax_select_control_t;
+
+typedef struct {
+ uint32_t ext_opcode:1; /* 31 Extended Opcode: 0 nop, 1 sync */
+ uint32_t unused:31; /* 30:0 Reserved, set to 0 */
+} dax_nop_control_t;
+
+typedef struct {
+ uint64_t flow_ctrl:2; /* 63:62 Flow Control Type */
+ uint64_t pipe_target:2; /* 61:60 Pipeline Target */
+ uint64_t out_buf_size:20; /* 59:40 Output Buffer Size */
+ /* (cachelines less 1) */
+ uint64_t unused1:8; /* 39:32 Reserved, Set to 0 */
+ uint64_t out_alloc:5; /* 31:27 Output Allocation */
+ uint64_t unused2:1; /* 26 Reserved */
+ uint64_t pri_len_fmt:2; /* 25:24 Input Length Format */
+ uint64_t pri_len:24; /* 23:0 Input Element/Byte/Bit Count */
+ /* (less 1) */
+} dax_data_access_t;
+
+typedef struct {
+ uint32_t upper; /* U operand MSW */
+ uint32_t lower; /* L operand MSW */
+} dax_lu_t;
+
+/* Generic CCB, followed by opcode-specific CCBs */
+
+struct dax_ccb {
+ dax_header_t hdr; /* CCB Header */
+ dax_control_t ctrl; /* Control Word */
+ void *ca; /* Completion Address */
+ void *pri; /* Primary Input Address */
+ dax_data_access_t dac; /* Data Access Control */
+ void *sec; /* Secondary Input Address */
+ uint64_t dword5; /* depends on opcode */
+ void *out; /* Output Address */
+ void *huff_or_bitmap; /* Huff Table Address or bitmap */
+};
+
+typedef struct {
+ dax_header_t hdr; /* CCB Header */
+ dax_extract_control_t ctrl; /* Control Word */
+ void *ca; /* Completion Address */
+ void *pri; /* Primary Input Address */
+ dax_data_access_t dac; /* Data Access Control */
+ void *sec; /* Secondary Input Address */
+ uint64_t dword5; /* Unused, must be 0 */
+ void *out; /* Output Address */
+ void *huff; /* Huff Table Address */
+} dax_extract_ccb_t;
+
+typedef struct {
+ dax_header_t hdr; /* CCB Header */
+ dax_translate_control_t ctrl; /* Control Word */
+ void *ca; /* Completion Address */
+ void *pri; /* Primary Input Address */
+ dax_data_access_t dac; /* Data Access Control */
+ void *sec; /* Secondary Input Address */
+ uint64_t dword5; /* Unused, must be 0 */
+ void *out; /* Output Address */
+ void *bitmap; /* Translate Vector Address */
+} dax_translate_ccb_t;
+
+typedef struct {
+ dax_header_t hdr; /* CCB Header */
+ dax_select_control_t ctrl; /* Control Word */
+ void *ca; /* Completion Address */
+ void *pri; /* Primary Input Address */
+ dax_data_access_t dac; /* Data Access Control */
+ void *sec; /* Secondary Input Address */
+ uint64_t dword5; /* Unused, must be 0 */
+ void *out; /* Output Address */
+ void *huff; /* Huff Table Address */
+} dax_select_ccb_t;
+
+typedef struct {
+ dax_header_t hdr; /* CCB Header */
+ dax_scan_control_t ctrl; /* Control Word */
+ void *ca; /* Completion Address */
+ void *pri; /* Primary Input Address */
+ dax_data_access_t dac; /* Data Access Control */
+ void *sec; /* Secondary Input Address */
+ dax_lu_t lu1; /* L and U Operands MSW */
+ void *out; /* Output Address */
+ void *huff; /* Huff Table Address */
+
+ /* note: must set longccb if these fields are used */
+ dax_lu_t lu2; /* L and U operand 2MSW */
+ dax_lu_t lu3; /* L and U operand 3MSW */
+ dax_lu_t lu4; /* L and U operand 4MSW */
+ uint64_t unused[5]; /* Reserved, must be 0 */
+} dax_scan_ccb_t;
+
+typedef struct {
+ dax_header_t hdr; /* CCB Header */
+ dax_nop_control_t ctrl; /* Control Word */
+ void *ca; /* Completion Address */
+ uint64_t unused[6]; /* Unused, must be 0 */
+} dax_nop_ccb_t, dax_sync_ccb_t;
+
+#define OFFSETOF(s, m) ((size_t)(&(((s *)0)->m)))
+#define CCB_LU1_OFFSET OFFSETOF(dax_scan_ccb_t, lu1)
+#define CCB_LU2_OFFSET OFFSETOF(dax_scan_ccb_t, lu2)
+
+/* Dax command completion area */
+
+/* dax_cca_t status */
+#define CCA_STAT_NOT_COMPLETED 0
+#define CCA_STAT_COMPLETED 1
+#define CCA_STAT_FAILED 2
+#define CCA_STAT_KILLED 3
+#define CCA_STAT_NOT_RUN 4
+#define CCA_STAT_PIPE_OUT 5
+#define CCA_STAT_PIPE_SRC 6
+#define CCA_STAT_PIPE_DST 7
+
+#define IS_CCA_COMPLETED(status) \
+	(((status) == CCA_STAT_COMPLETED) || \
+	((status) == CCA_STAT_PIPE_OUT))
+
+/* dax_cca_t err */
+#define CCA_ERR_SUCCESS 0x0 /* no error */
+#define CCA_ERR_OVERFLOW 0x1 /* buffer overflow */
+#define CCA_ERR_DECODE 0x2 /* CCB decode error */
+#define CCA_ERR_PAGE_OVERFLOW 0x3 /* page overflow */
+#define CCA_ERR_KILLED 0x7 /* command was killed */
+#define CCA_ERR_TIMEOUT 0x8 /* Timeout */
+#define CCA_ERR_ADI 0x9 /* ADI error */
+#define CCA_ERR_DATA_FMT 0xA /* data format error */
+#define CCA_ERR_OTHER_NO_RETRY 0xE /* Other error, do not retry */
+#define CCA_ERR_OTHER_RETRY 0xF /* Other error, retry */
+#define CCA_ERR_PARTIAL_SYMBOL 0x80 /* QP partial symbol warning */
+
+/* These error codes are written into err by software; DAX never sets them */
+#define CCA_ERR_NOT_RUN 0xf9 /* innocent ccb being skipped */
+#define CCA_ERR_THREAD 0xfa /* thread did not init dax */
+#define CCA_ERR_SUBMIT 0xfb /* unknown submission error */
+#define CCA_ERR_EAGAIN 0xfc /* try again */
+#define CCA_ERR_NOMAP 0xfd /* no VA->PA mapping for some arg */
+#define CCA_ERR_NOACCESS 0xfe /* no permission to access some arg */
+#define CCA_ERR_UNAVAILABLE 0xff /* dax unavailable during live migr */
+
+struct dax_cca {
+ uint8_t status; /* user may mwait on this address */
+ uint8_t err; /* user visible error notification */
+ uint8_t rsvd[2]; /* reserved */
+ uint32_t n_remaining; /* for QP partial symbol warning */
+ uint32_t output_sz; /* output in bytes */
+ uint32_t rsvd2; /* reserved */
+ uint64_t run_cycles; /* run time in OCND2 cycles */
+ uint64_t run_stats; /* nothing reported in version 1.0 */
+ uint32_t n_processed; /* number input elements */
+ uint32_t rsvd3[5]; /* reserved */
+ uint64_t retval; /* command return value */
+ uint64_t rsvd4[8]; /* reserved */
+};
+
+typedef struct dax_cca dax_cca_t;
+
+/* Bitfield definitions for CCB Header */
+
+#define HDR_DATATYPE uint32_t
+
+#define HDR_CCA_ADDR_TYPE_LOW 0
+#define HDR_CCA_ADDR_TYPE_HIGH 1
+#define HDR_CCA_ADDR_TYPE_DATATYPE HDR_DATATYPE
+
+#define HDR_PRI_ADDR_TYPE_LOW 2
+#define HDR_PRI_ADDR_TYPE_HIGH 4
+#define HDR_PRI_ADDR_TYPE_DATATYPE HDR_DATATYPE
+
+#define HDR_SEC_ADDR_TYPE_LOW 5
+#define HDR_SEC_ADDR_TYPE_HIGH 7
+#define HDR_SEC_ADDR_TYPE_DATATYPE HDR_DATATYPE
+
+#define HDR_OUT_ADDR_TYPE_LOW 8
+#define HDR_OUT_ADDR_TYPE_HIGH 10
+#define HDR_OUT_ADDR_TYPE_DATATYPE HDR_DATATYPE
+
+#define HDR_TABLE_ADDR_TYPE_LOW 11
+#define HDR_TABLE_ADDR_TYPE_HIGH 12
+#define HDR_TABLE_ADDR_TYPE_DATATYPE HDR_DATATYPE
+
+#define HDR_OPCODE_LOW 16
+#define HDR_OPCODE_HIGH 23
+#define HDR_OPCODE_DATATYPE HDR_DATATYPE
+
+#define HDR_SERIAL_LOW 24
+#define HDR_SERIAL_HIGH 24
+#define HDR_SERIAL_DATATYPE HDR_DATATYPE
+
+#define HDR_COND_LOW 25
+#define HDR_COND_HIGH 25
+#define HDR_COND_DATATYPE HDR_DATATYPE
+
+#define HDR_LONGCCB_LOW 26
+#define HDR_LONGCCB_HIGH 26
+#define HDR_LONGCCB_DATATYPE HDR_DATATYPE
+
+#define HDR_PIPE_LOW 27
+#define HDR_PIPE_HIGH 27
+#define HDR_PIPE_DATATYPE HDR_DATATYPE
+
+#define HDR_SYNC_FLAGS_LOW 24
+#define HDR_SYNC_FLAGS_HIGH 27
+#define HDR_SYNC_FLAGS_DATATYPE HDR_DATATYPE
+
+#define HDR_CCB_VERSION_LOW 28
+#define HDR_CCB_VERSION_HIGH 31
+#define HDR_CCB_VERSION_DATATYPE HDR_DATATYPE
+
+/*
+ * Bitfield definitions for CCB Control Word: dax_extract_control_t,
+ * dax_scan_control_t, dax_translate_control_t, dax_select_control_t,
+ * dax_nop_control_t.
+ */
+
+#define CTRL_DATATYPE uint32_t
+
+/* For Extract, Scan, Translate, Select */
+#define CTRL_OUT_ELEM_SIZE_LOW 10
+#define CTRL_OUT_ELEM_SIZE_HIGH 11
+#define CTRL_OUT_ELEM_SIZE_DATATYPE CTRL_DATATYPE
+
+#define CTRL_OUT_FMT_LOW 12
+#define CTRL_OUT_FMT_HIGH 13
+#define CTRL_OUT_FMT_DATATYPE CTRL_DATATYPE
+
+#define CTRL_SEC_ELEM_SIZE_LOW 14
+#define CTRL_SEC_ELEM_SIZE_HIGH 15
+#define CTRL_SEC_ELEM_SIZE_DATATYPE CTRL_DATATYPE
+
+#define CTRL_SEC_OFFSET_LOW 16
+#define CTRL_SEC_OFFSET_HIGH 18
+#define CTRL_SEC_OFFSET_DATATYPE CTRL_DATATYPE
+
+#define CTRL_SEC_ENCODING_LOW 19
+#define CTRL_SEC_ENCODING_HIGH 19
+#define CTRL_SEC_ENCODING_DATATYPE CTRL_DATATYPE
+
+#define CTRL_PRI_OFFSET_LOW 20
+#define CTRL_PRI_OFFSET_HIGH 22
+#define CTRL_PRI_OFFSET_DATATYPE CTRL_DATATYPE
+
+#define CTRL_PRI_ELEM_SIZE_LOW 23
+#define CTRL_PRI_ELEM_SIZE_HIGH 27
+#define CTRL_PRI_ELEM_SIZE_DATATYPE CTRL_DATATYPE
+
+#define CTRL_PRI_FMT_LOW 28
+#define CTRL_PRI_FMT_HIGH 31
+#define CTRL_PRI_FMT_DATATYPE CTRL_DATATYPE
+
+/* For Sync and No-op */
+#define CTRL_OPCODE_LOW 31
+#define CTRL_OPCODE_HIGH 31
+#define CTRL_OPCODE_DATATYPE CTRL_DATATYPE
+
+/* For Extract and Select */
+#define CTRL_PAD_DIR_LOW 9
+#define CTRL_PAD_DIR_HIGH 9
+#define CTRL_PAD_DIR_DATATYPE CTRL_DATATYPE
+
+/* For Scan */
+#define CTRL_L_SIZE_LOW 0
+#define CTRL_L_SIZE_HIGH 4
+#define CTRL_L_SIZE_DATATYPE CTRL_DATATYPE
+
+#define CTRL_U_SIZE_LOW 5
+#define CTRL_U_SIZE_HIGH 9
+#define CTRL_U_SIZE_DATATYPE CTRL_DATATYPE
+
+/* Bitfield definitions for Data Access Control, dax_data_access_t */
+
+#define DAC_DATATYPE uint64_t
+
+#define DAC_PRI_LEN_LOW 0
+#define DAC_PRI_LEN_HIGH 23
+#define DAC_PRI_LEN_DATATYPE DAC_DATATYPE
+
+#define DAC_PRI_LEN_FMT_LOW 24
+#define DAC_PRI_LEN_FMT_HIGH 25
+#define DAC_PRI_LEN_FMT_DATATYPE DAC_DATATYPE
+
+#define DAC_OUT_ALLOC_LOW 27
+#define DAC_OUT_ALLOC_HIGH 31
+#define DAC_OUT_ALLOC_DATATYPE DAC_DATATYPE
+
+#define DAC_OUT_BUF_SIZE_LOW 40
+#define DAC_OUT_BUF_SIZE_HIGH 59
+#define DAC_OUT_BUF_SIZE_DATATYPE DAC_DATATYPE
+
+#define DAC_PIPE_TARGET_LOW 60
+#define DAC_PIPE_TARGET_HIGH 61
+#define DAC_PIPE_TARGET_DATATYPE DAC_DATATYPE
+
+#define DAC_FLOW_CTRL_LOW 62
+#define DAC_FLOW_CTRL_HIGH 63
+#define DAC_FLOW_CTRL_DATATYPE DAC_DATATYPE
+
+#define SHORT_CCB_UNITS 1
+#define LONG_CCB_UNITS 2
+#define CCB_MAX_SIZE (LONG_CCB_UNITS * sizeof (struct dax_ccb))
+#define CCB_MIN_SIZE sizeof (struct dax_ccb)
+#define CCB_UNIT_SIZE sizeof (struct dax_ccb)
+#define CCA_SIZE sizeof (dax_cca_t)
+#define CCA_UNIT_SIZE sizeof (dax_cca_t)
+
+/* TBD: delete if unused */
+#define CCB_CONT 0
+
+#define IS_LONG_CCB(ccb) \
+ ((*((uint64_t *)(ccb)) >> (32 + DAX_LONGCCB_SHIFT)) & 0x1)
+
+#define IS_PIPE_CCB(ccb) \
+ ((*((uint64_t *)(ccb)) >> (32 + DAX_PIPECCB_SHIFT)) & 0x1)
+
+#define CCB_ENTRIES(ccb) \
+ (1 << IS_LONG_CCB(ccb))
+
+#define CCB_SIZE(ccb) \
+ (CCB_MIN_SIZE << IS_LONG_CCB(ccb))
+
+#define MAX_BIT_WIDTH_32KBITS_TRANS_VEC 15
+
+#endif /* _DAX1_CCB_H */
new file mode 100644
@@ -0,0 +1,219 @@
+/*
+** Example
+**
+** Copyright © 2017 Oracle corp. All rights reserved.
+** The Universal Permissive License (UPL), Version 1.0
+**
+** Subject to the condition set forth below, permission is hereby granted to any person obtaining a copy of this
+** software, associated documentation and/or data (collectively the "Software"), free of charge and under any and
+** all copyright rights in the Software, and any and all patent rights owned or freely licensable by each licensor
+** hereunder covering either (i) the unmodified Software as contributed to or provided by such licensor, or
+** (ii) the Larger Works (as defined below), to deal in both
+**
+** (a) the Software, and
+** (b) any piece of software and/or hardware listed in the lrgrwrks.txt file if one is included with the Software
+** (each a “Larger Work” to which the Software is contributed by such licensors),
+**
+** without restriction, including without limitation the rights to copy, create derivative works of, display,
+** perform, and distribute the Software and make, use, sell, offer for sale, import, export, have made, and have
+** sold the Software and the Larger Work(s), and to sublicense the foregoing rights on either these or other terms.
+**
+** This license is subject to the following condition:
+** The above copyright notice and either this complete permission notice or at a minimum a reference to the UPL must
+** be included in all copies or substantial portions of the Software.
+**
+** THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO
+** THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+** AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF
+** CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
+** IN THE SOFTWARE.
+*/
+
+/*
+ * This is example code to demonstrate how any kernel code
+ * can utilize the Oracle DAX coprocessor.
+ *
+ * This particular example implements a simple memory clearing
+ * function using the coprocessor's Extract operation.
+ */
+
+#include <linux/slab.h>
+#include <asm/hypervisor.h>
+#include "dax1_ccb.h"
+
+#define ASI_MONITOR_PRIMARY 0x84
+u8 loadmon8(void *addr)
+{
+ u8 ret;
+
+ __asm__ __volatile__("lduba [%[src]] %[asi], %[dest]\n"
+ : [dest] "=r" (ret)
+ : [asi] "i" (ASI_MONITOR_PRIMARY),
+ [src] "r" (addr));
+ return ret;
+}
+
+#define MWAIT_COUNT_REGISTER 28
+void mwait(int nsecs)
+{
+ __asm__ __volatile__("wr %%g0, %[arg], %%asr%[mcr]\n"
+ : : [arg] "r" (nsecs),
+ [mcr] "i" (MWAIT_COUNT_REGISTER));
+}
+
+/*
+ * DAX Extract operation to zero the output buffer.
+ *
+ * The primary input buffer is a page full of zeroes, and the
+ * secondary input buffer is a run-length-encoding, where byte I
+ * determines the number of copies of primary input byte I to be
+ * produced in the output. We fill the RLE buffer with the value 0xff,
+ * which produces 256 copies of each input byte in the output.
+ * Additionally, the output format is specified as 16 bytes, so each
+ * byte of input produces 16 bytes of output. Thus each 1-byte element
+ * is expanded to a 16-byte output element 256 times (16 * 256 = 4096
+ * bytes), so an 8 KB page of inputs can clear 32 MB (4 KB * 8 K) of memory.
+ */
+#define DAX_RLE_EXPAND_ELEM_LEN (16*256UL)
+#define DAX_ZERO_OUTPUT_MAX_LEN (DAX_RLE_EXPAND_ELEM_LEN * PAGE_SIZE)
+#define DAX_ZERO_TIMEOUT (5UL * 1000UL * 1000UL * 1000UL)
+#define MWAIT_TIME 8192
+
+int dax_zero(void *addr, int len)
+{
+ unsigned long hv_rv, accepted_len, status_data, timeout, res;
+ struct dax_ccb *ccb;
+ struct dax_cca *cca;
+ void *src0, *src1;
+ u16 kill_res;
+ int ret = 1;
+
+ printk(KERN_ALERT "%s(%p, %x)\n", __func__, addr, len);
+
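+	if (len < 0 || len > DAX_ZERO_OUTPUT_MAX_LEN)
+		return ret;
+
+	/*
+	 * Below one expanded element, pri_len would be DAX_LESS1(0) and
+	 * wrap to the maximum length, so clear small buffers on the CPU.
+	 */
+	if (len < DAX_RLE_EXPAND_ELEM_LEN) {
+		memset(addr, 0, len);
+		return 0;
+	}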
+
+	ccb = kzalloc(sizeof(*ccb), GFP_KERNEL);	/* command block */
+	cca = kzalloc(sizeof(*cca), GFP_KERNEL);	/* completion area */
+	src0 = kzalloc(2 * PAGE_SIZE, GFP_KERNEL);	/* primary input */
+	if (!ccb || !cca || !src0)
+		goto done;	/* kfree(NULL) is a no-op, so cleanup is safe */
+	src1 = src0 + PAGE_SIZE;			/* secondary input */
+	memset(src1, 0xff, PAGE_SIZE);
+
+ ccb->hdr.opcode = DAX_OP_EXTRACT;
+
+ ccb->hdr.pri_addr_type = DAX_ADDR_TYPE_VA;
+ ccb->hdr.sec_addr_type = DAX_ADDR_TYPE_VA;
+ ccb->hdr.out_addr_type = DAX_ADDR_TYPE_VA;
+ ccb->hdr.cca_addr_type = DAX_ADDR_TYPE_VA;
+
+ ccb->pri = src0;
+ ccb->sec = src1;
+ ccb->out = addr;
+ ccb->ca = cca;
+
+ ccb->ctrl.pri_fmt = DAX_PRI_FMT_RLE;
+ ccb->ctrl.pri_elem_size = DAX_PRI_ELEM_SIZE(1);
+ ccb->ctrl.sec_encoding = DAX_SEC_ENCODING_LESS1;
+ ccb->ctrl.sec_elem_size = DAX_SEC_ELEM_SIZE(8);
+ ccb->ctrl.out_fmt = DAX_OUT_FMT_16B;
+ ccb->ctrl.out_elem_size = 0;
+
+ ccb->dac.pri_len_fmt = DAX_PRI_LEN_FMT_BYTES;
+ ccb->dac.pri_len = DAX_PRI_LEN(len / DAX_RLE_EXPAND_ELEM_LEN);
+ ccb->dac.out_buf_size = DAX_OUT_BUF_SIZE(len);
+
+ hv_rv = sun4v_ccb_submit((unsigned long)ccb, sizeof(*ccb),
+ HV_CCB_ARG0_PRIVILEGED | HV_CCB_VA_PRIVILEGED |
+ HV_CCB_ARG0_TYPE_PRIMARY | HV_CCB_QUERY_CMD,
+ 0, &accepted_len, &status_data);
+
+ if (hv_rv != HV_EOK || accepted_len != sizeof(*ccb)) {
+ printk(KERN_ALERT "ccb_submit failed (rv=%ld, status_data=0x%lx)\n",
+ hv_rv, status_data);
+ goto done;
+ }
+
+ /*
+ * handle any residual bytes here in parallel with the
+ * coprocessor
+ */
+ res = len % DAX_RLE_EXPAND_ELEM_LEN;
+ memset(addr + (len - res), 0, res);
+
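+	/*
+	 * timeout is unsigned and DAX_ZERO_TIMEOUT is not a multiple of
+	 * MWAIT_TIME, so stop before the subtraction could wrap around.
+	 */
+	for (timeout = DAX_ZERO_TIMEOUT; timeout >= MWAIT_TIME;
+	     timeout -= MWAIT_TIME) {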
+ if (loadmon8(cca) == CCA_STAT_NOT_COMPLETED)
+ mwait(MWAIT_TIME);
+ else
+ break;
+ }
+
+ if (cca->status == CCA_STAT_COMPLETED) {
+ ret = 0;
+ goto done;
+ } else if (cca->status == CCA_STAT_NOT_COMPLETED) {
+ printk(KERN_ALERT "dax_zero ccb timed out, kill ccb\n");
+ hv_rv = sun4v_ccb_kill(virt_to_phys(cca), &kill_res);
+ if (hv_rv == HV_EOK) {
+ printk(KERN_ALERT "ccb kill successful (kill_res=%d)\n",
+ kill_res);
+ } else {
+ printk(KERN_ALERT "ccb kill failed (hv_rv=%ld)\n",
+ hv_rv);
+ }
+
+ } else {
+ printk(KERN_ALERT "ccb failed, status=%d, err=0x%x\n",
+ cca->status, cca->err);
+ }
+
+done:
+ kfree(src0);
+ kfree(cca);
+ kfree(ccb);
+ return ret;
+}
+
+#if 0
+void test_dax_zero(void)
+{
+ u8 *output;
+ long i, j;
+ long sizes[] = {8192, 8192 + 653, 16384, 4 * 1024 * 1024,
+ DAX_ZERO_OUTPUT_MAX_LEN};
+
+ output = kzalloc(DAX_ZERO_OUTPUT_MAX_LEN, GFP_KERNEL);
+ if (output == NULL)
+ return;
+
+ for (j = 0; j < sizeof(sizes) / sizeof(long); j++) {
+ long size = sizes[j];
+
+ /* set output to 0xaa */
+ memset(output, 0xaa, DAX_ZERO_OUTPUT_MAX_LEN);
+
+ dax_zero(output, size);
+
+ /* check that all bytes zeroed are 0, and all others are 0xaa */
+ for (i = 0; i < size; i++) {
+ if (output[i] != 0) {
+ printk(KERN_ALERT "dax_zero test (size=%ld) fail: output[%ld]=%x (expected 0)\n",
+ size, i, output[i]);
+ goto done;
+ }
+ }
+
+ for (i = size; i < DAX_ZERO_OUTPUT_MAX_LEN; i++) {
+ if (output[i] != 0xaa) {
+ printk(KERN_ALERT "dax_zero test (size=%ld) fail: output[%ld]=%x (expected 0xaa)\n",
+ size, i, output[i]);
+ goto done;
+ }
+ }
+ }
+
+ printk(KERN_ALERT "dax_zero test passed, all bytes correct\n");
+
+done:
+ kfree(output);
+}
+#endif
new file mode 100644
@@ -0,0 +1,249 @@
+Oracle Data Analytics Accelerator (DAX)
+---------------------------------------
+
+DAX is a coprocessor which resides on the SPARC M7 (DAX1) and M8
+(DAX2) processor chips, and has direct access to the CPU's L3 caches
+as well as physical memory. It can perform several operations on data
+streams with various input and output formats. A driver provides a
+transport mechanism and has limited knowledge of the various opcodes
+and data formats. A user space library provides high level services
+and translates these into low level commands which are then passed
+into the driver and subsequently the Hypervisor and the coprocessor.
+The library is the recommended way for applications to use the
+coprocessor, and the driver interface is not intended for general use.
+This document describes the general flow of the driver, its
+structures, and its programmatic interface.
+
+The user library is open source and available at:
+ https://oss.oracle.com/git/gitweb.cgi?p=libdax.git
+
+The Hypervisor interface to the coprocessor is described in detail in
+the accompanying document, dax-hv-api.txt, which is a plain text
+excerpt of the (Oracle internal) "UltraSPARC Virtual Machine
+Specification" version 3.0.20, dated 2017-04-05.
+
+
+High Level Overview
+-------------------
+
+A coprocessor request is described by a Command Control Block
+(CCB). The CCB contains an opcode and various parameters. The opcode
+specifies what operation is to be done, and the parameters specify
+options, flags, sizes, and addresses. The CCB (or an array of CCBs)
+is passed to the Hypervisor, which handles queueing and scheduling of
+requests to the available coprocessor execution units. A status code
+returned indicates if the request was submitted successfully or if
+there was an error. One of the addresses given in each CCB is a
+pointer to a "completion area", which is a 128 byte memory block that
+is written by the coprocessor to provide execution status. No
+interrupt is generated upon completion; the completion area must be
+polled by software to find out when a transaction has finished, but
+the M7 and later processors provide a mechanism to pause the virtual
+processor until the completion status has been updated by the
+coprocessor. This is done using the monitored load and mwait
+instructions, which are described in more detail later. The DAX
+coprocessor was designed so that after a request is submitted, the
+kernel is no longer involved in the processing of it. The polling is
+done at the user level, which results in almost zero latency between
+completion of a request and resumption of execution of the requesting
+thread.
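
As a rough illustration of the 128 byte completion area described above, the sketch below restates the struct dax_cca layout from dax1_ccb.h and checks its size and the position of the status byte at compile time. The _Static_assert checks are an addition for illustration only (they assume a C11 compiler); they are not part of the driver or of libdax.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same field order and widths as struct dax_cca in dax1_ccb.h. */
struct dax_cca {
	uint8_t  status;       /* command status, polled by software */
	uint8_t  err;          /* error code */
	uint8_t  rsvd[2];      /* reserved */
	uint32_t n_remaining;  /* for partial symbol warning */
	uint32_t output_sz;    /* output size in bytes */
	uint32_t rsvd2;        /* reserved */
	uint64_t run_cycles;   /* run time */
	uint64_t run_stats;    /* run statistics */
	uint32_t n_processed;  /* number of input elements */
	uint32_t rsvd3[5];     /* reserved */
	uint64_t retval;       /* command return value */
	uint64_t rsvd4[8];     /* reserved */
};

/* Illustrative checks: the area is 128 bytes and status is its first byte. */
_Static_assert(sizeof(struct dax_cca) == 128,
	       "completion area must be 128 bytes");
_Static_assert(offsetof(struct dax_cca, status) == 0,
	       "status must be the first byte");
```

Because the status byte sits at offset 0, a pointer to the completion area can be polled directly as a byte address.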
+
+
+Addressing Memory
+-----------------
+
+The kernel does not have access to physical memory in the Sun4v
+architecture, as there is an additional level of memory virtualization
+present. This intermediate level is called "real" memory, and the
+kernel treats this as if it were physical. The Hypervisor handles the
+translations between real memory and physical so that each logical
+domain (LDOM) can have a partition of physical memory that is isolated
+from that of other LDOMs. When the kernel sets up a virtual mapping,
+it specifies a virtual address and the real address to which it should
+be mapped.
+
+The DAX coprocessor can only operate on physical memory, so before a
+request can be fed to the coprocessor, all the addresses in a CCB must
+be converted into physical addresses. The kernel cannot do this since
+it has no visibility into physical addresses. So a CCB may contain
+either the virtual or real addresses of the buffers or a combination
+of them. An "address type" field is available for each address that
+may be given in the CCB. In all cases, the Hypervisor will translate
+all the addresses to physical before dispatching to hardware. Address
+translations are performed using the context of the process initiating
+the request.
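
To make the per-address "address type" fields concrete, here is a minimal sketch that packs a CCB header word from an opcode and four address types. The bit positions and the DAX_ADDR_TYPE_NONE / DAX_ADDR_TYPE_VA / DAX_OP_EXTRACT values are copied from dax1_ccb.h; the helper name dax_hdr_pack() is hypothetical and exists only for this illustration. The header file does not list a value for the real address type, so none is shown here.

```c
#include <stdint.h>

/* Values copied from dax1_ccb.h */
#define DAX_ADDR_TYPE_NONE	0
#define DAX_ADDR_TYPE_VA	3	/* virtual address */
#define DAX_OP_EXTRACT		0x1

/* Low bit of each header field, as in the HDR_*_LOW definitions */
#define HDR_CCA_ADDR_TYPE_LOW	0	/* bits 1:0 */
#define HDR_PRI_ADDR_TYPE_LOW	2	/* bits 4:2 */
#define HDR_SEC_ADDR_TYPE_LOW	5	/* bits 7:5 */
#define HDR_OUT_ADDR_TYPE_LOW	8	/* bits 10:8 */
#define HDR_OPCODE_LOW		16	/* bits 23:16 */

/*
 * Hypothetical helper: build the 32-bit header word. Sync flags,
 * CCB version and the table address type are left as zero.
 */
static uint32_t dax_hdr_pack(uint32_t opcode, uint32_t cca_type,
			     uint32_t pri_type, uint32_t sec_type,
			     uint32_t out_type)
{
	return ((opcode & 0xffu) << HDR_OPCODE_LOW) |
	       ((out_type & 0x7u) << HDR_OUT_ADDR_TYPE_LOW) |
	       ((sec_type & 0x7u) << HDR_SEC_ADDR_TYPE_LOW) |
	       ((pri_type & 0x7u) << HDR_PRI_ADDR_TYPE_LOW) |
	       ((cca_type & 0x3u) << HDR_CCA_ADDR_TYPE_LOW);
}
```

An address that is not used by a given operation would carry DAX_ADDR_TYPE_NONE in its type field.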
+
+
+The Driver API
+--------------
+
+An application makes requests to the driver via the write() system
+call, and gets results (if any) via read(). The completion areas are
+made accessible via mmap(), and are read-only for the application.
+
+The request may either be an immediate command or an array of CCBs to
+be submitted to the hardware.
+
+Each open instance of the device is exclusive to the thread that
+opened it, and must be used by that thread for all subsequent
+operations. The driver open function creates a new context for the
+thread and initializes it for use. This context contains pointers and
+values used internally by the driver to keep track of submitted
+requests. The completion area buffer is also allocated, and this is
+large enough to contain the completion areas for many concurrent
+requests. When the device is closed, any outstanding transactions are
+flushed and the context is cleaned up.
+
+On a DAX1 system (M7), the device will be called "oradax1", while on a
+DAX2 system (M8) it will be "oradax2". If an application requires one
+or the other, it should simply attempt to open the appropriate
+device. Only one of the devices will exist on any given system, so the
+name can be used to determine what the platform supports.
+
+The immediate commands are CCB_DEQUEUE, CCB_KILL, and CCB_INFO. For
+all of these, success is indicated by a return value from write()
+equal to the number of bytes given in the call. Otherwise -1 is
+returned and errno is set.
+
+CCB_DEQUEUE
+
+Tells the driver to clean up resources associated with past
+requests. Since no interrupt is generated upon the completion of a
+request, the driver must be told when it may reclaim resources. No
+further status information is returned, so the user should not
+subsequently call read().
+
+CCB_KILL
+
+Kills a CCB during execution. The CCB is guaranteed to not continue
+executing once this call returns successfully. On success, read() must
+be called to retrieve the result of the action.
+
+CCB_INFO
+
+Retrieves information about a currently executing CCB. Note that some
+Hypervisors might return 'notfound' when the CCB is in 'inprogress'
+state. To ensure a CCB in the 'notfound' state will never be executed,
+CCB_KILL must be invoked on that CCB. Upon success, read() must be
+called to retrieve the details of the action.
+
+Submission of an array of CCBs for execution
+
+A write() whose length is a multiple of the CCB size is treated as a
+submit operation. The file offset is treated as the index of the
+completion area to use, and may be set via lseek() or using the
+pwrite() system call. If -1 is returned then errno is set to indicate
+the error. Otherwise, the return value is the length of the array that
+was actually accepted by the coprocessor. If the accepted length is
+equal to the requested length, then the operation was completely
+successful and there is no further status needed; hence, the user
+should not subsequently call read(). Partial acceptance of the CCB
+array is indicated by a return value less than the requested length,
+and read() must be called to retrieve further status information. The
+status will reflect the error caused by the first CCB that was not
+accepted, and status_data will provide additional data in some cases.
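+
+The three possible outcomes of a submit can be told apart from the
+return value of write() alone. The sketch below is illustrative only;
+the enum and function names are hypothetical, and the 64-byte CCB size
+is taken from the CCB layout described in this document.
+
```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>

#define DAX_CCB_SIZE 64		/* a CCB is eight 64-bit words */

enum submit_outcome {
	SUBMIT_FAILED,		/* write() returned -1, consult errno */
	SUBMIT_PARTIAL,		/* some CCBs rejected, call read() for status */
	SUBMIT_COMPLETE		/* everything accepted, do not call read() */
};

/* Classify the return value of a submit write() of 'requested' bytes. */
static enum submit_outcome classify_submit(ssize_t ret, size_t requested)
{
	assert(requested > 0 && requested % DAX_CCB_SIZE == 0);
	if (ret < 0)
		return SUBMIT_FAILED;
	if ((size_t)ret < requested)
		return SUBMIT_PARTIAL;
	return SUBMIT_COMPLETE;
}

/* Number of whole CCBs accepted, given the return value of write(). */
static size_t accepted_ccbs(ssize_t ret)
{
	return ret > 0 ? (size_t)ret / DAX_CCB_SIZE : 0;
}
```
+
+Only the SUBMIT_PARTIAL case is followed by read(); the index of the
+first rejected CCB is then accepted_ccbs(ret).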
+
+MMAP
+
+The mmap() function provides access to the completion area allocated
+in the driver. Note that the completion area is not writeable by the
+user process.
+
+
+Completion of a Request
+-----------------------
+
+The first byte in each completion area is the command status, which is
+updated by the coprocessor hardware. Software may take advantage of
+new M7/M8 processor capabilities to efficiently poll this status byte.
+First, a "monitored load" is achieved via a Load from Alternate Space
+(ldxa, lduba, etc.) with ASI 0x84 (ASI_MONITOR_PRIMARY). Second, a
+"monitored wait" is achieved via the mwait instruction. This
+instruction is like pause in that it suspends execution of the virtual
+processor, but in addition will terminate early when one of several
+events occurs. If the block of data containing the monitored location
+is modified, then the mwait terminates. This allows software to resume
+execution immediately (without a context switch or kernel to user
+transition) after a transaction completes. Thus the latency between
+transaction completion and resumption of execution may be just a few
+nanoseconds.
+
+
+Application Life Cycle of a DAX Submission
+------------------------------------------
+
+ - open dax device
+ - call mmap() to get the completion area address
+ - allocate a CCB and fill in the opcode, flags, parameters, addresses, etc.
+ - submit CCB via write() or pwrite()
+ - go into a loop executing monitored load + monitored wait and
+ terminate when the command status indicates the request is complete
+ (CCB_KILL or CCB_INFO may be used any time as necessary)
+ - perform a CCB_DEQUEUE
+ - call munmap() for completion area
+ - close the dax device
+
+
+Memory Constraints
+------------------
+
+The DAX hardware operates only on physical addresses. Therefore, it is
+not aware of virtual memory mappings and the discontiguities that may
+exist in the physical memory that a virtual buffer maps to. There is
+no I/O TLB or any scatter/gather mechanism. All buffers, whether input
+or output, must reside in a physically contiguous region of memory.
+
+The Hypervisor translates all addresses within a CCB to physical
+before handing off the CCB to DAX. The Hypervisor determines the
+virtual page size for each virtual address given, and uses this to
+program a size limit for each address. This prevents the coprocessor
+from reading or writing beyond the bound of the virtual page, even
+though it is accessing physical memory directly. A simpler way of
+saying this is that a DAX operation will never "cross" a virtual page
+boundary. If an 8k virtual page is used, then the data is strictly
+limited to 8k. If a user's buffer is larger than 8k, then a larger
+page size must be used, or the transaction size will be truncated to
+8k.
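+
+The page-boundary rule can be expressed as simple address arithmetic.
+The following sketch is illustrative only; both helper names are
+hypothetical, and the page size is supplied by the caller rather than
+queried from the system.
+
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes from addr to the end of the page that contains it. */
static size_t dax_bytes_to_page_end(uintptr_t addr, size_t page_size)
{
	/* page_size must be a non-zero power of two */
	assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
	return page_size - (size_t)(addr & (page_size - 1));
}

/* Largest transfer starting at addr that stays inside its page. */
static size_t dax_clamp_len(uintptr_t addr, size_t len, size_t page_size)
{
	size_t room = dax_bytes_to_page_end(addr, page_size);

	return len < room ? len : room;
}
```
+
+For a buffer that starts 16 bytes into an 8k page, dax_clamp_len()
+yields 8176 bytes, which is the most a single transaction could cover
+under the rule above.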
+
+Huge pages. A user may allocate huge pages using standard
+interfaces. Memory buffers residing on huge pages may be used to
+achieve much larger DAX transaction sizes, but the rules must still be
+followed, and no transaction will cross a page boundary, even a huge
+page. A major caveat is that Linux on SPARC presents 8MB as one of
+the huge page sizes. SPARC does not actually provide an 8MB hardware
+page size, and this size is synthesized by pasting together two 4MB
+pages. The reasons for this are historical, and it creates an issue
+because only half of this 8MB page can actually be used for any given
+buffer in a DAX request, and it must be either the first half or the
+second half; it cannot be a 4MB chunk in the middle, since that
+crosses a (hardware) page boundary. Note that this entire issue may be
+hidden by higher level libraries.
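+
+Whether a buffer inside a synthesized huge page is usable can be
+checked by testing that it stays within one hardware page. This is a
+sketch under the assumption stated above that the hardware page size
+is 4 megabytes; the helper name and the HW_PAGE_4MB constant are
+hypothetical.
+
```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define HW_PAGE_4MB ((size_t)4 << 20)

/*
 * True if [addr, addr + len) lies within a single aligned region of
 * hw_page bytes, i.e. it does not cross a hardware page boundary.
 */
static bool dax_within_hw_page(uintptr_t addr, size_t len, size_t hw_page)
{
	/* hw_page must be a non-zero power of two */
	assert(hw_page != 0 && (hw_page & (hw_page - 1)) == 0);
	if (len == 0)
		return true;
	if (len > hw_page)
		return false;
	return (addr / hw_page) == ((addr + len - 1) / hw_page);
}
```
+
+With an 8-megabyte-aligned base, the first and second halves pass the
+check, while a 4-megabyte buffer starting 2 megabytes in does not.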
+
+
+CCB Structure
+-------------
+A CCB is an array of 8 64-bit words. Several of these words provide
+command opcodes, parameters, flags, etc., and the rest are addresses
+for the completion area, output buffer, and various inputs:
+
+ struct ccb {
+ u64 control;
+ u64 completion;
+ u64 input0;
+ u64 access;
+ u64 input1;
+ u64 rsvd;
+ u64 output;
+ u64 table;
+ };
+
+See libdax/common/sys/dax1/dax1_ccb.h for a detailed description of
+each of these fields, and see dax-hv-api.txt for a complete description
+of the Hypervisor API available to the guest OS (i.e., the Linux kernel).
+
+The first word (control) is examined by the driver for the following:
+ - CCB version, which must be consistent with hardware version
+ - Opcode, which must be one of the documented allowable commands
+ - Address types, which must be set to "virtual" for all the addresses
+ given by the user, thereby ensuring that the application can
+ only access memory that it owns
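+
+The checks listed above can be sketched with plain shifts and masks
+rather than bitfields. This is an illustration, not the driver's code.
+It assumes the header occupies a 32-bit word with the bit positions
+given in the comments of struct dax_header in the driver source of
+this patch, and it assumes that an unused address slot (type "none")
+is acceptable alongside "virtual". All function names are
+hypothetical.
+
```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DAX_ADDR_TYPE_NONE	0
#define DAX_ADDR_TYPE_VA	3	/* virtual address */

/* Extract bits hi:lo (inclusive) from a 32-bit header word. */
static uint32_t hdr_bits(uint32_t hdr, unsigned int hi, unsigned int lo)
{
	uint32_t width, mask;

	assert(hi < 32 && lo <= hi);
	width = hi - lo + 1;
	mask = (width == 32) ? 0xffffffffu : ((1u << width) - 1);
	return (hdr >> lo) & mask;
}

static uint32_t hdr_version(uint32_t hdr) { return hdr_bits(hdr, 31, 28); }
static uint32_t hdr_opcode(uint32_t hdr)  { return hdr_bits(hdr, 23, 16); }

/* Assumption: a slot passes if it is unused or marked virtual. */
static bool addr_type_ok(uint32_t type)
{
	return type == DAX_ADDR_TYPE_NONE || type == DAX_ADDR_TYPE_VA;
}

/* Check the four user-supplied data addresses named in the header. */
static bool hdr_addr_types_ok(uint32_t hdr)
{
	return addr_type_ok(hdr_bits(hdr, 12, 11)) &&	/* table */
	       addr_type_ok(hdr_bits(hdr, 10, 8)) &&	/* output */
	       addr_type_ok(hdr_bits(hdr, 7, 5)) &&	/* secondary input */
	       addr_type_ok(hdr_bits(hdr, 4, 2));	/* primary input */
}
```
+
+Shifts and masks are used here because C bitfield layout depends on
+the compiler and byte order, whereas the bit numbers are fixed.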
new file mode 100644
@@ -0,0 +1,214 @@
+/*
+** Example (from libdax/test)
+**
+** Copyright © 2017 Oracle corp. All rights reserved.
+** The Universal Permissive License (UPL), Version 1.0
+**
+** Subject to the condition set forth below, permission is hereby granted to any person obtaining a copy of this
+** software, associated documentation and/or data (collectively the "Software"), free of charge and under any and
+** all copyright rights in the Software, and any and all patent rights owned or freely licensable by each licensor
+** hereunder covering either (i) the unmodified Software as contributed to or provided by such licensor, or
+** (ii) the Larger Works (as defined below), to deal in both
+**
+** (a) the Software, and
+** (b) any piece of software and/or hardware listed in the lrgrwrks.txt file if one is included with the Software
+** (each a “Larger Work” to which the Software is contributed by such licensors),
+**
+** without restriction, including without limitation the rights to copy, create derivative works of, display,
+** perform, and distribute the Software and make, use, sell, offer for sale, import, export, have made, and have
+** sold the Software and the Larger Work(s), and to sublicense the foregoing rights on either these or other terms.
+**
+** This license is subject to the following condition:
+** The above copyright notice and either this complete permission notice or at a minimum a reference to the UPL must
+** be included in all copies or substantial portions of the Software.
+**
+** THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO
+** THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+** AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF
+** CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
+** IN THE SOFTWARE.
+*/
+
+/*
+ * Program to demonstrate the interface to the driver
+ *
+ */
+
+#include <stdio.h>
+#include <fcntl.h>
+#include <sys/mman.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+
+#include "dax1_ccb.h"
+#include "../../arch/sparc/include/uapi/asm/oradax.h"
+
+void verify_bits(unsigned char *bitmap, int nbytes, int nbits)
+{
+ int i;
+
+ for (i=0; i<nbits; i++)
+ if ((bitmap[i/8] & (0x80 >> (i % 8))) == 0)
+ printf("bit %d is 0, expected 1, bitmap[%d]=0x%x\n",
+ i, i/8, bitmap[i/8]);
+ for (i=nbits; i <nbytes*8; i++)
+ if ((bitmap[i/8] & (0x80 >> (i % 8))))
+ printf("bit %d is 1, expected 0, bitmap[%d]=0x%x\n",
+ i, i/8, bitmap[i/8]);
+}
+
+#define ASI_MONITOR_PRIMARY 0x84
+uint8_t __attribute__((noinline)) loadmon8(void *addr)
+{
+ uint8_t ret;
+
+ __asm__ __volatile__("lduba [%[src]] %[asi], %[dest]\n"
+ : [dest] "=r" (ret)
+ : [asi] "i" (ASI_MONITOR_PRIMARY), [src] "r" (addr));
+ return ret;
+}
+
+#define MWAIT_COUNT_REGISTER 28
+void __attribute__((noinline)) mwait(int nsecs)
+{
+ __asm__ __volatile__("wr %%g0, %[arg], %%asr%[mcr]\n"
+ : : [arg] "r" (nsecs), [mcr] "i" (MWAIT_COUNT_REGISTER));
+}
+
+/*
+ * SCAN operation: examine each element of a vector looking for those
+ * that match either of two values. The output is a bitmap which contains
+ * one bit for each input element. For each input element that matches
+ * either of the scan values, the corresponding output bit will be set
+ * to 1.
+ *
+ * Values to use for this scan:
+ * the input holds 499 elements equal to 0x77 followed by
+ * 1001 elements equal to 0xf5, so 1500 elements should match
+ */
+#define SCAN_VAL1 0x77
+#define SCAN_VAL2 0xf5
+
+#define SCAN_COUNT1 499
+#define SCAN_COUNT2 1001
+
+int main(void)
+{
+ char *dev;
+ int fd, ret;
+ dax_cca_t *ca;
+ dax_scan_ccb_t ccb;
+ struct dax_command dc;
+ struct ccb_exec_result res;
+ unsigned char *input, *output;
+
+ dev = "/dev/" DAX_NAME "1";
+ if (access(dev, F_OK) == -1) {
+ dev = "/dev/" DAX_NAME "2";
+ if (access(dev, F_OK) == -1) {
+ fprintf(stderr, "No dax device available\n");
+ exit(1);
+ }
+ }
+
+ fd = open(dev, O_RDWR);
+ if (fd < 0) {
+ perror(dev);
+ exit(1);
+ }
+
+ /* map completion area */
+ ca = mmap(NULL, DAX_MMAP_LEN, PROT_READ, MAP_SHARED, fd, 0);
+ if (ca == MAP_FAILED) {
+ perror("mmap");
+ exit(2);
+ }
+
+ /* allocate and initialize input buffer */
+ input = mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANON, -1, 0);
+ if (input == MAP_FAILED) {
+ perror("mmap input");
+ exit(4);
+ }
+ memset(input, 0, 8192);
+ memset(input, SCAN_VAL1, SCAN_COUNT1);
+ memset(input+SCAN_COUNT1, SCAN_VAL2, SCAN_COUNT2);
+
+ /* allocate and initialize output buffer */
+ output = mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANON, -1, 0);
+ if (output == MAP_FAILED) {
+ perror("mmap output");
+ exit(4);
+ }
+ memset(output, 0, 8192);
+
+ /* set up ccb for a SCAN operation */
+ memset(&ccb, 0, sizeof(dax_scan_ccb_t));
+ ccb.hdr.opcode = DAX_OP_SCAN_VALUE;
+
+ /* set source address, type, length, and format */
+ ccb.pri = input;
+ ccb.hdr.pri_addr_type = DAX_ADDR_TYPE_VA;
+ ccb.ctrl.pri_fmt = 0; /* fixed width, byte */
+ ccb.ctrl.pri_elem_size = DAX_PRI_ELEM_SIZE(1);
+ ccb.dac.pri_len_fmt = DAX_PRI_LEN_FMT_BYTES;
+ ccb.dac.pri_len = DAX_LESS1(8192);
+
+ /* set output address, type, length, and format */
+ ccb.out = output;
+ ccb.hdr.out_addr_type = DAX_ADDR_TYPE_VA;
+ ccb.ctrl.out_fmt = DAX_OUT_FMT_BIT;
+ ccb.ctrl.out_elem_size = DAX_OUT_ELEM_SIZE(1);
+ ccb.dac.out_buf_size = DAX_OUT_BUF_SIZE(8192);
+
+ /* set scan values and sizes */
+ ccb.ctrl.u_size = DAX_LU_SIZE(1);
+ ccb.ctrl.l_size = DAX_LU_SIZE(1);
+ ccb.lu1.upper = SCAN_VAL1 << 24;
+ ccb.lu1.lower = SCAN_VAL2 << 24;
+
+ /* send ccb to coprocessor */
+ ret = write(fd, &ccb, 64);
+ if (ret != 64) {
+ /* submission failed, get driver status */
+ printf("write returned %d\n", ret);
+ if (read(fd, &res, sizeof(res)) != sizeof(res)) {
+ perror("read ccb exec error status");
+ exit(3);
+ }
+ printf("res.status = 0x%x, status_data = 0x%llx\n",
+ res.status, res.status_data);
+ printf("input=%p, output=%p\n", input, output);
+
+ exit(3);
+ }
+
+ /* submission successful, poll completion area until done */
+ while (loadmon8(ca) == CCA_STAT_NOT_COMPLETED)
+ mwait(1000);
+
+ if (IS_CCA_COMPLETED(ca->status)) {
+ printf("Success, output size = %d, retval = %ld\n",
+ ca->output_sz, ca->retval);
+ if (ca->retval != SCAN_COUNT1 + SCAN_COUNT2)
+ printf("retval doesn't match %d+%d\n",
+ SCAN_COUNT1, SCAN_COUNT2);
+ verify_bits(output, 8192, SCAN_COUNT1 + SCAN_COUNT2);
+ } else {
+ printf("cmd_status = %d\n", ca->status);
+ printf("Failed, err=0x%x\n", ca->err);
+ }
+
+ /* dequeue */
+ dc.command = CCB_DEQUEUE;
+ if (write(fd, &dc, sizeof(dc)) != sizeof(dc))
+ perror("dequeue");
+
+ /* unmap completion area */
+ munmap(ca, DAX_MMAP_LEN);
+
+ close(fd);
+ return 0;
+}
+
new file mode 100644
@@ -0,0 +1,91 @@
+/*
+ * Copyright (c) 2017, Oracle and/or its affiliates. All rights reserved.
+ *
+ * This program is free software: you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation, either version 3 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program. If not, see <http://www.gnu.org/licenses/>.
+ */
+
+/*
+ * Oracle DAX driver API definitions
+ */
+
+#ifndef _ORADAX_H
+#define _ORADAX_H
+
+#include <linux/types.h>
+
+#define CCB_KILL 0
+#define CCB_INFO 1
+#define CCB_DEQUEUE 2
+
+struct dax_command {
+ __u16 command; /* CCB_KILL/INFO/DEQUEUE */
+ __u16 ca_offset; /* offset into mmapped completion area */
+};
+
+struct ccb_kill_result {
+ __u16 action; /* action taken to kill ccb */
+};
+
+struct ccb_info_result {
+ __u16 state; /* state of enqueued ccb */
+ __u16 inst_num; /* dax instance number of enqueued ccb */
+ __u16 q_num; /* queue number of enqueued ccb */
+ __u16 q_pos; /* ccb position in queue */
+};
+
+struct ccb_exec_result {
+ __u64 status_data; /* additional status data (e.g. bad VA) */
+ __u32 status; /* one of DAX_SUBMIT_* */
+};
+
+union ccb_result {
+ struct ccb_exec_result exec;
+ struct ccb_info_result info;
+ struct ccb_kill_result kill;
+};
+
+#define DAX_MMAP_LEN (16 * 1024)
+#define DAX_MAX_CCBS 15
+#define DAX_CCB_BUF_MAXLEN (DAX_MAX_CCBS * 64)
+#define DAX_NAME "oradax"
+
+/* CCB_EXEC status */
+#define DAX_SUBMIT_OK 0
+#define DAX_SUBMIT_ERR_RETRY 1
+#define DAX_SUBMIT_ERR_WOULDBLOCK 2
+#define DAX_SUBMIT_ERR_BUSY 3
+#define DAX_SUBMIT_ERR_THR_INIT 4
+#define DAX_SUBMIT_ERR_ARG_INVAL 5
+#define DAX_SUBMIT_ERR_CCB_INVAL 6
+#define DAX_SUBMIT_ERR_NO_CA_AVAIL 7
+#define DAX_SUBMIT_ERR_CCB_ARR_MMU_MISS 8
+#define DAX_SUBMIT_ERR_NOMAP 9
+#define DAX_SUBMIT_ERR_NOACCESS 10
+#define DAX_SUBMIT_ERR_TOOMANY 11
+#define DAX_SUBMIT_ERR_UNAVAIL 12
+#define DAX_SUBMIT_ERR_INTERNAL 13
+
+/* CCB_INFO states - must match HV_CCB_STATE_* definitions */
+#define DAX_CCB_COMPLETED 0
+#define DAX_CCB_ENQUEUED 1
+#define DAX_CCB_INPROGRESS 2
+#define DAX_CCB_NOTFOUND 3
+
+/* CCB_KILL actions - must match HV_CCB_KILL_* definitions */
+#define DAX_KILL_COMPLETED 0
+#define DAX_KILL_DEQUEUED 1
+#define DAX_KILL_KILLED 2
+#define DAX_KILL_NOTFOUND 3
+
+#endif /* _ORADAX_H */
@@ -70,5 +70,13 @@ config DISPLAY7SEG
another UltraSPARC-IIi-cEngine boardset with a 7-segment display,
you should say N to this option.
+config ORACLE_DAX
+ tristate "Oracle Data Analytics Accelerator"
+ default m if SPARC64
+ help
+ Driver for Oracle Data Analytics Accelerator, which is
+ a coprocessor that performs database operations in hardware.
+ It is available on M7 and M8 based systems only.
+
endmenu
@@ -16,3 +16,4 @@ obj-$(CONFIG_SUN_OPENPROMIO) += openprom.o
obj-$(CONFIG_TADPOLE_TS102_UCTRL) += uctrl.o
obj-$(CONFIG_SUN_JSFLASH) += jsflash.o
obj-$(CONFIG_BBC_I2C) += bbc.o
+obj-$(CONFIG_ORACLE_DAX) += oradax.o
new file mode 100644
@@ -0,0 +1,1005 @@
+/*
+ * Copyright (c) 2017, Oracle and/or its affiliates. All rights reserved.
+ *
+ * This program is free software: you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation, either version 3 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program. If not, see <http://www.gnu.org/licenses/>.
+ */
+
+/*
+ * Oracle Data Analytics Accelerator (DAX)
+ *
+ * DAX is a coprocessor which resides on the SPARC M7 (DAX1) and M8
+ * (DAX2) processor chips, and has direct access to the CPU's L3
+ * caches as well as physical memory. It can perform several
+ * operations on data streams with various input and output formats.
+ * The driver provides a transport mechanism only and has limited
+ * knowledge of the various opcodes and data formats. A user space
+ * library provides high level services and translates these into low
+ * level commands which are then passed into the driver and
+ * subsequently the hypervisor and the coprocessor. The library is
+ * the recommended way for applications to use the coprocessor, and
+ * the driver interface is not intended for general use.
+ *
+ * See Documentation/sparc/oracle_dax.txt for more details.
+ */
+
+#include <linux/uaccess.h>
+#include <linux/module.h>
+#include <linux/delay.h>
+#include <linux/cdev.h>
+#include <linux/slab.h>
+#include <linux/mm.h>
+
+#include <asm/hypervisor.h>
+#include <asm/mdesc.h>
+#include <asm/oradax.h>
+
+MODULE_LICENSE("GPL");
+MODULE_DESCRIPTION("Driver for Oracle Data Analytics Accelerator");
+
+#define DAX_DBG_FLG_BASIC 0x01
+#define DAX_DBG_FLG_STAT 0x02
+#define DAX_DBG_FLG_INFO 0x04
+#define DAX_DBG_FLG_ALL 0xff
+
+#define dax_err(fmt, ...) pr_err("%s: " fmt "\n", __func__, ##__VA_ARGS__)
+#define dax_info(fmt, ...) pr_info("%s: " fmt "\n", __func__, ##__VA_ARGS__)
+
+#define dax_dbg(fmt, ...) do { \
+ if (dax_debug & DAX_DBG_FLG_BASIC)\
+ dax_info(fmt, ##__VA_ARGS__); \
+ } while (0)
+#define dax_stat_dbg(fmt, ...) do { \
+ if (dax_debug & DAX_DBG_FLG_STAT) \
+ dax_info(fmt, ##__VA_ARGS__); \
+ } while (0)
+#define dax_info_dbg(fmt, ...) do { \
+ if (dax_debug & DAX_DBG_FLG_INFO) \
+ dax_info(fmt, ##__VA_ARGS__); \
+ } while (0)
+
+#define DAX1_MINOR 1
+#define DAX1_MAJOR 1
+#define DAX2_MINOR 0
+#define DAX2_MAJOR 2
+
+#define DAX1_STR "ORCL,sun4v-dax"
+#define DAX2_STR "ORCL,sun4v-dax2"
+
+#define DAX_CA_ELEMS (DAX_MMAP_LEN / sizeof(struct dax_cca))
+
+#define DAX_CCB_USEC 100
+#define DAX_CCB_RETRIES 10000
+
+/* stream types */
+enum {
+ OUT,
+ PRI,
+ SEC,
+ TBL,
+ NUM_STREAM_TYPES
+};
+
+/* completion status */
+#define CCA_STAT_NOT_COMPLETED 0
+#define CCA_STAT_COMPLETED 1
+#define CCA_STAT_FAILED 2
+#define CCA_STAT_KILLED 3
+#define CCA_STAT_NOT_RUN 4
+#define CCA_STAT_PIPE_OUT 5
+#define CCA_STAT_PIPE_SRC 6
+#define CCA_STAT_PIPE_DST 7
+
+/* completion err */
+#define CCA_ERR_SUCCESS 0x0 /* no error */
+#define CCA_ERR_OVERFLOW 0x1 /* buffer overflow */
+#define CCA_ERR_DECODE 0x2 /* CCB decode error */
+#define CCA_ERR_PAGE_OVERFLOW 0x3 /* page overflow */
+#define CCA_ERR_KILLED 0x7 /* command was killed */
+#define CCA_ERR_TIMEOUT 0x8 /* Timeout */
+#define CCA_ERR_ADI 0x9 /* ADI error */
+#define CCA_ERR_DATA_FMT 0xA /* data format error */
+#define CCA_ERR_OTHER_NO_RETRY 0xE /* Other error, do not retry */
+#define CCA_ERR_OTHER_RETRY 0xF /* Other error, retry */
+#define CCA_ERR_PARTIAL_SYMBOL 0x80 /* QP partial symbol warning */
+
+/* CCB address types */
+#define DAX_ADDR_TYPE_NONE 0
+#define DAX_ADDR_TYPE_VA_ALT 1 /* secondary context */
+#define DAX_ADDR_TYPE_RA 2 /* real address */
+#define DAX_ADDR_TYPE_VA 3 /* virtual address */
+
+/* dax_header_t opcode */
+#define DAX_OP_SYNC_NOP 0x0
+#define DAX_OP_EXTRACT 0x1
+#define DAX_OP_SCAN_VALUE 0x2
+#define DAX_OP_SCAN_RANGE 0x3
+#define DAX_OP_TRANSLATE 0x4
+#define DAX_OP_SELECT 0x5
+#define DAX_OP_INVERT 0x10 /* OR with translate, scan opcodes */
+
+struct dax_header {
+ u32 ccb_version:4; /* 31:28 CCB Version */
+ /* 27:24 Sync Flags */
+ u32 pipe:1; /* Pipeline */
+ u32 longccb:1; /* Longccb. Set for scan with lu2, lu3, lu4. */
+ u32 cond:1; /* Conditional */
+ u32 serial:1; /* Serial */
+ u32 opcode:8; /* 23:16 Opcode */
+ /* 15:0 Address Type. */
+ u32 reserved:3; /* 15:13 reserved */
+ u32 table_addr_type:2; /* 12:11 Huffman Table Address Type */
+ u32 out_addr_type:3; /* 10:8 Destination Address Type */
+ u32 sec_addr_type:3; /* 7:5 Secondary Source Address Type */
+ u32 pri_addr_type:3; /* 4:2 Primary Source Address Type */
+ u32 cca_addr_type:2; /* 1:0 Completion Address Type */
+};
+
+struct dax_control {
+ u32 pri_fmt:4; /* 31:28 Primary Input Format */
+ u32 pri_elem_size:5; /* 27:23 Primary Input Element Size(less1) */
+ u32 pri_offset:3; /* 22:20 Primary Input Starting Offset */
+ u32 sec_encoding:1; /* 19 Secondary Input Encoding */
+ /* (must be 0 for Select) */
+ u32 sec_offset:3; /* 18:16 Secondary Input Starting Offset */
+ u32 sec_elem_size:2; /* 15:14 Secondary Input Element Size */
+ /* (must be 0 for Select) */
+ u32 out_fmt:2; /* 13:12 Output Format */
+ u32 out_elem_size:2; /* 11:10 Output Element Size */
+ u32 misc:10; /* 9:0 Opcode specific info */
+};
+
+struct dax_data_access {
+ u64 flow_ctrl:2; /* 63:62 Flow Control Type */
+ u64 pipe_target:2; /* 61:60 Pipeline Target */
+ u64 out_buf_size:20; /* 59:40 Output Buffer Size */
+ /* (cachelines less 1) */
+ u64 unused1:8; /* 39:32 Reserved, Set to 0 */
+ u64 out_alloc:5; /* 31:27 Output Allocation */
+ u64 unused2:1; /* 26 Reserved */
+ u64 pri_len_fmt:2; /* 25:24 Input Length Format */
+ u64 pri_len:24; /* 23:0 Input Element/Byte/Bit Count */
+ /* (less 1) */
+};
+
+struct dax_ccb {
+ struct dax_header hdr; /* CCB Header */
+ struct dax_control ctrl;/* Control Word */
+ void *ca; /* Completion Address */
+ void *pri; /* Primary Input Address */
+ struct dax_data_access dac; /* Data Access Control */
+ void *sec; /* Secondary Input Address */
+ u64 dword5; /* depends on opcode */
+ void *out; /* Output Address */
+ void *tbl; /* Table Address or bitmap */
+};
+
+struct dax_cca {
+ u8 status; /* user may mwait on this address */
+ u8 err; /* user visible error notification */
+ u8 rsvd[2]; /* reserved */
+ u32 n_remaining; /* for QP partial symbol warning */
+ u32 output_sz; /* output in bytes */
+ u32 rsvd2; /* reserved */
+ u64 run_cycles; /* run time in OCND2 cycles */
+ u64 run_stats; /* nothing reported in version 1.0 */
+ u32 n_processed; /* number input elements */
+ u32 rsvd3[5]; /* reserved */
+ u64 retval; /* command return value */
+ u64 rsvd4[8]; /* reserved */
+};
+
+/* per thread CCB context */
+struct dax_ctx {
+ struct dax_ccb *ccb_buf;
+ u64 ccb_buf_ra; /* cached RA of ccb_buf */
+ struct dax_cca *ca_buf;
+ u64 ca_buf_ra; /* cached RA of ca_buf */
+ struct page *pages[DAX_CA_ELEMS][NUM_STREAM_TYPES];
+ /* array of locked pages */
+ struct task_struct *owner; /* thread that owns ctx */
+ struct task_struct *client; /* requesting thread */
+ union ccb_result result;
+ u32 ccb_count;
+ u32 fail_count;
+};
+
+/* driver public entry points */
+static int dax_open(struct inode *inode, struct file *file);
+static ssize_t dax_read(struct file *filp, char __user *buf,
+ size_t count, loff_t *ppos);
+static ssize_t dax_write(struct file *filp, const char __user *buf,
+ size_t count, loff_t *ppos);
+static int dax_devmap(struct file *f, struct vm_area_struct *vma);
+static int dax_close(struct inode *i, struct file *f);
+
+static const struct file_operations dax_fops = {
+ .owner = THIS_MODULE,
+ .open = dax_open,
+ .read = dax_read,
+ .write = dax_write,
+ .mmap = dax_devmap,
+ .release = dax_close,
+};
+
+static int dax_ccb_exec(struct dax_ctx *ctx, const char __user *buf,
+ size_t count, loff_t *ppos);
+static int dax_ccb_info(u64 ca, struct ccb_info_result *info);
+static int dax_ccb_kill(u64 ca, u16 *kill_res);
+
+static struct cdev c_dev;
+static struct class *cl;
+static dev_t first;
+
+static int max_ccb_version;
+static int dax_debug;
+module_param(dax_debug, int, 0644);
+MODULE_PARM_DESC(dax_debug, "Debug flags");
+
+static int __init dax_attach(void)
+{
+ unsigned long dummy, hv_rv, major, minor, minor_requested, max_ccbs;
+ struct mdesc_handle *hp = mdesc_grab();
+ char *prop, *dax_name;
+ bool found = false;
+ int len, ret = 0;
+ u64 pn;
+
+ if (hp == NULL) {
+ dax_err("Unable to grab mdesc");
+ return -ENODEV;
+ }
+
+ mdesc_for_each_node_by_name(hp, pn, "virtual-device") {
+ prop = (char *)mdesc_get_property(hp, pn, "name", &len);
+ if (prop == NULL)
+ continue;
+ if (strncmp(prop, "dax", strlen("dax")))
+ continue;
+ dax_dbg("Found node 0x%llx = %s", pn, prop);
+
+ prop = (char *)mdesc_get_property(hp, pn, "compatible", &len);
+ if (prop == NULL)
+ continue;
+ dax_dbg("Found node 0x%llx = %s", pn, prop);
+ found = true;
+ break;
+ }
+
+ if (!found) {
+ dax_err("No DAX device found");
+ ret = -ENODEV;
+ goto done;
+ }
+
+ if (strncmp(prop, DAX2_STR, strlen(DAX2_STR)) == 0) {
+ dax_name = DAX_NAME "2";
+ major = DAX2_MAJOR;
+ minor_requested = DAX2_MINOR;
+ max_ccb_version = 1;
+ dax_dbg("MD indicates DAX2 coprocessor");
+ } else if (strncmp(prop, DAX1_STR, strlen(DAX1_STR)) == 0) {
+ dax_name = DAX_NAME "1";
+ major = DAX1_MAJOR;
+ minor_requested = DAX1_MINOR;
+ max_ccb_version = 0;
+ dax_dbg("MD indicates DAX1 coprocessor");
+ } else {
+ dax_err("Unknown dax type: %s", prop);
+ ret = -ENODEV;
+ goto done;
+ }
+
+ minor = minor_requested;
+ dax_dbg("Registering DAX HV api with major %ld minor %ld", major,
+ minor);
+ if (sun4v_hvapi_register(HV_GRP_DAX, major, &minor)) {
+ dax_err("hvapi_register failed");
+ ret = -ENODEV;
+ goto done;
+ } else {
+ dax_dbg("Max minor supported by HV = %ld (major %ld)", minor,
+ major);
+ minor = min(minor, minor_requested);
+ dax_dbg("registered DAX major %ld minor %ld", major, minor);
+ }
+
+ /* submit a zero length ccb array to query coprocessor queue size */
+ hv_rv = sun4v_ccb_submit(0, 0, HV_CCB_QUERY_CMD, 0, &max_ccbs, &dummy);
+ if (hv_rv != 0) {
+ dax_err("get_hwqueue_size failed with status=%ld and max_ccbs=%ld",
+ hv_rv, max_ccbs);
+ ret = -ENODEV;
+ goto done;
+ }
+
+ if (max_ccbs != DAX_MAX_CCBS) {
+ dax_err("HV reports unsupported max_ccbs=%ld", max_ccbs);
+ ret = -ENODEV;
+ goto done;
+ }
+
+ if (alloc_chrdev_region(&first, 0, 1, DAX_NAME) < 0) {
+ dax_err("alloc_chrdev_region failed");
+ ret = -ENXIO;
+ goto done;
+ }
+
+ cl = class_create(THIS_MODULE, DAX_NAME);
+ if (IS_ERR(cl)) {
+ dax_err("class_create failed");
+ ret = PTR_ERR(cl);
+ goto class_error;
+ }
+
+ if (IS_ERR(device_create(cl, NULL, first, NULL, dax_name))) {
+ dax_err("device_create failed");
+ ret = -ENXIO;
+ goto device_error;
+ }
+
+ cdev_init(&c_dev, &dax_fops);
+ if (cdev_add(&c_dev, first, 1) < 0) {
+ dax_err("cdev_add failed");
+ ret = -ENXIO;
+ goto cdev_error;
+ }
+
+ pr_info("Attached DAX module\n");
+ goto done;
+
+cdev_error:
+ device_destroy(cl, first);
+device_error:
+ class_destroy(cl);
+class_error:
+ unregister_chrdev_region(first, 1);
+done:
+ mdesc_release(hp);
+ return ret;
+}
+module_init(dax_attach);
+
+static void __exit dax_detach(void)
+{
+ pr_info("Cleaning up DAX module\n");
+ cdev_del(&c_dev);
+ device_destroy(cl, first);
+ class_destroy(cl);
+ unregister_chrdev_region(first, 1);
+}
+module_exit(dax_detach);
+
+/* map completion area */
+static int dax_devmap(struct file *f, struct vm_area_struct *vma)
+{
+ struct dax_ctx *ctx = (struct dax_ctx *)f->private_data;
+ size_t len = vma->vm_end - vma->vm_start;
+
+ dax_dbg("len=0x%lx, flags=0x%lx", len, vma->vm_flags);
+
+ if (ctx->owner != current) {
+ dax_dbg("devmap called from wrong thread");
+ return -EINVAL;
+ }
+
+ if (len != DAX_MMAP_LEN) {
+ dax_dbg("len(%lu) != DAX_MMAP_LEN(%d)", len, DAX_MMAP_LEN);
+ return -EINVAL;
+ }
+
+ /* completion area is mapped read-only for user */
+ if (vma->vm_flags & VM_WRITE)
+ return -EPERM;
+ vma->vm_flags &= ~VM_MAYWRITE;
+
+ if (remap_pfn_range(vma, vma->vm_start, ctx->ca_buf_ra >> PAGE_SHIFT,
+ len, vma->vm_page_prot))
+ return -EAGAIN;
+
+ dax_dbg("mmapped completion area at uva 0x%lx", vma->vm_start);
+ return 0;
+}
+
+/* Unlock user pages. Called during dequeue or device close */
+static void dax_unlock_pages(struct dax_ctx *ctx, int ccb_index, int nelem)
+{
+ int i, j;
+
+ for (i = ccb_index; i < ccb_index + nelem; i++) {
+ for (j = 0; j < NUM_STREAM_TYPES; j++) {
+ struct page *p = ctx->pages[i][j];
+
+ if (p) {
+ dax_dbg("freeing page %p", p);
+ if (j == OUT)
+ set_page_dirty(p);
+ put_page(p);
+ ctx->pages[i][j] = NULL;
+ }
+ }
+ }
+}
+
+static int dax_lock_page(void *va, struct page **p)
+{
+ int ret;
+
+ dax_dbg("uva %p", va);
+
+ ret = get_user_pages_fast((unsigned long)va, 1, 1, p);
+ if (ret == 1) {
+ dax_dbg("locked page %p, for VA %p", *p, va);
+ return 0;
+ }
+
+ dax_dbg("get_user_pages failed, va=%p, ret=%d", va, ret);
+ return -1;
+}
+
+static int dax_lock_pages(struct dax_ctx *ctx, int idx,
+ int nelem, u64 *err_va)
+{
+ int i;
+
+ for (i = 0; i < nelem; i++) {
+ struct dax_ccb *ccbp = &ctx->ccb_buf[i];
+
+ /*
+ * For each address in the CCB whose type is virtual,
+ * lock the page and change the type to virtual alternate
+ * context. On error, return the offending address in
+ * err_va.
+ */
+ if (ccbp->hdr.out_addr_type == DAX_ADDR_TYPE_VA) {
+ dax_dbg("output");
+ if (dax_lock_page(ccbp->out,
+ &ctx->pages[i + idx][OUT]) != 0) {
+ *err_va = (u64)ccbp->out;
+ goto error;
+ }
+ ccbp->hdr.out_addr_type = DAX_ADDR_TYPE_VA_ALT;
+ }
+
+ if (ccbp->hdr.pri_addr_type == DAX_ADDR_TYPE_VA) {
+ dax_dbg("input");
+ if (dax_lock_page(ccbp->pri,
+ &ctx->pages[i + idx][PRI]) != 0) {
+ *err_va = (u64)ccbp->pri;
+ goto error;
+ }
+ ccbp->hdr.pri_addr_type = DAX_ADDR_TYPE_VA_ALT;
+ }
+
+ if (ccbp->hdr.sec_addr_type == DAX_ADDR_TYPE_VA) {
+ dax_dbg("sec input");
+ if (dax_lock_page(ccbp->sec,
+ &ctx->pages[i + idx][SEC]) != 0) {
+ *err_va = (u64)ccbp->sec;
+ goto error;
+ }
+ ccbp->hdr.sec_addr_type = DAX_ADDR_TYPE_VA_ALT;
+ }
+
+ if (ccbp->hdr.table_addr_type == DAX_ADDR_TYPE_VA) {
+ dax_dbg("tbl");
+ if (dax_lock_page(ccbp->tbl,
+ &ctx->pages[i + idx][TBL]) != 0) {
+ *err_va = (u64)ccbp->tbl;
+ goto error;
+ }
+ ccbp->hdr.table_addr_type = DAX_ADDR_TYPE_VA_ALT;
+ }
+
+ /* skip over 2nd 64 bytes of long CCB */
+ if (ccbp->hdr.longccb)
+ i++;
+ }
+ return DAX_SUBMIT_OK;
+
+error:
+ dax_unlock_pages(ctx, idx, nelem);
+ return DAX_SUBMIT_ERR_NOACCESS;
+}
+
+static void dax_ccb_wait(struct dax_ctx *ctx, int idx)
+{
+ int ret, nretries;
+ u16 kill_res;
+
+ dax_dbg("idx=%d", idx);
+
+ for (nretries = 0; nretries < DAX_CCB_RETRIES; nretries++) {
+ if (ctx->ca_buf[idx].status == CCA_STAT_NOT_COMPLETED)
+ udelay(DAX_CCB_USEC);
+ else
+ return;
+ }
+ dax_dbg("ctx (%p): CCB[%d] timed out, wait usec=%d, retries=%d. Killing ccb",
+ (void *)ctx, idx, DAX_CCB_USEC, DAX_CCB_RETRIES);
+
+ ret = dax_ccb_kill(ctx->ca_buf_ra + idx * sizeof(struct dax_cca),
+ &kill_res);
+ dax_dbg("Kill CCB[%d] %s", idx, ret ? "failed" : "succeeded");
+}
+
+static int dax_close(struct inode *ino, struct file *f)
+{
+ struct dax_ctx *ctx = (struct dax_ctx *)f->private_data;
+ int i;
+
+ f->private_data = NULL;
+
+ for (i = 0; i < DAX_CA_ELEMS; i++) {
+ if (ctx->ca_buf[i].status == CCA_STAT_NOT_COMPLETED) {
+ dax_dbg("CCB[%d] not completed", i);
+ dax_ccb_wait(ctx, i);
+ }
+ dax_unlock_pages(ctx, i, 1);
+ }
+
+ kfree(ctx->ccb_buf);
+ kfree(ctx->ca_buf);
+ dax_stat_dbg("CCBs: %d good, %d bad", ctx->ccb_count, ctx->fail_count);
+ kfree(ctx);
+
+ return 0;
+}
+
+static ssize_t dax_read(struct file *f, char __user *buf,
+ size_t count, loff_t *ppos)
+{
+ struct dax_ctx *ctx = f->private_data;
+
+ if (ctx->client != current)
+ return -EUSERS;
+
+ ctx->client = NULL;
+
+ if (count != sizeof(union ccb_result))
+ return -EINVAL;
+ if (copy_to_user(buf, &ctx->result, sizeof(union ccb_result)))
+ return -EFAULT;
+ return count;
+}
+
+static ssize_t dax_write(struct file *f, const char __user *buf,
+ size_t count, loff_t *ppos)
+{
+ struct dax_ctx *ctx = f->private_data;
+ struct dax_command hdr;
+ unsigned long ca;
+ int i, idx, ret;
+
+ if (ctx->client != NULL)
+ return -EINVAL;
+
+ if (count == 0 || count > DAX_MAX_CCBS * sizeof(struct dax_ccb))
+ return -EINVAL;
+
+ if (count % sizeof(struct dax_ccb) == 0)
+ return dax_ccb_exec(ctx, buf, count, ppos); /* CCB EXEC */
+
+ if (count != sizeof(struct dax_command))
+ return -EINVAL;
+
+ /* immediate command */
+ if (ctx->owner != current)
+ return -EUSERS;
+
+ if (copy_from_user(&hdr, buf, sizeof(hdr)))
+ return -EFAULT;
+
+ ca = ctx->ca_buf_ra + hdr.ca_offset;
+
+ switch (hdr.command) {
+ case CCB_KILL:
+ if (hdr.ca_offset >= DAX_MMAP_LEN) {
+ dax_dbg("invalid ca_offset (%d) >= ca_buflen (%d)",
+ hdr.ca_offset, DAX_MMAP_LEN);
+ return -EINVAL;
+ }
+
+ ret = dax_ccb_kill(ca, &ctx->result.kill.action);
+ if (ret != 0) {
+ dax_dbg("dax_ccb_kill failed (ret=%d)", ret);
+ return ret;
+ }
+
+ dax_info_dbg("killed (ca_offset %d)", hdr.ca_offset);
+ idx = hdr.ca_offset / sizeof(struct dax_cca);
+ ctx->ca_buf[idx].status = CCA_STAT_KILLED;
+ ctx->ca_buf[idx].err = CCA_ERR_KILLED;
+ ctx->client = current;
+ return count;
+
+ case CCB_INFO:
+ if (hdr.ca_offset >= DAX_MMAP_LEN) {
+ dax_dbg("invalid ca_offset (%d) >= ca_buflen (%d)",
+ hdr.ca_offset, DAX_MMAP_LEN);
+ return -EINVAL;
+ }
+
+ ret = dax_ccb_info(ca, &ctx->result.info);
+ if (ret != 0) {
+ dax_dbg("dax_ccb_info failed (ret=%d)", ret);
+ return ret;
+ }
+
+ dax_info_dbg("info succeeded on ca_offset %d", hdr.ca_offset);
+ ctx->client = current;
+ return count;
+
+ case CCB_DEQUEUE:
+ for (i = 0; i < DAX_CA_ELEMS; i++) {
+ if (ctx->ca_buf[i].status !=
+ CCA_STAT_NOT_COMPLETED)
+ dax_unlock_pages(ctx, i, 1);
+ }
+ return count;
+
+ default:
+ return -EINVAL;
+ }
+}
+
+static int dax_open(struct inode *inode, struct file *f)
+{
+ struct dax_ctx *ctx = NULL;
+ int i;
+
+ ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
+ if (ctx == NULL)
+ goto done;
+
+ ctx->ccb_buf = kcalloc(DAX_MAX_CCBS, sizeof(struct dax_ccb),
+ GFP_KERNEL);
+ if (ctx->ccb_buf == NULL)
+ goto done;
+
+ ctx->ccb_buf_ra = virt_to_phys(ctx->ccb_buf);
+ dax_dbg("ctx->ccb_buf=0x%p, ccb_buf_ra=0x%llx",
+ (void *)ctx->ccb_buf, ctx->ccb_buf_ra);
+
+ /* allocate CCB completion area buffer */
+ ctx->ca_buf = kzalloc(DAX_MMAP_LEN, GFP_KERNEL);
+ if (ctx->ca_buf == NULL)
+ goto alloc_error;
+ for (i = 0; i < DAX_CA_ELEMS; i++)
+ ctx->ca_buf[i].status = CCA_STAT_COMPLETED;
+
+ ctx->ca_buf_ra = virt_to_phys(ctx->ca_buf);
+ dax_dbg("ctx=0x%p, ctx->ca_buf=0x%p, ca_buf_ra=0x%llx",
+ (void *)ctx, (void *)ctx->ca_buf, ctx->ca_buf_ra);
+
+ ctx->owner = current;
+ f->private_data = ctx;
+ return 0;
+
+alloc_error:
+ kfree(ctx->ccb_buf);
+done:
+ /* kfree(NULL) is a no-op, so no NULL check is needed */
+ kfree(ctx);
+ return -ENOMEM;
+}
+
+static char *dax_hv_errno(unsigned long hv_ret, int *ret)
+{
+ switch (hv_ret) {
+ case HV_EBADALIGN:
+ *ret = -EFAULT;
+ return "HV_EBADALIGN";
+ case HV_ENORADDR:
+ *ret = -EFAULT;
+ return "HV_ENORADDR";
+ case HV_EINVAL:
+ *ret = -EINVAL;
+ return "HV_EINVAL";
+ case HV_EWOULDBLOCK:
+ *ret = -EAGAIN;
+ return "HV_EWOULDBLOCK";
+ case HV_ENOACCESS:
+ *ret = -EPERM;
+ return "HV_ENOACCESS";
+ default:
+ break;
+ }
+
+ *ret = -EIO;
+ return "UNKNOWN";
+}
+
+static int dax_ccb_kill(u64 ca, u16 *kill_res)
+{
+ unsigned long hv_ret;
+ int count, ret = 0;
+ char *err_str;
+
+ for (count = 0; count < DAX_CCB_RETRIES; count++) {
+ dax_dbg("attempting kill on ca_ra 0x%llx", ca);
+ hv_ret = sun4v_ccb_kill(ca, kill_res);
+
+ if (hv_ret == HV_EOK) {
+ dax_info_dbg("HV_EOK (ca_ra 0x%llx): %d", ca,
+ *kill_res);
+ return 0;
+ }
+ err_str = dax_hv_errno(hv_ret, &ret);
+ dax_dbg("%s (ca_ra 0x%llx)", err_str, ca);
+
+ if (ret != -EAGAIN)
+ return ret;
+ dax_info_dbg("ccb_kill count = %d", count);
+ udelay(DAX_CCB_USEC);
+ }
+
+ return -EAGAIN;
+}
+
+static int dax_ccb_info(u64 ca, struct ccb_info_result *info)
+{
+ unsigned long hv_ret;
+ char *err_str;
+ int ret = 0;
+
+ dax_dbg("attempting info on ca_ra 0x%llx", ca);
+ hv_ret = sun4v_ccb_info(ca, info);
+
+ if (hv_ret == HV_EOK) {
+ dax_info_dbg("HV_EOK (ca_ra 0x%llx): %d", ca, info->state);
+ if (info->state == DAX_CCB_ENQUEUED) {
+ dax_info_dbg("dax_unit %d, queue_num %d, queue_pos %d",
+ info->inst_num, info->q_num, info->q_pos);
+ }
+ } else {
+ err_str = dax_hv_errno(hv_ret, &ret);
+ dax_dbg("%s (ca_ra 0x%llx)", err_str, ca);
+ }
+
+ return ret;
+}
+
+static void dax_prt_ccbs(struct dax_ccb *ccb, int nelem)
+{
+ int i, j;
+ u64 *ccbp;
+
+ dax_dbg("ccb buffer:");
+ for (i = 0; i < nelem; i++) {
+ ccbp = (u64 *)&ccb[i];
+ dax_dbg(" %sccb[%d]", ccb[i].hdr.longccb ? "long " : "", i);
+ for (j = 0; j < 8; j++)
+ dax_dbg("\tccb[%d].dwords[%d]=0x%llx",
+ i, j, *(ccbp + j));
+ }
+}
+
+/*
+ * Validates user CCB content. Also sets completion address and address types
+ * for all addresses contained in CCB.
+ */
+static int dax_preprocess_usr_ccbs(struct dax_ctx *ctx, int idx, int nelem)
+{
+ int i;
+
+ /*
+ * The user is not allowed to specify real address types in
+ * the CCB header. This must be enforced by the kernel before
+ * submitting the CCBs to HV. The only values accepted below for
+ * any address field are DAX_ADDR_TYPE_VA and DAX_ADDR_TYPE_NONE.
+ */
+ for (i = 0; i < nelem; i++) {
+ struct dax_ccb *ccbp = &ctx->ccb_buf[i];
+ unsigned long ca_offset;
+
+ if (ccbp->hdr.ccb_version > max_ccb_version)
+ return DAX_SUBMIT_ERR_CCB_INVAL;
+
+ switch (ccbp->hdr.opcode) {
+ case DAX_OP_SYNC_NOP:
+ case DAX_OP_EXTRACT:
+ case DAX_OP_SCAN_VALUE:
+ case DAX_OP_SCAN_RANGE:
+ case DAX_OP_TRANSLATE:
+ case DAX_OP_SCAN_VALUE | DAX_OP_INVERT:
+ case DAX_OP_SCAN_RANGE | DAX_OP_INVERT:
+ case DAX_OP_TRANSLATE | DAX_OP_INVERT:
+ case DAX_OP_SELECT:
+ break;
+ default:
+ return DAX_SUBMIT_ERR_CCB_INVAL;
+ }
+
+ if (ccbp->hdr.out_addr_type != DAX_ADDR_TYPE_VA &&
+ ccbp->hdr.out_addr_type != DAX_ADDR_TYPE_NONE) {
+ dax_dbg("invalid out_addr_type in user CCB[%d]", i);
+ return DAX_SUBMIT_ERR_CCB_INVAL;
+ }
+
+ if (ccbp->hdr.pri_addr_type != DAX_ADDR_TYPE_VA &&
+ ccbp->hdr.pri_addr_type != DAX_ADDR_TYPE_NONE) {
+ dax_dbg("invalid pri_addr_type in user CCB[%d]", i);
+ return DAX_SUBMIT_ERR_CCB_INVAL;
+ }
+
+ if (ccbp->hdr.sec_addr_type != DAX_ADDR_TYPE_VA &&
+ ccbp->hdr.sec_addr_type != DAX_ADDR_TYPE_NONE) {
+ dax_dbg("invalid sec_addr_type in user CCB[%d]", i);
+ return DAX_SUBMIT_ERR_CCB_INVAL;
+ }
+
+ if (ccbp->hdr.table_addr_type != DAX_ADDR_TYPE_VA &&
+ ccbp->hdr.table_addr_type != DAX_ADDR_TYPE_NONE) {
+ dax_dbg("invalid table_addr_type in user CCB[%d]", i);
+ return DAX_SUBMIT_ERR_CCB_INVAL;
+ }
+
+ /* set completion (real) address and address type */
+ ccbp->hdr.cca_addr_type = DAX_ADDR_TYPE_RA;
+ ca_offset = (idx + i) * sizeof(struct dax_cca);
+ ccbp->ca = (void *)ctx->ca_buf_ra + ca_offset;
+ memset(&ctx->ca_buf[idx + i], 0, sizeof(struct dax_cca));
+
+ dax_dbg("ccb[%d]=%p, ca_offset=0x%lx, compl RA=0x%llx",
+ i, ccbp, ca_offset, ctx->ca_buf_ra + ca_offset);
+
+ /* skip over 2nd 64 bytes of long CCB */
+ if (ccbp->hdr.longccb)
+ i++;
+ }
+
+ return DAX_SUBMIT_OK;
+}
+
+static int dax_ccb_exec(struct dax_ctx *ctx, const char __user *buf,
+ size_t count, loff_t *ppos)
+{
+ unsigned long accepted_len, hv_rv;
+ int i, idx, nccbs, naccepted;
+
+ ctx->client = current;
+ idx = *ppos;
+ nccbs = count / sizeof(struct dax_ccb);
+
+ if (ctx->owner != current) {
+ dax_dbg("wrong thread");
+ ctx->result.exec.status = DAX_SUBMIT_ERR_THR_INIT;
+ return 0;
+ }
+ dax_dbg("args: ccb_buf_len=%zu, idx=%d", count, idx);
+
+ /* verify that ca_buf[idx] .. ca_buf[idx + nccbs - 1] exist */
+ if (idx < 0 || idx + nccbs > DAX_CA_ELEMS) {
+ ctx->result.exec.status = DAX_SUBMIT_ERR_NO_CA_AVAIL;
+ return 0;
+ }
+
+ /*
+ * Copy CCBs into kernel buffer to prevent modification by the
+ * user in between validation and submission.
+ */
+ if (copy_from_user(ctx->ccb_buf, buf, count)) {
+ dax_dbg("copyin of user CCB buffer failed");
+ ctx->result.exec.status = DAX_SUBMIT_ERR_CCB_ARR_MMU_MISS;
+ return 0;
+ }
+
+ /* check that ca_buf[idx] .. ca_buf[idx + nccbs - 1] are available */
+ for (i = idx; i < idx + nccbs; i++) {
+ if (ctx->ca_buf[i].status == CCA_STAT_NOT_COMPLETED) {
+ dax_dbg("CA range not available, dequeue needed");
+ ctx->result.exec.status = DAX_SUBMIT_ERR_NO_CA_AVAIL;
+ return 0;
+ }
+ }
+ dax_unlock_pages(ctx, idx, nccbs);
+
+ ctx->result.exec.status = dax_preprocess_usr_ccbs(ctx, idx, nccbs);
+ if (ctx->result.exec.status != DAX_SUBMIT_OK)
+ return 0;
+
+ ctx->result.exec.status = dax_lock_pages(ctx, idx, nccbs,
+ &ctx->result.exec.status_data);
+ if (ctx->result.exec.status != DAX_SUBMIT_OK)
+ return 0;
+
+ if (dax_debug & DAX_DBG_FLG_BASIC)
+ dax_prt_ccbs(ctx->ccb_buf, nccbs);
+
+ hv_rv = sun4v_ccb_submit(ctx->ccb_buf_ra, count,
+ HV_CCB_QUERY_CMD | HV_CCB_VA_SECONDARY, 0,
+ &accepted_len, &ctx->result.exec.status_data);
+
+ switch (hv_rv) {
+ case HV_EOK:
+ /*
+ * Hcall succeeded with no errors but the accepted
+ * length may be less than the requested length. The
+ * only way the driver can resubmit the remainder is
+ * to wait for completion of the submitted CCBs since
+ * there is no way to guarantee the ordering semantics
+ * required by the client applications. Therefore we
+ * let the user library deal with resubmissions.
+ */
+ ctx->result.exec.status = DAX_SUBMIT_OK;
+ break;
+ case HV_EWOULDBLOCK:
+ /*
+ * This is a transient HV API error. The user library
+ * can retry.
+ */
+ dax_dbg("hcall returned HV_EWOULDBLOCK");
+ ctx->result.exec.status = DAX_SUBMIT_ERR_WOULDBLOCK;
+ break;
+ case HV_ENOMAP:
+ /*
+ * HV was unable to translate a VA. The VA it could
+ * not translate is returned in the status_data param.
+ */
+ dax_dbg("hcall returned HV_ENOMAP");
+ ctx->result.exec.status = DAX_SUBMIT_ERR_NOMAP;
+ break;
+ case HV_EINVAL:
+ /*
+ * This is the result of an invalid user CCB as HV is
+ * validating some of the user CCB fields. Pass this
+ * error back to the user. There is no supporting info
+ * to isolate the invalid field.
+ */
+ dax_dbg("hcall returned HV_EINVAL");
+ ctx->result.exec.status = DAX_SUBMIT_ERR_CCB_INVAL;
+ break;
+ case HV_ENOACCESS:
+ /*
+ * HV found a VA that did not have the appropriate
+ * permissions (such as the w bit). The VA in question
+ * is returned in status_data param.
+ */
+ dax_dbg("hcall returned HV_ENOACCESS");
+ ctx->result.exec.status = DAX_SUBMIT_ERR_NOACCESS;
+ break;
+ case HV_EUNAVAILABLE:
+ /*
+ * The requested CCB operation could not be performed
+ * at this time. Return the specific unavailable code
+ * in the status_data field.
+ */
+ dax_dbg("hcall returned HV_EUNAVAILABLE");
+ ctx->result.exec.status = DAX_SUBMIT_ERR_UNAVAIL;
+ break;
+ default:
+ ctx->result.exec.status = DAX_SUBMIT_ERR_INTERNAL;
+ dax_dbg("unknown hcall return value (%lu)", hv_rv);
+ break;
+ }
+
+ /* unlock pages associated with the unaccepted CCBs */
+ naccepted = accepted_len / sizeof(struct dax_ccb);
+ dax_unlock_pages(ctx, idx + naccepted, nccbs - naccepted);
+
+ /* mark unaccepted CCBs as completed so their CA slots can be reused */
+ for (i = idx + naccepted; i < idx + nccbs; i++)
+ ctx->ca_buf[i].status = CCA_STAT_COMPLETED;
+
+ ctx->ccb_count += naccepted;
+ ctx->fail_count += nccbs - naccepted;
+
+ dax_dbg("hcall rv=%lu, accepted_len=%lu, status_data=0x%llx, ret status=%d",
+ hv_rv, accepted_len, ctx->result.exec.status_data,
+ ctx->result.exec.status);
+
+ if (count == accepted_len)
+ ctx->client = NULL; /* no read needed to complete protocol */
+ return accepted_len;
+}